## Problem definition

In speech processing and elsewhere, a frequently appearing task is to make a prediction of an unknown vector *y* from available observation vectors *x*. Specifically, we want to have an estimate $\hat y = f(x)$ such that $\hat y \approx y$. In particular, we will focus on *linear estimates*, where

$$\hat y = f(x) := A^T x,$$

and where *A* is a matrix of parameters.
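
As a minimal sketch of this model (the dimensions below are hypothetical, not taken from the text), a linear estimate in NumPy is simply a matrix-vector product:

```python
import numpy as np

# Hypothetical dimensions: each observation x has n entries, each target y has m.
n, m = 4, 2
rng = np.random.default_rng(0)

A = rng.standard_normal((n, m))   # parameter matrix A (n x m)
x = rng.standard_normal(n)        # a single observation vector

y_hat = A.T @ x                   # linear estimate y_hat = A^T x, length m
print(y_hat.shape)                # (2,)
```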

## The minimum mean square estimate (MMSE)

Suppose we want to minimise the squared error of our estimate on average. The estimation error is $e = y - \hat y$, and the squared error is the squared $L_2$-norm of the error, that is, $\left\|e\right\|^2 = e^T e$. Its mean can be written as the expectation $E\left[\left\|e\right\|^2\right] = E\left[\left\|y-\hat y\right\|^2\right] = E\left[\left\|y-A^T x\right\|^2\right]$. Formally, the minimum mean square problem can then be written as

$$\min_A\, E\left[\left\|y-A^T x\right\|^2\right].$$

In general, this cannot be implemented directly, because of the abstract expectation operator in the middle.

*(Advanced derivation begins)* To get a computational model, first note that the error expectation can be approximated by the average over a sample of $N$ error vectors $e_k$ as

$$E\left[\left\|e\right\|^2\right] \approx \frac1N \sum_{k=1}^N \left\|e_k\right\|^2 = \frac1N \mathrm{tr}\left(E^T E\right),$$

where $E=\left[e_1,\,e_2,\dotsc,e_N\right]$ and $\mathrm{tr}(\cdot)$ is the matrix trace. To minimize the error energy expectation, we can then set its derivative with respect to $A$ to zero,

$$0 = \frac{\partial}{\partial A} \frac1N \mathrm{tr}\left(E^T E\right) = \frac1N\frac{\partial}{\partial A} \mathrm{tr}\left((Y-A^TX)^T (Y-A^TX)\right) = -\frac2N X\left(Y-A^T X\right)^T,$$

where the observation matrix is $X=\left[x_1,\,x_2,\dotsc,x_N\right]$ and the desired output matrix is $Y=\left[y_1,\,y_2,\dotsc,y_N\right]$. *(End of advanced derivation)*
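
As a small numerical check of the sample approximation above (the data here is random and purely illustrative), the per-sample sum of squared errors and the trace form agree:

```python
import numpy as np

# Purely illustrative random data: N samples of dimension n, targets of dimension m.
n, m, N = 4, 2, 100
rng = np.random.default_rng(1)
X = rng.standard_normal((n, N))   # observation matrix, one column per sample
Y = rng.standard_normal((m, N))   # desired output matrix
A = rng.standard_normal((n, m))   # some candidate parameter matrix

E = Y - A.T @ X                             # error matrix with e_k as columns
mse_sum   = np.mean(np.sum(E**2, axis=0))   # (1/N) sum_k ||e_k||^2
mse_trace = np.trace(E.T @ E) / N           # (1/N) tr(E^T E)
print(np.isclose(mse_sum, mse_trace))       # True
```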

It follows from the zero-derivative condition that $X X^T A = X Y^T$, so the optimal weight matrix $A$ can be solved as

$$\boxed{A = \left(XX^T\right)^{-1}XY^T = \left(X^T\right)^\dagger Y^T},$$

where the superscript $\dagger$ denotes the Moore-Penrose pseudo-inverse.
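
The boxed solution can be computed directly with NumPy. A sketch with synthetic data (the model and noise level below are assumptions for illustration only):

```python
import numpy as np

# Synthetic data generated from a known linear model plus a little noise.
n, m, N = 4, 2, 500
rng = np.random.default_rng(2)
A_true = rng.standard_normal((n, m))
X = rng.standard_normal((n, N))
Y = A_true.T @ X + 0.01 * rng.standard_normal((m, N))

# Closed form A = (X X^T)^{-1} X Y^T ...
A_closed = np.linalg.solve(X @ X.T, X @ Y.T)

# ... which is the same as the least-squares solution of X^T A = Y^T.
A_lstsq, *_ = np.linalg.lstsq(X.T, Y.T, rcond=None)

print(np.allclose(A_closed, A_lstsq))      # True
print(np.max(np.abs(A_closed - A_true)))   # small, limited only by the noise
```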

## Estimates with a mean parameter

Suppose that instead of an estimate $\hat y=A^T x$, we want to include a mean vector in the estimate as $\hat y=A^T x + \mu$. While it is possible to derive all of the above equations for this modified model, it is easier to rewrite the model into the same form as above with

$$\hat y=A^T x + \mu = \begin{bmatrix} \mu & A^T \end{bmatrix} \begin{bmatrix} 1 \\ x \end{bmatrix} =: A'^T x', \qquad\text{where}\quad A'=\begin{bmatrix} \mu^T \\ A \end{bmatrix},\quad x'=\begin{bmatrix} 1 \\ x \end{bmatrix}.$$

That is, we extend $x$ by a single 1 (and the observation matrix $X$ similarly by a row of constant 1s), and extend $A$ to include the mean vector. With these modifications, the above Moore-Penrose pseudo-inverse solution can again be used to solve the modified model.
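
A sketch of the same augmentation in NumPy (the true mean and model below are made up for illustration):

```python
import numpy as np

# Made-up data with a non-zero mean in the target.
n, m, N = 3, 2, 400
rng = np.random.default_rng(3)
A_true  = rng.standard_normal((n, m))
mu_true = np.array([5.0, -2.0])
X = rng.standard_normal((n, N))
Y = A_true.T @ X + mu_true[:, None] + 0.01 * rng.standard_normal((m, N))

# Extend the observations with a row of constant 1s: x' = [1; x].
X_aug = np.vstack([np.ones((1, N)), X])

# Solve the augmented least-squares problem; the first row of A' is mu^T.
A_aug, *_ = np.linalg.lstsq(X_aug.T, Y.T, rcond=None)
mu_hat, A_hat = A_aug[0], A_aug[1:]
print(np.round(mu_hat, 1))   # close to [ 5. -2.]
```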

## Estimates with linear equality constraints

*(Advanced derivation begins)*

In practical situations we often also have linear constraints, such as $C^T A = B$, which is equivalent to $C^T A - B = 0$. The modified optimization problem is then

$$\min_A\, E\left[\left\|y-A^T x\right\|^2\right]\quad\text{such that}\quad C^T A - B = 0.$$

For simplicity, let us consider only scalar estimation with a single constraint, where instead of the vector $y$ and the matrices $A$, $B$ and $C$ we have, respectively, the scalar $\theta$, the vector $a$, the scalar $\beta$ and the vector $c$, and the optimization problem is

$$\min_a\, E\left[\left(\theta-a^T x\right)^2\right]\quad\text{such that}\quad c^T a - \beta = 0.$$
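
Before deriving the analytic solution, this scalar constrained problem can be sketched numerically with a generic optimizer (the data, constraint vector and constraint value below are hypothetical):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical sample data for the scalar constrained problem.
n, N = 3, 1000
rng = np.random.default_rng(4)
X = rng.standard_normal((n, N))                          # observations as columns
theta = np.array([1.0, -2.0, 0.5]) @ X + 0.1 * rng.standard_normal(N)

c, beta = np.array([1.0, 1.0, 1.0]), 1.0                 # constraint c^T a = beta

# Sample mean squared error as a function of a.
def mse(a):
    return np.mean((theta - a @ X) ** 2)

# Generic numerical solution of: min_a mse(a)  such that  c^T a - beta = 0.
res = minimize(mse, x0=np.zeros(n),
               constraints={"type": "eq", "fun": lambda a: c @ a - beta})
print(np.round(res.x, 3), np.round(c @ res.x, 3))        # constraint holds: c^T a ≈ 1
```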

Such constraints can be included in the objective function using the method of Lagrange multipliers, such that the modified objective function is

$$\eta(a,\gamma) = E\left[\left(\theta - a^T x\right)^2\right] - 2 \gamma \left(c^T a - \beta\right),$$

where $\gamma$ is the Lagrange multiplier.

A heuristic explanation of this objective function is based on the fact that $\gamma$ is a free parameter. Since its value can be anything, $c^T a - \beta$ must be zero, because otherwise the value of the objective function could be anything. That is, when optimizing with respect to $a$, we find the minimum of the mean square error while simultaneously satisfying the constraint.

The objective function can further be rewritten as

$$\begin{split} \eta(a,\gamma) & = E\left[\left(\theta - a^T x\right)^2\right] - 2 \gamma \left(c^T a - \beta\right) \\ & = E\left[\theta^2 - 2\theta x^T a + a^T x x^T a\right] - 2 \gamma \left(c^T a - \beta\right) \\ & = \sigma_\theta^2 - 2 r_{\theta x}^T a + a^T R_x a - 2 \gamma \left(c^T a - \beta\right) \\ & = \sigma_\theta^2 + \begin{bmatrix} a^T & \gamma \end{bmatrix} \begin{bmatrix} R_x & -c \\ -c^T & 0 \end{bmatrix} \begin{bmatrix} a \\ \gamma \end{bmatrix} - 2 \begin{bmatrix} r_{\theta x}^T & -\beta \end{bmatrix} \begin{bmatrix} a \\ \gamma \end{bmatrix}, \end{split}$$

where $\sigma_\theta^2 = E\left[\theta^2\right]$, $r_{\theta x} = E\left[\theta x\right]$ and $R_x = E\left[x x^T\right]$.

*TBC*
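
As a sketch of how the derivation presumably continues (this step is not part of the text above and relies on the reconstructed quadratic form): the objective is quadratic in the stacked variable $\begin{bmatrix} a \\ \gamma \end{bmatrix}$, so setting its derivative to zero gives a linear system,

$$\frac{\partial \eta}{\partial \begin{bmatrix} a \\ \gamma \end{bmatrix}} = 0 \quad\Longrightarrow\quad \begin{bmatrix} R_x & -c \\ -c^T & 0 \end{bmatrix} \begin{bmatrix} a \\ \gamma \end{bmatrix} = \begin{bmatrix} r_{\theta x} \\ -\beta \end{bmatrix},$$

from which $a$ and the multiplier $\gamma$ can be solved, provided the block matrix is invertible; note that the second row of the system reproduces the constraint $c^T a = \beta$.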

*(Advanced derivation ends)*