It's time for the geometer and algebraist in me to get all formal about this least squares business. Everything here is real and finite-dimensional, you anal nitpickers, you. Also, I'm only doing the linear case, because it's the easy one.

A statement of the problem is as follows: suppose we have a linear system of equations, represented in matrix form as Ax = y, but that A doesn't have an inverse because the dimension of its codomain is much greater than the dimension of its domain (A has lots more rows than columns). We settle instead for the next best thing, which is a solution in the least squares sense; that is, we seek a solution vector x = θ such that the Euclidean norm (which is the square root of a sum of squares)

||Ax - y||

is minimised. The following result holds.

Theorem: For the linear problem posed as

Ax = y

the least squares solution is given by

θ = (A^tA)^(-1) A^t y

where x is the column vector of n unknown coefficients, estimated in the least squares sense by θ; y is the column vector of m observations; and A is the m × n design matrix. The solution exists whenever A^tA is invertible, that is, whenever the columns of A are linearly independent; this covers all the interesting cases. ♦

The algebra is exceedingly simple. The only computationally difficult part of the least squares problem is dealing with the inverse (A^tA)^(-1), but there are algorithms for that sort of thing, such as the LU factorisation (in practice, one solves the normal equations A^tAθ = A^t y rather than forming the inverse explicitly). Observe that in the trivial case when A itself is invertible (for which we don't need a least squares solution at all), we can distribute the inverse according to the socks and shoes theorem (you put on your socks first and then your shoes, but you take them off in the reverse order): (A^tA)^(-1)A^t = A^(-1)(A^t)^(-1)A^t = A^(-1), and obtain that θ = A^(-1)y.
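For the concretely minded, here is a minimal numerical sketch of the formula, assuming numpy and a made-up overdetermined system (the matrix and observations are invented purely for illustration): solving the normal equations A^tAθ = A^t y agrees with numpy's own least squares routine.

    import numpy as np

    # A made-up overdetermined system: m = 6 observations, n = 2 unknowns.
    rng = np.random.default_rng(0)
    A = rng.normal(size=(6, 2))    # the m x n design matrix
    y = rng.normal(size=6)         # the column vector of observations

    # The theorem's formula, computed by solving the normal equations
    # A^t A theta = A^t y rather than forming the inverse explicitly.
    theta_normal = np.linalg.solve(A.T @ A, A.T @ y)

    # numpy's built-in least squares solver, for comparison.
    theta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

    print(theta_normal)
    print(theta_lstsq)             # agrees with the above to machine precision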

Being all haughty and pure mathematicians for the time being, let us ignore the many, many, many real-world applications of the least squares problem and the origin of the linear equation Ax = y. Instead, let us dedicate the rest of this writeup to an algebro-geometric proof of the least squares theorem. First, it's lemma time!

Lemma: Let V be a vector space and W be a subspace of V. Consider any vector v ∈ V. Then the vector w in the subspace W that is closest to v, in the sense that the Euclidean norm ||v - w|| is minimal, is the orthogonal projection w = proj_W v. ♦

The lemma is intuitively obvious. What it says is that the vector in a linear space closest to a given vector outside of that space is the orthogonal projection, the "shadow" of that vector on the space. In two dimensions and glorious ASCII art, the lemma says the following:

 

                            ^
                           /|
for this                  / |
vector right here... ->  /  |
                        /   |     and this line 
                     v /    |        here...
                      /     |         |
                     /      |         |
---------------------.------>--------------- W
                       proj_W v
                       |
                       |
                    this orthogonal projection
                    is the vector on the line W
                    closest to the vector v.
                    (well, duh!)

I call the above ASCII drawing the geometric proof. The algebraic proof, which isn't illuminating but merely boring book-keeping, is to take an orthogonal basis for the subspace W, extend it to a basis for the bigger space V (possible by the Gram-Schmidt orthogonalisation process), and to write the norm ||v - w|| in this basis. The last step in the algebraic proof is to observe that the only way to minimise this norm is by setting each coordinate of w equal to the corresponding coordinate of v along the basis vectors of W, since those are the only squared terms in the sum that can be made zero; the coordinates of v along the remaining basis vectors are beyond the reach of any w in W. This minimising choice of w is exactly the orthogonal projection.

Details, of course, left to the reader and to the hardcore nitpickers.
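For the hardcore nitpickers armed with numpy, though, here is a minimal sketch of the lemma in action, with a made-up plane W in R^3 and a made-up vector v: the orthogonal projection of v onto W beats every randomly chosen vector of W at being close to v.

    import numpy as np

    rng = np.random.default_rng(1)

    # W is the plane in R^3 spanned by the columns of B (made up for illustration).
    B = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
    v = np.array([2.0, -1.0, 3.0])

    # Orthonormalise the spanning set (a small Gram-Schmidt, done here via QR)
    # and project: proj_W v = Q Q^t v, where the columns of Q are an
    # orthonormal basis for W.
    Q, _ = np.linalg.qr(B)
    proj_v = Q @ (Q.T @ v)

    # No randomly chosen vector of W comes closer to v than the projection does.
    best = np.linalg.norm(v - proj_v)
    for _ in range(1000):
        w = B @ rng.normal(size=2)   # a random vector in W
        assert np.linalg.norm(v - w) >= best - 1e-12
    print("distance from v to proj_W v:", best)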

The lemma tells us how to find a solution of the least squares problem. If we consider the range R(A) of A, that is, the set of vectors Ax as x ranges over all possible values, the least squares solution we seek is the orthogonal projection Aθ = proj_R(A) y of the observation vector y onto the range R(A). (Look back at the statement of the problem at the beginning of this writeup.)
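A quick numerical check of this claim, again assuming numpy and a made-up A and y: computing proj_R(A) y from an orthonormal basis of the range and computing Aθ from the least squares formula give the same vector.

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.normal(size=(8, 3))     # a made-up m x n design matrix, m > n
    y = rng.normal(size=8)

    # theta from the least squares formula (via the normal equations).
    theta = np.linalg.solve(A.T @ A, A.T @ y)

    # proj_R(A) y, computed from an orthonormal basis Q of the range of A.
    Q, _ = np.linalg.qr(A)
    proj_y = Q @ (Q.T @ y)

    print(np.allclose(A @ theta, proj_y))   # True: A theta is the projection of y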

Good, now that we know what θ should be, the rest of the proof is to come up with a formula for θ amenable to computation. To that end, we draw the same picture as before, but with different labels:



                               ^
                              /|
                             / |
                            /  |
                         y /   |
                          /    | n
                         /     |
                        /      |
                       /       |
----------------------.-------->---------------------- R(A)
                          Aθ

Observe that we may decompose y = Aθ + n as the sum of its orthogonal projection and a normal component, so that the normal component is given by n = y - Aθ. Now, this normal component, by definition of the orthogonal projection, has the property that it is orthogonal to every vector in R(A), that is, to every vector Ax for any x. Noting that we may write the inner product of two vectors u and v in Euclidean space by using transposes (u^t v is the inner product of u and v), we record the orthogonality condition as follows:

(y - Aθ)^t Ax = 0,

for all x. But since this can be regrouped as the matrix equation ((y - Aθ)^t A)x = 0, the only way that this can hold for all x is if

(y - Aθ)^t A = 0.

(Think about this for a second. If a matrix maps every vector into zero, then that matrix must be the zero matrix.)
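In numpy terms, the condition says that A^t applied to the residual y - Aθ vanishes. A tiny sketch, with made-up data as before:

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.normal(size=(10, 4))    # a made-up m x n design matrix
    y = rng.normal(size=10)

    theta = np.linalg.solve(A.T @ A, A.T @ y)
    residual = y - A @ theta        # the normal component n

    # (y - A theta)^t A = 0: the residual is orthogonal to every column of A.
    print(np.allclose(A.T @ residual, 0.0))   # True, up to rounding error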

All that remains now is to solve the above equation for θ. Thus we proceed. Transposing both sides of the equation and distributing, we obtain

A^t y - A^tAθ = 0
A^tAθ = A^t y
θ = (A^tA)^(-1) A^t y,

which is the least squares solution that we sought. QED, as they say.
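And, to close the loop, one last sketch with made-up data, checking the end result directly: the θ given by the formula really does beat nearby candidate vectors x at minimising ||Ax - y||.

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.normal(size=(12, 3))    # a made-up overdetermined system
    y = rng.normal(size=12)

    # The formula we just derived.
    theta = np.linalg.solve(A.T @ A, A.T @ y)
    best = np.linalg.norm(A @ theta - y)

    # No perturbation of theta does better at minimising ||Ax - y||.
    for _ in range(1000):
        x = theta + rng.normal(scale=0.1, size=3)
        assert np.linalg.norm(A @ x - y) >= best - 1e-12
    print("minimal residual norm:", best)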