Photogrammetry is the science of making maps from aerial photographs. I worked for many years at a company in San Diego, California, on a software product called SOCET SET that did photogrammetry.

The software was a lot of fun to work on, and involved a lot of math and image processing. From a math standpoint, the key algorithm used was called Least Squares.

Least Squares

Least squares is a very, very powerful technique for solving for unknown values, given a bunch of data. Least Squares is usually used by scientists or engineers when they have a lot of data (or observations) and they want to generalize from it. The word "regression" is used when the scientists are trying to fit a curve through the data. Least squares is very powerful and very broad, and its applications are nearly unlimited, so it is a bit hard to explain. When do you use it?

N linear equations with n unknowns

If you have a set of n linear equations in n unknowns, you don't need least squares, because you can just work with the equations to isolate the unknowns (which is the same as inverting a matrix) and you're done. If your unknowns are X and your data values are B and A:

a1 = b11*x1 + b12*x2 + b13*x3

a2 = b21*x1 + b22*x2 + b23*x3

a3 = b31*x1 + b32*x2 + b33*x3

Which can be represented in matrix notation as

A = B * X

where A and X are vectors, and B is a square matrix. The solution for the unknowns X is

X = B^-1 * A
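For the square case, the solution is one line with numpy. This is just a sketch; the 3x3 values below are made up for illustration:

```python
import numpy as np

# A made-up 3x3 system A = B * X (three equations, three unknowns).
B = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
A = np.array([4.0, 10.0, 8.0])

# X = B^-1 * A.  np.linalg.solve does the inversion-and-multiply in one
# step, which is faster and more accurate than forming B^-1 explicitly.
X = np.linalg.solve(B, A)
```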

More linear equations than unknowns

If you have more linear equations than unknowns then the matrix equation is over-determined, and you can still solve it directly with matrix math; the iterative least squares procedure described below is not needed. For example, if you are fitting a surface with m unknown coefficients through a set of n points, then in matrix notation:

A(n x 1) = B(n x m) * X(m x 1)

Where n > m; A is the given "y values" that you want the surface to pass through; X is the unknown polynomial coefficients that you will solve for; and B is the known factors in the polynomial, such as x, x^2, x^3, z, z^2, zx, etc. Because B is non-square, you have to use the transpose to solve for X:


B^T * A = B^T * B * X

(B^T * B)^-1 * B^T * A = X
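The same transpose recipe in numpy, fitting a made-up parabola through five points (more points than coefficients):

```python
import numpy as np

# Five made-up data points lying on y = 1 + 2t + 3t^2, so n=5 > m=3.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
A = 1.0 + 2.0 * t + 3.0 * t ** 2        # the given "y values"

# B holds the known polynomial factors for each point: 1, t, t^2.
B = np.column_stack([np.ones_like(t), t, t ** 2])

# X = (B^T * B)^-1 * B^T * A
X = np.linalg.solve(B.T @ B, B.T @ A)
```

Because the data here is exact, X recovers the three polynomial coefficients.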


Non-Linear equations

Least squares is most appropriate when the problem you have cannot be reduced to linear equations. Let's say your unknowns are X and your known data is B and A, and let's say you have equations like:

a1 = b11*x1^2 + b12*x2^3 + b13*x3

a2 = b21*x1 + b22*sin( x2 ) + b23*x3

a3 = b31*cos( x1 ) + b32*x2 + b33*sqrt( x3 )

a4 = b41*tan( x1 ) + b42*x2^3 + b43*x3

Because these equations are not linear, you cannot use the simple matrix math (linear algebra) described above. This is when least squares is used.

Least squares can be used even when the relationships between the unknowns (X) and the data (A and B) are so complex that they cannot be reduced to equations. For example, if the relationship between X and A and B is embodied in a complex algorithm that requires 20 subroutines to describe, that is okay: Least Squares will still work!

Least squares works well when the data is over-determined, that is, when you have more data than needed to find the unknowns. In this case, least squares will automatically find the solution that is "best" in the sense that the errors are minimized.

One of the catches to Least Squares is that you must have a guess of the initial values of the unknowns. If you can't make a decent guess, then Least Squares won't work.
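Off-the-shelf solvers handle exactly this situation. Here is a sketch using scipy's least_squares routine; the model, data, and starting guess are all invented for illustration:

```python
import numpy as np
from scipy.optimize import least_squares

# A made-up non-linear model a = x1 * sin(x2 * t), with "true" values
# x1 = 2.0 and x2 = 0.5 used to simulate the observed data.
t = np.linspace(0.0, 10.0, 50)
a_observed = 2.0 * np.sin(0.5 * t)

def residuals(x):
    # difference between the model's predictions and the observed data
    return x[0] * np.sin(x[1] * t) - a_observed

# The initial guess must be reasonable; here we start near the truth.
result = least_squares(residuals, x0=[1.0, 0.4])
```

Note that the solver is handed an initial guess (x0), per the catch above; start it too far from the answer and it can settle on a wrong solution.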

Web Links

I tried to find some good tutorial web pages on Least Squares, so I could put some additional links here, but I couldn't find any. If you know of a link, let me know.

Overview of the Least Squares Algorithm

  • Step 0: Define the model (i.e. the algorithms or equations) that describes your problem. For example:

    a1 = b11*x1^2 + b12*x2^3 + b13*x3

    a2 = b21*x1 + b22*sin( x2 ) + b23*x3

    a3 = b31*cos( x1 ) + b32*x2 + b33*sqrt( x3 )

  • Step 1: Gather all your observations (i.e. data); these are the values A and B in the equations above.
  • Step 2: Guess initial values for all the unknowns (X).
  • Step 3: Compute the predicted data values (A) that correspond to your guess of the unknowns (X) (from step 2 on the first pass, from step 5 afterwards). I.e. input the X guess and B (from step 1) into your model (from step 0), and get output A values.
  • Step 4: Compute the difference (i.e. error, or residual) between the actual A values (from step 1) and the predicted A values (from step 3). If the error is small, we are done and stop. If the errors are not getting smaller after a lot of iterations, give up. Otherwise continue.
  • Step 5: Adjust the guess of X to get a better guess, as follows:

    5a) Create a set of linear equations that relate the (unknown) adjustments to X to the A errors from step 4. Represent them as a matrix equation.

    5b) Fill the matrix with sensitivities (i.e. partial derivatives, computed numerically) that relate the adjustments to the errors.

    5c,d) Solve the equations from steps 5a,b using the "overdetermined linear equation" matrix math described above.

    5e) Compute new guesses for all the X values by adding the adjustments (solved for in step 5d) to the prior guesses. Go to step 3 and repeat.
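The steps above can be sketched as a small Gauss-Newton style loop. Everything here (the two-unknown model, the data, the deltas and tolerances) is invented for illustration:

```python
import numpy as np

def model(x, b):
    """Step 0: an invented Behavior Model, a = b1*x1^2 + b2*sin(x2)."""
    return b[:, 0] * x[0] ** 2 + b[:, 1] * np.sin(x[1])

# Step 1: gather observations (here simulated from known "true" unknowns).
rng = np.random.default_rng(0)
b = rng.uniform(1.0, 2.0, size=(20, 2))
x_true = np.array([1.5, 0.7])
a = model(x_true, b)

x = np.array([1.0, 0.5])            # Step 2: initial guess for X
for _ in range(20):                 # iterate steps 3-5
    predicted = model(x, b)         # Step 3: predicted data values
    residual = a - predicted        # Step 4: error (residual)
    if np.abs(residual).max() < 1e-10:
        break                       # errors are tiny: done
    # Steps 5a/5b: build the matrix of numerical partial derivatives.
    delta = 1e-6
    B = np.empty((len(a), len(x)))
    for j in range(len(x)):
        nudged = x.copy()
        nudged[j] += delta
        B[:, j] = (model(nudged, b) - predicted) / delta
    # Steps 5c/5d: solve the overdetermined system for the adjustments.
    adjustment = np.linalg.solve(B.T @ B, B.T @ residual)
    x = x + adjustment              # Step 5e: improved guess; loop again
```

After a handful of iterations the guess x converges back to the true unknowns.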

    Least Squares Applied to a Real Application, In Detail

    To explain this process, I'll use the application I'm familiar with: aerial photography. Let's say you have an airplane that was flying and took 10 overlapping pictures of the ground. Your goal is to discover exactly where the plane was (its location in latitude, longitude, and elevation) at the 10 instants it took those pictures. For simplicity, we will assume that the camera in the airplane is pointing straight down.

    Step 0) Define the Behavior Model. You need an algorithm that will model the physical behavior of the system you are solving, in this case, an airplane and the camera it holds. Because this algorithm is key to our least squares solution, we need a name for it, so we call it the Behavior Model. I'll use the words "model" and "algorithm" interchangeably.

    The Behavior Model for this airplane example will input an airplane location and a landmark ground location (lat/long/elev) and it will output the location of the landmark in the photos (in millimeters). The formula is a simple "projective ray" formula that represents how the lens projects light beams onto the film. We won't bother to describe the projective ray algorithm in detail here, since it doesn't help you understand Least Squares. The key thing for you to understand is that your application must have a Behavior Model, and that you must define the inputs and outputs very carefully.

    The inputs to the Model algorithm must include the unknown values you are trying to solve (in this example, the airplane locations). The Model will represent just one airplane/photograph, even though in a typical job we may have 10 or more photographs. The outputs of the Model algorithm are some of the data that you have available (in this case, the location in the photos of the landmarks).

    Every application will have a unique Behavior Model. The Behavior Model doesn't have to be a polynomial; in fact, it doesn't even have to be a closed-form formula: it is okay if the algorithm is complicated or even recursive.

    Defining your Behavior Model may be the most important part of your entire least squares experience. In particular, the choice of what parameters you will solve for (in my example, there are three: the airplane location coordinates lat, long, and elev) can be critical. For example, I could have included the airplane orientation (heading, pitch, and roll) for a total of 6 unknowns per airplane. Also, you have to decide how much detail to put into your Behavior Model. For airplane photography: should the algorithm account for earth curvature? Should it account for refraction of light rays in the atmosphere? It's your choice when you define your Behavior Model!

    Some people write a Behavior Model, then add a polynomial (stuck on the end) to represent "unmodeled errors".

    Most people consider the definition of the Behavior Model to be the most critical part of the job, because the math part of least squares is rather mechanical. Defining the Behavior Model includes deciding which parameters of the Behavior Model you will adjust (i.e. what are the unknowns). Least Squares does not create the Behavior Model for you, instead it merely computes certain parameters of the model.

    Step 1) Gather Input Data. The input data we have in the airplane example is (a) the ground location (lat, long, elev) of 50 landmarks on the earth's surface and (b) the location of the landmarks in the photographs (i.e. on the film, measured in millimeters from the upper left corner of the photo). Let's say each of the 10 photos contains between 8 and 15 landmarks, for a total of 100 photo measurements. The input data will change for each job you perform (i.e. each set of airplane photos) but the Behavior Model will not change (unless you shift to another application entirely, such as weather prediction or molecular modelling).

    CAUTION: Don't confuse the locations of the landmarks on the ground (measured in lat, long, elev) with the locations of the landmarks in the photo film (measured in millimeters). These are two different values!!

    Step 2) Guess the Answer. Non-linear least squares needs to get boot-strapped with an approximate answer. In this case, we need 10 approximate locations of the airplane, one for each of the photos. We know these locations are wrong, but that is okay: we are just using them as a starting point and will improve the guesses later. If the guesses are too far off, Least Squares won't work, so the guesses have to be reasonable.

    Step 3) Compute Predicted Data. Using our guess of the airplane locations (from step 2 on the first pass, from step 5 in later passes), we use the Behavior Model for the airplane and camera to compute where the landmarks should appear in the photos (measured in millimeters from the photo corner). We do this computation for all 100 photo measurements, yielding 200 values (X and Y millimeter measurements for each landmark). These computed values will be different, of course, from the actual locations of the landmarks on the film (see step 1), but that is okay: we will measure how wrong the computed values are and use that difference to adjust our guess of the airplane locations.

    Step 4) Compute Error. Compare the computed values of the landmark locations in the photo (from step 3) with the actual measured values (from step 1). Subtract them to get a difference, called the error or residual. Since we have 100 total landmark measurements in the photos, that is 200 distinct error values (each error has an X and Y component). If the error is tiny, you can quit and declare success. If not, continue on to the next step.

    Step 5) Adjust the Guesses. Now comes the hard part: We will use the errors from step 4 to adjust our guesses of the 10 airplane locations. If we do this right, the airplane location guesses will improve, that is, will become closer to the correct answer. We then repeat the process over and over until we don't get any more improvement. Then we are done!

    How do we compute the adjustments to the airplane locations? We know that our guesses are wrong, and in fact we even have a good idea of how wrong (from our errors measured in step 4). From these errors we will compute the adjustments, which are small correction values we will add to the guessed airplane locations (from step 2), which gives new, improved guesses. Steps 3, 4, and 5 are repeated until either we find a good solution (that is, the error we compute in step 4 is tiny) or we give up (that is, the number of iterations exceeds, say, 20 or the error stops getting smaller).

    Step 5a) Build the Least Squares Matrix Equation. Go back and look at the simple least squares matrix notation at the top of this page, used to solve polynomial fitting problems.

    A(n x 1) = B(n x m) * X(m x 1)

    We will use this formula. However, we cannot use it directly to solve our airplane locations in one pass, because our Behavior Model (described in step 3) is not a simple polynomial. Therefore, we use a trick: We will solve, not for the final airplane locations, but rather for adjustments to the airplane locations. These adjustments are small "nudges" given to the locations to make them more correct. If we have 10 guessed locations, each with 3 components (latitude, long, and elev) then we have 30 separate adjustments we must calculate. These adjustments are the "unknowns" shown as matrix X above. Matrix A is the error measures from step 4; it contains 200 error values. In our example, the vector size n=200 and m=30. What is matrix B? Find out in the next step!

    Step 5b) Compute Partial Derivatives. Matrix B is a 2-dimensional matrix that contains the partial derivatives of the Behavior Model with respect to the parameters we are solving for (namely, the 30 airplane location coordinates). If you are not good at calculus, don't panic! The derivatives don't have to be precise, so we can compute them using a simple numerical technique. We do not need a formula for the derivatives!!! Each entry in the B matrix is computed "numerically" as follows: (i) Take the Behavior Model, input the airplane location and landmark ground location, and compute the landmark film location; (ii) Adjust the airplane location used in step i by adding a small delta to one value (e.g. adjust the latitude coordinate by adding, say, 10 meters) and compute a second landmark film location; (iii) Subtract the two film locations from steps i and ii, then divide the film location delta by the small airplane location delta you used in step ii: this yields two partial derivatives (one for each of the X and Y film components). Repeat steps i to iii for all 30 airplane parameters, against all 200 error values, for a total of 30 x 200 = 6,000 partial derivatives. If a landmark does not appear in a photo, just put a zero in the matrix for that combination. That is matrix B.
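    The recipe in steps i to iii can be sketched in code. The projective model below is a stand-in (a plain pinhole formula with a made-up focal length and made-up coordinates), not the real SOCET SET model:

```python
import numpy as np

def behavior_model(airplane, landmark):
    """Placeholder projective model: film (x, y) in mm for a straight-down
    camera, from airplane (x, y, z) and landmark (x, y, z) in metres."""
    f = 150.0  # mm, made-up focal length
    dx = landmark[0] - airplane[0]
    dy = landmark[1] - airplane[1]
    h = airplane[2] - landmark[2]   # height of camera above landmark
    return np.array([f * dx / h, f * dy / h])

airplane = np.array([1000.0, 2000.0, 3000.0])   # guessed location
landmark = np.array([1400.0, 2300.0, 100.0])

delta = 10.0  # metres: the small nudge from step (ii)
partials = np.empty((2, 3))   # 2 film components x 3 airplane parameters
base = behavior_model(airplane, landmark)        # step (i)
for j in range(3):
    nudged = airplane.copy()
    nudged[j] += delta                           # step (ii)
    shifted = behavior_model(nudged, landmark)
    partials[:, j] = (shifted - base) / delta    # step (iii)
```

Each column of partials is the sensitivity of the two film measurements to one airplane coordinate; no derivative formula was needed.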

    Step 5c) Understand what is Happening. In your mind, picture 10 airplanes in the sky. Picture 100 landmarks on the ground, with rays pointing up from the landmarks, through the camera lenses, and onto the film in the airplanes. We know where each landmark hits the film, but we don't know where the airplanes are. We want to move the airplanes around, all 10 SIMULTANEOUSLY, until the rays projected up from the ground hit all the actual landmark locations on the film. We do it in several iterations: in each iteration we move all 10 planes a small amount, calculated to move the ray-projections on the film closer to the actual locations. The partial derivatives in matrix B are a measure of sensitivity: If I take airplane three and move it 5 meters to the north, how much will the projected ray of landmark #22 move to the left on the film? There are 6,000 different sensitivities to work with: one for each combination of airplane location parameter (10 x 3) and landmark measurement component in the photos (100 x 2).

    Step 5d) Solve for the Adjustments. Remember, our goal is to solve for adjustments to the airplane locations, which are stored in vector X in the matrix formula above. We can solve for X as shown at the top of the page:

    X = (B^T * B)^-1 * B^T * A

    Step 5e) Compute new Guesses and Iterate. Vector X, which we computed in step 5d, contains the optimal adjustments, or nudges, that we will apply to the airplane locations. These adjustments, if we are lucky, will improve our guesses, because the adjustments are based on the error measures. By adding the adjustments to our guesses of the airplane locations (initially from step 2) we compute new, better guesses of the airplane locations. We now repeat steps 3, 4, and 5.


    A Note on the Partial Derivative Matrix (B)

    The B matrix has lots of zeros. In fact, the non-zero numbers are clustered in rectangular blocks: one block for each airplane. If you look closely you'll see that in the airplane example the matrix solution for each airplane is independent of the other airplanes. This makes the matrix inversion simpler. However, in the general case, there may be inter-dependencies, and adjusting a single parameter will affect many (not just a few) of the error measures. Thus, the airplane example is simpler than some other applications, and you should not assume that all applications will have so many zeros in the B matrix.

    Matrix Inversion Speed

    When you write a least squares algorithm, the slowest part tends to be the matrix inversion (step 5d) so that is where you concentrate your speed-up efforts.

    Data Structure for B Matrix

    In many applications, the B matrix is predominantly zero, and this can lead to huge wastes of memory (and time), so some people use a special data structure that stores only the non-zero sub-blocks of the matrix.
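    A sketch of that idea with scipy's sparse block_diag helper, using made-up block sizes matching the airplane example (10 airplanes, each with a dense block of 20 measurement rows by 3 parameter columns):

```python
import numpy as np
from scipy.sparse import block_diag

# Made-up example: 10 airplanes, each contributing its own dense 20x3
# block of partial derivatives; everything off the diagonal is zero.
rng = np.random.default_rng(1)
blocks = [rng.standard_normal((20, 3)) for _ in range(10)]

# Store only the non-zero sub-blocks instead of the full 200x30 matrix.
B_sparse = block_diag(blocks, format="csr")
```

Only 600 values are stored instead of 6,000, and sparse solvers can exploit the block structure.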

    Normalizing the Values

    Inverting a matrix is a process that is very sensitive to the relative values that are in the matrix. Inversion may fail (or produce wrong results) if the values range too widely (e.g. contain values like 0.000003 and 500000.0). Therefore, it is wise to normalize the values. This is done by scaling your data (A, B, and X) linearly so that the absolute values tend to fall in the range 0.1 to 10.0. Think of it as changing the units that you measure A, B, and X in.
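    A sketch of such normalization; the matrix values are invented to be badly scaled:

```python
import numpy as np

# Made-up, badly scaled problem: the two columns of B differ by about
# eleven orders of magnitude, so forming B^T * B directly is risky.
B = np.array([[3e-6,  5e5],
              [1e-6, -2e5],
              [4e-6,  7e5]])
X_true = np.array([2e5, 1e-6])     # hypothetical "true" unknowns
A = B @ X_true                     # consistent observations

# Rescale each column of B -- i.e. change the units each unknown is
# measured in -- so the matrix entries land roughly in 0.1 .. 10.
scale = np.abs(B).max(axis=0)
B_norm = B / scale

# Solve the normal equations in the normalized units, then convert back.
X_norm = np.linalg.solve(B_norm.T @ B_norm, B_norm.T @ A)
X = X_norm / scale
```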

    It's time for the geometer and algebraist in me to get all formal about this least squares business. Everything here is real and finite-dimensional, you anal nitpickers, you. Also, I'm only doing the linear case, because it's the easy one.

    A statement of the problem is as follows: suppose we have a linear system of equations, represented in matrix form as Ax = y, but that A doesn't have an inverse because the dimension of its codomain is much greater than the dimension of its domain (A has lots more rows than columns). We settle instead for the next best thing, which is a solution in the least squares sense; that is, we seek a solution vector x = θ such that the Euclidean norm (which is the square root of a sum of squares)

    ||Ax - y||

    is minimum. The following result holds.

    Theorem: For the linear problem posed as

    Ax = y

    The least squares solution is given by

    θ = (A^T A)^-1 A^T y

    where x is the column vector of n unknown coefficients estimated in the least squares sense by θ; y is the column vector of m observations, and A is the m × n design matrix. The solution exists whenever A^T A is invertible, i.e. whenever the columns of A are linearly independent, which covers all interesting cases. ♦

    The algebra is exceedingly simple. The only computationally difficult part about the solution of the least squares problem is computing the inverse (A^T A)^-1, but there are algorithms for that sort of thing, such as the LU factorisation. Observe that in the trivial case when A is invertible (for which we don't need a least squares solution at all) we can distribute the inverse according to the socks and shoes theorem (you put on your socks first and then your shoes, but you take them off in the reverse order) and obtain θ = A^-1 y.

    Being all haughty and pure mathematicians for the time being, let us ignore the many, many, many real-world applications of the least squares problem and the origin of the linear equation Ax = y. Instead, let us dedicate the rest of this writeup to an algebro-geometric proof of the least squares problem. First, it's lemma time!

    Lemma: Let V be a vector space and W be a subspace of V. Consider any vector v in V. Then the vector w in the subspace W that is closest to v, in the sense that the Euclidean norm ||v - w|| is minimal, is the orthogonal projection w = projW v. ♦

    The lemma is intuitively obvious. What it says is that the vector on a linear space closest to a given vector outside of that space is the orthogonal projection, the "shadow" of that vector onto the space. In two dimensions and glorious ASCII art, the lemma says the following:

    for this                  / |
    vector right here... ->  /  |
                            /   |     and this line 
                         v /    |        here...
                          /     |         |
                         /      |         |
    ---------------------.------>--------------- W
                          projW v                   
                        this orthogonal projection
                        is the vector on the line W
                        closest to the vector v.
                        (well, duh!)

    I call the above ASCII drawing the geometric proof. The algebraic proof, which isn't illuminating but merely boring book-keeping, is to take an orthogonal basis for the subspace W, extend it to a basis for the bigger space V (possible by the Gram-Schmidt orthogonalisation process), and to write the norm ||v - w|| in this basis. The last step in the algebraic proof is to observe that the only way to minimise this norm is to set each coordinate of w equal to the corresponding coordinate of v in the basis for W, zeroing every square that can possibly be made zero. This minimising choice of w corresponds to the orthogonal projection.

    Details, of course, left to the reader and to the hardcore nitpickers.

    The lemma tells us how to find a solution of the least squares problem. If we consider the range R(A) of A, that is, the set of vectors Ax as x ranges over all possible values, the least squares solution we seek is the orthogonal projection Aθ = projR(A) y of the observation vector y onto the range R(A). (Look back at the statement of the problem at the beginning of this writeup.)

    Good, now that we know what θ should be, the rest of the proof is to come up with a formula for θ amenable to computation. To that effect, we draw the same picture as before, but with different labels:

                                 / |
                                /  |
                             y /   |
                              /    | n
                             /     |
                            /      |
                           /       |
    ----------------------.-------->---------------------- R(A)

    Observe that we may decompose y = Aθ + n as a sum of its orthogonal projection and a normal component, so that the normal component is given by n = y - Aθ. Now, this normal component, by definition of the orthogonal projection, has the property that it is orthogonal to every vector in R(A), that is, to every vector Ax for any x. Noting that we may write the inner product of two vectors u and v in Euclidean space using transposes (u^T v is the inner product of u and v), we record the orthogonality condition as follows:

    (y - Aθ)^T Ax = 0,

    for all x. But since this can be regrouped as the matrix equation ((y - Aθ)^T A)x = 0, the only way that this can hold for all x is if

    (y - Aθ)^T A = 0.

    (Think about this for a second. If a matrix maps every vector into zero, then that matrix must be the zero matrix.)

    All that remains now is to solve the above equation for θ. Thus we proceed. Transposing both sides of the equation and distributing, we obtain

    A^T y - A^T Aθ = 0
    A^T Aθ = A^T y
    θ = (A^T A)^-1 A^T y,

    which is the least squares solution that we sought. QED, as they say.
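    The theorem is easy to check numerically on random made-up data; the formula matches numpy's built-in solver, and the residual is orthogonal to the range of A:

```python
import numpy as np

# Random over-determined system: m = 50 observations, n = 3 unknowns.
rng = np.random.default_rng(42)
A = rng.standard_normal((50, 3))
y = rng.standard_normal(50)

# theta = (A^T A)^-1 A^T y, the formula from the theorem.
theta = np.linalg.solve(A.T @ A, A.T @ y)

# Compare with the library's least squares solver.
theta_ref, *_ = np.linalg.lstsq(A, y, rcond=None)
```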
