Least squares

( previous, home, next )


Further reading

An XKCD cartoon on linear regression

From their early applications in surveying, the arts of geometry and trigonometry seeped into other areas, including architecture, warfare, astronomy, cartography, and navigation. The application of mathematics in surveying, astronomy, and navigation also lead to the creation of the method of linear least squares. The method of linear least squares was developed almost simultaneously in France by Adrien-Marie Legendre, in (what is now) Germany by Carl Gauss, and in the United States by Robert Adrian. The story in the US is interesting and somewhat controversial.

Patterson’s surveying problem

In 1808, Robert Patterson, a poor Irish immigrant who had risen to the position Professor of Mathematics at the University of Pennsylvania, wrote to The Analyst with the following problem.

“In order to find the content of a piece of ground, having a plane level surface I measured with a common circumferentor and chain, the bearings and lengths of several sides, or boundary line which I found as follows:

1. N 45° E 40 perches
2. S 30° W 25 perches
3. S 5° E 36 perches
4. W 29.6 perches
5. N 20° E 31 perches to the place of beginning.

But upon casting up the difference of latitude and departure, I discovered what will perhaps always be the case in actual surveys, that some error has been contracted in taking the dimensions. …"

Patterson surveying problem
Patterson surveying problem

Patterson was describing a common problem in surveying called a “misclosure”. A misclosure occurs when all sides of a closed circuit are surveyed, but the mathematical reconstruction of the circuit based on the measurements does not return exactly to its starting point because of errors in the observations. Patterson wanted to know the best way to correct for these errors and determine the enclosed area, knowing that the figure should be closed. A prize of $10 was offered, the equivalent of about $200 in 2017.

Nathaniel Bowditch, a self-taught mathematician from Salem, Massachusetts, won the prize. Bowditch was well known for his textbook The New American Practical Navigator (1802), which explained the geometry and trigonometry calculations needed for celestial and terrestrial navigation. However, Bowditch’s solutions were not very satisfying to the journal editor, Robert Adrian. Adrian was an orphan who had come to America as a political refugee from Ireland in 1798. Adrian followed up Bowditch’s solution with his own explanation, and included the earliest known observations connecting line-fitting to the normal probability distribution. Adrian’s idea, in its modern form, is called the “method of linear least squares”.

The Method of Least Squares

Linear least squares is a way to solve systems of linear equations that can not otherwise be solved. The broad idea is to look for the thing that is closest to solving all of the equations, rather than solving all of the equations directly. The trick of the method that has lead to its success is to minimize the sum of the square errors between the observed output and the predicted output. Let’s begin by briefly reviewing some of the basics of linear systems.

A linear system is a set of equations that we can arrange to look like \[\begin{align} b_1 &= a_{1,1} x_1 + a_{1,2} x_2 + \cdots + a_{1,n} x_n \\ b_2 &= a_{2,1} x_1 + a_{2,2} x_2 + \cdots + a_{2,n} x_n \\ \vdots & \vdots \\ b_m &= a_{m,1} x_1 + a_{m,2} x_2 + \cdots + a_{m,n} x_n \end{align}\] where the \(a\)’s and \(b\)’s are all known numbers, and the \(x\)’s are the unknowns we would like to determine. This linear system can be written more compactly as a vector-matrix equation \[\mathbf{b} = \mathbf{A x}\] where \(\mathbf{b} = [ b _ i ]\) is an \(m\)-dimensional vector, \(\mathbf{A} = [a _ {ij}]\) is an \(m \times n\) matrix with \(m\) rows and \(n\) columns, and \(\mathbf{x} = [ x _ i]\) is an \(n\)-dimensional vector. If the number of equations is equal to the number of unknowns (\(m = n\)), then we can usually determine the value of all the unknowns exactly (the exceptions being cases where one of the equations repeats the other equations). If there are fewer equations than unknowns (\(m < n\)), the linear system is underdetermined, and we can not determine the exact values of the unknowns. If there are more equations than unknowns (\(m > n\)), then the linear system is usually overdetermined, and there are no solutions.

Ideally, when we measure things, all our measurements are very accurate, and we can make just enough measurements so that we get a linear system having exactly as many independent equations as we have unknowns. In practice, however, our observations have errors, and we always want to check your work, in case one of our observations went really wrong. A good way to check our work for errors is to make more observations. But adding more observations means adding more equations to the system. So, we often have overdetermined systems. And since there’s always a little error in these observations, it is unlikely that a single set of unknown values will satisfy all the equations at once.

Although there is no solution to an over-determined system, we can still look for the solution that is “closest” to solving all of our equations at once, \(x _ {\rm closest}\). There are many possible ways to measure “closest”. We could, for instance, calculate how much each equation is off by, and choose the value of \(x\) that minimized the largest of these errors. Or we could just sum the absolute values of all the errors and minimize this sum. But the most useful approach we’ve found is to pick values for our unknowns that minimize the sum of the squared errors from the of each equation – the x’s with the “least squares”. Turns out, the vector \(x _ {\rm closest}\) with the least square-error solves a simplified linear system \[A^T b = A^T A x _ {\rm closest}.\] The matrix equation \(A^T b = A^T A x\) is called the normal equation of a linear system.

Why the normal equation? The normal equation is derivable using standard minimization techniques from calculus – differentiate an error function and find where that derivative of the error vanishes. Because the error function is square’s, it’s derivative will be a linear function. It is good to have a reference handy for matrix calculus.

First, observe that if you want to sum the squares of the components of any vector \(v\), you can do this by taking the dot-product of the vector with itself, or multiplying the vector by it’s own matrix transpose, \[\sum_i v_i^2 = v \cdot v = v^T v.\] Now, we define the square error \[\begin{align*} E(x) &:= \sum _ i \left[ (Ax)_i - b_i \right]^2 \\ &= (Ax - b)^T (Ax-b) \\ &= x^T A^T A x - x^T A b - b^T A x + b^T b \end{align*}\] Since each of these terms is a scalar, \(x^T A b = (x^T A b)^T = b^T A x\), and \[\begin{align*} E(x) = x^T A^T A x - 2 b^T A x + b^T b \end{align*}\] To find the solution \(x _ {\rm closest}\) the gives the least square error, differentiate \(E(x)\) with respect to \(x\) and set the derivative to zero: \[\frac{dE}{dx} = 2 x^T A^T A - 2 b^T A = 0.\] If we now re-arrange, we obtain the normal equation stated above: \[ A^T A x = A^T b.\] So, that’s where the normal equations come from. If, instead of minimizing the squared error, we assumed observation errors were normally distributed and attempted to find the most likely solution, we would also obtain the normal equations. The connection between the normal distribution and the normal equations was Robert Adrian’s big insight, although those who have read his work closely are not satisfied by his justification.

Pro tip: For any of you familiar with the techniques of numerical linear algebra, it is often recommended to use a QR algorithm or SVD decomposition to find linear least-squares solutions rather than solving the normal equations by elimination with partial pivoting – these alternative algorithms are more robust in cases where the normal equations are sensitive to small measurement errors. Computationally this means, don’t invert \(A^TA\) to recover \(x=(A^T A)^{-1} A^T b\). Use a built-in linear system solver like scipy’s linalg.solve instead.

Chauvenet’s charting problem

One of the first people in the United States to teach linear least squares was William Chauvenet. Chauvenet was born in Milford, Pennsylvania, in 1820, graduated from Yale with degrees in mathematics and classics, wrote several math books, helped found the US’s Annapolis Naval Academy, served as chancellor of Washington University in St. Louis, and played a role in the engineering of the first steel bridge across the Mississippi river for renouned engineer James Eads (which we will revisit later).

In the 1864 book A manual of spherical and practical astronomy, Chauvenet illustrates the method with a problem for coastal surveying (v 2, p551). A Coastguard station at Pine Mount surveyed the angles between 4 neighboring stations, Joscelyne, Deepwater, Deakyne, and Burden as follows, measured to a precision of thousandths of an arcsecond (or roughly a millionth of a degree). The conversion of (degrees, minutes, seconds) to decimal form can be quickly accomplished with the python function lambda d,m,s : ((s/60.+m)/60.+d)

Stations Angle between Decimal form
Joscelyne to Deepwater 65 11’ 52.500" 65.197917
Deepwater to Deakyne 66 24’ 15.553" 66.404320
Deakyne to Burden 87 2’ 24.703" 87.040195
Burden to Jescelyne 141 21’ 21.757" 141.356044

But we also have one extra piece of information that hasn’t been accounted for – the sum of all these angles must be 360 degrees. If we add up all four observed angles, we get 359.998476, close to 360 degrees, but off by more than a thousandth of a degree. That’s definitely not correct to the millionth of a degree precision we expect, given the precision of the measurements reported. So, the task is to use this extra information about the sum of the angles to improve our accuracy, if we can.

If we let \(t\), \(u\), \(v\), and \(w\) be actual angles, we have 5 equations. \[\begin{align*} t &= 65.197917 \\ u &= 66.404320 \\ v &= 87.040195 \\ w &= 141.356044 \\ t + u + v + w &= 360 \end{align*}\] In matrix form, \[ A x = b\] where \[A = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix}, \quad x = \begin{bmatrix} t \\ u \\ v \\ w \end{bmatrix}, \quad b = \begin{bmatrix} 65.197917 \\ 66.404320 \\ 87.040195 \\ 41.356044 \\ 360 \end{bmatrix}. \]

The calculation from here on can be done by hand but are best performed by computer.

[Show code]
from scipy import *
from scipy import linalg

thetas = array([65.197917, 66.404320, 87.040195, 141.356044]) # directly observed values
A = zeros((5,4))
A[0:4,0:4] = eye(4)
A[4,0:4] = 1
b = zeros((5,1))
b[0:4,0] = thetas
b[4] = 360.
xbest = linalg.solve(A.T.dot(A), A.T.dot(b)) #
print("Corrections, in arc-seconds:\n")
print((xbest.T - thetas) * 60 ** 2)

We find \[A^T A = \begin{bmatrix}2 & 1 & 1 & 1\\1 & 2 & 1 & 1\\1 & 1 & 2 & 1\\1 & 1 & 1 & 2\end{bmatrix} \quad \text{and} \quad A^T b = \begin{bmatrix}425.197917\\426.40432\\447.040195\\501.356044\end{bmatrix}.\]

\[x _ {\text{closest}} = (A^T A)^{-1} (A^T b) = \frac{ 1 }{ 5 } \begin{bmatrix}4&-1&-1&-1\\-1&4&-1&-1\\-1&-1&4&-1\\-1&-1&-1&4\end{bmatrix} \begin{bmatrix}425.197917\\426.40432\\447.040195\\501.356044\end{bmatrix} = \begin{bmatrix}65.1982218\\66.4046248\\87.0404998\\141.3563488\end{bmatrix}. \]

We find the best correction is to add 1.097 arcseconds to each angle estimate. This makes intuitive sense, since without more information, we expect all four of the measurements to have the same kinds of errors, and our auxiliary condition that the angles must sum to 360 degrees also treats all four symmetrically.

Stations Observed angle Corrected angle Correction
Joscelyne to Deepwater 65° 11’ 52.500" 65° 11’ 53.598" 1.097"
Deepwater to Deakyne 66° 24’ 15.553" 66° 24’ 16.649" 1.097"
Deakyne to Burden 87° 2’ 24.703" 87° 2’ 25.799" 1.097"
Burden to Joscelyne 141° 21’ 21.757" 141° 21’ 22.856" 1.097"

Solving Patterson’s surveying problem

Now, let us return to Patterson’s surveying problem. The analysis of Patterson’s problem requires a little more mathematics than the previous example. Patterson’s problem involves two kinds of measurements – 5 bearings \(\theta _ k\), and 5 distances \(r _ k\). Translated to standard polar coordinates, the observations of the surveying were \[\begin{alignat}{3} \theta_1 &= 45^\circ, \quad & \quad r_1 &=40, \\ \theta_2 &= -120^\circ, \quad & \quad r_2 &= 25, \\ \theta_3 &= -85^\circ, \quad & \quad r_3 &= 36, \\ \theta_4 &= 180^\circ, \quad & \quad r_4 &= 29.6, \\ \theta_5 &= 70^\circ, \quad & \quad r_5 &= 31. \end{alignat}\] If we convert the bearings from degrees to radians, then the vectors representing the sides of the surveyed region can be conveniently written in algebraic form using complex numbers \[r e^{i \theta \pi/180}\] where \(r\) is the vector length and \(\theta\) is the bearing, in degrees. So, for example, the first side is represented by the complex number \[40 e^{i 45 \pi/180} \approx 28.28 + 28.28 i.\] The closure problem arrises because these vectors do not sum to zero: \[\begin{gather} \sum_{k=1}^{5} r_k e^{i \theta_k \pi /180} = -0.0754 - 0.0989 i. \end{gather}\] But each angle measurement \(\theta _ k\) and distance measurement \(r _ k\) has a small amount of error. Let’s call these errors \(\alpha _ k\) and \(\epsilon _ k\) respectively. The fact that the boundary should form a closed curve implies \[\begin{gather} \sum_{k=1}^{5} (r_k - \epsilon_k) e^{i (\theta_k - \alpha_k) \pi /180} = 0. \end{gather}\]

We’d like to estimate the errors \(\alpha _ k\) and \(\epsilon _ k\) and use them to correct the original survey observations using linear least squares. However, this last equation is NOT a linear function of the error terms. However, since we expect the errors to be small, we can use the zeroth and first terms of the Taylor series expansion in combination with the known total error to find a linear function that is a close approximation of the boundary closure condition. To the first order, \[\begin{gather} \sum_{k=1}^{5} \left(r_k - \epsilon_k - \frac{i \pi}{180} \alpha_k r_k \right) e^{i \pi \theta_k/180} = 0, \end{gather}\] After re-arranging to standard form with the constant term on the right, \[\begin{gather} \sum_{k=1}^{5} \left(\epsilon_k + \frac{i \pi}{180} \alpha_k r_k \right) e^{i \pi \theta_k/180} = \sum_{k=1}^{5} r_k e^{i \pi \theta_k/180}. \end{gather}\] When numerically evaluated for our given data, we get the equation \[\begin{multline} - 0.4936 \alpha_{1} + 0.4936 i \alpha_{1} + 0.3778 \alpha_{2} - 0.2181 i \alpha_{2} + 0.6259 \alpha_{3} + 0.0547 i \alpha_{3} \\ - 0.5166 i \alpha_{4} - 0.5084 \alpha_{5} + 0.1850 i \alpha_{5} + 0.7071 \epsilon_{1} + 0.7071 i \epsilon_{1} - 0.5 \epsilon_{2} \\ - 0.8660 i \epsilon_{2} + 0.0871 \epsilon_{3} - 0.9961 i \epsilon_{3} - \epsilon_{4} + 0.3420 \epsilon_{5} + 0.9396 i \epsilon_{5} \\ = 0.07549 + 0.09890 i \end{multline}\] Since equality applies for both the real and imaginary parts, this single equation of complex numbers is equivalent to two linear real-valued equations. Together with the five angle error equations \(\alpha _ k = 0\) and the five distance error equations \(\epsilon _ k = 0\), we have a total of 12 equations for our ten unknown angles \(\alpha _ k\) and distances \(\epsilon _ k\). Thus we have an overdetermined linear system we can solve using linear least squares. Setting this up as a matrix system the same way we did above, we find the corrected bearings and distances are \[\begin{alignat}{3} \theta_{1} - \alpha_{1} &= 45.001, \quad & \quad r_{1} - \epsilon_{1} &= 40.024, \\ \theta_{2} - \alpha_{2} &= -119.998, \quad & \quad r_{2} - \epsilon_{2} &= 24.976, \\ \theta_{3} - \alpha_{3} &= -84.989, \quad & \quad r_{3} - \epsilon_{3} &= 35.983, \\ \theta_{4} - \alpha_{4} &= 179.991, \quad & \quad r_{4} - \epsilon_{4} &= 29.584, \\ \theta_{5} - \alpha_{5} &= 69.995, \quad & \quad r_{5} - \epsilon_{5} &= 31.023, \end{alignat}\]

All of the corrections are small relative to the original observations, so our Taylor series approximation probably did a good job. The closure error from the direct observations was about \(0.12\), while the closure error after the least squares correction is \(0.024\), five-fold better. This could be improved further by using an appropriate weighting of angle errors verses distance errors (see Exercises) and using the method of successive approximations to correct for higher order terms in the Taylor series.

[Show code]
from sympy import *

A linear least-squares solution of Patterson's surveying problem
from 1808

# (angle in degrees, distance in "perches" (?) )
data = [(45,40), (-120,25), (-85,36), (180,29.6), (70,31)]
theta,r = zip(*data)
n = len(data)

a = symbols('a')
s = list(symbols('alpha1:%d'%(n+1),real=True))
p = list(symbols('epsilon1:%d'%(n+1),real=True))

F = sum([ (r[j]+p[j]*a)*exp(I*(theta[j]+s[j]*a)*pi/180) for j in range(n) ])
Fs = F.series(a,0,2).removeO().subs(a,1).evalf().expand()
print "# Taylor series approximation of closure equation"
print latex(Fs)
rhs = -F.subs(zip(s+p,[0]*(2*n))).evalf()

# For simplicity, we'll use equal weights,
# even though we have much higher confidence
# in the closure condition than the observations
# themselves
ws = 1
wp = 1
wz = 1
W = diag(*([ws]*n+[wp]*n+[wz]*2)) # Weight matrix

A = W*Matrix(s+p+[re(Fs), im(Fs)]).jacobian(s+p)
b = W*Matrix([0]*(2*n)+[re(rhs), im(rhs)])

sol = zip(s+p,linsolve( (A.T*A, A.T*b), s+p).args[0])

v = A*Matrix([0]*(2*n))-b
Err0 = v.T*v

v = A*Matrix(s+p).subs(sol)-b
ErrSol = v.T*v

print "# Estimated errors of each observation"
print "# Original closure error"
print "# Closure error after correction"
ans = map(tuple,(Matrix(data) + Matrix([s,p]).T.subs(sol)).tolist())

# LaTex output for web page
for k in range(n):
    s1 = r"\theta_{%d} - \alpha_{%d} &= %.3f"%(k+1, k+1, ans[k][0])
    s2 = r"r_{%d} - \epsilon_{%d} &= %.3f"%(k+1, k+1, ans[k][1])
    s = s1 + r", \quad & \quad " + s2 + r",\\"
    print s


The linear least squares method for approximating solution to overdetermined linear systems has several advantages over other approaches like minimizing the total error or the maximum error. First, the square error forms a convex function, and thus, there is always a unique local minimum that is also the global minimum. This ensures reproducibility, and removes ambiguity. Second, the least-squares solution can be calculated directly, with greater accuracy and efficiency than other optimization methods that require iterative evaluation for convergence. Third, the method can be theoretically justified in cases when the errors in the observations are normally distributed.

But don’t mistake these advantages for “proof” that least squares methods and linear regression are universally the “right” approach to estimation problems. Least squares methods are a good general-purpose approach and often give answers very similar to more powerful alternatives. But for problems where the value of greater accuracy justifies the time and effort or where there are strong nonlinearities, we can get better results by modelling the actual error mechanisms involved and using alternative optimization algorithms. And least-squares will always gives you a solution, even if the question is nonsense! You should never accept it’s results without some sanity-checking. We’ll see more examples as things go along.


  1. (From Sand, 1995) The following data set illustrates Bergmann’s rule that body size within a species should increase with lattitude. Using the linear least-square method, find a regression line through the data that is the best fit.

[ Data : hide , shown as table , shown as CSV shown as QR ]

# This data set illustrates Bergmann's rule of
# animal size increasing with lattitude with Swedish Moose 
# https://en.wikipedia.org/wiki/Bergmann's_rule
# @article{bib:Sand1995,
#   title={Geographical and latitudinal variation in growth
#   patterns and adult body size of Swedish moose (Alces alces)},
#   volume={102},
#   ISSN={1432-1939},
#   url={http://dx.doi.org/10.1007/BF00341355},
#   DOI={10.1007/bf00341355},
#   number={4},
#   journal={Oecologia},
#   publisher={Springer Nature},
#   author={Sand, H�kan and Cederlund, G�ran and Danell, Kjell},
#   year={1995},
#   pages={433-442},
# }
# Data from tables 1 and 2
# Columns ...
# Lattitude (degrees N), mean male adult mass in herd (kg)
  1. (From Rosner, 2000) In newborn babies, blood-pressure depends on the weight of the baby as well as the age of the baby when the blood pressure is recorded. Below is a data set from 16 infants. Use linear least-squares to fit a model of the form \(b = c_a a + c_w w + c_0\) where \(a\) is the baby’s age, \(w\) is teh baby’s weight, \(b\) is the baby’s blood pressure, and \(c_i\) are the model parameters to be estimated. Hint: the method for one independent variable works just the same with 2 independent variables, but with the matrix \(A\) having 3 columns instead of 2.

[ Data : hide , shown as table , shown as CSV shown as QR ]

#patient id, age (days), birthweight (oz), systolic blood pressure (mm Hg)
#Data from Rosner, Fundamentals of Biostatistics, page 468
01, 3, 135, 89
02, 4, 120, 90 
03, 3, 100, 83 
04, 2, 105, 77
05, 4, 130, 92 
06, 5, 125, 98 
07, 2, 125, 82
08, 3, 105, 85
09, 5, 120, 96
10, 4, 090, 95
11, 2, 120, 80
12, 3, 095, 79
13, 3, 120, 86
14, 4, 150, 97
15, 3, 160, 92
16, 3, 125, 88

  1. The calculations Chauvenet actually performs in his book are a little more complicated than described above. Based on other information, some measurements were more trustworthy than others. Specifically, the measurement of the angle between Burden to Jescelyne is believed to be less precise than the other three angles. It would be nice to be able to take this imprecision into account when calculating our least-squares solution. And there is indeed a way – we can weight the errors produced by each equation such that equations measured to greater precision are given larger weights and equations measured to lower precision are given smaller weights.

    1. Find the weighted normal equations in matrix form for the vector \(x_{\rm best}\) that minimizes the weighted square error \[E(x) = \sum_{i} W_{ii} ( (Ax)_i - b_i)^2,\] where \(W\) is a non-negative diagonal matrix.

    2. Chauvenet weights the first 3 measurements as 3 times more precise than the 4th, and the observation that angles must sum to 360 degrees with infinitely more weight. Replicate the calculations in the published version of Chauvenet’s survey problem using the weighted normal equations obtained above. Explain any discrepancies with Chauvenet’s result.

  2. In 1807, Nathaniel Bowditch observed a comet in the New England sky and made a series of astronomical observations. He then attempted to calculate the orbit of the comet using the methods described by Laplace. In the process, he developed a system of 56 equations for 5 unknowns. Use linear least squares to estimate the values of these 5 parameters and discuss the relationship between your estimates and Bowditch’s.

  3. Find the latitude and longitude of the Pine Mount Coastguard station.

( previous, home, next )