( previous, home, next )


Further reading

An XKCD cartoon on linear regression

From their early applications in surveying, the arts of geometry and trigonometry seeped into other areas, including architecture, warfare, astronomy, cartography, and navigation. The application of mathematics in surveying, astronomy, and navigation also lead to the creation of the method of linear least squares. The method of linear least squares was developed almost simultaneously in France by Adrien-Marie Legendre, in (what is now) Germany by Carl Gauss, and in the United States by Robert Adrian. The story in the US is interesting and somewhat controversial.

The Method of Linear Least Squares

Linear least squares is a way to solve systems of linear equations that can not otherwise be solved. The broad idea is to look for the thing that is closest to solving all of the equations, rather than solving all of the equations directly. The trick of the method that has lead to its wide adoption is that it minimizes the sum of the square errors between the observations and predictions. Let's begin by briefly reviewing some of the basics of linear systems.

A linear system is a set of equations that we can arrange to look like \[\begin{align*} b_1 &= a_{1,1} x_1 + a_{1,2} x_2 + \cdots + a_{1,n} x_n \\ b_2 &= a_{2,1} x_1 + a_{2,2} x_2 + \cdots + a_{2,n} x_n \\ \vdots & \vdots \\ b_m &= a_{m,1} x_1 + a_{m,2} x_2 + \cdots + a_{m,n} x_n \end{align*}\] where the \(a\)'s and \(b\)'s are all known numbers, and the \(x\)'s are the unknowns we would like to determine. This linear system can be written more compactly as a vector-matrix equation \[\mathbf{b} = \mathbf{A x}\] where \(\mathbf{b} = [ b _ i ]\) is an \(m\)-dimensional vector, \(\mathbf{A} = [a _ {ij}]\) is an \(m \times n\) matrix with \(m\) rows and \(n\) columns, and \(\mathbf{x} = [ x _ i]\) is an \(n\)-dimensional vector. If the number of equations is equal to the number of unknowns (\(m = n\)), then we can usually determine the value of all the unknowns exactly (the exceptions being cases where one or more of the equations are repeating information already present in the other equations). If there are fewer equations than unknowns (\(m < n\)), the linear system is underdetermined, and we can not determine the exact values of the unknowns. If there are more equations than unknowns (\(m > n\)), then the linear system is usually overdetermined, and there are no solutions.

Ideally, when we measure things, all our measurements are very accurate, and we can make just enough measurements so that we get a linear system having exactly as many independent equations as we have unknowns. In practice, however, our observations have errors, and we always want to check your work, in case one of our observations went really wrong. A good way to check our work for errors is to make more observations. But adding more observations means adding more equations to the system. So, we often have overdetermined systems. And since there's always a little error in these observations, it is unlikely that a single set of unknown values will satisfy all the equations at once.

Although there is no solution to an overdetermined system, we can still look for the solution that is "closest" to solving all of our equations at once, \(x _ {\rm closest}\). There are many possible ways to measure "closest". We could, for instance, calculate how much each equation is off by, and choose the value of \(x\) that minimized the largest of these errors. Or we could just sum the absolute values of all the errors and minimize this sum. But the most useful approach we've found is to pick values for our unknowns that minimize the sum of the squared errors for each equation -- the unknowns that return the "least squares". As it turns out, the vector \(x _ {\rm closest}\) with the least sum-of-squares-error solves a simplified linear system \[A^T b = A^T A x _ {\rm closest}.\] The matrix equation \(A^T b = A^T A x\) is called the normal equation of a linear system.

Why the normal equation? The normal equation is derivable using standard minimization techniques from calculus -- differentiate an error function and find where that derivative of the error vanishes. Because the error function is square's, it's derivative will be a linear function. It is good to have a reference handy for matrix calculus.

First, observe that if you want to sum the squares of the components of any vector \(v\), you can do this by taking the dot-product of the vector with itself, or multiplying the vector by it's own matrix transpose, \[\sum_i v_i^2 = v \cdot v = v^T v.\] Now, we define the square error \[\begin{align*} E(x) &:= \sum _ i \left[ (Ax)_i - b_i \right]^2 \\ &= (Ax - b)^T (Ax-b) \\ &= x^T A^T A x - x^T A b - b^T A x + b^T b \end{align*}\] Since each of these terms is a scalar, \(x^T A b = (x^T A b)^T = b^T A x\), and \[\begin{align*} E(x) = x^T A^T A x - 2 b^T A x + b^T b \end{align*}\] To find the solution \(x _ {\rm closest}\) the gives the least square error, differentiate \(E(x)\) with respect to \(x\) and set the derivative to zero: \[\frac{dE}{dx} = 2 x^T A^T A - 2 b^T A = 0.\] If we now re-arrange, we obtain the normal equation stated above: \[ A^T A x = A^T b.\] So, that's where the normal equations come from. One of the reasons the math works out so nicely is that the sum-of-squared-errors is always a convex function of the unknowns, and for convex functions on a bounded convex domain, there is always exactly one local minimum that is also a global minimum.

The normal equation also has a close connection to probability theory. If observation errors only occur in the constant vector \(b\) and these errors are normally distributed, then according to principles of probability, the solution of the normal equations gives the most likely values for the unknown vector \(x\). The connection between the normal distribution and the normal equations was Robert Adrian's big insight, although those who have read his work closely are not satisfied by his justification.

Pro tip: For any of you familiar with the techniques of numerical linear algebra, it is often recommended to use a QR algorithm or SVD decomposition to find linear least-squares solutions rather than solving the normal equations by elimination with partial pivoting -- these alternative algorithms are more robust in cases where the normal equations are sensitive to small measurement errors. Computationally this means, don't invert \(A^TA\) to recover \(x=(A^T A)^{-1} A^T b\). Use a built-in linear system solver like scipy's linalg.solve instead. The version of the linear least squares algorithm presented here is the simplest, most naive version. Many variations and enhancements have been developed over the last century (see Björck, 1996).

Chauvenet's charting problem

One of the first people in the United States to teach linear least squares was William Chauvenet. Chauvenet was born in Milford, Pennsylvania, in 1820, graduated from Yale with degrees in mathematics and classics, wrote several math books, helped found the US's Annapolis Naval Academy, served as chancellor of Washington University in St. Louis, and played a role in the engineering of the first steel bridge across the Mississippi river for renouned engineer James Eads (which we will revisit later).

In the 1864 book A manual of spherical and practical astronomy, Chauvenet illustrates the method with a problem for coastal surveying (v 2, p551). A Coastguard station at Pine Mount surveyed the angles between 4 neighboring stations, Joscelyne, Deepwater, Deakyne, and Burden as follows, measured to a precision of thousandths of an arcsecond (or roughly a millionth of a degree). The conversion of (degrees, minutes, seconds) to decimal form can be quickly accomplished with the python function lambda d,m,s : ((s/60.+m)/60.+d)

Stations Angle between Decimal form

Joscelyne to Deepwater

65 11' 52.500"


Deepwater to Deakyne

66 24' 15.553"


Deakyne to Burden

87 2' 24.703"


Burden to Jescelyne

141 21' 21.757"


But we also have one extra piece of information that hasn't been accounted for -- the sum of all these angles must be 360 degrees. If we add up all four observed angles, we get 359.998476, close to 360 degrees, but off by more than a thousandth of a degree. That's definitely not correct to the millionth of a degree precision we expect, given the precision of the measurements reported. So, the task is to use this extra information about the sum of the angles to improve our accuracy, if we can.

If we let \(t\), \(u\), \(v\), and \(w\) be actual angles, we have 5 equations. \[\begin{align*} t &= 65.197917 \\ u &= 66.404320 \\ v &= 87.040195 \\ w &= 141.356044 \\ t + u + v + w &= 360 \end{align*}\] In matrix form, \[ A x = b\] where \[A = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix}, \quad x = \begin{bmatrix} t \\ u \\ v \\ w \end{bmatrix}, \quad b = \begin{bmatrix} 65.197917 \\ 66.404320 \\ 87.040195 \\ 41.356044 \\ 360 \end{bmatrix}. \]

The calculation from here on can be done by hand but are best performed by computer.

[Show code]
from scipy import *
from scipy import linalg

thetas = array([65.197917, 66.404320, 87.040195, 141.356044]) # directly observed values
A = zeros((5,4))
A[0:4,0:4] = eye(4)
A[4,0:4] = 1
b = zeros((5,1))
b[0:4,0] = thetas
b[4] = 360.
xbest = linalg.solve(A.T.dot(A), A.T.dot(b)) #
print("Corrections, in arc-seconds:\n")
print((xbest.T - thetas) * 60 ** 2)

We find \[A^T A = \begin{bmatrix}2 & 1 & 1 & 1\\1 & 2 & 1 & 1\\1 & 1 & 2 & 1\\1 & 1 & 1 & 2\end{bmatrix} \quad \text{and} \quad A^T b = \begin{bmatrix}425.197917\\426.40432\\447.040195\\501.356044\end{bmatrix}.\]

\[x _ {\text{closest}} = (A^T A)^{-1} (A^T b) = \frac{ 1 }{ 5 } \begin{bmatrix}4&-1&-1&-1\\-1&4&-1&-1\\-1&-1&4&-1\\-1&-1&-1&4\end{bmatrix} \begin{bmatrix}425.197917\\426.40432\\447.040195\\501.356044\end{bmatrix} = \begin{bmatrix}65.1982218\\66.4046248\\87.0404998\\141.3563488\end{bmatrix}. \]

We find the best correction is to add 1.097 arcseconds to each angle estimate. This makes intuitive sense, since without more information, we expect all four of the measurements to have the same kinds of errors, and our auxiliary condition that the angles must sum to 360 degrees also treats all four symmetrically.

Stations Observed angle Corrected angle Correction

Joscelyne to Deepwater

65° 11' 52.500"

65° 11' 53.598"


Deepwater to Deakyne

66° 24' 15.553"

66° 24' 16.649"


Deakyne to Burden

87° 2' 24.703"

87° 2' 25.799"


Burden to Joscelyne

141° 21' 21.757"

141° 21' 22.856"


Legendre, Regression Lines, and the Figure of the Earth

The very first public application of the method of least squares in 1805 was by Adrien-Marie Legendre to determine the figure of the earth and orbit of a comet. Laplace had previously attempted to solve the problem, and we will work with the data he collated. Since Jean Richer's observations of natural variation in pendulum periods in 1672 started the debate, a number surveys and expeditions made careful measurements of the distance travelled to traverse 1 degree of latitude, including Maupertius's expedition to Lapland (1736), the French Geodesic Mission to Peru (1739), and the Mason-Dixon observation in Pennsylvania (1768).

[ Data : hide , shown as table , shown as CSV shown as QR ]

# latitude L, Ds/DL, arc length
# Volume 2, book 3, section 41, page 445 of Bowditch's translation
# https://archive.org/stream/mcaniquecles02laplrich#page/444/mode/2up
# Formula: ds/dL = m sin(L)**2 + b
# and we wish to determine m and b from observations of the
# latitude L measured in grads (400 grads = 360 degrees = 2 pi rads)
# the length of one degree (ds/dL) is measured in double toises per grad
#   (1 toises is 1.949 meters, so y*2*1.949)
# The last column is the actual arclength measured to estimate ds/dL,
# and not used in our calculations
0.00000, 25538.85, 3.4633
37.0093, 25666.65, 1.3572
43.5556, 25599.60, 1.6435
47.7963, 25640.55, 2.4034
51.3327, 25658.28,10.7487
53.0926, 25683.30, 3.2734
73.7037, 25832.25, 1.0644

Working with the approximate equation \[\frac{ds}{d\lambda} \approx c + m \sin^2(\lambda),\] where \(c\) and \(m\) are unknowns, Laplace was able to use existing observations of lattitude \(\lambda\) and curvature \(ds/d\lambda\) to obtain 7 linear equations of a form that could be used to determine the figure of the earth. Laplace didn't have a good method for solving these 7 equations for the 2 unknown variables \(c\) and \(m\), though he tried several methods.

Let's consider the general problem of finding a line through several data points. Suppose we have an equation of the form \(m u + c = v\), and we observe a bunch of data points \((u _ i, v _ i)\) that we think should occur on this line, except for errors in \(v _ i\). \[\begin{align*} m u _ 1 + c &= v _ 1 \\ m u _ 2 + c &= v _ 2 \\ \vdots \\ m u _ i + c &= v _ i \\ \vdots \end{align*}\] The parameters \(m\) and \(c\) are our unknowns that we'd like to solve for, given our data points, but we have way more than 2 equations for our 2 unknowns, so we'll use linear least-squares. If we re-arrange our equations into matrix form, \[\begin{gather*} \begin{bmatrix} u _ 1 & 1 \\ u _ 2 & 1 \\ \vdots & \vdots \\ u _ i & 1 \\ \vdots & \vdots \end{bmatrix} \begin{bmatrix} m \\ c \end{bmatrix} = \begin{bmatrix} v _ 1 \\ v _ 2 \\ \vdots \\ v _ i \\ \vdots \end{bmatrix} \end{gather*}\] So, then \[\begin{gather*} A = \begin{bmatrix} u _ 1 & 1 \\ u _ 2 & 1 \\ \vdots & \vdots \\ u _ i & 1 \\ \vdots & \vdots \end{bmatrix}, \quad x = \begin{bmatrix} m \\ c \end{bmatrix}, \quad b = \begin{bmatrix} v _ 1 \\ v _ 2 \\ \vdots \\ v _ i \\ \vdots \end{bmatrix}. \end{gather*}\] We can now calculate \(A^T A\) and \(A^T b\) and then solve the normal equation \(A^T A x = A^T b\) for \(x\), i.e. for the parameters \(m\) and \(c\), using the python code
x = scipy.linalg.solve(A.T.dot(A), A.T.dot(b)) to determine the best fit.

[Show code]
from scipy import *
from numpy import *
from matplotlib.pyplot import *

fname = "LaplaceEarth.csv"
data = loadtxt(fname, delimiter=',', comments='#')
print("Data: \n", data)
n = max(data.shape)
b = data[:,1] # b becomes is a 1-d array
latitude = data[:,0:1]*(360./400)*(pi/180) # convert grads to rads
                # using '0:1' keeps result a 2-d array

sinsq = sin(latitude)**2
A = hstack([ones((n, 1)), sinsq])
x = linalg.solve( A.T.dot(A), A.T.dot(b))
c, m = x
print("=== best least squares fit ===")

plot(sinsq, b, 'bo', markersize=10, label="Laplace's Data")
u = array([0.,1.])
plot(u, u*m + c, 'k-', linewidth=2, label='Linear least squared error')

print("=== Laplace 1 ===")
m,c = 308.202, 25525.10
plot(u, u*m + c, 'g:', linewidth=2, label="Laplace's weighted least error")

xlabel('Transformed Lattitude ($\sin^2 \lambda$)',fontsize=18)
ylabel('Inverse curvature ($ds/d\lambda$)\nin double-toises per gradian',fontsize=18)
print("# rsync -auvz LaplaceEarthLstSq.png LaplaceEarthLstSq.pdf ../../../figures/")

When we evaluate the normal equations for Laplace's data, we find \[ A^T A = \begin{bmatrix} 7. & 3.07473 \\ 3.07473 & 1.74303 \end{bmatrix}, \quad A^T b = \begin{bmatrix} 179619.48 \\ 79022.77 \end{bmatrix}, \] which is equivalent to the 2 by 2 linear system \[\begin{align*} 179619.48 &= 7 c + 3.07473 m \\ 79022.77 &= 3.07473 c + 1.74303 m \end{align*}\] If we now solve this linear system, we find the best least-squares fit is \(c \approx 25519\), \(m \approx 319.60\), and the regression line through the data is given as \[v = 25519 + 319.60 u.\] Laplace used his own approach to minimizing the error, leading to the line \(v = 25525 + 308.202 u\).

plot of least squares fit to latitude data

plot of least squares fit to latitude data

When we plot the data and the regression line together, as in the figure above, we see the least-squares regression line does a good job of bisecting the data, but the data itself has enough variation to be scattered away from the line. Though calculated differently, there is essentially no difference between Laplace's result and ours when allowing for the uncertainty of the data. The data's trend is clear -- inverse curvature increases with lattitude, so curvature decreases with latitude, implying that the earth bulges at the equator and is flatter at the poles.

Historical asside

The hypothesis that earth is a ball with no edges traces back at least as far as the pre-Socratic Greek philosophers(c. 500 BC), and since the Magellan-Elcano circumnavigation in 1522, it has been widely accepted that the earth was roughly spherical in shape. But with the increased scientific sophistication of the age of enlightenment, it became a matter of great debate whether the earth was a perfect sphere and if not, was it bulging (an oblate spheroid) or cinched (a prolate spheroid) at the equator.

The whole problem can be traced back to one of the most practical applied-mathematics problems of the day -- ship navigation, and the longitude problem in-particular. As nations grew and trade expanded during the Renaissance, shipwrecks became an important source of financial and personal risk. The risks became larger after 1500, as long, expensive, and uncertain ocean crossings became common. Accurate navigational charts were being drawn, but were of little help when ships couldn't determine their own positions. Latitude could be approximated based on observations of the stars, but the longitude of a ship was nearly impossible to keep track of because the earth's daily rotation blurred the most useful celestial references.

In 1656, Christiaan Huygens, a remarkable natural philosopher whom we will encounter again later, patented the first pendulum clock. One of the uses Huygen's proposed for his clock was the determination of longitude. If you know what time it was in Paris when it was noon in New York, you would know the time-difference between the two, and by multiplication, the longitude of New York, relative to Paris. Huygen's clocks were not sophisticated enough to practically solve the longitude problem -- John Harrison's chronometer designs would eventually do the job a century later. But the idea already intrigued many, and in 1670, when the recently created Académie des Sciences in Paris decided to send it's second scientific expedition to Cayenne in French Guyana, clock-testing was part of the plan. In 1672, Jean Richer reported back that the length of a seconds-pendulum was not the same in Cayenne and Paris. In Cayenne, the seconds-pendulum was slightly shorter than the 24.84 centimeters in Paris. Parisian clocks would lose about 2 minutes and 28 seconds each day when used in Cayenne.

This one piece of seemingly modest data was the butterfly that started a hurricane in European science. Giovanni Cassini, the new influential Director of the Paris observatory, interpreted Richer's observation in the context of Huygen's pendulum theory (see later). There, the slower period implied that gravity was weaker in Cayenne, probably because earth was smaller around the equator than it was around the poles. Alternatively, in 1683, Robert Hooke had hypothesized that the earth was shaped like an onion rather than a sphere. Isaac Newton ran with this idea in the first edition of Principia (1687). He argued that the slowing of the pendulum was due to the spin of the earth counteracting gravitational forces, and that the slowing would coincide with a bulging of the earth around the equator. Cassini's and Newton's contradictory positions, proposed by influential scientists on opposite sides of the English channel, made the figure of the earth a hot topic of scientific debate across Europe.

How would you go about determining the figure of the earth? It is much less clear how to answer this than it is to answer the same question for the planet Jupiter. We can directly observe the shape of Jupiter through a telescope. Giovanni Cassini did this and determined that Jupiter was wider than tall. But we can not do the same for the planet we live on. This puzzle forces us to take a closer examination of what we mean by "figure" and "shape". If, for example, the earth's cross-section is elliptical rather than circular, how can we tell?

If we try to explain an ellipse to somebody who can not see it, one description would be that an ellipse is like a circle, but squeezed so that opposite ends get rounder while the middle gets flattened. This description suggests a way forward. An important local characteristic of shape is curvature. A circle has the same curvature all the way around as you go, but other shapes like an ellipse become more or less curved and for some shapes even reverse the direction of curvature. If we can measure curvature at different locations, then we might be able to determine the shape.

Curvature can be thought of as the bending of an arc per unit of distance travelled. Suppose we start at some point \(p\) on a planar curve. Draw a normal vector to the curve at the point we are starting. Now, we walk along the curve, keeping track of the distance \(s\) we have travelled. As long as the curve we are walking along is smooth enough that we can measure deviations in angle from being perpendicular to the curve, and we have a universal reference frame like the night sky, we can also recording the angle \(\theta(s)\) between the normal vector to the curve at our current location and the normal vector at our start location. The local curvature, then, can be thought of as \(d\theta/ds\). On a circle of radius, this implies that the curvature is constant everywhere and proportional the reciprocal of the radius. As we might expect, the bigger the circles get, the smaller their curvatures.

So, let's consider the specific problem of estimating the shape of an ellipse in detail. An ellipse can be represented with an equation of the form \[x^2 + (1+\epsilon)y^2 = r^2\] where \(\epsilon\) represents the distortion of the ellipse from a perfect circle.

Diagram of relationship between angle and distance on an ellipse

The normal vector perpendicular to a point on this ellipse can be found by treating this ellipse as a single contour line of the function \(F(x,y) = x^2 + (1+\epsilon) y^2\). The gradient can be determined by differentiation, \(\nabla F = [ 2 x, 2 (1+ \epsilon ) y ]\), and at each point, the gradient vector will be perpendicular to the ellipse curve forming a contour of constant value. Let \(\lambda\) be the angle between the normal vector and the equatorial plane. Then from elementary trigonometry, \[\tan \lambda = \frac{(1 + \epsilon)y}{x}.\] We can use this equation to represent both \(x\) and \(y\) parametrically as functions of \(\lambda\). Solving for \(y\) and substituting into our equation for the ellipse, \[\begin{gather*} x^2 + \frac{1}{1 + \epsilon} x^2 \tan^2 \lambda = r^2, \\ x = r \cos(\lambda) \sqrt{\frac{1+\epsilon}{1+\epsilon \cos^2 \lambda}}. \end{gather*}\] Similarly, solving our tangent equation for \(x\) and substituting leads to \[y = r \sin(\lambda) \sqrt{\frac{1 }{(1+\epsilon)(1 + \epsilon \cos^{2}(\lambda ))}}.\] Now, if we parameterize the arclength \(s\) as a function of the angle \(\lambda\), standard calculus references tell us \[\frac{ds}{d\lambda} = \sqrt{ \left( \frac{dx}{d\lambda} \right)^2 + \left( \frac{dy}{d\lambda} \right)^2 }.\] If we calculate the derivatives, substitute, and simplify, we find the reciprocal curvature \[\frac{ds}{d\lambda} = \sqrt{\frac{(1 + \epsilon) r^2}{\left(1 + \epsilon \cos^2(\lambda)\right)^3}}.\] This is a laborious nonlinear equation, so working with it directly at the time was not feasible. But since the earth is approximately spherical, \(\epsilon \approx 0\). The first two terms in the McLaurin series expansion in \(\epsilon\) give \[\frac{ds}{d\lambda} \approx r - \epsilon r + \frac{3}{2} \epsilon r \sin^{2}{\left (\lambda \right )},\] or if we transform to more convenient variables, \[\frac{ds}{d\lambda} \approx c + m \sin^2(\lambda)\] where \(c = r (1 - \epsilon)\) and \(m = 3/2 \epsilon r\). This remarkably little equation relating the latitude and distance on an ellipsoidal planet was original form derived by Roger Boscovich in Rome, in 1755. It was used successively by Maupertuis, Laplace, and Legendre (with minor variation), when estimating the figure of the earth. The left-hand side has units of distance per angle, the reciprocal of curvature. The ellipse parameters \(r\) and \(\epsilon\) on the right-hand side are the what we seek to estimate, while the free variables \(\lambda\) and \(ds/d\lambda\) can be measured directly by observation.

Using our regression-line fit obtained above, and solving for our original variables, the elliptical distortion \(\epsilon = 2m /(2m+3 c)\approx 0.008280\) and the earth's radius \(r = c + (2/3) m \approx 25,732\) meters. Hooke and Newton were correct -- the earth bulges at the equator. Unfortunately, these calculations do not settle the problem of determining the figure of the earth. There were significant errors in these data (as you might have guessed from their plot), and the method we've employed, based only on local estimates of the curvature, is not a good match to the actual data collection process. While the estimate of the earth's elliptical distortion is within 20% of modern approximations, our methods above lead to estimates of around 300 kilometers for the radius of the earth. This number is ridiculously small. If it were true, we would be able to drive around the earth in less than a day. In 1826, with the help of a much-improved data set and a little creativity, George Airy was able to obtain values that we can consider accurate by modern standards (see Exercises). We now know the earth's radius is about 6,300 kilometers.

There is a second, parallel track in the story of the determination of the earth's figure that we should atleast mention. While much work on measurement and fitting was going on, there was also substantial theoretical work under way, in an effort to explain the shape of the earth using the laws of nature. Newton's work was the first step on this path, but was unsatisfactory to a number of attentive readers. Subsequent studies by mathematical modelers opened up a can of worms. There was not a simple answer to the question of the equilibrium shape of a spinning blob of fluid. There was not a single shape it turned out, but a menagery of equilibria. The story unfolded over the next 200 hundred years of study, as new equilibria were discovered, and then shown to deform into some other unknown shape as the rate of spin of the blob increased. These analyses gave rise to what we now call "bifurcation theory". Many possibly familiar names, including Clairaut, Maclaurin, Simpson, Jacobi, Poincare, and Liapounov appear, and the curious may find it a rewarding story to seek out.


The linear least squares method for approximating solution to overdetermined linear systems has several advantages over other approaches like minimizing the total error or the maximum error. First, the square error forms a convex function, and thus, there is always a unique local minimum that is also the global minimum. This ensures reproducibility, and removes ambiguity. Second, the least-squares solution can be calculated directly, with greater accuracy and efficiency than other optimization methods that require iterative evaluation for convergence. Third, the method can be theoretically justified in cases when the errors in the observations are normally distributed.

But don't mistake these advantages for "proof" that least squares methods and linear regression are universally the "right" approach to estimation problems. Least squares methods are a good general-purpose approach and often give answers very similar to alternatives. But for problems where the value of greater accuracy justifies the time and effort or where there are strong nonlinearities, we can get better results by modelling the actual error mechanisms involved and using alternative optimization algorithms. And least-squares will always gives you a solution, even if the question is nonsense! You should never accept it's results without some sanity-checking. We'll see more examples as things go along.


  1. (Gauss) Find the least-squares solution for \(x\), \(y\), and \(z\) of the following over-determined linear system. \[\begin{align*} x - y + 2 z =& 3, \\ 3x + 2y - 5 z =& 5, \\ 4x + y + 4 z =& 21, \\ -2x + 6y + 6 z =& 28. \end{align*}\]

  2. The calculations Chauvenet actually performs in his book are a little more complicated than described above. Based on other information, some measurements were expected to be more accurate than others. Specifically, the measurement of the angle between Burden to Jescelyne is believed to be less accurate than the other three angles. It would be nice to be able to take this inaccuracy into account when calculating our least-squares solution. And there is indeed a way -- the method of weighted linear least squares. We can weight the errors produced by each equation such that equations measured to greater accuracy are given larger weights and equations with less accuracy are given smaller weights.

    1. Find the matrix form for the equations of the method of weighted linear least squares by determining the vector \(x_{\rm closest}\) that minimizes the total weighted square error \[E(x) = \sum_{i} W_{ii} ( (Ax)_i - b_i)^2,\] where \(W\) is a non-negative diagonal matrix.

    2. Chauvenet weights the first 3 measurements as 3 times more accurate than the 4th, and the observation that angles must sum to 360 degrees with (say) 1,000 times more accurate. Use the method of weighted linear least equations obtained above to find the angles between stations to the nearest thousands of an arc-second.

    3. Replicate the calculations in the published version of Chauvenet's survey problem using the weighted normal equations. Explain any discrepancies with Chauvenet's result.

  3. (Hard) In 1807, Nathaniel Bowditch observed a comet in the New England sky and made a series of astronomical observations. He then attempted to calculate the orbit of the comet using the methods described by Laplace. In the process, he developed a system of 56 equations for 5 unknowns.

    1. Use linear least squares to estimate the values of these 5 parameters.
    2. Discuss the relationship between your estimates and Bowditch's.
    3. Show that even though Bowditch's estimates do not minimize the sum-of-squared errors, they do do a better job minimizing the sum-of-errors.

[ Data : hide , shown as table , shown as CSV shown as QR ]

# eq #, c_d, c_t, c_p, c_n, c_i, constant
# data from pages 7, 8, and 9 of Bowditch, 1807
# http://www.jstor.org/stable/25057874
# Special thanks to ZY and HP for transcription
1,  -315,    2,  129,  239, -234,   38
2,  -317,   -1,  138,  242, -248,  -88
3,  -319,   -4,  145,  244, -262,  -98
4,  -323,   -8,  151,  246, -279,  -84
5,  -330,    2,  198,  249, -412,   21
6,  -329,   14,  207,  246, -448,  -75
7,  -327,   17,  208,  242, -467,   16
8,  -324,   26,  212,  240, -485, -164
9,  -321,   31,  213,  235, -505, -116
10, -318,   40,  215,  231, -526,  -71
11, -314,   48,  215,  225, -547,   86
12, -308,   57,  214,  218, -567, -278
13, -304,   67,  213,  211, -589, -158
14, -199,  202,  157,  108, -789, -413
15, -140,  262,  119,   59, -851,  385
16, -122,  280,   99,   37, -876,  120
17,  -95,  305,   81,   16, -895, -304
18,  -14,  378,   12,  -55, -955, 1081
19,   47,  429,  -40, -105, -987,  775
20,  187,  533, -165, -217,-1033,  263
21,  267,  588, -240, -280,-1045,  164
22,  306,  613, -282, -314,-1048,  757
23,  509,  726, -483, -464,-1020,  204
24,  542,  742, -520, -489,-1007,  238
25,  582,  761, -562, -517, -992,  614
26,  690,  804, -676, -587, -929,   47
27,  724,  813, -713, -609, -905, 1628
28,  944,  789,-1006, -716, -496,   69
29,  226,  957, -336, -214,   43, -180
30,  237,  950, -337, -212,   41, -235
31,  247,  945, -339, -211,   39, -301
32,  258,  933, -341, -209,   38, -143
33,  333,  865, -361, -208,   28, -262
34,  349,  849, -367, -210,   27, -197
35,  359,  839, -372, -211,   27, -386
36,  366,  831, -374, -212,   27, -180
37,  375,  824, -378, -213,   28,  -98
38,  383,  814, -382, -215,   29, -127
39,  390,  804, -386, -217,   28,  -92
40,  397,  794, -391, -220,   31,   74
41,  403,  784, -395, -223,   32, -108
42,  463,  702, -435, -241,   74,  244
43,  476,  671, -447, -245,   97, -294
44,  478,  659, -451, -246,  106, -197
45,  483,  651, -454, -246,  115,  413
46,  491,  617, -464, -246,  146, -803
47,  493,  593, -468, -242,  170, -200
48,  488,  541, -468, -229,  223, -177
49,  484,  516, -466, -220,  252,  287
50,  477,  500, -464, -215,  267,  424
51,  447,  428, -440, -176,  342,  565
52,  436,  412, -434, -167,  355,  101
53,  427,  397, -426, -157,  370, -143
54,  395,  351, -398, -123,  412, -589
55,  384,  338, -387, -111,  425,-1082
56,  206,  167, -219,   56,  539, -229
  1. (Unsolved) Find the latitudes and longitudes of the Coastguard stations in Chauvenet's book or show they do not exist.

  2. Laplace proposed that the equations for the data-fitting should be weighted in proportion to the total length of arc measured, reasoning that the larger the arc, the more accurate the estimate of the length of a single degree. If Laplace's data is fit using weighted linear least squares (see exercise in previous lecture), what estimates do we get for \(m\) and \(c\)?

  3. George Airy (1826), like Legendre before him, recognized that the local approximation wasn't a great match to the long transects contemporary surveyers were making in their efforts to improve accuracy. Instead of working directly with point estimates of curvature, Airy compared estimates of arc length between different latitudes. Let's try a version of his method and see what we get.

    1. Integrate our McLaurin series approximation \[\frac{ds}{d\lambda} \approx c + m \sin^{2}{\left (\lambda \right )},\] to get a formula for \(s(\lambda)\). Use trigonometry identities if necessary to express your formula in terms of \(\sin(2 \lambda)\).

    2. Airy's data have the form \((\phi, \psi, d)\), where \(\phi\) is the starting latitude, \(\psi\) is the ending latitude, and \(d\) is the distance measured between the two latitudes. Use your formula for \(s(\lambda)\) to derive an equation for the error between the predicted and observed values of \(d\) that is linear in \(c\), \(m\), and \(d\). (The equation will NOT be linear in \(\phi\) or \(\psi\) -- that's OK.)

    3. Use linear least squares and Airy's data below to get estimates for the values of \(c\) and \(m\).

    4. Based on your estimates of \(c\) and \(m\), what's the radius of the earth? How accurate is this estimate compared modern estimates?

[ Data : hide , shown as table , shown as CSV shown as QR ]

# lower latitude, upper latitude, measured distance 
# These are the improved data used by Bowditch and Airy
# in their estimates of the figure of the earth.
# https://archive.org/stream/mcaniquecles02laplrich#page/452/mode/2up
# latitudes are in radians
# the measured distance between latitudes is in meters
-7.3313525e-04, 5.36780860e-02, 3.44747088e+05
1.42430454e-01, 3.15146346e-01, 1.09477454e+06
6.74841785e-01, 8.90744298e-01, 1.37446573e+06
8.83453039e-01, 9.33023396e-01, 3.15927029e+05
1.14362830e+00, 1.17193793e+00, 1.80813456e+05

  1. (Medium) One of the sources of error in Airy's approximation is the use of only the first two terms in the series approximation of \(ds/d\lambda\).

    1. Derive the 3rd term in the McLaurin series of \(ds/d\lambda\) for small \(\epsilon\).
    2. Integrate this improved series to find an improved approximation to \(s(\lambda)\).
    3. Using a modern approximation for the earth's radius, plot the log of the sum of the squared errors between the predicted and observed values of \(d\) as a function of \(\epsilon\) over the range \([0,0.01]\).
    4. Which value of \(\epsilon\) minimizes this log sum of squared errors? (Note: because our formula for \(s(\lambda)\) is now a nonlinear function of \(\epsilon\), we can no longer use linear least squares to directly estimate its value.)
  2. (From Sand, 1995) The following data set illustrates Bergmann's rule that body size within a species should increase with lattitude. Using the linear least-square method, find a regression line through the data that is the best fit.

[ Data : hide , shown as table , shown as CSV shown as QR ]

# This data set illustrates Bergmann's rule of
# animal size increasing with lattitude with Swedish Moose 
# https://en.wikipedia.org/wiki/Bergmann's_rule
# @article{bib:Sand1995,
#   title={Geographical and latitudinal variation in growth
#   patterns and adult body size of Swedish moose (Alces alces)},
#   volume={102},
#   ISSN={1432-1939},
#   url={http://dx.doi.org/10.1007/BF00341355},
#   DOI={10.1007/bf00341355},
#   number={4},
#   journal={Oecologia},
#   publisher={Springer Nature},
#   author={Sand, H�kan and Cederlund, G�ran and Danell, Kjell},
#   year={1995},
#   pages={433-442},
# }
# Data from tables 1 and 2
# Columns ...
# Lattitude (degrees N), mean male adult mass in herd (kg)
  1. (From Rosner, 2000) In newborn babies, blood-pressure depends on the weight of the baby as well as the age of the baby when the blood pressure is recorded. Below is a data set from 16 infants. Use linear least-squares to fit a model of the form \(b = c_a a + c_w w + c_0\) where \(a\) is the baby's age, \(w\) is teh baby's weight, \(b\) is the baby's blood pressure, and \(c_i\) are the model parameters to be estimated. Hint: the method for one independent variable works just the same with 2 independent variables, but with the matrix \(A\) having 3 columns instead of 2.

[ Data : hide , shown as table , shown as CSV shown as QR ]

#patient id, age (days), birthweight (oz), systolic blood pressure (mm Hg)
#Data from Rosner, Fundamentals of Biostatistics, page 468
01, 3, 135, 89
02, 4, 120, 90 
03, 3, 100, 83 
04, 2, 105, 77
05, 4, 130, 92 
06, 5, 125, 98 
07, 2, 125, 82
08, 3, 105, 85
09, 5, 120, 96
10, 4, 090, 95
11, 2, 120, 80
12, 3, 095, 79
13, 3, 120, 86
14, 4, 150, 97
15, 3, 160, 92
16, 3, 125, 88

  1. The method of linear least squares can be applied to fitting certain polynomial curves to data. The parabola \(y = p_0 + p_1 x + p_2 x^2\) can be fit to data by letting the columns of the matrix \(A\) be the monomial powers \(x^0\), \(x^1\), and \(x^2\), with the coefficients \(p_i\) being our unknowns. Use the method of linear least squares to find the parabola that best fits the points \((-2,8)\), \((0,5)\), \((2, 3)\), \((4,4)\), and \((6,9)\).

  2. Fit a circle to the following data set by applying the method of linear least squares to estimate the parameters \(a\), \(b\), and \(c\) in the circle equation \[x^2 + y^2 = a + b x + c y.\]

[ Data : hide , shown as table , shown as CSV shown as QR ]

4.699 -15.127
5.003 -13.942
5.191 -15.625
5.309 -15.477
5.367 -15.923
5.399 -14.886
5.461 -13.778
5.563 -13.871
5.858 -16.825
6.305 -12.915
6.435 -17.143
6.750 -17.045
6.789 -16.919
6.882 -13.101
6.977 -16.659
7.154 -13.088
7.228 -12.828
7.855 -12.871
8.072 -16.715
8.468 -14.796
8.527 -13.972
8.690 -14.195
8.835 -15.526
8.853 -13.606
8.898 -15.991
8.946 -15.823
9.021 -15.789
9.038 -16.106
9.051 -16.004
9.094 -15.117

( previous, home, next )