Gauss on least-squares and maximum-likelihood estimation

Gauss’ 1809 discussion of least squares, which can be viewed as the beginning of mathematical statistics, is reviewed. The general consensus seems to be that Gauss’ arguments are at fault, but we show that his reasoning is in fact correct, given his self-imposed restrictions, and persuasive without these restrictions.

Mathematical statistics is much younger and, with similar caution, we select as its beginning the publication of Gauss' famous 1809 monograph. Legendre (1805) had published his method of least squares four years earlier, but he developed it as an approximation tool and assumed no randomness. Gauss (1809), in contrast, works in the context of random variables and distributions; see, e.g., Pearson (1978), Stigler (1986), and Gorroochurn (2016) for historical details.
Some satisfaction seems to be derived in finding mistakes in the writings of great minds, and Leibniz' error is quoted frequently. Rather than laughing at Leibniz' mistake, we should realize just how difficult the beginnings of probability theory were, and that things that we now consider easy are not easy because we are so clever but because they have sunk into common knowledge.
Similarly, most Gauss commentators have found his 1809 treatment of least squares at fault. For example, Stigler (1986, pp. 141-143) considers it a "logical aberration … essentially both circular and non sequitur" and Gorroochurn (2016, p. 163) writes that "his reasoning contains an inherent circularity because the normal distribution emerges as a consequence of the postulate of the arithmetic mean, which is in fact a consequence of the normality assumption!" The purpose of this note is to demonstrate that it is not Gauss who is at fault but his commentators.

In modern notation, Gauss starts with the linear model $y = X\beta + u$, where he assumes that the errors $u_i$ are independent and identically distributed (iid) with mean zero and common variance $\sigma^2$, which we set equal to one without loss of generality. Since the $u_i$ are iid, they have a common density function, say $\varphi(u_i)$, and the logarithm of the joint density becomes $\sum_{i=1}^n \log \varphi(u_i)$. Gauss wishes to estimate $\beta$ by maximizing the joint density. In other words, he wants to derive the maximum-likelihood estimator for $\beta$.
Gauss is aware of the fact that if he assumes normality of the errors, then the joint density will be of the form
$$(2\pi)^{-n/2} \exp\Bigl(-\tfrac{1}{2}\sum_{i=1}^n u_i^2\Bigr),$$
so that (under normality) maximizing the likelihood is the same as minimizing the sum of squared deviations. Gauss makes life unnecessarily difficult for himself by working in a Bayesian framework, assuming a flat bounded prior for each of the $\beta_j$, so that the posterior also has bounded support. But in essence, Gauss showed (for the first time) that in the standard linear model under normality the maximum-likelihood estimator is equal to the least-squares formula. This is an important result in itself and Gauss could have stopped there. But he did not want to assume at the outset that the errors are normally distributed. Instead, he wanted to show that normality of the errors is not only sufficient but also necessary for the maximum-likelihood estimator to be equal to the least-squares formula. In this attempt he fails, not because his argument is wrong (as most Gauss scholars seem to believe), but because his (correct) argument is not general, which he fully realizes.
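The equivalence under normality is easy to check numerically. The following sketch (our illustration, not Gauss' own computation) maximizes the normal log-likelihood of a pure location model over a grid and confirms that the maximizer is the least-squares estimator, the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.0, size=200)  # iid normal observations

def loglik(mu):
    # normal log-likelihood with sigma = 1, dropping the constant -n/2*log(2*pi)
    return -0.5 * np.sum((y - mu) ** 2)

# maximize over a fine grid around the data
grid = np.linspace(y.mean() - 1, y.mean() + 1, 2001)
mu_hat = grid[np.argmax([loglik(m) for m in grid])]

# the maximum-likelihood estimator coincides with least squares: the mean
assert abs(mu_hat - y.mean()) < 1e-3
```

The grid search is only for transparency; setting the derivative of the log-likelihood to zero gives the same answer in closed form.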
Let us reexamine his argument. Gauss (1809, book II, section III, §177) proves the following result (in modern notation).
Proposition (Gauss 1809) Let $y_1, y_2, \ldots, y_n$ ($n \geq 3$) be a sequence of independent and identically distributed observations from an absolutely continuous distribution with $E(y_i) = \mu$ and $\operatorname{var}(y_i) = 1$. Assume that the $n$ realizations of $y_i$ take only two values with frequencies $n_1$ and $n_2$, respectively ($n_1 \geq 1$, $n_2 \geq 1$, $n_1 \neq n_2$). Then the average $\bar y$ is the maximum-likelihood estimator of $\mu$ if and only if the $y_i$ are normally distributed.
Before we prove the proposition, some comment is in order on Gauss' assumption that the $n$ realizations of $y_i$ take only two values. This seems to contradict the fact that the $y_i$ follow an absolutely continuous distribution. Of course, there is a difference between observations as random variables from an absolutely continuous distribution and observations as realized values. Some statistical concepts have two terms (estimator, estimate; predictor, prediction) to emphasize this difference, but most (like observation) do not. The random variables follow an absolutely continuous distribution, but the realizations take on specific values, and Gauss assumes that they take one or the other of two values. This is a rather heroic assumption, but it is not inconsistent or wrong. Gauss himself simply says supponendo itaque (by supposing, therefore), as if this were a logical continuation of his argument, and provides no further comment.
To prove the proposition, Gauss argues as follows. Let $u_i = y_i - \mu$. Since the $u_i$ are iid, they have a common density function, say $\varphi(u_i)$.

First assume that $\varphi$ is the standard-normal density. Then the log-likelihood can be written as
$$L(\mu) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^n (y_i - \mu)^2.$$
This is maximized if and only if the sum of squares is minimized, which occurs when $\sum_i (y_i - \mu) = 0$, that is, when $\hat\mu = \bar y$. Note that the additional assumption on the realizations of $y_i$ is not required.

Now assume that $\hat\mu = \bar y$. Gauss needs to show that this implies that $\varphi$ is the standard-normal density. As assumed, the $y_i$ can only take two distinct values, say $z_1$ ($n_1$ times) and $z_2$ ($n_2$ times), where $n = n_1 + n_2$. Then, letting
$$r = \frac{n_1}{n}, \qquad d = z_1 - z_2, \qquad f(u) = \frac{\varphi'(u)}{u\,\varphi(u)},$$
he obtains
$$L(\mu) = n_1 \log \varphi(z_1 - \mu) + n_2 \log \varphi(z_2 - \mu).$$
Setting $L'(\bar y) = 0$ then gives
$$n_1\,\frac{\varphi'(z_1 - \bar y)}{\varphi(z_1 - \bar y)} + n_2\,\frac{\varphi'(z_2 - \bar y)}{\varphi(z_2 - \bar y)} = 0,$$
which, since $z_1 - \bar y = (1-r)d$ and $z_2 - \bar y = -rd$, can be rewritten as
$$f\bigl((1-r)d\bigr) = f(-rd).$$
For each given value of $r$ ($0 < r < 1$, $r \neq 1/2$), this has to hold for every value of $d$, and it is easy to see (unde facile colligitur in Gauss' words) that this implies that $f$ is a constant. (We have to exclude $r = 1/2$ because this would only imply that $f$ is symmetric around zero.) Hence, we must solve the equation
$$\frac{\varphi'(u)}{\varphi(u)} = -ku,$$
where $k$ is a positive constant. The solution to this differential equation is
$$\varphi(u) = A e^{-ku^2/2}$$
for some constant $A$. Since $\varphi$ represents a distribution it must integrate to one, which implies that the constant $A$ takes the value $A = \sqrt{k/(2\pi)}$, as proved a few decades earlier by Laplace (1774) in a theorema elegans, a fact gracefully acknowledged by Gauss. In our case $\sigma = 1$ and hence $k = 1$. Hence $\varphi$ is the standard-normal density, and the proof is complete.
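The final step of the proof, the differential equation and its normalizing constant, can be verified symbolically. A minimal sketch with sympy (our illustration; the symbol names are ours):

```python
import sympy as sp

u = sp.symbols('u', real=True)
k = sp.symbols('k', positive=True)
phi = sp.Function('phi')

# Gauss' differential equation: phi'(u) = -k*u*phi(u)
ode = sp.Eq(phi(u).diff(u), -k * u * phi(u))
sol = sp.dsolve(ode, phi(u))          # phi(u) = A*exp(-k*u**2/2)
assert sp.checkodesol(ode, sol)[0]

# normalization: the density must integrate to one, giving A = sqrt(k/(2*pi))
A = sp.sqrt(k / (2 * sp.pi))
total = sp.integrate(A * sp.exp(-k * u**2 / 2), (u, -sp.oo, sp.oo))
assert sp.simplify(total - 1) == 0
```

The integral is Laplace's theorema elegans; with $k = 1$ the solution is the standard-normal density.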
The presented proof follows Gauss' argument closely, except that he sets $n_1 = 1$ and $n_2 = n - 1$ (and tacitly assumes that $n \geq 3$). The proposition tells us how far Gauss got in proving the necessity of the normality assumption. The answer is: not very far, because his conditions are rather restrictive. Two centuries later we can get a little further. In particular, Kagan et al. (1973, Theorem 7.4.1), building on an earlier result in Kagan et al. (1965), established that, in general, linear estimators of location parameters are admissible if and only if the random variables are normally distributed; and they applied this admissibility approach to the linear model in Kagan et al. (1973, Section 7.7).
To link the linear model $y = X\beta + u$ to the proposition, Gauss thus makes three simplifying assumptions:

1. The design matrix $X$ has only one column, namely the vector of ones, so that we only have a constant term in the model, there is only one $\beta$ to estimate, and $u_i = y_i - \beta$.
2. The realizations of $y_i$ only take two distinct values, say $z_1$ ($n_1$ times) and $z_2$ ($n_2$ times), where $n = n_1 + n_2$ and $n_1 \neq n_2$.
3. The optimum is attained at $\hat\beta = \bar y$.
The third assumption is perfectly reasonable, as we are considering iid random variables $y_i$ with common mean $\beta$ and common variance. So, unless we expect Gauss to discuss shrinkage estimators, what alternative is there for estimating $\beta$? Under these three assumptions, Gauss shows that the $y_i$ must be normally distributed.
After establishing the proposition, Gauss argues that it is thus reasonable to assume normality. This is a qualitative statement which can be challenged, but it is not incorrect. Gauss was primarily interested in the justification of least squares, not in pushing the normal distribution. Not completely happy with his restrictive assumptions, Gauss (1823, §20) considered the same model again. This time he asked a different question, namely: which linear unbiased estimator has the smallest variance? This resulted in what we now call the Gauss-Markov theorem and it does not rely on normality of the errors.
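The Gauss-Markov result can be illustrated by simulation. In the sketch below (our illustration, with arbitrarily chosen alternative weights), the errors are deliberately non-normal, and the least-squares estimator of a constant term, the equal-weight average, still has smaller variance than another linear unbiased estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, beta = 10, 20000, 2.0

# iid non-normal errors: uniform on (-1, 1), mean zero
u = rng.uniform(-1.0, 1.0, size=(reps, n))
y = beta + u

# two linear unbiased estimators of beta (weights sum to one)
w_ols = np.full(n, 1.0 / n)        # least squares: equal weights
w_alt = np.linspace(1.0, 3.0, n)   # arbitrary unequal weights
w_alt /= w_alt.sum()

est_ols = y @ w_ols
est_alt = y @ w_alt

# both are unbiased, but least squares has the smaller variance
assert abs(est_ols.mean() - beta) < 0.01
assert abs(est_alt.mean() - beta) < 0.01
assert est_ols.var() < est_alt.var()
```

This is exactly the point of the 1823 result: the optimality of least squares among linear unbiased estimators does not depend on the shape of the error distribution, only on its first two moments.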
Gauss liked his 1823 approach much better than his 1809 approach, not only because of the restrictive mathematical assumptions but also because of the metaphysics (Gauss' word). In a letter to Friedrich Bessel dated February 28, 1839 (Gauss and Bessel 1880, p. 523), Gauss writes that for a reason "den ich selbst öffentlich nicht erwähnt habe" (which I myself have not mentioned publicly) he views best linear unbiased estimation as more fundamental than maximum likelihood, and that he therefore prefers his 1823 justification of least squares over his 1809 analysis.

Conflict of interest
The author has no competing interests to declare that are relevant to the content of this article.