Robert Leslie Ellis’s work in mathematical statistics and probability can be conveniently grouped under three headings: (1) his 1844 critique of the justifications of the method of least squares, (2) his short 1842 paper on the foundations of probability (with a brief follow-up in 1854), and (3) his return to least squares in 1850.Footnote 1 The first of these was a brilliant dissection of two major derivations of least squares and one minor one, each under different sets of assumptions. It was the most cited and admired of his mathematical works during his lifetime, and indeed in the nineteenth century. The second was not influential in the nineteenth century, but it may be the most cited of his works in the twentieth, when it caught the eyes of a number of philosophers as a partial anticipation of the frequency theory of John Venn (1834–1923). The third, his return to the topic of 1844 after an 1850 review by John F.W. Herschel (1792–1871) had added a new wrinkle to an old topic, may have been less successful, but it still sheds new light on his understanding. I propose to treat these three topics in that order, with the emphasis on a brief summary of the issues and a critique of how they may be thought of in the light of current historical understanding. Regarding the third, I will also present a previously unpublished letter from Ellis on the topic.

1 Ellis and Least Squares in 1844

To understand the scope of Ellis’s achievement, it may be useful to offer a very short summary of work on least squares before Ellis.

The method of least squares was formally introduced by Adrien-Marie Legendre (1752–1833) in 1805 as a method of reconciling a set of inconsistent linear equations through a form of averaging.Footnote 2 In the simplest case, it involved fitting a line Y = a + bX to points (Xi, Yi), under the assumption that each Yi = a + bXi + ei, ei being an error of observation. Legendre’s elegant but ad hoc solution was to choose a and b to minimize the sum

$$\sum_i \left( Y_i - \left( a + b X_i \right) \right)^2$$

(hence the name he gave the method), and the least squares estimates of a and b were relatively easily calculable linear functions of the Yi given the Xi. Legendre gave no argument to justify the choice of this measure of fit other than observing that it gave a balance to the points considered as a mechanical system.
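Legendre’s recipe is simple enough to state in a few lines of modern code. The following Python sketch (with made-up illustrative data, not drawn from any historical source) computes the a and b that minimize the sum above, using the closed-form solution obtained by setting the partial derivatives to zero:

```python
# Least squares fit of the line Y = a + bX, minimizing
# sum((Y_i - (a + b*X_i))**2) over a and b.
# Setting the partial derivatives to zero gives the closed form below.

def least_squares(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: sum of cross-deviations over sum of squared x-deviations.
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x  # intercept: the fitted line passes through the means
    return a, b

# Illustrative data only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
a, b = least_squares(xs, ys)  # a ≈ 0.15, b ≈ 1.94
```

As the text notes, the estimates are linear functions of the Yi, which is what made Laplace’s later asymptotic analysis possible.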

Subsequent authors would build the case upon assumptions about the errors of observation, using probability arguments where Legendre had not even mentioned probability. In 1809 Carl Friedrich Gauss (1777–1855) presented an argument that the errors could be considered to be independently random, each following what we now call the normal or Gaussian distribution. He then showed how this would lead to the method of least squares as the best method, in the sense that the a and b thus derived made the observed values of the data (the Yi’s) more probable than any other choice of a and b would have made them, given the Xi’s. We now call this the criterion of maximum likelihood. His argument for the particular choice of a normal distribution was a bit circular. Astronomers, he said, put faith in the arithmetic mean in the simplest case where b = 0 and only a is to be estimated, and the arithmetic mean will in that case be the most probable value of a if and only if the errors have that distribution.
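The elementary fact underlying Gauss’s starting point is easy to check numerically: when b = 0 and only a is estimated, the least squares choice of a is the arithmetic mean, whatever the error distribution. (Gauss’s further claim, that the mean is also the most probable value only under the normal law, is not checked here.) A minimal sketch with illustrative data:

```python
# With b = 0, least squares reduces to minimizing sum((x_i - a)**2) over a.
# The minimizer is the arithmetic mean, verified here against nearby values.
xs = [3.1, 2.9, 3.4, 2.8, 3.3]
mean = sum(xs) / len(xs)

def sum_sq(a):
    return sum((x - a) ** 2 for x in xs)

# The sum of squares at the mean is no larger than at perturbed values of a.
assert all(sum_sq(mean) <= sum_sq(mean + d) for d in (-0.5, -0.1, 0.1, 0.5))
```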

In 1811 Pierre-Simon Laplace (1749–1827), inspired by Gauss’s observation of the special role for that particular distribution, gave a lengthy derivation justifying least squares as approximately optimum when n, the number of points (Xi, Yi), is large, regardless of the distribution of the errors. Gauss in 1809 and Legendre in 1805 had arrived at the least squares estimates without restricting the form of the estimates; Laplace, unlike them, restricted attention to linear estimators.Footnote 3

In two memoirs in 1821–1823 Gauss reconsidered the problem, adopting Laplace’s restriction to linear estimators and adding a “mean square error” criterion as a measure of fit to the data, and from that deduced least squares without recourse to any assumption about the form of error distribution other than symmetry.Footnote 4 This second solution of Gauss’s has come to be known as “The Gauss-Markov Theorem,” through no fault of Gauss or Andrej Markov (1856–1922).

Ellis’s critique of this body of work was a brilliant tour de force. To appreciate what he accomplished, consider the difficulty of the task. Legendre’s introduction of the method may have been limited in scope, but it was expressed in as clear a statement as one can find in the mathematical literature: four pages of text that were direct, unambiguous, and complete; his short description was all that was needed to understand what he had done and why he had done it. Gauss’s first contribution was part of his important and brilliant 1809 treatise on gravitational attraction in the solar system, where in passing he made major contributions to spherical geometry and to the handling of linear equations, as well as giving a specific analysis of the orbit of the recently discovered asteroid Ceres. His later work in the 1820s, where he greatly advanced numerical analysis with a new approach to linear equations (Gaussian elimination), was not well understood by any statistical commentator before Ellis.

At least Gauss’s work, notwithstanding the difficulty of the material, was rigorously expressed through the exacting mind of one of the greatest mathematicians of any era. Laplace was a different case altogether. His was also a great mathematical mind, but of a markedly different type, where the accurate reach of his conclusions outran the proofs he could provide. He may be the person who introduced the annoying phrase “it is easy to see” to mathematics. In an 1837 review of Laplace’s major book on this topic (Théorie analytique des probabilités), Augustus De Morgan wrote that ‘[o]f all the masterpieces of analysis, this is perhaps the least known; [it] is the Mont Blanc of mathematical analysis,’ he added, ‘but the mountain has this advantage over the book, that there are guides always ready near the former, whereas the student has been left to his own method of encountering the latter’.Footnote 5 Ellis himself stated in his critique that ‘there are few mathematical investigations less inviting than the fourth chapter of the Théorie des Probabilités, which is that in which the method of least squares is proved’.Footnote 6 At least Laplace was basically correct in his analysis. The same cannot be said of the articles by James Ivory (1765–1842) which Ellis also discussed: Is there any work that is harder to read carefully than an extensive, confused, and mistaken analysis by a second-rate mathematical scientist? Ironically, Ellis’s diary records him reading Ivory on least squares on 12 September 1840; it may have been Ellis’s first encounter with least squares.

Ellis began his critique with the very smallest of points: the initial assumption of Gauss that in simple situations the arithmetic mean deserved a special status. Yet even there Ellis cast new light. He took the crux of the matter to be the belief that in the long run, the errors plus and minus would tend to cancel out. If the observations were x1, x2, etc, and “a” is the true magnitude, so that the errors are x1–a = e1, x2–a = e2, etc, then the tendency would be towards Σei = 0 and so Σ(xi–a) = 0, giving a=Σxi/n, the arithmetic mean. But as Ellis observed, equally well we would have Σf(ei) = 0 for any odd function f (where f(e) = –f(–e), as with odd powers of e), and so just as well we should have Σf(xi–a) = 0, which would lead to a conclusion different from least squares. He pointed out that if the probability density of an error e were any function φ(e) symmetric about zero, then what we now call the maximum likelihood estimate would be found from solving Σf(xi–a) = 0, where f(e) is the derivative of logφ(e), namely φ′(e)/φ(e), and we can no more suppose the mean is privileged than we can suppose one particular error distribution is privileged, namely that with φ′(e)/φ(e) = e, the distribution now called the normal distribution. Ellis would surely have been amused to learn that more than a century later, statisticians seeking more “robust” estimates (estimates less sensitive to larger errors than is the method of least squares) would adopt exactly the approach of solving Σf(xi–a) = 0 for an f selected with that goal, calling these “M-estimates”.Footnote 7
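Ellis’s estimating equation translates directly into modern code. The sketch below is a minimal illustration, not any historical procedure: the data, the cutoff c = 1.0 in Huber’s bounded function, and the bisection solver are all arbitrary choices made for the example.

```python
# Solve sum(f(x_i - a)) = 0 for a, for an odd nondecreasing function f.
# With f(e) = e this recovers the arithmetic mean; with a bounded odd f
# (Huber's psi, the 20th-century "M-estimate") gross errors get less weight.

def solve(f, xs, lo=-1e3, hi=1e3, tol=1e-10):
    # g(a) = sum f(x - a) is nonincreasing in a, so bisection applies.
    def g(a):
        return sum(f(x - a) for x in xs)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def huber(e, c=1.0):
    return max(-c, min(c, e))  # odd and bounded: f(-e) = -f(e)

xs = [0.1, -0.2, 0.05, 0.15, 10.0]  # one gross outlier
a_mean = solve(lambda e: e, xs)     # the arithmetic mean, 2.02
a_robust = solve(huber, xs)         # about 0.275: barely moved by the outlier
```

With f(e) = e the outlier drags the solution to 2.02; with Huber’s bounded f the solution stays near the bulk of the data, which is precisely the “robustness” the text describes.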

Having disposed of Gauss 1809, Ellis turned to Laplace. The mathematical question there was entirely different. Laplace, seeing the intimate tie between the normal distribution and least squares that Gauss had brought to attention, saw a different way to reach the same conclusion. In earlier work he had generalized Abraham De Moivre’s (1667–1754) demonstration that, for a large number of trials, the binomial distribution could be nicely approximated by the normal distribution. After seeing Gauss’s first effort, Laplace saw a way to use that insight in the broader question. If the estimates were to be linear in the dependent observations, then he could show how the smoothing that linearity produced would lead to the final estimates being approximately equivalent to what one would derive if the error distribution was indeed a normal distribution. To that end, he adopted a method of proof using complex analysis that essentially used what later came to be called Fourier transforms. The method was not rigorously developed – this was well before Augustin-Louis Cauchy (1789–1857) took the first serious steps towards the needed rigor. In the most mathematically impressive part of his article, Ellis found a way to do with rigor in real analysis what Laplace had done with complex numbers.

Laplace’s use of complex numbers was purely formal, exploiting Leonhard Euler’s (1707–1783) formula e^(ix) = cos(x) + i·sin(x) as an analytical tool to integrate powers of s = e^(ix) easily in terms of trigonometric functions. Ellis painstakingly redid the analysis directly in terms of trigonometric expansions, arriving at the same result without invoking imaginary numbers, with the rigor and clarity necessary to convince mathematicians unwilling to accept Laplace’s formal leap. De Morgan wrote of Laplace: ‘No one was more sure of giving the result of an analytical process correctly, and no one ever took so little care to point out the various small considerations on which correctness depends. His Théorie des Probabilités is by very much the most difficult mathematical work we have met with, and principally from this circumstance’.Footnote 8 Ellis’s recasting of Laplace was a tour-de-force. Whether his claim that ‘the mathematical difficulty is greatly diminished by the change [in approach]’ is accepted would depend on the reader’s unwillingness to accept Laplace’s formalism on faith.Footnote 9 A part of his analysis, on the distribution of sums, was published separately in 1844.Footnote 10

Finally, Ellis moved on to a brief discussion of Gauss’s later approach. Of all the early comments on these works, Ellis’s may have been the only one that showed both a just appreciation of what Gauss presented and an understanding of its relationship to the approach of Laplace. By adopting Laplace’s restriction to linear estimates and adding the specific measure of mean squared error as a judge of performance, Gauss freed the analysis from any assumptions about errors (other than a balance between positive and negative) or any appeal to asymptotics. It did not answer all questions, of course, and in truth at the practical level few people worried about the foundations of the method. Long after mathematicians had ceased giving attention to Gauss’s 1809 argument, textbooks still gave it alone as an explanation, because it was simple and succinct.

Ellis closed his article with a brief and effective dismissal of three efforts by mathematician James Ivory, efforts that had added confusion, not clarity, mostly through misunderstanding of Laplace.

Ellis’s analysis of least squares may not have changed the textbook treatments, but it was surely his most famous mathematical work over the last half of the 19th century. It was noticed with approval by mathematicians such as J.W.L. Glaisher (1848–1928) writing on least squares, and its fame was spread internationally by a much-cited bibliography of works on least squares compiled by Mansfield Merriman as part of his PhD thesis at Yale University in 1876. Merriman’s summary of Ellis ends with this endorsement: ‘The paper is one of the most valuable in the theoretical literature of the subject’.Footnote 11

2 Ellis and the Foundations of Probability in 1842

Two years before his least squares critique, Ellis had read a short report to the Cambridge Philosophical Society on the foundations of probability. The topic would not have been fashionable at the time; the joining of mathematical and philosophical considerations of probability was not entirely new, but serious attempts to examine the subject were mostly in a distant future. In his speculative study, Ellis took off from Jacob Bernoulli’s (1655–1705) theorem and an apparent paradox: ‘If the probability of a given event be correctly determined, the event will, on a long run of trials, tend to recur with a frequency proportional to this probability. This is generally proved mathematically. It seems to me to be true a priori’.Footnote 12 That is, if we accept the long run frequency definition of probability, then Bernoulli’s limit theorem is automatically fulfilled, and no proof is needed. And to Ellis, the meaning of probability was unalterably linked to the frequency interpretation.

At one level this was an odd assertion, because it misrepresents what Bernoulli proved, and internal evidence suggests Ellis had never read Bernoulli. As superficial evidence there is the matter that he consistently misspelled the name as “Bernouilli”. Of course, he was not the only person of that time to sin in this way. Augustus De Morgan upbraided another sinner of that century with the comment, ‘Oh, you have deeply offended me. Pray always keep in mind the personal interest I take in one-eyed philosophers’.Footnote 13 (De Morgan had lost the sight of his left eye in infancy.) But what Ellis missed in Bernoulli was deeper and of a mathematical sort that would have interested him.

Bernoulli in fact had seen the same paradox that animated Ellis, although he expressed his view differently. He thought the long run tendency was obvious: ‘to judge concerning some future event it would not suffice to take one or another experiment, but a great abundance of experiments would be required, given that even the most foolish person, alone and with no previous instruction (which is truly astonishing), has discovered that the more observations of this sort are made, the less danger there will be of error’.Footnote 14 But if Bernoulli thought the simply-stated limit theorem obvious to even the untutored, why did he concoct a proof and just what did he prove? In fact, Bernoulli derived a difficult and complicated proof, not of a limit theorem, but of an approximation theorem with an exact bound on the degree of error. He could, for a given binomial distribution, and for any degree of approximation desired, give a conservative upper bound on the number of trials needed to achieve that accuracy. The commonplace limit theorem Ellis stated could be derived from Bernoulli’s result, but the actual result was stronger and much more precise. Bernoulli could tell you how tightly the probability was packed about the theoretical value, something very far from a priori evident. Knowing precisely how concentrated probability is in a high dimensional space for any number of trials is very different from knowing only that at some point it will be concentrated to some degree, but we cannot know how much or when. Bernoulli ruled out a wandering concentration, although only in theory. He admitted that in practice, with a coin or die wearing down, the chance could change.

Would Ellis have taken a different view if he had known Bernoulli’s actual result, rather than the popular description given by Laplace in his Essai philosophique sur les probabilités, a book written for a general audience? Perhaps not, but at least he would have had a better sense of the mathematical difficulty and it should at least have given him pause. In any case, what Ellis did say was quite interesting. He noted that a central problem with a frequency theory in practice was that it dealt with types or species of like events, and once you get away from dice and coins, deciding whether some instances are of the same type as others depends crucially on specifying the defining characteristic of a “type” in a world where all individuals are different in some respects while alike in others. In that view, social applications fail because it is impossible to group events as well-defined types. John Venn would later take this on with more success, but Ellis did raise the issue. Venn only noticed Ellis’s paper some time after his first edition of The Logic of Chance in 1866, and he added a note in the second edition of 1876 stating that their two approaches were ‘substantially similar’, but he took issue with Ellis’s terminology invoking “genus and species”, terms he thought less appropriate than his own use of “series”.Footnote 15

This question was particularly crucial in induction. Ellis faulted Laplace’s treatment of the problem of succession as lacking rigorous foundation. Laplace’s rule of succession held that if you had no knowledge of the chance of, say, heads versus tails in tossing a coin, you should take all chances as equally likely; if you then observed n heads in a row, the probability of a head on the next toss is (n + 1)/(n + 2), a consequence of Bayes’s theorem. Ignorance, Ellis said, was no justification for assuming a specific form for a prior distribution, even one of uniform probability.
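In modern terms Laplace’s calculation is a short exercise in Bayes’s theorem. The sketch below makes the uniform-prior assumption explicit — which is exactly the assumption Ellis objected to — and uses the elementary integral ∫₀¹ pᵏ dp = 1/(k + 1):

```python
# Laplace's rule of succession with a uniform prior on the unknown chance p:
# after n heads in a row,
#   P(next head) = (integral of p^(n+1) dp) / (integral of p^n dp)
#                = (1/(n+2)) / (1/(n+1)) = (n+1)/(n+2).
# Exact rational arithmetic below.
from fractions import Fraction

def rule_of_succession(n):
    numerator = Fraction(1, n + 2)    # ∫ p^(n+1) dp over [0, 1]
    denominator = Fraction(1, n + 1)  # ∫ p^n dp over [0, 1]
    return numerator / denominator

assert rule_of_succession(5) == Fraction(6, 7)  # (n+1)/(n+2) with n = 5
```

Note that with n = 0 the rule gives 1/2: the uniform prior, not any observation, is doing the work there, which is the circularity Ellis pressed.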

Ellis referenced Laplace on induction, but only for the Essai philosophique, and he never cited Bayes. Ellis’s criticism of Laplace had merit, especially if read by a rigorous mathematical mind. But had Ellis read Bayes? If he had, he would at least have seen an argument for the possibility of probabilistic induction. And he would have seen a treatment of succession that could have softened his view.Footnote 16 In 1854 Ellis published an even shorter comment that did not advance the matter, just recasting the question of “type” in metaphysical terms.

Ellis’s notes on the foundations of probability were ignored by his contemporaries De Morgan, Herschel, George Boole, John Stuart Mill, and, aside from the short note in the second edition of his Logic of Chance, Venn. A century later his work was viewed more favourably. A number of recent philosophers have called attention to Ellis’s articles, generally as an early anticipation of some of the questions to be addressed in framing a successful frequency theory of probability, rather than as a successful resolution of those questions, a success that Ellis never claimed.Footnote 17 Salmon summed up his discussion succinctly: ‘Ellis, it seems to me, took us to the very threshold of a frequency theory of probability; Venn opened the door and led us in’.Footnote 18

3 Back to Least Squares in 1850

In July 1850, the Edinburgh Review published a long article on Adolphe Quetelet’s (1796–1874) work on probability and social science. The article was anonymous, but generally known to have been written by John Herschel, and Herschel reprinted it in a collection of essays in 1857.Footnote 19 The article was widely read and so appreciated by Quetelet that he translated it into French and despite its length reprinted it, with permission from Herschel, as a preface to the 1869 edition of his Physique sociale. Some historians of science have argued that James Clerk Maxwell found his inspiration for his theory of gases in reading this review.Footnote 20 Ellis was also an avid reader of the review, and it reawakened his interest in least squares.

The particular passage that caught Ellis’s attention was a digression by Herschel offering what he considered to be a new and simple proof for the normal distribution as the general distribution of errors, and hence a new justification for the method of least squares. From considering target shooting and similar examples, Herschel argued that a general error distribution about the two-dimensional target should be symmetric in both horizontal and vertical directions, and that errors in those two directions should be independent. Furthermore, the probability of error should depend only upon the size of the error: the distribution should be such that the probability of any shot (x, y) depends only upon the distance between the shot and the target (0, 0); that is, it should be a function of x² + y², where x and y are the horizontal and vertical errors. Together with independence this led to the product f(x²)f(y²) depending only on x² + y², and the only solution possible was a normal distribution. (Unknown to Herschel and to Ellis, Robert Adrain had offered the same argument in 1808 or 1809.)Footnote 21
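In modern notation Herschel’s functional-equation argument can be sketched as follows (a compressed reconstruction, not Herschel’s own wording):

```latex
% Independence plus rotational symmetry give, for all x and y,
%   f(x^2)\, f(y^2) = g(x^2 + y^2).
% Setting y = 0 shows g(u) = f(u)\, f(0), so h(u) := f(u)/f(0) satisfies
%   h(u)\, h(v) = h(u + v),
% Cauchy's functional equation in multiplicative form. Its continuous
% solutions are h(u) = e^{-ku}, hence
\[
  f(x^2) \, f(y^2) = g(x^2 + y^2)
  \;\Longrightarrow\;
  f(x^2) = f(0)\, e^{-k x^2},
\]
% the normal density once k > 0 is imposed (so the density is integrable)
% and f(0) is chosen to make the total probability equal to one.
```

The mathematics is sound; Ellis’s objection, as described below, was to the physical assumptions of independence and of a single universal law of error, not to the deduction from them.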

Ellis’s attention must have been drawn to the article soon after publication. By the end of August, he was in correspondence with a friend, Archibald Smith, about the matter. A transcription of a September 3rd letter from Ellis to Smith is reproduced in Part II of the present volume. Smith, a Scottish mathematician, had preceded Ellis as Senior Wrangler at Cambridge and as a winner of, yes, the prestigious Smith Prize (no relation), given since 1769 for performance on examination at Cambridge, awarded to Archibald Smith in 1836 and to Ellis in 1840.

In a short publication in the Philosophical Magazine, Ellis raised three points of criticism of “the reviewer” (the letter to Smith shows Ellis knew Herschel was the reviewer). The first was the assumption that there was such a thing as a general law of error that could hold in all circumstances. The second was the assumption that errors in different directions could be considered independent in the technical meaning of the term that Herschel had indeed used, rather than in the casual meaning as simply separate. In both of these, Ellis’s complaint was just. Herschel and others (including Adrain) had been, and many people still are, cavalier in their assumption of independence. Herschel did not reply, but I would guess that the criticism would not have greatly troubled him; he could have felt that the assumptions were approximately sufficient with some generality, even if not full generality. Ellis would have none of that. To him both assumptions were based on ignorance about causes, and nothing could flow from ignorance but more ignorance. “Ex nihilo nihil” was Ellis’s frequent refrain, a Latin quotation from Lucretius, after a Greek argument attributed to Parmenides.

Ellis’s third point was quite different and quite unfortunate. It was simply wrong, and a rare example of how this fine mind could err in substantial ways when rushing into print. At the end of the published note he correctly showed that if you transform to polar coordinates and then find the probability distribution of the distance of a shot from the centre of the target, you get a distribution over the values from zero to infinity (we would now say he had derived the distribution of the square root of a Chi-square variate with two degrees of freedom). His mathematics was fine, but he then made the absurd comment that since the distance from the centre – the absolute value of the distance from the centre – had a positive most probable value (naturally, since it could never be negative), this contradicted Herschel’s claim that the bivariate normal distribution of errors (signed errors) had a centre of gravity at the target’s centre point! He wrote, ‘the centre of gravity of the shot-marks is not the most probable position […] so that [the reviewer’s] hypothesis is self-contradictory’.Footnote 22 Ellis, probably writing quickly and carelessly, had confused error and absolute error.
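The distinction Ellis blurred is easy to exhibit numerically. If the horizontal and vertical errors are independent standard normals (sigma = 1, an illustrative choice), the distance r = √(x² + y²) has density r·e^(−r²/2) — the Rayleigh density, the modern statement of what Ellis derived — whose mode is at r = 1 > 0, even though each signed error has its mode at 0. A minimal sketch:

```python
# Mode of the signed error vs. mode of the distance from the centre,
# for independent standard normal errors in each direction.
from math import exp, sqrt, pi

def normal_pdf(x):
    # density of a signed error: maximal at x = 0
    return exp(-x * x / 2) / sqrt(2 * pi)

def distance_pdf(r):
    # density of r = sqrt(x^2 + y^2): the Rayleigh density r * exp(-r^2 / 2),
    # zero at r = 0 and maximal at r = 1
    return r * exp(-r * r / 2)

grid = [i / 100 for i in range(0, 300)]
mode_signed = max(grid, key=normal_pdf)      # 0.0
mode_distance = max(grid, key=distance_pdf)  # 1.0
```

Both statements are true at once: the most probable signed error is zero, while the most probable distance is positive. There is no contradiction, only a change of variable.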

Ellis must have soon realized he had erred; his article had been sent on September 19 (13 days after his letter to Archibald Smith), and it appeared in the November issue. On 7 November he wrote confessing his error: when the errors are strictly bivariate normal, the centre of gravity is, as claimed, at the centre of the target, and that is the most probable point for a shot-mark. He added a hint, without proof, that without the normal distribution this would not be true.Footnote 23 That would usually be true, but it was a weak attempt to recover from an egregious error that must have upset him greatly.

4 Ellis and Mathematical Statistics

Ellis was an unusually able mathematician for his time. His was the most insightful critique of arguments for the method of least squares in the 19th century. His understanding of Gauss’s and Laplace’s works on probability and least squares was at that time exceeded only by Gauss and Laplace themselves, and his explication was often clearer than theirs.

His philosophical work on probability was less successful, but passionate and clear. Ellis’s work on the foundations of probability suffered from his apparent unfamiliarity with work by Bernoulli and Bayes. Had he engaged with their ideas instead of with secondary accounts he might well have advanced understanding. Reading Ellis gives the impression of reading an author who, once he is exposed to an idea, pauses to think deeply about how to challenge it, with little curiosity to explore how others may have developed it further and little sympathy for other interpretations.

Ellis’s lack of experience in experimental or observational science limited the impact of his critiques of probabilistic approaches to induction. He was a philosopher without patience for a scientist who, despite limited support for a method, cannot plead ignorance and drop the subject but must proceed in any event. His precarious health must have been a great limit to his energies. One can only wonder what he might have accomplished with good health and a longer life. Would he have then tried to extend mathematical statistics, rather than merely critique it? Might he have attacked the philosophy of probability full bore, rather than casually sow provocative seeds? Certainly, the ability was there that could have made him a major scientist or philosopher, but as many other cases prove, ability is necessary but not sufficient. And his breadth of interests was so wide that we cannot even guess over what fields his mind might have wandered to explore. His reading was extraordinarily wide; he could come up with such surprises as a maxim of metrology: ‘the Arab’s saying that a mile is as far as one can tell a man from a woman’.Footnote 24 We are now far from Ellis but close enough to recognize a cast of mind that glows brighter at a great distance than many more celebrated figures in the history of science. We are left to marvel at what he did accomplish in a very short working life.