Output as Prophecy

Holmes’s famous epigram, that “prophecies of what the courts will do” are what constitute the law, applies equally well to machine learning. In machine learning, prediction is the be-all and end-all. Machine learning requires that its tasks be formulated as prediction problems—not just forecasting, but prediction in the broader sense of “give an answer for cases where the true answer is not known to the machine”, for example naming the animal shown in a photo. A machine learning system is trained so as to maximize the accuracy of its predictions, and prediction accuracy is the benchmark by which machine learning systems are compared and evaluated. In the law, according to Holmes, prediction is everything, from the humble plaintiff who simply wants to guess how a case might turn out, to the appeals court judge whose decisions are written in anticipation of how they will be received by other judges and by the rest of society.

Holmes's interest in the logic and philosophy of probability and statistics has come more fully to light thanks to recent scholarship; 2 he immersed himself in those subjects early in his career. Holmes's use of the word "prophecy" was deliberate. It accorded with his overall view of law by getting away from the scientific and rational overtones of "prediction," even as he used that word too. Arguably, given how elusive explanations have been of how machine learning systems arrive at the predictions they make, "prophecy" is a good term in that context too.
We expand in this chapter on Holmes's idea that prophecies constitute the law, and then we return to prediction in machine learning.

5.1 Prophecies Are What Law Is

Holmes's famous epigram has been widely repeated, but it is not widely understood. Taken in isolation from The Path of the Law, where Holmes set it down, and in isolation from Holmes's development as a thinker, it might sound like no more than a piece of pragmatic advice to a practicing lawyer: don't get carried away by the cleverness of your syllogisms; ask yourself, instead, what the judge is going to do in your client's case. If that is all it meant, then it would be good advice, but it would not be a concept of law. Holmes had in mind a concept of law. The epigram needs to be read in context: Holmes was contrasting "law as prophecy" to "law as axioms and deductions." He saw an inductive approach to law-the pattern-finding approach that starts with data or experience-not just as improving upon or augmenting legal formalism. He saw it as a corrective. The declaration in his dissent in the Lochner case a few years after The Path of the Law that "[g]eneral propositions do not decide concrete cases" 4 was not just to say that the formal, deductive approach is insufficient; it was to say that formalism gets in the way. The central concern for Holmes was the reality of decision, the output that a court might produce. The realism or positivism in this understanding of law contrasted with the formalist school that had long prevailed. To shift the concern of lawyers in this way was to lessen the role of doctrine, of formal rules, and to open a vista of social and historical considerations heretofore not part of the law school curriculum and ignored, or at any rate not publicly acknowledged, by lawyers or judges. Jurists have been divided ever since as to whether the shift of conception was for better or worse. Whatever one's assessment of it, the concept of law as Holmes expressed it continues to influence the law.
There is more still to Holmes's epigram about prophecies. True, the contrast it entails between the inductive method and the deductive method alone has revolutionary implications. But Holmes was not merely concerned with what method "our friend the bad man" (or indeed the bad man's lawyer) should employ to predict the outcome of a case. He wasn't writing a law practice handbook. He was interested in individual encounters with the law to be sure, 5 but this was because he sought to reach a general understanding of law as a system. Holmes's invocation of prophecies, like his use of terms from logic and mathematics, was memorable use of language, but it was more than rhetoric: it was at the core of Holmes's definition of law. He referred to the law as "systematized prediction." 6 This was to apply the term "prediction" broadly-indeed across the legal system as a whole. Holmes was not sparing in his use of the word "prophecy" when defining the law. The word "prophecy" or its derivatives appears nine times in The Path of the Law. 7 He used it in the same sense when writing for the Supreme Court. 8 Holmes's concern with prediction is traceable in his other writings too. 9 The heart of Holmes's insight, and what has so affected jurisprudence since, is that the law is prediction. 10 Prophecy does not refer solely to the method for predicting what courts will do. Prophecy is what constitutes the law.
Prophecy of what, by whom, and on the basis of which input data? Holmes gave several illustrations. For example, he famously described the law of contract as revolving around prediction: "The duty to keep a contract at common law means a prediction that you must pay damages if you do not keep it, and nothing else." 11 He stated his main thesis in similar terms: "a legal duty so called is nothing but a prediction that if a man does or omits certain things he will be made to suffer in this or that way by judgment of the court." 12 This is a statement about "legal duty" irrespective of the content of the duty. It thus just as well describes any duty that exists in the legal system.
We think that Holmes's concept of law as prediction indeed is comprehensive. Many jurists don't see it that way. Considering how Holmes understood law to relate to decisions taken by courts, one sees why his concept of law-as-prophecy often has received a more limited interpretation.
Holmes wrote that "the object of [studying law], then, is prediction, the prediction of the incidence of the public force through the instrumentality of the courts." 13 Making the equation directly, he wrote, "Law is a statement of the circumstances, in which the public force will be brought to bear upon men through the courts…."; a "word commonly confined to such prophecies… addressed to persons living within the power of the courts." 14 It is often assumed that Holmes's description here does not account for the decisions of the highest courts in a jurisdiction, courts whose decisions are final. After all, in a system of hierarchy, the organ at the apex expects its commands to be obeyed. To call decisions that emanate from such quarters "predictions" seems to ignore the reality of how a court system works. In a well-functioning legal system, a judgment by a final court of appeal saying, for example, that the police are to release such and such a prisoner, should lead almost certainly to that outcome. The court commands it; the prisoner is released.
In two respects, however, one perhaps trivial but the other assuredly significant, the highest court's statements, too, belong to the concept of law as prophecy.
First, even in a well-functioning legal system, the court's decision is still only a prediction. As outlandish as the situation would be in which the police ignored the highest court, it is a physical possibility. A statistician might say that the probability is very high (say, 99.9999%) that the highest court's judgment that the law requires the prisoner to be released will in fact result in an exercise of public power in accordance with that judgment. We will say more below about the relation between probability and prediction. 15 Leaving that relation aside for the moment, a judgment even of the highest court is a prediction in Holmes's sense. It is a prediction in this way: the implementation of a judgment by agents of public power is an act of translation, and in that act the possibility exists for greater or lesser divergence from the best understanding of what the judge commanded. So the definition of law as prophecy is instanced in the chance that the "public force" will not properly implement the judicial decision. In a well-functioning legal system, the chance is remote. In legal systems that don't function well, the predictive character of final judgments is more immediate, because the risk in those systems is greater that the public force will not properly implement the courts' commands. "Finality" in some judicial systems is more formal than real. 16

Second, and more significantly, the respect in which the concept of law as prophecy is comprehensive comes into view when we consider how judges decide cases and how advocates argue them. In deciding a case, a judge will have in mind how that decision is likely to be interpreted, relied upon, or rejected, by future courts and academics and public opinion, as well as by the instruments of public force. The barrister, for her part, in deciding what line of argument to pursue, will have in mind how the judge might be influenced, and this in turn requires consideration of the judge's predictions about posterity.
Holmes thus described a case after the highest court had decided it as still in the "early stages of law." 17 As Kellogg puts it, Holmes situated a case "not according to its place on the docket but rather in the continuum of inquiry into a broader problem." 18 The law is a self-referential system, whose rules and norms are consequences of predictions of what those rules and norms might be. 19 Some people participate in the system in a basic and episodic way, for example the "bad man" who simply wants advice from a solicitor about likely outcomes in respect of his situation. Some people participate in a formative way. The apex example is the judge of the final court of appeal whose predictions about the future of the legal system are embodied in a judgment which she expects as a consequence of her authority in the legal system to be a perfect prediction of the exercise of public power in respect of the case. But her outlook, indeed her self-regard as a judge, entails more than that; a judge does more than participate in disconnected episodes of judging: she hopes that any judgment she gives in a case, because she strives for judgments that withstand the test of time, will be a more or less accurate prediction of how a future case bearing more or less likeness to the case will be decided. The judge describes her judgment as command, not as prophecy; but the process leading to it, and the process as the judge hopes it will unfold in the future, is predictive. Law-as-prophecy, understood this way, has no gap.
Holmes's claim, as we understand it, holds law to be predictive through and through. Prophecy is what law is made of. The predictive character of law, in this constitutive sense, is visible in the process of judicial decision, regardless what level of the judiciary is deciding; and it is visible in all other forms of legal assertion as well. Prophecy embraces all parts of legal process.
So everyone who touches the law is making predictions, from the self-interested "bad man" to the judge in the highest court of the land, and they make predictions about the full range of possible outcomes. The experience that influences their predictions, as we saw in Chapter 3, 20 Holmes understood to be wide, and the new situations that they make predictions about are unlimited. People on Holmes's path of the law thus engage in tasks much broader than standard machine learning tasks. As we also discussed in Chapter 3, 21 machine inputs, while they consist in very large datasets ("Big Data"), are limited to inputs on which a considerable degree of structure has been imposed-a degree of structure almost certainly lacking in the wider (and wilder) environment from which experience might be drawn. Machine outputs are correspondingly limited as well. Let us now further explore the machine learning side of the analogy-and its limits.

5.2 Prediction Is What Machine Learning Output Is
Holmes, writing in 1897, obviously did not have machine learning in mind. Nevertheless, his idea that prophecy constitutes the law has remarkable resonance with machine learning, a mechanism of computing that, like law as Holmes understood it, is best understood as constituted by prediction.
The word prediction is a term of art in machine learning. It is used like this:

    In a typical scenario, we have an outcome measurement, usually quantitative (such as a stock price) or categorical (such as heart attack/no heart attack), that we wish to predict based on a set of features (such as diet and clinical measurements). We have a training set of data, in which we observe the outcome and feature measurements for a set of objects (such as people). Using this data we build a prediction model, or learner, which will enable us to predict the outcome for new unseen objects. A good learner is one that accurately predicts such an outcome. 22

Though in common parlance the term "prediction" means forecasts-that is to say, statements about future events-in machine learning the term has a wider meaning. We have touched on the wider meaning in Chapter 4 and at the opening of the present chapter. Let us delve a little more into that wider meaning now.
It is true that some machine learning outputs are "prediction" in the sense in which laypersons typically speak: "Storm Oliver will make landfall in North Carolina" 23 or "the stock price will rise 10% within six months." Other outputs are not predictions in the layperson's sense. Indeed, the main purposes for which machine learning is used do not involve predictions of that kind-purposes like classifying court cases by topic or controlling an autonomous vehicle. Whatever the purposes for which it is used, machine learning involves "prediction" of the more general kind computer scientists denote with that term.
The essential feature of prediction in machine learning is that it should concern "the outcome for new unseen objects," i.e. for objects not in the training set. Thus, for example, if the training set consists of labelled photographs, and if we treat the pixels of the photograph as features and the label as the outcome, then it is prediction when the machine learning system is given a new photograph as input data and it outputs the label "kitten." In machine learning prediction, "pre-" simply refers to before the true outcome measurement has been revealed to the machine learning system. The sense of "pre-" in "prediction" holds even though other parties might well already know the outcome. For example, the computer scientist might well already know that the new photograph is of a tiger, not a kitten. That assignment of label-to-picture has already happened, but the machine learning system has not been told about it at the point in time when the system is asked to predict. Philosophers of science use the terms "postdiction" or "retrodiction" to refer to predicting things that have already happened. 24 These words are not used in the machine learning community, but the concept behind them is much what that community has in mind when it talks about prediction.
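To make the point concrete, here is a toy sketch of prediction in this sense; the two-feature "photos," the labels, and the deliberately crude nearest-example learner are our own illustrative assumptions, not any particular system's design:

```python
# Training set: (features, outcome) pairs the system has observed.
# Each "photo" is reduced to two made-up numeric features.
training_set = [((0.9, 0.8), "tiger"), ((0.1, 0.2), "kitten")]

def predict(features):
    """Predict the outcome for a new, unseen object by reporting the
    label of the closest training example (a deliberately crude learner)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(training_set, key=lambda ex: dist(ex[0], features))[1]

new_photo = (0.85, 0.75)   # an object not in the training set
print(predict(new_photo))  # prints: tiger

# The computer scientist may already know the true label; the machine
# does not. The "pre-" in prediction refers to the machine's ignorance.
true_label = "tiger"
print(predict(new_photo) == true_label)  # prints: True
```

The essential feature is only that the outcome for the new object has not been revealed to the machine at the moment it predicts, whoever else may know it.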
A significant part of the craft of machine learning is to formulate a task as a prediction problem. We have already described how labelling a photograph can be described as prediction. A great many other examples may be given. Translation can be cast as prediction: "predict the French version of a sentence, given the English text," where the training set is a human-translated corpus of sentences. Handwriting synthesis can as well. Given a dataset of handwritten text, recorded as the movements of a pen nib, and given the same text transcribed into text in a word processor, the task of handwriting synthesis can be cast as prediction: "predict the movements of a pen nib, given text from a word processor." As Judea Pearl observed in the interview with which we opened Chapter 4, 25 it is truly remarkable how many tasks can be formulated this way. In the social sciences, it is "a rather new epistemological approach […] and research agendas based on predictive inference are just starting to emerge." 26 A theory of law based on predictive inference, however, emerged over a century ago: Holmes theorized law to be constituted by prophecy. So too might we say that machine learning is constituted by prediction.
Moreover, prediction is not just the way that machine learning tasks are formulated. It is also the benchmark by which we train and evaluate machine learning systems in the performance of their tasks. The goal of training is to produce a "good learner," i.e. a system that makes accurate predictions. Training is achieved by measuring the difference between the machine's predictions (or postdictions, as the philosophers would say) and the actual outcomes in the training dataset; and iteratively tweaking the machine's parameter values so as to minimize the difference. The machine that reliably labels tigers as "tigers" has learned well and, at least for that modest task, needs no more tweaking. The machine that labels a tiger as a "kitten" needs tweaking. The one that labels a tiger as "the forests of the night," though laudable if its task had been to predict settings in which tigers are found in the poetry of William Blake, needs some further tweaking still to perform the task of labeling animals. This process of iterative tweaking, as we noted in Chapter 2, is what is known as gradient descent, 27 the backbone of modern machine learning. Thus, a mechanism of induction, not algorithmic logic, is at the heart of machine learning, much as Holmes's "inductive turn" is at the heart of his revolutionary idea of law.
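The iterative tweaking described above can be sketched in a few lines; the toy model y = w * x, the data, and the learning rate below are illustrative assumptions, not a description of any real system:

```python
# Gradient descent on a single parameter w of a toy model y = w * x:
# repeatedly nudge w to shrink the gap between the machine's predictions
# and the outcomes in the training dataset.
training = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # (feature x, outcome y)

w = 0.0    # initial parameter value
lr = 0.01  # learning rate: the size of each tweak

for step in range(1000):
    # Gradient of mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in training) / len(training)
    w -= lr * grad  # tweak the parameter against the gradient

print(round(w, 2))  # prints: 2.04 -- near the best-fitting value
```

Each pass measures the difference between predictions and actual outcomes and tweaks the parameter to reduce it, which is the whole of the mechanism in miniature.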
It is not machine learning's fundamental characteristic that it can be used to forecast future events-when will the next hurricane occur, where will it make landfall? One doesn't need machine learning to make forecasts. One can make forecasts about hurricanes and the like with dice or by sacrificing a sheep (or by consulting a flock of dyspeptic parrots). One can also make such forecasts with classic algorithms, by simulating dynamical systems derived from atmospheric science. This sort of prediction is not the fundamental characteristic of machine learning.
The fundamental characteristic of machine learning is that the system is trained using a dataset consisting of examples of input features and outcome measurements, until, through the process of gradient descent, the machine's parameter values are so refined that the machine's predictions, when we give it further inputs, differ only minimally from the actual outcomes in the training dataset.

Judges, litigants, and their lawyers certainly try to align their predictive statements of law with what they discern to be the relevant pattern in law's input data, that is to say in the collective experience that shapes the law. It is equally the case, in Holmes's understanding of the law, that we do not test court judgments by comparing them against stipulated "correct" labels the way our spam email or tiger detector was tested. Judgments are, however, tested against future judgments. This is to the point we made earlier about the judge's aim that her judgments withstand the test of time. The test is whether future judgments show her judgment to have been an accurate prediction, or at least not so far off as to be set aside and forgotten.

A machine learning system must be trained on a dataset of input features and outcome measurements. This is in contrast to the classic algorithmic approach, which starts instead from rules. For example, the classic approach to forecasting the weather works by solving equations that describe how the atmosphere and oceans behave; it is based on scientific laws (which are presumably the result of codifying data from earlier experiments and observation). Just as machine learning rejects rules and starts instead with training data, Holmes rejected the idea that law is the derivation of outcomes from general principles, and he cast it instead as a prediction problem-prophesying what a court will do-to be performed on the basis of experience.

5.3 Limits of the Analogy

As we noted in Chapter 3, 28 the predictions made by a machine learning system must have the same form as the outcomes in the training dataset, and the input data for the object to be predicted must have the same form as objects already seen. In earlier applications of machine learning, "same form" was construed very narrowly: for example, the training set for the ImageNet challenge 29 consists of images paired with labels; the machine learning task is to predict which one of these previously seen labels is the best fit for a new image, and the new image is required to have the same dimensions as all the examples in the training set. Human ability to make predictions about novel situations is far ahead of that of machines. A human lawyer can extrapolate from experience and make predictions about new cases that don't conform to a narrow definition of "cases similar to those already seen." The distance is closing, however, as researchers develop techniques to broaden the meaning of "same form." For example, an image captioning system 30 is now able to generate descriptions of images, rather than just repeat labels it has already seen. Thus, it is now well within the grasp of machines to label an image as "tiger on fire in a forest," but they are still a long way, probably, from describing, as the poet did, the tiger's "fearful symmetry."

There is a more significant difference between predictions in machine learning and in law. In machine learning, the paradigm is that there is something for the learning agent-i.e., the machine-to learn. The thing to learn is data, something that is given, not a changing environment affected by numerous factors-including by the learning agent. A machine for translating English to French can be trained using a human-translated corpus of texts, and its translations can be evaluated by how well they match the human translation.
Whatever translations the machine comes up with they do not alter the English nor French languages. In law, by contrast, the judgment in a case becomes part of the body of experience to be used in subsequent cases. Herein, we think, Holmes's concept of law as a system constituted from prediction may hold lessons for machine learning. In Chapters 6-8, we will consider some challenges that machine learning faces, and possible lessons from Holmes, as we discuss "explainability" of machine learning outputs 31 and outputs that may have invidious effects because they reflect patterns that emerge from the data (such as patterns of racial or gender discrimination). 32 In Chapter 9, 33 we will suggest that Holmes, because he understood law to be a self-referential process in which each new prediction shapes future predictions, might point the way for future advances in machine learning.
Before we get to the challenges of machine learning and possible lessons for the future from Holmes, we will briefly consider a question that prediction raises: does prediction, whether as the constitutive element of law or as the output of machine learning, necessarily involve the assessment of probabilities?

5.4 Probabilistic Reasoning and Prediction
"For the rational study of the law the blackletter man may be the man of the present, but the man of the future is the man of statistics," said Holmes. 34 It is not certain that Holmes thought that the predictive character of law necessarily entails a probabilistic character for law. He was certainly interested in probability. In the time after his Civil War service, a period that Frederic Kellogg closely examined in Oliver Wendell Holmes, Jr. and Legal Logic, Holmes studied theories of probability and was much engaged in discussions about the phenomenon, including how it relates to logic and syllogism. 35 Later, as a judge, he recognized the part played by probability in commercial life, for example in the functioning of the futures markets. 36 In personal correspondence, Holmes said that early in his life he had learned "that I must not say necessary about the universe, that we don't know whether anything is necessary or not. So that I describe myself as a bettabilitarian. I believe that we can bet on the behavior of the universe…" 37 Holmes would have been comfortable with the idea that law, in its character as prediction, concerned probability as well. Some jurists indeed have discerned in Holmes's idea of law-as-prophecy just such a link. 38

Predictions made by machine learning are not inherently probabilistic. For example, the "k nearest neighbors" 39 machine learning algorithm is simply "To predict the outcome for a new case, find the k most similar cases in the dataset, find their average outcome, and report this as the prediction." The system predicts a value, which may or may not turn out to be correct. Modern machine learning systems such as neural networks, however, are typically designed to generate predictions using the language of probability, for example "the probability that this given input image depicts a kitten is 93%." 40
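The quoted k-nearest-neighbors recipe is short enough to sketch directly; the dataset of one-feature cases below is hypothetical:

```python
def knn_predict(dataset, new_case, k):
    """Find the k most similar cases in the dataset, average their
    outcomes, and report the average as the prediction."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(dataset, key=lambda ex: dist(ex[0], new_case))[:k]
    return sum(outcome for _, outcome in nearest) / k

# Hypothetical cases: one numeric feature, one numeric outcome.
cases = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((8.0,), 80.0)]
print(knn_predict(cases, (2.5,), k=2))  # prints: 25.0
```

The system simply reports a value, with no statement of confidence attached, which is the point made above about non-probabilistic prediction.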
Separately, we can classify machine learning systems by whether or not they employ probabilistic reasoning to generate their predictions:

    [One type of] Machine Learning seeks to learn [probabilistic] models of data: define a space of possible models, learn the parameters and structure of the models from data; make predictions and decisions. [The other type of] Machine Learning is a toolbox of methods for processing data: feed the data into one of many possible methods; choose methods that have good theoretical or empirical performance; make predictions and decisions. 41

Are legal predictions expressed in the language of probability? Lawyers serving clients do not always give probability assessments when they give predictions, but sometimes they do. 42 Some clients need such an assessment for purposes of internal controls, financial reporting, and the like. Others ask for it for help in strategizing around legal risk. Modern empirical turns in law scholarship, it may be added, are much concerned with statistics. 43 Attaching a probability to a prediction of a legal outcome is an inexact exercise, but it is not unfamiliar to lawyers.
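The probabilistic phrasing quoted earlier ("the probability that this given input image depicts a kitten is 93%") is commonly produced by converting a model's raw scores into probabilities. One standard recipe for doing so is the softmax function; the label set and the raw scores below are made-up assumptions:

```python
import math

def softmax(scores):
    """Turn raw model scores into positive values that sum to 1,
    so they can be read as probabilities over the labels."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["kitten", "tiger", "other"]  # hypothetical label set
raw_scores = [4.0, 1.0, 0.5]           # hypothetical model outputs
probs = softmax(raw_scores)
for label, p in zip(labels, probs):
    print(f"{label}: {p:.0%}")
```

With these made-up scores the machine would report roughly a 93% probability for "kitten": a prediction stated in the language of probability rather than as a bare label.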
Holmes, when he referred to the prophecies of what courts will do, is often read to mean that the law should be made readily predictable. 44 Though we don't doubt he preferred stable judges to erratic ones, we don't see that that was Holmes's point. Courts whose decisions are hard to predict are no less sources of legal decision. Even when the lawyer has the privilege to argue in front of a "good" judge, whom for present purposes we define as a judge whose decisions are easy to predict, the closer the legal question, the harder it is to predict the answer. It is inherent that lawyers will be more confident in some of their predictions than in others.
Judges, practically by definition of their role as legal authorities, do not proffer a view as to the chances that their judgments are correct. It is hard to see how the process of judgment would keep the confidence of society if every judgment were issued with a p-value! 45 Yet reading judgments through a realist's glasses, one may discern indicia of how likely it is that the judgment will be understood in the future to have stated the law. Judges do not shy from describing some cases as clear ones, others as close ones. They don't call it hedging, but that's very much what it's like. When a judge refers to how finely balanced such and such a question was, it has the effect of qualifying the judgment. It thus may be that one can infer from a judgment's text how much confidence one should have in the judgment as a prediction of future results. The text, even where it does not express anything in terms about the closeness of a case, still may give clues. The structure of the reasoning may be a clue: the more complex and particularistic a judge's reasoning, the more the judgment might be questioned, or at least limited in its future application. Textual clues permit an inference as to how confident one should be that the judgment accurately reflects a pattern in the experience that was the input behind it. 46

Does the law use probabilistic reasoning to arrive at a prediction? In other words, once a judgment has been made and it becomes part of the body of legal experience, do lawyers and judges reason about their level of confidence that an earlier judgment is relevant for their predictions about a current case? Ex post, every judgment is in fact, to a greater or lesser extent, questioned or rejected or ignored-or affirmed or relied upon. Nullification, reversal, striking down-by whatever term the legal system refers to the process, a rejection of a judgment by a controlling authority is a formal expression that the judge got it wrong. 47 Endorsement, too, is sometimes formal and explicit, the archetype being a decision on appeal that affirms the judgment. Formal and explicit signals, whether of rejection or of reliance, entail a significant adjustment in how much confidence we should have in a judgment as a prediction of a future case.
It is not just in appeals that we look for signals as to how confident we should be in a given judgment as a prediction of future cases. Rejection or endorsement might occur in a different case on different facts (i.e., not on appeal in the same case), and in that situation the signal is only an approximation. By ignoring, rejecting, or "distinguishing" a past judgment, or by invoking it with approval, a judge in a different case says or implies that the judge in the past judgment had the law wrong or had it right; but such indirect treatment, whether expressed or implied, says only so much about the past judgment. A jurist, considering such indirect treatment, would struggle to arrive at a numerical value to adjust how much confidence to place in the past judgment. 48 In evidence about judgments-evidence inferable from the words of the judgments themselves and evidence contained in their reception-one nevertheless discerns at least rough markers of the probability that they will be followed in the future.
There is no received view as to what Holmes thought the function of probability is in prediction. As is the case with machine learning, jurists make probabilistic as well as non-probabilistic predictions. You can state the law-i.e., give a prediction about the future exercise of public power-without giving an assessment of your confidence that your prediction is right. Jurists also use both probabilistic and non-probabilistic reasoning. Holmes, when referring to prophecies, was not however telling courts how to reason (or, for that matter, legislatures or juries; we will return to juries in Chapter 7). His concern was to state what it is that constitutes the law. True, we don't call wobbly or inarticulate judges good judges. But Holmes was explicitly not concerned with the behavior of the "good" litigant; and, in his thinking about the legal system as a whole, his concern was not limited to the behavior of the "good" judge.

Gradient descent is often coupled with another technique called cross validation, also based on prediction. The term derives from the idea of a "validation dataset." When training a machine learning system, measuring prediction accuracy on the same dataset as was used to train the machine gives an over-optimistic estimate. (This can be shown mathematically.) Therefore, the training dataset is split into two: one part for training parameter values, the other part for measuring prediction accuracy. This latter part is called the "validation dataset." Cross validation is totemic in machine learning: stats.stackexchange.com, a popular Internet Q&A site for machine learning, calls itself CrossValidated. It is also technically subtle.
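The split can be sketched as follows; the toy dataset, the 80/20 split, and the one-parameter threshold "learner" are our own illustrative assumptions:

```python
# Toy labelled data: feature x in [0, 1], outcome "big" iff x > 0.5.
data = [(i / 99, "big" if i / 99 > 0.5 else "small") for i in range(100)]

# Hold out every fifth example as the validation dataset (an 80/20 split).
validation = [ex for idx, ex in enumerate(data) if idx % 5 == 4]
train = [ex for idx, ex in enumerate(data) if idx % 5 != 4]

def classify(x, t):
    return "big" if x > t else "small"

def fit_threshold(train_set):
    """'Train' a one-parameter model: scan candidate thresholds and keep
    the one with the highest accuracy on the training data."""
    best_t, best_acc = 0.0, 0.0
    for step in range(101):
        t = step / 100
        acc = sum(classify(x, t) == y for x, y in train_set) / len(train_set)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

t = fit_threshold(train)
# Measure accuracy only on the held-out validation data, never on the
# data the parameter was fitted to.
val_acc = sum(classify(x, t) == y for x, y in validation) / len(validation)
print(t, val_acc)  # prints: 0.49 0.95
```

The held-out examples play the role of "new unseen objects": accuracy measured on them, not on the training portion, is what counts as the learner's prediction accuracy.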
See Hastie, Tibshirani & Friedman (2009).