1 Introduction

The no-free-lunch (NFL) theorems of supervised learning (Wolpert 1992a, 1996b; Schaffer 1994) are an influential collection of impossibility results in machine learning. Computer scientists have ranked these results “among the most important theorems in statistical learning” (von Luxburg and Schölkopf 2011, p. 695), while some philosophers have read them as “a radicalized version of Hume’s induction skepticism” (Schurz 2017, p. 825, p. 830).

In a nutshell, the results say (or rather, are usually interpreted as saying) that we cannot formally justify our machine learning algorithms. That is, we cannot formally ground our conviction that some learning algorithms are more sensible than others: that we have reason to think some algorithms perform better in attaining the epistemic goals that we designed them to attain. In Wolpert’s original interpretation, “all learning algorithms are equivalent” (Wolpert 1995a, p. 129; 2002, p. 35), so that, for instance, a standard learning method like cross-validation has as much justification as anti-cross-validation (Zhu and Rohwer 1996; Wolpert 1996b, p. 1359f; 2021, p. 6f).

Yet for many such standard learning algorithms we do seem to have a justification. The field of machine learning theory is concerned with deriving mathematical learning guarantees that purport to show that standard procedures, like minimizing empirical error on the training set, are better than other possible procedures, like maximizing empirical error (Shalev-Shwartz and Ben-David 2014). This raises a puzzle. How can there exist a learning theory at all, if the lesson of the NFL theorems is that learning algorithms can have no formal justification?

While this tension has been noted from the start (Wolpert 1996b, p. 1347), existing explanations of the consistency of the NFL theorems with learning theory (e.g., Wolpert 1996b, p. 1368ff, Bousquet et al. 2004, p. 202ff, von Luxburg and Schölkopf 2011, p. 692ff) are partial at best. In this paper, we investigate in detail the implications of the NFL results for the justification of machine learning algorithms. The main tool in our analysis is a distinction between two conceptions of learning algorithms, a distinction that has a parallel in the philosophical literature promoting a local view of inductive inference. This is the distinction between a conception of learning algorithms as purely data-driven or data-only, as instantiating functions that only take data, and a conception of learning algorithms as model-dependent, as instantiating functions that, aside from input data, also ask for an input model.

We argue that the NFL theorems rely on the former, data-driven conception of learning algorithms; but that many standard learning methods, including empirical risk minimization and cross-validation, should not be viewed as such. By their specification, such algorithms take two inputs: data, and an explicitly formulated model or hypothesis class, which constitutes a choice of bias. What we can reasonably demand from such model-dependent algorithms is that they perform as well as possible relative to any chosen model. Consequently, learning-theoretic guarantees are relative to the instantiated models the algorithm can take, and it is in this form that there is justification for standard learning algorithms. It is in this sense that learning theory allows one to say that empirical risk minimization is preferable to risk maximization, and that cross-validation is preferable to anti-cross-validation.

This is all consistent with the valid lesson of the NFL results, namely that every data-only learning procedure must possess some inductive bias. Our point is that this lesson should not be taken as a stick to wield against any possible learning algorithm. On the contrary: in model-dependent learning algorithms, this lesson is accounted for from the start.

The plan of the paper is as follows. First, in Sect. 2, we introduce the original Wolpert-Schaffer results. Still granting here the data-only conception of learning algorithms, we dispute the results’ interpretation that all algorithms are equivalent. We discuss how this interpretation relies on an unmotivated assumption of a uniform distribution over possible learning situations, which can in fact be seen as an explicit assumption that learning is impossible. We advance the alternative statement that there is no universal data-only learning algorithm. As instantiations of this statement, the NFL results illustrate and support the central insight in machine learning that every mechanical learning procedure, understood as a mapping from possible data to conclusions, must possess an inductive bias.

Next, in Sect. 3, we develop the model-dependent conception of learning methods, and show how this conception makes room for a justification for standard learning methods that is consistent with the NFL results. We start by pointing out that discussions surrounding the NFL theorems share a questionable presupposition with Hume’s original argument for inductive skepticism: the idea that the performance of our inductive methods must be grounded in a general postulate of the induction-friendliness of the world. We discuss philosophical work that denies the cogency of such a principle, and that advances a local view of induction. This leads us to a local view of learning algorithms: the model-dependent perspective, and the accompanying possibility of a model-relative learning-theoretic justification. We discuss this in more detail for Bayesian machine learning, empirical risk minimization, and cross-validation, making explicit why learning theory allows us to say, for instance, that cross-validation is more sensible than anti-cross-validation. We conclude in Sect. 4.

Finally, we provide two appendices that complement the main argument. In “Appendix A” we investigate the formal consistency of the original NFL results with learning theory, and in “Appendix B” we list some important nuances to our discussion about model-dependent learning algorithms.

2 All learning algorithms are equivalent?

The first mentions in print of the “no-free-lunch theorems” of supervised learning are in Wolpert (1995a; 1996b, also see 1995b),Footnote 1 although an earlier version of the results already appeared in Wolpert (1992a, 1992b). Around the same time, Schaffer (1994) presented a version of these results, with reference to Wolpert, as a “conservation law for generalization performance.”

We start this section by presenting some basic versions of the Wolpert-Schaffer results, within a problem setting of prediction (Sect. 2.1), and within the original setting of classification (Sect. 2.2). Next, we discuss Wolpert’s interpretation of his results that “all learning algorithms are equivalent, on average.” We discuss the results’ concern with all possible learning algorithms vis-à-vis the traditional philosophical concern with “inductive method,” and note its restriction to data-only algorithms (Sect. 2.3). We then critically analyse Wolpert’s equivalence claim and the underlying assumption of a uniform distribution over possible learning situations (Sect. 2.4). Finally, we advance the alternative NFL statement that there is no universal data-only learning algorithm (Sect. 2.5).

2.1 Prediction

Imagine that every day we are given a bowl of oatmeal for breakfast. Every morning on waking up, before we have our breakfast, we seek to predict whether it will be tasty (\(\mathsf {T}\)) or not (\(\mathsf {N}\)), based only on whether it was tasty on the previous days. A learning algorithm in this simple learning framework makes a guess whether the oats we are served today will be tasty, based on the data of the previous days. For a sequence of three days (see Fig. 1), there are in this scenario \(2^3\) logically possible histories or learning situations (of the form \(\mathsf {TTT, TNT, NTT}\), ...), and already \(2^7\) possible learning algorithms (functions from \(\{\emptyset , \mathsf {T,N, TT, NT, TN, NN}\}\) to \(\{\mathsf {T,N}\}\)). Let an algorithm’s error be the ratio, among all predictions, of those predictions that are incorrect (e.g., a prediction of \(\mathsf {T}\) and then obtaining \(\mathsf {N}\)). Then a no-free-lunch statement in this scenario is that for each possible level of error, every learning algorithm suffers this error in equally many possible learning situations. Namely, one can verify that every single algorithm predicts perfectly (has error 0) in exactly one possible learning situation, predicts maximally badly (error 1) in exactly one other possible situation, suffers error 1/3 in three possible learning situations, and error 2/3 in the remaining three.Footnote 2

Fig. 1

NFL for prediction. For any possible learning method (say, the method that always chooses \(\mathsf {T}\), here represented by the arrows), there is one learning situation (path through the tree) with error 0 (follow the arrows), one with error 1 (never follow the arrow), and three situations each with error 1/3 and 2/3. Assigning each learning situation the same probability 1/8, the algorithm’s expected error is 1/2

Note that in thus counting learning situations and comparing these counts, we treat all possible learning situations on a par. Another way of doing this is to assume a uniform probability distribution on all possible learning situations, that is, a distribution that assigns the same probability to each of the finitely many possible learning situations. Then the above NFL result can be restated as the observation that, under the uniform distribution on learning situations, every learning algorithm has the same expected error of exactly 1/2. That is, every learning algorithm can be expected to do no better (or worse) than random guessing.Footnote 3
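The counting claim of this section can be checked by brute force. The following Python sketch enumerates all \(2^7\) algorithms and all \(2^3\) learning situations of the three-day oatmeal scenario, and tabulates for each algorithm how many situations yield each error level. The helper names (`all_algorithms`, `error_profile`, and so on) are ours and purely illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

OUTCOMES = "TN"
DAYS = 3

# Every past an algorithm may be asked to extrapolate: "", "T", "N", "TT", ...
HISTORIES = ["".join(p) for k in range(DAYS) for p in product(OUTCOMES, repeat=k)]
# Every logically possible three-day history (learning situation).
SITUATIONS = ["".join(p) for p in product(OUTCOMES, repeat=DAYS)]


def all_algorithms():
    """Yield every function from the 7 possible pasts to a prediction in {T, N}."""
    for predictions in product(OUTCOMES, repeat=len(HISTORIES)):
        yield dict(zip(HISTORIES, predictions))


def error(algorithm, situation):
    """Fraction of the three daily predictions that turn out wrong."""
    wrong = sum(algorithm[situation[:day]] != situation[day] for day in range(DAYS))
    return Fraction(wrong, DAYS)


def error_profile(algorithm):
    """Count, per error level, the situations in which that error is incurred."""
    return Counter(error(algorithm, s) for s in SITUATIONS)


profiles = [error_profile(a) for a in all_algorithms()]
expected = Counter({Fraction(0): 1, Fraction(1, 3): 3, Fraction(2, 3): 3, Fraction(1): 1})

# Expected error of each algorithm under the uniform distribution on situations.
mean_errors = {
    sum(error(a, s) for s in SITUATIONS) / len(SITUATIONS) for a in all_algorithms()
}

print(len(profiles), all(p == expected for p in profiles), mean_errors)
```

Every one of the 128 algorithms shows the same profile (one situation with error 0, three with 1/3, three with 2/3, one with 1), and the set of uniformly averaged errors collapses to the single value 1/2.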

2.2 Classification

The original Wolpert-Schaffer results were derived in a problem setting more standard in machine learning theory, the setting of classification. We first discuss the simplified setting of non-stochastic classification (Sect. 2.2.1), before we turn to the more general setting of stochastic classification (Sect. 2.2.2).

2.2.1 Non-stochastic classification

Imagine we want to learn to successfully classify whether a bowl of oats will be tasty or not, based on three different features we can determine before trying it: its temperature, its color, and its smell. Formally, supposing that these attributes are binary (either hot or cold, either bright or dull, either reeking or not), every instance of a bowl of oats can be represented by a length-three attribute vector of binary (write 0 or 1) components. This gives a total of eight (\(2^3\)) different possible instances, collected in the domain set \(\mathcal {X}=\{0,1\}^3\). A classifier is a function \(f: \mathcal {X} \rightarrow \mathcal {Y} \) from the possible instances to their labels (tasty or not), collected in the label set \(\mathcal {Y} = \{\mathsf {T,N}\}\). Supposing that the true labels are indeed fully determined by the attributes, the possible learning situations—possible true labelings of all instances of oats—are given by the possible classifiers. A learning algorithm A maps a sample \(S=(x_1,y_1), \ldots , (x_n, y_n)\) of training data, pairs of instances and true labels, to a particular classifier f.

We are now interested in a learning algorithm’s generalization error \(L_{\overline{S}}(A(S))\): given some training sample S, how accurate is the classifier \(f=A(S)\) selected by A on the instances that lie outside of S? Suppose the training data includes six of the total number of eight different possible instances of oats, determining the true tastiness labels for these six instances (see Fig. 2). There are four possible ways of classifying the two unseen instances, or four remaining possible learning situations \(f^*\). Each possible learning algorithm selects a particular classifier in response to the training data, which classifies the two unseen instances in one of the four possible ways. That means that each possible learning algorithm (selected classifier f) has the same generalization error (ratio of incorrectly classified unseen instances over all unseen instances: either 0, 0.5, or 1) in the same number (one, two, one) of still possible learning situations \(f^*\).Footnote 4

Fig. 2

NFL for nonstochastic classification. For any learning algorithm A, any non-exhaustive training sample S (here of size six) and any possible labeling of S (say, all \(\mathsf {N}\), leading A to output classifier \(\hat{f}\)), there is the same number (here, four) of remaining possible learning situations (here, the classifiers \(f^*_1\) to \(f^*_4\)) that each label the (here, two) remaining instances differently. (Table adapted from Giraud-Carrier and Provost 2005.)

Alternatively, we can put things again in terms of a uniform distribution \(\mathcal {U}\) over all possible learning situations. So for this specific sample S of instances and labels, we have that uniformly averaged over the four remaining possible learning situations, the error of each learning algorithm is equal to 1/2. More generally, we can consider the same sample \(S_X\) stripped of its labels, and move the averaging to the front, so to speak, to cover how the possible \(f^*\) (now all possible \(f^*\)) assign labels to \(S_X\), and how the algorithm fares for the resulting \(S=S_X \times f^*(S_X)\) of instances and labels. But since for any four learning situations that label \(S_X\) in an identical way, an algorithm’s average generalization error is 1/2, it remains 1/2 when averaged this way over all learning situations; and this reasoning goes through for any non-exhaustive \(S_X \subsetneq \mathcal {X}\). Thus we arrive at the statement that for any non-exhaustive training sample \(S_X\) of instances every learning algorithm A has expected generalization error \({{\,\mathrm{\mathbf {E}}\,}}_{f^* \sim \mathcal {U}}\left[ L_{\overline{S}}(A(S)) \right] = 1/2 \).Footnote 5
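The averaging argument can likewise be verified exhaustively for the oatmeal example. The sketch below fixes one arbitrary set \(S_X\) of six instances, lets each of the 256 possible targets \(f^*\) label it, and averages the resulting off-training-set error uniformly. The three learners (`majority`, `anti_majority`, `always_tasty`) are hypothetical examples of ours, chosen only to contrast an "inductive", an "anti-inductive" and a constant rule.

```python
from fractions import Fraction
from itertools import product

X = list(product((0, 1), repeat=3))   # the eight possible instances
S_X = X[:6]                           # an arbitrary choice of six training instances
UNSEEN = X[6:]                        # the two instances outside the sample
# All 2^8 possible learning situations f*, as dictionaries from instance to label.
TARGETS = [dict(zip(X, labels)) for labels in product("TN", repeat=len(X))]


def majority(sample):
    """Predict everywhere the label seen most often in the sample (ties go to T)."""
    labels = [y for _, y in sample]
    guess = "T" if labels.count("T") >= labels.count("N") else "N"
    return lambda x: guess


def anti_majority(sample):
    """Predict everywhere the opposite of what `majority` would predict."""
    inductive = majority(sample)
    return lambda x: "N" if inductive(x) == "T" else "T"


def always_tasty(sample):
    """Ignore the data and always predict T."""
    return lambda x: "T"


def ots_error(classifier, target):
    """Fraction of unseen instances that the classifier labels incorrectly."""
    wrong = sum(classifier(x) != target[x] for x in UNSEEN)
    return Fraction(wrong, len(UNSEEN))


def expected_ots_error(algorithm):
    """Uniform average of the off-training-set error over all 256 targets."""
    total = sum(
        ots_error(algorithm([(x, target[x]) for x in S_X]), target)
        for target in TARGETS
    )
    return total / len(TARGETS)


results = {a.__name__: expected_ots_error(a) for a in (majority, anti_majority, always_tasty)}
print(results)
```

All three learners come out at exactly 1/2, as the argument in the text predicts for any data-only algorithm under the uniform distribution \(\mathcal {U}\).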

2.2.2 Stochastic classification

An additional refinement in the standard framework for classification (see Shalev-Shwartz and Ben-David 2014) is that the true connection between instances and labels can itself be stochastic. Moreover, we assume some unknown probability distribution for the drawing of instances. Thus a learning situation is given by a distribution \(\mathcal {D}\) over pairs of instances and labels.Footnote 6

We now also measure generalization error in expectation over drawing an instance from \(\mathcal {D}\): we shall call this the risk. But we have a choice here: do we take the expectation over all of \(\mathcal {X}\), so including instances that were already in the training set, or do we discard the latter? Wolpert’s “off-training-set” (ots) risk, write \(L_{\mathcal {D} {{\setminus }} S}(A(S))\), explicitly discounts already seen instances. He actually departs here from most of learning theory, where the error is standardly evaluated over all instances. We shall follow Wolpert in calling the latter quantity “i.i.d.” (iid) risk, write \(L_\mathcal {D}(A(S))\). Formally, for given sample \(S= (x_1,y_1), \ldots , (x_n,y_n)\), \(L_\mathcal {D}(A(S))\) is the probability, under \(\mathcal {D}\), that an independently sampled example \((X,Y)\) has \(f(X) \ne Y\), where \(f= A(S)\) is the classifier output by algorithm A on input S. This can also be written as

$$\begin{aligned} L_\mathcal {D}(A(S)) = \mathbf{E}_{(X,Y) \sim \mathcal {D}} [|Y-A(S)(X)|], \end{aligned}$$

that is, the expected 0/1-error. In contrast, \(L_{\mathcal {D} {{\setminus }} S}(A(S))\) is the probability that \(f(X) \ne Y\), with \(f= A(S)\), conditional on \((X,Y) \not \in \{(x_1,y_1), \ldots , (x_n,y_n) \}\). This can also be written as

$$\begin{aligned} L_{\mathcal {D} {{\setminus }} S}(A(S)) = \mathbf{E}_{(X,Y) \sim \mathcal {D}}\left[ |Y-A(S)(X)| \; \mid \; (X,Y) \not \in \{(x_1,y_1), \ldots , (x_n,y_n) \}\right] . \end{aligned}$$

A central claim in Wolpert’s works is that ots risk is a more natural measure of generalization performance than iid risk (1996b, p. 1345ff; 2002, p. 25ff). Note that it is certainly more similar to the generalization error in the previous nonstochastic case (where the labels of already seen instances are conclusively learned). But this does not make it clearly better in the stochastic case, where there is still an estimation problem even for already seen instances. We discuss the relation between the two notions (and the relevance of their difference in the context of the consistency of the NFL results with positive results in learning theory) in more detail in “Appendix A.1”, and in the following always make clear what risk we mean.
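To make the difference between the two notions concrete, the following Python sketch computes both risks for a finite distribution \(\mathcal {D}\) given as a table of probabilities. The toy distribution, the sample, and the `memorizer` learner are assumptions of ours, not taken from Wolpert or from the learning-theory literature; ots risk is computed by conditioning on the example lying outside the sample, following the definition above.

```python
from fractions import Fraction


def iid_risk(classifier, dist):
    """Probability under dist that a fresh example (x, y) is misclassified."""
    return sum(p for (x, y), p in dist.items() if classifier(x) != y)


def ots_risk(classifier, dist, sample):
    """Misclassification probability, conditional on the example not being in the sample."""
    seen = set(sample)
    outside = {xy: p for xy, p in dist.items() if xy not in seen}
    mass = sum(outside.values())
    if mass == 0:
        raise ValueError("the sample covers the whole support; ots risk is undefined")
    return iid_risk(classifier, outside) / mass


def memorizer(sample, default="N"):
    """Hypothetical learner: recall seen labels, answer `default` elsewhere."""
    table = dict(sample)
    return lambda x: table.get(x, default)


# Toy learning situation (our assumption): three instances with deterministic labels.
D = {(0, "T"): Fraction(1, 2), (1, "N"): Fraction(1, 4), (2, "T"): Fraction(1, 4)}
S = [(0, "T"), (1, "N")]
f = memorizer(S)

print(iid_risk(f, D))     # 1/4
print(ots_risk(f, D, S))  # 1
```

In this example the classifier is perfect on what it has seen and wrong on the one unseen instance, so its iid risk is a modest 1/4 while its ots risk is maximal. The gap between the two numbers is entirely due to the probability mass of the already seen examples.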

Abstracting away from the oatmeal classification example, suppose instances are given by some finite-length set of features that can take a finite number of values, so that there is a (possibly huge yet) finite number m of possible instances. Given some training set S of n labeled instances, consider again any single unseen instance x. For each learning algorithm (selected classifier, assigning label y to x), there is a possible learning situation \(\mathcal {D}\) in which the classifier’s risk on this particular x is 0 (namely, a \(\mathcal {D}\) that assigns probability 1 to label y, conditional on instance x). Likewise, there is a possible learning situation \(\mathcal {D}\) in which the classifier’s risk on this particular x is 1. Indeed, for each value in the unit interval there is a possible learning situation in which the classifier has that risk on x, as well as a counterpoint situation where the classifier has one minus that risk on x. The intuition that these risks all even out finds again a precise expression under the assumption of an (in this case, continuous) uniform distribution \(\mathcal {U}\) over all learning situations: in this case, a uniform distributionFootnote 7 over distributions. Thus for any given set of training data, for any learning algorithm, the selected classifier’s \(\mathcal {U}\)-expected risk on any single unseen instance is 0.5. This concerns a specific unseen instance, given some specific set of training data. But, crucially, we can again move the expectations to the front, to range over the whole process of drawing training data and measuring risk.Footnote 8 In this way we reach the statement of the NFL theorem, or the conservation law of generalization performance: every learning algorithm A, for any sample size \(n<m\), has the same expected ots risk \({{\,\mathrm{\mathbf {E}}\,}}_{\mathcal {D}\sim \mathcal {U}, S \sim \mathcal {D}^n} \left[ L_{\mathcal {D} {{\setminus }} S}(A(S))\right] =1/2\).Footnote 9

2.3 All learning algorithms ...

We presented some versions of the Wolpert-Schaffer results, leading up to what is essentially the original form. But already the first example in the framework of prediction brings out an important characteristic of the NFL theorems: their concern, for the given learning problem, with all possible learning algorithms, understood as mappings from data to conclusions.Footnote 10

There is, to begin with, no special regard for particular subclasses of learning algorithms, say those that we would intuitively call “inductive” (or indeed learning algorithms). In the prediction setting, for example, an “inductive” function that extrapolates the past data \(\mathsf {NN}\) to the prediction \(\mathsf {N}\) is no less a learning algorithm than an “anti-inductive” function that extrapolates data \(\mathsf {NN}\) to prediction \(\mathsf {T}\), or indeed than the “learning-resistant” constant function that outputs \(\mathsf {T}\) no matter what. As such, the NFL theorems can be seen to simply bypass the main companion problem to that of justifying induction: the problem of specifying or describing what actually constitutes inductive method or methods (see Lipton 2004, p. 7ff).Footnote 11

That said, when it comes to the assessment of the results’ implications, it seems there is only a small subset of all logically possible algorithms that we are really interested in. These are the algorithms that are actually used. There is a limited number of standard algorithms developed and analyzed in machine learning, generic algorithms that are employed in a wide variety of different domains. Naturally enough, the motivating discussions in Wolpert’s writings focus on the ramifications of his results for the justification for these algorithms. We will discuss the justificatory implications of the NFL in detail in Sect. 3 below.

While the “all possible” in the NFL results’ characteristic concern with all possible learning algorithms can be seen as a useful generality in the results’ scope, there is also an important sense in which this scope is limited. This has to do with the restriction to “learning algorithms,” understood as well-defined mappings from data to conclusions. The NFL results apply to formal learning rules that fully specify what conclusion follows which observed data. They clearly do not apply to a non-algorithmic conception of inductive method(s) that involves irreducibly informal factors (like, perhaps, everyday human and even scientific reasoning). But they do not even apply to a conception of learning methods as taking for input other (context-dependent) elements: the NFL results apply to a conception of learning algorithms as purely data-driven or data-only. We will also return to and expand on this point in Sect. 3 below.

2.4 ... are equivalent?

The interpretation that Wolpert attached to his formal results, and that we went along with in our presentation, is that “for any two learning algorithms A and B ... there are just as many situations (appropriately weighted) in which algorithm A is superior to algorithm B as vice versa” (Wolpert 1996b, p. 1360), or that “all algorithms are equivalent, on average” (Wolpert 1995a, p. 129; 2002, p. 35). The obvious worry about the significance of the NFL theorems concerns the qualifiers “appropriately weighted” and “on average” in these statements: that is, the presupposition of a uniform distribution on learning situations. This is indeed what the immediate responses in the literature focused on.

Perhaps the main criticism is that a uniform distribution is really a worst-case assumption for the purpose of learning. The “rational reconstruction” by Rao et al. (1995) shows that Schaffer’s conservation law of generalization performance is equivalent to the (trivial) statement that for any unseen example, both possible classifications result in a generalization error of 0.5, if we measure the latter by uniformly averaging over both possible true classes. On a more conceptual level, this procedure of uniformly averaging corresponds to assuming that however many examples we have seen, we cannot have learned anything: the best guess for the label of any new example will always still be fifty-fifty. Thus these authors conclude that “the uniform concept distribution ... in which every possible classification of unseen cases is equally likely ... is the definition of a uniformly random universe, in which learning is impossible” (ibid., 475).Footnote 12 Obviously the NFL theorems cannot be said to hold much significance if we understand them as the observation that every learning algorithm is equivalent in a universe where learning is impossible.Footnote 13

It has been suggested that this particular criticism can be countered by the observation that a uniform distribution is not a necessary condition for NFL theorems to go through (e.g., Giraud-Carrier and Provost 2005, p. 10). Rao et al. (1995, p. 475ff) show that generalization performance is conserved under a wider class of distributions; and indeed Wolpert (1996b, p. 1361f) also already gives “extensions for nonuniform averaging.” But as long as the results do not extend to all distributions (and they do not: there is a certain symmetry that must be retained, Rao et al. 1995, p. 477), the worry remains that the NFL results are simply an expression of the induction-hostileness of the presupposed weighting distribution.

Wolpert was aware of this perspective on his results.Footnote 14 In (1992a), he himself refers to the assumption of a “maximum-entropy universe”; the way he puts his point there is that “[s]ince such a universe cannot be ruled out on an a priori basis, it is theoretically impossible to come to any conclusions about how to generalize using only a priori reasoning.” But the statement that it is a priori possible that there are (in expectation) no distinctions between learning algorithms is weaker than the categorical statement that there are (in expectation) no a priori distinctions between learning algorithms, the claim of the later paper (1996b).

In this paper (ibid., 1362ff), Wolpert actually argues that the uniform distribution does have a preferred status. He starts by allowing that if we change the weighting of learning situations, then there could arise “a priori distinctions” between learning algorithms. However, he continues, “a priori” such a change of weighting could just as well favor algorithm A as B: “[a]ccordingly, claims that ‘in the real world [the distribution over learning situations] is not uniform, so the NFL results do not apply to my favorite learning algorithm’ are misguided at best” (ibid., 1363). Indeed, he points at results in the same paper regarding averages over prior distributions over learning situations, with the interpretation that there are as many priors for which A is superior to B as the other way around. From this perspective, “uniform distributions over targets are not an atypical, pathological case ... [r]ather they and their associated results are the average case (!)” (ibid.).

This jump to a higher level is clearly inconclusive: we can restate the same worry at that level.Footnote 15 Most remarkable, however, is Wolpert’s dialectical move of turning the table on the critic: “the burden is on the user of a particular learning algorithm. Unless they can somehow show that [the true prior] is one of the ones for which their algorithm does better than random ... they cannot claim to have any formal justification for their learning algorithm” (ibid.).

Curiously, responses in the computer science literature critical of the significance of Wolpert’s results have essentially followed him here. Rao et al., after discussing how NFL theorems must depend on a symmetrical prior, conjecture that “our world has strong regularities, rather than being nearly random. However, only time and further testing of physical theories can refine our understanding of the nature of our universe [and] might lead to a reasonable estimate of [the true prior] in our world” (1995, p. 477). Giraud-Carrier and Provost emphatically set forth as an implicit yet generally accepted “weak assumption of machine learning” that “the process that presents us with learning problems ... induces a non-uniform probability distribution [over learning situations]” (2005, p. 11).Footnote 16 But this Wolpert would not disagree with: he writes himself that a nonuniform distribution “is why some algorithms tend to perform better than others in the real world” (Wolpert 1996b, p. 1361, emphasis ours).Footnote 17 The point is to give a “formal justification” for believing in any such distribution. Indeed, if we seek to criticize the assumption of a uniform distribution in Wolpert’s claim that all algorithms are a priori equivalent by postulating a different, nonuniform, distribution, then we had better provide a justification for postulating that distribution. The result is that we find ourselves in a corner, because it is not clear where to look for such a justification. What we should have done, of course, is to insist that Wolpert justify his assumption.

In fact, a more fundamental reply is to demand a reason for postulating any prior distribution over learning situations. Doing so is a formal requirement in Wolpert’s “extended Bayesian formalism” (unlike in the conventional classification framework); but that merely shows that the framework is constraining in a way that we may find unpalatable.Footnote 18,Footnote 19 Indeed, it is not at all clear what it is supposed to mean to assign probabilities to possible learning situations. An epistemic interpretation, as some (ideally rational) agent’s degrees of belief, is perhaps the easiest to make sense of, but immediately throws us back to the justification for any specific choice of prior distribution: in particular, the idea of a uniform distribution as an objective-logical “indifference prior” has long been abandoned by philosophers and statisticians alike as a viable option (see, e.g., van Fraassen 1989, p. 293ff; Zabell 2016). This is, in any case, not what Wolpert appears to have in mind: the suggestion is rather that we should think of these probabilities as objective-physical, as chances.Footnote 20 But in the absence of a fuller account of the nature of these chances we do not see much reason for going along with the idea that the universe is governed by some objective distribution generating learning situations, let alone that this distribution should be uniform.

In sum, it would be granting Wolpert too much to accept that it is on us to show, contra his equivalence claim, that some algorithms are generally better than others. (We do not even need to think that “generally” is a qualifier that can be made meaningful here.) The burden is rather on Wolpert to justify the presuppositions that back his claim, in particular the assumption of a uniform distribution on learning situations, and this he has not done.

2.5 There is no universal data-only learning algorithm

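We can, however, formulate a weaker variation of the NFL results, a statement that is implied by the original but that does away with the uniformity assumption. In stating it, we also make explicit the observation from Sect. 2.3 that we are still talking of data-only algorithms, functions from data to conclusions:

(*) For every data-only learning algorithm, there is a learning situation in which it does not perform well, while some other data-only learning algorithm does.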

In other words, there is no single data-only learning algorithm that performs well whenever some data-only algorithm performs well: there is no universal data-only learning algorithm.

Note right away that the truth of any instantiation of this statement depends on the learning problem in question, including the possible methods and the adopted notion of good performance. It is not too hard to come up with (artificial) learning problems for which the statement is false (e.g., a problem that is formulated such that the possible learning situations explicitly accommodate a particular learning method).Footnote 21 The statement is relevant insofar as it holds for problems within most standard learning frameworks and natural measures of good performance.

For instance, we retrieve this statement from the original Wolpert-Schaffer result if we drop the uniformity assumption and make “good performance” precise as (say) “having expected risk strictly smaller than 1/2.” Namely, for every learning algorithm \(A_1\), for any sample size n, there exists a learning situation \(\mathcal {D}\) such that the algorithm has expected ots risk \({{\,\mathrm{\mathbf {E}}\,}}_{S \sim \mathcal {D}^n}\left[ L_{\mathcal {D}{{\setminus }} S}(A_1(S)) \right] \) at least 1/2, while another algorithm \(A_2\) has expected ots risk below 1/2 (indeed zero, for choice of \(\mathcal {D}\) that labels instances deterministically via some \(f^*\), and \(A_2\) that always outputs this \(f^*\)).
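The existence claim can be illustrated constructively in the simplified non-stochastic setting of Sect. 2.2.1. The sketch below is a simplification of ours: it fixes a single set of six training instances rather than taking an expectation over samples \(S \sim \mathcal {D}^n\), and the names `adversarial_target`, `majority` and `oracle` are hypothetical. Given any data-only algorithm \(A_1\) and any labeling of the training instances, it builds a target \(f^*\) that contradicts \(A_1\) on every unseen instance, and compares \(A_1\) with an algorithm \(A_2\) that always outputs this \(f^*\).

```python
from itertools import product

X = list(product((0, 1), repeat=3))   # the eight possible instances
S_X, UNSEEN = X[:6], X[6:]            # fixed training instances and unseen remainder


def flip(label):
    """Return the other label."""
    return "N" if label == "T" else "T"


def adversarial_target(algorithm, train_labels):
    """Build a target f* that agrees with the given training labels and
    contradicts the algorithm's selected classifier on every unseen instance."""
    sample = list(zip(S_X, train_labels))
    classifier = algorithm(sample)
    target = dict(sample)
    target.update({x: flip(classifier(x)) for x in UNSEEN})
    return target


def ots_error(classifier, target):
    """Fraction of unseen instances that the classifier labels incorrectly."""
    return sum(classifier(x) != target[x] for x in UNSEEN) / len(UNSEEN)


def majority(sample):
    """A_1: predict everywhere the label seen most often (ties go to T)."""
    labels = [y for _, y in sample]
    guess = "T" if labels.count("T") >= labels.count("N") else "N"
    return lambda x: guess


f_star = adversarial_target(majority, "TTTNTN")
sample = [(x, f_star[x]) for x in S_X]


def oracle(sample):
    """A_2: a data-only algorithm that ignores its input and always outputs f*."""
    return lambda x: f_star[x]


a1_error = ots_error(majority(sample), f_star)
a2_error = ots_error(oracle(sample), f_star)
print(a1_error, a2_error)
```

Here \(A_1\) errs on both unseen instances while \(A_2\) errs on none. The same construction works for any other choice of \(A_1\), which is the sense in which no data-only algorithm is universal; of course \(A_2\) in turn fails badly on other targets.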

A variant for iid risk is the NFL theorem in the standard textbook by Shalev-Shwartz and Ben-David (2014, p. 61ff). Here the notion of good performance is that, in \(\mathcal {D}\)-expectation over samples of size n (with n no more than half the total number of possible instances), the algorithm’s iid risk is smaller than 1/4. Correspondingly, their NFL theorem states that for every learning algorithm \(A_1\) there is a learning situation \(\mathcal {D}\) such that its expected iid risk \({{\,\mathrm{\mathbf {E}}\,}}_{S \sim \mathcal {D}^n}\left[ L_{\mathcal {D}}(A_1(S)) \right] \) is higher than 1/4, while that of another algorithm \(A_2\) is lower than 1/4 (indeed again zero). The authors write that the “theorem states that for every learner, there exists a task on which it fails, even though that task can be successfully learned by another learner” (ibid., 61).

Another example is given by the NFL theorems collected by Belot (2021) for problems of prediction. (He calls these results “of the absolute variety,” as opposed to “measure-relative,” which would include the original Wolpert-Schaffer results.) The learning situations in these problems are (probability measures over) infinite sequences of binary outcomes, and he considers different types of effectively computable learning functions (namely, “extrapolators” that are, as in our example in Sect. 2.1, functions from past outcome sequences to next outcomes, and “forecasters” that output probabilities of next outcomes) and for both of these types various notions of good performance. In each case he derives two types of results, that are both instantiations of the general NFL statement that there is no universal algorithm: that for each learning algorithm \(A_1\) there is a second algorithm \(A_2\) that performs well in those situations in which \(A_1\) does, and in other situations still (“better-but-no-best”),Footnote 22 and that for each \(A_1\) there is an \(A_2\) such that the situations in which they perform well are disjoint (“evil-twin”).

These examples also illustrate that statement (*) retains much of the spirit of the original Wolpert-Schaffer statement. In particular, it is a clear expression of the central insight in machine learning (Mitchell 1980, 1997; Dietterich 1989; Russell 1991; Shalev-Shwartz and Ben-David 2014) that no purely data-driven learning algorithm—no formal inductive function from data to conclusions—can be successful in all circumstances. That is, every such data-only algorithm must possess some inductive bias that determines in which restricted class of situations it performs well, and hence in which situations it does not. What statement (*) still adds to this is that such a learning algorithm’s inevitable inductive bias excludes it from learning successfully in some situations that are not unlearnable: situations in which some other algorithm would perform well. But it does not go as far as the original Wolpert-Schaffer statement that all (data-driven) algorithms are equivalent in their performance, depending as this does on the additional and unmotivated assumption of a uniform prior distribution.

3 Generic algorithms and local models

In this section, we investigate the significance of the NFL-statement (*) for the justification for machine learning algorithms. The route we take is to first relate the NFL results to Hume’s skeptical argument about induction (Sect. 3.1). We note that both Hume’s original argument and discussions of the original Wolpert-Schaffer results presuppose that justifying inductive methods requires justifying a general postulate of the induction-friendliness of the world. Subsequently, we discuss philosophical work that denies this presupposition, and that promotes a local perspective on induction (Sect. 3.2). We argue that a local conception of induction, applied to machine learning, points at a more natural conception of learning algorithms: rather than one-place functions on data only, many standard learning algorithms are better conceived of as two-place functions that for their operation also require some model (Sect. 3.3). Learning-theoretic guarantees do justify the use of such algorithms, in a local, model-relative manner (Sect. 3.4).

3.1 The road to skepticism

The NFL theorems, both the original Wolpert-Schaffer results and instantiations of the statement (*) of Sect. 2.5, are mathematical theorems. They say something about the impossibility of mathematically proving that some learning algorithms, conceived of as purely data-driven, perform better than others. As such, they can be seen as versions of the first, deductive, horn of the fork that constitutes Hume’s original argument against a justification for induction. This first horn concerns the impossibility of inferring good performance of inductive inference using only deductive, a priori reasoning: since it implies no logical contradiction that induction does not perform well, we can never deductively derive, from a priori premises only, that it does.Footnote 23 Similarly, the NFL results show for any learning algorithm that it implies no contradiction that this algorithm does not perform well (does not perform at least as well as other algorithms), by showing that there are a priori possible situations in which it does not.Footnote 24

This does not yet constitute a skeptical argument that we can offer no rational grounds for thinking that one algorithm performs better than another. Likewise, the first horn of Hume’s fork did not yet establish a skeptical conclusion about the grounds for inductive inference. Arguably, the novelty and force of Hume’s argument lay in the second horn of his fork: the assertion that neither can we offer, on pain of circularity, good nondeductive or empirical grounds for thinking that inductive inference must perform well. Only the two horns taken together lead to the skeptical conclusion that we can offer no rational, epistemic ground for using inductive inference: that we cannot justify induction.

Perhaps the Wolpert-Schaffer results were not intended to support a skeptical conclusion, and we should read conclusions of the sort that “methods for induction to unseen cases cannot be justified rigorously” (Schaffer 1994, p. 264) or that “one can not formally justify [standard learning algorithms]” (Wolpert 2002, p. 38) as merely indicating the limits of mathematically founding the performance of learning algorithms. However, something more than that is suggested in the original discussion surrounding these results, by the nods to Hume (Wolpert 1996b, p. 1341; Schaffer 1994, p. 264), but also by the outlines of a move very reminiscent of Hume’s. This is the idea, discussed before in Sect. 2.4, that the only way remaining to found the good performance of our learning algorithms is to postulate that “the world” (or “nature,” or the “universe”) has a certain structure that guarantees this. Hume’s original argument in fact starts with the premise that inductive reasoning proceeds upon the principle that “nature is uniform.” It is this principle that is subjected to the two horns; in particular, that we cannot justify it inductively or empirically. Namely, any attempt to derive the uniformity of nature from past such observed uniformity would require the very principle at stake and thus be viciously circular.

Hume’s argument and most of its later reconstructions simply concerned “inductive inference” or “inductive method,” exemplified by something like enumerative induction but beyond that largely left unspecified (prompting a distinct problem of description, recall Sect. 2.3). The NFL theorems concern all possible purely data-driven learning algorithms. Still, the skeptical threat of the NFL results lies in their application to “our standard algorithms,” the generic learning algorithms that we actually use (recall again Sect. 2.3). So both Hume’s argument and discussion surrounding the NFL results envisage some restricted collection of generic inductive methods. And in both cases we see that the performance of these inductive methods is paired to a particular structure the world may or may not have. If the world has the matching structure, then our inductive methods perform well; if not, they do not.Footnote 25 Consequently, the dialectics turns on the justification for such an assumption on the world: in Hume’s argument from the start, in the case of the Wolpert-Schaffer results in the ensuing discussion. The NFL statement (*) is similarly susceptible to this move: if we do want to uphold the existence of well-performing generic (universal) learning algorithms, then it seems we must postulate that the world has a structure that facilitates such algorithms’ performance. But in all cases, it appears impossible to justify, without question-begging, such an assumption on the world, whence we are driven towards a skeptical conclusion.

3.2 Localizing induction

An idea that has been advanced in the philosophical literature is that we may avoid being driven there by denying that inductive inference relies on universal uniformity principles (Okasha 2005b). This idea builds on arguments that it is hopeless to try and give a precise account of a principle of the “uniformity of nature” (Salmon 1953; Sober 1988, p. 55ff).Footnote 26

Sober (1988, p. 58ff; also see Okasha 2005b, p. 245ff) argues that in presupposing that induction relies on a single principle of uniformity, Hume actually commits a quantifier-shift fallacy. It is not the case, as Hume has it, that there is a certain assumption (the uniformity of nature) that every inductive inference requires; it is rather the case that every inductive inference requires a certain assumption. That is, rather than all relying on a single universal uniformity principle, every induction relies on a specific and local empirical assumption.

Arguments against universal uniformity principles usually run together with arguments that it is hopeless to try and give a precise account of “inductive method” (Putnam 1981, 1987; Rosenkrantz 1982; van Fraassen 1989, 2000; Norton 2003, 2010). Okasha (2001) indeed develops an argument analogous to Sober’s where he diagnoses the fault in Hume’s reasoning to be the presupposition that inductive inference is given by universally applicable rules. He, like Norton (Norton 2003, p. 666; 2014), argues that the denial of this presupposition actually blocks the skeptical argument.

These ideas offer a local perspective on inductive inference.Footnote 27 In order to assess the value of this perspective for machine learning algorithms and their justification, we make two observations.

First, even if we grant that Hume’s original argument no longer goes through when we deny the existence of universal uniformity principles or inductive rules, it does not follow that we are safe from a skeptical argument. As Sober (1988, p. 66ff) himself emphasises, there are still always assumptions involved in an inductive inference, that themselves stand in need of justification. Even if we are safe from Hume’s argument that any nondeductive justification of induction must be circular, it appears we will now be facing an endless regress, where each empirical assumption can only be justified by another induction with its own empirical assumptions.

Yet Okasha (2005b) is more optimistic: “The form which the inductive sceptic’s argument takes on the \(\forall \exists \) picture—pushing the demand for justification further and further back—seems somehow less problematic than in the \(\exists \forall \) case,” where “the whole practice of reasoning inductively seems to be premised on an enormous, untestable assumption about the way the world is” (ibid., pp. 252, 251). We do not think that this settles the matter, but it does clearly bring out a crucial advantage of a local perspective on induction. Namely, this perspective is much closer to what the problem of justification looks like in actual enquiry.Footnote 28 Plausibly, in an actual enquiry, each inference takes place within a constellation of context-specific or local empirical assumptions.Footnote 29 The motivation for such an inference will focus on one or more of these assumptions, and not on a universal uniformity principle. Furthermore, the question of justification does not only target these assumptions: even given these assumptions, there can still be room for different inferences, in which case there is still the question of the justification for the inference of choice, or the method used for the inference. We will argue below that both aspects are important to the question of the justification for machine learning algorithms.

Second, it might seem that a local conception of induction, inasmuch as it is coupled to the position that inductive inference cannot be encoded into general rules, actually does not sit very well with the enterprise of machine learning. After all, and arguably in contrast to day-to-day human or even scientific reasoning, machine learning is characterized by the design and use of learning algorithms: fully mechanical, generic procedures for inductive inference.

The rejection of general inductive rules in a local perspective must be qualified, though. For instance, Okasha (2001; also see 2005a), in the course of arguing against the idea of general rules for inductive inference, does endorse Bayesian conditionalization as the rational procedure for learning from experience.Footnote 30 There appears to be a tension there (cf. Henderson 2020): is updating by conditionalization not a rule? Okasha, however, makes a distinction: “a rule of inductive inference is supposed to tell you what beliefs you should have, given your data, and the rule of conditionalization does not do that ... the state of opinion you end up in depends on the state you were in previously; whereas if you apply an inductive rule to your data, the state of opinion you end up in depends on the instructions contained in the rule” (2001, p. 316). The output of Bayesian conditionalization does not depend on the input data only: it also depends on “the state you were in previously,” ultimately, a prior probability assignment. The rejection of general rules for inductive inference here thus concerns purely data-driven rules.

This idea is, of course, very much supported by the statement of the NFL theorems we advocated: there is no universal purely data-driven learning algorithm.Footnote 31 Moreover, this is perfectly consistent with allowing for general rules for induction that also require other inputs, plausibly inputs that encode local assumptions, like (in the case of the Bayesian method) a prior probability distribution. In sum, the lesson we take from a local conception of induction is not to reject rules for induction: the lesson is to fine-grain the notion of inductive rule, to conceive of it as a procedure that can also take for input local assumptions. Applying this perspective to machine learning algorithms, we will also be able to qualify the sweeping skeptical conclusion that the NFL theorems seemed to lead us to.

3.3 Model-dependent learning algorithms

We think it highly implausible that the use of machine learning algorithms relies, explicitly or implicitly, on a general “assumption of machine learning” about the learning-friendliness of the world, let alone a belief in some all-governing non-uniform prior distribution on possible learning situations. The assumptions that accompany the use of a learning algorithm in any particular context are normally themselves of a context-dependent, local nature. But how to square the role of local assumptions with the use of generic mechanical learning procedures?

The observations of the previous section point us at an answer. Many standard learning algorithms are not purely data-driven, but must also take for input a model. Such model-dependent algorithms instantiate, not a one-place function that maps data to conclusions, but a two-place function that maps data and a model to conclusions. Crucially, such algorithms can be given a model-relative justification.

In the following, we illustrate model-dependent learning algorithms using three standard machine learning examples: Bayesian machine learning (Sect. 3.3.1), empirical risk minimization (Sect. 3.3.2) and cross-validation (Sect. 3.3.3). These methods all have in common (as do most if not all standard model-dependent learning algorithms that we know of) that they select a hypothesis or combination of hypotheses with good predictive performance, measured in terms of the loss function of interest (empirical risk minimization, cross-validation) or a related measure such as the likelihood (Bayes). We discuss how these methods receive a model-relative justification in the form of learning-theoretic guarantees, and thereby bring out why such claims as “the NFL theorems indicate that cross-validation has no more inherent justification than anti-cross-validation” are misleading.Footnote 32 We conclude our examples with a discussion of the consistency of the NFL results with learning guarantees (Sect. 3.3.4).

Finally, we have delegated to “Appendix B” some nuances that distract from the argument’s main thrust.

3.3.1 Bayesian learning

The Bayesian scheme, central to many philosophical accounts of rational learning, also constitutes an important approach in machine learning (Duda et al. 2001; Bishop 2006). What characterizes Bayesian learning is that an algorithm must be provided with a prior distribution over some domain of probability distributions, and this choice of prior constitutes a choice of model. The role of the prior as a variable input factor lends such an algorithm a considerable genericity: the algorithm itself does not come with a particular model, but is flexibly supplied with a specific model in each specific application. This is also what provides room for a model-relative learning-theoretic justification: a demonstration that relative to the choice of prior distribution, a Bayesian algorithm performs well.

We now discuss this in some detail for Bayesian machine learning in the framework of classification, the realm of the original Wolpert-Schaffer results. Here, the prior \(\Pi \) is usually taken over a set of conditional probability distributions of the form \(P(Y\mid X)\) with \(Y \in \mathcal {Y}\), \(X \in \mathcal {X}\) the possible labels and instances, respectively. (Recall the oatmeal example of Sect. 2.2, where \(\mathcal {Y} = \{ \mathsf {T,N}\}\) and \(\mathcal {X} = \{0,1\}^3\).) The distributions are extended to n outcomes by assuming that the data pairs \((X_1,Y_1), (X_2,Y_2), \ldots , (X_n,Y_n)\) in the training set are sampled independently. The set of probability distributions in the prior support (that is, those with prior density or mass greater than 0) demarcates a model \(\mathcal {M}\), a set of (conditional) probability distributions. A prototypical example is the logistic regression model (Hastie et al. 2009, p. 119ff), in which the \(X_i = (X_{i,(1)}, \ldots , X_{i,(k)})\) are vector-valued as in our example, and the \(P(Y \mid X)\) are given by linear functions \(\sum _{j=1}^k \beta _{(j)} X_{(j)}\), rescaled by a fixed nonlinear function so as to become probabilities that sum to one.
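To make the rescaling concrete: in the binary case the usual choice is the logistic function. As a sketch, taking \(\mathsf {T}\) as the label whose probability is modeled directly (this choice is our assumption) and omitting an intercept term as in the text, each coefficient vector \(\beta \) determines the conditional distribution

$$\begin{aligned} P_\beta (Y = \mathsf {T} \mid X) = \frac{1}{1 + \exp \left( - \sum _{j=1}^k \beta _{(j)} X_{(j)} \right) }, \qquad P_\beta (Y = \mathsf {N} \mid X) = 1 - P_\beta (Y = \mathsf {T} \mid X). \end{aligned}$$

A prior on the coefficients \(\beta _{(j)}\) then induces a prior on the distributions \(P_\beta \), and the model \(\mathcal {M}\) is the set of all \(P_\beta \) in its support.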

There exist several variations of the Bayesian stance, which differ in how the prior is interpreted. For the purpose of our discussion, most relevant is the distinction between a subjective and a pragmatic stance. Under the former, the prior quite literally encodes one’s beliefs (which can be elicited by, for example, testing willingness to bet on certain outcomes). That is, the relevant inductive assumption can be equated with one’s beliefs. Alternatively, under a pragmatic interpretation, to which most practitioners subscribe, one still assumes the model (set of all distributions in the support of the prior) to be correct, but one can choose the prior \(\Pi \) for other, more pragmatic reasons. These could be considerations of (computational) convenience, of optimizing worst-case behaviour (this leads to “noninformative” or “flat” priors), or a mix of prior knowledge with worst-case and computational considerations. For example, a standard pragmatic approach for the logistic regression model is to take a Gaussian prior centered at 0 on the \(\beta _{(j)}\)’s.

Regardless of the prior’s origin, it serves as an input to the Bayesian algorithm. Together with the data, i.e., the training sample, one uses Bayes’s rule to update the prior to a posterior. The posterior over the distributions is then used to output a classifier \(\hat{f}_{\text {Bayes}}\), defined as the function from \(\mathcal {X}\) to \(\mathcal {Y}\) that has the largest probability of being correct according to the Bayesian posterior predictive distribution (Bishop 2006). In contrast to the notion of learning algorithm in Sect. 2, where an algorithm only takes data, the Bayesian algorithm requires additional input: the user’s inductive assumptions, codified explicitly as prior and induced model. One cannot avoid stating these explicitly: without specifying a prior and hence a model, the outcome of the Bayesian algorithm is simply undefined.
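The two-place character of the Bayesian algorithm can be sketched in a few lines. The code below is an illustration under our own simplifying assumptions, not a reference implementation: the model is finite, there are two instances (0 and 1) and two labels, and each candidate distribution is a hypothetical tuple `p` with `p[x]` the probability of label 1 at instance `x`.

```python
def bayes_classifier(sample, prior):
    """Two-place Bayesian learner: takes data and a prior over a finite model.

    `prior` maps each candidate distribution p (a tuple with
    p[x] = P(Y=1 | X=x)) to its prior mass; its keys form the model.
    """
    if not prior:
        raise ValueError("no prior, hence no model: the output is undefined")
    # Bayes's rule: posterior mass is proportional to prior mass times likelihood.
    weights = {}
    for p, mass in prior.items():
        likelihood = 1.0
        for x, y in sample:
            likelihood *= p[x] if y == 1 else 1.0 - p[x]
        weights[p] = mass * likelihood
    total = sum(weights.values())
    if total == 0:
        raise ValueError("the data have probability zero under the model")
    posterior = {p: w / total for p, w in weights.items()}

    def classify(x):
        # Posterior predictive probability of label 1 at instance x.
        predictive = sum(post * p[x] for p, post in posterior.items())
        return 1 if predictive >= 0.5 else 0

    return classify, posterior

# Hypothetical three-element model with a flat prior.
model = [(0.9, 0.1), (0.1, 0.9), (0.5, 0.5)]
flat_prior = {p: 1 / 3 for p in model}
sample = [(0, 1), (0, 1), (1, 0), (0, 1), (1, 0)]
f_bayes, posterior = bayes_classifier(sample, flat_prior)
```

Calling the function with the same sample but a different prior, or a prior with different support, will in general yield a different classifier; calling it without a prior yields none at all.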

When it comes to the question of justification, the distinction between the two Bayesian stances is also relevant. Under the subjective stance, the Bayesian algorithm is simply optimal: among all algorithms, it leads to the best possible classifier (with smallest risk) under one’s own inductive assumptions as encoded by the prior. In other words, if the prior truly reflects one’s beliefs, then one must also believe that the Bayesian procedure, with this prior, is justified. If one is willing to take the subjective stance, then any argument that the Bayesian algorithm has no more inherent justification than any other algorithm, let alone “anti-Bayesian learning” (where one selects the classifier with the highest risk under the posterior), is futile.Footnote 33

Under a pragmatic view of Bayesian inference, the prior weights cannot be directly related to one’s beliefs, and the Bayesian algorithm cannot be said to be optimal in the previous sense. Nevertheless, under the pragmatic view one can still show that the Bayesian procedure has a certain model-relative optimality, even if the specific choice of prior over the same model now becomes important. We already mentioned how choices of noninformative priors can optimize worst-case behavior, by which we meant that \(\hat{f}_{\text {Bayes}}\) has the smallest possible generalization error in the worst case under all \(P^* \in \mathcal {M}\).

Furthermore, there exists a plethora of results (e.g., Ghosal et al. 2000, 2008) showing that, under very weak conditions on the model \(\mathcal {M}\), one can select priors such that for all \(P^* \in \mathcal {M}\), the posterior concentrates around \(P^*\) at a certain rate. In our context, this implies that the expected generalization error of \(\hat{f}_{\text {Bayes}}\) converges to the generalization error one could obtain if one knew the “true” (leading to the best possible predictions) \(P^*\). Moreover, one can give nonasymptotic bounds on the difference in generalization errors (Grünwald and Mehta 2020). These results provide a clear model-relative justification for the pragmatic Bayesian procedure: if one has reason to believe that the model is correct, then (with the right choice of prior over this model) one also has reason to believe that the algorithm performs well.

For the sake of brevity we do not go into more detail on the justification of Bayesian methods. Instead, we proceed with a more in-depth discussion of two methods that have received more attention in the context of the NFL results: empirical risk minimization and cross-validation.

3.3.2 Empirical risk minimization

This is probably the most standard machine learning method. Like Bayesian learning, empirical risk minimization (ERM) is a model-dependent method. The crucial difference with Bayesian learning is that the “model” is now not a set of probability distributions, but rather a user-specified set of classifiers \(\mathcal {F}\), usually called a hypothesis class. In practice, it could be the set of all neural networks with a given number of nodes and connectivity matrix, represented by their weights; or the set of all decision trees of a given size. The generalization performance of ERM can be analyzed via the standard machinery of learning theory (Shalev-Shwartz and Ben-David 2014). Here, as in Sect. 2.2, one assumes that the data \(S = (X_1, Y_1), (X_2, Y_2), \ldots , (X_n,Y_n)\) are sampled independently from an unknown distribution \(\mathcal {D}\). No further assumptions about \(\mathcal {D}\) are made: instead all inductive assumptions go into \(\mathcal {F}\). In Bayesian learning, the choice of model \(\mathcal {M}\) can be seen as the inductive assumption that “there is a \(P \in \mathcal {M}\) such that acting as if the data is a random sample from P leads to the best possible predictions.” In learning theory, adopting a class \(\mathcal {F}\) can be seen as the inductive assumption that “there is an \(f \in \mathcal {F}\) that has classification risk \(\ll 1/2\), small enough to be useful.” Here the classification risk is iid risk, or the probability that \(f(X) \ne Y\) under \(\mathcal {D}\).

The ERM method \(A_{\textsc {erm}}\) takes as input both a training sample S and a hypothesis class \(\mathcal {F}\) as above. It proceeds by picking the classifier \(\hat{f}_{\text {erm}} = A_{\textsc {erm}}(S,\mathcal {F})\) in \(\mathcal {F}\) that made, among all elements in \(\mathcal {F}\), the minimum number of errors on S, with some arbitrary rule for breaking ties. Assume for simplicity that \(\mathcal {F}\) is finite, so that there exists an \(f^*\) in \(\mathcal {F}\) that minimizes the risk. A variation of a standard result in learning theory says that ERM works well, in the following sense: the difference between the expected risk of \(\hat{f}_{\text {erm}}\) and the best obtainable risk within the model, namely that of \(f^*\), is bounded by \(\sqrt{\log |\mathcal {F}|/(2n)}\). (See “Appendix A.3” for a derivation.) This result holds no matter what \(\mathcal {D}\) is. Since the dependence on the size of \(\mathcal {F}\) is logarithmic, the guarantee remains non-void even for exponentially large, and in this sense fairly complex \(\mathcal {F}\). In fact, it can be extended to many infinite \(\mathcal {F}\) as well: the \(\log |\mathcal {F} |\) term is then replaced in the bound by an abstract (but computable) complexity notion such as the Rademacher, Vapnik-Chervonenkis or “PAC-Bayesian” complexity of \(\mathcal {F}\) (Grünwald and Mehta 2020). Interestingly, as the latter paper explains in detail, such results are proven using essentially the same techniques as those used for proving non-asymptotic convergence of pragmatic Bayesian learning.
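As a minimal sketch of ERM as a two-place function, consider the following. The threshold classifiers, the toy sample and the tie-breaking rule (the earliest element of the class wins) are our own illustrative assumptions; the second function merely evaluates the finite-class bound \(\sqrt{\log |\mathcal {F}|/(2n)}\) stated above.

```python
import math

def erm(sample, hypothesis_class):
    """A_erm(S, F): return a classifier in F with the fewest errors on S."""
    def empirical_errors(f):
        return sum(f(x) != y for x, y in sample)
    # min() returns the earliest minimizer, which serves as the tie-breaking rule.
    return min(hypothesis_class, key=empirical_errors)

def erm_excess_risk_bound(class_size, n):
    """Evaluate sqrt(log|F| / (2n)) for a finite class of the given size."""
    return math.sqrt(math.log(class_size) / (2 * n))

# Hypothetical class F: threshold classifiers on the integers 0..9.
F = [lambda x, t=t: int(x >= t) for t in range(11)]
S = [(1, 0), (2, 0), (4, 0), (6, 1), (7, 1), (9, 1)]
f_erm = erm(S, F)

# The logarithmic dependence on |F|: an exponentially large class,
# here of size 2**100, still gets a non-void bound at n = 10000.
large_class_bound = erm_excess_risk_bound(2**100, 10_000)
```

Passing a different hypothesis class with the same sample changes the output, which is the sense in which all inductive assumptions go into \(\mathcal {F}\).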

What about anti-ERM (or empirical risk maximization), which picks the \(\hat{f}_{\text {a-ERM}} \in \mathcal {F}\) with largest error on the training set? We can precisely reverse the math behind the convergence of ERM to show that anti-ERM will converge to the worst element of \(\mathcal {F}\), the element that maximizes risk. The difference between the expected risk of \(\hat{f}_{\text {a-ERM}}\) and the worst obtainable risk is again at most \(\sqrt{\log |\mathcal {F}|/(2n) }\) if \(\mathcal {F}\) is finite, and an analogous result holds again with \(\log |\mathcal {F} |\) replaced by Rademacher complexity or VC dimension for infinite \(\mathcal {F}\). Saying “ERM has no inherently better justification than anti-ERM” would thus amount to saying: “A method which (given a not too small sample) leads to the best possible predictions that can be obtained based on my hypothesis class, has no more inherent justification than a method which (given a not too small sample) leads to the worst possible predictions that can be obtained based on my hypothesis class.” To us, this seems an aberration.Footnote 34
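One way to make the reversal explicit, sketched here on the assumption that the ERM guarantee holds for any loss with values in [0, 1], is to note that maximizing the number of errors on S is the same as minimizing empirical risk for the flipped loss, which charges 1 for a correct prediction and 0 for an error. Under the flipped loss the risk of f is \(1 - L_{\mathcal {D}}(f)\), so applying the ERM bound to anti-ERM gives, for finite \(\mathcal {F}\),

$$\begin{aligned} {{\,\mathrm{\mathbf {E}}\,}}_{S \sim \mathcal {D}^n}\left[ 1 - L_{\mathcal {D}}\left( \hat{f}_{\text {a-ERM}}\right) \right] - \min _{f \in \mathcal {F}} \left( 1 - L_{\mathcal {D}}(f) \right) \le \sqrt{\frac{\log |\mathcal {F}|}{2n}}, \end{aligned}$$

which rearranges to

$$\begin{aligned} \max _{f \in \mathcal {F}} L_{\mathcal {D}}(f) - {{\,\mathrm{\mathbf {E}}\,}}_{S \sim \mathcal {D}^n}\left[ L_{\mathcal {D}}\left( \hat{f}_{\text {a-ERM}}\right) \right] \le \sqrt{\frac{\log |\mathcal {F}|}{2n}}. \end{aligned}$$

That is, the expected risk of anti-ERM falls short of the worst risk in the class by at most the same amount by which the expected risk of ERM exceeds the best.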

Our point is certainly not that ERM is perfect: if \(\mathcal {F}\) becomes “too complex” then ERM may suffer from severe overfitting and will not work in practice.Footnote 35 But if anyone advises us to use such a class in combination with ERM, we can simply reply that handling it goes beyond the power of ERM—other methods more suitable for that case exist, such as structural risk minimization (Vapnik 1998; Shalev-Shwartz and Ben-David 2014), or forms of minimum description length learning (Grünwald and Roos 2020), or ERM combined with cross-validation as below.

We thus have a well-defined condition (small enough complexity of our \(\mathcal {F}\)) under which ERM is provably preferable to anti-ERM. No such conditions have ever been formulated under which anti-ERM performs better than ERM (with the same model!), and it is highly implausible that something of the sort could be done.

3.3.3 Cross-validation

This method can be viewed as a meta-algorithm to select between different learning algorithms.Footnote 36 For ease of presentation, we concentrate on a simplification of cross-validation: two-fold forward-validation. This takes as input a data set of a given size \(n>1\), and a finite set of learning algorithms \(A_1, A_2, \ldots , A_{m}\). Forward-validation runs all these algorithms on the first half \(S_1\) of the original training set.Footnote 37 Letting \(\hat{f}_{k}=A_k(S_1)\) denote the classifier learned by algorithm \(A_k\), it then selects as final classifier the classifier \(\hat{f}_{\hat{k}_{\text {fv}}}\) where \(\hat{k}_{\text {fv}}\) is the k such that \(\hat{f}_k\) has the smallest error on the second half \(S_2\) of the training set, which is thereby used as a validation set. Thus, the final classifier always coincides with one of the m initial classifiers. For full two-fold cross-validation, one repeats the procedure with the two data sets interchanged, and for M-fold cross-validation we split the data in M parts with a validation set of size n/M. Everything we say below for two-fold forward-validation also holds mutatis mutandis for full M-fold cross-validation, but the phrasing of results becomes more cumbersome, so we stick to the two-fold forward case for simplicity.

Now, let \(\mathcal {E}^{(n)}_k\) be the expected iid risk of algorithm k after having run on the first half of the data: \(\mathcal {E}^{(n)}_k = {{\,\mathrm{\mathbf {E}}\,}}_{S \sim \mathcal {D}^n}\left[ L_\mathcal {D}\left( \hat{f}_k \right) \right] \). Let \(\mathcal {E}^{(n)}_{\text {fv}}\) be the expected iid risk of two-fold forward-validation as defined above: \(\mathcal {E}^{(n)}_{\text {fv}} = {{\,\mathrm{\mathbf {E}}\,}}_{S \sim \mathcal {D}^n}\left[ L_\mathcal {D}\left( \hat{f}_{\hat{k}_{\text {fv}}}\right) \right] \). One can now show (see “Appendix A.3”) that the expected iid risk of forward-validation satisfies

$$\begin{aligned} \mathcal {E}^{(n)}_{\text {fv}} \le \min _{k \in \{1, \ldots , m \}} \mathcal {E}^{(n)}_k + \sqrt{\frac{\log m}{n}}. \end{aligned}$$

Thus, the expected iid risk of forward-validation converges, as n grows, to the expected risk of the learning algorithm that, among all algorithms under consideration, is best in the sense that it outputs the lowest-risk classifier in expectation over the training set \(S_1\). This holds for all m and n, so if n is large, we can also take m very large; in particular, due to the logarithmic dependence on m, at sample size n we can choose between a number of learning algorithms m that is orders of magnitude larger than n and still have a meaningful bound.Footnote 38
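The procedure itself is short. The following is a sketch under our own assumptions: learning algorithms are represented as functions from a training set to a classifier, ties on the validation set go to the algorithm listed first, and the two constant predictors serve only as placeholder algorithms. The last function evaluates the additive term \(\sqrt{\log m / n}\) from the bound above.

```python
import math

def forward_validation(sample, algorithms):
    """Two-fold forward-validation: train on the first half S_1 of the
    sample, then select the classifier with the fewest errors on S_2."""
    half = len(sample) // 2
    s1, s2 = sample[:half], sample[half:]
    classifiers = [algorithm(s1) for algorithm in algorithms]

    def validation_errors(f):
        return sum(f(x) != y for x, y in s2)

    k_fv = min(range(len(classifiers)),
               key=lambda k: validation_errors(classifiers[k]))
    return k_fv, classifiers[k_fv]

def fv_bound_term(m, n):
    """Evaluate sqrt(log(m) / n), the additive term in the bound."""
    return math.sqrt(math.log(m) / n)

def always(label):
    """Placeholder data-only algorithm: ignore the data, predict `label`."""
    return lambda s1: (lambda x: label)

sample = [(0, 1), (1, 1), (2, 0), (3, 1), (4, 1), (5, 1), (6, 0), (7, 1)]
algorithms = [always(0), always(1)]
k_fv, f_fv = forward_validation(sample, algorithms)

# With a million candidate algorithms and ten thousand data points,
# the additive term is still small.
term = fv_bound_term(10**6, 10**4)
```

The selected classifier is by construction one of the classifiers trained on \(S_1\), so the final output always coincides with one of the m initial classifiers.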

Forward- and cross-validation can be fruitfully applied both to model-dependent algorithms and to algorithms that may be better viewed as data-only. A prototypical example of the latter is nearest-neighbor classification. Here \(\mathcal {X}\) is a space equipped with a metric (e.g., Euclidean space with the Euclidean metric). The k-nearest-neighbor method based on a training set with \(n'\) instances plus labels \((x_1,y_1), \ldots , (x_{n'}, y_{n'})\) outputs the classifier which, for any value of x, picks the k data points \(\{i_1, i_2, \ldots , i_{k} \} \subset \{1, \ldots , n'\}\) for which \(x_i\) is closest to x, and outputs the majority vote for the corresponding \(y_{i_1}, \ldots , y_{i_k}\). Nearest-neighbor with \(k=1\) always has zero error on the training set, so typically overfits dramatically. However, one can use cross- or forward-validation to choose a value of k. The number \(m_n\) of k’s that make sense at sample size n is at most n, so the generalization bounds above are meaningful, and we have the guarantee that the expected risk based on using \(\hat{k}_{\text {fv}}\)-nearest-neighbor is close to the risk obtained with the unknown optimal \(k \in \{1, \ldots , m_n\}\) that achieves the best expected risk \({{\,\mathrm{\mathbf {E}}\,}}_{S \sim \mathcal {D}^n}\left[ L_\mathcal {D}(\hat{f}_{k})\right] \).
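For illustration, here is a sketch of k-nearest-neighbor on real-valued instances with k chosen by forward-validation. The one-dimensional metric, the toy data and the candidate values of k are our own assumptions; ties in the vote go to the label encountered first among the nearest points.

```python
from collections import Counter

def knn_classifier(train, k):
    """k-nearest-neighbor for real-valued instances with metric |x - x'|."""
    def classify(x):
        nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
        votes = Counter(y for _, y in nearest)
        return votes.most_common(1)[0][0]  # majority vote
    return classify

def error_rate(f, data):
    return sum(f(x) != y for x, y in data) / len(data)

def choose_k(sample, candidates):
    """Forward-validation over k: fit on the first half, score on the second."""
    half = len(sample) // 2
    s1, s2 = sample[:half], sample[half:]
    usable = [k for k in candidates if k <= len(s1)]
    return min(usable, key=lambda k: error_rate(knn_classifier(s1, k), s2))

# Hypothetical data with distinct instances.
sample = [(0.10, 0), (0.90, 1), (0.20, 0), (0.80, 1),
          (0.35, 1), (0.70, 1), (0.15, 0), (0.60, 1)]
candidates = [1, 3]
k_fv = choose_k(sample, candidates)

# With distinct instances, 1-nearest-neighbor reproduces its training labels.
training_error_1nn = error_rate(knn_classifier(sample, 1), sample)
```

The zero training error of the \(k=1\) classifier says nothing about its risk, which is why the choice of k is made on the held-out half rather than on the data used for fitting.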

When applying forward- and cross-validation to model-dependent learning algorithms, one typically takes the same learning algorithm (say ERM) for \(A_1, \ldots , A_{m}\), turned into one-place algorithms by combining each \(A_k\) with a different hypothesis class \(\mathcal {F}_k\). For example, \(A_k\) could represent ERM applied to \(\mathcal {F}_k\), the set of decision trees of depth k. The class of all decision trees of arbitrary depth is too large for ERM to work well (yield nontrivial generalization guarantees), but in combination with forward- or cross-validation one can use the above result to get meaningful generalization guarantees again.
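The move from a two-place, model-dependent algorithm to a family of one-place algorithms can be sketched as follows. As a simple stand-in for depth-k decision trees (our substitution, not the example in the text), \(\mathcal {F}_k\) is here the class of classifiers on \([0,1)\) that are constant on each of \(2^k\) equal-width bins; for this class, ERM amounts to taking the majority label in each bin. The names `erm_histogram`, `forward_validation` and `draw`, and the sampled distribution, are hypothetical.

```python
import random
from functools import partial


def erm_histogram(depth, sample):
    """ERM over F_depth: classifiers constant on each of 2**depth equal-width bins of [0, 1)."""
    bins = 2 ** depth
    counts = [[0, 0] for _ in range(bins)]
    for x, y in sample:
        counts[min(int(x * bins), bins - 1)][y] += 1
    # The majority label in each bin minimizes the empirical error within the class.
    labels = [int(c[1] > c[0]) for c in counts]
    return lambda x: labels[min(int(x * bins), bins - 1)]


def error(f, sample):
    """Empirical error of classifier f on `sample`."""
    return sum(f(x) != y for x, y in sample) / len(sample)


def forward_validation(algorithms, s1, s2):
    """Train every one-place algorithm on s1; return the classifier with lowest error on s2."""
    classifiers = [a(s1) for a in algorithms]
    return min(classifiers, key=lambda f: error(f, s2))


def draw(n, rng, noise=0.2):
    """Hypothetical distribution: x uniform on [0, 1), label 1 iff x > 0.5, flipped w.p. noise."""
    return [(x, int(x > 0.5) ^ int(rng.random() < noise))
            for x in (rng.random() for _ in range(n))]


rng = random.Random(0)
s1, s2 = draw(300, rng), draw(300, rng)

# Fixing the model turns the two-place algorithm ERM(F_k, S) into a one-place A_k(S).
algorithms = [partial(erm_histogram, k) for k in range(8)]
train_errors = [error(a(s1), s1) for a in algorithms]
f_hat = forward_validation(algorithms, s1, s2)

print("training errors by depth:", train_errors)
print("validation error of selected classifier:", error(f_hat, s2))
```

Because the classes are nested, the training error of ERM cannot increase with k, which is why training error alone cannot be used to choose among the classes and a held-out part is needed.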

How about anti-cross-validation? We can invoke precisely the same analysis as for ERM. Our inductive bias is now explicitly specified at a meta-level, by specifying the algorithms \(A_k\). If m, the number of algorithms taken into account, is fixed or grows subexponentially with n, cross-validation can be expected to converge to the best of them based on a finite and quantifiable sample size. In contrast, under the same conditions, anti-cross-validation will converge to the worst of them. Analogously to the ERM case, there is a clear condition (m subexponential as a function of n) under which cross-validation is (much) better than anti-cross-validation relative to the given algorithms that encode our inductive bias. And again, we cannot imagine a condition that would allow one to prove an interesting guarantee in support of anti-cross-validation.
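The contrast can be illustrated with a small simulation. This is a sketch under simplifying assumptions of our own: the classifiers output by the \(A_k\) are taken as fixed, with hypothetical true risks; each validation error is simulated as an independent average of Bernoulli draws rather than computed on one shared validation set; and a single held-out part stands in for full cross-validation.

```python
import random


def holdout_errors(true_risks, n, rng):
    """Simulated validation error of each fixed classifier on n i.i.d. points."""
    return [sum(rng.random() < r for _ in range(n)) / n for r in true_risks]


# Hypothetical true risks of the classifiers output by A_1, ..., A_m.
true_risks = [0.10, 0.30, 0.50]
rng = random.Random(1)
val = holdout_errors(true_risks, n=2000, rng=rng)

# Validation selects the classifier with the lowest held-out error;
# anti-validation selects the one with the highest.
cv_pick = min(range(len(val)), key=val.__getitem__)
anti_pick = max(range(len(val)), key=val.__getitem__)

print("validation errors:", val)
print("true risk selected by validation:", true_risks[cv_pick])
print("true risk selected by anti-validation:", true_risks[anti_pick])
```

With a validation part of this size, the held-out errors concentrate tightly around the true risks, so the two selection rules land on the best and the worst candidate, respectively.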

3.3.4 The consistency with no-free-lunch

To conclude our examples, we note that the model-dependent perspective still encapsulates the valid lesson from NFL results: the lesson that every algorithm, when operating on data only, must incorporate an inductive bias. A Bayesian algorithm, when provided with a model and a prior distribution on this model, will possess a certain bias; similarly, ERM, when provided with a hypothesis class, and cross-validation, when provided with a set of hypothesis classes, possess a certain bias. The models here represent an inductive bias, and NFL results show that any such model must indeed be biased in the sense that it must be restrictive. Any algorithm plus instantiated model performs well in some situations: those with which the inductive bias is, in some sense, well aligned. But the algorithm plus this model does not perform well in other situations, even situations in which the very same algorithm, with a different instantiated model, would perform well.

To further illustrate the consistency of negative NFL results and positive learning-theoretic results, recall the NFL version of Shalev-Shwartz and Ben-David (2014) that we described in Sect. 2.5. It states that every data-only algorithm (like ERM with any instantiated \(\mathcal {F}\)) fails to perform well in some situations \(\mathcal {D}\) in which another data-only algorithm does perform well. They prove this by exhibiting a second algorithm that has a \(\mathcal {D}\)-expected risk at least 1/4 less than the first algorithm; specifically, the second algorithm is ERM with a class \(\mathcal {F}'\) that is well (indeed perfectly) aligned with \(\mathcal {D}\). Note that if the first algorithm is \(\textsc {ERM}\) with some \(\mathcal {F}\), then this second \(\mathcal {F}'\) must be a different class, for any significant difference in expected risk (depending on the sample size). This follows from the learning-theoretic guarantee that the expected risk of ERM cannot be much worse than that of the best hypothesis in \(\mathcal {F}\), and therefore than that of any algorithm that uses (must select a classifier from) model \(\mathcal {F}\). Again, ERM with a particular \(\mathcal {F}\) may be much worse than a different data-only algorithm if \(\mathcal {F}\) is not a good model. But ERM cannot perform much worse than any algorithm with the same model; and if we have reason to believe that our model is good, then we have reason to believe that ERM with this model performs well, too (Footnote 39).

3.4 The justification for learning algorithms

Learning theory thus provides us with model-relative justification for many standard methods. For a generic model-dependent method, such a model-relative justification is all we can ask for. For such a generic method, it simply does not make sense to speculate about empirical assumptions that would render the method in itself successful and in that sense justify it. This observation stands in sharp contrast to the reduction of the justification for standard learning methods to some postulate about the right structure of the world. We think that this observation within the domain of machine learning also lends further plausibility to local accounts of induction in philosophy.

One could object, however, that no method is perfectly generic, and some assumptions or biases are always inherent to it. To put this point differently, we have used the term “inductive bias” in a relatively narrow way, as only pertaining to the choice of hypothesis class. But one could object that, for instance, the method of ERM (anti-ERM), irrespective of the hypothesis class, embodies a substantive assumption that the evidence so far is not (is) misleading (Footnote 40). We agree these can also be called biases, or perhaps rather meta-biases (as they concern extrapolating classifiers’ success rather than the data directly); but they are fundamentally linked to assumptions that are already introduced in the formulation of the relevant learning problem, in this case the general problem of stochastic classification (Footnote 41). In particular, the use of ERM relies (and learning-theoretic guarantees for ERM rely) on the problem assumption of stochastic classification that data is sampled i.i.d. (this can be extended to a stationarity assumption but not much beyond). For this learning problem, and in particular due to the i.i.d. assumption, the “uniformity meta-bias” of ERM is provably good, and the “anti-uniformity meta-bias” of anti-ERM is provably not. In general, in the same way that any NFL statement concerns a certain learning problem (recall Sect. 2.5), any learning guarantee concerns a certain learning problem. Thus our claim is more precisely that many standard learning methods, also relative to the learning problem they were designed for, have a model-relative justification.

Finally, recall from our discussion in Sect. 3.2 that it is far from clear that a local conception of induction brings us closer to an absolute, global justification of inductive inferences. Similarly, a model-relative justification still leaves open the justification for the model in any particular application of a learning algorithm, and indeed for the further assumptions encoded in the very formulation of the learning problem. A global justification for the conclusions of a machine learning algorithm must also include the justification for all these assumptions. The obvious threat is an endless justificatory regress, where the motivation for these local assumptions leads us to an earlier inference that itself relies on inductive assumptions that want justification. Note, though, that this regress will soon, if not immediately, lead us to assumptions that we have not actually arrived at by machine learning methods. We will soon have left the domain of machine learning, and face the problem of induction in its full generality. Rather, therefore, than understanding the NFL theorems as somehow deepening Humean skepticism, the more sober conclusion is that the question of the global justification for the conclusions of machine learning algorithms reduces to the original problem of induction.

4 Conclusion

The NFL theorems are commonly understood to show that every learning algorithm must possess a certain bias, and must ultimately lack justification because any such bias must. We have argued that for many standard learning algorithms, this is turning things on their head. NFL results do show that any data-only algorithm must have an inherent bias. Presented with such an algorithm, we could expose its bias, and question the justification for this bias and thereby for the algorithm. However, many standard learning algorithms are better conceived of as model-dependent. The need for a choice of bias is accommodated by such an algorithm from the start: on each application, one must equip it with a particular model that represents the bias. The algorithm itself is generic in that it does not itself come with a bias: on each application, one must provide it with one. What is more, such algorithms can have a model-relative justification: relative to any given model, such an algorithm performs well. Learning-theoretic guarantees show that in that sense some standard learning algorithms are sensible, are justified, and other possible algorithms are not. This is perfectly consistent with the valid lesson of NFL results that any data-only learning method, including a model-dependent algorithm plus a particular choice of model, must possess a bias.

In the course of our argument, we drew some parallels to the broader philosophy of induction. Most importantly, we discussed the role of a general postulate on the induction-friendliness of the world, and the local view of induction that challenges the cogency of such a postulate. We think of our emphasis on the model-dependence of many standard learning algorithms as an instance of the local view of induction. It is important to note, however, that the local view does not yet suffice to escape Hume’s skeptical argument, and neither does the model-relative conception in the context of machine learning algorithms. Namely, an absolute justification for the conclusions of inductive inferences still requires a justification for the preceding choice of local assumptions or model.

For that reason, the local view of induction also does not suffice to fully explain the success of our inductive inferences. Analogously, Wolpert (1996b, p. 1364) points to “a rather profound (if somewhat philosophical) paradox” that is not yet resolved by the model-dependent perspective on learning algorithms: “How is it that we perform inference so well in practice, given the NFL theorems and the limited scope of our prior knowledge?” That this is not merely a “somewhat philosophical” issue is demonstrated, for instance, by the recent debate surrounding the “paradox of deep learning” (Zhang et al. 2017; Neyshabur et al. 2017; Arpit et al. 2017; Kawaguchi et al. 2019), which revolves around the perceived lack of a good explanation for the empirical success of deep neural networks. The case of deep learning is particularly interesting, as a clean separation of method and model is here much more contentious, and the remaining question of justification does not clearly center on the motivation for a well-articulated choice of model.