Abstract
The No Free Lunch theorems prove that, under a uniform distribution over induction problems (search problems or learning problems), all induction algorithms perform equally. As I discuss in this chapter, the importance of the theorems arises from using them to analyze scenarios involving nonuniform distributions, and to compare different algorithms without any assumption about the distribution over problems at all. In particular, the theorems prove that anti-cross-validation (choosing among a set of candidate algorithms based on which has the worst out-of-sample behavior) performs as well as cross-validation, unless one makes an assumption (one that has never been formalized) about how the distribution over induction problems is related to the set of algorithms one is choosing among using (anti-)cross-validation. In addition, the theorems establish strong caveats concerning the significance of the many results in the literature that establish the strength of a particular algorithm without assuming a particular distribution. They also motivate a "dictionary" between supervised learning and blackbox optimization, which allows one to "translate" techniques from supervised learning into the domain of blackbox optimization, thereby strengthening blackbox optimization algorithms. Finally, I briefly discuss the implications of the theorems for the philosophy of science.
Notes
1. Note that as a special case, we could have each of the two professors always produce the exact same search algorithm for any objective function they are presented with. In this case, comparing the performance of the two professors just amounts to comparing the performance of the two associated search algorithms.
2. The choice to use an off-training-set cost function in the analysis of supervised learning is the analog of the choice, in the analysis of search, to use a search algorithm that only searches over points not yet sampled. In both cases, the goal is to "mod out" aspects of the problem that are typically not of interest and might produce misleading results: the ability of the learning algorithm to reproduce a training set in the case of supervised learning, and the ability to revisit points already sampled with a good objective value in the case of search.
References
Wolpert, D.H.: The lack of a priori distinctions between learning algorithms; The existence of a priori distinctions between learning algorithms. Neural Comput. 8, 1341–1390, 1391–1421 (1996)
Schaffer, C.: A conservation law for generalization performance. In: International Conference on Machine Learning, pp. 259–265. Morgan Kaufmann, San Mateo (1994)
Wolpert, D.H., Macready, W.G.: No free lunch theorems for optimization. IEEE Trans. Evol. Comput. 1(1), 67–82 (1997)
Whitley, D., Rowe, J.: A “no free lunch” tutorial: sharpened and focused no free lunch. In: Theory of Randomized Search Heuristics: Foundations and Recent Developments, pp. 255–287. World Scientific, Singapore (2011)
Igel, C., Toussaint, M.: A no-free-lunch theorem for non-uniform distributions of target functions. J. Math. Model. Algorithms 3(4), 313–322 (2005)
Poland, K., Beer, K., Osborne, T.J.: No free lunch for quantum machine learning (2020). Preprint, arXiv:2003.14103
Peel, L., Larremore, D.B., Clauset, A.: The ground truth about metadata and community detection in networks. Sci. Adv. 3(5), e1602548 (2017)
Godfrey-Smith, P.: Theory and Reality: An Introduction to the Philosophy of Science. University of Chicago Press, Chicago (2009)
Wolpert, D.H.: The relationship between PAC, the statistical physics framework, the Bayesian framework, and the VC framework. In: The Mathematics of Generalization, pp. 117–215. Addison-Wesley, Reading (1995)
Wolpert, D.H.: On bias plus variance. Neural Comput. 9(6), 1211–1243 (1997)
Mackay, D.J.C.: Information Theory, Inference, and Learning Algorithms. Cambridge University Press, Cambridge (2003)
Jefferys, W.H., Berger, J.O.: Ockham’s razor and Bayesian analysis. Am. Sci. 80(1), 64–72 (1992)
Loredo, T.J.: From Laplace to SN 1987a: Bayesian inference in astrophysics. In: Maximum Entropy and Bayesian Methods, pp. 81–142. Kluwer Academic, Dordrecht (1990)
Gull, S.F.: Bayesian inductive inference and maximum entropy. In: Maximum Entropy and Bayesian Methods, pp. 53–74. Kluwer Academic, Dordrecht (1988)
Wolpert, D.H.: On the Bayesian “Occam factors” argument for Occam’s razor. In: Petsche T., et al. (eds.) Computational Learning Theory and Natural Learning Systems III. MIT Press, Cambridge (1995)
Li, M., Vitanyi, P.: An Introduction to Kolmogorov Complexity and Its Applications. Springer, Berlin (2008)
Lattimore, T., Hutter, M.: No free lunch versus Occam’s razor in supervised learning. In: Algorithmic Probability and Friends. Bayesian Prediction and Artificial Intelligence, pp. 223–235. Springer, Berlin (2013)
Wolpert, D.H.: The relationship between Occam’s razor and convergent guessing. Complex Syst. 4, 319–368 (1990)
Ermoliev, Y.M., Norkin, V.I.: Monte Carlo optimization and path dependent nonstationary laws of large numbers. Technical Report IR-98-009, International Institute for Applied Systems Analysis, March 1998
Rubinstein, R., Kroese, D.: The Cross-Entropy Method. Springer, Berlin (2004)
De Bonet, J.S., Isbell, C.L. Jr., Viola, P.: MIMIC: finding optima by estimating probability densities. In: Advances in Neural Information Processing Systems 9. MIT Press, Cambridge (1997)
Rajnarayan, D., Wolpert, D.H.: Exploiting parametric learning to improve black-box optimization. In: Jost, J. (ed.) Proceedings of ECCS 2007 (2007)
Rajnarayan, D., Wolpert, D.H.: Bias-variance techniques for Monte Carlo optimization: cross-validation for the CE method (2008). arXiv:0810.0877v1
Wolpert, D.H., Macready, W.: Coevolutionary free lunches. Trans. Evol. Comput. 9, 721–735 (2005)
Macready, W.G., Wolpert, D.H.: What makes an optimization problem hard? Complexity 1, 40–46 (1995)
Appendix
A.1 NFL and Inner Product Formulas for Search
To begin, expand the performance probability distribution:

$$P(\phi \mid {\mathcal {A}}, m) \;=\; \sum_{d^m_Y} \delta\big(\phi, \Phi(d^m_Y)\big)\, P(d^m_Y \mid {\mathcal {A}}, m), \tag{5}$$

where the delta function equals 1 if its two arguments are equal, and 0 otherwise. The choice of search algorithm affects performance only through the term \(P(d^m_Y \mid {\mathcal {A}}, m)\). In turn, this probability of \(d^m_Y\) under \({\mathcal {A}}\) is given by

$$P(d^m_Y \mid {\mathcal {A}}, m) \;=\; \sum_f P(d^m_Y \mid f, {\mathcal {A}}, m)\, P(f). \tag{6}$$

Plugging in gives

$$P(\phi \mid {\mathcal {A}}, m) \;=\; \sum_f D(f; \phi, {\mathcal {A}}, m)\, P(f), \tag{7}$$

where

$$D(f; \phi, {\mathcal {A}}, m) \;:=\; \sum_{d^m_Y} \delta\big(\phi, \Phi(d^m_Y)\big)\, P(d^m_Y \mid f, {\mathcal {A}}, m). \tag{8}$$

So for any fixed ϕ, \(P(\phi \mid {\mathcal {A}}, m)\) is an inner product of two real-valued vectors each indexed by f: \(D(f; \phi, {\mathcal {A}}, m)\) and P(f). Note that all the details of how the search algorithm operates are embodied in the first of those vectors. In contrast, the second one is completely independent of the search algorithm.
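As a sanity check on the inner-product decomposition described above, the following sketch (not from the chapter; the tiny search space, the nonuniform prior, and the scan-in-order algorithm are all hypothetical choices) computes the distribution over performance values two ways for a deterministic, non-revisiting search algorithm: first via the intermediate distribution over value sequences \(d^m_Y\), and second as a direct sum over objective functions weighted by P(f). The two routes must agree.

```python
from itertools import product
from collections import defaultdict

X = range(3)
m = 2
funcs = list(product([0, 1, 2], repeat=3))  # all f: X -> Y with |Y| = 3

# Hypothetical nonuniform prior P(f), normalized over all functions.
weights = [sum(f) + 1 for f in funcs]
Z = sum(weights)
P_f = {f: w / Z for f, w in zip(funcs, weights)}

def trace(f, m):
    """Deterministic non-revisiting search: always sample the lowest
    unvisited x; return the observed value sequence d^m_Y."""
    visited, ys = [], []
    for _ in range(m):
        x = min(set(X) - set(visited))
        visited.append(x)
        ys.append(f[x])
    return tuple(ys)

Phi = min  # performance measure: best (lowest) value seen so far

# Route 1: build P(d^m_Y | A, m), then marginalize to P(phi | A, m).
P_d = defaultdict(float)
for f, p in P_f.items():
    P_d[trace(f, m)] += p
P_phi_1 = defaultdict(float)
for d, p in P_d.items():
    P_phi_1[Phi(d)] += p

# Route 2: inner product sum_f D(f; phi) P(f); for a deterministic
# algorithm, D(f; phi) is the indicator that f's trace achieves phi.
P_phi_2 = defaultdict(float)
for f, p in P_f.items():
    P_phi_2[Phi(trace(f, m))] += p

for c in set(P_phi_1) | set(P_phi_2):
    assert abs(P_phi_1[c] - P_phi_2[c]) < 1e-12
```

The point of the exercise is that all dependence on the search algorithm sits in the trace (the D vector), while P(f) enters only as the weights in the final sum.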
This notation also allows us to state the NFL for search theorem formally. Let B be any subset of the set of all objective functions, \(Y^X\). Then Eq. (7) allows us to express the expected performance for functions inside B in terms of expected performance outside of B:

$$\sum_{f \in B} {\mathbb {E}}(\Phi \mid f, m, {\mathcal {A}}) \;=\; \text{const} \;-\; \sum_{f \notin B} {\mathbb {E}}(\Phi \mid f, m, {\mathcal {A}}), \tag{9}$$

where the constant on the right-hand side depends on the performance measure Φ(.) but is independent of both \({\mathcal {A}}\) and B [3]. Expressed differently, Eq. (9) says that \(\sum _f {\mathbb {E}}(\Phi \mid f, m, {\mathcal {A}})\) is independent of \({\mathcal {A}}\). This is the core of the NFL for search, as elaborated in the next section.
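The statement that the f-summed performance is independent of the algorithm can be verified exhaustively on a toy problem. The sketch below (not from the chapter; the two algorithms are hypothetical) compares a fixed-order scan against an adaptive algorithm whose next query depends on the values seen so far, summing best-found-value performance over every objective function on a four-point domain.

```python
from itertools import product

X = list(range(4))
Y = [0, 1]
m = 3
funcs = list(product(Y, repeat=len(X)))  # all 16 objective functions

def run(algo, f, m):
    """Run a deterministic, non-revisiting search algorithm on f;
    return the observed value sequence d^m_Y."""
    visited, ys = [], []
    for _ in range(m):
        x = algo(visited, ys)
        visited.append(x)
        ys.append(f[x])
    return ys

def algo_A(visited, ys):
    # Scan x in increasing order, ignoring all observations.
    return min(set(X) - set(visited))

def algo_B(visited, ys):
    # Adaptive: jump to the largest unvisited x after seeing a 1,
    # otherwise take the smallest unvisited x.
    unvisited = set(X) - set(visited)
    return max(unvisited) if (ys and ys[-1] == 1) else min(unvisited)

# Performance: best (minimum) objective value found in m samples,
# summed uniformly over every possible objective function.
total_A = sum(min(run(algo_A, f, m)) for f in funcs)
total_B = sum(min(run(algo_B, f, m)) for f in funcs)
assert total_A == total_B  # NFL: the f-summed performance is identical
```

The equality holds exactly, not approximately: for any deterministic non-revisiting algorithm, each possible value sequence \(d^m_Y\) is produced by the same number of objective functions, so any performance measure that depends only on \(d^m_Y\) sums to the same total.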
A.2 NFL for Search When We Average over P(f)’s
To derive the NFL theorem that applies when we vary over P(f)’s, first recall our simplifying assumption that both X and Y are finite (as they will be when doing search on any digital computer). Due to this, any P(f) is a finite-dimensional real-valued vector living on a simplex Ω. Let π refer to a generic element of Ω. So \(\int_\Omega d\pi \; P(f \mid \pi)\) is the average probability of any one particular f, if one uniformly averages over all distributions on f’s. By symmetry, this integral must be a constant, independent of f. In addition, as mentioned above, Eq. (9) tells us that \(\sum_{f} {\mathbb {E}}(\Phi \mid f, m, {\mathcal {A}})\) is independent of \({\mathcal {A}}\). Therefore for any two search algorithms \({\mathcal {A}}\) and \({\mathcal {B}}\),

$$\int_\Omega d\pi \sum_f P(f \mid \pi)\, {\mathbb {E}}(\Phi \mid f, m, {\mathcal {A}}) \;=\; \int_\Omega d\pi \sum_f P(f \mid \pi)\, {\mathbb {E}}(\Phi \mid f, m, {\mathcal {B}}), \tag{10}$$

that is,

$$\int_\Omega d\pi \; {\mathbb {E}}_\pi(\Phi \mid m, {\mathcal {A}}) \;=\; \int_\Omega d\pi \; {\mathbb {E}}_\pi(\Phi \mid m, {\mathcal {B}}). \tag{11}$$

We can re-express this result as the statement that \(\int _\Omega d\pi \; {\mathbb {E}}_\pi (\Phi \mid m, {\mathcal {A}})\) is independent of \({\mathcal {A}}\).

Next, let Π be any subset of Ω. Then our result that \(\int _\Omega d\pi \; {\mathbb {E}}_\pi (\Phi \mid m, {\mathcal {A}})\) is independent of \({\mathcal {A}}\) implies

$$\int_\Pi d\pi \; {\mathbb {E}}_\pi(\Phi \mid m, {\mathcal {A}}) \;=\; \text{const} \;-\; \int_{\Omega \setminus \Pi} d\pi \; {\mathbb {E}}_\pi(\Phi \mid m, {\mathcal {A}}), \tag{12}$$

where the constant depends on Φ but is independent of both \({\mathcal {A}}\) and Π. So if any search algorithm performs particularly well for one set of P(f)’s, Π, it must perform correspondingly poorly on all other P(f)’s. This is the NFL theorem for search when P(f)’s vary.
A.3 NFL and Inner Product Formulas for Supervised Learning
To state the supervised learning inner product and NFL theorems requires introducing some more notation. Conventionally, these theorems are presented in the version where both the learning algorithm and the target function are stochastic. (In contrast, the restrictions for search—presented above—conventionally involve a deterministic search algorithm and deterministic objective function.) This makes the statement of the restrictions for supervised learning intrinsically more complicated.
Let X be a finite input space, Y a finite output space, and say we have a target distribution \(f(y_f \in Y \mid x \in X)\), along with a training set \(d = (d^m_X, d^m_Y)\) of m pairs \(\{(d^m_X(i) \in X, d^m_Y(i) \in Y)\}\), which is stochastically generated according to a distribution P(d∣f) (conventionally called a likelihood, or “data-generation process”). Assume that based on d we have a hypothesis distribution \(h(y_h \in Y \mid x \in X)\). The creation of h from d, specified in toto by the distribution P(h∣d), is conventionally called the learning algorithm. In addition, let \(L(y_h, y_f)\) be a loss function taking \(Y \times Y \rightarrow {\mathbb {R}}\). Finally, let C(f, h, d) be an off-training set cost function (see note 2),

$$C(f, h, d) \;=\; \sum_{q \in X} P(q) \sum_{y_h, y_f \in Y} L(y_h, y_f)\, h(y_h \mid q)\, f(y_f \mid q), \tag{13}$$

where P(q) is some probability distribution over X assigning nonzero measure to \(X \setminus d^m_X\).
All aspects of any supervised learning scenario, including the prior, the learning algorithm, the data likelihood function, etc., are given by the joint distribution P(f, h, d, c) (where c is the value of the cost function) and its marginals. In particular, in [9] it is proven that the probability of a particular cost value c is given by

$$P(c \mid d) \;=\; \sum_{h, f} P(h \mid d)\, M_{c,d}(h, f)\, P(f \mid d) \tag{14}$$

for a matrix \(M_{c,d}\) that is symmetric in its arguments so long as the loss function is. \(P(f \mid d) \propto P(d \mid f)P(f)\) is the posterior probability that the real world has produced a target f for you to try to learn, given that you only know d. It has nothing to do with your learning algorithm. In contrast, P(h∣d) is the specification of your learning algorithm. It has nothing to do with the distribution of targets f in the real world. So Eq. (14) tells us that as long as the loss function is symmetric, how “aligned” you (the learning algorithm) are with the real world (the posterior) determines how well you will generalize.
This supervised learning inner product formula results in a set of NFL theorems for supervised learning, once one imposes some additional conditions on the loss function. See [9] for details.
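The supervised learning NFL theorems can likewise be checked by brute force on a toy problem. The sketch below (not from [9]; the "majority" and "anti-majority" learners and the four-point input space are hypothetical illustrations) averages off-training-set zero-one loss uniformly over every deterministic binary target, for two learning algorithms that make exactly opposite predictions, and finds the same average error for both.

```python
from itertools import product

X = list(range(4))
Y = [0, 1]
d_X = [0, 1]                            # training-set inputs
ots = [x for x in X if x not in d_X]    # off-training-set inputs
targets = list(product(Y, repeat=len(X)))  # all 16 deterministic targets f

def ots_error(h, f):
    """Average zero-one loss off the training set (uniform P(q))."""
    return sum(h[x] != f[x] for x in ots) / len(ots)

def learn_majority(d):
    # Predict the majority training label at every input.
    y = round(sum(d) / len(d))
    return {x: y for x in X}

def learn_antimajority(d):
    # Deliberately predict the opposite of the majority label.
    y = 1 - round(sum(d) / len(d))
    return {x: y for x in X}

def avg_error(learner):
    total = 0.0
    for f in targets:
        d = [f[x] for x in d_X]   # noise-free training labels
        total += ots_error(learner(d), f)
    return total / len(targets)

e1 = avg_error(learn_majority)
e2 = avg_error(learn_antimajority)
assert e1 == e2 == 0.5  # uniform over targets, both generalize at chance
```

Both learners average exactly 0.5 because, uniformly over targets, the off-training-set labels are independent of the training data; any alignment between P(h∣d) and P(f∣d), per Eq. (14), has been averaged away.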
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this chapter
Wolpert, D.H. (2021). What Is Important About the No Free Lunch Theorems?. In: Pardalos, P.M., Rasskazova, V., Vrahatis, M.N. (eds) Black Box Optimization, Machine Learning, and No-Free Lunch Theorems. Springer Optimization and Its Applications, vol 170. Springer, Cham. https://doi.org/10.1007/978-3-030-66515-9_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66514-2
Online ISBN: 978-3-030-66515-9
eBook Packages: Mathematics and Statistics (R0)