
What Is Important About the No Free Lunch Theorems?

Part of the book series: Springer Optimization and Its Applications (SOIA, volume 170)

Abstract

The No Free Lunch theorems prove that, under a uniform distribution over induction problems (search problems or learning problems), all induction algorithms perform equally. As I discuss in this chapter, the importance of the theorems arises from using them to analyze scenarios involving nonuniform distributions, and to compare different algorithms without any assumption about the distribution over problems at all. In particular, the theorems prove that anti-cross-validation (choosing among a set of candidate algorithms based on which has the worst out-of-sample behavior) performs as well as cross-validation, unless one makes an assumption—which has never been formalized—about how the distribution over induction problems, on the one hand, is related to the set of algorithms one is choosing among using (anti-)cross-validation, on the other. In addition, the theorems establish strong caveats concerning the significance of the many results in the literature that establish the strength of a particular algorithm without assuming a particular distribution. They also motivate a “dictionary” between supervised learning and blackbox optimization, which allows one to “translate” techniques from supervised learning into blackbox optimization, thereby strengthening blackbox optimization algorithms. Finally, I briefly discuss the implications of the theorems for the philosophy of science.


Notes

  1. Note that as a special case, we could have each of the two professors always produce the exact same search algorithm for any objective function they are presented with. In this case, comparing the performance of the two professors just amounts to comparing the performance of the two associated search algorithms.

  2. The choice to use an off-training-set cost function for the analysis of supervised learning is the analog of the choice, in the analysis of search, to use a search algorithm that only searches over points not yet sampled. In both cases, the goal is to “mod out” aspects of the problem that are typically not of interest and might produce misleading results: the ability of the learning algorithm to reproduce a training set in the case of supervised learning, and the ability to revisit points already sampled with a good objective value in the case of search.

References

  1. Wolpert, D.H.: The lack of a priori distinctions between learning algorithms and the existence of a priori distinctions between learning algorithms. Neural Comput. 8, 1341–1390, 1391–1421 (1996)

  2. Schaffer, C.: A conservation law for generalization performance. In: International Conference on Machine Learning, pp. 259–265. Morgan Kaufmann, San Mateo (1994)

  3. Wolpert, D.H., Macready, W.G.: No free lunch theorems for optimization. IEEE Trans. Evol. Comput. 1(1), 67–82 (1997)

  4. Whitley, D., Rowe, J.: A “no free lunch” tutorial: sharpened and focused no free lunch. In: Theory of Randomized Search Heuristics: Foundations and Recent Developments, pp. 255–287. World Scientific, Singapore (2011)

  5. Igel, C., Toussaint, M.: A no-free-lunch theorem for non-uniform distributions of target functions. J. Math. Model. Algorithms 3(4), 313–322 (2005)

  6. Poland, K., Beer, K., Osborne, T.J.: No free lunch for quantum machine learning (2020). Preprint, arXiv:2003.14103

  7. Peel, L., Larremore, D.B., Clauset, A.: The ground truth about metadata and community detection in networks. Sci. Adv. 3(5), e1602548 (2017)

  8. Godfrey-Smith, P.: Theory and Reality: An Introduction to the Philosophy of Science. University of Chicago Press, Chicago (2009)

  9. Wolpert, D.H.: The relationship between PAC, the statistical physics framework, the Bayesian framework, and the VC framework. In: The Mathematics of Generalization, pp. 117–215. Addison-Wesley, Reading (1995)

  10. Wolpert, D.H.: On bias plus variance. Neural Comput. 9(6), 1211–1243 (1997)

  11. MacKay, D.J.C.: Information Theory, Inference, and Learning Algorithms. Cambridge University Press, Cambridge (2003)

  12. Jefferys, W.H., Berger, J.O.: Ockham’s razor and Bayesian analysis. Am. Sci. 80(1), 64–72 (1992)

  13. Loredo, T.J.: From Laplace to SN 1987A: Bayesian inference in astrophysics. In: Maximum Entropy and Bayesian Methods, pp. 81–142. Kluwer Academic, Dordrecht (1990)

  14. Gull, S.F.: Bayesian inductive inference and maximum entropy. In: Maximum Entropy and Bayesian Methods, pp. 53–74. Kluwer Academic, Dordrecht (1988)

  15. Wolpert, D.H.: On the Bayesian “Occam factors” argument for Occam’s razor. In: Petsche, T., et al. (eds.) Computational Learning Theory and Natural Learning Systems III. MIT Press, Cambridge (1995)

  16. Li, M., Vitányi, P.: An Introduction to Kolmogorov Complexity and Its Applications. Springer, Berlin (2008)

  17. Lattimore, T., Hutter, M.: No free lunch versus Occam’s razor in supervised learning. In: Algorithmic Probability and Friends. Bayesian Prediction and Artificial Intelligence, pp. 223–235. Springer, Berlin (2013)

  18. Wolpert, D.H.: The relationship between Occam’s razor and convergent guessing. Complex Syst. 4, 319–368 (1990)

  19. Ermoliev, Y.M., Norkin, V.I.: Monte Carlo optimization and path dependent nonstationary laws of large numbers. Technical Report IR-98-009, International Institute for Applied Systems Analysis (1998)

  20. Rubinstein, R., Kroese, D.: The Cross-Entropy Method. Springer, Berlin (2004)

  21. De Bonet, J.S., Isbell, C.L., Jr., Viola, P.: MIMIC: finding optima by estimating probability densities. In: Advances in Neural Information Processing Systems 9. MIT Press, Cambridge (1997)

  22. Rajnarayan, D., Wolpert, D.H.: Exploiting parametric learning to improve black-box optimization. In: Jost, J. (ed.) Proceedings of ECCS 2007 (2007)

  23. Rajnarayan, D., Wolpert, D.H.: Bias-variance techniques for Monte Carlo optimization: cross-validation for the CE method (2008). Preprint, arXiv:0810.0877v1

  24. Wolpert, D.H., Macready, W.G.: Coevolutionary free lunches. IEEE Trans. Evol. Comput. 9(6), 721–735 (2005)

  25. Macready, W.G., Wolpert, D.H.: What makes an optimization problem hard? Complexity 1, 40–46 (1995)


Appendix

A.1 NFL and Inner Product Formulas for Search

To begin, expand the performance probability distribution:

$$\displaystyle \begin{aligned} \begin{array}{rcl} P(\phi \mid {\mathcal{A}}, m) & =&\displaystyle \sum_{d^m_Y} P(d^m_Y \mid {\mathcal{A}}, m) P(\phi \mid d^m_Y, {\mathcal{A}}, m) \\ & =&\displaystyle \sum_{d^m_Y} P(d^m_Y \mid {\mathcal{A}}, m) \delta(\phi, \Phi(d^m_Y)), {} \end{array} \end{aligned} $$
(5)

where the delta function equals 1 if its two arguments are equal, and 0 otherwise. The choice of search algorithm affects performance only through the term \(P(d^m_Y \mid {\mathcal {A}}, m)\). In turn, this probability of \(d^m_Y\) under \({\mathcal {A}}\) is given by

$$\displaystyle \begin{aligned} \begin{array}{rcl} P(d^m_Y \mid {\mathcal{A}}, m) & =&\displaystyle \sum_{f} P(d^m_Y \mid f, m, {\mathcal{A}}) P(f \mid m, {\mathcal{A}}) \\ & =&\displaystyle \sum_{f} P(d^m_Y \mid f, m, {\mathcal{A}}) P(f). \end{array} \end{aligned} $$
(6)

Plugging in gives

$$\displaystyle \begin{aligned} P(\phi \mid {\mathcal{A}}, m) &= \sum_{f} P(f) D(f; \phi, {\mathcal{A}}, m), {} \end{aligned} $$
(7)

where

$$\displaystyle \begin{aligned} D(f; \phi, {\mathcal{A}}, m) &:= \sum_{d^m_Y} P(d^m_Y \mid f, {\mathcal{A}}, m)\, \delta(\phi, \Phi(d^m_Y)). \end{aligned} $$
(8)

So for any fixed ϕ, \(P(\phi \mid {\mathcal {A}}, m)\) is an inner product of two real-valued vectors, each indexed by f: \(D(f; \phi, {\mathcal {A}}, m)\) and P(f). Note that all the details of how the search algorithm operates are embodied in the first of those vectors. In contrast, the second one is completely independent of the search algorithm.
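To see this inner product structure concretely, here is a minimal Python sketch with every particular invented for illustration: a three-point X, a binary Y, m = 2, a single deterministic non-retracing algorithm, Φ taken to be the best (lowest) value found, and an arbitrary P(f). It computes \(P(\phi \mid {\mathcal{A}}, m)\) as the inner product of Eq. (7), with each component of \(D(f; \phi, {\mathcal{A}}, m)\) obtained by running the algorithm on f.

```python
import itertools
import numpy as np

X, Y, m = [0, 1, 2], [0, 1], 2   # illustrative finite spaces and sample count

def alg(d):
    # A deterministic, non-retracing search algorithm: sweep X in order.
    visited = {x for x, _ in d}
    return next(x for x in X if x not in visited)

def phi_of_run(f):
    # Phi(d^m_Y): best (lowest) value observed in m steps of the search.
    d = []
    for _ in range(m):
        x = alg(d)
        d.append((x, f[x]))
    return min(y for _, y in d)

# Enumerate all |Y|^|X| objective functions f : X -> Y.
all_f = [dict(zip(X, ys)) for ys in itertools.product(Y, repeat=len(X))]

rng = np.random.default_rng(0)
P_f = rng.dirichlet(np.ones(len(all_f)))   # an arbitrary prior P(f)

for value in Y:   # the possible values phi of the performance measure
    # D(f; phi, A, m) is 0/1 here because the algorithm is deterministic.
    D = np.array([float(phi_of_run(f) == value) for f in all_f])
    print(value, P_f @ D)   # P(phi | A, m) as the inner product of Eq. (7)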

This notation also allows us to state the NFL theorem for search formally. Let B be any subset of the set of all objective functions, \(Y^X\). Then Eq. (7) allows us to express the expected performance for functions inside B in terms of the expected performance outside of B:

$$\displaystyle \begin{aligned} \sum_{f \in B} {\mathbb{E}}(\Phi \mid f, m, {\mathcal{A}}) &= constant - \sum_{f \in Y^X \setminus B} {\mathbb{E}}(\Phi \mid f, m, {\mathcal{A}}), {} \end{aligned} $$
(9)

where the constant on the right-hand side depends on the performance measure Φ(.) but is independent of both \({\mathcal {A}}\) and B [3]. Expressed differently, Eq. (9) says that \(\sum_f {\mathbb{E}}(\Phi \mid f, m, {\mathcal{A}})\) is independent of \({\mathcal{A}}\). This is the core of the NFL theorem for search, as elaborated in the next section.
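To make Eq. (9) concrete, here is a minimal sketch in the same illustrative setting (three-point X, binary Y, m = 2, Φ the best value found), now comparing two arbitrary non-retracing deterministic algorithms. It enumerates all \(f \in Y^X\) and confirms that \(\sum_f {\mathbb{E}}(\Phi \mid f, m, {\mathcal{A}})\) is the same for both.

```python
import itertools

X = [0, 1, 2]   # finite search space
Y = [0, 1]      # finite set of objective values
m = 2           # number of distinct points sampled

def run(algorithm, f):
    """Run a deterministic, non-retracing search algorithm on objective f,
    returning the sequence of sampled objective values d^m_Y."""
    d = []  # history of (x, f(x)) pairs
    for _ in range(m):
        x = algorithm(d)          # next point; never a previously visited one
        d.append((x, f[x]))
    return [y for _, y in d]

def fixed_order(d):
    # Sweep X left to right, skipping visited points.
    visited = {x for x, _ in d}
    return next(x for x in X if x not in visited)

def adaptive(d):
    # Start at 2, then branch on the value just observed (valid for m = 2).
    if not d:
        return 2
    return 0 if d[-1][1] == 0 else 1

def perf(d_Y):
    # Performance measure Phi: best (lowest) value found, i.e. minimization.
    return min(d_Y)

for alg in (fixed_order, adaptive):
    # Sum of E(Phi | f, m, A) over all |Y|^|X| = 8 objective functions f.
    total = sum(perf(run(alg, dict(zip(X, ys))))
                for ys in itertools.product(Y, repeat=len(X)))
    print(alg.__name__, total)   # both algorithms print the same total
```

Swapping in any other performance measure or any other pair of non-retracing algorithms leaves the two totals equal, which is Eq. (9) with B equal to all of \(Y^X\).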

A.2 NFL for Search When We Average over P(f)’s

To derive the NFL theorem that applies when we vary over P(f)’s, first recall our simplifying assumption that both X and Y are finite (as they will be when doing search on any digital computer). Due to this, any P(f) is a finite-dimensional real-valued vector living on a simplex Ω. Let π refer to a generic element of Ω. So \(\int_\Omega d\pi \; P(f \mid \pi)\) is the average probability of any one particular f, if one uniformly averages over all distributions on f’s. By symmetry, this integral must be a constant, independent of f: permuting the f-components of π leaves the uniform measure on Ω unchanged, so the integral cannot depend on which component f indexes. In addition, as mentioned above, Eq. (9) tells us that \(\sum_{f} {\mathbb{E}}(\Phi \mid f, m, {\mathcal{A}})\) is independent of \({\mathcal{A}}\). Therefore for any two search algorithms \({\mathcal {A}}\) and \({\mathcal {B}}\),

$$\displaystyle \begin{aligned} \begin{array}{rcl} \sum_f {\mathbb{E}}(\Phi \mid f, m, {\mathcal{A}}) & =&\displaystyle \sum_f {\mathbb{E}}(\Phi \mid f, m, {\mathcal{B}}) , \\ \sum_f {\mathbb{E}}(\Phi \mid f, m, {\mathcal{A}}) \bigg[\int_\Omega d\pi \; P(f \mid \pi)\bigg] & =&\displaystyle \sum_f {\mathbb{E}}(\Phi \mid f, m, {\mathcal{B}}) \bigg[\int_\Omega d\pi \; P(f \mid \pi)\bigg] , \\ \sum_f {\mathbb{E}}(\Phi \mid f, m, {\mathcal{A}}) \bigg[\int_\Omega d\pi \; \pi(f)\bigg] & =&\displaystyle \sum_f {\mathbb{E}}(\Phi \mid f, m, {\mathcal{B}}) \bigg[\int_\Omega d\pi \; \pi(f)\bigg], \\ \int_\Omega d\pi \; \sum_f {\mathbb{E}}(\Phi \mid f, m, {\mathcal{A}}) \pi(f) & =&\displaystyle \int_\Omega d\pi \; \sum_f {\mathbb{E}}(\Phi \mid f, m, {\mathcal{B}}) \pi(f), \end{array} \end{aligned} $$
(10)

that is,

$$\displaystyle \begin{aligned} \begin{array}{rcl} \int_\Omega d\pi \; {\mathbb{E}}_\pi(\Phi \mid m, {\mathcal{A}}) & =&\displaystyle \int_\Omega d\pi \; {\mathbb{E}}_\pi(\Phi \mid m, {\mathcal{B}}). {} \end{array} \end{aligned} $$
(11)

We can re-express this result as the statement that \(\int _\Omega d\pi \; {\mathbb {E}}_\pi (\Phi \mid m, {\mathcal {A}})\) is independent of \({\mathcal {A}}\).

Next, let Π be any subset of Ω. Then our result that \(\int _\Omega d\pi \; {\mathbb {E}}_\pi (\Phi \mid m, {\mathcal {A}})\) is independent of \({\mathcal {A}}\) implies

$$\displaystyle \begin{aligned} \begin{array}{rcl} \int_{\pi \in \Pi} d\pi \; {\mathbb{E}}_\pi(\Phi \mid m, {\mathcal{A}}) & =&\displaystyle constant - \int_{\pi \in \Omega \setminus \Pi} d\pi \; {\mathbb{E}}_\pi(\Phi \mid m, {\mathcal{A}}), {} \end{array} \end{aligned} $$
(12)

where the constant depends on Φ but is independent of both \({\mathcal {A}}\) and Π. So if any search algorithm performs particularly well for one set of P(f)’s, Π, it must perform correspondingly poorly on the remaining P(f)’s, those in Ω ∖ Π. This is the NFL theorem for search when P(f)’s vary.
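Under the same illustrative assumptions as the sketches above, one can check Eq. (11) numerically: sample priors π uniformly from Ω (a Dirichlet distribution with all concentration parameters set to 1) and compare Monte Carlo estimates of \(\int_\Omega d\pi \; {\mathbb{E}}_\pi(\Phi \mid m, {\mathcal{A}})\) for two algorithms. This is a sketch, not part of the chapter’s formal development; the algorithms and spaces are the same invented ones as before.

```python
import itertools
import numpy as np

X, Y, m = [0, 1, 2], [0, 1], 2   # same toy setup as the previous sketches

def phi(algorithm, f):
    # Phi for a deterministic algorithm: best (lowest) value found in m steps.
    d = []
    for _ in range(m):
        x = algorithm(d)
        d.append((x, f[x]))
    return min(y for _, y in d)

def fixed_order(d):
    visited = {x for x, _ in d}
    return next(x for x in X if x not in visited)

def adaptive(d):
    return 2 if not d else (0 if d[-1][1] == 0 else 1)

all_f = [dict(zip(X, ys)) for ys in itertools.product(Y, repeat=len(X))]

rng = np.random.default_rng(0)
# Uniform samples from the simplex Omega: Dirichlet(1, ..., 1).
pis = rng.dirichlet(np.ones(len(all_f)), size=200_000)

for alg in (fixed_order, adaptive):
    e = np.array([phi(alg, f) for f in all_f])  # E(Phi | f, m, A), one entry per f
    # Monte Carlo estimate of the integral over Omega of E_pi(Phi | m, A).
    print(alg.__name__, (pis @ e).mean())       # the two agree up to sampling noise
```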

A.3 NFL and Inner Product Formulas for Supervised Learning

Stating the supervised learning inner product and NFL theorems requires introducing some more notation. Conventionally, these theorems are presented in the version where both the learning algorithm and the target function are stochastic. (In contrast, the restrictions for search presented above conventionally involve a deterministic search algorithm and a deterministic objective function.) This makes the statement of the restrictions for supervised learning intrinsically more complicated.

Let X be a finite input space, Y a finite output space, and say we have a target distribution \(f(y_f \in Y \mid x \in X)\), along with a training set \(d = (d^m_X, d^m_Y)\) of m pairs \(\{(d^m_X(i) \in X, d^m_Y(i) \in Y)\}\), which is stochastically generated according to a distribution \(P(d \mid f)\) (conventionally called a likelihood, or “data-generation process”). Assume that based on d we have a hypothesis distribution \(h(y_h \in Y \mid x \in X)\). (The creation of h from d—specified in toto by the distribution \(P(h \mid d)\)—is conventionally called the learning algorithm.) In addition, let \(L(y_h, y_f)\) be a loss function taking \(Y \times Y \rightarrow {\mathbb{R}}\). Finally, let C(f, h, d) be an off-training-set cost function (see Note 2),

$$\displaystyle \begin{aligned} \begin{array}{rcl} C(f, h, d) \propto \sum_{y_f \in Y, y_h \in Y} \sum_{q \in X \setminus d^m_X} P(q) L(y_f, y_h) f(y_f \mid q) h(y_h \mid q), \end{array} \end{aligned} $$
(13)

where P(q) is some probability distribution over X assigning nonzero measure to \(X \setminus d^m_X\).
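As a concrete illustration of Eq. (13), the following sketch evaluates the off-training-set cost for one hypothetical target f, hypothesis h, and training set. The specific conditional distributions, the zero-one loss, and the uniform P(q) are all assumptions made for the example.

```python
import numpy as np

# Toy instantiation of Eq. (13); all particular choices below are illustrative.
X = [0, 1, 2, 3]                  # input space
Y = [0, 1]                        # output space
d_X = [0, 1]                      # training inputs; off-training-set points are X \ d_X

# Target f(y_f | x) and hypothesis h(y_h | x); rows indexed by x, columns by y.
f = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]])
h = np.array([[0.8, 0.2], [0.3, 0.7], [0.3, 0.7], [0.5, 0.5]])

L = 1.0 - np.eye(len(Y))          # symmetric zero-one loss L(y_f, y_h)
P_q = np.full(len(X), 0.25)       # sampling distribution P(q) over X

off = [q for q in X if q not in d_X]
# C(f, h, d): expected loss over off-training-set inputs, up to the overall
# proportionality constant in Eq. (13).
C = sum(P_q[q] * f[q] @ L @ h[q] for q in off)
print(C)
```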

All aspects of any supervised learning scenario—including the prior, the learning algorithm, the data likelihood function, etc.—are given by the joint distribution P(f, h, d, c) (where c is the value of the cost function) and its marginals. In particular, in [9] it is proven that the probability of a particular cost value c is given by

$$\displaystyle \begin{aligned} \begin{array}{rcl} P(c \mid d ) & =&\displaystyle \int df dh \; P(h \mid d)P(f \mid d)M_{c,d}(f, h) {} \end{array} \end{aligned} $$
(14)

for a matrix \(M_{c,d}\) that is symmetric in its arguments so long as the loss function is. \(P(f \mid d) \propto P(d \mid f)P(f)\) is the posterior probability that the real world has produced a target f for you to try to learn, given that you only know d. It has nothing to do with your learning algorithm. In contrast, \(P(h \mid d)\) is the specification of your learning algorithm. It has nothing to do with the distribution of targets f in the real world. So Eq. (14) tells us that, as long as the loss function is symmetric, how “aligned” you (the learning algorithm) are with the real world (the posterior) determines how well you will generalize.

This supervised learning inner product formula results in a set of NFL theorems for supervised learning, once one imposes some additional conditions on the loss function. See [9] for details.
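To see Eq. (14) in action, here is a fully discrete toy case in which the cost is deterministic given (f, h, d), so that \(M_{c,d}(f, h)\) reduces to \(\delta(c, C(f, h, d))\). The particular spaces, the stand-in distributions \(P(f \mid d)\) and \(P(h \mid d)\), and the zero-one off-training-set cost are all assumptions of the example.

```python
import itertools
from collections import defaultdict
import numpy as np

# Deterministic toy case: f and h are functions X -> Y, encoded as tuples.
X, Y = [0, 1, 2, 3], [0, 1]
d_X = [0, 1]                                     # training inputs
off = [q for q in X if q not in d_X]

funcs = list(itertools.product(Y, repeat=len(X)))

def cost(f, h):
    # Off-training-set zero-one loss under uniform P(q), a special case of Eq. (13).
    return sum(f[q] != h[q] for q in off) / len(off)

rng = np.random.default_rng(1)
P_f = rng.dirichlet(np.ones(len(funcs)))         # stand-in posterior P(f | d)
P_h = rng.dirichlet(np.ones(len(funcs)))         # stand-in learning algorithm P(h | d)

# M_{c,d}(f, h) = delta(c, C(f, h, d)) here, since the cost is deterministic;
# cost(f, h) == cost(h, f) because the zero-one loss is symmetric.
P_c = defaultdict(float)
for i, f in enumerate(funcs):
    for j, h in enumerate(funcs):
        P_c[cost(f, h)] += P_f[i] * P_h[j]       # Eq. (14) as a double sum

for c in sorted(P_c):
    print(f"P(c = {c:.2f} | d) = {P_c[c]:.4f}")
```

Because cost(f, h) = cost(h, f) under this symmetric loss, exchanging the roles of \(P(f \mid d)\) and \(P(h \mid d)\) leaves \(P(c \mid d)\) unchanged, which is exactly the symmetry of \(M_{c,d}\) used in the text.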


Copyright information

© 2021 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Wolpert, D.H. (2021). What Is Important About the No Free Lunch Theorems?. In: Pardalos, P.M., Rasskazova, V., Vrahatis, M.N. (eds) Black Box Optimization, Machine Learning, and No-Free Lunch Theorems. Springer Optimization and Its Applications, vol 170. Springer, Cham. https://doi.org/10.1007/978-3-030-66515-9_13
