Abstract
The No Free Lunch theorems prove that, under a uniform distribution over induction problems (search problems or learning problems), all induction algorithms perform equally. As I discuss in this chapter, the importance of the theorems arises from using them to analyze scenarios involving nonuniform distributions, and to compare different algorithms without any assumption about the distribution over problems at all. In particular, the theorems prove that anti-cross-validation (choosing among a set of candidate algorithms based on which has the worst out-of-sample behavior) performs as well as cross-validation, unless one makes an assumption (one that has never been formalized) about how the distribution over induction problems is related to the set of algorithms one is choosing among using (anti-)cross-validation. In addition, the theorems establish strong caveats concerning the significance of the many results in the literature that establish the strength of a particular algorithm without assuming a particular distribution. They also motivate a "dictionary" between supervised learning and blackbox optimization, which allows one to "translate" techniques from supervised learning into the domain of blackbox optimization, thereby strengthening blackbox optimization algorithms. Finally, I briefly discuss the implications of the theorems for the philosophy of science.
Notes
1. Note that as a special case, we could have each of the two professors always produce the exact same search algorithm for any objective function they are presented with. In this case, comparing the performance of the two professors just amounts to comparing the performance of the two associated search algorithms.
2. The choice to use an off-training-set cost function in the analysis of supervised learning is the analog of the choice, in the analysis of search, to use a search algorithm that only searches over points not yet sampled. In both cases, the goal is to "mod out" aspects of the problem that are typically not of interest and might produce misleading results: the ability of the learning algorithm to reproduce a training set in the case of supervised learning, and the ability to revisit points already sampled with a good objective value in the case of search.
References
Wolpert, D.H.: The lack of a priori distinctions between learning algorithms; The existence of a priori distinctions between learning algorithms. Neural Comput. 8, 1341–1390, 1391–1421 (1996)
Schaffer, C.: A conservation law for generalization performance. In: International Conference on Machine Learning, pp. 259–265. Morgan Kaufmann, San Mateo (1994)
Wolpert, D.H., Macready, W.G.: No free lunch theorems for optimization. IEEE Trans. Evol. Comput. 1(1), 67–82 (1997)
Whitley, D., Rowe, J.: A “no free lunch” tutorial: sharpened and focused no free lunch. In: Theory of Randomized Search Heuristics: Foundations and Recent Developments, pp. 255–287. World Scientific, Singapore (2011)
Igel, C., Toussaint, M.: A no-free-lunch theorem for non-uniform distributions of target functions. J. Math. Model. Algorithms 3(4), 313–322 (2005)
Poland, K., Beer, K., Osborne, T.J.: No free lunch for quantum machine learning (2020). Preprint, arXiv:2003.14103
Peel, L., Larremore, D.B., Clauset, A.: The ground truth about metadata and community detection in networks. Sci. Adv. 3(5), e1602548 (2017)
Godfrey-Smith, P.: Theory and Reality: An Introduction to the Philosophy of Science. University of Chicago Press, Chicago (2009)
Wolpert, D.H.: The relationship between PAC, the statistical physics framework, the Bayesian framework, and the VC framework. In: The Mathematics of Generalization, pp. 117–215. Addison-Wesley, Reading (1995)
Wolpert, D.H.: On bias plus variance. Neural Comput. 9(6), 1211–1243 (1997)
Mackay, D.J.C.: Information Theory, Inference, and Learning Algorithms. Cambridge University Press, Cambridge (2003)
Jefferys, W.H., Berger, J.O.: Ockham’s razor and Bayesian analysis. Am. Sci. 80(1), 64–72 (1992)
Loredo, T.J.: From Laplace to SN 1987a: Bayesian inference in astrophysics. In: Maximum Entropy and Bayesian Methods, pp. 81–142. Kluwer Academic, Dordrecht (1990)
Gull, S.F.: Bayesian inductive inference and maximum entropy. In: Maximum Entropy and Bayesian Methods, pp. 53–74. Kluwer Academic, Dordrecht (1988)
Wolpert, D.H.: On the Bayesian “Occam factors” argument for Occam’s razor. In: Petsche T., et al. (eds.) Computational Learning Theory and Natural Learning Systems III. MIT Press, Cambridge (1995)
Li, M., Vitanyi, P.: An Introduction to Kolmogorov Complexity and Its Applications. Springer, Berlin (2008)
Lattimore, T., Hutter, M.: No free lunch versus Occam’s razor in supervised learning. In: Algorithmic Probability and Friends. Bayesian Prediction and Artificial Intelligence, pp. 223–235. Springer, Berlin (2013)
Wolpert, D.H.: The relationship between Occam’s razor and convergent guessing. Complex Syst. 4, 319–368 (1990)
Ermoliev, Y.M., Norkin, V.I.: Monte Carlo optimization and path dependent nonstationary laws of large numbers. Technical Report IR-98-009, International Institute for Applied Systems Analysis, March 1998
Rubinstein, R., Kroese, D.: The Cross-Entropy Method. Springer, Berlin (2004)
De Bonet, J.S., Isbell, C.L. Jr., Viola, P.: MIMIC: finding optima by estimating probability densities. In: Advances in Neural Information Processing Systems 9. MIT Press, Cambridge (1997)
Rajnarayan, D., Wolpert, D.H.: Exploiting parametric learning to improve black-box optimization. In: Jost, J. (ed.) Proceedings of ECCS 2007 (2007)
Rajnarayan, D., Wolpert, D.H.: Bias-variance techniques for Monte Carlo optimization: cross-validation for the CE method (2008). arXiv:0810.0877v1
Wolpert, D.H., Macready, W.: Coevolutionary free lunches. Trans. Evol. Comput. 9, 721–735 (2005)
Macready, W.G., Wolpert, D.H.: What makes an optimization problem hard? Complexity 1, 40–46 (1995)
Appendix
A.1 NFL and Inner Product Formulas for Search
To begin, expand the performance probability distribution:

$$P(\phi \mid {\mathcal {A}}, m) \;=\; \sum_{d^m_Y} \delta\big(\phi, \Phi(d^m_Y)\big)\, P(d^m_Y \mid {\mathcal {A}}, m), \tag{5}$$

where the delta function equals 1 if its two arguments are equal, and 0 otherwise. The choice of search algorithm affects performance only through the term \(P(d^m_Y \mid {\mathcal {A}}, m)\). In turn, this probability of \(d^m_Y\) under \({\mathcal {A}}\) is given by

$$P(d^m_Y \mid {\mathcal {A}}, m) \;=\; \sum_f P(d^m_Y \mid f, {\mathcal {A}}, m)\, P(f). \tag{6}$$

Plugging in gives

$$P(\phi \mid {\mathcal {A}}, m) \;=\; \sum_f D(f; \phi, {\mathcal {A}}, m)\, P(f), \tag{7}$$

where

$$D(f; \phi, {\mathcal {A}}, m) \;:=\; \sum_{d^m_Y} \delta\big(\phi, \Phi(d^m_Y)\big)\, P(d^m_Y \mid f, {\mathcal {A}}, m). \tag{8}$$

So for any fixed ϕ, \(P(\phi \mid {\mathcal {A}}, m)\) is an inner product of two real-valued vectors each indexed by f: \(D(f; \phi, {\mathcal {A}}, m)\) and P(f). Note that all the details of how the search algorithm operates are embodied in the first of those vectors. In contrast, the second one is completely independent of the search algorithm.
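As a sanity check on the inner-product decomposition described above, the following sketch (not from the chapter; the tiny search space, the nonuniform prior, and the scan-in-order algorithm are all hypothetical choices) computes the distribution over performance values two ways for a deterministic, non-revisiting search algorithm: first via the intermediate distribution over value sequences \(d^m_Y\), and second as a direct sum over objective functions weighted by P(f). The two routes must agree.

```python
from itertools import product
from collections import defaultdict

X = range(3)
m = 2
funcs = list(product([0, 1, 2], repeat=3))  # all f: X -> Y with |Y| = 3

# Hypothetical nonuniform prior P(f), normalized over all functions.
weights = [sum(f) + 1 for f in funcs]
Z = sum(weights)
P_f = {f: w / Z for f, w in zip(funcs, weights)}

def trace(f, m):
    """Deterministic non-revisiting search: always sample the lowest
    unvisited x; return the observed value sequence d^m_Y."""
    visited, ys = [], []
    for _ in range(m):
        x = min(set(X) - set(visited))
        visited.append(x)
        ys.append(f[x])
    return tuple(ys)

Phi = min  # performance measure: best (lowest) value seen so far

# Route 1: build P(d^m_Y | A, m), then marginalize to P(phi | A, m).
P_d = defaultdict(float)
for f, p in P_f.items():
    P_d[trace(f, m)] += p
P_phi_1 = defaultdict(float)
for d, p in P_d.items():
    P_phi_1[Phi(d)] += p

# Route 2: inner product sum_f D(f; phi) P(f); for a deterministic
# algorithm, D(f; phi) is the indicator that f's trace achieves phi.
P_phi_2 = defaultdict(float)
for f, p in P_f.items():
    P_phi_2[Phi(trace(f, m))] += p

for c in set(P_phi_1) | set(P_phi_2):
    assert abs(P_phi_1[c] - P_phi_2[c]) < 1e-12
```

The point of the exercise is that all dependence on the search algorithm sits in the trace (the D vector), while P(f) enters only as the weights in the final sum.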
This notation also allows us to state the NFL for search theorem formally. Let B be any subset of the set of all objective functions, \(Y^X\). Then Eq. (7) allows us to express the expected performance for functions inside B in terms of expected performance outside of B:

$$\sum_{f \in B} {\mathbb {E}}(\Phi \mid f, m, {\mathcal {A}}) \;=\; \text{const} \;-\; \sum_{f \notin B} {\mathbb {E}}(\Phi \mid f, m, {\mathcal {A}}), \tag{9}$$

where the constant on the right-hand side depends on the performance measure Φ(.) but is independent of both \({\mathcal {A}}\) and B [3]. Expressed differently, Eq. (9) says that \(\sum _f {\mathbb {E}}(\Phi \mid f, m, {\mathcal {A}})\) is independent of \({\mathcal {A}}\). This is the core of the NFL for search, as elaborated in the next section.
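The statement that the f-summed performance is independent of the algorithm can be verified exhaustively on a toy problem. The sketch below (not from the chapter; the two algorithms are hypothetical) compares a fixed-order scan against an adaptive algorithm whose next query depends on the values seen so far, summing best-found-value performance over every objective function on a four-point domain.

```python
from itertools import product

X = list(range(4))
Y = [0, 1]
m = 3
funcs = list(product(Y, repeat=len(X)))  # all 16 objective functions

def run(algo, f, m):
    """Run a deterministic, non-revisiting search algorithm on f;
    return the observed value sequence d^m_Y."""
    visited, ys = [], []
    for _ in range(m):
        x = algo(visited, ys)
        visited.append(x)
        ys.append(f[x])
    return ys

def algo_A(visited, ys):
    # Scan x in increasing order, ignoring all observations.
    return min(set(X) - set(visited))

def algo_B(visited, ys):
    # Adaptive: jump to the largest unvisited x after seeing a 1,
    # otherwise take the smallest unvisited x.
    unvisited = set(X) - set(visited)
    return max(unvisited) if (ys and ys[-1] == 1) else min(unvisited)

# Performance: best (minimum) objective value found in m samples,
# summed uniformly over every possible objective function.
total_A = sum(min(run(algo_A, f, m)) for f in funcs)
total_B = sum(min(run(algo_B, f, m)) for f in funcs)
assert total_A == total_B  # NFL: the f-summed performance is identical
```

The equality holds exactly, not approximately: for any deterministic non-revisiting algorithm, each possible value sequence \(d^m_Y\) is produced by the same number of objective functions, so any performance measure that depends only on \(d^m_Y\) sums to the same total.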
A.2 NFL for Search When We Average over P(f)’s
To derive the NFL theorem that applies when we vary over P(f)’s, first recall our simplifying assumption that both X and Y are finite (as they will be when doing search on any digital computer). Due to this, any P(f) is a finite-dimensional real-valued vector living on a simplex Ω. Let π refer to a generic element of Ω. So \(\int_\Omega d\pi \; P(f \mid \pi)\) is the average probability of any one particular f, if one uniformly averages over all distributions on f’s. By symmetry, this integral must be a constant, independent of f. In addition, as mentioned above, Eq. (9) tells us that \(\sum_{f} {\mathbb {E}}(\Phi \mid f, m, {\mathcal {A}})\) is independent of \({\mathcal {A}}\). Therefore for any two search algorithms \({\mathcal {A}}\) and \({\mathcal {B}}\),

$$\int_\Omega d\pi \sum_f P(f \mid \pi)\, {\mathbb {E}}(\Phi \mid f, m, {\mathcal {A}}) \;=\; \int_\Omega d\pi \sum_f P(f \mid \pi)\, {\mathbb {E}}(\Phi \mid f, m, {\mathcal {B}}), \tag{10}$$

that is,

$$\int_\Omega d\pi \; {\mathbb {E}}_\pi(\Phi \mid m, {\mathcal {A}}) \;=\; \int_\Omega d\pi \; {\mathbb {E}}_\pi(\Phi \mid m, {\mathcal {B}}). \tag{11}$$

We can re-express this result as the statement that \(\int _\Omega d\pi \; {\mathbb {E}}_\pi (\Phi \mid m, {\mathcal {A}})\) is independent of \({\mathcal {A}}\).

Next, let Π be any subset of Ω. Then our result that \(\int _\Omega d\pi \; {\mathbb {E}}_\pi (\Phi \mid m, {\mathcal {A}})\) is independent of \({\mathcal {A}}\) implies

$$\int_\Pi d\pi \; {\mathbb {E}}_\pi(\Phi \mid m, {\mathcal {A}}) \;=\; \text{const} \;-\; \int_{\Omega \setminus \Pi} d\pi \; {\mathbb {E}}_\pi(\Phi \mid m, {\mathcal {A}}), \tag{12}$$

where the constant depends on Φ but is independent of both \({\mathcal {A}}\) and Π. So if any search algorithm performs particularly well for one set of P(f)’s, Π, it must perform correspondingly poorly on all other P(f)’s. This is the NFL theorem for search when P(f)’s vary.
A.3 NFL and Inner Product Formulas for Supervised Learning
To state the supervised learning inner product and NFL theorems requires introducing some more notation. Conventionally, these theorems are presented in the version where both the learning algorithm and the target function are stochastic. (In contrast, the restrictions for search—presented above—conventionally involve a deterministic search algorithm and deterministic objective function.) This makes the statement of the restrictions for supervised learning intrinsically more complicated.
Let X be a finite input space, Y a finite output space, and say we have a target distribution \(f(y_f \in Y \mid x \in X)\), along with a training set \(d = (d^m_X, d^m_Y)\) of m pairs \(\{(d^m_X(i) \in X, d^m_Y(i) \in Y)\}\), which is stochastically generated according to a distribution P(d∣f) (conventionally called a likelihood, or “data-generation process”). Assume that based on d we have a hypothesis distribution \(h(y_h \in Y \mid x \in X)\). The creation of h from d, specified in toto by the distribution P(h∣d), is conventionally called the learning algorithm. In addition, let \(L(y_h, y_f)\) be a loss function taking \(Y \times Y \rightarrow {\mathbb {R}}\). Finally, let C(f, h, d) be an off-training set cost function (see note 2),

$$C(f, h, d) \;=\; \sum_{q \in X} P(q) \sum_{y_h, y_f \in Y} L(y_h, y_f)\, h(y_h \mid q)\, f(y_f \mid q), \tag{13}$$

where P(q) is some probability distribution over X assigning nonzero measure to \(X \setminus d^m_X\).
All aspects of any supervised learning scenario, including the prior, the learning algorithm, the data likelihood function, etc., are given by the joint distribution P(f, h, d, c) (where c is the value of the cost function) and its marginals. In particular, in [9] it is proven that the probability of a particular cost value c is given by

$$P(c \mid d) \;=\; \sum_{h, f} P(h \mid d)\, M_{c,d}(h, f)\, P(f \mid d) \tag{14}$$

for a matrix \(M_{c,d}\) that is symmetric in its arguments so long as the loss function is. \(P(f \mid d) \propto P(d \mid f)P(f)\) is the posterior probability that the real world has produced a target f for you to try to learn, given that you only know d. It has nothing to do with your learning algorithm. In contrast, P(h∣d) is the specification of your learning algorithm. It has nothing to do with the distribution of targets f in the real world. So Eq. (14) tells us that as long as the loss function is symmetric, how “aligned” you (the learning algorithm) are with the real world (the posterior) determines how well you will generalize.
This supervised learning inner product formula results in a set of NFL theorems for supervised learning, once one imposes some additional conditions on the loss function. See [9] for details.
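The supervised learning NFL theorems can likewise be checked by brute force on a toy problem. The sketch below (not from [9]; the "majority" and "anti-majority" learners and the four-point input space are hypothetical illustrations) averages off-training-set zero-one loss uniformly over every deterministic binary target, for two learning algorithms that make exactly opposite predictions, and finds the same average error for both.

```python
from itertools import product

X = list(range(4))
Y = [0, 1]
d_X = [0, 1]                            # training-set inputs
ots = [x for x in X if x not in d_X]    # off-training-set inputs
targets = list(product(Y, repeat=len(X)))  # all 16 deterministic targets f

def ots_error(h, f):
    """Average zero-one loss off the training set (uniform P(q))."""
    return sum(h[x] != f[x] for x in ots) / len(ots)

def learn_majority(d):
    # Predict the majority training label at every input.
    y = round(sum(d) / len(d))
    return {x: y for x in X}

def learn_antimajority(d):
    # Deliberately predict the opposite of the majority label.
    y = 1 - round(sum(d) / len(d))
    return {x: y for x in X}

def avg_error(learner):
    total = 0.0
    for f in targets:
        d = [f[x] for x in d_X]   # noise-free training labels
        total += ots_error(learner(d), f)
    return total / len(targets)

e1 = avg_error(learn_majority)
e2 = avg_error(learn_antimajority)
assert e1 == e2 == 0.5  # uniform over targets, both generalize at chance
```

Both learners average exactly 0.5 because, uniformly over targets, the off-training-set labels are independent of the training data; any alignment between P(h∣d) and P(f∣d), per Eq. (14), has been averaged away.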
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this chapter
Wolpert, D.H. (2021). What Is Important About the No Free Lunch Theorems?. In: Pardalos, P.M., Rasskazova, V., Vrahatis, M.N. (eds) Black Box Optimization, Machine Learning, and No-Free Lunch Theorems. Springer Optimization and Its Applications, vol 170. Springer, Cham. https://doi.org/10.1007/978-3-030-66515-9_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66514-2
Online ISBN: 978-3-030-66515-9
eBook Packages: Mathematics and Statistics (R0)