## Abstract

I use a theorem from machine learning, called the “No Free Lunch” theorem (NFL), to support the claim that non-epistemic values are essential to theory choice. I argue that NFL entails that predictive accuracy is insufficient to favor a given theory over others, and that NFL challenges our ability to give a purely epistemic justification for using other traditional epistemic virtues in theory choice. In addition, I argue that the natural way to overcome NFL’s challenge is to use non-epistemic values. If my argument holds, non-epistemic values are entangled in theory choice regardless of human limitations and regardless of the subject matter. My argument thereby overcomes objections to the main lines of argument revealing the role of values in theory choice. At the end of the paper, I argue that, contrary to common conception, the epistemic challenge arising from NFL is distinct from Hume’s problem of induction and from other forms of underdetermination.


## Notes

- 1.
For a more technical, yet accessible, introduction to machine learning algorithms see Russell and Norvig (2010).

- 2.
To be more precise, Wolpert’s (1996) theorem applies to supervised learning algorithms.

- 3.
- 4.
Since NFL allows any error measure that is a function only of the relevant values, and prominent distance measures are also functions of the same values, we can manipulate NFL’s results to bear on popular error measures. For example, suppose we use squared Euclidean distance as our error measure for NFL: \(\left| {Y_{F} \left( x \right) - Y_{H} \left( x \right)} \right|^{2}\), where \(Y_{H} \left( x \right)\) is the algorithm’s prediction for input x and \(Y_{F} \left( x \right)\) is the true output. Then, according to NFL, all algorithms have the same average expected error: \(\mathop \sum \nolimits_{x \in X} \left| {Y_{F} \left( x \right) - Y_{H} \left( x \right)} \right|^{2} /\left| X \right|\) (where X is the set of all relevant inputs). However, since \(\left| X \right|\) is just the number of items in X, the quantity \(\mathop \sum \nolimits_{x \in X} \left| {Y_{F} \left( x \right) - Y_{H} \left( x \right)} \right|^{2}\) is also the same for all algorithms. But \(\mathop \sum \nolimits_{x \in X} \left| {Y_{F} \left( x \right) - Y_{H} \left( x \right)} \right|^{2}\) is the Brier inaccuracy measure. Therefore, the predictions of all algorithms are equally inaccurate relative to the Brier inaccuracy measure.

- 5.
See Dotan (forthcoming) for more discussion of the implications of the No Free Lunch theorem for the use of accuracy in theory choice.

- 6.
Based on the OR/XOR example from Wilson and Martinez (1997).
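The manipulation in note 4 can be checked numerically. The following sketch (my illustration, not from the paper) enumerates every possible target function on a three-element input set and shows that two arbitrary fixed predictors accumulate the same average Brier inaccuracy once we average over all targets; the particular predictors H1 and H2 are assumed for the example:

```python
from itertools import product

# Toy domain of three inputs. A "target" F assigns a binary label to each input;
# we enumerate all 2**3 possible targets. H1 and H2 are two arbitrary fixed
# predictors (their particular values are illustrative assumptions).
X = [0, 1, 2]
H1 = {0: 0, 1: 0, 2: 0}   # always predicts 0
H2 = {0: 1, 1: 0, 2: 1}   # some other fixed rule

def brier(F, H):
    """Sum of squared errors over all inputs: the Brier inaccuracy measure."""
    return sum((F[x] - H[x]) ** 2 for x in X)

targets = [dict(zip(X, labels)) for labels in product([0, 1], repeat=len(X))]

# Averaged over every possible target, both predictors are equally inaccurate.
avg1 = sum(brier(F, H1) for F in targets) / len(targets)
avg2 = sum(brier(F, H2) for F in targets) / len(targets)
print(avg1, avg2)  # 1.5 1.5  (= |X| / 2)
```

Any other fixed predictor yields the same average, \(\left| X \right|/2\), because for each input exactly half of the possible targets disagree with its prediction.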

## References

Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. *Statistics Surveys*, 4, 40–79.

Bird, A. (2012). The structure of scientific revolutions and its significance: An essay review of the fiftieth anniversary edition. *The British Journal for the Philosophy of Science*, 63(4), 859–883. https://doi.org/10.1093/bjps/axs031.

Boghossian, P. A. (2006). *Fear of knowledge: Against relativism and constructivism*. Oxford: Clarendon Press.

Culberson, J. (1998). On the futility of blind search: An algorithmic view of “no free lunch”. *Evolutionary Computation*, 6(2), 109–127. https://doi.org/10.1162/evco.1998.6.2.109.

Davidson, D. (1973). On the very idea of a conceptual scheme. *Proceedings and Addresses of the American Philosophical Association*, 47, 5–20.

Domingos, P. (2012). A few useful things to know about machine learning. *Communications of the ACM*, 55(10), 78–87.

Dotan, R. (forthcoming). What can we learn about accuracy from machine learning? *Philosophy of Science*.

Douglas, H. (2009). *Science, policy, and the value-free ideal*. Pittsburgh: University of Pittsburgh Press.

Elliott, K., & Steel, D. (Eds.). (2017). *Current controversies in values and science*. Oxford: Taylor & Francis.

Fernández-Delgado, M., Cernadas, E., Barro, S., et al. (2014). Do we need hundreds of classifiers to solve real world classification problems? *Journal of Machine Learning Research*, 15, 3133–3181.

Giraud-Carrier, C., & Provost, F. (2005). Toward a justification of meta-learning: Is the no free lunch theorem a show-stopper? In *Proceedings of the ICML-2005 Workshop on Meta-Learning*.

Gómez, D., & Rojas, A. (2015). An empirical overview of the no free lunch theorem and its effect on real-world machine learning classification. *Neural Computation*, 28, 105.

Henderson, L. (2020). The problem of induction. In E. N. Zalta (Ed.), *The Stanford encyclopedia of philosophy* (Spring 2020 ed.). https://plato.stanford.edu/archives/spr2020/entries/induction-problem.

Igel, C., & Toussaint, M. (2005). A no-free-lunch theorem for non-uniform distributions of target functions. *Journal of Mathematical Modelling and Algorithms*, 3(4), 313–322.

Korb, K. B. (2004). Introduction: Machine learning as philosophy of science. *Minds and Machines*, 14(4), 433–440. https://doi.org/10.1023/B:MIND.0000045986.90956.7f.

Kuhn, T. S. (1962). *The structure of scientific revolutions*. Chicago: The University of Chicago Press.

Lacey, H. (1999). *Is science value free? Values and scientific understanding*. London: Routledge.

Lacey, H. (2017). Distinguishing between cognitive and social values. In K. Elliott & D. Steel (Eds.), *Current controversies in values and science*. New York: Routledge.

Lattimore, T., & Hutter, M. (2011). No free lunch versus Occam’s razor in supervised learning. arXiv preprint arXiv:1111.3846.

Lauc, D. (2018). How gruesome are the no-free-lunch theorems for machine learning? *Croatian Journal of Philosophy*, 18(54), 479–485.

Laudan, L. (1990). *Science and relativism: Some key controversies in the philosophy of science*. Chicago: The University of Chicago Press.

Levi, I. (1962). On the seriousness of mistakes. *Philosophy of Science*, 29(1), 47–65.

Lipton, P. (2004). *Inference to the best explanation* (2nd ed.). New York: Routledge.

Longino, H. (1990). *Science as social knowledge: Values and objectivity in scientific inquiry*. Princeton: Princeton University Press.

Longino, H. (1996). Cognitive and non-cognitive values in science: Rethinking the dichotomy. In L. H. Nelson & J. Nelson (Eds.), *Feminism, science, and the philosophy of science* (pp. 39–58). New York: Kluwer Academic Publishers.

Longino, H. (2002). *The fate of knowledge*. Princeton: Princeton University Press.

Longino, H. (2014). Values, heuristics, and politics of knowledge. In M. Carrier (Ed.), *The challenge of the social and the pressure of practice: Science and values revisited*. Pittsburgh: University of Pittsburgh Press.

McMullin, E. (1982). Values in science. *PSA: Proceedings of the Biennial Meeting of the Philosophy of Science Association*, 2, 3–28.

Montañez, G. D. (2017). *Why machine learning works*. Doctoral dissertation, Carnegie Mellon University.

Okruhlik, K. (1994). Gender and the biological sciences. *Canadian Journal of Philosophy*, 24(sup1), 21–42.

Pettigrew, R. (2016). *Accuracy and the laws of credence*. Oxford: Oxford University Press.

Rolin, K. (2017). Can social diversity be best incorporated into science by adopting the social value management ideal? In K. C. Elliott & D. Steel (Eds.), *Current controversies in values and science* (pp. 113–129). New York: Routledge.

Rudner, R. (1953). The scientist qua scientist makes value judgments. *Philosophy of Science*, 20(1), 1–6.

Russell, S., & Norvig, P. (2010). *Artificial intelligence: A modern approach* (3rd ed.). New Jersey: Pearson Education.

Schaffer, C. (1993a). Overfitting avoidance as bias. *Machine Learning*, 10(2), 153–178.

Schaffer, C. (1993b). Selecting a classification method by cross-validation. *Machine Learning*, 13(1), 135–143.

Schaffer, C. (1994). A conservation law for generalization performance. In *Machine learning: Proceedings of the eleventh international conference*.

Steel, D. (2013). Acceptance, values, and inductive risk. *Philosophy of Science*, 80(5), 818–828. https://doi.org/10.1086/673936.

Strawson, P. F. (1952). *Introduction to logical theory*. London: Methuen.

Swinburne, R. (1997). *Simplicity as evidence for truth*. Milwaukee: Marquette University Press.

The Biology and Gender Study Group. (1988). The importance of feminist critique for contemporary cell biology. *Hypatia*, 3(1), 61–76.

Toulmin, S. (1970). Does the distinction between normal and revolutionary science hold water? In I. Lakatos & A. Musgrave (Eds.), *Criticism and the growth of knowledge*. Cambridge: Cambridge University Press.

van Fraassen, B. C. (1980). *The scientific image*. New York: Oxford University Press.

Wilson, D. R., & Martinez, T. R. (1997). Bias and the probability of generalization. In *Proceedings of Intelligent Information Systems (IIS’97)* (pp. 108–114). https://doi.org/10.1109/iis.1997.645199.

Wolpert, D. H. (1993). *On overfitting avoidance as bias*. Technical Report SFI TR 92-03-5001. Santa Fe, NM: The Santa Fe Institute.

Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. *Neural Computation*, 8(7), 1391–1420. https://doi.org/10.1162/neco.1996.8.7.1391.

Wolpert, D. H. (2012). What the no free lunch theorems really mean; how to improve search algorithms (pp. 1–13).

## Acknowledgements

For extensive feedback on this paper, I would like to thank Lara Buchak and Shamik Dasgupta. For comments on earlier drafts, I would like to thank Greyson Abid, Michael Arsenault, Nick French, Alvin Goldman, Tyler Haddow, Daniel Harman, Dan Hicks, John MacFarlane, Sven Neth, Emily Perry, Daniel Warren, and two anonymous referees for Synthese. For extensive conversations, I thank Gil Rosenthal. I am also grateful for comments and discussion from the conferences where versions of this paper were presented, including the 2020 Eastern APA, the 2019 Congress on Logic, Methodology and Philosophy of Science and Technology, the 2019 Canadian Society for the History and Philosophy of Science conference, the 2019 Society of Exact Philosophy conference, the 2019 Values in Medicine, Science, and Technology conference, and the 2018 Philosophy of Science Association conference.


## Additional information

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Appendix: The no free lunch theorem(s)


“No Free Lunch” is the name of a family of theorems. No Free Lunch theorems differ, among other things, in the kinds of algorithms they consider. For example, the initial No Free Lunch theorems were proven for optimization algorithms (Wolpert and Macready 1993). Wolpert (1996, 2001, 2012) proves No Free Lunch theorems for supervised learning algorithms, and these are what I have focused on in this paper. Schaffer (1994) gives an elegant formulation of Wolpert’s main No Free Lunch theorem for classification learning algorithms, based on a preprint of Wolpert (1996). In this appendix, I state Schaffer’s formulation to illustrate more formally what NFL theorems say (Schaffer calls it the “Law of Conservation of Generalization Performance”). See Montañez (2017, chapter 2) for a review of various No Free Lunch results, and see Schaffer (1994) and Wolpert (1996) for a proof of the theorem stated here.

We start by defining cases in a classification problem. Each case in a classification problem, \(A_{i}\), is a vector of attributes. For simplicity, we assume that each component in the vector is a finite number. \(\left\{ {A_{1} , \ldots ,A_{m} } \right\}\) is the set of all possible attribute vectors, where m is finite. C is a class probability vector, which defines the relationship between attribute vectors and classes. Each component of C, \(C_{i}\), is the probability that a case with attribute vector \(A_{i}\) belongs to class 1. We assume that data are generated in the same way in training and in testing: attribute vectors are sampled with replacement according to an arbitrary distribution D, and a class is assigned to each using C. We also assume that the training set contains n samples. A learning situation S is a triple (D, C, n).

The Generalization Accuracy of a learner (\(GA_{L}\)) is the expected prediction performance of a learner on cases with attribute vectors not represented in the training set. For example, the generalization accuracy of a random guesser in a two-class problem is 1/2 for every D and C. We use the generalization accuracy of a random guesser as a baseline and define the Generalization Performance of a learner (\(GP_{L}\)) as the difference between its generalization accuracy and the generalization accuracy of a random guesser:

\(GP_{L} \left( S \right) = GA_{L} \left( S \right) - GA_{random\,guesser} \left( S \right)\)

Generalization performance greater than zero means better than chance performance. \(GP_{L} \left( S \right)\) is the generalization performance of learner L in learning situation S.

Using the notation above, we can write Schaffer’s Law of Conservation of Generalization Performance:

\(\mathop \sum \nolimits_{C} GP_{L} \left( {D,C,n} \right) = 0\), for every learner L, distribution D, and training-set size n.

In words, this law says that any positive performance by a learner in a certain learning situation must be exactly balanced by negative performance in other learning situations.

If we allow for the possibility of noise, then the law is properly written with an integral instead of a summation:

\(\mathop \int \nolimits_{{\left[ {0,1} \right]^{m} }} GP_{L} \left( {D,C,n} \right)\,dC = 0\)

In this case, the components of C are taken from the real interval [0,1] and the integral runs over the space \(\left[ {0,1} \right]^{m}\) of class probability vectors. Without noise, the components of C are taken from {0,1} and the summation runs over the \(2^{m}\) possible class probability vectors.
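The noise-free law can be checked exhaustively on a toy problem. The following sketch is my own illustration, not the paper’s: the particular learner and the train/test split are assumed for the example. It enumerates all \(2^{m}\) class vectors for m = 4, trains a simple majority-class learner on two inputs, and confirms that its generalization performance over the remaining off-training-set inputs sums to exactly zero:

```python
from itertools import product

# Noise-free toy problem: m = 4 attribute vectors, class vectors C in {0,1}^4.
# The learner sees inputs 0 and 1 with their true labels; generalization
# performance is measured on the off-training-set inputs 2 and 3.
m = 4
train_inputs, test_inputs = [0, 1], [2, 3]

def majority_learner(train_labels):
    """Predict the majority class in the training set (ties go to class 1)."""
    guess = 1 if 2 * sum(train_labels) >= len(train_labels) else 0
    return lambda x: guess

def gp(learner, C):
    """Generalization performance: off-training-set accuracy minus 1/2."""
    predict = learner([C[i] for i in train_inputs])
    ga = sum(predict(x) == C[x] for x in test_inputs) / len(test_inputs)
    return ga - 0.5

# Summed over all 2**m class vectors, the learner's GP is exactly zero.
total = sum(gp(majority_learner, C) for C in product([0, 1], repeat=m))
print(total)  # 0.0
```

Swapping in any other learner leaves the total at zero, since the learner’s prediction depends only on the training labels while the off-training-set labels vary freely over all class vectors.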

From the conservation law, it follows that all learners have the same average generalization performance if we average over all possible learning situations (or, as I put it, that all algorithms have the same expected error if we make no assumptions about the problem we are trying to solve). Here’s why.

For any learner, summing the conservation law over all learning situations gives:

\(\mathop \sum \nolimits_{S} GP_{L} \left( S \right) = \mathop \sum \nolimits_{S} \left[ {GA_{L} \left( S \right) - GA_{random\,guesser} \left( S \right)} \right] = 0\)

Add \(\mathop \sum \nolimits_{S} GA_{random\,guesser} \left( S \right)\) to both sides and get:

\(\mathop \sum \nolimits_{S} GA_{L} \left( S \right) = \mathop \sum \nolimits_{S} GA_{random\,guesser} \left( S \right)\)

Divide both sides by the number of learning situations, N, and we get the formulation used in this paper, namely that the average generalization performance of every learner L is the same, and in particular equal to that of the random guesser:

\(\mathop \sum \nolimits_{S} GA_{L} \left( S \right)/N = \mathop \sum \nolimits_{S} GA_{random\,guesser} \left( S \right)/N\)
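The averaged formulation can be illustrated the same way. In this sketch (again my own toy setup, not from the paper), three different learners, one that always predicts 0, one that copies the first training label, and one that negates it, all attain average generalization accuracy exactly 1/2, matching the random guesser:

```python
from itertools import product

# Same toy setup as the conservation law: train on inputs 0 and 1,
# evaluate on the off-training-set inputs 2 and 3 (names are illustrative).
train_inputs, test_inputs = [0, 1], [2, 3]

def constant_zero(train_labels):
    return lambda x: 0                      # ignores the data entirely

def copy_first(train_labels):
    first = train_labels[0]
    return lambda x: first                  # projects the first training label

def anti_first(train_labels):
    first = 1 - train_labels[0]
    return lambda x: first                  # negates the first training label

def avg_ga(learner):
    """Generalization accuracy averaged over all 2**4 noise-free targets."""
    targets = list(product([0, 1], repeat=4))
    total = 0.0
    for C in targets:
        predict = learner([C[i] for i in train_inputs])
        total += sum(predict(x) == C[x] for x in test_inputs) / len(test_inputs)
    return total / len(targets)

results = [avg_ga(L) for L in (constant_zero, copy_first, anti_first)]
print(results)  # [0.5, 0.5, 0.5]
```

However sensible or perverse the learner, averaging over all class vectors washes out any advantage, which is exactly the point the appendix makes formally.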


## About this article

### Cite this article

Dotan, R. Theory choice, non-epistemic values, and machine learning.
*Synthese* (2020). https://doi.org/10.1007/s11229-020-02773-2


### Keywords

- Theory choice
- Epistemic values
- No free lunch theorem
- Machine learning
- General philosophy of science