Theory choice, non-epistemic values, and machine learning

Abstract

I use a theorem from machine learning, called the “No Free Lunch” theorem (NFL), to support the claim that non-epistemic values are essential to theory choice. I argue that NFL entails that predictive accuracy is insufficient to favor a given theory over others, and that NFL challenges our ability to give a purely epistemic justification for using other traditional epistemic virtues in theory choice. In addition, I argue that the natural way to overcome NFL’s challenge is to use non-epistemic values. If my argument holds, non-epistemic values are entangled in theory choice regardless of human limitations and regardless of the subject matter. My argument thereby overcomes objections to the main lines of argument revealing the role of values in theory choice. At the end of the paper, I argue that, contrary to a common conception, the epistemic challenge arising from NFL is distinct from Hume’s problem of induction and other forms of underdetermination.

Notes

  1. For a more technical, yet accessible, introduction to machine learning algorithms, see Russell and Norvig (2010).

  2. To be more precise, Wolpert’s (1996) theorem applies to supervised learning algorithms.

  3. This thought experiment is loosely based on the adversary argument from Culberson (1998) and the OR/XOR example from Wilson and Martinez (1997).

  4. Since NFL allows us to use any error measure that is a function only of the relevant values, and prominent distance measures are functions of those same values, we can bring NFL’s results to bear on popular error measures. For example, suppose we use squared Euclidean distance as our error measure for NFL: \(\left| {Y_{F} \left( {\text{x}} \right) - Y_{H} \left( {\text{x}} \right)} \right|^{2}\), where \(Y_{H} \left( {\text{x}} \right)\) is the algorithm’s prediction for input x and \(Y_{F} \left( {\text{x}} \right)\) is the true output. Then, according to NFL, all algorithms have the same average expected error: \(\mathop \sum \nolimits_{{x \in {\text{X}}}} \left| {Y_{F} \left( {\text{x}} \right) - Y_{H} \left( {\text{x}} \right)} \right|^{2} /\left| X \right|\) (where X is the set of all relevant inputs). However, since |X| is just the number of items in X, the quantity \(\mathop \sum \nolimits_{{x \in {\text{X}}}} \left| {Y_{F} \left( {\text{x}} \right) - Y_{H} \left( {\text{x}} \right)} \right|^{2}\) is also the same for all algorithms. But \(\mathop \sum \nolimits_{{x \in {\text{X}}}} \left| {Y_{F} \left( {\text{x}} \right) - Y_{H} \left( {\text{x}} \right)} \right|^{2}\) is the Brier inaccuracy measure. Therefore, the predictions of all algorithms are equally inaccurate relative to the Brier inaccuracy measure.

  5. See Dotan (forthcoming) for more discussion of the implications of the No Free Lunch theorem for the use of accuracy in theory choice.

  6. Based on the OR/XOR example from Wilson and Martinez (1997).

References

  1. Arlot, S., and Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40–79.

  2. Bird, A. (2012). The structure of scientific revolutions and its significance: An essay review of the fiftieth anniversary edition. The British Journal for the Philosophy of Science, 63(4), 859–883. https://doi.org/10.1093/bjps/axs031.

  3. Boghossian, P. A. (2006). Fear of knowledge: Against relativism and constructivism. Oxford: Clarendon Press.

  4. Culberson, J. (1998). On the futility of blind search: An algorithmic view of “no free lunch”. Evolutionary Computation, 6(2), 109–127. https://doi.org/10.1162/evco.1998.6.2.109.

  5. Davidson, D. (1973). On the very idea of a conceptual scheme. Proceedings and Addresses of the American Philosophical Association, 47, 5–20.

  6. Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10), 78–87.

  7. Dotan, R. (forthcoming). What can we learn about accuracy from machine learning? Philosophy of Science.

  8. Douglas, H. (2009). Science, policy, and the value free ideal. Pittsburgh: University of Pittsburgh Press.

  9. Elliott, K., & Steel, D. (Eds.). (2017). Current controversies in values and science. Oxford: Taylor & Francis.

  10. Fernández-Delgado, M., Cernadas, E., Barro, S., et al. (2014). Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15, 3133–3181.

  11. Giraud-Carrier, C., & Provost, F. (2005). Toward a justification of meta-learning: Is the no free lunch theorem a show-stopper. In Proceedings of the ICML-2005 Workshop on Meta-Learning.

  12. Gómez, D., & Rojas, A. (2015). An empirical overview of the no free lunch theorem and its effect on real-world machine learning classification. Neural Computation, 28, 105.

  13. Henderson, L. (2020). The problem of induction. In E. N. Zalta (Ed.), The Stanford encyclopedia of philosophy. https://plato.stanford.edu/archives/spr2020/entries/induction-problem.

  14. Igel, C., & Toussaint, M. (2005). A no-free-lunch theorem for non-uniform distributions of target functions. Journal of Mathematical Modelling and Algorithms, 3(4), 313–322.

  15. Korb, K. B. (2004). Introduction: Machine learning as philosophy of science. Minds and Machines, 14(4), 433–440. https://doi.org/10.1023/B:MIND.0000045986.90956.7f.

  16. Kuhn, T. S. (1962). The structure of scientific revolutions. Chicago: The University of Chicago Press.

  17. Lacey, H. (1999). Is science value free? Values and scientific understanding. London: Routledge.

  18. Lacey, H. (2017). Distinguishing between cognitive and social values. In K. Elliott & D. Steel (Eds.), Current controversies in values and science. New York: Routledge.

  19. Lattimore, T., & Hutter, M. (2011). No free lunch versus Occam’s razor in supervised learning. [ArXiv preprint available at arXiv:1111.3846].

  20. Lauc, D. (2018). How gruesome are the no-free-lunch theorems for machine learning? Croatian Journal of Philosophy, 18(54), 479–485.

  21. Laudan, L. (1990). Science and relativism: Some key controversies in the philosophy of science. Chicago: The University of Chicago Press.

  22. Levi, I. (1962). On the seriousness of mistakes. Philosophy of Science, 29(1), 47–65.

  23. Lipton, P. (2004). Inference to the best explanation (2nd ed.). New York: Routledge.

  24. Longino, H. (1990). Science as social knowledge: Values and objectivity in scientific inquiry. Princeton: Princeton University Press.

  25. Longino, H. (1996). Cognitive and non-cognitive values in science: Rethinking the dichotomy. In L. H. Nelson & J. Nelson (Eds.), Feminism, science, and the philosophy of science (pp. 39–58). New York: Kluwer Academic Publishers.

  26. Longino, H. (2002). The fate of knowledge. Princeton: Princeton University Press.

  27. Longino, H. (2014). Values, heuristics, and politics of knowledge. In M. Carrier (Ed.), The challenge of the social and the pressure of the practice: Science and values revisited. Pittsburgh: University of Pittsburgh Press.

  28. McMullin, E. (1982). Values in science. PSA Proceedings of the Biennial Meeting of the Philosophy of Science Association, 2, 3–28.

  29. Montañez, G. D. (2017). Why machine learning works. Carnegie Mellon.

  30. Okruhlik, K. (1994). Gender and the biological sciences. Canadian Journal of Philosophy, 24(sup1), 21–42.

  31. Pettigrew, R. (2016). Accuracy and the laws of credence. Oxford: Oxford University Press.

  32. Rolin, K. (2017). Can social diversity be best incorporated into science by adopting the social value management ideal? In K. Elliott & D. Steel (Eds.), Current controversies in values and science (pp. 113–129). New York: Routledge.

  33. Rudner, R. (1953). The scientist qua scientist makes value judgments. Philosophy of Science, 20(1), 1–6.

  34. Russell, S., & Norvig, P. (2010). Artificial intelligence: A modern approach (3rd ed.). New Jersey: Pearson Education Inc.

  35. Schaffer, C. (1993a). Overfitting avoidance as bias. Machine Learning, 10(2), 153–178.

  36. Schaffer, C. (1993b). Selecting a classification method by cross validation. Machine Learning, 13(1), 135–143.

  37. Schaffer, C. (1994). A conservation law for generalization performance. In Machine learning: Proceedings of the eleventh international conference.

  38. Steel, D. (2013). Acceptance, values, and inductive risk. Philosophy of Science, 80(5), 818–828. https://doi.org/10.1086/673936.

  39. Strawson, P. F. (1952). Introduction to logical theory. London: Methuen.

  40. Swinburne, R. (1997). Simplicity as evidence for truth. Milwaukee: Marquette University Press.

  41. The Biology and Gender Study Group. (1988). The importance of feminist critique for contemporary cell biology. Hypatia, 3(1), 61–76.

  42. Toulmin, S. (1970). Does the distinction between normal and revolutionary science hold water? In I. Lakatos & A. Musgrave (Eds.), Criticism and the growth of knowledge. Cambridge: Cambridge University Press.

  43. van Fraassen, B. C. (1980). The scientific image. New York: Oxford University Press.

  44. Wilson, D. R., & Martinez, T. R. (1997). Bias and the probability of generalization. In Proceedings Intelligent Information Systems. IIS’97 (pp. 108–114). https://doi.org/10.1109/iis.1997.645199.

  45. Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1391–1420. https://doi.org/10.1162/neco.1996.8.7.1391.

  46. Wolpert, D. H. (2012). What the no free lunch theorems really mean; how to improve search algorithms (pp. 1–13).

  47. Wolpert, D. H. (1993). On overfitting avoidance as bias. Technical Report SFI TR 92-03-5001. Santa Fe, NM: The Santa Fe Institute.

Acknowledgements

For extensive feedback on this paper, I would like to thank Lara Buchak and Shamik Dasgupta. For comments on earlier drafts, I would like to thank Greyson Abid, Michael Arsenault, Nick French, Alvin Goldman, Tyler Haddow, Daniel Harman, Dan Hicks, John MacFarlane, Sven Neth, Emily Perry, Daniel Warren, and two anonymous referees for Synthese. For extensive conversations, I thank Gil Rosenthal. I am also grateful for comments and discussion from the conferences where versions of this paper were presented, including the 2020 Eastern APA, the 2019 Congress on Logic, Methodology, and Philosophy of Science and Technology, the 2019 Canadian Society for the History and Philosophy of Science conference, the 2019 Society of Exact Philosophy conference, the 2019 Values in Medicine, Science, and Technology conference, and the 2018 Philosophy of Science Association conference.

Author information

Corresponding author

Correspondence to Ravit Dotan.

Appendix: The no free lunch theorem(s)

“No Free Lunch” is the name of a family of theorems. The theorems differ, among other things, in the kinds of algorithms they consider. For example, No Free Lunch theorems were initially proven for optimization algorithms (Wolpert and Macready 1993). Wolpert (1996, 2001, 2012) proves No Free Lunch theorems for supervised learning algorithms, and these are what I have focused on in this paper. Schaffer (1994) gives an elegant formulation of Wolpert’s main No Free Lunch theorem for classification learning algorithms, based on a preprint of Wolpert (1996). In this appendix, I state Schaffer’s formulation to illustrate more formally what NFL theorems say (Schaffer calls it the “Law of Conservation of Generalization Performance”). See Montañez (2017, chapter 2) for a review of various No Free Lunch results, and see Schaffer (1994) and Wolpert (1996) for proofs of the theorem that I state here.

We start by defining cases in a classification problem. Each case in a classification problem, \(A_{i}\), is a vector of attributes. For simplicity, we assume that each component in the vector is a finite number. \(\left\{ {A_{1} , \ldots ,A_{m} } \right\}\) is the set of all possible attribute vectors, where m is finite. C is a class probability vector, which defines the relationship between attribute vectors and classes. Each component of C, \(C_{i}\), is the probability that a case with attribute vector \(A_{i}\) belongs to class 1. We assume that data are generated in the same way when training and when testing a learner: attribute vectors are sampled with replacement according to an arbitrary distribution D, and a class is assigned to each of them using C. We also assume that the training set contains n samples. A learning situation S is a triple (D, C, n).

The Generalization Accuracy of a learner L (\(GA_{L}\)) is the expected prediction performance of the learner on cases with attribute vectors not represented in the training set. For example, the generalization accuracy of a random guesser in a two-class problem is 1/2 for every D and C. We use the generalization accuracy of a random guesser as a baseline and define the Generalization Performance of a learner (\(GP_{L}\)) as the difference between its generalization accuracy and the generalization accuracy of a random guesser:

$$GP_{L} = GA_{L} - GA_{random\,guesser}$$

Generalization performance greater than zero means better than chance performance. \(GP_{L} \left( S \right)\) is the generalization performance of learner L in learning situation S.
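These definitions can be made concrete with a small noise-free example. The following Python sketch is an illustration under stated assumptions, not code from Schaffer (1994): the names `majority_learner` and `generalization_accuracy`, the three-vector attribute space, and the uniform choice of D are all hypothetical, and n is assumed to be smaller than m so that at least one attribute vector is always absent from the training set.

```python
from fractions import Fraction
from itertools import product


def majority_learner(train, query):
    """Hypothetical learner: predict the majority training class (ties -> 1)."""
    ones = sum(label for _, label in train)
    return 1 if 2 * ones >= len(train) else 0


def generalization_accuracy(learner, D, C, n):
    """Expected accuracy on attribute vectors absent from the training set.

    D: list of Fractions; D[i] is the probability of drawing attribute vector i.
    C: tuple of 0/1 classes (noise-free case); C[i] is the class of vector i.
    n: training-set size, assumed smaller than len(D).
    """
    m = len(D)
    total = Fraction(0)
    for xs in product(range(m), repeat=n):  # every ordered training sample
        p_xs = Fraction(1)
        for i in xs:
            p_xs *= D[i]  # sampling with replacement according to D
        train = [(i, C[i]) for i in xs]
        unseen = [i for i in range(m) if i not in xs]
        mass = sum(D[i] for i in unseen)
        hits = sum(D[i] for i in unseen if learner(train, i) == C[i])
        total += p_xs * hits / mass  # D-weighted accuracy on unseen vectors
    return total


D = [Fraction(1, 3)] * 3          # assumed uniform distribution over 3 vectors
ga = generalization_accuracy(majority_learner, D, (1, 1, 1), 2)
gp = ga - Fraction(1, 2)          # baseline: random guesser has accuracy 1/2
print(ga, gp)
```

In this situation every attribute vector belongs to class 1, which suits the majority learner perfectly: its generalization accuracy is 1 and its generalization performance is 1/2.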

Using the notation above we can write Schaffer’s Law of Conservation of Generalization Performance:

$$\mathop \sum \limits_{S} GP_{L} \left( S \right) = 0,\;{\text{for}}\;{\text{every}}\;{\text{D}},{\text{n}}$$

In words, this law says that any positive performance by a learner in a certain learning situation must be exactly balanced by negative performance in other learning situations.
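The balancing described here can be checked by brute force in a tiny noise-free setting. The sketch below is only an illustration: the three learners, the function names, the particular D, and the choice m = 3, n = 2 are assumptions made for the example, not part of Schaffer's presentation. It holds D and n fixed, sums generalization performance over all \(2^{m}\) class vectors C, and uses exact rational arithmetic so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product


def majority(train, query):
    """Predict the majority training class (ties -> 1)."""
    ones = sum(label for _, label in train)
    return 1 if 2 * ones >= len(train) else 0


def anti_majority(train, query):
    """Predict the opposite of the majority learner."""
    return 1 - majority(train, query)


def always_one(train, query):
    """Ignore the data and always predict class 1."""
    return 1


def generalization_performance(learner, D, C, n):
    """GP_L(S) = GA_L(S) - 1/2 for the noise-free situation S = (D, C, n)."""
    m = len(D)
    ga = Fraction(0)
    for xs in product(range(m), repeat=n):  # every ordered training sample
        p_xs = Fraction(1)
        for i in xs:
            p_xs *= D[i]
        train = [(i, C[i]) for i in xs]
        unseen = [i for i in range(m) if i not in xs]  # non-empty since n < m
        mass = sum(D[i] for i in unseen)
        hits = sum(D[i] for i in unseen if learner(train, i) == C[i])
        ga += p_xs * hits / mass
    return ga - Fraction(1, 2)


def conservation_sum(learner, D, n):
    """Sum GP_L over all 2^m noise-free class vectors C, with D and n fixed."""
    return sum(generalization_performance(learner, D, C, n)
               for C in product((0, 1), repeat=len(D)))


D = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]  # assumed non-uniform D
for learner in (majority, anti_majority, always_one):
    print(learner.__name__, conservation_sum(learner, D, n=2))
```

For each of the three learners the sum is exactly zero, even though individual situations differ: the majority learner, for instance, has positive generalization performance when all vectors share a class and negative performance elsewhere.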

If we allow for the possibility of noise, then the law is properly written with an integral instead of a summation:

$$\mathop \int \limits_{S} GP_{L} \left( S \right)dS = 0,\;{\text{for}}\;{\text{every}}\;{\text{D}},{\text{n}}$$

In this case, the components of C are taken from the real interval [0,1] and the integral runs over the space \([0,1]^{m}\) of class probability vectors. Without noise, the components of C are taken from {0,1} and the summation runs over the \(2^{m}\) possible class probability vectors.
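A brief sketch of why the integral vanishes may help; it is offered as an illustration rather than as Schaffer's proof. Assume the training set is held fixed and write \(h_{i} \in \left\{ {0,1} \right\}\) (notation introduced here, not in the source) for the learner's prediction on an attribute vector \(A_{i}\) that is absent from the training set. The prediction cannot depend on \(C_{i}\), so the probability of a correct prediction is \(C_{i}\) when \(h_{i} = 1\) and \(1 - C_{i}\) when \(h_{i} = 0\). Integrating over \(C_{i}\) gives

$$\mathop \int \limits_{0}^{1} \left[ {h_{i} C_{i} + \left( {1 - h_{i} } \right)\left( {1 - C_{i} } \right)} \right]dC_{i} = \frac{1}{2}$$

for either value of \(h_{i}\). On this reading, the accuracy on each unseen case, averaged over the possible values of \(C_{i}\), equals that of the random guesser, which is why the generalization performance integrates to zero.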

From the conservation law, it follows that all learners have the same average generalization performance if we average over all possible learning situations (or, as I put it, that all algorithms have the same expected error if we make no assumptions about the problem we are trying to solve). Here’s why.

For any learner:

$$\mathop \sum \limits_{S} GP_{L} \left( S \right) = \mathop \sum \limits_{S} \left( {GA_{L} \left( S \right) - GA_{random\,guesser} \left( S \right)} \right) = 0$$

Add \(\mathop \sum \nolimits_{S} GA_{random\,guesser} \left( S \right)\) to both sides and get:

$$\mathop \sum \limits_{S} GA_{L} \left( S \right) = \mathop \sum \limits_{S} GA_{random \,guesser} \left( S \right)$$

Divide by the number of learning situations, \(\# S\), to get the formulation used in this paper: the average generalization accuracy is the same for every learner L, and in particular equal to that of the random guesser:

$$\frac{{\mathop \sum \nolimits_{S} GA_{L} \left( S \right)}}{\# S} = \frac{{\mathop \sum \nolimits_{S} GA_{random\, guesser} \left( S \right)}}{\# S}$$

About this article

Cite this article

Dotan, R. Theory choice, non-epistemic values, and machine learning. Synthese (2020). https://doi.org/10.1007/s11229-020-02773-2

Keywords

  • Theory choice
  • Epistemic values
  • No free lunch theorem
  • Machine learning
  • General philosophy of science