Between the Devil and the Deep Blue Sea: Tensions Between Scientific Judgement and Statistical Model Selection


Discussions of model selection in the psychological literature typically frame the issues as a question of statistical inference, with the goal being to determine which model makes the best predictions about data. Within this setting, advocates of leave-one-out cross-validation and Bayes factors disagree on precisely which prediction problem model selection questions should aim to answer. In this comment, I discuss some of these issues from a scientific perspective. What goal does model selection serve when all models are known to be systematically wrong? How might “toy problems” tell a misleading story? How does the scientific goal of explanation align with (or differ from) traditional statistical concerns? I do not offer answers to these questions, but hope to highlight the reasons why psychological researchers cannot avoid asking them.




  1. While there are many people who assert that “a single failure is enough to falsify a theory”, I confess I have not yet encountered anyone willing to truly follow this principle in real life.

  2. For instance, Gelman et al. (2003, pp. 586–587) present an analogous convergence result for the posterior distribution P(𝜃|x) within a single model ℳ. The result generalises to the Bayes factor by noting that the Bayes factor identifies a model ℳ with the prior predictive distribution P(x|ℳ). Substituting P(x|ℳ) for the role of P(x|𝜃) in their derivation produces the necessary result.

  3. For the purposes of full disclosure, I should note that the precise situation from Lee and Navarro (2002) is quite a bit more complex than this description implies, and several details about how we had to adapt a model from one context to be applicable to the other have been omitted.
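
The substitution described in footnote 2 can be made concrete with the standard definitions, sketched below in LaTeX. The symbols \mathcal{M}, \mathcal{M}_1, \mathcal{M}_2 and \mathrm{BF}_{12} are notation assumed here for illustration and are not taken from the footnote itself.

```latex
% Prior predictive (marginal likelihood) of data x under a model M:
% the likelihood averaged over the prior on the parameters.
P(x \mid \mathcal{M}) = \int P(x \mid \theta, \mathcal{M}) \, P(\theta \mid \mathcal{M}) \, d\theta

% Bayes factor for M_1 against M_2: a ratio of prior predictive probabilities.
\mathrm{BF}_{12} = \frac{P(x \mid \mathcal{M}_1)}{P(x \mid \mathcal{M}_2)}

% Posterior model odds equal the Bayes factor times the prior model odds.
\frac{P(\mathcal{M}_1 \mid x)}{P(\mathcal{M}_2 \mid x)}
  = \mathrm{BF}_{12} \times \frac{P(\mathcal{M}_1)}{P(\mathcal{M}_2)}
```

Read this way, each model plays the role that a single parameter value 𝜃 plays in the within-model result: it assigns a probability to the data, and Bayes' rule updates belief over models in the same way that it updates belief over parameter values.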


  1. Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B.N. Petrov & F. Csaki (Eds.), Second international symposium on information theory (pp. 267–281). Budapest: Akademiai Kiado.

  2. Bernardo, J.M., & Smith, A.F.M. (2000). Bayesian theory, 2nd Edn. New York: John Wiley & Sons.

  3. Box, G.E.P. (1976). Science and statistics. Journal of the American Statistical Association, 71, 791–799.

  4. Browne, M. (2000). Cross-validation methods. Journal of Mathematical Psychology, 44, 108–132.

  5. Devezer, B., Nardin, L.G., Baumgaertner, B., Buzbas, E. (under review). Discovery of truth is not implied by reproducibility but facilitated by innovation and epistemic diversity in a model-centric framework. Manuscript submitted for publication. arXiv:1803.10118.

  6. Edwards, W., Lindman, H., Savage, L.J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70, 193–242.

  7. Gelman, A., Carlin, J.B., Stern, H.S., Rubin, D.B. (2003). Bayesian Data Analysis, 2nd Edn. Boca Raton: Chapman & Hall/CRC.

  8. Grünwald, P. (2007). The minimum description length principle. Cambridge: MIT Press.

  9. Gronau, Q., & Wagenmakers, E.J. (2018). Limitations of Bayesian leave-one-out cross-validation for model selection. Computational Brain and Behavior.

  10. Hayes, B.K., Banner, S., Forrester, S., Navarro, D.J. (under review). Sampling frames and inductive inference with censored evidence. Manuscript submitted for publication.

  11. Kamin, L.J. (1969). Predictability, surprise, attention, and conditioning. In B.A. Campbell & R.M. Church (Eds.), Punishment and Aversive Behavior (pp. 279–296). New York: Appleton-Century-Crofts.

  12. Kruschke, J.K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99(1), 22–44.

  13. Lake, B.M., Salakhutdinov, R., Tenenbaum, J.B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266), 1332–1338.

  14. Lattal, K.M., & Nakajima, S. (1998). Overexpectation in appetitive Pavlovian and instrumental conditioning. Animal Learning & Behavior, 26(3), 351–360.

  15. Lee, M.D. (2001a). On the complexity of additive clustering models. Journal of Mathematical Psychology, 45, 131–148.

  16. Lee, M.D. (2001b). Determining the dimensionality of multidimensional scaling models for cognitive modeling. Journal of Mathematical Psychology, 45, 149–166.

  17. Lee, M.D., & Navarro, D.J. (2002). Extending the ALCOVE model of category learning to featural stimulus domains. Psychonomic Bulletin & Review, 9, 43–58.

  18. Navarro, D.J. (2004). A note on the applied use of MDL approximations. Neural Computation, 16, 1763–1768.

  19. Navarro, D.J., Dry, M.J., Lee, M.D. (2012). Sampling assumptions in inductive generalization. Cognitive Science, 36, 187–223.

  20. Navarro, D.J., Pitt, M.A., Myung, I.J. (2004). Assessing the distinguishability of models and the informativeness of data. Cognitive Psychology, 49, 47–84.

  21. Pavlov, I. (1927). Conditioned reflexes. London: Oxford University Press.

  22. Pitt, M.A., Myung, I.J., Zhang, S. (2002). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472–491.

  23. Pitt, M.A., Kim, W., Navarro, D.J., Myung, J.I. (2006). Global model analysis by parameter space partitioning. Psychological Review, 113, 57–83.

  24. Rescorla, R.A. (1968). Probability of shock in the presence and absence of CS in fear conditioning. Journal of Comparative and Physiological Psychology, 66, 1–5.

  25. Rescorla, R.A. (1969). Conditioned inhibition of fear resulting from negative CS-US contingencies. Journal of Comparative and Physiological Psychology, 67, 504–509.

  26. Rescorla, R.A. (1971). Variations in the effectiveness of reinforcement following prior inhibitory conditioning. Learning and Motivation, 2, 113–123.

  27. Rescorla, R.A., & Wagner, A.R. (1972). A theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and nonreinforcement. In A.H. Black & W.F. Prokasy (Eds.), Classical conditioning II: current research and theory (pp. 64–99). New York: Appleton-Century-Crofts.

  28. Ransom, K., Perfors, A., Navarro, D.J. (2016). Leaping to conclusions: why premise relevance affects argument strength. Cognitive Science, 40, 1775–1796.

  29. Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42, 40–47.

  30. Schultz, W., Dayan, P., Montague, P.R. (1997). A neural substrate of prediction and reward. Science, 275(5306), 1593–1599.

  31. Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.

  32. Shao, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association, 88, 486–494.

  33. Shiffrin, R.M., Börner, K., Stigler, S.M. (2018). Scientific progress despite irreproducibility: a seeming paradox. Proceedings of the National Academy of Sciences, USA, 115, 2632–2639.

  34. Tenenbaum, J.B., & Griffiths, T.L. (2001). Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24, 629–640.

  35. Vehtari, A., Simpson, D., Yao, Y., Gelman, A. (2018). Limitations of Bayesian leave-one-out cross-validation. Computational Brain and Behavior.

  36. Vehtari, A., & Ojanen, J. (2012). A survey of Bayesian predictive methods for model assessment, selection and comparison. Statistics Surveys, 6, 142–228.

  37. Voorspoels, W., Navarro, D.J., Perfors, A., Ransom, K., Storms, G. (2015). How do people learn from negative evidence? Non-monotonic generalizations and sampling assumptions in inductive reasoning. Cognitive Psychology, 81, 1–25.

  38. Wagenmakers, E.J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804.

  39. Wickelgren, W.A. (1972). Trace resistance and decay of long-term memory. Journal of Mathematical Psychology, 9, 418–455.


I am grateful to many people for helpful conversations and comments that shaped this paper, most notably Nancy Briggs, Berna Devezer, Chris Donkin, Olivia Guest, Daniel Simpson, Iris van Rooij and Fred Westbrook.

Corresponding author

Correspondence to Danielle J. Navarro.

Cite this article

Navarro, D.J. Between the Devil and the Deep Blue Sea: Tensions Between Scientific Judgement and Statistical Model Selection. Comput Brain Behav 2, 28–34 (2019).


  • Model selection
  • Science
  • Statistics