Machine Learning, Volume 66, Issue 2–3, pp. 119–149

Suboptimal behavior of Bayes and MDL in classification under misspecification

  • Peter Grünwald
  • John Langford
Open Access


Abstract

We show that forms of Bayesian and MDL inference that are often applied to classification problems can be inconsistent. This means that there exists a learning problem such that, for all amounts of data, the generalization errors of the MDL classifier and of the Bayes classifier relative to the Bayesian posterior both remain bounded away from the smallest achievable generalization error. From a Bayesian point of view, the result can be reinterpreted as saying that Bayesian inference can be inconsistent under misspecification, even for countably infinite models. We discuss the result extensively from both a Bayesian and an MDL perspective.
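The inconsistency claim in the abstract can be written out in standard statistical-learning notation. This is a sketch only; the symbols below are assumed conventions, not the paper's exact definitions.

```latex
% Generalization error of a classifier c under data distribution D
% (standard 0/1-loss definition):
\[
  \mathrm{err}_D(c) \;=\; \Pr_{(X,Y)\sim D}\bigl[\, c(X) \neq Y \,\bigr].
\]
% Inconsistency (sketch): there exist a distribution D, a model class
% \mathcal{C}, and a fixed \epsilon > 0 such that, with probability 1,
% the classifier \hat{c}_n learned by MDL or Bayes from n samples satisfies
\[
  \liminf_{n\to\infty} \, \mathrm{err}_D(\hat{c}_n)
  \;\geq\; \inf_{c\in\mathcal{C}} \mathrm{err}_D(c) \;+\; \epsilon .
\]
```

In words: no matter how much data is observed, the learned classifier's error stays at least $\epsilon$ above the best error achievable within the class.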


Keywords: Bayesian statistics · Minimum description length · Classification · Consistency · Inconsistency · Misspecification



Copyright information

© Springer Science + Business Media, LLC 2007

Authors and Affiliations

  1. CWI, Amsterdam, The Netherlands
  2. Yahoo Research, New York