Abstract
We show that forms of Bayesian and MDL inference that are often applied to classification problems can be inconsistent. This means that there exists a learning problem such that, for all sample sizes, the generalization errors of the MDL classifier and of the Bayes classifier based on the Bayesian posterior both remain bounded away from the smallest achievable generalization error. From a Bayesian point of view, the result can be reinterpreted as saying that Bayesian inference can be inconsistent under misspecification, even for countably infinite models. We extensively discuss the result from both a Bayesian and an MDL perspective.
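The core phenomenon behind the result can be illustrated with a toy example (a hypothetical sketch, not the paper's actual construction): when the model class is misspecified, the model with the smallest expected log loss, which is what Bayesian and MDL inference asymptotically favor, need not be the model whose plug-in classifier has the smallest 0/1 generalization error. All numbers and model names below are invented for illustration.

```python
import math

# Toy misspecified setup (hypothetical numbers, not the paper's construction):
# X is uniform on {a, b}; the true conditional is P(Y=1|X=a)=0.9, P(Y=1|X=b)=0.4.
TRUE = {'a': 0.9, 'b': 0.4}

# Two candidate conditional models; neither equals the truth (misspecification).
MODELS = {
    'A': {'a': 0.6, 'b': 0.6},    # closer to the truth in log loss, classifies badly on b
    'B': {'a': 0.99, 'b': 0.01},  # overconfident (poor log loss), but classifies optimally
}

def expected_log_loss(model):
    """Average cross-entropy of Y given X under the true distribution."""
    total = 0.0
    for x, p in TRUE.items():
        q = model[x]
        total += 0.5 * (-(p * math.log(q) + (1 - p) * math.log(1 - q)))
    return total

def classification_error(model):
    """Expected 0/1 loss of the plug-in classifier 1{P_model(Y=1|x) >= 1/2}."""
    total = 0.0
    for x, p in TRUE.items():
        predict_one = model[x] >= 0.5
        total += 0.5 * ((1 - p) if predict_one else p)
    return total

for name, m in MODELS.items():
    print(name, round(expected_log_loss(m), 4), round(classification_error(m), 4))
# Model A attains the lower expected log loss, yet model B attains the lower
# generalization error (0.25 vs. 0.35): log-loss-driven selection picks the
# worse classifier, and more data does not change the ranking.
```

Since Bayesian posteriors and MDL codelengths are driven by (cumulative) log loss, in such a situation they concentrate on the log-loss-optimal model, so the resulting classifier's error stays bounded away from the best achievable error in the class, mirroring the inconsistency established in the paper.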
Additional information
Editors: Olivier Bousquet and Andre Elisseeff
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License ( https://creativecommons.org/licenses/by-nc/2.0 ), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
Cite this article
Grünwald, P., Langford, J. Suboptimal behavior of Bayes and MDL in classification under misspecification. Mach Learn 66, 119–149 (2007). https://doi.org/10.1007/s10994-007-0716-7