Abstract
We show that forms of Bayesian and MDL inference that are often applied to classification problems can be inconsistent. This means that there exists a learning problem such that, for all sample sizes, the generalization errors of the MDL classifier and of the Bayes classifier based on the Bayesian posterior both remain bounded away from the smallest achievable generalization error. From a Bayesian point of view, the result can be reinterpreted as saying that Bayesian inference can be inconsistent under misspecification, even for countably infinite models. We extensively discuss the result from both a Bayesian and an MDL perspective.
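The core phenomenon behind the result can be illustrated with a toy example (a hypothetical sketch, not the paper's actual construction): when the model class is misspecified, the model with the smallest expected log loss, which is what Bayesian and MDL inference asymptotically favor, need not be the model whose plug-in classifier has the smallest 0/1 generalization error. All numbers and model names below are invented for illustration.

```python
import math

# Toy misspecified setup (hypothetical numbers, not the paper's construction):
# X is uniform on {a, b}; the true conditional is P(Y=1|X=a)=0.9, P(Y=1|X=b)=0.4.
TRUE = {'a': 0.9, 'b': 0.4}

# Two candidate conditional models; neither equals the truth (misspecification).
MODELS = {
    'A': {'a': 0.6, 'b': 0.6},    # closer to the truth in log loss, classifies badly on b
    'B': {'a': 0.99, 'b': 0.01},  # overconfident (poor log loss), but classifies optimally
}

def expected_log_loss(model):
    """Average cross-entropy of Y given X under the true distribution."""
    total = 0.0
    for x, p in TRUE.items():
        q = model[x]
        total += 0.5 * (-(p * math.log(q) + (1 - p) * math.log(1 - q)))
    return total

def classification_error(model):
    """Expected 0/1 loss of the plug-in classifier 1{P_model(Y=1|x) >= 1/2}."""
    total = 0.0
    for x, p in TRUE.items():
        predict_one = model[x] >= 0.5
        total += 0.5 * ((1 - p) if predict_one else p)
    return total

for name, m in MODELS.items():
    print(name, round(expected_log_loss(m), 4), round(classification_error(m), 4))
# Model A attains the lower expected log loss, yet model B attains the lower
# generalization error (0.25 vs. 0.35): log-loss-driven selection picks the
# worse classifier, and more data does not change the ranking.
```

Since Bayesian posteriors and MDL codelengths are driven by (cumulative) log loss, in such a situation they concentrate on the log-loss-optimal model, so the resulting classifier's error stays bounded away from the best achievable error in the class, mirroring the inconsistency established in the paper.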
Additional information
Editors: Olivier Bousquet and Andre Elisseeff
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License ( https://creativecommons.org/licenses/by-nc/2.0 ), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
Cite this article
Grünwald, P., Langford, J. Suboptimal behavior of Bayes and MDL in classification under misspecification. Mach Learn 66, 119–149 (2007). https://doi.org/10.1007/s10994-007-0716-7