Statistics and Computing, Volume 16, Issue 2, pp 161–175

MDL convergence speed for Bernoulli sequences

  • Jan Poland
  • Marcus Hutter


The Minimum Description Length principle for online sequence estimation/prediction in a proper learning setup is studied. If the underlying model class is discrete, then the total expected square loss is a particularly interesting performance measure: (a) this quantity is finitely bounded, implying convergence with probability one, and (b) it additionally specifies the convergence speed. For MDL, in general one can only have loss bounds which are finite but exponentially larger than those for Bayes mixtures. We show that this is even the case if the model class contains only Bernoulli distributions. We derive a new upper bound on the prediction error for countable Bernoulli classes. This implies a small bound (comparable to the one for Bayes mixtures) for certain important model classes. We discuss the application to Machine Learning tasks such as classification and hypothesis testing, and generalization to countable classes of i.i.d. models.
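The comparison between MDL and Bayes mixture predictions described above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' construction: the class is a finite set of three Bernoulli parameters with made-up prior weights, MDL is read as "predict with the single model maximizing weight times likelihood", and the loss is taken as the squared distance between the true parameter and the predicted probability of a 1. All function names are hypothetical.

```python
import math
import random


def log_likelihood(theta, ones, zeros):
    """Log-probability of a binary sequence with the given counts under Bernoulli(theta)."""
    # Assumes 0 < theta < 1, so both logarithms are finite.
    return ones * math.log(theta) + zeros * math.log(1.0 - theta)


def log_scores(thetas, weights, ones, zeros):
    """Log of (prior weight * likelihood) for every model in the class."""
    return [math.log(w) + log_likelihood(th, ones, zeros)
            for th, w in zip(thetas, weights)]


def mdl_prediction(thetas, weights, ones, zeros):
    """Predict with the single model that maximizes weight * likelihood
    (equivalently, minimizes the two-part code length)."""
    scores = log_scores(thetas, weights, ones, zeros)
    best = max(range(len(thetas)), key=scores.__getitem__)
    return thetas[best]


def bayes_prediction(thetas, weights, ones, zeros):
    """Predict with the posterior-weighted average of all models."""
    scores = log_scores(thetas, weights, ones, zeros)
    top = max(scores)
    posterior = [math.exp(s - top) for s in scores]  # unnormalized, numerically stable
    return sum(p * th for p, th in zip(posterior, thetas)) / sum(posterior)


def cumulative_square_loss(predict, thetas, weights, true_theta, horizon, rng):
    """Sum over time of (true_theta - predicted probability of a 1)^2 on one sampled sequence."""
    ones = zeros = 0
    loss = 0.0
    for _ in range(horizon):
        loss += (true_theta - predict(thetas, weights, ones, zeros)) ** 2
        if rng.random() < true_theta:
            ones += 1
        else:
            zeros += 1
    return loss


if __name__ == "__main__":
    # Illustrative class and weights (assumptions, not taken from the paper).
    thetas = [0.25, 0.5, 0.75]
    weights = [0.5, 0.25, 0.25]
    for name, predictor in [("MDL", mdl_prediction), ("Bayes", bayes_prediction)]:
        rng = random.Random(0)  # same sequence for both predictors
        loss = cumulative_square_loss(predictor, thetas, weights, 0.75, 200, rng)
        print(name, "cumulative square loss:", loss)
```

Averaging `cumulative_square_loss` over many sampled sequences gives a Monte Carlo estimate of the total expected square loss. A single run on such a small class only illustrates the quantities involved and says nothing about the bounds themselves.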


Artificial Intelligence, Machine Learning, Hypothesis Testing, Prediction Error, Model Class





Copyright information

© Springer Science + Business Media, LLC 2006

Authors and Affiliations

  1. Graduate School of Information Science and Technology, Hokkaido University, Japan
  2. IDSIA, Manno (Lugano), Switzerland
