The Missing Consistency Theorem for Bayesian Learning: Stochastic Model Selection

  • Jan Poland
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4264)


Bayes’ rule specifies how to obtain a posterior from a class of hypotheses endowed with a prior and the observed data. There are three principal ways to use this posterior for predicting the future: marginalization (integration over the hypotheses w.r.t. the posterior), MAP (taking the a posteriori most probable hypothesis), and stochastic model selection (selecting a hypothesis at random according to the posterior distribution). If the hypothesis class is countable and contains the data generating distribution, strong consistency theorems are known for the former two methods, asserting almost sure convergence of the predictions to the truth as well as loss bounds. We prove the first corresponding results for stochastic model selection. As a main technical tool, we will use the concept of a potential: this quantity, which is always positive, measures the total possible amount of future prediction errors. Precisely, in each time step, the expected potential decrease upper bounds the expected error. We introduce the entropy potential of a hypothesis class as its worst-case entropy with regard to the true distribution. We formulate our results in the online classification framework, but they are equally applicable to the prediction of non-i.i.d. sequences.
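
The three ways of using the posterior can be illustrated with a minimal Python sketch. It is not taken from the paper: it assumes a toy finite class of three Bernoulli hypotheses, a uniform prior, and a short binary data sequence, and all function names (`posterior`, `predict_marginal`, `predict_map`, `predict_stochastic`) are hypothetical. Each predictor returns an estimate of the probability that the next observation is 1.

```python
import random

def likelihood(theta, data):
    """Probability of a binary sequence under a Bernoulli(theta) hypothesis."""
    ones = sum(data)
    return theta ** ones * (1 - theta) ** (len(data) - ones)

def posterior(prior, likelihoods):
    """Bayes' rule: posterior weight is proportional to prior times likelihood."""
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def predict_marginal(thetas, post):
    """Marginalization: average the hypotheses' predictions under the posterior."""
    return sum(w * t for w, t in zip(post, thetas))

def predict_map(thetas, post):
    """MAP: predict with the hypothesis of highest posterior weight."""
    best = max(range(len(thetas)), key=lambda i: post[i])
    return thetas[best]

def predict_stochastic(thetas, post, rng):
    """Stochastic model selection: sample one hypothesis from the posterior."""
    return rng.choices(thetas, weights=post, k=1)[0]

# Toy setup (illustrative assumptions, not from the paper).
thetas = [0.2, 0.5, 0.8]           # hypothesis class
prior = [1 / 3, 1 / 3, 1 / 3]      # uniform prior
data = [1, 1, 0, 1, 1, 1, 0, 1]    # observed binary sequence

post = posterior(prior, [likelihood(t, data) for t in thetas])
rng = random.Random(0)             # seeded for reproducibility

print("posterior:  ", post)
print("marginal:   ", predict_marginal(thetas, post))
print("MAP:        ", predict_map(thetas, post))
print("stochastic: ", predict_stochastic(thetas, post, rng))
```

In this sketch the marginal and MAP predictions are deterministic given the data, whereas the stochastic prediction varies with the random draw. This randomness is what distinguishes stochastic model selection from the other two methods.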


Keywords: Model Class, Current Input, Minimum Description Length, True Distribution, Kolmogorov Complexity





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Jan Poland, Graduate School of Information Science and Technology, Hokkaido University, Japan
