The Safe Bayesian

Learning the Learning Rate via the Mixability Gap
  • Peter Grünwald
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7568)


Standard Bayesian inference can behave suboptimally if the model is wrong. We present a modification of Bayesian inference which continues to achieve good rates with wrong models. Our method adapts the Bayesian learning rate to the data, picking the rate minimizing the cumulative loss of sequential prediction by posterior randomization. Our results can also be used to adapt the learning rate in a PAC-Bayesian context. The results are based on an extension of an inequality due to T. Zhang and others to dependent random variables.
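The rate-selection idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm as published. It assumes a finite set of Bernoulli models, log loss, and a fixed grid of candidate learning rates. All function names and the toy setup are hypothetical choices made for this example. For each candidate rate η, the generalized posterior is proportional to the prior times the likelihood raised to the power η. The loss of "prediction by posterior randomization" at each step is taken to be the expected log loss of a model drawn from the posterior built on the earlier outcomes.

```python
import math

def bernoulli_log_loss(theta, z):
    """Log loss -log p_theta(z) of a Bernoulli(theta) model on outcome z in {0, 1}."""
    return -math.log(theta if z == 1 else 1.0 - theta)

def normalize(log_weights):
    """Turn unnormalized log weights into a probability vector (stable softmax)."""
    m = max(log_weights)
    weights = [math.exp(lw - m) for lw in log_weights]
    total = sum(weights)
    return [w / total for w in weights]

def cumulative_randomized_loss(data, thetas, prior, eta):
    """Cumulative expected log loss when, at each step, a model is drawn from
    the eta-generalized posterior based on the outcomes seen so far."""
    log_w = [math.log(p) for p in prior]
    total = 0.0
    for z in data:
        posterior = normalize(log_w)
        losses = [bernoulli_log_loss(t, z) for t in thetas]
        # Expected loss of a randomly drawn model, evaluated before updating.
        total += sum(p * l for p, l in zip(posterior, losses))
        # Generalized Bayes update: likelihood raised to the power eta.
        log_w = [lw - eta * l for lw, l in zip(log_w, losses)]
    return total

def pick_learning_rate(data, thetas, prior, etas):
    """Return the candidate rate with the smallest cumulative randomized loss."""
    return min(etas, key=lambda eta: cumulative_randomized_loss(data, thetas, prior, eta))

if __name__ == "__main__":
    data = [1] * 20                      # toy data, assumed for illustration
    thetas = [0.2, 0.8]                  # hypothetical finite model class
    prior = [0.5, 0.5]
    etas = [0.0, 0.25, 0.5, 1.0]         # assumed grid of candidate rates
    print(pick_learning_rate(data, thetas, prior, etas))
```

With η = 0 the posterior never moves away from the prior, and with η = 1 the update is ordinary Bayes. The grid search simply compares how well each rate would have predicted the data sequentially.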


Keywords: Learning Rate, Predictive Distribution, Statistical Learning Theory, Dependent Random Variable, Long Version




  1. Audibert, J.Y.: PAC-Bayesian statistical learning theory. PhD thesis, Université Paris VI (2004)
  2. Barron, A.R., Cover, T.M.: Minimum complexity density estimation. IEEE Transactions on Information Theory 37(4), 1034–1054 (1991)
  3. Catoni, O.: PAC-Bayesian Supervised Classification. Lecture Notes IMS (2007)
  4. Chaudhuri, K., Freund, Y., Hsu, D.: A parameter-free hedging algorithm. In: NIPS 2009, pp. 297–305 (2009)
  5. Dawid, A.P.: Present position and potential developments: Some personal views, statistical theory, the prequential approach. J. R. Stat. Soc. Ser. A-G 147(2), 278–292 (1984)
  6. Doob, J.L.: Application of the theory of martingales. In: Le Calcul de Probabilités et ses Applications. Colloques Internationaux du CNRS, pp. 23–27 (1949)
  7. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 119–139 (1997)
  8. Grünwald, P.: The MDL Principle. MIT Press, Cambridge (2007)
  9. Grünwald, P.: Safe learning: bridging the gap between Bayes, MDL and statistical learning theory via empirical convexity. In: Proc. COLT 2011, pp. 551–573 (2011)
  10. Grünwald, P., Langford, J.: Suboptimal behavior of Bayes and MDL in classification under misspecification. Machine Learning 66(2-3), 119–149 (2007)
  11. Kleijn, B., van der Vaart, A.: Misspecification in infinite-dimensional Bayesian statistics. Ann. Stat. 34(2) (2006)
  12. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22, 76–86 (1951)
  13. Li, J.Q.: Estimation of Mixture Models. PhD thesis, Yale, New Haven, CT (1999)
  14. McAllester, D.: PAC-Bayesian stochastic model selection. Mach. Learn. 51(1), 5–21 (2003)
  15. Jordan, M.I., Bartlett, P.L., McAuliffe, J.D.: Convexity, classification and risk bounds. J. Am. Stat. Assoc. 101(473), 138–156 (2006)
  16. Seeger, M.: PAC-Bayesian generalization error bounds for Gaussian process classification. J. Mach. Learn. Res. 3, 233–269 (2002)
  17. Shalizi, C.: Dynamics of Bayesian updating with dependent data and misspecified models. Electronic Journal of Statistics 3, 1039–1074 (2009)
  18. Takeuchi, J., Barron, A.R.: Robustly minimax codes for universal data compression. In: Proc. ISITA 1998, Japan (1998)
  19. van der Vaart, A.: Asymptotic Statistics. Cambridge University Press (1998)
  20. Vovk, V.: Competitive on-line statistics. Intern. Stat. Rev. 69, 213–248 (2001)
  21. Vovk, V.: Aggregating strategies. In: Proc. COLT 1990, pp. 371–383 (1990)
  22. Zhang, T.: From ε-entropy to KL entropy: analysis of minimum information complexity density estimation. Ann. Stat. 34(5), 2180–2210 (2006a)
  23. Zhang, T.: Information theoretical upper and lower bounds for statistical estimation. IEEE T. Inform. Theory 52(4), 1307–1321 (2006b)

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Peter Grünwald, CWI, Amsterdam and Leiden University, The Netherlands
