Machine Learning

, Volume 51, Issue 1, pp 5–21 | Cite as

PAC-Bayesian Stochastic Model Selection

  • David A. McAllester


PAC-Bayesian learning methods combine the informative priors of Bayesian methods with distribution-free PAC guarantees. Stochastic model selection predicts a class label by stochastically sampling a classifier according to a “posterior distribution” on classifiers. This paper gives a PAC-Bayesian performance guarantee for stochastic model selection that is superior to analogous guarantees for deterministic model selection. The guarantee is stated in terms of the training error of the stochastic classifier and the KL-divergence of the posterior from the prior. It is shown that the posterior optimizing the performance guarantee is a Gibbs distribution. Simpler posterior distributions are also derived that have nearly optimal performance guarantees.

PAC learning model averaging posterior distribution Gibbs distribution PAC-Baysian learning 


  1. Barron, A. R. (1991). Complexity regularization with application to artificial neural networks. In G. Roussas (Ed.), Nonparametric functional estimation and related topics (pp. 561–576). Dordrecht: Kluwer Academic Publishers.Google Scholar
  2. Barron, A. R.,& Cover, T. M. (1991). Minimum complexity density estimation. IEEE Transactions on Information Theory, 37, 1034–1054.Google Scholar
  3. Buntine, W. (1992). Learning classification trees. Statistics and Computing, 2, 63–73.Google Scholar
  4. Catoni, O. (To appear a) Gibbs estimators. Probability Theory and Related Fields.Google Scholar
  5. Catoni, O. (To appear b) Universal aggregation rules with sharp oracle inequalities. Annals of Statistics.Google Scholar
  6. Cesa-Bianchi, N., Freund, Y., Helmbold, D. P., Haussler, D., Schapire, R. E., & Warmuth, M. K. (1997). How to use expert advice. JACM, 44:3, 427–485.Google Scholar
  7. Freund, Y., & Schapire, R. E. (1999). Adaptive game playing using multiplicative weights. Games and Economic Behavior, 79–103.Google Scholar
  8. Freund, Y., Schapire, R. E., Singer, Y., & Warmuth, M. K. (1997). Using and combining predictors that specialize. In COLT-97.Google Scholar
  9. Helmbold, D. P.,& Schapire, R. E. (1997). Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27:1, 51–68.Google Scholar
  10. Kearns, M., & Mansour, Y. (1998). A fast, bottom-up decision tree pruning algorithm with near-optimal generalization. In Proceedings of the 15th International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann.Google Scholar
  11. Kearns, M., Mansour, Y., Ng, A., & Ron, D. (1995). An experimental and theoretical comparison of model selection methods. In Proceedings of the Eighth ACMConference on Computational Learning Theory (pp. 21–30). New York: ACM Press.Google Scholar
  12. Linial, N., Mansour, Y., & Rivest, R. (1991). Results on learnability and the Vapnik-Chervonenkis dimension. Information and Computation, 90, 33–49.Google Scholar
  13. Littlestone, N., & Warmuth, M. (1994). The weighted majority algorithm. Information and Computation, 108:2, 212–261. An extended abstract appeared in COLT-89.Google Scholar
  14. Lugosi, G., & Zeger, K. (1996). Concept learning using complexity regularization. IEEE Transactions on Information Theory, 42, 48–54.Google Scholar
  15. McAllester, D. (1998). Some pac-bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory (pp. 230–234).Google Scholar
  16. McAllester, D. (1999). Pac-bayesian model averaging. In COLT-99.Google Scholar
  17. Oliver, J. J.,& Hand, D. (1995).Onpruning and averaging decision trees. In Proceedings of the Twelfth International Conference on Machine Learning.Google Scholar
  18. Pereira, F. C., & Singer, Y. (1997). An efficient extension to mixture techniques for prediction and decision trees. In Proceedings of the Tenth Annual Conference on Computational Learning Theory (pp. 114–121). Long version to appear in Machine Learning.Google Scholar
  19. Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.Google Scholar
  20. Shawe-Taylor, J., & Williamson, R. (1997). A pac analysis of a bayesian estimator. In Proceedings of the Tenth Annual Conference on Computational Learning Theory. New York: ACM Press.Google Scholar
  21. Yamanishi, K. (1992). Learning non-parametric densities in tyerms of finite-dimensional parametric hypotheses. IEICE Trans. Inf. and Syst., E75-D(4).Google Scholar
  22. Yang, Y. (2000a). Adaptive estimation in pattern recognition by combining different procedures. Statistica Sinica, 10, 1069–1089.Google Scholar
  23. Yang, Y. (2000b). Mixing strategies for density estimation. Annals of Statistics, 28, 75–87.Google Scholar

Copyright information

© Kluwer Academic Publishers 2003

Authors and Affiliations

  • David A. McAllester
    • 1
  1. 1.AT&T Shannon LabsFlorham ParkUSA

Personalised recommendations