Machine Learning, Volume 14, Issue 1, pp. 83–113

Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension

  • David Haussler
  • Michael Kearns
  • Robert E. Schapire


In this paper we study a Bayesian or average-case model of concept learning with a twofold goal: to provide more precise characterizations of learning curve (sample complexity) behavior that depend on properties of both the prior distribution over concepts and the sequence of instances seen by the learner, and to smoothly unite in a common framework the popular statistical physics and VC dimension theories of learning curves. To achieve this, we undertake a systematic investigation and comparison of two fundamental quantities in learning and information theory: the probability of an incorrect prediction for an optimal learning algorithm, and the Shannon information gain. This study leads to a new understanding of the sample complexity of learning in several existing models.
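The two quantities compared in the abstract can be illustrated with a toy example (this sketch is not from the paper itself; the concept class, prior, and instance sequence are invented for illustration). A Bayesian learner maintains a posterior over a finite concept class; the Bayes-optimal prediction is the posterior-weighted majority vote, and the Shannon information gained from each example is the negative log-probability the learner assigned to the observed label.

```python
import math

# Toy concept class (assumed for illustration): threshold functions on
# {0,...,10}, where c_t(x) = 1 iff x >= t, with a uniform prior.
concepts = [lambda x, t=t: int(x >= t) for t in range(11)]
posterior = [1.0 / len(concepts)] * len(concepts)

def bayes_predict(posterior, x):
    """Bayes-optimal prediction: posterior-weighted majority vote."""
    p1 = sum(w for w, c in zip(posterior, concepts) if c(x) == 1)
    return int(p1 >= 0.5)

def update(posterior, x, y):
    """Condition on the observed (noise-free) label; z is the probability
    the learner assigned to that label before seeing it."""
    new = [w if c(x) == y else 0.0 for w, c in zip(posterior, concepts)]
    z = sum(new)
    return [w / z for w in new], z

target_t = 4            # hidden target concept: threshold t = 4
total_info = 0.0        # cumulative Shannon information gain (in bits)
for x in [7, 2, 5, 3, 4]:
    y = int(x >= target_t)
    yhat = bayes_predict(posterior, x)     # prediction before the update
    posterior, z = update(posterior, x, y)
    total_info += -math.log2(z)            # bits gained from this example
```

On this sequence the posterior collapses onto the target concept, and the cumulative information gain equals log2(11) bits, the negative log of the target's prior weight, which is the kind of relationship between prediction and information gain the paper studies.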


Keywords: learning curves, VC dimension, Bayesian learning, information theory, average-case learning, statistical physics



Copyright information

© Kluwer Academic Publishers 1994

Authors and Affiliations

  • David Haussler (1)
  • Michael Kearns (2)
  • Robert E. Schapire (2)
  1. Computer and Information Sciences, University of California, Santa Cruz
  2. AT&T Bell Laboratories, Murray Hill