Introduction to Statistical Learning Theory

  • Olivier Bousquet
  • Stéphane Boucheron
  • Gábor Lugosi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3176)

Abstract

The goal of statistical learning theory is to study, in a statistical framework, the properties of learning algorithms. In particular, most results take the form of so-called error bounds. This tutorial introduces the techniques that are used to obtain such results.

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Olivier Bousquet (1)
  • Stéphane Boucheron (2)
  • Gábor Lugosi (3)

  1. Max-Planck Institute for Biological Cybernetics, Tübingen, Germany
  2. Laboratoire d’Informatique, Université de Paris-Sud, Orsay, France
  3. Department of Economics, Pompeu Fabra University, Barcelona, Spain
