Machine Learning

Volume 48, Issue 1–3, pp 85–113

Model Selection and Error Estimation

  • Peter L. Bartlett
  • Stéphane Boucheron
  • Gábor Lugosi


We study model selection strategies based on penalized empirical loss minimization. We point out a tight relationship between error estimation and data-based complexity penalization: any good error estimate may be converted into a data-based penalty function and the performance of the estimate is governed by the quality of the error estimate. We consider several penalty functions, involving error estimates on independent test data, empirical VC dimension, empirical VC entropy, and margin-based quantities. We also consider the maximal difference between the error on the first half of the training data and the second half, and the expected maximal discrepancy, a closely related capacity estimate that can be calculated by Monte Carlo integration. Maximal discrepancy penalty functions are appealing for pattern classification problems, since their computation is equivalent to empirical risk minimization over the training data with some labels flipped.
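The equivalence between maximal discrepancy and empirical risk minimization with flipped labels can be checked numerically. The sketch below is an illustration only, not the authors' implementation: it assumes an even sample size, labels in {-1, +1}, the 0-1 loss, and a toy class of one-dimensional threshold classifiers (all function names here are hypothetical). Under those assumptions, flipping the labels of the first half turns the maximal discrepancy into 1 minus twice the minimal empirical error on the relabeled sample.

```python
import random


def stump_predictions(xs, threshold, sign):
    """Predict labels in {-1, +1} with a 1-D threshold classifier (toy class, assumed)."""
    return [sign if x > threshold else -sign for x in xs]


def error_rate(preds, ys):
    """Fraction of points where the prediction disagrees with the label (0-1 loss)."""
    return sum(p != y for p, y in zip(preds, ys)) / len(ys)


def candidate_stumps(xs):
    """Enumerate every distinct labeling the threshold class can produce on xs."""
    thresholds = [min(xs) - 1.0] + sorted(xs)
    return [(t, s) for t in thresholds for s in (+1, -1)]


def maximal_discrepancy_direct(xs, ys):
    """Max over the class of (error on first half) - (error on second half)."""
    half = len(xs) // 2
    best = float("-inf")
    for t, s in candidate_stumps(xs):
        preds = stump_predictions(xs, t, s)
        gap = error_rate(preds[:half], ys[:half]) - error_rate(preds[half:], ys[half:])
        best = max(best, gap)
    return best


def maximal_discrepancy_via_erm(xs, ys):
    """Same quantity, obtained by ERM on the sample with first-half labels flipped."""
    half = len(xs) // 2
    flipped = [-y for y in ys[:half]] + list(ys[half:])
    min_err = min(
        error_rate(stump_predictions(xs, t, s), flipped)
        for t, s in candidate_stumps(xs)
    )
    return 1.0 - 2.0 * min_err


if __name__ == "__main__":
    random.seed(0)
    n = 20  # even sample size, so both halves have n / 2 points
    xs = [random.random() for _ in range(n)]
    # Noisy threshold labels: the true cut is at 0.5, each label flips with probability 0.2.
    ys = [(1 if x > 0.5 else -1) * (1 if random.random() > 0.2 else -1) for x in xs]
    print(maximal_discrepancy_direct(xs, ys))
    print(maximal_discrepancy_via_erm(xs, ys))
```

The two functions should agree up to floating-point rounding. The brute-force search over thresholds stands in for whatever empirical risk minimizer is available for the class at hand; for richer classes only the ERM formulation remains practical.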

model selection, penalization, concentration inequalities, empirical penalties



Copyright information

© Kluwer Academic Publishers 2002

Authors and Affiliations

  • Peter L. Bartlett (1)
  • Stéphane Boucheron (2)
  • Gábor Lugosi (3)

  1. BIOwulf Technologies, Berkeley, USA
  2. Laboratoire de Recherche en Informatique, CNRS-Université Paris-Sud, Orsay-Cedex, France
  3. Department of Economics, Pompeu Fabra University, Barcelona, Spain
