Probability Theory and Related Fields, Volume 150, Issue 3–4, pp 405–433

A high-dimensional Wilks phenomenon

  • Stéphane Boucheron
  • Pascal Massart


A theorem by Wilks asserts that in smooth parametric density estimation the difference between the maximum likelihood and the likelihood of the sampling distribution converges toward a Chi-square distribution where the number of degrees of freedom coincides with the model dimension. This observation is at the core of some goodness-of-fit testing procedures and of some classical model selection methods. This paper describes a non-asymptotic version of the Wilks phenomenon in bounded contrast optimization procedures. Using concentration inequalities for general functions of independent random variables, it proves that in bounded contrast minimization (as for example in Statistical Learning Theory), the difference between the empirical risk of the minimizer of the true risk in the model and the minimum of the empirical risk (the excess empirical risk) satisfies a Bernstein-like inequality where the variance term reflects the dimension of the model and the scale term reflects the noise conditions. From a mathematical statistics viewpoint, the significance of this result comes from the recent observation that when using model selection via penalization, the excess empirical risk represents a minimum penalty if non-asymptotic guarantees concerning prediction error are to be provided. From the perspective of empirical process theory, this paper describes a concentration inequality for the supremum of a bounded non-centered (actually non-positive) empirical process. Combining the now classical analysis of M-estimation (building on Talagrand’s inequality for suprema of empirical processes) and versatile moment inequalities for functions of independent random variables, this paper develops a genuine Bernstein-like inequality that seems beyond the reach of traditional tools.
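The classical Wilks statement recalled at the start of the abstract can be checked numerically in the simplest setting. The sketch below is an illustration only, not the paper's method: it assumes a Gaussian location model with identity covariance and true parameter zero, and it uses the usual normalization in which twice the log-likelihood gap is compared to a chi-square law. The function names, the sample size `n`, the dimension `d` and the number of replications are all hypothetical choices made for the example.

```python
import random
import statistics


def log_likelihood(sample, theta):
    # Gaussian log-likelihood with identity covariance, up to an additive
    # constant that cancels when two log-likelihoods are subtracted.
    return -0.5 * sum((x - t) ** 2 for row in sample for x, t in zip(row, theta))


def wilks_statistic(n, d, rng):
    # Assumed setting: n i.i.d. draws from N(0, I_d), model {N(theta, I_d)}.
    theta0 = [0.0] * d
    sample = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    # The maximum likelihood estimator in this model is the sample mean.
    theta_hat = [sum(row[j] for row in sample) / n for j in range(d)]
    # Twice the gap between the maximum log-likelihood and the
    # log-likelihood of the sampling distribution.
    return 2.0 * (log_likelihood(sample, theta_hat) - log_likelihood(sample, theta0))


def simulate(n=50, d=5, reps=2000, seed=0):
    rng = random.Random(seed)
    return [wilks_statistic(n, d, rng) for _ in range(reps)]


values = simulate()
mean_stat = statistics.mean(values)
var_stat = statistics.variance(values)
# A chi-square law with d degrees of freedom has mean d and variance 2d.
print("empirical mean:", mean_stat, "empirical variance:", var_stat)
```

In this particular model the statistic equals n times the squared norm of the sample mean, so it follows a chi-square law with d degrees of freedom exactly; the empirical mean and variance should therefore be close to d and 2d. The bounded contrast setting treated in the paper is more general and is not reproduced by this toy example.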


Keywords

Wilks phenomenon · Risk estimates · Suprema of empirical processes · Concentration inequalities · Statistical learning

Mathematics Subject Classification (2000)

60E15 62G08 62H30 





Copyright information

© Springer-Verlag 2010

Authors and Affiliations

  1. Laboratoire Probabilités et Modèles Aléatoires, Université Paris-Diderot, Paris, France
  2. Département de Mathématiques, Université Paris-Sud, Orsay, France
