Test, Volume 15, Issue 2, pp. 271–344

Regularization in statistics

  • Peter J. Bickel
  • Bo Li
  • Alexandre B. Tsybakov
  • Sara A. van de Geer
  • Bin Yu
  • Teófilo Valdés
  • Carlos Rivero
  • Jianqing Fan
  • Aad van der Vaart

Abstract

This paper is a selective review of the regularization methods scattered across the statistics literature. We introduce a general conceptual approach to regularization and fit most existing methods into it. We have tried to focus on the importance of regularization when dealing with today's high-dimensional objects: data and models. A wide range of examples is discussed, including nonparametric regression, boosting, covariance matrix estimation, principal component estimation, and subsampling.
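To make the abstract's central idea concrete, the following is a minimal, hypothetical sketch (not taken from the paper) of the prototypical regularized estimator, ridge regression (Hoerl and Kennard, 1970; reference 43 below): adding a penalty lam * ||b||^2 to least squares trades a little bias for a large reduction in variance, which is what stabilizes estimation when the dimension p approaches the sample size n. All data, parameter values, and helper names below are illustrative assumptions.

    import numpy as np

    def ridge(X, y, lam):
        # Ridge estimate: argmin_b ||y - X b||^2 + lam * ||b||^2,
        # with closed form b_hat = (X'X + lam * I)^(-1) X'y.
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    rng = np.random.default_rng(0)
    n, p = 50, 40                      # p close to n: least squares is ill-conditioned
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:5] = 2.0                     # sparse "true" coefficient vector (illustrative)
    y = X @ beta + rng.standard_normal(n)

    b_ols = ridge(X, y, 0.0)           # lam = 0 recovers ordinary least squares
    b_ridge = ridge(X, y, 5.0)         # lam > 0 shrinks and stabilizes the estimate
    print(np.linalg.norm(b_ols - beta), np.linalg.norm(b_ridge - beta))

In practice the penalty level lam would itself be chosen by a data-driven device such as cross-validation or C_L, both of which the paper reviews.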

Key Words

Regularization · linear regression · nonparametric regression · boosting · covariance matrix · principal component · bootstrap · subsampling · model selection

AMS subject classification

Primary 62G08, 62H12; Secondary 62F12, 62G20, 62H25

References

  1. Akaike, H. (1970). Statistical predictor identification. Annals of the Institute of Statistical Mathematics, 22:203–217.
  2. Bair, E., Hastie, T. J., Paul, D., and Tibshirani, R. (2006). Prediction by supervised principal components. Journal of the American Statistical Association, 101(473):119–137.
  3. Bickel, P. J., Götze, F., and van Zwet, W. R. (1997). Resampling fewer than n observations: gains, losses, and remedies for losses. Statistica Sinica, 7(1):1–31. Empirical Bayes, sequential analysis and related topics in statistics and probability (New Brunswick, NJ, 1995).
  4. Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1998). Efficient and adaptive estimation for semiparametric models. Reprint of the 1993 original. Springer-Verlag, New York.
  5. Bickel, P. J. and Levina, E. (2004). Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010.
  6. Bickel, P. J. and Levina, E. (2006). Regularized estimation of large covariance matrices. Technical Report 716, Department of Statistics, University of California, Berkeley, CA.
  7. Bickel, P. J., Ritov, Y., and Zakai, A. (2006). Some theory for generalized boosting algorithms. Journal of Machine Learning Research. To appear.
  8. Bickel, P. J. and Sakov, A. (2005). On the choice of m in the m out of n bootstrap and its application to confidence bounds for extreme percentiles. Unpublished.
  9. Birgé, L. and Massart, P. (1997). From model selection to adaptive estimation. In D. Pollard, E. Torgersen, and G. Yang, eds., A Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics, pp. 55–87. Springer-Verlag, New York.
  10. Birgé, L. and Massart, P. (2001). Gaussian model selection. Journal of the European Mathematical Society, 3(3):203–268.
  11. Böttcher, A. and Silbermann, B. (1999). Introduction to large truncated Toeplitz matrices. Universitext. Springer-Verlag, New York.
  12. Breiman, L. (1996). Heuristics of instability and stabilization in model selection. The Annals of Statistics, 24(6):2350–2383.
  13. Breiman, L., Stone, C. J., and Kooperberg, C. (1990). Robust confidence bounds for extreme upper quantiles. Journal of Statistical Computation and Simulation, 37(3–4):127–149.
  14. Bühlmann, P. (2006). Boosting for high-dimensional linear models. The Annals of Statistics, 34(2):559–583.
  15. Bühlmann, P. and Yu, B. (2006). Sparse boosting. Journal of Machine Learning Research, 7:1001–1024.
  16. Bunea, F., Wegkamp, M. H., and Auguste, A. (2006). Consistent variable selection in high dimensional regression via multiple testing. Journal of Statistical Planning and Inference, 136(12):4349–4364.
  17. Chen, H. (1988). Convergence rates for parametric components in a partly linear model. The Annals of Statistics, 16(1):136–146.
  18. Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions. Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31(4):377–403.
  19. Daniels, M. J. and Pourahmadi, M. (2002). Bayesian analysis of covariance matrices and dynamic models for longitudinal data. Biometrika, 89(3):553–566.
  20. Datta, S. and McCormick, W. P. (1995). Bootstrap inference for a first-order autoregression with positive innovations. Journal of the American Statistical Association, 90(432):1289–1300.
  21. Devroye, L., Györfi, L., and Lugosi, G. (1996). A probabilistic theory of pattern recognition, Vol. 31 of Applications of Mathematics (New York). Springer-Verlag, New York.
  22. Donoho, D. L. (2000). High dimensional data analysis: The curses and blessings of dimensionality. In Math Challenges of the 21st Century. American Mathematical Society. Plenary speaker. Available at http://www-stat.stanford.edu/donoho/Lectures/AMS2000/
  23. Donoho, D. L. and Johnstone, I. M. (1998). Minimax estimation via wavelet shrinkage. The Annals of Statistics, 26(3):879–921.
  24. Draper, N. R. and Smith, H. (1998). Applied regression analysis, 3rd ed. Wiley Series in Probability and Statistics: Texts and References Section. John Wiley & Sons, New York.
  25. Dudoit, S., Fridlyand, J., and Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457):77–87.
  26. Dudoit, S. and van der Laan, M. J. (2005). Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Statistical Methodology, 2(2):131–154.
  27. Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, 7(1):1–26.
  28. Efron, B. (2004). The estimation of prediction error: covariance penalties and cross-validation (with discussion). Journal of the American Statistical Association, 99(467):619–642.
  29. Efron, B., Hastie, T. J., Johnstone, I., and Tibshirani, R. (2004). Least angle regression (with discussion). The Annals of Statistics, 32(2):407–499.
  30. Fan, J. and Gijbels, I. (1996). Local polynomial modelling and its applications, Vol. 66 of Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, London.
  31. Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.
  32. Fan, J. and Li, R. (2006). Statistical challenges with high dimensionality: Feature selection in knowledge discovery. In M. Sanz-Solé, J. Soria, J. L. Varona, and J. Verdera, eds., Proceedings of the International Congress of Mathematicians, Madrid 2006, Vol. III, pp. 595–622. European Mathematical Society Publishing House.
  33. Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 32(3):928–961.
  34. Furrer, R. and Bengtsson, T. (2006). Estimation of high-dimensional prior and posterior covariance matrices in Kalman filter variants. Journal of Multivariate Analysis. To appear.
  35. Götze, F. (1993). Asymptotic approximation and the bootstrap. IMS Bulletin, p. 305.
  36. Götze, F. and Račkauskas, A. (2001). Adaptive choice of bootstrap sample sizes. In State of the Art in Probability and Statistics (Leiden, 1999), Vol. 36 of IMS Lecture Notes Monograph Series, pp. 286–309. Institute of Mathematical Statistics, Beachwood, OH.
  37. Greenshtein, E. (2006). Best subset selection, persistence in high-dimensional statistical learning and optimization under ℓ1-constraint. The Annals of Statistics, 34(5). To appear.
  38. Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10(6):971–988.
  39. Györfi, L., Kohler, M., Krzyzak, A., and Walk, H. (2002). A distribution-free theory of nonparametric regression. Springer Series in Statistics. Springer-Verlag, New York.
  40. Hall, P. (1992). The bootstrap and Edgeworth expansion. Springer Series in Statistics. Springer-Verlag, New York.
  41. Hall, P., Horowitz, J. L., and Jing, B.-Y. (1995). On blocking rules for the bootstrap with dependent data. Biometrika, 82(3):561–574.
  42. Hastie, T. J., Tibshirani, R., and Friedman, J. H. (2001). The elements of statistical learning: Data mining, inference, and prediction. Springer Series in Statistics. Springer-Verlag, New York.
  43. Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.
  44. Huang, J., Liu, N., Pourahmadi, M., and Liu, L. (2006). Covariance matrix selection and estimation via penalised normal likelihood. Biometrika, 93(1):85–98.
  45. Hunter, D. R. and Li, R. (2005). Variable selection using MM algorithms. The Annals of Statistics, 33(4):1617–1642.
  46. James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, pp. 361–379. University of California Press, Berkeley, CA.
  47. Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics, 29(2):295–327.
  48. Johnstone, I. M. and Lu, A. Y. (2006). Sparse principal component analysis. Journal of the American Statistical Association. To appear.
  49. Johnstone, I. M. and Silverman, B. W. (2005). Empirical Bayes selection of wavelet thresholds. The Annals of Statistics, 33(4):1700–1752.
  50. Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430):773–795.
  51. Kass, R. E. and Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431):928–934.
  52. Kosorok, M. and Ma, S. (2006). Marginal asymptotics for the "large p, small n" paradigm: with applications to microarray data. Unpublished.
  53. Künsch, H. R. (1989). The jackknife and the bootstrap for general stationary observations. The Annals of Statistics, 17(3):1217–1241.
  54. Ledoit, O. and Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365–411.
  55. Li, K.-C. (1985). From Stein's unbiased risk estimates to the method of generalized cross validation. The Annals of Statistics, 13(4):1352–1377.
  56. Li, K.-C. (1986). Asymptotic optimality of C_L and generalized cross-validation in ridge regression with application to spline smoothing. The Annals of Statistics, 14(3):1101–1112.
  57. Li, K.-C. (1987). Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: discrete index set. The Annals of Statistics, 15(3):958–975.
  58. Lugosi, G. and Nobel, A. B. (1999). Adaptive model selection using empirical complexities. The Annals of Statistics, 27(6):1830–1864.
  59. Lugosi, G. and Vayatis, N. (2004). On the Bayes-risk consistency of regularized boosting methods. The Annals of Statistics, 32(1):30–55.
  60. Mallows, C. L. (1973). Some comments on C_p. Technometrics, 15(4):661–675.
  61. Mammen, E. (1992). When Does Bootstrap Work? Springer-Verlag, New York.
  62. Mammen, E. and Tsybakov, A. B. (1999). Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829.
  63. Meinshausen, N. (2005). Lasso with relaxation. Unpublished.
  64. Nadaraya, E. A. (1964). On estimating regression. Theory of Probability and Its Applications, 10:186–190.
  65. Parzen, E. (1962). On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33:1065–1076.
  66. Paul, D. (2005). Asymptotics of the leading sample eigenvalues for a spiked covariance model. Unpublished.
  67. Politis, D. N. and Romano, J. P. (1994). Large sample confidence regions based on subsamples under minimal assumptions. The Annals of Statistics, 22(4):2031–2050.
  68. Politis, D. N., Romano, J. P., and Wolf, M. (1999). Subsampling. Springer Series in Statistics. Springer-Verlag, New York.
  69. Pourahmadi, M. (1999). Joint mean-covariance models with applications to longitudinal data: unconstrained parameterisation. Biometrika, 86(3):677–690.
  70. Pourahmadi, M. (2000). Maximum likelihood estimation of generalised linear models for multivariate normal covariance matrix. Biometrika, 87(2):425–435.
  71. Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, 30(4):629–636.
  72. Robert, C. P. and Casella, G. (2004). Monte Carlo statistical methods, 2nd ed. Springer Texts in Statistics. Springer-Verlag, New York.
  73. Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, 27:832–837.
  74. Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464.
  75. Shao, J. (1997). An asymptotic theory for linear model selection (with discussion). Statistica Sinica, 7(2):221–264.
  76. Smith, M. and Kohn, R. (2002). Parsimonious covariance matrix estimation for longitudinal data. Journal of the American Statistical Association, 97(460):1141–1153.
  77. Stone, C. J., Hansen, M. H., Kooperberg, C., and Truong, Y. K. (1997). Polynomial splines and their tensor products in extended linear modeling (with discussion). The Annals of Statistics, 25(4):1371–1470.
  78. Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). Journal of the Royal Statistical Society, Series B, 36:111–147.
  79. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288.
  80. Tikhonov, A. N. (1943). On the stability of inverse problems. C. R. (Doklady) Acad. Sci. URSS (N.S.), 39:176–179.
  81. Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166.
  82. Vapnik, V. N. (1998). Statistical learning theory. Adaptive and Learning Systems for Signal Processing, Communications, and Control. John Wiley & Sons, New York.
  83. Wachter, K. W. (1978). The strong limits of random matrix spectra for sample matrices of independent elements. The Annals of Probability, 6(1):1–18.
  84. Wang, Y. (2004). Model selection. In Handbook of Computational Statistics, pp. 437–466. Springer-Verlag, Berlin.
  85. Watson, G. S. (1964). Smooth regression analysis. Sankhyā, Series A, 26:359–372.
  86. Wigner, E. P. (1955). Characteristic vectors of bordered matrices with infinite dimensions. Annals of Mathematics, Second Series, 62:548–564.
  87. Wu, W. B. and Pourahmadi, M. (2003). Nonparametric estimation of large covariance matrices of longitudinal data. Biometrika, 90(4):831–844.
  88. Zhang, H. H., Wahba, G., Lin, Y., Voelker, M., Ferris, M., Klein, R., and Klein, B. (2004). Variable selection and model building via likelihood basis pursuit. Journal of the American Statistical Association, 99(467):659–672.
  89. Zhang, T. and Yu, B. (2005). Boosting with early stopping: convergence and consistency. The Annals of Statistics, 33(4):1538–1579.
  90. Zou, H. and Hastie, T. J. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301–320.

Additional references

  1. Barron, A., Cohen, A., Dahmen, W., and DeVore, R. (2005). Approximation and learning by greedy algorithms. Manuscript.
  2. Bunea, F., Tsybakov, A. B., and Wegkamp, M. H. (2005). Aggregation for Gaussian regression. The Annals of Statistics. Tentatively accepted.
  3. Bunea, F., Tsybakov, A. B., and Wegkamp, M. H. (2006). Aggregation and sparsity via ℓ1 penalized least squares. In H. U. Simon and G. Lugosi, eds., Proceedings of the 19th Annual Conference on Learning Theory (COLT 2006), Vol. 4005 of Lecture Notes in Artificial Intelligence, pp. 379–391. Springer-Verlag, Berlin-Heidelberg.
  4. Juditsky, A., Nazin, A., Tsybakov, A., and Vayatis, N. (2005a). Recursive aggregation of estimators by mirror descent algorithm with averaging. Problems of Information Transmission, 41(4):368–384.
  5. Juditsky, A., Rigollet, P., and Tsybakov, A. B. (2005b). Learning by mirror averaging. Preprint, Laboratoire de Probabilités et Modèles Aléatoires, Universités Paris 6 and Paris 7. https://hal.ccsd.cnrs.fr/ccsd-00014097.
  6. Klemelä, J. (2006). Density estimation with stagewise optimization of the empirical risk. Manuscript.
  7. Mannor, S., Meir, R., and Zhang, T. (2003). Greedy algorithms for classification: consistency, convergence rates, and adaptivity. Journal of Machine Learning Research, 4:713–742.
  8. Mason, L., Baxter, J., Bartlett, P. L., and Frean, M. (2000). Functional gradient techniques for combining hypotheses. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, eds., Advances in Large Margin Classifiers, pp. 221–247. MIT Press, Cambridge, MA.
  9. Tsybakov, A. B. (2003). Optimal rates of aggregation. In B. Schölkopf and M. Warmuth, eds., Proceedings of the 16th Annual Conference on Learning Theory (COLT 2003) and 7th Annual Workshop on Kernel Machines, Vol. 2777 of Lecture Notes in Artificial Intelligence, pp. 303–313. Springer-Verlag, Berlin-Heidelberg.

Additional references

  1. Boucheron, S., Bousquet, O., and Lugosi, G. (2005). Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375 (electronic).
  2. Koltchinskii, V. (2006). 2004 IMS Medallion Lecture: Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6). To appear.

Additional references

  1. Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2(3):499–526.
  2. Kutin, S. and Niyogi, P. (2002). Almost-everywhere algorithmic stability and generalization error. Technical Report TR-2002-03, Department of Computer Science, University of Chicago, Chicago, IL.

Additional references

  1. Rivero, C. and Valdés, T. (2004). Mean based iterative procedures in linear models with general errors and grouped data. Scandinavian Journal of Statistics, 31(3):469–486.

Additional references

  1. Antoniadis, A. and Fan, J. (2001). Regularized wavelet approximations (with discussion). Journal of the American Statistical Association, 96:939–967.
  2. Chamberlain, G. and Rothschild, M. (1983). Arbitrage, factor structure, and mean-variance analysis on large asset markets. Econometrica, 51:1281–1304.
  3. Chen, S., Donoho, D. L., and Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33–61.
  4. Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81:425–455.
  5. Efromovich, S. (1999). Nonparametric curve estimation: Methods, theory and applications. Springer-Verlag, New York.
  6. Fama, E. and French, K. (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics, 33:3–56.
  7. Fan, J., Chen, Y., Chan, H. M., Tam, P., and Ren, Y. (2005a). Removing intensity effects and identifying significant genes for Affymetrix arrays in MIF-suppressed neuroblastoma cells. Proceedings of the National Academy of Sciences of the United States of America, 103:17751–17756.
  8. Fan, J. and Jiang, J. (2005). Nonparametric inference for additive models. Journal of the American Statistical Association, 100:890–907.
  9. Fan, J., Peng, H., and Huang, T. (2005b). Semilinear high-dimensional model for normalization of microarray data: a theoretical analysis and partial consistency. Journal of the American Statistical Association, 100:781–813.
  10. Fan, J., Zhang, C. M., and Zhang, J. (2001). Generalized likelihood ratio statistics and Wilks phenomenon. The Annals of Statistics, 29:153–193.
  11. Ross, S. (1976). The arbitrage theory of capital asset pricing. Journal of Economic Theory, 13:341–360.
  12. Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences of the United States of America, 99:6567–6572.

Additional references

  1. Barron, A., Birgé, L., and Massart, P. (1999). Risk bounds for model selection via penalization. Probability Theory and Related Fields, 113(3):301–413.
  2. Belitser, E. and Ghosal, S. (2003). Adaptive Bayesian inference on the mean of an infinite-dimensional normal distribution. The Annals of Statistics, 31(2):536–559.
  3. Birgé, L. (2006). Statistical estimation with model selection (Brouwer lecture). Preprint.
  4. Cai, T. T. and Low, M. G. (2004). An adaptation theory for nonparametric confidence intervals. The Annals of Statistics, 32(5):1805–1840.
  5. Cai, T. T. and Low, M. G. (2005). On adaptive estimation of linear functionals. The Annals of Statistics, 33(5):2311–2343.
  6. Cohen, A., Dahmen, W., Daubechies, I., and DeVore, R. (2001). Tree approximation and optimal encoding. Applied and Computational Harmonic Analysis, 11(2):192–226.
  7. DeVore, R., Kerkyacharian, G., Picard, D., and Temlyakov, V. (2006). Approximation methods for supervised learning. Foundations of Computational Mathematics, 6(1):3–58.
  8. DeVore, R. A. and Lorentz, G. G. (1993). Constructive approximation, Vol. 303 of Grundlehren der Mathematischen Wissenschaften. Springer-Verlag, Berlin.
  9. Ghosal, S., Ghosh, J. K., and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. The Annals of Statistics, 28(2):500–531.
  10. Ghosal, S., Lember, J., and van der Vaart, A. (2003). On Bayesian adaptation. Acta Applicandae Mathematicae, 79(1–2):165–175.
  11. Ghosal, S. and van der Vaart, A. W. (2006). Convergence rates of posterior distributions for noniid observations. The Annals of Statistics, 34.
  12. Hoffmann, M. and Lepski, O. (2002). Random rates in anisotropic regression. The Annals of Statistics, 30(2):325–396.
  13. Huang, T.-M. (2004). Convergence rates for posterior distributions and adaptive estimation. The Annals of Statistics, 32(4):1556–1593.
  14. Juditsky, A. and Lambert-Lacroix, S. (2003). Nonparametric confidence set estimation. Mathematical Methods of Statistics, 12(4):410–428.
  15. Juditsky, A., Nazin, A., Tsybakov, A., and Vayatis, N. (2005). Recursive aggregation of estimators by mirror descent algorithm with averaging. Problems of Information Transmission, 41(4):368–384.
  16. Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. The Annals of Statistics, 28(3):681–712.
  17. Keles, S., van der Laan, M., and Dudoit, S. (2004). Asymptotically optimal model selection method with right censored outcomes. Bernoulli, 10(6):1011–1037.
  18. Kerkyacharian, G. and Picard, D. (2004). Entropy, universal coding, approximation, and bases properties. Constructive Approximation, 20(1):1–37.
  19. Lember, J. (2004). On Bayesian adaptation. Preprint.
  20. Li, L., Tchetgen, E., Robins, J., and van der Vaart, A. (2005). Robust inference with higher order influence functions: Parts I and II. Joint Statistical Meetings, Minneapolis, Minnesota.
  21. Murphy, S. A. and van der Vaart, A. W. (2000). On profile likelihood. Journal of the American Statistical Association, 95(450):449–485.
  22. Nemirovski, A. (2000). Topics in non-parametric statistics. In Lectures on Probability Theory and Statistics (Saint-Flour, 1998), Vol. 1738 of Lecture Notes in Mathematics, pp. 85–277. Springer-Verlag, Berlin.
  23. Robins, J. M. (1997). Causal inference from complex longitudinal data. In M. Berkane, ed., Latent Variable Modeling and Applications to Causality (Los Angeles, CA, 1994), Vol. 120 of Lecture Notes in Statistics, pp. 69–117. Springer-Verlag, New York.
  24. Robins, J. M. and van der Vaart, A. W. (2006). Adaptive nonparametric confidence sets. The Annals of Statistics, 34(1):229–253.
  25. van der Laan, M. J. and Robins, J. M. (2003). Unified methods for censored longitudinal data and causality. Springer Series in Statistics. Springer-Verlag, New York.
  26. van der Vaart, A. W. (2002). Semiparametric statistics. In Lectures on Probability Theory and Statistics (Saint-Flour, 1999), Vol. 1781 of Lecture Notes in Mathematics, pp. 331–457. Springer-Verlag, Berlin.
  27. Yang, Y. (2000). Mixing strategies for density estimation. The Annals of Statistics, 28(1):75–87.
  28. Yang, Y. (2004). Aggregating regression procedures to improve performance. Bernoulli, 10(1):25–47.

Additional references

  1. Bickel, P. J. and Freedman, D. A. (1981). Some asymptotic theory for the bootstrap. The Annals of Statistics, 9(6):1196–1217.
  2. Bickel, P. J. and Ritov, Y. (2000). Non- and semiparametric statistics: compared and contrasted. Journal of Statistical Planning and Inference, 91(2):209–228. Prague Workshop on Perspectives in Modern Statistical Inference: Parametrics, Semi-parametrics, Non-parametrics (1998).
  3. Cox, D. D. (1993). An analysis of Bayesian inference for nonparametric regression. The Annals of Statistics, 21(2):903–923.
  4. Devroye, L. P. and Wagner, T. J. (1979). Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25(5):601–604.
  5. Freedman, D. (1999). On the Bernstein-von Mises theorem with infinite-dimensional parameters. The Annals of Statistics, 27(4):1119–1140.
  6. Gray, H. L. and Schucany, W. R. (1972). The generalized jackknife statistic, Vol. 1 of Statistics Textbooks and Monographs. Marcel Dekker, New York.
  7. Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (1986). Robust statistics: The approach based on influence functions. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, New York.
  8. Hodges, J. L., Jr. (1967). Efficiency in normal samples and tolerance of extreme values for some estimates of location. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Berkeley, CA, 1965/66), Vol. I: Statistics, pp. 163–186. University of California Press, Berkeley, CA.
  9. Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35:73–101.
  10. Kleijn, B. and van der Vaart, A. (2005). The Bernstein-von Mises theorem under misspecification. Unpublished.
  11. Koltchinskii, V. (2006). 2004 IMS Medallion Lecture: Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6). To appear.
  12. Li, L., Tchetgen, E., Robins, J., and van der Vaart, A. (2005). Robust inference with higher order influence functions: Parts I and II. Joint Statistical Meetings, Minneapolis, Minnesota.
  13. van der Vaart, A. W. (1998). Asymptotic statistics, Vol. 3 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge.
  14. Wahba, G. (1990). Spline models for observational data, Vol. 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, PA.

Copyright information

© Sociedad Española de Estadística e Investigación Operativa 2006

Authors and Affiliations

  • Peter J. Bickel (1)
  • Bo Li (2)
  • Alexandre B. Tsybakov (3)
  • Sara A. van de Geer (4)
  • Bin Yu (1)
  • Teófilo Valdés (5)
  • Carlos Rivero (5)
  • Jianqing Fan (6)
  • Aad van der Vaart (7)
  1. Department of Statistics, University of California at Berkeley, USA
  2. School of Economics and Management, Tsinghua University, China
  3. Laboratoire de Probabilités et Modèles Aléatoires, Université Paris VI, France
  4. Seminar für Statistik, ETH Zürich, Switzerland
  5. Department of Statistics and Operational Research, Complutense University of Madrid, Spain
  6. Department of Operations Research and Financial Engineering, Princeton University, USA
  7. Department of Mathematics, Vrije Universiteit Amsterdam, Netherlands
