Machine Learning, Volume 72, Issue 1–2, pp. 39–61

Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity

Abstract

We study the problem of aggregation under the squared loss in the model of regression with deterministic design. We obtain sharp PAC-Bayesian risk bounds for aggregates defined via exponential weights, under general assumptions on the distribution of errors and on the functions to aggregate. We then apply these results to derive sparsity oracle inequalities.
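As a rough illustration of the construction studied here, the sketch below computes an exponentially weighted aggregate: given candidates f_1, …, f_M evaluated on a fixed design, candidate j receives a weight proportional to π_j exp(−n r_j / β), where r_j is its empirical squared risk, π is a prior on the candidates, and β > 0 is a temperature parameter. This is a minimal sketch, not the paper's construction verbatim; the function name, the interface, and the uniform default prior are illustrative assumptions.

```python
import numpy as np

def exponential_weighting_aggregate(Y, F, beta, prior=None):
    """Aggregate candidate functions (rows of F) by exponential weights.

    Illustrative sketch: candidate j gets weight proportional to
    prior[j] * exp(-n * r_j / beta), where r_j is its empirical squared
    risk, and the aggregate is the resulting convex combination.

    Y     : (n,) observations at the design points.
    F     : (M, n) values of the M candidates at the design points.
    beta  : temperature parameter (> 0); larger beta flattens the weights.
    prior : (M,) prior weights on the candidates (uniform if None).
    """
    M, n = F.shape
    if prior is None:
        prior = np.full(M, 1.0 / M)
    risks = np.mean((F - Y) ** 2, axis=1)      # empirical squared risks r_j
    logits = np.log(prior) - n * risks / beta  # log of unnormalized weights
    logits -= logits.max()                     # stabilize before exponentiating
    theta = np.exp(logits)
    theta /= theta.sum()                       # normalized exponential weights
    return theta @ F                           # aggregate on the design points
```

In the deterministic-design model, Y collects the noisy observations at the design points and each row of F holds the values of one non-random candidate; the sharp PAC-Bayesian bounds of the paper govern how β must be chosen relative to the noise level for the resulting oracle inequalities to hold.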

Keywords

Aggregation · Nonparametric regression · Oracle inequalities · Sparsity

Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  1. LPMA, University of Paris 6, Paris cedex 05, France
  2. Laboratoire de Statistique, CREST, Malakoff cedex, France