Simpler PAC-Bayesian bounds for hostile data
PAC-Bayesian learning bounds are of the utmost interest to the learning community. Their role is to connect the generalization ability of an aggregation distribution \(\rho\) to its empirical risk and to its Kullback-Leibler divergence with respect to some prior distribution \(\pi\). Unfortunately, most of the available bounds rely on heavy assumptions such as boundedness and independence of the observations. This paper relaxes these constraints and provides PAC-Bayesian learning bounds that hold for dependent, heavy-tailed observations (hereafter referred to as hostile data). In these bounds, the Kullback-Leibler divergence is replaced with a general version of Csiszár's f-divergence. We prove a general PAC-Bayesian bound and show how to use it in various hostile settings.
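For orientation, the contrast drawn in the abstract can be made explicit; the sample size \(n\), confidence level \(\delta\), risk \(R\) and empirical risk \(r_n\) below are notation introduced here for illustration, not taken from this section. A classical KL-based PAC-Bayesian bound (McAllester-style, for i.i.d. observations and a loss bounded in \([0,1]\)) states that, with probability at least \(1-\delta\), simultaneously for all aggregation distributions \(\rho\),
\[
\mathbb{E}_{\theta\sim\rho}[R(\theta)] \leq \mathbb{E}_{\theta\sim\rho}[r_n(\theta)] + \sqrt{\frac{\mathrm{KL}(\rho\|\pi) + \log\frac{2\sqrt{n}}{\delta}}{2n}},
\]
while Csiszár's f-divergence, which replaces \(\mathrm{KL}(\rho\|\pi)\) in the bounds of this paper, is defined for any convex function \(f\) with \(f(1)=0\) and any \(\rho\) absolutely continuous with respect to \(\pi\) by
\[
D_f(\rho,\pi) = \int f\!\left(\frac{\mathrm{d}\rho}{\mathrm{d}\pi}\right)\mathrm{d}\pi,
\]
the Kullback-Leibler divergence being recovered with \(f(x) = x\log x\).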
Keywords: PAC-Bayesian theory · Dependent and unbounded data · Oracle inequalities · f-divergence
We would like to thank Pascal Germain for fruitful discussions, along with two anonymous Referees and the Editor for insightful comments. This author gratefully acknowledges financial support from the research programme New Challenges for New Data from LCL and GENES, hosted by the Fondation du Risque, from Labex ECODEC (ANR-11-LABEX-0047) and from Labex CEMPI (ANR-11-LABX-0007-01).