Simpler PAC-Bayesian bounds for hostile data
Abstract
PAC-Bayesian learning bounds are of the utmost interest to the learning community. Their role is to connect the generalization ability of an aggregation distribution \(\rho \) to its empirical risk and to its Kullback-Leibler divergence with respect to some prior distribution \(\pi \). Unfortunately, most of the available bounds rely on heavy assumptions such as boundedness and independence of the observations. This paper aims at relaxing these constraints and provides PAC-Bayesian learning bounds that hold for dependent, heavy-tailed observations (hereafter referred to as hostile data). In these bounds the Kullback-Leibler divergence is replaced with a general version of Csiszár’s f-divergence. We prove a general PAC-Bayesian bound, and show how to use it in various hostile settings.
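For orientation, the following sketch recalls the classical form of bound that the paper generalizes. It is stated under the boundedness and i.i.d. assumptions that the paper relaxes, with notation introduced here only for illustration (\(R\) for the out-of-sample risk, \(r_n\) for the empirical risk on \(n\) observations, \(\mathcal{K}(\rho ,\pi )\) for the Kullback-Leibler divergence) and with McAllester-type constants for a loss bounded in \([0,1]\); the paper’s own statements differ. With probability at least \(1-\delta \), simultaneously for all aggregation distributions \(\rho \),
\[
\mathbb{E}_{\theta \sim \rho }\left[R(\theta )\right] \le \mathbb{E}_{\theta \sim \rho }\left[r_n(\theta )\right] + \sqrt{\frac{\mathcal{K}(\rho ,\pi ) + \log \frac{2\sqrt{n}}{\delta }}{2n}}.
\]
The bounds of the paper replace \(\mathcal{K}(\rho ,\pi )\) with a general version of Csiszár’s f-divergence which, for a convex function \(f\) satisfying \(f(1)=0\), reads
\[
D_f(\rho ,\pi ) = \int f\!\left(\frac{\mathrm{d}\rho }{\mathrm{d}\pi }\right)\mathrm{d}\pi ,
\]
and which remains meaningful for dependent, heavy-tailed observations.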
Keywords
PAC-Bayesian theory · Dependent and unbounded data · Oracle inequalities · f-divergence
Acknowledgements
We would like to thank Pascal Germain for fruitful discussions, along with two anonymous Referees and the Editor for insightful comments. One of the authors gratefully acknowledges financial support from the research programme New Challenges for New Data from LCL and GENES, hosted by the Fondation du Risque, from Labex ECODEC (ANR-11-LABEX-0047), and from Labex CEMPI (ANR-11-LABX-0007-01).
References
- Agarwal, A., & Duchi, J. C. (2013). The generalization ability of online algorithms for dependent data. IEEE Transactions on Information Theory, 59(1), 573–587.
- Alquier, P., & Li, X. (2012). Prediction of quantiles by statistical learning and application to GDP forecasting. In 15th International Conference on Discovery Science 2012 (pp. 23–36). Springer.
- Alquier, P., & Wintenberger, O. (2012). Model selection for weakly dependent time series forecasting. Bernoulli, 18(3), 883–913.
- Alquier, P., Li, X., & Wintenberger, O. (2013). Prediction of time series by statistical learning: General losses and fast rates. Dependence Modeling, 1, 65–93.
- Alquier, P., Ridgway, J., & Chopin, N. (2016). On the properties of variational approximations of Gibbs posteriors. Journal of Machine Learning Research, 17(239), 1–41. http://jmlr.org/papers/v17/15-290.html.
- Audibert, J.-Y. (2009). Fast learning rates in statistical inference through aggregation. The Annals of Statistics, 37(4), 1591–1646.
- Audibert, J.-Y., & Catoni, O. (2011). Robust linear least squares regression. The Annals of Statistics, 39, 2766–2794.
- Bégin, L., Germain, P., Laviolette, F., & Roy, J.-F. (2016). PAC-Bayesian bounds based on the Rényi divergence. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (pp. 435–444).
- Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford: Oxford University Press.
- Catoni, O. (2004). Statistical learning theory and stochastic optimization. In J. Picard (Ed.), Saint-Flour Summer School on Probability Theory 2001. Lecture Notes in Mathematics. Berlin: Springer.
- Catoni, O. (2007). PAC-Bayesian supervised classification: The thermodynamics of statistical learning. Institute of Mathematical Statistics Lecture Notes–Monograph Series (Vol. 56). Beachwood, OH: Institute of Mathematical Statistics.
- Catoni, O. (2012). Challenging the empirical mean and empirical variance: A deviation study. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 48, 1148–1185.
- Catoni, O. (2016). PAC-Bayesian bounds for the Gram matrix and least squares regression with a random design. arXiv:1603.05229.
- Csiszár, I., & Shields, P. C. (2004). Information theory and statistics: A tutorial. Breda: Now Publishers.
- Dedecker, J., Doukhan, P., Lang, G., León R., J. R., Louhichi, S., & Prieur, C. (2007). Weak dependence. In Weak dependence: With examples and applications (pp. 9–20). Berlin: Springer.
- Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. Berlin: Springer.
- Devroye, L., Lerasle, M., Lugosi, G., & Oliveira, R. I. (2015). Sub-Gaussian mean estimators. arXiv:1509.05845.
- Dinh, V. C., Ho, L. S., Nguyen, B., & Nguyen, D. (2016). Fast learning rates with heavy-tailed losses. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 29, pp. 505–513). Curran Associates, Inc. http://papers.nips.cc/paper/6104-fast-learning-rates-with-heavy-tailed-losses.pdf.
- Doukhan, P. (1994). Mixing: Properties and examples. Lecture Notes in Statistics. New York: Springer.
- Giraud, C., Roueff, F., & Sanchez-Pèrez, A. (2015). Aggregation of predictors for nonstationary sub-linear processes and online adaptive forecasting of time varying autoregressive processes. The Annals of Statistics, 43(6), 2412–2450.
- Giulini, I. (2015). PAC-Bayesian bounds for principal component analysis in Hilbert spaces. arXiv:1511.06263.
- Grünwald, P. D., & Mehta, N. A. (2016). Fast rates with unbounded losses. arXiv:1605.00252.
- Guedj, B., & Alquier, P. (2013). PAC-Bayesian estimation and prediction in sparse additive models. Electronic Journal of Statistics, 7, 264–291.
- Honorio, J., & Jaakkola, T. (2014). Tight bounds for the expected risk of linear classifiers and PAC-Bayes finite-sample guarantees. In Proceedings of the 17th international conference on artificial intelligence and statistics (pp. 384–392).
- Hsu, D., & Sabato, S. (2016). Loss minimization and parameter estimation with heavy tails. Journal of Machine Learning Research, 17(18), 1–40.
- Kontorovich, L. A., & Ramanan, K. (2008). Concentration inequalities for dependent random variables via the martingale method. The Annals of Probability, 36(6), 2126–2158.
- Kuznetsov, V., & Mohri, M. (2014). Generalization bounds for time series prediction with non-stationary processes. In International conference on algorithmic learning theory (pp. 260–274). Springer.
- Langford, J., & Shawe-Taylor, J. (2002). PAC-Bayes & margins. In Proceedings of the 15th international conference on neural information processing systems (pp. 439–446). MIT Press.
- Lecué, G., & Lerasle, M. (2017). Learning from MOM’s principles. arXiv:1701.01961.
- Lecué, G., & Mendelson, S. (2016). Regularization and the small-ball method I: Sparse recovery. arXiv:1601.05584.
- London, B., Huang, B., & Getoor, L. (2016). Stability and generalization in structured prediction. Journal of Machine Learning Research, 17(222), 1–52.
- Lugosi, G., & Mendelson, S. (2016). Risk minimization by median-of-means tournaments. arXiv:1608.00757.
- Lugosi, G., & Mendelson, S. (2017). Regularization, sparse recovery, and median-of-means tournaments. arXiv:1701.04112.
- McAllester, D. A. (1998). Some PAC-Bayesian theorems. In Proceedings of the eleventh annual conference on computational learning theory (pp. 230–234). New York: ACM.
- McAllester, D. A. (1999). PAC-Bayesian model averaging. In Proceedings of the twelfth annual conference on computational learning theory (pp. 164–170). New York: ACM.
- Mendelson, S. (2015). Learning without concentration. Journal of the ACM, 62(3), 21:1–21:25. https://doi.org/10.1145/2699439.
- Minsker, S. (2015). Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4), 2308–2335.
- Modha, D. S., & Masry, E. (1998). Memory-universal prediction of stationary random processes. IEEE Transactions on Information Theory, 44(1), 117–133.
- Mohri, M., & Rostamizadeh, A. (2010). Stability bounds for stationary \(\varphi \)-mixing and \(\beta \)-mixing processes. Journal of Machine Learning Research, 11, 789–814.
- Oliveira, R. I. (2013). The lower tail of random quadratic forms, with applications to ordinary least squares and restricted eigenvalue properties. arXiv:1312.2903. (To appear in Probability Theory and Related Fields).
- Oneto, L., Anguita, D., & Ridella, S. (2016). PAC-Bayesian analysis of distribution dependent priors: Tighter risk bounds and stability analysis. Pattern Recognition Letters, 80, 200–207.
- Ralaivola, L., Szafranski, M., & Stempfel, G. (2010). Chromatic PAC-Bayes bounds for non-iid data: Applications to ranking and stationary \(\beta \)-mixing processes. Journal of Machine Learning Research, 11, 1927–1956.
- Rio, E. (2000). Théorie asymptotique des processus aléatoires faiblement dépendants. Mathématiques & Applications (Vol. 31). Berlin: Springer.
- Seeger, M. (2002). PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3, 233–269.
- Seldin, Y., & Tishby, N. (2010). PAC-Bayesian analysis of co-clustering and beyond. Journal of Machine Learning Research, 11, 3595–3646.
- Seldin, Y., Auer, P., Shawe-Taylor, J., Ortner, R., & Laviolette, F. (2011). PAC-Bayesian analysis of contextual bandits. In Advances in neural information processing systems (pp. 1683–1691).
- Seldin, Y., Laviolette, F., Cesa-Bianchi, N., Shawe-Taylor, J., & Auer, P. (2012). PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12), 7086–7093.
- Shawe-Taylor, J., & Williamson, R. (1997). A PAC analysis of a Bayes estimator. In Proceedings of the 10th annual conference on computational learning theory (pp. 2–9). New York: ACM.
- Steinwart, I., & Christmann, A. (2009). Fast learning from non-iid observations. In Advances in neural information processing systems (pp. 1768–1776).
- Taleb, N. N. (2007). The black swan: The impact of the highly improbable. New York: Random House.
- Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134–1142.
- Vapnik, V. N. (2000). The nature of statistical learning theory. Berlin: Springer.
- Yu, B. (1994). Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1), 94–116.
- Zimin, A., & Lampert, C. H. (2015). Conditional risk minimization for stochastic processes. arXiv:1510.02706.