Simpler PAC-Bayesian bounds for hostile data



PAC-Bayesian learning bounds are of the utmost interest to the learning community. Their role is to connect the generalization ability of an aggregation distribution \(\rho \) to its empirical risk and to its Kullback-Leibler divergence with respect to some prior distribution \(\pi \). Unfortunately, most of the available bounds rely on heavy assumptions such as boundedness and independence of the observations. This paper aims at relaxing these constraints and provides PAC-Bayesian learning bounds that hold for dependent, heavy-tailed observations (hereafter referred to as hostile data). In these bounds the Kullback-Leibler divergence is replaced with a general version of Csiszár’s f-divergence. We prove a general PAC-Bayesian bound, and show how to use it in various hostile settings.
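
For reference, the two divergences named above can be written as follows. This is the standard textbook definition in generic notation, given here as an assumption about conventions rather than as the paper's own statement: for a convex function \(f\) with \(f(1) = 0\), and \(\rho \) absolutely continuous with respect to \(\pi \),

\[ D_f(\rho , \pi ) = \int f\left( \frac{d\rho }{d\pi } \right) d\pi . \]

The Kullback-Leibler divergence is the special case \(f(x) = x \log x\), since then

\[ D_f(\rho , \pi ) = \int \frac{d\rho }{d\pi } \log \left( \frac{d\rho }{d\pi } \right) d\pi = \int \log \left( \frac{d\rho }{d\pi } \right) d\rho = \mathrm{KL}(\rho , \pi ). \]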


PAC-Bayesian theory; Dependent and unbounded data; Oracle inequalities; f-divergence



We would like to thank Pascal Germain for fruitful discussions, along with two anonymous Referees and the Editor for insightful comments. This author gratefully acknowledges financial support from the research programme New Challenges for New Data from LCL and GENES, hosted by the Fondation du Risque, from Labex ECODEC (ANR-11-LABEX-0047) and from Labex CEMPI (ANR-11-LABX-0007-01).



Copyright information

© The Author(s) 2017

Authors and Affiliations

  1. CREST, ENSAE, Université Paris Saclay, Paris, France
  2. Modal Project-Team, Inria, Lille - Nord Europe research center, France
