
Relative deviation learning bounds and generalization with unbounded loss functions

  • Corinna Cortes
  • Spencer Greenberg
  • Mehryar Mohri

Abstract

We present an extensive analysis of relative deviation bounds, including detailed proofs of two-sided inequalities and their implications. We also give detailed proofs of two-sided generalization bounds that hold in the general case of unbounded loss functions, under the assumption that a moment of the loss is bounded. We then illustrate how to apply these results in a sample application: the analysis of importance weighting.
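For orientation only, and not as the paper's exact statements, the displays below sketch (i) a standard relative deviation bound for a [0,1]-valued loss, (ii) the kind of moment condition under which unbounded-loss bounds are usually stated, and (iii) the importance-weighted empirical risk to which such results can be applied. All symbols here ($R(h)$, $\widehat R_S(h)$, $\Pi_{\mathcal H}$, $\alpha$, $\Lambda$, $w$, $P$, $Q$) are illustrative notation introduced for this sketch rather than the paper's own.

A classical relative deviation bound (Vapnik; Anthony and Shawe-Taylor), for a hypothesis set $\mathcal H$ with growth function $\Pi_{\mathcal H}$, a $[0,1]$-valued loss, true risk $R(h)$, and empirical risk $\widehat R_S(h)$ on a sample $S$ of size $m$:
$$
\Pr\!\left[\sup_{h \in \mathcal H} \frac{R(h) - \widehat R_S(h)}{\sqrt{R(h)}} > \epsilon \right]
\;\le\; 4\,\Pi_{\mathcal H}(2m)\,\exp\!\left(-\frac{m\epsilon^{2}}{4}\right).
$$

A typical bounded-moment assumption replacing boundedness of the loss $L$, for some $\alpha > 1$ and a data distribution $D$:
$$
\sup_{h \in \mathcal H} \mathbb{E}_{z \sim D}\!\left[L(h, z)^{\alpha}\right]^{1/\alpha} \;\le\; \Lambda \;<\; \infty.
$$

Importance weighting, the sample application mentioned in the abstract, estimates the risk under a target distribution $P$ from a sample drawn from a source distribution $Q$ via the weighted empirical risk
$$
\widehat R_{w}(h) \;=\; \frac{1}{m}\sum_{i=1}^{m} w(x_i)\, L(h(x_i), y_i),
\qquad w(x) = \frac{P(x)}{Q(x)},
$$
whose summands are typically unbounded even when $L$ itself is bounded, which is what makes generalization bounds for unbounded losses relevant to this setting.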

Keywords

Generalization bounds · Learning theory · Unbounded loss functions · Relative deviation bounds · Importance weighting · Unbounded regression · Machine learning

Mathematics Subject Classification (2010)

97R40 



Acknowledgments

We thank the reviewers for several careful and very useful comments. This work was partly funded by NSF CCF-1535987 and NSF IIS-1618662.


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Google Research, New York, USA
  2.
  3. Courant Institute and Google Research, New York, USA
