Distribution-free consistency of empirical risk minimization and support vector regression

Original Article

Abstract

In this paper, we study the generalization ability of empirical risk minimization in the framework of agnostic learning, and consider support vector regression as a special case. We give a set of analytic conditions that characterize the empirical risk minimization methods, and their approximations, that are distribution-free consistent. Then, exploiting the weak topology of the feature space, we show that support vector regression, possibly with a discontinuous kernel, is distribution-free consistent. Moreover, in certain cases a tighter generalization error bound is achieved if the regularization parameter grows with the sample size. The results carry over to ν-support vector regression.
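The central object of the abstract, the regularized empirical risk minimized by ε-support vector regression, can be sketched as follows (a minimal NumPy illustration; the function names, the linear hypothesis class, and the parameter choices are ours for illustration, not the paper's):

```python
import numpy as np

def eps_insensitive_risk(y_true, y_pred, eps=0.1):
    """Empirical ε-insensitive risk: deviations smaller than eps cost nothing."""
    return float(np.maximum(np.abs(y_true - y_pred) - eps, 0.0).mean())

def svr_objective(w, b, X, y, lam=1.0, eps=0.1):
    """Regularized empirical risk of a linear predictor f(x) = <w, x> + b:
    the ε-insensitive empirical risk plus a norm penalty lam * ||w||^2,
    which is the kind of objective ε-SVR minimizes."""
    y_pred = X @ w + b
    return eps_insensitive_risk(y, y_pred, eps) + lam * float(w @ w)
```

The consistency question the paper addresses is whether minimizers of such objectives converge to the best attainable risk as the sample size grows, for any underlying distribution.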

Keywords

Consistency · Covering number · Discontinuous kernel · Empirical risk minimization · PAC learning · Support vector regression · Weak topology


Copyright information

© Springer-Verlag London Limited 2009

Authors and Affiliations

  1. Department of Electrical Engineering, Pennsylvania State University, University Park, USA
  2. Department of Electrical and Computer Engineering, University of Florida, Gainesville, USA
