Distribution-free consistency of empirical risk minimization and support vector regression
Abstract
In this paper, we focus on the generalization ability of the empirical risk minimization technique in the framework of agnostic learning, and consider the support vector regression method as a special case. We give a set of analytic conditions that characterize the empirical risk minimization methods, and their approximations, that are distribution-free consistent. Then, utilizing the weak topology of the feature space, we show that support vector regression, possibly with a discontinuous kernel, is distribution-free consistent. Moreover, a tighter generalization error bound is shown to be achievable in certain cases if the value of the regularization parameter grows as the sample size increases. The results carry over to ν-support vector regression.
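Empirical risk minimization, the central technique the abstract refers to, selects from a hypothesis class the function with the smallest average loss on the observed sample. The following is a minimal illustrative sketch, not the paper's construction: a finite class of linear predictors, squared loss, and synthetic data are all assumptions made for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sample: y = 2x + small Gaussian noise (illustrative only)
n = 200
x = rng.uniform(-1.0, 1.0, n)
y = 2.0 * x + rng.normal(0.0, 0.1, n)

# A finite hypothesis class of linear predictors f_a(x) = a * x
candidates = np.linspace(-3.0, 3.0, 61)  # candidate slopes a, step 0.1

def empirical_risk(a, x, y):
    """Average squared loss of f_a over the sample."""
    return np.mean((a * x - y) ** 2)

# ERM: pick the hypothesis minimizing the empirical risk
risks = np.array([empirical_risk(a, x, y) for a in candidates])
a_hat = candidates[np.argmin(risks)]
print(a_hat)  # should be close to the true slope 2.0
```

Consistency, in the sense studied by the paper, concerns whether such a minimizer's true (population) risk converges to the best achievable risk as the sample size grows, uniformly over data distributions.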
Keywords
Consistency · Covering number · Discontinuous kernel · Empirical risk minimization · PAC learning · Support vector regression · Weak topology