Regularization Networks and Support Vector Machines

  • Theodoros Evgeniou
  • Massimiliano Pontil
  • Tomaso Poggio

Abstract

Regularization Networks and Support Vector Machines are techniques for solving certain problems of learning from examples – in particular, the regression problem of approximating a multivariate function from sparse data. Radial Basis Functions, for example, are a special case of both regularization and Support Vector Machines. We review both formulations in the context of Vapnik's theory of statistical learning, which provides a general foundation for the learning problem, combining functional analysis and statistics. The emphasis is on regression: classification is treated as a special case.

Keywords

regularization · Radial Basis Functions · Support Vector Machines · Reproducing Kernel Hilbert Space · Structural Risk Minimization
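
A minimal sketch of the shared framework the abstract refers to, in standard regularization notation (the symbols below are generic and assumed here, not quoted from the article): given \(\ell\) examples \((x_i, y_i)\), a loss function \(V\), a positive definite kernel \(K\) with Reproducing Kernel Hilbert Space norm \(\|\cdot\|_K\), and a regularization parameter \(\lambda > 0\), both techniques minimize a regularized empirical risk over the space \(\mathcal{H}\) induced by \(K\),

\[
  \min_{f \in \mathcal{H}} \; \frac{1}{\ell} \sum_{i=1}^{\ell} V\bigl(y_i, f(x_i)\bigr) \;+\; \lambda \, \|f\|_K^2 ,
\]

whose minimizer, by the representer theorem, is a finite kernel expansion on the data,

\[
  f(x) \;=\; \sum_{i=1}^{\ell} c_i \, K(x, x_i).
\]

The square loss \(V(y, f(x)) = (y - f(x))^2\) yields a Regularization Network (a classical Radial Basis Function scheme when \(K\) is radial), while Vapnik's \(\varepsilon\)-insensitive loss yields Support Vector Machine regression; under this view the two methods differ only in the choice of \(V\).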

Copyright information

© Kluwer Academic Publishers 2000

Authors and Affiliations

  • Theodoros Evgeniou (1)
  • Massimiliano Pontil (1)
  • Tomaso Poggio (1)

  1. Center for Biological and Computational Learning and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, USA
