A Generalized Representer Theorem

  • Bernhard Schölkopf
  • Ralf Herbrich
  • Alex J. Smola
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2111)

Abstract

Wahba’s classical representer theorem states that the solutions of certain risk minimization problems involving an empirical risk term and a quadratic regularizer can be written as expansions in terms of the training examples. We generalize the theorem to a larger class of regularizers and empirical risk terms, and give a self-contained proof utilizing the feature space associated with a kernel. The result shows that a wide range of problems have optimal solutions that live in the finite-dimensional span of the training examples mapped into feature space, thus enabling us to carry out kernel algorithms independently of the (potentially infinite) dimensionality of the feature space.
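
In standard reproducing kernel Hilbert space (RKHS) notation, the result can be sketched as follows; this is a paraphrase consistent with the abstract, in our own symbols (cost term c, regularizer g, kernel k, coefficients \alpha_i), since the body of the paper is not reproduced on this page. For a kernel k with RKHS \mathcal{H} and training data (x_1, y_1), \ldots, (x_m, y_m), any minimizer f of

    c\bigl((x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m))\bigr) + g\bigl(\lVert f \rVert_{\mathcal{H}}\bigr),

where c is an empirical risk term depending on f only through its values at the training inputs and g is a strictly increasing function of the RKHS norm (the precise conditions are stated in the paper), admits a representation of the form

    f(\cdot) = \sum_{i=1}^{m} \alpha_i \, k(\cdot, x_i).

As a concrete illustration of why this makes kernel algorithms tractable (a minimal sketch, not taken from the paper): for the squared loss and g(t) = \lambda t^2, the search over the whole RKHS collapses to an m-dimensional problem in the coefficients \alpha, solved by an m x m linear system as in kernel ridge regression. The kernel, data, and regularization weight below are placeholders chosen for the example.

    import numpy as np

    def rbf_kernel(A, B, gamma=1.0):
        # Gaussian RBF kernel matrix: K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))                       # m = 50 toy training inputs
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)    # toy targets

    lam = 0.1                                          # regularization weight (illustrative)
    K = rbf_kernel(X, X)                               # m x m Gram matrix
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)  # expansion coefficients

    # Prediction needs only kernel evaluations against the m training points,
    # never an explicit (possibly infinite-dimensional) feature map.
    X_new = rng.normal(size=(5, 3))
    f_new = rbf_kernel(X_new, X) @ alpha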

References

  1. M.A. Aizerman, É.M. Braverman, and L.I. Rozonoér. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.
  2. N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.
  3. P.L. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C.J.C. Burges, and A.J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 43–54, Cambridge, MA, 1999. MIT Press.
  4. C. Berg, J.P.R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, New York, 1984.
  5. B.E. Boser, I.M. Guyon, and V.N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144–152, Pittsburgh, PA, July 1992. ACM Press.
  6. O. Bousquet and A. Elisseeff. Algorithmic stability and generalization performance. In T.K. Leen, T.G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13. MIT Press, 2001.
  7. D. Cox and F. O'Sullivan. Asymptotic analysis of penalized likelihood and related estimators. Annals of Statistics, 18:1676–1695, 1990.
  8. L. Csató and M. Opper. Sparse representation for Gaussian process models. In T.K. Leen, T.G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13. MIT Press, 2001.
  9. Y. Freund and R.E. Schapire. Large margin classification using the perceptron algorithm. In J. Shavlik, editor, Machine Learning: Proceedings of the Fifteenth International Conference, San Francisco, CA, 1998. Morgan Kaufmann.
  10. T.-T. Frieß, N. Cristianini, and C. Campbell. The kernel adatron algorithm: A fast and simple learning procedure for support vector machines. In J. Shavlik, editor, 15th International Conf. Machine Learning, pages 188–196. Morgan Kaufmann Publishers, 1998.
  11. C. Gentile. Approximate maximal margin classification with respect to an arbitrary norm. Unpublished.
  12. F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7(2):219–269, 1995.
  13. D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Department, University of California at Santa Cruz, 1999.
  14. G.S. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics, 41:495–502, 1970.
  15. G.S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33:82–95, 1971.
  16. J. Kivinen, A.J. Smola, P. Wankadia, and R.C. Williamson. On-line algorithms for kernel methods. In preparation, 2001.
  17. A. Kowalczyk. Maximal margin perceptron. In A.J. Smola, P.L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 75–113, Cambridge, MA, 2000. MIT Press.
  18. H. Lodhi, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Technical Report 2000-79, NeuroCOLT, 2000. Published in: T.K. Leen, T.G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, MIT Press, 2001.
  19. O.L. Mangasarian. Nonlinear Programming. McGraw-Hill, New York, NY, 1969.
  20. J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society, London, A 209:415–446, 1909.
  21. B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In B. Schölkopf, C.J.C. Burges, and A.J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 327–352. MIT Press, Cambridge, MA, 1999.
  22. A. Smola, T. Frieß, and B. Schölkopf. Semiparametric support vector and linear programming machines. In M.S. Kearns, S.A. Solla, and D.A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 585–591, Cambridge, MA, 1999. MIT Press.
  23. V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
  24. G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.
  25. G. Wahba. Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV. In B. Schölkopf, C.J.C. Burges, and A.J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 69–88, Cambridge, MA, 1999. MIT Press.
  26. C. Watkins. Dynamic alignment kernels. In A.J. Smola, P.L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 39–50, Cambridge, MA, 2000. MIT Press.
  27. C.K.I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M.I. Jordan, editor, Learning and Inference in Graphical Models. Kluwer, 1998.

Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Bernhard Schölkopf (1, 2, 3)
  • Ralf Herbrich (1, 2)
  • Alex J. Smola (1)

  1. Department of Engineering, Australian National University, Canberra, Australia
  2. Microsoft Research Ltd., Cambridge, UK
  3. Biowulf Technologies, New York, USA
