Machine Learning, Volume 74, Issue 3, pp 315–345

An algebraic characterization of the optimum of regularized kernel methods

  • Francesco Dinuzzo
  • Giuseppe De Nicolao
Article

Abstract

The representer theorem for kernel methods states that the solution of the associated variational problem can be expressed as the linear combination of a finite number of kernel functions. However, for non-smooth loss functions, the analytic characterization of the coefficients poses nontrivial problems. Standard approaches resort to constrained optimization reformulations which, in general, lack a closed-form solution. Herein, by a proper change of variable, it is shown that, for any convex loss function, the coefficients satisfy a system of algebraic equations in a fixed-point form, which may be directly obtained from the primal formulation. The algebraic characterization is specialized to regression and classification methods and the fixed-point equations are explicitly characterized for many loss functions of practical interest. The consequences of the main result are then investigated along two directions. First, the existence of an unconstrained smooth reformulation of the original non-smooth problem is proven. Second, in the context of SURE (Stein’s Unbiased Risk Estimation), a general formula for the degrees of freedom of kernel regression methods is derived.
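To make the abstract's statements concrete, the following is a minimal sketch of the standard regularized kernel setting it refers to. The notation (the space H, kernel K, convex loss V, regularization parameter λ, coefficients c) is not defined in this excerpt, and the scaling constant in the fixed-point equation, as well as the precise form of the map derived in the paper, may differ; for non-smooth losses the derivative of V is replaced by a subgradient.

```latex
% Regularized kernel method over an RKHS H with kernel K and convex loss V:
\min_{f \in \mathcal{H}} \; \sum_{i=1}^{n} V\bigl(y_i, f(x_i)\bigr) \;+\; \lambda\,\|f\|_{\mathcal{H}}^{2}

% Representer theorem: the minimizer is a finite kernel expansion
f(x) \;=\; \sum_{i=1}^{n} c_i \, K(x_i, x), \qquad \hat{y}_i := \sum_{j=1}^{n} c_j \, K(x_i, x_j)

% Fixed-point characterization of the coefficients (subgradient for non-smooth V):
c_i \;=\; -\,\frac{1}{2\lambda}\,\frac{\partial V}{\partial \hat{y}_i}\bigl(y_i, \hat{y}_i\bigr), \qquad i = 1, \dots, n

% SURE-type (equivalent) degrees of freedom of the fitted values:
\mathrm{df} \;=\; \sum_{i=1}^{n} \frac{\partial \hat{y}_i}{\partial y_i}
\qquad\Bigl(\text{e.g. } \mathrm{df} = \operatorname{tr}\bigl(K(K + \lambda I)^{-1}\bigr)\ \text{for the squared loss}\Bigr)
```

A toy numerical illustration of the same idea, assuming a smooth (Huber-type) loss so that plain fixed-point iteration applies, could look as follows. This is not the authors' algorithm: the loss, kernel, and the deliberately large λ (chosen so the iteration is a contraction) are assumptions made for the example only.

```python
# A minimal illustrative sketch (not the authors' algorithm): fixed-point
# iteration for the expansion coefficients of a regularized kernel method
# with a smooth convex loss.  All names and parameter choices are assumptions.
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gaussian (RBF) kernel matrix for the rows of X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def huber_grad(r, delta=1.0):
    """Derivative of the Huber loss w.r.t. the residual r = y_hat - y."""
    return np.clip(r, -delta, delta)

def fit_fixed_point(K, y, lam, n_iter=200):
    """Iterate c <- -(1/(2*lam)) * V'(K c - y): the stationarity condition of
    sum_i V(y_hat_i - y_i) + lam * c^T K c written as a fixed-point map
    (the exact constant depends on how the regularizer is scaled)."""
    c = np.zeros(len(y))
    for _ in range(n_iter):
        c = -huber_grad(K @ c - y) / (2.0 * lam)
    return c

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

K = rbf_kernel(X, gamma=0.5)
# Setting lam = ||K||_2 makes the map a contraction here (|V''| <= 1 for the
# Huber loss), so plain iteration converges; the algebraic characterization
# itself does not require such a choice.
lam = np.linalg.norm(K, 2)
c = fit_fixed_point(K, y, lam)
y_hat = K @ c
print("training RMSE:", float(np.sqrt(np.mean((y_hat - y) ** 2))))
```

For genuinely non-smooth losses such as the hinge or ε-insensitive loss, the update becomes an inclusion rather than an equation, and one would instead rely on the unconstrained smooth reformulation whose existence the paper establishes, or on a proximal-type scheme.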

Keywords

Representer theorem · Regularization theory · Kernel methods · Support vector machines · Degrees of freedom

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. Department of Mathematics, University of Pavia, Pavia, Italy
  2. Department of Computer Engineering and Systems Science, University of Pavia, Pavia, Italy
