
Supervised Learning by Support Vector Machines

Reference work entry in: Handbook of Mathematical Methods in Imaging

Abstract

During the last two decades, support vector machine learning has become a very active field of research, with a wealth of both sophisticated theoretical results and exciting real-world applications. This chapter gives a brief introduction to the basic concepts of supervised support vector learning and touches on some recent developments in this broad field.
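
For orientation, the core problem behind supervised support vector learning can be stated as soft-margin classification; the following formulation is the standard one from the SVM literature and is given here only as an illustrative sketch, not quoted from the chapter itself. Given training data (x_1, y_1), ..., (x_n, y_n) with labels y_i ∈ {−1, +1}, a feature map φ induced by a kernel k(x, x') = ⟨φ(x), φ(x')⟩, and a regularization parameter C > 0, one solves

\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad y_i\bigl(\langle w, \varphi(x_i)\rangle + b\bigr) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n,

and classifies a new point x by the sign of f(x) = ⟨w, φ(x)⟩ + b. The first term maximizes the margin, while the slack variables ξ_i allow penalized violations for nonseparable data.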

Copyright information

© 2011 Springer Science+Business Media, LLC

About this entry

Cite this entry

Steidl, G. (2011). Supervised Learning by Support Vector Machines. In: Scherzer, O. (eds) Handbook of Mathematical Methods in Imaging. Springer, New York, NY. https://doi.org/10.1007/978-0-387-92920-0_22
