Data Mining and Knowledge Discovery, Volume 2, Issue 2, pp. 121–167

A Tutorial on Support Vector Machines for Pattern Recognition

  • Christopher J.C. Burges

Abstract

The tutorial starts with an overview of the concepts of VC dimension and structural risk minimization. We then describe linear Support Vector Machines (SVMs) for separable and non-separable data, working through a non-trivial example in detail. We describe a mechanical analogy, and discuss when SVM solutions are unique and when they are global. We describe how support vector training can be practically implemented, and discuss in detail the kernel mapping technique which is used to construct SVM solutions which are nonlinear in the data. We show how Support Vector Machines can have very large (even infinite) VC dimension by computing the VC dimension for homogeneous polynomial and Gaussian radial basis function kernels. While very high VC dimension would normally bode ill for generalization performance, and while at present there exists no theory which shows that good generalization performance is guaranteed for SVMs, there are several arguments supporting the observed high accuracy of SVMs, which we review. Results of some experiments which were inspired by these arguments are also presented. We give numerous examples and proofs of most of the key theorems. There is new material, and I hope that the reader will find that even old material is cast in a fresh light.
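To make the abstract's quantities concrete: the structural risk minimization framework it opens with bounds the true risk $R(\alpha)$ by the empirical risk $R_{\mathrm{emp}}(\alpha)$ plus a capacity term depending on the VC dimension $h$ and the number of training points $l$. In the standard Vapnik form that the tutorial works from, with probability $1-\eta$,

$$
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h\,\bigl(\ln(2l/h) + 1\bigr) - \ln(\eta/4)}{l}}.
$$

The two kernels whose VC dimension the tutorial computes are the homogeneous polynomial kernel $K(\mathbf{x},\mathbf{y}) = (\mathbf{x}\cdot\mathbf{y})^{p}$ and the Gaussian RBF kernel $K(\mathbf{x},\mathbf{y}) = e^{-\lVert\mathbf{x}-\mathbf{y}\rVert^{2}/2\sigma^{2}}$. Below is a minimal sketch of soft-margin SVM training with both kernels; the use of scikit-learn is an assumption made purely for illustration, since the tutorial predates the library and derives the underlying quadratic program directly:

```python
# Illustrative sketch only; not code from the tutorial.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy non-separable data: two overlapping Gaussian blobs in the plane.
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# Homogeneous polynomial kernel K(x, y) = (x . y)^3:
# with coef0=0, sklearn's (gamma * <x, y> + coef0)^degree is homogeneous.
poly_svm = SVC(kernel="poly", degree=3, gamma=1.0, coef0=0.0, C=10.0).fit(X, y)

# Gaussian RBF kernel K(x, y) = exp(-gamma * ||x - y||^2).
rbf_svm = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(X, y)

# The support vectors are exactly the training points whose Lagrange
# multipliers are nonzero in the dual solution.
print("polynomial kernel, support vectors per class:", poly_svm.n_support_)
print("RBF kernel, support vectors per class:", rbf_svm.n_support_)
```

The RBF kernel corresponds to an infinite-dimensional feature space, which is one source of the "very large (even infinite) VC dimension" the abstract refers to.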

Keywords: support vector machines, statistical learning theory, VC dimension, pattern recognition

Copyright information

© Kluwer Academic Publishers 1998

Authors and Affiliations

  • Christopher J.C. Burges
    Bell Laboratories, Lucent Technologies, USA