Machine Learning, Volume 46, Issue 1–3, pp 161–190

Training Invariant Support Vector Machines

  • Dennis DeCoste
  • Bernhard Schölkopf

Abstract

Practical experience has shown that, in order to obtain the best possible performance, prior knowledge about the invariances of the classification problem at hand ought to be incorporated into the training procedure. We describe and review all known methods for doing so in support vector machines, provide experimental results, and discuss their respective merits. One significant new result reported in this work is the lowest test error so far reported on the well-known MNIST digit recognition benchmark, achieved with SVM training times significantly faster than those of previous SVM methods.

Keywords: support vector machines · invariance · prior knowledge · image classification · pattern recognition
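The core idea summarized in the abstract — building known invariances (such as small image translations) into the training procedure — can be illustrated with a minimal sketch. The snippet below is not the authors' implementation; it only shows, under assumed toy 3×3 images and one-pixel shifts, how "jittered" copies of a pattern yield virtual examples and a translation-tolerant distance:

```python
# Sketch of invariance via "jittering": generate translated copies of a
# training image (virtual examples), and compare two images with a
# jitter-tolerant distance = minimum distance over small translations.
# Toy 3x3 images and one-pixel shifts are illustrative assumptions.

def translate(img, dx, dy):
    """Shift a 2D image by (dx, dy), padding exposed pixels with 0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = img[sy][sx]
    return out

def jittered_copies(img, shifts=(-1, 0, 1)):
    """All translated variants of img (the 'virtual example' set)."""
    return [translate(img, dx, dy) for dx in shifts for dy in shifts]

def dist(a, b):
    """Plain squared Euclidean distance between two images."""
    return sum((pa - pb) ** 2
               for ra, rb in zip(a, b)
               for pa, pb in zip(ra, rb))

def jittered_dist(a, b):
    """Translation-tolerant distance: best match over small shifts of b."""
    return min(dist(a, j) for j in jittered_copies(b))

vertical_bar = [[0, 1, 0],
                [0, 1, 0],
                [0, 1, 0]]
shifted_bar = translate(vertical_bar, 1, 0)  # same stroke, moved one pixel right

print(dist(vertical_bar, shifted_bar))           # -> 6.0 (sees a difference)
print(jittered_dist(vertical_bar, shifted_bar))  # -> 0.0 (shift-invariant)
```

In the virtual-example view, the copies from `jittered_copies` would be added to the training set with the original label; in the jittered-kernel view, a distance like `jittered_dist` is built directly into the kernel evaluation.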


Copyright information

© Kluwer Academic Publishers 2002

Authors and Affiliations

  • Dennis DeCoste
    • 1. Jet Propulsion Laboratory, MS 126-347, Pasadena, USA
    • 2. California Institute of Technology, USA
  • Bernhard Schölkopf
    • 3. Max-Planck-Institut für biologische Kybernetik, Tübingen, Germany
