Machine Learning, Volume 46, Issue 1–3, pp 21–52

Bayesian Methods for Support Vector Machines: Evidence and Predictive Class Probabilities

  • Peter Sollich

Abstract

I describe a framework for interpreting Support Vector Machines (SVMs) as maximum a posteriori (MAP) solutions to inference problems with Gaussian Process priors. This probabilistic interpretation can provide intuitive guidelines for choosing a ‘good’ SVM kernel. Beyond this, it allows Bayesian methods to be used for tackling two of the outstanding challenges in SVM classification: how to tune hyperparameters—the misclassification penalty C, and any parameters specifying the kernel—and how to obtain predictive class probabilities rather than the conventional deterministic class label predictions. Hyperparameters can be set by maximizing the evidence; I explain how the latter can be defined and properly normalized. Both analytical approximations and numerical methods (Monte Carlo chaining) for estimating the evidence are discussed. I also compare different methods of estimating class probabilities, ranging from simple evaluation at the MAP or at the posterior average to full averaging over the posterior. A simple toy application illustrates the various concepts and techniques.
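To make the framework described above concrete, here is a minimal sketch (in LaTeX notation; a paraphrase of the standard correspondence, written for a latent function θ(x) with the bias term omitted, and not the paper's exact formulation) of how the soft-margin SVM objective can be read as MAP inference under a Gaussian process prior:

\[
  \theta_{\mathrm{SVM}}
    \;=\; \arg\min_{\theta}\;
      \tfrac{1}{2}\,\lVert\theta\rVert_K^2
      \;+\; C\sum_{i=1}^{n} l\bigl(y_i\,\theta(x_i)\bigr),
  \qquad l(z) \;=\; \max(0,\,1-z),
\]
\[
  P(\theta \mid D) \;\propto\;
    \exp\!\bigl(-\tfrac{1}{2}\,\lVert\theta\rVert_K^2\bigr)
    \,\prod_{i=1}^{n}\exp\!\bigl(-C\,l\bigl(y_i\,\theta(x_i)\bigr)\bigr).
\]

The first factor is a zero-mean Gaussian process prior with covariance kernel K; the remaining factors act as per-example likelihood terms, so the SVM solution is the MAP estimate of θ. In this reading, the evidence P(D) is the normalizing constant of the posterior, maximized over C and the kernel parameters for hyperparameter tuning, and predictive class probabilities follow from averaging the likelihood model over the posterior rather than evaluating it only at the MAP point. Making the likelihood factors properly normalizable is one of the issues the paper addresses.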

Keywords: Support vector machines, Gaussian processes, Bayesian inference, evidence, hyperparameter tuning, probabilistic predictions


Copyright information

© Kluwer Academic Publishers 2002

Authors and Affiliations

  • Peter Sollich
    1. Department of Mathematics, King's College London, Strand, London, UK
