Abstract
I describe a framework for interpreting Support Vector Machines (SVMs) as maximum a posteriori (MAP) solutions to inference problems with Gaussian process priors. This probabilistic interpretation can provide intuitive guidelines for choosing a ‘good’ SVM kernel. Beyond this, it allows Bayesian methods to be used for tackling two of the outstanding challenges in SVM classification: how to tune hyperparameters (the misclassification penalty C, and any parameters specifying the kernel) and how to obtain predictive class probabilities rather than the conventional deterministic class label predictions. Hyperparameters can be set by maximizing the evidence; I explain how the latter can be defined and properly normalized. Both analytical approximations and numerical methods (Monte Carlo chaining) for estimating the evidence are discussed. I also compare different methods of estimating class probabilities, ranging from simple evaluation at the MAP or at the posterior average, to full averaging over the posterior. A simple toy application illustrates the various concepts and techniques.
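To fix ideas, here is a minimal sketch of the correspondence (an orientation aid using generic notation, not a verbatim excerpt from the paper): with a zero-mean Gaussian process prior whose kernel matrix K is evaluated at the training inputs, and a likelihood built from the SVM hinge loss l(z) = max(0, 1 − z), the negative log posterior over the latent values \(\mathbf{f} = (f(x_1), \ldots, f(x_n))\) is, up to terms independent of \(\mathbf{f}\),

\[
-\ln P(\mathbf{f} \mid D) \;=\; C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i f(x_i)\bigr) \;+\; \tfrac{1}{2}\, \mathbf{f}^{\top} K^{-1} \mathbf{f} \;+\; \mathrm{const},
\]

so the MAP solution minimizes exactly the standard SVM training objective. The evidence is then the marginal likelihood \(P(D \mid C, K) = \int \mathrm{d}\mathbf{f}\; P(D \mid \mathbf{f})\, P(\mathbf{f})\); the normalization issue flagged in the abstract arises because a likelihood \(\propto \exp[-C\, l(y f)]\) does not sum to one over \(y = \pm 1\), so it must be normalized before the evidence is well defined. Predictive class probabilities take the form \(P(y \mid x, D) = \int \mathrm{d}\mathbf{f}\; P(y \mid f(x))\, P(\mathbf{f} \mid D)\), which can be approximated by evaluation at the MAP, at the posterior average, or by full averaging over the posterior.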
Cite this article
Sollich, P. Bayesian Methods for Support Vector Machines: Evidence and Predictive Class Probabilities. Machine Learning 46, 21–52 (2002). https://doi.org/10.1023/A:1012489924661