Bayesian Kernel Methods

Chapter in: Advanced Lectures on Machine Learning

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2600)

Abstract

Bayesian methods allow for a simple and intuitive representation of the function spaces used by kernel methods. This chapter describes the basic principles of Gaussian Processes, their implementation and their connection to other kernel-based Bayesian estimation methods, such as the Relevance Vector Machine.

The present article is based on [62].
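
Since the abstract only names the ingredients, the following is a minimal, self-contained sketch of the Gaussian process regression computation that the chapter builds on: the posterior mean and covariance under a squared-exponential (RBF) kernel with Gaussian observation noise. It is an illustrative NumPy example, not code from the chapter; the function names (rbf_kernel, gp_posterior), hyperparameter values, and toy data are assumptions made for the demonstration.

```python
# Minimal Gaussian process regression sketch (illustrative only).
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel: k(a, b) = variance * exp(-||a - b||^2 / (2 * lengthscale^2))
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, noise_var=0.1):
    # Posterior mean and covariance of the GP at the test inputs.
    K = rbf_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)   # cross-covariances, shape (n_train, n_test)
    K_ss = rbf_kernel(X_test, X_test)   # test covariances
    L = np.linalg.cholesky(K)           # K = L L^T, solved via Cholesky for stability
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha                # E[f(X_test) | data]
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v                # Cov[f(X_test) | data]
    return mean, cov

# Toy usage: recover a noisy sine function.
X = np.linspace(0.0, 5.0, 20)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.randn(20)
X_star = np.linspace(0.0, 5.0, 100)[:, None]
mu, Sigma = gp_posterior(X, y, X_star)
```

Exact inference of this form scales as O(n³) in the number of training points, which is the motivation for the sparse and low-rank approximations cited below (e.g. [67], [68], [83]).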

References

  1. K. P. Bennett, A. Demiriz, and J. Shawe-Taylor. A column generation algorithm for boosting. In P. Langley, editor, Proceedings of the International Conference on Machine Learning, San Francisco, 2000. Morgan Kaufmann Publishers.

  2. K. P. Bennett and O. L. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1:23–34, 1992.

  3. C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.

  4. C. M. Bishop and M. E. Tipping. Variational relevance vector machines. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence (UAI 2000), pages 46–53, 2000.

  5. P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. In J. Shavlik, editor, Proceedings of the International Conference on Machine Learning, pages 82–90, San Francisco, California, 1998. Morgan Kaufmann Publishers. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-03.ps.Z.

  6. S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1999.

  7. T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley and Sons, New York, 1991.

  8. H. Cramér. Mathematical Methods of Statistics. Princeton University Press, 1946.

  9. L. Csató and M. Opper. Sparse representation for Gaussian process models. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 444–450. MIT Press, 2001.

  10. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39(1):1–22, 1977.

  11. S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195:216–222, 1987.

  12. S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representation. Technical report, IBM Watson Research Center, New York, 2000.

  13. R. Fletcher. Practical Methods of Optimization. John Wiley and Sons, New York, 1989.

  14. G. Fung and O. L. Mangasarian. Data selection for support vector machine classifiers. In Proceedings of KDD 2000, 2000. Also: Data Mining Institute Technical Report 00-02, University of Wisconsin, Madison.

  15. A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall, London, 1995.

  16. M. Gibbs and D. J. C. MacKay. Variational Gaussian process classifiers. Technical report, Cavendish Laboratory, Cambridge, UK, 1998.

  17. M. N. Gibbs. Bayesian Gaussian Methods for Regression and Classification. PhD thesis, University of Cambridge, 1997.

  18. M. Gibbs and D. J. C. MacKay. Efficient implementation of Gaussian processes. Technical report, Cavendish Laboratory, Cambridge, UK, 1997. Available at http://www.wol.ra.phy.cam.ac.uk/mng10/GP/.

  19. P. E. Gill, W. Murray, and M. H. Wright. Practical Optimization. Academic Press, 1981.

  20. F. Girosi. Models of noise and robust estimates. A.I. Memo 1287, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1991.

  21. D. Goldfarb and K. Scheinberg. A product-form Cholesky factorization method for handling dense columns in interior point methods for linear programming. Technical report, IBM Watson Research Center, Yorktown Heights, 2001.

  22. G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, 3rd edition, 1996.

  23. I. S. Gradshteyn and I. M. Ryzhik. Table of Integrals, Series, and Products. Academic Press, New York, 1981.

  24. T. Graepel, R. Herbrich, P. Bollmann-Sdorra, and K. Obermayer. Classification on pairwise proximity data. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 438–444, Cambridge, MA, 1999. MIT Press.

  25. D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Department, UC Santa Cruz, 1999.

  26. R. Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, 2002.

  27. R. Herbrich, T. Graepel, and C. Campbell. Bayes point machines: Estimating the Bayes point in kernel space. In Proceedings of the IJCAI Workshop on Support Vector Machines, pages 23–27, 1999.

  28. R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, 1985.

  29. P. J. Huber. Robust statistics: a review. Annals of Mathematical Statistics, 43:1041, 1972.

  30. T. Jaakkola, M. Meila, and T. Jebara. Maximum entropy discrimination. Technical Report AITR-1668, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1999.

  31. T. S. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 487–493, Cambridge, MA, 1999. MIT Press.

  32. T. S. Jaakkola and M. I. Jordan. Computing upper and lower bounds on likelihoods in intractable networks. In Proceedings of the 12th Conference on Uncertainty in AI. Morgan Kaufmann Publishers, 1996.

  33. W. James and C. Stein. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematics, Statistics and Probability, volume 1, pages 361–380, Berkeley, 1960. University of California Press.

  34. T. Jebara and T. Jaakkola. Feature selection and dualities in maximum entropy discrimination. In Uncertainty in Artificial Intelligence, 2000.

  35. M. I. Jordan and C. M. Bishop. An Introduction to Probabilistic Graphical Models. MIT Press, 2002.

  36. M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. In M. I. Jordan, editor, Learning in Graphical Models, pages 105–162. Kluwer Academic, 1998.

  37. M. S. Lewicki and T. J. Sejnowski. Learning nonlinear overcomplete representations for efficient coding. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems 10, pages 556–562, Cambridge, MA, 1998. MIT Press.

  38. D. G. Luenberger. Introduction to Linear and Nonlinear Programming. Addison-Wesley, Reading, MA, 1973.

  39. H. Lütkepohl. Handbook of Matrices. John Wiley and Sons, Chichester, 1996.

  40. D. J. C. MacKay. Bayesian Methods for Adaptive Models. PhD thesis, Computation and Neural Systems, California Institute of Technology, Pasadena, CA, 1991.

  41. D. J. C. MacKay. The evidence framework applied to classification networks. Neural Computation, 4(5):720–736, 1992.

  42. S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41:3397–3415, 1993.

  43. O. L. Mangasarian. Linear and nonlinear separation of patterns by linear programming. Operations Research, 13:444–452, 1965.

  44. B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 25(2):227–234, 1995.

  45. R. Neal. Priors for infinite networks. Technical Report CRG-TR-94-1, Dept. of Computer Science, University of Toronto, 1994.

  46. R. Neal. Bayesian Learning for Neural Networks. Springer, 1996.

  47. R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto, 1993.

  48. B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996.

  49. M. Opper and O. Winther. Mean field methods for classification with Gaussian processes. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 309–315, Cambridge, MA, 1999. MIT Press.

  50. M. Opper and O. Winther. Gaussian processes and SVM: mean field and leave-one-out. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 311–326, Cambridge, MA, 2000. MIT Press.

  51. J. Platt. Probabilities for SV machines. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 61–73, Cambridge, MA, 2000. MIT Press.

  52. T. Poggio. On optimal nonlinear associative recall. Biological Cybernetics, 19:201–209, 1975.

  53. W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing (2nd ed.). Cambridge University Press, Cambridge, 1992. ISBN 0-521-43108-5.

  54. C. Rasmussen. Evaluation of Gaussian Processes and Other Methods for Non-Linear Regression. PhD thesis, Department of Computer Science, University of Toronto, 1996. ftp://ftp.cs.toronto.edu/pub/carl/thesis.ps.gz.

  55. G. Rätsch, S. Mika, and A. J. Smola. Adapting codes and embeddings for polychotomies. In Neural Information Processing Systems, volume 15. MIT Press, 2002. To appear.

  56. B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, 1996.

  57. R. T. Rockafellar. Convex Analysis, volume 28 of Princeton Mathematics Series. Princeton University Press, 1970.

  58. P. Ruján and M. Marchand. Computing the Bayes kernel classifier. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 329–347, Cambridge, MA, 2000. MIT Press.

  59. P. Ruján. Playing billiards in version space. Neural Computation, 9:99–122, 1997.

  60. B. Schölkopf, S. Mika, C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. Smola. Input space vs. feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5):1000–1017, 1999.

  61. B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 327–352. MIT Press, Cambridge, MA, 1999.

  62. B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

  63. M. Seeger. Bayesian methods for support vector machines and Gaussian processes. Master's thesis, University of Edinburgh, Division of Informatics, 1999.

  64. J. Skilling. Maximum Entropy and Bayesian Methods. Cambridge University Press, 1988.

  65. A. Smola, B. Schölkopf, and G. Rätsch. Linear programs for automatic accuracy control in regression. In Ninth International Conference on Artificial Neural Networks, Conference Publications No. 470, pages 575–580, London, 1999. IEE.

  66. A. J. Smola. Learning with Kernels. PhD thesis, Technische Universität Berlin, 1998. GMD Research Series No. 25.

  67. A. J. Smola and P. L. Bartlett. Sparse greedy Gaussian process regression. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 619–625. MIT Press, 2001.

  68. A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In P. Langley, editor, Proceedings of the International Conference on Machine Learning, pages 911–918, San Francisco, 2000. Morgan Kaufmann Publishers.

  69. A. J. Smola and S. V. N. Vishwanathan. Cholesky factorization for rank-k modifications of diagonal matrices. SIAM Journal of Matrix Analysis, 2002. Submitted.

  70. C. S. Ong, A. J. Smola, and R. C. Williamson. Superkernels. In Neural Information Processing Systems, volume 15. MIT Press, 2002. To appear.

  71. D. J. Spiegelhalter and S. L. Lauritzen. Sequential updating of conditional probabilities on directed graphical structures. Networks, 20:579–605, 1990.

  72. J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Springer, New York, second edition, 1993.

  73. M. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.

  74. V. Tresp. A Bayesian committee machine. Neural Computation, 12(11):2719–2741, 2000.

  75. V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

  76. V. Vapnik, S. Golowich, and A. Smola. Support vector method for function approximation, regression estimation, and signal processing. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 281–287, Cambridge, MA, 1997. MIT Press.

  77. G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.

  78. C. Watkins. Dynamic alignment kernels. Technical Report CSD-TR-98-11, Royal Holloway, University of London, Egham, Surrey, UK, 1999.

  79. G. N. Watson. A Treatise on the Theory of Bessel Functions. Cambridge University Press, Cambridge, UK, 2nd edition, 1958.

  80. C. K. I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M. I. Jordan, editor, Learning and Inference in Graphical Models. Kluwer Academic, 1998.

  81. C. K. I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M. I. Jordan, editor, Learning and Inference in Graphical Models, pages 599–621. MIT Press, 1999.

  82. C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514–520, Cambridge, MA, 1996. MIT Press.

  83. C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 682–688, Cambridge, MA, 2001. MIT Press.

  84. C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.

  85. T. Zhang. Some sparse approximation bounds for regression problems. In Proc. 18th International Conf. on Machine Learning, pages 624–631. Morgan Kaufmann, San Francisco, CA, 2001.

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

Cite this chapter

Smola, A.J., Schölkopf, B. (2003). Bayesian Kernel Methods. In: Mendelson, S., Smola, A.J. (eds) Advanced Lectures on Machine Learning. Lecture Notes in Computer Science, vol 2600. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36434-X_3

  • DOI: https://doi.org/10.1007/3-540-36434-X_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-00529-2

  • Online ISBN: 978-3-540-36434-4
