Bayesian Kernel Methods

  • Alexander J. Smola
  • Bernhard Schölkopf
Chapter in the Lecture Notes in Computer Science book series (LNCS, volume 2600)

Abstract

Bayesian methods allow for a simple and intuitive representation of the function spaces used by kernel methods. This chapter describes the basic principles of Gaussian processes, their implementation, and their connection to other kernel-based Bayesian estimation methods, such as the relevance vector machine.

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Alexander J. Smola (1)
  • Bernhard Schölkopf (2)
  1. RSISE, The Australian National University, Canberra, Australia
  2. Max Planck Institut für Biologische Kybernetik, Tübingen, Germany