Model Selection for Gaussian Process Regression

  • Conference paper
  • Published in: Pattern Recognition (GCPR 2017)
  • Part of the book series: Lecture Notes in Computer Science, volume 10496

Abstract

Gaussian processes are powerful tools since they can model non-linear dependencies between inputs and outputs, while remaining analytically tractable. A Gaussian process is characterized by a mean function and a covariance function (kernel), which are determined by a model selection criterion. The candidate functions do not just differ in their parametrization but in their fundamental structure; it is often unclear which structure to choose, for instance whether to prefer a squared exponential or a rational quadratic kernel. Based on the principle of posterior agreement, we develop a general framework for model selection to rank kernels for Gaussian process regression and compare it with maximum evidence (also called marginal likelihood) and leave-one-out cross-validation. Given the disagreement between current state-of-the-art methods in our experiments, we show the difficulty of model selection and the need for an information-theoretic approach.
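
As an illustration of the two baseline criteria (not the paper's posterior-agreement method), the sketch below scores a squared exponential (RBF) and a rational quadratic kernel by maximum evidence and by leave-one-out cross-validation with scikit-learn; the toy data set and all parameter values are assumptions for demonstration only.

```python
# Illustrative sketch: rank two kernel structures by maximum evidence
# (log marginal likelihood) and by leave-one-out cross-validation.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, RationalQuadratic
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(40, 1)), axis=0)  # toy inputs (assumption)
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=40)    # noisy toy targets

for kernel in (RBF(), RationalQuadratic()):
    gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-2).fit(X, y)
    evidence = gpr.log_marginal_likelihood_value_  # evidence at fitted hyperparameters
    loo = cross_val_score(GaussianProcessRegressor(kernel=kernel, alpha=1e-2),
                          X, y, cv=LeaveOneOut(),
                          scoring="neg_mean_squared_error").mean()
    print(f"{type(kernel).__name__:17s} evidence={evidence:8.2f} LOO-MSE={-loo:.4f}")
```

The two criteria need not agree on the ranking, which is precisely the disagreement the paper investigates.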

N.S. Gorbach and A.A. Bian—These two authors contributed equally.

Notes

  1. http://berkeleyearth.org/data/.

  2. http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant.

References

  1. Bachoc, F.: Cross validation and maximum likelihood estimations of hyper-parameters of Gaussian processes with model misspecification. Comput. Stat. Data Anal. 66, 55–69 (2013)

  2. Bian, A.A., Gronskiy, A., Buhmann, J.M.: Information-theoretic analysis of maxcut algorithms. Technical report, Department of Computer Science, ETH Zurich (2016). http://people.inf.ethz.ch/ybian/docs/pa.pdf

  3. Bian, Y., Gronskiy, A., Buhmann, J.M.: Greedy maxcut algorithms and their information content. In: IEEE Information Theory Workshop (ITW), pp. 1–5 (2015)

  4. Buhmann, J.M.: Information theoretic model validation for clustering. In: IEEE International Symposium on Information Theory (ISIT), pp. 1398–1402 (2010)

  5. Buhmann, J.M.: SIMBAD: emergence of pattern similarity. In: Pelillo, M. (ed.) Similarity-Based Pattern Analysis and Recognition. ACVPR, pp. 45–64. Springer, London (2013). doi:10.1007/978-1-4471-5628-4_3

  6. Cawley, G.C., Talbot, N.L.C.: On over-fitting in model selection and subsequent selection bias in performance evaluation. J. Mach. Learn. Res. 11, 2079–2107 (2010)

  7. Chapelle, O.: Some thoughts about Gaussian processes (2005). http://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/gp_[0].pdf

  8. Chehreghani, M.H., Busetto, A.G., Buhmann, J.M.: Information theoretic model validation for spectral clustering. In: International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 495–503 (2012)

  9. Damianou, A.C., Lawrence, N.D.: Deep Gaussian processes. In: International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 207–215 (2013)

  10. Frank, M., Buhmann, J.M.: Selecting the rank of truncated SVD by maximum approximation capacity. In: IEEE International Symposium on Information Theory (ISIT), pp. 1036–1040 (2011)

  11. Gronskiy, A., Buhmann, J.: How informative are minimum spanning tree algorithms? In: IEEE International Symposium on Information Theory (ISIT), pp. 2277–2281 (2014)

  12. Horn, R.A., Johnson, C.R.: Matrix Analysis, 2nd edn. Cambridge University Press, Cambridge (2012)

  13. Jaynes, E.T.: Information theory and statistical mechanics. Phys. Rev. 106, 620–630 (1957)

  14. Jaynes, E.T.: Information theory and statistical mechanics. II. Phys. Rev. 108, 171–190 (1957)

  15. Lloyd, J.R., Duvenaud, D., Grosse, R., Tenenbaum, J.B., Ghahramani, Z.: Automatic construction and natural-language description of nonparametric regression models. In: AAAI Conference on Artificial Intelligence (AAAI), pp. 1242–1250 (2014)

  16. Nocedal, J.: Updating quasi-Newton matrices with limited storage. Math. Comput. 35, 773–782 (1980)

  17. Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. The MIT Press, Cambridge (2006)

  18. Seeger, M.W.: PAC-Bayesian generalisation error bounds for Gaussian process classification. J. Mach. Learn. Res. 3, 233–269 (2002)

  19. Tong, Y.L.: The Multivariate Normal Distribution. Springer Science & Business Media, New York (2012)

  20. Zee, A.: Quantum Field Theory in a Nutshell. Princeton University Press, Princeton (2003)

  21. Zhu, X., Welling, M., Jin, F., Lowengrub, J.S.: Predicting simulation parameters of biological systems using a Gaussian process model. Stat. Anal. Data Min. 5, 509–522 (2012)

Acknowledgments

This research was partially supported by the Max Planck ETH Center for Learning Systems and the SystemsX.ch project SignalX.

Author information

Correspondence to Stefan Bauer.

Appendix

A Propositions on Gaussian distributions

This appendix collects properties of Gaussian distributions that are used in the derivations of Sect. 2.2.

Proposition 1

If

$$ \begin{bmatrix} {\varvec{t}} \\ {\varvec{u}} \end{bmatrix} \thicksim {{\mathrm{\mathcal {N}}}}\left( {\begin{bmatrix} {\varvec{\mu }} \\ {\varvec{r}} \end{bmatrix}, \begin{bmatrix} \varvec{\varSigma }&\varvec{A} \\ \varvec{A}^\intercal&\varvec{V} \end{bmatrix}}\right) $$

then

$$ {\varvec{t}}\mid {\varvec{u}}\thicksim {{\mathrm{\mathcal {N}}}}\left( {{\varvec{\mu }}+ \varvec{A}\varvec{V} ^ {-1} \left( {{\varvec{u}}- {\varvec{r}}}\right) ,\; \varvec{\varSigma }- \varvec{A}\varvec{V} ^ {-1} \varvec{A}^\intercal }\right) $$

[19, Theorem 3.3.4].
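
As a quick numerical sanity check (not part of the paper), the snippet below compares the density ratio \( p({\varvec{t}}, {\varvec{u}}) / p({\varvec{u}}) \) with the conditional Gaussian density above; the dimensions and the randomly generated joint covariance are illustrative assumptions.

```python
# Check of Proposition 1: p(t | u) = p(t, u) / p(u) matches the Gaussian with
# mean mu + A V^{-1}(u - r) and covariance Sigma - A V^{-1} A^T.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d_t, d_u = 2, 3                                  # arbitrary dimensions
mu, r = rng.normal(size=d_t), rng.normal(size=d_u)
M = rng.normal(size=(d_t + d_u, d_t + d_u))
J = M @ M.T + (d_t + d_u) * np.eye(d_t + d_u)    # SPD joint covariance
Sigma, A, V = J[:d_t, :d_t], J[:d_t, d_t:], J[d_t:, d_t:]

t, u = rng.normal(size=d_t), rng.normal(size=d_u)
lhs = (multivariate_normal(np.concatenate([mu, r]), J).pdf(np.concatenate([t, u]))
       / multivariate_normal(r, V).pdf(u))
rhs = multivariate_normal(mu + A @ np.linalg.solve(V, u - r),
                          Sigma - A @ np.linalg.solve(V, A.T)).pdf(t)
print(np.isclose(lhs, rhs))  # True
```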

Proposition 2

If \( \varvec{\varLambda }\) is symmetric positive-definite, then

$$ \int _{\mathbb {R}^ {D}}^{} \exp \left( {{\varvec{x}}^\intercal \left( {{\varvec{r}}- \frac{1}{2} \varvec{\varLambda }{\varvec{x}}}\right) }\right) \,\text{ d }^ {D} {\varvec{x}}= \sqrt{\frac{(2 \pi )^ {D}}{\det \varvec{\varLambda }}} \exp \left( {\frac{1}{2} {\varvec{r}}^\intercal \varvec{\varLambda } ^ {-1} {\varvec{r}}}\right) $$

[20, 14].
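
For intuition, here is a one-dimensional check of this Gaussian integral against numerical quadrature; the values of \( \varvec{\varLambda }\) and \( {\varvec{r}}\) are arbitrary assumptions.

```python
# Check of Proposition 2 for D = 1: the integral of exp(x (r - lam x / 2))
# equals sqrt(2 pi / lam) * exp(r^2 / (2 lam)).
import numpy as np
from scipy.integrate import quad

lam, r = 2.5, 0.7  # illustrative values with lam > 0
numeric, _ = quad(lambda x: np.exp(x * (r - 0.5 * lam * x)), -np.inf, np.inf)
closed = np.sqrt(2 * np.pi / lam) * np.exp(0.5 * r**2 / lam)
print(np.isclose(numeric, closed))  # True
```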

Proposition 3

It holds that

$$ \int _{\mathbb {R}^ {D}}^{} \prod _{k = 1}^{K} {{\mathrm{\mathcal {N}}}}\left( {{\varvec{x}}\mid {\varvec{\mu }}_{k}, \varvec{\varSigma }_{k}}\right) \,\text{ d }^ {D} {\varvec{x}}= \gamma \sqrt{\frac{(2 \pi )^ {D}}{\det \varvec{\varLambda }}} \exp \left( {\frac{1}{2} {\varvec{r}}^\intercal \varvec{\varLambda } ^ {-1} {\varvec{r}}}\right) , $$

where \( {\varvec{r}}= \sum _{k = 1}^{K} \varvec{\varSigma }_{k} ^ {-1} {\varvec{\mu }}_{k} \), \( \varvec{\varLambda }= \sum _{k = 1}^{K} \varvec{\varSigma }_{k} ^ {-1} \) and \( \gamma = \prod _{k = 1}^{K} \left( {(2 \pi )^ {D} \det \varvec{\varSigma }_{k}}\right) ^ {-1/2} \exp \left( {- \frac{1}{2} {\varvec{\mu }}_{k}^\intercal \varvec{\varSigma }_{k} ^ {-1} {\varvec{\mu }}_{k}}\right) \).

Proof

We collect the factor \( \gamma \), which is independent of \( {\varvec{x}}\), and move it out of the integral:

$$ \prod _{k = 1}^{K} {{\mathrm{\mathcal {N}}}}\left( {{\varvec{x}}\mid {\varvec{\mu }}_{k}, \varvec{\varSigma }_{k}}\right) = \gamma \exp \left( {{\varvec{x}}^\intercal \left( {{\varvec{r}}- \frac{1}{2} \varvec{\varLambda }{\varvec{x}}}\right) }\right) . $$

The remaining integral can be calculated by Proposition 2.
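
The proposition can likewise be checked numerically; the sketch below compares both sides in one dimension for \( K = 3 \) Gaussians with arbitrarily chosen means and variances.

```python
# Check of Proposition 3 for D = 1, K = 3 (values arbitrary).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mus, sds = [0.5, -1.0, 2.0], [1.0, 0.8, 1.5]
r = sum(m / s**2 for m, s in zip(mus, sds))        # r = sum_k Sigma_k^{-1} mu_k
lam = sum(1 / s**2 for s in sds)                   # Lambda = sum_k Sigma_k^{-1}
gamma = np.prod([np.exp(-0.5 * m**2 / s**2) / np.sqrt(2 * np.pi * s**2)
                 for m, s in zip(mus, sds)])
numeric, _ = quad(lambda x: np.prod([norm.pdf(x, m, s) for m, s in zip(mus, sds)]),
                  -np.inf, np.inf)
closed = gamma * np.sqrt(2 * np.pi / lam) * np.exp(0.5 * r**2 / lam)
print(np.isclose(numeric, closed))  # True
```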

Proposition 4

If \( \varvec{\varSigma }\) is symmetric positive-definite, then \( \varvec{\varSigma }\) is invertible and \( \varvec{\varSigma } ^ {-1} \) is symmetric positive-definite [12, p. 430].

Proposition 5

If \( \varvec{\varSigma }\) is symmetric positive-definite and \( \varvec{A}\) has full row rank, then \( \varvec{A}\varvec{\varSigma }\varvec{A}^\intercal \) is symmetric positive-definite [12, Observation 7.1.8.(b)].
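
Both propositions are easy to verify numerically with random matrices; the dimensions below are arbitrary assumptions.

```python
# Checks of Propositions 4 and 5 with random SPD Sigma and full-row-rank A.
import numpy as np

rng = np.random.default_rng(1)
N, D = 5, 3
M = rng.normal(size=(N, N))
Sigma = M @ M.T + N * np.eye(N)   # symmetric positive-definite by construction
A = rng.normal(size=(D, N))       # full row rank with probability one

inv = np.linalg.inv(Sigma)        # Proposition 4: inverse exists and is SPD
S = A @ Sigma @ A.T               # Proposition 5: A Sigma A^T is SPD
print(np.linalg.eigvalsh(inv).min() > 0, np.linalg.eigvalsh(S).min() > 0)  # True True
```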

Proposition 6

For \( \varvec{A}\in \mathbb {R}^ {D \times N} \) of full row rank, the density

$$ p\left( {{\varvec{x}}}\right) \propto \exp \left( {- \frac{1}{2} \left( {\varvec{A}^\intercal {\varvec{x}}- {\varvec{\mu }}}\right) ^\intercal \varvec{\varSigma } ^ {-1} \left( {\varvec{A}^\intercal {\varvec{x}}- {\varvec{\mu }}}\right) }\right) $$

has the equivalent form \( {{\mathrm{\mathcal {N}}}}\left( {{\varvec{x}}\mid \varvec{\varLambda } ^ {-1} {\varvec{r}}, \varvec{\varLambda } ^ {-1}}\right) \), where \( {\varvec{r}}= \varvec{A}\varvec{\varSigma } ^ {-1} {\varvec{\mu }}\) and \( \varvec{\varLambda }= \varvec{A}\varvec{\varSigma } ^ {-1} \varvec{A}^\intercal \).

Proof

First, we separate a factor independent of \( {\varvec{x}}\):

$$ \exp \left( {- \frac{1}{2} \left( {\varvec{A}^\intercal {\varvec{x}}- {\varvec{\mu }}}\right) ^\intercal \varvec{\varSigma } ^ {-1} \left( {\varvec{A}^\intercal {\varvec{x}}- {\varvec{\mu }}}\right) }\right) = \exp \left( {- \frac{1}{2} {\varvec{\mu }}^\intercal \varvec{\varSigma } ^ {-1} {\varvec{\mu }}}\right) \exp \left( {{\varvec{x}}^\intercal \left( {{\varvec{r}}- \frac{1}{2} \varvec{\varLambda }{\varvec{x}}}\right) }\right) , $$

and this constant factor cancels between numerator and denominator of the normalized density.

Therefore,

$$ p\left( {{\varvec{x}}}\right) = \frac{\exp \left( {{\varvec{x}}^\intercal \left( {{\varvec{r}}- \frac{1}{2} \varvec{\varLambda }{\varvec{x}}}\right) }\right) }{\int _{\mathbb {R}^ {D}}^{} \exp \left( {{\varvec{x}}^\intercal \left( {{\varvec{r}}- \frac{1}{2} \varvec{\varLambda }{\varvec{x}}}\right) }\right) \,\text{ d }^ {D} {\varvec{x}}}. $$

We now calculate the integral. From Proposition 4 and Proposition 5, one can see that \(\varvec{\varLambda }\) is symmetric positive-definite, so that Proposition 2 can be applied to find

$$ \int _{\mathbb {R}^ {D}}^{} \exp \left( {{\varvec{x}}^\intercal \left( {{\varvec{r}}- \frac{1}{2} \varvec{\varLambda }{\varvec{x}}}\right) }\right) \,\text{ d }^ {D} {\varvec{x}}= \sqrt{\frac{(2 \pi )^ {D}}{\det \varvec{\varLambda }}} \exp \left( {\frac{1}{2} {\varvec{r}}^\intercal \varvec{\varLambda } ^ {-1} {\varvec{r}}}\right) . $$

Finally, one gets

$$ p\left( {{\varvec{x}}}\right) = \sqrt{\frac{\det \varvec{\varLambda }}{(2 \pi )^ {D}}} \exp \left( {- \frac{1}{2} \left( {{\varvec{x}}- \varvec{\varLambda } ^ {-1} {\varvec{r}}}\right) ^\intercal \varvec{\varLambda }\left( {{\varvec{x}}- \varvec{\varLambda } ^ {-1} {\varvec{r}}}\right) }\right) = {{\mathrm{\mathcal {N}}}}\left( {{\varvec{x}}\mid \varvec{\varLambda } ^ {-1} {\varvec{r}}, \varvec{\varLambda } ^ {-1}}\right) . $$
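
As a final sanity check (again not from the paper), the snippet below compares density ratios of the unnormalized density from Proposition 6 with those of \( {{\mathrm{\mathcal {N}}}}\left( {{\varvec{x}}\mid \varvec{\varLambda } ^ {-1} {\varvec{r}}, \varvec{\varLambda } ^ {-1}}\right) \), so that normalization constants cancel; all sizes and values are illustrative.

```python
# Check of Proposition 6 via density ratios at two random points.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
D, N = 2, 4
A = rng.normal(size=(D, N))       # full row rank with probability one
M = rng.normal(size=(N, N))
Sigma = M @ M.T + N * np.eye(N)   # SPD
mu = rng.normal(size=N)

Si = np.linalg.inv(Sigma)
r, Lam = A @ Si @ mu, A @ Si @ A.T

def unnorm(x):                    # unnormalized density from Proposition 6
    v = A.T @ x - mu
    return np.exp(-0.5 * v @ Si @ v)

target = multivariate_normal(np.linalg.solve(Lam, r), np.linalg.inv(Lam))
x1, x2 = rng.normal(size=D), rng.normal(size=D)
print(np.isclose(unnorm(x1) / unnorm(x2), target.pdf(x1) / target.pdf(x2)))  # True
```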


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Gorbach, N.S., Bian, A.A., Fischer, B., Bauer, S., Buhmann, J.M. (2017). Model Selection for Gaussian Process Regression. In: Roth, V., Vetter, T. (eds) Pattern Recognition. GCPR 2017. Lecture Notes in Computer Science, vol. 10496. Springer, Cham. https://doi.org/10.1007/978-3-319-66709-6_25

  • DOI: https://doi.org/10.1007/978-3-319-66709-6_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-66708-9

  • Online ISBN: 978-3-319-66709-6
