Abstract
Gaussian processes are powerful tools since they can model non-linear dependencies between inputs, while remaining analytically tractable. A Gaussian process is characterized by a mean function and a covariance function (kernel), which are determined by a model selection criterion. The functions to be compared do not just differ in their parametrization but in their fundamental structure. It is often not clear which function structure to choose, for instance to decide between a squared exponential and a rational quadratic kernel. Based on the principle of posterior agreement, we develop a general framework for model selection to rank kernels for Gaussian process regression and compare it with maximum evidence (also called marginal likelihood) and leave-one-out cross-validation. Given the disagreement between current state-of-the-art methods in our experiments, we show the difficulty of model selection and the need for an information-theoretic approach.
N.S. Gorbach and A.A. Bian—These two authors contributed equally.
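To make the two baseline criteria concrete, the following is a minimal NumPy sketch (an illustration only, not the posterior-agreement method developed in this paper): it ranks a squared exponential kernel against a rational quadratic kernel by their log marginal likelihood (evidence) on synthetic data. The kernel hyperparameters, noise level, and data are placeholder choices.

import numpy as np

def se_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    # Squared exponential (RBF) kernel: k(x, x') = s^2 exp(-||x - x'||^2 / (2 l^2))
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def rq_kernel(X1, X2, lengthscale=1.0, variance=1.0, alpha=1.0):
    # Rational quadratic kernel: k(x, x') = s^2 (1 + ||x - x'||^2 / (2 a l^2))^{-a}
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * (1.0 + 0.5 * d2 / (alpha * lengthscale ** 2)) ** (-alpha)

def log_evidence(K, y, noise_var=1e-2):
    # Log marginal likelihood of a zero-mean GP:
    # -1/2 y^T (K + s_n^2 I)^{-1} y - 1/2 log|K + s_n^2 I| - n/2 log(2 pi),
    # computed via a Cholesky factorization.
    n = y.shape[0]
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ a - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2.0 * np.pi)

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

for name, K in [("SE", se_kernel(X, X)), ("RQ", rq_kernel(X, X))]:
    print(name, log_evidence(K, y))

A leave-one-out cross-validation score, the second baseline compared in our experiments, can be obtained in the same setting by predicting each held-out point from the remaining ones and averaging the predictive log probabilities.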
References
Bachoc, F.: Cross validation and maximum likelihood estimations of hyper-parameters of Gaussian processes with model misspecification. Comput. Stat. Data Anal. 66, 55–69 (2013)
Bian, A.A., Gronskiy, A., Buhmann, J.M.: Information-theoretic analysis of maxcut algorithms. Technical report, Department of Computer Science, ETH Zurich (2016). http://people.inf.ethz.ch/ybian/docs/pa.pdf
Bian, Y., Gronskiy, A., Buhmann, J.M.: Greedy maxcut algorithms and their information content. In: IEEE Information Theory Workshop (ITW), pp. 1–5 (2015)
Buhmann, J.M.: Information theoretic model validation for clustering. In: IEEE International Symposium on Information Theory (ISIT), pp. 1398–1402 (2010)
Buhmann, J.M.: SIMBAD: emergence of pattern similarity. In: Pelillo, M. (ed.) Similarity-Based Pattern Analysis and Recognition. ACVPR, pp. 45–64. Springer, London (2013). doi:10.1007/978-1-4471-5628-4_3
Cawley, G.C., Talbot, N.L.C.: On over-fitting in model selection and subsequent selection bias in performance evaluation. J. Mach. Learn. Res. 11, 2079–2107 (2010)
Chapelle, O.: Some thoughts about Gaussian processes (2005). http://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/gp_[0].pdf
Chehreghani, M.H., Busetto, A.G., Buhmann, J.M.: Information theoretic model validation for spectral clustering. In: International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 495–503 (2012)
Damianou, A.C., Lawrence, N.D.: Deep Gaussian processes. In: International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 207–215 (2013)
Frank, M., Buhmann, J.M.: Selecting the rank of truncated SVD by maximum approximation capacity. In: IEEE International Symposium on Information Theory (ISIT), pp. 1036–1040 (2011)
Gronskiy, A., Buhmann, J.: How informative are minimum spanning tree algorithms? In: IEEE International Symposium on Information Theory (ISIT), pp. 2277–2281 (2014)
Horn, R.A., Johnson, C.R.: Matrix Analysis, 2nd edn. Cambridge University Press, Cambridge (2012)
Jaynes, E.T.: Information theory and statistical mechanics. Phys. Rev. 106, 620–630 (1957)
Jaynes, E.T.: Information theory and statistical mechanics. II. Phys. Rev. 108, 171–190 (1957)
Lloyd, J.R., Duvenaud, D., Grosse, R., Tenenbaum, J.B., Ghahramani, Z.: Automatic construction and natural-language description of nonparametric regression models. In: AAAI Conference on Artificial Intelligence (AAAI), pp. 1242–1250 (2014)
Nocedal, J.: Updating quasi-Newton matrices with limited storage. Math. Comput. 35, 773–782 (1980)
Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. The MIT Press, Cambridge (2006)
Seeger, M.W.: PAC-Bayesian generalisation error bounds for Gaussian process classification. J. Mach. Learn. Res. 3, 233–269 (2002)
Tong, Y.L.: The Multivariate Normal Distribution. Springer Science & Business Media, New York (2012)
Zee, A.: Quantum Field Theory in a Nutshell. Princeton University Press, Princeton (2003)
Zhu, X., Welling, M., Jin, F., Lowengrub, J.S.: Predicting simulation parameters of biological systems using a Gaussian process model. Stat. Anal. Data Min. 5, 509–522 (2012)
Acknowledgments
This research was partially supported by the Max Planck ETH Center for Learning Systems and the SystemsX.ch project SignalX.
Appendix
A Propositions of Gaussian distribution
This appendix collects properties of Gaussian distributions that are used in the derivations of Sect. 2.2.
Proposition 1
If
then
[19, Theorem 3.3.4].
Proposition 2
If \( \varvec{\varLambda }\) is symmetric positive-definite, then
[20, 14].
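Judging from the cited passage of [20] and from how the result is invoked with \( {\varvec{r}}\) and \( \varvec{\varLambda }\) in Propositions 3 and 6, Proposition 2 is presumably the multivariate Gaussian integral
\[ \int _{\mathbb {R}^{D}} \exp \left ( -\tfrac{1}{2}\, {\varvec{x}}^\intercal \varvec{\varLambda }\, {\varvec{x}} + {\varvec{r}}^\intercal {\varvec{x}} \right ) \mathrm {d}{\varvec{x}} = \sqrt{\frac{(2\pi )^{D}}{\det \varvec{\varLambda }}}\, \exp \left ( \tfrac{1}{2}\, {\varvec{r}}^\intercal \varvec{\varLambda }^{-1} {\varvec{r}} \right ). \]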
Proposition 3
It holds that,
where \( {\varvec{r}}= \sum _{k = 1}^{K} \varvec{\varSigma }_{k} ^ {-1} {\varvec{\mu }}_{k} \) and \( \varvec{\varLambda }= \sum _{k = 1}^{K} \varvec{\varSigma }_{k} ^ {-1} \).
Proof
We collect the terms independent of \( {\varvec{x}}\) into a factor \( \gamma \) and move it out of the integral, as in
The remaining integral can be calculated by Proposition 2.
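With the stated \( {\varvec{r}}\) and \( \varvec{\varLambda }\), the step above presumably rests on the fact that a product of Gaussian densities in the same variable is again of Gaussian form up to a constant,
\[ \prod _{k=1}^{K} \mathcal {N}({\varvec{x}} \mid {\varvec{\mu }}_{k}, \varvec{\varSigma }_{k}) = \gamma \, \exp \left ( -\tfrac{1}{2}\, {\varvec{x}}^\intercal \varvec{\varLambda }\, {\varvec{x}} + {\varvec{r}}^\intercal {\varvec{x}} \right ), \quad \gamma = \prod _{k=1}^{K} (2\pi )^{-D/2} \det (\varvec{\varSigma }_{k})^{-1/2} \exp \left ( -\tfrac{1}{2}\, {\varvec{\mu }}_{k}^\intercal \varvec{\varSigma }_{k}^{-1} {\varvec{\mu }}_{k} \right ), \]
so that the remaining integral over \( {\varvec{x}}\) is exactly the one supplied by Proposition 2.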
Proposition 4
If \( \varvec{\varSigma }\) is symmetric positive-definite, then \( \varvec{\varSigma }\) is invertible and \( \varvec{\varSigma } ^ {-1} \) is symmetric positive-definite [12, 430].
Proposition 5
If \( \varvec{\varSigma }\) is symmetric positive-definite and \( \varvec{A}\) has full row rank, then \( \varvec{A}\varvec{\varSigma }\varvec{A}^\intercal \) is symmetric positive-definite [12, Observation 7.1.8.(b)].
Proposition 6
For \( \varvec{A}\in \mathbb {R}^ {D \times N} \) of full row rank, the density
has the equivalent form, where \( {\varvec{r}}= \varvec{A}\varvec{\varSigma } ^ {-1} {\varvec{\mu }}\) and \( \varvec{\varLambda }= \varvec{A}\varvec{\varSigma } ^ {-1} \varvec{A}^\intercal \).
Proof
First, we separate a factor independent of \( {\varvec{x}}\) in
Therefore,
We now calculate the integral. From Proposition 4 and Proposition 5, one can see that \(\varvec{\varLambda }\) is symmetric positive-definite, so that Proposition 2 can be applied to find
Finally, one gets
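Given \( {\varvec{r}}= \varvec{A}\varvec{\varSigma } ^ {-1} {\varvec{\mu }}\) and \( \varvec{\varLambda }= \varvec{A}\varvec{\varSigma } ^ {-1} \varvec{A}^\intercal \), the density in question is presumably \( \mathcal {N}(\varvec{A}^\intercal {\varvec{x}} \mid {\varvec{\mu }}, \varvec{\varSigma }) \), read as a function of \( {\varvec{x}}\in \mathbb {R}^{D} \). Separating the factor independent of \( {\varvec{x}}\) then gives
\[ \mathcal {N}(\varvec{A}^\intercal {\varvec{x}} \mid {\varvec{\mu }}, \varvec{\varSigma }) = \gamma \, \exp \left ( -\tfrac{1}{2}\, {\varvec{x}}^\intercal \varvec{\varLambda }\, {\varvec{x}} + {\varvec{r}}^\intercal {\varvec{x}} \right ), \quad \gamma = (2\pi )^{-N/2} \det (\varvec{\varSigma })^{-1/2} \exp \left ( -\tfrac{1}{2}\, {\varvec{\mu }}^\intercal \varvec{\varSigma }^{-1} {\varvec{\mu }} \right ), \]
and Proposition 2 normalizes the right-hand side over \( {\varvec{x}}\), as in the proof above.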