Abstract
Kernel methods have been successfully used in many machine learning problems, with favorable performance in extracting nonlinear structure from high-dimensional data. Recently, nonparametric inference methods using positive definite kernels have been developed, employing the kernel mean expression of distributions. In this approach, the distribution of a variable is represented by its kernel mean, the mean element of the random feature vector defined by the kernel function, and relations among variables are expressed by covariance operators. This article gives an introduction to this new approach, called kernel Bayesian inference, in which Bayes’ rule is realized through computations with kernel means and covariance expressions to estimate the kernel mean of the posterior [11]. This approach provides a novel nonparametric way of Bayesian inference, expressing a distribution as a weighted sample and computing the posterior with simple matrix calculations. As an example of a problem to which kernel Bayesian inference applies effectively, the nonparametric state-space model is discussed, in which the state transition and observation model are assumed to be neither known nor estimable with a simple parametric model. This article gives detailed explanations of the intuitions, derivations, and implementation issues of kernel Bayesian inference.
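To make the phrase “computing the posterior with simple matrix calculations” concrete, the following is a minimal numerical sketch of kernel Bayes’ rule, not the reference implementation of [11]: the Gaussian kernel, the function names (gauss_gram, kernel_bayes_weights), and the exact placement of the regularization constants eps and delta are choices made here for illustration.

```python
import numpy as np

def gauss_gram(A, B, sigma):
    """Gram matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * sigma**2))

def kernel_bayes_weights(X, Y, U, gamma, y_obs, sigma_x, sigma_y, eps, delta):
    """Weights w such that the estimated posterior kernel mean is sum_i w_i k_x(., X_i).

    X, Y     : (n, d_x), (n, d_y) joint sample expressing P(X, Y)
    U, gamma : points and weights expressing the prior kernel mean sum_j gamma_j k_x(., U_j)
    y_obs    : observed value of Y, shape (1, d_y)
    eps, delta : regularization constants (their scaling with n is a tuning choice here)
    """
    n = X.shape[0]
    G_x = gauss_gram(X, X, sigma_x)      # Gram matrix on the X sample
    G_y = gauss_gram(Y, Y, sigma_y)      # Gram matrix on the Y sample
    G_xu = gauss_gram(X, U, sigma_x)     # cross Gram matrix between X sample and prior points

    # Kernel sum rule: mu_i are the weights of the embedding of the prior predictive of Y
    mu = np.linalg.solve(G_x + n * eps * np.eye(n), G_xu @ gamma)

    # Kernel Bayes' rule, Tikhonov-type form (mu may contain negative entries)
    Lam = np.diag(mu)
    LG = Lam @ G_y
    k_y = gauss_gram(Y, y_obs, sigma_y).ravel()   # vector (k_y(Y_i, y_obs))_i
    w = LG @ np.linalg.solve(LG @ LG + delta * np.eye(n), Lam @ k_y)
    return w
```

The posterior expectation of a function f in the RKHS is then estimated by \(\sum_i w_i f(X_i)\); in particular, a common point estimate of X is \(\sum_i w_i X_i\).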
Keywords
- Feature Vector
- Bayesian Inference
- Covariance Operator
- Kernel Method
- Observation Model
Notes
- 1.
As the kernel mean depends on k, it should rigorously be written as \(m_X^k\). We will, however, generally write \(m_X\) for simplicity where there is no ambiguity.
- 2.
These conditions guarantee the existence of the covariance operator. Note also that \(E[k(X,X)]<\infty \) is stronger than the condition for the kernel mean, \(E[\sqrt{k(X,X)}]<\infty \); this is obvious from the Cauchy–Schwarz inequality (see the short calculation after these notes).
- 3.
Some previous works derived convergence rates under unrealistic assumptions. For example, Theorem 6 in [30] assumes \(k(\cdot ,y_0)\in \mathscr {R}(C_{YY})\) to achieve the rate \(n^{-1/4}\), but in typical cases there is no function \(f\in {\mathscr {H}_\mathscr {Y}}\) that satisfies \(\int k(y,z)f(z)dP_Y(z)=k(y,y_0)\). Theorem 1.3.2 shows that if the eigenvalues decay sufficiently fast, the rate approaches \(n^{-1/4}\). As a relevant result, Theorem 11 in [11] shows a convergence rate of the kernel sum rule. While the conditional kernel mean is a special case of the kernel sum rule with the prior given by Dirac’s delta function at x, the faster rate (\(n^{-1/3}\) at best) is not achievable by Theorem 1.3.2, since the result in [11] assumes that \(\pi /p_X\) is a function in the RKHS and is smooth enough.
- 4.
Although the samples are not i.i.d., we assume an appropriate mixing condition and thus the empirical covariances converge to the covariances with respect to the stationary distribution as \(T\rightarrow \infty \).
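The comparison claimed in Note 2 is, written out, a one-line application of the Cauchy–Schwarz inequality to \(\sqrt{k(X,X)}\) and the constant function 1:
\[
E\bigl[\sqrt{k(X,X)}\bigr]
  = E\bigl[\sqrt{k(X,X)}\cdot 1\bigr]
  \le \bigl(E[k(X,X)]\bigr)^{1/2}\bigl(E[1]\bigr)^{1/2}
  = \bigl(E[k(X,X)]\bigr)^{1/2},
\]
so finiteness of \(E[k(X,X)]\) implies finiteness of \(E[\sqrt{k(X,X)}]\), but not conversely.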
References
Aronszajn, N.: Theory of reproducing kernels. Trans. Am. Math. Soc. 68(3), 337–404 (1950)
Baker, C.: Joint measures and cross-covariance operators. Trans. Am. Math. Soc. 186, 273–289 (1973)
Berlinet, A., Thomas-Agnan, C.: Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers (2004)
Caponnetto, A., De Vito, E.: Optimal rates for regularized least-squares algorithm. Found. Comput. Math. 7(3), 331–368 (2007)
Doucet, A., Freitas, N.D., Gordon, N.: Sequential Monte Carlo Methods in Practice. Springer (2001)
Fine, S., Scheinberg, K.: Efficient SVM training using low-rank kernel representations. J. Mach. Learn. Res. 2, 243–264 (2001)
Fukumizu, K., Bach, F., Jordan, M.: Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res. 5, 73–99 (2004)
Fukumizu, K., Bach, F., Jordan, M.: Kernel dimension reduction in regression. Ann. Stat. 37(4), 1871–1905 (2009)
Fukumizu, K., Gretton, A., Sun, X., Schölkopf, B.: Kernel measures of conditional dependence. In: Advances in Neural Information Processing Systems 20, pp. 489–496. MIT Press (2008)
Fukumizu, K., Bach, F.R., Jordan, M.I.: Kernel dimension reduction in regression. Technical Report 715, Department of Statistics, University of California, Berkeley (2006)
Fukumizu, K., Song, L., Gretton, A.: Kernel Bayes’ rule: Bayesian inference with positive definite kernels. J. Mach. Learn. Res. 14, 3753–3783 (2013)
Fukumizu, K., Sriperumbudur, B.K., Gretton, A., Schölkopf, B.: Characteristic kernels on groups and semigroups. Adv. Neural Inf. Proc. Syst. 20, 473–480 (2008)
Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. In: Advances in Neural Information Processing Systems 19, pp. 513–520. MIT Press (2007)
Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., Smola, A.: A kernel two-sample test. J. Mach. Learn. Res. 13, 723–773 (2012)
Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B.: A fast, consistent kernel two-sample test. Adv. Neural Inf. Process. Syst. 22, 673–681 (2009)
Gretton, A., Fukumizu, K., Sriperumbudur, B.: Discussion of: brownian distance covariance. Ann. Appl. Stat. 3(4), 1285–1294 (2009)
Gretton, A., Fukumizu, K., Teo, C.H., Song, L., Schölkopf, B., Smola, A.: A kernel statistical test of independence. In: Advances in Neural Information Processing Systems 20, pp. 585–592. MIT Press (2008)
Haeberlen, A., Flannery, E., Ladd, A.M., Rudys, A., Wallach, D.S., Kavraki, L.E.: Practical robust localization over large-scale 802.11 wireless networks. In: Proceedings of 10th International Conference on Mobile computing and networking (MobiCom ’04), pp. 70–84 (2004)
Kanagawa, M., Fukumizu, K.: Recovering distributions from Gaussian RKHS embeddings. J. Mach. Learn. Res. W&CP 3, 457–465 (2014)
Kanagawa, M., Nishiyama, Y., Gretton, A., Fukumizu, K.: Monte Carlo filtering using kernel embedding of distributions. In: Proceedings of 28th AAAI Conference on Artificial Intelligence (AAAI-14), pp. 1897–1903 (2014)
Kwok, J.Y., Tsang, I.: The pre-image problem in kernel methods. IEEE Trans. Neural Networks 15(6), 1517–1525 (2004)
McCalman, L.: Function embeddings for multi-modal Bayesian inference. Ph.D. thesis, School of Information Technology, The University of Sydney (2013)
McCalman, L., O’Callaghan, S., Ramos, F.: Multi-modal estimation with kernel embeddings for learning motion models. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 2845–2852 (2013)
Mika, S., Schölkopf, B., Smola, A., Müller, K.R., Scholz, M., Rätsch, G.: Kernel PCA and de-noising in feature spaces. In: Advances in Neural Information Processing Systems 11, pp. 536–542. MIT Press (1999)
Monbet, V., Ailliot, P., Marteau, P.: \(l^1\)-convergence of smoothing densities in non-parametric state space models. Stat. Infer. Stoch. Process. 11, 311–325 (2008)
Moulines, E., Bach, F.R., Harchaoui, Z.: Testing for homogeneity with kernel Fisher discriminant analysis. In: Advances in Neural Information Processing Systems 20, pp. 609–616. Curran Associates, Inc. (2008)
Quigley, M., Stavens, D., Coates, A., Thrun, S.: Sub-meter indoor localization in unmodified environments with inexpensive sensors. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2010), pp. 2039–2046 (2010)
Schölkopf, B., Smola, A.: Learning with Kernels. MIT Press (2002)
Song, L., Fukumizu, K., Gretton, A.: Kernel embeddings of conditional distributions: a unified kernel framework for nonparametric inference in graphical models. IEEE Sig. Process. Mag. 30(4), 98–111 (2013)
Song, L., Huang, J., Smola, A., Fukumizu, K.: Hilbert space embeddings of conditional distributions with applications to dynamical systems. In: Proceedings of the 26th International Conference on Machine Learning (ICML2009), pp. 961–968 (2009)
Sriperumbudur, B.K., Fukumizu, K., Lanckriet, G.: Universality, characteristic kernels and RKHS embedding of measures. J. Mach. Learn. Res. 12, 2389–2410 (2011)
Sriperumbudur, B.K., Gretton, A., Fukumizu, K., Schölkopf, B., Lanckriet, G.: Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res. 11, 1517–1561 (2010)
Steinwart, I., Hush, D., Scovel, C.: Optimal rates for regularized least squares regression. Proc. COLT 2009, 79–93 (2009)
Thrun, S., Langford, J., Fox, D.: Monte Carlo hidden Markov models: learning non-parametric models of partially observable stochastic processes. In: Proceedings of International Conference on Machine Learning (ICML 1999), pp. 415–424 (1999)
Wan, E., van der Merwe, R.: The unscented Kalman filter for nonlinear estimation. In: Adaptive Systems for Signal Processing, Communications, and Control Symposium (AS-SPCC 2000), pp. 153–158. IEEE (2000)
Widom, H.: Asymptotic behavior of the eigenvalues of certain integral equations. Trans. Am. Math. Soc. 109, 278–295 (1963)
Widom, H.: Asymptotic behavior of the eigenvalues of certain integral equations II. Arch. Ration. Mech. Anal. 17, 215–229 (1964)
Williams, C.K.I., Seeger, M.: Using the Nyström method to speed up kernel machines. In: Advances in Neural Information Processing Systems, vol. 13, pp. 682–688. MIT Press (2001)
Acknowledgments
The author has been supported in part by MEXT Grant-in-Aid for Scientific Research on Innovative Areas 25120012.
Appendix: Proof of Theorem 1.3.2
First, we show a lemma used to derive a convergence rate of the conditional kernel mean.
Lemma 1.5.1
Assume that the kernels are measurable and bounded. Let \(N(\varepsilon ):=\mathrm {Tr}[C_{YY}(C_{YY}+\varepsilon I)^{-1}]\) and \(\varepsilon _n\) be a constant such that \(\varepsilon _n\rightarrow 0\) as \(n\rightarrow \infty \). Then,
and
as \(n\rightarrow \infty \).
Proof
The first result is shown in [4] (page 349). While the proof of the second one is similar, it is shown below for completeness.
Let \(\xi _{yx}\) be an element in \({\mathscr {H}_\mathscr {Y}}\otimes {\mathscr {H}_\mathscr {X}}\) defined by
With the identification between \({\mathscr {H}_\mathscr {Y}}\otimes {\mathscr {H}_\mathscr {X}}\) and the Hilbert–Schmidt operators from \({\mathscr {H}_\mathscr {X}}\) to \({\mathscr {H}_\mathscr {Y}}\),
Take \(a>0\) such that \(k(x,x)\le a^2\) and \(k(y,y)\le a^2\) for all x and y. It follows from \(\Vert f\otimes g\Vert =\Vert f\Vert \,\Vert g\Vert \) and \(\Vert (C_{YY}+\varepsilon _n I)^{-1}\Vert \le 1/\varepsilon _n\) that
and
Here \(k(\cdot ,Y)^*\) is the dual element of \(k(\cdot ,Y)\) and \(k(\cdot ,Y)\otimes k(\cdot ,Y)^*\) is regarded as an operator on \({\mathscr {H}_\mathscr {Y}}\). In the last inequality, \((C_{YY}+\varepsilon _n I)^{-1}\) in the trace is replaced by its upper bound \(\varepsilon _n^{-1} I\). Since \(\frac{1}{n}\sum _{i=1}^n (C_{YY}+\varepsilon _n I)^{-1}\xi _{Y_i X_i} = (C_{YY}+\varepsilon _n I)^{-1}\widehat{C}^{(n)}_{YX}\), it follows from Proposition 2 in [4] that for all \(n\in {\mathbb {N}}\) and \(0<\eta <1\)
which proves the assertion.\(\square \)
Proof of Theorem 1.3.2 First, we have
Using the general formula \(A^{-1}-B^{-1}=A^{-1}(B-A)B^{-1}\) for invertible operators A and B, the first term on the right-hand side of the above inequality is upper bounded by
where the second inequality uses the decomposition \(C_{XY}=C_{XX}^{1/2}W_{XY}C_{YY}^{1/2}\) with some \(W_{XY}:{\mathscr {H}_\mathscr {Y}}\rightarrow {\mathscr {H}_\mathscr {X}}\) satisfying \(\Vert W_{XY}\Vert \le 1\) [2]. It follows from Lemma 1.5.1 that
as \(n\rightarrow \infty \). It is known (Proposition 3, [4]) that, under the assumption on the decay rate of the eigenvalues, \(N(\varepsilon )\le \frac{b\beta }{b-1}\varepsilon ^{-1/b}\) holds with some \(\beta \ge 0\). Since \(\varepsilon _n^{-3/2}n^{-1} \ll \varepsilon _n^{-1-\frac{1}{2b}}n^{-1/2}\) for \(b>1\) and \(n\varepsilon _n \rightarrow \infty \), we have
as \(n\rightarrow \infty \).
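To spell out the comparison of the two terms used just above:
\[
\frac{\varepsilon_n^{-3/2} n^{-1}}{\varepsilon_n^{-1-\frac{1}{2b}} n^{-1/2}}
  = \varepsilon_n^{-\frac{1}{2}+\frac{1}{2b}}\, n^{-1/2}
  = \bigl(n\,\varepsilon_n^{\frac{b-1}{b}}\bigr)^{-1/2}
  \le (n\varepsilon_n)^{-1/2} \rightarrow 0,
\]
where the inequality uses \(\varepsilon_n\le 1\) for large n together with \((b-1)/b<1\), and the convergence uses \(n\varepsilon_n\rightarrow \infty\).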
For the second term of Eq. (1.19), let \(\varTheta :=E[k(X,\tilde{X})|Y=\cdot ,\tilde{Y}=*]\in \mathscr {R}(C_{YY}\otimes C_{YY})\). Note that for any \(\varphi \in {\mathscr {H}_\mathscr {Y}}\) we have
Similarly,
It follows from these equalities with \(\varphi =(C_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0)\) that
From the assumption \(\varTheta \in \mathscr {R}(C_{YY}\otimes C_{YY})\), there is \(\varPsi \in {\mathscr {H}_\mathscr {Y}}\otimes {\mathscr {H}_\mathscr {Y}}\) such that \(\varTheta = (C_{YY}\otimes C_{YY}) \varPsi \). Let \(\{\phi _i\}\) be the eigenvectors of \(C_{YY}\) with eigenvalues \(\lambda _1\ge \lambda _2\ge \cdots \ge 0\). Since the eigenvectors and eigenvalues of \(C_{YY}\otimes C_{YY}\) are given by \(\{\phi _i\otimes \phi _j\}_{ij}\) and \(\lambda _i\lambda _j\), respectively, with the fact \((C_{YY}+\varepsilon _n I)^{-1}C_{YY}^2\phi _i=(\lambda _i^2/(\lambda _i+\varepsilon _n))\phi _i\) and Parseval’s theorem we have
which shows
By balancing Eqs. (1.21) and (1.22), the assertion is obtained with \(\varepsilon _n=n^{-b/(4b+1)}\).
\({}\square \)
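For the reader checking the choice of \(\varepsilon_n\): assuming, as the bounds above indicate, that Eq. (1.21) is of order \(\varepsilon_n^{-1-\frac{1}{2b}}n^{-1/2}\) and Eq. (1.22) is of order \(\varepsilon_n\), equating the two orders gives
\[
\varepsilon_n^{-1-\frac{1}{2b}} n^{-1/2} = \varepsilon_n
\;\Longleftrightarrow\;
\varepsilon_n^{2+\frac{1}{2b}} = n^{-1/2}
\;\Longleftrightarrow\;
\varepsilon_n = n^{-\frac{1}{4+1/b}} = n^{-\frac{b}{4b+1}},
\]
and the balanced order \(n^{-b/(4b+1)}\) tends to \(n^{-1/4}\) as \(b\rightarrow \infty\), in line with the remark in Note 3.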