
Nonparametric Bayesian Inference with Kernel Mean Embedding


Part of the SpringerBriefs in Statistics book series (JSSRES)


Kernel methods have been used successfully in many machine learning problems, showing favorable performance in extracting nonlinear structure from high-dimensional data. Recently, nonparametric inference methods based on positive definite kernels have been developed that employ the kernel mean representation of distributions. In this approach, the distribution of a variable is represented by its kernel mean, the mean element of the random feature vector defined by the kernel function, and relations among variables are expressed by covariance operators. This article gives an introduction to this new approach, called kernel Bayesian inference, in which Bayes’ rule is realized by computing kernel means and covariance expressions to estimate the kernel mean of the posterior [11]. This approach provides a novel nonparametric way of Bayesian inference, expressing a distribution by a weighted sample and computing the posterior with simple matrix calculations. As an example of a problem to which kernel Bayesian inference applies effectively, the nonparametric state-space model is discussed, in which the state transition and observation model are assumed to be neither known nor estimable with a simple parametric model. The article gives detailed explanations of the intuitions, derivations, and implementation issues of kernel Bayesian inference.
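As a quick illustration of the kernel mean representation described above (a minimal sketch, not code from the chapter; the kernel choice, sample size, and function names are illustrative), the empirical kernel mean \(\widehat{m}_X=\frac{1}{n}\sum_i k(\cdot,x_i)\) of a sample can be computed and checked against a closed-form expectation:

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-(x - y)^2 / (2 sigma^2))."""
    return np.exp(-(x - y) ** 2 / (2.0 * sigma ** 2))

def empirical_kernel_mean(sample, sigma=1.0):
    """Return m_hat(.) = (1/n) sum_i k(., x_i), the empirical kernel mean."""
    def m_hat(y):
        return np.mean(rbf_kernel(sample, y, sigma))
    return m_hat

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=5000)  # X ~ N(0, 1)

m = empirical_kernel_mean(x)
# Reproducing property: <m_hat, k(., z)> = m_hat(z), an estimate of E[k(X, z)].
# For X ~ N(0, 1) and sigma = 1, E[k(X, z)] = exp(-z^2 / 4) / sqrt(2).
z = 0.5
print(m(z), np.exp(-z ** 2 / 4.0) / np.sqrt(2.0))
```

The two printed numbers should agree to a few decimal places, illustrating how the kernel mean of a weighted sample carries expectations of RKHS functions.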


  • Feature Vector
  • Bayesian Inference
  • Covariance Operator
  • Kernel Method
  • Observation Model




  1. As the kernel mean depends on \(k\), it should rigorously be written as \(m_X^k\). We will, however, generally write \(m_X\) for simplicity where there is no ambiguity.

  2. These conditions guarantee the existence of the covariance operator. Note also that \(E[k(X,X)]<\infty \) is stronger than the condition for the kernel mean, \(E[\sqrt{k(X,X)}]<\infty \); this is obvious from the Cauchy–Schwarz inequality.

  3. Some previous works derived convergence rates under unrealistic assumptions. For example, Theorem 6 in [30] assumes \(k(\cdot ,y_0)\in \mathscr {R}(C_{YY})\) to achieve the rate \(n^{-1/4}\), but in typical cases there is no function \(f\in {\mathscr {H}_\mathscr {Y}}\) satisfying \(\int k(y,z)f(z)dP_Y(z)=k(y,y_0)\). Theorem 1.3.2 shows that the rate approaches \(n^{-1/4}\) if the eigenvalues decay sufficiently fast. As a related result, Theorem 11 in [11] shows a convergence rate for the kernel sum rule. Although the conditional kernel mean is a special case of the kernel sum rule with the prior given by Dirac’s delta function at x, the faster rate there (\(n^{-1/3}\) at best) is not achievable by Theorem 1.3.2, since the former assumes that \(\pi /p_X\) is a function in the RKHS and smooth enough.

  4. Although the samples are not i.i.d., we assume an appropriate mixing condition, so that the empirical covariances converge to the covariances with respect to the stationary distribution as \(T\rightarrow \infty \).


References

  1. Aronszajn, N.: Theory of reproducing kernels. Trans. Am. Math. Soc. 68(3), 337–404 (1950)

  2. Baker, C.: Joint measures and cross-covariance operators. Trans. Am. Math. Soc. 186, 273–289 (1973)

  3. Berlinet, A., Thomas-Agnan, C.: Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers (2004)

  4. Caponnetto, A., De Vito, E.: Optimal rates for the regularized least-squares algorithm. Found. Comput. Math. 7(3), 331–368 (2007)

  5. Doucet, A., de Freitas, N., Gordon, N.: Sequential Monte Carlo Methods in Practice. Springer (2001)

  6. Fine, S., Scheinberg, K.: Efficient SVM training using low-rank kernel representations. J. Mach. Learn. Res. 2, 243–264 (2001)

  7. Fukumizu, K., Bach, F., Jordan, M.: Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res. 5, 73–99 (2004)

  8. Fukumizu, K., Bach, F., Jordan, M.: Kernel dimension reduction in regression. Ann. Stat. 37(4), 1871–1905 (2009)

  9. Fukumizu, K., Gretton, A., Sun, X., Schölkopf, B.: Kernel measures of conditional dependence. In: Advances in Neural Information Processing Systems 20, pp. 489–496. MIT Press (2008)

  10. Fukumizu, K., Bach, F.R., Jordan, M.I.: Kernel dimension reduction in regression. Technical Report 715, Department of Statistics, University of California, Berkeley (2006)

  11. Fukumizu, K., Song, L., Gretton, A.: Kernel Bayes’ rule: Bayesian inference with positive definite kernels. J. Mach. Learn. Res. 14, 3753–3783 (2013)

  12. Fukumizu, K., Sriperumbudur, B.K., Gretton, A., Schölkopf, B.: Characteristic kernels on groups and semigroups. Adv. Neural Inf. Process. Syst. 20, 473–480 (2008)

  13. Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. In: Advances in Neural Information Processing Systems 19, pp. 513–520. MIT Press (2007)

  14. Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., Smola, A.: A kernel two-sample test. J. Mach. Learn. Res. 13, 723–773 (2012)

  15. Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B.: A fast, consistent kernel two-sample test. Adv. Neural Inf. Process. Syst. 22, 673–681 (2009)

  16. Gretton, A., Fukumizu, K., Sriperumbudur, B.: Discussion of: Brownian distance covariance. Ann. Appl. Stat. 3(4), 1285–1294 (2009)

  17. Gretton, A., Fukumizu, K., Teo, C.H., Song, L., Schölkopf, B., Smola, A.: A kernel statistical test of independence. In: Advances in Neural Information Processing Systems 20, pp. 585–592. MIT Press (2008)

  18. Haeberlen, A., Flannery, E., Ladd, A.M., Rudys, A., Wallach, D.S., Kavraki, L.E.: Practical robust localization over large-scale 802.11 wireless networks. In: Proceedings of the 10th International Conference on Mobile Computing and Networking (MobiCom ’04), pp. 70–84 (2004)

  19. Kanagawa, M., Fukumizu, K.: Recovering distributions from Gaussian RKHS embeddings. J. Mach. Learn. Res. W&CP 3, 457–465 (2014)

  20. Kanagawa, M., Nishiyama, Y., Gretton, A., Fukumizu, K.: Monte Carlo filtering using kernel embedding of distributions. In: Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI-14), pp. 1897–1903 (2014)

  21. Kwok, J.Y., Tsang, I.: The pre-image problem in kernel methods. IEEE Trans. Neural Networks 15(6), 1517–1525 (2004)

  22. McCalman, L.: Function embeddings for multi-modal Bayesian inference. Ph.D. thesis, School of Information Technology, The University of Sydney (2013)

  23. McCalman, L., O’Callaghan, S., Ramos, F.: Multi-modal estimation with kernel embeddings for learning motion models. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 2845–2852 (2013)

  24. Mika, S., Schölkopf, B., Smola, A., Müller, K.R., Scholz, M., Rätsch, G.: Kernel PCA and de-noising in feature spaces. In: Advances in Neural Information Processing Systems 11, pp. 536–542. MIT Press (1999)

  25. Monbet, V., Ailliot, P., Marteau, P.: \(l^1\)-convergence of smoothing densities in non-parametric state space models. Stat. Infer. Stoch. Process. 11, 311–325 (2008)

  26. Moulines, E., Bach, F.R., Harchaoui, Z.: Testing for homogeneity with kernel Fisher discriminant analysis. In: Advances in Neural Information Processing Systems 20, pp. 609–616. Curran Associates, Inc. (2008)

  27. Quigley, M., Stavens, D., Coates, A., Thrun, S.: Sub-meter indoor localization in unmodified environments with inexpensive sensors. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2010), pp. 2039–2046 (2010)

  28. Schölkopf, B., Smola, A.: Learning with Kernels. MIT Press (2002)

  29. Song, L., Fukumizu, K., Gretton, A.: Kernel embeddings of conditional distributions: a unified kernel framework for nonparametric inference in graphical models. IEEE Signal Process. Mag. 30(4), 98–111 (2013)

  30. Song, L., Huang, J., Smola, A., Fukumizu, K.: Hilbert space embeddings of conditional distributions with applications to dynamical systems. In: Proceedings of the 26th International Conference on Machine Learning (ICML 2009), pp. 961–968 (2009)

  31. Sriperumbudur, B.K., Fukumizu, K., Lanckriet, G.: Universality, characteristic kernels and RKHS embedding of measures. J. Mach. Learn. Res. 12, 2389–2410 (2011)

  32. Sriperumbudur, B.K., Gretton, A., Fukumizu, K., Schölkopf, B., Lanckriet, G.: Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res. 11, 1517–1561 (2010)

  33. Steinwart, I., Hush, D., Scovel, C.: Optimal rates for regularized least squares regression. In: Proceedings of COLT 2009, pp. 79–93 (2009)

  34. Thrun, S., Langford, J., Fox, D.: Monte Carlo hidden Markov models: learning non-parametric models of partially observable stochastic processes. In: Proceedings of the International Conference on Machine Learning (ICML 1999), pp. 415–424 (1999)

  35. Wan, E.A., van der Merwe, R.: The unscented Kalman filter for nonlinear estimation. In: Adaptive Systems for Signal Processing, Communications, and Control Symposium (AS-SPCC 2000), pp. 153–158. IEEE (2000)

  36. Widom, H.: Asymptotic behavior of the eigenvalues of certain integral equations. Trans. Am. Math. Soc. 109, 278–295 (1963)

  37. Widom, H.: Asymptotic behavior of the eigenvalues of certain integral equations II. Arch. Ration. Mech. Anal. 17, 215–229 (1964)

  38. Williams, C.K.I., Seeger, M.: Using the Nyström method to speed up kernel machines. In: Advances in Neural Information Processing Systems 13, pp. 682–688. MIT Press (2001)



The author has been supported in part by MEXT Grant-in-Aid for Scientific Research on Innovative Areas 25120012.

Author information

Correspondence to Kenji Fukumizu.

Appendix: Proof of Theorem 1.3.2

First, we show a lemma used to derive the convergence rate of the conditional kernel mean.

Lemma 1.5.1

Assume that the kernels are measurable and bounded. Let \(N(\varepsilon ):=\mathrm {Tr}[C_{YY}(C_{YY}+\varepsilon I)^{-1}]\) and \(\varepsilon _n\) be a constant such that \(\varepsilon _n\rightarrow 0\) as \(n\rightarrow \infty \). Then,

$$ \left\| (\widehat{C}^{(n)}_{YY}-C_{YY})(C_{YY}+\varepsilon _n I)^{-1}\right\| = O_p\left( \frac{1}{\varepsilon _n n} + \sqrt{\frac{N(\varepsilon _n)}{\varepsilon _n n}}\right) $$


$$ \left\| (\widehat{C}^{(n)}_{XY}-C_{XY})(C_{YY}+\varepsilon _n I)^{-1}\right\| = O_p\left( \frac{1}{\varepsilon _n n} + \sqrt{\frac{N(\varepsilon _n)}{\varepsilon _n n}}\right) $$

as \(n\rightarrow \infty \).


The first result is shown in [4] (p. 349). The proof of the second is similar, but it is given below for completeness.

Let \(\xi _{yx}\) be an element in \({\mathscr {H}_\mathscr {Y}}\otimes {\mathscr {H}_\mathscr {X}}\) defined by

$$ \xi _{yx}:=\bigl \{(C_{YY}+\varepsilon _n I)^{-1} k(\cdot ,y)\bigr \}\otimes k(\cdot ,x). $$

With the identification between \({\mathscr {H}_\mathscr {Y}}\otimes {\mathscr {H}_\mathscr {X}}\) and the Hilbert–Schmidt operators from \({\mathscr {H}_\mathscr {X}}\) to \({\mathscr {H}_\mathscr {Y}}\),

$$ E[\xi _{YX}]=(C_{YY}+\varepsilon _n I)^{-1}C_{YX}. $$

Take \(a>0\) such that \(k(x,x)\le a^2\) and \(k(y,y)\le a^2\). It follows from \(\Vert f\otimes g\Vert =\Vert f\Vert \,\Vert g\Vert \) and \(\Vert (C_{YY}+\varepsilon _n I)^{-1}\Vert \le 1/\varepsilon _n\) that

$$ \Vert \xi _{yx}\Vert = \bigl \Vert (C_{YY}+\varepsilon _n I)^{-1} k(\cdot ,y) \bigr \Vert \bigl \Vert k(\cdot ,x)\bigr \Vert \le \frac{1}{\varepsilon _n} \Vert k(\cdot ,y)\Vert \,\Vert k(\cdot ,x)\Vert \le \frac{a^2}{\varepsilon _n}, $$


$$\begin{aligned} E\Vert \xi _{YX}\Vert ^2&= E\bigl \Vert \{(C_{YY}+\varepsilon _n I)^{-1} k(\cdot ,Y)\}\otimes k(\cdot ,X)\bigr \Vert ^2 \\&= E\Vert k(\cdot ,X)\Vert ^2 \,\bigl \Vert (C_{YY}+\varepsilon _n I)^{-1} k(\cdot ,Y)\bigr \Vert ^2 \\&\le a^2 E\bigl \Vert (C_{YY}+\varepsilon _n I)^{-1} k(\cdot ,Y)\bigr \Vert ^2 \\&= a^2 E\bigl \langle (C_{YY}+\varepsilon _n I)^{-2}k(\cdot ,Y),k(\cdot ,Y)\bigr \rangle \\&= a^2 E \mathrm {Tr}\bigl [ (C_{YY}+\varepsilon _n I)^{-2}(k(\cdot ,Y)\otimes k(\cdot ,Y)^*)\bigr ] \\&=a^2 \mathrm {Tr}\bigl [ (C_{YY}+\varepsilon _n I)^{-2}C_{YY}\bigr ] \\&\le \frac{a^2}{\varepsilon _n }\mathrm {Tr}\bigl [ (C_{YY}+\varepsilon _n I)^{-1}C_{YY}\bigr ] = \frac{a^2}{\varepsilon _n } N(\varepsilon _n). \end{aligned}$$

Here \(k(\cdot ,Y)^*\) is the dual element of \(k(\cdot ,Y)\) and \(k(\cdot ,Y)\otimes k(\cdot ,Y)^*\) is regarded as an operator on \({\mathscr {H}_\mathscr {Y}}\). In the last inequality, \((C_{YY}+\varepsilon _n I)^{-1}\) in the trace is replaced by its upper bound \(\varepsilon _n^{-1} I\). Since \(\frac{1}{n}\sum _{i=1}^n (C_{YY}+\varepsilon _n I)^{-1}\xi _{Y_i X_i} = (C_{YY}+\varepsilon _n I)^{-1}\widehat{C}^{(n)}_{YX}\), it follows from Proposition 2 in [4] that for all \(n\in {\mathbb {N}}\) and \(0<\eta <1\)

$$\begin{aligned} \Pr \biggl ( \Bigg \Vert (C_{YY}+\varepsilon _n I)^{-1}\widehat{C}^{(n)}_{YX} - (C_{YY}+\varepsilon _n I)^{-1}&C_{YX} \Bigg \Vert \\&\ge 2\biggl (\frac{2a^2}{n\varepsilon _n} + \sqrt{\frac{a^2 N(\varepsilon _n)}{\varepsilon _n n}}\biggr )\log \frac{2}{\eta } \bigg ) \le \eta , \end{aligned}$$

which proves the assertion.\(\square \)

Proof of Theorem 1.3.2 First, we have

$$\begin{aligned}&\bigl \Vert \widehat{C}^{(n)}_{XY}(\widehat{C}^{(n)}_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0) - E[k_\mathscr {X}(\cdot ,X)|Y=y_0] \bigr \Vert _{\mathscr {H}_\mathscr {X}}\nonumber \\&\le \bigl \Vert \widehat{C}^{(n)}_{XY}(\widehat{C}^{(n)}_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0) - C_{XY}(C_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0) \Vert _{\mathscr {H}_\mathscr {X}}\end{aligned}$$
$$\begin{aligned}&\qquad + \bigl \Vert C_{XY}(C_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0)-E[k_\mathscr {X}(\cdot ,X)|Y=y_0] \bigr \Vert _{\mathscr {H}_\mathscr {X}}. \end{aligned}$$

Using the general formula \(A^{-1}-B^{-1}=A^{-1}(B-A)B^{-1}\) for invertible operators \(A\) and \(B\), the first term on the right-hand side of the above inequality is bounded above by

$$\begin{aligned}&\bigl \Vert (\widehat{C}^{(n)}_{XY}-C_{XY})(\widehat{C}^{(n)}_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0)\bigr \Vert _{\mathscr {H}_\mathscr {X}}\nonumber \\&\qquad + \bigl \Vert C_{XY}(C_{YY}+\varepsilon _n I)^{-1}(C_{YY}-\widehat{C}^{(n)}_{YY})(\widehat{C}^{(n)}_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0)\bigr \Vert _{\mathscr {H}_\mathscr {X}}\\ \le&\bigl \Vert (\widehat{C}^{(n)}_{XY}-C_{XY})(\widehat{C}^{(n)}_{YY}+\varepsilon _n I)^{-1}\bigr \Vert \,\bigl \Vert k_\mathscr {Y}(\cdot ,y_0)\bigr \Vert _{\mathscr {H}_\mathscr {Y}}\\&\qquad + \frac{1}{\sqrt{\varepsilon _n}}\Vert C_{XX}\Vert ^{1/2} \bigl \Vert (\widehat{C}^{(n)}_{YY}-C_{YY})(\widehat{C}^{(n)}_{YY}+\varepsilon _n I)^{-1}\bigr \Vert \, \bigl \Vert k_\mathscr {Y}(\cdot ,y_0)\bigr \Vert _{\mathscr {H}_\mathscr {Y}}, \end{aligned}$$

where in the second inequality the decomposition \(C_{XY}=C_{XX}^{1/2}W_{XY}C_{YY}^{1/2}\) with some \(W_{XY}:{\mathscr {H}_\mathscr {Y}}\rightarrow {\mathscr {H}_\mathscr {X}}\) (\(\Vert W_{XY}\Vert \le 1\)) [2] is used. It follows from Lemma 1.5.1 that

$$\begin{aligned} \bigl \Vert \widehat{C}^{(n)}_{XY}(\widehat{C}^{(n)}_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0) - C_{XY}(C_{YY}&+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0) \Vert _{\mathscr {H}_\mathscr {X}}\\&=O_p\left( \varepsilon _n^{-1/2}\left\{ \frac{1}{\varepsilon _n n}+\sqrt{\frac{N(\varepsilon _n)}{\varepsilon _n n}}\right\} \right) , \end{aligned}$$

as \(n\rightarrow \infty \). It is known (Proposition 3, [4]) that, under the assumption on the decay rate of the eigenvalues, \(N(\varepsilon )\le \frac{b\beta }{b-1}\varepsilon ^{-1/b}\) holds with some \(\beta \ge 0\). Since \(\varepsilon _n^{-3/2}n^{-1} \ll \varepsilon _n^{-1-\frac{1}{2b}}n^{-1/2}\) for \(b>1\) and \(n\varepsilon _n \rightarrow \infty \), we have

$$\begin{aligned} \bigl \Vert \widehat{C}^{(n)}_{XY}(\widehat{C}^{(n)}_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0) - C_{XY}(C_{YY}+\varepsilon _n I)^{-1}&k_\mathscr {Y}(\cdot ,y_0) \Vert _{\mathscr {H}_\mathscr {X}}\nonumber \\&=O_p\left( \varepsilon _n^{-1-\frac{1}{2b}}n^{-1/2}\right) , \end{aligned}$$

as \(n\rightarrow \infty \).

For the second term of Eq. (1.19), let \(\varTheta :=E[k(X,\tilde{X})|Y=\cdot ,\tilde{Y}=*]\in \mathscr {R}(C_{YY}\otimes C_{YY})\). Note that for any \(\varphi \in {\mathscr {H}_\mathscr {Y}}\) we have

$$\begin{aligned} \langle C_{XY}\varphi ,&C_{XY}\varphi \rangle =E[k(X,\tilde{X})\varphi (Y)\varphi (\tilde{Y})]\\&\quad =E\bigl [ E[k(X,\tilde{X})|Y,\tilde{Y}]\varphi (Y)\varphi (\tilde{Y})\bigr ] =\langle (C_{YY}\otimes C_{YY})\varTheta ,\varphi \otimes \varphi \rangle _{{\mathscr {H}_\mathscr {Y}}\otimes {\mathscr {H}_\mathscr {Y}}}. \end{aligned}$$


$$\begin{aligned} \langle C_{XY}\varphi , E[k(\cdot ,X)|Y=y_0]\rangle _{\mathscr {H}_\mathscr {X}}=\langle E[k&(X,\tilde{X})|Y=y_0,\tilde{Y}=*], C_{YY} \varphi \rangle _{{\mathscr {H}_\mathscr {Y}}} \\&=\langle (I\otimes C_{YY})\varTheta ,k(\cdot ,y_0)\otimes \varphi \rangle _{{\mathscr {H}_\mathscr {Y}}\otimes {\mathscr {H}_\mathscr {Y}}}. \end{aligned}$$

It follows from these equalities with \(\varphi =(C_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0)\) that

$$\begin{aligned}&\bigl \Vert C_{XY}(C_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0)-E[k_\mathscr {X}(\cdot ,X)|Y=y_0] \bigr \Vert _{\mathscr {H}_\mathscr {X}}^2 \\&=\bigl \langle \bigl \{ (C_{YY}+\varepsilon _n I)^{-1}C_{YY}\otimes (C_{YY}+\varepsilon _n I)^{-1}C_{YY} -I\otimes (C_{YY}+\varepsilon _n I)^{-1}C_{YY}\\&\qquad -(C_{YY}+\varepsilon _n I)^{-1}C_{YY}\otimes I + I\otimes I\bigr \}\varTheta , k_\mathscr {Y}(\cdot ,y_0)\otimes k_\mathscr {Y}(*,y_0)\bigr \rangle _{{\mathscr {H}_\mathscr {Y}}\otimes {\mathscr {H}_\mathscr {Y}}}. \end{aligned}$$

From the assumption \(\varTheta \in \mathscr {R}(C_{YY}\otimes C_{YY})\), there is \(\varPsi \in {\mathscr {H}_\mathscr {Y}}\otimes {\mathscr {H}_\mathscr {Y}}\) such that \(\varTheta = (C_{YY}\otimes C_{YY}) \varPsi \). Let \(\{\phi _i\}\) be the eigenvectors of \(C_{YY}\) with eigenvalues \(\lambda _1\ge \lambda _2\ge \cdots \ge 0\). Since the eigenvectors and eigenvalues of \(C_{YY}\otimes C_{YY}\) are given by \(\{\phi _i\otimes \phi _j\}_{ij}\) and \(\lambda _i\lambda _j\), respectively, the fact \((C_{YY}+\varepsilon _n I)^{-1}C_{YY}^2\phi _i=(\lambda _i^2/(\lambda _i+\varepsilon _n))\phi _i\) and Parseval’s theorem give

$$\begin{aligned}&\bigl \Vert \bigl \{(C_{YY}+\varepsilon _n I)^{-1}C_{YY}\otimes (C_{YY}+\varepsilon _n I)^{-1}C_{YY} -I\otimes (C_{YY}+\varepsilon _n I)^{-1}C_{YY}\\&\qquad -(C_{YY}+\varepsilon _n I)^{-1}C_{YY}\otimes I + I\otimes I\bigr \}\varTheta \bigr \Vert _{{\mathscr {H}_\mathscr {Y}}\otimes {\mathscr {H}_\mathscr {Y}}}^2 \\&= \sum _{i,j}\Bigl \{ \frac{\lambda _i^2}{\lambda _i+\varepsilon _n}\frac{\lambda _j^2}{\lambda _j+\varepsilon _n}- \frac{\lambda _i^2\lambda _j}{\lambda _i+\varepsilon _n}-\frac{\lambda _i\lambda _j^2}{\lambda _j+\varepsilon _n}+\lambda _i\lambda _j\Bigr \}^2 \langle \phi _i\otimes \phi _j,\varPsi \rangle _{{\mathscr {H}_\mathscr {Y}}\otimes {\mathscr {H}_\mathscr {Y}}}^2 \\&= \varepsilon _n^4 \sum _{i,j}\Bigl \{\frac{\lambda _i\lambda _j}{(\lambda _i+\varepsilon _n)(\lambda _j+\varepsilon _n)}\Bigr \}^2 \langle \phi _i\otimes \phi _j,\varPsi \rangle _{{\mathscr {H}_\mathscr {Y}}\otimes {\mathscr {H}_\mathscr {Y}}}^2 \le \varepsilon _n^4 \Vert \varPsi \Vert _{{\mathscr {H}_\mathscr {Y}}\otimes {\mathscr {H}_\mathscr {Y}}}^2, \end{aligned}$$

which shows

$$\begin{aligned} \bigl \Vert C_{XY}(C_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0)-E[k_\mathscr {X}(\cdot ,X)|Y=y_0] \bigr \Vert _{\mathscr {H}_\mathscr {X}}= O(\varepsilon _n). \end{aligned}$$

By balancing Eqs. (1.21) and (1.22), the assertion is obtained with \(\varepsilon _n=n^{-b/(4b+1)}\).
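The balancing computation behind this choice of \(\varepsilon _n\) is routine but worth spelling out, equating the orders of the two error bounds (estimation error \(\varepsilon _n^{-1-\frac{1}{2b}}n^{-1/2}\) and approximation error \(\varepsilon _n\)):

```latex
\varepsilon_n \asymp \varepsilon_n^{-1-\frac{1}{2b}}\, n^{-1/2}
\quad\Longleftrightarrow\quad
\varepsilon_n^{2+\frac{1}{2b}} = \varepsilon_n^{\frac{4b+1}{2b}} \asymp n^{-1/2}
\quad\Longleftrightarrow\quad
\varepsilon_n \asymp n^{-\frac{b}{4b+1}},
```

with both terms then of order \(n^{-b/(4b+1)}\).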

\({}\square \)
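In finite samples, the estimator analyzed above reduces to Gram-matrix algebra: with a sample \((x_i,y_i)_{i=1}^n\), \(\widehat{C}^{(n)}_{XY}(\widehat{C}^{(n)}_{YY}+\varepsilon _n I)^{-1}k_\mathscr {Y}(\cdot ,y_0)=\sum _i w_i k_\mathscr {X}(\cdot ,x_i)\) with \(w=(G_Y+n\varepsilon _n I)^{-1}\mathbf {k}_Y(y_0)\), where \(G_Y\) is the Gram matrix of the \(y_i\). A minimal numerical sketch (sample size, kernel bandwidth, and regularization are illustrative choices, not the chapter's):

```python
import numpy as np

def gram(a, b, sigma=1.0):
    """Gaussian RBF Gram matrix: K[i, j] = exp(-(a_i - b_j)^2 / (2 sigma^2))."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(1)
n, eps = 2000, 1e-3                      # sample size, regularization (illustrative)
y = rng.normal(size=n)
x = 2.0 * y + 0.1 * rng.normal(size=n)   # so E[X | Y = y0] = 2 * y0

y0 = 0.7
G_Y = gram(y, y)
k_y0 = gram(y, np.array([y0]))[:, 0]

# Weights of the estimated conditional kernel mean:
#   C^_XY (C^_YY + eps I)^{-1} k(., y0) = sum_i w_i k(., x_i)
w = np.linalg.solve(G_Y + n * eps * np.eye(n), k_y0)

# The weighted sample (w_i, x_i) represents P(X | Y = y0); for instance,
est = w @ x                              # estimate of E[X | Y = y0]
print(est)                               # close to 2 * y0 = 1.4
```

Note that \(\sum _i w_i x_i\) coincides with the kernel ridge regression prediction of \(x\) from \(y\) evaluated at \(y_0\), which is one way to see why the estimator behaves like nonparametric conditioning.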


Copyright information

© 2015 The Author(s)

About this chapter

Fukumizu, K. (2015). Nonparametric Bayesian Inference with Kernel Mean Embedding. In: Peters, G., Matsui, T. (eds) Modern Methodology and Applications in Spatial-Temporal Modeling. SpringerBriefs in Statistics. Springer, Tokyo.
