Model-based kernel sum rule: kernel Bayesian inference with probabilistic models

Kernel Bayesian inference is a principled approach to nonparametric inference in probabilistic graphical models, where probabilistic relationships between variables are learned from data in a nonparametric manner. Various kernel Bayesian inference algorithms have been developed by combining kernelized basic probabilistic operations such as the kernel sum rule and kernel Bayes' rule. However, the current framework is fully nonparametric, and it does not allow a user to flexibly combine nonparametric and model-based inferences. This is inefficient when good probabilistic models (or simulation models) are available for some parts of a graphical model; this is particularly true in scientific fields where "models" are the central topic of study. Our contribution in this paper is to introduce a novel approach, termed the model-based kernel sum rule (Mb-KSR), to combine a probabilistic model and kernel Bayesian inference. By combining the Mb-KSR with the existing kernelized probabilistic rules, one can develop various algorithms for hybrid (i.e., nonparametric and model-based) inference. As an illustrative example, we consider Bayesian filtering in a state space model, where typically there exists an accurate probabilistic model for the state transition process. We propose a novel filtering method that combines model-based inference for the state transition process and data-driven, nonparametric inference for the observation generating process. We empirically validate our approach with synthetic and real-data experiments, the latter being the problem of vision-based mobile robot localization in robotics, which illustrates the effectiveness of the proposed hybrid approach.


Introduction
Kernel methods are powerful tools for developing nonlinear algorithms in machine learning (Schölkopf and Smola, 2002; Steinwart and Christmann, 2008). The basis of kernel methods is to transform data into elements of a reproducing kernel Hilbert space (RKHS) and to solve problems in that space by exploiting the reproducing property (kernel trick) of the Hilbert space. A recent trend in kernel methods is to exploit the mean feature element, i.e., the kernel mean, of a probability distribution in an RKHS (Smola et al, 2007). The kernel mean $m_P$ is defined as the expectation of a random feature function $k(\cdot, X)$ with respect to a probability distribution $P$ on a measurable space $(\mathcal{X}, \mathcal{B})$. The mapping $P \to m_P$ is called the kernel mean map. A positive-definite kernel $k$ associated with kernel means is called characteristic (Fukumizu et al, 2004; Sriperumbudur et al, 2010) if the kernel mean map is injective, i.e., every probability distribution can be distinguished by its kernel mean. The use of characteristic kernels guarantees that a kernel mean $m_P$ uniquely specifies a probability distribution $P$, so that $m_P$ may be used in place of $P$ in probability operations, estimation, and the learning of $P$. Many machine learning applications that involve kernel means have been proposed, e.g., density estimation (Smola et al, 2007; Song et al, 2008; McCalman et al, 2013), hypothesis testing (Gretton et al, 2012; Gretton and Györfi, 2010; Fukumizu et al, 2008), Bayesian inference (Song et al, 2009, 2010, 2011; Fukumizu et al, 2013; Song et al, 2013; Kanagawa et al, 2014), classification (Muandet et al, 2012), dimension reduction (Fukumizu and Leng, 2012), and control problems (Grünewälder et al, 2012; Nishiyama et al, 2012; Rawlik et al, 2013; Boots et al, 2013).
In the context of Bayesian inference, the basic probabilistic operations of the sum rule, chain rule, product rule, and Bayes' rule have been kernelized in terms of kernel means; they are referred to as the kernel sum rule (KSR), kernel chain rule (KCR), kernel product rule (KPR), and kernel Bayes' rule (KBR), respectively (Song et al, 2013). Combinations of these rules allow Bayesian inference to be expressed entirely in terms of kernel means. In this study, we refer to Bayesian inference in the kernel mean expression as kernel Bayesian inference. Song et al (2011) developed a nonparametric belief propagation method to infer the kernel means of marginals. Song et al (2009) and Fukumizu et al (2013) proposed nonparametric filtering algorithms for state space models, where the transition dynamics and observation process are both represented by kernel means and learned nonparametrically from samples. Control problems (MDPs (Grünewälder et al, 2012), POMDPs (Nishiyama et al, 2012), path integral control (Rawlik et al, 2013), and predictive state representations (Boots et al, 2013)) have also been addressed based on kernel Bayesian inference.
Kernel Bayesian inference has many benefits such as: (i) algorithms can perform nonparametric inference and capture complex conditional relations among variables, (ii) algorithms are similarity-based and the application domains are not restricted provided that a positive-definite kernel is defined, and (iii) algorithms can compute expected values without requiring density estimation.
Nonparametric kernel Bayesian inference is powerful, but a weakness of the current framework is that inference algorithms (obtained by combining the rules KSR, KCR, KPR, and KBR) only allow fully nonparametric inference, i.e., all of the conditional probabilities are learned nonparametrically. Depending on the application, the knowledge encoded in probabilistic models should be combined flexibly with nonparametric kernel Bayesian inference; robot localization is a case in point. In state space models of vision-based robot localization problems, the hidden state is a mobile robot's position and the observation is an image captured by the robot. The robot's position should be estimated from a sequence of images. The process generating the captured image at a specific position in a building is typically complex. By contrast, the transition dynamics of the robot's position are rather simple and may be described by mathematical models of the physical system, encoded as probabilistic motion models. Thus, it is desirable to combine nonparametric learning of the observation process using kernel Bayesian inference with probabilistic models of the transition dynamics.
In this study, we propose a novel KSR method, called the model-based KSR (Mb-KSR), which exploits the knowledge contained in probabilistic models of conditional distributions. The existing KSR (Song et al, 2013) aims at nonparametric learning of a conditional distribution, and we refer to it as the nonparametric KSR (Non-KSR). Our Mb-KSR method exploits the tractable conditional kernel means of probabilistic models. Incorporating the Mb-KSR into the existing rules, i.e., Non-KSR, KCR, KPR, and KBR, provides a more flexible framework for kernel Bayesian inference that can be combined with probabilistic models. We focus on combinations of the Mb-KSR, Non-KSR, and KBR, which are sufficient to develop a filtering algorithm for state space models.
Fig. 1 illustrates inference using a combination of Mb-KSR and Non-KSR. We consider the following three cases. (i) (Left chain) Suppose that both conditional probabilities $P_{\mathcal{Y}|\mathcal{X}}$ and $P_{\mathcal{Z}|\mathcal{Y}}$ are complex and that training samples $\{(X_i, Y_i)\}_{i=1}^n$ and $\{(Y_j, Z_j)\}_{j=1}^n$ are available for both. (ii) (Middle chain) Suppose that the conditional probability $P_{\mathcal{Y}|\mathcal{X}}$ is complex and there is a training sample $\{(X_i, Y_i)\}_{i=1}^n$, but $P_{\mathcal{Z}|\mathcal{Y}}$ is described simply by a probabilistic model, such as an additive Gaussian noise model. (iii) (Right chain) This is the opposite setting to (ii), i.e., $P_{\mathcal{Y}|\mathcal{X}}$ is an additive Gaussian noise model but $P_{\mathcal{Z}|\mathcal{Y}}$ is complex and there is a training sample $\{(Y_i, Z_i)\}_{i=1}^n$. The inference of $Z$ given $X$ in the kernel mean expression is then achieved simply by executing: (i) the Non-KSR twice, for both $X$ to $Y$ and $Y$ to $Z$; (ii) the Non-KSR for $X$ to $Y$ and the Mb-KSR for $Y$ to $Z$; and (iii) the Mb-KSR for $X$ to $Y$ and the Non-KSR for $Y$ to $Z$, respectively. Example 3.2 illustrates the kernel mean estimators for these cases. By introducing the Mb-KSR, a kernel mean estimator is represented not only by a weighted sum of feature functions but also by a weighted sum of tractable conditional kernel means.
The differences between the Mb-KSR and the Non-KSR are as follows. The Non-KSR contains a regularization parameter. Tuning a regularization parameter is always a computational burden, e.g., it may require cross-validation. In addition, a regularization parameter with a finite value introduces a bias error into the Non-KSR. By contrast, the Mb-KSR does not contain a regularization parameter, because the smoothness of the regression function is determined by the probabilistic model, whose knowledge is thereby reflected. It is natural to expect that if a probabilistic model describes the true conditional distribution well, it is better to learn the probabilistic model first and use it in the Mb-KSR. The Mb-KSR and Non-KSR are compared in the experiments reported in Sect. 5.1.
In this study, we focus on additive Gaussian noise models for the application of the Mb-KSR, but the idea can be extended to more general noise models, as described in Appendix A.1. A systematic view is obtained by considering a conjugate pair comprising a probabilistic model and a positive-definite kernel.
By combining the probabilistic rules Non-KSR, KBR, and the proposed Mb-KSR, we develop a filtering algorithm for state space models. In this setting, the observation process is learned nonparametrically using kernel means, and the transition dynamics are given by probabilistic models, such as additive Gaussian noise models.
The main contributions of this study are summarized as follows.
- We propose to split the KSR into the Non-KSR and the Mb-KSR, depending on whether the conditional distribution can be described by a probabilistic model or not. The Mb-KSR can be incorporated into the existing probabilistic rules, i.e., Non-KSR, KCR, KPR, and KBR, thereby yielding a more flexible framework for kernel Bayesian inference combined with probabilistic models. The Mb-KSR is more accurate and requires less computation than the Non-KSR if the conditional distribution is described well by a probabilistic model, such as an additive Gaussian noise model.
- We propose a filtering algorithm, Algorithm 1, for state space models, which combines nonparametric learning of the observation process using kernel means with probabilistic models of the transition dynamics. The advantages of this algorithm are as follows. The algorithm can handle arbitrary observation domains, e.g., images (as shown in Sect. 5.3), provided that a positive-definite kernel is defined, whereas well-known filtering algorithms such as nonlinear Kalman filters and particle filters restrict the domain to (a subset of) the Euclidean space $\mathbb{R}^d$. Moreover, the proposed algorithm does not assume a specific probabilistic model for the observation process, i.e., the observation process is learned nonparametrically from a training sample, whereas nonlinear Kalman filters and particle filters need to assume a probabilistic model of the observation process.
A related method is as follows. Kanagawa et al (2014) proposed a Monte Carlo (MC) sampling method for filtering in state space models in the same setting, which combines nonparametric learning of the observation process using kernel means with probabilistic models of the transition dynamics. Because the transition dynamics assume a probabilistic model, that algorithm generates samples from the transition dynamics and estimates the kernel means from those samples. However, if the transition dynamics are simple, e.g., the frequently used additive Gaussian noise models considered in the present study, explicit expressions for the kernel means of the probabilistic models can be exploited and MC sampling is unnecessary. Thus, we propose a filtering algorithm for additive Gaussian transition noise models that does not use MC sampling. In general, the difference between the MC method of Kanagawa et al (2014) and our proposed method is analogous to the difference between particle filters and Kalman filters, where Kalman filters do not require sampling.
The remainder of this paper is structured as follows. The next section provides preliminaries on kernel Bayesian inference, which are used in the later sections to propose the Mb-KSR and the filtering algorithm. Sect. 3 introduces the Mb-KSR. Sect. 4 proposes a filtering algorithm for state space models. Sect. 5 presents the results of ground-truth experiments, which confirm that the proposed Mb-KSR performs adequately, and describes a real-world application to robot localization problems. Our conclusions and suggestions for future work are given in Sect. 6.

Preliminaries: Kernel Bayesian Inference
This section reviews kernel means, the KSR, and the KBR, which are the components of kernel Bayesian inference (see Song et al (2009); Fukumizu et al (2013); Song et al (2013) for further details). All of the functions considered in this study are real-valued unless stated otherwise.

Positive-definite kernel
Let $\mathcal{X}$ be an arbitrary nonempty set. A function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a positive-definite kernel if it is symmetric and, for any $n \in \mathbb{N}$, $x_1, \dots, x_n \in \mathcal{X}$, and $c_1, \dots, c_n \in \mathbb{R}$, it satisfies $\sum_{i,j=1}^{n} c_i c_j k(x_i, x_j) \geq 0$.

Reproducing kernel Hilbert space (RKHS)
An RKHS $H$ on a nonempty set $\mathcal{X}$ is a Hilbert space of functions $f: \mathcal{X} \to \mathbb{R}$ such that for every $x \in \mathcal{X}$ there exists a unique element $k(\cdot, x) \in H$ satisfying the reproducing property
$$f(x) = \langle f, k(\cdot, x) \rangle_H \quad \text{for all } f \in H,$$
where $\langle \cdot, \cdot \rangle_H$ denotes the inner product of the Hilbert space $H$.

Kernel mean in an RKHS on a measurable space

Let $\mathcal{P}(\mathcal{X})$ be the set of probability distributions on a measurable space $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$. Let $E_X[f(X)] := \int_{\mathcal{X}} f(x) dP(x)$ denote the expectation of a measurable function $f: \mathcal{X} \to \mathbb{R}$ with respect to a probability distribution $P \in \mathcal{P}(\mathcal{X})$ of a random variable $X$. The kernel mean of $P \in \mathcal{P}(\mathcal{X})$ in the RKHS $H$ generated by a positive-definite kernel $k$ is the RKHS element
$$m_P := E_X[k(\cdot, X)] = \int_{\mathcal{X}} k(\cdot, x) dP(x) \in H.$$
If $k$ is shift-invariant, i.e., $k(x, y) = \psi(x - y)$, $x, y \in \mathbb{R}^m$, the kernel mean is the convolution $m_P = \psi * P$. Throughout this study, we assume that a positive-definite kernel is bounded ($\sup_{x \in \mathcal{X}} k(x, x) < \infty$); a bounded kernel guarantees that the kernel mean $m_P$ is well defined for all $P \in \mathcal{P}(\mathcal{X})$ (Sriperumbudur et al, 2010). The kernel mean $m_P$ satisfies the expectation property
$$\langle m_P, f \rangle_H = E_X[f(X)] \quad \text{for all } f \in H, \qquad (3)$$
which is used in applications of kernel Bayesian inference. Typically, the kernel mean is estimated as a weighted sum $\hat{m}_P = \sum_{i=1}^{n} w_i k(\cdot, \tilde{X}_i)$ with weights $w = \{w_i\} \in \mathbb{R}^n$ from data $\tilde{X}_1, \dots, \tilde{X}_n$. The expectation of an RKHS function $f \in H$ can then be estimated using eq. (3) as
$$\widehat{E_X[f(X)]} := \langle \hat{m}_P, f \rangle_H = \sum_{i=1}^{n} w_i f(\tilde{X}_i). \qquad (4)$$
If $\hat{m}_P$ is a consistent estimator in the sense that $\|\hat{m}_P - m_P\|_H \stackrel{p}{\to} 0$, then eq. (4) is a consistent estimator. It is not straightforward to show that eq. (4) is consistent for a non-RKHS function $f \notin H$; however, Kanagawa and Fukumizu (2014) proved that eq. (4) is consistent for more general functions in the Besov space under Gaussian RBF kernels. Song et al (2009) and Fukumizu et al (2013) provided consistent estimators for the nonparametric KSR and the KBR, as follows.
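To make the expectation property concrete, the following minimal numpy sketch (variable names and parameter values are our own illustration, not from the paper) builds an empirical kernel mean with uniform weights and uses eq. (4) to estimate the expectation of an RKHS function:

```python
import numpy as np

def gauss_kernel(X, Y, sigma=1.0):
    """Gaussian RBF kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X_data = rng.normal(0.0, 1.0, size=(500, 1))      # sample X_1, ..., X_n from P

# Empirical kernel mean m_hat = sum_i w_i k(., X_i) with uniform weights.
w = np.full(len(X_data), 1.0 / len(X_data))

# Take an RKHS function f = sum_j c_j k(., x_j) and estimate E_P[f(X)] via the
# expectation property (4): <m_hat, f> = sum_{i,j} w_i c_j k(X_i, x_j).
centers = np.array([[0.0], [1.0]])
c = np.array([1.0, -0.5])
rkhs_estimate = w @ gauss_kernel(X_data, centers) @ c

# Plain Monte Carlo estimate of the same expectation with a fresh sample.
X_mc = rng.normal(0.0, 1.0, size=(100000, 1))
mc_estimate = np.mean(gauss_kernel(X_mc, centers) @ c)
print(rkhs_estimate, mc_estimate)                 # the two should be close
```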

Kernel sum rule (KSR)
Let $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$ and $(\mathcal{Y}, \mathcal{B}(\mathcal{Y}))$ be measurable spaces, and let $(X, Y)$ be a random variable on $\mathcal{X} \times \mathcal{Y}$ with a joint probability distribution $P_{\mathcal{X} \times \mathcal{Y}}$. Let $P_{\mathcal{X}}$ be the marginal distribution of $P_{\mathcal{X} \times \mathcal{Y}}$ on $\mathcal{X}$. For each $x \in \mathcal{X}$, let $P_{\mathcal{Y}|x}$ be the conditional distribution, and denote $P_{\mathcal{Y}|\mathcal{X}} := \{P_{\mathcal{Y}|x} \mid x \in \mathcal{X}\}$. If $P_{\mathcal{X} \times \mathcal{Y}}$, $P_{\mathcal{X}}$, and $P_{\mathcal{Y}|x}$ have densities, we denote them by $p(x, y)$, $p(x)$, and $p(y|x)$, respectively. Let $\Pi$ be another probability distribution on $\mathcal{X}$, with density $\pi(x)$ if it exists. $P_{\mathcal{Y}|\mathcal{X}}$ and $\Pi$ define a new joint probability distribution $Q_{\mathcal{X} \times \mathcal{Y}}$ on $\mathcal{X} \times \mathcal{Y}$ as
$$dQ_{\mathcal{X} \times \mathcal{Y}}(x, y) = dP_{\mathcal{Y}|x}(y)\, d\Pi(x). \qquad (5)$$
We denote by $Q_{\mathcal{Y}}$ the marginal of $Q_{\mathcal{X} \times \mathcal{Y}}$ on $\mathcal{Y}$, with density $q(y)$ if it exists. The sum rule is the computation of the marginal $Q_{\mathcal{Y}}$ given an input distribution $\Pi$, i.e., it deals with the mapping $\Pi \mapsto Q_{\mathcal{Y}}$. The KSR is a kernelization of the sum rule, i.e., the computation of the kernel mean $m_{Q_{\mathcal{Y}}}$ of $Q_{\mathcal{Y}}$ from the kernel mean $m_\Pi$ of $\Pi$; it deals with a mapping $U_{\mathcal{Y}|\mathcal{X}}: m_\Pi \mapsto m_{Q_{\mathcal{Y}}}$. For each $x \in \mathcal{X}$, let $m_{\mathcal{Y}|x}$ be the conditional kernel mean of $P_{\mathcal{Y}|x}$, defined by
$$m_{\mathcal{Y}|x} := E_{Y \sim P_{\mathcal{Y}|x}}[k_{\mathcal{Y}}(\cdot, Y)] = \int_{\mathcal{Y}} k_{\mathcal{Y}}(\cdot, y)\, dP_{\mathcal{Y}|x}(y).$$
Song et al (2009) provided a KSR estimator based on nonparametric learning of the conditional kernel means $m_{\mathcal{Y}|\mathcal{X}}$ of the conditional distribution $P_{\mathcal{Y}|\mathcal{X}}$; in this study, we refer to it as the Non-KSR. Let $\{(X_i, Y_i)\}_{i=1}^n$ be a sample drawn i.i.d. from $P_{\mathcal{Y}|\mathcal{X}}$ with an input distribution $P_{\mathcal{X}}$. Let $\hat{m}_\Pi = \sum_{i=1}^{l} \gamma_i k_{\mathcal{X}}(\cdot, \tilde{X}_i)$ be an estimator of the input kernel mean $m_\Pi$ with a weight vector $\gamma \in \mathbb{R}^l$ using data $(\tilde{X}_1, \dots, \tilde{X}_l)$. The Non-KSR estimator is given by
$$\hat{m}_{Q_{\mathcal{Y}}} = \sum_{i=1}^{n} w_i k_{\mathcal{Y}}(\cdot, Y_i), \qquad w = (G_X + n\epsilon_n I_n)^{-1} G_{X\tilde{X}} \gamma, \qquad (6)$$
where $G_X \in \mathbb{R}^{n \times n}$ and $G_{X\tilde{X}} \in \mathbb{R}^{n \times l}$ are the kernel matrices $(G_X)_{ij} = k_{\mathcal{X}}(X_i, X_j)$ and $(G_{X\tilde{X}})_{ij} = k_{\mathcal{X}}(X_i, \tilde{X}_j)$, $I_n$ is the $n \times n$ identity matrix, and $\epsilon_n$ is a regularization parameter.

Kernel Bayes' rule (KBR)

We use the same notation as for the KSR. Let $Q_{\mathcal{X}|y}$ be the conditional distribution of $X$ given $Y = y$ under $Q_{\mathcal{X} \times \mathcal{Y}}$, where $\Pi$ plays the role of the prior. Let $\hat{m}_\Pi = \sum_{i=1}^{l} \gamma_i k_{\mathcal{X}}(\cdot, \tilde{X}_i)$ be an estimator of the prior kernel mean $m_\Pi$ with weights $\gamma \in \mathbb{R}^l$ and data $(\tilde{X}_1, \dots, \tilde{X}_l) \subset \mathcal{X}$. The KBR estimator is given by
$$\hat{m}_{Q_{\mathcal{X}|y}} = \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X_i), \qquad w = \Lambda G_Y \big( (\Lambda G_Y)^2 + \delta_n I_n \big)^{-1} \Lambda k_Y(y), \qquad (7)$$
where $\Lambda = \mathrm{diag}(\mu)$ with $\mu = (G_X + n\epsilon_n I_n)^{-1} G_{X\tilde{X}} \gamma$, $(G_Y)_{ij} = k_{\mathcal{Y}}(Y_i, Y_j)$, $k_Y(y) = (k_{\mathcal{Y}}(Y_1, y), \dots, k_{\mathcal{Y}}(Y_n, y))^\top$, and $\delta_n$ is a regularization parameter. The weight vector $\mu$ corresponds to the KSR weights (6).
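Both weight computations reduce to a few lines of linear algebra. The sketch below is our own rendering of eqs. (6) and (7) as reconstructed above, assuming the Gram matrices are precomputed:

```python
import numpy as np

def non_ksr_weights(G_X, G_Xtilde, gamma, eps):
    """Non-KSR weights of eq. (6): w = (G_X + n*eps*I_n)^{-1} G_{X~X} gamma.

    G_X[i, j] = k_X(X_i, X_j) and G_Xtilde[i, j] = k_X(X_i, ~X_j); the output
    kernel mean estimator is then sum_i w_i k_Y(., Y_i).
    """
    n = G_X.shape[0]
    return np.linalg.solve(G_X + n * eps * np.eye(n), G_Xtilde @ gamma)

def kbr_weights(G_X, G_Y, G_Xtilde, gamma, kY_y, eps, delta):
    """KBR posterior weights of eq. (7) for the prior sum_j gamma_j k_X(., ~X_j).

    kY_y[i] = k_Y(Y_i, y) for the conditioning observation y; the posterior
    kernel mean estimator is sum_i w_i k_X(., X_i).
    """
    n = G_X.shape[0]
    mu = np.linalg.solve(G_X + n * eps * np.eye(n), G_Xtilde @ gamma)  # KSR weights
    Lam = np.diag(mu)
    LG = Lam @ G_Y
    return LG @ np.linalg.solve(LG @ LG + delta * np.eye(n), Lam @ kY_y)
```

Both routines use np.linalg.solve rather than explicit matrix inverses, which is the standard numerically stable choice for the regularized linear systems in (6) and (7).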
Kernel Bayesian Inference with Probabilistic Models

In this section, we introduce the Mb-KSR method. We define it explicitly, so that combinations of Non-KSR, Mb-KSR, and KBR immediately indicate how to infer kernel means using different combinations of nonparametric learning and probabilistic models. Sect. 3.1 defines the Mb-KSR. Sect. 3.2 describes inference combining the Non-KSR and the Mb-KSR for the chain example shown in Fig. 1. Sect. 3.3 describes a KBR adapted to the Mb-KSR, where the prior kernel mean estimator is slightly different. These results are used to derive the filtering algorithm in Sect. 4.

Model-based KSR (Mb-KSR)
The Mb-KSR is based on a simple observation with respect to the conditional kernel means $m_{\mathcal{Y}|x}$. Assume that, for each $y \in \mathcal{Y}$, the function $m_{\mathcal{Y}|(\cdot)}(y): x \mapsto m_{\mathcal{Y}|x}(y)$ belongs to the RKHS $H_{\mathcal{X}}$. Then, by the expectation property (3),
$$m_{Q_{\mathcal{Y}}}(y) = \int_{\mathcal{X}} m_{\mathcal{Y}|x}(y)\, d\Pi(x) = \langle m_\Pi, m_{\mathcal{Y}|(\cdot)}(y) \rangle_{H_{\mathcal{X}}}.$$
Thus, $m_{Q_{\mathcal{Y}}}(y)$ is obtained as the inner product of $m_\Pi$ and $m_{\mathcal{Y}|(\cdot)}(y)$, and it can be estimated using the input kernel mean estimator $\hat{m}_\Pi = \sum_{i=1}^{l} \gamma_i k_{\mathcal{X}}(\cdot, \tilde{X}_i)$ as
$$\hat{m}_{Q_{\mathcal{Y}}}(y) = \sum_{i=1}^{l} \gamma_i m_{\mathcal{Y}|\tilde{X}_i}(y). \qquad (8)$$
We then define the following estimator:
$$\hat{m}_{Q_{\mathcal{Y}}} = \sum_{i=1}^{l} \gamma_i m_{\mathcal{Y}|\tilde{X}_i}. \qquad (9)$$
The Mb-KSR deals with the mapping $\bar{U}_{\mathcal{Y}|\mathcal{X}}: m_\Pi \mapsto m_{Q_{\mathcal{Y}}}$. The consistency rate of $\hat{m}_{Q_{\mathcal{Y}}}$ is the same as that of the input kernel mean $\hat{m}_\Pi$.
In the Mb-KSR, the conditional distribution $P_{\mathcal{Y}|x}$ is given by a probabilistic model, and $m_{\mathcal{Y}|x}$ is its conditional kernel mean. If $m_{\mathcal{Y}|x}(y)$ can be computed efficiently, then eq. (8) can be computed efficiently. For example, if $P_{\mathcal{Y}|x}$ is an additive Gaussian noise model and $k$ is a Gaussian kernel, then $m_{\mathcal{Y}|x}$ has an explicit expression and $m_{\mathcal{Y}|x}(y)$ can be computed efficiently for every $y \in \mathcal{Y}$. In the following, we consider the Mb-KSR of an additive Gaussian noise model; thus, we focus on the Gaussian case in the present study.

Let $d_G(x|\mu, R)$, $x \in \mathbb{R}^m$, denote the $m$-dimensional Gaussian density function with mean vector $\mu$ and covariance matrix $R$, and write $X \sim N(\mu, R)$ for a Gaussian random vector. Let $\mathcal{Y} = \mathbb{R}^m$ and let $H_{R_Y}$ be the RKHS generated by the Gaussian kernel $k_{R_Y}(y, y') = d_G(y - y'|0, R_Y)$. Assume that $\{P_{\mathcal{Y}|x}\}$ is an additive Gaussian noise model $y = f(x) + \epsilon$ with a function $f: \mathcal{X} \to \mathbb{R}^m$ and Gaussian noise $\epsilon \sim N(0, \Sigma)$. Then the conditional kernel mean of $P_{\mathcal{Y}|x}$ in $H_{R_Y}$ is given by
$$m_{\mathcal{Y}|x}(y) = d_G(y|f(x), R_Y + \Sigma). \qquad (10)$$

Proof The additive Gaussian noise model has the conditional density $p(y|x) = d_G(y|f(x), \Sigma)$. For each $x \in \mathcal{X}$, the conditional kernel mean is $m_{\mathcal{Y}|x}(y) = \int d_G(y - y'|0, R_Y)\, d_G(y'|f(x), \Sigma)\, dy'$, the convolution of two Gaussian densities, which is again a Gaussian density with the means and covariances summed.

By substituting the explicit expression (10) into eq. (9), we obtain the Mb-KSR for computing $\hat{m}_{Q_{\mathcal{Y}}}$ in the case of the additive Gaussian noise model. For each $y \in \mathcal{Y}$, the evaluation $\hat{m}_{Q_{\mathcal{Y}}}(y)$ of the output kernel mean often needs to be computed in applications; this requires the evaluation of the Gaussian density (10) in eq. (8).
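A minimal sketch of eqs. (8) and (10) for the additive Gaussian noise model, using scipy's Gaussian density (the example transition $f = \sin$ and the parameter values are hypothetical):

```python
import numpy as np
from scipy.stats import multivariate_normal

def cond_kernel_mean_gauss(y, x, f, Sigma, R_Y):
    """Eq. (10): conditional kernel mean of the additive Gaussian noise model
    y = f(x) + eps, eps ~ N(0, Sigma), in the Gaussian RKHS H_{R_Y}:
    m_{Y|x}(y) = d_G(y | f(x), R_Y + Sigma)."""
    return multivariate_normal.pdf(y, mean=f(x), cov=R_Y + Sigma)

def mb_ksr_evaluate(y, X_tilde, gamma, f, Sigma, R_Y):
    """Eq. (8): Mb-KSR evaluation m_hat_{Q_Y}(y) = sum_i gamma_i m_{Y|~X_i}(y)."""
    return sum(g * cond_kernel_mean_gauss(y, x, f, Sigma, R_Y)
               for g, x in zip(gamma, X_tilde))

# Example: scalar state, transition f(x) = sin(x), noise and kernel variances 0.1.
val = mb_ksr_evaluate(y=0.3, X_tilde=[0.0, 0.5, 1.0], gamma=[0.4, 0.3, 0.3],
                      f=np.sin, Sigma=0.1, R_Y=0.1)
print(val)
```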
In this study, we focus on additive Gaussian noise models, which are used frequently, but the same idea can be applied to other probabilistic models, provided that $m_{\mathcal{Y}|x}(y)$ can be computed for each $y \in \mathcal{Y}$. A systematic approach to such probabilistic models is to focus on the conjugate pair of a probabilistic model and a positive-definite kernel for the kernel mean (Nishiyama and Fukumizu, 2014). A probabilistic model $p$ and a positive-definite kernel $k$ are conjugate if $p$ and its kernel mean $m_p$ have the same density form. The pair of an additive Gaussian noise model $p(y|x)$ and a Gaussian kernel $k$ is an example, since the kernel mean (10) is also Gaussian. For other examples, see Appendix A.1.
The Non-KSR and Mb-KSR estimators differ as follows. The Non-KSR (6) estimates the output kernel mean $m_{Q_{\mathcal{Y}}}$ by a linear combination of feature functions $\{k_{\mathcal{Y}}(\cdot, Y_i) \mid i = 1, \dots, n\}$ on $\mathcal{Y}$ with weights $w \in \mathbb{R}^n$, whereas the Mb-KSR (9) estimates $m_{Q_{\mathcal{Y}}}$ by a linear combination of conditional kernel means $\{m_{\mathcal{Y}|\tilde{X}_i} \mid i = 1, \dots, l\}$ with the input weights $\gamma \in \mathbb{R}^l$, where the evaluation of the conditional kernel means is computationally tractable. If a conjugate pair is used, the conditional kernel means have the same functional form as the feature functions but with inflated parameter values: in the additive Gaussian noise model case, the Mb-KSR output is a linear combination of Gaussian functions $\{d_G(\cdot|f(\tilde{X}_i), R_Y + \Sigma) \mid i = 1, \dots, l\}$, whose covariance is increased by $\Sigma$. The Non-KSR estimator uses the regularization parameter $\epsilon_n$ to determine the smoothness, whereas the Mb-KSR does not and thus requires no tuning; the smoothness is determined by the probabilistic model, reflecting the knowledge it encodes.

Combining the Non-KSR and Mb-KSR
Given the two types of kernel sum rule, the Non-KSR and the Mb-KSR, we can infer kernel means by combining them. We perform inference for the chain case shown in Fig. 1.
Let $(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$, $(\mathcal{Y}, \mathcal{B}_{\mathcal{Y}})$, and $(\mathcal{Z}, \mathcal{B}_{\mathcal{Z}})$ be measurable spaces. Let $(X, Y, Z)$ be a random variable on $\mathcal{X} \times \mathcal{Y} \times \mathcal{Z}$ with probability distribution $Q_{\mathcal{X} \times \mathcal{Y} \times \mathcal{Z}}$. Suppose that $Q_{\mathcal{X} \times \mathcal{Y} \times \mathcal{Z}}$ is factored into a chain, as in Fig. 1, comprising an input distribution $\Pi$ on $\mathcal{X}$ and conditional distributions $P_{\mathcal{Y}|\mathcal{X}}$ and $P_{\mathcal{Z}|\mathcal{Y}}$. Let $Q_{\mathcal{Y}}$ and $Q_{\mathcal{Z}}$ denote the marginals of $Q_{\mathcal{X} \times \mathcal{Y} \times \mathcal{Z}}$ on $\mathcal{Y}$ and $\mathcal{Z}$, respectively. Define three RKHSs $(\mathcal{X}, k_{\mathcal{X}}, H_{\mathcal{X}})$, $(\mathcal{Y}, k_{\mathcal{Y}}, H_{\mathcal{Y}})$, and $(\mathcal{Z}, k_{\mathcal{Z}}, H_{\mathcal{Z}})$.

Example 3.2 (chain) Let $\hat{m}_\Pi = \sum_{i=1}^{l} \gamma_i k_{\mathcal{X}}(\cdot, \tilde{X}_i)$ be an estimator of the input kernel mean $m_\Pi$ of a probability distribution $\Pi$.
(i) Fig. 1 (Left). This case corresponds to the preliminary results (Song et al, 2009; Fukumizu et al, 2013; Song et al, 2013) and is obtained by applying the Non-KSR twice.
(ii) Fig. 1 (Middle). Let $\{(X_i, Y_i)\}_{i=1}^n$ be the training data for a conditional distribution $P_{\mathcal{Y}|\mathcal{X}}$, generated from an input distribution $P_{\mathcal{X}}$. Suppose that a probabilistic model is given for $P_{\mathcal{Z}|\mathcal{Y}}$. The output kernel mean $m_{Q_{\mathcal{Z}}}$ is then estimated by $\hat{m}_{Q_{\mathcal{Z}}} = \bar{U}_{\mathcal{Z}|\mathcal{Y}} \hat{U}_{\mathcal{Y}|\mathcal{X}} \hat{m}_\Pi$, where $\hat{U}_{\mathcal{Y}|\mathcal{X}}$ denotes the Non-KSR operation and $\bar{U}_{\mathcal{Z}|\mathcal{Y}}$ denotes the Mb-KSR operation, i.e.,
$$\hat{m}_{Q_{\mathcal{Z}}} = \sum_{i=1}^{n} w_i m_{\mathcal{Z}|Y_i}, \qquad w = (G_X + n\epsilon_n I_n)^{-1} G_{X\tilde{X}} \gamma, \qquad (11)$$
where $G_X$ and $G_{X\tilde{X}}$ are the kernel matrices $(G_X)_{ij} = k_{\mathcal{X}}(X_i, X_j)$ and $(G_{X\tilde{X}})_{ij} = k_{\mathcal{X}}(X_i, \tilde{X}_j)$, $I_n$ is the $n \times n$ identity matrix, and $\epsilon_n$ is a regularization parameter. The RKHS elements $\{m_{\mathcal{Z}|Y_i}\}_{i=1}^n \subset H_{\mathcal{Z}}$ are the conditional kernel means of the probabilistic model $\{P_{\mathcal{Z}|Y_i}\}_{i=1}^n$ at the data points $\{Y_i\}_{i=1}^n$.

(iii) Fig. 1 (Right). This is the opposite of setting (ii). Suppose that a probabilistic model is given for $P_{\mathcal{Y}|\mathcal{X}}$. Let $\{(Y_i, Z_i)\}_{i=1}^n$ be the training data for a conditional distribution $P_{\mathcal{Z}|\mathcal{Y}}$, generated by an input distribution $P_{\mathcal{Y}}$. The output kernel mean $m_{Q_{\mathcal{Z}}}$ is then estimated by $\hat{m}_{Q_{\mathcal{Z}}} = \hat{U}_{\mathcal{Z}|\mathcal{Y}} \bar{U}_{\mathcal{Y}|\mathcal{X}} \hat{m}_\Pi$, where $\bar{U}_{\mathcal{Y}|\mathcal{X}}$ denotes the Mb-KSR operation and $\hat{U}_{\mathcal{Z}|\mathcal{Y}}$ denotes the Non-KSR operation, i.e.,
$$\hat{m}_{Q_{\mathcal{Z}}} = \sum_{i=1}^{n} w_i k_{\mathcal{Z}}(\cdot, Z_i), \qquad w = (G_Y + n\epsilon_n I_n)^{-1} M \gamma, \qquad (12)$$
where $(G_Y)_{ij} = k_{\mathcal{Y}}(Y_i, Y_j)$, $M \in \mathbb{R}^{n \times l}$ has entries $M_{ij} = m_{\mathcal{Y}|\tilde{X}_j}(Y_i)$ (by the reproducing property, $M_{ij} = \langle k_{\mathcal{Y}}(\cdot, Y_i), m_{\mathcal{Y}|\tilde{X}_j} \rangle_{H_{\mathcal{Y}}}$), $I_n$ is the $n \times n$ identity matrix, and $\epsilon_n$ is a regularization parameter.
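Both combined estimators can be sketched directly from eqs. (11) and (12). The following illustration is our own; it assumes Gaussian-density kernels so that the conditional kernel means take the closed form (10), and all parameter names are hypothetical:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def chain_non_mb(z, X, Y, X_tilde, gamma, g, Sigma2, R_Z, R_X, eps):
    """Case (ii), eq. (11): Non-KSR for X -> Y, then Mb-KSR for the Gaussian
    model z = g(y) + eps2, eps2 ~ N(0, Sigma2); evaluates m_hat_{Q_Z}(z).
    X, Y are (n, d) arrays; X_tilde is (l, d)."""
    n = len(X)
    G_X  = np.stack([mvn.pdf(X, mean=xj, cov=R_X) for xj in X], axis=1)
    G_Xt = np.stack([mvn.pdf(X, mean=xt, cov=R_X) for xt in X_tilde], axis=1)
    w = np.linalg.solve(G_X + n * eps * np.eye(n), G_Xt @ gamma)
    return sum(wi * mvn.pdf(z, mean=g(y), cov=R_Z + Sigma2) for wi, y in zip(w, Y))

def chain_mb_non(z, Y, Z, X_tilde, gamma, f, Sigma1, R_Y, R_Z, eps):
    """Case (iii), eq. (12): Mb-KSR for the Gaussian model y = f(x) + eps1
    feeds the Non-KSR for Y -> Z; M[i, j] = m_{Y|~X_j}(Y_i)."""
    n = len(Y)
    G_Y = np.stack([mvn.pdf(Y, mean=yj, cov=R_Y) for yj in Y], axis=1)
    M = np.stack([mvn.pdf(Y, mean=f(xt), cov=R_Y + Sigma1) for xt in X_tilde],
                 axis=1)
    w = np.linalg.solve(G_Y + n * eps * np.eye(n), M @ gamma)
    return w @ mvn.pdf(Z, mean=z, cov=R_Z)     # evaluate sum_i w_i k_Z(z, Z_i)
```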
The consistency of the two estimators (11) and (12) is addressed in Example A11 (Appendix A.2). Since the rate of the output kernel mean of the Mb-KSR is equal to the rate of the input kernel mean, the two estimators (11) and (12) have the same consistency rate.

KBR
As described in Sect. 3.1, when the Mb-KSR is used, a kernel mean estimator can be expressed not only as a weighted sum of feature functions, $\hat{m}_P = \sum_{i=1}^{l} \gamma_i k_{\mathcal{X}}(\cdot, \tilde{X}_i)$, but also as a weighted sum of conditional kernel means (9). The KBR (7) for the case where $\hat{m}_\Pi$ is expressed as a weighted sum of conditional kernel means can be obtained in a slightly modified manner, as follows.
Let $\hat{m}_\Pi = \sum_{j=1}^{l} \gamma_j m_j$ be a prior kernel mean estimator that uses RKHS elements $\{m_j\}_{j=1}^{l} \subset H_{\mathcal{X}}$. The KBR is given by
$$\hat{m}_{Q_{\mathcal{X}|y}} = \sum_{i=1}^{n} w_i k_{\mathcal{X}}(\cdot, X_i), \qquad w = \Lambda G_Y \big( (\Lambda G_Y)^2 + \delta_n I_n \big)^{-1} \Lambda k_Y(y), \qquad (13)$$
where $\Lambda = \mathrm{diag}(\mu)$ with $\mu = (G_X + n\epsilon_n I_n)^{-1} M \gamma$ and $M_{ij} = m_j(X_i)$; the only modification relative to eq. (7) is that the matrix $G_{X\tilde{X}}$ is replaced by the evaluations $M$ of the RKHS elements $\{m_j\}$ at the data points.

Fig. 2 State space models. Left: the prediction step is the kernel sum rule (KSR). Right: the filtering step is the kernel Bayes' rule (KBR).
Filtering for State Space Models

Using the components described in Sections 2 and 3, we develop a filtering algorithm for state space models (Fig. 2). We consider the following setting:
- Let $x \in \mathcal{X} = \mathbb{R}^m$ be a hidden state. The transition dynamics comprise an additive Gaussian noise model with a linear or nonlinear function $f: \mathbb{R}^m \to \mathbb{R}^m$.
- Let $z$ be an observation in an arbitrary domain $\mathcal{Z}$. The observation process is given by a conditional distribution $P_{\mathcal{Z}|\mathcal{X}}$, with density $p(z|x)$ if it exists. We learn $P_{\mathcal{Z}|\mathcal{X}}$ nonparametrically from the data $\{(X_i, Z_i)\}_{i=1}^n$.
The filtering task is the sequential estimation of the hidden state $x_t$ from a sequence of observations $z_{1:t} := (z_1, \dots, z_t)$ at each time $t$. Due to the Markov property of the state space model, the current state $x_t$ is estimated using the distribution (belief) of the previous hidden state $x_{t-1}$ and the current observation $z_t$. Fig. 2 shows one step of the filtering algorithm at time $t$. The algorithm comprises two steps: the prediction step and the filtering step. Suppose that $q(x_{t-1}|z_{1:t-1})$ is the distribution of $x_{t-1}$ given observations $z_{1:t-1}$. In the prediction step, $x_t$ is predicted, without the observation $z_t$, as
$$p(x_t|z_{1:t-1}) = \int p(x_t|x_{t-1})\, q(x_{t-1}|z_{1:t-1})\, dx_{t-1}.$$
In the filtering step, the predictive distribution is updated to
$$q(x_t|z_{1:t}) = \frac{p(z_t|x_t)\, p(x_t|z_{1:t-1})}{\int_{\mathcal{X}} p(z_t|x_t)\, p(x_t|z_{1:t-1})\, dx_t}$$
using the prior $p(x_t|z_{1:t-1})$ and the likelihood $p(z_t|x_t)$ with a new observation $z_t$. The prediction and filtering steps are applied sequentially at each time $t$. The prediction step corresponds to the sum rule and the filtering step corresponds to Bayes' rule.
Filtering algorithms that use kernel means are obtained by applying the KSR and the KBR. Let $(\mathcal{X}, k_{\mathcal{X}}, H_{\mathcal{X}})$ and $(\mathcal{Z}, k_{\mathcal{Z}}, H_{\mathcal{Z}})$ be the RKHSs on the state and observation domains. Let $m_{X_{t-1}|z_{1:t-1}}$, $m_{X_t|z_{1:t-1}}$, and $m_{X_t|z_{1:t}}$ be the kernel means of $q(x_{t-1}|z_{1:t-1})$, $p(x_t|z_{1:t-1})$, and $q(x_t|z_{1:t})$ in $H_{\mathcal{X}}$, respectively. Suppose that we have an estimator
$$\hat{m}_{X_{t-1}|z_{1:t-1}} = \sum_{i=1}^{n} \alpha_i^{X_{t-1}|z_{1:t-1}} k_{\mathcal{X}}(\cdot, X_i)$$
with weights $\alpha^{X_{t-1}|z_{1:t-1}} \in \mathbb{R}^n$ and data $(X_1, \dots, X_n)$. In the prediction step, $\hat{m}_{X_t|z_{1:t-1}} = \bar{U}_{X'|X} \hat{m}_{X_{t-1}|z_{1:t-1}}$, where $\bar{U}_{X'|X}$ is the Mb-KSR operation. From eq. (9), this is given by
$$\hat{m}_{X_t|z_{1:t-1}} = \sum_{i=1}^{n} \alpha_i^{X_{t-1}|z_{1:t-1}} m_{X'|X_i},$$
where $m_{X'|X_i}$ is the conditional kernel mean of the transition dynamics (the additive Gaussian noise model) given a data point $X_i$. In the filtering step, $m_{X_t|z_{1:t}}$ is obtained by the KBR (13) with the prior kernel mean $\hat{m}_{X_t|z_{1:t-1}}$. From eq. (13), the posterior kernel mean is estimated by
$$\hat{m}_{X_t|z_{1:t}} = \sum_{i=1}^{n} \alpha_i^{X_t|z_{1:t}} k_{\mathcal{X}}(\cdot, X_i), \qquad \alpha^{X_t|z_{1:t}} = R_{X|Z}(\beta^{Z_t|z_{1:t-1}})\, k_Z(z_t),$$
where $R_{X|Z}(\beta^{Z_t|z_{1:t-1}})$ is the $n \times n$ matrix that depends on $\beta^{Z_t|z_{1:t-1}}$ as
$$R_{X|Z}(\beta) = \Lambda G_Z \big( (\Lambda G_Z)^2 + \delta_n I_n \big)^{-1} \Lambda, \qquad \Lambda = \mathrm{diag}(\beta),$$
and where
$$\beta^{Z_t|z_{1:t-1}} = (G_X + n\epsilon_n I_n)^{-1} M \alpha^{X_{t-1}|z_{1:t-1}}, \qquad M_{ij} = m_{X'|X_j}(X_i).$$
This corresponds to eq. (12) in Example 3.2. As a result, the filtering algorithm is summarized in Algorithm 1.

Algorithm 1 KBR-Filter (Transition: probabilistic model, Observation: nonparametric)

As described in eq. (4), the expectation $E_{X \sim Q_{\mathcal{X}|z_{1:t}}}[f(X)]$ of any RKHS function $f \in H_{\mathcal{X}}$ with respect to the posterior distribution $Q_{\mathcal{X}|z_{1:t}}$ can be estimated as
$$\hat{E}[f(X)] = \sum_{i=1}^{n} \alpha_i^{X_t|z_{1:t}} f(X_i). \qquad (14)$$
A point estimate $\hat{x}_t$ from a given kernel mean estimator $\hat{m}_{X_t|z_{1:t}}$ may be obtained by solving a preimage problem (Song et al, 2009; Fukumizu et al, 2011), where the minimization results in a simple iteration in the Gaussian kernel case (Mika et al, 1999).
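A compact sketch of Algorithm 1 under the reconstruction above, with Gaussian-density kernels and a weighted-mean point estimate standing in for the preimage step (all names and simplifications are ours):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def kbr_filter(Z_obs, X_tr, Z_tr, f, Sigma, R_X, R_Z, eps, delta):
    """Sketch of Algorithm 1 (KBR-Filter).
    Transition: x' = f(x) + N(0, Sigma), handled by the Mb-KSR via the
    conditional kernel means m_{X'|x} = d_G(. | f(x), R_X + Sigma) (eq. (10)).
    Observation process: learned nonparametrically from (X_tr, Z_tr)."""
    n = len(X_tr)
    G_X = np.stack([mvn.pdf(X_tr, mean=xj, cov=R_X) for xj in X_tr], axis=1)
    G_Z = np.stack([mvn.pdf(Z_tr, mean=zj, cov=R_Z) for zj in Z_tr], axis=1)
    # M[i, j] = m_{X'|X_j}(X_i), the transition model evaluated at the data.
    M = np.stack([mvn.pdf(X_tr, mean=f(xj), cov=R_X + Sigma) for xj in X_tr],
                 axis=1)
    A = G_X + n * eps * np.eye(n)
    alpha = np.full(n, 1.0 / n)                   # initial belief weights
    estimates = []
    for z in Z_obs:
        beta = np.linalg.solve(A, M @ alpha)      # prediction step (eq. (12))
        Lam = np.diag(beta)
        LG = Lam @ G_Z
        kz = mvn.pdf(Z_tr, mean=z, cov=R_Z)       # k_Z(Z_i, z_t)
        alpha = LG @ np.linalg.solve(LG @ LG + delta * np.eye(n), Lam @ kz)
        estimates.append(alpha @ X_tr)            # heuristic point estimate
    return np.array(estimates)
```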
We compare Algorithm 1 with other typical filtering algorithms as follows. Well-known algorithms such as the Kalman filter, extended Kalman filter, unscented Kalman filter, and particle filters restrict the observation domain $\mathcal{Z}$ to (a subset of) the Euclidean space $\mathbb{R}^d$, and they require a probabilistic model of the observation process $P_{\mathcal{Z}|\mathcal{X}}$, e.g., to compute the likelihood in particle filters. By contrast, Algorithm 1 permits an arbitrary observation domain $\mathcal{Z}$ and does not assume a specific probabilistic model for the observation process. The next section presents the results of numerical experiments.

Experiments
In this section, we present the results of numerical experiments using the Mb-KSR and the filtering algorithm, Algorithm 1, for state space models. Sect. 5.1 describes simple ground-truth experiments that validate the concept of the Mb-KSR and illustrate its differences from the Non-KSR. Sect. 5.2 presents filtering results for synthetic nonlinear state space models, which demonstrate the superior performance of the proposed Algorithm 1 compared with the existing fully nonparametric kernel Bayes filter (Fukumizu et al, 2013, Sect. 4.3), owing to the incorporation of a probabilistic model. Sect. 5.3 describes the application of our proposed method to real-world vision-based robot localization problems.

Ground-truth Experiment
We validate the Mb-KSR concept and determine how it differs from the Non-KSR. We consider a case where the true output kernel mean $m_{Q_Y}$ described in Sect. 3.1 has an analytical solution, and we evaluate the accuracy $\|m_{Q_Y} - \hat{m}_{Q_Y}\|_{H_Y}$ of the estimators in the RKHS norm.
Let $\mathcal{X} = \mathcal{Y} = \mathbb{R}^m$ and let $P_{\mathcal{Y}|\mathcal{X}}$ be a linear Gaussian model $y = Ax + \epsilon$ with matrix $A \in \mathbb{R}^{m \times m}$ and Gaussian noise $\epsilon \sim N(0, \Sigma)$. If the input distribution $\Pi$ is a Gaussian mixture with density $\pi(x) = \sum_{i=1}^{L} \xi_i d_G(x|\mu_i, V_i)$, then the output kernel mean $m_{Q_Y}$ with a Gaussian kernel has an analytical solution, since $Q_Y$ is again a Gaussian mixture. Proposition A12 gives the RKHS norm errors of the Non-KSR and Mb-KSR estimators in this case.
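The closed form follows from Gaussian convolution identities: $Q_Y$ is the mixture $\sum_i \xi_i N(A\mu_i, A V_i A^\top + \Sigma)$, and convolving with the Gaussian kernel $k_{R_Y}$ adds $R_Y$ to each covariance. A short sketch of this reconstruction:

```python
import numpy as np
from scipy.stats import multivariate_normal

def analytic_mQY(y, A, Sigma, R_Y, xis, mus, Vs):
    """Analytic output kernel mean for y = A x + eps, eps ~ N(0, Sigma), with
    Gaussian-mixture input pi(x) = sum_i xi_i d_G(x | mu_i, V_i) and Gaussian
    kernel k_{R_Y}: m_{Q_Y}(y) = sum_i xi_i d_G(y | A mu_i, A V_i A^T + Sigma + R_Y)."""
    return sum(xi * multivariate_normal.pdf(y, mean=A @ mu,
                                            cov=A @ V @ A.T + Sigma + R_Y)
               for xi, mu, V in zip(xis, mus, Vs))
```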
In the experiments, we set $m = 2$ and $A = \Sigma = I_2$, and drew a sample $\{(X_i, Y_i)\}_{i=1}^{500}$ i.i.d. with a uniform input distribution $P_{\mathcal{X}}$ on the square $[-10, 10]^2$. For the test distribution $\Pi$, we set $L = 4$ and $\xi_i = 1/4$ ($i = 1, 2, 3, 4$). The input kernel mean $m_\Pi$ was estimated by $\hat{m}_\Pi = \frac{1}{500} \sum_{i=1}^{500} k_{\mathcal{X}}(\cdot, \tilde{X}_i)$ with a sample $\tilde{X}_1, \dots, \tilde{X}_{500}$ drawn i.i.d. from $\Pi$. We set $R_X = 0.1 I_2$ and $R_Y = I_2$ for the Gaussian kernels $k_{R_X}$ and $k_{R_Y}$, respectively. The results were averaged over 30 experiments. Fig. 3 plots the errors when the Mb-KSR misspecified the coefficient $A$ as $\tilde{A} = \sigma_1 A$ with fixed $\Sigma$, where the value $\sigma_1 > 0$ indicates the horizontal axis; $\sigma_1 = 1$ corresponds to the exact case. Similarly, Fig. 3 (bottom left) plots the errors when the Mb-KSR misspecified the variance $\Sigma$ as $\tilde{\Sigma} = \sigma_2 \Sigma$ with fixed $A$, where the value $\sigma_2 > 0$ indicates the horizontal axis; $\sigma_2 = 1$ corresponds to the exact case. These figures show the sensitivity of the Mb-KSR to model misspecification. The Mb-KSR performed better than the Non-KSR even under some model misspecification in this setting. Next, we examined the inference corresponding to eqs. (11) and (12). We again considered a case where the output kernel mean $m_{Q_Z}$ has an analytical solution, so that the RKHS norm error $\|m_{Q_Z} - \hat{m}_{Q_Z}\|_{H_Z}$ can be evaluated.
Let $P_{\mathcal{Y}|\mathcal{X}}$ and $P_{\mathcal{Z}|\mathcal{Y}}$ be the linear Gaussian noise models $y = A_1 x + \epsilon_1$ and $z = A_2 y + \epsilon_2$ with $x, y, z \in \mathbb{R}^m$, matrices $A_1, A_2 \in \mathbb{R}^{m \times m}$, and independent Gaussian noises $\epsilon_1 \sim N(0, \Sigma_1)$ and $\epsilon_2 \sim N(0, \Sigma_2)$. The kernel mean $m_{Q_Z}$ of $Q_Z$ with a Gaussian kernel $k_{R_Z}$ then has an analytical solution (again a Gaussian mixture). Proposition A13 gives the RKHS norm errors of the three estimators (i), (ii), and (iii) of Example 3.2 in this setting.
In the experiments, we set $A_1 = A_2$ and $\Sigma_1 = \Sigma_2$, and used the same parameter settings as above. Fig. 3 (bottom right) shows the errors of the three types of estimators in Proposition A13: 'N+N,' 'N+Mb,' and 'Mb+N' represent (i) Non-KSR + Non-KSR, (ii) Non-KSR + Mb-KSR, and (iii) Mb-KSR + Non-KSR, respectively, and 'N+eMb' indicates Non-KSR + Mb-KSR with an estimated model. All of the errors decreased as the regularization parameter $\epsilon \to 0$. Estimators (ii) and (iii) were more accurate than (i) because the Mb-KSR reflected the partial knowledge ($P_{\mathcal{Y}|\mathcal{X}}$ or $P_{\mathcal{Z}|\mathcal{Y}}$) contained in the linear Gaussian noise model.

Filtering for state space models
We performed experiments with the proposed kernel mean filter, Algorithm 1, on a state space model (Fukumizu et al, 2013, Sect. 5.3) and compared the results with those of the existing kernel mean filter (Fukumizu et al, 2013, Sect. 4.3). The model settings were as follows.
- The hidden state was $x_t := (u_t, v_t) \in \mathbb{R}^2$. The transition dynamics were $(u_{t+1}, v_{t+1})^\top = (1 + b \sin(M \theta_{t+1}))(\cos\theta_{t+1}, \sin\theta_{t+1})^\top + \varsigma_t$, $\theta_{t+1} = \theta_t + \eta \pmod{2\pi}$, where $\varsigma_t$ was independent Gaussian noise $N(0, \sigma_h^2 I_2)$.
- The observation variable was $z_t \in \mathbb{R}^2$. A training sample $\{(x_i, z_i)\}_{i=1}^n$ was used for learning the unknown observation process.

We used Gaussian kernels for the state and observation domains. We learned the kernel parameters (the regularization parameters $\delta$, $\epsilon$ and the bandwidth parameters) by twofold cross-validation with a grid search. In the test phase, we estimated a single hidden state from the posterior kernel mean using eq. (14). We then evaluated the mean squared error (MSE) of the estimated state trajectory relative to the true state trajectory in a test sequence and computed the average MSE over 30 experiments.

Fig. 4 (top left) shows the posterior kernel mean estimator, where the green curve indicates the transition function $f$ without the noise term and the dots indicate the training sample $\{x_i\}_{i=1}^n$. For each time point $t$, the kernel mean of the posterior distribution was estimated from the weights on the data $\{x_i\}_{i=1}^n$; cyan-colored data indicate positive weights and magenta-colored data indicate negative weights. Fig. 4 (top right) shows the computed MSEs and their standard deviations as a function of the training sample size $n$ (the parameter settings are shown in the caption). In this figure, 'proposed' indicates the results obtained with Algorithm 1 and 'fKBF' indicates the results obtained using the existing fully nonparametric kernel Bayes' filter (Fukumizu et al, 2013, Sect. 4.3). fKBF learns both the transition dynamics and the observation process nonparametrically, using transition samples $\{(x_t, x_{t+1})\}_{t=1}^n$ and observation samples $\{(x_i, z_i)\}_{i=1}^n$. Since fKBF is known to perform better than both the extended Kalman filter and the unscented Kalman filter when the state space model is strongly nonlinear (Fukumizu et al, 2013, Sect. 5.3), we only report the comparison between Algorithm 1 and fKBF. The proposed Algorithm 1 can achieve a lower MSE than fKBF due to the incorporation of the Mb-KSR. Fig. 4 (bottom left) shows that similar results were obtained when the transition noise was extended to a Gaussian mixture; Algorithm 1 applies to an additive Gaussian mixture noise model by a simple extension (Appendix A.1, Example A9). Another benefit of Algorithm 1 over fKBF is that Algorithm 1 can handle time-variant (inhomogeneous) transition dynamics such as $x_{t+1} = f_t(x_t) + \varsigma_t$, whereas fKBF cannot. Fig. 4 (bottom right) shows the simplest such case, where the transition dynamics of the training and test phases differ; fKBF cannot deal with this situation and exhibits bias errors. Nonparametric learning of time-variant transition dynamics $P^{(t)}_{X'|X}$ would require a training sample $\{(x_i, x_i')\}_{i=1}^n$ for each fixed time $t$, demanding a vast number of training samples.

Robot Localization
We applied the proposed Algorithm 1 to a vision-based robot localization problem. This task requires the sequential estimation of the position of a mobile robot from the images it captures. In the state space modeling, the state domain $\mathcal{X}$ is the robot's position and the observation domain $\mathcal{Z}$ comprises the images. Note that Algorithm 1 permits an arbitrary domain $\mathcal{Z}$, such as images, provided a positive-definite kernel is defined.
In this experiment, we used the COsy Localization Database (COLD) (Pronobis and Caputo, 2009), which contains image sequences captured by mobile robots in indoor laboratory environments. Each image $z$ is labeled with the robot's position $(x, y, \theta) \in \mathbb{R}^2 \times [-\pi, \pi]$, where $(x, y)$ denotes the position in a building (global coordinate system) and $\theta$ is the angle representing the robot's pose, and with the robot's internal odometry data $(\bar{x}, \bar{y}, \bar{\theta})$; i.e., the data for each time point $t$ comprise $D_t = (x_t, y_t, \theta_t, \bar{x}_t, \bar{y}_t, \bar{\theta}_t, z_t)$. The process generating an image $z$ at a given robot position $(x, y, \theta)$ is complex and depends on the environment; thus, we employed nonparametric learning using kernel means. By contrast, the transition dynamics of the robot's position $(x, y, \theta)$ can be modeled using a probabilistic odometry motion model (Thrun et al, 2005).
In the experiments, we used the Saarbrücken, Part A, Standard, Cloudy data from the database, which comprise three similar trajectories along the path (Pronobis and Caputo, 2009, p. 590, Fig. 1(b), blue). Fig. 5 (left) shows the trajectories with arrows. We used two of these trajectories for training and the remaining one for testing. We used an odometry motion model to predict the next state $(x', y', \theta')$ given the current state $(x, y, \theta)$, the current odometry measurement $(\bar{x}, \bar{y}, \bar{\theta})$, and the next odometry measurement $(\bar{x}', \bar{y}', \bar{\theta}')$, where the odometry increments are perturbed by Gaussian noises $\xi_x \sim N(0, \sigma_x^2)$, $\xi_y \sim N(0, \sigma_y^2)$, $\xi_c \sim N(0, \sigma_c^2)$, and $\xi_s \sim N(0, \sigma_s^2)$; here, $\mathrm{atan2}(\cdot, \cdot)$ is the arctangent function with two arguments. We learned the variances $\sigma_x^2$, $\sigma_y^2$, $\sigma_c^2$, and $\sigma_s^2$ using the two training trajectories. We used the spatial pyramid matching kernel (Lazebnik et al, 2006) based on scale-invariant feature transform (SIFT) descriptors (Lowe, 2004) for the images $\mathcal{Z}$, and a Gaussian kernel for the states $(x, y, \cos\theta, \sin\theta)$. The bandwidth and regularization parameters were tuned by twofold cross-validation using the training data. In the test phase, we estimated the filtered state as the training data point that maximized the weight of the posterior kernel mean. We then evaluated the root mean squared error (RMSE) on the test sequence, averaged over 10 experiments for each training dataset size.
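The state kernel used here can be sketched directly: embed each pose as $(x, y, \cos\theta, \sin\theta)$ and apply a Gaussian kernel, so that the kernel respects the $2\pi$-periodicity of the heading angle (the bandwidth below is a placeholder; in the experiments it was tuned by cross-validation):

```python
import numpy as np

def pose_features(states):
    """Embed robot poses (x, y, theta) as (x, y, cos theta, sin theta)."""
    x, y, th = states[:, 0], states[:, 1], states[:, 2]
    return np.stack([x, y, np.cos(th), np.sin(th)], axis=1)

def pose_kernel(S1, S2, sigma=1.0):
    """Gaussian kernel on embedded poses, as used for the state domain here."""
    F1, F2 = pose_features(S1), pose_features(S2)
    sq = np.sum(F1**2, 1)[:, None] + np.sum(F2**2, 1)[None, :] - 2 * F1 @ F2.T
    return np.exp(-sq / (2 * sigma**2))
```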
We compared the results obtained using the following methods.
- Naïve method (NAI): Given a test image $z$, the position $(x, y, \theta)$ of the robot was estimated as the point in the training dataset that maximized the similarity under the same spatial pyramid matching kernel. This method does not use the Markov property of the time series.
- Nearest Neighbors (NN): A $k$-nearest-neighbor approach to filtering with nonparametric learning of the observation process (Vlassis et al, 2002). Given a test image $z$, the $k$ nearest neighbors among the training images were found and the likelihood function was approximated based on these neighbors.
- fKBF (Fukumizu et al, 2013): this algorithm estimates the hidden states by nonparametric learning of both the observation process and the transition dynamics. The number of training samples used for the transition dynamics was fixed at 900. This comparison demonstrates the effect of incorporating the odometry motion model.

Fig. 5 (right) shows the RMSEs computed with different training dataset sizes. NN performed better than NAI because NAI did not use the Markov property. The proposed method (Analy) yielded more accurate RMSEs than fKBF by incorporating the odometry motion model, which was achieved by introducing the Mb-KSR.

Conclusion
In this study, we proposed a kernel Bayesian inference approach that flexibly combines nonparametric estimation of conditional distributions with probabilistic models of conditional distributions, such as additive Gaussian noise models. This was achieved by introducing the Mb-KSR and combining it with the existing rules, i.e., the Non-KSR and KBR. We demonstrated the consistency of the Mb-KSR and showed how to combine the Non-KSR and Mb-KSR in a chain example. We also developed a filtering algorithm for state space models that combines nonparametric learning of the observation process using kernel means with additive Gaussian noise models of the transition dynamics. The idea of the Mb-KSR extends to α-stable noise models and elliptical distributions by exploiting the concept of a conjugate pair comprising a positive-definite kernel and a probabilistic model. In contrast to the Non-KSR, the Mb-KSR does not contain a regularization parameter, so no tuning is necessary. The proposed Algorithm 1 can even deal with filtering for state space models whose transition dynamics are time-variant, which is not possible with the current fully nonparametric KBR filter. Another potential application of the Mb-KSR is learning a probabilistic model within kernel Bayesian inference. This could be achieved in future research by learning the parameters of the transition dynamics, e.g., by maximizing the likelihood of a sequence of observations, possibly with expectation-maximization-like algorithms. Learning probabilistic models within kernel Bayesian inference may enable semi-parametric inference with kernel methods.

A.1 Conditional Kernel Means of Probabilistic Models
This appendix describes examples of noise models for Mb-KSR in addition to additive Gaussian noise models, which are based on the results of Nishiyama and Fukumizu (2014).
Nishiyama and Fukumizu (2014) described a systematic approach for determining the kernel means of probabilistic models in terms of infinitely divisible distributions, introducing the conjugate pair of a probabilistic model $p$ and a positive-definite kernel $k$ so that the kernel mean $m_p$ has a simple form. A probabilistic model $p$ and a positive-definite kernel $k$ are called conjugate if $p$ and $m_p$ have the same density function form. If $k$ is shift-invariant, the kernel mean map $p \to m_p$ is a convolution $p \to \psi * p$ with the positive-definite function $\psi$. Thus, the conjugate property corresponds to a family of distributions that is closed under convolution and includes a positive-definite function $\psi$.
For example, the family of Gaussian distributions is closed under convolution and it includes a Gaussian density function $\psi$ that is positive-definite; the kernel mean $\psi * p$ of a Gaussian density $p$ with a Gaussian kernel $\psi(x - y)$ is then Gaussian. More generally, for each $\alpha \in (0, 2]$, the family of α-stable distributions is closed under convolution and it includes an α-stable density function $\psi$ that is positive-definite (Nishiyama and Fukumizu, 2014, Sect. 4, p. 16). α-stable distributions have heavy tails for $\alpha \in (0, 2)$ and light tails for $\alpha = 2$ (the Gaussian case); α tunes the tail behavior of both the distribution and the positive-definite kernel. By contrast, the family of Gamma distributions is closed under convolution but does not include a Gamma density $\psi$ that is positive-definite, so this is not a conjugate case. The family of Laplace distributions is not closed under convolution, so this is not a conjugate case either; however, the conditional kernel mean of a Laplace distribution with a Laplace kernel has an explicit form (Example A6) and can be used for the Mb-KSR.
The results obtained using the kernel means of probabilistic models (Nishiyama and Fukumizu, 2014) immediately indicate the conditional kernel mean cases for additive probabilistic noise models.
The α-stable case on R is as follows.
Example A1 (Additive α-stable noise model on $\mathbb{R}$) Let $\mathcal{Y} = \mathbb{R}$. Assume that $\{P_{\mathcal{Y}|x} \mid x \in \mathcal{X}\}$ is an additive α-stable noise model $y = f(x) + \epsilon$ with a function $f: \mathcal{X} \to \mathbb{R}$ and α-stable noise $\epsilon \sim S_\alpha(\sigma, \beta, 0)$. The conditional kernel mean $m_{\mathcal{Y}|x}$ of the additive α-stable noise model in the α-stable RKHS $H_{\alpha, \sigma_0}$ is an α-stable density function, given by
$$m_{\mathcal{Y}|x}(y) = d_{S_\alpha}\!\left(y \,\middle|\, (\sigma_0^\alpha + \sigma^\alpha)^{1/\alpha},\ \frac{\beta \sigma^\alpha}{\sigma_0^\alpha + \sigma^\alpha},\ f(x)\right),$$
following the closure of α-stable laws under convolution. The generalization to the multi-dimensional α-stable cases on $\mathbb{R}^m$ is similar.
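Since scipy ships an α-stable density, the closed form in Example A1 can be evaluated directly. A sketch (note that scipy's levy_stable parameterization conventions matter for skewed cases; for $\beta = 0$ the formula below is unambiguous):

```python
import numpy as np
from scipy.stats import levy_stable

def alpha_stable_cond_mean(y, fx, alpha, sigma0, sigma, beta):
    """Conditional kernel mean of the additive alpha-stable noise model on R
    (Example A1): the kernel is generated by an S_alpha(sigma0, 0, 0) density
    and the noise is S_alpha(sigma, beta, 0).  Stable laws are closed under
    convolution: the scale combines as (sigma0^a + sigma^a)^(1/a) and the
    skewness becomes beta * sigma^a / (sigma0^a + sigma^a)."""
    s = sigma0**alpha + sigma**alpha
    return levy_stable.pdf(y, alpha, beta * sigma**alpha / s,
                           loc=fx, scale=s**(1.0 / alpha))

# Example: Cauchy-like symmetric noise (alpha = 1, beta = 0) around f(x) = 2.0.
val = alpha_stable_cond_mean(y=1.5, fx=2.0, alpha=1.0, sigma0=0.5,
                             sigma=0.3, beta=0.0)
print(val)
```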
Example A2 (Additive α-stable noise model on $\mathbb{R}^m$) Let $\mathcal{Y} = \mathbb{R}^m$. Assume that $\{P_{\mathcal{Y}|x} \mid x \in \mathcal{X}\}$ is an additive α-stable noise model $y = f(x) + \epsilon$ with a function $f: \mathcal{X} \to \mathbb{R}^m$ and an α-stable random vector noise $\epsilon \sim S_\alpha(\Gamma, 0)$. The conditional kernel mean $m_{\mathcal{Y}|x}$ of the additive α-stable noise model in the α-stable RKHS $H_{\alpha, \Gamma_s}$ is the α-stable density function
$$m_{\mathcal{Y}|x}(y) = d_{S_\alpha}(y \mid \Gamma_s + \Gamma, f(x)),$$
where the spectral measures add under convolution. A well-known special case of the multi-dimensional α-stable distributions on $\mathbb{R}^m$ comprises the sub-Gaussian α-stable distributions on $\mathbb{R}^m$.
Let $d_{SG_\alpha}(x|R, \mu)$, $x \in \mathbb{R}^m$, denote the sub-Gaussian α-stable density on $\mathbb{R}^m$, where $\alpha \in (0, 2)$ is the characteristic index, $\mu \in \mathbb{R}^m$ is a location parameter, and $R \in \mathbb{R}^{m \times m}$ is a dispersion matrix. We denote a sub-Gaussian α-stable random vector $X$ on $\mathbb{R}^m$ by $X \sim SG_\alpha(R, \mu)$. Let $H_{\alpha, R}$ be the RKHS generated by the α-stable kernel $k_{\alpha, R}(x, y) = d_{SG_\alpha}(x - y|R, 0)$, $x, y \in \mathbb{R}^m$; $\alpha = 2$ corresponds to the multi-dimensional Gaussian case.
Example A3 (Additive sub-Gaussian α-stable noise model on $\mathbb{R}^m$) Let $\mathcal{Y} = \mathbb{R}^m$. Assume that $\{P_{\mathcal{Y}|x} \mid x \in \mathcal{X}\}$ is an additive sub-Gaussian α-stable noise model $y = f(x) + \epsilon$ with a function $f: \mathcal{X} \to \mathbb{R}^m$ and $\epsilon \sim SG_\alpha(R, 0)$. If $R_0 = cR$ with $c > 0$, the conditional kernel mean $m_{\mathcal{Y}|x}$ of the additive sub-Gaussian α-stable noise model in the sub-Gaussian α-stable RKHS $H_{\alpha, R_0}$ is a sub-Gaussian α-stable density function, given by
$$m_{\mathcal{Y}|x}(y) = d_{SG_\alpha}\!\left(y \,\middle|\, (c^{\alpha/2} + 1)^{2/\alpha} R,\ f(x)\right).$$

Gaussian distributions and sub-Gaussian α-stable distributions are examples of elliptical distributions, corresponding to the characteristic generators $\phi(r) = e^{-\frac{1}{2}r}$ and $\phi(r) = e^{-(\frac{1}{2}r)^{\alpha/2}}$, respectively. More conceptually, the results given above can be extended to general elliptical distributions; Appendix B reviews elliptical distributions.

Example (Additive elliptical noise model on $\mathbb{R}^m$) Let $\mathcal{Y} = \mathbb{R}^m$. Assume that $\{P_{\mathcal{Y}|x} \mid x \in \mathcal{X}\}$ is an additive elliptical noise model $y = f(x) + \epsilon$ with a function $f: \mathcal{X} \to \mathbb{R}^m$ and $\epsilon \sim EC(0, \Sigma, \phi)$. The conditional kernel mean $m_{\mathcal{Y}|x}$ of the additive elliptical noise model in the RKHS $H_{\Sigma, \phi_0}$ is the elliptical density function
$$m_{\mathcal{Y}|x}(y) = d_{EC}(y - f(x);\ 0,\ \Sigma,\ \phi_0 \cdot \phi),$$
since the characteristic generators multiply under convolution when the kernel and the noise share the matrix $\Sigma$.
Proof For each $x \in \mathcal{X}$, $m_{\mathcal{Y}|x}$ is given by a convolution (eq. (24)). In general, we use the following lemma, which completes the proof.
The proof of Lemma A8 is obtained by direct computation and is omitted. Eq. (24) and Lemma A8 complete the derivation.
The conditional kernel mean of an additive Laplace noise model in a Laplace RKHS has an explicit form, as given above, but it is not exactly a Laplace density function; this is not a conjugate case. Note, however, that the Mb-KSR with the Laplace noise model is computable using the explicit expression (23).
The Mb-KSR may also be employed with a mixture of the probabilistic models in the examples above.

Example A9 (Mixture of probabilistic models) Let $\{P^{(k)}_{\mathcal{Y}|x}\}_{k=1}^N$ be probabilistic models whose conditional kernel means $\{m^{(k)}_{\mathcal{Y}|x}\}_{k=1}^N$ lie in a common RKHS $H_{\mathcal{Y}}$. The conditional kernel mean of the mixture $P_{\mathcal{Y}|x} = \sum_{k=1}^{N} \omega_k P^{(k)}_{\mathcal{Y}|x}$ in $H_{\mathcal{Y}}$ is the mixture $m_{\mathcal{Y}|x} = \sum_{k=1}^{N} \omega_k m^{(k)}_{\mathcal{Y}|x}$. If the $\{m^{(k)}_{\mathcal{Y}|x}\}_{k=1}^N$ have explicit expressions, as in the examples above, then $m_{\mathcal{Y}|x}$ is a mixture of them.

A.2 Consistency
The consistency of the Mb-KSR is demonstrated as follows: the rate of the output kernel mean is the same as that of the input kernel mean $\hat{m}_\Pi$. Note that Fukumizu et al (2013, Theorem 11, p. 3776) provides the consistency rate of the Non-KSR as an upper bound; that rate is always slower than the rate of the input kernel mean.
Proposition A10 Suppose that a kernel mean estimator $\hat{m}_\Pi := \sum_{i=1}^{l} \gamma_i k_{\mathcal{X}}(\cdot, \tilde{X}_i)$ satisfies $\|\hat{m}_\Pi - m_\Pi\|_{H_{\mathcal{X}}} = O_p(l^{-b})$ for some $b > 0$. Then the Mb-KSR estimator (9) satisfies $\|\hat{m}_{Q_{\mathcal{Y}}} - m_{Q_{\mathcal{Y}}}\|_{H_{\mathcal{Y}}} = O_p(l^{-b})$.

Proof We have the following derivation.
The fourth equality employs the assumption on $\theta$. The following are the consistency rates of the two estimators (ii) and (iii) given in Example 3.2.
A.3 RKHS norm errors in the setting described in Sect. 5.1

Proposition A12 (i) Let $\hat{m}_{Q_Y}$ denote the Non-KSR estimator in the setting given in Sect. 5.1. Then the RKHS norm error $\|m_{Q_Y} - \hat{m}_{Q_Y}\|_{H_Y}^2$ has a closed-form expression. (ii) Let $\hat{m}_{Q_Y}$ denote the Mb-KSR estimator in the setting given in Sect. 5.1, and suppose that the Mb-KSR misspecifies the model $(A, \Sigma)$ as $(\tilde{A}, \tilde{\Sigma})$. Then the RKHS norm error likewise has a closed-form expression.

Proof We derive (i) and (ii).
(i) First, the norm $\|m_{Q_Y}\|_{H_Y}^2$ can be computed using eq. (15); here we use the explicit expression for the inner product of two Gaussian kernel means (Nishiyama and Fukumizu, 2014, Proposition 4.5). Expanding $\|m_{Q_Y} - \hat{m}_{Q_Y}\|_{H_Y}^2$ with the Non-KSR estimator $\hat{m}_{Q_Y}$ yields (i). (ii) From eq. (27) and the Mb-KSR estimator $\hat{m}_{Q_Y}$, the analogous expansion completes the derivation; for the fourth equality, we again use the explicit expression for the inner product of two Gaussian kernel means (Nishiyama and Fukumizu, 2014, Proposition 4.5).
Proposition A13 (i) (Non-KSR + Non-KSR) Let $\hat{m}_{Q_Z}$ be the estimator $\hat{m}_{Q_Z} = \hat{U}_{\mathcal{Z}|\mathcal{Y}} \hat{U}_{\mathcal{Y}|\mathcal{X}} \hat{m}_\Pi$, where both $\hat{U}_{\mathcal{Z}|\mathcal{Y}}$ and $\hat{U}_{\mathcal{Y}|\mathcal{X}}$ are Non-KSR operations in the setting given in Sect. 5.1. (ii) (Non-KSR + Mb-KSR) Let $\hat{m}_{Q_Z} = \bar{U}_{\mathcal{Z}|\mathcal{Y}} \hat{U}_{\mathcal{Y}|\mathcal{X}} \hat{m}_\Pi$, where $\bar{U}_{\mathcal{Z}|\mathcal{Y}}$ is the Mb-KSR operation and $\hat{U}_{\mathcal{Y}|\mathcal{X}}$ is the Non-KSR operation, with $w = (G_X + n\epsilon I_n)^{-1} G_{X\tilde{X}} \gamma$. (iii) (Mb-KSR + Non-KSR) Let $\hat{m}_{Q_Z} = \hat{U}_{\mathcal{Z}|\mathcal{Y}} \bar{U}_{\mathcal{Y}|\mathcal{X}} \hat{m}_\Pi$, where $\hat{U}_{\mathcal{Z}|\mathcal{Y}}$ is the Non-KSR operation and $\bar{U}_{\mathcal{Y}|\mathcal{X}}$ is the Mb-KSR operation. In each case, the RKHS norm error $\|m_{Q_Z} - \hat{m}_{Q_Z}\|_{H_Z}^2$ has a closed-form expression. Proof The derivation is similar to Proposition A12 and thus is omitted.

B Elliptical Distributions
This appendix presents definitions of elliptical (elliptically contoured) distributions and checks the condition under which positive-definite kernels generated by elliptical densities are characteristic. We begin by reviewing spherical (spherically contoured) distributions.
Definition B1 (McNeil et al, 2005, Definition 3.18, p. 89) A random vector $X = (X_1, \dots, X_d)^\top$ on $\mathbb{R}^d$ has a spherical distribution if it is distributionally invariant under rotations, i.e., $OX \stackrel{d}{=} X$ holds for every orthogonal matrix $O \in \mathbb{R}^{d \times d}$ (a matrix such that $OO^\top = O^\top O = I_d$).

Theorem B2 (McNeil et al, 2005, Theorem 3.19, p. 89) A random vector $X$ on $\mathbb{R}^d$ is spherical if and only if there exists a scalar variable function $\phi: \mathbb{R}_{\geq 0} \to \mathbb{C}$ such that the characteristic function satisfies $E_X[e^{\sqrt{-1}\theta^\top X}] = \phi(\theta^\top \theta)$.

The scalar variable function $\phi$ fully describes a spherical random vector and is called the characteristic generator. If $X$ is spherical with $\phi$, we denote it by $X \sim SC(\phi)$. The normal distribution $N(\mu, \Sigma)$ with $\mu = 0$ and $\Sigma = I_d$ is spherical, and its characteristic generator is $\phi(r) = e^{-\frac{1}{2}r}$. If a spherical distribution has a density, it takes the form $g(x^\top x)$ for a scalar variable function $g: \mathbb{R}_{\geq 0} \to \mathbb{R}_{\geq 0}$, called the density generator.
An elliptical distribution is defined by an affine transformation of a spherical distribution.
Definition B3 (McNeil et al, 2005, Definition 3.26, p. 93) A random vector $X$ on $\mathbb{R}^d$ has an elliptical distribution if $X \stackrel{d}{=} \mu + AY$, where $Y \sim SC(\phi)$, $A \in \mathbb{R}^{d \times m}$ is a matrix, and $\mu \in \mathbb{R}^d$ is a vector.
The characteristic function of an elliptical distribution is $E_X[e^{\sqrt{-1}\theta^\top X}] = e^{\sqrt{-1}\theta^\top \mu} \phi(\theta^\top \Sigma \theta)$ with $\Sigma = AA^\top$. If $X$ is an elliptical random vector with $\mu$, $\Sigma$, and $\phi$, we denote it by $X \sim EC(\mu, \Sigma, \phi)$; note that this representation is not unique, since, for example, the normal distribution $N(\mu, \Sigma)$ with characteristic generator $\phi(r) = e^{-\frac{1}{2}r}$ has the multiple expressions $EC(\mu, c\Sigma, \phi(\cdot/c))$ for any $c > 0$. If $\Sigma$ is positive-definite and $g$ is the density generator of $Y$, then $X$ has the density
$$d_{EC}(x; \mu, \Sigma, \phi) = |\Sigma|^{-1/2}\, g\big((x - \mu)^\top \Sigma^{-1} (x - \mu)\big).$$
The following condition is required for an elliptical kernel to be characteristic; it guarantees that a kernel mean uniquely specifies its probability distribution when the elliptical kernel is used, e.g., for the Mb-KSR.

Proposition B4 Suppose that $d_{EC}(x; 0, \Sigma, \phi)$, $x \in \mathbb{R}^d$, is a bounded continuous elliptical density on $\mathbb{R}^d$. The function $k_{\Sigma,\phi}(x, y) = d_{EC}(x - y; 0, \Sigma, \phi)$, $x, y \in \mathbb{R}^d$, is a positive-definite kernel if and only if $\phi$ is a nonnegative function $\phi: \mathbb{R}_{\geq 0} \to \mathbb{R}_{\geq 0}$. The elliptical kernel $k_{\Sigma,\phi}(x, y)$ is characteristic if and only if $\phi$ is a positive function $\phi: \mathbb{R}_{\geq 0} \to \mathbb{R}_{>0}$.

Proof The symmetric function $k_{\Sigma,\phi}(x, y)$, $x, y \in \mathbb{R}^d$, is positive-definite if and only if the Fourier transform of $d_{EC}(x; 0, \Sigma, \phi)$, $x \in \mathbb{R}^d$, is a finite nonnegative measure on $\mathbb{R}^d$ (Bochner's theorem). The elliptical kernel $k_{\Sigma,\phi}(x, y)$ is characteristic if and only if that Fourier transform has full support $\mathbb{R}^d$ (Sriperumbudur et al, 2010, Theorem 9).
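Proposition B4 can be checked numerically in one dimension: discretize a candidate density $\psi$, approximate its Fourier transform, and verify nonnegativity (Bochner's theorem). A small sketch using the Laplace density, whose transform $1/(1 + \omega^2)$ is strictly positive, so the Laplace kernel is characteristic:

```python
import numpy as np

# Grid with an odd number of points so that x = 0 lies exactly on the grid.
x = np.linspace(-50, 50, 2**14 + 1)
dx = x[1] - x[0]
psi = 0.5 * np.exp(-np.abs(x))          # Laplace(0, 1) density as psi(x - y)

# Discrete approximation of the Fourier transform of psi.
ft = np.fft.fftshift(np.real(np.fft.fft(np.fft.ifftshift(psi)))) * dx
print(ft.min() > 0)                     # strictly positive: characteristic
```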

Fig. 1 Kernel mean inference for a three-variable chain.


Fig. 5 Left: three similar trajectories (colored) of a mobile robot; each state (x, y, θ) (arrows represent θ) is associated with an image. Right: RMSE vs. training sample size; error bars indicate standard deviations.
