1 Introduction

The ability to reason about past, current, and future states in continuous, partially observable stochastic processes is a fundamental stepping stone towards fully autonomous and intelligent systems. Such models are required in many applications, for example state estimation in case of incomplete sensory data, smoothing noisy data from mediocre sensors, or predicting future states from past and current observations.

Traditional state estimation techniques usually require analytical models of the underlying system, are often limited to a set of models with a special structure, and require knowledge about the moments of the stochastic processes. When assuming linear Gaussian models with known mean and covariance, for instance, the Kalman filter (Kalman 1960) yields the optimal solution. However, the required linear Gaussian models with known statistics impose a strong limitation on the applicability of this method. For more complex processes, approximate solutions have to be used instead. Examples are the extended Kalman filter (McElhoe 1966; Smith et al. 1962) or the unscented Kalman filter (Julier and Uhlmann 1997; Wan and Van Der Merwe 2000). These solutions inherit the Gaussian representation of the belief state, to which they apply the non-linear system dynamics. However, the Gaussian distribution with its unimodal nature is a strong assumption about the belief state, which leads to poor results for systems that require a more complex distribution over possible states. Moreover, both the Kalman filter and its extensions to non-linear systems require that the dynamics of the system are given as analytical models. Yet, these analytical models are often hard to obtain or make simplifying assumptions about the system.

The recently introduced framework for nonparametric inference (Song et al. 2013; Fukumizu et al. 2013) alleviates the problems of traditional state estimation methods for nonlinear systems. The basic idea of these methods is to embed the probability distributions into reproducing kernel Hilbert spaces (RKHS). These embeddings allow the representation of arbitrary probability distributions using empirical estimators. Inference on the embedded distribution can then be performed efficiently and entirely in the RKHS using the kernelized versions of the sum rule, the chain rule, and the Bayes’ rule. Additionally, Song et al. (2013) use the kernel sum rule and the kernel Bayes’ rule to construct the kernel Bayes’ filter (KBF). The KBF learns the transition and observation models from observed samples and can be applied to nonlinear systems with high-dimensional observations. However, the computational complexity of the KBR update does not scale well with the number of samples such that hyper-parameter optimization becomes prohibitively expensive. Moreover, the KBR requires mathematical tricks that may cause numerical instabilities and also render the objective that is optimized by the KBR unclear.

In this paper, we present two approaches to overcome the limitations named above. First, we introduce the subspace conditional embedding operator. In contrast to the conditional embedding operator (Song et al. 2009), this operator allows its empirical estimator to be learned from a much larger data set while maintaining computational efficiency. We further apply the subspace conditional embedding operator to the kernel sum rule, the kernel chain rule, and the kernel Bayes' rule to derive their subspace versions. We have presented these results at the large-scale kernel learning workshop at ICML 2015 (Gebhardt et al. 2015).

Furthermore, we present the kernel Kalman rule (KKR) as an approximate alternative to the kernel Bayes' rule. Our derivations closely follow the derivations of the innovation update used in the Kalman filter and are based on a recursive least squares minimization objective in a reproducing kernel Hilbert space (RKHS). The KKR does not perform an exact Bayesian update as it uses a regularization term in the least squares objective and assumes constant noise on the conditioning variable. While the update equations are formulated in a potentially infinite dimensional RKHS, we derive, through application of the kernel trick and by virtue of the representer theorem, an algorithm that uses only operations on finite kernel matrices and vectors. We employ the kernel Kalman rule together with the kernel sum rule for filtering, which results in the kernel Kalman filter (KKF). In contrast to filtering techniques that rely on the KBR, the KKF allows us to precompute expensive matrix inversions, which significantly reduces the computational complexity and also makes hyper-parameter optimization feasible. This work has been presented at AAAI 2017 (Gebhardt et al. 2017).

In addition to the KKF, we introduce the kernel forward backward smoother (KFBS), which computes the embedding of the belief state given all available observations from the past and the future. The kernel forward backward smoother combines the belief state embeddings of a forward pass and a backward pass into smoothed embeddings using Hilbert space operations. Both the forward and the backward pass are realized by a KKF, where the backward KKF operates backwards in time starting at the last observation. To scale gracefully with larger data sets, we rederive the KKR, the KKF, and the KFBS with the subspace conditional operator (Gebhardt et al. 2015).

We compare our approach to different versions of the KBR and demonstrate its improved estimation accuracy and computational efficiency. Furthermore, we evaluate the KKR on a simulated 4-link pendulum task, on a human motion capture data set (Wojtusch and von Stryk 2015) and on data from a table-tennis setup (Gomez-Gonzalez et al. 2016).

1.1 Related work

To the best of our knowledge, the kernel Bayes' rule exists in three different versions. It was first introduced in its original version by Fukumizu et al. (2013). Here, the KBR is derived, similar to the conditional operator, using prior-modified covariance operators. These prior-modified covariance operators are approximated by weighting the feature mappings with the weights of the embedding of the prior distribution. Since these weights are potentially negative, the covariance operator might become indefinite, thus rendering its inversion impossible. To overcome this drawback, the authors have to apply a form of Tikhonov regularization that decreases accuracy and increases the computational costs. A second version of the KBR was introduced by Song et al. (2013), in which they use a different approach to approximate the prior-modified covariance operator. In the experiments conducted for this paper, this second version often leads to more stable algorithms than the first version. Boots et al. (2013) introduced a third version of the KBR in which they apply only the simple form of Tikhonov regularization. However, this rule requires the inversion of a matrix that is often indefinite, and therefore high regularization constants are required, which again degrades the performance. In our experiments, we refer to these different versions as KBR(b) for the first, KBR(a) for the second (order adapted from the literature), and KBR(c) for the third version. Song et al. (2013) propose in their framework for nonparametric inference to combine the KBR with the kernel sum rule to obtain the kernel Bayes filter (KBF). The kernel Kalman filter presented in this work is closely related to this, as we simply replace the KBR with the KKR. We compare to the KBF in our experiments. Nishiyama et al. (2016) recently proposed the nonparametric kernel Bayes smoother. This approach builds on top of the kernel Bayes filter, which is used to compute the estimates of a normal forward pass. The smoothing update is then obtained by propagating the embeddings backwards in time without performing a second filtering pass.

For filtering tasks with known linear system equations and Gaussian noise, the Kalman filter (KF) yields the solution that minimizes the squared error of the estimate to the true state. Two widely known and applied approaches to extend the Kalman filter to non-linear systems are the extended Kalman filter (EKF) (McElhoe 1966; Smith et al. 1962) and the unscented Kalman filter (UKF) (Wan and Van Der Merwe 2000; Julier and Uhlmann 1997). Both the EKF and the UKF assume that the non-linear system dynamics are known and use them to update the prediction mean. Yet, updating the prediction covariance is not straightforward. In the EKF the system dynamics are linearized at the current estimate of the state, and in the UKF the covariance is updated by applying the system dynamics to a set of sample points (sigma points). While these approximations make the computations tractable, they can significantly reduce the quality of the state estimation, in particular for high-dimensional systems.

Hsu et al. (2012) recently proposed an algorithm for learning Hidden Markov Models (HMMs) by exploiting the spectral properties of observable measures to derive an observable representation of the HMM (Jaeger 2000). An RKHS embedded version thereof was presented by Song et al. (2010). While this method is applicable for continuous state spaces, it still assumes a finite number of discrete hidden states.

Other algorithms closely related to our approach are the kernelized version of the Kalman filter and Kalman smoother by Ralaivola and d'Alche Buc (2005) and the kernel Kalman filter based on the conditional embedding operator (KKF-CEO) by Zhu et al. (2014). The former approach formulates the Kalman filter in a subspace of the infinite feature space that is defined by the kernels. Hence, this approach does not fully leverage the kernel idea of using an infinite feature space. In contrast, the KKF-CEO approach also embeds the belief state in an RKHS. However, it requires that the observation is a noisy version of the full state of the system, and thus, it cannot handle partial observations. Moreover, it also deviates from the standard derivation of the Kalman filter, which, as our experiments show, decreases the estimation accuracy. The full observability assumption is needed in order to implement a simplified version of the innovation update of the Kalman filter in the RKHS. The KKF does not suffer from this restriction. It also provides update equations that are much closer to the original Kalman filter and outperforms the KKF-CEO algorithm as shown in our experiments. Another approach to state estimation is presented in Kawahara et al. (2007), where the authors propose to estimate low-dimensional state vectors based on kernel canonical correlation analysis and then regress a linear transition model on the estimated state vectors and the nonlinear features of the input.

Learning predictors in the space of predictive state representations to perform filtering has been proposed in Sun et al. (2016b) and later extended to smoothing in Sun et al. (2016a). They introduce predictive state inference machines (PSIM) which are (nonlinear) regressors on predictive states learned from data to perform filtering. With the smoothing machine (SMACH) they extend this concept for smoothing.

1.2 Structure of the paper

The remainder of the paper is structured as follows: in Sect. 2, we discuss the framework for non-parametric inference (Song et al. 2013) and the Kalman filter as foundations for the work we present in this paper; in Sect. 3 we introduce the subspace conditional embedding operator and show its application to the framework for non-parametric inference; in Sect. 4 we present the kernel Kalman rule and the subspace kernel Kalman rule; in Sect. 5 we introduce the (subspace) kernel Kalman filter (Sect. 5.1) and the (subspace) kernel forward backward smoother (Sect. 5.4) before we conclude the paper in Sect. 6. Experimental evaluations of all proposed methods are shown and discussed directly in the respective sections.

2 Preliminaries

Our work is based on the recent formulations of embedding distributions into reproducing kernel Hilbert spaces (Smola et al. 2007; Song et al. 2013). These embeddings allow representing arbitrary probability distributions non-parametrically by a potentially infinite dimensional feature vector. Through the application of derived operators (Song et al. 2009; Fukumizu et al. 2013), it is furthermore possible to apply inference rules entirely in the Hilbert space. In the first part of this section, we give the reader an introduction to these techniques and define the notation we will use throughout this article.

One of the main contributions of our paper is a novel method for performing approximate Bayesian updates on a distribution embedded into an RKHS. The derivations of this update rule are based on a least-squares objective and inspired by the derivations of the Kalman filter update, thus we name this method kernel Kalman rule. In the second part of this section, we will recapitulate the classical Kalman filter equations and review the derivations of the innovation update based on the least-squares objective.

2.1 Nonparametric inference with Hilbert space embeddings of distributions

Intuitively, a Hilbert space is an extension of the well known two- or three-dimensional Euclidean vector space to arbitrarily many dimensions, specifically including infinite dimensional vector spaces. Such infinite dimensional Hilbert spaces include spaces whose elements are functions, i.e., infinite dimensional vectors that contain for each element of the domain the corresponding function value in the image. In addition, a Hilbert space has an inner product that allows measuring distances and angles between its elements. For a reproducing kernel Hilbert space \(\mathcal {H}_{k}\), this inner product \(\langle \cdot , \cdot \rangle \) is implicitly defined by a reproducing kernel \(k(\varvec{x}, \varvec{x}') = \langle \varphi (\varvec{x}),\varphi (\varvec{x}')\rangle \), where \(\varphi (\varvec{x})\) is a feature mapping into a possibly infinite dimensional space, intrinsic to the kernel function. For example, the Gaussian kernel computes the inner product of the feature mappings of its inputs, where the feature mappings themselves cannot be written down explicitly as they map into an infinite dimensional space. Due to the reproducing property of the kernel, all elements f of the RKHS can be reproduced by k in the sense that the outcome f(x) of the function for a specific value x can be obtained by an evaluation of the kernel function (Aronszajn 1950), i.e., \(f(\varvec{x}) = \langle f , \varphi (\varvec{x}) \rangle \) for any \(f \in \mathcal {H}_{k}\).

In a practical setting, we want to embed probability distributions in an RKHS spanned by samples \({\mathcal {D}}_X = \left\{ \varvec{x}_1, \ldots , \varvec{x}_n\right\} \). Based on the representer theorem (Schölkopf et al. 2001) and the reproducing property, the elements f of an RKHS \(\mathcal {H}_{k}\) can then be written as

$$\begin{aligned} f(\cdot ) = \sum \limits _{i=1}^n \alpha _{i} k(\varvec{x}_{i},\cdot ) = \sum \limits _{i=1}^n \alpha _{i} \langle \varphi (\varvec{x}_{i}),\varphi (\cdot )\rangle = \varvec{\alpha }^\intercal \varvec{\varvec{\Upsilon }_{x}^{\intercal }}\varphi (\cdot ), \end{aligned}$$
(1)

with the weights \(\alpha _{i} \in {\mathbb {R}}\) and where we denote the feature matrix of samples \(\varvec{x}_i\) by \(\varvec{\varvec{\Upsilon }_{x}^{\phantom {\intercal }}}= [\varphi (\varvec{x}_{1}),\ldots ,\varphi (\varvec{x}_{n})]\). In the following paragraphs we will show how probability distributions can be represented as an embedding in such a reproducing kernel Hilbert space and how the operators for performing inference in the RKHS can be derived.

2.1.1 Embeddings of marginal and joint distributions

The embedding of a marginal density P(X) over the random variable X is defined as the expected feature mapping \(\mu _{X} := {\mathbb {E}}_{X} \left[ \varphi (X)\right] \), also called the mean map (Smola et al. 2007). Using a finite set of samples \(\left\{ \varvec{x}_1, \ldots , \varvec{x}_n\right\} \) from P(X), the mean map can be estimated as

$$\begin{aligned} {\hat{\mu }}_{X} = \frac{1}{n} \sum _{i=1}^{n} \varphi (\varvec{x}_{i}) = \frac{1}{n} \varvec{\varvec{\Upsilon }_{x}^{\intercal }}\varvec{1}_{n}, \end{aligned}$$
(2)

where \(\varvec{1}_{n} \in {\mathbb {R}}^{n}\) is an n dimensional vector of ones. Because of the reproducing property of the kernel function, computing the expectation of a function which is an element of the same RKHS resolves to simple matrix operations. On the other hand, the probability of a single outcome or higher order statistics of the distribution are not straightforward to obtain.
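
As an illustration of how such expectations reduce to finite matrix operations, the following minimal NumPy sketch (our own example with a Gaussian kernel and arbitrary sample data; all names are placeholders) estimates the mean map weights of Eq. 2 and evaluates \({\mathbb {E}}_{X}[f(X)] = \langle f, {\hat{\mu }}_{X} \rangle \) for an RKHS function \(f = \sum _i \alpha _i k(\varvec{x}_i, \cdot )\).

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    # pairwise Gaussian kernel matrix between the rows of A and B
    sq_dist = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * sq_dist / bandwidth**2)

# samples from P(X); the empirical mean map has uniform weights 1/n (Eq. 2)
X = np.random.randn(200, 2)
n = X.shape[0]
K = gaussian_kernel(X, X)
mean_map_weights = np.full(n, 1.0 / n)

# an RKHS function f = sum_i alpha_i k(x_i, .)
alpha = np.random.randn(n)
# E_X[f(X)] = <f, mu_X> reduces to a finite matrix-vector product
expectation = alpha @ K @ mean_map_weights
```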

Alternatively, a distribution can be embedded in a tensor product RKHS \(\mathcal {H}_{k} \times \mathcal {H}_{k}\) as the expected tensor product of the feature mappings (Smola et al. 2007)

$$\begin{aligned} {\mathcal {C}}_{XX} := {\mathbb {E}}_{XX} \left[ \varphi (X) \otimes \varphi (X)\right] - \mu _{X} \otimes \mu _{X}, \end{aligned}$$
(3)

where we use \(\otimes \) to denote the tensor product (or outer product) of two vectors. This embedding is also called the centered covariance operator. The finite sample estimator is given by

$$\begin{aligned} \hat{{\mathcal {C}}}_{XX} = \frac{1}{m} \sum _{i=1}^{m} \varphi (\varvec{x}_{i}) \otimes \varphi (\varvec{x}_{i}) - {\hat{\mu }}_{X} \otimes {\hat{\mu }}_{X}. \end{aligned}$$
(4)

Similarly, we can define the uncentered cross-covariance operator for a joint distribution p(XY) of two variables X and Y as \(\hat{{\mathcal {C}}}_{XY} = \frac{1}{m} \sum _{i=1}^{m} \varphi (\varvec{x}_{i}) \otimes \phi (\varvec{y}_{i})\). Here, we have used a data set of tuples \({\mathcal {D}}_{XY} = \left\{ (\varvec{x}_1, \varvec{y}_1), \ldots , (\varvec{x}_n, \varvec{y}_n)\right\} \) sampled from p(XY) and a second RKHS \({\mathcal {H}}_{g}\) with kernel function \(g(\varvec{y}, \varvec{y'}) =: \langle \phi (\varvec{y}),\phi (\varvec{y'})\rangle \).

2.1.2 The conditional embedding operator

The embedding of a conditional distribution P(Y|X) is, unlike the mean map, not a single element of the RKHS, but rather a family of embeddings that yields a mean embedding for each realization of the conditioning variable X. To obtain the conditional distribution for a specific value \(X = \varvec{x}_{*}\), Song et al. (2009) defined the conditional embedding operator \({\mathcal {C}}_{Y|X}\) which, if applied to the feature mapping of \(\varvec{x}\), returns the embedding of \(P(Y|X = \varvec{x}_*)\)

$$\begin{aligned} \mu _{Y| \varvec{x}} := {\mathbb {E}}_{Y| \varvec{x}} \left[ \phi (Y)\right] = {\mathcal {C}}_{Y|X} \varphi (\varvec{x}). \end{aligned}$$
(5)

Using the data set \({\mathcal {D}}_{XY}\) from the joint distribution, an estimator of the conditional embedding operator can be derived from a least-squares objective (Grünewälder et al. 2012) as

$$\begin{aligned} \hat{{\mathcal {C}}}_{Y|X} = \varvec{\Phi }^{\phantom {\intercal }}(\varvec{K}_{xx}+ \lambda \varvec{I}_{n})^{-1} \varvec{\Upsilon }_{x}^{\intercal }, \end{aligned}$$
(6)

with the feature matrices \(\varvec{\Phi }^{\phantom {\intercal }}:= [\phi (\varvec{y}_{1}), \ldots ,\phi (\varvec{y}_{n})]\) and \(\varvec{\Upsilon }_{x}^{\phantom {\intercal }}:= [\varphi (\varvec{x}_{1}),\ldots ,\varphi (\varvec{x}_{n})]\), the Gram matrix \(\varvec{K}_{xx}= \varvec{\Upsilon }_{x}^{\intercal }\varvec{\Upsilon }_{x}^{\phantom {\intercal }}\in {\mathbb {R}}^{n \times n}\), the regularization parameter \(\lambda \), and the identity matrix \(\varvec{I}_{n} \in {\mathbb {R}}^{n \times n}\). With the feature mapping of the realization \(\varvec{x}_{*}\) this results in

$$\begin{aligned} {\hat{\mu }}_{Y| \varvec{x}_*} = \hat{ {\mathcal {C}}}_{Y|X} \varphi (\varvec{x}_*) = \varvec{\Phi }^{\phantom {\intercal }}(\varvec{K}_{xx}+ \lambda \varvec{I}_{n})^{-1} \varvec{\Upsilon }_{x}^{\intercal }\varphi (\varvec{x}_*) = \varvec{\Phi }^{\phantom {\intercal }}(\varvec{K}_{xx}+ \lambda \varvec{I}_{n})^{-1} \varvec{k}_{\varvec{x}_*}, \end{aligned}$$
(7)

where \(\varvec{k}_{\varvec{x}_*} = [k(\varvec{x}_1, \varvec{x}_*), \ldots , k(\varvec{x}_n, \varvec{x}_*)]^{\intercal }\) is the kernel vector of the samples \(\varvec{x}_i\) and the realization \(\varvec{x}_*\). As the kernel matrices in the inverse and the kernel vector of the realization are finite, the embedding of the conditional distribution can be represented as a weighted sum of feature mappings

$$\begin{aligned} {\hat{\mu }}_{Y| \varvec{x}_*} = \varvec{\Phi }^{\phantom {\intercal }}\varvec{\alpha }= \sum \limits _{i=1}^n \alpha _i \phi (\varvec{y}_i), \end{aligned}$$
(8)

with the finite weight vector \(\varvec{\alpha }\in {\mathbb {R}}^{n}\). Based on the two definitions for Hilbert space embeddings of probability distributions and the conditional embedding operator discussed above, all the rules of the framework for non-parametric inference (Song et al. 2013) can be derived.
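
As a minimal sketch (our own illustration; the function name and the default regularization constant are arbitrary choices), the weight vector \(\varvec{\alpha }\) of Eq. 8 can be computed from the Gram matrix and the kernel vector of the realization as follows.

```python
import numpy as np

def conditional_embedding_weights(K_xx, k_xstar, reg=1e-3):
    """Weights alpha such that mu_{Y|x*} = Phi @ alpha (Eqs. 7 and 8)."""
    n = K_xx.shape[0]
    # solve (K_xx + lambda I) alpha = k_x* instead of forming the inverse explicitly
    return np.linalg.solve(K_xx + reg * np.eye(n), k_xstar)
```

Solving the regularized linear system instead of explicitly forming the inverse is numerically preferable; the cubic cost in the number of samples n remains.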

2.1.3 The kernel sum rule

The embedding of \(Q(Y) = \sum _{X} P(Y|X)\pi (X)\) can be obtained from the kernel sum rule (Song et al. 2013). To that end, the conditional operator is applied to the embedding \({\hat{\mu }}_{X}^{\pi } = \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{\alpha }_{\pi }\) of the prior distribution \(\pi (X)\),

$$\begin{aligned} {\hat{\mu }}_{Y}^{\pi }&= \hat{{\mathcal {C}}}_{Y|X} {\hat{\mu }}_{X}^{\pi } = \varvec{\Phi }^{\phantom {\intercal }}(\varvec{K}_{xx}+ \lambda \varvec{I}_{n})^{-1} \varvec{K}_{xx}\varvec{\alpha }_{\pi }. \end{aligned}$$
(9)

Again, the result can be represented as a weighted sum over feature mappings. In order to obtain the distribution Q(Y) as a covariance operator instead of a mean map, Song et al. (2013) also proposed the kernel sum rule for tensor product features which yields the prior modified covariance operator \({\mathcal {C}}_{YY}^{\pi }\) as

$$\begin{aligned} {\mathcal {C}}_{YY}^{\pi }&= {\mathcal {C}}_{(YY)|X}\mu _{X}^{\pi } \end{aligned}$$
(10)
$$\begin{aligned} \hat{{\mathcal {C}}}_{YY}^{\pi }&= \varvec{\Phi }^{\phantom {\intercal }}\mathrm {diag}((\varvec{K}_{xx}+ \lambda \varvec{I}_{n})^{-1} \varvec{K}_{xx}\varvec{\alpha }_{\pi }) \varvec{\Phi }^{\intercal }, \end{aligned}$$
(11)

where \({\mathcal {C}}_{(YY)|X}\) is the conditional operator for tensor product features.
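
In this finite weight representation, both forms of the kernel sum rule reduce to a single regularized linear solve. The sketch below (our own illustration of Eqs. 9 and 11, not code from the paper) maps the prior weights \(\varvec{\alpha }_{\pi }\) to the weights of \({\hat{\mu }}_{Y}^{\pi }\) and to the diagonal weight matrix of \(\hat{{\mathcal {C}}}_{YY}^{\pi }\).

```python
import numpy as np

def kernel_sum_rule(K_xx, alpha_pi, reg=1e-3):
    """Map prior weights alpha_pi to the weights of mu_Y^pi (Eq. 9) and to the
    diagonal weight matrix of the prior-modified covariance operator (Eq. 11)."""
    n = K_xx.shape[0]
    beta = np.linalg.solve(K_xx + reg * np.eye(n), K_xx @ alpha_pi)
    return beta, np.diag(beta)
```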

2.1.4 The kernel chain rule

The kernel chain rule (Song et al. 2013) yields an embedding of the joint distribution \(Q(X,Y) = P(Y|X)\pi (X)\) as a prior modified covariance operator. There are two versions of the kernel chain rule. Both apply the conditional embedding operator of P(Y|X) to an embedding of the prior distribution \(\pi (X)\). In the first version the conditional operator is applied to a covariance embedding of the prior distribution. This covariance operator is not estimated directly from samples but approximated from the weight vector \(\varvec{\alpha }_{\pi }\) of the embedding of the prior distributions as \(\hat{{\mathcal {C}}}_{XX}^{\pi } = \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\text {diag}(\varvec{\alpha }_{\pi }) \varvec{\Upsilon }_{x}^{\intercal }\). This yields Version (a) of the kernel chain rule as

$$\begin{aligned} \hat{{\mathcal {C}}}^{\pi }_{YX}&= \hat{{\mathcal {C}}}_{Y|X} \hat{{\mathcal {C}}}^{\pi }_{XX} = \varvec{\Phi }^{\phantom {\intercal }}(\varvec{K}_{xx}+ \lambda \varvec{I}_{n})^{-1} \varvec{K}_{xx}\text {diag}(\varvec{\alpha }_\pi ) \varvec{\Upsilon }_{x}^{\intercal }. \end{aligned}$$
(12)

Version (b) of the kernel chain rule first computes the mean map \(\mu _{Y}^{\pi }\) conditioned on the prior distribution \(\pi (X)\) by applying the conditional embedding operator to the mean map \(\mu _{X}^{\pi }\). Afterwards, the prior-modified covariance operator of the joint distribution is constructed from the resulting weight vector which results in

$$\begin{aligned} \hat{{\mathcal {C}}}_{YX}^{\pi }&= \varvec{\Phi }^{\phantom {\intercal }}\text {diag}((\varvec{K}_{xx}+ \lambda \varvec{I}_{n})^{-1} \varvec{K}_{xx}\varvec{\alpha }_\pi ) \varvec{\Upsilon }_{x}^{\intercal }. \end{aligned}$$
(13)

Both versions of the kernel chain rule have been used to derive different versions of the kernel Bayes’ rule as we will depict below.

2.1.5 The kernel Bayes’ rule

Given the embedding of a prior distribution \(\pi (X)\) and the feature mapping of an observation \(\phi (\varvec{y}_*)\), the kernel Bayes’ rule (KBR) infers the mean embedding of the posterior distribution \(Q_\pi (X|Y = \varvec{y}_*)\). The idea is to construct a prior-modified conditional embedding operator that yields the mean map of the posterior if applied to the feature mapping of the observation (Fukumizu et al. 2013)

$$\begin{aligned} \mu _{X|\varvec{y}}^\pi = \mathcal {C}_{X|Y}^{\pi } \phi (\varvec{y}_*). \end{aligned}$$
(14)

This prior-modified conditional operator is constructed from two prior-modified covariance operators \({\mathcal {C}}_{YY}^{\pi }\) and \({\mathcal {C}}_{YX}^{\pi }\) obtained from the kernel sum and the kernel chain rule, respectively, using the relation

$$\begin{aligned} \mathcal {C}_{X|Y}^{\pi } = {\mathcal {C}}_{XY}^{\pi } \left( {\mathcal {C}}_{YY}^{\pi } \right) ^{-1}. \end{aligned}$$
(15)

In the first version, which we denote by KBR(b) following the notation of Song et al. (2013), Fukumizu et al. (2013) derived the kernel Bayes’ rule using the tensor product conditional operator in the kernel chain rule (c.f. Eq. 13) and arrived at

$$\begin{aligned} {\hat{\mu }}_{X|y}^{\pi }&= \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{D} \varvec{G}_{yy}\left( (\varvec{D} \varvec{G}_{yy})^{2} + \kappa \varvec{I}_{n}\right) ^{-1} \varvec{D} \varvec{g}_{\varvec{y}_*}, \end{aligned}$$
(16)

with the diagonal matrix \(\varvec{D} := \text {diag} ((\varvec{K}_{xx} + \lambda \varvec{I}_{n})^{-1} \varvec{K}_{xx}\varvec{\alpha }_\pi )\), the Gram matrix \(\varvec{G}_{yy}= \varvec{\Phi }^{\intercal }\varvec{\Phi }^{\phantom {\intercal }}\), the kernel vector \(\varvec{g}_{\varvec{y}_*} = [g(\varvec{y}_1, \varvec{y}_*), \ldots , g(\varvec{y}_n, \varvec{y}_*)]^{\intercal }\), and the regularization parameter \(\kappa \). Song et al. (2013) derived the KBR using the first formulation of the kernel chain rule shown in Eq. 12, which results in

$$\begin{aligned} {\hat{\mu }}_{X|y}^{\pi }&= \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{\Lambda }^{\intercal } \left( (\varvec{D} \varvec{G}_{yy})^{2} + \kappa \varvec{I}_{n}\right) ^{-1} \varvec{G}_{yy}\varvec{D} \varvec{g}_{\varvec{y}_*}, \end{aligned}$$
(17)

with \(\varvec{\Lambda }:= (\varvec{K}_{xx}+ \lambda \varvec{I}_{n})^{-1} \varvec{K}_{xx}\text {diag}(\varvec{\alpha }_\pi )\). This second version of the kernel Bayes’ rule is denoted by KBR(a). As the matrix \(\varvec{D} \varvec{G}_{yy}\) is typically not invertible, both of these versions of the KBR use a form of the Tikhonov regularization in which the matrix in the inverse is squared. Boots et al. (2013) use a third form of the KBR which is derived analogously to the first version but does not use the squared form of Tikhonov regularization, i.e.,

$$\begin{aligned} {\hat{\mu }}_{X|y}^{\pi }&= \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\left( \varvec{D} \varvec{G}_{yy}+ \kappa \varvec{I}_{n}\right) ^{-1} \varvec{D} \varvec{g}_{\varvec{y}_*}. \end{aligned}$$
(18)

Since the product \(\varvec{D} \varvec{G}_{yy}\) is often not positive definite, a strong regularization parameter is required to make the matrix invertible. We denote this third version of the kernel Bayes’ rule consequently by KBR(c).
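
The three variants differ only in how the finite weight vector over \(\varvec{\Upsilon }_{x}^{\phantom {\intercal }}\) is computed from \(\varvec{D}\), \(\varvec{\Lambda }\), \(\varvec{G}_{yy}\), and \(\varvec{g}_{\varvec{y}_*}\). The sketch below (our own illustration of Eqs. 16-18; the regularization constants are placeholders) makes this explicit.

```python
import numpy as np

def kbr_weights(K_xx, G_yy, alpha_pi, g_ystar, lam=1e-3, kappa=1e-3):
    """Posterior weights over Upsilon_x for the KBR variants of Eqs. 16-18."""
    n = K_xx.shape[0]
    I = np.eye(n)
    D = np.diag(np.linalg.solve(K_xx + lam * I, K_xx @ alpha_pi))
    DG = D @ G_yy
    # KBR(b), Fukumizu et al. (2013): squared Tikhonov regularization, Eq. 16
    w_b = DG @ np.linalg.solve(DG @ DG + kappa * I, D @ g_ystar)
    # KBR(a), Song et al. (2013): uses Lambda from the kernel chain rule, Eq. 17
    Lam = np.linalg.solve(K_xx + lam * I, K_xx) @ np.diag(alpha_pi)
    w_a = Lam.T @ np.linalg.solve(DG @ DG + kappa * I, G_yy @ D @ g_ystar)
    # KBR(c), Boots et al. (2013): simple Tikhonov regularization, Eq. 18
    w_c = np.linalg.solve(DG + kappa * I, D @ g_ystar)
    return w_a, w_b, w_c
```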

2.2 The Kalman filter

The Kalman filter is a well known technique for state estimation, prediction, and smoothing in environments with linear system dynamics that are subject to zero-mean Gaussian noise with known covariances (Kalman 1960). The system equations can be formulated as

$$\begin{aligned} \varvec{x}_{t+1} = \varvec{F} \varvec{x}_t + \varvec{v}_t,\quad \varvec{y}_{t} = \varvec{H} \varvec{x}_{t} + \varvec{w}_t, \end{aligned}$$
(19)

where \(\varvec{x}_{t}\) is the latent state of the system at time t and \(\varvec{y}_{t}\) is the corresponding observation. The linear Gaussian model is defined by the system matrix \(\varvec{F}\), the observation matrix \(\varvec{H}\), and noise vectors \(\varvec{v}_{t}\) and \(\varvec{w}_{t}\) which are sampled from \({\mathcal {N}}(\varvec{0}, \varvec{P})\) and \({\mathcal {N}}(\varvec{0}, \varvec{R})\), respectively.

From the assumption of Gaussian transition noise and Gaussian observation noise, it follows that the belief state over the latent state \(\varvec{x}_{t}\) is also a Gaussian distribution with mean \(\varvec{\eta }_{\varvec{x},t}\) and covariance \(\varvec{\Sigma }_{\varvec{x},t}\). The Kalman filter iteratively applies two update procedures to the belief state, which we will refer to as the prediction update and the innovation update. During the prediction update, the Kalman filter propagates the belief state forward in time by applying the transition model, i.e.,

$$\begin{aligned} {\varvec{\eta }}^{-}_{\varvec{x},t+1} = \varvec{F} {\varvec{\eta }}^{+}_{\varvec{x},t}, \quad {\varvec{\Sigma }}^{-}_{\varvec{x},t+1} = \varvec{F} {\varvec{\Sigma }}^{+}_{\varvec{x}, t} \varvec{F}^{\intercal } + \varvec{P}. \end{aligned}$$
(20)

On new observations \(\varvec{y}_t\), the innovation update applies Bayes’ theorem to the a-priori belief state \(\{{\varvec{\eta }}^{-}_{\varvec{x},t},{\varvec{\Sigma }}^{-}_{\varvec{x},t}\}\) to obtain the a-posteriori mean and covariance as

$$\begin{aligned} {\varvec{\eta }}^{+}_{\varvec{x},t}&= {\varvec{\eta }}^{-}_{\varvec{x},t} + \varvec{Q}_{t} (\varvec{y}_{t} - \varvec{H} {\varvec{\eta }}^{-}_{\varvec{x},t}), \end{aligned}$$
(21)
$$\begin{aligned} {\varvec{\Sigma }}^{+}_{\varvec{x},t}&= {\varvec{\Sigma }}^{-}_{\varvec{x},t} - \varvec{Q}_{t} \varvec{H} {\varvec{\Sigma }}^{-}_{\varvec{x},t}, \end{aligned}$$
(22)

with the Kalman gain matrix

$$\begin{aligned} \varvec{Q}_t = {\varvec{\Sigma }}^{-}_{\varvec{x},t} \varvec{H}^{\intercal } (\varvec{H} {\Sigma }^{-}_{\varvec{x},t} \varvec{H}^{\intercal } + \varvec{R})^{-1}. \end{aligned}$$
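
The two update procedures can be written compactly as follows; this is a minimal NumPy sketch of Eqs. 20-22 (our own illustration, assuming the model matrices \(\varvec{F}\), \(\varvec{H}\), \(\varvec{P}\), and \(\varvec{R}\) are given).

```python
import numpy as np

def kalman_predict(eta, Sigma, F, P):
    """Prediction update, Eq. 20."""
    return F @ eta, F @ Sigma @ F.T + P

def kalman_innovate(eta, Sigma, y, H, R):
    """Innovation update, Eqs. 21-22, with the Kalman gain Q_t."""
    Q = Sigma @ H.T @ np.linalg.inv(H @ Sigma @ H.T + R)
    return eta + Q @ (y - H @ eta), Sigma - Q @ H @ Sigma
```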

Another approach to deriving the Kalman filter equations follows from a least-squares objective between the a-priori and the a-posteriori state estimators given the observation (Simon 2006). This second approach does not make the explicit assumption that the belief state can be represented as a Gaussian random variable. Rather, this representation follows from the objective of minimizing the variance of the error between the a-priori and a-posteriori estimators. We will take this second approach as inspiration to derive the kernel Kalman rule in Sect. 4.

3 Efficient nonparametric inference in a subspace

A general drawback of kernel methods is that the complexities of the algorithms scale poorly with the number of samples in the kernel matrices. As the conditional embedding operator and the kernel inference rules require the inversion of a kernel matrix, the complexity scales cubically with the number of data points. To overcome this drawback, several approaches exist that aim to find a good trade-off between a compact representation and leveraging a large data set. Examples are the sparse Gaussian processes that use pseudo-inputs (Snelson and Ghahramani 2006; Csató and Opper 2002), or a sparse subset of the data which is selected by maximizing the posterior probability (Smola and Bartlett 2001). Other techniques are based on approximating the kernel matrices using the Nyström method (Williams and Seeger 2000) or random Fourier features (Rahimi and Recht 2007). We approach this problem by proposing the subspace conditional embedding operators (Gebhardt et al. 2015). The basic idea is to use only a subset of the available training data as representation for the embeddings, but the full data set to learn the conditional operators. In the following sections, we will recapitulate this approach and show how it can be applied to the framework for nonparametric inference.

Given the feature matrices \(\varvec{\Phi }^{\phantom {\intercal }}:= [\phi (\varvec{y}_{1}), \ldots ,\phi (\varvec{y}_{n})]\) and \(\varvec{\Upsilon }_{x}^{\phantom {\intercal }}:= [\varphi (\varvec{x}_{1}),\ldots ,\varphi (\varvec{x}_{n})]\), we can define the respective subsets \(\varvec{\Psi }^{\phantom {\intercal }}\subset \varvec{\Phi }^{\phantom {\intercal }}\) and \(\varvec{\Gamma }^{\phantom {\intercal }}\subset \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\), where \(|\varvec{\Psi }^{\phantom {\intercal }}| = |\varvec{\Gamma }^{\phantom {\intercal }}| = m \ll n\). We assume that the subsets are representative for the embedded distributions. Similar to the conditional operator discussed in Sect. 2.1.2, we define the subspace conditional operator \(\mathcal {C}_{Y|X}^{S}\) as the mapping from an embedding \(\varphi (\varvec{x}) \in {\mathcal {H}}_{k}\) to the mean embedding \(\mu _{Y|\varvec{x}} \in {\mathcal {H}}_{g}\) of the conditional distribution \(P(Y|\varvec{x})\) conditioned on the variate \(\varvec{x}\). To obtain this subspace conditional operator, we first introduce an auxiliary conditional operator \(\mathcal {C}_{Y|X}^{\mathrm {aux}}\) which maps from the subspace projection of the embedding \(\varvec{\Gamma }^{\intercal }\varphi (\varvec{x})\) to the mean map of the conditional distribution, i.e.,

$$\begin{aligned} {\hat{\mu }}_{Y|\varvec{x}} = \hat{\mathcal {C}}_{Y|X}^{\mathrm {aux}} \varvec{\Gamma }^{\intercal }\varphi (\varvec{x}). \end{aligned}$$
(23)

We can then derive this auxiliary conditional operator by minimizing the squared error on the full data set

$$\begin{aligned} \hat{\mathcal {C}}_{Y|X}^{\mathrm {aux}}&= {{\,\mathrm{\mathrm{arg\,min}}\,}}_{\mathcal {C}} \left\| \varvec{\Phi }^{\phantom {\intercal }}- \mathcal {C} \varvec{\Gamma }^{\intercal }\varvec{\Upsilon }_{x}^{\phantom {\intercal }}\right\| _{2} \end{aligned}$$
(24)
$$\begin{aligned}&= \varvec{\Phi }^{\phantom {\intercal }}\varvec{\Upsilon }_{x}^{\intercal }\varvec{\Gamma }^{\phantom {\intercal }}\left( \varvec{\Gamma }^{\intercal }\varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{\Upsilon }_{x}^{\intercal }\varvec{\Gamma }^{\phantom {\intercal }}+ \lambda \varvec{I}_m\right) ^{-1}, \end{aligned}$$
(25)

with the identity \(\varvec{I}_m \in {\mathbb {R}}^{m \times m}\). Substituting this result for the auxiliary conditional operator in Eq. 23 gives us the subspace conditional operator as

$$\begin{aligned} \hat{\mathcal {C}}_{Y|X}^{S}&= \hat{\mathcal {C}}_{Y|X}^{\mathrm {aux}} \varvec{\Gamma }^{\intercal }\nonumber \\&= \varvec{\Phi }^{\phantom {\intercal }}\varvec{K}_{x\bar{x}}\left( \varvec{K}_{x\bar{x}}^{\intercal }\varvec{K}_{x\bar{x}}+ \lambda \varvec{I}_m\right) ^{-1} \varvec{\Gamma }^{\intercal }, \end{aligned}$$
(26)

where \(\varvec{K}_{x\bar{x}}= \varvec{\Upsilon }_{x}^{\intercal }\varvec{\Gamma }^{\phantom {\intercal }}\in {\mathbb {R}}^{n \times m}\) is the kernel matrix of the sample feature set \(\varvec{\Upsilon }_{x}^{\phantom {\intercal }}\) and the subset \(\varvec{\Gamma }^{\phantom {\intercal }}\). Since we assume that \(m \ll n\), the inverse in the subspace conditional operator is in \({\mathbb {R}}^{m \times m}\) and, thus, of a much smaller size than the inverse in the standard conditional operator shown in Eq. 6. Additionally, we can use the feature matrix \(\varvec{\Gamma }^{\intercal }\) on the right hand side of Eq. 26 to always represent the mean embedding in the subspace spanned by the features \(\varvec{\Gamma }^{\phantom {\intercal }}\). This allows us to avoid representations and computations in the high-dimensional space spanned by the features of the full sample set. Before we rederive the non-parametric inference rules analogously to Song et al. (2013), but based on the subspace conditional embedding operator, we discuss the selection of the samples for spanning the subspace and the relation of the subspace conditional operator to other sparsification approaches in the next sections.
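
The following sketch (our own illustration of Eq. 26; names and the regularization constant are arbitrary) computes the weight vector over \(\varvec{\Phi }^{\phantom {\intercal }}\) that represents \({\hat{\mu }}_{Y|\varvec{x}_*}\), given the \(n \times m\) kernel matrix \(\varvec{K}_{x\bar{x}}\) and the m-dimensional kernel vector of \(\varvec{x}_*\) with the subset.

```python
import numpy as np

def subspace_conditional_weights(K_xbar, kbar_xstar, reg=1e-3):
    """Weights beta with mu_{Y|x*} = Phi @ beta, following Eq. 26.

    K_xbar     : n x m kernel matrix between the full sample set and the subset
    kbar_xstar : m-dim kernel vector Gamma^T phi(x*)
    """
    m = K_xbar.shape[1]
    A = K_xbar.T @ K_xbar + reg * np.eye(m)   # only an m x m system has to be solved
    return K_xbar @ np.linalg.solve(A, kbar_xstar)
```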

3.1 Selecting the sample set to span the subspace

To learn the subspace conditional embedding operator, we need to choose m points for the representation of the embedding from a data set of n data points, where \(m \ll n\). The selection of these data points is a crucial step as we want a subset that is descriptive enough to represent the belief state well. We propose two approaches to address this problem which aim at different characteristics of the subset.

The first approach simply samples uniformly without replacement from the full data set. The result is a subset that statistically resembles the full data set, i.e., regions that have a high density in the full data set will also have a high density in the subset and vice versa.

The goal of the second approach is to obtain a subset with an optimal coverage of the sample space. We select the first sample of the subset randomly from the full data set. Afterwards, we iteratively extend the subset according to the following criterion: we compute the maximum kernel activation of each sample in the full data set with the samples in the current subset, and then add the sample from the full data set with the minimal maximum activation to the subset. We call this second strategy the kernel activation heuristic.
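
A possible implementation of this heuristic is sketched below (our own code; kernel is assumed to be any function that returns the kernel matrix between two sample sets, e.g., a Gaussian kernel).

```python
import numpy as np

def kernel_activation_subset(X, m, kernel, seed=None):
    """Greedily pick m samples: repeatedly add the sample whose maximum
    kernel activation with the current subset is minimal."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(X.shape[0]))]        # first sample chosen at random
    max_act = kernel(X, X[selected]).max(axis=1)      # max activation with the subset
    while len(selected) < m:
        candidate = int(np.argmin(max_act))           # least covered sample
        selected.append(candidate)
        max_act = np.maximum(max_act, kernel(X, X[[candidate]])[:, 0])
    return np.array(selected)
```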

3.2 Relation to other sparsification approaches

Many other sparsification techniques for kernel methods exist. The two most important techniques among these are probably the Nyström method (Williams and Seeger 2000; Drineas and Mahoney 2005) and the random Fourier features (Rahimi and Recht 2007). Both methods are closely related to our approach.

Williams and Seeger (2000) approximate the Gram matrix \(\varvec{K}\) based on the eigendecomposition \(\varvec{K} = \varvec{U} \varvec{\Lambda }\varvec{U}^{\intercal }\), where \(\varvec{U}\) contains the eigenvectors and \(\varvec{\Lambda }\) is a diagonal matrix of the eigenvalues \(\lambda _{1}, \ldots , \lambda _{n}\). By taking only the first \(m < n\) eigenvectors as \(\varvec{U}^{(m)}\) and the first m eigenvalues as diagonal \(\varvec{\Lambda }^{(m)}\), \(\varvec{K}\) can be approximated as \(\varvec{K} \approx \varvec{U}^{(m)} \varvec{\Lambda }^{(m)} (\varvec{U}^{(m)})^{\intercal }\). However, since the eigendecomposition is computationally costly and more efficient methods only significantly decrease the running time for \(m \ll n\), Williams and Seeger (2000) propose to use instead the Nyström approximation of the eigenvectors, which can be computed in only \(O(m^2n)\) instead of the \(O(n^3)\) required for the true eigenvectors. The resulting approximation of the Gram matrix \(\varvec{K}\) has the form \(\varvec{\tilde{K}} = \varvec{K}_{n,m}\varvec{K}_{m,m}^{-1}\varvec{K}_{m,n}\). Using the Nyström method to approximate the conditional embedding operator would result in the following equations

$$\begin{aligned} \hat{\mathcal {C}}_{Y|X} {\hat{\mu }}_{X}&= \varvec{\Phi }^{\phantom {\intercal }}(\varvec{K}_{n,m}\varvec{K}_{m,m}^{-1}\varvec{K}_{m,n} + \lambda \varvec{I}_{n})^{-1} \varvec{K}_{n,m}\varvec{K}_{m,m}^{-1}\varvec{K}_{m,X} \end{aligned}$$
(27)
$$\begin{aligned}&= \varvec{\Phi }^{\phantom {\intercal }}\varvec{K}_{n,m}\varvec{K}_{m,m}^{-1} (\varvec{K}_{m,n}\varvec{K}_{n,m}\varvec{K}_{m,m}^{-1} + \lambda \varvec{I}_{m})^{-1} \varvec{K}_{m,X}, \end{aligned}$$
(28)

where \(\varvec{K}_{m,X}\) is the approximator of the kernel mean embedding \(\mu _{X}\) using the m samples of the Nyström approximation. Thus, if we assume \(\varvec{K}_{m,m} = \varvec{I}\), the subspace conditional operator is equivalent to the Nyström approximation. This assumption requires that the features of the selected data points are orthonormal, i.e., \(\varphi (\varvec{x}_i)^\intercal \varphi (\varvec{x}_j) = \delta _{ij}\). Note that the kernel activation heuristic presented in the previous section selects the data points by minimizing exactly this inner product with respect to the points that are already in the subset.
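
For comparison, the Nyström approximation of the full Gram matrix can be computed in a few lines; the sketch below is our own illustration and assumes that \(\varvec{K}_{m,m}\) is invertible (the small jitter term is our addition for numerical stability).

```python
import numpy as np

def nystroem_gram(K_nm, K_mm, jitter=1e-8):
    """Nystroem approximation K ~ K_nm K_mm^{-1} K_mn of the full Gram matrix."""
    m = K_mm.shape[0]
    return K_nm @ np.linalg.solve(K_mm + jitter * np.eye(m), K_nm.T)
```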

The idea of the random Fourier features (Rahimi and Recht 2007) is to compute the Fourier transform p of the kernel function k. Random samples are drawn from the distribution over frequencies p, which are then used to construct a feature function \(\varvec{z}(\varvec{x})\). Rather than using a projection of the high dimensional features as in our approach, the random features directly transform the samples into a finite dimensional feature space whose inner product approximates the kernel function, i.e., \(\varvec{z}(\varvec{x})^{\intercal }\varvec{z}(\varvec{y}) \approx k(\varvec{x}, \varvec{y})\). Let \(\varvec{Z} \in {\mathbb {R}}^{n \times m}\) be the feature matrix of the n data points with m random features. We could approximate the conditional embedding operator as

$$\begin{aligned} \hat{\mathcal {C}}_{Y|X} {\hat{\mu }}_{X}&= \varvec{\Phi }^{\phantom {\intercal }}(\varvec{Z} \varvec{Z}^{\intercal } + \lambda \varvec{I}_{n})^{-1} \varvec{Z} \varvec{Z}^{\intercal } \varvec{m}_X \end{aligned}$$
(29)
$$\begin{aligned}&= \varvec{\Phi }^{\phantom {\intercal }}\varvec{Z} (\varvec{Z}^{\intercal } \varvec{Z} + \lambda \varvec{I}_{m})^{-1} \varvec{Z}^{\intercal } \varvec{m}_X. \end{aligned}$$
(30)

Again, it is easy to observe the similarity to our approach if we replace \(\varvec{Z}\) by \(\varvec{K}_{x\bar{x}}\). Note that this representation, in contrast to our approach and to the Nyström method, does not allow deriving an operator in the reproducing kernel Hilbert space but directly uses finite vector/matrix representations.
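
A standard construction of such features for the Gaussian kernel is sketched below (our own example; the frequencies are drawn from the Gaussian spectral density, and \(\varvec{z}(\varvec{x})^{\intercal }\varvec{z}(\varvec{y})\) approximates \(k(\varvec{x}, \varvec{y})\) in expectation).

```python
import numpy as np

def random_fourier_features(X, num_features, bandwidth=1.0, seed=None):
    """Random Fourier features z(x) with z(x)^T z(y) ~ k(x, y) for a Gaussian kernel."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / bandwidth, size=(d, num_features))  # frequencies ~ p(w)
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)           # random phases
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)
```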

3.3 The subspace kernel sum rule

Analogously to Song et al. (2013), the subspace kernel sum rule is the application of the subspace conditional operator to the embedding of a distribution \(\pi (X)\), i.e.,

$$\begin{aligned} {\hat{\mu }}_{Y}^{\pi }&= \hat{\mathcal {C}}_{Y|X}^{S} {\hat{\mu }}_{X}^{\pi } = \varvec{\Phi }^{\phantom {\intercal }}\varvec{K}_{x\bar{x}}\left( \varvec{K}_{x\bar{x}}^{\intercal }\varvec{K}_{x\bar{x}}+ \lambda \varvec{I}_m\right) ^{-1} \varvec{\Gamma }^{\intercal }\varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{\alpha }_\pi , \end{aligned}$$
(31)

where \(\mu _{X}^{\pi } = \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{\alpha }_\pi \) is the embedding of the prior distribution \(\pi (X)\). We construct the subspace kernel sum rule for tensor product features differently than Song et al. (2013). Instead of applying the conditional operator to the mean embedding and then approximating the covariance operator with the resulting weights (c.f. Eq. 11), we first approximate the covariance operator \(\mathcal {C}_{XX}^{\pi }\) from the weights \(\varvec{\alpha }_\pi \) and then apply the subspace conditional operator to both sides, i.e.,

$$\begin{aligned} \hat{\mathcal {C}}_{YY}^{S,\pi }&= \hat{\mathcal {C}}_{Y|X}^{S} \hat{\mathcal {C}}_{XX}^{\pi } \left( \hat{\mathcal {C}}_{Y|X}^{S}\right) ^{\intercal } = \hat{\mathcal {C}}_{Y|X}^{S} \varvec{\Upsilon }_{x}^{\phantom {\intercal }}{{\,\mathrm{\mathrm{diag}}\,}}(\varvec{\alpha }_\pi ) \varvec{\Upsilon }_{x}^{\intercal }\left( \hat{\mathcal {C}}_{Y|X}^{S}\right) ^{\intercal } \nonumber \\&= \varvec{\Phi }^{\phantom {\intercal }}\varvec{K}_{x\bar{x}}\varvec{L}{{\,\mathrm{\mathrm{diag}}\,}}(\varvec{\alpha }_\pi ) \varvec{L}^\intercal \varvec{K}_{x\bar{x}}^{\intercal } \varvec{\Phi }^{\intercal }. \end{aligned}$$
(32)

Here, we denote \(\varvec{L}:= \left( \varvec{K}_{x\bar{x}}^{\intercal } \varvec{K}_{x\bar{x}}+ \lambda \varvec{I}_m\right) ^{-1} \varvec{K}_{x\bar{x}}^{\intercal } \in {\mathbb {R}}^{m \times n}\) to keep the notation uncluttered. This definition of the subspace kernel sum rule follows from the kernel chain rule for tensor product features, where the conditional operator \(\mathcal {C}_{Y|X}^{\phantom {\intercal }}\) is applied to the covariance embedding \(\mathcal {C}_{XX}^{\pi }\) to obtain the covariance embedding \(\mathcal {C}_{YX}^{\pi }\) (c.f. Eq. 12). The subspace kernel sum rule then follows from applying the transpose of the conditional operator a second time from the right-hand side.

3.4 The subspace kernel chain rule

The subspace kernel chain rule is a straightforward modification of the kernel chain rule by Song et al. (2013). We simply apply the subspace conditional operator \(\mathcal {C}_{Y|X}^{S}\) from the left side to a covariance operator \(\mathcal {C}_{XX}^{\pi } = \varvec{\Upsilon }_{x}^{\phantom {\intercal }}{{\,\mathrm{\mathrm{diag}}\,}}(\varvec{\alpha }_\pi ) \varvec{\Upsilon }_{x}^{\intercal }\) approximated from the weights \(\varvec{\alpha }_\pi \) of the prior mean map \(\mu _{X}^{\pi }\)

$$\begin{aligned} \hat{\mathcal {C}}_{YX}^{S,\pi }&= \hat{\mathcal {C}}_{Y|X}^{S} \hat{\mathcal {C}}_{XX}^{\pi } = \varvec{\Phi }^{\phantom {\intercal }}\varvec{K}_{x\bar{x}}\varvec{L}{{\,\mathrm{\mathrm{diag}}\,}}(\varvec{\alpha }_\pi ) \varvec{\Upsilon }_{x}^{\intercal }. \end{aligned}$$
(33)

With the subspace kernel sum rule and the subspace kernel chain rule we can now construct the subspace kernel Bayes’ rule.

3.5 The subspace kernel Bayes’ rule

The Bayes’ rule computes a posterior distribution P(X|Y) from a prior distribution \(\pi (X)\) and a likelihood function P(Y|X). Fukumizu et al. (2013) derive a conditional operator \(\mathcal {C}_{X|Y}^{\pi }\) from the prior modified covariance operators \(\mathcal {C}_{XY}^{\pi }\) and \(\mathcal {C}_{YY}^{\pi }\). We follow this approach and construct the subspace kernel Bayes’ rule (subKBR) from the prior modified covariance operators \(\mathcal {C}_{XY}^{S,\pi }\) and \(\mathcal {C}_{YY}^{S,\pi }\) which we obtain from the subspace kernel chain rule and the subspace kernel sum rule for tensor product features, respectively. When applied to the embedding of a variate \(\varvec{y}_*\), the subspace kernel Bayes’ rule returns the mean embedding of the conditional distribution \(P(X|\varvec{y}_*)\) as

$$\begin{aligned} {\hat{\mu }}_{X|y}^{\pi }&= \hat{\mathcal {C}}_{X|Y}^{S,\pi } \phi (\varvec{y}_*) \end{aligned}$$
(34)
$$\begin{aligned} {\hat{\mu }}_{X|y}^{\pi }&= \hat{\mathcal {C}}_{XY}^{S,\pi } \left( \hat{\mathcal {C}}_{YY}^{S,\pi } \right) ^{-1} \phi (\varvec{y}_*). \end{aligned}$$
(35)

From the subspace kernel chain rule, we obtain

$$\begin{aligned} \hat{\mathcal {C}}_{XY}^{S,\pi }&= \left( \hat{\mathcal {C}}_{Y|X}^{S} \hat{\mathcal {C}}_{XX}^{\pi }\right) ^{\intercal } \end{aligned}$$
(36)
$$\begin{aligned}&= \varvec{\Upsilon }_{x}^{\phantom {\intercal }}{{\,\mathrm{\mathrm{diag}}\,}}(\varvec{\alpha }_\pi ) \varvec{L}^{\intercal } \varvec{K}_{x\bar{x}}^{\intercal } \varvec{\Phi }^{\intercal }\end{aligned}$$
(37)

and from the subspace kernel sum rule

$$\begin{aligned} \mathcal {C}_{YY}^{S,\pi }&= \mathcal {C}_{Y|X}^{S} \mathcal {C}_{XX}^{\pi } \left( \mathcal {C}_{Y|X}^{S}\right) ^{\intercal } \end{aligned}$$
(38)
$$\begin{aligned}&= \varvec{\Phi }^{\phantom {\intercal }}\varvec{K}_{x\bar{x}}\varvec{L}{{\,\mathrm{\mathrm{diag}}\,}}(\varvec{\alpha }_\pi ) \varvec{L}^{\intercal } \varvec{K}_{x\bar{x}}^{\intercal } \varvec{\Phi }^{\intercal }. \end{aligned}$$
(39)

To keep the notation of the subspace kernel Bayes' rule uncluttered, we define the following matrices

$$\begin{aligned} \bar{\varvec{\Lambda }}&:= {{\,\mathrm{\mathrm{diag}}\,}}(\varvec{\alpha }_\pi ) \varvec{L}^{\intercal } \end{aligned}$$
(40)
$$\begin{aligned} \bar{\varvec{D}}&:= \varvec{L}{{\,\mathrm{\mathrm{diag}}\,}}(\varvec{\alpha }_\pi ) \varvec{L}^{\intercal } \in {\mathbb {R}}^{m \times m}, \end{aligned}$$
(41)
$$\begin{aligned} \varvec{E}&:= \varvec{K}_{x\bar{x}}^{\intercal } \varvec{G}_{yy}\varvec{K}_{x\bar{x}}\in {\mathbb {R}}^{m \times m}, \end{aligned}$$
(42)

where \(\varvec{G}_{yy}= \varvec{\Phi }^{\intercal }\varvec{\Phi }^{\phantom {\intercal }}\) is the kernel matrix of the samples \(\varvec{y}_{i}\). Using the same form of the Tikhonov regularization as the kernel Bayes’ rule in Fukumizu et al. (2013) and substituting the prior modified subspace covariance operators from Eqs. 37 and 39 results in

$$\begin{aligned} {\hat{\mu }}_{X|y}^{\pi }&= \hat{\mathcal {C}}_{XY}^{S,\pi } \left[ \left( \hat{\mathcal {C}}_{YY}^{S,\pi }\right) ^{2} + \gamma \varvec{I}_m \right] ^{-1} \hat{\mathcal {C}}_{YY}^{S,\pi } \phi (\varvec{y}_*) \end{aligned}$$
(43)
$$\begin{aligned}&= \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\bar{\varvec{\Lambda }} \varvec{K}_{x\bar{x}}^{\intercal } \varvec{\Phi }^{\intercal }\left[ \left( \varvec{\Phi }^{\phantom {\intercal }}\varvec{K}_{x\bar{x}}\bar{\varvec{D}} \varvec{K}_{x\bar{x}}^{\intercal } \varvec{\Phi }^{\intercal }\right) \varvec{\Phi }^{\phantom {\intercal }}\varvec{K}_{x\bar{x}}\bar{\varvec{D}} \varvec{K}_{x\bar{x}}^{\intercal } \varvec{\Phi }^{\intercal }+ \gamma \varvec{I}_m \right] ^{-1} \varvec{\Phi }^{\phantom {\intercal }}\varvec{K}_{x\bar{x}}\bar{\varvec{D}} \varvec{K}_{x\bar{x}}^{\intercal } \varvec{\Phi }^{\intercal }\phi (\varvec{y}_*) \end{aligned}$$
(44)
$$\begin{aligned}&= \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\bar{\varvec{\Lambda }} \varvec{E} \left[ (\bar{\varvec{D}} \varvec{E})^{2} + \gamma \varvec{I}_m\right] ^{-1} \bar{\varvec{D}} \varvec{K}_{x\bar{x}}^{\intercal } \varvec{g}_{\varvec{y}_*}, \end{aligned}$$
(45)

with kernel vector \(\varvec{g}_{\varvec{y}_*} = [g(\varvec{y}_1, \varvec{y}_*), \ldots , g(\varvec{y}_n, \varvec{y}_*)]^{\intercal }\), and where we apply the matrix identity \(\varvec{A} \left( \varvec{BA} + \lambda \varvec{I}\right) ^{-1} = \left( \varvec{AB} + \lambda \varvec{I}\right) ^{-1} \varvec{A}\) with \(\varvec{A} = \varvec{\Phi }^{\phantom {\intercal }}\varvec{K}_{x\bar{x}}\) to obtain a finite matrix in the inverse. Since \(\varvec{E}\) and \(\bar{\varvec{D}}\) are both in \({\mathbb {R}}^{m \times m}\), the matrix inversion has complexity \(O(m^{3})\) instead of \(O(n^{3})\). The entire subspace kernel Bayes’ rule has complexity \(O(nm^{2})\) and, thus, scales linearly with the number of sample points (given a fixed reference set) instead of cubically as for the original kernel Bayes’ rule.
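
Putting the pieces together, the subspace kernel Bayes' rule can be evaluated with finite matrices only. The sketch below (our own illustration of Eqs. 40-45, with placeholder regularization constants) returns the posterior weights over \(\varvec{\Upsilon }_{x}^{\phantom {\intercal }}\).

```python
import numpy as np

def subspace_kbr_weights(K_xbar, G_yy, alpha_pi, g_ystar, lam=1e-3, gamma=1e-3):
    """Posterior weights w with mu_{X|y*} = Upsilon_x @ w, following Eq. 45."""
    n, m = K_xbar.shape
    L = np.linalg.solve(K_xbar.T @ K_xbar + lam * np.eye(m), K_xbar.T)  # m x n
    Lambda_bar = np.diag(alpha_pi) @ L.T                                # Eq. 40
    D_bar = L @ np.diag(alpha_pi) @ L.T                                 # Eq. 41
    E = K_xbar.T @ G_yy @ K_xbar                                        # Eq. 42
    DE = D_bar @ E
    rhs = D_bar @ (K_xbar.T @ g_ystar)
    return Lambda_bar @ (E @ np.linalg.solve(DE @ DE + gamma * np.eye(m), rhs))
```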

3.6 Experimental evaluation

We compare the performance, learning time, and run time of the subspace kernel Bayes' filter to those of the standard kernel Bayes' filter on a simple toy task. We simulate a pendulum which we randomly initialize in the ranges \([0.1\pi , 0.4\pi ]\) for the angle \(\theta \) and \([-\,0.5\pi , 0.5\pi ]\) for the angular velocity \({\dot{\theta }}\). The pendulum has a mass of 5 kg and a friction coefficient of 1. We apply Gaussian white noise to the system with a variance of 1, and to the observations with a variance of 0.1. Additionally, the observed angles are randomly perturbed by an offset of \(\pi /4\). These random perturbations occur with a probability of 0.1 in every time step. Each episode consists of 30 time steps with \(\Delta t = 0.1\).

Figure 1 shows that the subspace KBF has a slightly better performance when the training set equals the subspace set, and that it maintains the performance of the standard KBF with an increasing number of training samples while the subspace set is fixed to 100 samples. At the same time, the learning time of the subspace KBF increases at a much lower rate and its run time stays nearly constant, while the learning and run times of the standard KBF grow cubically. The samples for the subspace kernel Bayes' rule are drawn uniformly without replacement from the full sample set.

Fig. 1 The subspace variant of the kernel Bayes' filter outperforms the standard kernel Bayes' filter in both the training time depicted in the left plot and the run time depicted in the middle plot, while maintaining a similar performance to the standard kernel Bayes' rule as depicted in the right plot. The size of the subspace is fixed to 100 samples. The plots show the median and the [0.25, 0.75] quantiles over 20 evaluations.

4 The kernel Kalman rule

All three versions of the kernel Bayes' rule discussed in Sect. 2.1.5 have drawbacks. First, due to the approximation of the prior-modified covariance operators, these operators are not guaranteed to be positive definite; thus, their inversion requires either a harsh form of Tikhonov regularization or a strong regularization factor and is often still numerically unstable. Furthermore, the inverse depends on the embedding of the prior distribution and, hence, needs to be recomputed for every Bayesian update of the mean map. This recomputation significantly increases the computational costs, for example, if we want to optimize the hyper-parameters.

In this section we will present the kernel Kalman rule (KKR) as an approximate alternative to the kernel Bayes’ rule (Fukumizu et al. 2013). We assume a prior belief state over a variable X embedded in a Hilbert space \(\mathcal {H}_{k}\) as

$$\begin{aligned} {\mu }^{-}_{X,t} = {\mathbb {E}}_{X_{t}|\varvec{y}_{1:t-1}} [\varphi (X)] \end{aligned}$$

and new measurement \(\varvec{y}_{t}\) embedded in a Hilbert space \(\mathcal {H}_{g}\) as \(\phi (\varvec{y}_{t})\). With the kernel Kalman rule we want to infer the embedding of the posterior belief state

$$\begin{aligned} {\mu }^{+}_{X,t} = {\mathbb {E}}_{X_{t}|\varvec{y}_{1:t}} [\varphi (X)] \in \mathcal {H}_{k} \end{aligned}$$

from the prior belief and the new measurement. The derivations for the KKR are inspired by the ansatz from recursive least squares (Gauss 1823; Sorenson 1970; Simon 2006), and thus the resulting update equations follow from a clear optimality criterion.

4.1 Estimating the posterior mean embedding from a least squares objective

Let \(\mathcal {C}_{Y|X}^{\phantom {\intercal }}\) be a conditional embedding operator of the observation model P(Y|X) that yields, for a given belief state embedded in the Hilbert space \(\mathcal {H}_{k}\), the distribution over possible observations embedded into the Hilbert space \(\mathcal {H}_{g}\). We also call this conditional embedding operator the observation operator. For a single sample \((\varvec{x}_{t}, \varvec{y}_{t})\), the observation operator yields the relation

$$\begin{aligned} \phi (\varvec{y}_t) = \mathcal {C}_{Y|X}^{\phantom {\intercal }}\varphi (\varvec{x}_t) + \zeta _t, \end{aligned}$$
(46)

where \(\zeta _t\) is zero mean noise with covariance \(\mathcal {R}\). Let us assume that the distribution \(p(\varvec{x})\) is unknown and we can only observe the samples \(\varvec{y}_t\). The objective of the KKR is then to find the mean embedding \(\mu _X\) that minimizes the squared error

$$\begin{aligned} L = {\mathbb {E}}_{XY} \left[ \left( \phi (\varvec{y}_t) - \mathcal {C}_{Y|X}^{\phantom {\intercal }}\mu _X\right) ^{\intercal } \mathcal {R}^{-1} \left( \phi (\varvec{y}_t) - \mathcal {C}_{Y|X}^{\phantom {\intercal }}\mu _X \right) \right] . \end{aligned}$$
(47)

Note that the use of \(\mathcal {R}^{-1}\) as the metric for the least squares objective is somewhat arbitrary (it works with any invertible matrix). This definition becomes more important once we regularize the estimate of \(\mu _X\) (see below). We assume that \(\mathcal {R}\) is constant for all samples \(\varvec{x}_t\). We do not assume that the noise is constant if we use a mean embedding on the operator \(\mathcal {C}_{Y|X}^{\phantom {\intercal }}\).

To show that \(\mu _X = {\mathbb {E}}_{X}[\varphi (\varvec{x})]\), i.e., that \(\mu _{X}\) is indeed the mean embedding of the distribution \(p(\varvec{x})\), we can solve for \(\mu _{X}\) by setting the derivative of L to zero, i.e.,

$$\begin{aligned} \frac{\text {d} L}{\text {d} \mu _{X}}&= {\mathbb {E}}_{XY} \left[ (\phi (\varvec{y}_t) - \mathcal {C}_{Y|X}^{\phantom {\intercal }}\mu _{X})^{\intercal } \mathcal {R}^{-1} \mathcal {C}_{Y|X}^{\phantom {\intercal }}\right] \end{aligned}$$
(48)
$$\begin{aligned}&= {\mathbb {E}}_{Y} \left[ \phi (\varvec{y}_t)^{\intercal }\right] \mathcal {R}^{-1} \mathcal {C}_{Y|X}^{\phantom {\intercal }}- \mu _{X}^\intercal \mathcal {C}_{Y|X}^{\intercal } \mathcal {R}^{-1} \mathcal {C}_{Y|X}^{\phantom {\intercal }}\end{aligned}$$
(49)
$$\begin{aligned}&= 0. \end{aligned}$$
(50)

Assuming that \(\mathcal {C}_{Y|X}^{\intercal } \mathcal {R}^{-1} \mathcal {C}_{Y|X}^{\phantom {\intercal }}\) is invertible, this yields

$$\begin{aligned} \mu _{X}&= (\mathcal {C}_{Y|X}^{\intercal } \mathcal {R}^{-1} \mathcal {C}_{Y|X}^{\phantom {\intercal }})^{-1} \mathcal {C}_{Y|X}^{\intercal } \mathcal {R}^{-1} {\mathbb {E}}_{Y} \left[ \phi (\varvec{y}_t)\right] \end{aligned}$$
(51)
$$\begin{aligned}&= (\mathcal {C}_{Y|X}^{\intercal } \mathcal {R}^{-1} \mathcal {C}_{Y|X}^{\phantom {\intercal }})^{-1} \mathcal {C}_{Y|X}^{\intercal } \mathcal {R}^{-1} \mathcal {C}_{Y|X}^{\phantom {\intercal }}{\mathbb {E}}_{X}\left[ \varphi (\varvec{x}_t)\right] = {\mathbb {E}}_{X}\left[ \varphi (\varvec{x}_t)\right] , \end{aligned}$$
(52)

where we have used the kernel sum rule \({\mathbb {E}}_{Y} [\phi (\varvec{y}_t)] = \mathcal {C}_{Y|X}^{\phantom {\intercal }}{\mathbb {E}}_{X}[\varphi (\varvec{x}_t)]\). Hence, under the assumption that \(\mathcal {C}_{Y|X}^{\intercal } \mathcal {R}^{-1} \mathcal {C}_{Y|X}^{\phantom {\intercal }}\) is invertible, \(\mu _{X}\) indeed estimates the mean embedding of the unobserved distribution \(p(\varvec{x})\). Note that the derivations hold for a constant \(\varvec{x}\), i.e., \(\varvec{x}_t = \varvec{x}\), as well as for samples \(\varvec{x}_t\) drawn from the distribution \(p(\varvec{x})\).

In practice, inverting \(\mathcal {C}_{Y|X}^{\intercal } \mathcal {R}^{-1} \mathcal {C}_{Y|X}^{\phantom {\intercal }}\) is, however, not always feasible. Hence, we can introduce an additional regularization objective, i.e.,

$$\begin{aligned} L_{\text {reg}} = L + (\mu _{X} - {\mu }^{-}_X)^\intercal ({\mathcal {C}}^{-}_{XX})^{-1}(\mu _{X} - {\mu }^{-}_X), \end{aligned}$$
(53)

where \({\mu }^{-}_X\) and \({\mathcal {C}}^{-}_{XX}\) denote a prior belief embedded as mean embedding and covariance operator, respectively. The solution of this optimization problem is given by

$$\begin{aligned} \mu _{X} = \left( \mathcal {C}_{Y|X}^{\intercal } \mathcal {R}^{-1} \mathcal {C}_{Y|X}^{\phantom {\intercal }}+ ({\mathcal {C}}^{-}_{XX})^{-1}\right) ^{-1} \left( \mathcal {C}_{Y|X}^{\intercal } \mathcal {R}^{-1} {\mathbb {E}}_{Y} \left[ \phi (\varvec{y}_t)\right] + ({\mathcal {C}}^{-}_{XX})^{-1} {\mu }^{-}_X\right) . \end{aligned}$$
(54)

Note that this regularization is the only approximation we make in the derivation of the kernel Kalman rule.
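As an illustration of the structure of Eq. 54, the following Python sketch evaluates the regularized solution with explicit finite-dimensional feature vectors; the random observation operator, noise covariance, and prior are synthetic stand-ins chosen purely for illustration and are not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_y = 5, 3                      # dimensions of the (toy) feature spaces H_k and H_g

# toy observation operator C_{Y|X}, noise covariance R, and prior belief (mean, covariance)
C_YX = rng.normal(size=(d_y, d_x))
R = 0.1 * np.eye(d_y)
mu_prior = rng.normal(size=d_x)
C_prior = np.eye(d_x)

# expected observation embedding E_Y[phi(y)] generated from some true mean embedding
mu_true = rng.normal(size=d_x)
E_phi_y = C_YX @ mu_true             # kernel sum rule: E[phi(y)] = C_{Y|X} mu_X

# regularized solution of Eq. 54
A = C_YX.T @ np.linalg.inv(R) @ C_YX + np.linalg.inv(C_prior)
b = C_YX.T @ np.linalg.inv(R) @ E_phi_y + np.linalg.inv(C_prior) @ mu_prior
mu_post = np.linalg.solve(A, b)      # posterior mean embedding, trading off data and prior
```

As the prior precision \(({\mathcal {C}}^{-}_{XX})^{-1}\) vanishes, this solution reduces to the unregularized estimate of Eq. 52.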

4.2 Using recursive least squares to estimate the posterior embedding

Since we want to update our estimate \(\mu _{X}\) iteratively with each new observation \(\varvec{y}_{t}\), we assume a prior mean map \({\mu }^{-}_{X,t}\). In each iteration, we update the prior mean map with the measurement \(\varvec{y}_t\) to obtain the posterior mean map \({\mu }^{+}_{X,t}\). From the recursive least squares solution, we know that the update rule for obtaining the posterior mean map \({\mu }^{+}_{X,t}\) is

$$\begin{aligned} {\mu }^{+}_{X,t} = {\mu }^{-}_{X,t} + \mathcal {Q}_{t}\left( \phi (\varvec{y}_{t}) - \mathcal {C}_{Y|X}^{\phantom {\intercal }}{\mu }^{-}_{X,t}\right) , \end{aligned}$$
(55)

where \(\mathcal {Q}_{t}\) is the Hilbert space Kalman gain operator that is applied to the correction term \(\delta _{t} = \phi (\varvec{y}_{t}) - \mathcal {C}_{Y|X}^{\phantom {\intercal }}{\mu }^{-}_{X,t}\). We call this rule the kernel Kalman rule (KKR). It remains to find an optimal value for the Kalman gain operator. In the next section, we will show that this update rule is an unbiased estimator of the posterior mean map for any choice of \(\mathcal {Q}_{t}\); hence, we cannot obtain the optimal \({\mathcal {Q}}_{t}\) by directly minimizing the expected error. Instead, we will show in Sect. 4.2.2 how to find an optimal \({\mathcal {Q}}_{t}\) by minimizing the covariance of the error.

4.2.1 The kernel Kalman update is an unbiased estimator of the posterior mean map

The embedding of the observation \(\phi (\varvec{y}_{t})\) in the correction term \(\delta _{t}\) is a single-sample estimator of the embedding of the true distribution over observations \({\mu }^{-}_{Y|x,t} = \mathcal {C}_{Y|X}^{\phantom {\intercal }}\varphi (\varvec{x}_t)\). Let us assume for now that we have access to the true embedding \(\varphi (\varvec{x}_t)\). Thus, we have for the embedding of the observation

$$\begin{aligned} \phi (\varvec{y}_{t}) = {\mu }^{-}_{Y|x,t} + \zeta _{t} = \mathcal {C}_{Y|X}^{\phantom {\intercal }}\varphi (\varvec{x}_t) + \zeta _{t}, \end{aligned}$$
(56)

where \(\zeta _{t}\) denotes the error of the single-sample estimator \(\phi (\varvec{y}_{t})\) with respect to the embedding of the true distribution \({\mu }^{-}_{Y|x,t}\). By taking the expectation, it is easy to show that this error is zero-mean and thus \(\phi (\varvec{y}_{t})\) is an unbiased estimator of \({\mu }^{-}_{Y|x,t}\). We refer to “Appendix B.1” for a more detailed derivation. We further assume that \(\zeta _{t}\) is independent of the state \(\varvec{x}_{t}\) and has constant covariance \({\mathcal {R}}\). Following the delta method (Agresti 2002), this assumption is a reasonable choice as we assume i.i.d., zero-mean observation noise. Similarly, the error of the a-posteriori mean embedding with respect to the embedding of the true state is given as

$$\begin{aligned} {\varepsilon }^{+}_{t}&= \varphi (\varvec{x}_{t}) - {\mu }^{+}_{X,t} \end{aligned}$$
(57)
$$\begin{aligned}&= \varphi (\varvec{x}_{t}) - {\mu }^{-}_{X,t} - \mathcal {Q}_{t}(\phi (\varvec{y}_{t}) - \mathcal {C}_{Y|X}^{\phantom {\intercal }}{\mu }^{-}_{X,t}), \end{aligned}$$
(58)

where we use Eq. 55 to substitute the embedding of the posterior belief. By substituting \(\phi (\varvec{y}_{t})\) with Eq. 56 and defining the error of the a-priori mean embedding analogously as \({\varepsilon }^{-}_{t} = \varphi (\varvec{x}_{t}) - {\mu }^{-}_{X,t}\), we arrive at

$$\begin{aligned} {\varepsilon }^{+}_{t}&= \varphi (\varvec{x}_{t}) - {\mu }^{-}_{X,t} - \mathcal {Q}_{t}(\mathcal {C}_{Y|X}^{\phantom {\intercal }}\varphi (\varvec{x}_{t}) + \zeta _{t} - \mathcal {C}_{Y|X}^{\phantom {\intercal }}{\mu }^{-}_{X,t}) \end{aligned}$$
(59)
$$\begin{aligned}&= \left( \mathcal {I} - \mathcal {Q}_{t}\mathcal {C}_{Y|X}^{\phantom {\intercal }}\right) (\varphi (\varvec{x}_{t}) - {\mu }^{-}_{X,t}) - \mathcal {Q}_{t} \zeta _{t} \end{aligned}$$
(60)
$$\begin{aligned}&= \left( \mathcal {I} - \mathcal {Q}_{t}\mathcal {C}_{Y|X}^{\phantom {\intercal }}\right) {\varepsilon }^{-}_{t} - \mathcal {Q}_{t} \zeta _{t} \end{aligned}$$
(61)

with identity operator \(\mathcal {I}\). We can now apply the expectation operator and exploit its linearity to obtain

$$\begin{aligned} {\mathbb {E}} \left[ {\varepsilon }^{+}_{t}\right]&= {\mathbb {E}} \left[ \left( \mathcal {I} - \mathcal {Q}_{t}\mathcal {C}_{Y|X}^{\phantom {\intercal }}\right) {\varepsilon }^{-}_{t} - \mathcal {Q}_{t} \zeta _{t}\right] \nonumber \\&= \left( \mathcal {I} - \mathcal {Q}_{t}\mathcal {C}_{Y|X}^{\phantom {\intercal }}\right) {\mathbb {E}} \left[ {\varepsilon }^{-}_{t}\right] - \mathcal {Q}_{t} {\mathbb {E}} \left[ \zeta _{t}\right] . \end{aligned}$$
(62)

Since the residual of the observation operator is zero mean (\({\mathbb {E}}[\zeta _{t}] = 0\)), we see that, given an unbiased a-priori mean embedding (\({\mathbb {E}}[{\varepsilon }^{-}_{t}] = 0\)), the a-posteriori mean embedding obtained from the kernel Kalman update is unbiased (\({\mathbb {E}}[{\varepsilon }^{+}_{t}] = 0\)) independent of the choice of \(\mathcal {Q}\). Thus, we cannot use the expected error as an optimality criterion for the Kalman gain operator.

4.2.2 Finding the optimal kernel Kalman gain operator

Since the expected error, i.e., the first moment of the error distribution, is already zero, we turn to the covariance of the error, i.e., the second moment. Hence, we choose the kernel Kalman gain operator \(\mathcal {Q}_{t}\) that minimizes the expected squared loss \({\mathbb {E}} \left[ \left( {\varepsilon }^{+}_{t}\right) ^{\intercal }{\varepsilon }^{+}_{t}\right] \) or, equivalently, the variance of the estimator. Minimizing the variance can also be formulated as minimizing the trace of the a-posteriori covariance operator \({\mathcal {C}}^{+}_{XX,t}\) of the state \(\varvec{x}_{t}\) at time t, i.e., \(\min _{\mathcal {Q}_{t}} {\mathbb {E}} \left[ \left( {\varepsilon }^{+}_{t}\right) ^{\intercal }{\varepsilon }^{+}_{t}\right] = \min _{\mathcal {Q}_{t}} \mathrm {Tr} \; {\mathcal {C}}^{+}_{XX,t}\). Using the formulation of the posterior error from Eq. 61 and the assumed independence of \(\zeta _{t}\) and \({\varepsilon }^{-}_{t}\), we can rewrite the a-posteriori covariance operator as

$$\begin{aligned} {\mathcal {C}}^{+}_{XX,t} =&\; \left( \mathcal {I} - \mathcal {Q}_{t}\mathcal {C}_{Y|X}^{\phantom {\intercal }}\right) {\mathcal {C}}^{-}_{XX,t} \left( \mathcal {I} - \mathcal {Q}_{t}\mathcal {C}_{Y|X}^{\phantom {\intercal }}\right) ^{\intercal } + \mathcal {Q}_{t} \mathcal {R} \mathcal {Q}_{t}^{\intercal }, \end{aligned}$$
(63)

where \(\mathcal {R} = {\mathbb {E}} [\zeta _{t}\zeta _{t}^{\intercal }]\) is the covariance of the residual of the observation operator. Taking the derivative of the trace of the covariance operator and setting it to zero leads to the solution for the kernel Kalman gain operator

$$\begin{aligned} \mathcal {Q}_{t}&= {\mathcal {C}}^{-}_{XX,t} \mathcal {C}_{Y|X}^{\intercal } \left( \mathcal {C}_{Y|X}^{\phantom {\intercal }}{\mathcal {C}}^{-}_{XX,t} \mathcal {C}_{Y|X}^{\intercal } + \mathcal {R}\right) ^{-1}. \end{aligned}$$
(64)

We provide a detailed derivation of the optimal Kalman gain operator in “Appendix B.2”. From Eqs. 63 and 64, we can also see that the covariance embedding operator can be estimated recursively, independently of the mean map and of the observations. This property will later allow us to precompute the covariance embedding operator as well as the kernel Kalman gain operator and thereby further reduce the computational cost of our algorithm. Following Simon (2006), the update of the covariance operator can be further simplified to

$$\begin{aligned} {\mathcal {C}}^{+}_{XX,t} =&{\mathcal {C}}^{-}_{XX,t} - \mathcal {Q}_{t} \mathcal {C}_{Y|X}^{\phantom {\intercal }}{\mathcal {C}}^{-}_{XX,t}. \end{aligned}$$
(65)

The derivations of this simplification can be found in “Appendix B.3”. In the following section we will show how to obtain the empirical Kalman update rule from a finite data set.
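The equivalence of the full update in Eq. 63 and the simplified update in Eq. 65 for the optimal gain of Eq. 64 can be checked numerically in a finite-dimensional surrogate space; the following sketch uses random symmetric positive definite matrices in place of the operators (all quantities are synthetic placeholders, not part of the derivation).

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, d_y = 6, 4

# random SPD stand-ins for the a-priori covariance operator and the residual covariance R
A = rng.normal(size=(d_x, d_x)); C_prior = A @ A.T + np.eye(d_x)
B = rng.normal(size=(d_y, d_y)); R = B @ B.T + 0.1 * np.eye(d_y)
C_YX = rng.normal(size=(d_y, d_x))            # stand-in for the observation operator

# optimal gain, Eq. 64
Q = C_prior @ C_YX.T @ np.linalg.inv(C_YX @ C_prior @ C_YX.T + R)

# full update, Eq. 63, and simplified update, Eq. 65
I = np.eye(d_x)
C_post_full = (I - Q @ C_YX) @ C_prior @ (I - Q @ C_YX).T + Q @ R @ Q.T
C_post_simple = C_prior - Q @ C_YX @ C_prior

assert np.allclose(C_post_full, C_post_simple)   # both forms agree for the optimal gain
```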

4.3 Empirical kernel Kalman rule

The equations for the kernel Kalman rule that we derived in the previous section are based on embeddings in infinite-dimensional Hilbert spaces and operators that map between these spaces. In practice, these embeddings and operators are estimated from a finite set of samples \({\mathcal {D}}_{XY} = \left\{ (\varvec{x}_1, \varvec{y}_1), \ldots , (\varvec{x}_n, \varvec{y}_n)\right\} \). In this section, we show how the kernel Kalman rule can be reformulated as manipulations of finite matrices by applying the kernel trick together with matrix identities. Based on the data set \({\mathcal {D}}_{XY}\) and the corresponding feature matrices \(\varvec{\Upsilon }_{x}^{\phantom {\intercal }}\) and \(\varvec{\Phi }^{\phantom {\intercal }}\), the finite sample estimators of the prior mean embedding and the prior covariance operator are given as

$$\begin{aligned} {{\hat{\mu }}}^{-}_{X,t} = \varvec{\Upsilon }_{x}^{\phantom {\intercal }}{\varvec{m}}^{-}_{t} \qquad \text {and} \qquad {\hat{\mathcal {C}}}^{-}_{XX,t} = \varvec{\Upsilon }_{x}^{\phantom {\intercal }}{\varvec{S}}^{-}_{t} \varvec{\Upsilon }_{x}^{\intercal }, \end{aligned}$$
(66)

respectively, with weight vector \({\varvec{m}}^{-}_{t}\) and positive definite weight matrix \({\varvec{S}}^{-}_{t}\). Using this finite sample estimator of the covariance operator, the finite sample estimator of the conditional operator from Eq. 6, and by approximating the covariance of the residual of the observation operator with a diagonal \({\mathcal {R}} = \kappa {\mathcal {I}}\), we can rewrite the kernel Kalman gain operator as

$$\begin{aligned} \hat{\mathcal {Q}}_t = \varvec{\Upsilon }_{x}^{\phantom {\intercal }}{\varvec{S}}^{-}_{t} \varvec{O}^{\intercal } \varvec{\Phi }^{\intercal }\left( \varvec{\Phi }^{\phantom {\intercal }}\varvec{O} {\varvec{S}}^{-}_t \varvec{O}^{\intercal } \varvec{\Phi }^{\intercal }+ \kappa \mathcal {I}\right) ^{-1}, \end{aligned}$$
(67)

with the observation matrix \(\varvec{O} = (\varvec{K}_{xx}+ \lambda \varvec{I})^{-1} \varvec{K}_{xx}\). Here, the approximation \({\mathcal {R}} = \kappa \mathcal {I}\) also acts as a small regularization in the inverse that ensures its positive definiteness and improves the numerical stability of the kernel Kalman rule. However, \(\hat{\mathcal {Q}}_t\) still requires the inversion of an infinite-dimensional operator. Using matrix identities, we can solve this problem and arrive at

$$\begin{aligned} \hat{\mathcal {Q}}_t = \varvec{\Upsilon }_{x}^{\phantom {\intercal }}{\varvec{S}}^{-}_{t} \varvec{O}^{\intercal } \left( \varvec{G}_{yy}\varvec{O} {\varvec{S}}^{-}_t \varvec{O}^{\intercal } + \kappa \varvec{I}\right) ^{-1} \varvec{\Phi }^{\intercal }= \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{Q}_t \varvec{\Phi }^{\intercal }, \end{aligned}$$
(68)

where we define \(\varvec{Q}_t = {\varvec{S}}^{-}_{t} \varvec{O}^{\intercal } (\varvec{G}_{yy}\varvec{O} {\varvec{S}}^{-}_t \varvec{O}^{\intercal } + \kappa \varvec{I})^{-1} \in {\mathbb {R}}^{n \times n}\) and \(\varvec{G}_{yy}= \varvec{\Phi }^{\intercal }\varvec{\Phi }^{\phantom {\intercal }}\) denotes the Gram matrix of the observations. Based on this reformulation of the kernel Kalman gain operator, we can obtain finite vector/matrix representations of the update equations for the estimator of the mean embedding (Eq. 55) and the estimator of the covariance operator (Eq. 65). For the weight vector \(\varvec{m}_{t}\), we arrive at

$$\begin{aligned} {\varvec{m}}^{+}_{t}&= {\varvec{m}}^{-}_{t} + \varvec{Q}_{t} \left( \varvec{g}_{\varvec{y}_t} - \varvec{G}_{yy}\varvec{O} {\varvec{m}}^{-}_{t}\right) , \end{aligned}$$
(69)

where \(\varvec{g}_{\varvec{y}_t} = [g(\varvec{y}_1, \varvec{y}_t), \ldots , g(\varvec{y}_n, \varvec{y}_t)]^{\intercal }\) is the kernel vector of the measurement at time t. Similarly, we obtain the update equation for the posterior weight matrix \({\varvec{S}}^{+}_{t}\) as

$$\begin{aligned} {\varvec{S}}^{+}_{t}&= {\varvec{S}}^{-}_{t} - \varvec{Q}_{t} \varvec{G}_{yy}\varvec{O} {\varvec{S}}^{-}_{t}. \end{aligned}$$
(70)

The algorithm requires the inversion of an \(n \times n\) matrix in every iteration for computing the kernel Kalman gain matrix \(\varvec{Q}_{t}\). Hence, similar to the kernel Bayes’ rule, the computational complexity of a straightforward implementation scales cubically with the number of data points n. However, in contrast to the KBR, the inverse in \(\varvec{Q}_{t}\) depends only on time and not on the estimate of the mean map. Thus, the kernel Kalman gain matrix can be precomputed, since it is identical for multiple parallel runs of the algorithm. Furthermore, if the stream of incoming measurements is reliable (no time steps without an incoming measurement), \(\varvec{S}_{t}\) converges to a stationary matrix and, by that, \(\varvec{Q}_{t}\) becomes stationary as well. While many applications do not require performing state estimation in parallel, it is a huge advantage for hyper-parameter optimization, as we can evaluate multiple trajectories from a validation set simultaneously. As for most kernel-based methods, hyper-parameter optimization is crucial for scaling the approach to complex tasks. So far, the hyper-parameters of the kernels for the KBF have typically been set by heuristics, as optimization would be too expensive.
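For concreteness, the following sketch performs one empirical KKR update according to Eqs. 67–70 on synthetic data; the Gaussian kernel, its bandwidth, the regularization constants \(\lambda \) and \(\kappa \), and the uninformed prior weights are placeholder choices for illustration, not tuned settings.

```python
import numpy as np

def gauss_gram(A, B, bw):
    """Gram matrix of a Gaussian kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bw ** 2))

rng = np.random.default_rng(2)
n = 50
X = rng.normal(size=(n, 1))                  # training states x_i
Y = X + 0.1 * rng.normal(size=(n, 1))        # training observations y_i

lam, kappa = 1e-3, 1e-3
K_xx = gauss_gram(X, X, 1.0)
G_yy = gauss_gram(Y, Y, 1.0)
O = np.linalg.solve(K_xx + lam * np.eye(n), K_xx)     # observation matrix O

# prior belief: weight vector m and weight matrix S (toy, uninformed initialization)
m = np.full(n, 1.0 / n)
S = np.eye(n) / n

# one KKR update with a new measurement y_t (Eqs. 68-70)
y_t = np.array([[0.3]])
g_y = gauss_gram(Y, y_t, 1.0).ravel()                 # kernel vector of the measurement
Q = S @ O.T @ np.linalg.inv(G_yy @ O @ S @ O.T + kappa * np.eye(n))
m = m + Q @ (g_y - G_yy @ O @ m)                      # posterior weight vector, Eq. 69
S = S - Q @ G_yy @ O @ S                              # posterior weight matrix, Eq. 70
```

Since \(\varvec{Q}_{t}\) and \(\varvec{S}_{t}\) do not depend on the measurements, the last three matrix operations involving them could be computed once and reused across parallel filtering runs, as discussed above.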

Besides the hyper-parameters of the kernel functions (e.g., bandwidths) and the regularization constants of the conditional operators, we also treat the approximation of the covariance operator \(\mathcal {R} \approx \kappa \mathcal {I}\) as a hyper-parameter which we optimize. Note that the choice of the minimization objective, e.g., mean squared error (MSE) or negative log-likelihood (NLL), has a substantial effect on the selection of this parameter. For the MSE objective, the parameters are chosen to optimize only the expectation of the filter output. Consequently, the parameter \(\kappa \) acts more as a regularizer and is chosen as small as possible. In contrast, the NLL objective also respects the variance of the filter output, and thus the role of \(\kappa \) as an approximation of the residual covariance \(\mathcal {R}\) becomes more important for the chosen value.

4.4 The subspace kernel Kalman rule

In Sect. 3 we have already shown how the subspace conditional embedding operator allows us to leverage large data sets while maintaining the computational tractability of the learned models. In this section, we show how we can apply this technique to the kernel Kalman rule to obtain the subspace kernel Kalman rule (subKKR).

A core difference between the KKR and the subKKR is the representation of the embedded distributions. While we represent the embeddings for the kernel Kalman rule by a weight vector \({\varvec{m}}^{-}_{t}\) and a weight matrix \({\varvec{S}}^{-}_{t}\), we represent them for the subspace kernel Kalman rule by their projections into a subspace,

$$\begin{aligned} \varvec{n}_{t}&= \varvec{\Gamma }^{\intercal }\mu _{t} = \varvec{\Gamma }^{\intercal }\varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{m}_{t} = \varvec{K}_{x\bar{x}}^{\intercal } \varvec{m}_{t}, \end{aligned}$$
(71)
$$\begin{aligned} \varvec{P}_{t}&= \varvec{\Gamma }^{\intercal }\mathcal {C}_{XX,t} \varvec{\Gamma }^{\phantom {\intercal }}= \varvec{\Gamma }^{\intercal }\varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{S}_{t} \varvec{\Upsilon }_{x}^{\intercal }\varvec{\Gamma }^{\phantom {\intercal }}= \varvec{K}_{x\bar{x}}^{\intercal } \varvec{S}_{t} \varvec{K}_{x\bar{x}}. \end{aligned}$$
(72)

These projections will later allow us to express all operations in the lower-dimensional subspace instead of the space spanned by the full data set. Matrix manipulations involving the full data set are then only necessary during the learning phase, not while performing inference.

We use a slightly modified version of the kernel Kalman gain from Eq. 64, where we approximate the covariance operator \(\mathcal {R}\) by a diagonal operator \(\kappa \mathcal {I}\). With the subspace conditional embedding operator \(\hat{\mathcal {C}}_{Y|X}^{S}\) of the distribution P(Y|X), as derived in Sect. 2.1.2, we obtain the subspace kernel Kalman gain operator as

$$\begin{aligned} \hat{\mathcal {Q}}_{t}^{S}&= {\hat{\mathcal {C}}}^{-}_{XX,t} \left( \hat{\mathcal {C}}_{Y|X}^{S}\right) ^{\intercal } \left( \hat{\mathcal {C}}_{Y|X}^{S} {\hat{\mathcal {C}}}^{-}_{XX,t} \left( \hat{\mathcal {C}}_{Y|X}^{S}\right) ^{\intercal } + \kappa \mathcal {I}\right) ^{-1} \end{aligned}$$
(73)

We can further derive a finite matrix representation of the operator using matrix identities and the projection into the subspace spanned by the features \(\varvec{\Gamma }^{\intercal }\) as

(74)

where we define the subspace kernel Kalman gain matrix \(\varvec{Q}_{t}^{S}\) using the shorthand \(\varvec{O}^{S} := (\varvec{K}_{x\bar{x}}^{\intercal } \varvec{K}_{x\bar{x}}+ \lambda \varvec{I}_m)^{-1}\). A detailed derivation can be found in “Appendix B.5”. Note that \(\varvec{Q}_{t}^{S} \in {\mathbb {R}}^{m \times n}\) and not \({\mathbb {R}}^{m \times m}\); however, when applying the subspace KKR in an inference algorithm, we can use the matrix \(\varvec{K}_{x\bar{x}}^{\intercal }\) on the right side as a projection for the high-dimensional embedding of the distribution over the variable Y to which the gain is applied (see Algorithm 2 for an example). From this, the update equation for the projection of the mean map becomes

$$\begin{aligned} {\varvec{n}}^{+}_{t}&= {\varvec{n}}^{-}_{t} + \varvec{Q}_{t}^{S} \left( \varvec{g}_{\varvec{y}_{t}} - \varvec{G}_{yy}\varvec{K}_{x\bar{x}}\varvec{O}^{S} {\varvec{n}}^{-}_{t}\right) , \end{aligned}$$
(75)

and similarly, the update equation for the covariance embedding becomes

$$\begin{aligned} {\varvec{P}}^{+}_{t}&= {\varvec{P}}^{-}_{t} - \varvec{Q}_{t}^{S} \varvec{G}_{yy}\varvec{K}_{x\bar{x}}\varvec{O}^{S} {\varvec{P}}^{-}_{t} \end{aligned}$$
(76)

In contrast to the kernel Kalman gain presented in the previous section, but also in contrast to the variants of the kernel Bayes rule discussed in Sect. 2.1.5, the subspace kernel Kalman gain requires only the inversion of an \(m \times m\) matrix instead of an \(n \times n\) matrix, where \(m \ll n\). Still, the full data set of n samples can be used to learn the Kalman gain operator.
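As the explicit finite form of the gain in Eq. 74 is deferred to “Appendix B.5”, the following sketch shows one consistent reading of the subspace update (Eqs. 75–76): assuming that the finite gain follows from Eq. 73 via the same matrix identity used for Eq. 68, it can be computed as \(\varvec{Q}_{t}^{S} = ({\varvec{P}}^{-}_{t} \varvec{O}^{S} \varvec{K}_{x\bar{x}}^{\intercal } \varvec{G}_{yy}\varvec{K}_{x\bar{x}}\varvec{O}^{S} + \kappa \varvec{I}_m)^{-1} {\varvec{P}}^{-}_{t} \varvec{O}^{S} \varvec{K}_{x\bar{x}}^{\intercal }\), which requires only an \(m \times m\) inversion. This expression is our reconstruction and should be checked against the appendix; all data and parameters below are synthetic placeholders.

```python
import numpy as np

def gauss_gram(A, B, bw):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bw ** 2))

rng = np.random.default_rng(3)
n, m = 200, 20                                   # full data set size and subspace size
X = rng.normal(size=(n, 1))
Y = X + 0.1 * rng.normal(size=(n, 1))
Xbar = X[rng.choice(n, m, replace=False)]        # subspace samples (uniform subsampling)

lam, kappa = 1e-3, 1e-3
K = gauss_gram(X, Xbar, 1.0)                     # K_{x xbar}, n x m
G = gauss_gram(Y, Y, 1.0)                        # G_yy, n x n
O_S = np.linalg.inv(K.T @ K + lam * np.eye(m))   # shorthand O^S, m x m

# prior subspace belief: projected mean n_t and projected covariance P_t (toy initialization)
n_t = K.T @ np.full(n, 1.0 / n)
P_t = np.eye(m)

# assumed finite gain: only an m x m system has to be solved
y_t = np.array([[0.3]])
g_y = gauss_gram(Y, y_t, 1.0).ravel()
M = K @ O_S                                      # n x m map from subspace weights to observation weights
Q_S = np.linalg.solve(P_t @ M.T @ G @ M + kappa * np.eye(m), P_t @ M.T)   # m x n

n_t = n_t + Q_S @ (g_y - G @ M @ n_t)            # Eq. 75
P_t = P_t - Q_S @ G @ M @ P_t                    # Eq. 76
```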

4.5 Experimental comparison of (sub)KKR and (sub)KBR

We compare the performance of the (subspace) kernel Kalman rule to the performance of the (subspace) kernel Bayes rule on a simple stationary filtering task, namely estimating the expectation of a Gaussian distribution. The graphical model that we assume for this task is depicted in Fig. 2. We sample \(N = 500\) latent context variables \(c_{i}\) uniformly from the interval \([-5, 5]\) as the means of the Gaussian distributions. Afterwards, we draw a single (\(M=1\)) observed sample \(s_{i}\) for each context from \({\mathcal {N}}(c_{i}, \frac{1}{3})\) and learn the kernel Kalman rule and the different versions of the kernel Bayes rule with the context variables as states and the samples as observations. For the performance comparison (Fig. 3), the KKR and the KBR are learned with a kernel size of 200 samples, while the subKKR and subKBR are learned with 200 samples to span the subspace and the full set of 500 samples to learn the operators. The comparison of time efficiency is summarized in Table 1, where the respective kernel size and subspace size are denoted in the column headers. The subKKR and subKBR have always been learned with the full data set of 500 samples. The data points for the subspace have been drawn uniformly without replacement from the full data set.

Fig. 2 Graphical model for comparing KKR to KBR

For the optimization of the hyper-parameters and for the evaluation, we have each generated a data set with \(N = 10\) latent context variables from the same uniform distribution. These context variables are not observed by the filter methods. Next, we draw \(M=10\) samples from the Gaussian distribution around each context and update each method iteratively with these ten samples. For each update we compute the squared error to the true context and take the mean over all ten context variables. We use squared exponential kernel functions and optimize their bandwidths as well as the regularization parameters using CMA-ES (Hansen 2006). Figure 3 shows the median and the (0.15, 0.85)-quantiles of the MSE to the true context over the number of seen samples. As a baseline, we depict the maximum-likelihood (ML) estimate of the expectation. We see that, while in the beginning all methods perform similarly to the ML estimate, with more seen samples the KKR and subKKR outperform all variants of the KBR. We depict the median and (0.15, 0.85)-quantiles instead of mean and standard deviation because of the unstable optimization behavior of the KBR, which produced many outliers. In Table 1 we state the time consumed to perform ten KKR/KBR updates on 10 estimation tasks for different kernel sizes. Here, the KKR/subKKR methods benefit from their ability to process the updates for all 10 estimation tasks in parallel, and the ability to precompute \(\varvec{Q}_{t}\)/\(\varvec{Q}^{S}_{t}\) and \(\varvec{S}_{t}\)/\(\varvec{P}_{t}\) has not even been exploited.

Fig. 3 Performance of the KKR updates versus the KBR updates for estimating the mean of a Gaussian distribution with 1–10 seen samples. The ML estimate serves as a baseline. Depicted are the median and the (0.15, 0.85)-quantiles of the MSE to the true mean over 20 runs

Table 1 Time consumptions of the KKR and KBR update methods for different kernel sizes

In a second experiment, we have investigated how sensitive the KKR is to non-constant noise in comparison to the KBR. We have sampled data similarly to the previous experiment, with a context variable \(c_{i}\) in the range \([-5, 5]\) and observations \(s_{i,j}\) from the distribution \({\mathcal {N}}(c_i, \sigma (c_i))\). The variance of the Gaussian distribution depends on the context variable via \(\sigma (c_i) = \exp (c_i)\). Again, we sample \(N = 500\) context variables and one observation (\(M = 1\)) for each context for learning the models. For the optimization of the hyper-parameters, we have sampled a data set of \(N = 10\) context variables with \(M = 10\) observations for each context. For the evaluation of the methods, we have chosen the context variables at the integers \([-5, -4,\ldots , 5]\) and have again sampled \(M = 10\) observations per context. For each sampled context, we perform updates with all ten observations. The plots in Fig. 4 show the mean and min/max of the estimated mean relative to the true mean (context) from which the observations have been sampled, for both cases, constant noise \(\sigma = \frac{1}{3}\) and variable noise \(\sigma = \exp (c_i)\). As expected from the previous experiment, all models perform similarly well in the case of a constant noise variance. In the case of variable noise variance, all models perform worse for smaller variances, where the impact on the performance of the KKR and the subKKR is larger. Note, however, that the KBR methods suffered from numerical instabilities for large noise variances. For instance, KBR(b) was the only KBR method that yielded results for the largest variance \(\sigma = \exp (5)\).

Fig. 4 Comparison of the performances of the KKR, subKKR, KBR(b), and subKBR on the estimation of the mean \(c_{i}\) of a Gaussian random variable, left with constant variance and right with variable variance \(\exp (c_i)\). The y-axis depicts the mean absolute error relative to the context using a log-scale. Because of the exponential relation between the observation noise and the context variable, we get the linear slope in the distribution of the samples in the right plot (Color figure online)

5 Applications of the kernel Kalman rule

In Sect. 4, we have shown how we can derive the KKR as an operator for approximate Bayesian updates in the framework for nonparametric inference. In this section, we present two applications of the kernel Kalman rule. In Sect. 5.1, we first present the kernel Kalman filter (KKF) and discuss details of its implementation. A subspace variant of the KKF is presented in Sect. 5.2, and experimental results for both are shown in Sect. 5.3. The kernel forward–backward smoother (KFBS) is presented as another application of the KKR in Sect. 5.4, and a subspace variant is discussed in Sect. 5.5. We finally show experimental evaluations of the KFBS and the subKFBS in Sect. 5.6.

5.1 The kernel Kalman filter

Similar to the kernel Bayes’ filter (Fukumizu et al. 2013; Song et al. 2013), we can combine the kernel Kalman rule with the kernel sum rule to formulate the kernel Kalman filter (KKF). To learn the models of the KKF, we assume that a data set \({\mathcal {D}}_{{\tilde{XXY}}} = \left\{ (\varvec{{\tilde{x}}}_{1},\varvec{x}_{1},\varvec{y}_{1}),\ldots ,(\varvec{{\tilde{x}}}_{n},\varvec{x}_{n},\varvec{y}_{n})\right\} \) consisting of triples of preceding state \({\tilde{x}}_{i}\), state \(x_{i}\), and measurement \(y_{i}\) is given. We further assume the states to be Markov, i.e., the state \(x_{i}\) depends only on its predecessor \({\tilde{x}}_{i}\). Based on this data set we define the feature matrices \(\varvec{\Upsilon }_{x}^{\phantom {\intercal }}:= [\varphi (\varvec{x}_{1}),\ldots ,\varphi (\varvec{x}_{n})]\), \(\varvec{\Upsilon }_{\tilde{x}}^{\phantom {\intercal }}:= [\varphi (\varvec{{\tilde{x}}}_{1}),\ldots ,\varphi (\varvec{{\tilde{x}}}_{n})]\), and \(\varvec{\Phi }^{\phantom {\intercal }}:= [\phi (\varvec{y}_{1}),\ldots ,\phi (\varvec{y}_{n})]\). In contrast to the KBF, we represent the belief state by the mean map \({\hat{\mu }}_{X,t} = \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{m}_{t}\) and the covariance operator \(\hat{\mathcal {C}}_{XX,t} = \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{S}_{t} \varvec{\Upsilon }_{x}^{\intercal }\).

The forward model \(P(X|{\tilde{X}})\) that propagates the posterior belief state at time t to the prior belief state at time \(t+1\) can then be learned as conditional embedding operator

$$\begin{aligned} \hat{\mathcal {C}}_{X|\tilde{X}}^{\phantom {\intercal }}&= \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\left( \varvec{K}_{\tilde{x}\tilde{x}}+ \lambda \varvec{I}_n\right) ^{-1} \varvec{\Upsilon }_{\tilde{x}}^{\intercal }, \end{aligned}$$
(77)

which we also call the transition operator. Here, \(\varvec{K}_{\tilde{x}\tilde{x}}\) is the Gram matrix of the features of the preceding states \(\varvec{\Upsilon }_{\tilde{x}}^{\phantom {\intercal }}\). The posterior belief state at time t is then propagated to the prior belief state at time \(t+1\) by applying the kernel sum rule. That is, we apply the transition operator to the posterior mean map and the posterior covariance embedding at time t and obtain the prior mean map and prior covariance embedding at time \(t+1\), i.e.,

$$\begin{aligned} {{\hat{\mu }}}^{-}_{X,t+1}&= \hat{\mathcal {C}}_{X|\tilde{X}}^{\phantom {\intercal }}{{\hat{\mu }}}^{+}_{X,t} = \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{T} {\varvec{m}}^{+}_t,&\Leftrightarrow \qquad {\varvec{m}}^{-}_{t+1}&= \varvec{T} {\varvec{m}}^{+}_t \end{aligned}$$
(78)
$$\begin{aligned} {\hat{\mathcal {C}}}^{-}_{XX,t+1}&= \hat{\mathcal {C}}_{X|\tilde{X}}^{\phantom {\intercal }}{\hat{\mathcal {C}}}^{+}_{XX,t} \hat{\mathcal {C}}_{X|\tilde{X}}^{\intercal } + \mathcal {V}&\Leftrightarrow \qquad \; {\varvec{S}}^{-}_{t+1}&= \varvec{T} {\varvec{S}}^{+}_t \varvec{T}^{\intercal } + \varvec{V}. \end{aligned}$$
(79)

Note that the propagation of the covariance embedding is slightly different from the kernel sum rule by Song et al. (2013); however, this formulation follows directly from the kernel chain rule (c.f. Eqs. 32 and 11). Analogously to the observation matrix \(\varvec{O}\) (c.f. Sect. 4.3), we define the transition matrix \(\varvec{T} = (\varvec{K}_{\tilde{x}\tilde{x}}+ \lambda _{\varvec{T}} \varvec{I})^{-1} \varvec{K}_{\tilde{x} x}\), where \(\varvec{K}_{\tilde{x} x}= \varvec{\Upsilon }_{\tilde{x}}^{\intercal }\varvec{\Upsilon }_{x}^{\phantom {\intercal }}\) is the kernel matrix of the preceding states and the current states. The covariance of the transition residual \(\mathcal {V}\) and its finite matrix representation \(\varvec{V}\) can be obtained as

(80)
(81)
(82)

We can then apply the kernel Kalman rule as the observation update to the new prior belief state obtained from the transition update. Before we give a condensed summary of the kernel Kalman filter in Algorithm 1, we discuss in the next section how we obtain the embedding of the distribution over the initial states. To extract meaningful information from the RKHS-embedded distributions, we furthermore need to find a mapping of the embedded distribution back into the state space. In Sect. 5.1.2, we show how we approach the so-called pre-image problem and briefly discuss other solutions.
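The following sketch combines the transition update (Eqs. 78–79) with the empirical KKR update from Sect. 4.3 into one filter step on synthetic data; since the estimate of the transition residual \(\varvec{V}\) is not repeated here, the sketch simply uses a small diagonal placeholder for it, and all helper names, kernels, and constants are illustrative choices rather than the settings used in our experiments.

```python
import numpy as np

def gauss_gram(A, B, bw):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bw ** 2))

rng = np.random.default_rng(4)
n = 100
X_prev = rng.normal(size=(n, 1))                        # preceding states x~_i
X_cur = 0.9 * X_prev + 0.05 * rng.normal(size=(n, 1))   # states x_i
Y = X_cur + 0.1 * rng.normal(size=(n, 1))               # measurements y_i

lam, kappa = 1e-3, 1e-3
K_pp = gauss_gram(X_prev, X_prev, 1.0)                  # K_{x~x~}
K_pc = gauss_gram(X_prev, X_cur, 1.0)                   # K_{x~x}
K_cc = gauss_gram(X_cur, X_cur, 1.0)                    # K_{xx}
G_yy = gauss_gram(Y, Y, 1.0)

T = np.linalg.solve(K_pp + lam * np.eye(n), K_pc)       # transition matrix T (Sect. 5.1)
O = np.linalg.solve(K_cc + lam * np.eye(n), K_cc)       # observation matrix O (Sect. 4.3)
V = 1e-4 * np.eye(n)                                    # placeholder for the transition residual V

def kkf_step(m_post, S_post, y_t):
    """One KKF iteration: transition update (Eqs. 78-79) followed by the KKR update (Eqs. 69-70)."""
    m_pri = T @ m_post                                  # prior weight vector
    S_pri = T @ S_post @ T.T + V                        # prior weight matrix
    g_y = gauss_gram(Y, y_t, 1.0).ravel()
    Q = S_pri @ O.T @ np.linalg.inv(G_yy @ O @ S_pri @ O.T + kappa * np.eye(n))
    m_post = m_pri + Q @ (g_y - G_yy @ O @ m_pri)
    S_post = S_pri - Q @ G_yy @ O @ S_pri
    return m_post, S_post

m_t, S_t = np.full(n, 1.0 / n), np.eye(n) / n           # toy initial belief
m_t, S_t = kkf_step(m_t, S_t, np.array([[0.2]]))
```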

5.1.1 Embedding the initial state distribution

Before running the filter on incoming measurements \(y_{t}\), we need to initialize the belief state with an initial mean map \(\mu _{X,0}\) and an initial covariance operator \(\mathcal {C}_{XX,0}\). We can obtain these initial embeddings from a data set \({\mathcal {D}}_{0} = \{\varvec{x}_{1}^{0},\ldots ,\varvec{x}_{N}^{0}\}\) which consists in general of samples from the initial distribution of the system. Practically, we can obtain this data set by taking the initial states from multiple training episodes or—if we assume a stationary distribution—we can also take all training samples for the initialization. We can obtain the initial mean map by first embedding a uniform distribution into the RKHS spanned by the features of the initial states \(\varvec{\Upsilon }_{x,0}^{\phantom {\intercal }}\). Afterwards, we apply a conditional operator to map this distribution into the Hilbert space spanned by the features \(\varvec{\Upsilon }_{x}^{\phantom {\intercal }}\) as

$$\begin{aligned} {\hat{\mu }}_{X,0} = \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{m}_{0}&= \varvec{\Upsilon }_{x}^{\phantom {\intercal }}(\varvec{K}_{xx}+ \lambda \varvec{I}_n)^{-1} \varvec{\Upsilon }_{x}^{\intercal }\varvec{\Upsilon }_{x,0}^{\phantom {\intercal }} \varvec{1}_{N} \frac{1}{N} \end{aligned}$$
(83)
$$\begin{aligned} \Leftrightarrow \quad \varvec{m}_{0}&= (\varvec{K}_{xx}+ \lambda \varvec{I}_n)^{-1} \varvec{K}_{x0}\varvec{1}_{N} \frac{1}{N}. \end{aligned}$$
(84)

where \(\varvec{K}_{x0}= \varvec{\Upsilon }_{x}^{\intercal }\varvec{\Upsilon }_{x,0}\) is the kernel matrix of the training samples and the samples in \({\mathcal {D}}_{0}\), and \(\varvec{1}_{N}\) denotes the N-dimensional all-ones vector. Similarly, we can obtain the initial covariance embedding operator as

$$\begin{aligned} \hat{\mathcal {C}}_{XX,0} = \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{S}_{0} \varvec{\Upsilon }_{x}^{\intercal }= \frac{1}{N} \varvec{\Upsilon }_{x}^{\phantom {\intercal }}(\varvec{K}_{xx}+ \lambda \varvec{I}_n)^{-1} \varvec{K}_{x0}^{\phantom {\intercal }} \varvec{K}_{x0}^{\intercal } (\varvec{K}_{xx}+ \lambda \varvec{I}_n)^{-1} \varvec{\Upsilon }_{x}^{\intercal }- \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{m}_{0}^{\phantom {\intercal }} \varvec{m}_{0}^{\intercal } \varvec{\Upsilon }_{x}^{\intercal }. \end{aligned}$$
(85)

Hence, we can obtain the initial weight vector \(\varvec{m}_{0}\) and the initial weight matrix \(\varvec{S}_{0}\) by computing the mean and the covariance over the columns of the matrix \(\varvec{C}_{0} = (\varvec{K}_{xx}+ \lambda \varvec{I}_n)^{-1} \varvec{K}_{x0}\).
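A minimal sketch of this initialization, computing \(\varvec{m}_{0}\) and \(\varvec{S}_{0}\) from the columns of \(\varvec{C}_{0}\) (Eqs. 84–85) for synthetic training and initial-state samples (kernel and regularization constant are placeholder choices):

```python
import numpy as np

def gauss_gram(A, B, bw):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bw ** 2))

rng = np.random.default_rng(5)
n, N = 100, 20
X = rng.normal(size=(n, 1))        # training states spanning the RKHS
X0 = rng.normal(size=(N, 1))       # samples from the initial state distribution

lam = 1e-3
K_xx = gauss_gram(X, X, 1.0)
K_x0 = gauss_gram(X, X0, 1.0)
C0 = np.linalg.solve(K_xx + lam * np.eye(n), K_x0)   # C_0, n x N

m0 = C0.mean(axis=1)                                 # initial weight vector, Eq. 84
S0 = C0 @ C0.T / N - np.outer(m0, m0)                # initial weight matrix, Eq. 85
```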

5.1.2 The pre-image problem/recovering the state-space distribution

Recovering a distribution in the state space that is a pre-image of a given mean map is still a topic of ongoing research. There are several approaches to this problem, such as fitting a Gaussian mixture model (Mccalman et al. 2013) or sampling from the embedded distribution by optimization (Chen et al. 2010). In the experiments conducted for this paper, we approach the pre-image problem by matching a Gaussian distribution, which is a reasonable choice if the recovered distribution is unimodal. Since we embed the belief state for the kernel Kalman rule as a mean map and a covariance operator, we can obtain the mean and covariance of a Gaussian approximation by simple matrix manipulations. The space of the samples \({\mathbb {R}}^{d}\) together with the linear kernel \(k(\varvec{x}_{1}, \varvec{x}_{2}) = \langle \varvec{x}_{1}, \varvec{x}_{2}\rangle = \varvec{x}_{1}^{\intercal }\varvec{x}_{2}\) forms an RKHS as well. Therefore, we can simply define a conditional embedding operator that maps from the Hilbert space of the feature vectors to the Hilbert space of the samples as

$$\begin{aligned} \hat{\mathcal {C}}_{\text {pre}} = \varvec{X} (\varvec{K}_{xx}+ \lambda \varvec{I}_n)^{-1} \varvec{\Upsilon }_{x}^{\intercal }. \end{aligned}$$
(86)

By applying this conditional operator now to the belief state, we obtain the mean of the embedded distribution in the sample space

$$\begin{aligned} \varvec{\eta }_{t} = \hat{\mathcal {C}}_{\text {pre}} \mu _{X,t}&= \hat{\mathcal {C}}_{\text {pre}} {\mathbb {E}}_{b_{t}}[\varphi (X)] = {\mathbb {E}}_{b_{t}}[\hat{\mathcal {C}}_{\text {pre}} \varphi (X)] = {\mathbb {E}}_{b_{t}}[X]. \end{aligned}$$
(87)

Similarly, we can also apply this operator to the covariance embedding to obtain the covariance of the belief state in the sample space

$$\begin{aligned} \varvec{\Sigma }_{t}&= \hat{\mathcal {C}}_{\text {pre}} \hat{\mathcal {C}}_{XX,t} \hat{\mathcal {C}}_{\text {pre}}^{\intercal } \nonumber \\&= \hat{\mathcal {C}}_{\text {pre}} \left( {\mathbb {E}}_{b_{t}} \left[ \varphi (X) \otimes \varphi (X)\right] - \mu _{X,t} \otimes \mu _{X,t}\right) \hat{\mathcal {C}}_{\text {pre}}^{\intercal } \nonumber \\&= {\mathbb {E}}_{b_{t}} \left[ \hat{\mathcal {C}}_{\text {pre}} \varphi (X) \otimes \varphi (X)\hat{\mathcal {C}}_{\text {pre}}^{\intercal }\right] - \hat{\mathcal {C}}_{\text {pre}} \mu _{X,t} \otimes \mu _{X,t} \mathcal {C}_{\text {pre}}^{\intercal } \nonumber \\&= {\mathbb {E}}_{b_{t}} \left[ X \otimes X\right] - \varvec{\eta }_{t} \otimes \varvec{\eta }_{t}^{\intercal } \end{aligned}$$
(88)

However, any other approach from the literature could be used in the kernel Kalman filter algorithm as well.
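In the finite-sample representation, the Gaussian approximation of Eqs. 86–88 reduces to a few matrix products; the following sketch recovers the state-space mean and covariance from the weight representation (synthetic data, illustrative kernel and regularization constant).

```python
import numpy as np

def gauss_gram(A, B, bw):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bw ** 2))

rng = np.random.default_rng(6)
n, d = 100, 2
X = rng.normal(size=(n, d))                           # training states (one sample per row)
lam = 1e-3
K_xx = gauss_gram(X, X, 1.0)

# belief state in weight form (toy values)
m_t = np.full(n, 1.0 / n)
S_t = np.eye(n) / n

# pre-image operator applied to the belief (Eqs. 86-88)
A = np.linalg.solve(K_xx + lam * np.eye(n), K_xx)     # (K + lam I)^{-1} K
eta = X.T @ A @ m_t                                   # state-space mean, Eq. 87
Sigma = X.T @ A @ S_t @ A.T @ X                       # state-space covariance, Eq. 88
```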

Algorithm 1 The kernel Kalman filter (KKF)

5.1.3 Embedding observation windows

So far, we assumed that we have access to the latent states \(\varvec{x}_i\) in our training set. However, in many setups we only have access to the partial observations \(\varvec{y}_i\), which do not have the Markov property. Yet, we can still learn a KKR model from the provided data by embedding time windows \(\varvec{y}_{t-k+1:{t}}\) of size k as the internal state representation. Similar approaches have been used in auto-regressive HMMs (Shannon et al. 2013). With longer data windows, the transitions become increasingly Markovian. How many observations each data window has to contain depends on two factors: the dimensionality of the underlying system and the signal-to-noise ratio of the measurements \(\varvec{y}_{i}\).
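A simple way to construct such window states is to stack the k most recent observations; the following sketch shows a generic implementation of this construction (the helper name and the toy data are our own, and the window size k remains a design choice as discussed above).

```python
import numpy as np

def build_windows(Y, k):
    """Stack k consecutive observations y_{t-k+1:t} into one window per time step t >= k-1."""
    T, d = Y.shape
    return np.stack([Y[t - k + 1:t + 1].ravel() for t in range(k - 1, T)])

# toy usage: windows of size 4 serve as an (approximately Markovian) state representation
Y = np.random.default_rng(7).normal(size=(100, 2))   # a sequence of 2-D observations
X_windows = build_windows(Y, k=4)                    # shape (97, 8)
```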

5.2 The subspace kernel Kalman filter

The subspace kernel Kalman filter (subKKF) is an extension of the KKF that applies the subspace conditional embedding operator presented in Eq. 26 as well as the subspace formulation of the kernel Kalman rule derived in Sect. 4.4. In contrast to the KKF, we assume for the subKKF a data set of triples \(\{(\varvec{x}_{1}, \varvec{x}_{1}', \varvec{y}_{1}),\ldots ,(\varvec{x}_{n}, \varvec{x}_{n}', \varvec{y}_{n})\}\), where \(\varvec{x}_{i}'\) is the successor state to \(\varvec{x}_{i}\). The representation of the belief state changes from weight vector \(\varvec{m}_{t}\) and weight matrix \(\varvec{S}_{t}\) to the subspace projections of the embeddings \(\varvec{n}_{t} = \varvec{\Gamma }^{\intercal }\varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{m}_{t} = \varvec{K}_{x\bar{x}}^{\intercal } \varvec{m}_{t}\) and \(\varvec{P}_{t} = \varvec{\Gamma }^{\intercal }\varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{S}_{t} \varvec{\Upsilon }_{x}^{\intercal }\varvec{\Gamma }^{\phantom {\intercal }}= \varvec{K}_{x\bar{x}}^{\intercal } \varvec{S}_{t} \varvec{K}_{x\bar{x}}\), respectively. Additionally, both update procedures of the kernel Kalman filter, the transition update and the innovation update, have to be substituted by their subspace counterparts. The transition update is realized by the subspace kernel sum rule and the innovation update by the subspace kernel Kalman rule. The equations are depicted in Algorithm 2.

Since we represent the belief state as a projection into the subspace defined by \(\varvec{\Gamma }^{\phantom {\intercal }}\), we can directly obtain the initial belief state by projecting the uniform embedding in the RKHS spanned by the samples from the initial distribution as

$$\begin{aligned} \varvec{n}_{0}&= \varvec{\Gamma }^{\intercal }\varvec{\Upsilon }_{x,0}^{\phantom {\intercal }} \varvec{1}_{N} \frac{1}{N} = \varvec{K}_{\bar{x}0}\varvec{1}_{N} \frac{1}{N}, \end{aligned}$$
(89)
$$\begin{aligned} \varvec{P}_{0}&= \frac{1}{N} \varvec{\Gamma }^{\intercal }\varvec{\Upsilon }_{x,0}^{\phantom {\intercal }} \varvec{\Upsilon }_{x,0}^{\intercal } \varvec{\Gamma }^{\phantom {\intercal }}= \frac{1}{N} \varvec{K}_{\bar{x}0}\left( \varvec{K}_{\bar{x}0}\right) ^{\intercal }. \end{aligned}$$
(90)

Here, \(\varvec{K}_{\bar{x}0}\) is the kernel matrix of the subspace samples and the samples from the initial state distribution. For the mapping back into the state space, we can, similarly to the KKF, define a subspace conditional operator as

$$\begin{aligned} \hat{\mathcal {C}}_{\text {pre}}^{S} := \varvec{X} \varvec{K}_{x\bar{x}}\left( \varvec{K}_{x\bar{x}}^{\intercal }\varvec{K}_{x\bar{x}}+ \lambda \varvec{I}_m\right) ^{-1} \varvec{\Gamma }^{\intercal }. \end{aligned}$$
(91)

By applying this operator to the mean map and the covariance embedding, we obtain the mean and covariance in state space from the subspace projections as

$$\begin{aligned} \varvec{\eta }_{t}&= \varvec{X} \varvec{K}_{x\bar{x}}\left( \varvec{K}_{x\bar{x}}^{\intercal }\varvec{K}_{x\bar{x}}+ \lambda \varvec{I}_m\right) ^{-1} \varvec{n}_{t}, \end{aligned}$$
(92)
$$\begin{aligned} \varvec{\Sigma }_{t}&= \varvec{X} \varvec{K}_{x\bar{x}}\left( \varvec{K}_{x\bar{x}}^{\intercal }\varvec{K}_{x\bar{x}}+ \lambda \varvec{I}_m\right) ^{-1} \varvec{P}_{t} \left( \varvec{K}_{x\bar{x}}^{\intercal }\varvec{K}_{x\bar{x}}+ \lambda \varvec{I}_m\right) ^{-1} \varvec{K}_{x\bar{x}}^{\intercal } \varvec{X}^{\intercal }. \end{aligned}$$
(93)

A concise description of the subspace kernel Kalman filter can be found in Algorithm 2.
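The following sketch illustrates the subspace belief initialization (Eqs. 89–90) and the mapping back to the state space (Eqs. 92–93) on synthetic data; for simplicity, the subspace is selected here by plain uniform subsampling instead of the kernel activation heuristic used in the experiments, and all constants are placeholders.

```python
import numpy as np

def gauss_gram(A, B, bw):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bw ** 2))

rng = np.random.default_rng(8)
n, m, N, d = 200, 30, 25, 2
X = rng.normal(size=(n, d))                          # full training states
Xbar = X[rng.choice(n, m, replace=False)]            # subspace samples
X0 = rng.normal(size=(N, d))                         # samples from the initial distribution
lam = 1e-3

K_xbar = gauss_gram(X, Xbar, 1.0)                    # K_{x xbar}, n x m
K_bar0 = gauss_gram(Xbar, X0, 1.0)                   # K_{xbar 0}, m x N

# initial subspace belief, Eqs. 89-90
n0 = K_bar0.mean(axis=1)
P0 = K_bar0 @ K_bar0.T / N

# mapping back to the state space, Eqs. 92-93
B = np.linalg.solve(K_xbar.T @ K_xbar + lam * np.eye(m), np.eye(m))   # (K^T K + lam I)^{-1}
eta = X.T @ K_xbar @ B @ n0                                           # state-space mean
Sigma = X.T @ K_xbar @ B @ P0 @ B @ K_xbar.T @ X                      # state-space covariance
```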

Algorithm 2 The subspace kernel Kalman filter (subKKF)

5.3 Experimental evaluation of the kernel Kalman filter

We evaluate the performance of the KKF and the subKKF in two experiments on simulated environments, a pendulum and a quad-link, and one experiment on real-world data from a human motion tracking data set (Wojtusch and von Stryk 2015). For all kernel-based methods, we use the squared exponential kernel, where we choose the kernel bandwidths according to the median trick (Jaakkola et al. 1999) and scale the median distances with a single optimized parameter.

5.3.1 Pendulum

Fig. 5 Graphical model that we assume for the pendulum experiment

In this experiment, we use a simulated pendulum as system dynamics. The state \(s_{0} = (q_{0}, \dot{q}_0)\) of the pendulum is initialized uniformly in the range \([0.1 \pi , 0.4 \pi ]\) for the angle \(q_{0}\) and in the range \([-\,0.5 \frac{\pi }{s}, 0.5 \frac{\pi }{s}]\) for the angular velocity \(\dot{q}_{0}\). We simulate the pendulum with a frequency of 10,000 Hz and add normally distributed process noise with \(\sigma = 0.1\). The filter methods observe the joint positions with additive Gaussian noise, i.e., \(o_t \sim {\mathcal {N}}(q_t,0.01)\) at a rate of 10 Hz. A graphical model of the pendulum is depicted in Fig. 5.

We compare the KKF, the subspace KKF (subKKF) and the KKF learned with the full data set (fullKKF) to version (a) of the kernel Bayes filter (KBF(a)) (Song et al. 2013) (the other versions, KBF(b) and KBF(c), have yielded worse results in this experiment) and the kernel Kalman filter with covariance embedding operator (KKF-CEO) (Zhu et al. 2014), as well as to standard filtering approaches such as the EKF (Julier and Uhlmann 1997) and the UKF (Wan and Van Der Merwe 2000) (which require a model of the system dynamics). To learn the models, we simulate 10 episodes with a length of 30 steps (3 s), i.e., 300 samples in total. Instead of the true state \(s_{t}\) of the pendulum, we use a window of 4 samples to represent the latent state. For the KKF and all KBF models, we use a kernel size of 100 samples, for the fullKKF, we use all available training samples and for the subKKF we use a set of 100 samples to span the subspace and the full data set to learn the operators. The samples for the subspace are selected from the full data set using the kernel activation heuristic. The results are shown in Fig. 6. The KKF and subKKF show clearly better results than all other non-parametric filtering methods and reach a performance level close to the EKF and UKF.

Fig. 6 Comparison of KKF to KBF(b), KKF-CEO, EKF and UKF. All kernel methods (except fullKKF) use kernel matrices of 100 samples. The subKKF method uses a subset of 100 samples and the whole data set to learn the conditional operators. Depicted is the median MSE to the ground-truth of 20 trials with the [0.25, 0.75] quantiles

5.3.2 Quad-link

In this experiment, we use a simulated 4-link pendulum where we observe the 2-D end-effector positions. The state \(\varvec{s}_{t}\) of the pendulum consists of the four joint angles \(\varvec{q}_{t}\) and the joint velocities \(\varvec{\dot{q}}_{t}\). The first and the last joint, \(q_{0,t=0}\) and \(q_{3,t=0}\), are initialized uniformly in the range \([-\,0.55 \pi , -\,0.45 \pi ]\) for \(q_{0,t=0}\) and \([-\,0.5 \pi , 0.5 \pi ]\) for \(q_{3,t=0}\). The remaining joints and the joint velocities are all initialized at 0.0. We simulate with Gaussian process noise with \(\sigma = 0.01\). The filter methods observe the end-effector positions \(x_{t}\) with Gaussian observation noise as \(o_{t} \sim {\mathcal {N}}(x_t, 0.001)\) at a rate of 10 Hz. As we assume that we have no access to the true states, we use data windows of size 4 as the representation of the latent state to learn the models. A graphical model of the system is depicted in Fig. 7.

Fig. 7 Graphical model that we assume for the quad-link experiment. The states \(s_{t}\) contain the joint angles and velocities, \(x_{t}\) is the position of the end-effector, and \(o_{t}\) the noisy observation thereof

We evaluate the prediction performance of the subKKF in comparison to the KKF-CEO, the EKF and the UKF. All other non-parametric filtering methods could not achieve a good performance or are not feasible due to the very high computation times. As the subKKF outperformed the KKF in the previous experiments and is also computationally much cheaper, we skip the comparison to the standard KKF in this and also the following experiments. We use a subspace of 500 samples, which is selected according to the kernel activation heuristic, and learn the subKKF with the full data set of 3000 samples.

In a first qualitative evaluation, we compare the long-term prediction performance of the subKKF to the UKF, the EKF, and the Monte Carlo filter (MCF) as a baseline. This evaluation is shown in Fig. 8. The first five steps of the end-effector trajectories are observed by the filters, and the following 30 steps are predicted. The UKF is not able to predict the movements of the quad-link end-effector due to the high non-linearity, while the subKKF is able to predict the whole trajectory.

Fig. 8 Example trajectory of the quad-link end-effector. The filter outputs in black, where the ellipses enclose 90% of the probability mass. All filters were updated with the first five measurements (yellow marks) and predicted the following 30 steps. a Animation of the trajectory, b–d depict the outputs of Monte-Carlo filter, unscented Kalman filter, and subspace kernel Kalman filter, respectively (Color figure online)

Fig. 9 1, 2 and 3 step prediction performances in mean Euclidean distances (MED) to the true end-effector positions of the quad-link

We also compared the 1, 2 and 3-step prediction performance of the subKKF to the KKF-CEO, EKF and UKF (Fig. 9). The KKF-CEO provides poor results already for the filtering task. The EKF performs equally badly, since the observation model is highly non-linear. The UKF yields a much better performance, as it does not suffer from the linearization of the system dynamics. The subKKF performs slightly better than the UKF.

5.3.3 Human motion data

The human motion dynamics (HuMoD) database by Wojtusch and von Stryk (2015) consists of data sets of several motions executed by two subjects. All data sets contain the recordings from a motion capture system with 36 markers as well as the recordings of the electrical activity of 14 muscles in the legs. Additionally, data from the treadmill such as ground reaction forces and velocities are available. The x-, y-, and z-locations of the markers were recorded at 500 Hz, the muscle activities at 2000 Hz, and the data from the treadmill at 1000 Hz. Furthermore, the database contains joint positions and joint trajectories derived from the marker positions via kinematic models of the human body. In our experiments, we use the marker locations, the derived locations of the joints, and the muscle activities. We subsample all data to a common frame rate of 50 Hz and translate the x- and z-positions of all markers such that the T12 marker (the marker at the 12th thoracic vertebra) has \((x=0,z=0)\) in all frames. Note that in the HuMoD database, the x-axis points in the motion direction (i.e., along the treadmill), the y-axis points upwards, and the z-axis forms a right-hand coordinate system towards the right side of the treadmill. We used walking motions at 1.0 m/s, 1.5 m/s, and 2.0 m/s, and running motions at 2.0 m/s, 3.0 m/s, and 4.0 m/s, captured from one subject. For evaluating the trained model, we used a test data set in which the subject transitions linearly from 0 m/s up to 4 m/s and back to 0 m/s.

Fig. 10 Example sequence of 4 postures and the measured muscle activities. The marker and skeleton in green depict the ground-truth, the estimates from the models are depicted in black/blue. The learned models estimate the marker and joint positions from the muscle activities. The first row shows the estimated positions from the subKKF, the second row shows the estimated positions from the subKBF and the third row shows the estimated positions from a sparse GP. For all three models, we use a sample set of 2000 samples and a sparse subset of 500 samples (Color figure online)

Fig. 11 Performance of the subKKF and the subKBF on the HuMoD transition data for different sizes of the subspace

Table 2 Top: performance of subKKF, subKBF, and SGP for the HuMoD transition data for different sizes of the subspace. Bottom: time consumptions of subKKF and subKBF for filtering 100 test sequences of length 50 from the HuMoD data set
Fig. 12 Performance of the subKKF on HuMoD test sequences after 0, 10, 20, 30 and 40 iterations of the CMA-ES optimizer

In this experiment, we compare the performance of the subKKF, the subKBF, and a sparse Gaussian process (SGP) in restoring the marker and joint positions from the muscle activities. We learn all three models using the marker and joint positions as state variables (or outputs) \(\varvec{x}_{i}\) and the muscle activities as observations (or inputs) \(\varvec{y}_{i}\). We use a set of 2000 samples to learn the kernel matrices and a subset of 500 samples to define the subspace (or as inducing inputs). For the subKKF and subKBF, we use a window size of 2. While we could easily carry out the optimization of the parameters for the subKKF and for the SGP, the optimization of the parameters for the subKBF was not feasible in a reasonable amount of time.

Figure 10 depicts marker and joint positions of four exemplary postures together with the muscle activities during that period of time. The locations of the exemplary postures in the time line are depicted by vertical lines in the plot of the muscle activities. While this is only a qualitative example, it depicts how the subKKF outperforms the subKBF and the SGP in restoring the positions of the markers and the joints.

We compare the performance of the subKKF, subKBF and SGP for different sizes of the subspace (sets of inducing inputs). Figure 11 depicts the performance of the subKKF and subKBF. The results of the SGP, given in Table 2, are clearly worse than those of the filtering approaches, which take the temporal correlation of the data into account. Furthermore, Table 2 also depicts the time consumption of the subKKF and subKBF for filtering 100 test sequences of length 50. The gain in efficiency of the subKKF over the subKBF, which is around a factor of 100, can be seen clearly. In Fig. 12, we depict the performance gain of the subKKF over the number of iterations of the CMA-ES optimizer. We see that in this case the first 10 iterations yield a bigger jump in performance than the following 40 iterations. However, from our experience, this is very specific to the problem and to the initial setting of the parameters, which in this experiment were already very close to the optimal parameters.

5.4 The kernel forward–backward smoother

In contrast to filtering, smoothing is a post-processing routine. While filtering refers to a routine where the current state is estimated recursively from all past observations, smoothing computes the best state estimates given all available observations from the past and the future. Hence, for a given time series of observations \([\varvec{y}_1,\ldots ,\varvec{y}_T]\), we want to obtain the belief \(p(\varvec{x}_{t}|\varvec{y}_{1},\ldots ,\varvec{y}_{T})\) for all \(1 \le t \le T\).

One well-known and simple approach to smoothing is the forward–backward smoother. During a forward pass, the standard filtering algorithm is applied to the observations. Afterwards, during the backward pass, an inverse filter is applied to the same time series of observations. The filter estimates of the forward and backward pass are finally combined into the smoothed estimates. Since the information from the observations should be incorporated only once into the smoothed estimates, we need to combine the posterior estimates of the forward pass with the prior estimates of the backward pass (or vice versa). For ordinary Kalman filters, the backward pass is hard to realize because of two problems: first, it requires an inverse model of the underlying system, and second, an initialization of the belief at the final state is necessary. These issues do not apply to the KKF, however, as we can learn both the inverse models and the embedding of the initial distribution over the final states from data.

5.4.1 Computing the smoothed belief state as a weighted average

Assuming that we have the a-posteriori belief states from the forward pass and the a-priori belief states from the backward pass as

$$\begin{aligned} \left\{ \left( {\mu }^{+}_{f,1}, {\mathcal {C}}^{+}_{f,1}\right) ,\ldots ,\left( {\mu }^{+}_{f,T}, {\mathcal {C}}^{+}_{f,T}\right) \right\} \qquad \text {and} \qquad \left\{ \left( {\mu }^{-}_{b,1}, {\mathcal {C}}^{-}_{b,1}\right) ,\ldots ,\left( {\mu }^{-}_{b,T}, {\mathcal {C}}^{-}_{b,T}\right) \right\} , \end{aligned}$$
(94)

respectively, we can combine the mean maps into a smoothed belief state as the weighted average

$$\begin{aligned} \mu _{s,t}&= \mathcal {Z}_{f,t}{\mu }^{+}_{f,t} + \mathcal {Z}_{b,t}{\mu }^{-}_{b,t}. \end{aligned}$$
(95)

Since both the estimator from the forward pass and the estimator from the backward pass are unbiased, the weighting operators \(\mathcal {Z}_{f,t}\) and \(\mathcal {Z}_{b,t}\) need to satisfy \({\mathcal {I}} = \mathcal {Z}_{f,t}+ \mathcal {Z}_{b,t}\) in order to obtain an unbiased estimator of the smoothed mean map, i.e.,

$$\begin{aligned} {\mathbb {E}}\left[ \varphi (X_t) - \mu _t^s \right]&\overset{!}{=} 0 \end{aligned}$$
(96)
$$\begin{aligned} {\mathbb {E}}\left[ \varphi (X_t) - \mathcal {Z}_{f,t}{\mu }^{+}_{f,t} - \mathcal {Z}_{b,t}{\mu }^{-}_{b,t}\right]&\overset{!}{=} 0 \end{aligned}$$
(97)
$$\begin{aligned} {\mathbb {E}}\left[ \varphi (X_t)\right] - \mathcal {Z}_{f,t}{\mathbb {E}}\left[ {\mu }^{+}_{f,t}\right] - \mathcal {Z}_{b,t}{\mathbb {E}}\left[ {\mu }^{-}_{b,t}\right]&\overset{!}{=} 0 \end{aligned}$$
(98)
$$\begin{aligned} \mu _{t} - \left( \mathcal {Z}_{f,t}+ \mathcal {Z}_{b,t}\right) \mu _{t}&\overset{!}{=} 0 \end{aligned}$$
(99)
$$\begin{aligned} \Rightarrow \quad \mathcal {Z}_{f,t}+ \mathcal {Z}_{b,t}&= {\mathcal {I}} \end{aligned}$$
(100)

Thus, the weighting operators can be expressed in terms of each other as \(\mathcal {Z}_{b,t}= {\mathcal {I}} - \mathcal {Z}_{f,t}\) and vice versa. We substitute this representation back into the smoothing update in Eq. 95 to obtain

$$\begin{aligned} \mu _{s,t}&= \mathcal {Z}_{f,t}{\mu }^{+}_{f,t} + \left( {\mathcal {I}} - \mathcal {Z}_{f,t}\right) {\mu }^{-}_{b,t}. \end{aligned}$$
(101)

5.4.2 Finding the optimal weighting operators

We obtain the optimal weighting operators by minimizing the squared error of the smoothed mean map, which is equivalent to minimizing the trace of the smoothed covariance embedding operator \(\mathcal {C}_{s,t}\), i.e.,

$$\begin{aligned} \min {\mathbb {E}}\left[ \left( \varphi (X_t) - \mu _t^s\right) ^\intercal \left( \varphi (X_t) - \mu _t^s\right) \right]&= \min {\mathbb {E}}\left[ \mathrm {Tr} \left( \varphi (X_t) - \mu _t^s\right) \left( \varphi (X_t) - \mu _t^s\right) ^\intercal \right] \end{aligned}$$
(102)
$$\begin{aligned}&= \min \mathrm {Tr} \;\mathcal {C}_{s,t}. \end{aligned}$$
(103)

We can then use Eq. 101 to rewrite the covariance operator as

$$\begin{aligned} \mathcal {C}_{s,t}&= {\mathbb {E}}\left[ \left( \varphi (X_t) - \mathcal {Z}_{f,t}{\mu }^{+}_{f,t} - \left( {\mathcal {I}} - \mathcal {Z}_{f,t}\right) {\mu }^{-}_{b,t}\right) \Big (\ldots \Big )^\intercal \right] \end{aligned}$$
(104)
$$\begin{aligned}&= {\mathbb {E}}\left[ \left( \mathcal {Z}_{f,t}(\epsilon _f - \epsilon _b) + \epsilon _b\right) \Big (\ldots \Big )^\intercal \right] , \end{aligned}$$
(105)

where \(\epsilon _{f} = \varphi (\varvec{x}_t) - {\mu }^{+}_{f,t}\) is the error of the a-posteriori estimate of the forward pass, \(\epsilon _{b} = \varphi (\varvec{x}_t) - {\mu }^{-}_{b,t}\) is the error of the a-priori estimate of the backward pass, and where we used the relation \(\varphi (X_t) = {\mathcal {I}} \varphi (X_t) = (\mathcal {Z}_{f,t}+ \mathcal {Z}_{b,t}) \varphi (X_t) = (\mathcal {Z}_{f,t}+ ({\mathcal {I}} - \mathcal {Z}_{f,t}) ) \varphi (X_t)\). By expanding the square and since the cross-covariances of the errors from the forward and the backward pass are zero (i.e., \({\mathbb {E}}[\epsilon _{f}\epsilon _{b}^{\intercal }] = 0\)), we arrive at

$$\begin{aligned} \mathcal {C}_{s,t}&= {\mathbb {E}}\left[ \mathcal {Z}_{f,t}^{\phantom {\intercal }} \left( \epsilon _f^{\phantom {\intercal }}\epsilon _f^{\intercal } + \epsilon _b^{\phantom {\intercal }} \epsilon _b^{\intercal }\right) \mathcal {Z}_{f,t}^{\intercal } - \mathcal {Z}_{f,t}^{\phantom {\intercal }} \epsilon _b^{\phantom {\intercal }} \epsilon _b^{\intercal } - \epsilon _b^{\phantom {\intercal }} \epsilon _b^{\intercal } \mathcal {Z}_{f,t}^{\intercal } + \epsilon _b^{\phantom {\intercal }} \epsilon _b^{\intercal }\right] . \end{aligned}$$
(106)

Lastly, we can take the derivative and set it to zero to obtain the optimal \(\mathcal {Z}_{f,t}\) as

$$\begin{aligned} 0&\overset{!}{=} \frac{\partial \, \mathrm {Tr} \,\mathcal {C}_{s,t}}{\partial \mathcal {Z}_{f,t}} = 2 {\mathbb {E}}\left[ \mathcal {Z}_{f,t}\left( \epsilon _f^{\phantom {\intercal }}\epsilon _f^{\intercal } + \epsilon _b^{\phantom {\intercal }} \epsilon _b^{\intercal }\right) - \epsilon _b^{\phantom {\intercal }} \epsilon _b^{\intercal }\right] \end{aligned}$$
(107)
$$\begin{aligned} 0&\overset{!}{=} \mathcal {Z}_{f,t}\left( {\mathbb {E}}\left[ \epsilon _f^{\phantom {\intercal }}\epsilon _f^{\intercal }\right] + {\mathbb {E}}\left[ \epsilon _b^{\phantom {\intercal }} \epsilon _b^{\intercal }\right] \right) - {\mathbb {E}}\left[ \epsilon _b^{\phantom {\intercal }} \epsilon _b^{\intercal }\right] \end{aligned}$$
(108)
$$\begin{aligned} 0&\overset{!}{=} \mathcal {Z}_{f,t}\left( {\mathcal {C}}^{+}_{f,t} + {\mathcal {C}}^{-}_{b,t}\right) - {\mathcal {C}}^{-}_{b,t} \end{aligned}$$
(109)
$$\begin{aligned} \mathcal {Z}_{f,t}&= {\mathcal {C}}^{-}_{b,t} \left( {\mathcal {C}}^{+}_{f,t} + {\mathcal {C}}^{-}_{b,t}\right) ^{-1}. \end{aligned}$$
(110)

From the condition on the weighting operators stated in Eq. 100, it furthermore follows that \(\mathcal {Z}_{b,t}= {\mathcal {C}}^{+}_{f,t} \left( {\mathcal {C}}^{+}_{f,t} + {\mathcal {C}}^{-}_{b,t}\right) ^{-1}\).

5.4.3 Smoothing the covariance embedding operator

Taking the representation of the smoothed covariance in Eq. 106 and substituting the covariance operators and the optimal weighting operator \(\mathcal {Z}_{f,t}\) gives us the following smoothed covariance operator

$$\begin{aligned} \mathcal {C}_{s,t}&= {\mathcal {C}}^{-}_{b,t} - {\mathcal {C}}^{-}_{b,t} \left( {\mathcal {C}}^{+}_{f,t} + {\mathcal {C}}^{-}_{b,t}\right) ^{-1} {\mathcal {C}}^{-}_{b,t} \end{aligned}$$
(111)
$$\begin{aligned}&= {\mathcal {C}}^{-}_{b,t} - \left( \mathcal {I} - {\mathcal {C}}^{+}_{f,t} \left( {\mathcal {C}}^{+}_{f,t} + {\mathcal {C}}^{-}_{b,t}\right) ^{-1}\right) {\mathcal {C}}^{-}_{b,t} \end{aligned}$$
(112)
$$\begin{aligned}&= {\mathcal {C}}^{+}_{f,t} \left( {\mathcal {C}}^{+}_{f,t} + {\mathcal {C}}^{-}_{b,t}\right) ^{-1} {\mathcal {C}}^{-}_{b,t}. \end{aligned}$$
(113)

From the optimal solution of the weighting operator, we can now see that the smoothing update of the covariance embedding operator can be expressed as

$$\begin{aligned} \mathcal {C}_{s,t}&= \mathcal {Z}_{b,t}{\mathcal {C}}^{-}_{b,t} = \mathcal {Z}_{f,t}{\mathcal {C}}^{+}_{f,t}. \end{aligned}$$
(114)
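As a quick sanity check of these relations, the following sketch (not part of the paper; it uses small random symmetric positive definite matrices as stand-ins for the covariance operators) verifies numerically that the weighting operators sum to the identity and that both expressions in Eq. 114 agree with Eq. 113.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(n):
    # random symmetric positive definite matrix as a stand-in for a covariance operator
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

n = 5
C_f = random_spd(n)  # a-posteriori covariance of the forward pass
C_b = random_spd(n)  # a-priori covariance of the backward pass

Z_f = C_b @ np.linalg.inv(C_f + C_b)  # Eq. 110
Z_b = C_f @ np.linalg.inv(C_f + C_b)  # implied by Eq. 100

assert np.allclose(Z_f + Z_b, np.eye(n))       # Eq. 100
C_s = C_f @ np.linalg.inv(C_f + C_b) @ C_b     # Eq. 113
assert np.allclose(C_s, Z_f @ C_f)             # Eq. 114
assert np.allclose(C_s, Z_b @ C_b)             # Eq. 114
```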

In the following section, we will show how the smoothing update can be expressed with finite samples using vector/matrix operations.

5.4.4 The empirical kernel forward–backward smoother

We assume that we are given the weight vectors and weight matrices from the forward and the backward pass as \(\{({\varvec{m}}^{+}_{f,1}, {\varvec{S}}^{+}_{f,1}), \ldots , ({\varvec{m}}^{+}_{f,T}, {\varvec{S}}^{+}_{f,T})\}\) and \(\{({\varvec{m}}^{-}_{b,1}, {\varvec{S}}^{-}_{b,1}), \ldots , ({\varvec{m}}^{-}_{b,T}, {\varvec{S}}^{-}_{b,T})\}\), respectively. Since the weighting operator \(\hat{\mathcal {Z}}_{b,t}\) can be expressed by \(\hat{\mathcal {Z}}_{f,t}\) and vice versa, we only need to compute one of the weighting operators, which we choose to be \(\hat{\mathcal {Z}}_{f,t}\). We add the identity operator scaled by a small factor \(\gamma \) inside the inverse to improve the numerical stability and obtain

$$\begin{aligned} \hat{\mathcal {Z}}_{f,t}&= {\hat{\mathcal {C}}}^{-}_{b,t} \left( {\hat{\mathcal {C}}}^{+}_{f,t} + {\hat{\mathcal {C}}}^{-}_{b,t} + \gamma {\mathcal {I}}\right) ^{-1} \end{aligned}$$
(115)
$$\begin{aligned}&= \varvec{\Upsilon }_{x}^{\phantom {\intercal }}{\varvec{S}}^{-}_{b,t} \varvec{\Upsilon }_{x}^{\intercal }\left( \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\left( {\varvec{S}}^{+}_{f,t} + {\varvec{S}}^{-}_{b,t}\right) \varvec{\Upsilon }_{x}^{\intercal }+ \gamma {\mathcal {I}}\right) ^{-1} \end{aligned}$$
(116)
$$\begin{aligned}&= \varvec{\Upsilon }_{x}^{\phantom {\intercal }}{\varvec{S}}^{-}_{b,t} \left( \varvec{K}_{xx}\left( {\varvec{S}}^{+}_{f,t} + {\varvec{S}}^{-}_{b,t}\right) + \gamma \varvec{I}\right) ^{-1} \varvec{\Upsilon }_{x}^{\intercal }\end{aligned}$$
(117)
$$\begin{aligned}&= \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{Z}_{f,t} \varvec{\Upsilon }_{x}^{\intercal }, \qquad \varvec{Z}_{f,t} = {\varvec{S}}^{-}_{b,t} \left( \varvec{K}_{xx}\left( {\varvec{S}}^{+}_{f,t} + {\varvec{S}}^{-}_{b,t}\right) + \gamma \varvec{I}\right) ^{-1}, \end{aligned}$$
(118)

where we used the matrix identity \(\varvec{A} (\varvec{B} \varvec{A} + \varvec{I})^{-1} = (\varvec{A} \varvec{B} + \varvec{I})^{-1} \varvec{A}\) and defined the finite weighting matrix \(\varvec{Z}_{f,t}\). With this weighting matrix, we can now combine the mean maps of the forward and the backward pass as

$$\begin{aligned} {\hat{\mu }}_{s,t}&= \hat{\mathcal {Z}}_{f,t}{{\hat{\mu }}}^{+}_{f,t} + \left( {\mathcal {I}} - \hat{\mathcal {Z}}_{f,t}\right) {{\hat{\mu }}}^{-}_{b,t} \end{aligned}$$
(119)
$$\begin{aligned}&= \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{Z}_{f,t} \varvec{\Upsilon }_{x}^{\intercal }{{\hat{\mu }}}^{+}_{f,t} + \left( {\mathcal {I}} - \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{Z}_{f,t} \varvec{\Upsilon }_{x}^{\intercal }\right) {{\hat{\mu }}}^{-}_{b,t} \end{aligned}$$
(120)
$$\begin{aligned}&= \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{Z}_{f,t} \varvec{K}_{xx}{\varvec{m}}^{+}_{f,t} + \varvec{\Upsilon }_{x}^{\phantom {\intercal }}{\varvec{m}}^{-}_{b,t} - \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{Z}_{f,t} \varvec{K}_{xx}{\varvec{m}}^{-}_{b,t} \end{aligned}$$
(121)
$$\begin{aligned} \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{m}_{t}^{s}&= \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\left( {\varvec{m}}^{-}_{b,t} + \varvec{Z}_{f,t} \varvec{K}_{xx}\left( {\varvec{m}}^{+}_{f,t} - {\varvec{m}}^{-}_{b,t}\right) \right) . \end{aligned}$$
(122)

Similarly, we can obtain the smoothed estimate of the covariance operator as

$$\begin{aligned} \hat{\mathcal {C}}_{t}^{s}&= \hat{\mathcal {Z}}_{f,t}{\hat{\mathcal {C}}}^{+}_{f,t} \end{aligned}$$
(123)
$$\begin{aligned}&= \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{Z}_{f,t} \varvec{\Upsilon }_{x}^{\intercal }\varvec{\Upsilon }_{x}^{\phantom {\intercal }}{\varvec{S}}^{+}_{f,t} \varvec{\Upsilon }_{x}^{\intercal }\end{aligned}$$
(124)
$$\begin{aligned} \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{S}_{t}^{s} \varvec{\Upsilon }_{x}^{\intercal }&= \varvec{\Upsilon }_{x}^{\phantom {\intercal }}\varvec{Z}_{f,t} \varvec{K}_{xx}{\varvec{S}}^{+}_{f,t} \varvec{\Upsilon }_{x}^{\intercal }. \end{aligned}$$
(125)

A concise description of the kernel forward–backward smoothing algorithm can be found in Algorithm 3.
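To make the finite-sample update concrete, the following sketch (not taken from the paper; function and variable names are our own) implements one smoothing step as given by Eqs. 118, 122 and 125, assuming the Gram matrix and the forward and backward weight representations are already available.

```python
import numpy as np

def kfbs_smoothing_step(K_xx, m_f, S_f, m_b, S_b, gamma=1e-6):
    """One step of the empirical kernel forward-backward smoother.

    K_xx     : (n, n) Gram matrix of the state features
    m_f, m_b : (n,) weight vectors of the forward a-posteriori and backward a-priori mean maps
    S_f, S_b : (n, n) weight matrices of the corresponding covariance embeddings
    gamma    : small regularizer added inside the inverse for numerical stability
    """
    n = K_xx.shape[0]
    # finite weighting matrix Z_{f,t} (Eq. 118)
    Z_f = S_b @ np.linalg.inv(K_xx @ (S_f + S_b) + gamma * np.eye(n))
    # smoothed mean-map weights (Eq. 122)
    m_s = m_b + Z_f @ K_xx @ (m_f - m_b)
    # smoothed covariance-embedding weights (Eq. 125)
    S_s = Z_f @ K_xx @ S_f
    return m_s, S_s
```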


5.4.5 Initialization of the backward kernel Kalman filter

A critical aspect of the classical forward–backward smoothing algorithm is the initialization of the belief state for the backward pass. While the distribution over the initial state is often well known, a distribution over the terminal state usually is not, and it is often not even clear how a terminal state is defined. For the backward kernel Kalman filter, two approaches can be used to initialize the belief state. The first approach assumes that the training data contain multiple episodes, each of which terminates in a terminal state of the system. We can then compute the initialization for the backward pass analogously to the initialization of the KKF described in Sect. 5.1.1. The second approach simply assumes that the system has a stationary distribution which is covered by the training data. The initialization is then the embedding of the distribution over all samples in the training set.
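As an illustration of the second approach, a minimal sketch (our own, not from the paper) of the weight representation of such an initialization: embedding the empirical distribution over all n training samples gives uniform weights for the mean map and, assuming a centered covariance embedding, the usual centered empirical weight matrix.

```python
import numpy as np

def init_backward_belief(n):
    """Embed the empirical distribution over all n training samples (assumed stationary)."""
    m0 = np.full(n, 1.0 / n)               # weights of the initial mean map
    S0 = np.eye(n) / n - np.outer(m0, m0)  # weights of the centered covariance embedding
    return m0, S0
```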

5.5 The subspace kernel forward–backward smoother

If we use the subspace kernel Kalman filter to perform the forward and the backward pass, we obtain the subspace projections \(\varvec{n}_{t}\) of the mean maps and \(\varvec{P}_{t}\) of the covariance embeddings instead of the weight vectors \(\varvec{m}_{t}\) and weight matrices \(\varvec{S}_{t}\), respectively. To perform smoothing on these subspace projections, we need to find the weighting matrices for a smoothing update analogous to Eq. 101. However, since this representation is already finite-dimensional, we can directly apply the optimal solution found in Eq. 110 to the subspace projections of the covariance operators. Hence, the weighting matrix for the subspace kernel forward–backward smoother (subKFBS) becomes

$$\begin{aligned} \varvec{Z}_{f,t}^{S}&= {\varvec{P}}^{-}_{b,t} \left( {\varvec{P}}^{+}_{f,t} + {\varvec{P}}^{-}_{b,t}\right) ^{-1}. \end{aligned}$$
(126)

From here, we can easily obtain the equations for the smoothing update of the subKFBS as

$$\begin{aligned} \varvec{n}_{s,t}&= \varvec{Z}_{f,t}^{S} {\varvec{n}}^{+}_{f,t} + \left( \varvec{I} - \varvec{Z}_{f,t}^{S}\right) {\varvec{n}}^{-}_{b,t}, \qquad \text {and} \end{aligned}$$
(127)
$$\begin{aligned} \varvec{P}_{s,t}&= \varvec{Z}_{f,t}^{S} {\varvec{P}}^{+}_{f,t}. \end{aligned}$$
(128)

Algorithm 4 gives a compact description of the subKFBS.
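For reference, a minimal sketch (again our own, with hypothetical names; the regularizer gamma is an assumption and not part of Eq. 126) of the subKFBS smoothing step:

```python
import numpy as np

def subkfbs_smoothing_step(n_f, P_f, n_b, P_b, gamma=1e-6):
    """Combine the subspace projections of the forward and backward beliefs (Eqs. 126-128)."""
    k = P_f.shape[0]
    Z_f = P_b @ np.linalg.inv(P_f + P_b + gamma * np.eye(k))  # Eq. 126 (regularized)
    n_s = Z_f @ n_f + (np.eye(k) - Z_f) @ n_b                 # Eq. 127
    P_s = Z_f @ P_f                                           # Eq. 128
    return n_s, P_s
```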


5.6 Experimental evaluation of the kernel forward–backward smoother

We evaluate the kernel forward–backward smoother in two experiments. In the first experiment, on data from a simulated pendulum, we show the performance gain of the KFBS over the KKF. In the second experiment, we apply the KFBS to data of a table tennis ball recorded with a camera-based tracking system and show how the KFBS and the subKFBS are able to restore the full trajectory of the ball while only having observations at the first four and at the last time step.

5.6.1 Pendulum

We simulate a pendulum similar to the one from Sect. 5.3.1; however, we initialize the pendulum in the range \([-0.25 \pi , 0.25 \pi ]\) and with an angular velocity sampled from the range \([-2 \frac{\pi }{s}, 2 \frac{\pi }{s}]\). During the simulation, we apply Gaussian process noise with \(\sigma = 0.01\), and as observations we use the angular displacement with additive Gaussian observation noise with \(\sigma = 0.2\). To learn the KFBS models, we sample 100 episodes of 30 steps each, where a step corresponds to 0.1 s. The training samples are 200 windows of four observations, which we select by the kernel activation heuristic explained in Sect. 3.1. To find the optimal parameters, we apply CMA-ES (Hansen 2006), using the negative log-likelihood of the ground truth under the smoothed estimate as the optimality criterion. During the optimization, we use a test data set of 10 episodes in which we observe at every time step. Afterwards, we evaluate the smoothing performance on an evaluation data set in which we do not observe at every time step but only at \(t = [1{-}4, 6, 11, 16, 21, 27{-}30]\). This optimization procedure yielded better results than directly optimizing with only partial observations.

In Fig. 13, we show a qualitative comparison of the forward and the backward pass to the smoother. The results are as expected: the forward pass yields better estimates in the first half of the episode, and the backward pass yields better estimates in the second half. The smoother combines both estimates and outperforms the filter results. The effect of smoothing can also be observed in the profiles of the standard deviation: while the variance of the filters grows at each time step without an observation until the next measurement arrives, the variance of the smoother is much smaller and rises only slightly between the observations.

Fig. 13

Qualitative comparison of the forward and the backward pass to the smoothed estimates of the KFBS on a simulated pendulum. The upper plots show the mean and variance output of the filter/smoother, the lower plots show the profiles of the standard deviation. While the forward pass already yields good estimates in the first half of the time series, the smoother incorporates the good estimates from the backward pass in the second half and outperforms the filters. In addition, the smoother is more confident about its estimates

In Fig. 14, we compare the performance of a standard KKF to the KFBS and the subKFBS for different kernel sizes on the same state estimation task for a simulated pendulum. The subKFBS has been learned with 300 samples in the full data set. Depicted are the median and the [0.15, 0.85]-quantiles of the MSE over 20 repetitions. The KFBS and the subKFBS clearly outperform the KKF for small kernel sizes (50 and 100 samples) and also yield better results for larger kernel sizes (150 and 200 samples). The subKFBS yields slightly better results than the KFBS. In addition, the quantiles of the MSE show that the KFBS and the subKFBS behave more stably during the optimization process than the KKF.

Fig. 14

The KFBS and the subKFBS clearly outperform the KKF for small kernel sizes, but also with more samples in the Gram matrices. The task was to estimate the state of a pendulum from noisy partial observations. Depicted are the median and the [0.15, 0.85]-quantiles of the MSE over 20 repetitions

5.6.2 Table tennis

In a second experiment, we perform smoothing on observations of a table tennis ball (Gomez-Gonzalez et al. 2016). The data set contains 54 trajectories of a table tennis ball tracked with a camera system; each trajectory contains 51 observations recorded at a frequency of 100 Hz. We train the subKFBS with the data of 34 trajectories and use 10 trajectories for optimizing the parameters with CMA-ES (Hansen 2006). The remaining 10 trajectories are used for evaluating the results. For the smoothing task, the ball is observed at the first five time steps and then again at the last time step.

Figure 15 shows qualitative examples of smoothed trajectories obtained with the subKFBS in comparison to the output of the subKKF. Here, we used data windows of size 4 and learned the models with 300 samples in the training data set and 100 samples in the subset. We optimized the regularization parameters and all bandwidths with CMA-ES (Hansen 2006). The plot shows how the subKFBS can accurately estimate the path of the ball from observations only at the beginning and at the end of the trajectory, while the subKKF diverges from the actual trajectory over time. In particular, the impact position of the ball on the table is estimated much better by the subKFBS than by the subKKF.

We also compare the KFBS to the subKFBS for different kernel sizes on the same smoothing task on recorded table tennis ball data. Figure 16 shows a comparison of the MSE, depicting the median and the [0.05, 0.95]-quantiles over 20 repetitions. The KFBS has been learned with a varying kernel size of 50, 100, 150, and 200 samples. The subKFBS uses the same number of samples to span the subspace but always learns the models with 400 samples in the full training set. While the subKFBS outperforms the KFBS for all kernel sizes, the KFBS achieves a performance similar to the subKFBS when learned with 200 samples.

Fig. 15

In comparison to the subKKF, the subKFBS is able to reconstruct the trajectories of a table tennis ball from observations at the first five and at the last two time steps. The plot shows the z-coordinate of two trajectories of a table tennis ball recorded with a camera-based tracking system. We learned the subKFBS/subKKF with 100 samples in the subspace and 300 points in the training set

Fig. 16

The subKFBS performs better than the KFBS in the table tennis ball smoothing task. The difference in the MSE between the estimates and the noisy recorded data is more pronounced for small kernel sizes and decreases with the number of samples in the Gram matrices. Depicted are the median and the [0.05, 0.95]-quantiles of the MSE over 20 repetitions

6 Conclusion and future work

In this paper, we have presented the kernel Kalman rule (KKR) as an alternative to the kernel Bayes' rule (KBR) in the framework for nonparametric inference (Song et al. 2013). In contrast to the KBR, the KKR is computationally more efficient, numerically more stable, and follows from a clear optimization objective. We have further combined the KKR as Bayesian update with the kernel sum rule to formulate the kernel Kalman filter (KKF). The KKF can be applied to nonlinear state estimation tasks as it learns the probabilistic transition and observation dynamics from data as linear operations on embeddings of the belief state in high-dimensional Hilbert spaces. In contrast to existing kernel Kalman filter formulations, the KKF provides a more general formulation that is much closer to the original Kalman filter equations and can also be applied to partially observable systems.

While the KKF can be applied to state estimation and prediction based on past observations, we have extended this work by introducing the kernel forward–backward smoother (KFBS), which infers the belief state from past, current, and future information. We have shown in an experimental evaluation how this additional information leads to a performance gain of the KFBS over the KKF. As kernel methods typically scale poorly with the number of data points in the kernel matrices, we have introduced a sparsification technique that leverages the full training set while representing the embeddings with only a small subset of the data. This technique leads to significant gains in computational efficiency while yielding similar or even slightly better results than without the sparsification.

We have shown that it is possible to learn the kernel Kalman rule and other kernelized inference methods from partial observations if sliding windows of the time series provide sufficient statistics. In future work, we want to concentrate on learning the transition dynamics in the RKHS with an expectation-maximization algorithm in the case of missing information about the latent state, as we think that this leads to better models of the dynamics and also improves the accuracy of the estimated variance.