1 Introduction

We pose the problem, describe its background and give a brief sketch of earlier approaches. After that we explain our approach and contribution to the literature and outline the organization of the present paper.

1.1 Problem description and background

When using stochastic models in a practical environment, model parameters need to be estimated, which turns out to be a very challenging problem. In the present paper we present a method based on a combination of the Kalman and particle filters to do so. In statistics one can discern two paradigms, Bayesian and frequentist, with ensuing Bayesian and frequentist estimation methods. Among the latter, maximum likelihood estimation (MLE) is a classical one. Such methods can also be categorized as online or offline depending on whether the data are used sequentially or in batches of observations. The MLE approach is to find the estimate which maximizes the likelihood function of the observed data. The Bayesian approach, however, considers the parameters as random variables. Prior distributions of these, reflecting prior knowledge of the parameters, are updated by the observations through the likelihood, resulting in posterior distributions.

In applications, the MLE based on offline methods is often linked to the Kalman filter or its modifications such as the extended Kalman filter, see Einicke and White (1999) and Wan and Nelson (2002), or the unscented Kalman filter, see Wan and van der Merwe (2002), because these algorithms can compute or approximate the likelihood function analytically. However, a common problem of the MLE calibration is that the likelihood function is usually not convex. Hence the numerical optimization of the likelihood often ends up at a local maximum instead of the global maximum. This problem can be even more severe when dealing with models with many parameters, such as multi-factor Hull–White and Vasiček models, popular in interest rate modeling. Originally, the MLE method requires static model parameters, while in reality the model parameters, such as volatility in financial models, could change over time. Later on, likelihood based methods have been developed to cope with this situation as well.
Change point methods, see for example Nemeth et al. (2014), have been developed to address abrupt changes of parameters, but they require a separate model to determine when the model parameters change in the time series, which could increase the complexity of the model. An alternative is to use online methods, which have received more and more attention in recent decades.

One attempt to solve the problem of estimating static parameters online has been to include simulations (particles) of parameter values. One then has a particle filter, see for example Doucet et al. (2000), Gordon et al. (1993), Kitagawa (1996), Liu and Chen (1998) and Kantas et al. (2015) for a survey. However, through successive time steps this approach can quickly lead to what is called particle degeneracy of the parameter space. One solution to this degeneracy problem is to use a kernel density to estimate the posterior distribution of the parameters from which new parameter particles can be drawn at each time step (Liu and West 2001). However, the convergence of the algorithm introduced in this latter paper has not been studied, and such a method could deliver poor performance in some simple set-ups other than the low dimensional case (Miguez et al. 2005).

In recent years, some new methods have been proposed to deal with the online parameter estimation problem, including the iterated batch importance sampling (IBIS), see Chopin (2002), the sequential Monte Carlo squared (\(\text{SMC}^2\)) simulation, see Chopin et al. (2013), and the recursive nested particle filter (RNP filter, also RNPF in short), see Crisan and Míguez (2018). The \(\text{SMC}^2\) and the RNPF use two layers of Monte Carlo methods to overcome certain difficulties with the IBIS method, see Papavasiliou (2006). An important difference between \(\text{SMC}^2\) and the RNPF is that the \(\text{SMC}^2\) is a non-recursive method, whereas the RNPF is recursive (see Ljung and Söderström 1983 for a definition of a recursive (online) algorithm for the estimation of a given model parameter). Hence in general, RNPF is more efficient than \(\text{SMC}^2\).

In Crisan and Míguez (2018) the posterior measure of the parameters estimated by an RNPF algorithm is shown to converge to the actual measure in \(L^p\)-norm with rate \(N^{-1/2}+M^{-1/2}\), where N is the number of particles for the parameter estimation in the outer layer and \(N\times M\) is the number of particles for the state variables in the inner layer. The RNPF has some drawbacks for a practical application. One is that the computation of the two Monte Carlo layers is very time consuming, another one is that the RNPF requires that the parameter mutation size is small enough. As a consequence the RNPF converges very slowly to the actual value of the parameters and hence requires a very long time series of data, which is often not available in applications. To obtain a faster yet still accurate algorithm, we propose an algorithm that combines the Kalman filter and a particle filter together with a so-called jittering kernel. The mixture of a Kalman filter and a particle filter was previously considered in Andrieu and Doucet (2002) and Chen and Liu (2000) for conditional linear-Gaussian systems, although the focus in these papers is different, namely on filtering with known parameters, while in the present paper we focus on parameter estimation. Other algorithms have been proposed by Stroud et al. (2018) that combine an Ensemble Kalman filter for state estimation with various approximations (one involving a particle filter too) for the updates of the parameter posterior. Their performance has only been evaluated numerically in examples. We apply our algorithm to well chosen non-Gaussian non-linear models and we also provide a convergence analysis of the parameter estimators, an issue not treated in Andrieu and Doucet (2002), Chen and Liu (2000) and Stroud et al. (2018). More explanation follows in the next section.

1.2 Contribution

In this paper, we consider joint parameter and state estimation for a state space model where the state evolves continuously in time, whereas the observations are sampled at discrete time instants. We use a Bayesian online approach to parameter estimation. We propose an algorithm which combines the Kalman filter and a particle filter for online estimation of the posterior distribution of the unknown parameters. This algorithm has a structure similar to that of the RNPF: it is a semi-recursive algorithm that also has a two-layer structure. The inner layer provides the approximation of the posterior distribution of the state variables conditioned on the parameter particles generated in the outer layer, while the outer layer provides an approximation of the posterior distribution of the parameters by using the outcome of the inner layer.

Our proposed methodology has two main differences when compared to the RNPF algorithm. One difference is that in the inner layer, the posterior distribution of the state variables is estimated by the Kalman filter instead of a particle filter. The implementation of the Kalman filter reduces the computational complexity and hence results in a much faster and more robust algorithm. The second difference is in the outer layer. In the RNPF the parameter samples are generated from a certain kernel function. In order to obtain a recursive algorithm, some requirements on the kernel function are introduced. This results in a kernel that significantly reduces the convergence speed of the RNPF. We overcome this problem by using dynamic jittering kernels. Specifically, in this paper we implement two different kernel functions. One is applied at the beginning stage to obtain a higher convergence speed. The consequence, however, is that the algorithm is not recursive at this beginning stage since this kernel function does not satisfy the requirements of a recursive algorithm. The other kernel is applied once the variance of the parameter particles has decreased to a level at which this kernel function satisfies the conditions for a recursive method. From that time on, the algorithm is truly recursive. From the numerical experiments we performed, we observe that the variance of the particles decreases very fast at the beginning stage, usually within a few hundred steps. Hence by using these two different kernel functions, the algorithm converges much faster than the RNPF.

This paper also provides theoretical results on the asymptotic behavior of the proposed algorithm. When dealing with linear and Gaussian state space models, we show that our algorithm converges with a speed of order \({\mathcal {O}}(N^{-1/2})\), where N is the number of particles for the parameter space. When dealing with non-Gaussian or non-linear models, the Kalman filter in the inner layer could produce a biased estimate of the posterior distribution of the state variables. This makes it in general difficult to study the convergence of the posterior distribution of the parameters. Although it is shown in Pérez-Vieites et al. (2018) that, under certain assumptions, the bias introduced in the inner layer makes the posterior distribution of the parameters converge to a biased distribution, this bias is intractable in general. In this paper, for models with a well chosen structure (affine models for interest rates), we show that the estimated distributions of the parameters and the states converge to the actual distributions in \(L^p\) with rate of order \({\mathcal {O}}(N^{-1/2}+\varDelta ^{1/2})\) under certain regularity assumptions, where \(\varDelta \) is the maximum time step between consecutive observations. Note that we do not have to deal with particles in the inner layer, which removes the term of order \(M^{-1/2}\) that appears in the convergence rate of the RNPF. Our proofs are inspired by those in Crisan and Míguez (2018), but at crucial steps we obtain novel results. These are due to the use of the Kalman filter in one of the layers and to the size of the time discretization that governs the observations of the continuous-time system, the latter not playing a role in the setting of the cited reference.

To illustrate the performance of the algorithm, we present numerical results of the parameter estimation on several affine interest rate models, some allowing for stochastic volatility, including the Vasiček model, also known as the two-factor Hull–White model with constant parameters, and the Cox–Ingersoll–Ross (CIR) model. For the CIR model we have also implemented the RNPF and we observed that our algorithm outperforms the RNPF. Although the algorithm is designed for static parameter estimation, it can also be used to estimate parameters that undergo sudden changes in value. We present an implementation of the algorithm in such a situation, and we observe that the algorithm is able to quickly track such a sudden change.

1.3 Organization of the paper

In Sect. 2 we present the state space model of interest. This section also provides brief reviews on Bayesian filters, including the Kalman filter and the particle filter, and online parameter estimation using particle filters. Section 3 contains an encompassing framework for various affine models that are used in interest rate modeling and to which we apply our proposed Kalman particle algorithm, which is introduced in Sect. 4. In Sect. 5 we provide the convergence analysis and in Sect. 6 the numerical results are presented. Finally, Sect. 7 is devoted to the conclusions. In the “Appendix” we collect some background results on affine processes.

1.4 Notation

Let \(d\ge 1\), \(S \subseteq {\mathbb {R}}^d\), and \({\mathcal {B}}(S)\) be the sigma algebra of Borel subsets of S. We denote by \({\mathbf {1}}_A\) the indicator function on \(A \in {\mathcal {B}}(S)\) and by \(\delta _x\) the Dirac measure for a given \(x \in S\), i.e.,

$$\begin{aligned} \delta _{x}(A) = {\mathbf {1}}_A(x)= {\left\{ \begin{array}{ll} 1, &{} \text{ if } \ x \in A\,,\\ 0, &{} \text{ otherwise }\,. \end{array}\right. } \end{aligned}$$

Suppose given a function \(f: S\rightarrow {\mathbb {R}}\) and a probability measure \(\mu \) on \((S, {\mathcal {B}}(S))\). We denote the integral of f w.r.t. \(\mu \) by \((f,\mu ) := \int _S f(x)\,\mu (\mathrm {d}x)\) and the supremum norm of f by \(\Vert f\Vert _\infty = \sup _{x \in S} |f(x)|\). In the case of a conditional measure \(\nu (\cdot \mid y)\), \(y \in S\), defined on \((S, {\mathcal {B}}(S))\), we use the notation \((f,\nu (\cdot \mid y)) := \int _S f(x)\,\nu (\mathrm {d}x\mid y)\).

We use the notation \(x_{0:k} := (x_{0},\ldots ,x_k)\) for a discrete-time sequence up to time k of a process \((x_k)_{k \in {\mathbb {N}}}\). By \(\cdot ^\top \), we denote the transpose of a vector or a matrix. The Euclidean norm of an element \(x \in {\mathbb {R}}^d\) is denoted by \(\Vert x\Vert \) and the \(L^p\)-norm, for \(p\ge 1\), of a random variable X, defined on some probability space \((\varOmega ,{\mathcal {F}},{\mathbb {P}})\), is denoted by \(\Vert X\Vert _p = ({\mathbb {E}}|X|^p)^{1/p}\). Densities of random variables or vectors X (always assumed to exist w.r.t. the Lebesgue measure) are often denoted p, or p(x), and conditional densities of X given \(Y=y\) are often denoted \(p(x\mid y)\), possibly endowed with sub- or superscripts.

2 Set up and background on parameters estimation using filters

In this section we outline the set-up, formulate the problem, give a brief survey of various filters (Bayesian, Kalman, particle filter) and address the parameter estimation problem using particle filters. Time is assumed to be discrete.

2.1 Discrete-time state space model

We consider the following general state space model, defined on some probability space \((\varOmega ,{\mathcal {F}},{\mathbb {P}})\).

$$\begin{aligned} \begin{aligned} x_k&=f_k(x_{k-1},u_k)\,, \quad k \in {\mathbb {N}}^+,\\ y_k&=h_k(x_k,v_k)\,, \quad k \in {\mathbb {N}}^+\,, \end{aligned} \end{aligned}$$
(2.1)

where \(f_k:{\mathbb {R}}^d \times {\mathbb {R}}^d \rightarrow {\mathbb {R}}^d\), \(h_k: {\mathbb {R}}^d \times {\mathbb {R}}^d \rightarrow {\mathbb {R}}^m\) are given functions and \(\{u_k\}_{k\in {\mathbb {N}}^+}\) and \(\{v_k\}_{k\in {\mathbb {N}}^+}\) are d-dimensional strong white noise processes, so independent sequences, both independent of the initial condition \(x_0\), and mutually independent as well. Parameters in the functions \(f_k\) and \(h_k\), together with the covariances of \(u_k\) and \(v_k\), can be seen as the parameters of the state space model, to which we refer as \(\theta \).

It follows that the model (2.1) satisfies the properties of a stochastic system, i.e. at every (present) time \(k\ge 1\) the future states and future observations \((x_j,y_j)\), \(j\ge k\), are conditionally independent of the past states and observations \((x_j,y_{j-1})\), \(j\le k\), given the present state \(x_k\), see van Schuppen (1989). It then follows that \(\{x_k\}_{k \in {\mathbb {N}}}\) is a Markov process, and for every \(k\ge 1\) one has that \(y_k\) and \(y_{1:k-1}\) are conditionally independent given \(x_k\); in terms of densities,

$$\begin{aligned} p(y_k\mid y_{1:k-1},x_k)=p(y_k\mid x_k)\,,\quad \text{ for }\quad k \in {\mathbb {N}}^+\,. \end{aligned}$$
(2.2)

Moreover, one also has, for every \(k\ge 1\), that \(x_k\) and \(y_{1:k-1}\) are conditionally independent given \(x_{k-1}\); in terms of densities,

$$\begin{aligned} p(x_k\mid x_{k-1},y_{1:k-1})=p(x_k\mid x_{k-1}). \end{aligned}$$
(2.3)

The latter equation has the consequence

$$\begin{aligned} p(x_k\mid y_{1:k-1})=\int p(x_k\mid x_{k-1})p(x_{k-1}\mid y_{1:k-1})\mathrm {d}x_{k-1}. \end{aligned}$$
(2.4)

We are interested in estimating the (latent) state process \(\{x_k\}_{k \in {\mathbb {N}}^+}\), but only have access to the process \(\{y_k\}_{k \in {\mathbb {N}}^+}\) which represents the observations. Because of the existence of the white noise in the data, estimating the value of the latent states \(\{x_k\}_{k \in {\mathbb {N}}^+}\) by the observations \(\{y_k\}_{k \in {\mathbb {N}}^+}\) is not trivial. There are different methodologies in the literature to estimate the latent process (see e.g. Press 2003; Chui and Chen 2017; Arulampalam et al. 2002). We introduce some of these methodologies in our paper since we will need them in our analysis later. We first introduce the Bayesian filter.

2.2 Bayesian filter of discrete-time Markovian state space model

The Bayesian filter, see e.g. Press (2003), Robert (2007) for an overview, is used to estimate the latent states \(\{x_k\}_{k \in {\mathbb {N}}^+}\) in (2.1) given the parameter \(\theta \). We define the initial probability measure \(\pi _{0}\) of \(x_{0}\), and the transition measure \(\pi _k^{\theta }\) of \(x_k\) under a given parameter \(\theta \) at time k by

$$\begin{aligned} \begin{aligned} \pi _{0}(A)&= {\mathbb {P}}(x_{0}\in A),\\ \pi _k^{\theta }(A\mid x_{k-1})&= {\mathbb {P}}(x_k\in A\mid x_{k-1},\theta ), \qquad k\in {\mathbb {N}}^+\,, \end{aligned} \end{aligned}$$
(2.5)

where \(A\in {\mathcal {B}}({\mathbb {R}}^{d})\) is a Borel set.

The methodology in Bayesian filtering consists of two parts: prediction and update. At every time point k, the prediction part computes (estimates) the prior measure of \(x_k\) (at time k given the past observations up to time \(k-1\)) and the update part computes (estimates) the posterior measure of \(x_k\) given the past up to time k, respectively given by

$$\begin{aligned} \begin{aligned} \gamma _k^{\theta }(\mathrm {d}x_k)&= {\mathbb {P}}(\mathrm {d}x_k\mid y_{1:k-1},\theta )\,, \\ \varGamma _k^{\theta }(\mathrm {d}x_k)&= {\mathbb {P}}(\mathrm {d}x_k\mid y_{1:k}, \theta )\,, \qquad k\in {\mathbb {N}}^+\,. \end{aligned} \end{aligned}$$
(2.6)

Using Bayes’ rule, we deduce that the density function of the prior distribution is given by

$$\begin{aligned} \begin{aligned} p(x_k\mid y_{1:k-1}, \theta )&= \int p(x_k\mid x_{k-1},y_{1:k-1}, \theta )p(x_{k-1}\mid y_{1:k-1}, \theta )\, \mathrm {d}x_{k-1}\\&= \int p(x_k\mid x_{k-1}, \theta )p(x_{k-1}\mid y_{1:k-1}, \theta )\, \mathrm {d}x_{k-1}\,, \end{aligned} \end{aligned}$$

where we used (2.3) to get the last equality. This implies the relation

$$\begin{aligned} \gamma _k^\theta (\mathrm {d}x_k) =\int \pi _k^\theta (\mathrm {d}x_k\mid x_{k-1})\varGamma _{k-1}^\theta (\mathrm {d}x_{k-1}). \end{aligned}$$
(2.7)

Let \(f: {\mathbb {R}}^d\rightarrow {\mathbb {R}}\) be an integrable function w.r.t. the measure \(\gamma _k^\theta \). Then we get by Fubini’s theorem

$$\begin{aligned} (f,\gamma _k^{\theta })&= \int \int f(x_k)\pi _k^{\theta }(\mathrm {d}x_k\mid x_{k-1}) \varGamma _{k-1}^{\theta }(\mathrm {d}x_{k-1})\\&= \int (f,\pi ^\theta _k(\cdot \,\mid x_{k-1}))\varGamma _{k-1}^{\theta }(\mathrm {d}x_{k-1}), \end{aligned}$$

which we abbreviate by

$$\begin{aligned} (f,\gamma _k^{\theta })= ((f,\pi _k^{\theta }),\varGamma _{k-1}^{\theta }). \end{aligned}$$
(2.8)

The purpose of the Bayesian algorithm is to sequentially compute the posterior measure \(\varGamma _k^{\theta }\). Let

$$\begin{aligned} l^{\theta }_{y_k}(x)=p(y_k\mid x,\theta ) \end{aligned}$$

be the density (with some abuse of statistical terminology we often also call it likelihood) of the realized observation \(y_k\) conditional on the state value \(x_k=x\) and the model parameter \(\theta \). Then using Bayes’ rule, (2.2) and (2.7), we obtain for a function f that is integrable w.r.t. \(\varGamma _k^{\theta }\)

$$\begin{aligned} (f,\varGamma _k^{\theta })&= \int f(x_k)p(x_k\mid y_{1:k}, \theta )\, \mathrm {d}x_k\\&= \int f(x_k) \frac{p(x_k,y_k,y_{1:k-1}\mid \theta )}{p(y_{1:k}\mid \theta )}\, \mathrm {d}x_k\\&= \frac{\int f(x_k)p(y_k\mid x_k,y_{1:k-1}, \theta )p(x_k\mid y_{1:k-1}, \theta )\, \mathrm {d}x_k}{p(y_k\mid y_{1:k-1}, \theta )}\\&= \frac{\int f(x_k)p(y_k\mid x_k, \theta )p(x_k\mid y_{1:k-1}, \theta )\, \mathrm {d}x_k}{\int p(y_k\mid x_k, \theta )p(x_k\mid y_{1:k-1}, \theta )\, \mathrm {d}x_k}\\&= \frac{\int f(x_k)l^\theta _{y_k}(x_k)p(x_k\mid y_{1:k-1}, \theta )\, \mathrm {d}x_k}{\int l^\theta _{y_k}(x_k)p(x_k\mid y_{1:k-1}, \theta )\, \mathrm {d}x_k}\\&= \frac{\int f(x_k)l^\theta _{y_k}(x_k)\int \pi _k^\theta (\mathrm {d}x_k\mid x_{k-1})\varGamma _{k-1}^\theta (\mathrm {d}x_{k-1})}{\int l^\theta _{y_k}(x_k)\int \pi _k^\theta (\mathrm {d}x_k\mid x_{k-1})\varGamma _{k-1}^\theta (\mathrm {d}x_{k-1})}, \end{aligned}$$

which we abbreviate, similar to (2.8), by

$$\begin{aligned} (f,\varGamma _k^{\theta })= \frac{((fl^{\theta }_{y_k},\pi _{{k}}^{\theta }),\varGamma _{k-1}^{\theta })}{((l^{\theta }_{y_k},\pi _{{k}}^{\theta }),\varGamma _{k-1}^{\theta })}. \end{aligned}$$
(2.9)

If we assume the likelihood function \(l^{\theta }_{y_k}\) and the transition measure \(\pi _{{k}}^{\theta }\) are known, then given the posterior measure \(\varGamma _{k-1}^{\theta }\), we can use Eq. (2.9) to compute the posterior measure \(\varGamma _{k}^{\theta }\). In this way the posterior measure \(\{\varGamma _{k}^{\theta }\}_{k\in {\mathbb {N}}^+}\) can be computed recursively. Moreover, using (2.2) again, the conditional likelihood \(p(y_k\mid y_{1:k-1},\theta )\) and the likelihood \(p(y_{1:k}\mid \theta )\) can be respectively computed as

$$\begin{aligned} p(y_k\mid y_{1:k-1},\theta )&= \int p(y_k\mid x_{k},y_{1:k-1},\theta ) p(x_k\mid y_{1:k-1},\theta )\, \mathrm {d}x_k\nonumber \\&= \int p(y_k\mid x_k,\theta )p(x_k\mid y_{1:k-1},\theta )\, \mathrm {d}x_k\nonumber \\&= (l^{\theta }_{y_k},\gamma _{k}^{\theta }), \end{aligned}$$
(2.10)

and

$$\begin{aligned} p(y_{1:k}\mid \theta )&= p(y_{1}\mid \theta )\prod _{i=2}^k p(y_{i} \mid y_{1: i-1},\theta ) \\&= p(y_{1}\mid \theta )\prod _{i=2}^k (l^{\theta }_{y_{i}},\gamma _{i}^{\theta })\,. \end{aligned}$$
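
To make the prediction and update recursion concrete, the following minimal sketch carries out one step of (2.7), (2.10) and (2.9) for a fixed \(\theta \). It assumes, purely for illustration, that the state takes values in a finite set \(\{s_1,\ldots ,s_n\}\), so that all integrals reduce to finite sums; the function name `bayes_filter_step` and the numbers in the example are hypothetical and not part of the model (2.1).

```python
def bayes_filter_step(posterior_prev, transition, lik):
    """One prediction/update step of the Bayesian filter on a finite state space.

    posterior_prev[i] : Gamma_{k-1}({s_i}), posterior weights at time k-1
    transition[i][j]  : pi_k({s_j} | s_i), transition probabilities
    lik[j]            : l_{y_k}(s_j) = p(y_k | x_k = s_j) for the new observation
    """
    n = len(posterior_prev)
    # Prediction, cf. (2.7): integrate the transition measure against Gamma_{k-1}.
    prior = [sum(posterior_prev[i] * transition[i][j] for i in range(n))
             for j in range(n)]
    # Conditional likelihood, cf. (2.10): p(y_k | y_{1:k-1}) = (l_{y_k}, gamma_k).
    cond_lik = sum(l * g for l, g in zip(lik, prior))
    # Update, cf. (2.9): reweight the prior by the likelihood and normalise.
    posterior = [l * g / cond_lik for l, g in zip(lik, prior)]
    return prior, posterior, cond_lik


# Hypothetical two-state example (all numbers are illustrative only).
prior, posterior, cond_lik = bayes_filter_step(
    posterior_prev=[0.5, 0.5],
    transition=[[0.9, 0.1], [0.2, 0.8]],
    lik=[0.7, 0.1],
)
```

Multiplying the returned `cond_lik` values over successive steps gives the likelihood \(p(y_{1:k}\mid \theta )\) of the factorization above.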

When (2.1) is a linear Gaussian model, then the Bayesian filter is equivalent to the Kalman filter, which we briefly review in the next subsection.

2.3 Kalman filter

We assume that the state and observations in (2.1) evolve according to a linear Gaussian model. That is the functions \(f_k\) and \(h_k\) have to take linear forms as follows

$$\begin{aligned} \begin{aligned} x_k&= F_k x_{k-1} + u_k\,,\\ y_k&= H_k x_k + v_k\,, \qquad k\in {\mathbb {N}}^+\,, \end{aligned} \end{aligned}$$
(2.11)

where \(F_k\) is a \(d\times d\) matrix, \(H_k\) is a \(m\times d\) matrix and the noise terms \(u_k\) (d-dimensional), \(v_k\) (m-dimensional) are assumed to be Gaussian with mean 0 and covariance matrices \(Q_k\), \(R_k\), respectively. Moreover, the initial state \(x_{0}\) is assumed to be Gaussian. Due to the Gaussian assumptions and the linear structure of the model in (2.11), one can derive analytic expressions for the prior and posterior measures defined in (2.6), and the Kalman filter algorithm, see e.g. Chui and Chen (2017), Grewal and Andrews (2015), yields the exact solution to the estimation problem.

Denote by \(N(\mathrm {d}x;\mu ,\varSigma )\) or \(N(\mu ,\varSigma )\) the Gaussian distribution with mean \(\mu \) and covariance \(\varSigma \). We also use the generic notation \(N(x;\mu ,\varSigma )\) to denote the density at x of this normal distribution. Recall from (2.6) the prior and posterior measures and denote by \(A_{k-1}\) and \(P_{k-1}\), respectively, the mean and the covariance of the posterior measure at time \({k-1}\). Recall also that we denote by \(\theta \) the vector of all the parameters involved, i.e. those in \(F_k\) and \(H_k\) for model (2.11). Then the prior measure is given by

$$\begin{aligned} \gamma _k^{\theta } (\mathrm {d}x_k) = N(\mathrm {d}x_k; F_kA_{k-1}, F_kP_{k-1}F_k^\top +Q_k)\,, \end{aligned}$$

which implies that the prior measure is a conditionally Gaussian measure with mean and covariance respectively given by

$$\begin{aligned} A_k^- = F_kA_{k-1}\,, \qquad P_k^- = F_kP_{k-1}F_k^\top +Q_k\,. \end{aligned}$$

Moreover, the posterior measure is given by

$$\begin{aligned}&\varGamma _k^{\theta } (\mathrm {d}x_k) = N(\mathrm {d}x_k; A_{k}, P_k)\,, \end{aligned}$$
(2.12)

where

$$\begin{aligned} A_{k}= & {} A_k^- + P_{{k}}^- H_k^\top (H_kP_{{k}}^-H_k^\top + R_k)^{-1}(y_k-H_kA_k^-)\,, \\ P_k= & {} P_{{k}}^- - P_{{k}}^- H_k^\top (H_kP_{{k}}^-H_k^\top + R_k)^{-1}H_kP_{{k}}^-\,. \end{aligned}$$

Finally, the conditional likelihood is given by

$$\begin{aligned} p(y_k\mid y_{1:k-1},\theta ) = N(y_k; H_kA_k^-,H_kP_{{k}}^-H_k^\top + R_k)\,. \end{aligned}$$
(2.13)

Let \(S_k = H_kP_{{k}}^-H_k^\top + R_k\), \(k\in {\mathbb {N}}^+\). Then, taking the logarithm of the Gaussian density in (2.13), we obtain the recursion for the log-likelihood of the observations \(\log p(y_{1:k}\mid \theta )\) as follows,

$$\begin{aligned} \log p(y_{1:k}\mid \theta )&=\log p(y_{1:k-1}\mid \theta ) \\&\qquad - \frac{1}{2}\left( m\log 2\pi + \log (\det (S_k))\right) \\&\qquad -\frac{1}{2} (y_k-H_kA_{{k}}^- )^\top S_k^{-1} (y_k-H_kA_{{k}}^-)\,, \end{aligned}$$

where m is the dimensionality of the data \(y_k\), \(k\in {\mathbb {N}}^+\). Hence, by maximizing the likelihood of the observations, one can determine the optimal parameters of the linear Gaussian system (2.11).
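
The Kalman recursion and the accumulation of the log-likelihood can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: it assumes the scalar case \(d=m=1\) with time-constant coefficients, so that all matrices reduce to numbers, and the names `kalman_step` and `log_likelihood` as well as the default initial values are hypothetical.

```python
import math


def kalman_step(a_prev, p_prev, y, F, H, Q, R):
    """One predict/update step of the scalar (d = m = 1) Kalman filter."""
    # Prediction: mean and variance of the prior measure gamma_k.
    a_pred = F * a_prev
    p_pred = F * p_prev * F + Q
    # Innovation variance S_k and the resulting gain.
    s = H * p_pred * H + R
    gain = p_pred * H / s
    # Update: mean and variance of the posterior measure Gamma_k.
    innovation = y - H * a_pred
    a_post = a_pred + gain * innovation
    p_post = p_pred - gain * H * p_pred
    # Log of the Gaussian conditional likelihood p(y_k | y_{1:k-1}, theta), cf. (2.13).
    loglik = -0.5 * (math.log(2.0 * math.pi) + math.log(s)) - 0.5 * innovation ** 2 / s
    return a_post, p_post, loglik


def log_likelihood(ys, F, H, Q, R, a0=0.0, p0=1.0):
    """Accumulate log p(y_{1:k} | theta) over the observations ys."""
    a, p, total = a0, p0, 0.0
    for y in ys:
        a, p, increment = kalman_step(a, p, y, F, H, Q, R)
        total += increment
    return total
```

For a candidate parameter value, `log_likelihood(ys, F, H, Q, R)` returns the quantity that an offline MLE routine would maximize over \(\theta \).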

For most non-linear non-Gaussian models, it is not possible to compute the prior and posterior measures analytically and numerical methods are called for. In this case, the particle filter, which we introduce in the next subsection, is widely used.

2.4 Particle filter

In the particle filter, see e.g. Arulampalam et al. (2002), Cappé et al. (2007) and Doucet and Johansen (2009), the prior and posterior distributions are estimated by a Monte Carlo method. With a Monte Carlo method, a certain measure \(\mu \) is generally estimated by

$$\begin{aligned} \mu ^N(\mathrm {d}x)= \sum _{i=1}^{N}a^{(i)}\delta _{x^{(i)}}(\mathrm {d}x)\,, \end{aligned}$$

where \(\{x^{(i)}, i=1,\cdots ,N\}\) are i.i.d. random samples from a so-called importance density and \(\{a^{(i)}, i=1,\cdots ,N\}\) are the importance weights. The key part of the particle filter is to choose the importance density and compute the importance weights, see e.g. Doucet (1997). For the general state space model (2.1), suppose the posterior measure \(\varGamma _{k-1}^{\theta }\) at time \({k-1}\) is estimated by

$$\begin{aligned} {\varGamma }_{k-1}^{\theta }\approx \sum _{i=1}^{N}a_{k-1}^{(i)}\delta _{x_{k-1}^{(i)}}. \end{aligned}$$

If at time k, the samples \({\tilde{x}}_k^{(i)}\) are generated from the transition measure \(\pi _k^{\theta }(\mathrm {d}x\mid x_{k-1}^{(i)})\) for \(i=1,\cdots ,N\), then using Eq. (2.8), the integral \((f,\gamma _k^{\theta })\) can be estimated by

$$\begin{aligned} (f,\gamma _k^{\theta })\approx \sum _{i=1}^N a_{k-1}^{(i)}f({\tilde{x}}_k^{(i)}). \end{aligned}$$

Moreover, using Eq. (2.9), the prior and posterior measures are respectively estimated by

$$\begin{aligned} \begin{aligned} \gamma _k^{\theta }&\approx \sum _{i=1}^N a_{k-1}^{(i)}\delta _{{\tilde{x}}_k^{(i)}},\\ \varGamma _k^{\theta }&\approx \sum _{i=1}^N a_{k}^{(i)}\delta _{{\tilde{x}}_k^{(i)}} \end{aligned} \end{aligned}$$
(2.14)

and from (2.10), we deduce the following approximation for the conditional likelihood

$$\begin{aligned} p(y_k\mid y_{1:k-1}, \theta )\approx \sum _{i=1}^N a_{k-1}^{(i)}l^{\theta }_{y_k}({\tilde{x}}_k^{(i)})\,. \end{aligned}$$
(2.15)

Consequently, the integral \((f,\varGamma _k^{\theta })\) can be estimated by

$$\begin{aligned} \begin{aligned} (f,\varGamma _k^{\theta })&\approx \frac{\sum _{i=1}^N a_{k-1}^{(i)}l^{\theta }_{y_k} ({\tilde{x}}_k^{(i)})f({\tilde{x}}_k^{(i)})}{\sum _{i=1}^N a_{k-1}^{(i)} l^{\theta }_{y_k}({\tilde{x}}_k^{(i)})}\\&= \sum _{i=1}^N a_{k}^{(i)}f({\tilde{x}}_k^{(i)})\,, \end{aligned} \end{aligned}$$

where the weights \(a_{k}^{(i)}\) are defined by

$$\begin{aligned} a_{k}^{(i)} = \frac{a_{k-1}^{(i)}l^{\theta }_{y_k}({\tilde{x}}_k^{(i)})}{\sum _{i=1}^N a_{k-1}^{(i)}l^{\theta }_{y_k}({\tilde{x}}_k^{(i)})}. \end{aligned}$$
(2.16)

Eqs. (2.14) and (2.15) show how to sequentially estimate the posterior measure \(\varGamma _k^{\theta }\) using the Monte Carlo method. This type of particle filter is often referred to as sequential particle filter. It is a specific member of the family termed the bootstrap particle filter (see Gordon et al. 1993). In Doucet (1997) it is shown that the variance of the importance weights increases stochastically over time. This will lead the importance weights to be concentrated on a small number of sampled particles. This problem is called degeneracy. To address the rapid degeneracy problem, the sampling importance resampling (SIR) method, see e.g. Doucet (1997), Pitt and Shephard (1999), is introduced to eliminate the samples with low importance weight and multiply the samples with high importance weight. In SIR, once the approximation of the posterior measure \(\varGamma _k^{\theta }\approx \sum _{i=1}^N a_{k}^{(i)}\delta _{{\tilde{x}}_k^{(i)}}\) is obtained, new, re-sampled, particles \({x}_k^{(j)}\) are i.i.d. sampled from this approximated measure, i.e. every \(x_k^{(j)}\) is independently chosen from the \({\tilde{x}}_k^{(i)}\) with probabilities \(a_k^{(i)}\), for \(i=1,\cdots ,N\). This step can be accomplished by sampling indices from \(\{1,\ldots ,N\}\) with probabilities \(a_{k}^{(i)},i=1,\cdots ,N\). Then the new estimate of the posterior measure \(\varGamma _k^{\theta }\) is given by

$$\begin{aligned} \varGamma _k^{\theta } \approx \frac{1}{N}\sum _{i=1}^N\delta _{{x}_k^{(i)}} \end{aligned}$$

and the new estimate of the conditional likelihood is

$$\begin{aligned} p(y_k\mid y_{1:k-1}, \theta )\approx \frac{1}{N}\sum _{i=1}^N l^{\theta }_{y_k}({x}_k^{(i)})\,. \end{aligned}$$
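
The propagation, weighting and resampling steps described above can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the function `sir_step`, the scalar autoregressive example model and all numerical values are hypothetical, and multinomial resampling is performed at every step.

```python
import math
import random


def sir_step(particles, weights, y, sample_transition, likelihood, rng):
    """One step of a particle filter with multinomial resampling (SIR)."""
    # Propagate each particle through the transition measure pi_k(. | x_{k-1}).
    proposed = [sample_transition(x, rng) for x in particles]
    # Unnormalised weights a_{k-1} * l_{y_k}(x~_k); their sum estimates (2.15).
    unnorm = [a * likelihood(y, x) for a, x in zip(weights, proposed)]
    cond_lik = sum(unnorm)  # assumed strictly positive in this sketch
    new_weights = [u / cond_lik for u in unnorm]  # normalised weights, cf. (2.16)
    # Resample N particles with probabilities given by the normalised weights.
    n = len(proposed)
    resampled = rng.choices(proposed, weights=new_weights, k=n)
    return resampled, [1.0 / n] * n, cond_lik


# Hypothetical scalar model: x_k = 0.9 x_{k-1} + u_k, y_k = x_k + v_k.
def sample_transition(x_prev, rng):
    return 0.9 * x_prev + rng.gauss(0.0, 0.5)


def likelihood(y, x):
    sd = 0.3
    return math.exp(-0.5 * ((y - x) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


rng = random.Random(0)
n = 500
particles = [rng.gauss(0.0, 1.0) for _ in range(n)]
weights = [1.0 / n] * n
for y in [0.4, 0.1, -0.2]:  # illustrative observations
    particles, weights, cond_lik = sir_step(
        particles, weights, y, sample_transition, likelihood, rng
    )
```

After resampling all weights equal \(1/N\), so the particle cloud itself represents the estimate of \(\varGamma _k^{\theta }\).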

2.5 Static model parameters estimation using particle filter

When the parameters are known, the particle filter is a quite effective algorithm for latent variable estimation. However, if the parameters are not known beforehand, it is a very challenging task to estimate the parameters and the latent states using the particle filter. Here we take a Bayesian approach to estimate the parameters. The estimation of the parameters in online estimation requires the computation of the posterior distribution of \(\theta \), i.e., \(p(\theta \mid y_{1:k}), k \in {\mathbb {N}}^+\). Using Bayes’ rule, one can represent the posterior density as

$$\begin{aligned} p(\theta \mid y_{1:k}) = \frac{p(y_k\mid y_{1:k-1},\theta )p(\theta \mid y_{1:k-1})}{\int p(y_k\mid y_{1:k-1},\theta )p(\theta \mid y_{1:k-1}) \,\mathrm {d}\theta }\,. \end{aligned}$$

To estimate the density of \(\theta \) given \(y_{1:k}\), a straightforward way is to sample parameter particles from the previous posterior distribution \(p(\theta \mid y_{1:k-1})\). Denoting the samples by \(\{\theta ^{(i)}, i=1,\cdots ,N\}\), the measure \(p(\mathrm {d}\theta \mid y_{1:k})\) at time k can be approximated by

$$\begin{aligned} \sum _{i=1}^N \frac{p(y_k\mid y_{1:k-1},\theta ^{(i)})}{\sum _{i=1}^N p(y_k\mid y_{1:k-1},\theta ^{(i)})} \delta _{\theta ^{(i)}}(\mathrm {d}\theta )=\sum _{i=1}^N w_k^{\theta ^{(i)}}\delta _{\theta ^{(i)}}(\mathrm {d}\theta )\,, \end{aligned}$$
(2.17)

where the weights \(w_k^{\theta ^{(i)}}\), \(i=1, \cdots N\), are defined by

$$\begin{aligned} w_k^{\theta ^{(i)}}=\frac{p(y_k\mid y_{1:k-1},\theta ^{(i)})}{\sum _{i=1}^N p(y_k\mid y_{1:k-1},\theta ^{(i)})}\,. \end{aligned}$$
(2.18)
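
As a small illustration of (2.17) and (2.18), the sketch below normalises the conditional likelihoods of N parameter particles into weights and uses them to form a posterior mean. It assumes that the conditional likelihoods are supplied on the log scale and subtracts their maximum before exponentiating, a numerical precaution that is our own choice and not part of (2.18); the function names are hypothetical.

```python
import math


def parameter_weights(log_cond_liks):
    """Normalised weights w_k^{theta^(i)} of (2.18) from log p(y_k | y_{1:k-1}, theta^(i))."""
    # Subtracting the maximum leaves the normalised weights unchanged
    # but avoids underflow when the likelihoods are very small.
    m = max(log_cond_liks)
    unnorm = [math.exp(l - m) for l in log_cond_liks]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def posterior_mean(thetas, weights):
    """Mean of the weighted particle approximation (2.17), scalar theta."""
    return sum(w * t for w, t in zip(weights, thetas))


# Hypothetical example with two parameter particles.
w = parameter_weights([math.log(1.0), math.log(3.0)])
theta_hat = posterior_mean([0.2, 0.6], w)
```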

There are two issues in implementing (2.17). One is that sampling from the previous posterior distribution \(p(\theta \mid y_{1:k-1})\) usually cannot be carried out exactly. Another is that often the likelihood \(p(y_k\mid y_{1:k-1},\theta ^{(i)})\) cannot be computed analytically. These two issues can be tackled by using the recursive nested particle filter (RNPF), recently introduced in Crisan and Míguez (2018), which is presented below.

2.5.1 Recursive nested particle filter

In the RNPF, a two-layer Monte Carlo method is used. In the first layer, also referred to as outer layer, new parameter samples are generated by using a kernel function. This step is usually called jittering and the kernel is referred to as the jittering kernel. In the second layer, also called inner layer, a particle filter is applied to approximate the conditional likelihood \(p(y_k\mid y_{1:k-1},\theta ^{(i)})\). In the remainder of this section we present the RNPF in more detail and introduce its ensuing Algorithm 2.1.

First, assume that \(\theta \) has a compact support \(D_{\theta } \subset {\mathbb {R}}^{d_\theta }\), where \(d_\theta \) is the dimension of \(\theta \). Moreover assume at time \({k-1}\), one can generate a random grid of samples in the parameter space \(D_{\theta }\), say \(\{\theta _{k-1}^{(i)}, i=1,\cdots ,N\}\), and for each \(\theta _{k-1}^{(i)}\), we have the set of particles in the state space \(\{x_{k-1}^{(i,j)}, 1\le j\le M\}\).

  • \(\mathbf{Jittering}. \) Given the parameter samples \(\{\theta _{k-1}^{(i)}, i=1,\cdots ,N\}\) at time \(k-1\), new particles \(\{{\tilde{\theta }}^{(i)}_k, i=1,\cdots ,N\}\) at time k are generated by some Markov kernels denoted by \(\kappa (d\theta \mid \theta _{{k-1}}^{(i)}): {\mathcal {B}}(D_\theta )\times D_\theta \rightarrow [0,1]\) (step 1.a in Algorithm 2.1 below). This step is the outer Monte Carlo layer.

  • \(\mathbf{Update}. \) From Eqs. (2.8) and (2.10), we know that for a given \({\tilde{\theta }}\), the likelihood function is obtained by calculating the integral

    $$\begin{aligned} p(y_k\mid y_{1:k-1},{\tilde{\theta }}) = ((l_{y_k}^{{\tilde{\theta }}}, \pi _k^{{\tilde{\theta }}}),\varGamma _{k-1}^{{\tilde{\theta }}})\,. \end{aligned}$$

In order to compute this latter integral, the posterior measure at time \({k-1}\), \(\varGamma _{k-1}^{{\tilde{\theta }}}\), needs to be known. In the standard Bayesian filter, the parameters are fixed over time and this posterior measure is computed at time \({k-1}\) by using Eq. (2.9). However in this case, this measure is not directly available since the parameter has evolved from \(\theta \) at time \({k-1}\) to \({\tilde{\theta }}\) at time k. In order to compute \(\varGamma _{k-1}^{{\tilde{\theta }}}\), one needs to re-run a filter from time 1 to k, which makes the algorithm not recursive and very time consuming. The authors in Crisan and Míguez (2018) solved this latter problem by assuming that \(\varGamma _{k-1}^{\theta }\) is continuous w.r.t. \(\theta \in D_\theta \), which means that when \(\theta \approx {\tilde{\theta }}\), then \(\varGamma _{k-1}^{\theta } \approx \varGamma _{k-1}^{{\tilde{\theta }}}\). Therefore by considering a rather small variance in the jittering kernel, one can use the particle approximation of the filter computed for \(\theta \) at time \(k-1\) as a particle approximation of the filter for the new sampled \({\tilde{\theta }}\) at time k.

In the RNPF, the jittering kernel is chosen such that the mutation step from \(\theta _{k-1}\) to \({\tilde{\theta }}_k\) is sufficiently small, see Sect. 4.2 in Crisan and Míguez (2018). Then for each \({\tilde{\theta }}_k^{(i)}\), \(\{i=1,\cdots , N\}\), a sequential nested particle filter (see Sect. 2.4 for the description of the particle filter methodology) is used for the state space to obtain \(\{{\tilde{x}}_k^{(i,j)}, \,1\le j\le M\}\); see steps 1.b, 1.d, 1.e in Algorithm 2.1 below. This is the inner Monte Carlo layer.

  • \(\mathbf{Resampling}. \) The inner layer Monte Carlo method in the update step above provides an approximation of the likelihoods \(p(y_k\mid y_{1:k-1},{\tilde{\theta }}_k^{(i)})\), \(i=1,\cdots , N\) (step 1.c in Algorithm 2.1), which are used to re-weight the parameter particles and obtain \(\{{\theta }_k^{(i)},{x}_k^{(i,j)},\, i=1,\cdots , N,\, j=1,\cdots , M\}\); see step 2 in Algorithm 2.1.

The RNPF is introduced in Crisan and Míguez (2018). We reproduce it here for the sake of completeness.

Algorithm 2.1

(sequential nested particle filter for parameter estimation)

  • Initialization: Assume an initial distribution \(p(\theta _{0})\) for the parameters and \(p(x_{0})\) for the states, and sample from the initial distributions to get N particles \(\{\theta _{0}^{(i)}, i=1,\cdots , N\}\) and \(N\times M\) particles \(\{x_{0}^{(i,j)}, i=1,\cdots ,N, \,j=1,\cdots ,M\}\).

  • Recursion:

    1. Filtering: given \(\{\theta _{k-1}^{(i)},x_{k-1}^{(i,j)}\}\), for each \(i=1, \cdots , N\),

      a. (jittering, outer Monte Carlo layer) sample new parameters \({\tilde{\theta }}_k^{(i)}\) from the jittering kernel \(\kappa (d\theta \mid \theta _{{k-1}}^{(i)})\),

      b. (together with the next two steps, this is the update part) sample new states \({\hat{x}}_k^{(i,j)}\), \(j=1,\cdots , M\), from the transition measure \(\pi _k^{{\tilde{\theta }}_k^{(i)}}(\mathrm {d}x\mid x_{k-1}^{(i,j)})\) (inner Monte Carlo Layer),

      c. compute \(p(y_k\mid y_{1:k-1}, {\tilde{\theta }}_k^{(i)}) \approx \frac{1}{M} \sum _{j=1}^{M} l^{{\tilde{\theta }}_k^{(i)}}_{y_k}({\hat{x}}_k^{(i,j)})\),

      d. compute the weights for the state space using Eq. (2.16)

        $$\begin{aligned} a_{k}^{(j)} =\frac{ a_{k-1}^{(j)} p(y_{k}\mid {\hat{x}}_k^{(i,j)},{\tilde{\theta }}_k^{(i)})}{\sum _{j=1}^M a_{k-1}^{(j)} p(y_{k}\mid {\hat{x}}_k^{(i,j)},{\tilde{\theta }}_k^{(i)})}, \qquad j=1,\cdots , M\,, \end{aligned}$$

      e. resample the \({\hat{x}}^{(i,p)}\): set \({\tilde{x}}^{(i,j)}\) equal to \({\hat{x}}^{(i,p)}\) with probability \(a_{k}^{(p)}\), where \(j,p \in \{1,\cdots M\}\).

    2. Resampling of the \(\{\theta _k^{(i)}\}\): compute the weights for the parameters space using Eq. (2.18)

      $$\begin{aligned} w_k^{{\tilde{\theta }}^{(i)}_k}= \frac{p(y_k\mid y_{1:k-1},{\tilde{\theta }}_k^{(i)})}{\sum _{i=1}^N p(y_k\mid y_{1:k-1},{\tilde{\theta }}_k^{(i)})} , \qquad i=1,\cdots , N\,. \end{aligned}$$
      (2.19)

      For \(i=1,\cdots ,N\), set \(\{\theta _k^{(i)}, x_k^{(i, j)}\}_{1\le j\le M}\) equal to \(\{{\tilde{\theta }}_k^{(p)}, {\tilde{x}}_k^{(p, j)}\}_{1\le j\le M}\) with probability \(w_k^{{\tilde{\theta }}^{(p)}_k}\), where \(p\in \{1,\cdots ,N\}\).

    3. Go back to the filtering step.
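The steps of Algorithm 2.1 can be sketched as follows for a toy scalar model, which is an assumption made purely for illustration and does not appear in the text: \(x_k = \theta x_{k-1} + \sigma _x u_k\), \(y_k = x_k + \sigma _y v_k\) with standard normal \(u_k, v_k\) and \(\theta \in [0,1]\). The function name `rnpf`, the Gaussian jittering with clipping to the support, and all default values are hypothetical choices; since the states are resampled at every step, the previous state weights \(a_{k-1}^{(j)}\) are uniform and drop out of step 1.d.

```python
import numpy as np

def rnpf(y, N=200, M=100, jitter_std=0.02, theta_box=(0.0, 1.0),
         sig_x=0.5, sig_y=0.5, seed=0):
    """Sketch of Algorithm 2.1 for x_k = theta*x_{k-1} + sig_x*u_k, y_k = x_k + sig_y*v_k."""
    rng = np.random.default_rng(seed)
    lo, hi = theta_box
    theta = rng.uniform(lo, hi, size=N)        # theta_0^(i), assumed uniform prior
    x = rng.normal(0.0, 1.0, size=(N, M))      # x_0^(i,j), assumed standard normal prior
    means = []
    for yk in y:
        # 1.a jittering (outer layer): small Gaussian move, clipped to the compact support
        theta_t = np.clip(theta + jitter_std * rng.normal(size=N), lo, hi)
        # 1.b propagate the states with the jittered parameters (inner layer)
        x_hat = theta_t[:, None] * x + sig_x * rng.normal(size=(N, M))
        # 1.c likelihood estimate: average of the observation density over j
        lik = np.exp(-0.5 * ((yk - x_hat) / sig_y) ** 2) / (sig_y * np.sqrt(2 * np.pi))
        lik = lik + 1e-300                     # avoids division by zero below
        p_hat = lik.mean(axis=1)
        # 1.d and 1.e: state weights and resampling within each parameter particle
        a = lik / lik.sum(axis=1, keepdims=True)
        x_t = np.empty_like(x_hat)
        for i in range(N):
            x_t[i] = x_hat[i, rng.choice(M, size=M, p=a[i])]
        # 2. parameter weights (2.19) and joint resampling of (theta, x)
        w = p_hat / p_hat.sum()
        idx = rng.choice(N, size=N, p=w)
        theta, x = theta_t[idx], x_t[idx]
        means.append(theta.mean())
    return np.array(means)
```

The returned array holds the particle mean of \(\theta \) after each observation. The cost per time step is of order \(N\times M\), which reflects the nested structure of the two Monte Carlo layers.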

2.5.2 A note on the convergence of the RNPF

A convergence study of Algorithm 2.1 was carried out in Lemmas 3 to 6 and Theorems 2 and 3 in Crisan and Míguez (2018), where the reasoning was split into the three steps of the algorithm: the jittering, the update and the resampling. In this latter paper, it was proven that, under some regularity conditions, the \(L^p\)-norms of the approximation errors, induced by these different steps, vanish with rate proportional to \(\frac{1}{\sqrt{N}}\) and \(\frac{1}{\sqrt{M}}\). Recall here that N and \(N\times M\) are respectively the number of samples in the parameter space and the number of particles in the state space. A similar result was proven for the approximation of the joint posterior distribution of the parameters and the state variables. We will make use of some of these convergence results later in Sect. 5 to carry out a convergence study of our proposed algorithm, Algorithm 4.2.

Under the assumption that the posterior measure \(\varGamma _k^\theta (\mathrm {d}x)\) is continuous w.r.t. the parameter \(\theta \) and when the mutation step of the parameters is small enough, the RNPF is a recursive algorithm. This makes the RNPF more efficient than non-recursive methods such as sequential Monte Carlo squared, see Chopin et al. (2013), and Markov Chain Monte Carlo methods, see Gamerman and Lopes (2006), Geweke and Tanizaki (1999) and Higdon (1998). The drawbacks of the RNPF are its heavy computational burden and slow convergence speed, which are respectively due to the nested simulations in the two Monte Carlo layers and the small mutation step of the parameters. In many applications, including some in financial modeling, the time length of the data is quite limited, and often too short to observe convergent behavior of the RNPF. To tackle this problem, we propose a new methodology in Sect. 4.

3 Parameters estimation in short rate models

Here we present a rather general model, an affine process, which is particularly relevant in mathematical finance, for instance when one is interested in estimating the parameters of the short rate curve given the observed data. It motivates the kind of system that we will consider and to which the new (Kalman particle) filter of Sect. 4 will be applied.

Let \((\varOmega , {\mathcal {F}}, ({\mathcal {F}}_t)_{t\ge 0},{\mathbb {P}})\) be a filtered probability space satisfying the usual conditions and \((W_t)_{t\ge 0}\) be a d-dimensional Brownian motion. In this paper, although our results can be applied to general state space models of type (2.1), we will mainly consider dynamics of the type

$$\begin{aligned} \mathrm {d}x_t = A(\beta -x_t)\, \mathrm {d}t + \left( \varSigma +{\tilde{\varSigma }}\sqrt{x_t^{(1)}}\right) \, \mathrm {d}W_t, \, x_0 = x \in {\mathbb {R}}^d\,, \end{aligned}$$
(3.1)

where \(A,\varSigma \) and \({\tilde{\varSigma }}\) are \(d\times d\)-matrices, \(\beta \) is a d-vector whose first component is non-negative, and \((x_t^{(1)})_{t\ge 0}\) is the first component of \((x_t)_{t\ge 0}\). We assume the matrix A is diagonal and we denote the diagonal elements of A by \(\alpha _1,\cdots ,\alpha _{d}\). Consider some integers \(p, q\ge 0\) with \(p+q=d\). When \(\varSigma {\tilde{\varSigma }}={\tilde{\varSigma }}\varSigma =0\) (below we specialise to the cases \(\varSigma =0\) or \({\tilde{\varSigma }}=0\)) and the parameters of the model (3.1) satisfy certain conditions known in the literature as admissibility conditions, the process \((x_t)_{t\ge 0}\) is an \({\mathbb {R}}^p_+ \times {\mathbb {R}}^q\)-valued affine process; see Duffie et al. (2003), Duffie et al. (2000) and Keller-Ressel and Mayerhofer (2015) for an overview of affine processes. In “Appendix A.1” we specify the admissibility of the parameters of the dynamics (3.1). In our context, the short rate evolution will be described by a process \((r_t)_{t\ge 0}\) given in terms of \((x_t)_{t\ge 0}\) by

$$\begin{aligned} r_t = c + \gamma ^\top x_t\,, \end{aligned}$$

where \(c \in {\mathbb {R}}\), \(\gamma \in {\mathbb {R}}^d\). Let \(T>0\) be the maturity time, then the zero coupon bond price at time \(t<T\) is defined as

$$\begin{aligned} P(t,T) = {\mathbb {E}}[\mathrm {e}^{-\int _t^T r(s) \, \mathrm {d}s}\mid {\mathcal {F}}_t]\,, \end{aligned}$$

and the corresponding zero rates, also called yields, are defined as

$$\begin{aligned} -\log P(t,T)/(T-t)\,. \end{aligned}$$

The fact that the process \((x_t)_{t\ge 0}\) is affine, which happens if \(\varSigma {\tilde{\varSigma }}\) is zero, allows one to obtain an explicit formula for the zero coupon bond price, i.e.

$$\begin{aligned} P(t,T) =\mathrm {e}^{-\phi (T-t,0) -\psi (T-t,0)^\top x(t)}\,. \end{aligned}$$
(3.2)

The functions \(\phi \) and \(\psi \) are the solutions to some ordinary differential equations, which are often referred to as the Riccati equations, see Theorem A.1 in “Appendix A.2” for details. Then, if \({\tilde{\varSigma }}=0\) (first case), Eq. (3.2) holds for \((\phi , \psi )\) the solution to (A.1). If \(\varSigma =0\) (second case), then (3.2) holds for \((\phi , \psi )\) the solution to (A.2). Denote the time to maturity \(T-t\) by \(\tau \), then the zero rate at time t with time to maturity \(\tau \) can be computed by \(\frac{1}{\tau }R_t(\tau )\), where

$$\begin{aligned} R_t(\tau ) := \phi (\tau ,0) + \psi (\tau ,0)^\top x_t. \end{aligned}$$
(3.3)

In the market, we can obtain the data for zero rates at discrete-time instants \(t_k\) with certain times to maturity \(\tau _1,\cdots ,\tau _L\), call these data \(R_k(\tau _l)\). We believe these data contain noise, hence at time k we observe

$$\begin{aligned} y_k = [y_k(\tau _1),\cdots ,y_k(\tau _L)]^\top = [R_k(\tau _1),\cdots ,R_k(\tau _L)]^\top +v_k, \end{aligned}$$

for \(k=1,\cdots , K\), where \(v_k\) is an L-dimensional random vector which represents the noise in the observed data. Let \(0=t_0\le t_1\le \cdots \le t_n=T\) be a partition of the time interval [0, T]. Then, considering a time-discrete version \(x_k := x_{t_k}\), \(k\in {\mathbb {N}}^+\), of the affine process \((x_{t})_{t\ge 0}\), our aim is to estimate the parameters of the latent state process \((x_k)_{k\in {\mathbb {N}}^+}\) given the observations \((y_k)_{k\in {\mathbb {N}}^+}\). To be more precise, we consider the following state space model, in which the observation equation can be seen to be of the general form in (2.11) by enlarging the state vector,

$$\begin{aligned} x_k&= \mathrm {e}^{-A(t_k-t_{k-1})}x_{k-1} + \left( I-\mathrm {e}^{-A(t_k-t_{k-1})}\right) \beta \nonumber \\&\qquad + \int _{t_{k-1}}^{t_k} \mathrm {e}^{-A(t_k-u)}\left( \varSigma +{\tilde{\varSigma }} \sqrt{x_u^{(1)}}\right) \, \mathrm {d}W_u\,, \end{aligned}$$
(3.4)
$$\begin{aligned} y_k&= H_kx_k +H^0_k+ v_k\,, \end{aligned}$$
(3.5)

where I is the identity matrix, \((x_k)_{k\in {\mathbb {N}}^+}\) is the latent process, \((y_k)_{k\in {\mathbb {N}}^+}\) represents the observations, \(H_k\) is an \(L\times d\) matrix with each row equal to \(-\psi (\tau _l,0)/\tau _l\), \(l=1,\cdots ,L\), \(H^0_k\) is the column vector \([-\phi (\tau _1,0)/\tau _1,\cdots ,-\phi (\tau _L,0)/\tau _L]^\top \) and \(v_k\) represents the noise. Note that the \(x_k\) from (3.4) form a discrete-time sample from the continuous process that solves (3.1), and that (3.4) represents \(x_t\) at time \(t=t_k\) exactly; it is not a recipe for an approximation. The aim is to estimate the model parameters \(A, \beta , \varSigma , {\tilde{\varSigma }}\) given the observation vector \(y_{1:k}\) and the variance of \(v_k\). As usual, in the sequel we collectively denote these parameters by \(\theta \).
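To make (3.4) concrete in the Gaussian case \({\tilde{\varSigma }}=0\), the sketch below computes the quantities of the exact transition \(x_k\mid x_{k-1}\sim N(Fx_{k-1}+c, Q)\) for diagonal A with positive diagonal entries. The function name `gaussian_transition` and the symbols F, c and Q are introduced here for illustration only. The covariance follows from the Itô isometry applied to the stochastic integral in (3.4), which gives \(Q_{ij} = (\varSigma \varSigma ^\top )_{ij}\,(1-\mathrm {e}^{-(\alpha _i+\alpha _j)\varDelta })/(\alpha _i+\alpha _j)\) with \(\varDelta =t_k-t_{k-1}\).

```python
import numpy as np

def gaussian_transition(alpha, beta, Sigma, dt):
    """Exact transition of (3.4) when Sigma_tilde = 0 and A = diag(alpha), alpha_i > 0.

    Returns F, c, Q such that x_k | x_{k-1} ~ N(F x_{k-1} + c, Q).
    """
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    d = alpha.size
    F = np.diag(np.exp(-alpha * dt))            # e^{-A dt}
    c = (np.eye(d) - F) @ beta                  # (I - e^{-A dt}) beta
    S = Sigma @ Sigma.T
    rate = alpha[:, None] + alpha[None, :]      # alpha_i + alpha_j
    Q = S * (1.0 - np.exp(-rate * dt)) / rate   # covariance of the stochastic integral
    return F, c, Q
```

For small \(\varDelta \) the covariance Q is close to \(\varSigma \varSigma ^\top \varDelta \), the value an Euler scheme would use.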

We end this section by giving some examples of the models of type (3.1) which are well known in the literature and to which we return with numerical experiments in Sect. 6. For \(d=1\), \(p=1\), \(\varSigma =0\), \({\tilde{\varSigma }} \ne 0\), one obtains the Cox-Ingersoll-Ross (CIR) model, see Cox et al. (1985), i.e.,

$$\begin{aligned} \mathrm {d}x_t = \alpha _1(\beta -x_t)\, \mathrm {d}t +{\tilde{\varSigma }} \sqrt{x_t}\, \mathrm {d}W_t\,. \end{aligned}$$
(3.6)

For \(d=2\), \(p=0\), \(\beta ^{(1)}=\beta ^{(2)} =0\), \(\varSigma \ne 0\), \({\tilde{\varSigma }}=0\), one obtains the two-factor Hull-White model with constant parameters (also known as the two-factor Vasiček model), with mean-reversion level 0, see Hull and White (1990), i.e.,

$$\begin{aligned} \begin{aligned} \mathrm {d}x_t^{(1)}&=- \alpha _{11} x^{(1)}_t\, \mathrm {d}t + \varSigma _{11} \, \mathrm {d}W_t^{(1)} +\varSigma _{12}\, \mathrm {d}W_t^{(2)}\,,\\ \mathrm {d}x_t^{(2)}&=- \alpha _{22} x^{(2)}_t\, \mathrm {d}t + \varSigma _{21} \, \mathrm {d}W_t^{(1)} +\varSigma _{22}\, \mathrm {d}W_t^{(2)}\,. \end{aligned} \end{aligned}$$
(3.7)

For \(d=2\), \(p=q=1\), \(\varSigma =0\), one obtains the stochastic volatility model, see Heston (1993), in which the first component, \((x_t^{(1)})_{t\ge 0}\), represents the stochastic volatility of the short rate \((x_t^{(2)})_{t\ge 0}\), i.e.,

$$\begin{aligned} \begin{aligned} \mathrm {d}x_t^{(1)}&=\alpha _{11} (\beta _1-x^{(1)}_t)\, \mathrm {d}t + \sqrt{x_t^{(1)}} \left( {\tilde{\varSigma }}_{11} \, \mathrm {d}W_t^{(1)} +{\tilde{\varSigma }}_{12}\, \mathrm {d}W_t^{(2)}\right) \,, \\ \mathrm {d}x_t^{(2)}&=\alpha _{22} (\beta _2-x^{(2)}_t)\, \mathrm {d}t + \sqrt{x_t^{(1)}}\left( {\tilde{\varSigma }}_{21} \, \mathrm {d}W_t^{(1)} +{\tilde{\varSigma }}_{22}\, \mathrm {d}W_t^{(2)}\right) \,. \end{aligned} \end{aligned}$$
(3.8)

4 Kalman particle filter for online parameters estimation

In this section, we introduce the Kalman particle filter for online parameter estimation. It is a semi-recursive algorithm that combines the Kalman filter and the particle filter. In this new approach, we consider a two-layer method as in the RNPF algorithm. In the outer layer, we sample the particles of the model parameters using some Markovian Gaussian kernel which is updated at each time step. In the inner layer, the distribution of the state process and the conditional likelihood function \(p(y_k\mid y_{1:k-1},\theta _k^{(i)})\), which is used to re-weight the parameter particles in the outer layer, are estimated given the sampled parameter particles.

There are two main differences between our proposed Kalman particle filter algorithm and the RNPF algorithm. The first difference is that in the outer layer we use dynamic jittering functions, i.e. the jittering functions can change over time. Specifically, in this paper we specify two jittering functions to sample the model parameters, see (4.5) and (4.9) as described in Sect. 4.1 below. The second difference is that we use the Kalman filter, instead of the particle filter, to update the underlying states in the inner layer. Note that in case the state space model is not linear and Gaussian, the literature offers different alternatives, see Bruno (2013) for a Monte Carlo approach, Sorenson and Alspach (1971) for the Gaussian mixture approach, or Kalman filter extensions such as the extended Kalman filter, see Einicke and White (1999) and Wan and Nelson (2002), and the unscented Kalman filter, see Wan and van der Merwe (2002). In these latter methodologies, the idea is to consider an approximation of the state variables which is linear and Gaussian, and then run a Kalman filter on the approximation. When the model is not Gaussian, such an approximation introduces bias. In Sect. 5, we will carry out a convergence analysis of our algorithm and we will prove that the bias induced by the Gaussian approximation of the model (3.4), (3.5) indeed vanishes when the time step tends to zero.

The use of the two jittering functions in the outer layer and of the Kalman filter in the inner layer allows us to obtain an algorithm that has faster convergence speed and less computational complexity than the RNPF algorithm. This will be further illustrated in the examples in Sect. 6.

4.1 Static model estimation

As described in Sect. 2.5, in order to sequentially estimate the posterior density \(p(\theta \mid y_{1:k})\), \(k=1,\cdots ,K\), we face two issues: how to sample particles from the former posterior distribution \(p(\theta \mid y_{1:k-1})\) and how to compute the conditional likelihood \(p(y_k\mid y_{1:k-1},\theta )\). First, we consider the sampling problem and we introduce the first jittering kernel that we use to update the parameter space \(\theta \). We assume that the parameter \(\theta \) has a compact domain \(D_\theta \subset {\mathbb {R}}^{d_\theta }\), with \(d_\theta \) the dimension of \(\theta \). For the time being, \(\theta \) is an abstract parameter, which will be specified later when a specific model is assumed.

4.1.1 Truncated Gaussian kernel with changing covariance

Recall that we use the generic notation \(N(x;\mu ,\varSigma )\) to denote the density at x of the normal distribution with mean vector \(\mu \) and covariance matrix \(\varSigma \). We choose a Gaussian kernel such that the conditional density of \(\theta _k\) is given as

$$\begin{aligned} p(\theta _k\mid \theta _{k-1}) = N(\theta _k; \mu (\theta _{k-1}),\varSigma (\theta _{k-1}))\,, \end{aligned}$$

with \(\mu (\theta _{k-1})\) and \(\varSigma (\theta _{k-1})\) being respectively the conditional mean and covariance of \(\theta _k\) given \(\theta _{k-1}\). Then if the parameters \(\theta _k\) are jittered from this Gaussian kernel, one can easily derive that

$$\begin{aligned} \begin{aligned} {\mathbb {E}}(\theta _k)&= {\mathbb {E}}(\mu (\theta _{k-1})), \\ \text{ Var }(\theta _k)&= {\mathbb {E}}(\varSigma (\theta _{k-1}))+\text{ Var }(\mu (\theta _{k-1}))\,. \end{aligned} \end{aligned}$$
(4.1)

Ideally, the jittering should introduce neither bias nor information loss (an artificial increase in the variance), see Liu and West (2001), which means that \({\mathbb {E}}(\theta _k)={\mathbb {E}}(\theta _{k-1})\) and \(\text{ Var }(\theta _k)=\text{ Var }(\theta _{k-1})\), \(k=1,\cdots ,K\). The latter, together with Eqs. (4.1), implies

$$\begin{aligned} \begin{aligned} {\mathbb {E}}(\mu (\theta _{k-1}))&= {\mathbb {E}}(\theta _{k-1}),\\ {\mathbb {E}}(\varSigma (\theta _{k-1}))+\text{ Var }(\mu (\theta _{k-1}))&= \text{ Var }(\theta _{k-1})\,. \end{aligned} \end{aligned}$$
(4.2)

To achieve that, Liu and West (2001) apply a shrinkage to the kernel. We will apply the same technique although the jittering function is used differently in our case. If one assumes a deterministic jittering covariance, i.e. \(\varSigma (\theta _{k-1}) = \varSigma _{k-1}\), and a linear mean function

$$\begin{aligned} \mu (\theta _{k-1}) = a\theta _{k-1} + (1-a){\mathbb {E}}(\theta _{k-1}), \text{ for } \text{ some } a\in (0,1), \end{aligned}$$
(4.3)

then the jittering kernel satisfying (4.2) is given by

$$\begin{aligned} p(\theta _k\mid \theta _{k-1}) = N(\theta _k; \mu (\theta _{k-1}), \varSigma (\theta _{k-1}))\,, \end{aligned}$$
(4.4)

where \( \varSigma (\theta _{k-1})= (1-a^2)\text{ Var }(\theta _{k-1})\). The kernel in (4.4) is the same jittering kernel as used in Liu and West (2001). We will refer to the number a as the discount factor. However, note that the domain of the kernel (4.4) is not compact, whereas we assumed the domain of \(\theta \) to be compact. So instead of directly using the kernel (4.4), we truncate this Gaussian distribution to the compact domain \(D_\theta \). The truncated kernel is denoted

$$\begin{aligned} p(\theta _k\mid \theta _{k-1}) = \mathrm {TruncNorm}(\theta _k; \mu (\theta _{k-1}), \varSigma (\theta _{k-1}), D_\theta )\,, \end{aligned}$$
(4.5)

with \(\mathrm {TruncNorm}(x; \mu ,\varSigma , D_\theta )\) the truncated normal distribution which truncates the normal distribution with mean \(\mu \) and covariance \(\varSigma \) to the range \(D_\theta \).
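A minimal sketch of sampling from the kernel (4.5) is given below, under the following assumptions that are not taken from the text: \({\mathbb {E}}(\theta _{k-1})\) and \(\text{ Var }(\theta _{k-1})\) are replaced by the sample mean and sample covariance of the particles, \(D_\theta \) is a box, and the truncation is realised by rejection sampling. The function name `jitter_kernel_1` and its arguments are hypothetical.

```python
import numpy as np

def jitter_kernel_1(theta, a, box_lo, box_hi, rng, max_tries=1000):
    """Jitter an (N, d) particle array with the shrinkage kernel (4.3)-(4.5).

    Draws from N(mu, (1 - a^2) * Var) and redraws any sample outside the box.
    """
    theta = np.asarray(theta, dtype=float)
    N, d = theta.shape
    mean = theta.mean(axis=0)
    cov = np.atleast_2d(np.cov(theta, rowvar=False))   # sample version of Var(theta_{k-1})
    mu = a * theta + (1.0 - a) * mean                  # linear mean function (4.3)
    # small ridge keeps the Cholesky factorisation stable for near-singular covariances
    L = np.linalg.cholesky((1.0 - a**2) * cov + 1e-12 * np.eye(d))
    out = mu + rng.normal(size=(N, d)) @ L.T
    for _ in range(max_tries):
        bad = np.any((out < box_lo) | (out > box_hi), axis=1)
        if not bad.any():
            break
        out[bad] = mu[bad] + rng.normal(size=(int(bad.sum()), d)) @ L.T
    # fallback only if max_tries was exhausted; this is not an exact truncation
    return np.clip(out, box_lo, box_hi)
```

When the particles lie well inside \(D_\theta \), the truncation is rarely active and the jittered sample has approximately the same mean and covariance as the input, in line with (4.2): the shrunk means contribute \(a^2\text{ Var }(\theta _{k-1})\) and the noise contributes \((1-a^2)\text{ Var }(\theta _{k-1})\).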

Before we present our methodology for jittering in detail, we introduce the following assumption which we need in our recursive algorithm later. We will use this assumption in Sect. 5 to prove the convergence of our proposed algorithm.

Assumption 4.1

The jittering kernels \(\kappa _k^N(d\theta \mid \theta ^\prime ): {\mathcal {B}}(D_\theta )\times D_\theta \rightarrow [0,1]\), for \(k=1,\cdots ,N\), and \(\theta ^\prime \) taking values in a compact set \(D_\theta \subset {\mathbb {R}}^d\), satisfy the following inequalities

$$\begin{aligned} \sup _{\theta ^\prime \in D_\theta }\int \left| f(\theta )-f(\theta ^\prime )\right| \kappa _k^N(d\theta \mid \theta ^\prime )\le \frac{e_{1,k}\left\| f\right\| _\infty }{\sqrt{N}}\,, \end{aligned}$$
(4.6)

for all bounded functions \(f : D_\theta \rightarrow {\mathbb {R}}\) and some positive constant \(e_{1,k}\) which is independent of f, and

$$\begin{aligned} \sup _{\theta ^\prime \in D_\theta }\int \left\| \theta -\theta ^\prime \right\| ^p \kappa _k^N(d\theta \mid \theta ^\prime )\le \frac{e_{2,k}^p}{\sqrt{N^p}}\,, \end{aligned}$$
(4.7)

for \(p\ge 1\) and some positive constant \(e_{2,k}\).

Let \(f: D_\theta \rightarrow {\mathbb {R}}\) be a bounded Lipschitz function. In Proposition 1 of Appendix C in Crisan and Míguez (2018), with \(\epsilon _n=1\) there, it is shown that if, for any \(p \ge 1\), the jittering kernels \(\kappa ^N_k(\mathrm {d}\theta \mid \theta ^\prime )\) satisfy

$$\begin{aligned} \sup _{\theta ^{\prime } \in D_\theta }\int \left\| \theta -\theta ^\prime \right\| ^2\kappa ^N_k(\mathrm {d}\theta \mid \theta ^{\prime })\le \frac{c}{\sqrt{N^{p+2}}}\,, \end{aligned}$$
(4.8)

for some positive constant c independent of N, then Assumption 4.1 holds.

The jittering kernels of type (4.5) have an appealing property: the covariance can change over time. This aspect helps us to design an algorithm with the following attractive feature. Initially, since we lack information on the unknown parameters, a larger covariance can lead to a faster convergence of the parameters to the high likelihood area. Over time, the filter refines the estimate of the fixed parameters until at some point a very small variance has been reached, which makes the parameter estimation more accurate. However, a direct application of this kernel does not yield a recursive method since it generally does not satisfy Assumption 4.1. Hence, it is unclear whether the algorithm converges. To tackle this issue, we introduce a second Gaussian jittering kernel which satisfies Assumption 4.1. This second jittering kernel will allow us to obtain a recursive method. The details will be described below in Sect. 4.1.2.

Remark 4.1

Although in this paper we have specified the use of (truncated) Gaussian kernels with a certain mean and variance, the dynamic kernel set-up is very generic. In an implementation, one can freely choose another kernel that fits one's purpose. For example, one can define the variance of the jittering kernel as a monotonically decreasing function of time so that the convergence speed can be manually controlled.

4.1.2 Description of the Kalman particle filter methodology

Here we present our methodology for linear Gaussian models. When the model is not linear and Gaussian, we approximate it by a linear Gaussian model and apply the methodology described below to the approximation. Before we present the algorithms, we introduce some notation and conventions. We denote by diag(A) the diagonal matrix that has the same diagonal elements as a given matrix A. With a diagonal matrix V, diag\((A)<V\) means every diagonal element of diag(A) is smaller than the corresponding diagonal element in V. Similarly, an expression like \(\min \{\text {diag}(A),V\}\) is to be interpreted element-wise. These notational conventions apply to all algorithms below.

In the recursive step explained below we set thresholds on the covariance matrices of the jittering kernel so that Assumption 4.1 holds. Notice that although the (co)variance matrices used in the jittering kernel of the recursive step do not have to be diagonal when Assumption 4.1 holds, they are chosen to be diagonal for the convenience of the implementation. Hence, in the jittering kernel in (4.9) below of the recursive step we use diag\((\varSigma (\theta _k))\) instead of \(\varSigma (\theta _k)\).

The non recursive step. Assume at time \({k-1}\), one can generate a random grid of samples in the parameter space, say \(\{\theta ^{(i)}_{k-1}, i=1,\cdots , N\}\).

  • Jittering step 1. Here we apply the kernel (4.5), referred to as jittering kernel 1, to obtain new samples \(\{{\tilde{\theta }}^{(i)}_k, i=1,\cdots , N\}\) (step 1.a.i in Algorithm 4.2 below).

  • Update. In order to compute the posterior measure \(\varGamma _k^{{\tilde{\theta }}_k^{(i)}}\) at time k, one needs to know the mean \(B_{k-1}^{{\tilde{\theta }}^{(i)}_k}\) and the covariance \(P_{k-1}^{{\tilde{\theta }}^{(i)}_k}\) at time \({k-1}\) of the posterior distribution, see Formula (2.12) and note that we make the dependence on \({\tilde{\theta }}^{(i)}_k\) clear in the notation. However, these latter quantities are not available since the parameter has evolved from \(\theta _{k-1}\) at time \({k-1}\) to \({\tilde{\theta }}_{k}\) at time k, see also the discussion in Sect. 2.5. Hence at this step, the algorithm does not run recursively and at every time k where a new parameter particle is sampled, the inner filter re-runs from time \(t_0\) to \(t_k\) (step 1.a.ii in Algorithm 4.2). Moreover, the conditional likelihood function \(p(y_k\mid y_{1:k-1},{\tilde{\theta }}_k^{(i)})\), \(i=1,\cdots ,N\), is computed in the inner filter, using Eq. (2.13). This latter will be used to re-weight the parameter particles, see step 1.c in Algorithm 4.2 below.

  • Resampling. We use a resampling technique to obtain \(\{\theta ^{(i)}_k, B_k^{(i)}, P_k^{(i)}, \,i=1,\cdots ,N\}\), further specified in step 2 of Algorithm 4.2.

The recursive step. In this step we first need to specify a diagonal matrix \(V_N\). This chosen \(V_N\) serves as the threshold for the (co)variance of the jittering kernel satisfying Assumption 4.1. For example, the i-th diagonal element of \(V_N\) can be defined as \(\frac{c_i}{\sqrt{N^{p+2}}}\) with arbitrary positive constants \(c_i\). Once at some time point \(t_l\), diag\((\varSigma (\theta _{{l}}))=(1-a^2)\) diag\((\text{ Var }(\theta _{l}))<V_N\), we switch to jittering kernel 2. In a practical implementation, one could also set a floor, a diagonal matrix \(V_f\), with \(0\le V_f\le V_N\), for the elements of the jittering variance matrix to prevent the algorithm from getting stuck.

  • Jittering step 2. We apply the jittering kernel 2,

    $$\begin{aligned}&p(\theta _k^{(i)}\mid \theta ^{(i)}_{k-1}) = \nonumber \\&\quad \mathrm {TruncNorm}(\theta ^{(i)}_k;\theta ^{(i)}_{k-1},\min \{\max \{\text {diag} (\varSigma (\theta _{{l}})),V_f\},V_N\}, D_\theta )\,, \end{aligned}$$
    (4.9)

    for \(k\ge l+1\), see step 1.b.i of Algorithm 4.2.

  • Update. From time \(t_{l+1}\) on, we have a recursive algorithm based on the idea to approximate the posterior measure \(\varGamma _k^{{\tilde{\theta }}_k^{(i)}}\) by \({\hat{\varGamma }}_k^{{\theta }^{(i)}_{k-1}}\), which is computed by the Kalman filter using \({\tilde{\theta }}_{k}^{(i)}\), the mean \(B_{k-1}^{\theta ^{(i)}_{k-1}}\) and the covariance \(P_{k-1}^{\theta ^{(i)}_{k-1}}\) (step 1.b.ii in Algorithm 4.2). As in Sect. 2.5.1, here we assume that \(\varGamma _k^\theta \) is continuous w.r.t. \(\theta \in D_\theta \). Moreover, the marginal likelihood \(p(y_k\mid y_{1:k-1},{\tilde{\theta }}_k^{(i)})\), \(i=1,\cdots ,N\), is approximated by the inner filter using Eq. (2.13) (step 1.c in Algorithm 4.2). This latter will be used to re-weight the parameter particles.

  • Resampling. Apply a resampling technique to obtain \(\{\theta ^{(i)}_k, B_k^{(i)}, P_k^{(i)}, \, i=1,\cdots ,N\}\), see step 2 of Algorithm 4.2.
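The switching rule and the clipped variance in (4.9) can be sketched as follows. All function names are hypothetical, the diagonal matrices are stored as vectors of their diagonal entries, and the example form \(c_i/\sqrt{N^{p+2}}\) of the threshold is the one mentioned above, with p the exponent appearing in (4.8).

```python
import numpy as np

def switching_threshold(N, c, p=1):
    """Diagonal of V_N with entries c_i / sqrt(N^(p+2))."""
    return np.asarray(c, dtype=float) / np.sqrt(float(N) ** (p + 2))

def use_kernel_2(var_theta, a, V_N):
    """True once diag(Sigma) = (1 - a^2) diag(Var(theta)) < V_N holds element-wise."""
    sig = (1.0 - a**2) * np.asarray(var_theta, dtype=float)
    return bool(np.all(sig < V_N))

def kernel_2_variance(var_theta, a, V_N, V_f):
    """Element-wise min{max{diag(Sigma), V_f}, V_N}, the variance used in (4.9)."""
    sig = (1.0 - a**2) * np.asarray(var_theta, dtype=float)
    return np.minimum(np.maximum(sig, V_f), V_N)
```

For instance, with \(N=100\), \(p=1\) and \(c=(1,2)\), the threshold has diagonal entries 0.001 and 0.002, and with \(a=0.98\) the shrinkage factor \(1-a^2\) is 0.0396.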

We now introduce the Kalman particle algorithm. Recall \(\mu (\theta _{k-1})\) and \(\varSigma (\theta _{k-1})\) from (4.5) and the discount factor a from (4.3).

Algorithm 4.2

(Kalman particle filter for static parameter model)

Initialization:

  1. 1.

    set the number of particles N, a value for the discounting factor a, a switching variance level \(V_N\) and a floored variance level \(V_f\),

  2. 2.

    assume an initial distribution \(p(\theta _{0})\) for the parameters,

  3. 3.

    sample from the initial distribution to get N particles \(\{\theta _{0}^{(i)},\, i=1,\cdots ,N\}\) for the parameters,

  4. 4.

    for each particle \(\theta _{0}^{(i)}\), assign the same initial mean \(B_{0}^{(i)}\) and covariance value \(P_{0}^{(i)}\) of the posterior distribution \(\varGamma _{0}^{\theta _{0}^{(i)}}\), \(i=1,\cdots , N\).

Recursion:

  1. 1.

    Filtering: given \(\{\theta _{k-1}^{(i)},\, i=1,\cdots ,N\}\),

    1. a.

      as long as \(\text {diag}(\varSigma (\theta _{k-1}))\) is not smaller than \(V_N\) (jittering case 1), for each \(i=1,\cdots ,N\),

      1. i.

        sample new parameters from the kernel (4.5), i.e.

        $$\begin{aligned} {\tilde{\theta }}_k^{(i)}\sim \mathrm {TruncNorm}\left( {\tilde{\theta }}_k; \mu (\theta _{k-1}^{(i)}),\varSigma (\theta _{k-1}),D_\theta \right) \,, \end{aligned}$$
      2. ii.

        based on the parameter \({\tilde{\theta }}_k^{(i)}\), use the Kalman filter to compute the mean and covariance of the posterior distribution from time 1 to k and hence obtain \(\varGamma _k^{{\tilde{\theta }}^{(i)}_k}\),

    2. b.

      once \(\text {diag}(\varSigma (\theta _{l-1})) < V_N\) (jittering case 2), for some \(t_l\), then for \(k\ge l+1\) and \(i=1,\cdots ,N\), given \(\{B_{k-1}^{(i)}, P_{k-1}^{(i)}\}\),

      1. i.

        sample new parameters from the kernel (4.9), i.e.

        $$\begin{aligned} {\tilde{\theta }}_k^{(i)}\sim \mathrm {TruncNorm}({\tilde{\theta }}_k; \theta _{k-1}^{(i)},\min \{\max \{\text {diag}(\varSigma (\theta _{{k-1}})),V_f\},V_N\}, D_\theta )\,, \end{aligned}$$
      2. ii.

        based on the parameters \({\tilde{\theta }}_{k}^{(i)}\), \(B_{k-1}^{(i)}\) and \(P_{k-1}^{(i)}\), use the Kalman filter to compute the mean \({\tilde{B}}_{k-1}^{(i)}\) and covariance \({\tilde{P}}_{k-1}^{(i)}\) of the posterior distribution at time k and hence obtain an approximation \({\hat{\varGamma }}_k^{{\theta }^{(i)}_{k-1}}\) of the posterior distribution (update step),

    3. c.

      in either case 1.a or 1.b, for \(i=1,\cdots , N\), compute \({\tilde{p}}(y_{k}\mid y_{1:k-1},{\tilde{\theta }}_k^{(i)})\) using Eq. (2.13) and then, using (2.18), obtain an approximation of the normalized weights given by

      $$\begin{aligned} {\tilde{w}}_k^{{\tilde{\theta }}_k^{(i)}} = \frac{{\tilde{p}}(y_{k}\mid y_{1:k-1},{\tilde{\theta }}_k^{(i)})}{\sum _{i=1}^{N}{\tilde{p}}(y_{k}\mid y_{1:k-1},{\tilde{\theta }}_k^{(i)})}\,, \end{aligned}$$

      which gives an update to be used in the next step,

  2. 2.

    Resampling: for each \(i=1,\cdots , N\), set \(\{\theta _k^{(i)}, B_k^{(i)}, P_k^{(i)}\}\) equal to \(\{{\tilde{\theta }}_k^{(p)}, {\tilde{B}}_k^{(p)}, {\tilde{P}}_k^{(p)}\}\) with probability \({\tilde{w}}_k^{{\tilde{\theta }}_k^{(p)}}\), where \(p \in \{1,\cdots ,N\}\).

  3. 3.

    Return to the filtering step.
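The recursive part of the algorithm (jittering case 2, weight update, resampling) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation used in the paper: the scalar toy model \(x_k = \phi x_{k-1} + (1-\phi )\beta + \text {noise}\), \(y_k = x_k + \text {noise}\) with the single unknown parameter \(\theta =\beta \), the function names `trunc_norm`, `kalman_step` and `recursive_step`, the rejection sampler for the truncated normal, and the use of the plain sample variance of the particles in place of \(\varSigma (\theta _{k-1})\) are all hypothetical choices made for the example.

```python
import math
import random


def trunc_norm(rng, mean, sd, lo, hi):
    """Draw from N(mean, sd^2) restricted to [lo, hi] by simple rejection."""
    while True:
        draw = rng.gauss(mean, sd)
        if lo <= draw <= hi:
            return draw


def kalman_step(beta, B, P, y, phi=0.9, sigma2=0.01, h=0.01):
    """One Kalman predict/update for the assumed toy model.

    Returns the posterior mean, the posterior variance and the
    predictive density of the observation y.
    """
    m_pred = phi * B + (1.0 - phi) * beta      # predicted state mean
    p_pred = phi * phi * P + sigma2            # predicted state variance
    s = p_pred + h                             # innovation variance
    lik = math.exp(-0.5 * (y - m_pred) ** 2 / s) / math.sqrt(2.0 * math.pi * s)
    gain = p_pred / s
    return m_pred + gain * (y - m_pred), (1.0 - gain) * p_pred, lik


def recursive_step(rng, thetas, Bs, Ps, y, V_f, V_N, lo, hi):
    """Jitter (case 2), Kalman update, weighting and resampling for one y."""
    n = len(thetas)
    mean = sum(thetas) / n
    var = sum((t - mean) ** 2 for t in thetas) / n
    # Jittering variance clamped between the floor V_f and the level V_N.
    sd = math.sqrt(min(max(var, V_f), V_N))
    proposals = [trunc_norm(rng, t, sd, lo, hi) for t in thetas]
    # Update step: reuse the stored (B, P) of each particle.
    updated = [kalman_step(th, B, P, y) for th, B, P in zip(proposals, Bs, Ps)]
    liks = [u[2] for u in updated]
    total = sum(liks)
    weights = [l / total for l in liks]        # normalized weights
    # Multinomial resampling of the triples (theta, B, P).
    idx = rng.choices(range(n), weights=weights, k=n)
    return ([proposals[j] for j in idx],
            [updated[j][0] for j in idx],
            [updated[j][1] for j in idx],
            weights)
```

Because each particle carries its own pair \((B, P)\), one recursion costs a single Kalman update per particle, instead of rerunning the filter from time 1 as in the non-recursive case.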

Note that Algorithm 4.2 can be applied to general state space models (2.1). For our convergence analysis in the next section and in financial applications, we will focus on the special type (3.4), (3.5) of affine state space models. For those models, when \(\varSigma \ne 0\), the transition kernel \(\pi ^\theta _k(\mathrm {d}x\mid x_{k-1})\) resulting from (3.4) is not Gaussian, and we need to approximate it by a Gaussian transition kernel in order to apply the Kalman filter.

The approximation is obtained by replacing \(\sqrt{x_u^{(1)}}\) in (3.4) with \(\sqrt{x^{(1)}_{t_{k-1}}}\), resulting in

$$\begin{aligned} {\check{x}}_k&= \mathrm {e}^{-A(t_k-t_{k-1})}x_{k-1} + \left( I-\mathrm {e}^{-A(t_k-t_{k-1})}\right) \beta \nonumber \\&\qquad + \int _{t_{k-1}}^{t_k} \mathrm {e}^{-A(t_k-u)}\left( \varSigma +{\tilde{\varSigma }}\sqrt{x_{k-1}^{(1)}}\right) \,\mathrm {d}W_u. \end{aligned}$$
(4.10)
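To illustrate why (4.10) gives a Gaussian transition, the following sketch computes the conditional mean and variance of \({\check{x}}_k\) in the scalar case. It is a sketch under assumptions not made in the text: the state is one-dimensional with \(A=a>0\), so that \(x^{(1)}=x\), and the function name `frozen_gaussian_transition` and its argument names are hypothetical. With the diffusion coefficient frozen at \(c=\varSigma +{\tilde{\varSigma }}\sqrt{x_{k-1}}\), the stochastic integral has variance \(c^2(1-\mathrm {e}^{-2a(t_k-t_{k-1})})/(2a)\).

```python
import math


def frozen_gaussian_transition(x_prev, a, beta, sigma, sigma_tilde, dt):
    """Mean and variance of the Gaussian transition in the scalar case.

    x_prev must be non-negative; dt is the step size t_k - t_{k-1}.
    """
    decay = math.exp(-a * dt)
    mean = decay * x_prev + (1.0 - decay) * beta
    # Diffusion coefficient frozen at the left end point of the interval.
    diffusion = sigma + sigma_tilde * math.sqrt(x_prev)
    var = diffusion ** 2 * (1.0 - math.exp(-2.0 * a * dt)) / (2.0 * a)
    return mean, var
```

For small step sizes the variance is close to \(c^2(t_k-t_{k-1})\), which matches an Euler-type discretization of the diffusion term.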

Note that given \(x_{k-1}\), the variable \({\check{x}}_k\) admits a Gaussian transition for \(k \in {\mathbb {N}}^+\). Hence, given the model parameters \(\theta \) (i.e. \(A,\beta , \varSigma , {{\tilde{\varSigma }}}\)), we can compute the approximated transition measure \({\hat{\pi }}_k^{\theta }=p(d{\check{x}}_k\mid x_{k-1}, \theta )\). Recall from Eqs. (2.17), (2.8) and (2.10) that the weights for the parameters space are computed by

$$\begin{aligned} w_k^{{\tilde{\theta }}_k^{(i)}}\propto \left( (l^{{\tilde{\theta }}_k^{(i)}}_{y_k},\pi _k^{{\tilde{\theta }}_k^{(i)}}), \varGamma _{k-1}^{{\tilde{\theta }}_k^{(i)}}\right) \,. \end{aligned}$$
(4.11)

In the recursive step of Algorithm 4.2, the measure \(\varGamma _{k-1}^{{\tilde{\theta }}_k^{(i)}}\) is not available at time \({k-1}\), since the parameter evolves from \(\theta _{k-1}^{(i)}\) to \({\tilde{\theta }}_k^{(i)}\) in the jittering step at time k. In order to have a recursive algorithm, the estimate \({\hat{\varGamma }}_{k-1}^{{\theta }_{k-1}^{(i)}}\) obtained at time \({k-1}\) is used to approximate the measure \(\varGamma _{k-1}^{{\tilde{\theta }}_k^{(i)}}\). Hence, when the model is linear and Gaussian, we obtain the following estimation of the weights for the parameters space

$$\begin{aligned} {\tilde{w}}_k^{{\tilde{\theta }}_k^{(i)}}\propto \left( (l^{{\tilde{\theta }}_k^{(i)}}_{y_k},\pi _k^{{\tilde{\theta }}_k^{(i)}}), {\hat{\varGamma }}_{k-1}^{\theta _{k-1}^{(i)}}\right) \,. \end{aligned}$$
(4.12)

When the model is nonlinear or non-Gaussian, we consider the approximation (4.10) to (3.4). In this case, the transition probability \(\pi _k^{{\tilde{\theta }}_k^{(i)}}\) is approximated by a Gaussian transition probability \({\hat{\pi }}_k^{{\tilde{\theta }}_k^{(i)}}\). Therefore, in the non-recursive step, the estimate of the weights for the parameters space is given by

$$\begin{aligned} {\hat{w}}_k^{{\tilde{\theta }}_k^{(i)}}\propto \left( (l^{{\tilde{\theta }}_k^{(i)}}_{y_k},{\hat{\pi }}_k^{{\tilde{\theta }}_k^{(i)}}), {\hat{\varGamma }}_{k-1}^{{\tilde{\theta }}_{k}^{(i)}}\right) \,. \end{aligned}$$
(4.13)

In the recursive step, the measure \(\varGamma _{k-1}^{{\tilde{\theta }}_k^{(i)}}\) is approximated by \({\hat{\varGamma }}_{k-1}^{{\theta }_{k-1}^{(i)}}\). Hence, we obtain the following estimation of the weights for the parameters space

$$\begin{aligned} {\hat{w}}_k^{{\tilde{\theta }}_k^{(i)}}\propto \left( (l^{{\tilde{\theta }}_k^{(i)}}_{y_k},{\hat{\pi }}_k^{{\tilde{\theta }}_k^{(i)}}), {\hat{\varGamma }}_{k-1}^{{\theta }_{k-1}^{(i)}}\right) \,, \end{aligned}$$
(4.14)

together with the estimation of the posterior measure of \(x_k\), for \(A \in {\mathcal {B}}({\mathbb {R}}^d)\) a Borel set,

$$\begin{aligned} {\hat{\varGamma }}_{k}^{\tilde{{\theta }}_{k}^{(i)}} (A)= \frac{\left( ({\mathbf {1}}_{\{x_k \in A\}}l^{{\tilde{\theta }}_k^{(i)}}_{y_k},{\hat{\pi }}_k^{{\tilde{\theta }}_k^{(i)}}), {\hat{\varGamma }}_{k-1}^{{\theta }_{k-1}^{(i)}}\right) }{\left( (l^{{\tilde{\theta }}_k^{(i)}}_{y_k},{\hat{\pi }}_k^{{\tilde{\theta }}_k^{(i)}}),{\hat{\varGamma }}_{k-1}^{{\theta }_{k-1}^{(i)}}\right) }\,. \end{aligned}$$

We will make use of the weights (4.12), (4.13) and (4.14) in our convergence analysis later in Sect. 5.

Remark 4.2

Switching between the jittering kernels happens once the condition in step 1.b of the algorithm is satisfied. Here one may take \(V_N\) having diagonal elements of order \(N^{-(p+2)/2}\), as mentioned at the beginning of the recursive step. It may however happen in an implementation that the switching does not take place. As will be explained at the beginning of Sect. 5, not switching from kernel 1 to kernel 2 will not prevent convergence of the algorithm. We consider two causes for not switching between kernels: first, the data series is too short; and second, the noise level h is extremely large and hence there is not much information contained in the data.

In the first case there are few data points, and the non-recursive algorithm will quickly produce results anyway, so this case is not problematic from a practical point of view. The second case is not problematic either, because it concerns a truly extreme situation. For example, the experiment that we present in Sect. 6.1 on simulated interest rates without noise produces data whose magnitude is on average several basis points (\(10^{-4}\)). The variance of the added noise is \(10^{-7}\), which is already quite large compared to the interest rates themselves. Even in this case, we observed that the algorithm switches kernels after 1164 steps. Data with 1164 working days (a bit less than 5 years) or more are widely available in financial markets, especially for interest rates. Nevertheless, for Algorithm 4.2 the convergence to the true measure is always assured.

4.2 Kalman Particle filter for models with piece-wise constant parameters

So far in this paper, the model parameters are assumed to be fixed over time. But in many applications, it is more realistic to assume that at least some parameters are time-varying, for instance piecewise constant. An offline method that can deal with the estimation of the location of the change-points and of the parameters of the model is studied e.g. in Chib (1998). From an SMC perspective, there are multiple papers that deal with change-point detection problems, see e.g. Chopin (2007), Fearnhead and Liu (2007) and He and Maheu (2010). For the particle filter methods which treat the model parameters as static, the variance of the samples for the parameters decreases with more observed data. Hence the marginal distribution of the model parameters will be increasingly concentrated around certain values. The consequence is that the particle filter algorithm is not able to capture abrupt changes of parameters.

We extend our proposed algorithm for static parameters so that it adapts to abrupt changes of parameters. To achieve that, we first identify the change points. This step is done by comparing the marginal likelihood between two consecutive steps. Suppose the parameter samples have already converged to the actual value. If at some point the actual parameter value jumps to another value, then the marginal likelihood based on the existing parameter samples is far from optimal. Hence the marginal likelihood at this time point should be significantly smaller than that at the previous time point. On the other hand, if at this time point the actual parameter value does not change, then the marginal likelihood should be very close to the previous value. So we set a threshold \(q<1\) and if at some time point k, the maximum marginal likelihood for \(\theta ^{(i)}_k \in D_\theta \), \(k\in {\mathbb {N}}^+\), satisfies

$$\begin{aligned} \max _{1\le i\le N}\left\{ p(y_{k}\mid y_{1:k-1},\theta ^{(i)}_{k})\right\} <q\max _{1\le i\le N}\left\{ p(y_{k-1}\mid y_{1:k-2},\theta ^{(i)}_{k-1})\right\} \,, \end{aligned}$$

then we consider the time point k to be the change point of the parameters. Of course, this condition is not sufficient but it is necessary. Notice that the threshold q in our case is user-specified. For statistical methods to choose q, we refer e.g. to Maheu and Gordon (2008).
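The detection rule above amounts to a one-line comparison of the largest per-particle marginal likelihoods at two consecutive time points. A minimal sketch, with the hypothetical function names `is_change_point` and `is_change_point_log`, could read as follows; the log-domain variant is an added assumption intended to avoid numerical underflow when the likelihoods are very small, and is not taken from the text.

```python
import math


def is_change_point(liks_now, liks_prev, q):
    """True if the best likelihood at time k drops below q times the best
    likelihood at time k-1 (with threshold 0 < q < 1)."""
    return max(liks_now) < q * max(liks_prev)


def is_change_point_log(loglik_now, loglik_prev, q):
    """Same test, applied to log-likelihoods."""
    return max(loglik_now) < math.log(q) + max(loglik_prev)
```

For instance, with \(q=0.1\) a drop of the best likelihood from 2.0 to 0.02 is flagged, whereas a move from 2.0 to 1.9 is not.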

Another issue here is that the parameter samples may have already (nearly) converged before the change point, hence the variance of these samples is too small to capture the change. This problem can be tackled by adding new samples from the initial parameter distribution to increase the sample variance. But since the variance is increased, the jittering kernel (4.9) does not satisfy Assumption 4.1. Hence the jittering kernel should switch to (4.5). Moreover, since the model parameter changes, the posterior distributions from the previous time point are also not valid anymore, and a new initial value for the mean and the variance of the posterior distribution should also be initialized. So once the change point is determined, one can treat the calibration of the model as a new calibration based on data after this change point.

We introduce the Kalman particle algorithm extended to time-varying parameters. We present the algorithm in full detail, noting that the differences with the previous Algorithm 4.2 are in the two jittering cases in the filtering step.

Algorithm 4.3

(Kalman particle filter for models with time-varying parameters)

Initialization:

  1. 1.

    set the number of particles N, a value for the discounting factor a, a switching variance level \(V_N\), a floored variance level \(V_f\), and the threshold parameter q,

  2. 2.

    assume an initial distribution \(p(\theta _{0})\) for the parameters,

  3. 3.

    sample from the initial distribution to get N particles \(\{\theta _{0}^{(i)}, i=1,\cdots ,N\}\) for the parameters,

  4. 4.

    for each particle \(\{\theta _{0}^{(i)}\}\), assign the same initial mean and covariance value of the posterior distribution \(B_{0}^{(i)}\) and \(P_{0}^{(i)}\) for the Kalman filter update,

Recursion:

  1. 1.

    Filtering: given \(\{\theta _{k-1}^{(i)}, \, i=1,\cdots ,N\}\),

    1. a.

      as long as \(\text {diag}(\varSigma (\theta _{k-1}))\) is not smaller than \(V_N\) (jittering case 1), for each \(i=1,\cdots ,N\),

      1. i.

        sample new parameters from the kernel (4.5), i.e.,

        $$\begin{aligned} {\tilde{\theta }}_k^{(i)}\sim \mathrm {TruncNorm}({\tilde{\theta }}_k; \mu (\theta _{k-1}^{(i)}), \varSigma (\theta _{k-1}),D_\theta )\,, \end{aligned}$$
      2. ii.

        based on the parameter \({\tilde{\theta }}_k^{(i)}\), use the Kalman filter to compute the mean and the covariance of the posterior distribution from time 1 to k,

      3. iii.

        compute the likelihood \(p(y_{k}\mid y_{1:k-1},{\tilde{\theta }}_k^{(i)})\), consequently obtain the normalized weights

        $$\begin{aligned} w_k^{{\tilde{\theta }}^{(i)}_k} = \frac{p(y_{k}\mid y_{1:k-1},{\tilde{\theta }}_k^{(i)})}{\sum _{i=1}^Np(y_{k}\mid y_{1:k-1},{\tilde{\theta }}_k^{(i)})}\,; \end{aligned}$$
    2. b.

      once \(\text {diag}(\varSigma (\theta _{l-1})) < V_N\) (jittering case 2), then for \(k\ge l+1\) and \(i=1,\cdots , N\), given \(\{B_{k-1}^{(i)}, P_{k-1}^{(i)}\}\),

      1. 1.

        sample new parameters from the kernel (4.9), i.e.

        $$\begin{aligned} {\tilde{\theta }}_k^{(i)}\sim \mathrm {TruncNorm}({\tilde{\theta }}_k; \theta _{k-1}^{(i)},\min \{\max \{\varSigma (\theta _{k-1}),V_f\},V_N\},D_\theta ), \end{aligned}$$
      2. 2.

        based on the parameters \({\tilde{\theta }}_{k}^{(i)},B_{k-1}^{(i)}\) and \(P_{k-1}^{(i)}\), use the Kalman filter to compute the mean \({\tilde{B}}_k^{(i)}\) and the covariance \({\tilde{P}}_k^{(i)}\) of the approximated posterior distribution at time k,

      3. 3.

        compute an approximation \({\tilde{p}}(y_{k}\mid y_{1:k-1},{\tilde{\theta }}_k^{(i)})\) of the likelihood \(p(y_{k}\mid y_{1:k-1},{\tilde{\theta }}_k^{(i)})\). If

        $$\begin{aligned} \max _{1\le i\le N}\{{\tilde{p}}(y_{k}\mid y_{1:k-1},{\tilde{\theta }}_{k}^{(i)})\}<q\max _{1\le i\le N}\{{\tilde{p}}(y_{k-1}\mid y_{1:k-2},{\tilde{\theta }}_{k-1}^{{(i)}})\}\,, \end{aligned}$$

        then go to the Initialization step of the algorithm to initialize the algorithm using the data after time k. Otherwise compute the normalized weights

        $$\begin{aligned} {\tilde{w}}_k^{{\tilde{\theta }}_k^{(i)}} = \frac{{\tilde{p}}(y_{k}\mid y_{1:k-1},{\tilde{\theta }}_k^{(i)})}{\sum _{i=1}^{N}{\tilde{p}}(y_{k}\mid y_{1:k-1},{\tilde{\theta }}_k^{(i)})}\, . \end{aligned}$$
  2. 2.

    Resampling: for each \(i=1,\cdots , N\), set \(\{{\theta }_k^{(i)}, B_k^{(i)}, P_k^{(i)}\} =\{{\tilde{\theta }}_k^{(p)}, {\tilde{B}}_k^{(p)}, {\tilde{P}}_k^{(p)}\}\), with probability \(w_k^{{\tilde{\theta }}^{(p)}_k}\), where \(p \in \{1,\cdots , N\}\).

In Sect. 6 we will present a numerical study where Algorithm 4.3 is seen to be capable of quickly tracking a sudden parameter change. Similar to the example in Crisan and Míguez (2018), Sect. 6, we content ourselves with this empirical tracking behaviour and refrain from presenting a consistency result in a large sample setting. For a fixed number of change points though, convergence results do exist, see e.g. Bai and Perron (1998) on multiple structural changes in linear regression. But, as stated above, an asymptotic analysis of this behaviour is beyond the scope of the present section.

5 Convergence analysis

This section is devoted to showing a convergence result, Theorem 5.9 below, for the Kalman particle Algorithm 4.2 introduced in Sect. 4.1.2. The standing assumption is that we have observations \(y_{1:k}\) generated by (3.4) and (3.5). The parameters are \(A,\beta ,\varSigma ,{\tilde{\varSigma }}\), collectively denoted \(\theta \).

In Algorithm 4.2, the (conditional) measure

$$\begin{aligned} \mu _k(\mathrm {d}\theta _k) = p(\mathrm {d}\theta _k\mid y_{1:k})\,, \end{aligned}$$
(5.1)

is estimated for \(k=1,\cdots ,K\). From now on, for convenience, we assume the algorithm starts at time 1. At each time, the algorithm has three main steps: jittering, update and resampling.

Define the maximum step size \(\varDelta = \sup _{k=1,\cdots , K}(t_{k}-t_{k-1})\). Let the parameter \(\theta \in D_\theta \). At time k, suppose the estimated measure \(\mu _{k-1}^N\) from the previous time step is available and

$$\begin{aligned} \mu _{k-1}^N = \frac{1}{N}\sum _{i=1}^N\delta _{\theta _{k-1}^{(i)}}(\mathrm {d}\theta )\,. \end{aligned}$$

In the jittering step as described in Algorithm 4.2, new samples \({\tilde{\theta }}_{k}^{(i)}\) are sampled from the kernel functions. The resulting measure \({\tilde{\mu }}_k^N\) is then defined by

$$\begin{aligned} {\tilde{\mu }}_k^N = \frac{1}{N}\sum _{i=1}^N\delta _{{\tilde{\theta }}_{k}^{(i)}}(\mathrm {d}\theta )\,. \end{aligned}$$
(5.2)

In this step, no extra information is used to refine the estimates on the parameters. Hence the aim is to prove that the measure \( {\tilde{\mu }}_k^N\) converges to the measure \(\mu _{k-1}^N\) in some sense when \(\varDelta \) goes to zero and the number of samples N goes to infinity.

In the update step as described in Algorithm 4.2 of Sect. 4.1.2, there are four cases to analyze, namely the combinations of linear Gaussian or non-linear/non-Gaussian models with the non-recursive or recursive part of the algorithm. When the model is linear and Gaussian, and we consider the non-recursive step of the algorithm, the normalized weights \(w_k^{{\tilde{\theta }}_k^{(i)}}\) are computed exactly. When we consider the recursive step of the algorithm for a Gaussian linear model, the normalized weights \(w_k^{{\tilde{\theta }}_k^{(i)}}\) are estimated by \({\tilde{w}}_k^{{\tilde{\theta }}_k^{(i)}}\), see (4.12). For the non-Gaussian or non-linear model, we consider the approximation (4.10) to (3.4), and the normalized weights are estimated by (4.13) and (4.14) in the non-recursive and the recursive step, respectively. Since the convergence of the weights in this last case (approximated model, recursive step) implies the convergence of the weights in the first three cases, we will only consider the model (4.10) and the recursive step of the algorithm. In this case, we obtain a new estimated measure

$$\begin{aligned} {\hat{\mu }}_k^N = \sum _{i=1}^N {\hat{w}}_k^{{\tilde{\theta }}_k^{(i)}}\delta _{{\tilde{\theta }}_{k}^{(i)}}(\mathrm {d}\theta )\,, \end{aligned}$$
(5.3)

and the convergence analysis of \({\hat{\mu }}_k^N\) depends on that of the estimated weights \({\hat{w}}_k^{{\tilde{\theta }}_k^{(i)}}\). The aim is to prove that \({\hat{\mu }}_k^N\) converges to \(\mu _k\) in some sense when \(\varDelta \) goes to 0 and N goes to infinity.

In the resampling step, the new particles \(\{\theta _k^{(i)},\, i=1,\cdots ,N\}\) are sampled from the empirical distribution \( \sum _{i=1}^{N}{\hat{w}}_k^{{\tilde{\theta }}_k^{(i)}}\delta _{{\tilde{\theta }}_k^{(i)}}\) and we need to prove that the measure

$$\begin{aligned} \mu _k^N = \frac{1}{N}\sum _{i=1}^N\delta _{\theta _k^{(i)}}(\mathrm {d}\theta )\,, \end{aligned}$$
(5.4)

converges to \(\mu _k\) when \(\varDelta \) goes to zero and N goes to infinity.
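The three particle measures \({\tilde{\mu }}_k^N\), \({\hat{\mu }}_k^N\) and \(\mu _k^N\) enter the analysis only through integrals \((f,\cdot )\) of bounded test functions. As an illustration, such an integral against an empirical measure is a plain or weighted average over the particles. The helper name `integrate` below is hypothetical, and the sketch assumes scalar particles and weights that already sum to one.

```python
def integrate(f, particles, weights=None):
    """Integral (f, mu) of f against an empirical measure.

    With weights=None the measure is (1/N) * sum of Dirac masses, as in
    (5.2) and (5.4); otherwise it is the weighted sum of Dirac masses,
    as in (5.3).
    """
    if weights is None:
        return sum(f(t) for t in particles) / len(particles)
    return sum(w * f(t) for w, t in zip(weights, particles))
```

The convergence statements of this section bound the \(L^p\) distance between such averages and the corresponding integral against the limiting measure.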

For our convergence analysis we will need, besides Assumption 4.1, the following three assumptions. They are of a type that is common when dealing with parameter estimation using particle filters.

Assumption 5.1

The posterior measure \(\varGamma _k^{\theta }\), \(k=1,\cdots ,N\), \(\theta \in D_\theta \), is Lipschitz in the parameter \(\theta \), i.e. for any bounded continuous function f,

$$\begin{aligned} \left| (f,\varGamma _k^{\theta })-(f,\varGamma _k^{\theta ^\prime })\right| \le e_{3,k}\left\| f\right\| _\infty \left\| \theta -\theta ^\prime \right\| \,, \end{aligned}$$
(5.5)

for some positive constant \(e_{3,k}\) independent of f.

Assumption 5.2

Let \((x_k)_{k\in {\mathbb {N}}^+}\) be as in (3.4). To emphasize that the distribution of \(x_k\) depends on \(\theta \), we write \((x_k^\theta )_{k\in {\mathbb {N}}^+}\). It is assumed that there exists a constant \(M>0\) such that

$$\begin{aligned} \sup _{\theta \in D_\theta , \, k\in {\mathbb {N}}^+}{\mathbb {E}}\left[ \left| x_k^\theta \right| \right] \le M. \end{aligned}$$

Assumption 5.3

For any fixed observation sequence \(y_{1:k}\), the likelihood \(\{l^{\theta }_{y_t}(x),\,t=1,\cdots ,k,\,\theta \in D_\theta \}\) satisfies

  1. 1.

    \(\left\| l_{y_t}\right\| _\infty := \sup _{\theta \in D_\theta }\left\| l^{\theta }_{y_t}\right\| _\infty <\infty \),

  2. 2.

    \(\inf _{\theta \in D_\theta }l^{\theta }_{y_t}>0\).

Remark 5.1

Assumptions 5.2 and 5.3 hold when a judicious choice of \(D_\theta \) is made. For instance, the inequality in Assumption 5.2 involves \(x^\theta _k\) for all k. This leads us to assume that the matrix \(-A\) is stable, i.e. all eigenvalues of A have positive real part or, in the diagonal case, all diagonal elements of A are positive. To have Assumption 5.3 satisfied as well, one has to impose further restrictions: the elements of A come from a bounded set and its eigenvalues are bounded away from zero (then \(A^{-1}\) also belongs to a bounded set), and the same conditions are imposed on \(\varSigma \) and \({{\tilde{\varSigma }}}\). Under these conditions, Assumptions 5.2 and 5.3 hold true.

Inequality (4.6) in Assumption 4.1 is used for the convergence analysis of the jittering step; Inequality (4.7) in Assumption 4.1, together with Assumptions 5.1 and 5.2, is used specifically for the analysis of the update step; and Assumption 5.3 is used for the analysis of both the update and the resampling steps. Given the convergence of \(\mu _{k-1}^N\) to \(\mu _{k-1}\) in an \(L^p\)-sense when \(\varDelta \) goes to zero and N goes to infinity, we present the convergence of the measures \({\tilde{\mu }}_k^N\), \({\hat{\mu }}_k^N\) and \(\mu _k^N\) in the three Lemmas 5.4, 5.7 and 5.8, respectively, in the subsections below. In Sect. 5.4, we study the convergence of Algorithm 4.2 presented in Sect. 4.1.2 by induction using these three lemmas. The main result there is Theorem 5.9. Our analysis is inspired by the results in Crisan and Míguez (2018) for the nested particle filter and follows a similar pattern. Note however the crucial differences between our Algorithm 4.2 and their nested particle filter. We have to deal with only one layer with a particle filter (instead of two such layers), as we use the Kalman filter in the other layer, but we also pay attention to the time discretization of the continuous processes.

5.1 Jittering

In the jittering step 1.b.i for low values of \(\varSigma (\theta _{k-1})\) the new parameter particles \({\tilde{\theta }}_k^{(i)}\) are sampled from a kernel function \(\kappa _k^N(\mathrm {d}\theta \mid \theta _{k-1}^{(i)})\), \(i=1,\cdots ,N\). The following lemma shows that the error due to the jittering step vanishes. This lemma can be seen as the analog of Lemma 3 in Crisan and Míguez (2018), but with the term \(\tfrac{1}{\sqrt{M}}\) there (which plays no role in our analysis) replaced with \(\sqrt{\varDelta }\) as we treat the influence of time discretization. It can be proven in the same way and we present it here for the sake of completeness and in a form that suits our purposes.

Lemma 5.4

Let f be a bounded function and suppose Assumption 4.1 holds. If

$$\begin{aligned} \left\| (f,\mu _{k-1}^N)-(f,\mu _{k-1})\right\| _p\le \frac{c_{1,{k-1}}\left\| f\right\| _\infty }{\sqrt{N}}+d_{1,{k-1}}\left\| f\right\| _\infty \sqrt{\varDelta }\,, \end{aligned}$$
(5.6)

for some constants \(c_{1,{k-1}}\) and \(d_{1,{k-1}}\) which are independent of f, N and \(\varDelta \), then there exist constants \({\tilde{c}}_{1,{k}}\) and \({\tilde{d}}_{1,{k}}\) which are independent of f, N and \(\varDelta \), such that

$$\begin{aligned} \left\| (f,{\tilde{\mu }}_{k}^N)-(f,\mu _{k-1})\right\| _p\le \frac{{\tilde{c}}_{1,{k}}\left\| f\right\| _\infty }{\sqrt{N}}+{\tilde{d}}_{1,{k}} \left\| f\right\| _\infty \sqrt{\varDelta }\,. \end{aligned}$$
(5.7)

5.2 Update

To prove the convergence in the update step 1.c, we first need to prove that the error introduced by approximating the weights \(w_k^{{\tilde{\theta }}_{k}^{(i)}}\) by the weights \({\hat{w}}_k^{{\tilde{\theta }}_{k}^{(i)}}\) can be bounded by a quantity of the same form as in (5.7). The following lemma is a core result in our convergence analysis; in its proof we exploit the affine nature of the state process.

Lemma 5.5

Let the observation sequence \(y_{1:k}\) be fixed. Suppose the function f is bounded and continuous and Assumptions 4.1, 5.1 and 5.2 hold. If

$$\begin{aligned} \sup _{1\le i\le N} \left| (f,{\hat{\varGamma }}_{k-1}^{\theta _{k-1}^{(i)}}) -(f,{\varGamma }_{k-1}^{\theta _{k-1}^{(i)}})\right| \le \frac{c_{2,{k-1}}\left\| f\right\| _\infty }{\sqrt{N}}+d_{2,{k-1}}\left\| f\right\| _\infty \sqrt{\varDelta }\,, \end{aligned}$$
(5.8)

for some constants \(c_{2,{k-1}}\) and \(d_{2,{k-1}}\) which are independent of f, N and \(\varDelta \), then there exist constants \({\tilde{c}}_{2,{k}}\) and \({\tilde{d}}_{2,k}\) which are independent of f, N and \(\varDelta \) such that

$$\begin{aligned} \sup _{1\le i\le N} \left| ((f,{\hat{\pi }}_k^{{\tilde{\theta }}_{k}^{(i)}}), {\hat{\varGamma }}_{k-1}^{\theta _{k-1}^{(i)}}) -((f,\pi _k^{{\tilde{\theta }}_{k}^{(i)}}),{\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}})\right| \le \frac{{\tilde{c}}_{2,{k}}\left\| f\right\| _\infty }{\sqrt{N}}+{\tilde{d}}_{2,{k}} \left\| f\right\| _\infty \sqrt{\varDelta }\,. \end{aligned}$$
(5.9)

To prove Lemma 5.5, we exploit the structure (3.4) of the state process. Besides, we need the following auxiliary result. The proof of it follows the same lines as the proof of Lemma 4 in Crisan and Míguez (2018), but note again the \(\frac{1}{\sqrt{M}}\) term is replaced by \(\sqrt{\varDelta }\).

Lemma 5.6

Suppose the function f is bounded and continuous. Moreover, suppose Assumptions 4.1 and 5.1 and Inequality (5.8) hold. Then there exist some constants \({\tilde{c}}_{2,{k-1}}\) and \({\tilde{d}}_{2,{k}}\) which are independent of f, N, \(\varDelta \) and of all \(\theta \) such that

$$\begin{aligned} \sup _{1\le i\le N} \left| (f,{\hat{\varGamma }}_{k-1}^{\theta _{k-1}^{(i)}}) -(f,{\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}})\right| \le \frac{{\tilde{c}}_{2,{k-1}}\left\| f\right\| _\infty }{\sqrt{N}}+{\tilde{d}}_{2,{k}} \left\| f\right\| _\infty \sqrt{\varDelta }\,. \end{aligned}$$
(5.10)

Now we are ready to prove Lemma 5.5.

Proof

Using the triangle inequality, one obtains

$$\begin{aligned}&{\sup _{1\le i\le N}\left| ((f,{\hat{\pi }}_k^{{\tilde{\theta }}_{k}^{(i)}}), {\hat{\varGamma }}_{k-1}^{\theta _{k-1}^{(i)}}) -((f,\pi _k^{{\tilde{\theta }}_{k}^{(i)}}), {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}})\right| } \nonumber \\&\quad \le \sup _{1\le i\le N}\left| ((f,{\hat{\pi }}_k^{{\tilde{\theta }}_{k}^{(i)}}), {\hat{\varGamma }}_{k-1}^{\theta _{k-1}^{(i)}}) -((f,{\hat{\pi }}_k^{{\tilde{\theta }}_{k}^{(i)}}), {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}})\right| \nonumber \\&\qquad +\sup _{1\le i\le N}\left| ((f,{\hat{\pi }}_k^{{\tilde{\theta }}_{k}^{(i)}}), {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}}) -((f,\pi _k^{{\tilde{\theta }}_{k}^{(i)}}), {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}})\right| \,. \end{aligned}$$
(5.11)

Note that \(\sup _{\theta \in D_\theta }(f,{\hat{\pi }}_k^{\theta })\) is bounded by \(\left\| f\right\| _\infty \). Hence, using Inequality (5.10), we get for the first term on the right hand side of (5.11)

$$\begin{aligned}&{\sup _{1\le i\le N}\left| ((f,{\hat{\pi }}_k^{{\tilde{\theta }}_{k}^{(i)}}), {\hat{\varGamma }}_{k-1}^{\theta _{k-1}^{(i)}}) -((f,{\hat{\pi }}_k^{{\tilde{\theta }}_{k}^{(i)}}), {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}})\right| } \nonumber \\&\quad \le \sup _{\theta \in D_\theta }\sup _{1\le i\le N}\left| ((f,{\hat{\pi }}_k^{\theta }), {\hat{\varGamma }}_{k-1}^{\theta _{k-1}^{(i)}}) -((f,{\hat{\pi }}_k^{\theta }), {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}})\right| \nonumber \\&\quad \le \frac{{\tilde{c}}_{2,{k-1}}\left\| f\right\| _\infty }{\sqrt{N}}+{\tilde{d}}_{2,{k}} \left\| f\right\| _\infty \sqrt{\varDelta }\,. \end{aligned}$$
(5.12)

Recall \((x_k)_{k\in {\mathbb {N}}}\) and \(({\check{x}}_k)_{k\in {\mathbb {N}}}\) respectively from (3.4) and (4.10). For \(\epsilon \), \(M_1>0\), define the sets

$$\begin{aligned} A_{\epsilon ,i}&= \left\{ \left| {\check{x}}_k^{{\tilde{\theta }}_{k}^{(i)}} -x_k^{{\tilde{\theta }}_{k}^{(i)}}\right| <\epsilon \right\} \,, \\ B_{M_1,i}&= \left\{ \left| {\check{x}}_k^{{\tilde{\theta }}_{k}^{(i)}}\right| \le M_1, \,\,\left| {x}_k^{{\tilde{\theta }}_{k}^{(i)}}\right| \le M_1 \right\} \,. \end{aligned}$$

Note that \((f,{\hat{\pi }}_k^{{\tilde{\theta }}_{k}^{(i)}})\) can be seen as the (conditional) expectation of f(.) taken under the measure \({\hat{\pi }}_k^{{\tilde{\theta }}_{k}^{(i)}}\). Stated otherwise, we can see it as the expectation of \(f({\check{x}}_k^{{\tilde{\theta }}_{k}^{(i)}})\). Likewise, we can see \((f,\pi _k^{{\tilde{\theta }}_{k}^{(i)}})\) as the (conditional) expectation of \(f({x}_k^{{\tilde{\theta }}_{k}^{(i)}})\). Below we use the notations \({\mathbb {E}}f({\check{x}}_k^{{\tilde{\theta }}_{k}^{(i)}})\) and \({\mathbb {E}}f({x}_k^{{\tilde{\theta }}_{k}^{(i)}})\) for these expectations. With these interpretations, the second term on the right hand side of (5.11) yields

$$\begin{aligned}&\sup _{1\le i\le N}\left| \left( (f,{\hat{\pi }}_k^{{\tilde{\theta }}_{k}^{(i)}}), {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}}\right) -\left( (f,\pi _k^{{\tilde{\theta }}_{k}^{(i)}}), {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}}\right) \right| \nonumber \\&\quad = \sup _{1\le i\le N}\left| \left( (f,{\hat{\pi }}_k^{{\tilde{\theta }}_{k}^{(i)}}) -(f,\pi _k^{{\tilde{\theta }}_{k}^{(i)}}), {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}}\right) \right| \nonumber \\&\quad \le \sup _{1\le i\le N}{{ \left( \left| (f,{\hat{\pi }}_k^{{\tilde{\theta }}_{k}^{(i)}}) -(f,\pi _k^{{\tilde{\theta }}_{k}^{(i)}})\right| , {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}} \right) }}\nonumber \\&\quad = \sup _{1\le i\le N}{\left( \left| {\mathbb {E}} f({\check{x}}_k^{{\tilde{\theta }}_{k}^{(i)}}) -{\mathbb {E}}f({x}_k^{{\tilde{\theta }}_{k}^{(i)}})\right| , {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}}\right) }\nonumber \\&\quad \le \sup _{1\le i\le N}{\left( {\mathbb {E}} \left| f({\check{x}}_k^{{\tilde{\theta }}_{k}^{(i)}}) -f({x}_k^{{\tilde{\theta }}_{k}^{(i)}})\right| , {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}}\right) }\nonumber \\&\quad =\sup _{1\le i\le N}{\left( {\mathbb {E}} {\mathbf {1}}_{A_{\epsilon ,i}}{\mathbf {1}}_{B_{M_1,i}} \left| f({\check{x}}_k^{{\tilde{\theta }}_{k}^{(i)}}) -f({x}_k^{{\tilde{\theta }}_{k}^{(i)}})\right| , {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}}\right) }\nonumber \\&\qquad + \sup _{1\le i\le N}{\left( {\mathbb {E}} {\mathbf {1}}_{A_{\epsilon ,i}}{\mathbf {1}}_{B_{M_1,i}^c} \left| f({\check{x}}_k^{{\tilde{\theta }}_{k}^{(i)}}) -f({x}_k^{{\tilde{\theta }}_{k}^{(i)}})\right| , {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}}\right) }\nonumber \\&\qquad + \sup _{1\le i\le N}{\left( {\mathbb {E}}{\mathbf {1}}_{A_{\epsilon ,i}^c} \left| f({\check{x}}_k^{{\tilde{\theta }}_{k}^{(i)}}) -f({x}_k^{{\tilde{\theta }}_{k}^{(i)}})\right| , {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}}\right) }. \end{aligned}$$
(5.13)

We need to find an upper bound for the three terms on the right hand side of (5.13).

Consider the first term on the right hand side of (5.13) and note that the random variables \({\check{x}}_k^{{\tilde{\theta }}_{k}^{(i)}}\) and \({x}_k^{{\tilde{\theta }}_{k}^{(i)}}\) restricted to \(B_{M_1,i}\) take values in a compact set. Hence the continuous function f is also uniformly continuous on that set. Then there exists an \(\epsilon >0\), such that \(\left| f(x)-f(y)\right| \le \sqrt{\varDelta }\) for all x, y in that set with \(\left| x-y\right| <\epsilon \). From now on we assume that we use this \(\epsilon \), and we obtain for the first term on the right hand side of (5.13), using the definition of \(A_{\epsilon ,i}\),

$$\begin{aligned} \sup _{1\le i\le N}{\mathbb {E}}{\mathbf {1}}_{A_{\epsilon ,i}}{\mathbf {1}}_{B_{M_1,i}} \left| f({\check{x}}_k^{{\tilde{\theta }}_{k}^{(i)}}) -f({x}_k^{{\tilde{\theta }}_{k}^{(i)}})\right| \le \sqrt{\varDelta }\,, \end{aligned}$$

which implies

$$\begin{aligned} \sup _{1\le i\le N}{\left( {\mathbb {E}}{\mathbf {1}}_{A_{\epsilon ,i}}{\mathbf {1}}_{B_{M_1,i}} \left| f({\check{x}}_k^{{\tilde{\theta }}_{k}^{(i)}}) -f({x}_k^{{\tilde{\theta }}_{k}^{(i)}})\right| , {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}}\right) }\le \sqrt{\varDelta }\,. \end{aligned}$$
(5.14)

Next we consider the second term on the right hand side of (5.13). Set \(M_1=M/\sqrt{\varDelta }+\epsilon \), where M is as defined in Assumption 5.2. On the set \(A_{\epsilon ,i}\), we have \(\left| {\check{x}}_k^{{\tilde{\theta }}_{k}^{(i)}}\right| -\epsilon \le \left| {x}_k^{{\tilde{\theta }}_{k}^{(i)}}\right| \le \left| {\check{x}}_k^{{\tilde{\theta }}_{k}^{(i)}}\right| + \epsilon \), which implies \({\mathbb {P}}\left( A_{\epsilon ,i}, \left| {\check{x}}_k^{{\tilde{\theta }}_{k}^{(i)}}\right|>M_1 \right) \le {\mathbb {P}}\left( \left| {x}_k^{{\tilde{\theta }}_{k}^{(i)}}\right| > M_1-\epsilon \right) \). Hence using Assumption 5.2 and the Markov inequality, we obtain

$$\begin{aligned}&\sup _{1\le i\le N}{\mathbb {E}}{\mathbf {1}}_{A_{\epsilon ,i}} {\mathbf {1}}_{B_{M_1,i}^c}\left| f({\check{x}}_k^{{\tilde{\theta }}_{k}^{(i)}}) -f({x}_k^{{\tilde{\theta }}_{k}^{(i)}})\right| \nonumber \\&\quad \le 2\left\| f\right\| _\infty \sup _{1\le i\le N} \left( {\mathbb {E}}{\mathbf {1}}_{A_{\epsilon ,i}}{\mathbf {1}}_{B_{M_1,i}^c}\right) \nonumber \\&\quad \le 2\left\| f\right\| _\infty \sup _{1\le i\le N}\left[ {\mathbb {P}} \left( A_{\epsilon ,i},\left| {\check{x}}_k^{{\tilde{\theta }}_{k}^{(i)}}\right|>M_1\right) + {\mathbb {P}}\left( A_{\epsilon ,i}, \left| {x}_k^{{\tilde{\theta }}_{k}^{(i)}}\right|> M_1 \right) \right] \nonumber \\&\quad \le 2\left\| f\right\| _\infty \sup _{1\le i\le N}\left[ {\mathbb {P}} \left( \left| {x}_k^{{\tilde{\theta }}_{k}^{(i)}}\right|> M_1-\epsilon \right) + {\mathbb {P}}\left( \left| {x}_k^{{\tilde{\theta }}_{k}^{(i)}}\right|> M_1 \right) \right] \nonumber \\&\quad \le 4\left\| f\right\| _\infty \sup _{1\le i\le N}{\mathbb {P}} \left( \left| {x}_k^{{\tilde{\theta }}_{k}^{(i)}}\right| > M_1-\epsilon \right) \nonumber \\&\quad \le 4\left\| f\right\| _\infty \frac{\sup _{1\le i\le N}{\mathbb {E}} \left| x_k^{{\tilde{\theta }}_{k}^{(i)}}\right| }{M_1-\epsilon }\nonumber \\&\quad \le 4\left\| f\right\| _\infty \sqrt{\varDelta } \,. \end{aligned}$$
(5.15)
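
For the reader's convenience, the last step of (5.15) can be spelled out. This is only a sketch under our reading of Assumption 5.2 (which is not restated in this section), namely that it provides the moment bound \(\sup _{1\le i\le N}{\mathbb {E}}\left| x_k^{{\tilde{\theta }}_{k}^{(i)}}\right| \le M\). With the choice \(M_1=M/\sqrt{\varDelta }+\epsilon \) we then have

$$\begin{aligned} \frac{\sup _{1\le i\le N}{\mathbb {E}}\left| x_k^{{\tilde{\theta }}_{k}^{(i)}}\right| }{M_1-\epsilon } \le \frac{M}{M/\sqrt{\varDelta }} = \sqrt{\varDelta }\,. \end{aligned}$$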

Substituting (5.15) into the second term of the right hand side of (5.13), we obtain

$$\begin{aligned}&{\sup _{1\le i\le N}{\left( {\mathbb {E}}{\mathbf {1}}_{A_{\epsilon ,i}} {\mathbf {1}}_{B_{M_1,i}^c}\left| f({\check{x}}_k^{{\tilde{\theta }}_{k}^{(i)}}) -f({x}_k^{{\tilde{\theta }}_{k}^{(i)}})\right| , {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}}\right) }} \nonumber \\&\quad \le \sup _{1\le i\le N}{\left( 4\left\| f\right\| _\infty \sqrt{\varDelta }, {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}}\right) }\nonumber \\&\quad \le 4\left\| f\right\| _\infty \sqrt{\varDelta }\,. \end{aligned}$$
(5.16)

For the last term on the right hand side of (5.13), we apply again the Markov inequality, to obtain

$$\begin{aligned} {\mathbb {E}}{\mathbf {1}}_{A_{\epsilon ,i}^c} \left| f({\check{x}}_k^{{\tilde{\theta }}_{k}^{(i)}}) -f({x}_k^{{\tilde{\theta }}_{k}^{(i)}})\right|&\le 2\left\| f\right\| _\infty {\mathbb {P}}(A_{\epsilon ,i}^c)\\&\le 2\left\| f\right\| _\infty \frac{{\mathbb {E}} \left| {\check{x}}_k^{{\tilde{\theta }}_{k}^{(i)}} -{x}_k^{{\tilde{\theta }}_{k}^{(i)}}\right| }{\epsilon }\,.\\ \end{aligned}$$

We denote the i-th component of an \({\mathbb {R}}^d\)-valued process \((x_k)_{k\in {\mathbb {N}}}\) by \((x^{(i)}_k)_{k\in {\mathbb {N}}}\). Furthermore, we temporarily suppress the dependence on \({\tilde{\theta }}_{k}^{(i)}\) in the notation. Given \(x_{k-1}\), according to Eqs. (3.4) and (4.10), we obtain

$$\begin{aligned} {\mathbb {E}}\left| {\check{x}}_k^{(i)}-{x}_k^{(i)}\right|&= \mathrm {e}^{-\alpha _i(t_k-t_{k-1})}{\mathbb {E}}\left| \int _{t_{k-1}}^{t_{k}} \mathrm {e}^{\alpha _iu}\sum _{j=1}^{d}\varSigma _{ij}\left( \sqrt{x_{u}^{(1)}} -\sqrt{x_{k-1}^{(1)}}\right) \mathrm {d}W_u^{(j)}\right| \\&\le {\mathbb {E}}\left| \int _{t_{k-1}}^{t_{k}}\mathrm {e}^{\alpha _iu}\sum _{j=1}^{d}\varSigma _{ij} \left( \sqrt{x_{u}^{(1)}}-\sqrt{x_{k-1}^{(1)}}\right) \mathrm {d}W^{(j)}_u\right| \,. \end{aligned}$$

Define \(\varSigma ^{(i)}=\sqrt{\sum _{j=1}^{d}(\varSigma _{ij})^2}\). Using the Burkholder-Davis-Gundy inequality (Karatzas and Shreve 1998, Theorem 3.28), we know there exists a constant C which does not depend on \((x_t)_{t\ge 0}\) such that, for every \(i=1,\cdots ,d\),

$$\begin{aligned}&{{\mathbb {E}}\left| \int _{t_{k-1}}^{t_{k}}\mathrm {e}^{\alpha _iu}\sum _{j=1}^{d} \varSigma _{ij}\left( \sqrt{x_{u}^{(1)}}-\sqrt{x_{k-1}^{(1)}}\right) \,\mathrm {d}W_u^{(j)}\right| } \\&\quad \le C\varSigma ^{(i)} {\mathbb {E}}\left( \int _{t_{k-1}}^{t_{k}} \mathrm {e}^{2\alpha _iu}\left( \sqrt{x_u^{(1)}}-\sqrt{x_{k-1}^{(1)}}\right) ^2\, \mathrm {d}u\right) ^{\frac{1}{2}}\\&\quad \le C\varSigma ^{(i)} {\mathbb {E}}\left( \int _{t_{k-1}}^{t_{k}}\mathrm {e}^{2\alpha _iu} \left( x_u^{(1)}+x_{k-1}^{(1)}\right) \,\mathrm {d}u\right) ^{\frac{1}{2}}. \end{aligned}$$

Using Jensen’s inequality and Fubini’s theorem, we get for the latter expectation

$$\begin{aligned}&{{\mathbb {E}}\left( \int _{t_{k-1}}^{t_{k}}\mathrm {e}^{2\alpha _iu} \left( x_u^{(1)}+x_{k-1}^{(1)}\right) \,\mathrm {d}u\right) ^{\frac{1}{2}}}\\&\quad \le \left( {\mathbb {E}}\int _{t_{k-1}}^{t_{k}}\mathrm {e}^{2\alpha _iu} \left( x_u^{(1)}+x_{k-1}^{(1)}\right) \,\mathrm {d}u\right) ^{\frac{1}{2}}\\&\quad = \left( \int _{t_{k-1}}^{t_{k}}\mathrm {e}^{2\alpha _iu}{\mathbb {E}} \left( x_u^{(1)}+x_{k-1}^{(1)}\right) \,\mathrm {d}u\right) ^{\frac{1}{2}}\\&\quad = \left( \int _{t_{k-1}}^{t_{k}}\mathrm {e}^{2\alpha _i u} \left[ (1+\mathrm {e}^{-\alpha _1(u-t_{k-1})})x_{k-1}^{(1)} +(1-e^{-\alpha _1(u-t_{k-1})})\beta _1\right] \mathrm {d}u\right) ^{\frac{1}{2}}\\&\quad = \left( \frac{x_{k-1}^{(1)}+\beta _1}{2\alpha _i} \left( \mathrm {e}^{2\alpha _i\varDelta }-1\right) e^{2\alpha _it_{k-1}} \right. \nonumber \\&\qquad +\left. \frac{x_{k-1}^{(1)}-\beta _1}{2\alpha _i-\alpha _1} \left( \mathrm {e}^{(2\alpha _i-\alpha _1)\varDelta }-1\right) e^{2\alpha _it_{k-1}}\right) ^{\frac{1}{2}}. \end{aligned}$$

Note that \(\mathrm {e}^{x}-1=O(x)\) as \(x\rightarrow 0\). Since the parameters are assumed to take values in a compact domain, we conclude that there exists a constant \(C_1\), independent of the parameters, such that

$$\begin{aligned} {\mathbb {E}}\left| {\check{x}}_k-{x}_k\right| \le C_1x_{k-1}^{(1)}\sqrt{\varDelta }\,. \end{aligned}$$
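
The closed-form integral obtained above, and the \(\sqrt{\varDelta }\) rate that follows from it, can be checked numerically. The sketch below compares the closed form with a midpoint-rule quadrature and then looks at the ratio of its square root to \(\sqrt{\varDelta }\) for a small step. All numerical values and function names are illustrative assumptions and are not taken from the experiments in this paper; the closed form requires \(2\alpha _i\ne \alpha _1\).

```python
import math

def integrand(u, x, alpha_i, alpha_1, beta_1, t_prev):
    """e^{2 alpha_i u} * E[x_u^{(1)} + x_{k-1}^{(1)} | x_{k-1}^{(1)} = x]."""
    decay = math.exp(-alpha_1 * (u - t_prev))
    return math.exp(2 * alpha_i * u) * ((1 + decay) * x + (1 - decay) * beta_1)

def closed_form(x, alpha_i, alpha_1, beta_1, t_prev, delta):
    """Closed form of the integral over [t_prev, t_prev + delta]."""
    scale = math.exp(2 * alpha_i * t_prev)
    first = (x + beta_1) / (2 * alpha_i) * math.expm1(2 * alpha_i * delta)
    second = ((x - beta_1) / (2 * alpha_i - alpha_1)
              * math.expm1((2 * alpha_i - alpha_1) * delta))
    return scale * (first + second)

def midpoint_rule(x, alpha_i, alpha_1, beta_1, t_prev, delta, n=20000):
    """Plain midpoint quadrature of the integrand, used only as a cross-check."""
    h = delta / n
    return h * sum(
        integrand(t_prev + (j + 0.5) * h, x, alpha_i, alpha_1, beta_1, t_prev)
        for j in range(n)
    )

# Hypothetical values chosen for illustration only.
x, alpha_i, alpha_1, beta_1, t_prev = 0.005, 0.3, 0.45, 0.001, 1.0

exact = closed_form(x, alpha_i, alpha_1, beta_1, t_prev, delta=0.004)
approx = midpoint_rule(x, alpha_i, alpha_1, beta_1, t_prev, delta=0.004)

# sqrt(integral) / sqrt(delta) should stay bounded as delta -> 0;
# its limit is sqrt(2 x) * e^{alpha_i t_prev}.
small = 1e-6
ratio = math.sqrt(closed_form(x, alpha_i, alpha_1, beta_1, t_prev, small) / small)
limit = math.sqrt(2 * x) * math.exp(alpha_i * t_prev)
```

The boundedness of `ratio` is what allows the square root of the integral to be absorbed into a constant times \(\sqrt{\varDelta }\).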

Hence, returning to previously used notation,

$$\begin{aligned} {\mathbb {E}}{\mathbf {1}}_{A_{\epsilon ,i}^c} \left| f({\check{x}}_k^{{\tilde{\theta }}_{k}^{(i)}}) -f({x}_k^{{\tilde{\theta }}_{k}^{(i)}})\right| \le \frac{C_1}{\epsilon }x_{k-1}^{{(1)},{\tilde{\theta }}_{k}^{(i)}}\sqrt{\varDelta }\,, \end{aligned}$$

which implies

$$\begin{aligned} \sup _{1\le i\le N}{\left( {\mathbb {E}}{\mathbf {1}}_{A_{\epsilon ,i}^c} \left| f({\check{x}}_k^{{\tilde{\theta }}_{k}^{(i)}}) -f({x}_k^{{\tilde{\theta }}_{k}^{(i)}})\right| , {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}}\right) }&\le \frac{C_1}{\epsilon }\sqrt{\varDelta }\sup _{1\le i\le N} {\mathbb {E}}{x_{k-1}^{{(1)},{\tilde{\theta }}_{k}^{(i)}}}\nonumber \\&\le \frac{C_1M}{\epsilon }\sqrt{\varDelta }\,. \end{aligned}$$
(5.17)

Combining (5.14), (5.16) and (5.17) with (5.12), we obtain the statement of the lemma. \(\square \)

Lemma 5.5 shows that the approximation errors of \({\hat{\pi }}_k^{{\tilde{\theta }}_k^{(i)}}\) and \({\hat{\varGamma }}_{k-1}^{{\theta }_{k-1}^{(i)}}\) can be controlled in an appropriate manner, guaranteeing \(\varSigma (\theta _{k-1})\) below a threshold value \(V_N\). This allows us to run the outer layer in Algorithm 4.2 recursively, see step 1.b. Adding Assumption 5.3, we present in the following lemma the convergence of \({\hat{\mu }}_k^N\).

Lemma 5.7

Let the observation sequence \(y_{1:k}\) be fixed and Assumptions 4.1, 5.1, 5.2 and 5.3 hold. Then for any bounded and continuous function f, if (5.6) and

$$\begin{aligned} \sup _{1\le i\le N} \left| (f,{\hat{\varGamma }}_{k-1}^{\theta _{k-1}^{(i)}}) -(f,{\varGamma }_{k-1}^{\theta _{k-1}^{(i)}})\right| \le \frac{c_{2,{k-1}}\left\| f\right\| _\infty }{\sqrt{N}}+d_{2,{k-1}} \left\| f\right\| _\infty \sqrt{\varDelta }\,, \end{aligned}$$

hold for some constants \(c_{2,{k-1}}\) and \(d_{2,{k-1}}\) which are independent of f, N and \(\varDelta \), then there exist constants \({\hat{c}}_{1,{k}}, {\hat{d}}_{1,k}, {\tilde{c}}_{2,{k}}\) and \({\tilde{d}}_{2,k}\) which are independent of f, N and \(\varDelta \) such that

$$\begin{aligned} \left\| (f,{\hat{\mu }}_{k}^N)-(f,\mu _{k})\right\| _p&\le \frac{{\hat{c}}_{1,{k}}\left\| f\right\| _\infty }{\sqrt{N}}+{\hat{d}}_{1,{k}} \left\| f\right\| _\infty \sqrt{\varDelta }\,,\\ \sup _{1\le i\le N} \left| (f,{\hat{\varGamma }}_k^{{\tilde{\theta }}_{k}^{(i)}}) -(f,{\varGamma }_{k}^{{\tilde{\theta }}_{k}^{(i)}})\right|&\le \frac{{\tilde{c}}_{2,{k}}\left\| f\right\| _\infty }{\sqrt{N}}+{\tilde{d}}_{2,{k}} \left\| f\right\| _\infty \sqrt{\varDelta }\,. \end{aligned}$$

Proof

First, we prove the convergence of the estimated normalized weights \({\hat{w}}_k^{{\tilde{\theta }}_k^{(i)}}\), which is then used to prove the convergence of the measures \({\hat{\mu }}_{k}^N\). Denote the unnormalized weights by \({\hat{v}}_k^{{\tilde{\theta }}_k^{(i)}}=((l^{{\tilde{\theta }}_k^{(i)}}_{y_k}, {\hat{\pi }}_k^{{\tilde{\theta }}_k^{(i)}}), {\hat{\varGamma }}_{k-1}^{{\theta }_{k-1}^{(i)}})\), see (4.14), and \({v}_k^{{\tilde{\theta }}_k^{(i)}}= ((l^{{\tilde{\theta }}_k^{(i)}}_{y_k},{\pi }_k^{{\tilde{\theta }}_k^{(i)}}), {\varGamma }_{k-1}^{\tilde{{\theta }}_{k}^{(i)}})\), see (4.11). Note that \(\sup _{\theta \in D_\theta } (l^{\theta }_{y_k},{\pi }_k^{\theta })\) is bounded by \(\left\| l_{y_k}\right\| _\infty \). Using Lemma 5.5, and noting that \(\left\| l_{y_k}\right\| _\infty \) is finite under Assumption 5.3, we have

$$\begin{aligned} \sup _{1\le i\le N}\left| {\hat{v}}_k^{{\tilde{\theta }}_k^{(i)}}-{v}_k^{{\tilde{\theta }}_k^{(i)}}\right|&=\sup _{1\le i\le N} \left| ((l_{y_k}^{{\tilde{\theta }}_k^{(i)}},{\hat{\pi }}_k^{{\tilde{\theta }}_k^{(i)}}), {\hat{\varGamma }}_{k-1}^{{\theta }_{k-1}^{(i)}}) -((l_{y_k}^{{\tilde{\theta }}_k^{(i)}},{\pi }_k^{{\tilde{\theta }}_k^{(i)}}), {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}})\right| \nonumber \\&\le \sup _{\theta \in D_\theta } \sup _{1\le i\le N} \left| ((l_{y_k}^{\theta },{\hat{\pi }}_k^{{\tilde{\theta }}_k^{(i)}}), {\hat{\varGamma }}_{k-1}^{{\theta }_{k-1}^{(i)}}) -((l_{y_k}^{\theta },{\pi }_k^{{\tilde{\theta }}_k^{(i)}}), {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}})\right| \nonumber \\&\le \frac{{\tilde{c}}_{2,{k}}}{\sqrt{N}}+{\tilde{d}}_{2,{k}}\sqrt{\varDelta } \,, \end{aligned}$$
(5.18)

where \({\tilde{c}}_{2,{k}}\) and \({\tilde{d}}_{2,{k}}\) are constants which are independent of \(N,\varDelta \) and \(\theta \). By Assumption 5.3, we obtain

$$\begin{aligned} \inf _{\theta \in D_\theta }{\hat{v}}_k^{\theta }, \inf _{\theta \in D_\theta }{v}_k^{\theta }&>0\,,\\ \sup _{\theta \in D_\theta }{\hat{v}}_k^{\theta },\sup _{\theta \in D_\theta }{v}_k^{\theta }&<\infty \,. \end{aligned}$$

Hence for the normalized weights, it follows that

$$\begin{aligned} \sup _{1\le i\le N}\left| {\hat{w}}_k^{{\tilde{\theta }}_k^{(i)}}-{w}_k^{{\tilde{\theta }}_k^{(i)}}\right|&=\sup _{1\le i\le N}\left| \frac{{\hat{v}}_k^{{\tilde{\theta }}_k^{(i)}}}{\sum _{i=1}^N {\hat{v}}_k^{{\tilde{\theta }}_k^{(i)}}} -\frac{{v}_k^{{\tilde{\theta }}_k^{(i)}}}{\sum _{i=1}^N {v}_k^{{\tilde{\theta }}_k^{(i)}}} \right| \nonumber \\&\le \sup _{1\le i\le N}\left| \frac{{\hat{v}}_k^{{\tilde{\theta }}_k^{(i)}}}{\sum _{i=1}^N {\hat{v}}_k^{{\tilde{\theta }}_k^{(i)}}} - \frac{{\hat{v}}_k^{{\tilde{\theta }}_k^{(i)}}}{\sum _{i=1}^N {v}_k^{{\tilde{\theta }}_k^{(i)}}}\right| \nonumber \\&\qquad + \sup _{1\le i\le N}\left| \frac{{\hat{v}}_k^{{\tilde{\theta }}_k^{(i)}}}{\sum _{i=1}^N {v}_k^{{\tilde{\theta }}_k^{(i)}}} - \frac{{v}_k^{{\tilde{\theta }}_k^{(i)}}}{\sum _{i=1}^N {v}_k^{{\tilde{\theta }}_k^{(i)}}}\right| \nonumber \\&\le \sup _{1\le i\le N}\frac{{\hat{v}}_k^{{\tilde{\theta }}_k^{(i)}}}{(\sum _{i=1}^N {v}_k^{{\tilde{\theta }}_k^{(i)}})(\sum _{i=1}^N {\hat{v}}_k^{{\tilde{\theta }}_k^{(i)}})}\sum _{i=1}^N \left| {v}_k^{{\tilde{\theta }}_k^{(i)}}-{\hat{v}}_k^{{\tilde{\theta }}_k^{(i)}}\right| \nonumber \\&\qquad + \sup _{1\le i\le N}\frac{1}{\sum _{i=1}^N {\hat{v}}_k^{{\tilde{\theta }}_k^{(i)}}} \left| {\hat{v}}_k^{{\tilde{\theta }}_k^{(i)}}-{v}_k^{{\tilde{\theta }}_k^{(i)}}\right| \,. \end{aligned}$$
(5.19)

Observe that \(\frac{{\hat{v}}_k^{{\tilde{\theta }}_k^{(i)}}}{\sum _{i=1}^N {v}_k^{{\tilde{\theta }}_k^{(i)}}}\le 1\) and that \(\sum _{i=1}^N {\hat{v}}_k^{{\tilde{\theta }}_k^{(i)}}\) is bounded from below by a constant times N. Substituting Inequality (5.18) into Inequality (5.19), one obtains that there exist constants \({\hat{c}}_{2,{k}}\) and \({\hat{d}}_{2,{k}}\) such that

$$\begin{aligned} \sup _{1\le i\le N}\left| {\hat{w}}_k^{{\tilde{\theta }}_k^{(i)}}-{w}_k^{{\tilde{\theta }}_k^{(i)}}\right| \le \frac{{\hat{c}}_{2,{k}}}{\sqrt{N}}+{\hat{d}}_{2,{k}}\sqrt{\varDelta }\,. \end{aligned}$$
(5.20)

Next we study the convergence of the measures \({\hat{\mu }}_{k}^N\). To simplify the notation, we write \({\hat{w}}_k = {\hat{w}}_k^{{\tilde{\theta }}_k^{(i)}}\). Recalling that \(w_k = \frac{p(y_k\mid y_{1:k-1},\theta )}{\int p(y_k\mid y_{1:k-1},\theta )d\theta }\) (note that \(\left\| w_k\right\| _\infty <\infty \) by Assumption 5.3), we get from Bayes’ rule

$$\begin{aligned} (f,\mu _{k})&= \int f(\theta )p(\theta \mid y_{1:k})d\theta \\&= \int f(\theta )\frac{p(y_k\mid y_{1:k-1},\theta )p(\theta \mid y_{1:k-1})}{\int p(y_k\mid y_{1:k-1},\theta )p(\theta \mid y_{1:k-1})d\theta }d\theta \\&= \frac{(fw_k,\mu _{k-1})}{(w_k,\mu _{k-1})}\,, \end{aligned}$$

and from (5.2) and (5.3) we get

$$\begin{aligned} (f,{\hat{\mu }}_{k}^N)&=\frac{(f{\hat{w}}_k,{\tilde{\mu }}_{k}^N)}{({\hat{w}}_k,{\tilde{\mu }}_{k}^N)}\,. \end{aligned}$$
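
The ratio form \((fw,\mu )/(w,\mu )\) is what a weighted particle approximation evaluates in practice. The following is a minimal sketch of such a self-normalized estimator, assuming equally weighted particles and a user-supplied weight function. The function name `posterior_expectation`, the uniform prior and the Gaussian toy likelihood are hypothetical choices made for illustration; this is not Algorithm 4.2.

```python
import numpy as np

def posterior_expectation(f, weight, particles):
    """Approximate (f w, mu) / (w, mu) with mu = (1/N) * sum of Dirac masses."""
    w = weight(particles)
    return np.sum(f(particles) * w) / np.sum(w)

rng = np.random.default_rng(0)

# Toy setting (assumption): particles drawn from a U(0, 1) prior.
theta = rng.uniform(0.0, 1.0, size=200_000)

# Toy likelihood (assumption): one observation y with Gaussian noise of
# standard deviation s; any positive constant factor cancels in the ratio.
y, s = 0.3, 0.1

def weight(th):
    return np.exp(-0.5 * ((y - th) / s) ** 2)

# Posterior mean of theta under the toy model.
est = posterior_expectation(lambda th: th, weight, theta)

# With constant weights the estimator reduces to the plain sample mean.
const = posterior_expectation(lambda th: th, lambda th: np.ones_like(th), theta)
```

Because the weights appear in both numerator and denominator, they only need to be known up to a multiplicative constant, which is why the normalization of \(w_k\) plays no role in the ratio.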

Since \(w_k\ge 0\) and \((w_k,\mu _{k-1})>0\) by Assumption 5.3, the triangle inequality gives

$$\begin{aligned}&\left\| (f,{\hat{\mu }}_{k}^N)-(f,\mu _{k})\right\| _p\nonumber \\&\quad =\left\| \frac{(f{\hat{w}}_k,{\tilde{\mu }}_{k}^N)}{({\hat{w}}_k, {\tilde{\mu }}_{k}^N)}-\frac{(fw_k,\mu _{k-1})}{(w_k,\mu _{k-1})}\right\| _p\nonumber \\&\quad =\frac{1}{(w_k,\mu _{k-1})}\left\| \frac{(f{\hat{w}}_k,{\tilde{\mu }}_{k}^N)}{({\hat{w}}_k,{\tilde{\mu }}_{k}^N)} (w_k,\mu _{k-1}) - (fw_k,\mu _{k-1})\right\| _p\nonumber \\&\quad \le \frac{1}{(w_k,\mu _{k-1})}\left( \left\| \frac{(f{\hat{w}}_k, {\tilde{\mu }}_{k}^N)}{({\hat{w}}_k,{\tilde{\mu }}_{k}^N)} (w_k,\mu _{k-1}) - (f{\hat{w}}_k,{\tilde{\mu }}_{k}^N)\right\| _p \right. \nonumber \\&\qquad +\left. \left\| (f{\hat{w}}_k,{\tilde{\mu }}_{k}^N) -(fw_k,\mu _{k-1})\right\| _p \right) \nonumber \\&\quad = \frac{1}{(w_k,\mu _{k-1})} \left( \left\| \frac{(f{\hat{w}}_k,{\tilde{\mu }}_{k}^N)[(w_k,\mu _{k-1}) -({\hat{w}}_k,{\tilde{\mu }}_{k}^N)]}{({\hat{w}}_k,{\tilde{\mu }}_{k}^N)}\right\| _p \right. \nonumber \\&\qquad + \left. \left\| (f{\hat{w}}_k,{\tilde{\mu }}_{k}^N) -(fw_k,\mu _{k-1})\right\| _p \right) \nonumber \\&\quad \le \frac{1}{(w_k,\mu _{k-1})}\left( \left\| f\right\| _\infty \left\| (w_k,\mu _{k-1}) -({\hat{w}}_k,{\tilde{\mu }}_{k}^N)\right\| _p \right. \nonumber \\&\qquad + \left. \left\| (f{\hat{w}}_k,{\tilde{\mu }}_{k}^N)-(fw_k,\mu _{k-1})\right\| _p \right) \,. \end{aligned}$$
(5.21)

Hence, we proceed to finding upper bounds for the two quantities \(\Vert (w_k,\mu _{k-1})-({\hat{w}}_k,{\tilde{\mu }}_{k}^N)\Vert _p\) and \(\left\| (f{\hat{w}}_k,{\tilde{\mu }}_{k}^N)-(fw_k,\mu _{k-1})\right\| _p\). Note that

$$\begin{aligned}&{\left\| (f{\hat{w}}_k,{\tilde{\mu }}_{k}^N)-(fw_k,\mu _{k-1})\right\| _p \le } \nonumber \\&\qquad \left\| (fw_k,\mu _{k-1})-(fw_k,{\tilde{\mu }}_{k}^N)\right\| _p+\left\| (fw_k,{\tilde{\mu }}_{k}^N) -(f{\hat{w}}_k,{\tilde{\mu }}_{k}^N)\right\| _p. \end{aligned}$$
(5.22)

For the first term on the right hand side of (5.22), since \(\left\| w_k\right\| _p<\infty \), it follows from Lemma 5.4 that

$$\begin{aligned} \left\| (fw_k,\mu _{k-1})-(fw_k,{\tilde{\mu }}_{k}^N)\right\| _p\le \frac{{\tilde{c}}_{1,{k}} \left\| f\right\| _\infty }{\sqrt{N}}+\left\| f\right\| _\infty {\tilde{d}}_{1,{k}}\sqrt{\varDelta }\,. \end{aligned}$$
(5.23)

For the second term on the right hand side of (5.22), we get using (5.20)

$$\begin{aligned} \left| (fw_k,{\tilde{\mu }}_{k}^N)-(f{\hat{w}}_k,{\tilde{\mu }}_{k}^N)\right|&=\left| \frac{1}{N}\sum _{i=1}^Nf({\tilde{\theta }}_k^{(i)}) \left( {\hat{w}}_k^{{\tilde{\theta }}_k^{(i)}}-{w}_k^{{\tilde{\theta }}_k^{(i)}}\right) \right| \\&\le \frac{\left\| f\right\| _\infty }{N}\sum _{i=1}^N\left| {\hat{w}}_k^{{\tilde{\theta }}_k^{(i)}} -{w}_k^{{\tilde{\theta }}_k^{(i)}}\right| \\&\le \left\| f\right\| _\infty \sup _{1\le i\le N}\left| {\hat{w}}_k^{{\tilde{\theta }}_k^{(i)}} -{w}_k^{{\tilde{\theta }}_k^{(i)}}\right| \\&\le \frac{{\hat{c}}_{2,{k}}\left\| f\right\| _\infty }{\sqrt{N}}+{\hat{d}}_{2,{k}} \left\| f\right\| _\infty \sqrt{\varDelta }\,. \end{aligned}$$

Hence,

$$\begin{aligned} \left\| (fw_k,{\tilde{\mu }}_{k}^N)-(f{\hat{w}}_k,{\tilde{\mu }}_{k}^N)\right\| _p \le \frac{{\hat{c}}_{2,{k}}\left\| f\right\| _\infty }{\sqrt{N}}+{\hat{d}}_{2,{k}} \left\| f\right\| _\infty \sqrt{\varDelta }\,. \end{aligned}$$

Inserting the latter together with (5.23) in (5.22), we obtain

$$\begin{aligned} \left\| (f{\hat{w}}_k,{\tilde{\mu }}_{k}^N)-(fw_k,\mu _{k-1})\right\| _p\le \frac{{c}_{k}^\prime \left\| f\right\| _\infty }{\sqrt{N}}+{d}_{k}^\prime \left\| f\right\| _\infty \sqrt{\varDelta }\,, \end{aligned}$$
(5.24)

where \({c}_{k}^\prime \) and \({d}_{k}^\prime \) are constants independent of N and \(\varDelta \). Letting \(f=1\) in (5.24) implies

$$\begin{aligned} \left\| (w_k,\mu _{k-1})-({\hat{w}}_k,{\tilde{\mu }}_{k}^N)\right\| _p \le \frac{{c}_{k}^\prime }{\sqrt{N}}+{d}_{k}^\prime \sqrt{\varDelta }\,. \end{aligned}$$
(5.25)

Therefore substituting (5.25) and (5.24) into the right hand side of Inequality (5.21), we obtain

$$\begin{aligned} \left\| (f,{\hat{\mu }}_{k}^N)-(f,\mu _{k})\right\| _p \le \frac{{\hat{c}}_{1,{k}}\left\| f\right\| _\infty }{\sqrt{N}}+{\hat{d}}_{1,{k}} \left\| f\right\| _\infty \sqrt{\varDelta }\,, \end{aligned}$$

where \({\hat{c}}_{1,{k}}=\frac{2}{(w_k,\mu _{k-1})}{c}_{k}^\prime <\infty \) and \({\hat{d}}_{1,{k}}=\frac{2}{(w_k,\mu _{k-1})}{d}_{k}^\prime <\infty \) are independent of N and \(\varDelta \) and the statement for \({\hat{\mu }}_k^N\) follows.

To prove the statement for \({\hat{\varGamma }}_k^{{\tilde{\theta }}_{k}^{(i)}}\), we compute using (2.9) and (4.14)

$$\begin{aligned}&{\left| (f,{\hat{\varGamma }}_k^{{\tilde{\theta }}_{k}^{(i)}}) -(f,{\varGamma }_{k}^{{\tilde{\theta }}_{k}^{(i)}})\right| } \\&\quad = \left| \frac{((fl^{{\tilde{\theta }}_{k}^{(i)}}_{y_k}, {\hat{\pi }}_k^{{\tilde{\theta }}_{k}^{(i)}}),{\hat{\varGamma }}_{k-1}^{\theta _{k-1}^{(i)}} )}{{\hat{w}}_{k}^{{\tilde{\theta }}_{k}^{(i)}}} -\frac{((fl^{{\tilde{\theta }}_{k}^{(i)}}_{y_k}, {\pi }_k^{{\tilde{\theta }}_{k}^{(i)}}),{\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}} )}{{w}_{k}^{{\tilde{\theta }}_{k}^{(i)}}}\right| \\&\quad =\frac{1}{{\hat{w}}_{k}^{{\tilde{\theta }}_{k}^{(i)}} {w}_{k}^{{\tilde{\theta }}_{k}^{(i)}}} \left| {w}_{k}^{{\tilde{\theta }}_{k}^{(i)}}((fl^{{\tilde{\theta }}_{k}^{(i)}}_{y_k}, {\hat{\pi }}_k^{{\tilde{\theta }}_{k}^{(i)}}), {\hat{\varGamma }}_{k-1}^{\theta _{k-1}^{(i)}} ) -{\hat{w}}_{k}^{{\tilde{\theta }}_{k}^{(i)}} ((fl^{{\tilde{\theta }}_{k}^{(i)}}_{y_k}, {\pi }_k^{{\tilde{\theta }}_{k}^{(i)}}), {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}} )\right| \\&\quad \le \frac{1}{{\hat{w}}_{k}^{{\tilde{\theta }}_{k}^{(i)}} {w}_{k}^{{\tilde{\theta }}_{k}^{(i)}}} \left( \left| {w}_{k}^{{\tilde{\theta }}_{k}^{(i)}} ((fl^{{\tilde{\theta }}_{k}^{(i)}}_{y_k},{\hat{\pi }}_k^{{\tilde{\theta }}_{k}^{(i)}}), {\hat{\varGamma }}_{k-1}^{\theta _{k-1}^{(i)}} ) -{\hat{w}}_{k}^{{\tilde{\theta }}_{k}^{(i)}}((fl^{{\tilde{\theta }}_{k}^{(i)}}_{y_k}, {\hat{\pi }}_k^{{\tilde{\theta }}_{k}^{(i)}}), {\hat{\varGamma }}_{k-1}^{\theta _{k-1}^{(i)}} )\right| \right. \\&\qquad \left. +\left| {\hat{w}}_{k}^{{\tilde{\theta }}_{k}^{(i)}} ((fl^{{\tilde{\theta }}_{k}^{(i)}}_{y_k},{\hat{\pi }}_k^{{\tilde{\theta }}_{k}^{(i)}}), {\hat{\varGamma }}_{k-1}^{\theta _{k-1}^{(i)}} ) -{\hat{w}}_{k}^{{\tilde{\theta }}_{k}^{(i)}}((fl^{{\tilde{\theta }}_{k}^{(i)}}_{y_k}, {\pi }_k^{{\tilde{\theta }}_{k}^{(i)}}), {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}} )\right| \right) \\&\quad \le \frac{1}{{\hat{w}}_{k}^{{\tilde{\theta }}_{k}^{(i)}} {w}_{k}^{{\tilde{\theta }}_{k}^{(i)}}} \left( \left\| f\right\| _\infty \left\| l_{y_k}\right\| _\infty \left| {w}_{k}^{{\tilde{\theta }}_{k}^{(i)}} -{\hat{w}}_{k}^{{\tilde{\theta }}_{k}^{(i)}}\right| \right. \\&\qquad \left. +{\hat{w}}_{k}^{{\tilde{\theta }}_{k}^{(i)}} \left| ((fl^{{\tilde{\theta }}_{k}^{(i)}}_{y_k}, {\hat{\pi }}_k^{{\tilde{\theta }}_{k}^{(i)}}),{\hat{\varGamma }}_{k-1}^{\theta _{k-1}^{(i)}} )-((fl^{{\tilde{\theta }}_{k}^{(i)}}_{y_k},{\pi }_k^{{\tilde{\theta }}_{k}^{(i)}}), {\varGamma }_{k-1}^{{\tilde{\theta }}_{k}^{(i)}} )\right| \right) \,. \end{aligned}$$

Since \({\hat{w}}_{k}^{{\tilde{\theta }}_{k}^{(i)}},{w}_{k}^{{\tilde{\theta }}_{k}^{(i)}}\ge \inf _{\theta \in D_\theta } l^{\theta }_{y_k}>0\), the statement of the lemma follows from (5.20) and Lemma 5.5 (note that \(l^{{\tilde{\theta }}_{k}^{(i)}}_{y_k}\) is bounded by Assumption 5.3). \(\square \)

5.3 Resampling

In the following lemma we study the convergence of the measure \(\mu _k^N\).

Lemma 5.8

Let the observation sequence \(y_{1:k}\) be fixed and let f be a bounded and continuous function. If

$$\begin{aligned} \left\| (f,{\hat{\mu }}_{k}^N)-(f,\mu _{k})\right\| _p&\le \frac{{\hat{c}}_{1,{k}} \left\| f\right\| _\infty }{\sqrt{N}}+{\hat{d}}_{1,{k}}\left\| f\right\| _\infty \sqrt{\varDelta }\,,\\ \sup _{1\le i\le N} \left| (f,{\hat{\varGamma }}_k^{{\tilde{\theta }}_{k}^{(i)}}) -(f,{\varGamma }_{k}^{{\tilde{\theta }}_{k}^{(i)}})\right|&\le \frac{{\tilde{c}}_{2,{k}}\left\| f\right\| _\infty }{\sqrt{N}}+{\tilde{d}}_{2,{k}} \left\| f\right\| _\infty \sqrt{\varDelta }\,, \end{aligned}$$

hold for some constants \({\hat{c}}_{1,{k}},{\hat{d}}_{1,{k}},{\tilde{c}}_{2,{k}}\) and \({\tilde{d}}_{2,{k}}\) which are independent of N and \(\varDelta \), then there exist constants \({c}_{1,{k}},{d}_{1,{k}},{c}_{2,{k}}\) and \({d}_{2,{k}}\) which are independent of N and \(\varDelta \), such that

$$\begin{aligned} \left\| (f,{\mu }_{k}^N)-(f,\mu _{k})\right\| _p&\le \frac{{c}_{1,{k}} \left\| f\right\| _\infty }{\sqrt{N}}+{d}_{1,{k}}\left\| f\right\| _\infty \sqrt{\varDelta }\,,\\ \sup _{1\le i\le N} \left| (f,{\hat{\varGamma }}_k^{{\theta }_{k}^{(i)}})-(f,{\varGamma }_{k}^{{\theta }_{k}^{(i)}})\right|&\le \frac{{c}_{2,{k}}\left\| f\right\| _\infty }{\sqrt{N}}+{d}_{2,{k}}\left\| f\right\| _\infty \sqrt{\varDelta }\,. \end{aligned}$$

Proof

Note that in the resampling step the \({\theta }_{k}^{(i)}\) are resampled from the pool \(\{{\tilde{\theta }}_{k}^{(i)},i=1,\cdots ,N\}\). Hence it is trivial that

$$\begin{aligned} \sup _{1\le i\le N} \left| (f,{\hat{\varGamma }}_k^{{\theta }_{k}^{(i)}})-(f,{\varGamma }_{k}^{{\theta }_{k}^{(i)}})\right|&\le \sup _{1\le i\le N} \left| (f,{\hat{\varGamma }}_k^{{\tilde{\theta }}_{k}^{(i)}}) -(f,{\varGamma }_{k}^{{\tilde{\theta }}_{k}^{(i)}})\right| \\&\le \frac{{\tilde{c}}_{2,{k}}\left\| f\right\| _\infty }{\sqrt{N}} +{\tilde{d}}_{2,{k}}\left\| f\right\| _\infty \sqrt{\varDelta }\\&= \frac{{c}_{2,{k}}\left\| f\right\| _\infty }{\sqrt{N}}+{d}_{2,{k}}\left\| f\right\| _\infty \sqrt{\varDelta }\,, \end{aligned}$$

where \({c}_{2,{k}}={\tilde{c}}_{2,{k}}\) and \({d}_{2,{k}}={\tilde{d}}_{2,{k}}\) are independent of N and \(\varDelta \). Moreover, by the triangle inequality, we have

$$\begin{aligned} \left\| (f,{\mu }_{k}^N)-(f,\mu _{k})\right\| _p \le \left\| (f,{\mu }_{k}^N)-(f,{\hat{\mu }}_{k}^N)\right\| _p + \left\| (f,{\hat{\mu }}_{k}^N)-(f,\mu _{k})\right\| _p\,. \end{aligned}$$
(5.26)

For the second term on the right hand side of (5.26), it follows from the conditions in the lemma, that

$$\begin{aligned} \left\| (f,{\hat{\mu }}_{k}^N)-(f,\mu _{k})\right\| _p\le \frac{{\hat{c}}_{1,{k}}\left\| f\right\| _\infty }{\sqrt{N}}+{\hat{d}}_{1,{k}} \left\| f\right\| _\infty \sqrt{\varDelta }\,. \end{aligned}$$
(5.27)

Note that \(\{\theta _k^{(i)},i=1,\cdots ,N\}\) are i.i.d. samples generated from \({\hat{\mu }}_k^N(\mathrm {d}\theta )=\sum _{i=1}^N {\hat{w}}_k^{{\tilde{\theta }}_k^{(i)}}\delta _{{\tilde{\theta }}_k^{(i)}}(\mathrm {d}\theta )\). Let \(\tilde{{\mathcal {G}}}_k\) be the sigma-algebra generated by \( \{\theta _{1:k-1}^{(i)},{\tilde{\theta }}_{1:k}^{(i)}, i=1,\cdots ,N\}\), then

$$\begin{aligned} {\mathbb {E}}[f(\theta _{k}^{(i)})\mid \tilde{{\mathcal {G}}}_k]&= \int f(\theta )\sum _{i=1}^N {\hat{w}}_k^{{\tilde{\theta }}_k^{(i)}} \delta _{{\tilde{\theta }}_k^{(i)}}(d\theta )\\&= \sum _{i=1}^N {\hat{w}}_k^{{\tilde{\theta }}_k^{(i)}}f({\tilde{\theta }}_k^{(i)})\\&= (f,{\hat{\mu }}_k^N)\,. \end{aligned}$$

Define \(Z_k^{(i)} = f(\theta _k^{(i)})-(f,{\hat{\mu }}_k^N)=f(\theta _k^{(i)})-{\mathbb {E}} [f(\theta _{k}^{(i)})\mid \tilde{{\mathcal {G}}}_k]\). The \(Z_k^{(i)}\), \(i=1,\cdots ,N\), are random variables which, conditionally on \(\tilde{{\mathcal {G}}}_k\), have zero mean. They are bounded by \(2\left\| f\right\| _\infty \) and have the property \({\mathbb {E}}[Z_k^{(i)}Z_k^{(j)}\mid \tilde{{\mathcal {G}}}_k]=0\) for \(i\ne j\). Let \(p\ge 1\), and let us first assume in addition that p is an even integer. Then

$$\begin{aligned} {\mathbb {E}}\left[ \left| (f,\mu _k^N)-(f,{\hat{\mu }}_k^N)\right| ^p \mid \tilde{{\mathcal {G}}}_k\right]&={\mathbb {E}}\left[ \left| \frac{1}{N}\sum _{i=1}^N f(\theta _k^{(i)}) -(f,{\hat{\mu }}_k^N)\right| ^p \mid \tilde{{\mathcal {G}}}_k\right] \\&={\mathbb {E}}\left[ \left| \frac{1}{N}\sum _{i=1}^N Z_k^{(i)}\right| ^p \mid \tilde{{\mathcal {G}}}_k\right] \\&=\frac{1}{N^p}{\mathbb {E}}\left[ \sum _{i_1=1}^N\cdots \sum _{i_p=1}^N Z_k^{(i_1)}\cdots Z_k^{(i_p)} \mid \tilde{{\mathcal {G}}}_k\right] \,. \end{aligned}$$

Since \({\mathbb {E}}[Z_k^{(i)}\mid \tilde{{\mathcal {G}}}_k]=0\), there are at most \(N^{\frac{p}{2}}\) non-zero contributions to

$$\begin{aligned} \sum _{i_1=1}^N\cdots \sum _{i_p=1}^N {\mathbb {E}}\left[ Z_k^{(i_1)}\cdots Z_k^{(i_p)} \mid \tilde{{\mathcal {G}}}_k\right] \,. \end{aligned}$$

Hence

$$\begin{aligned} \frac{1}{N^p}{\mathbb {E}}\left[ \sum _{i_1=1}^N\cdots \sum _{i_p=1}^N Z_k^{(i_1)}\cdots Z_k^{(i_p)} \mid \tilde{{\mathcal {G}}}_k\right] \le \frac{1}{N^{\frac{p}{2}}}2^p\left\| f\right\| _\infty ^p\,, \end{aligned}$$

which implies

$$\begin{aligned} \left\| (f,\mu _k^N)-(f,{\hat{\mu }}_k^N)\right\| _p\le \frac{2\left\| f\right\| _\infty }{\sqrt{N}}\,. \end{aligned}$$
(5.28)
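
The bound (5.28) can be illustrated by a small simulation. The sketch below resamples N indices with probabilities given by normalized weights and estimates the \(L^p\) norm of the resampling error over many replications. It assumes multinomial resampling, an arbitrary bounded test function with \(\left\| f\right\| _\infty \le 1\), and synthetic weights; these choices and the function name `resampling_lp_error` are illustrative assumptions, and a simulation of this kind illustrates the bound but does not prove it.

```python
import numpy as np

def resampling_lp_error(f_vals, weights, p, n_rep, rng):
    """Monte Carlo estimate of ||(f, mu^N) - (f, mu_hat^N)||_p."""
    n = len(f_vals)
    target = np.dot(weights, f_vals)                  # (f, mu_hat^N)
    idx = rng.choice(n, size=(n_rep, n), p=weights)   # n draws per replication
    resampled_means = f_vals[idx].mean(axis=1)        # (f, mu^N), one per replication
    return np.mean(np.abs(resampled_means - target) ** p) ** (1.0 / p)

rng = np.random.default_rng(1)
N = 500

# Synthetic particle pool and normalized weights (assumptions for illustration).
theta_tilde = rng.uniform(0.0, 1.0, size=N)
w_hat = rng.uniform(0.5, 1.5, size=N)
w_hat /= w_hat.sum()

# Bounded test function with sup norm at most 1.
f_vals = np.cos(6.0 * theta_tilde)

err = resampling_lp_error(f_vals, w_hat, p=4, n_rep=2000, rng=rng)
bound = 2.0 / np.sqrt(N)   # right hand side of (5.28) with ||f||_inf = 1
```

In runs of this kind the estimated error stays below `bound`, consistent with the \(1/\sqrt{N}\) rate.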

Substituting (5.27) and (5.28) into (5.26) yields, for any even p, the result

$$\begin{aligned} \left\| (f,{\mu }_{k}^N)-(f,\mu _{k})\right\| _p\le \frac{{c}_{1,{k}}\left\| f\right\| _\infty }{\sqrt{N}}+{d}_{1,{k}}\left\| f\right\| _\infty \sqrt{\varDelta }\,, \end{aligned}$$
(5.29)

where \({c}_{1,{k}} = 2+{\hat{c}}_{1,{k}}\) and \({d}_{1,{k}} ={\hat{d}}_{1,{k}}\). For any real number \(p\ge 1\), there exists an even integer \(q\ge p\) such that (5.29) holds with p replaced by q. Since \(\left\| \cdot \right\| _p\le \left\| \cdot \right\| _q\) on a probability space whenever \(p\le q\), the statement follows for every \(p\ge 1\). \(\square \)

5.4 Convergence of the Kalman Particle algorithm

In the following theorem we prove the convergence of our proposed algorithm.

Theorem 5.9

Suppose the function f is bounded and continuous and Assumptions 4.1, 5.1, 5.2 and 5.3 hold. Let the observation sequence \(y_{1:k}\) be fixed and let the measures \(\mu _k\) and \(\mu _k^N\) resulting from Algorithm 4.2 be as in (5.1) and (5.4), respectively, where the model is Gaussian and linear in the sense of (2.11) or is of type (3.1). Then

$$\begin{aligned} \left\| (f,\mu _k^N)-(f,\mu _k)\right\| _p\le \frac{c_{1,k}\left\| f\right\| _\infty }{\sqrt{N}} + d_{1,{k}}\left\| f\right\| _\infty \sqrt{\varDelta }, \end{aligned}$$

with constants \(c_{1,k},d_{1,k}, 1\le k\le K\) independent of N and \(\varDelta \).

Proof

We prove this theorem by induction. At time \(t_0\), the parameter samples \(\theta _{0}^{(i)}\), \(i=1,\cdots ,N\), are sampled from the initial measure \(\mu _{0}\) and \(\mu _{0}^N = \frac{1}{N}\sum _{i=1}^N \delta _{\theta _{0}^{(i)}}\). A well known result of Monte Carlo simulation, see for example Chapter I in Glasserman (2013), implies that

$$\begin{aligned} \left\| (f,\mu _{0})-(f,\mu _{0}^N)\right\| _p \le \frac{c_{1,t_0}\left\| f\right\| _\infty }{\sqrt{N}} + d_{1,t_0}\left\| f\right\| _\infty \sqrt{\varDelta }, \end{aligned}$$

for some constants \(c_{1,t_0}\) and \(d_{1,t_0}\) independent of N and \(\varDelta \). Moreover, one can define the initial measure of \(x_{0}\) to be a Gaussian measure \(\varGamma _{0}\) and set \({\hat{\varGamma }}_{0}^{\theta _{0}^{(i)}}= {\varGamma }_{0}^{\theta _{0}^{(i)}} = \varGamma _{0}\), for \(i=1,\cdots ,N\). Then it is trivial that

$$\begin{aligned} \sup _{1\le i\le N}\left| (f,{\hat{\varGamma }}_{0}^{\theta _{0}^{(i)}}) -(f,{\varGamma }_{0}^{\theta _{0}^{(i)}})\right| \le \frac{c_{2,t_0}\left\| f\right\| _\infty }{\sqrt{N}}+d_{2,t_0}\left\| f\right\| _\infty \sqrt{\varDelta }\, \end{aligned}$$

holds for some constants \(c_{2,t_0}\) and \(d_{2,t_0}\) independent of N and \(\varDelta \) (actually \(c_{2,t_0}=d_{2,t_0}=0\)). Assume that, at time \({k-1}\), the inequalities

$$\begin{aligned} \left\| (f,{\mu }_{k-1}^N)-(f,\mu _{k-1})\right\| _p \le \frac{{c}_{1,{k-1}}\left\| f\right\| _\infty }{\sqrt{N}}+{d}_{1,{k-1}}\left\| f\right\| _\infty \sqrt{\varDelta } \end{aligned}$$

and

$$\begin{aligned} \sup _{1\le i\le N} \left| (f,{\hat{\varGamma }}_{k-1}^{\theta _{k-1}^{(i)}}) -(f,{\varGamma }_{k-1}^{\theta _{k-1}^{(i)}})\right| \le \frac{c_{2,{k-1}}\left\| f\right\| _\infty }{\sqrt{N}}+d_{2,{k-1}}\left\| f\right\| _\infty \sqrt{\varDelta }\, \end{aligned}$$

hold for some constants \({c}_{1,{k-1}},{d}_{1,{k-1}},c_{2,{k-1}}\) and \(d_{2,{k-1}}\) which are independent of N and \(\varDelta \). Then we can successively apply Lemmas 5.7 and 5.8 to obtain the statement of the theorem. \(\square \)

6 Numerical results on affine term structure models

In this section we illustrate our results by applying our algorithms to some specific examples of the affine class (3.1). In particular we consider the one-factor Cox-Ingersoll-Ross (CIR) model, the two-factor Vasiček model, and the one-factor Vasiček model with stochastic volatility. To test how the designed algorithm works on these models, the models are calibrated on data simulated with pre-determined parameters. Moreover, to illustrate our algorithm on empirical data, we consider the calibration of a two-factor Vasiček model on yield curves. For the CIR model we also compare the performance of our algorithm with that of the recursive nested particle filter (RNPF). The comparison shows that in this case our algorithm outperforms the RNPF.

6.1 One-factor CIR model

First we consider the CIR model (3.6). One important property of the CIR model is that the yield curves are always non-negative. The transition distribution of the CIR model is a scaled non-central chi-square distribution, i.e.,

$$\begin{aligned} x_{t_{k+1}}\mid x_{t_k} \sim c\chi _p^2(\lambda )\,, \end{aligned}$$

where \(p=\frac{4\alpha \beta }{\sigma ^2}\) is the number of degrees of freedom, \(\lambda =x_{t_k}(4\alpha e^{-\alpha (t_{k+1}-t_{k})})/(\sigma ^2(1-e^{-\alpha (t_{k+1}-t_k)}))\) is the non-centrality parameter and \(c=\sigma ^2(1-e^{-\alpha (t_{k+1}-t_k)})/(4\alpha )\). We define the short rate process to be \(r_t=x_t\). The analytical solution of the functions \(\phi \) and \(\psi \) defined in (3.2) can be obtained by solving the ODE (A.2), with \(d=1\), \(\gamma =-1\) and \(c=0\).
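
The transition law above allows exact simulation of the state on a time grid. The following is a minimal sketch using NumPy's non-central chi-square sampler. The step size of 1/250 years for daily data, the initial state, the identification of the diffusion parameter with the value 0.017, and the function names are assumptions made for illustration; the sketch simulates only the state \(x_t\), not the yield curves.

```python
import numpy as np

def cir_transition_params(x, alpha, beta, sigma, dt):
    """Return (c, p, lam) such that x_{k+1} | x_k = x is c * chi2_p(lam)."""
    decay = np.exp(-alpha * dt)
    c = sigma**2 * (1.0 - decay) / (4.0 * alpha)
    p = 4.0 * alpha * beta / sigma**2
    lam = x * 4.0 * alpha * decay / (sigma**2 * (1.0 - decay))
    return c, p, lam

def simulate_cir(x0, alpha, beta, sigma, dt, n_steps, rng):
    """Exact simulation of the CIR state on an equidistant grid."""
    path = np.empty(n_steps + 1)
    path[0] = x0
    for k in range(n_steps):
        c, p, lam = cir_transition_params(path[k], alpha, beta, sigma, dt)
        path[k + 1] = c * rng.noncentral_chisquare(p, lam)
    return path

rng = np.random.default_rng(42)
# dt = 1/250 (one trading day in years) and x0 = 0.005 are assumptions.
path = simulate_cir(x0=0.005, alpha=0.45, beta=0.001, sigma=0.017,
                    dt=1 / 250, n_steps=2000, rng=rng)

# Consistency check: c * (p + lam) is the mean of c * chi2_p(lam) and should
# equal the known CIR conditional mean x e^{-alpha dt} + beta (1 - e^{-alpha dt}).
c, p, lam = cir_transition_params(0.005, 0.45, 0.001, 0.017, 1 / 250)
mean_from_chi2 = c * (p + lam)
mean_closed_form = 0.005 * np.exp(-0.45 / 250) + 0.001 * (1.0 - np.exp(-0.45 / 250))
```

Sampling from the exact transition law avoids the discretization bias of an Euler scheme, which matters when the simulated data serve as a benchmark for calibration.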

In the tests, the parameters are set to \(\alpha _1=0.45, \beta =0.001\) and \({\tilde{\varSigma }}=0.017\). Based on these parameter values, we generate daily yield curve data with times to maturity ranging from 1 year to 30 years. The time length is \(T=2000\), i.e., the data set contains 2000 days. Furthermore, we add white noise with variance \(h = 1 \times 10^{-8}\) to the simulated zero rates.

We use our Kalman particle filter algorithm (Algorithm 4.2, henceforth referred to as KPF) to calibrate the model parameters. The transition distribution of the CIR model is not Gaussian, but note that the CIR model fits the structure of (3.1) with \(\varSigma =0\). Hence we use the approximation (4.10) and apply the Kalman filter in the inner layer. In the outer layer, we set the number of particles to be \(N=5000\) and the initial prior distribution of the parameters to be uniform, i.e.,
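
To give a feeling for the size of the error made by a Gaussian approximation over one daily step, the sketch below compares the exact conditional mean and variance of the CIR state with those obtained when the square-root diffusion coefficient is frozen at its value at the start of the step. This frozen-coefficient form is our reading of the approximation, suggested by the error term analysed in Section 5, and should be treated as an assumption, as should the step size of 1/250 years and the function names.

```python
import math

def cir_exact_moments(x, alpha, beta, sigma, dt):
    """Exact conditional mean and variance of the CIR state after time dt."""
    decay = math.exp(-alpha * dt)
    mean = x * decay + beta * (1.0 - decay)
    var = (x * sigma**2 / alpha * (decay - decay**2)
           + beta * sigma**2 / (2.0 * alpha) * (1.0 - decay) ** 2)
    return mean, var

def frozen_diffusion_moments(x, alpha, beta, sigma, dt):
    """Gaussian moments when sqrt(x_u) is frozen at sqrt(x) over the step."""
    decay = math.exp(-alpha * dt)
    mean = x * decay + beta * (1.0 - decay)
    var = sigma**2 * x * (1.0 - decay**2) / (2.0 * alpha)
    return mean, var

# Parameter values from the text; x = 0.005 and dt = 1/250 are assumptions.
m_exact, v_exact = cir_exact_moments(0.005, 0.45, 0.001, 0.017, 1 / 250)
m_gauss, v_gauss = frozen_diffusion_moments(0.005, 0.45, 0.001, 0.017, 1 / 250)
rel_var_error = abs(v_gauss - v_exact) / v_exact
```

The means coincide and the variances differ only at higher order in the step size, which is in line with the approximation error vanishing as \(\varDelta \rightarrow 0\).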

$$\begin{aligned} \alpha \sim U(0,1),\ \ \ \beta \sim U(0,0.01), \ \ \ \text{ and } \ \ \sigma \sim U(0,0.1)\,. \end{aligned}$$

Moreover, at each step the sampled parameters are truncated to the support of the corresponding uniform distributions above. In the next sections we use the same kind of truncation. In the inner layer, the mean and variance of the initial prior distribution of \(x_0\) are 0.005 and 0.01. The discounting factor a is set to be 0.98 and the variance boundaries \(V_N\) and \(V_f\) are set to be the diagonal matrices whose diagonal elements are equal to \(\frac{1}{\sqrt{N^3}}\) (i.e. \(p=1\)) and \(10^{-8}\), respectively. We will also use the same a, \(V_N\) and \(V_f\) for the other experiments below.
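The truncation and the numerical values of the variance boundaries can be illustrated as follows. This is only a sketch: the function `jitter_and_truncate` is hypothetical, the Gaussian perturbation is a simplification, and the way \(V_N\) and \(V_f\) enter Algorithm 4.2 (for example the switching between kernels) is not reproduced here.

```python
import numpy as np

def jitter_and_truncate(particles, kernel_var, lower, upper, rng):
    """Perturb parameter particles with Gaussian noise and clip them to the prior support."""
    noise = rng.normal(0.0, np.sqrt(kernel_var), size=particles.shape)
    return np.clip(particles + noise, lower, upper)

rng = np.random.default_rng(2)
N = 5000
lower = np.array([0.0, 0.0, 0.0])        # lower bounds for (alpha, beta, sigma)
upper = np.array([1.0, 0.01, 0.1])       # upper bounds for (alpha, beta, sigma)
particles = rng.uniform(lower, upper, size=(N, 3))

V_N = np.full(3, N ** -1.5)              # diagonal of V_N: 1 / sqrt(N^3)
V_f = np.full(3, 1e-8)                   # diagonal of V_f

jittered = jitter_and_truncate(particles, V_N, lower, upper, rng)
```

For \(N=5000\) the diagonal elements of \(V_N\) are roughly \(2.8\times 10^{-6}\), so \(V_N\) is more than two orders of magnitude larger than \(V_f\).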

Fig. 1
Convergence of parameter estimates by KPF for the CIR model with noise variance \(h = 1\times 10^{-8}\); the blue lines represent the posterior mean of the parameters, the red lines represent the true values of the parameters and the black dots are the lower and upper bounds of the 95% credible intervals

Fig. 2
Simulated states and estimated states; the blue line represents estimates of the simulated states, the red line represents the simulated states

Figure 1 shows that the estimated parameters converge over time to the true values. In this figure (and in the ones below) the blue lines represent the posterior means of the parameters, the red lines represent the true values of the parameters and the black dots are the lower and upper bounds of the 95% credible intervals.
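Quantities of the kind plotted in these figures can be computed from a weighted particle sample. The sketch below shows one common way to obtain a posterior mean and an equal-tailed 95% credible interval; the function name is hypothetical and it is an assumption that the intervals in the figures are computed in this way.

```python
import numpy as np

def weighted_summary(samples, weights, level=0.95):
    """Posterior mean and equal-tailed credible interval of a weighted sample."""
    x = np.asarray(samples, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                              # normalize the weights
    order = np.argsort(x)
    cdf = np.cumsum(w[order])                    # weighted empirical CDF
    tail = 0.5 * (1.0 - level)
    lo_idx = min(np.searchsorted(cdf, tail), x.size - 1)
    hi_idx = min(np.searchsorted(cdf, 1.0 - tail), x.size - 1)
    mean = float(np.sum(w * x))
    return mean, float(x[order][lo_idx]), float(x[order][hi_idx])

rng = np.random.default_rng(4)
samples = rng.uniform(0.0, 1.0, size=20000)      # stand-in for parameter particles
weights = np.ones_like(samples)                  # equal weights for the demonstration
post_mean, ci_low, ci_high = weighted_summary(samples, weights)
```

With equally weighted draws from \(U(0,1)\), the result should be close to a mean of 0.5 and an interval of roughly \([0.025, 0.975]\).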

Figure 2 shows that the estimated states follow the simulated states very well. Moreover, we observe that after 810 steps the variance is smaller than the required level \(V_N\). We repeated this experiment for noise variances \(h = 1\times 10^{-7}\) and \(h = 1\times 10^{-9}\); the results are shown in Figures 3 and 4, respectively.

Fig. 3
Convergence of parameter estimates by KPF for the CIR model with noise level \(h = 1\times 10^{-7}\)

Fig. 4
Convergence of parameter estimates by KPF for the CIR model with noise level \(h = 1\times 10^{-9}\)

We find that in all these cases the posterior means converge to the true values. We also observe that the larger the variance of the noise is, the more steps it takes to switch the jittering kernel and the more steps it takes for the algorithm to converge. Specifically, in the three cases with noise variances \(1\times 10^{-7}, 1\times 10^{-8}\) and \(1\times 10^{-9}\), it took 1164, 810 and 134 steps to switch between the kernels, respectively.

Furthermore, for noise variance \(h = 1\times 10^{-8}\), we extend this experiment to weekly and monthly data, i.e. the time step is one week and one month; the results are shown in Figures 5 and 6, respectively. We find that in both cases the posterior means again converge to the true values. We also observe that the larger the time step is, the more calendar time it takes for the algorithm to converge (note that 1 step in those two cases is 1 week and 1 month, respectively).

Fig. 5
Convergence of parameter estimates by KPF for the CIR model with weekly data and noise level \(h = 1\times 10^{-8}\)

Fig. 6
Convergence of parameter estimates by KPF for the CIR model with monthly data and noise level \(h = 1\times 10^{-8}\)

We also implemented the recursive nested particle filter (RNPF) for comparison. The sample sizes in the two layers are set to be 1000 and 300, respectively. Moreover, we simulated 18000 more days of data, so that in total we have 20000 days of data. For the rest we use the same settings as in the previous example, i.e. the same initial sample distributions for the parameter generation and the same boundaries on the parameter samples in the outer layer, and the variance of the jittering kernel is set equal to \(V_N\).

Fig. 7
Behavior of parameter estimates by RNPF for the CIR model

Figure 7 shows the behavior of the estimated parameters over time. One observes that even after 20000 time steps the RNPF estimates of \(\alpha \) and \(\sigma \) do not reach the true parameter values.

6.2 Two-factor Vasiček model

In this subsection we consider the two-factor Vasiček model (3.7). The short rate process \(r_t\) is given by \(r_t = x_t^{(1)} + x_t^{(2)}\). The analytical solution of the functions \(\phi \) and \(\psi \) defined in (3.2) can be obtained by solving the ODE (A.1), with \(d=2\), \(\gamma = (-1, -1)^\top \) and \(c=0\). Furthermore we set \(\alpha _{11} = 0.03, \,\alpha _{22} = 0.23,\, \varSigma _{11}=0.02,\, \varSigma _{12}=0, \,\varSigma _{21} = 0.02\rho , \,\varSigma _{22} = 0.02 \sqrt{1-\rho ^2}, \,\rho = -0.5\,\). As in the previous example, the yield curve data are simulated based on these parameters and then a white noise process is added to the simulated data. The variance of the white noise is \(6\times 10^{-7}\). The times to maturity of the yield curves range from 1 year up to 30 years. The time step of the data is daily and the time length is 2000.
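The parametrization of \(\varSigma \) through \(\sigma _1, \sigma _2\) and \(\rho \), and the resulting Gaussian transition, can be sketched as follows. The sketch assumes dynamics of the form \(\mathrm {d}x_t = -\mathrm {diag}(\alpha _{11},\alpha _{22})\,x_t\,\mathrm {d}t + \varSigma \,\mathrm {d}W_t\) with mean reversion level zero, which may differ from the exact form of (3.7); the function names and the daily step of \(1/250\) years are likewise illustrative assumptions.

```python
import numpy as np

def vasicek_sigma(sigma1, sigma2, rho):
    """Lower triangular diffusion matrix built from two volatilities and a correlation."""
    return np.array([[sigma1, 0.0],
                     [sigma2 * rho, sigma2 * np.sqrt(1.0 - rho**2)]])

def vasicek_discretize(alphas, Sigma, dt):
    """Exact transition x_{k+1} = F x_k + w_k with w_k ~ N(0, Q) (assumed dynamics)."""
    alphas = np.asarray(alphas, dtype=float)
    F = np.diag(np.exp(-alphas * dt))
    C = Sigma @ Sigma.T                          # instantaneous covariance
    rate = alphas[:, None] + alphas[None, :]     # alpha_i + alpha_j
    Q = C * (1.0 - np.exp(-rate * dt)) / rate
    return F, Q

Sigma = vasicek_sigma(0.02, 0.02, -0.5)
C = Sigma @ Sigma.T
implied_rho = C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])   # recovers rho = -0.5
F, Q = vasicek_discretize([0.03, 0.23], Sigma, dt=1.0 / 250.0)
```

The lower triangular form guarantees that \(\varSigma \varSigma ^\top \) is a valid covariance matrix for any \(\rho \in (-1,1)\), and the matrices `F` and `Q` are exactly the inputs a linear Kalman filter needs.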

Using these noisy simulated data, we apply our Kalman particle filter algorithm (Algorithm 4.2) to calibrate the model parameters. The Vasiček model fits the structure of (3.1) with \(\varSigma =0\). Since the Vasiček model is a Gaussian model, the Kalman filter in the inner layer is an optimal filter. The number of particles at each step is \(N = 2000\). For the initialization of the parameter samples, the initial prior distribution of the parameters is chosen to be uniform, namely:

$$\begin{aligned} \alpha _1,\alpha _2 \sim U(0,0.4),\ \ \ \sigma _1,\sigma _2 \sim U(0,0.1), \ \ \text{ and } \ \ \rho \sim U(-0.8,-0.3)\,. \end{aligned}$$

The initial prior distribution of the state \(x_k\) is chosen to be Gaussian with mean 0 and variance \(\mathrm {diag}([0.1,0.1])\). Figure 8 shows how the estimated parameters converge over time. One can observe that the convergence is fast and accurate.
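For a linear Gaussian model such as the one in this subsection, one step of the inner-layer filter can be sketched as a textbook Kalman predict/update step that also returns the log-likelihood increment of the new observation. This is a generic illustration and not a transcription of Algorithm 4.2; the function name and argument layout are assumptions.

```python
import numpy as np

def kalman_step(m, P, y, F, Q, H, d, R):
    """One Kalman step for x_{k+1} = F x_k + w, y = d + H x + v.

    Returns the updated mean, updated covariance and log-likelihood increment.
    """
    m_pred = F @ m                               # predicted state mean
    P_pred = F @ P @ F.T + Q                     # predicted state covariance
    resid = y - (d + H @ m_pred)                 # innovation
    S = H @ P_pred @ H.T + R                     # innovation covariance
    K = np.linalg.solve(S, H @ P_pred).T         # gain P_pred H^T S^{-1}
    m_new = m_pred + K @ resid
    P_new = (np.eye(m.size) - K @ H) @ P_pred
    _, logdet = np.linalg.slogdet(S)
    loglik = -0.5 * (resid.size * np.log(2.0 * np.pi) + logdet
                     + resid @ np.linalg.solve(S, resid))
    return m_new, P_new, loglik

# Scalar check: prior N(0, 1), static state, unit observation noise, observation y = 2.
one = np.eye(1)
m1, P1, ll = kalman_step(np.zeros(1), one, np.array([2.0]),
                         one, np.zeros((1, 1)), one, np.zeros(1), one)
```

In the scalar check the posterior is \(N(1, 0.5)\), as expected when a unit-variance prior and a unit-variance observation are combined. In a two-layer scheme, likelihood increments of this type are what make it possible to weight each parameter particle.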

Fig. 8
Convergence of parameter estimates by KPF for the Vasiček model

6.3 Vasiček model with stochastic volatility

We consider the following stochastic volatility model,

$$\begin{aligned} \mathrm {d}V_t&= \alpha _1(0.1-V_t)\,\mathrm {d}t+\sigma _1\sqrt{V_t}\,\mathrm {d}W_t^{(1)},\\ \mathrm {d}x_t&= \alpha _2(\beta -x_t)\,\mathrm {d}t+\sigma _2\sqrt{V_t}\left( \rho \,\mathrm {d}W_t^{(1)}+\sqrt{1-\rho ^2}\,\mathrm {d}W_t^{(2)}\right) , \end{aligned}$$

which is model (3.8) in a notation that is more suitable for the purposes of this section. The process \(V_t\) represents the fluctuating volatility of the system. Note that the long-term mean of the process \(V_t\) is fixed at the known constant 0.1; otherwise the model would be over-parametrized, since by scaling the volatility process and the parameters \(\sigma _1, \sigma _2\) one can obtain an equivalent model. The transition density of this stochastic volatility model is not analytically available. To tackle this issue, we apply the same approximation as in the CIR model test of Sect. 6.1, namely, we approximate the stochastic diffusion coefficient by a constant one between the time steps, see (4.10). Under this approximation, the transition density of the model is Gaussian and its mean and variance can be computed explicitly. The short rate is defined by \(r_t=x_t\) and hence the yield curve can be computed, see (3.3), from \(R_t(\tau ) = \phi (\tau ,0)+\psi (\tau ,0)^\top {\tilde{x}}_t\), with \({\tilde{x}}_t = [V_t,x_t]^\top \). The functions \(\phi \), \(\psi \) are the solutions to the Riccati Eqs. (A.2), with \(d=2\), \(\gamma = (0,-1)^\top \) and \(c=0\). The solutions to these equations are not known in closed form. We introduce an efficient numerical algorithm to compute these functions; the details are given in Appendix A.3.

The parameters of this model are set to be \([\alpha _1,\alpha _2,\beta ,\sigma _1,\sigma _2,\rho ]=[0.1,0.3,0.03,0.3,0.07,-0.5]\). The variance of the white noise is \(10^{-8}\). The times to maturity of the yield curves range from 1 year up to 20 years. The time step of the data is daily and the time length is 2000. In the outer layer, we sample \(N=2000\) particles and in the inner layer, we use the Kalman filter. The initial prior distributions of the parameters are uniform,

$$\begin{aligned}&{\alpha _1,\alpha _2 \sim U(0,1), \beta \sim U(0,0.1),} \\&\sigma _1\sim U(0,0.8), \sigma _2\sim U(0,0.2) \text{ and } \rho \sim U(-1,1). \end{aligned}$$

The mean and variance of the initial prior distribution of \({\tilde{x}}_t\) are [0.1, 0] and \(\mathrm {diag}([0.01,0.01])\). Figure 9 shows that the parameter estimates also converge quickly for a model with stochastic volatility.
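The constant-diffusion approximation used in this subsection can be illustrated with the following sketch, in which \(\sqrt{V_t}\) is frozen at its value at the start of each time step so that the transition becomes Gaussian with explicit mean and covariance. The precise form of (4.10) may differ; the function name, the guard against negative \(V\) and the daily step of \(1/250\) years are assumptions.

```python
import numpy as np

def frozen_diffusion_transition(state, alpha1, alpha2, beta,
                                sigma1, sigma2, rho, dt, v_bar=0.1):
    """Gaussian one-step transition with sqrt(V) held fixed over the step."""
    V, x = state
    alphas = np.array([alpha1, alpha2])
    levels = np.array([v_bar, beta])             # mean reversion levels of (V, x)
    decay = np.exp(-alphas * dt)
    mean = levels + (np.array([V, x]) - levels) * decay
    s = np.sqrt(max(V, 0.0))                     # frozen volatility; guard is an assumption
    Sigma = s * np.array([[sigma1, 0.0],
                          [sigma2 * rho, sigma2 * np.sqrt(1.0 - rho**2)]])
    C = Sigma @ Sigma.T
    rate = alphas[:, None] + alphas[None, :]
    cov = C * (1.0 - np.exp(-rate * dt)) / rate
    return mean, cov

mean, cov = frozen_diffusion_transition((0.1, 0.0), 0.1, 0.3, 0.03,
                                        0.3, 0.07, -0.5, dt=1.0 / 250.0)
corr = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
```

Starting from \({\tilde{x}} = [0.1, 0]^\top \), the first component of the mean stays at its long-term level 0.1, and the one-step correlation between the two components is close to \(\rho = -0.5\).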

Fig. 9
Convergence of parameter estimates by KPF for the Vasiček model with stochastic volatility

6.4 One-factor CIR model with jump parameters

In this section we test the performance of Algorithm 4.3, the alternative to Algorithm 4.2 (which is for the case of static parameters), when a sudden jump in the parameters takes place. This experiment can be seen as an extension of the experiment on the CIR model calibration of Sect. 6.1. In this experiment, we assume the parameters have a jump at time \(T=2001\), from \([\alpha ,\beta ,\sigma ] = [0.45,0.001,0.017]\) to \([\alpha ,\beta ,\sigma ] = [0.55,0.0015,0.023]\). We simulate the new data from time point \(T=2001\) to \(T=4000\) based on the new parameters; all other settings are the same as in Sect. 6.1. To identify the parameter change, we set \(q=0.1\). Figure 10 shows that in this study the KPF algorithm for models with time-varying parameters is able to track a sudden change in the parameter values and quickly stabilizes at the new values.

Fig. 10
Parameter estimates by KPF for the CIR model with a sudden jump

6.5 Two-factor Vasiček model for real data

In this subsection, we apply our algorithm to the calibration of a two-factor Vasiček model on yield curves that are bootstrapped from Euro swap rates with 3-month floating and fixed leg. The swap rates come from Bloomberg, with daily time step from 01-01-2009 to 29-05-2017. The tenors we use range from 4 years to 15 years.

Let \((Y_t)_{t\ge 0} = (Y_t^{(1)},Y_t^{(2)})_{t\ge 0}\) be a two-factor affine process. Assume the short rate is \(r_t = Y_t^{(1)} + Y_t^{(2)}\), \(t\ge 0\). Then the corresponding yield curve at tenor \(\tau \) is given by

$$\begin{aligned} R_t(\tau ) = \phi (\tau ,0) + \psi (\tau ,0)^\top Y_t. \end{aligned}$$

Suppose we have observations of the process Y at discrete times \(k=1,\ldots ,T\). Denote \({\bar{R}}(\tau ) = \frac{1}{T}\sum _{k=1}^T R_k(\tau )\) and \({\bar{Y}} = \frac{1}{T}\sum _{k=1}^T Y_k\). Then we obtain

$$\begin{aligned} R_k(\tau ) - {\bar{R}}(\tau ) =\psi (\tau ,0)^\top (Y_k-{\bar{Y}})\,. \end{aligned}$$
(6.1)
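For completeness, the step leading to (6.1) can be written out as follows: averaging the yield curve relation over \(k=1,\ldots ,T\) gives the relation for \({\bar{R}}(\tau )\), and subtracting it from the relation at time \(k\) removes the intercept \(\phi (\tau ,0)\),

$$\begin{aligned} {\bar{R}}(\tau )&= \frac{1}{T}\sum _{k=1}^T \left( \phi (\tau ,0) + \psi (\tau ,0)^\top Y_k\right) = \phi (\tau ,0) + \psi (\tau ,0)^\top {\bar{Y}},\\ R_k(\tau ) - {\bar{R}}(\tau )&= \psi (\tau ,0)^\top Y_k - \psi (\tau ,0)^\top {\bar{Y}} = \psi (\tau ,0)^\top (Y_k-{\bar{Y}})\,. \end{aligned}$$

In particular, the centered yields depend on the model only through \(\psi \), so \(\phi \) does not need to be evaluated for the centered data.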

We assume the process \(X_t = Y_t-{\bar{Y}}\), \(t\ge 0\), is a two-factor Vasiček process with mean reversion level zero, i.e.,

$$\begin{aligned} \mathrm {d}X_t&= AX_t \, \mathrm {d}t + \varSigma \, \mathrm {d}W_t\,, \end{aligned}$$
(6.2)

where A is a \(2\times 2\) diagonal matrix and \(\varSigma \) is a \(2\times 2\) lower triangular matrix. Instead of the perfect observations given by (6.1), we assume that we have noisy observations of the yield curve given by

$$\begin{aligned} R_k(\tau ) - {\bar{R}}(\tau ) = \psi (\tau ,0)^\top X_k + v_k, \end{aligned}$$
(6.3)

which corresponds to Eq. (3.5). Moreover, \(\psi (\tau ,0)\) is the solution of ODE (A.1), with \(d=2\), \(\gamma = (-1, -1)^\top \) and \(c=0\).

The model (6.2) has five parameters that need to be calibrated: the diagonal elements of A, denoted by \(\alpha _1, \alpha _2\), and the volatility parameters and the correlation parameter in \(\varSigma \), denoted by \(\sigma _1, \sigma _2\) and \(\rho \), respectively. The variance of \(v_k\) in (6.3) is set to be \(2.36\times 10^{-8}\) (obtained from MLE). The base configuration of our algorithm is the same as in Sect. 6.2, except that we now also allow for sudden changes of the parameters. For that we set \(q=0.01\). Figure 11 shows the calibration results for all five parameters using the KPF algorithm and the MLE. It also shows that there are three different levels of the parameters, to which the algorithm quickly converges in all cases.

Fig. 11
Parameter estimates by KPF and MLE for the two-factor Vasiček model applied to real data of yield curves

7 Conclusion

In this paper we have introduced a semi-recursive algorithm combining the Kalman filter and the particle filter in a two-layer structure. In the outer layer a dynamic Gaussian kernel is implemented to sample the parameter particles. Moreover, the Kalman filter is applied in the inner layer to estimate the posterior distribution of the state variables given the parameters sampled in the outer layer. These two changes provide faster convergence and reduce the computational time compared to the RNPF methodology. The theoretical contribution of this paper is the convergence analysis of the proposed algorithm. We proved that, under regularity assumptions and given a certain model structure, the posterior distribution of the parameters and the state variables converges to the actual distribution in \(L^p\) with rate \({\mathcal {O}}(N^{-\frac{1}{2}}+\varDelta ^{\frac{1}{2}})\). The theoretical result is complemented by numerical results for several affine term structure models with static parameters or jump parameters. Although our numerical illustrations are for term structure models, the Kalman particle algorithm can also be applied to many other models.