Nudging the particle filter

We investigate a new sampling scheme aimed at improving the performance of particle filters whenever (a) there is a significant mismatch between the assumed model dynamics and the actual system, or (b) the posterior probability tends to concentrate in relatively small regions of the state space. The proposed scheme pushes some particles toward specific regions where the likelihood is expected to be high, an operation known as nudging in the geophysics literature. We reinterpret nudging in a form applicable to any particle filtering scheme, as it does not involve any changes in the rest of the algorithm. Since the particles are modified, but the importance weights do not account for this modification, the use of nudging leads to additional bias in the resulting estimators. However, we prove analytically that nudged particle filters can still attain asymptotic convergence with the same error rates as conventional particle methods. Simple analysis also yields an alternative interpretation of the nudging operation that explains its robustness to model errors. Finally, we show numerical results that illustrate the improvements that can be attained using the proposed scheme. In particular, we present nonlinear tracking examples with synthetic data and a model inference example using real-world financial data.


Background
State-space models (SSMs) are ubiquitous in many fields of science and engineering, including weather forecasting, mathematical finance, target tracking, machine learning, population dynamics, etc., where inferring the states of dynamical systems from data plays a key role.
An SSM comprises a pair of stochastic processes (x t ) t≥0 and (y t ) t≥1 called signal process and observation process, respectively.The conditional relations between these processes are defined with a transition and an observation model (also called likelihood model) where observations are conditionally independent given the signal process, and the latter is itself a Markov process.Given an observation sequence, y 1:t , the filtering problem in SSMs consists in the estimation of expectations with respect to the posterior probability distribution of the hidden states, conditional on y 1:t , which is also referred to as the filtering distribution.
Apart from a few special cases, neither the filtering distribution nor the integrals (or expectations) with respect to it can be computed exactly; hence, one needs to resort to numerical approximations of these quantities.Particle filters (PFs) have been a classical choice for this task since their introduction by Gordon et al. (1993); see also (Kitagawa, 1996;Liu and Chen, 1998;Doucet et al., 2000;de Freitas et al., 2001).The PF constructs an empirical approximation of the posterior probability distribution via a set of Monte Carlo samples (usually termed particles) which are modified or killed sequentially as more data are taken into account.These samples are then used to estimate the relevant expectations.The original form of the PF, often referred to as the bootstrap particle filter (BPF), has received significant attention due to its efficiency in a variety of problems, its intuitive appeal and its straightforward implementation.A large body of theoretical work concerning the BPF has also been compiled.For example, it has been proved that the expectations with respect to the empirical measures that the BPF constructs converge to the expectations with respect to the true posterior distributions when the number of particles is large enough (Del Moral and Guionnet, 1999;Chopin, 2004;Künsch, 2005;Douc and Moulines, 2008) or that they converge uniformly over time under additional assumptions related to the stability of the true distributions (Del Moral and Guionnet, 2001;Del Moral, 2004).
Despite the success of PFs in relatively low dimensional settings, their use has been regarded impractical in models where (x t ) t≥0 and (y t ) t≥1 are sequences of high-dimensional random variables.This is an instance of a general problem coined as the the curse of dimensionality.It is referred to as the collapse of the particle filter within the particle filtering context and it has been studied from various empirical and theoretical viewpoints in recent years.Bengtsson et al. (2008) study the weights of the BPF in a high-dimensional setting and they show that the maximum weight tends to 1 (under simplifying assumptions) which, they argue, causes the filter to collapse effectively (also see (Snyder et al., 2008;Bickel et al., 2008) for additional research on the collapse of the BPF).Recently, Rebeschini and Van Handel (2015) have proposed a "block PF", applicable to certain network-like high-dimensional SSMs, which can be proved to converge for a subset of marginal posterior distributions on the state space under assumptions on the correlations between state variables.
The collapse of the PF has received significant attention, also, in the weather dynamics literature, where models are high-dimensional, obviously approximate, and do not have analytic solutions.In particular, in the data assimilation literature, high-dimensional systems are often dealt with via an operation called nudging (Hoke and Anthes, 1976;Malanotte-Rizzoli andHolland, 1986, 1988;Zou et al., 1992).Within the particle filtering context, nudging can be defined as a transformation of the particles, which are pushed towards the observations using some observationdependent map (van Leeuwen, 2009(van Leeuwen, , 2010;;Ades andvan Leeuwen, 2013, 2015).If the dimensions of the observations and the hidden states are different, which is often the case, a gain matrix is computed in order to perform the nudging operation.In (van Leeuwen, 2009(van Leeuwen, , 2010;;Ades andvan Leeuwen, 2013, 2015), nudging is performed after the sampling step of the particle filter.The importance weights are then computed accordingly, so that they remain proper.Hence, nudging in this version amounts to a sophisticated choice of the importance function that generates the particles.It has been shown (numerically) that the schemes proposed by van Leeuwen (2009Leeuwen ( , 2010)); Ades andvan Leeuwen (2013, 2015) can track high-dimensional systems with a low number of particles.However, generating samples from the nudged proposal requires costly computations for each particle and the evaluation of weights becomes heavier as well.It is also unclear how to apply existing nudging schemes when non-Gaussianity and nontrivial nonlinearities are present in the observation model.
A related class of algorithms includes the so-called implicit particle filters (IPFs) (Chorin and Tu, 2009;Chorin et al., 2010;Atkins et al., 2013).Similar to nudging schemes, IPFs rely on the principle of pushing particles to high-probability regions in order to prevent the collapse of the filter in high-dimensional state spaces.In a typical IPF, the region where particles should be generated is determined by solving an algebraic equation.This equation is model dependent, yet it can be solved for a variety of different cases (general procedures for finding solutions are given by Chorin and Tu (2009) and Chorin et al. (2010)).The fundamental principle underlying IPFs, moving the particles towards high-probability regions, is similar to nudging.Note, however, that unlike IPFs, nudging-based methods are not designed to guarantee that the resulting particles land on high-probability regions; it can be the case that nudged particles are moved to relatively low probability regions (at least occasionally).Since an IPF requires the solution of a modeldependent algebraic equation for every particle, it can be computationally costly, similar to the nudging methods by van Leeuwen (2009Leeuwen ( , 2010)); Ades andvan Leeuwen (2013, 2015).Moreover, it is not straightforward to derive the map for the translation of particles in general models, hence the applicability of IPFs depends heavily on the specific model at hand.

Contribution
In this work, we propose a modification of the PF, termed the nudged particle filter (NuPF) and assess its performance in high dimensional settings and with misspecified models.Although we use the same idea for nudging that is presented in the literature, our algorithm has subtle but crucial differences, as summarized below.
First, we define the nudging step not just as a relaxation step towards observations but as a step that strictly increases the likelihood of a subset of particles.This definition paves the way for using different nudging schemes, such as using the gradients of likelihoods or employing random search schemes to move around the state-space.In particular, classical nudging (relaxation) operations arise as a special case of nudging using gradients when the likelihood is assumed to be Gaussian.Compared to IPFs, the nudging operation we propose is easier to implement as we only demand the likelihood to increase (rather than the posterior density).Indeed, nudging operators can be implemented in relatively straightforward forms, without the need to solve model-dependent equations.
Second, unlike the other nudging based PFs, we do not correct the bias induced by the nudging operation during the weighting step.Instead, we compute the weights in the same way they are computed in the standard BPF.However, we carry out the nudging step in a way that preserves the convergence rate of the BPF under mild standard assumptions.In particular, we push only a fraction of particles and leave others untouched, which is key to preserving the Monte Carlo convergence rate, in addition to a significant computational gain †.Also, computing biased weights using only the likelihood function (i.e.regardless of the proposal) is usually much faster than computing full weights, as one needs only one function evaluation for each weight.Depending on the choice of nudging scheme, the proposed algorithm can have an almost negligible computational overhead compared to the standard BPF.
We present computer simulation results on tracking a stochastic chaotic Lorenz 63 attractor with model misspecification, object tracking with a misspecified model involving a heavy-tailed nonlinear observation model, and tracking a high-dimensional Lorenz 96 model.The first two examples illustrate the robustness of the proposed nudged PFs when there is a significant misspecification in the transition model (provided that the likelihood is well-specified).This is helpful in realworld applications because practitioners often have more control over, or simply knowledge of, measurement systems, which determine the likelihood, than they have over the state dynamics.The last example, dealing with a stochastic Lorenz 96 model, shows how a relatively high dimensional system can be tracked without a major increase of the computational effort compared to the standard BPF.
The paper is structured as follows.After a brief note about notation, we describe the SSMs of interest and the BPF in Section 2. Then in Section 3, we outline the general algorithm and the specific nudging schemes we propose to use within the PF.We prove a convergence result in Section 4 which shows that the new algorithm has the same asymptotic convergence rate as the BPF.We also provide an alternative interpretation of the nudging operation that explains its robustness in scenarios where there is a mismatch between the observed data and the assumed SSM.We discuss the results of several computer experiments in Section 5 and demonstrate the practical use of our algorithm on a real dataset in Section 6, namely for the fitting of a stochastic volatility model.Finally, we make some concluding remarks in Section 7.

Notation
We denote the set of real numbers as R, while We denote the set of positive integers with N + and the set of positive reals with R + .We represent the state space with X ⊂ R dx and the observation space with Y ⊂ R dy .
The supremum norm of a real function ϕ : we indicate the space of real bounded functions X → R as B(X).
In order to denote sequences, we use the shorthand notation x i1:i2 = {x i1 , . . ., x i2 }.For sets of integers, we use [n] = {1, . . ., n}.The p-norm of a vector x ∈ R d is defined by The Borel σ-algebra of subsets of X is denoted B(X) and the integral of a function ϕ : X → R with respect to a measure µ on the measurable space (X, B(X)) is is denoted (ϕ, µ) := ϕdµ.The †In principle, one can choose to nudge all particles.However, although the resulting algorithm can be designed to enjoy the same theoretical properties as nudging only a fraction of particles, it would have a much higher computational cost and it would be harder to tune in practice.Therefore, we do not explore this possibility.
unit Dirac delta measure located at x ∈ R d is denoted δ x (dx).The Monte Carlo approximation of a measure µ constructed using N samples is generally denoted as µ N .

State space models
We consider SSMs of the form (2) where x t ∈ X is the system state at time t, y t ∈ Y is the t-th observation, the measure π 0 describes the prior probability distribution of the state, τ t is a Markov transition kernel on X, and g t (y t |x t ) is the (possibly non-normalized) pdf of the observation y t conditional on the state x t .We assume the observation sequence {y t } t∈N+ is arbitrary but fixed.Hence, it is convenient to think of the conditional pdf g t as a likelihood function and we write g t (x t ) := g t (y t |x t ) for conciseness.We are interested in the sequence of posterior probability distributions of the states generated by the SSM model.To be specific, at each time t = 1, 2, ... we aim at computing (or, at least, approximating) the probability measure π t , t = 1, 2, ..., which describes the probability distribution of the state x t conditional on the observation of the sequence y 1:t .When it exists, we use π(x t |y 1:t ) to denote the pdf of x t given y 1:t with respect to the Lebesgue measure, i.e., π t (dx t ) = π(x t |y 1:t )dx t .
The measure π t is often termed the optimal filter at time t.It is closely related to the probability measure ξ t , which describes the probability distribution of the state x t conditional on y 1:t−1 and it is, therefore, termed the predictive measure at time t.As for the case of the optimal filter, we use ξ(x t |y 1:t−1 ) to denote the pdf, with respect to the Lebesgue measure, of x t given y 1:t−1 .

Bootstrap particle filter
The BPF (Gordon et al., 1993) is a recursive algorithm that produces successive Monte Carlo approximations of ξ t and π t for t = 1, 2, ....The method can be outlined as shown in Algorithm 1.

4:
Weighting: compute w 5: (dx), independently for i = 1, ..., N .6: end for After an initialization stage, where a set of independent and identically distributed (iid) samples from the prior are drawn, it consists of three recursive steps which can be depicted as, Given a Monte Carlo approximation computed at time t − 1, the sampling step yields an approximation of the predictive measure ξ t of the form t by propagating the particles {x t−1 ).The observation y t is assimilated via the importance weights w and the resampling step produces a set of un-weighted particles that completes the recursive loop and yields the approximation The random measures ξ N t , πN t and π N t are commonly used to estimate a posteriori expectations conditional on the available observations.For example, if ϕ is a function X → R, then the expectation of the random variable ϕ(x t ) conditional on y Similarly, we can have estimators (ϕ, πN t ) ≈ (ϕ, π t ) and (ϕ, π N t ) ≈ (ϕ, π t ).Classical convergence results are usually proved for real bounded functions, e.g., if ϕ ∈ B(X) then lim under mild assumptions; see (Del Moral, 2004;Bain and Crisan, 2009) and references therein.
The BPF can be generalized by using arbitrary proposal pdf's q t (x t |x (i) t−1 , y t ), possibly observationdependent, instead of the Markov kernel τ t (•|x (i) t−1 ), in order to generate the particles {x i=1 in the sampling step.This can lead to more efficient algorithms, but the weight computation has to account for the new proposal and we obtain (Doucet et al., 2000), which can be more costly to evaluate.This issue is related to the nudged PF to be introduced in Section 3 below, which can be interpreted as a scheme to choose a certain observation-dependent proposal q t (x t |x (i) t−1 , y t ).However, the new method does not require that the weights be computed as in (5) in order to ensure convergence of the estimators.

General algorithm
Compared to the standard BPF, the nudged particle filter (NuPF) incorporates one additional step right after the sampling of the particles {x i=1 at time t.The schematic depiction of the BPF in (4) now becomes where the new nudging step intuitively consists in pushing a subset of the generated particles {x t } N i=1 towards regions of the state space X where the likelihood function g t (x) takes higher values.
When considered jointly, the sampling and nudging steps in ( 6) can be seen as sampling from a proposal distribution which is obtained by modifying the kernel τ t (•|x t−1 ) in a way that depends on the observation y t .Indeed, this is the classical view of nudging in the literature (van Leeuwen, 2009(van Leeuwen, , 2010;;Ades andvan Leeuwen, 2013, 2015).However, unlike in this classical approach, here the weighting step does not account for the effect of nudging.In the proposed NuPF, the weights are kept the same as in the original filter, w In doing so, we save computations but, at the same time, introduce bias in the Monte Carlo estimators.One of the contributions of this paper is to show that this bias can be controlled using simple design rules for the nudging step, while practical performance can be improved at the same time.
In order to provide an explicit description of the NuPF, let us first state a definition for the nudging step.
Definition 1.A nudging operator α yt t : X → X associated with the likelihood function g t (x) is a map such that for every x, x ′ ∈ X.
Intuitively, we define nudging herein as an operation that increases the likelihood.There are several ways in which this can be achieved and we discuss some examples in Sections 3.2 and 3.3.The NuPF with nudging operator α yt t : X → X is outlined in Algorithm 2.

4:
Nudging: choose a set of indices

7: end for
It can be seen that the nudging operation is implemented in two stages.
• First, we choose a set of indices I t ⊂ [N ] that identifies the particles to be nudged.Let M = |I t | denote the number of elements in I t .We prove in Section 4 that keeping allows the NuPF to converge with the same error rates O 1 √ N as the BPF.In Section 3.2 we discuss two simple methods to build I t in practice.
• Second, we choose an operator α yt t that guarantees an increase of the likelihood of any particle.We discuss different implementations of α yt t in Section 3.3.
We devote the rest of this section to a discussion of how these two steps can be implemented (in several ways).

Selection of particles to be nudged
The set of indices I t , that identifies the particles to be nudged in Algorithm 2, can be constructed in several different ways, either random or deterministic.In this paper, we describe two simple random procedures with little computational overhead.
• Batch nudging: Let the number of nudged particles M be fixed.A simple way to construct I t is to draw indices i 1 , i 2 , . . ., i M uniformly from [N ] without replacement, and then let I t = i 1:M .We refer to this scheme as batch nudging, referring to selection of the indices at once.One advantage of this scheme is that the number of particles to be nudged, M , is deterministic and can be set a priori.
• Independent nudging: The size and the elements of I t can also be selected randomly in a number of ways.Here, we have studied a procedure in which, for each index i = 1, ..., N , we assign i ∈ I t with probability M N .In this way, the actual cardinality |I t | is random, but its expected value is exactly M .This procedure is particularly suitable for parallel implementations, since each index can be assigned to I t (or not) at the same time as all others.

How to nudge
The nudging step is aimed at increasing the likelihood of a subset of individual particles, namely those with indices contained in I t .Therefore, any map α yt t : X → X such that (g t • α yt t )(x) ≥ g t (x) when x ∈ X is a valid nudging operator.Typical procedures used for optimization, such as gradient moves or random search schemes, can be easily adapted to implement (relatively) inexpensive nudging steps.Here we briefly describe a few of such techniques.
• Gradient nudging: If g t (x t ) is a differentiable function of x t , one straightforward way to nudge particles is to take gradient steps.In Algorithm 3 we show a simple procedure with one gradient step alone, and where γ > 0 is a step-size parameter and ∇ x g t (x) denotes the vector of partial derivatives of g t with respect to the state variables, i.e., Algorithms can obviously be designed where nudging involves several gradient steps.In this work we limit our study to the single-step case, which is shown to be effective and keeps the computational overhead to a minimum.We also note that the performance of gradient nudging can be sensitive to the choice of the step-size parameter γ > 0, and this is, in turn, model dependent.
• Random nudging: Gradient-free techniques inherited from the field of global optimization can also be employed in order to push particles towards regions where they have higher likelihoods.A simple stochastic-search technique adapted to the nudging framework is shown in Algorithm 4. We hereafter refer to the latter scheme as random-search nudging.
• Model specific nudging: Particles can also be nudged using the specific model information.
For instance, in some applications the state vector x t can be split into two subvectors, x o t and x u t (observed and unobserved, respectively), such that g t (x t ) = g t (x o t ), i.e., the likelihood depends only on x o t and not on x u t .If the relationship between x o t and x u t is tractable, one can first nudge x o t in order to increase the likelihood and then modify x u t in order to keep it coherent with x o t .A typical example of this kind arises in object tracking problems, where positions and velocities have a special and simple physical relationship but usually only position variables are observed through a linear or nonlinear transformation.In this case, nudging would only effect position variables.However, using position variables, one can also nudge velocity variables with simple rules.We discuss this idea and show numerical results in Section 5.

Algorithm 3 Gradient nudging
1: for every i ∈ I t do x(i) t + η t where η t ∼ N (0, C) for some covariance matrix C. 3: x(i) t .4: until the particle is nudged.

Nudging general particle filters
In this paper we limit our presentation to BPFs in order to focus on the key concepts of nudging and ease presentation.It should be apparent, however, that nudging steps can be plugged into general PFs.More specifically, since the nudging step is algorithmically detached from the sampling and weighting steps, it can be easily used within any PF, even if it relies on different proposals and different weighting schemes.We leave for future work the investigation of the performance of nudging within widely used PFs, such as auxiliary particle filters (APFs) (Pitt and Shephard, 1999).

Analysis
The nudging step modifies the random generation of particles in a way that is not compensated by the importance weights.Therefore, we can expect nudging to introduce bias in the resulting estimators in general.However, in Section 4.1 we prove that, as long as some basic guidelines are followed, the estimators of integrals with respect to the filtering measure π t and the predictive measure ξ t converge in L p as N → ∞ with the usual Monte Carlo rate O(1/ √ N ).The analysis is based on a simple induction argument and ensures the consistency of the estimators.In Section 4.2 we briefly comment on the conditions needed to guarantee that convergence is attained uniformly over time.We do not provide a full proof, but this can be done extending the classical arguments in (Del Moral and Guionnet, 2001) or (Del Moral, 2004) and using the same treatment of the nudging step as in the induction proof of Section 4.1.Finally, in Section 4.3, we provide an interpretation of nudging in a scenario with modelling errors.In particular, we show that the NuPF can be seen as a standard PF for a modified SSM which is "a better fit" for the available data than the original model.

Convergence in L p
The goal in this section is to provide theoretical guarantees of convergence for the NuPF under mild assumptions.First, we analyze a general NuPF (with arbitrary nudging operator α yt t and an upper bound on the size M of the index set I t ) and then we provide a result for a NuPF with gradient nudging.
Before proceeding with the analysis, let us note that the NuPF produces several approximate measures, depending on the set of particles (and weights) used to construct them.After the sampling step, we have the random probability measure after nudging.Once the weights w (i) t are computed, we obtain the approximate filter after the resampling step.Similar to the BPF, the simple Assumption 1 stated next is sufficient for consistency and to obtain explicit error rates (Del Moral and Miclo, 2000;Crisan and Doucet, 2002;Míguez et al., 2013) for the NuPF, as stated in Theorem 1 below.
Assumption 1.The likelihood function is positive and bounded, i.e., Theorem 1.Let y 1:T be an arbitrary but fixed sequence of observations, with T < ∞, and choose any M ≤ √ N and any map α yt t < ∞.If Assumption 1 is satisfied and for every t = 1, 2, ..., T , any ϕ ∈ B(X), any p ≥ 1 and some constant c t < ∞ independent of N .
Proof.We follow the same kind of induction argument as in, e.g., (Bain and Crisan, 2009).For t = 0, Eq. ( 10) is satisfied trivially as we draw x The analysis of the approximate predictive measure ξ N t is standard (see, e.g., the proof of (Míguez et al., 2013, Lemma 1)) and it can be easily shown that where c 1,t < ∞ is a constant independent of N .After the nudging step we obtain the random measure ξN t .The sets of samples {x , where {j 1 , . . ., j M } = I t .Therefore, we readily obtain the relationship where the first inequality holds trivially (since ) and the second inequality follows from the assumption M ≤ √ N .Combining ( 11) and ( 12) we arrive at where the constant c1,t = 2 + c 1,t < ∞ is independent of N .The rest of the proof is standard.Following the same steps as in the proof of (Míguez et al., 2013, Lemma 1), we can prove that where is finite a constant independent of N (the denominator is positive and the numerator is finite as a consequence of Assumption 1).The analysis of the multinomial resampling step is also standard and it yields where c 3,t < ∞ is a constant independent of N .We can combine bounds ( 14) and ( 15) to obtain the inequality (10), with c t = c 2,t + c 3,t < ∞, and conclude the proof.✷ Theorem 1 is very general; it actually holds for any map α yt t < ∞, i.e., not necessarily a nudging operator.We can also obtain error rates for specific choices of the nudging scheme.A simple, yet practically appealing, setup is the combination of batch and gradient nudging, as described in Sections 3.2 and 3.3, respectively.
Assumption 2. The gradient of the likelihood is bounded.In particular, there are constants G t < ∞ such that ∇ x g t (x) 2 ≤ G t < ∞ for every x ∈ X and t = 1, 2, . . ., T .
Lemma 1. Choose a step-size γ > 0 and the number of nudged particles M > 0 in such a way that γM ≤ √ N .If Assumption 2 holds and ϕ is a Lipschitz test function, then the error introduced by the batch gradient nudging step with |I t | = M can be bounded as, where L is the Lipschitz constant of ϕ, for every t = 1, . . ., T .

Proof. Since x(i)
t ), we readily obtain the relationships where the first inequality follows from the Lipschitz assumption, the identity is due to the implementation of the gradient-nudging step and the second inequality follows from Assumption 2. Then we bound the error (ϕ, where the identity is a consequence of the construction of I t and we apply Minkowski's inequality, ( 16) and the assumption |I t = M | to obtain (17).However, we have assumed that γM ≤ √ N , hence

✷
It is straightforward to apply Lemma 1 to prove convergence of the NuPF with a batch gradientnudging step.Specifically, we have the following result.
Theorem 2. Let y 1:T be an arbitrary but fixed sequence of observations, with T < ∞, and choose a step size γ > 0 and an integer M such that γM ≤ √ N .Let π N t denote the filter approximation obtained with a NuPF with batch gradient nudging.If Assumptions 1 and 2 are satisfied and for every t = 1, 2, ..., T , any bounded Lipschitz function ϕ, any p ≥ 1 and some constant c t < ∞ independent of N .
The proof is straightforward (using the same argument as in the proof of Theorem 1 combined with Lemma 1 )and we omit it here.We note that Lemma 1 provides a guideline for the choice of M and γ.In particular, one can select M = N β , where 0 < β < 1, together with γ ≤ N 1 2 −β in order to ensure γM ≤ √ N .Actually, it would be sufficient to set γ ≤ CN 1 2 −β for some constant C < ∞ in order to keep the same error rate (albeit with a different constant in the numerator of the bound).Therefore, Lemma 1 provides a heuristic to balance the step size with the number of nudged particles.We can increase the number of nudged particles but in that case we need to shrink the step size accordingly, so as to keep γM ≤ √ N .Similar results can be obtained using the gradient of the log-likelihood, log g t , if the g t comes from the exponential family of densities.

Uniform convergence
Uniform convergence can be proved for the NuPF under the same standard assumptions as for the conventional BPF; see, e.g., (Del Moral and Guionnet, 2001;Del Moral, 2004).The latter can be summarised as follows, following Del Moral ( 2004): (i) The likelihood function is bounded and bounded away from zero, i.e., g t ∈ B(X) and there is some constant a > 0 such that inf t>0,x∈X g t (x) ≥ a.
(ii) The kernel mixes sufficiently well, namely, for any given integer m there is a constant 0 for any Borel set A, where τ t+m|t is the composition of the kernels τ t+m When (i) and (ii) above hold, the sequence of optimal filters {π t } t≥0 is stable and it can be proved that sup for any bounded function ϕ ∈ B(X), c < ∞ is constant with respect to N and t and π N t is the particle approximation produced by either the NuPF (as in Theorem 1 or, provided sup t>0 G t < ∞, as in Theorem 2) or the BPF algorithms.We skip a formal proof as, again, it is straightforward combination of the standard argument by Del Moral (2004) (see also, e.g., (Oreshkin and Coates, 2011;Crisan and Miguez, 2017)) with the same handling of the nudging operator in the proofs of Theorem 1 or Lemma 1 .

Nudging and model evidence
We have found in computer simulation experiments that the NuPF is consistently more robust to model errors than the conventional BPF.In order to obtain some analytical insight of this scenario, in this section we reinterpret the NuPF as a standard BPF for a modified SSM and show that this modified model is a better fit for the given data than the original SSM.
In Bayesian methodology, a common approach to compare two competing probabilistic models, say M 0 and M 1 , for a given data set y 1:T is to evaluate the so-called model evidence (Bernardo and Smith, 1994) for both M 0 and M 1 .
Definition 2. The evidence (or likelihood) of a probabilistic model M for a given data set y 1:T is the probability density of the data conditional on the model, that we denote as p(y 1:T |M).
We say that M 1 is a better fit than M 0 for the data set y 1:T when p(y 1:T |M 1 ) > p(y 1:T |M 0 ).
If the models to be compared are SSMs, then they are defined by the prior τ 0 , the kernels τ t and the likelihood functions g t (x) = g t (y t |x), for t ≥ 1.In this section we write the latter as g yt t (x) = g t (y t |x), in order to emphasize that g t is parameterized by the observation y t , and we also assume that every g yt t is a normalized pdf for the sake of clarity.Hence, we can formally represent the SSM defined by ( 1), ( 2) and (3) as M 0 = {τ 0 , τ t , g yt t }, and its evidence can be written as Now, let us consider y 1:T to be fixed and construct the alternative SSM M 1 = {τ 0 , τ yt t , g yt t }, where and the nudging operator α yt t is a one-to-one map that depends on the (fixed) observation y t .We note that the kernel τ yt t jointly represents the Markov transition induced by the original kernel τ t followed by an independent nudging transformation (namely, each particle is independently nudged with probability ε M ).As a consequence, the standard BPF for model M 1 coincides exactly with a NuPF for model M 0 with independent nudging and operator α yt t .Indeed, according to the definition of τ yt t in ( 20), generating a sample x(i) t−1 ) is a three-step process where • the generate a sample u t from the uniform distribution U(0, 1), and After sampling, the importance weight for the BPF applied to model M 1 is w t ).This is exactly the same procedure as in the NuPF applied to the original SSM M 0 (see Algorithm 2).
It turns out that M 1 is a better fit for the given sequence of observations y 1:T than M 0 , in terms of the model evidence, and, hence, the NuPF is itself (in this sense) a better fit than the BPF for the given data.This is made formal by the following result.
Proof.We note that where ( 21) follows from the structure of the SSM and ( 22) is readily obtained from the definition of τ yt t in (20).Since (g yt t • α yt t )(x) > g yt t (x) > 0 for every x ∈ X (by the definition of nudging operator), it follows that If the nudging operator α yt t is a continuous function of y t , then the inequality in Proposition 1 holds as well, for the same sequence of kernels τt , t = 1, ..., T , if we evaluate the model evidence for any sequence y ′ 1:T within a (sufficiently small) neighbourhood of y 1:T .
The assumption in expression (20) that the nudging operator α yt t is one-to-one can be relaxed.In particular, it is relatively straightforward to extend Proposition 1 to some models where α yt t is a many-to-one map.In particular, if there is a partition X = nt j=1 X j such that the restrictions α yt t,j : X j → X of the map α yt t are one-to-one, then it is possible to prove the same result as in Proposition 1 with just some additional notational burden.
Remark 3. If we fix the maps α yt t and the likelihoods g for arbitrary but fixed y 1:T .

Computer simulations
In this section, we present the results of three computer experiments.In the first one, we address the tracking of a stochastic Lorenz 63 model with misspecified parameters.Then, we study an object tracking problem where the observation model is nonlinear and the observation noise is heavy-tailed.Finally, we present results for the tracking of a high-dimensional stochastic Lorenz 96 model ‡.
We have used gradient nudging in all experiments, with either M ≤ √ N (deterministically, with batch nudging) or E[M ] ≤ √ N (with independent nudging).This ensures the validity of the theoretical results presented in Section 4. For the object tracking experiment, we have used a large step-size, but this choice does not affect the convergence rate of the NuPF algorithm either.

Stochastic Lorenz 63 model with misspecified parameters
In this experiment, we demonstrate the performance of the NuPF a misspecified stochastic Lorenz 63 model.The dynamics of the system is described by a stochastic differential equation (SDE) in three dimensions,  where {w i (s)} s∈(0,∞) for i = 1, 2, 3 are 1-dimensional independent Wiener processes and s, r, b ∈ R are fixed model parameters.We discretise the model using the Euler-Maruyama scheme with integration step T > 0 and obtain the system of difference equations where {u i,t } t∈N , i = 1, 2, 3 are i.i.d.Gaussian random variables with zero mean and unit variance.We assume that we can only observe the variable x 1,t every t s > 1 discrete time steps and contaminated by additive noise.To be specific, we collect the sequence of observations where {v n } n∈N is a sequence of i.i.d.Gaussian random variables with zero mean and unit variance and the scale parameter k o is assumed known.
In order to simulate both the state signal and the synthetic observations from this model, we choose the so-called standard parameter values (s, r, b) = 10, 28, 8 3 , which make the system dynamics chaotic.The initial condition is set as which corresponds to a deterministic trajectory of the system (i.e., with no state noise) with the same parameter set (Crisan and Miguez, 2016).We assume that the system is observed every t s = 40 discrete time steps and for each simulation we simulate the system for t = 0, 1, . . ., t f , with t f = 20, 000 .Since t s = 40, we have a sequence of t f ts = 500 observations overall.We run the BPF and NuPF algorithms for the model described above, except that the parameter b is replaced by b ǫ = b + ǫ, with ǫ = 0.75 (hence b ǫ ≈ 3.417 versus b ≈ 2.667 for the actual system).As the system underlying dynamics is chaotic, this mismatch affects the predictability of the system significantly.
We have implemented the NuPF with independent gradient nudging.Each particle is nudged with probability 1 √ N , where N is the number of particles (hence E[M ] = √ N ), and the size of the gradient steps is set to γ = 0.75 (see Algorithm 3).As a figure of merit, we evaluate the normalised mean square error (NMSE) for the 3-dimensional state vector, averaged over 1,000 independent Monte Carlo simulation.Specifically, the NMSE for the j-th simulation is evaluated as where xt (j) is the estimate of the state vector in the j-th simulation run.In the various figures we usually plot the mean and the standard deviation of the sample NMSE(1), . . ., NMSE(1000).
Fig. 1 (left) displays the NMSE, attained for varying number of particles N , for the standard BPF and the NuPF.It is seen that the NuPF outperforms the BPF for the whole range of values of N in the experiment, both in terms of the mean and the standard deviation of the errors, although the NMSE values become closer for larger N .The plot on the right displays the values of x 2,t and its estimates for a typical simulation.In general, the experiment shows that the NuPF can track the actual system using the misspecified model and a small number of particles, whereas the BPF requires a higher computational effort to attain a similar performance.

Object tracking with a misspecified model
In this experiment, we consider a tracking scenario where a target is observed through sensors collecting radio signal strength (RSS) measurements contaminated with additive heavy-tailed noise.The target dynamics is described by the model, where x t ∈ R 4 denotes the target state, consisting of its position r t ∈ R 2 and its velocity, The parameter x o is a deterministic, pre-chosen state to be attained by the target.Each element in the sequence {u t } t∈N is a zero-mean Gaussian random vector with covariance matrix Q.The parameters A, B, Q are selected as where I 2 is the 2 × 2 identity matrix and κ = 0.04.The policy matrix L ∈ R 2×4 determines the trajectory of the target from an initial position x 0 = [140, 140, 50, 0] ⊤ to a final state x o = [140, −140, 0, 0] ⊤ and it is computed by solving a Riccati equation (see (Bertsekas, 2001) for details), which yields This policy results in a highly maneuvering trajectory.In order to design the NuPF, however, we assume the simpler dynamical model hence there is a considerable model mismatch.
The observations are nonlinear and coming from 10 sensors placed in the region where the target moves.The measurement collected at the i-th sensor, time t, is modelled as y t,i = 10 log 10 P 0 r t − s i 2 + η + w t,i where r t ∈ R 2 is the position vector, s i is the position of the ith sensor and w t,i ∼ T (0, 1, ν) is an independent t-distributed random variable for each i = 1, . . ., 10. Intuitively, the closer the parameter ν to 1, the more explosive the observations become.In particular, we set ν = 1.01 to make the observations explosive and heavy-tailed.As for the sensor parameters, we set the transmitted RSS as P 0 = 1 and the sensitivity parameter as η = 10 −9 .The latter yields a soft lower bound of -90 dB's for the RSS measurements.
We have implemented the NuPF with batch gradient nudging, with a large-step size γ = 5.5 and M = ⌊ √ N ⌋.Since the observations depend on the position vector r t only, an additional model-specific nudging step is needed for the velocity vector v t .In particular, after nudging the r 2,t ] ⊤ , we update the velocity variables as where κ = 0.04 as defined for the model.The motivation for this additional transformation comes from the physical relationship between position and velocity.We note, however, that the NuPF also works without nudging the velocities.
We have run 10, 000 Monte Carlo runs with N = 500 particles in the auxiliary particle filter (APF) (Pitt and Shephard, 1999), the BPF (Gordon et al., 1993) and the NuPF.We have also implemented the extended Kalman filter (EKF), which uses the gradient of the observation model.
Figure 2 shows a typical simulation run with each one of the four algorithms (on the left side, plots (a)-(d)) and a box-plot of the NMSE's obtained for the 10,000 simulations (on the right, plot (e)).Plots (a)-(d) show that, while the EKF also uses the gradient of the observation model, it fails to handle the heavy-tailed noise, as it relies on Gaussian approximations.The BPF and the APF collapse due to the model mismatch in the state equation.Plot (d) shows that the NMSE of the NuPF is just slightly smaller in the mean than the NMSE of the EKF, but much more stable.

High-dimensional stochastic Lorenz 96 model
In this computer experiment, we compare the NuPF with the ensemble Kalman filter (EnKF) for the tracking of a stochastic Lorenz 96 system.The latter is described by the set of stochastic differfential equations (SDEs) where {w i (s)} s∈(0,∞) , i = 1, . . ., d, are independent Wiener processes, d is the system dimension and the forcing parameter is set to F = 8, which ensures a chaotic regime.The model is assumed to have a circular structure, so that x −1 = x d−1 , x 0 = x d , and x d+1 = x 1 .In order to simulate data from this model, we apply the Euler-Maruyama discretization scheme and obtain the difference equations, where u i,t are zero-mean, unit-variance Gaussian random variables.We assume that the system is only partially observed.In particular, half of the state variables are observed, in Gaussian noise, every t s = 10 time steps, namely, y j,n = x 2j−1,nts + u j,n , n = 1, 2, . . ., j = 1, 2, . . ., ⌊d/2⌋, where u j,n is a normal random variable with zero mean and unit variance.
In all the simulations for this system we run the NuPF with batch gradient nudging (with M = ⌊ √ N ⌋ nudged particles and step-size γ = 0.075).In the first computer experiment, we fixed the dimension d = 40 and run the BPF and the NuPF with increasing number of particles.The results can be seen in Fig. 3, which shows how the NuPF performs better than the BPF in terms of NMSE (plot (a)) while the running times of both algorithms are nearly identical (plot (b)).
In a second computer experiment, we compared the NuPF with the EnKF. Figure 4(a) shows how the NMSE of the two algorithms grows as the model dimension d increases and the number of particles N is kept fixed.In particular, the EnKF attains a better performance for smaller dimensions (up to d = 10 3 ), however its NMSE blows up for d > 10 3 while the performance of the NuPF remains stable.The running time of the EnKF was also higher than the running time of the NuPF in the range of higher dimensions (d ≥ 10 3 ).

Experimental results on model inference
In this section, we illustrate the application of the NuPF to estimate the of a financial time-series model.In particular, we adopt a stochastic-volatility SSM and we aim estimating its unknown parameters (and track its state variables) using the EURUSD log-return data from 2014-12-31 to 2016-12-31 (obtained from www.quandl.com).For this task, we apply two recently proposed Monte Carlo schemes: the nested particle filter (NPF) (Crisan and Miguez, 2016) (a purely recursive, particle-filter style Monte Carlo method) and the particle Metropolis-Hastings (pMH) algorithm (Andrieu et al., 2010) (a batch Markov chain Monte Carlo procedure).In their original forms, both algorithms use the marginal likelihood estimators given by the BPF to construct a Monte Carlo approximation of the posterior distribution of the unknown model parameters.Here, we compare the performance of these algorithms when the marginal likelihoods are computed using either the BPF or the proposed NuPF.We assume the stochastic volatility SSM (Tsay, 2005), where µ ∈ R, σ v ∈ R + , and φ ∈ [−1, 1] are fixed but unknown parameters.The states {x t } 1≤t≤T are log-volatilities and the observations {y t } 1≤t≤T are log-returns.We follow the same procedure as Dahlin and Schön (2015) to pre-process the observations.Given the historical price sequence s 0 , . . ., s T , the log-return at time t is calculated as for 1 ≤ t ≤ T .Then, given y 1:T , we tackle the joint Bayesian estimation of x 1:T and the unknown parameters θ = (µ, σ v , φ).In the next two subsections we compare the conventional BPF and the NuPF as building blocks of the NPF and the pMH algorithms.

Nudging the nested particle filter
The NPF in (Crisan and Miguez, 2016) consists of two layers of particle filters which are used to jointly approximate the posterior distributions of the parameters and the states.The filter in the first layer builds a particle approximation of the marginal posterior distribution of the parameters.Then, for each particle in the parameter space, say θ (i) , there is an inner filter that approximates the posterior distribution of the states conditional on the parameter vector θ (i) .The inner filters are classical particle filters, which are essentially used to compute the importance weights (marginal likelihoods) of the particles in the parameter space.In the implementation of Crisan and Miguez (2016), the inner filters are conventional BPFs.We have compared this (a), it can be seen that there is a significant increase in the acceptance rates when the number of particles are low, e.g., N = 100.From (b) and (c), it can be seen that the pMH-NuPF is still better for increasing number of particles but the pMH-BPF is catching up with the performance of the pMH-NuPF.
We have carried out a computer experiment to compare the performance of the pMH scheme using either BPFs or NuPFs to compute acceptance probabilities.The two algorithms are labeled pMH-BPF and pMH-NuPF, respectively, hereafter.The parameter priors in the experiment are p(µ) = N (0, 1) p(σ v ) = G(2, 0.1) p(φ) = B(120, 2) where G(a, θ) denotes the Gamma pdf with shape parameter a and scale parameter θ, and B(α, β) denotes the Beta pdf with shape parameters (α, β).Unlike the authors of (Dahlin and Schön, 2015), who use a truncated Gaussian prior centered on 0.95 with a small variance for φ, we use the Beta pdf, which is defined on [0, 1], with mean α/(α + β) = 0.9836, which puts a significant probability mass on the interval [0.9, 1].
We have compared the pMH-BPF algorithm and the pMH-NuPF scheme (using a batch nudging procedure with γ = 0.1 and M = ⌊ √ N ⌋) by running 1,000 independent Monte Carlo trials.We have computed the marginal likelihood estimates in the NuPF after the nudging step.
The resulting empirical acceptance rates can be seen from Fig. 6.It is observed that the use of nudging leads to higher acceptance rates, making the pMH more efficient to use.
Another key figure of merit for an MCMC algorithm is the degree of correlation in the chain.Fig. 7 displays the average autocorrelation functions (ACFs) of the chains obtained in the 1,000 independent simulations.We see that the autocorrelation of the chains produced by the pMH-NuPF method decay more quickly than the autocorrelation of the chains output by the conventional pMH-BPF.Less correlation can be expected to translate into better estimates as well for a fixed length of the chain.

Conclusions
We have proposed a simple modification of the particle filter which, according to our computer experiments, can improve the performance of the algorithm (e.g., when tracking high-dimensional systems) or enhance its robustness to model mismatches in the state equation of a SSM.The modification of the standard particle filtering scheme consists of an additional step, which we term nudging, in which a subset of particles are pushed towards regions of the state space with a higher likelihood.In this way, the state space can be explored more efficiently while keeping the computational effort at nearly the same level as in a standard particle filter.We refer to the new algorithm as the "nudged particle filter" (NuPF).While, for clarity and simplicity, we have kept the discussion and the numerical comparisons restricted to the modification (nudging) of the conventional BPF, the new step can be naturally incorporated to most known particle filtering methods.
We have presented a basic analysis of the NuPF which indicates that the algorithm converges (in L p ) with the same error rate as the standard particle filter.In addition, we have also shown simple analytical arguments that illustrates why the NuPF tends to outperform the BPF in computer simulations when there is some mismatch in the state equation of the SSM.To be specific, we have shown that, given a fixed sequence of observations, the NuPF amounts to a standard particle for a modified SSM with a higher model evidence (i.e., a higher likelihood) compared to the original SSM.
The analytical results have been corroborated by a number of computer experiments, both with synthetic and real data.In the latter case, we have tackled the fitting of a stochastic volatility SSM using Bayesian methods for model inference and a time-series dataset consisting of euro to US dollar exchange rates over a period of two years.We have shown how different figures of merit (model evidence, acceptance probabilities or autocorrelation functions) improve when using the NuPF, instead of a standard BPF, in order to implement a nested particle filter (Crisan and Miguez, 2017) and a particle Metropolis-Hastings (Andrieu et al., 2010) algorithm.
Since the nudging step is fairly general, it can be used with a wide range of differentiable or non-differentiable likelihoods.Besides, the new operation does not require any modification of the well-defined steps of the particle filter so it can be plugged into a variety of common particle filtering methods.Similar to the resampling step, which is routinely employed for numerical stability, the nudging step can be systematically used for improving the performance and robustness of particle filters.
‡For the experiments involving Lorenz 96 model, simulation from the model is implemented in C++ and integrated into Matlab.The rest of the simulations are fully implemented in Matlab.

Fig. 2 .
Fig. 2. Plots (a)-(d):A typical simulation run for the BPF, APF, EKF and NuPF algorithms using N = 500 particles.The black dots denote the real trajectory of the object, the red dots are sensors and the blue dots are position estimates as provided by the filters.Plot (e): Box-plot of the errors NMSE(1), . . ., NMSE(10, 000) obtained for the set of independent simulation runs.The NuPF achieves a low NMSE with a low variance whereas the EKF exhibits a large variance.

Fig. 3 .
Fig. 3. Comparison of the NuPF and the BPF for the stochastic Lorenz 96 system with model dimension d = 40.The results have been obtained from a set of 1, 024 independent Monte Carlo runs.Plot (a): average NMSE and standard deviation (dashed lines) as the number of particles N is increased.Plot (b): Running times of the BPF and the NuPF for the same set of simulations.The increase in computational cost of the NuPF, compared to the BPF, is negligible.

Fig. 4 .
Fig. 4. Comparison of the NuPF with the EnKF for the stochastic Lorenz 96 model with increasing dimension d and fixed number of particles N = 500 (this is the sames as the number of ensemble members in the EnKF).We have run 1,000 independent Monte Carlo trials for this experiment.Plot (a): NMSE versus dimension d.The EnKF attains a smaller error for lower dimensions but then it explodes for d > 10 3 , while the NuPF remains stables.Plot (b): Running times for the same set of simulations.

Fig. 6 .
Fig.6.Empirical acceptance rates computed for the pMH running BPF and the pMH running NuPF.From (a), it can be seen that there is a significant increase in the acceptance rates when the number of particles are low, e.g., N = 100.From (b) and (c), it can be seen that the pMH-NuPF is still better for increasing number of particles but the pMH-BPF is catching up with the performance of the pMH-NuPF.