This chapter provides an introduction to methods that, in theory, sample the posterior pdf exactly. Commonly used ensemble data-assimilation methods, such as the EnKF and EnRML, only sample the posterior pdf correctly in the Gauss-linear case and typically fail in cases with strong nonlinearity. Particle methods are also ensemble methods, but they attempt to sample the full posterior pdf, including in problems with multimodal distributions. Their major drawback is that they suffer from convergence issues as the state dimension increases, through a phenomenon known as ensemble degeneracy. Particle methods work very well for low-dimensional problems, but they require an intelligent implementation for high-dimensional models and affordable ensemble sizes. In the following, we focus on particle filters and particle flows, which are currently the most promising approaches for highly nonlinear problems in high dimensions.

1 Particle Approximation

The methods discussed in the previous chapters concentrate on finding or approximating certain features of the posterior pdf, such as the mode (typically the variational techniques) or the mean (non-iterative ensemble methods). However, if the posterior pdf is not unimodal and symmetric, different sampling techniques are needed. Sometimes the pdf can still be described by a small number of parameters, and methods to estimate these parameters can be employed, but this is typically not the case. Often the posterior pdf can have any shape, e.g., it can be multimodal or heavily skewed. In this case, we can approximate the posterior pdf using many samples from it.

Let’s introduce a new approximation that will significantly ease the computational aspects of the methods discussed in the previous chapter.

Approximation 9 (Particle representation of the pdfs).

It is possible to approximate a probability density function by a finite ensemble of N model states (or particles) as

$$\begin{aligned} f({\mathbf {z}}) \approx \sum _{j=1}^N \frac{1}{N} \delta ({\mathbf {z}}- {\mathbf {z}}_j), \end{aligned}$$
(9.1)

where \(\delta (\cdot )\) denotes the Dirac-delta function.

We can then use these samples to calculate the mean, covariance, higher-order moments, quantiles, etc.
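As a minimal illustration of Approximation 9, the following Python sketch (the variable names and the bimodal example are ours, purely illustrative) computes the mean, covariance, and quantiles from an equally weighted particle ensemble:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: N particles from a bimodal pdf in a two-dimensional state space.
N = 1000
component = rng.integers(0, 2, size=N)
particles = np.where(component[:, None] == 0,
                     rng.normal(-2.0, 0.5, size=(N, 2)),
                     rng.normal(+2.0, 0.5, size=(N, 2)))

# Equal weights 1/N correspond to the particle representation in Eq. (9.1).
mean = particles.mean(axis=0)
cov = np.cov(particles, rowvar=False)
quantiles = np.quantile(particles, [0.05, 0.5, 0.95], axis=0)
print(mean, cov, quantiles, sep="\n")
```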

Several data-assimilation methods exist that sample the posterior pdf, and we divide them roughly into sequential Metropolis-like algorithms and parallel particle-filter-like methods. “Sequential” here means that we generate the samples one after the other, and we use the value of the current sample to generate the next one. Examples of this kind of algorithm are Metropolis–Hastings, Gibbs samplers, Langevin samplers, and Hybrid Monte Carlo. These are known as Markov chain Monte Carlo (MCMC) methods, where the chain part refers to the sequential nature of the sample generation.

The parallel particle-filter-like algorithms generate samples independently from each other or with only small interactions. Examples are standard (bootstrap) particle filters, particle filters with proposal densities, and particle flows. Many of these parallel methods use resampling to ensure sufficient samples to represent the posterior pdf well. Sometimes these particle schemes are also called MCMC methods, but for a different reason. They are connected in physical time via a Markov Chain, for instance, when we propagate the samples in time via a stochastic partial differential equation.

Many schemes use combinations of methods, such as particle-within-Gibbs schemes and particle MCMC. However, strategies that generate samples sequentially cannot be made parallel by construction (although one can run several Markov chains to create different strings of samples). Moreover, we need many samples to converge to a realistic description of the posterior pdf. Thus, it is not common to use these methods in high-dimensional geophysical systems. For that reason, we restrict the discussion to particle filters and flow filters in the following.

As a final note on nonlinear filtering, we touch upon exciting new developments. One is the so-called Schroedinger perspective on data assimilation, in which one tries to draw equal-weight particles from the posterior at time n, based directly on draws from the equal-weight posterior particles at time \(n-1\), see Reich (2019). These methods try to solve the prediction and assimilation problem in one go. No practical algorithms exist for high-dimensional systems, but this is an active research area in the applied mathematics community.

Substantial progress is also reported using coupling methods that try to find an optimal transportation map between the prior and posterior pdfs by first finding the map from the prior to a standard Gaussian pdf and then the map from that Gaussian pdf to the posterior pdf. For example, ElMoselhy and Marzouk (2012) and Spantini et al. (2019) explore a specific form of the map called the Knothe–Rosenblatt (KR) rearrangement, resulting in a triangular map from one pdf to the other. One exciting feature is that many deep learning algorithms with a ReLU activation function have the same structure and are ideal for learning the map. Another exciting feature is that for a linear data-assimilation problem, the map is precisely the EnKF update.

Another recent development is ensemble-Riemannian data assimilation over the Wasserstein space. The prior and the likelihood are considered marginal pdfs of a coupling pdf, and one calculates the distance between pdfs using the Wasserstein metric. The posterior pdf is then the Wasserstein barycenter of the prior and the likelihood, analogous to the analysis state being the Euclidean barycenter of the prior and likelihood terms in a cost function in linear data assimilation (Tamang et al., 2021). Extensions to high-dimensional problems are described by Tamang et al. (2022), who explore entropic regularization to speed up convergence in the optimization for the optimal transport map and apply the technique to a two-layer quasi-geostrophic model.

2 Particle Filters

One of the exciting aspects of data assimilation is that we know the solution, and we can often write it down in analytical form, but we do not know how to describe it in practical terms for use in a computer. The expression for the posterior pdf is

$$\begin{aligned} f({\mathbf {z}}|{\mathbf {d}}) = \frac{f({\mathbf {d}}|{\mathbf {z}})}{f({\mathbf {d}})}f({\mathbf {z}}), \end{aligned}$$
(9.2)

which is just Bayes’ theorem from Eq. (2.10).

There are several ways to generate samples from the posterior pdf. For instance, similar to ensemble Kalman filters, we can have a set of particles from an ensemble integration of the numerical model. In the ensemble Kalman filter, we assume that these particles describe a Gaussian prior. The particle filter does not apply the Gaussian assumption from Approx. 4, and the prior pdf can have any shape. Thus, the only representation we have of the prior pdf is the set of particles, and in the following, we will describe several methods that explore this feature.

2.1 The Standard Particle Filter

We start out by representing the prior pdf with Eq. (9.1), i.e., by the samples \({\mathbf {z}}_j\). Using this representation in the expression for the posterior pdf, we find

$$\begin{aligned} f({\mathbf {z}}|{\mathbf {d}}) = \sum _{j=1}^N w_j \delta ({\mathbf {z}}- {\mathbf {z}}_j ), \end{aligned}$$
(9.3)

with the so-called likelihood weights \(w_j\) given by

$$\begin{aligned} w_j = \frac{1}{N}\frac{f({\mathbf {d}}|{\mathbf {z}}_j)}{f({\mathbf {d}})} = \frac{f({\mathbf {d}}|{\mathbf {z}}_j)}{\sum _{i=1}^N f({\mathbf {d}}|{\mathbf {z}}_i)}. \end{aligned}$$
(9.4)

Here we have used a standard self-normalization to ensure the weights add up to one. It can be derived from \(f({\mathbf {d}}) = \int f({\mathbf {d}}|{\mathbf {z}})f({\mathbf {z}}) \; d{\mathbf {z}}\approx \frac{1}{N}\sum _{i=1}^N f({\mathbf {d}}|{\mathbf {z}}_i)\), using the ensemble representation of the prior as above; the factor \(1/N\) cancels upon normalization. The full scheme is presented in Algorithm 9.

[Algorithm 9: The standard particle filter]
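As a minimal sketch of the weight computation in Algorithm 9 (our own variable names; a Gaussian likelihood with a linear observation operator \({\mathbf {H}}\) is assumed):

```python
import numpy as np
from scipy.stats import multivariate_normal

def likelihood_weights(particles, d, H, C_dd):
    """Self-normalized likelihood weights of Eq. (9.4)."""
    # log-likelihood of the observations d for each particle z_j
    logw = np.array([multivariate_normal.logpdf(d, mean=H @ z, cov=C_dd)
                     for z in particles])
    w = np.exp(logw - logw.max())   # subtract the maximum for numerical stability
    return w / w.sum()              # normalize so the weights add up to one
```

Working with log-weights avoids numerical underflow when the likelihood is very peaked, which is precisely the regime in which degeneracy sets in.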

Within this scheme, we propagate the weighted particles from one assimilation step to the next. At each assimilation step, we assign new weights to the particles. After a few assimilation steps, a particle’s weight is proportional to the product of all its previous weights. In practice, this means that the weights of the particles diverge more and more. Typically within a few assimilation steps the relative weight of one particular particle is very close to one, while all the other particles have weights very close to zero. We call the resulting ensemble degenerate as it effectively contains just one particle. The weighted ensemble mean equals the particle with near-one weight, and the ensemble variance is close to zero.

One way to avoid this degeneracy problem is to use resampling. If the weights diverge too much, we abandon the low-weight particles and duplicate the high-weight particles such that the total number of particles does not change. A standard measure for particle divergence is the effective ensemble size, defined as

$$\begin{aligned} N_\text {eff} =\frac{1}{\sum _{i=1}^N w_i^2} , \end{aligned}$$
(9.5)

where \(w_i\) are the normalized weights, such that they add up to one. Typically, resampling is introduced when \(N_\text {eff} \le 0.8N\).

Resampling prevents the weights from accumulating on a few particles, and we obtain so-called particle filters with sequential importance resampling (SIR). There are several resampling schemes from which we can choose. The one that leads to the minimum additional random noise is the so-called Stochastic Universal Resampling, which proceeds as outlined in Algorithm 10; see Kitagawa (1996).

When the number of independent observations is large, and large here typically means more than 10, we need 100,000 samples or more even to represent the mean accurately. The reason is that the likelihood will be very peaked if it consists of products of individual likelihoods for each independent observation, which means that the weights will vary enormously over the particles. We often have very many observations in the geosciences, and the SIR method will degenerate even after one assimilation step. In this case, the resampling will result in an ensemble of particles identical to the best member, and the ensemble remains degenerate.

[Algorithm 10: Stochastic universal resampling]
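The following sketch (again with our own naming) computes the effective ensemble size of Eq. (9.5) and performs a stochastic universal resampling step in the spirit of Algorithm 10:

```python
import numpy as np

def effective_ensemble_size(w):
    """N_eff of Eq. (9.5) for normalized weights w."""
    return 1.0 / np.sum(w**2)

def stochastic_universal_resampling(w, rng=None):
    """Return resampling indices: one random offset places N equally
    spaced pointers on the cumulative weight distribution."""
    rng = np.random.default_rng() if rng is None else rng
    N = len(w)
    positions = (rng.uniform() + np.arange(N)) / N
    return np.searchsorted(np.cumsum(w), positions)

# Possible usage: resample when N_eff drops below, e.g., 0.8 N,
# and reset all weights to 1/N afterwards:
# if effective_ensemble_size(w) <= 0.8 * len(w):
#     particles = particles[stochastic_universal_resampling(w)]
#     w = np.full(len(w), 1.0 / len(w))
```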

2.2 Proposal Densities

Interestingly, we do not have to draw samples from the prior as we can also use samples from another pdf, the so-called proposal pdf \(q({\mathbf {z}})\). We can rewrite Bayes’ theorem from Eq. (9.2) as

$$\begin{aligned} f({\mathbf {z}}|{\mathbf {d}}) = \frac{f({\mathbf {d}}|{\mathbf {z}})}{f({\mathbf {d}})}f({\mathbf {z}}) = \frac{f({\mathbf {d}}|{\mathbf {z}})}{f({\mathbf {d}})}\frac{f({\mathbf {z}})}{q({\mathbf {z}})}q({\mathbf {z}}), \end{aligned}$$
(9.6)

which holds for any \(q({\mathbf {z}})\) that is nonzero where the prior is nonzero. Assume we have samples from \(q({\mathbf {z}})\), so we can write

$$\begin{aligned} q({\mathbf {z}}) = \sum _{i=1}^N \frac{1}{N} \delta ({\mathbf {z}}- {\mathbf {z}}_i), \end{aligned}$$
(9.7)

then, using this expression for \(q({\mathbf {z}})\) in Eq. (9.6), we find again

$$\begin{aligned} f({\mathbf {z}}|{\mathbf {d}}) = \sum _{i=1}^N w_i \delta ({\mathbf {z}}- {\mathbf {z}}_i), \end{aligned}$$
(9.8)

but now with weights

$$\begin{aligned} w_i = \frac{f({\mathbf {d}}|{\mathbf {z}}_i)}{N\sum _{j=1}^N f({\mathbf {d}}|{\mathbf {z}}_j)} \frac{f({\mathbf {z}}_i)}{q({\mathbf {z}}_i)} . \end{aligned}$$
(9.9)

These weights are the product of the likelihood weights and the so-called proposal weights. It looks like we did not gain much, but remember that the pdf \(q({\mathbf {z}})\) can be whatever we choose. For instance, we can make it dependent on the observations, \(q({\mathbf {z}}|{\mathbf {d}})\), so that we can use particles that already know where the observations are. As an example, we can use posterior samples from an EnKF solution as samples from the proposal density \(q({\mathbf {z}})\). In that case, all particles will be closer to the observations than samples from the prior, and hence the likelihood weights will be much closer together. Of course, we need to include the so-called proposal weights \(f({\mathbf {z}}_j) / q({\mathbf {z}}_j)\) in our final expression, but these often have a much smoother distribution than the likelihood weights.

To use this formalism we need to evaluate \(f({\mathbf {z}}_j)\). In many cases this pdf is assumed to be known, e.g., for pure parameter estimation or for state estimation over just a single time window. In general, we know this pdf for stationary estimation problems.

However, in sequential estimation the posterior pdf at one time step becomes the prior at the next time step, and we do not know the shape of this prior pdf. All we have is a set of particles that represents the prior pdf, and it is impossible to evaluate the value of \(f({\mathbf {z}}_j)\) directly. Fortunately, we can use the model equations because we know the statistics of the model errors. Thus, we assume we know \(f({\mathbf {z}}_k|{\mathbf {z}}_{k-1})\), where k is the time index and \({\mathbf {z}}\) can be a model state, or parameters when these are estimated sequentially.

For example, let’s assume Gaussian additive model errors with mean zero and covariance \({\mathbf {C}}_\textit{qq}\). In that case we can introduce the transition density

$$\begin{aligned} f({\mathbf {z}}_k|{\mathbf {z}}_{k-1}) \propto \exp \Bigl (-\frac{1}{2}\bigl ({\mathbf {z}}_k-{\mathbf {m}}({\mathbf {z}}_{k-1})\bigr )^\mathrm {T}{\mathbf {C}}_\textit{qq}^{-1}\bigl ({\mathbf {z}}_k-{\mathbf {m}}({\mathbf {z}}_{k-1})\bigr )\Bigr ) . \end{aligned}$$
(9.10)

The reason for introducing the transition density from one state to the next is that we can rewrite the prior pdf as

$$\begin{aligned} f({\mathbf {z}}_k) = \int f({\mathbf {z}}_k,{\mathbf {z}}_{k-1})\;d{\mathbf {z}}_{k-1} = \int f({\mathbf {z}}_k|{\mathbf {z}}_{k-1}) f({\mathbf {z}}_{k-1}) \; d{\mathbf {z}}_{k-1} . \end{aligned}$$
(9.11)

If we now invoke a particle representation at time \(t_{k-1}\), we find the following expression for the prior pdf

$$\begin{aligned} f({\mathbf {z}}_k) =\int f({\mathbf {z}}_k|{\mathbf {z}}_{k-1}) \frac{1}{N} \sum _{j=1}^N \delta ({\mathbf {z}}_{k-1}-{\mathbf {z}}_{j,k-1}) \; d{\mathbf {z}}_{k-1} = \sum _{j=1}^N \frac{1}{N} f({\mathbf {z}}_k|{\mathbf {z}}_{j,k-1}). \end{aligned}$$
(9.12)

The exciting part of this development is that we have changed from representing the prior by a set of delta functions to using a continuous prior defined by a sum of transition densities. The transition densities are often Gaussian pdfs representing Gaussian model errors without any further approximation. Hence, for Gaussian model errors a Gaussian mixture is a natural expression of the prior, where the model error covariance defines the width of each Gaussian mixture component.
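As a small sketch (with a hypothetical model operator and our own variable names), the Gaussian-mixture representation of Eq. (9.12) can be evaluated pointwise as follows:

```python
import numpy as np
from scipy.stats import multivariate_normal

def prior_density(z_k, particles_prev, model, C_qq):
    """Evaluate Eq. (9.12): f(z_k) as the average of the Gaussian transition
    densities N(m(z_{j,k-1}), C_qq) over the particles at time k-1."""
    vals = [multivariate_normal.pdf(z_k, mean=model(z_prev), cov=C_qq)
            for z_prev in particles_prev]
    return np.mean(vals)

# Hypothetical usage with a simple linear stand-in for the model m(.):
model = lambda z: 0.9 * z
C_qq = 0.1 * np.eye(2)                                   # model error covariance
particles_prev = np.random.default_rng(1).normal(size=(50, 2))
print(prior_density(np.zeros(2), particles_prev, model, C_qq))
```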

Remember that we want to introduce a proposal density in this formalism to obtain weights with better behavior. With this in mind, we use the particle representation from Eq. (9.12) in Eq. (9.2) to obtain

$$\begin{aligned} \begin{aligned} f\bigl ({\mathbf {z}}_k|{\mathbf {d}}_k\bigr )&= \frac{f\bigl ({\mathbf {d}}_k|{\mathbf {z}}_k\bigr )}{f\bigl ({\mathbf {d}}_k\bigr )}f\bigl ({\mathbf {z}}_k\bigr ) \\&= \frac{f\bigl ({\mathbf {d}}_k|{\mathbf {z}}_k\bigr )}{f\bigl ({\mathbf {d}}_k\bigr )} \sum _{j=1}^N \frac{1}{N} f\bigl ({\mathbf {z}}_k|{\mathbf {z}}_{j,k-1}\bigr )\\&= \frac{f\bigl ({\mathbf {d}}_k|{\mathbf {z}}_k\bigr )}{f\bigl ({\mathbf {d}}_k\bigr )} \sum _{j=1}^N \frac{1}{N} \frac{f\bigl ({\mathbf {z}}_k|{\mathbf {z}}_{j,k-1}\bigr )}{q\bigl ({\mathbf {z}}_k|{\mathbf {z}}_{k-1},{\mathbf {d}}\bigr )} q\bigl ({\mathbf {z}}_k|{\mathbf {z}}_{k-1},{\mathbf {d}}\bigr ). \end{aligned} \end{aligned}$$
(9.13)

Note that we have introduced a transition proposal density \(q({\mathbf {z}}_k|{\mathbf {z}}_{k-1},{\mathbf {d}})\) that depends not only on the state evolution equations, but is also allowed to depend on the new observations. As before, we multiply and divide the expression in the second line of the equation by q. This division is possible when the support of q is equal to or larger than that of the transition density \(f\bigl ({\mathbf {z}}_k|{\mathbf {z}}_{j,k-1}\bigr )\). The important element is that we now draw samples from q instead of directly from the model error pdf f. This leads again to

$$\begin{aligned} f\bigl ({\mathbf {z}}_k|{\mathbf {d}}_k\bigr ) = \sum _{i=1}^N w_i \delta \bigl ({\mathbf {z}}_k - {\mathbf {z}}_{i,k}\bigr ), \end{aligned}$$
(9.14)

but now with weights

$$\begin{aligned} w_i = \frac{f\bigl ({\mathbf {d}}_k|{\mathbf {z}}_{i,k}\bigr )}{\sum _{j=1}^N f\bigl ({\mathbf {d}}_k|{\mathbf {z}}_{j,k}\bigr )} \frac{f\bigl ({\mathbf {z}}_{i,k}|{\mathbf {z}}_{i,k-1}\bigr )}{q\bigl ({\mathbf {z}}_{i,k}|{\mathbf {z}}_{i,k-1},{\mathbf {d}}_k\bigr )} , \end{aligned}$$
(9.15)

where the first part comes from Eq. (9.4). The values of the weights in Eq. (9.15) depend on our choice for q. Since q is a transition density, it is related to state evolution equations. In fact, we can choose any model equation we like to ensure that the weights are less degenerate. The freedom is enormous. Since we can take q to be dependent on the new observations, we can include other data-assimilation methods into a particle filter in a very natural way.

Let’s have a look at how to include a stochastic ensemble Kalman filter in a particle filter, as introduced by Papadakis et al. (2010), with a correction to that scheme discussed in Van Leeuwen (2009). In terms of proposal densities, we can split the stochastic ensemble Kalman filter into two steps: a model evolution step from time \(t_{k-1}\) to the observation time \(t_k\), and an update step at time \(t_k\). Assuming a linear observation operator and suppressing the superscript f for the forecast, we can write for each ensemble member

$$\begin{aligned} {\mathbf {z}}_{j,k} = {\mathbf {m}}({\mathbf {z}}_{j,k-1}) + {\mathbf {q}}_j + \overline{{\mathbf {K}}}\Bigl ({\mathbf {d}}_k + \boldsymbol{\epsilon }_j - {\mathbf {H}}\bigl ({\mathbf {m}}({\mathbf {z}}_{j,k-1}) + {\mathbf {q}}_j\bigr )\Bigr ), \end{aligned}$$
(9.16)

where \({\mathbf {q}}_j\) is a sample of the Gaussian model error with covariance \({\mathbf {C}}_\textit{qq}\), \(\boldsymbol{\epsilon }_j\) is a sample of the Gaussian observation error with covariance \({\mathbf {C}}_\textit{dd}\), and \(\overline{{\mathbf {K}}}\) is the Kalman gain calculated from the prior ensemble at time k.

We see that the update consists of a deterministic part and a stochastic part. Assuming Gaussian model errors and Gaussian observation errors, the stochastic part is Gaussian with mean zero and covariance

$$\begin{aligned} \overline{\boldsymbol{\Sigma }} = \bigl ({\mathbf {I}}-\overline{{\mathbf {K}}}{\mathbf {H}}\bigr ){\mathbf {C}}_\textit{qq}\bigl ({\mathbf {I}}-\overline{{\mathbf {K}}}{\mathbf {H}}\bigr )^\mathrm {T}+ \overline{{\mathbf {K}}}{\mathbf {C}}_\textit{dd}\overline{{\mathbf {K}}}^\mathrm {T}. \end{aligned}$$
(9.17)

Hence, the proposal-transition density for each ensemble member of a stochastic ensemble Kalman filter becomes

$$\begin{aligned} q\bigl ({\mathbf {z}}_k | {\mathbf {z}}_{j,k-1},{\mathbf {d}}\bigr ) = \mathcal {N}\bigl (\tilde{{\mathbf {z}}}_j,\overline{\boldsymbol{\Sigma }}\bigr ), \end{aligned}$$
(9.18)

where \(\tilde{{\mathbf {z}}}_j\) results from the deterministic part

$$\begin{aligned} \tilde{{\mathbf {z}}}_j = \bigl ({\mathbf {I}}-\overline{{\mathbf {K}}}{\mathbf {H}}\bigr ){\mathbf {m}}({\mathbf {z}}_{j,k-1}) + \overline{{\mathbf {K}}}{\mathbf {d}}_k. \end{aligned}$$
(9.19)

Thus, we use EnKF to calculate the “proposed” particles and, after that, obtain their weights from Eq. (9.15). To compute the weights, we evaluate the probability of each EnKF-updated particle from the normal distribution \(q({\mathbf {z}}_k | {\mathbf {z}}_{j,k-1},{\mathbf {d}})\) in Eq. (9.18). Then we calculate \(f({\mathbf {z}}_k|{\mathbf {z}}_{j,k-1})\) for each particle using the Gaussian in Eq. (9.10). The ratio of these two probabilities gives the proposal part of the weights. We multiply this ratio with the likelihood part of the weights from Eq. (9.4) and, after normalization, we have our final weights.
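A minimal sketch of this weight computation (our own function and variable names; a linear observation operator \({\mathbf {H}}\), Gaussian errors, and a given EnKF gain are assumed):

```python
import numpy as np
from scipy.stats import multivariate_normal

def enkf_proposal_weights(z_prev, z_enkf, d, model, H, K, C_qq, C_dd):
    """Weights of Eq. (9.15) for particles z_enkf proposed by a stochastic EnKF.

    z_prev : (N, n) particles at time k-1
    z_enkf : (N, n) particles after the stochastic EnKF update at time k
    """
    I = np.eye(C_qq.shape[0])
    # Covariance of the stochastic part of the EnKF update, Eq. (9.17)
    Sigma = (I - K @ H) @ C_qq @ (I - K @ H).T + K @ C_dd @ K.T

    logw = np.zeros(len(z_enkf))
    for j, (zp, za) in enumerate(zip(z_prev, z_enkf)):
        z_det = (I - K @ H) @ model(zp) + K @ d                            # Eq. (9.19)
        log_q = multivariate_normal.logpdf(za, mean=z_det, cov=Sigma)      # Eq. (9.18)
        log_f = multivariate_normal.logpdf(za, mean=model(zp), cov=C_qq)   # Eq. (9.10)
        log_like = multivariate_normal.logpdf(d, mean=H @ za, cov=C_dd)    # likelihood
        logw[j] = log_like + log_f - log_q                                 # Eq. (9.15)
    w = np.exp(logw - logw.max())
    return w / w.sum()                                                     # final, normalized weights
```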

Many other choices are possible too. For instance, one could use a 3DVar on each particle as a proposal. Or we could use the EnKF with reduced observation errors to draw the particles closer to the observations. More extensively, we can also use a 4DVar or an ensemble smoother on each particle (Van Leeuwen et al., 2015). Other suggested proposals include synchronization methods (Pinheiro et al., 2019a, b) and simple nudging schemes (Van Leeuwen, 2010). The point is that one can use every trick in the book and beyond without making any other approximation than Approx. 9. However, the proposed samples should make physical sense and represent the posterior distribution as closely as possible at each update step.

2.3 The Optimal Proposal Density

One can ask if there is an optimal proposal density, and depending on the definition of optimal, there is. One way to define optimality is to minimize the variance of the weights, leading to the so-called optimal proposal density. This density is given by \(q({\mathbf {z}}_k | {\mathbf {z}}_{j,k-1},{\mathbf {d}}) = f({\mathbf {z}}_k | {\mathbf {z}}_{j,k-1},{\mathbf {d}})\). By using the definition of conditional densities as \(f(a|b,c) = f(a,b|c)/f(b|c)\) and \(f(a,b|c) = f(a|b,c)f(b|c)\), we can write

$$\begin{aligned} f\bigl ({\mathbf {z}}_k | {\mathbf {z}}_{j,k-1},{\mathbf {d}}_k \bigr )&= \frac{f\bigl ({\mathbf {z}}_k , {\mathbf {d}}_k | {\mathbf {z}}_{j,k-1}\bigr )}{f\bigl ({\mathbf {d}}_k | {\mathbf {z}}_{j,k-1}\bigr )} \nonumber \\&= \frac{f\bigl ({\mathbf {d}}_k | {\mathbf {z}}_k, {\mathbf {z}}_{j,k-1} \bigr )f\bigl ({\mathbf {z}}_k | {\mathbf {z}}_{j,k-1}\bigr )}{f\bigl ({\mathbf {d}}_k | {\mathbf {z}}_{j,k-1}\bigr )} \nonumber \\&= \frac{f\bigl ({\mathbf {d}}_k | {\mathbf {z}}_k\bigr )f\bigl ({\mathbf {z}}_k | {\mathbf {z}}_{j,k-1}\bigr )}{f\bigl ({\mathbf {d}}_k | {\mathbf {z}}_{j,k-1}\bigr )}, \end{aligned}$$
(9.20)

where in the last equality we used that \(f({\mathbf {d}}_k | {\mathbf {z}}_k,{\mathbf {z}}_{j,k-1}) =f({\mathbf {d}}_k | {\mathbf {z}}_k)\) because \({\mathbf {d}}_k\) does not explicitly depend on \({\mathbf {z}}_{j,k-1}\) when \({\mathbf {z}}_{k}\) is given. The denominator does not depend on the active variable \({\mathbf {z}}_k\), and hence is a normalization constant that we do not have to worry about.

We can evaluate the weights of the optimal proposal density without any approximation. From Eq. (9.15) the weights become, using Eq. (9.20),

$$\begin{aligned} w_j&= \frac{f({\mathbf {d}}_k|{\mathbf {z}}_{j,k})}{f({\mathbf {d}}_k)} \frac{f({\mathbf {z}}_{j,k}|{\mathbf {z}}_{j,k-1})}{f({\mathbf {z}}_{j,k}|{\mathbf {z}}_{j,k-1},{\mathbf {d}}_k)} \nonumber \\&= \frac{f({\mathbf {d}}_k|{\mathbf {z}}_{j,k})}{f({\mathbf {d}}_k)} \frac{f({\mathbf {z}}_{j,k}|{\mathbf {z}}_{j,k-1})}{f({\mathbf {d}}_k|{\mathbf {z}}_{j,k})} \frac{f({\mathbf {d}}_k|{\mathbf {z}}_{j,k-1})}{f({\mathbf {z}}_{j,k}|{\mathbf {z}}_{j,k-1})} \nonumber \\&= \frac{f({\mathbf {d}}_k|{\mathbf {z}}_{j,k-1})}{f({\mathbf {d}}_k)} . \end{aligned}$$
(9.21)

The variance in these optimal proposal weights will be much lower than that of the standard particle filter weights because of the model error pdf. To see this, we can write

$$\begin{aligned} w_j = \frac{f({\mathbf {d}}_k|{\mathbf {z}}_{j,k-1})}{f({\mathbf {d}}_k)} = \int \frac{f({\mathbf {d}}_k, {\mathbf {z}}_k|{\mathbf {z}}_{j,k-1})}{f({\mathbf {d}}_k)} \; d{\mathbf {z}}_k = \int \frac{f({\mathbf {d}}_k| {\mathbf {z}}_k)}{f({\mathbf {d}}_k)} f({\mathbf {z}}_k|{\mathbf {z}}_{j,k-1}) \; d{\mathbf {z}}_k. \end{aligned}$$
(9.22)

Thus, we can write the weights as a convolution of the standard particle filter weights with the model error pdf. Such a convolution always results in a broader pdf as the standard particle filter weights are “smeared out.”

When the model and the observation operators are nonlinear, it is not straightforward to generate these optimal proposal particles, i.e., the draws from \(f({\mathbf {z}}_k|{\mathbf {z}}_{j,k-1},{\mathbf {d}}_k)\). Chorin and Tu (2009), Chorin et al. (2010) and Morzfeld et al. (2012) have developed an efficient scheme named the implicit particle filter that partly resolves this problem. For Gaussian observation and model errors, this solution equals a 4DVar estimate for each particle, perturbed by a random error. In each 4DVar the state covariance at the initial time is zero (and hence, the prior pdf is a delta function centered at that particle). For linear observation operators, we can evaluate this explicitly. The numerator in Eq. (9.20) is a product of two Gaussians, and we know from standard Kalman filters that we can write \(f({\mathbf {z}}_k | {\mathbf {z}}_{j,k-1},{\mathbf {d}}_k)\) as another Gaussian with mean

$$\begin{aligned} \widetilde{{\mathbf {z}}}_{j,k} = {\mathbf {m}}({\mathbf {z}}_{j,k-1}) + \widetilde{{\mathbf {K}}}\bigl ({\mathbf {d}}_k - {\mathbf {H}}{\mathbf {m}}({\mathbf {z}}_{j,k-1})\bigr ), \end{aligned}$$
(9.23)

in which \(\widetilde{{\mathbf {K}}} = {\mathbf {C}}_\textit{qq}{\mathbf {H}}^\mathrm {T}\bigl ({\mathbf {H}}{\mathbf {C}}_\textit{qq}{\mathbf {H}}^\mathrm {T}+ {\mathbf {C}}_\textit{dd}\bigr )^{-1}\), which is a Kalman gain with model covariance \({\mathbf {C}}_\textit{qq}\), and with covariance

$$\begin{aligned} \widetilde{\boldsymbol{\Sigma }} = \left( {\mathbf {I}}-\widetilde{{\mathbf {K}}}{\mathbf {H}}\right) {\mathbf {C}}_\textit{qq}\left( {\mathbf {I}}-\widetilde{{\mathbf {K}}}{\mathbf {H}}\right) ^\mathrm {T}+ \widetilde{{\mathbf {K}}}{\mathbf {C}}_\textit{dd}\widetilde{{\mathbf {K}}}^\mathrm {T}. \end{aligned}$$
(9.24)

The mean and mode of these transition densities are equal to those of a 4DVar on each particle with prior covariance equal to zero and model error covariance equal to \({\mathbf {C}}_\textit{qq}\). We do not need the mode, but instead we need to draw from the Gaussian distribution and consider each particle as a weak-constraint 4DVar solution perturbed by a draw from \(\mathcal {N}({\mathbf {0}},\widetilde{\boldsymbol{\Sigma }})\). Note the resemblance with using the stochastic EnKF as proposal, as expected. For this particular case, we can generate an analytical expression for the weights as

$$\begin{aligned} w_j \propto \exp \Bigl (-\frac{1}{2}\bigl ({\mathbf {d}}_k-{\mathbf {H}}{\mathbf {m}}({\mathbf {z}}_{j,k-1})\bigr )^\mathrm {T}\widetilde{{\mathbf {C}}}_{dd}^{-1}\bigl ({\mathbf {d}}_k-{\mathbf {H}}{\mathbf {m}}({\mathbf {z}}_{j,k-1})\bigr )\Bigr ), \end{aligned}$$
(9.25)

where

$$\begin{aligned} \widetilde{{\mathbf {C}}}_{dd} = {\mathbf {H}}{\mathbf {C}}_\textit{qq}{\mathbf {H}}^\mathrm {T}+ {\mathbf {C}}_\textit{dd}. \end{aligned}$$
(9.26)

We see that the weight in Eq. (9.25) is the likelihood of the observations given particle j at the previous time. These weights will be better behaved than the weights of the standard particle filter because we inflate the observation error covariance by a term \({\mathbf {H}}{\mathbf {C}}_\textit{qq}{\mathbf {H}}^\mathrm {T}\). When the model error is significant, this additional inflation can make the weights much more similar.
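A minimal sketch of this calculation (our own naming; a linear observation operator and Gaussian errors are assumed):

```python
import numpy as np
from scipy.stats import multivariate_normal

def optimal_proposal_weights(z_prev, d, model, H, C_qq, C_dd):
    """Weights of Eq. (9.25): the likelihood of d given each particle at time
    k-1, with the observation error covariance inflated by H C_qq H^T."""
    C_dd_tilde = H @ C_qq @ H.T + C_dd                 # Eq. (9.26)
    logw = np.array([multivariate_normal.logpdf(d, mean=H @ model(zp), cov=C_dd_tilde)
                     for zp in z_prev])
    w = np.exp(logw - logw.max())                      # stable normalization
    return w / w.sum()
```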

2.4 Other Particle Filter Schemes

It is easy to show that, for high-dimensional systems as encountered in the geosciences, the weights are still degenerate when using an EnKF as the proposal density. This is even the case for the optimal proposal density with a realistic number of ensemble members, e.g., 100, to compute the proposal ensemble. The community has not yet systematically explored other ways of calculating the proposal ensemble, so the search for methods to avoid degeneracy remains an area of active research.

The literature proposes mainly two solutions for avoiding weight degeneracy. The first approach uses localization to reduce the number of observations in each local update of the weights. We will discuss localization in more detail in Chap. 10. The second approach tries to ensure that all or most particles have equal weight. The reason that we can do better than the optimal proposal density, which minimizes the variance of the weights, is that we sacrifice a few particles to ensure that the remaining particles have very similar weights. Hence, the variance in the weights can be large, but by resampling the bad particles we can avoid degeneracy of the overall filter. We will not discuss these methods here but rather refer to, e.g., Ades and Van Leeuwen (2013, 2015a, 2015b), Skauvold et al. (2019), Van Leeuwen (2010, 2011), Van Leeuwen and Ades (2013), Zhu et al. (2016), and the review in Van Leeuwen et al. (2019).

A recently proposed third solution is to use methods that avoid particle weights altogether, as discussed in the next section.

3 Particle-Flow Filters

In particle flows, one typically starts with equally weighted samples from the prior. Instead of weighting them with the likelihood, as in the standard particle filter, we transform the samples in state space to represent the posterior pdf. This transformation is an iterative process. In the previous chapters, we discussed variational schemes like 4DVar and RML sampling. 4DVar uses an iterative Gauss–Newton method to find the posterior mode, and RML sampling minimizes an ensemble of cost functions to approximately sample the posterior pdf. Particle flow is an iterative ensemble method that, in theory, correctly samples the posterior pdf.

Recently, there has been increased interest in methods that dynamically move the particles in state space from equal-weight particles representing the prior, \(f({\mathbf {z}})\), to equal-weight particles representing the posterior, \(f({\mathbf {z}}|{\mathbf {d}})\). In these methods, we seek a potentially stochastic differential equation

$$\begin{aligned} d{\mathbf {z}}= {\mathbf {m}}_s({\mathbf {z}}) ds + d {\mathbf {q}}, \end{aligned}$$
(9.27)

in artificial time \(s\ge 0\), where the deterministic flow map \({\mathbf {m}}_s\) and the stochastic term \(d {\mathbf {q}}\) define the desired transformation. The stochastic term is drawn from \(\mathcal {N}(0,{\mathbf {C}}_\textit{ff}\, ds)\), in which \({\mathbf {C}}_\textit{ff}\) is the covariance matrix of the error in the flow map, i.e., the stochastic forcing. If the initial conditions of the differential equation (9.27) are chosen from a pdf \(f_0({\mathbf {z}})\), with 0 referring to the initial artificial time, then the solutions follow a distribution characterized by the Fokker–Planck Eq. (2.25), which we write as

$$\begin{aligned} \frac{\partial f_s({\mathbf {z}})}{\partial s} = -\nabla _{\mathbf {z}}\cdot \bigl ({\mathbf {m}}_s({\mathbf {z}}) f_s({\mathbf {z}})\bigr ) + \frac{1}{2} \nabla _{\mathbf {z}}\cdot \bigl ({\mathbf {C}}_\textit{ff}\nabla _{\mathbf {z}}f_s({\mathbf {z}})\bigr ). \end{aligned}$$
(9.28)

The initial condition for this equation is \(f_0({\mathbf {z}}) = f({\mathbf {z}})\) and we aim to determine a flow map \({\mathbf {m}}_s\) and stochastic forcing determined by \({\mathbf {C}}_\textit{ff}\), such that \(f_s\) satisfies the final condition \(f_{s_\text {final}}({\mathbf {z}})=f({\mathbf {z}}|{\mathbf {d}})\).

3.1 Particle Flow Filters via Likelihood Factorization

Several classes of particle-flow filters arise from the likelihood-factorization formalism. To introduce this formulation, let us assume

$$\begin{aligned} f_s({\mathbf {z}}) \propto f({\mathbf {d}}|{\mathbf {z}})^s f({\mathbf {z}}), \end{aligned}$$
(9.29)

in which \(s=0\) gives us back the prior, and \(s=1\) the posterior pdf. We can take the natural logarithm to find:

$$\begin{aligned} \ln f_s({\mathbf {z}}) = s \ln f({\mathbf {d}}|{\mathbf {z}}) + \ln f({\mathbf {z}}) + c(s), \end{aligned}$$
(9.30)

in which c(s) is a function of the pseudo time s, but not of the state \({\mathbf {z}}\). If we now take the pseudo-time derivative we find:

$$\begin{aligned} \frac{1}{ f_s}\frac{\partial f_s}{\partial s} = \ln f({\mathbf {d}}|{\mathbf {z}}) + \frac{\partial c(s)}{\partial s}. \end{aligned}$$
(9.31)

We now divide the Fokker–Planck Eq. (9.28) by \(f_s\) to find:

$$\begin{aligned} \frac{1}{f_s}\frac{\partial f_s}{\partial s} = -{\mathbf {m}}_s\cdot \nabla _{\mathbf {z}}\ln f_s - \nabla _{\mathbf {z}}\cdot {\mathbf {m}}_s + \frac{1}{2 f_s} \nabla _{\mathbf {z}}\cdot \bigl ({\mathbf {C}}_\textit{ff}\nabla _{\mathbf {z}}f_s\bigr ). \end{aligned}$$
(9.32)

Combining the last two equations and taking the gradient with respect to the state \({\mathbf {z}}\) to eliminate c(s) leads directly to:

$$\begin{aligned} \nabla \log f({\mathbf {d}}|{\mathbf {z}}) = - {\mathbf {m}}_s^\mathrm {T}\nabla _{\mathbf {z}}^2 \log f_s - \nabla _{\mathbf {z}}(\nabla _{\mathbf {z}}\cdot {\mathbf {m}}_s) - \nabla _{\mathbf {z}}\log f_s \nabla _{\mathbf {z}}\cdot {\mathbf {m}}_s + \frac{1}{2} \nabla _{\mathbf {z}}\left( \frac{\nabla _{\mathbf {z}}\cdot ({\mathbf {C}}_\textit{ff}\nabla _{\mathbf {z}}f_s)}{f_s} \right) . \end{aligned}$$
(9.33)

Thus, we have a nonlinear coupled system of equations whose size is the dimension of the system. However, \({\mathbf {m}}_s\) has that same dimension, and \({\mathbf {C}}_\textit{ff}\) has that dimension squared, so the number of unknowns is much larger than the number of independent equations. Thus, there are many, in fact infinitely many, combinations of \({\mathbf {m}}_s\) and \({\mathbf {C}}_\textit{ff}\) that are valid solutions.

Remarkably, and this is truly remarkable, Daum et al. (2018) found an analytical solution of Eq. (9.33) , i.e.,

(9.34)
(9.35)

This solution’s significance is the existence of a closed-form solution for the fully nonlinear data-assimilation problem in terms of the movement of individual particles. Unfortunately, we need the gradient of the logarithm of \(f_s({\mathbf {z}})\), which is a pdf that we only have an ensemble representation of, so we know it as a sum of Dirac-delta distributions. Hence, this gradient does not exist, and we need to make approximations, e.g., assuming that each particle is not a Dirac-delta distribution but a Gaussian. We have not yet seen this approach explored in any detail for high-dimensional geophysical problems.

In another class of methods, we assume that the stochastic term is zero and start from a tapering approach, where we gradually increase s such that \(s_\text {final}=1\). We now take the limit of an increasing number of tapering steps by choosing the step length \(\gamma _i=1/n_s = \Delta s\) and letting \(n_s \rightarrow \infty \), so that \(\gamma _i \rightarrow 0\), or equivalently \(\Delta s \rightarrow 0\); see Daum and Huang (2011, 2013) and Reich (2011). This approach leads to

$$\begin{aligned} f_{s+\Delta s}({\mathbf {z}}) &\propto f({\mathbf {d}}|{\mathbf {z}})^{s+\Delta s} f({\mathbf {z}}) \nonumber \\&\propto f({\mathbf {d}}|{\mathbf {z}})^{\Delta s} f_s({\mathbf {z}}) \nonumber \\&\approx \bigl (1 + \Delta s \ln f({\mathbf {d}}|{\mathbf {z}})\bigr ) f_s({\mathbf {z}}), \end{aligned}$$
(9.36)

where we have used a first-order Taylor expansion to get to the final line. Hence, we find

$$\begin{aligned} \frac{\partial f_s({\mathbf {z}})}{\partial s} = \bigl (\ln f({\mathbf {d}}|{\mathbf {z}}) - c_s\bigr ) f_s({\mathbf {z}}), \end{aligned}$$
(9.37)

with \(c_s = \int f_s({\mathbf {z}}) \ln f({\mathbf {d}}|{\mathbf {z}})\; d{\mathbf {z}}\), which follows directly from integrating the equation over the whole state space, and using \(\int f_s({\mathbf {z}})\; d{\mathbf {z}}=1 \). If we now use the Liouville equation (Jazwinski, 1970) for the evolution of a pdf we can identify

$$\begin{aligned} -\nabla _{\mathbf {z}}\cdot \bigl (f_s({\mathbf {z}}) {\mathbf {m}}_s({\mathbf {z}})\bigr ) = \bigl (\ln f({\mathbf {d}}|{\mathbf {z}}) - c_s\bigr ) f_s({\mathbf {z}}), \end{aligned}$$
(9.38)

which is an implicit equation for \({\mathbf {m}}_s\) in terms of \(f_s\). Explicit expressions for \({\mathbf {m}}_s\) are available for certain pdfs such as Gaussians and Gaussian mixtures (Reich, 2012). These particle-flow filters can be viewed as a continuous limit of the tapering methods, avoiding the need for resampling and jittering. Note that the elliptic partial differential equation (9.38) does not determine \({\mathbf {m}}_s\) uniquely. Optimal choices in the sense of minimizing the \(L_2(f_s)\)-norm of \({\mathbf {m}}_s\) lead to the theory of optimal transportation, see  Reich and Cotter (2015) and Villani (2008).

3.2 Particle Flows via Distance Minimization

Alternatively, one can define a distance between the intermediate pdf \(f_s({\mathbf {z}})\) and the posterior pdf, and then find the flow field \({\mathbf {m}}_s\) that minimizes that distance. Many definitions of the distance between two pdfs exist, and we will use the Kullback–Leibler (KL) divergence here. (The KL divergence is strictly speaking not a distance as it is not symmetric in its arguments, but reducing the KL divergence does bring the two pdfs closer together.) The following compact derivation follows Hu and Van Leeuwen (2021). The KL divergence is given by

$$\begin{aligned} \text {KL}\bigl (f_s({\mathbf {z}})\,||\,f({\mathbf {z}}|{\mathbf {d}})\bigr ) = \int f_s({\mathbf {z}})\, \ln \frac{f_s({\mathbf {z}})}{f({\mathbf {z}}|{\mathbf {d}})}\; d{\mathbf {z}}, \end{aligned}$$
(9.39)

and we find the rate of change of the KL divergence with s from

$$\begin{aligned} \frac{\partial \text {KL}}{\partial s} = \int \frac{\partial f_s({\mathbf {z}})}{\partial s} \left( \ln \frac{f_s({\mathbf {z}})}{f({\mathbf {z}}|{\mathbf {d}})}+1 \right) \;d{\mathbf {z}}. \end{aligned}$$
(9.40)

We can rewrite this expression using the Liouville equation for \(f_s({\mathbf {z}})\) as

$$\begin{aligned} \frac{\partial \text {KL}}{\partial s} = -\int \nabla _{\mathbf {z}}\cdot \bigl (f_s({\mathbf {z}}) {\mathbf {m}}_s({\mathbf {z}})\bigr ) \left( \ln \frac{f_s({\mathbf {z}})}{f({\mathbf {z}}|{\mathbf {d}})}+1 \right) \;d{\mathbf {z}}, \end{aligned}$$
(9.41)

and, integrating by parts twice, we obtain

$$\begin{aligned} \frac{\partial \text {KL}}{\partial s} = -\int f_s({\mathbf {z}}) \bigl [ {\mathbf {m}}_s({\mathbf {z}}) \cdot \nabla _{{\mathbf {z}}} \ln f({\mathbf {z}}|{\mathbf {d}}) + \nabla _{{\mathbf {z}}} \cdot {\mathbf {m}}_s({\mathbf {z}}) \bigr ] \;d{\mathbf {z}}. \end{aligned}$$
(9.42)

Our task now is to find the flow field \({\mathbf {m}}_s({\mathbf {z}})\) that leads to a fast decrease of the KL divergence and thus an efficient mapping from the prior to the posterior pdf. As we have no direct solution to this optimization problem,   Liu and Wang (2016) suggest embedding the flow field in a reproducing-kernel Hilbert space (RKHS), such that

$$\begin{aligned} {\mathbf {m}}_s({\mathbf {z}}) = \bigl \langle \boldsymbol{\mathcal {K}}(\cdot ,{\mathbf {z}})\;,\;{\mathbf {m}}_s(\cdot ) \bigr \rangle , \end{aligned}$$
(9.43)

in which \(\boldsymbol{\mathcal {K}}(\cdot ,{\mathbf {z}})\) is a matrix-valued kernel, so a matrix of functions of two state vectors. Using this result in Eq. (9.42) leads directly to

$$\begin{aligned} \frac{\partial \text {KL}}{\partial s} =-\left\langle \int f_s({\mathbf {z}})\left[ \boldsymbol{\mathcal {K}}(\cdot ,{\mathbf {z}}) \nabla _{{\mathbf {z}}} \ln f({\mathbf {z}}|{\mathbf {d}}) + \nabla _{{\mathbf {z}}} \boldsymbol{\mathcal {K}}(\cdot ,{\mathbf {z}})\right] \;d{\mathbf {z}}\;,\;{\mathbf {m}}_s(\cdot ) \right\rangle \;, \end{aligned}$$
(9.44)

where we used the linearity of the integral and the inner product to change their order. If we now define \(\nabla _{\mathbf {z}}\text {KL}\) as the gradient of the KL distance, i.e., the maximal functional derivative of KL at every state vector \({\mathbf {z}}\) in the RKHS, we can write the change in KL in the direction of \({\mathbf {m}}_s\) as

$$\begin{aligned} \frac{\partial \text {KL}}{\partial s} = \bigl \langle \nabla _{\mathbf {z}}\text {KL}\;,\;{\mathbf {m}}_s(\cdot ) \bigr \rangle . \end{aligned}$$
(9.45)

By comparing this expression with Eq. (9.44), we can identify

$$\begin{aligned} \nabla _{\mathbf {z}}\text {KL} = - \int f_s({\mathbf {z}})\left[ \boldsymbol{\mathcal {K}}(\cdot ,{\mathbf {z}}) \nabla _{{\mathbf {z}}} \ln f({\mathbf {z}}|{\mathbf {d}}) + \nabla _{{\mathbf {z}}} \boldsymbol{\mathcal {K}}(\cdot ,{\mathbf {z}})\right] \;d{\mathbf {z}}. \end{aligned}$$
(9.46)

Hence, by introducing the Reproducing Kernel Hilbert Space, we find an expression for the gradient of KL in terms of an integral that contains the kernel. The critical point is that this gradient is independent of \({\mathbf {m}}_s\). If we choose the flow field \({\mathbf {m}}_s\) along this gradient direction

$$\begin{aligned} {\mathbf {m}}_s({\mathbf {z}}) = - \epsilon \nabla _{\mathbf {z}}\text {KL} ({\mathbf {z}}), \end{aligned}$$
(9.47)

where \(\epsilon \) is a positive scalar, we can use this gradient in a steepest descent minimization of the KL distance. Furthermore, as in variational data-assimilation methods, we can rotate the descent direction to achieve faster convergence. In general, we can use

$$\begin{aligned} {\mathbf {m}}_s({\mathbf {z}}) = - {\mathbf {B}}\nabla _{\mathbf {z}}\text {KL} ({\mathbf {z}}), \end{aligned}$$
(9.48)

in which \({\mathbf {B}}\) is a positive definite matrix of our choosing. From the variational and other iterative methods discussed in Chap. 3, one might want to choose the posterior covariance matrix for \({\mathbf {B}}\), see, e.g., Eq. (3.17). In practical applications with variables of different physical dimensions, we recommend exploiting this freedom in the definition of the matrix \({\mathbf {B}}\).

[Algorithm 11: The particle-flow filter, and accompanying figures]

Finally, we replace the integral by its empirical approximation by using the particle representation of \(f_s({\mathbf {z}})\), to obtain

$$\begin{aligned} {\mathbf {m}}_s({\mathbf {z}}_j) = \frac{1}{N} \sum _{i=1}^N {\mathbf {B}}\left[ \boldsymbol{\mathcal {K}}({\mathbf {z}}_j,{\mathbf {z}}_i) \nabla _{{\mathbf {z}}_i} \ln f({\mathbf {z}}_i|{\mathbf {d}}) + \nabla _{{\mathbf {z}}_i} \boldsymbol{\mathcal {K}}({\mathbf {z}}_j,{\mathbf {z}}_i)\right] , \end{aligned}$$
(9.49)

The intuitive explanation of this equation is that the first term in (9.49) pulls the particles towards the mode of the posterior, as in a variational method, while the second term acts as a repulsive force that allows for particle diversity. If only the first term were present, the particles would all flow towards the mode of the posterior pdf, since the kernel-weighted average of the gradient of the log posterior at each particle then determines the particle flow.

The second term avoids this particle collapse by repelling the particles when they become too close. This can easily be seen by choosing a scalar Gaussian kernel, as in Liu and Wang (2016) and Pulido and Van Leeuwen (2019). If we write \(\boldsymbol{\mathcal {K}}({\mathbf {z}}_j,{\mathbf {z}}_l) = k({\mathbf {z}}_j,{\mathbf {z}}_l) {\mathbf {I}}\) and take the gradient with respect to \({\mathbf {z}}_l\), we obtain

$$\begin{aligned} \nabla _{{\mathbf {z}}_l} \boldsymbol{\mathcal {K}}({\mathbf {z}}_j,{\mathbf {z}}_l) \propto ({\mathbf {z}}_j-{\mathbf {z}}_l) k({\mathbf {z}}_j,{\mathbf {z}}_l) . \end{aligned}$$
(9.50)

If a component of \({\mathbf {z}}_j\) is larger than that of \({\mathbf {z}}_l\), the gradient in Eq. (9.50) is positive, increasing \({\mathbf {z}}_j\) in that dimension. Thus, the term acts as a repelling force. Hu and Van Leeuwen (2021) showed that for sparsely observed systems, a matrix kernel is more efficient than a scalar kernel. The issue with a scalar kernel is that the repelling term uses the distance between two complete state vectors. The particles converge quickly to each other in the part of state space directly influenced by the observations, while the part of the state vector far from the observations shows slow convergence. This slow convergence results in a large distance between particles, and hence a tiny repelling force, while the particles collapse in the observed part of the state vector. We can easily avoid this problem by using a simple diagonal kernel with local kernels on the diagonal. We present the particle-flow algorithm in Algorithm 11.
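A minimal sketch of the flow iteration in Eq. (9.49) with a scalar Gaussian kernel (a simplification of Algorithm 11; the kernel bandwidth, step size, and the user-supplied gradient of the log posterior are our own illustrative choices):

```python
import numpy as np

def gaussian_kernel(Z, sigma):
    """Scalar Gaussian kernel k(z_j, z_i) and its gradient with respect to z_i."""
    diff = Z[:, None, :] - Z[None, :, :]              # (N, N, n): z_j - z_i
    k = np.exp(-0.5 * np.sum(diff**2, axis=-1) / sigma**2)
    grad_k = diff * k[..., None] / sigma**2           # (z_j - z_i) k, cf. Eq. (9.50)
    return k, grad_k

def particle_flow(Z, grad_log_post, B, sigma=1.0, eps=0.05, n_steps=500):
    """Move particles Z (N, n) along the flow field of Eq. (9.49)."""
    N = Z.shape[0]
    for _ in range(n_steps):
        k, grad_k = gaussian_kernel(Z, sigma)
        G = np.array([grad_log_post(z) for z in Z])   # gradient of ln f(z|d) at each particle
        attract = k @ G                               # kernel-weighted pull towards the mode
        repel = grad_k.sum(axis=1)                    # repulsive term that keeps particle diversity
        Z = Z + eps * ((attract + repel) / N) @ B.T   # preconditioning with the B matrix
    return Z
```

Here grad_log_post(z) would return \(\nabla _{\mathbf {z}}\ln f({\mathbf {z}}|{\mathbf {d}})\), i.e., the sum of the gradients of the log prior and the log likelihood for the problem at hand.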

Interestingly, Lu et al. (2019) showed that this particle-flow filter converges to the true posterior for any kernel symmetric in its arguments that vanishes at infinity, in the limit of an infinite number of particles. Hence, in that limit the choice of kernel is irrelevant! With a finite number of particles, as in any realistic geophysical application, the choice of the kernel will matter.

Another choice to be made in this scheme is the \({\mathbf {B}}\) matrix, which can be seen as a preconditioning matrix for the minimization. By choosing this matrix proportional to a localized ensemble covariance matrix, Hu and Van Leeuwen  (2021) demonstrated a practical scheme that works well in problems with hundreds of local modes using only 20 particles. In Chap. 18 we demonstrate the use of a particle-flow implementation with a scalar model and show how the method samples the true posterior distribution, in contrast to traditional assimilation methods, while in Chap. 20 a high-dimensional application to a quasi-geostrophic atmospheric model is described.