In the following, we derive methods for sampling the posterior conditional pdf in Eq. (3.8). We aim to estimate the full pdf, not only to find its maximum. In this chapter, we use an approach named randomized maximum likelihood (RML) sampling. The name is not precise, as the method attempts to sample the posterior pdf and not just the likelihood; however, we will continue using the name RML when we refer to the technique. RML provides a highly efficient approach for approximate sampling of the posterior pdf and lays the groundwork for developing many popular ensemble methods.

## 1 RML Sampling

To introduce randomized-maximum-likelihood sampling, let’s define an ensemble of cost functions where the prior vectors $${\mathbf {z}}_j^\mathrm {f}$$ are samples from the Gaussian distribution in Eq. (3.5), and we introduce the perturbed measurements $${\mathbf {d}}_j={\mathbf {d}}+ \boldsymbol{\epsilon }_j$$ where the perturbations $$\boldsymbol{\epsilon }_j$$ are samples from Eq. (3.6),

Ensemble of cost functions

\begin{aligned} \mathcal {J}({\mathbf {z}}_j) =\frac{1}{2}\bigl ({\mathbf {z}}_j-{\mathbf {z}}_j^\mathrm {f}\bigr )^\mathrm {T}{{\mathbf {C}}_{\textit{zz}}^{-1}}\bigl ({\mathbf {z}}_j-{\mathbf {z}}_j^\mathrm {f}\bigr ) +\frac{1}{2}\bigl ({\mathbf {g}}({\mathbf {z}}_j)-{\mathbf {d}}_j\bigr )^\mathrm {T}{\mathbf {C}}_\textit{dd}^{-1}\bigl ({\mathbf {g}}({\mathbf {z}}_j)-{\mathbf {d}}_j\bigr ), \end{aligned}
(7.1)

as proposed by Kitanidis (1995) and Oliver et al. (1996). These cost functions are independent of each other and differ from the cost function (3.9) by the introduction of the random samples $${\mathbf {z}}_j^\mathrm {f}\sim \mathcal {N}({\mathbf {z}}^\mathrm {f}, {{\mathbf {C}}_{\textit{zz}}})$$ and $${\mathbf {d}}_j \sim \mathcal {N}({\mathbf {d}}, {\mathbf {C}}_\textit{dd})$$.
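As an illustration, the sampling of prior realizations and perturbed measurements, and the evaluation of one cost-function realization from Eq. (7.1), can be sketched in a few lines of Python. All dimensions, the observation operator `g`, and the numerical values below are hypothetical choices for demonstration only:

```python
import numpy as np

rng = np.random.default_rng(0)

nz, nd, N = 3, 2, 5                 # state size, data size, ensemble size (illustrative)
z_f = np.zeros(nz)                  # prior mean
C_zz = np.eye(nz)                   # prior covariance, Eq. (3.5)
C_dd = 0.1 * np.eye(nd)             # measurement-error covariance, Eq. (3.6)
d = np.array([1.0, -0.5])           # the actual measurements

def g(z):
    """A hypothetical weakly nonlinear observation operator."""
    return z[:2] + 0.1 * z[:2] ** 2

# Prior samples z_j^f ~ N(z^f, C_zz) and perturbed data d_j = d + eps_j, eps_j ~ N(0, C_dd).
Zf = rng.multivariate_normal(z_f, C_zz, size=N)
D = d + rng.multivariate_normal(np.zeros(nd), C_dd, size=N)

def cost(z, zf_j, d_j):
    """One realization of the ensemble of cost functions, Eq. (7.1)."""
    dz = z - zf_j
    r = g(z) - d_j
    return 0.5 * dz @ np.linalg.solve(C_zz, dz) + 0.5 * r @ np.linalg.solve(C_dd, r)
```

Each pair $$({\mathbf {z}}_j^\mathrm {f}, {\mathbf {d}}_j)$$ defines its own cost function, which can then be minimized independently of the others.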

One might ask why we need to perturb the measurements when Bayes' theorem tells us that we are already given the measurements in the data-assimilation problem. Indeed, Van Leeuwen (2020) contains a detailed discussion of why it is more consistent to perturb the predicted measurements $${\mathbf {g}}({\mathbf {z}}_j)$$ with a draw from the measurement-error pdf. However, there is no practical advantage to either choice: the cost function in Eq. (7.1) only contains the difference between the predicted and actual measurements, and the Gaussian is symmetric in its arguments. In this chapter, we will use the conventional “perturbed measurements” formalism.

### Approximation 6 (RML sampling)

In the weakly nonlinear case, we can approximately sample the posterior pdf in Eq. (3.8) by minimizing the ensemble of cost functions defined by Eq. (7.1). $$\square$$

In the Gauss-linear case, the minimizing solutions of these cost functions sample the posterior conditional pdf in Eq. (3.8) exactly. Furthermore, with an infinite number of samples, the sample mean and covariance will converge to the KF solution given by Eqs. (6.28) and (6.33). When we introduce nonlinearity into the problem, the samples will deviate from the pdf in Eq. (3.8), but in many cases with only weak nonlinearity, this approximation is acceptable. The fun fact is that nobody knows precisely which distribution the method samples from in the nonlinear case. Note also that we can minimize each of the cost functions independently of the others using the Gauss–Newton method described in Chap. 3.
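In a scalar Gauss-linear setting, this exactness is easy to verify numerically. The sketch below uses hypothetical values (prior $$\mathcal {N}(0,1)$$, linear observation $${\mathbf {g}}(z)=z$$ with unit error variance, and measurement $$d=1$$), minimizes each quadratic cost function in closed form, and recovers the KF posterior mean $$1/2$$ and variance $$1/2$$ statistically:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200_000                          # large ensemble for the statistical check

# Scalar Gauss-linear setup: z ~ N(0, 1), d = z + eps with eps ~ N(0, 1), observed d = 1.
zf = rng.normal(0.0, 1.0, N)         # prior samples z_j^f
dj = 1.0 + rng.normal(0.0, 1.0, N)   # perturbed measurements d_j

# Each quadratic cost 0.5 (z - zf_j)^2 + 0.5 (z - d_j)^2 is minimized by the average.
za = 0.5 * (zf + dj)

# The sample mean and variance of za approach the KF posterior values 0.5 and 0.5.
print(za.mean(), za.var())
```

The same experiment with a nonlinear $${\mathbf {g}}$$ would show the samples drifting away from the exact posterior, consistent with Approx. 6.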

Similarly to Eq. (3.11), we now have an ensemble of gradients that we set to zero to minimize the ensemble of cost functions in Eq. (7.1),

Ensemble of gradients set to zero

\begin{aligned} {{\mathbf {C}}_{\textit{zz}}^{-1}}\bigl ({\mathbf {z}}_j - {\mathbf {z}}_j^\mathrm {f}\bigr ) + \nabla _{{\mathbf {z}}} {\mathbf {g}}\bigl ({\mathbf {z}}_j \bigr ) {\mathbf {C}}_\textit{dd}^{-1} \bigl ({\mathbf {g}}({\mathbf {z}}_j) - {\mathbf {d}}_j \bigr ) = 0. \end{aligned}
(7.2)

## 2 Approximate EKF Sampling

The simplest way to solve Eq. (7.2) for an ensemble of realizations is to use the Kalman filter update Eq. (6.44) to solve for each sample, $$j=1, \dots , N_{ens}$$,

\begin{aligned} {\mathbf {z}}_j^\mathrm {a}= {\mathbf {z}}_j^\mathrm {f}+ {{\mathbf {C}}_{\textit{zz}}}{{\mathbf {G}}_{j}}^\mathrm {T}\bigl ({{\mathbf {G}}_{j}}{{\mathbf {C}}_{\textit{zz}}}{{\mathbf {G}}_{j}}^\mathrm {T}+ {\mathbf {C}}_\textit{dd}\bigr )^{-1}\bigl ({\mathbf {d}}_j - {\mathbf {g}}({\mathbf {z}}_j^\mathrm {f})\bigr ), \end{aligned}
(7.3)

with $${{\mathbf {G}}_{j}}\triangleq \bigl (\nabla _{\mathbf {z}}{\mathbf {g}}({\mathbf {z}}_j^\mathrm {f})\bigr )^\mathrm {T}$$ denoting the tangent-linear operator evaluated at the prior sample.

However, as we noted in the previous chapter, these equations are only valid in the linear case or for modest updates in the nonlinear case.
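A direct translation of this per-sample Kalman update into code might look as follows. This is a sketch in NumPy; the function `G_of`, which returns the tangent-linear operator evaluated at a given state, and all matrices are assumed given:

```python
import numpy as np

def rml_ekf_update(Zf, D, g, G_of, C_zz, C_dd):
    """Approximate EKF sampling: one Kalman update per realization, Eq. (7.3).

    Zf   : (N, nz) prior samples z_j^f
    D    : (N, nd) perturbed measurements d_j
    g    : observation operator
    G_of : returns the tangent-linear operator G_j evaluated at z_j^f
    """
    Za = np.empty_like(Zf)
    for j, (zf, dj) in enumerate(zip(Zf, D)):
        G = G_of(zf)
        S = G @ C_zz @ G.T + C_dd            # innovation covariance
        K = C_zz @ G.T @ np.linalg.inv(S)    # Kalman gain for realization j
        Za[j] = zf + K @ (dj - g(zf))
    return Za
```

For a scalar linear example with unit prior and error variances, the gain is $$1/2$$, and each update moves the sample halfway toward its perturbed measurement.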

## 3 Approximate Gauss–Newton Sampling

As an alternative to the EKF solution from Sect. 6.5, we can minimize the cost functions in Eq. (7.1) without introducing Approx. 5. We do this by using the Gauss–Newton method from Sect. 3.4 for each of the cost functions in the ensemble. Taking the derivative of Eq. (3.10) while neglecting terms containing second derivatives, we obtain an approximation to the Hessian

\begin{aligned} \nabla _{\mathbf {z}}\nabla _{\mathbf {z}}\mathcal {J}({\mathbf {z}}_j) \approx {{\mathbf {C}}_{\textit{zz}}^{-1}}+ \nabla _{\mathbf {z}}{\mathbf {g}}({\mathbf {z}}_j) {\mathbf {C}}_\textit{dd}^{-1} \bigl (\nabla _{\mathbf {z}}{\mathbf {g}}({\mathbf {z}}_j)\bigr )^\mathrm {T}. \end{aligned}
(7.4)

We can then write a Gauss–Newton iteration for $${\mathbf {z}}$$ as

Ensemble of GN iterations

\begin{aligned} {\mathbf {z}}_j^{i+1} = {\mathbf {z}}_j^{i} - \gamma \Bigl ({{\mathbf {C}}_{\textit{zz}}^{-1}}+ {{\mathbf {G}}_{j}^{i}}^\mathrm {T}{\mathbf {C}}_\textit{dd}^{-1}{{\mathbf {G}}_{j}^{i}}\Bigr )^{-1}\Bigl ({{\mathbf {C}}_{\textit{zz}}^{-1}}\bigl ({\mathbf {z}}_j^{i} - {\mathbf {z}}_j^\mathrm {f}\bigr ) + {{\mathbf {G}}_{j}^{i}}^\mathrm {T}{\mathbf {C}}_\textit{dd}^{-1}\bigl ({\mathbf {g}}({\mathbf {z}}_j^{i}) - {\mathbf {d}}_j\bigr )\Bigr ), \end{aligned}
(7.5)

with a step-length parameter $$\gamma \le 1$$,

where we have defined the gradient of the observation operator at iteration i and for ensemble member j as

\begin{aligned} {{\mathbf {G}}_{j}^{i}}\triangleq \bigl (\nabla _{\mathbf {z}}{\mathbf {g}}({\mathbf {z}}_j^{i})\bigr )^\mathrm {T}. \end{aligned}
(7.6)

In this formulation, each realization uses the tangent-linear model $${{\mathbf {G}}_{j}^{i}}$$ evaluated at the solution for realization j at iteration i. Thus, each realization has a model sensitivity that is independent of the other realizations. This approach and any other method that minimizes the cost functions in Eq. (7.1) will correctly sample the posterior distribution in the Gauss-linear case. Still, for a posterior non-Gaussian distribution, Approx. 6 applies. Thus, we can use any of the methods discussed in Chaps. 3, 4, and 5 to solve for the minimizing solution of each cost-function realization.
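The per-realization Gauss–Newton iteration can be sketched as follows. The step length `gamma` and iteration count are illustrative choices, and `G_of` returns the sensitivity $${{\mathbf {G}}_{j}^{i}}$$ at the current iterate:

```python
import numpy as np

def rml_gauss_newton(zf_j, d_j, g, G_of, C_zz, C_dd, gamma=1.0, n_iter=25):
    """Minimize one cost-function realization of Eq. (7.1) by Gauss-Newton."""
    Czz_inv = np.linalg.inv(C_zz)
    Cdd_inv = np.linalg.inv(C_dd)
    z = np.array(zf_j, dtype=float)          # start the iteration from the prior sample
    for _ in range(n_iter):
        G = G_of(z)                          # sensitivity G_j^i at iteration i
        grad = Czz_inv @ (z - zf_j) + G.T @ Cdd_inv @ (g(z) - d_j)   # Eq. (7.2)
        H = Czz_inv + G.T @ Cdd_inv @ G      # approximate Hessian, Eq. (7.4)
        z = z - gamma * np.linalg.solve(H, grad)
    return z
```

In the Gauss-linear case with full step length, a single iteration reproduces the Kalman update of the corresponding realization; with nonlinear $${\mathbf {g}}$$, several damped iterations are typically required.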

## 4 Least-Squares Best-Fit Model Sensitivity

There are two aspects of the solutions defined in Eqs. (7.3) and (7.5) that require our attention. First, we assume we know the tangent-linear model $${{\mathbf {G}}_{j}^{i}}$$ and its adjoint, $${{\mathbf {G}}_{j}^{i}}^\mathrm {T}$$, which is not always the case. The other aspect relates to the storage and inversion of $${{\mathbf {C}}_{\textit{zz}}}$$, a huge matrix.

In cases where we do not have access to a tangent-linear model or the adjoint operator, we can use a statistical representation of the model sensitivity. Rather than computing different tangent-linear operators $${{\mathbf {G}}_{j}^{i}}$$ for each sample, we represent them by a statistical least-squares best-fit model sensitivity $${{\mathbf {G}}^{i}}$$ common to all realizations (Chen & Oliver, 2013; Evensen, 2019; Reynolds et al., 2006), and we introduce the following approximation.

### Approximation 7 (Best-fit ensemble-averaged model sensitivity)

Interpret $${{\mathbf {G}}_{j}}$$ in Eq. (7.3) and $${{\mathbf {G}}_{j}^{i}}$$ in Eq. (7.5) as the sensitivity matrix in linear regression and represent them using the definition

\begin{aligned} {{\mathbf {G}}_{j}}&\approx {\mathbf {G}}\triangleq {\mathbf {C}}_{yz} {{\mathbf {C}}_{\textit{zz}}^{-1}}. \end{aligned}
(7.7)

Note that we have dropped the subscript j for the realizations. Hence, we approximate the individual model sensitivities with a common averaged sensitivity used for all realizations.

A consequence of this approximation is that we slightly alter the gradient in Eq. (7.2) and thus also the minimizing solution that the Kalman filter updates or the Gauss–Newton iterations would provide.

By introducing the averaged model sensitivity from Eq. (7.7), we can rewrite the Gauss–Newton iteration in Eq. (7.5) as

\begin{aligned} {\mathbf {z}}_j^{i+1}&= {\mathbf {z}}_j^{i} - \gamma \Bigl ({{\mathbf {C}}_{\textit{zz}}^{-1}}+ {{\mathbf {G}}^{i}}^\mathrm {T}{\mathbf {C}}_\textit{dd}^{-1}{{\mathbf {G}}^{i}}\Bigr )^{-1}\Bigl ({{\mathbf {C}}_{\textit{zz}}^{-1}}\bigl ({\mathbf {z}}_j^{i} - {\mathbf {z}}_j^\mathrm {f}\bigr ) + {{\mathbf {G}}^{i}}^\mathrm {T}{\mathbf {C}}_\textit{dd}^{-1}\bigl ({\mathbf {g}}({\mathbf {z}}_j^{i}) - {\mathbf {d}}_j\bigr )\Bigr ) \end{aligned}
(7.8)

\begin{aligned} {\mathbf {z}}_j^{i+1}&= {\mathbf {z}}_j^{i} - \gamma \Bigl ({\mathbf {z}}_j^{i} - {\mathbf {z}}_j^\mathrm {f}- {{\mathbf {C}}_{\textit{zz}}}{{\mathbf {G}}^{i}}^\mathrm {T}\bigl ({{\mathbf {G}}^{i}}{{\mathbf {C}}_{\textit{zz}}}{{\mathbf {G}}^{i}}^\mathrm {T}+ {\mathbf {C}}_\textit{dd}\bigr )^{-1}\bigl ({{\mathbf {G}}^{i}}({\mathbf {z}}_j^{i} - {\mathbf {z}}_j^\mathrm {f}) - {\mathbf {g}}({\mathbf {z}}_j^{i}) + {\mathbf {d}}_j\bigr )\Bigr ), \end{aligned}
(7.9)

where we have used the corollaries from Eqs. (6.9) and (6.10).

A rather tricky issue with Eq. (7.9) is the appearance of products of the averaged model sensitivity $${{\mathbf {G}}^{i}}$$, evaluated at iteration i, with the prior covariance matrix $${{\mathbf {C}}_{\textit{zz}}}$$. Chen and Oliver (2013) provided an alternative approach by evaluating the state covariance in the Hessian at the current iterate. This modification does not impact the final solution, but it alters the update step. They introduced various strategies for solving Eqs. (7.8) and (7.9) using ensemble representations of the state covariances. The next chapter will present a recent and efficient algorithm that searches for the solution in the ensemble subspace.

Recall that $${\mathbf {y}}= {\mathbf {g}}({\mathbf {z}})$$ is the model equivalent of the observed state and $${\mathbf {C}}_{yz}$$ is the covariance between the state vector $${\mathbf {z}}$$ and the predicted measurements $${\mathbf {y}}$$. The operator $${\mathbf {G}}$$, defined in Eq. (7.7), is the linear regression between $${\mathbf {y}}$$ and $${\mathbf {z}}$$, and we have

\begin{aligned} {\mathbf {G}}{{\mathbf {C}}_{\textit{zz}}}= {\mathbf {C}}_{yz}, \end{aligned}
(7.10)

and

\begin{aligned} {\mathbf {G}}{{\mathbf {C}}_{\textit{zz}}}{\mathbf {G}}^\mathrm {T}= {\mathbf {C}}_{yz} {{\mathbf {C}}_{\textit{zz}}^{-1}}{\mathbf {C}}_{zy} . \end{aligned}
(7.11)

We will use these expressions further in the following chapter. For now, we note that we can use the EKF update Eq. (7.3) to formulate an ensemble of Kalman-filter updates without using the tangent-linear operator, as

\begin{aligned} {\mathbf {z}}_j^\mathrm {a}= {\mathbf {z}}_j^\mathrm {f}+ {\mathbf {C}}_{zy} \bigl ({\mathbf {C}}_{yz} {{\mathbf {C}}_{\textit{zz}}^{-1}}{\mathbf {C}}_{zy} + {\mathbf {C}}_\textit{dd}\bigr )^{-1}\bigl ({\mathbf {d}}_j - {\mathbf {g}}({\mathbf {z}}_j^\mathrm {f})\bigr ). \end{aligned}
(7.12)

It is common to replace the term $${\mathbf {G}}{{\mathbf {C}}_{\textit{zz}}}{\mathbf {G}}^\mathrm {T}= {\mathbf {C}}_{yz} {{\mathbf {C}}_{\textit{zz}}^{-1}}{\mathbf {C}}_{zy}$$ with $${\mathbf {C}}_{yy}$$. However, most data-assimilation practitioners are unaware that this replacement introduces another approximation if $${\mathbf {g}}({\mathbf {z}})$$ is nonlinear. In the following chapter, we will come back to this issue when discussing a low-rank ensemble approximation of the prior covariance matrix that leads to efficient ensemble-data-assimilation methods.
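The following sketch illustrates how different the two terms can be, using a deliberately extreme nonlinear example $$y = z^2$$ with $$z \sim \mathcal {N}(0,1)$$ chosen purely for demonstration. Here the regressed covariance $${\mathbf {C}}_{yz} {{\mathbf {C}}_{\textit{zz}}^{-1}}{\mathbf {C}}_{zy}$$ is near zero, while $${\mathbf {C}}_{yy}$$ is close to 2:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
z = rng.normal(size=(N, 1))
y = z**2                                 # strongly nonlinear g, chosen for illustration

A = z - z.mean(axis=0)
B = y - y.mean(axis=0)
C_zz = A.T @ A / (N - 1)
C_yz = B.T @ A / (N - 1)
C_yy = B.T @ B / (N - 1)

regressed = C_yz @ np.linalg.inv(C_zz) @ C_yz.T   # G C_zz G^T of Eq. (7.11)
# For y = z^2, the cross covariance involves E[z^3] = 0, so the regressed term is
# near zero, while C_yy = Var(z^2) is close to 2: the two are not interchangeable.
```

For a linear $${\mathbf {g}}$$, the two quantities coincide, which is why the replacement is exact only in the Gauss-linear case.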