1 Introduction

In Bayesian methods of data assimilation, it is necessary to specify the prior joint probability density for the model parameters (the “prior” for short). The prior for parameters of a subsurface reservoir model (properties such as permeability and porosity) may be based partly on data that have already been assimilated, such as log or core measurements and seismic surveys. The choice of a prior joint probability is also often influenced by the joint distributions of properties observed in modern geological analogues of ancient depositional environments.

The choice of the prior is a challenge in Bayesian inference (Scales and Tenorio 2001). In almost all applications of ensemble Kalman-based data assimilation methods, the prior is chosen to be multivariate normal with fully specified prior mean and prior covariance. In these cases, the prior mean largely determines the average properties in regions where the data are not sensitive to parameters of the model, while the choice of the covariance determines the smoothness of the spatial distributions, the variability of magnitudes, and the orientation and the range of correlation of the spatially distributed parameter fields. In most subsurface applications, it is difficult to select these parameters (Malinverno and Briggs 2004).

In many applications of Bayesian data assimilation methods to field cases, the prior is found to have been specified too narrowly: it does not allow for the possibility of events that could (and perhaps will) occur in the future, and it is often inconsistent with historical measurements of flow and transport. If a prior model that is inconsistent with data is used for data assimilation, the result will be biased estimates and forecasts, implausible parameter values, and unjustified reduction in uncertainty (Moore and Doherty 2005; Oliver and Alfonzo 2018). At the end of an expensive model-building and calibration exercise, it will be apparent that the model is inadequate and will need to be rebuilt.

Although allowing for uncertainty in the prior mean is not common in ensemble-based history matching, its usefulness has been demonstrated in both synthetic and field cases. Li et al. (2010) allowed for a uniform adjustment to the mean, while Zhang and Oliver (2011) allowed the prior mean to be characterized by a trend surface whose coefficients were uncertain. In some cases (such as the Brugge benchmark case), the use of a hierarchical model with uncertainty in the mean permeability was not necessary for matching production data, but the hierarchical model provided a more reasonable explanation of the bias in the initial forecasts (Chen and Oliver 2010).

Uncertainty in the prior covariance is widely recognized as a key aspect of overall uncertainty, but careful treatment of covariance uncertainty in history matching is rare. A practical approach to parameterizing the uncertainty in the covariance is to assume a family of covariance functions with hyperparameters that control the smoothness, the covariance range, the variance, and the anisotropy. If these hyperparameters are fixed at inappropriate values for data assimilation, the ability to assimilate flow data will often be limited. One approach to treating uncertainty in the hyperparameters is to use the concept of scenarios to represent a discrete number of alternative values for the hyperparameters. Park et al. (2013) demonstrated the usefulness of this approach for a case in which the orientation of the anisotropy was assumed to be one of two possible angles. The approach used by Emerick (2016) was somewhat similar, except that weights for the scenarios were computed using an iterative ensemble smoother. Although Malinverno and Briggs (2004) did not consider uncertainty in anisotropy, they allowed uncertainty in five hyperparameters of the covariance for a problem of inferring compressional wave slowness in a one-dimensional earth model.

If the covariance is allowed to be uncertain in an ensemble Kalman-based approach, the parameterization of the model and the choice of the covariance model are both important to successful data assimilation. Chada et al. (2018) investigated the effect of both centered and non-centered parameterizations (described in Sect. 2.1) of a hierarchical spatial model for use with ensemble Kalman inversion (EKI). In their experiments, which included nonstationary hyperparameters, they concluded that a hierarchical approach with a non-centered parameterization was significantly better than a hierarchical approach with a centered parameterization, and that both hierarchical approaches were better than a non-hierarchical approach. Subsequently, Dunlop et al. (2020) showed that for a linear inverse problem, the centered parameterization is to be preferred when the goal is maximum a posteriori (MAP) estimation of the hyperparameters rather than uncertainty quantification.

2 The Hierarchical Model

Gaussian priors are often used in data assimilation because they are relatively tolerant of misspecification errors, but the choice of the parameters of the covariance still has an influence on the quality of the final data match. In Sect. 4.2, the feasibility of using the hybrid IES to assimilate data in a model with an uncertain prior covariance is investigated. In that example, the “true” model that generated the data has an anisotropic covariance with a longer range in one direction. If one were fortunate enough to know the covariance model from which the truth was sampled, it would not be necessary to use a hierarchical model, since a good average total squared data mismatch (e.g., \(S_d^o = 365\) for the correct covariance in the two-dimensional flow problem) could be achieved without consideration of uncertainty in hyperparameters. On the other hand, if the prior covariance is mistakenly fixed at an incorrect orientation (off by \(\pi /2\)), the misfit to data is substantially worse (\(S_d^o = 2142\)), and the ability to accurately forecast future behavior also suffers. Uncertainty in the covariance can be easily accounted for by introducing a hierarchical model for which the anisotropy and ranges are uncertain. By applying the hierarchical model in this problem and using the hybrid iterative ensemble smoother for data assimilation, it is possible to obtain a mean squared data mismatch (\(S_d^o = 1151\)) that is intermediate between the value obtained using the correct covariance and that obtained using a covariance with incorrect orientation. When hierarchical modeling is an appropriate approach to describing uncertainty, it provides more robust priors for history matching and reduces the need for rebuilding the model and re-history matching. Unfortunately, hierarchical models are more nonlinear than non-hierarchical models, and data assimilation or history matching is consequently more difficult. A straightforward application of an iterative ensemble smoother with localization to the hierarchical problem with uncertain anisotropy results in an extremely poor match to data (\(S_d^o = 13{,}000\)). It appears that in order to use a hierarchical model with ensemble-based data assimilation, modification of the data assimilation methodology will generally be required, as presented in Sect. 3.3.

The probability distribution of a Gaussian random variable (denoted m) is completely determined by its mean and covariance. In this study, we will assume that the mean \(m_{pr}\) is known but that the covariance \(C_m\) is uncertain. There are a number of parameters that could describe the uncertainty in \(C_m\); for two-dimensional fields, we will focus on the correlation range \(\rho \), the orientation of the anisotropy \(\phi \), and the ratio of the ranges in the two principal directions. For a one-dimensional field, we will focus on problems in which the variance and the correlation range are uncertain.

2.1 Hierarchical Model Parameterization: Centered and Non-centered

There are two common parameterizations for Gaussian hierarchical problems: the centered (or natural) parameterization and the non-centered parameterization (Papaspiliopoulos et al. 2007). In the centered parameterization, the natural variables (denoted m for model parameters) are augmented with a set of hyperparameters \(\theta \) that characterize the covariance and perhaps the mean. With the augmented parameters, the set of parameters that must be sampled can be written as \(x = (m,\theta )\). The centered parameterization utilizes the conditional independence of the data d and the hyperparameters of the distribution \(\theta \) given the observable parameters m, so it is relatively easy to implement a Gibbs sampler for a centered parameterization. The disadvantage of the centered parameterization for the ensemble form of data assimilation, however, is that the objective function contains the term \((m-m_{\mathrm {pr}})^{\mathrm {T}}C_m^{-1}(m-m_{\mathrm {pr}})\). The matrix \(C_m\) depends on the parameters \(\theta \), so that an ensemble approximation of \(C_m\) is not appropriate.

In the non-centered parameterization, the relationship between the natural Gaussian variable m and the non-centered parameters can be written

$$\begin{aligned} m = m_{\mathrm {pr}} + L(\theta ) z, \end{aligned}$$
(1)

where L is a “square root” of the model covariance matrix, i.e., \(L L^{\mathrm {T}}= C_m\), and z is a vector of independent standard normal deviates with the same dimension as m. In the non-centered parameterization, \(\theta \) and z are a priori independent. Thus, for the non-centered parameterization, we can write the entire set of parameters \(x = (z, \theta )\). Note that the prior probability density function (pdf) for x is Gaussian in the non-centered parameterization, making it suitable for an ensemble form of data assimilation.

Although Eq. 1 is used in this paper to generate realizations of m, it will only be feasible for large grids if the square root of the covariance, L, is compact (as it is for the spherical covariance), or if the correlation range is sufficiently short that most elements of L are effectively zero. Other methods could be more efficient for large grids. In particular, it may often be advantageous to use either the inverse of the covariance matrix (the precision matrix) or a factorization of the precision matrix (Stojkovic et al. 2017), or a Matérn–Whittle covariance model (Gneiting et al. 2010) for which stochastic partial differential equations can be used efficiently to generate a random field (Roininen et al. 2019; Zhou et al. 2018). Another relatively common approach for describing the uncertainty in Gaussian hierarchical models is to assume that the prior distribution for the covariance matrix is inverse Wishart (Myrseth and Omre 2010; Tsyrulnikov and Rakitko 2017). While this approach has some computational advantages, it seems less likely to describe realistic prior uncertainty in the covariance.
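To make Eq. 1 concrete, the following is a minimal sketch (not the implementation used in this paper) of generating a realization of m on a one-dimensional grid, using the analytic convolution square root of the Gaussian covariance given later in Sect. 3.3.1. The function names, the grid, the zero prior mean, and the \(\sqrt{\mathrm{d}x}\) scaling that discretizes the convolution are illustrative assumptions.

```python
# A minimal sketch of Eq. 1, assuming a Gaussian (squared exponential)
# covariance and the analytic convolution square root from Sect. 3.3.1.
import numpy as np

def L_gaussian(grid, sigma, a):
    """Discretized L(theta) with entries L_s(x_i - x_k) * sqrt(dx), so that
    L @ L.T approximates C(r) = sigma^2 exp(-r^2 / a^2)."""
    dx = grid[1] - grid[0]
    r = grid[:, None] - grid[None, :]
    return sigma * (4.0 / (a**2 * np.pi))**0.25 * np.exp(-2.0 * r**2 / a**2) * np.sqrt(dx)

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 150)
z = rng.standard_normal(grid.size)        # z: iid standard normal deviates
theta = (1.0, 0.2)                        # (sigma, a): arbitrary illustrative values
m = 0.0 + L_gaussian(grid, *theta) @ z    # Eq. 1 with m_pr = 0 assumed
```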

3 Data Assimilation for Gaussian Hierarchical Models

For large geoscience inverse problems, iterative ensemble smoothers are often an effective approach. These are all based on the original development of the ensemble Kalman filter (Evensen 1994), which has several advantages over Kalman filters or extended Kalman filters: a low-rank approximation of the covariance matrix is used instead of the full covariance, and the linearization of the relationship between predicted data and model parameters is approximated without requiring adjoints. For geoscience inverse problems, it has been found that it is more efficient to use a “smoother” to update the parameters of the inverse problem using all the data simultaneously. On the other hand, the problem of parameter estimation becomes more nonlinear when all data are assimilated simultaneously, so iteration is almost always required when a smoother is used for history matching. Although there are many variants of iterative ensemble smoothers, they can generally be classified into one of two approaches. In multiple data assimilation (MDA), the same data are assimilated multiple times with an inflated observation error (Reich 2011; Emerick and Reynolds 2013). This approach reduces the nonlinearity in the update, but also requires updating the approximation of the covariance matrix at each iteration. That appears to be more difficult for hierarchical models than for models in which the prior covariance is assumed to be known.
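In MDA, the observation-error inflation coefficients \(\alpha _k\) must satisfy \(\sum _k \alpha _k^{-1} = 1\) (Emerick and Reynolds 2013) so that the repeated assimilations are consistent with a single Bayesian update in the linear Gaussian case. The snippet below only illustrates this constraint; the uniform choice \(\alpha _k = N_a\) is a common convention, not a requirement.

```python
# A sketch of the MDA inflation constraint: 1/alpha_1 + ... + 1/alpha_Na = 1.
N_a = 4
alphas = [float(N_a)] * N_a                 # uniform inflation, one per pass
assert abs(sum(1.0 / a for a in alphas) - 1.0) < 1e-12
```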

The second class of iterative ensemble smoothers is based on the randomized maximum likelihood (RML) approach to approximate sampling from the posterior (Kitanidis 1995; Oliver et al. 1996, 2008). In the RML approach, samples from the prior are updated to become approximate samples from the posterior by minimizing a stochastic objective function. Like MDA, this method is exact for linear Gaussian data assimilation problems, but it does not require updating of the covariance at each iteration. In the iterative ensemble smoother form of RML (Chen and Oliver 2012), an average sensitivity, computed from the ensemble of samples, is used to approximate the downhill direction. Because an ensemble average sensitivity will not provide an accurate sensitivity when the problem is highly nonlinear, a hybrid data assimilation method that is a blend of RML and IES is introduced, in which some derivatives are computed analytically while other derivatives are estimated from the ensemble. Finally, for the linear observation case, results will be compared with the marginal-then-conditional (MTC) approach introduced by Fox and Norton (2016).

The unnormalized posterior pdf for the model parameters in the non-centered parameterization can be written as

$$\begin{aligned}&p(z , \theta \mid d) \propto \exp \left( -\frac{1}{2} (d^o-g(m(\theta , z)))^{\mathrm {T}}C_d^{-1}(d^o-g(m(\theta , z))) \right) \nonumber \\&\quad \qquad \qquad \times \exp \left( -\frac{1}{2} (\theta -{\bar{\theta }})^{\mathrm {T}}C_\theta ^{-1}(\theta -{\bar{\theta }}) -\frac{1}{2} z^{\mathrm {T}}z \right) \end{aligned}$$
(2)

or more simply

$$\begin{aligned}&p(x \mid d) \propto \exp \left( -\frac{1}{2} (d^o-g(m(x)))^{\mathrm {T}}C_d^{-1}(d^o-g(m(x))) \right) \nonumber \\&\quad \, \exp \left( -\frac{1}{2} (x-{\bar{x}})^{\mathrm {T}}C_x^{-1}(x -{\bar{x}}) \right) . \end{aligned}$$
(3)

The non-centered parameterization appears to be well suited to data assimilation using an iterative ensemble smoother when the prior pdfs for both z and \(\theta \) are assumed Gaussian (perhaps after transformation), as the pdf for x (Eq. 3) is identical in form to the pdf for m in traditional Bayesian history matching. Nonlinearity is a result either of the relationship \(d=g(m)\) being nonlinear, as is the case if the data are water rates and the model parameters are porosity and log-permeability, or of nonlinearity in the relationship between m and \(\theta \). Although Eq. 3 is written for the case in which the prior for the hyperparameters is Gaussian, when the uncertainty in the orientation of the anisotropy is moderately large, a Gaussian approximation is not appropriate. We discuss that case in Appendix A.
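For concreteness, the negative log of the unnormalized posterior in Eq. 3 can be evaluated as in the sketch below; the callables `g` and `m_of_x` and the precision matrices are hypothetical placeholders for the forward model, the non-centered map of Eq. 1, and the inverses of \(C_d\) and \(C_x\).

```python
# A sketch of the negative log of the unnormalized posterior in Eq. 3.
import numpy as np

def neg_log_posterior(x, x_bar, C_x_inv, d_obs, C_d_inv, g, m_of_x):
    r_d = d_obs - g(m_of_x(x))    # data residual, d^o - g(m(x))
    r_x = x - x_bar               # prior residual in x = (z, theta)
    return 0.5 * r_d @ C_d_inv @ r_d + 0.5 * r_x @ C_x_inv @ r_x
```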

3.1 Data Assimilation: RML

The randomized maximum likelihood approach begins, like the perturbed observation form of the ensemble Kalman filter (Burgers et al. 1998; Houtekamer and Mitchell 1998), by drawing samples \((x_i',\epsilon _i')\), \(i=1,\ldots ,N_s\), from the Gaussian distribution

$$\begin{aligned} q_{X'\epsilon '} (x',\epsilon ') \propto \exp \left( -\frac{1}{2} (x'-{\bar{x}})^{\mathrm {T}}C_x^{-1}(x' -{\bar{x}}) \right) \exp \left( - \frac{1}{2} \left( \epsilon ' \right) ^{\mathrm {T}}C_\epsilon ^{-1}\epsilon ' \right) \end{aligned}$$
(4)

for a given \({\bar{x}}\). The ith approximate sample from the posterior is obtained by computing the minimizer of the cost functional

$$\begin{aligned} J_i(x)&= \frac{1}{2} \left( x- x_i' \right) ^{\mathrm {T}}C_x^{-1}\left( x- x_i' \right) \nonumber \\&\quad \, + \frac{1}{2} \left( g(m(x)) + \epsilon '_i - d^o \right) ^{\mathrm {T}}C_d^{-1}\left( g(m(x)) + \epsilon '_i - d^o \right) . \end{aligned}$$
(5)

This is usually done by solving

$$\begin{aligned} \nabla _x J_i(x) = 0 \end{aligned}$$
(6)

for x. If Levenberg–Marquardt with a Gauss–Newton approximation of the Hessian is used for the minimization, the \(\ell \)th update is of the form

$$\begin{aligned} \delta x_{\ell } = \frac{x'_i-x_{\ell }}{1+\uplambda _{\ell }}- C_x G_{\ell }^{\mathrm {T}}\biggl [ (1+\uplambda _{\ell }) C_d + G_{\ell } C_x G_{\ell }^{\mathrm {T}}\biggr ]^{-1}\biggl [(g(m(x_{\ell })) + \epsilon '_i - d^{o}) - \frac{ G_{\ell } (x_{\ell } - x'_i) }{1+\uplambda _{\ell }} \biggr ], \end{aligned}$$
(7)

where \(G^{\mathrm {T}}= \nabla _{x} \big (g^{\mathrm {T}}\big )\), and \(\uplambda _{\ell }\) is the Levenberg–Marquardt regularization parameter. This method of sampling from the posterior distribution is only exact if the relationship between the data and the model parameters is linear. It will sample accurately from multimodal distributions in some situations, but exact sampling using RML requires computation of additional critical points and weighting of solutions (Ba et al. 2022). For geoscience applications, the standard unweighted RML is almost always used.
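A single step of Eq. 7 can be written compactly as in the sketch below, assuming the full sensitivity \(G = \nabla _x g^{\mathrm{T}}\) is available (analytically or from an adjoint); `g_of_x` composes the forward model with the map \(m(x)\), and all names are illustrative.

```python
# A sketch of one Levenberg-Marquardt RML step (Eq. 7) for realization i.
import numpy as np

def rml_lm_step(x, x_prime, eps_prime, d_obs, g_of_x, G, C_x, C_d, lam):
    """Return x + delta_x with delta_x as in Eq. 7."""
    residual = g_of_x(x) + eps_prime - d_obs - G @ (x - x_prime) / (1.0 + lam)
    CG = C_x @ G.T                                  # C_x G^T
    S = (1.0 + lam) * C_d + G @ CG                  # (1 + lambda) C_d + G C_x G^T
    return x + (x_prime - x) / (1.0 + lam) - CG @ np.linalg.solve(S, residual)
```

In practice, \(\uplambda \) is adapted between steps, decreased after a successful step and increased otherwise, as described for the examples in Sect. 4.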

3.2 Data Assimilation: IES

One disadvantage of RML is the need for computation of the gradient of the objective function. For many forward models of the data such as reservoir production simulation, the derivatives are not readily available. In those cases, the iterative ensemble smoothers offer an alternative that avoids the need to compute G directly. The basic idea is that terms that appear in Eq. 7 can often be computed efficiently using ensemble approximations. In an ensemble-based approach to sampling from the posterior, Eq. 7 is replaced by the following update step (Chen and Oliver 2013)

$$\begin{aligned} \begin{aligned} \delta x_{\ell +1}&= - \frac{1}{(1+\uplambda _\ell )} \Delta x_{\ell } \, \Delta x_{\ell }^{\mathrm {T}}C_x^{-1}(x_\ell -x'_i) \\&\qquad - \Delta x_{\ell } \, \Delta d_{\ell }^{\mathrm {T}}\left( (1+ \uplambda _\ell ) C_{d} + \Delta d_{\ell } \Delta d_\ell ^{\mathrm {T}}\right) ^{-1}\\&\qquad \times \Biggl ( g(m_\ell ) + \epsilon '_i - d^o - \frac{1}{(1+\uplambda _\ell )} \Delta d_{\ell } \Delta x_{\ell }^{\mathrm {T}}C_x^{-1}(x_\ell - x'_i) \Biggr ), \end{aligned} \end{aligned}$$
(8)

where \(\Delta x_{\ell } = (X_\ell - {\bar{X}}_\ell )/\sqrt{N_e-1}\), and similarly for \(\Delta d_{\ell }\). If this approach is applied directly, any update is restricted to the space spanned by the initial ensemble, and the number of degrees of freedom available for calibration is \(N_e-1\), where \(N_e\) is the number of realizations in the initial ensemble. To avoid this limitation, localization is nearly always applied in large problems, but for clarity, localization has been omitted in Eq. 8. The main disadvantage of an ensemble-based methodology is the failure to handle strong nonlinearity well, primarily because the same average sensitivity is used for all samples.
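The following sketch shows the ensemble quantities entering Eq. 8 for one member, without localization (as in Eq. 8 itself). `X` and `D` are \(N_x \times N_e\) and \(N_d \times N_e\) arrays of parameters and predicted data; the names are illustrative.

```python
# A sketch of the IES update (Eq. 8) for one ensemble member.
import numpy as np

def anomalies(A):
    """Scaled anomalies, e.g., Delta x = (X - mean(X)) / sqrt(N_e - 1)."""
    return (A - A.mean(axis=1, keepdims=True)) / np.sqrt(A.shape[1] - 1)

def ies_step(X, D, x, x_prime, eps_prime, d_obs, g_m, C_x_inv, C_d, lam):
    dX, dD = anomalies(X), anomalies(D)
    # (1/(1+lambda)) * Delta x^T C_x^{-1} (x - x'), shared by both terms of Eq. 8
    prior_term = dX.T @ C_x_inv @ (x - x_prime) / (1.0 + lam)
    S = (1.0 + lam) * C_d + dD @ dD.T
    residual = g_m + eps_prime - d_obs - dD @ prior_term
    return x - dX @ prior_term - dX @ dD.T @ np.linalg.solve(S, residual)
```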

3.3 Data Assimilation: Hybrid IES

The hybrid IES is a method that takes advantage of the ability of the RML to use different gain matrices for each sample and of the ability of the IES to avoid the need for adjoint systems. As in the RML method, the transpose of the sensitivity of the data to the model parameters in the non-centered parameterization is

$$\begin{aligned} G^{\mathrm {T}}= G(x)^{\mathrm {T}}= \begin{bmatrix} \nabla _{x} g_{1}&\quad \nabla _{x} g_{2}&\quad \ldots&\quad \nabla _{x} g_{N_d} \end{bmatrix} = \nabla _{x} \big (g^{\mathrm {T}}\big ) \end{aligned}$$
(9)

and similarly define the sensitivity of data to the observable model parameters m,

$$\begin{aligned} G_m^{\mathrm {T}}= G(m)^{\mathrm {T}}= \begin{bmatrix} \nabla _{m} g_{1}&\quad \nabla _{m} g_{2}&\quad \ldots&\quad \nabla _{m} g_{N_d} \end{bmatrix} = \nabla _{m} \big (g^{\mathrm {T}}\big ). \end{aligned}$$
(10)

These two sensitivity matrices are related by the chain rule,

$$\begin{aligned} G^{\mathrm {T}}= \nabla _{x} \big (g^{\mathrm {T}}\big ) = \nabla _{x} \big (m^{\mathrm {T}}\big ) \cdot \nabla _{m} \big (g^{\mathrm {T}}\big ) = \nabla _{x} \big (m^{\mathrm {T}}\big ) \cdot G_{m}^{\mathrm {T}}\end{aligned}$$
(11)

or

$$\begin{aligned} G = G_{m} \cdot \left( \nabla _{x} \big (m^{\mathrm {T}}\big )\right) ^{\mathrm {T}}. \end{aligned}$$
(12)

For notational simplicity, the Jacobian of the observable model parameters with respect to the non-centered hierarchical parameters is written as

$$\begin{aligned} M_x= \left( \nabla _{x} \big (m^{\mathrm {T}}\big )\right) ^{\mathrm {T}}, \end{aligned}$$
(13)

where the dimension of \(M_x\) is \(N_m \times N_x\). Substituting \(G = G_m M_x\) into the RML update expression (Eq. 7), with an ensemble representation of \(G_m\), results in the hybrid IES data assimilation approach

$$\begin{aligned} \delta x&= - \frac{1}{(1+\uplambda )} (x-x') - C_x M_x^{\mathrm {T}}G_m^{\mathrm {T}}\left( (1+ \uplambda ) C_{d} + G_m M_x C_x M_x^{\mathrm {T}}G_m^{\mathrm {T}}\right) ^{-1}\nonumber \\&\qquad \times \left( g(m)+\epsilon ' - d^o - \frac{1}{(1+\uplambda )} G_m M_x (x - x') \right) \nonumber \\&= - \frac{1}{(1+\uplambda )} (x-x') \nonumber \\&\qquad - C_x M_x^{\mathrm {T}}(\Delta m)^{-\mathrm {T}}\, \Delta d^{\mathrm {T}}\left( (1+ \uplambda ) C_{d} + \Delta d \, (\Delta m)^{-1}M_x C_x M_x^{\mathrm {T}}(\Delta m)^{-\mathrm {T}}\, \Delta d^{\mathrm {T}}\right) ^{-1}\nonumber \\&\qquad \times \left( g(m)+\epsilon ' - d^o - \frac{1}{(1+\uplambda )} \Delta d \, (\Delta m)^{-1}M_x (x - x') \right) . \end{aligned}$$
(14)

Note that in the hybrid IES method, each ensemble member has its own Kalman gain matrix, since the sensitivity \(M_x\) of the physical model parameters m to the non-centered parameters x is specific to an ensemble member: m is determined from the known mean \(m_{pr}\), the uncertain covariance \(C_m\), and the stochastic variable z, i.e., \(m = m_{pr} + L(\theta ) z\), where \(L L^{\mathrm {T}} = C_m\) and \(\theta \) denotes the covariance hyperparameters.
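A sketch of one member's update from Eq. 14 is given below. The ensemble part of the sensitivity, \(G_m \approx \Delta d \, (\Delta m)^{-1}\), is formed here with a pseudo-inverse, while \(M_x\), the member-specific analytic Jacobian of m with respect to \(x = (z, \theta )\), is assumed to be supplied; all names are illustrative.

```python
# A sketch of one member's hybrid IES update (Eq. 14).
import numpy as np

def hybrid_ies_step(x, x_prime, eps_prime, d_obs, g_m, dM, dD, M_x, C_x, C_d, lam):
    G_hat = dD @ np.linalg.pinv(dM) @ M_x     # G_m M_x with ensemble G_m
    CG = C_x @ G_hat.T                        # C_x M_x^T G_m^T
    S = (1.0 + lam) * C_d + G_hat @ CG
    residual = g_m + eps_prime - d_obs - G_hat @ (x - x_prime) / (1.0 + lam)
    return x - (x - x_prime) / (1.0 + lam) - CG @ np.linalg.solve(S, residual)
```

Because \(M_x\) depends on the member's own hyperparameters, each member receives its own gain matrix, which is the key difference from the standard IES step above.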

3.3.1 Factorization of Covariance Functions

The hybrid approach will only be useful if computation of the terms in Eq. 14 is practical. First, note that \(C_x\) is diagonal, so that it can be eliminated by scaling of the variables. The remaining term is \(M_x\); since \(m = m_{\mathrm {pr}} + L(\theta )z\), its columns are \(\partial m/\partial z = L(\theta )\) and \(\partial m/\partial \theta _j = (\partial L/\partial \theta _j)\, z\), so computing \(M_x\) analytically relies on factorability of the covariance matrix. For functions in one dimension, the equivalent to defining \(C_m = L L^{\mathrm {T}}\) is to define

$$\begin{aligned} C(r) = f *f^{\mathrm {T}}= \int _{-\infty }^{+\infty } f(s) f(r+s) \mathrm{d}s. \end{aligned}$$
(15)

For illustration, consider the one-dimensional exponential covariance function \(C(r) = \sigma ^2 e^{ - \mid r\mid /a}\) where r is the distance between two variables, a is a measure of the range of the correlation, and \(\sigma ^2\) is the variance. In one dimension, a symmetric factorization is \(L_s(r) = \frac{ \sigma \sqrt{2} }{ \sqrt{a} \pi }\ K_0(\mid r\mid / a)\), where \(K_0\) is the modified Bessel function of the second kind of order 0. The factorizations are not unique, however, and it may be beneficial to use a factorization for which the “square root” is not singular. A one-sided factorization that is always finite is \(L_a(r) = \sigma \sqrt{2/a} \ H(r) \exp (-r/a)\) where H(r) is the Heaviside function. Similarly, the symmetric factorization of the squared exponential or Gaussian covariance \(C(r) = \sigma ^2 e^{ - r^2 /a^2}\) can be shown to be \(L_s(r) = \sigma \left( \frac{ 4 }{ a^2 \pi } \right) ^{1/4} \exp (- 2 r^2 /a^2)\).
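These factorizations are easy to verify numerically. The short check below (not from the paper) confirms that the one-sided factorization \(L_a\) reproduces the exponential covariance through the convolution in Eq. 15; the truncation of the integral and the grid resolution are assumptions.

```python
# A quick numerical check that L_a(r) = sigma*sqrt(2/a)*H(r)*exp(-r/a)
# reproduces C(r) = sigma^2 exp(-|r|/a) through Eq. 15, for r >= 0.
import numpy as np

sigma, a = 1.0, 0.1
s = np.linspace(0.0, 10 * a, 20001)          # H(r) restricts the integral to s >= 0
ds = s[1] - s[0]
L_a = lambda r: sigma * np.sqrt(2.0 / a) * np.exp(-r / a)

for r in (0.0, 0.05, 0.1, 0.2):
    conv = np.sum(L_a(s) * L_a(r + s)) * ds  # int_0^inf L_a(s) L_a(r+s) ds
    print(f"r={r:4.2f}  convolution={conv:.4f}  exact={sigma**2 * np.exp(-r / a):.4f}")
```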

Similar factorizations in two and three dimensions can easily be derived for other covariance functions such as the spherical covariance in three dimensions, the circular covariance in two dimensions, and Whittle’s covariance (Oliver 1995). For example, in a two-dimensional stationary field with geometric anisotropy and positions u, we define \(r= \sqrt{(u-u')^{\mathrm {T}}H (u-u')}\) and compute convolution square roots of the covariance as described above. In this paper, the square root of a two-dimensional Gaussian covariance model is used. Alternatively, one can start with a specific covariance “square root” and compute the corresponding covariance function (Gaspari and Cohn 1999). In this study, the factorization of the covariance function has been used to obtain an approximation of the factorization of the covariance matrix. A scale space implementation might be justified if additional accuracy were required (Lindeberg 1990).

4 Numerical Examples

Two numerical examples are presented to illustrate data assimilation for Gaussian models with uncertainty in the covariance. In the first example, the model is defined on a one-dimensional grid, and the observations are of m. In the second example, the uncertain permeability field in a two-dimensional porous medium is estimated by assimilation of a time series of water cut observations at six producing wells.

4.1 One-Dimensional Linear Gaussian

Consider a simple one-dimensional Gaussian random field, discretized on the interval [0, 1] into \(n_m=150\) lattice points. The random variable m is observed at every fourth lattice point, with independent measurement noise characterized by \(\sigma _d = 0.01\). The data-generating model has correlation range \(a^{tr} = 0.100\) and standard deviation \(\sigma _m^{tr} = 1.08\). The noisy observations are shown in Fig. 1a.

Fig. 1
figure 1

The one-dimensional linear test problem with hierarchical prior

The form of the covariance of the data-generating model \(C_m\) (Gaussian, i.e., squared exponential) is assumed to be known—only the correlation range a and the model variance \(\sigma ^2\) are uncertain. Because both parameters are required to be positive, log-normal distributions are assumed for each: \(\theta _1 \equiv \log \sigma _m \sim N(-0.22, 0.5^2)\) and \(\theta _2 \equiv \log a \sim N(-2.3, 0.6^2)\). Ten unconditional samples of m from the prior are shown in Fig. 1b.
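A short sketch (not the paper's code) of drawing such unconditional prior realizations is given below: hyperparameters are drawn from their log-normal priors, then m from Eq. 1. An eigendecomposition square root is used here for numerical robustness; the seed and the zero prior mean are assumptions.

```python
# A sketch of sampling the hierarchical prior of Sect. 4.1 (as in Fig. 1b).
import numpy as np

def cov_sqrt(grid, sigma, a):
    """Square root of C_ij = sigma^2 exp(-(x_i - x_j)^2 / a^2) via eigh."""
    r = grid[:, None] - grid[None, :]
    C = sigma**2 * np.exp(-r**2 / a**2)
    w, V = np.linalg.eigh(C)
    return V * np.sqrt(np.clip(w, 0.0, None))   # L with L @ L.T = C

rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 150)
samples = []
for _ in range(10):
    sigma_m = np.exp(rng.normal(-0.22, 0.5))    # theta_1 = log sigma_m
    a = np.exp(rng.normal(-2.3, 0.6))           # theta_2 = log a
    z = rng.standard_normal(grid.size)          # latent iid N(0,1) field
    samples.append(cov_sqrt(grid, sigma_m, a) @ z)
```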

The posterior pdf can therefore be written as

$$\begin{aligned}&p(z,\theta \mid d) \propto \exp \biggl ( -\frac{1}{2} (d^o-g(m(\theta ,z)))^{\mathrm {T}} C_d^{-1} (d^o-g(m(\theta ,z))) \nonumber \\&\quad \, -\frac{1}{2} (\theta -\theta _{pr})^{\mathrm {T}}C_\theta ^{-1}(\theta -\theta _{pr}) -\frac{1}{2} z^{\mathrm {T}}z \biggr ), \end{aligned}$$
(16)

and the corresponding objective function for a minimization-based sampling approach is given by Eq. 5. Results from the three different data assimilation approaches described in Sect. 3 (RML, IES, and hybrid IES) are compared with the exact distribution of hyperparameters obtained using marginalization for the centered parameterization (Rue and Held 2005; Rue and Martino 2007; Fox and Norton 2016) and with the true values of the hyperparameters.

Fig. 2
figure 2

Reduction of squared data mismatch to perturbed data (Eq. 17) for three data assimilation algorithms

Note that although RML does not use information from the ensemble for data assimilation—each realization is generated independently—100 RML samples were generated for the comparison. It is necessary to set a few algorithmic parameters for the minimizations. For RML with Levenberg–Marquardt minimization, an initial value of \(\uplambda = 5000\) was used; it was decreased by a factor of 4 if the objective decreased and increased by a factor of 4 otherwise. For each minimization, the initial model was chosen to be the corresponding sample from the prior. For IES and hybrid IES, the initial value of \(\uplambda \) was based on the initial mean value of the data mismatch objective function (Chen and Oliver 2013). An ensemble size of 200 was used for the IES, and an ensemble size of 100 was used for the hybrid IES.

Fig. 3
figure 3

Samples of m from the posterior for three data assimilation methods (top row) and comparison of prior and posterior distributions of hyperparameters (bottom row)

The reduction in the squared mismatch of simulated data to perturbed observations (Eq. 17) was initially rapid for all three methods, although a few realizations with long initial correlation ranges failed to converge for both the RML and the hybrid IES (Fig. 2). The smallest mismatch values with perturbed observations were achieved by the RML, for which the data objective function

$$\begin{aligned} S_d^{\mathrm {rml}}(m) = \frac{1}{2} \sum _{i=1}^{N_d} \frac{(g_i(m) - (d_i^o + \epsilon _i))^2}{\sigma _d^2} \end{aligned}$$
(17)

attained values smaller than one in several cases. While this objective is useful for monitoring convergence, for model checking it is more useful to evaluate the squared data mismatch between predicted data from the calibrated models and the actual observations; the expected value of that metric is

$$\begin{aligned} E[S_d^{obs}(m)] = \frac{1}{2} \sum _{i=1}^{N_d} \frac{(g_i(m) - d_i^o )^2}{\sigma _d^2} = \frac{1}{2} N_d =19. \end{aligned}$$
(18)

The average values from all three methods are close to the expected value.

Fig. 4
figure 4

Posterior estimates of standard deviation of the model as functions of position, for three data assimilation methods applied to the one-dimensional linear problem

The a posteriori realizations of m from all three methods also look generally plausible (top row of Fig. 3), although it appears that realizations from the RML and IES have correlation lengths that are generally shorter than the correlation length in the data-generating model. Qualitatively, it appears that the standard deviation of the posterior model realizations is somewhat larger for the IES method than for the other two methods. Figure 4 confirms this quantitatively, although the differences between methods are relatively small. Note that the standard error in the observations is 0.1, and the prior standard deviation for m was uncertain.

4.2 Fluid Flow Example

In the second example, the data-generating system is a two-dimensional (\(30\times 15\)) anisotropic porous medium with two-phase (oil and water), immiscible, incompressible flow driven by two injectors and six producing wells. The permeability field that generates the data is a draw from a prior model in which the angle of the principal axis of anisotropy is 0.93 radians, the range parameter for the longest correlation length is 1.0, and the ratio of correlation ranges in the two principal directions is 6.0. The covariance type for log-permeability is set to be “Gaussian,” i.e., squared exponential (Eq. B6) with standard deviation 2.0. The porosity is uniform.

Fig. 5
figure 5

Two-dimensional reservoir flow problem with uncertain prior covariance

The data are measurements of water cut (fraction of produced fluid that is water) as a function of time in the producing wells. Although the permeability is highly variable, the wells are controlled by total flow rate which is identical for all producers. Figure 5a shows the permeability field that was used to generate the observations, and Fig. 5b shows the noisy observations of water cut at each of the wells. The errors in the observations are independent Gaussian with standard error 0.02. For data assimilation, the prior covariance for log-permeability has uncertainty in the orientation of the principal axes of anisotropy and in the range of the correlation in each of the two principal directions. The prior uncertainties for each of the hyperparameters of the covariance are shown as histograms of sampled values in Fig. 6. The priors for correlation length and ratio of correlation ranges are both assumed to be log-normal. The prior for orientation of the principal axes of the covariance for permeability is Gauss–von Mises (Eq. A1), which is close to Gaussian when the variance is relatively small.

Fig. 6
figure 6

Prior uncertainty in each of the hyperparameters of the hierarchical model for the two-dimensional flow problem

4.3 Data Assimilation Results

Data assimilation with the hierarchical parameterization was performed using two methods: a standard Levenberg–Marquardt iterative ensemble smoother and a hybrid IES, also with Levenberg–Marquardt regularization. In both approaches, the parameters being updated are z and the three hyperparameters of the covariance function. Recall that z has the same dimension as m (450) and is defined in Eq. 1. The updates of z were localized in the IES approach using the Gaspari–Cohn correlation function with a taper range that was the same as the true correlation range. Localization was not used in the hybrid IES approach. An ensemble size of 200 was used for the IES, while a smaller ensemble (\(N_e = 100\)) was used for the hybrid IES approach. In both approaches, the starting value of the Levenberg–Marquardt parameter \(\uplambda \) was selected based on the magnitude of the initial data mismatch (Chen and Oliver 2013). If the average squared data mismatch decreased in an iteration, the value of \(\uplambda \) was decreased by a factor of 4; if it increased, \(\uplambda \) was multiplied by a factor of 4. Iterations were stopped if \(\uplambda \) increased in two successive iterations, if the number of iterations exceeded 25, or if the magnitude of the reduction in data mismatch was too small. A sketch of this schedule is given below.
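In the sketch, `step` and `mismatch` are hypothetical placeholders for one smoother update of the whole ensemble (Eq. 8 or Eq. 14) and for the average squared data mismatch; the relative tolerance used for the "too small" stopping test is an assumption beyond what the text specifies.

```python
# A sketch of the Levenberg-Marquardt schedule and stopping rules above.
def run_smoother(ensemble, lam0, step, mismatch, max_iter=25, rel_tol=1e-3):
    lam, n_up = lam0, 0
    S = mismatch(ensemble)
    for _ in range(max_iter):
        trial = step(ensemble, lam)
        S_trial = mismatch(trial)
        if S_trial < S:                        # accept: relax regularization
            if S - S_trial < rel_tol * S:      # reduction too small: stop
                return trial
            ensemble, S, lam, n_up = trial, S_trial, lam / 4.0, 0
        else:                                  # reject: increase lambda
            lam, n_up = lam * 4.0, n_up + 1
            if n_up == 2:                      # lambda rose twice in a row: stop
                return ensemble
    return ensemble
```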

Fig. 7
figure 7

Data assimilation with the IES and hybrid IES. Water cut versus time for 40 realizations from the prior and from the posterior for five producers. Blue dots are noisy observations of water cut

Prior and posterior predictive distributions of water cut are compared with the observed water cut in Fig. 7 for five of the six producers. The prior predictive distribution (top row) “covers” the observations, although the prior distribution for producer 3 appears to be inconsistent with trends in the data. The posterior predictive distribution computed using the IES (middle row) has reduced spread compared to the prior distribution, but the mismatch with actual observations (blue dots) is much larger than would be expected for conditional realizations. The hybrid IES (bottom row) generates a posteriori predictions that are generally consistent with both the observations and with the measurement error, with the exception that the match with data in producer 3 is poor.

The improved match to the data obtained using the hybrid IES approach is confirmed quantitatively in Fig. 8. In both subfigures, the spread of squared data mismatch values is shown as a box which represents the range containing the central 50% of the values. Note that the reduction in the squared data mismatch is slow for the IES and that the final mean value (13,000) is much greater than the expected value for a properly calibrated ensemble (240). The hybrid IES converges much faster and ends at a much smaller mean value of data mismatch (1,151). Although the final value is larger than expected, this is largely a result of the poor match to water cut observations in producer 3.

Fig. 8
figure 8

Reduction in squared mismatch with observed (not perturbed) data. The expected value at convergence is 240. Note that different scales are used for the two plots

Fig. 9
figure 9

The first six realizations of the log-permeability field from the prior (left) and the corresponding realizations from the posterior (right). Darker greens correspond to larger values of log-permeability

Fig. 10
figure 10

Modification of distributions of hyperparameters after assimilation of flow data. Blue is prior distribution. Orange is posterior distribution. Red dashed lines show the values used in the data-generating model

Figure 9 shows the first six ensemble members from the prior and the corresponding ensemble members from the posterior after data assimilation in the hierarchical model using the hybrid IES method. Six realizations are not sufficient to illustrate the results of the data assimilation, but by comparing corresponding realizations, it appears that the spread in orientation of the anisotropy is reduced and that the principal range is largely maintained. The effect of data assimilation on the hyperparameters is illustrated more quantitatively in Fig. 10, which compares the prior distribution for the three hyperparameters (blue) with the posterior marginal distributions (orange) and the values used in the data-generating model (dashed red line). The posterior distribution of orientations (\(\phi \)) is narrower than the prior distribution, but the mean is nearly unchanged from the prior (Fig. 10 (left)). The range parameter (\(\rho \)) shows the largest influence of data assimilation. Short correlation ranges have been eliminated from the posterior, while the mean of the posterior distribution has been shifted to a value that is larger than both the prior mean and the value that was used in the data-generating model (Fig. 10 (center)). If the relationship of data to model hyperparameters were linear, one would expect the posterior mean to lie between the prior and the data-generating value, which is not what is seen here. The mean value of the ratio of correlation ranges in the principal directions (\(\alpha \)) has shifted to a slightly smaller value after data assimilation (Fig. 10 (right)), meaning that the mean correlation range in the second principal direction has increased slightly, but less than the range in the first principal direction.

4.4 Discussion: Benefit of the Hybrid Approach

In Sect. 4.3, the IES method failed to assimilate data properly in a hierarchical model for which the anisotropy in the covariance was uncertain. The problem of estimating the permeability field from flow observations in this case is clearly nonlinear, but IES often works well for highly nonlinear problems, so that does not appear to be the primary explanation. In this type of hierarchical problem, it is relatively straightforward to see why a standard IES may not converge and why the hybrid IES works very well. To illustrate the relative performance of the two algorithms, consider a simpler two-dimensional hierarchical problem in which an ensemble of realizations of m is updated from partial noisy observations of m. The prior covariance in this example has uncertain orientation and ranges, just as in the two-dimensional flow problem but with simpler observation and sensitivity. It is instructive in this case to look at the ensemble estimate of the sensitivity, \(\nabla _z m\), between observation of m at a location near the center of the domain and z for an ensemble with 200 realizations (Fig. 11a) and compare it to the exact sensitivity for one particular realization (Fig. 11b).

Fig. 11
figure 11

Sensitivity of observation of m at (10,4) to values of z at every cell. Note difference in color scales

The ensemble estimate of sensitivity (Fig. 11a) suffers from several problems. The first is that although the ensemble size (200) is larger than typically used for data assimilation, the magnitudes of the spurious correlations are comparable to the estimated magnitude at the observation location. Also, the magnitude of the ensemble estimate of the sensitivity at the observation location is almost an order of magnitude smaller than the correct value. Finally, the region of significant sensitivity in the ensemble estimate is different from the region in the exact calculation. Increasing ensemble size would reduce the spurious correlations and improve the magnitude of sensitivity at the peak, but the region of sensitivity would not improve because the average orientation is different from the orientation in realization 0.

Fig. 12
figure 12

Comparison of purely ensemble-based estimates of the cross-covariance of observation at (10,4) to the values of z (top row) to hybrid estimates of cross-covariance (bottom row)

The effect of ensemble size on the estimation of the cross-covariance between the observation of m at (10,4) and the values of z at every cell is shown in Fig. 12. Because the covariance matrix for z is the identity matrix, the cross-covariance should be the same as the sensitivity matrix if both are computed accurately. It can be seen, in fact, that the hybrid IES estimate of the cross-covariance for large ensemble size (Fig. 12) is identical to the exact sensitivity shown in Fig. 11. Importantly, the estimate from the hybrid IES method with \(N_e=100\) is much better than the standard IES estimate with ensemble size \(N_e=800\). Again, the bias in the cross-covariance from the ensemble is large, indicating that the global estimate is not representative of the local value in this highly nonlinear problem.

The reasons for the remarkably good results for the hybrid IES in this example are twofold. First, \(C_z = I\) for the non-centered hierarchical parameterization. The hybrid IES method uses the exact \(C_z\), while the traditional IES uses a low-rank approximation of \(C_z\), which suffers from spurious correlations. Second, the hybrid method uses the chain rule to compute part of the sensitivity analytically. This eliminates the bias that appears when the mean hyperparameters do not match the hyperparameters of the individual realization.
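The two estimates compared in Fig. 12 can be contrasted as in the sketch below, for the non-centered parameterization where \(C_z = I\). Here `dZ` and `dD` are ensemble anomalies, `L_i` is the analytic square root \(L(\theta _i)\) for member i, and `G_m_hat` is an estimate of the sensitivity of the data to m; all names are illustrative placeholders.

```python
# A sketch of the two cross-covariance estimates compared in Fig. 12.
import numpy as np

def cross_cov_ies(dZ, dD):
    """Pure ensemble estimate Delta z Delta d^T: subject to spurious
    correlations and to bias from averaging over hyperparameters."""
    return dZ @ dD.T

def cross_cov_hybrid(L_i, G_m_hat):
    """Hybrid estimate C_z (G_m dm/dz)^T = L_i^T G_m_hat^T, since C_z = I
    exactly and dm/dz = L(theta_i) is exact for member i."""
    return L_i.T @ G_m_hat.T
```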

4.4.1 Summary

In an attempt to increase the robustness of data assimilation against model misspecification, a hierarchical Gaussian model was introduced for history matching of flow data. The non-centered parameterization used here is relatively simple, but allows uncertainty in several important parameters of the prior covariance that are typically fixed during history matching. Because we assumed that the prior covariance was stationary, the total number of uncertain parameters was only slightly larger than the number of uncertain parameters in standard data assimilation, but the effective number of uncertain parameters is much larger, as the grid-based variables are independent and identically distributed (iid) in the non-centered hierarchical parameterization.

Unfortunately, updating model parameters using a standard ensemble Kalman-like method was not effective when the number of model cells was relatively large compared to the size of the ensemble, even though it is common in data assimilation for the number of model parameters to be much larger than the ensemble size. There appear to be two reasons for the increased difficulty of data assimilation for the hierarchical model. The first is that the effect of spurious correlations is much larger with the non-centered parameterization, and localization based on the range of the prior covariance was not useful when the grid-based parameters were iid. The second is that the addition of hyperparameters makes the problem more nonlinear, so that average sensitivities (or Jacobians) computed from the ensemble do not represent local sensitivities accurately. Although the problem of spurious correlations could conceptually be solved by increasing the ensemble size, that does not solve the problem of incorrect sensitivities.

To solve these problems while retaining the advantages of the iterative ensemble smoothers (no need to derive the adjoint system and no need to store very large matrices), a hybrid IES was developed in which the “difficult sensitivities”, such as the sensitivities of water production rate to permeability, are still computed approximately using an ensemble of model parameters and the ensemble of model predictions. The “easy sensitivities”, such as the sensitivity of the permeability to the hyperparameters or to the latent iid variable z, are computed analytically. In a two-dimensional problem with uncertainty in the orientation of the principal axes of the covariance, hybrid estimates of G and \(CG^{\mathrm {T}}\) were much less noisy than estimates computed directly from the ensemble. Additionally, the hybrid estimates of the sensitivities G and the cross-covariances \(CG^{\mathrm {T}}\) are specific to a realization and will allow convergence to the correct value as ensemble size increases for linear observation operators. Finally, since each realization has its own Kalman gain matrix, the methodology is less limited by the Gaussian assumptions of the standard IES approaches. The hybrid method provides greater robustness against nonlinearity if some of the nonlinearity is in the relationship between z and m, as it is for the hierarchical model and also for the truncated plurigaussian model (Oliver and Chen 2018).

In this paper, the hybrid IES method was tested on a one-dimensional linear problem and on a two-dimensional flow problem, both with hierarchical parameterizations. In the one-dimensional problem, the hierarchical parameterization provided uncertainty in the range and variance of the prior covariance. The number of parameters was small compared to the ensemble size in this case, so all methods performed similarly, although the standard IES tended to produce correlation ranges that were too short. In the two-dimensional flow problem, there was uncertainty in the orientation of the anisotropic covariance and in the correlation ranges in the two principal directions. Data assimilation using the hybrid IES gave results that were much better than those of the IES. In this problem, the number of model parameters was much greater than the number of ensemble members. The hierarchical model with hybrid IES data assimilation provided updated models with data mismatch magnitudes that were close to the expected values in all wells except one.

The comparison of data assimilation with a hierarchical prior to data assimilation with a “good” and a “bad” prior showed that results with the hierarchical parameterization were slightly worse than results with the good prior and much better than results with the bad prior. The good results for the hierarchical method were, however, only obtained in conjunction with the use of the hybrid IES method of data assimilation. When the standard IES was used with the hierarchical parameterization, the mismatch to data was nearly as large as the prior mismatch.