Marginalized iterative ensemble smoothers for data assimilation

Data assimilation is an important tool in many geophysical applications. One of many key elements of data assimilation algorithms is the measurement error that determines the weighting of the data in the cost function to be minimized. Although the algorithms used for data assimilation treat the measurement uncertainty as known, it is in many cases estimated or set based on some expert opinion. Here we treat the measurement uncertainty as a hyperparameter in a fully Bayesian hierarchical model and derive a new class of iterative ensemble methods for data assimilation where the measurement uncertainty is integrated out. The proposed algorithms are compared with the standard iterative ensemble smoother on a 2D synthetic reservoir model.


Introduction
Data assimilation is the process of combining model predictions with real measurements for estimating unknown parameters and states of a model.In statistics, estimating the hidden (random) states of a model is commonly known as filtering or smoothing, whereas the parameter estimation problem is known as inversion or inverse problem in mathematical literature.The parameter estimation problem can be formulated either as a (regularized) maximum likelihood problem, or as a Bayesian inversion.In petroleum engineering, the inverse problem is commonly referred to as history matching [16].There are multiple ways to address data assimilation, and here we focus on a particular approximate Bayesian approach known as the iterative ensemble methods [2,18] since exact Bayesian methods such as e.g.MCMC methods are too expensive for large scale models.The iterative ensemble methods are well suited for large scale data assimilation since they treat the model as a black box and do not require the model to be differentiated.
In addition to a numerical model and the data itself, Bayesian data assimilation algorithms require prior informa-tion in terms of the prior distribution for both the parameters and states of the model as well as the likelihood function for the data, i.e. the measurement bias (if any) and uncertainty.For most standard Bayesian methods the prior distribution, as well as the likelihood, is assumed to be known.In many situations these parameters, often denoted hyperparameters, are unknown and can be modelled using hierarchical Bayes [8].In principle there are several ways to handle hyperparameters with the largest class being the empirical Bayesian methods [3] where the hyperparameters are estimated from the data either by some point estimates or by some iterative procedure that alternates between computing the posterior of the parameters and states given the hyperparameters and computing the posterior of the hyperparameters given the states and parameters.The first approach typically ignores some of the uncertainty in the posterior while the latter requires multiple applications of the data assimilation method.The fully Bayesian hierarchical analysis, however, is to integrate out all the hyperparameters to obtain the full posterior.This approach often involves approximation of integrals via nested Laplace transformation (see e.g.[22]), but in some cases the integral can be solved analytically.The focus here is on the measurement uncertainty and since the likelihood is frequently assumed to be Gaussian in data assimilation methods, we can choose prior densities such that the likelihood can be analytically integrated over the measurement error variances.
The error of a measurement typically stems from model error or representation error and not necessarily from the measurement instrument alone.The are many sources of measurement error, but regardless of how they are derived, (almost) all methods used for data assimilation have one thing in common: they all assume that the true uncertainty is specified.Naturally this is a strong assumption in many real world applications.Several techniques to estimate unknown errors online are available, such as adaptive filtering [14,15].However, the assumptions required are usually not satisfied in real applications.Robust approaches such as H ∞ filters [23] or Levy filters [24] serve as alternative methods, but the application in data assimilation setting is very limited to the best of our knowledge.
In geophysical models, it is quite often difficult to estimate the measurement error due to simplified physics, simplified parameterization or lack of knowledge of the measurement process [21].Most approaches within atmospheric science are based on the work of [5,6,10], and [7], using the difference between observation and error forecasts (see e.g.[4]).Other approaches such as e.g.[1] use multiple data sets in combination with theory from [9].A Bayesian approach is presented in [26] where the measurement uncertainty is estimated by maximizing an approximate log-posterior at each time step.[12] estimate the covariance of a water balance model in a weak constrained ENKF, where a variational Bayesian approach is used in a second update step of the algorithm, and an online approach for estimating colored observation noise has recently been introduced in [19].
For petroleum application there is not a lot of discussion of the measurement error of production data.It can in principle be obtained from the measuring instrument, but this would require a perfect model.For time-lapse seismic observations the noise can be estimated from the difference in data between different surveys in domains where the seismic signal is expected to be unchanged.[13] used techniques from image de-noising such as local and non-local moving average and sparse wavelet representation of the observations.When combining production data and seismic data, specifying the uncertainty becomes even more troublesome as there are two or more data terms in the likelihood function.Furthermore, due to imperfect models and pre-processing of certain data, the observation error is more than just noise related to the survey equipment (see [17] and references therein).
Regardless of how the measurement error is estimated or specified, and whether or not bias correction is included, the uncertainty in the estimate should be taken into account when applying Bayes' rule.This is commonly neglected in the data assimilation literature.From a Bayesian point of view, it is possible to incorporate the measurement uncertainty as a hyperparameter, and as far as the data assimilation algorithm is concerned, the measurement error variances can be considered like any of the physical model parameter.However, despite the interdependence (in the sense that the distribution of one is conditioned on the other), thereis no obvious hierarchical relationship to the physical parameters, unlike, e.g., hyperparameters of the prior.Instead, they could, or perhaps should, be considered 'nuisance parameters', i.e. parameters that are of no inherent interest but must be taken into account with their uncertainties in order to determine the physical model parameters of interest.Importantly, it is not necessary to reconstruct the probability distribution of the nuisance parameters, as long as their impact on the physical model parameters is accounted for.Here, an iterative ensemble smoother (iES) is proposed, where the likelihood function is marginalized over the measurement error.That is compliant with the fully Bayesian approach discussed above where the hyperparameters are integrated out of the hierarchical model.The resulting algorithm, denoted marginalized iterative ensemble smoother (miES), requires only a few modifications of iES algorithm.The theory is presented for a single measurement variance, but the more general case with several data types is briefly discussed and also presented in an example.
In the next section we define the data assimilation and inverse problem and give a short summary of the subspace version of the iES.The marginalized version is introduced in Section 3 and in Section 4 we provide examples and comparison with iES.The paper is concluded with a summary and discussion of further work.

Bayesian framework
Our quantity of interest, X , is considered as an unknown random variable with a prior probability density, f X (x).The observation, Y , is linked to X via the likelihood function f Y |X (y|x).The conceptual solution of the inverse problem is given by the posterior density We have omitted any potential time dependence for simplicity.

Data assimilation
In most data assimilation algorithms, and ensemble based methods in particular, it is common to assume that both the prior (after a possible transformation of variables) and the likelihood function are Gaussian and that the observations are given as a nonlinear mapping of X with additive noise The posterior is then known up to the normalizing constant where μ is the prior mean, P is the prior covariance matrix, is the observation error covariance matrix and C is the logarithm of the normalizing constant.The iterative ensemble methods discussed in later sections seek to minimize the objective function Eq. 1 or a stochastic version of it.When is uncertain, this will be reflected in the posterior and hence the objective function to be minimized.

Marginalizing the likelihood
In the following, we assume that the measurement error covariance matrix is known up to a constant, although the more general situation with block matrices with unknown variances is also presented.The measurement error is assumed to be Gaussian with a covariance matrix = σ 2 R, where R is a known M × M matrix.The likelihood function, given σ , is In a Bayesian framework, it is natural to view σ 2 as a hyper parameter with a prior density f σ 2 (σ 2 ) and then marginalize the likelihood w.r.t.σ 2 [25].If no information is available, a non-informative prior on σ 2 defined by f σ 2 (σ 2 ) = σ −2 can be used.This is known as Jeffrey's prior and represents the state of no prior information on the scale parameter σ [11].With this choice we can analytically compute f Y |X (y|x, σ 2 ) p(σ 2 )dσ 2 to get the marginalized likelihood (see e.g.[25]) with corresponding log-likelihood given by More often than not, the measurement error is specified either from an expert opinion or from some type of estimation.In either case it is uncertain, and this uncertainty should be accounted for.
Both the estimate of measurement error and its uncertainty can easily be incorporated into a prior distribution.To this end, a frequently used prior for the variance in Bayesian statistics is the scale inverse χ 2 distribution defined by where s 2 is the estimate of the measurement error variance σ 2 and ν is the degrees of freedom.In this case the marginal likelihood is proportional to which is a multivariate t distribution with t = y − H(x) 2 R /s 2 and ν degrees of freedom.The log-likelihood is then The two marginalized likelihoods Eqs. 3 and 6 can easily be extended to K data types of dimension M k with different variance, the log likelihood is then for the non-informative prior and for the scale inverse χ 2 prior.The marginalized likelihoods can be used in Bayes' formulation for filtering, data assimilation and Bayesian inversion.Here we focus on large scale inversion, but the algorithms can also be directly implemented for sequential data assimilation as they are presented.
In the following, we develop a subspace square root iES algorithm [2].For the ease of notation we focus on the single data type case, but some examples presented later contains multiple data types.

The new objective functions
Since solving the fully Bayesian inversion problem is too computationally costly for large scale models, we seek an approximate solution via maximization of the posterior (MAP) with the associate uncertainty provided by the inverse of the approximate Hessian matrix [2].The cost function to be minimized in the miES is defined as the negative log posterior density.Hence, for Jeffrey's prior we have The gradient and approximate Hessian (using a first order model expansion) are given by where χ y = y − H(x) 2 R and H x is the tangent linear operator of H evaluated at x.
For the case with scale inverse χ 2 prior, the cost function is with corresponding gradient and approximate Hessian where χ y = y − H(x) 2 R and H x is the tangent linear operator of H evaluated at x.A quasi Gauss-Newton algorithm for minimizing the cost functions is given by where k is the iteration index.If there are several data types, the likelihood term of the gradients in Eqs.11 and 14 are replaced by a sum of gradient terms derived from Eqs. 8 and 9.In the next section an ensemble subspace approximation of the quasi Gauss-Newton scheme is presented.

Ensemble subspace formulation
In most applications of ensemble methods, the dimension of the state, M, is greater than the number of ensembles N .In this case we can use the simplified subspace ensemble version of iES with the new cost functions which avoids computing the pseudo inverse of matrices of dimension M × N [2,18].Let X be the matrix of prior ensemble anomalies (samples with the mean subtracted from each column).The variable x can then be expressed in the ensemble subspace as where X is the initial ensemble anomalies, x is the ensemble mean and ω is the N × 1 weight vector initially set to 0.
Inserting the change of variables into the Jeffery's prior gives the following objective function where N is the ensemble size and I N is the identity matrix of size N .The gradient and first order Hessian approximation is Let E be the matrix with ensemble member i as its i th column and let x denote the column mean.The ensemble anomaly matrix X is defined by X = E − x1 , where 1 is the N-dimensional column vector of ones.The ensemble can be written E = x1 + XW, where W is initially set to the identity matrix of dimension N × N .For each ensemble member, i.e. each column of E, the forward model is run and the output is collected in the matrix denoted H(E).The linearization we use to linearize H using the ensemble is given by Y = H(E)W −1 1 , where ⊥ 1 = I N − 11 /N , I N is the N -dimensional identity matrix.This implementation avoids using the chain rule by direct regression of the composite function H(x(ω)).For more details on the linearization we refer to [18].We approximate Eq. 19 using the ensemble with Jeffrey's prior as ∇ ω J prior = (N − 1)ω ( 24) Finally, the ensemble subspace formulation of Eq. 16, at iteration k, is then given by where γ k is the step size.Often step size is not implemented in Gauss Newton, however, since we are using so many approximations in our scheme, we have seen that there can be advantages of not taking a full step (γ <1) in some cases.
For the inverse χ 2 prior the objective function (with the change of variables) is given by The following defines our subspace formulation marginalized iterative Ensemble smoother with an inverse χ 2 prior.
x k+1 = x + X ω k+1 (39) Remark 1 The gradient and Hessian approximations involves terms of the form (y − H(x + Xω)).Since we run the ensemble, and not the ensemble mean, this quantity is not available to us.There are two ways to deal with this issue.First, it is possible to run N − 1 ensemble members and the mean to obtain the desired quantity without increasing the computational cost, alternatively we can approximate H(x +Xω) with H(E), as is done in [2].The latter approach is implemented here.
In the next section the two algorithms are tested on a synthetic reservoir model and compared with a square root implementation of the iES.

Reservoir example
The test case is a 2D reservoir of 60 × 60 grid cells with a two phase flow consisting of oil and water operated with six producing wells and one injection well.There are three different data types: oil production rates, water production rates and water injection rates, so the log likelihood for the miES is a sum of three terms (K = 3 in Eqs. 8 and 9).We assume that the porosity is known (fixed at 0.2) and that the permeability field is unknown and modeled using a Gaussian prior distribution with a constant mean of 5, a constant variance of 1, and with a spherical variogram of range 30, anisotropy of 0.33, and with an angle of −45 degrees.The true permeability field is a realization from the prior shown in Fig. 1 along with the prior mean.
The historical production data are obtained by running the The Open Porous Media Flow simulator [20] on bottom hole pressure control .The measurements are generated by adding a zero mean Gaussian measurement error to the oil production rates and water production rates in the six producers as well as the water injection rates in a single injector every 150 days for a total of 12000 days (80 observation time points) except for producer 4 which opens after 6000 days and producers 5 and 6 which open after 9000 days.The standard deviation of the measurement error is set to 10, 20 and We run seven different cases with three different ensemble methods.The iES and miES with inverse χ 2 prior are run with three different inputs (the standard deviation for iES and s for the miES).Once with the correct values for the standard deviations (10,20,30), once with a lower value ( √ 40) for all three data types and once with a higher value ( √ 1000) for all three data types.In all runs the degrees of freedom in miES, ν, is set equal to the number of measurements (80).Since the miES with Jeffrey's prior does not require any input it is only run once.The ensemble size is 100.In addition, we run the iES with the correct standard deviations for the measurement error and an ensemble size of 1000 that will serve as a reference solution.That is, all results should be compared with the result from this run as it is the best solution we can get for an iES algorithm.
Normally some type of localization should be applied when using ensemble methods, but for the sake of comparison of the impact of the different methods and inputs, we did not apply any localization in this example.In Figs.2,3,4 the prior and posterior oil and water production is shown for the first producer as well as the water injection for the injector.All the methods provide a fairly good match with the data, with the exception of the iES when the measurement error standard deviations are set to a lower value than the true standard deviations.
Figure 5 shows the mean of the posterior permeability fields, where we particularly notice the robustness of the miES in the sense that the end results are very similar irrespective of the input for the measurement variance, unlike the iES where the results are much more sensitive to the input.Finally in Fig. 6 the posterior standard deviations in each grid cell are shown.Again we notice the similarity for the miES results, as well as the high sensitivity of iES to the input values.The results clearly demonstrate the advantage of the miES: it is by construction far less sensitive to the input than the iES, yielding a more robust algorithm for real field cases.From the results it seems as the miES has a 'localization' effect on the ensemble when compared with the iES with 1000 members.

Summary and future work
This paper has presented a new class of iterative ensemble methods called the marginalized iterative ensemble smoother (miES).The new algorithms are derived from maximization of the posterior density where the measurement error variance parameters in the likelihood are treated as hyperparameters and integrated out.Two different priors for the hyperparameters were studied.The first was the noninformative or Jeffrey's prior, which requires no user input.Hence the miES algorithm requires no input for the measurement error variances.The second was the scaled inverse chi-square prior where an offline estimate or an expert opinion can be used as input.The resulting miES algorithm can then be used with the same input as the standard iES algorithm.A test on a synthetic reservoir model clearly demonstrated potential advantages of the new method when compared to the iES.The solution is less sensitive to the input and produced solutions that were more in line with the iES with an ensemble size of 1000 and correctly specified standard deviations than the iES with 100 ensemble members.The miES can handle different variances for different datatypes and it is also possible to use different priors for different data.The algorithm is of the same complexity as the standard iES.
In situations where the measurements are correlated, but with unknown correlation structure, further work is required to use the miES.A direct generalization using noninformative or an inverse Wishart prior density is possible assuming that the entire measurement error covariance is unknown, but this may lead to a very uncertain likelihood function and it is perhaps more interesting to study a known correlation structure with unknown parameters.Both these topics will be investigated in future research.