Novel iterative ensemble smoothers derived from a class of generalized cost functions

Iterative ensemble smoothers (IES) are among the state-of-the-art approaches to solving history matching problems. From an optimization-theoretic point of view, these algorithms can be derived by solving certain stochastic nonlinear-least-squares problems. In a broader picture, history matching is essentially an inverse problem, which is often ill-posed and may not possess a unique solution. To mitigate the ill-posedness, in the course of solving an inverse problem, prior knowledge and domain experience are often incorporated, as a regularization term, into a suitable cost function within a respective optimization problem. Whereas inverse theory offers a rich class of inversion algorithms resulting from various choices of regularized cost functions, few ensemble data assimilation algorithms (including IES) are implemented in practice in a form that goes beyond nonlinear least squares. This work aims to narrow this noticed gap. Specifically, we consider a class of more generalized cost functions, and establish a unified formula that can be used to construct a corresponding group of novel ensemble data assimilation algorithms, called generalized IES (GIES), in a principled and systematic way. For demonstration, we choose a subset (up to 30+) of the GIES algorithms derived from the unified formula, and apply them to two history matching problems. Experiment results indicate that many of the tested GIES algorithms exhibit superior performance to that of an original IES developed in a previous work, showcasing the potential benefit of designing new ensemble data assimilation algorithms through the proposed framework.


Introduction
A typical inverse problem involves finding one (or multiple) set(s) of model variables m, as inputs to a numerical forward simulator g, in such a way that the generated outputs g(m) from the simulator match a collection of observed data d^o, or other relevant quantities of interest (e.g., a probability density function conditioned on d^o), to a good extent. Inverse theory is an important area in various science and engineering disciplines, and finds wide applications in different fields, such as system identification, signal/image processing, computer vision, machine learning, meteorology, oceanography, geophysics, and petroleum engineering, to name but a few. Depending on the field, inverse theory may have diverse names. For instance, it is known as estimation theory in statistical signal processing [1], data assimilation in meteorology and oceanography [2,3], or history matching in reservoir engineering [4]. In the current work, the terms "data assimilation" and "history matching" will be adopted and used interchangeably.
To find a solution to an inverse problem, there are two major schools of methods, from the Bayesian and optimization-theoretic perspectives, respectively. A Bayesian approach focuses on updating a prior probability density function (PDF) p(m) of model variables to the posterior one p(m|d^o) based on Bayes' theorem, whereas the forward simulator (together with the PDF of observation noise) is used to calculate the likelihood p(d^o|m) that the observations d^o are caused by a specific set of model variables m [5]. After the posterior PDF is obtained, point estimates, such as the corresponding mean, median or mode, can be generated accordingly.
In a conventional optimization-based inversion algorithm [6], one first defines a certain cost function C(m; d^o, m_0) that describes the discrepancies between the observations d^o and the outputs g(m) of the forward simulator, where m_0 represents an initial guess of the unknown model variables. The solution to the inverse problem then boils down to finding one (or multiple) set(s) of model variables m that approximately minimize the cost function C(m; d^o, m_0). Often, inverse problems are ill-posed, e.g., in the sense that there can exist an infinite number of models m that match the observations d^o equally well. To mitigate the ill-posedness, it is customary to introduce a certain regularization term to the cost function, such that the regularized cost function is typically of the form C(m; d^o, m_0) = D(d^o − g(m)) + γ R(m − m_0), where the operator D defines the discrepancies (or mismatch) between the observations d^o and the simulated data g(m) of the forward simulator, the operator R specifies the amount of deviation of m from the initial guess m_0, and γ is a regularization parameter associated with R. As will be shown later, one can slightly extend the form of the cost function C(m; d^o, m_0) by allowing D and/or R to act on transformed data and/or model variables. For ease of reference, hereafter we call D(d^o − g(m)) and γ R(m − m_0) (or simply R(m − m_0)) the data mismatch and regularization terms, respectively.
Under certain circumstances, connections between Bayesian and optimization-based inversion approaches can be established. For instance, if one lets the prior PDF and the likelihood function be Gaussian, then maximizing the posterior PDF, i.e., adopting the maximum a posteriori (MAP) criterion, is equivalent to minimizing a regularized nonlinear-least-squares cost function, in which the data mismatch and regularization terms are induced by the likelihood and the prior, respectively.

Although theoretically sound, computational issues may arise in applications of the conventional Bayesian and optimization-based inversion approaches to practical inverse problems. For a Bayesian approach, the involved PDFs are often replaced by certain Monte Carlo approximations, which typically require a large number of samples in order to achieve a decent approximation accuracy, and may thus become prohibitively expensive in large-scale problems, where even running a single forward simulation would require substantial computational resources. In contrast, a conventional optimization-based inversion approach tends to be computationally more efficient than a Bayesian approach, as it requires only a few forward simulations per iteration step. However, in many cases, solving the optimization problem itself may become more complicated. For instance, if a gradient-based local optimization method is adopted to solve the optimization problem, the exact gradient may not always be available, e.g., in case the forward simulator is a black-box system provided by a third party. In addition, aiming to obtain a single point estimate, an optimization-based inversion approach would not in general be able to quantify the uncertainties associated with the estimated model variables.
Recent years have witnessed the development of new algorithms that are able to mitigate some of the noted computational issues in Bayesian and/or optimization-based inversion approaches, among which we focus here on iterative ensemble smoothers (IES), which are among the state-of-the-art approaches to large-scale history matching problems. In this regard, different forms of IES have been developed from either a Bayesian or an optimization-theoretic point of view, see, for example, [7][8][9][10][11][12][13][14][15].
Although in principle one can iteratively update the posterior PDF within the framework of IES, in such a way that the prior and final posterior PDF satisfy Bayes' theorem [10,13,15], in practical implementations it is rare that PDFs are explicitly involved in numerical computations. Instead, the more common choice is to adopt a relatively small ensemble of model variables to represent the whole space that the model variables may span, and the iteration process aims to update the ensemble of model variables under a certain criterion, derived from either the Bayesian or the optimization-theoretic perspective. For a Bayesian approach, e.g., the ensemble smoother with multiple data assimilation (ES-MDA) [10], such an ensemble can be treated as a Monte Carlo approximation to the corresponding PDF, but with a relatively small ensemble size to reduce the computational cost. The iteration process is then designed with the goal that the Monte Carlo approximations of PDFs (approximately) satisfy Bayes' theorem.
Meanwhile, for an optimization-based approach, e.g., [7,11,13,14], the ensemble of model variables leads to an ensemble of cost functions, and the objective is to update the ensemble of model variables to minimize either each individual cost function [8,11] or the average of the cost functions [13,14]. Compared to the conventional optimization-based inversion approaches, ensemble-based methods are non-intrusive, and do not require explicit knowledge of the gradients of a cost function with respect to the model variables. As such, ensemble-based methods are derivative-free and applicable to black-box systems, and are more flexible and convenient in terms of algorithm implementation. Moreover, instead of producing a single point estimate as in a conventional optimization-based approach, an ensemble-based method generates an ensemble of estimates, which provides a means of uncertainty quantification to some extent. On the other hand, an open problem for ensemble-based methods is that, except for some special cases, the final ensemble of estimated model variables may not in general be considered as samples drawn from the correct posterior PDF, which satisfies Bayes' theorem with respect to the corresponding prior PDF and the likelihood function [16].
For IES developed from either the Bayesian or the optimization-theoretic perspective, a less explored issue is that, in practical applications, the underlying cost functions are typically limited to the forms of nonlinear least squares, in which both the data mismatch and regularization operators, namely D and R, are quadratic functions with respect to the simulated data and the model variables, respectively. In contrast, for the conventional optimization-based inversion approaches, there exist richer types of cost functions. For instance, in terms of the regularization operator R, apart from the conventional quadratic form (known as Tikhonov or ℓ_2-norm regularization) as adopted in the IES, other types of regularization methods, such as total variation regularization [17], ℓ_p regularization in which p may not be equal to 2 [18], mixed-norm regularization [19], etc., have also been proposed in the literature to improve the performance of inversion algorithms. However, to the best of our knowledge, the literature appears to lack ensemble-based iterative inversion algorithms that are developed from cost functions beyond the conventional forms of nonlinear least squares.
This work aims to narrow this noticed gap between the conventional optimization-based inversion approaches and the ensemble-based ones, in terms of the varieties of the underlying cost functions used to establish the corresponding inversion algorithms. The motivation for us to consider other types of cost functions can also be understood from a Bayesian perspective. Indeed, as aforementioned, the Bayesian formulation can have an underlying cost function if we make a connection between the Bayesian and optimization-based approaches under the MAP criterion. For a cost function in the form of nonlinear least squares, it is equivalent for a Bayesian approach to assume that both the prior model variables and the observation noise follow certain Gaussian distributions. In practice, however, it may also be desirable to consider other statistical distributions. One good example in this regard is the Bayesian LASSO regression [20], which equips the regularization term with the ℓ_1 norm, corresponding to a Laplacian prior. Another good example is the Bayesian elastic net [21], which admits a mixture of ℓ_1 and ℓ_2 norms for the regularization term, corresponding to a prior resulting from the combined use of Gaussian and Laplacian distributions.
The rest of this paper is organized as follows: We first present a unified ensemble-based model update formula, which provides solutions to a class of inverse problems in the form of generalized minimum-average-cost (GMAC) problems. Underlying the GMAC problems is a corresponding class of regularized cost functions that generalize and go beyond the form of nonlinear least squares in general. As such, the presented update formula provides a principled and systematic way to derive a group of novel ensemble-based data assimilation algorithms, called (more) generalized iterative ensemble smoothers (GIES) in this work. We show that the update formulas of the GIES algorithms bear certain structural similarities to those of the IES in the literature. Consequently, previous experience in implementing and applying the IES algorithms can be naturally transferred to the implementations and applications of the GIES. For demonstration, we compare the performance of a subset (up to 30+) of the GIES algorithms (including one original IES developed in a previous work) in two history matching problems. Experiment results indicate that many of the tested GIES algorithms outperform the original IES in the case studies, manifesting the potential benefit of designing new ensemble data assimilation algorithms under the GMAC framework.

Original IES derived from a stochastic nonlinear-least-squares problem
Consider the following noisy observation system

d^o = g(m^{tr}) + ε,   (1)

where d^o ∈ R^d stands for a d-dimensional vector of observations; g : R^m → R^d for a forward simulator that maps an m-dimensional model vector m^{tr} ∈ R^m onto the observation space; and ε ∈ R^d for contamination noise following a certain distribution.
In ensemble-based data assimilation, our objective is to estimate a set of possible model vectors m that may explain the observations d^o to a good extent, given the forward simulator g and possibly certain prior knowledge of m. Without loss of generality, we assume that the forward simulator g is perfect.
Since the forward simulator g is nonlinear in general, we employ a certain iterative algorithm to solve the data assimilation problem. To this end, let us assume that at the ith iteration step, we have an ensemble of N_e model vectors, denoted by M^i ≡ {m^i_j}^{N_e}_{j=1}, as the initial guess, where j is the index of ensemble members. Following the previous work [13], our objective is to find a new ensemble M^{i+1} ≡ {m^{i+1}_j}^{N_e}_{j=1} of model vectors that approximately solves the following minimum-average-cost (MAC) problem, in the form of stochastic nonlinear least squares, at the (i + 1)th iteration step:

M^{i+1} = argmin_{{m^{i+1}_j}} (1/N_e) Σ^{N_e}_{j=1} [ (d_j − g(m^{i+1}_j))^T C_d^{-1} (d_j − g(m^{i+1}_j)) + γ^i (m^{i+1}_j − m^i_j)^T (C^i_m)^{-1} (m^{i+1}_j − m^i_j) ],   (2)

where d_j is a data vector, being either the observation vector d^o itself or a perturbed version thereof; C_d is the observation error covariance matrix; C^i_m is the sample covariance matrix of the ensemble M^i, with the square-root matrix S^i_I being defined later (cf. Eq. (6)); and γ^i is a parameter that is adaptive to the iteration process, following certain preset rules [13].
The main idea in Eq. 2 is to minimize the average of an ensemble of cost functions. Each cost function consists of two parts, namely, the data mismatch term, which measures the weighted Euclidean distance between simulated and observed data, and the quadratic regularization term, which is introduced to mitigate the ill-posedness of the history matching problem, and also to prevent the IES from overfitting the observations. Through a linearization-based approximation strategy, Eq. 2 can be solved as follows [13]:

m^{i+1}_j = m^i_j + K^i (d_j − g(m^i_j)),   (3)

K^i ≡ S^i_I (S^i_g)^T (S^i_g (S^i_g)^T + γ^i C_d)^{-1} = S^i_I ((S^i_g)^T C_d^{-1} S^i_g + γ^i I_{N_e})^{-1} (S^i_g)^T C_d^{-1},   (4)

C^i_m ≡ S^i_I (S^i_I)^T,   (5)

S^i_I ≡ (N_e − 1)^{-1/2} [m^i_1 − \bar{m}^i, m^i_2 − \bar{m}^i, · · · , m^i_{N_e} − \bar{m}^i],   (6)

S^i_g ≡ (N_e − 1)^{-1/2} [g(m^i_1) − \bar{g}^i, g(m^i_2) − \bar{g}^i, · · · , g(m^i_{N_e}) − \bar{g}^i],   (7)

where \bar{m}^i and \bar{g}^i stand for the ensemble means of the model vectors m^i_j and of the simulated data g(m^i_j), respectively. For the proof of the matrix identity in Eq. 4, readers are referred to Appendix A.
For notational convenience, the square-root matrices in Eqs. 6 and 7 can be expressed in a unified way as follows:

S^i_O ≡ (N_e − 1)^{-1/2} [O(m^i_1) − \bar{O}^i, O(m^i_2) − \bar{O}^i, · · · , O(m^i_{N_e}) − \bar{O}^i];  \bar{O}^i ≡ (1/N_e) Σ^{N_e}_{j=1} O(m^i_j),   (8)

where O represents a certain operator that maps a reservoir model into another domain. In Eq. 6, O reduces to the identity operator I, such that I(m) = m for a reservoir model m. Similarly, in Eq. 7, the operator O equals the forward reservoir simulator g. The convention in Eq. 8 will be adopted throughout this work.
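To make Eqs. 3-8 concrete, the following minimal NumPy sketch illustrates the unified square-root matrix of Eq. 8 and one IES iteration on a toy forward model. This is our own illustration rather than code from the paper; the function names (sqrt_matrix, ies_update) and the toy simulator are hypothetical.

```python
import numpy as np

def sqrt_matrix(M, O=lambda m: m):
    """Eq. 8: S_O = [O(m_1) - mean, ..., O(m_Ne) - mean] / sqrt(Ne - 1).
    M is an (m_dim, Ne) array whose columns are ensemble members."""
    OM = np.column_stack([O(M[:, j]) for j in range(M.shape[1])])
    return (OM - OM.mean(axis=1, keepdims=True)) / np.sqrt(M.shape[1] - 1)

def ies_update(M, g, D, C_d, gamma):
    """One IES iteration (Eqs. 3-7): m_j <- m_j + K (d_j - g(m_j)),
    with K = S_I S_g^T (S_g S_g^T + gamma C_d)^{-1}.
    D is a (d_dim, Ne) array of (possibly perturbed) data vectors d_j."""
    S_I = sqrt_matrix(M)                  # Eq. 6: O is the identity
    S_g = sqrt_matrix(M, g)               # Eq. 7: O is the simulator g
    K = S_I @ S_g.T @ np.linalg.inv(S_g @ S_g.T + gamma * C_d)  # Eq. 4
    G = np.column_stack([g(M[:, j]) for j in range(M.shape[1])])
    return M + K @ (D - G)                # Eq. 3, applied column-wise

# Toy usage: 5 parameters, 2 data points, 20 ensemble members.
rng = np.random.default_rng(0)
g = lambda m: np.array([m[:3].sum(), np.sin(m[3]) + m[4]])
M0 = rng.normal(size=(5, 20))
d_obs = np.array([1.0, 0.5])
C_d = 0.01 * np.eye(2)
D = d_obs[:, None] + rng.multivariate_normal(np.zeros(2), C_d, 20).T
M1 = ies_update(M0, g, D, C_d, gamma=1.0)
```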

GIES derived from a class of generalized cost functions
In the sequel, we consider the following generalized MAC (GMAC) problem

M^{i+1} = argmin_{{m^{i+1}_j}} (1/N_e) Σ^{N_e}_{j=1} C(m^{i+1}_j),   (9)

C(m^{i+1}_j) ≡ D(T(d_j) − T(g(m^{i+1}_j))) + γ^i R(S(m^{i+1}_j) − S(m^i_j)),   (10)

with a class of more general cost functions, in comparison to the original MAC problem in Eq. 2. According to Eq. 10, in the GMAC problem, each of the cost functions also consists of two terms, namely, the data mismatch term with respect to the distance metric D : R^t → [0, +∞), and the regularization term with respect to the regularization operator R : R^s → [0, +∞).
Within the data mismatch term, a transform operator T : R^d → R^t is introduced to accommodate the possibility that the observed and simulated data may be transformed into another domain, e.g., through a discrete wavelet transform [22,23] or a learned dictionary [24], before computing the respective data mismatch between them. Similarly, within the regularization term, a transform operator S : R^m → R^s is present to take into account the possibility that regularization may be applied to transformed model variables, rather than directly to the model variables themselves; for example, see [25][26][27]. In this regard, S may also be considered as a re-parametrization strategy, if one assumes that the transform operator S is (approximately) invertible, so that in the data mismatch term one can re-write

T(g(m^{i+1}_j)) = (T ∘ g ∘ S^{-1})(S(m^{i+1}_j)),

where T ∘ g ∘ S^{-1} denotes the composition of the operators T, g and S^{-1}, and acts as an effective forward simulator for the effective input S(m^{i+1}_j). On the other hand, we note that in Eq. (10), the invertibility of S is not formally required.
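As a quick illustration of Eq. 10, the sketch below evaluates the generalized cost for one ensemble member with pluggable operators D, R, T and S. The concrete choices (an ℓ_1 regularizer acting on first-order differences, a toy simulator g) are our own hypothetical examples, not settings from the paper.

```python
import numpy as np

def gmac_cost(m_new, m_old, d_j, g, D, R, T, S, gamma):
    """Eq. 10: data mismatch after transform T, plus regularization
    after transform S, for a single ensemble member."""
    data_mismatch = D(T(d_j) - T(g(m_new)))
    regularization = R(S(m_new) - S(m_old))
    return data_mismatch + gamma * regularization

# Hypothetical choices: quadratic data mismatch (Eq. 11 form), an l1
# regularizer on first-order differences of the model variables.
C_d_inv = np.eye(2) / 0.01
D = lambda x: float(x @ C_d_inv @ x)
R = lambda x: float(np.abs(x).sum())
T = lambda x: x                       # identity data transform
S = lambda m: np.diff(m)              # first-order variation of m
g = lambda m: np.array([m.sum(), m[0] - m[-1]])

c = gmac_cost(np.ones(4), np.zeros(4), np.array([1.0, 0.2]),
              g, D, R, T, S, gamma=0.5)
```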
As a special case, given a dummy vector x, if one takes

D(x) ≡ x^T C_d^{-1} x,   (11)

R(x) ≡ x^T (C^i_m)^{-1} x,   (12)

where C_d and C^i_m represent the observation error covariance matrix and the sample model error covariance matrix, respectively, as in the original MAC problem (2), and also lets both T and S be identity operators, then up to a constant factor, the GMAC problem in Eqs. 9 and 10 is reduced to the stochastic nonlinear-least-squares problem in Eq. 2, which leads to the IES derived in [13], and is also closely related to the cost functions in other variants of the IES, e.g., [7,10,14].
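For completeness, the reduction can be written out by direct substitution (a one-step derivation consistent with Eqs. 10-12):

```latex
% Substituting D(x) = x^T C_d^{-1} x, R(x) = x^T (C_m^i)^{-1} x and
% T = S = I into the cost function of Eq. 10 recovers the summand of Eq. 2:
\begin{align*}
 C_j &= D\!\left(T(\mathbf{d}_j) - T(g(\mathbf{m}_j^{i+1}))\right)
        + \gamma^i\, R\!\left(S(\mathbf{m}_j^{i+1}) - S(\mathbf{m}_j^{i})\right) \\
     &= \left(\mathbf{d}_j - g(\mathbf{m}_j^{i+1})\right)^{T} \mathbf{C}_d^{-1}
        \left(\mathbf{d}_j - g(\mathbf{m}_j^{i+1})\right)
        + \gamma^i \left(\mathbf{m}_j^{i+1} - \mathbf{m}_j^{i}\right)^{T}
          \left(\mathbf{C}_m^{i}\right)^{-1}
          \left(\mathbf{m}_j^{i+1} - \mathbf{m}_j^{i}\right).
\end{align*}
```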
In Appendices A and B, we show that the GMAC problem can be approximately solved under different scenarios, depending on whether the relevant gradient and/or Hessian information is available. When all relevant gradient and/or Hessian information is provided, the GMAC problem is solved by the formulas in Eqs. 62-66. Otherwise, ensemble-based approximate solutions can be obtained by combining the formulas in Eqs. 74 and 75 with the approximation strategies in Appendix B.
In the current work, we assume that both the distance metric D and the regularization operator R are relatively simple (e.g., as in Eqs. 11 and 12), such that the analytic forms of gradient and Hessian of D (with respect to T ) and R (with respect to S) are available.
Under this setting, it can be shown (cf. Appendix A) that an ensemble-based approximate solution to the GMAC problem in Eqs. 9 and 10 is given by the update formula in Eqs. 13-16, which, analogously to the original IES, comprises an "effective" Kalman-gain-like matrix and an "effective" innovation term. In the GIES, the effective innovation term can be interpreted as a gradient with respect to (w.r.t.) the data mismatch (after data transform) of the respective ensemble member.
Of particular interest here is the dependence of the effective Kalman-gain-like matrix on individual ensemble members in the GIES, a phenomenon not observed in the original IES. Indeed, as indicated in Table 1, the effective Kalman-gain-like matrix in the original IES is common to all ensemble members, the reason being the quadratic forms of the distance metric D and the regularization operator R adopted in Eqs. 11 and 12, such that the corresponding local Hessian matrices, C_d^{-1} and (C^i_m)^{-1}, respectively, are independent of the model variables. In contrast, in the GIES, the distance metric D and the regularization operator R may not necessarily admit constant Hessian matrices everywhere. Instead, in general, these Hessian matrices may vary across ensemble members, and thus depend on the iteration path of the ensemble algorithm.
In implementation, the dependence of the Kalman-gain-like matrix on individual ensemble members implies that the GIES would require more computational time and computer memory than the original IES does. Fortunately, since both the projected local Hessian (PLH) matrices, M^i_D(d_j) and M^i_R(m^i_j), respectively, are of dimension N_e × N_e, the additional memory consumption would not be significant when the ensemble size N_e is not too big. On the other hand, in terms of computational complexity, the numbers of flops required to evaluate M^i_D(d_j) and M^i_R(m^i_j) are of the orders of t^2 × N_e and s^2 × N_e, respectively, where t and s are the sizes of the transformed data and model, after applying the transform operators T and S. Based on these observations, there are a few strategies that can be exploited to help reduce the required computational resources. For instance, if one chooses not to perturb the observations, then the PLH M^i_D(d_j) will be the same for all reservoir models. On the other hand, sparse representations of model and/or data, e.g., [22,24,26,27], can also be adopted to render reduced sizes for the transformed data and model.
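The following sketch shows, under our own illustrative dimensions, how a PLH of the form (S^i)^T ∇²(·) S^i is assembled, and why its storage cost is governed by N_e rather than by the (transformed) model or data size:

```python
import numpy as np

def plh(S, H):
    """Project a local Hessian H (s x s) onto the ensemble subspace
    spanned by the square-root matrix S (s x Ne): M = S^T H S."""
    return S.T @ (H @ S)          # costs O(s^2 Ne); result is Ne x Ne

rng = np.random.default_rng(1)
s, Ne = 200, 30                   # transformed model size, ensemble size
S = rng.normal(size=(s, Ne)) / np.sqrt(Ne - 1)
H = np.eye(s)                     # a quadratic metric: constant Hessian
M = plh(S, H)                     # small (Ne x Ne), cheap to store
```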

A subclass of the GIES: the ℓ^q_p-GIES
For illustration, in this work we consider a subclass of the GIES algorithms, which approximately solves the GMAC problem of Eq. 9 with the cost function

C(m^{i+1}_j) ≡ (d_j − g(m^{i+1}_j))^T C_d^{-1} (d_j − g(m^{i+1}_j)) + γ^i Σ^{K_mix}_{k=1} w_k ‖B_k (S_k(m^{i+1}_j) − S_k(m^i_j))‖^{q_k}_{p_k},   (20)

where ‖x‖_p ≡ (Σ_f |x_f|^p)^{1/p} denotes the ℓ_p norm. Here, | · | is the operator taking absolute values element-wise, e.g., |x| = [|x_1|, |x_2|, · · · ]^T; (Bx)_e stands for the eth element of the vector Bx, x_f for the f th element of the vector x, and B_{e,f} for the element of the matrix B on the eth row and the f th column.
In this GMAC problem, the data mismatch term is in the conventional form of nonlinear least squares, which corresponds to the distance metric D in Eq. 11, with T being an identity transform operator. On the other hand, the regularization term consists of a mixture of ℓ^{q_k}_{p_k} distance metrics, and we accordingly call the resulting subclass of inversion algorithms ℓ^q_p-GIES. Although not considered in the current work, the derivation to be presented below can be similarly applied to establish variants of the ℓ^q_p-GIES, for which the corresponding distance metric D also consists of a mixture of different operators.
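A minimal sketch of the mixture regularizer in Eq. 20 may look as follows; the weights, matrices B_k and transforms S_k below are hypothetical placeholders:

```python
import numpy as np

def lqp(x, p, q):
    """||x||_p^q, i.e., the l_p norm of x raised to the power q."""
    return float(np.sum(np.abs(x) ** p) ** (q / p))

def mixture_reg(m_new, m_old, terms):
    """Weighted sum over (w_k, B_k, S_k, p_k, q_k) tuples, as in Eq. 20."""
    return sum(w * lqp(B @ (S(m_new) - S(m_old)), p, q)
               for (w, B, S, p, q) in terms)

# Hypothetical mixture: case l^2_1 (p=1, q=2) on first-order
# variations, plus case l^2_2 (p=q=2) on the raw parameters.
m_new, m_old = np.ones(5), np.zeros(5)
terms = [(0.5, np.eye(4), np.diff, 1, 2),
         (0.5, np.eye(5), lambda m: m, 2, 2)]
r = mixture_reg(m_new, m_old, terms)
```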
According to the update formula of the GIES, Eqs. 13-16, we obtain the update formula in Eq. 23 for the ℓ^q_p-GIES. By applying the matrix identity (49) of Appendix A to Eq. 23, one obtains the alternative update formula in Eq. 24. The reason for us to present the alternative update formula (24) is that, when combined with a truncated eigenvalue decomposition, adopting (24) tends to achieve better history matching performance than a straightforward implementation of Eq. 23, in some of the experiments conducted in the current work (not reported here for brevity).
Next, we explain how to evaluate the PLH M^i_R(m^i_j).
Treating the individual ℓ^{q_k}_{p_k} metrics as individual operators in the mixture (20), a combination of the analytic results in Appendices C and D leads to the corresponding gradients and Hessians in Eqs. 27-29. In Eqs. 28 and 29, the operator ∧ raises all elements of a vector to a certain power, e.g., given a vector x, x^∧a ≡ [x_1^a, x_2^a, · · · ]^T; the operator ∘ stands for the element-wise product (i.e., Hadamard product); and 1 for a vector whose elements are all equal to 1.

A comparison study of ℓ^q_p-GIES algorithms applied to a history matching problem
The focus of this section is to show the variety of the GIES algorithms, illustrated by applying a set of ℓ^q_p-GIES algorithms derived in the previous section to the channelized reservoir characterization problem below. Another set of ℓ^q_p-GIES algorithms is also applied to history matching in a 5-spots problem. To limit the length of the main text, however, numerical results with respect to the 5-spots problem are presented in Appendix E.

The reservoir model
The case study considers a 2D channelized reservoir model of dimension 45 × 45, with oil/water two-phase flow, as described in the previous work [28]. Figure 1 shows the geological structure of the permeability map in the reference model. The porosity of the reference model is spatially constant, and is set to 0.1. On the other hand, the reservoir has isotropic permeability (PERM), whose value is set to 10000 md within the channel, and to 500 md in the background. The reservoir has 8 injectors (labeled I1, I2, etc.) and 8 producers (labeled P1, P2, etc.), whose positions are indicated in Fig. 1. The injectors are constrained by injection rates, and the producers by bottom-hole pressures.
The reference model in Fig. 1 is used to generate production data every 190 days, for a total period of 3800 days. The production data consist of well oil/water production rates (WOPR/WWPR) from the 8 producers, and well bottom-hole pressures (WBHP) from the 8 injectors. In the case study, we divide the production data into two parts: we use the production data (with observation noise) within the first 1900 days for history matching through the ℓ^q_p-GIES algorithms, while reserving the production data (without observation noise) within the second 1900 days for performance validation. The number of production data points in each time interval is 240. The observed production data within the history matching period are contaminated by certain zero-mean Gaussian noise. For WOPR/WWPR data, the standard deviations (STDs) of observation noise are 0.2236 m^3/d, whereas for WBHP data, their STDs are 0.2646 bar.
In the current case study, permeability values (in the unit of md) are the unknown parameters to be estimated through certain ℓ^q_p-GIES algorithms. An initial ensemble of 100 PERM maps is generated by SNESIM [29], following the procedure explained in [28]. For illustration, Fig. 2 shows the mean (top left) and STD (top right) maps of the initial ensemble, as well as two sample reservoir models from the initial ensemble.

The chosen ℓ^q_p-GIES algorithms for comparison
In the experiment, the orders of p_k and q_k in Eq. 20 are set to either (case ℓ^2_1) p_k = 1 and q_k = 2, or (case ℓ^2_2) p_k = q_k = 2, while the settings of the matrices B_k and the transform operators S_k will be specified later. According to Eqs. (27)-(29), the corresponding Hessian matrices are then given by

∇^2_x ‖B_k x‖^2_1 = 2 B_k^T ṡgn(B_k x) ṡgn(B_k x)^T B_k
                  = 2 B_k^T ṡgn(B_k x (B_k x)^T) B_k,   (30)

∇^2_x ‖B_k x‖^2_2 = 2 B_k^T B_k.   (31)

In Eq. 30, ṡgn(x) = x ./ |x| represents the sign function applied element-wise to its argument x, with ./ denoting the element-wise division operator. To derive the second line of Eq. 30 from the first one, we have exploited the fact that ṡgn(x) ṡgn(x)^T = ṡgn(x x^T). Another remark is that the Hessian of the ℓ^1_1 metric vanishes almost everywhere, as shown in Appendix C, and this is the reason that we do not consider the ℓ^1_1 metric here. Regarding the transform operator S_k, we consider three possible scenarios: (S-I) S_k is the identity operator I; (S-II) S_k calculates the first-order variations of the permeability map of a reservoir model; and (S-III) S_k generates the histogram of the permeability map of a reservoir model, and converts the histogram values into information entropies.
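Under the reconstruction of Eqs. 30-31 above, the two Hessians can be evaluated as follows (a sketch with illustrative inputs; note the rank-1 structure in the case ℓ^2_1):

```python
import numpy as np

def hessian_l21(B, x):
    """Hessian (a.e.) of ||Bx||_1^2: 2 B^T sgn(Bx) sgn(Bx)^T B (Eq. 30)."""
    v = B.T @ np.sign(B @ x)
    return 2.0 * np.outer(v, v)       # a rank-1 "sign outer product"

def hessian_l22(B):
    """Hessian of ||Bx||_2^2: the constant matrix 2 B^T B (Eq. 31)."""
    return 2.0 * B.T @ B

B = np.array([[1.0, 0.0], [1.0, -1.0]])
x = np.array([0.3, -0.7])
H1 = hessian_l21(B, x)                # rank 1 (wherever sgn(Bx) != 0)
H2 = hessian_l22(B)                   # independent of x
```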
For distinction later, we denote the identity operator in S-I by I, such that I(m) = m. Here we only consider case ℓ^2_2, as this leads to the original IES and serves as the base case for our comparison study later. Indeed, with p_k = q_k = 2, the corresponding Hessian is given by Eq. 31.
If, in addition, we let B_k = (C^i_m)^{-1/2} in Eq. 31, and also choose ‖B_k(m^{i+1}_j − m^i_j)‖^2_2 as the only metric in the regularization term (with the associated weight w_1 = 1), then according to Eq. 26, the PLH, denoted here by M^i_R(m^i_j), reduces to w_1 I_{N_e}. As a result, the update formula of the corresponding ℓ^q_p-GIES algorithm coincides with that of the original IES. The same convention will be adopted for other PLHs later on.
In S-II, the transform operator calculates the first-order spatial variations of permeability maps. For this reason, we denote this operator by S_V hereafter. In the current case study, a permeability map can be represented by a 45 × 45 matrix. Let P stand for such a matrix; then one can compute the first-order variations of P along both the horizontal and vertical directions, denoted by P_h and P_v, respectively. Specifically, suppose that P consists of a set of column vectors c_k and a set of row vectors r_k, for k = 1, 2, · · · , 45; then P_h and P_v collect the differences between adjacent columns and adjacent rows of P, respectively, e.g., P_h = [c_2 − c_1, c_3 − c_2, · · · , c_45 − c_44]. For illustration, Fig. 3a shows the permeability map of sample realization 1 from the initial ensemble, and Fig. 3b and c indicate the first-order spatial variations of the permeability map, along the horizontal and vertical directions, respectively. Clearly, in this case, non-zero variation values are able to capture the positions of the boundaries between the channel and the background regions.
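A sketch of the operator S_V (our own implementation of the description above, including the augmentation into a single column vector discussed in the next paragraph):

```python
import numpy as np

def S_V(m, shape=(45, 45)):
    """First-order variations of a flattened permeability map, along
    the horizontal and vertical directions, stacked into one vector."""
    P = m.reshape(shape)
    P_h = np.diff(P, axis=1)          # differences of adjacent columns
    P_v = np.diff(P, axis=0)          # differences of adjacent rows
    return np.concatenate([P_h.ravel(order="F"), P_v.ravel(order="F")])

# For a channelized map, the nonzero entries of S_V(m) flag the
# channel boundaries, as in Fig. 3.
m = np.full(45 * 45, np.log(500.0)); m[1000:1100] = np.log(10000.0)
v = S_V(m)
```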
In addition, the transform operator S_V further augments the column vectors of both P_h and P_v into a super column vector. As such, given two reservoir models, say m_1 and m_2, the difference S_V(m_1) − S_V(m_2) equals the difference between the augmented variation vectors of m_1 and m_2. With the operator S_V, we let the matrix B_k in both (30) and (31) be the identity matrix, such that the corresponding Hessian matrices become 2 ṡgn(x) ṡgn(x)^T and 2 I, respectively (cf. Eqs. 30 and 31 with B_k = I). According to Eq. 26, the corresponding PLHs with respect to the reservoir model m^i_j are given by Eqs. 36 and 37 (for cases ℓ^2_1 and ℓ^2_2), respectively, where the square-root matrix S^i_{S_V} is computed according to Eq. 8, with the operator O therein replaced by the operator S_V, and "V" in the subscript of the PLHs stands for variation. Note that in Eq. 36, the component ṡgn(·) ṡgn(·)^T is a rank-1 matrix. As a result, we expect the corresponding PLH in Eq. 36 to be severely rank-deficient, in contrast to the PLH in Eq. 37, whose rank would be N_e − 1 given independent ensemble members m^i_j.

In S-III, the transform operator first calculates the histogram of a permeability map. For distinction, we denote this operator by S_H hereafter. Suppose that a permeability map contains M_p permeability values in total, and that M_{b_k} of them fall into the kth histogram bin. For each bin, say the kth one, we assign to it a value E_k ≡ −M_p p_k ln(p_k), where p_k = M_{b_k}/M_p is the empirical probability (or relative frequency) that a permeability value falls into the kth bin. We note that E_k can be considered as the component-wise Shannon information entropy [30], scaled by the factor M_p. By definition, we also have E_k = M_{b_k} × (ln M_p − ln M_{b_k}), which is the formula used in our computation. For numerical evaluations, we adopt the convention that 0 × ln(0) = 0.
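The histogram-entropy computation just described can be sketched as follows; the bin count and value range are hypothetical, and the returned vector corresponds to the augmentation discussed in the next paragraph:

```python
import numpy as np

def S_H(m, bins=20, value_range=(0.0, 10.0)):
    """Histogram-entropy transform: E_k = M_bk (ln M_p - ln M_bk),
    with the convention 0 * ln(0) = 0 enforced via the mask below."""
    counts, _ = np.histogram(m, bins=bins, range=value_range)
    M_p = m.size
    E = np.zeros(bins)
    nz = counts > 0
    E[nz] = counts[nz] * (np.log(M_p) - np.log(counts[nz]))
    return E                          # the augmented entropy vector

m = np.random.default_rng(2).uniform(0.0, 10.0, size=45 * 45)
e = S_H(m)
```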
Furthermore, the transform operator S_H augments the entropies of all histogram bins into a column vector. For two reservoir models, say m_1 and m_2, the difference S_H(m_1) − S_H(m_2) is then equal to the difference between the augmented information-entropy vectors of m_1 and m_2. We note that the Hessians and the PLHs in S-III bear similarities to those in S-II. We let the matrix B_k in both Eqs. 30 and 31 be the identity matrix again, so that the corresponding Hessian matrices in S-III take the same forms as in S-II, while the corresponding PLHs (for cases ℓ^2_1 and ℓ^2_2) are expressed in terms of the square-root matrix S^i_{S_H}, which is again computed according to Eq. 8, and "H" in the subscripts of the PLHs stands for histogram. Similar to the situation in S-II, the PLH in the case ℓ^2_1 is rank-deficient.

For ease of reference later, we adopt a 5-bit binary code to represent the ℓ^q_p-GIES algorithms. In each bit, "1" represents the presence of the corresponding PLH, while "0" means the absence of such a PLH. For instance, the code "10000" stands for the original IES with M^i_R(m^i_j) = w_1 I_{N_e}, and "10100" for the algorithm whose regularization term combines the first and the third PLHs. As a result, following Eq. 42, we end up with 2^5 − 1 = 31 ℓ^q_p-GIES algorithms in total, after excluding the case with the binary code "00000" (i.e., no regularization). Table 2 lists the considered ℓ^q_p-GIES algorithms, with the information on the presence or absence of individual regularization terms in the cost function. (In Table 2, the algorithms are ranked in terms of mean data mismatch values during the forecast period, see Table 3; definitions of the matrix C^i_m and the transform operators S_V and S_H are provided in the text, whereas the weights associated with the individual regularization terms are specified in Table 3.) The PLHs in S-II and S-III are rank-deficient in general, and this deficiency would be particularly serious when using the ℓ^2_1 metric. The combined PLH M^i_R(m^i_j) will be of full rank if the term w_1 I_{N_e} is present; otherwise it will be rank-deficient as well.
As such, to compute the inverse of M^i_R(m^i_j) in Eq. 24, we first conduct an eigenvalue decomposition on M^i_R(m^i_j), whether it is rank-deficient or not. We then keep the leading eigenvalues (and the corresponding eigenvectors) which add up to 99% or more of the sum of all eigenvalues. The inverse of M^i_R(m^i_j) is a pseudo-inverse, computed based on the truncated eigenvalue decomposition, with the corresponding eigenvalues being the inverses of the kept eigenvalues of M^i_R(m^i_j).

There could be different ways of specifying the weight coefficients w_k, k = 1, 2, · · · , 5, in Eq. 42. In the current work, we follow the strategy in [13], and set w_k as follows:

w_k = α_k N_e / tr(M̃^i_k(m^i_j)),   (43)

where M̃^i_k(m^i_j) denotes the kth (unweighted) PLH term, tr(·) stands for the trace of a matrix, and α_k (k = 1, 2, · · · , 5) are some non-negative coefficients satisfying the constraints α_k ∈ [0, 1] and Σ^5_{k=1} α_k = 1. The rationale behind Eq. 43 is to balance the relative weights of the individual PLHs, such that the trace of each non-vanishing PLH term is roughly comparable to the trace of I_{N_e} (equal to N_e). Typically, this strategy helps to avoid the situation that an ill-conditioned PLH term has a dominating weight over other existing terms, and thus improves the numerical stability of the algorithm implementation. After M^i_R(m^i_j) is determined, the value of γ^i in Eq. 24 is chosen following the same rule as in [13].
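The two implementation details above, i.e., the trace-balancing weights of (the reconstructed) Eq. 43 and the truncated-eigendecomposition pseudo-inverse, can be sketched as follows (function names are our own):

```python
import numpy as np

def balanced_sum(plhs, alphas):
    """Eq. 43 (as reconstructed): weight each PLH term P_k by
    alpha_k * Ne / trace(P_k), so traces are comparable to trace(I_Ne)."""
    Ne = plhs[0].shape[0]
    return sum(a * Ne / np.trace(P) * P for P, a in zip(plhs, alphas))

def trunc_pinv(M, energy=0.99):
    """Pseudo-inverse via eigenvalue decomposition, keeping the leading
    eigenvalues that account for >= 99% of the total sum."""
    vals, vecs = np.linalg.eigh(M)               # M symmetric
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    k = int(np.searchsorted(np.cumsum(vals) / vals.sum(), energy)) + 1
    return vecs[:, :k] @ np.diag(1.0 / vals[:k]) @ vecs[:, :k].T

rng = np.random.default_rng(3)
A = rng.normal(size=(30, 5))
plhs = [np.eye(30), A @ A.T]          # one full-rank term, one rank-5
M = balanced_sum(plhs, alphas=[0.5, 0.5])
M_inv = trunc_pinv(M)
```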
To mitigate the adverse effects of sampling errors due to the relatively small ensemble size in use, we equip all the ℓ^q_p-GIES algorithms with an automatic and adaptive localization scheme, labeled RndShfl-GC in [31]. The stopping criteria of the ℓ^q_p-GIES include: (1) the iteration process reaches a maximum of 50 iteration steps; (2) the change of data mismatch values during two consecutive iteration steps is less than 0.01%. An ℓ^q_p-GIES algorithm stops if either of these two criteria is satisfied.

Experiment results
This sub-section compares the performance of the chosen ℓ^q_p-GIES algorithms, in terms of data mismatch values during the history matching and forecast periods. Given an ensemble M^i of N_e reservoir models at the ith iteration step, and a collection of observations d^o (with the associated observation error covariance matrix C_d) within a given period, we compute an ensemble Θ(M^i) of data mismatch values (normalized by the number d of production data) as follows:

Θ(M^i) ≡ { (1/d) (g(m^i_j) − d^o)^T C_d^{-1} (g(m^i_j) − d^o) }^{N_e}_{j=1},   (44)

with d = 240 in this case study.
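A sketch of Eq. 44 for a whole ensemble of simulated data (array shapes and names are illustrative):

```python
import numpy as np

def data_mismatch(G, d_obs, C_d):
    """Eq. 44: per-member residuals weighted by C_d^{-1}, divided by
    the number d of data points. G is (d_dim, Ne) simulated data."""
    R = G - d_obs[:, None]
    W = np.linalg.solve(C_d, R)                  # C_d^{-1} R
    return np.einsum("ij,ij->j", R, W) / d_obs.size

rng = np.random.default_rng(4)
G = rng.normal(size=(240, 100))                  # 240 data, 100 members
d_obs = rng.normal(size=240)
C_d = np.diag(np.full(240, 0.05))
theta = data_mismatch(G, d_obs, C_d)             # report mean +/- STD
```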
In the experiment, we update the reservoir models by assimilating the observations within the history matching period, and cross-validate the qualities of the estimated reservoir models by comparing the fit of their simulated data to those of the reference model within the forecast period. As a result, in the sequel, we use the (mean) forecast data mismatch as a measure of the quality of an estimated reservoir model. Table 3 reports the performance of all 31 ℓ^q_p-GIES algorithms, in terms of data mismatch values (in the form of mean ± STD) during the history matching and forecast periods, and the associated weight coefficients α_k. For the purpose of demonstration, it is sufficient to select some set of α_k values without fine-tuning to optimize the performance of each individual ℓ^q_p-GIES algorithm. For ease of comparison, the ℓ^q_p-GIES algorithms are listed in an ascending order of their mean data mismatch values within the forecast period, such that a lower rank implies better performance. Before history matching is conducted, the data mismatch values for the initial ensemble are 24.1399 ± 9.3804 and 31.6418 ± 15.7455 within the history matching and forecast periods, respectively.

Some observations can be made based on the results in Table 3. First of all, in this case study, it appears important that at least one full-rank PLH (e.g., the term w_1 I_{N_e}) be absorbed into the regularization term; otherwise, the performance of the ℓ^q_p-GIES algorithms would be degraded, see, for example, the results for those ranked from the 25th to the 31st in the table. As a conjecture, the reason for the under-performance of these algorithms is the rank deficiency of their PLHs; in particular, the histogram-based PLHs may also have relatively low ranks due to the possible correlations among the histograms of the initial ensemble of reservoir models, which was generated using SNESIM with a common training image [28]. With the low-rank PLHs, the corresponding ℓ^q_p-GIES algorithms are not able to find new reservoir models that significantly reduce the data mismatch values. As a result, one can see that the data mismatch values for the reservoir models obtained by these ℓ^q_p-GIES algorithms remain very close to those of the initial ensemble.
For conciseness, in the sequel we present some additional experiment results with respect to selected ℓ^q_p-GIES algorithms, which correspond to the binary codes 10000 (the original IES, ranked 15th), 11000 (ranked 1st), 10011 (ranked 2nd) and 01001 (ranked 3rd), respectively, in Table 3. Figure 4 reports the box plots of data mismatch within the history matching period at different iteration steps. As indicated in Table 3, in terms of mean data mismatch values, the ℓ^q_p-GIES algorithm with the code 01001 results in the lowest mismatch values, followed by the ℓ^q_p-GIES algorithms with the codes 10011 and 11000, whereas the original IES (code 10000) leads to the highest data mismatch. Meanwhile, it should be noted that lower data mismatch within the history matching period does not necessarily imply lower data mismatch within the forecast period. In fact, it is the ℓ^q_p-GIES algorithm with the code 11000 that achieves the lowest mean data mismatch within the forecast period. A possible explanation for this performance inconsistency is that some of the ℓ^q_p-GIES algorithms may have overfitted the noisy observations within the history matching period, in comparison to others.
For illustration, Fig. 5 shows the profiles of simulated well oil production rates (WOPR) at P5 within both the history matching and forecast periods, with respect to the initial ensemble and the final ensembles obtained through the selected ℓ^q_p-GIES algorithms, respectively. (In each sub-figure of Fig. 5, the orange curve represents the true WOPR generated by the reference model, without any observation noise; the red dots denote the noisy observations within the history matching period; the blue curves stand for the forecasts from either the initial or the final ensembles of reservoir models, with the adopted history matching algorithms marked by their corresponding binary codes in the captions of the sub-figures; and the vertical dashed green lines separate the history matching and forecast periods.) As can be seen from Fig. 5a, all the simulated WOPR of the initial ensemble (blue curves) under-forecast the WOPR of the reference model. This is possibly due to the fact that, in the reference model, the rectangular area with the wells I5, I6, P5 and P6 as its vertices (upper right region of Figs. 1 or 7a) contains a relatively large portion of high-permeability values. In contrast, in the initial ensemble, the same rectangular area is filled with more low-permeability values instead (cf. Fig. 2a or 7b). After applying the ℓ^q_p-GIES algorithms to update the reservoir models, the simulated data of the final ensembles tend to match the WOPR of the reference model better, although we still see the tendency of under-forecasting the WOPR for most of the updated reservoir models. On the other hand, better data matches can be found in other wells, as illustrated in Fig. 6.

Figure 7 presents the mean permeability maps of the final ensembles obtained by the ℓ^q_p-GIES algorithms. For ease of comparison, we also re-plot the reference model and the mean map of the initial ensemble there. Comparing the mean of the initial ensemble to the reference model, one can see that the initial mean model captures some of the geological structures in the reference model, e.g., in terms of the lower channel closer to the bottom. On the other hand, there are also some noticeable differences, e.g., in the areas surrounded by the wells I1, I2, P1 and P2 (upper left parts of the maps), or the areas surrounded by the wells I5, I6, P5 and P6 (upper right parts of the maps). After history matching, the final mean maps are able to capture the geological structures better in the areas surrounded by the wells I1, I2, P1 and P2, and by the wells I5, I6, P5 and P6, respectively. The connectivities of the channels in the final ensembles of updated reservoir models, however, remain a problem. We expect that this issue may be mitigated by adopting a more sophisticated re-parametrization strategy [25][26][27] in the course of history matching, which is, however, beyond the scope of the current study. For more information, in Figs. 8 and 9 we also report the STD maps with respect to the initial ensemble and the final ones obtained by the ℓ^q_p-GIES algorithms, and sample reservoir models before and after history matching, respectively.
Finally, Fig. 10 plots the histograms with respect to the reference model, and the mean models of the initial and final ensembles. While the histogram of the reference model takes two separate points as its support (corresponding to the channel and background permeability values), those of the mean models of the initial and final ensembles spread over much wider ranges of permeability values, as one may expect from the smoothing effect of ensemble averaging.

Discussion and conclusion
This work investigates the feasibility of deriving novel ensemble data assimilation algorithms in a principled and systematic way, and applies some newly derived algorithms to two history matching problems for performance comparison.
We start by establishing an ensemble-based model update formula, which can be considered as an approximate solution to a generalized minimum-average-cost (GMAC) problem. In the GMAC problem, one aims to find an ensemble of inputs (e.g., model parameters) to a given forward simulator, in such a way that the corresponding outputs of the simulator match the observed data to a good extent. As such, solving the GMAC problem corresponds to finding an ensemble of solutions to the underlying inverse problem. Similar to the transition from variational data assimilation to ensemble-based data assimilation, the solution to the GMAC problem can be deemed an "ensemblized" version of a certain deterministic approach to solving the inverse problem. In the literature, various deterministic inversion algorithms have been developed by minimizing different types of (regularized) cost functions. In contrast, for ensemble-based data assimilation algorithms in practical use, such a variety (in terms of the underlying cost functions) does not seem to exist, to the best of our knowledge. The current work narrows this noticed gap between the varieties of the deterministic and ensemble-based inversion algorithms, by showing that one can derive novel ensemble data assimilation algorithms from a class of generalized cost functions, which in general go beyond the form of nonlinear least squares in the conventional ensemble-based data assimilation algorithms. For demonstration, we investigate a sub-class of newly derived ensemble data assimilation algorithms, called the ℓ^q_p-GIES algorithms. The underlying cost functions of these algorithms share the same data mismatch term, but differ in their regularization terms, which consist of mixtures of ℓ^{q_k}_{p_k} metrics. By applying the ℓ^q_p-GIES algorithms to two history matching problems, we show that many tested ℓ^q_p-GIES algorithms outperform the original IES in these case studies. In a bigger picture, given the richness of the ensemble-based data assimilation algorithms that can be derived from the general model update formula, it appears sensible to believe that one may find certain algorithms that perform better than the original IES in various inverse problems. On the other hand, how to find the optimal designs of such ensemble-based algorithms remains an open research question.

Appendix A: Iterative ensemble smoothers derived from a class of generalized cost functions
In the sequel, we proceed to develop an approximate solution to the generalized MAC problem in Eqs. 9 and 10, in a way similar to that in [13]. To this end, we need to assume certain regularity conditions, namely, that the operators D, R, T, S and g are (locally) differentiable up to the relevant orders in our derivations below.
We start by considering the first-order Taylor approximation

f(x_0 + δx) ≈ f(x_0) + (∇_x f(x_0))^T δx,   (45)

where f is a vector-valued function, x_0 is an input vector associated with a (relatively small) perturbation vector δx, and ∇_x f(x_0), as defined in Eq. 46, represents the gradient of f evaluated at the point x_0. Note that throughout this work, we adopt the following convention: given the m_x-dimensional vector x and the function f : R^{m_x} → R^{m_f},

∇_x f ≡ ∂f/∂x   (46)

is the m_x × m_f matrix whose (u, v)th element is ∂f_v/∂x_u. The following formulas of matrix calculus [32],

∂f(u(x))/∂x = (∂u(x)/∂x)(∂f(u)/∂u),   (47)

∂(M u(x))/∂x = (∂u(x)/∂x) M^T,   (48)

will be involved in our derivation later. In Eq. 47, f(u(x)) represents the composition of two vector functions f and u, with the dummy input vector x. Equation 48 may be considered as a special case of Eq. 47, where f(u(x)) = M u(x), with M being a constant matrix, such that ∂(Mu)/∂u = M^T. In addition, the matrix identity

S^T (S S^T + γ C)^{-1} = (S^T C^{-1} S + γ I)^{-1} S^T C^{-1},   (49)

valid for any conformable matrix S, symmetric positive-definite matrix C and scalar γ > 0, and the corresponding derivative in Eq. 50 will also be used. Then, using Eq. 49 and with some algebra, one arrives at the update formulas in Eqs. 62-66.

Before proceeding further, we have a few remarks. First, the assimilation algorithm presented in Eqs. 62-66 provides an approximate solution to the GMAC problem, but without using ensemble approximations to the first- and second-order gradients therein yet. If one already has access to all required gradients, then they can be used in the assimilation algorithm, and this may lead to improved history matching performance. On the other hand, if some of the gradients are not available, or are impractical to compute and/or store, then one may employ ensemble approximations to obtain certain (sometimes partially) derivative-free algorithms, as will be discussed later.
Second, in the original IES algorithm (3)-(7), the Kalman-gain-like matrix, K^i in Eq. 4, is common to all ensemble members. In contrast, in the case with a generalized cost function, e.g., that in Eq. 10, the corresponding Kalman-gain-like matrix, K^i_j in Eq. (66), depends in general on both the individual ensemble member m^i_j and the (potentially) perturbed data d_j, as one can see from Eqs. 62-63. While the diversity of K^i_j may be of theoretical interest, in practice one of the implications is the increased memory consumption for storing these Kalman-gain-like matrices. This problem can be avoided if one lets both the distance metric D and the regularization operator R be quadratic functions, as in the problem formulation that leads to the original IES, cf. Eq. 2 or Eqs. 11 and 12; or partially mitigated if one adopts ensemble approximations to compute K^i_j, as will be shown below.

To obtain an ensemble-based approximation, we start by computing the product C^i_{m,j} (H^i)^T involved in the computation of K^i_j, through the intermediate quantities defined in Eqs. 67-73. Again, if ∇^2 D cannot be evaluated exactly, then an ensemble-based approximation can be adopted, which will be discussed later. Assembling the results from Eqs. 67-73 into Eq. 65, we derive the ensemble-based model update formula in Eq. 74. In Eq. 74, if the analytic form of the gradient ∇D is not available, then an ensemble-based approximation strategy can be adopted for its computation, as will be discussed later.

Below we show that the original IES algorithm in Eqs. 3-7 is a special case of the more general formula in Eq. 74. To this end, let the distance metric D and the regularization operator R be quadratic mappings as defined in Eqs. 11 and 12. In addition, let both the transform operators T and S be the identity, which leads to the expressions in Eqs. 75-76, where I_{N_e} is the identity matrix of dimension N_e × N_e. As a result, the update formula (74) is reduced to Eq. 77. According to Eq. 49, one has the identity in Eq. 78; therefore, Eq. 77 is equivalent to Eq. 79, which is exactly the same as the update formula induced by Eqs. 3-7.
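As an aside, the matrix identity that we reconstructed as Eq. 49, and that underlies Eq. 4, admits a two-line verification, shown here in LaTeX for reference:

```latex
% Verification of the (reconstructed) identity in Eq. 49, for
% conformable S, symmetric positive-definite C, and gamma > 0:
\begin{align*}
 \left(\mathbf{S}^{T}\mathbf{C}^{-1}\mathbf{S} + \gamma \mathbf{I}\right)\mathbf{S}^{T}
   &= \mathbf{S}^{T}\mathbf{C}^{-1}\left(\mathbf{S}\mathbf{S}^{T} + \gamma \mathbf{C}\right) \\
 \Longrightarrow \quad
 \mathbf{S}^{T}\left(\mathbf{S}\mathbf{S}^{T} + \gamma \mathbf{C}\right)^{-1}
   &= \left(\mathbf{S}^{T}\mathbf{C}^{-1}\mathbf{S} + \gamma \mathbf{I}\right)^{-1}\mathbf{S}^{T}\mathbf{C}^{-1}.
\end{align*}
```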

Appendix B: Approximations to the products between gradient or Hessian and certain ensemble-induced square-root matrices
Here, we consider ensemble-based approximations to the products between the gradients or Hessians of the operators involved and certain ensemble-induced square-root matrices, in case the analytic forms of these gradients and/or Hessians are not available.

Appendix C: Gradients and Hessians of the ℓ^q_p metrics

Among the results derived here, the gradient of the ℓ^1_1 metric ‖Bx‖^1_1 with respect to x is B^T ṡgn(Bx), and the derivative of the element-wise sign function ṡgn vanishes almost everywhere. The latter result implies that the Hessian of ‖Bx‖^1_1 vanishes almost everywhere (except at Bx = 0).

Appendix D: Gradient and Hessian of a mixture of regularization operators
Here we aim to evaluate the gradient and Hessian of a mixture of K_mix regularization operators in the transformed model space, of the form

R(S(x)) ≡ Σ^{K_mix}_{k=1} w_k R_k(S_k(x)),

where w_k is a scalar, representing the weight associated with the kth regularization operator R_k, which acts on the transformed model variables S_k(x), obtained by applying the transform operator S_k to the model variables x.
For ease of comprehension, in the sequel we adopt the method of induction. To start, consider the case with a single regularization operator R, such that the regularization term is simply of the form R(S(x)). In this case, applying the chain rule (47), we have the gradient ∇_x(R(S(x))) = (∇_x S) ∇_S R (Eq. 94). To get the Hessian, we differentiate the gradient once more, which leads to Eq. 95, whose third line takes the form (∇_x S)(∇^2_S R)(∇_x S)^T. To derive the third line of Eq. 95, we ignore the Hessian of the transform operator S with respect to x, in light of the strategy of first-order Taylor approximation in Eq. 55.

Now let us consider the case of a mixture of two regularization operators, in terms of w_1 R_1(S_1(x)) + w_2 R_2(S_2(x)). In this case, we define the stacked transform operator and the corresponding mixture operator accordingly. As a result, we obtain the gradient of R with respect to S, composed block-wise of the weighted gradients w_1 ∇R_1 and w_2 ∇R_2. In addition, the Hessian of R with respect to S is given by Eq. 107. To obtain (107), we have used the fact that the mixed partial derivatives between S_1(x) and S_2(x) vanish, so that the Hessian is block-diagonal.

Similarly, for a mixture of K_mix regularization operators, one has its gradient and Hessian in the transformed model space given by the block-wise counterparts of the two-operator case, with the stacked transform operator

S(x) ≡ [(S_1(x))^T, (S_2(x))^T, · · · , (S_{K_mix}(x))^T]^T.   (108)

Appendix E: Additional results in a 5-spots example
The reservoir model

Here we present some additional numerical results obtained by applying a subset of the ℓ^q_p-GIES algorithms to a 2D 5-spots example, wherein the numerical reservoir model is of dimension 50 × 50 (in gridblocks), with oil, water and gas phases. There are four producers (labeled P1-P4) on the corners of the model, and one injector (labeled I1) in the center, as indicated in Fig. 11. The locations of these wells, in terms of Cartesian coordinates (x, y), are as follows: P1 at (2, 2); P2 at (49, 2); P3 at (2, 49); P4 at (49, 49); and I1 at (25, 25).
The parameters to be estimated consist of the permeability along the x direction (PERMX) and the porosity (PORO) on all reservoir gridblocks. Initial ensembles for PERMX and PORO (with 100 ensemble members) are generated through sequential Gaussian simulation. Figure 11 shows the PERMX and PORO maps in the reference model, which is used to generate production data every 30 days, for a total period of 1500 days. One of the producers, P4, is shut in during history matching; therefore the production data consist of well oil/water/gas production rates (WOPR/WWPR/WGPR) from the other 3 producers, and well bottom-hole pressures (WBHP) from wells P1-P3 and I1, every 30 days. The number of production data points used for history matching is 650 in total. The observed production data are contaminated by certain zero-mean Gaussian white noise. For WOPR/WWPR/WGPR data, the standard deviations (STDs) of observation noise are either 10% of their magnitudes, or set to 10^{-6} if the observed production data (e.g., WWPR) are equal to zero; whereas for WBHP, their STDs are 1 bar.

The ℓ^q_p-GIES algorithms adopted for this case study are constructed based on the following considerations. As Figure 11 indicates, there are spatial patterns in the reservoir models that can be exploited by the algorithms, e.g., through the calculations of the first-order variations of the PERMX and PORO maps, similar to the previous channelized reservoir characterization problem. On the other hand, histograms of PERMX and PORO in the current case study do not provide particularly useful information about spatial patterns. As such, we choose to construct the ℓ^q_p-GIES algorithms by solving the GMAC problem with the following cost function:

C(m^{i+1}_j) ≡ (d_j − g(m^{i+1}_j))^T C_d^{-1} (d_j − g(m^{i+1}_j)) + γ^i [ w_1 (m^{i+1}_j − m^i_j)^T (C^i_m)^{-1} (m^{i+1}_j − m^i_j) + w_2 ‖S_V(m^{i+1}_j) − S_V(m^i_j)‖^2_2 + w_3 ‖S_V(m^{i+1}_j) − S_V(m^i_j)‖^2_1 ],   (111)

where S_V is a transform operator similar to that in the channelized reservoir characterization problem, but applied to both the PERMX and PORO maps; and the weights w_i (i = 1, 2, 3) are determined in a way similar to that in Eq. 43, which leads to the scalar coefficients α_i (i = 1, 2, 3) in Table 4. Similar to the situation in the channelized reservoir characterization problem, we adopt 3-bit binary codes to refer to the ℓ^q_p-GIES algorithms derived from Eq. 111, which leads to 7 algorithms after excluding the one with the code 000 (i.e., no regularization). The characteristics of these 7 algorithms are analogous to those summarized in Table 2, with the regularization terms in Eq. 111 adopted instead. All these ℓ^q_p-GIES algorithms are equipped with the automatic and adaptive localization scheme, RndShfl-GC, of [31] during history matching, and are run for 10 iteration steps.
In this example, we use the root mean square error (RMSE) as a measure to cross-validate the history matching performance. Given an m-dimensional reference model m^{ref} and an ensemble M^i = {m^i_j}^{N_e}_{j=1} of reservoir models at the ith iteration step, we compute an ensemble Ω(M^i) of RMSE values as follows:

Ω(M^i) ≡ { ‖m^i_j − m^{ref}‖_2 / √m }^{N_e}_{j=1}.   (112)

In Table 4, the ℓ^q_p-GIES algorithms are ranked in terms of the average RMSE values of PERMX and PORO (with equal weights). Following this setting, the original IES (code 100) is ranked the 6th, meaning that there are a few other ℓ^q_p-GIES algorithms that outperform the original IES. As such, one can draw a conclusion similar to that in the previous channelized reservoir characterization problem; that is, it is possible to design, through the GIES framework, new ensemble history matching algorithms that may perform better than the original IES algorithm in certain circumstances.
In the current case study, except for the ℓ^q_p-GIES algorithm with the code 001 (for convenience, hereafter we simply call it algorithm 001, and the same custom applies to the other algorithms), the algorithms result in very similar RMSE values for the estimated PORO maps. On the other hand, the differences among the RMSE values of PERMX are more substantial. In particular, for algorithm 001, although its mean data mismatch value is higher, its RMSE values for both PORO and PERMX are the lowest among the tested algorithms.
To limit the length of the current work, we do not present the full numerical results. Instead, we focus on inspecting the impacts of a few ℓ^q_p-GIES algorithms on the model updates. To this end, Fig. 12 reports the PERMX and PORO maps of one initial reservoir model, and the corresponding maps of the final models obtained by algorithms 100 (i.e., the original IES), 001 and 010, respectively. In line with the results in Table 4, the PORO maps obtained by algorithms 100 and 010, respectively, are very similar. In comparison to the initial PORO map, a noticeable difference can be observed in the area around the coordinate (45, 15). On the other hand, the final PORO map obtained by algorithm 001 bears more structural differences from those of the other two algorithms. Meanwhile, compared to the initial PORO map, the final PORO map of algorithm 001 also exhibits some clear differences in the area around the coordinate (12, 7). A similar conclusion can be drawn for the PERMX maps, especially if one compares the PERMX maps in the area around the coordinate (18, 12). As such, in this particular case study, it appears that, due to the use of the regularization term ‖S_V(m^{i+1}_j) − S_V(m^i_j)‖^2_1 with the ℓ_1 norm, algorithm 001 is able to produce flatter regions, in which the estimated parameters (PERMX and PORO) exhibit less spatial variation. This property is less noticeable in algorithms 100 and 010, in which the distance is measured by the ℓ_2 norm instead.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.