p-Kernel Stein Variational Gradient Descent for Data Assimilation and History Matching

A Bayesian method of inference known as “Stein variational gradient descent” was recently implemented for data assimilation problems, under the heading of “mapping particle filter”. In this manuscript, the algorithm is applied to another type of geoscientific inversion problems, namely history matching of petroleum reservoirs. In order to combat the curse of dimensionality, the commonly used Gaussian kernel, which defines the solution space, is replaced by a p-kernel. In addition, the ensemble gradient approximation used in the mapping particle filter is rectified, and the data assimilation experiments are re-run with more relevant settings and comparisons. Our experimental results in data assimilation are rather disappointing. However, the results from the subsurface inverse problem show more promise, especially as regards the use of p-kernels.


Introduction
Bayesian inference in data assimilation (DA) has been researched for several decades and is frequently applied in the petroleum industry today. Due to the typically vast computational cost of the forward simulation problem, classic Bayesian Monte Carlo methods, such as Markov chain Monte Carlo (MCMC) and importance sampling, are not applicable. Therefore, approximate Bayesian methods, which require fewer computational resources, are applied in most real-world cases. It is convenient to separate these methods into derivative-based and derivative-free approaches. Derivative-based approximations include Randomized Maximum Likelihood (RML, Kitanidis 1995;Oliver et al. 1996), distributed Gauss-Newton solvers (Gao et al. 2017), and EDA-4DVar (Carrassi et al. 2018).
However, it is more common to use derivative-free methods due to the common lack of adjoint code in both commercial simulators and large scale models in general. Many of the modern derivative-free methods are based on the ensemble Kalman filter (EnKF, Evensen 2004). During the last decade the method of choice has become the iterative ensemble methods such as iterative EnKF/EnKS (Bocquet and Sakov 2014), the ensemble RML (Chen and Oliver 2013) and the ensemble smoother with multiple data assimilations (ESMDA, Emerick and Reynolds 2013). Unfortunately, in the general (non-linear) case, none of these ensemble methods converge to the true posterior distribution in the limit of an infinite sample size. It is possible to nest some of these methods in the importance sampling framework to achieve asymptotic optimality (Stordal and Elsheikh 2015;Stordal 2015;Stordal and Karlsen 2017), but at a considerable cost. Other approximate approaches to Bayesian sampling that go beyond the traditional approach of MCMC include sampling via optimal transport (El Moshely and Marzouk 2012;Reich 2013;Marzouk et al. 2017) and implicit sampling (Skauvold et al. 2019) In this work, the recently published Stein variational gradient descent method (SVGD, Liu and Wang 2016) is applied in history matching for the first time. The kernels introduced in the algorithm are more appropriate for high dimensional applications. In addition, an alternative derivative-free implementation is discussed. Previous applications of SVGD includes Bayesian logistic regression (Liu and Wang 2016), training of neural nets (Feng et al. 2017), sequential filtering of the Lorenz-63 and -96 models , inference on a simple linear and nonlinear partial differential equation (PDE) using a subspace Hessian projection (Chen et al. 2019) and a Gauss-Newton formulation (Detommaso et al. 2018).
The outline of this paper is as follows: Sect. 2 introduces the SVGD theory. In Sect. 3, the implementation is presented with and without an adjoint code, and the choice of kernels for high dimensional applications. Examples with a toy model, with the Lorenz systems, and with a reservoir model, are given in Sect. 4. A summary and discussion of future work concludes the paper in Sect. 5.

Stein Variational Gradient Descent
Let X ∈ X ⊆ R N x be the unknown vector of interest. In Bayesian statistics, it is assumed random. Denote its (prior) probability density function (PDF), p(x). It is observed through the noise-corrupted measurements where ∈ R N y is a random noise vector, and H is a (possibly nonlinear) mapping H : X → Y ⊆ R N y . The likelihood, p(y|x), follows from the distribution of , whose framing as an additive error is mere convention. The posterior is given by Bayes' rule: p(x|y) ∝ p(y|x) p(x). For complex problems, such as large scale inversion and DA, Bayesian inference often takes the form of (approximate) sampling from the posterior. Given two densities p and q, the Stein discrepancy can be used to quantify their difference. It is defined as where F is a set of test functions, f : X → R, that are sufficiently smooth and satisfy That is, f is in the Stein class of p . The variational problem Eq. 2 is computationally intractable in most cases, and therefore Chwialkowski et al. (2016) and  introduced the kernelized Stein discrepancy, where the function space F is set to the unit ball within a reproducible kernel Hilbert space (RKHS), H , for which analytical solutions may be obtained. The Kullback-Leibler (KL) divergence from p to q is given by It was related to kernelized Stein discrepancy by Liu and Wang (2016), who showed that if X has density q, then the KL divergence to p from the PDF q T of the transformation has a derivative, d d D KL (q T p), that is maximized at = 0 for where K (x, x ) is the unique kernel defining H .
In the context of Bayesian inference, Bayes' rule may be computed by a KL minimizing algorithm (Liu and Wang 2016): the SVGD starts with a sample {x i 0 } N i=1 from the prior density p(x) and iteratively updates x i k using the empirical (Monte Carlo) version of Eq. 6 in Eq. 5 for all particles, or ensemble members, i = 1, . . . , N . If the kernels are chosen to be 2 ), then Eq. 7 reduces to The last term may be seen as a weighted average of the gradient of the log posterior and of its similarity to the other members. The first term guides the sample points towards the maximum of the log posterior, while the second term repulses sample points that are too close. Equation 8 presupposes the use of simple gradient descent optimization In practice, Liu and Wang (2016); Pulido and van Leeuwen (2019) both used an adaptive subgradient variation of this (ADAGRAD, Duchi et al. 2011), which is also the choice made here. It is interesting to note that if the kernel is degenerate, then the SVGD algorithm produces N copies of the MAP estimate (or N local optima in the general non-convex case). It was shown in Liu (2017) that the continuous-time infinite-sample limit of the density induced by SVGD satisfies the Vlasov equation (Vlasov 1961) Using Stein's lemma (Stein 1972) it may be shown that ∂μ t = 0 for μ t = p(x|y), meaning that the SVGD algorithm can be viewed as particle flow. Furthermore, Zhang et al. (2018) showed that SVGD can be combined with Langevin dynamics to obtain a stochastic particle flow version of SVGD that may avoid the issue of particles collapsing to a single point or getting stuck in a local mode. This will not be discussed further here as it did not have any impact on the problems presented here. For more details the reader is referred to Zhang et al. (2018) and references therein.

Implementation
In the following it is assumed that the prior and the measurement error are both Gaussian. Hence where H is the sensitivity matrix (the derivative of H with respect to x) and C x and C are the covariance matrices of X and , respectively. Computing H is challenging since it requires an adjoint code. It is therefore of great interest to study alternative implementations of SVGD using approximate gradients. It should also be mentioned that Han and Liu (2018) presented a weighted SVGD wherein the gradient is computed using a surrogate density instead of the target density. In the following description, however, the focus is to use either an adjoint code, or an ensemble approximation of H.

Adjoint Implementation
In large scale PDE-constrained optimization for the solution of inverse problems (e.g. in reservoir history matching) it is convenient to formulate H as depending on two separate variables, H = H(ξ, x), where the dynamic ones, ξ , depend deterministically on the fixed (but unknown) parameter fields x through g(ξ, x) = 0, where g is the set of (discretized) forward model equations. Implicit differentiation yields ∂ x ξ = −(∂ ξ g) −1 ∂ x g, so that the (total) derivative of H with respect to x, required in Eq. 10, is given by Using the shorthand the last term in Eq. 10 becomes The use of brackets in Eq. 15 seems heavy-handed. However, in contrast to the explicit computation of H of Eq. 12 (known in the literature as the forward, or direct, method (Rodrigues 2006;Anterion et al. 1989), or gradient simulator (Oliver et al. 2008)), the ordering of the brackets in Eq. 15 yields a sequence of computations, now involving the transposed system, with a cost proportional to a single simulation. This sequence of operations is known in the literature as adjoint, or backward, method (Rodrigues 2006;Chavent et al. 1975).
A major hurdle in applying the adjoint method (as well as in the Forward/Direct method) is writing the code for the computation of the partial derivatives of g and H with regard to both ξ and (mainly) x. In this study, automatic differentiation (Bendtsen and Stauning 1996) is applied. Zhou (2008) showed that, by the reproducing property of f ∈ H ,

Adjoint-Free Implementation
However, the inner product on H is given by where φ and λ are the eigenfunctions and eigenvalues of K . This is not trivial to compute due to the high-dimensional integrals and the computation of eigenfunctions. An alternative formulation was used by  where the gradient of H was approximated by Monte Carlo integratioñ using the sample available from the SVGD algorithm at each iteration. A further normalization was later introduced to improve the approximation by dividing each value ofH by the sum of the kernels at that point. The approximation was based on the claim that Eq. 18 would converge to for H ∈ H as N goes to infinity. However, this uses the plain L 2 inner product, which is not a RKHS, so Eq. 19 is incorrect. However, if it assumed that H(x)K (x, x ) = 0 in the limits or at the boundary (in case of finite support), then integration by parts yields Hence, Eq. 19 can be seen as a kernel-smoothed derivative of H around the point x, provided the kernel is normalized (integrates to one). Furthermore, their unsatisfactory results are a consequence, not only of the approximation in Eq. 19, but also of the fact that the sample is drawn from a particular distribution and not uniformly in space. Hence the notion that Eq. 18 will converge to HK is also incorrect. This can be corrected by noting that the samples are drawn from a distribution q, hence, Eq. 18 converges to E q [HK ] unless the integrand is normalized by q. An alternative approach is therefore to draw the sample from the kernel density, q(x) = N −1 i K (x, x i ) at each iteration, yielding the Monte Carlo estimate of Eq. 19 given by Alternatively,  suggest using an average ensemble approximation for the derivative, as in the EnKF. This does not allow for local information, and the algorithm falls into the ensemble-based category where quasi-linearity of the model is assumed.
Finally, sinceH is estimated using a sample, it will suffer from Monte Carlo errors. This can, just like in the EnKF and its variants, be reduced by introducing localization. A user defined correlation matrix ρ can be used to taper H(x) in Eq. 22 via the Schur product ρ • H(x).

Extension to p-Kernels
In the literature discussed above, almost all applications selected K (x, x ) to be a Gaussian kernel. The choice of the kernel may not be critical in low dimensional problems. For high dimensional problems, however, the Gaussian kernel is not able to separate points locally, due to the curse of dimensionality. It will produce degeneracy, as illustrated in importance sampling. To overcome this issue, scaled kernels were used both in Liu and Wang (2016) and Detommaso et al. (2018). In many cases, a simple scaling of the kernel bandwidth is not sufficient to separate points in high dimensions (increasing the bandwidth in high dimensions results in almost uniform weights) and an alternative approach of using p-kernels (Francois et al. 2005) is therefore chosen here. The kernel is specified by The special case p = 2 and d(x, x ) = x − x 2 reduces the p-kernel to a Gaussian kernel. As the dimension increases, the distances (i = j), which are χ 2 if X is Gaussian, tend to cluster away from zero, hence the term with i = j in Eq. 8 increasingly dominates unless the bandwidth of the kernel is chosen to be very large. In this case the kernel will not separate points, and the local property of the kernel is lost.
To overcome this, p-kernels force the kernel value to stay close to 1 until it reaches the lower tail of the distance distribution of the sample, and then decay appropriately. Specifically, the parameters p and σ are here set to match the α-percentile, d α , and the (1−α)-percentile, d 1−α , of the empirical distribution of distances, meaning In Fig. 1, the distribution of Euclidean distances for a standard Gaussian random vector of dimension 1000 with α = 0.05 is plotted. Also included is a set of Gaussian kernels with different bandwidths and the p-kernel. The problem of using Gaussian kernels is evident: either the kernel function dies out before it reaches the sampled distances, or the bandwidth is so large that it does not separate the points. Hence uniform weights could in practice replace the kernel.

Univariate Example
The first test is to reproduce the toy example in  in order to validate our derivative approximation (Eq. 22) and demonstrate the sampling properties of SVGD. In the simple toy problem of , the prior is given by a univariate Gaussian density with mean 0.5 and variance of 1. The measurement model is y = x 2 + , where is a zero-mean Gaussian random variable with a variance of 0.5. The actual measurement is set to y = 3. Although x 2 is not in the Gaussian RKHS (Minh 2010), the reproducing property is still valid for the derivative in this particular case. That is, u 2 ∂ x K (x, u) du = 2x.  Figure 2 shows the derivative approximation of Eq. 18 and its normalized version (blue and red), reproduced from , together with the true derivative (black) and our proposed method (cyan) in Eq. 22 of the prior sample. The results clearly show the inadequacy of the approximation (blue line) in Eq. 18, which in this case is not even monotonic. The normalized version (red line) is just a damped version and exhibits the same non-monotonic behavior. However, it can also be seen that the corrected Monte Carlo estimate is slightly biased due to deficiency the of kernel density estimation (Silverman 1986) for the tails of the prior density.
In Fig. 3 we plot density estimates, including SVGD using the exact (magenta) and the new approximate (cyan) derivative of Eq. 22. The standard EnKF solution (green) and the prior (red) are also included. The sample size is 100 and 1000. The error in SVGD is due to the sampling error and bias in the kernel density estimation. The SVGD with the (new) approximate derivative has an additional bias due to the error in the derivative estimate, as seen by the left and right skewness in Fig. 3. Note that this error is reduced with increased sample size.

Data Assimilation Tests with the Lorenz Systems
Pulido and van Leeuwen (2019) tested the SVGD in the sequential filtering setting (described below) of data assimilation (DA, Wikle and Berliner 2007), applying it in the analysis step by constituting the prior from the ensemble using Gaussian mixtures. They called this the mapping particle filter (MPF), however, the SVGD label is maintained here. Their results suggest that the SVGD filter achieves the accuracy of the (bootstrap) particle filter on the Lorenz-63 system, and the EnKF on the Lorenz-96 system.
Unfortunately, their result lacks relevance because the Lorenz systems, used as coarse prototypes of atmospheric dynamics, are inundated with noise (see appendix A), so that the extended Kalman filter (EKF, Jazwinski 1970), or even 3D-Var, can achieve optimal accuracy, at much lower costs. This issue is rectified here by benchmarking the SVGD filter (i.e. MPF) with more standard settings.
The Lorenz-63 system (Lorenz 1963;Sakov et al. 2012) is given by the N x = 3 coupled ordinary differential equationṡ with parameter values r = 28, σ = 10, and b = 8/3. The true trajectory, x(t), is computed using the fourth-order Runge-Kutta scheme, with time steps of 0.01 time units, and no model noise. Observations of the entire state vector (i.e. H = I N x ) are taken Δt Obs = 0.25 time units apart with a noise variance of 2 (i.e. C = 2I N x ). The Lorenz-96 system (Lorenz 1996;Ott et al. 2004) is given by the coupled ordinary differential equationṡ for m = 1, . . . , N x , with periodic boundary conditions, and a constant "force" of F = 8. These are integrated using the fourth-order Runge-Kutta scheme, with time steps of 0.05 time units, and no model noise. The state dimension is set to N x = 10 (rather than the typical value of 40) so that the particle filter is practical. Observations of the entire state vector (i.e. H = I N x ) are taken Δt Obs = 0.1 time units apart with unit noise variance (i.e. C = I N x ). The methods are assessed by their accuracy, as measured by root-mean squared error (RMSE) which is recorded following each analysis of the latest observation y(t). After the experiment, the instantaneous RMSE(t) are averaged for all t > 20. Comparison of the benchmark performance of SVGD is made to that of the EnKF ) and the bootstrap particle filter (with universal resampling, triggered if w −2 < N /2, where w is the vector of weights). In addition, baseline methods are included for context. Their analysis estimates, x a , are computed as follows:x for Climatology,x + K(C) [y −x] for Optimal Interpolation, and x f + K(cI) [y − x f ] for 3D-Var. Here,x and C are the mean and covariance of the (invariant measure of the) system dynamics, K(C) = CH (HCH + R) −1 is a gain matrix, x f is the model forecast of the previous x a , and c is a scaling factor subject to tuning.
The RMSE averages of each method are tabulated for a range of ensemble sizes, N , and plotted as curves in Figs. 4 and 5. The plotted scores represent the lowest obtained among a large number of tuning settings, selected for optimality at each point. For the PF the tuning parameter is the bandwidth (scaling) of the regularizing post-resample jitter, whose covariance is computed from the weighted ensemble. For the EnKF ) the tuning parameters are (i) the post-analysis inflation factor and (ii) whether or not to apply random, covariance-preserving, post-analysis rotations (Sakov and Oke 2008). For the SVGD the tuned parameters are: (i) the number of iterations (maximum: 100) and (ii) the step size for ADAGRAD, (iii) the bandwidth of the isotropic Gaussian kernels of the priors' mixture, and (iv) the bandwidth of the isotropic Gaussian kernels defining the RKHS. In the case of p-kernels, the latter bandwidth is determined automatically by Eq. 25, along with the p parameter.
As can be seen in both Figs. 4 and 5, the abscissa (N -axis) can be roughly partitioned into three segments, with regard to the ordering of the method scores: For intermediate ensemble sizes, N , the EnKF obtains the lowest RMSE score among all of the methods. For the largest ensemble size, N = 400, the particle filter obtains the best score, while in the case of Lorenz-63, the SVGD filter achieves the same score as the EnKF (which it already obtained with N = 8). Tests with N > 400 were not affordable due to the cost of SVGD. For small N , the SVGD filter obtains lower RMSE than both the particle filter and the EnKF. However, as this score is not better than 3D-Var using a background matrix proportional to identity, it does not hold much relevance.
Furthermore, the p-kernel modification generally scores worse than SVGD with Gaussian kernels and tuned bandwidths. This is not very surprising, because the dimensionality is low, preempting their rationale. Moreover, an investigation reveals that, for small ensemble sizes, the p-kernel approach has a tendency to use very large values of p. For example, the corresponding curve in Fig. 4 spikes at N = 3. The reason is that, sometimes, the three distances between the three ensemble members are almost equal, but far from zero, requiring, on average, p = 22. By contrast, this does not occur at N = 2, because then there is only one distance, nor does it occur when N is very large and the distances rarely coincide.
In summary, the SVGD filter of Pulido and van Leeuwen (2019), with or without our p-kernel modification, while functional, does not achieve the same accuracy as standard DA methods. It is important to remember that the SVGD filter involves many iterations, quite a few tuning parameters, and linear algebra with N x × N x matrices, all of which add to its cost. It seems likely that the disappointing performance of SVGD stems from several issues: Firstly, it is limited by the precision of the prior approximations, which come from Gaussian mixtures, and may suffer from the curse of dimensionality, or from using simple, identity covariances. Secondly, the optimization process may not be very successful. In cursory experiments, other optimization routines were tested, and testw were condunctued using more iterations and the use of pre-conditioning, all of which sometimes could yield improved performance. A thorough examination is left for future work.

Reservoir History Matching Example
This section presents the results of SVGD with Gaussian kernels and p-kernels on a synthetic reservoir case. The reservoir is a two-dimensional model with 21 by 21 grid cells consisting of incompressible two-phase flow (oil-water). In each corner cell there is a production well and in the center an injection well (see Fig. 6). The porosity is assumed to be known in each grid cell with a value of 0.3. The forward model equations, as well as the adjoint model for this example, are fully depicted in de Moraes et al. (2018).
The permeability field is considered unknown with a prior distribution that is Gaussian. There are 1000 realizations available for this model (Jansen 2011), which specify the prior mean and covariance. Figure 7 shows four different permeability realizations from the ensemble. This is the same set up as in Stordal and Elsheikh (2015).
The observed data are generated from a synthetic truth (of the permeability field), chosen at random as one realization from the ensemble. Specifically, the observations are the water rates resulting from the simulation of 10 years of the truth, observed every six months, with a 5% white noise level. For more details on the reservoir settings, the reader is referred to Jansen (2011) andde Moraes et al. (2018).
The ADAGRAD method for optimization was implemented with a step size of 0.1 and an autocorrelation of 0.9, which are the standard settings for ADAGRAD. The maximum number of iterations was set to 50. The experiments were conducted with both Gaussian and p-kernels, and with ensemble sizes N = 100 and N = 1000.
The results for the data match with 100 members are shown in Fig. 8, as well as the data match with 1000 members in Fig. 9.
As can be observed from the results, both history matching using 100 members (Fig. 8) and 1000 members (Fig. 9) provide a reasonably good match for each individual well. Even though the prior ensemble either mostly underestimates the water rate for PROD1 and PROD3 wells (top-left and bottom-right panels, respectively) or overestimates the PROD2 and PROD4 water rates (top-right and bottom-left panels, respectively), the posterior data predictions capture the observed data well. Additionally, better data predictions are obtained with 1000 members. Furthermore, underestimation of uncertainty is often a problem in history matching. It therefore is interesting to see that the posterior ensemble using the p-kernel contains a much larger uncertainty when compared to the posterior ensemble using the Gaussian kernel. This is even clearer if one observes the uncertainty around the water break-through of wells PROD2 and PROD4 (top-right and bottom-left panels on Fig. 9b). Next, the history matching results are further investigated by comparing the permeability marginal PDFs conditioned to production data, shown in Fig. 10. In general, the conditioned marginal PDFs for both Gaussian kernel and p-kernel provide a reasonably good representation of uncertainty. While subtle, one may note that the conditioned The blue curves represent the true (reference) marignal pdf PDFs using p-kernel (right-side panels of Fig. 10) better represent the uncertainty when compared to the conditioned PDFs using Gaussian kernel, mainly when looking at the permeability range (horizontal axis) and the spread of the PDFs at the bottom, that use a 1000 member ensemble.
Comparisons with RML (Oliver et al. 1996) in order to further investigate the usage of p-kernels to better represent different high-dimensional geological settings will be investigated in the future.
Even though the results presented here indicate that the SVGD is a promising alternative for reservoir uncertainty quantification, it is only feasible, even considering this relatively small example, with an efficient gradient computation strategy. However, other data assimilation/uncertainty quantification strategies, such as RML (Chen and Oliver 2012), are also only feasible combined with efficient gradient computation strategies. Even though derivative-free formulations have been devised to overcome the adjoint implementation challenges (e.g. EnRML), the performance of both formulations (derivative and derivative-free) in an SVGD setting should be investigated in reservoir history matching problems.

Conclusions
The Stein variational gradient descent (SVGD) algorithm was extended to p-kernels, and we discussed derivative and derivative-free implementations. With an adjoint code, the algorithm was applied to subsurface petroleum history matching for the first time. The results showed that it can obtain reasonable data match with both Gaussian kernels and p-kernels, but that the posterior uncertainty was larger using the p-kernels, hence demonstrating the potential usefulness of p-kernels in higher dimensions.
The SVGD was also tested on two small chaotic dynamical systems. For these models the measurement operator is linear, thus the sensitivity matrix is trivial. However, as the prior distribution is unknown, it has to be approximated using kernel density estimation. The results showed that the SVGD is outperformed by the EnKF for intermediate ensemble sizes and the particle filter for large ensemble sizes, tempering the impression given by . This contradiction is most likely a result of using a less noisy system (the dynamics in  were entirely driven by noise) and the fact that the posterior sample from SVGD cannot overcome the deficiencies of the density estimate of the prior. Further analysis and improvements are left for future work.
The reservoir example indicated more potential for the SVGD algorithm. In particular, the uncertainty quantification of the posterior seemed to improve when using the p-kernel instead of the Gaussian kernel. A more detailed investigation of the uncertainty quantification properties of SVGD, in particular when compared to RML or ensemble smoothers, is left for future work. In addition, the ensemble approximation of the derivative for reservoir models will also be investigated in the future, in combination with localization. difference in RMSE between the noisy and noiseless cases indicates that the systems are dominated by noise.

A.2 Predicting the RMSE Without Experiments
It is no coincidence that the RMSE value of 0.5 occurs in the noisy case both (!) for . Consider the stationary Riccati recursion with constant system matrices (with conventional DA notation) which can be used to predict the squared error, P a ∞ , if the dynamics, M, are approximately constant. Pulido and van Leeuwen (2019) use H = I, and a diagonal model noise covariance, Q. Now, if the error growth is entirely dominated by the noise, then the impact of the dynamics is negligible, i.e. M ≈ I. Thus the system dimensions (for either  becomes decoupled and homogeneous, and Eq. 31 reduces to a set of identical scalar systems, each of which yields the quadratic equation Inserting the settings of the Pulido and van Leeuwen (2019), namely Q = 0.3 and R = 0.5, yields P a ∞ = 0.26, of which the square-root is 0.5. This indicates that, indeed, the model, M, has negligible bearing on the experiments.