Improved sampling strategies for ensemble-based optimization

We are concerned with the efficiency of stochastic gradient estimation methods for large-scale nonlinear optimization in the presence of uncertainty. These methods aim to estimate an approximate gradient from a limited number of random input vector samples and corresponding objective function values. Ensemble methods usually employ Gaussian sampling to generate the input samples. It is known from the optimal design theory that the quality of sample-based approximations is affected by the distribution of the samples. We therefore evaluate six different sampling strategies to optimization of a high-dimensional analytical benchmark optimization problem, and, in a second example, to optimization of oil reservoir management strategies with and without geological uncertainty. The effectiveness of the sampling strategies is analyzed based on the quality of the estimated gradient, the final objective function value, the rate of the convergence, and the robustness of the gradient estimate. Based on the results, an improved version of the stochastic simplex approximate gradient method is proposed based on UE(s2) sampling designs for supersaturated cases that outperforms all alternative approaches. We additionally introduce two new strategies that outperform the UE(s2) designs previously suggested in the literature.


Introduction
A continuous increase over recent decades in computing power, accompanied by improvements in numerical algorithms, has led to increasing use of simulation models to obtain optimal operating strategies for complex systems. Simulation of these models may be very computationally demanding and will therefore require highly efficient numerical optimization workflows. One domain in which computational demands are continuously challenging the efficiency of optimization workflows is the management of subsurface hydrocarbon reservoirs. This problem can be characterized by large-scale multiphase and compositional flow models; by high-dimensional control (input) spaces; O. Leeuwenburgh olwijn.leeuwenburgh@tno.nl 1 Eindhoven University of Technology, Eindhoven, Netherlands 2 TNO, Utrecht, Netherlands 3 Faculty of Geosciences and Engineering, Delft University of Technology, Delft, Netherlands and by large geological, economical, and operational uncertainty. The presence of significant uncertainty, even after years of data gathering, motivates the optimization of the expected value of the objective function, an approach that is sometimes referred to as robust optimization [35].
Controls may include the number of wells to be drilled (10-100s), their locations and trajectories, the drilling order, the well type (injector or producer), and operational controls such as well rates or pressures over a period of several years. The total number of variables to be optimized can easily be in the order of 1000s. Gradient-based methods have been shown to be the most efficient techniques to find optimal solutions for these complex problems [3,16,17,32]. In many practical cases of interest, the types of controls (e.g., integer or categorical) and lack of access to the numeric model code prevent use of efficient gradient estimation by means of the adjoint method.
In such scenarios, approximate gradient methods that require a limited number of test simulations with perturbed controls as input have been proven to be quite useful. An advantage of this approach is that it treats the model as a black box, and therefore offers great flexibility in terms of the type of controls that can be considered. The main the approximate gradients are accurate enough to enable sufficient increases in the objective function at reasonable computational cost. Since test simulations can be performed in parallel (assuming the availability of a parallel computing facility), this challenge translates into choosing the control perturbation set (ensemble) in such a way that the gradients can be estimated with minimal error.
Various methods exist for gradient approximation. Deterministic methods include finite differences and the simplex gradient [7], both of which are computationally unattractive for large numbers of controls since they require as many perturbation tests runs as there are controls. Stochastic approaches based on a limited number of random perturbations include simultaneous perturbation stochastic approximation (SPSA) [33] and stochastic noise reaction (SNR) [26], both of which are based on averaging, and ensemble optimization (EnOpt) [6] and a modified version coined stochastic simplex approximate gradient (StoSAG) [10,11,14], which are both based on least squares linear regression. Do and Reynolds [10] discussed the relationship between some of these methods in a deterministic context.
The sampling strategy (for example, the distribution) used to generate the ensemble of controls is extremely important. The SPSA method, for example, utilizes control perturbations sampled from the Bernoulli distribution instead of Gaussian sampling. The impact of sampling strategy on the performance of ensemble methods, on the other hand, has so far received rather little attention. The impact of ensemble size on the quality of the ensemble gradient was investigated for the Rosenbrock function and for an oil reservoir model [12]. It was also shown that the perturbation size (the standard deviation of a multivariate Gaussian distribution) has a significant impact on the gradient quality. A method called CMA-EnOpt to adaptively adjust the perturbation size through covariance matrix adaptation (CMA) was found to improve the performance of ensemble gradients [13]. The impact of alternative distributions was considered by Sarma and Chen [31] who investigated the impact of a quasi-random sampling method (Sobol sampling, [25]) that avoids clustering of samples on SNR gradient estimates. They found Sobol sampling to lead to a faster rate of convergence relative to Gaussian sampling when applied to a deterministic reservoir optimization problem. The performance of Sobol sampling strategies in a robust optimization context was not investigated.
Considering that the number of controls (N) for the problems of interest will normally be much larger than the feasible number of test simulations (M), we will be dealing here exclusively with the underdetermined (supersaturated) case. Specifically, we will address the question which sampling strategy for the supersaturated case leads to optimal performance of the approximate gradient estimation methods within large-scale nonlinear optimization problems under uncertainty. We investigate three categories of sampling: quasi-random (low-discrepancy) sequences, stratified sampling, and sampling designs motivated by optimality criteria. All sampling methods are applied in combination with the StoSAG gradient estimation method.
In the remainder of this paper, we first provide a brief review of ensemble optimization for both deterministic and robust cases in Section 2, followed by a discussion of the various sampling strategies used in this paper (Sobol sampling, Latin hypercube sampling (LHS), UE(s 2 )-optimal supersaturated design) and the motivation for considering them in Section 3. Here we also introduce two new variants of UE(s 2 )-optimal supersaturated design. Finally in Section 4, the sampling strategies are applied in conjunction with the StoSAG method first to the extended Rosenbrock optimization test function [9] and subsequently to a synthethic 3D reservoir model of realistic complexity (for both deterministic and robust cases) followed by a detailed analysis. Chen [6] proposed a stochastic gradient estimation method for use within an ensemble-based optimization workflow referred to as EnOpt. Later modifications for deterministic and robust optimization problems [10,11] addressed approximation errors associated with the use of ensemble means. The modified method has since been referred to as stochastic simplex approximate gradient (StoSAG) [14] to highlight the relationship with the simplex gradient [20]. While the simplex gradient estimation method is a full-rank deterministic method, the StoSAG method is a low-rank stochastic method based on random perturbations. With low-rank, we mean that the estimation typically involves fewer equations than unknowns.

Ensemble-based gradient estimation
from which we wish to estimate the gradient g = ∇ u J . If the model is considered uncertain, one may choose to define an expected objective function J (u) = 1 An expected gradient can be obtained similarly as the sum of N r individual gradients, which can be computed using Eq. 1 using M/N r control perturbation vectors each. Theoretical and numerical studies [6,14] in which a ratio equal to 1, that is M = N r , was used, have shown that the expected gradient g = ∇ u J can alternatively be estimated by solving a single system like (1) with j replaced by ( 2 ) In the following, we will refer to the optimization problem with N r = 1 as deterministic, and to the case with N r > 1 as robust optimization. For robust optimization with M/N r = 1, we will use Eq. 2, and for higher ratios, we will use the mean of individual gradients computed each by solving Eq. 1. In all cases, the number of unknowns is N and the number of model simulations required to evaluate the required gradient is M.
Taking as example the deterministic case Eq. 1, the normal equations can be formulated by pre-multiplying with U T , leading to The matrix U T U has dimension N × N. Since the number of perturbations M that we can afford to evaluate (i.e., the number of equations) is typically less than N, the number of controls, the N × N matrix U T U is rank deficient and its inverse does not exist. A unique solution is normally obtained by imposing a minimum norm constraint and can be computed from the generalized pseudoinverse aŝ A regularized gradient estimate can be obtained by premultiplication of the above gradient by a positive definite matrix B. A common choice is B = C u where C u is a block-diagonal covariance matrix prescribing correlations between controls, for example, over time.
It can be shown [34] that if {z i } M i=1 with z i = u + δu i = u i is an i.i.d. sample from the multivariate Gaussian density N (u, C u ), the ensemble gradient (4) has the following convergence property (in the almost sure sense) for M → ∞ a.s.
In other words, the ensemble gradient (4) is a Monte Carlo (i.e., random sampling-based) approximation of a probability-weighted integral of the function values J (u i ) over all possible values of u i . The convergence properties of such an approximation will depend strongly on the chosen sampling strategy [4].

Sampling strategies
In the context of estimation, the matrix U is known as the design matrix and the matrix S = (U T U ) as the information matrix. The choice for a set of samples is that for a particular design and can be motivated by the desired statistical properties of the solution of Eq. 3. These properties generally depend on properties of the matrix S or, equivalently, of its inverse, known as the dispersion matrix, and lead to a number of optimality criteria which will be discussed later in this section. If M ≥ N and the rank of U is equal to or greater than N, the solution (4) is the best linear unbiased estimator (BLUE) and has variance proportional to S −1 . In the case that M < N, which is most relevant here, the solution (4) is the minimum bias estimator. If M = N and the elements S ij = 0 for all i = j , the design is called orthogonal.

Random sampling
Random sampling (or Monte Carlo sampling) is the conventional approach to generate control perturbations for ensemble-based gradient estimation. A generic approach to generating samples is to obtain random combinations of basis vectors that are obtained by factorization of a perturbation covariance matrix C u , for example, by Cholesky decomposition C u = L T L, such that δu i = L r i , where r i is a number from a pseudo-random sequence as can be generated by random number generators available with any computer code. The standard distribution used for ensemble gradient estimation is the Gaussian distribution, i.e., r ∝ N (0, C u ). If perturbations are uncorrelated, C u = σ 2 I N . In some cases, for example, when the controls represent long time series discretized in short intervals (as is typical for the oil reservoir well control problem), a regularized solution may be obtained by imposing time correlation between subsequent controls. In this case, C u will be a block-diagonal matrix.

Quasi-Monte Carlo sampling
The Monte Carlo approximation of an integral of f over I N is where x 1 , x 2 , .., x M are random points in I N [28]. The strong law of large numbers ensures the convergence of the approximation. Furthermore, the expected integration error is bounded by , with the interesting fact that it does not depend on dimension N. We have already seen that the ensemble gradient estimation is equivalent with a Monte Carlo integration (also known as quadrature). The quasi-Monte Carlo (QMC) method [23] is an alternative to the Monte Carlo (MC) method for calculating this approximation using quasi-random (deterministic) sequences with higher convergence rate than obtained with (pseudo) random sequences. The improved convergence originates from the uniformity of the quasirandom sampling distribution. The uniformity is quantified by the so-called discrepancy which measures the relative density of samples in each sub-volume of the sampled domain. Low-discrepancy sequences have good uniformity properties [4]. Examples of low-discrepancy quasi-random sequences are the Sobol and Halton sequences. More detailed discussion of quasi-random sequences and their properties is provided by, e.g., Niederreiter [24]. Given their low-discrepancy properties, which avoid clustering of samples in subvolumes, they are good candidates for generating space-filling designs [4]. In this work, we will present results obtained with the Sobol sequence which tends to produce lower correlations in high dimensions than Halton sampling [5,23]. Successful application of Sobol sequences in problems of dimension ∼ 300 has been reported in the literature [29].

Stratified sampling
A number of approaches that directly address the error variance of the Monte Carlo estimate are discussed by Caflisch [4]. Stratification is a variance reduction technique that, like low-discrepancy sampling, attempts to avoid the clustering of samples. Latin hypercube sampling (LHS) [22] is perhaps the best known stratified sampling method that is suitable for higher dimensions and settings where M < N [28]. LHS divides the input (design) space equally into M strata (subdomains), an arrangement known as Latin squares, and places a sample randomly in each stratum. Theoretical considerations [22,27] suggest that LHS can be much better than MC sampling and it cannot be much worse. However, it has been reported that LHS methods may produce clustering of samples in high dimensions [8]. Figure 1 shows examples of Gaussian and uniform (pseudo) random, quasi-random (Sobol), and stratified (LHS) sample distributions for a simple 2-control example and 100 samples, that is, N = 2 and M = 100. While this is different from the M < N case of interest, the figure serves as a simple illustration of the motivation for considering sample distributions other than Gaussian. In order to enable comparison of the sample spread, the standard deviation was normalized to 1 in both directions for all four distributions. Gaussian sampling produces relatively dense sampling around the center, as expected. LHS appears to produce sampling distributions that are very similar to uniform sampling, at least for the case M > N, with some clustering and under-sampled intervals. Sobol sampling can be seen to produce a uniform space-filling sample distribution.

Optimal supersaturated designs
The theory of Design of Experiments (DOE) distinguishes saturated (M = N) and nonsaturated (M > N) designs. It furthermore defines a number of design criteria and techniques to obtain designs that satisfy these criteria. Here we are interested primarily in designs for the supersaturated case (M < N) based on the E(s 2 ) criterion which defines an approximate orthogonality measure [2]. An extension, the so-called UE(s 2 )-optimal supersaturated designs [19], adds an effective design optimality criterion (D-optimality). For a supersaturated design, the information matrix S becomes rank deficient and hence its inverse does not exist. A natural approach in this case is to find a design that is nearly orthogonal, that is, the design in which the absolute values of the off-diagonal elements of the matrix S are small in some sense. Two alternative approaches have been suggested [2] to obtain near-orthogonal designs. The first is to choose a design with minimum max i =j |s ij | and among all such designs to choose one with the fewest s ij that achieve this maximum. The second approach is to choose a design in which the sum of the squares of the off-diagonal elements is minimum, that is, a design that minimizes which is called the E(s 2 ) criterion. A design is E(s 2 )optimal if it satisfies the following conditions: 1. s 1j = 0 ∀j = 2, ..., N 2. among all those designs that satisfy 1, the design should minimize E(s 2 ) given in Eq. 9.
There are various methods for construction of E(s 2 )optimal supersaturated designs [15], but we will consider only the methods using Hadamard matrices [21,36]. Optimal designs are experimental designs for which the solution of the estimator satisfies particular statistical optimality criteria. Generally, these statistical criteria are formulated in terms of the (generalized) variance of the solution, for example, minimum trace of the covariance ofĝ (A-optimality), minimum maximum eigenvalue of the covariance ofĝ (E-optimality), or minimum product of non-zero eigenvalues of the covariance ofĝ (Doptimality) [1]. The E(s 2 ) design can be made more theoretically strong and efficient by adding such traditional design optimality criteria. UE(s 2 )-optimal designs are designs for the supersaturated case that are near-orthogonal but exchange the first constraint above for D-optimality. UE(s 2 )-optimal supersaturated designs could therefore be described as producing minimum bias-minimum variance estimates (for details about algorithms for their construction from Hadamard matrices, we refer to Jones and Majumdar [19]). A brief summary of the general procedure and variants is provided here.
A Hadamard matrix H ∈ R N×N is a square matrix whose columns are orthogonal to each other and for which holds that H H T = H T H = N I N where I N is the identity matrix of size N × N. It consists of elements ±1 and it is generally available for order N equal to 1, 2 and multiples of 4. Procedures for constructing a UE(s 2 )-optimal design matrix U ∈ R M×N with M < N from Hadamard matrices are discussed by [19], who also review modern methods to construct Hadamard matrices of the required orders. Four situations can be distinguished based on the remainder of N when divided by 4 that are referred to as T 0 , T 1 , T 2 , and T 3 . In the following, N = a(mod 4) means a is the remainder when N is divided by 4. Given that there is some freedom in constructing the Hadamard matrices, we will consider three variants for constructing U denoted as M1, M2, and M3 as explained below. M1 is the conventional approach, while M2 and M3 are new variants that we introduce here. We finally note that the near-orthogonality and Doptimality of UE(s 2 )-optimal designs do not hold for the case N = 3(mod 4) where M > (N+5) 2 . However, by choosing M properly, the design can be made D-optimal for this case as well [19].

Numerical experiments
The various sampling strategies and designs are first applied to gradient estimation in a simple toy problem for which exact gradients can be computed analytically. This enables us to determine the impact of the sampling strategy on the accuracy of the stochastic gradients.
We determine the gradient accuracy separately for two sets of control test points. In order to explain the composition of these two sets, we refer to the red line in Fig. 6 which represents a typical example of the evolution of the objective function as a function of iteration of an optimization process. The iterative objective function increase during optimization can be commonly characterized by fast improvements during early iterations when objective function values are far from the optimum (the objective function curve is steep), and very slow improvement towards convergence (the objective function curve is nearly flat). In our example (Fig. 6), the first stage would consist of roughly the first 10 iterations, while the second stage would consist of iterations 10 to 35. The separation between the two stages can simply be defined by, for example, a threshold rate of improvement in the objective function. We propose this procedure to identify points that are expected to lie relatively far away from the optimum (visited during the first stage) and points that are expected to lie in a part of the control space that is connected to the optimum through pathways along which the objective function is only very weak sloping (visited during the second stage). We are interested in determining the quality of gradient estimates during both stages of the optimization process separately since it may be expected that for a given perturbation size, the gradient quality is lower during the second stage than during the first stage. One hundred conventional robust optimization experiments were conducted with the extended Rosenbrock function for one hundred different starting controls.

Analytical toy problem
We use an extended version of the well-known Rosenbrock benchmark function which is characterized in 2D by a curved valley, with a minimum at coordinates (1, 1) located in one of the two branches of the valley. In order to mimic the high dimensionality typically encountered in subsurface reservoir problems, we use the extended Rosenbrock function [9]. In addition to a large number of controls we want to investigate the impact of uncertainty in the model properties. Therefore, uncertainty is introduced to mimic the geological uncertainty following [12], where (c j 1 , c j 2 ) with j = 1, . . . , 100 are samples from N (0, I 2 ) representing N r = 100 model realizations. N is the number of controls which we set here to 320. The gradient of Eq. 10 can be derived analytically for any set of controls.
Using a simple objective function increase criterion solutions obtained at subsequent iterations of the optimization process were classified as belonging to either the first or second stage as discussed above. The gradient at each of these test points was estimated using different sampling strategies. For sampling strategies involving random numbers, the gradient computation was repeated 100 times, after which an average angle error α 100 was computed by comparison with the analytical gradient direction and further averaging over all 50 test points. The standard deviation of the control perturbation magnitude was set to 0.01 in all cases. The angle error is estimated for different ratios M/N r . If the ratio equals 1, we use the same number of perturbations as there are model realizations, while for a ratio of for example 3, three different perturbed controls are applied to each model realization. Figure 2 shows the angle error alpha 100 for 3 ratios and for different sampling strategies averaged over the first 50 test points and Fig. 3 shows the average angle error for the second set of 50 test points.  For control points far away from the optimum, the Gaussian, Sobol, LHS, and UE(s 2 )-M1 sampling strategies provide a similar gradient quality when the ratio equals 1 (Fig. 2). A slight difference is seen when the ratio is increased to 3 with UE(s 2 )-M1 and Sobol performing slightly better on average than the Gaussian and LHS strategies. For a 1:1 ratio, UE(s 2 )-M2 and UE(s 2 )-M3 provide the best gradient quality, with angle errors that are 5 • to 30 • lower than for the other strategies.
Gradient errors are significantly larger for points closer to the optimum as seen in Fig. 3. Otherwise, the results are more or less consistent with those for the early stage except that the differences between UE(s 2 )-optimal designs of types M2 and M3 and Gaussian, Sobol, LHS, and UE(s 2 )-optimal design of type M1 are relatively smaller. In conclusion, for the high-dimensional Rosenbrock function with uncertainty, UE(s 2 )-optimal designs of types M2 and M3 provide significantly better expected gradients than the other sampling strategies, especially in the early stages of the optimization process when solutions are far from the  In the following, we will only present robust optimization results for a ratio of 1.

Oil reservoir case
In this section, we will investigate the impact of the different sampling strategies on an optimization process for a small, but realistically complex, reservoir test case. The 3D reservoir model used here is the "Egg" benchmark model [18,35]. Figure 4 shows the permeability field of one model realization and the position of eight injection wells (blue) and four production wells (red). The egg model is a channelized reservoir model with seven vertical layers and a total of 18,553 active cells.
The permeability values are not conditioned to values at the wells, and the porosity is assumed to be constant. The producers are operated at constant bottom hole pressure, while the injectors are rate-controlled between 10 and 79.5 m 3 /day. Production of the field is simulated for a period of 3600 days that is discretized into 40 control time intervals of 90 days. This results in a total of 40 × 8 = 320 injection rate controls. The objective function used in this work is the undiscounted net present value (NPV), i.e., the sum of revenues and costs induced over the production period. We use an oil price of 126 $/m 3 and costs of 19 $/m 3 and 6 $/m 3 for water production and injection respectively. The fully implicit black oil simulator OPM flow is used for the model simulations and the objective function is computed based on the simulator output. We investigate both the deterministic and robust optimization cases, where the model realizations are taken from a set of 100 permeability realizations. Six of these realizations are shown in Fig. 5. More details on the model can be found in Jansen et al. [18].

Deterministic optimization
In deterministic optimization, there is no uncertainty in the model, and therefore only a single model realization  [18], characterizing the uncertainty in the permeability is used in this section (i.e., N r = 1). All optimization experiments are run for a fixed number of iterations (35) and use a steepest ascent update with a normalized gradient (that is, the norm of the gradient vector is 1) and a fixed step size of 0.1. The convergence will thus be affected primarily by the quality of the gradient. The initial control vector consists of equal values of 79.5 in units of m 3 /day which corresponds to the maximum injection rate. M =100 perturbation vectors are generated to estimate the gradients by solving (4). Figures 6 and 7 show the objective function curves over all iterations for different sampling strategies with the number of perturbation vectors M=100 and M = 30 respectively.
The curves for UE(s 2 ) designs of types M2 and M3 flatten after 14 iterations. The curve for Sobol sampling approaches the same final objective function value at a slightly slower rate. These methods also produce high convergence rates and final objective function values when the ensemble size is very small (30). From the results of the Rosenbrock function, it was observed that UE(s 2 ) designs of type M1 provides inferior gradient quality compared with M2 and M3 at poor control points (steep section of the objective function curve). This is also observed in Figs. 6  . The curve for M1 shows iteration intervals for which convergence is extremely slow, alternated by intervals with steep increases in the objective function. Upon inspection, it was discovered that these intervals correspond to iterations in which the row of the Hadamard matrix containing only +1 values was either not included (slow improvement) or was included (fast improvement). This behavior actually motivated the creation of the new schemes M2 and M3 in this paper. In general, both M2 and M3 perform slightly better than Sobol sampling which tends to produce objective function curves that flatten a bit earlier. Since the curves for Gaussian and LHS sampling have not yet flattened after 35 iterations, it is not possible from these results to draw conclusions about the final objective function value that can be reached. Given the high computational cost associated with simulating large and complex reservoir models, it is reasonable to consider the performance of different methods for a limited number of iterations (or function evaluations). It appears that LHS does not perform better than Gaussian sampling for the number of controls considered in these experiments. The optimal control strategies for one of the injectors obtained after 35 iterations are shown in Fig. 8.
The choice of sampling method clearly has a significant impact on the character of the resulting control strategy. While Sobol sampling produces a strategy with frequent and large changes in the injection rate, the UE(s 2 ) designs tend to produce fairly smooth low-rate profiles. Highly dynamic control strategies are generally undesirable from an operational point of view. Regularization of the gradients is often proposed as a means to produce smooth control profiles. One way to achieve this is by imposing correlations over time between the control perturbations, i.e., between the samples. The impact of this approach on the optimization process for different sampling methods is illustrated in Fig. 9, where a correlation length of 15 control intervals was applied (the total number of intervals is 40). Correlation clearly benefits the convergence properties for all sampling methods except Sobol sampling. Gaussian sampling, LHS, and UE(s 2 ) designs all produce very similar objective function profiles. Gaussian sampling and LHS produce the highest final objective function values, while the values obtained for UE(s 2 ) are nearly identical to those obtained without induced correlation. The convergence rate for Sobol sampling on the other hand has decreased notably. We conclude that this latter result must be related to the loss of uniformity of Sobol distributions after smoothing.

Robust optimization
In this section, the optimization is aimed at maximizing the expected NPV as evaluated over N r = 100 equiprobable realizations of the model with different permeability fields The optimal control strategies obtained after 25 iterations are shown in Fig. 12 for all sampling strategies. A similar behavior can be observed as in the deterministic case. When using Sobol sampling, the water injection rate jumps between near-minimum and near-maximum values. This is close to what is known as a bang-bang strategy, which is an optimal strategy for certain linear problems and is characterized by solutions that attain only the minimum and maximum allowable control values. The solutions obtained with Gaussian sampling and LHS tend to vary around an intermediate average control value, while the UE(s 2 ) solutions consistently suggest near-minimum injection rates. The solutions for this well are characteristic also for those of the other wells, with UE(s 2 )-based injection rates mostly in the range of 10-30 m 3 /day.

Sensitivity of results
The optimization experiments presented here were performed with the same constant perturbation size which is the standard approach in approximate gradient applications. It has been observed in experiments with different perturbation sizes [12] that smaller perturbations may be preferred during the later stages of the optimization process. One way to improve the quality of gradient estimates at different stages of the optimization process is to adjust the perturbation size [30]. This was found to be an effective solution when the traditional Gaussian sampling strategy is used.
Here we investigate to what extend the sampling strategy affects the sensitivity to perturbation size. Experiments were performed with the Egg reservoir model with different sampling strategies for both deterministic and robust settings. Figure 13 shows robust optimization results for three different fixed perturbation sizes and all sampling strategies considered in this paper. The results suggest that the performance for UE(s 2 ) designs of type M2 is much less sensitive  to the perturbation size than that for Gaussian and Sobol sampling or LHS designs. Some of the considered sampling strategies, including Sobol sampling and UE(s 2 ) designs of type M3, are deterministic and therefore will produce the same result each time. Strategies based on pseudo-random numbers (Gaussian, LHS), or on random selection of perturbations from a fixed set (UE(s 2 ) designs of types M1 and M2) may produce different results for different random number generator seeds. In Fig. 14, we present robust optimization results obtained with UE(s 2 ) designs of type M1 and type M2 for five different initial random seeds. The sensitivity of the gradient quality and convergence with respect to the initial seed is very large for UE(s 2 ) designs of type M1, but almost negligible for designs of type M2. This is another benefit of the UE(s 2 )-type M2 designs proposed here.

Conclusions
The standard practice of using Gaussian sampling to generate random perturbations for use in approximate gradient estimation procedures is compared against various alternative sampling strategies. The alternative strategies include two space-filling designs, namely Sobol sampling and LHS, based on low-discrepancy concepts as achieved by quasi-Monte Carlo approaches and stratification respectively. A second class of methods is based on the E(s 2 ) near-orthogonality concept for supersaturated designs and D-optimal reduction of the generalized variance of the gradient estimate (E(s 2 )-optimal designs). Two new variants of E(s 2 )-optimal designs were proposed. The sampling strategies were applied to high-dimensional analytical test problem to evaluate their impact on the gradient quality. In a second example, they were applied to an oil reservoir case with realistic complexity in terms of number of controls and uncertainty in parameter values to test their impact on optimization performance. The main conclusions can be summarized as follows.
-Sobol sampling and UE(s 2 ) designs outperform random sampling and stratified experimental designs in terms of gradient quality and convergence properties in all cases when no smoothing is performed on the samples prior to gradient estimation. -When samples are smoothed over time, the performance of Sobol sampling strongly deteriorates. -The sampling strategy is found to have a significant impact on the character of the resulting control strategy. Sobol sampling tends to produce highly dynamic strategies, while UE(s 2 ) designs produce fairly smooth strategies, also when no smoothing is explicitly applied. -The new UE(s 2 ) design referred to here as M2 was observed to outperform the optimal supersaturated design method previously suggested (M1), as well as a third variant (M3), in terms of performance of the optimization and in terms of sensitivity to the perturbation size and initial random seed. -UE(s 2 )-optimal supersaturated designs perform well in all situations that were investigated for both deterministic and robust cases and are therefore recommended for gradient approximation schemes where the number of samples is less than the number of unknowns.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons. org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.