Parametric generation of conditional geological realizations using generative neural networks

We introduce a method for parametric generation of conditional geological realizations using generative neural networks. We build on our recent work where we trained a neural network to generate unconditional geological realizations using generative adversarial networks. Here we propose a method for post-hoc conditioning of pre-trained generator networks to generate conditional realizations. We frame the problem in the Bayesian setting and model the posterior distribution of the latent vector given observations. To efficiently generate multiple latent vector solutions, we train a neural network to generate samples from the posterior distribution. This inference network is trained by minimizing the discrepancy between its output distribution and the posterior. Once the inference network is trained, it is coupled to the (unconditional) generator to obtain the conditional generator, thus also maintaining a parametrization of the (conditional) generation process.


Introduction
The large scale nature of geological models makes reservoir simulation an expensive task, prompting numerous works on parametrization methods that can preserve complex geological characteristics required for accurate flow modeling. A wide variety of methods exist including zonation [1,2], PCA-based methods [3][4][5], SVD methods [6,7], discrete cosine transform [8,9], level set methods [10][11][12], and dictionary learning [13,14]. Very recently, a new method from the machine learning community called generative adversarial networks [15] has been investigated [16][17][18][19][20][21][22] for the purpose of parametrization, reconstruction, and synthesis of geological properties; obtaining very competitive results in the visual quality of the generated images compared to previous methods. This adds to the recent trend in applying machine learning techniques [23][24][25][26][27][28][29] to leverage rapid advances in the field as well as the increasing availability of data and computational resources that enable these techniques to be effective.
Generative adversarial networks (GAN) is a novel technique for training a neural network to sample from a distribution that is unknown and intractable, using only a dataset of realizations from this distribution. The result is a neural network parametrization called a generator, which is capable of generating new realizations from the target distribution (in our case, geological images) using a very efficient representation. Recent works show that using the generator to parametrize the geology is very effective in preserving high-order flow statistics [18,22], two-point spatial statistics [16,19] and morphology [16], all while achieving dimensionality reduction of several orders of magnitude.
Subsequent works on GAN focused on the problem of conditioning the generator: given a generator trained on unconditional realizations, the task is to generate realizations conditioned to spatial observations (hard data). In [20,21], an image inpainting technique was used which adopts a sampling-by-optimization approach, i.e. it requires solving an optimization problem for each conditional realization that is generated. The method obtained very good results; in particular, [20] reported superior performance in many aspects compared to standard geomodeling tools. However, sampling by optimization can be expensive if realizations need to be continuously generated during deployment, e.g. for history matching or uncertainty quantification. An alternative approach was presented in [19], where the authors addressed conditioning using a Bayesian framework and performed Markov chain Monte Carlo to sample conditional realizations. Neither of these approaches, however, provides a parametrization for the conditional sampling process. As the authors in [19,20] express, it is of interest to obtain such a parametrization to directly sample conditional realizations without running optimizations or Monte Carlo methods.
In this work, we propose a method to obtain a parametrization to directly sample conditional realizations. The main idea is to simply extend the existing generator network by stacking a second inference network that performs the conditioning. This inference network is a neural network trained to sample a posterior distribution, derived using a Bayesian formulation of the conditioning task. The resulting extended neural network thus provides the conditional parametrization, hence direct conditional sampling can be done very efficiently. We assess our method on the benchmark image of Strebelle and Journel [30], finding positive results for a wide variety of conditioning configurations.
Note that although previous works [16,19,20] study applications of GAN mainly in the context of geomodeling and multipoint geostatistical simulations, here we emphasize on the effectiveness of GAN -and neural networks in general -for parametrization and dimensionality reduction, highlighting their ability to learn efficient representations for complex and high-dimensional data. The rest of this work is organized as follows: In Section 2, we describe parametrization using generative adversarial networks, and the Bayesian formulation of the conditioning problem. We introduce our method in Section 3 where we describe how the inference network is obtained. In Section 4, we show results for unconditional and conditional parametrization of binary channelized subsurface images. We discuss related work in Section 5 including other alternatives to train the inference network, and conclude our work in Section 6.

Background
In this section, we discuss the importance of parametrization for subsurface simulations (Section 2.1), we describe generative adversarial networks (Section 2.2), and we describe the Bayesian formulation of the conditioning problem (Section 2.3).

Parametrization
Parametrization is useful in subsurface simulations where the large number of uncertain variables are highly correlated and redundantly represented as a consequence of the grid discretization. One useful analogy to parametrization is an index of words or a dictionary: Consider the task of inferring the content of a book using only indirect information such as the frequency of letters. A priori, this task would need to consider any possible arrangement of letters however implausible (top row, left of Figure 1). On the other hand, since most books consist of words, we know that most arrangements are unlikely and can be quickly discarded. The task, although still difficult, is suddenly much easier with the inclusion of this prior information via an index of words (top row, right of Figure 1). Likewise, consider the task of inferring the subsurface from indirect information such as the oil production history. Without any other information, attempting to deliberately model the subsurface to match the production history would almost certainly result in unrealistic images (bottom row, left of Figure 1). On the other hand, we know that real subsurface images are not completely random but instead tend to exhibit clear spatial correlations. By using a suitable parametrization of the subsurface, we can embed this information and narrow our search to only the plausible realizations (bottom row, right of Figure 1), thus reducing the number of simulations required in uncertainty quantification and inversion problems.
Let the random vector y ∈ R^{n_y} represent plausible subsurface images. Parametrization aims to construct a well-behaved function G : R^{n_z} → R^{n_y} such that y = G(z), where z ∈ R^{n_z} (normally n_z ≪ n_y) is a latent random vector with a known pre-defined distribution (for example, z ∼ N(0, I)). Generally, strictly achieving y = G(z) for complex and high-dimensional y is hard, hence many methods settle for replicating simple statistics of y such as the mean and covariance. For example, in a parametrization based on principal component analysis, G is an affine transformation G(z) = Az + b, where A, b are fitted so that G(z), z ∼ N(0, I), preserves the sample mean and covariance estimated from an available dataset {y_1, ..., y_n} of realizations of y. Note that for natural data this parametrization is often too simplistic, resulting in unrealistic realizations that are overly smooth in practice.
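The PCA-based affine parametrization above can be sketched in a few lines. The dataset below is synthetic Gaussian data, not the geological realizations of this work; it only illustrates how A and b are fitted from the sample mean and covariance.

```python
import numpy as np

# Minimal sketch of a PCA-based parametrization G(z) = A z + b.
# The synthetic dataset stands in for the realizations {y_1, ..., y_n}.
rng = np.random.default_rng(0)
Y = rng.standard_normal((2000, 8)) @ rng.standard_normal((8, 8))  # dataset

b = Y.mean(axis=0)                        # sample mean
C = np.cov(Y, rowvar=False)               # sample covariance
vals, vecs = np.linalg.eigh(C)            # eigendecomposition (ascending order)
k = 4                                     # retained eigencomponents (n_z)
A = vecs[:, -k:] * np.sqrt(vals[-k:])     # A = U_k diag(lambda_k)^{1/2}

def G(z):
    """Affine parametrization preserving the mean and rank-k covariance."""
    return A @ z + b

y = G(rng.standard_normal(k))             # one (overly smooth) realization
```

By construction, the covariance of G(z) with z ∼ N(0, I) is A Aᵀ, the best rank-k approximation of C, which is exactly why such realizations come out overly smooth.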
In this work, we use a parametrization based on deep neural networks:

G(z) = f_l ∘ f_{l−1} ∘ ⋯ ∘ f_1(z),  f_i(x) = σ_i(A_i x + b_i),  i = 1, ..., l,  (1)

where ∘ denotes composition (f_2 ∘ f_1(x) = f_2(f_1(x))) and σ_i denotes a component-wise non-linearity. This is motivated by the high expressive power of deep neural networks, as is now evident from the state-of-the-art results in computer vision (see e.g. [31,32] for recent examples). In addition to the more flexible parametrization, instead of training the weights A_i, b_i to preserve only the mean and covariance as in principal component analysis, we leverage once more the expressive power of neural networks and let a second neural network learn and decide the relevant statistics directly from the dataset. This is possible due to a recent technique called generative adversarial networks, described below in Section 2.2.
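In PyTorch, the composition of affine maps and component-wise non-linearities described above takes only a few lines. The layer widths here are illustrative, not the architecture used in this work:

```python
import torch
import torch.nn as nn

# A minimal instance of the deep parametrization G = f_l ∘ ... ∘ f_1 with
# f_i(x) = σ_i(A_i x + b_i). Layer sizes are illustrative only.
n_z, n_y = 30, 64 * 64
G = nn.Sequential(
    nn.Linear(n_z, 128), nn.ReLU(),   # f_1: affine map A_1 x + b_1, then σ_1
    nn.Linear(128, 256), nn.ReLU(),   # f_2
    nn.Linear(256, n_y), nn.Tanh(),   # output layer with σ_l = tanh
)

z = torch.randn(5, n_z)               # a batch of latent vectors z ~ N(0, I)
y = G(z)                              # parametrized realizations, values in (-1, 1)
```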

Generative adversarial networks
We use generative adversarial networks to obtain the (unconditional) parametrization of the geology. This method can be used to obtain a parametrization of a general random vector given a dataset of its realizations. Let the random vector y ∈ R^{n_y} represent the uncertain subsurface property of interest, where n_y is very large (e.g. permeability discretized by the simulation grid). This random vector follows a distribution y ∼ P_y that is unknown and possibly intractable (e.g. the distribution of plausible channelized permeability images). Instead, we are only given a dataset of realizations {y_1, ..., y_N} of the random vector (e.g. a set of permeability realizations deemed representative of the area under study). Using this training dataset, we aim to find a parametrization for y: Consider now a latent random vector z ∈ R^{n_z} with n_z ≪ n_y and z ∼ p_z, where p_z is manually chosen to be easy to sample from (e.g. a multivariate normal or uniform distribution); and a neural network G_θ : R^{n_z} → R^{n_y}, that we call a generator, where θ denotes the weights of the neural network. We aim to determine θ so that y = G_θ(z). In other words, let P_θ denote the distribution induced by G_θ (i.e. G_θ(z) ∼ P_θ), which depends on θ; the goal is to determine θ so that P_θ = P_y.
A difficulty with the problem statement above is that P y is completely unknown (we only have realizations of y) and P θ is unknown and intractable (even if p z is simple, G θ is a neural network with several non-linearities). On the other hand, sampling from these distributions is easy: For P y , we "sample" by drawing realizations from the training set, assuming the set is large enough to be representative. For P θ , we simply sample z ∼ p z and evaluate G θ (z). We therefore have two distributions that we can sample from but we cannot model analytically, and yet we need to optimize P θ to approximate P y . Informally, we need to teach the generator G θ to generate plausible realizations.
Following this observation, the seminal work in [15] (see also [33]) introduces the idea of using a classifier neural network D_ψ : R^{n_y} → [0, 1], called a discriminator, with weights ψ, to assess the plausibility of generated realizations. The discriminator D_ψ is trained to distinguish between "fake" (from the generator) and "real" (from the training dataset) realizations, and it essentially outputs a probability estimate. The aim of the generator is then to fool the discriminator (see Figure 2), hence the discriminator and the generator are adversaries. The discriminator is trained to solve a binary classification problem by maximizing the following loss:

L(θ, ψ) = E_{y∼P_y}[log D_ψ(y)] + E_{z∼p_z}[log(1 − D_ψ(G_θ(z)))],  (2)

which is in essence a binary classification score. The expectations in the expression above are approximated by taking a batch of M ≤ N realizations from the training set for the first term, and sampling M realizations z_1, ..., z_M from p_z for the second term. The generator, on the other hand, is trained to minimize the same loss. Thus, an adversarial game is created where G and D optimize the loss in opposite directions,

min_θ max_ψ L(θ, ψ).  (3)

In practice, this optimization is performed alternately using stochastic gradient descent, where the gradients with respect to θ and ψ are obtained using automatic differentiation algorithms. The equilibrium is reached when G effectively learns to approximate P_y and D outputs 1/2 in the support of P_y (coin toss scenario). It is shown in [15] that in the infinite capacity setting, the process minimizes the Jensen-Shannon divergence between P_θ and P_y. Once trained, we can discard the discriminator and keep the generator as our parametrization.
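One alternating update of this adversarial game can be sketched as follows. The networks, learning rates, and toy 2-D "training data" are all illustrative stand-ins, not the setup of this work:

```python
import torch
import torch.nn as nn

# One alternating update of the GAN game in Equations (2)-(3), on toy data.
n_z, n_y, M = 4, 2, 64
G = nn.Sequential(nn.Linear(n_z, 32), nn.ReLU(), nn.Linear(32, n_y))
D = nn.Sequential(nn.Linear(n_y, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)

y_real = torch.randn(M, n_y) + 2.0       # batch drawn from the "training set"
z = torch.randn(M, n_z)                  # batch of latent vectors z ~ p_z

# Discriminator step: ascend E[log D(y)] + E[log(1 - D(G(z)))]
loss_D = -(torch.log(D(y_real)).mean()
           + torch.log(1 - D(G(z).detach())).mean())
opt_D.zero_grad(); loss_D.backward(); opt_D.step()

# Generator step: descend the same objective through its second term
loss_G = torch.log(1 - D(G(z))).mean()
opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```

Note the `detach()` in the discriminator step: each player's update treats the other's output as fixed, which is what "alternating" optimization of the min-max game means in practice.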
Note that the method is very general and directly applicable in practice to all types of geological models including multi-facies and multimodal geology, since minimal assumptions are imposed on P y and P θ as we do not need to model them explicitly. We only require realizations {y 1 , · · · , y N } from the unknown target distribution P y and the discriminator is in charge of inferring it from the realizations.
Variations of GAN. Stability issues with the original formulation of GAN led to numerous works to improve and generalize the method (see e.g. [34][35][36][37] and references therein). One line of research generalizes GAN in the framework of integral probability metrics [38]: Given two distributions P and Q, and a class 𝒟 of real-valued functions, an integral probability metric measures the discrepancy between P and Q as

d_𝒟(P, Q) = sup_{D ∈ 𝒟} { E_{x∼P}[D(x)] − E_{x∼Q}[D(x)] }.

Note the slight similarity with Equation (2). In comparison, this formulation drops the logarithms and performs the optimization within a class 𝒟 that may be more general, i.e. not necessarily limited to classifier functions. The choice of 𝒟 is important and leads to different flavors of GAN. For example, when 𝒟 is a ball in a Reproducing Kernel Hilbert Space, d_𝒟 is the Maximum Mean Discrepancy (MMD GAN) [39,40]. When 𝒟 is a set of 1-Lipschitz functions, d_𝒟 is the Wasserstein distance (WGAN) [41,42]. When 𝒟 is a Lebesgue ball, we obtain Fisher GAN [43], and when 𝒟 is a Sobolev ball, we obtain Sobolev GAN [44]. See [44,45] for further discussion. Our unconditional generator is trained using the Wasserstein formulation (see also our recent work [18,22]).

Conditioning to observations
Given a pre-trained generator G, we aim to generate realizations conditioned to spatial observations (hard data), i.e. find z such that G(z) honors the observations. Let d_obs denote the observations and d(z) = G(z)_obs the values at the observed locations given G(z). Under the probabilistic framework, we can formulate the problem as finding z* that maximizes its posterior probability given the observations. From Bayes' rule and applying logarithms,

z* = argmax_z p(z|d_obs) = argmin_z { −log p(d_obs|z) − log p(z) }.  (4)

For the prior p(z), a natural choice is p_z, for which the generator has been trained. In most applications (and in ours), this is the multivariate standard normal distribution, hence p(z) ∝ exp(−(1/2)‖z‖²). For the likelihood p(d_obs|z), we take the general assumption of i.i.d. Gaussian measurement noise,

p(d_obs|z) ∝ exp(−(1/2σ²)‖d(z) − d_obs‖²),

where σ is the measurement standard deviation. Then the optimization in Equation (4) can be written as

z* = argmin_z L(z),  (5)

where

L(z) = ‖d(z) − d_obs‖² + λ‖z‖²,  (6)

after multiplying everything by 2σ² with λ = σ², and discarding the irrelevant constant. One way to draw different conditional realizations is to optimize Equation (5) repeatedly using a local optimizer and different initial guesses for z, as performed in [20,21]. Another approach is to sample the full posterior using Markov chain Monte Carlo methods, as performed in [19].
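The sampling-by-optimization route for Equation (5) can be sketched as follows. A fixed random linear map stands in for the restriction z ↦ G(z)_obs of a pre-trained generator, and all sizes are illustrative assumptions:

```python
import torch

# Sampling by optimization: minimize L(z) = ||d(z) - d_obs||^2 + lam * ||z||^2.
# A random linear map W stands in for the generator restriction d(z) = G(z)_obs.
torch.manual_seed(0)
n_z, n_obs, lam = 30, 5, 0.1
W = torch.randn(n_obs, n_z)
d = lambda z: W @ z                      # stand-in for z -> G(z)_obs
d_obs = torch.randn(n_obs)               # the hard data to honor

z = torch.randn(n_z, requires_grad=True) # initial guess z ~ p_z
opt = torch.optim.Adam([z], lr=0.05)
losses = []
for _ in range(500):
    loss = ((d(z) - d_obs) ** 2).sum() + lam * (z ** 2).sum()
    losses.append(float(loss))
    opt.zero_grad(); loss.backward(); opt.step()
```

Repeating this loop from different initial guesses yields different conditional realizations; this per-sample optimization cost is exactly what the inference network of Section 3 amortizes.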

Conditional generator for geological realizations
As mentioned in Section 2.3, one way to sample multiple realizations conditioned to observations is to solve Equation (5) repeatedly using local optimization with different initial guesses [20,21]. However, this approach can be expensive if a large number of realizations need to be continuously generated in deployment, e.g. for uncertainty quantification and inversion problems, and it may also not cover the full solution space. Another approach is to use Markov chain Monte Carlo methods (assuming the latent vector is of moderate size) to sample the full posterior distribution [19]. Neither approach, however, provides a parametrization of the sampling process. That is, we no longer have a functional relationship y_cond = G_cond(w), w ∼ p_w, where y_cond denotes the conditional geology and p_w is some fixed distribution.
Here we propose a method to obtain a conditional parametrization for direct and parametric sampling of conditional realizations. The idea is to extend the existing generator, G ∘ I =: G_cond, where I is another neural network, called the inference network, that performs the conditioning, as illustrated in Figure 3. The inference network is trained to sample the Bayesian posterior p(z|d_obs) derived in Section 2.3. Let I_φ : R^{n_w} → R^{n_z}, where φ denotes the weights of the neural network to be determined. I_φ maps from yet another random vector w ∈ R^{n_w}, w ∼ p_w with manually chosen p_w (we can naturally choose p_w = p_z and n_w = n_z), to the conditional latent vector z|d_obs ∼ p(z|d_obs). Let q_φ denote the distribution density induced by I_φ, which depends on φ. This density function is unknown and intractable (I_φ is a neural network with several non-linearities), but it is easy to sample from, since it only requires sampling w ∼ p_w and evaluating I_φ(w). The Kullback-Leibler divergence from p(·|d_obs) to q_φ gives us

D_KL(q_φ ‖ p(·|d_obs)) = E_{z∼q_φ}[log q_φ(z) − log p(z|d_obs)] = (1/2λ) E_{z∼q_φ}[L(z)] − H(q_φ) + const.  (7)

The first term is the expected loss under the induced distribution q_φ, with the loss L defined in Equation (6). It can be approximated as

E_{z∼q_φ}[L(z)] ≈ (1/M) Σ_{i=1}^{M} L(I_φ(w_i)),  (8)

by sampling M realizations w_1, ..., w_M from p_w. The second term, however, is more difficult to evaluate since we lack the unknown and intractable q_φ. This term involves the (negative) entropy of q_φ, usually denoted H(q_φ) := −E_{z∼q_φ}[log q_φ(z)].
Fortunately, there are sample estimators Ĥ for H, so we can estimate it from a sample {z_1, ..., z_M}, z_i = I_φ(w_i). We use the Kozachenko-Leonenko estimator [46,47],

Ĥ = (n_z/M) Σ_{i=1}^{M} log ρ(z_i) + log c_{n_z} + ψ(M) − ψ(k),  (9)

where ρ(z_i) is the distance between z_i and its k-th nearest neighbor in the sample, c_{n_z} is the volume of the unit ball in R^{n_z}, and ψ is the digamma function. A good rule of thumb is k ≈ √M [47]. Intuitively, the entropy estimator measures how spread the elements of the sample are. If the entropy term were not present, minimizing Equation (7) would reduce to finding the maximum a posteriori estimate instead of sampling the full posterior.
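The k-th nearest-neighbor entropy estimator can be implemented directly with a KD-tree. This is a sketch of the standard form of the estimator, not necessarily the exact variant used here; the test sample is a 2-D standard normal, whose entropy is known in closed form:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(Z, k):
    """Kozachenko-Leonenko entropy estimate from a sample Z of shape (M, d).

    Uses the k-th nearest-neighbor distance rho(z_i) of each point, plus the
    standard additive constants (digamma terms and unit-ball volume).
    """
    M, d = Z.shape
    rho = cKDTree(Z).query(Z, k=k + 1)[0][:, -1]  # k-th neighbor, excluding self
    log_c_d = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)  # log unit-ball volume
    return digamma(M) - digamma(k) + log_c_d + d * np.mean(np.log(rho))

rng = np.random.default_rng(0)
Z = rng.standard_normal((4000, 2))                # sample from N(0, I_2)
H_hat = kl_entropy(Z, k=int(np.sqrt(4000)))       # rule of thumb k ≈ sqrt(M)
```

For N(0, I_2), the true differential entropy is log(2πe) ≈ 2.84, and the estimate above should land close to it.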
To train the inference network I_φ, we minimize D_KL(q_φ ‖ p(·|d_obs)) (Equation (7)) using automatic differentiation algorithms. Once trained, we obtain our conditional parametrization G_cond = G ∘ I : R^{n_w} → R^{n_y}. Note that we now map from a new source distribution w ∼ p_w, although we can simply pick p_w = p_z. Sampling conditional realizations is then done very efficiently by directly sampling w ∼ p_w and forward-passing through G ∘ I.
We summarize the training steps of the inference network in Algorithm 1. Note that we show a simple gradient descent update (line 7), however it is more common to use dedicated schemes for neural networks such as Adam [48] or RMSProp [49].
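The training loop of the inference network can be sketched as follows. The generator's observed output is again replaced by a fixed random linear map, the network width and step counts are illustrative, and the entropy term is a differentiable k-th nearest-neighbor estimate with its additive constants dropped (they do not affect the gradient):

```python
import torch
import torch.nn as nn

# Sketch of training an inference network I to minimize Equation (7).
# A random linear map W stands in for z -> G(z)_obs; sizes are illustrative.
torch.manual_seed(0)
n_w = n_z = 30
W = torch.randn(5, n_z)
d_obs, lam = torch.randn(5), 0.1

I = nn.Sequential(nn.Linear(n_w, 64), nn.SELU(), nn.Linear(64, n_z))
opt = torch.optim.Adam(I.parameters(), lr=1e-3)

M, k = 64, 8                                # batch size and neighbor order
for step in range(200):
    w = torch.randn(M, n_w)                 # w ~ p_w
    z = I(w)                                # z ~ q_phi
    # Loss L(z) of Equation (6), per sample in the batch
    L = ((z @ W.T - d_obs) ** 2).sum(1) + lam * (z ** 2).sum(1)
    # Differentiable k-th NN entropy estimate, constants dropped
    rho = torch.cdist(z, z).kthvalue(k + 1, dim=1).values
    H_hat = n_z * torch.log(rho + 1e-8).mean()
    # Equation (7) scaled by 2*lam: E[L] - 2*lam*H, up to constants
    loss = L.mean() - 2 * lam * H_hat
    opt.zero_grad(); loss.backward(); opt.step()
```

After training, conditional realizations come from a single forward pass of G ∘ I, with no per-sample optimization.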
Note that the inference network I is relatively easy to train compared to the generator G, which is based on GAN. The network I is also usually small, so the relative increase in the evaluation cost of the composition G ∘ I is not significant. We find this to be the case in our experiments.

Numerical experiments
We train generative adversarial networks to obtain a parametrization of binary channelized subsurface images based on the benchmark image of Strebelle and Journel [30]. We then condition the parametrization for a variety of configurations using our method described in Section 3. Finally, we also include in Appendix C a side experiment as a sanity check of the proposed method, where we train neural samplers for simple mixtures of Gaussians. All our numerical experiments are implemented using PyTorch [50], an open-source Python package for automatic differentiation that provides tools to facilitate the construction and training of neural network models. Our implementation code is available in our repository.

Unconditional parametrization
We train a generator G : R^30 → R^{64×64} using a dataset of 1000 realizations of size 64 × 64 of binary channelized subsurface images. The realizations were obtained using the snesim algorithm [30,51] provided within the Stanford geostatistical modeling software [52], using the benchmark image from Strebelle and Journel [30] as the reference image. A few snesim realizations are shown in Figure 4b.

Architecture design
The latent vector size n_z = 30 was chosen using principal component analysis as a heuristic, where the number of eigencomponents required to retain 75% of the variance is used as a reference. This results in a dimensionality reduction of two orders of magnitude, from 64 × 64 to 30. The latent vector is sampled from the standard normal distribution, z ∼ N(0, I). Since the data is binary, it is reasonable to embed this knowledge into the neural network design using a suitable non-linearity in the output layer. We use σ = tanh, where we adopt 1 to denote channel material and −1 to denote background material. Note that attempting to use a hard threshold here would render G discontinuous and introduce issues in the training. It can also be an issue in inversion problems during deployment. The rest of the neural network architecture (shapes of A_i, b_i, non-linearities σ_i, number of layers l, etc.; see Equation (1)) is designed according to the template provided in [34]. This template is the result of experimentation, heuristics, and experience. In particular, an important design choice is the use of convolutional layers. These are sparse matrices A_i that follow a certain structure that makes them effective for spatial data. A brief description of convolutional layers is provided in Appendix E. Further details of the generator architecture and training are given in Appendix A.1.
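A generator in the spirit of the template of [34] looks as follows: transposed convolutions upsample the latent vector to a 64 × 64 image, with a tanh output so values lie in (−1, 1). The channel widths here are illustrative assumptions, not the exact architecture of Appendix A.1:

```python
import torch
import torch.nn as nn

# DCGAN-style generator sketch: latent vector -> 64x64 image in (-1, 1).
n_z = 30
G = nn.Sequential(
    nn.ConvTranspose2d(n_z, 128, 4, 1, 0), nn.BatchNorm2d(128), nn.ReLU(),  # 4x4
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),    # 8x8
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(),     # 16x16
    nn.ConvTranspose2d(32, 16, 4, 2, 1), nn.BatchNorm2d(16), nn.ReLU(),     # 32x32
    nn.ConvTranspose2d(16, 1, 4, 2, 1), nn.Tanh(),                          # 64x64
)

z = torch.randn(2, n_z, 1, 1)    # latent vector reshaped to a 1x1 spatial map
y = G(z)                         # realizations with values in (-1, 1)
```

The tanh output is what makes the binarization step of the quality assessment necessary, since the generated images are continuous by design.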

Quality assessment
Realizations generated by the parametrization G are shown in Figure 4a. We also show snesim realizations in Figure 4b. We can already see from the figure that the parametrization is at least visually competitive with previous parametrization methods. The realizations of the parametrization are virtually indistinguishable from snesim, recreating crisp and clear channels from the reference (note that no thresholding has been performed). We next assess the results quantitatively.
Previous works have assessed the effectiveness of GAN-based parametrization using a variety of tools. Two-point probability functions, morphological measures, and effective porosity were assessed in [16]. Two-point probability and cluster functions, and fractions of facies were assessed in [19]. In our previous work [18], we assessed the effectiveness of the parametrization for preserving high-order flow statistics in uncertainty quantification. Here we add to the assessment using the method of analysis of distances (ANODI) [53], which captures multipoint statistics, providing a more reliable measure of quality for complex data where two-point statistics are insufficient. We also apply multidimensional scaling for visualization.
The ANODI method aims to capture multipoint statistics by comparing multipoint histograms at different resolutions. It computes an inconsistency score (how well it matches the statistics of the reference image) and a diversity score (variability between realizations) -therefore, we want low inconsistency and high diversity. Multidimensional scaling is a method that aims to project a set of high-dimensional objects to low dimensions in a way that preserves the distances between the objects. Although some information may be lost in the projection, the method provides a useful way of visualizing high-dimensional objects (e.g. images) using a scatter-plot. The notion of distance between images, as adopted in [53], is the Jensen-Shannon divergence between multipoint histograms of patterns extracted from the images within a window size.
We perform the analysis at four resolutions: ×1 (original), ×1/2, ×1/4, and ×1/8 (i.e. at 64 × 64, 32 × 32, 16 × 16, and 8 × 8). We use a window size of 4 × 4 and sets of 100 realizations for the analysis. Importantly, note that the snesim realizations are fresh realizations, i.e. not from the training dataset: Ultimately, we aim to plug the parametrization into a reservoir simulator, for which we are assuming that the parametrization replicates the data generating process. Therefore, the comparison is made against out-of-sample realizations to see if the parametrization generalizes. For multidimensional scaling, we use the SMACOF [54] algorithm with 300 iterations and a tolerance of 10^−3.
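For intuition on what multidimensional scaling does, here is a numpy-only sketch of classical (Torgerson) MDS, a simpler eigendecomposition-based stand-in for the SMACOF algorithm used above: double-center the squared distance matrix and take the leading eigencomponents.

```python
import numpy as np

# Classical (Torgerson) MDS: an illustrative stand-in for SMACOF.
def classical_mds(Dist, dim=2):
    n = Dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    B = -0.5 * J @ (Dist ** 2) @ J         # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dim]     # leading eigencomponents
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

# Toy check: distances between points on a line are recovered exactly.
X = np.array([[0.0], [1.0], [3.0], [6.0]])
Dist = np.abs(X - X.T)
coords = classical_mds(Dist, dim=1)
```

For Euclidean distance matrices, classical MDS recovers the configuration exactly (up to rotation and reflection); SMACOF instead iteratively minimizes a stress function, which is preferable for non-Euclidean distances such as the Jensen-Shannon divergences between multipoint histograms used here.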
For the analysis, we need to binarize the realizations generated by the parametrization, which are continuous by design. For this, we use Otsu's thresholding method [55]. We also apply small object removal on the images to remove possible isolated pixels. Note that we do not apply thresholding or any other post-processing on the displayed images in this work. Finally, since it can be difficult to gauge the differences in the ANODI scores, we include a "patch sampling" method (i.e. drawing patches of 64 × 64 from the reference image) in the analysis to serve as a third point of comparison.
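Otsu's method picks the threshold that maximizes the between-class variance of the histogram. A numpy-only sketch (rather than a library call) on a bimodal toy signal mimicking the tanh output:

```python
import numpy as np

# Sketch of Otsu's method: choose the threshold maximizing between-class
# variance of the histogram. The bimodal toy data mimics tanh output values
# near -1 (background) and +1 (channel).
def otsu_threshold(x, bins=256):
    hist, edges = np.histogram(x, bins=bins)
    p = hist / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    best_t, best_var = centers[0], -1.0
    for i in range(1, bins):
        w0, w1 = p[:i].sum(), p[i:].sum()   # class weights below/above cut
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (p[:i] * centers[:i]).sum() / w0
        mu1 = (p[i:] * centers[i:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, centers[i]
    return best_t

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-0.8, 0.1, 500), rng.normal(0.8, 0.1, 500)])
t = otsu_threshold(x)
binary = x > t                              # binarized "image"
```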
We show the ANODI scores and multidimensional scaling visualizations in Table 1 and Figure 4c, respectively. The patch sampling procedure understandably produces the highest consistency (lowest inconsistency), although it is also slightly less diverse. Regarding the parametrization, we find that the scores for snesim and G are very close across all resolutions, suggesting that G effectively learned to replicate the data generating process. The multidimensional scaling visualization in Figure 4c further supports this result, showing a very good overlap in the scatter-plots of snesim and G. The scatter-plots are also well spread and centered around the reference image, verifying the good performance of both methods.

Memorization
To verify that the parametrization is not simply memorizing the training dataset, we find the nearest neighbor in the dataset for each of 50 generated realizations. To further verify that the generator is not simply learning trivial transformations, we data-augment our training dataset with horizontal and vertical flips, as well as 10 and −10 degrees rotation and shearing (with reflection filling at the boundaries). This results in 35 additional variations for each image in the dataset. Finally, to capture small translations, we apply a Gaussian blur to the images before computing the Euclidean distance.
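The blurred nearest-neighbor search described above can be sketched as follows. The arrays here are random toy images, not the dataset of this work; the Gaussian blur makes the Euclidean distance tolerant to small translations:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Sketch of the memorization check: blur both the generated image and each
# (augmented) training image before computing Euclidean distances, then find
# the nearest neighbor. Toy random arrays stand in for the actual images.
rng = np.random.default_rng(0)
dataset = rng.standard_normal((20, 64, 64))   # augmented training images
generated = rng.standard_normal((64, 64))     # one generated realization

g_blur = gaussian_filter(generated, sigma=1.0)
dists = [np.linalg.norm(g_blur - gaussian_filter(img, sigma=1.0))
         for img in dataset]
nearest = int(np.argmin(dists))               # index of closest dataset image
```

In the actual check, the generated realization would then be displayed next to `dataset[nearest]` for visual comparison, as in Figure 5a.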
The 50 realizations along with their nearest neighbors are shown in Figure 5a. We see that there is no perfect match despite the heavy data augmentation, verifying that the parametrization is capable of generating novel realizations that are not mere rotations, translations, shearings and flips of images from the dataset. The lack of memorization can be justified by the fact that the generator never has direct access to the training dataset (see Equation (2)). Instead, the generator only obtains indirect information about the dataset via the discriminator (i.e. through its gradients). This is similar to principal component analysis, where the parametrization is only informed about the dataset covariance. In the case of GAN, the relevant dataset statistics are automatically discovered and informed by the discriminator. Finally, we provide a further verification by performing an interpolation in the latent space in Figure 5b. If G were simply memorizing the dataset, we would expect to see sudden jumps from one image of the dataset to another, with implausible images in between as we interpolate in the latent space. We instead find smooth transitions between plausible outputs. The smoothness is also justified by the fact that G is continuous and piecewise differentiable by design (see Equation (1)). Note also that the smoothness is critical in practice for efficient exploration of the solution space during deployment, e.g. for inversion and uncertainty quantification tasks.

Conditional parametrization
We now obtain conditional generators for 9 conditioning configurations, ranging from 16 to 49 spatial observations (hard data). The configurations are detailed in Table 12. These indicate the presence or absence of channels at different locations of the domain. For each configuration d_obs, we derive the Bayesian posterior p(z|d_obs) as described in Section 2.3, and train an inference network to sample from it as described in Section 3. We assume λ = 0.1 in Equation (6) in all our test cases. Note that we use a Gaussian likelihood function in the Bayesian formulation, although one could also consider the binomial distribution for binary data. Also note that under the Bayesian framework adopted here (i.e. with noisy observations), we cannot fully guarantee that the conditioning is strictly honored, only that it is honored with high probability.

Architecture design -inference network
The inference network I_φ : R^30 → R^30 is a fully connected neural network with several layers. We naturally use n_w = n_z = 30 and p_w = p_z = N(0, I) (so that if no conditioning were present, I_φ should learn a distribution-preserving function such as the identity function). To simplify the presentation, we perform minimal hyperparameter tuning; that is, we use the same neural network architecture and optimization parameters for all 9 test cases. In practice, one should perform hyperparameter optimization for each problem at hand. For the non-linearity, we use scaled exponential linear units [56]. No non-linearity is applied in the output layer. Further details of the architecture and training are given in Appendix A.2. Once I is trained, we use G ∘ I to generate conditional realizations.

Quality assessment -conditional realizations
We show conditional realizations generated by G ∘ I for each test case in Figures 6 to 14. We also include conditional realizations obtained using snesim for comparison. Over the images, we indicate the conditioning using blue dots to denote channel material and orange crosses to denote background material. Overall, we observe that G ∘ I generates good conditioning results while maintaining the plausibility of the realizations. We also show in Figure 15 the output of the inference network I for each test case to visualize the distribution change (with no conditioning, the distribution is normal). Since it is cumbersome to visualize the distribution for all 30 components of z, we show pairwise scatter plots only for the first and last two components.
Assessment using analysis of distances. We perform a quantitative assessment as in the unconditional case, using the ANODI method and multidimensional scaling on sets of 100 realizations. We keep the "patch sampling" method for the multidimensional scaling visualization. The results are shown in Figures 6 to 14 and Tables 2 to 10. In terms of the ANODI scores, we find that whenever one method generates more plausible images (lower inconsistency), it also tends to be less diverse, and vice versa; this is the usual trade-off in image synthesis. Overall, we find that snesim produces more diverse realizations whereas G ∘ I emphasizes plausibility. This is reasonable since every output of G ∘ I is also an output of G, hence the conditional realizations cannot deviate too much from the reference spatial statistics. Also for this reason, we find that the outputs of G ∘ I and snesim differ most when the conditioning statistics are in less agreement with the reference spatial statistics. This is evident in Figure 14, and to a lesser extent Figures 6 and 8. In Figure 8 we enforce a diagonal channel, finding that G ∘ I generates plausible but noticeably less diverse realizations compared to snesim. The difference is more pronounced in Figure 14, where we densely enforce vertical channels and find a failure case for G ∘ I, whereas snesim can handle this case despite the implausibility of this conditioning (there are no vertical channels in the reference image). In other words, if the conditioning is in far disagreement with the reference spatial statistics, effective conditional parametrization may be difficult since G is tied to the reference statistics. In the snesim algorithm, deliberate conditioning and diversity can be achieved regardless, since the conditioning is trivially imposed and the stochasticity is intrinsic to the synthesis process.
Finally, when the conditioning is in good agreement with the reference spatial statistics, as in the remaining cases, we observe that G • I generates realizations that are visually comparable with those of snesim. Note that in practice we always aim to use a reference image whose spatial statistics agree well with the spatial observations; otherwise the reference image may not be representative of the area under study. Also note that although we compare our method against a multipoint geostatistical simulator, our emphasis is on parametrization. Lastly, we mention that the present results could be further improved with hyperparameter tuning for each individual test case.
Assessment using the discriminator We demonstrate an alternative approach to assess the quality of the generated realizations using the discriminator D that is made available after training generative adversarial networks to obtain G. Recall that the discriminator outputs a score that estimates the probability of a realization being "real" (see Section 2.2), with higher scores corresponding to higher probability. We can therefore use the discriminator to assess the quality of the generated realizations. We evaluate the discriminator on the same sets of 100 realizations used before in the ANODI assessment, and plot the histogram of the scores in Figure 16. Overall, we verify that this assessment at least arrives at the same qualitative conclusions as the ANODI assessment. For a quantitative summary report, one can consider summary statistics of the scores such as the mean and variance as measures of plausibility and diversity, respectively. Another option is to report the Jensen-Shannon divergence with respect to some reference histogram.
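The Jensen-Shannon divergence mentioned above can be computed directly from two binned score histograms; the sketch below uses synthetic stand-ins for the discriminator scores of two sets of 100 realizations.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two histograms over the same bins.

    Inputs are normalized internally; the result lies in [0, ln 2].
    """
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(np.where(a > 0, a * np.log((a + eps) / (b + eps)), 0.0))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Stand-ins for discriminator scores of two sets of 100 realizations.
rng = np.random.default_rng(1)
scores_a = rng.beta(5, 2, size=100)   # scores concentrated near 1 ("real-looking")
scores_b = rng.beta(2, 5, size=100)   # scores concentrated near 0
bins = np.linspace(0.0, 1.0, 21)
h_a, _ = np.histogram(scores_a, bins=bins)
h_b, _ = np.histogram(scores_b, bins=bins)
print(js_divergence(h_a, h_a))        # 0.0 for identical histograms
print(js_divergence(h_a, h_b))        # positive, bounded by ln 2
```

A divergence near zero indicates that the score distribution of the generated set matches the reference histogram.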

Related work
There is increasing interest in applying deep learning techniques in geological applications to leverage recent advances in the field as well as the increasing availability of data and computational resources that make these techniques effective. In particular, we expect to see more applications of generative adversarial networks (GAN) [15] in geology following successful results from recent works [16][17][18][19][20][21]. In [19], conditioning is addressed using a Bayesian formulation and performing Markov chain Monte Carlo to sample the corresponding posterior distribution. In [20,21], the authors address conditioning using the inpainting technique from [57], which is equivalent to a Bayesian formulation using a "neural network prior" (see Appendix D for more details), and the sampling is done using local optimization. Our approach is closer to [19] in that we use a simple prior and aim to sample the full posterior, with the difference that the sampling is carried out by a neural network and we obtain a parametrization for the sampling process. Our approach is motivated by [58], where the authors trained a neural network to perform texture synthesis; those authors used the sample entropy estimator for the case k = 1 (nearest neighbor, see Equation (9)). The entropy estimator used in our work is a generalization introduced in [47]. A similar estimator based on random distances is used in [59] in the context of texture synthesis. In the context of generative modeling, [60] used a closed-form expression of the entropy term when using batch normalization [61]. Other alternatives for training neural samplers include normalizing flows [62], autoregressive flows [63], and Stein discrepancy [64]; these are all worth exploring in future work. Also related to our work are [65,66], where the authors optimize the latent space to condition on labels/classes.

Conclusion
We introduced a method to obtain a conditional parametrization by extending an existing unconditional parametrization, enabling reusability as well as direct and parametric sampling of conditional realizations. The parametrization considered in this work was based on deep neural networks motivated by their ability to express complex high-dimensional data such as natural images, including geological subsurface images. The unconditional parametrization G was obtained using generative adversarial networks (GAN) [15], and the post-hoc conditioning was done by training a second neural network I to sample a Bayesian posterior, resulting in G • I as the conditional parametrization.
We applied the method to parametrize binary channelized images using the benchmark image of Strebelle and Journel [30]. In previous works, unconditional parametrization based on GAN was assessed using mostly two-point statistics tools. Here we added to the assessment using the analysis of distances method [53], which captures multipoint statistics. We found very positive results for the unconditional case, supporting previous results showing that G can effectively replicate the data generating process (in our case, the snesim [30] algorithm) while achieving dimensionality reduction of two orders of magnitude. Post-hoc conditional parametrization was explored for a variety of configurations. We found that G • I produces very plausible realizations with good conditioning results, but the effectiveness may depend on the conditioning. Specifically, if the observations strongly disagree with the reference spatial statistics, effective conditioning may be difficult. For observations that agree with the reference spatial statistics, we found that the parametrization produces results comparable to snesim.
Possible future works include studying alternative training methods for the inference network as mentioned in Section 5, and further assessments with other images and in large scale settings.

A Implementation details
This section describes training and hyperparameters of the neural network models. See [67] for a practical guide on training neural networks.

A.1 Generator neural network
The generator G : R^30 → R^(64×64) is a deep convolutional neural network based on the template provided in [34]. The generator architecture consists of stacks of (transposed) convolutional layers (see Appendix E) together with batch normalization layers [61]. Batch normalization is the operation of normalizing the intermediate layer results to have zero mean and unit variance, which drastically improves optimization of deep neural networks [61]. For the non-linearity, we use rectified linear units (ReLU, σ(x) = max(0, x)) in the intermediate layers, and σ(x) = tanh(x) in the last layer to constrain the output to [−1, 1]. The architecture is summarized in Table 11a. We train G using the Wasserstein formulation of GAN introduced in [41] with the proposed default hyperparameters. The optimization is performed using the Adam [48,68] method with a learning rate of 10^-4 and a batch size of 32.
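The growth of the spatial dimensions through a stack of transposed convolutions follows a standard output-size formula; the sketch below uses hypothetical DCGAN-style kernel/stride/padding values to reach a 64×64 output, not necessarily the exact settings of Table 11a.

```python
def tconv_out(size, kernel, stride, pad):
    """Output length of a transposed convolution along one spatial dimension."""
    return (size - 1) * stride - 2 * pad + kernel

# Hypothetical DCGAN-style progression from a 1x1 latent map to 64x64:
s = tconv_out(1, 4, 1, 0)        # 1 -> 4
for _ in range(4):
    s = tconv_out(s, 4, 2, 1)    # 4 -> 8 -> 16 -> 32 -> 64
print(s)  # 64
```

Each stride-2 layer doubles the spatial size, which is why four such layers suffice to go from 4×4 feature maps to the full 64×64 image.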
Our generator converges in approximately 20,000 iterations, taking around 30 minutes using a Nvidia GeForce GTX Titan X GPU. For deployment, it can generate approximately 5500 realizations per second using the GPU.

A.2 Inference neural network
We use the same inference network architecture I : R^30 → R^30 for all our conditioning experiments. The architecture is simply a stack of fully connected layers with constant-size intermediate layers. More specifically, we first transform the input from size 30 to size 512, then apply several more intermediate transformations preserving the size, and finally apply a transformation to bring the size back from 512 to 30 in the output layer. For the non-linearity, we use scaled exponential linear units (SELU) [56], which are the current default option for deep fully connected networks: σ(x) = λx if x > 0, otherwise σ(x) = λα(e^x − 1), where the constants λ, α are given in [56]. No non-linearity is applied in the output layer (we do not need to bound the output as in the case of the generator). We experimented with different numbers of layers. Perhaps not surprisingly, we found that deeper architectures tended to produce better results in general. In our work, we settled on 5 intermediate layers. The architecture is summarized in Table 11b. We optimize I using the Adam method with a learning rate of 10^-4 and a batch size of 64 for all the test cases. The network converges in between 1000 and 10,000 iterations, depending on the conditioning, taking between seconds and a few minutes to train using a Nvidia GeForce GTX Titan X GPU. For deployment, the conditional generator G • I can generate approximately 5500 realizations per second using the GPU; we do not see a significant increase in generation time from G to G • I.
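The SELU non-linearity is simple to state explicitly; the constants below are the self-normalizing values derived in [56].

```python
import numpy as np

# Self-normalizing constants from Klambauer et al. [56].
LAMBDA = 1.0507009873554805
ALPHA = 1.6732632423543772

def selu(x):
    """Scaled exponential linear unit, applied elementwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, LAMBDA * x, LAMBDA * ALPHA * (np.exp(x) - 1.0))

print(selu(1.0))    # LAMBDA * 1.0, about 1.0507
print(selu(0.0))    # 0.0
```

For large negative inputs the activation saturates at −λα ≈ −1.758, which is what drives the self-normalizing behavior.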

B Conditioning settings
The conditioning settings are summarized in Table 12.

C Mixture of Gaussians
The proposed method described in Section 3 can be used to train a general neural sampler. In this side section, we perform a simple sanity check by assessing the method on a toy problem where we train neural networks to sample mixtures of Gaussians. Concretely, we train fully connected neural networks I_φ : R^(n_w) → R^(n_z) to sample simple 1D and 2D mixtures of Gaussians, with n_z = n_w = 1 in the 1D case, and n_z = n_w = 2 in the 2D case. The source distribution p_w is the standard normal in both cases. Results are summarized in Figure 17.
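As a concrete illustration of the entropy term that appears in the training objective, the following is a minimal sketch of the nearest-neighbor (k = 1) Kozachenko-Leonenko entropy estimator, checked against the known entropy of a standard normal. It illustrates the estimator only, not the full training loop, and the sample size is an arbitrary choice.

```python
import numpy as np
from math import gamma, log, pi

def knn_entropy(x):
    """Kozachenko-Leonenko entropy estimator with k = 1 (nearest neighbor).

    x : (N, d) array of samples. Returns an estimate of the
    differential entropy in nats.
    """
    n, d = x.shape
    # Distance of each sample to its nearest neighbor (excluding itself).
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    eps = dist.min(axis=1)
    # psi(N) - psi(1) equals the harmonic number H_{N-1} for integer N.
    harmonic = np.sum(1.0 / np.arange(1, n))
    log_c_d = log(pi ** (d / 2) / gamma(d / 2 + 1))  # log unit-ball volume
    return harmonic + log_c_d + d * np.mean(np.log(eps))

rng = np.random.default_rng(0)
x = rng.standard_normal((2000, 1))
h_true = 0.5 * log(2 * pi * np.e)   # entropy of N(0, 1), about 1.4189
print(abs(knn_entropy(x) - h_true) < 0.1)
```

During training, maximizing this estimated entropy of the sampler's outputs discourages the collapse of all latent solutions onto a single point.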

D Comparison to related work based on inpainting
In image processing, image inpainting is used to fill incomplete images or replace a subregion of an image (e.g. a face with the eyes covered). The recent GAN-based inpainting technique employed in [20,21] uses an optimization approach whose loss, Equation (10), combines a data misfit term on the observed values with a perceptual loss term. The perceptual loss is the same as the second term of the GAN loss in Equation (2), which is the classification score on synthetic realizations. Compare Equation (10) with Equation (6): While our Bayesian posterior uses a simple Gaussian prior, the prior in Equation (10) (the perceptual loss) involves the discriminator D used during the GAN training. We argue that the Gaussian prior can be equally effective, as long as the GAN training has converged successfully: If G and D are at convergence, then G(z) always produces plausible realizations for z ∼ p_z, where p_z is the chosen latent distribution, and D is 1/2 for all realizations of G(z). In such a scenario, the perceptual loss should then act as a regularization term that drives z towards regions of high density of the latent distribution p_z, therefore having a similar effect to using p_z as the prior. For example, let us consider z ∼ U[0, 1] and y ∼ U[1, 3]. An optimal generator would be G(z) = 2z + 1, and an optimal discriminator D(y) = 1/2 for y ∈ [1, 3] and D(y) = 0 otherwise. Then D(G(z)) = 1/2 for z ∈ [0, 1], and D(G(z)) = 0 otherwise, which is precisely the density function of z ∼ U[0, 1] scaled by 1/2. Therefore, in this example the perceptual loss and p_z as prior would have the same effect. Nevertheless, in practice the perceptual loss can be very useful when G and D are not exactly optimal and there exist bad realizations from G. In that case, the perceptual loss can help the optimization to find good solutions. In our work, we found the Gaussian prior to be sufficient while removing a layer of complexity from the optimization.
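The worked example above can be checked numerically; the functions below simply transcribe the optimal G and D from the text.

```python
import numpy as np

# Optimal generator and discriminator for z ~ U[0, 1], y ~ U[1, 3],
# transcribed from the worked example in the text.
G = lambda z: 2.0 * z + 1.0
D = lambda y: np.where((y >= 1.0) & (y <= 3.0), 0.5, 0.0)

z = np.linspace(0.0, 1.0, 101)
print(np.all(D(G(z)) == 0.5))      # True: 0.5 everywhere on the support of z
print(float(D(np.array(4.0))))     # 0.0 outside [1, 3]
```

As claimed, D(G(z)) is constant at 1/2 on the support of z and zero elsewhere, i.e. the density of U[0, 1] scaled by 1/2.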

E Convolutional neural networks
We provide a very brief description by example of convolutional neural networks. See [69,70] for further details or [71] for a more practical treatment. Let u = (u_1, u_2, u_3, u_4) and a = (a_1, a_2). Let us call a a filter. To convolve the filter a over u is to compute the output vector v with components v_i = u_i a_1 + u_{i+1} a_2 for i = 1, 2, 3. The operation is illustrated as a neural network layer in Figure 18a. In this example, the convolution has a stride of 1 (the step at which the filter is swept), but in general it can be any positive integer.
We also show the matrix A associated with this operation; it is easy to verify that v = Au. We see that the associated matrix is sparse and diagonal-constant, which is the appeal of using convolutional layers. This structural constraint achieves two things: it drastically reduces the number of free weights, and it does so by assuming a locality prior. This locality prior turns out to be useful in practice, since nearby events in natural phenomena (natural images, speech, text, etc.) tend to be correlated.
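The example can be written out directly; the sparse, diagonal-constant matrix A below corresponds to convolving the filter a = (a_1, a_2) over u with stride 1, with arbitrary filter values chosen for illustration.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0, 4.0])
a = np.array([10.0, 100.0])       # filter (a_1, a_2), arbitrary values

# Direct convolution: v_i = u_i * a_1 + u_{i+1} * a_2 for i = 1, 2, 3.
v = np.array([u[i] * a[0] + u[i + 1] * a[1] for i in range(3)])

# The same operation as a sparse, diagonal-constant matrix A.
A = np.array([[a[0], a[1], 0.0, 0.0],
              [0.0, a[0], a[1], 0.0],
              [0.0, 0.0, a[0], a[1]]])
print(np.allclose(v, A @ u))      # True: v = Au
print(v)                          # values 210, 320, 430
```

Note that the dense version of this layer would have 12 free weights, whereas the convolution has only the 2 filter weights.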
Compare the convolutional layer with the fully connected layer shown in Figure 18b: In the fully connected case, the associated matrix is dense, resulting in 12 free weights whereas the convolution layer has only 2 for the same layer sizes. This difference is greatly amplified in practice where inputs/outputs are large (e.g. images), making convolutional layers a much more efficient architecture. Note that in practice we use deep architectures, i.e. several stacks of convolutional layers, therefore the full connectivity can be recovered if necessary, although now with an embedded locality prior along with a hierarchy in the influence of the weights.
Note that the example considered above would always result in a smaller output vector size. If the opposite effect is desired, a simple solution is to transpose the matrix A. For this reason, this operation is called a transposed convolution. Several stacks of transposed convolutions are typically used in generators and decoders to upsample the small latent vector to the full-size output image. In classifier neural networks, normal convolutions are used instead to downsample the large image to a single number indicating a probability.
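Continuing the same example, transposing A upsamples a length-3 vector back to length 4; this is a minimal sketch of the transposed convolution, reusing the same arbitrary filter values as above.

```python
import numpy as np

a = np.array([10.0, 100.0])                  # same arbitrary filter (a_1, a_2)
A = np.array([[a[0], a[1], 0.0, 0.0],
              [0.0, a[0], a[1], 0.0],
              [0.0, 0.0, a[0], a[1]]])

w = np.array([1.0, 2.0, 3.0])                # small "latent" input
up = A.T @ w                                 # upsampled from size 3 to size 4
print(up.shape)                              # (4,)
print(up)                                    # values 10, 120, 230, 300
```

Stacking several such transposed layers is how a generator expands a short latent vector into a full-size image.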
Our brief description can be readily extended to 2D and 3D arrays with corresponding multidimensional filters. For example, for a 2D input the filters are of rectangular shape and can be swept horizontally and vertically. See [71] for further practical details.

F Computational complexity
Let N denote the dataset size and d the dimension of each realization. Fast PCA methods based on singular value decomposition can achieve a complexity of O(N^2 d). This complexity is favorable in geology where N ≪ d, i.e. we have a small number of very large realizations (although it still grows quadratically with the number of realizations). For our present method, reporting the computational complexity is less straightforward since it is highly problem-dependent. To illustrate the difficulties, we discuss in the following the computational complexity of a classifier neural network; similar arguments apply to encoders, generators and decoders.
Estimating the computational complexity of neural network models is difficult since it depends entirely on the architecture, which in turn depends on the learning difficulty of the problem at hand. For example, in the simple case where the dataset is linearly separable, a classifier neural network of the form f(x) = σ(w^T x + b), with w, b to be determined, is enough to correctly classify all points of the dataset. The evaluation cost of this neural network is simply O(d), hence the training cost is O(Td) (when using stochastic gradient descent, as normally done), where T is the number of update iterations. Note that this expression does not depend on N, although in practice T is at most linear in N, e.g. when performing multiple passes through the dataset until convergence; the training can, however, converge even before a single pass through the dataset (which happens on massive datasets). Hence, neural networks are very favorable in the big data setting, i.e. when N is very large.
The estimated evaluation cost of O(d) is overly optimistic since in practice we use deep architectures to deal with complex datasets that are not linearly separable. If the architecture is instead f(x) = σ(A_l(· · · σ(A_2(σ(A_1 x + b_1) + b_2)) · · ·) + b_l), where each A_i is a d × d matrix, then the evaluation cost of this architecture is roughly O(d^2) (we omit the number of layers l since this is a constant factor and l ≪ d). However, this estimate is now overly pessimistic: First, in practice the A_i are not shape-preserving; instead, they decrease very quickly in size while exponentially compressing the input. Second, the matrices A_i are rarely full since convolutional layers are used instead (see Appendix E), resulting in very sparse matrices that are several orders of magnitude lighter. Modern architectures use several stacks of exponentially decreasing convolutional layers, while fully connected layers are avoided or used only sparingly (and for small inputs/outputs). The overall effect is a drastic reduction in the computational complexity, from O(d^2) to O(kd), where k is a factor determined by the architecture. The corresponding training complexity is then O(Tkd). Note that although k < d in practice, k can still be sizable. On the other hand, k = 1 is also possible, as just mentioned. Ultimately, k will depend on the learning difficulty of the problem. In most models encountered in the literature, k grows sublinearly with d.
Perhaps more important than the computational time is the human time involved in optimizing the dozens of hyperparameters, in particular the architecture design, for which automation is currently limited. As mentioned before, designing the architecture is heavily based on experience, heuristics, and experimentation, which incur high costs in terms of engineering time. The justification of such costs will ultimately depend on the lifespan of the model, since the model needs to be constructed only once but can be deployed for a long time (e.g. history matching) or virtually indefinitely (e.g. most applications in internet companies, such as recommender systems, visual and voice recognition, and language translation). Automatic architecture search is an ongoing area of research (see e.g. [72,73] and references therein).