A Surrogate Modelling Approach Based on Nonlinear Dimension Reduction for Uncertainty Quantification in Groundwater Flow Models

In this paper, we develop a surrogate modelling approach for capturing the output field (e.g. the pressure head) from groundwater flow models involving a stochastic input field (e.g. the hydraulic conductivity). We use a Karhunen–Loève expansion for a log-normally distributed input field and apply manifold learning (local tangent space alignment) to perform Gaussian process Bayesian inference using Hamiltonian Monte Carlo in an abstract feature space, yielding outputs for arbitrary unseen inputs. We also develop a framework for forward uncertainty quantification in such problems, including analytical approximations of the mean of the marginalized distribution (with respect to the inputs). To sample from the distribution, we present a Monte Carlo approach. Two examples are presented to demonstrate the accuracy of our approach: a Darcy flow model with contaminant transport in 2-d and a Richards equation model in 3-d.


Introduction
Groundwater contamination, caused by landfills, wastewater seepage, hazardous chemical spillage, dumping of toxic substances or discharge from industrial processes (Karatzas 2017), is a major concern for both public and environmental health. Understanding the mechanisms and predicting the transport of contaminants through soils is therefore an important topic in groundwater flow modelling.
The control of groundwater quality relies on knowledge of the transport of chemicals to the groundwater through soil. The efficacy of remedial treatment and management of contaminated land depends on the accuracy of models used for the simulation of flow and solute transport. Modelling and simulation of hydraulic phenomena in soil are, however, hampered by the complex and heterogeneous nature of soils, as well as the broad range of influential factors involved. A number of simplified models have been developed to describe the small-scale physical, chemical (Boi et al. 2009; Foo and Hameed 2009; Vomvoris and Gelhar 1990), and biological mechanisms (Schäfer et al. 1998; Barry et al. 2002) that affect unsaturated flow and contaminant transport.
A. A. Shah, Akeel.Shah@warwick.ac.uk, School of Engineering, University of Warwick, Coventry CV4 7AL, UK
A current challenge in modelling solute transport in soils lies in characterizing and quantifying the uncertainties engendered by the natural heterogeneity of the soil. Such uncertainty can be vital for decision-making. Despite strong evidence from field-scale observations and experimental studies in relation to the effects of soil heterogeneity on the transport of contaminants (Al-Tabbaa et al. 2000;Kristensen et al. 2010), relatively few numerical models incorporate the effects of this uncertainty (Feyen et al. 1998;Aly and Peralta 1999;Datta 2011a, 2014;Herckenrath et al. 2011).
Monte Carlo (MC) sampling is the default method for investigating uncertainties in a system (e.g. propagating uncertainty in the inputs), including in the context of groundwater flow modelling (Fu and Gomez-Hernandez 2009;Paleologos et al. 2006;Kourakos and Harter 2014;Maxwell et al. 2007;Herckenrath et al. 2011). MC estimates are extracted from multiple runs of the model using different realizations of the inputs, sampled from some distribution. While convergence is guaranteed as the number of runs increases, the slow rate of convergence demands (typically) a few thousand runs in order to extract reliable estimates of the statistics. If the model is computationally expensive, such a brute-force approach can be extremely time-consuming or perhaps even infeasible (Maxwell et al. 2007). Analytical stochastic methods have also been employed (Gelhar and Axness 1983;Gelhar 1986). Such methods can be useful for conceptual understanding of the transport process but are not applicable to practical scenarios.
Such limitations and shortcomings could be resolved in theory by using surrogate models (also known as metamodels, emulators or simply surrogates) in place of the complex numerical codes; that is, computationally efficient approximations of the codes based on data-driven or reduced-order model (ROM) approaches. Surrogate models have been used in a limited number of groundwater flow modelling problems (Aly and Peralta 1999;Bhattacharjya and Datta 2005;Kourakos and Mantoglou 2009;Sreekanth and Datta 2011b;Ataie-Ashtiani et al. 2014) (we refer to Razavi et al. 2012;Ketabchi and Ataie-Ashtiani 2015 for reviews of the topic) and are typically based on artificial neural networks (ANNs) for approximating a small number of outputs within an optimization task. For example, Bhattacharjya and Datta used an ANN to approximate the salt concentration in pumped water at 8 pumping wells for 3 different times, in order to maximize the total withdrawal of water from a coastal aquifer while limiting the salt concentration (Bhattacharjya and Datta 2005). Similarly, Kourakos and Mantoglou used an ANN model to optimize 34 well pumping rates in a coastal aquifer (Kourakos and Mantoglou 2009).
Another popular surrogate modelling approach is the stochastic collocation method (Babuška et al. 2007) in which the approximate response is constrained to a subspace, typically spanned by a generalized Polynomial Chaos basis (Xiu and Karniadakis 2002). The coefficients in this basis are approximated via a collocation scheme. While these schemes yield good convergence rates, they scale poorly with the number of collocation points (Rajabi et al. 2015). Although sparse grid methods based on the Smolyak algorithm (Smolyak 1963) help to alleviate the increased computational burden, the resulting schemes are still severely limited by the input space dimensionality and tend to perform poorly with limited observations (Xiu and Hesthaven 2005;Xiu 2007;Nobile et al. 2008;Ma and Zabaras 2009).
When data are scarce, we may turn to statistical Bayesian approaches such as Gaussian process (GP) regression. GPs are stochastic processes used for inferring nonlinear and latent functions. They are defined as families of normally distributed random variables, indexed in this case by the input variable(s). GPs were first used for surrogate models in the seminal papers of Currin et al. (1988) and Sacks et al. (1989). The first applications of GP surrogate models to uncertainty quantification can be found in O'Hagan and Kingman (1978). Kernel methods such as GP models are well-established tools for analysing the relationships between input data and corresponding outputs of complex functions. Kernels encapsulate the properties of functions in a computationally efficient manner and provide flexibility in terms of model complexity (the functions used to approximate the target function) through variation of the functional form and parameters of the kernel.
GPs excel when data are scarce since they make a priori assumptions with regard to the relationship between data points. Comparatively, ANNs make fewer a priori assumptions and as a result require much larger data sets; they are, therefore, infrequently used for uncertainty quantification tasks. In the context of groundwater flow, very few applications of GPs can be found (Bau and Mayer 2006;Hemker et al. 2008;Borgonovo et al. 2012), the most likely explanations for which are the difficulty in implementing multioutput GP models and the lack of available information on, and software for GP modelling in comparison with ANNs. Existing applications again deal with low-dimensional outputs; e.g. in Bau and Mayer (2006), the authors use a GP model to learn 4 well extraction rates for a pump-and-treat optimization problem.
Our aim in this paper is to develop a surrogate model for the values of a field variable in a groundwater flow model, e.g. the pressure, pressure head or flow velocity, at a high number of points in the spatial domain, in order to propagate uncertainty in a stochastic field input, e.g. the hydraulic conductivity. In such cases, simplified covariance structures (Conti and O'Hagan 2010) for the output space (response surface) or dimensionality reduction for the input and/or output space can be used. Higdon et al. (2008) use principal component analysis (PCA) to perform linear, non-probabilistic dimensionality reduction on the response in order to render a GP model tractable (independent learning of a small number of PCA coefficients). Such linear approaches (PCA, multidimensional scaling, factor analysis) are applicable only when data lie in or near a linear subspace of the output space.
For more complex response surfaces, manifold learning (nonlinear dimensionality reduction) can be employed, using, for example, kernel principal component analysis (kPCA), diffusion maps (Xing et al. 2016) or isomaps (Xing et al. 2015). In contrast, kPCA was used to perform nonlinear, non-probabilistic dimensionality reduction of the input space in Ma and Zabaras (2011). This can be useful when the input space is generated from observations (experimental data), but when the form is specified we can use linear dimension reduction methods such as the Karhunen-Loève (KL) expansion (Wong 1971).
In this paper we use manifold learning in the form of local tangent space alignment (LTSA) (Zhang and Zha 2004) to perform Bayesian inference (GP regression/emulation with Markov Chain Monte Carlo) in an abstract feature space and use an inverse (pre-image) map to obtain the output field at a finite number of points for an arbitrary input. In contrast to diffusion maps, isomaps and kPCA, LTSA is a local method in that it approximates points on the manifold on localized regions (patches), rather than directly seeking a global basis for the feature space. This can potentially provide more accurate results, although this is of course dependent upon the sampling methodology for the points and the quality of the reconstruction mapping.
The aforementioned approach is combined with a Karhunen-Loève expansion for a lognormally distributed input field and a framework for UQ is developed. We derive analytical forms for the output distribution by pushing the feature space Gaussian distribution through a locally linear reconstruction map. Additionally, we derive analytical estimates of the moments of the predictive distribution via approximate marginalization of the stochastic input. To sample from the hyperparameter and signal precision posteriors, we employ a Hamiltonian Monte Carlo scheme and use MC sampling to approximately marginalize the stochastic input distribution. The accuracy of the approach is demonstrated via two examples: a linear, steady-state Darcy's Law with a contaminant mass balance in a 2-d domain (aquifer) and a time-dependent Richards equation evaluated at a fixed time in a 3-d domain. In both cases we consider a stochastic hydraulic conductivity input.
The rest of the paper is organized as follows. In Sect. 2 we provide a detailed problem statement and outline the proposed solution. In Sect. 3 we outline LTSA, and in Sect. 4 we outline GP regression. In Sect. 5 we provide full details of the coupling of the methods and we demonstrate how the approach can be used to perform UQ tasks. In Sect. 6 we present the examples and discuss the results.

Problem Statement
Consider a well-defined steady-state partial differential equation (PDE) with a scalar, isotropic random field input (e.g. a permeability or hydraulic conductivity), and a response (output) consisting of a scalar field, e.g. pressure head, concentration or flow velocity. We may generalize our approach to multiple or vector fields but in order to simplify the presentation we focus on a single scalar field. We can also apply the method we develop to dynamic problems by focusing on the spatial field at a given fixed time (the second example we present). For an arbitrary input field realization, solutions to the PDE are found using a numerical code (simulator, or solver) on a spatial mesh with $k_y$ fixed degrees of freedom, e.g. grid points in a finite difference grid, control volume centres in a finite volume mesh or spatial nodes in a finite element mesh combined with a nodal basis.
We denote the input field by $K(\mathbf{x})$, where $\mathbf{x} \in \mathcal{R} \subset \mathbb{R}^d$, $d \in \{1, 2, 3\}$, denotes a spatial location and the notation makes explicit the spatial dependence. The model output (a scalar field) is denoted by $u(\mathbf{x}; K)$, i.e. it is a function of $\mathbf{x}$ that is parameterized by $K(\mathbf{x})$. The random input $K(\mathbf{x})$ is defined on the whole of $\mathcal{R}$ and therefore requires a discrete (finite-dimensional) approximation in order to obtain a numerical solution. Let $\mathbf{x}_k \in \mathbb{R}^d$, $k = 1, \dots, k_y$, be a set of nodes or grid points and suppose that the simulator yields discrete approximations of the output field $u(\mathbf{x}; K)$ at these nodes in each run. Our goal is to approximate these simulator outputs for an arbitrary $K$.

Input Model: Karhunen-Loève Expansion
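Let $(\Omega, \mathcal{F}, P)$ be a probability space, with sample space $\Omega$, event space $\mathcal{F}$ and probability measure $P$. We can explicitly signify the randomness of the input by writing $K(\mathbf{x}, \omega)$, where $\omega \in \Omega$. For simplicity, and where it will not cause confusion, we suppress the dependence on $\omega$ (the same applies to other random processes). We assume that $K(\mathbf{x})$ is log-normal (to avoid unphysical, i.e. negative, realizations), so is of the form $K(\mathbf{x}, \omega) = \exp(Z(\mathbf{x}, \omega))$, where $Z(\mathbf{x}, \omega)$ is a normally distributed field (a GP indexed by $\mathbf{x}$). For each $\mathbf{x} \in \mathcal{R}$, $Z(\mathbf{x}, \cdot) : \Omega \to \mathbb{R}$ is a random variable defined on the (common) probability space $(\Omega, \mathcal{F}, P)$. For a fixed $\omega \in \Omega$, $Z(\cdot, \omega) : \mathcal{R} \to \mathbb{R}$ is a deterministic function of $\mathbf{x}$ called a realization or sample path of the process. The mean and covariance functions of $Z(\mathbf{x}, \omega)$ are defined as:
$$m_Z(\mathbf{x}) = \mathbb{E}[Z(\mathbf{x}, \omega)], \qquad c_Z(\mathbf{x}, \mathbf{x}') = \mathbb{E}\big[(Z(\mathbf{x}, \omega) - m_Z(\mathbf{x}))(Z(\mathbf{x}', \omega) - m_Z(\mathbf{x}'))\big], \tag{1}$$
respectively, in which $\mathbb{E}[\cdot]$ is the expectation operator. Given the covariance and mean functions for $Z(\mathbf{x}, \omega)$, the most widely used finite-dimensional approximation is based on a Karhunen-Loève (KL) expansion (Wong 1971). Assume that $Z(\mathbf{x}, \omega)$ is mean-square continuous, and is thus a second-order process. The KL theorem states that we may express $Z(\mathbf{x}, \omega)$ as a linear combination of deterministic $L^2(\mathcal{R})$-orthonormal functions $w_j(\mathbf{x})$, with random $L^2(\Omega)$-orthonormal coefficients $\xi_j(\omega)$:
$$Z(\mathbf{x}, \omega) = m_Z(\mathbf{x}) + \sum_{j=1}^{\infty} \sqrt{\lambda_j}\, \xi_j(\omega)\, w_j(\mathbf{x}), \tag{2}$$
where $\lambda_1 \ge \lambda_2 \ge \dots > 0$ and $\{w_j(\mathbf{x})\}_{j=1}^{\infty}$ are, respectively, the eigenvalues and eigenfunctions of an integral operator with kernel $c_Z(\mathbf{x}, \mathbf{x}')$:
$$\int_{\mathcal{R}} c_Z(\mathbf{x}, \mathbf{x}')\, w_j(\mathbf{x}')\, d\mathbf{x}' = \lambda_j w_j(\mathbf{x}). \tag{3}$$
The random coefficients are given by:
$$\xi_j(\omega) = \frac{1}{\sqrt{\lambda_j}} \int_{\mathcal{R}} \big(Z(\mathbf{x}, \omega) - m_Z(\mathbf{x})\big)\, w_j(\mathbf{x})\, d\mathbf{x}, \tag{4}$$
and are independent, standard normal ($\xi_j \sim \mathcal{N}(0, 1)$), with $\mathrm{Var}(\sqrt{\lambda_j}\, \xi_j(\omega)) = \lambda_j$, where $\mathrm{Var}(\cdot)$ denotes the variance operator. Here and throughout, $\mathcal{N}(\cdot, \cdot)$ denotes a normal distribution, in which the first argument is the mean (mean vector) and the second is the variance (covariance matrix). The sum (2) can be truncated by virtue of the decay in the eigenvalues for increasing $j$. Discretizing the eigenvalue problem (3) using finite differencing at the nodes $\mathbf{x}_k \in \mathcal{R}$, $k = 1, \dots, k_y$, assuming that they are uniformly distributed, leads to an eigenvalue problem for the covariance matrix $\mathbf{C} = [c_Z(\mathbf{x}_k, \mathbf{x}_j)]_{k,j=1}^{k_y}$:
$$\mathbf{C} \mathbf{w}_j = \lambda_j \mathbf{w}_j, \tag{5}$$
where the $k$th component $w_{j,k}$ of $\mathbf{w}_j \in \mathbb{R}^{k_y}$, $j = 1, \dots, k_y$, is equivalent to the evaluation of eigenfunction $w_j$ at the node $\mathbf{x}_k$, $k = 1, \dots, k_y$. Defining the random vector $\mathbf{Z} := (Z(\mathbf{x}_1), \dots, Z(\mathbf{x}_{k_y}))^T : \Omega \to \mathbb{R}^{k_y}$, we can write:
$$\mathbf{Z} = \mathbf{m}_Z + \sum_{j=1}^{k_y} \sqrt{\lambda_j}\, \xi_j\, \mathbf{w}_j, \tag{6}$$
where $\mathbf{m}_Z = (m_Z(\mathbf{x}_1), \dots, m_Z(\mathbf{x}_{k_y}))^T$ and $\xi_j \sim \mathcal{N}(0, 1)$ are independent random variables (note that we have kept the notation $\xi_j$ and $\lambda_j$ used in the continuous case in order to avoid notational clutter). This provides discrete realizations of $Z(\mathbf{x}, \omega)$, and the expansion in (6) can be truncated by virtue of the decay in $\lambda_j$ for some $k_\xi < k_y$, chosen so that the generalized variance satisfies $\sum_{j=1}^{k_\xi} \lambda_j / \sum_{j=1}^{k_y} \lambda_j > \vartheta$ for some specified tolerance $0 < \vartheta < 1$. We can then obtain discrete realizations $\mathbf{K} = (K_1, \dots, K_{k_y})^T$ of $K(\mathbf{x}, \omega)$ via:
$$\mathbf{K} = \exp\Big(\mathbf{m}_Z + \sum_{j=1}^{k_\xi} \sqrt{\lambda_j}\, \xi_j\, \mathbf{w}_j\Big), \tag{7}$$
in which the exponential is applied componentwise. The discrete input $\mathbf{K}$ can then be replaced by the random vector $\boldsymbol{\xi} = (\xi_1, \dots, \xi_{k_\xi})^T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, the coefficients of which are independent standard normal. We may then write $u(\mathbf{x}_k; \boldsymbol{\xi})$ for the KL approximation to $u(\mathbf{x}_k; K)$ at the nodes $\{\mathbf{x}_k\}_{k=1}^{k_y}$. We note that different methods, including different quadrature rules or the use of projection schemes and Nyström methods (Wan and Karniadakis 2006), can be used to solve the eigenvalue problem (3), all of which lead to a generalized eigenvalue problem in place of (5) (Betz et al. 2014). For example, if the finite element method is used, we may express the eigenfunctions as $w_j(\mathbf{x}) = \sum_k l_{j,k} \psi_k(\mathbf{x})$ in terms of the finite element basis $\{\psi_k\}_{k=1}^{k_y}$ and perform a Galerkin projection of (3) onto $\mathrm{span}(\psi_1, \dots, \psi_{k_y})$ to yield a generalized eigenvalue problem for $\{\lambda_j\}_{j=1}^{k_y}$ and the undetermined coefficients $\{l_{j,k}\}_{j,k=1}^{k_y}$ (Ghanem and Spanos 2003).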
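The discretized KL construction described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used for the examples in this paper: the squared-exponential covariance `sq_exp`, its length scale, the 1-d uniform grid and all function names are hypothetical choices made only for the sketch.

```python
import numpy as np

def sq_exp(a, b, variance=1.0, length=0.3):
    # Squared-exponential covariance c_Z(x, x'); an illustrative choice only.
    return variance * np.exp(-0.5 * np.sum((a - b) ** 2, axis=-1) / length**2)

def kl_lognormal_samples(nodes, mean_z, cov_fn, tol=0.95, n_samples=5, seed=0):
    """Sample K = exp(Z) at the nodes from a truncated, discretized KL expansion."""
    # Covariance matrix C[k, j] = c_Z(x_k, x_j).
    C = cov_fn(nodes[:, None, :], nodes[None, :, :])
    lam, W = np.linalg.eigh(C)                 # eigenvalues in ascending order
    lam, W = lam[::-1], W[:, ::-1]             # reorder so that lam[0] is largest
    lam = np.clip(lam, 0.0, None)              # remove round-off negatives
    # Smallest k_xi whose retained variance fraction exceeds the tolerance.
    frac = np.cumsum(lam) / np.sum(lam)
    k_xi = int(np.searchsorted(frac, tol, side="right")) + 1
    rng = np.random.default_rng(seed)
    xi = rng.standard_normal((n_samples, k_xi))   # xi ~ N(0, I)
    # Truncated expansion: Z = m_Z + sum_j sqrt(lam_j) xi_j w_j.
    Z = mean_z + (xi * np.sqrt(lam[:k_xi])) @ W[:, :k_xi].T
    return np.exp(Z), xi, lam, k_xi

nodes = np.linspace(0.0, 1.0, 50)[:, None]       # 50 uniform nodes in 1-d
K, xi, lam, k_xi = kl_lognormal_samples(nodes, 0.0, sq_exp)
```

Each row of `K` is one positive realization of the discrete conductivity field, and the corresponding row of `xi` is the low-dimensional input that a surrogate would take in its place.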

Statement of the Surrogate Model Problem
The simulator can now be considered as a mapping $\eta : \mathcal{X} \to \mathcal{Y}$ (assumed to be continuous and injective), where $\mathcal{X} \subset \mathbb{R}^{k_\xi}$ is the permissible input space, with $\boldsymbol{\xi} \in \mathcal{X}$, and $\mathcal{Y} \subset \mathbb{R}^{k_y}$ is the permissible output space or response surface, with $\mathbf{y} \in \mathcal{Y}$ consisting of the discrete field:
$$\mathbf{y} = \eta(\boldsymbol{\xi}) = \big(u(\mathbf{x}_1; \boldsymbol{\xi}), \dots, u(\mathbf{x}_{k_y}; \boldsymbol{\xi})\big)^T. \tag{8}$$
Our aim is to develop a surrogate to make fast, online predictions of $\eta(\boldsymbol{\xi})$, using training data from a limited number of solver runs at the design points $\boldsymbol{\xi}_n$, $n = 1, \dots, N$. The training data can be expressed compactly as a matrix $\mathbf{Y} = [\mathbf{y}_1, \dots, \mathbf{y}_N]^T \in \mathbb{R}^{N \times k_y}$ and we can define $\boldsymbol{\Xi} = [\boldsymbol{\xi}_1, \dots, \boldsymbol{\xi}_N]^T \in \mathbb{R}^{N \times k_\xi}$. The high dimensionalities of the (original) input and output spaces pose great challenges for surrogate model development. The input space dimensionality can be reduced as described above. The intrinsic dimensionality of the output space is significantly lower than $k_y$ by virtue of correlations between outputs for different inputs, as well as physical constraints imposed by the simulator. This suggests that we treat $\mathcal{Y}$ as a manifold and use manifold learning/dimensionality reduction to perform Bayesian inference on a low-dimensional (feature) space $\mathcal{F}$ that is locally homeomorphic to $\mathcal{Y}$. Below we introduce the manifold learning method employed, before recasting the emulation problem as one of inference in the feature space, together with a pre-image (inverse) mapping to obtain solutions in $\mathcal{Y}$ for arbitrary inputs $\boldsymbol{\xi}$.
[...] manifold. To characterize the manifold in such cases, we can introduce overlapping patches, each with its own system of (non-unique) coordinates.
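Formally speaking, a smooth $k_z$-manifold is defined as a topological space $\mathcal{Y}$ that is equipped with a maximal open cover $\{U_\alpha\}_\alpha$ consisting of coordinate neighbourhoods (or patches) $U_\alpha$, together with a collection of homeomorphisms (coordinate charts) $\phi_\alpha : U_\alpha \to \phi_\alpha(U_\alpha) \subset \mathbb{R}^{k_z}$ such that the sets $\phi_\alpha(U_\alpha \cap U_\beta)$ are open in $\mathbb{R}^{k_z}$; we say that $\phi_\alpha$ and $\phi_\beta$ are compatible. Moreover, the transition maps defining a change of coordinates $\phi_\beta \circ \phi_\alpha^{-1}$ are diffeomorphisms for all $\alpha, \beta$. Let $\mathcal{A} = \{(U_\alpha, \phi_\alpha)\}_\alpha$ be an atlas on $\mathcal{Y}$ ($\{U_\alpha\}_\alpha$ is a cover and the $\{\phi_\alpha\}_\alpha$ are pairwise compatible). Two smooth curves $\gamma_0, \gamma_1 : \mathbb{R} \to \mathcal{Y}$ are called $\mathbf{y}$-equivalent at a point $\mathbf{y} \in \mathcal{Y}$ if for every $\alpha$ with $\mathbf{y} \in U_\alpha$, we have $\gamma_0(0) = \gamma_1(0) = \mathbf{y}$ and furthermore
$$\frac{d}{dt} \phi_\alpha(\gamma_0(t)) \Big|_{t=0} = \frac{d}{dt} \phi_\alpha(\gamma_1(t)) \Big|_{t=0}.$$
With this equivalence relation, the equivalence class of a smooth curve $\gamma$ with $\gamma(0) = \mathbf{y}$ is denoted $[\gamma]_{\mathbf{y}}$ and the tangent space $T_{\mathbf{y}}\mathcal{Y}$ of $\mathcal{Y}$ at $\mathbf{y}$ is the set of equivalence classes $\{[\gamma]_{\mathbf{y}} : \gamma(0) = \mathbf{y}\}$. The tangent space is a $k_z$-dimensional vector space, which is seen more clearly by identifying $T_{\mathbf{y}}\mathcal{Y}$ with the set of all derivations at $\mathbf{y}$ [linear maps from $C^\infty(\mathcal{Y})$ to $\mathbb{R}$ satisfying the derivation (Leibniz) property].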
We assume that the output space $\mathcal{Y}$ is a manifold of dimension $k_z \ll k_y$ embedded in $\mathbb{R}^{k_y}$. Representations of points in $\mathcal{Y}$ and corresponding representations in the feature or latent space $\mathcal{F} \subset \mathbb{R}^{k_z}$ can be related by some smooth and unknown function $\mathbf{f} : \mathcal{F} \to \mathcal{Y}$. Manifold learning is concerned with the reconstruction of $\mathbf{f}$ and its inverse given data points on the manifold, whereas dimensionality reduction is concerned with the representation of given points in $\mathcal{Y}$ by corresponding points in the feature space $\mathcal{F}$. Here we are interested primarily in dimensionality reduction and use Local Tangent Space Alignment (LTSA) (Zhang and Zha 2004). The tangent space at a point $\mathbf{y}$ provides a low-dimensional linear approximation of points in a neighbourhood of $\mathbf{y}$. We can approximate each point $\mathbf{y}$ in a data set using a basis for $T_{\mathbf{y}}\mathcal{Y}$ and use these approximations to find low-dimensional representations in a global coordinate system, by aligning the tangent spaces using local affine transformations (Zhang and Zha 2004). We note that this assumes the existence of a single chart (homeomorphism) $\mathbf{f}^{-1}$.
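Consider a noise-free model in which the data $\mathbf{Y}$ are generated by the smooth function $\mathbf{f}$ defined above:
$$\mathbf{y} = \mathbf{f}(\mathbf{z}),$$
where $\mathbf{z} = (z_1, \dots, z_{k_z})^T \in \mathcal{F}$ is a latent/feature vector (i.e. the low-dimensional representation of the point $\mathbf{y}$). Under the assumption that $\mathbf{f}$ is smooth, it can be approximated using a first-order Taylor expansion in a neighbourhood $\Omega(\mathbf{z})$ of a point $\mathbf{z}$:
$$\mathbf{f}(\tilde{\mathbf{z}}) = \mathbf{f}(\mathbf{z}) + \mathbf{J}_f(\mathbf{z})(\tilde{\mathbf{z}} - \mathbf{z}) + O(\|\tilde{\mathbf{z}} - \mathbf{z}\|^2), \qquad \tilde{\mathbf{z}} \in \Omega(\mathbf{z}),$$
where $\mathbf{J}_f(\mathbf{z}) \in \mathbb{R}^{k_y \times k_z}$ is the Jacobian of $\mathbf{f}$ at $\mathbf{z}$. Here and throughout, $\|\cdot\|$ denotes a standard Euclidean norm.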
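The tangent space $T_{\mathbf{y}}\mathcal{Y}$ of $\mathcal{Y}$ (a $k_z$-dimensional linear subspace of $\mathbb{R}^{k_y}$) at $\mathbf{y} = \mathbf{f}(\mathbf{z})$ is spanned by the column vectors of the Jacobian $\mathbf{J}_f(\mathbf{z})$ of $\mathbf{f}$. For a point $\tilde{\mathbf{z}}$ in a neighbourhood $\Omega(\mathbf{z})$ of $\mathbf{z}$, the vector $\tilde{\mathbf{z}} - \mathbf{z}$ then gives the coordinate of $\mathbf{f}(\tilde{\mathbf{z}})$ in the affine subspace $\mathbf{f}(\mathbf{z}) + T_{\mathbf{y}}\mathcal{Y}$. $\mathbf{J}_f$ cannot be computed explicitly without knowledge of $\mathbf{f}$. Suppose we can express $T_{\mathbf{y}}\mathcal{Y}$ in terms of a matrix $\mathbf{Q}_{\mathbf{z}}$, the columns of which form an orthonormal basis for $T_{\mathbf{y}}\mathcal{Y}$:
$$\mathbf{J}_f(\mathbf{z})(\tilde{\mathbf{z}} - \mathbf{z}) = \mathbf{Q}_{\mathbf{z}} \boldsymbol{\pi}^*_{\mathbf{z}}, \tag{10}$$
where the local coordinate $\boldsymbol{\pi}^*_{\mathbf{z}} = \mathbf{Q}_{\mathbf{z}}^T \mathbf{J}_f(\mathbf{z})(\tilde{\mathbf{z}} - \mathbf{z}) \equiv \mathbf{P}_{\mathbf{z}}(\tilde{\mathbf{z}} - \mathbf{z})$ is still unknown. Combining Eq. (10) with the Taylor expansion, we can, however, find an approximation of $\boldsymbol{\pi}^*_{\mathbf{z}}$ consisting of an orthogonal projection of $\mathbf{f}(\tilde{\mathbf{z}}) - \mathbf{f}(\mathbf{z})$ onto $T_{\mathbf{y}}\mathcal{Y}$:
$$\boldsymbol{\pi}_{\mathbf{z}} \equiv \mathbf{Q}_{\mathbf{z}}^T \big(\mathbf{f}(\tilde{\mathbf{z}}) - \mathbf{f}(\mathbf{z})\big) = \boldsymbol{\pi}^*_{\mathbf{z}} + O(\|\tilde{\mathbf{z}} - \mathbf{z}\|^2),$$
provided that the basis $\mathbf{Q}_{\mathbf{z}}$ is known for each $\mathbf{z}$. Truncating this expansion, the global coordinate $\tilde{\mathbf{z}}$ then satisfies:
$$\int_{\Omega(\mathbf{z})} \|\mathbf{P}_{\mathbf{z}}(\tilde{\mathbf{z}} - \mathbf{z}) - \boldsymbol{\pi}_{\mathbf{z}}\| \, d\tilde{\mathbf{z}} \approx 0.$$
If the Jacobian is of full column rank, $\mathbf{P}_{\mathbf{z}}$ is invertible and we can find a local affine transformation:
$$\tilde{\mathbf{z}} - \mathbf{z} \approx \mathbf{P}_{\mathbf{z}}^{-1} \boldsymbol{\pi}_{\mathbf{z}} \equiv \mathbf{L}_{\mathbf{z}} \boldsymbol{\pi}_{\mathbf{z}}.$$
The transformation $\mathbf{L}_{\mathbf{z}}$ aligns the local coordinate $\boldsymbol{\pi}_{\mathbf{z}}$ with the global coordinate $\tilde{\mathbf{z}} - \mathbf{z}$ for $\mathbf{f}(\tilde{\mathbf{z}})$.
We then find the global coordinate $\mathbf{z}$ and affine transformation $\mathbf{L}_{\mathbf{z}}$ by minimizing
$$\int_{\Omega(\mathbf{z})} \|\tilde{\mathbf{z}} - \mathbf{z} - \mathbf{L}_{\mathbf{z}} \boldsymbol{\pi}_{\mathbf{z}}\| \, d\tilde{\mathbf{z}}.$$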
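We note that the orthogonal basis $\mathbf{Q}_{\mathbf{z}}$ for each tangent space is still unknown. Consider a data set $\mathbf{y}_n$, $n = 1, \dots, N$, sampled with noise $\boldsymbol{\varepsilon}_n$, $n = 1, \dots, N$, from the underlying nonlinear manifold:
$$\mathbf{y}_n = \mathbf{f}(\mathbf{z}_n) + \boldsymbol{\varepsilon}_n, \qquad n = 1, \dots, N.$$
For any $\mathbf{y}_n$, let $\mathbf{Y}_n = [\mathbf{y}_{n_1} \dots \mathbf{y}_{n_P}]$ be the matrix containing the $P$ nearest neighbours, including $\mathbf{y}_n$, where distances are measured using the standard Euclidean metric. The best $k_z$-dimensional local affine subspace approximation for the points in $\mathbf{Y}_n$ is given by:
$$\min_{\mathbf{y}, \boldsymbol{\Pi}, \mathbf{Q}} \sum_{k=1}^{P} \|\mathbf{y}_{n_k} - (\mathbf{y} + \mathbf{Q} \boldsymbol{\pi}_k)\|^2 = \min_{\mathbf{y}, \boldsymbol{\Pi}, \mathbf{Q}} \|\mathbf{Y}_n - (\mathbf{y} \mathbf{e}^T + \mathbf{Q} \boldsymbol{\Pi})\|_F^2,$$
where the orthonormal matrix $\mathbf{Q}$ has $k_z$ columns, $\boldsymbol{\Pi} = [\boldsymbol{\pi}_1 \dots \boldsymbol{\pi}_P]$ and $\mathbf{e}$ is a vector of all ones. The optimal $\mathbf{y}$ is given by the mean of $\{\mathbf{y}_{n_k}\}_k$, denoted $\bar{\mathbf{y}}_n$, and the optimal $\mathbf{Q}$ is given by $\mathbf{Q}_n$, the columns of which are the $k_z$ left singular vectors of $\mathbf{Y}_n(\mathbf{I} - \mathbf{e}\mathbf{e}^T/P)$ corresponding to the $k_z$ largest singular values. Lastly, $\boldsymbol{\Pi}$ is given by $\boldsymbol{\Pi}_n$:
$$\boldsymbol{\Pi}_n = \mathbf{Q}_n^T \mathbf{Y}_n (\mathbf{I} - \mathbf{e}\mathbf{e}^T/P) = [\boldsymbol{\pi}^{(n)}_1 \dots \boldsymbol{\pi}^{(n)}_P],$$
where $\boldsymbol{\pi}^{(n)}_k = \mathbf{Q}_n^T(\mathbf{y}_{n_k} - \bar{\mathbf{y}}_n)$. Consequently:
$$\mathbf{y}_{n_k} = \bar{\mathbf{y}}_n + \mathbf{Q}_n \boldsymbol{\pi}^{(n)}_k + \boldsymbol{\varphi}^{(n)}_k, \tag{17}$$
where $\boldsymbol{\varphi}^{(n)}_k = (\mathbf{I} - \mathbf{Q}_n \mathbf{Q}_n^T)(\mathbf{y}_{n_k} - \bar{\mathbf{y}}_n)$ is the reconstruction error. Having minimized the local reconstruction error, we would like to find the global coordinates $\mathbf{Z} = [\mathbf{z}_1 \dots \mathbf{z}_N] \in \mathbb{R}^{k_z \times N}$, corresponding to data points $\mathbf{Y}$, given the local coordinates $\boldsymbol{\pi}^{(n)}_k$. The global coordinates $\mathbf{z}_{n_k}$ of the corresponding points $\mathbf{y}_{n_k}$ are chosen to respect the local geometry as determined by the $\boldsymbol{\pi}^{(n)}_k$:
$$\mathbf{z}_{n_k} = \bar{\mathbf{z}}_n + \mathbf{L}_n \boldsymbol{\pi}^{(n)}_k + \boldsymbol{\epsilon}^{(n)}_k, \qquad k = 1, \dots, P, \tag{18}$$
where $\bar{\mathbf{z}}_n$ is the mean of $\{\mathbf{z}_{n_k}\}_k$, $\mathbf{Z}_n = [\mathbf{z}_{n_1} \dots \mathbf{z}_{n_P}]$, and the local reconstruction error matrix $\mathbf{E}_n = [\boldsymbol{\epsilon}^{(n)}_1 \dots \boldsymbol{\epsilon}^{(n)}_P]$ is given by $\mathbf{E}_n = \mathbf{Z}_n(\mathbf{I} - \mathbf{e}\mathbf{e}^T/P) - \mathbf{L}_n \boldsymbol{\Pi}_n$. We find the latent points and local affine transformations $\mathbf{L}_n$ that minimize the local reconstruction error $\|\mathbf{E}_n\|_F$, in which $\|\cdot\|_F$ denotes a Frobenius norm. The optimal $\mathbf{L}_n$ are given by $\mathbf{L}_n = \mathbf{Z}_n(\mathbf{I} - \mathbf{e}\mathbf{e}^T/P)\boldsymbol{\Pi}_n^+$, and consequently the errors are given by $\mathbf{E}_n = \mathbf{Z}_n(\mathbf{I} - \mathbf{e}\mathbf{e}^T/P)(\mathbf{I} - \boldsymbol{\Pi}_n^+ \boldsymbol{\Pi}_n)$, where $\boldsymbol{\Pi}_n^+$ is the Moore-Penrose pseudo-inverse of $\boldsymbol{\Pi}_n$. We define a 0-1 selection matrix $\mathbf{S}_n \in \mathbb{R}^{N \times P}$ such that $\mathbf{Z}\mathbf{S}_n = \mathbf{Z}_n$. The global coordinates can then be selected according to a minimization of the overall reconstruction error:
$$\min_{\mathbf{Z}} \sum_{n=1}^{N} \|\mathbf{E}_n\|_F^2 = \min_{\mathbf{Z}} \|\mathbf{Z}\mathbf{S}\mathbf{W}\|_F^2 \quad \text{subject to } \mathbf{Z}\mathbf{Z}^T = \mathbf{I},$$
where $\mathbf{S} = [\mathbf{S}_1 \dots \mathbf{S}_N]$ and $\mathbf{W} = \mathrm{diag}(\mathbf{W}_1, \dots, \mathbf{W}_N)$ with $\mathbf{W}_n = (\mathbf{I} - \mathbf{e}\mathbf{e}^T/P)(\mathbf{I} - \boldsymbol{\Pi}_n^+ \boldsymbol{\Pi}_n)$. The constraint $\mathbf{Z}\mathbf{Z}^T = \mathbf{I}$ ensures that the solutions are unique.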
The vector $\mathbf{e}$ is an eigenvector of $\mathbf{B} \equiv \mathbf{S}\mathbf{W}\mathbf{W}^T\mathbf{S}^T \in \mathbb{R}^{N \times N}$ corresponding to a zero eigenvalue. Arranging the eigenvalues in increasing order, the optimal $\mathbf{Z}$ is given by $\mathbf{Z} = [\boldsymbol{\zeta}_2 \dots \boldsymbol{\zeta}_{k_z+1}]^T$, where $\boldsymbol{\zeta}_2, \dots, \boldsymbol{\zeta}_{k_z+1} \in \mathbb{R}^N$ are the eigenvectors of $\mathbf{B}$ corresponding to the 2nd to $(k_z + 1)$st smallest eigenvalues, i.e. the $k_z$ smallest eigenvalues excluding the first (zero) eigenvalue. This defines a map $\mathbf{f}^- : \mathbf{y} \mapsto \mathbf{z}$, $\mathbf{z} = \mathbf{f}^-(\mathbf{y})$, that approximates $\mathbf{f}^{-1} : \mathcal{Y} \to \mathcal{F}$ for the given data points:
$$\mathbf{f}^-(\mathbf{y}_n) = \mathbf{z}_{n,:}, \qquad n = 1, \dots, N,$$
where $\mathbf{z}_{n,:}$ is the $n$th column of $\mathbf{Z}$. Fixing the number of neighbours assumes that the manifold has a certain smoothness, while using the same number of neighbours for every tangent space assumes a global smoothness. These assumptions may result in inaccurate predictions, in which case we can use adaptive algorithms (Zou and Zhu 2011; Zhang et al. 2012; Wei et al. 2008). Similar adaptations can be made for other issues, such as robustness in the presence of noise (Zhan and Yin 2011).
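The LTSA steps above (local PCA on each neighbourhood, followed by a global eigenvalue problem for the alignment matrix) can be sketched as follows. This is a minimal NumPy illustration with a fixed number of neighbours and a brute-force neighbour search; the function name `ltsa` and the helix test data are assumptions made for the sketch, not part of the method as published.

```python
import numpy as np

def ltsa(Y, k_z, P):
    """Basic LTSA. Y: (N, k_y) data rows. Returns Z with shape (k_z, N)."""
    N = Y.shape[0]
    # P nearest neighbours of every point (Euclidean), including the point itself.
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    nbrs = np.argsort(d2, axis=1)[:, :P]
    centre = np.eye(P) - np.ones((P, P)) / P       # I - e e^T / P
    B = np.zeros((N, N))                           # alignment matrix S W W^T S^T
    for n in range(N):
        idx = nbrs[n]
        Yn = Y[idx].T @ centre                     # centred neighbourhood, (k_y, P)
        # The leading right singular vectors span the row space of the local
        # coordinates Pi_n, so V V^T equals the projector Pi_n^+ Pi_n.
        _, _, Vt = np.linalg.svd(Yn, full_matrices=False)
        V = Vt[:k_z].T                             # (P, k_z)
        Wn = centre @ (np.eye(P) - V @ V.T)        # W_n
        B[np.ix_(idx, idx)] += Wn @ Wn.T           # accumulate S_n W_n W_n^T S_n^T
    _, evecs = np.linalg.eigh(B)                   # eigenvalues in ascending order
    return evecs[:, 1:k_z + 1].T                   # skip the constant eigenvector

# Illustrative data: a 1-d curve (helix segment) embedded in 3-d.
t = np.linspace(0.0, 2.5, 60)
Y = np.column_stack([np.cos(t), np.sin(t), 0.5 * t])
Z = ltsa(Y, k_z=1, P=8)
```

For this curve the single recovered coordinate should vary almost linearly with the curve parameter `t`, up to sign and scale, which is the sense in which the embedding respects the local geometry.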
We remark that LTSA is a nonparametric technique, in that an explicit form of f is not available. This means that the out-of-sample problem does not have a parametric (explicit) solution. In other words, application of LTSA (the map f − ) to a point that was not in the data set can only be achieved by rerunning the entire algorithm with an updated data set that appends the new point. Nonparametric solutions to the out-of-sample problem have been developed, and one that is applicable to LTSA can be found in Li et al. (2005).
If we map points y ∈ Y to F using f − and perform inference in F , an approximation of f is required in order to make predictions in the physical space Y. This is referred to as the pre-image problem in manifold learning methods: given a point in the low-dimensional space, find a mapping to the original space (manifold). We outline an approximation of the pre-image map in the next section.

Pre-image Problem: Reconstruction of Points in the Manifold Y
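Given a point $\mathbf{z} \in \mathcal{F}$ in latent space, we require the corresponding point in the original physical space $\mathbf{y} \in \mathcal{Y}$. Let $\mathbf{z}_k$ be the neighbour nearest to $\mathbf{z}$. According to Eq. 18:
$$\mathbf{z} = \bar{\mathbf{z}}_k + \mathbf{L}_k \boldsymbol{\pi}^{(k)}_* + \boldsymbol{\epsilon}^{(k)}_*.$$
By Eq. 17 we can also define:
$$\mathbf{y} = \bar{\mathbf{y}}_k + \mathbf{Q}_k \boldsymbol{\pi}^{(k)}_* + \boldsymbol{\varphi}^{(k)}_*.$$
Consequently, we have the following approximate pre-image mapping $\hat{\mathbf{f}} : \mathcal{F} \to \mathcal{Y}$ (approximation of $\mathbf{f}$):
$$\mathbf{y} = \bar{\mathbf{y}}_k + \mathbf{Q}_k \mathbf{L}_k^{-1}(\mathbf{z} - \bar{\mathbf{z}}_k) + \mathbf{E} \approx \bar{\mathbf{y}}_k + \mathbf{Q}_k \mathbf{L}_k^{-1}(\mathbf{z} - \bar{\mathbf{z}}_k) \equiv \hat{\mathbf{f}}(\mathbf{z}), \tag{23}$$
where $k = \arg\min_n \|\mathbf{z} - \mathbf{z}_n\|$ and $\mathbf{E} = -\mathbf{Q}_k \mathbf{L}_k^{-1} \boldsymbol{\epsilon}^{(k)}_* + \boldsymbol{\varphi}^{(k)}_*$ incorporates the error terms.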
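As an illustration of the pre-image map, the sketch below rebuilds the local quantities (neighbourhood means, tangent basis and affine transformation) for the patch of the nearest latent training point and applies the locally linear reconstruction with the error term dropped. It is a simplified, hypothetical implementation: the function name `preimage`, the row-wise storage of `Z` and `Y`, and the planar test data are assumptions. On data lying exactly on a linear subspace the reconstruction should be exact, which is what the example checks.

```python
import numpy as np

def preimage(z_new, Z, Y, P):
    """Locally linear pre-image of z_new (k_z,). Z: (N, k_z) latent, Y: (N, k_y) outputs."""
    k_z = Z.shape[1]
    k = np.argmin(np.linalg.norm(Z - z_new, axis=1))        # nearest latent point z_k
    idx = np.argsort(np.linalg.norm(Y - Y[k], axis=1))[:P]  # neighbourhood of y_k
    y_bar, z_bar = Y[idx].mean(axis=0), Z[idx].mean(axis=0)
    Yc, Zc = (Y[idx] - y_bar).T, (Z[idx] - z_bar).T         # (k_y, P) and (k_z, P)
    U, _, _ = np.linalg.svd(Yc, full_matrices=False)
    Q = U[:, :k_z]                                          # tangent basis Q_k
    Pi = Q.T @ Yc                                           # local coordinates Pi_k
    L = Zc @ np.linalg.pinv(Pi)                             # local affine map L_k
    # y = y_bar + Q_k L_k^{-1} (z - z_bar), neglecting the error term E.
    return y_bar + Q @ np.linalg.solve(L, z_new - z_bar)

# Illustrative check on a 2-d plane embedded in R^5.
rng = np.random.default_rng(0)
coords = rng.uniform(size=(30, 2))
A, b = rng.standard_normal((5, 2)), rng.standard_normal(5)
M, t_off = np.array([[2.0, 1.0], [0.0, 1.0]]), np.array([0.3, -0.2])
Y = coords @ A.T + b            # points on the plane
Z = coords @ M.T + t_off        # latent coordinates (an affine image of coords)
c_new = np.array([0.4, 0.6])
y_hat = preimage(M @ c_new + t_off, Z, Y, P=6)
y_true = A @ c_new + b
```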

Gaussian Process Emulation in Feature Space
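In Sect. 2.2, the surrogate model problem was defined as one of approximating the simulator mapping $\eta : \mathcal{X} \to \mathcal{Y}$ given the data set $\mathcal{D} = \{\boldsymbol{\Xi}, \mathbf{Y}\}$ derived from runs of the simulator at selected design points $\{\boldsymbol{\xi}_n\}_{n=1}^{N}$ (the rows of $\boldsymbol{\Xi}$). We can instead consider the simulator as a mapping
$$\eta_{\mathcal{F}} = \mathbf{f}^{-1} \circ \eta : \mathcal{X} \to \mathcal{F}.$$
Application of LTSA to points on the manifold approximates this mapping with $\mathbf{z}_n = \mathbf{f}^-(\eta(\boldsymbol{\xi}_n)) \approx \eta_{\mathcal{F}}(\boldsymbol{\xi}_n)$, $n = 1, \dots, N$, and our aim is now to approximate the mapping $\eta_{\mathcal{F}}(\cdot)$. Returning a general point $\mathbf{z} = \eta_{\mathcal{F}}(\boldsymbol{\xi})$ to the corresponding point $\mathbf{y}$ in the space $\mathcal{Y}$ is discussed in the next section.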
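In this work, a GP model is used to infer the mapping $\eta_{\mathcal{F}} : \boldsymbol{\xi} \mapsto \mathbf{z}$ by treating it as a realization of a (Gaussian) stochastic process indexed by the inputs $\boldsymbol{\xi}$. Specifically, we learn each component of $\mathbf{z}$ separately (assuming independence) using a scalar GP model. Here and throughout, $\mathcal{GP}(\cdot, \cdot)$ denotes a GP, in which the first argument is the mean function and the second is the covariance (kernel) function. Let
$$z_i = h_i(\boldsymbol{\xi}) + \epsilon_i, \qquad h_i(\boldsymbol{\xi}) \sim \mathcal{GP}\big(0, c(\boldsymbol{\xi}, \boldsymbol{\xi}'; \boldsymbol{\theta}_i)\big), \qquad \epsilon_i \sim \mathcal{N}(0, \beta_i^{-1}), \qquad i = 1, \dots, k_z,$$
where $c(\boldsymbol{\xi}, \boldsymbol{\xi}'; \boldsymbol{\theta}_i)$ is the kernel function (of the same form across $i$) in which $\boldsymbol{\theta}_i$ is a vector of hyperparameters pertaining to component $i$. The latent functions $h_i(\boldsymbol{\xi})$, $i = 1, \dots, k_z$, can be thought of as independent draws from the GP. Using the notation $h_{n,i} = h_i(\boldsymbol{\xi}_n)$ we can define a matrix $\mathbf{H} \in \mathbb{R}^{N \times k_z}$ with columns $\mathbf{h}_{:,i} = (h_{1,i}, \dots, h_{N,i})^T$ ($\mathbf{z}_{:,i}$ similarly defines the vector of the $i$th features across all samples). By the independence assumption:
$$p(\mathbf{Z} \mid \mathbf{H}, \boldsymbol{\beta}) = \prod_{i=1}^{k_z} p(\mathbf{z}_{:,i} \mid \mathbf{h}_{:,i}, \beta_i),$$
where $p(\mathbf{z}_{:,i} \mid \mathbf{h}_{:,i}, \beta_i) = \mathcal{N}(\mathbf{h}_{:,i}, \beta_i^{-1}\mathbf{I})$ by virtue of the noise model, and $\boldsymbol{\beta} = (\beta_1, \dots, \beta_{k_z})^T$.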
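We place gamma priors on all hyperparameters $\boldsymbol{\theta}_i$ and signal noise precisions $\beta_i$. The parameterization of these priors is determined through an initial maximum likelihood run. We choose these parameters such that the mean is equal to the maximum likelihood value, and so that we obtain an appropriate variance. Let $\mathbf{z} \in \mathcal{F}$ be the feature vector corresponding to the test (new) input $\boldsymbol{\xi}$. The predictive distribution for the $i$th component $z_i$ of $\mathbf{z}$ ($i = 1, \dots, k_z$) is given by:
$$\begin{aligned} z_i \mid \boldsymbol{\xi}, \mathcal{D}, \boldsymbol{\theta}_i, \beta_i &\sim \mathcal{N}\big(\mu_{z_i}(\boldsymbol{\xi}), \sigma_i^2(\boldsymbol{\xi})\big), \\ \mu_{z_i}(\boldsymbol{\xi}) &= \mathbf{c}_i(\boldsymbol{\xi})^T (\mathbf{C}_i + \beta_i^{-1}\mathbf{I})^{-1} \mathbf{z}_{:,i}, \\ \sigma_i^2(\boldsymbol{\xi}) &= c(\boldsymbol{\xi}, \boldsymbol{\xi}; \boldsymbol{\theta}_i) + \beta_i^{-1} - \mathbf{c}_i(\boldsymbol{\xi})^T (\mathbf{C}_i + \beta_i^{-1}\mathbf{I})^{-1} \mathbf{c}_i(\boldsymbol{\xi}), \end{aligned} \tag{26}$$
where $\mathbf{C}_i = [c(\boldsymbol{\xi}_n, \boldsymbol{\xi}_m; \boldsymbol{\theta}_i)]_{n,m=1}^{N}$ is the kernel matrix and $\mathbf{c}_i(\boldsymbol{\xi}) = (c(\boldsymbol{\xi}_1, \boldsymbol{\xi}; \boldsymbol{\theta}_i), \dots, c(\boldsymbol{\xi}_N, \boldsymbol{\xi}; \boldsymbol{\theta}_i))^T$ is the cross-covariance between $\mathbf{z}$ and $\mathbf{z}_n$, $n = 1, \dots, N$. Thus, the latent variable GP prediction is distributed as:
$$\mathbf{z} \mid \boldsymbol{\xi}, \mathcal{D}, \boldsymbol{\Theta}, \boldsymbol{\beta} \sim \mathcal{N}\big(\boldsymbol{\mu}_z(\boldsymbol{\xi}), \boldsymbol{\Sigma}_z(\boldsymbol{\xi})\big), \tag{27}$$
where $\boldsymbol{\Theta} = \{\boldsymbol{\theta}_i\}_{i=1}^{k_z}$, the components of $\boldsymbol{\mu}_z(\boldsymbol{\xi}) \in \mathcal{F}$ are given by the second of Eqs. (26) and $\boldsymbol{\Sigma}_z(\boldsymbol{\xi}) \in \mathbb{R}^{k_z \times k_z}$ is a diagonal covariance matrix, in which the $i$th diagonal element corresponds to the predictive variance given by the third of Eqs. (26), while the off-diagonal elements are zero due to the assumption of independent GPs across $i$.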
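A single-feature GP prediction of the kind described above can be sketched with standard GP regression formulas. The sketch assumes a zero prior mean, a squared-exponential kernel and fixed (not sampled) hyperparameters; the names `sq_exp_kernel` and `gp_predict`, the default values and the sine test function are all illustrative assumptions. The predictive variance here includes the noise term, i.e. it is the variance of the noisy feature rather than of the latent function.

```python
import numpy as np

def sq_exp_kernel(A, B, signal_var, lengthscale):
    # c(xi, xi') = signal_var * exp(-|xi - xi'|^2 / (2 * lengthscale^2))
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(Xi, z, xi_new, signal_var=1.0, lengthscale=1.0, beta=1e4):
    """Predictive mean and variance of one latent feature at the rows of xi_new."""
    C = sq_exp_kernel(Xi, Xi, signal_var, lengthscale) + np.eye(len(Xi)) / beta
    c = sq_exp_kernel(Xi, xi_new, signal_var, lengthscale)    # (N, M) cross-covariance
    chol = np.linalg.cholesky(C)                              # C = chol chol^T
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, z)) # (C)^{-1} z
    v = np.linalg.solve(chol, c)
    mean = c.T @ alpha
    var = signal_var + 1.0 / beta - np.sum(v**2, axis=0)
    return mean, var

# Illustrative 1-d example: 15 design points, one feature.
Xi = np.linspace(-2.0, 2.0, 15)[:, None]
z = np.sin(Xi[:, 0])
mean_tr, var_tr = gp_predict(Xi, z, Xi)                       # at the design points
mean, var = gp_predict(Xi, z, np.array([[0.1], [5.0]]))       # near and far test inputs
```

The variance at the far test input is much larger than at the near one, reflecting the lack of training data there. In the full model, one such GP would be fitted per latent feature.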

Sampling Hyperparameter Posterior with Hybrid Monte Carlo
We explore the hyperparameter posterior distributions using a hybrid (Hamiltonian) Monte Carlo (HMC) scheme. HMC is a Metropolis method that uses gradient information. It exploits Hamiltonian dynamics to propose distant states that are accepted with high probability, and consequently limits the random walk behaviour. The Hamiltonian is defined as an energy function in terms of a position vector $\mathbf{q}(t)$ and a momentum vector $\mathbf{p}(t)$ at time $t$ (unrelated to the time component in the solver):
$$H(\mathbf{q}, \mathbf{p}) = E_U(\mathbf{q}) + E_K(\mathbf{p}),$$
where $E_U(\mathbf{q})$ is the potential energy and $E_K(\mathbf{p})$ is the kinetic energy, the sum of which is constant. The evolution of this system is then defined by the partial derivatives of the Hamiltonian:
$$\frac{d\mathbf{q}}{dt} = \frac{\partial H}{\partial \mathbf{p}}, \qquad \frac{d\mathbf{p}}{dt} = -\frac{\partial H}{\partial \mathbf{q}}.$$
We define the potential energy as the negative log of the unnormalized posterior (it is determined only up to an additive constant, which can be chosen for convenience): $E_U(\mathbf{q}(t)) = -\log(\mathrm{likelihood}(\mathbf{q}(t))) - \log(\mathrm{prior}(\mathbf{q}(t)))$. Furthermore, following convention we define the kinetic energy as:
$$E_K(\mathbf{p}) = \tfrac{1}{2}\, \mathbf{p}^T \mathbf{M}_K^{-1} \mathbf{p},$$
where $\mathbf{M}_K$ is a symmetric, positive definite mass matrix, chosen to be a scalar multiple of the identity matrix. With this choice, the kinetic energy is (up to an additive constant) the negative log probability density of a zero-mean multivariate Gaussian distribution with covariance $\mathbf{M}_K$ and matches the classical definition of kinetic energy.
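To make the scheme concrete, the following is a minimal HMC sketch using the standard leapfrog integrator and a Metropolis accept/reject step, with a scalar mass (mass matrix equal to `mass` times the identity). The step size, trajectory length and the standard-normal target are illustrative assumptions; in the setting above, `U` would be the negative log likelihood plus negative log prior of the hyperparameters and precisions, which is not reproduced here.

```python
import numpy as np

def leapfrog(q, p, grad_U, step, n_steps, mass=1.0):
    """Integrate Hamilton's equations with the leapfrog scheme."""
    p = p - 0.5 * step * grad_U(q)           # initial half step in momentum
    for i in range(n_steps):
        q = q + step * p / mass              # full step in position
        if i < n_steps - 1:
            p = p - step * grad_U(q)         # full step in momentum
    p = p - 0.5 * step * grad_U(q)           # final half step in momentum
    return q, p

def hmc(U, grad_U, q0, n_samples, step=0.1, n_steps=20, mass=1.0, seed=0):
    rng = np.random.default_rng(seed)
    q = np.asarray(q0, dtype=float)
    samples = []
    for _ in range(n_samples):
        p = rng.standard_normal(q.shape) * np.sqrt(mass)   # p ~ N(0, mass * I)
        H0 = U(q) + 0.5 * p @ p / mass
        q_new, p_new = leapfrog(q, p, grad_U, step, n_steps, mass)
        H1 = U(q_new) + 0.5 * p_new @ p_new / mass
        if rng.random() < np.exp(min(0.0, H0 - H1)):       # Metropolis acceptance
            q = q_new
        samples.append(q.copy())
    return np.array(samples)

# Illustrative target: standard normal in 2-d, so U(q) = |q|^2 / 2.
U = lambda q: 0.5 * q @ q
grad_U = lambda q: q
q1, p1 = leapfrog(np.array([1.0, 0.0]), np.array([0.0, 1.0]), grad_U, 0.01, 100)
H1 = U(q1) + 0.5 * p1 @ p1                   # initial energy is exactly 1.0
samples = hmc(U, grad_U, np.zeros(2), n_samples=2000)
```

The leapfrog integrator conserves the Hamiltonian only approximately, which is why the accept/reject step is retained; with a small step size nearly all proposals are accepted.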

Predictions
The physical models we consider have an unknown, stochastic input (e.g. the hydraulic conductivity). This represents a lack of knowledge of the input, which induces a random variable response (e.g. the pressure head). Quantifying the distribution over the response is referred to as a pushforward or forward problem. The pushforward measure is the distribution over the response, or quantity of interest derived from the response. Based on the methods of the preceding sections, we now present an emulation framework for interrogating the pushforward measure (the response distribution). We begin by describing in the next section how a single realization of the random variable response may be obtained given a single realization of the stochastic input. In Sect. 5.2, we then discuss how to quantify the pushforward measure (extract relevant statistics of the response).

Outputs Conditioned on Inputs
Due to the nature of the emulator, the prediction of a point $\mathbf{z} \in \mathcal{F}$ is normally distributed. This distribution captures uncertainty in the predictions as a consequence of limited and noise-corrupted data. A common challenge when using reduced dimensional representations is analytically propagating this distribution through a nonlinear pre-image map [in this case $\hat{\mathbf{f}} : \mathcal{F} \to \mathcal{Y}$, $\mathbf{z} \mapsto \mathbf{y}$, defined by Eq. (23)] for a test input $\boldsymbol{\xi}$. Analytically propagating a distribution through a nonlinear mapping is often not feasible. Instead we could repeatedly sample from the feature space response distribution (over $\mathbf{z} \in \mathcal{F}$) and apply the pre-image map to find the distribution over the corresponding $\mathbf{y} \in \mathcal{Y}$. Examples that use this latter approach include kernel principal component analysis and Gaussian process latent variable models (in the latter case, approximations can be obtained using the projected process approximation). Since the manifold consists of aligned (tangent) hyperplanes, however, we are able to derive locally linear pre-image maps that can be used for mapping distributions defined on local tangent spaces. The latent variable GP prediction $\mathbf{z}$ is distributed according to Eq. (27). Restricting to a single tangent space, it is a straightforward task to push this distribution through Eq. (23) to obtain a normal distribution for the corresponding $\mathbf{y} \in \mathcal{Y}$:
$$\mathbf{y} \sim \mathcal{N}\big(\boldsymbol{\mu}_y(\boldsymbol{\xi}), \boldsymbol{\Sigma}_y(\boldsymbol{\xi})\big), \qquad \boldsymbol{\mu}_y(\boldsymbol{\xi}) = \bar{\mathbf{y}}_k + \mathbf{Q}_k \mathbf{L}_k^{-1}\big(\boldsymbol{\mu}_z(\boldsymbol{\xi}) - \bar{\mathbf{z}}_k\big), \qquad \boldsymbol{\Sigma}_y(\boldsymbol{\xi}) = \mathbf{Q}_k \mathbf{L}_k^{-1} \boldsymbol{\Sigma}_z(\boldsymbol{\xi}) \mathbf{L}_k^{-T} \mathbf{Q}_k^T,$$
where $k = \arg\min_n \|\boldsymbol{\mu}_z(\boldsymbol{\xi}) - \mathbf{z}_n\|$, $\boldsymbol{\mu}_y(\boldsymbol{\xi}) \in \mathbb{R}^{k_y}$, and $\boldsymbol{\Sigma}_y(\boldsymbol{\xi}) \in \mathbb{R}^{k_y \times k_y}$. This result is particularly useful for scenarios in which knowledge of the correlations between response features is required. Without this result we would require a large number of samples (tens of thousands) to estimate the covariance. If, however, we are only interested in samples of the distribution (27), i.e. making predictions at specified inputs, then it is more memory efficient to sample from the predictive distribution (27) and use the pre-image map (23).
When the support of this distribution is large, the accuracy of the local approximation breaks down and we must first sample the latent features before applying the pre-image map.
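The locally linear push-through can be illustrated with a minimal sketch. It assumes that, within one tangent space, the pre-image map reduces to an affine map y = A z + b; the matrix `A`, the offset `b` and all dimensions below are hypothetical stand-ins for the quantities in Eq. (23), not the values used in this paper. The sketch compares the closed-form mean and covariance with estimates obtained by sampling z and mapping each draw.

```python
import numpy as np

def push_gaussian_through_affine(mu_z, cov_z, A, b):
    """Mean and covariance of y = A z + b when z ~ N(mu_z, cov_z)."""
    mu_y = A @ mu_z + b
    cov_y = A @ cov_z @ A.T
    return mu_y, cov_y

rng = np.random.default_rng(0)
k_z, k_y = 3, 6                        # illustrative sizes only
A = rng.standard_normal((k_y, k_z))    # hypothetical local tangent basis
b = rng.standard_normal(k_y)           # hypothetical offset (e.g. a neighbourhood mean)
mu_z = rng.standard_normal(k_z)
cov_z = np.diag([0.5, 0.2, 0.1])       # independent latent features give a diagonal covariance

# Closed-form route: no sampling needed for the full k_y x k_y covariance.
mu_y, cov_y = push_gaussian_through_affine(mu_z, cov_z, A, b)

# Sampling route: draw z, map each draw, then estimate the moments empirically.
z_samples = rng.multivariate_normal(mu_z, cov_z, size=20000)
y_samples = z_samples @ A.T + b
mu_y_mc = y_samples.mean(axis=0)
cov_y_mc = np.cov(y_samples, rowvar=False)
```

The closed-form route delivers the full covariance without any draws, whereas the sampling route only needs to store the draws themselves, which is the trade-off described above.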

Marginalizing the Stochastic Input
Having obtained a distribution over the response for a stochastic input realization, we now consider the problem of obtaining a distribution over the response marginalized over the stochastic input. We assume that the input is normally distributed, $\xi \sim \mathcal{N}(\boldsymbol{\mu}_\xi, \Sigma_\xi)$, for some mean vector $\boldsymbol{\mu}_\xi$ (equal to 0 in this case) and covariance matrix $\Sigma_\xi$ (equal to I in this case). We wish to evaluate the marginalized distribution (32), in which $\hat{f}$ is the (measurable) pre-image map and $p(z \mid \xi, D, \Theta, \boldsymbol{\beta})$ is defined in Eq. (27). Since the input ξ appears nonlinearly in the inverse of the z predictive distribution covariance $\sigma^2(\xi)$, we are unable to find a closed form solution to the integral in (32), i.e. the marginal distribution over z. The moments of this marginal distribution can, on the other hand, be found analytically, although we will not know the family of distributions to which these moments belong. Let us focus on the ith feature of z. We wish to find the first two moments, i.e. the mean and variance, of the marginal distribution $p(z_i \mid D, \theta_i, \beta_i)$. We can then push these moments through the pre-image map to obtain analytical solutions. This can be repeated for each i by virtue of the independence assumption. To begin, $p(z_i \mid D, \theta_i, \beta_i)$ is approximated as a Gaussian with mean m and variance v (Girard and Murray-Smith 2003), which are derived in "Appendix A" in terms of $E_\xi[\cdot]$ and $\mathrm{Var}_\xi(\cdot)$, the expectation and variance with respect to ξ, respectively. Calculation of these moments involves expectations of the kernel with respect to the stochastic input distribution on the unknown and unobserved test inputs. The analytic tractability of these integrals is dependent upon the choice of kernel and stochastic input distribution. One example of a kernel is the commonly used squared exponential, for which the integrals are derived in "Appendix B". Once calculated, the mean can be pushed through the local pre-image mapping (23).
Since we expect that the variance, on the other hand, will span more than one tangent space, predictions of the variance using this method may be inaccurate.
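For orientation, the two moments of the Gaussian approximation follow from the laws of total expectation and total variance. The display below is a sketch in assumed notation: $\mu_i(\xi)$ and $\sigma_i^2(\xi)$ denote the GP predictive mean and variance of the ith feature at input ξ, and the subscripted symbols are introduced here for illustration. "Appendix A" remains the reference for the exact expressions.

```latex
\begin{align}
m &= \mathbb{E}_{\xi}\!\left[\mu_i(\xi)\right], \\
v &= \mathbb{E}_{\xi}\!\left[\sigma_i^2(\xi)\right]
     + \operatorname{Var}_{\xi}\!\left(\mu_i(\xi)\right)
   = \mathbb{E}_{\xi}\!\left[\sigma_i^2(\xi)\right]
     + \mathbb{E}_{\xi}\!\left[\mu_i(\xi)^2\right] - m^2 .
\end{align}
```

The first term of v is the average predictive uncertainty of the emulator, while the second is the spread of the predictive mean induced by the stochastic input.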
Since we cannot sample from the approximate marginal of the analytical approach, further analysis requires MC to fully characterize the distribution (32). Again it is sufficient to demonstrate the procedure for a single latent (feature space) dimension i. Using MC we obtain a marginalized predictive distribution expressed as an equally weighted mixture of normal distributions, which is itself non-Gaussian [Eq. (36)]: $p(z_i \mid D, \theta_i, \beta_i) \approx \frac{1}{Q} \sum_{q=1}^{Q} p(z_i \mid \xi^{(q)}, D, \theta_i, \beta_i)$, where $\xi^{(q)} \sim \mathcal{N}(\boldsymbol{\mu}_\xi, \Sigma_\xi)$, $\theta_i$ and $\beta_i$ are samples from the hyperparameter and signal noise posteriors (for the ith feature), and the approximation converges as Q → ∞. Each sampled latent variable can then be pushed through the pre-image map. Latent variables found in this way are draws from the marginalized distribution $p(z_{\cdot,i} \mid D, \theta_i, \beta_i)$ and we obtain multiple marginalized distributions [one for each $(\theta_i, \beta_i)$] from which we can estimate the statistics of the response. Algorithm 1 describes the procedure. Note that we use a * superscript in order to avoid confusion between MC samples and training points. Each $Y^*_i$ in Algorithm 1 can be interrogated to find any property of the pushforward measure (mean, standard deviation and higher-order moments). We can use kernel density estimation (KDE, also known as the Parzen-Rosenblatt window method) (Simonoff 1996) to approximate the pdf given a finite number of samples, or find the moments of the density. We use a Gaussian kernel function with a suitably small bandwidth.
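The MC procedure and the KDE step can be sketched for a single latent feature and a single hyperparameter sample; Algorithm 1 would repeat this over the posterior samples. The functions `gp_predict` and `pre_image` are toy stand-ins for the GP predictive distribution (27) and the pre-image map (23), and the bandwidth and grid are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def gp_predict(xi):
    """Toy stand-in for the GP predictive mean and variance of one latent feature."""
    mean = np.sin(xi.sum(axis=-1))
    var = 0.05 + 0.01 * (xi ** 2).sum(axis=-1)
    return mean, var

def pre_image(z):
    """Toy stand-in for the pre-image map, returning one scalar response feature."""
    return 30.0 + 2.0 * z

def gaussian_kde_pdf(samples, grid, bandwidth):
    """Parzen-Rosenblatt estimate with a Gaussian kernel, evaluated on `grid`."""
    u = (grid[:, None] - samples[None, :]) / bandwidth
    norm = samples.size * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * u ** 2).sum(axis=1) / norm

Q, k_xi = 5000, 5
xi = rng.standard_normal((Q, k_xi))                      # xi^(q) ~ N(0, I)
mean, var = gp_predict(xi)
z_star = mean + np.sqrt(var) * rng.standard_normal(Q)    # one latent draw per input sample
y_star = pre_image(z_star)                               # draws from the marginalized response

grid = np.linspace(y_star.min() - 1.0, y_star.max() + 1.0, 400)
pdf = gaussian_kde_pdf(y_star, grid, bandwidth=0.1)
```

Moments such as `y_star.mean()` and `y_star.std()` can be read directly from the draws, while `pdf` gives the smoothed density on the chosen grid.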

Results and Discussion
We now assess the performance of the proposed method on two example partial differential equation problems: a Darcy flow problem with a contaminant mass balance, modelling steady-state groundwater flow in a 2-d porous medium; and Richards equation, modelling single-phase flow through a 3-d porous medium. As explained in Sect. 5, the analysis includes: (i) predictions that are conditioned on an input; and (ii) predictions that are marginalized over the stochastic input.
When making conditioned predictions, we use the conditional predictive distribution (30) for y, or the distribution (27) for z in conjunction with the pre-image map (23). As explained in Sect. 4.1, we place a prior over the hyperparameters and signal variances $\boldsymbol{\beta}$ and use an HMC scheme to sample from the posterior distributions over these parameters. Each sample can be used to obtain a different normal predictive distribution, conditioned on an input. We are therefore able to see how the predictive mean and variance change with respect to the uncertainty in the GP parameters. In the results, we plot the expectation and standard deviation of the first two predictive distribution moments.
For the forward UQ problem we marginalize the conditional predictive distributions over a stochastic input (Eq. 32) to obtain the pushforward measure (non-analytically). We are able to analytically find the mean using (A2) and (A3) together with the pre-image map, or, using Algorithm 1, sample from the marginalized distribution via MC (Eq. 36).
The accuracies of both the point predictions and the predictions of the pushforward measure are assessed by comparison with the true values obtained with the simulator (on the test inputs $\{\xi^*_q\}_{q=1}^{Q}$). We run the solver for each test input to generate the true response, denoted $\tilde{y}^*_q$. For the UQ comparison we again approximate the pdf using KDE (or simply extract the moments) based on $\{\tilde{y}^*_q\}_{q=1}^{Q}$. The latter approximation is guaranteed to converge to the truth as the number of test inputs increases.

Darcy Flow: Non-point Source Pollution
The first example is a linear model of steady-state groundwater flow in 2-d. The approach was developed by Kourakos et al. (2012) and implemented in the mSim package. The model comprises Darcy's law and a contaminant mass balance in a 2-d polygonal domain of total area 18.652 km² containing wells and a stream, and subdivided into polygonal regions of different land use (Fig. 5). Full details of the model and the numerical method can be found in Kourakos et al. (2012). Below we provide a brief description. In the model equations, K(x) is the hydraulic conductivity, h(x) is the pressure head, C(x, t) is the contaminant concentration, R is the retardation factor, D is the dispersion tensor, v is the fluid velocity, and Q and G represent sources/sinks. The contaminant transport equation is replaced by a 1-d approximation and is solved through an ensemble of one-dimensional streamline-based solutions (Kourakos et al. 2012). The contaminant balance and flow (Darcy) equations are decoupled. The latter is solved using the finite element method based on triangular elements and first-order (linear) shape functions. The boundary conditions are given by: (i) a constant head equal to 30 m on the left boundary; (ii) a general head boundary equal to 40 m with conductance equal to 160 m³ day⁻¹ on the right boundary; and (iii) no flow on the top and bottom boundaries. Each land use polygon is assigned its own recharge rate. Stream rates are assigned directly to nodes. (Any node closer than 10 m to the stream is considered to be part of the stream.) We assume that K(x) is log-normally distributed and treat it as an input. The output field upon which we focus is the pressure head, that is, u(x; K) = h(x) in the notation of Sect. 2. We use the input model described in Sect. 2, defining a discretized random field corresponding to realizations of K(x) = exp(Z(x)) at the nodes $\{x_k\}_{k=1}^{k_y} \subset R$ on the finite element mesh.
The covariance function for the random field Z(x) is of separable form, with correlation lengths $l_1$ and $l_2$. This separable form was suggested in Zhang and Lu (2004) and is used extensively in the literature to model hydraulic permeability fields (often by setting the correlation lengths equal). The number of retained KL terms $k_\xi$ was chosen to capture 98% of the generalized variance. Both the training and test input samples were drawn independently: $\xi_n \sim \mathcal{N}(0, I)$, n = 1, ..., N, to yield $\{y_n\}_{n=1}^{N}$ for training; and $\xi_q \sim \mathcal{N}(0, I)$, q = 1, ..., Q, to yield $\{\tilde{y}^*_q\}_{q=1}^{Q}$ for testing and the forward problem (UQ). We set Q = 5000 and N ∈ {25, 50, 75, 100}. Running the solver with an input generated using the KL truncation necessarily leads to a response surface with intrinsic dimension at most $k_\xi$, which was therefore the value chosen for the approximating manifold dimension $k_z$. In all of the results presented below, $k_y = 1933$ nodes (elements) were used in the simulation. The number of neighbours P in the LTSA algorithm was chosen according to the error between the solver response and the predictive mean at the test points. We define a scaled measure of error $e_q$ on each test point as a relative norm of the difference between $\tilde{y}^*_q$, the response predicted by the solver, and $y^*_q$, the point recovered by application of the pre-image map (23) on the GP predictive mean (26). The scaling ensures that the errors are comparable and can be interpreted as percentage errors.
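A truncated KL expansion of this kind can be sketched as follows. The sketch assumes a separable exponential covariance, c(x, x') = σ_Z² exp(−|x_1 − x_1'|/l_1 − |x_2 − x_2'|/l_2), on a small rectangular grid, and interprets the variance criterion as a fraction of the sum of the eigenvalues. The grid, the covariance form, the parameter values and this interpretation are all assumptions made for illustration, so the number of terms returned should not be expected to match the values of $k_\xi$ reported in this paper.

```python
import numpy as np

def kl_terms_for_variance(cov, fraction=0.98):
    """Smallest number of leading eigenpairs whose eigenvalues sum to `fraction` of the total."""
    eigvals = np.linalg.eigvalsh(cov)[::-1]        # descending order
    eigvals = np.clip(eigvals, 0.0, None)          # guard against round-off negatives
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cumulative, fraction) + 1)

def sample_log_normal_field(cov, mean, k_xi, rng):
    """Draw K = exp(Z), with Z given by a KL expansion truncated to k_xi terms."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:k_xi]
    lam = np.clip(eigvals[order], 0.0, None)
    xi = rng.standard_normal(k_xi)                 # xi ~ N(0, I)
    z = mean + eigvecs[:, order] @ (np.sqrt(lam) * xi)
    return np.exp(z), xi

# Illustrative rectangular grid (not the polygonal finite element mesh).
g1 = np.linspace(0.0, 6000.0, 12)
g2 = np.linspace(0.0, 3500.0, 8)
x1, x2 = np.meshgrid(g1, g2, indexing="ij")
pts = np.column_stack([x1.ravel(), x2.ravel()])

l1, l2, sigma2, m_z = 2000.0, 1000.0, 0.2, np.log(40.0)   # illustrative values
d1 = np.abs(pts[:, None, 0] - pts[None, :, 0])
d2 = np.abs(pts[:, None, 1] - pts[None, :, 1])
cov = sigma2 * np.exp(-d1 / l1 - d2 / l2)                  # assumed separable exponential form

k_xi = kl_terms_for_variance(cov, 0.98)
K, xi = sample_log_normal_field(cov, m_z, k_xi, np.random.default_rng(2))
```

Each call to `sample_log_normal_field` maps one standard normal vector ξ to one positive conductivity field, which is the input format assumed by the emulator.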
We present results for three stochastic input models:

M1 We set $m_Z = \ln(40)$ and $\sigma_Z^2 = 0.2$, yielding a mean for K(x) of 44.2 m day⁻¹, which is close to the default value in the mSim package, and a standard deviation of 13.63 m day⁻¹. The correlation lengths were chosen as $l_1 = 2000$ m and $l_2 = 1000$ m, which correspond to dimensionless values of 1/3 and 2/7, respectively. These choices require $k_\xi = 5$ input dimensions to capture 98% of the generalized variance.
M2 We set $m_Z = \ln(36.18)$ and $\sigma_Z^2 = 0.4$, again yielding a mean of 44.2 m day⁻¹ and a standard deviation of 18.80 m day⁻¹. We set $l_1 = 2000$ m and $l_2 = 1000$ m. $k_\xi = 5$ captures 98% of the generalized variance.
M3 We set $m_Z = \ln(40)$ and $\sigma_Z^2 = 0.4$ and reduce the correlation lengths to $l_1 = 1000$ m and $l_2 = 500$ m (dimensionless values of 1/6 and 1/7, respectively). We now require $k_\xi = 15$ to capture 98% of the generalized variance.
[Figure: boxplots of the log error log($e_q$) against the neighbour number P]
The relationship between the errors and P is more complicated. The errors are high for P < 8 (not shown in the boxplots) at all values of N and decrease as P increases. This is due to the linear approximation of points in local tangent spaces via PCA in the LTSA algorithm. As more points are added, the approximation improves. As P is increased beyond a certain value, however, the errors increase (this is most clearly visible for N = 100). The reason for this behaviour is that for large enough neighbourhood sizes the linear approximation breaks down. Thus, there is an optimal choice of P for each value of N, and the higher the value of N the more sensitive the errors are to the value of P. In the subsequent results we use P = 15 unless otherwise specified.
In Fig. 2 we plot the normalized pressure head prediction (for each coordinate of the predicted pressure head we subtract the mean and divide by the standard deviation) corresponding to the highest $e_q$ for both N = 25 and N = 50 (using P = 15). The normalization highlights the differences between the true values and the predictions (the errors) more clearly. The predicted means of the means (middle row) are the mean predictions averaged over all hyperparameter and precision samples. Also shown (bottom row) are the standard deviations of the predictions averaged over all hyperparameter and precision samples. We observe that the prediction at N = 75 is highly accurate, while the prediction at N = 25 is still reasonably accurate even in this worst case (an outlier in Fig. 1).
We now focus on the forward problem, in which we estimate the marginalized predictive distribution (32) using Algorithm 1. KDE is used to obtain estimates of the pdf of a feature for different predictive posterior, hyperparameter and precision samples, as previously described. The feature we choose is the pressure head at the spatial location x = (2511, 486) ∈ R. We plot a heat map of the pdfs in Fig. 4 for different N. The distributions are accurately estimated for all values of N. While the predictions improve as the number of training samples N increases, the true value does not always lie within the contours. This is because: (i) as stated earlier, an increased GP predictive variance acts to smooth the density, rather than increase the width of the contours; (ii) by choosing a priori the number of neighbours we also a priori assume a global smoothness of the emulator; and (iii) we have a pre-image map $\hat{f} : F \to Y$ for which the error is unknown (as with all methods), but not estimated (as with probabilistic methods).
We can find the means and standard deviations across the samples obtained for different predictive posterior, hyperparameter and precision samples using Algorithm 1. We obtain distributions over the moments of the marginalized predictive distribution (32). In Fig. 5 we plot the mean and standard deviation of the marginalized predictive mean and standard deviation for N = 25, with comparisons to the true values obtained by finding the mean and standard deviation across the test responses $\{\tilde{y}^*_q\}_{q=1}^{Q}$. Even for this low number of training points the results are highly accurate.
For Model M2, the distributions of $\{e_q\}_{q=1}^{Q}$ for increasing N and P are shown in Fig. 6. We observe trends similar to those observed using Model M1, although the increased variance leads to larger errors at fixed N and P (higher maxima and minima). With the exception of an isolated outlier (shown later), the predictions are nevertheless accurate for N = 75.
The worst case (highest $e_q$) for P = 15 is shown in Fig. 7 for N = 25 and 75 points (see Fig. 6). As before, the top row is the test (solver prediction), while the middle and bottom rows are the mean prediction and standard deviation of the prediction averaged over all hyperparameter and precision samples. The true values lie within the credible regions, although for this model a higher number of training points is required to ensure that even the worst-case predictions are accurate. Figure 8 demonstrates the quality of the predicted responses when the errors are at the median in the P = 15 boxplots in Fig. 6. Here, even N = 25 provides accurate results. Figure 9 shows heat maps of the pdfs of the pressure head at the spatial location x = (2511, 486) for different N (generated using KDE) in the case of Model M2. Using N = 75 we achieve very good agreement with the MC prediction based on the simulator results (test points), although again the true value does not lie within the contours. For N = 25, we plot the mean and standard deviation of the marginalized predictive mean and standard deviation in Fig. 10, with a comparison to the true values obtained from $\{\tilde{y}^*_q\}_{q=1}^{Q}$. The predictions are highly accurate. In fact, even for N = 25 (not shown to conserve space) the mean was very accurate and the standard deviation exhibited only slight differences from the true value.
For Model M3 (decreased correlation lengths, high standard deviation and $k_\xi = 15$), the distributions of $\{e_q\}_{q=1}^{Q}$ for increasing N and P are shown in Fig. 11. In this case it is clear that a much higher value of P (P > 60, giving a similar neighbourhood radius in line with the increased sample density) is required to obtain a reasonable accuracy. For N = 500 and P = 80, there is a small number (9 out of 5000) of outliers with low accuracy, while the errors for the remaining points satisfy $\ln(e_q) < -3.25$. The worst cases (highest $e_q$) for P = 70, N = 300 and P = 80, N = 500 are shown in Fig. 12, and in Fig. 13 we show predicted responses with errors at the medians for the same values of P and N. There are noticeable differences in the worst cases, although the qualitative agreement is very good at both values of N. For the median error cases both emulators perform extremely well.

Richards Equation: Unsaturated Flow in Porous Media
Consider a single-phase flow through a 3-d porous region $R \subset \mathbb{R}^3$ containing unsaturated soil with a random permeability field. The vertical flow problem can be solved using Richards equation (Darcy's law combined with a mass balance). There are three standard forms of Richards equation: the pressure head-based (h-based) form; the water content-based (θ-based) form; and the mixed form (Huang et al. 1996; Shahraiyni and Ataie-Ashtiani 2011).
The h-based form with an implicit or explicit finite difference (FD) scheme has been shown to provide good accuracy, although this approach may result in high mass balance errors (Huang et al. 1996). The mixed-based form, on the other hand, exhibits low mass balance errors with highly accurate predictions using a fully implicit FD scheme (Ray and Mohanty 1992; Zarba et al. 1990; Celia et al. 1987). The latest work of Shahraiyni and Ataie-Ashtiani (2011) showed that a fully implicit FD scheme with a standard chord slope (CSC) approximation (Rathfelder and Abriola 1994) not only solved the mass balance problem of the h-based form but also improved convergence. Thus, in this paper we adopt this approach, although other numerical formulations are by no means precluded. In the h-based form of Richards equation, Eq. (40), h is the pressure head, u(h) = ∂θ/∂h is the specific moisture capacity, in which θ is the moisture content, K(h) is the unsaturated hydraulic conductivity, and $x = (x_1, x_2, x_3)^T$ is the spatial coordinate, in which $x_3$ is the vertical coordinate. The nonlinear functions θ(h) and K(h) can take on different forms. For example, in Haverkamp et al. (1977), a least squares fit to experimental data was used to derive the relationships (41): $\theta(h) = \frac{\alpha_1(\theta_s - \theta_r)}{\alpha_1 + |h|^{\alpha_2}} + \theta_r$ and $K(h) = K_s(x)\,\frac{\alpha_3}{\alpha_3 + |h|^{\alpha_4}}$, where $\theta_r$ and $\theta_s$ are the residual and saturated water contents, $K_s(x)$ is the saturated hydraulic conductivity, and $\alpha_1$, $\alpha_2$, $\alpha_3$ and $\alpha_4$ are fitting parameters. We adopt the relationships (41) and use the parameter values in Haverkamp et al. (1977): $\alpha_1 = 1.611 \times 10^6$, $\alpha_2 = 3.96$, $\alpha_3 = 1.175 \times 10^6$, $\alpha_4 = 4.74$, $\theta_s = 0.287$ and $\theta_r = 0.075$. The domain R is taken to be 20 cm × 20 cm × 20 cm. $K_s(x)$ is treated as a random field input with a log-normal distribution ($K_s(x) = \exp(Z(x))$), again discretized using the Karhunen-Loève theorem. We generate realizations of a corresponding discrete random field on an $n_1 \times n_2 \times n_3$ finite difference grid ($k_y = n_1 n_2 n_3$), with grid spacings $\Delta x_1$, $\Delta x_2$ and $\Delta x_3$ in the directions $x_1$, $x_2$ and $x_3$, respectively. The output field of interest is again the pressure head, at a fixed time T. Thus, we set u(x; K) = h(x, T).
Fig. 11 Log normalized error between true and predictive mean in the normalized pressure head prediction from an emulator trained on 300 and 500 points $y_n$ and interrogated with Q = 5000 test points $\tilde{y}^*_q$ for different nearest neighbour numbers P (Model M3). Predictions were obtained by averaging over hyperparameter and precision posterior samples. a 300 training points. b 500 training points
The boundary conditions are those used in Haverkamp et al. (1977), corresponding to laboratory experiments of infiltration in a plexiglass column packed with sand. Along the top boundary (surface) x 3 = 20 cm, the pressure head is maintained at h = −20.7 cm (θ = 0.267 cm 3 cm −3 ), and along the bottom boundary x 3 = 0 cm, it is maintained at h = − 61.5 cm. At all other boundaries a no-flow condition is imposed: ∇h · n = 0, where n is the unit, outwardly pointing normal to the surface. The initial condition is h(x, 0) = − 61.5 cm.
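The constitutive relations can be checked against the boundary data with a short sketch. It assumes the commonly quoted Haverkamp et al. (1977) forms, θ(h) = α_1(θ_s − θ_r)/(α_1 + |h|^{α_2}) + θ_r and K(h) = K_s α_3/(α_3 + |h|^{α_4}), with h in cm and the parameter values listed in this section. The function names and the finite-difference step used for the specific moisture capacity are illustrative choices.

```python
import math

ALPHA_1, ALPHA_2 = 1.611e6, 3.96
ALPHA_3, ALPHA_4 = 1.175e6, 4.74
THETA_S, THETA_R = 0.287, 0.075

def water_content(h):
    """Moisture content theta(h) for a pressure head h in cm."""
    return ALPHA_1 * (THETA_S - THETA_R) / (ALPHA_1 + abs(h) ** ALPHA_2) + THETA_R

def conductivity(h, k_s):
    """Unsaturated hydraulic conductivity K(h) for saturated conductivity k_s."""
    return k_s * ALPHA_3 / (ALPHA_3 + abs(h) ** ALPHA_4)

def moisture_capacity(h, dh=1e-4):
    """Specific moisture capacity u(h) = d(theta)/dh, by central differences."""
    return (water_content(h + dh) - water_content(h - dh)) / (2.0 * dh)

theta_top = water_content(-20.7)      # close to the 0.267 quoted for the top boundary
theta_bottom = water_content(-61.5)   # drier state at the bottom boundary and initially
k_top = conductivity(-20.7, k_s=0.0094)
k_bottom = conductivity(-61.5, k_s=0.0094)
```

The strong drop in conductivity between the two boundary heads is what makes the problem nonlinear and motivates the iterative solution at each time step.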
The covariance function for the random field Z(x) is again of the same separable form as in the first example, where the $l_i$ are correlation lengths, chosen as $l_1 = l_2 = l_3 = 7.5$ cm. The mean $m_Z$ and variance $\sigma_Z^2$ are chosen such that the mean and standard deviation of $K_s(x)$ are 0.0094 cm s⁻¹ (Shahraiyni and Ataie-Ashtiani 2011; Haverkamp et al. 1977) and 0.00235 cm s⁻¹ (25% of the mean), respectively. The number of retained KL terms $k_\xi$ was again chosen from a condition on the generalized variance. The training and test input samples were drawn independently: $\xi_n \sim \mathcal{N}(0, I)$ and $\xi_q \sim \mathcal{N}(0, I)$ to yield $\{y_n\}_{n=1}^{N}$ for training and $\{\tilde{y}^*_q\}_{q=1}^{Q}$ for testing and UQ. We set Q = 5000 and N ≤ 800. As before, the manifold dimension was set to $k_z = k_\xi$. The number of neighbours P and the number of training points N were chosen as in the first example by examining the errors $e_q = \|\tilde{y}^*_q - y^*_q\| / \|\tilde{y}^*_q\|$ on the test set, where again $\tilde{y}^*_q$ is the solver output (truth) and $y^*_q$ is the emulator prediction based on the GP predictive mean (26). Equation (40) was solved using a finite difference scheme with first-order differencing for the first-order derivatives, central differencing for the second-order derivatives and a fully implicit backward Euler time stepping scheme. A Picard iteration scheme is used at each time step. Details are provided in "Appendix C". We followed the procedure of the first example. Training point numbers below 600 led to inaccurate results. For N = 600, the results were reasonably accurate but to achieve good accuracy we required N > 700. We present the results for N = 800. The pressure head is normalized as in the first example in order to highlight the errors in the predictions more clearly. In Fig. 16a we plot the log normalized error $\ln(e_q)$ for an emulator trained on N = 800 points $y_n$ and tested with Q = 5000 points $\tilde{y}^*_q$ for different nearest neighbour numbers P > 20 (averaging over hyperparameter and precision posterior samples). For P ≤ 20 the errors were high, with the same trend as seen in the first example.
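The error measure used to select P and N can be sketched in a few lines. The sketch assumes Euclidean norms and that the scaling is by the norm of the solver output; the function name and the two-point data set are purely illustrative.

```python
import numpy as np

def scaled_errors(y_true, y_pred):
    """Relative error per test point: ||y_true_q - y_pred_q|| / ||y_true_q||, row by row."""
    num = np.linalg.norm(y_true - y_pred, axis=1)
    den = np.linalg.norm(y_true, axis=1)
    return num / den

# Two hypothetical test responses with k_y = 2 (rows are test points).
y_true = np.array([[3.0, 4.0], [1.0, 0.0]])
y_pred = np.array([[3.0, 3.0], [1.0, 0.5]])

errors = scaled_errors(y_true, y_pred)   # -> [0.2, 0.5]
log_errors = np.log(errors)              # the quantity summarized in the boxplots
```

Summaries such as `np.median(log_errors)` or `errors.argmax()` then identify the median and worst-case test points examined in the figures.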
We use Algorithm 1 and KDE to obtain predictions of the pdf of a feature of the response. We choose as a feature the pressure head at the location $x = (10.4, 10.4, 10.4)^T$ (grid point number 4411). The distributions are shown in Fig. 16b for N = 800. We can again find the means and standard deviations across predictive posterior, hyperparameter and precision samples to obtain distributions over the moments of the marginalized distribution (32). These are plotted in Fig. 17, alongside comparisons to the true values obtained from $\{\tilde{y}^*_q\}_{q=1}^{Q}$. These results show that the emulator performs extremely well, capturing both the mean and the standard deviation accurately.

Numerical Computation
LTSA naturally lends itself to parallelization since almost all computations are performed on each neighbourhood independently. After merging threads we need only solve an eigenvalue problem for an N × N matrix. Similarly, independent Gaussian processes across latent dimensions lead to a natural parallelization framework.
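The neighbourhood-wise structure can be seen in a minimal sketch of the first stage of LTSA (local PCA via an SVD of each centred neighbourhood). The function name, the brute-force neighbour search and the synthetic data are assumptions made for illustration; the alignment stage and the final N × N eigenvalue problem are omitted.

```python
import numpy as np

def local_tangent_bases(Y, P, k_z):
    """Return neighbour indices (N x P) and orthonormal tangent bases (N x k_y x k_z)."""
    sq_dist = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    neighbours = np.argsort(sq_dist, axis=1)[:, :P]   # each point is its own first neighbour
    bases = []
    for idx in neighbours:                            # iterations are mutually independent
        centred = Y[idx] - Y[idx].mean(axis=0)
        _, _, vt = np.linalg.svd(centred, full_matrices=False)
        bases.append(vt[:k_z].T)                      # leading right singular vectors
    return neighbours, np.stack(bases)

# Synthetic data: a noisy 2-d plane embedded in 5 dimensions.
rng = np.random.default_rng(3)
N, k_y, k_z, P = 40, 5, 2, 8
latent = rng.uniform(size=(N, k_z))
Y = latent @ rng.standard_normal((k_z, k_y)) + 0.01 * rng.standard_normal((N, k_y))

neighbours, bases = local_tangent_bases(Y, P, k_z)
```

Because no iteration of the loop depends on any other, the neighbourhoods can be distributed across threads or processes before the results are merged.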
For large sample sizes and feature space dimensions saving each $Q_i$ can become infeasible ($N \times k_y \times k_z$ elements). Similarly, for large sample and neighbourhood sizes saving $\hat{f}$ can become infeasible ($N \times k^2$ elements). In such cases, these tensors may be saved to file or re-calculated online.
[Figure caption, partially recovered] The black line gives the MC prediction using the simulator. a 800 training points. b 800 training points, 30 k-NN
The scalability of our approach is limited by the computational complexity of Gaussian processes, $O(N^3)$. However, this can be alleviated by using sparse Gaussian process regression models. These models introduce m inducing points, reducing the computational complexity to $O(m^2 N)$. We may also use active learning to reduce the number of samples required.
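As a rough illustration of the saving, the leading-order operation counts can be compared directly. The sketch ignores constants and lower-order terms, and the number of inducing points m = 50 is a hypothetical choice; N = 800 matches the largest training set used above.

```python
def exact_gp_cost(n):
    """Leading-order cost of exact GP training, O(N^3)."""
    return n ** 3

def sparse_gp_cost(n, m):
    """Leading-order cost of a sparse GP with m inducing points, O(m^2 N)."""
    return m ** 2 * n

n_train, n_inducing = 800, 50   # n_inducing is an assumed, illustrative value
speed_up = exact_gp_cost(n_train) / sparse_gp_cost(n_train, n_inducing)   # (N / m)^2 = 256.0
```

The ratio grows as (N/m)², so the benefit of a sparse approximation increases quickly with the number of training points.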

Summary and Conclusions
In this paper we developed a new approach to the emulation of a model involving a random field input and a field output, with a focus on problems arising in groundwater flow modelling. The main challenges are the high input and output space dimensionalities, which we dealt with using a KL expansion and manifold learning, respectively. We implemented LTSA on the given outputs (training data), which allowed us to perform Bayesian inference in a low-dimensional feature space. Furthermore, we developed a framework for UQ in such problems by marginalizing over the inputs, either analytically (the mean and possibly in some cases the standard deviation) or using MC sampling.
Testing the emulation method on two examples reveals that it performs well in certain cases. When the variance of the log-normal input is high or the correlation lengths of the normal process Z(x) are short, the accuracy suffers, as is found in all other approaches. Nevertheless, the accuracy in terms of the forward UQ problem is high even in such cases for the examples considered. (Of course, further increases in the variance or decreases in the correlation lengths would eventually lead to unacceptably poor performance.) The major drawback of the KL expansion approach (and similarly with circulant embedding) is the curse of dimensionality as the number of retained coefficients grows. Some progress can be made in this regard by using a Smolyak algorithm (Smolyak 1963) for sampling or incremental local tangent space alignment (Liu et al. 2006) combined with active learning (Settles 2012), but the gains will be limited. Our method, in common with other methods except direct Monte Carlo or ROMs, is therefore potentially limited, given current computational resources, to problems in which the domain size is at most a few multiples of the shortest correlation length. The assumption of independence of the feature vector coordinates is also sub-optimal. Since the number of coordinates is small, however, this assumption can easily be relaxed by adopting, e.g. a convolved GP approach.