Gaussian active learning on multi-resolution arbitrary polynomial chaos emulator: concept for bias correction, assessment of surrogate reliability and its application to the carbon dioxide benchmark

Surrogate models are widely used to improve computational efficiency in various geophysical simulation problems by reducing the number of model runs. Conventional one-layer surrogate representations are based on global kernels (e.g., polynomial chaos expansion, PCE) or on local kernels (e.g., Gaussian process emulator, GPE). Global representations omit some details, while local kernels require more model runs. The existing multi-resolution PCE is a promising hybrid: it is a global representation with local refinement. However, it cannot (yet) estimate the uncertainty of the resulting surrogate, which techniques like the GPE can do. We propose to combine multi-resolution PCE and GPE in a joint surrogate framework to get the best out of both worlds. By doing so, we correct the surrogate bias and assess the remaining uncertainty of the surrogate itself. The resulting multi-resolution emulator offers a pathway for several active learning strategies to improve the surrogate at acceptable computational costs; compared to the existing PCE-kriging approach, it adds the multi-resolution aspect. We analyze the performance of a multi-resolution emulator and a plain GPE using didactic test cases and a CO2 benchmark that is representative of many similar problems in the geosciences. Both approaches show similar improvements during active learning, but our multi-resolution emulator leads to much more stable results than the GPE. Overall, our suggested emulator can be seen as a generalization of the multi-resolution PCE and GPE concepts that offers the possibility of active learning.


Introduction
Nowadays, many problems addressed by the geosciences require dealing with uncertainty. These uncertainties can occur based on assumptions made, limitations of the underlying models, lacking data for model building or calibration, uncertain or highly variable parameters, or uncertainty in the underlying process description itself. To address these uncertainties, several concepts of model-based Uncertainty Quantification (UQ) and Machine Learning (ML) have been developed in computational geosciences over a long time, ranging from classical concepts of surrogate modelling or ML-assisted data assimilation [19,37] to today's deep artificial neural networks (DNNs) [46,52,58]. However, many applied tasks consider large systems, where calculating the results for a single model run can be computationally very expensive.
To overcome this computational infeasibility, surrogate models are widely used to reduce the uncertainty-related computational effort of repeated model runs [23,29,32,62]. Scenarios where the number of available training samples (model runs) is very limited are usually challenging for sufficiently complex DNNs, which are considered data-hungry and usually require significantly more than 1000 training samples. Therefore, various surrogate techniques based on polynomials and on Gaussian processes are very common in the geosciences. Nevertheless, it should be mentioned that DNNs have also been used successfully for at least some complex data-poor applications [35].
The Polynomial Chaos Expansion (PCE) was originally introduced by Wiener [78] and extended by Cameron and Martin [11] with a proof of convergence in the $L^2$ sense. Ghanem and Spanos introduced the PCE into the context of UQ in [21]. Since then, PCE has been widely used in the context of steady-state [43,80] and convection-dominated [27,34,56,64,82] problems. The convergence and statistical properties of PCE are considered in [18,48], and a comprehensive overview of orthogonal polynomials in general is given in [20,73]. Today, PCE is widely used in the context of random ODEs [66], stochastic differential equations [81] and PDEs [3,82]. For complex applications, PCE is often used together with stochastic collocation [12] and compressive sensing methods [38,39].
Nowadays, many PCE implementations are related to the generalized PCE (gPC) [80,82], which extends the PCE to a certain list of underlying probability distributions of the uncertain model parameters by using the Askey scheme [2]. In the current work, we pay special attention to a specific version of PCE called the arbitrary Polynomial Chaos (aPC) expansion [50,51], which is free to adapt to arbitrary probability distributions of the uncertain model parameters. However, all mentioned PCE-based approaches yield a single global polynomial representation and can be affected by oscillations of the surrogate model [15]. In particular, non-linear, convection-dominated problems (such as multiphase flow in porous media [6,13,45]) are known to be susceptible to the Gibbs phenomenon that causes such oscillations.
To face this challenge, the multi-resolution / multi-element Ansatz was developed in the 1990s [1,22,47] and later transferred to the UQ context [36,77]. It has been used by many authors in several versions [16,42,75,76]. Multi-resolution versions of the PCE are able to handle discontinuities by replacing the global PCE polynomials with piecewise polynomial functions. The piecewise polynomial Ansatz is powerful in capturing strongly non-linear effects of models. Comparisons among diverse modern surrogate approaches can be found in [27], further extensions w.r.t. geophysical applications in [29,32,55], CFD-related applications in [16,75,76] and other engineering-related applications in [4,7,9,42]. The arbitrary Multi-Resolution Polynomial Chaos (aMR-PC) [31] is a multi-resolution-based extension of the aPC that carries the multi-resolution advantages over to cases with arbitrary probability distributions.
Unfortunately, neither the PCE, the aPC, nor their multi-resolution extensions can estimate how uncertain the resulting surrogate model is in its approximation based on limited training data, at least as long as Bayesian optimization [8,57] or bootstrap-based resampling techniques [41] are not used. In contrast, the Gaussian Process Emulator (GPE) [79] is able to estimate the approximation uncertainty after training. The GPE, which is closely related to Kriging [30], has been used and extended, e.g., in the context of reservoir simulation [19], experimental design [10,67] and active learning [49,53,68]. The combination of PCE and GPE was proposed in [25,63].
The current paper proposes to join the aMR-PC and the GPE into a unified framework in order to get the best out of both branches, creating a multi-resolution localized arbitrary polynomial chaos emulator. Such a combination not only helps to estimate the uncertainty of the surrogate itself, but also opens a pathway for Gaussian Active Learning (AL) on a multi-resolution arbitrary polynomial chaos emulator. For brevity, we denote this combination as GALMaP. Our proposed surrogate model is data-driven, reduces the Gibbs phenomenon by construction due to the localization, and at the same time is also suitable for larger random fields. For AL on the plain PCE and GPE we refer to [61]. In particular, we can make use of pool-based AL, where iteratively added samples are chosen only from a given set of candidates. This AL idea was first introduced by McCallum et al. for text classification [44], but has since been used in many other fields, such as linear regression [24,71].
To be specific, we propose a two-step emulator framework, which generates a preliminary surrogate model using the aMR-PC in its first step and enhances it adaptively by means of GPE-based AL in its second step. We expect that the presented strategy can reduce the number of model runs required to obtain a reliable surrogate model for many computationally expensive real-world applications. Moreover, different types of computing hardware could be used for the different steps: the model runs required by the aMR-PC can be processed in parallel, such that only the second step requires iterative execution. Exploring different combinations of how the aMR-PC and the GPE could be joined together, we demonstrate the accuracy and efficiency of the method on the basis of several numerical experiments. The most important test case is the CO2 sequestration benchmark that was introduced in [27] to compare the reliability of a large set of surrogate models (data sets are available at [28]) and that has already been used in several publications [8,31,49,59,60], such that the presented results can easily be compared with the existing surrogate models used in these publications.

Outline of the paper
This paper is organized as follows. In Section 2, we provide an introduction to the concepts of surrogate modelling used here. In particular, we recall the aPC in Section 2.1.1, the aMR-PC in Section 2.1.2 and the GPE in Section 2.2. In Section 3, we present our new aMR-PC- and GPE-based AL strategy. In Section 4, we demonstrate the performance of the introduced methods on several examples. Comparability with other methods is provided by using the CO2 sequestration benchmark [27] that was already used in [31,49,59,60]. In Section 5, we summarize the results of this paper.

Introduction to surrogate modelling
Surrogate models can be split into two types: parametric models and non-parametric models. In the current work, one surrogate model concept from each type is relevant. The aMR-PC [31], which is a multi-resolution-based extension of the well-established PCE [78] and of its data-driven extension to arbitrary distributions (the aPC [50]), is a parametric type of model. In contrast, the GPE exploits the powerful and widely used machinery of Gaussian processes from the area of machine learning. It is non-parametric (and can account for noise in the data [79]) and can quantify uncertainty based on its inherent properties. This section contains a brief overview of both surrogate types. The notation used is described in Table 1.

Polynomial chaos expansion
In the present work, we are mainly interested in a system response $Y$ as a function $f$ of a random vector $X = (X_1, \ldots, X_M)$ in the probability space $(\Omega, \mathcal{A}, \Gamma)$ with sample space $\Omega$, related $\sigma$-algebra $\mathcal{A}$ and probability measure $\Gamma$. For the sake of better readability, we start our explanation with the scalar case $X = X$, such that the system response $Y$ can be written as
$$Y = f(X). \quad (1)$$
If $Y \in L^2(\Omega)$ (w.r.t. $\Gamma$), the system response $Y$ can be expressed by a PCE as an infinite sum of orthonormal polynomials $\phi_k(X)$ with deterministic expansion coefficients $Y_k$:
$$Y = \sum_{k=0}^{\infty} Y_k \, \phi_k(X). \quad (2)$$
For computational reasons, the PC expansion is usually truncated, such that only the terms related to polynomial degrees not exceeding the desired expansion degree $N_o$ are considered:
$$Y^{PCE} = \sum_{k=0}^{N_o} Y_k \, \phi_k(X).$$
The polynomials $\phi_k(X)$ are selected such that they fulfil the orthonormality condition w.r.t. $\Gamma$:
$$\langle \phi_k, \phi_l \rangle_\Gamma = \int_\Omega \phi_k(x)\, \phi_l(x)\, d\Gamma(x) = \delta_{kl},$$
where $\langle \cdot, \cdot \rangle$ denotes the inner product of two polynomials with respect to the probability measure $\Gamma$, and $\delta_{kl}$ denotes the Kronecker delta. Therefore, the orthonormal representation in Eq. (2) is not redundant. Moreover, the mean and variance of the PC expansion of $Y$ can be obtained analytically from the polynomial coefficients $Y_k$, $k = 0, \ldots, N_o$, without further evaluations of $Y^{PCE}$ [43,70].
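To make the truncated expansion concrete, the following minimal sketch (an illustration, not code from the paper) evaluates the truncated version of Eq. (2) for a uniform input on $[-1, 1]$, for which the orthonormal basis consists of scaled Legendre polynomials; the coefficient values are hypothetical.

```python
# Minimal sketch (not from the paper): a truncated PCE for a uniform input
# on [-1, 1], where the orthonormal basis is phi_k(x) = sqrt(2k + 1) * P_k(x)
# with P_k the k-th Legendre polynomial.
import numpy as np
from numpy.polynomial import legendre


def pce_eval(x, coeffs):
    """Evaluate Y_PCE(x) = sum_k Y_k * phi_k(x) up to degree N_o = len(coeffs) - 1."""
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for k, Y_k in enumerate(coeffs):
        e_k = np.zeros(k + 1)
        e_k[k] = 1.0                      # selects the k-th Legendre polynomial
        y += Y_k * np.sqrt(2 * k + 1) * legendre.legval(x, e_k)
    return y


coeffs = [0.5, 0.3, -0.1]                 # hypothetical coefficients, N_o = 2
print(pce_eval(np.linspace(-1.0, 1.0, 5), coeffs))
```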

Arbitrary polynomial chaos
The various versions of PC expansions that exist today mostly differ in how the polynomials are chosen and which probability distribution is assumed for the random variable $X$ [2,72,74,80]. For many common probability distributions, the related orthogonal polynomials are known [2,72], and, of course, the probability density of the random variables describing the uncertainty has to be known. The arbitrary Polynomial Chaos (aPC) [50] is a variation of PCE that addresses these two points. It defines orthogonal polynomials for any arbitrary probability measure $\Gamma$ of a random variable $X$, and it is not necessary to give the probability measure $\Gamma$ [50] in an explicit, fully defined form. Instead, the probability measure $\Gamma$ may be merely implied by a finite number of raw moments of the random variable $X$. The $n$-th raw stochastic moment of $X$ is defined via the probability measure $\Gamma$ as
$$\mu_n = \int_\Omega x^n \, d\Gamma(x).$$
The idea behind the aPC is to construct the orthonormal polynomial basis $\phi_0, \ldots, \phi_{N_o}$ of degree $N_o$ from the first $2N_o$ raw stochastic moments of $X$. Each aPC polynomial $\phi_d$ of degree $d$ from the basis can be written [51] as
$$\phi_d(x) = \sum_{k=0}^{d} p_k^{(d)} x^k.$$
Here, the constants $p_k^{(d)}$ determining the polynomial basis (up to a subsequent normalization that makes it orthonormal) are given by the solution of the following linear system of equations [51]:
$$H_{d+1} \left( p_0^{(d)}, \ldots, p_d^{(d)} \right)^T = (0, \ldots, 0, 1)^T,$$
where $H_{d+1}$ is the (modified) Hankel matrix of raw moments [20,72]:
$$H_{d+1} = \begin{pmatrix} \mu_0 & \mu_1 & \cdots & \mu_d \\ \mu_1 & \mu_2 & \cdots & \mu_{d+1} \\ \vdots & & & \vdots \\ \mu_{d-1} & \mu_d & \cdots & \mu_{2d-1} \\ 0 & 0 & \cdots & 1 \end{pmatrix}.$$
The aPC formulation above leads to the same set of polynomials as the gPC if the random variable $X$ is within the Askey scheme, i.e. the aPC includes the gPC. However, the aPC is also feasible for other distributions of the random variable $X$. In particular, if the exact probability density of the random variable $X$ is unknown and only discrete values $x_i$ are available to represent $X$, then the aPC representation can be constructed in a purely data-driven manner, e.g. by providing a sample of values for $X$ from which empirical raw moments can be obtained. For a comprehensive introduction and description of the aPC, we refer to [48,50,51].

Table 1 Notation
input value – a value or vector x that is given as input to a model or surrogate (also called input vector)
output value – a value y returned from a model or surrogate
X, Y – capital letters denote the random variables of input values and of the system response, respectively
x, y – lower-case letters denote deterministic values or samples
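As a hedged sketch of the moment-based construction above (our own illustration: the last row of the system fixes the leading coefficient, and we normalize empirically afterwards), the coefficients of the degree-$d$ aPC polynomial can be computed from sample moments as follows:

```python
import numpy as np


def apc_poly_coeffs(samples, d):
    """Coefficients p_0..p_d of the degree-d aPC polynomial, built purely
    from the raw sample moments mu_0..mu_{2d-1} (data-driven aPC)."""
    samples = np.asarray(samples, dtype=float)
    mu = [np.mean(samples ** n) for n in range(2 * d + 1)]
    H = np.zeros((d + 1, d + 1))
    for i in range(d):                    # orthogonality to x^0 .. x^(d-1)
        H[i, :] = mu[i:i + d + 1]
    H[d, d] = 1.0                         # fixes the leading coefficient to 1
    rhs = np.zeros(d + 1)
    rhs[d] = 1.0
    p = np.linalg.solve(H, rhs)           # phi_d(x) = sum_k p[k] * x**k
    # empirical normalization to unit norm w.r.t. the sample measure
    norm = np.sqrt(np.mean(np.polyval(p[::-1], samples) ** 2))
    return p / norm
```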
The final aPC representation is defined by the polynomial coefficients $\mathbf{Y} = \{Y_k \mid k = 0, \ldots, N_o\}$ of Eq. (2). Across all PC methods, the polynomial coefficients $\mathbf{Y}$ can be obtained through several typical options [72]. We obtain them by solving a linear system of equations [50,70] that can be seen as a least-squares minimization on $N$ regression points $\{x_1, \ldots, x_N\}$:
$$\mathbf{Y} = \arg\min_{\{Y_k\}} \sum_{j=1}^{N} \left( f(x_j) - \sum_{k=0}^{N_o} Y_k \, \phi_k(x_j) \right)^2.$$
As stated in Table 1, for the sake of readability, the lower-case letter $x$ denotes a value, while the capital letter $X$ refers to a random variable. From an accuracy point of view, the $N$ regression points $x_j$ should be chosen properly, and the related Gaussian quadrature points are considered optimal [20,72]. Focusing on the quadrature points related to the polynomial degree $N_o$ avoids the oscillations associated with the Runge phenomenon and allows us to omit overfitting-prevention routines, which become computationally expensive for large random fields. The Gaussian quadrature points, and also the corresponding quadrature weights, can be obtained from the Hankel matrix alone [20]. Therefore, if required, Gaussian quadrature is also available to compute the aPC expansion based on numerical quadrature. However, in this paper we follow the Ansatz of [50] and use an appropriate least-squares regression to obtain the polynomial coefficients. This concept is very common in PCE research, e.g., in the context of compressive sensing and sparse PCE [38,39].
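A minimal sketch of this least-squares step (our illustration, with hypothetical helper names): given the basis evaluations at the $N$ regression points, the coefficients follow from ordinary least squares.

```python
import numpy as np


def fit_pce_coeffs(basis_funcs, x_points, y_values):
    """Least-squares fit of the PCE coefficients Y_k on N regression points:
    Phi[j, k] = phi_k(x_j), then minimize ||Phi @ Y - y||_2."""
    Phi = np.column_stack([phi(x_points) for phi in basis_funcs])
    coeffs, *_ = np.linalg.lstsq(Phi, y_values, rcond=None)
    return coeffs
```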

Arbitrary multi-resolution polynomial chaos
In the current work, we employ the recent data-driven generalization called arbitrary multi-resolution PC (aMR-PC), which combines the non-intrusive aPC Ansatz with the multi-resolution framework [31] to reduce possible Gibbs-type oscillations.
The aMR-PC uses the quantile function $Q_X(\cdot)$ of the random variable $X$ to decompose the sample space / dataset $\Omega$ into stochastic subdomains (SDs) by means of dyadic decomposition. For refinement level $N_r$ and a scalar random variable $X$, we get $2^{N_r}$ SDs defined as follows:
$$\Omega_k^{N_r} = \left[ Q_X\!\left(\frac{k}{2^{N_r}}\right),\; Q_X\!\left(\frac{k+1}{2^{N_r}}\right) \right), \quad k \in I_{N_r} := \{0, \ldots, 2^{N_r}-1\}.$$
Further, the aMR-PC uses the aPC Ansatz to construct a polynomial basis on each SD and thus obtains an orthonormal piecewise polynomial basis in $L^2(\Omega)$: each basis function $\phi^{N_r}_{k,p}$ of degree $p$ is an aPC polynomial on the subdomain $\Omega^{N_r}_k$ and vanishes elsewhere. The polynomials satisfy the orthonormality condition on each SD element,
$$\left\langle \phi^{N_r}_{l,p}, \phi^{N_r}_{k,q} \right\rangle_\Gamma = \delta_{k,l}\,\delta_{p,q}, \quad \text{for } k, l \in I_{N_r},\; p, q = 0, \ldots, N_o.$$
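A minimal sketch (our illustration under the stated dyadic-decomposition idea, using the empirical quantile function of a sample in place of $Q_X$):

```python
import numpy as np


def dyadic_subdomains(samples, Nr):
    """Split samples of a scalar X into 2**Nr stochastic subdomains of equal
    probability mass via the empirical quantile function Q_X."""
    edges = np.quantile(samples, np.linspace(0.0, 1.0, 2 ** Nr + 1))
    idx = np.searchsorted(edges, samples, side="right") - 1
    return edges, np.clip(idx, 0, 2 ** Nr - 1)   # subdomain index per sample
```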
The extension from a scalar random variable $X$ to the $M$-dimensional random vector $X = (X_1, \ldots, X_M)^T$ is based on multi-variate piecewise polynomials built as tensor products of the univariate ones:
$$\Phi^{N_r}_{\mathbf{k},\mathbf{p}}(x) = \prod_{i=1}^{M} \phi^{N_r}_{k_i, p_i}(x_i) \quad \text{on } \Omega^{N_r}_{k_1} \times \cdots \times \Omega^{N_r}_{k_M}, \qquad \text{and } 0 \text{ else},$$
with multi-indices $\mathbf{k} = (k_1, \ldots, k_M)$ and $\mathbf{p} = (p_1, \ldots, p_M)$.
Using the definitions above, the overall aMR-PC expansion of $Y \in L^2(\Omega^M)$ for refinement level $N_r$ and highest polynomial degree $N_o$ can be written [31] as
$$Y^{aMR\text{-}PC} = \sum_{\mathbf{k}} \sum_{\mathbf{p}} Y_{\mathbf{k},\mathbf{p}} \, \Phi^{N_r}_{\mathbf{k},\mathbf{p}}(X).$$
For transparency, in this paper we will denote by aMR-PC$_{N_r,N_o}$ the aMR-PC obtained for a specific combination of refinement level $N_r$ and maximum polynomial degree $N_o$. Moreover, the aPC can be seen as the particular case of aMR-PC with $N_r = 0$. Thus, the current paper will only use the aMR-PC$_{N_r,N_o}$ notation. Equivalently, using an aMR-PC$_{N_r,N_o}$ is the same as creating and fitting an aPC of degree $N_o$ on each defined subdomain. The multi-resolution multi-element design leads to a higher resolution of the surrogate representation in areas where more samples are given and helps to deal with discontinuities in the model function $Y = f(X)$. It also improves the accuracy compared to the standard aPC approach, in particular for lower polynomial degrees. However, with an increasing number of subdomains, the computational cost also increases, since the number of coefficients of the full-tensor aMR-PC$_{N_r,N_o}$ expansion (without adaptivity) is given by
$$P = 2^{M N_r} \, (N_o + 1)^M;$$
for example, $M = 3$, $N_r = 1$ and $N_o = 1$ already yield $2^3 \cdot 2^3 = 64$ coefficients. More coefficients also require more training runs of the model to be approximated. Therefore, to reduce the computational costs of the multi-resolution / multi-element representation, multi-wavelet-based adaptivity is usually used [31]. Nevertheless, for the sake of clarity, in this paper we focus our attention on the full-tensor non-adaptive case.

Gaussian process emulators
A Gaussian Process Emulator (GPE), also known as a Gaussian process regressor, is a surrogate model that uses Gaussian processes to establish the functional dependence $f$ in Eq. (1). Formally, a Gaussian Process (GP) is defined as a collection of random variables, any finite subset of which follows a joint Gaussian distribution [10,40,79]. Similar to how a Gaussian distribution is given by a mean and a variance, a Gaussian process is specified by a mean function $m(x)$ and a (covariance) kernel $k(x, x')$, for $x, x' \in \mathbb{R}^M$:
$$Y^{GPE}(x) \sim \mathcal{GP}\left(m(x), \, k(x, x')\right),$$
where $x$ and $x'$ are the input values of two samples $(x, y)$ and $(x', y')$. The usage of $\sim$ symbolises that the approximation $Y^{GPE}$ is distributed according to the given GP.

Kernels
Kernels $k(x, x')$, also called covariance functions, encode assumptions about the properties of the model by describing how similar the outputs for two inputs $x$ and $x'$ are [17,79]. Beyond the scope of Gaussian processes, kernels and their properties are widely used in the context of support vector machines [69]. Kernels can also be seen as a feature-space transformation of the variables. The common requirement on kernels is positive definiteness or positive semi-definiteness [69,72], in particular due to the requirements of Mercer's theorem [69,79]. The choice of kernel depends on the understanding of the specific application. A popular choice is the Matérn kernel, which is specified by its parameter $\nu$. In this paper, we use the squared exponential kernel, i.e. the Matérn kernel with $\nu \to \infty$.
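For concreteness, a minimal sketch of the squared exponential kernel (standard textbook form, not code from the paper):

```python
import numpy as np


def squared_exponential(X1, X2, length_scale=1.0, variance=1.0):
    """k(x, x') = s^2 * exp(-||x - x'||^2 / (2 l^2)); X1: (n, M), X2: (m, M)."""
    sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dists / length_scale ** 2)
```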
Depending on the properties of the input values $x$ and the model response $Y$, the choice of other kernels could be more reasonable. However, the functionality of the methods proposed in this paper is not dependent on the specific kernel. For a more in-depth explanation and more kernels, see [69,79].

Making predictions
The typical assumption is that the training samples in a training set $D$ used to fit the GPE contain only additive, independent and identically distributed Gaussian noise with zero mean and variance $\sigma_n^2$. In that case, equations for the most likely prediction $\hat{y}_*$ returned by the GPE for some unknown input $x_*$ can be derived. In general, such a prediction is given by the conditional expectation
$$\hat{y}_* = \mathbb{E}\left[ Y_* \mid D, x_* \right] = \int y \; p(y \mid D, x_*) \, dy,$$
where $D$ is the set of known training samples, $p(\cdot)$ is a probability density and $Y_*$ denotes the random variable of the GPE prediction. Since the distribution stems from a GP, it is normally distributed:
$$Y_* \mid D, x_* \sim \mathcal{N}\left( \mu_{GP,*}, \; \sigma^2_{GP,*} \right).$$
Based on this distribution, equations for the conditional mean $\mu_{GP,*}$ and conditional variance $\sigma^2_{GP,*}$ of the GPE for this specific input value can be derived:
$$\mu_{GP,*} = k(x_*, x_D)^T \left( K + \sigma_n^2 I \right)^{-1} y_D, \qquad \sigma^2_{GP,*} = k(x_*, x_*) - k(x_*, x_D)^T \left( K + \sigma_n^2 I \right)^{-1} k(x_*, x_D), \quad (9)$$
where $(x_D, y_D) \in D$, $K$ is the kernel matrix of the training inputs, and $I$ denotes the identity matrix. Further information can be found in [79]. The conditional mean $\mu_{GP,*}$ is the surrogate representation given by the GPE for the input value $x_*$. The quality of the GPE surrogate is expressed in the form of the standard deviation $\sigma_{GP,*}$, which gives a measure of how uncertain this representation is.
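A minimal sketch of Eq. (9) (our illustration, assuming a zero mean function):

```python
import numpy as np


def gp_posterior(K, k_star, k_ss, y_train, sigma_n=1e-6):
    """Conditional mean and variance of Eq. (9).

    K:      (n, n) kernel matrix of the training inputs x_D
    k_star: (n,)   vector k(x_D, x_*)
    k_ss:   scalar k(x_*, x_*)
    """
    A = K + sigma_n ** 2 * np.eye(K.shape[0])
    mu_star = k_star @ np.linalg.solve(A, y_train)          # conditional mean
    var_star = k_ss - k_star @ np.linalg.solve(A, k_star)   # conditional variance
    return mu_star, var_star
```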

Gaussian active learning on the aMR-PC emulator
As presented in Section 2, aMR-PC and GPE are two inherently different types of surrogate models. The aMR-PC representation discussed in Section 2.1.2 is a parametric method, which exploits statistical properties of the input parameters. Moreover, the aMR-PC structure suggests selecting the training samples before the first model run, i.e. at the related quadrature points. This property is very helpful in the context of parallelization. In contrast to the aMR-PC, the GPE surrogate described in Section 2.2 belongs to the class of non-parametric surrogate models, where the positioning of samples in the parameter space is not dictated by the GPE concept. GPEs usually show good performance in handling the noisy and perturbed data that can occur in real-world problems. The purpose of the current section is to present a novel heuristic that combines the two surrogate model concepts into a unified framework in order to increase the overall accuracy of the final surrogate representation.
The selection of the regression points is usually a trade-off between the computational effort related to the number of model runs and the computational effort of building and selecting the surrogate model. The second aspect is relevant if sufficiently large random fields are considered; e.g., the CO2 benchmark in Section 4.2 uses a random field of length 250, but lengths of several thousand elements are also common. The goal of our approach is to reduce the time to solution by exploiting knowledge of the prior distribution of the input parameters. On the one hand, we reduce the Gibbs phenomenon by the multi-resolution / multi-element Ansatz, which is necessary at least for non-linear convection-dominated problems [15]. On the other hand, using the polynomial-degree-related quadrature points for regression prevents the Runge phenomenon and overfitting. Since the quadrature points are known in advance, for many practical applications the related model evaluations can be computed in parallel, which can reduce the wall time significantly. Therefore, only the second stage, performed by the GPE, requires iterative evaluations. If required, the number of model evaluations can be reduced by multi-wavelet-based pre-processing of input data [31], use of hierarchical quadrature rules [72] or extension to sparse PCE concepts [8,38,39]. In the last case, overfitting prevention is usually provided by leave-one-out cross-validation.

Problem specification
For all proposed combinations of the aMR-PC with the GP, we assume that the distribution of the random vector $X$ is given in the form of a set of realizations. The known set of training samples $D = \{x_i, y_i\}_{i=1}^{n}$, consisting of $n$ vectors of input values $x_i$ and corresponding target values $y_i$, is chosen such that a specified aMR-PC$_{N_r,N_o}$ can be created from it. Additionally, there exists an independent set of test samples $D_{test} = \{x_j, y_j\}_{j=1}^{m}$ of $m$ input vectors $x_j$ and their corresponding target values $y_j$, which are to be used for evaluation. For the convenience of the reader, we collect the general notations in Table 1, and the notations and brief descriptions of the methods in Table 2.

Non-iterative multi-resolution emulator
First, we present the non-iterative combination of aMR-PC and GPE to construct an aMR-PC emulator, denoted by aMR-PC⊕GP.

The proposed unified framework consists of two subsequent representations, as shown in Fig. 1. The first component of the surrogate representation is given by the aMR-PC. The second component is based on the residual error Res, which is then approximated by the GPE. This type of combination of the two representations is denoted by ⊕.
We determine the residual error Res as the difference between the system response and the values approximated by the aMR-PC for all available samples:
$$Res(x_i) = y_i - Y^{aMR\text{-}PC}(x_i), \quad (x_i, y_i) \in D,$$
where the residual error does not generally have a mean value of 0, i.e. it can have a non-zero bias. The resulting aMR-PC⊕GP emulator, denoted as $\hat{Y}$, can formally be written as the combination
$$\hat{Y} = Y^{aMR\text{-}PC}(D) + Y^{GPE}(Res). \quad (11)$$
Here, the brackets $(\cdot)$ contain the samples used to fit the approximation. Thus, $Y^{aMR\text{-}PC}(D)$ denotes the aMR-PC fitted on the training samples $D$, and $Y^{GPE}(Res)$ describes the results of a GPE that is fitted on the residual error Res of the aMR-PC. Similar to the previous section, the subscript $N_r, N_o$ will be used to specify the architecture of the aMR-PC representation, leading to the specific notation aMR-PC⊕GP$_{N_r,N_o}$ for the overall surrogate emulator. To be consistent with PCE theory, we use the samples and residual errors at the corresponding Gaussian roots as the training samples for the aMR-PC and the GPE. In this way, no additional samples are needed to fit the GPE. However, since the aMR-PC is fitted with least squares to these samples, it approximates the values at these samples with the smallest possible error, and thus the resulting residual will be very small. Obviously, fitting the GPE to these small residual errors results in very small additional corrections from the GPE, and the overall approximation can be improved only slightly. However, equipping the aMR-PC surrogate with the GPE emulator properties opens up an indispensable pathway for active learning that can help to smartly allocate additional computational resources for further refinement. Further, since the degrees of freedom of the aMR-PC, namely $N_r$ and $N_o$, do not increase, the computationally expensive overfitting prevention for the aMR-PC part can be omitted.
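A minimal sketch of the two-step idea in Eq. (11), with a plain global polynomial standing in for the aMR-PC trend (the actual method uses the piecewise basis of Section 2.1.2; class and helper names are our own):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF


class TwoStepEmulator:
    """Trend + GP-on-residual, i.e. Y_hat = trend(D) + GPE(Res) as in Eq. (11)."""

    def __init__(self, degree=2):
        self.degree = degree

    def fit(self, x, y):
        # step 1: parametric trend (stand-in for the aMR-PC representation)
        self.trend = np.polynomial.Polynomial.fit(x, y, self.degree)
        residual = y - self.trend(x)                 # residual error "Res"
        # step 2: GPE fitted on the residual error
        self.gp = GaussianProcessRegressor(kernel=RBF())
        self.gp.fit(x.reshape(-1, 1), residual)
        return self

    def predict(self, x):
        res_mean, res_std = self.gp.predict(x.reshape(-1, 1), return_std=True)
        return self.trend(x) + res_mean, res_std     # corrected value, sigma_GP
```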

Active learning on the multi-resolution emulator
The two-step emulator framework using the aMR-PC⊕GP representation (see Section 3.2) provides the expected corrected surrogate value (including bias) and also assesses the remaining uncertainty via its emulator properties. In order to increase the accuracy of the initially constructed aMR-PC⊕GP emulator, we now employ active learning. Active learning can be seen as an iterative process in which the emulator is able to ask queries, typically in the form of unlabelled instances (here, an input vector $x_k$) [65]. It is, in return, given the label (here, the corresponding target value $y_k$ of the instance) and can use this knowledge to improve its representation. The pool from which the unlabelled instances stem is set to be the set of test samples $D_{test}$. Since this pool is specified independently of the model, this form of active learning is called pool-based active learning [24,44,53,55].
To be specific, we perform active learning iteratively: in each iteration step, a single sample $\{x_k, y_k\}$ is added as a training sample. To do so, we choose the next sample $\{x_k, y_k\}$ by considering the current variance $\sigma_{GP}(D)$ of the active learner on each input value in $D_{test}$:
$$x_k = \arg\max_{x \in D_{test}} \sigma_{GP}(x).$$
As this search strategy is based on the Gaussian properties of the emulator, it is called Gaussian Active Learning (GAL) [44,49]. After each iteration of GAL, the resulting aMR-PC⊕GP representation is evaluated, and a new additional sample is chosen from $D_{test}$. However, any new training sample could be used to update one or more parts of the aMR-PC⊕GP in Eq. (11). Therefore, in the following sections, three different active learning variations of the two-step emulator aMR-PC⊕GP are presented.
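A minimal sketch of the pool-based GAL loop (our illustration; `gp` is assumed to be an already-fitted scikit-learn-style regressor):

```python
import numpy as np


def gaussian_active_learning(gp, X_train, y_train, X_pool, y_pool, n_iter):
    """Pool-based GAL: repeatedly move the pool sample with the largest
    predictive standard deviation sigma_GP into the training set and refit."""
    for _ in range(n_iter):
        _, std = gp.predict(X_pool, return_std=True)
        k = int(np.argmax(std))                     # x_k = argmax sigma_GP(x)
        X_train = np.vstack([X_train, X_pool[k:k + 1]])
        y_train = np.append(y_train, y_pool[k])
        X_pool = np.delete(X_pool, k, axis=0)       # remove x_k from the pool
        y_pool = np.delete(y_pool, k)
        gp.fit(X_train, y_train)
    return gp, X_train, y_train
```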

Gaussian active learning only on the Gaussian process emulator
The first, straightforward variation of GAL is its direct application to a plain GPE-based surrogate, for now omitting the multi-resolution component. GAL then iteratively enlarges the set of training samples for the GPE, and we call this GAL-GP. A schematic representation can be found in Fig. 2. The result of running GAL-GP for a given number of iterations $iter$ consists of one representation $Y^{GPE}(D_{it})$ for each iteration $it$, which is based on the iteratively growing set of training samples $D_{it}$:
$$D_{it+1} = D_{it} \cup \{x_k, y_k\}, \qquad D_0 = D.$$
Overall, we expect the quality of the GPE representation to improve over the iterations of GAL-GP. The quality is also expected to depend on the initial training samples of the GPE, which can either be given randomly or selected in some specific manner. In particular, the initial training samples before GAL can be selected according to an aMR-PC$_{N_r,N_o}$ strategy. This strategy will be employed later for comparison and, hence, we will use the subscripts $N_r, N_o$ to indicate the GAL-GP initialization, leading to GAL-GP$_{N_r,N_o}$.

Gaussian active learning on the multi-resolution arbitrary polynomial chaos emulator
As the second variation of GAL, we combine it with the newly introduced two-step representation aMR-PC⊕GP from Section 3.2, yielding Gaussian Active Learning on the Multi-resolution arbitrary Polynomial chaos (GALMaP). We estimate the initial multi-resolution trend via an aMR-PC representation built on classical PCE-related training data. Then, we employ GAL to iteratively reduce the remaining error represented through the GPE. In that sense, the aMR-PC is not affected by the iterative changes, and GAL only adjusts the GPE. The resulting structure is schematically visualized in Fig. 3.
In each iteration, GALMaP returns a representation $\hat{Y}_{it}$ that consists of the approximation $Y^{aMR\text{-}PC}(D)$ of the aMR-PC$_{N_r,N_o}$ fitted to the original training samples $D$, in combination with the approximation $Y^{GAL}_{it}(Res_{it})$ returned by the GAL-GP$_{N_r,N_o}$ of the current iteration, fitted to the current residual error $Res_{it}$:
$$\hat{Y}_{it} = Y^{aMR\text{-}PC}(D) + Y^{GAL}_{it}(Res_{it}).$$
Since the GAL-GP is now fitted to the residual error, this residual error has to be extended, similarly to $D_{it}$, to include the additional samples:
$$Res_{it+1} = Res_{it} \cup \left\{ x_k, \; y_k - Y^{aMR\text{-}PC}(x_k) \right\}.$$
The aMR-PC values used here include both the representation of the aMR-PC on the $n$ samples in $D$ and its predictions at the newly added input values.

GALMaP with an adaptive trend
The third variation of GAL is also based on the newly introduced aMR-PC⊕GP, similar to Section 3.3.2, but now we update the aMR-PC and the GPE simultaneously. This GAL strategy will be called GALMaP with an adaptive trend, or shortly GALMaP$_{AT}$. A trend can be seen as a general direction in which something is developing or changing. In that sense, the term adaptive trend describes the change in the underlying representation given by the aMR-PC. Thus, GALMaP$_{AT}$ defines a representation with an adaptive trend, where one part of the approximation, the GAL-GP, adapts to the trend in the residual of the changing aMR-PC:
$$\hat{Y}_{it} = Y^{aMR\text{-}PC}(D_{it}) + Y^{GAL}_{it}(Res_{it}).$$
Since GALMaP$_{AT}$ uses both the iterative aMR-PC and the GAL-GP, both the training samples $D$ and the residual error Res have to be updated in each iteration. A schematic presentation of the GALMaP$_{AT}$ structure can be seen in Fig. 4. The strategy GALMaP$_{AT}$ might seem like an obvious extension of the previously defined GALMaP. However, there are two reasons that make it unlikely to improve the representation. The first reason is related to the least-squares optimization, where the residual error of the aMR-PC is minimized during training, as pointed out in Section 3.2: using the training samples as inputs will thus lead to only minimal changes in the full representation. The second reason is that, when specifying an aMR-PC through its parameters $N_r$ and $N_o$, the number of samples given by the Gaussian roots matches the number of degrees of freedom of the aMR-PC. Adding further samples then means that the aMR-PC is fitted to more samples than its degrees of freedom allow for, thus resulting in underfitting [5,14].

Numerical experiments
The current section demonstrates the performance of the four surrogate representations developed in Section 3, out of which three are equipped with AL strategies. We also consider the two already existing individual surrogates, aMR-PC and GPE, described in Section 2. This results in a total of six methods for comparison. We first consider two different examples in Section 4.1 to give a general understanding of the functionality of the aMR-PC, as well as of the proposed iterative variation GALMaP.
These two examples are designed such that they will be underfitted by a stand-alone aMR-PC for the chosen $(N_r, N_o)$ level (a highly non-linear response vs. a too low polynomial degree $N_o$, and a discontinuous response vs. $N_r = 0$) in order to visualize the contribution of the two components. Then, in Section 4.2, we analyse a complex example dedicated to a CO2 benchmark study, which is also the main test case of this section. This benchmark was designed to reproduce the challenges typical of UQ for non-linear convection-dominated problems, e.g., two-phase flow in porous media, which are very common in the geosciences. In the CO2 benchmark, we establish the aMR-PC and GPE surrogates, as well as GAL and the proposed aMR-PC⊕GP, GALMaP and GALMaP$_{AT}$ strategies, to construct the multi-resolution emulators.
In all test cases, we use the squared exponential kernel, where the hyper-parameters are optimized to the training data in each iteration. Related data and code are available at [26].

Surrogate performance in didactic examples
In order to provide a basic understanding of the newly introduced surrogate representations, we first use a simple example. We construct this function such that it is not trivial to approximate by orthogonal polynomials. All variables $x_i$ follow a uniform distribution on the interval $[-1, 1]$. We generate training samples according to the aMR-PC quadrature rules as a starting point and then perform active learning. As the test exploration set $D_{test}$ and as the pool for GAL, we use $m = 1000$ realizations taken from the uniformly distributed $x_i \in [-1, 1]$ and their corresponding values $y_i$ as test samples. We illustrate the performance using the aMR-PC representation with $(N_r, N_o) = (2, 2)$ as the initial setup. Obviously, considering higher values of $(N_r, N_o)$ offers better approximations, but setting them to moderate values helps to show the effects of the considered GAL strategy. The resulting representation can be seen in Fig. 5(a), in addition to the reference function and the samples that make up the Gaussian roots that the aMR-PC is fitted to. In this visualization, the multi-resolution subdomains are clearly visible through the jumps of the approximation at their borders. Inside the subdomains, the aMR-PC shows a parabolic behaviour, as can be expected, since this is the given maximum polynomial degree. Beyond this test case, the degrees of freedom of the aMR-PC, namely the number of refinements $N_r$ and the polynomial degree $N_o$, usually depend on the available computational resources and on the quality of the input data.
Extending the aMR-PC representation displayed in Fig. 5(a) towards an aMR-PC⊕GP emulator is not worthwhile, because the residual error of the aMR-PC at its Gaussian quadrature roots is almost non-existent. As a consequence, the GPE emulation on these values does not change the approximation. However, further active learning can improve the reliability of the aMR-PC⊕GP surrogate. To illustrate the benefit of the active learning strategy, we construct GALMaP iteratively using 20 additional model runs, i.e. with the number of iterations $iter = 20$. The resulting approximation after 20 iterations of GALMaP$_{2,2}$ can be seen in Fig. 5(b); the additional 20 samples are shown in red. The GALMaP fits the example function much more closely than the bare aMR-PC. Even though the jumps between the subdomains have not fully disappeared, they have become smaller. Thus, the proposed GALMaP demonstrates a clear improvement over the initial aMR-PC representation.
Let us now consider another simple example, where the underlying model contains a discontinuity, which is extremely challenging to represent using any set of global polynomials. To set up such a test case, we consider the signum function
$$f(x) = \operatorname{sgn}(x).$$
Figure 6 illustrates the signum function itself and two variants of aMR-PC representations, with a total of 5 samples for aMR-PC$_{0,4}$ and 9 samples for aMR-PC$_{0,8}$. Obviously, the aMR-PC$_{0,8}$ surrogate has more flexibility due to the 8th degree of expansion and shows superior results compared to aMR-PC$_{0,4}$. Figure 6 also shows the GALMaP$_{0,4}$ representation with 4 additional samples, i.e. the total number of samples equals 9, as for the aMR-PC$_{0,8}$. However, the initial trend of GALMaP$_{0,4}$ is captured by aMR-PC$_{0,4}$, and its encapsulated bias is corrected via the GPE. We observe that aMR-PC$_{0,8}$ and GALMaP have a similar shape, but for $x \in [-0.5, 0.5]$ the representation given by GALMaP shows better performance than the aMR-PC. Therefore, GALMaP could potentially be more efficient (i.e. require fewer model runs) than the aMR-PC in the multi-variate case, where the impact of active learning will be more significant. Moreover, GALMaP indicates the uncertainty of the surrogate, which is also shown via $\sigma_{GAL}$ in Fig. 6.

Description of the CO2 benchmark
After the two simple examples in the previous section, we now evaluate the suggested multi-resolution emulators on a complex example. To do so, we address a frequently used carbon dioxide (CO2) storage benchmark problem [27], where a non-linear, convection-dominated set-up with a strong shock displacement is very challenging to capture by surrogate modelling approaches. The goal of the CO2 benchmark study is to provide a fair comparison of different surrogate modelling concepts. This benchmark has also been analysed with recent approaches, in particular with (piecewise) PCE in [27,31], sparse PCE in [8], sparse grids and B-splines in [27,59], and reduced-basis methods in [27,60]. Further, a modified version of this benchmark was also used in the context of GPE-surrogate-based Bayesian active learning in [49].
The considered benchmark problem [27] looks at the injection of CO2 through an injection well into a deep aquifer and the subsequent spreading of the CO2 in a geological formation. The model is based on a benchmark problem described in [13] and is further simplified by only considering radial flow in the vicinity of the well. In short, the physical model can be described by the following aspects. The aquifer is initially filled with brine. A hydrostatic pressure distribution, which depends on the brine density, is given as the initial condition. Then, CO2 is injected into the centre of the area at a constant rate $q_{CO_2}$. Fluid properties such as density and viscosity are assumed to be constant, and capillary pressure is neglected. Brine and CO2 are assumed to be two separate phases that do not mix, and mutual dissolution is neglected. The aquifer is assumed to be isotropic and rigid and does not react chemically. Additionally, all processes are assumed to be isothermal. Further initial and boundary conditions include constant Dirichlet conditions at the lateral boundaries and no-flow conditions everywhere else. Based on the two-phase flow equations obtained from the mass balances of the two fluid phases, in combination with the multi-phase version of Darcy's law and the fractional flow formulation, a coupled system of equations can be obtained.
Using radial coordinates [27] and assuming that the pressure can be determined in a closed analytical form, the equation for the gas phase can be written as
$$\phi \, \frac{\partial S_g}{\partial t} + \frac{C_p}{r} \, \frac{\partial f_g(S_g)}{\partial r} = 0, \quad r \in [0, s], \; t \in (0, T]. \quad (18)$$
Here, the spatial (radial) domain is denoted by $[0, s]$ m with $s = 250$, $T$ is the terminal time, $r$ is the radial coordinate representing the distance from the injection well, $\phi$ is the porosity and $C_p$ is a constant derived from the pressure decay. Following [27], $T$ is set to 100 days. The fractional flux function is denoted by $f_g$. The quantity of interest is $S_g$, the unknown saturation of the gas phase.
For the numerical solution of Eq. (18), a central-upwind method [33] was used. A more detailed explanation of the benchmark and the corresponding solver can be found in [27]. The benchmark considers three uncertain parameters: $X_1$ in the injection rate $q_{CO_2}$, $X_2$ in the fractional flux function $f_g$ through the permeability degree, and $X_3$ in the reservoir porosity $\phi$. We collect all three parameters in a three-dimensional random vector $X = (X_1, X_2, X_3)^T$ and assume $X_i$, $i = 1, \ldots, 3$, to be independent. A realization of this vector will be called an input value.
This paper looks at the case of data-poor settings in combination with computationally expensive model evaluations. The distributions of the random variables are given in a fully data-driven manner through a set of 10000 values. The corresponding histograms can be seen in Fig. 7. For each input value, the saturation over the distance $r = 0, \ldots, 250$ is given by the hyperbolic solver described in [27]. This leads to a total of 10000 given samples $(x_i, y_i)$ for each distance $r$, out of which only far fewer will actually be used. The saturation values lie approximately in the interval $[0, 0.8]$, with smaller distances typically resulting in larger values and larger distances in smaller values. This also shows in the mean $\mu$ and standard deviation $\sigma$ of the full set of samples in Fig. 9. The related input and simulation data and binaries are provided in [28]. This benchmark reproduces a challenge that is typical of non-linear convection-dominated problems, e.g., multi-phase flow through porous media as in CO2 storage, oil production, subsurface flow and the transport of groundwater contamination, as well as of many other computational-fluid-dynamics-related applications, such as meteorology. Here, the saturation profile tends to build shocks, and the shock speed depends strongly on the uncertain input parameters. Figure 8 visualizes this behaviour for 30 randomly selected saturation profiles from the dataset. It is easy to see that the difference between the saturation values is higher for $r > 150$ m, which is especially visible at the tails of the curves. To emphasize the necessity of dedicated surrogate models for data-poor scenarios, we also provide comparisons with common techniques in the appendices. Figure 18 in Appendix B shows that MC chains do not fully mix on samples of this benchmark, which further motivates the usage of surrogate models. Appendix A contains the description and results of three deep neural networks (DNNs) on this benchmark. The results in Fig. 17 show only a slight improvement of the approximation for more training samples; thus, DNNs are not considered further in this paper.

Setup of evaluation
The training samples $D$ for the newly introduced emulators are generally based on the Gaussian quadrature roots of an aMR-PC$_{N_r,N_o}$, and we describe them in depth for each case separately. In order to compare the proposed variations with each other and with the original model, they are evaluated on the full set of test samples $D_{test}$ (i.e., this set is used only for the discussion of the methods) for each example in the upcoming sections. The set of test samples starts out with the 10000 samples described in Section 4.2 and decreases by 1 in each iteration of active learning, as described in Section 3.3. The initial samples needed by the aMR-PC are given as extra samples. This section lays the basis for this evaluation by defining the error measures used.
We consider three error measures, a normalized $L^2$ error of the mean $\mu$ and of the standard deviation $\sigma$ of the model response, as well as the mean squared error (MSE):
$$\varepsilon_\mu = \frac{\| \mu_Y - \mu_{\hat{Y}} \|_2}{\| \mu_Y \|_2}, \qquad \varepsilon_\sigma = \frac{\| \sigma_Y - \sigma_{\hat{Y}} \|_2}{\| \sigma_Y \|_2}, \qquad \mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2.$$
Each of these error measures is calculated from the reference solution $Y$ obtained from the test samples $D_{test}$ and the representation $\hat{Y}$ on the corresponding input values $x_i$. We also consider the standard deviation $\sigma_{GP}$ of the distribution of the GPE for every sample as a measure of how certain the GPE is about its prediction, and $\sigma_{GAL}$ for the iterative variations. We define a spatial mean and standard deviation for these values. The values of $\sigma_{GP}$ should be small to ensure that the GPE gives precise predictions for all input values, but large enough that the test samples $D_{test}$ fall inside the interval given by $\sigma_{GAL}$ around the approximation of the GPE.
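A minimal sketch of these error measures (our illustration; the equations above are reconstructed from the surrounding definitions):

```python
import numpy as np


def error_measures(Y_ref, Y_sur):
    """Normalised L2 errors of mean and standard deviation over the spatial
    grid, plus the MSE; inputs have shape (m_samples, n_gridpoints)."""
    mu_ref, mu_sur = Y_ref.mean(axis=0), Y_sur.mean(axis=0)
    sd_ref, sd_sur = Y_ref.std(axis=0), Y_sur.std(axis=0)
    eps_mu = np.linalg.norm(mu_ref - mu_sur) / np.linalg.norm(mu_ref)
    eps_sigma = np.linalg.norm(sd_ref - sd_sur) / np.linalg.norm(sd_ref)
    mse = np.mean((Y_ref - Y_sur) ** 2)
    return eps_mu, eps_sigma, mse
```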

Results with aMR-PC
The results of the already existing aPC and aMR-PC representations on the CO2 benchmark are described in depth in [27,31]. In our current paper, only a low-order aMR-PC is used for illustration, mainly $(N_r, N_o) = (1, 1)$. This is viable, since the residual error will be given to, and approximated by, the GPE. Furthermore, the residual error can oscillate, which would require many more samples when approximated by the aMR-PC compared to the GPE. Reducing the number of training samples in this way also reduces the necessary computational cost. Figure 9 illustrates the corresponding approximations of the uncertain output's mean and standard deviation for aMR-PC$_{1,1}$ in comparison to the reference obtained from the full set of test samples $D_{test}$. At the tail ends, the results show drastic differences to the reference values due to the high variability of the samples in the dataset. The values of the error measures for this aMR-PC can be found in Table 3.

Performance of the Gaussian process emulator
Now, we illustrate the performance of the conventional GPE surrogate using the same training samples from aMR-PC$_{1,1}$ as in Section 4.4 in order to provide a direct comparison. We denote by GP$_{N_r,N_o}$ the GPE trained on the Gaussian quadrature points for $(N_r, N_o)$. The kernel of the GPE was set to an RBF kernel with an initial length scale of 10 and the length-scale bounds $10^{-2}$ and $10^2$; the final values are determined via hyperparameter training. The GPE was implemented via scikit-learn [54], which uses hyperparameter optimization to improve the accuracy of the approximation. The representation for GP$_{1,1}$, shown in Fig. 10, has a very similar shape to that of aMR-PC$_{1,1}$. The overall accuracy of the two representations is also very similar, as illustrated by the similar values for the error measures given in Table 3. We also compare the results of the GPE and the aMR-PC for different combinations of $(N_r, N_o)$, as shown in Fig. 12. One can see that, although the results are similar, the GPE does not lead to optimal results on the Gaussian roots. Another difference between the aMR-PC and the GPE can be found in their runtime: on average, GP$_{N_r,N_o}$ takes about 7 times longer than the corresponding aMR-PC$_{N_r,N_o}$. This is mainly due to the high number of allowed hyperparameter-optimization iterations of the GPE. Such a difference in runtime could be reduced by parallelizing the GP over the spatial (radial) distances, or by choosing a kernel more specialized to the given samples. In this paper, the RBF kernel was used purposefully as a very general kernel that does not encode further prior knowledge about the samples or the residual error, and no parallelization was implemented.
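A minimal sketch of the described scikit-learn setup (the kernel, initial length scale and bounds match the text above; the number of optimizer restarts is our own assumption):

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# RBF kernel with the initial length scale of 10 and bounds [1e-2, 1e2];
# fit() then tunes the length scale by maximising the log-marginal
# likelihood (scikit-learn's built-in hyperparameter optimization).
kernel = RBF(length_scale=10.0, length_scale_bounds=(1e-2, 1e2))
gpe = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)

# usage (x_train, y_train, x_test are placeholders):
# gpe.fit(x_train, y_train)
# mu, sigma = gpe.predict(x_test, return_std=True)
```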

Performance of the two-step aMR-PC⊕GP
Next, we demonstrate the newly introduced two-step emulator aMR-PC⊕GP described in Section 3.2. As described there, the strategy is to minimize the number of model runs by reusing the residual error of the aMR-PC on the Gaussian quadrature roots to fit the GPE. However, since the Gaussian quadrature roots are the samples for which the residual error of the aMR-PC is smallest, the GPE is likely to underestimate this error and will not change the results much. Thus, compared to the representations given by the aMR-PC and the GPE on their own, this variation leads to only a very minor improvement. The corresponding values of the error measures, evaluated on the whole dataset with 10000 samples, are given in Table 3 and visualized in Fig. 11. As expected, the aMR-PC⊕GP emulator offers only a very minor improvement. Compared to the bare aMR-PC, however, the aMR-PC⊕GP can assess the reliability/uncertainty of the surrogate representation itself. Yet, again because the GPE part is trained on very small residuals, the GPE returns very small $\sigma_{GP}$ values. For some distances $r$, the values of $\sigma_{GP}$ are so small that the reference is not included in the standard deviation band around the aMR-PC⊕GP representation. As noted in Section 4.3, this shows that there is still room for improvement in the GPE part of the representation and indicates that the training samples chosen for the GPE might not be optimal. For a deeper comparison between the aMR-PC, the GPE and the aMR-PC⊕GP, we use various combinations of $(N_r, N_o)$. Figure 12 contains the results of all three error measures for the analysed representations. Note the increase in error for $(N_r, N_o) = (2, 1)$ compared to $(N_r, N_o) = (1, 2)$, which is due to the switch in refinement level $N_r$. The corresponding mean and standard deviation of $\sigma_{GP}$ (see Section 4.3) for the GPE and the aMR-PC⊕GP are given in Fig. 13. The remaining uncertainty of the aMR-PC⊕GP is much smaller than that of the GPE. These results further illustrate that the previously shown effects also hold true for other combinations of $(N_r, N_o)$. Overall, we observe that the GPE representation demonstrates inferior performance, while the aMR-PC and aMR-PC⊕GP approaches show very similar behaviour. However, the key feature of the aMR-PC⊕GP emulator has not been exploited yet: a further active learning procedure can increase its performance.

Active learning of emulators
Finally, we explore the active learning idea described in Section 3.3 to further improve the representations that fitted the GPE on the Gaussian roots only. To make the results comparable, we perform 100 iterations of active learning for each strategy. The values of the error measures for GAL-GP for the considered values of $(N_r, N_o)$ are given in Fig. 14. The proposed method of adding samples is indeed capable of decreasing all three error measures. The stepwise behaviour of the errors indicates that some added samples lead to more drastic improvements than others and, as such, might be more important for the GPE representation. However, there are also many spikes in these errors, which show that we do not obtain a consistent quality of representation in each iteration. The corresponding reliability of GAL-GP is shown in Fig. 15, using the remaining standard deviation $\sigma_{GAL}$ for the considered $(N_r, N_o)$ values. The mean and standard deviation of $\sigma_{GAL}$ show a similar behaviour, and both decrease as the iterations pass. We observe that a lower polynomial degree $N_o$ seems to reflect some noise occurrence, which stabilizes for higher $N_o$. Overall, GAL-GP improves the surrogate with the added samples, but its iteration-to-iteration quality remains unstable.

Gaussian active learning on multi-resolution arbitrary polynomial chaos emulator
A similar improvement by active learning can also be expected for GALMaP, as described in Section 3.3.2. Figure 16 shows the resulting representation for $iter = 100$ of GALMaP$_{1,1}$. Compared to the initial aMR-PC⊕GP representation shown in Fig. 11, the active learning embedded in GALMaP improves the accuracy. Moreover, GALMaP reduces the uncertainty of the surrogate, expressed via the values of $\sigma_{GP}$, which have adapted such that the reference now lies inside the $\sigma_{GP}$ interval around the approximated mean and standard deviation of the model response.
The results of GALMaP given in Fig. 14 also look very similar to those of GAL-GP for the same $(N_r, N_o)$, but without the pronounced spikes. This illustrates that the functionality of GAL-GP does not depend on the chosen underlying training targets, in this case the residual error given by the aMR-PC. At the same time, the underlying aMR-PC trend helps to stabilize the representation, as shown by the lack of spikes (see Fig. 14) in the curves describing the error values for GALMaP.

Gaussian active learning on multi-resolution arbitrary polynomial chaos emulator with an adaptive trend
The last active learning strategy on aMR-PC⊕GP includes the trend adaptation proposed in Section 3.3.3 and is denoted GALMaP$_{AT}$. GALMaP$_{AT}$ iteratively adapts both the GPE and the aMR-PC based on the samples chosen by the GPE. This variation seems to be an obvious extension of GALMaP, but may fail to improve the overall representation, as pointed out in Section 3.3.3. Indeed, the results demonstrate that the GALMaP$_{AT}$ representation performs worse than GAL-GP and GALMaP for all three considered error measures. Moreover, the same also holds true for the reliability of the GALMaP$_{AT}$ surrogate, expressed in terms of the standard deviation $\sigma_{GAL}$ shown in Fig. 17 in Appendix A.

Summary and conclusion
The current paper presents a two-step emulator framework that combines the multi-resolution arbitrary Polynomial Chaos (aMR-PC) and the Gaussian process emulator (GPE). The two-step emulator aMR-PC⊕GP consists of a first approximation given by the aMR-PC on known training samples, and a second representation from a GPE fitted on the residual error values of the aMR-PC. The proposed joint aMR-PC⊕GP can also perform active learning to improve the final surrogate representation at acceptable computational costs. This can be very helpful when initially planning the training runs for an aMR-PC representation but later desiring a refinement at small additional computational cost. Following that idea, pool-based active learning on the GPE component has been used to refine the developed two-step emulator using several strategies. The first strategy (denoted GALMaP) keeps the aMR-PC unchanged and iteratively updates the GPE. The second strategy (denoted GALMaP$_{AT}$) additionally updates the multi-resolution aMR-PC trend during the active learning procedure.
The newly introduced two-step emulator aMR-PC⊕GP and the related active learning strategies GALMaP and GALMaP$_{AT}$ were evaluated on didactic test cases and on a CO2 benchmark problem. We observe that the aMR-PC⊕GP emulator alone does not result in much improvement over the existing aMR-PC, due to the small residual errors of the aMR-PC surrogate on its corresponding training samples. However, additional active learning using the GALMaP strategy improves the aMR-PC⊕GP representation: the GALMaP emulator mitigates the remaining bias and reduces the uncertainty of the surrogate itself. In contrast, GALMaP$_{AT}$ also updates the trend encoded by the aMR-PC; however, it shows an underfitting of the aMR-PC due to a lack of degrees of freedom compared to the number of given training samples, which the GPE cannot compensate. Moreover, we have compared the suggested GALMaP framework with active learning on the plain GPE (GAL-GP). Both approaches show similar improvements in their representations, but GALMaP leads to much more stable results than GAL-GP during the active learning. The GALMaP framework makes it possible to break free from the number of training samples dictated by the chosen aMR-PC, and to find representations that need fewer evaluations of the underlying model to yield similar or better results.
Overall, the suggested aMR-PC⊕GP emulator can be seen as a generalization of the aPC and GPE concepts. It offers the freedom to set up the initial multi-resolution representation via $(N_r, N_o)$ and to improve it further via the GALMaP framework, leading to useful accuracy at acceptable computational costs.

Appendix A
To provide additional information for comparison, here we show results for the CO2 benchmark introduced in Section 4.2 that are obtained from training deep neural networks and from GALMaP$_{AT}$, using a comparable number of training samples. A total of three DNNs are considered. Each row of Table 4 shows the specification of one of the DNNs: the number of training samples $N_t$, the number of layers $L$, the number of neurons per layer $N_L$ and the total number of weights $N_w$ in the DNN. Only fully connected layers were used.
The activation function was set to tanh for all DNNs and all layers,
$$A_{\tanh}(I) = \frac{e^I - e^{-I}}{e^I + e^{-I}},$$
and a combination of the MSE and the mean squared weight (MSW) was used as the loss function during training:
$$\mathcal{L} = \mathrm{MSE} + \mathrm{MSW}, \qquad \mathrm{MSE} = \frac{1}{m} \sum_{i} \left( y_i - \hat{y}_i \right)^2, \qquad \mathrm{MSW} = \frac{1}{N_w} \sum_{i=1}^{N_w} w_i^2.$$
Here, $y_i$ denotes the output of the DNN for the $i$-th test sample with reference output $\hat{y}_i$, while $w_i$ is the $i$-th weight of the DNN. Figure 17 extends the results of Fig. 15 by the results of GALMaP$_{AT}$ and the three DNNs described above. The training samples for the DNNs were given by the Gaussian quadrature points of an aPC with the same number of training samples.

Appendix B
For further motivation of the usage of surrogates on the CO2 benchmark introduced in Section 4.2, we take a look at the results of MC estimation on the available samples. Here, we repeat the MC approach with numbers of samples comparable to those used in the surrogate experiments above. Figure 18 shows the error of the mean and standard deviation of 10 chains of up to 400 MC samples on the benchmark. The chains do not fully mix or converge, even for higher numbers of samples. Thus, we can say that MC does not provide reliable results for the considered number of samples.
Funding Open Access funding enabled and organized by Projekt DEAL.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.