
Bayesian Uncertainty Propagation Using Gaussian Processes

  • Ilias Bilionis
  • Nicholas Zabaras
Living reference work entry

Abstract

Classic non-intrusive uncertainty propagation techniques typically require a significant number of model evaluations in order to yield convergent statistics. In practice, however, the computational complexity of the underlying computer codes significantly limits the number of observations that one can actually make. In such situations, the estimates produced by classic approaches cannot be trusted, since the limited number of observations induces additional epistemic uncertainty. The goal of this chapter is to highlight how the Bayesian formalism can quantify this epistemic uncertainty and provide robust predictive intervals for the statistics of interest with as few simulations as one has available. It is shown how the Bayesian formalism can be materialized by employing the concept of a Gaussian process (GP). In addition, several practical aspects that depend on the nature of the underlying response surface, such as the treatment of spatiotemporal variation and multi-output responses, are discussed. The practicality of the approach is demonstrated by propagating uncertainty through a dynamical system and an elliptic partial differential equation.

Keywords

Epistemic uncertainty · Expensive computer code · Expensive computer simulations · Gaussian process · Uncertainty propagation

1 Introduction

National laboratories, research groups, and corporate R&D departments have spent decades of effort and billions of dollars to develop realistic multi-scale/multi-physics computer codes for a wide range of engineered systems such as aircraft engines, nuclear reactors, and automobiles. The driving force behind the development of these models has been their potential use for designing systems with desirable properties. However, this potential has been hindered by the inherent presence of uncertainty attributed to the lack of knowledge about initial/boundary conditions, material properties, geometric features, the form of the models, as well as the noise present in the experimental data used in model calibration.

This chapter focuses on the task of propagating parametric uncertainty through a given computer code, ignoring the uncertainty introduced by discretizing an ideal mathematical model. This task is known as the uncertainty propagation (UP) problem. Even though the discussion is limited to the UP problem, all the ideas presented can be extended to the problems of model calibration and design optimization under uncertainty, although this is beyond the scope of this chapter.

The simplest approach to the solution of the UP problem is the Monte Carlo (MC) approach. When using MC, one simply samples the parameters, evaluates the model, and records the response. By post-processing the recorded responses, it is possible to quantify any statistic of interest. However, obtaining convergent statistics via MC for computationally expensive realistic models is not feasible, since one may be able to run only a handful of simulations.

The most common way of dealing with expensive models is to replace them with inexpensive surrogates. That is, one evaluates the model on a set of design points and then tries to develop an accurate representation of the response surface based on what he observes. The most popular such approach is to expand the response in a generalized polynomial chaos (gPC) basis [4, 51, 52, 53] and approximate its coefficients with a Galerkin projection computed with a quadrature rule, e.g., sparse grids [44]. In relatively low-dimensional parametric settings, these techniques outperform MC by orders of magnitude. In addition, it is possible to prove rigorous convergence results for gPC. However, the quality of the estimates obtained from a very limited number of simulations is questionable. The main reason is that these techniques do not attempt to quantify the epistemic uncertainty induced by the limited number of simulations.

The quantification of the epistemic uncertainty induced by the limited number of simulations requires statistical methodologies and, specifically, a Bayesian approach. The first statistical approach to computer code surrogate building was put forward by Currin et al. [16] and Sacks et al. [42], both using Gaussian processes (GP). This work was put in a Bayesian framework by Currin et al. [17] and Welch et al. [50]. The first fully Bayesian treatment of the UP problem has its roots in the Bayes-Hermite quadrature of O’Hagan [36], and it was put in a modern context in O’Hagan et al. [38] and Oakley and O’Hagan [34]. Of great relevance is the Gaussian emulation machine for sensitivity analysis (GEM-SA) software of O’Hagan and Kennedy [37]. The work of the authors in [7, 10] constitutes a continuation along this path. The present chapter is a comprehensive review of the Bayesian approach to UP to date.

The outline of the chapter is as follows. It starts with a generic description of physical models and the computer emulators used for their evaluation, followed by the definition of the UP problem. Then, it discusses the Bayesian approach to UP introducing the concept of a Bayesian surrogate and showing how the epistemic uncertainty induced by limited observations can be represented. Then, it shows how the Bayesian framework can be materialized using GPs, by providing practical guidelines for the treatment of spatiotemporal multi-output responses, training the hyper-parameters of the model, and quantifying epistemic uncertainty due to limited data by sampling candidate surrogates. The chapter ends with three demonstrative examples, a synthetic one-dimensional example that clarifies some of the introduced concepts, a dynamical system with uncertain initial conditions, and a stochastic partial differential equation.

2 Methodology

2.1 Physical Models

A physical model is mathematically equivalent to a multi-output function,
$$\displaystyle{ \mathbf{f} : \mathcal{X} = \mathcal{X}_{s} \times \mathcal{X}_{t} \times \mathcal{X}_{\xi }\rightarrow \mathcal{Y}, }$$
(1)
where \(\mathcal{X}_{s} \subset \mathbb{R}^{d_{s}}\), with \(d_{s} = 0,1,2\), or 3, is the spatial domain; \(\mathcal{X}_{t} \subset \mathbb{R}^{d_{t}}\), with \(d_{t} = 0\) or 1, is the time domain; \(\mathcal{X}_{\xi }\subset \mathbb{R}^{d_{\xi }}\) is the parameter domain; and \(\mathcal{Y}\subset \mathbb{R}^{d_{y}}\) is the output space. Note that under this notation, \(d_{s} = 0\), or \(d_{t} = 0\), is interpreted as if \(\mathbf{f}(\cdot )\) has no spatial or time components, respectively. One thinks of \(\mathbf{f}(\mathbf{x}_{s},t,\boldsymbol{\xi })\) as the model response at the spatial location \(\mathbf{x}_{s} \in \mathcal{X}_{s}\) at time \(t \in \mathcal{X}_{t}\) when the parameters \(\boldsymbol{\xi }\in \mathcal{X}_{\xi }\) are used. The parameters, \(\boldsymbol{\xi }\), should specify everything that is required in order to provide a complete description of the system. This covers any physical parameters, boundary conditions, external forcings, etc.

2.1.1 Example: Dynamical Systems

Let \(\mathbf{f}(t,\boldsymbol{\xi }) = \mathbf{z}(t;\mathbf{z}_{0}(\boldsymbol{\xi }_{1}),\boldsymbol{\xi }_{2})\) be the solution of the initial value problem
$$\displaystyle{ \begin{array}{ccc} \dot{\mathbf{z}} & =&\mathbf{h}(\mathbf{z},\boldsymbol{\xi }_{2}), \\ \mathbf{z}(0)& =& \mathbf{z}_{0}(\boldsymbol{\xi }_{1}),\end{array} }$$
(2)
with \(\boldsymbol{\xi }= (\boldsymbol{\xi }_{1},\boldsymbol{\xi }_{2})\), and \(\mathbf{h} : \mathbb{R}^{q} \times \mathcal{X}_{\xi }\rightarrow \mathbb{R}^{q}\). Here \(\boldsymbol{\xi }_{1}\) and \(\boldsymbol{\xi }_{2}\) are any parameters that affect the initial conditions and the dynamics, respectively. One has \(d_{s} = 0\), \(d_{t} = 1\), and \(d_{y} = q\).
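For illustration, a minimal Python sketch of such a model is given below. The damped-oscillator right-hand side, the parameter dimensions, and the use of scipy.integrate.solve_ivp are illustrative assumptions, not part of the chapter's development; the sketch merely shows how \(\mathbf{f}(t,\boldsymbol{\xi })\) can be evaluated by integrating Equation (2) for a given realization of \(\boldsymbol{\xi }= (\boldsymbol{\xi }_{1},\boldsymbol{\xi }_{2})\).

    import numpy as np
    from scipy.integrate import solve_ivp

    def f(t, xi):
        """Physical model f(t, xi) = z(t; z0(xi_1), xi_2) of Equation (2).

        Illustrative choice: a damped oscillator with q = 2, where xi_1 scales the
        initial state and xi_2 is the damping coefficient.
        """
        xi_1, xi_2 = xi
        z0 = xi_1 * np.array([1.0, 0.0])                            # z(0) = z0(xi_1)
        rhs = lambda t, z: np.array([z[1], -z[0] - xi_2 * z[1]])    # h(z, xi_2)
        sol = solve_ivp(rhs, (0.0, t), z0, t_eval=[t], rtol=1e-8)
        return sol.y[:, -1]                                         # z(t), a vector in R^q

    # Model response at t = 1.5 for one realization of the parameters.
    print(f(1.5, (0.8, 0.3)))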

2.1.2 Example: Partial Differential Equation

In flow through porous media applications, \(\boldsymbol{\xi }\) parametrizes the permeability and the porosity fields, e.g., via a Karhunen-Loève expansion (KLE). Here, \(d_{s} = 3\), \(d_{t} = 1\), and \(d_{y} = 4\), with
$$\displaystyle{ \mathbf{f}(\mathbf{x}_{s},t,\boldsymbol{\xi }) = \left (p(\mathbf{x}_{s},t,\boldsymbol{\xi }),v_{1}(\mathbf{x}_{s},t,\boldsymbol{\xi }),v_{2}(\mathbf{x}_{s},t,\boldsymbol{\xi }),v_{3}(\mathbf{x}_{s},t,\boldsymbol{\xi })\right ), }$$
(3)
where \(p(\mathbf{x}_{s},t,\boldsymbol{\xi })\) and \(v_{i}(\mathbf{x}_{s},t,\boldsymbol{\xi }),i = 1,2,3\), are the pressure and the i-th component of the velocity field of the fluid, respectively.

2.2 Computer Emulators

A computer emulator, \(\mathbf{f}_{c}(\cdot )\), of a physical model, \(\mathbf{f}(\cdot )\), is a function that reports the values of the physical model at a given set of spatial locations,
$$\displaystyle{ \mathbf{X}_{s} = \left \{\mathbf{x}_{s,1},\ldots ,\mathbf{x}_{s,n_{s}}\right \}, }$$
(4)
and at specific times,
$$\displaystyle{ \mathbf{X}_{t} = \left \{t_{1},\ldots ,t_{n_{t}}\right \}, }$$
(5)
for any realization of the parameters, \(\boldsymbol{\xi }\). That is,
$$\displaystyle{ \mathbf{f}_{c} : \mathcal{X}_{\xi }\rightarrow \mathcal{Y}^{n_{s}n_{t} } \subset \mathbb{R}^{n_{s}n_{t}d_{y} }, }$$
(6)
with
$$\displaystyle{ \mathbf{f}_{c}(\boldsymbol{\xi }) = \left (\mathbf{f}(\mathbf{x}_{s,1},t_{1},\boldsymbol{\xi })^{T}\ldots \mathbf{f}(\mathbf{x}_{ s,1},t_{n_{t}},\boldsymbol{\xi })^{T}\ldots \mathbf{f}(\mathbf{x}_{ s,n_{s}},t_{1},\boldsymbol{\xi })^{T}\ldots \mathbf{f}(\mathbf{x}_{ s,n_{s}},t_{n_{t}},\boldsymbol{\xi })^{T}\right )^{T}. }$$
(7)
For example, for the case of dynamical systems, \(\mathbf{X}_{t}\) may be the time steps on which the numerical integration routine reports the solution. In the porous flow example, \(\mathbf{X}_{s}\) may be the centers of the cells of a finite volume scheme and \(\mathbf{X}_{t}\) the time steps on which the numerical integration reports the solution.
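As a sketch of this bookkeeping, the following Python snippet assembles \(\mathbf{f}_{c}(\boldsymbol{\xi })\) from a generic model \(\mathbf{f}(\mathbf{x}_{s},t,\boldsymbol{\xi })\) in the ordering of Equation (7); the particular toy model, grids, and function names are placeholders chosen only for illustration.

    import numpy as np

    def make_emulator(f, X_s, X_t):
        """Return f_c(xi), i.e., the values f(x_s, t, xi) stacked as in Equation (7).

        f   : callable f(x_s, t, xi) returning a vector of length d_y
        X_s : list of spatial locations x_{s,1}, ..., x_{s,n_s}
        X_t : list of times t_1, ..., t_{n_t}
        """
        def f_c(xi):
            # Spatial index on the outside, time index on the inside, as in Equation (7).
            blocks = [f(x_s, t, xi) for x_s in X_s for t in X_t]
            return np.concatenate(blocks)        # vector of length n_s * n_t * d_y
        return f_c

    # Toy example: a scalar response (d_y = 1) on a 2-point spatial grid and 3 time steps.
    f = lambda x_s, t, xi: np.array([np.sin(xi[0] * t) * np.exp(-x_s[0])])
    f_c = make_emulator(f, X_s=[np.array([0.0]), np.array([1.0])], X_t=[0.0, 0.5, 1.0])
    print(f_c(np.array([2.0])).shape)            # (6,)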
Alternatively, the computer code represents the physical model in an internal basis, e.g., spectral elements. In this case,
$$\displaystyle{ \mathbf{f}(\mathbf{x}_{s},t,\boldsymbol{\xi }) \approx \sum _{i=1}^{n_{c} }\mathbf{c}_{i}(t,\boldsymbol{\xi })\psi _{i}(\mathbf{x}_{s}), }$$
(8)
where \(\psi _{i}(\mathbf{x}_{s})\) are known spatial basis functions. Then, one may think of the computer code as the function that returns the coefficients \(\mathbf{c}_{i}(t,\boldsymbol{\xi }) \in \mathbb{R}^{d_{y}}\), i.e.,
$$\displaystyle\begin{array}{rcl} \mathbf{f}_{c} : \mathcal{X}_{\xi }\rightarrow \mathbb{R}^{n_{c}n_{t}d_{y} }& &{}\end{array}$$
(9)
$$\displaystyle\begin{array}{rcl} \mathbf{f}_{c}(\boldsymbol{\xi }) = \left (\mathbf{c}_{1}(t_{1},\boldsymbol{\xi })^{T}\ldots \mathbf{c}_{ 1}(t_{n_{t}},\boldsymbol{\xi })^{T}\ldots \mathbf{c}_{ n_{c}}(t_{1},\boldsymbol{\xi })^{T}\ldots \mathbf{c}_{ n_{c}}(t_{n_{t}},\boldsymbol{\xi })^{T}\right )^{T}.& &{}\end{array}$$
(10)

2.3 The Uncertainty Propagation Problem

In all problems of physical relevance, the parameters, \(\boldsymbol{\xi }\), are not known explicitly. Without loss of generality, uncertainty about \(\boldsymbol{\xi }\) may be represented by assigning to it a probability density, \(p(\boldsymbol{\xi })\). Accordingly, \(\boldsymbol{\xi }\) is referred to as the stochastic input. The goal of uncertainty propagation is to study the effects of \(p(\boldsymbol{\xi })\) on the model output \(\mathbf{f}(\mathbf{x}_{s},t,\boldsymbol{\xi })\). Usually, it is sufficient to be able to compute low-order statistics such as the mean,
$$\displaystyle{ \mathbb{E}_{\boldsymbol{\xi }}[\mathbf{f}(\cdot )](\mathbf{x}_{s},t) =\int \mathbf{f}(\mathbf{x}_{s},t,\boldsymbol{\xi })p(\boldsymbol{\xi })d\boldsymbol{\xi }, }$$
(11)
the covariance matrix function:
$$\displaystyle\begin{array}{rcl} \mathbb{C}_{\boldsymbol{\xi }}[\mathbf{f}(\cdot )]((\mathbf{x}_{s},t);(\mathbf{x}_{s}',t'))& =& \mathbb{E}_{\boldsymbol{\xi }}\Big[\left \{\mathbf{f}(\mathbf{x}_{s},t,\boldsymbol{\xi }) - \mathbb{E}_{\boldsymbol{\xi }}[\mathbf{f}(\cdot )](\mathbf{x}_{s},t)\right \} \\ & & \quad \left \{\mathbf{f}(\mathbf{x}_{s}',t',\boldsymbol{\xi }) - \mathbb{E}_{\boldsymbol{\xi }}[\mathbf{f}(\cdot )](\mathbf{x}_{s}',t')\right \}^{T}\Big],{}\end{array}$$
(12)
the variance of component i as a function of space and time:
$$\displaystyle{ \mathbb{V}_{\boldsymbol{\xi },i}[\mathbf{f}(\cdot )](\mathbf{x}_{s},t) = \mathbb{C}_{\boldsymbol{\xi },ii}[\mathbf{f}(\cdot )]((\mathbf{x}_{s},t);(\mathbf{x}_{s},t)), }$$
(13)
for \(i = 1,\ldots ,d_{y}\), or low-dimensional full statistics, e.g., point-wise probability densities of one component of the response.

The focus of this chapter is restricted to non-intrusive uncertainty propagation methods. These techniques estimate the statistics of the physical model using the computer code \(\mathbf{f}_{c}(\cdot )\) as a black box. In particular, a fully Bayesian approach using Gaussian processes is developed.
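To fix ideas, the following minimal Python sketch estimates the mean and the variance of Equations (11) and (13) by plain Monte Carlo, treating a cheap, purely illustrative code \(\mathbf{f}_{c}\) with no spatial or time components as a black box; the code, the input density, and the sample size are assumptions made only for this illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative stand-ins: a cheap black-box code and an input density p(xi).
    f_c = lambda xi: np.array([np.sin(xi[0]) + 0.1 * xi[1] ** 2])   # maps R^2 -> R^1
    sample_xi = lambda n: rng.standard_normal((n, 2))               # xi ~ N(0, I)

    # Plain Monte Carlo: sample xi, evaluate the code, post-process the recorded outputs.
    n = 10_000
    Y = np.array([f_c(xi) for xi in sample_xi(n)])   # shape (n, d_y)
    mean_est = Y.mean(axis=0)                        # point estimate of Equation (11)
    var_est = Y.var(axis=0, ddof=1)                  # point estimate of Equation (13)
    print(mean_est, var_est)

For an expensive simulator, only a handful of such evaluations would be affordable, which is precisely the situation addressed by the Bayesian approach developed below.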

2.4 The Bayesian Approach to Uncertainty Propagation

Assume that one has made n simulations and collected the following data set:
$$\displaystyle{ \mathcal{D} = \left \{(\mathbf{x}_{i},\mathbf{y}_{i})\right \}_{i=1}^{n}, }$$
(14)
where \(\mathbf{x}_{i} \in \mathcal{X}\), i.e., \(\mathbf{x}_{i} = (\mathbf{x}_{s,i},t_{i},\boldsymbol{\xi }_{i})\), and \(\mathbf{y}_{i} = \mathbf{f}(\mathbf{x}_{i})\). The problem is to estimate the statistics of the response, y, using only the simulations in \(\mathcal{D}\).

Classic approaches to uncertainty propagation use \(\mathcal{D}\) to build a surrogate surface \(\hat{\mathbf{f}}(\mathbf{x})\) of the original model f(x). Then, they characterize the uncertainty on the response y by propagating the uncertainty of the stochastic inputs, \(\boldsymbol{\xi }\), through the surrogate. In some cases, e.g., gPC [26], this can be done analytically. In general, since the surrogate surface is cheap to evaluate, the uncertainty of the stochastic inputs, \(\boldsymbol{\xi }\), is propagated through it via a simple Monte Carlo procedure [31, 41]. Such a procedure can provide point estimates of any statistic. However, what can one say about the accuracy of these estimates? This question becomes important especially when the number of simulations, n, is very small. The Bayesian approach to uncertainty propagation can address this issue by providing confidence intervals for the estimated statistics.

2.4.1 Bayesian Surrogates

The Bayesian approach is based on the idea of a Bayesian surrogate. A Bayesian surrogate is a probability measure on the space of surrogates which is compatible with one’s prior beliefs about the nature of f(x) as well as the data \(\mathcal{D}\). A precise mathematical meaning of these concepts is given in the Gaussian process section. For the moment – and without loss of generality – assume that one has a parameterized family of surrogates, \(\hat{\mathbf{f}}(\cdot ;\boldsymbol{\theta })\), where \(\boldsymbol{\theta }\) is a finite dimensional random variable, with PDF \(p(\boldsymbol{\theta })\). Intuitively, think of \(\hat{\mathbf{f}}(\cdot ;\boldsymbol{\theta })\) as a candidate surrogate with parameters \(\boldsymbol{\theta }\) and that, before observing any data, \(\boldsymbol{\theta }\) may take any value compatible with the prior probability \(p(\boldsymbol{\theta })\). In addition, let \(p(\mathcal{D}\vert \boldsymbol{\theta })\) be the likelihood of the simulations under the model. Using Bayes rule, one may characterize his state of knowledge about \(\boldsymbol{\theta }\) via the posterior PDF:
$$\displaystyle{ p(\boldsymbol{\theta }\vert \mathcal{D}) = \frac{p(\mathcal{D}\vert \boldsymbol{\theta })p(\boldsymbol{\theta })} {p(\mathcal{D})} , }$$
(15)
where the normalization constant,
$$\displaystyle{ p(\mathcal{D}) =\int p(\mathcal{D}\vert \boldsymbol{\theta })p(\boldsymbol{\theta })d\boldsymbol{\theta }, }$$
(16)
is known as the evidence. The posterior of \(\boldsymbol{\theta }\), Equation (15), neatly encodes everything one has learned about the true response, f(⋅ ), after seeing the simulations in \(\mathcal{D}\). How can one use this information to characterize his state of knowledge about the statistics of f(⋅ )?

2.4.2 Predictive Distribution of Statistics

A statistic is an operator \(\mathcal{Q}[\cdot ]\) that acts on the response surface. Examples of statistics are the mean, \(\mathbb{E}_{\boldsymbol{\xi }}[\cdot ]\), of Equation (11); the covariance, \(\mathbb{C}_{\boldsymbol{\xi }}[\cdot ]\), of Equation (12); and the variance, \(\mathbb{V}_{\boldsymbol{\xi }}[\cdot ]\), of Equation (13). Using the posterior of \(\boldsymbol{\theta }\) (see Equation (15)), the state of knowledge about an arbitrary statistic \(\mathcal{Q}[\cdot ]\) is characterized via
$$\displaystyle{ p(Q\vert \mathcal{D}) =\int \delta \left (Q -\mathcal{Q}[\hat{\mathbf{f}}(\cdot ;\boldsymbol{\theta })]\right )p(\boldsymbol{\theta }\vert \mathcal{D})d\boldsymbol{\theta }. }$$
(17)
Equation (17) contains everything that is known about the value of \(\mathcal{Q}[\cdot ]\), given the observations in \(\mathcal{D}\). The quantity \(p(\mathcal{Q}\vert \mathcal{D})\) is known as the predictive distribution of the statistic \(\mathcal{Q}[\cdot ]\) given \(\mathcal{D}\). The uncertainty in \(p(\mathcal{Q}\vert \mathcal{D})\) corresponds to the epistemic uncertainty induced by one’s limited-data budget. The Bayesian approach is the only one that can naturally characterize this epistemic uncertainty.
Equation (17) applies generically to any Bayesian surrogate and to any statistic \(\mathcal{Q}[\cdot ]\). For the case of a Gaussian process surrogate, it is possible to develop semi-analytic approximations to Equation (17) when \(\mathcal{Q}[\cdot ]\) is the mean or the variance of the response. However, in general one has to think of Equation (17) in a generative manner in the sense that it enables sampling of possible statistics in a two-step procedure:
  1. Sample a \(\boldsymbol{\theta }\) from \(p(\boldsymbol{\theta }\vert \mathcal{D})\) of Equation (15).
  2. Evaluate \(\mathcal{Q}[\hat{\mathbf{f}}(\cdot ;\boldsymbol{\theta })]\).
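A generic Python sketch of this two-step procedure is given below. The posterior sampler, the surrogate, and the statistic used in the example are illustrative stand-ins (a scalar \(\boldsymbol{\theta }\) and a toy surrogate), not the chapter's actual Gaussian process construction, which is developed in the following sections.

    import numpy as np

    rng = np.random.default_rng(1)

    def sample_statistic(sample_posterior_theta, surrogate, Q, num_samples=100):
        """Draw samples from p(Q | D) of Equation (17) via the two-step procedure."""
        samples = []
        for _ in range(num_samples):
            theta = sample_posterior_theta()                      # step 1: theta ~ p(theta | D)
            samples.append(Q(lambda xi: surrogate(xi, theta)))    # step 2: evaluate Q[f_hat(.; theta)]
        return np.array(samples)

    # Illustrative stand-ins: a scalar theta posterior and a toy surrogate family.
    sample_posterior_theta = lambda: 0.1 * rng.standard_normal()
    surrogate = lambda xi, theta: np.sin(xi) + theta

    # Statistic: the mean of Equation (11), estimated by Monte Carlo over xi ~ N(0, 1).
    xi_samples = rng.standard_normal(2000)
    Q_mean = lambda g: np.mean(g(xi_samples))

    Q_samples = sample_statistic(sample_posterior_theta, surrogate, Q_mean)
    # The spread of Q_samples quantifies the epistemic uncertainty in the estimated mean.
    print(Q_samples.mean(), Q_samples.std())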

     

2.5 Gaussian Process Regression

For simplicity, the section starts by developing the theory for one-dimensional outputs, i.e., \(d_{y} = 1\).

2.5.1 Modeling Prior Knowledge About the Response

An experienced scientist or engineer has some knowledge about the response function, f(⋅ ), even before running any simulations. For example, he might know that f(⋅ ) cannot exceed, or be smaller than, certain values, that it satisfies translation invariance, that it is periodic along certain inputs, etc. This knowledge is known as prior knowledge.

Prior knowledge can be precise, e.g., the response is exactly twice differentiable, the period of the first input is 2π, etc., or it can be vague, e.g., the probability that the period T takes any particular value is p(T), the probability that the length-scale \(\ell_{1}\) of the first input takes any particular value is \(p(\ell_{1})\), etc. When one is dealing with vague prior knowledge, he may refer to it as prior belief. Almost always, one’s prior knowledge about a computer code is a prior belief.

Prior knowledge about f(⋅ ) can be modeled by a probability measure on the space of functions from \(\mathcal{X}\) to \(\mathbb{R}\). This probability measure encodes one’s prior beliefs, in the sense that it assigns probability one to the set of functions that are consistent with it. A Gaussian process is a great way to represent this probability measure.

A Gaussian process is a generalization of a multivariate Gaussian random variable to infinite dimensions [39]. In particular, f(⋅ ) is a Gaussian process with mean function \(m : \mathcal{X} \rightarrow \mathbb{R}\) and covariance function \(k : \mathcal{X}\times \mathcal{X} \rightarrow \mathbb{R}\), i.e.,
$$\displaystyle{ f(\cdot ) \sim \mathop{\mathrm{GP}}\nolimits \left (f(\cdot )\vert m(\cdot ),k(\cdot ,\cdot )\right ), }$$
(18)
if and only if for any collection of input points \(\mathbf{X} =\{ \mathbf{x}_{1},\ldots ,\mathbf{x}_{n}\} \subset \mathcal{X}\), the corresponding outputs \(\mathbf{Y} =\{ y_{1} = f(\mathbf{x}_{1}),\ldots ,y_{n} = f(\mathbf{x}_{n})\}\) are distributed according to:
$$\displaystyle{ \mathbf{Y}\vert \mathbf{X} \sim \mathcal{N}_{n}\left (\mathbf{Y}\vert \mathbf{m}(\mathbf{X}),\mathbf{k}(\mathbf{X},\mathbf{X})\right ), }$$
(19)
where \(\mathbf{m}(\mathbf{X}) = (m(\mathbf{x}_{i}))_{i}\), \(\mathbf{k}(\mathbf{X},\mathbf{X}') = (k(\mathbf{x}_{i},\mathbf{x}_{j}'))_{ij}\), and \(\mathcal{N}_{n}(\cdot \vert \boldsymbol{\mu },\boldsymbol{\varSigma })\) is the probability density function of an n-dimensional multivariate normal random variable with mean vector \(\boldsymbol{\mu }\) and covariance matrix \(\boldsymbol{\varSigma }\).
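The defining property of Equation (19) can be checked numerically. The short Python sketch below draws prior sample paths at a finite set of inputs; the zero mean and the simple exponential covariance used here are illustrative choices only (a concrete covariance family is discussed later in this section).

    import numpy as np

    rng = np.random.default_rng(2)

    # Illustrative mean and covariance functions on a one-dimensional input space.
    m = lambda x: 0.0
    k = lambda x, xp: np.exp(-0.5 * (x - xp) ** 2 / 0.3 ** 2)

    # Finite collection of inputs X = {x_1, ..., x_n}.
    X = np.linspace(0.0, 1.0, 20)
    mX = np.array([m(x) for x in X])
    KXX = np.array([[k(x, xp) for xp in X] for x in X])

    # By Equation (19), Y | X ~ N_n(m(X), k(X, X)); draw three prior sample paths.
    Y = rng.multivariate_normal(mX, KXX + 1e-10 * np.eye(len(X)), size=3)
    print(Y.shape)   # (3, 20)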

But how does Equation (18) encode one’s prior knowledge about the code? It does so through the choice of the mean and the covariance functions. The mean function can be an arbitrary function. Its role is to encode any generic trends about the response. The covariance function can be any positive semi-definite function. It is used to model the signal strength of the response and how it varies across \(\mathcal{X}\), the similarity (correlation) of the response at two distinct input points, noise, regularity, periodicity, invariance, and more. The choice of mean and covariance functions is discussed more elaborately in what follows.

Vague prior knowledge, i.e., prior beliefs, may be modeled by parameterizing the mean and covariance functions and assigning prior probabilities to their parameters. In particular, the following generic forms for the mean and covariance functions are considered:
$$\displaystyle{ m : \mathcal{X}\times \boldsymbol{\varPsi }_{m} \rightarrow \mathbb{R}, }$$
(20)
and
$$\displaystyle{ k : \mathcal{X}\times \mathcal{X}\times \boldsymbol{\varPsi }_{k} \rightarrow \mathbb{R}, }$$
(21)
respectively. Here
$$\displaystyle{ \boldsymbol{\varPsi }_{m} \subset \mathbb{R}^{d_{m} }\;\text{and}\;\boldsymbol{\varPsi }_{k} \subset \mathbb{R}^{d_{k} }, }$$
(22)
and for all \(\boldsymbol{\psi }_{k} \in \boldsymbol{\varPsi }_{k}\) the function \(k(\cdot ,\cdot ;\boldsymbol{\psi }_{k})\) is positive definite. The parameters \(\boldsymbol{\psi }_{m} \in \boldsymbol{\varPsi }_{m}\) and \(\boldsymbol{\psi }_{k} \in \boldsymbol{\varPsi }_{k}\), of the mean and covariance functions, respectively, are known as hyper-parameters of the model. Using this notation, the most general model considered in this work is:
$$\displaystyle\begin{array}{rcl} f(\cdot )\vert \boldsymbol{\psi }_{m},\boldsymbol{\psi }_{k}& \sim & \mathop{\mathrm{GP}}\nolimits \left (f(\cdot )\vert m(\cdot ;\boldsymbol{\psi }_{m}),k(\cdot ,\cdot ;\boldsymbol{\psi }_{k})\right ),{}\end{array}$$
(23)
$$\displaystyle\begin{array}{rcl} \boldsymbol{\psi }_{m}& \sim & p(\boldsymbol{\psi }_{m}),{}\end{array}$$
(24)
$$\displaystyle\begin{array}{rcl} \boldsymbol{\psi }_{k}& \sim & p(\boldsymbol{\psi }_{k}).{}\end{array}$$
(25)
For notational economy one writes:
$$\displaystyle{ \boldsymbol{\psi }:=\{\boldsymbol{\psi } _{m},\boldsymbol{\psi }_{k}\}, }$$
(26)
and
$$\displaystyle{ p(\boldsymbol{\psi }) := p(\boldsymbol{\psi }_{m})p(\boldsymbol{\psi }_{k}). }$$
(27)
Choosing the Mean Function
The role of the mean function is to model one’s prior beliefs about the existence of systematic trends. One of the most common choices for the mean function is the generalized linear model:
$$\displaystyle{ m(\mathbf{x};\mathbf{b}) = \mathbf{b}^{T}\mathbf{h}(\mathbf{x}), }$$
(28)
where \(\mathbf{h} : \mathbb{R}^{d} \rightarrow \mathbb{R}^{d_{h}}\) is an arbitrary function and \(\mathbf{b} =:\boldsymbol{\psi } _{m} \in \boldsymbol{\varPsi }_{m} = \mathbb{R}^{d_{m}},d_{m} = d_{h}\), are the hyper-parameters known as weights. A popular prior for the weights, b, is the improper, “non-informative,” uniform:
$$\displaystyle{ p(\boldsymbol{\psi }_{m}) = p(\mathbf{b}) \propto 1. }$$
(29)
Another commonly used prior is the multivariate normal:
$$\displaystyle{ p(\mathbf{b}\vert \boldsymbol{\mu }_{\mathbf{b}},\boldsymbol{\varSigma }_{\mathbf{b}}) = \mathcal{N}_{d_{m}}\left (\mathbf{b}\vert \boldsymbol{\mu }_{\mathbf{b}},\boldsymbol{\varSigma }_{\mathbf{b}}\right ), }$$
(30)
where \(\boldsymbol{\mu }_{\mathbf{b}} \in \mathbb{R}^{d_{m}}\), and \(\boldsymbol{\varSigma }_{\mathbf{b}} \in \mathbb{R}^{d_{m}\times d_{m}}\) is positive definite. It can be shown that in both choices, Equation (29) or Equation (30), it is actually possible to integrate b out of the model [15]. Some examples of generalized linear models are:
  1. The constant mean function, \(d_{h} = 1\),
    $$\displaystyle{ h(x) = 1; }$$
    (31)
     
  2. The linear mean function, \(d_{h} = d + 1\),
    $$\displaystyle{ \mathbf{h}(\mathbf{x}) = (1,x_{1},\ldots ,x_{d}); }$$
    (32)
     
  3. The generalized polynomial chaos (gPC) mean function in which \(\mathbf{h}(\cdot ) = (h_{1}(\cdot ),\ldots ,h_{d_{h}}(\cdot ))\), with the \(h_{i} : \mathbb{R}^{d} \rightarrow \mathbb{R},i = 1,\ldots ,d_{h}\) being polynomials with degree up to ρ and orthogonal with respect to a measure μ(⋅ ), i.e.,
    $$\displaystyle{ \int h_{i}(x)h_{j}(x)d\mu (x) =\delta _{ij}. }$$
    (33)
    For uncertainty propagation applications, μ(⋅ ) is usually a probability measure. For many well-known probability measures, the corresponding gPC is known [53]. For arbitrary probability measures, the polynomials can only be constructed numerically, e.g., [24]. Excellent Fortran code for the construction of orthogonal polynomials can be found in [25]. An easy-to-use Python interface is available by Bilionis [6].
     
  4. The Fourier mean function, defined for d = 1, in which \(\mathbf{h}(\cdot ) = (h_{1}(\cdot ),\ldots ,h_{d_{h}}(\cdot ))\), with the \(h_{i} : \mathbb{R} \rightarrow \mathbb{R},i = 1,\ldots ,d_{h}\) being trigonometric functions supporting certain frequencies \(\omega _{i}\), i.e.,
    $$\displaystyle{ h_{2i}(x) =\sin (\omega _{i}x),\;\text{and}\;h_{2i+1}(x) =\cos (\omega _{i}x). }$$
    (34)
     
Choosing the Covariance Function
Discussing the choice of the covariance function in great depth is beyond the scope of the chapter. Instead, the interested reader is pointed to the excellent account by Rasmussen and Williams [39]. Here, some notable aspects that relate directly to representing prior beliefs about computer codes are discussed:
  1. Modeling measurement noise. Assume that measurements of f(x) are noisy and that this noise is Gaussian with variance \(\sigma ^{2}\). GPs can account for this fact if one adds a Kronecker delta-like function to the covariance, i.e., if one works with a covariance of the form:
    $$\displaystyle{ k(\mathbf{x},\mathbf{x}';\boldsymbol{\psi }_{k}) = k_{0}(\mathbf{x},\mathbf{x}';\boldsymbol{\psi }_{k_{0}}) +\sigma ^{2}\delta (\mathbf{x} -\mathbf{x}'). }$$
    (35)
    Note that \(\delta (\mathbf{x} -\mathbf{x}')\) here is one if and only if x and x′ correspond to exactly the same measurement. Even though most computer simulators do not return noisy outputs, some do, e.g., molecular dynamics simulations. So, being able to model noise is useful. Apart from that, it turns out that including a small \(\sigma ^{2}\), even when there is no noise, is beneficial because it can improve the stability of the computations. When \(\sigma ^{2}\) is included for this reason, it is known as a “nugget” or as a “jitter.” In addition to the improved stability, the nugget can also lead to better predictive accuracy [27].
     
  2. Modeling regularity. It can be shown that the regularity properties of the covariance function are directly associated with the regularity of the functions sampled from the probability measure induced by the GP (Equations (18) and (23)). For example, if \(k(\mathbf{x},\mathbf{x};\boldsymbol{\psi }_{k})\) is continuous at x, then samples \(f(\cdot )\) from Equation (23) are continuous almost surely (a.s.) at x. If \(k(\mathbf{x},\mathbf{x};\boldsymbol{\psi }_{k})\) is ρ times differentiable at x, then samples \(f(\cdot )\) from Equation (23) are ρ times differentiable a.s. at x.

     
  3. Modeling invariance. If \(k(\cdot ,\cdot ;\boldsymbol{\psi }_{k})\) is invariant with respect to a transformation T, i.e., \(k(\mathbf{x},T\mathbf{x}';\boldsymbol{\psi }_{k}) = k(T\mathbf{x},\mathbf{x}';\boldsymbol{\psi }_{k}) = k(\mathbf{x},\mathbf{x}';\boldsymbol{\psi }_{k})\), then samples f(x) from Equation (23) are invariant with respect to the same transformation a.s. In particular, if \(k(\cdot ,\cdot ;\boldsymbol{\psi }_{k})\) is periodic, then samples f(x) from Equation (23) are periodic a.s.

     
  4. Modeling additivity. Assume that the covariance function is additive, i.e.,
    $$\displaystyle{ k(\mathbf{x},\mathbf{x}';\boldsymbol{\psi }_{k}) =\sum _{ i=1}^{d}k_{ i}(x_{i},x_{i}';\boldsymbol{\psi }_{k_{i}}) +\sum _{1\leq i<j\leq d}k_{ij}\left ((x_{i},x_{j}),(x_{i}',x_{j}');\boldsymbol{\psi }_{k_{ij}}\right )+\ldots , }$$
    (36)
    with \(\boldsymbol{\psi }_{k} =\{\boldsymbol{\psi } _{k_{i}}\} \cup \{\boldsymbol{\psi }_{k_{ij}}\}\). If \(f_{i}(\mathbf{x}),f_{ij}(\mathbf{x}),\ldots\) are samples from Equation (23) with covariances \(k_{i}(\cdot ,\cdot ;\boldsymbol{\psi }_{k_{i}})\), \(k_{ij}(\cdot ,\cdot ;\boldsymbol{\psi }_{k_{ij}})\), \(\ldots\), respectively, then
    $$\displaystyle{f(\mathbf{x}) =\sum _{ i=1}^{d}f_{ i}(x_{i}) +\sum _{1\leq i<j\leq d}f_{ij}(x_{i},x_{j})+\ldots ,}$$
    is a sample from Equation (23) with the additive covariance defined in Equation (36). These ideas can be used to deal effectively with high-dimensional inputs [22, 23].
     
The most commonly used covariance function for representing one’s prior knowledge about a computer code is of the form of Equation (35) with \(k_{0}(\cdot ,\cdot ;\boldsymbol{\psi }_{k_{0}})\) being the squared exponential (SE) covariance function:
$$\displaystyle{ k_{0}(\mathbf{x},\mathbf{x}';\boldsymbol{\psi }_{k_{0}}) = s^{2}\exp \left \{-\sum _{ i=1}^{d}\frac{(x_{i} - x_{i}')^{2}} {2\ell_{i}^{2}} \right \}, }$$
(37)
where \(\boldsymbol{\psi }_{k_{0}} =\{ s,\ell_{1},\ldots ,\ell_{d}\}\). Note that s > 0 may be interpreted as the signal strength and \(\ell_{i} > 0\) as the length scale of the i-th input dimension, \(i = 1,\ldots ,d\). Prior beliefs about \(\boldsymbol{\psi }_{k} =\{\sigma ^{2}\} \cup \{\boldsymbol{\psi }_{k_{0}}\}\) are modeled by:
$$\displaystyle{ p(\boldsymbol{\psi }_{k}) = p(\sigma )p(\boldsymbol{\psi }_{k_{0}}), }$$
(38)
with
$$\displaystyle{ p(\boldsymbol{\psi }_{k_{0}}) = p(s)\prod _{i=1}^{d}p(\ell_{ i}). }$$
(39)
Typically, \(p(\sigma ),p(s),p(\ell_{1}),\ldots ,p(\ell_{d})\) are chosen to be Jeffreys priors or exponential probability densities with fixed parameters.
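A minimal Python implementation of the covariance of Equations (35) and (37) is sketched below; the function name, array layout, and hyper-parameter values are illustrative assumptions.

    import numpy as np

    def se_cov(X, Xp, s, ell, sigma=0.0):
        """Squared exponential covariance, Equation (37), plus the delta term of Equation (35).

        X, Xp : arrays of shape (n, d) and (m, d);  s : signal strength;
        ell   : length scales (one per input);      sigma : noise / nugget standard deviation.
        The delta term contributes only when X and Xp are the same set of points.
        """
        D2 = ((X[:, None, :] - Xp[None, :, :]) ** 2 / ell ** 2).sum(axis=-1)
        K = s ** 2 * np.exp(-0.5 * D2)
        if X is Xp:
            K = K + sigma ** 2 * np.eye(X.shape[0])
        return K

    # Covariance matrix k(X, X; psi_k) for five 2-dimensional inputs.
    X = np.random.default_rng(3).random((5, 2))
    K = se_cov(X, X, s=1.0, ell=np.array([0.2, 0.5]), sigma=1e-3)
    print(np.linalg.cholesky(K)[0, 0])   # the small nugget keeps the factorization well conditioned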

2.5.2 Conditioning on Observations of the Response

As seen in the previous section, one’s prior knowledge about the response can be modeled in terms of a generic GP defined by Equations (23), (24), and (25). Now, assume that one makes n simulations at inputs \(\mathbf{x}_{1},\ldots ,\mathbf{x}_{n}\) and that he observes the responses \(y_{1} = f(\mathbf{x}_{1}),\ldots ,y_{n} = f(\mathbf{x}_{n})\). Write \(\mathbf{X} =\{ \mathbf{x}_{1},\ldots ,\mathbf{x}_{n}\}\) and \(\mathbf{Y} =\{ y_{1},\ldots ,y_{n}\}\) for the observed inputs and outputs, respectively. Abusing the mathematical notation slightly, the symbol \(\mathcal{D}\) is used to denote X and Y collectively (see Equation (14)). We refer to \(\mathcal{D}\) as the (observed) data. How does the observation of \(\mathcal{D}\) alter one’s state of knowledge about the response surface?

The answer to the aforementioned question comes after a straightforward application of Bayes rule and use of Kolmogorov’s theorem on the existence of random fields. One’s state of knowledge is characterized by a new GP,
$$\displaystyle{ f(\cdot )\vert \mathcal{D},\boldsymbol{\psi }\sim \mathop{\mathrm{GP}}\nolimits \left (f(\cdot )\vert m^{{\ast}}(\cdot ;\boldsymbol{\psi },\mathcal{D}),k^{{\ast}}(\cdot ,\cdot ;\boldsymbol{\psi },\mathcal{D})\right ), }$$
(40)
with mean function:
$$\displaystyle{ m^{{\ast}}(\mathbf{x};\boldsymbol{\psi },\mathcal{D}) := m(\mathbf{x};\boldsymbol{\psi }_{ m}) + \mathbf{k}(\mathbf{x},\mathbf{X};\boldsymbol{\psi }_{k})\mathbf{k}(\mathbf{X},\mathbf{X};\boldsymbol{\psi }_{k})^{-1}\left (\mathbf{Y} -\mathbf{m}(\mathbf{X};\boldsymbol{\psi }_{ m})\right ), }$$
(41)
covariance function:
$$\displaystyle{ k^{{\ast}}(\mathbf{x},\mathbf{x}';\boldsymbol{\psi },\mathcal{D}) := k(\mathbf{x},\mathbf{x}';\boldsymbol{\psi }_{ k}) -\mathbf{k}(\mathbf{x},\mathbf{X};\boldsymbol{\psi }_{k})\mathbf{k}(\mathbf{X},\mathbf{X};\boldsymbol{\psi }_{k})^{-1}\mathbf{k}(\mathbf{X},\mathbf{x}';\boldsymbol{\psi }_{ k}), }$$
(42)
and the posterior of the hyper-parameters:
$$\displaystyle{ p(\boldsymbol{\psi }\vert \mathcal{D}) = \frac{p(\mathcal{D}\vert \boldsymbol{\psi })p(\boldsymbol{\psi })} {p(\mathcal{D})} , }$$
(43)
where
$$\displaystyle{ p(\mathcal{D}\vert \boldsymbol{\psi }) := p(\mathbf{Y}\vert \mathbf{X},\boldsymbol{\psi }) = \mathcal{N}_{n}\left (\mathbf{Y}\vert \mathbf{m}(\mathbf{X};\boldsymbol{\psi }_{m}),\mathbf{k}(\mathbf{X},\mathbf{X};\boldsymbol{\psi }_{k})\right ), }$$
(44)
is the likelihood of \(\mathcal{D}\) induced by the defining property of the GP (Equation (19)), and \(p(\mathcal{D}) =\int p(\mathcal{D}\vert \boldsymbol{\psi })p(\boldsymbol{\psi })d\boldsymbol{\psi }\) is the evidence.
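For concreteness, a minimal Python sketch of the conditioning formulas (41) and (42) is given below, assuming generic callables for the prior mean and covariance and using a Cholesky factorization (with a small jitter) in place of the explicit matrix inverse; all names and the toy data are placeholders.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def condition_gp(m, k, X, Y):
        """Return the posterior mean m*(x) and covariance k*(x, x') of Equations (41)-(42)."""
        K = np.array([[k(xi, xj) for xj in X] for xi in X])        # k(X, X)
        L = cho_factor(K + 1e-8 * np.eye(len(X)))                  # jitter for numerical stability
        alpha = cho_solve(L, Y - np.array([m(x) for x in X]))      # k(X, X)^{-1} (Y - m(X))

        def m_star(x):
            kxX = np.array([k(x, xj) for xj in X])
            return m(x) + kxX @ alpha                              # Equation (41)

        def k_star(x, xp):
            kxX = np.array([k(x, xj) for xj in X])
            kXxp = np.array([k(xj, xp) for xj in X])
            return k(x, xp) - kxX @ cho_solve(L, kXxp)             # Equation (42)

        return m_star, k_star

    # Toy one-dimensional data, zero prior mean, SE covariance with length scale 0.2.
    m = lambda x: 0.0
    k = lambda x, xp: np.exp(-0.5 * (x - xp) ** 2 / 0.2 ** 2)
    X, Y = np.array([0.1, 0.4, 0.9]), np.array([0.0, 0.5, -0.3])
    m_star, k_star = condition_gp(m, k, X, Y)
    print(m_star(0.4), k_star(0.4, 0.4))   # nearly interpolates the data, with small variance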

2.5.3 Treating Space and Time by Using Separable Mean and Covariance Functions

The input x may include spatial, \(\mathbf{x}_{s}\); time, t; and stochastic, \(\boldsymbol{\xi }\), components (see Equation (1)). A computer code reports the response for a given \(\boldsymbol{\xi }\) (see Equation (7)) at a fixed set of \(n_{s}\) spatial locations, \(\mathbf{X}_{s}\), and \(n_{t}\) time instants, \(\mathbf{X}_{t}\) (see Equations (4) and (5), respectively). Now suppose that \(n_{\xi }\) observations of \(\boldsymbol{\xi }\) are to be made. Then, the size of the covariance matrix \(\mathbf{k}(\mathbf{X},\mathbf{X};\boldsymbol{\psi }_{k})\) used in Equation (19) becomes \((n_{\xi }n_{s}n_{t}) \times (n_{\xi }n_{s}n_{t})\). Since the cost of inference and prediction is \(O\left ((n_{s}n_{t}n_{\xi })^{3}\right )\), one encounters insurmountable computational issues even for moderate values of \(n_{\xi },n_{s}\), and \(n_{t}\). In an attempt to remedy this problem, simplifying assumptions must be made. Namely, one has to assume that the mean is a generalized linear model (see Equation (28)) with a separable set of basis functions, \(\mathbf{h}(\cdot )\), and that the covariance function is also separable.

The vector of basis functions \(\mathbf{h} : \mathcal{X}_{s} \times \mathcal{X}_{t} \times \mathcal{X}_{\xi }\rightarrow \mathbb{R}^{d_{h}}\) is separable if it can be written as:
$$\displaystyle{ \mathbf{h}(\mathbf{x}) = \mathbf{h}_{s}(\mathbf{x}_{s}) \otimes \mathbf{h}_{t}(t) \otimes \mathbf{h}_{\xi }(\boldsymbol{\xi }), }$$
(45)
where “⊗” denotes the Kronecker product, \(\mathbf{h}_{s} : \mathcal{X}_{s} \rightarrow \mathbb{R}^{d_{h_{s}}},\mathbf{h}_{ t} : \mathcal{X}_{t} \rightarrow \mathbb{R}^{d_{h_{t}}}\), and \(\mathbf{h}_{\xi } : \mathcal{X}_{\xi }\rightarrow \mathbb{R}^{d_{h_{\xi }}}\), with \(d_{h} = d_{h_{s}}d_{h_{t}}d_{h_{\xi }}\).
The covariance function, \(k : \mathcal{X}\times \mathcal{X} \rightarrow \mathbb{R}\), is separable if it can be written as:
$$\displaystyle{ k(\mathbf{x},\mathbf{x}';\boldsymbol{\psi }_{k}) = k_{s}(\mathbf{x}_{s},\mathbf{x}_{s}';\boldsymbol{\psi }_{k,s})k_{t}(t,t';\boldsymbol{\psi }_{k,t})k_{\xi }(\boldsymbol{\xi },\boldsymbol{\xi }';\boldsymbol{\psi }_{k,\xi }), }$$
(46)
where \(k_{s}(\cdot ,\cdot ;\boldsymbol{\psi }_{k,s}),k_{t}(\cdot ,\cdot ;\boldsymbol{\psi }_{k,t})\), and \(k_{\xi }(\cdot ,\cdot ;\boldsymbol{\psi }_{k,\xi })\) are spatial, time, and stochastic domain covariance functions, respectively; \(\boldsymbol{\psi }_{k,s},\boldsymbol{\psi }_{k,t}\), and \(\boldsymbol{\psi }_{k,\xi }\) are the corresponding hyper-parameters; and \(\boldsymbol{\psi }_{k} =\{\boldsymbol{\psi } _{k,s},\boldsymbol{\psi }_{k,t},\boldsymbol{\psi }_{k,\xi }\}\). Then, as shown in Bilionis et al. [10], the covariance matrix can be written as the Kronecker product of a spatial, a time, and a stochastic covariance, i.e.,
$$\displaystyle{ \mathbf{k}(\mathbf{X},\mathbf{X};\boldsymbol{\psi }_{k}) = \mathbf{k}_{s}(\mathbf{X}_{s},\mathbf{X}_{s};\boldsymbol{\psi }_{k,s}) \otimes \mathbf{k}_{t}(\mathbf{X}_{t},\mathbf{X}_{t};\boldsymbol{\psi }_{k,t}) \otimes \mathbf{k}_{\xi }(\mathbf{X}_{\xi },\mathbf{X}_{\xi };\boldsymbol{\psi }_{k,\xi }). }$$
(47)

Exploiting the fact that factorizations (e.g., Cholesky or QR) of a matrix formed by Kronecker products are given by the Kronecker products of the factorizations of the individual matrices [47], inference and predictions can be made in \(O(n_{s}^{3}) + O(n_{t}^{3}) + O(n_{\xi }^{3})\) time. For the complete details of this approach, the reader is directed to the appendix of Bilionis et al. [10]. It is worth mentioning that exactly the same idea was used by Stegle et al. [46].
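The key algebraic fact can be verified with a few lines of Python; the three random symmetric positive-definite matrices below are stand-ins for the spatial, temporal, and stochastic covariance factors of Equation (47).

    import numpy as np

    rng = np.random.default_rng(4)

    def random_spd(n):
        """Random symmetric positive-definite matrix (stand-in for a covariance factor)."""
        A = rng.standard_normal((n, n))
        return A @ A.T + n * np.eye(n)

    Ks, Kt, Kxi = random_spd(4), random_spd(3), random_spd(5)

    # Full covariance matrix k(X, X) = Ks (x) Kt (x) Kxi, as in Equation (47).
    K = np.kron(np.kron(Ks, Kt), Kxi)

    # chol(A (x) B) = chol(A) (x) chol(B): only the small factors are ever factorized.
    Ls, Lt, Lxi = (np.linalg.cholesky(M) for M in (Ks, Kt, Kxi))
    L = np.kron(np.kron(Ls, Lt), Lxi)
    print(np.allclose(L @ L.T, K))   # True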

2.5.4 Treating Space and Time by Using Output Dimensionality Reduction

In this section an alternative way to deal with space and time inputs, first introduced by Higdon et al. [30], is presented. Assume that the computer code reports the response at specific spatial locations and time instants exactly as in the previous subsection (see also Equation (7)). The idea is to perform a dimensionality reduction on the output matrix \(\mathbf{Y} \in \mathbb{R}^{n_{\xi }\times (n_{s}n_{t})}\),
$$\displaystyle{ \mathbf{Y} = \left (\begin{array}{c} \mathbf{f}_{c}(\boldsymbol{\xi }_{1})^{T}\\ \vdots \\ \mathbf{f}_{c}(\boldsymbol{\xi }_{n_{\xi }})^{T} \end{array} \right ), }$$
(48)
and then learn the map between \(\boldsymbol{\xi }\) and the reduced variables. For notational convenience, let \(n_{y} = n_{s}n_{t}\) denote the number of outputs of the code, \(\mathbf{y} \in \mathbb{R}^{n_{y}}\) the full output, and \(\mathbf{z} \in \mathbb{R}^{n_{z}}\) the reduced output. The choice of the right dimensionality reduction map is an open research problem. Principal component analysis (PCA) [12, Chapter 12] is going to be used for the identification of the dimensionality reduction map. Consider the empirical covariance matrix \(\mathbf{C} \in \mathbb{R}^{n_{y}\times n_{y}}\):
$$\displaystyle{ \mathbf{C} = \frac{1} {n_{\xi } - 1}\sum _{i=1}^{n_{\xi }}(\mathbf{Y}_{ i} -\mathbf{m})(\mathbf{Y}_{i} -\mathbf{m})^{T}, }$$
(49)
where \(\mathbf{Y}_{i}\) is the i-th row of the output matrix Y and \(\mathbf{m} \in \mathbb{R}^{n_{y}}\) is the empirical mean of the observed outputs:
$$\displaystyle{ \mathbf{m} = \frac{1} {n_{\xi }}\sum _{i=1}^{n_{\xi }}\mathbf{Y}_{ i}. }$$
(50)
One proceeds by diagonalizing C:
$$\displaystyle{ \mathbf{C} = \mathbf{V}\mathbf{D}\mathbf{V}^{T}, }$$
(51)
where \(\mathbf{V} \in \mathbb{R}^{n_{y}\times n_{y}}\) contains the eigenvectors of C as columns, and \(\mathbf{D} \in \mathbb{R}^{n_{y}\times n_{y}}\) is a diagonal matrix with the eigenvalues of C on its diagonal. The PCA connection between y and z (reconstruction map) is given by:
$$\displaystyle{ \mathbf{y} = \mathbf{V}\mathbf{D}^{1/2}\mathbf{z} + \mathbf{m}. }$$
(52)
In this equation, the reduced outputs corresponding to very small eigenvalues can be eliminated to a very good approximation. Typically, one keeps \(n_{z}\) eigenvalues of C so that 95 %, or more, of the observed variance of y is explained. This is achieved by removing columns from V and columns and rows from D. The inverse of Equation (52) (reduction map) is given by:
$$\displaystyle{ \mathbf{z} = \mathbf{V}^{T}\mathbf{D}^{-1/2}(\mathbf{y} -\mathbf{m}). }$$
(53)
The map between \(\boldsymbol{\xi }\) and the reduced variables z can be learned by any of the multi-output GP regression techniques to be introduced in subsequent sections.
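A compact Python sketch of the reduction and reconstruction maps of Equations (49) to (53) is given below; the synthetic snapshot matrix and the 95 % energy threshold are illustrative choices.

    import numpy as np

    def pca_maps(Y, energy=0.95):
        """Build the reduction and reconstruction maps of Equations (52) and (53).

        Y : array of shape (n_xi, n_y), one code output per row.
        Keeps enough eigenvalues to explain a fraction `energy` of the observed variance.
        """
        m = Y.mean(axis=0)                                  # Equation (50)
        C = np.cov(Y, rowvar=False)                         # Equation (49)
        vals, V = np.linalg.eigh(C)                         # Equation (51)
        order = np.argsort(vals)[::-1]
        vals, V = vals[order], V[:, order]
        n_z = int(np.searchsorted(np.cumsum(vals) / vals.sum(), energy)) + 1
        V, D_half = V[:, :n_z], np.sqrt(vals[:n_z])
        reduce_map = lambda y: V.T @ (y - m) / D_half       # Equation (53)
        reconstruct = lambda z: V @ (D_half * z) + m        # Equation (52)
        return reduce_map, reconstruct, n_z

    # Synthetic outputs: n_xi = 50 snapshots of a 200-dimensional spatial field.
    rng = np.random.default_rng(5)
    t = np.linspace(0.0, 1.0, 200)
    Y = np.array([np.sin(2 * np.pi * (t + rng.random())) + 0.01 * rng.standard_normal(200)
                  for _ in range(50)])
    reduce_map, reconstruct, n_z = pca_maps(Y)
    y0_rec = reconstruct(reduce_map(Y[0]))
    print(n_z, np.linalg.norm(y0_rec - Y[0]) / np.linalg.norm(Y[0]))   # few modes, small error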

2.6 Training the Parameters of the Gaussian Process

Training the parameters of a GP requires the characterization of the posterior distribution \(p(\boldsymbol{\psi }\vert \mathcal{D})\). A very general way to describe this posterior is via a particle approximation. A particle approximation is a collection of weights, \(w^{(s)}\), and samples, \(\boldsymbol{\psi }^{(s)}\), with which one may represent the posterior as:
$$\displaystyle{ p(\boldsymbol{\psi }\vert \mathcal{D}) \approx \sum _{s=1}^{S}w^{(s)}\delta \left (\boldsymbol{\psi }-\boldsymbol{\psi }^{(s)}\right ). }$$
(54)
Here, \(w^{(s)} \geq 0\) and \(\sum _{s=1}^{S}w^{(s)} = 1\). The usefulness of Equation (54) relies on the fact that it allows one to approximate expectations with respect to \(p(\boldsymbol{\psi }\vert \mathcal{D})\) in a straightforward manner:
$$\displaystyle{ \mathbb{E}_{\boldsymbol{\psi }}[g(\boldsymbol{\psi })] :=\int g(\boldsymbol{\psi })p(\boldsymbol{\psi }\vert \mathcal{D})d\boldsymbol{\psi } \approx \sum _{s=1}^{S}w^{(s)}g\left (\boldsymbol{\psi }^{(s)}\right ). }$$
(55)
There are various ways in which a particle approximation can be constructed. The discussion below includes the most common approaches.

2.6.1 Maximization of the Likelihood

The most widespread approach to training the GP parameters is to obtain a point estimate of the hyper-parameters by maximizing the likelihood of the data as given in Equation (44) [39, Ch. 5]. This is known as the maximum likelihood estimator (MLE) of the hyper-parameters:
$$\displaystyle{ \boldsymbol{\psi }_{\text{MLE}}^{{\ast}} =\arg \max _{\boldsymbol{\psi } \in \boldsymbol{\varPsi }}p(\mathcal{D}\vert \boldsymbol{\psi }). }$$
(56)
The MLE can be thought of as a single-particle approximation, i.e., \(S = 1,w^{(1)} = 1,\boldsymbol{\psi }^{(1)} =\boldsymbol{\psi }_{ \text{MLE}}^{{\ast}}\),
$$\displaystyle{ p(\boldsymbol{\psi }\vert \mathcal{D}) \approx \delta \left (\boldsymbol{\psi }-\boldsymbol{\psi }_{\text{MLE}}^{{\ast}}\right ), }$$
(57)
if the prior is relatively flat and the posterior is sharply peaked around \(\boldsymbol{\psi }_{ \text{MLE}}^{{\ast}}\).
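A minimal Python sketch of the MLE computation is given below for a one-dimensional input, a zero mean, and the SE covariance with nugget of Equations (35) and (37); the toy data, the log-parameterization, and the choice of optimizer are illustrative assumptions.

    import numpy as np
    from scipy.optimize import minimize

    # Toy data set D = {(x_i, y_i)}, i = 1, ..., n (illustrative).
    rng = np.random.default_rng(6)
    X = np.linspace(0.0, 1.0, 10)
    Y = np.sin(2 * np.pi * X) + 0.05 * rng.standard_normal(10)

    def neg_log_likelihood(log_psi):
        """Minus the log of Equation (44) for a zero mean and an SE covariance with nugget."""
        s, ell, sigma = np.exp(log_psi)          # optimize in log space to enforce positivity
        K = s ** 2 * np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / ell ** 2) \
            + sigma ** 2 * np.eye(len(X))
        _, logdet = np.linalg.slogdet(K)
        return 0.5 * (logdet + Y @ np.linalg.solve(K, Y) + len(X) * np.log(2.0 * np.pi))

    # psi_MLE = argmax_psi p(D | psi), Equation (56).
    res = minimize(neg_log_likelihood, x0=np.log([1.0, 0.3, 0.1]), method="Nelder-Mead")
    print(dict(zip(["s", "ell", "sigma"], np.exp(res.x))))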

2.6.2 Maximization of the Posterior

If the prior information is not flat but the posterior is expected to be sharply peaked, then it is preferable to use the maximum a posteriori (MAP) estimate of the hyper-parameters:
$$\displaystyle{ \boldsymbol{\psi }_{\text{MAP}}^{{\ast}} =\arg \max _{\boldsymbol{\psi } \in \boldsymbol{\varPsi }}p(\boldsymbol{\psi }\vert \mathcal{D}), }$$
(58)
where the posterior \(p(\boldsymbol{\psi }\vert \mathcal{D})\) was defined in Equation (43). This is also a single-particle approximation:
$$\displaystyle{ p(\boldsymbol{\psi }\vert \mathcal{D}) \approx \delta \left (\boldsymbol{\psi }-\boldsymbol{\psi }_{\text{MAP}}^{{\ast}}\right ). }$$
(59)

2.6.3 Markov Chain Monte Carlo

Markov chain Monte Carlo (MCMC) techniques [29, 31, 33] can be employed to sample from the posterior of Equation (43). After discarding a burn-in period and thinning, these techniques result in a series of approximately uncorrelated samples \(\boldsymbol{\psi }^{(s)},s = 1,\ldots ,S\) from Equation (43). The particle approximation is built by picking \(w^{(s)} = 1/S,s = 1,\ldots ,S\). MCMC is useful when the posterior is unimodal.
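The following Python sketch shows a generic random-walk Metropolis sampler that could be used to build such a particle approximation. The target below is a simple stand-in for the log of Equation (43); in practice the log-likelihood of Equation (44) plus the log-prior would be used, and the step size and chain length are arbitrary illustrative choices.

    import numpy as np

    def metropolis(log_post, psi0, steps=5000, step_size=0.1, seed=0):
        """Random-walk Metropolis targeting p(psi | D) of Equation (43) up to a constant."""
        rng = np.random.default_rng(seed)
        psi, lp = np.asarray(psi0, dtype=float), log_post(psi0)
        chain = []
        for _ in range(steps):
            prop = psi + step_size * rng.standard_normal(psi.shape)
            lp_prop = log_post(prop)
            if np.log(rng.random()) < lp_prop - lp:   # accept with probability min(1, ratio)
                psi, lp = prop, lp_prop
            chain.append(psi.copy())
        return np.array(chain)

    # Stand-in for log p(D | psi) + log p(psi); replace with Equation (44) plus the log-prior.
    log_post = lambda psi: -0.5 * (psi[0] ** 2 + (psi[1] - psi[0]) ** 2)
    chain = metropolis(log_post, psi0=np.zeros(2))
    # Discard burn-in and thin to obtain roughly uncorrelated particles with weights 1/S.
    particles = chain[1000::50]
    print(particles.shape)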

2.6.4 Sequential Monte Carlo

Sequential Monte Carlo (SMC) is a nonlinear extension of Kalman filters [19, 21]. These algorithms can also be used in order to sample from the posterior of a Bayesian model. For example, Bilionis and Zabaras [9], Bilionis et al. [11], and Wan and Zabaras [48] use them to sample from the posterior of a stochastic inverse problem. One introduces a family of probability densities on \(\boldsymbol{\psi }\) parameterized by γ:
$$\displaystyle{ p(\boldsymbol{\psi }\vert \gamma ,\mathcal{D}) \propto p(\mathcal{D}\vert \boldsymbol{\psi })^{\gamma }p(\boldsymbol{\psi }). }$$
(60)
Notice that for γ = 0 one obtains the prior and for γ = 1 one obtains the posterior. The idea is to start a particle approximation \(\left \{w_{0}^{(s)},\boldsymbol{\psi }_{0}^{(s)}\right \}_{s=1}^{S}\) from the prior (γ = 0) which is very easy to sample from and gradually move it to the posterior (γ = 1). This can be easily achieved if there is an underlying MCMC routine for sampling from Equation (60). In addition, the schedule of γ’s can be picked adaptively. The reader is directed to the appendix of Bilionis and Zabaras [9] for the complete details. SMC-based methodologies are suitable for multimodal posteriors. Another very attractive attribute is that they are embarrassingly parallelizable.
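A stripped-down tempering sketch in Python is shown below under several simplifying assumptions: the γ schedule is fixed rather than adaptive, and the MCMC move step that a full SMC sampler would perform after each resampling is omitted; the scalar toy prior and likelihood are placeholders.

    import numpy as np

    def smc_temper(log_like, prior_sample, S=500, seed=0):
        """Move S particles from the prior (gamma = 0) toward the posterior (gamma = 1).

        log_like     : psi -> log p(D | psi)
        prior_sample : n -> array of n draws from p(psi)
        A full SMC sampler would follow each reweighting/resampling step with an MCMC move.
        """
        rng = np.random.default_rng(seed)
        psi = prior_sample(S)
        ll = np.array([log_like(p) for p in psi])
        gammas = np.linspace(0.0, 1.0, 11)
        for g0, g1 in zip(gammas[:-1], gammas[1:]):
            logw = (g1 - g0) * ll                        # incremental importance weights
            w = np.exp(logw - logw.max()); w /= w.sum()
            idx = rng.choice(S, size=S, p=w)             # multinomial resampling
            psi, ll = psi[idx], ll[idx]
        return psi                                       # equally weighted particles

    # Toy setup: scalar psi, prior N(0, 1), likelihood sharply peaked at psi = 1.
    log_like = lambda psi: -0.5 * ((psi - 1.0) / 0.2) ** 2
    particles = smc_temper(log_like, lambda n: np.random.default_rng(1).standard_normal(n))
    print(particles.mean())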

2.7 Multi-output Gaussian Processes

In what follows, the treatment of generic \(d_{y}\)-dimensional response functions \(\mathbf{f}(\cdot )\) is considered. The collection of all observed outputs is denoted by \(\mathbf{Y}_{d} = (\mathbf{y}_{1}^{T},\ldots ,\mathbf{y}_{n}^{T}) \in \mathbb{R}^{n\times d_{y}}\). The proper treatment of multiple outputs in a computationally efficient way is an open research problem. Even though there exist techniques that attempt to capture nontrivial dependencies between the various outputs, e.g., [2, 3, 13, 46], here the focus is on computationally simple techniques that treat outputs as either independent or linearly dependent. These techniques tend to be the easiest to use in practice.

2.7.1 Completely Independent Outputs

In this model the outputs are completely independent, i.e.,
$$\displaystyle{ \mathbf{f}(\cdot )\vert \boldsymbol{\psi }\sim \prod _{i=1}^{d_{y} }\mathop{ \mathrm{GP}}\nolimits \left (f_{i}(\cdot )\vert m_{i}(\cdot ;\boldsymbol{\psi }_{m,i}),k_{i}(\cdot ,\cdot ;\boldsymbol{\psi }_{k,i})\right ), }$$
(61)
with
$$\displaystyle{ \boldsymbol{\psi }=\bigcup _{ i=1}^{d_{y} }\{\boldsymbol{\psi }_{m,i},\boldsymbol{\psi }_{k,i}\}, }$$
(62)
where \(m_{i}(\cdot ;\cdot )\), \(k_{i}(\cdot ;\cdot )\) are the mean and the covariance function and \(\boldsymbol{\psi }_{m,i}\), \(\boldsymbol{\psi }_{k,i}\) the parameters of the mean and the covariance function, respectively, of the i-th output of \(\mathbf{f}(\cdot )\). The likelihood of the model is:
$$\displaystyle{ \begin{array}{ccc} p(\mathcal{D}\vert \boldsymbol{\psi })& =& p(\mathbf{Y}\vert \mathbf{X},\boldsymbol{\psi }) \\ & =&\prod _{i=1}^{d_{y}}p(\mathbf{y}_{i}\vert \mathbf{X},\boldsymbol{\psi }_{i}), \end{array} }$$
(63)
where \(\boldsymbol{\psi }_{i} =\{\boldsymbol{\psi } _{m,i},\boldsymbol{\psi }_{k,i}\}\) and \(p(\mathbf{y}_{i}\vert \mathbf{X},\boldsymbol{\psi }_{i})\) is a multivariate Gaussian exactly the same as the one in Equation (19) with \(m_{i}(\cdot ;\boldsymbol{\psi }_{m,i})\) and \(k_{i}(\cdot ,\cdot ;\boldsymbol{\psi }_{k,i})\) instead of \(m(\cdot ;\boldsymbol{\psi }_{m})\) and \(k(\cdot ,\cdot ;\boldsymbol{\psi }_{k})\), respectively. A typical choice of a prior for \(\boldsymbol{\psi }\) assumes a priori independence of the various parameters, i.e.,
$$\displaystyle{ p(\boldsymbol{\psi }) =\prod _{ i=1}^{d_{y} }\left (p(\boldsymbol{\psi }_{m,i})p(\boldsymbol{\psi }_{k,i})\right ). }$$
(64)
There is nothing special about this model. Essentially, it is equivalent to carrying out a Gaussian process regression independently on each one of the outputs.
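As an illustration, the Python sketch below fits an independent single-output GP posterior mean (Equation (41) with a zero prior mean) to each column of a toy output matrix; the fixed kernels and data are illustrative and no hyper-parameter training is performed.

    import numpy as np

    # Completely independent outputs: a single-output GP is fit to each column of Y.
    X = np.linspace(0.0, 1.0, 8)                                           # n = 8 training inputs
    Y = np.column_stack([np.sin(2 * np.pi * X), np.cos(2 * np.pi * X)])    # d_y = 2 outputs

    def posterior_mean(X, y, k, x):
        """Posterior mean of Equation (41) with a zero prior mean."""
        K = k(X[:, None], X[None, :]) + 1e-8 * np.eye(len(X))
        return k(x, X) @ np.linalg.solve(K, y)

    # Each output gets its own covariance function (here only the length scale differs).
    kernels = [lambda a, b, ell=ell: np.exp(-0.5 * (a - b) ** 2 / ell ** 2) for ell in (0.2, 0.3)]
    x_new = 0.37
    print([posterior_mean(X, Y[:, i], kernels[i], x_new) for i in range(2)])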

2.7.2 Independent, but Similar, Outputs

In this model, the outputs are correlated only via the parameters of the covariance function, i.e.,
$$\displaystyle{ \mathbf{f}(\cdot )\vert \boldsymbol{\psi }\sim \prod _{i=1}^{d_{y} }\mathop{ \mathrm{GP}}\nolimits \left (f_{i}(\cdot )\vert m(\cdot ;\boldsymbol{\psi }_{m,i}),\psi _{s,i}^{2}k(\cdot ,\cdot ;\boldsymbol{\psi }_{ k})\right ), }$$
(65)
with
$$\displaystyle{ \boldsymbol{\psi }=\{\boldsymbol{\psi } _{k}\} \cup \bigcup _{i=1}^{d_{y} }\{\boldsymbol{\psi }_{m,i},\boldsymbol{\psi }_{s,i}\}, }$$
(66)
where \(\boldsymbol{\psi }_{m,i}\) and \(\boldsymbol{\psi }_{s,i}\) are the parameters of the mean and the signal strength, respectively, of the i-th output, and \(\boldsymbol{\psi }_{k}\) are the parameters of the covariance function shared by all outputs. The likelihood of this model is given by an equation similar to Equation (63) with an appropriate mean and covariance function. The prior of \(\boldsymbol{\psi }\) is assumed to have an a priori independence structure similar to Equation (64). The advantage of this approach compared with the fully independent approach is that all outputs share the same covariance function, and, hence, its computational complexity is the same as that of a single-output Gaussian process regression.

2.7.3 Linearly Correlated Outputs

Conti and O’Hagan [15] introduce a simple multi-output model in which the outputs are correlated using a positive-definite covariance matrix \(\boldsymbol{\varSigma }\in \mathbb{R}^{d_{y}\times d_{y}}\) via a d y -dimensional Gaussian random field, i.e.,
$$\displaystyle{ \mathbf{f}(\cdot )\vert \boldsymbol{\psi }\sim \mathop{\mathrm{GP}}\nolimits \left (\mathbf{f}(\cdot )\vert \mathbf{m}(\cdot ;\boldsymbol{\psi }_{m}),k(\cdot ,\cdot ;\boldsymbol{\psi }_{k}),\boldsymbol{\varSigma }\right ), }$$
(67)
with
$$\displaystyle{ \boldsymbol{\psi }=\{\boldsymbol{\psi } _{m},\boldsymbol{\psi }_{k},\boldsymbol{\varSigma }\}, }$$
(68)
where \(\mathbf{m}(\cdot ;\boldsymbol{\psi }_{m})\) is the mean-vector function with \(d_{y}\) outputs, and \(k(\cdot ,\cdot ;\boldsymbol{\psi }_{k})\) is a common covariance function. Equation (67) essentially means that, a priori,
$$\displaystyle{ \mathbb{E}[\mathbf{f}(\mathbf{x})\vert \boldsymbol{\psi }] = \mathbf{m}(\mathbf{x};\boldsymbol{\psi }_{m}), }$$
(69)
and
$$\displaystyle{ \mathbb{C}[\mathbf{f}(\mathbf{x}),\mathbf{f}(\mathbf{x}')\vert \boldsymbol{\psi }] = k(\mathbf{x},\mathbf{x}';\boldsymbol{\psi }_{k})\boldsymbol{\varSigma }. }$$
(70)
Notice that there is no need to include a signal parameter in the covariance function, since it will be absorbed by \(\boldsymbol{\varSigma }\). Therefore, in this model, one can take the signal strength of the response to be identically equal to one. Equation (70) provides the intuitive meaning of \(\boldsymbol{\varSigma }\) as the matrix capturing the linear part of the correlations of the various outputs. In this case, the likelihood is given via a matrix normal [18]:
$$\displaystyle{ \begin{array}{ccc} p(\mathcal{D}\vert \boldsymbol{\psi })& =& p(\mathbf{Y}\vert \mathbf{X},\boldsymbol{\psi }) \\ & =&\mathcal{N}_{n\times d_{y}}\left (\mathbf{Y}\vert \mathbf{m}(\mathbf{X};\boldsymbol{\psi }_{m}),\boldsymbol{\varSigma },\mathbf{k}(\mathbf{X},\mathbf{X};\boldsymbol{\psi }_{k})\right ).\end{array} }$$
(71)
The prior of \(\boldsymbol{\psi }\) is assumed to have the form:
$$\displaystyle{ p(\boldsymbol{\psi }) = p(\boldsymbol{\psi }_{m})p(\boldsymbol{\psi }_{k})p(\boldsymbol{\varSigma }), }$$
(72)
with the prior of \(\boldsymbol{\varSigma }\) being “non-informative” [15]:
$$\displaystyle{ p(\boldsymbol{\varSigma }) \propto \vert \boldsymbol{\varSigma }\vert ^{-\frac{d_{y}+1} {2} }. }$$
(73)
Alternatively, \(\boldsymbol{\varSigma }\) could follow an inverse Wishart distribution [28]. As shown in [10] and [14], if in addition to the prior assumptions for \(\boldsymbol{\varSigma }\), the mean \(m(\cdot ;\boldsymbol{\psi }_{m})\) is chosen to be a generalized linear model with a priori flat weights \(\boldsymbol{\psi }_{m}\), then both \(\boldsymbol{\varSigma }\) and \(\boldsymbol{\psi }_{m}\) can be integrated out of the model analytically. This feature makes inference in this model as computationally efficient as single-output Gaussian process regression (GPR). For the complete details of this approach, the reader is directed to [10].

2.8 Sampling Possible Surrogates

The purpose of this section is to demonstrate how the posterior Gaussian process defined by Equations (40) and (43) can be represented as \(\hat{f}(\cdot ;\boldsymbol{\theta })\), where \(\boldsymbol{\theta }\) is a finite dimensional random variable with probability density \(p(\boldsymbol{\theta }\vert \mathcal{D})\). There are two ways in which this can be achieved. The first way is based on a truncated Karhunen-Loève expansion (KLE) [26, 32, 45] of Equation (40). The second way is based on the idea introduced in O’Hagan et al. [38], further discussed in Oakley and O’Hagan [34, 35], and revisited in Bilionis et al. [10] as well as in Chen et al. [14]. In both approximations, \(\hat{f}(\cdot ;\boldsymbol{\theta })\) is given analytically. In particular, it can be expressed as
$$\displaystyle{ \hat{f}(\mathbf{x};\boldsymbol{\theta }) = m^{{\ast}}(\mathbf{x};\boldsymbol{\psi }) + \mathbf{k}^{{\ast}}(\mathbf{x},\mathbf{X}_{ d};\boldsymbol{\psi })\mathbf{C}(\boldsymbol{\psi })\boldsymbol{\omega }, }$$
(74)
where \(m^{{\ast}}(\cdot ;\boldsymbol{\psi })\) is the posterior mean given in Equation (41) and \(\mathbf{k}^{{\ast}}(\mathbf{x},\mathbf{X}_{d};\boldsymbol{\psi })\) is the posterior cross covariance, defined generically via
$$\displaystyle{ \mathbf{k}^{{\ast}}(\mathbf{X},\mathbf{X}^{{\prime}};\boldsymbol{\psi }) := \left (\begin{array}{ccc} k^{{\ast}}(\mathbf{x}_{ 1},\mathbf{x}_{1}^{{\prime}};\boldsymbol{\psi })&\ldots &k^{{\ast}}(\mathbf{x}_{ 1},\mathbf{x}_{n^{{\prime}}}^{{\prime}};\boldsymbol{\psi })\\ \vdots &\ddots & \vdots \\ k^{{\ast}}(\mathbf{x}_{n},\mathbf{x}_{1}^{{\prime}};\boldsymbol{\psi })&\ldots &k^{{\ast}}(\mathbf{x}_{n},\mathbf{x}_{n^{{\prime}}}^{{\prime}};\boldsymbol{\psi }) \end{array} \right ), }$$
(75)
with \(k^{{\ast}}(\cdot ,\cdot ;\boldsymbol{\psi })\) being the posterior covariance defined in Equation (42). The parameters \(\boldsymbol{\theta }\) that characterize the surrogate surface \(\hat{f}(\mathbf{x};\boldsymbol{\theta })\) are:
$$\displaystyle{ \boldsymbol{\theta }=\{\boldsymbol{\psi },\boldsymbol{\omega }\}, }$$
(76)
with
$$\displaystyle{ \boldsymbol{\omega }= (\omega _{1},\ldots ,\omega _{d_{\omega }}), }$$
(77)
being distributed as a standard normal random vector, i.e.,
$$\displaystyle{ p(\boldsymbol{\omega }) = \mathcal{N}_{d_{\omega }}\left (\boldsymbol{\omega }\vert \mathbf{0}_{d_{\omega }},\mathbf{I}_{d_{\omega }}\right ). }$$
(78)
The posterior probability density of \(\boldsymbol{\theta }\), \(p(\boldsymbol{\theta }\vert \mathcal{D})\), is given by:
$$\displaystyle{ p(\boldsymbol{\theta }\vert \mathcal{D}) = p(\boldsymbol{\psi }\vert \mathcal{D})p(\boldsymbol{\omega }), }$$
(79)
with \(p(\boldsymbol{\psi }\vert \mathcal{D})\) being the posterior of the hyper-parameters (see Equation (43)). The matrix
$$\displaystyle{ \mathbf{X}_{d} = \left (\mathbf{x}_{d,1}^{T}\ldots \mathbf{x}_{ d,n_{d}}^{T}\right )^{T}, }$$
(80)
contains \(n_{d}\) design points in \(\mathcal{X}\), and \(\mathbf{C}(\boldsymbol{\psi }) \in \mathbb{R}^{n_{d}\times d_{\omega }}\) is a matrix that corresponds to a factorization of the posterior covariance function of Equation (42) over the design points \(\mathbf{X}_{d}\). The optimal choice of the design points, \(\mathbf{X}_{d}\), is an open research problem. Below, some heuristics are provided. These depend on how one actually constructs the various quantities of Equation (74).

2.8.1 The Karhunen-Loève Approach for Constructing \(\hat{f}(\cdot ;\boldsymbol{\theta })\)

Let \(\omega _{\ell},\ell= 1,\ldots\) be standard normal random variables. Consider the eigenvalues \(\lambda _{\ell}(\boldsymbol{\psi })\) and eigenfunctions \(\phi _{\ell}(\cdot ;\boldsymbol{\psi })\) of the posterior covariance function \(k^{{\ast}}(\cdot ,\cdot ;\boldsymbol{\psi })\) of Equation (42). That is:
$$\displaystyle{ \int k^{{\ast}}(\mathbf{x},\mathbf{x}';\boldsymbol{\psi })\phi _{\ell}(\mathbf{x};\boldsymbol{\psi })d\mathbf{x} =\lambda _{\ell}(\boldsymbol{\psi })\phi _{\ell}(\mathbf{x};\boldsymbol{\psi }), }$$
(81)
for \(\ell= 1,\ldots\). Then, the posterior Gaussian process defined by Equation (40) can be written as:
$$\displaystyle{ \hat{f}(\mathbf{x};\boldsymbol{\psi },\omega _{1},\ldots ) = m^{{\ast}}(\mathbf{x},\boldsymbol{\psi }) +\sum _{ \ell =1}^{\infty }\omega _{ \ell}\sqrt{\lambda _{\ell}(\boldsymbol{\psi })}\phi _{\ell}(\mathbf{x};\boldsymbol{\psi }). }$$
(82)
Typically, one truncates the series to a finite order \(d_{\omega }\) and writes
$$\displaystyle{ \hat{f}(\mathbf{x};\boldsymbol{\theta }) = m^{{\ast}}(\mathbf{x};\boldsymbol{\psi }) +\sum _{ \ell =1}^{d_{\omega }}\omega _{ \ell}\sqrt{\lambda _{\ell}(\boldsymbol{\psi })}\phi _{\ell}(\mathbf{x};\boldsymbol{\psi }), }$$
(83)
where \(\boldsymbol{\theta }\) and \(\boldsymbol{\omega }\) are as in Equations (76) and (77), respectively. The probability density of \(\boldsymbol{\theta }\), \(p(\boldsymbol{\theta }\vert \mathcal{D})\), is defined via Equations (79) and (78).

Equation (81) is a Fredholm integral eigenvalue problem. In general, this equation cannot be solved analytically. A very recent study of the numerical techniques that can be used for the solution of this problem can be found in Betz et al. [5]. The present development relies on the Nyström approximation [20, 40] and follows Betz et al. [5] closely.

Start by approximating the integral on the left-hand side of Equation (81) by
$$\displaystyle{ \sum _{j=1}^{n_{d} }w_{j}k^{{\ast}}(\mathbf{x},\mathbf{x}_{ d,j};\boldsymbol{\psi })\phi _{\ell}(\mathbf{x}_{d,j};\boldsymbol{\psi }) \approx \lambda _{\ell}(\boldsymbol{\psi })\phi _{\ell}(\mathbf{x};\boldsymbol{\psi }), }$$
(84)
where \(\{\mathbf{x}_{d,j},w_{j}\}_{j=1}^{n_{d}}\) is a suitable quadrature rule. Notice that in this approximation the design points of Equation (80) correspond to the quadrature points. The simplest quadrature rule would be a Monte Carlo type of rule, in which the points \(\mathbf{x}_{d,j}\) are randomly picked in \(\mathcal{X}\) and \(w_{j} = \frac{1} {n_{d}}\). Other choices could be based on tensor products of one-dimensional rules or a sparse grid quadrature rule [44]. As shown later on, for the special – but very common – case of a separable covariance function, the difficulty of the problem can be reduced dramatically.
The next step in the Nyström approximation is to solve Equation (84) at the quadrature points:
$$\displaystyle{ \sum _{j=1}^{n_{d} }w_{j}k^{{\ast}}(\mathbf{x}_{ d,i},\mathbf{x}_{d,j};\boldsymbol{\psi })\phi _{\ell}(\mathbf{x}_{d,j};\boldsymbol{\psi }) \approx \lambda _{\ell}(\boldsymbol{\psi })\phi _{\ell}(\mathbf{x}_{d,i};\boldsymbol{\psi }), }$$
(85)
for \(i = 1,\ldots ,n_{d}\). Therefore, one needs to solve the generalized eigenvalue problem:
$$\displaystyle{ \mathbf{k}^{{\ast}}(\mathbf{X}_{ d},\mathbf{X}_{d};\boldsymbol{\psi })\mathbf{W}\mathbf{v}_{\ell} =\lambda _{\ell}\mathbf{v}_{\ell}, }$$
(86)
where \(\mathbf{k}^{{\ast}}(\mathbf{X}_{d},\mathbf{X}_{d};\boldsymbol{\psi })\) is the posterior covariance matrix on the integration points \(\mathbf{X}_{d}\) (see Equation (75)), and \(\mathbf{W} =\mathop{ \mathrm{diag}}\nolimits \left (w_{1},\ldots ,w_{n_{d}}\right )\). It is easy to see that the solution of Equation (86) can be obtained by first solving the regular eigenvalue problem
$$\displaystyle{ \mathbf{B}\tilde{\mathbf{v}}_{\ell} =\lambda _{\ell}\tilde{\mathbf{v}}_{\ell}, }$$
(87)
where
$$\displaystyle{ \mathbf{B} = \mathbf{W}^{\frac{1} {2} }\mathbf{k}^{{\ast}}(\mathbf{X}_{d},\mathbf{X}_{d};\boldsymbol{\psi })\mathbf{W}^{\frac{1} {2} }, }$$
(88)
and noticing that
$$\displaystyle{ \mathbf{v}_{\ell} = \mathbf{W}^{-\frac{1} {2} }\tilde{\mathbf{v}}_{\ell}, }$$
(89)
or
$$\displaystyle{ \phi _{\ell}(\mathbf{x}_{d,j};\boldsymbol{\psi }) \approx \frac{1} {\sqrt{w_{j}}}\tilde{v}_{\ell,j}. }$$
(90)
Plugging this into Equation (84) and solving for \(\phi _{\ell}(\mathbf{x})\), it is seen that
$$\displaystyle{ \phi _{\ell}(\mathbf{x};\boldsymbol{\psi }) \approx \frac{1} {\lambda _{\ell}(\boldsymbol{\psi })} \sum _{j=1}^{n_{d} }\sqrt{w_{j}}\tilde{v}_{\ell,j}k^{{\ast}}(\mathbf{x}_{ d,j},\mathbf{x};\boldsymbol{\psi }). }$$
(91)
Typically, the KLE is truncated at some \(d_{\omega }\leq n_{d}\) such that a fraction \(\alpha\) of the energy of the field is captured, e.g., \(\alpha = 0.9\). That is, one may pick the smallest \(d_{\omega }\) so that:
$$\displaystyle{ \sum _{\ell=1}^{d_{\omega }}\lambda _{ \ell} \geq \alpha \sum _{ \ell=1}^{n_{d} }\lambda _{\ell}. }$$
(92)
With this choice, the matrix \(\mathbf{C}(\boldsymbol{\psi })\), Equation (74), is given by:
$$\displaystyle{ \mathbf{C}(\boldsymbol{\psi }) = \mathbf{W}^{\frac{1} {2} }\tilde{\mathbf{V}}\boldsymbol{\varLambda }^{-\frac{1} {2} }, }$$
(93)
where \(\boldsymbol{\varLambda }=\mathop{ \mathrm{diag}}\nolimits \left (\lambda _{1},\ldots ,\lambda _{d_{\omega }}\right )\) and
$$\displaystyle{ \tilde{\mathbf{V}} = \left (\tilde{\mathbf{v}}_{1}\ldots \tilde{\mathbf{v}}_{d_{\omega }}\right ), }$$
(94)
is the matrix whose columns are the first \(d_{\omega }\) eigenvectors of \(\mathbf{B}\) of Equation (88).
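The construction above is straightforward to implement numerically. The following sketch assembles the matrix \(\mathbf{C}(\boldsymbol{\psi })\) of Equation (93) using a Monte Carlo quadrature rule; the callable `k_post`, assumed to evaluate the posterior covariance of Equation (42) on arbitrary sets of points, as well as the other names, are illustrative and not part of the chapter.

```python
import numpy as np

def nystrom_C(k_post, X_d, alpha=0.9):
    """Assemble the KLE factor C(psi) of Equation (93) via the Nystrom method.

    k_post : callable, k_post(X1, X2) returns the posterior covariance matrix (assumed)
    X_d    : (n_d, d) array of design/quadrature points
    alpha  : fraction of the energy retained when truncating the KLE
    """
    n_d = X_d.shape[0]
    w = np.full(n_d, 1.0 / n_d)                  # Monte Carlo quadrature weights
    K = k_post(X_d, X_d)                         # posterior covariance at the design points
    sqrt_w = np.sqrt(w)
    B = sqrt_w[:, None] * K * sqrt_w[None, :]    # Equation (88)
    lam, V = np.linalg.eigh(B)                   # Equation (87), eigenvalues in ascending order
    lam, V = lam[::-1], V[:, ::-1]               # reorder to decreasing eigenvalues
    energy = np.cumsum(lam) / np.sum(lam)
    d_omega = int(np.searchsorted(energy, alpha)) + 1   # smallest d_omega capturing >= alpha energy
    lam, V = lam[:d_omega], V[:, :d_omega]
    C = sqrt_w[:, None] * V / np.sqrt(lam)[None, :]     # Equation (93): W^{1/2} V Lambda^{-1/2}
    return C, lam

# A candidate surrogate is then evaluated as
#   f_hat(x) = m_post(x) + k_post(x, X_d) @ (C @ omega),  omega ~ N(0, I_{d_omega}),
# which is the finite-dimensional representation of Equation (74).
```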

2.8.2 The O’Hagan Approach for Constructing \(\hat{f}(\cdot ;\boldsymbol{\theta })\)

Consider the \(n_{d}\) design points \(\mathbf{X}_{d}\) defined in Equation (80). Let \(\mathbf{Y}_{d} \in \mathbb{R}^{n_{d}\times 1}\) be the random vector corresponding to the unobserved output of the simulation at the design points \(\mathbf{X}_{d}\). That is,
$$\displaystyle{ \mathbf{Y}_{d}\vert \mathcal{D},\boldsymbol{\psi },\mathbf{X}_{d} \sim \mathcal{N}_{n_{d}}\left (\mathbf{Y}_{d}\vert \mathbf{m}^{{\ast}}(\mathbf{X}_{ d};\boldsymbol{\psi }),\mathbf{k}^{{\ast}}(\mathbf{X}_{ d},\mathbf{X}_{d};\boldsymbol{\psi })\right ), }$$
(95)
where
$$\displaystyle{ \mathbf{m}^{{\ast}}(\mathbf{X}_{ d};\boldsymbol{\psi }) = \left (m^{{\ast}}(\mathbf{x}_{ d,1};\boldsymbol{\psi })\ldots m^{{\ast}}(\mathbf{x}_{ d,n_{d}};\boldsymbol{\psi })\right )^{T}, }$$
(96)
with \(m^{{\ast}}(\cdot ;\boldsymbol{\psi })\) being the posterior mean defined in Equation (41), and
$$\displaystyle{ \mathbf{k}^{{\ast}}(\mathbf{X}_{ d},\mathbf{X}_{d};\boldsymbol{\psi }) = \left (\begin{array}{ccc} k^{{\ast}}(\mathbf{x}_{d,1},\mathbf{x}_{d,1};\boldsymbol{\psi })&\ldots &k^{{\ast}}(\mathbf{x}_{d,1},\mathbf{x}_{d,n_{d}};\boldsymbol{\psi })\\ \vdots &\ddots & \vdots \\ k^{{\ast}}(\mathbf{x}_{d,n_{d}},\mathbf{x}_{d,1};\boldsymbol{\psi })&\ldots & k^{{\ast}}(\mathbf{x}_{d,n_{d}},\mathbf{x}_{d,n_{d}};\boldsymbol{\psi }) \end{array} \right ), }$$
(97)
with \(k^{{\ast}}(\cdot ,\cdot ;\boldsymbol{\psi })\) being the posterior covariance defined in Equation (42). Finally, let us condition the posterior Gaussian process of Equation (40) on the hypothetical observations \(\left \{\mathbf{X}_{d},\mathbf{Y}_{d}\right \}\). One has:
$$\displaystyle{ f(\cdot )\vert \mathcal{D},\boldsymbol{\psi },\mathbf{X}_{d},\mathbf{Y}_{d} \sim \mathop{\mathrm{GP}}\nolimits \left (f(\cdot )\vert \mathbf{m}^{{\ast}{\ast}}(\cdot ;\boldsymbol{\psi },\mathbf{Y}_{ d}),k^{{\ast}{\ast}}(\cdot ,\cdot ;\boldsymbol{\psi },\mathbf{Y}_{ d})\right ), }$$
(98)
where the mean is given by:
$$\displaystyle{ m^{{\ast}{\ast}}(\mathbf{x};\boldsymbol{\psi },\mathbf{Y}_{ d}) = m^{{\ast}}(\mathbf{x};\boldsymbol{\psi }) + \mathbf{k}^{{\ast}}(\mathbf{x},\mathbf{X}_{ d};\boldsymbol{\psi })\mathbf{k}^{{\ast}}(\mathbf{X}_{ d},\mathbf{X}_{d};\boldsymbol{\psi })^{-1}(\mathbf{Y}_{ d} -\mathbf{m}^{{\ast}}(\mathbf{X}_{ d};\boldsymbol{\psi })), }$$
(99)
and the covariance is given by:
$$\displaystyle{ k^{{\ast}{\ast}}(\mathbf{x},\mathbf{x}';\boldsymbol{\psi }) = k^{{\ast}}(\mathbf{x},\mathbf{x}';\boldsymbol{\psi }) -\mathbf{k}^{{\ast}}(\mathbf{x},\mathbf{X}_{ d};\boldsymbol{\psi })\mathbf{k}^{{\ast}}(\mathbf{X}_{ d},\mathbf{X}_{d};\boldsymbol{\psi })^{-1}\mathbf{k}^{{\ast}}(\mathbf{X}_{ d},\mathbf{x}';\boldsymbol{\psi }). }$$
(100)
The idea of [38] was that, if \(n_{d}\) is sufficiently large, then \(k^{{\ast}{\ast}}(\mathbf{x},\mathbf{x}')\) is very small and, thus, negligible. In other words, if \(\mathbf{X}_{d}\) is dense enough, then all the probability mass of Equation (98) is concentrated around the mean \(m^{{\ast}{\ast}}(\cdot ;\boldsymbol{\psi },\mathbf{Y}_{d})\). Therefore, one may think of the mean \(m^{{\ast}{\ast}}(\cdot ;\boldsymbol{\psi },\mathbf{Y}_{d})\) as a sample surface from Equation (40).
To make the connection with Equation (74), let \(\mathbf{L}^{{\ast}}(\boldsymbol{\psi })\) be the Cholesky decomposition of the covariance matrix \(\mathbf{k}^{{\ast}}(\mathbf{X}_{d},\mathbf{X}_{d};\boldsymbol{\psi })\) of Equation (97),
$$\displaystyle{\mathbf{k}^{{\ast}}(\mathbf{X}_{ d},\mathbf{X}_{d};\boldsymbol{\psi }) = \mathbf{L}^{{\ast}}(\boldsymbol{\psi })\left (\mathbf{L}^{{\ast}}(\boldsymbol{\psi })\right )^{T}.}$$
From Equation (95), one may express Y d as
$$\displaystyle{\mathbf{Y}_{d} = \mathbf{m}^{{\ast}}(\mathbf{X}_{ d};\boldsymbol{\psi }) + \mathbf{L}^{{\ast}}(\boldsymbol{\psi })\boldsymbol{\omega },}$$
for a \(\boldsymbol{\omega }\sim \mathcal{N}_{n_{d}}(\boldsymbol{\omega }\vert \mathbf{0}_{n_{d}},\mathbf{I}_{n_{d}})\), i.e., \(d_{\omega } = n_{d}\) in this approach. Therefore, one may rewrite Equation (99) as:
$$\displaystyle{m^{{\ast}{\ast}}(\mathbf{x};\boldsymbol{\psi };\boldsymbol{\omega }) = m^{{\ast}}(\mathbf{x};\boldsymbol{\psi }) + \mathbf{k}^{{\ast}}(\mathbf{x},\mathbf{X}_{ d};\boldsymbol{\psi })\left (\mathbf{L}^{{\ast}}(\boldsymbol{\psi })\right )^{T,-1}\boldsymbol{\omega }.}$$
From this, the matrix \(\mathbf{C}(\boldsymbol{\psi })\) of Equation (74) should be:
$$\displaystyle{ \mathbf{C}(\boldsymbol{\psi }) = \left (\mathbf{L}^{{\ast}}(\boldsymbol{\psi })\right )^{T,-1}. }$$
(101)
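A minimal sketch of how a candidate surrogate can be drawn with this construction is given below; the callables `m_post` and `k_post` (assumed to evaluate the posterior mean and covariance of Equations (41) and (42)) and the added jitter for numerical stability are illustrative assumptions.

```python
import numpy as np

def ohagan_surrogate_sample(m_post, k_post, X_d, x_eval, jitter=1e-10):
    """Draw one candidate surrogate m**(.; psi, omega) following Equations (99) and (101).

    m_post, k_post : callables for the posterior mean/covariance (assumed names)
    X_d            : (n_d, d) dense design points
    x_eval         : (m, d) points at which the sampled surrogate is evaluated
    jitter         : small diagonal regularization for numerical stability (not in the chapter)
    """
    n_d = X_d.shape[0]
    K_dd = k_post(X_d, X_d) + jitter * np.eye(n_d)
    L = np.linalg.cholesky(K_dd)                 # k*(X_d, X_d; psi) = L L^T
    omega = np.random.standard_normal(n_d)       # omega ~ N(0, I_{n_d})
    # C(psi) omega = L^{-T} omega (Equation (101)); use a solve instead of forming the inverse
    C_omega = np.linalg.solve(L.T, omega)
    return m_post(x_eval) + k_post(x_eval, X_d) @ C_omega
```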
However, since \(n_{d}\) is expected to be quite large, it is not a good idea to use all \(n_{d}\) design points in \(\mathbf{X}_{d}\) to build a functional sample. Apart from the increased computational complexity, the Cholesky decomposition of the large covariance matrix of Equation (97) introduces numerical instabilities. A heuristic that constructs a subset of design points, \(\mathbf{X}_{d,s}\), from the original, dense, set of design points, \(\mathbf{X}_{d}\), without sacrificing accuracy, is now discussed. The idea is to iteratively select the point of \(\mathbf{X}_{d}\) with maximum posterior variance, until the maximum variance falls below a threshold \(\epsilon > 0\). The algorithm is as follows:
  1. Start with \(\mathbf{X}_{d,s} =\{\}\).
  2. If \(\vert \mathbf{X}_{d,s}\vert = n_{d}\), then STOP. Otherwise, CONTINUE.
  3. Find
    $$\displaystyle{i^{{\ast}} =\arg \max _{ 1\leq j\leq n_{d}}k^{{\ast}{\ast}}(\mathbf{x}_{ d,j},\mathbf{x}_{d,j};\boldsymbol{\psi },\mathbf{X}_{d,s}),}$$
    where \(k^{{\ast}{\ast}}(\mathbf{x},\mathbf{x}';\boldsymbol{\psi },\mathbf{X}_{d,s})\) is the covariance function defined in Equation (100) with \(\mathbf{X}_{d,s}\) used instead of \(\mathbf{X}_{d}\).
  4. If
    $$\displaystyle{k^{{\ast}{\ast}}(\mathbf{x}_{ d,i^{{\ast}}},\mathbf{x}_{d,i^{{\ast}}};\boldsymbol{\psi },\mathbf{X}_{d,s}) >\epsilon ,}$$
    then set
    $$\displaystyle{\mathbf{X}_{d,s} \leftarrow \mathbf{X}_{d,s} \cup \{\mathbf{x}_{d,i^{{\ast}}}\},}$$
    and GO TO 2. Otherwise, STOP.

Notice that when one includes a new point \(\mathbf{x}_{d,i^{{\ast}}}\), he has to compute the Cholesky decomposition of the covariance matrix \(\mathbf{k}^{{\ast}}(\mathbf{X}_{d,s} \cup \{\mathbf{x}_{d,i^{{\ast}}}\},\mathbf{X}_{d,s} \cup \{\mathbf{x}_{d,i^{{\ast}}}\};\boldsymbol{\psi })\). This can be done efficiently using rank-one updates of its Cholesky factor (see Seeger [43]).
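The following sketch implements the greedy selection heuristic; for clarity it recomputes the conditional variances of Equation (100) directly at every iteration instead of performing the rank-one Cholesky updates of Seeger [43], and the callable `k_post` is again an illustrative assumption.

```python
import numpy as np

def select_design_subset(k_post, X_d, eps=1e-2, jitter=1e-10):
    """Greedy max-variance selection of X_{d,s} from a dense set X_d (Section 2.8.2 heuristic)."""
    n_d = X_d.shape[0]
    K = k_post(X_d, X_d)                    # posterior covariance on all candidate points
    prior_var = np.diag(K).copy()           # k*(x_j, x_j): the variance before any selection
    var = prior_var.copy()
    selected = []                           # indices defining X_{d,s}
    while len(selected) < n_d:
        i_star = int(np.argmax(var))
        if var[i_star] <= eps:              # maximum residual variance below the tolerance: stop
            break
        selected.append(i_star)
        # Recompute k**(x_j, x_j; psi, X_{d,s}) for all j using Equation (100)
        K_ss = K[np.ix_(selected, selected)] + jitter * np.eye(len(selected))
        K_sj = K[selected, :]               # (|X_{d,s}|, n_d)
        var = prior_var - np.einsum('ij,ij->j', K_sj, np.linalg.solve(K_ss, K_sj))
        var[selected] = -np.inf             # never re-select an already included point
    return X_d[selected]
```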

2.9 Semi-analytic Formulas for the Mean and the Variance

Equation (74) can be used directly to obtain samples from the predictive distribution, \(p(\mathcal{Q}\vert \mathcal{D})\), of a statistic of interest \(\mathcal{Q}[\cdot ]\). Thus, using a Monte Carlo procedure, one can characterize one's uncertainty about any statistic of the response surface. This could become computationally expensive in the case of high dimensions and many observations, albeit far less expensive than evaluating these statistics using the simulator itself. Fortunately, as shown in this section, it is actually possible to evaluate exactly the predictive distribution of the mean statistic, Equation (11), since it turns out to be Gaussian. Furthermore, it is possible to derive the predictive mean and variance of the covariance statistic of Equation (12).

2.9.1 One-Dimensional Output with No Spatial or Time Inputs

Assume a one-dimensional output d y  = 1 and that there are no spatial or time inputs, i.e., \(d_{s} = d_{t} = 0\). In this case, one has \(\mathbf{x} =\boldsymbol{\xi },\mathbf{X}_{d} =\boldsymbol{\varXi } _{d}\), and he may simply rewrite Equation (74) as:
$$\displaystyle{ \hat{f}(\boldsymbol{\xi };\boldsymbol{\theta }) = m^{{\ast}}(\boldsymbol{\xi };\boldsymbol{\psi }) + \mathbf{k}^{{\ast}}(\boldsymbol{\xi },\boldsymbol{\varXi }_{ d};\boldsymbol{\psi })\mathbf{C}(\boldsymbol{\psi })\boldsymbol{\omega }, }$$
(102)
where \(\boldsymbol{\theta },\boldsymbol{\psi },\) and \(\boldsymbol{\omega }\) are as before. Taking the expectation of this over \(p(\boldsymbol{\xi })\), one obtains:
$$\displaystyle{ \mathbb{E}_{\boldsymbol{\xi }}[\hat{f}(\cdot ;\boldsymbol{\theta })] = \mathbb{E}_{\boldsymbol{\xi }}[m^{{\ast}}(\cdot ;\boldsymbol{\psi })] + \mathbb{E}_{\boldsymbol{\xi }}[\mathbf{k}^{{\ast}}(\boldsymbol{\xi },\boldsymbol{\varXi }_{ d};\boldsymbol{\psi })]\mathbf{C}(\boldsymbol{\psi })\boldsymbol{\omega }. }$$
(103)
Since \(\boldsymbol{\omega }\) is a standard normal random vector (see Equation (78)), it can be integrated out of Equation (103) to give:
$$\displaystyle{ \mathbb{E}_{\boldsymbol{\xi }}[f(\cdot )]\vert \mathcal{D},\boldsymbol{\psi }\sim \mathcal{N}\left (\mathbb{E}_{\boldsymbol{\xi }}[f(\cdot )]\vert \mu _{\mu _{f}}(\boldsymbol{\psi }),\sigma _{\mu _{f}}^{2}(\boldsymbol{\psi })\right ), }$$
(104)
where
$$\displaystyle{ \mu _{\mu _{f}}(\boldsymbol{\psi }) := \mathbb{E}_{\boldsymbol{\xi }}[m^{{\ast}}(\cdot ;\boldsymbol{\psi })], }$$
(105)
and
$$\displaystyle{ \sigma _{\mu _{f}}^{2}(\boldsymbol{\psi }) :=\parallel \mathbf{C}^{T}(\boldsymbol{\psi })\boldsymbol{\epsilon }^{{\ast}}(\boldsymbol{\varXi }_{ d};\boldsymbol{\psi }) \parallel ^{2}, }$$
(106)
with
$$\displaystyle{ \boldsymbol{\epsilon }^{{\ast}}(\boldsymbol{\varXi }_{ d};\boldsymbol{\psi }) := \mathbb{E}_{\boldsymbol{\xi }}[\mathbf{k}^{{\ast}}(\boldsymbol{\varXi }_{ d},\cdot ;\boldsymbol{\psi })]. }$$
(107)
All these quantities are expressible in terms of expectations of the covariance function with respect to \(p(\boldsymbol{\xi })\):
$$\displaystyle{ \boldsymbol{\epsilon }(\boldsymbol{\varXi }';\boldsymbol{\psi }_{k}) := \mathbb{E}_{\boldsymbol{\xi }}[\mathbf{k}(\boldsymbol{\varXi }',\cdot ;\boldsymbol{\psi }_{k})]. }$$
(108)
Indeed, from Equation (41) one gets:
$$\displaystyle{ \mu _{\mu _{f}}(\boldsymbol{\psi }) = \mathbb{E}_{\boldsymbol{\xi }}[m(\cdot ;\boldsymbol{\psi }_{m})] +\boldsymbol{\epsilon } (\boldsymbol{\varXi };\boldsymbol{\psi }_{k})^{T}\mathbf{k}(\boldsymbol{\varXi },\boldsymbol{\varXi };\boldsymbol{\psi }_{ k})^{-1}(\mathbf{Y} -\mathbf{m}(\boldsymbol{\varXi };\boldsymbol{\psi }_{ m})), }$$
(109)
and from Equation (42)
$$\displaystyle{ \boldsymbol{\epsilon }^{{\ast}}(\boldsymbol{\varXi }_{ d};\boldsymbol{\psi }) =\boldsymbol{\epsilon } (\boldsymbol{\varXi }_{d};\boldsymbol{\psi }_{k}) -\mathbf{k}(\boldsymbol{\varXi }_{d},\boldsymbol{\varXi };\boldsymbol{\psi }_{k})\mathbf{k}(\boldsymbol{\varXi },\boldsymbol{\varXi };\boldsymbol{\psi }_{k})^{-1}\boldsymbol{\epsilon }(\boldsymbol{\varXi };\boldsymbol{\psi }_{ k}). }$$
(110)
For the case of an SE covariance (see Equation (37)) combined with a Gaussian or uniform distribution \(p(\boldsymbol{\xi })\), [34] and [7], respectively, show that Equation (108) can be computed analytically. As shown in [8], for the case of an arbitrary separable covariance as well as arbitrary independent random variables \(\boldsymbol{\xi }\), Equation (108) can be computed efficiently by carrying out \(d_{\xi }\) one-dimensional integrals.
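As an illustration of the former case, and assuming the one-dimensional parameterization \(k(\xi ,\xi ') = s^{2}\exp \left (-(\xi -\xi ')^{2}/(2\ell^{2})\right )\) (which may differ in details from the exact form of Equation (37)) together with \(\xi \sim \mathcal{N}(m_{\xi },\sigma _{\xi }^{2})\), the required expectation is available in closed form:
$$\displaystyle{\mathbb{E}_{\xi }\left [k(\xi ',\xi )\right ] = s^{2}\, \frac{\ell} {\sqrt{\ell^{2 } +\sigma _{\xi }^{2}}}\,\exp \left (-\frac{(\xi ' - m_{\xi })^{2}} {2\left (\ell^{2} +\sigma _{\xi }^{2}\right )}\right ).}$$
In higher dimensions, with a separable SE covariance and independent Gaussian inputs, the expectation is simply the product of such one-dimensional factors.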

2.9.2 One-Dimensional Output with Spatial and/or Time Inputs

Consider the case of one-dimensional output, i.e., d y  = 1, with possible spatial and/or time inputs, i.e., d s , d t  ≥ 0. In this generic case, \(\mathbf{x} = (\mathbf{x}_{s},t,\boldsymbol{\xi })\).

It is possible to use the particular form of Equation (74) to derive semi-analytic formulas for some of the statistics. Let us start by considering the mean statistic. One has
$$\displaystyle{ \mathbb{E}_{\boldsymbol{\xi }}[\hat{f}(\cdot ;\boldsymbol{\theta })](\mathbf{x}_{s},t) = \mathbb{E}_{\boldsymbol{\xi }}\left [m^{{\ast}}(\mathbf{x}_{ s},t,\boldsymbol{\xi };\boldsymbol{\psi })\right ] + \mathbb{E}_{\boldsymbol{\xi }}\left [\mathbf{k}^{{\ast}}((\mathbf{x}_{ s},t,\boldsymbol{\xi }),\mathbf{X}_{d};\boldsymbol{\psi })\right ]\mathbf{C}(\boldsymbol{\psi })\boldsymbol{\omega }. }$$
(111)
In other words, it has been shown that if \(f(\cdot )\) is a Gaussian process, then its mean over \(\boldsymbol{\xi }\), \(\mathbb{E}_{\boldsymbol{\xi }}[f(\cdot )](\cdot )\), is also a Gaussian process over the space/time inputs:
$$\displaystyle{ \mathbb{E}_{\boldsymbol{\xi }}[f(\cdot )](\cdot )\vert \mathcal{D},\boldsymbol{\psi }\sim \mathop{\mathrm{GP}}\nolimits \left (\mathbb{E}_{\boldsymbol{\xi }}[f(\cdot )](\cdot )\vert m_{\text{mean}}(\cdot ;\boldsymbol{\psi }),k_{\text{mean}}(\cdot ,\cdot ;\boldsymbol{\psi })\right ), }$$
(112)
with mean function:
$$\displaystyle{ m_{\text{mean}}(\mathbf{x}_{s},t;\boldsymbol{\psi }) = \mathbb{E}_{\boldsymbol{\xi }}\left [m^{{\ast}}(\mathbf{x}_{ s},t,\boldsymbol{\xi };\boldsymbol{\psi })\right ], }$$
(113)
and covariance function:
$$\displaystyle\begin{array}{rcl} & & k_{\text{mean}}((\mathbf{x}_{s},t),(\mathbf{x}_{s}',t')) \\ & & \quad = \mathbb{E}_{\boldsymbol{\xi }}\left [\mathbf{k}^{{\ast}}((\mathbf{x}_{ s},t,\boldsymbol{\xi }),\mathbf{X}_{d};\boldsymbol{\psi })\right ]\mathbf{C}(\boldsymbol{\psi })\mathbf{C}(\boldsymbol{\psi })^{T}\mathbb{E}_{\boldsymbol{\xi }}\left [\mathbf{k}^{{\ast}}((\mathbf{x}_{ s}',t',\boldsymbol{\xi }),\mathbf{X}_{d};\boldsymbol{\psi })\right ]^{T}.{}\end{array}$$
(114)
Note that if the stochastic variables in \(\boldsymbol{\xi }\) are independent and the covariance function \(k(\mathbf{x},\mathbf{x}';\boldsymbol{\psi })\) is separable with respect to the \(\xi _{i}\)'s, then all these quantities can be computed efficiently with one-dimensional numerical integration.
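As an illustration of this remark, the following sketch estimates the building block \(\boldsymbol{\epsilon }\) of Equation (108) for a covariance that is separable in the \(\xi _{i}\)'s with independent inputs, using one one-dimensional node average per stochastic dimension. The function name, the unit signal strength, and the equal-weight nodes are assumptions made for illustration only.

```python
import numpy as np

def eps_separable(Xi_prime, ell, xi_nodes):
    """Estimate eps(Xi'; psi_k) = E_xi[k(Xi', xi)] of Equation (108) for a covariance that
    is separable in the xi_i's, k(xi', xi) = prod_i exp(-(xi'_i - xi_i)^2 / (2 ell_i^2)),
    with independent xi_i (unit signal strength assumed for brevity).

    Xi_prime : (n, d_xi) points at which the expectation is required
    ell      : (d_xi,) length scales of the one-dimensional SE factors
    xi_nodes : (d_xi, n_nodes) equally weighted nodes (e.g., samples) of each xi_i
    """
    n, d_xi = Xi_prime.shape
    eps = np.ones(n)
    for i in range(d_xi):
        # One independent one-dimensional integral per stochastic dimension
        diff = Xi_prime[:, i][:, None] - xi_nodes[i][None, :]
        eps *= np.exp(-0.5 * (diff / ell[i]) ** 2).mean(axis=1)
    return eps
```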

Equations similar to Equation (111) can be derived without difficulty for the covariance statistic of Equation (12) as well as for the variance statistic of Equation (13) [10, 14]. In contrast to the mean statistic, however, the resulting random field is not Gaussian. That is, an equation similar to Equation (112) does not hold.

3 Numerical Examples

3.1 Synthetic One-Dimensional Example

In this synthetic example, the ability of the Bayesian approach to characterize one's state of knowledge about the statistics with a very limited number of observations is demonstrated. To keep things simple, start with no space/time inputs (\(d_{s} = d_{t} = 0\)), one stochastic variable, \(d_{\xi } = 1\), and one output, \(d_{y} = 1\). That is, the input is just \(x =\xi\). Consider n = 7 arbitrary observations, \(\mathcal{D} = \left \{\left (x^{(i)},y^{(i)}\right )\right \}_{i=1}^{n}\), which are shown as crosses in Fig. 1a. The goal is to use these seven observations to learn the underlying response function y = f(x) and to characterize one's state of knowledge about the mean \(\mathbb{E}[f(\cdot )]\) (Equation (11)), the variance \(\mathbb{V}[f(\cdot )]\) (Equation (13)), and the induced probability density function of the response y:
$$\displaystyle{p(y) =\int \delta \left (y - f(x)\right )p(x)dx,}$$
where p(x) is the input probability density, taken to be a Beta(10, 2) and shown in Fig. 1b.
Fig. 1

Synthetic: Subfigure (a) shows the observed data (cross symbols), the mean (dashed blue line), the 95 % predictive intervals (shaded gray area), and three samples (solid black lines) from the posterior Gaussian process conditioned on the observed data. The green line of subfigure (b) shows the probability density function imposed on the input x. The three lines in subfigure (c) correspond to the first three eigenfunctions used in the KLE of the posterior GP. Subfigures (d) and (e) depict the predictive distribution, conditioned on the observations, of the mean and the variance statistic of f(x), respectively. Subfigure (f) shows the mean predicted probability density of y = f(x) (blue dashed line) with 95 % predictive intervals (shaded gray area), and three samples (solid black lines) from the posterior predictive probability measure on the space of probability densities

The first step is to assign a prior Gaussian process to the response (Equation (18)). This is done by picking a zero mean and an SE covariance function (Equation (37)) with no nugget, i.e., \(\sigma ^{2} = 0\) in Equation (35), and with the signal and length-scale parameters fixed to s = 1 and \(\ell= 0.1\), respectively. These choices represent one's prior beliefs about the underlying response function y = f(x).

Given the observations in \(\mathcal{D}\), the updated state of knowledge is characterized by the posterior GP of Equation (40). The posterior mean function, \(m^{{\ast}}(\cdot )\) of Equation (41), is the dashed blue line of Fig. 1a. The shaded gray area of the same figure corresponds to a 95 % predictive interval about the mean. This interval is computed using the posterior covariance function, \(k^{{\ast}}(\cdot ,\cdot )\) of Equation (42). Specifically, the point predictive distribution at x is
$$\displaystyle{p(y\vert x,\mathcal{D}) = \mathcal{N}\left (y\vert m^{{\ast}}(x),\left (\sigma ^{{\ast}}(x)\right )^{2}\right ),}$$
where \(\sigma ^{{\ast}}(x) = \sqrt{k^{{\ast} } (x, x)}\) and, thus, the 95 % predictive interval at x is given, approximately, by \(\left (m^{{\ast}}(x) - 1.96\sigma ^{{\ast}}(x),m^{{\ast}}(x) + 1.96\sigma ^{{\ast}}(x)\right )\). The posterior mean can be thought of as a point estimate of the underlying response surface.

In order to sample possible surrogates from Equation (40), the Karhunen-Loève approach for constructing \(\hat{f}(\cdot ;\boldsymbol{\theta })\) is followed (see Equations (74) and (83)), retaining d ω  = 3 eigenfunctions (see Equations (81) and (91)) of the posterior covariance which account for more than α = 90 % of the energy of the posterior GP (see Equation (92)). These eigenfunctions are shown in Fig. 1c. Using the constructed \(\hat{f}(\cdot ;\boldsymbol{\theta })\), one can sample candidate surrogates. Three such samples are shown as solid black lines in Fig. 1a.

Having constructed a finite-dimensional representation of the posterior GP, one is in a position to characterize one's state of knowledge about arbitrary statistics of the response, which is captured by Equation (17). Here the suggested two-step procedure is followed. That is, candidate surrogates are repeatedly sampled and then the statistics of interest are computed for each sample. In the results presented, 1,000 sampled candidate surrogates are used. Figure 1d shows the predictive probability density for the mean of the response, \(p\left (\mathbb{E}[f(\cdot )]\vert \mathcal{D}\right )\). Note that this result can also be obtained semi-analytically using Equation (104). Figure 1e shows the predictive probability density for the variance of the response, \(p\left (\mathbb{V}[f(\cdot )]\vert \mathcal{D}\right )\), which is not available analytically. Finally, subfigure (f) of the same figure characterizes the predictive distribution of the PDF of the response p(y). Specifically, the blue dashed line corresponds to the median of the PDFs of each one of the 1,000 sampled candidate surrogates, while the gray shaded area corresponds to a 95 % predictive interval around the median. The solid black lines of the same figure are the PDFs of three arbitrary sampled candidate surrogates.
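A sketch of the two-step procedure used in this example is given below; the sampler `sample_surrogate`, assumed to return one candidate surrogate constructed as in Equation (74), is an illustrative placeholder, and the sample sizes are assumptions.

```python
import numpy as np

def characterize_statistics(sample_surrogate, n_surrogates=1000, n_mc=10000, seed=0):
    """Two-step procedure: (i) sample candidate surrogates, (ii) compute statistics per sample."""
    rng = np.random.default_rng(seed)
    x = rng.beta(10.0, 2.0, size=n_mc)        # samples from the input density p(x) = Beta(10, 2)
    means, variances = [], []
    for _ in range(n_surrogates):
        f_hat = sample_surrogate()            # one candidate surrogate from the posterior GP
        y = f_hat(x)
        means.append(y.mean())                # one draw from p(E[f(.)] | D)
        variances.append(y.var())             # one draw from p(V[f(.)] | D)
    return np.array(means), np.array(variances)

# Histograms of the returned arrays approximate the predictive densities of Fig. 1d and 1e;
# a kernel density estimate of each set of y values could be used to reproduce Fig. 1f.
```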

3.2 Dynamical System Example

In this example, the Bayesian approach to uncertainty propagation is applied to a dynamical system with random initial conditions. In particular, the dynamical system [49]:
$$\displaystyle\begin{array}{rcl} \frac{dy_{1}} {dt} & =& y_{1}y_{3}, {}\\ \frac{dy_{2}} {dt} & =& -y_{2}y_{3}, {}\\ \frac{dy_{3}} {dt} & =& -y_{1}^{2} + y_{ 2}^{2}, {}\\ \end{array}$$
subject to uncertain initial conditions at \(t = 0\):
$$\displaystyle{y_{1}(0) = 1,\;y_{2}(0) = 0.1\xi _{1},\;y_{3}(0) =\xi _{2},}$$
where
$$\displaystyle{\xi _{i} \sim \mathcal{U}([-1,1]),\ i = 1,2,}$$
is considered. To make the connection with the notation of this chapter, note that \(d_{s} = 0,d_{t} = 1,d_{\xi } = 2\), and \(d_{y} = 3\). For each choice of \(\boldsymbol{\xi }\), the computer simulator, \(\mathbf{f}_{c}(\boldsymbol{\xi })\) of Equation (7), reports the response at \(n_{t} = 20\) equidistant time steps in [0, 10], the \(\mathbf{X}_{t}\) of Equation (5). The results of \(n_{\xi }\) randomly picked simulations are observed, and one wants to characterize his state of knowledge about the statistics of the response. The cases of \(n_{\xi } = 70,100\), and 150 are considered. Note that propagating uncertainty through this dynamical system is not trivial, since there exists a discontinuity in the response surface as \(\xi _{1}\) crosses zero.
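For reference, a minimal sketch of the forward model of this example, using a standard ODE integrator, is given below; the solver tolerances and the inclusion of t = 0 in the time grid are assumptions.

```python
import numpy as np
from scipy.integrate import solve_ivp

def forward_model(xi, n_t=20, t_final=10.0):
    """Solve the dynamical system for a given xi in [-1, 1]^2 and report the response
    at n_t equidistant time steps in [0, t_final]."""
    def rhs(t, y):
        y1, y2, y3 = y
        return [y1 * y3, -y2 * y3, -y1 ** 2 + y2 ** 2]
    y0 = [1.0, 0.1 * xi[0], xi[1]]                      # uncertain initial conditions
    t_eval = np.linspace(0.0, t_final, n_t)
    sol = solve_ivp(rhs, (0.0, t_final), y0, t_eval=t_eval, rtol=1e-8, atol=1e-10)
    return sol.t, sol.y                                 # sol.y has shape (3, n_t)

# Training data are obtained by evaluating the model at n_xi random inputs, e.g.,
#   Xi = np.random.uniform(-1.0, 1.0, size=(100, 2))
#   data = [forward_model(xi) for xi in Xi]
```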

The prior GP is picked to be a multi-output GP with linearly correlated outputs, Equation (67), with a constant mean function, \(h(t,\boldsymbol{\xi }) = 1\), and a separable covariance function, Equation (46), with both the time and stochastic covariance functions being SE, Equation (37), with nuggets, Equation (35). Denote the hyper-parameters of the time and stochastic parts of the covariance by \(\boldsymbol{\psi }_{t} =\{\ell _{t},\sigma _{t}\}\) and \(\boldsymbol{\psi }_{\xi } =\{\ell _{\xi ,1},\ell_{\xi ,2},\sigma _{\xi }\}\), respectively. An exponential prior is assigned to all of them, albeit with different rate parameters. Specifically, the rate of \(\ell_{t}\) is 2, the rate of \(\ell_{\xi ,i},i = 1,2\), is 20, and the rate of the nuggets \(\sigma _{t}\) and \(\sigma _{\xi }\) is \(10^{6}\). This assignment corresponds to the vague prior knowledge that the a priori mean of the time length scale is about 0.5 of the time unit, the length scale of \(\boldsymbol{\xi }\) is about 0.05 of its unit, and the nuggets are expected to be around \(10^{-6}\). According to the comment below Equation (70), the signal strength can be picked to be identically equal to one, since it is absorbed by the covariance matrix \(\boldsymbol{\varSigma }\). For the hyper-parameter of the mean function, i.e., the constant value, a flat uninformative prior is assigned. As already discussed, with this choice it is possible to integrate it out of the model analytically.

The model is trained by sampling the posterior of \(\boldsymbol{\psi }=\{\boldsymbol{\psi } _{t},\boldsymbol{\psi }_{\xi }\}\) (see Equation (43)) using a mixed MCMC-Gibbs scheme (see [10] for a discussion of the scheme and evidence of convergence). After the MCMC chain has sufficiently mixed (this takes about 500 iterations), a particle approximation of the posterior state of knowledge about the response surface is constructed. This is done as follows. At every 100th step of the MCMC chain (the intermediate 99 steps are dropped to reduce correlations), 100 candidate surrogates are drawn using the O'Hagan procedure with a tolerance of \(\epsilon = 10^{-2}\).

In all plots, the blue solid lines and the shaded gray areas depict the predictive mean and the 95 % predictive intervals of the corresponding statistics, respectively. The prediction about the time evolution of the mean response, \(p\left (\mathbb{E}\left [y_{i}(t)\right ]\vert \mathcal{D}\right ),i = 1,3\), for the case of \(n_{\xi } = 100\) observations is shown in the first row of Fig. 2. Note that there is very little residual epistemic uncertainty in this prediction. The time evolution of the variance of the response, \(p\left (\mathbb{V}\left [y_{i}(t)\right ]\vert \mathcal{D}\right )\), is shown in the second and third rows of the same figure for \(n_{\xi } = 100\) and \(n_{\xi } = 150\), respectively. Notice how the width of the predictive interval decreases with increasing \(n_{\xi }\). In Fig. 3, the time evolution of the probability density of y2(t) is summarized. Specifically, the four rows correspond to four different time instants, t = 4, 6, 8, and 10, and the columns refer to the different sample sizes \(n_{\xi } = 70,100\), and 150, counting from the left.
Fig. 2

Dynamical system: Subfigures (a) and (b) correspond to the predictions about the mean of y1(t) and y3(t), respectively, using \(n_{\xi } = 100\) simulations. Subfigures (c) and (d) ((e) and (f)) show the predictions about the variance of the same quantities for \(n_{\xi } = 100\) (\(n_{\xi } = 150\)) simulations (Reproduced with permission from [10])

Fig. 3

Dynamical system: Columns correspond to results using \(n_{\xi } = 70,100\), and 150 simulations counting from the left. Counting from the top, rows one, two, three, and four show the predictions about the PDF of y2(t) at times \(t = 4,6,8,10\), respectively (Reproduced with permission from [10])

3.3 Partial Differential Equation Example

In this example, it is shown how the Bayesian approach to uncertainty propagation can be applied to a partial differential equation. In particular, a two-dimensional (\(\mathcal{X}_{s} = [0,1]^{2}\) and \(d_{s} = 2\)), single-phase, steady-state (\(d_{t} = 0\)) flow through an uncertain permeability field is studied; see Aarnes et al. [1] for a review of the underlying physics and solution methodologies. The uncertainty in the permeability is represented by a truncated KLE of an exponentiated Gaussian random field with zero mean, an exponential covariance function, signal strength equal to one, and correlation length equal to 0.1. The total number of stochastic variables corresponds to the truncation order of the KLE, and it is chosen to be \(d_{\xi } = 50\). Three outputs, \(d_{y} = 3\), are considered: the pressure, \(p(\mathbf{x}_{s};\boldsymbol{\xi })\), and the horizontal and vertical components of the velocity field, \(u(\mathbf{x}_{s};\boldsymbol{\xi })\) and \(v(\mathbf{x}_{s};\boldsymbol{\xi })\), respectively. The simulator, \(\mathbf{f}_{c}(\boldsymbol{\xi })\) of Equation (7), is based on the finite element method and is described in detail in [10]; it reports the response on a regular 32 × 32 spatial grid, i.e., \(n_{s} = 32^{2} = 1,024\). The objective is to quantify the statistics of the response using a limited number of \(n_{\xi } = 24,64\), and 120 simulations. The results are validated by comparing against a plain vanilla MC estimate of the statistics using 108,000 samples.

As in the previous example, the prior state of knowledge is represented using a multi-output GP with linearly correlated outputs, Equation (67); a constant mean function, \(h(\mathbf{x}_{s},\boldsymbol{\xi }) = 1\); and a separable covariance function, Equation (46), with both the space and stochastic covariance functions being SE, Equation (37), with nuggets, Equation (35). Denote the hyper-parameters of the spatial and stochastic parts of the covariance by \(\boldsymbol{\psi }_{s} =\{\ell _{s,1},\ell_{s,2},\sigma _{s}\}\) and \(\boldsymbol{\psi }_{\xi } =\{\ell _{\xi ,1},\ldots ,\ell_{\xi ,50},\sigma _{\xi }\}\), respectively. The separability of the spatial component is exploited to significantly reduce the computational cost of the calculations. Again, exponential priors are assigned. The rate parameters of the spatial length scales are 100, corresponding to an a priori expectation of 0.01 spatial units. The rates of \(\ell_{\xi ,i},\sigma _{s}\), and \(\sigma _{\xi }\) are 3, 100, and 100, respectively.

The posterior of the hyper-parameters, Equation (43), is sampled using 100,000 iterations of the same MCMC-Gibbs procedure as in the previous example. However, in order to reduce the computational burden, a single-particle MAP approximation to the posterior, Equation (59), is constructed by selecting the sample with the highest posterior probability among the 100,000 MCMC-Gibbs samples collected. Then, 100 candidate surrogate surfaces are sampled following the O'Hagan procedure with a tolerance of \(\epsilon = 10^{-2}\). For each sampled surrogate, the statistics of interest are calculated and compared to MC estimates.

Figure 4 compares the predictive mean of \(\mathbb{E}_{\boldsymbol{\xi }}[u(\mathbf{x}_{s};\boldsymbol{\xi })]\), the mean of the horizontal velocity component as a function of the spatial coordinates, conditioned on \(n_{\xi } = 24,64\), and 120 simulations (subfigures (a), (b), and (c), respectively), to the MC estimate (subfigure (e)). The error bars shown in subfigure (d) of the same figure correspond to two standard deviations of the predictive \(p\left (\mathbb{E}_{\boldsymbol{\xi }}[u(\mathbf{x}_{s};\boldsymbol{\xi })]\vert \mathcal{D}\right )\) for the case of \(n_{\xi } = 120\) simulations. Figures 5 and 6 report the same statistic for the vertical component of the velocity, \(v(\mathbf{x}_{s};\boldsymbol{\xi })\), and the pressure, \(p(\mathbf{x}_{s};\boldsymbol{\xi })\), respectively. Similarly, Figs. 7, 8, and 9 characterize the predictive distributions of the variances of the horizontal velocity component, \(p\left (\mathbb{V}\left [u(\mathbf{x}_{s};\boldsymbol{\xi })\right ]\vert \mathcal{D}\right )\); the vertical velocity component, \(p\left (\mathbb{V}\left [v(\mathbf{x}_{s};\boldsymbol{\xi })\right ]\vert \mathcal{D}\right )\); and the pressure, \(p\left (\mathbb{V}\left [p(\mathbf{x}_{s};\boldsymbol{\xi })\right ]\vert \mathcal{D}\right )\), respectively. Even though one observes an underestimation of the variance, which is more pronounced for the cases with fewer simulations, the truth is well covered by the predicted error bars.
Fig. 4

Partial differential equation: Mean of \(\mathbb{E}[u(\mathbf{x}_{s};\boldsymbol{\xi })]\). Subfigures (a), (b), and (c) show the predictive mean of \(\mathbb{E}[u(\mathbf{x}_{s};\boldsymbol{\xi })]\) as a function of x s conditioned on 24, 64, and 120 simulations, respectively. Subfigure (d) plots two standard deviations of \(\mathbb{E}[u(\mathbf{x}_{s};\boldsymbol{\xi })]\) conditioned on 120 observations. Subfigure (e) shows the MC estimate of the same quantity (Reproduced with permission from [10])

Fig. 5

Partial differential equation: Mean of \(\mathbb{E}[v(\mathbf{x}_{s};\boldsymbol{\xi })]\). Subfigures (a), (b), and (c) show the predictive mean of \(\mathbb{E}[v(\mathbf{x}_{s};\boldsymbol{\xi })]\) as a function of x s conditioned on 24, 64, and 120 simulations, respectively. Subfigure (d) plots two standard deviations of \(\mathbb{E}[v(\mathbf{x}_{s};\boldsymbol{\xi })]\) conditioned on 120 observations. Subfigure (e) shows the MC estimate of the same quantity (Reproduced with permission from [10])

Fig. 6

Partial differential equation: Mean of \(\mathbb{E}[p(\mathbf{x}_{s};\boldsymbol{\xi })]\). Subfigures (a), (b), and (c) show the predictive mean of \(\mathbb{E}[p(\mathbf{x}_{s};\boldsymbol{\xi })]\) as a function of x s conditioned on 24, 64, and 120 simulations, respectively. Subfigure (d) plots two standard deviations of \(\mathbb{E}[p(\mathbf{x}_{s};\boldsymbol{\xi })]\) conditioned on 120 observations. Subfigure (e) shows the MC estimate of the same quantity (Reproduced with permission from [10])

Fig. 7

Partial differential equation: Mean of \(\mathbb{V}[u(\mathbf{x}_{s};\boldsymbol{\xi })]\). Subfigures (a), (b), and (c) show the predictive mean of \(\mathbb{V}[u(\mathbf{x}_{s};\boldsymbol{\xi })]\) as a function of x s conditioned on 24, 64, and 120 simulations, respectively. Subfigure (d) plots two standard deviations of \(\mathbb{V}[u(\mathbf{x}_{s};\boldsymbol{\xi })]\) conditioned on 120 observations. Subfigure (e) shows the MC estimate of the same quantity (Reproduced with permission from [10])

Fig. 8

Partial differential equation: Mean of \(\mathbb{V}[v(\mathbf{x}_{s};\boldsymbol{\xi })]\). Subfigures (a), (b), and (c) show the predictive mean of \(\mathbb{V}[v(\mathbf{x}_{s};\boldsymbol{\xi })]\) as a function of x s conditioned on 24, 64, and 120 simulations, respectively. Subfigure (d) plots two standard deviations of \(\mathbb{V}[v(\mathbf{x}_{s};\boldsymbol{\xi })]\) conditioned on 120 observations. Subfigure (e) shows the MC estimate of the same quantity (Reproduced with permission from [10])

Fig. 9

Partial differential equation: Mean of \(\mathbb{V}[p(\mathbf{x}_{s};\boldsymbol{\xi })]\). Subfigures (a), (b), and (c) show the predictive mean of \(\mathbb{V}[p(\mathbf{x}_{s};\boldsymbol{\xi })]\) as a function of x s conditioned on 24, 64, and 120 simulations, respectively. Subfigure (d) plots two standard deviations of \(\mathbb{V}[p(\mathbf{x}_{s};\boldsymbol{\xi })]\) conditioned on 120 observations. Subfigure (e) shows the MC estimate of the same quantity (Reproduced with permission from [10])

In Fig. 10 the solid blue line and the shaded gray area correspond to the mean of the predictive probability density of the PDF of the horizontal velocity component \(u(\mathbf{x}_{s} = (0.5,0.5);\boldsymbol{\xi })\) and a two-standard-deviation band about it, respectively, conditioned on 24 (a), 64 (b), and 120 (c) simulations; subfigure (d) shows the corresponding MC estimate for comparison. Notice that the ground truth, i.e., the MC estimate, always falls within the shaded areas. Finally, Figs. 11, 12, and 13 show the predictive distributions of the PDFs of \(u(\mathbf{x}_{s} = (0.25,0.25);\boldsymbol{\xi })\), \(p(\mathbf{x}_{s} = (0.5,0.5);\boldsymbol{\xi })\), and \(p(\mathbf{x}_{s} = (0.25,0.25);\boldsymbol{\xi })\), respectively.
Fig. 10

Partial differential equation: The prediction for the PDF of \(u(\mathbf{x}_{s} = (0.5,0.5);\boldsymbol{\xi })\). The solid blue line shows the predictive mean of the PDF conditioned on 24 (a), 64 (b), and 120 (c) simulations. The filled gray area depicts two standard deviations of the predictive distribution of the PDF about its predictive mean. The solid red line of (d) shows the MC estimate for comparison (Reproduced with permission from [10])

Fig. 11

Partial differential equation: The prediction for the PDF of \(u(\mathbf{x}_{s} = (0.25,0.25);\boldsymbol{\xi })\). The solid blue line shows the predictive mean of the PDF conditioned on 24 (a), 64 (b), and 120 (c) simulations. The filled gray area depicts two standard deviations of the predictive distribution of the PDF about its predictive mean. The solid red line of (d) shows the MC estimate for comparison (Reproduced with permission from [10])

Fig. 12

Partial differential equation: The prediction for the PDF of \(p(\mathbf{x}_{s} = (0.5,0.5);\boldsymbol{\xi })\). The solid blue line shows the predictive mean of the PDF conditioned on 24 (a), 64 (b), and 120 (c) simulations. The filled gray area depicts two standard deviations of the predictive distribution of the PDF about its predictive mean. The solid red line of (d) shows the MC estimate for comparison (Reproduced with permission from [10])

Fig. 13

Partial differential equation: The prediction for the PDF of \(p(\mathbf{x}_{s} = (0.25,0.25);\boldsymbol{\xi })\). The solid blue line shows the predictive mean of the PDF conditioned on 24 (a), 64 (b), and 120 (c) simulations. The filled gray area depicts two standard deviations of the predictive distribution of the PDF about its predictive mean. The solid red line of (d) shows the MC estimate for comparison (Reproduced with permission from [10])

4 Conclusions

In this chapter, we presented a comprehensive review of the Bayesian approach to the UP problem that is able to quantify the epistemic uncertainty induced by a limited number of simulations. The core idea was to interpret a GP as a probability measure on the space of surrogates which characterizes our prior state of knowledge about the response surface. We focused on practical aspects of GPs such as the treatment of spatiotemporal variation and multi-output responses. We showed how the prior GP can be conditioned on the observed simulations to obtain a posterior GP, whose probability mass corresponds to the epistemic uncertainty introduced by the limited number of simulations, and we introduced sampling-based techniques that allow for its quantification.

Despite the successes of the current state of the Bayesian approach to the UP problem, there is still a wealth of open research questions. First, carrying out GP regression in high dimensions is not a trivial problem since it requires the development of application-specific covariance functions. The study of covariance functions that automatically perform some kind of internal dimensionality reduction seems to be a promising step forward. Second, in order to capture sharp variations in the response surface, such as localized bumps or even discontinuities, there is a need for flexible nonstationary covariance functions or alternative approaches based on mixtures of GPs, e.g., see [14]. Third, there is a need for computationally efficient ways of treating nonlinear correlations between distinct model outputs, since this is expected to squeeze more information out of the simulations. Fourth, in a semi-intrusive approach, the mathematical models describing the physics of the problem could be used to derive physics-constrained covariance functions that would, presumably, force the prior GP probability measure to be compatible with known response properties, such as mass conservation. That is, such an approach would put more effort into better representing our prior state of knowledge about the response. Fifth, there is an evident need for developing simulation selection policies which are specifically designed to gather information about the uncertainty propagation task. Finally, note that the Bayesian approach can also be applied to other important contexts such as model calibration and design optimization under uncertainty. As a result, all the open research questions have the potential to also revolutionize these fields.

References

  1. Aarnes, J.E., Kippe, V., Lie, K.A., Rustad, A.B.: Modelling of multiscale structures in flow simulations for petroleum reservoirs. In: Hasle, G., Lie, K.A., Quak, E. (eds.) Geometric Modelling, Numerical Simulation, and Optimization, chap. 10, pp. 307–360. Springer, Berlin/Heidelberg (2007). doi:10.1007/978-3-540-68783-2_10
  2. Alvarez, M., Lawrence, N.D.: Sparse convolved Gaussian processes for multi-output regression. In: Koller, D., Schuurmans, D., Bengio, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems 21 (NIPS 2008), Vancouver (2008)
  3. Alvarez, M., Luengo-Garcia, D., Titsias, M., Lawrence, N.: Efficient multioutput Gaussian processes through variational inducing kernels. In: Ft. Lauderdale (2011)
  4. Babuska, I., Nobile, F., Tempone, R.: A stochastic collocation method for elliptic partial differential equations with random input data. SIAM J. Numer. Anal. 45(3), 1005–1034 (2007)
  5. Betz, W., Papaioannou, I., Straub, D.: Numerical methods for the discretization of random fields by means of the Karhunen-Loeve expansion. Comput. Methods Appl. Mech. Eng. 271, 109–129 (2014). doi:10.1016/j.cma.2013.12.010
  6. Bilionis, I.: py-orthpol: Construct orthogonal polynomials in python. https://github.com/PredictiveScienceLab/py-orthpol (2013)
  7. Bilionis, I., Zabaras, N.: Multi-output local Gaussian process regression: applications to uncertainty quantification. J. Comput. Phys. 231(17), 5718–5746 (2012). doi:10.1016/j.jcp.2012.04.047
  8. Bilionis, I., Zabaras, N.: Multidimensional adaptive relevance vector machines for uncertainty quantification. SIAM J. Sci. Comput. 34(6), B881–B908 (2012). doi:10.1137/120861345
  9. Bilionis, I., Zabaras, N.: Solution of inverse problems with limited forward solver evaluations: a Bayesian perspective. Inverse Probl. 30(1), 015004 (2014). doi:10.1088/0266-5611/30/1/015004
  10. Bilionis, I., Zabaras, N., Konomi, B.A., Lin, G.: Multi-output separable Gaussian process: towards an efficient, fully Bayesian paradigm for uncertainty quantification. J. Comput. Phys. 241, 212–239 (2013). doi:10.1016/j.jcp.2013.01.011
  11. Bilionis, I., Drewniak, B.A., Constantinescu, E.M.: Crop physiology calibration in the CLM. Geoscientific Model Dev. 8(4), 1071–1083 (2015). doi:10.5194/gmd-8-1071-2015
  12. Bishop, C.M.: Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, New York (2006)
  13. Boyle, P., Frean, M.: Dependent Gaussian processes. In: Saul, L.K., Weiss, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems 17 (NIPS 2004), Whistler (2004)
  14. Chen, P., Zabaras, N., Bilionis, I.: Uncertainty propagation using infinite mixture of Gaussian processes and variational Bayesian inference. J. Comput. Phys. 284, 291–333 (2015)
  15. Conti, S., O’Hagan, A.: Bayesian emulation of complex multi-output and dynamic computer models. J. Stat. Plan. Inference 140(3), 640–651 (2010). doi:10.1016/j.jspi.2009.08.006
  16. Currin, C., Mitchell, T., Morris, M., Ylvisaker, D.: A Bayesian approach to the design and analysis of computer experiments. Report, Oak Ridge National Laboratory (1988)
  17. Currin, C., Mitchell, T., Morris, M., Ylvisaker, D.: Bayesian prediction of deterministic functions, with applications to the design and analysis of computer experiments. J. Am. Stat. Assoc. 86(416), 953–963 (1991). doi:10.2307/2290511
  18. Dawid, A.P.: Some matrix-variate distribution theory – notational considerations and a Bayesian application. Biometrika 68(1), 265–274 (1981)
  19. Del Moral, P., Doucet, A., Jasra, A.: Sequential Monte Carlo samplers. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 68(3), 411–436 (2006)
  20. Delves, L.M., Walsh, J.E. (eds.): Numerical Solution of Integral Equations. Clarendon Press, Oxford (1974)
  21. Doucet, A., De Freitas, N., Gordon, N. (eds.): Sequential Monte Carlo Methods in Practice. Statistics for Engineering and Information Science. Springer, New York (2001)
  22. Durrande, N., Ginsbourger, D., Roustant, O.: Additive covariance kernels for high-dimensional Gaussian process modeling. arXiv:1111.6233 (2011)
  23. Duvenaud, D., Nickisch, H., Rasmussen, C.E.: Additive Gaussian processes. In: Advances in Neural Information Processing Systems, vol. 24, pp. 226–234 (2011)
  24. Gautschi, W.: On generating orthogonal polynomials. SIAM J. Sci. Stat. Comput. 3(3), 289–317 (1982). doi:10.1137/0903018
  25. Gautschi, W.: Algorithm 726: ORTHPOL – a package of routines for generating orthogonal polynomials and Gauss-type quadrature rules. ACM Trans. Math. Softw. 20(1), 21–62 (1994). doi:10.1145/174603.174605
  26. Ghanem, R., Spanos, P.D.: Stochastic Finite Elements: A Spectral Approach, rev. edn. Dover Publications, Mineola (2003)
  27. Gramacy, R.B., Lee, H.K.H.: Cases for the nugget in modeling computer experiments. Stat. Comput. 22(3), 713–722 (2012). doi:10.1007/s11222-010-9224-x
  28. Haff, L.: An identity for the Wishart distribution with applications. J. Multivar. Anal. 9(4), 531–544 (1979). doi:10.1016/0047-259X(79)90056-3
  29. Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1), 97–109 (1970). doi:10.2307/2334940
  30. Higdon, D., Gattiker, J., Williams, B., Rightley, M.: Computer model calibration using high-dimensional output. J. Am. Stat. Assoc. 103(482), 570–583 (2008)
  31. Liu, J.S.: Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics. Springer, New York (2001)
  32. Loève, M.: Probability Theory, 4th edn. Graduate Texts in Mathematics. Springer, New York (1977)
  33. Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. 21(6), 1087–1092 (1953). doi:10.1063/1.1699114
  34. Oakley, J., O’Hagan, A.: Bayesian inference for the uncertainty distribution of computer model outputs. Biometrika 89(4), 769–784 (2002)
  35. Oakley, J.E., O’Hagan, A.: Probabilistic sensitivity analysis of complex models: a Bayesian approach. J. R. Stat. Soc. Ser. B Stat. Methodol. 66, 751–769 (2004). doi:10.1111/j.1467-9868.2004.05304.x
  36. O’Hagan, A.: Bayes-Hermite quadrature. J. Stat. Plan. Inference 29(3), 245–260 (1991)
  37. O’Hagan, A., Kennedy, M.: Gaussian emulation machine for sensitivity analysis (GEM-SA) (2015). http://www.tonyohagan.co.uk/academic/GEM/
  38. O’Hagan, A., Kennedy, M.C., Oakley, J.E.: Uncertainty analysis and other inference tools for complex computer codes. Bayesian Stat. 6, 503–524 (1999)
  39. Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. MIT Press, Cambridge (2006)
  40. Reinhardt, H.J.: Analysis of Approximation Methods for Differential and Integral Equations. Applied Mathematical Sciences. Springer, New York (1985)
  41. Robert, C.P., Casella, G.: Monte Carlo Statistical Methods, 2nd edn. Springer Texts in Statistics. Springer, New York (2004)
  42. Sacks, J., Welch, W.J., Mitchell, T., Wynn, H.P.: Design and analysis of computer experiments. Stat. Sci. 4(4), 409–423 (1989)
  43. Seeger, M.: Low rank updates for the Cholesky decomposition. Report, University of California at Berkeley (2007)
  44. Smolyak, S.A.: Quadrature and interpolation formulas for tensor products of certain classes of functions. Sov. Math. Dokl. 4, 240–243 (1963)
  45. Stark, H., Woods, J.W.: Probability and Random Processes with Applications to Signal Processing, 3rd edn. Prentice Hall, Upper Saddle River (2002)
  46. Stegle, O., Lippert, C., Mooij, J.M., Lawrence, N.D., Borgwardt, K.M.: Efficient inference in matrix-variate Gaussian models with iid observation noise. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 24 (NIPS 2011), Granada (2011)
  47. Van Loan, C.F.: The ubiquitous Kronecker product. J. Comput. Appl. Math. 123(1–2), 85–100 (2000)
  48. Wan, J., Zabaras, N.: A Bayesian approach to multiscale inverse problems using the sequential Monte Carlo method. Inverse Probl. 27(10), 105004 (2011)
  49. Wan, X.L., Karniadakis, G.E.: An adaptive multi-element generalized polynomial chaos method for stochastic differential equations. J. Comput. Phys. 209(2), 617–642 (2005). doi:10.1016/j.jcp.2005.03.023
  50. Welch, W.J., Buck, R.J., Sacks, J., Wynn, H.P., Mitchell, T.J., Morris, M.D.: Screening, predicting, and computer experiments. Technometrics 34(1), 15–25 (1992)
  51. Xiu, D.B.: Efficient collocational approach for parametric uncertainty analysis. Commun. Comput. Phys. 2(2), 293–309 (2007)
  52. Xiu, D.B., Hesthaven, J.S.: High-order collocation methods for differential equations with random inputs. SIAM J. Sci. Comput. 27(3), 1118–1139 (2005)
  53. Xiu, D.B., Karniadakis, G.E.: The Wiener-Askey polynomial chaos for stochastic differential equations. SIAM J. Sci. Comput. 24(2), 619–644 (2002)

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. School of Mechanical Engineering, Purdue University, West Lafayette, USA
  2. Warwick Centre for Predictive Modelling, University of Warwick, Coventry, UK
