1 Introduction

In many fields of Science, Engineering and Technology, the mathematical and physical models are implemented in computer programs known as simulators [25]. Simulators are nowadays increasingly popular in the social sciences, social network modeling and (electronic) commerce as well. These simulators aim to model and reproduce complex real-world phenomena accurately. On the downside, simulators typically demand large computational and memory resources, as well as the introduction of complicated ad hoc rules in the programs. Despite their good performance and reputation in the related fields, these shortcomings impede their wide practical adoption. In the last decades, machine learning has played a key role in the field by proposing surrogate models, commonly known as emulators, which try to reproduce (learn) the complex input-output mapping built by simulators from empirical data. This is typically done by running the simulator for a number of input factors and situations, thus yielding a training dataset, which is then used to fit regression models able to perform out-of-sample predictions.

Statistical emulators were originally developed in the 1980s for general purposes [9, 21, 25]. Emulators typically rely on Gaussian processes and neural networks because of their flexibility, accuracy, and computational efficiency. Once the emulator is built, one can run approximate simulations very efficiently because the test-time cost of machine learning models is commonly very low (often linear or sub-linear). This allows one to analyze the system for which the simulator was built under far more instantiations and situations, as well as to perform sensitivity analysis (that is, to analyze the relative relevance of the drivers) in a more robust manner. Interesting applications of Gaussian process emulators include input feature selection, sensitivity analysis, and uncertainty quantification of the outputs given the uncertainty of the inputs [5, 19].

In this paper, we focus on what is perhaps the most active field currently building emulators: Earth science. In Earth observation and climate science one typically has access to physical models encoding the variable relations. These physical models are called process-based models in global change ecology, radiative transfer models (RTMs) in remote sensing, or climate models in detection-and-attribution schemes for climate science applications. RTMs simulate the system as \(\mathbf {y}=f(\mathbf {x},\varvec{\omega })\), where \(\mathbf {y}\) is a measurement obtained by the satellite (e.g. radiance); the vector \(\mathbf {x}\) represents the state of the biophysical variables on the Earth; \(\varvec{\omega }\) contains a set of controllable conditions (e.g. wavelengths, viewing direction, time, Sun position, and polarization); and \(f(\cdot )\) is a function that relates \(\mathbf {y}\) to \(\mathbf {x}\). Such a function f is typically considered to be nonlinear, smooth and continuous. The goal in inverse modeling is to obtain an accurate model \(g(\cdot ) \approx f^{-1}(\cdot )\), parametrized by \(\varvec{\theta }\), which approximates the biophysical variables \(\mathbf {x}\) given the data \(\mathbf {y}\) received by the satellite, i.e. \(\hat{\mathbf {x}} = g(\mathbf {y},\varvec{\theta })\). In emulation mode, however, we are interested in approximating the RTM itself, that is, in obtaining a machine learning model \(\widehat{f}\) that approximates the RTM code f accurately and at a fraction of its computational cost. Obtaining such a model is a game changer, since one can then perform model inversion, sensitivity analysis and parameter retrieval much more efficiently than with the original simulator.

Here we are concerned with using the emulator to replace RTMs and then perform model inversion. Such inversion typically requires large multi-dimensional look-up tables (LUTs), which are precomputed for later interpolation [11]. However, the computation of these LUTs still imposes a large computational burden, requiring parallelization techniques and execution on computer grids. In order to further reduce this burden, a possible strategy is to select an optimal subset of anchor or landmark points that reduces the interpolation error of the LUT. Compact and informative LUTs give rise, in turn, to interesting possibilities for emulating RTMs [24]. In this work, we address simultaneously the problem of optimal selection of the points to be included in the LUT and the construction of the emulator.

The problem has received attention from (apparently unrelated) fields in statistical signal processing and machine learning. It has been cast as optimal experimental design [6, 17, 20, 26] of interpolators of arbitrary functions f. To reduce the number of direct runs of the system (evaluations of f), a possible approach is to construct an approximation of f starting from a set of support points. This approximation is then sequentially improved by incorporating novel points according to a suitable statistical rule. This topic is also related to several other research areas: optimal nonuniform sampling, quantization and interpolation of continuous signals [8, 16], the so-called Bayesian Optimization (BO) problem [12, 18, 27], and active learning [2, 7, 10, 28]. Finally, an interesting alternative approach is based on adaptive gridding, where the aim is to construct a partitioning of \(\mathcal {X}\) into cells of equal size, where the cell edges have different lengths depending on their spatial direction. This was the approach followed in [4]. In order to find these lengths, the method proposed there uses a modification of the Relevance Vector Machine (RVM) [3, 22].

In this paper we introduce a simpler and more general approach. The proposed method is a sequential, adaptive and automatic construction of the emulator based on the notion of an acquisition function, similarly to the BO approach [12, 18]. Unlike in BO, our goal is not the optimization of the unknown underlying function f but its accurate approximation \(\widehat{f}\). Given a set of initial points, the emulator is built automatically by the online addition of new nodes maximizing the acquisition function at each iteration. Conceptually, the acquisition function should incorporate (a) geometric information about the unknown function f, and (b) information about the distribution of the current nodes. Indeed, areas of high variability of \(f(\mathbf {x})\) require the addition of more points, and areas with a low concentration of nodes likewise require the introduction of new inputs. Thus, the experimental design problem is converted into a sequential optimization problem where the function to be optimized combines geometric and spatial information (regardless of the dimensionality of the input space).

The rest of the paper is outlined as follows. Section 2 describes the general ideas behind automatic emulation. In Sect. 3, we introduce the proposed automatic emulator scheme and review each of its building blocks (regression model and acquisition function). In Sect. 4, we give experimental evidence of the performance of the proposal in several numerical examples, one of them involving the optimization of the nodes for the construction of an optimal LUT for atmospheric correction based on a computationally demanding physical model. We conclude in Sect. 5 with some remarks and an outline of further work.

2 Automatic Emulation

This section introduces the proposed scheme for automatic emulation. We start by fixing the notation and presenting the processing scheme. Then we detail all the ingredients that allow us to design an adaptive acquisition function in conjunction with the RVM model.

2.1 Emulation Scheme

Let us consider a D-dimensional input space \(\mathcal {X}\), i.e., \(\mathbf {x}=[x_1,\ldots ,x_D]^{\intercal } \in \mathcal {X}\subset \mathbb {R}^D\) and, for the sake of simplicity, we assume that \(\mathcal {X}\) is bounded. Let us denote the system to be emulated as \(f(\mathbf {x}): \mathcal {X}\mapsto \mathbb {R}\), e.g. the complicated transfer equations modeled by an expensive RTM. Given an input matrix \(\mathbf {X}_t=[\mathbf {x}_1,\cdots ,\mathbf {x}_{m_t}]\) of dimension \(D\times m_t\), we have a vector of outputs \(\mathbf{y}_t=[y_1,\ldots ,y_{m_t}]^{\intercal }\), where \(y_i=f(\mathbf {x}_i)\) and the index \(t\in \mathbb {N}^+\) denotes the t-th iteration of the algorithm. Essentially, at each iteration t one performs an interpolation or a regression, providing \(\widehat{f}_t(\mathbf {x}|\mathbf {X}_t,\mathbf{y}_t)\), followed by an optimization step that updates the acquisition function \(A_{t}(\mathbf {x})\), updates the set \(\mathbf {X}_{t+1} =[\mathbf {X}_t, \mathbf {x}_{m_t+1}]\) by adding a new node, and sets \(m_{t+1} = m_t +1\) and \(t\leftarrow t +1\). The procedure is repeated until a suitable stopping condition is satisfied. We assume scalar outputs in order to simplify the description of the technique, yet the algorithm can easily be extended to multi-output settings. The procedure is outlined in Algorithm 1, and Fig. 1 shows a graphical representation of a generic automatic emulator. In this work, \(f(\mathbf {x})\) represents an RTM but, more generally, it could be any costly function.

Fig. 1. Graphical representation of an automatic emulator. In this work, the function \(f(\mathbf{x})\) represents an RTM but, more generally, it can be any function that is costly to evaluate.

Algorithm 1. Automatic emulation (pseudocode).
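To make the loop of Algorithm 1 concrete, the following minimal Python sketch (not taken from the original paper) illustrates the sequential procedure; here `fit_regressor`, `build_acquisition` and `maximize_acq` are hypothetical placeholders for the regression step, the construction of \(A_t(\mathbf{x})\), and its numerical optimization, respectively.

```python
import numpy as np

def automatic_emulation(f, X0, n_add, fit_regressor, build_acquisition, maximize_acq):
    """Sketch of the generic automatic emulation loop (Algorithm 1).

    f                 : costly function to emulate (e.g., one run of an RTM)
    X0                : (m0, D) array with the initial nodes
    n_add             : number of nodes to add (a simple stand-in for a stopping rule)
    fit_regressor     : callable returning an emulator f_hat fitted on (X, y)
    build_acquisition : callable returning A_t(x) >= 0, equal to zero at the current nodes
    maximize_acq      : callable returning argmax_x A_t(x) over the bounded domain
    """
    X = np.atleast_2d(X0)
    y = np.array([f(x) for x in X])                  # evaluate the costly model at the initial nodes
    for _ in range(n_add):
        f_hat = fit_regressor(X, y)                  # interpolation / regression step
        A_t = build_acquisition(f_hat, X)            # acquisition: geometry times diversity
        x_new = maximize_acq(A_t)                    # optimization step: new node
        X = np.vstack([X, np.atleast_2d(x_new)])     # X_{t+1} = [X_t, x_new], m_{t+1} = m_t + 1
        y = np.append(y, f(np.ravel(x_new)))         # run the simulator at the new node
    return fit_regressor(X, y), X, y
```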

2.2 Conceptual Definition of the Acquisition Function

Let us start by describing the general properties that a generic acquisition function \(A_t(\mathbf {x})\) should satisfy. Conceptually, an acquisition function can be defined as the product of two functions, a geometry term \(G_t(\mathbf {x})\) and a diversity term \(D_t(\mathbf {x})\):

$$\begin{aligned} A_{t}(\mathbf {x}) = G_t(\mathbf {x})\, D_t(\mathbf {x}), \end{aligned}$$
(1)

where \(G_t(\mathbf {x}): \mathcal {X}\mapsto \mathbb {R}^+\), \(D_t(\mathbf {x}): \mathcal {X}\mapsto \mathbb {R}^+\) and hence \(A_{t}(\mathbf {x}): \mathcal {X}\mapsto \mathbb {R}^+\) (i.e., \(A_t(\mathbf {x})\ge 0\)). The first function, \(G_t(\mathbf {x})\), represents some suitable geometrical information about the hidden function f. The second function, \(D_t(\mathbf {x})\), depends on the distribution of the points in the current matrix \(\mathbf {X}_t\). More specifically, \(D_t(\mathbf {x})\) has a greater probability mass around empty areas within \(\mathcal {X}\), is approximately zero close to the support points, and is exactly zero at the support points, i.e., \(D_t(\mathbf {x}_i)=0\), for \(i=1,\ldots ,m_t\) and \(\forall t\in \mathbb {N}\). As a consequence, a suitable acquisition function satisfies the following condition,

$$\begin{aligned} A_{t}(\mathbf {x}_i)=0, \quad \text{ for } i=1,\ldots ,m_t, \text{ and } \forall t\in \mathbb {N}. \end{aligned}$$
(2)
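As a toy illustration of this product structure (a sketch only; the concrete construction actually used by RAE is given in Sect. 3.2), one may take for \(D_t\) the squared distance to the closest node, which vanishes exactly at the support points, while \(G_t\) can be any nonnegative proxy for the local variability of f:

```python
import numpy as np

def diversity_term(x, nodes):
    """D_t(x): squared distance to the closest current node; zero at every node."""
    return np.min(np.sum((nodes - x) ** 2, axis=1))

def acquisition(x, nodes, geometry_term):
    """A_t(x) = G_t(x) * D_t(x); geometry_term is any nonnegative proxy
    for the local variability of the hidden function f."""
    return geometry_term(x) * diversity_term(x, nodes)

# Example: with a flat geometry term, A_t reduces to pure space filling.
nodes = np.array([[0.0], [1.0], [2.0]])
print(acquisition(np.array([0.5]), nodes, geometry_term=lambda x: 1.0))  # 0.25
print(acquisition(np.array([1.0]), nodes, geometry_term=lambda x: 1.0))  # 0.0, Eq. (2)
```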

3 RVM Automatic Emulator (RAE)

In this section, we specify the implementation of an automatic emulator based on a variant of the RVM method, called RAE. We first introduce the Adaptive RVM (A-RVM), employed as the regressor in RAE, and then describe a suitable construction of the acquisition function \(A_t(\mathbf {x})\) that takes into account the information provided by the previously optimized hyperparameters of A-RVM.

3.1 Adaptive Relevance Vector Machine (A-RVM)

Let us consider the standard regression model

$$ y=f(\mathbf {x})+\epsilon , $$

where \(\epsilon \sim \mathcal {N}(0,\sigma _e^2)\) and we observe N data pairs, \(\{\mathbf {x}_n,y_n\}_{n=1}^N\). We also denote \(\mathbf {X}=[\mathbf {x}_1,\ldots , \mathbf {x}_N]\), \(\mathbf {y}=[y_1,\ldots ,y_N]^{\intercal }\), and define the following N basis functions

$$\begin{aligned} \varvec{\phi }_n(\mathbf {x}|\mathbf {x}_n)=\exp \left( -\frac{\Vert \mathbf {x}-\mathbf {x}_n\Vert ^2}{2\delta _n^2}\right) : \mathcal {X} \times \mathcal {X} \rightarrow \mathbb {R} \end{aligned}$$
(3)

Hence, we have N functions centered at the data inputs \(\mathbf {x}_n\), each one with a different scale parameter \(\delta _n\). Moreover, let us define the \(N\times N\) matrix \([{\varvec{\varPhi }}]_{i,j}:=\varvec{\phi }_j(\mathbf {x}_i|\mathbf {x}_j)\), and the \(N\times 1\) vector \(\varvec{\phi }(\mathbf {x},\mathbf {X})=[\varvec{\phi }_1(\mathbf {x}|\mathbf {x}_1),\ldots , \varvec{\phi }_N(\mathbf {x}|\mathbf {x}_N)]^{\intercal }\). We assume that the hidden function f can be expressed as \(f(\mathbf {x})=\varvec{\phi }(\mathbf {x},\mathbf {X})^{\intercal }\mathbf{w}\), where \(\mathbf{w}\) is an unknown vector. Furthermore, we consider a Gaussian prior over the \(N\times 1\) weight vector \(\mathbf{w}\), i.e., \(\mathbf{w}\sim \mathcal {N}(\mathbf{0},{\varvec{\varSigma }}_p)\). The Minimum Mean Square Error (MMSE) solution is

$$\begin{aligned} \widehat{\mathbf{w}} = \frac{1}{\sigma _e^2}\left( \frac{1}{\sigma _e^2}{\varvec{\varPhi }}{\varvec{\varPhi }}^{\intercal }+ {\varvec{\varSigma }}_p^{-1}\right) ^{-1} {\varvec{\varPhi }} \mathbf {y}= {\varvec{\varSigma }}_p{\varvec{\varPhi }}\left( {\varvec{\varPhi }}^{\intercal }{\varvec{\varSigma }}_p{\varvec{\varPhi }}+ \sigma _e^2\mathbf{I}_N\right) ^{-1} \mathbf {y}. \end{aligned}$$
(4)

Hence, since \(\widehat{f}(\mathbf {x})=\varvec{\phi }(\mathbf {x},\mathbf {X})^{\intercal }\widehat{\mathbf{w}}\), we obtain

$$\begin{aligned} \widehat{f}(\mathbf {x}) = \varvec{\phi }(\mathbf {x},\mathbf {X})^{\intercal }{\varvec{\varSigma }}_p{\varvec{\varPhi }}\left( {\varvec{\varPhi }}^{\intercal }{\varvec{\varSigma }}_p{\varvec{\varPhi }}+ \sigma _e^2\mathbf{I}_N\right) ^{-1} \mathbf {y}= \mathbf{k}^{\intercal }\left( {\mathbf K}+ \sigma _e^2\mathbf{I}_N\right) ^{-1} \mathbf {y}, \end{aligned}$$
(5)

where we have set \({\mathbf K}={\varvec{\varPhi }}^{\intercal }{\varvec{\varSigma }}_p{\varvec{\varPhi }}\) and \(\mathbf{k}^{\intercal }=\varvec{\phi }(\mathbf {x},\mathbf {X})^{\intercal }{\varvec{\varSigma }}_p{\varvec{\varPhi }}\).
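The following snippet is a compact numerical sketch of Eqs. (3)–(5) (an illustration under the stated model, not the authors' code); it builds the basis matrix \({\varvec{\varPhi }}\) with one lengthscale \(\delta_n\) per node and evaluates the predictive mean of Eq. (5), taking \({\varvec{\varSigma }}_p=\mathbf{I}_N\) by default.

```python
import numpy as np

def basis_matrix(X, centers, deltas):
    """Phi[i, j] = phi_j(x_i | x_j) = exp(-||x_i - x_j||^2 / (2 * delta_j^2)), Eq. (3)."""
    sq_dists = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dists / (2.0 * deltas[None, :] ** 2))

def arvm_predict(X_test, X, y, deltas, sigma2_e, Sigma_p=None):
    """Predictive mean f_hat(x) = k^T (K + sigma_e^2 I)^{-1} y, Eqs. (4)-(5)."""
    N = X.shape[0]
    if Sigma_p is None:
        Sigma_p = np.eye(N)                        # isotropic Gaussian prior on w
    Phi = basis_matrix(X, X, deltas)               # N x N design matrix
    K = Phi.T @ Sigma_p @ Phi                      # K = Phi^T Sigma_p Phi
    Phi_test = basis_matrix(X_test, X, deltas)     # phi(x, X)^T for each test input
    k = Phi_test @ Sigma_p @ Phi                   # k^T = phi(x, X)^T Sigma_p Phi
    alpha = np.linalg.solve(K + sigma2_e * np.eye(N), y)
    return k @ alpha

# Minimal one-dimensional usage example (illustrative lengthscales).
X = np.array([[0.5], [5.0], [15.0]])
y = np.log(X[:, 0])
x_grid = np.linspace(0.5, 15.0, 5)[:, None]
print(arvm_predict(x_grid, X, y, deltas=np.array([1.0, 3.0, 6.0]), sigma2_e=0.1))
```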

On the adaptive lengthscales. We remark that the number of basis functions \(\varvec{\phi }_n\) equals the number of data points, N, and that the model has \(N+1\) hyperparameters, \({\varvec{\theta }}=[\delta _1,\ldots ,\delta _N,\sigma _e^2]^{\intercal }\): the N scale parameters \(\delta _n\) (one for each function \(\varvec{\phi }_n\)) and the variance of the measurement noise \(\sigma _e^2\).

Fig. 2. Graphical representation of the adaptive RVM regression model. The regressor \(\widehat{f}(x)=\sum _{n=1}^3 w_n\varvec{\phi }_n(x|x_n)\) is a linear combination of \(N=3\) basis functions, each one with a different scale \(\delta _n\), \(n=1,2,3\).

Clearly, the use of different \(\delta _n\)'s fosters the flexibility of the regression method. Moreover, and more importantly for our purpose, each parameter \(\delta _n\) carries geometric information about the hidden function \(f(\mathbf {x})\). Indeed, the parameters \(\delta _n\) of the functions \(\varvec{\phi }_n\) located in regions with a greater variation of \(f(\mathbf {x})\) are smaller than those of the functions \(\varvec{\phi }_n\) located in regions where \(f(\mathbf {x})\) is flatter. Roughly speaking, a large value of \(\delta _n\) means that \(\varvec{\phi }_n\) is located in an area where f is virtually flat, whereas a small value of \(\delta _n\) is obtained when \(\varvec{\phi }_n\) is located in a region where f has a high derivative, for instance. This consideration is very useful for building a suitable acquisition function. An illustrative example of the adaptive property is shown in Fig. 2.

On hyperparameter tuning. The tuning of the hyperparameters is performed by maximizing the log-marginal likelihood,

$$ J({\varvec{\theta }})=\log [p(\mathbf {y}|{\varvec{\theta }})]=-\frac{1}{2}\mathbf {y}^{\intercal }({\mathbf K}+\sigma _e^2\mathbf{I}_N)^{-1}\mathbf {y}-\frac{1}{2}\log \left[ \det ({\mathbf K}+\sigma _e^2\mathbf{I}_N)\right] +c, $$

where c is a constant and \({\mathbf K}={\varvec{\varPhi }}^{\intercal }{\varvec{\varSigma }}_p{\varvec{\varPhi }}\). We consider a simulated annealing approach for finding the global maximum of \(J({\varvec{\theta }})\) [13, 15, 23]. More sophisticated optimization methods could exploit information about the gradient of \(J({\varvec{\theta }})\).
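A minimal sketch of this tuning step is given below (again with \({\varvec{\varSigma }}_p=\mathbf{I}_N\)); it evaluates \(J({\varvec{\theta }})\) and maximizes it with `scipy.optimize.dual_annealing`, used here as a simple, generic stand-in for the parallel simulated annealing scheme of [13, 15].

```python
import numpy as np
from scipy.optimize import dual_annealing

def log_marginal_likelihood(theta, X, y):
    """J(theta) = -1/2 y^T (K + s2 I)^{-1} y - 1/2 log det(K + s2 I) + c, with Sigma_p = I."""
    deltas, sigma2_e = theta[:-1], theta[-1]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    Phi = np.exp(-sq_dists / (2.0 * deltas[None, :] ** 2))   # Eq. (3), one delta per node
    C = Phi.T @ Phi + sigma2_e * np.eye(len(y))               # K + sigma_e^2 I, with K = Phi^T Phi
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * y @ np.linalg.solve(C, y) - 0.5 * logdet    # constant c dropped

# Global maximization of J(theta) over box bounds (illustrative ranges).
X = np.array([[0.5], [5.0], [15.0]])
y = np.log(X[:, 0])
bounds = [(1e-2, 20.0)] * len(y) + [(1e-3, 1.0)]              # N deltas and sigma_e^2
result = dual_annealing(lambda th: -log_marginal_likelihood(th, X, y), bounds, maxiter=200)
deltas_hat, sigma2_hat = result.x[:-1], result.x[-1]
print(deltas_hat, sigma2_hat)
```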

3.2 Proposed Acquisition Function

As observed above, the N learnt parameters \(\delta _n\) of A-RVM contain geometric information about the hidden function \(f(\mathbf {x})\). In order to create a suitable acquisition function, we also need to take into account the distribution of the current nodes at each iteration of the algorithm. A possibility is to define the acquisition function as

$$\begin{aligned} A(\mathbf {x})=\prod _{i=1}^{m_t} V_i(\mathbf {x}|\mathbf {x}_i), \end{aligned}$$
(6)

where \(V_i(\mathbf {x}|\mathbf {x}_i)=(\mathbf {x}-\mathbf {x}_i)^2\exp (-\frac{(\mathbf {x}-\mathbf {x}_i)^2}{\ell \delta _i})\), and \(\ell >0\) is a parameter chosen by the user. The function \(V_i(\mathbf {x}|\mathbf {x}_i)\) has the analytical structure of a Nakagami density [14], which has been widely studied from an analytical point of view. Thus, different features of \(V_i(\mathbf {x}|\mathbf {x}_i)\) are known analytically, and it can easily be normalized.

Fig. 3. Example of two functions \(V_i(\mathbf {x}|\mathbf {x}_i)\) for the same \(\mathbf {x}_i\) (circle), with a high (solid) and a low (dashed) value of \(\delta _i\).

Note that each \(V_i(\mathbf {x}|\mathbf {x}_i)\) depends on the parameter \(\delta _i\) and \(V_i(\mathbf {x}_i|\mathbf {x}_i)=0\) (Fig. 3). Hence, \(A(\mathbf {x}_i)=0\) for all \(i=1,\ldots ,m_t\). The function \(V_i(\mathbf {x}|\mathbf {x}_i)\) is symmetric with respect to \(\mathbf {x}_i\) and it is bimodal with modes located at

$$\begin{aligned} \mathbf {x}^*=\mathbf {x}_i\pm \sqrt{\ell \delta _i}. \end{aligned}$$
(7)

When \(\delta _i\) is small, the two modes are closer to each other and closer to \(\mathbf {x}_i\), whereas when \(\delta _i\) is high the modes are far away from \(\mathbf {x}_i\). Figure 4 shows a sequence of approximations \(\{\widehat{f}_t(x)\}_{t=0}^3\) of f(x), and the corresponding acquisition functions at each iteration.
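For illustration purposes, the snippet below gives a one-dimensional sketch of Eqs. (6)–(7) (not the authors' implementation); it evaluates \(A(x)=\prod_i V_i(x|x_i)\) on a dense grid and picks the grid maximizer as the next node, a crude stand-in for the simulated annealing search used in the experiments. The lengthscales \(\delta_i\) are illustrative values standing in for those learnt by A-RVM.

```python
import numpy as np

def V(x, x_i, delta_i, ell):
    """V_i(x|x_i) = (x - x_i)^2 exp(-(x - x_i)^2 / (ell * delta_i)), Eq. (6);
    zero at x_i, bimodal with modes at x_i +/- sqrt(ell * delta_i), Eq. (7)."""
    d2 = (x - x_i) ** 2
    return d2 * np.exp(-d2 / (ell * delta_i))

def acquisition(x, nodes, deltas, ell=6.0):
    """A(x) = prod_i V_i(x|x_i): zero at every current node."""
    return np.prod([V(x, x_i, d_i, ell) for x_i, d_i in zip(nodes, deltas)], axis=0)

# Pick the next node on a dense grid (one-dimensional example).
nodes = np.array([0.01, 10.0, 20.0])    # current nodes X_t
deltas = np.array([0.5, 4.0, 4.0])      # illustrative lengthscales learnt by A-RVM
grid = np.linspace(0.01, 20.0, 2000)
x_new = grid[np.argmax(acquisition(grid, nodes, deltas))]
print(x_new)                             # new node to be added
```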

Fig. 4. A sequence of approximations \(\widehat{f}_0(x)\), \(\widehat{f}_1(x)\), \(\widehat{f}_2(x)\) and \(\widehat{f}_3(x)\) (top) of the function \(f(x)=\log (x)\) (dashed line) obtained by RAE starting with 3 points, and the corresponding acquisition functions \(A_0(x)\), \(A_1(x)\), \(A_2(x)\), \(A_3(x)\) and \(A_{4}(x)\) (bottom). The nodes are depicted by circles. In this example, we have set \(\ell =6\).

4 Experimental Results

4.1 Synthetic Example

In order to test RAE, we first consider a toy example where we can compare the achieved approximation \(\widehat{f}_t\) with the underlying function f, which is unknown in real-world applications. In this way, we can exactly check the true accuracy of the approximation obtained with different schemes. For the sake of simplicity, we consider the function

$$\begin{aligned} f(x)=\log (x), \end{aligned}$$
(8)

with \(x\in \mathcal {X}=(0,20]\), hence \(D=1\) (see Fig. 5). Even in this simple scenario, the procedure used for selecting new points is relevant, as confirmed by the results provided below. We start with \(m_0=3\) support points, \(\mathbf{X}_0=[0.01,10,20]\). We then add another 10 points sequentially and automatically using different techniques: (a) a completely random choice, uniformly within \(\mathcal {X}\); (b) a deterministic procedure that locates the largest distance between two consecutive points and adds the middle point of this interval (sketched below); (c) Latin Hypercube (LHC) sampling [17]; and (d) RAE with \(\ell =6\) in the functions \(V_i(\mathbf {x}|\mathbf {x}_i)\). In all cases, we use the A-RVM regression scheme (for simplicity, we also set \(\sigma _e^2=0.1\)). The optimization of the remaining hyperparameters is carried out by a parallel simulated annealing approach [13, 15]. Note that the deterministic procedure always adds the same sequence of points, obtaining \(\mathbf{X}_{10}=[\mathbf {X}_0,15,5.005,12.5,17.5,7.5025,11.25,13.75,16.25,18.75]\) (recalling that \(\mathbf{X}_0=[0.01,10,20]\) for all methods). At each run, the results can vary slightly even for the deterministic procedure due to the optimization of the hyperparameters (we use a simulated annealing approach, which is a stochastic optimization technique [13]). We average all the results over 500 independent runs.
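For reference, the deterministic baseline (b) can be sketched as follows (a minimal illustration of the rule, not the authors' implementation): at each step, the largest gap between two consecutive sorted nodes is located and its midpoint is added.

```python
import numpy as np

def gap_filling(X0, n_add):
    """Deterministic baseline: repeatedly add the midpoint of the largest gap
    between two consecutive (sorted) nodes."""
    nodes = sorted(X0)
    for _ in range(n_add):
        gaps = np.diff(nodes)
        j = int(np.argmax(gaps))                              # index of the widest interval
        nodes.insert(j + 1, (nodes[j] + nodes[j + 1]) / 2.0)  # add its midpoint
    return nodes

print(gap_filling([0.01, 10.0, 20.0], n_add=10))
```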

We compute the \(L_2\) distance between \(\widehat{f}_t(x)\) and f(x) at each iteration for each method. Figure 5 shows the evolution of the averaged \(L_2\) distance versus the number of support points \(m_t\) (that is, \(m_t=t+m_0\)); specifically, we show the median curves obtained over the 500 runs. We can observe that the RAE scheme outperforms the other methods, providing the smallest distances between f and \(\widehat{f}_t\). The proposed technique gives a clear advantage in the first iterations: with only \(m_t=7\) points it provides an error of \(\approx \)10\(^{-1}\), whereas the other methods require more than 12 points to obtain the same error value. After the addition of a certain number of nodes (enough to cover the entire domain \(\mathcal {X}\)), the results clearly become similar. The difficulty of the automatic design problem is well illustrated by the curve of the deterministic procedure: attempting to fill the whole domain \(\mathcal {X}\) as fast as possible by maximizing the distance among the nodes is not, in general, the best strategy. In some specific scenarios, after the addition of one point the performance can even get worse (this can vary depending on the regression scheme employed). This happens with the deterministic procedure upon the addition of the 6th and 7th nodes (see Fig. 5). Figure 4 depicts the function f(x) in Eq. (8), the approximation \(\widehat{f}_t(x)\), and the acquisition function \(A_t(x)\) obtained in four consecutive iterations of RAE, considering a specific run of the algorithm.

Fig. 5. The \(L_2\) distance (in log-scale) between f(x) and \(\widehat{f}_t(x)\) versus the number of support points \(m_t\) (that is, \(m_t=t+3\) in this example). The results of RAE are shown with squares, the LHC curve with x-marks, the deterministic procedure with triangles, and the completely random approach with circles.

4.2 Emulation of Costly Radiative Transfer Codes

Our second example focuses on the optimization of the selected points for a MODTRAN5-based LUT. MODTRAN5 is considered the de facto standard atmospheric RTM for atmospheric correction applications [1]. This RTM solves the radiative transfer equation in the atmosphere, considering the effect of scattering and absorption by gases and aerosols, for a flexible configuration of viewing and illumination conditions and surface reflectance. In our test application, and for the sake of simplicity, we have considered \(D=2\), with the Aerosol Optical Thickness at 550 nm (\(\tau \)) and the ground elevation (h) as key input parameters. The underlying function \(f(\mathbf {x})\) therefore consists in the execution of MODTRAN5 at given values of \(\tau \) and h at the single output wavelength of 760 nm (i.e. the bottom of the O\(_2\)-A band). The remaining MODTRAN5 input parameters are set to standard atmospheric and geometric conditions (e.g. mid-latitude summer atmosphere, rural aerosol, nadir view, 55\(^\circ \) solar zenith angle, sensor height at 100 km). The input parameter space is bounded to 0.05–0.4 for \(\tau \) and 0–3 km for h. In order to test the accuracy of the different schemes, we have evaluated \(f(\mathbf {x})\) at all the 1750 possible combinations of 35 values of \(\tau \) and 50 values of h. This fine grid represents the ground truth in this example.

We test (a) a random approach choosing points uniformly within \(\mathcal {X}=[0.05,0.4]\times [0,3]\), (b) Latin Hypercube (LHC) sampling (see, e.g., [17]) and (c) RAE (with \(\ell =3\)). We use the simulated annealing algorithm [13, 15] both for optimizing the hyperparameters of A-RVM and for finding the maximum of the acquisition function \(A_t(\mathbf {x})\). We start with \(m_0=5\) points, \(\mathbf {x}_1=[0.05,0]^{\top }\), \(\mathbf {x}_2=[0.05,3]^{\top }\), \(\mathbf {x}_3=[0.4,0]^{\top }\), \(\mathbf {x}_4=[0.4,3]^{\top }\) and \(\mathbf {x}_5=[0.2,1.5]^{\top }\), for all the techniques. We compute the final number of nodes \(m_t\) required to obtain an \(L_2\) distance between f and \(\widehat{f}\) of approximately \(\eta \in \{0.03,0.2\}\) with the different methods. The results, averaged over \(10^3\) runs, are shown in Table 1. RAE requires the addition of \(\approx \)3 new points to obtain a distance of \(\approx \)0.2 and \(\approx \)6 new points to obtain a distance of \(\approx \)0.03.
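The evaluation protocol can be sketched as follows (a hypothetical illustration: `run_modtran5` stands for a user-supplied wrapper around the actual MODTRAN5 executable, and the reference grids are assumed linearly spaced here for concreteness):

```python
import numpy as np

def run_modtran5(x):
    """Hypothetical wrapper: runs MODTRAN5 for AOT tau = x[0] and elevation h = x[1] (km)
    under the fixed atmospheric/geometric settings, returning the radiance at 760 nm."""
    raise NotImplementedError("placeholder for the costly RTM execution")

# Ground-truth reference grid: 35 values of tau in [0.05, 0.4] times 50 values of h in [0, 3] km.
tau_grid = np.linspace(0.05, 0.4, 35)
h_grid = np.linspace(0.0, 3.0, 50)
X_ref = np.array([[t, h] for t in tau_grid for h in h_grid])   # 1750 combinations

# Initial design shared by all methods: the four corners of the domain plus its center.
X0 = np.array([[0.05, 0.0], [0.05, 3.0], [0.4, 0.0], [0.4, 3.0], [0.2, 1.5]])

def l2_error(emulator, y_ref):
    """L2 distance between the emulator and the precomputed RTM outputs on the reference grid."""
    y_hat = np.array([emulator(x) for x in X_ref])
    return np.linalg.norm(y_hat - y_ref)
```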

Table 1. Averaged number of nodes needed to obtain an \(L_2\) error of \(\approx \eta \).

5 Conclusions

We have introduced an automatic method to construct surrogate models, also known as emulators, and optimal look-up tables for costly physical models (such as RTMs). We proposed an iterative scheme combining the interpolation capabilities of RVMs with the design of an acquisition function that accounts for both the local flatness of the underlying function and the spatial distribution of the current nodes, thus fostering a suitable choice of new nodes. We illustrated the good capabilities of the method in a synthetic example and in a real example involving atmospheric correction based on the computationally expensive MODTRAN5 model. Future work will address the development of multi-output schemes and testing with other costly RTMs.