1 Introduction

1.1 The Green AI challenge

Machine Learning (ML) models are computationally hungry. This is particularly true for Deep Neural Networks (DNNs) in fields like computer vision (Bianco et al. 2020) and Natural Language Processing (NLP) (Kulkarni and Shivananda 2019). An approximate quantification of the financial and environmental costs of training and validating some neural network models in the NLP domain is reported in Strubell et al. (2019) and Hao (2019): the amount of energy consumed to train and validate a single NLP model can generate carbon dioxide emissions approximately five times the lifetime emissions of an average American car. It is no surprise that Green Machine Learning (Green-ML) and Green Artificial Intelligence (Green AI) (Schwartz et al. 2019; Yang et al. 2020) have recently emerged as new research topics.

This paper focuses on hyperparameter optimization (HPO). Hyperparameters are all the parameters of a model that are not updated during learning: they configure the model itself (e.g., the number of layers of a deep neural network), characterize the algorithm used in the training phase (e.g., the learning rate of gradient descent), and may even include the choice of the optimization algorithm and of the data features fed into the ML model.

HPO can be regarded as an optimization outer loop on top of ML model learning (the inner loop), searching for the set of hyperparameters leading to the lowest error on a validation set. This two-tier optimization structure has several implications. First, evaluating the objective function of the outer loop is very expensive, as it requires learning a model and evaluating its performance on a validation set; this is usually repeated k times in a k-fold cross-validation procedure. Moreover, the objective function is unknown and can only be observed pointwise, with typically noisy evaluations. Second, the average value of the loss function does not reflect the true distribution of the data (which leads to the generalization error) and, due to the relatively small size of the validation set, the variance of the average estimate obtained by cross-validation can be high. Ignoring this uncertainty can result in a suboptimal configuration of hyperparameters. One must also consider that the performance of the model is evaluated with some error, so finding the true optimum with high precision is usually not critical: this fits nicely into the Bayesian Optimization (BO) framework, which is very sample efficient and yields an acceptable solution with relatively few function evaluations.

The outer-loop optimization algorithm can be passive, like grid or pure random search, or “educated”: learning, from previous evaluations, the structure of the objective function, and actively searching where the most interesting solutions are. Indeed, BO is a framework to model this learning process and to yield a principled quantification of uncertainty (Frazier 2018; Candelieri and Archetti 2019). BO has become the main approach to handle all the relevant steps in finding an accurate ML model: algorithm selection and hyperparameter optimization, recently integrated in the more general setting named CASH, Combining Algorithm Selection and Hyperparameter optimization (Kotthoff et al. 2017). This led to the definition of Automated Machine/Deep Learning (AutoML/AutoDL) (Hutter et al. 2019) and Neural Architecture Search (NAS) (Hutter et al. 2019; Lindauer and Hutter 2019), showing that different algorithms, and different values of their hyperparameters, can result in significantly different performances (Wolpert 2002; Melis et al. 2017).

Although the active learning inherent in BO and the ensuing sample efficiency are usually associated with the search for the best algorithm and its configuration in terms of accuracy (Shahriari et al. 2016), they also translate into significant cost and energy savings. For instance, the BERT (Bidirectional Encoder Representations from Transformers) model, now available in the Google Cloud and aimed at contextual representation in NLP, can require 4-day training sessions to learn its 110 million parameters (Strubell et al. 2019), which makes the NAS performed in the outer loop awfully expensive. Sample efficiency requires some assumption on the objective function and a model of learning from observations.

Probabilistic models commonly used in BO are Gaussian Processes (GPs) (Williams and Rasmussen 2006) and Random Forests (RFs) (Ho 1995) (here we do not discuss their relative merits in different problem classes). GPs are a powerful framework for reasoning about an unknown function \(f\) given partial knowledge of its behavior obtained through function evaluations. A GP provides a principled estimate of predictive uncertainty, supporting a careful balance of exploration (increasing one’s knowledge about \(f\)) and exploitation (focusing on the best points found so far).

The global hyperparameter optimization problem is usually defined as:

$$\underset{x\in \mathcal{X}\subset {\mathbb{R}}^{d}}{\mathrm{min}}f\left(x\right)$$
(1)

where the search space \(\mathcal{X}\) is generally box-bounded, \(f\) is the loss function and \(x\) are the values of the hyperparameters. We remark that \(f\) is analytically unknown (also called latent) and only pointwise, usually noisy, evaluations can be obtained by querying it. We refer to this situation as black-box optimization.

BO leverages the fact that conditioning the GP on previous observations provides versatile regressors of the objective function. BO starts from a GP prior over \(f\), encoded with parametric mean and kernel. The available observations are used to build the posterior distribution which is used to determine the learning policy, balancing exploration (high GP variance) and exploitation (low GP mean value).

Given the cost of evaluating the objective function, uninformed trial-and-error methods like random or grid search are of limited use. Compared to a simple grid search, BO can identify a better solution for HPO given the same number of configurations to evaluate. Thanks to its modeling flexibility, BO can build a relatively cheap probabilistic surrogate of \(f\), take advantage of related tasks (Swersky et al. 2013) or use problem-specific priors (De Ath et al. 2020).

The strategy we follow here to mitigate the high cost of hyperparameter optimization is to enable the BO algorithm to trade off the value of information gained from evaluating a hyperparameter configuration against its cost. In Swersky et al. (2013) and Klein et al. (2017), BO is used to evaluate models trained on randomly chosen subsets of data, obtaining more, but less informative, evaluations. Two strategies aiming at the same target, which we do not consider here, are curriculum learning, which takes a data-centric view by training the model on increasingly larger datasets, and continuation learning, which takes a model-centric view by building a sequence of loss functions \({L}_{1},\dots ,{L}_{r}\), in which each \({L}_{i+1}\) is more difficult to optimize than \({L}_{i}\), so that each \({L}_{i}\) can be viewed as a regularized version of \({L}_{i+1}\) (Aggarwal 2018).

These approaches can be interpreted as optimization problems in which multiple information sources are available, every source approximating the actual black-box and expensive (loss) function, with a different cost for querying each information source. This setting is known as Multi-Information Source Optimization (MISO), or multi-fidelity optimization in the special case where the “fidelity” of each source is known a priori and independent of the values of the hyperparameters.

1.2 Multi-information source optimization: related work

This problem was initially studied under the name of multi-fidelity optimization: rather than a single objective \(f\), we have a collection of information sources denoted by \({f}_{1}\left(x\right),\dots ,{f}_{S}\left(x\right)\). Each source has its own query cost, \({c}_{1},\dots ,{c}_{S}\), with \({c}_{s}>0\) for all \(s=1,\dots ,S\), and the index controls the fidelity, with lower \(s\) giving higher fidelity: increasing the fidelity gives a more accurate estimate, but at a higher cost. In the case of cross-validation, the fidelity can be related to the number of iterations of the learning algorithm, the amount of data used in the training or the number of folds of the cross-validation. In MISO, the goal is to solve (1) while reducing the overall cost along the optimization process. MISO requires specific approaches to choose both the next location and the source to evaluate, leading to a sequence \(\left\{\left({s}^{(1)},{x}^{(1)}\right),\dots ,\left({s}^{(N)},{x}^{(N)}\right)\right\}\). It is always possible to sort the sources such that \({c}_{s}>{c}_{s+1}\); if \(f\left(x\right)\) itself can also be queried, it is the most expensive source, so we can set \(f\left(x\right)={f}_{1}\left(x\right)\) without loss of generality.

In the early work on multi-fidelity optimization, the sources \({f}_{s}\left(x\right)\) were assumed to be ordered in terms of accuracy and cost. In the more general multi-information source setting, we only assume that \(f(x)\), taking a design input \(x\), is the objective and that the \({f}_{s}\left(x\right)\) are sources approximating it with different biases, different amounts of noise and different costs.

MISO has been gaining increasing attention in recent years, also beyond ML. An example in engineering design is the finite element method, where models of different cost and fidelity can be obtained using different mesh resolutions. Cheap approximations do not represent the optimization targets accurately, but can still offer an indication of the sensitivity of the output to changes in the parameters. Output data from physical prototypes can also be integrated in the optimization framework as an additional information source, with fidelity depending on the application and the experimental setting. The application domain which first exploited the advantages offered by multi-fidelity and multi-information source optimization is aerodynamics: Chaudhuri et al. (2019) and Lam et al. (2015) present an approach that adaptively updates a multi-fidelity surrogate over multiple information sources, without any assumption about hierarchical relations among them.

In a seminal paper (Swersky et al. 2013), the use of small datasets to quickly optimize the hyperparameters of an ML model for large datasets was proposed. The method shows that it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to speed up k-fold cross-validation, with the algorithm dynamically choosing which dataset to query in order to yield the most information per unit cost. In Kandasamy et al. (2016) a multi-fidelity bandit optimization based on Gaussian Process (GP) approximations of all the sources is proposed. The algorithm, named Multi-Fidelity Gaussian Process Upper Confidence Bound (MF-GP-UCB), explores the search space using first the lower-fidelity sources and then the higher-fidelity ones in successively smaller regions, converging to the optimum. FABOLAS (FAst Bayesian Optimization on LArge dataSets) (Klein et al. 2017) is an approach for HPO on large datasets: at each iteration, it selects a hyperparameter configuration and a dataset size, with the aim of optimizing the hyperparameters for the entire dataset. Results are reported for HPO of Support Vector Machines (SVMs) and DNNs, with FABOLAS often providing good solutions significantly faster than “vanilla” BO-based HPO on the full dataset. The approach in Poloczek et al. (2017) uses a GP with a kernel working on a space consisting of both the search space (spanned by the hyperparameters to optimize) and the information sources. In Ghoreishi and Allaire (2019) an approach incorporating correlations both within and among information sources is proposed: this makes it possible to exploit the information collected over all the sources, fusing it into a unique fused GP. Furthermore, the constrained setting is considered, where the constraints can also be queried on multiple information sources.

A different approach, Importance-based Bayesian Optimization (IBO), has been proposed in Ariafar et al. (2020): it models a distribution over the location of the optimal hyperparameter configuration and allocates the experimental budget according to the cost-adjusted expected reduction in entropy (Hennig and Schuler 2012). Higher-fidelity observations provide a larger reduction in entropy, albeit at a higher evaluation cost.

To properly quantify predictive uncertainty, it is important for a learning system to recognize the different types of uncertainty arising in the modeling process (Liu et al. 2019). Two types of uncertainty must be considered: aleatoric and epistemic. Aleatoric uncertainty arises from the stochastic variability of the data generating process (e.g., imperfect sensors), while epistemic uncertainty arises from our lack of knowledge about the data generating mechanism. A model’s epistemic uncertainty can be reduced by collecting more data and takes two forms: parametric uncertainty, associated with estimating the model parameters under the current model specification, and structural uncertainty, which reflects the extent to which the model is sufficient to describe the data, i.e., whether there exists a systematic discrepancy.

1.3 Our contributions

The main contributions of this paper can be summarized as follows:

  • A new GP called augmented GP which does not require a kernel working in the \(x,s\) space of hyperparameters and sources. Relations among sources are captured by a simplified and computationally cheap discrepancy measure (related to the epistemic error), used to select “reliable” evaluations to fit the proposed GP and included into a new acquisition function.

  • A new acquisition function based on U/LCB that embeds the sparsification strategy. Indeed, the resulting GP is sparse, which reduces the computational cost of fitting it (a cost cubic in the number of evaluations).

  • The new GP mitigates the computational problems of nonparametric regression estimation, which is inherently difficult in high dimensions, with known lower bounds depending exponentially on the dimension.

  • Making MISO itself energy-efficient by selecting a subset of “reliable” evaluations among all those performed over all the sources. Only this subset is used to fit a GP, in contrast with the fused GP in Ghoreishi and Allaire (2019).

  • Demonstrating, empirically, the benefit provided by our approach on an HPO task aimed at optimally tuning a Support Vector Machine classifier on a large dataset.

2 Background

2.1 Gaussian processes

One way to interpret a Gaussian Process (GP) regression model is to think of \(f\) as a latent function defining a distribution over functions, and with inference taking place directly in the space of functions (i.e., function-space view) (Williams and Rasmussen 2006). A GP is a collection of random variables, any finite number of which have a joint Gaussian distribution. A GP is completely specified by its mean function \(\mu (x)\) and covariance function \(cov\left( {f\left( x \right),f\left( {x^{\prime}} \right)} \right) = k\left( {x,x^{\prime}} \right)\):

$$ \begin{aligned} \mu \left( x \right) & = {\mathbb{E}}\left[ {f\left( x \right)} \right] \\ cov\left( {f\left( x \right),f\left( {x^{\prime}} \right)} \right) & = k\left( {x,x^{\prime}} \right) = {\mathbb{E}}\left[ {\left( {f\left( x \right) - \mu \left( x \right)} \right)\left( {f\left( {x^{\prime}} \right) - \mu \left( {x^{\prime}} \right)} \right)} \right] \\ \end{aligned} $$
(2)

and the GP is written as:

$$f\left(x\right)\sim GP\left(\mu \left(x\right),k\left(x,{x}^{^{\prime}}\right)\right)$$
(3)

Usually, for notational simplicity, we will take the prior of the mean function to be zero, although this is not necessary.

As a consequence, the function values \(f\left({x}_{1}\right),\dots ,f\left({x}_{n}\right)\) obtained at \(n\) different points \({x}_{1},\dots ,{x}_{n}\) are jointly Gaussian. To see this, we can draw samples from the distribution of functions evaluated at any number of points; in detail, we choose a set of input points \({X}_{1:n}={({x}_{1},\dots ,{x}_{n})}^{T}\) and then compute the corresponding covariance matrix elementwise. This is usually done through predefined covariance functions, which allow the covariance between outputs to be written as a function of the inputs (i.e., \(cov\left(f\left(x\right),f\left({x}^{\prime}\right)\right)= k(x,x^{\prime})\)). Finally, we can generate a random Gaussian vector as:

$$f({X}_{1:n})\sim \mathcal{N}(0,\mathrm{K}({X}_{1:n},{X}_{1:n}) )$$
(4)

and plot the generated values as a function of the inputs. This is known as sampling from the prior.
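To make this concrete, the following minimal R sketch (R being the implementation language of this paper, see Sect. 3.3) draws five samples from a GP prior with an SE kernel; the grid, length scale and jitter are illustrative choices, not the settings used in our experiments.

```r
# Minimal sketch: sampling from a GP prior with a Squared Exponential kernel
se_kernel <- function(X1, X2, l = 0.2) {
  exp(-outer(X1, X2, function(a, b) (a - b)^2) / (2 * l^2))
}

x_grid <- seq(0, 1, length.out = 200)             # input locations X_1:n
K <- se_kernel(x_grid, x_grid) + diag(1e-8, 200)  # small jitter for numerical stability
L <- t(chol(K))                                   # lower Cholesky factor: K = L %*% t(L)

set.seed(1)
# each column is one function drawn from N(0, K), i.e., from the prior
prior_samples <- L %*% matrix(rnorm(200 * 5), nrow = 200, ncol = 5)
matplot(x_grid, prior_samples, type = "l", lty = 1,
        xlab = "x", ylab = "f(x)", main = "Samples from the GP prior")
```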

Let \({\mathbf{X}}_{1:n}=\left\{{x}^{(1)},\dots ,{x}^{(n)}\right\}\) denote a set of \(n\) locations into the search space \(\mathcal{X}\) and \(\mathbf{y}=\left\{{y}^{(1)},\dots ,{y}^{(n)}\right\}\) the associated function values, with \({y}^{(i)}=f\left({x}^{(i)}\right)\) or, in the noisy setting, \({y}^{(i)}=f\left({x}^{(i)}\right)+\varepsilon \) with \(\varepsilon \sim \mathcal{N}\left(0,{\lambda }^{2}\right)\). Then, the GP’s mean and variance are conditioned to the training set \(\left({\mathbf{X}}_{1:n},\mathbf{y}\right)\) as follows:

$$\mu \left(x\right)|({\mathbf{X}}_{1:n},\mathbf{y})=\mathbf{k}\left(x,{\mathbf{X}}_{1:n}\right){\left[\mathbf{K}+{\lambda }^{2}\mathbf{I}\right]}^{-1}\mathbf{y}$$
(5)
$${\sigma }^{2}\left(x\right)|({\mathbf{X}}_{1:n},\mathbf{y})=k\left(x,x\right)-\mathbf{k}\left(x,{\mathbf{X}}_{1:n}\right){\left[\mathbf{K}+{\lambda }^{2}\mathbf{I}\right]}^{-1}\mathbf{k}\left({\mathbf{X}}_{1:n},x\right)$$
(6)

with \(k\) a kernel function, \(\mathbf{k}\left(x,{\mathbf{X}}_{1:n}\right)\) a vector whose \(i\)th component is \(k\left(x,{x}^{(i)}\right)\) and \(\mathbf{K}\) an \(n\times n\) matrix with entries \({\mathbf{K}}_{{\varvec{i}}{\varvec{j}}}=k\left({x}^{(i)},{x}^{(j)}\right)\). Finally, \(\mathbf{k}\left({\mathbf{X}}_{1:n},x\right)\) is the transposed version of \(\mathbf{k}\left(x,{\mathbf{X}}_{1:n}\right)\).
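Equations (5)–(6) can be sketched in a few lines of R, reusing the se_kernel() function from the previous snippet; parameter values are again illustrative.

```r
# Minimal sketch of Eqs. (5)-(6): GP posterior mean and standard deviation
gp_posterior <- function(x_new, X, y, l = 0.2, lambda = 0.01) {
  K      <- se_kernel(X, X, l) + diag(lambda^2, length(X))            # K + lambda^2 I
  K_inv  <- solve(K)
  k_new  <- se_kernel(x_new, X, l)                                    # k(x, X_1:n), one row per x_new
  mu     <- k_new %*% K_inv %*% y                                     # Eq. (5)
  sigma2 <- se_kernel(x_new, x_new, l) - k_new %*% K_inv %*% t(k_new) # Eq. (6)
  list(mu = as.vector(mu), sigma = sqrt(pmax(diag(sigma2), 0)))
}

# usage: condition on six observations and predict over a grid
X_obs <- c(0.05, 0.2, 0.4, 0.6, 0.8, 0.95)
y_obs <- sin(2 * pi * X_obs)
post  <- gp_posterior(seq(0, 1, length.out = 200), X_obs, y_obs)
```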

Figure 1 shows a simple example: five samples drawn at random from a GP prior and from the corresponding posterior, the latter conditioned on six function observations.

A kernel function (aka covariance function) is the crucial ingredient in a GP predictor, as it encodes assumptions about the function to approximate. It is clear that the notion of similarity between data points is crucial; it is a basic assumption that points which are close are likely to have similar target values \(y\), and thus function evaluations that are near to a given point should be informative about the prediction at that point. Under the GP view it is the covariance function that defines nearness or similarity.

2.1.1 Squared Exponential (SE) kernel

$${k}_{SE}\left(x,{x}^{\prime}\right)={e}^{-\frac{{\Vert x-{x}^{\prime}\Vert }^{2}}{2{\ell }^{2}}}$$

with \(\ell \) known as the characteristic length scale. Dividing the input distance by a large length scale makes inputs effectively closer to each other, while a small length scale does the opposite. Consequently, a large length scale implies long-range correlations, whereas a short length scale makes function values strongly correlated only if their respective inputs are very close to each other. This kernel is infinitely differentiable, meaning that the sample paths of the corresponding GP are very “smooth.”

Another way to look at \(\ell \) is through the expected number of 0-upcrossings, which is proportional to \(1/\ell \): \(\ell \) is thus proportional to the expected distance between successive crossings of 0, hence the name length scale.

SE is the most widely used kernel: it is easy to code, relatively robust to misspecification and guarantees a positive definite covariance matrix regardless of the input dimension. One must anyway bear in mind that it is particularly liable to numerical ill-conditioning of the kernel matrix. Other widely adopted kernels are reported in the Appendix of this paper.

2.2 Acquisition functions

The acquisition function is the mechanism implementing the trade-off between exploration and exploitation in BO. More precisely, any acquisition function aims to guide the search for the optimum toward points with potentially low values of the objective function, either because the prediction of \(f(x)\), based on the probabilistic surrogate model, is low or because the uncertainty is high (or both). Exploiting means targeting the area offering the best chance to improve the current solution (with respect to the current surrogate model), while exploring means moving toward less explored regions of the search space, where predictions based on the surrogate model have higher variance.

Confidence Bound—where Upper and Lower Confidence Bound (UCB and LCB) are used, respectively for maximization and minimization problems—is an acquisition function that manages exploration–exploitation by being optimistic in the face of uncertainty, in the sense of considering the best-case scenario for a given probability value (Auer 2002).

For the case of minimization, LCB is given by:

$$\mathrm{LCB}\left(x\right)= \mu \left(x\right)-\xi \sigma (x)$$

where \(\xi \ge 0\) is the parameter in charge of managing the trade-off between exploration and exploitation (\(\xi =0\) yields pure exploitation; on the contrary, higher values of \(\xi \) emphasize exploration by inflating the model uncertainty). For this acquisition function there are strong theoretical results on achieving optimal regret, originated in the context of multi-armed bandit problems and derived by Srinivas et al. (2012). For the candidate point \({x}_{n}\) we observe the instantaneous regret \({r}_{n}=f\left({x}_{n}\right)-f\left({x}^{*}\right)\). The cumulative regret after \(N\) function evaluations is the sum of the instantaneous regrets: \({R}_{N}=\sum_{n=1}^{N}{r}_{n}\). A desirable asymptotic property of an algorithm is to be no-regret: \(\underset{N\to \infty }{\mathrm{lim}}\frac{{R}_{N}}{N}=0\). Bounding \({R}_{N}\) by a quantity sublinear in \(N\) translates bounds on the average regret \(\frac{{R}_{N}}{N}\) into convergence rates: the best value found in the first \(N\) function evaluations, \(f\left({x}^{+}\right)=\underset{n\le N}{\mathrm{min}}f\left({x}_{n}\right)\), is no further from \(f({x}^{*})\) than the average regret. Therefore \(f\left({x}^{+}\right)-f({x}^{*})\to 0\) as \(N\to \infty \), and a no-regret algorithm will converge to a subset of the global minimizers.

A detailed analysis of the convergence rate of \({R}_{N}\) in the case of the Matérn kernel, for different values of \(d\) and \(\nu \), is given in Vakili et al. (2020).

Figure 2 shows how the selected points change depending on \(\xi \).

Finally, the next point to evaluate is chosen according to \({x}^{\left(n+1\right)}= \underset{x\in X}{\mathrm{argmin}}\,LCB(x)\), in the case of a minimization problem, or \({x}^{\left(n+1\right)}= \underset{x\in X}{\mathrm{argmax}}\,UCB(x)\) in the case of a maximization problem.
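As an illustration, the following R sketch computes LCB on a discretized candidate set, built on the gp_posterior() sketch of Sect. 2.1; in practice the acquisition function is optimized with a continuous method such as L-BFGS (cf. Sect. 4.3).

```r
# Minimal sketch: LCB over a candidate grid and selection of the next point (minimization)
lcb <- function(x_cand, X, y, xi = 2) {
  post <- gp_posterior(x_cand, X, y)
  post$mu - xi * post$sigma   # low mean (exploitation) and high variance (exploration)
}

x_cand <- seq(0, 1, length.out = 1000)
acq    <- lcb(x_cand, X_obs, y_obs)
x_next <- x_cand[which.min(acq)]   # argmin of LCB = next point to evaluate
```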

From the perspective of BO, a particularly interesting bandit problem is the kernelized continuum-armed bandit problem (Srinivas et al. 2010). Here, \(f\) is assumed to be in the closure of the functions on \(\mathcal{X}\) expressible as a linear combination of a feature embedding parametrized by a kernel \(k\). The properties of the functions in the resulting space, referred to as the RKHS of \(k\), are determined by the choice of the kernel. For an SE kernel the RKHS contains only infinitely differentiable functions.

The optimization of the acquisition function leads to the next location to be queried, \({x}^{(n+1)}\), and, consequently, to a sequence of locations \(\left\{{x}^{(1)},\dots ,{x}^{(N)}\right\}\) generated over the BO process, with \(N\) the overall number of function evaluations at the end of the process. In this paper we use the Lower Confidence Bound, widely adopted in GP-based BO and with a convergence proof under an appropriate scheduling of the internal parameter \({\beta }^{(n)}\) (Srinivas et al. 2012), which balances exploration and exploitation:

$${LCB}^{\left(n\right)}(x) = {\mu }^{\left(n\right)}(x)-\sqrt{{\beta }^{(n)}}\,{\sigma }^{(n)}(x)$$
(7)

where the superscript \(n\) highlights that the value of \(\beta \) changes over the BO iterations, as do the conditioned GP’s mean and standard deviation. Confidence Bound has been successfully applied in MISO, such as in Kandasamy et al. (2016). Wilson et al. (2018) point out that the acquisition function may have large flat regions which, particularly in high-dimensional spaces, make its optimization problematic; they propose a Monte Carlo evaluation of the acquisition function amenable to gradient-based optimization and identify a family of acquisition functions, including UCB, whose characteristics allow greedy approaches for their maximization.

A specific problem in MISO is the choice of the acquisition function. According to Poloczek et al. (2017) and Ghoreishi and Allaire (2019), Knowledge Gradient, Entropy Search and Predictive Entropy Search can be applied. However, their computation and optimization are computationally expensive: for this reason, in this paper we start from L/UCB and build on it a new acquisition function specifically designed for MISO.

3 The proposed multi information source optimization—augmented Gaussian process (MISO-AGP)

3.1 Augmented GP

The MISO approach proposed in this paper is based on the idea of training a GP on a “reliable” subset of all the function evaluations performed so far over all the information sources. We refer to this GP as the Augmented Gaussian Process (AGP) and consequently name our approach MISO-AGP. The term “augmented” highlights that the set of function evaluations used to train the AGP starts from those performed on the most expensive source and is then “augmented” by selecting evaluations performed on the other sources. Before explaining how the selection is performed, we introduce some useful notation.

Let \({D}_{s}={\left\{\left({x}^{\left(i\right)},{y}_{s}^{(i)}\right)\right\}}_{i=1,\dots ,{n}_{s}}\) denote the \({n}_{s}\) function evaluations performed so far on the source \(s\). For each source \(s\) a specific GP, \({\mathcal{G}}_{s}\), is trained on the current \({D}_{s}\). Let us introduce a model discrepancy measure, \(\eta \left(x,\mathcal{G},{\mathcal{G}}^{\prime}\right)\), between two GPs. Unlike other papers, such as Poloczek et al. (2017) and Ghoreishi and Allaire (2019), we compute it simply as:

$$\eta \left(x,\mathcal{G},{\mathcal{G}}^{\prime}\right)=\left|\,\mu \left(x\right)|{D}_{s}-{\mu }^{\prime}\left(x\right)|{D}_{{s}^{\prime}}\,\right|$$
(8)

with \(\mu \left(x\right)\) and \({\mu }^{\prime}\left(x\right)\) the conditioned mean functions of the two GPs. It is important to note that \(\eta \left(x,\mathcal{G},{\mathcal{G}}^{\prime}\right)\) depends on \(x\): in MISO we do not know a priori the fidelity of each source, and it may not be constant over \(\mathcal{X}\).

Assume that \(f\left(x\right)\) can be queried at the highest cost, that is \(f\left(x\right)={f}_{1}\left(x\right)\). Thus, the set of evaluations to train the AGP consists of \({D}_{1}\) “augmented” by:

$$\tilde{D}=\left\{\left(\tilde{x},\tilde{y}\right):\exists \, s>1:\left(\tilde{x},\tilde{y}\right)\in {D}_{s}\,\wedge \,\eta \left(\tilde{x},{\mathcal{G}}_{1},{\mathcal{G}}_{s}\right)<m\,{\sigma }_{1}\left(\tilde{x}\right)\right\}$$
(9)

with \(m\) a technical parameter of the MISO-AGP algorithm. We used \(m=1\) (i.e., around 68% of normally distributed observations fall within mean \(\pm \) one standard deviation). Thus, function evaluations on cheaper sources having a discrepancy lower than the threshold in (9) are considered “reliable” enough to be merged with those collected on the most expensive source. Let \(\widehat{D}={D}_{1}\cup \tilde{D}\) denote the augmented set of function evaluations; the AGP \(\widehat{\mathcal{G}}\) is trained on \(\widehat{D}\), leading to \(\widehat{\mu }(x)\) and \(\widehat{\sigma }(x)\), computed according to (5)–(6). An example is reported in Fig. 3.
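A minimal R sketch of the selection rule (8)–(9) follows, reusing the gp_posterior() helper of Sect. 2.1; data structures and names are illustrative. Each source \(s\) is assumed to come with its observations (X[[s]], y[[s]]).

```r
# Minimal sketch of Eqs. (8)-(9): build the augmented set D_hat
augment <- function(X, y, m = 1, l = 0.2) {
  D_hat <- data.frame(x = X[[1]], y = y[[1]])        # start from D_1
  for (s in 2:length(X)) {
    g1 <- gp_posterior(X[[s]], X[[1]], y[[1]], l)    # GP_1 evaluated at source-s locations
    gs <- gp_posterior(X[[s]], X[[s]], y[[s]], l)    # GP_s at its own locations
    eta  <- abs(g1$mu - gs$mu)                       # Eq. (8): model discrepancy
    keep <- eta < m * g1$sigma                       # Eq. (9): "reliable" evaluations only
    D_hat <- rbind(D_hat, data.frame(x = X[[s]][keep], y = y[[s]][keep]))
  }
  D_hat                                              # the AGP is then fit on D_hat
}
```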

Fig. 1 Sampling from prior vs sampling from posterior (for the sake of simplicity, we consider the noise-free setting)

Fig. 2 GP trained on seven observations (top); LCB for different values of \(\xi \), whose minima correspond to the next point to evaluate (bottom)

Fig. 3 An example of AGP on a one-dimensional MISO minimization problem with two information sources. (Left) the two GPs trained on each source; (right) the AGP: only three evaluations on the cheaper source (around \(x=0.5\), \(x=0.7\) and \(x=0.8\)) are selected to “augment” the evaluations on the expensive one. This reduces, at the same time, the uncertainty near the global minimum of \({f}_{1}\) and the number of evaluations used to train the AGP (six out of the 14 overall)

3.2 Acquisition function in MISO-AGP algorithm

Following the training of the AGP, an acquisition function must be used to choose the next source-location pair to query, that is \(\left({s}^{\prime},{x}^{\prime}\right)\). We consider the U/LCB framework:

$$\left({s}^{\prime},{x}^{\prime}\right)=\underset{\begin{array}{c}x\in \mathcal{X}\subset {\mathbb{R}}^{d}\\ s=1,\dots ,S\end{array}}{\mathrm{argmax}}\left\{\frac{{y}^{+}-\left(\widehat{\mu }\left(x\right)-\sqrt{{\beta }^{\left(n\right)}}\,\widehat{\sigma }\left(x\right)\right)}{{c}_{s}\left(1+\eta \left(x,\widehat{\mathcal{G}},{\mathcal{G}}_{s}\right)\right)}\right\}$$
(10)

where \(n\) is the number of function evaluations in \(\widehat{D}\) and \({y}^{+}=\underset{\left(x,y\right)\in \widehat{D}}{\mathrm{min}}\left\{y\right\}\) is the best value observed in \(\widehat{D}\). The numerator is the most optimistic improvement with respect to the AGP’s LCB, penalized, in the denominator, by the cost of the source \(s\) and by the model discrepancy between the AGP \(\widehat{\mathcal{G}}\) and \({\mathcal{G}}_{s}\) at the location \(x\).
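The following R sketch evaluates (10) on a discretized candidate set; agp_mu, agp_sigma (from the AGP) and eta_by_source (the discrepancies of Eq. (8) between the AGP and each \({\mathcal{G}}_{s}\)) are assumed precomputed, and all names are illustrative.

```r
# Minimal sketch of Eq. (10): jointly choose the next location and source
acq_miso <- function(x_cand, agp_mu, agp_sigma, eta_by_source, costs, y_best, beta = 4) {
  scores <- sapply(seq_along(costs), function(s) {
    num <- y_best - (agp_mu - sqrt(beta) * agp_sigma)  # most optimistic improvement over LCB
    den <- costs[s] * (1 + eta_by_source[[s]])         # cost and discrepancy penalization
    num / den
  })                                                   # matrix: one column per source
  best <- which(scores == max(scores), arr.ind = TRUE)[1, ]
  list(x_next = x_cand[best["row"]], s_next = unname(best["col"]))
}
```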

It may happen that \({x}^{\prime}\) is too close to some previous function evaluation on \({s}^{\prime}\). This behaviour arises when BO is converging to a (local/global) optimum and leads to a well-known instability issue in GP training, namely the ill-conditioning of the matrix \(\left[\mathbf{K}+{\lambda }^{2}\mathbf{I}\right]\) to be inverted. This issue occurs even more frequently and quickly in the noise-free setting (i.e., \(\lambda =0\)). To avoid this undesired behaviour, which wastes evaluations without obtaining any improvement and/or risks triggering the instability issue, we introduce the following correction.

Given \(\left({s}^{\prime},{x}^{\prime}\right)\) from (10), if \(\exists \left({x}^{(i)},{y}^{(i)}\right)\in {D}_{{s}^{\prime}}\) such that \({\Vert {x}^{\prime}-{x}^{(i)}\Vert }^{2}<\updelta \), then

$$ {s}^{\prime} \leftarrow 1 \quad \mathrm{and} \quad {x}^{\prime} \leftarrow \underset{x\in \mathcal{X}\subset {\mathbb{R}}^{d}}{\mathrm{argmax}}\,{\sigma }_{1}\left(x\right) $$
(11)

with \(\updelta >0\) the second technical parameter of MISO-AGP. In other words, we set the acceptable level of approximation, \(\updelta \), in locating the optimizer and, in the case that \({x}^{\prime}\) is closer than \(\updelta \) to another evaluation on \({s}^{\prime}\), we prefer to “spend our budget” on reducing the uncertainty on the most expensive source.
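The correction can be sketched as follows (one-dimensional case for simplicity; names are illustrative, and sigma_1 is the predictive standard deviation of \({\mathcal{G}}_{1}\) over the candidate set).

```r
# Minimal sketch of Eq. (11): fall back to maximum uncertainty on the expensive source
apply_correction <- function(x_next, s_next, X, sigma_1, x_cand, delta = 1e-3) {
  if (any((X[[s_next]] - x_next)^2 < delta)) {    # too close to a previous evaluation?
    list(x_next = x_cand[which.max(sigma_1)], s_next = 1)
  } else {
    list(x_next = x_next, s_next = s_next)
  }
}
```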

The MISO-AGP algorithm is summarized in the following.

Algorithm (figure a): the MISO-AGP pseudocode
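For concreteness, here is a minimal R sketch of the main loop, combining the helper sketches introduced above (the pseudocode in figure a remains the authoritative description; all names are illustrative and the one-dimensional case is assumed).

```r
# Minimal sketch of the MISO-AGP main loop (1-d case)
miso_agp <- function(sources, costs, X, y, n_iter = 30,
                     x_cand = seq(0, 1, length.out = 1000)) {
  for (iter in seq_len(n_iter)) {
    D_hat <- augment(X, y)                                  # Eqs. (8)-(9)
    agp   <- gp_posterior(x_cand, D_hat$x, D_hat$y)         # AGP fit on D_hat
    eta_by_source <- lapply(seq_along(sources), function(s) {
      gs <- gp_posterior(x_cand, X[[s]], y[[s]])
      abs(agp$mu - gs$mu)                                   # discrepancy AGP vs GP_s
    })
    nxt <- acq_miso(x_cand, agp$mu, agp$sigma, eta_by_source,
                    costs, y_best = min(D_hat$y))           # Eq. (10)
    g1  <- gp_posterior(x_cand, X[[1]], y[[1]])
    nxt <- apply_correction(nxt$x_next, nxt$s_next,
                            X, g1$sigma, x_cand)            # Eq. (11)
    X[[nxt$s_next]] <- c(X[[nxt$s_next]], nxt$x_next)       # query the chosen source
    y[[nxt$s_next]] <- c(y[[nxt$s_next]], sources[[nxt$s_next]](nxt$x_next))
  }
  D_hat <- augment(X, y)
  D_hat[which.min(D_hat$y), ]                               # "augmented best seen" (Sect. 4.3)
}
```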

3.3 Computational setting

All the experiments have been performed on a Microsoft Azure virtual machine, H8 (High Performance Computing family) Standard with 8 vCPUs, 56 GB of memory, Ubuntu 16.04.6 LTS. The code has been developed in R: all the code is available upon request from the authors.

4 Experimental setting

4.1 Test problems

To validate the proposed MISO-AGP approach, we first evaluated it on two test problems: the one-dimensional Forrester test function (Forrester et al. 2007; Bartz-Beielstein et al. 2015) and the two-dimensional Rosenbrock test function, presented in Poloczek et al. (2017) as both a MISO and a multi-fidelity optimization test case.

4.1.1 Forrester test problem

The Forrester test problem is characterized by the two following sources:

$$ \begin{aligned} f_{1} \left( x \right) & = f\left( x \right) = \left( {6x - 2} \right)^{2} \sin \left( {12x - 4} \right) \\ f_{2} \left( x \right) & = 0.5 f_{1} \left( x \right) + 10\left( {x - 0.5} \right) + 5 \\ \end{aligned} $$

with associated costs \({c}_{1}=1000\) and \({c}_{2}=1\). The two functions are considered black-box, and the search space is the interval \(\left[\mathrm{0,1}\right]\). The solution for this problem is \({x}^{*}=0.7572488\) with associated function value \(f\left({x}^{*}\right)=-6.02074\).
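In R, the two sources read:

```r
# Forrester test problem: expensive source f1 = f and cheap, biased source f2
f1 <- function(x) (6 * x - 2)^2 * sin(12 * x - 4)
f2 <- function(x) 0.5 * f1(x) + 10 * (x - 0.5) + 5
sources <- list(f1, f2); costs <- c(1000, 1)   # query costs c1, c2
```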

4.1.2 Rosenbrock test problem

The Rosenbrock test problem is characterized by the following two sources:

$$ \begin{aligned} f_{1} \left( x \right) & = \left( {1 - x_{\left[ 1 \right]} } \right)^{2} + 100 \left( {x_{\left[ 2 \right]} - x_{\left[ 1 \right]}^{2} } \right)^{2} \\ f_{2} \left( x \right) & = f_{1} \left( x \right) + 0.1 \sin \left( {10 x_{\left[ 1 \right]} + 5 x_{\left[ 2 \right]} } \right) \\ \end{aligned} $$

where \({x}_{\left[1\right]}\) and \({x}_{\left[2\right]}\) represent, respectively, the first and second components of \(x\). The associated costs for evaluating the two sources are \({c}_{1}=1000\) and \({c}_{2}=1\). The two functions are considered black-box and the search space is \({\left[-\mathrm{2,2}\right]}^{2}\). The solution of this problem is \({x}^{*}=\left(\mathrm{1,1}\right)\) with associated function value \(f\left({x}^{*}\right)=0\).
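Analogously, in R:

```r
# Rosenbrock test problem: x is a vector c(x1, x2) in [-2, 2]^2
f1 <- function(x) (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2
f2 <- function(x) f1(x) + 0.1 * sin(10 * x[1] + 5 * x[2])
```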

4.2 C-support vector classification with radial basis function kernel

To validate our MISO-AGP approach, we designed an HPO task whose goal is to optimally and efficiently tune the hyperparameters of a Support Vector Machine (SVM) classifier on a large dataset. More precisely, we consider a C-SVC with a Radial Basis Function (RBF) kernel and the “MAGIC Gamma Telescope” dataset.

We chose C-SVC (i.e., C-Support Vector Classification, where C is the hyperparameter managing the trade-off between maximizing the margin and minimizing the classification error) due to its relative inefficiency on large datasets: the computational complexity of training a C-SVC, for a given hyperparameter configuration, scales with the cube of the number of instances. The C-SVC’s hyperparameters to optimize are the regularization term, \(C\), and the \(\gamma \) of the RBF kernel: \({k}_{RBF}\left(x,x^{\prime}\right)={e}^{-\gamma {\Vert x-x^{\prime}\Vert }^{2}}\).

The MAGIC dataset is generated by a Monte Carlo program (Heck et al. 1998) simulating the registration of high-energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique. The overall dataset consists of 19,020 instances: 12,332 of the class “gamma (signal)” and 6,688 of the class “hadron (background)”, each represented by ten continuous features. As preprocessing, we scaled all the dataset features to \([0,1]\).

Following the notation used in this paper, MISO-AGP is used to minimize \(f\left(x\right)\). This is straightforward for the two test problems, Forrester and Rosenbrock. For the hyperparameter optimization of the C-SVC, \(f\left(x\right)\) is the misclassification error computed via tenfold cross-validation on the MAGIC dataset. The search space \(\mathcal{X}\) is two-dimensional and box-bounded, spanned by the two C-SVC hyperparameters \(C\in \left[{10}^{-2},{10}^{2}\right]\) and \(\gamma \in \left[{10}^{-4},{10}^{4}\right]\). We adopt a logarithmic scaling of the search space, the usual procedure suggested in AutoML for hyperparameters varying over several orders of magnitude.

We defined two different sources: the first provides the misclassification error obtained via tenfold cross-validation of a C-SVC configuration using the entire MAGIC dataset (i.e., \({f}_{1}\left(x\right)=f\left(x\right)\)); the second, \({f}_{2}\left(x\right)\), performs the same computation using only a small portion of the data (5%, obtained through stratified sampling).

The energy required to perform tenfold cross-validation is essentially associated with the computational time, which we consider as a proxy for the sources’ costs. Since the computational time can also depend on the values of the C-SVC’s hyperparameters, we ran a sample of ten hyperparameter configurations on both sources and used the average computational times to estimate reference values for \({c}_{1}\) and \({c}_{2}\). More precisely, the computational time required by \({f}_{1}\left(x\right)\) is, on average, 320 times that required by \({f}_{2}\left(x\right)\). Thus, we set \({c}_{2}=1\) and, consequently, \({c}_{1}=320\).
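The two sources can be sketched in R as follows, assuming the e1071 package (whose svm() function supports built-in k-fold cross-validation through the cross argument) and a data frame magic with the ten features plus a factor column class; all names are illustrative, and the hyperparameters are handled in log10 scale, consistently with the search space above.

```r
library(e1071)

# misclassification error of a C-SVC with RBF kernel, via tenfold cross-validation
cv_error <- function(C, gamma, data) {
  model <- svm(class ~ ., data = data, kernel = "radial",
               cost = C, gamma = gamma, cross = 10)
  1 - model$tot.accuracy / 100       # tot.accuracy is the CV accuracy in percent
}

# x = (log10 C, log10 gamma); f1 uses the entire dataset (cost c1 = 320)
f1 <- function(x) cv_error(10^x[1], 10^x[2], magic)

# f2 uses a 5% stratified subsample (cost c2 = 1)
idx <- unlist(lapply(split(seq_len(nrow(magic)), magic$class),
                     function(i) sample(i, ceiling(0.05 * length(i)))))
f2 <- function(x) cv_error(10^x[1], 10^x[2], magic[idx, ])
```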

4.3 MISO-AGP setting

The kernel used to model the covariance function of all the GPs, including the AGP, is the Squared Exponential kernel, whose hyperparameters are set via Maximum Likelihood Estimation during GP training. The acquisition function (10) and, when needed, the correction (11) are both optimized via the L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno) algorithm.

As initialization, three hyperparameter configurations are sampled in \(\mathcal{X}\) via Latin Hypercube Sampling (Huntington and Lyrintzis 1998). Then, 30 further function evaluations are used by MISO-AGP to optimize over the sources. We decided not to set a limit on the cumulated cost, but to use this value to assess the efficiency of the proposed approach with respect to BO applied only to the most expensive source. To mitigate the effect of the initial randomness, ten different runs of MISO-AGP and BO have been performed and compared: at each run, the two approaches share the same initialization.

As a metric, we consider the best function value observed so far, usually named “best seen” in BO and simply defined as \({y}_{+}^{(n)}=\underset{i=1,\dots ,n}{\mathrm{min}}\left\{{y}^{(1)},\dots ,{y}^{(n)}\right\}\), since we are minimizing the misclassification error. However, this definition is no longer valid in the case of the AGP. Suppose that, at a certain iteration, a function evaluation on a cheaper source is selected to fit the AGP and corresponds to the best seen up to that iteration. At the next iteration it might not be selected and, consequently, it can no longer be considered the best seen. More formally, let \({\widehat{y}}_{+}^{(n)}\) denote the “augmented best seen,” \({\widehat{y}}_{+}^{(n)}=\underset{i=1,\dots ,p}{\mathrm{min}}\left\{{y}^{(1)},\dots ,{y}^{(p)}\right\}\), with \(p<n\) because only a subset of the evaluations on all the sources is used to train the AGP. If \({\widehat{y}}_{+}^{(n-1)} \notin \left\{{y}^{\left(1\right)},\dots ,{y}^{\left(p\right)}\right\}\), then \({\widehat{y}}_{+}^{(n)}\) may be smaller than, equal to, or even larger than \({\widehat{y}}_{+}^{(n-1)}\); in other terms, contrary to the common “best seen,” the “augmented best seen” may not be monotone over the function evaluations.

5 Results

5.1 Results on test problems

In this section we summarize the results on the two test problems, namely Forrester and Rosenbrock. The actual optimizer \({x}^{*}\) is known for each of the two test problems, so we measured the distance between \({x}^{*}\) and the solutions identified at the end of the BO and MISO-AGP processes. Distances are reported in Table 1, as average and standard deviation over the 30 different runs.

Table 1 Distance from \({x}^{*}\) and overall cumulated cost. Values are mean (standard deviation) on 30 independent runs. Standard deviation is not applicable in the case of BO cost because BO uses only one source, that is \({f}_{1}\left(x\right)=f\left(x\right)\)

As expected, the overall cumulated cost of MISO-AGP is significantly lower than that of performing BO on \({f}_{1}\left(x\right)=f\left(x\right)\) only.

With respect to the Forrester test problem, the solutions identified by MISO-AGP are significantly closer to \({x}^{*}\) than those found by BO (Wilcoxon test, p value < 0.01), at approximately half of the cumulated cost. Results are less impressive on the Rosenbrock test problem: BO solutions are in this case closer to \({x}^{*}\) than those found by MISO-AGP (Wilcoxon test, p value < 0.001). However, the MISO-AGP cost is significantly lower than BO’s, around 2% of it on average. Therefore, MISO-AGP still has margin to improve by slightly increasing its cumulated cost.

5.2 Results on hyperparameter optimization of C-SVC

Figure 4 summarizes the results obtained on the real-world application, the hyperparameter optimization of a C-SVC on the MAGIC dataset. In this case, neither the optimal hyperparameter configuration \({x}^{*}\) nor the associated function value \(f\left({x}^{*}\right)\) is known a priori. The best value of the misclassification error is reported with respect to the cost cumulated over the MISO-AGP and BO iterations, separately. Solid lines represent the mean over the ten independent runs, while shaded areas represent the standard deviations. As a reference value, we considered the best misclassification error registered, on the entire MAGIC dataset, over all the experiments performed (green dashed line). The cumulated costs, which are the actual ones and not the nominal \({c}_{1}\) and \({c}_{2}\) used in the acquisition function, are also averaged over the ten independent runs.

Fig. 4 HPO of C-SVC on the MAGIC dataset. Comparison between traditional BO-based HPO and MISO-AGP on two information sources. Results refer to ten independent runs

The MISO-AGP approach proved to be both more effective and more efficient than traditional BO: the identified hyperparameter configurations are associated with a lower misclassification error, obtained in less than one-third of the time required by BO. On average, 60% of the function evaluations are performed on the cheaper source. Thus, MISO-AGP has intelligently exploited the cheaper information source, thanks to the proposed AGP, leading to an energy-efficient and green HPO task.

6 Conclusions

The GP framework can be extended to deal with multiple information sources. Relations among sources are captured by a simplified and computationally cheap discrepancy measure, which enables a sparsification strategy used to select “reliable” evaluations to fit the proposed AGP. MISO-AGP has been empirically shown to solve a real HPO task effectively while significantly reducing the computational time and, consequently, the energy usage.