Green Machine Learning via Augmented Gaussian Processes and Multi-Information Source Optimization

Searching for accurate Machine and Deep Learning models is a computationally expensive and energy-hungry process. A strategy which has recently been gaining importance to drastically reduce computational time and energy consumption is to exploit the availability of different information sources, with different computational costs and different "fidelity", typically smaller portions of a large dataset. The multi-source optimization strategy fits into the scheme of Gaussian Process based Bayesian Optimization. An Augmented Gaussian Process method exploiting multiple information sources (namely, MISO-AGP) is proposed. The Augmented Gaussian Process is trained using only "reliable" information among the available sources. A novel acquisition function is defined according to the Augmented Gaussian Process. Computational results are reported related to the optimization of the hyperparameters of a Support Vector Machine (SVM) classifier using two sources: a large dataset (the most expensive one) and a smaller portion of it. A comparison with a traditional Bayesian Optimization approach, which optimizes the hyperparameters of the SVM classifier on the large dataset only, is reported.

This paper focuses on the issue of hyperparameter optimization (HPO). Hyperparameters are all the parameters of a model which are not updated during learning: they are used to configure the model (e.g., the number of layers of a deep neural network) or to characterize the algorithm used in the training phase (e.g., the learning rate of a gradient descent algorithm), and may even include the choice of the optimization algorithm itself and of the data features which are fed into the ML model.
HPO can be regarded as an optimization outer loop on top of ML model learning (the inner loop), searching for the set of hyperparameters leading to the lowest error on a validation set. This 2-tier optimization structure has several implications. First, the evaluation of the objective function of the outer loop is very expensive, as it requires learning a model and evaluating its performance on a validation set; this is usually repeated k times in a k-fold cross-validation procedure. Moreover, the objective function is unknown and can only be observed pointwise, with typically noisy evaluations. Second, the average value of the loss function does not reflect the true distribution of the data (which leads to the generalization error) and, due to the relatively small size of the validation set, the variance of the average estimate obtained by cross-validation can be high. Ignoring this uncertainty can result in a sub-optimal configuration of the hyperparameters. One must also take into account that the performance of the model is evaluated with some error, and thus finding the true optimum with high precision is usually not critical: this fits nicely into the Bayesian Optimization (BO) framework, which is very sample efficient and yields an acceptable solution with relatively few function evaluations.
The outer-loop optimization algorithm can be passive, like grid or pure random search, or "educated", learning from previous evaluations the structure of the objective function and actively searching where the most interesting solutions are. Indeed, BO is a framework to model the learning process and to yield a principled quantification of uncertainty (Frazier 2018). Although the active learning inherent in BO and the ensuing sample efficiency are usually associated with the search for the best algorithm and its configuration (Shahriari et al. 2016), in terms of accuracy, they translate into significant cost and energy savings. For instance, the BERT (Bidirectional Encoder Representations from Transformers) model, now available in the Google Cloud and aimed at contextual representation in NLP, can require 4-day training sessions (with 110 million DNN parameters to be learned) (Strubell et al. 2019), which makes the NAS performed in the outer loop awfully expensive. Sample efficiency requires some assumption on the objective function and a model of learning from observations. Probabilistic models commonly used in BO are Gaussian Processes (GPs) (Williams and Rasmussen 2006) and Random Forests (RFs) (Ho 1995); here we do not discuss their relative merits in different problem classes. GPs are a powerful framework for reasoning about an unknown function f given partial knowledge of its behavior obtained through function evaluations. A GP leverages a principled estimate of predictive uncertainty towards a careful balance of exploration (increasing one's knowledge about f) and exploitation (focusing on the best points found so far).
The global hyperparameter optimization problem is usually defined as:

x* = argmin_{x ∈ X} f(x)    (1)

where the search space X is generally box-bounded, f is the loss function and x the vector of hyperparameter values. We remark that f is analytically unknown (also called latent) and only pointwise, usually noisy, evaluations can be obtained by querying it. We refer to this situation as black-box optimization. BO leverages the fact that conditioning the GP on previous observations provides versatile regressors of the objective function. BO starts from a GP prior over f, encoded with parametric mean and kernel. The available observations are used to build the posterior distribution, which is used to determine the learning policy, balancing exploration (high GP variance) and exploitation (low mean value).
Given the cost of evaluating the objective function, trial-and-error methods like random or grid search are not useful. Compared to a simple grid search, BO can identify a better solution for HPO given the same number of configurations to evaluate. Given its modelling flexibility, BO can build a relatively cheap probabilistic surrogate of f, take advantage of related tasks (Swersky et al. 2013) or use problem-specific priors (De Ath et al. 2020).
The strategy we follow here is to mitigate the high cost of hyperparameter optimization by enabling the BO algorithm to trade off the value of the information gained from the evaluation of a hyperparameter configuration against its cost. In (Swersky et al. 2013) BO is used to evaluate models trained on randomly chosen subsets of data to obtain more, but less informative, evaluations. Two strategies aiming at the same target, which we do not consider here, are curriculum learning, which leverages a data-centric view by training the model on increasingly larger datasets, and continuation learning, which leverages a model-centric view by building a sequence of loss functions f_1, …, f_N, in which each f_{i+1} is more difficult to optimize than f_i, and one can view each f_i as a regularized version of f_{i+1} (Aggarwal 2018). These approaches can be interpreted as optimization problems in which multiple information sources are available, with every source approximating the actual black-box and expensive (loss) function and with a different cost for querying each information source. This setting is known as Multi-Information Source Optimization (MISO), or multi-fidelity optimization in the special case that the "fidelity" of each source is known a priori and is independent of the value of the hyperparameters.

Multi-Information Source Optimization: Related Works
This problem was initially studied under the name of multi-fidelity optimization: rather than a single objective f, we have a collection of information sources denoted with f_1(x), …, f_S(x). Each source has its own cost c_1, …, c_S, with c_s > 0, which controls the fidelity, with a higher cost giving higher fidelity: increasing the fidelity gives a more accurate estimate but at a higher cost. In the case of cross-validation, the fidelity can be related to the number of iterations of the learning algorithm, the amount of data used in the training, or the number of folds in the cross-validation. In MISO the goal is to solve (1) while reducing the overall cost along the optimization process. MISO requires specific approaches to choose both the next location and the next source to evaluate, leading to a sequence {(x^(1), s^(1)), …, (x^(n), s^(n))}. It is always possible to sort the sources such that c_s > c_{s+1} for s = 1, …, S − 1; in the case that f(x) itself can also be queried, it is the most expensive source, so we can set f(x) = f_1(x) without loss of generality.
In the early work about multi-fidelity, the sources f_s(x) were assumed to be ordered in terms of accuracy and cost; in more general multi-information source optimization problems we only assume a function f(x), taking a design input x, as the objective, and the f_s(x) being sources with different biases, different amounts of noise and different costs.
MISO has been gaining increasing attention in recent years, also beyond ML. An example in engineering design is the finite element method, where models with different costs and fidelities can be obtained using different mesh resolutions. Cheap approximations do not represent the optimization targets accurately, but can still offer an indication of the sensitivity of the output to changes in the parameters. Also, output data from physical prototypes can be integrated into the optimization framework as an additional information source, with a fidelity depending on the application and the experimental setting. The application domain which first exploited the advantages offered by multi-fidelity and multi-information source optimization is aerodynamics: (Chaudhuri et al. 2019) and (Lam et al. 2015) present an approach that adaptively updates a multi-fidelity surrogate on multiple information sources, without any assumption about hierarchical relations among them.
In a seminal paper (Swersky et al. 2013), the use of small datasets to quickly optimize the hyperparameters of a ML model for large datasets was proposed. The method shows that it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to speed up k-fold cross-validation. The algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost. In (Kandasamy et al. 2016) a multi-fidelity bandit optimization based on Gaussian Process (GP) approximations of all the sources is proposed. The algorithm, named Multi-Fidelity Gaussian Process Upper Confidence Bound (MF-GP-UCB), proved able to explore the search space (spanned by the hyperparameters of the ML algorithm to optimize) using the lower-fidelity sources and then exploit the higher-fidelity sources in successively smaller regions, converging to the optimum. FABOLAS (FAst Bayesian Optimization on LArge dataSets) is an approach for HPO on large datasets: at each iteration, it selects a hyperparameter configuration and a dataset size, with the aim of optimizing the hyperparameters for the entire dataset. Results are reported for HPO of Support Vector Machines (SVMs) and DNNs, with FABOLAS often providing good solutions significantly faster than "vanilla" BO-based HPO on the full dataset. The approach in (Poloczek et al. 2017) uses a GP with a kernel working on a space consisting of both the search space (spanned by the hyperparameters to optimize) and the information sources. In (Ghoreishi and Allaire 2019) an approach incorporating correlations both within and among information sources is proposed. This makes it possible to exploit the information collected over all the sources and then fuse it into a unique fused GP. Furthermore, the constrained setting is considered, where the constraints can also be queried on multiple information sources.
A different approach, Importance-based Bayesian Optimization (IBO), has been proposed in (Ariafar et al. 2020): it models a distribution over the location of the optimal hyperparameter configuration and allocates the experimental budget according to the cost-adjusted expected reduction in entropy (Hennig and Schuler 2012). Higher-fidelity observations provide a larger reduction in entropy, albeit at a higher evaluation cost.
To properly quantify predictive uncertainty, it is important for a learning system to recognize the different types of uncertainty arising in the modelling process (Liu et al. 2019). Two types of uncertainty must be considered: aleatoric and epistemic. Aleatoric uncertainty arises from the stochastic variability of the data generating process and from imperfect sensors; epistemic uncertainty arises from our lack of knowledge about the data generating mechanism. A model's epistemic uncertainty can be reduced by collecting more data and takes two forms: parametric uncertainty, that is the uncertainty associated with estimating the model parameters under the current model specification, and structural uncertainty, which reflects the extent to which the model is sufficient to describe the data, i.e., whether there exists a systematic discrepancy.

Our Contributions
The main contributions of this paper can be summarized as follows:
• A new GP, called Augmented GP (AGP), which does not require a kernel working in the joint (x, s) space of hyperparameters and sources. Relations among the sources are captured by a simplified and computationally cheap discrepancy measure (related to the epistemic error), used to select "reliable" evaluations to fit the proposed GP and included in a new acquisition function.
• A new acquisition function based on U/LCB but implementing a sparsification strategy. Indeed, the proposed GP turns out to be sparse, reducing the computational cost of fitting it (which is cubic in the number of evaluations).
• The new GP mitigates the computational problems in estimating nonparametric regression, which is inherently difficult in high dimensions, with known lower bounds depending exponentially on the dimension.
• Making MISO energy-efficient itself, by selecting a subset of "reliable" evaluations among all those performed over all the sources. Only this subset is used to fit a GP, differently from the fused GP in (Ghoreishi and Allaire 2019).
• Demonstrating, empirically, the benefit provided by our approach on an HPO task aimed at optimally tuning a Support Vector Machine classifier on a large dataset.

Gaussian Processes
The global optimization problem is defined as:

x* = argmin_{x ∈ X} f(x)

where the search space X is generally box-bounded and no analytical expression of f(x) is known: its shape can only be learned from its evaluations. One way to interpret a Gaussian Process (GP) regression model is to think of f as a latent function defining a distribution over functions, with inference taking place directly in the space of functions (i.e., the function-space view) (Williams and Rasmussen 2006). A GP is a collection of random variables, any finite number of which have a joint Gaussian distribution. A GP is completely specified by its mean function μ(x) and covariance function cov(f(x), f(x′)) = k(x, x′), and we will write the GP as:

f(x) ~ GP(μ(x), k(x, x′))

Usually, for notational simplicity, we will take the prior mean function to be zero, although this is not necessary.
A simple example of a GP can be obtained from a Bayesian linear regression model f(x) = φ(x)ᵀw with prior w ~ N(0, Σ_p), where φ(x) and w are N-dimensional vectors. More precisely, φ(x) is a function mapping the d-dimensional input vector x into an N-dimensional feature vector.
Thus, the equations for mean and covariance become:

E[f(x)] = φ(x)ᵀ E[w] = 0
cov(f(x), f(x′)) = E[f(x) f(x′)] = φ(x)ᵀ Σ_p φ(x′)

This means that f(x) and f(x′) are jointly Gaussian with zero mean and covariance given by φ(x)ᵀ Σ_p φ(x′).
As a consequence, one can draw the function values f(x_1), …, f(x_n) at different points and plot the generated values as a function of the inputs. This is known as sampling from the prior.
Conditioning on the observations (x^(1:n), y^(1:n)) gives the posterior mean and variance:

μ(x) = k(x, x_{1:n}) [K + λ²I]⁻¹ y    (6)
σ²(x) = k(x, x) − k(x, x_{1:n}) [K + λ²I]⁻¹ k(x_{1:n}, x)    (7)

with k a kernel function, k(x, x_{1:n}) a vector whose i-th component is k(x, x^(i)), and K an n × n matrix with entries K_ij = k(x^(i), x^(j)). Finally, k(x_{1:n}, x) is the transposed version of k(x, x_{1:n}). Consider, as an example, 5 different samples drawn at random from a GP prior and posterior, respectively, with the posterior conditioned on 6 function observations. It is easy to notice that the mean prediction is a linear combination of n kernel functions, each one centred on an evaluated point. This allows us to write μ(x) as:

μ(x) = Σ_{i=1}^{n} α_i k(x, x^(i)),  with α = [K + λ²I]⁻¹ y

This means that, to make a prediction at a given x, we only need to consider the (n + 1)-dimensional distribution defined by the n function evaluations performed so far and the new point to evaluate.
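As a minimal sketch of the posterior equations above, the following code computes the GP mean and variance with a Squared Exponential kernel; all names, toy data and the length-scale are illustrative choices, not the paper's experimental setting.

```python
# Minimal sketch of GP posterior mean/variance (equations (6)-(7)),
# assuming a Squared Exponential kernel; data and parameters are toy values.
import numpy as np

def se_kernel(a, b, length_scale=1.0):
    """Squared Exponential kernel matrix between the rows of a and b."""
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(x_train, y_train, x_query, noise_var=1e-6, length_scale=1.0):
    """Posterior mean and variance at x_query, conditioned on (x_train, y_train)."""
    K = se_kernel(x_train, x_train, length_scale)       # n x n Gram matrix
    k_star = se_kernel(x_query, x_train, length_scale)  # cross-covariances
    K_inv = np.linalg.inv(K + noise_var * np.eye(len(x_train)))
    mu = k_star @ K_inv @ y_train                       # eq. (6)
    var = 1.0 - np.sum((k_star @ K_inv) * k_star, 1)    # eq. (7), k(x,x)=1 for SE
    return mu, var

x_train = np.array([[0.0], [0.5], [1.0]])
y_train = np.array([0.0, 1.0, 0.0])
mu, var = gp_posterior(x_train, y_train, x_train)
# with a small noise term, the posterior mean interpolates the observations
# and the posterior variance collapses at the evaluated points
```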
The values of the GP's hyperparameters are usually unknown a priori and are set depending on the observations y_{1:n}, usually via Marginal Likelihood maximization.
GP hyperparameters appear non-linearly in the kernel matrix, and a closed-form solution maximizing the marginal likelihood cannot be found in general. In practice, gradient-based optimization algorithms are adopted to find a (local) optimum of the marginal likelihood (e.g., conjugate gradients or BFGS).
An interesting generalization has been recently proposed in (Berkenkamp et al., 2019) where the values of hyperparameters are modified by iteratively reducing the characteristic length-scale instead of setting them up through Marginal Likelihood maximization.

Kernels
A kernel function (aka covariance function) is the crucial ingredient in a GP predictor, as it encodes assumptions about the function to approximate. The notion of similarity between data points is crucial: the basic assumption is that points which are close are likely to have similar target values, and thus function evaluations near a given point should be informative about the prediction at that point. Under the GP view, it is the covariance function that defines nearness or similarity.
Squared Exponential (SE) kernel:

k(x, x′) = exp(−‖x − x′‖² / (2ℓ²))

with ℓ known as the characteristic length-scale. A large value of the length-scale maps the inputs to a narrower range of scaled values, while a small length-scale does the opposite. Consequently, a large length-scale implies long-range correlations, whereas a short length-scale makes function values strongly correlated only if their respective inputs are very close to each other. This kernel is infinitely differentiable, meaning that the sample paths of the corresponding GP are very "smooth".
Another way to look at ℓ is through the expected number of 0-upcrossings, which is proportional to 1/ℓ. Then ℓ is proportional to the expected length before crossing 0, hence the name length-scale.
SE is the most widely used kernel because it is easy to code, relatively robust to misspecification and guarantees a positive definite covariance matrix regardless of the input dimension. One must anyway bear in mind that it is particularly liable to numerical ill-conditioning of the kernel matrix.
Matérn kernels:

k(x, x′) = (2^{1−ν} / Γ(ν)) (√(2ν) ‖x − x′‖ / ℓ)^ν K_ν(√(2ν) ‖x − x′‖ / ℓ)

with two hyperparameters ν and ℓ, and where K_ν is a modified Bessel function. Note that for ν → ∞ we obtain the SE kernel.
The Matérn covariance functions become especially simple when ν is a half-integer: ν = p + 1/2, where p is a non-negative integer.
In this case, the formula can be rewritten as the product of an exponential and a polynomial of order p. The advantage of the simplified Matérn covariance function is that there are no Bessel functions, no sums of factorials nor fractions of gamma functions, as reported in (Gramacy 2020). This is important because the evaluation of the Bessel function can be as computationally demanding as the matrix inversion.
The most widely adopted versions, specifically in the Machine Learning community, are ν = 3/2 and ν = 5/2.
Choosing p = 0 one obtains the exponential kernel; p = 0 implies ν = 1/2, which is appropriate for rough surfaces.
Sample paths of the latent function under a GP with a Matérn kernel are k-times differentiable if and only if ν is larger than k.
One great advantage of Matérn kernels is that, at least for small ν, they produce covariance matrices that are better conditioned than those of the SE kernel.
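The half-integer Matérn forms and the conditioning claim above can be sketched numerically; the grid, length-scale and kernel names below are illustrative assumptions.

```python
# Sketch of the half-integer Matérn kernels (nu = 3/2, 5/2) and a conditioning
# comparison against the SE kernel on a dense 1-D grid.
import numpy as np

def matern32(r, ell=1.0):
    a = np.sqrt(3.0) * r / ell
    return (1.0 + a) * np.exp(-a)          # exponential times order-1 polynomial

def matern52(r, ell=1.0):
    a = np.sqrt(5.0) * r / ell
    return (1.0 + a + a**2 / 3.0) * np.exp(-a)  # exponential times order-2 polynomial

def se(r, ell=1.0):
    return np.exp(-0.5 * (r / ell) ** 2)

# pairwise distances on a dense grid: close points yield near-singular Gram matrices
x = np.linspace(0.0, 1.0, 50)
R = np.abs(x[:, None] - x[None, :])

cond_se = np.linalg.cond(se(R))
cond_m52 = np.linalg.cond(matern52(R))
cond_m32 = np.linalg.cond(matern32(R))
# rougher kernels (smaller nu) give better-conditioned covariance matrices:
# cond_m32 < cond_m52 < cond_se
```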
The exponential kernel is also called the Laplace kernel and has a strong link with Mondrian kernels which results in Gaussian models conceptually close to Random Forests (Lévesque et al. 2017).

Rational Quadratic Covariance Function
k(x, x′) = (1 + ‖x − x′‖² / (2αℓ²))^{−α}

where α and ℓ are the two hyperparameters. This kernel can be considered as an infinite sum (scale mixture) of SE kernels with different characteristic length-scales.
The aforementioned kernels are just the most widely adopted in GP regression. More details and a more comprehensive set of covariance functions, including non-stationary kernels and dot product kernels, are reported in (Williams and Rasmussen 2006) and (Gramacy 2020).
Some issues on kernels have been considered in recent publications, such as kernel composition and safe optimization in relation to cognition (Schultz et al.). A spatio-temporal kernel has been proposed in (Nyikosa et al. 2018) to allow the GP to capture all the instances of the function over time and track a temporally evolving minimum.

Acquisition functions
The acquisition function is the mechanism to implement the trade-off between exploration and exploitation in BO. More precisely, any acquisition function aims to guide the search for the optimum towards points with potentially low values of the objective function, either because the prediction of f(x), based on the probabilistic surrogate model, is low or because the uncertainty is high (or both). Indeed, exploiting means targeting the area offering the best chance to improve the current solution (with respect to the current surrogate model), while exploring means moving towards less explored regions of the search space, where the predictions based on the surrogate model have a higher variance.

Probability of Improvement

For minimization, the Probability of Improvement (PI) is:

PI(x) = Φ((f(x⁺) − μ(x) − ξ) / σ(x))

where f(x⁺) is the best value of the objective function observed so far, μ(x) and σ(x) are the mean and standard deviation provided by (6) and (7), and Φ(•) is the normal cumulative distribution function. The parameter ξ is introduced to modulate the balance between exploration and exploitation. More precisely, ξ = 0 leans towards exploitation, while ξ > 0 leans more towards exploration.
The next point to evaluate is chosen according to: x^(n+1) = argmax_{x ∈ X} PI(x)

Expected Improvement
Expected Improvement (EI), initially proposed in (Močkus et al. 1975) and then made popular in (Jones et al. 1998), measures the expectation of the improvement on f(x) with respect to the predictive distribution of the probabilistic surrogate model:

EI(x) = (f(x⁺) − μ(x) − ξ) Φ(Z) + σ(x) φ(Z),  with Z = (f(x⁺) − μ(x) − ξ) / σ(x)

where φ(•) is the normal probability density function. The parameter ξ actively manages the trade-off between exploration (larger values) and exploitation (smaller values); ξ should be adjusted dynamically to decrease monotonically with the number of function evaluations.
The next point to evaluate is chosen according to: x^(n+1) = argmax_{x ∈ X} EI(x)
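The two acquisition functions above can be sketched as follows; the posterior means, standard deviations and incumbent value are toy inputs, as if produced by equations (6) and (7).

```python
# Sketch of the PI and EI acquisition functions for minimization,
# computed from a GP's posterior mean/std at a set of candidate points.
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, xi=0.0):
    z = (f_best - mu - xi) / sigma
    return norm.cdf(z)

def expected_improvement(mu, sigma, f_best, xi=0.0):
    z = (f_best - mu - xi) / sigma
    return (f_best - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)

mu = np.array([0.2, 0.5, 0.9])      # posterior means at three candidates
sigma = np.array([0.3, 0.1, 0.05])  # posterior standard deviations
f_best = 0.4                        # best observed value so far

pi = probability_of_improvement(mu, sigma, f_best)
ei = expected_improvement(mu, sigma, f_best)
# the candidate combining a low mean with high uncertainty scores highest
```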

Upper/Lower Confidence Bound
The Confidence Bound (where Upper and Lower are used for maximization and minimization problems, respectively) is an acquisition function that manages the exploration-exploitation trade-off by being optimistic in the face of uncertainty, in the sense of considering the best-case scenario for a given probability value (Auer 2002).
For the case of minimization, LCB is given by:

LCB(x) = μ(x) − β σ(x)

where β ≥ 0 is the parameter managing the trade-off between exploration and exploitation (β = 0 is for pure exploitation; on the contrary, higher values of β emphasize exploration by inflating the model uncertainty). For this acquisition function there are strong theoretical results, originated in the context of multi-armed bandit problems, on achieving the optimal regret, derived by (Srinivas et al. 2012). For the candidate point x^(n) we observe the instantaneous regret r_n = f(x^(n)) − f(x*). The cumulative regret after N function evaluations is the sum of the instantaneous regrets: R_N = Σ_{n=1}^{N} r_n. A desirable asymptotic property of an algorithm is to be no-regret: lim_{N→∞} R_N / N = 0. Bounds on the average regret translate, by bounding R_N by a quantity sublinear in N, into convergence rates: the best value f(x⁺) = min_{n ≤ N} f(x^(n)) found in the first N function evaluations is no further from f(x*) than the average regret. Therefore, f(x⁺) − f(x*) → 0 as N → ∞, and so a no-regret algorithm will converge to a subset of the global minimizers.
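The LCB rule and the regret bookkeeping above can be sketched as follows; the posterior values, the β settings and the evaluation sequence are illustrative, with the optimum value assumed to be f(x*) = 0.

```python
# Sketch of the LCB acquisition and of instantaneous/cumulative regret
# on toy posterior values; f(x*) = 0 is an assumed known optimum.
import numpy as np

def lcb(mu, sigma, beta):
    return mu - beta * sigma

mu = np.array([0.8, 0.3, 0.6])
sigma = np.array([0.05, 0.1, 0.4])

# beta = 0: pure exploitation, pick the lowest posterior mean
pick_exploit = np.argmin(lcb(mu, sigma, beta=0.0))
# large beta: exploration, the high-variance candidate can win
pick_explore = np.argmin(lcb(mu, sigma, beta=3.0))

# regret bookkeeping for a sequence of evaluated values, with f(x*) = 0.0
f_evals = np.array([0.9, 0.4, 0.2, 0.25])
inst_regret = f_evals - 0.0
cum_regret = np.cumsum(inst_regret)
best_seen = np.minimum.accumulate(f_evals)
# best_seen[-1] is never further from f(x*) than the average regret
```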
A wide analysis of the convergence rate of in the case of Matérn kernel, for different values of d and , is given in (Vakili et al. 2020).
The following figure shows how the selected points change depending on β. From the perspective of BO, a particularly interesting bandit problem is the kernelized continuum-armed bandit problem (Srinivas et al. 2010). Here f is assumed to be in the closure of the functions on X expressible as a linear combination of a feature embedding parametrised by a kernel k. The properties of the functions in the resulting space, referred to as the RKHS of k, are determined by the choice of the kernel. For a SE kernel, the RKHS contains only infinitely differentiable functions. The Matérn kernel is parametrised by a smoothness parameter ν: for a given ν, the Matérn RKHS contains functions differentiable up to an order determined by ν.
The optimization of the acquisition function leads to the next location to be queried, x^(n+1), and, consequently, to a sequence of locations {x^(1), …, x^(N)} generated over the BO process, with N the overall number of function evaluations at the end of the process.
In this paper we use the Lower Confidence Bound, largely adopted in GP-based BO and endowed with a convergence proof under an appropriate scheduling of the internal parameter β^(n) (Srinivas et al. 2012), which balances between exploration and exploitation.
where the superscript related to the current iteration has been included to highlight that the value of β changes over the BO iterations, as do the conditioned GP's mean and standard deviation. The Confidence Bound has been successfully applied in MISO, such as in (Kandasamy et al. 2016). (Wilson et al. 2018) point out that the acquisition function may have large flat regions which, in particular in high-dimensional spaces, make its optimization problematic; they propose a Monte Carlo evaluation of the acquisition function amenable to gradient-based optimization and identify a family of acquisition functions, including EI and UCB, whose characteristics allow the use of greedy approaches for their maximization. A specific problem in MISO is related to the acquisition function: a direct translation of the popular Expected Improvement can result in a zero improvement on the cheaper sources, leading to querying only the highest-fidelity source.

The proposed Multi-Information Source Optimization - Augmented Gaussian Process (MISO-AGP)

Augmented GP
The MISO approach proposed in this paper is based on the idea of training a GP on a "reliable" subset of all the function evaluations performed so far over all the information sources. We refer to this GP as Augmented Gaussian Process (AGP) and consequently named our approach MISO-AGP. The term "augmented" is used to highlight that the set of function evaluations to train the AGP starts from those performed on the most expensive source and then it is "augmented" by selecting evaluations performed on some other source. Before explaining how the selection process is performed, we introduce some useful notations.
The model discrepancy between two GPs is defined, at a location x, as the absolute difference of their predictions:

η(x, GP, GP′) = |μ(x) − μ′(x)|

with μ(x) and μ′(x) the mean functions of the two GPs. It is also important to note that η(x, GP, GP′) depends on x. Indeed, in MISO we do not know a priori the fidelity of each source, and it may not be constant over X.
Assume that f(x) can be queried at the highest cost, that is f(x) = f_1(x). Thus, the set of evaluations used to train the AGP consists of D_1 "augmented" by the evaluations on the cheaper sources whose discrepancy with respect to GP_1 falls within its predictive uncertainty:

D̃ = {(x̄, y) ∈ D_s, s = 2, …, S : η(x̄, GP_1, GP_s) < m σ_1(x̄)}    (11)

with m a technical parameter of the MISO-AGP algorithm. We used m = 1 (i.e., around 68% of normally distributed observations are in the interval mean ± standard deviation). Thus, function evaluations on cheaper sources, having a discrepancy lower than the threshold given in (11), are considered "reliable" enough to be merged with those collected on the most expensive source. Let D̂ denote the augmented set of function evaluations, such that D̂ = D_1 ∪ D̃; the AGP is trained on D̂, leading to μ̂(x) and σ̂(x), computed according to (6)-(7). An example is reported in Fig. 1.
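The reliability filter above can be sketched as follows; the GP predictions and the evaluation locations are toy values standing in for the fitted GP_1 and GP_s, and the function name is illustrative.

```python
# Sketch of the AGP "reliable evaluation" selection rule: an evaluation from a
# cheap source is kept when the model discrepancy at its location is within
# m posterior standard deviations of the high-fidelity GP (m = 1 by default).
import numpy as np

def select_reliable(x_cheap, y_cheap, mu1, sigma1, mu_s, m=1.0):
    """Keep cheap-source evaluations with |mu1(x) - mu_s(x)| < m * sigma1(x)."""
    mask = np.abs(mu1 - mu_s) < m * sigma1
    return x_cheap[mask], y_cheap[mask]

# toy predictions at three cheap-source evaluation locations
x_cheap = np.array([0.1, 0.5, 0.9])
y_cheap = np.array([1.0, 2.1, 3.5])
mu1 = np.array([1.1, 2.0, 2.0])      # high-fidelity GP mean
sigma1 = np.array([0.3, 0.2, 0.2])   # high-fidelity GP standard deviation
mu_s = np.array([1.0, 2.1, 3.4])     # cheap-source GP mean

xr, yr = select_reliable(x_cheap, y_cheap, mu1, sigma1, mu_s)
# only the first two locations pass the reliability threshold and are
# merged with the evaluations on the most expensive source
```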

Acquisition function in MISO-AGP algorithm
Following the training of the AGP, an acquisition function must be used to choose the next source-location pair to query, that is (s′, x′). We consider the framework of U/LCB:

(s′, x′) = argmax_{s=1,…,S; x ∈ X} [σ̂(x) √(β^(n)) − μ̂(x)] / [c_s (1 + η(x, AGP, GP_s))]    (12)

where n is the number of function evaluations in D̂. The numerator is the opposite of the AGP's LCB (indeed, we are maximizing (12)), penalized by the cost of the source and by the model discrepancy between the AGP and GP_s at the location x.
There is the chance that x′ could be too close to some previous function evaluation on source s′. This behaviour arises when BO is converging to a (local/global) optimum and leads to a well-known instability issue in GP training, that is ill-conditioning in the inversion of the matrix [K + λ²I]. This instability issue occurs even more frequently and quickly in the noise-free setting (i.e., λ = 0). To avoid this undesired behaviour (leading to wasted evaluations without obtaining any improvement and/or risking running into the instability issue) we introduce the following correction.
If x′ is closer than δ to a previous evaluation on source s′, the pair is replaced by querying the most expensive source where the AGP's uncertainty is largest:

(s′, x′) ← (1, argmax_{x ∈ X} σ̂(x))    (13)

with δ > 0 the second technical parameter of MISO-AGP. In other words, we set the acceptable level of approximation, δ, in locating the optimizer and, in the case that x′ is closer than δ to another evaluation on s′, we prefer to "spend our budget" on reducing the uncertainty on the most expensive source.
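The selection step and the distance-based correction can be sketched over a discrete candidate grid; the score in the code follows the cost- and discrepancy-penalized negative LCB described above, and every quantity, parameter value and function name is an illustrative assumption.

```python
# Sketch of the MISO-AGP selection step: maximize a cost- and
# discrepancy-penalized negative LCB over candidates, then apply the
# distance-based correction when the winner is too close to a past evaluation.
import numpy as np

def miso_agp_select(cands, mu_hat, sigma_hat, costs, discrepancy,
                    beta=3.0, delta=0.05, evaluated=None):
    """
    cands: candidate locations, shape (m,)
    mu_hat, sigma_hat: AGP posterior at the candidates, shape (m,)
    costs: per-source costs, shape (S,), source 0 being the most expensive
    discrepancy: eta(x, AGP, GP_s) per source and candidate, shape (S, m)
    evaluated: dict mapping a source index to its already-evaluated locations
    """
    evaluated = evaluated or {}
    # score: (sigma * sqrt(beta) - mu) / (cost * (1 + discrepancy))
    score = (np.sqrt(beta) * sigma_hat - mu_hat)[None, :] / (
        costs[:, None] * (1.0 + discrepancy))
    s_best, i_best = np.unravel_index(np.argmax(score), score.shape)
    x_best = cands[i_best]
    # correction: if too close to a previous evaluation on that source,
    # query the most expensive source where the AGP uncertainty is largest
    prev = np.asarray(evaluated.get(s_best, []))
    if prev.size and np.min(np.abs(prev - x_best)) < delta:
        return 0, cands[np.argmax(sigma_hat)]
    return s_best, x_best

cands = np.array([0.0, 0.5, 1.0])
mu_hat = np.array([0.5, 0.2, 0.4])
sigma_hat = np.array([0.1, 0.3, 0.2])
costs = np.array([10.0, 1.0])                        # source 0 is expensive
disc = np.array([[0.0, 0.0, 0.0], [0.1, 0.1, 2.0]])  # per-source discrepancy

s, x = miso_agp_select(cands, mu_hat, sigma_hat, costs, disc,
                       evaluated={1: np.array([0.9])})
# the cheap source is queried at the low-mean, high-uncertainty candidate
```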

C-SVC with RBF kernel
To validate our MISO-AGP approach we designed an HPO task whose goal is to optimally and efficiently tune the hyperparameters of a Support Vector Machine (SVM) classifier on a large dataset. More precisely, we consider a C-SVC with a Radial Basis Function (RBF) kernel and the "MAGIC Gamma Telescope" dataset 1 .
We chose the C-SVC due to its relative inefficiency on large datasets: the computational complexity for training a C-SVC, for a given hyperparameter configuration, is cubic in the number of instances. The C-SVC's hyperparameters to optimize are the regularization term C and the γ in the RBF kernel: k(x, x′) = exp(−γ‖x − x′‖²). The MAGIC dataset is generated by a Monte Carlo program (Heck et al. 1998) simulating the registration of high-energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique. The overall dataset consists of 19'020 instances: 12'332 of the class "gamma (signal)" and 6'688 of the class "hadron (background)", with each instance represented by 10 continuous features. We performed a pre-processing step scaling all the dataset features into [0, 1].

MISO-AGP setting
Following the notation used in this paper, MISO-AGP will be used to minimize f(x), that is the misclassification error of a C-SVC, computed via 10-fold cross-validation on the MAGIC dataset. The search space X is 2-dimensional and box-bounded, spanned by the two C-SVC hyperparameters C ∈ [10⁻², 10²] and γ ∈ [10⁻⁴, 10⁴]. We adopt a logarithmic scaling of the search space, a usual procedure suggested in AutoML for hyperparameters varying within ranges of this scale. We have defined two different sources: the first provides the misclassification error obtained via 10-fold cross-validation of a C-SVC configuration using the entire MAGIC dataset (i.e., f_1(x) = f(x)). The second (i.e., f_2(x)) performs the same computation but using a smaller portion of the data (just 5%, obtained through stratified sampling).
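The two information sources can be sketched as follows; a small synthetic dataset stands in for MAGIC, and the function and variable names are illustrative, not the authors' code.

```python
# Sketch of the two information sources: 10-fold CV misclassification error of
# an RBF C-SVC on the full data (expensive source) vs. on a 5% stratified
# subsample (cheap source). A synthetic dataset stands in for MAGIC here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

def source_error(C, gamma, X, y, fraction=1.0, seed=0):
    """Misclassification error via 10-fold CV, optionally on a stratified subsample."""
    if fraction < 1.0:
        X, _, y, _ = train_test_split(
            X, y, train_size=fraction, stratify=y, random_state=seed)
    clf = SVC(C=C, gamma=gamma, kernel="rbf")
    return 1.0 - cross_val_score(clf, X, y, cv=10).mean()

f1 = lambda C, gamma: source_error(C, gamma, X, y)                 # expensive source
f2 = lambda C, gamma: source_error(C, gamma, X, y, fraction=0.05)  # cheap source
```

Querying f2 touches only 5% of the instances, which, given the cubic training cost of the C-SVC, is what makes the cheap source so much faster than f1.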
The energy required to perform 10-fold cross-validation is basically associated with the computational time, which we consider as a proxy for the sources' costs. Since the computational time can also depend on the values of the C-SVC's hyperparameters, we ran a sample of 10 hyperparameter configurations on both sources and used the average computational times to estimate reference values for c_1 and c_2. More precisely, the computational time required by f_1(x) is, on average, 320 times that required by f_2(x). Thus, we set c_2 = 1 and, consequently, c_1 = 320.
The kernel used to model the covariance function, for all the GPs including the AGP, is the Squared Exponential kernel, whose hyperparameters are set via Maximum Log-likelihood Estimation during the GP training. The acquisition function (12) and, when needed, the correction (13) are both optimized via L-BFGS.
As initialization, 3 hyperparameter configurations are sampled in X via Latin Hypercube Sampling. Then, 30 further function evaluations are used by MISO-AGP to optimize over the sources. We decided not to set a limit on the cumulated cost but to use this value to make considerations on the efficiency of the proposed approach with respect to BO applied only to the most expensive source. To mitigate the effect of initial randomness, 10 different runs of MISO-AGP and BO have been performed and compared; at each run, the two approaches share the same initialization.
As a metric, we consider the best function value observed so far. It is usually named "best seen" in BO and, since we are considering the minimization of the misclassification error, is simply defined as y⁺(n) = min_{i=1,…,n} {y^(1), …, y^(n)}. However, this definition is no longer valid in the case of the AGP. Suppose that, at a certain iteration, a function evaluation on a cheaper source is selected to fit the AGP and that it corresponds to the best seen up to that iteration. At the next iteration, it could be not selected and, consequently, it can no longer be considered the best seen. More formally, let ŷ⁺(n) denote the "augmented best seen", that is the minimum over the n̂ evaluations in D̂, with n̂ < n because only a subset of the evaluations on all the sources is used to train the AGP. In the case that ŷ⁺(n−1) no longer belongs to the augmented set, ŷ⁺(n) may be larger than, equal to, or smaller than ŷ⁺(n−1); in other terms, contrary to the common "best seen", the "augmented best seen" may not be monotone over the function evaluations.
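The non-monotonicity of the augmented best seen can be sketched on toy data: the minimum is taken only over the evaluations currently selected to fit the AGP, so dropping an evaluation as unreliable can increase the metric. The values and masks below are illustrative.

```python
# Sketch of the "augmented best seen": per iteration, the minimum over only
# the evaluations currently in the augmented set D-hat.
import numpy as np

def augmented_best_seen(y_all, selected_masks):
    """y_all: all observed values; selected_masks: per-iteration boolean masks
    marking which evaluations are in the augmented set at that iteration."""
    return np.array([np.min(y_all[mask]) for mask in selected_masks])

y_all = np.array([0.30, 0.12, 0.25, 0.20])
masks = [
    np.array([True, True, False, False]),   # cheap-source value 0.12 selected
    np.array([True, False, True, True]),    # 0.12 later dropped as unreliable
]
abs_seen = augmented_best_seen(y_all, masks)
# -> [0.12, 0.20]: the augmented best seen increased between iterations
```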

Results
The following figure (Fig. 2) summarizes the results of the study. The best value of the misclassification error is reported with respect to the cost cumulated over the MISO-AGP and BO iterations, separately. Solid lines represent the mean over the 10 independent runs, while shaded areas represent the standard deviations. As a reference value, we have considered the best misclassification error registered, on the entire MAGIC dataset, over all the experiments performed (green dashed line). The cumulated costs (which are the actual ones, and not the nominal c_1 and c_2 used in the acquisition function) are also averaged over the 10 independent runs. The MISO-AGP approach proved to be both more effective and more efficient than traditional BO: the identified hyperparameter configurations are associated with a lower misclassification error, obtained within less than 1/3 of the time required by BO. On average, 60% of the function evaluations were performed on the cheaper source. Thus, MISO-AGP intelligently exploited the cheaper information source, thanks to the proposed AGP, leading to an energy-efficient and green HPO task.

Conclusions
The GP framework can be extended to deal with multiple information sources. Relations among the sources are captured by a simplified and computationally cheap discrepancy measure, which enables a sparsification strategy used to select "reliable" evaluations to fit the proposed AGP. MISO-AGP has been empirically shown to solve a real HPO task effectively while significantly reducing computational time and, consequently, energy usage.