Fair and Green Hyperparameter Optimization via Multi-objective and Multiple Information Source Bayesian Optimization

There is a consensus that focusing only on accuracy in searching for optimal machine learning models amplifies biases contained in the data, leading to unfair predictions and decision supports. Recently, multi-objective hyperparameter optimization has been proposed to search for machine learning models which offer equally Pareto-efficient trade-offs between accuracy and fairness. Although these approaches proved to be more versatile than fairness-aware machine learning algorithms -- which optimize accuracy constrained to some threshold on fairness -- they could drastically increase the energy consumption in the case of large datasets. In this paper we propose FanG-HPO, a Fair and Green Hyperparameter Optimization (HPO) approach based on both multi-objective and multiple information source Bayesian optimization. FanG-HPO uses subsets of the large dataset (aka information sources) to obtain cheap approximations of both accuracy and fairness, and multi-objective Bayesian Optimization to efficiently identify Pareto-efficient machine learning models. Experiments consider two benchmark (fairness) datasets and two machine learning algorithms (XGBoost and Multi-Layer Perceptron), and provide an assessment of FanG-HPO against both fairness-aware machine learning algorithms and hyperparameter optimization via a multi-objective single-source optimization algorithm in BoTorch, a state-of-the-art platform for Bayesian Optimization.


Introduction

Rationale and motivations
A low misclassification/prediction error is not the only performance metric of interest when searching for the most suitable Machine Learning (ML) model for a successful decision support application. Additional metrics such as fairness, interpretability, and privacy have become increasingly important in recent years (Barocas et al., 2017). This paper focuses on fairness, a desired property of the decision support provided by an ML model: it must not be "biased" towards a specific person or group of individuals. The topic is widely known as FairML (Mehrabi et al., 2021), with approaches that can be organized into three families: (i) post-processing, which modifies a pre-trained model to increase the fairness of its outcomes, (ii) in-processing, which enforces fairness constraints during training, and (iii) pre-processing, which modifies the data representation and then applies standard ML algorithms (Friedler et al., 2019).
These approaches concern the design of fairness-aware (or fair-by-design) ML algorithms, but they suffer from one or more of the following drawbacks (Perrone et al., 2021): the intervention performed to deal with biases (i) is specific to the model class (e.g., linear models only), (ii) is limited to a specific definition of fairness, (iii) is limited to a single, binary sensitive feature, (iv) requires access to sensitive feature information at prediction time, and (v) results in a randomized classifier that may generate different predictions for the same input at different times.
These considerations have recently led to a new proposition: instead of designing fairness-aware ML algorithms, FairML can be addressed as a hyperparameter optimization (HPO) task which also considers fairness as a metric. Indeed, HPO of an ML algorithm driven only by misclassification/prediction error minimization can amplify biases contained in the dataset, leading to unfair decision support (Barocas et al., 2017; Buolamwini & Gebru, 2018). Recently, two types of approaches have been proposed to perform HPO for FairML: (i) constrained optimization, aimed at maximizing accuracy while satisfying some given fairness constraint (Perrone et al., 2020; 2021), and (ii) multi-objective optimization, aimed at simultaneously maximizing accuracy and some fairness metric (Schmucker et al., 2020).
It is important to remark that the first approach may not be applicable in many real-life settings, because suitable fairness constraints can be difficult to establish a priori. For this reason, we have decided to focus our research on the multi-objective optimization approach.
In addition to fairness, this paper also addresses the energy efficiency of an HPO task. Nowadays, it has become crucial to consider the dual role of Artificial Intelligence (AI) and ML in the climate crisis. On the one hand, they can support more sustainable and low-emission decisions, from design to management of critical systems such as smart energy grids, transportation, healthcare and water utilities, and they can also provide accurate climate change predictions. On the other hand, AI and ML are themselves energy-hungry and, consequently, significant emitters of CO2, leading to the concept of Red-AI (Dhar, 2020). The prevalence of Red-AI is also quantified in (Schwartz et al., 2020), reporting that the total cost of producing accurate ML models increases linearly with (i) the cost of executing the model on a single example, (ii) the size of the training dataset and (iii) the number of HPO experiments, which controls how many times the model is trained on the dataset. Striking results are also reported in (Strubell et al., 2019; Hao, 2019), which analysed the training process of many Natural Language Processing (NLP) models to estimate the energy cost in kilowatts required. When these figures are converted into approximate carbon emissions, it turns out that the carbon footprint of training a single large NLP model is equal to the amount of CO2 emitted by 125 round-trip flights between New York and Beijing or, equivalently, by five average American cars over their lifetimes, including their manufacturing processes. Consequently, the research community has been focusing on the Green-AI topic and has started to propose novel optimization techniques to make the HPO task "greener" (Tornede et al., 2021), for instance by using smaller portions of the available databases/datasets, from the seminal work of (Swersky et al., 2013) to more recent ones such as (Klein et al., 2017; Candelieri et al., 2021).

Contributions
The main contributions of this paper are:
1. A new Fair and Green Hyperparameter Optimization algorithm, namely FanG-HPO, based on both multi-objective and multiple information source Bayesian Optimization, which offers more flexibility and better computational performance on HPO problems.
2. A computational assessment of FanG-HPO against both fairness-aware ML algorithms (Scutari et al., 2021) and fair-HPO performed by using BoTorch (Balandat et al., 2020), a state-of-the-art platform for BO.
The rest of the paper is organized as follows: Sect. 2 provides the main background on multi-objective and multiple information source optimization, along with the definition of the fairness metric adopted in the study. In Sect. 3 the FanG-HPO approach is detailed. Sect. 4 describes the experimental setting and reports results. Finally, Sect. 5 provides conclusions and perspectives.

Related works
A recent BO-related study targeting fairness in ML is (Schmucker et al., 2020), whose proposed approach combines multi-objective and multi-fidelity optimization by building upon Hyperband (Li et al., 2017). Unfortunately the code is not currently available: an implementation was under review to be included in the AutoGluon suite, but it has not been included yet. Other relevant works are (Perrone et al., 2021; 2020), where the fairness requirement is modelled as a constraint, instead of an objective, in single-objective (i.e., accuracy maximization) HPO.
Finally, although BO has been extended to deal with multiple objectives (Svenson & Santner, 2016; Feliot et al., 2017; Yang et al., 2019; Iqbal et al., 2020; Daulton et al., 2020) as well as multiple fidelities and multiple information sources (Lam et al., 2015; Poloczek et al., 2017; Ghoreishi & Allaire, 2019; Candelieri & Archetti, 2021b;a; Ariafar et al., 2021), there is a significant lack of solutions jointly addressing the two tasks. On the other hand, research interest in this specific challenge is quickly increasing, especially because of its applicability to many real-life problems other than fair and green ML, as demonstrated by very recent works such as (Sun et al., 2022) and (Irshad et al., 2021).

Background
This paper addresses FairML as a multi-objective problem and, at the same time, uses multiple information sources (i.e., small portions of the large target dataset) to improve energy efficiency and, consequently, reduce CO2 emissions.
Here, we briefly summarize the basic background about multi-objective optimization, fairness metrics and multiple information source optimization.

Multi-objective optimization
Multi-objective optimization (MO) concerns solving problems with more than one objective function to be optimized simultaneously, that is:

$$\min_{x \in \Omega} f(x) = \big(f_1(x), \ldots, f_M(x)\big) \tag{1}$$

where $\Omega \subset \mathbb{R}^d$ is the search space, typically box-bounded, and $f : \Omega \to \mathbb{R}^M$ is the vector-valued function of the multiple objectives. In MO, due to the conflicting nature of the objectives, there is no unique solution $x^* \in \Omega$ to problem (1). The final aim is to identify a set of equally efficient trade-offs among the objectives. This set of efficient trade-offs can be depicted within the space spanned by the $M$ conflicting objectives, allowing for drawing the so-called Pareto front (aka frontier or boundary). The associated set of solutions, in the search space, is instead known as the Pareto set. An example is shown in Appendix A.1.
Formally, the Pareto set consists of only dominant (aka non-dominated) solutions, where a solution $x$ is said to dominate another solution $x'$ if their objectives, respectively $f(x)$ and $f(x')$, satisfy the following two conditions:

$$f_i(x) \le f_i(x') \quad \forall\, i \in \{1, \ldots, M\} \tag{2}$$

$$\exists\, j \in \{1, \ldots, M\} : f_j(x) < f_j(x') \tag{3}$$

Equation (2) means that $x$ is not worse than $x'$ in all the objectives, and equation (3) means that $x$ is strictly better than $x'$ in at least one objective. The Pareto dominance symbol, $\prec$, is used to synthesize (2-3): $x \prec x'$ if and only if both (2) and (3) hold.

If the objectives are black-box, their values can only be known point-wise by querying $f(x)$ at specific locations.
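The dominance conditions (2-3) translate directly into code. The following is a minimal sketch, assuming all objectives are to be minimized and outcomes are stored as tuples; the function names `dominates` and `pareto_front` are illustrative and do not come from the FanG-HPO implementation.

```python
def dominates(a, b):
    """True if outcome a Pareto-dominates outcome b (minimization).

    a is not worse than b in every objective (condition 2) and
    strictly better in at least one objective (condition 3).
    """
    not_worse = all(ai <= bi for ai, bi in zip(a, b))
    strictly_better = any(ai < bi for ai, bi in zip(a, b))
    return not_worse and strictly_better


def pareto_front(outcomes):
    """Keep only the outcomes that no other outcome dominates."""
    return [p for p in outcomes if not any(dominates(q, p) for q in outcomes)]


# Hypothetical (MCE, DSP) pairs, used only to exercise the functions.
observed = [(0.2, 0.6), (0.4, 0.3), (0.5, 0.7), (0.3, 0.4)]
front = pareto_front(observed)
```

Here `(0.5, 0.7)` is discarded because `(0.2, 0.6)` is better in both objectives, while the remaining three outcomes are mutually non-dominated.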
Given all the queries performed so far, the set of non-dominated solutions (respectively, outcomes) is the current approximation of the Pareto set (respectively, front), that is, the set of the currently dominant solutions (respectively, outcomes). If the objectives are also expensive to evaluate, in terms of time or resources, then problem (1) must be solved efficiently, meaning that a good Pareto front/set approximation has to be found within a limited number of queries. Thus, the sample-efficiency of BO was the driver of its successful extension to the MO setting (i.e., MOBO), mainly along three different strategies:
• Scalarization, which maps the vector of all objectives into a scalar parametrized function whose optimizer, computed by a single-objective method, can span, as the parameters vary, the whole Pareto set (Paria et al., 2020; Zhang & Golovin, 2020). The key drawback of scalarization is that it does not consider the geometry of the Pareto front approximation.
• Maximization of some index related to the quality of the Pareto front approximation. A common choice is the dominated hypervolume indicator, that is, the volume of the region dominated by a Pareto front approximation. Based on this, the hypervolume improvement is commonly used in multi-objective optimization.
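To make the scalarization strategy concrete, the sketch below implements the augmented Chebyshev scalarization introduced with the original ParEGO algorithm (Knowles, 2006). It assumes objectives already scaled to [0, 1] (as MCE and DSP are) and non-negative weights summing to one; the weight vectors and the value rho = 0.05 are illustrative assumptions, not settings taken from this paper's experiments.

```python
def augmented_chebyshev(objectives, weights, rho=0.05):
    """Map a vector of objectives (to be minimized) into a single scalar.

    The max term drives the search towards the trade-off selected by
    `weights`; the small weighted-sum term (scaled by rho) breaks ties.
    """
    weighted = [w * f for w, f in zip(weights, objectives)]
    return max(weighted) + rho * sum(weighted)


# Hypothetical (MCE, DSP) outcome, scalarized under two weight vectors.
outcome = (0.2, 0.4)
balanced = augmented_chebyshev(outcome, (0.5, 0.5))
accuracy_focused = augmented_chebyshev(outcome, (0.9, 0.1))
```

Varying the weight vector across iterations is what lets a single-objective optimizer visit different regions of the Pareto front.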
This paper focuses on the first two strategies. It is important to remark that, while in "vanilla" BO a probabilistic surrogate model is used to approximate the black-box objective function, almost all the MOBO approaches adopt a probabilistic surrogate model for each one of the objectives, assuming independence among them. This is a reasonable assumption, because in MO the objectives should be competing. Exactly as in BO, every new query contributes to a better approximation of the objective function, which is vector-valued in MOBO, through the update of the probabilistic surrogate model. The second key component of BO is the acquisition function, which deals with the well-known exploration-exploitation dilemma. All the "vanilla" BO acquisition functions can be used in the case of scalarization, because the multi-objective problem is mapped into a single-objective one. On the contrary, Expected Hypervolume Improvement (EHVI) is an acquisition function specifically designed for vector-valued MOBO, basically extending the idea underlying the well-known Expected Improvement (EI) to the multi-objective setting.
In this paper we consider HPO of a classification model, with two different objectives to minimize: the misclassification error (MCE) and the unfairness metric known as Differential Statistical Parity (DSP), which is detailed in the next section. Both objectives are computed through stratified 10-fold cross validation, so they are black-box, expensive, multi-extremal, and possibly noisy (depending on the specific ML algorithm to be optimized or the cross-validation procedure).

Differential Statistical Parity as unfairness metric
There is no unique definition, and consequently no unique metric, of fairness (Verma & Rubin, 2018). Instead, different alternatives have been proposed depending on application domains and specific use cases. In this paper we consider the DSP, which has also been recently considered in (Schmucker et al., 2020). More specifically, we refer to the standard framework where $F_L$ denotes the true labels for the target feature, $F_S$ is the sensitive feature, and $\hat{F}_L$ denotes the predicted labels. Statistical Parity (SP) requires that positive predictions are unaffected by the value of the sensitive feature(s), independently of the actual label:

$$P(\hat{F}_L = 1 \mid F_S = 0) = P(\hat{F}_L = 1 \mid F_S = 1)$$

Finally, DSP is a measure of the violation of the above condition, and it is considered as a measure of unfairness.
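As a worked illustration, the sketch below computes a DSP value from predictions. It assumes a binary sensitive feature and takes DSP to be the absolute difference between the positive prediction rates of the two groups, which is one common way of quantifying the violation of statistical parity; the exact formula and the function name `dsp` are assumptions made for this example.

```python
def dsp(y_pred, sensitive):
    """Absolute gap in positive prediction rate between the two groups.

    y_pred: iterable of predicted labels in {0, 1}
    sensitive: iterable of sensitive feature values in {0, 1}
    Returns a value in [0, 1]; 0 means statistical parity holds exactly.
    """
    group0 = [p for p, s in zip(y_pred, sensitive) if s == 0]
    group1 = [p for p, s in zip(y_pred, sensitive) if s == 1]
    if not group0 or not group1:
        raise ValueError("both groups must contain at least one example")
    rate0 = sum(group0) / len(group0)
    rate1 = sum(group1) / len(group1)
    return abs(rate0 - rate1)


# Hypothetical predictions: 3 of 4 positives in group 0, 1 of 4 in group 1.
predictions = [1, 0, 1, 1, 0, 0, 1, 0]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
unfairness = dsp(predictions, groups)
```

With these toy inputs the two positive rates are 0.75 and 0.25, so the resulting unfairness is 0.5.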

Multiple information source optimization
Multiple Information Source Optimization (MISO) aims at searching for the global optimum of a black-box, expensive and multi-extremal function, namely the ground-truth, given the possibility to also query less expensive information sources which are its approximations. The final goal is to find an optimal solution for the ground-truth while satisfying some constraint on the query cost accumulated along the search process, basically by effectively and efficiently using the cheap information sources. MISO has been defined for single-objective problems: differently from the multi-objective setting, here the subscript is used to denote the information source, where $f_1(x)$ is the ground-truth and $f_s(x)$, with $s \in \{2, \ldots, S\}$, are the cheap sources.
The MISO problem can be formulated as:

$$\min_{x \in \Omega} f_1(x) \tag{4}$$

subject to:

$$\sum_{i=1}^{n} c_{s^{(i)}} \le C_{max} \tag{5}$$

where $\{(s^{(i)}, x^{(i)})\}_{i=1:n}$ denotes the ordered set of source-location pairs sequentially queried, $c_s$ is the cost for querying $f_s(x)$, and $C_{max}$ is the maximum query cost that can be accumulated along the sequential optimization process.
BO has also been successfully extended to deal with MISO problems, where each information source is individually modelled through a probabilistic surrogate model, usually a Gaussian Process (GP), fitted on the queries performed on that source. Then, all the individual models are combined into a single one, which is used to drive the choice of the next promising source-location pair to query, such as in (Ghoreishi & Allaire, 2019; Candelieri & Archetti, 2021b).

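FanG-HPO
The proposed fair and green HPO task is performed by solving the problem (4-5), but with the scalar objective function replaced by a vector-valued one, that is:

$$\min_{x \in \Omega} f_1(x), \qquad f_1 : \Omega \to \mathbb{R}^M$$

subject to:

$$\sum_{i=1}^{n} c_{s^{(i)}} \le C_{max}$$

where $s^{(i)}$ is the source queried at the $i$-th iteration, $f_1$ is the ground-truth, while all the other cheaper information sources are $f_s$, with $s \in \{2, \ldots, S\}$.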

Modelling objectives and information sources
In FanG-HPO, both objectives and information sources are modelled independently via GP regression (Williams & Rasmussen, 2006;Gramacy, 2020).A GP is a probabilistic regression model whose predictive mean, µ(x), and uncertainty, σ(x), are conditioned on previous observations.A brief introduction to GP regression is provided in the Appendix A.2.
Thus, at a generic iteration, FanG-HPO fits $S \times M$ GP models, leading to the following set of predictive mean and uncertainty functions:

$$\{\mu_{sm}(x), \sigma_{sm}(x)\}_{s=1:S,\; m=1:M}$$

Then, a single GP model is fitted, for each objective, by combining the GPs individually modelling that objective on every information source. In FanG-HPO this operation is performed by following the Augmented Gaussian Process (AGP) approach recently proposed in (Candelieri & Archetti, 2021b). More precisely, a set of indices identifying "reliable" observations from cheaper sources is computed for each source-objective pair, where "reliable" means that they are not too discrepant with respect to the ground-truth:

$$I_{sm} = \big\{ i \in \{1, \ldots, n_s\} : |\mu_{1m}(x_s^{(i)}) - y_{sm}^{(i)}| < \alpha\, \sigma_{1m}(x_s^{(i)}) \big\}, \quad s \in \{2, \ldots, S\}$$

where $x_s^{(i)}$ is the $i$-th location queried on source $s$, $y_{sm}^{(i)}$ is the associated value observed for objective $m$, and $\alpha$ is a technical parameter to tune the reliability of observations from cheap information sources (in (Candelieri & Archetti, 2021b) the suggested value is $\alpha = 1$).
Then, the observations on the ground-truth are "augmented" with those identified by $I_{sm}$, separately for each objective $m \in \{1, \ldots, M\}$:

$$\hat{X}_m = X_1 \cup \bigcup_{s=2}^{S} \{ x_s^{(i)} : i \in I_{sm} \}, \qquad \hat{Y}_m = Y_1^{[m]} \cup \bigcup_{s=2}^{S} \{ y_{sm}^{(i)} : i \in I_{sm} \}$$

where $X_s$ are the locations queried on source $s$ and $Y_s^{[m]}$ are the values observed for the objective $m$ and associated to the set $X_s$ (i.e., the symbol $[m]$ is the operator selecting only the column $m$ of the $n_s \times M$ matrix $Y_s$, with $n_s$ the number of queries performed on the source $s$), with $x_s^{(i)}$ and $y_{sm}^{(i)}$ denoting the $i$-th entries of $X_s$ and $Y_s^{[m]}$.
Finally, FanG-HPO fits $M$ independent AGPs, with predictive means and uncertainties respectively denoted with $\hat{\mu}_m(x)$ and $\hat{\sigma}_m(x)$, both conditioned on the augmented sets $\hat{X}_m$, $\hat{Y}_m$.
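The augmentation step can be sketched for a single objective and a single cheap source as follows. This is only an illustration under stated assumptions: the reliability test is taken to be "absolute gap between the ground-truth GP prediction and the cheap observation below alpha times the ground-truth GP uncertainty", the GP predictions are passed in as plain lists rather than computed, and all names (`reliable_indices`, `augment`) are hypothetical.

```python
def reliable_indices(mu_ground, sigma_ground, y_cheap, alpha=1.0):
    """Indices of cheap-source observations close to the ground-truth model.

    mu_ground[i], sigma_ground[i]: ground-truth GP predictive mean and
    uncertainty at the i-th location queried on the cheap source.
    y_cheap[i]: value actually observed on the cheap source there.
    """
    return [
        i
        for i, (mu, sigma, y) in enumerate(zip(mu_ground, sigma_ground, y_cheap))
        if abs(mu - y) < alpha * sigma
    ]


def augment(x_ground, y_ground, x_cheap, y_cheap, indices):
    """Ground-truth observations plus the selected cheap observations."""
    x_aug = list(x_ground) + [x_cheap[i] for i in indices]
    y_aug = list(y_ground) + [y_cheap[i] for i in indices]
    return x_aug, y_aug


# Hypothetical values for one objective (e.g., MCE) and one cheap source.
mu_1 = [0.30, 0.25, 0.40]
sigma_1 = [0.05, 0.02, 0.10]
x_cheap, y_cheap = [[0.1], [0.5], [0.9]], [0.33, 0.30, 0.55]
idx = reliable_indices(mu_1, sigma_1, y_cheap, alpha=1.0)
x_aug, y_aug = augment([[0.2], [0.7]], [0.31, 0.28], x_cheap, y_cheap, idx)
```

Only the first cheap observation passes the test here, so the augmented set contains the two ground-truth observations plus that one; in the bi-objective setting the same procedure would be repeated per objective.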

Deriving the next query
The next source-location pair to query, namely $(s', x')$, is derived by solving a multi-objective problem whose objectives are approximated by the AGPs obtained as previously described. Having $M$ independent AGPs is in line with recent results in the literature: as reported in (Zhan et al., 2017), modelling each objective with an independent GP makes multi-objective optimization approaches easy to implement, while using dependent GP models, such as multi-output GPs, does not provide any relevant benefit over independent GPs (Svenson & Santner, 2016). More precisely, $(s', x')$ is obtained according to the following two-step procedure:

1. Selecting x'. First, the location $x'$ is selected depending on the well-known Expected Hypervolume Improvement (EHVI). Hypervolume Improvement (HVI) is defined as the relative increase in the hypervolume indicator when an outcome $y$, associated to a solution $x$, is added to the current Pareto front approximation. In BO, the HVI is a random variable because $y$ is a (set of) random variable(s) itself, and this leads to the EHVI:

$$x' = \arg\max_{x \in \Omega} \mathrm{EHVI}(x, P, r)$$

where $P$ is the current approximated Pareto front and $r$ is the reference point. In this paper $r$ is the worst point, with both MCE and DSP equal to 1. Although a closed formula for the EHVI exists (Feliot et al., 2017), it is expensive to calculate. In FanG-HPO the fast calculation proposed in (Zhao et al., 2018) is used, which extends to the EHVI computation the Walking Fish Group (WFG) technique (While et al., 2011), one of the fastest algorithms for calculating the hypervolume of a Pareto front approximation.
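For intuition, the sketch below computes the hypervolume and the HVI in the bi-objective minimization case with reference point (1, 1), and estimates the EHVI by plain Monte Carlo sampling from independent Gaussian predictive distributions. The Monte Carlo estimator is a deliberately simple stand-in and is not the WFG-based calculation used in FanG-HPO; the HVI is computed here as an absolute difference of hypervolumes, and all names and numbers are illustrative.

```python
import random


def hypervolume_2d(points, ref=(1.0, 1.0)):
    """Area dominated by `points` and bounded by `ref` (minimization)."""
    inside = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area, best_f2 = 0.0, ref[1]
    for f1, f2 in inside:  # sweep by increasing first objective
        if f2 < best_f2:   # otherwise the point is dominated
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


def hvi(y, front, ref=(1.0, 1.0)):
    """Hypervolume gained by adding outcome y to the current front."""
    return hypervolume_2d(list(front) + [y], ref) - hypervolume_2d(front, ref)


def mc_ehvi(mu, sigma, front, ref=(1.0, 1.0), n_samples=2000, seed=0):
    """Monte Carlo EHVI under independent Gaussian predictions per objective."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        y = tuple(rng.gauss(m, s) for m, s in zip(mu, sigma))
        total += hvi(y, front, ref)
    return total / n_samples


# Hypothetical front of (MCE, DSP) outcomes and a candidate prediction.
front = [(0.2, 0.6), (0.4, 0.3)]
hv = hypervolume_2d(front)
gain = hvi((0.3, 0.4), front)
expected_gain = mc_ehvi(mu=(0.3, 0.4), sigma=(0.05, 0.05), front=front)
```

In an acquisition loop, the estimate would be evaluated over candidate locations using each AGP's predictive mean and uncertainty, and the maximizer taken as $x'$.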
2. Selecting s'. Then, the information source $s'$ is selected according to both its query cost and its discrepancy with respect to the ground-truth at $x'$, over all the objectives (problem (12)).

Contrary to other recent approaches which propose to query the ground-truth on a regular basis, such as in (Khatamsaz et al., 2020), at each iteration FanG-HPO adaptively chooses among all the sources, including the ground-truth. However, to ensure a sufficient quality of the approximation provided by the AGPs, before solving (12) FanG-HPO checks whether the number of augmenting observations coming from any one of the cheap sources is larger than the number of observations from the ground-truth: in that case $s' = 1$ is selected, instead of solving (12).

Experimental setting
Experiments consider two benchmark datasets on fairness, ADULT and COMPAS, and two ML algorithms whose hyperparameters are optimized: a Multi-layer Perceptron (MLP) and XGBoost (XGB). The dimensionality of the search space is $d = 10$ for MLP and $d = 7$ for XGB. The search spaces are those used in (Schmucker et al., 2020) and reported in Appendix A.3.
The bi-objective ground-truth is given by the MCE and DSP, computed through stratified 10-fold cross validation, using the entire datasets. Since DSP is computed for every sensitive feature, we have decided to consider $\mathrm{DSP} = \max_{i \in F_S} \{\mathrm{DSP}_i\}$, where $F_S$ is the set of sensitive features and $\mathrm{DSP}_i$ is the feature-specific DSP value.
Querying the cheap information sources consists in computing the same metrics by using only half of the original dataset.
According to preliminary evaluations on a set of random configurations of the hyperparameters, the cost for querying the ground-truth is approximately twice that for querying the cheap information source, for all the dataset and ML algorithm pairs. Therefore, we set $c_1 = 2$ and $c_2 = 1$.
It is important to remark that only FanG-HPO exploits the two information sources, while the HPO task performed through BoTorch uses only the ground-truth. We have selected BoTorch as a baseline because it provides implementations of many state-of-the-art algorithms for vanilla BO, multi-objective BO, and multi-fidelity BO. Testing all of them is out of the scope of this paper. Thus, we have selected one of the most effective and efficient implementations for multi-objective (single-source) BO, that is ParEGO (Knowles, 2006) with q-Knowledge Gradient (qKG) as acquisition function (Wu & Frazier, 2016). Specifically, we have used $q = 5$ in our experiments. ParEGO is an extension of the (single-objective) Efficient Global Optimization (EGO) algorithm (Jones et al., 1998) to the MO setting, whose core consists in approximating the different objectives through independent GPs and using scalarization to reduce the multi-objective problem to a single-objective one. It is also important to remark that, like all the other available BO tools, BoTorch considers multi-objective and multi-fidelity optimization as two separate problems, so it does not provide any algorithm to jointly address them.
Moreover, although BoTorch provides implementations for multi-fidelity optimization, multi-fidelity is just a special case of multiple information source optimization; consequently, we have discarded these implementations because they are not well-suited for the comparison.
First, $2d$ hyperparameter configurations are randomly chosen to generate the initial designs for MLP and XGB, separately. For BoTorch, which queries only the ground-truth, this leads to an initialization cost of $2 d\, c_1$, that is 40 and 28 for MLP and XGB (independently of the dataset, in this study). The maximum accumulated query cost has been set to $C_{max} = 20d$, that is $C_{max} = 200$ and $C_{max} = 140$ for MLP and XGB, respectively.
To have a fair comparison between BoTorch and FanG-HPO, 13 and 9 hyperparameter configurations are sampled from the BoTorch initial designs, respectively for MLP and XGB, leading to an associated query cost of 26 and 18.
The remaining budgets for initialization (i.e., 40 − 26 = 14 and 28 − 18 = 10) are used to generate and evaluate initial designs for the cheap information source, that is, 14 hyperparameter configurations for MLP and 10 for XGB.
Finally, to mitigate the randomness due to (i) the generation of the initial design and (ii) the MLP and XGBoost learning algorithms, we have generated five initial designs for each ML algorithm and performed five independent runs for each one of them.
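The budget arithmetic above can be checked with a few lines of code. This is a small sketch using the costs $c_1 = 2$ and $c_2 = 1$ stated in the text; the function name `initial_budget` and the dictionary keys are illustrative.

```python
C_GROUND = 2  # cost of one ground-truth query (c_1)
C_CHEAP = 1   # cost of one cheap-source query (c_2)


def initial_budget(d, n_ground_fang):
    """Initialization costs for a search space of dimensionality d.

    n_ground_fang: ground-truth configurations kept by FanG-HPO out of
    the 2d configurations of the BoTorch initial design.
    """
    botorch_init_cost = 2 * d * C_GROUND
    fang_ground_cost = n_ground_fang * C_GROUND
    remaining = botorch_init_cost - fang_ground_cost
    return {
        "botorch_init_cost": botorch_init_cost,
        "fang_ground_cost": fang_ground_cost,
        "fang_cheap_configs": remaining // C_CHEAP,
        "max_cost": 20 * d,
    }


mlp = initial_budget(d=10, n_ground_fang=13)
xgb = initial_budget(d=7, n_ground_fang=9)
```

Both methods therefore start from the same initialization cost (40 for MLP, 28 for XGB), but FanG-HPO spreads it over more configurations by using the cheap source.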
The hypervolume (HV), with respect to the accumulated query cost, is the performance metric used to monitor the effectiveness of the two BO-based approaches.As already mentioned, the reference point is the worst one, that is MCE=1 and DSP=1.
To mitigate the effect of randomness, five independent runs have been performed for both the FairML algorithms.As usual, a constraint on the unfairness must be provided for these algorithms: we set this value to 0.1 (for each individual sensitive feature).
FanG-HPO was developed in R and integrates, through the reticulate R package, Python code (i.e., sklearn modules for MLP and XGBoost).FairML algorithms are from the fairml R package.

FANG-HPO VS BOTORCH-BASED HPO
This section summarizes the most relevant results of the study. Figure 1 compares FanG-HPO and BoTorch-based HPO according to the evolution of the HV, associated to the current Pareto front approximation, with respect to the query cost accumulated over the HPO process. First, two points must be remarked: (i) only observations on the ground-truth are used to compute the approximate Pareto front at each iteration (and consequently the associated HV), also for FanG-HPO, and (ii) the small initial difference between the HV of the two methods is due to the fact that FanG-HPO uses just a subsample of the initial design (on the ground-truth) of BoTorch (i.e., FanG-HPO must split the initial budget to initialize the GPs on both the ground-truth and the cheap information source).
As far as HPO of MLP is concerned, FanG-HPO outperforms BoTorch-based HPO. More specifically, given the same cost, but a different number of hyperparameter configurations evaluated, FanG-HPO identifies MLP models offering a better trade-off between MCE and DSP. Indeed, the HV, computed as the median over the 25 runs, is higher for FanG-HPO almost immediately in the case of the ADULT dataset and after half of the available budget in the case of COMPAS (i.e., accumulated query cost equal to $C_{max}/2 = 100$). This means that, given a desired accumulated query cost, which is a proxy for computational burden and, consequently, energy consumption, FanG-HPO identifies ML models which are more Pareto-efficient (i.e., higher HV) than those generated via BoTorch. Analogously, given a desired level of Pareto-efficiency (i.e., a HV value), FanG-HPO achieves it with a lower computational burden (i.e., accumulated query cost). Therefore, FanG-HPO is more effective and greener than BoTorch-based HPO.
As reported in Table 3, at the end of the optimization process the difference in terms of HV is statistically significant at a significance level of 0.05 (Mann-Whitney's U test, null hypothesis: final HVs are similar between the two approaches; alternative hypothesis: final HV for FanG-HPO is higher than for BoTorch).
Results are not equally encouraging when HPO of XGBoost is considered. A possible explanation is that XGBoost is an effective ML algorithm able to provide accurate and fair models by itself, so there are no significant differences in optimizing its hyperparameters through FanG-HPO or BoTorch-based HPO. This is also confirmed by the results in Table 3, where the slight difference between the two approaches, in terms of HV at the end of the optimization, is not statistically significant (Mann-Whitney's U test, null hypothesis: final HVs are similar between the two approaches; alternative hypothesis: final HV for BoTorch is higher than for FanG-HPO). In the case of the ADULT dataset, the HV of BoTorch-based HPO is, on median, higher than that of FanG-HPO. However, the difference quickly decreases with the accumulated query cost and becomes not statistically significant (Figure 1). In the case of the COMPAS dataset, the two approaches are comparable, with FanG-HPO slightly better, on median, when the accumulated query cost is approximately within the range [60, 110], and BoTorch slightly better for higher accumulated query costs. The initially large standard deviation for FanG-HPO (around 0.05) is due to the random subsampling of 9 (out of the 14) hyperparameter configurations from the initial design of BoTorch, performed as explained in the experimental setting. Usage of the cheap source depends on the run, basically on the initialization of the design.
Summarizing, FanG-HPO does not appear to underperform BoTorch-based HPO and, in some cases, can significantly outperform it. This result stems from how FanG-HPO exploits information from the cheap information source. In Figure 2 and Figure 3, we have reported the number of queries of FanG-HPO, on each information source and for each one of the 25 runs, separately for each ML algorithm and dataset pair. The queries related to the initial designs are not included in the counts.
Despite the encouraging results, only a larger set of experiments, performed also by other research groups and involving more datasets and ML algorithms, could definitively confirm these preliminary results.

COMPARISON AGAINST FAIRML ALGORITHMS
Finally, we present a Pareto analysis of the solutions provided by both FanG-HPO and BoTorch-based HPO against the two fairml algorithms zlrm and fgrrm. Separately for the two datasets and the two ML algorithms (i.e., MLP and XGBoost), we have retrieved the best Pareto fronts for FanG-HPO and BoTorch-based HPO (i.e., the ones with the largest HV over all the 25 independent runs). A separate chart has been obtained for each dataset, depicting also the MCE and DSP values provided by the five independent runs of zlrm and fgrrm. Pareto fronts are zoomed in for a better visualization, so the reference point (i.e., MCE=1 and DSP=1) is not in the charts.
Figure 4 shows the results for the COMPAS dataset. Important remarks are:
• XGBoost models dominate, in Pareto terms, most of the MLP models, irrespective of the adoption of FanG-HPO or BoTorch. However, some non-dominated MLP models offer a lower DSP (at the price of an increase in MCE);
• fairml models identified through zlrm and fgrrm dominate many MLP models but are completely dominated by XGBoost models;
• the Pareto fronts identified by FanG-HPO and BoTorch are quite similar, as well as the number of Pareto-efficient models. It is important to remark that solutions on the Pareto front refer to the ground-truth only, but FanG-HPO performs a lower number of queries on it.
Figure 5 shows results on ADULT. Important remarks are:
• XGB models identified by FanG-HPO dominate most of the other XGB and MLP models, as well as all the models identified through zlrm and fgrrm;
• MLP models identified by FanG-HPO dominate most of the MLP models identified by BoTorch-based HPO;
• zlrm and fgrrm models lie on the approximated Pareto front of FanG-HPO on MLP, and dominate a portion of that of BoTorch-based HPO.

Conclusions and perspectives
The proposed approach, namely FanG-HPO (Fair and Green hyperparameter optimization), aims at searching for accurate and fair ML models while using small portions of the large dataset to also reduce the computational time for training and validation and, consequently, energy consumption and possibly the associated CO2 emissions.
Preliminary results show that FanG-HPO does not underperform multi-objective (i.e., accuracy and fairness) HPO built on BoTorch. Furthermore, the capability to deal with multiple information sources (i.e., small portions of the large dataset) allows FanG-HPO to significantly improve efficiency in terms of accumulated query costs, compared to BoTorch. This means that an approximate Pareto front with a larger hypervolume can be identified with a lower accumulated computational time required to train and validate ML models. This was clearly observed in the case of HPO of MLP, on both the ADULT and COMPAS datasets.
Finally, depending on the ML algorithm to be optimized, both FanG-HPO and BoTorch-based HPO can identify more efficient models, in Pareto terms, than fairness-aware ML algorithms, specifically zlrm and fgrrm.
Ongoing and future works will focus on a larger set of experiments, including more ML algorithms and fairness datasets, and on investigating the possibility to also consider cost-aware optimization, recently proposed in (Lee et al., 2020; Candelieri & Archetti, 2021a; Luong et al., 2021), where the sources' query costs are not fixed but depend on the hyperparameter configuration to evaluate. Although the two-step acquisition function proposed in FanG-HPO should not require any awareness about location-dependent costs (i.e., after choosing $x'$ the query cost only depends on the size of the dataset underlying the information source), it could anyway be interesting to investigate this topic.

Software and Data
Both the R code and the data are available on request.

A. Appendix
A.1. Multi-objective optimization

A.2. Gaussian Process Regression
A Gaussian Process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution, and it is completely specified by its mean function, $\mu(x)$, and covariance function, $k(x, x')$. A GP is denoted with $GP\big(\mu(x), k(x, x')\big)$ (Williams & Rasmussen, 2006; Gramacy, 2020). Importantly, these two scalar-valued functions can be conditioned on a set of available observations, leading to a probabilistic regression model which can be used to make predictions at any location $x$, according to the so-called GP's predictive (aka posterior) mean and standard deviation. While the first represents the predicted value, the second represents the associated predictive uncertainty.
Consider $n$ performed queries, and denote with $X_{1:n} = \{x^{(i)}\}_{i=1:n}$ the set of queried locations and with $Y_{1:n} = \{y^{(i)}\}_{i=1:n}$ the associated observed outcomes, possibly noisy (i.e., $y^{(i)} = f(x^{(i)}) + \varepsilon^{(i)}$, where $\varepsilon^{(i)}$ is assumed to be zero-mean Gaussian noise, $\varepsilon^{(i)} \sim \mathcal{N}(0, \sigma_\varepsilon^2)$, $\forall\, i \in \{1, \ldots, n\}$). Then, the GP's predictive mean and variance, conditioned on the $n$ performed queries, are respectively computed as follows (written here in their standard form for a zero-mean prior):

$$\mu(x) = k(x, X_{1:n}) \big[K + \sigma_\varepsilon^2 I\big]^{-1} Y_{1:n}$$

$$\sigma^2(x) = k(x, x) - k(x, X_{1:n}) \big[K + \sigma_\varepsilon^2 I\big]^{-1} k(X_{1:n}, x)$$

where $K$ is the $n \times n$ matrix with entries $K_{ij} = k(x^{(i)}, x^{(j)})$ and $k(x, X_{1:n})$ is the vector with entries $k(x, x^{(i)})$.
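For illustration, the standard GP predictive mean and variance can be computed numerically as in the sketch below. It assumes a zero-mean prior, a one-dimensional input, and a squared exponential kernel with a fixed length scale; these choices, the small linear solver and all names are assumptions made for the example, and the kernel actually used in FanG-HPO is not specified in this section.

```python
import math


def sq_exp_kernel(a, b, length_scale=1.0):
    """Squared exponential covariance between two scalar locations."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def gp_posterior(x_new, x_obs, y_obs, noise_var=1e-6, length_scale=1.0):
    """Predictive mean and variance at x_new given the observations."""
    n = len(x_obs)
    gram = [
        [sq_exp_kernel(x_obs[i], x_obs[j], length_scale)
         + (noise_var if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    k_star = [sq_exp_kernel(x_new, xi, length_scale) for xi in x_obs]
    mean = sum(k * a for k, a in zip(k_star, solve(gram, y_obs)))
    reduction = sum(k * v for k, v in zip(k_star, solve(gram, k_star)))
    variance = sq_exp_kernel(x_new, x_new, length_scale) - reduction
    return mean, max(variance, 0.0)


# Hypothetical observations of one objective at three scalar locations.
x_obs, y_obs = [0.0, 1.0, 2.0], [0.30, 0.25, 0.40]
mean_at_obs, var_at_obs = gp_posterior(1.0, x_obs, y_obs)
mean_far, var_far = gp_posterior(100.0, x_obs, y_obs)
```

At an already queried location the prediction reproduces the observation with almost no uncertainty, whereas far from the data it falls back to the prior (mean 0, variance 1).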

Figure 1. Hypervolume (HV) with respect to query cost accumulated over the HPO process (median ± standard deviation over the 25 experiments: 5 initial designs times 5 repeated runs for each design). Comparison between FanG-HPO and BoTorch-based HPO.

Figure 2. Number of queries performed by FanG-HPO on each source for optimizing: (left) MLP on the ADULT dataset, and (right) MLP on the COMPAS dataset.

Figure 3. Number of queries performed by FanG-HPO on each source for optimizing: (left) XGBoost on the ADULT dataset, and (right) XGBoost on the COMPAS dataset.

Figure 4. Pareto analysis of the models identified by the different approaches on the COMPAS dataset.

Figure 5. Pareto analysis of the models identified by the different approaches on the ADULT dataset.

Figure 6 summarizes the main concepts of Pareto analysis in the multi-objective setting. For the sake of visualization, a two-objective problem, with a two-dimensional search space Ω, is considered. On the left-hand side the search space and the (unknown) Pareto set are depicted; on the right-hand side, the outcome space spanned by the two objectives is illustrated, along with the (unknown) feasible space (containing the outcomes associated to all the possible solutions in the search space) and the (unknown) actual Pareto front.

Figure 6. On the left: box-bounded Search Space Ω, (unknown) Pareto Set, Pareto and not-Pareto solutions. On the right: (unknown) Feasible Space within the Outcome Space (i.e., consisting of all the outcomes associated to solutions in the Search Space), Pareto (aka dominant) outcomes, not-Pareto (aka dominated) outcomes, and actual (unknown) Pareto Front.

Table 1. Hypervolumes (median ± sd over 25 independent runs) of the final Pareto front approximations provided by FanG-HPO and BoTorch.