1 Introduction

1.1 Rationale and motivations

A low misclassification/prediction error is not the only performance metric of interest when searching for the most suitable Machine Learning (ML) model for a successful decision support application. Additional metrics like fairness, interpretability, and privacy have become increasingly important in recent years. This paper focuses on fairness, a desired property of the decision support provided by an ML model: it must not be “biased” towards a specific person or group of individuals (Barocas et al., 2017; Buolamwini & Gebru, 2018; Pessach & Shmueli, 2022). The topic is known as FairML (Mehrabi et al., 2021), with approaches organized into three families: (i) post-processing, which modifies a pre-trained model to increase the fairness of its outcomes, (ii) in-processing, which enforces fairness constraints during training, and (iii) pre-processing, which modifies the data representation and then applies standard ML algorithms (Friedler et al., 2019; Hort et al., 2022). These approaches concern the design of fairness-aware (or fair-by-design) ML algorithms, but they suffer from one or more of the following drawbacks (Perrone et al., 2021): the intervention performed to deal with biases (i) is specific to the model class (e.g., linear models only), (ii) is limited to a specific definition of fairness, (iii) is limited to a single, binary sensitive feature, (iv) requires access to sensitive feature information at prediction time, or (v) results in a randomized classifier that may generate different predictions for the same input at different times.

These considerations have recently led to a new strategy: instead of designing fairness-aware ML algorithms, FairML can be addressed as the hyperparameter optimization (HPO) of an ML algorithm that also takes fairness into account. Two mechanisms have been recently proposed to include fairness in HPO: (i) constrained optimization, aimed at minimizing the misclassification error while satisfying some fairness constraint (Perrone et al., 2020, 2021), and (ii) multi-objective optimization, aimed at simultaneously minimizing the misclassification error and some unfairness metric (Schmucker et al., 2020).

More recently, the topic has been named Fairness-aware AutoML (Weerts et al., 2023), where automated Machine Learning (AutoML) generically refers to automating the design of an ML pipeline (aka workflow), in which HPO of ML algorithms is a specific task (Hutter et al., 2019; He et al., 2021).

It is important to remark that fairness-constrained HPO may not be viable in many real-life settings, because a suitable threshold on fairness can be difficult to establish a priori. In light of this consideration, and of recent trends in the research field (Nguyen et al., 2023), we have decided to focus on the multi-objective HPO approach.

In addition to fairness, this paper also addresses the energy efficiency of HPO. It has become crucial to consider the dual role of artificial intelligence (AI) and ML in the climate crisis. On the one hand, they can support more sustainable and low-emission decisions, from the design to the management of critical systems such as smart energy grids, transportation, healthcare and water utilities, and they can also provide accurate climate change predictions. On the other hand, AI and ML are themselves energy-hungry and, consequently, significant CO\(_2\) emitters, leading to the concept of Red-AI (Dhar, 2020). The prevalence of Red-AI is also quantified in Schwartz et al. (2020), who report that the total cost of producing accurate ML models increases linearly with (i) the cost of executing the model on a single example, (ii) the size of the training dataset and (iii) the number of HPO experiments, which controls how many times the model is trained on the dataset. Striking results are also reported in Strubell et al. (2019) and Hao (2019), which analysed the training process of many natural language processing (NLP) models to estimate the energy cost in kilowatts required. When these figures are converted into approximate carbon emissions, it turns out that the carbon footprint of training a single large NLP model is equal to the amount of CO\(_2\) emitted by 125 round-trip flights between New York and Beijing or, equivalently, by five average American cars over their lifetimes, including their manufacturing processes.

Consequently, the research community has been focusing on the Green-AI topic and has also started to propose novel approaches to make HPO, and more generally AutoML, “greener”, for instance by using smaller portions of the available databases/datasets, as proposed in the seminal work of Swersky et al. (2013) and in more recent ones, such as Klein et al. (2017) and Candelieri et al. (2021). Another possibility consists in discarding unpromising hyperparameter configurations early, to save and re-allocate computational resources (aka successive halving, first proposed in Jamieson & Talwalkar, 2016). A well-known example is Hyperband (Li et al., 2017). A wider overview of Green AutoML is given in Tornede et al. (2023), along with an indication of future research directions.

Bayesian optimization (BO) is a sample-efficient, sequential, model-based, global optimization method, well-suited for optimizing black-box, expensive, and multi-extremal objective functions (Frazier, 2018; Archetti & Candelieri, 2019; Garnett, 2023). Thanks to its sample-efficiency, BO is the core component of most of the current AutoML solutions, both open-source and commercial. BO has been recently extended to also deal with multiple objectives (Hernández-Lobato et al., 2016; Paria et al., 2020), as well as multiple information sources which can be queried at different costs (Ghoreishi & Allaire, 2019; Belakaria et al., 2020a; Candelieri et al., 2021; Candelieri & Archetti, 2021; Khatamsaz et al., 2020). A special case is when information sources can be organized hierarchically depending on their quality of approximation (aka fidelity), leading to the so-called multi-fidelity optimization, originally proposed in Kennedy & O’Hagan (2000). In both multiple information source and multi-fidelity optimization, energy efficiency is achieved by suitably using the less expensive (and lower-fidelity) sources to keep the cumulative query cost low, the latter being a proxy of energy consumption and, consequently, CO\(_2\) emissions.

1.2 Contributions

The main contributions of our paper are:

  1. A comparative analysis between fairness-aware ML and Fairness-aware AutoML (specifically, HPO) algorithms, on a set of four relevant benchmark (fairness) datasets.

  2. A new fair and green hyperparameter optimization algorithm, namely FanG-HPO, based on both multi-objective and multiple information source Bayesian optimization to simultaneously address Fairness-aware and Green AutoML.

  3. A computational assessment of FanG-HPO against two other state-of-the-art BO suites enabling both multi-objective and energy-efficient HPO, specifically:

    • autogluon-FairBO (Schmucker et al., 2020), in which HPO is addressed as a bi-objective optimization task (i.e., simultaneously minimizing misclassification error and unfairness on 10-fold cross validation) and successive halving is used to reduce energy consumption and, consequently, CO\(_2\) emissions.

    • BoTorch-MOMF (multi-objective and multi-fidelity) (Irshad et al., 2021), implementing a generic multi-objective and multi-fidelity BO framework. We have used it to target HPO as a bi-objective task and to address energy efficiency by selectively using the entire dataset (high-fidelity source) or a portion of it (low-fidelity source) to compute the misclassification error and the unfairness metric of every hyperparameter configuration, on 10-fold cross validation. BoTorch-MOMF has been released only quite recently, within the BoTorch suite.

The rest of the paper is organized as follows: Sect. 2 provides the main background on multi-objective and multiple information source optimization, along with the definition of the unfairness metric adopted in this study. In Sect. 3 the FanG-HPO approach is detailed. Section 4 describes the experimental setting and Sect. 5 reports the results. Finally, Sect. 6 provides conclusions and perspectives.

1.3 Related works

As far as fairness in ML is concerned, recent reviews are given in Barocas et al. (2017), Pessach & Shmueli (2022) and Weerts et al. (2023). Specifically, Weerts et al. (2023) aim at raising awareness among AutoML researchers and developers about the limitations of Fairness-aware AutoML, while remarking on the potential of AutoML as a tool enabling research on fairness in ML.

With respect to the energy efficiency, a recent review is given in Tornede et al. (2023), providing important hints about quantifying sustainability of AutoML systems, along with an overview and taxonomy of currently available energy-efficient AutoML systems.

Our paper contributes to the effort of providing AutoML practitioners with tools enabling the development of more socially responsible (both fair and green) decision support systems based on ML.

Although BO has been extended to deal with multiple objectives (Svenson & Santner, 2016; Feliot et al., 2017; Yang et al., 2019; Iqbal et al., 2020; Daulton et al., 2020) as well as multiple fidelities and multiple information sources (Lam et al., 2015; Poloczek et al., 2017; Ghoreishi & Allaire, 2019; Candelieri & Archetti, 2021, 2021a; Ariafar et al., 2021), there is still a lack of solutions jointly addressing the two problems. On the other hand, the research interest in this specific challenge has quickly increased, especially because of its applicability to many real-life problems other than fair and green ML, as demonstrated by very recent works (Sun et al., 2022; Irshad et al., 2021).

The only two available research tools, to the authors’ knowledge, are autogluon-FairBO (Schmucker et al., 2020), which combines multi-objective and multi-fidelity optimization by building upon Hyperband (Li et al., 2017), and BoTorch-MOMF (Irshad et al., 2021). Although an implementation of autogluon-FairBO is still under review for inclusion in the autogluon suite (footnote 1), the code is freely available (footnote 2) and it has been used to implement the first competitor of FanG-HPO. In contrast, BoTorch-MOMF is already integrated and freely available in the BoTorch suite (footnote 3), and is considered in this paper as the second competitor of FanG-HPO.

Finally, with respect to fairness-aware ML algorithms, relevant works are Komiyama et al. (2018), Zafar et al. (2019) and Scutari et al. (2021).

2 Background

This paper addresses FairML as a multi-objective problem and, at the same time, uses multiple information sources (i.e., a small portion of the large target dataset) to improve energy-efficiency and, consequently, reduce CO\(_2\) emissions. Here, we briefly summarize the basic background about multi-objective optimization, fairness metrics, and multiple information source optimization.

2.1 Multi-objective optimization

Multi-objective optimization (MO) concerns solving problems with more than one objective function to be optimized simultaneously, that is:

$$\begin{aligned} \underset{\textbf{x} \in \Omega }{\min }\; \textbf{f}(\textbf{x}) \end{aligned}$$
(1)

where \(\Omega\) is the search space, typically box-bounded in \(\mathbb {R}^d\), and \(\textbf{f}:\Omega \rightarrow \mathbb {R}^M\) is the vector-valued function of the multiple objectives. In MO, due to the conflicting nature of the objectives, there is no unique solution \(\mathbf {x^*} \in \Omega\) to problem (1). The final aim is to identify a set of equally efficient trade-offs among the objectives. This set of efficient trade-offs can be depicted within the space spanned by the M conflicting objectives, yielding the so-called Pareto front (aka frontier or boundary). The associated set of solutions in the search space \(\Omega\) is instead known as the Pareto set. An example is shown in Appendix A.1. Formally, the Pareto set consists of only non-dominated solutions, where a solution \(\textbf{x}\) is said to dominate another solution \(\mathbf {x'}\) if their objectives, respectively \(\textbf{f}(\textbf{x})=\big (f_1(\textbf{x}),...,f_M(\textbf{x})\big )\) and \(\textbf{f}(\mathbf {x'})=\big (f_1(\mathbf {x'}),...,f_M(\mathbf {x'})\big )\), satisfy the following two conditions:

$$\begin{aligned}{} & {} f_m(\textbf{x})\le f_m(\mathbf {x'}) \; \forall \; m \in \{1,...,M\} \end{aligned}$$
(2)
$$\begin{aligned}{} & {} \exists \; j \in \{1,...,M\}: f_j(\textbf{x})<f_j(\mathbf {x'}) \end{aligned}$$
(3)

Equation (2) means that \(\textbf{x}\) is not worse than \(\mathbf {x'}\) in any of the objectives, and Eq. (3) means that \(\textbf{x}\) is strictly better than \(\mathbf {x'}\) in at least one objective. The Pareto dominance symbol, \(\prec\), is used to summarize (2) and (3): \(\textbf{f}(\textbf{x})\prec \textbf{f}(\mathbf {x'})\).
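As an illustration of conditions (2) and (3), the following minimal Python sketch checks Pareto dominance between two objective vectors and filters a list of observed outcomes down to the non-dominated ones. The function names and the (MCE, DSP) values are hypothetical and serve only as an example; minimization of all objectives is assumed.

```python
from typing import List, Sequence, Tuple


def dominates(fa: Sequence[float], fb: Sequence[float]) -> bool:
    """Return True if objective vector fa Pareto-dominates fb (minimization)."""
    no_worse = all(a <= b for a, b in zip(fa, fb))        # condition (2)
    strictly_better = any(a < b for a, b in zip(fa, fb))  # condition (3)
    return no_worse and strictly_better


def pareto_front(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the outcomes that no other observed outcome dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (MCE, DSP) outcomes of four hyperparameter configurations.
observed = [(0.20, 0.10), (0.15, 0.30), (0.25, 0.05), (0.22, 0.12)]
front = pareto_front(observed)
```

Here the last outcome is dominated by the first one, so it is excluded from the current Pareto front approximation, while the other three remain as mutually incomparable trade-offs.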

If the objectives are black-box, their values can only be known point-wise by querying \(\textbf{f}(\textbf{x})\) at specific locations. Given all the queries performed so far, the set of non-dominated solutions (respectively, outcomes) is the current approximation of the Pareto set (respectively, front). If the objectives are also expensive to evaluate, in terms of time or resources, then (1) must be solved efficiently, meaning that a good Pareto front/set approximation has to be found within a limited number of queries. Thus, sample-efficiency of BO was the driver of its successful extension to the MO setting (i.e., MOBO), mainly along three different strategies:

  • Scalarization which maps the vector of all objectives into a scalar parametrized function whose optimizer, computed by a single objective method, can span, as the parameters vary, the whole Pareto set (Paria et al., 2020; Zhang & Golovin, 2020). The key drawback of scalarization is that it does not consider the geometry of the Pareto front approximation.

  • Maximization of some index related to the quality of the Pareto front approximation. A common choice is the dominated hypervolume indicator (HV), that is the volume of the region dominated by a Pareto front approximation.

  • Information theoretic based, which aims at reducing the uncertainty/entropy about the Pareto front (Belakaria et al., 2019; Suzuki et al., 2020; Belakaria et al., 2020b), recently also considering the multi-fidelity setting (Belakaria et al., 2020a).

The BO methods considered in this paper rely on the first or the second strategy.
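To make the dominated hypervolume indicator concrete, the sketch below computes it for the bi-objective minimization case as the area enclosed between a Pareto front approximation and a reference point. This is only an illustrative two-dimensional routine with hypothetical names and values; the reference point (1, 1) is an assumption of the example.

```python
from typing import Iterable, Tuple


def hypervolume_2d(front: Iterable[Tuple[float, float]],
                   ref: Tuple[float, float] = (1.0, 1.0)) -> float:
    """Area dominated by a two-objective front (minimization) w.r.t. ref."""
    # Only points strictly better than the reference contribute to the area.
    pts = sorted(p for p in front if p[0] < ref[0] and p[1] < ref[1])
    area, best_f2 = 0.0, ref[1]
    for f1, f2 in pts:  # sweep by increasing first objective
        if f2 < best_f2:  # dominated points add no new area
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


# Hypothetical (MCE, DSP) front approximation.
hv = hypervolume_2d([(0.15, 0.30), (0.20, 0.10), (0.25, 0.05)])
```

Adding a dominated outcome to the list leaves the value unchanged, which is why an increase of this indicator signals a genuine improvement of the front approximation.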

It is important to remark that, while in “vanilla” BO a probabilistic surrogate model is used to approximate the black-box objective function, almost all the MOBO approaches in the literature adopt a probabilistic surrogate model for each one of the objectives, assuming independence among them. This is a reasonable assumption, because in MO the objectives should be competing and uncorrelated. Exactly as in BO, every new query contributes to a better approximation of the objective function (which is vector-valued in MOBO) through the update of the probabilistic surrogate model. The second key component of BO is the acquisition function, which deals with the well-known exploration-exploitation dilemma. All the “vanilla” BO acquisition functions can be used in the case of scalarization, because the multi-objective problem is mapped into a single-objective one. On the contrary, expected hypervolume improvement (EHVI) is an acquisition function specifically designed for vector-valued MOBO, basically extending the idea underlying the well-known Expected Improvement (EI) to the multi-objective setting.

In this paper we consider HPO of a classification model, with two different objectives to minimize: the misclassification error (MCE) and the unfairness metric known as differential statistical parity (DSP), which is detailed in the next section. Both the objectives are computed through stratified 10-fold cross validation (10FCV), so they are black-box, expensive, multi-extremal, and possibly noisy (depending on the specific ML algorithm to be optimized or the cross-validation procedure).

2.2 Differential statistical parity as unfairness metric

There is not a unique definition—and consequently metric—of fairness (Verma et al., 2018). Instead, different alternatives have been proposed depending on application domains and specific use cases. In this paper we consider the DSP—which has been also recently considered in Schmucker et al. (2020). We refer to the standard framework where \(F_L\) denotes the true labels for the target feature, \(F_{Sens}\) is the sensitive feature, and \(\widehat{F}_L\) denotes the predicted labels. Statistical parity (SP) requires that positive predictions are unaffected by the value of the sensitive feature, independently of the actual label:

$$\begin{aligned} P\left( \widehat{F}_L=1 \vert F_{Sens}=0\right) = P\left( \widehat{F}_L=1 \vert F_{Sens}=1\right) \end{aligned}$$

Finally, the absolute value of the difference between the two terms is the DSP, that is a measure of the violation of the above condition and, consequently, a measure of unfairness.

$$\begin{aligned} DSP = \left| P\left( \widehat{F}_L=1 \vert F_{Sens}=0\right) - P\left( \widehat{F}_L=1 \vert F_{Sens}=1\right) \right| \end{aligned}$$
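In practice the two conditional probabilities are estimated as the rates of positive predictions within each group. A minimal sketch, assuming binary predictions and a binary sensitive feature encoded as 0/1 (the function name and the data are hypothetical):

```python
from typing import Sequence


def dsp(y_pred: Sequence[int], sensitive: Sequence[int]) -> float:
    """Empirical |P(y_hat=1 | s=0) - P(y_hat=1 | s=1)| from 0/1 sequences."""
    rates = []
    for group in (0, 1):
        preds = [p for p, s in zip(y_pred, sensitive) if s == group]
        if not preds:
            raise ValueError(f"no instances with sensitive value {group}")
        rates.append(sum(preds) / len(preds))  # positive-prediction rate
    return abs(rates[0] - rates[1])


# Hypothetical predictions for eight instances, four per group.
y_hat = [1, 0, 1, 1, 0, 0, 1, 0]
sens = [0, 0, 0, 0, 1, 1, 1, 1]
value = dsp(y_hat, sens)  # positive rates are 0.75 and 0.25
```

Note that the true labels do not enter the computation, in line with the definition of statistical parity given above.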

2.3 Multiple information source optimization

Multiple information source optimization (MISO) aims at searching for the global optimum of a black-box, expensive and multi-extremal function, namely the ground-truth, given the possibility to also query less expensive information sources which are its approximations. The final goal is to find an optimal solution for the ground-truth while satisfying some constraint on the query cost accumulated along the search process, by effectively and efficiently using the cheap information sources. MISO has been defined for single-objective problems: differently from multi-objective, here the subscript is used to denote a specific information source, where \(f_1(\textbf{x})\) is the ground-truth and \(f_s(\textbf{x})\), with \(s \in \{2,...,S\}\), are the cheap information sources.

The MISO problem can be formulated as:

$$\begin{aligned} \mathbf {x^*} =&\quad \underset{\textbf{x}\in \Omega }{\arg \min } f_1(\textbf{x}) \end{aligned}$$
(4)
$$\begin{aligned} \text {subject to:}&\sum _{(s,\textbf{x})\in Z^{1:n}} c_s \le C_{max} \end{aligned}$$
(5)

where \(Z^{1:n}=\bigg \{\Big (s^{(i)},\textbf{x}^{(i)}\Big )\bigg \}_{i=1:n}\) denotes the set of source-location pairs sequentially queried, \(c_s\) is the cost for querying \(f_s(\textbf{x})\), and \(C_{max}\) is the maximum query cost that can be accumulated along the optimization process.
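The budget constraint (5) amounts to simple bookkeeping over the queried source-location pairs. The following sketch, with hypothetical names and cost values, accumulates the query cost and checks whether a further query on a given source would still respect \(C_{max}\):

```python
from typing import Dict, List, Tuple

Location = Tuple[float, ...]


def cumulative_cost(queries: List[Tuple[int, Location]],
                    costs: Dict[int, float]) -> float:
    """Sum of c_s over the source-location pairs queried so far, as in (5)."""
    return sum(costs[s] for s, _x in queries)


def can_query(queries: List[Tuple[int, Location]], costs: Dict[int, float],
              source: int, c_max: float) -> bool:
    """True if one more query on `source` keeps the total within c_max."""
    return cumulative_cost(queries, costs) + costs[source] <= c_max


# Hypothetical setting: source 1 is the ground-truth, source 2 is cheaper.
costs = {1: 1.0, 2: 0.5}
history = [(1, (0.1,)), (2, (0.4,)), (2, (0.7,))]
spent = cumulative_cost(history, costs)
```

With a budget of 2.5, one more query on the cheap source is still admissible in this example, while a query on the ground-truth is not.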

BO has been also successfully extended to deal with MISO problems, where each information source is individually modelled through a probabilistic surrogate model—usually a Gaussian process (GP)—fitted on the queries performed on that source. Then, all the individual models are combined into a single one, which is used to drive the choice of the next promising source-location pair to query, such as in Ghoreishi & Allaire (2019) and Candelieri & Archetti (2021).

3 FanG-HPO

The proposed fair and green HPO approach aims at solving problem (4)-(5), but with the scalar objective function replaced by a vector-valued one, that is:

$$\begin{aligned} \mathbf {x^*} =&\quad \underset{\textbf{x}\in \Omega }{\arg \min } \;\mathbf {f_1}(\textbf{x}) \end{aligned}$$
(6)
$$\begin{aligned} \text {subject to:}&\sum _{(s,\textbf{x})\in Z^{1:n}} c_s \le C_{max} \end{aligned}$$
(7)

where \(\mathbf {f_1}\) is the vector-valued ground-truth, while all the other cheaper information sources are \(\mathbf {f_s}\), with \(s\in \{2,...,S\}\).

3.1 Modelling objectives and information sources

In FanG-HPO, both objectives and information sources are modelled independently via GP regression (Williams & Rasmussen, 2006; Gramacy, 2020). A GP is a probabilistic regression model whose predictive mean, \(\mu (\textbf{x})\), and uncertainty, \(\sigma (\textbf{x})\), are conditioned on previous observations. A brief introduction to GP regression is provided in the Appendix A.2.

Thus, at a generic iteration, FanG-HPO learns \(S \times M\) GP models, leading to the following set of predictive means and uncertainty functions:

$$\begin{aligned} \bigg \{\mu _{sm}(\textbf{x}),\sigma _{sm}(\textbf{x})\bigg \}_{\begin{array}{c} s=1:S,\\ m=1:M \end{array}} \end{aligned}$$

Then, a single GP model is fitted, for each objective, by combining the GPs which individually model that specific objective on every information source. In FanG-HPO this operation is performed by using the Augmented Gaussian Process (AGP) methodology recently proposed in Candelieri & Archetti (2021). More precisely, a set of indices identifying “reliable” observations from cheaper sources is computed for each source-objective pair, where “reliable” means that they are not too discrepant with respect to the ground-truth:

$$\begin{aligned} \begin{aligned} \mathcal {I}_{sm} = \Big \{&i: \big \vert \mu _{1m}\big (\textbf{x}^{(i)}\big )-\mu _{sm}\big (\textbf{x}^{(i)}\big )\big \vert \le \alpha \sigma _{1m}\big (\textbf{x}^{(i)}\big ), \textbf{x}^{(i)} \in \textbf{X}_s \Big \},\\&\forall \; s\ne 1, \forall \; m \in \{1,...,M\} \end{aligned} \end{aligned}$$
(8)

where \(\alpha\) is a technical parameter to tune reliability of the observations from cheap information sources. In Candelieri & Archetti (2021) the suggested value is \(\alpha =1\).

Then, the observations on the ground-truth are “augmented” with those identified by \(\mathcal {I}_{sm}\), separately for each objective \(m\in \{1,...,M\}\):

$$\begin{aligned}{} & {} \mathbf {{\widehat{X}}}_m \leftarrow \textbf{X}_1 \cup \Big \{ \textbf{x}^{(i)} \in \textbf{X}_s: i \in \mathcal {I}_{sm}, \forall s\ne 1\Big \} \end{aligned}$$
(9)
$$\begin{aligned}{} & {} \mathbf {{\widehat{Y}}}_{m} \leftarrow \textbf{Y}_{1[m]} \cup \Big \{ y^{(i)} \in \textbf{Y}_{s[m]}: i \in \mathcal {I}_{sm}, \forall s\ne 1\Big \} \end{aligned}$$
(10)

where \(\textbf{X}_s\) are the locations queried on source s and \(\mathbf {{\widehat{Y}}}_{m}\) are the values observed for the objective m and associated to the set \(\mathbf {{\widehat{X}}}_m\) (i.e., the symbol [m] is the operator selecting only the column m of the \(n_s \times M\) matrix \(\textbf{Y}_{s}\), with \(n_s\) the number of queries performed on the source s).

Finally, FanG-HPO fits M independent AGPs, with predictive means and uncertainties respectively denoted by \(\widehat{\mu }_m(\textbf{x})\) and \(\widehat{\sigma }_m(\textbf{x})\), both conditioned on \(\Big \{\mathbf {\widehat{X}}_m,\mathbf {\widehat{Y}}_m\Big \}\).
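The augmentation step in (8)-(10) can be sketched for one objective and one cheap source as follows. This is only an illustration under stated assumptions: the already fitted GPs are represented by plain Python callables returning the predictive mean and uncertainty, both predictive means are evaluated at the queried location \(\textbf{x}^{(i)}\), locations are scalars, and all names and numbers are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Predictor = Callable[[float], float]


def reliable_indices(X_s: Sequence[float], mu_1: Predictor, sigma_1: Predictor,
                     mu_s: Predictor, alpha: float = 1.0) -> List[int]:
    """Indices of cheap-source queries whose predicted value stays within
    alpha * sigma_1 of the ground-truth GP prediction, as in (8)."""
    return [i for i, x in enumerate(X_s)
            if abs(mu_1(x) - mu_s(x)) <= alpha * sigma_1(x)]


def augment(X_1: Sequence[float], y_1: Sequence[float],
            X_s: Sequence[float], y_s: Sequence[float],
            idx: Sequence[int]) -> Tuple[List[float], List[float]]:
    """Ground-truth data extended with the reliable cheap observations,
    as in (9) and (10), for a single objective."""
    X_hat = list(X_1) + [X_s[i] for i in idx]
    y_hat = list(y_1) + [y_s[i] for i in idx]
    return X_hat, y_hat


# Toy stand-ins for the fitted GPs (assumptions, not real GP models).
mu_1 = lambda x: x ** 2            # ground-truth predictive mean
sigma_1 = lambda x: 0.1            # ground-truth predictive uncertainty
mu_s = lambda x: x ** 2 + 0.3 * x  # cheap-source predictive mean

X_1, y_1 = [0.0, 1.0], [0.0, 1.0]
X_s, y_s = [0.1, 0.5, 0.9], [0.04, 0.40, 1.08]
idx = reliable_indices(X_s, mu_1, sigma_1, mu_s, alpha=1.0)
X_hat, y_hat = augment(X_1, y_1, X_s, y_s, idx)
```

In this toy case only the first cheap observation is retained, because the discrepancy between the two predictive means grows with x and exceeds the ground-truth uncertainty at the other two locations. A GP fitted on the augmented data would then play the role of the AGP for that objective.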

3.2 Deriving the next source-location to query

The next source-location pair to query, namely \((s',\textbf{x}')\), is derived by solving a multi-objective problem whose objectives are approximated by the AGPs obtained as previously described. Having M independent AGPs is in line with recent results in the literature: as reported in Zhan et al. (2017), modelling each objective independently with a separate GP makes multi-objective optimization approaches easy to implement, while using dependent GP models (such as multi-output GPs) does not provide any relevant benefit over independent GPs (Svenson & Santner, 2016).

More precisely, \((s',\textbf{x}')\) is obtained according to the following two-step procedure:

  1. Selecting \(\textbf{x}'\). First, the location \(\textbf{x}'\) is selected depending on the well-known expected hypervolume improvement (EHVI). Hypervolume Improvement (HVI) is defined as the relative increase in the hypervolume indicator when an outcome \(\textbf{y}\), associated to a solution \(\textbf{x}\), is added to the current Pareto front approximation. In BO, the HVI is a random variable because \(\textbf{y}\) is itself a (vector of) random variable(s), and this leads to the EHVI.

    $$\begin{aligned} \textbf{x}' = \underset{\textbf{x}\in \Omega }{\arg \max } \; \text {EHVI}(\textbf{x},\mathcal {P},\textbf{r}) \end{aligned}$$
    (11)

    where \(\mathcal {P}\) is the current approximated Pareto front and \(\textbf{r}\) is the reference point. In this paper \(\textbf{r}\) is the worst point, with both MCE and DSP equal to 1. Although a closed formula for the EHVI exists (Feliot et al., 2017), it is expensive to calculate. In FanG-HPO the fast calculation proposed in Zhao et al. (2018) is used, which extends to the EHVI computation the Walking Fish Group (WFG) technique (While et al., 2011), one of the fastest algorithms for calculating the hypervolume of a Pareto front approximation.

  2. Selecting \(s'\). Then, the information source \(s'\) is selected according to both its query cost and its discrepancy from the ground-truth at \(\textbf{x}'\), over all the objectives, that is:

    $$\begin{aligned} s' = \underset{s\in \{1,...,S\}}{\arg \min } \; c_s \cdot \sum _{m=1}^M \Big \vert \mu _{1m}(\textbf{x}')-\mu _{sm}(\textbf{x}')\Big \vert \end{aligned}$$
    (12)

Contrary to other recent approaches which propose to query the ground-truth on a regular basis, such as in Khatamsaz et al. (2020), at each iteration FanG-HPO adaptively chooses among all the sources, including the ground-truth. However, to ensure a sufficient quality of the approximation provided by the AGPs, before solving (12) FanG-HPO checks whether the number of augmenting observations coming from cheap sources is larger than the number of observations from the ground-truth: in that case \(s'=1\) is selected, instead of solving (12).
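To illustrate the quantity maximized in the first step, the sketch below estimates the EHVI at a candidate location by plain Monte Carlo sampling in the bi-objective case. It is not the WFG-based calculation used in FanG-HPO: it is a simpler stand-in that assumes two independent Gaussian predictive distributions (one per objective), measures the improvement as the plain difference between hypervolumes, and uses hypothetical names and values throughout.

```python
import random
from typing import Iterable, List, Tuple

Point = Tuple[float, float]


def hv2d(front: Iterable[Point], ref: Point) -> float:
    """Area dominated by a two-objective front (minimization) w.r.t. ref."""
    pts = sorted(p for p in front if p[0] < ref[0] and p[1] < ref[1])
    area, best_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 < best_f2:
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


def ehvi_mc(mu: Point, sigma: Point, front: List[Point],
            ref: Point = (1.0, 1.0), n_samples: int = 2000,
            seed: int = 0) -> float:
    """Monte Carlo estimate of the expected hypervolume improvement at a
    candidate whose two objectives are predicted as independent Gaussians."""
    rng = random.Random(seed)
    base = hv2d(front, ref)
    total = 0.0
    for _ in range(n_samples):
        # Draw one possible outcome y from the predictive distributions.
        y = (rng.gauss(mu[0], sigma[0]), rng.gauss(mu[1], sigma[1]))
        total += hv2d(list(front) + [y], ref) - base  # improvement, >= 0
    return total / n_samples


# Hypothetical front and two candidates with different predictions.
front = [(0.20, 0.10), (0.15, 0.30)]
promising = ehvi_mc(mu=(0.10, 0.05), sigma=(0.02, 0.02), front=front)
unpromising = ehvi_mc(mu=(0.60, 0.60), sigma=(0.02, 0.02), front=front)
```

A candidate predicted to fall deep inside the dominated region receives an estimate close to zero, whereas a candidate predicted to improve on both objectives receives a clearly positive one; the location \(\textbf{x}'\) is the maximizer of this quantity over the search space.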

4 Experimental setting

4.1 Datasets and Machine Learning algorithms

To select a suitable set of benchmark datasets on fairness, we have based our choice on Le Quy et al. (2022), which provides a detailed overview and analysis of real-world tabular datasets frequently used in FairML. Specifically, Bayesian networks were used there to model and analyse the relationship between protected attributes and target class, for each considered dataset. We limited our selection to four binary classification datasets that proved difficult in terms of achievable accuracy and statistical parity, according to the results reported in Table 15 of Le Quy et al. (2022). It is important to anticipate that our results on the four selected datasets are homogeneous, suggesting that extending the analysis to other, easier, datasets was unnecessary (i.e., a more extended experimental campaign is out of the scope of this paper; for a wider set of experiments on the effectiveness of Fairness-aware ML one can refer to very recent studies, such as Nguyen et al., 2023).

More specifically, the four selected datasets are known as ADULT, COMPAS, GERMAN CREDIT, and LAW SCHOOL ADMISSIONS. They are taken from the R package fairml (footnote 4), which also provides implementations of a set of fairness-aware algorithms. The same datasets are also available on the well-known UCI Repository (footnote 5), but the versions available in the R package could be slightly different due to some basic pre-processing operations. It is important to clarify that these operations are related to common data pre-processing (e.g., identifiers removal) and not to any fairness-oriented pre-processing technique. In the following we report, for each dataset, a brief description and some details relevant for our experiments.

  • ADULT: the aim of the associated classification task is to predict whether personal income exceeds 50K USD per year, using the 1994 U.S. Census data. The dataset consists of 30,162 instances and 14 features (among them, two are sensitive: “gender” and “race”).

  • COMPAS: data refers to criminal offenders screened in Florida (US) during 2013–2014. The aim of the associated classification task is to predict recidivism within two years. The dataset consists of 5855 instances and 16 features (among them, two are sensitive: “gender” and “race”).

  • GERMAN CREDIT: data refers to credit scoring and the aim of the associated classification task is to predict defaults on consumer loans in the German market. The dataset consists of 1000 instances and 21 features (among them, one is sensitive: “gender”).

  • LAW SCHOOL ADMISSIONS: data refers to a survey among students attending law school in the U.S. in 1991. Although the original task associated to this dataset is a regression task (i.e., predicting the Undergraduate Grade Point Average), we have decided to consider a different target feature, specifically the one assessing whether the student has passed the bar exam on the first try. Thus, the classification task we consider in our experiments is to predict this outcome. The dataset consists of 20,800 instances and 11 features (among them, two are sensitive: “gender” and “race”).

Before performing HPO, all the datasets have been (further) pre-processed by applying one-hot-encoding to all the nominal features, increasing the final number of features (including the target feature) to: 52 for ADULT, 20 for COMPAS, 47 for GERMAN CREDIT, 51 for LAW SCHOOL ADMISSIONS. Pre-processing has also increased the number of sensitive features for some datasets. As detailed in Sect. “Availability of data and material”, all the pre-processed datasets, as they have been used in this study, are available for replicability, along with the code.

With respect to the HPO of the ML algorithms, we have selected four completely different ML algorithms: multi-layer perceptron (MLP), random forest (RF), eXtreme Gradient Boosting (XGB), and support vector machine (SVM) with an RBF kernel. The number of hyperparameters to be optimized is respectively: 10 for MLP, 2 for RF, 7 for XGB, and 2 for SVM. The details about their associated search spaces are reported in Tables 1, 2, 3, and 4. For MLP and XGB we have used the same search spaces defined in Schmucker et al. (2020), while those for RF and SVM have been defined by us, because these algorithms were not considered in the quoted study.

Table 1 sklearn MLP’s search space
Table 2 sklearn RF’s search space. The range of the hyperparameter MAX_FEATURES depends on the dataset: \(\vert F \vert\) denotes the number of features, excluding the target one
Table 3 sklearn XGBoost’s search space
Table 4 sklearn SVM’s search space

4.2 Compared methods

We have compared our approach, FanG-HPO, against two state-of-the-art BO suites enabling both multi-objective and energy efficient optimization, specifically autogluon-FairBO and BoTorch-MOMF.

Bi-objective optimization (i.e., simultaneous minimization of 10FCV MCE and 10FCV DSP) is addressed via scalarization in autogluon-FairBO and via EHVI maximization in both FanG-HPO and BoTorch-MOMF.

Another important difference regards the strategies adopted to deal with energy-efficiency. According to the taxonomy in Tornede et al. (2023):

  • autogluon-FairBO belongs to the family of “early discarding of unpromising candidates” methods and is based on Hyperband (Li et al., 2017).

  • BoTorch-MOMF and FanG-HPO belong to the family of the “multi-fidelity performance measurements” methods, but with a significant methodological difference. BoTorch-MOMF uses a multi-output Gaussian process (GP) to model three objectives: not only 10FCV MCE and 10FCV DSP, but also the query costs associated to the sources. Moreover, the fidelity is also included as an additional decision variable (i.e., it is treated just like a hyperparameter of the ML algorithm to be optimized). On the contrary, FanG-HPO uses an independent Augmented Gaussian Process (AGP) for each objective, fitted by merging observations on the different information sources. This difference is even more important in terms of acquisition function: although both BoTorch-MOMF and FanG-HPO use EHVI, the former penalizes the associated value depending on the cost of the source (the higher the fidelity, the higher the cost), while the latter adopts the two-step mechanism explained in Sect. 3.2.

Another important comparison performed in our study is between the three HPO methods, all together, and two well known FairML algorithms, both available in the R package fairml. The two algorithms are named zlrm and fgrrm (Scutari et al., 2021). This comparison is important to evaluate, in terms of fairness, the effectiveness of Fairness-aware AutoML with respect to fairness-aware (aka fairness-by-design) ML algorithms.

4.3 Performance metrics

As previously mentioned, in this paper HPO is aimed at simultaneously minimizing 10FCV MCE and 10FCV DSP, separately for every pair ML algorithm - dataset. It is important to remark that, given a specific dataset, DSP is computed for every sensitive feature, as reported in Sect. 2.2. To obtain a single value instead of a vector, we have decided to consider DSP\(=\underset{i=1:n_{Sens}}{\max}\{DSP_i\}\), where \(n_{Sens}\) is the number of sensitive features and \(DSP_i\) is the DSP value associated with the i-th sensitive feature. Basically, we minimize the worst DSP over all the sensitive features.
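
The aggregation over sensitive features can be illustrated with a minimal Python sketch. It assumes binary sensitive features and takes the per-feature DSP to be the absolute gap in positive-prediction rates between the two groups; this is a common statistical-parity measure used here only for illustration, while the definition actually adopted is the one in Sect. 2.2. The function names and the toy data are hypothetical.

```python
from typing import Dict, Sequence


def dsp_binary(y_pred: Sequence[int], sensitive: Sequence[int]) -> float:
    """Absolute gap in positive-prediction rate between groups 0 and 1.

    ASSUMPTION: illustrative stand-in for the per-feature DSP_i;
    the paper's own definition is given in its Sect. 2.2.
    """
    rates = []
    for group in (0, 1):
        preds = [p for p, s in zip(y_pred, sensitive) if s == group]
        if not preds:
            raise ValueError(f"group {group} has no samples")
        rates.append(sum(preds) / len(preds))
    return abs(rates[0] - rates[1])


def worst_case_dsp(y_pred: Sequence[int],
                   sensitive_features: Dict[str, Sequence[int]]) -> float:
    """DSP = max over i of DSP_i, i.e. the worst value over all sensitive features."""
    return max(dsp_binary(y_pred, s) for s in sensitive_features.values())


# Toy predictions and two hypothetical binary sensitive features.
y_pred = [1, 1, 1, 1, 1, 0, 0, 0]
features = {
    "sex": [0, 0, 0, 0, 1, 1, 1, 1],
    "age_group": [0, 1, 0, 1, 0, 1, 0, 1],
}
print(worst_case_dsp(y_pred, features))
```

In this toy case the gap is larger for the first feature than for the second, so the first one determines the value that HPO would try to minimize.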

Since BoTorch-MOMF and FanG-HPO are, respectively, a multi-fidelity and a multiple information source optimization approach, it is quite simple to define sources and their query costs for them. Specifically, 10FCV MCE and 10FCV DSP are computed on the high fidelity / expensive source (i.e., the ground-truth) when the associated hyperparameter configuration is evaluated on the entire dataset; otherwise they are computed on the low fidelity / cheap source, that is, when the hyperparameter configuration is evaluated on a stratified sample (i.e., 50%) of the original dataset. Just for the sake of simplicity, we can assume that the nominal query costs of the two information sources are, respectively, \(c_1=1\) and \(c_2=0.5\). It is important to remark that the nominal query cost is not a direct proxy of energy consumption and CO\(_2\) emissions, but it drives energy-efficient choices in the two approaches.
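
As a rough illustration of the two sources, the following Python sketch draws a 50% sample that preserves class proportions and records the nominal costs stated above. Stratifying by class label, the helper name `stratified_half`, and the toy labels are assumptions made for this example; the text does not specify how the stratified sample is actually built.

```python
import random
from collections import defaultdict

# Nominal query costs from the text: entire dataset vs. 50% stratified sample.
NOMINAL_COST = {"full": 1.0, "half": 0.5}


def stratified_half(labels, seed=0):
    """Return sorted indices of a 50% sample that keeps class proportions.

    ASSUMPTION: stratification is done on the class label.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep half of each class (at least one sample per class).
        chosen.extend(indices[: max(1, len(indices) // 2)])
    return sorted(chosen)


# Toy dataset with a 60/40 class split.
labels = [0] * 60 + [1] * 40
cheap_idx = stratified_half(labels)
print(len(cheap_idx), sum(labels[i] for i in cheap_idx))
```

A configuration evaluated on `cheap_idx` would be charged the cost of the cheap source, while one evaluated on all indices would be charged the cost of the ground-truth.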

Being based on Hyperband, autogluon-FairBO adopts successive halving on the validation folds. In brief, if a configuration of the hyperparameters is not promising according to the results iteratively collected on the folds, it is discarded and the 10-fold cross-validation procedure is stopped early. As a consequence, we cannot define in advance the cost of evaluating a hyperparameter configuration in autogluon-FairBO. Thus, we cannot define information sources and their query costs a priori: if the 10FCV procedure terminates successfully, then we consider that the query has been performed on the ground-truth, and we use \(c_1=1\); otherwise we consider \(c_2=n_f/10\), with \(n_f\) the number of folds analysed before halving.

As far as energy consumption and CO\(_2\) emissions are concerned, we decided to consider runtime (i.e., query time) as a suitable proxy, as better detailed and motivated in Sect. 5.3.

Finally, it is important to remark that the query cost is not applicable in the case of the fairness-aware algorithms (i.e., HPO is not performed on them because they have no hyperparameters to be optimized).

4.4 Experimental protocol

Autogluon-FairBO does not allow specifying a maximum cumulative query cost as a termination criterion. The only option is to provide a maximum number of queries, that is, a maximum number of hyperparameter configurations to evaluate. To have a fair comparison against the other two approaches, namely BoTorch-MOMF and FanG-HPO, we have decided to implement the following experimental protocol (for all the ML algorithms and datasets):

  1. Executing autogluon-FairBO with a limit of 200 queries;

  2. Computing the cost of each query as \(n_f/10\), with \(n_f\) the number of folds considered before halving (if any);

  3. Computing the resulting overall query cost accumulated by autogluon-FairBO on that run (call it budget);

  4. Selecting the first \(2(d+1)\) hyperparameter configurations queried by autogluon-FairBO and dividing them, randomly, into two sets of size \(d+1\) each, with d the number of hyperparameters to optimize;

  5. Running BoTorch-MOMF by initializing the multi-output GPs with the two sets mentioned above (recall that fidelity is treated as an additional hyperparameter to be optimized and that the associated query cost is part of the acquisition function). The previously computed budget is set as a threshold for the cumulative query cost (i.e., termination criterion);

  6. Running FanG-HPO by initializing the two AGPs with the two sets of observations mentioned above. The same termination criterion of BoTorch-MOMF is also used for FanG-HPO.
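
The bookkeeping behind the steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the `propose` callable, and the toy numbers are hypothetical, and the text does not say whether the last query may overshoot the budget (this sketch lets it).

```python
import random


def run_budget(folds_per_query, total_folds=10):
    """Steps 2-3: each query costs n_f / 10; the budget is the sum over the run."""
    return sum(folds_per_query) / total_folds


def split_initial_design(configs, d, seed=0):
    """Step 4: take the first 2(d+1) configurations and split them at random
    into two sets of d+1 configurations each."""
    first = list(configs[: 2 * (d + 1)])
    if len(first) < 2 * (d + 1):
        raise ValueError("need at least 2(d+1) configurations")
    random.Random(seed).shuffle(first)
    return first[: d + 1], first[d + 1:]


def optimize_within_budget(propose, budget, cost_of):
    """Steps 5-6 (skeleton): query until the cumulative nominal cost reaches
    the budget. `propose` is a hypothetical callable that returns the next
    (configuration, source) pair given the history so far."""
    spent, history = 0.0, []
    while spent < budget:
        config, source = propose(history)
        spent += cost_of[source]
        history.append((config, source))
    return history, spent


# Toy run: four queries, two of them halved after 3 and 5 folds.
budget = run_budget([10, 3, 10, 5])
set_a, set_b = split_initial_design(list(range(20)), d=2)
history, spent = optimize_within_budget(
    propose=lambda h: (len(h), "half"),  # always picks the cheap source
    budget=budget,
    cost_of={"full": 1.0, "half": 0.5},
)
print(budget, len(set_a), len(set_b), len(history), spent)
```

In a real run, `propose` would be replaced by the acquisition step of BoTorch-MOMF or FanG-HPO, and the two initial sets would come from the observations collected by autogluon-FairBO.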

To mitigate the randomness of the initialization in autogluon-FairBO, ten independent runs have been performed for each pair ML algorithm - dataset. The experimental protocol has been applied for each one of the independent runs. Analogously, five independent runs have been performed for zlrm and fgrrm, separately. A constraint on the unfairness must be provided for these two algorithms: we set this value to 0.1 (for each sensitive feature).

All the experiments have been performed on a Microsoft Azure virtual machine, Standard D16ds v5 (16 vcpus, 64 GiB memory) Ubuntu 18.04 LTS.

5 Results

In this section we summarize the most relevant results of our study. Every result is first stated and then commented, to make this section easier to read. Moreover, results are organized into three subsections:

  • the first is related to a comparison between FairML and Fairness-aware AutoML algorithms,

  • the second is related to the cost-effectiveness of the three BO-based approaches considered in the paper,

  • and finally the third subsection illustrates the ecological profiles of the three BO-based methods.

5.1 Fairness related results

As a first step, we have performed a comparison, in terms of Pareto optimality, between Fairness-aware ML and Fairness-aware AutoML algorithms. To achieve this, for every pair dataset - ML algorithm, we have selected the dominant hyperparameter configurations among all those generated by the three BO-based approaches over all the 10 independent runs. It is important to remark that only hyperparameter configurations evaluated on the entire datasets are considered in this operation. We call the resulting approximated Pareto front the “super Pareto front”. Figure 1 depicts the super Pareto fronts along with the MCE–DSP trade-offs provided by the Fairness-aware ML algorithms. The main results derived from it are reported in the remainder of this subsection.
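
A minimal Python sketch of how such a super Pareto front could be assembled is given here. It pools the (MCE, DSP) pairs of all approaches and runs and keeps only the non-dominated ones, with both objectives minimized. The function names and the numerical values are made up for illustration and are not results from the study.

```python
def dominates(a, b):
    """True if point a Pareto-dominates point b (all objectives minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the non-dominated points, sorted by the first objective."""
    unique = set(points)
    front = [p for p in unique if not any(dominates(q, p) for q in unique)]
    return sorted(front)


def super_pareto_front(runs):
    """Pool the full-dataset evaluations of all approaches and runs, then
    keep only the non-dominated (MCE, DSP) pairs."""
    pooled = [point for run in runs for point in run]
    return pareto_front(pooled)


# Hypothetical (MCE, DSP) pairs from three runs; values are invented.
runs = [
    [(0.20, 0.10), (0.25, 0.05)],
    [(0.18, 0.15), (0.22, 0.12)],
    [(0.30, 0.02), (0.25, 0.05)],
]
print(super_pareto_front(runs))
```

In this toy example the pair (0.22, 0.12) is dropped because (0.20, 0.10) is better on both objectives, and the duplicated pair is kept only once.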

Result 1.

Fairness-aware AutoML (Pareto) dominates Fairness-aware ML algorithms.

With respect to the four datasets, there always exists at least one super Pareto front dominating the MCE–DSP trade-offs of the two Fairness-aware ML algorithms. This result is also in line with what was recently reported in Cruz & Hardt (2023) about post-processing of Fairness-aware ML algorithms.

Result 2.

Bi-objective HPO of RF leads to super Pareto fronts smaller than those of HPO of other ML algorithms, in terms of both HV and number of Pareto optimal hyperparameter configurations.

More specifically, the super Pareto front associated to bi-objective HPO of RF is quite limited in terms of 10FCV DSP, for all the four datasets considered. Thus, although accurate, the final RF models obtained via HPO are not so fair.

Result 3.

Overall, the bi-objective HPO of XGB has led to the best results.

The super Pareto front associated to XGB is always larger than the others in terms of both HV and number of Pareto optimal hyperparameter configurations.

Fig. 1

Comparison, on each dataset, between MCE–DSP trade-offs provided by Fairness-aware ML algorithms and super Pareto fronts obtained through HPO of four ML algorithms. Super Pareto fronts refer to all the dominant hyperparameter configurations identified by the three BO-based approaches together (i.e., autogluon-FairBO, BoTorch-MOMF, and FanG-HPO), over 10 independent runs

5.2 Cost-effectiveness of fairness-aware HPO methods

When comparing AutoML systems, it is not correct to just consider the final performances at the end of their run. Instead, one has to look at the performance curves, usually given by the best observed value of a metric with respect to the number of queries performed or the cumulative query cost.

In this section we report the curves of the best HV (of the approximated Pareto front) with respect to the cumulative query cost: one curve for every BO-based approach and run, and separately for each pair ML algorithm - dataset. These curves allow us to compare the three BO-based approaches in terms of their cost-effectiveness.

It is important to remark that, in this case, we do not deal with the super Pareto front: instead, we consider the HV of the approximated Pareto front consisting of the dominant hyperparameter configurations evaluated on the entire dataset only (i.e., queries on the ground-truth), at increasing values of the cumulative query cost. It is also important to recall that, in this subsection, we refer to the nominal query costs, associated with the two different sources and defined in Sect. 4.3.
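
The construction of one such curve can be sketched in Python as follows. The sketch computes the 2D hypervolume of the ground-truth evaluations seen so far after every query, against the cumulative cost. The reference point (1, 1) is an assumption made for illustration (both MCE and DSP lie in [0, 1]); the reference point actually used is not stated in this section. Function names and toy queries are hypothetical.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref` (both objectives minimized)."""
    # Keep points strictly better than the reference, sorted by objective 1.
    pts = sorted(p for p in set(points) if p[0] < ref[0] and p[1] < ref[1])
    hv, best_y = 0.0, ref[1]
    for x, y in pts:
        if y < best_y:  # point is non-dominated among those scanned so far
            hv += (ref[0] - x) * (best_y - y)
            best_y = y
    return hv


def hv_curve(queries, ref=(1.0, 1.0)):
    """queries: (cost, on_ground_truth, mce, dsp) tuples in evaluation order.

    Returns (cumulative_cost, hv) after each query. Every query adds to the
    cost, but only ground-truth queries enter the approximated Pareto front.
    """
    curve, seen, spent = [], [], 0.0
    for cost, on_ground_truth, mce, dsp in queries:
        spent += cost
        if on_ground_truth:
            seen.append((mce, dsp))
        curve.append((spent, hypervolume_2d(seen, ref)))
    return curve


# Toy sequence: expensive query, cheap query, expensive query.
queries = [
    (1.0, True, 0.5, 0.5),
    (0.5, False, 0.25, 0.25),
    (1.0, True, 0.25, 0.75),
]
print(hv_curve(queries))
```

The cheap query in the middle moves the curve to the right without raising it, which is the behaviour the cost-effectiveness plots are meant to expose. Passing measured query times instead of nominal costs as the first tuple element gives the runtime-based variant discussed in Sect. 5.3.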

Result 4.

HPO of MLP on the four datasets: successive halving (i.e., autogluon-FairBO) is less cost-effective than multi-fidelity and multiple information source BO (i.e., BoTorch-MOMF and FanG-HPO).

Although autogluon-FairBO starts from higher initial values of HV (due to the different type of initialization of the approaches, as described in the experimental protocol), both BoTorch-MOMF and FanG-HPO are able to overtake it within a small cumulative nominal cost (Fig. 2).

Fig. 2

Cost-effectiveness of the three BO-based approaches for bi-objective HPO of an MLP classifier over 10 independent runs, separately for the four datasets

Result 5.

The “pathological” behaviour of RF (previously reported in Result 2) affects all the three BO-based methods.

As depicted in Fig. 3, although BoTorch-MOMF and FanG-HPO are both able to improve with respect to their initial HV values, they could not close the gap with autogluon-FairBO (except for BoTorch-MOMF on the GERMANCREDIT dataset). Moreover, on average, the improvement with respect to the initial HV value is not so relevant, for all the approaches. The main motivation underlying these two issues is the degenerate shape of the associated approximated Pareto front for RF (as shown in the previous subsection).

Fig. 3

Cost-effectiveness of the three BO-based approaches for bi-objective HPO of an RF classifier over 10 independent runs, separately for the four datasets

Result 6.

HPO of XGB on the four datasets: successive halving (i.e., autogluon-FairBO) is less cost-effective than multi-fidelity and multiple information source BO (i.e., BoTorch-MOMF and FanG-HPO).

Basically, this result is aligned with what was previously obtained on MLP: the multi-fidelity and the multiple information source BO approaches (i.e., BoTorch-MOMF and FanG-HPO) are more cost-effective than the successive halving mechanism of autogluon-FairBO (Fig. 4).

Fig. 4

Cost-effectiveness of the three BO-based approaches for bi-objective HPO of an XGB classifier over 10 independent runs, separately for the four datasets

Result 7.

HPO of SVM on the four datasets: FanG-HPO is the most cost-effective approach.

Although there is no clear winner between autogluon-FairBO and BoTorch-MOMF, FanG-HPO is always among the most cost-effective approaches, on all the four datasets considered (Fig. 5).

Fig. 5

Cost-effectiveness of the three BO-based approaches for bi-objective HPO of an SVM classifier over 10 independent runs, separately for the four datasets

5.3 Ecological performance profiles

Computing the cost-effectiveness in terms of cumulative nominal query cost does not provide a direct quantification of the energy consumption and, consequently, the carbon footprint of the three BO-based approaches.

As reported in Tornede et al. (2023), although runtime is a poor measure of energy efficiency (basically because it is hardware-dependent), it can be straightforwardly measured on most hardware, contrary to other measures. Moreover, it is a quite practical proxy of the environmental impact whenever additional information, such as the energy consumption of the hardware used (per time unit) and the composition of the energy mix, is not available. Runtime also depends on other factors, such as the degree of parallelism and heterogeneity of the execution environment, but when the same hardware is used for running a set of competing approaches, just like in our case, it can be considered a good proxy of the CO\(_2\) footprint of each competitor, at the location and time it was executed.

According to these considerations, we have decided to redraw the previous cost-effectiveness curves in terms of cumulative query time (i.e., runtime), instead of cumulative nominal query cost. It is also important to remark that, to guarantee a fair comparison, we have only considered the runtime required to evaluate hyperparameter configurations (i.e., query time), ignoring the computational time of the approaches themselves (which is in any case negligible with respect to the query time). When runtime is considered instead of the cumulative nominal query cost, the curves are named ecological performance profiles (Tornede et al., 2023). They are reported in the following Figs. 6, 7, 8 and 9, one for each ML algorithm.

Result 8.

Overall, successive halving (i.e., autogluon-FairBO) proved less “ecological” than multi-fidelity and multiple information source bi-objective HPO (i.e., BoTorch-MOMF and FanG-HPO, respectively).

As can be seen in Figs. 6, 7, 8 and 9, the cumulative query times of BoTorch-MOMF and FanG-HPO are significantly lower than that of autogluon-FairBO. Only in the case of HPO of RF, whose pathological behaviour has already been commented on, is the cumulative query time of BoTorch-MOMF larger than that of autogluon-FairBO.

Result 9.

XGB is the ML algorithm on which performing bi-objective HPO yields the best results; FanG-HPO is the most effective and green method for HPO.

From an ML developer’s and practitioner’s perspective, this is one of the most valuable results of our research. As widely proven, XGB is usually among the best-performing ML algorithms on a number of datasets. Moreover, performing bi-objective HPO on it (i.e., accuracy and fairness) leads to the richest set of Pareto optimal models. Finally, performing HPO of XGB through FanG-HPO results in the best ecological performance profile.

Fig. 6

HPO of MLP on the four datasets: ecological performance profiles of the three BO-based approaches

Fig. 7

HPO of RF on the four datasets: ecological performance profiles of the three BO-based approaches

Fig. 8

HPO of XGB on the four datasets: ecological performance profiles of the three BO-based approaches

Fig. 9

HPO of SVM on the four datasets: ecological performance profiles of the three BO-based approaches

5.4 Additional results and considerations

Finally, we have investigated how frequently every BO-based approach queries the high-fidelity / expensive information source to deal with energy efficiency. The most relevant results are as follows.

Result 10.

On average, FanG-HPO is the approach that most frequently uses the expensive source.

From Table 5 it is easy to notice that FanG-HPO queries the expensive information source (i.e., the ground truth) significantly more frequently than the other approaches. This is quite obvious with respect to autogluon-FairBO: indeed, halving occurs quite frequently and, even if it happens at 9 out of 10 folds, the associated query is not counted as “on the ground truth”. Then, except for HPO of SVM, BoTorch-MOMF shows a quite constant behaviour, that is, querying the high-fidelity source around 40% of the time.

Table 5 Percentage of queries on the ground-truth (i.e., hyperparameter configurations evaluated on the entire dataset): mean and standard deviation on 10 independent runs

It is important to remark that, although FanG-HPO performs more queries on the expensive source, its cumulative query time (i.e., runtime) is not much higher than that of BoTorch-MOMF; actually, it is even smaller in many cases, given the same termination criterion. This means that FanG-HPO is “clever” in using the different sources, as clearly stated in the next, and last, result of this paper.

Result 11.

On average, FanG-HPO is the approach providing the largest sets of Pareto optimal models.

Table 6 reports the percentages of Pareto optimal hyperparameter configurations with respect to the total number of configurations evaluated, averaged over 10 independent runs.

Table 6 Percentage of Pareto optimal hyperparameter configurations: mean and standard deviation on 10 independent runs

6 Conclusions and perspectives

Although it is not the main goal, this paper empirically shows that Fairness-aware AutoML approaches (specifically bi-objective HPO) are more effective than fairness-aware ML algorithms; a more recent and extended experimental campaign is offered by Nguyen et al. (2023).

From an ML developer/practitioner’s point of view, one of the most practical results is that XGB is the best algorithm to address FairML via HPO. On the four datasets considered, the Pareto optimal hyperparameter configurations obtained for XGB dominate, almost completely, those of the other ML algorithms as well as those from two well-known fairness-aware ML algorithms.

As far as Fairness-aware AutoML (i.e., bi-objective HPO) is concerned, the two approaches belonging to the so-called family of multi-fidelity performance measurements methods (Tornede et al., 2023), that is, BoTorch-MOMF and FanG-HPO (proposed in this paper), offer better ecological performance profiles than successive halving (i.e., autogluon-FairBO). Specifically, defining the information sources, as done in BoTorch-MOMF and FanG-HPO, instead of using successive halving, improves the ability to learn from different information sources and to take advantage of that over the sequential optimization process. This consideration is also in line with other results recently reported in Candelieri et al. (2021), where single-objective HPO, based on multiple information source optimization, proved more effective and efficient than FABOLAS (Klein et al., 2017), which instead addresses multi-fidelity by including the size of the dataset as a further hyperparameter to optimize.

Although they belong to the same family, the underlying methodological background of BoTorch-MOMF and FanG-HPO is significantly different (as clarified in Sect. 4.2). This is at the basis of the better performance provided, on average, by FanG-HPO. Currently, one of the most important practical advantages is that ML developers can choose between BoTorch-MOMF and FanG-HPO according to their programming skills, specifically coding in Python or R. On the other hand, we are currently working on porting our code to Python in order to evaluate its integration into the BoTorch platform.

It is important to remark that, whenever a real-life decision support system has to be developed, the most appropriate metrics, both in terms of error and fairness, must be carefully chosen depending on the specific problem/dataset. In this study we have selected misclassification error and DSP for all the experiments in order to have a homogeneous experimental setting, even if they might not be the most appropriate choices for each of the datasets. On the other hand, it is also interesting to remark that all three approaches are agnostic to the underlying semantics of the adopted metrics, so they should show similar behaviours when different metrics are used.

Future work will investigate the possibility of also considering cost-aware (aka location-dependent or frugal) optimization, recently proposed in Lee et al. (2020), Candelieri & Archetti (2021a), Luong et al. (2021) and Wu et al. (2021), where the nominal query costs of the sources are not fixed but depend on the hyperparameter configuration to evaluate and can be learned along the optimization process. Moreover, validating FanG-HPO on more realistic datasets and a larger set of ML algorithms will allow us and other research groups to further extend and improve it.

In more detail, the GitHub repository contains:

  • The four pre-processed (fairness) datasets.

  • All the basic R scripts implementing AGP, EHVI, and the core functionalities of FanG-HPO.

  • One R script for each pair dataset - ML algorithm to run FanG-HPO starting from the associated autogluon-FairBO run.

  • Python scripts (called from R scripts) for training MLP, RF, XGB, and SVM classifiers, given a specific configuration of their hyperparameters.

  • All the results obtained with FanG-HPO: https://drive.google.com/drive/folders/14o7FbZAwUWfJn2QHn0foqRdHtVCob1tx?usp=sharing

Moreover, from the following Google Drive folders it is possible to download: