1 Introduction

Nowadays, Artificial Intelligence (AI) is omnipresent in everyday life. Current technological advances allow us to analyze huge amounts of data to generate knowledge that is used in many different ways, e.g., for automatic user recommendations (Cai et al. 2020), image recognition (Phillips et al. 2005; Andreopoulos and Tsotsos 2013), and supporting healthcare-related tasks (Jiang et al. 2017). In general, AI can be seen as a computer technology capable of carrying out functions that traditionally required human intelligence (Ertel 2018). Although learning is a key element in many areas of artificial intelligence, the concept of learning itself is mainly studied in the Machine Learning (ML) subfield. According to Mitchell (1997), “a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E”. ML algorithms and their parameters must be carefully configured to make the most of the data. The parameters that need to be specified before training the algorithm are usually referred to as hyperparameters: they influence the learning process, but they are not optimized as part of the training algorithm.

The impact of these hyperparameters on algorithm performance should not be underestimated (Kim et al. 2017; Kong et al. 2017; Singh et al. 2020; Cooney et al. 2020); yet, their optimization (hereafter referred to as hyperparameter optimization or HPO) is a challenging task, as traditional optimization methods are often not applicable (Luo 2016). Indeed, classic convex optimization methods such as gradient descent tend to be ill-suited for HPO, as the measure to optimize is usually a non-convex and non-differentiable function of the hyperparameters (Stamoulis et al. 2018; Parsa et al. 2019). Furthermore, the hyperparameters to optimize may be discrete, categorical, and/or continuous (typical hyperparameters for an Artificial Neural Network (ANN), for instance, are the number of layers, the number of neurons per layer, the type of optimizer, and the learning rate). The search space can also contain conditional hyperparameters; e.g., the hyperparameters to tune in a support vector machine algorithm depend on the type of kernel used. Finally, the time needed to train a machine learning model with a given hyperparameter configuration on a given dataset may already be substantial, particularly for moderate to large datasets; as a typical HPO algorithm requires many such training cycles, the algorithm itself needs to be computationally efficient to be useful in practice.

HPO should not be confused with the more general topic of automatic algorithm configuration (AC), which is much broader in scope (see López-Ibáñez et al. 2016; Hutter et al. 2009 for examples on this topic). In AC, the aim is to find a well-performing parameter configuration for an arbitrary algorithm on a given, finite set of problem instances. In HPO, by contrast, we typically search for a well-performing hyperparameter configuration on a single data set, for a specific task (classification, image recognition, or other). AC is also broader in the sense that the target algorithm does not necessarily carry out a learning process for the task under study; e.g., it also comprises the optimization of solvers and/or metaheuristics.

HPO has gained increasing attention in recent years, probably spurred by the popularity of deep learning algorithms, which have demanding characteristics (e.g., the need for large amounts of data and time to train the models, high model complexity, and a diverse mix of hyperparameter types). Previously, analysts tended to use simple methods to look for the “best” hyperparameter settings. The most basic of these is grid search (Montgomery 2017): the user creates a set of possible values for each hyperparameter, and the search evaluates the Cartesian product of these sets. Although this strategy is easy to implement and easy to understand, its performance is strongly influenced by the number of hyperparameters to optimize and by the (number of) values chosen on the grid. Random search (Bergstra and Bengio 2012) provides an alternative to grid search, and tends to be popular when some of the hyperparameters are more important than others; e.g., learning rate and momentum are critical to guarantee a faster convergence of neural networks (Guo et al. 2020). More advanced optimization methods have also been put forward, such as meta-learning methods (Bui and Yi 2020), neural architecture search (NAS) methods (Jing et al. 2020), multi-fidelity algorithms (such as Freeze-thaw Bayesian optimization (Swersky et al. 2014), the Successive Halving algorithm (Karnin et al. 2013), Hyperband (Li et al. 2017), Bayesian Optimization Hyperband (Falkner et al. 2018), and Multi-task Bayesian optimization (Swersky et al. 2013)), population-based optimization algorithms (such as Population-based training (PBT) (Jaderberg et al. 2017) and Population-based Bandits (PB2) (Parker-Holder et al. 2020)), and reinforcement learning algorithms (such as HypRL (Jomaa et al. 2019) and the model-based reinforcement learning algorithm of Wu et al. (2020)).

So far, these more advanced approaches have largely focused on single-objective HPO problems. Multi-objective optimization is particularly relevant in HPO, though, as several conflicting objectives may be important for the analyst (e.g., the error-based performance of the target ML algorithm, inference time, model size, and energy consumption). Multi-objective HPO should not be confused with multi-task learning (MTL). In multi-objective HPO, we seek to optimize the hyperparameter configuration for a specific task, on a single data set, in view of balancing multiple conflicting objectives. MTL, by contrast, seeks to optimize the HP configuration for multiple tasks, potentially using multiple datasets; while the performance metrics for the individual tasks can be seen as multiple simultaneous objectives, they are not necessarily in conflict.

Our work aims to provide an overview of the state of the art in the field of multi-objective hyperparameter optimization for machine learning algorithms, highlighting the approaches currently used in the literature and the typical performance measures used as objectives, and discussing the remaining challenges in the field. To the best of our knowledge, our work presents the first comprehensive review of these multi-objective HPO approaches. Previous reviews (Hutter et al. 2015; Luo 2016; Yang and Shami 2020; Feurer and Hutter 2019; Talbi 2021) mainly discuss single-objective HPO approaches, often focusing on particular contexts (such as biomedical data analysis), specific target algorithms (such as Deep Neural Networks), or specific approaches (Sequential Model-based Bayesian Optimization, multi-fidelity approaches). While two of the most recent surveys (Feurer and Hutter 2019; Talbi 2021) mention multi-objective HPO in passing, they only list some examples or common strategies relevant to this topic, without discussing the actual approaches.

The remainder of this article is organized as follows. Section 2 discusses the methodology used in the literature search. Section 3 formalizes the concepts of single- and multi-objective hyperparameter optimization and discusses the most commonly used performance measures in HPO algorithms. Section 4 categorizes the existing methods for multi-objective hyperparameter optimization. Section 5 discusses the pros and cons of the algorithms. Finally, Sect. 6 summarizes the findings, highlighting potential improvements and avenues for further research.

2 Methodology

Given the remarkable surge in publications on HPO since 2014, we focused on research published between 2014 and 2020. Figure 1 shows an overview of the search and selection process.

Fig. 1 Overview of search and selection process

We first performed a WoS (Web of Science) search, using the search terms shown in Table 1. Although the main focus is on multi-objective HPO, we also consider the occurrence of the phrase “single objective” in the abstract (AB), as it is common to transform multiple objectives into a single objective by means of a scalarization function. As the use of surrogates is common in single-objective HPO for deep learning networks (e.g., Wistuba et al. 2018; Sjöberg 2019; Victoria and Maragatham 2021), we also searched for articles mentioning the terms “surrogate”, “metamodel”, “deep learning”, “neural networks”, “Gaussian process”, and “kriging” in the abstract. As the choice of hyperparameters is also related to overfitting (Feurer and Hutter 2019), this term was considered as well. Finally, we also include the term “constraint”, as required performance targets (e.g., maximum memory consumption or training time; Stamoulis et al. 2018; Hu et al. 2019) may be presented as constraints in (multi-objective) HPO. We limited our search to publications (including conference proceedings, articles, book chapters, and meeting abstracts) in computer science-related categories (WC).

We subsequently completed the set of papers through (1) scanning suggestions of papers on Google Scholar alerts, and (2) a reference search. We limited the latter to electronic collections only, and solely considered journals/conference proceedings/workshop proceedings that were indexed on WoS (for the WoS journals, we included accepted preprints of forthcoming articles).

Table 1 Search term details

The papers obtained through the WoS and manual search were manually filtered based on the title and abstract, to ensure they were related to the topic of discussion. We filtered out irrelevant papers, such as those that focus on the optimization of industrial processes (Chen et al. 2014), meta-learning (Vanschoren 2019), optimization of internal parameters (Wawrzyński 2017), and papers related to AutoML systems that are not focused on hyperparameter optimization (such as model selection algorithms (van Rijn et al. 2015; Silva et al. 2016) or pure feature selection methods (Hegde and Mundada 2020)). Neural Architecture Search (NAS) is usually considered a distinct category with its own methods and techniques for optimizing the structure of a neural network; hence, articles on NAS were only considered when the problem was addressed as an HPO problem. Articles focusing on more specific aspects of NAS (such as Negrinho et al. 2019) are beyond the scope of this research.

A full read of the articles, combined with a reference search, resulted in a final selection of 48 relevant articles. Most of these articles (about 60%) were published in conferences or workshops, though there has been an increase in scientific journal articles in 2020 (see Fig. 2); these were mainly published in Q1/Q2 journals belonging to the Computer Science field.

Fig. 2 Number of articles that address multi-objective HPO, according to the publication source (2014–2020)

3 HPO: concepts and performance measures

Section 3.1 provides an overview of the basic concepts related to HPO, while Sect. 3.2 discusses the main performance measures (objectives) used in such optimization. Finally, Sect. 3.3 discusses the quality metrics used for comparing the performance of multi-objective HPO algorithms.

3.1 HPO: concepts and terminology

In mathematics and computer science, an algorithm is a finite sequence of well-defined instructions that, when fed with a set of initial inputs, eventually produces an output. Figure 3 shows that in HPO, the optimization algorithm forms an “outer” shell of optimization instructions; the “inner” optimization refers to the training and cross-validation of the target ML algorithm (e.g., ANN, SVM, etc.). This inner optimization trains the target algorithm for the task at hand (e.g., predicting house prices from a data set, using a set of features). The HPO algorithm, in turn, takes the hyperparameters of the target ML algorithm as input and produces a number of performance measures as output (e.g., RMSE, energy consumption, etc.). The aim of the HPO algorithm is to optimize the set of hyperparameters, in view of obtaining the best possible outcomes for the performance measures considered.

Fig. 3 Example of the interplay between the HPO algorithm and the target ML algorithm (in this case, an ANN for predicting house prices)
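To make this interplay concrete, the following minimal sketch (our own illustration; the dataset, model, and the random-search strategy are assumptions, not taken from any surveyed paper) shows an inner training/validation step wrapped in a simple outer HPO loop, using scikit-learn:

```python
# Minimal sketch (our own illustration): an inner training/validation step
# wrapped in a simple outer HPO loop. Dataset, model, and the random-search
# strategy are assumptions, not taken from any surveyed paper.
import random

from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

X, y = fetch_california_housing(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

def inner_evaluation(cfg):
    """Inner optimization: train the target ML algorithm with one
    hyperparameter configuration and return its validation loss (MSE)."""
    model = MLPRegressor(hidden_layer_sizes=(cfg["n_neurons"],),
                         learning_rate_init=cfg["learning_rate"],
                         max_iter=200, random_state=0)
    model.fit(X_train, y_train)
    return mean_squared_error(y_valid, model.predict(X_valid))

# Outer HPO loop: here a plain random search over the configuration space.
best_cfg, best_loss = None, float("inf")
for _ in range(20):
    cfg = {"n_neurons": random.choice([16, 32, 64, 128]),
           "learning_rate": 10 ** random.uniform(-4, -1)}
    loss = inner_evaluation(cfg)
    if loss < best_loss:
        best_cfg, best_loss = cfg, loss
```

In an actual multi-objective HPO algorithm, the random-search loop would be replaced by one of the search strategies categorized in Sect. 4, and the inner evaluation would return several performance measures (e.g., validation error and model size) rather than a single loss.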

More formally, the single-objective HPO problem can be defined as follows. Consider a target ML algorithm \(\mathcal {A}\) with N hyperparameters, such that the n-th hyperparameter has a domain denoted by \(\Lambda _n\). The overall hyperparameter configuration space is denoted as \(\Lambda =\Lambda _1 \times \Lambda _2 \times \cdots \times \Lambda _N\). A vector of hyperparameters is denoted by \(\varvec{\lambda } \in \Lambda\), and an algorithm \(\mathcal {A}\) with its hyperparameters set to \(\varvec{\lambda }\) is denoted by \(\mathcal {A}_{\varvec{\lambda }}\). In HPO, the available data are split into a training set, a validation set, and a test set. The learning process of the algorithm takes place on the training set (\(\mathcal {D}_{train}\)) and is validated on the validation set (\(\mathcal {D}_{valid}\)). The single-objective HPO problem can then be formalized as (Feurer and Hutter 2019):

$$\begin{aligned} \min _{\varvec{\lambda } \in \Lambda } V(\mathcal {L}\mid \mathcal {A}_{\varvec{\lambda }}, \mathcal {D}_{train}, \mathcal {D}_{valid} ) \end{aligned}$$

where \(V(\mathcal {L} \mid \mathcal {A}_{\varvec{\lambda }}, \mathcal {D}_{train}, \mathcal {D}_{valid} )\) is a validation protocol that uses a loss function \(\mathcal {L}\) to estimate the performance of a model \(\mathcal {A}_{\varvec{\lambda }}\) trained on \(\mathcal {D}_{train}\) and validated on \(\mathcal {D}_{valid}\). Popular choices for the validation protocol \(V(\cdot )\) are holdout validation and cross-validation (see Bischl et al. 2012 for an overview of validation protocols). Without loss of generality, we assume in the remainder of this article that the loss function should be minimized.

The previous definition can be readily extended to multi-objective optimization (see Li and Yao 2019). Consider a multi-objective hyperparameter optimization problem with N hyperparameters and a set \({\textbf{L}}\) containing m performance measures (objective functions). These can reflect the error-based performance of the algorithm, but also other metrics such as algorithm complexity (as detailed later in Sect. 3.2). The multi-objective HPO problem can then be formalized as follows (assuming that all performance measures should be minimized):

$$\begin{aligned} \min _{\varvec{\lambda } \in \Lambda } V({\textbf{L}} \mid \mathcal {A}_{\varvec{\lambda }}, \mathcal {D}_{train}, \mathcal {D}_{valid} ) \end{aligned}$$

Typically, there is a trade-off among the different objectives: for instance, between the performance of a model and training time (increasing the accuracy of a model often requires larger amounts of data and, hence, a higher training time; see e.g., Rajagopal et al. 2020), or between different error-based measures (e.g., between confusion matrix-based measures (Tharwat 2020) of a binary classification problem; see Horn and Bischl 2016). Considering these trade-offs is often crucial: e.g., in medical diagnostics (de Toro et al. 2002), the simultaneous consideration of objectives such as sensitivity and specificity is essential to determine if the machine learning model can be used in practice. The goal in multi-objective HPO is to obtain the Pareto-optimal solutions, i.e., those solutions for which none of the objectives can be improved without negatively affecting any other objective. In the decision space, the set of optimal solutions is referred to as the Pareto set; in objective space, it yields the Pareto front (or Pareto frontier). The Pareto-optimal solutions are also referred to as the non-dominated solutions (Emmerich and Deutz 2018). Ideally, these solutions should be diverse (i.e., spread across the different areas of the Pareto front), while approximating this front as well as possible (i.e., showing convergence to the Pareto front).
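To illustrate the dominance concept, the short sketch below (our own, assuming all objectives are to be minimized) filters a set of evaluated hyperparameter configurations down to its non-dominated subset:

```python
# Minimal sketch (our own illustration), assuming all objectives are minimized.
from typing import List, Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is no worse in every objective and strictly better
    in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Return the non-dominated subset (the empirical Pareto front)."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

# Example: (validation error, model size in MB) for five HP configurations.
observed = [(0.12, 45.0), (0.10, 60.0), (0.15, 20.0), (0.12, 50.0), (0.09, 90.0)]
print(pareto_front(observed))
# -> [(0.12, 45.0), (0.10, 60.0), (0.15, 20.0), (0.09, 90.0)]
```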

In (general) multi-objective optimization problems, the multiple objectives are often scalarized into a single function, such that the problem can be solved as a single-objective problem. Care should be taken, though, when selecting the scalarization approach: e.g., not all approaches are able to detect solutions on non-convex parts of the front (see Miettinen and Mäkelä 2002 for further details on scalarization functions). Scalarization methods have also been applied in multi-objective HPO; see Sect. 4 for further details.
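As an illustration (using the notation introduced above, where \(L_i(\varvec{\lambda })\) is shorthand for the validated value of the i-th performance measure and \(w_i \ge 0\) are user-chosen weights), two frequently used scalarizations are the weighted sum and the augmented Tchebycheff function employed, e.g., by ParEGO (Knowles 2006); the weighted sum cannot recover solutions on non-convex parts of the Pareto front, whereas the augmented Tchebycheff function can:

$$\begin{aligned} F_{ws}(\varvec{\lambda }) = \sum _{i=1}^{m} w_i L_i(\varvec{\lambda }), \qquad F_{tch}(\varvec{\lambda }) = \max _{i=1,\ldots ,m} \big [ w_i L_i(\varvec{\lambda }) \big ] + \rho \sum _{i=1}^{m} w_i L_i(\varvec{\lambda }), \end{aligned}$$

with \(\rho\) a small positive constant.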

3.2 Multi-objective HPO: typical objectives

Tables 2 and 3 show an overview and concise description of the performance measures occurring in the current literature on multi-objective HPO (Table 2 focuses on error-based measures, while Table 3 summarizes the non-error-based measures). These measures will reappear later in Section 4, when we categorize the different multi-objective HPO algorithms. As evident from Table 2, for regression problems, the error-based metrics are commonly based on the squared errors; for classification problems, they are commonly related to the elements of the confusion matrix [True Positives (TP), False Negatives (FN), False Positives (FP), and True Negatives (TN)].

Error-based measures are heavily used in multi-objective HPO, as they reflect how closely the model's outputs match reality. Additionally, model complexity objectives are often included (following Occam's razor principle; Blumer et al. 1987), along with time-based metrics (e.g., training time on embedded devices) and/or (computational) cost objectives. The complexity of a neural network, for instance, is often estimated using the number of parameters (weights of the connections between neurons) (Liang et al. 2019; Lu et al. 2020; Baldeon and Lai-Yuen 2020; Calisto and Lai-Yuen 2020). The number of features can also be used as a complexity measure: see Sopov and Ivanov (2015), Martinez-de Pison et al. (2017), Binder et al. (2020), Faris et al. (2020), Bouraoui et al. (2018). The more features the training algorithm has to consider, the more expensive training becomes; on the other hand, considering fewer features may negatively affect the error-based performance of the algorithm.

Table 2 Error-based measures used in multi-objective HPO algorithms
Table 3 Non error-based performance measures used in multi-objective HPO algorithms

Metrics reflecting model size naturally depend on the target ML algorithm to be optimized (e.g., the number of neurons in a single-layer NN (Juang and Hsu 2014), the number of support vectors in a SVM (Bouraoui et al. 2018), the DNN file size (Shinozaki et al. 2020), or the number of models used (Garrido and Hernández 2019) for ensemble algorithms). Alternatively, the number of floating point operations (FLOPs) in a NN can be used (Wang et al. 2019, 2020; Lu et al. 2020; Chin et al. 2020; Loni et al. 2020). This metric is also used to reflect energy consumption (Han et al. 2015); likewise, the number of parameters in a NN is used as a measure for complexity as well as for model size. Both FLOPs and the number of parameters are sometimes used as memory consumption measures (Laskaridis et al. 2020), and can be combined with a time-based measure (Shah and Ghahramani 2016). Time-based measures can relate to the training phase (Tanaka et al. 2016; Rajagopal et al. 2020; Laskaridis et al. 2020; Lu et al. 2020), the prediction phase (Hernández et al. 2016; Abdolsh et al. 2019; Garrido and Hernández 2019), the inference time of forward passes in ANNs (Kim et al. 2017), or the whole optimization process (Richter et al. 2016).

The increasing computational cost of Deep Learning models generally translates into higher hardware costs. As a result, optimization using both algorithm performance and hardware cost should be considered, especially for edge devices. Hardware-related costs can be measured in different ways; e.g., through energy consumption (Hernández-Lobato et al. 2016) or memory utilization (Chandra and Lane 2016). In many cases, these measures are estimated as a function of the hyperparameters. For instance, Parsa et al. (2019) present an abstract energy consumption model that depends on the neural network architecture (number of layers, number of outputs of each layer, kernel size, etc).

Some objectives encountered in the literature do not fall into any of the categories above. In Table 3, they are grouped into the category “Other” (e.g., diversity measures for ensembles (Kuncheva 2014)).

3.3 Quality metrics for comparing multi-objective HPO algorithms

The surveyed literature presents different metrics to judge and/or compare the strengths and weaknesses of multi-objective HPO algorithms. A first set of quality metrics is related to the resulting Pareto front. Here, hypervolume is the most widely used (Horn and Bischl 2016; Hernández et al. 2016; Shah and Ghahramani 2016; Horn et al. 2017; Garrido and Hernández 2019; Lu et al. 2020): it computes the volume of the objective-space region enclosed by the Pareto front and a user-specified reference point. Binder et al. (2020) compute the generalization dominated hypervolume, which is obtained by evaluating the non-dominated solutions of the validation set on the test set data. Other quality metrics based on the Pareto front are the difference in performance between each solution on the front and the single-objective version of the algorithm (holding the other objectives steady) (Chatelain et al. 2007), the average distance (or Generational Distance) of the front to a reference set (such as the approximated true Pareto front obtained by exhaustive search, see Smithson et al. 2016; or an aggregated front, see Gülcü and Kuş 2021), a coverage measure computed as the percentage of the solutions of an algorithm A dominated by the solutions of another algorithm B (Juang and Hsu 2014; Li et al. 2004), and metrics based on the shape of the Pareto front (Abdolsh et al. 2019) or its diversity (Juang and Hsu 2014; Li et al. 2004). The latter can be computed using the spacing and the spread of the solutions: spacing evaluates the diversity of the Pareto points along a given front (Gülcü and Kuş 2021), whereas spread evaluates the range of the objective function values (see Zitzler et al. 2000).
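For the two-objective (minimization) case, the dominated hypervolume can be computed with a few lines of code; the sketch below (our own, assuming the input is already a non-dominated set) illustrates the principle:

```python
# Minimal sketch (our own illustration) for the two-objective minimization case;
# assumes 'front' is already a non-dominated set of points.
def hypervolume_2d(front, reference):
    """Dominated hypervolume of a 2-objective Pareto front w.r.t. a
    user-chosen reference point (both objectives minimized)."""
    ref_x, ref_y = reference
    volume, prev_y = 0.0, ref_y
    # Sort by the first objective; the second then decreases along the front.
    for x, y in sorted(front):
        volume += (ref_x - x) * (prev_y - y)
        prev_y = y
    return volume

# Example: three trade-off solutions (error, inference time), reference (1.0, 1.0).
print(hypervolume_2d([(0.2, 0.9), (0.4, 0.5), (0.7, 0.1)], (1.0, 1.0)))  # ≈ 0.44
```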

Some authors use performance measures that do not relate to the quality of the obtained front; e.g., execution time (Parsa et al. 2019; Richter et al. 2016; Horn et al. 2017), number of performance evaluations (Parsa et al. 2019), CPU utilization in parallel computer architectures (Richter et al. 2016), measures that were not used as objectives but are evaluated at the Pareto solutions (usually, confusion matrix-based measures for classification problems; see Salt et al. 2019), or measures that are specific to the HPO algorithm used (e.g., the number of new points suggested per batch is used by Gupta et al. (2018) to evaluate the performance of the search executed during batch Bayesian optimization).

4 Multi-objective HPO algorithms: categorization

In this section, we categorize the literature on multi-objective HPO algorithms based on the way in which the algorithms perform the search for the optimal solutions (i.e., the search methodology). We distinguish the following three categories (Fig. 4):

  • Metaheuristic-based optimization algorithms (Sect. 4.1): these algorithms use a metaheuristic to guide the search process, based on the empirically observed input/output observations.

  • Metamodel-based optimization algorithms (Sect. 4.2): in these algorithms, a metamodel is fit to the empirical input/output observations, and an acquisition function is used to search for the optimal HPO configurations.

  • Hybrid algorithms (Sect. 4.3): a metamodel is fit to the input/output observations, and a metaheuristic is used to guide the search for better solutions.

Fig. 4 Multi-objective HPO algorithms: number of articles per category (2014–2020)

4.1 Metaheuristic-based HPO algorithms

Heuristic search attempts to optimize a problem by improving the solution based on a given heuristic function or cost measure (Russell and Norvig 2010). A heuristic search method does not guarantee finding the optimal solution, but aims to find a good or acceptable solution within a reasonable amount of time and memory. Metaheuristics are algorithms that combine heuristics (which are often problem-specific) in a more general framework (Bianchi et al. 2009). Figure 5 summarizes the general procedure of a metaheuristic-based algorithm for multi-objective optimization (MOO). The algorithm generates new solution(s) starting from one or more initial solution(s). Depending on the algorithm, the information gathered during the search so far (e.g., the sampling distribution used by the metaheuristic, the velocity vectors in Particle Swarm Optimization, or the pheromone trails in Ant Colony Optimization) can be updated before the next iteration starts, and/or bad solutions can be discarded. The process is repeated until a stop criterion is met.

Fig. 5 General procedure in metaheuristic-based MOO algorithms

While some metaheuristics start from a single initial solution (e.g., Tabu Search (Glover 1986)), others (referred to as population-based algorithms) start from a set of solutions (e.g., Ant Colony Optimization (Dorigo and Blum 2005) and Evolutionary Algorithms, e.g. Evolution Strategies and Genetic Algorithms (Mitchell 1998)).

For ease of reference, Table 4 gives an overview of the metaheuristic-based algorithms currently used in multi-objective HPO, while Table 5 gives an overview of the experimental comparisons reported in these papers. Clearly, the most popular metaheuristic-based algorithm for multi-objective HPO is the Non-dominated Sorting Genetic Algorithm II (NSGA-II; Deb et al. 2002). This is not surprising, as genetic algorithms have been shown to perform quite well in single-objective HPO settings: see, e.g., Deighan et al. (2021), who showed that they can not only obtain CNN configurations from scratch but also refine state-of-the-art CNNs. NSGA-II builds on the original NSGA algorithm (Srinivas and Deb 1994), yet it is computationally less expensive (a time complexity of \(O(MN^2)\) versus \(O(MN^3)\) for the original algorithm, where M is the number of objectives and N is the population size). Another important difference is the preservation of the best solutions, through an elitist selection according to the fitness and spread of the solutions. Ekbal and Saha (2015) applied NSGA-II to jointly optimize hyperparameters and features, and demonstrated the superiority of the resulting models over others (trained with default hyperparameters, and using all the features included in a dataset). Binder et al. (2020) observed analogous results when optimizing a SVM, kkNN, and XGBoost. Yet, according to the generalization-dominated hypervolume, NSGA-II performed slightly worse than ParEGO, a Bayesian optimization-based approach (see Knowles 2006 for further details). Binder et al. (2020) thus suggest preferring NSGA-II over ParEGO only when model evaluations are cheap and a marginal degradation in performance is acceptable.
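As an illustration of the diversity-preservation mechanism in NSGA-II, the sketch below (our own, following the description in Deb et al. 2002) computes the crowding distance of the solutions in a single non-dominated front:

```python
# Minimal sketch (our own, following Deb et al. 2002) of the crowding-distance
# computation used by NSGA-II to preserve diversity within one non-dominated front.
def crowding_distance(front):
    """front: list of objective vectors belonging to one non-dominated front."""
    n, m = len(front), len(front[0])
    dist = [0.0] * n
    for obj in range(m):
        order = sorted(range(n), key=lambda i: front[i][obj])
        f_min, f_max = front[order[0]][obj], front[order[-1]][obj]
        dist[order[0]] = dist[order[-1]] = float("inf")   # keep boundary solutions
        if f_max == f_min:
            continue
        for k in range(1, n - 1):
            i = order[k]
            dist[i] += (front[order[k + 1]][obj] - front[order[k - 1]][obj]) / (f_max - f_min)
    return dist

# Example with a small (error, model size) front:
print(crowding_distance([(0.09, 90.0), (0.10, 60.0), (0.12, 45.0), (0.15, 20.0)]))
```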

In contrast to NSGA-II, the Multi-Objective Evolutionary Algorithm based on Decomposition (MOEA/D) (Zhang and Li 2007) uses scalarization to solve the multi-objective HPO problem. Both MOEA/D and NSGA-II have been shown to improve the accuracy of the resulting model compared with manual hyperparameter selection (Magda et al. 2017; Calisto and Lai-Yuen 2020). In Baldeon and Lai-Yuen (2020), MOEA/D is compared with a Bayesian Optimization approach (using Gaussian Process Regression with Expected Improvement as acquisition function), for tuning an adaptive convolutional neural network (AdaResU-Net) used for medical image segmentation. The use of MOEA/D resulted in a reduction in the number of parameters to train; the comparison is not really reliable, though, as the Bayesian approach was used as a single-objective optimizer, focusing only on segmentation accuracy and not on model size. The ENS-MOEA/D algorithm proposed by Zhao et al. (2012) presents a further improvement to the original MOEA/D algorithm, by adaptively adjusting the neighborhood size (large neighborhood sizes favor more global search, while smaller sizes lead to more local search). Zhang et al. (2020) apply this method to optimize the hyperparameters of a Variational Mode Decomposition (VMD) procedure, used to pre-process time series for forecasting wind speeds. The authors show that this yields better forecasts, yet they did not compare against other HPO procedures.

Table 4 Overview of Metaheuristic-based HPO algorithms
Table 5 Experimental comparisons reported in the literature on metaheuristic-based HPO algorithms

The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) (Hansen et al. 2003) is a population-based metaheuristic that differs from Genetic Algorithms in the use of a fixed-length real-valued vector as a gene (instead of the typical vector of binary components), and of a multivariate Gaussian distribution to generate new solutions. Multi-objective CMA-ES can be formulated by using the dominance of solutions to redefine the ranking function that determines the best solutions found so far (now a Pareto front rather than a single solution) (Tanaka et al. 2016; Qin et al. 2017; Shinozaki et al. 2020). Shinozaki et al. (2020) optimize DNN-based spoken language systems using this approach; the resulting networks had lower word error rates and were smaller than the networks designed by NSGA-II. Additionally, multi-objective CMA-ES generated smaller networks than the one obtained with single-objective CMA-ES (using only the error-based measure as objective). In our opinion, though, this last comparison does not make much sense, since network size did not appear as an objective in the single-objective setting.

Analogous to Genetic Algorithms, Particle Swarm Optimization (PSO) (Eberhart and Kennedy 1995) works with a population of candidate solutions, known as particles. Each particle is characterized by a velocity and a position. The particles search for the optimal solutions by continuously updating their position and velocity. Their movement is influenced not only by their own local best-known position but is also guided toward the best-known position found by other particles in the search space. A multi-objective PSO algorithm (OMOPSO) was developed by Sierra and Coello (2005), using Pareto dominance and crowding distance to filter out the best particles. It employs different mutation operators which act on subsets of the swarm, and applies the \(\epsilon\)-dominance concept (see Laumanns et al. 2002 for more details) to fix the size of the set of final solutions produced by the algorithm.
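For reference, the particle updates take the standard form below (our notation: \(\omega\) is the inertia weight, \(c_1, c_2\) are acceleration coefficients, and \(r_1, r_2\) are uniform random numbers in [0, 1]); in OMOPSO, the global guide \(\varvec{g}\) is selected from the archive of non-dominated particles:

$$\begin{aligned} \varvec{v}_i^{t+1} = \omega \varvec{v}_i^{t} + c_1 r_1 (\varvec{p}_i - \varvec{x}_i^{t}) + c_2 r_2 (\varvec{g} - \varvec{x}_i^{t}), \qquad \varvec{x}_i^{t+1} = \varvec{x}_i^{t} + \varvec{v}_i^{t+1}, \end{aligned}$$

where \(\varvec{x}_i^{t}\) and \(\varvec{v}_i^{t}\) are the position and velocity of particle i at iteration t, and \(\varvec{p}_i\) is its best-known position.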

Strength Pareto Evolutionary Algorithm II (SPEA-II) (Zitzler et al. 2001) adds several improvements to the original SPEA algorithm presented by Zitzler and Thiele (1999). Loni et al. (2019) used the algorithm to optimize six hyperparameters of a CNN, yielding more accurate and less complex networks than could be obtained with hand-crafted networks, or with NAS algorithms.

Differential Evolution (DE) (Storn and Price 1997) is similar to Genetic Algorithms but differs in the way in which the solutions are coded (using real vectors instead of binary-coded ones) and, consequently, in the way in which the evolutionary operators are applied. Multi-Objective Differential Evolution (MODE) (Babu and Gujarathi 2007) selects the non-dominated solutions to generate new solutions on each iteration. To reduce the computational effort while maintaining accuracy, a memetic adaptive DE method (MADE) was developed by Li et al. (2019). DE depends significantly on its control parameter settings. Therefore, MADE uses a historical memory of successful control parameter settings to guide the selection of future control parameter values (Tanabe and Fukunaga 2013). Additionally, a local search method (e.g., the Nelder-Mead simplex method (NMM) (Li et al. 2019), or chaotic local search (Pathak et al. 2020)) is employed to refine the solutions, and a ranking-based elimination strategy (using non-dominated and crowding distance sorting) is proposed to maintain the most promising solutions.

Ant Colony Optimization (ACO) (Dorigo et al. 1996) is inspired by the behavior of real ants; the basic idea is to model the HPO problem as the search for a minimum cost path in a graph. ACO algorithms can be applied to solve multi-objective problems, and may differ in three respects (Alaya et al. 2007): (1) the way solutions are built, using either a single pheromone structure for an aggregation of several objectives or a separate pheromone structure for each objective (Iredi et al. 2001; Gravel et al. 2002); (2) the way in which the solutions are used to update the pheromone information (Iredi et al. 2001; Barán and Schaerer 2003); and (3) the incorporation of existing problem-specific knowledge into the transition rule that defines how to create new solutions from existing ones (Gravel et al. 2002; Doerner et al. 2004). The latter is included in a multi-objective version of ACO (MO-RACACO, Hsu and Juang 2013) for Fuzzy Neural Network (FNN) optimization (Juang and Hsu 2014). The results showed that MO-RACACO outperformed other population-based MO algorithms (MO-EA, Juang 2002; and MO-ACOr, Socha and Dorigo 2008) in terms of the coverage measure obtained, yet it did not always obtain the best diversity values.

Simulated annealing (SA) is a probabilistic technique for finding the global optimum of a single-objective problem (Kirkpatrick et al. 1983). Gülcü and Kuş (2021) applied a multi-objective approach (MOSA) to optimize 14 hyperparameters of a CNN. The algorithm selects new solutions based on their relative merit (measured by the dominance relationship) w.r.t. the current solutions.

The Nelder-Mead simplex method (NMM) (Olsson and Nelson 1975) has been applied by Albelwi and Mah (2016) to optimize seven hyperparameters for a CNN. As NMM is a single-objective optimization procedure, the objectives need to be scalarized (the authors used a weighted sum approach). NMM is a local optimization procedure, so it may get stuck in a local minimum. This may be avoided by running the algorithm from different starting points, which increases the probability of reaching the global minimum. Alternatively, modifications to the algorithm have been proposed (as in McKinnon 1998) that allow the algorithm to escape from local minima, yet at the cost of a large number of iterations.

4.2 Metamodel-based HPO algorithms

Training a machine learning algorithm can be computationally expensive, due to, e.g., the target algorithm's structure (e.g., deep learning models), the amount and complexity of the data to process, resource limitations (execution time, memory and energy consumption, etc.), and/or the type of training algorithm used. Therefore, different HPO approaches have been developed that employ less expensive models (referred to as metamodels or surrogate models) to emulate the evaluation of the real performance functions. The resulting algorithms are also referred to as Efficient Global Optimization (EGO) or Bayesian Optimization (BO) algorithms, and use an acquisition function or infill criterion to guide the search. Figure 6 summarizes the main steps in such an algorithm.

Fig. 6 Generic optimization procedure in metamodel-based MOO algorithms

The optimization starts with a set of initial points (input/output observations) to train the metamodel. Next, the acquisition function is used to select one or more new points (infill points) to be evaluated. The use of this acquisition function is a key element in the search (approaches that combine metamodels with metaheuristic search are referred to as hybrid methods, and are discussed in Sect. 4.3). The metamodel is updated with this new information (adding the new I/O observations to the initial set), and the procedure continues until a stopping criterion is met.

For ease of reference, Table 6 gives an overview of the metamodel-based algorithms currently used for multi-objective HPO, while Table 7 gives an overview of the experimental comparisons reported in this part of the literature. As evident from Table 6, most multi-objective HPO articles use a Gaussian Process (GP) metamodel. GPs use a covariance function, or kernel, to compute the spatial correlation among several output observations for a given performance measure (i.e., a given objective of the HPO algorithm; see Fig. 3). In this approach, it is assumed that HPO input configurations that differ only slightly from one another (i.e., they are close to each other in the search space) are strongly positively correlated w.r.t. their outputs; as the configurations are further apart in the search space, the correlation dies out. The choice of the kernel in a GP is important, as it determines the shape of the assumed correlation function. In general, the most common kernels used in GP-based metamodels are the Gaussian kernel and the Matérn kernel (Ounpraseuth 2008). Using the kernel, the analyst can not only predict the estimated outputs (i.e., in our case, the performance measures) at non-observed input locations (i.e., hyperparameter configurations), but can also estimate the uncertainty on these output predictions. Both the predictions and their uncertainty are reflected in the acquisition function to search for new hyperparameter settings. We refer the reader to Rojas-Gonzalez and Van Nieuwenhuyse (2020) for a detailed review of acquisition functions, for (general, non-HPO related) single and multi-objective optimization problems.
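For completeness, the two kernels just mentioned take the following standard forms (with \(r = \Vert \varvec{\lambda } - \varvec{\lambda }'\Vert\), and with \(\sigma ^2\) and \(\ell\) denoting the signal variance and length-scale parameters of the GP itself); the length-scale governs how quickly the correlation between configurations dies out with their distance in the search space:

$$\begin{aligned} k_{SE}(\varvec{\lambda }, \varvec{\lambda }') = \sigma ^2 \exp \left( -\frac{r^2}{2\ell ^2}\right) , \qquad k_{M5/2}(\varvec{\lambda }, \varvec{\lambda }') = \sigma ^2 \left( 1 + \frac{\sqrt{5}\,r}{\ell } + \frac{5r^2}{3\ell ^2}\right) \exp \left( -\frac{\sqrt{5}\,r}{\ell }\right) . \end{aligned}$$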

Table 6 Overview of Metamodel-based HPO algorithms
Table 7 Experimental comparisons reported in the literature on metamodel-based MO HPO algorithms

Table 6 also shows the acquisition functions that have been used so far in multi-objective HPO. Clearly, the most popular one is Expected Improvement (EI; originally proposed by Jones et al. 1998). The EI represents the expected improvement over the best outputs found so far, at an (arbitrary) non-observed input configuration. As EI was originally developed for single-objective problems, it is usually applied in multi-objective problems where the objectives are scalarized. Salt et al. (2019), for instance, optimize a Spiking Neural Network (SNN) using a weighted function of three individual objectives (the accuracy, the sum of squared errors of the membrane voltage signal, and the reward of the spiking trace). Three acquisition functions were studied: EI, Probability of Improvement (POI), and Upper Confidence Bound (UCB). The performance obtained with POI was significantly better than that obtained with EI and UCB, and overall, the BO-based approach required significantly fewer evaluations than evolutionary strategies such as SADE.
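For a GP with posterior mean \(\mu (\varvec{\lambda })\) and standard deviation \(s(\varvec{\lambda })\), and with \(f_{\min }\) the best (scalarized) validation loss observed so far, EI has the well-known closed form (Jones et al. 1998), written here for minimization:

$$\begin{aligned} EI(\varvec{\lambda }) = \big (f_{\min } - \mu (\varvec{\lambda })\big )\, \Phi (z) + s(\varvec{\lambda })\, \phi (z), \qquad z = \frac{f_{\min } - \mu (\varvec{\lambda })}{s(\varvec{\lambda })}, \end{aligned}$$

where \(\Phi\) and \(\phi\) denote the standard normal cumulative distribution and density functions, respectively.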

Another way to use BO in multi-objective HPO is to fit a metamodel to each objective independently. Parsa et al. (2019) use such an approach in their Pseudo Agent-Based multi-objective Bayesian hyperparameter Optimization (PABO) algorithm; they use the dominance rank (based on the predicted values of each objective) as an infill criterion. This evidently yields different infill points for the respective objectives (in their case, an error-based objective and an energy-related objective). The infill point suggested for one objective function is then also evaluated for the other objective function, provided that it is not dominated by any previously analyzed HPO configuration. In this way, the algorithm speeds up the search for Pareto-optimal solutions. The experiments indeed demonstrated that PABO outperforms NSGA-II in terms of speed.

Other authors have studied HPO problems where the performance measures are correlated (Shah and Ghahramani 2016), or where one of the measures is clearly more important than the others (Abdolsh et al. 2019). The algorithm proposed by Shah and Ghahramani (2016) models the correlations between accuracy, memory consumption, and training time of an ANN using a multi-output Gaussian process or Co-Kriging (Liu et al. 2018). The authors propose a modification to the expected hypervolume (EHV) that reflects these correlations; this modified EHV is then used as an acquisition function, preferring the infill point that increases the expected hypervolume of the Pareto front the most. The algorithm is compared to ParEGO (Knowles 2006), random search, and a GP using the original EHV metric. The results suggest that the modified EHV criterion speeds up the optimization, requiring fewer iterations to converge to the Pareto-optimal solutions.

The MOBO-PC algorithm proposed by Abdolsh et al. (2019) adjusts the Expected Hypervolume Improvement (EHI) acquisition function to account for the probability that the new HP configuration satisfies a set of user-defined preference-order constraints. In this way, it manages to focus its search on the Pareto solutions that are most relevant for the user, as opposed to the other algorithms used as a comparison in the paper (PESMO, Hernández et al. 2016; SMS-EGO, Ponweiser et al. 2008; Stepwise Uncertainty Reduction, Picheny 2014; and ParEGO, Knowles 2006), which try to find solutions across the entire Pareto front.

Other acquisition functions used in metamodel-based algorithms are the Lower Confidence Bound (LCB) or Upper Confidence Bound (UCB). These use a (user-defined) confidence bound to focus the search on local areas or explore the search space more globally. Richter et al. (2016) use a multipoint LCB which simultaneously generates q hyperparameter configurations. A GP is used to model the misclassification error and the logarithmic runtime. The results demonstrated an improvement in CPU utilization (and, thus, an increase in the number of hyperparameter evaluations) within the same time budget. Confidence bounds are also used by Chin et al. (2020) to optimize the hyperparameters of Slimmable Neural Networks. The algorithm fits a GP to each individual performance measure, hence obtaining information to compute individual UCBs. These UCBs are then scalarized, and the resulting single objective function is minimized to obtain the next infill point. The proposed algorithm succeeds in reducing the complexity of the NNs studied; yet, the authors did not compare its performance with any other multi-objective HPO algorithms.
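In its simplest form (our notation, analogous to the EI expression above), the lower confidence bound for a minimized objective is:

$$\begin{aligned} LCB(\varvec{\lambda }) = \mu (\varvec{\lambda }) - \kappa \, s(\varvec{\lambda }), \end{aligned}$$

where larger values of the user-defined parameter \(\kappa \ge 0\) put more weight on the predictive uncertainty \(s(\varvec{\lambda })\) and thus favor exploration, while smaller values favor exploitation of the predicted mean.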

The Predictive Entropy Search (PES) criterion is used by multiple authors, as an infill criterion for different algorithms. Hernández et al. (2016) use PESMO (multi-objective PES) to optimize a NN with six hyperparameters, in view of minimizing the prediction error and the training time. PESMO seeks to minimize the uncertainty in the location of the Pareto set. The algorithm is compared with ParEGO, SMS-EGO, and SUR, showing that PESMO gives the best overall results in terms of hypervolume and the number of expensive evaluations required for training/testing the neural network. Garrido and Hernández (2019) use PESMOC (a modified version of PESMO which takes into account constraints) to optimize an ensemble of Decision Trees. The experiments show that PESMOC is able to obtain better results than a state-of-the-art method for constrained multi-objective Bayesian optimization (Feliot et al. 2017), in terms of the hypervolume obtained and the number of evaluations required. Finally, Hernández-Lobato et al. (2016) used PES to design a neural network with three layers. While most of the HPO methods collect data in a coupled way by always evaluating all performance measures jointly at a given input, these authors consider a decoupled approach in which, at each iteration, the next infill configuration is selected according to the maximum value of the acquisition functions across all objectives. The results showed that this approach obtains better solutions (compared to NSGA-II and random search) when computational resources are limited; yet, the trade-offs found among the performance measures may be affected and one of the objectives can turn out to be prioritized over the others.

Random forests (RFs) (Ho 1995) are an ensemble learning method that trains a set of decision trees, each with low computational complexity. Each tree is trained with different samples, taken from the initial set of observations. For classification outputs, the RF uses a voting procedure to determine the decision class; for regression outputs, it returns the average value over the different trees. As with GPs, RFs allow the analyst to obtain an uncertainty estimate for the predicted values. Examples are the quantile regression forests method (Meinshausen and Ridgeway 2006), which estimates the prediction intervals, and the U-statistics approach (Mentch and Hooker 2016). Horn and Bischl (2016) use RFs as a metamodel to optimize the hyperparameters of three ML algorithms: SVM, Random Forest, and Logistic Regression. Using LCB as an acquisition function, the authors show that SMS-EGO and ParEGO outperform random sampling and NSGA-II.

Whereas GP-based approaches model the density function of the resulting outcomes (performance measures) given a candidate input configuration, Tree-structured Parzen Estimators (TPE) (Bergstra et al. 2011) model the probability of obtaining an input configuration, given a condition on the outcomes. TPEs naturally handle not only continuous but also discrete and categorical inputs, which are difficult to handle with a GP. Moreover, TPE also works well for conditional search spaces (where the value of a given hyperparameter may depend on the value of another hyperparameter), and has demonstrated good performance on HPO problems for single-objective optimization (Bergstra et al. 2013; Thornton et al. 2013; Falkner et al. 2018). While it can, in theory, also be applied to multi-objective settings by scalarizing the performance measures, Chandra and Lane (2016) obtained disappointing results when comparing this approach with random sampling, GP and Genetic Algorithms for optimizing an Augmented Tchebycheff scalarized function (Miettinen 2012) (using fixed weights) of three performance measures for ANNs: GP performed best, while TPE performed worst. Unfortunately, the authors reported the performance based solely on the scalarized value of the three performance measures; they did not report on any other quality metrics, such as hypervolume. They also did not discuss the reason for the poor TPE performance, such that it remains unclear whether this is due to the scalarization function, or to the characteristics of the search space. A (non-scalarized) multi-objective version of TPE has been proposed by Ozaki et al. (2020) and is included in the software Optuna (Akiba et al. 2019).

Strikingly, the majority of current HPO algorithms routinely ignore the fact that the obtained performance measures are noisy. The noise can be due to the target ML algorithm itself (when it contains randomness in its procedure, such as a NN that randomly initializes its weights); but even when no such randomness is involved, there will be noise on the outcomes due to the use of k-fold cross-validation during the training of the algorithm. This type of cross-validation is common in HPO: it involves the creation of different splits of the data into a training and validation set. This process is repeated k times; the performance measures of a given hyperparameter combination will thus differ for each split. Current HPO algorithms simply focus on the average performance measures over the different splits during the search for the Pareto-optimal points; the inherent uncertainty on these performance measures is ignored. Horn et al. (2017) are among the few authors to highlight the presence of noise. The paper assumes, though, that the noise is homogeneous (i.e., it does not differ over the search space), and only focuses on different strategies for handling this noise. These strategies are used in combination with the SMS-EGO algorithm (Ponweiser et al. 2008) and compared with the rolling tide evolutionary algorithm (RTEA) (Fieldsend and Everson 2014) and random search. The results show that simply ignoring the noise (by evaluating a given HPO combination only once, and considering the resulting performance measures as deterministic) performs poorly, even worse than a repeated random search. The best strategy is to reevaluate the (most promising) HP settings. According to the authors, this can likely be explained by the fact that the true noise on the performance measures in HPO settings is heterogeneous (i.e., its magnitude differs over the search space). Reevaluation of already observed HP settings is then required to improve the reliability of the observed performance measures. The interested reader is referred to Jalali et al. (2017) for a discussion of the impact of noise magnitude and noise structure on the performance of (general) optimization algorithms.
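The sketch below (our own illustration; the dataset, model, and hyperparameter values are arbitrary) shows this cross-validation-induced noise: the same hyperparameter configuration yields a different score on every split, while most HPO algorithms use only the mean:

```python
# Minimal sketch (our own illustration; dataset, model, and HP values are
# arbitrary): the same hyperparameter configuration yields a different score
# on every cross-validation split, while most HPO algorithms use only the mean.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
cfg = {"C": 1.0, "gamma": "scale"}                # one fixed HP configuration

for seed in range(3):                             # three different 5-fold splits
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(SVC(**cfg), X, y, cv=cv)
    print(f"split seed {seed}: mean accuracy = {scores.mean():.3f}, "
          f"std over folds = {scores.std():.3f}")
```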

Koch et al. (2015) adapt SMS-EGO (Ponweiser et al. 2008) and SExI-EGO (Emmerich et al. 2011) for noisy evaluations, to optimize the hyperparameters of a SVM. The authors again assume that the noise is homogeneous, and compare the performance of both algorithms with different noise handling strategies (the reinterpolation method proposed by Forrester et al. (2006), and static resampling). Both algorithms use the expected hypervolume improvement (EHI) as an infill criterion, though the actual calculation of the criterion differs (causing SExI-EGO to require larger runtimes). The results show that both SMS-EGO and SExI-EGO work well with the reinterpolation method, yielding comparable results in terms of hypervolume.

4.3 Hybrid HPO algorithms

A limited number of papers have combined aspects of metamodel-based and population-based HPO approaches: these are listed in Table 8, which summarizes their main characteristics. Table 9 gives an overview of the experimental comparisons reported in these papers.

Smithson et al. (2016) use an ANN as a metamodel to estimate the performance of the target ML algorithm. The neural network is embedded into a Design Space Exploration (DSE) metaheuristic, and is used to intelligently select new solutions that are likely to be Pareto optimal. The algorithm starts with a random solution, and iteratively generates new solutions that are evaluated with the ANN. DSE decides if the solution should be used to update the ANN knowledge, or should be discarded. Compared with manually designed networks from the literature, the proposed algorithm yields results with nearly identical performance, while reducing the associated costs (in terms of energy consumption).

The algorithm proposed by Martinez-de Pison et al. (2017) combines HPO with feature selection (as opposed to other algorithms, e.g., Ekbal and Saha 2015; León et al. 2019; Guo et al. 2019). First, a GP (with UCB as an acquisition function) is used to obtain the best HPO setting (according to the RMSE), considering the full set of features. Next, a variant of GA (GA-PARSIMONY, Sanz-García et al. 2015) is used to select the best features of the problem, given the hyperparameters obtained in the first step. In this way, the final model has high accuracy and lower complexity (i.e., fewer features), and optimization time is significantly reduced. In our opinion, however, this approach is still suboptimal, as the two optimization problems (HPO and feature selection) are solved sequentially, instead of jointly. Calisto and Lai-Yuen (2021) use an evolutionary strategy combined with a Random Forest metamodel, to optimize 10 hyperparameters of a CNN. At the beginning of the optimization, the algorithm updates the population of solutions using the evolutionary strategy; only after a number of iterations is the selection of new candidates guided by the RF, which is updated each time with all new Pareto front solutions. The final networks found by the algorithm perform better than (or on par with) state-of-the-art architectures, while the size of the architectures and the search time are significantly reduced.

Although most NAS algorithms are out of scope for this survey, we include the work by Lu et al. (2020), as it can be considered an HPO algorithm. The algorithm (NSGANetV2) simultaneously optimizes the architectural hyperparameters and the model weights of a CNN, using a bi-level approach consisting of NSGA-II combined with a metamodel. The metamodel is used to estimate performance measures, which are then optimized by an evolutionary algorithm (such approaches have also been applied successfully to non-HPO settings, see e.g., Jin 2011; Dutta and Gandomi 2020). In the upper level of the optimization, the metamodel is built using an initial set of candidate solutions. In each iteration of the upper level, NSGA-II is executed on the metamodel to detect the Pareto-optimal HP settings (configuration of layers, channels, kernel size, and input resolution of the CNN). At the lower level, the weights of the CNN are trained on a subset of the Pareto-optimal solutions. The metamodel is then updated with the results of the actual performance evaluations. Four different metamodels were studied; Multilayer Perceptron (MLP), Classification and Regression Trees (CART), Radial Basis Functions (RBF), and GP. Given that none of them consistently outperformed the others, the authors propose to select the best metamodel in every iteration. On standard datasets (CIFAR-10, CIFAR-100, and ImageNet), the resulting algorithm matches the performance of state-of-the-art NAS algorithms ( et al. 2019; Mei et al. 2020), but at a reduced search cost.

Table 8 Overview of hybrid HPO algorithms
Table 9 Experimental comparisons reported in the literature on hybrid MO HPO algorithms

5 Multi-objective HPO algorithms: pros and cons

In this section, we discuss the weaknesses and strengths of the different algorithms. We focus on four aspects: (1) the computational complexity of the algorithm, (2) the ability to accommodate high-dimensional input spaces, (3) the ability to handle mixed input spaces, and (4) the ease of use of parallel computation. Unfortunately, none of the papers studied in this review provides explicit details on these aspects. In general, we often observed a surprising lack of detail with respect to many methodological aspects (such as the nature of the hyperparameters being optimized, the nature of the genetic operators and the design of the initial population in metaheuristic-based algorithms, the design of experiments used, the final Pareto-optimal solutions provided by the algorithm, etc.). In many cases, no pseudocode is provided for the algorithm, and detailed descriptions of novel metrics (if any) used to measure the performance of the target ML algorithm are lacking. This lack of detail is likely caused by the fact that most papers aim to solve a particular practical application, such that the hyperparameter optimization itself was not seen as the main contribution of the paper.

Consequently, the discussion in this section remains quite general, and relies largely on the results of our own independent research, based on the information found in methodological papers for the algorithms considered. This information also allowed us to outline rough pseudocodes of the algorithms (presented in Appendix 1). Although we emphasize (again) that these pseudocodes do not necessarily reflect the accurate details of the algorithms, we find them helpful, in particular, to estimate the complexity of the algorithms. For black-box algorithms, this complexity can be measured by means of their worst-case expected running time (Doerr 2020). The running time (or optimization time) of an algorithm for a function f is defined as the number of function evaluations that the algorithm performs until (and including) the evaluation of an optimal solution for f. For HPO algorithms, the running time is largely proportional to the number of training and validation steps performed, as these are the most expensive steps in the HPO procedure. The training and validation steps need to be performed for each HPO configuration studied by the HPO algorithm. Consequently, in what follows, we propose to use the (worst-case) number of HPO configurations evaluated by the algorithm as a proxy for the algorithm's expected worst-case running time. The result is expressed as a function g(n, I, N), which depends on three parameters: (1) the number of initial HP configurations n required to start the optimization (e.g., the size of the initial population in evolutionary algorithms, or the size of a Latin hypercube sample for Bayesian optimization), (2) the number of iterations I allowed during the search, and (3) the number of new HPO configurations N generated per iteration. Table 10 summarizes the results of our analysis.
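As a rough illustration of this proxy (our own simplification; the exact counts differ per algorithm), a population-based metaheuristic that creates N new configurations in each of I generations, and a metamodel-based algorithm that proposes a single infill point per iteration, evaluate approximately

$$\begin{aligned} g_{\text {metaheuristic}}(n, I, N) \approx n + I \cdot N, \qquad g_{\text {metamodel}}(n, I, 1) \approx n + I \end{aligned}$$

HP configurations, respectively.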

Clearly, the number of costly function evaluations in a typical metamodel-based optimization is much lower than in a metaheuristic-based algorithm, as usually only a single new solution is evaluated in each iteration. MADE, the metaheuristic-based algorithm by Pathak et al. (2020), can be particularly expensive, as it performs a chaotic local search to generate N additional solutions for each solution present in the population of a given iteration. However, using a metamodel to reduce the number of HP configurations that need to be evaluated does not guarantee a lower execution time. For instance, the hybrid algorithm GP + GA_Parsimony (Sanz-García et al. 2015) tries to optimize both the hyperparameters and the features used to train the ML model; the running time remains high, however, as the feature selection is performed in a separate phase after the HPO has been performed. This leads to a drastic increase in the number of HP configurations evaluated, compared with other algorithms such as NSGA-II and GP-based optimization.

Parallel computation may be considered to decrease the total execution time of the optimization. For metaheuristic-based algorithms, this is usually implemented by parallelizing the evaluation of the new configurations in each population generation (Durillo et al. 2008; Wang et al. 2018). Parallelization has been reported for metaheuristic-based optimization algorithms such as CMA-ES (Tanaka et al. 2016; Qin et al. 2017), CoDeepNeat (Liang et al. 2019), GA (Deighan et al. 2021), and NSGA-II (Kim et al. 2017); it has also been suggested in (Albelwi and Mah 2016; Baldeon and Lai-Yuen 2020) for DNN optimization. Bayesian optimization approaches, by contrast, are inherently sequential, as they use past observations to determine the next point(s) to sample. Parallelization can still be used to some extent, e.g., in the evaluation of the initial set of configurations, or in batch BO (Richter et al. 2016; Binder et al. 2020; Horn and Bischl 2016). Parallel computations can also be introduced during the training/validation of the ML algorithm (by training/validating the model simultaneously on the different data splits of the cross-validation protocol (Mostafa et al. 2020)), or during the training of the metamodel (e.g., for Random Forests (Chen et al. 2016) and Gaussian Processes (Dai et al. 2014)).
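As a hedged illustration of the first pattern (parallelizing the evaluation of the configurations generated in one iteration), the sketch below evaluates a batch of candidate configurations in separate processes using Python's standard library; train_and_validate is a toy stand-in for the expensive training/validation step.

```python
from concurrent.futures import ProcessPoolExecutor

def train_and_validate(config):
    """Toy stand-in: train the ML model with `config` and return its
    validation objectives, e.g. (error_rate, inference_time)."""
    return (1.0 / (1.0 + config["lr"] * config["layers"]), 0.01 * config["layers"])

def evaluate_generation(configs, max_workers=4):
    # Each configuration is independent, so the expensive evaluations can run
    # concurrently; only the results are collected back in the main process.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(train_and_validate, configs))

if __name__ == "__main__":
    population = [{"lr": 0.01, "layers": 2}, {"lr": 0.001, "layers": 4}]
    print(evaluate_generation(population))
```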

The ability of an algorithm to handle mixed input spaces cannot be taken for granted. For metaheuristic-based optimization procedures, for instance, it requires a proper encoding of the solutions (e.g., the chromosomes in GAs or the particles in PSO), and consequently a reformulation of the evolutionary operators. For algorithms such as ACO, NMA, and CMA-ES, we expect that handling mixed search spaces is not straightforward, given that they were originally designed for a specific type of variables (ACO for discrete variables that can easily be structured in a graph, and NMA and CMA-ES for continuous variables). In metamodel-based optimization approaches using GPs, a proper kernel needs to be used to accommodate mixed input spaces. Metamodel-based approaches that rely on Random Forests or TPE, by contrast, can handle a mix of discrete, categorical, and numerical variables quite naturally.
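As an illustration of what such a mixed space may look like in practice, the sketch below samples random configurations from a space containing continuous, integer, and categorical hyperparameters (the names and ranges are invented for illustration); any evolutionary operator or kernel then needs to be defined consistently for each of these types.

```python
import random

# Hypothetical mixed search space: continuous, integer, and categorical HPs.
SEARCH_SPACE = {
    "learning_rate": ("continuous",  (1e-5, 1e-1)),
    "num_layers":    ("integer",     (1, 8)),
    "optimizer":     ("categorical", ["sgd", "adam", "rmsprop"]),
}

def sample_configuration(space, rng=random):
    """Draw one configuration by sampling each hyperparameter according to its type."""
    config = {}
    for name, (kind, domain) in space.items():
        if kind == "continuous":
            config[name] = rng.uniform(*domain)
        elif kind == "integer":
            config[name] = rng.randint(*domain)
        else:  # categorical
            config[name] = rng.choice(domain)
    return config

print(sample_configuration(SEARCH_SPACE))
```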

To judge the ability of the algorithms to handle high-dimensional search spaces, we relied on the findings of other studies (see the references in Table 10). We categorize the results as poor (meaning that handling high-dimensional search spaces is problematic), good, or unknown (meaning that no discussion of this aspect was found).

Table 10 Analysis of pros and cons for the MO HPO algorithms studied

6 Conclusions and future research

This paper has reviewed the literature on multi-objective HPO algorithms, categorizing the relevant papers into metaheuristic-based, metamodel-based, and hybrid approaches. The literature on MO HPO is not as abundant as that on single-objective HPO; yet, MO HPO is highly relevant in practice. Taking a multi-objective perspective on HPO not only allows the analyst to optimize trade-offs between different performance measures, but may even yield better solutions than the corresponding single-objective HPO problem. For instance, it has been shown that including complexity as an objective in multi-objective HPO does not necessarily compromise the loss-based performance of the ML algorithm on the task for which it is trained; in particular, minimizing the number of features used for training can improve the performance of the ML algorithm (Sopov and Ivanov 2015; Binder et al. 2020; Bouraoui et al. 2018; Faris et al. 2020).

As the field of multi-objective HPO is gaining momentum, it presents diverse opportunities for further research. We present recommendations here, distinguishing between (1) methodological recommendations (focusing on the use of more advanced optimization approaches), and (2) general recommendations (focusing on shortcomings or pitfalls that currently occur in the literature, and that, in our opinion, hamper the reproducibility, usability, and interpretability of the results). The recommendations are outlined in Table 11.

Table 11 Summary of research opportunities for multi-objective hyperparameter optimization

In the current literature, metaheuristic-based HPO approaches are clearly the most popular. This is quite striking, as such approaches require the evaluation of a large number of HP configurations, while training/testing the target algorithm for any given HP configuration is usually the most expensive step in the HPO algorithm (due to, e.g., the k-fold cross-validation, the optimization steps required for the algorithm's internal parameters, the evaluation of potentially expensive performance measures such as energy consumption or inference time, etc.). Further research on hybrid HPO algorithms appears promising here. So far, research on these algorithms remains scarce; yet, one would expect such algorithms to combine the best of both worlds, providing low computational cost (as the metamodel provides inexpensive function evaluations) along with a heuristic search that avoids the challenge of optimizing an acquisition function.

Current results have also demonstrated that using ensembles of optimal HP configurations can yield improvements (Ekbal and Saha 2015; Sopov and Ivanov 2015; Ekbal and Saha 2016; Zhang et al. 2016). Yet, this evidently increases the number of HP evaluations required. In future research, it may be promising to look at ensembles of multiple metamodels (Wistuba et al. 2018; Cho et al. 2020), multiple acquisition functions (Cowen-Rivers et al. 2020), or even multiple optimization procedures (Liu et al. 2020).

Furthermore, multiple opportunities exist to extend recent advanced approaches for single-objective HPO towards multi-objective HPO. Recent research has shown potential benefits in exploiting cheaply available (yet lower-fidelity) information, obtained for instance by evaluating only a fraction of the training data or a small number of iterations. Low-fidelity methods such as bandit-based approaches (Li et al. 2017) have, to the best of our knowledge, not yet been applied in multi-objective HPO. Also, early stopping criteria (Dai et al. 2019) could be considered to ensure a more intelligent use of the available computational budget. This has already been applied in single-objective optimization (Kohavi and John 1995; Provost et al. 1999) by considering the algorithm's learning curve: the training procedure for a given hyperparameter configuration is then stopped when adding further resources (training instances, iterations, training time, etc.) is predicted to be futile. Early stopping criteria have also been used to reduce the overfitting level of the ML algorithm (Makarova et al. 2021). To the best of our knowledge, none of these methodological approaches has been applied so far in multi-objective HPO algorithms.
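As an illustration of the kind of low-fidelity mechanism referred to above, the following is a minimal successive-halving sketch (in the spirit of Karnin et al. 2013, here single-objective for simplicity): every configuration first receives a small budget, and only the better half is promoted to the next, doubled budget; partial_train is a toy stand-in for a partial training run.

```python
import random

def partial_train(config, budget):
    """Toy stand-in: train `config` for `budget` epochs (or on a data fraction
    proportional to `budget`) and return a validation loss."""
    return random.random() / budget  # replace with a real partial evaluation

def successive_halving(configs, min_budget=1, rounds=3):
    budget = min_budget
    survivors = list(configs)
    for _ in range(rounds):
        scores = sorted((partial_train(c, budget), i) for i, c in enumerate(survivors))
        keep = [i for _, i in scores[: max(1, len(scores) // 2)]]
        survivors = [survivors[i] for i in keep]   # promote only the better half
        budget *= 2                                # promoted configurations get more resources
    return survivors

print(successive_halving([{"lr": 10 ** -k} for k in range(1, 9)]))
```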

Finally, apart from the work of Koch et al. (2015) and Horn et al. (2017), the uncertainty in the performance measures is commonly ignored in HPO. These two works mainly explored the impact of different noise handling strategies on the results of existing algorithms, while it may be more beneficial to account for the noise by adjusting the metamodels used and/or the algorithmic approach. Furthermore, they assume homogeneous noise, which is unlikely to be the case in practice. Stochastic algorithms (such as those of Binois et al. 2019; Gonzalez et al. 2020) can potentially be useful to determine the number of (extra) replications dynamically during the optimization, thus ensuring that the computational budget is spent on (re-)evaluating the configurations that yield the most information.

Apart from these methodological recommendations, we also outline some general recommendations. To improve the interpretability of the results, we recommend using individual performance measures as objectives in HPO settings, rather than an aggregate measure such as the F-measure (which combines recall and precision for classification problems; Ekbal and Saha 2015, 2016) or the Area Under the Curve measure (AUC), which combines the False Positive Rate and the True Positive Rate. Such aggregated measures reflect a fixed relationship between the individual measures, which may result in solutions that perform very well on the aggregated measure (for instance, the F-measure) but are suboptimal for the individual measures (recall and precision). Moreover, the aggregation of multiple performance measures into a single objective by means of scalarization should be done carefully, as not all scalarization methods (e.g., the weighted sum) allow the detection of all parts of the Pareto front. The augmented Tchebycheff function (Miettinen 2012), for instance, is recommended when the front contains non-convex areas: the nonlinear term in the scalarization function ensures that these areas can be detected, while the linear term ensures that weakly Pareto optimal solutions are less rewarded (see Miettinen and Mäkelä 2002 for a further discussion of scalarization functions).
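For reference, one common form of the augmented Tchebycheff scalarization (following Miettinen 2012) can be written, for objectives $f_i$, positive weights $w_i$, a reference (ideal) point $z^*$, and a small constant $\rho > 0$, as

\[
\min_{x} \; \max_{i = 1, \dots, k} \big[ w_i \, \lvert f_i(x) - z_i^{*} \rvert \big] \;+\; \rho \sum_{i = 1}^{k} \lvert f_i(x) - z_i^{*} \rvert .
\]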

Furthermore, we noticed a surprising lack of detail in current HPO papers (i.e., in the description of the methodological approaches, the experimental designs, and the corresponding results). To improve the reproducibility of the research and facilitate comparisons among different HPO algorithms, we recommend a clear description and analysis of four basic elements in every future HPO research paper: (1) the characteristics of the search space (type and range of the considered HPs), (2) algorithmic details (accompanied by pseudocode), (3) the description/definition of the performance objectives, and (4) details on the final optimal solutions obtained for the test problems (optimal HP configurations, quality metrics for the Pareto front, etc.).

Finally, we noticed that only about half of the papers studied benchmark the proposed algorithm against other existing algorithms. Such experimental comparisons have substantial added value for the research community. We therefore clearly advocate their inclusion in future multi-objective HPO research.