
Mind the Gap: Measuring Generalization Performance Across Multiple Objectives


Part of the Lecture Notes in Computer Science book series (LNCS, volume 13876)


Modern machine learning models are often constructed taking into account multiple objectives, e.g., minimizing inference time while also maximizing accuracy. Multi-objective hyperparameter optimization (MHPO) algorithms return such candidate models, and the approximation of the Pareto front is used to assess their performance. In practice, we also want to measure generalization when moving from the validation to the test set. However, some of the models might no longer be Pareto-optimal on the test set, which makes it unclear how to quantify the performance of the MHPO method there. To resolve this, we provide a novel evaluation protocol that allows measuring the generalization performance of MHPO methods, and we study its capabilities for comparing two optimization experiments.



  1. In principle, this is agnostic to the capability of the HPO algorithm to consider multiple objectives. Any HPO algorithm (including random search) would suffice since one can compute the Pareto-optimal set post hoc.

  2. The true Pareto front is only approximated because there is usually no guarantee that an MHPO algorithm finds the optimal solution. Furthermore, there is no guarantee that an algorithm can find all solutions on the true Pareto front.

  3. This is caused by a shift in distribution between the validation set and the test set that results from random sampling. The hyperparameter configuration (HPC) might then no longer be optimal because of overfitting.

  4. If the true function values of evaluated configurations cannot be recovered due to budget restrictions, our proposed evaluation protocol can be applied as well to deal with solutions that are no longer part of the Pareto front on the test set.

  5. Distributionally Robust Bayesian Optimization (Kirschner et al., 2020) is an algorithm that could be used in such a setting. The paper introducing it explicitly names AutoML as an application, but it neither demonstrates its applicability to AutoML nor elaborates on how to describe the distribution shift in a way that the algorithm could handle.
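The notes above mention computing the Pareto-optimal set post hoc and solutions that are no longer part of the Pareto front on the test set. The following minimal sketch illustrates both ideas for two minimized objectives. It is an illustration under stated assumptions, not the evaluation protocol proposed in the paper: the configuration names, the objective values, and the helper functions `dominates` and `pareto_set` are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: two objectives, both minimized, e.g. (error, inference time).
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_set(values: Dict[str, Point]) -> List[str]:
    """Names of all configurations that no other configuration dominates."""
    return sorted(
        name
        for name, v in values.items()
        if not any(dominates(w, v) for other, w in values.items() if other != name)
    )


# Hypothetical objective values for four evaluated configurations.
validation = {"a": (0.10, 5.0), "b": (0.20, 2.0), "c": (0.15, 3.0), "d": (0.25, 4.0)}
test = {"a": (0.12, 5.0), "b": (0.21, 2.0), "c": (0.22, 3.0), "d": (0.30, 4.0)}

# Post hoc Pareto-optimal set on the validation data (works for any HPO method).
val_front = pareto_set(validation)

# Re-evaluate only the returned configurations on the test data and check
# which of them are still non-dominated there.
test_front = pareto_set({name: test[name] for name in val_front})
no_longer_optimal = sorted(set(val_front) - set(test_front))

print(val_front, test_front, no_longer_optimal)
```

In this toy data, configuration `c` is Pareto-optimal on the validation set but is dominated by `b` on the test set, which is the situation that makes a naive comparison of validation and test fronts ambiguous.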


  • Benmeziane, H., El Maghraoui, K., Ouarnoughi, H., Niar, S., Wistuba, M., Wang, N.: A comprehensive survey on Hardware-aware Neural Architecture Search. arXiv:2101.09336 [cs.LG] (2021)

  • Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305 (2012)


  • Binder, M., Moosbauer, J., Thomas, J., Bischl, B.: Multi-objective hyperparameter tuning and feature selection using filter ensembles. In: Ceberio, J. (ed.) Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2020), pp. 471–479. ACM Press (2020)


  • Breiman, L.: Random forests. Mach. Learn. J. 45, 5–32 (2001)


  • Chakraborty, J., Xia, T., Fahid, F., Menzies, T.: Software engineering for fairness: a case study with Hyperparameter Optimization. In: Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE (2019)


  • Cruz, A., Saleiro, P., Belem, C., Soares, C., Bizarro, P.: Promoting fairness through hyperparameter optimization. In: Bailey, J., Miettinen, P., Koh, Y., Tao, D., Wu, X. (eds.) Proceedings of the IEEE International Conference on Data Mining (ICDM 2021), pp. 1036–1041. IEEE (2021)


  • Dua, D., Graff, C.: UCI machine learning repository (2017)


  • Elsken, T., Metzen, J., Hutter, F.: Efficient multi-objective Neural Architecture Search via Lamarckian evolution. In: Proceedings of the International Conference on Learning Representations (ICLR 2019) (2019a). Published online:

  • Elsken, T., Metzen, J., Hutter, F.: Neural architecture search: a survey. J. Mach. Learn. Res. 20(55), 1–21 (2019b)


  • Emmerich, M.T.M., Deutz, A.H.: A tutorial on multiobjective optimization: fundamentals and evolutionary methods. Nat. Comput. 17(3), 585–609 (2018)


  • Feffer, M., Hirzel, M., Hoffman, S., Kate, K., Ram, P., Shinnar, A.: An empirical study of modular bias mitigators and ensembles. arXiv:2202.00751 [cs.LG] (2022)

  • Feurer, M., Hutter, F.: Hyperparameter optimization. In: Hutter et al. (2019), chap. 1, pp. 3–38, available for free at

  • Feurer, M., et al.: OpenML-Python: an extensible Python API for OpenML. J. Mach. Learn. Res. 22(100), 1–5 (2021)


  • Gardner, S., et al.: Constrained multi-objective optimization for automated machine learning. In: Singh, L., De Veaux, R., Karypis, G., Bonchi, F., Hill, J. (eds.) Proceedings of the International Conference on Data Science and Advanced Analytics (DSAA 2019), pp. 364–373. IEEE (2019)


  • Gelbart, M., Snoek, J., Adams, R.: Bayesian optimization with unknown constraints. In: Zhang, N., Tian, J. (eds.) Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (UAI 2014), pp. 250–258. AUAI Press (2014)


  • Gonzalez, S., Branke, J., van Nieuwenhuyse, I.: Multiobjective ranking and selection using stochastic Kriging. arXiv:2209.03919 [stat.ML] (2022)

  • Hernández-Lobato, J., Gelbart, M., Adams, R., Hoffman, M., Ghahramani, Z.: A general framework for constrained Bayesian optimization using information-based search. J. Mach. Learn. Res. 17(1), 5549–5601 (2016)


  • Horn, D., Bischl, B.: Multi-objective parameter configuration of machine learning algorithms using model-based optimization. In: Likas, A. (ed.) 2016 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–8. IEEE (2016)


  • Horn, D., Dagge, M., Sun, X., Bischl, B.: First investigations on noisy model-based multi-objective optimization. In: Trautmann, H., et al. (eds.) EMO 2017. LNCS, vol. 10173, pp. 298–313. Springer, Cham (2017).


  • Hutter, F., Kotthoff, L., Vanschoren, J. (eds.): Automated Machine Learning: Methods, Systems, Challenges. Springer, Heidelberg (2019). Available for free at

  • Iqbal, M., Su, J., Kotthoff, L., Jamshidi, P.: Flexibo: Cost-aware multi-objective optimization of deep neural networks. arXiv:2001.06588 [cs.LG] (2020)

  • Karl, F., et al.: Multi-objective hyperparameter optimization - an overview. arXiv:2206.07438 [cs.LG] (2022)

  • Kirschner, J., Bogunovic, I., Jegelka, S., Krause, A.: Distributionally robust Bayesian optimization. In: Chiappa, S., Calandra, R. (eds.) Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020), pp. 2174–2184. Proceedings of Machine Learning Research (2020)


  • Konen, W., Koch, P., Flasch, O., Bartz-Beielstein, T., Friese, M., Naujoks, B.: Tuned data mining: a benchmark study on different tuners. In: Krasnogor, N. (ed.) Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation (GECCO 2011), pp. 1995–2002. ACM Press (2011)


  • Letham, B., Karrer, B., Ottoni, G., Bakshy, E.: Constrained Bayesian optimization with noisy experiments. Bayesian Analysis (2018)


  • Levesque, J.C., Durand, A., Gagne, C., Sabourin, R.: Multi-objective evolutionary optimization for generating ensembles of classifiers in the ROC space. In: Soule, T. (ed.) Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation (GECCO 2012), pp. 879–886. ACM Press (2012)


  • Manning, C., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)


  • Molnar, C., Casalicchio, G., Bischl, B.: Quantifying model complexity via functional decomposition for better post-hoc interpretability. In: Cellier, P., Driessens, K. (eds.) ECML PKDD 2019. CCIS, vol. 1167, pp. 193–204. Springer, Cham (2020).


  • Morales-Hernández, A., Nieuwenhuyse, I.V., Gonzalez, S.: A survey on multi-objective hyperparameter optimization algorithms for machine learning. arXiv:2111.13755 [cs.LG] (2021)

  • Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)


  • Raschka, S.: Model evaluation, model selection, and algorithm selection in machine learning. arXiv:1811.12808 [stat.ML] (2018)

  • Schmucker, R., Donini, M., Zafar, M., Salinas, D., Archambeau, C.: Multi-objective asynchronous successive halving. arXiv:2106.12639 [stat.ML] (2021)

  • Vanschoren, J., van Rijn, J., Bischl, B., Torgo, L.: OpenML: networked science in machine learning. SIGKDD Explor. 15(2), 49–60 (2014)


  • Zitzler, E., Deb, K., Thiele, L.: Comparison of multiobjective evolutionary algorithms: empirical results. Evol. Comput. 8(2), 173–195 (2000)


  • Zitzler, E., Thiele, L., Laumanns, M., Fonseca, C., Fonseca, V.: Performance assessment of multiobjective optimizers: an analysis and review. IEEE Trans. Evol. Comput. 7, 117–132 (2003)




Robert Bosch GmbH is acknowledged for financial support. This research was also partially supported by TAILOR, a project funded by the EU Horizon 2020 research and innovation programme under GA No 952215. The authors of this work take full responsibility for its content.

Author information



Corresponding author

Correspondence to Matthias Feurer.


A Experimental Details


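| Random Forest: hyperparameter name | Search space    | Linear Model: hyperparameter name | Search space            |
| (missing)                          | [gini, entropy] | (missing)                         | [l2, l1, elasticnet]    |
| (missing)                          | [True, False]   | (missing)                         | \([1e-6, 1e-2]\), log   |
| (missing)                          | [0.0, 1.0]      | l1 ratio                          | [0.0, 1.0]              |
| (missing)                          | [2, 20]         | (missing)                         | [True, False]           |
| (missing)                          | [1, 20]         | (missing)                         | \([1e-7, 1e-1]\)        |
| pos_class_weight exponent          | \([-7, 7]\)     | pos_class_weight exp              | \([-7, 7]\)             |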
We provide the random forest and linear model search spaces in Table A. We fit the linear model with stochastic gradient descent, use an adaptive learning rate, and minimize the log loss (please see the scikit-learn (Pedregosa et al., 2011) documentation for a description of these). Because we are dealing with unbalanced data, we consider the class weights as a hyperparameter and tune the weight of the minority (positive) class in the range of \([2^{-7}, 2^7]\) on a log scale (Horn and Bischl, 2016; Konen et al., 2011). To deal with categorical features, we use one-hot encoding. We transform the features for the linear models using a quantile transformer with a normal output distribution.
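Tuning the positive class weight in \([2^{-7}, 2^7]\) on a log scale amounts to searching over the exponent in \([-7, 7]\) and using \(2^{\text{exponent}}\) as the weight. The sketch below shows this for uniform random sampling. The function name `sample_pos_class_weight`, the use of random sampling, the class labels 0 and 1, and the fixed weight of 1 for the negative class are assumptions made for illustration and are not taken from the paper.

```python
import random


def sample_pos_class_weight(
    rng: random.Random, low_exp: float = -7.0, high_exp: float = 7.0
) -> float:
    """Draw the exponent uniformly and return 2 ** exponent as the class weight."""
    exponent = rng.uniform(low_exp, high_exp)
    return 2.0 ** exponent


rng = random.Random(0)
weights = [sample_pos_class_weight(rng) for _ in range(1000)]

# Assumption: label 1 is the positive (minority) class, label 0 keeps weight 1.
class_weights = [{0: 1.0, 1: w} for w in weights]

print(min(weights), max(weights))
```

Sampling the exponent rather than the weight itself gives values below 1 (down-weighting the minority class) and values above 1 (up-weighting it) the same share of the search space.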

We use the German credit dataset (Dua and Graff, 2017) because it is relatively small, which leads to high variance in algorithm performance, and because it is unbalanced. We downloaded the dataset from OpenML (Vanschoren et al., 2014) using the OpenML-Python API (Feurer et al., 2021) as task ID 31, but conducted our own 60/20/20 split. It is a binary classification problem with 30% positive samples. The dataset has 1000 samples and 20 features. Out of the 20 features, 13 are categorical. The dataset contains no missing values.
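For a dataset with 1000 samples, a 60/20/20 split yields 600 training, 200 validation, and 200 test samples. The following sketch shows one way to produce such a split on sample indices. The text does not state how the split was drawn, so the plain shuffled split, the seed, and the function name `split_indices` are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def split_indices(
    n_samples: int,
    fractions: Sequence[float] = (0.6, 0.2, 0.2),
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle the indices 0..n_samples-1 and cut them into train/valid/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(fractions[0] * n_samples)
    n_valid = round(fractions[1] * n_samples)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test


train_idx, valid_idx, test_idx = split_indices(1000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

With only 200 validation and 200 test samples, of which roughly 30% are positive, the estimated objectives of a configuration can differ noticeably between the two sets.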


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Feurer, M., Eggensperger, K., Bergman, E., Pfisterer, F., Bischl, B., Hutter, F. (2023). Mind the Gap: Measuring Generalization Performance Across Multiple Objectives. In: Crémilleux, B., Hess, S., Nijssen, S. (eds) Advances in Intelligent Data Analysis XXI. IDA 2023. Lecture Notes in Computer Science, vol 13876. Springer, Cham.


  • DOI:

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-30046-2

  • Online ISBN: 978-3-031-30047-9

  • eBook Packages: Computer Science, Computer Science (R0)
