
VolcanoML: speeding up end-to-end AutoML via scalable search space decomposition


Abstract

End-to-end AutoML has attracted intensive interest from both academia and industry: it automatically searches for ML pipelines in a space induced by feature engineering, algorithm/model selection, and hyper-parameter tuning. Existing AutoML systems, however, suffer from scalability issues when applied to application domains with large, high-dimensional search spaces. We present VolcanoML, a scalable and extensible framework that facilitates systematic exploration of large AutoML search spaces. VolcanoML introduces and implements basic building blocks that decompose a large search space into smaller ones, and allows users to utilize these building blocks to compose an execution plan for the AutoML problem at hand. VolcanoML further supports a Volcano-style execution model, akin to the one supported by modern database systems, to execute the plan constructed. Our evaluation demonstrates that not only does VolcanoML raise the level of expressiveness for search space decomposition in AutoML, it also leads to actual findings of decomposition strategies that are significantly more efficient than the ones employed by state-of-the-art AutoML systems such as auto-sklearn.


References

  1. Ghoting, A., Krishnamurthy, R., Pednault, E., Reinwald, B., Sindhwani, V., Tatikonda, S., Tian, Y., Vaithyanathan, S.: SystemML: declarative machine learning on MapReduce. In: 2011 IEEE 27th International Conference on Data Engineering, pp. 231–242. IEEE (2011)

  2. Boehm, M., Antonov, I., Baunsgaard, S., Dokter, M., Ginthör, R., Innerebner, K., Klezin, F., Lindstaedt, S., Phani, A., Rath, B., et al.: SystemDS: a declarative machine learning system for the end-to-end data science lifecycle. arXiv preprint arXiv:1909.02976 (2019)

  3. Ratner, A., et al.: Snorkel: rapid training data creation with weak supervision. In: PVLDB (2017)

  4. Wu, R., Chaba, S., Sawlani, S., Chu, X., Thirumuruganathan, S.: ZeroER: entity resolution using zero labeled examples. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 1149–1164 (2020)

  5. Baylor, D., Breck, E., Cheng, H.T., Fiedel, N., Foo, C.Y., Haque, Z., Haykal, S., Ispir, M., Jain, V., Koc, L., et al.: TFX: a TensorFlow-based production-scale machine learning platform. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1387–1395 (2017)

  6. Breck, E., Polyzotis, N., Roy, S., Whang, S., Zinkevich, M.: Data validation for machine learning. In: MLSys (2019)

  7. Wu, W., Flokas, L., Wu, E., Wang, J.: Complaint-driven training data debugging for Query 2.0. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 1317–1334 (2020)

  8. Nakandala, S., Kumar, A., Papakonstantinou, Y.: Incremental and approximate inference for faster occlusion-based deep CNN explanations. In: Proceedings of the 2019 International Conference on Management of Data, pp. 1589–1606 (2019)

  9. Nakandala, S., Zhang, Y., Kumar, A.: Cerebro: a data system for optimized deep learning model selection. Proc. VLDB Endow. 13(12), 2159–2173 (2020)


  10. Vartak, M., et al.: ModelDB: a system for machine learning model management. In: HILDA (2016)

  11. Zaharia, M., et al.: Accelerating the machine learning lifecycle with MLflow. IEEE Data Eng. Bull. (2018)

  12. De Sa, C., Ratner, A., Ré, C., Shin, J., Wang, F., Wu, S., Zhang, C.: DeepDive: declarative knowledge base construction. ACM SIGMOD Record 45(1), 60–67 (2016)


  13. Rekatsinas, T., Chu, X., Ilyas, I.F., Ré, C.: HoloClean: holistic data repairs with probabilistic inference. Proc. VLDB Endow. 10(11) (2017)

  14. Krishnan, S., Wang, J., Wu, E., Franklin, M.J., Goldberg, K.: ActiveClean: interactive data cleaning for statistical modeling. Proc. VLDB Endow. 9(12), 948–959 (2016)


  15. Kraska, T.: Northstar: an interactive data science system. Proc. VLDB Endow. 11(12), 2150–2164 (2018)


  16. Yao, Q., Wang, M., Chen, Y., Dai, W., Li, Y.F., Tu, W.W., Yang, Q., Yu, Y.: Taking human out of learning applications: a survey on automated machine learning. arXiv preprint arXiv:1810.13306 (2018)

  17. Zöller, M.A., Huber, M.F.: Survey on automated machine learning. arXiv preprint arXiv:1904.12054 (2019)

  18. Hutter, F., Kotthoff, L., Vanschoren, J. (eds.): Automated Machine Learning: Methods, Systems, Challenges. Springer, Berlin (2018)


  19. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Efficient and robust automated machine learning. In: Advances in Neural Information Processing Systems, pp. 2962–2970 (2015)

  20. Olson, R.S., Moore, J.H.: TPOT: a tree-based pipeline optimization tool for automating machine learning. In: Automated Machine Learning, pp. 151–160. Springer (2019)

  21. Komer, B., Bergstra, J., Eliasmith, C.: Hyperopt-sklearn: automatic hyperparameter configuration for scikit-learn. In: ICML workshop on AutoML, vol. 9. Citeseer (2014)

  22. Schawinski, K., et al.: Generative adversarial networks recover features in astrophysical images of galaxies beyond the deconvolution limit. MNRAS Letters (2017)

  23. Li, T., Zhong, J., Liu, J., Wu, W., Zhang, C.: Ease.ml: towards multi-tenant resource sharing for machine learning workloads. Proc. VLDB Endow. 11(5), 607–620 (2018)


  24. Liu, S., Ram, P., Bouneffouf, D., Bramble, G., Conn, A.R., Samulowitz, H., Gray, A.G.: An ADMM based framework for AutoML pipeline configuration. In: AAAI, pp. 4892–4899 (2020)

  25. Li, Y., Jiang, J., Gao, J., Shao, Y., Zhang, C., Cui, B.: Efficient automatic CASH via rising bandits. In: AAAI, pp. 4763–4771 (2020)

  26. Li, Y., Shen, Y., Zhang, W., Jiang, J., Ding, B., Li, Y., Zhou, J., Yang, Z., Wu, W., Zhang, C., et al.: VolcanoML: speeding up end-to-end AutoML via scalable search space decomposition. Proc. VLDB Endow. (2021)

  27. Garcia-Molina, H., Ullman, J.D., Widom, J.: Database Systems: The Complete Book, 2nd edn. Prentice Hall Press, Hoboken (2008)


  28. He, X., Zhao, K., Chu, X.: AutoML: a survey of the state-of-the-art. Knowl. Based Syst. 212, 106622 (2021)


  29. Hutter, F., Lücke, J., Schmidt-Thieme, L.: Beyond manual tuning of hyperparameters. KI-Künstliche Intelligenz 29(4), 329–337 (2015)


  30. Thornton, C., Hutter, F., Hoos, H.H., Leyton-Brown, K.: Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 847–855 (2013)

  31. Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305 (2012)


  32. Mohr, F., Wever, M., Hüllermeier, E.: ML-Plan: automated machine learning via hierarchical planning. Mach. Learn. 107(8), 1495–1515 (2018)


  33. Bergstra, J.S., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter optimization. In: Advances in Neural Information Processing Systems, pp. 2546–2554 (2011)

  34. Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. In: International Conference on Learning and Intelligent Optimization, pp. 507–523. Springer (2011)

  35. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems, pp. 2951–2959 (2012)

  36. Eggensperger, K., Feurer, M., Hutter, F., Bergstra, J., Snoek, J., Hoos, H., Leyton-Brown, K.: Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In: NIPS Workshop on Bayesian Optimization in Theory and Practice, vol. 10, p. 3 (2013)

  37. Shahriari, B., Swersky, K., Wang, Z., Adams, R.P., De Freitas, N.: Taking the human out of the loop: a review of Bayesian optimization. Proc. IEEE 104(1), 148–175 (2015)


  38. Vanschoren, J.: Meta-learning: a survey. CoRR (2018). http://arxiv.org/abs/1810.03548

  39. de Sá, A.G., Pinto, W.J.G., Oliveira, L.O.V., Pappa, G.L.: RECIPE: A grammar-based framework for automatically evolving classification pipelines. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2017)

  40. Hutter, F., Hoos, H., Leyton-Brown, K.: An efficient approach for assessing hyperparameter importance. In: 31st International Conference on Machine Learning, ICML 2014 (2014)

  41. Van Rijn, J.N., Hutter, F.: Hyperparameter importance across datasets. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2367–2376 (2018)

  42. Drori, I., Krishnamurthy, Y., Rampin, R., de Paula Lourenco, R., Ono, J.P., Cho, K., Silva, C., Freire, J.: AlphaD3M: machine learning pipeline synthesis. In: AutoML Workshop at ICML (2018)

  43. Chen, B., Wu, H., Mo, W., Chattopadhyay, I., Lipson, H.: Autostacker: a compositional evolutionary learning system. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 402–409 (2018)

  44. Smith, M.J., Sala, C., Kanter, J.M., Veeramachaneni, K.: The machine learning bazaar: harnessing the ML ecosystem for effective system development. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 785–800 (2020)

  45. LeDell, E., Poirier, S.: H2O AutoML: scalable automatic machine learning. In: Proceedings of the AutoML Workshop at ICML, vol. 2020 (2020)

  46. Barnes, J.: Azure machine learning. Microsoft Azure Essentials, 1st ed. Microsoft (2015)

  47. Google: Google Prediction API. https://developers.google.com/prediction (2020)

  48. Liberty, E., Karnin, Z., Xiang, B., Rouesnel, L., Coskun, B., Nallapati, R., Delgado, J., Sadoughi, A., Astashonok, Y., Das, P., et al.: Elastic machine learning algorithms in Amazon SageMaker. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 731–737 (2020)

  49. IBM: IBM Watson Studio AutoAI. https://www.ibm.com/cloud/watson-studio/autoai (2020)

  50. Khurana, U., Turaga, D., Samulowitz, H., Parthasrathy, S.: Cognito: automated feature engineering for supervised learning. In: 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), pp. 1304–1307. IEEE (2016)

  51. Kaul, A., Maheshwary, S., Pudi, V.: AutoLearn: automated feature generation and selection. In: 2017 IEEE International Conference on Data Mining (ICDM), pp. 217–226. IEEE (2017)

  52. Katz, G., Shin, E.C.R., Song, D.: ExploreKit: automatic feature generation and selection. In: 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 979–984. IEEE (2016)

  53. Nargesian, F., Samulowitz, H., Khurana, U., Khalil, E.B., Turaga, D.S.: Learning feature engineering for classification. In: IJCAI, pp. 2529–2535 (2017)

  54. Khurana, U., Samulowitz, H., Turaga, D.: Feature engineering for predictive modeling using reinforcement learning. In: 32nd AAAI Conference on Artificial Intelligence, AAAI 2018 (2018)

  55. Efimova, V., Filchenkov, A., Shalamov, V.: Fast automated selection of learning algorithm and its hyperparameters by reinforcement learning. In: International Conference on Machine Learning AutoML Workshop (2017)

  56. Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., Talwalkar, A.: Hyperband: a novel bandit-based approach to hyperparameter optimization. In: Proceedings of the International Conference on Learning Representations, pp. 1–48 (2018)

  57. Jamieson, K., Talwalkar, A.: Non-stochastic best arm identification and hyperparameter optimization. In: Artificial Intelligence and Statistics, pp. 240–248 (2016)

  58. Falkner, S., Klein, A., Hutter, F.: BOHB: robust and efficient hyperparameter optimization at scale. In: International Conference on Machine Learning, pp. 1437–1446. PMLR (2018)

  59. Li, Y., Shen, Y., Jiang, J., Gao, J., Zhang, C., Cui, B.: MFES-HB: efficient Hyperband with multi-fidelity quality measurements. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 8491–8500 (2021)

  60. Swersky, K., Snoek, J., Adams, R.P.: Multi-task Bayesian optimization. In: Advances in Neural Information Processing Systems, pp. 2004–2012 (2013)

  61. Klein, A., Falkner, S., Bartels, S., Hennig, P., Hutter, F.: Fast Bayesian optimization of machine learning hyperparameters on large datasets. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 528–536 (2017)

  62. Kandasamy, K., Dasarathy, G., Schneider, J., Póczos, B.: Multi-fidelity Bayesian optimisation with continuous approximations. In: International Conference on Machine Learning, pp. 1799–1808. PMLR (2017)

  63. Poloczek, M., Wang, J., Frazier, P.: Multi-information source optimization. In: Advances in Neural Information Processing Systems, pp. 4288–4298 (2017)

  64. Hu, Y.Q., Yu, Y., Tu, W.W., Yang, Q., Chen, Y., Dai, W.: Multi-fidelity automatic hyper-parameter tuning via transfer series expansion. AAAI (2019)

  65. Sen, R., Kandasamy, K., Shakkottai, S.: Noisy blackbox optimization with multi-fidelity queries: a tree search approach. arXiv preprint arXiv:1810.10482 (2018)

  66. Wu, J., Toscano-Palmerin, S., Frazier, P.I., Wilson, A.G.: Practical multi-fidelity Bayesian optimization for hyperparameter tuning. In: Uncertainty in Artificial Intelligence, pp. 788–798. PMLR (2020)

  67. Wistuba, M., Schilling, N., Schmidt-Thieme, L.: Two-stage transfer surrogate model for automatic hyperparameter optimization. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 199–214. Springer (2016)

  68. Golovin, D., Solnik, B., Moitra, S., Kochanski, G., Karro, J., Sculley, D.: Google Vizier: a service for black-box optimization. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1487–1495. ACM (2017)

  69. Feurer, M., Letham, B., Bakshy, E.: Scalable meta-learning for Bayesian optimization using ranking-weighted Gaussian process ensembles. In: AutoML Workshop at ICML (2018)

  70. Microsoft Research: Microsoft NNI. https://github.com/Microsoft/nni (2020)

  71. Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Elibol, M., Yang, Z., Paul, W., Jordan, M.I., et al.: Ray: a distributed framework for emerging AI applications. In: 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 561–577 (2018)

  72. Li, Y., Shen, Y., Zhang, W., Chen, Y., Jiang, H., Liu, M., Jiang, J., Gao, J., Wu, W., Yang, Z., et al.: OpenBox: a generalized black-box optimization service. arXiv preprint arXiv:2106.00421 (2021)

  73. Liaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J.E., Stoica, I.: Tune: a research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118 (2018)

  74. Kanter, J.M., Veeramachaneni, K.: Deep feature synthesis: towards automating data science endeavors. In: 2015 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015, Paris, France, October 19–21, 2015, pp. 1–10. IEEE (2015)

  75. Graefe, G.: Volcano: an extensible and parallel query evaluation system. IEEE Trans. Knowl. Data Eng. (1994)

  76. Levine, N., Crammer, K., Mannor, S.: Rotting bandits. In: Advances in Neural Information Processing Systems, pp. 3074–3083 (2017)

  77. Dechter, R.: Bucket elimination: a unifying framework for probabilistic inference. In: Learning in Graphical Models, pp. 75–104. Springer (1998)

  78. Carøe, C.C., Schultz, R.: Dual decomposition in stochastic integer programming. Oper. Res. Lett. 24(1–2), 37–45 (1999)


  79. Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-box functions. J. Global Optim. 13(4), 455–492 (1998)


  80. Takeno, S., Fukuoka, H., Tsukada, Y., Koyama, T., Shiga, M., Takeuchi, I., Karasuyama, M.: Multi-fidelity Bayesian optimization with max-value entropy search and its parallelization. In: International Conference on Machine Learning, pp. 9334–9345. PMLR (2020)

  81. Wang, Z., Zoghi, M., Hutter, F., Matheson, D., De Freitas, N.: Bayesian optimization in high dimensions via random embeddings. In: 23rd International Joint Conference on Artificial Intelligence (2013)

  82. Liu, C., Zoph, B., Neumann, M., Shlens, J., Hua, W., Li, L.J., Fei-Fei, L., Yuille, A., Huang, J., Murphy, K.: Progressive neural architecture search. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34 (2018)

  83. Vilalta, R., Drissi, Y.: A perspective view and survey of meta-learning. Artif. Intell. Rev. (2002). https://doi.org/10.1023/A:1019956318069


  84. Burges, C.: From RankNet to LambdaRank to LambdaMART: an overview. Learning 11 (2010)

  85. Vanschoren, J., Van Rijn, J.N., Bischl, B., Torgo, L.: OpenML: networked science in machine learning. ACM SIGKDD Explor. Newslett. 15(2), 49–60 (2014)


  86. Bardenet, R., Brendel, M., Kégl, B., Sebag, M.: Collaborative hyperparameter tuning. In: International Conference on Machine Learning, pp. 199–207. PMLR (2013)

  87. Dewancker, I., McCourt, M., Clark, S., Hayes, P., Johnson, A., Ke, G.: A strategy for ranking optimization methods using multiple criteria. In: Workshop on Automatic Machine Learning, pp. 11–20. PMLR (2016)

  88. Dietterich, T.G.: Ensemble methods in machine learning. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2000). https://doi.org/10.1007/3-540-45014-9_1

  89. Caruana, R., Niculescu-Mizil, A., Crew, G., Ksikes, A.: Ensemble selection from libraries of models. In: Proceedings, Twenty-First International Conference on Machine Learning, ICML 2004 (2004). https://doi.org/10.1145/1015330.1015432


Acknowledgements

This work is supported by the National Natural Science Foundation of China (NSFC Nos. 61832001, U1936104), Beijing Academy of Artificial Intelligence (BAAI), and the PKU-Tencent Joint Research Lab. Bin Cui is the corresponding author. Ce Zhang and the DS3Lab gratefully acknowledge the support from the Swiss National Science Foundation (Project Number 200021_184628), Innosuisse/SNF BRIDGE Discovery (Project Number 40B2-0_187132), the European Union Horizon 2020 Research and Innovation Programme (DAPHNE, 957407), the Botnar Research Centre for Child Health, the Swiss Data Science Center, Alibaba, Cisco, eBay, Google Focused Research Awards, and Oracle Labs.

Author information

Correspondence to Yang Li or Bin Cui.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

In this appendix, we provide more details on the background, system design, and implementation.

1.1 AutoML formulations and motivations

1.1.1 Formulations

Definition and notation. There are K candidate algorithms \(\mathcal {A}=\{A^1, ..., A^K\}\). Each algorithm \(A^i\) has a corresponding hyper-parameter space \(\Lambda ^i\). The algorithm \(A^i\) with hyper-parameter configuration \(\lambda \) and new feature set F is denoted by \(A^i_{(\lambda ,F)}\). Given the dataset \(D=\{D_{train}, D_{valid}\}\) of a learning problem, the AutoML problem is to find the joint algorithm, feature, and hyper-parameter configuration \(A^{*}_{(\lambda ^{*},F^{*})}\) that minimizes a loss metric (e.g., the validation error on \(D_{valid}\)):

$$\begin{aligned} A^*_{(\lambda ^*, F^{*})} = {\text {argmin}}_{A^i\in \mathcal {A},\lambda \in \Lambda ^i,F\in \mathcal {F}^i} \mathcal {L}(A_{(\lambda , F)}^i; D), \end{aligned}$$
(14)

where \(\mathcal {F}^i=Gen(A^i, D, \mathbf {op})\) is the feature space of \(A^i\) that can be generated from the raw feature (data) set D, and \(\mathbf {op}\) is the set of available FE operators.
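To make this objective concrete, the following is a minimal sketch of exploring the joint space in Eq. (14) with plain random search; the callables sample_hp, sample_fe, and loss are illustrative assumptions, not part of VolcanoML's API:

    import random

    def cash_random_search(algorithms, sample_hp, sample_fe, loss, n_trials=100):
        # algorithms: the candidate algorithms A^1, ..., A^K
        # sample_hp(A): draws a hyper-parameter configuration from Lambda^i
        # sample_fe(A): draws a feature set from F^i = Gen(A^i, D, op)
        # loss(A, hp, fe): evaluates L(A^i_(lambda, F); D) on the validation set
        best, best_loss = None, float("inf")
        for _ in range(n_trials):
            algo = random.choice(algorithms)
            hp, fe = sample_hp(algo), sample_fe(algo)
            cur = loss(algo, hp, fe)
            if cur < best_loss:
                best, best_loss = (algo, hp, fe), cur
        return best, best_loss

Treating the joint space as one flat space, as this sketch does, is exactly what becomes intractable as the space grows, which is the scalability challenge discussed next.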

Fig. 13: Validation error on pc4 when increasing the number of hyper-parameters in auto-sklearn, given the same time budget

Challenge: ever-growing search space. Enriching the search space can improve performance, since a richer space may contain better configurations. However, an ever-growing search space also significantly increases the complexity of searching for ML pipelines. Existing AutoML systems can usually explore only a very limited number of configurations in a huge search space and thus suffer from low efficiency [25], which hampers their effectiveness. Figure 13 gives a brief example with auto-sklearn, a state-of-the-art AutoML system: its search algorithm cannot scale to a high-dimensional search space [25]. To alleviate this issue, in this paper we focus on developing a scalable AutoML system.

1.1.2 Observations and motivations about AutoML

We now present several important observations that inspired the design of VolcanoML.

Observation 1. The search space can be partitioned according to ML algorithms. The entire search space is the union of the search spaces of the individual algorithms, i.e., \(\Omega = S^1 \cup \cdots \cup S^K\), where \(S^i=(\Lambda ^i \times \mathcal {F}^i)\) is the joint space of features and hyper-parameters of algorithm \(A^i\).

Fig. 14: The performance distribution of ML pipelines constructed from 30 FE and 30 HPO configurations on fri_c1 using Random Forest. For FE configurations, performance increases from top to bottom; for HPO configurations, performance increases from left to right (the deeper the color, the better)

Observation 2. The sub-space of algorithm \(A^i\) can be very large, e.g., in auto-sklearn, \(S^i\) usually includes more than 50 hyper-parameters. When exploring the search spaces via extensive experiments, we observe the following:

  • If hyper-parameter configuration \(\lambda _1\) performs better than \(\lambda _2\), denoted \(\lambda _1 \le \lambda _2\), then it often holds that \((\lambda _1, F) \le (\lambda _2, F)\) for any fixed feature set F;

  • If FE pipeline configuration \(F_1\) performs better than \(F_2\), i.e., \(F_1\le F_2\), then it often holds that \((\lambda , F_1) \le (\lambda , F_2)\) for any fixed hyper-parameter configuration \(\lambda \).

Figure 14 presents an example for these observations. This motivates us to solve the joint FE and HPO problem via alternating optimization. That is, we can alternate between optimizing FE and HPO, and we can fix the FE configuration (resp. HPO configuration) when optimizing for HPO (resp. FE). This alternating manner is indeed similar to how human experts solve the joint optimization problem manually. One obvious advantage of alternating optimization is that each time only a much smaller subspace (\(\Lambda ^i\) or \(\mathcal {F}^i\)) needs to be optimized, instead of the joint space \(S^i=(\Lambda ^i \times \mathcal {F}^i)\).
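A minimal sketch of this alternating scheme follows, assuming black-box proposal functions for each subspace (all names are illustrative, not VolcanoML internals):

    def alternating_search(hp, fe, propose_hp, propose_fe, loss, n_rounds=10):
        # Alternate between HPO and FE: optimize one subspace while the
        # configuration of the other is held fixed.
        best_loss = loss(hp, fe)
        for _ in range(n_rounds):
            for cand in propose_hp(hp):      # HPO step: fe is fixed
                cur = loss(cand, fe)
                if cur < best_loss:
                    hp, best_loss = cand, cur
            for cand in propose_fe(fe):      # FE step: hp is fixed
                cur = loss(hp, cand)
                if cur < best_loss:
                    fe, best_loss = cand, cur
        return hp, fe, best_loss

As Observation 3 below suggests, the two steps need not be scheduled evenly: the subspace to which the algorithm is more sensitive can be visited more frequently.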

Observation 3. The sensitivity of ML algorithms to FE and HPO often differs. Taking Fig. 14 as an example, compared to HPO, FE has a larger influence on the performance of Random Forest on fri_c1; in this case, optimizing FE more frequently can bring more performance improvement.

Observation 4. The above observations motivate the use of meta-learning. We can learn (1) the algorithm performance across ML tasks and (2) the configuration selection of each ML algorithm across tasks. Such meta-knowledge obtained from historical tasks can greatly improve the efficiency of ML pipeline search.

Therefore, a scalable AutoML system should include two basic components: (1) an efficient framework that can navigate a huge search space, and (2) a meta-learning module that can extract knowledge from previous ML tasks and apply it to new tasks.

1.2 VolcanoML components and implementations

1.2.1 Components and search space

Feature engineering. The feature engineering pipeline is shown in Fig. 2. It comprises four sequential stages: preprocessors (compulsory), scalers (5 possible operators), balancers (1 possible operator), and feature transformers (13 possible operators). For each of the latter three stages, VolcanoML picks one operator and then executes the entire pipeline. Table 13 presents the details of each operator. The total number of hyper-parameters for FE is 52.

We follow the design of the feature engineering search space in existing AutoML systems such as auto-sklearn and TPOT. It limits the search space by adopting a fixed pipeline of stages, where each stage is equipped with an operator (featurizer) selected from a pool of featurizers. The pool at each stage is relatively small, so Bayesian optimization can be used to choose a proper featurizer for each stage; when the pool is very large, high-dimensional Bayesian optimization algorithms may work better. Although this fixed-pipeline architecture is not always effective enough for feature engineering in real-world cases, there remains room to explore for conducting feature engineering effectively and efficiently. To support real scenarios, VolcanoML provides APIs for user-defined feature engineering operators, and we recommend that users add domain-specific operators to the search space for better search performance. In addition, users can easily replace the built-in feature engineering component of VolcanoML with other iterative feature engineering methods.

ML algorithms. VolcanoML implements 11 algorithms for classification and 10 algorithms for regression, with a total of 50 and 49 hyper-parameters, respectively. The built-in algorithms include linear models, support vector machines, discriminant analysis, nearest neighbors, and ensembles. Table 12 presents the details.

Ensemble methods. Ensembles that combine predictions from multiple base models have been known to outperform individual models, often drastically reducing the variance of the final predictions [88]. VolcanoML provides four ensemble methods: bagging, blending, stacking, and ensemble selection [89]. During the search process, the top \(N_{\text {top}}\) configurations for each algorithm are recorded and the corresponding models are stored. After the optimization budget is exhausted, the saved models are treated as the base models for the ensemble method. We use ensemble selection as the default method and build an ensemble of size 50.
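For intuition, here is a compact sketch of greedy ensemble selection in the style of Caruana et al. [89], over the stored base models' validation predictions. It assumes binary classification with positive-class probabilities and uses error rate as the metric; VolcanoML would use the configured metric instead:

    import numpy as np

    def ensemble_selection(base_preds, y_valid, ensemble_size=50):
        # base_preds: one validation-set probability vector per stored model.
        # Models are added greedily, with replacement; repeats act as integer
        # weights in the final averaged ensemble.
        chosen = []
        running = np.zeros_like(base_preds[0], dtype=float)
        for k in range(1, ensemble_size + 1):
            errors = [np.mean(np.round((running + p) / k) != y_valid)
                      for p in base_preds]
            best = int(np.argmin(errors))
            chosen.append(best)
            running += base_preds[best]
        return chosen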

Table 12 Hyper-parameters of ML algorithms in VolcanoML. We distinguish categorical (cat) hyper-parameters from numerical (cont) ones. The numbers in the brackets are conditional hyper-parameters
Table 13 Hyper-parameters of FE operators in VolcanoML

1.2.2 Programming interface

Consider a tabular dataset of raw values in a CSV file, named train.csv, where the last column represents the label. We take a classification task as an example. With VolcanoML, only six lines of code are needed for search and model evaluation.

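The listing appears only as an image in the online version; the following is a sketch reconstructed from the surrounding description (the import path is an assumption, and the final predict call is added for completeness):

    from volcanoml import DataManager, Classifier  # assumed import path

    dm = DataManager()
    train_data = dm.load_train('train.csv')  # type inference, imputation, one-hot encoding
    test_data = dm.load_test('test.csv')
    clf = Classifier(time_limit=3600)        # total search budget (seconds assumed)
    clf.fit(train_data)                      # search, evaluation, and ensembling
    predictions = clf.predict(test_data)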

By calling load_train and load_test, the data manager automatically identifies the type of each feature (continuous, discrete, or categorical), imputes missing values, and converts string-like features to one-hot vectors. By calling fit, VolcanoML splits the dataset into folds for training and validation, evaluates various configurations, and builds an ensemble from the individual configurations. For users who need to customize the search process, Classifier provides additional parameters to specify (see the example after this list):

  • time_limit controls the total runtime of the search process;

  • include_algorithms specifies which algorithms are included (if not specified, all built-in algorithms are included);

  • ensemble_method chooses which ensemble strategy to use;

  • enable_meta determines whether to use meta-learning to accelerate the search process;

  • metric specifies the metric used to evaluate the performance of each configuration.
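For instance, a customized invocation might look as follows (all values, including the algorithm identifiers, are illustrative):

    clf = Classifier(time_limit=1800,
                     include_algorithms=['random_forest', 'liblinear_svc'],
                     ensemble_method='ensemble_selection',
                     enable_meta=True,
                     metric='f1')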

Customized components. VolcanoML provides APIs that make it easy to enrich the search space, e.g., with new stages in the FE pipeline, new FE operators, and new ML algorithms. The following is the syntax for defining customized components:

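This listing is likewise an image in the online version; below is an illustrative sketch of what a user-defined algorithm could look like. The fit/predict contract is standard, but the registration call is an assumption about VolcanoML's extension API:

    from sklearn.tree import DecisionTreeClassifier

    class UserDefinedDecisionTree:
        # A user-defined ML algorithm following a fit/predict contract.
        def __init__(self, max_depth=5):
            self.max_depth = max_depth
            self.model = None

        def fit(self, X, y):
            self.model = DecisionTreeClassifier(max_depth=self.max_depth)
            self.model.fit(X, y)
            return self

        def predict(self, X):
            return self.model.predict(X)

    # Registering the component enlarges the search space (name assumed):
    # add_classifier(UserDefinedDecisionTree)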

Note that auto-sklearn supports neither adding a new stage to the FE pipeline nor updating the existing stages. In addition, auto-sklearn cannot add an operator to an existing stage (e.g., adding smote_balancer to the stage balancer), while VolcanoML supports all of these customizations.

1.3 Experiment datasets

In our experiments, we split each dataset into five folds: four are used for training and the remaining one for testing. The 60 OpenML datasets used are presented as follows (in the form of "dataset_name (OpenML id)"):

Classification datasets. kc1 (1067), quake (772), segment (36), ozone-level-8hr (1487), space_ga (737), sick (38), pollen (871), analcatdata_supreme (728), abalone (183), spambase (44), waveform(2) (979), phoneme (1489), page-blocks(2) (1021), optdigits (28), satimage (182), wind (847), delta_ailerons (803), puma8NH (816), kin8nm (807), puma32H (752), cpu_act (761), bank32nh (833), mc1 (1056), delta_elevators (819), jm1 (1053), pendigits (32), mammography (310), ailerons (734), eeg (1471), letter(2) (977), kropt (184), mv (881), fried (901), 2dplanes (727), electricity (151), a9a (A2), mnist_784 (554), higgs (23512), covertype (180).

Regression datasets. stock (223), socmob (541), Moneyball (41021), insurance (A1), weather_izmir (42369), us_crime (315), debutanizer (23516), space_ga (507), pollen (529), wind (503), bank8FM (572), bank32nh (558), kin8nm (189), puma8NH (225), cpu_act (573), puma32H (308), cpu_small (227), visualizing_soil (668), sulfur (23515), rainfall_bangladesh (41539).

Since the datasets insurance and a9a are not available on OpenML, we use A1 and A2 as their IDs instead.
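As a side note, the five-fold split described above can be sketched with scikit-learn as follows (the shuffling, seed, and placeholder data are assumptions):

    import numpy as np
    from sklearn.model_selection import KFold

    X, y = np.random.rand(100, 8), np.random.randint(0, 2, 100)  # placeholder data
    kf = KFold(n_splits=5, shuffle=True, random_state=1)
    train_idx, test_idx = next(iter(kf.split(X)))  # four folds train, one fold test
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]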


Cite this article

Li, Y., Shen, Y., Zhang, W. et al. VolcanoML: speeding up end-to-end AutoML via scalable search space decomposition. The VLDB Journal 32, 389–413 (2023). https://doi.org/10.1007/s00778-022-00752-2
