Hyperparameter tuning of supervised bagging ensemble machine learning model using Bayesian optimization for estimating stormwater quality

  • Original Article
Sustainable Water Resources Management

Abstract

Physically based models (PBMs), such as the stormwater management model (SWMM), require substantial in situ data and expertise to predict water quality in urban watersheds. In recent years, data-driven models have increasingly been used as an alternative for predicting pollutant concentrations, and supervised machine learning (ML) models have been applied to estimate stormwater quality parameters. However, optimizing the structure of such ML models has rarely been considered. This study comprehensively evaluates the optimization of a supervised bagging ensemble ML model for forecasting stormwater quality using an ML-based optimization method, Bayesian optimization (BO). To that end, a bagging ensemble model, namely random forest (RF), was first developed to estimate total suspended solids (TSS) concentration in urban watersheds. Eleven factors, including drainage area, land-use types, impervious area, rainfall depth, runoff volume, and antecedent dry days, were used as predictive features, with data acquired from the National Stormwater Quality Database (NSQD). Four RF hyperparameters were optimized using BO: the number of base estimators, the number of features selected for building each base estimator, the subsample size, and the maximum depth of the base learners. A sensitivity analysis was performed on the ML model and on the BO parameters, including the acquisition function, the number of initial points, and realizations. Results indicated that the accuracy of the RF model depends on all four of these RF hyperparameters. The best RF model performed satisfactorily in both the training and testing steps, obtaining R² values of 0.955 and 0.915, respectively. The study demonstrates the potential of combining RF models with BO for accurately predicting stormwater quality parameters.
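To make two quantities from the abstract concrete, the following is a minimal sketch of the R² metric and of one acquisition function, expected improvement. The abstract does not state which acquisition functions were compared, so expected improvement is an assumption chosen as a common example, not the study's method. The candidate labels and the surrogate means and standard deviations are hypothetical illustration values.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def expected_improvement(mu, sigma, best, xi=0.0):
    """Expected improvement (maximization) of a Gaussian surrogate
    prediction N(mu, sigma^2) over the best score observed so far.
    xi is an optional exploration margin (assumed default 0)."""
    if sigma <= 0.0:
        # No predictive uncertainty: improvement is deterministic.
        return max(mu - best - xi, 0.0)
    z = (mu - best - xi) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best - xi) * cdf + sigma * pdf


# Hypothetical surrogate predictions (mean, std) of validation R^2
# for three untested hyperparameter settings.
candidates = {
    "A": (0.90, 0.01),  # matches the current best, low uncertainty
    "B": (0.88, 0.06),  # slightly lower mean, high uncertainty
    "C": (0.85, 0.02),  # clearly lower mean, low uncertainty
}
best_so_far = 0.90  # hypothetical best validation R^2 observed so far

scores = {
    name: expected_improvement(mu, sd, best_so_far)
    for name, (mu, sd) in candidates.items()
}
next_candidate = max(scores, key=scores.get)
print(next_candidate, scores)
```

In a full BO loop, the selected setting would be evaluated by training the RF model, the surrogate would be refit on the enlarged set of observations, and the step would repeat. If scikit-learn were used (the abstract does not name a library), the four tuned quantities would plausibly correspond to the `n_estimators`, `max_features`, `max_samples`, and `max_depth` arguments of `RandomForestRegressor`.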

Data availability

The datasets generated and/or analyzed during the current study are publicly available on the International Stormwater BMP database (https://bmpdatabase.org/).

Acknowledgements

The author would like to thank the National Science Foundation (NSF), the University of Illinois Chicago, the Department of Civil, Materials, and Environmental Engineering, and Dr. Ahmed Abokifa for their support during his Ph.D. studies. The author also expresses his sincere gratitude to Dr. Ahmed Abokifa for his advice on this research study.

Author information

Contributions

Mohammadreza Moeini wrote the original draft, conceived the data, investigated the manuscript, reviewed the data, edited the manuscript, and supervised the data.

Corresponding author

Correspondence to Mohammadreza Moeini.

Ethics declarations

Conflict of interest

The author declares no conflict of interest.

Ethical approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Moeini, M. Hyperparameter tuning of supervised bagging ensemble machine learning model using Bayesian optimization for estimating stormwater quality. Sustain. Water Resour. Manag. 10, 83 (2024). https://doi.org/10.1007/s40899-024-01064-9

  • DOI: https://doi.org/10.1007/s40899-024-01064-9
