Abstract
Physically based models (PBMs), such as the Storm Water Management Model (SWMM), require a significant amount of in situ data and expertise to predict water quality in urban watersheds. In recent years, data-driven models have increasingly been used as an alternative for predicting pollutant concentrations, and supervised machine learning (ML) models have been applied to estimate stormwater quality parameters. However, optimizing the structure of such ML models has rarely been considered. This study comprehensively evaluates the optimization of a supervised bagging ensemble ML model for forecasting stormwater quality using an ML-based optimization method called Bayesian optimization (BO). To that end, a bagging ensemble model, namely random forest (RF), was first developed for estimating total suspended solids (TSS) concentration in urban watersheds. Eleven factors, including drainage area, land-use types, impervious area, rainfall depth, runoff volume, and antecedent dry days, were used as predictive features in the model, and their data were acquired from the National Stormwater Quality Database (NSQD). The number of base estimators, the number of features selected for building each base estimator, the subsample size, and the maximum depth of the base learners were optimized using BO. A sensitivity analysis was conducted on the ML model and on the BO parameters, including the acquisition function, the number of initial points, and the number of realizations. Results indicated that the accuracy of the RF model depends on all of the RF parameters mentioned above. The performance of the best RF model was satisfactory in both the training and the testing steps: it obtained R² values of 0.955 and 0.915 for training and testing, respectively. The study demonstrated the potential of combining RF models with BO for accurately predicting stormwater quality parameters.
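The tuning loop described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the data are synthetic stand-ins for the eleven NSQD features and the TSS target, the search bounds are illustrative, and the Gaussian-process surrogate with a Matérn kernel, the expected-improvement acquisition function, the random candidate search, and the 3-fold cross-validated R² objective are all assumptions chosen for brevity. The four tuned values map to scikit-learn's `n_estimators`, `max_features`, `max_samples`, and `max_depth`.

```python
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for 11 predictors and a TSS-like target (NOT real NSQD data).
X = rng.uniform(size=(200, 11))
y = 3.0 * X[:, 0] + np.sin(6.0 * X[:, 1]) + 0.1 * rng.normal(size=200)

# Illustrative bounds for: n_estimators, max_features, max_samples, max_depth.
LOW = np.array([10.0, 1.0, 0.3, 2.0])
HIGH = np.array([60.0, 11.0, 1.0, 12.0])


def objective(u):
    """Mean 3-fold cross-validated R^2 of an RF built from a point u in [0, 1]^4."""
    p = LOW + u * (HIGH - LOW)
    model = RandomForestRegressor(
        n_estimators=int(round(p[0])),
        max_features=int(round(p[1])),
        max_samples=min(float(p[2]), 1.0),
        max_depth=int(round(p[3])),
        random_state=0,
    )
    return cross_val_score(model, X, y, cv=3, scoring="r2").mean()


def expected_improvement(candidates, gp, best, xi=0.01):
    """Expected improvement (for maximization) over the best score seen so far."""
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)  # avoid division by zero
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)


def bayes_opt(n_init=5, n_iter=10):
    """Run BO: random initial points, then one EI-selected evaluation per iteration."""
    U = rng.uniform(size=(n_init, 4))
    scores = np.array([objective(u) for u in U])
    for _ in range(n_iter):
        # Refit the surrogate on every hyperparameter setting evaluated so far.
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
        gp.fit(U, scores)
        # Maximize the acquisition function over a random candidate set.
        candidates = rng.uniform(size=(500, 4))
        ei = expected_improvement(candidates, gp, scores.max())
        u_next = candidates[np.argmax(ei)]
        U = np.vstack([U, u_next])
        scores = np.append(scores, objective(u_next))
    best = int(np.argmax(scores))
    return LOW + U[best] * (HIGH - LOW), scores


best_params, history = bayes_opt()
print("best hyperparameters (unrounded):", best_params)
print("best cross-validated R^2:", history.max())
```

Changing `n_init`, the acquisition function, or the random seed and rerunning the loop mirrors, in miniature, the kind of sensitivity analysis on BO settings that the abstract mentions.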
Data availability
The datasets generated and/or analyzed during the current study are publicly available on the International Stormwater BMP database (https://bmpdatabase.org/).
Acknowledgements
The author thanks the National Science Foundation (NSF), the University of Illinois Chicago, the Department of Civil, Materials, and Environmental Engineering, and Dr. Ahmed Abokifa for their support during his Ph.D. studies. The author also expresses his sincere gratitude to Dr. Ahmed Abokifa for his advice on this research study.
Author information
Contributions
Mohammadreza Moeini wrote the original draft, conceived the data, investigated the manuscript, reviewed the data, edited the manuscript, and supervised the data.
Ethics declarations
Conflict of interest
The author declares no conflict of interest.
Ethical approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
About this article
Cite this article
Moeini, M. Hyperparameter tuning of supervised bagging ensemble machine learning model using Bayesian optimization for estimating stormwater quality. Sustain. Water Resour. Manag. 10, 83 (2024). https://doi.org/10.1007/s40899-024-01064-9
DOI: https://doi.org/10.1007/s40899-024-01064-9