Abstract
The water quality index (WQI) is used in many countries and regions as a numeric representation of the condition of water resources. However, computing the WQI involves a host of water quality variables. Although machine learning models have proven to be a promising tool for estimating the WQI with fewer inputs, sufficient data or samples must be collected so that the models can be trained well. This poses a great challenge in places that lack the data collection infrastructure needed to meet the requirements of machine learning models, so data scarcity is a major issue to be tackled. This study covered two major rivers that serve as water intakes in Peninsular Malaysia (Selangor River and Skudai River), where four synthetic data generation methods, namely the conditional tabular generative adversarial network (CTGAN), the tabular variational autoencoder (TVAE), the Gaussian copula (GC) and the copula generative adversarial network (CopulaGAN), were used to synthesise datasets based on the real dataset. Using the pairwise correlation difference (PCD), the Kullback-Leibler divergence (KLD) and the Kolmogorov-Smirnov (KS) test, the best synthetic datasets were selected for the two rivers. CopulaGAN1 and CopulaGAN2 yielded the best small and large synthetic datasets for the Selangor River, scoring the lowest PCD, KLD and KS statistics. For the Skudai River, TVAE1 and TVAE2 were chosen. The real and synthetic datasets were used to train a back-propagation neural network (BPNN) for WQI estimation. Based on various evaluation metrics, increasing the size of the training data with synthetic data was shown to have a positive impact on the performance of the BPNN. The BPNN trained with CopulaGAN2 (Selangor River) and TVAE2 (Skudai River) yielded more accurate estimations than those derived from the actual and smaller datasets.
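As a rough illustration of how the three similarity measures named in the abstract can be computed for a real and a synthetic dataset, the following is a minimal sketch in plain Python. It assumes that the PCD is taken as the Frobenius norm of the difference between the two Pearson correlation matrices and that the KLD is estimated from histograms over shared bins; the study's exact definitions may differ. The function names, the bin count and the toy values are hypothetical and are not taken from the study.

```python
import math
from bisect import bisect_right


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def corr_matrix(columns):
    """Correlation matrix for a list of columns (one list per variable)."""
    k = len(columns)
    return [[pearson(columns[i], columns[j]) for j in range(k)] for i in range(k)]


def pcd(real, synth):
    """Assumed PCD: Frobenius norm of the correlation-matrix difference."""
    cr, cs = corr_matrix(real), corr_matrix(synth)
    k = len(cr)
    return math.sqrt(sum((cr[i][j] - cs[i][j]) ** 2
                         for i in range(k) for j in range(k)))


def kld(real_col, synth_col, bins=10, eps=1e-9):
    """Histogram estimate of KL(real || synthetic) over shared bins."""
    lo = min(min(real_col), min(synth_col))
    hi = max(max(real_col), max(synth_col))
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values coincide

    def hist(col):
        counts = [0] * bins
        for v in col:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        smoothed = [c / len(col) + eps for c in counts]  # avoid log(0)
        total = sum(smoothed)
        return [s / total for s in smoothed]

    p, q = hist(real_col), hist(synth_col)
    return sum(a * math.log(a / b) for a, b in zip(p, q))


def ks_statistic(real_col, synth_col):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(real_col), sorted(synth_col)
    return max(abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
               for v in a + b)


# Toy illustration only: two variables, five samples each (made-up values).
real = [[1.0, 2.0, 3.0, 4.0, 5.0], [2.1, 3.9, 6.2, 8.1, 9.8]]
synth = [[1.2, 1.9, 3.3, 3.8, 5.1], [2.0, 4.2, 5.9, 8.3, 10.1]]

print("PCD:", pcd(real, synth))
for i, (r, s) in enumerate(zip(real, synth)):
    print(f"variable {i}: KLD={kld(r, s):.4f}, KS={ks_statistic(r, s):.4f}")
```

Lower values of all three measures indicate a synthetic dataset that is statistically closer to the real one, which is the selection criterion described in the abstract.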
Highlights
Data were insufficient to train machine learning models well in developing regions.
Synthetic data methods can overcome the data scarcity issue in Malaysia.
CopulaGAN and TVAE outperformed other methods at Selangor River and Skudai River.
BPNN trained with synthetic datasets estimated WQI with higher accuracy.
Data Availability
The datasets generated during and/or analysed during the current study are not publicly available due to obligation to the data provider but are available from the corresponding author on reasonable request.
Acknowledgements
This research was funded by Universiti Tunku Abdul Rahman (UTAR), Malaysia through the Universiti Tunku Abdul Rahman Research Fund (UTARRF) under project number IPSR/RMC/UTARRF/2020-C2/K03. The authors are also grateful to the Department of Environment, Malaysia for its strong support in providing the datasets, which are crucial for such modelling studies.
Funding
This work was supported by Universiti Tunku Abdul Rahman Research Fund (IPSR/RMC/UTARRF/2020-C2/K03). Koo, C. H. has received research support from Universiti Tunku Abdul Rahman (UTAR).
Author information
Contributions
All authors contributed to the study conception and design. Conceptualisation was done by Chai Hoon Koo. Material preparation, data collection and analysis were performed by Wei Di Chan and Jia Yin Pang. The first draft of the manuscript was written by Min Yan Chia. The manuscript was reviewed by Yuk Feng Huang. All authors read and approved the final manuscript.
Ethics declarations
Ethical Approval
Not Applicable.
Consent to Participate
Not Applicable.
Consent to Publish
All authors agreed and gave their consent to publish this manuscript.
Competing Interests
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chia, M.Y., Koo, C.H., Huang, Y.F. et al. Artificial Intelligence Generated Synthetic Datasets as the Remedy for Data Scarcity in Water Quality Index Estimation. Water Resour Manage 37, 6183–6198 (2023). https://doi.org/10.1007/s11269-023-03650-6
DOI: https://doi.org/10.1007/s11269-023-03650-6