Artificial Intelligence Generated Synthetic Datasets as the Remedy for Data Scarcity in Water Quality Index Estimation

Water Resources Management

Abstract

The water quality index (WQI) is used in many countries and regions as a numeric representation of the condition of water resources. However, computing the WQI requires a host of water quality variables. Although machine learning models have proven to be a promising tool for estimating the WQI with fewer inputs, sufficient data or samples must be collected so that the models can be trained well. This poses a great challenge in places that lack the data collection infrastructure to meet the needs of machine learning models, which makes data scarcity a major issue to be tackled. This study covered two major rivers that served as water intakes in Peninsular Malaysia (the Selangor River and the Skudai River), where four synthetic data generation methods, namely the conditional tabular generative adversarial network (CTGAN), the tabular variational autoencoder (TVAE), the Gaussian copula (GC) and the copula generative adversarial network (CopulaGAN), were used to synthesise datasets based on the real dataset. The best synthetic datasets for the two rivers were selected using the pairwise correlation difference (PCD), the Kullback-Leibler divergence (KLD) and the Kolmogorov-Smirnov (KS) test. For the Selangor River, CopulaGAN1 and CopulaGAN2 yielded the best small and large synthetic datasets, respectively, scoring the lowest PCD, KLD and KS statistics. For the Skudai River, TVAE1 and TVAE2 were chosen. The real and synthetic datasets were then used to train a back-propagation neural network (BPNN) for WQI estimation. Across the evaluation metrics, increasing the size of the training data with synthetic data had a positive impact on the performance of the BPNN. The BPNN trained with CopulaGAN2 (Selangor River) and TVAE2 (Skudai River) yielded more accurate estimations than those trained with the actual and the smaller datasets.
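The three fidelity measures named in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulas used in the study, so the definitions below are assumptions based on common usage: PCD as the Frobenius norm of the difference between the Pearson correlation matrices, KLD estimated from shared histograms, and the two-sample KS statistic as the largest gap between empirical distribution functions. The function names and the toy arrays `real`, `close` and `far` are hypothetical and do not come from the study's data.

```python
import numpy as np


def pairwise_correlation_difference(real, synth):
    """Frobenius norm of the gap between the two Pearson correlation matrices."""
    corr_real = np.corrcoef(real, rowvar=False)
    corr_synth = np.corrcoef(synth, rowvar=False)
    return float(np.linalg.norm(corr_real - corr_synth, ord="fro"))


def kl_divergence(real_col, synth_col, bins=20, eps=1e-9):
    """Histogram estimate of D(real || synth) for a single variable."""
    lo = min(real_col.min(), synth_col.min())
    hi = max(real_col.max(), synth_col.max())
    p, _ = np.histogram(real_col, bins=bins, range=(lo, hi))
    q, _ = np.histogram(synth_col, bins=bins, range=(lo, hi))
    # Smooth empty bins so the logarithm stays finite, then normalise.
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))


def ks_statistic(real_col, synth_col):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    grid = np.sort(np.concatenate([real_col, synth_col]))
    cdf_real = np.searchsorted(np.sort(real_col), grid, side="right") / real_col.size
    cdf_synth = np.searchsorted(np.sort(synth_col), grid, side="right") / synth_col.size
    return float(np.max(np.abs(cdf_real - cdf_synth)))


# Toy data only: two correlated variables stand in for water quality variables.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
real = rng.multivariate_normal([0.0, 0.0], cov, size=500)
close = rng.multivariate_normal([0.0, 0.0], cov, size=500)  # faithful "synthetic" set
far = rng.normal(3.0, 1.0, size=(500, 2))                   # unfaithful "synthetic" set

for name, synth in [("close", close), ("far", far)]:
    n_vars = real.shape[1]
    pcd = pairwise_correlation_difference(real, synth)
    kld = np.mean([kl_divergence(real[:, j], synth[:, j]) for j in range(n_vars)])
    ks = np.mean([ks_statistic(real[:, j], synth[:, j]) for j in range(n_vars)])
    print(f"{name}: PCD={pcd:.3f}  mean KLD={kld:.3f}  mean KS={ks:.3f}")
```

Under these definitions, lower values on all three measures indicate a synthetic dataset that tracks the real one more closely, which matches the selection rule described in the abstract.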

Highlights

Data were insufficient to train machine learning models well in developing regions.

Synthetic data methods can overcome the data scarcity issue in Malaysia.

CopulaGAN and TVAE outperformed the other methods at the Selangor River and the Skudai River, respectively.

BPNN trained with synthetic datasets estimated WQI with higher accuracy.
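How a BPNN can be trained on real rows augmented with synthetic rows is sketched below in plain NumPy. This is an illustration under stated assumptions, not the study's model: the single tanh hidden layer, the learning rate, the epoch count, the `toy_target` function and all array names are hypothetical, and `X_synth` is simply drawn from the same toy process as a stand-in for the output of a generator such as CopulaGAN or TVAE.

```python
import numpy as np


def train_bpnn(X, y, hidden=8, lr=0.05, epochs=2000, seed=0):
    """One-hidden-layer network trained by full-batch back-propagation on MSE."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.5, size=(d, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, size=(hidden, 1))
    b2 = np.zeros(1)
    y = y.reshape(-1, 1)
    losses = []
    for _ in range(epochs):
        # Forward pass.
        H = np.tanh(X @ W1 + b1)
        err = (H @ W2 + b2) - y
        losses.append(float(np.mean(err ** 2)))
        # Backward pass: propagate the MSE gradient layer by layer.
        grad_pred = 2.0 * err / n
        grad_W2 = H.T @ grad_pred
        grad_b2 = grad_pred.sum(axis=0)
        grad_Z = (grad_pred @ W2.T) * (1.0 - H ** 2)  # derivative of tanh
        grad_W1 = X.T @ grad_Z
        grad_b1 = grad_Z.sum(axis=0)
        # Gradient descent update.
        W1 -= lr * grad_W1
        b1 -= lr * grad_b1
        W2 -= lr * grad_W2
        b2 -= lr * grad_b2
    return (W1, b1, W2, b2), losses


def predict(params, X):
    W1, b1, W2, b2 = params
    return (np.tanh(X @ W1 + b1) @ W2 + b2).ravel()


def toy_target(X):
    # Hypothetical smooth target; NOT the WQI formula.
    return 0.5 * X[:, 0] - 0.3 * X[:, 1] + 0.2 * X[:, 2] ** 2


rng = np.random.default_rng(1)
X_real = rng.uniform(0.0, 1.0, size=(30, 3))    # small "real" sample
X_synth = rng.uniform(0.0, 1.0, size=(300, 3))  # stand-in for generated rows
X_test = rng.uniform(0.0, 1.0, size=(100, 3))   # held-out rows

# Augment the real rows with the synthetic ones before training.
X_train = np.vstack([X_real, X_synth])
params, losses = train_bpnn(X_train, toy_target(X_train))

rmse = float(np.sqrt(np.mean((predict(params, X_test) - toy_target(X_test)) ** 2)))
print(f"first loss={losses[0]:.4f}  last loss={losses[-1]:.4f}  test RMSE={rmse:.4f}")
```

In practice the held-out set should contain only real observations, so that the reported error reflects performance on measured data rather than on generated rows.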

Data Availability

The datasets generated and/or analysed during the current study are not publicly available due to obligations to the data provider, but are available from the corresponding author on reasonable request.

Acknowledgements

This research was funded by Universiti Tunku Abdul Rahman (UTAR), Malaysia through the Universiti Tunku Abdul Rahman Research Fund (UTARRF) under project number IPSR/RMC/UTARRF/2020-C2/K03. The authors are also grateful to the Department of Environment, Malaysia for its strong support in providing the datasets that were crucial for this modelling study.

Funding

This work was supported by Universiti Tunku Abdul Rahman Research Fund (IPSR/RMC/UTARRF/2020-C2/K03). Koo, C. H. has received research support from Universiti Tunku Abdul Rahman (UTAR).

Author information

Contributions

All authors contributed to the study conception and design. Conceptualisation was done by Chai Hoon Koo. Material preparation, data collection and analysis were performed by Wei Di Chan and Jia Yin Pang. The first draft of the manuscript was written by Min Yan Chia. Review of manuscript was done by Yuk Feng Huang. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Chai Hoon Koo.

Ethics declarations

Ethical Approval

Not Applicable.

Consent to Participate

Not Applicable.

Consent to Publish

All authors agreed and gave their consent to publish this manuscript.

Competing Interests

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chia, M.Y., Koo, C.H., Huang, Y.F. et al. Artificial Intelligence Generated Synthetic Datasets as the Remedy for Data Scarcity in Water Quality Index Estimation. Water Resour Manage 37, 6183–6198 (2023). https://doi.org/10.1007/s11269-023-03650-6

  • DOI: https://doi.org/10.1007/s11269-023-03650-6
