Proposed formulation of surface water quality and modelling using gene expression, machine learning, and regression techniques

  • Research Article
  • Environmental Science and Pollution Research

Abstract

Rising water pollution from anthropogenic sources motivates further research into water quality prediction models. Existing models are limited by short data records and by their inability to provide empirical expressions. This study models the surface water quality of the upper Indus River basin and derives empirical equations for it, using a 30-year dataset and machine learning techniques, and then identifies the most reliable model for accurately predicting river water quality. Total dissolved solids (TDS) and electrical conductivity (EC) were used as dependent variables and eight parameters as independent variables, with 70% of the data used for model training and 30% for testing. Model performance was assessed with the Nash-Sutcliffe efficiency (NSE), root mean square error (RMSE), coefficient of determination (R2), and mean absolute error (MAE). The data were also validated by k-fold cross-validation using R2 and RMSE. The results indicated a strong correlation, with NSE and R2 both above 0.85 for all the developed models. Gene expression programming (GEP) outperformed the artificial neural network (ANN) and the linear and non-linear regression models for both TDS and EC. The sensitivity and parametric analyses revealed that bicarbonate is the most sensitive parameter in both the TDS and EC models. Two equations were derived from the GEP model to help authorities monitor river water quality effectively.
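The evaluation criteria named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`nse`, `rmse`, `mae`, `r2`, `kfold_indices`) are hypothetical, the abstract does not state which definition of R2 was used (the squared Pearson correlation is assumed here), and the interleaved fold assignment is only one of several possible k-fold schemes.

```python
import math


def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 - SSE / (sum of squared deviations of obs from its mean)."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst


def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def r2(obs, pred):
    """Coefficient of determination, ASSUMED here to be the squared Pearson correlation."""
    mean_obs = sum(obs) / len(obs)
    mean_pred = sum(pred) / len(pred)
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(obs, pred))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_pred = sum((p - mean_pred) ** 2 for p in pred)
    return cov ** 2 / (var_obs * var_pred)


def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation.

    Samples are assigned to folds in an interleaved way (0, k, 2k, ... go to
    fold 0, and so on); this assignment is an illustrative assumption.
    """
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

With these definitions, a prediction that is offset from the observations by a constant keeps R2 at 1 while lowering NSE. This is one reason to report both, together with RMSE and MAE, when comparing models such as GEP, ANN, and regression.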

Data Availability

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Acknowledgments

The authors acknowledge the support of the Water and Power Development Authority (WAPDA), Pakistan, for providing the water quality data for the Indus River.

Funding

This research received no external funding.

Author information

Contributions

Conceptualization, data collection, and writing original draft preparation: Muhammad Izhar Shah; data analysis, modeling, review, and editing: Muhammad Faisal Javed; validation check, data curation, and manuscript revision: Taher Abunama. All authors approved the final manuscript.

Corresponding author

Correspondence to Muhammad Izhar Shah.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Ethical approval

Not applicable

Consent to participate

Not applicable

Consent to publish

Not applicable

Additional information

Responsible Editor: Marcus Schulz

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A. Expression tree diagrams

Fig. 18 Expression tree of the developed GEP model for TDS

Fig. 19 Expression tree of the developed GEP model for EC

Cite this article

Shah, M.I., Javed, M.F. & Abunama, T. Proposed formulation of surface water quality and modelling using gene expression, machine learning, and regression techniques. Environ Sci Pollut Res 28, 13202–13220 (2021). https://doi.org/10.1007/s11356-020-11490-9

  • DOI: https://doi.org/10.1007/s11356-020-11490-9
