Modelling of total dissolved solids in water supply systems using regression and supervised machine learning approaches

Monitoring of water quality through accurate predictions provides adequate information for water management. In the present study, three modelling approaches, namely Gaussian process regression (GPR), backpropagation neural network (BPNN) and principal component regression (PCR), were used to predict total dissolved solids (TDS) as a water quality indicator for water quality management. The performance of each model was evaluated based on three different sets of inputs from groundwater (GW), surface water (SW) and drinking water (DW). The GPR, BPNN and PCR models used in this study gave an accurate prediction of the observed data (TDS) in GW, SW and DW, with R² consistently greater than 0.850. The GPR model gave the best prediction of TDS concentration, with an average R², MAE and RMSE of 0.987, 4.090 and 7.910, respectively. For the BPNN, an average R², MAE and RMSE of 0.913, 9.720 and 19.137, respectively, were achieved, while the PCR gave an average R², MAE and RMSE of 0.888, 11.327 and 25.032, respectively. The performance of each model was also assessed using efficiency-based indicators such as the Nash and Sutcliffe coefficient of efficiency (ENS) and the index of agreement (d). The GPR, BPNN and PCR models, respectively, gave an ENS of (0.967, 0.915, 0.874) and d of (0.992, 0.977, 0.965). This study indicates that advanced machine learning approaches (e.g. GPR and BPNN) are appropriate for the prediction of water quality indices and would be useful for future prediction and management of water quality parameters of various water supply systems in mining communities where artificial intelligence technology is yet to be fully explored.


Introduction
Provision of safe and quality drinking water is a major concern in many developing countries due to rapid growth in urbanization and industrialization. An estimated 1.8 million people die every year, predominantly in developing countries, due to water-borne diseases and inadequate supply of quality water (Ishii and Sadowsky 2008; Corcoran 2010). To reduce the impact caused by water-related issues, water quality assessments based on water quality indices, drinking water standards and guidelines are used to evaluate the chemical, physical and biological constituents of water. Among other constituents, total dissolved solids (TDS) is one of the most vital parameters in assessing the overall suitability and quality of various water supply systems (Atta et al. 2018; Li et al. 2018; Pan et al. 2019). Therefore, accurate measurement and prediction of TDS may provide an indication of the salinity (total organic and inorganic dissolved substances) in various water resource systems.
Several models have been developed and applied for analysis and monitoring of water quality parameters (Ghosh et al. 2015; Sen et al. 2018; Adiat et al. 2020; Emami and Parsa 2020). Traditional (deterministic and stochastic) models, such as statistical approaches and visual modelling, have been commonly used in literature (Sun and Gui 2015; Tziritis and Lombardo 2017; Chen et al. 2018; Karami et al. 2018). Statistical-based water quality models, such as cluster analysis (CA), hierarchical cluster analysis (HCA) and principal component analysis (PCA), have been commonly used to classify and evaluate correlations between water constituents or parameters (Liu et al. 2011; Gu et al. 2016; Hamil et al. 2018; Lu and Ma 2020). However, these approaches require large amounts of data, which are difficult, time-consuming and expensive to obtain. Furthermore, many statistical models assume a linear relationship between response and predictor variables (parameters). Therefore, utilizing statistical approaches for nonlinear relations among variables is usually ineffective.
Multiple linear regression (MLR) and PCA models, despite their inefficiency on nonlinear relations among variables, have also been used in many hydrological studies, possibly because they are easy to use and the relationships between parameters are easy to interpret (Adeloye 2009; Chenini and Khemiri 2009; Gholami et al. 2011; Viswanath et al. 2015; Lu and Ma 2020). For example, Viswanath et al. (2015) proposed a prediction model for TDS concentration in watersheds, with 10 other water quality parameters as variables, by combining PCA with MLR. In their model, PCA was used to isolate less significant parameters, whereas the MLR model was used to predict TDS in terms of the other statistically significant parameters. However, the PCA prediction model used in their study utilized the entire dataset for model development with no validation. Alternatively, the principal component regression (PCR) technique, which combines PCA with MLR, has been developed and successfully applied in solid waste generation rate prediction (Azadi et al. 2016) and TDS prediction (Jacintha et al. 2017; Pan et al. 2019).
In addition to the classical statistical regression methods, supervised machine learning (SML) approaches such as artificial neural network (ANN), support vector machine (SVM) and adaptive neuro-fuzzy inference system (ANFIS) have been adopted in many hydrological studies (Suen and Eheart 2003;Asadollahfardi et al. 2012;Shamshirband et al. 2015;Alrashed et al. 2018;Yaseen et al. 2018;Sinshaw et al. 2019;Haghbin et al. 2020). These studies include forecasting nitrate concentration in rivers (Suen and Eheart 2003;Haghbin et al. 2020), modelling total phosphorus and total nitrogen in wetlands (Asadollahfardi et al. 2010), predicting TDS in rivers (Asadollahfardi et al. 2012), estimating the concentration of total nitrogen and total phosphorus in lakes (Sinshaw et al. 2019), analysing the thermal behaviour and performance of nano-suspensions in water supply systems (Shamshirband et al. 2015;Alrashed et al. 2018;Karimipour et al. 2019) and predicting water quality parameters including TDS, biochemical oxygen demand (BOD), and chemical oxygen demand (COD) using three new ensemble machine learning models (Asadollah et al. 2020;Sharafati et al. 2020).
In relation to the prediction of TDS, some recent studies have utilized various SML techniques. Asadollahfardi et al. (2012) developed an ANN model to predict TDS in the Talkheh Rud River (Iran). In their study, the Elman network, which includes the multilayer perceptron (MLP) and recurrent neural network (RNN), was developed and applied. The results from their study indicate that the Elman network predicts the TDS very close to the observed values (R = 0.964). Schuttrumpf (2018) developed a recurrent neural network (RNN)-based model for predicting and forecasting TDS of a river. They observed that the RNN model gave an accurate prediction of the observed parameters. Pan et al. (2019) compared the efficiency of dual-step MLR, hybrid PCR and BPNN models in predicting TDS in monitoring wells. According to their study, the hybrid PCR and dual-step MLR models provide better prediction compared to the BPNN model. Banadkooki et al. (2020) employed the ANFIS, SVM and ANN models for prediction of TDS of aquifers in the Yazd plain (Iran). The results from their study showed that the hybrid ANFIS improved accuracy over the ANN and SVM models by 1.4% and 3.8%, respectively. They also observed that the SVM model had the lowest Nash-Sutcliffe efficiency value among all the models.
Despite numerous studies on water quality analysis using deterministic and stochastic approaches, statistical regression methods and SML models, studies using SML, PCR and machine learning techniques such as Gaussian process regression (GPR) are limited, especially in developing African countries. While the world is geared towards the fourth industrial revolution, where artificial intelligence (AI), machine learning, augmented reality, the internet of things (IoT), etc., are playing a major role, Africa is yet to come to the moment of realization. More importantly, developing countries in Africa such as Ghana are yet to adopt, apply and test the efficiency of machine learning in water supply system management. This study will therefore create awareness for practitioners to appreciate the robustness of the methods used in this study. Also, to the best of the authors' knowledge, existing models for predicting water quality parameters are usually developed using a single water resource (rivers, monitoring wells or lakes), which may affect the long-term application of the models in different water supply systems. Hence, a comprehensive study on predictive models is required for monitoring and predicting water quality parameters, especially in mining communities such as Tarkwa, Ghana. Therefore, the objectives of this study are to (i) develop hybrid predictive models using GPR, BPNN and PCR by utilizing non-overlapping testing datasets from groundwater (GW), surface water (SW) and drinking water (DW); (ii) predict TDS concentration in SW, GW and DW using the proposed predictive models developed in this study; and (iii) evaluate and compare the performance of the models using a series of performance evaluation metrics and statistical indices.

Hydrogeological setting
A total of 386 data points (GW = 189, SW = 110 and DW = 87), obtained between February and March 2015, were used in this study. The datasets were taken from Tarkwa, a mining (mainly gold and manganese) community and the capital of Tarkwa-Nsuaem Municipal Assembly, Western Region, Ghana. The area was selected for this study due to high-level pollution of water supply systems from mining activities (Bhattacharya et al. 2012; Ewusi et al. 2017a, b; Baah-Ennumh and Adom-Asamoah 2019). The area is located between latitudes 4° 0ʹ 0″ N and 5° 40ʹ 0″ N and longitudes 1° 45ʹ 0″ W and 2° 1ʹ 0″ W. Figure 1, prepared using QGIS, shows the location and geological setting of the study area.
The domestic and commercial water supply systems in the area mainly consist of GW (boreholes and hand-dug wells) and SW (streams and rivers). The majority of these water supply systems (GW and SW) serve as a source of DW for nearby communities. The average well depth in the area is 35.4 m. Borehole yields range between 0.4 m³/h and 18 m³/h with an average of 2.4 m³/h (Bhattacharya et al. 2012). The Bonsa, Huni and Ankobra Rivers and their tributaries are the main sources of recharge for nearby streams and GW (Bhattacharya et al. 2012). The quality of the water supply systems in the area is highly affected by mine contaminants and mining-related activities, leakage from underground storage tanks, improper waste disposal and agrochemicals from agricultural fields (Ewusi et al. 2017b).

Data description
A total of 10 parameters, which include arsenic (As), cadmium (Cd), mercury (Hg), copper (Cu), cyanide (CN), total dissolved solids (TDS), total suspended solids (TSS), pH, turbidity and electric conductivity (EC), were obtained from GW, SW and DW systems in the study area. These parameters were carefully chosen based on their data availability, significance and concentrations with respect to the WHO guideline values. Table 1 presents a statistical summary of all the parameters used in this study. TDS was selected as the target parameter for all modelling and analyses as its concentration is affected by many of the studied parameters due to the high pollution of water supply systems from mining activities in the area. The remaining 9 parameters were used as the input parameters to build the prediction model for TDS.
In this study, three modelling approaches (GPR, BPNN and PCR) were used to predict the concentration of TDS in GW, SW and DW systems. To reduce modelling errors and avoid possible bias from the inputs, the datasets from GW, SW and DW were combined and split into training (70%) and testing (30%) subsets. The entire dataset was then used to perform model validation. This approach was adopted from previous studies (Konaté et al. 2015a; Ziggah et al. 2016) to understand the possible variability in the dataset and to determine the extent to which the developed model can be generalized should the size of the data increase in the study area. Moreover, it indicates the predictive capability of the model across the entire data extent in the study area. After the training, testing and validation stages, each model was used for prediction using 3 different sets of inputs from GW, SW and DW. As such, each model was evaluated 3 times with different datasets. This approach allows a fair and systematic assessment of the methods and modelling precision by applying consistent training and testing inputs in multiple trials. This also helps to identify and reveal possible bias from the input data. All modelling and analysis were performed using MATLAB (ver. R2020a).
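The pooling and 70/30 partition described above can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB workflow used in the study; the function name `split_train_test`, the random seed and the placeholder sample identifiers are assumptions, and only the sample counts (189, 110 and 87) come from the text.

```python
import random

def split_train_test(samples, train_fraction=0.7, seed=0):
    """Shuffle a pooled dataset and split it into training and testing subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle (seed is an assumption)
    n_train = int(round(train_fraction * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder sample identifiers; the counts per source are those reported in the text.
pooled = ([("GW", i) for i in range(189)]
          + [("SW", i) for i in range(110)]
          + [("DW", i) for i in range(87)])

train, test = split_train_test(pooled)
print(len(pooled), len(train), len(test))
```

Shuffling the pooled samples before splitting means each subset is drawn from all three water sources, and the two subsets do not overlap.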

Gaussian process regression
Gaussian process regression is a powerful nonlinear prediction tool, which can be used in both supervised and unsupervised learning frameworks. It is a nonparametric stochastic process that generalizes the Gaussian probability distribution. A Gaussian process is sometimes described as a distribution over functions (P(ƒ)), where ƒ is a function that projects input space (vector X) to feature space (vector r) and for any finite subset of X, the marginal distribution over that subset P(ƒ) has a Gaussian distribution. The ƒ could be an infinite-dimensional quantity. As a result, the Gaussian process extends multivariate Gaussian distributions to infinite dimensionality (Rasmussen and Williams 2006). One of the advantages of a Gaussian process model is that its formulation is probabilistic. This is useful for probabilistic prediction and also enables inference of the model parameters that control the kernel shape and noise level (Chu and Ghahramani 2005). Consider a dataset M = {X, y}, where X = [x_1, …, x_n] represents the matrix composed of the input vectors, y = [y_1, …, y_n] represents the output, x_i is a vector and y_i is a scalar. The relationship between the input and the output can be given as (Eq. 1):
$$y_i = f(x_i) + \varepsilon \tag{1}$$
where f(x) is the underlying regression function, and ε is the noise term.
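To make the regression model of Eq. 1 concrete, the following is a minimal sketch of Gaussian process prediction with a zero-mean prior and a squared exponential covariance function, written in Python with NumPy. It is not the MATLAB implementation used in the study: the function names, the fixed hyperparameters (`length_scale`, `signal_var`, `noise_var`) and the synthetic sine data are all assumptions chosen for illustration.

```python
import numpy as np

def squared_exponential(a, b, length_scale=1.0, signal_var=1.0):
    """k(a, b) = signal_var * exp(-||a - b||^2 / (2 * length_scale^2))."""
    sq_dist = (np.sum(a**2, axis=1)[:, None]
               + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return signal_var * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)

def gpr_predict(x_train, y_train, x_new, noise_var=1e-2, **kernel_args):
    """Posterior mean and variance of a zero-mean Gaussian process at x_new."""
    k_tt = squared_exponential(x_train, x_train, **kernel_args)
    k_nt = squared_exponential(x_new, x_train, **kernel_args)
    k_nn = squared_exponential(x_new, x_new, **kernel_args)
    # Cholesky factor of the noisy training covariance (K + noise_var * I).
    chol = np.linalg.cholesky(k_tt + noise_var * np.eye(len(x_train)))
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_nt @ alpha
    v = np.linalg.solve(chol, k_nt.T)
    var = np.diag(k_nn) - np.sum(v**2, axis=0)
    return mean, var

# Synthetic one-dimensional data (illustrative only, not the study's measurements).
x_train = np.linspace(0.0, 5.0, 25).reshape(-1, 1)
y_train = np.sin(x_train).ravel()
x_new = np.array([[1.0], [2.5], [10.0]])
mean, var = gpr_predict(x_train, y_train, x_new, noise_var=1e-4)
print(mean, var)
```

The predictive variance is small inside the range of the training inputs and approaches the prior variance far away from them, which is the probabilistic behaviour the text refers to.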

Principal component regression
PCA is commonly used in hydrological studies to reduce the number of variables, extract useful information and eliminate noise from data (Konaté et al. 2015b). PCA extracts eigenvalues from the original dataset and forms new principal components (PCs) that are linear combinations of the parameters (Pearson 1901). The resulting PCs are orthogonal to each other after varimax rotation (Abou Zakhem et al. 2017; Ravikumar and Somashekar 2017; Pan et al. 2019), which helps to avoid multicollinearity between model parameters. PCs with eigenvalues greater than unity (one) are considered significant (Abou Zakhem et al. 2017; Selvakumar et al. 2017; Pan et al. 2019), and each significant PC explains a portion of the total variance of the dataset. In this study, PCR models are developed by using PCs identified by PCA as independent variables in MLR. PCR is more advantageous than conventional MLR modelling since it retains more original predictor variables and minimizes multicollinearity between variables. For a given trial, PCs on TDS are first identified from the training dataset and MLR is carried out using the significant PCs (total variance > 95%) to obtain a TDS prediction model. The original MLR equation (Eq. 2) derived from the training dataset was used for testing and validation. The MLR interaction equation used in the current PCR model is expressed in terms of V_1, V_2 and V_3, which represent the principal components of the independent variables derived from the PCA.
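The two-stage idea behind PCR (PCA on the inputs, then MLR on the retained components) can be sketched as below. This is an illustrative Python/NumPy sketch under stated assumptions, not the model of Eq. 2: it keeps PCs whose eigenvalue exceeds one, omits the varimax rotation, uses main-effect terms only (no interaction terms), and runs on synthetic data. The names `fit_pcr` and `predict_pcr` are hypothetical.

```python
import numpy as np

def fit_pcr(x, y, eigen_threshold=1.0):
    """PCA on standardized inputs, then least-squares regression on the kept PCs."""
    mu, sigma = x.mean(axis=0), x.std(axis=0, ddof=1)
    z = (x - mu) / sigma
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
    order = np.argsort(eigvals)[::-1]          # sort components by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs[:, eigvals > eigen_threshold]  # eigenvalue-greater-than-one rule
    scores = z @ loadings                      # principal component scores
    design = np.column_stack([np.ones(len(y)), scores])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return {"mu": mu, "sigma": sigma, "loadings": loadings, "coef": coef}

def predict_pcr(model, x):
    """Project new inputs onto the stored PCs and apply the fitted linear model."""
    z = (x - model["mu"]) / model["sigma"]
    scores = z @ model["loadings"]
    return model["coef"][0] + scores @ model["coef"][1:]

# Synthetic data: six correlated inputs driven by two latent factors (illustrative only).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])
x = latent @ mixing + 0.05 * rng.normal(size=(200, 6))
y = 2.0 * latent[:, 0] - latent[:, 1] + 0.1 * rng.normal(size=200)

model = fit_pcr(x, y)
y_hat = predict_pcr(model, x)
print(model["loadings"].shape, np.corrcoef(y, y_hat)[0, 1])
```

Because the regression is carried out on uncorrelated component scores rather than on the correlated raw inputs, the multicollinearity problem mentioned in the text does not arise in the least-squares step.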

Artificial neural network model
An artificial neural network is a computational model that consists of highly interconnected elements (nodes or neurons) and is used to simulate the structure and/or functional aspects of biological neural networks. ANN applications can be categorized as classification or pattern recognition, clustering or prediction and modelling. The advantages of ANNs are the unrestricted number of inputs and outputs and the clearly defined number of hidden layers and hidden neurons. In the present study, the back-propagation training algorithm was used to adjust the connection weights and bias values during training.

Feed-forward network
A feed-forward network with one hidden layer was selected, in which the input data (x_1, x_2, …, x_n) are included in the first layer, and the network progressively processes those data throughout subsequent layers to produce the results (y_1, y_2, …, y_k) in the output layer. The input neurons are linked to those in the intermediate layer by w_ji weights (weight connecting the ith neuron in the input layer and the jth neuron in the hidden layer), and the neurons in the intermediate layer are linked to those in the output layer by w_kj weights (weight connecting the kth neuron in the output layer and the jth neuron in the hidden layer). The ANNs, based on the nonlinear activation functions, map the relationship between the inputs and the output. Thus, the explicit correlation for the output values is expressed in Eq. 3:
$$y_k = f_0\left(\sum_{j=1}^{s'} w_{kj}\, f_h\left(\sum_{i=1}^{s} w_{ji}\, x_i + b_j\right) + b_k\right) \tag{3}$$
where f_h = activation function of the nodes in the hidden layer; f_0 = activation function of the nodes in the output layer; s and s′ = number of nodes in the input and hidden layers, respectively; b_j = bias for the jth hidden neuron; and b_k = bias for the kth output neuron.
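A forward pass through a one-hidden-layer network of this kind can be sketched in a few lines of Python. The sketch assumes a hyperbolic tangent activation in the hidden layer and a linear (identity) activation in the output layer; the weights, biases and inputs are arbitrary illustrative numbers, not trained values from the study.

```python
import math

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One-hidden-layer forward pass with f_h = tanh and f_0 = identity."""
    # Hidden layer: weighted sum of the inputs plus bias b_j, passed through tanh.
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    # Output layer: weighted sum of the hidden activations plus bias b_k.
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w_out, b_out)]

# Illustrative network with 2 inputs, 2 hidden nodes and 1 output.
x = [0.5, -1.0]
w_hidden = [[1.0, 0.5], [-1.0, 2.0]]   # w_ji: row j holds the weights into hidden node j
b_hidden = [0.0, 0.5]
w_out = [[2.0, -1.0]]                  # w_kj: row k holds the weights into output node k
b_out = [0.1]

y = forward(x, w_hidden, b_hidden, w_out, b_out)
print(y)
```

Training then amounts to adjusting `w_hidden`, `b_hidden`, `w_out` and `b_out` so that the outputs match the observed targets.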
The five training algorithms commonly used in ANNs are Levenberg-Marquardt, gradient descent, gradient descent with momentum, gradient descent with momentum and adaptive learning rate, and gradient descent with adaptive learning rate. As the Levenberg-Marquardt (LM) algorithm is considered to be one of the fastest methods for training ANNs, it was chosen in this study.

Back-propagation neural network
The number of input and output nodes in the BPNN is determined by the nature of the actual input and output variables. The number of hidden nodes, however, depends on the complexity of the mathematical nature of the problem and is determined by the modeller, often by trial and error. Each hidden and output node processes its input by multiplying each of its input values by a weight, summing the products and then passing the sum through a nonlinear transfer function (e.g. sigmoid function) to produce a result (Eq. 4). It can be expressed as:
$$Y = f\left(\sum W X + \theta\right) \tag{4}$$
where X = input or hidden node value; Y = output value of the hidden or output node; ƒ() = transfer function; W = weights connecting the input to hidden, or hidden to output, nodes; and θ = bias or threshold for each node. The input, hidden and output layer nodes are interconnected by adjustable connection weights to recognize different patterns of information. A decision about the number of hidden layers and the number of hidden nodes is an important aspect of a neural network design process because it significantly affects the final output. For many practical problems, one hidden layer is sufficient to provide the required accuracy (Khalil et al. 2011; Wu et al. 2015; Azadi et al. 2016; Sinshaw et al. 2019). The current BPNN model developed in this study uses the same input and target variables as the other methods. A BPNN structure of 9-10-1, representing 9 input parameters, 10 nodes in the hidden layer, and 1 output variable (TDS), was adopted (Fig. 2).

Performance measurement of models
The performance of each model is evaluated and compared using the methods discussed in this section.

Linear correlation coefficient
The linear correlation coefficient (R) is a measure of how well a particular model can predict the observed (actual) data. The values of R range from -1.0 to 1.0. A value of 1.0 indicates a perfect positive correlation between the observed and the predicted values, whereas a value of -1.0 indicates a perfect negative correlation. The value of R is calculated using Eq. 5 given below:
$$R = \frac{n\sum y\,y' - \sum y \sum y'}{\sqrt{\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]\left[n\sum y'^{2} - \left(\sum y'\right)^{2}\right]}} \tag{5}$$
where y = observed value; y′ = predicted value; and n = number of data samples.

Coefficient of determination (R²)
The R² measures how much of the variance in the observed values is explained by the model prediction. The higher the R² value, the better the model prediction accuracy.
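As an illustration of these two measures, the sketch below computes R from sums of the observed and predicted values and takes R² as its square. Treating R² as the squared linear correlation coefficient is an assumption made here for illustration, and the TDS values in the example are hypothetical.

```python
import math

def correlation_coefficient(observed, predicted):
    """Linear correlation coefficient R between observed and predicted values."""
    n = len(observed)
    s_y, s_p = sum(observed), sum(predicted)
    s_yp = sum(y * p for y, p in zip(observed, predicted))
    s_yy = sum(y * y for y in observed)
    s_pp = sum(p * p for p in predicted)
    numerator = n * s_yp - s_y * s_p
    denominator = math.sqrt((n * s_yy - s_y**2) * (n * s_pp - s_p**2))
    return numerator / denominator

def coefficient_of_determination(observed, predicted):
    """R squared, taken here as the square of R (an assumption of this sketch)."""
    return correlation_coefficient(observed, predicted) ** 2

# Hypothetical TDS values in mg/L, for illustration only.
observed = [120.0, 250.0, 310.0, 480.0]
predicted = [130.0, 240.0, 300.0, 470.0]
print(correlation_coefficient(observed, predicted),
      coefficient_of_determination(observed, predicted))
```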
Fig. 2 Proposed structure of a feed-forward back-propagation neural network

Root-mean-squared error
Root-mean-squared error (RMSE) is the square root of the mean squared error. The RMSE is thus the standard deviation of the prediction errors, i.e. a measure of the average distance of the observed data points from the model line (Eq. 6). The RMSE is given by the following equation:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - y'_i\right)^{2}} \tag{6}$$

Mean absolute error
The mean absolute error (MAE) is the arithmetic mean of the absolute errors and statistically measures the predictive accuracy of a model. The MAE is commonly used in quantitative predictive models because it indicates the relative overall fit (i.e. the goodness of fit). The MAE is given by Eq. 7 below:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - y'_i\right| \tag{7}$$
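The two error measures can be computed directly from paired observed and predicted values, as in the short Python sketch below. The three TDS values are hypothetical and serve only to show the arithmetic.

```python
import math

def rmse(observed, predicted):
    """Root-mean-squared error: square root of the mean squared residual."""
    n = len(observed)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(observed, predicted)) / n)

def mae(observed, predicted):
    """Mean absolute error: mean of the absolute residuals."""
    n = len(observed)
    return sum(abs(y - p) for y, p in zip(observed, predicted)) / n

# Hypothetical TDS values in mg/L; the residuals are -10, 10 and 0.
observed = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 300.0]
print(rmse(observed, predicted), mae(observed, predicted))
```

Because the residuals are squared before averaging, the RMSE is never smaller than the MAE and is more sensitive to a few large errors.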

Nash and Sutcliffe coefficient of efficiency (ENS)
The Nash-Sutcliffe efficiency (ENS) is a normalized statistic that determines the relative magnitude of the residual variance compared to the measured data variance (Nash and Sutcliffe 1970). ENS indicates how well the plot of observed versus predicted data fits the 1:1 line. An ENS of 1 corresponds to a perfect fit and 0 < ENS < 1 indicates an acceptable performance, whereas ENS ≤ 0 indicates an unsatisfactory performance of the model. The value of ENS is calculated using Eq. 8 given below:
$$E_{NS} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - y'_i\right)^{2}}{\sum_{i=1}^{n}\left(y_i - y_m\right)^{2}} \tag{8}$$
where y_m = mean of the observed value.
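The efficiency-based indicators can be sketched in Python as follows. The Nash-Sutcliffe function follows the definition given above; the index of agreement, which is introduced in the next subsection, is written here in the standard form of Willmott (1982), which is assumed to be the form used in the study. The example values are hypothetical.

```python
def nash_sutcliffe(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 - residual variance / observed variance."""
    y_mean = sum(observed) / len(observed)
    residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    spread = sum((y - y_mean) ** 2 for y in observed)
    return 1.0 - residual / spread

def index_of_agreement(observed, predicted):
    """Willmott's index of agreement d (standard 1982 form, assumed here)."""
    y_mean = sum(observed) / len(observed)
    residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    potential = sum((abs(p - y_mean) + abs(y - y_mean)) ** 2
                    for y, p in zip(observed, predicted))
    return 1.0 - residual / potential

# Hypothetical TDS values in mg/L, for illustration only.
observed = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 300.0]
print(nash_sutcliffe(observed, predicted), index_of_agreement(observed, predicted))
```

A model that always predicts the observed mean has an efficiency of exactly zero, which is why values at or below zero are treated as unsatisfactory.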

Index of agreement (d)
Index of agreement (d) represents the ratio of the mean square error and the potential error (Willmott 1982). It is a standardized measure of the degree of model prediction error which varies between 0 and 1. A d value of 1 indicates perfect agreement between the observed and predicted values, whereas a value of 0 indicates no agreement. The value of d is calculated as:
$$d = 1 - \frac{\sum_{i=1}^{n}\left(y_i - y'_i\right)^{2}}{\sum_{i=1}^{n}\left(\left|y'_i - y_m\right| + \left|y_i - y_m\right|\right)^{2}}$$
where y = observed value; y′ = predicted value; and y_m = mean of the observed value.

Results and discussion
Table 1 presents a summary of the concentration of parameters from GW, SW and DW used in this study. The mean concentration of turbidity was considerably higher than the guideline value in GW, SW and DW. This is possibly due to a high amount of effluents released as a result of numerous mining activities in the area. Although the mean concentrations of other parameters including TDS were lower than the guideline value, some parameters had a considerably high maximum value above the guideline values. Considering the salinity problem associated with the water supply systems in the study area, accurate predictive models for constant monitoring of TDS in the study area are required to reduce the time and cost involved in using conventional methods. The correlation between TDS and other input parameters was evaluated. The results show that there is a high correlation between EC and TDS with R = 0.934, as presented in italics in Table 2. The performance of each model was also evaluated, and the results are summarized in Table 3.

Gaussian process regression model
For selecting the optimum covariance function for the proposed GPR model, the following covariance functions were tried and tested: (1) the squared exponential covariance function, (2) the exponential covariance function, (3) the rational quadratic covariance function, and (4) Table 5.
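For reference, the three covariance functions named above can be written in their standard forms as functions of the distance r between two input points. The Python sketch below uses hypothetical hyperparameter names (`length_scale`, `signal_var`, `alpha`) with default values chosen only for illustration; it does not reproduce the hyperparameters or the selection procedure of the study.

```python
import math

def squared_exponential(r, length_scale=1.0, signal_var=1.0):
    """k(r) = signal_var * exp(-r^2 / (2 * length_scale^2))."""
    return signal_var * math.exp(-0.5 * (r / length_scale) ** 2)

def exponential(r, length_scale=1.0, signal_var=1.0):
    """k(r) = signal_var * exp(-r / length_scale)."""
    return signal_var * math.exp(-r / length_scale)

def rational_quadratic(r, length_scale=1.0, signal_var=1.0, alpha=1.0):
    """k(r) = signal_var * (1 + r^2 / (2 * alpha * length_scale^2))^(-alpha)."""
    return signal_var * (1.0 + r**2 / (2.0 * alpha * length_scale**2)) ** (-alpha)

# Covariance between two points at increasing separation r.
for r in (0.0, 0.5, 1.0, 2.0):
    print(r, squared_exponential(r), exponential(r), rational_quadratic(r))
```

All three functions equal the signal variance at zero distance and decay as the points move apart; they differ in how quickly they decay, which controls how smooth the fitted regression function is.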

Back-propagation neural network model
The BPNN developed consists of three layers, i.e. input, hidden, and output layers. In accordance with existing literature (Hornik et al. 1989; Arthur et al. 2020), one hidden layer was used in this study due to its capability to universally approximate any complex problem. In the network, the hyperbolic tangent sigmoid and linear transfer functions were utilized in the hidden and output layers, respectively, due to the nonlinearity of the input datasets. Training of the BPNN was done using the Levenberg-Marquardt algorithm (Moré 1978). The optimum BPNN obtained for this study had 9 inputs, one neuron in the hidden layer and one output, with the structure [9-1-1]. Figure 5a-c presents the results of the BPNN model during training, testing and validation stages, respectively. The performance of the BPNN model was further evaluated using 3 different sets of inputs from GW, SW and DW, and the results are presented in Fig. 6a-f. It is worth noting that the TDS concentrations in GW, SW and DW were accurately predicted using the BPNN model with R² of 0.945 (Fig. 6a, b), 0.936 (Fig. 6c, d) [...] Table 5.

Principal component regression
The PCR model, which combines PCA and MLR models, is constructed in this study to minimize the multicollinearity of the variables. Table 4 presents a summary of all three PCs factor scores for the input parameters with the highest scores in italics. Figure 7a-c presents the results of the PCR model during training, testing and validation stages, respectively. New inputs from GW, SW and DW were used to evaluate the performance of the PCR model, and the results are presented in Fig. 8a-f. The TDS concentrations in GW, SW and DW were accurately predicted using the PCR model with R² of 0.875, 0.870 and 0.919 as shown in Fig. 8a-f, respectively. Similar to the GPR model, the PCA model gave [...] Table 5.

Comparison of model performance
The predictive techniques proposed in this study were evaluated by using the performance methods (R², MAE and RMSE) discussed previously. In general, indices during the training stage may not provide a good reference for accurate model evaluation (Bagheri et al. 2017). Therefore, the performance of each model (GPR, BPNN and PCR) was evaluated using new input parameters from GW, SW and DW as shown in Table 5. The best model performance (i.e. highest R² value, or lowest MAE and RMSE values) for each model is in italics. The MAE and RMSE for training, testing and validation are shown in Fig. 9a, b, respectively. It is worth noting that the GPR model showed the best performance with the least error during training, testing and validation (Fig. 9). It was found that all models adequately predict the observed data in GW, SW and DW, with R² consistently greater than 0.85. The average R² values of the GPR model (R² avg = 0.987) and BPNN (R² avg = 0.913) are higher than that of the PCR (R² avg = 0.888). In general, the most accurate predictions were made with input data from DW based on the GPR (R² = 0.999) and PCR (R² = 0.919) models; however, the BPNN gave its best prediction in GW (R² = 0.945). The performance of each model was again assessed using efficiency-based indicators such as ENS and d. The GPR, BPNN and PCR models, respectively, gave an ENS of (0.967, 0.915, 0.874) and d of (0.992, 0.977, 0.965). Overall, the GPR model gave the best prediction with the highest average R², ENS and d, and the lowest average error (MAE and RMSE) values as shown in Fig. 10. Table 6 summarizes the primary previous works in TDS prediction. It is worth noting that R² varied from 0.900 to 0.987, which indicates good performance of the proposed models for predicting TDS. Compared with other models proposed in the literature, the current study achieved the best predictive performance, with an R² of 0.987. Unlike previous works, the model used in this study ensured good generalization capability.
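As a small worked check of how the average values are obtained, the sketch below averages the three PCR R² values reported for GW, SW and DW (0.875, 0.870 and 0.919) and compares the result with the stated threshold of 0.85. The variable names are illustrative; only the three R² values come from the text.

```python
# R² values reported in the text for the PCR model on each water source.
pcr_r2 = {"GW": 0.875, "SW": 0.870, "DW": 0.919}

# Arithmetic mean over the three water sources.
pcr_r2_avg = sum(pcr_r2.values()) / len(pcr_r2)

print(round(pcr_r2_avg, 3))
print(all(value > 0.85 for value in pcr_r2.values()))
```

The mean of the three values reproduces the average R² of 0.888 quoted for the PCR model.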

Conclusions
The overall performance of various models for predicting the concentration of TDS in GW, SW and DW in the Tarkwa mining area was evaluated. Different water quality parameters were used to develop the TDS models, and the performance of the models was evaluated using various statistical indices. The major findings are:
• Although the mean concentrations of all the parameters used in this study were lower than the guideline value, except turbidity, TDS was chosen as the target parameter considering the salinity problem associated with the water supply systems in the study area.
• The GPR, BPNN and PCR models developed in this study gave an accurate prediction of the observed data (TDS) in GW, SW and DW, with R² consistently greater than 0.850.
• The GPR model gave the best prediction of TDS concentration, with an average R², MAE and RMSE of 0.987, 4.090 and 7.910, respectively. The performance of each model was assessed using efficiency-based indicators such as the Nash and Sutcliffe coefficient of efficiency (ENS) and index of agreement (d). The GPR, BPNN and PCR models, respectively, gave an ENS of (0.967, 0.915, 0.874) and d of (0.992, 0.977, 0.965).
• Compared with other models proposed in previous works, the proposed model in this study gave the best performance (highest R² value of 0.987) with a superior generalization capability because of the use of datasets from different water supply systems.
In general, this research work provides integrated analytical and modelling methods that would be useful for future prediction and management of water quality parameters in various water supply systems. The results obtained from this study suggest that advanced NLR techniques and machine learning approaches are appropriate for the prediction of water quality indices. Moreover, the models obtained from this study could form a basis for a more effective decision-making process which will help in maintaining and improving the management of water supply systems, especially in mining communities. Although the models used in this study have demonstrated predictive capability, the data used for validation and testing were limited to the study area. It is therefore recommended that future studies validate these models with larger datasets. It is also recommended that future research on predicting water quality parameters in developing African countries examine novel models such as extreme learning machines (ELM), hybrid and ensemble models.
Data availability statement Data generated or analysed during the study are available from the corresponding author by request.

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.
Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.