Introduction

Groundwater quality assessment and monitoring are crucial for the sustainable management of groundwater resources (Egbueri 2020; Kawo and Karuppannan 2018; Li et al. 2018; Islam et al. 2020a). Continuous population growth is directly associated with growing demand for clean water (Dos Santos et al. 2017; Islam et al. 2017; Rahman et al. 2020), and this demand encourages researchers to develop new models for predicting water quality (Uddin et al. 2021). As a key element of the water cycle and a drinking water resource, groundwater is under huge pressure worldwide (Ahmed et al. 2019; Saha et al. 2020). Thus, appraising water quality has become a matter of urgent interest. Horton (1965) developed the first water quality index (WQI) to condense the several parameters measured in water into a single number describing overall water quality. Since then, several indices have been developed (Hossain and Patra 2020; Mukate et al. 2019; Islam et al. 2020b). The parameters involved in calculating the WQI have to be chosen carefully to obtain meaningful results (Abbasi & Abbasi 2012). Various WQIs have been adopted by many researchers to assess the drinking suitability of groundwater and the quality of river water (Islam et al. 2017; 2019; Kabir et al. 2021). However, water quality can deteriorate because of many factors, e.g., inadequate sanitation, pollutants derived from industries, excessive use of fertilizer in agricultural practices, climate change, and poor groundwater management plans (Loecke et al. 2017; Alam et al. 2007; Trevett et al. 2005; Islam et al. 2018). In addition, water quality appraisal involves large-scale sample collection, laboratory testing, and data handling, which are mostly time-consuming and expensive in terms of equipment, chemicals, reagents, and human capital (Tiyasha et al. 2020).
Besides, the subindex calculation is time-consuming. Ongley (2000) found that water quality appraisal using traditional methods causes economic losses, which affects policy-making capacity for groundwater quality management plans. In addition, the recent coronavirus pandemic left laboratories short of the chemical reagents used for water analysis, following the marked reduction in the quantities of imported goods in several countries. To overcome these circumstances, a promising and cost-effective tool for rapid and precise water quality appraisal is needed. In such a case, artificial intelligence (AI) is an alternative option for building models during the pandemic period that help predict the overall quality of groundwater from analyses that do not need expensive reagents or sophisticated measurement instruments.

The AI technique is a potential and robust multifunctioning tool in water-science-related fields (Babbar and Babbar 2017; Kisi et al. 2018; Kim et al., 2019; Bui et al. 2020; Abba et al. 2020; Hayder et al. 2021; Singha et al. 2021; Bilali et al. 2021). Several research scholars have employed AI techniques worldwide including random forest (RF), support vector machine (SVM), and artificial neural network (ANN) in different water-related studies. The RF model was applied for the groundwater quality prediction (Singha et al. 2021), flood susceptibility study (Towfiqul Islam et al. 2021), river water quality prediction (Asadollah et al. 2021), and so on. Likewise, the SVM model was adopted for predicting marine water quality (Deng et al. 2021) and wastewater treatment plant monitoring (Nourani et al. 2018), with different precision levels. ANN-based prediction models have been extensively used in different fields including heavy metal pollution prediction (Singha et al., 2020), wetland vulnerability (Islam et al. 2021), and water level forecasting (Zhu et al. 2020).

Apart from these cited works, many studies have been performed for the prediction of WQI by appraising the performance of various AI models. For example, Gazzaz et al. (2012) adopted the ANN method to forecast river water quality and got a precision level of more than 90% (R2). Wang et al. (2017) applied a swarm optimization-based support vector regression model to predict WQI. A study performed by Ahmed et al. (2019) implemented 15 AI algorithms for the prediction of WQI, where the regression model and classification model outperformed the other models. Bui et al. (2020) found the better predictive performance of hybrid AI models over the conventional models for predicting WQI with 4 conventional and 12 hybrid AI techniques. Recently, Singha et al. (2021) applied deep learning for predicting WQI with 3 traditional models and found that the deep learning model is a more robust and accurate tool than the traditional model in the prediction of groundwater quality. Valentini et al. (2021) introduced a new WQI equation for Mirim Lagoon and evaluated its suitability based on 154 samples collected over three years at seven sampling points in Mirim Lagoon. For forecasting monthly WQI values at the Lam Tsuen River in Hong Kong, Asadollah et al. (2021) proposed a new ensemble machine learning algorithm called extra tree regression (ETR). The efficiency of the ETR model is comparable to that of traditional standalone models such as support vector regression (SVR) and decision tree regression (DTR) (Asadollah et al. 2021). Based on parameters such as pH, dissolved oxygen, conductivity, turbidity, fecal coliform, and temperature, Hu et al. (2021) investigated the classification of water quality using machine learning algorithms such as decision tree (DT), k-nearest neighbor (KNN), logistic regression (LogR), multilayer perceptron (MLP), and Naive Bayes (NB) and found that the DT algorithm outperformed other models with a classification accuracy of 99%.

From the aforementioned literature review, it is clear that different AI models have been applied under various hydrogeological conditions with different accuracy levels. In this context, additive regression (AR), M5P tree (M5P), random subspace (RSS), multilinear regression (MLR), and locally weighted linear regression (LWLR) were applied in our research to improve the reliability of water quality appraisal; these AI models have scarcely been used in hydrology for predicting groundwater quality.

Besides, after thoroughly reviewing the earlier literature, to the best of the authors' knowledge, no previous study has tested and verified the performance of the above-mentioned AI models for predicting groundwater quality. To close this gap, the current study applied 8 ML-based WQI prediction models in the Illizi region of southeastern Algeria. Groundwater is a vital source for human use and consumption in the study area, and its quality is mainly affected by human-induced pollution; hence, a thorough, systematic appraisal of groundwater quality is necessary for this region. Additionally, no such scientific investigation has been carried out in the study region. Predicting the WQI with 8 ML techniques is more robust than appraising it with any standalone tool. To achieve this aim, the study developed two scenarios. The first uses the 8 models to predict the WQI with all the analyzed parameters as input variables, to reduce the time spent on calculations. The second reduces the number of inputs based on sensitivity analysis and selects the main parameters controlling water quality, to predict the WQI in the critical case.

Materials and methods

Study area

General setting

With an area of 284,618 km2, Illizi county is the third largest wilayah by area. It is located in the extreme southeast of Algeria and shares a 1,233 km border with three countries: Tunisia and Libya to the east and Niger to the south. Ouargla county and Tamanrasset county border it to the north and the west, respectively (Kouadri and Samir 2021). Although the study area is very large, the climate is homogeneous, with a very long hot summer and a very short warm winter. The rains are extremely irregular. June is the hottest month of the year, while January is the coldest. Winds are generally light to moderate. Figure 1 presents the study area location.

Fig. 1 Study area location map

Hydrogeological settings

According to the authority of agricultural production in Saharan regions (CDARS), the hydrogeology of the Illizi area is distinguished by many aquifer deposits. The region covers a large surface area, within which many aquifer horizons can be discerned, such as Tassili's Cambro-Ordovician. Its sandstone formations have a very low porosity but are traversed by a pattern of cracks and faults; this fracturing gives the Tassili sandstones a strong permeability, which promotes water circulation. The Devonian aquifer is located in Illizi and its surroundings, especially in the north, and is exploited at depths from 250 to 1450 m in the Illizi and El Adeb Larach regions, respectively. The static level relative to the ground surface differs from one place to the next. In the high regions, it ranges from a few centimeters to a few meters; however, water is gushing north and east of Illizi. The Carboniferous: This aquifer is exploited in the In Aménas area at various depths from 800 to 1100 m. The water drawn by "lifting" is only used to keep the oil reservoirs under pressure and for irrigation; the static level ranges between 200 and 300 m (Peterson 1985; Boudjema 1987; van de Weerd and Ware 1994; Kouadri and Kateb 2021).

The Continental Intercalaire (CI) aquifer system: It is found in the stratigraphic interval between the Triassic and the Albian summit. The Barremian and Albian, which are sandstone and sandy–clayey Lower Cretaceous continental deposits, form the majority of the aquifer layers. It drains the Triassic and Jurassic sandstone and clay–sandstone deposits in the Stah and In Aménas regions (where the CI is known as the Zaraitine and Taouratin Series), from Barremian and Albien to Deb Deb and Albien to BOD and Rhourd Nouss (Boudjema, 1987; Kouadri and Kateb 2021).

Medium-depth (400–500 m) drilling in, (T.F.T), Ohanet, and (B.O.D) capture the aquifer. Rhourd Nouss and the north of Deb Deb are comparatively wide (800–1200 m). The sheet’s waters are gushing at Rhourd Nouss, Bordj Omar Driss, Tabankort, Maouar, Zemelet Mederba, and the north of Deb Deb; they are exploited by pumping at differing depths (from a few meters to 300 m) at Tinfouyé, Ohanet, the south of Deb Deb, and Stah; the useful tank's strength exceeds 250 m. Static pressure readings show that pressures can exceed 18 bars (e.g., Rhourd Nouss, gushing water). The Mio-Pliocene aquifer is made up of a rearrangement of sands and clays that stretches from the far north-west of the wilayah to the far northeast. Drilling 160–300 m deep is used to extract it in the Rhourd Nouss and El Hamra areas. The water is pumped out at different depths ranging from 80 to 100 m. Oued Djanet's Infero-Flux (Alluvial): The alluvial aquifer of Wadi Djanet was the region's largest and only water supply until the Cambro-Ordovician aquifer was discovered. It is a shallow aquifer spanning 17 km2 of heterogeneous alluviums ranging from silty sand to small pebbles resting on a twenty-kilometer stretch. Currently, 24 boreholes (including 1 well) have demanded it, with 9 boreholes and 1 well in operation. The water in this aquifer is of good quality, with dry residue ranging from 146 to 340 mg/l (Boudjema 1987; Montgomery 1993; Kouadri and Kateb 2021).

Geological settings

According to the National Organization of Hydrographic Network (A.N.R.H), the city of Illizi is situated on a plateau land consisting of lower Devonian clay-sandstone and Emsian clay-sandstone deposits, as well as Quaternary. To the north, the middle to upper undifferentiated Devonian layers outcrop for around 12 km, before being surpassed much further north by Upper Devonian to Carboniferous layers created primarily by the Khenig sandstone, upper Famennian at Tournaisien, with average coastlines of 550–650 m and peaks exceeding 700 m. This disparity in elevation creates a landscape of canyons that favors river drainage and flow acceleration (Kouadri and Kateb 2021).

A plain landscape stretches from the northeast to the side of Tin-Tourha, east to the field of Halloufa, and south to the side of Gara Souf Mellene, passing through Adjnadjane to the Gara Tan Harab. This plain, which has an 8-km radius, is mostly made up of post-Mesozoic (Quaternary) formations with an altimetry of 560–570 m. The lower Devonian formations, known as the Oued Samène formations, are located in the south and beyond 8 km (Siegenien). Their elevations are in excess of 700 m. With frank deformations and large fractures, these formations form a tectonic domain. Less significant faults run east–west as well.

In a strict sense, the geology of the state of Illizi is divided into two broad units: the crystalline basement and the sedimentary cover, which are lithostratigraphically distinct.

Data collection

This work relies on the water analysis results provided by the Directorate of Water Resources (DRE) of the State of Illizi. The data set consists of the analysis results of 114 samples taken from 57 exploited wells in 6 different layers. The samples were taken between 1999 and 2020. The analysis of each sample covered physical parameters (TDS, EC, and T °C) and chemical parameters (pH; the cations Ca2+, Mg2+, Na+, and K+; the anions Cl−, HCO3−, and SO42−; and NO3− as a pollution indicator). Applying the different models used in this work to this type of data is considered a new challenge, since their efficiency and performance are tested with an irregular data set.

Calculation of water quality index (WQI)

WQI is one of the most widely used tools for determining the quality of water and its suitability for human use (El Baba et al. 2020; Reyes-Toscano et al. 2020; Zhang et al. 2020; Maskooni et al. 2020; Bahir et al. 2020). The WQI is estimated in the following steps. First, each parameter is assigned a weight (wi) ranging from 1 to 5, based on its significance and impact on drinking water and human health. Mineralization, SO42−, Cl, and NO3 are assigned the highest weight of “5” due to their direct impact on water quality and human health (Seifi et al. 2020). The bicarbonates HCO3, on the other hand, have the minimum weight of “1”. Assigned weights, relative weights, and the limits required by WHO are shown in Table 1. The relative weight (Wi) of each parameter is then computed with Eq. (1).

$$\mathrm{Wi}=\frac{\mathrm{wi}}{\sum_{\mathrm{i}=1}^{\mathrm{n}}\mathrm{wi}}$$
(1)

where “Wi” is the relative weight, “wi” is the weight of each parameter, and “n” is the number of parameters.

Table 1 Assigned weights and relative weights of physicochemical parameters

Then, a quality rating scale (qi) for each parameter is calculated based on Eq. (2).

$$\mathrm{qi}=(\mathrm{Ci}/\mathrm{Si})\times 100$$
(2)

where “qi” is the quality rating, “Ci” is the concentration of each chemical parameter in the water sample (mg/L), and “Si” is the WHO drinking water quality standard (mg/L).

Furthermore, a subindex of the ith parameter is calculated using Eq. (3).

$$\mathrm{SIi}=\mathrm{qi}\times \mathrm{Wi}$$
(3)

where

“SIi” is the subindex rating.

“qi” is the quality rating.

“Wi” is the relative weight.

Finally, the water quality index is calculated as follows:

$$\mathrm{WQI}=\sum \mathrm{SIi}$$
(4)
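The four steps in Eqs. (1)–(4) can be sketched in a few lines of Python. This is only an illustrative sketch: the weights follow the text (5 for mineralization, SO42−, Cl, and NO3; 1 for HCO3), but the limit values below are placeholders and not the values of Table 1, and the function and variable names are hypothetical.

```python
# Weights as stated in the text; LIMITS are placeholder standards (mg/L),
# NOT the WHO limits listed in Table 1.
WEIGHTS = {"TDS": 5, "SO4": 5, "Cl": 5, "NO3": 5, "HCO3": 1}
LIMITS = {"TDS": 1000.0, "SO4": 250.0, "Cl": 250.0, "NO3": 50.0, "HCO3": 250.0}


def relative_weights(weights):
    """Eq. (1): Wi = wi / sum of all wi."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def water_quality_index(sample, weights, limits):
    """Eqs. (2)-(4): WQI = sum over parameters of (Ci / Si * 100) * Wi."""
    rel = relative_weights(weights)
    wqi = 0.0
    for name, w_rel in rel.items():
        q = sample[name] / limits[name] * 100.0  # Eq. (2): quality rating qi
        wqi += q * w_rel                         # Eq. (3): subindex SIi, summed per Eq. (4)
    return wqi


# Hypothetical sample with every concentration at half of its limit.
half_sample = {name: 0.5 * limit for name, limit in LIMITS.items()}
print(water_quality_index(half_sample, WEIGHTS, LIMITS))
```

Because the relative weights sum to one, a sample whose concentrations all equal their limits gets a WQI of 100, which gives a quick sanity check on any implementation.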

Artificial intelligence models

In this study, ANN, MLR, SVM, M5P tree, RF, LWLR, RSS, and AR models were proposed for estimating the WQI of Illizi groundwater. The data set was partitioned into two parts: 70% of the data were used for the calibration phase and the remaining 30% for verification. Selecting the dominant input parameters is one of the important steps in any AI-based modeling. MATLAB (R2018b) was used for the ANN and MLR analyses, while the remaining models were developed using the Waikato Environment for Knowledge Analysis (WEKA, version 3.8.4).
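As an illustration of the 70/30 partition, the sketch below splits 114 sample indices into calibration and verification sets. The text does not state how the samples were allocated, so the random shuffle and the fixed seed are assumptions, and `split_70_30` is a hypothetical helper name.

```python
import random


def split_70_30(samples, seed=0):
    """Shuffle the samples (assumed random allocation) and cut at 70%."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# 114 samples, as in the data set described above.
calibration, verification = split_70_30(range(114))
print(len(calibration), len(verification))
```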

Artificial neural network (ANN)

An artificial neural network (ANN) is a system whose dynamic functionality is inspired by the human nervous system. It was first introduced by McCulloch and Pitts (1943); the method builds a relationship between inputs and outputs through assigned weights, which play the role of a mathematical memory (Elbeltagi et al. 2020c).

As seen in Fig. 2, the ANN is made up of three groups of layers: the input layer, the hidden layers, and the output layer. The hidden layers are intermediate layers between the independent input layer and the dependent output layer, where all the computations are performed, and the output layer returns the result for the given inputs (Babaee et al. 2021). The input layer's circles are denoted by the vector "i." The hidden neuron layers are represented by the middle circles; these circles are the "activation" nodes and are associated with the weights (Ws). The final circle represents the output layer, which gives the predicted value of the water quality index (Elbeltagi et al. 2020a, b, c, d).

Fig. 2 Architecture of ANN model

To optimize the performance of the network, a training algorithm is used, such as the feed-forward back-propagation algorithm. This algorithm minimizes the error by calculating the difference between the actual and predicted values. Based on the size of the error, new weights are assigned in order to obtain better predictions. The main factors affecting the performance of an ANN include the number of hidden neurons and the activation function (Kouadri et al 2021; Elbeltagi et al. 2021a, b). To select the optimal number of hidden neurons, an iterative algorithm was used to plot the performance of the ANN model, measured as MSE in the training and validation phases, against the number of hidden neurons. The optimal number of hidden neurons is the one that gives the lowest error in both the training and validation phases.

Multilinear regression (MLR)

Multilinear regression analysis is considered one of the simplest mathematical models. It is based on the linear relationships between inputs and outputs. In other words, it extracts the linear relationships between dependent and independent variables by including a regression constant in the formula (Sihag et al. 2020). MLR is based on the equation below:

$$y={b}_{0}+{b}_{1}{x}_{1}+{b}_{2}{x}_{2}+\dots +{b}_{i}{x}_{i}$$

where y is the dependent variable, b0 is the regression constant, bi is the regression coefficient of the ith predictor, and xi is the ith predictor.
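To make the role of the regression constant and coefficients concrete, the following sketch fits them by ordinary least squares through the normal equations. It is a minimal pure-Python illustration and not the MATLAB procedure used in the study; `fit_mlr` and `predict_mlr` are hypothetical names, and the toy data are invented.

```python
def fit_mlr(X, y):
    """Return [b0, b1, ..., bp] minimizing the squared error (normal equations)."""
    rows = [[1.0] + [float(v) for v in r] for r in X]  # leading 1 carries b0
    n = len(rows[0])
    # Augmented system [X^T X | X^T y]
    M = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    # Back substitution
    b = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * b[k] for k in range(i + 1, n))
        b[i] = (M[i][n] - tail) / M[i][i]
    return b


def predict_mlr(b, x):
    """Evaluate y = b0 + b1*x1 + ... + bp*xp."""
    return b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))


# Toy data generated from y = 1 + 2*x1 + 3*x2 (invented for illustration).
X_toy = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y_toy = [1, 3, 4, 6, 8]
coefficients = fit_mlr(X_toy, y_toy)
print(coefficients)
```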

Support vector machine (SVM)

The support vector machine (SVM) was initially developed to identify the distribution pattern of data samples, classify them into categories, and support decision-making. The main idea of this method is to use a set of sample points as supports for drawing vectors that separate the classes in the data. Whereas SVM was used to solve discrete (classification) problems, the support vector machine regressor (SVMR) was created to deal with continuous problems. This method has many features that make it a frequent choice for solving linear and nonlinear correlation problems (Elbeltagi et al. 2021a, b). Among its advantages is its reliance on the structural risk minimization (SRM) principle, which has shown greater effectiveness than traditional empirical risk minimization (ERM). SRM has a strong ability to reduce error, unlike some other methods, such as artificial neural networks, that reduce the error only on the results of the training phase; this makes the SVM method more effective in prediction problems. The SVR problem formulation is often best understood from a geometrical perspective, using the one-dimensional example in Fig. 3. The equation below represents the continuous-valued function that is being approximated (Awad & Khanna 2015).

Fig. 3 One-dimensional linear SVR

$$y=f\left(x\right)=\langle w,x\rangle +b={\sum }_{j=1}^{M}{w}_{j}{x}_{j}+b,\quad y,b\in {\mathbb{R}},\; x,w\in {\mathbb{R}}^{M}$$

To simplify the mathematical notation for multidimensional data, augment x with a one and include b in the w vector to obtain the multivariate regression in the equation below:

$$f\left(x\right)={\left[\begin{array}{c}w\\ b\end{array}\right]}^{T}\left[\begin{array}{c}x\\ 1\end{array}\right]={w}^{T}x+b,\quad x,w\in {\mathbb{R}}^{M+1}$$

M5P tree

The M5P tree model was presented by Quinlan (1992). It is a tree learner designed for regression problems. The algorithm divides the overall problem into smaller problems by splitting the data, builds a multivariate model for each small problem, and assigns linear regression functions to the terminal nodes. This method can deal with complex problems with many variables, provided that the target is a continuous class rather than discrete classes (Adnan et al. 2021; Sihag et al. 2020; Singh et al. 2017).

Figure 4 presents an M5P tree architecture. The M5P tree determines its splitting criterion from the amount of error calculated at each node. The error is measured by the standard deviation of the values reaching the node, and the attribute chosen for the split is the one that most reduces this error, found by testing all the attributes at that node. The standard deviation reduction (SDR) is calculated by the following equation:

$$SDR=sd\left(K\right)-\sum_{i}\frac{\left|{K}_{i}\right|}{\left|K\right|}sd\left({K}_{i}\right)$$

where K is the set of instances that reach the node, Ki is the subset of instances that have the ith outcome of the potential split, and sd is the standard deviation.
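The standard deviation reduction can be illustrated with a short Python sketch. The use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not specify which estimator is meant, and `sdr` is a hypothetical helper name; the target values are invented.

```python
from statistics import pstdev


def sdr(parent, subsets):
    """SDR = sd(K) - sum(|Ki| / |K| * sd(Ki)) for one candidate split."""
    n = len(parent)
    return pstdev(parent) - sum(len(s) / n * pstdev(s) for s in subsets)


# Invented target values reaching a node.
node_values = [1.0, 1.0, 5.0, 5.0]

# A split that separates low from high values removes all the spread...
good_split = sdr(node_values, [[1.0, 1.0], [5.0, 5.0]])
# ...while a split that mixes them removes none.
poor_split = sdr(node_values, [[1.0, 5.0], [1.0, 5.0]])
print(good_split, poor_split)
```

The split with the larger SDR is the one a tree learner of this kind would prefer at that node.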

Fig. 4 M5P tree architecture model

Random forest (RF)

The random forest method was first introduced by Breiman (2001). It is a machine learning method that relies on an ensemble of decision trees, each grown individually, whose outputs are averaged to predict the target. Building a random forest regression depends on two parameters: first, the number of trees to be grown in the forest, denoted by (k); second, the number of variables considered at each node when growing a tree, denoted by (m) (Bournas et al. 2003; Pham et al. 2017; Sihag et al. 2019). The architecture of the random forest model is presented in Fig. 5.

Fig. 5 Random forest architecture model

Locally weighted linear regression (LWLR)

LWLR is a multivariate smoothing technique for fitting a regression surface to data. The dependent variable is smoothed as a function of the independent variables in a moving fashion, similar to how a moving average is calculated for a time series. The fundamental structure is as follows. Let xi = (xi1, …, xip), i = 1, …, n, be n measurements of p independent variables, and let yi, i = 1, …, n, be measurements of the dependent variable. Assume that the data are generated by yi = g(xi) + ξi, where, as in the most commonly used regression framework, the ξi are independent normal variables with mean 0 and variance σ2. In the ordinary setting, g is assumed to be a member of a parametric class of functions, such as polynomials; here, we only assume that g is a smooth function of the independent variables. With local fitting, we can approximate a much larger class of smooth functions than could be expected from any one parametric class of functions (Cleveland and Devlin 1988; Kisi and Ozkan 2017).
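The local-fitting idea can be sketched for a single independent variable as follows. This is an illustrative sketch only: the Gaussian kernel and the `bandwidth` parameter are assumptions chosen for brevity (they are not taken from the text or from the WEKA implementation), and `lwlr_predict` is a hypothetical name.

```python
import math


def lwlr_predict(x0, xs, ys, bandwidth=1.0):
    """Fit a weighted straight line around x0 and evaluate it at x0."""
    # Points close to x0 receive larger weights (assumed Gaussian kernel).
    w = [math.exp(-((x - x0) ** 2) / (2 * bandwidth ** 2)) for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw  # weighted mean of x
    my = sum(wi * y for wi, y in zip(w, ys)) / sw  # weighted mean of y
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)


# Invented data lying exactly on y = 2x + 1; a local line recovers it.
xs_toy = [0.0, 1.0, 2.0, 3.0, 4.0]
ys_toy = [2 * x + 1 for x in xs_toy]
print(lwlr_predict(2.5, xs_toy, ys_toy))
```

A separate line is fitted for every query point, which is what allows the fitted surface to follow a smooth g without committing to one parametric form.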

Random subspace (RSS)

Ho (1998) was the first to implement the RSS model, as a novel combined algorithm for solving natural problems based on artificial intelligence. This model combines multiple classifiers, each trained on a modified feature space. The training basis of this model is the set of multiple training subsets generated for the classifiers (Ho 1998). The training set (x), the base classifier (w), and the number of subspaces (L) are the RSS inputs (Kuncheva and Plumpton 2010; Luo et al. 2019; García-Pedrajas and Ortiz-Boyer 2008; Lai et al. 2006; Wang et al. 2018, 2015). This technique is strongly advocated by Pham et al. (2017) to avoid over-fitting problems and to cope with data sets containing largely redundant information. Figure 6 presents the architecture of an RSS model.

Fig. 6 Random subspace architecture model

Additive regression (AR)

Hastie and Tibshirani (1986) introduced the generalized additive model (GAM). The GAM, an extension of the generalized linear model (GLM) (McCullagh and Nelder 1989), has several benefits over the latter. Unlike the GLM, which rests on the explicit assumption of linearity in the parameters, the GAM assumes no particular form of dependence, so the relationship need not be linear. Its theory is based on using a sum of nonlinear functions to model the response, which allows the effect of each explanatory variable to be modeled more specifically. This flexibility makes it a common technique for modeling the effects of environmental variables, since these effects are often nonlinear and difficult to specify parametrically (Peng and Dominici 2008; Bruneau and Grégoire 2011). The literature review by Jbilou and El Adlouni (2012) described the GAM as a powerful technique in environmental health studies for detecting nonlinear associations between an environmental explanatory variable and a health-related dependent variable. The equation used for this algorithm is written as:

$$g\left(E\left(y\right)\right)={\beta }_{0}+{f}_{1}\left({x}_{1}\right)+{f}_{2}\left({x}_{2}\right)+\dots +{f}_{p}\left({x}_{p}\right)+\varepsilon$$
(5)

Estimating this model relies on the nonlinear smooth functions \({f}_{i}\left({x}_{i}\right)\), i = 1, …, p, one for each explanatory variable \({x}_{i}\).

Several candidate splits of the data set are evaluated using the standard deviation reduction (SDR) as the criterion for choosing the best attribute to segment the data set at each node. The selected attribute is the one expected to reduce the error the most.

$$SDR=SD\left(Tree\right)-\sum_{i}\frac{\left|{Tree}_{i}\right|}{\left|Tree\right|}\times SD\left({Tree}_{i}\right)$$
(6)

where \({Tree}_{i}\) denotes the subset of examples that have the ith outcome of the potential split, and SD() denotes the standard deviation of its argument. Splitting stops when the number of instances reaching a node falls to a certain number or when the values change only slightly. All model parameters used for modeling the WQI are given in Table 2.

Table 2 Parameters of machine learning algorithms used for WQI modeling

Sensitivity analysis

When there are several input variables, feature selection is one of the most important steps in developing a soft computing model to forecast and simulate engineering phenomena. There are many methods for determining the best possible combinations, including best subset regression, mutual information, forward stepwise selection, and so on. Best subset regression analysis was used in this research to find the best input combinations for the WQI model. Six statistical criteria were computed for this purpose: MSE, coefficient of determination (R2), adjusted R2, Mallows' Cp (Gilmour 1996), Akaike's AIC, and Amemiya's PC (Claeskens and Hjort 2008).

Model’s performance criteria

Throughout the course of the analysis, actual WQI data and modeled values were compared. The following statistical metrics were chosen to determine the accuracy of models: root mean square error, coefficient of determination, and mean absolute error (Malone et al. 2017; Elbeltagi et al. 2020a, b, d).

All parameters are defined as follows:

\({WQI}_{A}^{i}\) is the calculated or actual value.

\({WQI}_{P}^{i}\) is the predicted value.

\({WQI}^{-}\) is the mean value of reference samples, and N is the total number of data points.

Root mean square error

The RMSE is the sample standard deviation of the differences between predicted and actual values. It is given by:

$$RMSE=\sqrt{\frac{1}{\mathrm{N}}{\sum }_{i=1}^{N}{{(WQI}_{A}^{i}-{WQI}_{P}^{i})}^{2}}$$
(7)

Mean absolute error

The mean absolute error assesses the magnitude of errors in a series of predictions without taking their sign into account. It is the average of the absolute differences between predicted and observed values over the test sample. It is defined as follows:

$$MAE= \frac{1}{\mathrm{N}}{\sum }_{i=1}^{N}{|WQI}_{P}^{i}-{WQI}_{A}^{i}|$$
(8)

Relative absolute error

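The relative absolute error normalizes the total absolute error by dividing it by the total absolute error of a simple predictor (the mean of the actual values, \({WQI}^{-}\)).

$$RAE=\frac{{\sum }_{i=1}^{N}\left|{WQI}_{P}^{i}-{WQI}_{A}^{i}\right|}{{\sum }_{i=1}^{N}\left|{WQI}_{A}^{i}-{WQI}^{-}\right|}\times 100$$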
(9)

Root relative squared error

The relative squared error normalizes the total squared error by dividing it by the total squared error of a simple predictor (the mean of the actual values). Taking the square root of the relative squared error reduces the error to the same dimensions as the quantity being predicted.

$$RRSE=\frac{\sqrt{{\sum }_{i=1}^{N}{{(WQI}_{P}^{i}-{WQI}_{A}^{i})}^{2}}}{\sqrt{{\sum }_{i=1}^{N}{{(WQI}_{A}^{i}-{WQI}^{-})}^{2}}}$$
(10)
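The four criteria can be computed with a few short functions, sketched below in Python. RMSE, MAE, and RRSE follow Eqs. (7), (8), and (10); RAE follows the verbal definition above (total absolute error relative to that of a predictor that always returns the mean of the actual values), which is an interpretation on our part. The function names and the example values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error, Eq. (7)."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean absolute error, Eq. (8)."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def rae(actual, predicted):
    """Relative absolute error in percent, relative to the mean predictor."""
    mean_a = sum(actual) / len(actual)
    num = sum(abs(p - a) for a, p in zip(actual, predicted))
    den = sum(abs(a - mean_a) for a in actual)
    return num / den * 100.0


def rrse(actual, predicted):
    """Root relative squared error, Eq. (10)."""
    mean_a = sum(actual) / len(actual)
    num = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    den = sum((a - mean_a) ** 2 for a in actual)
    return math.sqrt(num) / math.sqrt(den)


# Invented values: three perfect predictions and one that is off by 2.
actual_wqi = [1.0, 2.0, 3.0, 4.0]
predicted_wqi = [1.0, 2.0, 3.0, 6.0]
print(rmse(actual_wqi, predicted_wqi), mae(actual_wqi, predicted_wqi))
print(rae(actual_wqi, predicted_wqi), rrse(actual_wqi, predicted_wqi))
```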

Results and discussion

Statistical analysis

Table 3 presents the descriptive statistics for the 114 groundwater samples. The correlation matrix is useful since it illustrates the importance of each parameter independently and its effect on the hydrochemistry mechanism (Helena et al. 2000; Khan 2011; Patil et al. 2020; Islam et al. 2017; 2020b). If the values of r are +1 or −1 in the Pearson’s correlation matrix (Table 4), they signify total correlation, i.e., functional dependency, between two variables. If the values are close to zero, there is no meaningful relationship between the two variables at the p < 0.05 level (Singh et al. 2011; Patil et al. 2020). If r is greater than 0.7, the parameters are highly correlated, and if r is between 0.4 and 0.7, the parameters are moderately correlated. A correlation matrix is used in this study to examine the correlation between the chemical parameters and the WQI values. The WQI, which is the focus of this study, has very weak correlations with pH and HCO3-, moderate correlations with EC, TH, K+, Cl, and NO3, and strong correlations with TDS, Ca2+, Mg2+, Na+, and SO42−.

Table 3 Descriptive statistics for all input and output variables
Table 4 The Pearson correlation coefficient between input and output variables

The electrical conductivity of water (EC) has a negative correlation with pH; a weak positive correlation (r < 0.4) with HCO3, Ca2+, and Cl; a moderate correlation (0.4 < r < 0.7) with TH, TDS, Mg2+, Na+, K+, SO42−, and WQI; and a strong correlation (r > 0.7) with NO3. The total hardness (TH) is moderately correlated with HCO3, Mg2+, and WQI, while no correlation exists with the remaining parameters. pH shows no correlation with the other parameters, with r ranging between −0.189 and 0.128. The correlation of TDS with HCO3 and NO3 is weak and moderate, respectively, whereas Ca2+, Mg2+, Na+, K+, SO42−, Cl, and WQI all correlate strongly with it. HCO3- has no relationship with Ca2+, Mg2+, Na+, K+, SO42−, Cl, NO3, and WQI; on the other hand, Ca2+, Mg2+, Na+, K+, SO42−, and Cl show strong to moderate correlations with each other.
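The thresholds used in this discussion can be expressed as a small Python sketch that computes Pearson's r and labels its strength. Applying the thresholds to the absolute value of r, so that negative correlations are graded in the same way, is an assumption; `pearson_r` and `correlation_strength` are hypothetical names and the data are invented.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_strength(r):
    """Label |r| with the thresholds quoted in the text (0.4 and 0.7)."""
    magnitude = abs(r)  # assumption: the sign is ignored when grading strength
    if magnitude > 0.7:
        return "strong"
    if magnitude >= 0.4:
        return "moderate"
    return "weak"


# Invented series with a perfect linear relationship.
r_example = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(r_example, correlation_strength(r_example))
```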

Sensitivity analysis

In this section, a sensitivity analysis is performed to determine the most sensitive parameters in the considered combination set for predicting WQI. The selection of the 2 best input combinations is mainly based on the nonlinear subset regression and sensitivity analysis. The advantage of using the nonlinear sensitivity input variables selection approach to carefully determine the most relevant factors has been reported in several studies (Bui et al. 2020; Kisi et al. 2018; Liu et al. 2019). The best subset regression analysis for determining the best input combinations is presented in Table 5. We found that the best combination was TH / pH / TDS / Ca / Mg / Na / K / SO4 / Cl / NO3, which achieved a high correlation and lower statistical errors. Besides, all the combinations found generated good results.

Table 5 The best subset regression analysis for determining the best input combinations to model WQI

Figure 7 presents the standardized coefficients of the input variables for the sensitivity analysis. TH is identified as the most sensitive parameter, with the highest standardized coefficient (0.453) among the considered parameters. TDS ranks second, with a standardized coefficient of 0.243. SO42−, Cl−, and NO3− have standardized coefficients of 0.152, 0.176, and 0.135, respectively, whereas the remaining parameters are considered non-influential in predicting the WQI (Table 6).

Fig. 7 The standardized coefficients of the input variables for sensitivity analysis

Table 6 The regression analysis for identifying the most effective parameters

Based on the results in Tables 5 and 6 and Fig. 7, and in order to achieve the objective of this paper, two input combinations were chosen: the first includes all the parameters, whereas the second contains only the two most influential inputs for predicting WQI, namely TH and TDS.
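
To make the notion of a standardized coefficient concrete, the following sketch computes the standardized regression coefficients for the special case of two predictors, using the closed-form expression in terms of pairwise Pearson correlations. It is a simplified illustration under stated assumptions, not the procedure used to produce Fig. 7 or Table 6: the study ranks a larger set of inputs, the function names are hypothetical, and the sample values are invented.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def standardized_coefficients(x1, x2, y):
    """Standardized least-squares coefficients of y regressed on x1 and x2.

    For two predictors:
        beta1 = (r_y1 - r_y2 * r_12) / (1 - r_12^2)
        beta2 = (r_y2 - r_y1 * r_12) / (1 - r_12^2)
    """
    r_y1 = pearson_r(x1, y)
    r_y2 = pearson_r(x2, y)
    r_12 = pearson_r(x1, x2)
    denom = 1.0 - r_12 ** 2
    if denom == 0.0:
        raise ValueError("predictors are perfectly collinear")
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    return beta1, beta2

# Hypothetical values for illustration only; not data from the study.
th = [200.0, 350.0, 300.0, 520.0, 610.0, 450.0]
tds = [600.0, 700.0, 1100.0, 900.0, 1500.0, 1300.0]
wqi = [60.0, 85.0, 95.0, 120.0, 160.0, 130.0]

beta_th, beta_tds = standardized_coefficients(th, tds, wqi)
ranking = sorted([("TH", beta_th), ("TDS", beta_tds)],
                 key=lambda item: abs(item[1]), reverse=True)
print(ranking)
```

Because the coefficients are expressed in standard-deviation units, they can be compared across inputs measured on different scales, which is what allows the inputs to be ranked by sensitivity.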

Evaluation of several ML models in WQI prediction

This study evaluated eight different methods for predicting the water quality index (WQI): MLR, ANN, M5P tree, SVM, RF, AR, RSS, and LWLR. Two combinations of input variables were used. The first contained all the chemical parameters used in the calculation of the WQI, while the second was limited to only two, namely total dissolved solids (TDS) and total hardness (TH). These two parameters were identified as the main controls on the WQI based on the sensitivity analysis results. It is worth mentioning that the Continental Intercalaire (CI) aquifer system receives non-point-source sewage from different industries and agricultural inputs, which contribute strongly to the deterioration of the WQI. In groundwater studies, several factors generally affect the predictive precision of models. In this work, one possible factor is the low correlation of pH with WQI, TDS, and TH. Another is the enhanced pollution triggered by industrial inputs, which can drastically decrease the precision of the models. This result is in good agreement with the study by Zhu and Heddam (2019).

Five statistical indicators were selected to evaluate and compare the performance of the different models. Table 7 presents the results of the models for the first input combination in the training and testing phases. As shown in Table 7, the MLR model performed perfectly in the training phase, with a correlation coefficient of R = 1 and the smallest error indicators: MAE = 1.4 × 10⁻⁸, RMSE = 2.14 × 10⁻⁸, RAE = 1.25 × 10⁻¹⁰%, and RRSE = 3.17 × 10⁻¹⁰. It was followed directly by the ANN model, with R = 0.9996, MAE = 0.925, RMSE = 1.4013, RAE = 1.89%, and RRSE = 0.024, whereas the lowest-performing model in the training phase was the LWLR model, with R = 0.9423, MAE = 15.52, RMSE = 18.39, RAE = 36%, and RRSE = 33.76. The values of the performance indicators show generally acceptable performance for all eight models. Yaseen et al. (2019) reported that RMSE is the most significant predictive numerical index for measuring model performance in data-mining modeling and time series forecasting. Our finding is in line with that of Yaseen et al. (2018), where prediction accuracy increases as the number of input variables increases for the prediction of WQI.
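
For readers who wish to reproduce indicators of this kind, the sketch below computes R, MAE, RMSE, RAE, and RRSE from paired calculated and predicted values. The formulas are the conventional definitions of these indicators and are assumed here, since the defining equations are not repeated in this section; the function name `evaluate` and the sample values are hypothetical.

```python
import math

def evaluate(actual, predicted):
    """Return R, MAE, RMSE, RAE (%) and RRSE (%) for paired sequences.

    RAE and RRSE are taken relative to a baseline that always predicts
    the mean of the actual values (conventional definition, assumed).
    """
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    abs_dev = sum(abs(a - mean_a) for a in actual)
    sq_dev = sum((a - mean_a) ** 2 for a in actual)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "R": cov / math.sqrt(sq_dev * var_p),
        "MAE": abs_err / n,
        "RMSE": math.sqrt(sq_err / n),
        "RAE_%": 100.0 * abs_err / abs_dev,
        "RRSE_%": 100.0 * math.sqrt(sq_err / sq_dev),
    }

# Hypothetical values for illustration only; not data from the study.
calculated_wqi = [10.0, 20.0, 30.0, 40.0]
predicted_wqi = [12.0, 18.0, 33.0, 37.0]

for name, value in evaluate(calculated_wqi, predicted_wqi).items():
    print(name, round(value, 4))
```

A model that reproduces the calculated WQI exactly yields R = 1 and zero for all four error indicators, which is the limiting behavior reported for MLR with the first input combination.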

Table 7 First input combination model result

For the testing phase, the MLR model had the highest correlation value (R = 1) and the smallest error indicators, which were close to zero: MAE = 4.8 × 10⁻⁹, RMSE = 7.7 × 10⁻⁹, RAE = 7.7 × 10⁻¹¹%, and RRSE = 2.5 × 10⁻¹⁰. It was followed by the ANN model, with R = 0.9987, MAE = 1.4, RMSE = 2.7, RAE = 1.68%, and RRSE = 0.044, whereas the weakest testing-phase performance was recorded for the SVM model, with R = 0.9412, MAE = 5.16, RMSE = 11.386, RAE = 22.6%, and RRSE = 37.265. The predictive capability of the MLR model is not surprising, because the WQI is calculated through a linear procedure that a linear regression on all of the input parameters can reproduce almost exactly; MLR has also shown good predictive ability in other studies (Abba et al. 2020; El Bilali et al. 2021).

Table 8 presents the performance of the eight models for the second input combination, which includes only TH and TDS. During the training phase, the best results were recorded for the RF model, with R = 0.9984, MAE = 1.99, RMSE = 3.248, RAE = 4.6%, and RRSE = 5.96%. The ANN model came second, with R = 0.9969, MAE = 2.46, RMSE = 3.88, RAE = 3.3%, and RRSE = 7.01%. The MLR model, which provided the best performance with the first input combination, dropped to fifth place with the second combination, with R = 0.9958, MAE = 3.48, RMSE = 4.98, RAE = 4.23%, and RRSE = 7.37. The weakest performance was recorded for the LWLR model, with R = 0.9406, MAE = 15.33, RMSE = 18.74, RAE = 36.08%, and RRSE = 34.42%. For the testing phase, the ANN model outperformed the rest of the models in terms of the correlation coefficient, with R = 0.9957, MAE = 3.85, RMSE = 6.19, RAE = 3.96%, and RRSE = 9.35%. It was followed by the RF model, with R = 0.9926, MAE = 2.15, RMSE = 3.82, RAE = 9.45%, and RRSE = 12.51. The weakest performance was recorded for the MLR model, with R = 0.9325, MAE = 7.94, RMSE = 11.04, RAE = 12.51%, and RRSE = 36.15%. The main reason for the poorer performance of the other models in both input combinations may be the inverse association identified by the negative correlations between the observed pH and the NO3− and HCO3− parameters, except for the TH and TDS values. This observation is analogous to the results reported by Zhu and Heddam (2019).

Table 8 Second input combination model result

It is noted that an ensemble tree-based model such as RF outperformed all the other models with considerable accuracy for the second input combination, owing to its robustness in dealing with complicated pathways and its ability to make predictions without requiring large, regular datasets. Our results showed that the RF model is superior to the other models in terms of precision. The key reason is that the RF model can accommodate high-dimensional factors to improve water quality prediction accuracy, e.g., the inclusion of a monthly physicochemical variable in this study. Similarly, Castrillo and García (2020) reported a higher prediction precision for the RF model than for the MLR model. This is also in line with earlier published works on classification problems (Salam and Islam 2020; Chen et al. 2020).

Figure 8 shows the scatter of the calculated WQI values against the predicted WQI values for each model separately using the first set of inputs. From Fig. 8, the MLR model appears to be the most suitable for predicting the WQI, since its points coincide with the perfect 1:1 line. Figure 9 shows the corresponding scatter plots for the second combination of inputs. Here, the MLR model points show a large dispersion, while the RF model points lie closer to the 1:1 line than those of the other models. The largest dispersion occurred for the LWLR and RSS models, which indicates their poor performance with the second set of inputs.

Fig. 8 First input combination model predicted vs calculated WQI in testing phase

Fig. 9 Second input combination model predicted vs calculated WQI in testing phase

The best model in each scenario is presented in Fig. 10 using scatter plots with smooth lines, blue for calculated WQI and purple for predicted WQI values, with markers representing samples. Part (a) presents the results of the MLR model from the first scenario, where an optimal fit is shown between the calculated and predicted WQI values. Part (b) presents the RF model from the second scenario. The fit in the second scenario is not as good as in the first because the number of inputs was reduced, which is why some predicted points do not match their calculated counterparts.

Fig. 10 Scatter plot of calculated and predicted WQI values in testing period using best models, a MLR model and b RF model

In addition, the Wilcoxon rank-sum test was used to confirm the results described in the previous paragraphs. This is a nonparametric statistical test used to compare two groups. The test calculates the differences between the pairs, and the results are used to determine whether the two groups are statistically different from each other. In this work, the test was used to evaluate the null hypothesis that the two groups have the same continuous distribution. To apply this test, the data should come from the same population and be paired, with random and independent data selection. Table 9 presents the P values for each model for the first and second input combinations. For the first input combination, the highest probability was recorded for the MLR and AR models, with P = 0.9951 for both, whereas the lowest probability was recorded for the RSS model, with P = 0.4730.

Table 9 Wilcoxon rank-sum test results (P values) of the optimal models in predicting WQI using first and second combination inputs

The use of the second input combination produced noticeable changes in the performance of the models. The highest probability of a match was recorded for the RF model, with P = 0.9951. Both the MLR and AR models showed marked decreases, with P = 0.8588 and P = 0.7585, respectively. The weakest performance was again recorded for the RSS model, with P = 0.5519.
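
The way P values of this kind are obtained can be illustrated with a minimal rank-sum implementation. The sketch below is an assumption-laden illustration rather than the procedure behind Table 9: it implements the two-sample Wilcoxon rank-sum statistic with a normal approximation and no tie correction, the function names are hypothetical, and the sample values are invented. Established libraries provide equivalent routines (for example, `scipy.stats.ranksums` in SciPy).

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, no tie correction).

    Returns (z, p). A large p gives no evidence that x and y come from
    different distributions.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                          # rank sum of the first sample
    expected = n1 * (n1 + n2 + 1) / 2.0          # mean of w under the null
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - expected) / std
    p = math.erfc(abs(z) / math.sqrt(2.0))       # equals 2 * (1 - Phi(|z|))
    return z, p

# Hypothetical values for illustration only; not data from the study.
calculated_wqi = [62.0, 75.5, 88.1, 101.3, 120.7, 140.2]
predicted_wqi = [63.4, 74.0, 90.2, 99.8, 118.5, 143.0]

z_stat, p_value = rank_sum_test(calculated_wqi, predicted_wqi)
print(round(z_stat, 4), round(p_value, 4))
```

Read in this way, a P value close to 1 indicates that the predicted and calculated WQI values are statistically indistinguishable in distribution, whereas a small P value would indicate a systematic difference between them.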

The physicochemical parameters chosen in the current study may also pose a limitation because of possibly inadequate sampling. In addition, uncertainty in physically based water quality models is inevitable and has been discussed in many studies (Bui et al. 2020; Kisi et al. 2018; Singha et al. 2021). Future research may use different input physicochemical parameters to predict the WQI based on WHO guidelines and compare the results with other standard indices. The model presented here should also be appraised in other, similar climatic and hydrological settings. However, given the noisy characteristics of this dataset, there remains a risk that the models did not fit the data well, which might undermine the outcomes of scenario forecasting. Adding more influential physicochemical factors could also improve model fitting. For example, there may be other factors affecting TDS concentration besides climate and hydrogeological features (Islam et al. 2017). With the ongoing development of machine learning models, predicting contaminant concentrations under future pollution scenarios is a promising direction for further work, provided that the machine learning algorithm fits the data well.

As mentioned in previous studies, a key gap in water quality studies has been the lack of consideration of cross effects between explanatory variables, such as the cross-correlation between land covers and the cross-correlation between land cover and climate in influencing stream water quality (Islam et al. 2021). Machine learning models can exploit such interactions among input variables to improve predictive accuracy, which is an advantage over conventional statistical models. For example, physicochemical factors likely interact with environmental variables and groundwater pollution in their effects on groundwater quality, and accounting for these interactions can therefore improve predictive accuracy.

Conclusion

In this work, the effectiveness of a group of artificial intelligence methods in predicting the water quality index (WQI) in a dry desert environment was examined based on 114 samples collected from six aquifers at different time periods in Illizi state, southeast Algeria. Eight artificial intelligence models, namely MLR, ANN, SVM, M5P tree, RSS, RF, AR, and LWLR, were used, and their predictive ability was tested in two scenarios with two different input combinations. The two proposed scenarios aim to solve two main problems. First, the classical computational method is replaced with a modeling approach. Second, the study provides an alternative solution when data are lacking or unavailable in critical cases. The first set of inputs included all the chemical parameters present in the water and used in calculating the WQI, while the second combination contained the parameters controlling water quality changes, which were determined using the sensitivity analysis.

The sensitivity analysis shows that all the subsets performed well as predictors in modeling WQI; the selection of only two parameters as inputs in the second scenario was intended to propose an alternative solution for monitoring the WQI in the study area in critical cases. In the second scenario, the modeling procedure showed that TDS and TH concentrations were the most vital determinants of WQI. The MLR model performed perfectly in the first scenario because the calculation procedure of the WQI is linear, so an MLR model with all the parameters as inputs can reproduce it almost exactly. Reducing the number of inputs directly affects the performance of the models, and the aim of the second scenario was to determine which model performs well under such conditions. The RF model was observed to be the best model for predicting WQI from TH and TDS in the study area.

It is worth noting that the MLR and RF algorithms generated robust results using a dataset covering longer periods under the two scenarios. Thus, these algorithms might be useful for developing regions that have very few wellbores. Our results suggest that the RF algorithm could be a robust and cost-effective model to enhance groundwater quality management plans in an arid region in southeast Algeria. This model may be more applicable in developing countries, where the costs of measuring several water quality variables are high and often prohibitive. However, these outcomes cannot be generalized directly to other regions or other hydrogeological datasets, and these algorithms might not be optimal (i.e., most reliable) in all areas and under all conditions.