Introduction

Groundwater is an important water source for many agricultural and industrial activities in the Akhmim area, Sohag Governorate, Egypt. Many authors have assessed the quality of groundwater and its suitability for drinking and irrigation (among them, Hagage et al. 2021; Balamurugan et al. 2020, 2021; Elbeih and El-Zeiny 2018; Ismaila and El-Rawyba 2018; Gedamy 2015; Melegy et al. 2014; Ahmed and Ali 2011). Elbeih and El-Zeiny (2018) evaluated the groundwater quality west of Sohag Governorate in 2008 and 2016, based on some physicochemical characteristics of the groundwater and a set of retrieved land use spectral indices. Ismaila and El-Rawyba (2018) assessed the hydrochemical properties of groundwater resources west of Sohag, Egypt, based on chemical analyses of groundwater samples collected in 2014. Melegy et al. (2014) studied the geochemical mobility of some heavy metals in water resources and their impact on human health west of Sohag Governorate. Their results recorded high contamination with cadmium and lead, and about 50% of the water samples were contaminated with iron and manganese. Ahmed and Ali (2011) reported that the groundwater resources of Sohag are threatened by pollution resulting from urbanization and agricultural activities. Ammonia occurs naturally in groundwater as a result of the anaerobic decomposition of organic materials (Bohlke et al. 2006). It can also reach the groundwater through leakage from sewage systems (Johan Lindenbaum 2012). Hagage et al. (2021) studied the suitability of the groundwater for drinking and irrigation in the Akhmim area, Egypt. They concluded that about 95% of the collected groundwater samples are highly contaminated with ammonia and that this contamination resulted from urban growth and from agricultural and industrial activities. The present study proposes a predictive model for groundwater contamination in the Akhmim area, Sohag Governorate, Egypt, using the Random Forest (RF) and Multivariate Logistic Regression (MLR) algorithms.

Machine learning is the algorithmic study of how computers simulate or implement human learning behavior. Machine learning algorithms are designed to accurately predict patterns within multivariate data (Cracknell 2014). They are classified into three main classes: supervised, unsupervised, and reinforcement algorithms (Russell and Norvig 2010). They are widely used in applications such as pattern recognition, anomaly detection, and classification. The use of machine learning (ML) algorithms such as Logistic Regression and Random Forest to predict water quality and groundwater contamination has been tested and evaluated by many authors (Venkataraman and Uddameri 2012; Mair and El-Kadi 2013; Solanki et al. 2015; Wang et al. 2017; Muharemia et al. 2019; Vijay and Kamaraj 2019a; Rizeei et al. 2018; Hosseini et al. 2018; Aldhyani et al. 2020). Venkataraman and Uddameri (2012) used a Logistic Regression model to predict the exceedance of drinking-water standards for arsenic and nitrate in the Southern Ogallala aquifer. Mair and El-Kadi (2013) assessed groundwater vulnerability to contamination in Hawaii using a Logistic Regression model. Madani and Niyazi (2015) used a knowledge-driven GIS model for groundwater potentiality mapping over Wadi Yalamlam, western Saudi Arabia. Solanki et al. (2015) used deep learning algorithms to predict water quality parameters in India. Wang et al. (2017) combined machine learning algorithms, the water quality index (WQI), and remote sensing spectral indices to establish a model for assessing water quality in China. Rizeei et al. (2018) used a data-driven Logistic Regression model to assess the groundwater nitrate contamination hazard in a semi-arid region. Hosseini et al. (2018) presented a novel machine learning-based approach for the risk assessment of nitrate groundwater contamination. Quedraogo et al. (2018) applied Random Forest regression to model groundwater contamination in Africa and compared its performance with that of a multiple linear regression model. Muharemia et al. (2019) applied several ML models to identify anomalies in water quality time series data. Their results showed that the DNN, RNN, and LSTM algorithms are very vulnerable compared to the SVM and LR models. Vijay and Kamaraj (2019b) investigated three ML models to predict groundwater quality and concluded that the C5.0 classifier produced the best result, with an accuracy of 96%. Venkatech et al. (2020) used tree-based modeling methods to predict nitrate exceedance in the Ogallala aquifer in Texas. Aldhyani et al. (2020) used several ML models for water quality prediction; their results showed that the SVM algorithm achieved the highest accuracy (97%). Nafouanti et al. (2021) compared the Random Forest, Logistic Regression, and artificial neural network algorithms for predicting fluoride contamination in groundwater in the Datong Basin, northern China. The paper is organized as follows: (1) description of the study area, (2) geological and hydrogeological background, (3) selection of the relevant variables through spatial-statistical analyses, (4) machine learning model implementation, and (5) model evaluation through the generation of performance metrics.

Study area

The study area (Akhmim District) is located east of the River Nile between latitudes 26°30′ and 26°44′N and longitudes 31°35′ and 31°55′E (Fig. 1), about 467 km from Cairo. Several authors have studied the hydrochemical characteristics of the groundwater of Sohag Governorate and evaluated the impact of human activities on groundwater quality (Awad et al. 1995; Ahmed 2009; Abdel Latif and El Kashouty 2010; Youssef et al. 2011; Ahmed and Ali 2011; Melegy et al. 2014; Ismaila and El-Rawyba 2018; Elbeih and El-Zeiny 2018; Hagage et al. 2021). Youssef and Abdel Moneim (2006) studied the geo-environmental impacts in the area east of Sohag Governorate and revealed the existence of three main geo-environmental hazards. Industrial, domestic, and agricultural activities are the main groundwater contamination sources recorded in the study area. Hagage et al. (2021) studied the impacts of anthropogenic activities on the archaeological sites in the Akhmim area, Sohag Governorate, Egypt, using remote sensing and GIS techniques. In the present study, ammonia data are used as an indicator of groundwater contamination because elevated ammonia concentrations in groundwater are typically caused by anthropogenic activities. Ammonia includes the non-ionized (NH3) and the ionized (NH4+) species. The most common nitrogen compound in groundwater is nitrate (NO3−), but in a reducing environment, ammonia is predominant. Hagage et al. (2021) identified the sources of groundwater deterioration through extensive field investigation. Population growth and urbanization without proper urban planning in the study area have led to unplanned sewerage systems (Hagage 2021). More than half of the population is not served by any sewerage network, which forces the inhabitants to build septic tanks and injection wells that contaminate the groundwater (Hagage 2021).

Fig. 1 Location map of the study area

Geological and hydrogeological background

The main features that characterize the study area include the cultivated Nile flood plain and the lowland desert areas along both sides of the Nile Valley (Youssef and Abdel Moneim 2006). Several authors have studied the sedimentary sequence in the study area (among them, Said 1960, 1981, 1990; Issawi et al. 1978; Issawi and Hinnawi 1980; Omer 1996). They revealed that the sequence starts at the base with the Lower Eocene Thebes Formation, followed by the Issawia Formation, Pre-Nile sediments, fanglomerate, Nile silt, and the Recent wadi deposits. Figure 2 shows part of the geological map of the study area.

Fig. 2 Geological map of the study area (EGSMA 1983)

The main aquifer system in the study area is the Quaternary aquifer, in which the Pleistocene deposits are the major water-bearing sediments in the Akhmim area (Abdel Moneim 1999; Hagage 2021). Beneath the old cultivated lands, the Pleistocene aquifer is semi-confined because its upper member consists of a clay-silt layer, whereas in the desert fringes it is unconfined because the clay-silt layer is replaced by desert sands. The aquifer is bounded below by extensive, thick deposits of Pliocene clays (Abdel Moneim 1992). The groundwater flows towards the River Nile, and the groundwater level in the study area ranges from 63 m (masl) at the valley fringes to 53 m (masl) close to the River Nile. Hagage et al. (2021) studied the groundwater quality and its suitability for drinking and irrigation in Akhmim District, Sohag, Egypt. Their results showed that several human activities affect the quality of the groundwater and its suitability for drinking.

Materials and methods

Spatial distribution maps of index and independent variables

A field trip to the study area took place in April 2019, during which 32 groundwater samples representing the Quaternary aquifer were collected. The water samples were collected in one-liter polyethylene bottles. For heavy metal analysis, 100 ml of each sample was acidified with nitric acid (1%) and preserved separately. All water samples were carefully sealed and labeled after collection and kept in a refrigerator until analysis. The analyses were performed in the Central Laboratory of the National Water Research Center, according to the standard methods for testing water described by the American Public Health Association (APHA 2005). The samples were chemically analyzed to determine the cations, anions, nutrients (ammonia, nitrate, and phosphate), and soluble heavy metals.

Index variable

In this study, ammonia is used as the index of groundwater contamination. Ammonia concentrations in the groundwater samples range between 0.01 and 22.4 mg/l. The presence of ammonia in water is evidence of fecal pollution from wastewater, and ammonia can be oxidized to nitrite and finally to nitrate (Karavoltsos et al. 2008). According to the WHO (2011), the maximum permissible ammonia concentration is 0.5 mg/l, and about 95% of the collected groundwater samples are contaminated with ammonia. Figure 3a shows the histogram of the ammonia concentration, whereas Figure 3b shows its box plot. Both plots show a distorted distribution with skewed values. Figure 4 shows the spatial distribution map of the ammonia concentration. The highest concentrations are recorded in the southern and middle parts of the study area, whereas the northern part records values below the permissible ammonia limit. The ammonia concentrations in the southern part range between 8 and 22.4 mg/l, whereas the middle parts record values between 0.5 and 8 mg/l.

Fig. 3 a Histogram distribution of the ammonia. b Box-plot chart of the ammonia

Fig. 4 Distribution map of the ammonia concentration
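The index variable can be turned into a binary class label by comparing each measured concentration with the permissible limit. The following is a minimal sketch, assuming that a sample is labeled as contaminated when its ammonia concentration exceeds the WHO (2011) limit of 0.5 mg/l quoted above. The function name and the example concentrations are hypothetical and are not the measured dataset.

```python
# WHO (2011) maximum permissible ammonia concentration quoted in the text.
AMMONIA_LIMIT_MG_L = 0.5


def label_contaminated(concentrations, limit=AMMONIA_LIMIT_MG_L):
    """Return 1 for samples above the limit and 0 otherwise.

    Assumption: "contaminated" means strictly above the permissible limit.
    """
    return [1 if value > limit else 0 for value in concentrations]


# Hypothetical concentrations (mg/l) chosen only to span the reported
# range of 0.01 to 22.4 mg/l; these are NOT the measured samples.
example_nh3 = [0.01, 0.4, 0.5, 2.3, 8.0, 22.4]
labels = label_contaminated(example_nh3)
print(labels)
```

A label of this kind is what a two-class model such as Logistic Regression or Random Forest needs as its response variable.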

Independent variables

About 16 physicochemical parameters were prepared and analyzed to clarify their correlation with ammonia. Only four variables were found to be relevant to the ammonia contamination. Figure 5a, b, c, and d show the spatial distribution patterns of the Pb, Mg, Fe, and Zn variables. The Pb and Mg distribution maps show a high correlation with the ammonia pattern, whereas the Zn and Fe maps show less correlation. Figure 5 demonstrates that Akhmim and El Kola are the most polluted sites, owing to industrial activities, urbanization, excessive use of chemical fertilizers, agricultural pesticides, and sewage leakage.

Fig. 5 Distribution maps of a lead, b magnesium, c iron, and d zinc concentrations

Statistical analysis

The correlation analysis was carried out to clarify whether relationships exist between the measured variables. The correlation coefficient (CC) may be negative or positive and ranges from −1.00 to +1.00. A large absolute value of the CC between two variables implies that they are highly correlated, in either the positive or the negative direction. In this study, a correlation is considered strong when the absolute CC value is greater than 0.5 and weak when it is less than 0.5. Correlation values scaled between 0 (no correlation) and 1 (perfect correlation) can be encoded with color in a 2D heat map. Here, the R2 values are translated to color saturation to produce a heat map (Fig. 6) that shows the correlation scores between the independent variables and the ammonia. It confirms a strong correlation of Pb and Mg with NH3 and a weak correlation of Zn and Fe with NH3.

Fig. 6 Heat map of the Mg, Pb, Fe, and Zn variables correlated with the ammonia
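The correlation screening described above can be illustrated with a short sketch. It assumes the Pearson correlation coefficient and applies the 0.5 threshold to the absolute value of the coefficient; the source does not name the coefficient type, so both choices are assumptions. All sample values and function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def strength(cc, threshold=0.5):
    """Classify a coefficient as strong or weak using its absolute value."""
    return "strong" if abs(cc) > threshold else "weak"


# Hypothetical concentrations (mg/l), NOT the measured samples.
nh3 = [0.2, 1.5, 4.0, 9.0, 15.0]
pb = [0.005, 0.010, 0.020, 0.030, 0.050]  # rises together with nh3
zn = [0.9, 0.1, 1.2, 0.2, 0.8]            # no clear trend with nh3

for name, values in (("Pb", pb), ("Zn", zn)):
    cc = pearson(nh3, values)
    print(name, round(cc, 3), strength(cc))
```

Squaring each coefficient gives the R2 value that a heat map of this kind would encode as color saturation.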

Table 1 provides descriptive statistics for the relevant variables used to predict groundwater contamination. Mg records a higher standard deviation than Fe, Zn, and Pb. The maximum and minimum values of Mg are 34 and 10.3 mg/l, with a mean value of 24.12 mg/l.

Table 1 Descriptive statistics

The maximum and minimum values of Fe are 0.148 and 0.006 mg/l, with a mean value of 0.025 mg/l. The maximum and minimum values of Pb are 0.05 and 0.004 mg/l, with a mean value of 0.019 mg/l. The maximum and minimum values of Zn are 1.8 and 0.009 mg/l, with a mean value of 0.18 mg/l. The box plots of the Pb, Fe, and Zn variables (Fig. 7) show a normal distribution, whereas the box plot of Mg shows a slightly distorted distribution with skewed values. Because the Fe, Pb, and Zn values are not of the same magnitude as the Mg values, the data are normalized before ML model implementation.

Fig. 7 Box-plot charts of the relevant variables to the groundwater contamination
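The normalization step can be sketched as follows. The source does not state which normalization was applied, so min-max scaling to the range 0 to 1 is an assumption here. The example lists reuse only the minimum, mean, and maximum values reported above as illustrative inputs; the function name is hypothetical.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


# Reported minimum, mean, and maximum (mg/l), used only as illustrative inputs.
mg = [10.3, 24.12, 34.0]
fe = [0.006, 0.025, 0.148]

mg_scaled = min_max_scale(mg)
fe_scaled = min_max_scale(fe)
print(mg_scaled)
print(fe_scaled)
```

After scaling, Mg and Fe occupy the same 0 to 1 range even though their raw magnitudes differ by about two orders of magnitude.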

Machine learning model selection and implementation

The general methodology of machine learning includes (a) data preparation, analysis, and visualization; (b) normalization; (c) model selection and implementation; and (d) performance metrics. In this study, the following steps were implemented in Python code within an Anaconda notebook: (1) import the required libraries (NumPy, Pandas, Matplotlib, and Seaborn), (2) import the “.csv” file containing the dataset (the Mg, Fe, Pb, and Zn variables and NH3), (3) perform statistical analyses and data visualization, (4) implement the Multivariate Logistic Regression and Random Forest models, and (5) generate the performance metrics (classification reports and confusion matrix). The following paragraphs describe each step in detail.

Figure 8 shows the pair plots of the relevant variables against NH3. No clear relationship is observed. In such cases, ensemble models are well suited to this kind of data. This study therefore implemented an ensemble Random Forest model in addition to the Multivariate Logistic Regression model.

Fig. 8 Pair plot of the Mg, Fe, Pb, and Zn variables against the NH3

Multivariate Logistic Regression model

Logistic Regression is one of the most commonly used machine learning algorithms for predicting two classes. It builds on the Linear Regression model, a linear function that describes the relationship between variables and is expressed by Eq. (1):

$$y=\beta_0+\beta_1x+\varepsilon$$
(1)

where y is the dependent variable, x is the independent variable, β0 is the y-intercept, β1 is the slope, and ε is a random error.

The sigmoid function is represented by Eq. (2):

$$P=\frac{1}{1+e^{-y}}$$
(2)

Applying the sigmoid function to the linear regression function gives Eq. (3):

$$P=\frac{1}{1+e^{-\left(\beta_0+\beta_1x\right)}}$$
(3)

The multiple Logistic Regression model considers a set of x independent variables, which in this study are represented by the Mg, Fe, Pb, and Zn variables, to predict the likelihood of the response variable Y which is represented by the NH3. This model is expressed as in Eq. (4):

$$Y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+\beta_4x_4+\varepsilon$$
(4)

where Y is the dependent variable; x1, x2, x3, and x4 are the independent variables; β0 is the y-intercept; β1, β2, β3, and β4 are the slopes; and ε is a random error.

Applying the sigmoid function to this linear combination gives Eq. (5):

$$P=\frac{1}{1+e^{-\left(\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+\beta_4x_4\right)}}$$
(5)
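Eq. (5) can be evaluated directly once the coefficients are known. The following is a minimal sketch in which the coefficient values and the input vector are hypothetical placeholders, not fitted values from this study.

```python
import math


def sigmoid(y):
    """Sigmoid function of Eq. (2)."""
    return 1.0 / (1.0 + math.exp(-y))


def predict_probability(x, beta0, betas):
    """Eq. (5): sigmoid of the linear combination of the predictors."""
    linear = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(linear)


# Hypothetical intercept and slopes for the (Mg, Fe, Pb, Zn) predictors.
beta0 = -1.0
betas = [2.0, 0.5, 3.0, 0.2]
sample = [0.6, 0.1, 0.8, 0.05]  # hypothetical normalized predictor values

p = predict_probability(sample, beta0, betas)
print(round(p, 3))
```

A probability above 0.5 would typically be assigned to the positive (contaminated) class, although the decision threshold used in this study is not stated.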

The Logistic Regression model has several optional parameters such as solver, random_state, and C. The “solver” parameter is a string that selects the algorithm used to fit the model; the options are “liblinear,” “newton-cg,” “lbfgs,” “sag,” and “saga.” The default depends on the library version: in scikit-learn, for example, it was “liblinear” before version 0.22 and is “lbfgs” from that version onward. The “random_state” parameter is an integer that seeds the pseudo-random number generator; the default is None. The “C” parameter is a positive floating-point number that defines the inverse of the regularization strength, so smaller values indicate stronger regularization; the default is 1.0. In this study, the authors implemented the MLR model twice. The first run used solver = liblinear, C = 10.0, and random_state = 0, whereas the second run used solver = sag, C = 80.0, and random_state = 0.
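The two parameter settings can be written out as a short sketch. It assumes the scikit-learn `LogisticRegression` class, which exposes parameters with these names; the source lists only NumPy, Pandas, Matplotlib, and Seaborn, so the library choice is an assumption. The data are synthetic stand-ins for the normalized Mg, Fe, Pb, and Zn values, and the labeling rule is invented for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 32 samples, 4 predictors already scaled to 0..1.
rng = np.random.default_rng(0)
X = rng.random((32, 4))
# Invented labeling rule (NOT from the study): above-median first column = 1.
y = (X[:, 0] > np.median(X[:, 0])).astype(int)

# First run reported in the text.
first_run = LogisticRegression(solver="liblinear", C=10.0, random_state=0)
first_run.fit(X, y)

# Second run reported in the text. The "sag" solver converges fastest when
# the features share a similar scale, which is one reason to normalize.
second_run = LogisticRegression(solver="sag", C=80.0, random_state=0)
second_run.fit(X, y)

first_pred = first_run.predict(X)
second_pred = second_run.predict(X)
print(first_run.score(X, y), second_run.score(X, y))
```

Because C is the inverse of the regularization strength, the second run (C = 80.0) is regularized less strongly than the first (C = 10.0).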

Random Forest (RF) model

Random Forest (RF) is an ensemble classification/regression method that trains several classifiers and combines their results through a voting process (Breiman et al. 1984; Breiman, 2001; Gislason et al. 2006; Pham et al. 2019). In this method, a large number of decision trees are created, each tree is trained on a bootstrap sample of the original training data, and the output class is determined by a majority vote of the trees. At each node, Random Forest searches across a randomly selected subset of variables to determine a split based on some metric. The type of metric differs between regression and classification tasks. In this study, the dataset was divided into 70% training and 30% testing subsets for the RF implementation. The model was run with the following parameters: criterion = “entropy,” n_estimators = 10, and random_state = 0.
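The split and the model settings described above can be sketched as follows. The sketch assumes scikit-learn's `train_test_split` and `RandomForestClassifier`, whose parameter names match those quoted in the text; the library choice and the `random_state` of the split are assumptions. The data are synthetic stand-ins, and the labeling rule is invented for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 32 samples, 4 predictors (Mg, Fe, Pb, Zn).
rng = np.random.default_rng(0)
X = rng.random((32, 4))
# Invented labeling rule (NOT from the study): above-median first column = 1.
y = (X[:, 0] > np.median(X[:, 0])).astype(int)

# 70% training and 30% testing, as stated in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Parameters quoted in the text: entropy criterion, 10 trees, fixed seed.
forest = RandomForestClassifier(
    criterion="entropy", n_estimators=10, random_state=0
)
forest.fit(X_train, y_train)

predictions = forest.predict(X_test)
print(len(X_train), len(X_test), forest.score(X_test, y_test))
```

With the "entropy" criterion, each split is chosen to maximize the information gain over the randomly selected subset of variables.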

Results and discussion

A confusion matrix and classification reports were generated to evaluate the model performance, and the F1 score, accuracy, precision, and recall were evaluated. Precision is the number of correctly classified positive samples (TP) divided by the number of samples labeled by the system as positive, that is, the sum of TP and FP (precision = TP/(TP + FP)) (Bottenberg & Ward, 1963). Recall is the number of correctly classified positive samples (TP) divided by the number of positive samples in the data (recall = TPR = TP/(TP + FN)). The F1 score is the harmonic mean of precision and recall (F1 = 2 × precision × recall/(precision + recall)). The performance of the models is shown in Table 2. The classification metrics show that the RF model scores the highest accuracy (93%), whereas the highest accuracy of the MLR model is 83%. This result indicates that the ensemble RF model is the better of the two for predicting groundwater contamination.

Table 2 Classification metric results

More information about the accuracy of the model can be obtained from the confusion matrix (Hamilton 2012). The confusion matrix reports the numbers of (1) true positives (TP = the number of samples classified as true while they are true), (2) true negatives (TN = the number of samples classified as false while they are false), (3) false positives (FP = the number of samples classified as true while they are false), and (4) false negatives (FN = the number of samples classified as false while they are true) (Bekkar et al. 2013).

For the MLR model, among 30 samples of actual data, 18 samples are classified as true positives, 7 as true negatives, 5 as false positives, and none as false negatives. For the RF model, among 30 samples of actual data, 16 samples are classified as true positives, 12 as true negatives, none as false positives, and 2 as false negatives. Figure 9a and b show the confusion matrices of the MLR and RF models, respectively.

Fig. 9 Confusion matrix: a the MLR model and b the RF model
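The classification metrics follow directly from the four confusion-matrix counts. The sketch below applies the definitions of accuracy, precision, recall, and F1 score to the counts reported above for the two models. The function name is hypothetical, and the precision, recall, and F1 values it prints are derived from those counts rather than copied from Table 2.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts reported in the text for the MLR and RF models.
mlr = classification_metrics(tp=18, tn=7, fp=5, fn=0)
rf = classification_metrics(tp=16, tn=12, fp=0, fn=2)

for name, metrics in (("MLR", mlr), ("RF", rf)):
    print(name, {key: round(value, 2) for key, value in metrics.items()})
```

The computed accuracies, 25/30 for MLR and 28/30 for RF, reproduce the reported 83% and 93%.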

The analyses of the groundwater samples revealed that about 95% of them exceed the maximum permissible ammonia concentration (0.5 mg/l) according to the WHO (2011). The ammonia concentration in the groundwater samples ranges between 0.01 and 22.4 mg/l. This contamination is due to various human activities as well as the use of wastewater for irrigation in the east of the study area (Hagage et al. 2021). The lead, magnesium, iron, and zinc contents of water depend on the amounts of industrial waste, fertilizers, and sewage sludge (Oluyemi et al. 2008). The lead concentration in the groundwater ranges between 0.004 and 0.05 mg/l. The high lead pollution in the groundwater is due to industrial activities, urbanization, excessive use of chemical fertilizers, agricultural pesticides, and sewage leakage (Hagage et al. 2021; Krishna and Kurakalva 2014).

Conclusions

In general, groundwater contamination by ammonia is a significant issue in Sohag Governorate, Egypt, and is attributed mainly to urban growth. The present study developed a predictive model for groundwater contamination using the ensemble RF and MLR models. The performance of these models was evaluated using classification metrics and the confusion matrix. The study concluded the following:

  1. The RF model performs better than the MLR model, with an accuracy of 93% compared with 83% for the MLR model.

  2. A strong relation is observed between urban expansion and high ammonia concentration. The lack of sewerage networks in the study area has forced the inhabitants to build septic tanks and injection wells, which leads to high contamination of the groundwater.

  3. Akhmim and El Kola are highly polluted sites, as demonstrated by the spatial distribution maps.

  4. The study demonstrated the usefulness of ML models for predicting groundwater contamination using the ammonia index and its relevant variables.