Introduction

Groundwater is a valuable resource in all parts of the world. In dry and semi-arid locations where there are no rivers or other sources of surface water, groundwater is the primary supply of water for agricultural and human activities. The quest for additional groundwater resources is a major concern in these places, particularly in locations where water demand is gradually increasing over time (Falkenmark et al. 2019). Therefore, suitable water resource management is highly necessary. Due to limited water supplies and rising uncertainties induced by climate change, water management in the Kingdom of Saudi Arabia (KSA) is experiencing significant problems. Exploiting deep aquifers for subterranean water depletes supplies that took decades or centuries to develop and on which the current annual rainfall has little direct influence (Mahmoud et al. 2014). An estimated 158.47 billion m3 of rainfall occurs on the nation each year (Al-Rashed and Sherif 2000). In KSA's biggest single alluvial reservoir, the total capacity in alluvial deposits is estimated to be 84 billion m3 (Abdulrazzak 1995). While the entire amount of groundwater removed from Saudi Arabia's deep aquifers during the previous two decades is estimated to be about 254.5 billion m3, it was pumped from Saudi Arabia to meet the demands of the agriculture sector's development (Al-Rashed and Sherif 2000). While the recharging of deep aquifers has been restricted to 41.04 billion m3 during the previous two decades (Al-Rashed and Sherif 2000). By the year 2000, the agriculture sector in Saudi Arabia consumed around 20 billion m3/year of water (Al-Rashed and Sherif 2000). Demands of water in Agricultural accounted for 83–90% of overall water demands from 1990 to 2009 (Chowdhury and Al-Zahrani 2015). To meet the water conservation policy, KSA implemented a plan to minimise agricultural water needs by adopting advanced irrigation techniques, which resulted in a 2.5 percent yearly decrease in water use for agricultural reasons between 2004 and 2009 (MoEP 2014). Because of the scarcity of water and the possibility for an expansion in the area under agriculture, Demarcating groundwater potential zones for agricultural growth is critical (Zhang et al. 2018a; Mahato and Pal 2019; Zhu and Abdelkareem 2021).

The first step in protecting ecosystems from groundwater discharge is identifying the groundwater discharge, the locations where groundwater discharge occurs, and the support that groundwater discharge provides for down-gradient ecosystems (e.g., fluvial ecosystems) (Zhang et al. 2018a; Mahato and Pal 2019; Zhu and Abdelkareem 2021). Such determinations are typically only possible by field investigation because different types (and an unknown number) of groundwater discharge sources are commonly intermingled in the subsurface near their point of connection with surface water (e.g., seeps and springs) (Yu and Michael 2019; Bierkens and Wada 2019). Field mapping of these forms of groundwater outflow is doable in some instances (Arabameri et al. 2019b; Díaz-Alcaide and Martínez-Santos 2019). Such approaches are often expensive to apply and require substantial ground truth knowledge prior to deployment. Therefore, it is impracticable across broad geographical scales and in difficult-to-reach areas. Remote sensing, geospatial modeling, and machine learning have all been employed in these cases to map remote sites where groundwater discharge occurs, with varying degrees of effectiveness (Mahato and Pal 2019; Das et al. 2019; Mallick and Rudra 2021; Vellaikannu et al. 2021; Zhu and Abdelkareem 2021). As computer processing capacity develops and remote sensing data becomes more readily accessible, these methods garner greater attention for broad applications in hydrology.

The aim of geospatial modeling, in particular, is to use a procedure for setting out the models and a suitable choice of geospatial algorithms to help decision-makers and other stakeholders decide the most optimal and beneficial option (Nagpal et al. 2019; van Eeuwijk et al. 2019; Pal et al. 2020a; Malik and Bhagwat 2021; Forootan and Seyedi 2021; Phong et al. 2021). Over the years, various methods such as multi-criteria decision analysis (MCDA), the statistical index (SI) (Mandal and Mandal 2018), logistic regression (LR) (Mandal and Mandal 2018; Zhang et al. 2018a),evidential belief function (EBF)(Chen et al. 2018b), probability-frequency ratio (FR) (Lee and Dan 2005; Chen et al. 2017b), certainty factors (CF) (Dou et al. 2014; Chen et al. 2016; Hong et al. 2016a), weight of evidence (WoE) (Xu et al. 2012; Xie et al. 2017), index of entropy (IoE) (Jaafari et al. 2014; Tien Bui et al. 2018) have been used to generate GPM in various regions. A review of past research demonstrates that, although the models used to provide decent results, utilizing them as standalone has drawbacks (Boori et al. 2019; Mahato and Pal 2019; Das et al. 2019; Pradhan et al. 2019; Mallick and Rudra 2021; Mosavi et al. 2021; Chen et al. 2021a). Furthermore, approaches like the multi-criteria index may give helpful information for selecting appropriate locations for managed aquifer recharge (MAR) (He et al. 2012; Kazakis et al. 2020).

ANN (Tien Bui et al. 2016; Chen et al. 2017a), neuro-fuzzy (Tien Bui et al. 2012), decision trees (Tien Bui et al. 2014; Hong et al. 2015), and support vector machines (Trigila et al. 2015; Hong et al. 2016a, b; Chen et al. 2018a, b) are examples of machine learning models that have been developed to address complex issues and have been applied to a variety of groundwater research (Ghimire et al. 2019; Qadir et al. 2019; Hamdani and Baali 2019; Hoque et al. 2020; Malik and Bhagwat 2021). They have shown promising predictive capabilities for groundwater potentiality prediction and are often used as benchmarks to evaluate new approaches (Pal and Sarda 2021a). Lee et al. (2018) show that the ANN technique is beneficial for producing GPM investigations. Emamgholizadeh et al. (2014) employed two machine learning approaches to estimate groundwater levels: ANN and ANFIS. According to their findings, the ANFIS model performs well in prediction and may be applied in similar investigations. There is still no broad consensus on which strategy is superior since learning algorithms with a single or straightforward hypothesis space have difficulty satisfying all case scenarios as the data they use changes (Tien Bui et al. 2018; Truong et al. 2018).

However, these efficient instances contain significant flaws that restrict the efficacy of individual algorithms in terms of prediction (Luo et al. 2017). Individual learning methods may overlook the best-fit function or actual distribution of the sample set from the hypothesis space, for example, since training data for GPM is generally insufficient (Stamatopoulos et al. 2015; Crosta et al. 2003). As a result, researchers created ensemble-learning algorithms for GPM, including Adaboost (Nguyen et al. 2020), Bagging (Iqbal et al. 2021), random subspace (Elbeltagi et al. 2021), and rotating forest (Mallick et al. 2021), to increase the prediction accuracy of a single classifier (Singha et al. 2020). These ensemble-learning strategies are homogenous because they often mix the same classifier to create an aggregated model. Consequently, the flaws of a single classifier may compound, which is undesirable for the outcome (Regmi et al. 2010).

Meanwhile, the studied locations differ geographically in terms of GW potentiality forecast. Diverse places have different geo-environmental characteristics, and homogenous ensemble-learning approaches for GPM ignore such differences (Hong et al. 2016b; Pham et al. 2017). In several sectors, heterogeneous ensemble-learning approaches that combine several single classifiers have exhibited significant nonlinear representation (Mirchooli et al. 2019; Rizeei et al. 2019; Roy and Saha 2019; Nhu et al. 2020; Pal and Sarda 2021b). Heterogeneous ensemble learning is more resilient and generalizable than homogeneous ensemble learning because it can harness the benefits of diverse groundwater potentiality models. Furthermore, the challenge of model selection for GPM may be mitigated to some degree using heterogeneous ensemble-learning approaches, which can preserve heterogeneity using any base classifier. Despite this, there is little research on using heterogeneous ensemble-learning approaches in GPM. It is essential to look into new ensemble modeling methodologies for groundwater potentiality assessment. Ensemble modeling is the simultaneous analysis of multiple (not necessarily optimized) models to look for regularities and patterns in the output from these different models that might give insights into how well they might predict possible future events (Forkuor et al. 2017; Chen et al. 2018a; Zhang et al. 2018b; Abdulkadir et al. 2019; Arabameri et al. 2019a).

In this study, we carried out GWP mapping in the Bisha watershed in Saudi Arabia for the first time using a stacking ensemble technique based on multiple machine learning algorithms. The approach is divided into two levels: the heterogeneous base learners at the bottom and the meta-learner at the top. It is worth noting that the benefits of stacking have been studied in various fields. According to Wang et al. (2011), the stacking-based credit scoring model had better predictive performance than base learners in average accuracy and error. Another study found that the ANN with stacking outperformed both the single ANN and the combination of ANNs employing the averaging technique in flood frequency analysis (Shu and Burn 2004). Three machine learning algorithms, namely SVM, ANN, and LR, were chosen as candidate base learners and RF as meta-classifiers of the stacking ensemble technique to estimate the spatial groundwater potential. The following three advancements for GWP modeling are presented in our study:

(1) For GWP, we used three heterogeneous ensemble-learning approaches. We adopted the RF approach over other machine learning and ensemble machine learning methods because of its superior performance (Talukdar et al. 2020). (2) We used both parametric and non-parametric ROC curves for the first time to validate the models. (3) The specifics of heterogeneous ensemble-learning techniques and result evaluation are presented in this paper. According to our literature analysis, heterogeneous ensemble-learning is still seldom applied in GPM, and many studies only used basic heterogeneous ensemble-learning algorithms. We used the aforesaid basic heterogeneous approaches as well as one meta heterogeneous way of stacking and blending for GPM in this investigation.

Materials and methodology

Study area

The Bisha watershed covers an area of 21,260 km2 and is bordered by Yemen. The Bisha watershed is located north of the equator between 17°59′27.588ʺ and 20°49′13.958ʺ north, and east of the Greenwich meridian between 41°49′50.825ʺ and 43°11′20.254ʺ east (Fig. 1). It features a varied scenery, including highlands, high mountains (between 2000 and 3000 m above sea level), plateaus, and Wadiyan. Additionally, it encompasses a sizable portion of the desert to the north and east, all the way to Bisha. The elevation ranges from 950 to 2980 m above sea level, with a mean of 1655 m. The climate of the region varies substantially by terrain and season. The climate varies considerably between semi-arid regions in the south and desert regions in the north. The average temperature over the last 30 years has fluctuated between 12 and 44 °C. Annual rainfall averages 245 mm. Rainfall in excess of 200 mm per year is restricted to a 20–30 km broad crest zone. As a result, eastward and northward Wadi flow rapidly reduces downstream, and deposition exceeds erosion at the plateau's eastern margin. The highland of the watershed is bordered by forests and Juniperus procera, which are home to a variety of indigenous and rare flora and wildlife.

Fig. 1
figure 1

The study area, Bisha Watershed

Materials

We prepared 21 parameters at the spatial scale for GWP modeling. We derived 14 parameters (among 21) from a high-resolution digital elevation model (DEM). In the present study, we used ALOS PALSAR DEM, having a spatial resolution of 12.5 m. We obtained it from the EarthData website, NASA (“https://asf.alaska.edu/data-sets/derived-data-sets/alos-palsar-rtc/alos-palsar-radiometric-terrain-correction/”). The sentinel-2 MSI satellite image has been obtained from the website of the United States Geological Survey- Earth Explorer (“https://earthexplorer.usgs.gov/”). On the other hand, rainfall data was collected from 16 meteorological stations of Saudi Arabia (the “Ministry of Environment, Water, and Agriculture (MEWA), Saudi Arabia”). A geological map has been obtained from the Survey of Saudi Arabia. Distance to water bodies and the urban center has been generated by integrating the maps provided by DIVA-GIS and field survey.

Groundwater potentiality inventory

The initial part was identifying the spring and non-spring locations and creating the appropriate inventory map. A random selection of non-spring locations was made using the Create Random Points feature in the Data Management Tools on the ArcGIS platform. Spring locations were obtained from land records, field surveys, and the Hydrological Data Management System. There were 50 springs and 50 non-spring places found. In addition, based on prior relevant research, twenty-four groundwater spring-associated variables were initially chosen in this phase (Mahato and Pal 2019; Das et al. 2019; Mallick and Rudra 2021; Vellaikannu et al. 2021; Zhu and Abdelkareem 2021). All groundwater and non-groundwater data have been separated into 80 (80):20 (20 points) training and testing datasets by arbitrary separation (Mallick and Rudra 2021).

Rationale and preparation of groundwater potentiality conditioning factors

When groundwater springs are present, it is believed that the slope, elevation, and aspect all have a role. Elevation has a significant impact on the local circumstances of the terrain that influence groundwater distribution (Arabameri et al. 2019b). Groundwater reservoirs often follow the gradient of height and tend to gather beneath the low-elevated terrain areas, which is why they are called "altitude reservoirs” (Chen et al. 2021b) (Fig. 2a). The landscape slope was chosen as an additional explanatory variable because of its link with the hydrologic processes that govern the direction of runoff and the infiltration capacity of the landscape (Sarkar et al. 2001; Arabameri et al. 2019b) (Fig. 2b). The slope length (LS) is a combination of the slope gradient (S) and the slope length (L) (Moore and Burch 1986). In general, a low value of LS indicates a greater likelihood of a productive groundwater aquifer. The length of the slope is a measure of the extent of ground cover (Fig. 2c). TRI is another morphological parameter that is extensively employed in groundwater assessments to quantify the relative elevation of a target cell concerning its neighboring cell. The TRI index ranges from 0 to 1, where high values indicate a well-drained surface and low values indicate a poorly drained surface (2d). Flow across a surface is affected by the profile curvature, which runs parallel to its maximum slope and is proportional to the acceleration or deceleration of the flow (Ginesta Torcivia et al. 2020). In contrast, planform curvature is perpendicular to the direction of the maximum slope and influences the convergence and divergence of flow across a surface, respectively (Costache and Tien Bui, 2020) (Fig. 2e, f). To generate potential groundwater maps, it was necessary to include the variable aspect as an essential explanatory variable. This variable is related to evapotranspiration because it indicates the direction of water flow, which has an effect on groundwater recharge and storage (Fig. 2g).

Fig. 2
figure 2

DEM derived topographic parameters, such as a elevation, b slope, c LS Factor, d TRI, e plan curvature, f profile curvature, g aspect

Once the TPI is calculated as the ratio of the target pixel elevation to the average elevation of the surrounding area, a positive value (Ridge) indicates a cell above the surrounding pixels concerning the target pixel. The negative value (Valley) reflects the sites at a lower elevation than the neighboring cells in the neighbourhood (Fig. 3a). Besides illustrating the curvature of a slope, CI also depicts the convergence or divergence of a cell, with positive values indicating diverging pixels and negative values indicating converging pixels. The minimal convergence suggests that there is a significant amount of groundwater available. The CI index oscillates between the extremes of maximum divergence (100) and maximum convergence (− 100) (Fig. 3b). When applied to hydrologic processes, TWI quantifies the influence of topography and, hence, on infiltration and recharge (Meles et al. 2020a). By examining the connection between the topographic index surface and reference data, it is possible to determine the correctness of a topographic index concerning grid spacing and terrain roughness (Meles et al. 2020a; Shit et al. 2020) (Fig. 3c). Stream power is the rate at which flowing water expends energy over a certain period. The Stream Power Index (SPI), sometimes known as the compound index (SPI), assesses stream power in geographic information systems (Chen et al. 2021b). SPI values below certain thresholds are related to a low erosion rate by runoff, allowing sufficient time for precipitation to percolate underground (Fig. 3d). Although there are other approaches for determining flow direction in grids, the most straightforward way to explain the flow direction is to consider the direction in which water and silt would flow out of that cell (Burrough and Mcdonnell 1998) (Fig. 3e). When water and slope materials gather, it impacts how they are distributed and where they prefer to collect (Pack et al. 1999) (Fig. 3f, g).

Fig. 3
figure 3

DEM derived topographic and hydrologic parameters, such as a TPI, b convergence index, c TWI, d SPI, e flow direction, f flow accumulation, g topographic features

The NDVI has a significant impact on the incidence of groundwater springs. They are primarily used to characterize the state of the surface vegetation, which impacts recharge rates and the availability of groundwater, both of which are necessary for the formation of vegetation communities (Fig. 4a). Dykes, shears, fractures, and other linear or curved structures on the surface topography are examples of lineaments. Lineaments play an essential part in the penetration process via rock weaknesses in hard rock terrain (Fig. 4b). The NDMI can detect wet or moist places in the landscape. However, soil moisture and groundwater cannot be separated. The wet/moisture pixels indicate those areas where water may resist for an extended period, commonly towards the end of a rainy event. Because water cannot resist uplands, the study also revealed moisture pixels in flat and valley regions, enabling more significant time for precipitation to sink into the subsurface; moreover, it highlights waterlogged places (Fig. 4c). Rainfall is a critical element in determining the variability of groundwater recharge and storage and calculating the groundwater potential. Annual continuous rainfall data was gathered from a network of 16 meteorological stations in and around the research area from 1976 to 2017. The kriging technique was used to plot the annual rainfall. Annual rainfall varies from 342 mm in the south and south-west to approximately 8.5 mm in the east (Fig. 4d). Because it influences the moisture content of soil and rock on the slope and the infiltration rate, distance from rivers has a significant impact on groundwater springs (Fig. 4e). Because the presence of urban area might create local hydrological and erosion difficulties while indirectly affecting the groundwater table, its distance from urban areas is thought to impact groundwater springs. As a consequence of the removal of geological formations and the disruption of the surface during the building phase, an urban area may affect the quantity of soil moisture and the infiltration rate (Fig. 4f). The infiltration capacity and porosity are governed by GG, which changes the particular groundwater storage. There are forty GG kinds in the research field. Some notable characteristics include sandstone and siltstone, intrusive plutonic rock, orthoamphibolite, andesite and dacite, andesitic volcaniclastic rocks, granodiorite, and granite suite (Fig. 4g).

Fig. 4
figure 4

The conditional parameters for GPM, such as a NDVI, b lineament density, c NDMI, d Rainfall, e distance to waterbody, f distance to urban, g geological structure

Multicollinearity test for groundwater potentiality variables

In this work, multicollinearity analysis assessed the relationship between the groundwater affecting elements and the groundwater itself. When there is a strong link between two or more predictor variables in a multiple regression model, this is called multicollinearity in statistics (Islam et al. 2021). To find multicollinearity among influencing variables in this research, the tolerance (TOL) and variance inflation factor (VIF) was utilized with each other. “Let \(X = \{ X_{1} ,X_{2} ,....,X_{N} )\) signify a particular independent variable set, and \(R_{j}^{2}\) denote the coefficient of determination when the jth independent variable \(X_{j}\) is regressed on all the other predictor variables in the model, as shown in the following example”. The following is the formula for calculating the VIF value:

$${\text{VIF}} = 1 - (1 - R_{j}^{2} ).$$
(1)

When the VIF value is divided by the TOL value, the degree of linear correlation between the two independent variables, it is recommended that if the VIF value is over ten or the TOL value is less than 0.1, the related factors be eliminated from the landslide prediction models because of multicollinearity (Talukdar et al. 2021).

Developing stacking ensemble machine learning algorithm

Wolpert (1992) was the first to introduce the stacking ensemble approach. Stacking, unlike other existing ensemble learning approaches, use meta-learning to integrate many types of algorithms. The outputs of numerous base learners (level-0) are merged by the meta-learner in the stacking structure, which has two levels: level-0 and level-1 (level-1). Figure 5 depicts a basic drawing of the stacking structure used in this work. SVM, ANN, and LR were in charge of the basic learners. GPM frequently use these machine learning techniques. In the case of the meta-learner, RF was used in accordance with prior study recommendations.

Fig. 5
figure 5

Stacking-based ensemble frameworks for groundwater potentiality modeling. SVM support vector machine, ANN artificial neural network, LR logistic regression, RF random forest”

Level-0: Application of base classifiers

ANN

Behavioral trends are used in the artificial neural network to provide a basis for modelling processes. It has three layers: input, covered, and output, as well as processing units such as neurons that are organised in several layers (Moghaddam et al. 2019). The attachment weights bind the neurons of previous layers to those of subsequent layers. The output of the middle layer (hidden layer) is fed into the next layer as data. The input layer receives the data, while the final layer produces the ANN model's final output. The input data is received and sent by the middle layers to the related nodes in the subsequent layers. The secret neurons use the weighted number of inputs to generate the intermediate output. The activation functions are used in the ANN model to compute the hidden and output neurons' outputs. It uses the bias values, as well as the weighted number of the neuron's inputs, to set the output. The preparation of the network structure and the modification of the weights of links are the two main stages of the ANN modelling process. According to the literature review, the backpropagation training algorithm is widely used in a variety of areas, including water engineering (Fan et al. 2017). The performance of the ANN model is first obtained as the ANN model's response. The error between measured and predicted values is reduced at the next step to determine the model's weights. Where the output differs from the observed value, the weights and biases are adjusted to reduce the error values. However, since the backpropagation algorithm has a poor convergence rate, meta-heuristic optimization algorithms were used in this analysis to solve this flaw.

SVM

Vapnik (2013)presents the SVM, a widely used machine learning tool, as well as a collection of linear predictor functions that have been used to solve problems of function determination. In the SVM model, the kernel mathematical machine function was used for data transformation. A hyper-plane was generated using the training datasets after the individual SVM datasets were translated to high dimensional feature space (Choi et al. 2020).The best linear hyper-plane was used to differentiate the real output space. It is often used to divide data into two categories, such as low groundwater potential (0) and high groundwater potential (1). Several experiments have shown that, for groundwater potential models chosen as benchmarks kernel function, the RBF is overweight among various kernel functions (Tehrany et al. 2014). Due to the versatility of the radial base kernel to address different dimensions of the data set and its better capacity for generalisation, flood susceptibility was primarily modelled (Chen et al. 2021b). Modeling with SVM is generally limited because of its difficulties in recording essential parameters (Choubin et al. 2019).

LR

Logistic regression is a multivariate analytic model that predicts the presence or absence of an attribute or result based on the values of a collection of predictor variables. The term "logistic regression" refers to a kind of multivariate regression in which a dependent variable is linked to many independent variables (Tu 1996). Groundwater occurrences rely on several geo-hydrological independent variables, and the dependent variable is a binary variable representing the presence or absence of groundwater. The logistic regression technique is used for maximum likelihood estimate after turning the dependent variable (groundwater) into a logit variable (Ayalew and Yamagishi 2005). The advantages of logistic regression include that the variables do not have to have a normal distribution; they may be continuous, discrete, or a mix of both (Tu 1996). The ratios for each independent variable in the multivariate analysis model may be predicted using logistic regression coefficients. In multi-regression analysis, the factors must be numerical, and the variables must have a normal distribution. The dependent variable, which denotes the presence or absence of a groundwater, should be entered as 1 or 0, and the model applies sound to groundwater potentiality analysis in this research (Pradhan and Lee 2010).

Level-1: Application of meta classifier

RF

Breiman et al. (2017) established the random forest as an ensemble learning approach for generating numerous decision trees from distinct data subsets and voting on the findings of multiple decision trees to produce the random forest output. The random forest has a substantial body of research that shows it is tolerant to outliers and noise, unlikely to over-fit, and has good prediction accuracy and stability.

The basic idea behind random forest is to train many unconnected decision tree models. Each decision tree produces a different prediction regarding the sample's categorization (for classification algorithm). The sample classification mode is the outcome. To reduce model variance, the random forest's performance may be enhanced by creating unrelated training sets. Sample training is used to create different training sets of classifications, which are then merged to create the random forest model.

Ensemble procedure

After all of the base learning algorithms have been developed, the stacking approach is used to combine them into a larger framework. Assume that the initial dataset D contains examples \(d_{i} = (x_{i} - y_{i} )\), where \(x_{i}\) denotes groundwater conditioning variables and \(y_{i}\) denotes classifications (groundwater or non-groundwater). \(i \in [1,N]\), where N is the total number of data points in the modelling dataset. \(L_{t} (t = 1,2,3)\) is the abbreviation for base learning algorithms including SVM, ANN, and LR. To begin, the dataset D is periodically split into two distinct subsets: one is used to train base learning algorithms to produce level-0 classifiers, denoted by \(h_{t}\), and the other is used to construct level-0 classifiers.

$$h_{t}^{i} = L_{t} (D - d_{i} )\forall i = 1,2, \ldots ,N;\forall t = 1,2,3.$$
(2)

The remaining examples are being utilized to train classifiers to generate predictions \(Zit\):

$$Zit = h_{t}^{i} (x_{i} ).$$
(3)

These level-0 classifier outputs, along with their actual classifications, form a new dataset \(D^{\prime} = ((Zit,Zit,....,Zit),y_{i} )\), which is subsequently given to level-1 to train the meta-learner (RF). As a consequence, RF can put together the classification results of basic learners to come up with a final prediction for new cases:

$$Y_{x} = RF(h_{1} (x),h_{2} (x),h_{3} (x)).$$
(4)

The initial ensemble model, known as the SVM–ANN–LR, was built using three single candidate algorithms and the stacking approach. The optimum parameters for whole stacking model have been presented in Table 1.

Table 1 Optimum parameters for stacking model for generating GPM

Validation of the models

All model-based predictions, including GPM, require measuring the performance of a modeling method. In the literature, several performance metrics have been effectively utilized. The most widely-used measurements for GPM AUC of ROC curves were employed to evaluate the algorithms' performances in this research. The ROC graphs the plot of sensitivity (true positive rate) versus 1-specificity (false positive rate) with varying thresholds (Nahayo et al. 2019). The AUC value, or area under this curve, is a regularly used metric for evaluating model performance. The AUC value for random is 0.5, whereas the value for perfect conformance is 1. When the AUC is less than 0.5, the model is non-informative or fits the data worse than a random model. In the present study, we employed two types of ROC curve, such as parametric and non-parameteric ROC curves for the validation.

Sensitivity analysis

The mean decreases in the Gini and the mean decreases in accuracy were measured. These two measures are widely utilized because they may choose criteria and rank things. Calculating essential variables of landslide conditioning factors for the current research region was accomplished using the mean decrease accuracy (MDA) and mean decrease gini (MDG) techniques of the RF algorithm, respectively. The two approaches described above are pretty popular, and they have been extensively employed in a variety of research to determine the significance of various factors. MDA and MDG techniques were used in the RF algorithm to compute error since they were discovered during the OOB calculation error and participation of the relevant variable inside the homogeneity of the trees.

Results and analysis

Computation of multicolinearity analysis

In this work, VIF and Tolerances (TOL) were used to assess multicollinearity amongst independent variables. Multicollinearity problems can arise if there are high correlations between the independent variables. A VIF score of over ten and a tolerance value of 0.2 suggest a multicollinearity concern. In the present study, 21 parameters have been generated for GPM. After applying colinartity technique, it has been found that all parameters have VIF score < 5, except elevation (VIF: 68.94) and TRI (VIF: 60.544) (Table 2). Therefore, these two parameters could not be used for modelling; otherwise it would generate lower accurate results. Then, we decided to exclude TRI from the parameters list; again colinearity technique has been applied. The VIF and tolerance values of all factors were determined to be less than 5 and higher than 0.8, respectively (Table 3). As a result, the factors were incorporated into the model. The following are the outcomes of the current study's multicolinearity analysis:

Table 2 The evaluation of the flood conditioning factors using a multi-collinearity test (VIF) before and after removal of redundant parameters
Table 3 Area coverage of hybrid models under different GWP zones

Developing stacking based robust GW potentiality models

In this work, a stacking-based ensemble model for GPM was established. However, before employing the stacking-based ensemble model, we used ANN, SVM, and LR to investigate their performance and compare it to the stacking model. GPMs in stretch form were created after adopting all four models. The GPMs were then categorised into five categories: very high, high, moderate, low, and very low GPMs.

An advanced hybrid method like SVM, ANN, LR, or stacking was used to create the groundwater potentiality models shown in Fig. 6. According to the classifications illustrated in Fig. 6, there are five levels of groundwater potential: very high, high, moderate, low, and very low. Figure 6 also illustrates that most of these zones are extensive and coincide with the expected pattern of groundwater potentiality based on previous studies. Parallel to the drainage path of the watershed, the very high and very high potential groundwater zone flows northwest-northward. Low groundwater potential zones predominate in the south and southeast.

Fig. 6
figure 6

Groundwater potentiality models using a SVM, b ANN, c LR, and d stacking

It was discovered that around 1850km2-2149km2 and 3644km2-4585km2 of the basin's total area had "very low" and "low" potential for groundwater, respectively, with this model (Table 3). Overall, the most significant proportion of the area was assessed to have a high groundwater potential. According to all models, there are many potential areas for water harvesting in the catchment region. However, the most representative model must be explained since the region's size varies. There is much variability in the studies, which may be because the models are based on different data sets, including application, resolution, and assumptions of raster layer attributes.

Validation of the models

The GPMs based on GPS data were validated using the empirical and binormal ROC AUC. SVM has an AUC of 0.831 and 0.906; ANN has an AUC of 0.841 and 0.911; LR has an AUC of 0.819 and 0.903, while stacking has an AUC of 0.858 and 0.921 (see Fig. 7a–d). According to both ROC curves, stacking was the outperformed model, accompanied by ANN, SVM, and LR. However, the ROC curves show that the stacking outperforms all other models under investigation. All models with AUCs greater than or equal to 0.8 could be regarded as successful. Results showed that single models have performed well, but the accuracy level is relatively lower than stacking ensemble model. Therefore, it can be stated that stacking model outperformed other single models.

Fig. 7
figure 7

Empirical and binormal ROC curves were used to validate the hybrid models, a SVM, b ANN, c LR, and d stacking

Sensitivity analysis

Because of the complicated mathematical relationship between historical trends in the amount of groundwater and the variables that trigger those trends, it is impossible to determine the exact region in which a significant and significant quantity of groundwater will be available for commercial use by developing advanced hybrid algorithms for mapping groundwater potential zones. No factors are included in either model why groundwater potential is diminishing in a particular location. Unless the impact of these elements on the frequency of landslides can be quantified, how can management strategies be devised and put into action?

If the impact of these variables on groundwater potential cannot be established, how will management strategies be devised and implemented? Reducing the frequency of groundwater decline may be achieved by identifying characteristics associated with prospective zones for groundwater. As a result, it is critical to identify the most influential factors. To determine how important each conditioning variable is to the RF modeling process, the MDG and MDA have been used (Hollister et al. 2016). In the GWP modeling, all elements except TRI were included (based on MDG and MDA), but the most relevant ones were the NDMI, NDVI, slope, distance to water bodies, and flow accumulation (based on MDG and MDA) (Fig. 8). It was difficult to determine the relative importance of 14 different factors using just TPI, aspect, and TRI as a criterion (Fig. 8).

Fig. 8
figure 8

Analyses of sensitivity for the best model (AND model) using a MDA and b MDG

Discussion

Water consumption has grown dramatically in the last decade due to fast population expansion, particularly in arid and semi-arid regions (Rahmati et al. 2016). Groundwater is the primary water supply for life in the vast majority of the study region, which includes dry and semi-arid areas (Mousavi et al. 2017). Groundwater planning and management are essential in this area. Hydrogeologists, engineers, and decision-makers need specific fundamental tools for managing groundwater. GPM can be used as a basic groundwater management technique (Yousefi et al. 2020).

GPM results from lithology, tectonics, terrain, vegetation, rainfall, and hydrology are present and accessible in the environment. In this study, many types of data were employed as input datasets (Kumar et al. 2021). Research-based on DEMs yields more accurate and essential findings. Different DEMs provide different outcomes; for example, the ALOS DEM with a spatial resolution of 12.5 m produces relevant and excellent results compared to the ASTER and SRTM DEMs with 30 m. The authors used a combination of geomorphology, geology, and hydrology characteristics to determine the spatial groundwater potential. Considering the argument issue, spatial analysis is the study's primary focus on choosing the best performing technique and models for GPMs. VIF and tolerance tests have been performed on geo-environmental parameters (elevation, aspect, slope, lithology, rainfall, land use/land cover, drainage density, soil type, distance to fault, distance to road, distance to river, NDVI, TPI, TWI, and SPI) (Singha et al. 2020). They have proven to be the most effective for groundwater storage.

Based on prior research, it has been discovered that numerous factors on groundwater potential are primarily site-specific and hence cannot be directly generalized to other areas. In the Ningtiaota area of China, Bui et al. (2019) revealed that elevation was the most influencing variable on groundwater potential, while in the Chilgazi region of Iran, TWI and distance from rivers were the most relevant variables Bui et al. (2019). Nguyen et al. (2020) revealed that elevation and rainfall are the most and least essential environmental parameters, respectively, in the DakNong Province of Vietnam. In contrast, Oikonomidis et al. (2015) found that rainfall was the most crucial variable influencing groundwater potential in the Greek region of Thessaly. Singh et al. (2019) conducted a national-scale groundwater potential mapping project in New Zealand and found that lithology was the most relevant variable to employ. (Tolche 2020) and Mallick et al. (2021) have all reported different variable rankings for different regions around the world, showing that The significance of variables for determining groundwater potential varies based on the study area's geo-environmental and topo-hydrological characteristics.

The present study constructed a novel ensemble model based high resolution groundwater potentiality model (GPM) by integrating random forest (RF) meta classiffier based stacking ensemble machine learning algorithm with high resolution groundwater conditioning parameters in Bisha watershed, Saudi Arabia. In the present research, twenty-one parameters were generated from high resolution satellite images and other secondary sources. The novel stacking ensemble machine learning algorithm has been developed by integrating three base-classifiers, such as artificial neural network (ANN), support vector machine (SVM), and logistic regression (LR) with RF meta-classifiers. The resulting AUC under the respective ROC (empirical and binormal) is 0.831 and 0.906 for SVM, 0.841 and 0.911 for ANN, 0.819 and 0.903 for LR, and 0.858 and 0.921 for stacking (see Fig. 7a–d). Based on the both ROC curves, stacking appeared as the best model, followed by ANN, SVM, and LR. As a result, the unique hybrid model performed better and can be applied to further domains. Although, if all possible parameters are combined, a better degree of precision can be reached. Our findings indicate that GW management in the study area region should focus their efforts on developing GW exploitation and agricultural operations in areas adjacent to wadies. As a result of the increased natural recharge, these areas offer a greater GW potential.

The imperative to employ ensemble learning techniques to minimize over-fitting was entirely salient in our study since the effective training efficiency of single models such as ANN, SVM, LR, and RF decreased to a relatively poor validation performance, indicating that the training performance had been over-fitted (Avand et al. 2020). When it comes to performance, while the models achieved different ranks (i.e., performance) during the training and validation phases, it was discovered that the ensemble models significantly improved the predictive ability of the single models using the ROC method (), which was used as the primary performance metric. When the single models (ANN, SVM, and LR) were combined with a single ensemble learning approach (RF) inside the stacking framework, the predictive ability of the hybrid model increased by 3.37 percent, 1.9 percent, and 4.52 percent, respectively, according to the results. The previous study has shown that model performance may be enhanced by employing the ensemble modeling technique. This is consistent with our modeling results (DeSimone et al. 2020; Feizizadeh et al. 2021; Fadhillah et al. 2021).

Ensemble modeling methods have been demonstrated to be successful in recent research; however, when applied to various situations in different locations of the globe, these strategies performed differently (Sachdeva and Kumar 2021). Using Rotation Forest and Bagging, Islam et al. (2021) found that the Reduced Pruning Error Tree (RPET) technique outperformed MultiBoost and Random Subspace (RSS) in the prediction of flood. Pal et al. 2020b) used the RSS ensemble learning approach for habitat quality prediction and found an enhanced model performance. This is also true for groundwater potentiality prediction studies using ensemble approaches (Golkarian et al. 2018; Chen et al. 2019; Vafaeinejad and Mahmoudi Jam 2021). These findings lead us to conclude that the performance of diverse predictive models obtained from different machine learning and ensemble learning approaches may be significantly affected by local variability. This, it seems, necessitates additional modeling work across areas to provide reliable forecast models.

According to the findings of this research, the new ensemble model described here is a powerful tool for modeling complex natural processes. To improve GWPM accuracy, various factors should be studied in the future. First and foremost, the consequences of restricting or expanding the region from which non-well sites are selected to locations where wells are situated or to areas outside the distribution of wells should be thoroughly examined. Since a second consideration, the traditional 80:20 ratio of datasets used for both testing and training has to be reexamined, as it may be contributing to problems. A comparison of the hybrid ensemble model to other ensemble models and deep learning to forecast groundwater sites should be performed to uncover any hidden impacts of the selection techniques used to choose features.

Conclusion

In the current study, a novel hybrid stacking model has been developed in which ANN, SVM, and LR were used as base classifiers, and RF was used as a meta-classifier for the generation of a groundwater potential map for the Bisha watershed, Saudi Arabia. The findings of the RF-based hybrid ensemble model are more accurate than the results of the three individual models. The study's findings also show that morphological characteristics, such as NDMI, NDVI, slope, flow accumulation, SPI, distance to the river, elevation, and aspect, substantially impact prediction performance. A few limitations should be highlighted despite the positive performance of the established models of groundwater spring potential. Several factors may affect the effectiveness of models, including the quality and amount of data and incorrectly identifying non-spring locations. In addition, fine-tuning the stacking framework's structural parameters is a computationally taxing undertaking that may need some prior user knowledge. Future work might be focused on providing a technique that automatically adjusts the structural parameters or on implementing an alternative evolutionary algorithm that requires fewer parameters to be adjusted as a direction. A future objective may be to deploy the approach in a location with a variety of geo-environmental circumstances to gauge its efficacy. In light of the fact that the environment and all of its inhabitants rely on groundwater resources, discovering the geographical patterns of its occurrence is critical to effective water resources management initiatives. Overall, the RF-based feature selection approach proposed in this research was determined to be essential for evaluating the potential of groundwater springs. As a result of its adoption, more accurate and dependable models were created at a lower cost. The findings of this research will be helpful to local governments and government agencies in developing future water resource management strategies since they identify possible groundwater spring locations.