Introduction

Groundwater is a major component of the hydrologic cycle, in which water infiltrates downward through the subsurface. Aquifer recharge takes place when water moves from the land surface or from the vadose zone into the saturated zone. Quantitative estimation of the recharge rate is crucial for understanding large-scale hydrologic processes and for evaluating the sustainability of groundwater supplies. The wide availability of fresh groundwater is the main reason for its universal use as a source of irrigation and drinking water (Alley et al. 2002). Moreover, a large share of crops is grown under irrigation, which depends mainly on the available amount of groundwater. Groundwater plays a fundamental role in river flow, mainly in dry periods, and is essential to many lagoons, wetlands and lakes (Rockström et al. 2010). In addition, humans, vegetation and aquatic animals rely on the groundwater that discharges to rivers, lagoons, ponds and wetlands. In recent years, groundwater levels have gradually declined because of extensive use for various purposes. The quantity of water that can be withdrawn from an aquifer without exhausting it depends mainly on groundwater recharge (Freeze 1969). Thus, estimating the recharging rate is essential for water supply and groundwater resource management, particularly in areas where economic development depends on groundwater resources.

Precipitation is the principal source of groundwater recharge. Natural groundwater recharge is defined as the amount of water that ultimately arrives at the water table (Sophocleous 2002). The quantity of recharge depends on the duration and intensity of precipitation, flooding, soil type, soil moisture conditions, etc. Because the recharging rate of the soil varies in space and time, the estimation method must be selected with care, and the suitability of a recharging model is site-specific. Experimental estimation of the recharging rate is a tedious and time-consuming task (Sihag et al. 2017; Kumar and Sihag 2019). Water storage capacity differs with soil texture and soil physical properties (Angelaki et al. 2013). Sand has relatively larger pores than clay and thus has a higher recharging rate and a very small water-holding capacity. The actual rate at which water percolates into the soil at any time is identified as the recharging rate. The significance of the recharging process has led researchers to develop several models (Green and Ampt 1911; Richards 1931; Kostiakov 1932; Horton 1941; Philip 1957; Holtan 1961; Singh and Yu 1990), as well as the Modified Kostiakov model, the SCS model and the Novel model. These models fall into three groups: physical, semi-empirical and empirical models. The correct determination of the recharging rate is essential for several groundwater-related studies and projects (Singh et al. 2018).

In recent years, data mining techniques such as neural networks, support vector machines, adaptive neuro-fuzzy inference system (ANFIS), random forest (RF), Gaussian process regression (GP) and M5P model tree have been successfully applied to civil engineering and water resources problems (Kisi et al. 2012; Ebtehaj and Bonakdari 2013; Parsaie et al. 2016; Parsaie and Haghiabi 2017a, b, c; Qishlaqi et al. 2017; Parsaie et al. 2018a, b; Sihag 2018; Sihag et al. 2018a, b, 2019; Parsaie et al. 2020). Several conventional models exist, but their outcomes do not generalize across locations and conditions. The aim of this study was to develop a new model for the accurate prediction of the natural recharging rate of groundwater. GP-, M5P- and RF-based regression methods were selected for the prediction of the natural recharging rate, and the soft computing-based models were compared with the empirical equations (Kostiakov model, multi-linear regression (MLR) and multi-nonlinear regression (MNLR)). The most important input parameter was identified using sensitivity analysis, and a Taylor diagram and a box plot of prediction errors were used to investigate the accuracy of the applied models.

Methodology and dataset

Experimental procedure

In order to investigate the recharging of water through different soil types, three soil samples with different hydrodynamic parameters were used. Soil samples were collected with a core cutter from three different locations in Greece. After the soils were dried at 105 °C, granulometric analysis was carried out: each soil sample was passed through a series of sieves of descending mesh diameter. Bulk density, the moisture content of the saturated soil and recharging rates were measured in the laboratory for all soil samples. The apparatus used for the experiments is shown in Fig. 1. Each soil sample was packed in a transparent Plexiglas column. To achieve good homogeneity of the soil porosity, the column was filled with soil using a tube fitted with a double sieve. Time domain reflectometry (TDR) probes were inserted carefully at set locations along the column, and silicone was used for waterproofing to avoid water leakage. Two volumetric tubes were used to supply homogeneous, steady rain and to maintain a 2 mm head boundary at the top of the soil column. One volumetric tube was used for pouring water into the column, while the other served as an outflow container. The volume of water entering the soil was calculated by subtracting the volume of water in the second (outflow) tube from the volume in the first (inflow) tube. While the wetting front moved through the soil, the TDR automatically measured the soil moisture at set locations and time intervals.

Fig. 1 Experimental procedure

Dataset

The entire dataset contains 106 experimental observations from the laboratory. The data were divided into two separate groups, training and testing. The training set comprises 70% of the data, chosen randomly from the whole dataset, and the testing set comprises the remaining 30%. The features of the training and testing datasets are presented in Table 1, where time, sand, clay, silt, bulk density and moisture content are the input parameters and the recharging rate of the soil is the target.
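The random 70/30 division described above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the function name `split_dataset`, the fixed seed and the rounding rule that turns 70% of 106 into a whole number of rows are all assumptions, and the observations are represented by row indices only.

```python
import random


def split_dataset(rows, train_fraction=0.7, seed=0):
    """Randomly split rows into a training list and a testing list."""
    shuffled = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # seeded shuffle for repeatability
    # Assumption: the training size is the nearest whole number of rows.
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in for the 106 laboratory observations (indices only, no real data).
observations = list(range(106))
train, test = split_dataset(observations)
print(len(train), len(test))  # 74 32
```

Under this rounding assumption the split yields 74 training and 32 testing rows; the source gives only the percentages.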

Table 1 Features of the data set

Modeling approaches

Gaussian process regression (GP)

GP regression relies on the assumption that nearby observations share information with each other, and it is a way of specifying a prior directly over function space. A Gaussian process is a generalization of the Gaussian distribution: where a Gaussian distribution is specified by a mean vector and a covariance matrix, a Gaussian process is specified by a mean function and a covariance function. Because prior knowledge about the functional dependence and the data is built into the model, a separate validation for generalization is not essential. GP regression models are able to provide the predictive distribution corresponding to the input test data (Rasmussen and Williams 2006).

A GP is a collection of random variables, any finite number of which have a joint multivariate Gaussian distribution. Let p and q be the input and output domains, respectively, from which x pairs (gi, hi) are drawn independently and identically distributed. For regression, it is assumed that \(h \subseteq \text{Re}\); a GP on p is then defined by the mean function \(v0: p \to {\text{Re}}\) and the covariance function \(\mu : p \times p \to {\text{Re}}\). The kernels used in the present work are the radial basis function kernel (RBF) and the Pearson VII function-based universal kernel (PUK), which are shown below:

  1. RBF = \(e^{ - \gamma \left| x_{i} - x_{j} \right|^{2} }\)

  2. PUK = \(\frac{1}{\left[ 1 + \left( \frac{2\sqrt{\left\| x_{i} - x_{j} \right\|^{2}} \sqrt{2^{(1/\omega)} - 1}}{\sigma} \right)^{2} \right]^{\omega}}\)

where γ, σ and ω are primary parameters of the kernels.
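The two kernels can be written out directly as functions of a pair of input vectors. The sketch below is only an illustration of the formulas above; the function names are hypothetical, and any values passed for γ, σ and ω are placeholders rather than the tuned values of Table 2.

```python
import math


def squared_distance(x_i, x_j):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x_i, x_j))


def rbf_kernel(x_i, x_j, gamma):
    """Radial basis function kernel: exp(-gamma * |x_i - x_j|^2)."""
    return math.exp(-gamma * squared_distance(x_i, x_j))


def puk_kernel(x_i, x_j, sigma, omega):
    """Pearson VII function-based kernel with width sigma and shape omega."""
    distance = math.sqrt(squared_distance(x_i, x_j))
    scaled = 2.0 * distance * math.sqrt(2.0 ** (1.0 / omega) - 1.0) / sigma
    return 1.0 / (1.0 + scaled ** 2) ** omega
```

Both kernels equal 1 for identical inputs and decay with distance. For the PUK form, the kernel value is 0.5 when the distance equals σ/2, whatever the value of ω, so σ acts as a width parameter while ω shapes the tails.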

M5P model (M5P)

The M5P tree, initially introduced by Quinlan (1992), grows a decision tree with linear regression functions at the nodes, building a model that relates the output value of the training cases to the values of the input attributes. A splitting criterion is applied at each node so as to gain the maximum information, that is, the minimum variation in class values within the subsets sent down each branch. Splitting terminates when the class values of the instances reaching a node vary only slightly, when only a few instances remain, or when the tree is pruned back. The fully grown tree offers a good structure and prediction accuracy because it fits linear models at the leaf nodes (Singh et al. 2017).

Random forest (RF)

The random forest algorithm generates a model made up of a group of many trees. Each tree produces its own classification and votes for it, and the forest chooses the classification with the most votes; in regression, the predictions of the trees are averaged instead. If N is the number of cases in the training set, N cases are sampled at random with replacement from the original data to form the input dataset for growing each tree. At each node, m variables are chosen at random out of the K input variables to find the best split; the value of m should be less than K and is held constant. Each tree is grown to the largest extent possible without pruning. RF can work efficiently and accurately with large and complex datasets.
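The two sources of randomness described above, sampling N cases with replacement and choosing m of the K inputs at a node, can be sketched in a few lines. This is an illustration of those steps only, not a full random forest and not the software used in the study. The function names are hypothetical, and averaging the tree outputs is the usual regression counterpart of voting, assumed here.

```python
import random

# Input names taken from the dataset description; the identifiers are illustrative.
FEATURES = ["time", "sand", "clay", "silt", "bulk_density", "moisture_content"]


def bootstrap_sample(rows, rng):
    """Draw len(rows) cases at random with replacement."""
    n = len(rows)
    return [rows[rng.randrange(n)] for _ in range(n)]


def random_feature_subset(feature_names, m, rng):
    """Pick m distinct candidate features for one split, with m < K."""
    if not 0 < m < len(feature_names):
        raise ValueError("m must be positive and smaller than the number of inputs")
    return rng.sample(feature_names, m)


def forest_prediction(tree_predictions):
    """Aggregate per-tree regression outputs by averaging (assumed rule)."""
    return sum(tree_predictions) / len(tree_predictions)
```

A seeded generator such as `random.Random(1)` can be passed as `rng` to make the sampling repeatable.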

Empirical models

Kostiakov model

An empirical model was proposed by Kostiakov (1932) in order to estimate the recharging rate:

$$R\left( t \right) = at^{ - b}$$
(1)
$$R\left( t \right) = 2.7563t^{ - 0.6529}$$
(2)

where R(t) is the recharging rate at time t (LT\(^{-1}\)), t is the recharging time (T), and a and b are dimensionless empirical constants; Eq. (2) gives the model with the constants obtained for the present dataset.
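A short sketch of how Eq. (2) can be evaluated, together with one common way of estimating a and b: a straight-line least-squares fit of ln R against ln t. The fitting route is an assumption for illustration, since the source does not say how the constants were obtained, and the function names are hypothetical. The units of t and R are those of the experimental data and are not restated here.

```python
import math


def kostiakov_rate(t, a=2.7563, b=0.6529):
    """Recharging rate R(t) = a * t**(-b); defaults are the constants of Eq. (2)."""
    if t <= 0:
        raise ValueError("t must be positive")
    return a * t ** (-b)


def fit_kostiakov(times, rates):
    """Estimate (a, b) from ln R = ln a - b ln t by ordinary least squares."""
    xs = [math.log(t) for t in times]
    ys = [math.log(r) for r in rates]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope
```

Because b is positive, the predicted rate decreases monotonically with time, and the model depends on time alone, with no soil property among its inputs.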

Multiple linear regression (MLR)

MLR is applied when there is more than one predictor parameter. The general structure of the MLR model is:

$$Z = c_{0} + c_{1} x_{1} + c_{2} x_{2} + c_{3} x_{3} + c_{4} x_{4} + \cdots + c_{n} x_{n}$$
(3)
$$R\left( t \right) = 0.925 - 0.0012t + 0.0187S + 0.103{\text{Si}} - 0.189C + 0.4089D - 5.173{\text{Mc}}$$
(4)
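Equation (4) can be evaluated with a small function such as the one below. The argument names are illustrative, and the mapping of S, Si, C, D and Mc to sand, silt, clay, bulk density and moisture content is inferred from the list of input parameters. The inputs must be in the same units as the experimental data, which are not restated here.

```python
def mlr_rate(t, sand, silt, clay, bulk_density, moisture):
    """Recharging rate from the fitted MLR model, Eq. (4)."""
    return (
        0.925
        - 0.0012 * t
        + 0.0187 * sand
        + 0.103 * silt
        - 0.189 * clay
        + 0.4089 * bulk_density
        - 5.173 * moisture
    )
```

Because the model is linear in every input, it can return negative rates for inputs far from the calibration data, so it is best applied only within the range of the measurements.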

Multiple nonlinear regression (MNLR)

Multiple nonlinear regression (MNLR) is applied when there is more than one predictor parameter. The general structure of the MNLR model is:

$$Z = c_{0} x_{1}^{{c_{1} }} x_{2}^{{c_{2} }} x_{3}^{{c_{3} }} x_{4}^{{c_{4} }} \ldots x_{n}^{{c_{n} }}$$
(5)
$$R\left( t \right) = \, 0.0648t^{ - 0.4694} S^{0.438} {\text{Si}}^{ - 0.839} C^{0.305 } D^{4.33} {\text{Mc}}^{0.4047}$$
(6)

where Z is the dependent variable, expressed as a function of n independent parameters x1, x2, x3, …, xn, and c0, c1, c2, c3, …, cn are unknown coefficients. These coefficients reflect the local behavior and are estimated by the least squares technique. In Eqs. (4) and (6), t is the time, S, Si and C are the sand, silt and clay contents, D is the bulk density and Mc is the moisture content.
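One common way to apply least squares to the power-law form of Eq. (5) is to take logarithms, which makes the model linear in the unknown exponents. The step below is a sketch of that standard transformation, assuming Z and all x_i are positive; the source does not state whether the coefficients of Eq. (6) were obtained this way.

$$\ln Z = \ln c_{0} + c_{1} \ln x_{1} + c_{2} \ln x_{2} + \cdots + c_{n} \ln x_{n}$$

In this form, ln c0 plays the role of the intercept and c1, …, cn are the slopes of an ordinary multiple linear regression on the log-transformed variables.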

Model assessment

Four widely used measures were employed to assess the performance of the data mining methods and empirical equations: correlation coefficient (R), mean square error (MSE), root mean square error (RMSE) and Nash–Sutcliffe model efficiency (NSE) (Sihag et al. 2020).

$$R = \frac{{a\sum mn - (\sum m)(\sum n)}}{{\sqrt {a(\sum m^{2} ) - (\sum m)^{2} } \sqrt {a(\sum n^{2} ) - (\sum n)^{2} } }}$$
(7)
$${\text{MSE}} = \frac{1}{a}\mathop \sum \limits_{i = 1}^{a} \left( {m - n} \right)^{2}$$
(8)
$${\text{RMSE}} = \sqrt {\frac{1}{a}\left( {\mathop \sum \nolimits_{i = 1}^{a} \left( {m - n} \right)^{2} } \right)}$$
(9)
$${\text{NSE}} = 1 - \frac{{\mathop \sum \nolimits_{i = 1}^{a} \left( {m - n} \right)^{2} }}{{\mathop \sum \nolimits_{i = 1}^{a} \left( {m - \bar{m}} \right)^{2} }}$$
(10)

where \(m\) is the actual value, \(n\) is the predicted value, \(\bar{m}\) is the mean of the actual values and \(a\) is the number of values.
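The four measures can be computed directly from paired lists of actual and predicted values. The sketch below follows the symbols m, n and a defined above and uses the standard Pearson form for R, with the sum of n squared in the second denominator term. The function names are illustrative.

```python
import math


def correlation_coefficient(actual, predicted):
    """Pearson correlation coefficient R between actual (m) and predicted (n)."""
    a = len(actual)
    sum_m = sum(actual)
    sum_n = sum(predicted)
    sum_mn = sum(m * n for m, n in zip(actual, predicted))
    sum_m2 = sum(m * m for m in actual)
    sum_n2 = sum(n * n for n in predicted)
    numerator = a * sum_mn - sum_m * sum_n
    denominator = math.sqrt(a * sum_m2 - sum_m ** 2) * math.sqrt(a * sum_n2 - sum_n ** 2)
    return numerator / denominator


def mean_square_error(actual, predicted):
    """MSE: mean of the squared differences (m - n)."""
    return sum((m - n) ** 2 for m, n in zip(actual, predicted)) / len(actual)


def root_mean_square_error(actual, predicted):
    """RMSE: square root of the MSE."""
    return math.sqrt(mean_square_error(actual, predicted))


def nash_sutcliffe_efficiency(actual, predicted):
    """NSE: 1 minus the residual sum of squares over the variance of the actual values."""
    m_mean = sum(actual) / len(actual)
    residual = sum((m - n) ** 2 for m, n in zip(actual, predicted))
    total = sum((m - m_mean) ** 2 for m in actual)
    return 1.0 - residual / total
```

A perfect prediction gives R = 1, MSE = RMSE = 0 and NSE = 1. R measures only linear association, so a biased prediction can still reach R = 1 while its NSE falls well below 1, which is one reason to report the measures together.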

Implementation of machine learning methods

Four standard statistical measures, R, MSE, RMSE and NSE, were chosen to judge the performance of the data mining methods and empirical equations. Numerous trials were carried out to find the optimum values of the primary parameters. Higher values of R and NSE and lower values of MSE and RMSE indicate better estimation accuracy. The number of trees to be grown in the forest (k) and the number of features or variables selected at each node to grow a tree (m) are the two standard primary parameters of random forest regression. In M5P, the models were calibrated by changing the number of instances allowed at each node (m), while in Gaussian process regression the Gaussian noise, γ, σ and ω are the primary parameters. The selected primary parameters of the modeling approaches are presented in Table 2.

Table 2 Primary parameters

Results and discussion

All empirical equations except the Kostiakov model performed well in estimating the natural recharging rate of groundwater with the current dataset. The results of each empirical equation are plotted against the actual data in Fig. 2. The standard error indices R, RMSE, MSE and NSE were used to assess the accuracy of the empirical equations (Table 3). The MNLR equation, with R = 0.90, MSE = 0.02, RMSE = 0.15 and NSE = 0.87, is the most accurate of the empirical models, as seen in Table 3 and Fig. 2.

Fig. 2 Performance of empirical models

Table 3 Performance of empirical equations

Results of M5P tree

Developing the M5P model is a trial-and-error process. The M5P model has only one user-defined parameter (m), and its optimum value was found to be m = 4. The agreement plots of the M5P model for the training and testing stages are shown in Fig. 3. The performance measures for both stages are presented in Table 4. Figure 3 shows that the M5P tree model, with R = 0.82, MSE = 0.03, RMSE = 0.18 and NSE = 0.82, is appropriate for predicting the natural recharging rate of groundwater.

Fig. 3 Performance of M5P tree model

Table 4 Performance of M5P-, GP- and RF-based models

Results of GP

As with the M5P model, the GP models were developed on the same dataset. In this study, the Gaussian noise was fixed at 0.01 for a fair comparison of the two kernel function-based models. The primary parameters of the GP models are listed in Table 2. Based on the results (Table 4), the PUK kernel function-based model performs better than the RBF kernel function-based model. Agreement plots for these models are presented in Fig. 4. The R values of the PUK kernel function-based GP model were 0.97 and 0.88 for training and testing, respectively. Table 4 and Fig. 4 indicate that the GP_PUK model is more appropriate than the M5P and GP_RBF models for predicting the natural recharging rate of the soil. Note that in these figures GP_PUK denotes the outcomes of the PUK kernel function-based GP model and GP_RBF denotes those of the RBF kernel function-based GP model.

Fig. 4 Performance of GP models

Results of RF

The RF model was developed in the same way as the M5P and GP models, using the same dataset. The RF model has two primary parameters, the number of trees (k) and the number of features (m). In this study, one tree and one feature were selected. The outcomes of the RF model for prediction of the recharging rate of groundwater are presented in Fig. 5, and the optimum values of its primary parameters are given in Table 2. Overall, Table 4 and Fig. 5 show that the RF model predicts the natural recharging rate of the soil more accurately than the other models. The R values of the RF model were 0.98 and 0.91 for training and testing, respectively.

Fig. 5 Performance of RF model

Comparison of the soft computing and empirical models (Tables 3, 4) shows that the RF-based model performs better than the other models. The MNLR model, in turn, estimates the natural recharging rate of groundwater better than the GP, M5P and other empirical models. Finally, the Kostiakov model has the least ability to estimate the natural recharging rate.

Inter-comparison of soft computing and empirical models

In recent years, soft computing methods have been used successfully in several engineering-related fields. In this study, the performance of M5P-, GP- and RF-based models was assessed for the prediction of the recharging rate of the soil. The developed soft computing-based models were compared with the Kostiakov model, MLR and MNLR. The performance of all models is listed in Tables 3 and 4 for both the training and testing stages. The agreement plot of actual versus predicted values for the applied models in the testing stage is shown in Fig. 6. Figure 6 and Tables 3 and 4 confirm that the RF model outperforms the other soft computing and empirical models. A box plot (Fig. 7) shows the overall error distribution; negative and positive error values correspond to over-estimation and under-estimation by the models, respectively. Figure 8 shows the Taylor diagram for all applied models, which summarizes model performance graphically (Taylor 2001). Three statistics, standard deviation, correlation and root mean square error, quantify the degree of agreement between actual and predicted recharging rates. Figure 8 suggests that the RF model achieves a higher correlation with minimum standard deviation values. The Taylor diagram also confirms that the RF model performs better than the other applied models.

Fig. 6 Agreement plot of actual vs predicted values of recharging rate with various soft computing techniques using the testing dataset

Fig. 7 Box plot of prediction error for the various soft computing and empirical models

Fig. 8 Taylor diagram for the various soft computing and empirical models

Sensitivity investigation using RF

A sensitivity investigation was carried out on the RF model, the best-performing model, to examine its performance in the absence of each input. Several training sets were prepared by removing one input parameter at a time, and the outcomes were recorded in terms of R and RMSE on the testing dataset. The outcomes of the sensitivity investigation are given in Table 5. Table 5 shows that, compared with the other input parameters, time plays the most important role in predicting the recharging rate of the soil.
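The remove-one-input procedure can be sketched as a loop over the input names. This is a generic outline, not the implementation used in the study: `train_and_score` is a hypothetical caller-supplied function that trains a model on the listed inputs and returns its testing RMSE, and the input identifiers are illustrative.

```python
# Input names taken from the dataset description; the identifiers are illustrative.
FEATURES = ["time", "sand", "clay", "silt", "bulk_density", "moisture_content"]


def leave_one_input_out(feature_names, train_and_score):
    """Return {removed_input: score} with one input dropped per run.

    train_and_score is a hypothetical callable: it receives the list of
    inputs to keep and returns an error score such as the testing RMSE.
    """
    results = {}
    for removed in feature_names:
        kept = [name for name in feature_names if name != removed]
        results[removed] = train_and_score(kept)
    return results


def most_influential_input(rmse_by_removed_input):
    """The input whose removal gives the largest RMSE."""
    return max(rmse_by_removed_input, key=rmse_by_removed_input.get)
```

Under this convention, the input whose removal degrades the testing error the most is read as the most influential one.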

Table 5 Sensitivity investigation using RF

Conclusions

Prediction of the natural recharging rate of groundwater is essential for the efficient use of groundwater resources in agriculture (irrigation) and water supply. In this study, experimental data were used to investigate the performance of GP-, M5P- and RF-based regression methods and to evaluate their potential for predicting the natural recharging rate, and the results were compared with those of empirical equations (Kostiakov model, multi-linear regression (MLR) and multi-nonlinear regression (MNLR)). The outcomes indicate that the RF-based model is superior to the other soft computing and empirical models. In particular, the RF model shows good potential for predicting the recharging rate of groundwater, with R values of 0.98 and 0.91 for the training and testing stages, respectively, while MNLR (an empirical model) performs better than the GP, M5P, MLR and Kostiakov models. The PUK-based GP model also performs better than the RBF-based GP model for this dataset. In addition, the sensitivity investigation suggests that time (t) is the most significant variable when the RF-based modeling method is used to predict the recharging rate of groundwater, as time strongly affects the recharging rate. The Taylor diagram and box plot results also confirm that the RF model performs better than the other applied models for predicting the recharging rate of groundwater.