Introduction

Infiltration is the process by which water enters the soil through its top surface, and the rate at which it enters is called the infiltration rate (Haghighi et al. 2010). It plays an important role in the hydrologic cycle. Many factors influence the infiltration rate, including rainfall intensity, suction head, humidity, water content, type of impurities and field density. Infiltration is associated with surface runoff and groundwater recharge (Uloma et al. 2014) and is relevant to water supply systems, landslides, irrigation design, flood control systems and drainage (Igbadun and Idris 2007). The sorptivity and unsaturated hydraulic conductivity of a soil can be estimated from its infiltration rate (Chow et al. 1988; Scotter et al. 1982). The hydraulic properties of soil are necessary for the design of drainage systems (Brooks and Corey 1964). At the catchment level, the infiltration characteristic is one of the dominant factors determining flooding conditions (Bhave and Sreeja 2013). The infiltration capacity of the soil affects the amount of surface flow (Diamond and Shanley 2003). The infiltration rate is inversely proportional to the water-holding capacity of the soil (Singh et al. 2014). Physical changes in the soil also affect the infiltration rate (Gupta and Gupta 2008; Smith 2006; Micheal 1978).

Water quality also affects the infiltration rate and, ultimately, natural and artificial groundwater recharge. Many impurities present at the earth's surface mix easily with water and change its quality. Several studies have examined the relationship between water quality and infiltration. Singh et al. (2017) used two types of impurities (ash and organic manure) with three soft computing techniques (M5P model tree, artificial neural network and random forest) and found that random forest predicted the infiltration rate better than the other methods. Sihag (2018) studied the infiltration rate of sand mixed with different proportions of fly ash and rice husk ash using fuzzy logic and an artificial neural network and found that the artificial neural network outperformed fuzzy logic. Singh et al. (2017a, 2018) and Sihag and Singh (2018) used various empirical infiltration models to calculate the infiltration rate of the soil for their study areas. Tiwari et al. (2017) used a generalised regression neural network, MLR, M5P model tree and SVM to predict the cumulative infiltration of soil and found that SVM performed better than the other techniques. Soft computing techniques have also been applied widely in hydraulics and environmental engineering (Sihag et al. 2017b, c, 2018a; Haghiabi et al. 2018; Nain et al. 2018a; Tiwari et al. 2018; Parsaie et al. 2017a, b; Shiri et al. 2016, 2017; Parsaie and Haghiabi 2015, 2017; Parsaie 2016; Azamathulla et al. 2016; Baba et al. 2013). These researchers found that the techniques work exceptionally well. In view of this, the present investigation focuses on predicting the infiltration rate using the M5P tree, Gaussian process (GP) regression, multi-linear regression (MLR) and support vector machine (SVM). Furthermore, the results are compared with an empirical model (Kostiakov 1932), and a sensitivity analysis is performed to identify the most influential input parameter for predicting the infiltration rate of the soil.

Soft computing techniques

Soft computing techniques are among the most relevant modern techniques applied to civil engineering problems (Sihag et al. 2018b, c; Nain et al. 2018b; Haghiabi et al. 2017; Kisi et al. 2017; Parsaie et al. 2017c; Kisi et al. 2015; Parsaie and Haghiabi 2014; Shiri and Kisi 2012). In this investigation, GP, SVM and M5P tree models were used. Each is described below.

Gaussian process (GP) regression

GP regression rests on the assumption that nearby observations share information with one another; it is an approach for specifying a prior directly over function space. A Gaussian process is a generalisation of the Gaussian distribution: the mean vector and covariance matrix of the Gaussian distribution become the mean function and covariance function in GP regression. Because prior knowledge of the functional dependence and the data is built into the model, a separate validation step for generalisation is not essential. GP regression models are able to provide the predictive distribution corresponding to the test input data (Rasmussen and Williams 2006).

A GP is a collection of random variables, any finite number of which have a joint multivariate Gaussian distribution. Let u and v denote the input and output domains, respectively, from which n pairs (gi, hi) are drawn independently and identically distributed. For regression, it is assumed that v ⊆ ℝ; a GP on u is then defined by the mean function v0: u → ℝ and the covariance function µ: u × u → ℝ. Readers are referred to Kuss (2006a, b) for exhaustive details of GP.
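The role of the mean and covariance functions can be illustrated with a minimal NumPy sketch of GP regression. It is not the implementation used in this study: it assumes a zero prior mean, an RBF covariance function and a Gaussian noise variance, and the names `rbf_matrix`, `gp_predict`, `gamma` and `noise` are hypothetical.

```python
import numpy as np


def rbf_matrix(A, B, gamma):
    """Pairwise RBF covariance exp(-gamma * ||a - b||^2) between rows of A and B."""
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))


def gp_predict(X_train, y_train, X_test, gamma=1.0, noise=1e-2):
    """Posterior mean and variance of a zero-mean GP (assumed prior)."""
    K = rbf_matrix(X_train, X_train, gamma) + noise * np.eye(len(X_train))
    K_s = rbf_matrix(X_test, X_train, gamma)
    K_ss = rbf_matrix(X_test, X_test, gamma)
    L = np.linalg.cholesky(K)                      # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s @ alpha                             # predictive mean
    v = np.linalg.solve(L, K_s.T)
    var = np.diag(K_ss - v.T @ v)                  # predictive variance
    return mean, var


# Toy data (hypothetical), not the infiltration measurements.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.sin(X).ravel()
mean, var = gp_predict(X, y, np.array([[1.5], [100.0]]), gamma=1.0, noise=1e-6)
print(mean, var)
```

Far from the training inputs, the predictive mean returns to the prior mean (zero here) and the variance returns to the prior variance, which is how the model expresses its uncertainty.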

Support vector machine (SVM)

This method was first proposed by Vapnik (1998) and is based on statistical learning theory. The main principle of SVM is the optimal separation of classes. When the classes are separable, SVM selects, from the infinite number of possible linear classifiers, the one with the lowest generalisation error, or sets an upper limit on the error derived from structural risk minimisation. The selected hyperplane therefore gives the maximum margin between the two classes, where the margin is the sum of the distances from the hyperplane to the nearest points of the two classes. Readers are referred to Smola (1996) for exhaustive details of SVM. Cortes and Vapnik (1995) introduced the idea of kernel functions for nonlinear support vector regression.

M5P tree

The M5P tree (Quinlan 1992) is a binary decision tree that has linear regression functions at its leaves (terminal nodes), which allows it to predict continuous numerical attributes. The model tree is generated in two stages. In the first stage, a splitting criterion is used to grow a decision tree. The criterion treats the standard deviation of the class values reaching a node as a measure of the error at that node. A split produces child nodes with a lower standard deviation than the parent node, and the child nodes are therefore considered purer (Quinlan 1992). Out of all possible splits, the M5P tree chooses the one that maximises the expected error reduction. Repeated splitting may overgrow the tree and cause overfitting, so the second stage prunes the overgrown tree by replacing subtrees with linear regression functions. In this way, the parameter space is split into subspaces and a linear regression model is built in each of them.

The M5P tree algorithm uses the standard deviation of the class values reaching a node as a measure of the error at that node and evaluates the expected reduction in error for each candidate split. The standard deviation reduction (SDR) is given as

$${\text{SDR}} = {\text{sd}}(N) - \sum\limits_{i} \frac{{\left| {N_{i} } \right|}}{\left| N \right|}{\text{sd}}(N_{i} )$$
(1)

where N is the set of examples that reach the node, Ni is the subset of examples that have the ith outcome of the potential split, and sd is the standard deviation.
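Eq. (1) can be sketched in a few lines of Python. The snippet below is only an illustration of the splitting criterion, not the M5P implementation used in this study; it assumes the population standard deviation (NumPy's default), and the function names `sdr` and `best_threshold` and the toy data are hypothetical.

```python
import numpy as np


def sdr(parent, subsets):
    """Standard deviation reduction of Eq. (1) for one candidate split."""
    parent = np.asarray(parent, dtype=float)
    # Weighted sum of the child standard deviations, weights |N_i| / |N|.
    weighted = sum(len(s) / parent.size * np.std(s) for s in subsets)
    return np.std(parent) - weighted


def best_threshold(x, y):
    """Scan the midpoints of one numeric attribute; return (threshold, SDR)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_thr, best_gain = None, -np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # equal attribute values cannot be separated
        gain = sdr(y, [y[:i], y[i:]])
        if gain > best_gain:
            best_thr, best_gain = (x[i] + x[i - 1]) / 2.0, gain
    return best_thr, best_gain


# Toy data (hypothetical): the target jumps once the attribute exceeds 3.
thr, gain = best_threshold([1, 2, 3, 10, 11, 12], [1, 1, 1, 5, 5, 5])
print(thr, gain)
```

The split that separates the two groups of target values leaves zero standard deviation in each child, so it removes the whole parent standard deviation and is selected.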

Conventional models

In this investigation, two conventional models were used. They are described below.

Multi-linear regression (MLR)

The variables in the multi-linear regression analysis are the infiltration rate f(t), cumulative time Tf, type of impurities It, concentration of impurities Ci and moisture content Wc; therefore, the following functional relationship may initially be assumed:

$$f(t) = k\;T_{\text{f}}^{a} \cdot I_{\text{t}}^{b} \cdot C_{\text{i}}^{c} \cdot W_{\text{c}}^{d}$$
(2)

where k is the proportionality constant and a, b, c and d are exponents. Taking the logarithm of both sides gives

$$\log \;f(t) = \log \;k + a\;\log \;T_{\text{f}} + b\;\log \;I_{\text{t}} + c\;\log \;C_{\text{i}} + d\;\log \;W_{\text{c}}$$
(3)

There are four explanatory variables in the multi-linear equation. To develop the multi-linear model, log f(t) is taken as the output parameter and the four explanatory variables, namely log Tf, log It, log Ci and log Wc, are taken as input parameters. The regression provides the values of k, a, b, c and d and hence the fitted equation in the form of Eq. (3). The developed multi-linear regression equation is as follows:

$$f(t) = 104\left( {\frac{{I_{\text{t}}^{1.310} \cdot C_{\text{i}}^{0.007} }}{{T_{\text{f}}^{0.66} \cdot W_{\text{c}}^{0.270} }}} \right)$$
(4)

where It is 1 for ash and 2 for organic manure.
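The log-linear fitting step behind Eqs. (2) to (4) can be sketched with ordinary least squares in NumPy. This is a minimal illustration under stated assumptions: base-10 logarithms are assumed, the synthetic inputs are hypothetical and generated from Eq. (4) itself rather than taken from the measured data, and the function names `fit_power_law` and `eq4` are not from the original study.

```python
import numpy as np


def eq4(Tf, It, Ci, Wc):
    """Evaluate the developed MLR equation, Eq. (4)."""
    return 104.0 * (It ** 1.310 * Ci ** 0.007) / (Tf ** 0.66 * Wc ** 0.270)


def fit_power_law(Tf, It, Ci, Wc, f):
    """Least-squares fit of Eq. (3); returns k, a, b, c, d of Eq. (2)."""
    X = np.column_stack([
        np.ones(len(f)),          # intercept column, gives log k
        np.log10(Tf), np.log10(It), np.log10(Ci), np.log10(Wc),
    ])
    coef, *_ = np.linalg.lstsq(X, np.log10(f), rcond=None)
    log_k, a, b, c, d = coef
    return 10.0 ** log_k, a, b, c, d


# Synthetic, noise-free inputs (hypothetical ranges, not the measured data).
rng = np.random.default_rng(0)
Tf = rng.uniform(5.0, 180.0, 60)               # cumulative time
It = rng.choice([1.0, 2.0], 60)                # 1 = ash, 2 = organic manure
Ci = rng.choice([1.0, 5.0, 10.0, 15.0], 60)    # concentration (%)
Wc = rng.uniform(5.0, 20.0, 60)                # moisture content (%)
k, a, b, c, d = fit_power_law(Tf, It, Ci, Wc, eq4(Tf, It, Ci, Wc))
print(k, a, b, c, d)
```

In the notation of Eq. (2), the terms in the denominator of Eq. (4) correspond to negative exponents, so the fit returns a negative a for Tf and a negative d for Wc.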

Kostiakov model

The details of the Kostiakov model (Kostiakov 1932) are as follows:

$$f(t) = aT_{\text{f}}^{ - b}$$
(5)

where a and b are constants.

Fitting Eq. (5) to the measured infiltration rate as a function of time gives Eq. (6).

$$f(t) = 114.6T_{\text{f}}^{ - 0.68}$$
(6)
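One common way to obtain the constants of Eq. (5) is a straight-line fit in log-log space, since log f(t) = log a − b log Tf. The sketch below assumes this fitting approach, which the text does not specify, and uses synthetic values generated from Eq. (6); the function names `kostiakov` and `fit_kostiakov` are hypothetical.

```python
import numpy as np


def kostiakov(T, a, b):
    """Kostiakov infiltration rate, Eq. (5): f(t) = a * T**(-b)."""
    return a * np.asarray(T, dtype=float) ** (-b)


def fit_kostiakov(T, f):
    """Estimate a and b from a linear fit of log f against log T."""
    slope, intercept = np.polyfit(np.log(T), np.log(f), 1)
    return np.exp(intercept), -slope


# Synthetic check (hypothetical time grid): regenerate Eq. (6) and refit it.
T = np.array([5.0, 10.0, 15.0, 30.0, 60.0, 90.0, 120.0, 180.0])
a, b = fit_kostiakov(T, kostiakov(T, 114.6, 0.68))
print(a, b)
```

With measured data the fitted curve will not pass through every point, and a fit in log space weights relative rather than absolute errors.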

Materials and methodology

In this investigation, two double-ring infiltrometers were used to measure the infiltration rate of the soil. Each consists of two rings, an inner ring and an outer ring, with diameters of 300 mm and 450 mm, respectively, as shown in Fig. 1. The instrument, which has a total depth of 300 mm, was driven 100 mm into the soil by uniform strikes of a falling-weight hammer, without disturbing the top layer of the soil. Both rings were filled with water to an equal depth, and the initial depth of water in the inner ring was noted, because water from the inner ring moves directly downwards rather than laterally. The moisture content of the soil was also determined before each experiment using the gravimetric method.

Fig. 1 Double-ring infiltrometer

The experiments were carried out in the Hydraulics Laboratory, NIT Kurukshetra, India. The soil at NIT Kurukshetra is loam, and the elevation is 274 m above sea level. The climate of Kurukshetra is cold in winter and dry in summer except during the monsoon season (normal annual rainfall 582 mm). The infiltration rate was measured for water mixed with different concentrations of impurities (1%, 5%, 10% and 15%) and different types of impurities (ash and organic manure), which are by-products generally present in the study area. Two double-ring infiltrometers were driven into the soil parallel to each other: one was filled with water containing a fixed concentration of ash and the other with water containing organic manure. The infiltration rate was measured for a fixed duration of 180 min, because after 180 min the infiltration rate reaches a steady value (Sihag et al. 2017). The details of the experimental procedure and the range of the infiltration rate are summarised in Table 1, and the infiltration rate is plotted against time for ash and organic manure in Fig. 2. As indicated in Table 1 and Fig. 2, the infiltration rate generally decreases with time. The initial infiltration rate of water containing ash was higher than that of water containing organic manure, but the final infiltration rate of water containing organic manure was higher than that of water containing ash. In the case of organic manure, the infiltration rate increased as the elapsed time approached 90 min and then decreased with time (Singh 2015).

Table 1 Details of the experimental procedure along with the range of infiltration rate
Fig. 2 Result analysis of the infiltration rate with different water qualities: (a) ash and (b) organic manure

Data set

The infiltration experiments were performed between January 2015 and May 2015 at the Hydraulics Laboratory, Civil Engineering Department, NIT Kurukshetra. The geographical coordinates of the study area are 29.9490° N and 76.8173° E. The soil on the campus is poorly permeable and therefore has a low infiltration rate. In total, 132 observations were obtained from the field experiments, of which 92 were used for training and the remaining 40 for testing the models. Cumulative time (min), type of impurities (organic manure/ash), concentration of impurities (%) and moisture content (%) were the input variables, and infiltration rate (mm/h) was the output. The features and correlation matrix of the data set are given in Tables 2 and 3, respectively.

Table 2 Features of the data used
Table 3 Correlation matrix of input data set

Detail of kernel functions

The design of SVM- and GP-based regression approaches involves the choice of a kernel function. Several kernel functions are available for GP and SVM. In this study, two kernel functions were used with both techniques.

  1. Radial basis kernel (RBF) = \(e^{ - \gamma \left\| {a - b} \right\|^{2} }\)

  2. Pearson VII kernel function (PUK) = \(\frac{1}{\left[ 1 + \left( \frac{2\sqrt{\left\| a - b \right\|^{2}} \sqrt{2^{(1/\omega)} - 1}}{\sigma} \right)^{2} \right]^{\omega}}\)

where γ, σ and ω are kernel parameters. The estimation performance of GP and SVM depends on a good setting of the meta-parameters: the Gaussian noise, C, γ, σ and ω. The choice of these parameters controls the complexity of the prediction (regression) model. In this study, the primary parameters (i.e. C, γ, σ, ω and Gaussian noise) were selected manually, choosing the values that minimise the RMSE and maximise the CC. The same kernel-specific parameters were used for GP regression and for SVM. Table 4 lists the optimal values of the primary parameters for the GP, SVM and M5P tree models.
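The two kernel functions listed above can be written directly in NumPy. The sketch below only evaluates the kernels for a pair of input vectors; the function names `rbf_kernel` and `puk_kernel` and the parameter values in the example are hypothetical and are not the optimal values of Table 4.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """Radial basis kernel: exp(-gamma * ||a - b||^2)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-gamma * np.sum(diff ** 2)))


def puk_kernel(a, b, sigma, omega):
    """Pearson VII kernel with width sigma and tailing parameter omega."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    dist = np.sqrt(np.sum(diff ** 2))
    scaled = 2.0 * dist * np.sqrt(2.0 ** (1.0 / omega) - 1.0) / sigma
    return float(1.0 / (1.0 + scaled ** 2) ** omega)


# Both kernels equal 1 for identical inputs and decay with distance.
a, b = [0.0, 0.0], [1.0, 1.0]
print(rbf_kernel(a, b, gamma=0.5), puk_kernel(a, b, sigma=1.0, omega=1.0))
```

Both kernels depend on the inputs only through the distance between them, which is why the same kernel-specific parameters can be shared between GP regression and SVM.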

Table 4 Primary parameters using GP, SVM and M5P tree

Statistical performance evaluation criteria

Correlation coefficient (CC) and root-mean-square error (RMSE) values were calculated to investigate the performance of GP, SVM and M5P tree modelling approaches.

Coefficient of correlation (CC)

The coefficient of correlation (CC) is computed as

$${\text{CC}} = \frac{{m\sum\nolimits_{i = 1}^{m} o_{i} t_{i} - \left( {\sum\nolimits_{i = 1}^{m} o_{i} } \right)\left( {\sum\nolimits_{i = 1}^{m} t_{i} } \right)}}{{\sqrt {m\left( {\sum\nolimits_{i = 1}^{m} o_{i}^{2} } \right) - \left( {\sum\nolimits_{i = 1}^{m} o_{i} } \right)^{2} } \sqrt {m\left( {\sum\nolimits_{i = 1}^{m} t_{i}^{2} } \right) - \left( {\sum\nolimits_{i = 1}^{m} t_{i} } \right)^{2} } }}$$
(7)

Root-mean-square error (RMSE)

The root-mean-square error (RMSE) is computed as:

$${\text{RMSE}} = \sqrt {\frac{1}{m}\sum\nolimits_{i = 1}^{m} {\left( {o_{i} - t_{i} } \right)^{2} } }$$
(8)

where oi is the ith observed value of the infiltration rate (calculated from the measurements), ti is the corresponding estimated value of the infiltration rate and m is the number of observations.
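The two criteria can be computed as in the following sketch, which assumes the standard Pearson form of CC (with the sums of ti and ti² in the second factor of the denominator) and the usual definition of RMSE as the square root of the mean squared error. The function names and sample values are hypothetical.

```python
import numpy as np


def correlation_coefficient(o, t):
    """Coefficient of correlation between observed o and estimated t."""
    o = np.asarray(o, dtype=float)
    t = np.asarray(t, dtype=float)
    m = o.size
    num = m * np.sum(o * t) - np.sum(o) * np.sum(t)
    den = (np.sqrt(m * np.sum(o ** 2) - np.sum(o) ** 2)
           * np.sqrt(m * np.sum(t ** 2) - np.sum(t) ** 2))
    return float(num / den)


def rmse(o, t):
    """Root-mean-square error between observed o and estimated t."""
    o = np.asarray(o, dtype=float)
    t = np.asarray(t, dtype=float)
    return float(np.sqrt(np.mean((o - t) ** 2)))


# Hypothetical values: every estimate is exactly 1 unit too high.
observed = [1.0, 2.0, 3.0, 4.0]
estimated = [2.0, 3.0, 4.0, 5.0]
print(correlation_coefficient(observed, estimated), rmse(observed, estimated))
```

The example shows why both criteria are reported together: a constant bias leaves CC at its maximum while RMSE still registers the error.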

Results and discussion

This section examines the predictive performance of the three soft computing techniques (GP, SVM and M5P tree) and the two empirical models (MLR and Kostiakov model). The performance of the soft computing models depends on the primary parameters, whose values are listed in Table 4. In this study, the input variables were Tf, It, Ci and Wc and the output was f(t). The results of the soft computing techniques and the empirical models are given in Table 5.

Table 5 Results of the different modelling approaches and empirical models for training and testing data set

Figure 3 shows scatter plots of the actual and predicted values of the infiltration rate of the soil for GP regression with the RBF and PUK kernel functions. It is clear from Fig. 3 that neither kernel function predicted the infiltration rate well. Of the two, the RBF kernel performed better, with CC and RMSE values of 0.4374 and 14.9329 mm/h, respectively (refer to Table 5).

Fig. 3 Predicted infiltration rate of soil using GP_RBF and GP_PUK

The same data set was also used for the SVM-based regression techniques. Figure 4 shows scatter plots of the infiltration rate of the soil for SVM regression with the RBF and PUK kernel functions. Like GP regression, SVM did not predict the infiltration rate of the soil well, although its results were slightly better than those of GP regression. The values of CC and RMSE for SVM with the RBF kernel were 0.5278 and 14.1891 mm/h, respectively (refer to Table 5).

Fig. 4 Predicted infiltration rate of soil using SVR_RBF and SVR_PUK

The infiltration rate of the soil was also predicted from the same data set using the M5P tree, multi-linear regression and the Kostiakov model. Figure 5 shows scatter plots of the infiltration rate of the soil for MLR, the Kostiakov model and the M5P tree. It is clear from Fig. 5 that the points from the M5P tree model lie closer to the line of agreement than those of the other two models. Its CC is also much higher (0.8490) and its RMSE much lower (9.4356 mm/h) than those of the other two techniques. The MLR and Kostiakov models performed almost equally, with CC values of 0.4405 and 0.4806 and RMSE values of 15.9657 and 15.0521 mm/h, respectively.

Fig. 5 Predicted infiltration rate of soil by using MLR, M5P tree and Kostiakov model

Comparison of the results

All the techniques and models were compared to identify the most efficient one for predicting the infiltration rate of the soil. The M5P tree performed better (CC = 0.8490 and RMSE = 9.4356 mm/h) than the other models and techniques. Among the GP and SVM models, SVM with the RBF kernel outperformed the other kernel and model combinations, with CC and RMSE values of 0.5278 and 14.1891 mm/h, respectively. Table 6 gives the statistical information of the actual and predicted values of the infiltration rate for the different soft computing techniques and empirical models.

Table 6 Statistical information of the infiltration rate with different soft computing techniques and empirical models

Figure 6 plots the values predicted by MLR, the M5P tree and the Kostiakov model together with the actual infiltration rate, in increasing order of the values. The figure suggests that the values predicted by the M5P model follow the same path as the actual values. However, when the actual infiltration rate was very high, all the techniques gave large errors because the infiltration rate fluctuated strongly at the start of the experiments. Hence, it is clear from Table 5 and Fig. 6 that the M5P tree was the best technique, and it can be used to predict the infiltration rate under the same conditions when infiltration data are not available.

Fig. 6 Variation of infiltration rate using the different regression approaches

Sensitivity analysis (SA)

Sensitivity analysis (SA) identifies the input parameter or parameters that most strongly affect the infiltration rate of the soil. In this investigation, the sensitivity analysis was carried out by removing one input parameter at a time. The M5P tree model was used for the analysis with the same primary parameters. Table 7 summarises the results, which suggest that cumulative time is the most important parameter for predicting the infiltration rate of the soil for this data set.

Table 7 Sensitivity analysis using M5P tree model
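The leave-one-input-out procedure can be sketched as follows. This is an illustration only: an ordinary least-squares linear model stands in for the M5P tree, the data are synthetic (with the first input deliberately made dominant), and the function names `fit_predict` and `sensitivity` are hypothetical. Only the 92/40 training and testing split and the input names follow the text.

```python
import numpy as np


def fit_predict(X_train, y_train, X_test):
    """Stand-in regressor (linear least squares), NOT the M5P tree."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef


def sensitivity(X_train, y_train, X_test, y_test, names):
    """Remove one input at a time; return {removed input: (CC, RMSE)}."""
    results = {}
    for j, name in enumerate(names):
        keep = [i for i in range(X_train.shape[1]) if i != j]
        pred = fit_predict(X_train[:, keep], y_train, X_test[:, keep])
        cc = float(np.corrcoef(y_test, pred)[0, 1])
        rmse = float(np.sqrt(np.mean((y_test - pred) ** 2)))
        results[name] = (cc, rmse)
    return results


# Synthetic data (hypothetical): 132 samples, first input dominates the output.
rng = np.random.default_rng(0)
X = rng.normal(size=(132, 4))
y = 5.0 * X[:, 0] + 0.1 * X[:, 1:].sum(axis=1) + 0.01 * rng.normal(size=132)
names = ["Tf", "It", "Ci", "Wc"]
table = sensitivity(X[:92], y[:92], X[92:], y[92:], names)
most_important = max(table, key=lambda n: table[n][1])
print(most_important, table)
```

The input whose removal causes the largest increase in RMSE (and the largest drop in CC) on the testing data is taken as the most important one.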

Conclusions

Knowledge of the infiltration process is essential for agriculture, hydrologic studies, watershed management, irrigation system design and drainage design. In this investigation, three soft computing techniques (SVR, GP and M5P tree) and two empirical models (MLR and Kostiakov model) were used to estimate the infiltration rate of the soil with different water qualities. The results show that the M5P tree model predicts the infiltration rate of the soil with different water qualities more accurately than the SVR, GP, MLR and Kostiakov models, while SVM performed better than GP, MLR and the Kostiakov model. Thus, the M5P tree model was the most suitable model for predicting the infiltration rate of the soil. Finally, the sensitivity analysis with the M5P model tree suggests that cumulative time is the most important parameter affecting the infiltration rate of the soil with different water qualities for this data set.