Introduction

Estimating the hydraulic properties of soil is essential because of their considerable role in studies related to dams, the hydrological cycle, irrigation systems, drainage systems and groundwater flow. The saturated hydraulic conductivity of soil is an important property that characterizes subsurface flow behaviour, and it largely governs the infiltration of water through soil. Hydraulic conductivity is strongly influenced by the compaction, density and water content of the soil. The design and feasibility assessment of irrigation and drainage projects require accurate determination of hydraulic conductivity for efficient management of surface as well as subsurface flow. Hydraulic conductivity of soil is traditionally estimated on small samples in the laboratory or by using different infiltrometers in the field. Direct measurement and accurate determination of hydraulic conductivity are difficult, tedious and time-consuming because of temporal and spatial variability when hydrological estimates are required for large areas (Arshad et al. 2013). Therefore, indirect methods involving predictive estimation have gained considerable popularity and are widely adopted to predict the hydraulic properties of soils from basic measurable soil properties (Al-Sulaiman and Aboukarima 2016). Thus, for numerous hydrological modelling purposes, soil hydraulic properties are predicted from more easily accessible proxy variables such as soil texture, bulk density or organic carbon content (Jarvis et al. 2013). Many predictive computing methods such as multiple linear regression (MLR), artificial neural networks (ANN), support vector machines (SVM) and the adaptive neuro-fuzzy inference system (ANFIS) have been used to improve the precision of soil hydraulic conductivity estimates.

Most of the studies available in the literature discuss the application of ANN and SVM as predictive models for estimating the hydraulic conductivity of soil (Agyare et al. 2007; Erzin et al. 2009; Rogiers et al. 2012; Das et al. 2012; Sihag 2018). Arshad et al. (2013) compared the performance of radial basis function neural networks (RBFNN), multilayer perceptron neural networks (MLPNN), ANFIS and MLR in estimating the saturated hydraulic conductivity from soil texture and bulk density. They reported ANFIS to be a more powerful estimation tool than ANN and MLR. Ekhmaj (2010) developed MLR and ANN models to predict the steady infiltration rate, and the ANN model yielded better predictions than MLR. Elbisy (2015) used a genetic algorithm to determine the optimum SVM parameters and investigated the performance of three kernel functions (linear, radial basis and sigmoid) in determining the field hydraulic conductivity of sandy soil, with easily measurable soil parameters as input variables. The study found the RBF kernel-based SVM to be a powerful method for the indirect estimation of hydraulic conductivity in comparison with the other methods. Al-Sulaiman and Aboukarima (2016) successfully implemented an ANN model to relate hydraulic conductivity to eight input soil parameters (sand, silt, clay, soil electrical conductivity, sodium adsorption ratio, organic matter, initial soil water content and bulk density of soil). In a study conducted on field infiltration data, Sihag et al. (2017a) proposed a novel nonlinear regression-based infiltration model, developed from the modified Kostiakov model for the location of NIT Kurukshetra (India), which yielded better estimates of infiltration rate than some popular conventional models.
In a laboratory study conducted on synthetic soil samples by varying the percentages of the soil mixture (sand, rice husk ash, fly ash), moisture content, bulk density and suction head, cumulative infiltration was estimated by using machine learning approaches (multiple nonlinear regression, support vector machines, Gaussian process regression) as well as conventional infiltration models (Sihag et al. 2017b). The Gaussian process regression (GPR) approach gave more accurate predictions than the other models. On a similar type of laboratory data, Tiwari et al. (2017) and Sihag et al. (2019a) showed successful use of ANFIS in modelling the cumulative infiltration and the unsaturated hydraulic conductivity of soil samples. Some recent studies reported successful application of soft computing techniques, viz. SVM, GPR, M5 tree and random forest regression, to the fields of groundwater hydrology (Singh et al. 2017, 2019a, b; Angelaki et al. 2018; Sihag et al. 2018a, b, c; Vand et al. 2018; Kumar and Sihag 2019; Sihag et al. 2019b, c), water resources (Kumar et al. 2018; Sepahvand et al. 2019; Singh et al. 2018a, b; Tiwari and Sihag 2018; Tiwari et al. 2019) and engineering (Nain et al. 2018, 2019; Mehdipour et al. 2018; Kumar et al. 2019; Mohanty et al. 2019). Keeping in view the importance of the M5 tree and random forest regression techniques, the present research implements these techniques in an attempt to relate the unsaturated hydraulic conductivity measured in the field at 20 locations of Kurukshetra district, Haryana, to the soil physical properties.

To the best of the authors' knowledge, the predictive capabilities of M5 tree and random forest (RF) regression have not been investigated for estimating the unsaturated hydraulic conductivity of soil in the field. This study therefore investigates the potential of M5 tree and RF regression models. A relationship based on multiple nonlinear regression (MNLR) is developed for the unsaturated hydraulic conductivity of soil, with sand (%), clay (%), silt (%), bulk density and moisture content as input variables, and the developed relationship is compared with the soft computing-based regression models (M5 and RF).

Study area

Kurukshetra district lies in the Ghaggar basin (Fig. 1), in the north-east part of Haryana State, India. The Ghaggar is one of the main rivers of Haryana State. Thanesar Tehsil of Kurukshetra district was chosen for experimentation, and twenty different locations were selected for measurement of the infiltration process. The texture of the soil is listed in Table 1 and shown in Fig. 2.

Fig. 1 Study area

Table 1 Texture of the soil

Fig. 2 Texture of the soil for the study area

Data set

The unsaturated soil hydraulic conductivity was measured in the field using a mini disc infiltrometer (Decagon Devices Inc.) as shown in Fig. 3. During the experiment, the volume of water in the lower chamber was recorded at regular time intervals. The total data set, consisting of 240 observations from field experiments on the infiltration process, was separated randomly into a training group and a testing group. The larger group (70% of the total data) is used as training data, while the smaller group (the remaining 30%) is used as testing data. The input parameters are sand, clay, silt, bulk density and moisture content, and the output parameter is the unsaturated hydraulic conductivity (\(K\)) of soil. The characteristics of both data sets are listed in Table 2.
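The random 70/30 division described above can be sketched as follows. This is a minimal illustration only: the function name `split_train_test`, the seed and the placeholder records are assumptions, and the source does not state which tool or random seed was actually used.

```python
import random


def split_train_test(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and split it into training and testing lists.

    The seed is an arbitrary, assumed value used only to make the sketch repeatable.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder records standing in for the 240 field observations.
observations = list(range(240))
train, test = split_train_test(observations)
print(len(train), len(test))  # 168 72
```

With 240 observations, this yields 168 training and 72 testing records, and every observation lands in exactly one of the two groups.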

Fig. 3 Mini disc infiltrometer (Infiltrometer User’s Manual, 2014)

Table 2 Characteristics of training and testing data sets

Modelling approaches

Multiple nonlinear regression (MNLR)

To develop the nonlinear regression model, the following general form of the multiple nonlinear regression model is considered:

$$K = a S^{{b_{1} }} C^{{b_{2} }} {\text{Si}}^{{b_{3} }} \rho^{{b_{4} }} {\text{MC}}^{{b_{5} }}$$
(1)

where \(K\) is the dependent variable representing the hydraulic conductivity of soil; \(S\), \(C\), \({\text{Si}}\), \(\rho\) and \({\text{MC}}\) are the explanatory variables (sand, clay, silt, bulk density and moisture content, respectively); \(a\) is a constant; and the regression coefficients \(b_{1}\), \(b_{2}\), \(b_{3}\), \(b_{4}\) and \(b_{5}\) are estimated by least squares, i.e. by minimizing the sum of squared prediction errors. Based on the above equation, the following relationship is developed from the training data set:

$$K = 0.83 \times 10^{ - 5} S^{0.95} C^{ - 0.23} {\text{Si}}^{ - 0.37} \rho^{1.92} {\text{MC}}^{ - 0.015}$$
(2)
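As an illustration of how Eq. (2) is applied to a single observation, the sketch below evaluates it directly. The function name and the sample input values are hypothetical, and the units of bulk density and moisture content are assumed to match those used to fit the coefficients, which this section does not state.

```python
def mnlr_conductivity(sand, clay, silt, bulk_density, moisture_content):
    """Evaluate Eq. (2) for one observation.

    sand, clay and silt are percentages; the units of bulk_density and
    moisture_content are assumed to be those of the original training data.
    """
    return (0.83e-5
            * sand ** 0.95
            * clay ** -0.23
            * silt ** -0.37
            * bulk_density ** 1.92
            * moisture_content ** -0.015)


# Hypothetical sample, not taken from the field data set.
sample = dict(sand=60.0, clay=15.0, silt=25.0,
              bulk_density=1.5, moisture_content=10.0)
k_pred = mnlr_conductivity(**sample)
print(f"Predicted K: {k_pred:.3e}")
```

Taking logarithms of Eq. (1) gives a form that is linear in \(\ln a\) and the exponents \(b_{1}\) to \(b_{5}\), which is one common way to fit such a power-law model; whether the coefficients in Eq. (2) were obtained this way or by a direct nonlinear fit is not stated here.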

M5 model tree (M5)

The M5 tree, introduced by Quinlan (1992), is a decision tree learner for regression problems. The algorithm divides the whole data space into several subspaces and fits a multivariate linear regression model to each subspace, so that linear regression functions are assigned at the terminal nodes. The M5 tree method deals with continuous class problems instead of discrete classes and can handle tasks with very high dimensionality. It provides a piecewise description of the data, in which each linear model approximates part of a nonlinear relationship in the data set.

The splitting criterion for the M5 model tree is based on a measure of error at each node. The error is measured by the standard deviation of the class values that reach a node. Each attribute is tested at the node, and the attribute that maximizes the expected error reduction is chosen for splitting. The standard deviation reduction (\({\text{SDR}}\)) is calculated by:

$${\text{SDR}} = {\text{sd}}\left( K \right) - \sum \frac{{\left| {K_{i} } \right|}}{\left| K \right|}{\text{sd}} \left( {K_{i} } \right)$$
(3)

where \(K\) denotes the set of instances that reach the node; \(K_{i}\) denotes the subset of instances that have the ith outcome of the potential test; and \({\text{sd}}\) denotes the standard deviation.
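Eq. (3) can be illustrated with a small sketch that computes the SDR of a candidate split and searches one attribute for its best threshold. The function names and toy values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not specify which form is used.

```python
from statistics import pstdev


def sdr(parent, subsets):
    """Standard deviation reduction of Eq. (3) for one candidate split."""
    n = len(parent)
    return pstdev(parent) - sum(len(s) / n * pstdev(s) for s in subsets)


def best_threshold(attribute, targets):
    """Return (threshold, SDR) of the binary split on one attribute with the largest SDR."""
    pairs = sorted(zip(attribute, targets))
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal attribute values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [t for a, t in pairs if a <= threshold]
        right = [t for a, t in pairs if a > threshold]
        gain = sdr(targets, [left, right])
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Toy values: an attribute (e.g. sand %) and a target that jumps between two levels.
sand = [10, 20, 30, 60, 70, 80]
k_values = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
print(best_threshold(sand, k_values))  # (45.0, 2.0)
```

In the toy example the split at 45 separates the two target levels exactly, so both subsets have zero standard deviation and the reduction equals the standard deviation of the parent node.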

Random forest regression (RF)

The RF approach was introduced by Breiman (2001). It is an ensemble machine learning method that builds several decision trees: for classification, the output is the class that is the mode of the classes predicted by the individual trees, whereas for regression the output is the average of the individual tree predictions. The number of trees to be grown (\(k\)) in the forest and the number of features or variables chosen (\(m\)) at every node to develop a tree are the two standard user-defined parameters required for random forest regression (Breiman 2001). In this study, the RF model is applied to predict the unsaturated hydraulic conductivity of soil (\(K\)).
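The roles of \(k\) and \(m\) can be made concrete with a deliberately small, didactic sketch of random forest regression. It is not the implementation used in the study: all function names, the depth and leaf-size limits and the synthetic data are assumptions chosen only to keep the example short.

```python
import random


def avg(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean."""
    m = avg(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(X, y, m, rng, min_leaf=2, depth=0, max_depth=6):
    """Grow one regression tree, testing m randomly chosen features at each node."""
    if len(y) < 2 * min_leaf or depth >= max_depth or len(set(y)) == 1:
        return avg(y)  # leaf: mean of the targets that reach it
    best = None
    for j in rng.sample(range(len(X[0])), m):
        for t in sorted(set(row[j] for row in X))[:-1]:
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        return avg(y)
    _, j, t, left, right = best
    grow = lambda idx: build_tree([X[i] for i in idx], [y[i] for i in idx],
                                  m, rng, min_leaf, depth + 1, max_depth)
    return (j, t, grow(left), grow(right))


def predict_tree(node, row):
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if row[j] <= t else right
    return node


def fit_forest(X, y, k=25, m=2, seed=0):
    """Grow k trees, each on a bootstrap sample of the training data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(k):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], m, rng))
    return forest


def predict_forest(forest, row):
    """Regression output: the average of the individual tree predictions."""
    return avg([predict_tree(tree, row) for tree in forest])


# Synthetic data (not the field data): three inputs, target depends on two of them.
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(3)] for _ in range(40)]
y = [3.0 * a + b for a, b, _ in X]
forest = fit_forest(X, y, k=25, m=2)
predictions = [predict_forest(forest, row) for row in X]
```

Each tree sees a different bootstrap sample and a different random subset of features at each node, so the averaged prediction is smoother than that of any single tree; in practice a library implementation would normally be preferred over a hand-written one.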

Implementation of machine learning methods

Three standard statistical measures, the coefficient of determination (\(R^{2}\)), root mean square error (\({\text{RMSE}}\)) and mean absolute error (\({\text{MAE}}\)), were used as performance evaluation parameters to judge the performance of the machine learning methods. A large number of models were developed by changing the user-defined parameters of the modelling approaches with the training data set, and the efficiency of the developed models was validated by analysing their performance on the testing data set. The modelling procedure thus involves three steps: changing the values of the user-defined parameters associated with the modelling method, building the model on the training data and validating the resulting models on the testing data. The identification of model-specific user-defined parameters and the selection of the models were based on statistical criteria. Higher values of \(R^{2}\) and lower values of \({\text{RMSE}}\) and \({\text{MAE}}\) indicate better estimation accuracy of the models. For random forest regression, the two user-defined parameters are the number of trees to be grown (\(k\)) in the forest and the number of features or variables selected (\(m\)) at each node to generate a tree. For the M5 tree model, calibration was done by changing the number of instances allowed at each node. The selected primary parameters of the modelling approaches are presented in Table 3.
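The three performance measures can be sketched as below. The function names and toy values are illustrative assumptions; in particular, \(R^{2}\) is computed here as one minus the ratio of residual to total sum of squares, whereas some studies report the squared correlation coefficient instead, and the source does not say which definition was used.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean square error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy values, not field data.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
print(rmse(actual, predicted))       # 0.5
print(mae(actual, predicted))        # 0.25
print(r_squared(actual, predicted))  # 0.8
```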

Table 3 Primary parameters

Results and discussion

The efficiency of the modelling methods in predicting the hydraulic conductivity of soil in the field is tested by developing the regression models on the training data and checking their accuracy on the unseen testing data. The inputs selected for estimating the hydraulic conductivity are sand (%), clay (%), silt (%), bulk density and moisture content. The performance of multiple nonlinear regression (MNLR) is evaluated by generating a simple multivariate relationship (Eq. 2) based on the nonlinear regression function (Eq. 1) applied to the training data set. To check the potential of the nonlinear relationship (Eq. 2), the equation is applied to the testing data set, and the outcomes are depicted in Fig. 4 as a scatter plot of actual versus predicted hydraulic conductivity. Closeness of the data to the perfect agreement line represents accuracy of the model in estimating the actual field data. In this case, however, the excessive scatter of the data points around the agreement line reveals poor performance of the MNLR model in approximating the actual field hydraulic conductivity, and hence a lack of generalization. The statistical measures observed with the testing data confirm the lower accuracy of the MNLR modelling technique, as the error values (\({\text{RMSE}}\) and \({\text{MAE}}\)) are higher and the coefficient of determination (\(R^{2}\)) is lower (Table 4). A single direct relationship is therefore not sufficient to relate the hydraulic conductivity precisely to the soil input parameters used in the current study, which leads to inferior performance by the MNLR model.

Fig. 4 Actual versus predicted values of hydraulic conductivity using MNLR model (testing data)

Table 4 Statistical performance measures for modelling approaches

In an attempt to approximate the actual field data of hydraulic conductivity of soil, machine learning methods are adopted to improve the generalization capacity. The M5 model tree algorithm, which uses linear regression models to define the input–output relationship by splitting the parameter space of the data set into several subspaces, was used. Two M5 tree models, pruned and unpruned, were developed by changing the number of instances used at the leaf node. The values of the user-defined parameter (instances used) were selected by applying the M5 model tree method to the training data and judging the performance on the testing data (Table 3). Comparing the results of the pruned and unpruned tree models on the testing data set, the statistical measures indicate lower values of \({\text{RMSE}}\) (0.0000699) and \({\text{MAE}}\) (0.0000488) for the unpruned M5 tree model than for the pruned tree model (\({\text{RMSE}}\) = 0.0000898, \({\text{MAE}}\) = 0.0000633). The higher value of \(R^{2}\) observed with the unpruned model implies closer prediction of the actual data, and the scatter plot (Fig. 5) shows that the estimated points of the unpruned model lie closer to the agreement line than those of the pruned model tree. Based on these results, the unpruned model shows better learning capability than the pruned model, as its estimation accuracy is higher.

Fig. 5 Actual versus predicted values of hydraulic conductivity using M5 model tree (testing data)

The random forest model is developed by carrying out trials with the training data set, changing the number of features used at each node to generate a tree and the number of trees; finally, the performance of the calibrated model is tested on the testing data set. The model was selected for its generalization ability, judged by the forecasting accuracy on the testing data in terms of the lowest \({\text{RMSE}}\) and \({\text{MAE}}\) values. The performance of RF regression is presented in Fig. 6 as a comparison of actual and predicted values of hydraulic conductivity. The plot shows that the data points lie relatively close to the perfect agreement line. The RF model generated lower values of \({\text{RMSE}}\) (0.0000491) and \({\text{MAE}}\) (0.0000396) than the other tested regression models (Table 4), which indicates the superior potential of the RF model in accurately relating the hydraulic conductivity of the field data to the soil properties.

Fig. 6 Actual versus predicted values of hydraulic conductivity using RF model (testing data)

Comparative analysis of the regression models

The efficacy of MNLR, M5 tree and RF regression in estimating the hydraulic conductivity of the field data is tested and presented as a combined graph showing all the applied regression models (Fig. 7). To study the scatter around the perfect agreement line, the graph of actual versus predicted values includes error lines at ± 30%. From Fig. 7, it is clear that the predictions of the random forest (RF) model lie well within the ± 30% error range except for some smaller values; that is, the model predicts the actual data to within ± 30%. The lower values of \({\text{RMSE}}\) and \({\text{MAE}}\) obtained with the RF model confirm this (Table 4). The scatter of the MNLR model around the perfect agreement line is larger (except for some larger values) than that of all the other models, indicating inferior performance in estimation and generalization. Both M5 tree models overpredict the smaller values of hydraulic conductivity, falling outside the + 30% error line, but underpredict the larger values, lying near the − 30% error line. The scatter of the M5_unpruned model is relatively smaller than that of the M5_pruned model, indicating better performance by the unpruned M5 tree model. So, based on statistical measures and error plots, the performance of the RF model is found to be superior to that of the M5 model tree and the nonlinear regression model.

Fig. 7 Comparison of soft computing models in estimating the testing data
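The ± 30% error lines used in the comparison can be expressed as a simple check on each actual/predicted pair. The sketch below is illustrative only: the function names and the toy conductivity values are assumptions and are not taken from the testing data.

```python
def within_error_band(actual, predicted, tolerance=0.30):
    """True if the prediction lies between the -30% and +30% lines around the actual value."""
    return abs(predicted - actual) <= tolerance * abs(actual)


def fraction_within_band(actuals, predictions, tolerance=0.30):
    """Share of predictions that fall inside the error band."""
    flags = [within_error_band(a, p, tolerance)
             for a, p in zip(actuals, predictions)]
    return sum(flags) / len(flags)


# Toy values with the order of magnitude of the reported errors (not field data).
actual_k = [1.0e-4, 2.0e-4, 4.0e-4, 5.0e-4]
predicted_k = [1.2e-4, 2.9e-4, 3.0e-4, 5.0e-4]
print(fraction_within_band(actual_k, predicted_k))  # 0.75
```

Here the second prediction overshoots its actual value by 45% and therefore falls outside the + 30% line, while the other three lie inside the band.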

To analyse the variation of the implemented modelling techniques relative to the actual experimental field data, a graph of hydraulic conductivity against observation number is presented (Fig. 8). This figure shows that the RF-based model follows the same path as the actual observed hydraulic conductivity values, so the RF model is more suitable for estimating the hydraulic conductivity of soil than the other models discussed above. The deviation of the predicted points from the actual points is the highest for the M5_pruned model among all the models.

Fig. 8 Scattering of actual and soft computing model-based predictions against observation number

As shown in Fig. 9, the RF model significantly reduces the overall residual errors due to accurate predictions by the model. Other regression models have larger residuals than RF model, thus indicating low efficiency of the models in accurate estimations of the field data.

Fig. 9 Residual error versus observation number using soft computing models

A single-factor ANOVA test was used to assess whether the difference between the values predicted by the machine learning approaches and the actual values is statistically significant. The results show that the F-value was less than the F-critical value and the P value was greater than 0.05 for all the soft computing models, which indicates that the difference between predicted and actual values was insignificant (Table 5).

Table 5 Result of ANOVA single-factor test
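The F-value of a single-factor ANOVA can be computed as in the sketch below, which treats the actual and predicted values as two groups. The function name and the toy numbers are assumptions; the F-critical value and the P value would be taken from an F-distribution table or a statistics library and are not computed here.

```python
def one_way_anova_f(*groups):
    """Return (F, df_between, df_within) for a single-factor ANOVA."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g)
                    for g, m in zip(groups, means))
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Toy groups standing in for actual and predicted values (not field data).
actual_values = [1.0, 2.0, 3.0]
predicted_values = [2.0, 3.0, 4.0]
print(one_way_anova_f(actual_values, predicted_values))  # (1.5, 1, 4)
```

An F-value below the critical value for the given degrees of freedom means that the null hypothesis of equal group means is not rejected, which is the sense in which the difference is called insignificant above.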

Conclusions

Machine learning methods were employed for accurate and reliable prediction of the hydraulic conductivity of soil. Twenty different locations in the district of Kurukshetra, Haryana (India), were selected for experimental data collection on a monthly basis over a period of 1 year. A mini disc infiltrometer was used to determine the hydraulic conductivity in the field. The compiled field data of hydraulic conductivity, together with the soil physical properties sand (%), clay (%), silt (%), bulk density and moisture content as input parameters, were used for modelling after a random division of the total data into two parts (training and testing). The modelling techniques employed in this study were multiple nonlinear regression, the M5 model tree and random forest (RF) regression. Based on the validation results of the developed regression models on the testing data set, RF regression was found to predict the hydraulic conductivity of the field data more accurately than the M5 model tree and the relationship developed on the basis of multiple nonlinear regression. The performance of the unpruned M5 tree model was found to be superior to that of both the pruned M5 tree and the multiple nonlinear regression models. The modelling results based on standard statistical measures indicated that the RF model, owing to its higher predictive efficiency in model development and validation, has higher generalization capability and can thus be applied for accurate estimation of the field hydraulic conductivity of soil from basic soil properties.