1 Introduction

Solutions to many geotechnical and geo-environmental engineering problems require a thorough understanding of soil hydraulic conductivity. According to Murthy [1], hydraulic conductivity is a measure of the ease with which water flows through permeable materials; it indicates the soil's ability to convey water when exposed to a hydraulic gradient. This soil parameter plays a key role in solving problems relating to leachate transport in landfill design and to earth dam design, as it dictates, among other important parameters, the selection of suitable material for the liner system and for the dam core, respectively [2,3,4]. Like many geotechnical parameters, hydraulic conductivity is simple in concept but complex in practice, especially when realistic measurements are required. Approaches taken to estimate the hydraulic conductivity of soils include laboratory and field measurement and calculation from empirical formulae. Laboratory and field measuring methods include the constant head test, falling head test, flexible wall permeameter test, rigid wall permeameter test, ring infiltrometer, instant profile method, test basins [5, 6], etc. Meanwhile, the empirical equations currently in use correlate the hydraulic conductivity of fine-grained soils with their index properties; according to Freeze and Cherry [7], hydraulic conductivity is related to the grain-size distribution of granular porous media. The major advantage of the available empirical methods [8, 9] is that they estimate hydraulic conductivity more rapidly than direct measurement. These relationships, however, may be applied incorrectly and may lead to random errors [5]. Field test methods have the advantage that the soil profile is often undisturbed, but, unlike laboratory test methods, they offer no control over the soil environment [10]. 
Although laboratory methods are relatively easy, their common drawback is that they take time. Consequently, obtaining a quick answer to a field problem within a given time frame can be challenging. As a result, attention has turned to determining the hydraulic conductivity of soils using predictive techniques [3].

Over the last few decades, computational intelligence (CI) methods, otherwise known as soft computing, have been applied in various fields of science and engineering. Complex tasks such as learning, modelling or extracting patterns from experimental data can be handled accurately with CI methods [11]. A CI technique can be either single (such as artificial neural networks (ANNs), fuzzy logic, support vector machine (SVM), particle swarm optimisation (PSO) and genetic algorithm (GA)) or hybrid (e.g. adaptive neuro-fuzzy inference system (ANFIS), GA-ANN, etc.). Single CI techniques are employed mainly for predicting, modelling or exploring data [11, 12]. ANN, as a subset of CI, has been the most popular and significant tool in various engineering fields [13,14,15]. Its strong ability to predict nonlinear behaviour has attracted many researchers [16]. ANN is inspired by how the human brain works: a large number of highly interconnected neurons work together to solve a specific problem. Just as the human brain processes huge amounts of information from data sent by the senses, ANNs learn by training [17,18,19]. A neural network's basic processing elements are called nodes, and the weighted connections between them play the same role as synapses in biological systems. Nodes are simple information-processing elements, the connection weights modulate the effect of the related input signals, and a transfer function represents the nonlinear characteristic displayed by neurons. The output of a neuron is then computed as the weighted sum of its inputs plus a bias, passed through the transfer function [17, 20]. By contrast, traditional linear regression is one of the oldest statistical methods and still maintains its relevance in the academic world, especially as a benchmark for measuring the performance of newly developed predictive tools. 
Multiple linear regression (MLR) examines the relationship between a response variable and a set of independent variables. It is a generalisation of the simple linear regression model [21]. The assumption in regression modelling is that the output can be explained by a linear combination of the input values.

In recent years, the use of ANNs has increased in several areas of civil engineering, and their application to many geotechnical and geo-environmental engineering problems has shown a commendable degree of success. Because an ANN can be trained directly with experimental data, it is regarded as superior to many popular modelling tools [12]. ANNs have been used successfully in the prediction of pile capacity, soil behaviour modelling, soil retaining structures, settlement of structures, slope stability, design of tunnels and underground openings, liquefaction, soil compaction, soil swelling, soil chemical properties such as cation exchange capacity (CEC), and classification of soils [22,23,24,25,26,27,28,29]. In addition, ANN and MLR have been used together as tools for the prediction of geotechnical properties [24,25,26] and of cation exchange capacity [30, 31], while ANNs have been used to predict the hydraulic conductivity of tropical soils using eight different algorithms [10] and the hydraulic conductivity of clays [3, 18, 32]. Minasny et al. [33] used neural networks to predict the unsaturated hydraulic conductivity of alluvial soils.

Previous predictions of soil hydraulic conductivity using ANN and MLR were each based on a specific soil. To the best of our knowledge, there are few or no recommendations in the literature on applying ANN and MLR to predict the hydraulic conductivity of all soil types. Therefore, the specific aim of this study is to develop models for the prediction of the saturated hydraulic conductivity of soils (fine grained and coarse grained) through a comparative study using artificial neural network and multiple linear regression analysis. The models were developed by widening the spectrum of test soils used by Sinha and Wang [20]: results from various reliable experimental studies on the hydraulic conductivity of naturally occurring soil types published in the literature were added, and different input variables and training parameters were tried to obtain an optimised result. The input data used are the percentages of sand (S), fines (Fi) and clay (C) of the soil samples, the plasticity index (PI), and the compaction characteristics. The developed models were compared with selected existing MLR models and networks to evaluate their reliability.

2 Methodology

2.1 Data collation and analysis

The reliability of the data set used is the most significant factor that can influence ANN modelling, particularly in geotechnical and geo-environmental engineering. In addition, the efficiency of an ANN relies on the range of data selected. More complicated problems need more examples that show all the distinct features of the problem [17, 27]. In this study, data were collated from several experimental studies on the hydraulic conductivity of different soil types published worldwide in the literature. Factors affecting the hydraulic conductivity of soils include soil density, moulding water content, degree of saturation, void ratio, soil composition, soil structure, permeant properties and others. Most of these factors are not truly independent but are interrelated in complex ways, for example, grain size and void ratio [20, 34]. The smaller the grain size, the smaller the voids, which reduces the size of the flow channels [1]. Hence, low hydraulic conductivity is likely to be achieved when the soil is well graded and the clay fraction governs the hydraulic behaviour of the soil matrix. As stated by Lambe [35], five factors have the greatest influence on hydraulic conductivity: soil composition, soil structure, permeant characteristics, void ratio and degree of saturation. Considering these factors, the experimental studies selected provide data on the particle size distribution, namely percentages of sand (S), fines (Fi) and clay (C) of the soil sample, the plasticity index (PI), the compaction properties (optimum moisture content, OMC, and maximum dry density, MDD) and the corresponding hydraulic conductivity, k (i.e. seven variables). The six input variables are easily obtainable factors that influence the hydraulic conductivity of most soils. The data set was divided into two parts: 75% for training and 25% for testing. 
To avoid overfitting, the training set was chosen so that each soil class was properly represented and the samples in each class covered a wide range of variation. The data set was analysed using R software to develop an MLR model and a neural network for hydraulic conductivity prediction.
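
The 75/25 division with every soil class represented in the training set can be sketched as a stratified split. The authors worked in R and do not describe their exact procedure, so the following Python sketch is only an illustration: the function name `stratified_split`, the record layout and the four soil classes are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(records, label_of, train_fraction=0.75, seed=0):
    """Split records so that each class contributes about the same share to training."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[label_of(rec)].append(rec)

    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)                       # random draw within the class
        cut = round(train_fraction * len(members))
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Hypothetical data: 144 records spread evenly over four made-up soil classes.
classes = ("sand", "silt", "clay", "bentonite")
records = [{"id": i, "soil_class": classes[i % 4]} for i in range(144)]
train, test = stratified_split(records, lambda r: r["soil_class"])
print(len(train), len(test))
```

With equal class sizes the split reproduces the 108/36 partition reported later in the paper; with unequal classes the rounding per class may shift the totals by a few records.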

2.2 R software

R is a programming language and environment for carrying out simple and complicated statistical analyses. R is free software and comes with absolutely no warranty. It was originally written by Robert Gentleman and Ross Ihaka of the Statistics Department of the University of Auckland in New Zealand. It is a collaborative effort with many contributors, and since mid-1997 there has been a core team with write access to the R source [36]. R has a number of benefits for scholars. It is open source, and using R gives access to a global community that is continuously creating new R packages and teaching resources. Diverse packages for machine learning techniques, especially neural networks (e.g. nnet, neuralnet, etc.), are available in R. Its language is easier to learn than that of some proprietary software, and it offers rich options for statistics. Different user-friendly interfaces for executing R commands (e.g. RStudio) are available that are free and simple to install. Although MATLAB provides good options to create, train, validate and test neural networks, it appears to offer fewer options on Windows and requires a licence. For this research, RStudio version 1.1.463 was used along with R version 3.5.2. R is available for Mac, Windows and Linux operating systems and can be obtained via www.r-project.org; RStudio is accessible at www.rstudio.com.

2.3 Multiple linear regression model (MLR)

Regression modelling aims to use a number of independent measurements to determine a mathematical function that describes the relationship between the input parameters and the output. In engineering and science, many problems revolve around the relationship between two or more variables. Unlike simple linear regression, MLR is a linear regression technique for finding the best relationship between a response variable and several independent variables [31, 37]. One of the assumptions in multiple linear regression is the absence of collinearity between the independent variables. The variance inflation factor (VIF) is an index used to detect collinearity. If there is no linear relationship between the independent variables, the VIF value is 1, and deviation from 1 reveals a tendency to collinearity. A VIF value above 10 for a variable indicates multicollinearity, which may cause estimation problems [38].

The multiple linear regression model was developed using 75% of the data set (the training set), and the remaining 25% was used to evaluate the efficiency of the developed model. The MLR model for hydraulic conductivity prediction was fitted using the ‘lm’ function in R. Equation (1) shows the general form of the MLR equation:

$$Y = a + b_{1} X_{1} + b_{2} X_{2} + ... + b_{i} X_{i}$$
(1)

where \(Y\) is the response variable representing hydraulic conductivity k,

a is the intercept,

\(b_{1} ...b_{i}\) are regression coefficients, and

\(X_{1} ...X_{i}\) are independent variables referring to basic soil properties (i.e. the input data).

At first, all the input variables were used in developing the MLR model; subsequently, the variables that had significantly less effect on the hydraulic conductivity (the output parameter) were eliminated.
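
To make Eq. (1) concrete, the sketch below fits the intercept and coefficients by ordinary least squares through the normal equations. It is a plain-Python illustration under stated assumptions, not R's ‘lm’ implementation (which uses a QR decomposition); the function names `fit_mlr` and `predict_mlr` and the toy data are hypothetical.

```python
def fit_mlr(X, y):
    """Return [a, b1, ..., bp] minimising the squared error of Y = a + sum(b_i * X_i)."""
    rows = [[1.0] + [float(v) for v in r] for r in X]   # prepend the intercept column
    p = len(rows[0])
    # Augmented matrix of the normal equations (A^T A) beta = A^T y.
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(M[k][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; the system is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for k in range(col + 1, p):
            factor = M[k][col] / M[col][col]
            for j in range(col, p + 1):
                M[k][j] -= factor * M[col][j]
    # Back substitution.
    beta = [0.0] * p
    for i in reversed(range(p)):
        s = sum(M[i][j] * beta[j] for j in range(i + 1, p))
        beta[i] = (M[i][p] - s) / M[i][i]
    return beta


def predict_mlr(beta, x):
    """Evaluate Eq. (1) for one observation x = [X1, ..., Xp]."""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))


# Toy data generated exactly from Y = 2 + 0.5*X1 - 1.5*X2 (made-up values).
X_toy = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]]
y_toy = [2 + 0.5 * x1 - 1.5 * x2 for x1, x2 in X_toy]
beta_toy = fit_mlr(X_toy, y_toy)
```

The singularity check is the numerical counterpart of the collinearity assumption: two perfectly collinear predictors make the normal equations unsolvable, and strongly correlated ones make the coefficients unstable.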

2.4 Artificial neural network architecture

Multilayer perceptron (MLP) and radial basis function (RBF) are two of the most widely used neural network architectures in the literature for classification and regression problems. The general difference between them is that RBF is a localist type of learning, responsive only to a limited section of the input space [39]. MLP, being the predominant network architecture and simple in design, was used for this study [18, 31, 40]. Figure 1 shows a typical two-layer perceptron that simulates an input–output response. The network consists of nodes structured into input, hidden and output layers. The input layer is not regarded as a neuron layer because it does not process any signal. Every node is linked to all nodes in the adjacent layers. The training set, which is made up of 108 observations (i.e. 75% of the data set), was used to develop the ANN model for predicting hydraulic conductivity using the R neuralnet package. The six input variables employed for the neural network training are S, Fi, C, PI, OMC and MDD of the soil. The input layer therefore has six neurons. The only output is the hydraulic conductivity, so the output layer has one neuron. Since there are no general guidelines for defining the number of neurons in each hidden layer, Bahmed et al. [27] suggested using a simple architecture of one hidden layer with a limited number of neurons to save time in the training stage. The rule of thumb in deciding the number of hidden layers is to start the training process with one hidden layer and, if the network does not train well, to increase the number of hidden neurons before considering adding more hidden layers [41]. The choice of the number of hidden neurons depends on the complexity of the problem. In this study, network training was started with a number of hidden neurons equal to half the number of input variables. 
This number was then increased for as long as the training error remained above the training error tolerance. After several training runs with different numbers of neurons in the hidden layer, the single-hidden-layer network with the number of neurons that produced the least error was selected.

Fig. 1 Two-layer perceptron network architecture [42]

The neuralnet package presents the training set to the ANN and modifies the weights to minimise the error between the actual and desired outputs. Within the network, each neuron's output is the weighted sum of its inputs plus the bias, passed through the transfer function [20]. Lim and Kolay [10] observed that the backpropagation (BP) training algorithm yields the best prediction model for the hydraulic conductivity of tropical soils compared with other learning algorithms such as the Levenberg–Marquardt algorithm, scaled conjugate gradient, BFGS quasi-Newton, conjugate gradient with Powell/Beale restarts, Fletcher–Powell conjugate gradient, and one-step secant. As a result, a feed-forward neural network with the backpropagation training algorithm was used to develop the ANN model for this study. The goal of BP training is to iteratively change the connection weights between the neurons in a direction that minimises the error. The algorithm adjusts the connection weights using a sample-by-sample updating rule. In one iteration, a training sample is presented to the network and the signal is fed forward through the network until the network output is obtained. The error between the actual and desired network outputs is then calculated and used to adjust the connection weights [41]. After the training process was complete, a new set of data, the testing data, was presented to the network to validate the trained network.
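
The forward pass and the sample-by-sample weight update described above can be sketched in a few lines of plain Python. This is a minimal illustration of online backpropagation, not the training routine of the R neuralnet package; the class name `TinyMLP`, the logistic transfer function, the learning rate, the six-input 6-3-1 layout and the synthetic target are all assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """One hidden layer of logistic neurons and a single linear output neuron."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each hidden row holds n_in input weights followed by one bias term.
        self.w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                    for _ in range(n_hidden)]
        # Output weights: one per hidden neuron, followed by one bias term.
        self.w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        # Neuron output = transfer function(weighted sum of inputs + bias).
        hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + row[-1])
                  for row in self.w_h]
        out = sum(w * h for w, h in zip(self.w_o, hidden)) + self.w_o[-1]
        return hidden, out

    def train_sample(self, x, target, lr=0.1):
        """One backpropagation step on a single training sample."""
        hidden, out = self.forward(x)
        err = out - target                       # gradient of 0.5 * err**2 w.r.t. out
        for j, h in enumerate(hidden):
            # Error propagated back to hidden neuron j (uses the pre-update weight).
            delta_h = err * self.w_o[j] * h * (1.0 - h)
            self.w_o[j] -= lr * err * h
            for i, xi in enumerate(x):
                self.w_h[j][i] -= lr * delta_h * xi
            self.w_h[j][-1] -= lr * delta_h
        self.w_o[-1] -= lr * err


def mse(net, data):
    return sum((net.forward(x)[1] - t) ** 2 for x, t in data) / len(data)


# Synthetic, already normalised data: six inputs and an arbitrary made-up target.
rng = random.Random(1)
data = []
for _ in range(40):
    x = [rng.random() for _ in range(6)]
    data.append((x, x[0] * x[1] + 0.5 * x[5]))

net = TinyMLP(n_in=6, n_hidden=3)
mse_before = mse(net, data)
for epoch in range(200):
    for x, t in data:                            # sample-by-sample updating rule
        net.train_sample(x, t)
mse_after = mse(net, data)
print(mse_before, mse_after)
```

Three hidden neurons mirror the starting point described in Section 2.4 (half the number of inputs); in practice the count would be increased and the run repeated until the error stops improving.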

2.5 Performance evaluation

The following statistical indices were used to assess the predictive ability of the developed ANN and MLR models: mean squared error (MSE), root mean squared error (RMSE), multiple coefficient of determination (\(R^{2}\)) and mean absolute error (MAE). During ANN training, the network error is minimised iteratively by altering the weights, as mentioned earlier, and the number of hidden layer neurons.

The mean squared error (MSE) indicates the error obtained during training; it measures the average squared difference between the observed response value and its prediction. MSE is calculated using Eq. (2):

$${\text{MSE}} = \frac{1}{N}\sum\limits_{n = 1}^{N} {\left( {{\text{actual}} - {\text{predicted}}} \right)}^{2}$$
(2)

where N is the overall number of data.

The root mean squared error (RMSE) is calculated between the measured values and the predicted values using Eq. (3):

$${\text{RMSE}} = \sqrt {\frac{1}{N}\sum\limits_{n = 1}^{N} {\left( {{\text{actual}} - {\text{predicted}}} \right)^{2} } }$$
(3)

The coefficient of determination, \(R^{2}\), defined by Eq. (4), expresses the proportion of the total variation in the response variable that is explained by the independent variables. The smaller the difference between the actual and forecast values, the higher the value of the determination coefficient. The value of \(R^{2}\) lies between 0 and 1: a value near 1 indicates a good fit, and a value near 0 indicates a poor fit.

$$R^{2} = 1 - \frac{{{\text{SSE}}}}{{{\text{SSa}}}}$$
(4)

where \({\text{SSE}} = \sum {\left( {a - \hat{a}} \right)}^{2}\); \({\text{SSa}} = \sum {\left( {a - \overline{a}} \right)}^{2}\),

\(a\) is the true value; \(\hat{a}\) is the forecast value of \(a\), and

the mean value of the \(a\) values is \(\overline{a}\).

The mean absolute error (MAE) is the average of the absolute differences between the actual and predicted values.

The correlation coefficient (r) describes the strength of the linear relationship between the predicted and actual values of the response variable and ranges from −1 to +1.
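
The indices of this section can be computed together as follows. This is a plain-Python sketch written for illustration (the authors computed their indices in R); the function name `regression_metrics` and the four sample values are hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE, R2 (Eq. 4) and Pearson r for paired observations."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n

    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    ss_a = sum((a - mean_a) ** 2 for a in actual)       # SSa in Eq. (4)
    ss_p = sum((p - mean_p) ** 2 for p in predicted)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))

    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": mae,
        "R2": 1.0 - (n * mse) / ss_a,                   # 1 - SSE / SSa
        "r": cov / math.sqrt(ss_a * ss_p),
    }


# Made-up values, used only to exercise the formulas.
m = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(m)
```

Note that \(R^{2}\) computed from Eq. (4) on unseen test data can fall below zero when the predictions are worse than simply using the mean of the observed values.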

3 Results and discussion

3.1 Data collated

The data set collated comprises 144 observations from different regions. Table 1 summarises the data set, of which 108 observations (75% of the data set), representing the characteristics of all the soil classes present, were used as the training set for both the ANN and MLR models for hydraulic conductivity prediction. The descriptive statistics for each variable in the collated data set are shown in Table 2. The PI ranged between 0.00 and 480 with a mean value of 83.64, showing that the soils collated ranged from pure sands to extremely swelling clay (sodium bentonite).

Table 1 Data set sources

Table 2 Descriptive statistics of data set collated

3.2 Data cleaning and multicollinearity analysis

Sanity checks were carried out on the data set to ensure that no variable values fall outside the expected boundaries (Table 2). This also included checking each column for missing values, so that they could be handled properly, and ensuring that there were sufficient observations for analysis. R handles missing values with the annotation ‘NA’, which indicates their existence [43]. The sanity checks revealed no missing data, and the values of each parameter used are within the expected range.

Correlation assessment was conducted to explore the likelihood and degree of multicollinearity between each independent variable and all other variables. The correlation analysis was computed in R using the ‘cor()’ command. From Table 3, the independent variables satisfied the conditions for the absence of multicollinearity except for the sand–fines pair (r = 0.969) and the OMC–MDD pair (r = 0.907). The correlation between independent variables is expected to be less than 0.8 in order to avoid multicollinearity problems in the predictive model. The strong relationship within these two pairs suggests that only one variable from each pair is needed in the regression analysis. It can also be observed that the chosen independent variables correlate only weakly with the response variable, k. Another measure of multicollinearity that R offers is the variance inflation factor (VIF), which is computed through the ‘car’ library. A VIF value below 5 and close to 1 indicates that there is no linear relationship among the input variables, whereas a value above 10 indicates that the variable is redundant and can be removed from the model [38].
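
The link between the reported correlations and the VIF thresholds can be illustrated with the definition \({\text{VIF}}_{j} = 1/(1 - R_{j}^{2})\), where \(R_{j}^{2}\) is obtained by regressing predictor j on the remaining predictors. The Python sketch below makes a simplifying assumption: it treats each correlated pair in isolation, in which case \(R_{j}^{2}\) equals the squared Pearson correlation. The VIF values of the full six-predictor model would differ and are not given in this section; the function name `vif_from_r2` is hypothetical.

```python
def vif_from_r2(r_squared):
    """VIF_j = 1 / (1 - R_j^2); R_j^2 comes from regressing predictor j on the others."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


# Two-predictor approximation using the pairwise correlations quoted from Table 3.
vif_sand_fines = vif_from_r2(0.969 ** 2)   # sand-fines pair
vif_omc_mdd = vif_from_r2(0.907 ** 2)      # OMC-MDD pair
print(round(vif_sand_fines, 2), round(vif_omc_mdd, 2))
```

Under this pairwise approximation the sand–fines pair gives a VIF of roughly 16.4, above the threshold of 10, and the OMC–MDD pair gives roughly 5.6, above 5, which is consistent with keeping only one variable from each pair.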

Table 3 Pearson correlation coefficient

3.3 Multiple linear regression model

3.3.1 MLR model training

The MLR model for hydraulic conductivity prediction was trained using 75% of the data, as mentioned earlier, with six input variables, namely PI, S, Fi, C, OMC and MDD of the soils. This was executed in R using the ‘lm()’ function. Table 4, which was generated by R, gives the summary of the developed MLR model. The residuals in Table 4 are the differences between the experimental values and the predicted values. A positive residual indicates that the model predicted a value lower than the observed value, and a negative residual indicates that it predicted a value higher than the observed value. In Table 4, ‘Min’ and ‘Max’ denote the minimum and maximum residuals, ‘Median’ denotes the median residual, and ‘1Q’ and ‘3Q’ mark the first and third quartiles of the residuals. For a good model, the median of the residuals is expected to be close to zero, the minimum and maximum values should be of almost the same magnitude, and the first and third quartile values should likewise be approximately the same [43]. The residuals in Table 4 deviate slightly from these conditions for the present model.

Table 4 Summary table of parameter estimates, residuals, and standard errors for the linear model of hydraulic conductivity fitted with six predictors

A section of Table 4 shows the coefficient of each of the input variables. Hence, the developed regression equation is as shown in Eq. (5):

$$\begin{gathered} k = \left( {3.300E - 05} \right) + \left( { - 1.1416E - 08} \right){\text{PI}} + \left( {1.413E - 07} \right)S + \left( {3.092E - 08} \right)Fi + \, \left( {5.189E - 10} \right)C + \hfill \\ \left( { - 1.909E - 08} \right){\text{MDD}} + \left( { - 2.591E - 07} \right){\text{OMC}} \hfill \\ \end{gathered}$$
(5)

The column named ‘Std. Error’ in Table 4 displays the standard error of each coefficient. For a reliable model, the standard error is expected to be at least five to ten times smaller than the corresponding coefficient [43]. For this model, the standard errors were nearly as large as, or larger than, the corresponding coefficients of the input variables. The column marked \(\Pr \left( { > |t|} \right)\) gives the p value of each coefficient. The p value is the probability of obtaining an estimate at least this extreme if the true coefficient were zero; it ranges from zero to unity, and the smaller it is, the stronger the evidence that the variable is needed in the model. Subtracting the p value from one gives the corresponding confidence level. In science generally, results yielding a p value less than or equal to 0.05 are considered statistically significant, and highly significant if the p value is less than or equal to 0.001. PI and MDD were the only variables that approached statistical significance, with p values of 0.0529 and 0.0935, respectively (i.e. 94.71% and 90.65% confidence, respectively); both exceed the 0.05 threshold but fall below 0.10. The other inputs were not statistically strong enough to establish a significant model by MLR.

The MLR model was retrained using the backward elimination approach explained by Lilja [43] to determine which predictors should be retained in the model and which discarded. The ‘summary()’ function in R reports the p value of each input variable used in the model. As stated earlier, the variable with the largest p value is the least significant statistically; a threshold p value of 0.05 was set, below which an input variable is considered significant at the 95% confidence level. A predictor whose p value exceeds this threshold is removed from the model, and the model is re-computed. With this backward elimination process, regression Eq. (6) was developed with three predictors. The model outcomes shown in Tables 4 and 5 indicate that Eq. (6) is more reliable than Eq. (5). The three variables, namely PI, percent sand (S) and MDD, prove to be statistically significant, with p values below 0.05 and VIF values close to 1.

Table 5 Summary table of parameter estimates, residuals, and standard errors for the linear model of hydraulic conductivity fitted with three predictors

3.3.2 MLR model validation

Having obtained regression Eq. (6), the equation was applied to the test data to predict their hydraulic conductivity. Figure 2 shows the scatter plot of observed against predicted values of k for the test data. The performance of the developed MLR models was evaluated using the values of MSE, RMSE, MAE and determination coefficient \(R^{2}\) between the observed and predicted values, as presented in Tables 6 and 7. As noted from Table 7 for the test samples, the coefficient of determination, \(R^{2}\), and the correlation coefficient, r, show that model Eq. (6) gives a stronger and more accurate output than model Eq. (5) with six input variables. The three input variables of model Eq. (6) (PI, S and MDD) explained 40.4% of the variation in hydraulic conductivity, whereas the six inputs of MLR model Eq. (5) explained 36.9%. The correlation coefficient r for model Eq. (6) also indicates a stronger linear relationship between the observed and predicted values of k than that for model Eq. (5). Since \(R^{2}\) and r values could give a biased estimate of model performance, the MLR models were also compared with respect to their mean squared error, MSE, and mean absolute error, MAE. As shown in Table 7, the MSE and MAE values for model Eq. (6) are lower, showing that the MLR model with three input variables (PI, S and MDD) is more accurate. The regression analysis thus revealed that PI, percent sand (S) and MDD have the most significant impact on the hydraulic conductivity of soils:

$$k = \left( {1.858E - 05} \right) + \left( { - 1.362E - 08} \right){\text{PI}} + \left( {1.291E - 07} \right)S + \left( { - 1.198E - 08} \right){\text{MDD}}$$
(6)

Fig. 2 The scatter plots of observed versus predicted values of hydraulic conductivity for MLR Eq. (6) model
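
For readers who want to apply Eq. (6) directly, the sketch below evaluates it with the coefficients as printed. It is an illustration in Python, not the authors' R code; the function name `k_eq6` and the example inputs are hypothetical, and the inputs must be expressed in the same units as the authors' data set, which this section does not state.

```python
def k_eq6(pi, sand, mdd):
    """Evaluate regression Eq. (6): k from plasticity index, percent sand and MDD."""
    return (1.858e-05
            + (-1.362e-08) * pi
            + 1.291e-07 * sand
            + (-1.198e-08) * mdd)


# Hypothetical soil: PI = 10, 50% sand, MDD = 18 (units assumed to match the data set).
k_example = k_eq6(pi=10, sand=50, mdd=18)
print(k_example)
```

Because the model is linear, input combinations far outside the range of the training data can return physically meaningless (even negative) values of k, so the equation should only be used within the range of soils summarised in Table 2.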

Table 6 Summary of performance indices for MLR models (training)
Table 7 Summary of performance indices for MLR models (testing)

3.3.3 MLR comparative study

The MLR models of Salarashayeri and Siosemarde [29], Merdun et al. [59] and Arshad et al. [39] show higher RMSE and MAE values, although with relatively higher coefficients of determination, as shown in Table 8. This indicates that the MLR model developed in this study with three input variables performed relatively better than these models with respect to the error generated. The MLR model of Salarashayeri and Siosemarde [29] was developed using three input variables derived from soil particle diameters, namely D10, D50 and D60, the particle diameters (mm) corresponding to 10%, 50% and 60% finer by weight, with the saturated hydraulic conductivity, k, expressed in m/day. The RMSE and MAE obtained were 4.06 and 3.32, respectively, with k in m/day; the equivalent values in m/s are shown in Table 8. Merdun et al. [59] used seven variables to develop their MLR model, namely S, Si, C, BD, P1, P2 and P3, corresponding to percent sand, percent silt, percent clay, bulk density and pore sizes, respectively. The RMSE obtained was 0.938 with hydraulic conductivity expressed in cm/hr. Arshad et al. [39] used the percentages of clay (C), silt (Si) and sand (S) and the bulk density (ρb) as independent variables to develop an MLR model for hydraulic conductivity prediction. The RMSE obtained was also higher than the one obtained in the present study (Table 8). This shows that the MLR model developed in this study with three independent variables performs relatively better for predicting the hydraulic conductivity of soils.
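
Comparing errors reported in different units requires converting them to a common unit first. RMSE and MAE scale linearly with the unit of k, so the published values can be converted with plain arithmetic, as in the Python sketch below. The helper names are hypothetical, and the results are straightforward unit conversions rather than values copied from Table 8.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_HOUR = 3_600


def m_per_day_to_m_per_s(value):
    return value / SECONDS_PER_DAY


def cm_per_hr_to_m_per_s(value):
    return value / 100.0 / SECONDS_PER_HOUR


# Error indices quoted in the text, converted to m/s.
rmse_29 = m_per_day_to_m_per_s(4.06)    # Salarashayeri and Siosemarde [29]
mae_29 = m_per_day_to_m_per_s(3.32)
rmse_59 = cm_per_hr_to_m_per_s(0.938)   # Merdun et al. [59]
print(f"{rmse_29:.3e} {mae_29:.3e} {rmse_59:.3e}")
```

The same linear conversion would not apply to MSE, which scales with the square of the unit factor.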

Table 8 MLR models comparison

3.4 Artificial neural network

3.4.1 ANN training

The ANN for hydraulic conductivity prediction was trained using the same six input parameters as the MLR model. It is considered good practice to normalise the data before training a neural network in order to avoid poor results or a very difficult training process with algorithm convergence problems [43]. Among the simple methods of data normalisation available, the minimum–maximum approach was used to bring the data values between 0 and 1 using Eq. (7):

$$a_{i} = \frac{{a - a_{\min } }}{{a_{\max } - a_{\min } }}$$
(7)

where \(a_{i}\) is the normalised value, \(a\) is the actual value, \(a_{\max }\) is the maximum value and \(a_{\min }\) is the minimum value.
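
Equation (7) and its inverse can be written as two short functions. The Python sketch below is illustrative only; the function names and the three sample PI values are hypothetical (the extremes 0 and 480 match the PI range reported in Section 3.1).

```python
def min_max_normalise(values):
    """Apply Eq. (7): map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; Eq. (7) is undefined")
    return [(v - lo) / (hi - lo) for v in values]


def denormalise(norm_value, lo, hi):
    """Invert Eq. (7) to bring a normalised prediction back to original units."""
    return norm_value * (hi - lo) + lo


pi_values = [0.0, 120.0, 480.0]          # made-up plasticity index values
pi_norm = min_max_normalise(pi_values)   # [0.0, 0.25, 1.0]
```

A common design choice, assumed here rather than stated in the paper, is to take the minimum and maximum from the training set and reuse them for the test set, so that the test data do not influence the scaling.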

The architecture that produced the best result after several training runs contains one hidden layer with 10 neurons (Fig. 3). It was chosen on the basis of the \(R^{2}\) and MSE obtained (Fig. 4). The major training parameters are summarised in Table 9. Figure 5 shows the relative importance of the predictors in the neural network to the output prediction, as obtained in R using Garson's algorithm through the ‘garson’ function of the NeuralNetTools library. In Garson's algorithm, input relevance is calculated using Eq. (8):

$$R_{ik} = \sum\limits_{j = 1}^{nh} {\left( {\frac{{|w_{ij} ||w_{jk} |}}{{\sum\limits_{i = 1}^{ni} {\left( {|w_{ij} ||w_{jk} |} \right)} }}} \right)}$$
(8)

where \(ni\) and \(nh\) are the number of inputs and hidden units, respectively, \(w_{ij}\) is the weight between input \(i\) and hidden unit \(j\) and \(w_{jk}\) is the weight between hidden unit \(j\) and output \(k\).
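
Equation (8) can be evaluated directly from the connection weights, as in the plain-Python sketch below. It is an illustration of the formula as printed, not the NeuralNetTools implementation; bias weights are ignored, the final rescaling to a sum of one is an added convenience for plotting relative importance, and the function names and the small 2-2-1 weight set are hypothetical.

```python
def garson_eq8(w_ih, w_ho):
    """Eq. (8): w_ih[i][j] is the input i -> hidden j weight, w_ho[j] the hidden j -> output weight."""
    n_in, n_hid = len(w_ih), len(w_ho)
    relevance = [0.0] * n_in
    for j in range(n_hid):
        contrib = [abs(w_ih[i][j]) * abs(w_ho[j]) for i in range(n_in)]
        total = sum(contrib)
        if total == 0.0:
            continue                      # a dead hidden unit carries no signal
        for i in range(n_in):
            relevance[i] += contrib[i] / total
    return relevance


def relative_importance(relevance):
    """Rescale the relevance values so that they sum to one."""
    s = sum(relevance)
    return [r / s for r in relevance]


# Made-up weights for a network with two inputs, two hidden units and one output.
w_ih = [[1.0, 2.0],
        [3.0, 2.0]]
w_ho = [0.5, -1.0]
raw = garson_eq8(w_ih, w_ho)              # [0.75, 1.25]
share = relative_importance(raw)          # [0.375, 0.625]
```

Because absolute values are used, the method ranks the magnitude of each input's contribution but says nothing about its direction.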

Fig. 3 ANN architecture for hydraulic conductivity prediction (R generated)

Fig. 4 Number of hidden nodes versus MSE

Table 9 Major training parameters used for ANN model in R

Fig. 5 Relative importance of input parameters

Unlike the MLR, the ANN used all the input variables to develop the network for hydraulic conductivity prediction, with percent fines having the greatest influence and percent clay the least. This indicates that the ANN can capture complex relationships between the input variables and the response variable that MLR cannot represent.

3.4.2 ANN model validation

The ANN model was provided with a new data set, the test data, to assess its efficiency and its capacity to generalise beyond the training data. Figure 6 shows the scatter plot of observed against predicted values of k for the test data, while Table 10 gives the performance estimates of the ANN model. For the test data, the \(R^{2}\) and MSE values of 0.955 and 7.366e-12 indicate acceptable accuracy: 95.5% of the variation in hydraulic conductivity was explained by the six input variables considered, with a small error, thereby validating the model.

Fig. 6 Scatter plot of observed versus predicted values of hydraulic conductivity for ANN model (testing)

Table 10 Summary of performance indices for ANN model

3.4.3 ANN and MLR model comparison

The MLR analysis of the collated data revealed that half of the input variables (percent fines, percent clay and OMC) did not contribute to the performance of the MLR model. This could be attributed to a nonlinear relationship, or a very low correlation, between these variables and hydraulic conductivity. The ANN, however, was able to capture the hidden relationships between these variables and the corresponding hydraulic conductivity, which demonstrates its ability to model nonlinear behaviour. The performance indices of the developed MLR and ANN models, presented in Table 11, show that the ANN produces a more reliable estimate of soil hydraulic conductivity than MLR. The higher \(R^{2}\) of 0.995 and the lower error estimates of the ANN relative to the MLR models agree with previous studies by Sinha and Wang [20], Arshad et al. [39] and other researchers, which showed that ANN is a better predictive tool for geotechnical and geo-environmental problems than traditional linear regression.

Table 11 Performance evaluation of MLR and ANN

3.4.4 ANN comparative study

Many earlier studies have applied ANN in geotechnical engineering, but only a few have predicted the hydraulic conductivity of a particular soil, and those used different input variables. The developed network is therefore compared with existing models, and Table 12 summarises this comparison. The input variables used across the networks listed include plasticity index (PI), percent gravel (Gr), percent sand (S), percent fines (F), percent silt (Si), percent clay (C), bulk density (ρb), dry density (ρd), liquid limit (LL), plastic limit (PL), maximum dry density (MDD), moisture content (w), optimum water content (OMC), void ratio (e), soil particle diameters (D10, D50) and degree of compaction expressed as a percentage (P%). Boroumand and Baziar [18] developed a network for predicting the hydraulic conductivity of clay using 55 data sets: 45 for training and 10 for testing. The soil physical and compaction properties used as input variables were MDD, PI, Gr, S and C, with a 5-4-1 architecture, that is, one hidden layer with four neurons. The \(R^{2}\) obtained for the test data was 0.54. The ANN model of Sinha and Wang [20] used test soils prepared by mixing different amounts of four main components: gravel, sand, limestone dust and sodium bentonite. These components were mixed in different ratios to give 55 distinct combinations. The five input variables were P (%), D10, D50, PL and LL. The network architecture contained one hidden layer with three neurons, and the \(R^{2}\) obtained was 0.901. Lim and Kolay [10] used 10 input variables with 10 neurons in the hidden layer to develop a hydraulic conductivity prediction network for tropical soils (Table 12). Their data set contained 144 observations, of which 100 were used for training and the rest for testing. The \(R^{2}\) obtained was 0.92.
For ease of comparison, the \(R^{2}\) value for Lim and Kolay [10] was calculated from their reported correlation coefficient, and the MSE for Merdun et al. [59] was calculated from their reported RMSE, taking the unit of k into account. Table 12 shows that the \(R^{2}\) obtained in this study was the highest (0.955), with the lowest error estimate (MSE = 7.366e-12), showing that the ANN developed for hydraulic conductivity prediction with six input variables (PI, S, F, C, MDD and OMC) generalised well to the validation data set (test data).
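
The two conversions mentioned above amount to simple arithmetic, sketched here in Python. The sketch assumes that \(R^{2}\) is taken as the square of the reported correlation coefficient and that the RMSE is first brought to the target unit of k before squaring. The function names, the `unit_factor` parameter and the numerical values are hypothetical and are not figures from the cited studies.

```python
def r2_from_r(r):
    """Square a correlation coefficient to obtain R^2 (assumed relationship)."""
    return r * r


def mse_from_rmse(rmse, unit_factor=1.0):
    """Convert an RMSE to an MSE, rescaling to the target unit of k first.

    unit_factor is the multiplier that takes the reported unit to the target
    unit, for example 0.01 when going from cm/s to m/s.
    """
    return (rmse * unit_factor) ** 2


# Hypothetical reported values, used only to show the arithmetic.
print(r2_from_r(0.9))                           # about 0.81
print(mse_from_rmse(2e-4, unit_factor=0.01))    # about 4e-12 when 2e-4 cm/s is read as m/s
```

Because the MSE carries the square of the unit of k, a unit factor of 0.01 changes the MSE by a factor of 1e-4, which is why the unit matters when error estimates from different studies are placed side by side.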

Table 12 ANN models comparison

4 Conclusions

This research assessed the performance of an artificial neural network (ANN) and multiple linear regression (MLR) in predicting the hydraulic conductivity of a wide range of soil types, so that an appropriate value of soil hydraulic conductivity can be obtained in the shortest possible time. Based on the analysis and the results obtained, the following conclusions are drawn:

  • The correlation and p value results revealed that only three input variables (plasticity index, percent sand and MDD) are statistically significant to the development of the MLR model; the others are redundant.

  • Relative importance analysis revealed that all six input variables used for ANN development are significant, with percent fines being the most influential.

  • The statistical indices (\(R^{2}\), MSE) show that ANN is the more reliable predictive tool and has a strong ability to predict nonlinear behaviour when compared with MLR.

  • The comparative study shows that the developed MLR and ANN models perform better than the corresponding available models considered.

  • The ANN model developed in this study can be used efficiently to predict the hydraulic conductivity of most soil types, since the input variables are easily obtainable parameters, which makes soil investigation with respect to hydraulic conductivity faster.

  • This study used the earliest and most widely used activation function (sigmoid) for the developed network. It is suggested that the performance of the rectified linear unit (ReLU) in ANN development for predicting the hydraulic conductivity of soils be investigated. Furthermore, it will be important to develop a network that can predict the hydraulic conductivity of different soil types stabilised with the same additive (e.g. lime, cement, fly ash, etc.).

The decision to collate data from previous studies on the hydraulic conductivity of soils resulted from the unavailability of a database of basic soil properties in Nigeria. Since the performance of ANN depends on the reliability of the training data, the data set used for this study was carefully selected to minimise data error. However, the integrity of the collated data set cannot be fully ascertained.