1 Introduction

The safety of water-retaining structures is a top priority in geotechnical research. The discharge flow rate at the downstream side of a water-retaining structure, as a seepage output quantity, is central to the safe design of the structure, because high discharge flow rates can endanger its stability. Sheet piles are installed to reduce the downstream discharge flow rate in a flow region or under the foundation of a dam. As in most seepage-related problems in geotechnical engineering, the seepage beneath a sheet pile is governed by a differential equation from which the discharge flow rate is calculated. Researchers have employed various methods to solve the governing seepage equation and compute the discharge flow rate, with analytical and numerical methods being the most widely used.

He [1] suggested a variational iteration method with fractional derivatives as an analytical method for solving nonlinear seepage flow in porous media. Jie et al. [2] used a finite difference method based on boundary-fitted coordinate transformation to analyze steady-state seepage with a free surface in an isotropic and homogeneous embankment dam. Fukuchi [3] adopted boundary polynomial interpolation in finite difference analyses of two- and three-dimensional seepage problems. Relying on two practical approaches suggested in the literature, Bresciani et al. [4] applied a finite volume-based method to solve groundwater flow through earth dams; their proposed methods combined the advantages of adaptive and fixed mesh techniques. Ouria et al. [5] employed a coupled finite-element-based method to model transient seepage flow beneath a concrete dam. Kazemzadeh-Parsi and Daneshmand [6] applied a smoothed fixed grid finite element method to three-dimensional unconfined seepage in complex geometries and in heterogeneous and anisotropic porous media. Rafiezadeh and Ataie-Ashtiani [7] developed a computer program based on the boundary element method to analyze three-dimensional confined seepage problems under dams. Jie et al. [8] simulated unconfined seepage problems with the natural element method. Zhang et al. [9] applied a mesh-free technique to the free-surface seepage problem treated as a moving-boundary problem; because node locations are arbitrary in this meshless method, seepage problems with a free surface can be analyzed appropriately.

Although analytical techniques cannot be applied directly to complicated geometries and complex boundary conditions, they can provide exact solutions [1]. Numerical methods provide approximate solutions of high accuracy for more complex problems. Mesh-based approaches, such as the finite difference, finite volume, and finite element methods, discretize the whole problem domain. An essential disadvantage of most mesh-based methods is that the domain may contain singular points and sharp corners, which can make further progress towards the desired solution numerically impossible [9,10,11].

The Scaled Boundary Finite Element Method (SBFEM), a semi-analytical method capable of solving different types of differential equations, was proposed by Song and Wolf [12] to overcome the limits of some existing approaches. The SBFEM combines important advantages of the finite element and boundary element methods. Bazyar and Graili [13] used the SBFEM to analyze steady-state confined seepage beneath dams and sheet piles in anisotropic media. In another part of the same study [13], they successfully solved unconfined flow problems with an unknown free surface through the dam body. Bazyar and Talebi [14] extended the SBFEM to transient seepage problems in bounded and unconfined domains. The proposed method was capable of solving the seepage problem for heterogeneous and anisotropic porous media without extra effort. Johari and Heydari [15] carried out a reliability analysis of seepage in several numerical problems using a stochastic SBFEM. Su et al. [16] utilized a drainage substructure and the nodal virtual flux method to simulate drainage holes and analyze complex seepage fields. These studies suggest that the SBFEM offers advantages over other methods for such problems. Hence, it appears to be a practical method for analyzing the seepage beneath sheet piles and obtaining the discharge flow rate.

Despite the efficiency of the SBFEM in obtaining the discharge flow rate, a separate analysis is needed to compute the discharge flow rate for every single condition. Furthermore, the user must be fully acquainted with the analysis procedure to analyze the seepage problem beneath the sheet piles effectively and to obtain the discharge flow rate. To overcome this downside, prediction models have been developed that directly relate quantities such as the discharge flow rate to their contributing parameters, removing the need for tedious analytical, numerical, or laboratory procedures to calculate the discharge flow rate. Data-driven approaches, regression methods, artificial intelligence, and other soft computing techniques have recently attracted several researchers seeking prediction equations for the complicated behavior of various systems. Artificial Neural Networks (ANN) [17, 18], Adaptive Neuro-Fuzzy Inference System (ANFIS) [19, 20], Ant Colony Optimization (ACO) [21], Evolutionary Polynomial Regression (EPR) [22], Genetic Algorithm (GA) [23,24,25], Genetic Programming (GP) [26,27,28], Genetic-Based Neural Network (GBNN) [29, 30], and Gene Expression Programming (GEP) [31,32,33,34] are among the most common soft computing techniques and have made outstanding contributions to various civil engineering problems.

In this contribution, an Evolutionary Polynomial Regression (EPR) model is developed to predict the discharge flow rate under sheet piles. The EPR models were developed from a large database comprising 1000 lines of artificial data generated with the SBFEM, simulating real-world conditions of seepage under sheet piles. The aim is a powerful, representative, and comprehensive model that can be applied to situations similar to the conditions covered by the model development database.

2 Evolutionary polynomial regression (EPR)

Evolutionary Polynomial Regression (EPR) is a data-driven method based on evolutionary computing that searches for polynomial structures representing a system. A general EPR expression may be presented as:

$$ y = \sum\limits_{j = 1}^{n} F\left( X, f(X), a_{j} \right) + a_{0} $$
(1)

where y is the estimated vector of the output of the process; \(a_{j}\) are the model parameters; F is a function constructed by the EPR process; X is the matrix of input variables; f is a function defined by the user; and n is the number of terms of the target expression. EPR constructs the general functional structure from elementary functions using a Genetic Algorithm (GA) strategy. The GA is employed to select the useful input vectors from X to be combined. The building blocks (elements) of the structure of F are defined by the user based on an understanding of the physical process. While the selection of feasible structures to be combined is made through an evolutionary process, the parameters \(a_{j}\) are estimated by the least squares method (Fig. 1).
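The least squares step described above can be illustrated with a minimal sketch. It assumes one commonly used EPR form in which each term of F is a product of powers of the inputs and f is omitted; this choice, the helper names (`build_terms`, `fit_coefficients`, `predict`), and the demonstration data are assumptions made for illustration and are not taken from the EPR implementation used in this study.

```python
import numpy as np

def build_terms(X, exponents):
    """Column j holds prod_k X[:, k] ** exponents[j, k] (assumed term form)."""
    X = np.asarray(X, dtype=float)
    exponents = np.asarray(exponents, dtype=float)
    # Broadcast (N, 1, k) ** (1, m, k) -> (N, m, k), then multiply over the inputs.
    return np.prod(X[:, None, :] ** exponents[None, :, :], axis=2)

def fit_coefficients(X, y, exponents):
    """Estimate a_0 and a_1..a_m by ordinary least squares for a fixed structure."""
    terms = build_terms(X, exponents)
    A = np.column_stack([np.ones(len(terms)), terms])  # first column carries a_0
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return coeffs

def predict(X, exponents, coeffs):
    """Evaluate y = a_0 + sum_j a_j * term_j for the fitted structure."""
    return coeffs[0] + build_terms(X, exponents) @ coeffs[1:]

# Hypothetical noise-free data generated from a known two-term structure.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.1, 1.0, size=(50, 2))
y_demo = 0.5 + 2.0 * X_demo[:, 0] * X_demo[:, 1] - 1.0 * X_demo[:, 0] ** 2
exps_demo = np.array([[1, 1], [2, 0]])  # one row of exponents per term
coeffs_demo = fit_coefficients(X_demo, y_demo, exps_demo)
```

In this sketch the structure (the exponent matrix) is fixed by hand; in EPR it is the GA that proposes candidate structures, and the least squares fit is repeated for each candidate.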

Fig. 1 Typical flow diagram for the EPR procedure

In this technique, combining the genetic algorithm, which finds feasible structures, with the least squares method, which finds the appropriate model parameters for those structures, brings some advantages. In particular, the GA allows a global exploration of the error surface relevant to specifically defined objective functions. Through such objective functions, criteria can be selected that must be satisfied during the search process. These criteria can be set to (a) avoid overfitting of models, (b) push the models towards simpler structures, and (c) avoid unnecessary terms representative of the noise in data. EPR avoids over-fitting by penalizing the number of inputs involved in structures (model complexity); by controlling the constant values, since a term may describe noise when its constant is close to zero; and by controlling the variance of EPR terms with respect to the noise variance in the data, which is estimated from the model residuals [35]. A useful feature of EPR is the high level of interactivity between the user and the methodology. The user's physical insight can be used to make hypotheses on the elements of the target function and on its structure [Eq. (1)]. Selecting an appropriate objective function, assuming pre-selected elements in Eq. (1) based on engineering judgment, and working with dimensional information enable refinement of the final models [22].

Before starting the evolutionary procedure, a number of constraints can be imposed to control the structure of the models to be constructed, in terms of the length of the equations, type of functions used, number of terms, range of exponents, number of generations, etc. There is therefore great potential for obtaining different models for a particular problem, which enables the user to gain additional information. The evolutionary process starts from a constant mean of the output values. As the number of evolutions increases, it gradually picks up the different participating parameters to form equations representing the constitutive relationships. Each model is trained using the training data and tested using the testing data [22, 35].
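The interplay between the user-defined constraints and the evolutionary search can be sketched as follows. This is a deliberately simplified, mutation-only search with elitist selection, used here as a stand-in for the GA of EPR, whose operators are not detailed in this paper. The function names, the candidate exponent set, the population size, and the absence of any complexity penalty are all assumptions of the sketch.

```python
import numpy as np

def sse_for_structure(X, y, exponents):
    """Fit one candidate structure by least squares; return its sum of squared errors."""
    terms = np.prod(X[:, None, :] ** exponents[None, :, :], axis=2)
    A = np.column_stack([np.ones(len(X)), terms])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coeffs
    return float(residuals @ residuals)

def evolve_structure(X, y, n_terms=2, candidates=(0, 1, 2), pop_size=20,
                     n_generations=30, seed=0):
    """Search exponent matrices under fixed constraints (terms, exponents, generations)."""
    rng = np.random.default_rng(seed)
    n_inputs = X.shape[1]
    # Each individual is an (n_terms x n_inputs) matrix of exponents.
    population = [rng.choice(candidates, size=(n_terms, n_inputs))
                  for _ in range(pop_size)]
    history = []
    for _ in range(n_generations):
        scores = [sse_for_structure(X, y, ind) for ind in population]
        order = np.argsort(scores)
        history.append(scores[order[0]])          # best error of this generation
        survivors = [population[i] for i in order[: pop_size // 2]]  # keep best half
        children = []
        for parent in survivors:
            child = parent.copy()
            j, k = rng.integers(n_terms), rng.integers(n_inputs)
            child[j, k] = rng.choice(candidates)  # mutate a single exponent
            children.append(child)
        population = survivors + children
    scores = [sse_for_structure(X, y, ind) for ind in population]
    return population[int(np.argmin(scores))], history
```

Because the best individuals are carried over unchanged, the best error recorded in `history` cannot increase from one generation to the next. A fuller implementation would also penalize model complexity, as described for EPR above.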

3 Data preparation using SBFEM

The EPR models developed in this study were based on an extensive database of 1000 lines of synthetic data generated with the SBFEM, simulating possible scenarios of seepage under sheet piles under various boundary and real-world conditions, with the aim of obtaining a robust, representative, and comprehensive model. Figure 2 shows the geometry and the boundary conditions of the problem domain, which is divided into non-uniform subdomains. The sheet pile is placed in the middle of the modeling domain. A 20.0 m by 40.0 m horizontal saturated soil layer is modeled as the domain of the problem.

Fig. 2 Geometry of the model

The accuracy and versatility of the SBFEM model used to produce the data underlying this research were verified by comparing the SBFEM results with those of the FEM. For this purpose, an FEM code was developed. The domain discretization used for both models is shown in Fig. 3. The domain is discretized into 450 subdomains for the SBFEM. The scaling center of each subdomain is located exactly at its geometric center. The contours of potential lines obtained with the SBFEM and FEM are shown in Fig. 4. The results indicate close agreement between the SBFEM and FEM, which supports the accuracy and reliability of the generated artificial data used in this study.

Fig. 3 Domain discretization of (a) SBFEM, (b) FEM

Fig. 4 Contour of potential lines

4 Developing the EPR model

In pattern recognition procedures in general, for instance neural networks, fuzzy logic, or genetic programming, model construction is normally based on adaptive learning over several cases. The performance of the developed model is then evaluated using a validation data set that has not been used in the model development process. In evolutionary-based modeling, the way the data are divided into training and validation sets has a significant effect on the results [36, 37].

The developed model can be applied to situations similar to the conditions covered by the model development database. The three input parameters with the greatest influence on the seepage results were selected: sheet pile height (D), upstream water level (H), and hydraulic conductivity anisotropy ratio of the deposit materials (K). The output parameter is the normalized flow rate Qnor. The seepage problem is an elastic problem, and according to the seepage equations the only soil parameter involved in its solution is the infiltration coefficient. From the geometrical point of view, the upstream water level (H) and the sheet pile height (D) are the most influential parameters, and the modelling dimensions have little effect on the results. Results from previous studies [40] emphasize that the sheet pile height has a greater effect on the total seepage discharge than any other location-related parameter that may affect seepage. The parameter ranges in this study fall within the expected range for the small to medium sheet piles [41] most commonly used in industry. However, the EPR model can be retrained if different parameter ranges are of interest or if complementary data become available, so that the model stays relevant and applicable to newly emerging scenarios. Table 1 states the ranges of the input and output parameters in this research.

Table 1 Parameters involved in the developed EPR model

The training of the EPR resulted in a few candidate equations. Of these, some did not include the effect of all contributing parameters. Among the remaining equations, the most appropriate and efficient one, based on model performance (fitness) and complexity, was selected as the final model. Equation 2 presents the developed EPR model:

$$ Q_{nor} = -1.24 D^{3} H + 2.47 D^{2} H - 1.64 D H - 0.17 D K + 0.23 H K + 0.44 H + 0.14 K + 0.02 $$
(2)

Here D, H, and K are the cutoff depth, the upstream/upper head of water, and the anisotropy ratio (kx/ky) of the soil deposit, respectively. Figure 5 shows the normalized flow rate predicted by EPR against the data used to develop the EPR model (training data).
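For convenience, Eq. (2) can be transcribed directly into a short function. The function name `q_nor` is an assumption of this sketch, and the inputs are assumed to be expressed in the same form and within the same ranges as the data used to develop the model (Table 1); the function itself does not check this.

```python
def q_nor(D, H, K):
    """Normalized flow rate from Eq. (2).

    D: cutoff depth, H: upstream/upper head of water, K: anisotropy ratio (kx/ky).
    Inputs are assumed to be scaled as in the model development database.
    """
    return (-1.24 * D**3 * H + 2.47 * D**2 * H - 1.64 * D * H
            - 0.17 * D * K + 0.23 * H * K + 0.44 * H + 0.14 * K + 0.02)
```

Because the expression is an explicit polynomial, it also works unchanged on NumPy arrays if a whole set of scenarios is to be evaluated at once.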

Fig. 5 Predicted vs. SBFEM-based data used to train EPR model

In this study, the data set was split into several random combinations of training and validation sets until a robust representation of the whole population was achieved for both sets. Statistical analysis was performed on the input and output parameters of the randomly selected training and validation sets to choose the most robust representation. This ensured that the statistical properties of the two subsets (training and testing) were as close as possible to each other and that both subsets represented the same statistical population. Of the 1000 available data sets, 800 (80%) were used to train EPR. The remaining 200 (20%) were used to validate the developed model, meaning that these sets were unseen by EPR during model development. The ratio in which the data are divided into training and testing subsets was chosen to stay consistent and comparable with the traditional approach in machine learning research [22, 36]; however, the EPR approach places no limitation on this ratio, which could change depending on data availability and application. Many candidate combinations of training and testing data were possible. Therefore, the minimum, maximum, mean, and standard deviation were calculated for all contributing parameters in the training and testing data sets of each candidate combination. The combination for which the standard deviation and mean values of the training and testing data were closest was chosen for the training and testing stages of the EPR model development process. In this way, the most statistically consistent combination was used to construct and validate the EPR model.
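The split-selection procedure described above can be sketched as a search over random permutations. The mismatch measure used here (sum of absolute differences in column means and standard deviations), the number of trials, and the function names are assumptions of this sketch; the paper does not state the exact criterion or the number of combinations examined.

```python
import numpy as np

def split_mismatch(train, test):
    """Sum of absolute differences in column means and standard deviations."""
    mean_gap = np.sum(np.abs(train.mean(axis=0) - test.mean(axis=0)))
    std_gap = np.sum(np.abs(train.std(axis=0) - test.std(axis=0)))
    return float(mean_gap + std_gap)

def choose_split(data, train_fraction=0.8, n_trials=200, seed=0):
    """Try random splits and keep the one whose subsets are statistically closest."""
    rng = np.random.default_rng(seed)
    n_train = int(round(train_fraction * len(data)))
    best = None
    for _ in range(n_trials):
        perm = rng.permutation(len(data))
        train_idx, test_idx = perm[:n_train], perm[n_train:]
        score = split_mismatch(data[train_idx], data[test_idx])
        if best is None or score < best[0]:
            best = (score, train_idx, test_idx)
    return best  # (mismatch, training row indices, testing row indices)
```

Here `data` is assumed to be an array with one row per data set and one column per parameter (inputs and output). If the parameters have very different magnitudes, scaling the columns before computing the mismatch would keep one parameter from dominating the choice.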

The level of accuracy at each stage of the modelling process was evaluated based on the coefficient of determination (COD), i.e., the fitness function defined as [22, 35]

$$ {\text{COD}} = 1 - \frac{{\sum\limits_{N} {({\text{Y}}_{a} - {\text{Y}}_{p} )^{2} } }}{{\sum\limits_{N} {\left( {{\text{Y}}_{a} - \frac{1}{N}\sum\limits_{N} {{\text{Y}}_{a} } } \right)}^{2} }} $$
(3)

where \({\mathbf{Y}}_{a}\) is the actual output value; \({\mathbf{Y}}_{p}\) is the EPR predicted value and N is the number of data points on which the COD is computed. If the model fitness is not acceptable or the other termination criteria (in terms of maximum number of generations and maximum number of terms) are not satisfied, the current model should go through another evolution to obtain a new model.
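Equation (3) translates directly into a few lines of code. The function name `cod` and the small numerical check are illustrative assumptions; the values in the check are not data from this study.

```python
import numpy as np

def cod(y_actual, y_pred):
    """Coefficient of determination as defined in Eq. (3)."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_actual - y_pred) ** 2)           # numerator of Eq. (3)
    ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # denominator of Eq. (3)
    return 1.0 - ss_res / ss_tot

# Hypothetical check: one of three predictions is off by 1.
example = cod([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])  # 1 - 1/2 = 0.5
```

A perfect prediction gives a COD of 1, while predicting the mean of the actual values for every point gives a COD of 0.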

As shown in Fig. 6, the comparison of the results, along with the high coefficient of determination (COD) values for the EPR model (training COD: 98%; testing COD: 97%), indicates the excellent performance of the developed model in capturing the underlying relationships between the contributing parameters and the flow rate, and in generalizing the training to predict seepage behavior under sheet piles for unseen conditions.

Fig. 6 Predicted vs. SBFEM-based data used to validate EPR model

The proposed EPR model generates a transparent and structured representation of the system. One of the main advantages of the EPR approach is that there is no need to assume an a priori form of the relationship between the input and output parameters. The explicit and transparent structures obtained from the proposed EPR method allow physical interpretation of the model predictions, and sensitivity analyses of the developed model for individual contributing parameters give the user additional insight into the relationship between input and output parameters. In general, EPR-based modeling has several advantages, including a simple and straightforward framework for modeling all materials. It does not require any arbitrary choice of the constitutive (mathematical) model, yield function, plastic potential function, flow rule, etc. As EPR learns the material behavior directly from raw experimental data, it is the shortest route from experimental, research-based, or artificially generated data to numerical modeling.

It should be noted that EPR trains and develops validated models using the data provided, regardless of how the data were collected or generated. This study is no exception: the synthetic data, together with the geometrical and any other aspects intrinsically included in them, were used by EPR, and the outcomes and predictions of the presented model reflect the data as a whole, as the user would expect. However, EPR can be retrained when more or different data are developed, needed, or become available, to ensure the model stays representative, relevant, and comprehensive.

4.1 Sensitivity analysis

A sensitivity analysis was conducted to investigate the effects of individual contributing parameters on the predictions made by the proposed model. The aim was to verify that the behavior predicted by the model is consistent with the behavior expected for the system from the literature. To perform the analysis for each normalized contributing parameter, all parameters other than the one being investigated were set to the average values of their ranges. The parameter being studied was then varied between its minimum and maximum values. A graph was plotted to show the variation in the EPR predictions of the flow rate as the parameter in question varied between its minimum and maximum values. Figures 7, 8, and 9 show the sensitivity analysis results for sheet pile/cutoff wall length, upstream/upper head, and anisotropy ratio.
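The one-at-a-time procedure described above can be sketched as follows, using Eq. (2) as the model. The parameter ranges are hypothetical placeholders (each input taken as varying over [0, 1]), since the actual ranges are those given in Table 1; the function names and the number of sweep points are likewise assumptions of the sketch.

```python
import numpy as np

def q_nor(D, H, K):
    """Normalized flow rate from Eq. (2)."""
    return (-1.24 * D**3 * H + 2.47 * D**2 * H - 1.64 * D * H
            - 0.17 * D * K + 0.23 * H * K + 0.44 * H + 0.14 * K + 0.02)

def sensitivity_sweep(model, ranges, name, n_points=11):
    """Vary one parameter over its range with the others fixed at mid-range."""
    values = {key: 0.5 * (lo + hi) for key, (lo, hi) in ranges.items()}
    lo, hi = ranges[name]
    sweep = np.linspace(lo, hi, n_points)
    outputs = []
    for v in sweep:
        values[name] = v                 # only the studied parameter changes
        outputs.append(model(**values))
    return sweep, np.array(outputs)

# Hypothetical ranges; replace with the ranges of Table 1 before use.
ranges = {"D": (0.0, 1.0), "H": (0.0, 1.0), "K": (0.0, 1.0)}
d_values, q_vs_d = sensitivity_sweep(q_nor, ranges, "D")
h_values, q_vs_h = sensitivity_sweep(q_nor, ranges, "H")
k_values, q_vs_k = sensitivity_sweep(q_nor, ranges, "K")
```

Plotting each output array against its sweep values gives curves of the kind reported in the sensitivity figures, subject to the assumed ranges.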

Fig. 7 Sensitivity analysis: effect of changes in sheet pile/cutoff wall length on flow rate predictions by the EPR model

Fig. 8 Sensitivity analysis: effect of changes in upstream/upper head of water on flow rate predictions by the EPR model

Fig. 9 Sensitivity analysis: effect of changes in anisotropy ratio on flow rate predictions by the EPR model

As shown in Fig. 7, given a certain position of the cutoff wall, increasing the cutoff depth reduces the seepage discharge, and the flow rate predicted by the EPR model decreases. This phenomenon can be understood based on Darcy’s theory [38]. Moreover, as the opening between the cutoff wall and the impervious floor is reduced, converging flow lines add resistance to the flow, and seepage is diminished. As shown in Fig. 8, given a specific position of the cutoff wall, increasing the upstream/upper head of water increases the seepage discharge, so the flow rate predicted by the EPR model increases. If the head (h) is the same everywhere, there is no water flow through the soil. If the head differs between parts of the soil mass, water flows away from points at which the head is high and towards points at which the head is lower. The flow rate is governed by the hydraulic gradient, which depends directly on the water head; the hydraulic gradient is also the essential term of the seepage force per unit soil volume, which acts in the flow direction. When the flow is upward in the soil, the pore water pressure increases and the effective stress decreases. When the flow is downward, the pore water pressure drops and the effective stress increases.

As shown in Fig. 9, given a specific position of the cutoff wall, increasing the anisotropy ratio of the soil deposit increases the seepage discharge, so the flow rate predicted by the EPR model increases. It was found that, at a constant vertical permeability, increasing the anisotropy ratio of permeability leads to the formation of horizontal flow channels and consequently increases the seepage flow. Variation of the permeability coefficient was found to have almost no impact on the mean discharge flow rate for anisotropic fields compared to isotropic conditions. Hence, it appears that the anisotropic properties of the soil alluvium have a significant influence on the stress distribution, hydraulic conductivity coefficient, and damage zone [39].

5 Discussion and Conclusion

The current study developed an EPR model to predict the discharge flow rate under sheet piles. The EPR models were based on an extensive database comprising 1000 lines of artificial data generated with the SBFEM, simulating real-world seepage conditions under sheet piles. One of the important advantages of the SBFEM is that it models singular points directly with high accuracy, and this feature can be utilized to model seepage beneath sheet piles, where the pile tip acts as a singular point. The accuracy and versatility of the SBFEM model were verified by comparing its results with those of the FEM. The domain was discretized into 450 subdomains for the SBFEM and 3200 elements for the FEM. The contours of potential lines obtained with the SBFEM and FEM were shown, and the results indicated close agreement between the two methods.

A robust, representative, and comprehensive model was developed that can be applied to situations similar to the conditions covered by the model development database. It was shown that the EPR model can capture the underlying relationships between various parameters directly from artificially developed SBFEM data and make predictions of very high precision for unseen scenarios (as verified by the unseen testing/verification data set). The EPR model was tested using data that were not used in the training stage of the model development process; thus, an unbiased indicator of the actual prediction capability of the model was obtained. The results show the excellent ability of the EPR model to generalize the training to predict flow rates under unseen conditions. The training of the EPR resulted in a few candidate equations. Since some equations did not include all contributing parameters, the most appropriate and efficient one, based on model performance (fitness) and complexity, was selected as the final model. After training, the accuracy of the selected EPR model was verified using 200 sets of validation data that had not been introduced to EPR during training. The COD values of the EPR model (training COD: 98%; testing COD: 97%) indicate the appropriate fitness of the developed model in capturing the underlying relationships between the contributing parameters and the flow rate, and in generalizing the training to predict seepage behavior under sheet piles for unseen conditions. Finally, the consistency between the behavior predicted by the model and the expected system behavior from the literature was assessed by sensitivity analysis. Accordingly, Qnor predicted by the EPR model decreases when the cutoff depth increases at a particular position of the cutoff wall, and it increases when the upstream/upper head of water and the soil deposit’s anisotropy ratio increase.

The synthetic data used to develop and verify the EPR model were carefully generated to be robust and to represent real-world problems. The model verification and parametric study suggest that the model predictions are in line with expectations and are highly accurate as long as the contributing parameters of a problem fall within the ranges used to create and verify the model. However, it is advised that the necessary precautions and verifications be put in place on a case-by-case basis when applying the model to real-world problems, to ensure the safety of the structures.