1 Introduction

Geotechnical parameters, such as shear strength, permeability, consistency limits, and compaction, play a significant role in the construction of civil engineering projects. These parameters provide insight into soil behavior, which is essential for making informed decisions about the design and construction of geotechnical features (Vanapalli et al. 1996). For instance, the undrained shear strength (UDSS) of soils is a fundamental parameter that is in high demand for solving many practical problems. It is widely used to determine the stability of slopes, the bearing capacity of shallow foundations, embankment settlement and stability, the bearing capacity of piles, and so on (Wong et al. 2021).

UDSS is particularly important in cohesive soils, where the pore water pressure generated during shearing cannot dissipate quickly. Conventional methods, such as in-situ or laboratory testing, are frequently employed to provide design information on the UDSS. Even though these methods can be very efficient, they can also be very time-consuming and costly, especially when dealing with soils that display spatial variability. For this reason, determining the UDSS of cohesive soils is a source of concern due to the challenges associated with in-situ measurements or subsequent laboratory testing (Lunne et al. 2006; Phoon and Kulhawy 1999). In practice, in order to expedite the design process, geotechnical engineers frequently apply correlations already established between UDSS and various soil indices (Hansbo 1957; Kulhawy and Mayne 1990; Skempton 1954). However, it is essential to consider the limitations of these approaches, as there can be differences, such as in site geology and soil characteristics, between the data source and the site where the correlations will be applied (D’Ignazio et al. 2016). In other words, the results drawn from these correlations may deviate from the actual UDSS values of the soils.

Over the past few years, there has been rapid development in the field of artificial intelligence (AI). This development has led to the emergence of machine learning (ML) algorithms that are now widely used in various fields. ML applications have transformed the way complex problems are tackled, enabling new and innovative solutions. Due to their learning ability, ML algorithms have become a desirable tool for revealing relationships between many soil parameters. Consequently, growing interest in the potential applications of ML algorithms to geotechnical problems has been witnessed over the past decades. Research covers a wide range of problems, such as soil liquefaction and liquefaction-induced hazards (Demir and Sahin 2023b; Durante and Rathje 2021; Goh and Goh 2007; Sahin and Demir 2023), slope stability (Aminpour et al. 2023; Sabri et al. 2023; Huang et al. 2023), pile bearing capacity (Benbouras et al. 2021; Kardani et al. 2020), and other specific problems (Baghbani et al. 2022; Chaabene et al. 2020; Chen et al. 2024; Kahraman and Ozdemir 2022; Niu et al. 2023; Qi et al. 2023; Shi et al. 2023a, b; Yin et al. 2023; Zhang et al. 2022; Zhao et al. 2023). Furthermore, several researchers have utilized ML algorithms to solve the UDSS prediction problem. For instance, Mbarak et al. (2020) predicted UDSS using Standard Penetration Test (SPT) results and soil consistency indices. Three ML approaches, Random Forest (RF), Gradient Boosting (GBM), and stacked models, were developed and employed on a dataset from different projects across Turkiye, and were also compared with simple and multiple linear regression models. They concluded that the ML models exhibited superior prediction capabilities compared to the conventional methods. Pham et al. (2020) developed a hybrid ML model combining RF and Particle Swarm Optimization (PSO) to predict the UDSS of soil using the experimental results of 127 soil samples; the RF-PSO model outperformed the single RF model without optimization, predicting UDSS with high accuracy. Zhang et al. (2021) utilized Bayesian-optimized eXtreme Gradient Boosting (XGBoost) and RF algorithms to predict the UDSS of soft clays. The performance of the proposed methods was compared with two transformation models from previous works and three baseline ML algorithms, and the results revealed that the XGBoost and RF models outperformed the other approaches. Tran et al. (2022) applied two novel hybrid ML approaches, ANFIS-CA (Adaptive Neuro-Fuzzy Inference System-Cultural Algorithm) and ANFIS-PSO, to predict the UDSS of sensitive clays using five input parameters. Their results showed that the ANFIS-PSO model yielded a promising correlation coefficient of 0.715 for predicting the UDSS of sensitive clays with limited input parameters. Länsivaara et al. (2023) compared the performance of traditional and ML-based models for UDSS, considering the influence of data coherence; the ML-based models outperformed the traditional ones for two different datasets. They also showed that including additional variables can improve training-set performance but worsen test-set prediction. Reviewing the existing research reveals that ML algorithms are capable of reasonably estimating the UDSS of soils.

It is well known that ML algorithms are data-driven and that the prediction performance of ML approaches is affected by the application of data pre-processing and the size of the training set; it is therefore essential to consider the data pre-processing steps when building a robust ML model (García et al. 2015; Marsland 2011). Data pre-processing methods typically involve modifying the training dataset through the removal, addition, or transformation of the training set data (Kuhn and Johnson 2013). Common data pre-processing methods include data cleaning, data transformation, data integration, and data reduction (García et al. 2015). Data cleaning involves the removal of noise and the correction of inconsistencies in the data. This process is essential to ensure that the data used in further analysis are accurate and reliable; it improves the overall quality of the data by identifying and correcting errors or inconsistencies that may adversely affect the results of data analysis. Data transformation is another important aspect of data pre-processing. This step may include processes such as data scaling (e.g., normalization, standardization), which adjust the values in the dataset to a common scale without distorting differences in the range of values or losing information. This is particularly useful when dealing with data that have different units or scales and can help improve the accuracy of data analysis. Data integration is a process that merges data from multiple sources into a single, coherent data store, often by combining data into a data warehouse that provides a unified view of the data. This step is crucial when dealing with large volumes of data from disparate sources and can greatly enhance the efficiency of data analysis. Finally, data reduction is a technique that reduces the volume of data using methods such as aggregation, elimination of redundant features, or clustering. This technique simplifies the data, making it easier to analyze and interpret, without losing significant information.

The importance of data pre-processing in enhancing the precision of ML models has become evident across a range of fields. For instance, Ojagh et al. (2021) demonstrated that refining raw data through pre-processing can lead to improved air quality predictions; their findings suggest that pre-processing techniques can enhance model performance even in complex applications like air quality monitoring. Aksangür et al. (2022) discussed the impact of data pre-processing techniques on the prediction accuracy of long short-term memory models using real-world data, concluding that data pre-processing enhances data quality as well as the model’s training time and prediction accuracy. Demir and Sahin (2023c) employed a data pre-processing method (i.e., outlier treatment) to investigate the effect of outliers on the prediction performance of models for the slope stability problem; their results showed how handling the outliers in a dataset enhances the models’ prediction performance. Recently, Habib and Okayli (2024) investigated the impact of data pre-processing on the accuracy of ML models in predicting concrete’s compressive strength and found that the choice of data pre-processing method significantly affects the results of predictive models. While the impact of data pre-processing methods on model performance is recognized, their impact on model accuracy has yet to receive sufficient attention in the geotechnical engineering domain, and the effects of data scaling and transformation methods on the learning performance of ML models have not been adequately investigated to date. To narrow this gap, this study aims to extend the current understanding of the influence of data scaling and transformation methods on the performance of UDSS models by systematically assessing a broad spectrum of data pre-processing methods: Range, Z-Score, Log Transformation, Box-Cox, and Yeo-Johnson. For this purpose, models based on the Random Forest (RF), Support Vector Regression (SVR), Cubist, and Stochastic Gradient Boosting (SGB) algorithms are adopted for predicting the UDSS of soil. Furthermore, this study also considers the effect of different sampling ratios on model performance; in this way, the contribution of the data scaling and data transformation tasks to ML model success at various sampling ratios is extensively assessed. The novelty of this work lies, first, in conducting a detailed and systematic comparison of UDSS prediction models utilizing five pre-processing methods and four ML algorithms and, second, in revealing the impact of the data pre-processing methods using the results of ML models built from scaled/transformed datasets. Furthermore, the effect of the sampling ratios on the performance and overfitting of the ML models was investigated, and the prediction performance of the RF, SVR, Cubist, and SGB algorithms was assessed using well-known performance metrics to identify the best performer.

The remainder of this paper is organized as follows. The experimental setting, data description, data pre-processing, prediction models, and model evaluation metrics are described in Sect. 2. The results on the prediction performance of the UDSS models are given in Sect. 3. Discussions are presented in Sect. 4. Lastly, Sect. 5 presents the conclusions of this work and some recommendations for future research.

2 Methodology

2.1 Overview of experimental settings

In the ML domain, two types of data are needed to use ML models to estimate the target variable: the training set, used to build the ML model, and the test set, used to estimate prediction accuracy. The training set is used to estimate the parameters of a specific model architecture, and the test set is applied to choose the best model among all models considered. Since natural groups typically do not exist in regression problems, the simple random sampling (SRS) method was applied in this study to divide the target variable into groups (strata) with a random sampling strategy. In order to conduct a comprehensive investigation of the effect of the training and test sizes, the dataset was divided into training sets (60%, 70%, 80%, and 90%) and test sets (40%, 30%, 20%, and 10%) for hyperparameter estimation, model production, and performance analysis. The UDSS dataset containing 384 observations with six different features was used to build the prediction model. Outliers were eliminated before model construction, and thus 372 samples were considered during model preparation. Assessing UDSS prediction with the ML methods on the training set alone may be biased. Hence, k-fold cross-validation (10-fold CV) was applied together with grid search to reduce overfitting and produce a less biased performance estimate (Demir and Sahin 2023a).
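
As an illustration, a minimal R sketch of this splitting and cross-validation setup is given below, using the caret package listed later in this section. The data frame udss_data and the target column UDSS are placeholder names, not the study's actual code.

```r
# Minimal sketch of the train/test split and 10-fold CV setup (caret).
# `udss_data` and the column name `UDSS` are placeholders.
library(caret)

set.seed(42)                                 # reproducible random sampling
idx <- createDataPartition(udss_data$UDSS,   # random split stratified on the target
                           p = 0.90,         # 90:10 ratio; 0.60/0.70/0.80 likewise
                           list = FALSE)
train_set <- udss_data[idx, ]
test_set  <- udss_data[-idx, ]

ctrl <- trainControl(method = "cv", number = 10)  # 10-fold CV used with grid search
```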

One of the major criticisms of ML algorithms in science is the lack of novel laws, understanding, and knowledge arising from their use. This issue stems from the common practice of treating ML algorithms as black boxes, where the intricate models crafted by machines surpass human understanding (Schmidt et al. 2019). Currently, data pre-processing, cross-validation, and extensive metric-based evaluation processes are employed to mitigate the black-box nature of ML and offer more objective and descriptive insights; these processes were therefore used in this study, as they provide transparency, understanding, and control over the modeling process. The final stage of the ML application is to evaluate the performance of the proposed models in order to determine the most accurate result. The performance measurements (i.e., R2, MSE, RMSE, MAE, and MAPE) were utilized to evaluate each ML model. Fig. 1 presents a schematic view of the steps of the methodology. For reference, the tests were performed on a PC running Windows 10 with an AMD Ryzen 9 3950X processor and 64 GB of RAM. All the code was written in the R programming language (version 4.3.0) with the following main R packages: caret, e1071, randomForest, Cubist, ipred, plyr, kernlab, and gbm.

Fig. 1 The methodology applied during the model generation and performance measurement

2.2 Data description

Two datasets, F-CLAY/7/216 and S-CLAY/7/168, compiled by D’Ignazio et al. (2016), were considered for predicting UDSS. The F-CLAY/7/216 and S-CLAY/7/168 datasets comprise 216 and 168 samples obtained from field vane tests in Finland and Sweden-Norway, respectively. In this paper, these datasets were combined to analyze the overall behavior of clayey soils (n = 384 samples). Both datasets contain seven parameters: the target variable UDSS and six features, namely depth (d), liquid limit (wL), plastic limit (wP), natural water content (wn), vertical effective stress (σ′v), and preconsolidation stress (σ′p). All these parameters play a critical role in determining the strength and stability of clayey soils. Statistical information on the dataset, such as the minimum, maximum, mean, median, and standard deviation, is presented in Table 1. In the dataset, the depths of the samples are quite diverse, ranging from as shallow as 0.5 m to as deep as 24 m. The vertical effective stress varies between a lowest value of 6.9 kPa and a highest value of 212.9 kPa. The natural water content of the samples varies from 17.3% to 180.1%. Furthermore, the liquid limit and plastic limit of the samples vary from 22% to 201.8% and from 2.7% to 73.9%, respectively. Meanwhile, the preconsolidation stress ranges from 15.2 kPa to 315.6 kPa.

Table 1 Statistical information of the input features

When analyzing a multi-featured dataset, it is important to visualize the relationships between the features. A heat map is a convenient tool for this, as it provides a quick and easy way to identify patterns and relationships among the features in the dataset. Pearson’s correlation coefficient is often computed to quantify the correlation between each pair of features and to identify which parameters are strongly or weakly correlated with each other. Fig. 2 shows the computed Pearson’s correlation coefficients for each pairwise feature. As shown in Fig. 2, some correlations between features are stronger than others. For instance, the pairs d - σ′v, σ′p - σ′v, wL - wn, and σ′p - UDSS are strongly correlated (i.e., |r| = 0.82, 0.71, 0.84, and 0.75, respectively). In statistical modeling, it is well known that the existence of strongly correlated variables can significantly influence the efficiency of a model; such variables may introduce redundancy and unnecessary complexity. However, Kutner et al. (2005) argued that correlated variables do not typically affect inferences about mean responses in the data. This suggests that even when variables share a strong correlation, each can still provide unique and valuable insights about the average responses in the dataset, thereby making them essential components of the model. Feature pairs with correlation coefficients between 0.40 and 0.69 are moderately correlated (e.g., d - σ′p, wP - wL, σ′v - UDSS), while pairs with coefficients below 0.40 are weakly correlated.
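
As a brief illustration, the correlation matrix behind such a heat map can be computed in R as sketched below; udss_data is a placeholder for the feature table, and the corrplot package, assumed here for plotting, is only one of several options (it is not among the packages listed in Sect. 2.1).

```r
# Sketch: pairwise Pearson coefficients and a heat map (cf. Fig. 2).
library(corrplot)   # assumed plotting package

corr_mat <- cor(udss_data, method = "pearson")               # Pearson coefficients
corrplot(corr_mat, method = "color", addCoef.col = "black")  # heat map with values
```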

Fig. 2 Pearson correlation matrix of the UDSS dataset

2.3 Data pre-processing

Typically, raw data is not organized in a form ready for an ML application and is frequently challenging for researchers to work with. Data pre-processing in an ML workflow is the process of gathering and transforming raw data into a format that can be accurately and quickly analyzed (Sahin 2023). A typical data pre-processing pipeline consists of several processes, such as data cleaning and refining, removal of outliers, missing-data interpolation, and feature scaling (i.e., normalization, standardization, and transformation). For this study, several data pre-processing steps were performed: data cleaning with the removal of outliers, data scaling, and data transformation.

  • Removing Outliers

In the data science literature, an outlier is defined as an abnormal, deviant, or discordant data point relative to the remaining dataset. Data corruption during data collection and incorrect measurement of observations can cause outliers (Gareth et al. 2013). Detecting and handling outliers is a crucial step in improving the performance of an ML model, since outliers can skew the results and lead to inaccurate predictions. Therefore, it is important to identify and remove them from the dataset or adjust them to fit within the expected range. By doing so, the model can be trained on a more representative dataset, resulting in better accuracy and generalization capability. A classical and popular tool for detecting outliers is the boxplot (Tukey 1977). An individual data point is marked as a possible outlier when its distance from the corresponding quartile (Q1 or Q3) exceeds 1.5 times the interquartile range (IQR). According to the boxplot results, 12 outliers were identified for the target variable (i.e., UDSS) in the dataset (Fig. 3). The boxplot analysis showed that the maximum and minimum outlier values of the UDSS variable were 75 kPa and 43 kPa, respectively. Thus, a total of 12 rows were eliminated from the raw dataset, leaving 372 rows for subsequent analysis. Figure 3 depicts the boxplot graphs of the raw dataset (384 samples before removing outliers) and the modified dataset (372 samples after removing outliers).
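
A minimal R sketch of this 1.5 × IQR fence rule, with udss_data and the column name UDSS as placeholders, is as follows:

```r
# Sketch of the 1.5*IQR boxplot rule applied to the target variable.
q     <- quantile(udss_data$UDSS, probs = c(0.25, 0.75))  # Q1 and Q3
iqr   <- q[2] - q[1]                                      # interquartile range
lower <- q[1] - 1.5 * iqr                                 # lower fence
upper <- q[2] + 1.5 * iqr                                 # upper fence

udss_clean <- udss_data[udss_data$UDSS >= lower &
                        udss_data$UDSS <= upper, ]
# In this study, 12 of the 384 rows fell outside the fences, leaving 372 rows.
```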

Fig. 3 Box plot graphs for the original and the modified dataset

  • Data scaling

In ML, the scale and distribution may differ for each variable in a dataset, which can pose a challenge when creating a model. These differences in scale across input variables can increase the difficulty of the problem being modeled. When a model has large weight values, it is often unstable, which can lead to a number of issues, including poor performance with a high generalization error (Brownlee 2020). To address this issue, it is crucial to normalize or scale the input data before training the model; this ensures that all variables have a comparable scale, making the model more robust and accurate. This paper employs two popular and widely used methods to scale the input variables: Range (Min-Max normalization) and Z-Score standardization.

Range is a data pre-processing technique that scales the feature values of a dataset to a range between 0 and 1. The goal of this technique is to adjust the scale of values in a dataset to a common range, making it easier to compare different features. In Range, the feature values are rescaled using the following formula:

$$v' = \frac{v - \min_A}{\max_A - \min_A}$$
(1)

where \(\min_A\) and \(\max_A\) represent the minimum and maximum values of feature A, respectively. The original feature value \(v\) is transformed into the normalized value \(v'\) using this formula, which maps the maximum and minimum feature values to 1 and 0, respectively.

Z-Score standardization was carried out to ensure the models’ independence from the scales of certain features. By standardizing the values, the data was transformed into a format that had a mean of 0 and a variance of 1. This is done by subtracting the mean from each value and then dividing by the standard deviation. Z-Score can be determined using the following equation:

$$z = \frac{v - \mu_A}{\sigma_A}$$
(2)

where \(\sigma_A\) and \(\mu_A\) are the standard deviation and mean of feature A. The original and standardized feature values are given by \(v\) and \(z\), respectively.
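
Both scaling methods are available through caret's preProcess function, as sketched below under the assumption that train_set and test_set come from the split described in Sect. 2.1; this is a sketch, not the study's actual code.

```r
# Sketch of both scaling methods via caret::preProcess.
library(caret)

pp_range  <- preProcess(train_set, method = "range")              # Eq. (1)
pp_zscore <- preProcess(train_set, method = c("center", "scale")) # Eq. (2)

# Parameters (min/max or mean/sd) are learned on the training set and then
# applied to both partitions so no test-set information leaks into training.
train_scaled <- predict(pp_zscore, train_set)
test_scaled  <- predict(pp_zscore, test_set)
```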

  • Data transformation

It is common for some algorithms to assume that the data are normally distributed. In real-world scenarios, however, data are often skewed, meaning they are distributed unevenly. To address this issue, data transformation is usually employed to decrease the skewness of the data (Son et al. 2019). Features that do not follow a normal distribution may not be appropriate for ML methods and need to be transformed to improve the generalization of the ML algorithms (Nguyen et al. 2021). Therefore, in this study, the Log, Box-Cox, and Yeo-Johnson transformations were applied to transform the feature values into an approximately symmetric form for the training process.

Log Transformation is one of the most prevalent approaches to transform data (Changyong et al. 2014). It is computed as \(y = \log_b(x)\), where \(y\) is the log-transformed value, \(x\) represents the input data, and \(b\) is the base of the logarithm, commonly between 2 and 10. The Box–Cox Transformation is defined by the following formula, given by Box and Cox (1964):

$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda} & (\lambda \ne 0) \\ \log y & (\lambda = 0) \end{cases}$$
(3)

where \(y\) stands for the original feature value, which is transformed to \(y^{(\lambda)}\) using the parameter \(\lambda\). A limitation of the Box–Cox Transformation is that it is only applicable to positive data. Yeo and Johnson (2000) proposed an alternative, improved transformation that can handle both negative and positive data. Its primary goal is to regulate the skewness of the original variable regardless of its sign. The Yeo–Johnson Transformation is defined as follows:

$$\psi(y,\lambda) = \begin{cases} \dfrac{(y+1)^{\lambda} - 1}{\lambda} & (\lambda \ne 0,\ y \geqslant 0) \\ \log(y+1) & (\lambda = 0,\ y \geqslant 0) \\ \dfrac{-\left[ (-y+1)^{2-\lambda} - 1 \right]}{2-\lambda} & (\lambda \ne 2,\ y < 0) \\ -\log(-y+1) & (\lambda = 2,\ y < 0) \end{cases}$$
(4)

The skewness value of each feature was calculated using the approach described by Cramer (1946) to determine the distribution type (i.e., symmetric, moderately skewed, or highly skewed). Thereafter, the transformation approaches were employed to reduce the skewness of each feature. The raw and transformed datasets, along with the distribution types, are given in Table 2. When the skewness values of the features in the raw data are evaluated, it is observed that the skewness of almost all features (except d and wn) is greater than 1, indicating that the distributions of these features are highly skewed. The skewness score of the d feature is about 0.91, which means its distribution is moderately skewed, while wn is approximately normally distributed and corresponds to a symmetric view. Although the distributions of depth, liquid limit, plastic limit, natural water content, preconsolidation stress, and vertical effective stress are skewed in the raw data, all features attain a symmetric form after the transformation process, except wP under the Log Transformation. Fig. 4 presents an overview of the histograms before and after the data scaling/transformation methods.
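
A sketch of how these transformations can be applied in R is given below. Box-Cox and Yeo-Johnson are available through caret::preProcess, with \(\lambda\) estimated from the data; the log transform is applied directly, with base 10 chosen here purely as an example, and train_set is again a placeholder.

```r
# Sketch of the three transformations and a skewness check.
library(caret)
library(e1071)   # provides skewness()

pp_bc <- preProcess(train_set, method = "BoxCox")      # Eq. (3); positive data only
pp_yj <- preProcess(train_set, method = "YeoJohnson")  # Eq. (4); any sign

train_bc  <- predict(pp_bc, train_set)
train_yj  <- predict(pp_yj, train_set)
train_log <- log10(train_set)   # simple log transform (base 10 as an example)

sapply(train_bc, skewness)      # verify skewness is near zero after transforming
```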

Table 2 The skewness value and distribution type of the modified and transformed dataset
Fig. 4 Histogram views of the features before and after data scaling/transformation

2.4 Machine learning regression

2.4.1 Random forest

Breiman (2001) developed the Random Forest (RF) model, a popular technique used for regression and classification problems. One of the key benefits of RF is that it requires only a small number of hyperparameters, which are easily tunable; RF can thus be implemented with minimal effort, making it an ideal choice for many use cases. Another advantage of RF is its ability to reduce the variance of the model without increasing the bias, which helps improve the generalization capability of the model. Finally, RF provides useful error information that can help users better understand the strengths and weaknesses of the model; this information can be used to fine-tune the model and improve its performance over time (Demir and Sahin 2022; Quinto 2020). The first step of the RF algorithm is to construct multiple bootstrapped samples of the original training data, so that each decision tree in the forest is trained on a unique subset of the data. Subsequently, for each tree, a random subset of input features is selected and used to split each node (Palczewska et al. 2014), which introduces variability in the decision-making process and reduces the correlation between the trees. Afterwards, the final prediction of the model is determined based on whether the problem is classification or regression: for classification, the final prediction is obtained by a majority vote of the decision trees, whereas for regression it is the average of the outputs of the individual trees (Demir and Sahin 2022). Further information on RF can be found in the literature (Breiman 2001; Hastie et al. 2009). The performance of the RF algorithm can be improved by adjusting its main parameters, the number of trees (ntree) and the number of features used to grow each tree (mtry).
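
A minimal caret-based sketch of this tuning is shown below; in caret's "rf" method only mtry is a native tuning parameter, while ntree is passed through to randomForest(). The grid values and the objects train_set and ctrl (from the sketch in Sect. 2.1) are illustrative.

```r
# Sketch of RF tuning; `train_set`, `ctrl`, and the grid are illustrative.
library(caret)
library(randomForest)

rf_fit <- train(UDSS ~ ., data = train_set,
                method    = "rf",
                trControl = ctrl,                     # 10-fold CV
                tuneGrid  = expand.grid(mtry = 2:6),  # candidate features per split
                ntree     = 500)                      # passed to randomForest()
```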

2.4.2 Support vector regression

Support Vector Regression (SVR) is a type of regression analysis that follows the same principles as the Support Vector Machine (SVM). SVM is a popular and powerful supervised learning method in the ML domain that has gained much attention due to its high accuracy and ability to handle complex datasets. SVM uses the principle of structural risk minimization to find a hyperplane that separates data into different categories (Chandaka et al. 2009). The idea of SVM is straightforward: find a decision boundary that maximizes the margin between two categories so that they can be separated as cleanly as possible (Kou et al. 2023). SVM achieves this by identifying the support vectors, the data points closest to the decision boundary. The main difference between SVM and SVR is their objective: SVR is a regression technique used to predict the value of a continuous variable, whereas SVM is a classification technique applied to separate data into different classes. Despite their differences, both algorithms share many of the same features, including the use of a kernel function to transform the input data into a higher-dimensional space, the identification of support vectors, and the ability to handle non-linear data. One of the most significant benefits of SVR is that its computational complexity does not depend on the dimensionality of the input space, so it is highly efficient and can easily handle large datasets. SVR also has excellent generalization capability: it is designed to identify complex patterns in the data and can therefore produce highly accurate predictions, even when working with noisy or incomplete data (Awad and Khanna 2015). Moreover, selecting an appropriate kernel function (i.e., linear, polynomial, sigmoid, or radial basis function) is important for building a stable and precise SVR model; the radial basis function kernel is the most widely used among them.
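
As an illustration, a radial-kernel SVR can be fitted with the e1071 package (one of the packages listed in Sect. 2.1), as sketched below; the gamma and cost values echo the best-model hyperparameters reported in Sect. 3.1.2 and are shown only as examples.

```r
# Sketch of a radial-kernel SVR; `train_set`/`test_set` are placeholders.
library(e1071)

svr_fit <- svm(UDSS ~ ., data = train_set,
               type   = "eps-regression",
               kernel = "radial",
               gamma  = 0.2147543,   # kernel width (example value)
               cost   = 1)           # regularization constant C

svr_pred <- predict(svr_fit, newdata = test_set)
```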

2.4.3 Cubist

Cubist is a rule-based model derived from an extended version of Quinlan’s M5 model tree (Kuhn et al. 2012; Quinlan 1992; Quinlan 1993). Unlike many other ML algorithms, it was originally distributed as commercial software, but the source code was released under an open-source license in 2011 (Kuhn and Johnson 2013). It is a useful tool for building rule-based models that balance the need for accurate prediction with understandability: Cubist typically produces more precise results than simple methods while being easier to comprehend than neural networks (Rulequest 2020). The algorithm consists of four steps: branching, regression model development, pruning, and smoothing. Cubist creates regression models using one or more rules, each based on a combination of conditions and a linear function to estimate target values. The algorithm can execute multiple rules and generate many models, achieving higher accuracy by combining them (Nguyen et al. 2019). Two hyperparameters need to be tuned to improve the performance of the model: the neighbor function and the committee function (Kuhn 2020). The neighbor function applies the nearest-neighbor algorithm to each leaf and then adjusts the rule-based predictions based on the most similar samples; this process helps minimize model error and prevent overlapping between rules. Committees, on the other hand, are a kind of boosting with similar properties, but in committees the case weights are not changed; instead, the outcome values are modified at each iteration to correct under-predictions from previous iterations.
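
A sketch of tuning these two hyperparameters with caret's "cubist" method follows; the grid is illustrative and includes the committees = 20, neighbors = 9 setting reported for the best model in Sect. 3.1.3.

```r
# Sketch of Cubist tuning over committees and neighbors.
library(caret)
library(Cubist)

cubist_fit <- train(UDSS ~ ., data = train_set,
                    method    = "cubist",
                    trControl = ctrl,   # 10-fold CV from Sect. 2.1
                    tuneGrid  = expand.grid(committees = c(1, 10, 20),
                                            neighbors  = c(0, 5, 9)))
```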

2.4.4 Stochastic gradient boosting

Stochastic Gradient Boosting (SGB), introduced by Friedman (2002), is a minor modification of the gradient boosting algorithm used for both regression and classification problems. SGB seeks to address the performance issues (i.e., improving accuracy, reducing overfitting) associated with traditional gradient boosting algorithms (De’ath 2007). Specifically, it randomly subsamples the training data (the fraction of the subsample relative to the entire training dataset is typically set to 0.4–0.6) rather than using the entire training dataset to compute the gradient of the loss function at each iteration. This randomization leads to a reduction in variance, improved computational efficiency, and reduced overfitting (Friedman 2002).

SGB generates several regression trees sequentially using a stepwise model-fitting approach and model averaging (Dube et al. 2015). A regression tree is built from a random subsample of the dataset at each iteration (Friedman 2002). The objective of SGB is to find the optimal function that minimizes the loss function (i.e., the degree of error of the model); the larger the loss, the poorer the model’s predictions. To reduce the loss and the error rate, the best approach is to make the loss function decline along the gradient direction (Chang et al. 2018). In gradient boosting, the optimal function is built by fitting a weak learner to the residuals of the target variable (i.e., the differences between the actual target values and the current predictions) and adding its prediction to the model. In SGB, the residuals are calculated on the random subset, and the weak learner is trained on those residuals. The final prediction is obtained by accumulating the contributions from all iterations, and the weights of the weak learners are updated at each iteration to minimize the loss function. In this way, SGB balances the trade-off between computational efficiency and accuracy by using a random subset of the data to approximate the gradient of the loss function (De’ath 2007).
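
A caret-based sketch of an SGB model follows; the tuning grid is illustrative and includes the best-model setting reported in Sect. 3.1.4, while bag.fraction (gbm's subsampling fraction, 0.5 by default) corresponds to the random subsampling described above.

```r
# Sketch of SGB via caret's "gbm" method; grid values are illustrative.
library(caret)
library(gbm)

sgb_fit <- train(UDSS ~ ., data = train_set,
                 method    = "gbm",
                 trControl = ctrl,
                 tuneGrid  = expand.grid(n.trees           = c(100, 150, 200),
                                         interaction.depth = c(1, 3, 5),
                                         shrinkage         = 0.1,
                                         n.minobsinnode    = 10),
                 bag.fraction = 0.5,   # random subsample fraction per iteration
                 verbose      = FALSE)
```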

2.5 Assessment of model performances

In regression analyses, it is important to evaluate the proximity of predicted values to observed values. This evaluation is based on numeric predictions for a set of n test samples, together with the measured (actual) values and the predicted (estimated) values for those test cases. Model evaluation is a crucial process that involves choosing the most appropriate model among different types, features, and tuning parameters (Subasi 2020). Selecting the most accurate model involves utilizing different performance metrics for quality assessment, and the choice of metrics is usually based on the intended application of the model; there is no single, unified, standard metric that can accurately assess regression results (Chicco et al. 2021). For this study, the standard performance metrics MAPE, MAE, MSE, RMSE, and R2 were used to quantify the performance of the regression models. Their basic definitions and formulas are given in Fig. 5. These metrics offer various perspectives on the prediction models: a lower MAE, MSE, MAPE, or RMSE value indicates a more accurate regression model, while a higher R2 value is considered better. An R2 near zero indicates that the model predictions have no linear association with the outcome, whereas a value near 1.0 means an almost perfect fit. It is also noted that a MAPE of less than 10% indicates highly accurate prediction performance; the other MAPE categories can be classified as good for 10–20%, reasonable for 20–50%, and poor for ≥ 50% (Lewis 1982).
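
For concreteness, the five metrics can be computed with a small helper function such as the sketch below. The function name is hypothetical, and the R2 definition used here (1 - SSE/SST) is one common convention; caret, for example, reports R2 as a squared correlation by default.

```r
# Sketch of the five evaluation metrics; `actual` and `predicted` are numeric
# vectors of test-set observations and model outputs.
regression_metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(MAE  = mean(abs(err)),
    MSE  = mean(err^2),
    RMSE = sqrt(mean(err^2)),
    MAPE = 100 * mean(abs(err / actual)),  # in percent
    R2   = 1 - sum(err^2) / sum((actual - mean(actual))^2))
}

# Hypothetical usage with the SVR predictions sketched in Sect. 2.4.2:
# regression_metrics(test_set$UDSS, svr_pred)
```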

Fig. 5 Performance evaluation metrics for regression models

3 Results

3.1 Model evaluation

This section systematically compares how performance varies as different data scaling/transformation methods and ML algorithms are combined. To this end, the data scaling and transformation methods, namely Range, Z-Score, Log, Box-Cox, and Yeo-Johnson, were used during the generation of the models. Four ML algorithms, RF, SVR, Cubist, and SGB, were utilized for UDSS prediction, and the best results are presented separately in tables and figures for each algorithm. Moreover, performance results for every generated model are provided in the Supplementary File. Details of the prediction results of the ML models are presented in the following sections.

3.1.1 Evaluation of RF model

The test results of the RF models were reported for varying sampling ratios based on the Range-, Z-Score-, Log-, Box-Cox-, and Yeo-Johnson-processed data. Table 3 presents the best results of the RF regression model for each sampling ratio. It can be seen from Table 3 that the Z-Score generally led to the best results for the RF models in predicting UDSS. The models’ performances are similar in terms of the MAPE metric, with minimal differences. The lowest RMSE and highest R2 scores (i.e., 4.61 and 0.75, respectively) were obtained with a 90:10 sampling ratio on the Z-Score-scaled dataset (the hyperparameters of the best model are ntree = 500 and mtry = 2). Among the four sampling ratios, the model with a 60:40 sampling ratio exhibited the poorest performance, with a high prediction error that is clearly observable from the reported scores. This is expected, as a larger training ratio assists the model learning process and thus enhances the prediction performance.

Table 3 Prediction results of the RF models based on the test data

Finally, the variation of the actual and predicted UDSS values across the test sets is presented in Fig. 6. In this figure, the predicted results and actual values are drawn as lines and points, respectively, for each sampling ratio. When the predicted and actual data are compared, it is evident that the predicted results differ from the actual data at the peak points, indicating that these extremes were not accurately estimated. In most cases, however, the predicted values followed nearly the same trend as the actual values for all four models.

Fig. 6 Comparison of actual and predicted values using the test set based on the RF model

3.1.2 Evaluation of SVR model

When the results of the SVR models in Table 4 were investigated, the best results were obtained with the Log and Box-Cox transformations for the various sampling ratios. SVR produced the lowest RMSE (3.93036) and MAPE (17.19957%) with a 90:10 sampling ratio for the Box-Cox transformation (the hyperparameters of the best model are kernel = Radial, gamma = 0.2147543, and C = 1). The R2 metric also provides useful insight into the prediction behavior of the method: the R2 value of 0.80729 is a satisfyingly good score, revealing that the model predictions closely track the test samples. On the other hand, the highest RMSE and MSE values were 4.93710 and 24.37495, respectively, for the Log-transformed dataset with the 60:40 sampling ratio. Moreover, the three models with 60%, 70%, and 80% training ratios yielded closely comparable prediction performances.

The comparison between the actual and predicted values of the SVR models for different sampling ratios is shown in Fig. 7 for the test set. The comparisons show the greatest scatter for the 60:40, 70:30, and 80:20 sampling ratios; these models appear to underpredict very low and very high actual values while overpredicting mid-range values. The Box-Cox model achieves better testing performance in predicting UDSS for the 90:10 sampling ratio than the other SVR models.

Table 4 Prediction results of the SVR models based on the test set
Fig. 7 Comparison of actual and predicted values using the test set based on the SVR model

3.1.3 Evaluation of Cubist model

The performance results of the different Cubist models were compared using the set of assessment metrics presented in Table 5. It can be seen that including a transformation method (i.e., Box-Cox) as part of the data pre-processing contributes a consistent improvement to the performance of the prediction model. Additionally, increasing the training size consistently improves performance over the previous models. Among the models, the Cubist model combined with the Box-Cox transformation performed best on the RMSE (3.37838), MSE (11.41344), MAE (2.62978), R2 (0.87107), and MAPE (16.35228%) metrics for the 90% training set (the hyperparameters of the best model are committees = 20 and neighbors = 9), followed by the Range, Yeo-Johnson, and Log Transformation models. The Cubist model with Log Transformation built on a 60% training sample showed the worst performance among the models considered in this study. Consequently, increasing the training set had a positive effect on learning performance in this modeling example, as in the others.

Table 5 Prediction results of the Cubist models based on the test set

Fig. 8 compares predicted and observed values for the best models at the four training/test sizes. It can be seen from the figure that the estimates at the 60% training size with the Log Transformation model and at the 80% training size with Range are inaccurate for very high values. The model based on the 70% training size with the Yeo-Johnson transformation generally tends to underpredict very low actual values. The Cubist model with the Box-Cox transformation at the largest training sample size (i.e., 90%) produced the prediction values closest to the original values.

Fig. 8 Comparison of actual and predicted data based on the Cubist model

3.1.4 Evaluation of SGB model

For the SGB models, the Z-Score and Box-Cox transformations resulted in the best metric scores for each sampling ratio. The results of the SGB models given in Table 6 show that the model produced with a 90% training set using the Box-Cox transformation is the best model among the sampling ratios, with RMSE, R2, MSE, MAE, and MAPE values of 4.20446, 0.77875, 17.67748, 3.14526, and 18.41018%, respectively (the hyperparameters of the best model are n.trees = 150, interaction.depth = 3, shrinkage = 0.1, and n.minobsinnode = 10). The 80% training size model with Z-Score is the second-best model, with an RMSE of 4.5472, while the 60% training size yielded the poorest-performing model in the assessment.

Table 6 Prediction results of the SGB models based on the test sets

The best model for each training set was identified and plotted in Fig. 9 as a scatter plot of actual versus predicted values to visually explore the scatter around the regression line. The results show that the relationship between the actual and predicted values was very similar for the 60:40 Z-Score model and the 70:30 Box-Cox transformation model, while the 80:20 Z-Score model had a somewhat closer agreement between actual measurements and predicted values. Nevertheless, the 90:10 Box-Cox transformation model showed a better actual-versus-predicted relationship than the other models.

Fig. 9 Comparison of actual and predicted data based on the SGB model

4 Discussions

The prediction models are compared and discussed in this section based on all data pre-processing methods for the different sampling-ratio scenarios (Fig. 10). Moreover, a comparison was made with other results from the literature based on the same UDSS dataset.

Fig. 10 demonstrates that the ML models generally achieve better accuracy with larger training sets. Accordingly, the best-performing models were observed for the 90% training set in predicting the UDSS of the studied soil. When Fig. 10 is examined more closely, it can be recognized that the UDSS prediction models performed similarly to those built on the Raw data for the 60:40, 70:30, and 80:20 sampling ratios, despite the data transformation or data scaling processes. In other words, the data transformation/scaling methods had no discernible impact on model performance for these sampling ratios. In the remaining scenario (i.e., 90:10), however, the prediction models performed better. The Box-Cox-transformed Cubist model yielded the best performance among the algorithms used in this study, with an R2 of 0.87 for the 90% training set, compared with R2 = 0.81 for the Raw data when Cubist was employed, corresponding to a notable increase (about 7.4%) in model accuracy for predicting UDSS. The Cubist model was the second-best performing model (R2 = 0.85) when used with the Yeo-Johnson Transformation. The results also showed that the prediction performances of the regression models were sensitive to the size of the training dataset; it can therefore be concluded that increasing the training dataset typically improves accuracy in most cases. For the other ML algorithms (RF, SVR, and SGB), interestingly, the model accuracies were approximately similar to those of the Raw dataset, regardless of the sampling ratios and data scaling/transformation methods, which can be attributed to the algorithms’ constitutive structure. Another important finding is that the transformation-based models (i.e., Box-Cox, Log, and Yeo-Johnson) provided slightly better results than the scaling-based models (i.e., Range and Z-Score) according to the performance metrics.

Fig. 10 Comparative assessment of the prediction models for each data scaling/transformation method and sampling ratio

The results of this study were also compared with other results from the literature. Several researchers have investigated ML methods for creating UDSS prediction models using the same dataset. For instance, Zhang et al. (2021) utilized ML algorithms for UDSS prediction, and their best success rate for the test set was R2 = 0.73 and RMSE = 4.40. Recently, Bherde et al. (2024) built prediction models in this field using ensemble-based methods; their ensemble-based XGBoost model achieved the highest R2 (0.80), with an RMSE of 4.51. The best-performing model presented in this study outperformed those of the mentioned studies, yielding R2 and RMSE values of 0.87 and 3.378, respectively. It should be noted that, despite these encouraging results, comparing the results of this study with previous research is not straightforward, primarily due to differences in the selection of input variables, splitting ratios, sampling strategies, and other model evaluation techniques, all of which directly affect model performance. This issue highlights the complexity and diversity of approaches in the field and suggests the need for further research to refine and standardize methodologies.

5 Conclusions and recommendations for future work

In this research, the impact of different pre-processing steps on the accuracy of four ML algorithms for UDSS prediction was evaluated on a dataset of 384 clay samples. Data scaling and data transformation techniques, including Range, Z-Score, Log, Box-Cox, and Yeo-Johnson Transformation, were applied as a preliminary step to the dataset, and then four ML algorithms, RF, SVR, Cubist, and SGB, were applied for UDSS prediction. The prediction performance of the four ML algorithms was computed before and after applying the data pre-processing for an objective comparison, and the effect of the sampling ratio was also included in the modeling process. The results showed that the accuracy of the ML algorithms improved significantly as the sampling ratio increased. It can also be concluded that the Cubist model combined with the Box-Cox transformation provided the best performance on the prediction metrics for the 90% training set, with R2 = 0.87, RMSE = 3.38, MSE = 11.41, MAE = 2.63, and MAPE = 16.35%, corresponding to an acceptable performance in predicting UDSS. The transformation-based models (Box-Cox, Log, and Yeo-Johnson) generally outperformed the scaling-based models (Range and Z-Score) in terms of prediction accuracy. On the other hand, all performance measurements (i.e., R2, MAE, RMSE, MAPE, and MSE) of the RF, SVR, and SGB models were similar to those of the Raw dataset, regardless of the sampling ratios and data scaling/transformation methods.

The study has some limitations, which suggest promising directions for future research. The dataset and each pre-processing approach have their own individual impact on the prediction performance of an ML algorithm, and the performance of the models assessed in this study may not generalize to other UDSS datasets with different soil characteristics. Therefore, further research could explore the effect of data scaling/transformation methods on models built from different datasets. Furthermore, metaheuristic algorithms could be adopted in future work to tune the hyperparameters of the prediction models and obtain a more robust model for UDSS estimation of cohesive soils. Additional investigations could also address feature engineering, such as feature extraction and feature selection, to optimize the learning process of the ML models; feature engineering may further contribute to the performance of the prediction models.