1 Introduction

Geotechnical parameters, such as shear strength, permeability, consistency limits, and compaction, play a significant role in the construction of civil engineering projects. These parameters provide insight into soil behavior, which is essential for making informed decisions about the design and construction of geotechnical features (Vanapalli et al. 1996). For instance, the undrained shear strength (UDSS) of soils is a fundamental parameter that is in high demand for solving many practical problems. It is widely used to determine the stability of slopes, the bearing capacity of shallow foundations, embankment settlement and stability, the bearing capacity of piles, and so on (Wong et al. 2021).

UDSS is particularly important in cohesive soils, where the pore water pressure generated during shearing cannot dissipate quickly. Conventional methods, such as in-situ or laboratory testing, are frequently employed to provide design information on the UDSS. Even though these methods can be very efficient, they can also be very time-consuming and costly, especially when dealing with soils that display spatial variability. For this reason, determining the UDSS of cohesive soils is a source of concern due to the challenges associated with in-situ measurements or subsequent laboratory testing (Lunne et al. 2006; Phoon and Kulhawy 1999). In practice, in order to expedite the design process, geotechnical engineers frequently apply correlations already established between UDSS and various soil indices (Hansbo 1957; Kulhawy and Mayne 1990; Skempton 1954). However, it is essential to consider the limitations of these approaches, as there can be differences, such as in site geology and soil characteristics, between the data source and the site where the correlations will be applied (D’Ignazio et al. 2016). In other words, the results drawn from these correlations may deviate from the actual UDSS values of the soils.

Over the past few years, there has been rapid development in the field of artificial intelligence (AI). This development has led to the emergence of machine learning (ML) algorithms that are now widely used in various fields. ML applications have transformed the way complex problems are tackled, enabling new and innovative solutions. Due to their learning ability, ML algorithms have become a desirable tool for revealing relationships between many soil parameters. Consequently, growing interest in the potential applications of ML algorithms to geotechnical problems has been witnessed over the past decades. Research covers a wide range of problems, such as soil liquefaction and liquefaction-induced hazards (Demir and Sahin 2023b; Durante and Rathje 2021; Goh and Goh 2007; Sahin and Demir 2023), slope stability (Aminpour et al. 2023; Sabri et al. 2023; Huang et al. 2023), pile bearing capacity (Benbouras et al. 2021; Kardani et al. 2020), and other specific problems (Baghbani et al. 2022; Chaabene et al. 2020; Chen et al. 2024; Kahraman and Ozdemir 2022; Niu et al. 2023; Qi et al. 2023; Shi et al. 2023a, b; Yin et al. 2023; Zhang et al. 2022; Zhao et al. 2023). Furthermore, several researchers have utilized ML algorithms to solve the UDSS prediction problem. For instance, Mbarak et al. (2020) predicted UDSS using Standard Penetration Test (SPT) results and soil consistency indices. Three ML approaches, Random Forest (RF), Gradient Boosting (GBM), and stacked models, were developed and employed on a dataset from different projects across Turkiye, and were also compared with simple and multiple linear regression models. They concluded that the ML models exhibited superior prediction capabilities compared to the conventional methods. Pham et al. (2020) developed a hybrid ML model combining RF and Particle Swarm Optimization (PSO) to predict the UDSS of soil using the experimental results of 127 soil samples; the RF-PSO model outperformed the single RF model without optimization, predicting UDSS with high accuracy. Zhang et al. (2021) utilized Bayesian-optimized eXtreme Gradient Boosting (XGBoost) and RF algorithms to predict the UDSS of soft clays. The performance of the proposed methods was compared with two transformation models from previous works and three baseline ML algorithms, and the results revealed that the XGBoost and RF models outperformed the other approaches. Tran et al. (2022) applied two novel hybrid ML approaches, ANFIS-CA (Adaptive Neuro-Fuzzy Inference System-Cultural Algorithm) and ANFIS-PSO, to predict the UDSS of sensitive clays using five input parameters. Their results showed that the ANFIS-PSO model yielded a promising correlation coefficient of 0.715 for predicting the UDSS of sensitive clays with limited input parameters. Länsivaara et al. (2023) compared the performance of traditional and ML-based models for UDSS, considering the influence of data coherence; the ML-based models outperformed the traditional ones for two different datasets. They also showed that including additional variables can improve training-set performance but worsen test-set prediction. Reviewing the existing research reveals that ML algorithms are capable of reasonably estimating the UDSS of soils.

It is well known that ML algorithms are data-driven and that the prediction performance of ML approaches is affected by the application of data pre-processing and the size of the training set; it is therefore essential to consider the data pre-processing steps when building a robust ML model (García et al. 2015; Marsland 2011). Data pre-processing methods typically involve modifying the training dataset through the removal, addition, or transformation of the training set data (Kuhn and Johnson 2013). Common data pre-processing methods include data cleaning, data transformation, data integration, and data reduction (García et al. 2015). Data cleaning involves the removal of noise and the correction of inconsistencies in the data. This process is essential to ensure that the data used in further analysis are accurate and reliable; it improves the overall quality of the data by identifying and correcting errors or inconsistencies that may adversely affect the results of data analysis. Data transformation is another important aspect of data pre-processing. This step may include processes such as data scaling (e.g., normalization, standardization), which adjust the values in the dataset to a common scale without distorting differences in the range of values or losing information. This is particularly useful when dealing with data that have different units or scales and can help improve the accuracy of data analysis. Data integration is a process that merges data from multiple sources into a single, coherent data store, often by combining data into a data warehouse that provides a unified view of the data. This step is crucial when dealing with large volumes of data from disparate sources and can greatly enhance the efficiency of data analysis. Finally, data reduction is a technique that reduces the volume of data using methods such as aggregation, elimination of redundant features, or clustering. This technique simplifies the data, making it easier to analyze and interpret, without losing significant information.

The importance of data pre-processing in enhancing the precision of ML models has become evident across a range of fields. For instance, Ojagh et al. (2021) demonstrated that refining raw data through pre-processing can lead to improved air quality predictions; their findings suggest that pre-processing techniques can enhance model performance even in complex applications like air quality monitoring. Aksangür et al. (2022) discussed the impact of data pre-processing techniques on the prediction accuracy of long short-term memory models using real-world data, concluding that data pre-processing enhances data quality as well as the model’s training time and prediction accuracy. Demir and Sahin (2023c) employed a data pre-processing method (i.e., outlier treatment) to investigate the effect of outliers on the prediction performance of models for the slope stability problem; their results showed how handling the outliers in a dataset enhances the models’ prediction performance. Recently, Habib and Okayli (2024) investigated the impact of data pre-processing on the accuracy of ML models in predicting concrete’s compressive strength and found that the choice of data pre-processing method significantly affects the results of predictive models. While the impact of data pre-processing methods on model performance is recognized, their impact on model accuracy has yet to receive sufficient attention in the geotechnical engineering domain, and the effects of data scaling and transformation methods on the learning performance of ML models have not been adequately investigated to date. To narrow this gap, this study aims to extend the current understanding of the influence of data scaling and transformation methods on the performance of UDSS models by systematically assessing a broad spectrum of data pre-processing methods: Range, Z-Score, Log Transformation, Box-Cox, and Yeo-Johnson. For this purpose, models based on the Random Forest (RF), Support Vector Regression (SVR), Cubist, and Stochastic Gradient Boosting (SGB) algorithms are adopted for predicting the UDSS of soil. Furthermore, this study also considers the effect of different sampling ratios on model performance; in this way, the contribution of the data scaling and data transformation tasks to ML model success at various sampling ratios is extensively assessed. The novelty of this work lies, first, in conducting a detailed and systematic comparison of UDSS prediction models utilizing five pre-processing methods and four ML algorithms and, second, in revealing the impact of the data pre-processing methods using the results of ML models built from scaled/transformed datasets. Furthermore, the effect of the sampling ratios on the performance and overfitting of the ML models was investigated, and the prediction performance of the RF, SVR, Cubist, and SGB algorithms was assessed using well-known performance metrics to identify the best performer.

The remainder of this paper is organized as follows. The experimental setting, data description, data pre-processing, prediction models, and model evaluation metrics are described in Sect. 2. The results on the prediction performance of the UDSS models are given in Sect. 3. Discussions are presented in Sect. 4. Lastly, Sect. 5 presents the conclusions of this work and some recommendations for future research.

2 Methodology

2.1 Overview of experimental settings

In the ML domain, two types of data are needed to use ML models to estimate the target variable: the training set, used to build the ML model, and the test set, used to estimate prediction accuracy. The training set is used to estimate the parameters of a specific model architecture, and the test set is applied to choose the best model among all models considered. Since natural groups typically do not exist in regression problems, the simple random sampling (SRS) method was applied in this study to divide the target variable into groups (strata) with a random sampling strategy. In order to conduct a comprehensive investigation of the effect of the training and test sizes, the dataset was divided into training sets (60%, 70%, 80%, and 90%) and test sets (40%, 30%, 20%, and 10%) for hyperparameter estimation, model production, and performance analysis. The UDSS dataset containing 384 observations with six different features was used to build the prediction model. Outliers were eliminated before model construction, and thus 372 samples were considered during model preparation. Assessing UDSS prediction with the ML methods on the training set alone may be biased. Hence, k-fold cross-validation (10-fold CV) was applied together with grid search to reduce overfitting and produce a less biased performance estimate (Demir and Sahin 2023a).
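
As an illustration, a minimal R sketch of this splitting and cross-validation setup is given below, using the caret package listed later in this section. The data frame udss_data and the target column UDSS are placeholder names, not the study's actual code.

```r
# Minimal sketch of the train/test split and 10-fold CV setup (caret).
# `udss_data` and the column name `UDSS` are placeholders.
library(caret)

set.seed(42)                                 # reproducible random sampling
idx <- createDataPartition(udss_data$UDSS,   # random split stratified on the target
                           p = 0.90,         # 90:10 ratio; 0.60/0.70/0.80 likewise
                           list = FALSE)
train_set <- udss_data[idx, ]
test_set  <- udss_data[-idx, ]

ctrl <- trainControl(method = "cv", number = 10)  # 10-fold CV used with grid search
```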

One of the major criticisms of ML algorithms in science is the lack of novel laws, understanding, and knowledge arising from their use. This issue stems from the common practice of treating ML algorithms as black boxes, where the intricate models crafted by machines surpass human understanding (Schmidt et al. 2019). Currently, data pre-processing, cross-validation, and extensive metric-based evaluation processes are employed to mitigate the black-box nature of ML and offer more objective and descriptive insights; these processes were therefore used in this study, as they provide transparency, understanding, and control over the modeling process. The final stage of the ML application is to evaluate the performance of the proposed models in order to determine the most accurate result. The performance measurements (i.e., R2, MSE, RMSE, MAE, and MAPE) were utilized to evaluate each ML model. Fig. 1 presents a schematic view of the steps of the methodology. For reference, the tests were performed on a PC running Windows 10 with an AMD Ryzen 9 3950X processor and 64 GB of RAM. All the code was written in the R programming language (version 4.3.0) with the following main R packages: caret, e1071, randomForest, Cubist, ipred, plyr, kernlab, and gbm.

Fig. 1 The methodology applied during the model generation and performance measurement

2.2 Data description

Two datasets, F-CLAY/7/216 and S-CLAY/7/168, compiled by D’Ignazio et al. (2016), were considered for predicting UDSS. The F-CLAY/7/216 and S-CLAY/7/168 datasets comprise 216 and 168 samples obtained from field vane tests in Finland and Sweden-Norway, respectively. In this paper, these datasets were combined to analyze the overall behavior of clayey soils (n = 384 samples). Both datasets contain seven parameters: the target variable UDSS and six features, namely depth (d), liquid limit (wL), plastic limit (wP), natural water content (wn), vertical effective stress (σ′v), and preconsolidation stress (σ′p). All these parameters play a critical role in determining the strength and stability of clayey soils. Statistical information on the dataset, such as the minimum, maximum, mean, median, and standard deviation, is presented in Table 1. In the dataset, the depths of the samples are quite diverse, ranging from as shallow as 0.5 m to as deep as 24 m. The vertical effective stress varies between a lowest value of 6.9 kPa and a highest value of 212.9 kPa. The natural water content of the samples varies from 17.3% to 180.1%. Furthermore, the liquid limit and plastic limit of the samples vary from 22% to 201.8% and from 2.7% to 73.9%, respectively. Meanwhile, the preconsolidation stress ranges from 15.2 kPa to 315.6 kPa.

Table 1 Statistical information of the input features

When analyzing a multi-featured dataset, it is important to visualize the relationships between the features. A heat map is a convenient tool for this, as it provides a quick and easy way to identify patterns and relationships among the features in the dataset. Pearson’s correlation coefficient is often computed to quantify the correlation between each pair of features and to identify which parameters are strongly or weakly correlated with each other. Fig. 2 shows the computed Pearson’s correlation coefficients for each pairwise feature. As shown in Fig. 2, some correlations between features are stronger than others. For instance, the pairs d - σ′v, σ′p - σ′v, wL - wn, and σ′p - UDSS are strongly correlated (i.e., |r| = 0.82, 0.71, 0.84, and 0.75, respectively). In statistical modeling, it is well known that the existence of strongly correlated variables can significantly influence the efficiency of a model; such variables may introduce redundancy and unnecessary complexity. However, Kutner et al. (2005) argued that correlated variables do not typically affect inferences about mean responses in the data. This suggests that even when variables share a strong correlation, each can still provide unique and valuable insights about the average responses in the dataset, thereby making them essential components of the model. Feature pairs with correlation coefficients between 0.40 and 0.69 are moderately correlated (e.g., d - σ′p, wP - wL, σ′v - UDSS), while pairs with coefficients below 0.40 are weakly correlated.
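
As a brief illustration, the correlation matrix behind such a heat map can be computed in R as sketched below; udss_data is a placeholder for the feature table, and the corrplot package, assumed here for plotting, is only one of several options (it is not among the packages listed in Sect. 2.1).

```r
# Sketch: pairwise Pearson coefficients and a heat map (cf. Fig. 2).
library(corrplot)   # assumed plotting package

corr_mat <- cor(udss_data, method = "pearson")               # Pearson coefficients
corrplot(corr_mat, method = "color", addCoef.col = "black")  # heat map with values
```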

Fig. 2 Pearson correlation matrix of the UDSS dataset

2.3 Data pre-processing

Typically, raw data is not organized in a form ready for an ML application and is frequently challenging for researchers to work with. Data pre-processing in an ML workflow is the process of gathering and transforming raw data into a format that can be accurately and quickly analyzed (Sahin 2023). A typical data pre-processing pipeline consists of several processes, such as data cleaning and refining, removal of outliers, missing-data interpolation, and feature scaling (i.e., normalization, standardization, and transformation). For this study, several data pre-processing steps were performed: data cleaning with the removal of outliers, data scaling, and data transformation.

  • Removing Outliers

In the data science literature, an outlier is defined as an abnormal, deviant, or discordant data point relative to the remaining dataset. Data corruption during data collection and incorrect measurement of observations can cause outliers (Gareth et al. 2013). Detecting and handling outliers is a crucial step in improving the performance of an ML model, since outliers can skew the results and lead to inaccurate predictions. Therefore, it is important to identify and remove them from the dataset or adjust them to fit within the expected range. By doing so, the model can be trained on a more representative dataset, resulting in better accuracy and generalization capability. A classical and popular tool for detecting outliers is the boxplot (Tukey 1977). An individual data point is marked as a possible outlier when its distance from the corresponding quartile (Q1 or Q3) exceeds 1.5 times the interquartile range (IQR). According to the boxplot results, 12 outliers were identified for the target variable (i.e., UDSS) in the dataset (Fig. 3). The boxplot analysis showed that the maximum and minimum outlier values of the UDSS variable were 75 kPa and 43 kPa, respectively. Thus, a total of 12 rows were eliminated from the raw dataset, leaving 372 rows for subsequent analysis. Figure 3 depicts the boxplot graphs of the raw dataset (384 samples before removing outliers) and the modified dataset (372 samples after removing outliers).
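
A minimal R sketch of this 1.5 × IQR fence rule, with udss_data and the column name UDSS as placeholders, is as follows:

```r
# Sketch of the 1.5*IQR boxplot rule applied to the target variable.
q     <- quantile(udss_data$UDSS, probs = c(0.25, 0.75))  # Q1 and Q3
iqr   <- q[2] - q[1]                                      # interquartile range
lower <- q[1] - 1.5 * iqr                                 # lower fence
upper <- q[2] + 1.5 * iqr                                 # upper fence

udss_clean <- udss_data[udss_data$UDSS >= lower &
                        udss_data$UDSS <= upper, ]
# In this study, 12 of the 384 rows fell outside the fences, leaving 372 rows.
```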

Fig. 3 Box plot graphs for the original and the modified dataset

  • Data scaling

In ML, the scale and distribution may differ for each variable in a dataset, which can pose a challenge when creating a model. These differences in scale across input variables can increase the difficulty of the problem being modeled. When a model has large weight values, it is often unstable, which can lead to a number of issues, including poor performance with a high generalization error (Brownlee 2020). To address this issue, it is crucial to normalize or scale the input data before training the model; this ensures that all variables have a comparable scale, making the model more robust and accurate. This paper employs two popular and widely used methods to scale the input variables: Range (Min-Max normalization) and Z-Score standardization.

Range is a data pre-processing technique that scales the feature values of a dataset to a range between 0 and 1. The goal of this technique is to adjust the scale of values in a dataset to a common range, making it easier to compare different features. In Range, the feature values are rescaled using the following formula:

$$v' = \frac{v - \min_A}{\max_A - \min_A}$$
(1)

where \(\min_A\) and \(\max_A\) represent the minimum and maximum values of feature A, respectively. The original feature value \(v\) is transformed into the normalized value \(v'\) using this formula, which maps the maximum and minimum feature values to 1 and 0, respectively.

Z-Score standardization was carried out to ensure the models’ independence from the scales of certain features. By standardizing the values, the data was transformed into a format that had a mean of 0 and a variance of 1. This is done by subtracting the mean from each value and then dividing by the standard deviation. Z-Score can be determined using the following equation:

$$z = \frac{v - \mu_A}{\sigma_A}$$
(2)

where \(\sigma_A\) and \(\mu_A\) are the standard deviation and mean of feature A. The original and standardized feature values are given by \(v\) and \(z\), respectively.
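
Both scaling methods are available through caret's preProcess function, as sketched below under the assumption that train_set and test_set come from the split described in Sect. 2.1; this is a sketch, not the study's actual code.

```r
# Sketch of both scaling methods via caret::preProcess.
library(caret)

pp_range  <- preProcess(train_set, method = "range")              # Eq. (1)
pp_zscore <- preProcess(train_set, method = c("center", "scale")) # Eq. (2)

# Parameters (min/max or mean/sd) are learned on the training set and then
# applied to both partitions so no test-set information leaks into training.
train_scaled <- predict(pp_zscore, train_set)
test_scaled  <- predict(pp_zscore, test_set)
```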

  • Data transformation

It is common for some algorithms to assume that the data are normally distributed. In real-world scenarios, however, data are often skewed, meaning they are distributed unevenly. To address this issue, data transformation is usually employed to decrease the skewness of the data (Son et al. 2019). Features that do not follow a normal distribution may not be appropriate for ML methods and need to be transformed to improve the generalization of the ML algorithms (Nguyen et al. 2021). Therefore, in this study, the Log, Box-Cox, and Yeo-Johnson transformations were applied to transform the feature values into an approximately symmetric form for the training process.

Log Transformation is one of the most prevalent approaches to transform data (Changyong et al. 2014). It is computed as \(y = \log_b(x)\), where \(y\) is the log-transformed value, \(x\) represents the input data, and \(b\) is the base of the logarithm, commonly between 2 and 10. The Box–Cox Transformation is defined by the following formula, given by Box and Cox (1964):

$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda} & (\lambda \ne 0) \\ \log y & (\lambda = 0) \end{cases}$$
(3)

where \(y\) stands for the original feature value, which is transformed to \(y^{(\lambda)}\) using the parameter \(\lambda\). A limitation of the Box–Cox Transformation is that it is only applicable to positive data. Yeo and Johnson (2000) proposed an alternative, improved transformation that can handle both negative and positive data. Its primary goal is to regulate the skewness of the original variable regardless of its sign. The Yeo–Johnson Transformation is defined as follows:

$$\psi(y,\lambda) = \begin{cases} \dfrac{(y+1)^{\lambda} - 1}{\lambda} & (\lambda \ne 0,\ y \geqslant 0) \\ \log(y+1) & (\lambda = 0,\ y \geqslant 0) \\ \dfrac{-\left[ (-y+1)^{2-\lambda} - 1 \right]}{2-\lambda} & (\lambda \ne 2,\ y < 0) \\ -\log(-y+1) & (\lambda = 2,\ y < 0) \end{cases}$$
(4)

The skewness value of each feature was calculated using the approach described by Cramer (1946) to determine the distribution type (i.e., symmetric, moderately skewed, or highly skewed). Thereafter, the transformation approaches were employed to reduce the skewness of each feature. The raw and transformed datasets, along with the distribution types, are given in Table 2. When the skewness values of the features in the raw data are evaluated, it is observed that the skewness of almost all features (except d and wn) is greater than 1, indicating that the distributions of these features are highly skewed. The skewness score of the d feature is about 0.91, which means its distribution is moderately skewed, while wn is approximately normally distributed and corresponds to a symmetric view. Although the distributions of depth, liquid limit, plastic limit, natural water content, preconsolidation stress, and vertical effective stress are skewed in the raw data, all features attain a symmetric form after the transformation process, except wP under the Log Transformation. Fig. 4 presents an overview of the histograms before and after the data scaling/transformation methods.
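
A sketch of how these transformations can be applied in R is given below. Box-Cox and Yeo-Johnson are available through caret::preProcess, with \(\lambda\) estimated from the data; the log transform is applied directly, with base 10 chosen here purely as an example, and train_set is again a placeholder.

```r
# Sketch of the three transformations and a skewness check.
library(caret)
library(e1071)   # provides skewness()

pp_bc <- preProcess(train_set, method = "BoxCox")      # Eq. (3); positive data only
pp_yj <- preProcess(train_set, method = "YeoJohnson")  # Eq. (4); any sign

train_bc  <- predict(pp_bc, train_set)
train_yj  <- predict(pp_yj, train_set)
train_log <- log10(train_set)   # simple log transform (base 10 as an example)

sapply(train_bc, skewness)      # verify skewness is near zero after transforming
```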

Table 2 The skewness value and distribution type of the modified and transformed dataset
Fig. 4 Histogram views of the features before and after data scaling/transformation

2.4 Machine learning regression

2.4.1 Random forest

Breiman (2001) developed the Random Forest (RF) model, a popular technique used for regression and classification problems. One of the key benefits of RF is that it requires only a small number of hyperparameters, which are easily tunable; RF can thus be implemented with minimal effort, making it an ideal choice for many use cases. Another advantage of RF is its ability to reduce the variance of the model without increasing the bias, which helps improve the generalization capability of the model. Finally, RF provides useful error information that can help users better understand the strengths and weaknesses of the model; this information can be used to fine-tune the model and improve its performance over time (Demir and Sahin 2022; Quinto 2020). The first step of the RF algorithm is to construct multiple bootstrapped samples of the original training data, so that each decision tree in the forest is trained on a unique subset of the data. Subsequently, for each tree, a random subset of input features is selected and used to split each node (Palczewska et al. 2014), which introduces variability in the decision-making process and reduces the correlation between the trees. Afterwards, the final prediction of the model is determined based on whether the problem is classification or regression: for classification, the final prediction is obtained by a majority vote of the decision trees, whereas for regression it is the average of the outputs of the individual trees (Demir and Sahin 2022). Further information on RF can be found in the literature (Breiman 2001; Hastie et al. 2009). The performance of the RF algorithm can be improved by adjusting its main parameters, the number of trees (ntree) and the number of features used to grow each tree (mtry).
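
A minimal caret-based sketch of this tuning is shown below; in caret's "rf" method only mtry is a native tuning parameter, while ntree is passed through to randomForest(). The grid values and the objects train_set and ctrl (from the sketch in Sect. 2.1) are illustrative.

```r
# Sketch of RF tuning; `train_set`, `ctrl`, and the grid are illustrative.
library(caret)
library(randomForest)

rf_fit <- train(UDSS ~ ., data = train_set,
                method    = "rf",
                trControl = ctrl,                     # 10-fold CV
                tuneGrid  = expand.grid(mtry = 2:6),  # candidate features per split
                ntree     = 500)                      # passed to randomForest()
```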

2.4.2 Support vector regression

Support Vector Regression (SVR) is a type of regression analysis that follows the same principles as the Support Vector Machine (SVM). SVM is a popular and powerful supervised learning method in the ML domain that has gained much attention due to its high accuracy and ability to handle complex datasets. SVM uses the principle of structural risk minimization to find a hyperplane that separates data into different categories (Chandaka et al. 2009). The idea of SVM is straightforward: find a decision boundary that maximizes the margin between two categories so that they can be separated as cleanly as possible (Kou et al. 2023). SVM achieves this by identifying the support vectors, the data points closest to the decision boundary. The main difference between SVM and SVR is their objective: SVR is a regression technique used to predict the value of a continuous variable, whereas SVM is a classification technique applied to separate data into different classes. Despite their differences, both algorithms share many of the same features, including the use of a kernel function to transform the input data into a higher-dimensional space, the identification of support vectors, and the ability to handle non-linear data. One of the most significant benefits of SVR is that its computational complexity does not depend on the dimensionality of the input space, so it is highly efficient and can easily handle large datasets. SVR also has excellent generalization capability: it is designed to identify complex patterns in the data and can therefore produce highly accurate predictions, even when working with noisy or incomplete data (Awad and Khanna 2015). Moreover, selecting an appropriate kernel function (i.e., linear, polynomial, sigmoid, or radial basis function) is important for building a stable and precise SVR model; the radial basis function kernel is the most widely used among them.
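
As an illustration, a radial-kernel SVR can be fitted with the e1071 package (one of the packages listed in Sect. 2.1), as sketched below; the gamma and cost values echo the best-model hyperparameters reported in Sect. 3.1.2 and are shown only as examples.

```r
# Sketch of a radial-kernel SVR; `train_set`/`test_set` are placeholders.
library(e1071)

svr_fit <- svm(UDSS ~ ., data = train_set,
               type   = "eps-regression",
               kernel = "radial",
               gamma  = 0.2147543,   # kernel width (example value)
               cost   = 1)           # regularization constant C

svr_pred <- predict(svr_fit, newdata = test_set)
```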

2.4.3 Cubist

Cubist is a rule-based model derived from an extended version of Quinlan’s M5 model tree (Kuhn et al. 2012; Quinlan 1992; Quinlan 1993). Unlike many other ML algorithms, it was originally distributed as commercial software, but the source code was released under an open-source license in 2011 (Kuhn and Johnson 2013). It is a useful tool for building rule-based models that balance the need for accurate prediction with understandability: Cubist typically produces more precise results than simple methods while being easier to comprehend than neural networks (Rulequest 2020). The algorithm consists of four steps: branching, regression model development, pruning, and smoothing. Cubist creates regression models using one or more rules, each based on a combination of conditions and a linear function to estimate target values. The algorithm can execute multiple rules and generate many models, achieving higher accuracy by combining them (Nguyen et al. 2019). Two hyperparameters need to be tuned to improve the performance of the model: the neighbor function and the committee function (Kuhn 2020). The neighbor function applies the nearest-neighbor algorithm to each leaf and then adjusts the rule-based predictions based on the most similar samples; this process helps minimize model error and prevent overlapping between rules. Committees, on the other hand, are a kind of boosting with similar properties, but in committees the case weights are not changed; instead, the outcome values are modified at each iteration to correct under-predictions from previous iterations.
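
A sketch of tuning these two hyperparameters with caret's "cubist" method follows; the grid is illustrative and includes the committees = 20, neighbors = 9 setting reported for the best model in Sect. 3.1.3.

```r
# Sketch of Cubist tuning over committees and neighbors.
library(caret)
library(Cubist)

cubist_fit <- train(UDSS ~ ., data = train_set,
                    method    = "cubist",
                    trControl = ctrl,   # 10-fold CV from Sect. 2.1
                    tuneGrid  = expand.grid(committees = c(1, 10, 20),
                                            neighbors  = c(0, 5, 9)))
```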

2.4.4 Stochastic gradient boosting

Stochastic Gradient Boosting (SGB), introduced by Friedman (2002), is a minor modification of the gradient boosting algorithm used for both regression and classification problems. SGB seeks to address the performance issues (i.e., improving accuracy, reducing overfitting) associated with traditional gradient boosting algorithms (De’ath 2007). Specifically, it randomly subsamples the training data (the fraction of the subsample relative to the entire training dataset is typically set to 0.4–0.6) rather than using the entire training dataset to compute the gradient of the loss function at each iteration. This randomization leads to a reduction in variance, improved computational efficiency, and reduced overfitting (Friedman 2002).

SGB generates several regression trees sequentially using a stepwise model-fitting approach and model averaging (Dube et al. 2015). A regression tree is built from a random subsample of the dataset at each iteration (Friedman 2002). The objective of SGB is to find the optimal function that minimizes the loss function (i.e., the degree of error of the model); the larger the loss, the poorer the model’s predictions. To reduce the loss and the error rate, the best approach is to make the loss function decline along the gradient direction (Chang et al. 2018). In gradient boosting, the optimal function is built by fitting a weak learner to the residuals of the target variable (i.e., the differences between the actual target values and the current predictions) and adding its prediction to the model. In SGB, the residuals are calculated on the random subset, and the weak learner is trained on those residuals. The final prediction is obtained by accumulating the contributions from all iterations, and the weights of the weak learners are updated at each iteration to minimize the loss function. In this way, SGB balances the trade-off between computational efficiency and accuracy by using a random subset of the data to approximate the gradient of the loss function (De’ath 2007).
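
A caret-based sketch of an SGB model follows; the tuning grid is illustrative and includes the best-model setting reported in Sect. 3.1.4, while bag.fraction (gbm's subsampling fraction, 0.5 by default) corresponds to the random subsampling described above.

```r
# Sketch of SGB via caret's "gbm" method; grid values are illustrative.
library(caret)
library(gbm)

sgb_fit <- train(UDSS ~ ., data = train_set,
                 method    = "gbm",
                 trControl = ctrl,
                 tuneGrid  = expand.grid(n.trees           = c(100, 150, 200),
                                         interaction.depth = c(1, 3, 5),
                                         shrinkage         = 0.1,
                                         n.minobsinnode    = 10),
                 bag.fraction = 0.5,   # random subsample fraction per iteration
                 verbose      = FALSE)
```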

2.5 Assessment of model performances

In regression analyses, it is important to evaluate the proximity of predicted values to observed values. This evaluation is based on numeric predictions for a set of n test samples, together with the measured (actual) values and the predicted (estimated) values for those test cases. Model evaluation is a crucial process that involves choosing the most appropriate model among different types, features, and tuning parameters (Subasi 2020). Selecting the most accurate model involves utilizing different performance metrics for quality assessment, and the choice of metrics is usually based on the intended application of the model; there is no single, unified, standard metric that can accurately assess regression results (Chicco et al. 2021). For this study, the standard performance metrics MAPE, MAE, MSE, RMSE, and R2 were used to quantify the performance of the regression models. Their basic definitions and formulas are given in Fig. 5. These metrics offer various perspectives on the prediction models: a lower MAE, MSE, MAPE, or RMSE value indicates a more accurate regression model, while a higher R2 value is considered better. An R2 near zero indicates that the model predictions have no linear association with the outcome, whereas a value near 1.0 means an almost perfect fit. It is also noted that a MAPE of less than 10% indicates highly accurate prediction performance; the other MAPE categories can be classified as good for 10–20%, reasonable for 20–50%, and poor for ≥ 50% (Lewis 1982).
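
For concreteness, the five metrics can be computed with a small helper function such as the sketch below. The function name is hypothetical, and the R2 definition used here (1 - SSE/SST) is one common convention; caret, for example, reports R2 as a squared correlation by default.

```r
# Sketch of the five evaluation metrics; `actual` and `predicted` are numeric
# vectors of test-set observations and model outputs.
regression_metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(MAE  = mean(abs(err)),
    MSE  = mean(err^2),
    RMSE = sqrt(mean(err^2)),
    MAPE = 100 * mean(abs(err / actual)),  # in percent
    R2   = 1 - sum(err^2) / sum((actual - mean(actual))^2))
}

# Hypothetical usage with the SVR predictions sketched in Sect. 2.4.2:
# regression_metrics(test_set$UDSS, svr_pred)
```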

Fig. 5 Performance evaluation metrics for regression models

3 Results

3.1 Model evaluation

This section systematically compares how performance varies as different data scaling/transformation methods and ML algorithms are combined. To this end, the data scaling and transformation methods, namely Range, Z-Score, Log, Box-Cox, and Yeo-Johnson, were used during the generation of the models. Four ML algorithms, RF, SVR, Cubist, and SGB, were utilized for UDSS prediction, and the best results are presented separately in tables and figures for each algorithm. Moreover, performance results for every generated model are provided in the Supplementary File. Details of the prediction results of the ML models are presented in the following sections.

3.1.1 Evaluation of RF model

The test results of the RF models were reported for varying sampling ratios based on the Range-, Z-Score-, Log-, Box-Cox-, and Yeo-Johnson-processed data. Table 3 presents the best results of the RF regression model for each sampling ratio. It can be seen from Table 3 that the Z-Score generally led to the best results for the RF models in predicting UDSS. The models’ performances are similar in terms of the MAPE metric, with minimal differences. The lowest RMSE and highest R2 scores (i.e., 4.61 and 0.75, respectively) were obtained with a 90:10 sampling ratio on the Z-Score-scaled dataset (the hyperparameters of the best model are ntree = 500 and mtry = 2). Among the four sampling ratios, the model with a 60:40 sampling ratio exhibited the poorest performance, with a high prediction error that is clearly observable from the reported scores. This is expected, as a larger training ratio assists the model learning process and thus enhances the prediction performance.

Table 3 Prediction results of the RF models based on the test data

Finally, the variation of the actual and predicted UDSS values across the test sets is presented in Fig. 6. In this figure, the predicted results and actual values are drawn as lines and points, respectively, for each sampling ratio. When the predicted and actual data are compared, it is evident that the predicted results differ from the actual data at the peak points, indicating that these extremes were not accurately estimated. In most cases, however, the predicted values followed nearly the same trend as the actual values for all four models.

Fig. 6 Comparison of actual and predicted values using the test set based on the RF model

3.1.2 Evaluation of SVR model

When the results of the SVR models in Table 4 were investigated, the best results were obtained with the Log and Box-Cox transformations for the various sampling ratios. SVR produced the lowest RMSE (3.93036) and MAPE (17.19957%) with a 90:10 sampling ratio for the Box-Cox transformation (the hyperparameters of the best model are kernel = Radial, gamma = 0.2147543, and C = 1). The R2 metric also provides useful insight into the prediction behavior of the method: the R2 value of 0.80729 is a satisfyingly good score, revealing that the model predictions closely track the test samples. On the other hand, the highest RMSE and MSE values were 4.93710 and 24.37495, respectively, for the Log-transformed dataset with the 60:40 sampling ratio. Moreover, the three models with 60%, 70%, and 80% training ratios yielded closely comparable prediction performances.

The comparison between the actual and predicted values of the SVR models for different sampling ratios is shown in Fig. 7 for the test set. The comparisons show the greatest scatter for the 60:40, 70:30, and 80:20 sampling ratios; these models appear to underpredict very low and very high actual values while overpredicting mid-range values. The Box-Cox model achieves better testing performance in predicting UDSS for the 90:10 sampling ratio than the other SVR models.

Table 4 Prediction results of the SVR models based on the test set
Fig. 7 Comparison of actual and predicted values using the test set based on the SVR model

3.1.3 Evaluation of Cubist model

The performance results of the different Cubist models were compared using the set of assessment metrics presented in Table 5. It can be seen that including a transformation method (i.e., Box-Cox) as part of the data pre-processing contributes a consistent improvement to the performance of the prediction model. Additionally, increasing the training size consistently improves performance over the previous models. Among the models, the Cubist model combined with the Box-Cox transformation performed best on the RMSE (3.37838), MSE (11.41344), MAE (2.62978), R2 (0.87107), and MAPE (16.35228%) metrics for the 90% training set (the hyperparameters of the best model are committees = 20 and neighbors = 9), followed by the Range, Yeo-Johnson, and Log Transformation models. The Cubist model with Log Transformation built on a 60% training sample showed the worst performance among the models considered in this study. Consequently, increasing the training set had a positive effect on learning performance in this modeling example, as in the others.

Table 5 Prediction results of the Cubist models based on the test set

Fig. 8 compares predicted and observed values for the best models at the four training/test sizes. It can be seen from the figure that the estimates at the 60% training size with the Log Transformation model and at the 80% training size with Range are inaccurate for very high values. The model based on the 70% training size with the Yeo-Johnson transformation generally tends to underpredict very low actual values. The Cubist model with the Box-Cox transformation at the largest training sample size (i.e., 90%) produced the prediction values closest to the original values.

Fig. 8 Comparison of actual and predicted data based on the Cubist model

3.1.4 Evaluation of SGB model

For the SGB models, the Z-Score and Box-Cox transformations resulted in the best metric scores for each sampling ratio. The results of the SGB models given in Table 6 show that the model produced with a 90% training set using the Box-Cox transformation is the best model among the sampling ratios, with RMSE, R2, MSE, MAE, and MAPE values of 4.20446, 0.77875, 17.67748, 3.14526, and 18.41018%, respectively (the hyperparameters of the best model are n.trees = 150, interaction.depth = 3, shrinkage = 0.1, and n.minobsinnode = 10). The 80% training size model with Z-Score is the second-best model, with an RMSE of 4.5472, while the 60% training size yielded the poorest-performing model in the assessment.

Table 6 Prediction results of the SGB models based on the test sets

The best model for each training set was identified and plotted in Fig. 9 as a scatter plot of actual versus predicted values to visually explore the scatter around the regression line. The results show that the relationship between the actual and predicted values was very similar for the 60:40 Z-Score model and the 70:30 Box-Cox transformation model, while the 80:20 Z-Score model had a somewhat closer agreement between actual measurements and predicted values. Nevertheless, the 90:10 Box-Cox transformation model showed a better actual-versus-predicted relationship than the other models.

Fig. 9 Comparison of actual and predicted data based on the SGB model

4 Discussions

The prediction models are compared and discussed in this section based on all data pre-processing methods for the different sampling-ratio scenarios (Fig. 10). Moreover, a comparison was made with other results from the literature based on the same UDSS dataset.

Fig. 10 demonstrates that the ML models generally achieve better accuracy with larger training sets. Accordingly, the best-performing models were observed for the 90% training set in predicting the UDSS of the studied soil. When Fig. 10 is examined more closely, it can be recognized that the UDSS prediction models performed similarly to those built on the Raw data for the 60:40, 70:30, and 80:20 sampling ratios, despite the data transformation or data scaling processes. In other words, the data transformation/scaling methods had no discernible impact on model performance for these sampling ratios. In the remaining scenario (i.e., 90:10), however, the prediction models performed better. The Box-Cox-transformed Cubist model yielded the best performance among the algorithms used in this study, with an R2 of 0.87 for the 90% training set, compared with R2 = 0.81 for the Raw data when Cubist was employed, corresponding to a notable increase (about 7.4%) in model accuracy for predicting UDSS. The Cubist model was the second-best performing model (R2 = 0.85) when used with the Yeo-Johnson Transformation. The results also showed that the prediction performances of the regression models were sensitive to the size of the training dataset; it can therefore be concluded that increasing the training dataset typically improves accuracy in most cases. For the other ML algorithms (RF, SVR, and SGB), interestingly, the model accuracies were approximately similar to those of the Raw dataset, regardless of the sampling ratios and data scaling/transformation methods, which can be attributed to the algorithms’ constitutive structure. Another important finding is that the transformation-based models (i.e., Box-Cox, Log, and Yeo-Johnson) provided slightly better results than the scaling-based models (i.e., Range and Z-Score) according to the performance metrics.

Fig. 10 Comparative assessment of the prediction models for each data scaling/transformation method and sampling ratio

The results of this study were also compared with other results from the literature. Several researchers have investigated ML methods for creating UDSS prediction models using the same dataset. For instance, Zhang et al. (2021) utilized ML algorithms for UDSS prediction, and their best success rate for the test set was R2 = 0.73 and RMSE = 4.40. Recently, Bherde et al. (2024) built prediction models in this field using ensemble-based methods; their ensemble-based XGBoost model achieved the highest R2 (0.80), with an RMSE of 4.51. The best-performing model presented in this study outperformed those of the mentioned studies, yielding R2 and RMSE values of 0.87 and 3.378, respectively. It should be noted that, despite these encouraging results, comparing the results of this study with previous research is not straightforward, primarily due to differences in the selection of input variables, splitting ratios, sampling strategies, and other model evaluation techniques, all of which directly affect model performance. This issue highlights the complexity and diversity of approaches in the field and suggests the need for further research to refine and standardize methodologies.

5 Conclusions and recommendations for future work

In this research, the impact of different pre-processing steps on the accuracy of four ML algorithms for UDSS prediction was evaluated on a dataset of 384 clay samples. Data scaling and data transformation techniques, including Range, Z-Score, Log, Box-Cox, and Yeo-Johnson Transformation, were applied as a preliminary step to the dataset, and then four ML algorithms, RF, SVR, Cubist, and SGB, were applied for UDSS prediction. The prediction performance of the four ML algorithms was computed before and after applying the data pre-processing for an objective comparison, and the effect of the sampling ratio was also included in the modeling process. The results showed that the accuracy of the ML algorithms improved significantly as the sampling ratio increased. It can also be concluded that the Cubist model combined with the Box-Cox transformation provided the best performance on the prediction metrics for the 90% training set, with R2 = 0.87, RMSE = 3.38, MSE = 11.41, MAE = 2.63, and MAPE = 16.35%, corresponding to an acceptable performance in predicting UDSS. The transformation-based models (Box-Cox, Log, and Yeo-Johnson) generally outperformed the scaling-based models (Range and Z-Score) in terms of prediction accuracy. On the other hand, all performance measurements (i.e., R2, MAE, RMSE, MAPE, and MSE) of the RF, SVR, and SGB models were similar to those of the Raw dataset, regardless of the sampling ratios and data scaling/transformation methods.

The study has some limitations, which suggest promising directions for future research. The dataset and each pre-processing approach have their own individual impact on the prediction performance of an ML algorithm, and the performance of the models assessed in this study may not generalize to other UDSS datasets with different soil characteristics. Therefore, further research could explore the effect of data scaling/transformation methods on models built from different datasets. Furthermore, metaheuristic algorithms could be adopted in future work to tune the hyperparameters of the prediction models and obtain a more robust model for UDSS estimation of cohesive soils. Additional investigations could also address feature engineering, such as feature extraction and feature selection, to optimize the learning process of the ML models; feature engineering may further contribute to the performance of the prediction models.