Introduction

Geogrids play a prominent role in flexible pavement construction, particularly in maintaining subgrade restraint and stabilizing the base course. These applications either reduce the base course thickness or extend the pavement life. When used to build pavement working platforms, geogrids are installed at the base–subgrade interface to enhance the load-bearing capacity of the pavement foundation. For base stabilization, geogrids can be positioned at the center of the base course or at the base–subgrade interface [1,2,3,4]. The successful implementation of both methods has afforded reduction in permanent deformation in the base course [5].

Geogrids are widely recognized for their primary function of enhancing horizontal stiffness in mechanically stabilized base layers, a process known as lateral confinement. Lateral confinement prevents excessive base course deformations and improves the bearing capacity under vehicular loading through interlocking between the geogrid and aggregate particles. Geogrid products differ in properties such as aperture shape and size, rib shape, and tensile strength, all of which affect the interlocking mechanisms [6]. Previous studies have reported that optimal geogrid stabilization and interlock can be achieved when the geogrid aperture size matches the aggregate grain size [7, 8]. Tutumluer et al. [9] demonstrated the difference in shear forces between triangular and rectangular geogrid aperture shapes using aggregate image–aided discrete element method simulation. Byun et al. [10] used shear wave transducers to successfully quantify different horizontal stiffness profiles of aggregate specimens enhanced by rectangular and triangular geogrid apertures. However, without coring and sampling in the field, information regarding the geogrid reinforcement type and presence may not be readily available. This can potentially affect the assessment of pavement conditions and selection of appropriate maintenance techniques.

Numerous machine learning (ML) studies in geotechnical engineering focus on supervised learning. In supervised learning, labeled datasets are used to predict or classify outcomes, where most of the related studies target regression problems [11,12,13,14]. Nevertheless, some studies have explored classification problems. For instance, Eyo et al. [15] demonstrated that meta-ensemble models could classify three different binder combinations previously used in predicting the unconfined compressive strength of reinforced soils. Ensemble classification algorithms have been used in slope stability analysis to determine landslide susceptibility based on field data [16, 17]. Furthermore, Ozsagir et al. [18] achieved a high prediction accuracy in assessing the liquefaction potential of fine-grained soils using a decision tree model, while Jas and Dodagoudar [19] explored the prediction of the liquefaction potential of soils using extreme gradient boosting-Shapley additive explanations (SHAP). Soil classification based on physical and chemical properties has also been recently investigated [20,21,22,23]. However, a major limitation of some of these studies is their insufficient datasets for training ML models as well as the interpretability of the models [19]. Thus, more studies focusing on applying ML classification algorithms and interpreting these models are required.

This study aims to investigate the efficacy of ML classification algorithms in distinguishing between unstabilized aggregate specimens and those stabilized with triangular and rectangular aperture geogrids. To address the data insufficiency noted in previous studies, a moderately sized dataset of 4500 experimental observations is employed herein to train five ML models, including single-learning and tree-ensemble algorithms. The performances of these models are compared using appropriate ML classification metrics. Furthermore, the global feature importance across all algorithms is examined to enhance the interpretability of the applied ML methods. Finally, the engineering implications of the current study are discussed.

Methodology

Database generation

The database comprises numerical data obtained from repeated-load triaxial experimental testing for the resilient modulus characterization of crushed aggregates. The crushed aggregates were obtained from a quarry located in North Carolina. Sieve analysis results indicate that the material was well-graded, with a mean diameter of 8.4 mm and a maximum particle size of 25 mm. A detailed description of the indexes and compaction properties of the aggregates can be found in previous studies [10, 24]. The experimental program utilized an unstabilized specimen and two specimens stabilized with triangular and rectangular aperture geogrids. The two geogrids were designed for base course stabilization in flexible pavements, and the dimensions are also detailed in previous studies [10, 24]. Five confining pressures, 100 load repetitions of varying deviatoric stresses, and 15 sequences were considered during the experimental program, following the procedure outlined in AASHTO T307 [25].

Figure 1a illustrates the typical haversine pulse obtained from each loading cycle of the repeated-load triaxial test, representing a standard wheel loading involving a 100 ms loading period and a 900 ms rest period. Furthermore, recoverable and plastic strains were recorded for each cycle and subsequently used to calculate the resilient modulus and permanent deformation. Herein, the accumulated permanent strain (εz) (Fig. 1b) was considered as an input feature alongside the resilient modulus (MR), number of cycles (N), confining pressure (σc), and deviatoric stress (σd). Notably, previous studies demonstrated that the deviatoric stress and number of cycles are major factors influencing the accumulated permanent strain [26, 27]. Finally, each datum was assigned to one of three classes: rectangular geogrid, triangular geogrid, and unstabilized.

Fig. 1

Typical responses of repeated-load triaxial tests: (a) axial stress at different cycles and (b) accumulated permanent strain for unstabilized specimen (ninth sequence)

Data preprocessing and exploratory data analysis

Data preprocessing is an important step that allows the examination and assurance of database integrity. Herein, no missing values or duplicate data points were found. Upon further investigation, the database was found to comprise 4500 observations, with the statistics summarized in Table 1. All classes were equally represented in the database used for this study, with 1500 data points for each class. This equal distribution was due to the consistent application of N, σc, and σd for each specimen, resulting in varied MR and εz values. A common issue with multiclass ML models is class imbalance, often leading to biased models [28, 29]. The class balance in this study helps avoid this problem and aids reliable performance evaluation.

Table 1 Relevant statistics of independent features in the database

The distribution of values for each input feature across the three classes is presented along the diagonal of the plot shown in Fig. 2. The input features N, σc, and σd exhibited the same distribution across all the classes. However, the unstabilized specimen showed higher MR values than the triangular and rectangular stabilized specimens (Fig. 3) because the two geogrid-stabilized specimens had slightly lower densities [10]. When examining the relation between MR and N, overlapping MR values were observed between the triangular and rectangular specimens. For the relation between εz and N, the εz for the triangular specimen was higher than those of the other classes. Owing to this general trend of overlapping MR and εz values, linear ML models might face challenges in accurately separating the classes.

Fig. 2

Pairwise relation between variables in the dataset

Fig. 3

Resilient modulus distribution for all classes: (a) unstabilized, (b) triangular geogrid, and (c) rectangular geogrid

The dataset was partitioned into training and testing sets, with 80% of the observations used for training and the remaining 20% for testing, as recommended in the literature [30, 31]. Subsequently, the values of the input features were scaled using a standard scaler so that they have a mean of 0 and a standard deviation of 1. This standard scaling can be mathematically represented as

$$z = \frac{x - u}{s}$$
(1)

where x is the feature value, u is the mean of the feature value, and s is the corresponding standard deviation.
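
As an illustration of Eq. (1), the following minimal sketch standardizes a small set of feature rows. The source states only that both sets were scaled; fitting the mean and standard deviation on the training rows and reusing them for the test rows is an assumption made here, as are the function names and the toy values (which are not data from this study).

```python
import math

def fit_standard_scaler(rows):
    """Return per-feature means (u) and population standard deviations (s)."""
    n = len(rows)
    n_features = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_features)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
        for j in range(n_features)
    ]
    return means, stds

def transform(rows, means, stds):
    """Apply z = (x - u) / s feature by feature; constant features map to 0."""
    return [
        [(x - u) / s if s > 0 else 0.0 for x, u, s in zip(r, means, stds)]
        for r in rows
    ]

# Hypothetical rows: (confining pressure, deviatoric stress), illustrative only
train = [[20.0, 20.0], [35.0, 40.0], [70.0, 60.0], [105.0, 80.0]]
test = [[140.0, 100.0]]

means, stds = fit_standard_scaler(train)      # statistics from training rows only
train_scaled = transform(train, means, stds)  # mean 0, standard deviation 1
test_scaled = transform(test, means, stds)    # reuses the training statistics
```

After this transformation, each training feature has zero mean and unit standard deviation, whereas the test rows are expressed relative to the training statistics and need not be centered.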

Machine learning algorithms

Logistic regression

Logistic regression (LR) employs a logistic function to establish the relation between a binary target variable and a set of predictor variables [22]. Given a binary target variable represented as Y and a vector of predictor variables denoted as X = (X1, X2, … Xp), LR formulates the probability of Y equaling 1 as follows:

$$P(Y=1 \mid X) = \frac{1}{1 + e^{-Z}}$$
(2)

where Z is the linear predictor defined by Z = β0 + β1X1 + β2X2 + … + βpXp, and β0, β1, β2, …, βp are the regression coefficients.

The key assumptions of LR include the absence of influential outliers and multicollinearity, the independence of errors, and linearity in the logit. The regression coefficients are typically estimated using maximum likelihood estimation techniques, such as iteratively reweighted least squares and the Newton–Raphson method. Although LR is well suited for binary classification problems, a one-vs.-rest (OvR) approach can be applied to multiclass problems. The OvR approach involves training a separate binary LR model for each class, treating that class as positive and all other classes as negative.
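
The logistic function of Eq. (2) and the OvR decision rule can be sketched as follows. The coefficients below are hypothetical placeholders rather than fitted values from this study, and the function names are chosen for illustration only; the sketch assumes that the class whose binary model returns the highest probability is selected.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_ovr(x, models):
    """Score x with one binary LR model per class and pick the best class.

    models maps a class label to (beta0, [beta1, ..., betap]).
    """
    scores = {}
    for label, (beta0, betas) in models.items():
        z = beta0 + sum(b * xi for b, xi in zip(betas, x))  # linear predictor Z
        scores[label] = sigmoid(z)                          # P(Y = 1 | X)
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical coefficients for two scaled features (illustrative only)
models = {
    "unstabilized": (0.0, [2.0, 0.0]),
    "triangular": (0.0, [-1.0, 1.0]),
    "rectangular": (0.0, [-1.0, -1.0]),
}
label, scores = predict_ovr([1.5, 0.2], models)
```

Note that the three OvR probabilities are produced by independent binary models and therefore need not sum to 1.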

Support vector machines

Support vector machines (SVMs) are used to find an optimal hyperplane in a high-dimensional feature space that maximizes the separation of data points between different classes. This is achieved using a kernel function, such as a linear, polynomial, or radial basis function (RBF), suitable for solving complex nonlinear problems [32, 33]. The RBF kernel used with the SVM can be expressed as

$$f\left(X_{1}, X_{2}\right) = \exp\left(-\upgamma \, \Vert X_{1} - X_{2} \Vert^{2}\right)$$
(3)

where X1 and X2 represent the input vectors and \(\upgamma\) defines the influence of a single training point on nearby data points.
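
A minimal sketch of the RBF kernel in Eq. (3) is given below; the function name and the input values are illustrative assumptions. It shows that identical vectors yield a kernel value of 1 and that a larger γ makes the similarity decay faster with distance.

```python
import math

def rbf_kernel(x1, x2, gamma):
    """RBF kernel exp(-gamma * ||x1 - x2||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * squared_distance)

same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)    # identical points
near = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)    # squared distance 2
sharp = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=5.0)   # larger gamma
```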

SVMs aim to identify support vectors, which are the data points closest to the decision boundary. These support vectors are crucial for defining the margin used to optimize the hyperplane. Generally, SVMs effectively handle high-dimensional data and are less susceptible to overfitting. However, they can be sensitive to hyperparameter tuning and may become computationally expensive when handling large datasets.

Tree-ensemble models

This study considered three tree-based/ensemble algorithms: random forest (RF), extra trees classifier (ETC), and light gradient boosting machine (LGBM). RF creates decision trees by training each tree using a random subset of data and features. The outcome is determined by combining the predictions from all individual trees [34]. Meanwhile, ETC shares similarities with RF but introduces additional randomness in the tree-building process. Random thresholds are used instead of identifying the optimal split of features based on a quality criterion, such as Gini impurity or entropy [35]. LGBM employs a unique method called gradient-based one-side sampling, which reduces the number of data instances required to build each tree while maintaining accuracy. This method considerably reduces the training time without compromising performance.

Generally, tree-ensemble methods combine multiple decision trees to create a robust and accurate predictive model. The primary difference between the methods lies in the way the decision trees are constructed. These methods are known for their interpretability based on embedded feature importance, ability to generalize when using new datasets, and computational efficiency, particularly concerning the tree-ensemble methods selected in this study. The fundamental equation for tree-ensemble methods is given by

$$F\left(x\right)={\sum }_{m=1}^{M}{\gamma }_{m}{h}_{m}\left(x\right)$$
(4)

where F(x) is the prediction, \({\gamma }_{m}\) denotes the weight assigned to the mth tree, and hm(x) represents the prediction from the mth tree for input x.
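
Equation (4) can be illustrated with a small sketch in which each tree returns a vector of class probabilities and the ensemble forms their weighted sum. The three decision stumps, their thresholds, and the equal weights γm = 1/M are hypothetical choices made for illustration; they do not reproduce the trained RF, ETC, or LGBM models of this study.

```python
def ensemble_predict(x, trees, weights):
    """Compute F(x) = sum_m gamma_m * h_m(x) for trees returning class probabilities."""
    n_classes = len(trees[0](x))
    total = [0.0] * n_classes
    for gamma, tree in zip(weights, trees):
        probs = tree(x)
        for k in range(n_classes):
            total[k] += gamma * probs[k]
    return total

# Hypothetical stumps; class order: unstabilized, triangular, rectangular
def stump_a(x):
    return [0.8, 0.1, 0.1] if x[0] > 0.5 else [0.1, 0.5, 0.4]

def stump_b(x):
    return [0.6, 0.2, 0.2] if x[0] > 0.0 else [0.2, 0.3, 0.5]

def stump_c(x):
    return [0.3, 0.6, 0.1] if x[1] > 0.0 else [0.3, 0.2, 0.5]

trees = [stump_a, stump_b, stump_c]
weights = [1.0 / len(trees)] * len(trees)   # equal weights, i.e., simple averaging

scores = ensemble_predict([1.0, -1.0], trees, weights)
predicted_class = scores.index(max(scores))
```

With equal weights, the sum reduces to averaging the tree outputs; boosting methods instead build trees sequentially, so this sketch is closer in spirit to the averaging ensembles.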

Machine learning implementation

The ML process typically comprises training, validation, and testing phases. During training, the model attempts to learn about the patterns and correlations that exist in the data. Optimization algorithms are used during model training to reduce errors. In addition, each ML algorithm includes hyperparameters that can be combined to improve model performance. Techniques, such as grid search, help to assess the different model hyperparameters and choose the best combination using the validation set [36]. Finally, the model is evaluated on the test set using suitable predefined metrics. Model evaluation helps to assess the generalization of the model and determine its performance on unseen data.

Herein, ML model training and validation were conducted using the Python programming language. After separating the datasets into training and testing sets, the input features were scaled for both sets. A parameter grid containing a range of hyperparameters for each model was defined. Only significant hyperparameters were considered to expedite the training process. A grid search algorithm [36] was used to identify the optimal combination of hyperparameters for each model, as outlined in Table 2. Subsequently, each model was trained and tested using the metrics defined in Sect. “Performance evaluation metrics”. The k-fold cross-validation method was also adopted in this study. This method divides the entire training set into k subsets or folds. The model was then trained using all but one of the subsets and validated using the remaining subset. This process was repeated iteratively k times. A recommended tenfold cross-validation was used in this study [37, 38].
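
The k-fold procedure described above can be sketched as an index generator. The function name is hypothetical, no shuffling is applied (an assumption made to keep the sketch short), and the size of 3600 is used only as an illustrative training-set size corresponding to 80% of 4500 observations.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]                  # held-out fold
        train_idx = indices[:start] + indices[start + size:]   # remaining k - 1 folds
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(3600, 10))
for train_idx, val_idx in folds:
    # A model would be fitted on train_idx and scored on val_idx here.
    pass
```

Each observation appears in exactly one validation fold, so averaging the k validation scores uses every training observation once for validation.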

Table 2 Selected hyperparameters for machine learning models

Performance evaluation metrics

Herein, the capability of each model to identify and correctly classify specimens was evaluated using the metrics detailed below. The first metric is accuracy, which is the ratio of correct predictions to the total number of predictions, mathematically expressed as

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
(5)

where TP refers to the true positive, i.e., the model correctly predicts a positive class; an example is when a model correctly classifies an unstabilized specimen as unstabilized. Conversely, false positives (FPs) are instances where the model incorrectly predicts a negative class as positive. In this case, an example would be classifying the unstabilized specimen as rectangular. True negatives (TNs) occur when the model correctly identifies the negative class. An example is when a model correctly classifies an unstabilized specimen as not being a specimen stabilized by triangular or rectangular aperture geogrid. Finally, false negatives (FNs) arise when the model inaccurately predicts a positive class as negative.

Notably, in the case of unbalanced classes, relying solely on accuracy to assess the performance of a model can be misleading. The next metric is thus precision, which gauges the FP rate of a model. A higher precision score implies a lower rate of FPs and vice versa. Precision is computed as

$$\text{Precision} = \frac{TP}{TP + FP}$$
(6)

Recall is another important metric that evaluates the FN rate of a model. A higher recall value indicates a low FN rate. Recall is calculated as follows:

$$\text{Recall} = \frac{TP}{TP + FN}$$
(7)

Finally, the F1-score is the harmonic mean of precision and recall, balancing these two metrics. The F1-score is mathematically expressed as

$$\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
(8)
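
Equations (5) to (8) can be evaluated for one class at a time by treating that class as positive and the other two as negative. The sketch below does this for a small set of hypothetical labels (U, T, and R standing for unstabilized, triangular, and rectangular); the labels, predictions, and function name are illustrative assumptions, not results from this study.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall, and F1-score with one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical labels and predictions (illustrative only)
y_true = ["U", "U", "U", "T", "T", "T", "R", "R", "R"]
y_pred = ["U", "U", "T", "T", "T", "R", "R", "R", "T"]

metrics_u = per_class_metrics(y_true, y_pred, positive="U")
metrics_t = per_class_metrics(y_true, y_pred, positive="T")
```

Averaging the per-class values over the three classes gives macro-averaged scores; whether the averages reported in this study are macro or weighted averages is not stated in the text, although the two coincide for recall when the classes are balanced.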

Receiver operating characteristic curve

An effective method for visualizing the performance of ML models is using an “area under the curve–receiver operating characteristics” (AUC–ROC) curve. In this study, the AUC–ROC curve is used to assess the discriminatory capacity of the models. Discriminatory capacity, as the term implies, is the ability of a model to accurately distinguish between classes. The ROC curve summarizes the trade-off between correct and incorrect positive predictions across classification thresholds, while the AUC quantifies the degree of class separability. More specifically, the ROC curve plots the TP rate (TPR) against the FP rate (FPR), with the TPR on the y-axis and the FPR on the x-axis. The AUC values fall within the range of 0–1. High AUC values, particularly those approaching 1, are desirable as they reflect the capacity of a model to accurately predict classes. An AUC of 0.5 indicates that the model cannot discriminate between the classes and essentially resorts to random guessing. By contrast, an AUC close to 0 indicates that the model systematically assigns the classes in reverse.
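
The AUC can also be read as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, which equals the area under the ROC curve. The sketch below computes the AUC in this pairwise form for one class against the rest; the function name and the scores are illustrative assumptions.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute the AUC.")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical one-vs-rest labels (1 = class of interest) and predicted scores
auc_mixed = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])     # one pair misranked
auc_perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])   # full separation
auc_inverted = roc_auc([1, 1, 0, 0], [0.1, 0.2, 0.8, 0.9])  # reversed ranking
auc_constant = roc_auc([1, 1, 0, 0], [0.5, 0.5, 0.5, 0.5])  # no discrimination
```

The four cases illustrate the interpretation of the AUC scale: full separation, partial separation, reversed ranking, and no discrimination.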

Shapley additive explanations

ML models are often considered black-box models because of their complex nature. Explainable artificial intelligence (XAI) has been developed to bridge the gap between the intricacies of artificial intelligence algorithms and human comprehension. An essential component of XAI is SHAP, which employs game theory to compute Shapley values [39]. In this study, SHAP feature importance was used to determine the rank of features across all models. The global feature importance was obtained by calculating the arithmetic mean of the absolute Shapley values for each feature. The features with high absolute Shapley values were deemed the most important. Notably, the kernel SHAP and tree SHAP [40] were used to determine the importance of the single-learning and tree-based models, respectively. Generally, the feature importance is given by

$$I_{j} = \frac{1}{n} \sum_{i = 1}^{n}\left|\phi_{j}^{(i)}\right|$$
(9)

where n represents the total number of instances or samples in the dataset and \(\phi_{j}^{(i)}\) denotes the Shapley value of feature j for observation i. For a single observation, the Shapley value of feature j is given by

$$\phi_{j} = \sum_{S \subseteq \{1,2,\ldots,p\}\backslash \{j\}}\frac{|S|!\left(p-|S|-1\right)!}{p!}\left[f\left(x_{S \cup \{j\}}\right) - f\left(x_{S}\right)\right]$$
(10)

where S denotes a subset of features excluding j, |S| represents the number of features in the subset S, and p denotes the total number of features. \(f(x_{S \cup \{j\}})\) is the prediction output when j is included in S, while \(f(x_{S})\) is the prediction output when j is excluded from S.
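
To make Eqs. (9) and (10) concrete, the following sketch computes exact Shapley values by enumerating all feature subsets for a toy value function and then averages absolute values across instances. The value function, its coefficients, and the function names are hypothetical; in practice, the kernel SHAP and tree SHAP algorithms estimate these quantities efficiently rather than by brute-force enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, p):
    """Exact Shapley values for p features; value_fn takes a frozenset of feature indices."""
    phi = []
    for j in range(p):
        others = [k for k in range(p) if k != j]
        total = 0.0
        for size in range(len(others) + 1):
            # Weight |S|! (p - |S| - 1)! / p! from Eq. (10)
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {j}) - value_fn(s))
        phi.append(total)
    return phi

def global_importance(phi_per_instance):
    """Mean absolute Shapley value per feature across instances, as in Eq. (9)."""
    n = len(phi_per_instance)
    p = len(phi_per_instance[0])
    return [sum(abs(row[j]) for row in phi_per_instance) / n for j in range(p)]

def toy_value(s):
    """Hypothetical model output when only the features in s are known."""
    out = 0.0
    if 0 in s:
        out += 3.0
    if 1 in s:
        out += 1.0
    if 0 in s and 2 in s:
        out += 2.0   # interaction shared equally by features 0 and 2
    return out

phi = shapley_values(toy_value, p=3)
importance = global_importance([[4.0, 1.0, 1.0], [-2.0, 1.0, 0.0]])
```

The Shapley values sum to the difference between the output with all features and the output with none, which is the additivity property that makes the per-feature contributions comparable across models.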

Results and discussion

Performance evaluation of ML models

The performances of the multiclass ML algorithms used to predict the geogrid stabilization category are plotted in Fig. 4. The tree-based learning models (LGBM, ETC, and RF) performed better than the single-learning models (LR and SVM). LR exhibited the worst performance among all the models, with an average precision, recall, and F1-score of 0.71, 0.70, and 0.70, respectively. The major reason for the poor performance of the LR model might be the assumption of a linear relation between the input variables and the log-odds of the target variable by the algorithm [41]. Feature engineering, such as creating interaction terms, could potentially enhance the performance of the LR model. However, this might increase the risk of overfitting and degrade performance when using new datasets. Meanwhile, the SVM model performed considerably better than the LR model, with a precision, recall, and F1-score of 0.85. This improved performance could be attributed to the ability of SVM to transform the initial feature space into a higher dimension where classes are more easily separated using a suitable kernel function. The RBF kernel used in this study was especially effective in handling nonlinear and complex multiclass decision boundaries [32]. Among the five ML algorithms implemented in this study, the LGBM model showed the best performance, with a precision, recall, and F1-score of 0.91. The RF and ETC models also exhibited good performances, with precision, recall, and F1-scores of 0.89 and 0.90, respectively. Notably, the three tree-based models did not assume a linear relation between features and targets, leading to an improved performance. Furthermore, tree-based models combine the predictions of multiple base models to make a final prediction. By aggregating predictions, tree-based models tend to produce more robust and generalized models. Finally, the tree-based models have been shown to outperform even deep-learning models when using tabular datasets with < 10,000 training examples [42].

Fig. 4

Average precision, recall and F1-score for the five models

Prediction uncertainty

The confusion matrix was used to illustrate the prediction uncertainties for each multiclass model (Fig. 5). Overall, the unstabilized specimens were better classified than the specimens stabilized by triangular and rectangular aperture geogrids. Each model correctly identified unstabilized specimens at least 90% of the time, with the LGBM model making accurate predictions in 94% of such instances. This could be attributed to the fact that unstabilized specimens possessed higher and more distinguishable MR than the specimens stabilized by triangular and rectangular aperture geogrids, which exhibited overlapping MR values. Note that a previous study demonstrated that the MR obtained from repeated-load triaxial tests could not discern the effect of the geogrid on the stabilization of aggregates [10]. The LR model performed poorly when predicting specimens stabilized by triangular and rectangular aperture geogrids, correctly predicting the specimens only 59 and 62% of the time, respectively. The reasons for the poor performance were explored in Sect. “Performance evaluation of ML models”. The SVM model accurately predicted the specimens stabilized by triangular and rectangular aperture geogrids approximately 80% of the time. By contrast, the tree-based models exhibited less uncertainty than the single-learning models, making accurate predictions at least 85% of the time for RF and 88% of the time for both LGBM and ETC.

Fig. 5

Confusion matrix for (a) light gradient boosting machine, (b) extra trees classifier, (c) random forest, (d) support vector machine, and (e) logistic regression

Sensitivity analysis

The average AUC and accuracy values of each model across all classes are summarized in Table 3. The models were found to demonstrate a strong capacity to distinguish positives from negatives. Both the SVM and tree-based models presented AUC values of > 0.96, while the LR model exhibited an AUC value of 0.87. However, this relatively high value for the LR model was considerably affected by the predictions of the unstabilized class. Figure 6 shows the ROC curves for the LGBM model considering each classification. The unstabilized specimens and the specimens stabilized by triangular and rectangular aperture geogrids exhibit AUC values of 0.99, 0.98, and 0.97, respectively. These results emphasize the potential of applying these algorithms to geotechnical classification problems.

Table 3 Performance of five models on test dataset

Fig. 6

Receiver operating characteristic curve for light gradient boosting machine

Feature importance

The Shapley multiclass output of each model using mean absolute values is illustrated in Fig. 7. Herein, 100 simulations were selected for each ML model to expedite the feature importance algorithm and ensure consistent ranking of the models. First, the mean absolute Shapley values for the LGBM model were found to be considerably higher than those for the ETC, RF, SVM, and LR models. This indicates a wider value range for the LGBM model compared with the narrow range displayed by the other models. In practice, the wider range of mean absolute Shapley values indicates that the LGBM model is better at distinguishing between important and unimportant features for making predictions. MR emerged as the most important feature across all the models. The final rank of feature importance is presented in Table 4, where σc, σd, εz, and N complete the ranking in the specified order. For the LGBM model, εz and σd emerged as the third- and fourth-ranked features, respectively, whereas the ETC, RF, and single-learning models showed a ranking pattern similar to the final ranking. Generally, the feature ranking was remarkably consistent and almost identical across all the models. Figure 7 also reveals the contribution of each feature to the prediction of each class across all the models. For each model, MR contributes considerably to predicting the class of the unstabilized specimen, which could be attributed to the higher and easily separable MR values for the unstabilized specimen. In addition, the specimen stabilized by the triangular aperture geogrid was effectively explained by MR. Furthermore, σc contributed more to predicting unstabilized specimens than to predicting the specimens stabilized by triangular and rectangular aperture geogrids. However, for the specimens stabilized by rectangular geogrid, εz and σd considerably contributed to predicting the class.

Fig. 7

Shapley multioutput absolute mean values for (a) light gradient boosting machine, (b) extra trees classifier, (c) random forest, (d) support vector machine, and (e) logistic regression

Table 4 Feature importance rank using Shapley absolute mean values

Engineering implications

Unbound aggregate base layers play a crucial role in pavement systems, providing structural support, distributing loads to the subgrade, and facilitating drainage. The resilient and permanent deformation behaviors of the base course are two key aspects of pavement design and performance analysis. Accordingly, the resilient modulus and rutting, representing recoverable and permanent deformation under repeated traffic loading, respectively, were evaluated. The resilient modulus and permanent deformation can be influenced by various factors, such as deviatoric stress and the number of load cycles. For pavement rehabilitation, the resilient modulus and permanent deformation can also be assessed using nondestructive field-testing methods, such as falling weight deflectometers. These deflectometers are typically used to estimate elastic-layer moduli based on the impulse load and measured deflection [43]. However, detecting geogrid insertion into the base course can be challenging when using a specific nondestructive field-testing method because the geogrid is much thinner than the base course. Furthermore, identifying the geogrid type without coring and sampling can be quite difficult in the context of pavement rehabilitation. By contrast, using the five features, either measured from field tests or computed from numerical analyses, the ML technique can identify the type and presence of geogrid reinforcement in aggregates. Therefore, the ML classification of geogrid-stabilized aggregates proposed in this study may be a promising approach for assessing pavement conditions and may help in selecting appropriate maintenance techniques.

Conclusions

This study proposed a novel ML technique to distinguish between unstabilized aggregate specimens and the specimens stabilized with triangular and rectangular aperture geogrids. Five ML models were employed using an experimental database comprising five input variables and 4500 data points equally distributed among the three classes. In addition, the feature importance across all the models was examined to improve the interpretability of the ML models. All models could correctly identify the unstabilized specimen with a minimum accuracy of 0.90. This result could be attributed to the higher, more distinguishable resilient modulus values of the unstabilized specimen than those of the stabilized specimens. The single-learning algorithms, especially the LR model, produced less accurate predictions than the tree-ensemble algorithms for the geogrid-reinforced specimens. Generally, the LGBM model exhibited the highest accuracy, with an overall score of 0.91, which could be attributed to the robustness of the algorithm and its high nonlinear mapping capabilities. The results from the sensitivity analysis showed that the SVM and tree-based models correctly identified the positive class, with a minimum AUC value of 0.96. The LGBM model exhibited the highest AUC value of 0.98, while LR showed the lowest AUC value (0.87), considerably affected by the accuracy scores obtained from the prediction of the unstabilized specimen. Finally, the resilient modulus was identified as the most important variable in classification, followed by confining pressure, deviatoric stress, permanent strain, and the number of cycles, in that order. The ML approach proposed in this study may efficiently assess pavement conditions by providing information regarding the type and presence of geogrid reinforcement in aggregates without coring and sampling.