Introduction

Carotid atherosclerosis (CAS) is a major risk factor for cardiovascular and cerebrovascular events. It is characterized by pathological thickening of the intima-media of the common or internal carotid artery and, because of the associated risk of ischemic stroke, coronary events, and restricted blood flow, it imposes a substantial disease burden worldwide [1, 2]. A recent study estimated that, among adults aged 30–79 years worldwide, the prevalence of increased carotid intima-media thickness (IMT) is 27.62%, that of carotid plaque 21.13%, and that of carotid stenosis 1.50% [3]. In China, about 31% of the general population and 39% of people aged 60 to 69 years have carotid plaques [4]. Identifying CAS is a prerequisite for early detection of, and intervention in, cardiovascular and cerebrovascular events such as stroke [4, 5].

Ultrasonography is widely used to measure carotid luminal stenosis and to identify patients with carotid atherosclerosis [6]. However, a high proportion of patients receive a delayed diagnosis of CAS, for reasons that include the following: (1) CAS is usually asymptomatic unless the patient has experienced a symptomatic ischemic stroke, transient ischemic attack, or amaurosis fugax [3]. (2) The accuracy of routine ultrasound examination varies greatly with operator experience, hemodynamics, and other factors. (3) Because of its cost, ultrasonography is generally not included in routine health check-ups, especially in economically underdeveloped areas [7]. Recently, with the rapid development of artificial intelligence, machine learning (ML) algorithms have overcome some limitations of traditional statistical models and have been successfully applied in medical settings because of their potential to improve the accuracy and efficiency of identifying health outcomes from electronic health record (EHR) datasets [8], for example in screening high-risk individuals for COVID-19 [9] and patients with diabetes [10]. ML has also been used for CAS diagnosis [11,12,13]. However, the reported models have several shortcomings. First, common ML algorithms with demonstrated performance on tabular data, such as extreme gradient boosting (XGB) and gradient boosting decision trees (GBDT) [14], have not been evaluated. In addition, previously reported models relied on many uncommon physical examination indicators, which greatly limits their ease of use [12]. Furthermore, external validation, calibration, and interpretability analyses of the established models have not been reported, nor have the sensitivity and specificity of ML models in different high-risk subgroups of CAS. The aim of this study was to develop and validate ML models for CAS classification using routine health check-up indicators and to interpret the outputs of the optimal ML model using the SHapley Additive exPlanations (SHAP) method.

Methods

Data collection and participant selection

This study was conducted following the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement [15]. The health examination center of the First Affiliated Hospital of China Medical University (Shenyang, China) provided health check-up medical records from 2018 and 2019 as Excel spreadsheets. The individuals examined were mainly employees of various organizations, new recruits, and people who voluntarily attended health check-ups; all were local residents of Shenyang, China. The training set contains 80% of the 2019 health check-up data, the internal validation set consists of the remaining 20%, and the 2018 dataset served as the external validation set. The inclusion criteria were: (1) age ≥ 18 years, (2) carotid ultrasound examination performed, and (3) routine biochemical blood testing performed, including liver function, renal function, serum lipids, and fasting serum glucose (FSG). Individuals were excluded if they (1) were aged < 18 years, (2) lacked a carotid ultrasound examination, or (3) lacked biochemical testing.
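The dataset partitioning described above can be illustrated with a minimal sketch. The file and column names (e.g., `checkups_2019.xlsx`, `cas_label`) are hypothetical placeholders, not the authors' actual identifiers, and the stratified split is one reasonable way to implement the stated 80/20 division.

```python
# Minimal sketch of the dataset partitioning; file and column names are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split

df_2019 = pd.read_excel("checkups_2019.xlsx")   # hypothetical file name
df_2018 = pd.read_excel("checkups_2018.xlsx")   # hypothetical file name

# 80/20 split of the 2019 records into training and internal validation sets,
# stratified on the CAS label to preserve the case proportion.
train_df, internal_val_df = train_test_split(
    df_2019, test_size=0.20, stratify=df_2019["cas_label"], random_state=42
)

external_val_df = df_2018  # the 2018 records serve as the external validation set
```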

Variables identified

Variables in the collected datasets included demographic characteristics, clinical variables, and laboratory indices. From the 70 health check-up items, 24 demographic, clinical, and biochemical candidate parameters were selected for CAS model construction according to the study design and clinicians' advice. The 24 selected variables were: demographic characteristics (six variables), including age, sex, body mass index (BMI), waist circumference, height, and body weight; clinical characteristics (two variables), including diastolic blood pressure (DBP) and systolic blood pressure (SBP); and biochemical characteristics (16 variables), including FSG, total cholesterol (TC), triglyceride (TG), high-density lipoprotein cholesterol (HDL-C), low-density lipoprotein cholesterol (LDL-C), non-high-density lipoprotein cholesterol (non-HDL-C), alkaline phosphatase (ALP), gamma-glutamyl transpeptidase (GGT), aspartate aminotransferase (AST), alanine transaminase (ALT), total protein (TP), total bilirubin (TBIL), albumin (ALB), blood urea nitrogen (BUN), creatinine (Cr), and uric acid (UA). Laboratory indicators were measured with an automatic biochemical analyzer (Cobas 8000 c701 module; Roche Diagnostics, Mannheim, Germany).

Outcome definition and assessment

Bilateral maximum carotid IMT values were used to assess the degree of carotid atherosclerosis. According to the diagnostic criteria in the textbook of diagnostic ultrasound (3rd edition) [16], normal carotid IMT was defined as < 1.0 mm; carotid atherosclerosis was defined as localized intimal thickening (1.0 mm ≤ IMT < 1.5 mm); and carotid plaque was defined as an IMT of 1.5 mm or greater, or at least 0.5 mm greater than the surrounding normal IMT, or more than 50% greater than the surrounding normal IMT, with a focal structure protruding into the lumen. Increased IMT, carotid plaque, and carotid stenosis were classified into the CAS group, and all other cases into the control group. CAS was diagnosed by two clinicians who independently reviewed the left and right carotid ultrasound reports; disagreements were resolved through discussion and consultation.
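The grouping rule above can be expressed as a short function. This is a simplified sketch of the numeric thresholds only: the plaque morphology check (focal protrusion into the lumen) and carotid stenosis, which are read from the ultrasound report, are not reproduced here, and the function names are illustrative.

```python
# Hedged sketch of the IMT-based grouping rule; numeric thresholds follow the cited
# criteria, while report-based findings (protrusion, stenosis) are omitted.
def classify_carotid_ultrasound(imt_max_mm: float, surrounding_imt_mm: float) -> str:
    """Return 'normal', 'increased IMT', or 'plaque' for one carotid artery."""
    is_plaque = (
        imt_max_mm >= 1.5
        or imt_max_mm >= surrounding_imt_mm + 0.5
        or imt_max_mm >= 1.5 * surrounding_imt_mm
    )
    if is_plaque:
        return "plaque"
    if imt_max_mm >= 1.0:
        return "increased IMT"
    return "normal"

# Increased IMT, plaque, and stenosis are all labelled CAS (1); 'normal' is control (0).
def cas_label(category: str) -> int:
    return 0 if category == "normal" else 1
```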

Feature selection, model construction, and evaluation

To ensure better model discrimination and reduce redundant variables, a genetic algorithm-based k-nearest neighbors (GA-KNN) approach [17] with ten-fold cross-validation was used for feature selection (repeats = 100). Ten well-known ML algorithms, including decision tree (DT), k-nearest neighbors (KNN), logistic regression (LR), naive Bayes (NB), random forest (RF), multilayer perceptron (MLP), extreme gradient boosting machine (XGB), gradient boosting decision tree (GBDT), linear support vector machine (SVM-linear), and non-linear support vector machine (SVM-nonlinear), were used to develop the CAS classification model. Their performance was assessed on both the internal and external validation datasets. Because LR is a highly interpretable and relatively simple ML algorithm, it served as the performance reference.
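The following is a simplified, hand-rolled sketch of genetic-algorithm feature selection with a KNN fitness function, assuming `X` (samples × 24 candidate features) and `y` are NumPy arrays. The GA operators (truncation selection, one-point crossover, bit-flip mutation) and parameter values are illustrative assumptions, not the exact procedure of the cited GA-KNN method [17].

```python
# Illustrative GA feature-selection sketch with a cross-validated KNN fitness function.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    # Mean 10-fold cross-validated AUC of a KNN classifier on the selected features.
    if mask.sum() == 0:
        return 0.0
    knn = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(knn, X[:, mask], y, cv=10, scoring="roc_auc").mean()

def ga_select(X, y, n_features, pop_size=30, generations=50, p_mut=0.05):
    pop = rng.integers(0, 2, size=(pop_size, n_features)).astype(bool)
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        order = np.argsort(scores)[::-1]
        parents = pop[order[: pop_size // 2]]            # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_features)            # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_features) < p_mut        # bit-flip mutation
            children.append(np.where(flip, ~child, child))
        pop = np.vstack([parents, children])
    scores = np.array([fitness(ind, X, y) for ind in pop])
    return pop[scores.argmax()]                          # best feature subset (boolean mask)

# Example: best_mask = ga_select(X, y, n_features=X.shape[1])
```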

To evaluate model performance, we report both the area under the receiver operating characteristic curve (auROC) and the area under the precision-recall curve (auPR). A calibration plot was used to assess agreement between predicted and observed outcomes. The optimal cut-off point for each model was estimated using Youden's index, and the following metrics were calculated: sensitivity, specificity, positive and negative predictive values (PPV and NPV), and positive and negative likelihood ratios (PLR and NLR).
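A minimal sketch of these evaluation metrics is shown below, assuming `y_true` (binary labels) and the model's predicted probabilities `y_prob` for a validation set; `average_precision_score` is used as a standard summary of the precision-recall curve.

```python
# Sketch of the threshold-free and cut-off-based performance metrics described above.
import numpy as np
from sklearn.metrics import (roc_curve, roc_auc_score,
                             average_precision_score, confusion_matrix)

def evaluate(y_true, y_prob):
    auroc = roc_auc_score(y_true, y_prob)
    aupr = average_precision_score(y_true, y_prob)   # summary of the PR curve

    # Best cut-off by Youden's index (sensitivity + specificity - 1).
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    cutoff = thresholds[np.argmax(tpr - fpr)]
    y_pred = (y_prob >= cutoff).astype(int)

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sens, spec = tp / (tp + fn), tn / (tn + fp)
    ppv, npv = tp / (tp + fp), tn / (tn + fn)
    plr, nlr = sens / (1 - spec), (1 - sens) / spec
    return dict(auROC=auroc, auPR=aupr, cutoff=cutoff, sensitivity=sens,
                specificity=spec, PPV=ppv, NPV=npv, PLR=plr, NLR=nlr)
```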

CAS is influenced by various factors; among these, advanced age, obesity, and a history of hypertension, diabetes, or hyperlipidemia are major risk factors for its development. To verify the stability of model performance, a sensitivity analysis explored the performance of the optimal model in five subgroups. Subgroup 1: individuals aged ≥ 65 years; subgroup 2: individuals with BMI ≥ 30 kg/m²; subgroup 3: individuals with hypertension [18] (SBP ≥ 140 mmHg or DBP ≥ 90 mmHg); subgroup 4: individuals with diabetes [19] (FSG ≥ 7.0 mmol/L); and subgroup 5: individuals with dyslipidemia [20] (TC ≥ 5.18 mmol/L, TG ≥ 1.76 mmol/L, LDL-C ≥ 3.37 mmol/L, or HDL-C ≤ 1.04 mmol/L).
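A sketch of this subgroup evaluation is given below. It assumes a pandas DataFrame `df` holding a validation set with the (illustrative) column names shown, the true CAS label, and the model's predicted probabilities, and it reuses the `evaluate` helper sketched above.

```python
# Hedged sketch of the subgroup (sensitivity) analysis; column names are assumptions.
subgroups = {
    "age >= 65":    df["age"] >= 65,
    "BMI >= 30":    df["bmi"] >= 30,
    "hypertension": (df["sbp"] >= 140) | (df["dbp"] >= 90),
    "diabetes":     df["fsg"] >= 7.0,
    "dyslipidemia": (df["tc"] >= 5.18) | (df["tg"] >= 1.76)
                    | (df["ldl_c"] >= 3.37) | (df["hdl_c"] <= 1.04),
}

# Evaluate the fitted model's predictions within each high-risk subgroup.
subgroup_metrics = {
    name: evaluate(df.loc[mask, "cas_label"], df.loc[mask, "pred_prob"])
    for name, mask in subgroups.items()
}
```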

Model interpretability and utility

To better understand the reasoning behind the best-performing ML model, we applied the SHapley Additive exPlanations (SHAP) method using the SHAP package (https://github.com/slundberg/shap) [21]. The clinical utility of each model was evaluated using decision curve analysis (DCA).
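The sketch below illustrates both analyses under stated assumptions: a fitted tree-based model `gbdt` (e.g., scikit-learn's `GradientBoostingClassifier`), a feature matrix `X_train` as a pandas DataFrame, and validation labels/probabilities `y_val`/`y_prob_val`. The net-benefit formula is computed from first principles rather than a specific DCA library.

```python
# Sketch of SHAP interpretability and DCA net-benefit computation; inputs are assumed.
import numpy as np
import shap

explainer = shap.TreeExplainer(gbdt)
shap_values = explainer.shap_values(X_train)
shap.summary_plot(shap_values, X_train)   # global feature-importance beeswarm plot

# Decision curve analysis: net benefit = TP/n - FP/n * t/(1 - t) at threshold t.
def net_benefit(y_true, y_prob, threshold):
    n = len(y_true)
    y_pred = y_prob >= threshold
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return tp / n - fp / n * threshold / (1 - threshold)

thresholds = np.linspace(0.01, 0.99, 99)
model_curve = [net_benefit(y_val, y_prob_val, t) for t in thresholds]
treat_all   = [net_benefit(y_val, np.ones_like(y_prob_val), t) for t in thresholds]
treat_none  = 0.0   # never treat: net benefit is zero at every threshold
```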

Statistical analysis

The dataset was cleaned using listwise deletion to exclude records with missing data and the Tukey method to identify and remove outliers. Categorical variables are expressed as n (%); continuous variables are expressed as mean ± SD or, when skewed, as median (interquartile range, IQR). The chi-squared test, Student's t-test, Mann–Whitney U test, or Kruskal–Wallis H test was used to compare group differences, as appropriate for the variable distribution and comparison purpose. The models were developed with our own program built in Python (version 3.7; Python Software Foundation, Wilmington, DE, USA) using the scikit-learn package (version 0.24.0).
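A minimal sketch of this cleaning step is shown below, assuming a pandas DataFrame `df` and a list `numeric_cols` of the continuous candidate variables; the 1.5 × IQR fences are the conventional Tukey rule and are an assumption about the exact cut-offs used.

```python
# Sketch of listwise deletion plus Tukey-fence outlier removal.
import pandas as pd

df = df.dropna()   # listwise deletion of records with any missing value

# Tukey's fences: keep values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
def tukey_mask(series: pd.Series) -> pd.Series:
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

keep = pd.concat([tukey_mask(df[c]) for c in numeric_cols], axis=1).all(axis=1)
df_clean = df[keep]
```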

Results

Characteristics of the study populations

A flowchart of the participant selection process is shown in Fig. 1. A total of 69,601 people received health check-ups between 2018 and 2019. After excluding those who did not undergo carotid ultrasonography, data for 6315 participants were included in the analysis: 3264 in the training dataset, 817 in the internal validation dataset, and 2234 in the external validation dataset. Table 1 presents the demographic and clinical characteristics of the training, internal validation, and external validation datasets. The proportion of CAS was approximately 50% in all three datasets, with no statistically significant differences between them. Compared with the training dataset, AST was higher in the internal validation dataset (20 [IQR: 17, 24] vs. 19 [IQR: 16, 24] U/L). Age and waist circumference were higher in the external validation dataset than in the training dataset (age: 49 [IQR: 40, 57] vs. 48 [IQR: 37, 55] years; waist circumference: 85 [IQR: 79, 91] vs. 85 [IQR: 78, 91] cm), whereas HDL-C (1.20 [IQR: 1.01, 1.42] vs. 1.21 [IQR: 1.03, 1.46] mmol/L), TP (69.1 [IQR: 66.8, 71.6] vs. 70.2 [IQR: 67.8, 72.6] g/L), and ALB (43.80 [IQR: 42.30, 45.40] vs. 44.30 [IQR: 42.70, 45.90] g/L) were lower in the external validation dataset. The other characteristics of each validation set were comparable to those of the training set.

Fig. 1 Flowchart of the study

Table 1 Characteristics of the training and validation datasets

Development and calibration of CAS classification ML models

Ten features for CAS classification were selected from the 24 candidate variables using the GA-KNN algorithm: age, sex, non-HDL-C, FSG, TC, DBP, LDL-C, ALB, GGT, and ALP. Table 2 and Fig. 2 summarize the performance of the ML models. In the internal validation set, the LR and GBDT models performed best, with auROCs of 0.861 (95% CI 0.841–0.881) and 0.860 (95% CI 0.839–0.880), respectively, whereas the corresponding values for KNN, MLP, SVM-linear, SVM-nonlinear, RF, NB, XGB, and DT were 0.800 (95% CI 0.777–0.824), 0.852 (95% CI 0.832–0.872), 0.846 (95% CI 0.824–0.867), 0.835 (95% CI 0.812–0.857), 0.849 (95% CI 0.828–0.870), 0.829 (95% CI 0.805–0.852), 0.855 (95% CI 0.835–0.876), and 0.817 (95% CI 0.794–0.840), respectively. Model performance assessed by auPR was consistent with that assessed by auROC. At the operating point determined by the maximal Youden index, the specificity, sensitivity, PPV, NPV, PLR, and NLR were 0.85, 0.722, 0.757, 0.804, 3.057, and 0.208 for the LR model and 0.84, 0.729, 0.762, 0.797, 3.104, and 0.219 for the GBDT model, respectively. In external validation, the LR and GBDT models showed similar performance in auROC, auPR, sensitivity, and specificity (Fig. 2). Calibration curves for the GBDT model in the training, internal validation, and external validation datasets are shown in Fig. 3 and indicate good agreement between predicted and observed probabilities.

Table 2 The performance of ten ML models for recognizing CAS in the training set, internal validation set, and external validation set
Fig. 2 ROC and PR curves of models with different algorithms in the training, internal validation, and external validation datasets. PR: precision-recall; ROC: receiver operating characteristic; KNN: k-nearest neighbors; LR: logistic regression; NB: naive Bayes; RF: random forest; SVM-linear: linear support vector machine; SVM-nonlinear: non-linear support vector machine; DT: decision tree; GBDT: gradient boosting decision tree; MLP: multilayer perceptron; XGB: extreme gradient boosting machine

Fig. 3 Calibration plots of the GBDT model in the training, internal validation, and external validation datasets. GBDT: gradient boosting decision tree

Sensitivity analysis of the optimal GBDT model for CAS classification

To test the performance of the GBDT model in different CAS risk groups, sensitivity analyses were performed in five subsets of the training, internal validation, and external validation datasets: individuals aged ≥ 65 years, with BMI ≥ 30 kg/m², with dyslipidemia, with hypertension, or with diabetes (Table 3). The GBDT model showed moderate to high discriminative performance across subgroups, with auROC ranging from 0.869 to 0.996, auPR from 0.866 to 0.993, sensitivity from 0.710 to 0.948, specificity from 0.775 to 0.939, PPV from 0.645 to 0.951, PLR from 0.93 to 3.184, and NLR from 0.133 to 0.689. However, the NPV was relatively low in the subgroup aged ≥ 65 years (range 0.100–0.333) and in the subgroup with diabetes (range 0.361–0.548).

Table 3 Performance of GBDT model in five high-risk CAS subgroups

Interpretability and clinical benefit analysis

Finally, the GBDT model, which had the best performance, was selected for SHAP analysis; SHAP analysis was also performed on the XGB model, an ensemble learning algorithm derived from GBDT. Figure 4 shows a global summary of the SHAP value distribution for all features, which conveys the importance of each feature. Age, sex, non-HDL-C, FSG, DBP, and TC were the top six indicators for CAS classification. In both the GBDT and XGB models, age, non-HDL-C, FSG, and DBP were positively associated with CAS, while sex and TC were negatively associated with CAS (Fig. 4a, b). Age contributed the most to the model predictions. The clinical utility of the ML models at varying risk thresholds is depicted in Fig. 5: the ML models yielded a net benefit in DCA compared with the “treat-all” and “treat-none” strategies at threshold probabilities above 20%. Here, “treat” refers to selecting patients for intervention.

Fig. 4 Contribution analysis of the GBDT and XGB model predictions in the training dataset using the SHAP technique. The higher the ranking, the more important the feature; each point represents a patient, and the color gradient from red to blue corresponds to high to low values of that feature. Points to the left of the baseline (SHAP value of 0) represent a negative contribution to CAS risk, while points to the right represent a positive contribution; the farther from the baseline, the greater the impact. CAS: carotid atherosclerosis; GBDT: gradient boosting decision tree; SHAP: SHapley Additive exPlanations; XGB: extreme gradient boosting machine; ALB: albumin; ALP: alkaline phosphatase; DBP: diastolic blood pressure; FSG: fasting serum glucose; GGT: gamma-glutamyl transpeptidase; LDL-C: low-density lipoprotein cholesterol; Non-HDL-C: non-high-density lipoprotein cholesterol; TC: total cholesterol

Fig. 5 DCA curves of the ML models in the development and validation datasets. DCA: decision curve analysis; KNN: k-nearest neighbors; LR: logistic regression; NB: naive Bayes; RF: random forest; SVM-linear: linear support vector machine; SVM-nonlinear: non-linear support vector machine; DT: decision tree; GBDT: gradient boosting decision tree; MLP: multilayer perceptron; XGB: extreme gradient boosting machine

Discussion

This study developed and validated a screening model for CAS using ten ML algorithms based on routine clinical and laboratory features. The GBDT model provided the best discriminatory performance (the highest auROC and auPR in the validation datasets), and its other metrics also outperformed those of the other ML models in both the internal and external validation sets, demonstrating the utility of the best model. Interpretability analysis showed that age was the most critical factor in the GBDT model's decision-making; other important factors included sex, non-HDL-C, and DBP. Compared with previous studies, this study has the following advantages. First, we used the GA-KNN algorithm to select the optimal combination of features. Second, the model was validated on an external dataset, which further confirmed its discriminative ability. Third, the SHAP algorithm compensates for the “black box” problem of advanced ML algorithms [21]. This study can be seen as a first step toward the use of ML models for CAS screening in clinical practice and can serve as a reference for further research.

The GA-KNN algorithm was used to select the optimal combination of candidate variables for CAS classification in our study. Compared with Shao's study [11], which modeled carotid plaque classification in a physical examination population, five of the same predictive variables (age, sex, blood pressure, glucose, and serum lipids) were used for model construction; this similarity supports the reliability of the GA-KNN feature selection algorithm. In terms of the number of indicators included in the model, Fan et al. used 19 features from different medical examination packages [13], raising the possibility of collinearity, which may bias model predictions. In addition, two other studies selected uncommon indicators, such as nonalcoholic fatty liver disease and homocysteine in Yu et al. [12] and platelets and diabetes mellitus in Fan et al. [13]; the inclusion of such indicators greatly limits the scope of application of those models. Our study used a genetic algorithm combined with the KNN algorithm to find the optimal combination of routine health check-up indicators, which helps avoid the bias introduced when features are selected manually on the basis of experience alone. This technique is worthy of further validation and evaluation in future studies [23].

Our study found that the GBDT algorithm achieved the best performance in CAS classification, which compares favorably with other reported ML models [23]. The better performance of GBDT in our study may be explained as follows. First, we used a feature selection strategy to find the best combination of CAS predictors, ensuring that important information was retained while redundancy was avoided. Second, although the superiority of LR as a classical linear statistical model was confirmed in a previous study [13], the GBDT model in our study achieved similar performance using a different computational strategy. In terms of algorithmic principle, GBDT is a classical tree-based boosting algorithm that can capture non-linear and interacting relationships between input and output [24]. It is also worth noting that although the XGB algorithm is derived from GBDT, the XGB model in our study did not perform as well as the GBDT model; the underlying reason may be that XGB has more hyperparameters to tune and is more prone to overfitting than GBDT on real-world EHR data. The GBDT algorithm can therefore be considered a powerful tool for analyzing real-world EHR data.

In addition, the subgroup analysis showed that the established GBDT model had a low NPV in the subgroups aged ≥ 65 years or with diabetes, indicating that the model was not reliable enough to rule out CAS in these two subgroups. The underlying reasons may be the small number of negative samples and the high prevalence of CAS in these subgroups, which led to insufficient training of the model's ability to discriminate CAS-negative individuals. Another reason may be that the features selected in our study lacked adequate diagnostic power for elderly and diabetic patients, suggesting that adding more discriminatory predictors (e.g., risk genes) might improve model performance in the future. Furthermore, patients in both subgroups often have several underlying diseases, which may affect the model's discriminative power. Finally, before ML modeling, cluster analysis [25] could be used to explore the heterogeneity of the target population, thereby guiding the construction of the ML model and achieving a balance between bias and variance.

Considering the “black box” nature of advanced ML models, this study used the SHAP algorithm, which can be applied to any type of ML model, offers fast implementations for tree-based models, and ensures consistency and local accuracy, to conduct an interpretability analysis of the GBDT and XGB models. For the first time, we ranked the factors affecting CAS and found that age and sex were the two key factors in the GBDT model's CAS classification. This may be corroborated by previous findings that age and sex influence CAS distribution and ultrasound morphology [26, 27], and it indicates that age and sex differences should be considered in clinical practice [28]. In addition, consistent with previous findings, non-HDL-C, FSG, and DBP were also important predictors of CAS [29], suggesting that CAS is closely related to metabolic status [27] and that these metabolic indicators deserve more attention in CAS prevention [3].

This study has several limitations. (1) Although internal and external validation datasets were used to assess the model's stability, the risks and benefits of deploying the optimal model in real-world settings require further evaluation in clinical trials. (2) Information on medications could not be collected from the health check-up records. However, preparation before the physical examination, including dietary restrictions (e.g., avoiding greasy and hard-to-digest food), not taking non-essential medicines for three days before the examination, and abstaining from food and water on the day of the examination, should minimize the impact of potential interfering factors. (3) Because this study was based on physical examination data from people in Northeast China, the reliability of the established models needs further validation before they are applied to populations beyond the one represented in this study.

Conclusions

The ML models developed here provide good discriminative power for CAS identification. They will hopefully be applied in settings without ethnic or geographic heterogeneity relative to the study population and help guide the prevention and management of individuals at risk of CAS.