Introduction

Coronary atherosclerotic heart disease (coronary heart disease, CHD) is currently the most common cardiovascular disease in the world. The incidence and mortality rates of cardiovascular diseases, particularly CHD, in people living with HIV (PLHIV) have been increasing annually [1]. In PLHIV, the mortality rate of myocardial infarction is 1.5 to 1.7 times that of the general population, and the average age of death is about 48 years, which is far lower than that of the general population [2]. Although PLHIV have a very high awareness of traditional risk factors for CHD, the incidence of CHD has not been reduced in this group, which may be related to characteristics specific to PLHIV. Compared with the general population, the age of onset of CHD in PLHIV is about 10 years earlier, and the risk of CHD increases with age. CHD is influenced by many complex traditional risk factors, among which male gender, smoking, and lowered high-density lipoprotein cholesterol are high-risk factors [3]. Therefore, analyzing the changes in clinical characteristics of PLHIV with CHD and exploring the risk factors for patients with comorbidities is of great significance for disease prevention and treatment in this special group.

The death rate of AIDS has gradually decreased owing to the extended application of highly active antiretroviral therapy (HAART). In turn, the life expectancy of PLHIV has been prolonged, even in less-developed areas such as sub-Saharan Africa [4, 5]. However, although AIDS has been transformed into a manageable chronic disease [6], the risk of underlying diseases such as cardiovascular disease (CVD) has increased. According to a meta-analysis, the risk of CVD among PLHIV is 2.16 times that in the general population [7]. The mortality rate among PLHIV is 1.6 per 1000 people and has been observed to increase annually [8]. This rise in mortality is primarily attributed to the long-term effects of antiretroviral therapy, which include hypercoagulability, co-infection, and immune activation [9,10,11,12]. Protease inhibitor therapy can also cause side effects such as hyperlipidemia and insulin resistance, which further promote the pathogenesis of CVD [13]. Given that CVD has become the leading cause of non-AIDS death among PLHIV, the management of CVD in PLHIV should be prioritized to control the death rate.

Risk factors help predict potential negative events in advance. In PLHIV, the risk factors are similar to the traditional CVD risk factors but more severe. For example, the prevalence of hypertension, diabetes, atherosclerosis, and dyslipidemia is significantly higher than in the non-HIV-infected population [14, 15]. Unfortunately, few studies have reported on risk factors in Chinese PLHIV with CHD. Only one retrospective study of these risk factors is available. Its results show that CHD presents similarly in HIV-positive patients, except that their body mass index is lower than that of HIV-negative patients, and that most of the clinical characteristics of HIV-positive patients with CHD are similar to those of HIV-negative ones. However, the levels of total cholesterol, high-density lipoprotein cholesterol, and low-density lipoprotein cholesterol in HIV-positive patients are significantly lower, the heart is significantly enlarged, and the incidence of acute coronary syndrome is reduced [16]. Specifically, males and young people infected with HIV are more likely to smoke than the non-HIV group, and smoking is the most important risk factor for acute coronary syndrome [17,18,19]. However, a 6-year follow-up study has revealed that in a population without CVD risk factors, the probability of acute myocardial infarction in PLHIV is twice that in non-infected individuals. The former group of PLHIV is 7 to 10 times more likely to have an acute myocardial infarction than those without HIV. Even after controlling for traditional CVD risk factors, people with HIV are twice as likely to develop CVD as those without HIV [20]. The reason may be that most PLHIV do not have high risk factors for conventional CVD at the time of diagnosis. Current risk factors therefore fail to predict potential CVD risk in PLHIV, and traditional CVD risk-assessment tools may underestimate the CVD risk of PLHIV, so a more accurate predictive indicator is needed.

Machine learning (ML) algorithms exhibit improved discrimination capacity and generalizability in high-dimensional data and are not confined by strict exclusion and inclusion criteria. Thus, the actual health status of individuals is available to these algorithms [21]. These properties address limitations in existing risk-prediction techniques. ML models, leveraging electronic medical records (EMRs), can enhance clinical diagnostic accuracy and decision-making by physicians [22]. Therefore, multiple ML algorithms have been extensively used to predict CVD [23, 24], including the prediction of 3-year all-cause mortality in patients with heart failure caused by CHD [25] and the classification of in-hospital mortality in chronic kidney disease patients with coronary artery disease [26]. ML algorithms can be useful in identifying patients with CVD. Often, many elements contribute to classifying patients who are at risk for these common diseases, and ML methods can help identify hidden patterns in these factors that may otherwise be missed. However, no ML-based predictive models of CHD in PLHIV built on EMRs have been constructed yet.

Accordingly, the present study aimed to determine accurate predictive risk factors for CHD in PLHIV by establishing a risk-prediction model based on ML. We compared the predictive performance of seven ML algorithms in detail, selected the model with the best comprehensive performance, and visually explained the model. This model can assist clinicians in screening HIV patients who may experience CHD in the future, identify the risk factors for CHD among HIV-infected patients, and provide evidence-based guidance for the prevention of CHD among HIV-infected patients in the Chinese population.

Materials and methods

Data source

Data were acquired from the EMR database of inpatients of Guangxi Chest Hospital, a unique provincial clinical center for the prevention and control of HIV/AIDS. The EMR database was linked to collect demographic information (e.g., age and gender), clinical laboratory measurement records, and clinical diagnoses of inpatients. Valid and intact patient data obtained between June 2016 and October 2021 were included in the study. Furthermore, to maintain privacy, identity-related information of all individuals was concealed during data acquisition.

Study population

Individuals were diagnosed with HIV, per the International Classification of Diseases (ICD)-10 codes. The inclusion criteria were as follows: (1) age above 18 years; (2) patient was confirmed to be HIV-infected according to the Guidelines for the Diagnosis and Treatment of AIDS in China (2018 Edition); and (3) results of biochemical examination during hospitalization can be queried. Meanwhile, patients with incomplete medical histories were excluded.

Data imputation

To enhance data utilization, variables with more than 20% missing data were excluded before data imputation. For the remaining variables, missing data were imputed using the random forest (RF) method. RF-based algorithms are well suited to imputation because they can handle mixed types of missing data, are adaptive to nonlinearity and interactions, and can potentially be scaled to big-data settings [27].
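
The variable-screening rule described above can be sketched as follows. This is a minimal illustration, not the study's code: the function names and the toy records are hypothetical, and missing entries are assumed to be encoded as None. The RF imputation step itself is not shown, because the text states only that an RF method was used.

```python
from typing import Dict, List, Optional

Column = List[Optional[float]]


def missing_fraction(values: Column) -> float:
    """Share of entries in one variable that are missing (None)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def drop_sparse_variables(table: Dict[str, Column],
                          max_missing: float = 0.20) -> Dict[str, Column]:
    """Keep only variables whose missing fraction does not exceed max_missing."""
    return {name: col for name, col in table.items()
            if missing_fraction(col) <= max_missing}


# Hypothetical toy records, not study data.
toy = {
    "age": [54.0, 61.0, 47.0, 39.0, 70.0],   # 0% missing
    "Glu": [5.1, None, 6.3, 5.8, 7.2],       # 20% missing: kept
    "AMY": [None, None, 80.0, 95.0, 60.0],   # 40% missing: dropped
}
kept = drop_sparse_variables(toy)
print(sorted(kept))
```

Variables that survive this filter would then be passed to the imputation step.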

Class-imbalance problem

An ML classifier is generally biased toward the majority class when dealing with class-imbalanced datasets, leading to poor classification of the minority class. In such problems, the majority is labeled as one class, whereas the minority is labeled as the other class [28]. In this dataset, individuals with CHD accounted for only 3.53% of PLHIV. This imbalanced distribution of the two classes could lead to subpar prediction performance for the minority class [28]. Cost-sensitive learning, a method used in data mining, aims to produce accurate results for class-imbalanced datasets with minimal cost by re-weighting the cost matrix so that the classifier pays more attention to the under-represented cases and avoids high-cost misclassifications [29].
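
One common way to re-weight classes is the "balanced" heuristic, in which each class receives the weight n_samples / (n_classes × class count). The sketch below applies it to the class counts reported in this study (239 CHD cases among 6772 PLHIV). It is an assumption that this particular weighting scheme resembles the one used; the text does not specify the cost matrix, and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Class counts reported in the Results: 239 CHD cases among 6772 PLHIV.
labels = [1] * 239 + [0] * (6772 - 239)
weights = balanced_class_weights(labels)
print({cls: round(w, 3) for cls, w in weights.items()})
```

Under this scheme an error on a CHD case costs roughly 27 times as much as an error on a non-CHD case (6533 / 239), while the total weight of the dataset is unchanged.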

Model development and evaluation

Seven ML algorithms were implemented in Python (version 3.10.9): a light gradient-boosting machine (LightGBM), lasso-logistic regression, eXtreme gradient boosting (XGBoost), adaptive boosting (AdaBoost), decision tree, multilayer perceptron (MLP), and support vector machine (SVM). They were used to identify the most informative variables for CHD risk prediction in PLHIV and to develop models that predicted CHD in PLHIV as a binary outcome (absence or presence) from the laboratory and clinical diagnosis values of the chosen predictor variables.

LightGBM and XGBoost are members of the boosting algorithm family and use the negative gradient of the loss function to compute the residual and approach the optimal solution. LightGBM is a highly efficient and accurate implementation of the gradient-boosting decision tree (GBDT). Compared with XGBoost, LightGBM reportedly trains faster, consumes less memory, has higher accuracy, and can handle larger amounts of data [30]. The AdaBoost algorithm [31] is a classic boosting algorithm that trains various weak classifiers on the same training set and then assembles them into a stronger final classifier. MLP [32], an enhancement of the logistic regression method, is a feedforward artificial neural network model that maps multiple inputs onto a single output. SVM [33] is a binary classification model that maps data into a high-dimensional space and finds a hyperplane in that space that maximizes the distance between the data points and the hyperplane, which distinguishes it from MLP. The decision-tree [34] learning method is a non-parametric supervised approach. It summarizes decision rules from labeled data with features and represents them as a tree graph to solve regression and classification problems. Logistic regression converts the output of linear regression into probability values via the logistic (sigmoid) function to classify samples. Lasso regression was used to screen features and eliminate multicollinearity among them.
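
To make the role of the negative gradient concrete, the sketch below assumes the binary log loss (a common choice for classification, not confirmed by the text as the loss used here). For this loss, the negative gradient with respect to the raw score is the residual y − p, where p is the sigmoid of the score. The function names are hypothetical, and a finite-difference check is included to confirm the identity numerically.

```python
import math


def sigmoid(score: float) -> float:
    """Map a raw (log-odds) score to a probability, as in logistic regression."""
    return 1.0 / (1.0 + math.exp(-score))


def log_loss(y: int, score: float) -> float:
    """Binary cross-entropy for one sample, written in terms of the raw score."""
    p = sigmoid(score)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def negative_gradient(y: int, score: float) -> float:
    """Analytic negative gradient of log_loss w.r.t. score: the residual y - p."""
    return y - sigmoid(score)


def numeric_negative_gradient(y: int, score: float, eps: float = 1e-6) -> float:
    """Central finite-difference estimate of the same quantity."""
    return -(log_loss(y, score + eps) - log_loss(y, score - eps)) / (2 * eps)


# Compare the analytic residual with the numerical estimate on a few points.
for y, score in [(1, -2.0), (1, 0.5), (0, 0.5), (0, 3.0)]:
    print(y, score,
          round(negative_gradient(y, score), 4),
          round(numeric_negative_gradient(y, score), 4))
```

Each boosting round fits a new tree to these residuals, so samples that the current model predicts poorly receive the largest corrections.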

Data were randomly split into training and validation datasets using the Python package scikit-learn. Of these data, 80% were used for model training, whereas the remaining 20% were used to validate predictive performance. In this study, sensitivity, accuracy, specificity, negative predictive value (NPV), positive predictive value (PPV), the area under the receiver operating characteristic curve (ROC-AUC), and F1 score (2 × (precision × recall)/(precision + recall)) were used to compare model performance. Moreover, 10-fold cross-validation was performed to compare the AUC of the seven ML algorithms and to ascertain the overall best performance.
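
The evaluation metrics listed above can be computed from a confusion matrix and a set of predicted scores, as in the following sketch. The labels, scores, and the 0.5 decision threshold are hypothetical toy values, and the function names are illustrative; the AUC is computed through its rank interpretation (the probability that a randomly chosen positive case is scored above a randomly chosen negative case, with ties counted as one half).

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count true/false positives and negatives for binary labels (1 = CHD)."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(t == 1 and p == 1 for t, p in pairs),
        "fp": sum(t == 0 and p == 1 for t, p in pairs),
        "tn": sum(t == 0 and p == 0 for t, p in pairs),
        "fn": sum(t == 1 and p == 0 for t, p in pairs),
    }


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Threshold-based metrics; assumes every denominator is non-zero."""
    sensitivity = tp / (tp + fn)   # recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)           # precision
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * (ppv * sensitivity) / (ppv + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "npv": npv, "accuracy": accuracy, "f1": f1}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as P(score of a positive > score of a negative), ties counted as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy labels and predicted probabilities, not study data.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1, 0.05]
y_pred = [int(s >= 0.5) for s in scores]   # assumed threshold of 0.5

metrics = classification_metrics(**confusion_counts(y_true, y_pred))
auc = roc_auc(y_true, scores)
print({k: round(v, 3) for k, v in metrics.items()}, round(auc, 3))
```

Unlike the threshold-based metrics, the AUC does not depend on the chosen cut-off, which makes it a convenient single figure for comparing algorithms across cross-validation folds.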

To interpret the findings of the ML models more clearly, the Shapley additive explanation (SHAP) method was used for visualization analysis. This method was applied to the best-performing prediction model by computing the individual contribution of each variable [35]. A SHAP value denotes the contribution of a feature to the outcome value. A positive value indicates that the feature increases the likelihood of a positive outcome, whereas a negative one indicates that the feature decreases the likelihood of a positive outcome.
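
The additive property behind SHAP can be illustrated with exact Shapley values on a small example. The sketch below is not the SHAP library and not the study's model: the toy risk score, its coefficients, and the patient and reference values are hypothetical, and features outside a coalition are replaced by a single reference value (a simplification; SHAP typically averages over a background dataset).

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict

Point = Dict[str, float]


def shapley_values(model: Callable[[Point], float],
                   x: Point, baseline: Point) -> Point:
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition are set to their baseline value.
    """
    names = list(x)
    n = len(names)

    def value(coalition) -> float:
        point = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return model(point)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_feature = value(set(subset) | {name})
                without_feature = value(set(subset))
                total += weight * (with_feature - without_feature)
        phi[name] = total
    return phi


def toy_model(p: Point) -> float:
    """Hypothetical risk score; coefficients are illustrative only."""
    return 0.04 * p["age"] + 0.8 * p["hypertension"] + 0.3 * p["hypertension"] * p["Glu"]


patient = {"age": 68.0, "hypertension": 1.0, "Glu": 9.0}
reference = {"age": 54.0, "hypertension": 0.0, "Glu": 5.5}

phi = shapley_values(toy_model, patient, reference)
base_value = toy_model(reference)
print({k: round(v, 3) for k, v in phi.items()})
print(round(base_value + sum(phi.values()), 3), round(toy_model(patient), 3))
```

The contributions sum to the difference between the patient's score and the reference score, which is the decomposition that force plots visualize.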

Statistical analyses

Analyses were conducted using SPSS (ver. 26.0) software (IBM, Chicago, IL, USA). The clinical-feature analysis was performed on the complete, imputed dataset. Continuous variables were reported as the median (IQR) because the data were non-normally distributed, and categorical variables were reported as numbers and percentages. Continuous and categorical variables were compared using the Wilcoxon rank sum test and the Chi-square test, respectively. For all tests, P < 0.05 was deemed statistically significant. The general schema for building this prediction model is illustrated in Fig. 1.
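
As an illustration of the categorical comparison, the sketch below computes the Pearson Chi-square statistic for a 2 × 2 table, without continuity correction, and its P value with one degree of freedom. The analyses in this study were run in SPSS; this code only mirrors the calculation, and the table counts and function name are hypothetical.

```python
import math
from typing import Sequence, Tuple


def chi_square_2x2(table: Sequence[Sequence[int]]) -> Tuple[float, float]:
    """Pearson chi-square statistic (no continuity correction) and P value."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: rows = group A / group B, columns = exposed / unexposed.
stat, p_value = chi_square_2x2([[30, 10], [20, 40]])
print(round(stat, 3), p_value < 0.05)
```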

Fig. 1

General schema for building and evaluating the prediction model. Positive samples were defined as PLHIV with CHD, whereas negative samples were PLHIV without CHD

Results

From the data of 6792 PLHIV, including 239 CHD patients and 6553 non-CHD patients, 75 variables were extracted. Twenty patients younger than 18 years of age were excluded, leaving 6772 patients. Then, 14 variables were removed because more than 20% of their data were missing, ultimately leaving 61 variables. These variables covered demographics, laboratory measurements, and diseases besides HIV and CHD (e.g., hypertension). After imputation with RF, clinical-feature analysis of the dataset was performed, and the findings are presented in Table 1. The median age of the 6772 PLHIV was 54 (IQR: 43–64) years, and the cohort included 5152 males (76.08%) and 1620 females (23.92%). Additionally, a total of 239 individuals (3.53%) were diagnosed with CHD, of whom 82.43% were males and 17.57% were females.

Table 1 Demographic and clinical characteristics of the included patients

Model performance and evaluation

The training set contained 5417 samples, whereas the validation set contained 1355 samples. XGBoost, decision tree, AdaBoost, LightGBM, SVM, MLP, and lasso-logistic regression models were built on the training set with the aforementioned 61 variables. The performance of the seven ML algorithms is presented in Table 2 and Fig. 2. The LightGBM model demonstrated superior performance, achieving the highest AUC of 0.849 (95% CI, 0.814–0.883), whereas the decision tree exhibited the lowest AUC (0.753; 95% CI, 0.704–0.803). Given that the LightGBM model exhibited the best performance across the seven ML algorithms, it was deemed the best model.

Table 2 Performance of prediction models generated by the seven ML algorithms
Fig. 2

Assessment of the seven ML algorithms per the AUC of the ROC curve. AUC, area under the curve; ROC, receiver operating characteristic

Explanation of risk factors

SHAP was used to interpret the LightGBM model by computing the individual contribution of each variable. The importance matrix and SHAP summary plots for LightGBM are depicted in Fig. 3, whereas the SHAP dependence plot is depicted in Fig. 4. The importance matrix plot ranked the variables contributing to CHD risk prediction among PLHIV from highest to lowest contribution as follows: baseline age, heart failure, hypertension, glucose (Glu), serum creatinine (Scr), indirect bilirubin (I-Bil), amylase (AMY), and serum uric acid (SUA) (Fig. 3A). The SHAP summary (Fig. 3B) and SHAP dependence (Fig. 4) plots showed the influence of each variable on the CHD outcome.

Fig. 3

A Importance matrix plot of the LightGBM model depicting the significance of each variable in predicting CHD risk in PLHIV. B SHAP summary plot of the top eight clinical attributes of the LightGBM model. Each point represents the SHAP value of a specific feature for one data point, indicating the magnitude and direction of that feature’s impact on the model’s predictive outcome. Red points denote high feature values and blue points denote low feature values; points with positive SHAP values increase the predicted risk, whereas points with negative SHAP values decrease it. Features are ranked from top to bottom by their average impact, highlighting their relative importance in the model’s decision-making process. Glu, glucose; Scr, serum creatinine; I-Bil, indirect bilirubin; AMY, amylase; SUA, serum uric acid

Fig. 4

SHAP dependence plot of the LightGBM model, illustrating the influence of a single variable on the prediction. The blue points in the plot represent the SHAP values for this feature across different data points, with the horizontal position indicating the feature’s contribution to the predictive outcome. SHAP values greater than zero suggest an increased risk of CHD in PLHIV. The vertical axis represents the actual values of the feature, and the distribution of points reveals the relationship between the feature values and the risk of CHD. Glu, glucose; Scr, serum creatinine; I-Bil, indirect bilirubin; AMY, amylase; SUA, serum uric acid

As illustrated in the SHAP summary plot, higher feature values generally corresponded with a higher likelihood of CHD occurrence in PLHIV. The red and blue dots depict higher and lower feature values, respectively. High values of age, heart failure, hypertension, Glu, Scr, and SUA corresponded with SHAP values > 0, indicating that these features were vital risk factors for CHD in PLHIV. Generally, older PLHIV (Fig. 4A) with heart failure and hypertension (Fig. 4B and C), high Glu (Fig. 4D), high Scr (Fig. 4E), low I-Bil (Fig. 4F), low AMY (Fig. 4G), and high SUA (Fig. 4H) exhibited an elevated CHD risk.

Applying the prediction model

The application of the model to individual patients is illustrated in Fig. 5. Red indicates that a feature value increases the CHD probability, whereas blue indicates that it reduces the CHD probability. Here, f(x) is the model output for an individual, i.e., the base value plus the sum of that individual's SHAP values, and the base value is the mean model output across all samples. Hence, if f(x) was higher than the base value, the model would classify the individual as having CHD. Figure 5A shows a PLHIV with CHD who was accurately predicted to have CHD, and Fig. 5B shows a PLHIV without CHD who was accurately distinguished. Therefore, the LightGBM model discriminated sufficiently between CHD and non-CHD individuals and can assign different risk probabilities based on the individual circumstances of each patient. Figure 5C illustrates the SHAP values predicted for each patient in the training set, with more red indicating a higher overall risk.

Fig. 5

SHAP force plot for PLHIV in the dataset at high (A) or low (B) risk of CHD. C SHAP values (global interpretation) for the training set. The abscissa denotes each individual, and the ordinate depicts the SHAP value. More red denotes a higher overall risk; conversely, more blue indicates a comparatively lower risk of CHD

Discussion

This study identified predictive risk factors for CHD among PLHIV and subsequently developed a CHD risk-prediction model using ML and easily retrieved clinical variables from EMRs. Earlier studies have shown that existing cardiovascular disease prediction models for HIV patients primarily include the D:A:D model (2010) for 5-year risk [36], the D:A:D model (2010) for 10-year risk [37, 38], the full D:A:D (2016) model for 5-year risk [39], the reduced D:A:D (2016) model for 5-year risk [39], the HIV MI-1 model [40], and the HIV MI-2 model [40]. These models are primarily based on Cox proportional hazards models, Poisson regression models, and lasso and ridge regressions [41]. However, significant heterogeneity exists in the definition of cardiovascular disease among these models [41], and the prediction of CHD risk in HIV patients is susceptible to other confounding factors. To the best of our knowledge, this work presents the first predictive model for HIV patient-specific CHD developed using ML algorithms. We compared the predictive performance of seven ML algorithms to establish the best model. The LightGBM model had the highest AUC, F1 score, and PPV. LightGBM is an effective implementation of the gradient-boosting learning algorithm, which is based on the decision-tree algorithm and uses n boosted trees. It has been reported to be superior to other algorithms in prediction problems [42] and is extensively used in regression and classification research with good detection results [42]. Accordingly, the SHAP method was used to explain the decision-making process of the LightGBM algorithm and to help doctors intuitively understand its prediction process.

The SHAP summary and dependence plots showed that heart failure, age, hypertension, Glu, I-Bil, Scr, AMY, and SUA distinguished patients with HIV who were at low or high risk of CHD. Older PLHIV with high Glu, Scr, and SUA and with low I-Bil and AMY, combined with heart failure and hypertension, were at a higher risk of developing CHD. Both elevated glucose and hypertension were risk factors for CHD in PLHIV, consistent with the literature [8, 43].

In an investigation of the proteomic composition of CD4 cells infected by HIV-1, Chan et al. [44] established that fatty acid synthase (FASN) concentration was elevated post-infection and that serum levels of inflammatory cytokines and insulin were positively correlated with FASN levels. This finding suggests that disrupted lipid metabolism within HIV-infected cells of the immune system can cause systemic lipid metabolic disorders and inflammatory whole-body insulin resistance (IR), ultimately progressing to dysglycemia [45]. Additionally, the possible mechanisms of CHD include inflammation, endothelial-cell injury, thrombosis, oxidative stress, and glucose and lipid metabolism disorders [46]. Recent studies have demonstrated that IR contributes to coronary plaque formation and remodeling independent of traditional risk factors such as smoking, age, and hypertension [47]. Furthermore, certain studies have established that vascular stiffness is increased by endothelial-cell injury directly related to HIV infection or by the activation of endothelial-cell proliferation by HIV proteins and cytokines, in association with ongoing hypertension-related endothelial damage. This may also contribute to the elevated incidence of CVD in individuals with HIV and hypertension [48].

Yue et al. [49] conducted a cohort study in Taiwan. They established robust relations between HIV infection and incident heart failure after stratifying individuals by sex, age, and comorbidities. HIV infection has also been found to increase the risk of heart failure [50]. Growing evidence indicates that the severity of HIV infection and the degree of HIV control may be key factors influencing heart failure risk [51]. In patients with CHD, coronary artery blood supply is insufficient, causing myocardial ischemia. Long-term ischemia that is not effectively relieved in turn causes myocardial degeneration and even necrosis. It easily leads to complications, the most common being heart failure. Heart failure induces a decline in heart contractility and a reduction in cardiac output, thereby resulting in insufficient blood supply to important organs. This insufficient blood supply to the heart can further aggravate myocardial ischemia and even increase the severity of CHD.

This study also demonstrated that Scr, I-Bil, AMY, and SUA were predictive factors for CHD in PLHIV. Bagheri et al. [52] explored the link between serum creatinine and the possibility and severity of CHD. They established that serum creatinine is significantly related to CHD. Meanwhile, other studies have revealed that creatinine levels in early HIV patients are higher than those in other groups [53]. Therefore, the Scr levels of PLHIV must be thoroughly monitored during hospitalization. Our findings indicated that elevated I-Bil levels were protective against CHD in PLHIV. Marconi et al. [54] conducted a veterans aging cohort study. Participants (regardless of HIV status) with elevated bilirubin levels were found to have a lower risk of incident total CVD, acute myocardial infarction, heart failure, and ischemic stroke events after adjustment for known risk factors. Amylase also reportedly increases in individuals with acute HIV infection [55]. Moreover, the serum amylase level of newly diagnosed PLHIV is related to CD4 cell count [56], i.e., CD4 cell count decreases with increased serum amylase.

Antiretroviral therapy may also be the main cause of increased serum amylase in HIV patients [56]. Park et al. [57] reported that a history of coronary artery disease (1.7, 1.01–2.87, P = 0.046) is related to heightened serum amylase or lipase when considering the prognosis, morbidity, and predisposing factors of individuals with elevated pancreatic enzyme levels after out-of-hospital cardiac arrest. Serum amylase is a direct indicator of pancreatic injury [56], and studies have demonstrated that individuals with chronic pancreatitis have an elevated risk of atherosclerotic cardiovascular disease [58, 59]. Therefore, monitoring amylase alterations during PLHIV therapy may aid the prevention or early detection of cardiovascular diseases.

Olaniyi et al. [60] established that uric acid content is significantly elevated in PLHIV relative to healthy controls. Uric acid, the end-product of purine metabolism in humans, is a cause of gout. However, it may also lead to the onset and progression of cardiovascular diseases, including atrial fibrillation, hypertension, chronic kidney disease, coronary artery disease, heart failure, and cardiovascular death. Thus, it can be used to predict cardiovascular prognosis [61]. Nicholson et al. [62] demonstrated that hyperuricemia and gout should be considered biomarkers of cardiovascular disease in PLHIV. Uric acid is elevated in PLHIV, and it contributes to cardiovascular disease onset and progression. Thus, alterations in blood uric acid levels should not be ignored when treating hospitalized PLHIV.

The present study had several strengths. It was real-world research on risk assessment that used 6772 samples and compared seven ML algorithms. The optimal prediction model, i.e., the LightGBM model, was found to have better generalizability. It is a highly optimized GBDT algorithm that can incorporate several clinical variables. Furthermore, leveraging the benefits of an ML algorithm meant that the analysis could include various indicators, for example, kidney function, blood glucose, coagulation function, and liver function, and could thus support a thorough assessment of the influencing factors. In addition, SHAP is a reliable technique for enhancing the clinical interpretability of the LightGBM model output. On this basis, doctors can make reasonable referral recommendations and individualized CHD health-management suggestions for PLHIV.

This study also had several limitations. It was performed at a single institute, where the limited sample and the missing data derived from EMRs may introduce bias. Moreover, because we focused on a single center, only internal validation was conducted. External validation on another dataset is needed to demonstrate that the performance of the prediction model is stable. Hence, more effort is needed to conduct multi-center prospective research with more opportunities for multi-center cooperation and better data-mining capabilities.

In summary, demographic and clinical variables were identified as predictive risk factors for CHD among PLHIV. Additionally, a CHD risk-prediction model was constructed for PLHIV using ML and EMR, which can support clinical management techniques for PLHIV in the EMR era.