Introduction

Globally, cardiovascular diseases (CVDs) continue to significantly impact mortality rates and overall health outcomes [1]. Coronary artery disease (CAD) is the most prevalent type of CVD, and its prevalence and incidence have increased noticeably in the majority of countries [2]. From 1990 to 2019, the number of deaths and disability-adjusted life years (DALYs) caused by CAD rose steadily, from around 5 million deaths and 120 million DALYs in 1990 to 9.14 million deaths and 182 million DALYs in 2019 [2]. This emphasizes the urgent need for precise identification of risk factors to predict and prevent CAD.

Insulin resistance is commonly regarded as one of the key risk factors for predicting CAD [3,4,5]. It is associated with chronic low-grade inflammation [6], which can lead to pro-coagulation states [7], decreased bioavailability of nitric oxide, and subsequently impaired endothelial function [8]. Further, insulin resistance can activate the sympathetic nervous system and reduce vagal activity, resulting in the activation of the renin-angiotensin-aldosterone system and kidney sodium retention, ultimately causing higher blood pressure and cardiovascular damage [9]. Remarkably, despite its considerable importance, it has not been incorporated into any international risk assessment framework for the prediction of CAD [3,4,5,10].

The hyperinsulinemic-euglycemic clamp technique serves as the standard for diagnosing insulin resistance, but its invasiveness, cost, and complexity make it unsuitable for epidemiological studies [11]. The Homeostasis Model Assessment of Insulin Resistance (HOMA-IR) is a commonly employed alternative, offering ease of use; however, this test cannot be used to diagnose people who are already undergoing insulin treatment [12, 13]. Additionally, HOMA-IR has another limitation, as laboratories do not routinely measure circulating insulin concentrations [14, 15].

In light of the drawbacks of direct measurement of insulin, numerous surrogate markers, based on glucose and lipid profiles as well as some anthropometric features, have emerged. These surrogate markers do not necessitate the measurement of serum insulin levels, and they have an even better correlation with the hyperinsulinemic-euglycemic clamp method compared to HOMA-IR [16,17,18]. The ratio of triglycerides to high-density lipoprotein cholesterol (TG/HDL-C), triglyceride-glucose index (TyG index), TyG-index with body mass index (TyG-BMI), TyG index with waist circumference (TyG-WC), and metabolic score for insulin resistance (METS-IR), are the most common of these less complicated and practical markers [19, 20]. Although prior studies have shown associations between these indices and CAD, there is no specific threshold for utilizing these indices, and it remains uncertain which one of them better predicts CAD [21,22,23].

Determining the most reliable predictor among these comparable indices poses a significant challenge in clinical environments, where they can aid in screening and preventive measures to reduce CAD. In this regard, in addition to the conventional statistical methods, we have decided to employ embedded feature selection techniques, which involve the fusion of machine learning algorithms with the process of selecting features [22, 23]. The main advantage of these machine learning algorithms over traditional statistical methods is their reduced emphasis on hypothesis-driven inference [24, 25]. Instead, they prioritize predictive accuracy and can algorithmically derive covariate interactions [24, 26]. These characteristics enable us to evaluate the impact of each feature on CAD prediction comprehensively.

To determine which of these indices best predicts CAD occurrence, we first investigated the association between different surrogate markers of insulin resistance and CAD in a 10-year prospective cohort study. Then, we evaluated the optimal cut-off points for these surrogate markers as CAD prediction tools. The ultimate objective was to develop embedded feature selection machine learning algorithms for CAD prediction and to compare the unique impacts of insulin resistance markers on CAD prediction.

Materials and methods

Study population

Data for this cohort study were derived from the Yazd Healthy Heart Project (YHHP), an epidemiological study investigating cardiovascular and metabolic illnesses in a population-based setting. In summary, a total of 2000 Iranian adults (1000 men and 1000 women) between the ages of 20 and 74 were selected using a cluster random sampling technique. The participants were recruited from the urban population of Yazd city during the period of 2005–2006 [27].

Inclusion and exclusion criteria

From the 2000 participants, 17 were omitted from the study due to loss during the second phase; from the 1983 individuals participating in the baseline examination, 62 were excluded due to diagnosis of CAD at baseline, 78 due to death during the study, and 312 due to missing data. The remaining 1531 participants (791 men, mean age 48.6 ± 14.7 years) were included in the present study (Fig. 1).

Fig. 1 Flow diagram of participants attending the 10-year follow-up study. Abbreviation: CAD, coronary artery disease

Biochemical analyses

Laboratory analyses were conducted following an overnight fast. Glucose and triglyceride (TG) levels were measured following centrifugation using kits obtained from Pars Azmoon Inc. (Tehran, Iran). The lipid profiles, including total cholesterol, low-density lipoprotein (LDL), and high-density lipoprotein (HDL), were examined using kits manufactured by Bionic Company (Tehran, Iran). The tests were conducted utilizing a biochemical autoanalyzer (BT 3000, Italy). The key exposure variables of interest were calculated using the following equations [18]:

$$\text{TyG-index}=\ln\left(\frac{TG\left(\frac{mg}{dl}\right)\times FBS\left(\frac{mg}{dl}\right)}{2}\right)$$
$$\text{TyG-BMI}=\text{TyG-index}\times BMI\left(\frac{kg}{m^{2}}\right)$$
$$\text{TyG-WC}=\text{TyG-index}\times \text{waist circumference}\left(cm\right)$$
$$\text{METS-IR}=\frac{\ln\left(2\times FBS\left(\frac{mg}{dl}\right)+TG\left(\frac{mg}{dl}\right)\right)\times BMI\left(\frac{kg}{m^{2}}\right)}{\ln\left(HDL\left(\frac{mg}{dl}\right)\right)}$$
$$\text{TG/HDL ratio}=\frac{TG\left(\frac{mg}{dl}\right)}{HDL\left(\frac{mg}{dl}\right)}$$
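As an illustration only, the five indices can be computed from fasting laboratory values and anthropometric measurements as in the minimal Python sketch below. The function and argument names are hypothetical and are not taken from the study's code. METS-IR is coded in its commonly published form, ln(2 × FBS + TG) × BMI / ln(HDL); that grouping of terms is an assumption of this sketch.

```python
import math


def tyg_index(tg_mg_dl, fbs_mg_dl):
    """Triglyceride-glucose index: ln(TG x FBS / 2), both in mg/dl."""
    return math.log(tg_mg_dl * fbs_mg_dl / 2)


def tyg_bmi(tg_mg_dl, fbs_mg_dl, bmi_kg_m2):
    """TyG-index multiplied by body mass index (kg/m^2)."""
    return tyg_index(tg_mg_dl, fbs_mg_dl) * bmi_kg_m2


def tyg_wc(tg_mg_dl, fbs_mg_dl, waist_cm):
    """TyG-index multiplied by waist circumference (cm)."""
    return tyg_index(tg_mg_dl, fbs_mg_dl) * waist_cm


def mets_ir(fbs_mg_dl, tg_mg_dl, bmi_kg_m2, hdl_mg_dl):
    """METS-IR in its commonly published form (assumed grouping of terms)."""
    return math.log(2 * fbs_mg_dl + tg_mg_dl) * bmi_kg_m2 / math.log(hdl_mg_dl)


def tg_hdl_ratio(tg_mg_dl, hdl_mg_dl):
    """Ratio of triglycerides to HDL cholesterol, both in mg/dl."""
    return tg_mg_dl / hdl_mg_dl


if __name__ == "__main__":
    # Made-up example values, not study data.
    print(round(tyg_index(150, 100), 2))
    print(round(tg_hdl_ratio(150, 50), 2))
```

All inputs are expected in the units given in the equations (mg/dl, kg/m², cm); mixing in mmol/l values would silently produce wrong results.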

Anthropometric features

The participants’ heights were measured with a stadiometer attached to a smooth wall with no dents or irregularities. They stood barefoot, with their heels, hips, shoulders, and heads touching the wall and fixed horizontally. The heights were measured with a 0.5 centimeter margin of error. Participants were weighed with minimal clothing on a digital scale (Seca, Germany). The participants’ weight was measured with precision to the nearest 0.1 kg in both phases. The circumferences of the waist and hips were measured using a non-stretchable tape at the superior border of the iliac crest and the widest part of the buttock, respectively.

Blood pressure measurements

Blood pressure was measured on the participants’ right arm, in a sitting position, using an Omron M6 Comfort digital automatic blood pressure monitor. Nursing staff measured blood pressure twice, with a five-minute interval between measurements.

Physical activity, family history of premature CAD, smoking, and education

Trained interviewers utilized questionnaires to gather demographic information, physical activity, smoking habits, family history of premature CAD, and angina pectoris. The assessment of physical activity was conducted using the International Physical Activity Questionnaire (IPAQ) [28]. As part of this survey, the participants were questioned about the duration and number of days of their walking, engagement in moderate-intensity exercise, and strenuous activity. Based on these inquiries, the number of MET-hours per week was computed, where 1 MET is equivalent to 1 kcal/kg/hr [29]. Using this metric, the participants were categorized into low-, moderate-, and high-activity groups. Based on current smoking habits, the participants were categorized into two groups: smokers and nonsmokers. Family history of premature CAD was defined by the occurrence of CAD in a mother or sister before the age of 55, or in a father or brother before the age of 45.
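The two derived variables described above can be sketched in Python as follows. This is a hedged illustration, not the study's scoring code: the MET weights (3.3 for walking, 4.0 for moderate and 8.0 for vigorous activity) are taken from the standard IPAQ scoring protocol and are an assumption here, since the text does not state them, and all function names are hypothetical. The thresholds separating low, moderate, and high activity are not given in the text, so no categorization is attempted.

```python
# MET weights from the IPAQ scoring protocol (assumed; not stated in the text).
MET_WALKING = 3.3
MET_MODERATE = 4.0
MET_VIGOROUS = 8.0

FEMALE_RELATIVES = {"mother", "sister"}
MALE_RELATIVES = {"father", "brother"}


def met_hours_per_week(walk_min, walk_days, mod_min, mod_days, vig_min, vig_days):
    """Weekly activity volume in MET-hours from minutes per day and days per week."""
    met_minutes = (
        MET_WALKING * walk_min * walk_days
        + MET_MODERATE * mod_min * mod_days
        + MET_VIGOROUS * vig_min * vig_days
    )
    return met_minutes / 60.0


def has_family_history_of_premature_cad(relative, age_at_cad):
    """Apply the definition in the text: CAD before 55 in a mother or sister,
    or before 45 in a father or brother."""
    if relative in FEMALE_RELATIVES:
        return age_at_cad < 55
    if relative in MALE_RELATIVES:
        return age_at_cad < 45
    return False


if __name__ == "__main__":
    # Made-up participant: 30 min of walking on 5 days, 20 min moderate on 3 days.
    print(met_hours_per_week(30, 5, 20, 3, 0, 0))
    print(has_family_history_of_premature_cad("mother", 52))
```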

Outcome definition

CAD events were identified based on medical records documenting occurrences of fatal or nonfatal CAD, myocardial infarction, coronary artery bypass graft, positive exercise tests, positive cardiac enzymes, and positive percutaneous coronary angiography. In addition, all participants completed the Rose angina questionnaire (RAQ) [30], a validated tool for assessing new angina. The participants also had electrocardiograms (ECGs), which were reviewed by both a general practitioner and a trained nurse. If any discrepancies arose, a cardiologist confirmed the findings. Besides events documented in medical records, participants with both a positive RAQ and ischemic findings on the ECG were classified as having CAD.

Statistical analysis

SPSS version 27.0 (IBM Corp., Armonk, NY, USA), Python 3, and R version 4.2.2 (www.R-project.org) were used for statistical analysis. Continuous variables were described as mean ± standard deviation (SD) and compared by ANOVA. Categorical variables were described as numbers (percentages) and compared by chi-square tests.

We employed multivariable Cox proportional hazard models to assess the association between quartiles of these indices and the CAD incidence. We employed two multivariable models for adjustment. Model 1 was adjusted for age and sex, whereas model 2 was adjusted for model 1 plus systolic and diastolic blood pressure, total cholesterol, LDL, HDL, BMI, waist to hip ratio, family history of premature CAD, physical activity, and smoking. If any of these factors were included in exposure variables (surrogate insulin resistance indices), we excluded them from the adjustment process. For instance, when analyzing TG/HDL ratio, we did not incorporate HDL into the statistical model.
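The exposure coding and the covariate-exclusion rule described above can be sketched in Python as follows. This is an illustrative sketch only: the covariate names are hypothetical, the quartile cut points come from the default method of `statistics.quantiles` (which may differ slightly from the software used in the study), and keeping the waist-to-hip ratio in the TyG-WC model is an assumption, because the text does not say how that case was handled. Fitting the Cox models themselves is outside the scope of the sketch.

```python
import bisect
import statistics

# Model 2 covariates as listed in the text (names are hypothetical).
MODEL_2_COVARIATES = [
    "age", "sex", "sbp", "dbp", "total_cholesterol", "ldl", "hdl",
    "bmi", "waist_hip_ratio", "family_history", "physical_activity", "smoking",
]

# Covariates that are already part of each index and must not be adjusted for twice.
INDEX_COMPONENTS = {
    "TyG-index": set(),
    "TyG-BMI": {"bmi"},
    "TyG-WC": set(),  # assumption: waist-to-hip ratio is kept
    "METS-IR": {"bmi", "hdl"},
    "TG/HDL": {"hdl"},
}


def adjustment_set(index_name):
    """Model 2 covariates minus those contained in the exposure index."""
    excluded = INDEX_COMPONENTS[index_name]
    return [c for c in MODEL_2_COVARIATES if c not in excluded]


def assign_quartiles(values):
    """Map each value to its quartile (1 to 4) using sample quartile cut points."""
    cuts = statistics.quantiles(values, n=4)
    return [bisect.bisect_left(cuts, v) + 1 for v in values]


if __name__ == "__main__":
    print(adjustment_set("TG/HDL"))
    print(assign_quartiles([8.1, 8.4, 8.6, 8.8, 9.0, 9.2, 9.5, 9.7]))
```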

We employed the receiver operating characteristic (ROC) curve to compare the predictive performance of all indices relative to one another. Then, we assessed the optimal cutoff points of the surrogate insulin resistance indices for predicting CAD using the “OptimalCutpoints” R package [31], based on simultaneously maximized sensitivity and specificity, maximum negative and positive diagnostic ratios, and the maximum Youden index. In addition, we categorized these thresholds according to gender.
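To make the ROC-based criteria concrete, the following Python sketch computes the AUC and the cutoff that maximizes the Youden index (sensitivity + specificity − 1) for a single marker. It is a minimal stand-alone illustration with hypothetical function names and made-up data; it does not reproduce the "OptimalCutpoints" package and covers only the Youden criterion.

```python
def roc_points(scores, labels):
    """For each distinct score used as a threshold (score >= threshold is
    classified as positive), return (threshold, sensitivity, specificity)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for threshold in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        true_neg = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
        points.append((threshold, true_pos / n_pos, true_neg / n_neg))
    return points


def youden_cutoff(scores, labels):
    """Threshold with the largest Youden index J = sensitivity + specificity - 1."""
    return max(roc_points(scores, labels), key=lambda p: p[1] + p[2] - 1)


def auc(scores, labels):
    """AUC as the probability that a random case scores higher than a random
    non-case, with ties counted as one half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


if __name__ == "__main__":
    # Made-up marker values and outcomes (1 = CAD), not study data.
    marker = [8.1, 8.4, 8.6, 8.8, 9.0, 9.2, 9.5, 9.7]
    cad = [0, 0, 1, 0, 0, 1, 1, 1]
    print(auc(marker, cad))
    print(youden_cutoff(marker, cad))
```

The pairwise AUC is quadratic in the sample size, which is adequate for illustration but slower than rank-based implementations on large cohorts.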

In order to choose the best surrogate insulin resistance marker for predicting CAD, we combined integrative methods with an ensemble of different embedded feature selection methods based on machine learning [23]. For the integrative part of our approach, we selected age, sex, systolic blood pressure (SBP), diastolic blood pressure (DBP), LDL, total cholesterol, smoking, family history of premature CAD, and diabetes as our reference variables for comparing our surrogate measures of insulin resistance. For the embedded feature selection part, we first used random forest feature selection, a non-linear algorithm that can consider multiple interactions and evaluates variables by determining how much each feature reduces impurity (Mean Decrease in Impurity [MDI]) [32]. For the second approach, we employed the Boruta algorithm, which shuffles the values of each feature to create shadow features representing noise or irrelevant features, then trains a random forest model on the original and shadow features and compares their importance over multiple iterations. If a feature is more important than its shadow, it is selected [33]. As a third approach, we used the least absolute shrinkage and selection operator (LASSO), a regularization technique based on linear regression that drives the coefficients of less important features to zero and selects the variables with non-zero coefficients [34]. We set the alpha (threshold of significance) to 0.05 for this algorithm. Finally, we used the ceteris paribus profile of the random forest model [35, 36]. The ceteris paribus profile graphically depicts the effect of altering a specific variable on the model's prediction while keeping all other variables unchanged.
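The idea behind a ceteris paribus profile can be illustrated with a short Python sketch: one feature of a single observation is varied over a grid while every other feature is held fixed, and the model's prediction is recorded at each grid value. The function names, the feature names, and the `toy_risk_model` below are hypothetical; the toy model is a simple logistic stand-in with arbitrary coefficients, not the fitted random forest used in the study.

```python
import math


def ceteris_paribus_profile(predict, observation, feature, grid):
    """Return (value, prediction) pairs obtained by setting `feature` to each
    grid value while all other features of `observation` stay unchanged."""
    profile = []
    for value in grid:
        modified = dict(observation)  # copy so the original is not altered
        modified[feature] = value
        profile.append((value, predict(modified)))
    return profile


def toy_risk_model(x):
    """Hypothetical stand-in for a fitted model: logistic function of age and
    TyG-index with arbitrary coefficients."""
    score = 0.04 * (x["age"] - 50) + 0.8 * (x["tyg_index"] - 9.0)
    return 1.0 / (1.0 + math.exp(-score))


if __name__ == "__main__":
    participant = {"age": 50, "tyg_index": 8.5}  # made-up observation
    for value, risk in ceteris_paribus_profile(
        toy_risk_model, participant, "tyg_index", [8.0, 8.5, 9.0, 9.5, 10.0]
    ):
        print(value, round(risk, 3))
```

Plotting the returned pairs for each marker gives curves whose slope and plateaus can be compared across features.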

Results

Association of surrogate insulin resistance indices with CAD

Table 1 presents the baseline characteristics of participants according to quartiles of surrogate insulin resistance indices. Age, blood pressure, low education, total cholesterol levels, and LDL differed significantly between quartiles for all markers. Table 2 reports the association between different surrogate markers of insulin resistance and CAD incidence. In model 1, after adjustment for age and sex, the fourth (highest) quartile of every index was significantly and positively associated with CAD. Nevertheless, following adjustment for multiple variables in model 2, only the TyG-index was significantly associated with CAD (hazard ratio [HR]: 2.54, Confidence Interval [CI]: 1.34–4.81, P value = 0.007, P trend = 0.02). In the gender-stratified analysis, only the TG/HDL ratio in men (HR: 1.95, CI: 1.01–3.77, P value = 0.04, P trend = 0.07) and the TyG-index in women (HR: 4.76, CI: 1.36–16.66, P value = 0.01, P trend = 0.004) were associated with CAD after final adjustment (Table 3).

Table 1 Baseline characteristics of the participants according to quartiles of different surrogate markers of insulin resistance
Table 2 Risk of CAD according to quartiles of surrogate markers of insulin resistance
Table 3 Risk of CAD according to quartiles of surrogate markers of insulin resistance stratified by gender

Table 4 presents the area under the ROC curve (AUC) and cut-off points for all indices used to predict CAD in men, women, and the total sample. The TyG-index demonstrated superior predictive performance in both the total sample and among women, with AUC values of 0.67 (0.63–0.70, P value 0.001) and 0.72 (0.66–0.77), respectively. However, the TyG-index and the TyG-WC revealed almost identical performance in men.

Table 4 Receiver operating characteristic curve and cut-off points of surrogate markers of insulin resistance for CAD prediction in men, women, and the total population

Figure 2 illustrates several feature selection methods and the ceteris paribus profile of a random forest model. Figure 2A indicates the feature selection process using the Boruta algorithm. According to this algorithm, age, SBP, and TyG-index were the most important variables for predicting CAD. The random forest model revealed that, following age, blood pressure, and sex, the TyG-index exhibited the greatest MDI, thus serving as the most effective surrogate measure of insulin resistance for predicting CAD (Fig. 2B).

Fig. 2 Ensemble of embedded feature selection methods. A Importance of variables based on their rank in the Boruta method; a lower rank indicates greater importance, while a higher rank indicates lesser importance. The variables highlighted in black are the most important ones. B The mean decrease in impurity (MDI), or Gini importance, measures the extent to which each feature contributes to accurate predictions. A higher MDI value indicates that the variable is more important. C LASSO is a regularization approach based on linear regression. Regularization approaches penalize large coefficients because their presence can lead to overfitting. LASSO shrinks the coefficients of less significant features to zero and selects the features whose coefficients have not been reduced to zero. A higher coefficient indicates greater importance. D The ceteris paribus profile examines individual features while holding all other components of the model constant, in order to understand the particular impact of different features on predictions in machine learning models. A sharper incline on the diagram without a plateau or a downward slope, with a higher constant, indicates a better feature.

Figure 2C depicts the LASSO technique, which is a penalized approach that discards redundant variables. The TyG-index was the only surrogate indicator of insulin resistance that was chosen by LASSO. The ceteris paribus profile of a random forest model is shown in Fig. 2D. Compared to other indices, the TyG-index had a stronger positive slope without a clear plateau or decline.

Discussion

Our research findings demonstrated that the TyG-index is the most effective surrogate marker of insulin resistance for predicting CAD and it has superior predictive capabilities in women. Not only did traditional statistical methods like Cox hazard regression and ROC analysis show that the TyG-index had a better HR and AUC for CAD compared to other surrogate indicators of insulin resistance, but also advanced feature selection techniques further validated these findings.

Surrogate insulin resistance markers encompass both blood glucose and dyslipidemia markers, serving as indirect indications of insulin resistance in the liver and adipose tissue [37]. Furthermore, some of these surrogate markers, including TyG-WC, TyG-BMI, and METS-IR, integrate obesity measures. This approach is grounded in the understanding that a direct relationship exists between insulin resistance and the majority of obesity indicators [38]. The advantage of these non-insulin dependent surrogate measures of insulin resistance, compared to the insulin-dependent competitors such as HOMA-IR, lies in their cost-effective and simplified acquisition technique, as well as their stronger association with the gold standard protocol for measuring insulin resistance [11,12,13]. Furthermore, research indicates that some of these indices may be more effective predictors of CAD than metabolic syndrome, which itself is a reflection of insulin resistance [39].

Meta-analyses have shown that the TyG-index [40] and the TG/HDL-C ratio [41] are associated with CAD. Additionally, cohort studies have demonstrated the association of TyG-BMI and METS-IR with CAD [19, 42, 43], while only a cross-sectional study has highlighted a link between TyG-WC and CAD [19]. In the current study, TyG-BMI and METS-IR were not associated with CAD and were also found to be the least effective surrogate markers in the feature selection approaches. A potential explanation is that BMI, as the sole anthropometric characteristic, fails to accurately indicate the risk of CAD when combined with insulin resistance-related traits [44, 45]. Although TyG-WC was the second most reliable indicator after the TyG-index in the present study, we found no significant association between it and CAD.

To date, only four studies have directly compared surrogate markers of insulin resistance and their association with CAD within a single analytical framework [19,20,21,46]. Among these, a case-control study highlighted the METS-IR index as more closely associated with CAD than both the TG/HDL ratio and the TyG-index, though this conclusion might be affected by Berkson’s bias due to the selection process, which targeted participants who were suspected of having CAD and underwent coronary angiography [20]. Elsewhere, an analysis of cross-sectional data from the National Health and Nutrition Examination Survey (NHANES) revealed a stronger correlation between the TyG-index and CAD than for the other indices, though the TyG-WC had a greater AUC [19]. However, the reliance on self-reported outcomes in the NHANES study raises concerns about misclassification. Furthermore, research by Mahdavi-Roshan et al. in Iran, employing a case-control approach, indicated that the TyG-index was more closely associated with CAD risk than either the METS-IR or TyG-BMI [21]. Recently, Liu et al. evaluated visceral obesity indices and surrogate insulin resistance markers for predicting coronary heart disease in a prospective cohort of the Chinese population [46]. They found that the Chinese visceral adiposity index (CVAI) is a more accurate predictor of coronary heart disease than surrogate markers of insulin resistance [46]. Although this index does correlate with insulin resistance and cardiometabolic disease, it was not initially designed for measuring insulin resistance. The initial development and validation of this index were based on measurements of visceral adipose tissue (VAT) acquired through CT scan [47, 48]. Conversely, surrogate insulin resistance markers were specifically formulated based on HOMA-IR and the glucose clamp test [49, 50, 51, 52]. Furthermore, CVAI has been designed for people of Chinese ethnicity, a population that differs significantly from our community.
For instance, in China, 34.3% of adults are overweight and 16.4% are obese [48]. In contrast, 63% of the Iranian population is overweight or obese, with 70.54% exhibiting abdominal obesity based on waist-to-hip ratio [53]. Although assessing these measures of visceral obesity is not within the scope of this study, it would be intriguing for future studies to determine which obesity indices are most effective in predicting CAD in the Persian population and whether they have a greater impact than indicators of insulin resistance. Overall, it is crucial to be cautious when interpreting these results because of inherent biases, differing findings among various studies, dependence on cross-sectional data, and reliance on traditional statistical methods. Accurately predicting intricate diseases such as CAD requires considering complex interactions among several parameters [23], a consideration that is overlooked in traditional techniques.

Embedded feature selection

Embedded feature selection techniques are supervised-learning dimension reduction techniques used to identify the optimal variables for predicting an outcome [53]. Not only do they enhance predictive models’ performance and cost-effectiveness [54], they can also help healthcare practitioners select the most appropriate variable from a set of variables that carry similar, overlapping information for the purpose of screening for and preventing an outcome. Although there is no flawless integrated feature selection algorithm [55], these strategies can be combined to use their respective advantages and mitigate their limitations [56]. Nevertheless, it is important to acknowledge that the choice between novel techniques such as machine learning and traditional statistical models in predictive analytics is not clear-cut. Traditional statistical models offer a transparent depiction of the data, often including a probabilistic framework, which enhances interpretability. These models highlight relevant variables and quantify the strength as well as the significance of associations. Conversely, machine learning models tend to be more empirical, prioritizing predictive performance over interpretability. Previous research has indicated that combining conventional statistical techniques with machine learning is the optimum strategy for reaching generalizable and significant findings [57]. This is why we employed both of these methods to achieve a more comprehensive interpretation of our data.

The ensemble of feature selection approaches in the current study indicated that the TyG-index is the best surrogate marker of insulin resistance for predicting CAD. Following that, the TyG-WC may have the greatest influence. The ceteris paribus profile of the random forest model demonstrated that the predictive capability of the TyG-index grew beyond a value of 9, with a positive slope and without any decline or flattening out, which was in accordance with the cutoff points of the ROC curve. The TyG-BMI and METS-IR curves displayed a consistently flat or negative slope, while the TG/HDL and TyG-WC curves showed several plateaus or declines, suggesting that they are not reliable indicators for predicting CAD.

The combination of all three embedded feature selection methods, along with the results of Cox hazard models and ROC curve analysis, demonstrated that the TyG-index is the most reliable surrogate insulin resistance index for predicting CAD. This consensus of findings of different methods demonstrates the stability and reproducibility of the result, thereby increasing confidence in the use of this index [57, 58] for CAD prediction.

Strengths and limitations

This study is the first to evaluate and compare the most common surrogate measures of insulin resistance within a unified framework for the prediction of CAD. The prospective structure of our study, which has focused on the community, helps to limit the likelihood of reverse causation and recall bias. Unlike previous studies [19], we employed a consistent approach to define CAD by examining both paraclinical and symptomatic data. This enabled us to reduce the likelihood of misclassification.

This study also had some limitations. The small number of follow-up sessions constrained our ability to assess and account for voluntary health check-ups as well as lifestyle modifications that may have influenced our findings over the ten-year study period. Further, because the surrogate insulin resistance indices were measured at a single baseline evaluation, our results may be influenced by within-individual variation over time. Above all, our study was conducted at a single center and included only individuals of the Iranian population. Thus, it is important to note that our findings may not be generalizable to populations in other countries.

Conclusion

The findings of the present investigation indicate that the TyG-index is the most efficient surrogate insulin resistance index for predicting and preventing CAD. Given the ease of evaluating the TyG-index using routine biochemical tests, incorporating this tool into clinical screenings and including it in future CAD risk assessment scores can greatly enhance healthcare professionals’ ability to manage and lower the risk of CAD. Nevertheless, more research involving multiple centers and diverse ethnic groups is necessary to validate our results.