Introduction

Cardiovascular diseases are of the leading causes of death worldwide. One of the most critical heart diseases is coronary artery disease (CAD). Coronary arteries include Left Anterior Descending (LAD), Left circumflex (LCX), and Right Coronary Artery (RCA), divided into left main artery (LMA) including LCX and LAD and right coronary artery (RCA). Suffering from CAD means that at least one of these arteries narrowness is more than 50% [1]. Examination and the study of various sources indicate that the risk factors for CAD are smoking, hypertension, hyperlipidemia (high total cholesterol, high triglycerides and high-density lipoproteins (HDL), Low-density lipoprotein (LDL)), diabetes, physical inactivity, obesity, abdominal obesity, unhealthy diet, age, gender, family history of heart disease, alcohol consumption, psychological factors, menopause, acute phase protein, high fasting glucose, fibrinogen, lipoprotein and homocysteine[1,2,3,4,5]. In examining the causes of death in Iran in 2009, it was found that out of 321,570 deaths in this year, 82,307 were due to CAD, which is the first cause of death with 25.6% [6].

Exercise testing, echocardiogram, nuclear scans of the heart and angiography are the ways to diagnose the disease, where angiography is the most accurate way to detect it [7, 8]. In a study on 598,792 patients, the mortality rate after angiography was 0.1% and the total number of major complications was about 1.7% [9]. Despite this angiography is an expensive and invasive procedure and is associated with risks like death, heart attack and stroke [2]. It is widely used to identify the causes of diseases and diagnostic strategies with higher accuracy and fewer side effects. Data mining and machine learning techniques are similar to decision trees, neural networks, Bayesian networks, and support vector machines. [10,11,12]. Predictions and classifications are a common practice in applied research. Some of the most widely used mathematical methods for predictions and classifications are discriminant analysis [13], logistic regression [14], neural networks [15], and classification and regression trees (CART) [16]. The statistical techniques are mostly divided into two categories, classical and non-classical. In classical statistics, this task is mainly done with the help of methods like regression, discriminant analysis, time series, regression tree and logistic regression, and in non-classical statistics, it is the duty of data mining and machine learning techniques [17,18,19].

The literature review showed that different algorithms such as clustering, classifications, regression and association rules, decision trees, Bayesian network, neural network, multi-layer perceptron with error back propagation algorithm, scaled conjugate gradient (SCG) and support vector machine (SVM) have been used for predicting CAD [20,21,22,23,24,25,26,27,28,29,30,31,32]. However, the comparison between the algorithms has not received adequate attention. Therefore, the purpose of this modelling was to provide insight based on information available from a specific subject.

Given the importance of the issue of prediction prognosis of CAD and the fact that, up to now, no comprehensive comparative study has been conducted in Iran to predict CAD, this study tried to predict the prognosis of CAD using discriminant analysis and compared with the logistic regression model.

The rest of this paper was organized as follows. In Sect. 2 we introduce our data sources and statistical methods we used. In Sect. 3 we presented our results, including sample characteristics, variable selection, and compare the prediction performance of the statistical models. In Sect. 4 we discussed the results. Section 5 includes concluding remarks and a description of directions for future research.

Methodology

Experimental sample

This diagnostic study was carried out as cross-sectional. The population study was all the patients with cardiovascular disease who were admitted to Fatemeh Zahra Teaching Hospital, (Sari, Iran) and underwent angiography of coronary arteries. The indication for angiography for patients were clinical indication including chest pain/ chronic coronary syndrome or unstable patients with myocardial infarction with or without ST segment elevation. The sample size in this study, according to the study of Kurt et al. [33] and considering the estimated area under the ROC curve of 75% for different methods and with a precision of 0.05 at 95% confidence level and 80% power was estimated. Also, considering the ratio of negative angiography to positive angiography cases equal to 0.33 (N− = 188 and N+  = 568), 756 samples in PASS11 are estimated for this study. Based on formulas is the following:

$$\begin{aligned} & Var\left( {\widehat{AUC}} \right) = V\left( {AUC} \right) = \left( {0.0099*e^{{\frac{{ - \alpha^{2} }}{2}}} } \right)*(6\alpha^{2} + 16) \\ & a = \varphi^{ - 1} ({\text{AUC}})*1.414\quad {\text{AUC}}_{{{\text{H0}}}} = 0.75\quad {\text{AUC}}_{{{\text{H1}}}} = 0.80 \\ & {\text{V}}_{{{\text{H}}_{0} \left( {\widehat{{{\text{AUC}}}}} \right)}} = 0/134842\quad {\text{V}}_{{{\text{H}}_{1} \left( {\widehat{{{\text{AUC}}}}} \right)}} = 0/119422 \\ \end{aligned}$$
(1)
$$n = \frac{{\left[ {z_{{\frac{\alpha }{2}}} \sqrt {V_{{H_{0} \left( {\widehat{AUC}} \right)}} } + Z_{\beta } \sqrt {V_{{H_{1} \left( {\widehat{AUC}} \right)}} } } \right]^{2} }}{{\left[ {AUC_{1} - AUC_{0} } \right]^{2} }} = 409$$
(2)

This study contained the records of 758 patients, each of which has 19 variables. All variables can be considered as indicators of CAD for a patient, according to medical literature [34,35,36,37,38]. The variables are arranged in three groups: demographic, clinical, and laboratory variables (Table 1). Some of the variables in the presented tables should be further explained: current smoker is current consumption of cigarettes, illicit drug abuse is current use of illegal drug use (opium, heroin, etc.), and alcohol consumption is the lifetime use of alcohol. Each patient could be in two possible categories CAD or Normal. A patient is categorized as CAD, if his/her diameter narrowing is greater than or equal to 50%, and otherwise as Normal based on the results of angiography and specialist diagnosis.

Table 1 Variables of study dataset

The inclusion criteria were patients who were admitted to the hospital for angiography due to cardiovascular disease with indication for angiography, and the exclusion criteria were patients who had received angiography before angiography or received any treatment after angiography. Diagnostic cases of CAD or non-CAD were recorded in the patient's file based on the results of angiography and physician's opinion. The study protocol and experimental protocols was approved by the Ethics Committee of Golestan University of Medical Sciences (IR.GOUMS.REC.1398.031). All methods were carried out in accordance with relevant guidelines and regulations, and a consent form was obtained from all the participants.

Statistical methods

Classical methods

Discriminant analysis (DA)

DA is one of the multivariate statistical methods for classifying a set of observations and discriminant analysis is a statistical technique that allows the researcher to distinguish between two or more groups according to several independent variables simultaneously. This technique is one of the multivariate statistical methods used to classify a set of observations as well as assign new observations to predefined categories. In other words, with DA technique, one can combine the linear composition of the independent variables as a discriminant function and divide the observations into two or more categories [39].

DA is one of the oldest and most well-known classification techniques proposed by Ronald Fisher in 1936 and generalized by others in later years [40, 41]. Over the past years, various discriminant functions have been examined, but they were similar in terms of purpose. Common types of parametric DA are linear discriminant analysis (LDA), Quadratic Discriminant Analysis (QDA), Regularized Discriminant Analysis (RDA), and the nonparametric DA is K-Nearest Neighbor (KNN) Analysis method [42].

Linear discriminant analysis (LDA)

LDA uses linear combination of independent variables to create the maximum intergroup ratio to intragroup changes in discriminant scores. Among the most popular functions used in DA is the Fisher discriminant function. In another method, the ranking rule is obtained by minimizing the average cost function (ECM) [17].

DA examines the relationship between several independent variables and the class response variable. The easiest type of analysis is when the response variable has two groups. In this case, the discriminant linear function, which passes through the average of the two groups (centers), can be used to separate the two groups, and when there are several prediction groups, k-1, where k is the number of classes, is required for classification. Imagine there are two groups:

If \(\overline{{X_{1} }}\) and \(\overline{{X_{2} }}\) are the mean of the first and the second groups, respectively, and S is the merged variance–covariance matrix, discriminant function of Fisher does separation as follows:

Xi is the member of group one if:

$$y = \left( {\overline{{x_{1} }} - \overline{{x_{2} }} } \right)^{\prime } S^{ - 1} X_{i} \ge \frac{1}{2}\left( {\overline{{x_{1} }} - \overline{{x_{2} }} } \right)^{\prime } S^{ - 1} \left( {x_{1} + x_{2} } \right)$$
(3)

Xi is the member of the second group if:

$$y = \left( {\overline{{x_{1} }} - \overline{{x_{2} }} } \right)^{\prime } S^{ - 1} X_{i} < \frac{1}{2}\left( {\overline{{x_{1} }} - \overline{{x_{2} }} } \right)^{\prime } S^{ - 1} (x_{1} + x_{2} )$$
(4)

Multivariate normalization, homogeneity of variance–covariance matrix, linearity and the absence of multicollinearity between independent variables are of the assumptions of LDA [43], yet Tabachnick and Fidell showed that the linear discriminant function is robust against the deviation from the multivariate normality due to the presence of outlier data as well as violations of the homogeneity of the variance–covariance matrix [44].

Quadratic discriminant analysis (QDA)

QDA is another DA technique that like LDA creates classification functions with independent variables. However, the functions are not linear in QDA. In some cases, linear functions may not create the best group separation, and using Quadratic Discriminant Functions (QDFs) may be more appropriate. However, in selecting the type of model, one must pay attention to the assumptions. Multivariate normal of the variables are independent of QDA method assumptions, but it does not assume variance–covariance homogeneity [45].

Thus, QDA is a more appropriate analysis in case of variance–covariance heterogeneity.

QDA assigns the case i to group one if the following equation holds; otherwise, case i belongs to group two. Here, \(\overline{{X_{1} }}\) and \(\overline{{X_{2} }}\), respectively, are the mean of the first and the second groups, and S1 and S2 are variance–covariance matrix of the first and the second groups, respectively.

$$\frac{1}{2}x_{i} \left( {S_{1}^{ - 1} - S_{2}^{ - 1} } \right)x_{i}^{\prime } + \left( {\overline{{x_{1} }} S_{1}^{ - 1} - \overline{{x_{2} }} S_{2}^{ - 1} } \right)x_{i}^{\prime } - k \ge \ln \left( {\frac{{c_{12} p_{2} }}{{c_{21} p_{1} }}} \right)$$
(5)

where

$$k = \frac{1}{2}\ln \left( {\frac{{\left| {S_{1} } \right|}}{{\left| {S_{2} } \right|}}} \right) + \frac{1}{2}\left( {\overline{{x_{1} }} S_{1}^{ - 1} \overline{{x_{1}^{\prime } }} - \overline{{x_{2} }} S_{2}^{ - 1} \overline{{x_{2}^{\prime } }} } \right)$$
(6)

Consider the Mahalanobis distance (MD). This function is as follows:

$${\text{f}}\left( {\text{x}} \right) = {\text{D}}_{1}^{2} \left( {\text{x}} \right) - {\text{D}}_{2}^{2} \left( {\text{x}} \right) + {\text{ln}}\left( {\frac{{\left| {S_{1} } \right|}}{{S_{2} }}} \right) - 2{\text{ln}}\left[ {\left( {\frac{{{\text{p}}_{2} }}{{{\text{p}}_{1} }}} \right)} \right]$$
(7)

The \(\mathrm{f}\left(\mathrm{x}\right)={\mathrm{D}}_{1}^{2}\left(\mathrm{x}\right)-{\mathrm{D}}_{1}^{2}\left(\mathrm{x}\right)\) value of MD is to the second power when \({S}_{1}\ne {S}_{2}\) decreases to the linear function [39, 46]. As the assumption of multivariate normality and homogeneity of covariance variance matrix is not established in this study, various types of DA were used.

Logistic regression (LR)

LR is the commonest method used to examine the relationship between independent variables and qualitative response variables, especially the dichotomous response variable. LR is a multivariate statistical method where the dependent variable (response) is a nominal or ranked variable, and the independent variable or variables can be continuous, discrete, nominal, or ranking [47].

In LR, independent variables that can be continuous or discrete are used. This model is basically used to identify the relationship between two or more independent and dependent variables. LR model is for Pi (probability of response) to n independent variables X1, X2,… and Xn. The name “Logistics” is derived from the word “Logit,” which is actually a variable change as follows.

$$logit\left({P}_{i}\right)=\mathrm{log}\left(\frac{{P}_{i}}{1-{P}_{i}}\right)={\beta }_{0}+{\beta }_{1}{X}_{1}+\dots +{\beta }_{n}{X}_{n}$$
(8)

In other words, Logit is a probability equal to the log of the ratio of chance or fraction whose numerator is the probability of an accident and its denominator is the probability of an accident. Converting P to \(ln\frac{{P}_{i}}{1-{P}_{i}}\) causes the range of changes in p, which is from zero to one, to be from − ∞ to + ∞. βi,s are the parameters estimated and Pi are the probability of the response. In using logistic regression to predict CAD, βo is the y-intercept and βi coefficients of the independent variables [48].

In the data analysis, while examining the assumptions, to fit the discriminant analysis model, its assumptions were examined, and the results showed that according to the Mardia's Skewness test, the multivariate normality assumption is not established due to high Skewness and according to Breusch-Pagan test, variance–covariance matrix homogeneity assumption was not established in the two groups either. Tolerance and Variance Inflation were used to examine the multicollinearity between the independent variables. As the tolerance values were less than 0.1 and Variance Inflation less than 10, multicollinearity assumption was not violated. In this case, given the capability of DA, LDA techniques, QDA, discriminant analysis with KNN were used and compared, the value of k = 15 was obtained using cross-validation method in KNN.

In order to fit the logistic regression model, the Hosmer–Lemeshow test was used to confirm the fit of the model, indicating that the logistic regression model can be used for this data set. In addition, the stepwise LR test is used to determine the predictor variables and perform multivariate analysis to adjust the effect of the variables. Odds Ratio (OR) and 95% confidence interval (95% CI) were calculated for the variable and the p. values less than 0.05 were considered as statistically significant. For model goodness of fit and to determine the accuracy, we used the diagnostic test indicators of receiver operating characteristics (ROC), accuracy, sensitivity and specificity. The statistical software package SAS (Statistical Analysis System) was used to perform a variety of DA and logistic regression.

Non-classical method

K-nearest neighbour (KNN) analysis

KNN is the first nonparametric DA presented by Fix and Hodges in 1951 [49, 50]. This analysis does not consider the normality of several variables, but assumes homogeneity of variance. KNN differs from the above methods. It does not find functions for group differentiation, but classifies observations based on group membership of K value [51].

The distance between the two observations is calculated from the following equation:

$${d}^{2}\left({x}_{1},{x}_{2}\right)={\left({x}_{1}-{x}_{2}\right)}^{^{\prime}}{V}^{-1}({x}_{1}-{x}_{2})$$
(9)

Results

Sample characteristics

Of the 758 subjects with suspected CAD who underwent coronary angiography, 409 (54%) were males and 349 (46%) females. The mean age and standard deviation were 59.11 ± 11.11 years with the lowest and highest age 27 and 90 years, respectively. Moreover, 250 (32.98%) subjects were non-CAD and 508 (67.02%) subjects were diagnosed with CAD.

Results of univariate logistic regression model suggested that the gender of men is associated with a higher risk of CAD (ORmen vs women = 3.50, 95% CI: 2.55–4.82). Smoking (ORyes vs no = 2.63, 95% CI:1.61–4.29) and illicit drug abuse (ORyes vs no = 3.50, 95% CI:1.96–6.34) were similarly associated with the risk of CAD. Age (OR = 1.06, 95% CI: 1.04–1.08), FBS (OR = 1.005, 95% CI: 1.002–1.009), BUN (OR = 1.07, 95% CI: 1.04–1.11), Cr (OR = 6.56, 955 CI: 3.02–14.26), systolic blood pressures (OR = 1.02, 95% CI: 1.01–1.03) and diastolic blood pressures (OR = 1.02, 95% CI: 1.008–1.03) were directly associated with a greater risk of CAD. Conversely, according to the results, body mass index (OR = 0.91, 95% CI: 0.88–0.94) and HDL (OR = 0.96, 95% CI: 0.94–0.98) associated with a lower risk of CAD (Table 2).

Table 2 Univariate logistic regression the association of independent variables with coronary artery disease

The results of multivariate logistic regression analysis with step-by-step variable entry technique led to the elimination of the significance of the diastolic blood pressure, BUN, Cr, and smoking. According to the results, the risk of CAD in males was 4 times higher than in females (OR: 4.01, 95% CI: 2.67–6.01), in people with illicit drug abuse were 2.17 times higher than others (OR: 2.17, 95% CI: 1.14–4.13), and in families with a history of cardiovascular disease were 1.93 times higher than in families without a history (OR: 1.93, 95% CI: 1.20–3.09) (Table 3).

Table 3 Multivariate logistic regression the association of independent variables with coronary artery disease

Models evaluation

The comparison results of the evaluation indicators showed that the accuracy of LDA was 78.6% and 1.8% higher than that of the logistic regression method. The QDA model was the highest sensitivity, 88.2%, and the KNN has the lowest sensitivity, 77.5%. In terms of specificity, QDA with 48.2% and LDA with 71.3% were estimated as the most and least in the models, respectively (Table 4).

Table 4 The accuracy, sensitivity, specificity and AUC models

The ROC curves of the four models clearly indicated that the models are similar together, with no difference in the area under the curve (AUC). The level below the ROC curve in four modelling techniques was very close and estimated to be between 81 and 82%. According to the Delong test [52], there was no significant difference in the level below the ROC curve of the model when the AUC values of the techniques are compared pairwise (Fig. 1 and Table 5).

Fig. 1
figure 1

ROC curve for logistic regression models, linear discriminant analysis, quadratic discriminant analysis, and K nearest neighbor

Table 5 Differences in surface area under the ROC curve in linear discriminant analysis, quadratic discriminant analysis, K nearest neighbor and logistic regression

In general, all models converged in similar results. All methods estimated the same statistically significant coefficients. The overall classification rate for all was good, and either can be helpful in classifying the class membership of CAD. LDA slightly exceeds discriminant function in the correct classification rate but when taking into account sensitivity, specificity and AUC the differences in the AUC were negligibly, thus indicating no discriminating difference between the models.

Accuracy, sensitivity, specificity and AUC indices of LR Models, LDA, QDA and KNN presented in Table 5 and Fig. 1.

Discussion

Cardiovascular disease is the most important cause of death in Iran and the rest of the world. The most important recent epidemiological study of the Iranian population indicates that CAD is common among young people in this population [53]. Cardiovascular disease is usually caused by a combination of multiple risk factors. It has been clearly shown that changes in risk factors can reduce mortality, especially in people with unknown cardiovascular diseases[54].

We found that age, gender, illicit drug abuse, family history of heart disease, systolic blood pressure, as well as FBS, HDL, LDL, and BMI all play an important role in CAD risk. So that, in the study of Bidel et al. [55], gender, smoking, family history of first-degree relatives, systolic blood pressure, diastolic blood pressure, and high-density lipoprotein were significantly associated with CAD. Also, in the study of Sut et al. [56], age, gender, cholesterol, triglyceride, low-density lipoprotein, high-density lipoprotein, a history of diabetes, and a history of smoking were significantly associated with CAD. The results of those studies are almost consistent with our findings. The most significant feature discovered by logistic regression is illegal drug use. Illicit drug abuse has been identified as a "risk factor" for an association between Illicit drug abuse and a high risk of coronary artery disease. However, limited research has indicated that this relationship may differ depending on the type of region and the type of drug used in that region [57,58,59,60]. Another finding of logistic regression is that BMI is related to coronary artery disease. However, this relationship was inverse, so that with increasing body mass index, the risk of coronary artery disease decreases. This finding may seem to contradict the findings of other studies [34, 38]. However, it should be noted that the subjects were not all healthy individuals but those who referred for angiography with a complaint of heart disease and their coronary artery disease was confirmed by angiography and the other group was not confirmed. However, people who are not approved are still at risk for heart disease and may have a higher body mass index.

Subsequently, we report on a study in which we developed several predictive models to predict CAD. In particular, we used LR, KNN, LDA, and QDA. In addition, we evaluated the performance of the model based on SEN, SPE, Accuracy, and AUC. We tested the statistical significance of the difference in the area under the ROC curve in pairwise. Comparison of the results of evaluation indices showed that the accuracy of LDA was higher than logistic regression. In terms of sensitivity indicators, QDA is the highest and KNN is the lowest. In addition, their AUCs are very close in almost all models.

According to the literature review, numerous studies have been conducted to predict CAD by using diagnostic models. In the study of Dwivedi et al. [61], data mining methods were used to predict CAD disease. Among these methods, the accuracy and sensitivity of KNN were more than LR, which is in line with the present study. However, the specificity of the logistic regression model is higher than that of KNN, which is inconsistent with this study. In addition, the study by Sut et al. [56] Shows that when LR and QDA models are used to diagnose CAD, QDA is more accurate. This is consistent with the results of this study. In addition, the study conducted by Antonogeorgos et al. [62] evaluated variables related to asthma. Among them, the accuracy and specificity of LR and the sensitivity of DA are higher, which is inconsistent with this study. But in terms of the area below the ROC curve, the LR method performed better, which is in line with the present study. Also, Sadehi et al. [17] showed in their metabolic syndrome prediction study that the area under the ROC curve of the LR method was higher than that of the LDA, and the accuracy of the LDA was higher than LR, which is consistent with this study. In addition, the study of Alizadeh et al. with purpose evaluated the coronary artery disease detection using Support Vector Machine (SVM) showed that accuracy rates of 86.14%, 83.17%, and 83.50% were achieved for the diagnosis of the stenosis of the left anterior descending (LAD) artery, left circumflex (LCX) artery and right coronary artery (RCA), respectively, indicating the best performance [63]. In another study, Kurt et al. [33] employed a logistic regression on a dataset. The results showed that accuracy of classification technique was assessed using ROC curve, the logistic regression (LR) was 0.753, indicating the low accuracy compared to our study. Also, Colombet et al. [64] showed in their cardiovascular risk prediction study that the area under the ROC curve of the LR method was 0.78 (0.75–0.81), which indicate the low accuracy compared to our study.

We suggest that gender, family history of heart disease, illicit drug abuse, age, BMI, FBS, HDL, LDL and Systolic blood pressure variables may be used as reliable indicators to predict presence of CAD. In our study, we compared methods by using a real data set in order to provide information on general tendency of data structures in data sets and help researchers to select best method for solving problems of classification. On the basis of these considerations, the linear discriminant analysis method is superior to the QDA, KNN and LR methods in differentiating CAD patients. Therefore, in addition to common non-invasive diagnostic methods, LDA technique is recommended as a predictive model with acceptable accuracy, sensitivity, and specificity for the diagnosis of CAD.

There are a few limitations with this study. First, this was a cross sectional study design with the documented data that abnormalities in biologic characteristics such as laboratory values we do not know whether they preceded the disease or were a result of the disease. Second limitation, we only collected data from one hospital which may limit the generalizability of the developed models, so data collection from more hospitals or population based should be considered. We also need more data from patients e.g., clinical symptoms and electrocardiogram (ECG). However, consistent with the purpose of the current research, considering only the routine clinical features of the patients while being admitted would suffice. Another limitation of our study was the low number of samples to compare different models.

Conclusion and future work

In this study, several models were applied on the dataset and the results were discussed. The variables included in this dataset are possible indicators of CAD, according to our medical knowledge. The accuracy value achieved in this study is, to the best of our knowledge, higher than currently reported values in the literature.

In future, we aim to consider predicting state of each artery independently. Moreover, it is obvious that true diagnosis of diseased people is more important than true identification of healthy ones. Finally, larger datasets, more variables and also broader data mining approaches, could be used to achieve better and more interesting results, and these models need to be compared with the artificial neural network and machine learning models as well.