Construction of genetic classification model for coronary atherosclerosis heart disease using three machine learning methods

Peng, Wenjuan; Sun, Yuan; Zhang, Ling

doi:10.1186/s12872-022-02481-4

Construction of genetic classification model for coronary atherosclerosis heart disease using three machine learning methods

Research article
Open access
Published: 12 February 2022

Volume 22, article number 42, (2022)
Cite this article

Download PDF

You have full access to this open access article

BMC Cardiovascular Disorders Aims and scope Submit manuscript

Construction of genetic classification model for coronary atherosclerosis heart disease using three machine learning methods

Download PDF

2232 Accesses
8 Citations
Explore all metrics

Abstract

Background

Although the diagnostic method for coronary atherosclerosis heart disease (CAD) is constantly innovated, CAD in the early stage is still missed diagnosis for the absence of any symptoms. The gene expression levels varied during disease development; therefore, a classifier based on gene expression might contribute to CAD diagnosis. This study aimed to construct genetic classification models for CAD using gene expression data, which may provide new insight into the understanding of its pathogenesis.

Methods

All statistical analysis was completed by R 3.4.4 software. Three raw gene expression datasets (GSE12288, GSE7638 and GSE66360) related to CAD were downloaded from the Gene Expression Omnibus database and included for analysis. Limma package was performed to identify differentially expressed genes (DEGs) between CAD samples and healthy controls. The WGCNA package was conducted to recognize CAD-related gene modules and hub genes, followed by recursive feature elimination analysis to select the optimal features genes (OFGs). The genetic classification models were established using support vector machine (SVM), random forest (RF) and logistic regression (LR), respectively. Further validation and receiver operating characteristic (ROC) curve analysis were conducted to evaluate the classification performance.

Results

In total, 374 DEGs, eight gene modules, 33 hub genes and 12 OFGs (HTR4, KISS1, CA12, CAMK2B, KLK2, DDC, CNGB1, DERL1, BCL6, LILRA2, HCK, MTF2) were identified. ROC curve analysis showed that the accuracy of SVM, RF and LR were 75.58%, 63.57% and 63.95% in validation; with area under the curve of 0.813 (95% confidence interval, 95% CI 0.761–0.866, P < 0.0001), 0.727 (95% CI 0.665–0.788, P < 0.0001) and 0.783 (95% CI 0.725–0.841, P < 0.0001), respectively.

Conclusions

In conclusion, this study found 12 gene signatures involved in the pathogenic mechanism of CAD. Among the CAD classifiers constructed by three machine learning methods, the SVM model has the best performance.

View this article's peer review reports

Discovering biomarkers associated and predicting cardiovascular disease with high accuracy using a novel nexus of machine learning techniques for precision medicine

Article Open access 02 January 2024

Identifying potential signatures for atherosclerosis in the context of predictive, preventive, and personalized medicine using integrative bioinformatics approaches and machine-learning strategies

Article 20 July 2022

Identification of Co-diagnostic Genes for Heart Failure and Hepatocellular Carcinoma Through WGCNA and Machine Learning Algorithms

Article 18 January 2024

Background

Coronary atherosclerosis heart disease (CAD) is the most common cardiovascular diseases (CVDs) and is characterized by high morbidity and mortality [1]. CVD accounted for one-third of all deaths, and there were an estimated 17.92 million deaths due to CVDs worldwide in 2015 [2]. In China, the summary of China cardiovascular disease report (2018) estimated that about 290 million people are suffering from CVDs, and 11 million of them are CAD patients [3]. A previous study showed that over 40% of deaths in China are directly caused by CAD or its complications [4]. Therefore, a comprehensive analysis of multiple biomarkers interaction is of great significance to understand the pathogenesis of CAD.

With the development of technology, the diagnosis of CAD is constantly innovated. Invasive coronary angiography is so far the gold standard by which the presence and severity of CAD could be defined, especially in patients with significant left ventricular dysfunction [5]. Coronary computed tomography angiography is increasingly being considered as an alternative diagnostic method because of its effectiveness, safety and non-invasion [6]. In addition, magnetic resonance coronary angiography provides a superior soft tissue characterization, and is well suited to the detection of adverse plaque characteristics [7]. However, CAD in the early stage is still missed diagnosis for the absence of any symptoms or mild degree of disease [8]. CAD is influenced by both environmental [9, 10] and genetic factors [11]. Actually, gene expression levels varied before morphological abnormality of the tissue during CAD development [12]. Therefore, genetic classification models might contribute to CAD diagnosis.

Due to the extensive application of gene chip and next-generation sequencing technology, a large amount of gene expression data is stored in databases, for example, Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/) [13]. GEO supplies plentiful data for researchers to investigate the association between gene expression and CAD [14,15,16]. Some analytical methods have been used as approaches for microarray data mining. The bioinformatics analyses could reveal the biological functions of CAD-related genes [17]. The machine learning methods contribute to finding genetic biomarkers or constructing classifiers of CAD [18].

In the present study, we obtained CAD-related gene chip data from GEO open resources. Differentially expressed genes (DEGs) were screened between CAD samples and healthy controls, followed by the weighted gene co-expression network analysis (WGCNA) [19] by which hub genes with the highest correlation with CAD were identified. Subsequently, the recursive feature elimination (RFE) [20] algorithm was performed to select the optimal features genes (OFGs) for CAD from hub genes. By utilizing machine learning methods, including support vector machine (SVM) [21], random forest (RF) [22] and logistic regression (LR), the genetic classification models of CAD were finally established. This study aimed to identify potential hub genes and construct genetic classification models for CAD, which may provide new insight into the understanding of its pathogenesis and facilitate further therapeutic studies.

Materials

Data collection, quality evaluation and preprocessing

In this study, three raw datasets (GSE12288, GSE7638 and GSE66360) and corresponding annotation files were acquired from GEO. The simpleaffy package was used to evaluate the quality of chips and draw a quality control diagram, the unqualified samples would be marked with “bioB” in the quality control diagram and further excluded. Then, affy package was performed to standardize raw data, including background correction, normalization, perfect match (PM) probe correction and probe expression value calculation. After that, robust multi-array average (RMA) algorithm [23] was conducted to normalize microarray data and perform a log2 transformation. The probe expression value is estimated based on a stochastic model employed by the PM signal distribution. Afterwards, each probe set in these three datasets was annotated with gene symbol according to corresponding annotation files. Furthermore, k-nearest neighbor (KNN) function in the impute package was carried out to fill in the missing data. Finally, the complete gene expression profiles were acquired. The impute.knn is a function to impute missing expression data using KNN [24]. For each gene with missing values, k nearest neighbors are selected using a Euclidean metric, and the missing elements are imputed by averaging those elements of its neighbors.

Batch effect removal and differential expression analysis

The SVA package was carried out for correcting the batch effects of these three normalized datasets. The limma package [25, 26] was performed to identify DEGs between CAD samples and healthy controls in three datasets and the integrated dataset, respectively. And the Benjamini–Hochberg method was performed for multiple testing correction, by which the adjusted P value was calculated. The integrated dataset was the combination of GSE12288, GSE7638 and GSE66360. The thresholds of adjusted P < 0.001, |log₂(foldchange, FC)|> 0.263 were set to define DEGs. Furthermore, volcano plots were achieved using ggplot2 package to investigate the whole gene comparison results.

Weighted gene co-expression network analysis

Within the integrated dataset, the WGCNA package was conducted to construct the scale-free co-expression network and to identify hub genes from adjusted P < 0.001 genes. The theory behind WGCNA algorithm have been described in detail previously [27]. Firstly, the absolute value of correlation coefficient between the pair of genes i and j across of all subjects was defined as co-expression similarity (\({\text{S}}_{{{\text{ij}}}} = \left| {{\text{cor}}\left( {{\text{i}},{\text{ j}}} \right)} \right|\)). Therefore, \({\text{S}} = \left[ {{\text{S}}_{{{\text{ij}}}} } \right]\) was used to represent the co-expression correlation matrix. Secondly, the \({\text{S}}\) was transformed into an adjacency matrix by a power function: \({\text{a}}_{{{\text{ij}}}} = {\text{power}}\left( {{\text{S}}_{{{\text{ij}}}} ,\upbeta } \right) = \left| {{\text{S}}_{{{\text{ij}}}} } \right|^{\upbeta }\), where the soft thresholding power parameter, β, was set to 5 in this study. Thirdly, the topological overlap matrix (TOM) was calculated on the following function: \({\text{w}}_{{{\text{ij}}}} = \frac{{{\text{l}}_{{{\text{ij}}}} + {\text{a}}_{{{\text{ij}}}} }}{{{\text{min}}\left\{ {{\text{k}}_{{\text{i}}} ,{\text{k}}_{{\text{j}}} } \right\} + 1 - {\text{a}}_{{{\text{ij}}}} }}\), where \({\text{l}}_{{{\text{ij}}}} = \mathop \sum \limits_{\upmu } {\text{a}}_{{{\text{i}}\upmu }} {\text{a}}_{{{\text{j}}\upmu }}\), \({\text{k}}_{{\text{i}}} = \mathop \sum \limits_{{\upmu }} {\text{a}}_{{{{\text{i}\upmu }}}}\), \({\text{k}}_{{\text{j}}} = \mathop \sum \limits_{{\upmu }} {\text{a}}_{{{\text{i}\upmu }}}\). The μ denotes genes connected with gene i or j. Then, the dissimilarity was defined as \({\text{d}}_{{{\text{ij}}}}^{{\text{w}}} = 1 - {\text{w}}_{{{\text{ij}}}}\), thus forming a dissimilarity matrix. Finally, average linkage hierarchical clustering was conducted based on the TOM-based dissimilarity with a minimum size of 30 for the genes dendrogram to classify genes with similar expression profiles into modules. Each module was assigned to the corresponding color.

A dynamic hybrid branch cutting method was implemented on the TOM-based dendrogram to identify module eigengenes (ME). ME was calculated by the first principal component of a given module, which could represent the expression patterns of all genes. A phenotypic trait-based gene significance measure was defined as the absolute value of correlation between the gene i and the phenotypic trait (T): \({\text{GS}}_{{\text{i}}} = \left| {{\text{cor}}\left( {{\text{i}},{\text{T}}} \right)} \right|\). T is the binary variable for CAD status (patient status = 1 and healthy control = 0). GS_i denotes the association between gene i and T. Module membership (MM) represent the correlation between gene i and ME: \({\text{MM}}_{{\text{i}}} = \left| {{\text{cor}}\left( {{\text{i}},{\text{ME}}} \right)} \right|\), which explains associations between gene i and the corresponding module. Hub genes represent a series of genes that is significantly connected to a relevant module [28]. In the current study, a cut of \(|{\text{GS}}_{{\text{i}}} | > 0.2\), \(\left| {{\text{MM}}_{{\text{i}}} } \right| > 0.8\) was considered as the threshold of hub genes.

Selection of optimal feature gene sets

RFE was applied to select OFGs of CAD from hub genes in the integrated dataset using caret package. The OFGs can be used as identifiers of clinical diagnosis to construct a CAD classifier based on their expression levels. Performances of different types of samples were evaluated through combinations of iterative random features until the optimal feature combination was obtained. And, the number of cross-validation was set to 200 in this study. Later, the heatmap of the OFGs was drawn by pheatmap package to compare the expression levels between groups in datasets, respectively.

Construction and validation of genetic classification models

Three machine learning methods (SVM, RF and LR) were used to construct the CAD genetic classification models. SVM is a discriminant classifier defined by the classification hyperplane. The model is trained with labeled training samples, and then, the test samples are classified by the output of the optimal hyperplane [29]. RF is an integrated learning algorithm that combines different decision trees. Among the decision trees that constitute an RF model, each tree is an independent set generated based on random samples. Each tree learns and predicts independently, and the final result is determined by the mean value of all decision trees [30, 31]. LR is one of the GLM models, which have been regarded as an extension of the linear model that establishes the relationship between the mathematical expected value of the response variables and the predictive variables of the linear combination through the coupling function [32].

In the present study, 50% of samples in GSE12288 were selected randomly and used as a training dataset. The SVM, RF and LR classification models were constructed using e1071 package, randomForest package and glm function, respectively. Samples were classified into cases and controls according to the expression level of genes. To confirm the robustness and transferability of these constructed classifiers, internal and external validations were performed. Internal validation was carried out in the remaining 50% of samples of GSE12288, and external validation was performed in the combination of GSE7638 and GSE66360 datasets. Then, the efficacy of models was comprehensively evaluated in terms of sensitivity (Se), specificity (Sp), positive predictive value (PPV), negative predictive value (NPV) and the area under the ROC curve (AUC). All statistical analyses were conducted using R 3.4.4 software.

Results

Data information

Figure 1 summarized the schematic overview of the study flow. In this study, three gene expression datasets related to CAD were acquired. The information of them was summarized in Table 1. A total of 481 samples were included for analysis, among which 269 samples were CAD patients and 212 were healthy control samples. After evaluation of chip quality, one sample (GSM1620893) belonging to healthy control group in GSE66360 was dropped (Additional file 1: Figure S1).

Table 1 Information of the downloaded datasets

Full size table

Integration of three datasets and identification of differentially expressed genes

After batch effect removal analysis, 12,395 genes and 480 samples remained in the integrated dataset. DEGs were identified by differential expression analysis (Fig. 2). Briefly, when CAD samples were compared with healthy controls, 114 (31 upregulated and 83 downregulated), 1157 (1112 upregulated and 45 downregulated) and 2484 (471 upregulated and 2013 downregulated) DEGs were recognized in GSE12288, GSE7638 and GSE66360, respectively (Fig. 2A–C). And, 374 DEGs were identified in the integrated dataset (Fig. 2D) in which 303 DEGs were upregulated and 71 were downregulated.

Hub genes identification using WGCNA

The expression data of 2546 genes that adjusted P < 0.001 were analyzed using WGCNA package to identify the co-expression patterns and hub genes. The threshold power of β = 5 was selected to ensure a scale‐free network (Fig. 3A, B). The co-expression network contained eight modules in total and the module sizes ranged from 40 (pink) to 859 (turquoise). These modules were labelled with colours and depicted in the dendrograms provided in Fig. 3C. However, 387 genes were not similarly co-expressed with other genes in the network (grey). The associations between the MEs of modules and CAD status (patient status = 1 and healthy control = 0) were identified (Fig. 3D). The correlation coefficients (r) of modules indicated that they were all significantly correlated with CAD status (P < 0.05). The MEs of blue, green, yellow, brown, pink and red modules were positively correlated with CAD status (r > 0, P < 0.05), while MEs of turquoise and black modules were negatively correlated with CAD status (r < 0, P < 0.05). In this study, \(|{\text{GS}}_{{\text{i}}} | > 0.2\) and \(\left| {{\text{MM}}_{{\text{i}}} } \right| > 0.8\) was considered as threshold for identifying hub genes, and 33 genes were identified from six modules in total (Fig. 3E, Table 2).

Table 2 The information of 33 hub genes identified by weighed gene co-expression network analysis

Full size table

Construction of genetic classification models based on optimal feature genes

In order to obtain the optimal characteristic combination of genes representative of 33 hub genes, the RFE algorithm was adopted in the integrated dataset. Finally, 12 hub genes were selected as OFGs, this OFGs combination had the lowest classification root mean square error (RMSE) of 25.54% (Fig. 4A). These 12 OFGs were HTR4, KISS1, CA12, CAMK2B, KLK2, DDC, CNGB1, DERL1, BCL6, LILRA2, HCK and MTF2 (Table 3), among which eight OFGs (HTR4, CA12, KLK2, DERL1, BCL6, LILRA2, HCK and MTF2) were upregulated, while the other four (KISS1, CAMK2B, CNGB1 and DDC) were downregulated. Hierarchical clustering analysis was then carried out in datasets based on expression data of OFGs (Fig. 4B–E).

Table 3 The result information of 12 optimal feature genes in limma package analysis

Full size table

In the current study, the training dataset contained 111 (50%) samples in GSE12288 which were selected randomly. The SVM, RF and LR classifiers were constructed based on the expression of these 12 genes in the training dataset. Furthermore, the remained 50% of samples of GSE12288, and the combination of GSE7638 and GSE66360 were deem to testing datasets for internal and external validation, respectively.

Validation and evaluation of classifiers performance

The results showed that SVM, RF and LR classifiers could accurately classify 105 (94.59%), 106 (95.50%) and 108 (97.30%) of the 111 samples in internal validation, respectively. In external validations, 195 (75.59%) of the 258 samples were accurately classified via SVM classifier, with AUC of 0.813 (95% confidence interval (95% CI): 0.761–0.866, P < 0.0001). RF classifier could exactly category 164 (63.57%) of 258 samples, with AUC of 0.727 (95% CI 0.665–0.788, P < 0.0001). LR classifier could precisely classify 165 (63.95%) of 258 samples, with AUC of 0.783 (95% CI 0.725–0.841, P < 0.0001). The ROC charts of samples were shown in Fig. 5.

The performance of these three classifiers was evaluated using a variety of indicators, such as correct rate, Se, Sp, PPV and NPV, which were described in Table 4. In the internal validation, the accuracy appeared RF > LR > SVM, but the AUC SVM > RF > LR. In the external validation, both correct rate and AUC appeared SVM > LR > RF. The Se in SVM classifier was the highest (0.780, 95% CI 0.707–0.842) and in LR classifier was the lowest (0.516, 95% CI 0.435–0.596), respectively. The Sp in LR classifier was the highest (0.869, 95% CI 0.786–0.928) and in RF classifier was the lowest (0.525, 95% CI 0.422–0.627). These results suggested that the constructed SVM classifier based on the 12 OFGs could be the best in the present study.

Table 4 Validation and evaluation results of three machine learning classifiers performance

Full size table

Discussion

Based on gene expression data, machine learning methods can be applied in constructing classification models of disease and propose a deeper understanding for clinical diagnosis and treatment. In this study, three mRNA expression profiles related to 269 CAD and 212 healthy control samples were downloaded from GEO. A total of 374 DEGs and 33 hub genes were identified by bioinformatics analyses. Accordingly, 12 OFGs (HTR4, KISS1, CA12, CAMK2B, KLK2, DDC, CNGB1, DERL1, BCL6, LILRA2, HCK and MTF2) were obtained and classification models were constructed through three machine learning methods. Finally, results of evaluating classifiers performance showed the SVM model was the best in the present study, with the AUC of 0.813 (95% CI 0.761–0.866), the sensitivity of 0.780 (95% CI 0.707–0.842) and the specificity of 0.717 (95% CI 0.618–0.803), respectively.

Gene expression might change before morphological abnormality of the tissue, researchers demonstrated that macrophage C-type lectin receptor CLEC5A (MDL-1) mainly expressed in atherosclerotic lesional macrophages and elevated macrophage MDL-1 expression was associated with early plaque progression [12]. Pulanco MC et al. found that C1q promoted macrophage survival and improved foam cell function, which may play an important protective role in early atherosclerosis progression [33]. In addition, matrix metalloproteinases (MMPs) participated in different mechanisms fundamental to atherothrombotic progression [34, 35], such as MMP-12 [36] and MMP-2 [37].

In the current study, eight (HTR4, CA12, KLK2, DERL1, BCL6, LILRA2, HCK and MTF2) of 12 potential critical genes were upregulated. Oksala found that CA12 expression was elevated in atherosclerotic plaques compared to control tissues (internal thoracic artery controls). And CA12 protein was expressed in the atheromatous core and to some extent in all vessel layers in plaques of all vessel beds, while only sparse cells were positive in control vessels [38]. Chronic inflammation is a hallmark of atherosclerosis, Barish GD examined the impact of the transcriptional repressor BCL6 on atherogenesis and revealed BCL6-SMRT/NCoR complexes could constrain immune responses and contribute to the prevention of atherosclerosis [39]. HCK and FGR are two Src tyrosine kinases, Medina demonstrated that Hck/Fgr-deficiency leads to reduced atherosclerotic lesion with concomitant reductions in macrophage accumulation and, paradoxically, lesion stability [40]. HTR4 is a member of the family of serotonin receptors and associated with average and maximal carotid intima-media thickness measures [41]. Serotonin, also named as 5-hydroxytryptamine (5-HT), is a well-known vasoreactive amine that could affect the circulation of the heart. Human kallikrein 2 (KLK2, also called hK2) has an important in vivo regulatory function on Prostate-specific antigen (PSA) activity, and could convert the inactive precursor form of PSA to active PSA [42]. PSA is a member of the human kallikrein family of serine proteases [43] and PSA is an established marker of myocardial infarction [44].

The other four (KISS1, CAMK2B, CNGB1 and DDC) of 12 OFGs were downregulated. The encoded protein of DDC catalyzes the decarboxylation of L-5-hydroxytryptophan to serotonin, which is a well-known vasoreactive amine. Kisspeptins are the endogenous cleavage products of the KiSS1 protein, they function as potent vasoconstrictors, and the response could comparable to angiotensin (Ang)-II in the coronary artery; In addition, Kisspeptins’ receptor, G protein-coupled receptor 54, is discretely located at atherosclerosis-prone vessels [45]. The product of CAMK2B belongs to serine/threonine protein kinase family. Akt (a serine/threonine protein kinase B) is an important signaling mediator which includes various Akt isoforms, such as Akt1, Akt2, and Akt3 [46]. Researchers reported that, in apoE-deficient mice, the loss of Akt1 leaded to severe atherosclerosis [47] and Akt3 deficiency in macrophages promoted foam cell formation and atherosclerosis [48]. T lymphocytes participate in the chronic inflammatory reaction and ultimately lead to the occurrence and development of acute coronary syndrome (ACS) [49, 50]. CNGB1 also called GARP, Zhu et al. found that the expression of GARP in CD4⁺ T cells of ACS patients was lower than those of control patients [51]. Circulating CD4⁺ CD25⁺ GARP⁺ Tregs were impaired in patients with ACS, targeting GARP might promote the protective function of Tregs in ACS [52].

This study used three kinds of machine learning methods (SVM, RF and LR) to construct genetic classification model of CAD. The SVM, RF and LR have been widely applied for discriminant analyses or biomarker identification in diseases, such as acute coronary syndromes [53], osteosarcoma [54], lung adenocarcinoma [55], rheumatoid arthritis [56], chronic obstructive pulmonary disease [57]. Several studies also compared these three classifiers to find the best one as disease classification models [58,59,60]. In the present study, the SVM classifier showed the best classification efficacy (AUC in the internal and external validation were 0.996 and 0.813, respectively) and was considered as the optimal machine learning method in this study.

Some strengths and limitations of the current study should be acknowledged. Firstly, feature gene selection was the basis of the model construction, this study conducted both WGCNA and RFE algorithm to identify gene features. WGCNA is an advanced systems biology-based approach used for finding molecular mechanisms and for linking the information to phenotypic traits [19]. WGCNA has been widely and successfully used to identify candidate biomarkers and gene modules highly associated with disease [19]. The combined application of WGCNA and RFE in the current study might find the optimal gene features associated with CAD to the maximum extent. Secondly, we included a sufficient number of samples, excluded the unqualified sample, and removed the batch effect between datasets, which made our statistical analyses more reliable. Thirdly, this study performed three kinds of machine learning methods to construct classifiers, and the classification efficacy was compared. Finally, both internal and external validation were conducted to examine the performance of three classifiers and the best classifier was selected. Limitations were as follows: Firstly, we only analyzed the gene expression profiles, but the clinic information was not taken into account since the data was not available. Secondly, the optimal feature genes related to CAD should be further validated by real-time polymerase chain reaction with a larger sample size and functional experiments. Eventually, whether the genetic classification model could be used in practice is currently unknown and should be explored in future studies.

Conclusions

In conclusion, 33 CAD-related hub genes were identified using bioinformatics analyses, and 12 OFGs were obtained. Among the CAD classifiers constructed by three machine learning methods, SVM model has the best performance, which proposed a deeper understanding for CAD clinical diagnosis and treatment.

Availability of data and materials

The datasets analyzed during the current study are available in the GEO (https://www.ncbi.nlm.nih.gov/geo/) databases.

Abbreviations

CAD:: Coronary atherosclerosis heart disease
CVD:: Cardiovascular disease
DEG:: Differentially expressed gene
WGCNA:: Weighted gene co-expression network analysis
RFE:: Recursive feature elimination
OFG:: Optimal features gene
SVM:: Support vector machine
RF:: Random forest
LR:: Logistic regression
FC:: Foldchange
Se:: Sensitivity
Sp:: Specificity
PPV:: Positive predictive value
NPV:: Negative predictive value
ROC:: Receiver operator characteristic
AUC:: Area under the ROC curve

References

Kuller LH. Ethnic differences in atherosclerosis, cardiovascular disease and lipid metabolism. Curr Opin Lipidol. 2004;15(2):109–13.
Article CAS PubMed Google Scholar
Roth GA, Johnson C, Abajobir A, Abd-Allah F, Abera SF, Abyu G, Ahmed M, Aksut B, Alam T, Alam K, et al. Global, regional, and national burden of cardiovascular diseases for 10 causes, 1990 to 2015. J Am Coll Cardiol. 2017;70(1):1–25.
Article PubMed PubMed Central Google Scholar
Hu S, Gao R, Liu L, Zhu M, Wang W, Wang Y, Wu Z, Li H, Gu D, Yang Y, et al. Summary of China cardiovascular disease report. Chin Circ J. 2019;34(03):209–20.
Google Scholar
Gao R, Yang Y, Han Y, Huo Y, Chen J, Yu B, Su X, Li L, Kuo HC, Ying SW, et al. Bioresorbable vascular scaffolds versus metallic stents in patients with coronary artery disease: ABSORB China trial. J Am Coll Cardiol. 2015;66(21):2298–309.
Article CAS PubMed Google Scholar
Lim MJ, White CJ. Coronary angiography is the gold standard for patients with significant left ventricular dysfunction. Prog Cardiovasc Dis. 2013;55(5):504–8.
Article PubMed Google Scholar
Paech DC, Weston AR. A systematic review of the clinical effectiveness of 64-slice or higher computed tomography angiography as an alternative to invasive coronary angiography in the investigation of suspected coronary artery disease. BMC Cardiovasc Disord. 2011;11:32.
Article PubMed PubMed Central Google Scholar
Vesey AT, Dweck MR, Fayad ZA. Utility of Combining PET and MR Imaging of Carotid Plaque. Neuroimaging Clin N Am. 2016;26(1):55–68.
Article PubMed Google Scholar
Kwok CS, Satchithananda D, Mallen CD: Missed opportunities in coronary artery disease: reflection on practice to improve patient outcomes. Coronary artery disease 2021.
Ades PA, Gaalema DE. Coronary heart disease as a case study in prevention: potential role of incentives. Prev Med. 2012;55(Suppl):S75-79.
Article PubMed Google Scholar
Mallika V, Goswami B, Rajappa M. Atherosclerosis pathophysiology and the role of novel risk factors: a clinicobiochemical perspective. Angiology. 2007;58(5):513–22.
Article CAS PubMed Google Scholar
Yamada Y, Matsui K, Takeuchi I, Fujimaki T. Association of genetic variants with coronary artery disease and ischemic stroke in a longitudinal population-based genetic epidemiological study. Biomed Rep. 2015;3(3):413–9.
Article CAS PubMed PubMed Central Google Scholar
Xiong W, Wang H, Lu L, Xi R, Wang F, Gu G, Tao R. The macrophage C-type lectin receptor CLEC5A (MDL-1) expression is associated with early plaque progression and promotes macrophage survival. J Transl Med. 2017;15(1):234.
Article PubMed PubMed Central Google Scholar
Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Marshall KA, et al. NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res. 2009;37(5):D885-890.
Article CAS PubMed Google Scholar
Liu J, Wang X, Lin J, Li S, Deng G, Wei J. Classifiers for predicting coronary artery disease based on gene expression profiles in peripheral blood mononuclear cells. Int J Gen Med. 2021;14:5651–63.
Article PubMed PubMed Central Google Scholar
Zhu L, Zhao S, Zhao W. Potential regulatory role of lncRNA-miRNA-mRNA in coronary artery disease (CAD). Int Heart J. 2021;62(6):1369–78.
Article CAS PubMed Google Scholar
Zhang B, Zeng K, Li R, Jiang H, Gao M, Zhang L, Li J, Guan R, Liu Y, Qiang Y, et al. Construction of the gene expression subgroups of patients with coronary artery disease through bioinformatics approach. Math Biosci Eng MBE. 2021;18(6):8622–40.
Article PubMed Google Scholar
Tan X, Zhang X, Pan L, Tian X, Dong P. Identification of key pathways and genes in advanced coronary atherosclerosis using bioinformatics analysis. Biomed Res Int. 2017;2017:4323496.
Article PubMed PubMed Central Google Scholar
Wang Y, Liu T, Liu Y, Chen J, Xin B, Wu M, Cui W. Coronary artery disease associated specific modules and feature genes revealed by integrative methods of WGCNA, MetaDE and machine learning. Gene. 2019;710:122–30.
Article CAS PubMed Google Scholar
Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinform. 2008;9:559.
Article Google Scholar
Baur B, Bozdag S. A feature selection algorithm to compute gene centric methylation from probe level methylation data. PLoS ONE. 2016;11(2):e0148977.
Article PubMed PubMed Central Google Scholar
Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.
Article Google Scholar
Breiman L. Random forests. Mach Learn. 2001;45:5–32.
Article Google Scholar
Gautier L, Cope L, Bolstad BM, Irizarry RA. affy–analysis of Affymetrix GeneChip data at the probe level. Bioinformatics (Oxford, England). 2004;20(3):307–15.
Article CAS Google Scholar
Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB. Missing value estimation methods for DNA microarrays. Bioinformatics (Oxford, England). 2001;17(6):520–5.
Article CAS Google Scholar
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47.
Article PubMed PubMed Central Google Scholar
Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004, 3:Article3.
Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol. 2005; 4:Article17.
Langfelder P, Mischel PS, Horvath S. When is hub gene selection better than standard meta-analysis? PLoS ONE. 2013;8(4):e61505.
Article CAS PubMed PubMed Central Google Scholar
Huang S, Cai N, Pacheco PP, Narrandes S, Wang Y, Xu W. Applications of support vector machine (SVM) learning in cancer genomics. Cancer Genom Proteomics. 2018;15(1):41–51.
CAS Google Scholar
Pavlov YL. Random forests. Berlin: De Gruyter; 2019.
Google Scholar
Jeong B, Cho H, Kim J, Kwon SK, Hong S, Lee C, Kim T, Park MS, Hong S, Heo TY. Comparison between statistical models and machine learning methods on classification for highly imbalanced multiclass kidney data. Diagnostics (Basel, Switzerland). 2020;10(6):415.
CAS Google Scholar
Qu Y, Luo J. Estimation of group means when adjusting for covariates in generalized linear models. Pharm Stat. 2015;14(1):56–62.
Article PubMed Google Scholar
Pulanco MC, Cosman J, Ho MM, Huynh J, Fing K, Turcu J, Fraser DA. Complement protein C1q enhances macrophage foam cell survival and efferocytosis. J Immunol (Baltimore, Md: 1950). 2017;198(1):472–80.
Article CAS Google Scholar
Johnson JL. Matrix metalloproteinases: influence on smooth muscle cells and atherosclerotic plaque stability. Expert Rev Cardiovasc Ther. 2007;5(2):265–82.
Article CAS PubMed Google Scholar
Rodriguez JA, Orbe J, Paramo JA. Metalloproteases, vascular remodeling and atherothrombotic syndromes. Rev Esp Cardiol. 2007;60(9):959–67.
Article PubMed Google Scholar
Liang J, Liu E, Yu Y, Kitajima S, Koike T, Jin Y, Morimoto M, Hatakeyama K, Asada Y, Watanabe T, et al. Macrophage metalloelastase accelerates the progression of atherosclerosis in transgenic rabbits. Circulation. 2006;113(16):1993–2001.
Article CAS PubMed Google Scholar
Li Z, Li L, Zielke HR, Cheng L, Xiao R, Crow MT, Stetler-Stevenson WG, Froehlich J, Lakatta EG. Increased expression of 72-kd type IV collagenase (MMP-2) in human aortic atherosclerotic lesions. Am J Pathol. 1996;148(1):121–8.
CAS PubMed PubMed Central Google Scholar
Oksala N, Levula M, Pelto-Huikko M, Kytomaki L, Soini JT, Salenius J, Kahonen M, Karhunen PJ, Laaksonen R, Parkkila S, et al. Carbonic anhydrases II and XII are up-regulated in osteoclast-like cells in advanced human atherosclerotic plaques-Tampere Vascular Study. Ann Med. 2010;42(5):360–70.
Article CAS PubMed Google Scholar
Barish GD, Yu RT, Karunasiri MS, Becerra D, Kim J, Tseng TW, Tai LJ, Leblanc M, Diehl C, Cerchietti L, et al. The Bcl6-SMRT/NCoR cistrome represses inflammation to attenuate atherosclerosis. Cell Metab. 2012;15(4):554–62.
Article CAS PubMed PubMed Central Google Scholar
Medina I, Cougoule C, Drechsler M, Bermudez B, Koenen RR, Sluimer J, Wolfs I, Doring Y, Herias V, Gijbels M, et al. Hck/Fgr kinase deficiency reduces plaque growth and stability by blunting monocyte recruitment and intraplaque motility. Circulation. 2015;132(6):490–501.
Article CAS PubMed PubMed Central Google Scholar
Sabater-Lleal M, Malarstig A, Folkersen L, Soler Artigas M, Baldassarre D, Kavousi M, Almgren P, Veglia F, Brusselle G, Hofman A, et al. Common genetic determinants of lung function, subclinical atherosclerosis and risk of coronary artery disease. PLoS ONE. 2014;9(8):e104082.
Article PubMed PubMed Central Google Scholar
Rittenhouse HG, Finlay JA, Mikolajczyk SD, Partin AW. Human Kallikrein 2 (hK2) and prostate-specific antigen (PSA): two closely related, but distinct, kallikreins in the prostate. Crit Rev Clin Lab Sci. 1998;35(4):275–368.
Article CAS PubMed Google Scholar
Watt KW, Lee PJ, M’Timkulu T, Chan WP, Loor R. Human prostate-specific antigen: structural and functional similarity with serine proteases. Proc Natl Acad Sci USA. 1986;83(10):3166–70.
Article CAS PubMed PubMed Central Google Scholar
Patanè S, Marte F. Prostate-specific antigen kallikrein and acute myocardial infarction: where we are. Where are we going? Int J Cardiol. 2011;146(1):e20-22.
Article PubMed Google Scholar
Mead EJ, Maguire JJ, Kuc RE, Davenport AP. Kisspeptins are novel potent vasoconstrictors in humans, with a discrete localization of their receptor, G protein-coupled receptor 54, to atherosclerosis-prone vessels. Endocrinology. 2007;148(1):140–7.
Article CAS PubMed Google Scholar
Manning BD, Cantley LC. AKT/PKB signaling: navigating downstream. Cell. 2007;129(7):1261–74.
Article CAS PubMed PubMed Central Google Scholar
Fernández-Hernando C, Ackah E, Yu J, Suárez Y, Murata T, Iwakiri Y, Prendergast J, Miao RQ, Birnbaum MJ, Sessa WC. Loss of Akt1 leads to severe atherosclerosis and occlusive coronary artery disease. Cell Metab. 2007;6(6):446–57.
Article PubMed PubMed Central Google Scholar
Ding L, Biswas S, Morton RE, Smith JD, Hay N, Byzova TV, Febbraio M, Podrez EA. Akt3 deficiency in macrophages promotes foam cell formation and atherosclerosis in mice. Cell Metab. 2012;15(6):861–72.
Article CAS PubMed PubMed Central Google Scholar
Hansson GK. Inflammation, atherosclerosis, and coronary artery disease. N Engl J Med. 2005;352(16):1685–95.
Article CAS PubMed Google Scholar
Libby P. Inflammation in atherosclerosis. Nature. 2002;420(6917):868–74.
Article CAS PubMed Google Scholar
Zhu ZF, Meng K, Zhong YC, Qi L, Mao XB, Yu KW, Zhang W, Zhu PF, Ren ZP, Wu BW, et al. Impaired circulating CD4+ LAP+ regulatory T cells in patients with acute coronary syndrome and its mechanistic study. PLoS ONE. 2014;9(2):e88775.
Article PubMed PubMed Central Google Scholar
Meng K, Zhang W, Zhong Y, Mao X, Lin Y, Huang Y, Lang M, Peng Y, Zhu Z, Liu Y, et al. Impairment of circulating CD4⁺CD25⁺GARP⁺ regulatory T cells in patients with acute coronary syndrome. Cell Physiol Biochem Int J Exp Cell Physiol Biochem Pharmacol. 2014;33(3):621–32.
Article CAS Google Scholar
Lu Y, Meng X, Wang L, Wang X. Analysis of long non-coding RNA expression profiles identifies functional lncRNAs associated with the progression of acute coronary syndromes. Exp Ther Med. 2018;15(2):1376–84.
CAS PubMed Google Scholar
He Y, Ma J, Wang A, Wang W, Luo S, Liu Y, Ye X. A support vector machine and a random forest classifier indicates a 15-miRNA set related to osteosarcoma recurrence. Onco Targets Ther. 2018;11:253–69.
Article PubMed PubMed Central Google Scholar
Wang Y, Fu J, Wang Z, Lv Z, Fan Z, Lei T. Screening key lncRNAs for human lung adenocarcinoma based on machine learning and weighted gene co-expression network analysis. Cancer Biomark. 2019;25(4):313–24.
Article CAS PubMed Google Scholar
Long NP, Park S, Anh NH, Min JE, Yoon SJ, Kim HM, Nghi TD, Lim DK, Park JH, Lim J, et al. Efficacy of integrating a novel 16-gene biomarker panel and intelligence classifiers for differential diagnosis of rheumatoid arthritis and osteoarthritis. J Clin Med. 2019;8(1):859.
Article Google Scholar
Mostafaei S, Kazemnejad A, Azimzadeh Jamalkandi S, Amirhashchi S, Donnelly SC, Armstrong ME, Doroudian M. Identification of novel genes in human airway epithelial cells associated with chronic obstructive pulmonary disease (COPD) using machine-based learning algorithms. Sci Rep. 2018;8(1):15775.
Article PubMed PubMed Central Google Scholar
Jin X, Wang J, Ge L, Hu Q. Identification of immune-related biomarkers for sciatica in peripheral blood. Front Genet. 2021;12:781945.
Article PubMed PubMed Central Google Scholar
Pan X, Jin X, Wang J, Hu Q, Dai B. Placenta inflammation is closely associated with gestational diabetes mellitus. Am J Transl Res. 2021;13(5):4068–79.
CAS PubMed PubMed Central Google Scholar
Li MX, Sun XM, Cheng WG, Ruan HJ, Liu K, Chen P, Xu HJ, Gao SG, Feng XS, Qi YJ. Using a machine learning approach to identify key prognostic molecules for esophageal squamous cell carcinoma. BMC Cancer. 2021;21(1):906.
Article CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgements

Not applicable.

Funding

This study was supported by Natural Science Foundation of China (Grant Numbers 81973121 and 81373076) and National Key Research and Development Program of China (Grant Number 2016YFC0900603). The authors received financial support for the research and publication of this article.

Author information

Authors and Affiliations

Department of Epidemiology and Health Statistics, School of Public Health, Capital Medical University, and Beijing Municipal Key Laboratory of Clinical Epidemiology, No. 10, Xi Toutiao You Anmenwai, Fengtai District, Beijing, 100069, China
Wenjuan Peng, Yuan Sun & Ling Zhang

Authors

Wenjuan Peng
View author publications
You can also search for this author in PubMed Google Scholar
Yuan Sun
View author publications
You can also search for this author in PubMed Google Scholar
Ling Zhang
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

WJP and LZ participated in the study design. WJP contributed to data collection and analysis. All authors were involved in the writing of the article. WJP, YS and LZ revised the manuscript. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Ling Zhang.

Ethics declarations

Ethics approval and consent to participate

This study complied with the ethical standards of the institutional research committee, the Helsinki Declaration of 1964 and its subsequent modifications, or comparable ethical standards. The study was analyzed and approved by the Ethics Committee of the Capital Medical University.

Consent for publication

Not applicable.

Competing interests

The authors confirm that there are no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Figure S1

. Quality control diagram of GSE666360. The horizontal axis is the scale factor (SF), which refers to the average signal level of all probe in the chip. Each sample has two values, the first number represents the detection rate, which refers to the number of probes with signal divided by the number of all probes in the chip; the second number represents the background noise, which refers to the average signal level of all mismatch probes. The hollow triangle is actin3/actin5, and is marked red when the value greater than 3, marked blue when the value less than 3. The hollow circle is gapdh3/gapdh5, and is marked red when the value greater than 1.25, marked blue when the value less than 1.25. Solid circle with wire refers to SF of sample, and the sample (GSM1620893) marked with "bioB" is unqualified and further excluded.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Peng, W., Sun, Y. & Zhang, L. Construction of genetic classification model for coronary atherosclerosis heart disease using three machine learning methods. BMC Cardiovasc Disord 22, 42 (2022). https://doi.org/10.1186/s12872-022-02481-4

Download citation

Received: 05 November 2021
Accepted: 24 January 2022
Published: 12 February 2022
DOI: https://doi.org/10.1186/s12872-022-02481-4

Construction of genetic classification model for coronary atherosclerosis heart disease using three machine learning methods

Abstract

Background

Methods

Results

Conclusions

Similar content being viewed by others

Discovering biomarkers associated and predicting cardiovascular disease with high accuracy using a novel nexus of machine learning techniques for precision medicine

Identifying potential signatures for atherosclerosis in the context of predictive, preventive, and personalized medicine using integrative bioinformatics approaches and machine-learning strategies

Identification of Co-diagnostic Genes for Heart Failure and Hepatocellular Carcinoma Through WGCNA and Machine Learning Algorithms

Background

Materials

Data collection, quality evaluation and preprocessing

Batch effect removal and differential expression analysis

Weighted gene co-expression network analysis

Selection of optimal feature gene sets

Construction and validation of genetic classification models

Results

Data information

Integration of three datasets and identification of differentially expressed genes

Hub genes identification using WGCNA

Construction of genetic classification models based on optimal feature genes

Validation and evaluation of classifiers performance

Discussion

Conclusions

Availability of data and materials

Abbreviations

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher's Note

Supplementary Information

Additional file 1: Figure S1

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation