Background

As the number one cause of mortality, coronary artery disease (CAD) contributes to approximately 17 million deaths every year across the world [1], and almost 700,000 deaths due to CAD are recorded annually in China [2]. CAD is a complicated, multifactorial condition that results from a variety of environmental exposures and genetic variation, with risk factors including gender, age, diabetes, hypertension, dyslipidemia, cigarette smoking, and family history [3,4,5,6,7]. Reaching a consensus on the early prevention of CAD would therefore be of considerable value.

As a practical way to identify gene expression changes, microarray analysis may be a useful aid in the early diagnosis of CAD [8]. However, numerous previous studies have revealed that microarray results are often not reproducible and are sensitive to perturbations in the data [9, 10]. Even worse, when more than 50,000 probes on a microarray are used to analyze only hundreds of samples, the potential predictors identified tend to be inaccurate.

Given these limitations, an integrated analysis may be used to increase the completeness and reliability of the conclusions. In this way, we sought a more precise method of detecting differentially expressed genes and aimed to identify their potential biological functions. The Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) [11] is a free, public repository of high-throughput microarray and next-generation sequencing functional genomic data sets deposited by researchers worldwide. The information in this database can be freely downloaded in many formats. In the current study, we analyzed two public microarray datasets from the GEO repository, using combined P values, to determine the genes that were differentially expressed in individuals suffering from CAD compared with controls. Based on the features of these genes, we attempted to suggest candidate biomarkers for timely prevention and therapy.

Materials and methods

Affymetrix microarray data

Two gene expression profile data sets were used. GSE19339 was generated on the GPL570 Affymetrix Human Genome U133 Plus 2.0 array, and GSE56885 was generated on the GPL15207 Affymetrix Human Genome array. The present study included 14 samples, comprising 8 CAD samples and 6 normal/healthy samples. However, sample-level details were not provided. In the current research, all of these samples were selected. The CEL files were transformed into an expression value matrix using the affy package in R, with the RMA method used for normalization [12]. Afterwards, Bioconductor in R was used to map the probe data to genes [13]. For any gene that corresponded to multiple probes, the mean expression value across those probes was used.
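The probe-to-gene step described above can be illustrated with a minimal sketch. It is written in Python rather than R purely for illustration; the function name, probe identifiers and expression values are hypothetical and are not taken from the study.

```python
from collections import defaultdict
from statistics import mean


def collapse_probes_to_genes(probe_expr, probe_to_gene):
    """Average, per sample, the expression of all probes mapping to one gene.

    probe_expr:    {probe_id: [value for each sample]}
    probe_to_gene: {probe_id: gene_symbol}
    Probes without a gene annotation are skipped.
    """
    grouped = defaultdict(list)
    for probe_id, values in probe_expr.items():
        gene = probe_to_gene.get(probe_id)
        if gene is None:
            continue
        grouped[gene].append(values)
    # zip(*rows) walks the samples column by column across a gene's probes.
    return {
        gene: [mean(sample_values) for sample_values in zip(*rows)]
        for gene, rows in grouped.items()
    }


# Hypothetical toy data: two probes for gene_x, one for gene_y, one unannotated.
probe_expr = {
    "probe_a": [8.0, 9.0],
    "probe_b": [10.0, 11.0],
    "probe_c": [5.0, 5.5],
    "probe_d": [7.0, 7.0],
}
probe_to_gene = {"probe_a": "gene_x", "probe_b": "gene_x", "probe_c": "gene_y"}

gene_expr = collapse_probes_to_genes(probe_expr, probe_to_gene)
print(gene_expr)
```

Averaging is only one of several possible collapsing rules; it is used here because it is the rule stated in the text.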

Differentially expressed genes (DEGs) analysis

The limma package [14] in R was used to identify DEGs by comparing the CAD samples with the control samples. The threshold values were |log2 fold change| > 2 and adjusted P < 0.05.
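The thresholding step can be sketched as follows. This is not limma itself: it assumes that per-gene log2 fold changes and raw P values are already available, and it assumes Benjamini-Hochberg adjustment, which the text does not specify. All gene names and numbers are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(genes, log2_fc, p_values, fc_cutoff=2.0, alpha=0.05):
    """Split genes passing |log2 FC| > fc_cutoff and adjusted P < alpha."""
    adjusted = benjamini_hochberg(p_values)
    up, down = [], []
    for gene, lfc, q in zip(genes, log2_fc, adjusted):
        if q < alpha and abs(lfc) > fc_cutoff:
            (up if lfc > 0 else down).append(gene)
    return up, down


# Hypothetical toy input (assumption: not data from the study).
genes = ["g1", "g2", "g3", "g4"]
log2_fc = [2.5, -3.0, 1.0, 2.2]
p_values = [0.001, 0.004, 0.0005, 0.5]

up, down = select_degs(genes, log2_fc, p_values)
print("up:", up, "down:", down)
```

In this toy input, g3 is significant but fails the fold-change cutoff, and g4 passes the fold-change cutoff but is not significant, so neither is retained.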

GO and pathway enrichment analyses

Gene Ontology (GO) analysis is widely used in studies of large-scale transcriptomic or genomic data [15]. The Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database holds information on the networks among genes or molecules and is used in genetic studies [16]. The Database for Annotation, Visualization and Integrated Discovery (DAVID) provides analytical tools and integrated biological information for interpreting the functions of large protein or gene lists [17]. Here, DAVID (version 6.8) and the R package GOplot [18] were used for GO function and KEGG pathway enrichment analysis of the DEGs, based on the KEGG (http://www.genome.ad.jp/kegg/) and GO (http://www.geneontology.org) databases, with an adjusted P-value (q-value) of < 0.05 as the threshold.

Construction of PPI interaction network

The Search Tool for the Retrieval of Interacting Genes (STRING V10.5) [19] database (https://string-db.org/) is a useful platform for studying experimentally determined and predicted protein interactions. STRING predictions are based on text mining, databases, co-expression, experiments, co-occurrence, gene fusion and neighborhood; the resulting network was visualized using Cytoscape (version 3.60) [20]. A combined score is assigned to each interacting protein pair in the database. To examine the key genes in the network and the potential PPI correlations, the DEGs were mapped onto these data, and the cut-off value was set as a combined score of > 0.9 [21]. The importance of each protein node in the network was described by its degree. All three procedures were carried out with the R package igraph version 1.0.1 [22].
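The score cut-off and the degree measure can be illustrated with a small sketch. It does not use STRING, Cytoscape or igraph; it assumes an edge list of (protein, protein, combined score) tuples with scores on a 0 to 1 scale, and the protein labels and scores below are hypothetical.

```python
from collections import Counter


def node_degrees(interactions, min_score=0.9):
    """Count distinct interaction partners per protein.

    Only pairs with a combined score strictly above min_score are kept.
    Edges are undirected, so (A, B) and (B, A) count once; self-loops
    are ignored.
    """
    edges = set()
    for protein_a, protein_b, score in interactions:
        if protein_a != protein_b and score > min_score:
            edges.add(frozenset((protein_a, protein_b)))
    degree = Counter()
    for edge in edges:
        for protein in edge:
            degree[protein] += 1
    return degree


# Hypothetical toy edge list (assumption: not STRING output).
interactions = [
    ("A", "B", 0.95),
    ("B", "A", 0.95),  # duplicate of the first pair
    ("A", "C", 0.93),
    ("A", "D", 0.99),
    ("D", "C", 0.40),  # below the cut-off, dropped
    ("B", "C", 0.91),
]

degree = node_degrees(interactions)
hub, hub_degree = degree.most_common(1)[0]
print(hub, hub_degree)
```

Nodes with the highest degree in the filtered network are the candidates usually referred to as hub genes.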

Module analysis

Modules within protein networks contain useful information regarding the biological functions of biological molecules. The most significant clustering module was determined using the Cytoscape app Molecular Complex Detection (MCODE) [23]. Subsequently, the DAVID online tool was used to analyze the KEGG pathways enriched among the DEGs in the modules. Count ≥ 2 and EASE ≤ 0.05 were chosen as the cut-off values, and an MCODE score > 6 was used as the threshold for the subsequent analysis.

Study population and follow-up

A total of 206 outpatients presenting with chest pain were recruited at the First Affiliated Hospital, Guangxi Medical University from Jan. 1, 2015 to Dec. 31, 2016. Patients were followed up monthly by telephone interview, and the interviewer was blinded to the genetic status of the patient. The endpoints of the follow-up were a diagnosis of acute coronary symptoms and cardiac death, such as sudden cardiac death or death due to pump failure. Coronary angiography was performed for patients with suspected CAD or with other unrelated conditions that required angiographic examination. Two experienced interventional cardiologists examined the coronary angiograms. CAD was confirmed by the presence of coronary stenosis (≥ 50%) in one or more of the three main coronary arteries or their major branches (branch diameter ≥ 2 mm). To analyze the influence of the hub genes on the prognosis of CAD, we divided the individuals into two groups after coronary angiography: CAD patients and controls. Patients with a previous CAD attack, type I diabetes mellitus or congenital heart disease were excluded [24]. The absence of CAD in healthy individuals was confirmed through clinical examination, medical history and questionnaires. The medical history and general information of all subjects were obtained with a standard questionnaire. All protocols followed the guidelines of the Declaration of Helsinki of 1975, as revised in 2008 (http://www.wma.net/en/30publications/10policies/b3/). Informed consent was obtained from all subjects involved in the genetic analysis and epidemiologic investigation. Our study was approved by the Ethics Committee of the First Affiliated Hospital, Guangxi Medical University (No: Lunshen-2011-KY-Guoji-001; Mar. 7, 2011) [25]. During the initial examination, clinical information was obtained from the medical records.

Serum hub gene determination

RT-qPCR was used to validate the four significantly dysregulated mRNAs obtained from the microarray results in the 206 participants. Total RNA was isolated from PBMCs extracted from blood samples using TRIzol reagent (Invitrogen). The TransScript® First-Strand cDNA Synthesis SuperMix (Transgen, China) was used to synthesize cDNA according to the manufacturer's protocol. Additional file 1 Table S1 shows the sequences of the gene-specific primers, which were designed by Sangon Biotech (Shanghai, China), together with the reaction conditions. All samples were loaded into 96-well plates and run in triplicate using the LightCycler® 96 and FastStart Essential DNA Green Master (Roche Diagnostics GmbH, Germany). The fit point method was used to calculate the quantification cycles (Cq) with the LightCycler® 96 Software (Version 1.1, provided by Roche). The mRNA levels of the genes were normalized to the expression of GAPDH as a reference. All methods (qPCR normalization, primer design, and sample collection, preparation and storage) were carried out in line with the MIQE guidelines.
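The normalization to GAPDH can be sketched with the commonly used 2^(-ΔΔCq) calculation. This is an assumption: the text states only that mRNA levels were normalized to GAPDH, not which formula was applied. The function name and Cq values below are hypothetical.

```python
from statistics import mean


def relative_expression(target_cq, ref_cq, control_target_cq, control_ref_cq):
    """Relative mRNA level by the 2^(-ddCq) method.

    Each argument is a list of replicate Cq values (e.g. triplicates).
    ref_cq / control_ref_cq are the reference gene (GAPDH in the text).
    The control group is the calibrator, so its own level equals 1.
    Assumes roughly 100% amplification efficiency for both genes.
    """
    delta_sample = mean(target_cq) - mean(ref_cq)
    delta_control = mean(control_target_cq) - mean(control_ref_cq)
    return 2 ** -(delta_sample - delta_control)


# Hypothetical triplicate Cq values (assumption: not measured data).
patient_target = [24.0, 24.1, 23.9]
patient_gapdh = [18.0, 18.0, 18.0]
control_target = [26.0, 26.0, 26.0]
control_gapdh = [18.0, 18.0, 18.0]

fold = relative_expression(
    patient_target, patient_gapdh, control_target, control_gapdh
)
calibrator = relative_expression(
    control_target, control_gapdh, control_target, control_gapdh
)
print(round(fold, 3), calibrator)
```

A target that reaches threshold two cycles earlier than in controls, at equal GAPDH Cq, corresponds to roughly a four-fold higher relative level under these assumptions.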

Table 1 GO for differentially expressed genes

Statistical analyses

SPSS 21.0 (SPSS Inc., Chicago, IL, USA) was used for statistical analysis. Differences in rates between groups were compared using the chi-square test. Continuous data are presented as the mean ± SD. A raw P value of < 0.05 was considered nominally significant. Univariate and multivariate Cox proportional-hazards regression analyses were used to determine the association of the clinical variables and genes with the endpoint of adverse cardiac events. All tests were two-sided.
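The comparison of rates between two groups can be illustrated with a Pearson chi-square test on a 2 × 2 table. This sketch assumes no continuity correction, which may differ from the SPSS output used in the study; the counts are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom, no correction).

    Table layout:
                   exposed   not exposed
        group 1       a           b
        group 2       c           d
    Returns (statistic, two-sided P value).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts (assumption: not data from Table 2).
statistic, p_value = chi_square_2x2(30, 20, 15, 35)
print(round(statistic, 3), round(p_value, 4))
```

With small expected cell counts, an exact test would usually be preferred over this approximation.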

Results

Preprocessing

After preprocessing GSE19339 and GSE56885, we obtained a total of 54,560 expression probes from each gene expression profile. The preprocessed data are shown in Additional file 2 Figure S1. When all of the median values lay on the same horizontal line, the data were considered adequately normalized.

Fig. 1

Heat map of differentially expressed genes. The depth of color reflects the level of differential expression (represented by logFC). a GSE19339; b GSE56885

Identified differentially expressed genes (DEGs)

The heat map of differentially expressed genes is presented in Fig. 1. As shown in Fig. 2, with thresholds of |log2 (fold change)| ≥ 2 and an adjusted P value < 0.05, a total of 402 DEGs were obtained in GSE19339, of which 140 were down-regulated and 262 were up-regulated. In GSE56885, 71 DEGs were found: 10 genes were down-regulated and 61 were up-regulated. Some probes showed extremely high or extremely low expression values; these probes were treated as outliers and removed from further analysis. In cases where many probes corresponded to one gene, we used the average expression value to screen for differential genes. After quality control and removal of the incorrect expression values, a total of 413 DEGs (122 down-regulated and 291 up-regulated) were taken into consideration.

Fig. 2

Identifying the DEGs by a volcano plot. The two vertical lines are the 2-fold change boundaries and the horizontal line is the statistical significance boundary (Adj-P < 0.05). Statistically significant up-regulated genes with a fold change ≥ 2 are marked with red dots, and the down-regulated genes are marked with green dots

Analysis of gene ontology functions and KEGG pathways enrichment of DEGs

In the analysis of GO functions, 256 biological processes, only 1 cellular component, and 21 molecular functions were identified. All of these data are presented in Table 1. As shown in Fig. 3, when the statistical threshold was adjusted to 3, only 34 biological processes and 6 molecular functions were retained for analysis. Among these, several functions were related to angiogenesis. The DAVID tool (version 6.8) was also used for the KEGG pathway analysis of the screened DEGs, and approximately 10 pathways were enriched (Fig. 4). To identify disease-related genes, the KEGG pathways and GO functions were then analyzed together. A total of 24 genes (Fig. 5b) were enriched in 9 biological processes (Fig. 5a) and 7 KEGG pathways. The biological processes included angiogenesis, blood vessel morphogenesis, smooth muscle cell proliferation, positive regulation of angiogenesis, vasculature development, MAPK cascade, regulation of MAP kinase activity, blood circulation and rhythmic process. The KEGG pathways included atherosclerosis, rheumatoid arthritis, fluid shear stress, AGE-RAGE signaling pathway in diabetic complications, tumor necrosis factor (TNF), cytokine-cytokine receptor interaction, interleukin (IL)-17 and NF-kappa B signaling pathway.

Fig. 3

Functional enrichment analysis for the up-regulated DEGs. The y-axis indicates the Adj-P value and the x-axis represents the Z-score. The light green background marks the biological processes, within which the 34 enriched functions are shown. The light blue background marks the molecular functions, within which the 6 enriched functions are shown

Fig. 4

Results of KEGG pathway enrichment analysis for the DEGs

Fig. 5

The enriched gene ontology terms and pathways of the differentially expressed genes (DEGs). a The inner ring shows a bar plot, with the color corresponding to the z-score and the height of the bar representing the significance of the term. The outer ring shows scatter plots of the expression levels (logFC) of the genes in each term. b Ribbons link the genes to their assigned terms. The logFC is represented by the blue-to-red color coding next to the marked genes

Protein-protein interaction (PPI) network construction

Analysis with the STRING database revealed 264 protein pairs and 64 nodes with a combined score > 0.9. The 7 genes with the highest degree, namely interleukin 8 (IL8, degree = 34), interleukin 1 beta (IL1B, degree = 29), C-X-C motif chemokine receptor 4 (CXCR4, degree = 27), intercellular adhesion molecule 1 (ICAM1, degree = 25), Jun proto-oncogene (JUN, degree = 23), C-X-C motif chemokine ligand 12 (CXCL12, degree = 21) and C-C motif chemokine ligand 2 (CCL2, degree = 20), are shown in Fig. 6a.

Fig. 6

The protein-protein interaction analysis of the differentially expressed genes. a Protein-protein interaction network of the selected differentially expressed genes (DEGs). Each edge indicates the interaction between two genes. The degree describes the importance of a protein node in the network; red indicates a high degree and green indicates a low degree. b The significant module identified from the protein-protein interaction network using the molecular complex detection method with a score > 6.0. MCODE score = 7.268

Module analysis

Using the MCODE app in Cytoscape, only one module with a score > 6 was found. As shown in Fig. 6b, the hub nodes ICAM1, IL1B, JUN and CCL2 had the highest degrees in this module.

Association of serum hub gene levels with CAD

Figure 7a shows the relationship among CAD, genes and environmental exposures; the scale represents the specific correlation coefficient. The microarray findings were validated by RT-qPCR. The mRNA levels of IL1B, JUN, ICAM1 and CCL2 were determined in peripheral blood specimens to verify the major conclusions derived from the microarray results. In general, the results of the microarray analysis were consistent with those of the RT-qPCR analysis. However, the RT-qPCR results showed that the expression of IL1B, ICAM1 and CCL2 was higher in CAD patients than in normal controls (Fig. 7b).

Fig. 7

The relationship among CAD, genes and several environmental exposures and validation with RT-qPCR. a The associations among CAD, genes and environmental exposures. Positive correlation is marked with red, and negative correlation is marked with green. The scale represents the specific correlation coefficient. b The expression profile of the hub genes obtained from the microarray data, as verified by RT-qPCR. Total RNA was isolated from PBMCs and subsequently converted to cDNA for RT-qPCR analysis. The mRNA level of each gene in healthy donors was set to 1. DM, Diabetes Mellitus

Demographic and Cox regression analysis in patients with CAD

The demographic and biochemical characteristics of the participants in the two groups are presented in Table 2. Compared with the CAD patients, the controls included more individuals with hypertension and more cigarette smokers. However, age, dyslipidemia, diabetes mellitus, alcohol drinking, height, weight, body mass index (BMI), waist circumference (WC), systolic blood pressure (SBP), diastolic blood pressure (DBP), pulse pressure (PP), serum glucose, total cholesterol (TC), triglyceride (TG), low-density lipoprotein cholesterol (LDL-C) and high-density lipoprotein cholesterol (HDL-C) did not differ between controls and CAD patients. The predictors of CAD were further analyzed by univariate and multivariate Cox proportional hazards regression (Table 3). Following adjustment for various clinicopathological and socioeconomic factors, cigarette smoking [hazard ratio (HR): 1.506; 95% confidence interval (CI), 0.831–2.727; P = 0.020], diabetes mellitus (HR: 3.679; 95% CI, 1.634–8.283; P = 0.002), and the overexpression of serum IL1B (HR: 1.896; 95% CI, 1.435–2.845; P = 0.017) and CCL2 (HR: 1.172; 95% CI, 0.563–2.440; P = 0.032) remained independent predictors of CAD.

Table 2 Comparison of the demographics and lifestyle characteristics and the prevalence of related disease between the two groups
Table 3 Univariate and multivariate Cox hazards regression analysis of CAD

Discussion

With the remarkable improvement in microarray expression data, identifying abnormally expressed genes may help us to detect and treat diseases. However, microarray data are not always reproducible and can be overly sensitive to errors [8]. Given these limitations, a sensible approach is to reduce false positives by combining several datasets with parallel experimental designs. In the current study, we combined two different CAD datasets and analyzed their GO enrichments, KEGG pathways, and PPI networks and modules to identify four significant and reproducible genes (IL1B, ICAM1, JUN and CCL2) that were differentially expressed between patients and controls. However, when these genes were tested in our own CAD samples, we found that serum JUN expression levels were not significantly changed, and only two genes (IL1B and CCL2) were verified by the Cox proportional hazards regression.

CAD is a condition that is associated with several risk factors. The main pathophysiological mechanism of CAD is atherosclerosis [26]. Studies have confirmed that atherosclerosis is a chronic inflammatory disorder [27]. The IL-1 family of cytokines modulates many immunoinflammatory processes. It regulates many biological processes, including lipoprotein metabolism, leukocyte adherence, the thrombogenic response of endothelial cells, endothelial and smooth muscle cell mitogenesis, vascular permeability and extracellular matrix production [28, 29]. It is also involved in plaque formation and rupture via different pathways, which can be summarized as follows: (1) the suppression of endothelial cell proliferation [30]; (2) the modification of the endothelium, which later favors thrombosis [31]; (3) the stimulation of vascular smooth muscle cells via transforming growth factor-β (TGF-β) [32]; and (4) the expression of adhesion molecules by endothelial cells [33]. In addition, measurement in blood samples of ST2 (also known as T1, IL1RL1, or Fit1), a member of the IL-1 receptor family, could serve as a clinical prognostic biomarker for risk stratification of patients suffering from myocardial infarction, heart failure and dyspnea [34, 35]. In our current study, we showed that IL1B participated in four of the main biological processes (Fig. 5b), namely angiogenesis, smooth muscle cell proliferation, positive regulation of angiogenesis and cytokine metabolic process. All of these biological processes give rise to chronic immunoinflammatory pathological changes and finally result in atherosclerosis.

C-C motif chemokine ligand 2 (CCL2) participates in the genesis and progression of atherosclerosis [36,37,38]. The current study showed that CCL2 participated in two of the main biological processes (Fig. 5b), namely angiogenesis and the positive regulation of angiogenesis. These two biological processes also contribute to atherosclerosis. Moreover, CCL2 is considered a risk factor for the promotion of atherosclerosis and for patients with CAD. Numerous studies have revealed that high CCL2 levels in patients with CAD are associated with an increased incidence of adverse cardiac outcomes and an increased risk of long-term mortality [39, 40]. A recent study showed that plasma HDL2 levels decreased as CCL2 levels increased, and that CCL2 was negatively correlated with HDL2 [41]. This may be another mechanism by which CCL2 contributes to CAD.

According to the KEGG analysis, seven pathways contained IL1B and CCL2: rheumatoid arthritis, cytokine-cytokine receptor interaction, the TNF signaling pathway, the IL-17 signaling pathway, the AGE-RAGE signaling pathway in diabetic complications, malaria, and fluid shear stress and atherosclerosis. Shanmugam et al. reported that TNF-α signaling exerted adverse effects on cardiovascular tissues although it ameliorated chronic inflammatory disease [42]. In atherosclerosis, Tuenter et al. found that the presence of intraplaque haemorrhage and calcifications was associated with elevated maximum shear stress [43]. Moreover, RAGE expression in many cell types, including smooth muscle cells, endothelial cells and macrophages, may contribute to the pathogenesis of atherosclerosis and of myocardial dysfunction [44]. These findings suggest that IL1B and CCL2 contribute to atherosclerosis and ultimately to CAD. Furthermore, when we repeated the validation in our own samples, using blood gene expression and/or Cox proportional hazards regression, we obtained the same results, which might increase the credibility of the conclusions.

Conclusions

Two CAD microarray datasets from the GEO series were systematically analyzed in this study. Based on the expression level, GO enrichment, pathway enrichment and protein-protein interaction analyses, four genes (IL1B, ICAM1, JUN and CCL2) were found to be significant, but only two genes (IL1B and CCL2) were replicated in our samples by testing gene expression in blood and/or by Cox proportional hazards regression. The mechanism may involve cytokine-cytokine receptor interaction, the TNF signaling pathway, fluid shear stress and the AGE-RAGE signaling pathway. However, additional experiments are warranted to validate these findings.