Background

Ulcerative colitis (UC) is an inflammatory intestinal disease characterized by easy recurrence and chronic persistence [1]. The lesion site is mainly confined to the large intestine mucosa and submucosa. Mucosal inflammation at the onset site has diffuse distribution and extends to the rectum. The main clinical manifestations are abdominal pain, diarrhea, and mucinous pus bloody stool. It is easy to cause intestinal fibrosis and increase colon cancer risk [2]. Medical circles at home and abroad hope to make achievements in UC treatment, but its etiology is still unclear and its pathogenesis is complex, so it is listed as one of the modern refractory diseases by the World Health Organization [3, 4].

With the development of high-throughput microarray technology, the identification of genomic variation has been promoted, which will help to understand the pathogenesis of potential biomarkers in many diseases [5]. Kang et al. [6] reported that the diagnosis of UC is usually delayed, but the relationship between delayed diagnosis and prognosis of UC has not been widely studied. Biasci et al. [3] reported that genes from the best classifiers are optimized by quantitative polymerase chain reaction (qPCR) and the best qPCR classifier is distinguished using further machine learning, which could evaluate the prognosis of newly diagnosed UC patients. In previous years, research has identified many molecular markers that could help in the early diagnosis of UC. For example, Zhang et al. [7] reported that IL6, PTPRC, CXCL8, IL1B, and MMP9 might be the key genes that could provide vital markers for the early diagnosis and treatment for UC. Zhu et al. [8] found several genes associated with the development of UC, such as MMP1, REG1A, and AQP8. Yan et al. [9] found 11 mutated genes differentially expressed in UC samples, such as APC, APOB, MECP2, NCOR2, and USP48. All these reports suggested that feature genes might play an important role in the diagnosis of UC.

Thus, in this study, the weighted gene coexpression network (WGCNA) was used to distinguish stable modules from four datasets. Then, the protein–protein interaction (PPI) network was constructed through differentially expressed genes (DEGs) in stable modules. A support vector machine (SVM) combined with recursive feature elimination (RFE) was also applied to construct a disease classifier for the disease diagnosis of UC patients.

Methods

Screening of expression profile data

“Ulcerative colitis” and “Human” were used as keywords to search all publicly uploaded expression profile data from the National Center for Biotechnology Information Gene Expression Omnibus database (http://www.ncbi.nlm.nih.gov/geo/). This study contained four datasets: GSE65114 (n = 28; 16 UC and 12 control), GSE36807 (n = 22; 15 UC and 7 control), GSE37283 (n = 20; 15 UC and 5 control), and GSE59071 (n = 108; 97 UC and 11 control). The selection standards of datasets were as follows: [1] the dataset was a gene expression profile, [2] the samples are solid samples of intestinal tissue from UC patients, and [3] the samples contained control samples. Microarray raw data (GPL570, CEL files) from the four datasets were obtained from the Affymetrix platform (Santa Clara, CA, USA), which were processed for background correction (MAS) and quantile normalization using Affy package in R3.4.1 version 1.60.6 (http://www.bioconductor.org/packages/release/bioc/html/affy.html) [10].

Meta-analysis of DEGs

Unlike methods of screening DEGs in previous studies, the meta-synthesis algorithm was adopted to screen DEGs with consistency in multiple datasets. The meta-analysis aimed to cite multiple studies, collect multiple experimental datasets, and screen for reliable genes. However, the four datasets in this study were obtained from different patient samples and experimental detection. There may be different degrees of bias in the dataset; thus, MetaQC (https://cran.r-project.org/web/packages/MetaQC/index.html) was first used to carry out objective quality-control on the datasets combined with principal component analysis (PCA) two-dimensional map and standardized mean rank to evaluate and screen datasets.

DEGs were then screened by MetaDE.ES in the MetaDE package (https://cran.r-project.org/web/packages/MetaDE) [11]. To evaluate gene expression consistency, the heterogeneity of the four datasets was checked through the τ2, Q value, and Qpval values (judgment criteria: when the value of the statistic τ2 is 0), it indicates that each research object is homogeneous and unbiased; the statistic Q obeys the χ2 test with a degree of freedom of k-1; when Qpval > 0.05, it indicates that each research object is homogeneous and unbiased). τ2 = 0 and Qpval > 0.05 were selected as homogeneity test parameters, whereas false discover rate (FDR) < 0.05 was as the threshold for differential gene expression. DEGs with consistency in different groups were detected using the MetaDE method with the cutoff criterion of p < 0.05. FDR was obtained through multiple test corrections. FDR < 0.05 indicated a significant difference. Each individual dataset was calculated separately to express the fold change. Genes were kept consistent by combining with log2FC orientation in each dataset.

Significant gene module based on the WGCNA

In this study, GSE59071 was used as the training dataset and GSE65114, GSE36807, and GSE37283 were used as the validation datasets. WGCNA package version 1.61 (https://cran.r-project.org/web/packages/WGCNA/index.html) [12] in R3.4.1 was applied to select the significant gene module associated with UC in the four datasets. The WGCNA algorithm is implemented according to the steps of defining adjacency function, dividing gene modules, and evaluating module stability. The threshold of gene module partition and screening is that the gene module contains at least 100 genes with a cut height of 0.995. Genes in the important modules were selected as the object for further analysis.

Screened DEGs were mapped to each WGCNA module to calculate the fold enrichment ratio and the p value of target DEGs in each WGCNA module using the hypergeometric algorithm. The formula is f(k,N,M,n) = C(k,M) * C(n-k,N-M) / C(n,N) [13], where N represents all genes referred to the analysis of the WGCNA algorithm, M is the gene number in each module obtained by the WGCNA algorithm, n is the number of obvious DEGs, and k is the DEG number mapped to the corresponding modules. The significant enrichment parameters fold enrichment and enrichment significance p values of the target significant DEGs were calculated in each WGCNA module. The threshold was set as p < 0.05 and fold enrichment > 1. Finally, DEGs markedly enriched in the stable WGCNA module obtained from the screening were compared to consistent and significant DEGs selected in the previous step, and the intersection part was taken. Enrichment analysis of Gene Ontology (GO) biological processes (BPs) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment were established based on DAVID version 6.8 (https://david.ncifcrf.gov/) [14, 15] with a cutoff of p < 0.05.

PPI network construction

STRING version 10.5 (https://string-db.org/) [16] was applied to search for the interaction between gene products and proteins used for constructing the PPI network. The gene interaction network was visualized through Cytoscape version 3.6.1 (http://www.cytoscape.org/) [17]. The KEGG pathways of DEGs that constitute the interaction network were analyzed based on DAVID.

Screening of key genes related to UC

All KEGG pathways associated with UC were searched from the Comparative Toxicogenomics Database 2019 update (http://ctd.mdibl.org/) [18], which were compared to the pathways in the PPI network. A PPI network of KEGG pathways directly related to UC and screened key genes involved in the UC pathways was constructed.

Construction of the sample-type recognition classifier

In the GSE59071 training dataset, the RFE, R3.4.1 caret package version 6.0–76 (https://cran.r-project.org/web/packages/caret) [19, 20], was applied to screen optimized feature gene combinations. The gene combination with the highest sample recognition accuracy in 100-fold cross-validation was selected as a feature gene combination [21]. The SVM [21] function in the e1071 package of R3.4.1 version 1.6–8 (https://cran.r-project.org/web/packages/e1071) was used to establish the SVM classifier for further analysis. The effectiveness of the SVM classifier was evaluated in the training dataset and three validation datasets. pROC package version 1.12.1 (https://cran.r-project.org/web/packages/pROC/index.html) [22] in R3.4.1 was applied to obtain sensitivity, specificity [23], and area under the curve (AUC) [24].

Results

Screening of DEGs with significant consistency by meta-analysis

First, we perform data standardization on each of the 4 datasets, which was shown in Additional file 1: Table 1. A total of 16,337 genes were obtained from the four datasets. MetaQC quality-control inspection was conducted on the four datasets (Table 1). The PCA principal component plan is shown in Fig. 1A, indicating that the four datasets are evenly distributed. The sum of PC1 and PC2 was 80.90%, and these indexes met the data quality testing standards that should be included in subsequent analysis. Subsequently, 2009 DEGs with obvious consistency were screened from the four datasets for the next analysis (Additional file 2: Table 2). In Fig. 1B, DEGs screened from four different datasets were consistent in the degree of difference and direction of maladjustment.

Table 1 The information in GSE65114, GSE36807, GSE37283 and GSE59071 datasets
Fig. 1
figure 1

(A) PCA principal component plan. MetaQC quality-control chart of GSE65114, GSE36807, GSE37283, and GSE59071 datasets. Horizontal and vertical axes indicate the first and second principal components in the PCA, respectively, whereas 1 to 4 represent the four corresponding datasets. (B) Heatmap of DEGs with significant consistency

Table 2 Preservation information of nine co-expression modules in GSE65114, GSE36807, GSE37283 and GSE59071 datasets

Identification of vital WGCNA modules

This study aimed to distinguish UC-related modules through the WGCNA. GSE59071 was the training dataset, whereas GSE65114, GSE36807, and GSE37283 were the validation datasets. Both training and validation datasets had a high positive correlation, and the correlation significance p values were < 0.05, a very significant positive correlation, indicating that the data are comparable (Fig. 2). Gene correlation coefficients were calculated from the four datasets. The correlation coefficients between GSE59071 and GSE65114, GSE36807, and GSE37283 were 0.81, 0.52, and 0.52, with p < 1e-200 among the three datasets. These data expressed good homogeneity of DEGs in the four datasets (Additional file 3: Table 3).

Fig. 2
figure 2

(A) Power selection graph of the adjacency matrix weight parameter. The horizontal axis indicates the weight parameter power; the vertical axis indicates the square of the correlation coefficient between log(k) and log[p(k)] in the corresponding network. The red line represents the standard line where the square of the correlation coefficient reaches 0.9. (B) Schematic diagram of the average gene connectivity under different power parameters. The red line represents the value of the average connectivity of network nodes under the power parameter value of the adjacency matrix weight parameter in (A). (C) Correlation graph of the expression levels on the training and validation datasets. (D) Training and validation dataset node connection correlation graph

Table 3 GO functional and KEGG pathway enrichment analysis of 718 overlapping genes

Twelve modules were screened through the WGCNA with a cut height of 0.995, and the amount of genes in each gene network was 25 as a criterion (Fig. 3A). Genes were colored based on the module color in the training dataset (Fig. 3).

Fig. 3
figure 3

Screening of gene modules in the (A) training dataset of GSE59071 and the validation datasets of (B) GSE65114, (C) GSE36807, and (D) GSE37283

The preservation Z score analysis was used to evaluate the stabilization of the 12 modules (Table 2). The Z scores of nine modules (black, blue, brown, green, gray, purple, turquoise, red, and yellow) were observed to be > 5 with p < 0.05. This result expressed that the nine modules were stable. Correlation analysis was performed between each stable module and UC characterization, as shown in Fig. 4. For example, black, gray, and yellow modules were positively related to UC, whereas turquoise, green, and yellow modules were negatively correlated with UC.

Fig. 4
figure 4

Correlation analysis performed between each stable module and UC characterization

A total of 1066 overlapping genes were screened through the comprehensive analysis of MetaDE and WGCNA (Fig. 5A; Table 2). Moreover, 78, 116, 115, 100, 145, and 164 overlapping genes from black, blue, brown, red, turquoise, and yellow modules were screened, respectively, whose fold enrichment were all > 1 and their p values were < 0.05 (Fig. 5B, Additional file 4: Table 4). Sixteen GO BPs and 12 KEGG pathways were screened, such as immune response, cytokine-cytokine receptor interaction, cell adhesion molecules (CAMs), and so on (Table 3).

Fig. 5
figure 5

Comprehensive analysis of MetaDE and WGCNA. (A) The Venn diagram was compared based on the overlapping genes and the screening results of the MetaDE algorithm. (B) Based on the proportional pie chart of the gene number in the WGCNA algorithm stability module and the intersection genes screened by MetaDE in different color modules, each color corresponds to different color modules in the WGCNA, and the number indicates the gene number in the color module

Table 4 The significantly associated KEGG pathways with the feature genes obtained by machine learning

Construction of the PPI network

The PPI network was established based on 718 overlapping genes. A total of 809 paired PPI interactions were obtained in this PPI network (Additional file 5: table 5). The network contained 329 nodes and 809 connection edges, as shown in Fig. 6. The KEGG pathway enrichment analysis was conducted on gene nodes that constitute the interaction network, and nine pathways with significant enrichment correlation were obtained, such as cytokine-cytokine receptor interaction, CAMs, leukocyte transendothelial migration, and so on (Table 4).

Fig. 6
figure 6

Construction of the PPI network. Triangles and inverted triangles represent the upregulated and downregulated genes in the tissues of patients with the disease, respectively. The color of the node indicates the corresponding color of the gene from the WGCNA module

Screening of important genes related to UC

A total of 157 KEGG pathways related to UC were selected by searching the UC database (Additional file 6: table 6). Nine overlapping pathways were obtained compared to KEGG pathways related to PPI interactions (Table 4), such as cytokine-cytokine receptor interaction, CAMs, leukocyte transendothelial migration, chemokine signaling pathway, extracellular matrix (ECM)-receptor interaction, T-cell receptor signaling pathway, Jak-STAT signaling pathway, natural killer cell-mediated cytotoxicity, and pathways in cancer. These KEGG pathways involved many genes involved in cytokine-cytokine receptor interaction (CXCL1, CCR2, IL1B, and IL1A), ECM-receptor interaction (COL4A2, COL4A1, COL6A3, COL3A1, and COL1A2), pathways in cancer (STAT5A and SP11), and leukocyte transendothelial migration and chemokine signaling pathway (ITK).

Construction of the sample-type recognition classifier

A sample-type classifier was constructed based on 84 gene expression levels in the pathway network constructed in the GSE59071 training dataset. From the RFE analysis, when the number of genes is 41, it has the highest accuracy of 0.965, which was used to establish an SVM classifier, such as CXCL1, CCR2, IL1B, IL1A, COL4A2, COL4A1, COL6A3, COL3A1, COL1A2, STAT5A, SP11, and ITK (Fig. 7, Additional file 7: table 7). Receiver operating characteristic (ROC) curves were applied to verify the efficacy of the SVM classifier (Fig. 8). The AUC was 0.999 in the GSE59071 dataset, whereas the AUC was 0.886, 0.790, and 0.819 in the validation dataset (GSE65114, GSE37283, and GSE36807, respectively).

Fig. 7
figure 7

Feature genes were screened via RFE analysis

Fig. 8
figure 8

Binary scatter diagram of SVM for GSE59071 (A), GSE65114 (B), GSE37283 (C), and GSE36807 (D). Blue and red dots and squares represent the control and UC samples, respectively. The ROC curve of the SVM results of each dataset is shown in the figure below

Discussion

In this study, 12 highly preserved modules were screened through the WGCNA. A PPI network was also constructed based on 718 overlapping genes. Besides, an SVM classifier combined with RFE was applied to explore the characteristic genes and pathways. From the RFE analysis, when the number of genes is 41, it has the highest accuracy of 0.965. The AUC was 0.999 in the training dataset, whereas the AUC was 0.886, 0.790, and 0.819 in the validation dataset (GSE65114, GSE37283, and GSE36807, respectively).

SVM is one of the most accurate methods among all well-known data mining algorithms. It is a two-class classification algorithm that can support linear and nonlinear classification. In this study, an SVM classifier was constructed to identify UC patients. Previous studies have reported that SVM could identify UC patients. For example, Ding et al. [24] used the SVM to classify healthy people or those with inactive colitis with a sensitivity of 83.5% and 97.1%, respectively. Watanabe et al. [25] used the SVM to evaluate the progress of UC-related neoplasms with an accuracy of 86.8% and 98.1%, respectively. Compared to previous studies, the SVM classifier constructed has better performance and potential applications.

An SVM classifier was established based on 41 genes involved in cytokine-cytokine receptor interaction (CXCL1, CCR2, IL1B, and IL1A), ECM-receptor interaction (COL4A2, COL4A1, COL6A3, COL3A1, and COL1A2), pathways in cancer (STAT5A and SP11), and leukocyte transendothelial migration and chemokine signaling pathway (ITK). Previous reports have indicated the vital role of gene-environment interaction in UC-related disease. Boshagh et al. [26] found that CXCL1 is involved in the pathogenesis of UC. CXCL1 may be used as a therapeutic target for UC, although more studies using human samples are required. CCR2, a chemokine receptor, may influence the body by maintaining and expanding chronic inflammation due to the timely removal or neutralization of invading agents. Pei et al. [27] found that PC3-secreted microprotein chemoattracted Ly6C monocytes in a CCR2-dependent manner by in situ chemotaxis and adoptive transfer experiments, which is an important molecule in UC. Previous studies have distinguished the interleukin-1 receptor antagonist gene allele 2 (IL-1RN*2) as a biomarker of susceptibility in UC [28]. Carter et al. [29] reported that IL-1RN*2 was related to decreased IL-1ra protein and IL-1RN mRNA levels in the colonic mucosa by determining the IL1A and IL1B genotypes. A previous study reported that somatic mutations and altered pathways are one of the reasons why UC turns into cancer [30]. Besides, Yan et al. [9] found that COL6A3 referred to apoptosis, and the phosphatidylinositol 3-kinase/Akt pathway is associated with nonsilent recurrent somatic mutations in UC. COL3A1 is a profibrogenic extracellular matrix gene. Wu [31] also indicated that COL3A1 was upregulated at the active/chronic inflammatory stages. Genetic variants in the region are associated with UC. Besides, Stadnicki et al. [32] found that ITK is significantly decreased in UC compared to noninflammatory controls. This report suggested that the release of ITK during inflammation plays a role in UC. Although there are few studies focused on the association between COL4A2, COL4A1, COL1A2, and SP11 and UC, this finding will provide a theoretical basis for future research on therapeutic targets for UC. This work expressed that CXCL1, CCR2, IL1B, IL1A, COL4A2, COL4A1, COL6A3, COL3A1, COL1A2, STAT5A, SP11, and ITK may be potential markers for UC.

The surgical management of UC remains a difficult challenge, depending on the patient’s status (whether urgent, emergent, or elective) [33]. However, the definition of the best timing and procedure for each patient is the key for the management of UC patients [34]. Thus, the finding of novel biomarkers is important for managing the time and procedure for UC patients. Lai et al. [35] indicated that six hub genes, including CXCR2 and CXCR1, were regarded as potential biomarkers for the classification of UC. Similarly, this study also found 11 key biomarkers related to UC that might be helpful to determine the timing and procedure for UC patients. However, verification of these hub genes would need further experiments that involve UC patient samples.

Conclusions

An SVM classifier based on feature genes could accurately identify healthy people or UC patients. This study may provide new insights into the molecular mechanism of UC.