Introduction

Colorectal cancer (CRC) is the third most frequent cancer to be diagnosed in both females and males in the USA. Approximately 41% of CRC cases involve the proximal colon, whereas 22% and 28% involve the distal colon and rectum, respectively1. By 2030, there would be a 90% increase in the incidence rate of CRC in the USA1. People are keen to look for new treatment approaches in such a difficult environment2. CRC patients usually show no conventional clinical symptoms or just non-specific indicators in the early stage, which results in a low incidence of early diagnosis even though early diagnosis can greatly improve the prognosis3. Since it can be detected and effectively treated, CRC is regarded to be almost the ideal cancer for screening as it can be prevented from developing and causing both mortality and morbidity during its clinical course4.

An in-silico system biology approach called weighted gene co-expression network analysis (WGCNA) is used to examine gene expression in a complicated network of regulatory genes. This R-based tool uses genetic correlations to find modules that are highly connected. As a result, it is useful for discovering new cancer diagnostic and prognostic biomarkers5,6,7,8. Guo et al. found ten hub genes that could serve as viable biomarkers for clinical diagnosis and are associated with the progression of CRC9. In a separate study conducted by Cao et al., the hub genes TDRD5 and GPC1 were identified for prognosis prediction in CRC using WGCNA10. Using WGCNA, Lin et al. discovered that the complement and coagulation cascade, as well as the focal adhesion pathway, play a crucial role in the development and progression of CRC patients with liver metastasis. Additionally, they identified FGG, KNG1, CAV1, and SPP1 as potential metastatic markers that could aid in the early diagnosis of CRC11.

In the current study, the co-expression network of DEGs connected to CRC and their target genes was built using the WGCNA algorithm. This research will aid in the identification of viable biomarkers for CRC and further comprehension of the molecular mechanisms behind this cancer.

Methods

Microarray data acquisition

The following criteria were used to select all of the CRC datasets from the Gene Expression Omnibus (GEO): (1) microarray-based mRNA expression profiles could be accessible; and (2) patients with CRC and healthy controls were estimated. Consequently, a total of three microarray datasets (GSE141174 [Illumina humanRef-8 v2.0 expression beadchip], GSE184093 [Agilent-067406 Human CBC lncRNA + mRNA microarray V4.0 (Probe name version)] and GSE206800 [Clariom_D_Human] Affymetrix Human Clariom D Assay [transcript (gene) version]) were included in this study. We excluded lymph node samples from dataset GSE141174 and metastasis samples from dataset GSE206800. In this analysis, a total of 39 samples (16 controls and 23 cases) were examined details are provided in Table 1.

Table 1 Details of included datasets.

Microarray data analysis and identification of differentially expressed mRNAs

Each of the microarray datasets were normalized using the normalizeQuantiles function in the preprocessCore package (Version 1.58) and then merged. Considering that the Agilent and Affymertix platform datasets for this study were merged, the batch effect and technical variation were eliminated using ComBat function in sva package (version 3.44.0)12.

Weighted correlation network analysis (WGCNA)

A scale-free network based on gene expression profiles was built using a systematic biological technique known as WGCNA. The R package WGCNA (version 1.72-1)5 was applied to create a weighted correlation network utilizing the mRNA expression profiles in the merged dataset in order to discover clinically meaningful modules of CRC. In summary, using the Pearson Correlation Coefficient test, the gene expression matrices were turned into matrices representing the similarities of paired mRNAs and then converted to adjacency matrices where the soft-threshold (power value) was applied to accentuate significant connections between genes and ignore weak correlations in the adjacency matrix. To represent the intensity of the connection between the genes, the adjacency matrix was then transformed into a topological overlap measure (TOM). Genes were analyzed via hierarchical clustering using TOM as an input, and network modules were found using the DynamicTreeCut function. Modules with high similarity scores were merged using a threshold value.

Construction of module-trait relationships and identification of modules hub genes

Module Eigengene (ME) was used to describe expression profiles of each module as the eigenvector related to the first principal component of the expression matrix in order to discover modules that were strongly linked to the clinical variables that were being examined. Also, the correlations between specific genes and the CRC trait were assessed using the estimates of gene significance (GS). Additionally, the ME correlation and the gene expression profile for each module were defined as module membership (MM). The most significant (central) components of the modules can be said to be closely connected to the trait if the GS and MM have a substantial correlation. Lastly, we chose the GS and MM genes that showed a correlation of 0.7 or higher, and we considered the identical genes between the two groups as the most important genes.

Protein–protein interaction network construction and hub clusters identification

The Search Tool for the Retrieval of Interacting Genes (STRING, https://string-db.org/) database was used to build the protein–protein interaction (PPI) network of the genes with highest GS and MM. Confidence score > 0.7 was set as significant. The hub clusters were chosen, and functional annotation was carried out, using the Cytoscape software plug-in molecular complex detection (MCODE)13.

Gene ontology (GO) and KEGG pathway enrichment analysis of hub clusters

The biological process (BP), molecular function (MF), cellular component (CC), and KEGG pathway14,15,16 enrichment analysis of the mRNAs in the hub clusters were obtained utilizing ClusterProfiler R package (version 4.4.4)17. A criterion of an adjusted p value of 0.05 or less was established for the functional category.

Identification of hub genes in PPI network using multiple centralities

By using maximal clique centrality (MCC) algorithm via cytohubba18 plugin in cytoscape, we discovered hub genes in the PPI network. Finally, the intersection of genes between hub genes and DEGs were identified as final hub genes.

MiRNA- and TF-final hub genes regulatory networks

Using the NetworkAnalyst database19, connections between the final hub genes, transcription factors (TFs), and microRNAs (miRNAs) were established. The greatest degree in the networks was then found for each TF and miRNA.

The final hub genes' genetic alterations in CRC patients

Using the cBioPortal for Cancer Genomics20, we examined the genetic changes of the final hub genes in CRC patients. We selected CRC datasets from the TCGA, which comprise 1510 CRC samples.

Verification of final hub genes via expression values

The UALCAN21 and GSCALite databases22, a Web server for Gene Set Cancer Analysis, were used to confirm the mRNA expression patterns of the final hub genes in colorectal adenocarcinoma (COAD) and normal samples. Immunohistochemistry (IHC) from the Human Protein Atlas (HPA)23 was used to compare the protein expression of the final hub genes between COAD and normal tissues.

Results

Microarray data processing, integrative meta-analysis and assessment of data quality

The role of mRNAs in CRC was examined utilizing three expression datasets that included samples from CRC and healthy control individuals. Using the normalizeQuantiles function in the preprocessCore package, quantile normalization was used to independently normalize each dataset. Then, gene symbols were used to combine three datasets. Next, the ComBat function from the R Package Surrogate Variable Analysis (SVA) was employed to rule out batch effects (non-biological differences). The raw data and normalized data following batch effect removal boxplots are displayed in Figure S1. These boxplots show a consistent level of quality in the expression data. Also, the boxplot of the preprocessed data showed satisfactory normalization. The PCA plot of all samples in the 2D plane included by their first two main components (PC1 and PC2) is shown in Figure S2 (PC1 and PC2). After batch effect reduction, the samples exhibited a fair dispersion, as seen by this plot.

Identification of DEmRNAs for CRC

We analyzed the expression matrix between CRC and normal samples using the Limma package (version 3.52.3)24 and discovered 792 DEmRNAs, comprising 466 downregulated and 326 upregulated DEmRNAs (Table S1). Table 2 displays the top 10 DEmRNAs. The volcano plot was made using the EnhancedVolcano package (https://github.com/kevinblighe/EnhancedVolcano) in R, allowing us to examine the changes in mRNA expression between CRC and normal samples (Fig. 1). Also, we utilized the R pheatmap package (version 1.0.12) (https://cran.r-project.org/package=pheatmap) to show the expression of the top up- and down-regulated DEmRNAs (Fig. 2A,B).

Table 2 The top 10 up- and downregulated DEmRNAs between CRC and healthy control samples.
Figure 1
figure 1

Volcano plot of DEGs; the horizontal axis represents the value of LogFC, while the vertical axis represents the mean value of -log 10 (false discovery rate). The significant dysregulated genes that fit the criteria are shown as red dots.

Figure 2
figure 2

Differentially expressed mRNAs heatmaps. (A) Heatmap of upregulated DEmRNAs in CRC samples related to normal samples, (B) heatmap of downregulated DEmRNAs in CRC samples related to normal samples.

Gene Ontology (GO) and KEGG pathway enrichment analysis of DEmRNAs

GO and KEGG pathway analysis were carried out to learn more about the biological role of the DEmRNAs. These findings revealed that the genes were mostly enriched in the BPs that were involved in the regulation of hormone levels, extracellular matrix organization, and extracellular structure organization (Fig. 3A). According the KEGG pathway, the AGE-RAGE signaling pathway in diabetic complications, protein digestion and absorption, and PPAR signaling pathway were the three main categories where the genes were enriched (Fig. 3B).

Figure 3
figure 3

Results of DEmRNA analysis in Kyoto Encyclopedia of Genes and Genomes (KEGG) and the Gene ontology (GO). (A) Results of the investigation of DEmRNAs GO enrichment. The size of the bars corresponds to the gene number, while the color denotes the p value. (B) Results of DEmRNA KEGG pathway analysis. The size of the bars corresponds to the number of genes, while the color denotes the p value.

Weighted correlation network analysis (WGCNA) and functional annotation

A total of 39 samples from merged dataset were used for WGCNA using WGCNA package in R. Then, we used principal component analysis and samples hierarchical clustering to eliminate outlier samples (Fig. 4A,B). We eliminated 4 samples because of clustering and carried on with the analysis using the remaining data. Soft threshold value for the dataset was chosen with a cutoff R2 value of 0.9. As the network follows the power law distribution at this value (Fig. 4C), it is more similar to the condition of a true biological network. In order to merge the similar modules, the minimum module size was set to 30 with a 025 height cut (Fig. 4D). Figure 4E depicts every gene co-expression module. Twenty co-expressed gene modules were identified, with gray modules denoting genes that were not co-expressed. Each module was given a color name. With 24 genes, the royalblue module is the smallest. The turquoise module, which has 2434 genes, is also the biggest module at the moment. In addition, the 7659 genes that are not associated with any module are represented by the background's grey color (Table 3). For the enrichment analysis, we selected modules with correlation coefficients of 0.7 or above (Fig. 4F). As a result, turquoise (0.86), tan (0.72) and green (0.76) modules were selected for further analysis. By evaluating the correlation between module eigengenes (MEs) and traits and the correlation between gene expression profiles and traits, respectively, we measured the membership module (MM) of these three modules and gene significance (GS) (Table S2). The similar genes GS and MM were lastly regarded as the most significant genes (Fig. 4G).

Figure 4
figure 4figure 4

Weighted correlation network analysis. (A) Hierarchical clustering of samples to detect the outlier samples. (B) Principal component analysis to detect the outlier samples. (C) Scale independence (left) and mean connectivity (right). (D) Using a hierarchical clustering of genes based on the 1-TOM matrix, the co-expression network modules Cluster dendrogram is arranged in a certain order. Various modules are represented by various colors. (E) Hierarchical clustering of modules (above) and heatmap of trait and modules (below). (F) Module-trait relationships. Each column is a clinical feature (CRC and normal), and each row denotes a color module. The correlation coefficient is displayed in each cell. (G) intersection of GS and MM.

Table 3 Identified modules and their gene numbers.

Key modules PPI network construction, identification of top clusters and hub genes

The PPI network of most significant genes (Fig. 5A) with 294 nodes and 1416 edges that was created from STRING was put into the MCODE plugin of Cytoscape 3.9 in order to identify the hub clusters (MCODE score > 10) (Fig. 5B). We identified two clusters (Fig. 5C). The first cluster with a score of 16,889 contained UTP18, SDAD1, PWP1, POLR1E, RBM28, RSL1D1, NOLC1, WDR36, NIP7, NAT10, NOB1, UTP14A, DKC1, POLR1A, GTPBP4, ABCE1, WDR3 and TWISTNB and the second cluster with a score of 12,267 contained NUP188, NUP37, NUP85, NUP155, SNRPG, RPL31, RPS18, CCT4, RPL23A, RPS21, GNB2L1, RPL35A, RPL37A, RPL39, RPS15A, EFTUD2, LSM2, LSM5, SF3A3, PPIH, LSM7, SF3B3, SNRPD1, SNRPB2, RPL27, RPL12, SNRPD2, GEMIN6, NUP133, SNRPF and NUP205. Next, we performed GO and KEGG enrichment analysis of two hub clusters using clusterProfiler package in R (Fig. 5D,E).

Figure 5
figure 5figure 5

PPI network of most significant genes, two cluster modules extracted by MCODE and GO and KEGG pathway enrichment. (A) There were 294 nodes and 1416 edges in the protein–protein interaction network created by the most significant genes. A protein is represented by each node, and a protein–protein interaction is represented by each edge. Also, the size of the nodes corresponds to the degree centrality, while the color denotes the neighborhood connectivity centrality. (B) Display of clusters in the PPI network. (C) Two clusters extracted by MCODE. (D) GO enrichment of two clusters. (E) KEGG pathway enrichment of two clusters.

PPI network hub genes detection

We identified hub genes of PPI network using cytohubba plugin. In this case, we selected 25 hub genes with highest maximal clique centrality (MCC) (Fig. 6A). Then, the intersection of genes between hub genes and DEmRNAs (Fig. 6B) showed that DKC1, PA2G4, LYAR and NOLC1 were the clinically final hub genes of CRC.

Figure 6
figure 6

Obtaining the hub genes of PPI network using the maximal clique centrality algorithm in cytohubba plugin. (A) Top 25 hub genes with highest MCC. (B) Intersection between hub genes and DEmRNAs.

Analyzing the regulatory networks of the TF-final hub genes and miRNAs

There are several ways that miRNAs can control how genes are expressed. Networkanalyst online database was utilized to collect miRNAs that target hub genes (Fig. 7A), and in this case, we used the miRTarBase v8.0 database to discover hub miRNA interactions. Hsa-miR-16-5p was regarded as a key miRNA for the development of CRC because it interacted with the three final hub genes (degree 3). Also, we got TFs that target final hub genes from the NetworkAnalyst database (Fig. 7B), and we selected the ChEA database to find the TFs that do so. According to the TF-hub gene network, the E2F1 controls each of the four hub genes and may be important for the development of CRC.

Figure 7
figure 7

Construction of miRNA- and TF-final hub genes regulatory networks. (A) In the miRNA-final hub genes regulatory network, it is shown that has-miR-16-5p is related to the DKC1, LYAR and PA2G4. (B) In the TF-final hub genes regulatory network, it is shown that E2F1 is related to the DKC1, LYAR, PA2G4 and NOLC1.

The genetic alterations of final hub genes in CRC patients

In colorectal adenocarcinoma TCGA datasets, we looked at the final hub genes mutations using the cBioPortal database. These datasets contained 1510 samples from 3 studies. As a result, out of a total of 981 samples, gene DKC1 was altered in 9 samples, PA2G4 in 4, NOLC1 in 21, and LYAR in 17 samples (Fig. 8). Also, it was discovered that the LYAR and DKC1 co-occur with mutations, with a q value of less than 0.001 (Table 4).

Figure 8
figure 8

The genetic changes of the last hub genes in individuals with COAD. An asterisk (*) indicates that the gene has undergone mutations.

Table 4 Mutually exclusive mutation pattern between the final hub gene pairs.

Validation of the expression of the final hub genes

We looked up the expression patterns of the final five hub genes (DKC1, PA2G4, NOLC1, LYAR, and E2F1) in several databases to check their reliability. According to the UALCAN database, the expression levels of every gene were noticeably greater in COAD than in normal samples (Fig. 9A; Table 5). According to the Gene Set Cancer Analysis Lite (GSCALite) database, all of these hub genes were also considerably enhanced in COAD (Fig. 9B). The methylation of these genes was also examined using the GSCALite database, and we discovered that NOLC1 was hypomethylated based on methylation difference between tumor and normal samples in COAD (Fig. 9C), which may be connected to the high level of this gene expression in COAD. The Human Protein Atlas (HPA) database also revealed that protein levels of these five genes considerably greater in tumor samples compared to normal samples (Fig. 9D).

Figure 9
figure 9figure 9

Validation of the final hub expression pattern in COAD. (A) DKC1, PA2G4, NOLC1, LYAR, and E2F1 expression patterns in COAD and normal samples were taken from the UALCAN database. (B) DKC1, PA2G4, NOLC1, LYAR, and E2F1 expression in COAD and normal samples from the GSCALite database. (C) Based on methylation difference between COAD and normal samples from the GSCALite database, there is a difference in the methylation of DKC1 and NOLC1; as the circle size increases, the level of significance becomes greater. Moreover, as the circle color shifts towards dark blue, it indicates a greater reduction in methylation in tumor samples. (D) Immunohistochemistry (IHC) of the DKC1 (DKC1 normal sample from patient 634; DKC1 COAD sample from patient 192), PA2G4 (PA2G4 normal sample from patient 1423; PA2G4 COAD sample from patient 2106), NOLC1 (NOLC1 normal sample from patient 1958; NOLC1 COAD sample from patient 4721), LYAR (LYAR normal sample from patient 2040; LYAR COAD sample from patient 4724), and E2F1 (E2F1 normal sample from patient 1423; E2F1 COAD sample from patient 2948) in COAD and normal samples from the HPA database.

Table 5 Statistical significance of the final hub genes in colorectal adenocarcinoma (COAD) TCGA data according on sample types.

Discussion

CRC is regarded as a cancer with high burden in the world needing urgent identification of appropriate diagnostic biomarkers. High throughput techniques are valuable tools for comparison of expression profiles of normal and cancerous cells to find biomarkers. WGCNA is a method for identification of interconnections between genes and recognition of differentially expressed genes between two sets of samples25. In the current study, we used this method and find final five hub genes, namely DKC1, PA2G4, NOLC1, LYAR, and E2F1. Expression levels of these gene were noticeably greater in COAD than in normal samples. Notably, NOLC1 was hypomethylated in COAD. Therefore, overexpression of this gene can be explained by the phenomenon.

Functionally, DKC1 has been shown to enhance angiogenesis through increasing HIF-1α transcription. Moreover, DKC1 can facilitate metastasis in CRC26. PA2G4 has also been shown to be highly expressed in a variety of cancers, including cervical cancer, CRC, nasopharyngeal carcinoma and salivary carcinoma27,28,29,30. In hepatocellular carcinoma, PA2G4 can promote the metastasis through increasing stability of FYN transcript in a YTHDF2-dependent manner31. NOLC1 is involved in determination of multidrug resistance phenotype in non-small cell lung cancer32. LYAR has been found to promote CRC cell mobility through activating galectin-1 expression33. Finally, E2F1 is a transcription factor that binds to DNA with dimerization partner proteins. This transcription factor has fundamental roles in the development of CRC34.

Notably, these genes have been found to be mutated in CRC samples. Thus, several lines of evidence show fundamental roles of DKC1, PA2G4, NOLC1, LYAR, and E2F1 in CRC.

To sum up, the bioinformatics strategy used in the current study revealed important roles of DKC1, PA2G4, NOLC1, LYAR, and E2F1 in the CRC carcinogenesis and potentiates these genes as biomarkers for detection of CRC and therapeutic targets for this cancer.