Background

According to the latest GLOBOCAN [1] report, colorectal cancer (CRC) is the third most commonly diagnosed cancer worldwide (10.2%) and has the second-highest mortality rate (9.2%). Approximately 145,600 new colorectal cancer cases occur each year in the United States, among which 101,420 cases are colon cancer, and the remainder is rectal cancer [2]. In recent years, colon cancer mortality has continued to rise in many countries with limited resources and health infrastructure, particularly in South America and Eastern Europe [3]. Colon adenocarcinoma (COAD) is the primary pathological type of colon cancer. Surgery combined with postoperative chemotherapy is currently the main treatment for COAD. However, the survival of COAD has improved due to the continuous advancement of surgical technology. However, postoperative recurrence and chemotherapy resistance remain two major obstacles to the long-term survival of patients [4,5,6].

With the development of high-throughput omics, various omics techniques, such as whole-genome sequencing, epigenomics, and proteomics, have been applied to study COAD [7,8,9,10]. Increasing evidence has shown that COAD is not a consistent disease type but a molecularly heterogeneous disease comprising a series of genetic changes [11]. Tumor heterogeneity can alter the tumor growth rate, invasive ability, sensitivity to drugs, prognosis and other aspects, making it one of the main obstacles affecting tumor treatment [12, 13]. Therefore, dividing patients with COAD into different risk groups based on gene expression profiles helps to predict the risk of tumor progression or metastasis and recurrence and is a necessary prerequisite for proper individualized treatment [14,15,16].

There is increasing evidence that the immune system plays an important role in the occurrence and development of cancer [17,18,19]. For example, Salem M [20] found that disrupting the cell surface receptor glycoprotein-A repetitions predominant (GARP) on activated regulatory T (Treg) cells reduces immune tolerance and the development of colon cancer. In recent years, a method based on the relative ranking of gene expression levels was proposed to eliminate the shortcomings of data standardization and scaling in gene expression data processing, achieving reliable results in various studies [21, 22]. The present study selected immune genes that are significantly associated with the prognosis of COAD. Next, we integrated these genes to construct an immune-related gene pair (IRGP) risk model and verified its feasibility as a prognostic marker for COAD.

Methods

Sources of colon adenocarcinoma patients

The data analyzed in this study were all obtained from public databases. The training cohort datasets were downloaded from TCGA (https://tcga-data.nci.nih.gov/tcga) [23], and the validation datasets were obtained from GEO (https://www.ncbi.nlm.nih.gov/geo/). The training cohort datasets included clinical datasets (n = 452), transcriptome datasets (n = 449), and verification datasets from GSE39582 (n = 585) [24] and GSE17538 (n = 244) [25].

Data processing

The human General Transfer Format (hunman.gtf) from Ensemble (https://www.ensembl.org/index.html) [26] was downloaded, and the TCGA data were annotated using Perl language [27]. The chip data file (GSE39582 and GSE17538) was preprocessed using Perl language through the annotation file of the GPL570 platform. Using the above operations, all the gene probe IDs were converted to corresponding gene symbols. To analyze the correlation between the IRGP signature and prognosis in COAD, only patient data containing complete overall survival (OS) were selected.

Establishment of the prognostic immune-related gene pair (IRGP) model

We downloaded a list of immune-related genes (IRGs) from IMMPORT (https://www.immport.org/) [28], a website with open access to immunoassay data for translation and clinical research. Next, the R language [29] limma package (version 3.42.2) was used to control the list to screen out the IRGs in the downloaded TCGA transcriptome data. To further select valuable IRGPs, we measured and stored IRGs with a relatively high variation on all the platforms in this study (as determined by the median absolute deviation (MAD) > 0.5) [30]. The expression levels of IRGs in each sample in the transcriptome, GSE39582 and GSE17538 were compared in pairs to form each IRGP according to a previously validated method [22]. Specifically, in the pairwise comparison of each sample, if the expression level of the first gene is greater than that of the second gene, the output is 1; otherwise, it is 0. Samples with a ratio of 0 and 1 less than 20% were deleted to retain gene pairs that may be related to survival. These IRGPs were merged with the survival time of the clinical data downloaded by the corresponding platform to evaluate the correlation between each IRGP in the training dataset and overall survival rate of the patient. Based on previous reports [31, 32], we used the R language survival software package (version 3.1–11) to perform univariate Cox regression analysis and P < 0.001 to screen the effective IRGPs in the TCGA data. From these IRGPs, we used R language for Lasso Cox proportional hazards regression (glmnet software package, version 3.0–2) to construct the risk score, and the final prognostic model was defined using 17 gene pairs. Finally, in the training cohort, we set the overall survival to 5 years and constructed the time-dependent receiver operating characteristic (ROC) curve (survivalROC, version 1.0.3) to determine the best cutoff value for the risk score and divide patients into low-risk and high-risk groups accordingly.

Further validation of the model

Using the R package survival and survminer (version 0.4.6), Kaplan–Meier plots were applied to establish survival curves for the high-risk and low-risk groups in the training and verification cohorts. The differences in the survival curves were analyzed using the log-rank test. Cox proportional hazards analysis was used for univariate and multivariate analyses to assess the effect of the risk score and other clinical factors.

Gene expression profiles (GEPs) of immune cell infiltration in tumors

We used CIBERSORT [33] to infer the relative abundance of tumor-infiltrating immune cells in different risk groups. CIBERSORT estimated the putative proportion of infiltrating immune cells using a reference set with 22 sorted immune cell subtypes for each sample in the training cohort and validation cohorts. Monte Carlo sampling was used in CIBERSORT to calculate the P-value of the deconvolution of each sample to provide the estimated confidence. The permutation is set to greater than 100, and the corresponding P-value is generated.

Gene set enrichment analysis (GSEA)

The chemical and genetic perturbation analysis-related documents involved in the study were downloaded from the Molecular Signature Database (MSigDB C2, version 7.1) (https://www.gsea-msigdb.org/gsea/datasets.jsp). GSEA [34] was performed using the R package fgsea (version 1.12.0) with default parameters. A log 2-fold change was made between GEPs in the high-risk vs low-risk groups. The difference in the gene sets between the high- and low-risk groups was compared. Differences with an FDR-adjusted P < 0.05 were defined as significant.

Statistical analysis

For all the above tests, a P-value less than 0.05 denoted the presence of a statistically significant difference. Statistical significance was indicated as follows: *P < 0.05, **P < 0.01, ***P < 0.001.

Results

Construction of the prognostic IRGP model

The TCGA transcriptome data were used as a training cohort. From the list of immune-related genes (IRGs) obtained by IMMPORT, the genes in the transcriptome data were searched in turn, and 1647 IRGs were identified. To ensure relatively high variation in the genes of the two platforms, we retained 325 IRGs with a median absolute deviation (MAD) > 0.5. In total, 40,375 pairs were deleted with a ratio of 0 and 1 less than 20%. Next, 12,275 immune-related gene pairs (IRGPs) were built based on these 325 IRGs. After univariate Cox regression analysis of these IRGPs in the training group, 28 potential prognostic IRGPs remained. Using Lasso Cox proportional hazards regression to define the model on the training set, 17 IRGPs were retained to form the final prognostic risk model. These IRGPs comprised 26 unique IRGs, most of which are antibiotics, cytokine receptors and cytokine-related molecules (Table 1). Next, the risk score for each patient in the TCGA dataset was calculated based on the model. Finally, we used a time-dependent ROC curve analysis to classify patients into high- or low-immune risk groups. The optimal cutoff value for the risk score was set to − 0.576 (Fig. 1). This value successfully stratified the patients in the training cohort into high- and low-risk groups. In other words, the overall survival (OS) of the low-risk group was significantly higher than that of the high-risk group (Fig. 2a). We further performed univariate and multivariate Cox proportional hazards analyses to test whether the IRGP model predicted survival independently of other prognostic factors in the TCGA cohort. Among these analyses, the risk score of the model can be used as an independent prognostic factor (Fig. 3a, b).

Table 1 List of immune-related genes for constructing prognostic models
Fig. 1
figure 1

Time-dependent ROC curve for IRGPs risk model in the training cohort. Risk score of − 0.576 which was used as cut-off value for the model to stratify patients into high risk group or low risk group. Abbreviations: ROC, receiver operating characteristic; IRGPs, immune-related gene pairs

Fig. 2
figure 2

Kaplan-meier curves of OS among different risk groups. Patients were stratified by immune-related gene pairs model. OS among patients in the training (a) and validation cohorts(b). Abbreviation: OS, overall survival

Fig. 3
figure 3

Univariate and multivariate analyses of prognostic factors in the training and validation cohort. a and c represent the univariate analysis of training cohort and validation cohort, respectively. b and d represent the multivariate analyses of the training cohort and the validation cohort, respectively

Verification of the feasibility of the IRGP model to predict survival

To determine whether the model had consistent prognostic value in different risk groups, we applied the model to GSE39582 and GSE17538 as external validation. The patients in the verification cohort were divided into two groups according to the risk score. The OS of subgroups in the low-risk group increased significantly (Fig. 2b, Figure S1A). After performing univariate and multivariate Cox proportional hazards analyses in the validation group, we found that the results were similar to those of the training group, and the high risk score of this model suggests a poor prognostic factor (Fig. 3c, d, Figure S1B and Figure S1C).

Immune cell infiltration in different risk groups

Previous studies have revealed that tumor-infiltrating immune cells are related to prognosis [35]. To determine the infiltration of specific tumor immune cell subsets, we used CIBERSORT to estimate the relative proportion of 22 different immune cells per patient in different risk groups. Three radar charts depict a comparative summary of various immune cells in these two risk groups (Fig. 4, Figure S2 and Figure S6). In the training cohort, we found that activated dendritic cells, resting dendritic cells, eosinophils, M0 macrophages, monocytes, resting CD4 memory T cells and regulatory T cells (Tregs) were enriched in different risk groups. Among them, regulatory T cells (Tregs) and M0 macrophages were significantly and highly expressed in the high-risk group, and the rest were highly expressed in the low-risk group (Fig. 5). The high-risk group of GSE39582 highly expressed M0 macrophages, M1 macrophages, monocytes, neutrophils, CD8 T cells and follicular helper T cells (Figure S3). The high-risk population in GSE17538 also highly expressed monocytes and Tregs (Figure S7).

Fig. 4
figure 4

Summary of the 22 immune cells’ abundance estimated by CIBERSORT for different risk groups. P-values are based on t-test(*P < 0.05, **P < 0.01, ***P < 0.001)

Fig. 5
figure 5

The abundance distribution of specific immune cells’ within different risk groups. T cells regulatory and Macrophage M0 were significantly highly expressed in the high-risk group, while the rest were significantly higher in the low-risk group

Functional evaluation of the IRGP model

To investigate the expression signatures of genetic perturbations that were significantly altered by the IRGP model, GSEA was performed in the high-risk and low-risk groups in the TCGA cohort. The bubble chart revealed that genes in the high-risk populations were enriched in stem cells and various advanced tumors (Fig. 6). The top five genetic perturbations in the high-risk group were enriched stem cells, increased breast cancer ductal invasion, a multicancer invasiveness signature, increased advanced vs early-stage gastric cancer and enriched mammary stem cells (Fig. 7). We also obtained similar results when performing the above analysis on GSE39582 and GSE17538 (Figure S4, Figure S8). The high-risk group genes in GSE39582 were significantly enriched in breast cancer ductal invasion and stem cells (Figure S5). The GSEA results obtained in GSE17538 also showed that the high-risk group genes are enriched in tumor cell growth and invasion (Figure S9).

Fig. 6
figure 6

The expression characteristics of genetic perturbations significantly changed by the IRGPs model. A number of these gene sets come in pairs: xxx_UP (and xxx_DN) gene sets representing genes induced (and repressed) by the perturbation

Fig. 7
figure 7

Gene Set Enrichment Analysis (GSEA). Gene set enrichment analysis confirmed that multiple tumor progression and stem cell growth-related pathways in high-risk groups were up-regulated

Discussion

Colon cancer is the most common type of gastrointestinal cancer and has high morbidity and mortality. Approximately 95% of colon cancer is colon adenocarcinoma (COAD). In recent years, immunotherapy has been a hotspot in the research of major tumor types. In the COAD field, studies on the high-level microsatellite instability (MSI-H) population have been performed successively since 2015. The Keynote 016, Keynote 164, Checkmate 142, and NICHE clinical trial results all indicate the extraordinary efficacy of immunotherapy [36,37,38,39]. Patients with MSI-H have a better prognosis than those with microsatellite stability (MSS). However, the MSI-H population accounts for only approximately 10% of COAD. Most patients still face the dilemma of not having an effective prognostic indicator. Thus, the determination of new prognostic biomarkers is urgent to predict the survival of colon adenocarcinoma patients.

To obtain the robustness of the prognosis prediction in this study, we adopted a method for data analysis without considering the technical deviation of different platforms. The newly established prognostic model is based on the ranking and pairing comparison of relative gene expression values; thus, data preprocessing, such as scaling and normalization, is not required. This method has reliable results in many studies [40, 41].

In this study, we identified an immune-related gene pair model to predict the overall survival for colon adenocarcinoma. The prognostic model comprises 17 immune-related gene pairs containing 26 unique immune-related genes. Most genes in this immune model are cytokine receptors and cytokines, which play a vital role in the adaptive immune response. Among these IRGs, no evidence supports that the overexpression of IL17RB can enhance the invasion and metastasis of thyroid cancer cells [42]. STC2 overexpression is associated with a poor prognosis in patients with nasopharyngeal carcinoma (NPC) and can be used as a predictor of NPC responses to radiation [43]. The increase in IL-7 in colorectal cancer (CRC) is related to metastatic disease and tumor location [44]. Decreased CXCL14 expression indicates a poor prognosis and causes metastasis in colon cancer [45]. GRP signaling alters the invasion of colon cancer through heterochromatin protein 1Hsβ and can improve the prognosis of patients with colon cancer [46]. Moreover, regulatory T cells (Tregs) and M0 macrophages are related to the poor clinical prognosis of many patients with cancer [47, 48]. Dendritic cells are associated with cancer immunity and a favorable prognosis [49]. At the same time, the immune cell types M0 macrophages, M1 macrophages, monocytes, neutrophils, CD8 T cells and follicular helper T cells in the high-risk group of GSE39582 are all related to tumor progression and poor prognosis [50,51,52,53]. These findings are consistent with our results. In this study, we also found that several expression characteristics of genetic perturbations, such as increased stem cells, increased breast cancer ductal invasion, a multicancer invasiveness signature, increased advanced vs early gastric cancer and increased mammary stem cells, were related to the IRGP model. These results were verified by corresponding experiments [54,55,56,57,58], confirming their importance in tumor development and cell growth. These findings indicate that the IRGP model may play an essential role in tumor invasiveness and progression in COAD.

The difference between this study and previously published studies [59] is that the IRGP model was established based on the TCGA database. Second, our strategy to establish a prognostic model was different. To screen out immune-related gene pairs that are significantly related to OS in patients with colon cancer, we used univariate Cox regression analysis before determining the final model using Lasso regression analysis. Finally, we conducted GSEA in the training and validation cohorts to further analyze the specific differences between the high- and low-risk groups. We found that the high-risk group genes were significantly enriched in tumor cell invasion and growth.

Similar to all RNA-seq and microarray analyses, our study had limitations. First, the training dataset to build the immune model was obtained from a retrospective study, which included fresh frozen samples; the stability and efficiency of formalin-fixed and paraffin-embedded (FFPE) samples remain questionable. Therefore, it may be necessary to add more datasets with different sample attributes for more extensive verification. Second, because the prognostic model was based on TCGA and other databases, it required proficiency in bioinformatics. Additionally, the gene expression profiles produced by RNA-seq or microarray platforms require high prices and long conversion cycles. Therefore, this method is challenging to popularize in daily clinical applications.

Conclusions

In summary, our immune-related gene pair model can provide an evaluation reference for the prognostic risk of patients with colon adenocarcinoma. The immune-related model was associated with the prognosis of patients with COAD. The tumor-infiltrating immune cells and genetic perturbations distinguished by this model in the high- and low-risk groups can further elucidate the role of our prognostic model in the development of colon adenocarcinoma. Therefore, the risk model will be a useful tool to better evaluate patients who may benefit from immunotherapy.