Introduction

Stomach adenocarcinoma (STAD) is the fifth most frequently diagnosed cancer and the fourth-leading cause of cancer-related death worldwide [1]. The long-term prognosis of patients with STAD differs significantly as a function of tumor stage as assessed by the 8th American Joint Committee on Cancer (AJCC) tumor, node, metastasis (TNM) system. At present, although surgical resection is the only potentially curative treatment for resectable STAD in stages I to III, a satisfactory result is only achieved in early-stage cases. According to the SEER database, the 10-year survival rate for patients below stage IIa is approximately 70% but for those above stage IIb, it is only about 50% [2]. Preoperative treatment is particularly important for patients with mid-to-late stage STAD and has been recommended in various guidelines for many years [3,4,5].

To identify whether drug or surgical treatment should be performed in the first instance, an accurate preoperative staging method for STAD is imperative. Microarray technology and high-throughput transcriptome profiling have provided new insights into tumor occurrence and development. It may be possible to link the gene expression profile of STAD with certain phenotypes or clinical features. As such, a set of gene signatures could potentially be used to profile STAD at different stages, further assisting clinicians in treatment decision-making in order to achieve optimal outcomes for STAD patients. Current radiological measures, including widely-applied computed tomography (CT), have only limited accuracy, especially in lymph node assessment [6]. Considerable under-staging still occurs.

More importantly, as TNM staging is still the most accurate indicator of STAD patient prognosis, there is an urgent need to identify the relationships between changes in gene expression and disease stage progression. This could help oncologists identify commonalities in tumorigenesis and development among this highly heterogeneous cancer type. Previous studies have focused on the direct links between gene expression and survival using open-access data [7,8,9]. However, a patient’s duration of survival partially depends on the treatment they receive: the resection type (D2 or not), their compliance with postoperative chemotherapy, and their choices for second-line treatment upon relapse. The TNM stage may be a more direct characteristic that reflects the mechanism of oncogenesis in some ways. To date, few studies have focused on TNM staging. This may be due to differences in the staging criteria applied to previous public data, which hamper the ability of researchers to link genes and staging data. Therefore, unified staging criteria based on the latest 8th AJCC edition are required.

The present study aimed to screen gene expression signatures for the discrimination of earlier and later TNM stages in local, non-metastatic STAD patients using systematic bioinformatic analysis of transcriptomic data.

Methods

Data sources and data pre-processing

TCGA dataset

The RNA sequencing data for STAD tissues were downloaded from the TCGA dataset (https://tcga-data.nci.nih.gov/tcga/) and contained 375 STAD samples with complete clinical and pathological information. The messenger RNA (mRNA) expression dataset was then extracted. Samples were excluded if: (1) the data were missing T stage information, (2) fewer than 16 lymph nodes were retrieved, (3) the patient had distant metastasis (M1), or (4) the patient had received preoperative treatment. In total, 162 eligible samples were screened from the 374 samples. The T and N stages and overall TNM stage were modified according to the latest AJCC 8th edition criteria (Additional file 1: Table S1). Patients who were classified as 8th edition TNM stages I to IIa were combined into an earlier-stage group (I-IIa) and those classified as stages IIb to III were combined into a later-stage group (IIb-III). To assure the accuracy of the results, features with fewer than two counts in more than 50% of the samples were discarded.
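The exclusion rules and the low-count filter described above can be illustrated with a minimal Python sketch. This is not the authors' R code; the record field names (t_stage, nodes_retrieved, m_stage, preop_treatment) are hypothetical and chosen only for illustration.

```python
def is_eligible(sample):
    """Apply the four exclusion rules to one sample record (a dict).

    The field names are hypothetical; they are not taken from the TCGA files.
    """
    if sample.get("t_stage") is None:      # (1) missing T stage information
        return False
    if sample["nodes_retrieved"] < 16:     # (2) fewer than 16 lymph nodes retrieved
        return False
    if sample["m_stage"] == "M1":          # (3) distant metastasis
        return False
    if sample["preop_treatment"]:          # (4) preoperative treatment received
        return False
    return True


def keep_feature(counts, min_count=2, max_low_fraction=0.5):
    """Keep a gene unless more than 50% of samples have fewer than two counts."""
    low = sum(1 for c in counts if c < min_count)
    return low / len(counts) <= max_low_fraction
```

Under this reading, a gene with low counts in exactly half of the samples is kept, because the text discards features only when the low-count fraction exceeds 50%.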

Training-validation dataset

The GEO website includes five publicly available series that contain more than 30 STAD tissue samples with complete TNM stage information (GSE15459, GSE26942, GSE62254, GSE29272, and GSE27342). None of these were staged according to the AJCC 8th edition. Only one publicly available gene expression profile (GSE62254) has detailed information on the pathological T stage, the number of retrieved and positive lymph nodes, and metastasis. For this reason, GSE62254 was selected as the training-validation dataset. The expression data of the 300 STAD samples in GSE62254 were generated using the GPL570 platform (Affymetrix Human Genome U133 plus 2.0 Array) and downloaded from the GEO database (http://www.ncbi.nlm.nih.gov/geo/). For the microarray dataset, ineligible records were excluded according to the same principles as described above: (1) missing T stage information, (2) fewer than 16 retrieved lymph nodes, and (3) distant metastasis (M1). In total, 262 samples met these criteria.

Validation set 2

To verify the robustness of the fitted model, GSE15459, obtained from the same GPL570 platform, was adopted as the second validation set. GSE15459 contains 192 qualified genome-wide mRNA expression profiles of primary STAD patients. The staging system in GSE15459 is based on the AJCC 6th edition TNM system, ranging from I to IV. As this database lacks clinical data on the number of retrieved/positive lymph nodes and the metastasis status, the GSE15459 data could not be transformed into the AJCC 8th edition staging system. Therefore, the stage I samples were classified as the earlier-stage group (N = 31) and the stages II to IV samples (N = 161) were classified as the later-stage group. Despite the divergence in diagnostic accuracy and criteria, the diagnostic scope of stage I in the 6th edition is similar to that of stage I-IIA in the 8th edition (except for T2N1). Thus, agreement under the same prediction model was expected (Additional file 1: Table S1).
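The two grouping rules (8th edition I-IIa vs. IIb-III for the TCGA and GSE62254 data, and 6th edition I vs. II-IV for GSE15459) can be written as a small lookup. The sketch below is illustrative only: the grouping rule comes from the text, whereas the exact substage labels accepted by the function are assumptions about how stages might be coded in a clinical table.

```python
# Stage labels accepted per AJCC edition (label spellings are assumptions).
EARLIER = {
    8: {"I", "IA", "IB", "IIA"},
    6: {"I", "IA", "IB"},
}
LATER = {
    8: {"IIB", "III", "IIIA", "IIIB", "IIIC"},
    6: {"II", "III", "IIIA", "IIIB", "IV"},
}


def stage_group(stage, edition):
    """Return 'earlier' or 'later' for a TNM stage label under a given edition."""
    label = stage.strip().upper()
    if label in EARLIER[edition]:
        return "earlier"
    if label in LATER[edition]:
        return "later"
    # e.g. an unsplit 8th-edition "II" cannot be assigned, and stage IV (M1)
    # samples were excluded from the 8th-edition datasets.
    raise ValueError(f"cannot assign stage {stage!r} under edition {edition}")
```

Raising an error for ambiguous labels, rather than guessing, mirrors the reason GSE15459 could not be restaged: without node counts, an unsplit stage cannot be mapped to the 8th edition.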

Outlier detection and removal

The TCGA dataset (N = 159) and GSE62254 dataset (N = 262) were separately subjected to outlier analysis using hierarchical cluster analysis via the “hclust” function, following the workflow of the WGCNA package [10]. After outlier removal, expression data were obtained from 156 subjects in the TCGA dataset (44 in the earlier-stage group and 112 in the later-stage group) and 258 subjects in the training-validation microarray dataset (73 in the earlier-stage group and 185 in the later-stage group; Additional file 5: Figure S1A, S1B).

Data splitting

The training-validation set was further divided into a training set (66.7%) and a validation set (33.3%) at a 2:1 ratio. A stratified sampling method was adopted according to grouping (earlier-stage vs. later-stage) using the function “strata” in the “sampling” R package. After sampling, there were 49 earlier-stage and 123 later-stage subjects in the training set and 24 earlier-stage and 62 later-stage subjects in validation set 1.
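A stratified 2:1 split of this kind can be sketched in a few lines of Python. This is an illustrative stand-in for the “strata” function, not a reimplementation of it; the function name, the seed, and the rounding rule are assumptions.

```python
import random


def stratified_split(labels, train_fraction=2 / 3, seed=0):
    """Split sample indices into training and validation sets within each group."""
    rng = random.Random(seed)
    by_group = {}
    for idx, label in enumerate(labels):
        by_group.setdefault(label, []).append(idx)

    train, valid = [], []
    for members in by_group.values():
        rng.shuffle(members)                            # random draw within the stratum
        n_train = round(len(members) * train_fraction)  # assumed rounding rule
        train.extend(members[:n_train])
        valid.extend(members[n_train:])
    return sorted(train), sorted(valid)
```

With 73 earlier-stage and 185 later-stage labels, this rounding rule gives 49 and 123 training samples and 24 and 62 validation samples, which matches the group sizes reported above.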

Selection of differentially expressed genes (DEGs)

Differentially expressed genes were identified using the LIMMA package (version 3.42.2) for microarray data and DESeq2 (version 1.26.0) for RNA-seq data in R 3.6.2 [11, 12]. Significant DEGs were detected according to the following criteria: (1) absolute fold-change > 1.5, (2) normalized (NOM) P value < 0.05, and (3) q-value (false discovery rate [FDR]) < 0.25. DEGs that overlapped between the GEO and TCGA datasets were retained for subsequent study. Heat maps and volcano plots of the DEGs were drawn using the “ggplot2” and “pheatmap” packages in R.
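The three DEG criteria and the overlap step can be expressed as a short filter. The sketch below makes two assumptions that the text does not state explicitly: the fold-change threshold is applied on the linear scale (so |log2 FC| > log2 1.5, about 0.585), and overlapping genes must change in the same direction in both datasets, as the separate Venn diagrams for up- and downregulated genes suggest. The input layout (gene mapped to a tuple of log2 fold-change, P value, and q-value) is hypothetical.

```python
import math

FC_CUTOFF = 1.5  # linear-scale fold-change threshold (interpretation is an assumption)


def is_deg(log2_fc, p_value, q_value):
    """Check the three criteria: |FC| > 1.5, P < 0.05, and FDR q < 0.25."""
    return (
        abs(log2_fc) > math.log2(FC_CUTOFF)
        and p_value < 0.05
        and q_value < 0.25
    )


def overlapping_degs(results_a, results_b):
    """Return genes that are DEGs in both result sets with the same direction.

    Each argument maps gene -> (log2_fc, p_value, q_value).
    """
    shared = {}
    for gene, stats_a in results_a.items():
        stats_b = results_b.get(gene)
        if stats_b is None:
            continue
        if not (is_deg(*stats_a) and is_deg(*stats_b)):
            continue
        if (stats_a[0] > 0) == (stats_b[0] > 0):  # same sign of change
            shared[gene] = "up" if stats_a[0] > 0 else "down"
    return shared
```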

Enrichment analysis of DEGs

Functional enrichment analysis included Gene Ontology (GO) analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis. GO and KEGG analyses were carried out using “clusterProfiler” in R (version 3.14.3) [13,14,15]. GO analysis encompassed biological processes, cellular components, and molecular functions. Gene Set Enrichment Analysis (GSEA) was also performed using the “gseKEGG” function with 1,000 permutations of the gene sets and a log2 ratio of classes as the metric for ranking genes. For both enrichment analysis and GSEA, pathways with both a NOM P value < 0.05 and FDR < 0.25 were considered significant, as recommended previously [16]. Additionally, only those pathways with an absolute normalized enrichment score (NES) > 1 were adopted in the GSEA results.
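The significance filter applied to the GSEA output can be sketched as follows. The thresholds are those stated above; the input layout (pathway mapped to a tuple of NES, P value, and FDR) is hypothetical, and treating a positive NES as enrichment in the later-stage group is an assumption that depends on the direction of the ranking metric.

```python
def significant_pathways(results, p_cutoff=0.05, fdr_cutoff=0.25, nes_cutoff=1.0):
    """Split GSEA results into significantly up- and downregulated pathways.

    `results` maps pathway name -> (nes, p_value, fdr).
    """
    up, down = [], []
    for name, (nes, p_value, fdr) in results.items():
        if p_value < p_cutoff and fdr < fdr_cutoff and abs(nes) > nes_cutoff:
            (up if nes > 0 else down).append(name)
    return sorted(up), sorted(down)
```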

Establishment of outcome signature with LASSO logistic regression model

The Least Absolute Shrinkage and Selection Operator (LASSO) method was applied to reduce the dimensions of the data and select the DEGs that best distinguished the data. This was achieved using the “glmnet” (version 4.0-2) package in the training microarray data. In the LASSO model, the minimum criterion (λ) based on 10-fold cross-validations was chosen. A multivariate logistic regression model was used to build a model for predicting later-stage cancer. The predictive index of each sample was calculated according to the constructed prognostic signatures based on the following formula: prediction index = \(\sum_{i=1}^{n} \beta_i \times X_i\), where \(\beta_i\) represents the coefficient obtained from LASSO-logistic regression and \(X_i\) indicates the relative expression level of each selected gene. The area under the curve (AUC) was calculated in the training, validation 1, and validation 2 datasets using the “rms” package.
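The prediction index is a weighted sum of expression values, which the logistic model converts to a probability of later-stage disease. A minimal sketch follows; the gene names and coefficients used in any call are placeholders rather than the fitted values in Table 3, and the intercept argument is an assumption about how the logistic model is parameterized.

```python
import math


def prediction_index(coefficients, expression):
    """Compute sum_i beta_i * X_i over the genes in the signature."""
    return sum(beta * expression[gene] for gene, beta in coefficients.items())


def later_stage_probability(index, intercept=0.0):
    """Logistic transform of the index (the intercept is a hypothetical parameter)."""
    return 1.0 / (1.0 + math.exp(-(intercept + index)))
```

For example, with placeholder coefficients {"GENE_A": 0.5, "GENE_B": -0.25} and expression values {"GENE_A": 2.0, "GENE_B": 4.0}, the index is 0.5 × 2.0 − 0.25 × 4.0 = 0, which corresponds to a probability of 0.5 when the intercept is zero.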

Statistical analysis

All data were analyzed using R (version 3.6.2). Comparisons between the two groups were made using the χ2 test (nominal data), Wilcoxon rank-sum test (nonparametric continuous data), or Student’s t-test (Gaussian continuous data), as appropriate. For predictive ability, the AUC was required to be equal to or higher than 0.65 with a 95% confidence interval (95% CI) excluding 0.5; an AUC ≥ 0.7 was considered to reflect good prediction or discrimination. We also compared the predictive ability of our gene signature with previously published prognostic signatures [17,18,19,20,21,22,23,24,25]. The Venkatraman permutation test was used to compare the paired ROC curves based on different signatures [26]. The prognostic values of the hub genes with the same probe IDs were inspected using Kaplan-Meier analysis based on the log-rank test. The relationships between clinicopathological factors and both long-term overall survival (OS) and disease-free survival (DFS) were assessed using univariate Cox regression analysis. Covariates that achieved a P-value < 0.05 in the univariate analyses were included in the multivariate analysis. A backward stepwise approach was used to identify possible predictors of OS among the candidate variables. The Akaike information criterion (AIC) was used to set a limit on the total number of variables included in the final model. P-values < 0.05 were considered statistically significant. The “sva” package in R was used to remove the batch effect between the datasets using the same platform, if necessary [27].
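The AUC acceptance rule can be made concrete with a short sketch. The AUC is computed here from its rank interpretation (the probability that a randomly chosen later-stage sample scores higher than a randomly chosen earlier-stage sample, with ties counted as one half); this is a generic formulation and not the “rms” implementation, and the confidence interval is taken as given rather than estimated.

```python
def auc(scores_positive, scores_negative):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


def meets_criterion(auc_value, ci_low, ci_high):
    """AUC >= 0.65 with a 95% CI that excludes 0.5."""
    return auc_value >= 0.65 and not (ci_low <= 0.5 <= ci_high)


def is_good_discrimination(auc_value):
    """AUC >= 0.7 was considered good prediction or discrimination."""
    return auc_value >= 0.7
```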

Results

Identification of DEGs

A detailed flow chart of the prognostic predictive model in this study is shown in Fig. 1. The detailed clinical features of the TCGA, training-validation, and validation 2 datasets before outlier removal are shown in Additional file 2: Table S2.

Fig. 1 Flow chart of sample selection

The DEGs between the earlier-stage and later-stage samples in the TCGA dataset and training set were screened. Detailed patient information from both databases is shown in Tables 1 and 2. Compared to the earlier-stage tumors, a total of 1748 DEGs, including 554 upregulated genes and 1194 downregulated genes, were identified in the later-stage group of the TCGA dataset (Fig. 2A) while 74 upregulated genes and 31 downregulated DEGs between the later-stage and earlier-stage samples were identified in the training set (Fig. 2B). Among the two datasets, 22 overlapping DEGs (19 upregulated and 3 downregulated) were identified (Fig. 2C, D). All DEGs are listed in Additional file 3: Table S3. Heatmap analysis was used to determine the relative expression levels of these 22 DEGs in the different groups (Fig. 2E).

Table 1 Demographic and clinicopathologic characteristics in training and validation cohorts (GSE62254)
Table 2 Demographic and clinicopathologic characteristics in TCGA dataset
Fig. 2 Differentially expressed stage-related genes for STAD in the (A) TCGA and (B) training datasets; Venn diagrams showing overlapping (C) upregulated and (D) downregulated DEGs between the TCGA and training datasets; (E) expression heatmap of the 22 overlapping DEGs

All overlapping DEGs were submitted to GO and KEGG pathway analyses. The top three GO enrichment terms for target genes in the biological processes of ontology, cellular components of ontology, and molecular function of ontology are shown separately in Fig. 3A; all seven enriched KEGG terms are presented in Fig. 3B. The results showed that “positive regulation of cytosolic calcium ion concentration” and “calcium ion transport into cytosol” were the most enriched GO terms, while “tyrosine metabolism”, “malaria”, and “cAMP signaling pathway” were the most enriched KEGG terms. The DEGs and their interactions with KEGG pathways are visualized in Fig. 3C.

Fig. 3 (A) GO analysis of the 22 overlapping DEGs; (B) KEGG pathway analysis of the DEGs; (C) net plot of the pathways enriched with DEGs, as identified by KEGG pathway analysis

Predicting pathological stage with binomial LASSO logistic regression

To examine the DEGs with the best discriminative ability for stage prediction, and to minimize multicollinearity, LASSO logistic regression was employed. Feature selection was performed based on the training dataset with the 22 identified DEGs. LASSO regression yielded a model with nine predictors (seven upregulated and two downregulated) that minimized binomial deviance and enhanced sparsity (Fig. 4A, B). These nine hub genes showed significant upregulation/downregulation between the two stage groups (Fig. 4C). The Kaplan-Meier plots indicated that overexpression of MYOCD, SCRG1, TYRP1, and THBS4 was associated with significantly poorer survival, while upregulation of GHRL and LYPD6B and downregulation of SERPINB2 and NEBL tended to be associated with poorer survival. Only TNFRSF17 showed no expression-related survival trend (Additional file 6: Figure S2A–I).

Fig. 4 LASSO logistic analysis via 10-fold cross-validation with minimum criteria. (A) Tuning parameter selection via 10-fold cross-validation with minimum criteria in the LASSO model. (B) LASSO coefficient profiles of the 22 candidate DEGs. (C) Expression levels of the nine hub genes in the earlier-stage and later-stage groups, as identified by LASSO regression

The nine hub genes (MYOCD, GHRL, SCRG1, TYRP1, LYPD6B, THBS4, TNFRSF17, SERPINB2, and NEBL) were included in a multivariate logistic regression model. The obtained coefficients of each identified DEG were then used to form the nine-gene model (Table 3). No reverse sign was observed in any of the covariates within the univariate and multivariate regressions. The ability of the nine-gene signature to predict TNM stage was evaluated by ROC curves and AUC analysis. In the training set, the AUC was 0.763 (0.685–0.841). The prediction model also achieved satisfactory performance with an AUC of 0.704 (0.587–0.821) in validation set 1 and an AUC of 0.743 (0.679–0.808) in the merged training-validation set. The prediction model performed moderately in validation set 2 with an AUC of 0.658 (0.558–0.758). The AUCs in each data set are presented in Fig. 5A.

Table 3 LASSO regression results. Genes selected by the LASSO logistic regression, with the estimated coefficients and odds ratio
Fig. 5 (A) Receiver operating characteristic (ROC) curves for the training set, validation set 1, the training-validation set, and validation set 2 (GSE15459); (B) ROC performance in validation set 2 before and after batch correction

A significant batch effect between GSE15459 (validation set 2) and GSE62254 (training-validation set) was observed. Because the two series used the same GPL570 platform, batch correction for validation set 2 with reference to the training-validation set was then performed. Boxplots of the merged dataset before and after batch effect removal are presented in Additional file 7: Figures S3A and S3B, respectively. There was an obvious improvement in the AUC value, which increased to 0.717 (0.627–0.806) after batch correction (Fig. 5B).

The nine-gene model was then applied to several clinical phenotypes. The prediction model performed well in forecasting lymph node metastasis (AUC: 0.728, 95% CI 0.647–0.808), signet ring (AUC: 0.711, 95% CI 0.617–0.805), and Lauren diffuse type (AUC: 0.707, 95% CI 0.643–0.771) STAD. The model achieved a moderate predictive value for T4 tumors (Table 4).

Table 4 The AUC performances of the 9 hub genes on other clinicopathologic phenotypes

Identification of KEGG pathways related to the TNM stage using GSEA

To improve our understanding of the gene expression changes that accompany stage development, GSEA was performed using the training-validation set (GSE62254). From this, 134 (62 upregulated and 72 downregulated) significantly enriched pathways were identified (P < 0.05, FDR < 0.25). All of the top 10 significantly enriched pathways were upregulated (Fig. 6A). The “PI3K-Akt signaling pathway” was the most significantly upregulated, followed by the “MAPK signaling pathway”, “Calcium signaling pathway”, “cAMP pathway”, and “focal adhesion”. A network of gene sets in the first half (N = 67) was constructed to illustrate the pathway interactions (Fig. 6B). The details of the significantly enriched gene sets are provided in Additional file 4: Table S4.

Fig. 6 Gene set enrichment analysis based on the training-validation set. (A) Top five GSEA enrichment results of the KEGG pathways for the later-stage group. (B) Network plot for GSEA showing enriched upregulated pathways (in red) and downregulated pathways (in blue) in samples of higher stage. The top 50% most significant KEGG pathways were included in this network

Exploring the prognostic significance of the nine genes and other clinicopathological factors

We further investigated the prognostic impact of the nine selected genes together with various clinicopathologic and genomic features. As the ACRG cohort had the most sophisticated clinical information and molecular subtypes, both the training-validation dataset (N = 258) and the original dataset (N = 300) were used to achieve robust results. Univariate Cox analysis revealed that higher signature score, tumor location, total resection, T stage, N stage, MLH1 positivity, diffuse Lauren type, poor differentiation, ACRG subtype (especially EMT), absence of chemotherapy, mesenchymal phenotype, and Borrmann type IV were risk factors for OS and/or DFS either in the training-validation dataset (Table 5A) or in the complete ACRG cohort (Table 5B). Specifically, four of the nine selected genes, i.e., MYOCD, SCRG1, TYRP1, and THBS4, were significantly correlated with survival as continuous variables. All statistically significant variables were then included in a multivariate Cox regression using the backward stepwise algorithm for covariate selection. The results showed that N stage, chemotherapy, and SCRG1 expression level (training-validation dataset: HR 1.21, 95% CI 1.11–1.32, P < 0.001; ACRG cohort: HR 1.14, 95% CI 1.05–1.24, P = 0.001) were significant covariates in both datasets (Table 6A, B), while T stage and MLH1 status were significant covariates only in the complete ACRG cohort (Table 6B). Other features, e.g., ACRG subtype, mesenchymal phenotype, and the other selected genes, were ruled out in both datasets using the same algorithm.

Table 5 (A) Univariate Cox regression in Training-Validation dataset (N = 258). (B) Univariate Cox regression in the whole ACRG cohort (N = 300)
Table 6 (A) Multivariate backward stepwise Cox regression on OS Training-Validation (N = 258, including 2 cases with NA entries). (B) Multivariate backward stepwise Cox regression on OS whole ACRG (N = 300, including 3 cases with NA entries)

Comparison of our signature with other gene signatures for stage prediction

A literature search was then performed, and the stage prediction ability of our signature was compared with those of nine other gene combinations containing similar gene numbers (ranging from 6 to 13 genes). The dataset for this analysis included the training-validation set and validation set 2 (N = 450) after batch correction. The coefficients were adjusted in all 10 signatures with the aim of achieving maximum predictive ability. Among the 10 gene collections, our signature achieved the highest AUC for stage prediction (AUC = 0.742, Fig. 7). The ROC curves indicated that our nine-gene signature was significantly different from the signatures reported in six studies and marginally significantly different from the signatures reported in three studies (Table 7).

Table 7 The collection of gene signatures of STAD used for comparison
Fig. 7 Receiver operating characteristic (ROC) curve analysis for stage prediction by our signature and by gene sets that appeared in previous studies

Discussion

The present study identified 22 overlapping DEGs based on the integration of the TCGA and GEO public datasets. A nine-gene signature was formed based on LASSO regression results and was further validated in several sets with satisfactory AUC values of > 0.7 in most datasets. The moderate AUC performance in the GSE15459 dataset is likely due to the inconsistent grouping criteria used in this dataset; we were unable to deal with the stage migration problem due to a lack of clinical data. The significant improvement in the AUC after batch correction provides further verification of the stage distinguishing ability of our nine-gene signature. The nine-gene signature reported here is the first stage-oriented prediction model at the transcriptome level using the AJCC 8th edition TNM staging system. The results suggest that this nine-gene signature may be of diagnostic value for the management of non-metastatic STAD and may assist with clinical decision-making.

For historical reasons, most current open-access gene expression sets for STAD have followed the AJCC 6th edition stage classification. The well-known “GEPIA” tool, for example, integrated various datasets and finalized a “stage plot” module [28]. Despite this excellent work and contribution to the field, this approach is somewhat open to question because, from the viewpoint of gastroenterologists and clinical oncologists, the relationship between the 6th edition staging system and the newest 8th edition TNM staging system is by no means a simple permutation or combination. For example, the AJCC 6th edition categorizes muscularis propria invasion as pT2a, subserosal invasion as pT2b, serosal penetration as pT3, and adjacent organ invasion as pT4, which correspond to the pT2, pT3, pT4a, and pT4b T stage criteria in the 8th (and 7th) editions [29, 30]. Even more importantly, both the 5th and 6th editions defined N1, N2, and N3 as positive lymph node numbers of 1–6, 7–15, and > 15, respectively, while starting from the 7th edition, the N stages were further refined as N1: 1–2, N2: 3–6, N3a: 7–15, and N3b: > 15 positive lymph nodes. This means that there is a considerable discrepancy when discussing the association between gene expression/behavior and stage [28]: patients with the same “N2” staging according to the 6th and 8th editions reflect different concepts and prognoses which cannot be simply merged together [9, 31]. Additionally, stage migration is another key factor in translating stages from the old to the new system and is a precondition for explaining the expression differences between earlier- and later-stage samples (Additional file 1: Table S1). Our previous findings of 1663 patients indicated that prognosis differences begin to reach statistical significance when pTNM stage reaches IIB [32]. This result prompted us to split the data in the current study into earlier- and later-stage patient groups as these groups are associated with prognosis and treatment strategy variations. Finally, as retrieval of > 15 lymph nodes is required for optimal staging, samples with inadequate lymph node retrieval are at considerable risk of under-staging and should be filtered out in analyses [33,34,35,36]. Given the above, due to the strict data processing performed in this study, these data can be used for accurate stage prediction and to identify factors (DEGs and pathways) that lead to stepwise STAD progression.
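The discrepancy between editions can be made explicit by coding the T-stage correspondence and the two sets of N-stage thresholds quoted in this paragraph. The sketch below is illustrative; the function and variable names are hypothetical, and N0 for zero positive nodes is the only rule added beyond those stated in the text.

```python
# 6th-edition pT category -> 8th-edition (and 7th-edition) pT category
T_STAGE_6TH_TO_8TH = {"pT2a": "pT2", "pT2b": "pT3", "pT3": "pT4a", "pT4": "pT4b"}


def n_stage_6th(positive_nodes):
    """5th/6th edition: N1 = 1-6, N2 = 7-15, N3 = more than 15 positive nodes."""
    if positive_nodes == 0:
        return "N0"
    if positive_nodes <= 6:
        return "N1"
    if positive_nodes <= 15:
        return "N2"
    return "N3"


def n_stage_8th(positive_nodes):
    """7th/8th edition: N1 = 1-2, N2 = 3-6, N3a = 7-15, N3b = more than 15."""
    if positive_nodes == 0:
        return "N0"
    if positive_nodes <= 2:
        return "N1"
    if positive_nodes <= 6:
        return "N2"
    if positive_nodes <= 15:
        return "N3a"
    return "N3b"
```

A patient with five positive nodes is N1 under the 6th edition but N2 under the 8th, and a patient with ten positive nodes is N2 under the 6th edition but N3a under the 8th, which is why identically named categories cannot be pooled across editions.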

Based on TNM stage characteristics, 22 overlapping DEGs were identified between the TCGA set and the training set. This number is similar to that in a prognosis-based study [37], but is far less than those obtained in other data mining studies that have focused on gene expression between tumor and normal tissue. This suggests that either the homogeneity or heterogeneity among gastric adenocarcinomas is much more complex. Accordingly, a penalized model (LASSO regression) was implemented to exclude the confounding variables that could generate multicollinearity in the prediction model. In fact, the coefficients of the nine selected genes in the multivariate analysis maintained the same sign as in the univariate models. This supports the robust performance of the model. This model also avoided overfitting and Simpson’s paradox, which are risks when performing bioinformatics analysis and model building [37,38,39,40]. More importantly, our signature had higher accuracy for stage prediction than previous signatures focusing on various prognostic features. Therefore, the results of this study indicate that the nine-gene signature is a novel biomarker with superior tumor stage predictive ability for patients with locally advanced gastric cancer (LAGC).

Of the nine identified genes, some have been reported to be of relevance to various cancers. THBS4 is one of five extracellular calcium-binding proteins that modulate the extracellular matrix (ECM). High levels of THBS4 have been found to be significantly related to cancer-associated ECM in breast cancer tissue [41], and the high expression levels of THBS4 in cancer-associated fibroblasts in Lauren diffuse-type gastric adenocarcinoma support its use as a biomarker [42]. Clinically, the Lauren type has been shown to be strongly correlated with lymph node metastasis in STAD [43]. In vitro, THBS4 also promotes tumor progression by interacting with ITGB1 via the FAK/PI3K/AKT pathway [44, 45].

Tyrosinase-related protein 1 (TYRP1) is the most abundant intracellular glycoprotein in melanoma and melanocytes [46]. Although it has a specific function in melanogenesis, it seems that high expression profiles of TYRP1 are not exclusive to melanoma. Bioinformatics analyses have demonstrated similar unusual overexpression of TYRP1 in STAD, and its expression is associated with poorer prognosis [8, 47]. It is proposed that the high expression of TYRP1 could serve as an indicator of the abnormal activation of the transcription regulator microphthalmia-associated transcription factor (MITF), which is phosphorylated by the SCF/KIT pathway, or of the inactivation of anti-oncogenes like p53, which results in tumor progression [48,49,50]. Furthermore, TYRP1 mRNA has been shown to act in a noncoding manner as a sponge for miR-16, which is known for its tumor-suppressor function in STAD [51, 52]. Taken together, this evidence suggests that TYRP1 plays a role in STAD progression.

SERPINB2, commonly known as plasminogen activator inhibitor-2 (PAI-2), serves as an inhibitor of extracellular protease urokinase plasminogen activator (uPA) and tissue plasminogen activator (tPA), both of which transform plasminogen into plasmin [53]. uPA-triggered fibrinolysis plays various roles in tumor progression, including ECM degradation, the release of tumor-related growth factors, and the promotion of angiogenesis [54,55,56]. In vitro, SERPINB2-deficient cancer cells are associated with increased tumor growth, aberrant ECM, and invasive properties, while SERPINB2 overexpression inhibits tumor proliferation and migration [57, 58]. A low-expression profile of SERPINB2 is linked with poor prognosis in various cancers, including STAD [7, 59].

The GHRL gene encodes the prepropeptides of ghrelin and obestatin. Physiologically, ghrelin/obestatin stimulate/decrease food intake, regulate growth hormones, and may have a role in cell proliferation, differentiation, and apoptosis [60, 61]. In vitro, ghrelin is reported to induce colon cancer cell proliferation through the GHS-R/Ras/PI3K/Akt/mTOR axis [62]. Abnormally high expression of GHRL is not only observed in gastrointestinal tumors but also in other types of cancer including breast cancer, renal cell carcinoma, and ovarian cancer [63, 64]. Interestingly, although in vitro studies and expression arrays have suggested stimulatory effects of ghrelin on proliferation and invasion of STAD, several clinical studies have indicated that ghrelin in serum acts as a protective factor for STAD patient prognosis [65, 66]. This suggests that circulating ghrelin and tumor-localized ghrelin have different effects [67]. A more comprehensive mechanistic analysis is needed to explain this phenomenon.

Scrapie responsive gene 1 (SCRG1) is predominantly expressed in neurons and is overexpressed in the central nervous system during infection or brain injury [68]. SCRG1 was initially recognized as a marker of autophagic vacuoles in terminal-stage disease [69]. The upregulation of SCRG1 was previously reported in STAD with lymph node metastasis in a data-mining study; however, the mechanism was not explained [70]. More recent studies have revealed that SCRG1 acts on CD157 to activate ERK and PI3K/Akt in human mesenchymal stem cells [71, 72]. SCRG1 is also specifically highly expressed in breast cancer with metastatic propensity [73] and might serve as an ideal indicator for developmental cancer-associated fibroblasts [74].

NEBL is also a commonly distinguishable gene that serves as a prognostic factor in various cancers, according to previous microarray results [75, 76]. Because the nebulette protein encoded by the NEBL gene mostly functions to stabilize actin filaments, the expression level of NEBL may reflect the extent of focal adhesion of anchored cancer cells [77]. Contrary to previous findings in colorectal cancer, whereby Hosseini et al. discovered a positive correlation between the expression level of NEBL and lymph node metastasis, the bioinformatics-based analysis in the present study revealed a negative correlation between these two factors. It is proposed that a stabilized cytoskeletal structure results in less random motility, thus enhancing focal adhesion and predicting late-stage STAD with poorer prognosis [78, 79].

Among the remaining three genes, the tumor necrosis factor receptor superfamily member 17 (TNFRSF17) gene, also known as the B-cell maturation antigen gene, is expressed on mature B cells and directly reflects B-cell homeostasis and autoimmune response [80]. The expression of TNFRSF17 is associated with the development of breast cancer, ovarian cancer, and colon cancer [81,82,83]. TNFRSF17 also has the potential to act as a marker for evaluating tumor immune infiltration status and it may predict beneficial effects of immune checkpoint blockade antigens [84,85,86]. Interestingly, in the current study, although TNFRSF17 showed a higher expression profile in later-stage samples, it had no effect on patient survival. In fact, the role of B cells in tumorigenesis and progression is much less understood than other immune cells [87, 88]. This may be due to the two-pronged nature of B cells [87]. On the other hand, the relationship between the overexpression of TNFRSF17 and its global contribution to/reflection of the tumor microenvironment requires further study [89]. LYPD7, also known as LYPD6B, belongs to the LY6/PLAUR domain-containing subclass (LYPD) of the Ly-6/uPAR superfamily [90]. Several bioinformatics-based analyses have revealed that increased LYPD7 expression may be implicated in the pathogenesis of NSCLC, while decreased hypermethylation of LYPD DNA is correlated with an invasive phenotype of malignant melanoma [91, 92]. Finally, contrary to the MYOCD profile in other common tumors, in which myocardin plays a suppressive role in the malignant transformation process [93, 94], the MYOCD level in STAD was vastly upregulated, indicating poorer prognosis (Additional file 8: Figure S4). This MYOCD amplification should be comprehensively investigated because activation of the PI3K/Akt pathway can lead to JAK3 phosphorylation, thus resulting in a STAT3 and myocardin interaction which co-regulates smooth muscle cell proliferation and angiogenesis [95].

Given that the limited number of DEGs identified in this study may not support a robust enrichment analysis, GSEA was used to inspect the pathways involved in STAD development, with samples grouped by stage. GSEA revealed that the PI3K-Akt, MAPK, and calcium signaling pathways were the top three pathways correlated with later-stage STAD compared with earlier-stage STAD. All three pathways play a vital role in cell proliferation, growth, and escape from apoptosis, which is consistent with the higher proliferative profile of late-stage STAD. Based on the network analysis, the proliferation-related and metabolism-related pathways are two major modules that are widely upregulated with stage advancement, while immune-related and DNA repair-related genes are widely downregulated. These results suggest that the development and migration of STAD depend on the stepwise activation of pathways that are commonly dysregulated in cancer. Additionally, the GSEA results provide solid evidence of changes in tumor behavior according to tumor stage.
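To make the enrichment statistic concrete, the following is a minimal sketch of a GSEA-style running-sum enrichment score (a weighted Kolmogorov-Smirnov-like statistic). It is not the pipeline used in this study: the function name `enrichment_score`, the weighting exponent `p`, and the toy gene labels and ranking values are all illustrative assumptions, and permutation-based normalization and significance testing are omitted.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """GSEA-style running-sum enrichment score (illustrative sketch).

    ranked_genes: gene names sorted by descending ranking metric
                  (e.g. correlation with late versus early stage).
    scores:       ranking metric for each gene, in the same order.
    gene_set:     set of gene names belonging to one pathway.
    p:            weighting exponent (assumed default of 1.0).
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    # Total weight of the pathway genes, used to normalize each upward step.
    total_weight = sum(abs(s) ** p for s, h in zip(scores, hits) if h)
    if total_weight == 0:
        raise ValueError("pathway genes must have non-zero ranking scores")

    running = 0.0
    best = 0.0
    for s, h in zip(scores, hits):
        if h:
            running += abs(s) ** p / total_weight  # step up at a pathway gene
        else:
            running -= 1.0 / n_miss                # step down otherwise
        if abs(running) > abs(best):
            best = running                         # maximum deviation from zero
    return best


# Toy example with hypothetical gene labels and ranking values.
genes = ["A", "B", "C", "D", "E", "F"]
metric = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top = enrichment_score(genes, metric, {"A", "B"})     # set at top of ranking
es_bottom = enrichment_score(genes, metric, {"E", "F"})  # set at bottom of ranking
```

A positive score indicates that the pathway genes cluster at the top of the ranking (here, toward later stage), and a negative score indicates clustering at the bottom.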

As most genes identified in this study were linked with the genesis and development of STAD, an increase in the nine-gene score was associated with a poorer prognosis. Among the nine identified genes, MYOCD, SCRG1, TYRP1, and THBS1 were statistically associated with patient survival, while GHRL, LYPD6B, SERPINB2, and NEBL only showed trends toward better or worse prognosis. After stepwise backward elimination, only SCRG1 remained an independent prognostic factor. This result is understandable because the stepwise algorithm is designed to mathematically avoid multicollinearity [96, 97]. This method is advantageous when the significance of covariates is unknown and the covariates are equally weighted [98]. Since our nine-gene signature was designed to predict tumor stage, a higher correlation with the T or N stage is unavoidable (Table 3), and several stage-related genes can be ruled out when the N and T stages become two of the most important prognostic factors. Apart from T and N stage, chemotherapy and MLH1 status are two clinicopathological features that significantly influence OS. Other important features, including the Lauren classification, ACRG subtype, and mesenchymal phenotype, were also excluded from the Cox model due to multicollinearity. Looking beyond the analysis, we hypothesize that some genomic or transcriptomic findings might be products of models overfitted to a limited sample size. Nonetheless, a population-based transcriptomic result is still necessary. Meanwhile, several clinical features (e.g., chemotherapy management) and phenotypes (e.g., TNM stage) remain key factors that drive patient prognosis. Moreover, as several key clinical features are interrelated, it is important to focus on the correlation between transcriptomic signatures and key cancer phenotypes in order to prescribe individualized treatment for patients. Based on this, the nine-gene signature identified in this study can assist with accurate STAD staging.
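The elimination behavior described above can be sketched as a short loop. This is an illustration of generic p-value-based backward elimination, not the exact procedure or software used in this study: the function `backward_eliminate`, the threshold `alpha`, and the stub `toy_fit` (which returns hand-written, hypothetical p-values in place of a fitted Cox model) are all assumptions introduced for the example.

```python
def backward_eliminate(covariates, fit_pvalues, alpha=0.05):
    """Generic backward elimination (illustrative sketch).

    covariates:  list of covariate names to start from.
    fit_pvalues: callable that refits the model on the given covariates and
                 returns {name: p_value}; in practice this would wrap a
                 multivariable Cox regression.
    alpha:       retention threshold (assumed value).
    """
    kept = list(covariates)
    while kept:
        pvals = fit_pvalues(kept)
        worst = max(kept, key=lambda name: pvals[name])
        if pvals[worst] <= alpha:
            break              # every remaining covariate is significant
        kept.remove(worst)     # drop the least significant one and refit
    return kept


def toy_fit(kept):
    """Hypothetical stand-in for model fitting, with hand-written p-values.

    gene_x is assumed to be strongly correlated with N stage, so it looks
    significant only when N_stage is absent from the model.
    """
    pvals = {"N_stage": 0.001, "gene_y": 0.30}
    pvals["gene_x"] = 0.40 if "N_stage" in kept else 0.01
    return {name: pvals[name] for name in kept}


selected = backward_eliminate(["N_stage", "gene_x", "gene_y"], toy_fit)
```

In this toy setting, the stage-correlated covariate is removed once stage itself is in the model, which mirrors why stage-related genes can drop out of a multivariable model that already includes T and N stage.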

Clinically, our stage-related gene signature could support decision-making in several ways. First, as preoperative diagnosis has become increasingly important in the multimodality treatment of patients who are initially diagnosed with locally advanced gastric cancer, a chip-based panel may facilitate accurate clinical staging, where diagnostic accuracy to date has been limited by the use of enhanced CT [99, 100]. Patients who are over-staged could receive timely resection, while under-staged patients may benefit from systemic treatment before surgery. Second, for D0/D1 surgery or D2 surgery with limited lymph node retrieval (< 15 nodes), a stage-related gene panel allows for tumor restaging and more accurate forecasting of the risk of lymph node metastasis, which can inform clinicians’ choice of postoperative regimen. Third, for early gastric cancer (T1a/T1b) treated with endoscopic resection, the signature identified here can be used to help decide whether salvage surgery is needed, as it is highly linked with lymph node metastasis and infiltration [101, 102]. Similarly, an extended lymphadenectomy or extensive radical resection may improve long-term outcomes for patients with very high signature scores [103, 104]. To sum up, more precise preoperative staging can be achieved by combining radiological and transcriptomic methods.

This study has some limitations that should be noted. First, the prediction model was based on bioinformatic analysis and lacked its own validation cohort. Second, although a stringent data cleaning workflow was implemented, there were still some under-staged samples due to missing information in the public dataset. Third, the nine-gene signature is a probe-based model limited to the GPL570 platform; cross-platform validation may require systematic correction, considering the different sensitivity of gene probes on each platform. Fourth, although the nine-gene signature exhibited promising predictive ability, the present model was mRNA-based. The performance of our signature should be further explained by the regulation of the corresponding non-coding RNAs, and the consistency of the associations across the genomic and protein levels needs further inquiry. Finally, yet importantly, both GSEA and conventional enrichment analysis methods were used to investigate the expression profile differences between groups, but the results drawn from each method were fragmented. According to GSEA, the calcium and MAPK signaling pathways achieved high normalized enrichment scores. However, none of the nine genes were involved in these two pathways. The biological interpretation was evidently limited by the small number of genes in the mathematically optimal model, and the gene set requires further expansion.

Conclusion

In summary, under stringent data filtering, nine hub genes were identified. These genes predict stage advancement in gastric adenocarcinoma. This nine-gene signature may help facilitate clinical decision-making for patients with localized STAD of uncertain stage. This model may also assist with tumor staging/restaging, especially for those patients with insufficient lymph node retrieval. Nevertheless, further analysis of the molecular mechanisms underlying the roles of these hub genes is required, as well as identification of the factors that drive activation/deactivation of the pathways involved in STAD progression.