Introduction

Gastric cancer (GC) often presents as an advanced disease with patients either having inoperable conditions or surgery as the only potentially curative treatment [1]. There is evidence that 75% of all GCs are initiated by Helicobacter pylori, a known carcinogenic pathogen [2, 3]. Risk factors also include age, sex, smoking and family history [4]. This oncogenesis leads to Correa’s cascade, a stepwise progression from normal, chronic active gastritis, atrophic gastritis, intestinal metaplasia, dysplasia then adenocarcinomas [3]. Intestinal metaplasia also has two subtypes, incomplete and complete intestinal metaplasia (IIM and CIM, respectively), with IIM having a higher probability of developing GC compared to CIM [5].

Research into GCs has used impactful approaches to investigate the genome [6], therapeutics [7] and survival [8], but these methods have not translated into actionable biomarkers of prognostication, targets, novel therapeutics, or changes in screening strategies. These genomic insights also have not provided insight into which genes are important in the progression of GC for pre-neoplastic detection and treatment.

Here, we present a network-based approach for biomarker and target discovery that uses artificial intelligence (AI) to select genes and then perform rigorous validation in multiple independent GC datasets. Previously, we have successfully exploited this approach to identify biomarkers in IBD [9], COVID-19 [10] and macrophages [11]. We demonstrate how Boolean implications allow us to develop models that provide insight into the gastric cancer disease continuum.

Methods

Detailed methods for computational modeling and AI-guided target identification are presented in Online Resource 1 and mentioned in brief here.

Construction of a network of Boolean implications

We modeled continuum states within the metaplasia → dysplasia → neoplasia cascade using Boolean Network Explorer (BoNE) [9]. We created an asymmetric gene expression network, for the progression from normal to gastric cancer (GC), using a computational method based on Boolean logic [12]. To build the GC network, we analyzed a publicly available gastric cancer transcriptomic dataset, GSE66229 [13] (n = 400; 300 GC tumor and 100 patient-matched normal tissue). A Boolean Network Explorer (BoNE; see Online Resource 1 for more details) computational tool was introduced, which uses asymmetric properties of Boolean implication relationships (Boolean implication relationships—BIRs—as in MiDReG algorithm [12]) to model natural progressive time-series changes in major cellular compartments that initiate, propagate, and perpetuate cellular state change and are likely to be important for GC progression. BoNE provides an integrated platform for the construction, visualization and querying of a network of progressive changes much like a disease map (in this case, GC map) in three steps: (1) the expression levels of all genes in these datasets were converted to Boolean values (high or low) using the StepMiner algorithm [14]. (2) Gene expression relationships between pairs of genes were classified into six possible BIRs and expressed as Boolean implication statements: two symmetric Boolean implications “equivalent” and “opposite” occur when two diagonally opposite sparse quadrants are identified and four asymmetric relationships, each corresponding to one sparse quadrant. Previous methods of analysis of transcriptomic datasets recognize the two symmetric relationships using correlation, while ignoring the asymmetric relationships. We used BooleanNet statistics to assess the significance of the Boolean implication relationships [12]. Prior work [9] revealed how our Boolean approach offers a distinct advantage from current conventional computational methods that rely on symmetric linear relationships from gene expression data. BIRs are also more robust to the noise of sample heterogeneity (i.e., healthy, diseased, genotypic, phenotypic, ethnic, interventions, and disease severity) compared to other methods and every sample follows the same mathematical equation. This makes BIRs identified in our methods likely to be reproducible in independent validation datasets. (3) A Boolean implications network was created using the identified BIRs. Clusters are defined by groups of genes that are equivalent to at least half of the genes in the rest of the cluster. The clusters were connected with directed edges by identifying the majority Boolean relationships between two clusters. The resulting Boolean implication network contains clusters of genes which are the nodes and the BIR between the clusters are the directed edges. BoNE enables their discovery in an unsupervised way without the bias of sample type. Gene expression datasets were visualized using Hierarchical Exploration of Gene Expression Microarrays Online (HEGEMON) framework [9].

Ordering samples based on composite score of Boolean path

A Boolean path contains one or more clusters. A composite score for each sample is calculated to provide a summary of the genes expressed in the Boolean path. The composite score is calculated using the following steps: (1) the genes in each cluster were normalized and averaged. Gene expression values were normalized according to a modified Z-score approach centered around StepMiner threshold (formula = (expr—SThr)/3/stddev). (2) A weighted linear combination of the averages from the clusters of a Boolean path was used to create a score for each sample. We either monotonically increased or decreased the weights along the path to make the sample order consistent with the logical order based on BIR. We then order the samples based on the final weighted and linearly combined score. If a cluster is highly expressed in a disease setting, it received a positive weight (ex: 1, 2, 3, etc.) and if a cluster is highly expressed in a healthy setting, it received a negative weight (ex: − 1, − 2, − 3, etc.).

Multivariate analysis for model selection

We used two microarray datasets (GSE37023 (only samples on GPL96 Affymetrix Human Genome U133A Array used for analysis), n = 65, non-malignant = 36, GC tumor = 29; GSE122401, n = 160, patient-matched normal = 80, GC tumor = 80) to train a network model that should distinguish normal vs GC samples. Using Ordinary Least Squares (OLS) regression in Python statsmodels (version 0.12.2), we performed multivariate analysis to determine which models performed best in the two training datasets.

Statistical analysis

Statistical significance between experimental groups was determined using Python scipy.stats.ttest_ind package (version 0.19.0) with Welch’s two sample t test (two-tailed, unpaired, unequal variance (equal_var = False), and unequal sample size). For all tests, a p value of 0.05 was used as the cutoff to determine significance. Violin and bar plots are created using Python seaborn package version 0.10.1.

Results

Machine learning identified two possible Boolean paths in the GC disease map

Using a publicly available GC dataset (GSE66229) that is comprised of tumor (T) and adjacent normal (AN) samples, we built a Boolean implication network (See Methods and Online Resource 1; Fig. 1a). Each cluster was evaluated to determine whether they fall on the healthy versus GC side of the disease map based on whether the average gene expression value of a cluster in healthy samples is up or down, yielding a GC map (Fig. 1b). We then used machine learning to identify Boolean paths (clusters connected by Boolean implication relationships) in the GC map that can distinguish tumor from AN samples in the training datasets (Fig. 1C top graphic). Clusters #11-2-4–14 (C#11-2-4-14) performed the best with an ROC-AUC of 0.96 in training dataset #1 (GSE37023 AN versus T), while clusters #7-13-14 (C#7-13-14) performed best in training dataset #2 (GSE122401 AN vs T) with an ROC-AUC of 0.98 (Fig. 1c). Specific violin plots for both datasets and Boolean paths are presented in Fig. 1d. We performed Reactome pathway analysis on clusters in both paths to identify the top five biological processes associated with the clusters (Fig. 1e). Cluster 11 involves the downregulation of genes related to muscle contraction in GC. Cluster 2 represents genes relevant to cell cycle as many other studies pointed out their relevance in the context of GC [15, 16]. Cluster 4 had genes from the immune system including neutrophil degranulation as linked in other papers [17,18,19]. Clusters 7 and 13 had genes involved in the downregulation of ion channel transport in GC [20, 21]. Cluster 14 represents genes increased in extracellular matrix (ECM) processes, indicating our findings that ECM is altered early during cell transformation is in keeping with what has been observed by others [22, 23]. Since both Boolean paths C#11-2-4-14 and C#7-13-14 can distinguish AN versus GC samples, we identified a gene signature called GC-BoNE uses the path that best characterized the different samples (highest ROC-AUC score out of both paths) for classification of samples.

Fig. 1
figure 1

Generation and validation of Boolean implication network-derived gastric cancer (GC) signature. a Schematic summarizing the workflow to build a Boolean map using a gastric cancer microarray dataset containing tumor and adjacent normal samples (GSE66229). b Disease map representing the continuum from normal stomach to gastric cancer. c Selection of Boolean path using machine learning on two training datasets (GSE37023 and GSE122401). Multivariate regression was used to determine which path best separated the tumor from the adjacent normal samples. Coefficient of each path score (at the center) with 95% confidence intervals (as error bars) and the p values were illustrated in the bar plot. The p value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). d Violin plots showing the top Boolean paths in each of the training datasets. e Reactome pathway analysis of the gene clusters in the GC-BoNE signature

We tested how well the clusters identified by our Boolean approach would compare to previously established gene signatures (Fig. 2a). C#11-2-4-14 and C#7-13-14 individually (Fig. 2b) could classify the tumor and normal/adjacent normal samples in the 21 validation datasets (see Online Resource 2 for a list of GSE IDs; ROC-AUC ranges from 0.57 to 1.00 in C#11-2-4-14, and 0.66–1.00 in C#7-13-14). We then compared GC-BoNE to other gene signatures (see Online Resource 3 for list of genes in signatures; Fig. 2c) and found that our signature outperformed the others (average ROC-AUC for GC-BoNE is 0.933, and other signatures range from 0.690 to 0.921). There were minimal overlaps between clusters 11-2-4 (Fig. 2d), 7-13 (Fig. 2e) and the top three signatures (DEA (Li 2015), DEA + PPIN and DEA (Junnila 2010) [6]). Cluster 14 and the DEA (Junnila 2010) [6] signature had 8 overlapping genes (Fig. 2f). These findings suggest GC-BoNE provides a new list of potential biomarkers for GC that differ from previous signatures.

Fig. 2
figure 2

Comparison of classification accuracy using GC-BoNE signature versus gene signatures from previous literature for normal versus GC samples. a. Schematic summarizing the workflow to compare GC-BoNE to other gene signatures. b Bar plots of GC datasets comparing normal (N)/AN vs T showing the ROC-AUC values for the Boolean paths in the GC-BoNE signature (11-2-4-14 and 7-13-14). Asterisks (*) after the ROC-AUC values represent the following: *p < = 0.05, **p < = 0.01, ***p < = 0.001, no asterisk: p value > 0.05. c Comparison of average ROC-AUC values for the datasets in B. using GC-BoNE and other gene signatures [6, 49,50,51,52,53,54,55] (DEA Differential Expression Analysis, PPIN Protein–Protein Interaction Network, INGP Ingenuity Pathway analysis, ISS Immune Scoring System, SI Stromal-Immune score; DEA (Junnila 2010) [6] uses the DEGs generated from Japanese patients; See Online Resource 3 for the complete list of genes in these signatures). d-f Venn diagrams showing the overlaps in genes in the top four gene signatures (GC-BoNE, DEA (Li 2015), DEA + PPIN and DEA (Junnila 2010) [6])

GC-BoNE identifies progressively increasing risk of GC along the metaplasia–dysplasia continuum

We next asked if the GC-BoNE signature is induced during the progression from normal to GC through the normal → inflammation (gastritis) → metaplasia → dysplasia → neoplasia cascade (Fig. 3a, b). In one dataset (E-MTAB-8889), we looked at the normal → inflammation (gastritis) → metaplasia cascade by comparing pairwise each sequential step, i.e., non-atrophic gastritis (NAG) vs chronic active gastritis (CG), CG vs chronic atrophic gastritis (CAG) and CAG vs intestinal metaplasia (IM) (Fig. 3c). We also looked at the first step in the cascade vs the other steps, i.e., NAG vs CAG and NAG vs IM (Fig. 3c). In another dataset (GSE55696), we studied the dysplasia → neoplasia cascade, which is typically scored by histopathological examination, as per the Vienna classification [24]; the latter comprises a continuum extending from low to high-grade dysplasia to intramucosal carcinoma. Here, we looked at chronic gastritis (CG) vs low-grade intestinal neoplasia (LGIN), LGIN vs high-grade intestinal neoplasia (HGIN), HGIN vs early gastric cancer (EGC), CG vs HGIN and CG vs EGC (Fig. 3d). We compared GC-BoNE to the other signatures (Fig. 3e) and found that our signature again outperformed the others when looking at progression (see Online Resource 2 for a list of GSE IDs; average ROC-AUC for GC-BoNE is 0.828, and other signatures range from 0.633 to 0.806). These findings suggest the genes identified in GC-BoNE may provide further insight into what initiates GC progression.

Fig. 3
figure 3

GC-BoNE signature in GC progression and mouse models. a Schematic showing the progression from normal mucosa to gastric cancer. b GC disease map with GC-BoNE signature highlighted (C#11-2-4-14 and C#7-13-14). c Violin plots for GC progression of normal active gastritis (NAG) → chronic active gastritis (CG) → chronic atrophic gastritis (CAG) → intestinal metaplasia (IM) (E-MTAB-8889) using the GC-BoNE signature: 11-2-4-14 (left) and 7-13-14 (right). d Violin plots for chronic gastritis (CG) → low-grade intestinal neoplasia (LGIN) → high-grade intestinal neoplasia (HGIN) → early gastric cancer (EGC) (GSE55696) using the GC-BoNE signature: 11-2-4-14 (left) and 7–13-14 (right). e Comparison of average ROC-AUC values for GC progression datasets using GC-BoNE and other gene signatures. f Schematic summarizing comparison of 38 mouse models from 20 GEO datasets using GC-BoNE. g. Top ten mouse models according to –log10(p value) from Welch’s two sample t test separated by path (7-13-14: blue, 11-2-4-14: orange)

GC-BoNE can objectively assess the appropriateness of mouse models for studying human GC

Next, we wanted to identify mouse models that recapitulated human normal versus GC. We analyzed 38 mouse models [25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42] from 20 NCBI GEO datasets using C#11-2-4-14 and C#7-13-14 (see Online Resource 2 for a list of GSE IDs; Fig. 3f). Many of the mouse models had a perfect ROC-AUC of 1.00 using C#11-2-4-14 and C#7-13-14 (see Online Resource 4). We then looked at which mouse models are significantly different using a t test to determine the top ten models (Fig. 3g). It is noteworthy that the top two models represent the two common risk factors for GC in humans. The model that ranked #1 (GSE13873) is one in which the H. pylori infection → GC cascade is modeled in C57Bl6 mouse model of experimental infection with the closely related H. felis. The authors showed that while most infected mice develop premalignant lesions such as gastric atrophy, compensatory epithelial hyperplasia and IM, a minority is completely protected from pre-neoplasia. The models that ranked #2-6 (GSE103639 (NGE vs pCP_GC), GSE45956, GSE103639 (NGE vs pChePS_GC), GSE16902, GSE93774) were all genetically engineered mouse models (GEMMs) in which targeted deletions were performed on genes (CDH1, SMAD4, CLDN18, etc.) that are associated with risk of GC, by virtue of being either the most common germline mutation in GC (CDH1 [43]), or for harboring disease-associated SNPs (SMAD4 [44]) or being the target of the most frequent somatic genomic rearrangements [45] (CLDN18). These results suggest that GC-BoNE can objectively assess the degree of similarity between mouse models (both infection-induced and genetically induced types) and human GC. In doing so, it can pinpoint which mouse models best recapitulate the patterns of gene expression that is observed during the transformation from healthy to GC in human samples.

GC-BoNE (C#11-2-4-14) can prognosticate the risk of IM → GC progression

Since we want to identify genes responsible for the progression of GC, we looked at a dataset that curated samples from a prospective study [46] with long-term follow-up (a mean of 12 ± 3.4 years) to evaluate risk of progression to GC among patients with incomplete or complete intestinal metaplasia (IIM and CIM, respectively) (Fig. 4a). It is known that among the types of intestinal metaplasia, IIM carries a greater risk for progression to GC compared to CIM [47]. A recent meta-analysis showed that compared with CIM, pooled relative risk (RR) of cancer/dysplasia in IIM patients was 4.48 (95% CI 2.50–8.03), and the RR was 4.96 (95% CI 2.72–9.04) for cancer, and 4.82 (95% CI 1.45–16.0) for dysplasia [48]. We found that C#11-2-4-14 best distinguished the healthy control patients (HC), patients with high risk-carrying IIM that progressed (IIM-GC) and those that did not progress (IIM-C) (ROC-AUC values: HC vs IIM-C: 0.86, HC vs IIM-GC: 0.94, IIM-C vs IIM-GC: 0.95; Fig. 4b). C#11-2-4-14 was not able to significantly distinguish (using Student’s t test) low risk-carrying CIM from HC. C#7-13-14 also could distinguish HC vs IIM-C (ROC-AUC = 0.80) and HC vs IIM-GC (ROC-AUC = 0.88), but not IIM-C vs IIM-GC (ROC-AUC = 0.71); however, C#11-2-4-14 performed better (Fig. 4c). In addition to IIM, C#7–13-14 could also significantly distinguish HC vs CIM-C (ROC-AUC = 0.73) and HC vs CIM-GC (ROC-AUC = 0.82), but C#7-13-14 could not distinguish CIM-C vs CIM-GC (ROC-AUC = 0.57). The DEA (Li 2015) gene signature similarly separates HC from the other groups (HC vs IIM-C: ROC-AUC = 0.90, HC vs IIM-GC: 0.87, HC vs CIM-C: 0.91, and HC vs CIM-GC 0.97), but is not able to identify the progressors from the non-progressors (IIM-C vs IIM-GC: 0.38; CIM-C vs CIM-GC: 0.47) (Fig. 4d). The DEA (Junnila 2010) [6] signature cannot significantly distinguish any of the samples (ROC-AUC values range from 0.42 to 0.74; Fig. 4e). These findings suggest genes in C#11-2-4-14 might be key to understanding why some IIM patients progress to GC.

Fig. 4
figure 4

GC-BoNE signature predicts outcome. a Schematic summarizing GSE78523: samples collected from healthy patients (HC) and patients with incomplete IM (IIM) or complete IM (CIM). After a mean of 12 + = 3.4 years, patients with IM were diagnosed as non-progressors (control: C) or progressors (GC). b–e Violin plots showing classification of samples using GC-BoNE, DEA (Li 2015), and DEA (Junnila 2010) signatures (b: 11-2-4-14, c: 7-13-14, d: DEA (Li 2015), e: DEA (Junnila 2010) [6]). f. GSE78523 is visualized as bubble plots of ROC-AUC values (radius of circles is based on the ROC-AUC) demonstrating the direction of gene regulation (Up: red, Down: blue) for the classification of samples (GC-BoNE clusters in columns; sample comparison in rows). P values based on Welch’s t test (of composite score of gene expression values) are provided using the standard code (*p < = 0.05, **p < = 0.01, ***p < = 0.001) next to the ROC-AUC

GC-BoNE provides insights into the changes in cellular continuum states during healthy → IIM → GC progression

To understand which cellular processes change during cell transformation and which genes contribute to the progression of GC, we checked how clusters in C#11-2-4-14 and C#7-13-14 perform separately (Fig. 4f). When looking at HC vs IIM-C (Fig. 4f row i), cluster 14 is not able to distinguish the samples (ROC-AUC = 0.63), but both C#11-2-4 and C#7-13 are able to separate the samples (ROC-AUC = 0.87, 0.89, respectively). However, when you compare IIM-C vs IIM-GC (Fig. 4f row ii), cluster 14 is better able to distinguish the samples (ROC-AUC = 0.86), with C#11-2-4-14 best able to classify the samples (ROC-AUC = 0.95). These results show genes in C#11-2-4 might be responsible for the progression from HC to IIM, while C#14 is important for IIM to GC. Although C#7-13-14 could not distinguish the progressors in CIM, C#7-13 alone could identify the progressors from the non-progressors (CIM-C vs CIM-GC ROC-AUC = 0.81). Findings thereby suggest that there may be two paths to GC: progression from HC to IIM may be impacted by genes related to muscle contraction, cell cycle and immune system, progression from IIM to GC is affected by extracellular matrix processes and progression from HC to CIM to GC is impacted by genes related to ion transport, which is expected to induce acid/base disturbances and barrier dysfunction, causing gastric acid-related diseases such as CAG and GC [20].

Discussion

Although the incidence rates of GC have been decreasing around the world [4], there have not been any significant improvements in terms of new therapeutics, diagnostics and changes in screening designed for pre-neoplastic stages. In this study, we built a Boolean implication network using GSE66229 and used machine learning (on GSE37023 and GSE122401) to identify a gene signature (GC-BoNE) which could classify normal and gastric samples. Reactome pathway analysis of GC-BoNE revealed C#11-2-4-14 contains genes that control infection-inflammation: increase in cell cycle related genes in C#2 may lead to abnormal cell proliferation [15, 16], increase in immune system genes in C#4 may lead to inflammation in the cells [17,18,19] and increase in ECM genes in C#14 may lead to a remodeled ECM [22, 23]. Changes in genes in C#7-13-14 signify ion transporter related abnormalities, which in parietal cells can lead to the onset of GC [20, 21]. Although previous studies have identified most of these pathways [15,16,17,18,19,20,21,22,23], muscle contraction has not been widely identified. We then tested how GC-BoNE compares to gene signatures from past studies in both normal vs GC samples (Fig. 2c) and GC progression samples (Figs. 3c and 4f).

Our Boolean network-based approach improves upon past studies by first identifying a gene signature (GC-BoNE) that is better able to classify samples along the GC disease continuum compared to previous signatures. When looking at normal vs GC samples, many of the signatures performed well (Fig. 2c). However, we are more interested in finding a gene signature that can distinguish samples earlier in the GC disease continuum. When looking at GC progression, our signature outperforms the other gene signatures (Fig. 3c). Since the genes in GC-BoNE do not overlap with many genes from the other gene signatures (Fig. 1e), this provides a list of new potential biomarkers for targeting therapeutics at different points along the GC disease continuum.

Second, we found that GC-BoNE may have identified two paths that lead from pre-neoplasia to GC. C#11-2-4-14 showcases the immune cell processes which predicted the risk for HC to IIM to GC progression while C#7-13-14 signifies the ion transporter abnormalities seen in HC to CIM to GC (Fig. 5). Although the model was built and trained on N vs GC samples, using a Boolean network-based approach allows us to identify paths that can also determine the intermediate states of disease progression. The invariant asymmetric Boolean implications present in the GC-BoNE signature provide insight into the cellular changes occurring at various time points along the disease continuum. We do not know which cluster is associated with which pre-neoplastic condition, but GC-BoNE provides a list of gene targets that can be tested using the mouse models we identified (Fig. 3e) or other models.

Fig. 5
figure 5

Graphic overview. Graphical overview of the gastric cancer continuum path, GC-BoNE signatures and reactome pathway summary of the genes in each cluster

Although this work provides a new set of genes that can be targeted for GC and precancerous conditions, we were not able to rigorously test whether GC-BoNE could identify patients with early lesions such as chronic atrophic gastritis who are at highest risk of progression to GC. We identified six additional datasets (GSE69144, GSE153224, GSE83389, GSE116312, GSE106656, GSE134520) from gastritis samples. One of them is a prospective study (GSE69144) that looked at whether precancerous gastric lesions progressed over time (multifocal atrophic gastritis to intestinal metaplasia or intestinal metaplasia to dysplasia). Since the data was profiled on a DASL Human Cancer Panel microarray, many genes in GC-BoNE were not included in the generation of the violin plot (0/240 genes available for C#11, 0/28 genes for C#7 and 0/14 genes for C#13 and 6/23 genes from C#14; Online Resource 5c). The resulting violin plot indicates we may not be able to predict which patients will progress using the available genes on a DASL cancer panel (progressors follow-up samples have lower scores than at baseline). The other datasets are small and did not show consistent patterns (Online Resource 5d-h). Due to the limited availability of datasets, we conclude that additional prospective studies at all stages of GC progression are necessary before we can fully evaluate the capability of GC-BoNE derived gene signatures to identify high-risk patients.

Overall, we demonstrate that the genes identified from our Boolean network-based approach were better able to classify samples along the GC disease continuum compared to the genes from previous work. The genes from GC-BoNE provide more opportunities to research the cellular processes behind GC progression. Results from this paper can be used to rationalize gene targets for diagnostics and therapeutics.