Artificial intelligence-guided discovery of gastric cancer continuum

Background Detailed understanding of pre-, early and late neoplastic states in gastric cancer helps develop better models of risk of progression to gastric cancers (GCs) and medical treatment to intercept such progression. Methods We built a Boolean implication network of gastric cancer and deployed machine learning algorithms to develop predictive models of known pre-neoplastic states, e.g., atrophic gastritis, intestinal metaplasia (IM) and low- to high-grade intestinal neoplasia (L/HGIN), and GC. Our approach exploits the presence of asymmetric Boolean implication relationships that are likely to be invariant across almost all gastric cancer datasets. Invariant asymmetric Boolean implication relationships can decipher fundamental time-series underlying the biological data. Pursuing this method, we developed a healthy mucosa → GC continuum model based on this approach. Results Our model performed better against publicly available models for distinguishing healthy versus GC samples. Although not trained on IM and L/HGIN datasets, the model could identify the risk of progression to GC via the metaplasia → dysplasia → neoplasia cascade in patient samples. The model could rank all publicly available mouse models for their ability to best recapitulate the gene expression patterns during human GC initiation and progression. Conclusions A Boolean implication network enabled the identification of hitherto undefined continuum states during GC initiation. The developed model could now serve as a starting point for rationalizing candidate therapeutic targets to intercept GC progression. Supplementary Information The online version contains supplementary material available at 10.1007/s10120-022-01360-3.


Introduction
Gastric cancer (GC) often presents as an advanced disease with patients either having inoperable conditions or surgery as the only potentially curative treatment [1]. There is evidence that 75% of all GCs are initiated by Helicobacter pylori, a known carcinogenic pathogen [2,3]. Risk factors also include age, sex, smoking and family history [4]. This oncogenesis leads to Correa's cascade, a stepwise progression from normal, chronic active gastritis, atrophic gastritis, intestinal metaplasia, dysplasia then adenocarcinomas [3]. Intestinal metaplasia also has two subtypes, incomplete and complete intestinal metaplasia (IIM and CIM, respectively), with IIM having a higher probability of developing GC compared to CIM [5].
Research into GCs has used impactful approaches to investigate the genome [6], therapeutics [7] and survival [8], but these methods have not translated into actionable biomarkers of prognostication, targets, novel therapeutics, or changes in screening strategies. These genomic insights also have not provided insight into which genes are important in the progression of GC for pre-neoplastic detection and treatment.
Here, we present a network-based approach for biomarker and target discovery that uses artificial intelligence (AI) to select genes and then perform rigorous validation in multiple independent GC datasets. Previously, we have successfully exploited this approach to identify biomarkers in IBD [9], COVID-19 [10] and macrophages [11]. We demonstrate how Boolean implications allow us to develop models that provide insight into the gastric cancer disease continuum.

Methods
Detailed methods for computational modeling and AIguided target identification are presented in Online Resource 1 and mentioned in brief here.

Construction of a network of Boolean implications
We modeled continuum states within the metaplasia → dysplasia → neoplasia cascade using Boolean Network Explorer (BoNE) [9]. We created an asymmetric gene expression network, for the progression from normal to gastric cancer (GC), using a computational method based on Boolean logic [12]. To build the GC network, we analyzed a publicly available gastric cancer transcriptomic dataset, GSE66229 [13] (n = 400; 300 GC tumor and 100 patient-matched normal tissue). A Boolean Network Explorer (BoNE; see Online Resource 1 for more details) computational tool was introduced, which uses asymmetric properties of Boolean implication relationships (Boolean implication relationships-BIRs-as in MiDReG algorithm [12]) to model natural progressive time-series changes in major cellular compartments that initiate, propagate, and perpetuate cellular state change and are likely to be important for GC progression. BoNE provides an integrated platform for the construction, visualization and querying of a network of progressive changes much like a disease map (in this case, GC map) in three steps: (1) the expression levels of all genes in these datasets were converted to Boolean values (high or low) using the StepMiner algorithm [14]. (2) Gene expression relationships between pairs of genes were classified into six possible BIRs and expressed as Boolean implication statements: two symmetric Boolean implications "equivalent" and "opposite" occur when two diagonally opposite sparse quadrants are identified and four asymmetric relationships, each corresponding to one sparse quadrant. Previous methods of analysis of transcriptomic datasets recognize the two symmetric relationships using correlation, while ignoring the asymmetric relationships. We used BooleanNet statistics to assess the significance of the Boolean implication relationships [12]. Prior work [9] revealed how our Boolean approach offers a distinct advantage from current conventional computational methods that rely on symmetric linear relationships from gene expression data. BIRs are also more robust to the noise of sample heterogeneity (i.e., healthy, diseased, genotypic, phenotypic, ethnic, interventions, and disease severity) compared to other methods and every sample follows the same mathematical equation. This makes BIRs identified in our methods likely to be reproducible in independent validation datasets. (3) A Boolean implications network was created using the identified BIRs. Clusters are defined by groups of genes that are equivalent to at least half of the genes in the rest of the cluster. The clusters were connected with directed edges by identifying the majority Boolean relationships between two clusters. The resulting Boolean implication network contains clusters of genes which are the nodes and the BIR between the clusters are the directed edges. BoNE enables their discovery in an unsupervised way without the bias of sample type. Gene expression datasets were visualized using Hierarchical Exploration of Gene Expression Microarrays Online (HEGEMON) framework [9].

Ordering samples based on composite score of Boolean path
A Boolean path contains one or more clusters. A composite score for each sample is calculated to provide a summary of the genes expressed in the Boolean path. The composite score is calculated using the following steps: (1) the genes in each cluster were normalized and averaged. Gene expression values were normalized according to a modified Z-score approach centered around StepMiner threshold (formula = (expr-SThr)/3/stddev). (2) A weighted linear combination of the averages from the clusters of a Boolean path was used to create a score for each sample. We either monotonically increased or decreased the weights along the path to make the sample order consistent with the logical order based on BIR. We then order the samples based on the final weighted and linearly combined score. If a cluster is highly expressed in a disease setting, it received a positive weight (ex: 1, 2, 3, etc.) and if a cluster is highly expressed in a healthy setting, it received a negative weight (ex: − 1, − 2, − 3, etc.).

Multivariate analysis for model selection
We used two microarray datasets (GSE37023 (only samples on GPL96 Affymetrix Human Genome U133A Array used for analysis), n = 65, non-malignant = 36, GC tumor = 29; GSE122401, n = 160, patient-matched normal = 80, GC tumor = 80) to train a network model that should distinguish normal vs GC samples. Using Ordinary Least Squares 1 3 (OLS) regression in Python statsmodels (version 0.12.2), we performed multivariate analysis to determine which models performed best in the two training datasets.

Statistical analysis
Statistical significance between experimental groups was determined using Python scipy.stats.ttest_ind package (version 0.19.0) with Welch's two sample t test (two-tailed, unpaired, unequal variance (equal_var = False), and unequal sample size). For all tests, a p value of 0.05 was used as the cutoff to determine significance. Violin and bar plots are created using Python seaborn package version 0.10.1.

Machine learning identified two possible Boolean paths in the GC disease map
Using a publicly available GC dataset (GSE66229) that is comprised of tumor (T) and adjacent normal (AN) samples, we built a Boolean implication network (See Methods and Online Resource 1; Fig. 1a). Each cluster was evaluated to determine whether they fall on the healthy versus GC side of the disease map based on whether the average gene expression value of a cluster in healthy samples is up or down, yielding a GC map (Fig. 1b). We then used machine learning to identify Boolean paths (clusters connected by Boolean implication relationships) in the GC map that can distinguish tumor from AN samples in the training datasets ( Fig. 1C top graphic). Clusters #11-2-4-14 (C#11-2-4-14) performed the best with an ROC-AUC of 0.96 in training dataset #1 (GSE37023 AN versus T), while clusters #7-13-14 (C#7-13-14) performed best in training dataset #2 (GSE122401 AN vs T) with an ROC-AUC of 0.98 (Fig. 1c). Specific violin plots for both datasets and Boolean paths are presented in Fig. 1d. We performed Reactome pathway analysis on clusters in both paths to identify the top five biological processes associated with the clusters (Fig. 1e). Cluster 11 involves the downregulation of genes related to muscle contraction in GC. Cluster 2 represents genes relevant to cell cycle as many other studies pointed out their relevance in the context of GC [15,16]. Cluster 4 had genes from the immune system including neutrophil degranulation as linked in other papers [17][18][19]. Clusters 7 and 13 had genes involved in the downregulation of ion channel transport in GC [20,21]. Cluster 14 represents genes increased in extracellular matrix (ECM) processes, indicating our findings that ECM is altered early during cell transformation is in keeping with what has been observed by others [22,23]. Since both Boolean paths C#11-2-4-14 and C#7-13-14 can distinguish AN versus GC samples, we identified a gene signature called GC-BoNE uses the path that best characterized the different samples (highest ROC-AUC score out of both paths) for classification of samples.
We tested how well the clusters identified by our Boolean approach would compare to previously established gene signatures (Fig. 2a). C#11-2-4-14 and C#7-13-14 individually ( Fig. 2b) could classify the tumor and normal/adjacent normal samples in the 21 validation datasets (see Online Resource 2 for a list of GSE IDs; ROC-AUC ranges from 0.57 to 1.00 in C#11-2-4-14, and 0.66-1.00 in C#7- [13][14]. We then compared GC-BoNE to other gene signatures (see Online Resource 3 for list of genes in signatures; Fig. 2c) and found that our signature outperformed the others (average ROC-AUC for GC-BoNE is 0.933, and other signatures range from 0.690 to 0.921). There were minimal overlaps between clusters 11-2-4 (

GC-BoNE identifies progressively increasing risk of GC along the metaplasia-dysplasia continuum
We next asked if the GC-BoNE signature is induced during the progression from normal to GC through the normal → inflammation (gastritis) → metaplasia → dysplasia → neoplasia cascade (Fig. 3a, b). In one dataset (E-MTAB-8889), we looked at the normal → inflammation (gastritis) → metaplasia cascade by comparing pairwise each sequential step, i.e., non-atrophic gastritis (NAG) vs chronic active gastritis (CG), CG vs chronic atrophic gastritis (CAG) and CAG vs intestinal metaplasia (IM) (Fig. 3c). We also looked at the first step in the cascade vs the other steps, i.e., NAG vs CAG and NAG vs IM (Fig. 3c). In another dataset (GSE55696), we studied the dysplasia → neoplasia cascade, which is typically scored by histopathological examination, as per the Vienna classification [24]; the latter comprises a continuum extending from low to high-grade dysplasia to intramucosal carcinoma. Here, we looked at chronic gastritis (CG) vs low-grade intestinal neoplasia (LGIN), LGIN vs high-grade intestinal neoplasia (HGIN), HGIN vs early gastric cancer (EGC), CG vs HGIN and CG vs EGC (Fig. 3d). We compared GC-BoNE to the other signatures ( Fig. 3e) and found that our signature again outperformed the others when looking at progression (see Online Resource 2 for a list of GSE IDs; average ROC-AUC for GC-BoNE is 0.828, and other signatures range from 0.633 to 0.806). These findings suggest the genes identified in GC-BoNE may provide further insight into what initiates GC progression.

GC-BoNE can objectively assess the appropriateness of mouse models for studying human GC
Next, we wanted to identify mouse models that recapitulated human normal versus GC. We analyzed 38 mouse models [25][26][27][28][29][30][31][32][33][34][35][36][37][38][39][40][41][42] from 20 NCBI GEO datasets using C#11-2-4-14 and C#7-13-14 (see Online Resource 2 for a list of GSE IDs; Fig. 3f). Many of the mouse models had a perfect ROC-AUC of 1.00 using C#11-2-4-14 and C#7-13-14 (see Online Resource 4). We then looked at which mouse models are significantly different using a t test to determine the top ten models (Fig. 3g). It is noteworthy that the top two models represent the two common risk factors for GC in humans. The model that ranked #1 (GSE13873) is one in which the H. pylori infection → GC cascade is modeled in C57Bl6 mouse model of experimental infection with the closely related H. felis. The authors showed that while most infected mice develop premalignant lesions such as gastric atrophy, compensatory epithelial hyperplasia and IM, a minority is completely protected from pre-neoplasia. The models that ranked #2-6 (GSE103639 (NGE vs pCP_ GC), GSE45956, GSE103639 (NGE vs pChePS_GC), GSE16902, GSE93774) were all genetically engineered mouse models (GEMMs) in which targeted deletions were performed on genes (CDH1, SMAD4, CLDN18, etc.) that are associated with risk of GC, by virtue of being either the most common germline mutation in GC (CDH1 [43]), or for harboring disease-associated SNPs (SMAD4 [44]) or being the target of the most frequent somatic genomic rearrangements [45] (CLDN18). These results suggest that GC-BoNE can objectively assess the degree of similarity between mouse models (both infection-induced and genetically induced types) and human GC. In doing so, it can pinpoint which mouse models best recapitulate the patterns of gene expression that is observed during the transformation from healthy to GC in human samples. (C#11-2-4-14) can prognosticate the risk of IM → GC progression Since we want to identify genes responsible for the progression of GC, we looked at a dataset that curated samples from a prospective study [46] with long-term follow-up (a mean of 12 ± 3.4 years) to evaluate risk of progression to GC among patients with incomplete or complete intestinal metaplasia (IIM and CIM, respectively) (Fig. 4a). It is known that among the types of intestinal metaplasia, IIM carries a greater risk for progression to GC compared to CIM [47]. A recent meta-analysis showed that compared with CIM, pooled relative risk (RR) of cancer/dysplasia in IIM patients was 4.48 (95% CI 2.50-8.03), and the RR was 4.96 (95% CI 2.72-9.04) for cancer, and 4.82 (95% CI 1.45-16.0) for dysplasia [48]. We found that C#11-2-4-14 best distinguished the healthy control patients (HC), patients with high risk-carrying IIM that progressed (IIM-GC) and those that did not progress (

GC-BoNE provides insights into the changes in cellular continuum states during healthy → IIM → GC progression
To understand which cellular processes change during cell transformation and which genes contribute to the progression of GC, we checked how clusters in C#11-2-4-14 and C#7-13-14 perform separately (Fig. 4f). When looking at HC vs IIM-C (Fig. 4f row i), cluster 14 is not Fig. 1 Generation and validation of Boolean implication networkderived gastric cancer (GC) signature. a Schematic summarizing the workflow to build a Boolean map using a gastric cancer microarray dataset containing tumor and adjacent normal samples (GSE66229). b Disease map representing the continuum from normal stomach to gastric cancer. c Selection of Boolean path using machine learning on two training datasets (GSE37023 and GSE122401). Multivariate regression was used to determine which path best separated the tumor from the adjacent normal samples. Coefficient of each path score (at the center) with 95% confidence intervals (as error bars) and the p values were illustrated in the bar plot. The p value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). LGIN (19) HGIN (20) EGC (19) p=0.00803 IIM-C (7) IIM-GC (6) CIM-C (9) CIM-GC (8) p=0  (15) IIM-C (7) IIM-GC (6) CIM-C (9) CIM-GC (8) p=0 Row: is affected by extracellular matrix processes and progression from HC to CIM to GC is impacted by genes related to ion transport, which is expected to induce acid/base disturbances and barrier dysfunction, causing gastric acidrelated diseases such as CAG and GC [20].

Discussion
Although the incidence rates of GC have been decreasing around the world [4], there have not been any significant improvements in terms of new therapeutics, diagnostics and changes in screening designed for pre-neoplastic stages. In this study, we built a Boolean implication network using GSE66229 and used machine learning (on GSE37023 and GSE122401) to identify a gene signature (GC-BoNE) which could classify normal and gastric samples. Reactome pathway analysis of GC-BoNE revealed C#11-2-4-14 contains genes that control infection-inflammation: increase in cell cycle related genes in C#2 may lead to abnormal cell proliferation [15,16], increase in immune system genes in C#4 may lead to inflammation in the cells [17][18][19] and increase in ECM genes in C#14 may lead to a remodeled ECM [22,23]. Changes in genes in C#7-13-14 signify ion transporter related abnormalities, which in parietal cells can lead to the onset of GC [20,21]. Although previous studies have identified most of these pathways [15][16][17][18][19][20][21][22][23], muscle contraction has not been widely identified. We then tested how GC-BoNE compares to gene signatures from past studies in both normal vs GC samples (Fig. 2c) and GC progression samples (Figs. 3c and 4f). Our Boolean network-based approach improves upon past studies by first identifying a gene signature (GC-BoNE) that is better able to classify samples along the GC disease continuum compared to previous signatures. When looking at normal vs GC samples, many of the signatures performed well (Fig. 2c). However, we are more interested in finding a gene signature that can distinguish samples earlier in the GC   (Fig. 3c). Since the genes in GC-BoNE do not overlap with many genes from the other gene signatures (Fig. 1e), this provides a list of new potential biomarkers for targeting therapeutics at different points along the GC disease continuum.
Second, we found that GC-BoNE may have identified two paths that lead from pre-neoplasia to GC. C#11-2-4-14 showcases the immune cell processes which predicted the risk for HC to IIM to GC progression while C#7-13-14 signifies the ion transporter abnormalities seen in HC to CIM to GC (Fig. 5). Although the model was built and trained on N vs GC samples, using a Boolean networkbased approach allows us to identify paths that can also determine the intermediate states of disease progression. The invariant asymmetric Boolean implications present in the GC-BoNE signature provide insight into the cellular changes occurring at various time points along the disease continuum. We do not know which cluster is associated with which pre-neoplastic condition, but GC-BoNE provides a list of gene targets that can be tested using the mouse models we identified (Fig. 3e) or other models.
Although this work provides a new set of genes that can be targeted for GC and precancerous conditions, we were not able to rigorously test whether GC-BoNE could identify patients with early lesions such as chronic atrophic gastritis who are at highest risk of progression to GC. We identified six additional datasets (GSE69144, GSE153224, GSE83389, GSE116312, GSE106656, GSE134520) from gastritis samples. One of them is a prospective study (GSE69144) that looked at whether precancerous gastric lesions progressed over time (multifocal atrophic gastritis to intestinal metaplasia or intestinal metaplasia to dysplasia). Since the data was profiled on a DASL Human Cancer Panel microarray, many genes in GC-BoNE were not included in the generation of the violin plot (0/240 genes available for C#11, 0/28 genes for C#7 and 0/14 genes for C#13 and 6/23 genes from C#14; Online Resource 5c). The resulting violin plot indicates we may not be able to predict which patients will progress using the available genes on a DASL cancer panel (progressors follow-up samples have lower scores than at baseline). The other datasets are small and did not show consistent patterns (Online Resource 5d-h). Due to the limited availability of datasets, we conclude that additional prospective studies at all stages of GC progression are necessary before we can fully evaluate the capability of GC-BoNE derived gene signatures to identify high-risk patients.
Overall, we demonstrate that the genes identified from our Boolean network-based approach were better able to classify samples along the GC disease continuum compared to the genes from previous work. The genes from GC-BoNE provide more opportunities to research the cellular processes behind GC progression. Results from this paper can be used to rationalize gene targets for diagnostics and therapeutics.