Diagnostic Performance of Artificial Intelligence-Centred Systems in the Diagnosis and Postoperative Surveillance of Upper Gastrointestinal Malignancies Using Computed Tomography Imaging: A Systematic Review and Meta-Analysis of Diagnostic Accuracy

Background. Upper gastrointestinal cancers are aggressive malignancies with poor prognosis, even following multimodality therapy. As such, they require timely and accurate diagnostic and surveillance strategies; however, such radiological workflows necessitate considerable expertise and resource to maintain. In order to lessen the workload upon already stretched health systems, there has been increasing focus on the development and use of artificial intelligence (AI)-centred diagnostic systems. This systematic review summarizes the clinical applicability and diagnostic performance of AI-centred systems in the diagnosis and surveillance of esophagogastric cancers. Methods. A systematic review was performed using the MEDLINE, EMBASE, Cochrane Review, and Scopus databases. Articles on the use of AI and radiomics for the diagnosis and surveillance of patients with esophageal cancer were evaluated, and quality assessment of studies was performed using the QUADAS-2 tool. A meta-analysis was performed to assess the diagnostic accuracy of sequencing methodologies. Results. Thirty-six studies that described the use of AI were included in the qualitative synthesis and six studies involving 1352 patients were included in the quantitative analysis. Of these six studies, four assessed the utility of AI in gastric cancer diagnosis, one assessed its utility for diagnosing esophageal cancer, and one assessed its utility for surveillance. The pooled sensitivity and specificity were 73.4% (64.6–80.7) and 89.7% (82.7–94.1), respectively. Conclusions. AI systems have shown promise in diagnosing and monitoring esophageal and gastric cancer, particularly when combined with existing diagnostic methods. Further work is needed to develop systems of greater accuracy and to give greater consideration to the clinical workflows into which they aim to integrate.

Esophageal cancer is an aggressive cancer with a mean estimated 5-year survival rate of 35-45%, even after treatment with curative intent. 1,2 The reported survival rate in advanced-stage disease drops further to 5-10% and can be attributed to the malignancy's insidious onset and aggressive tumor biology that often favors recurrence. [3][4][5] Similarly, gastric cancer has a poor 5-year survival rate and is still the third leading cause of malignancy-related death worldwide. 6 A number of investigations, such as computed tomography (CT) scans, positron emission tomography (PET) scans, endoscopic ultrasound (EUS), and endobronchial ultrasound (EBUS), are utilized in the diagnostic and staging pathway of esophagogastric (EG) malignancy, with CT being the most commonly used. 7 Unlike colorectal, hepatocellular, and pancreatic cancers, there is no reliable biomarker that can be tested and tracked non-invasively for diagnostic or surveillance purposes in esophageal and gastric cancers. [8][9][10] Consequently, patients are often reliant on radiological investigations for diagnosis with staging, detection of recurrence, and monitoring response to treatment. 7 These workflows necessitate both timely and expert radiological interpretation, a requirement that is often difficult to achieve given busy clinical work schedules and a lack of expertise outside tertiary oncological centers. As such, there have been increasing calls to explore the use of AI-centred diagnostic systems to alleviate this issue.
In the context of medical diagnostics, AI is the use of a system to mimic human cognition in the comprehension, analysis, and presentation of medical data. [11][12][13] This is often achieved using machine learning (ML), which is a specialized sub-field within AI that improves the performance of systems through repetitive experience. For example, in EG cancers, ML has been used extensively by AI systems to understand endoscopy images and enhance the interpretation of solely operator-dependent endoscopy. [14][15][16] Naturally, the next step will be the integration of AI into the major imaging modalities used in the management of EG cancers, specifically CT scans. Typically, this involves the high-throughput extraction of large quantities of data from the images, a technique termed radiomics. Radiomics is an emerging field using a non-invasive approach to extract numerous quantitative features from medical images, especially parameters not visible to the naked human eye or quantifiable by routine analysis. 17,18 Specifically, with CT scans, radiomics combined with ML offers a workflow of acquiring images; segmenting images into regions of interest (ROIs) or volumes of interest (VOIs); extracting quantitative imaging features from ROIs and VOIs; and, lastly, constructing and validating models. Recently, there has been an increase in work reporting on the combined or individual use of AI or radiomics to diagnose or monitor EG cancers. This review aims to summarize the potential applicability of AI diagnostic systems in the diagnosis and surveillance of esophageal and gastric cancers.
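The feature-extraction step of the radiomics workflow described above can be illustrated with a minimal sketch. It assumes a CT slice held as a NumPy array and a binary ROI mask; the function name `first_order_features`, the choice of four first-order features, and the toy pixel values are illustrative assumptions and are not taken from any of the included studies.

```python
import numpy as np

def first_order_features(image, mask, bins=32):
    """Return simple first-order intensity features for pixels inside a mask."""
    voxels = image[mask.astype(bool)].astype(float)
    if voxels.size == 0:
        raise ValueError("mask selects no voxels")
    mean = voxels.mean()
    std = voxels.std()
    # Skewness is undefined for a constant region; report 0.0 in that case.
    skew = 0.0 if std == 0 else float(((voxels - mean) ** 3).mean() / std ** 3)
    # Shannon entropy of the binned intensity histogram (in bits).
    counts, _ = np.histogram(voxels, bins=bins)
    p = counts[counts > 0] / voxels.size
    entropy = float(-(p * np.log2(p)).sum())
    return {"mean": float(mean), "std": float(std),
            "skewness": skew, "entropy": entropy}

# Toy 2D "slice" with synthetic values (not patient data).
image = np.array([[0, 0, 0, 0],
                  [0, 10, 20, 0],
                  [0, 30, 40, 0],
                  [0, 0, 0, 0]])
mask = np.zeros_like(image, dtype=bool)
mask[1:3, 1:3] = True  # hypothetical ROI covering the central 2x2 block

features = first_order_features(image, mask)
```

In practice, published pipelines extract far larger feature sets (texture, shape, and wavelet features) from segmented VOIs; this sketch only shows the general idea of turning a masked region into numeric inputs for a model.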

METHODS
Literature search methods, inclusion and exclusion criteria, outcome measures, and statistical analysis were defined according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. 19 Patients were not involved in the conception, design, analysis, drafting, interpretation, or revision of this research; ethical approval was therefore not required and was not sought for this study.

Literature Search
The following databases were searched: MEDLINE (from 1946 until the first week of April 2021) via OvidSP; MEDLINE In-Process and other non-indexed citations (latest issue) via OvidSP; Ovid EMBASE (from 1974 to the latest issue); and Scopus (from 1996 until the present). The last search was performed on 15 April 2021. Search terms used several strings that were linked by standard modifiers in the following order: 'machine learning', 'artificial intelligence', 'radiomic', 'AI' OR 'ML', as well as 'esophageal cancer', 'esophageal squamous cell cancer', 'esophageal adenocarcinoma', 'ESCC', 'EAC', 'esophageal malignancy', 'upper gastrointestinal cancer', OR 'upper GI cancer'. Additionally, the references of included articles were hand-searched to identify any additional studies.

Selection and Quality Assessment of Studies
Articles were screened for eligibility by SC and VS, and, where conflict arose, a third co-author (SRM) was consulted. Studies were included if they had incorporated the use of AI-centred systems in CT imaging for evaluating esophageal or gastric cancers. Studies with diagnostic, prognostic, and monitoring intents were included. Studies were excluded if they did not evaluate ML, used imaging modalities other than CT, did not include patients with esophageal or gastric cancers, had incomplete data on outcome measures, were not written in the English language, had sample sizes of fewer than 30 patients, or had incompatible designs, including letters, comments and reviews. Studies were assessed for robustness of methodology using the Quality Assessment Tool for Diagnostic Accuracy Studies 2 (QUADAS-2), which comprises four domains covering patient selection, index test, reference standard, and flow of patients through the study and timing of the index test(s) and reference standard. Each domain is evaluated in terms of the risk of bias, and the first three domains are also assessed for any concerns regarding applicability. This highlights aspects of the study design that may be exposed to bias.
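The structure of a QUADAS-2 assessment as described above (four risk-of-bias domains, with applicability rated for the first three only) can be sketched as a small record type. The class name `Quadas2Assessment`, the field names, and the example study are hypothetical; the "low", "high", and "unclear" ratings follow the standard QUADAS-2 judgement options.

```python
from dataclasses import dataclass

DOMAINS = ("patient_selection", "index_test",
           "reference_standard", "flow_and_timing")
# Applicability concerns are rated for the first three domains only.
APPLICABILITY_DOMAINS = DOMAINS[:3]
RATINGS = {"low", "high", "unclear"}

@dataclass
class Quadas2Assessment:
    study: str
    risk_of_bias: dict   # domain -> rating, all four domains
    applicability: dict  # domain -> rating, first three domains

    def __post_init__(self):
        if set(self.risk_of_bias) != set(DOMAINS):
            raise ValueError("risk_of_bias must rate all four domains")
        if set(self.applicability) != set(APPLICABILITY_DOMAINS):
            raise ValueError("applicability covers the first three domains only")
        for rating in list(self.risk_of_bias.values()) + list(self.applicability.values()):
            if rating not in RATINGS:
                raise ValueError(f"unknown rating: {rating}")

    def any_high_risk(self):
        """True if any domain is rated high for bias or applicability."""
        ratings = list(self.risk_of_bias.values()) + list(self.applicability.values())
        return "high" in ratings

# Hypothetical example record (not one of the included studies).
example = Quadas2Assessment(
    study="Hypothetical study A",
    risk_of_bias={d: "low" for d in DOMAINS},
    applicability={d: "low" for d in APPLICABILITY_DOMAINS},
)
```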

Statistical Analysis
All statistical analyses were performed using STATA/SE version 16.0 (StataCorp LLC, College Station, TX, USA). The overall pooled estimates of sensitivity and specificity, with their corresponding 95% confidence intervals (CIs), were calculated using the random-effects model with the metandi command in STATA/SE. Sensitivity was defined as the proportion of patients with esophageal cancer who were correctly identified by AI, while specificity was defined as the proportion of patients without the disease who were correctly identified as such. Forest plots were used to visualize the variation of the diagnostic parameter effect size estimates with 95% CI and weights from the included studies.
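The per-study definitions above can be made concrete with a short sketch that computes sensitivity and specificity from a 2x2 table. The Wilson score interval used here is an assumption chosen for illustration and is not stated in the source as the interval method; the function names and the example counts are hypothetical.

```python
import math

def proportion_with_wilson_ci(successes, total, z=1.96):
    """Return (estimate, lower, upper) using the Wilson score interval."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return p, centre - half, centre + half

def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return {
        "sensitivity": proportion_with_wilson_ci(tp, tp + fn),
        "specificity": proportion_with_wilson_ci(tn, tn + fp),
    }

# Hypothetical counts for a single study (not taken from the review).
result = sensitivity_specificity(tp=80, fp=10, fn=20, tn=90)
```

Each entry of `result` is a tuple of the point estimate followed by the lower and upper 95% limits.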

Study Selection
The database search yielded a total of 1439 studies, of which 137 duplicates were removed. Titles and abstracts of the remaining 1302 studies were screened for eligibility and 648 studies were removed. A further 617 studies were excluded after full-text review due to incompatible outcome measures, study design, or small sample sizes of fewer than 30 patients (Fig. 1). Thirty-seven studies that described the use of ML (a branch of AI) platforms for the diagnosis and surveillance of esophageal and gastric cancers were included in this study (Table 1).
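The study flow reported above can be checked with simple arithmetic. The counts are those given in the text; the variable names are illustrative, and the intermediate number of full-text articles reviewed is derived here rather than stated in the source.

```python
identified = 1439                 # records from the database search
duplicates = 137                  # duplicates removed
excluded_on_title_abstract = 648  # removed at title/abstract screening
excluded_on_full_text = 617       # removed after full-text review

screened = identified - duplicates
full_text_reviewed = screened - excluded_on_title_abstract
included = full_text_reviewed - excluded_on_full_text

print(screened, full_text_reviewed, included)  # 1302 654 37
```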

Quality Appraisal
Assessment of studies using the QUADAS-2 tool showed a low level of bias among the studies (Table 2). The risk of bias and concerns regarding applicability were low across most domains. Some risk of bias was present due to the heterogeneity of the patients included. In addition, most studies reported little on the sensitivity and specificity of the ML algorithms used.

Use of Machine Learning and Radiomics in the Management of Gastric Cancer
Two studies investigated the use of radiomics in diagnosing gastric cancer, specifically in differentiating gastric cancer from other gastric lesions. 24,25 In their study evaluating VOI-based textural features on preoperative arterial phase and portal phase scans of 95 patients, Ba-Ssalamach et al. differentiated gastric adenocarcinoma with an error rate as low as 3.1%. 24 Two studies reported that there was little correlation between radiomic features and histological grades, with areas under the receiver operating characteristic curve (AUCs) below 0.7, 9,10 while five studies evaluated images for lymph node status, vascular invasion, and occult peritoneal metastasis, with AUCs as high as 0.941. [11][12][13][14][15] Of the included studies, two evaluated the use of AI for prognosis after surgical resection for gastric cancers. Li et al. extracted 273 features from each ROI and 485 features from each VOI, and used the least absolute shrinkage and selection operator (LASSO) method to predict overall survival, although the results were not promising in their test set. 26 In contrast, Giganti et al. extracted 107 features from each VOI that were significantly associated with poorer overall survival in patients with resectable gastric cancer. 27 Four studies also investigated the use of AI for predicting response to neoadjuvant chemotherapy. Giganti
Six studies involving 1352 patients provided sufficient data on true-positive, true-negative, false-positive, and false-negative rates for the calculation of sensitivity and specificity. Of these studies, four assessed the utility of AI in gastric cancer diagnosis, one assessed its utility for diagnosing esophageal cancer, and one assessed its utility for surveillance (Table 1). The pooled sensitivity and specificity were 73.4% (64.6-80.7) and 89.7% (82.7-94.1), respectively, as visualized on the forest plot and summary receiver operating characteristic curve (Figs. 2 and 3).
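To illustrate what random-effects pooling of a proportion such as sensitivity involves, the sketch below applies a univariate DerSimonian-Laird estimator on the logit scale. This is a deliberate simplification and an assumption for illustration: the metandi command fits a bivariate model that estimates sensitivity and specificity jointly, so this sketch would not reproduce the pooled values reported above. The function name, the 0.5 continuity correction, and the example counts are all hypothetical.

```python
import math

def pool_proportions_dl(events, totals, z=1.96):
    """Pool proportions on the logit scale with DerSimonian-Laird weights.

    Returns (pooled estimate, lower limit, upper limit) as proportions.
    """
    if not events or len(events) != len(totals):
        raise ValueError("events and totals must be non-empty and equal length")
    logits, variances = [], []
    for e, n in zip(events, totals):
        # 0.5 continuity correction keeps the logit finite for 0% or 100%.
        a, b = e + 0.5, n - e + 0.5
        logits.append(math.log(a / b))
        variances.append(1 / a + 1 / b)
    w = [1 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, logits)) / sum(w)
    # Cochran's Q and the method-of-moments between-study variance.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, logits))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(logits) - 1)) / c) if c > 0 else 0.0
    w_re = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, logits)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))

    def inv_logit(x):
        return 1 / (1 + math.exp(-x))

    return inv_logit(pooled), inv_logit(pooled - z * se), inv_logit(pooled + z * se)

# Hypothetical true-positive counts and diseased totals for three studies.
true_positives = [40, 55, 30]
diseased = [60, 70, 45]
estimate, lower, upper = pool_proportions_dl(true_positives, diseased)
```

Pooling specificity would use the same function with true-negative counts and the number of non-diseased patients in each study.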

DISCUSSION
Our systematic review shows that the application of radiomics and AI for the diagnosis and surveillance of upper gastrointestinal tract malignancies is promising, despite being in its nascency. The included radiological studies show that AI can potentially be used to diagnose cancers, differentiate malignancies from benign lesions, and detect occult disease. AI systems may also be used for staging disease, determining whether surgery will improve survival outcomes in patients with resectable disease, and predicting whether patients will respond to adjuvant or neoadjuvant chemoradiotherapy. Our paper also highlights the different AI platforms available for these purposes and captures their breadth.
The typical patient undergoes several CT scans during their journey, with diagnosis as the primary aim. Applying radiomics and AI to current scans will enable clinicians to predict how patients will respond to treatment and also to assess how they have responded to treatment. In other cancers, radiomic data have provided support to genomic data in generating a prognostic signature that exceeds the accuracy of traditional TNM staging. 34 Given that there is a direct correlation between histopathological response of patients who underwent chemoradiotherapy and the overall survival rate, the ability to assess clinical response will be useful in adjusting the dose and regimens of chemoradiotherapy. 35,36 Our paper has included at least one study using radiomics or AI to assess the response to surgery, chemotherapy, radiotherapy, and immunotherapy, and all report high performance; however, evidence to support the studies described here remains scarce. AI can also help in overcoming technical limitations faced by traditional imaging. For example, Jin et al. combined radiomic and dosimetric analyses to overcome the artefacts in wall thickness created by the regular peristaltic waves of contraction. 23 In another study, Ding et al. showed that their models detected occult peritoneal metastasis more accurately than conventional CT scans. Previous studies, including the Worldwide Esophageal Cancer Collaboration, have reported that survival decreases with the presence of lymph node metastases, and imaging examinations are often the first-line investigations for assessing lymph node status in esophageal cancer. [37][38][39] However, the accuracy of CT in diagnosing the N stage of esophageal cancer was just 59%. 40 Most clinicians use a size criterion of 1 cm to differentiate between benign and malignant enlargement of lymph nodes, but this only has a sensitivity of 30-60% and a somewhat higher specificity of 60-80%. [41][42][43] In their study, Wang et al. showed that support vector machine (SVM) models have better diagnostic capability for lymph node (LN) metastasis than the traditional LN size criteria. 22 Furthermore, Bollschweiler et al. used a different ML methodology, termed artificial neural network (ANN), and reported a diagnostic accuracy of 79% in predicting LN metastasis in esophageal cancer. 44

STRENGTHS AND LIMITATIONS
The strength of our systematic review lies in its up-to-date unified analysis of esophageal and gastric cancers in different countries. We also identified challenges that will need to be overcome for the technology to be implemented into daily clinical practice. Our study has several weaknesses. First, most of the articles included in the study did not report the specificity or sensitivity of their AI technologies, which prevented a more comprehensive quantitative analysis to achieve a pooled statistic for the diagnostic accuracy of AI. This also prevented the stratification of pooled data based on study intent (diagnostic vs. prognostic). Furthermore, the diagnostic or predictive accuracy of AI depends on several parameters, including the specific AI program or model developed, scanning equipment, image preprocessing, acquisition protocols, and image reconstruction algorithms.
Although there is heterogeneity between the studies, most of the work is limited to a few specific groups that have taken an interest in this field. The majority of the studies are based in Asia, and several of the included papers stem from the work of the same group. Within the same group, the data acquisition and processing techniques are identical, but the aims of the studies differed and the studies therefore merited inclusion. For example, in the studies by Jiang et al., the first study evaluated the use of radiomics and AI in characterizing the tumor microenvironment, while the other study focused on identifying occult metastasis. 30 In the same vein, we also included some studies with a sample size of <100. Although small sample sizes lead to a greater degree of variation in the quantitative analysis, these studies were relevant in studying a niche area of treatment response. Larger studies have previously tended to focus on the diagnostic aspects, while other facets such as monitoring for recurrence, response to curative resection and chemotherapy, and tumor heterogeneity are areas that are still in their infancy and hence studied at a smaller scale. Furthermore, this emphasizes the paucity of studies with large sample sizes and hints at areas that need further work within the field of AI in esophagogastric cancers.

FUTURE DIRECTIONS
Future work should be aimed at the 'in silico' bench-to-bedside translation of these technologies. Although we highlight much promise in these technologies, several factors require evaluation before they are employed in routine upper gastrointestinal oncological care:
(1) Use case: There needs to be early clarification in the lifecycle of these AI devices as to (a) their specific clinical task; (b) potential risks and benefits; (c) whether they are used within either new or existing clinical workflows; and (d) whether they are used independently to diagnose disease/recurrence or as a 'second reader' alongside a human clinician. Downstream validation of these systems is dictated by many of these early decisions.
(2) Model development: The development of these systems is reliant on diverse, large-scale, and well-maintained datasets that are accurately labeled for the purposes of model training and internal validation. Systems created upon small single-center datasets with post hoc labeling rarely perform well when subjected to out-of-set testing.
(3) Validation: Independent validation of AI systems is crucial, with comparison against expert clinicians to demonstrate either non-inferiority or superiority in diagnostic performance to be undertaken when feasible. Such evaluations require careful study planning, with the need for diverse demographic representation in test datasets in order to assess for bias.
(4) Infrastructural requirements: Aside from developer considerations, the bottleneck for many contemporary AI products is end-user adoption and experience. There needs to be careful consideration of the IT infrastructural requirements at hospitals in which these technologies may be reasonably deployed.
(5) Cost effectiveness: Lastly, although it is assumed that the introduction of AI systems will lead to cost saving across health systems, this requires formal quantification. If deemed not to be financially beneficial, it may be more cost effective to hire diagnostic clinicians, which is the focus of current large-scale studies.
Furthermore, the power of these models is dependent on a large and diverse diet of datasets. At present, the retrospective single-center work available is insufficient and is limited in size, scope and variety. Given that the largest advances in esophagogastric surgery have occurred based on large prospective studies, the advent of ML only calls for further collaborative efforts at an international level to fully reap the potential of this technology.

CONCLUSION
AI and radiomics have huge potential for the diagnosis and surveillance of esophageal and gastric cancers. There is currently a paucity of large-scale studies evaluating the usefulness of AI and radiomics in esophageal cancer, and the evidence is limited to retrospective studies of small sample sizes. Further progression of their clinical application will require collaborative efforts to generate a large and diverse dataset that can produce an accurate model. This relies on determining the best and most feasible methodology for ML and standardizing this across centers. Hence, further work should focus on these areas.
OPEN ACCESS This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.