An artificial intelligence system for comprehensive pathologic outcome prediction in early gastric cancer through endoscopic image analysis (with video)

Background Accurate prediction of pathologic results for early gastric cancer (EGC) based on endoscopic findings is essential in deciding between endoscopic and surgical resection. This study aimed to develop an artificial intelligence (AI) model to assess comprehensive pathologic characteristics of EGC using white-light endoscopic images and videos.
Methods To train the model, we retrospectively collected 4,336 images and prospectively included 153 videos from patients with EGC who underwent endoscopic or surgical resection. The performance of the model was tested and compared to that of 16 endoscopists (nine experts and seven novices) using a mutually exclusive set of 260 images and 10 videos. Finally, we conducted external validation using 436 images and 89 videos from another institution.
Results After training, the model achieved predictive accuracies of 89.7% for undifferentiated histology, 88.0% for submucosal invasion, 87.9% for lymphovascular invasion (LVI), and 92.7% for lymph node metastasis (LNM), using endoscopic videos. The area under the curve values of the model were 0.992 for undifferentiated histology, 0.902 for submucosal invasion, 0.706 for LVI, and 0.680 for LNM in the test. In addition, the model showed significantly higher accuracy than the experts in predicting undifferentiated histology (92.7% vs. 71.6%), submucosal invasion (87.3% vs. 72.6%), and LNM (87.7% vs. 72.3%). The external validation showed accuracies of 75.6% and 71.9% for undifferentiated histology and submucosal invasion, respectively.
Conclusions AI may assist endoscopists with high predictive performance for differentiation status and invasion depth of EGC. Further research is needed to improve the detection of LVI and LNM.
Supplementary Information The online version contains supplementary material available at 10.1007/s10120-024-01524-3.


Introduction
Gastric cancer is the fifth most common malignancy and the fourth leading cause of cancer-related death worldwide [1]. Although radical surgery was traditionally the only curative treatment for gastric cancer, recent advances in endoscopic resection have demonstrated favorable clinical outcomes in early gastric cancer (EGC), concurrently improving quality of life for patients by preserving the stomach [2].
Endoscopic submucosal dissection (ESD) is considered curative for EGC without lymph node metastasis (LNM). Owing to the lack of reliable imaging methods to precisely detect LNM in EGC [3,4], current guidelines recommend curative criteria for ESD based on pathologic features in resected specimens associated with a minimal risk of LNM [5,6]. These factors include the differentiation status, invasion depth, and lymphovascular invasion (LVI) of the tumor. Since these characteristics are confirmed postoperatively, the accurate prediction of pathologic outcomes before treatment is essential to select the optimal curative approach between endoscopic and surgical resection.
Endoscopists perform forceps biopsies with the assistance of magnifying endoscopy with narrow-band imaging (ME-NBI) to evaluate differentiation status, and endoscopic ultrasonography (EUS) to detect submucosal invasion before deciding the treatment strategy for EGC. However, previous studies have revealed significant histologic discrepancies between biopsies and resected specimens, potentially leading to non-curative ESD or missed opportunities for ESD in surgical cases [7–9]. In addition, EUS is not superior to conventional endoscopy in determining the invasion depth of EGC, with an accuracy of approximately 70% [10–12]. Therefore, a detailed assessment of endoscopic features by physicians is essential for predicting pathologic results in EGC.
With advancements in deep learning methods, recent studies have proposed artificial intelligence (AI) models for detecting and characterizing EGC in endoscopic images, aiming to assist physicians in evaluating endoscopic features [13]. This includes our previous study, where we developed an AI model that can detect EGC in endoscopic videos [14]. Although several models have been developed to assess the invasion depth of EGC using endoscopic images, there remains a need for further research into AI-assisted pathologic prediction for EGC to enhance the performance in video analysis [15,16]. Moreover, to the best of our knowledge, no previous study has explored the capability of AI in predicting LVI or LNM based on endoscopic images or videos.
Therefore, this study aimed to develop and evaluate an AI model that comprehensively predicts the postoperative pathologic results of EGC, including the differentiation status, invasion depth, LVI, and LNM, based on preoperative white-light endoscopic images and videos.

Methods
The AI model developed in this study is an extension of the ENAD CAD-G, a convolutional neural network (CNN)-based model for detecting and classifying gastric lesions in endoscopic videos, as demonstrated in our previous study [14].

Study design and datasets
Figure 1 shows an overview of the study design and datasets. The total dataset of endoscopic images and videos was divided into an internal dataset used for training, internal validation, and testing and an external dataset employed for the external validation of the AI model.
For the internal dataset, we retrospectively collected 4,596 preoperative white-light endoscopic images of EGC from patients who underwent ESD or radical surgery between January 2018 and December 2022 at Seoul National University Hospital (SNUH), a tertiary hospital in the Republic of Korea. To assess the performance of the AI model in videos, we prospectively included 163 white-light endoscopic videos of patients referred from community clinics who underwent ESD or surgical resection for EGC between April 2022 and April 2023. For the external dataset, we used 436 images retrospectively collected from patients who underwent surgery for EGC between January 2020 and June 2020, and 89 videos prospectively collected from patients who underwent ESD for EGC between April 2022 and October 2022 at another tertiary hospital, Seoul National University Bundang Hospital (SNUBH), Republic of Korea.
This study was conducted in accordance with the Declaration of Helsinki and was approved by the ethics committees of the participating hospitals (IRB No. 2109-048-1253 at SNUH and IRB No. 2201-735-405 at SNUBH). Written informed consent was obtained from all prospectively enrolled patients who provided endoscopic videos. The requirement for informed consent was waived for the patients whose retrospective images were included in this study.

Preparation of endoscopic images
Supplementary Figure S1 shows the process of preparing endoscopic images before training the model. We retrospectively investigated the medical records of 1,617 patients who underwent ESD and 1,641 patients who underwent radical surgery for EGC at SNUH between 2018 and 2022. All preoperative white-light endoscopic images of the patients were reviewed by five endoscopists from SNUH, who selected images that best characterized the target lesions (two or three images per lesion) and excluded images with low resolution or blurring. Patients who underwent additional surgery after non-curative ESD were considered surgical patients. Patients who had received any previous endoscopic treatment for the target lesions before ESD or had undergone ESD at another hospital before surgery were excluded. Patients with inconclusive pathologic results or those who did not undergo preoperative endoscopies at SNUH were also excluded. Finally, 2,453 images from 1,031 patients who underwent ESD (51.4% of total patients) and 2,143 images from 975 patients who underwent surgery (48.6% of total patients) were included in the internal dataset.

Patient enrollment of endoscopic videos
We prospectively enrolled patients who were diagnosed with gastric dysplasia or EGC on initial biopsies, underwent ESD or radical surgery, and were confirmed to have EGC based on the pathological reports of the resected specimens. The indication for ESD was one of the following conditions: i) differentiated-type EGC with tumor size ≤ 2 cm and endoscopically suspected mucosal cancer without ulceration or ii) high-grade dysplasia [17]. We excluded patients who had previously undergone gastrectomy and those with contraindications for biopsy due to bleeding tendency or anticoagulant use. All endoscopic examinations were performed preoperatively using standard video endoscopes (GIF-Q260, GIF-H260, or GIF-H290; Olympus Medical Systems, Tokyo, Japan). Consequently, 163 patients from SNUH (50 ESD patients and 113 surgery patients) and 89 ESD patients from SNUBH were included in the study, and their endoscopic videos were provided. All ESD procedures were performed by experienced endoscopists following a standardized protocol [18]. The surgical procedures were based on standard gastrectomy with D1+ or D2 lymph node dissection [5].

Pathologic definitions
The pathologic characteristics of EGC in the images and videos were obtained from the pathological reports of specimens resected by ESD or surgery based on the 2022 Korean gastric cancer treatment guidelines [5]. Expert pathologists assessed all resected specimens. Differentiated-type EGC includes papillary, well, or moderately differentiated tubular adenocarcinoma, whereas undifferentiated-type EGC includes poorly differentiated adenocarcinoma, signet ring cell carcinoma, and mucinous carcinoma. In cases of mixed-type gastric cancer, the classification was determined by the histological type of the predominant lesion [19]. Submucosal invasion < 500 µm was defined as SM1 and submucosal invasion ≥ 500 µm was defined as SM2. The status of LNM in the resected specimens was also investigated in the surgical cases.

Training and internal validation of the AI model
The training set comprised 4,336 images and 153 videos, which were used to train and internally validate the AI model (Fig. 1). For the internal validation, the images and videos in the dataset were randomly divided into five subsets. Four subsets were used for training, and the remaining subset was used for validation to calculate the predictive performance of the trained model. This cross-validation process was conducted five times, so that every image and video in the set was used for validation once.
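The five-fold procedure described above can be illustrated with a minimal sketch. It is not the study's code: the function name `five_fold_splits`, the fixed seed, and the integer placeholder IDs are assumptions made for illustration, and the text does not state whether the random division was performed per image or per patient (this sketch splits per item).

```python
import random


def five_fold_splits(items, k=5, seed=0):
    """Yield (train, validation) lists so that each item is validated exactly once.

    Hypothetical helper for illustration; splits per item, not per patient.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal items round-robin so that fold sizes differ by at most one.
    folds = [shuffled[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        train = [x for j, fold in enumerate(folds) if j != held_out for x in fold]
        yield train, validation


if __name__ == "__main__":
    image_ids = range(4336)  # placeholder IDs; the count matches the training images
    for fold_no, (train, val) in enumerate(five_fold_splits(image_ids), start=1):
        print(f"fold {fold_no}: {len(train)} training items, {len(val)} validation items")
```

If several images come from the same lesion, grouping the split by patient would avoid near-duplicate images appearing on both sides of a fold; whether this was done is not stated in the text.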
The AI model was based on a CNN architecture and utilized EfficientNet-B0 to evaluate the pathologic characteristics of target lesions in endoscopic images [20]. The model employed a soft voting method to categorize these lesions into distinct predictive classes: differentiation status (differentiated or undifferentiated), invasion depth (mucosal or submucosal), LVI (positive or negative), and LNM (positive or negative). A generative model using StyleGAN2 was integrated to enhance the predictive performance of the model and increase its sensitivity [21]. Representative images analyzed by the model are presented in Fig. 2.
Figure 3 shows a schematic diagram of the evaluation of the endoscopic videos. Initially, gastric lesions within the videos were recognized and outlined with boundaries (cropped), using a lesion detection model based on YOLOv5 developed in our previous study [14]. Subsequently, the cropped images were categorized as cancer, adenoma, or non-neoplastic lesions using a lesion classification model that employs EfficientNet-B0 [14]. The AI model then calculated the confidence levels for pathologic predictions of the identified cancers. Finally, the model utilized a soft voting method to determine the pathologic classifications of the cancers. Accordingly, the cut-off value for considering an AI prediction as positive for a given class was set at 50%, and this method was used in all validation processes.
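Soft voting with a 50% cut-off can be sketched as follows. This is an illustrative assumption rather than the study's implementation: the text does not specify which confidences are pooled, so the sketch assumes one positive-class confidence per cropped frame, and the names `soft_vote` and `hard_vote` are hypothetical. Treating a mean exactly at the cut-off as positive is also an assumption.

```python
from statistics import fmean


def soft_vote(frame_confidences, cutoff=0.5):
    """Average the positive-class confidences and threshold the mean (soft voting)."""
    if not frame_confidences:
        raise ValueError("need at least one confidence value")
    mean_conf = fmean(frame_confidences)
    # Assumption: a mean exactly at the cut-off counts as positive.
    return mean_conf, mean_conf >= cutoff


def hard_vote(frame_confidences, cutoff=0.5):
    """Majority vote over per-frame thresholded labels, shown for contrast."""
    positives = sum(c >= cutoff for c in frame_confidences)
    return positives * 2 > len(frame_confidences)


if __name__ == "__main__":
    # One confident frame and two borderline frames (made-up values).
    confidences = [0.95, 0.45, 0.45]
    mean_conf, is_positive = soft_vote(confidences)
    print(f"soft vote: mean={mean_conf:.3f}, positive={is_positive}")
    print(f"hard vote: positive={hard_vote(confidences)}")
```

In this made-up example the two schemes disagree: soft voting lets one high-confidence frame outweigh two borderline frames, whereas a majority vote discards the confidence magnitudes.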

Testing the performance of the AI model and endoscopists
A test set was designed to evaluate and compare the predictive performances of the AI model and endoscopists, using 260 endoscopic images and 10 videos distinct from the training set. Sixteen endoscopists, comprising nine experts and seven novices, participated in the test and predicted the

Statistical analyses
The gold standard for prediction was derived from postoperative pathological reports of specimens obtained from ESD or surgery. Accuracy, sensitivity, specificity, and positive and negative predictive values of the predictions were calculated. The prediction metrics were presented as means with 95% confidence intervals and were compared using the Mann-Whitney U test. Receiver operating characteristic (ROC) curves and the corresponding area under the ROC curve (AUC) values for the AI model were calculated. The accuracies of the AI model and those of all the experts were compared using McNemar's test. The
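As a reference for the metrics named in this section, the following minimal sketch computes them from confusion-matrix counts and evaluates an exact two-sided McNemar test on discordant pairs. The function names and the example counts are hypothetical, and the text does not state whether the exact or the chi-square form of McNemar's test was used; the exact binomial form is assumed here.

```python
from math import comb


def diagnostic_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, specificity, PPV, and NPV.

    Assumes every denominator is non-zero (both classes present and
    at least one positive and one negative prediction).
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def mcnemar_exact(b, c):
    """Two-sided exact McNemar P value.

    b: cases answered correctly only by rater A (e.g., the AI model)
    c: cases answered correctly only by rater B (e.g., an endoscopist)
    Concordant cases do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least as extreme as observed.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Made-up counts for illustration only.
    for name, value in diagnostic_metrics(tp=8, fp=2, tn=85, fn=5).items():
        print(f"{name}: {value:.3f}")
    print(f"McNemar exact P: {mcnemar_exact(10, 2):.4f}")
```

McNemar's test is suited to this comparison because the AI model and each endoscopist answer the same test questions, so the predictions are paired.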

Internal validation of the AI model with images and videos
There was no significant difference in the performance of the AI model between images and videos, except for the accuracy of predicting LNM, which was significantly higher for the videos (P = 0.008).

Performance of the AI model according to pathologic characteristics of EGC
Supplementary Table S2 shows the performance of the AI model for endoscopic images according to the differentiation status of the target lesion within the training set. For differentiated-type EGC, the model exhibited mean accuracies of 91.1% (sensitivity, 82.3%; specificity, 89.2%) for submucosal invasion, 83.5% (sensitivity, 27.3%; specificity, 95.4%) for LVI, and 90.1% (sensitivity, 29.7%; specificity, 96.3%) for LNM. For undifferentiated-type EGC, the model demonstrated mean accuracies of 87.5% (sensitivity, 79.3%; specificity, 93.3%) for submucosal invasion, 88.1% (sensitivity, 10.4%; specificity, 98.6%) for LVI, and 83.8% (sensitivity, 26.9%; specificity, 92.6%) for LNM. The model presented a significantly higher accuracy in differentiated-type EGC than in undifferentiated-type EGC for predicting submucosal invasion (P = 0.008) and LNM (P = 0.016).

Comparison of the predictive accuracies between AI model and endoscopists
Figure 4 shows the ROC curves of the AI model in the test set, with the performance of each endoscopist plotted as a dot (blue = expert, red = novice). The AUC values of the model were 0.992 for undifferentiated histology, 0.902 for submucosal invasion, 0.706 for LVI, and 0.680 for LNM. All dots representing the performance of the endoscopists were positioned below the curves for predicting undifferentiated histology, submucosal invasion, and LNM.
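For readers less familiar with the AUC, it equals the probability that a randomly chosen positive case receives a higher confidence than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition; the function name `roc_auc` and the scores are made up for illustration and are unrelated to the study data.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Made-up confidences: three positive cases followed by three negative cases.
    scores = [0.90, 0.80, 0.35, 0.60, 0.30, 0.10]
    labels = [1, 1, 1, 0, 0, 0]
    print(f"AUC = {roc_auc(scores, labels):.3f}")
```

Unlike accuracy at a fixed cut-off, this quantity depends only on how the confidences rank positive cases relative to negative ones.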
Table 3 summarizes the performances of the AI model and endoscopists in the test. The model exhibited accuracies of 92.7% for undifferentiated histology, 87.3% for submucosal invasion, 76.4% for LVI, and 87.7% for LNM. The experts reported mean accuracies of 71.6% for undifferentiated histology, 72.6% for submucosal invasion, 69.7% for LVI, and 72.3% for LNM. The model showed significantly higher accuracy than the experts in predicting undifferentiated histology (P ≤ 0.001), submucosal invasion (P ≤ 0.012), and LNM (P ≤ 0.001). The experts showed significantly higher accuracy than the novices in identifying undifferentiated histology (P = 0.001) and submucosal invasion (P = 0.019). However, there was no significant difference between the experts and novices in detecting LVI (P = 0.525) and LNM (P = 0.790).
Representative videos of the AI model in the test set and external validation are shown in Videos 1 and 2, respectively. The resolution of images and videos in the external dataset (640 × 480) was lower than that in the internal dataset (1920 × 1080), owing to differences in the picture archiving and communication system between the two hospitals.

Discussion
In this study, we developed and evaluated an AI model that predicts postoperative pathologic results of EGC based on conventional white-light endoscopic images and videos. The performance of the model was compared with that of endoscopists in a test and externally validated using videos from another institution. Categorizing the differentiation status of EGC is pivotal in deciding the indication for ESD, considering the significantly lower curative resection rate in undifferentiated-type EGC compared to differentiated-type EGC [22,23]. Since approximately 18% of undifferentiated-type EGC can initially be misclassified as differentiated-type with forceps biopsy, endoscopic features of the lesions, including ME-NBI, must be combined for accurate diagnosis [24–26]. In a previous study, an AI model trained with ME-NBI showed an accuracy of 86.2% for classifying EGC differentiation status [27]. In our study, the AI model exhibited an accuracy of 89.7% in white-light endoscopic videos and outperformed the experts in identifying undifferentiated-type EGC. These results suggest that AI can assist endoscopists in predicting the differentiation status, with both white-light and ME-NBI endoscopic images.
Although EUS is commonly used to detect submucosal invasion in EGC, its advantages over conventional endoscopy are insignificant, with an accuracy of approximately 70% [10]. Notably, these findings are consistent with our results, where the experts showed a mean accuracy of 72.6% for predicting submucosal invasion of EGC in the test. In contrast, the AI model demonstrated significantly higher accuracy than the experts. Therefore, endoscopic findings indicative of submucosal invasion in EGC, such as clubbing, abrupt cutting or fusion of folds, uneven or nodular depression, and marked redness of the surface, can be assessed without ultrasound [28–31], and AI enhances this process by learning from an extensive dataset of conventional endoscopic images. Several studies have investigated deep learning-based prediction of submucosal invasion in EGC using endoscopic images, reporting accuracies ranging from 84 to 94% [16,32–34]. However, two studies revealed that undifferentiated-type EGC was associated with lower predictive accuracies compared with differentiated-type EGC [19,35], a tendency also observed in our study. Furthermore, the sensitivity for submucosal invasion was lower in undifferentiated-type EGC. Given that submucosal invasion with undifferentiated histology indicates non-curative ESD in EGC, these findings suggest that endoscopists still need to be more conservative when deciding to perform ESD for undifferentiated-type EGC than for differentiated-type EGC, even with the assistance of AI.
The lack of research on predicting LNM from endoscopic images using AI can be attributed to the low incidence of LNM in patients with EGC. The LNM rates have been reported to be < 9% for mucosal cancer and < 20% for submucosal cancer, according to large-scale studies based on surgical specimens of EGC [36,37]. In addition, the LVI rate of EGC was approximately 13% in another study based on surgical specimens [38]. Although our study included as many surgical patients as possible, these inherently low rates of LVI and LNM in EGC induced an imbalance between positive and negative cases within the datasets. This imbalance explains why our model exhibited low sensitivity and positive predictive value, and consequently a low AUC value, despite its high accuracy in predicting LVI and LNM. However, excluding some patients with negative LVI or LNM to address this data imbalance could introduce significant selection bias. Therefore, despite the potential effect of data imbalance, we chose to include patients consecutively in the study.
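The effect of class imbalance on accuracy can be made concrete with a short worked example. The sensitivity (10.4%) and specificity (98.6%) are the values reported above for LVI in undifferentiated-type EGC; the 13% prevalence is the LVI rate cited from another study [38] and is used here only as a hypothetical prevalence, not as the prevalence in our datasets. The function name `accuracy_and_ppv` is likewise an assumption for illustration.

```python
def accuracy_and_ppv(prevalence, sensitivity, specificity):
    """Expected accuracy and PPV for a test applied at a given prevalence.

    Returns (accuracy, ppv); ppv is None when no case is predicted positive.
    """
    true_pos = prevalence * sensitivity
    true_neg = (1 - prevalence) * specificity
    false_pos = (1 - prevalence) * (1 - specificity)
    accuracy = true_pos + true_neg
    predicted_pos = true_pos + false_pos
    ppv = true_pos / predicted_pos if predicted_pos > 0 else None
    return accuracy, ppv


if __name__ == "__main__":
    prevalence = 0.13  # hypothetical; taken from the cited LVI rate, not from our data

    acc, ppv = accuracy_and_ppv(prevalence, sensitivity=0.104, specificity=0.986)
    print(f"reported sens/spec: accuracy={acc:.3f}, PPV={ppv:.3f}")

    # A trivial rule that always predicts "negative" detects no positive case.
    acc_neg, ppv_neg = accuracy_and_ppv(prevalence, sensitivity=0.0, specificity=1.0)
    print(f"always negative:    accuracy={acc_neg:.3f}, PPV={ppv_neg}")
```

Under these assumptions, accuracy is dominated by the negative majority, so a rule that never predicts a positive case reaches nearly the same accuracy. This is why sensitivity, positive predictive value, and AUC are more informative than accuracy for LVI and LNM.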
Additionally, the absence of a significant difference in the mean accuracy between experts and novices suggests that the ability to detect LNM does not necessarily improve with clinical experience. However, the AI model in our study showed higher accuracy and sensitivity than the experts in predicting LNM. One possible explanation for this is that the AI may have adapted to associate a
This study has several limitations. First, the AI model was trained using retrospective images after the selection process, potentially introducing bias into our study. To compensate for this, we also included videos from patients enrolled prospectively under the same indications for ESD and found consistent performance of the model between images and videos. Furthermore, we tested the performance of the model by comparing it with experts from various hospitals across the nation. Second, the predictive performance of the AI model was lower in the external validation than in the internal tests. This discrepancy can be partially explained by the inferior resolution of images and videos in the external dataset compared to those in the internal dataset. Additionally, previous studies reported the "overfitting effect" in AI, where the learning process becomes excessively adapted to the training data [43,44]. Several studies on deep learning-based prediction of invasion depth in gastric neoplasms have also reported significant differences in accuracies between internal and external tests [19,45,46]. Nevertheless, the external validation of our model showed a predictive accuracy above 70% for invasion depth, which was higher than the reported predictive accuracy of EUS. The performance could be further improved by training the model with images from various institutions in the future. Third, this study did not evaluate ME-NBI images and videos of EGC. Training the model with NBI data may improve the histological diagnosis of EGC, and further studies should train the model with NBI images and videos. Fourth, incorporating both ESD and surgical cases into the dataset may have affected the model's performance due to heterogeneity among the data. The longer section intervals in surgical specimens compared to ESD specimens could potentially lead to underestimation of submucosal invasion and LVI in surgical specimens [47]. Finally, the findings of this study should be confirmed in randomized controlled trials, and we are planning to conduct prospective studies to apply this AI model in clinical practice.
In conclusion, this study suggests that AI has the potential to assist endoscopists in determining the optimal treatment strategy for EGC, showing high performance in predicting the differentiation status and invasion depth based on conventional endoscopic images and videos. However, the detection of LVI and LNM using deep learning-based methods requires further research.

Fig. 1 Flow diagram of study design and datasets. SNUH Seoul National University Hospital, SNUBH Seoul National University Bundang Hospital, EGC early gastric cancer, ESD endoscopic submucosal dissection, AI artificial intelligence

Fig. 2 Representative examples of pathologic predictions by the AI model in endoscopic images. Each endoscopic image contains one lesion of EGC with the following pathologic characteristics. a Differentiated-type EGC of mucosal invasion without LVI or LNM. b Differentiated-type EGC of submucosal invasion with positive LVI and negative LNM. c Undifferentiated-type EGC of

Fig. 3 Schematic diagram for AI-based pathologic prediction in endoscopic videos. Initially, the gastric lesion is identified and outlined with red boundaries (cropped) with the lesion detection model. Subsequently, the cropped lesion is categorized as either cancer, adenoma, or non-neoplastic lesion by the lesion classification model. For lesions classified as cancer, the model computes confidence levels to predict differentiation status, invasion depth, lymphovascular invasion, and lymph node metastasis. Finally, the lesion is categorized into distinct pathologic classes utilizing a soft voting method. AI artificial intelligence, EGC early gastric cancer

Table 1 Pathologic characteristics of early gastric cancer in endoscopic images and videos across datasets
the Republic of Korea who had over 10 years of experience performing gastric ESD before this study and held positions of associate professor or higher.

Table 2 Internal validation of trained AI model with endoscopic images and videos

Table 3 Comparison of predictive accuracies between AI model and endoscopists in the test set
AI artificial intelligence, CI confidence interval, n number of correct answers, N number of questions
* P < 0.05, when accuracy was compared with that of the AI system using McNemar's test
** P < 0.05, when the mean accuracy was compared with that of the experts using the Mann-Whitney U test