Performance Comparison of the Deep Learning and the Human Endoscopist for Bleeding Peptic Ulcer Disease

Management of peptic ulcer bleeding is clinically challenging. Accurate characterization of the bleeding during endoscopy is key for endoscopic therapy. This study aimed to assess whether a deep learning model can aid in the classification of bleeding peptic ulcer disease. Endoscopic still images of patients (n = 1694) with peptic ulcer bleeding over the preceding 5 years were retrieved and reviewed. Overall, 2289 images were collected for deep learning model training, and 449 images were reserved for performance testing. Two expert endoscopists classified the images into different classes based on their appearance. Four deep learning models (MobileNet V2, VGG16, Inception V4, and ResNet50), each an established convolutional neural network pre-trained on ImageNet, were evaluated. The trained deep learning model was compared with the endoscopists on the dataset of 449 testing images. Among the four models, MobileNet V2 showed the best performance and was therefore chosen for comparison with the diagnostic results obtained by one senior and one novice endoscopist. The sensitivity and specificity were acceptable for the prediction of the “normal” class in both the 3-class and 4-class classifications. For the 3-class category, the sensitivity and specificity were 94.83% and 92.36%, respectively. For the 4-class category, the sensitivity and specificity were 95.40% and 92.70%, respectively. On the testing dataset, the interobserver agreement between the model and the senior endoscopist was moderate to substantial. The deep learning model was more accurate than the novice endoscopist in determining whether endoscopic therapy was required and whether that therapy was high risk. In this study, the deep learning model performed better than inexperienced endoscopists. Further improvement of the model may aid in clinical decision-making during clinical practice, especially for trainee endoscopists.


Introduction
Peptic ulcer bleeding is a common gastrointestinal (GI) emergency with a 10% hospital mortality rate [1][2][3]. Important progress has been made in the treatment of this condition since the introduction of emergency endoscopy and the development of endoscopic therapy for hemostasis. The appearance of the ulcer base is probably the best available predictor of patient outcome. The classification of peptic GI bleeding was proposed by Forrest [4] in 1974. The classification differentiates among acute, recent (with risk of rebleeding), and almost-healed ulcerations. The goal of the Forrest classification is to allow an immediate judgment of the risk of rebleeding and the need for endoscopic intervention. This classification has been used since its introduction and has also been a standard for conducting various clinical trials [5][6][7]. Current guidelines [3,8,9] suggest that patients with high-risk ulcers, such as those with active spurting (Forrest Ia), active oozing (Forrest Ib), or a non-bleeding visible vessel (Forrest IIa), should receive endoscopic therapy owing to the high risk of persistent bleeding or rebleeding. Peptic ulcers with an adherent clot (Forrest IIb) should be subjected to endoscopic clot removal to decide on further treatment plans. Ulcers with red spots (Forrest IIc) or a clean base can be observed without endoscopic therapy.
Accurate identification of such stigmata of hemorrhage is essential for endoscopists to deliver appropriate care; however, the ability to classify them correctly varies with the endoscopist's experience. Laine et al. [10] reported that, before a training course, the rate of correctly identifying endoscopic stigmata of hemorrhage increased with endoscopic experience (performing five cases per month) from 59% to 73%. After the training course, the increase was related to the level of training: fellows, 15% increase; physicians with 0-20 years since training, 8% increase; and physicians with 20 years or more since training, 3% increase. Another study from Italy reported a high interobserver agreement for Forrest Ia/b lesions, but low agreement for Forrest II/III lesions [11]. The Canadian registry of patients with upper gastrointestinal bleeding revealed that only 47.8% of patients with high-risk stigmata received endoscopic therapy, whereas 9.8% of those with low-risk stigmata received endoscopic therapy [12], showing the wide variation in endoscopist practice in the real world.
Artificial intelligence (AI) is an emerging technology that affects several aspects of healthcare. In endoscopy, AI is currently being used to detect lesions during endoscopic procedures, such as colorectal lesions during colonoscopy [13], esophageal cancer during endoscopy [14], and small bowel ulcers during capsule endoscopy [15]. All these developments aim to provide a diagnostic efficacy that is similar or even superior to that of experienced endoscopists. Improved medical therapy, such as the use of proton pump inhibitors and the eradication of Helicobacter pylori infections, has led to a decrease in the peptic ulcer disease rate. Achieving successful hemostasis and providing the best care depend on endoscopic skills and experience. A survey study from the UK demonstrated a decline over time in trainee experience with peptic ulcer bleeding, from 76% in 1996 to 15% in 2011 [2]. The study also highlighted a lack of trainee experience in more challenging cases, particularly in the out-of-hours period. Although endoscopic skills for hemostasis, such as injection, coagulation, or clipping, can be improved by training with various ex-vivo models [16], the experience of determining the optimal management for a bleeding peptic ulcer can typically only be obtained by practice with real bleeding cases. Thus, there is a need to develop a tool to assist trainees or young endoscopists during their management of bleeding peptic ulcers. Machine learning algorithms could predict the severity of a bleeding peptic ulcer with an acceptable accuracy from still endoscopic images.
Thus, in this study, we aimed to evaluate the performance of a deep learning model in classifying still endoscopic color images obtained from patients with bleeding peptic ulcers.

Patients and Data Preparation
The endoscopy records of patients who underwent endoscopic examination between January 2015 and January 2020 at the endoscopy center of Changhua Christian Hospital were retrospectively reviewed. The images were reviewed and retrieved for subsequent analysis by two expert endoscopists with 15 years of experience in therapeutic endoscopy. Inclusion criteria for analysis of the images were (a) images from patients with symptoms of gastrointestinal bleeding, i.e., hematemesis, anemia, or tarry stool; (b) bleeding attributed to peptic ulcer disease, i.e., gastric or duodenal ulcers; and (c) endoscopy performed with the Olympus 260 or 290 series system. Endoscopic images with a clear view of the pre-treatment peptic ulcers were included as the peptic ulcer group, whereas endoscopic images with a normal appearance of the gastric or duodenal mucosa were included as the control group. Images with bleeding from variceal hemorrhage, neoplasm, angiodysplasia, post-polypectomy hemorrhage, or bleeding of unknown origin were excluded.
The endoscopic images of the bleeding peptic ulcers were first classified according to the Forrest classification; the consensus reached between the two endoscopists served as the ground truth for this study (expert 1). Next, we stratified the ulcers according to the clinical guideline as "no need of endoscopic therapy", i.e., Forrest IIc and Forrest III lesions, and "need of endoscopic therapy", i.e., Forrest Ia to Forrest IIb lesions. The ulcer images requiring endoscopic therapy were further classified to determine the risk of endoscopic therapy after a review by the two endoscopists. Ulcers in difficult locations, such as the duodenum or the lesser curvature, or those with a large ulcer base or large visible vessels were considered high risk for endoscopic therapy because primary endoscopic hemostasis may fail [17]; the remaining ulcers were considered low risk for endoscopic therapy.
The study complied with the World Medical Association Declaration of Helsinki for medical research involving human subjects, including research on identifiable human material and data, and was approved by the institutional review board of Changhua Christian Hospital (approval number: CCH IRB 200906).
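The label stratification described above can be sketched in a few lines of Python. This is only an illustration of the grouping rule stated in the text; the function and variable names are hypothetical, and the high-risk flag is assumed to be supplied by expert review rather than computed.

```python
# Illustrative grouping of Forrest classes into the study's label sets.
# All names here are hypothetical; they do not come from the study's code.
NEEDS_THERAPY = {"Ia", "Ib", "IIa", "IIb"}  # "need of endoscopic therapy"
NO_THERAPY = {"IIc", "III"}                 # "no need of endoscopic therapy"


def three_class_label(forrest=None):
    """Return the 3-class label; None stands for a normal-mucosa control image."""
    if forrest is None:
        return "normal"
    if forrest in NEEDS_THERAPY:
        return "therapy"
    if forrest in NO_THERAPY:
        return "no therapy"
    raise ValueError(f"unknown Forrest class: {forrest!r}")


def four_class_label(forrest=None, high_risk=False):
    """Split the 'therapy' group using the expert-assigned risk flag."""
    label = three_class_label(forrest)
    if label != "therapy":
        return label
    return "high-risk therapy" if high_risk else "low-risk therapy"
```

For example, `four_class_label("IIa", high_risk=True)` yields `"high-risk therapy"`, whereas a Forrest IIc lesion stays in the "no therapy" group regardless of the flag.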

Training of the Deep Learning Models
The endoscopic images were cropped to remove possible identification data before training. The experiment was performed on the DeepQ AI Platform (https://deepq.com/artificial-intelligence/) for the image classification task with different deep learning models. The DeepQ AI Platform is dedicated to medical image training and provides established models pretrained on the ImageNet dataset, such as ResNet-50, Inception-v4, VGG-16, and MobileNet V2, which are fine-tuned on the training data. Considering model performance and speed, the MobileNet V2 model [18,19] was chosen in this study for its potential use in mobile settings for emergent clinical consultation. A 5-fold cross-validation on the training set was performed such that the data were split randomly by patient into five sets for training and validation. The batch size was 32, and the number of training epochs was 100. Data augmentation was performed with horizontal flipping at a rate of 0.5.
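A patient-level split of this kind can be sketched as follows. This is a minimal illustration using only the Python standard library, not the DeepQ AI Platform's actual procedure; the function names, the seed, and the round-robin assignment of patients to folds are assumptions made for the example.

```python
import random
from collections import defaultdict


def patient_level_folds(image_patient_ids, n_folds=5, seed=0):
    """Assign image indices to folds so that no patient's images span two folds.

    image_patient_ids: list where entry i is the patient ID of image i.
    Returns a list of n_folds lists of image indices.
    """
    by_patient = defaultdict(list)
    for idx, pid in enumerate(image_patient_ids):
        by_patient[pid].append(idx)

    patients = sorted(by_patient)  # fixed order before shuffling, for reproducibility
    random.Random(seed).shuffle(patients)

    folds = [[] for _ in range(n_folds)]
    for i, pid in enumerate(patients):
        # Round-robin over shuffled patients (an assumption of this sketch).
        folds[i % n_folds].extend(by_patient[pid])
    return folds


def train_val_splits(folds):
    """Yield (train_indices, val_indices), holding out one fold at a time."""
    for k, val in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, val
```

Splitting by patient rather than by image keeps near-duplicate frames of the same ulcer from appearing in both the training and validation sets, which would otherwise inflate the validation accuracy.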

Evaluation of the Model Performance
To evaluate the performance of the trained model, we assessed its accuracy on an additional testing dataset. For comparison with human endoscopists, two additional endoscopists with 10 years of endoscopic experience (experts 2 and 3), one junior endoscopist with 2 years of experience (novice 1), and two novice endoscopists with one year of experience (novices 2 and 3), all of whom were blinded to the study design, reviewed the validation dataset for evaluating the accuracy.

Characteristics of the Image Dataset for Analysis
A total of 2738 images from 1694 reviewed patients were selected for analysis, as shown in Fig. 1. The images were split into the training (2289 images) and testing (449 images) datasets. Model training was evaluated for the three-class task (normal vs. no therapy vs. therapy required) and the four-class task (normal vs. no therapy vs. low-risk therapy vs. high-risk therapy).

Results of Different Model Performance Metrics
The results of the performance comparison of the different models are presented in Table 1. The MobileNet V2 model had the shortest training time, with a training accuracy of 90.59% for the four-class classification task and 94.09% for the three-class classification task. The detailed results of the trained MobileNet V2 model for the prediction of the testing dataset are shown in Table 2. The prediction of normal, no therapy, and therapy in the 3-class classification was high, with an area under the receiver operating characteristic curve (AUROC) of 0.98, 0.92, and 0.91, respectively (Fig. 2). The prediction of normal, no therapy, high-risk therapy, and low-risk therapy in the 4-class classification was moderate to high, with an AUROC of 0.99, 0.89, 0.92, and 0.88, respectively (Fig. 3). Examples of correctly and incorrectly labeled endoscopic images are illustrated in Figs. 4 and 5. The model in both classification tasks showed high sensitivity and specificity for determining the normal endoscopic images, but a lower sensitivity for determining the need for therapy. The sensitivity for determining peptic ulcers further decreased when the therapy group was stratified into high- and low-risk groups.
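Per-class sensitivity and specificity in a multi-class task are commonly computed one-vs-rest, treating one class as positive and all others as negative. The sketch below assumes that convention, which the text does not state explicitly; the function name and the toy labels are hypothetical and are not study data.

```python
def one_vs_rest_metrics(y_true, y_pred, positive):
    """Sensitivity and specificity for one class against all other classes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    # Guard against classes that never occur in the reference labels.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels for illustration only (not study data).
y_true = ["normal", "normal", "no therapy", "therapy", "therapy", "therapy"]
y_pred = ["normal", "no therapy", "no therapy", "therapy", "no therapy", "therapy"]
therapy_sens, therapy_spec = one_vs_rest_metrics(y_true, y_pred, "therapy")
```

In this toy example one of the three "therapy" images is predicted as "no therapy", so the sensitivity for "therapy" is 2/3 while its specificity is 1.0, mirroring the pattern of a missed need for therapy described above.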

Comparison of the Performance of the Deep Learning Model and Human Endoscopists
Tables 3 and 4 present the interobserver agreement on the testing image dataset, measured with Cohen's kappa coefficient [20], for the 3-class and 4-class classification tasks. Interobserver agreement was high in the expert group for the 3-class classification and substantial in the 4-class classification. The agreement of the deep learning model with the expert group was substantial and was higher than that of novices 2 and 3. The accuracy of the deep learning model in the 3-class classification for the determination of therapy was higher than that of novices 2 and 3 (Table 5).
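For reference, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance from their label frequencies. The following is a minimal standard-library sketch; the function names and toy ratings are hypothetical, and the interpretation bands follow those given in the Table 3 caption.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from the product of each rater's label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def agreement_band(kappa):
    """Map kappa to the descriptive bands listed in the Table 3 caption."""
    if kappa < 0:
        return "no agreement"
    for upper, name in [(0.20, "slight"), (0.40, "fair"),
                        (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return name
    return "almost perfect"


# Toy ratings for illustration only (not study data).
model = ["normal", "normal", "therapy", "therapy"]
expert = ["normal", "normal", "therapy", "normal"]
kappa = cohens_kappa(model, expert)
```

Here the two raters agree on three of four images (observed agreement 0.75) against a chance agreement of 0.5, giving a kappa of 0.5, which falls in the "moderate" band.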

Discussion
In the present study, we proposed a deep learning model for the classification of bleeding peptic ulcers. The deep learning model produced better predictions than novice human endoscopists. Our study is the first to show the potential use of a deep learning model in the management of peptic ulcer bleeding, particularly for young endoscopists, in an era of decreasing experience in managing such a gastrointestinal emergency [2,8].
The development of AI has come to affect several aspects of human life in the twenty-first century. Since 2010, substantial progress has been made in extending its application to the health care field with the introduction of deep learning methods [21]. As the incidence of colorectal cancer is increasing, the need for colonoscopy screening has also increased [5,6], but there is a shortage of endoscopists. Thus, current AI technology in the field of endoscopy has mainly been developed to aid in the detection and diagnosis of colon polyps to improve the quality of colonoscopy. Deep learning methods have been shown to be superior to shape and context information detection methods for the classification, detection, segmentation, and tracking of polyps, and to increase the rate of accurate diagnosis, with a reported accuracy of 80-96% [13,22,23].
Fig. 2 The prediction in the 3-class classification
Fig. 3 The prediction in the 4-class classification
The application of such technology in upper GI endoscopy has also been attempted [14,24-26]. Yoshimasa et al. [26] from Japan utilized deep learning for the detection of esophageal cancer with a sensitivity of 98%. Rintaro et al. [25] reported a pilot study with 458 test images (225 dysplasia and 233 non-dysplasia) that correctly detected early neoplasia with a sensitivity of 96.4%, specificity of 94.2%, and accuracy of 95.4%. The human esophagus and colon are narrow and show less mucosal inflammation, so high-quality endoscopic images are easier to obtain from them than from the stomach. Therefore, endoscopic detection of gastric lesions is usually more difficult than that of esophageal and colonic lesions, especially in an emergency setting.
Zhang et al. [27] recently reported a diagnostic system based on a ResNet34 residual network structure for five gastric conditions, including peptic ulcer (PU), early gastric cancer and high-grade intraepithelial neoplasia, advanced gastric cancer, gastric submucosal tumors (SMTs), and normal gastric mucosa without lesions, with a diagnostic accuracy of 74.2-88.9%. A high-quality image of a bleeding PU is usually more difficult to obtain than images of these gastric conditions captured in nonemergency settings [28]. To the best of our knowledge, no previous study has employed deep learning for this endoscopic emergency, and we are the first to explore its potential use in clinical practice.
[Figure caption] Endoscopic view of an ulcer in the gastric antrum with oozing, incorrectly labelled as no therapy
PU bleeding is a clinical emergency requiring a prompt clinical decision for appropriate management. Since its development, the endoscopic Forrest classification of ulcer morphology has been the cornerstone of clinical trials for deciding whether to perform endoscopic or medical therapy [4-6,11,29]. However, the number of patients with PUs has decreased with the eradication of H. pylori infection [3]. Young endoscopists do not have sufficient experience in managing such diseases [1,30]; thus, developing a computer-aided system may be helpful for such critical clinical decision-making. Current guidelines [3,8,9] suggest that endoscopic therapy should be provided for Forrest I, IIa, and IIb lesions; therefore, in the current study, we simplified the classification to "need" or "no need" of endoscopic therapy to fit the real-world clinical practice pattern. In addition, we attempted to further stratify the risk of endoscopic therapy, as judged by experienced endoscopists from the still endoscopic images. Our model shows high sensitivity and specificity for determining the normal endoscopic images, but low sensitivity and specificity for determining the need for therapy or its difficulty.
Table 3 The interobserver agreement of the testing dataset based on the 3-class classification evaluated with Cohen's kappa coefficient. Cohen's kappa coefficient: < 0 as no agreement, 0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1 as almost perfect agreement
Compared with the higher sensitivity and specificity reported in other endoscopic situations, the lower sensitivity observed in our model is mainly explained by the complicated circumstances during PU bleeding, particularly for high-risk ulcers requiring endoscopic therapy: the view of the lesion is difficult to standardize in an emergency, the gastric contents are not clear, and the adjacent gastric or duodenal mucosa is inflamed or deformed, all of which may increase the difficulty of such discrimination tasks. We performed our first experiment with different currently available deep learning models, and MobileNet V2 was chosen for the subsequent study because of its acceptable accuracy and its shorter computing time, which may be useful in an emergency consultation setting. A strength of this study was the comparison of the performance of the deep learning model with that of human endoscopists. The high interobserver agreement among experienced endoscopists and the low interobserver agreement among young endoscopists revealed the need for training to raise the latter to the expert level. Our trained model has the potential to be used as an aid for young endoscopists during their training. In addition, we attempted to stratify the risk of endoscopic therapy based on the endoscopists' experience in the current study. We found that discrepancies may exist in the endoscopists' opinions: a highly skilled endoscopist may regard a high-risk ulcer as low risk, whereas an inexperienced endoscopist may make the same judgment owing to a lack of experience. Further studies are required to ensure better consensus among experts on this risk stratification before subsequent clinical use.
There are several limitations of the current study. First, the dataset came from one hospital over the past five years, and only the Olympus endoscope system was utilized; thus, external validation is required. Second, the classification of the obtained images came mainly from two experts in our institution and may not reflect the opinion of experts from other institutions. In addition, the images used for model training were stored still images. PU bleeding is a dynamic process, and further studies with video clips are still required to better support clinical practice.

Conclusion
In conclusion, we report the first use of a convolutional neural network for classifying endoscopic images of bleeding peptic ulcers. The performance of the model was better than that of endoscopist trainees. Further improvement of the model may aid in clinical decision-making during the training of young endoscopists.