Introduction

Gastric cancer is the fifth most common form of malignant tumor and the third leading cause of cancer-related death worldwide, with approximately 952,000 new cases and 723,000 deaths per year [1, 2].

The prognosis of patients with gastric cancer depends on the cancer stage at diagnosis [2, 3]. Although patients with advanced gastric cancer have a poor prognosis, the 5-year survival rate of patients with gastric cancer detected at an early stage is greater than 90% [2,3,4,5]. Therefore, endoscopic detection of gastric cancer at an earlier stage is the single most effective measure for reducing gastric cancer mortality. It also offers an opportunity to treat patients with organ-preserving endoscopic therapy such as endoscopic mucosal resection or endoscopic submucosal dissection (ESD) [6,7,8,9,10,11,12].

Although esophagogastroduodenoscopy (EGD) is the standard procedure for diagnosing gastric cancer, the false-negative rate for detecting gastric cancer with EGD is 4.6–25.8% [13,14,15,16,17,18]. Furthermore, inexperienced endoscopists tend to overlook gastric cancer because most cases arise from atrophic mucosa. In addition, some early gastric cancer lesions show only subtle morphologic changes, which are difficult to distinguish from background mucosa with atrophic change [12, 19,20,21]. Therefore, endoscopists require long-term specific training and experience to detect gastric cancer properly.

In recent years, image recognition using artificial intelligence (AI) with machine learning has dramatically improved and been increasingly applied to diagnostic imaging in various medical fields. These fields include skin cancer classification, diagnosis in radiation oncology and diabetic retinopathy, histologic classification of gastric biopsy, and characterization of colorectal lesions using endocytoscopy [22,23,24,25,26].

Deep learning, a relatively new method of machine learning, enables machines to analyze various training images and extract specific clinical features using a backpropagation algorithm [27]. Based on the accumulated clinical features, machines can then diagnose newly acquired clinical images. This type of deep learning system has become possible through the use of convolutional neural networks (CNNs), which imitate the structure and activity of brain neurons on a computer. Various kinds of neural networks have been developed, and the CNN is particularly known for its high performance in the field of image recognition [27, 28].

Fitting optimal parameter values automatically is called learning of the neural network, and properly defining these parameters determines the neural network’s ability. Supervised learning uses data sets consisting of both input and appropriate output information. Thus, deep learning through a CNN using extensive image data has a high potential for clinical application in recognizing clinical images.

To apply deep learning through a CNN to the detection of early and advanced gastric cancer, we constructed an AI-based diagnostic system that was trained on more than 13,000 EGD images. We then tested the diagnostic accuracy of this system in detecting gastric cancer.

Methods

Preparation of training and test image sets

To develop an algorithm for detecting gastric cancer, EGD images were retrospectively obtained from two hospitals (Cancer Institute Hospital Ariake, Tokyo, Japan, and Tokatsu-Tsujinaka Hospital, Chiba, Japan) and two clinics (Tada Tomohiro Institute of Gastroenterology and Proctology, Saitama, Japan, and Lalaport Yokohama Clinic, Kanagawa, Japan) for the period from April 2004 to December 2016. EGD was performed for screening or preoperative examination in daily clinical practice, and images were captured using standard endoscopes (GIF-H290Z, GIF-H290, GIF-XP290N, GIF-H260Z, GIF-Q260J, GIF-XP260, GIF-XP260NS, and GIF-N260; Olympus Medical Systems, Co., Ltd., Tokyo, Japan) and standard endoscopic video systems (EVIS LUCERA CV-260/CLV-260 and EVIS LUCERA ELITE CV-290/CLV-290SL; Olympus Medical Systems).

Images obtained with standard white light, chromoendoscopy using indigo carmine spraying, and narrow band imaging (NBI) were included. Magnified images and poor-quality images resulting from insufficient air insufflation, post-biopsy bleeding, halation, blur, defocus, or mucus were excluded. After selection, 13,584 images of 2639 histologically proven gastric cancer lesions were collected as the training image data set. Every image contained at least one gastric cancer lesion, and multiple images were prepared for the same lesion to capture differences in angle, distance, and extension of the gastric wall. All gastric cancer lesions were marked manually by an author (TH), who is an expert on gastric cancer and a board-certified trainer of the Japan Gastroenterological Endoscopy Society. TH carefully marked the extent of each cancer lesion using rectangular frames (Figs. 1, 2, 3).

Fig. 1

Output of the CNN. a A slightly reddish and flat lesion of gastric cancer appears on the lesser curvature of the middle body. b The yellow rectangular frame was drawn by the CNN to indicate the location and extent of a suspected gastric cancer lesion. An endoscopist manually marked the location of the cancer using a green rectangular frame. [0–IIc, 5 mm, tub1, T1a(M)]

Fig. 2

Cancer lesion presented in multiple images. An endoscopist manually marked the location of the cancer in each image using a green rectangular frame. The yellow rectangular frame was produced by the CNN to identify a suspected lesion and indicates the extent of gastric cancer. Although the CNN did not identify gastric cancer in the distant view (a), it correctly located gastric cancer in the near view (b). This was counted as a correct answer

Fig. 3

Six lesions missed by the CNN. The green rectangular frames show gastric cancer missed by the CNN. a Greater curvature of the antrum, 0–IIc, 3 mm, tub1, T1a(M). b Lesser curvature of the middle body, 0–IIc, 4 mm, tub1, T1a(M). c Posterior wall of the antrum, 0–IIc, 4 mm, tub1, T1a(M). d Posterior wall of the antrum, 0–IIc, 5 mm, tub1, T1a(M). e Greater curvature of the antrum, 0–IIc, 5 mm, tub1, T1a(M). The yellow rectangular frame shows a pyloric ring, which the CNN misdiagnosed as gastric cancer. f Anterior wall of the lower body, 0–IIc, 16 mm, tub1, T1a(M)

To evaluate the diagnostic accuracy of the constructed CNN, an independent test data set of stomach images was collected from 69 consecutive patients with 77 gastric cancer lesions (62 cases had 1 gastric cancer lesion, 6 had 2 lesions, and 1 had 3 lesions) who underwent EGD at the Cancer Institute Hospital Ariake from 1 to 31 March 2017 during daily clinical practice. All EGD procedures used a standard endoscope (GIF-H290Z) and a standard endoscopic video system (EVIS LUCERA ELITE CV-290/CLV-290SL). During the procedure, the entire stomach was observed, and white-light images of all parts were captured. Chromoendoscopy, NBI, and poor-quality images were excluded. The final test data set included a total of 2296 images, with 18–69 images per case.

Constructing a CNN algorithm

To construct an AI-based diagnostic system, we used a deep neural network architecture called the Single Shot MultiBox Detector (SSD, https://arxiv.org/abs/1512.02325), without altering its algorithm. SSD is a deep CNN that consists of 16 layers or more. The Caffe deep learning framework, a widely used framework originally developed at the Berkeley Vision and Learning Center, was then used to train, validate, and test the CNN.

All CNN layers were fine-tuned using stochastic gradient descent with a global learning rate of 0.0001. Each image was resized to 300 × 300 pixels, and the bounding box was resized accordingly so that the CNN could analyze the image optimally. These values were set by trial and error to ensure that all data were compatible with SSD.
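The resizing step can be illustrated with a minimal sketch. It assumes bounding boxes stored as pixel corner coordinates and a hypothetical source frame of 1280 × 1024 pixels; neither the storage format nor the original image resolution is stated in this paper, and the function name `rescale_box` is illustrative only.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def rescale_box(
    box: Box,
    src_size: Tuple[int, int],
    dst_size: Tuple[int, int] = (300, 300),
) -> Box:
    """Scale a bounding box by the same factors used to resize the image."""
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    sx = dst_w / src_w  # horizontal scale factor
    sy = dst_h / src_h  # vertical scale factor
    x_min, y_min, x_max, y_max = box
    return (x_min * sx, y_min * sy, x_max * sx, y_max * sy)


# Hypothetical example: a lesion marked at (320, 256)-(640, 512)
# in a 1280 x 1024 frame, mapped onto the 300 x 300 input.
print(rescale_box((320, 256, 640, 512), src_size=(1280, 1024)))
# (75.0, 75.0, 150.0, 150.0)
```

Because width and height are scaled independently, a non-square frame is distorted by the resize, and the box is distorted in the same way, so it still encloses the same region of mucosa.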

Outcome measures

After constructing the CNN using the training image set, we evaluated its performance on the test image set. When the CNN detected a gastric cancer lesion in a test image, it output a disease name (early or advanced gastric cancer) and the lesion's position. A detected lesion was displayed with a yellow rectangular frame on the endoscopic image (Fig. 1).

Because some gastric cancer lesions were presented in multiple images, we used the following definitions to perform the test.

  • When the CNN detected a gastric cancer lesion in at least one of the multiple images of the same lesion, it was counted as a correct answer (Fig. 2).

  • Because the demarcation line of gastric cancer was sometimes unclear, when the CNN detected a partial gastric cancer lesion, it was regarded as a correct answer.
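The two definitions above can be sketched as a lesion-level scoring routine. This is an illustration rather than the scoring code used in the study: the data layout, the names `boxes_overlap` and `detected_lesions`, and the rule that any shared area counts as a partial detection are all assumptions, since the paper does not give a quantitative overlap threshold.

```python
from typing import Dict, List, Set, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def boxes_overlap(a: Box, b: Box) -> bool:
    """Return True if two boxes share any area (assumed rule for a partial hit)."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def detected_lesions(
    marked: Dict[str, List[Tuple[str, Box]]],
    predicted: Dict[str, List[Box]],
) -> Set[str]:
    """Return IDs of lesions hit by a predicted box in at least one of their images."""
    hits: Set[str] = set()
    for lesion_id, views in marked.items():
        for image_id, truth_box in views:
            if any(boxes_overlap(truth_box, p) for p in predicted.get(image_id, [])):
                hits.add(lesion_id)
                break  # one correct image is enough for this lesion
    return hits


# Hypothetical data: lesion "A" has a distant and a near view, lesion "B" one view.
marked = {
    "A": [("img_far", (100, 100, 140, 140)), ("img_near", (80, 80, 200, 200))],
    "B": [("img_b", (10, 10, 50, 50))],
}
predicted = {
    "img_far": [],                       # nothing found in the distant view
    "img_near": [(150, 150, 260, 260)],  # partial overlap with lesion "A"
    "img_b": [(200, 200, 250, 250)],     # box elsewhere, so lesion "B" is missed
}
print(detected_lesions(marked, predicted))  # {'A'}
```

Lesion "A" mirrors the situation in Fig. 2: it is missed in the distant view but found in the near view, so it counts once as a correct answer.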

The sensitivity and positive predictive value (PPV) for the CNN’s ability to detect gastric cancer were calculated as follows:

Sensitivity = number of gastric cancer lesions correctly detected by the CNN / actual number of gastric cancer lesions

PPV = number of gastric cancer lesions correctly detected by the CNN / number of lesions diagnosed as gastric cancer by the CNN
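As a worked check of these two definitions, the following short sketch applies them to the lesion counts reported in the Results (71 correctly detected lesions, 77 actual lesions, and 232 lesions diagnosed as gastric cancer by the CNN). The function names are illustrative.

```python
def sensitivity(correctly_detected: int, actual_lesions: int) -> float:
    """Fraction of actual gastric cancer lesions that the CNN detected."""
    return correctly_detected / actual_lesions


def ppv(correctly_detected: int, cnn_positive_lesions: int) -> float:
    """Fraction of CNN-positive lesions that were truly gastric cancer."""
    return correctly_detected / cnn_positive_lesions


print(f"Sensitivity: {sensitivity(71, 77):.1%}")  # Sensitivity: 92.2%
print(f"PPV: {ppv(71, 232):.1%}")                 # PPV: 30.6%
```

Both measures are computed per lesion rather than per image, so a lesion photographed many times contributes only once to each count.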

Ethics

This study was approved by the Institutional Review Board of the Cancer Institute Hospital Ariake (no. 2016–1171) and the Japan Medical Association (ID JMA-IIA00283).

Results

Of the 2296 test images, 714 (31.1%) showed gastric cancer. Table 1 presents the patient and lesion characteristics of the test image set. Fifty-eight cases (84.1%) had moderate to severe gastric mucosal atrophy. Fifty-two lesions (67.5%) were early gastric cancer (T1), and 25 (32.5%) were advanced gastric cancer (T2–T4). The median tumor diameter was 24 mm (range 3 to 170 mm). Most lesions (55, 71.4%) were superficial types (0–IIa, 0–IIb, 0–IIc, 0–IIa + IIc, 0–IIc + IIb, and 0–IIc + III).

Table 1 Patient and lesion characteristics in the test image set (Japanese classification of gastric carcinoma [34])

The CNN required 47 s to analyze the 2296 test images. The CNN diagnosed a total of 232 lesions as gastric cancer, of which 161 were non-cancerous, and it correctly identified 71 of the 77 gastric cancer lesions, for an overall sensitivity of 92.2% and a PPV of 30.6%. The sensitivity by tumor size and depth is shown in Table 2. Seventy of 71 lesions (98.6%) with a diameter of ≥ 6 mm were correctly detected by the CNN, as were all invasive cancers (T1b or deeper). The details of the six missed lesions are shown in Fig. 3. Five of the six were minute cancers (≤ 5 mm). All missed lesions were superficially depressed and differentiated-type intramucosal cancers that were difficult to distinguish from gastritis even for experienced endoscopists.

Table 2 Sensitivity based on tumor size and depth
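The figures in the preceding paragraph can be cross-checked with a few lines of arithmetic. In this sketch, the count and sensitivity for lesions of ≤ 5 mm are inferred from the numbers given in the text (77 lesions in total, 71 of them ≥ 6 mm, and five of the six missed lesions ≤ 5 mm) rather than quoted from Table 2, and the variable names are illustrative.

```python
# Throughput implied by the reported analysis time.
total_images, seconds = 2296, 47
print(f"{total_images / seconds:.1f} images/s")          # 48.9 images/s
print(f"{1000 * seconds / total_images:.1f} ms/image")   # 20.5 ms/image

# Lesion-level counts reported in the text.
total_lesions, detected = 77, 71
large_total, large_detected = 71, 70  # lesions with diameter >= 6 mm

# Inferred counts for lesions with diameter <= 5 mm (not quoted from Table 2).
small_total = total_lesions - large_total    # 6
small_detected = detected - large_detected   # 1

print(f">= 6 mm: {large_detected / large_total:.1%}")  # >= 6 mm: 98.6%
print(f"<= 5 mm: {small_detected / small_total:.1%}")  # <= 5 mm: 16.7%
print(f"overall: {detected / total_lesions:.1%}")      # overall: 92.2%
```

The contrast between the two size groups shows that the overall sensitivity is driven almost entirely by lesions of 6 mm or larger.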

Table 3 shows the details of false-positive lesions. Nearly half of the misdiagnosed lesions were gastritis with changes in color tone or irregular mucosal surface as shown in Fig. 4a–c. The next most common cause of misdiagnosis was the normal anatomical structures of the cardia, angulus, and pylorus, as shown in Fig. 4d.

Table 3 Causes of false-positive lesions
Fig. 4

False positive lesions. The yellow rectangular frames show non-cancerous lesions that the CNN misdiagnosed as gastric cancer. a Intestinal metaplasia with irregularity of surface mucosa. b Whitish mucosa as a result of localized atrophy. c Reddish mucosa as a result of superficial gastritis. d Bending of the angulus

Discussion

To develop an AI-based diagnostic system for detecting gastric cancer, we used a CNN, a model loosely patterned on neurons in the human brain. Extensive training data are generally required to construct such a system [29], and we used over 13,000 clear endoscopic images that had been stored at our institutions. To the best of our knowledge, this is the first report to evaluate the ability of a CNN to detect gastric cancer in endoscopic images. In this study, the constructed CNN detected 92.2% of gastric cancers in the independent test image set. The lesions detected by the CNN included small intramucosal gastric cancers that are relatively difficult to detect, even for endoscopists. Furthermore, the CNN detected all invasive gastric cancers. The six missed lesions were differentiated-type intramucosal cancers that resemble gastritis and are difficult to diagnose even for experienced endoscopists. Because the doubling time of gastric mucosal cancer is considered to be 2–3 years [30], small cancers missed in this way would likely still be intramucosal when detected at a subsequent annual EGD, so the clinical applicability of the CNN might not be considerably hampered.

By contrast, 69.4% of the lesions that the CNN diagnosed as gastric cancer were benign. The most common reasons for misdiagnosis were gastritis with redness, atrophy, and intestinal metaplasia. These findings are sometimes difficult even for endoscopists to distinguish from gastric cancer. Earlier studies reported that the PPV of gastric biopsy without magnifying endoscopy for gastric epithelial neoplasms was only 3.2–5.6% [31, 32]. Considering that the PPV of biopsy by endoscopists is relatively low and that false negatives are more problematic than false positives in diagnosing cancer, a PPV of 30.6% for the CNN would be clinically acceptable. The CNN also misdiagnosed the anatomical structures of the cardia, pylorus, and angulus as gastric cancer, an error that endoscopists are unlikely to make. If the CNN can learn such normal anatomical structures and various benign lesions more systematically, the PPV of gastric cancer detection should improve further.

Remarkably, the CNN required only 47 s to analyze more than 2000 test images, a speed of image recognition and judgment that humans cannot achieve. In 2016, an endoscopic mass screening program for gastric cancer was started in Japan. This program requires time-consuming double checking of endoscopic images, which places a heavy burden on clinicians. The CNN system would considerably ease this burden if introduced as a supporting tool. Furthermore, because the procedure can be performed completely “online,” the system could serve as a telemedicine tool that addresses the shortage of endoscopists in remote and rural areas as well as in developing countries. Thus, in the near future, an AI-based diagnostic system might bring about major global changes in the endoscopic diagnosis of gastric cancer.

This study has several limitations. First, we used only high-quality endoscopic images for the training and test image sets. With insufficient air insufflation, post-biopsy bleeding, halation, blur, defocus, or mucus, the CNN may misjudge an image (although the same occurs with endoscopists) [33]. Second, we collected a large number of training images from the outset to achieve good accuracy and did not try other training set sizes. More training images might improve the diagnostic ability of the CNN, but the association between the number of training images and CNN accuracy was not examined and remains an issue for future studies. Third, we used only gastric cancer cases for the test image sets, whereas the frequency of gastric cancer cases in an endoscopic mass survey would be extremely low. Fourth, because the 161 false-positive lesions were not histologically proven, occult cancerous lesions may be included among them. Fifth, a single endoscopist manually marked the training and test image sets of gastric cancer, although he has over 10 years of experience working at a cancer specialty hospital and has diagnosed more than 6000 cases of gastric cancer. Sixth, we did not compare the diagnostic accuracy of the CNN with that of endoscopists. Seventh, all test images were obtained with the same type of endoscope (GIF-H290Z) and endoscopic video system (EVIS LUCERA ELITE CV-290/CLV-290SL), so images from other endoscopic devices were not included. Finally, because all the test images came from gastric cancer cases, the CNN system will need to be incorporated into daily clinical practice to verify it on other images, including those from non-gastric cancer cases. We are currently planning a multicenter trial to address these limitations and further validate the capabilities of the CNN system using endoscopic mass survey screening images.

In conclusion, we developed a CNN system for detecting gastric cancer using stored endoscopic images, and it processed an extensive set of independent images in a very short time. The clinically relevant diagnostic ability of the CNN makes it promising for daily clinical practice, both for reducing the burden on endoscopists and for telemedicine in remote and rural areas and in developing countries where the number of endoscopists is limited.