Introduction

Endoscopic submucosal dissection (ESD) is the standard treatment for early gastric cancer (EGC) in East Asia [1,2,3,4,5]. En bloc excision of cancer allows for a detailed histopathological evaluation, whereby treatment curability is determined. In the Japanese guidelines, when EGC resected by ESD does not fulfill the curability criteria, the resection is classified as endoscopic curability C (i.e., noncurative resection), which is further subclassified into endoscopic curability C-1 and C-2 [6]. Because the latter cases potentially have a risk of lymph node metastasis (LNM), additional surgery with lymphadenectomy is recommended. However, a recent meta-analysis reported that LNM was found in only 8.0% of patients with an endoscopic curability of C-2 [7]. As the risk of LNM varies among patients within the endoscopic curability C-2 group, subjecting all patients to additional surgery results in overtreatment. To minimize unnecessary additional surgeries, a precise prediction method for the LNM risk of EGCs categorized as endoscopic curability C-2 is needed.

For this purpose, we focused on machine learning (ML), which has been adopted to build accurate prediction models in various fields of medicine, including gastroenterology [8,9,10,11,12]. ML is a branch of artificial intelligence that uses algorithms to enable computers to learn automatically from data and determine the rules behind them. Once an ML algorithm is trained, it can predict unknown outcomes from new data with high accuracy. Currently, several scoring models stratify the risk of LNM in patients with EGC; however, all use conventional statistical analyses [13,14,15,16]. We hypothesized that ML models might perform better than existing models established using statistical analyses.

This study aimed to develop an ML-based risk prediction model for LNM in patients with EGC classified as endoscopic curability C-2 and compare its performance with that of the existing scoring model. Among the existing models, the “eCura system” is the most common risk-scoring model for LNM of EGC classified as endoscopic curability C-2 [14], and is currently recommended in the Japanese guidelines [6]. Hence, in this study, we chose this model for comparison.

Methods

Patients

This multicenter retrospective study was conducted at 21 institutions. The study was approved by the institutional review board of Osaka University (approval number: 22171, approval date: July 26, 2022) and the participating hospitals and was performed in accordance with the guidelines outlined in the Declaration of Helsinki.

We used the data of consecutive patients with EGC who were treated with surgery, ESD with additional surgery, or ESD alone between 2010 and 2021 and whose lesions were histologically confirmed as endoscopic curability C-2. EGC was defined as an adenocarcinoma limited to the mucosa or submucosa, irrespective of LNM [17]. Exclusion criteria were as follows: special histological types of gastric cancer (e.g., neuroendocrine neoplasms, carcinoma with lymphoid stroma, adenocarcinoma of the fundic gland type [18, 19]), esophagogastric junction cancer, synchronous advanced cancer (in the stomach or other organs), synchronous EGC with endoscopic curability C-2, postoperative stomach, and missing data. Patients in the surgery group who had undergone preoperative chemotherapy were excluded. In the ESD-alone group, patients with a follow-up period of < 3 years (except those who died of known causes within that period) and those who received adjuvant chemotherapy after ESD were excluded. Finally, patients who underwent surgery without lymphadenectomy (i.e., local resection only) were excluded even if they could be followed up for ≥ 3 years, because local resection of the stomach is described as an investigational treatment in the Japanese gastric cancer treatment guidelines [6] and was not commonly performed; such patients might therefore have followed an atypical clinical course during surveillance.

Definition of endoscopic curability C-2

After endoscopic or surgical resection, histopathological evaluation was performed according to the Japanese classification system at each institution [17]. Specimens resected by ESD were sectioned at 2 mm intervals, whereas surgically resected specimens were sectioned at 5 mm intervals. Lymphovascular involvement was first examined by hematoxylin and eosin staining, and in cases with inconclusive findings, immunohistochemical staining was added.

Resected EGC was considered curatively resected when it was resected in one piece, had no cancer-positive margins or lymphovascular involvement, and met one of the following conditions: (i) mucosal differentiated cancer without ulceration; (ii) mucosal differentiated cancer with ulceration, ≤ 30 mm in diameter; (iii) mucosal undifferentiated cancer without ulceration, ≤ 20 mm in diameter; or (iv) shallow (< 500 μm from the muscularis mucosae) submucosal differentiated cancer, ≤ 30 mm in diameter.

Otherwise, the resected EGC was considered to be in a state of endoscopic curability C (noncurative). If a positive horizontal margin was the only noncurative factor, it was categorized as endoscopic curability C-1. All other conditions were categorized as endoscopic curability C-2, and we included only patients whose lesions met this histopathological definition. This definition of endoscopic curability C-2 was based on the Japanese gastric cancer treatment guidelines [6].
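The classification rules above can be sketched as a small decision function. This is a minimal illustration that follows only the conditions listed in this section; the function and parameter names are hypothetical, and treating histology as a single differentiated/undifferentiated flag is a simplifying assumption.

```python
def meets_curative_condition(*, differentiated, depth, ulceration, size_mm):
    """Check conditions (i)-(iv) from the text. depth is "M", "SM1" or "SM2"."""
    if depth == "M":
        if differentiated:
            # (i) no ulceration: any size; (ii) ulceration: <= 30 mm
            return (not ulceration) or size_mm <= 30
        # (iii) undifferentiated, no ulceration, <= 20 mm
        return (not ulceration) and size_mm <= 20
    if depth == "SM1":
        # (iv) shallow submucosal, differentiated, <= 30 mm
        return differentiated and size_mm <= 30
    return False  # SM2 never satisfies the listed conditions


def classify_curability(*, en_bloc, horizontal_margin_positive,
                        vertical_margin_positive, lymphovascular_involvement,
                        differentiated, depth, ulceration, size_mm):
    """Return "curative", "C-1" or "C-2" (hypothetical labels)."""
    other_factors_ok = (
        en_bloc
        and not vertical_margin_positive
        and not lymphovascular_involvement
        and meets_curative_condition(differentiated=differentiated, depth=depth,
                                     ulceration=ulceration, size_mm=size_mm)
    )
    if other_factors_ok and not horizontal_margin_positive:
        return "curative"
    if other_factors_ok:
        return "C-1"  # positive horizontal margin is the only noncurative factor
    return "C-2"
```

For example, a 15 mm differentiated lesion invading the submucosa to ≥ 500 μm (SM2) is classified as C-2 even when all margins are negative.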

Data collection

The following data were collected: age, sex, tumor location, size, histological type, invasion depth, histopathological ulceration, lymphatic involvement, and vascular involvement. Histological types were classified as follows: (i) well-differentiated tubular adenocarcinoma (tub1); (ii) moderately differentiated tubular adenocarcinoma (tub2); (iii) papillary adenocarcinoma (pap); (iv) poorly differentiated adenocarcinoma (por); (v) signet-ring cell carcinoma (sig); and (vi) mucinous adenocarcinoma (muc). When more than one histological type was present in the tumor, the two most dominant histological types were recorded in descending order of dominance (e.g., tub2 > tub1). Tub1, tub2, and pap were categorized as differentiated types, and por, sig, and muc were categorized as undifferentiated types. If the lesion had both differentiated and undifferentiated components, it was regarded as a mixed type. Invasion depth was classified into three categories: tumor limited to the mucosa (M), tumor invading the submucosa to a depth of < 500 μm from the muscularis mucosae (SM1), and tumor invading the submucosa to a depth of ≥ 500 μm (SM2). Vertical margins were also investigated in patients who underwent ESD (with or without additional surgery). For the ESD-alone group, the development of metastatic recurrence in the lymph nodes and/or other organs during follow-up was also surveyed. Data were obtained from the medical records of each participating institution between August 2022 and December 2022.
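The histology grouping and depth categories described above can be expressed compactly. The following is a minimal sketch; the function names are hypothetical, and representing a mucosal tumor by a depth of None is an encoding assumed here for illustration.

```python
DIFFERENTIATED = {"tub1", "tub2", "pap"}
UNDIFFERENTIATED = {"por", "sig", "muc"}


def histology_group(types):
    """Map the recorded histological types (dominant first) to a group.

    Returns "differentiated", "undifferentiated" or "mixed".
    """
    if not types:
        raise ValueError("at least one histological type is required")
    unknown = set(types) - DIFFERENTIATED - UNDIFFERENTIATED
    if unknown:
        raise ValueError(f"unexpected histological types: {sorted(unknown)}")
    has_diff = any(t in DIFFERENTIATED for t in types)
    has_undiff = any(t in UNDIFFERENTIATED for t in types)
    if has_diff and has_undiff:
        return "mixed"
    return "differentiated" if has_diff else "undifferentiated"


def depth_category(depth_um):
    """Categorize submucosal invasion depth in micrometres.

    None means the tumor is limited to the mucosa (assumed encoding).
    """
    if depth_um is None:
        return "M"
    return "SM1" if depth_um < 500 else "SM2"
```

With these definitions, a lesion recorded as tub2 > por is grouped as mixed, and an invasion depth of exactly 500 μm falls into SM2.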

Definitions of outcome

The outcome selected to develop the ML model was LNM. For the surgery or ESD with additional surgery groups, it was defined as the presence of histologically identified metastases in the resected lymph nodes. For the ESD-alone group, it was defined as the development of metastatic recurrence in the lymph nodes and/or other organs diagnosed by computed tomography during follow-up. When patients in the ESD-alone group did not develop metastatic recurrence during a follow-up period of ≥ 3 years, LNM was considered negative. Patients with follow-up periods < 3 years were excluded from the ESD-alone group, except for those who died of known causes.
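The outcome definition can be sketched as a labeling function. The treatment labels and parameter names below are hypothetical. The text does not state how ESD-alone patients who died of known causes within 3 years were labeled, so the rule used for them here is an assumption.

```python
def lnm_label(treatment, *, nodes_positive=None, metastatic_recurrence=None,
              followup_months=None, died_of_known_cause=False):
    """Return 1 (LNM positive), 0 (LNM negative) or None (excluded)."""
    if treatment in ("surgery", "esd_plus_surgery"):
        if nodes_positive is None:
            raise ValueError("lymph node histology is required after surgery")
        # Histologically identified metastasis in the resected lymph nodes
        return 1 if nodes_positive else 0
    if treatment != "esd_alone":
        raise ValueError(f"unknown treatment: {treatment!r}")
    if followup_months is None or metastatic_recurrence is None:
        raise ValueError("follow-up data are required for the ESD-alone group")
    if followup_months < 36 and not died_of_known_cause:
        return None  # excluded: follow-up shorter than 3 years
    # Assumption: patients who died of known causes within 3 years are
    # labeled by whether metastatic recurrence had been diagnosed.
    return 1 if metastatic_recurrence else 0
```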

Development of the ML model

We created two datasets: a training cohort used to build the ML model and a validation cohort used to compare the performance of the ML model with that of the eCura system. The former included all patient groups (surgery, ESD with additional surgery, or ESD alone), whereas the latter included only patients who underwent ESD (with or without additional surgery). The reasons for this were as follows: (i) the actual prediction targets of our ML model and the eCura system were patients who underwent noncurative ESD, and (ii) in the eCura system, a positive vertical margin was set as a risk factor, which is assessable only in lesions resected by ESD. We randomly separated patients who underwent ESD (with or without additional surgery) into training and validation groups.
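The cohort construction can be sketched as follows. The split ratio and random seed are not reported in this section, so the defaults below are illustrative assumptions. The even split matches the equal numbers of ESD patients reported for the two cohorts in the Results. The record layout and treatment labels are also hypothetical.

```python
import random


def split_cohorts(patients, validation_fraction=0.5, seed=0):
    """Split patients into training and validation cohorts.

    patients: list of dicts with a "treatment" key ("surgery",
    "esd_plus_surgery" or "esd_alone"; labels assumed here).
    """
    surgery = [p for p in patients if p["treatment"] == "surgery"]
    esd = [p for p in patients if p["treatment"] != "surgery"]
    rng = random.Random(seed)
    rng.shuffle(esd)  # random assignment of ESD patients only
    n_val = round(len(esd) * validation_fraction)
    validation = esd[:n_val]
    training = surgery + esd[n_val:]  # all primary surgery cases go to training
    return training, validation
```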

The ML model was constructed as a neural network with two hidden layers using Scikit-learn (https://scikit-learn.org), an ML library for Python. The training data were divided into four parts during the model training process, and parameter tuning was performed through fourfold cross-validation. We used the Adam optimizer for optimization. After parameter tuning, the first and second hidden layers comprised 6 and 18 nodes, respectively. The final inference model was an ensemble model (simple averaging) of the four models obtained through fourfold cross-validation. Hyperparameters of our ML model are listed in Supplementary Table S1.
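A minimal sketch of such a fold ensemble with Scikit-learn is shown below. It is not the authors' code: the stratified folds, the iteration limit, the random seed, and the synthetic data are assumptions for illustration, and the actual hyperparameters are those listed in Supplementary Table S1.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier


def train_fold_ensemble(X, y, n_splits=4, seed=0):
    """Fit one two-hidden-layer network per cross-validation fold."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    models = []
    for train_idx, _ in folds.split(X, y):
        model = MLPClassifier(
            hidden_layer_sizes=(6, 18),  # node counts reported in the text
            solver="adam",
            max_iter=500,                # illustrative value, not from the paper
            random_state=seed,
        )
        model.fit(X[train_idx], y[train_idx])
        models.append(model)
    return models


def predict_lnm_probability(models, X):
    """Simple average of the positive-class probabilities of all fold models."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)


# Synthetic stand-in data: 7 features scaled to [0, 1] and a binary outcome.
rng = np.random.default_rng(0)
X_demo = rng.random((200, 7))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1.0).astype(int)
models = train_fold_ensemble(X_demo, y_demo)
probs = predict_lnm_probability(models, X_demo)
```

Averaging the four fold models uses every training patient in three of the four fits, which is one common way to obtain a final predictor from cross-validation without a separate refit.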

For model development, we initially used age, sex, tumor location, lesion size, dominant histology, presence or absence of mixed-type histology, invasion depth, lymphatic involvement, vascular involvement, histopathological ulceration, vertical margin, and treatment method as input parameters. Through parameter tuning within the training dataset, the best predictions were achieved using the following seven factors: lesion size, dominant histology (tub2 or others), presence or absence of mixed-type histology, invasion depth (M, SM1, or SM2), lymphatic involvement (positive or negative), vascular involvement (positive or negative), and treatment method (surgery, or ESD with/without additional surgery). Most of our data were encoded as binary variables (i.e., 0 or 1), except for invasion depth and lesion size. Invasion depth was ordinally encoded as 1 for SM2, 0.5 for SM1, and 0 for M. Lesion size was scaled to a range of 0 to 1 by dividing the raw value by 100.
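The encoding of the seven inputs can be sketched as below. The feature order, the choice of which category maps to 1 for each binary variable, and the assumption that lesion size is recorded in millimetres are not stated in the text and are illustrative only.

```python
DEPTH_CODE = {"M": 0.0, "SM1": 0.5, "SM2": 1.0}  # ordinal encoding from the text


def encode_features(*, size_mm, dominant_histology, mixed_type, depth,
                    lymphatic, vascular, primary_surgery):
    """Return the seven model inputs as a list of floats."""
    return [
        size_mm / 100.0,                               # lesion size divided by 100
        1.0 if dominant_histology == "tub2" else 0.0,  # tub2 versus others
        1.0 if mixed_type else 0.0,                    # mixed-type histology
        DEPTH_CODE[depth],                             # M / SM1 / SM2
        1.0 if lymphatic else 0.0,                     # lymphatic involvement
        1.0 if vascular else 0.0,                      # vascular involvement
        1.0 if primary_surgery else 0.0,               # treatment method
    ]
```

For instance, a 25 mm tub2-dominant SM1 lesion with lymphatic involvement resected by ESD is encoded as [0.25, 1.0, 0.0, 0.5, 1.0, 0.0, 0.0] under these assumptions.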

Statistical analysis

The Chi-squared and Fisher exact tests were used to compare categorical data, and the Kruskal–Wallis and Mann–Whitney U tests were used to compare continuous data. The area under the receiver operating characteristic curve (AUC) was used to measure the performance of the prediction models, and DeLong’s test was used to compare the AUCs. P values < 0.05 were considered statistically significant. Analyses were performed using JMP Pro version 16 (SAS Institute, Cary, NC, USA) or EZR version 1.61 (Saitama Medical Center, Jichi Medical University, Japan).
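For readers unfamiliar with the AUC, it equals the probability that a randomly chosen LNM-positive patient receives a higher score than a randomly chosen LNM-negative patient. The sketch below computes it directly from that definition. It is an illustration only, not the JMP Pro or EZR procedure used in this study, and it does not implement DeLong's test.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

With labels [0, 0, 1, 1] and scores [0.1, 0.4, 0.35, 0.8], three of the four positive/negative pairs are ranked correctly, giving an AUC of 0.75.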

Results

Study cohort

Figure 1 shows the flowchart of patient selection. Among the 4,873 patients initially identified, 831 were excluded, and 4,042 were finally included: 3,506 patients in the training cohort and 536 patients in the validation cohort. In the training cohort, 2,970 patients (85%) underwent surgery, 414 (12%) underwent ESD with additional surgery, and 122 (3%) underwent ESD alone. In the validation cohort, 401 patients (75%) underwent ESD with additional surgery, and 135 (25%) underwent ESD alone. In the ESD-alone group, the median follow-up periods for the training and validation cohorts were 57 months (interquartile range [IQR] 41–73) and 55 months (IQR 41–74), respectively. Table 1 presents the characteristics of the training and validation cohorts. LNM was observed in 503 (14%) and 39 (7%) patients in the training and validation cohorts, respectively. The patient and lesion characteristics according to treatment are shown in Supplementary Table S2.

Fig. 1 Patient selection flowchart. Pts, patients; ESD, endoscopic submucosal dissection; EGC, early gastric cancer; LNM, lymph node metastasis

Table 1 Characteristics of the training and validation cohorts

Performance of the ML model

The ML model identified patients with LNM with an AUC of 0.83 (95% confidence interval [CI], 0.76–0.89) in the validation cohort, whereas the eCura system identified patients with LNM with an AUC of 0.77 (95% CI, 0.70–0.85) (P = 0.006, DeLong’s test) (Fig. 2). At the cutoff scores where the ML model and the eCura system identified patients with LNM with 100% sensitivity (i.e., a score of 0.02778 for the ML model and 0 for the eCura system), the specificity values were 24% (95% CI, 20%–28%) for the ML model versus 0% (95% CI, 0.0%–1.1%) for the eCura system. This indicates that the ML model could reduce unnecessary surgery by up to 24% with a minimized risk of overlooking LNM, whereas no patients could avoid surgery with the eCura system.
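The cutoff selection described above can be sketched as follows. The function name and the convention that a patient is predicted positive when the score is at or above the cutoff are assumptions for illustration.

```python
def specificity_at_full_sensitivity(labels, scores):
    """Return (cutoff, specificity) at the highest cutoff with 100% sensitivity.

    A case is predicted positive when its score is >= cutoff (assumed convention).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    cutoff = min(pos)  # every positive case scores at or above this value
    true_negatives = sum(s < cutoff for s in neg)
    return cutoff, true_negatives / len(neg)
```

Under this convention, a scoring system whose lowest possible score also occurs among LNM-positive patients has a cutoff equal to that lowest score, so no LNM-negative patient falls below it and the specificity is 0, which is the situation reported for the eCura system.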

Fig. 2 Receiver operating characteristic curves for the validation cohort (n = 536). AUC, area under the curve

The permutation feature importance of the seven variables used in the ML model was calculated for the training cohort (Fig. 3), and lymphatic involvement was found to be the most important factor for LNM.
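Permutation feature importance measures how much a model's performance drops when the values of one input are shuffled across patients. The sketch below shows the general technique with a toy model. The scoring metric, number of repeats, and all names are assumptions, as the text does not specify them.

```python
import random


def permutation_importance(predict, X, y, score_fn, n_repeats=10, seed=0):
    """Mean drop in score when each feature column is shuffled in turn.

    predict: function mapping a list of feature rows (lists) to a list of scores.
    score_fn: function (y, scores) -> float, where higher is better.
    """
    rng = random.Random(seed)
    baseline = score_fn(y, predict(X))
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:]
                      for row, v in zip(X, shuffled)]
            drops.append(baseline - score_fn(y, predict(X_perm)))
        importances.append(sum(drops) / n_repeats)
    return importances


def accuracy(y, scores):
    """Fraction of cases classified correctly at a 0.5 threshold."""
    return sum((s >= 0.5) == bool(t) for t, s in zip(y, scores)) / len(y)


def toy_predict(rows):
    """Toy model that uses only the first feature."""
    return [row[0] for row in rows]


X_toy = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [1.0, 1.0]] * 5
y_toy = [int(row[0]) for row in X_toy]
importances = permutation_importance(toy_predict, X_toy, y_toy, accuracy)
```

In this toy example the second feature is ignored by the model, so shuffling it leaves the score unchanged and its importance is zero, whereas shuffling the first feature degrades accuracy.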

Fig. 3 Permutation feature importance of the seven variables used to construct the machine learning model in the training cohort

A web application of the ML model

We developed a web application to make our ML model freely available for clinicians (https://www.med.osaka-u.ac.jp/pub/gh/egc-lnm-prediction.html).

Discussion

Our novel neural-network-based ML model derived from a large multi-institutional cohort identified the presence of LNM in patients with EGC categorized as endoscopic curability C-2 better than the most common risk-scoring model in Japan (i.e., the eCura system). Notably, the ML model performed better than the eCura system in choosing very low-risk patients who could be safely managed with only ESD. Our ML model has the potential to minimize unnecessary surgeries after gastric ESD.

Several researchers have developed ML models that predict the risk of LNM in patients with EGC [20,21,22,23,24]; however, these studies include many lesions satisfying the endoscopic curability criteria that have no risk of LNM. In contrast, we used only EGC data categorized as endoscopic curability C-2 (i.e., lesions at a high risk of LNM). Considering that prediction models are used for patients who are classified as endoscopic curability C-2 after gastric ESD, our ML model is more suitable and reliable than those previously reported.

Our study had the following strengths. First, we directly compared the new ML model with the eCura system, the most common risk-scoring model, which is currently recommended in the Japanese gastric cancer treatment guidelines [6]. The eCura system was developed based on data from patients who underwent surgery after noncurative ESD [14]. Hence, we excluded patients who underwent surgery as the first treatment from the validation cohort to allow the eCura system to demonstrate its true performance. Under these fair conditions, our ML model showed a significantly higher AUC than the eCura system (0.83 versus 0.77, P = 0.006). The predictive ability of the eCura system in this study (AUC of 0.77) was close to the original result reported by its developers (Hatta W, et al.) (AUC of 0.74) [14], which supports the credibility of our results. Second, we included information on the presence of mixed-type histology in the ML model because it is a potential predictor of LNM in EGC [13, 25,26,27]. In fact, the analysis of feature importance showed mixed-type histology to be the fourth most important factor in our model (Fig. 3). Since the eCura system does not evaluate mixed-type histology, we believe that this difference contributed to the better results with our model.

One of the problems with the eCura system was that the number of undifferentiated-type EGCs, which are often treated by primary surgery, was small in the development cohort (14%, 150/1101 cases). Thus, Hatta et al. reported that the risk of undifferentiated-type histology may be underestimated in the eCura system [28]. To address this problem, we included primary surgical cases in the training cohort, which increased the number of patients with undifferentiated-type EGC (40%, 1490/3506 cases). Ideally, the number of patients with undifferentiated-type EGC would be increased using only ESD cases. However, because of the limited number of ESD cases available, we used primary surgical cases as an alternative.

Although a positive vertical margin is regarded as a risk factor for LNM in the eCura system, we did not include this factor in our ML model because it did not improve the predictive power (data not shown). This might be because we used many surgery cases in the training cohort in which the vertical margin was not evaluable.

We classified the histologic types of EGC into two groups, tub2 or others, for our final ML model. Other classifications, such as differentiated versus undifferentiated, did not show better performance. One reason for this could be that tub2 was the most frequent histologic type among LNM-positive EGCs (41%, 204/503) in the training cohort in this study.

When our ML model is used in clinical settings, the worst scenario is overlooking LNM because it may eventually cause metastatic recurrence. Once this occurs, salvage surgery is almost impossible, and the recurrence can be fatal [29]. Therefore, we chose the cutoff score of the ML model by setting the sensitivity to 100% in the validation cohort. At 100% sensitivity, the ML model had a specificity of 24%, whereas the eCura system had a specificity of 0%. This means that among the 497 patients who did not have LNM in the validation cohort, the ML model could help 120 patients (24%) avoid unnecessary surgery, whereas none (0%) could avoid unnecessary surgery with the eCura system. Thus, our ML model performed better than the eCura system in correctly identifying patients who did not require surgery after ESD. The characteristics of those 120 patients who could have avoided additional surgery with our ML model (i.e., true negatives in our ML model) are shown in Supplementary Table S3. The eCura scores of those patients were 0 points for 49 patients (41%), 1 point for 68 patients (57%), and 2 points for 3 patients (2%). None of these patients had an eCura score of ≥ 3 points or lymphatic involvement (Table S3).

Our study had several limitations. First, the sectioning interval of the resected specimen differed between the ESD (2 mm) and surgery (5 mm) groups. Histopathological evaluation of surgically resected specimens carries the risk of underestimating the invasion depth and overlooking lymphatic and/or vascular involvement because of the wide sectioning interval. As a result, the effect of each risk factor on LNM may differ between the ESD and surgery groups. To minimize this problem, we included “treatment method” as an input parameter during training so that the ML model could learn the differences between the ESD and surgery groups. Second, immunohistochemical staining for assessing lymphatic and vascular involvement was not performed in all cases. This might also have caused underestimation of lymphatic and/or vascular involvement. Third, the vertical cancer margin was not evaluable in surgically resected specimens. Fourth, the proportion of patients with LNM-positive EGC in the validation cohort (7%) was lower than that in the training cohort (14%). Fifth, regarding the ESD-alone group, whether a minimum of 3 years of follow-up was sufficient remains controversial. However, considering that metastatic recurrence often appears within 3 years after EGC resection [30], we believe our follow-up period was acceptable. Sixth, we did not collect information on the extent of lymph node dissection or the number of resected lymph nodes in patients who underwent surgery, which may differ according to preoperative staging. Some patients might have undergone insufficient lymph node dissection, causing an underestimation of LNM; however, our large cohort might have reduced this bias.

In conclusion, we developed an ML model that performed better than the eCura system in predicting the risk of LNM in patients with EGC who did not meet the Japanese endoscopic curability criteria. This precision model is potentially useful for minimizing unnecessary surgeries after gastric ESD. A prospective study is required to further validate our ML model.