Thymic epithelial tumors are prevalent in the anterior mediastinum, accounting for approximately 43 % of anterior mediastinal masses.1 Lymph node metastasis has been found to occur more frequently in tumors that invade adjacent organs than in those without such invasion, and the reported frequency of lymphatic metastasis far exceeds what was previously assumed.2,3

Lymph node metastasis is an important factor affecting tumor recurrence and the prognosis of patients with common tumors, and lymphadenectomy can be performed to accurately stage the tumor and control disease progression.4,5,6 Unfortunately, lymph node examination currently is less frequently performed during thymectomy, which can increase the likelihood that the degree of disease progression in patients will be misinterpreted. Therefore, an urgent need exists for a model to predict lymph node metastasis in thymic epithelial tumors to assist in clinical diagnosis and treatment.

Predictive models for lymph node metastasis have been developed for many cancer types, such as squamous non-small cell lung cancer, esophageal squamous cell carcinoma, and colorectal cancer.7,8,9 Nevertheless, a model for predicting lymphatic involvement of thymic epithelial tumors is hard to construct, for two main reasons. First, the incidence of the disease is low. The overall incidence of thymoma is 0.13 per 100,000 person-years in America10 and 0.09 to 0.23 per 100,000 person-years in Europe.11 Second, until recently no lymph node map for thymic epithelial tumors existed as a public reference for lymphatic resection. A new lymph node map was therefore proposed by the International Thymic Malignancy Interest Group (ITMIG) and published in the 8th edition of the tumor-node-metastasis (TNM) stage classification system for thymic malignancies.12,13

Using the Surveillance, Epidemiology, and End Results (SEER) database, this study aimed to develop and validate a predictive model for lymph node metastasis status after thymic epithelial tumor resection. The results of this study can be conveniently implemented in clinical work and can contribute to further guidance and optimization of treatment strategies for thymic epithelial tumors.

Materials and Methods

Patients and Study Design

The population data on thymic epithelial tumors were extracted from the SEER database of the American National Cancer Institute. The Incidence‐SEER 18 Regs Research database is based on the November 2017 submission through SEER*Stat software version 8.3.6 (Information Management Services, Inc., Calverton, MD).

As shown in Fig. 1, patients with a diagnosis of thymic epithelial tumors between 2004 and 2015 were selected for the study from the public-use SEER database. The eligible patients were divided into two cohorts by year of surgery: those who received surgery between 2004 and 2013 formed the training cohort, and those who received surgery between 2014 and 2015 formed the validation cohort.
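The temporal split described above can be sketched as a small helper. This is only an illustration: the function name and the idea of passing a single integer year are assumptions, not part of the SEER extraction used in the study.

```python
from typing import Optional

# Year ranges taken from the text; range() excludes its upper bound.
TRAINING_YEARS = range(2004, 2014)    # 2004 through 2013
VALIDATION_YEARS = range(2014, 2016)  # 2014 through 2015


def assign_cohort(year_of_surgery: int) -> Optional[str]:
    """Return the cohort for a given year, or None if outside 2004-2015.

    `year_of_surgery` is a hypothetical field name used for illustration.
    """
    if year_of_surgery in TRAINING_YEARS:
        return "training"
    if year_of_surgery in VALIDATION_YEARS:
        return "validation"
    return None
```

Splitting by calendar period rather than at random means the validation cohort is later in time than the training cohort, which is closer to how a model would be applied to future patients.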

Fig. 1 Population inclusion flowchart

The inclusion criteria specified the following: (1) The histopathologic diagnosis had to be included. Histologic type had to be coded according to the International Classification of Diseases for Oncology, third revision (ICD-O-3), using the codes that correspond to the 2015 World Health Organization Classification of Tumors of the Thymus.14 The codes for thymic carcinoma were 8004, 8011, 8020, 8021, 8032, 8033, 8052, 8070, 8071, 8072, 8073, 8074, 8075, 8082, 8083, 8090, 8094, 8123, 8140, 8260, 8310, 8430, 8480, 8481, 8560, 8575, 8576, 8586, 8588, 8589, and 8980, and the codes for thymoma were 8580, 8581, 8582, 8583, 8584. All the histologic type codes were accompanied by the malignant behavior code 3. (2) Only patients with a diagnosis of tumor were included. (3) Only patients who received surgery were included.

The exclusion criteria ruled out patients whose race, marital status, lymphatic metastases, tumor size, tumor extension, histology type, or distant metastasis was unknown.
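The inclusion and exclusion rules above can be expressed as a simple filter. The sketch below is illustrative only: the record layout (a dictionary with keys such as `histology`, `behavior`, and `surgery`) and the use of `None` for unknown values are assumptions, and the code lists are copied as printed in the text, so the thymoma list may be incomplete.

```python
from typing import Optional

# ICD-O-3 histology codes exactly as listed in the inclusion criteria.
THYMIC_CARCINOMA_CODES = frozenset({
    8004, 8011, 8020, 8021, 8032, 8033, 8052, 8070, 8071, 8072, 8073,
    8074, 8075, 8082, 8083, 8090, 8094, 8123, 8140, 8260, 8310, 8430,
    8480, 8481, 8560, 8575, 8576, 8586, 8588, 8589, 8980,
})
THYMOMA_CODES = frozenset({8580, 8581, 8582, 8583, 8584})

# Fields whose value must be known (hypothetical key names).
REQUIRED_FIELDS = (
    "race", "marital_status", "lymph_node_metastasis", "tumor_size",
    "tumor_extension", "histology", "distant_metastasis",
)


def histology_group(icd_o_3_code: int) -> Optional[str]:
    """Map an ICD-O-3 histology code to thymoma or thymic carcinoma."""
    if icd_o_3_code in THYMOMA_CODES:
        return "thymoma"
    if icd_o_3_code in THYMIC_CARCINOMA_CODES:
        return "thymic carcinoma"
    return None


def is_eligible(record: dict) -> bool:
    """Apply the inclusion criteria, then the unknown-value exclusions."""
    histology = record.get("histology")
    if histology is None or histology_group(histology) is None:
        return False
    if record.get("behavior") != 3:        # malignant behavior code 3
        return False
    if not record.get("surgery", False):   # surgery is required
        return False
    # Exclude records where any required field is unknown (None here).
    return all(record.get(field) is not None for field in REQUIRED_FIELDS)
```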

Variable Definition

The candidate variables in the analysis were age at diagnosis, sex, race, marital status, tumor size, tumor extension, histologic type, and histologic grade. Race was separated into white, black, and Asian (Asian Indian, Pakistani, Chinese, Filipino, Japanese, Kampuchean, Korean, Laotian, and Vietnamese). Marital status was grouped as single (divorced, separated, single, or unmarried or domestic partner) and married. Extension of tumor included four subgroups: location (CS Extension code 100 or 300 and CS Mets at Dx code 00 or 10), adjacent connective tissue (CS Extension code 400 and CS Mets at Dx code 00 or 10), adjacent organs/structures (CS Extension code 600 and CS Mets at Dx code 00 or 10), and distance (two states according to the SEER manual: (1) CS Extension code 100, 300, 400, or 600 and CS Mets at Dx code 40 or 50 and (2) CS Extension code 800).15 Thymic epithelial tumors were classified into low-risk thymomas (type A, AB, and B1), high-risk thymomas (type B2 and B3), and thymic carcinomas (type C).16
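The mapping from the two SEER fields to the four extension subgroups can be written out explicitly. The following is a minimal sketch that assumes the codes arrive as zero-padded strings; the function name and return labels are chosen for illustration and simply reuse the subgroup names given above.

```python
from typing import Optional


def extension_group(cs_extension: str, cs_mets_at_dx: str) -> Optional[str]:
    """Classify tumor extension from CS Extension and CS Mets at Dx codes.

    Returns None for code combinations that the text does not describe.
    """
    # Distance, state (2): CS Extension code 800.
    if cs_extension == "800":
        return "distance"
    # Distance, state (1): lower extension codes with Mets at Dx 40 or 50.
    if cs_extension in {"100", "300", "400", "600"} and cs_mets_at_dx in {"40", "50"}:
        return "distance"
    # The remaining subgroups all require Mets at Dx 00 or 10.
    if cs_mets_at_dx in {"00", "10"}:
        if cs_extension in {"100", "300"}:
            return "location"
        if cs_extension == "400":
            return "adjacent connective tissue"
        if cs_extension == "600":
            return "adjacent organs/structures"
    return None
```

Checking the distance rules first keeps the subgroups mutually exclusive, because every other subgroup is restricted to CS Mets at Dx code 00 or 10.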

Statistical Analysis

Continuous data are described using median (interquartile range [IQR]), and categorical data are described as counts and percentages. Least absolute shrinkage and selection operator (LASSO) regression was performed on the training cohort using the lars package (https://mirrors.tuna.tsinghua.edu.cn/CRAN/web/packages/lars/lars.pdf), and the three variables with non-zero coefficients after feature selection were retained in the final prediction model.17 Nomograms were plotted for visual analysis using the rms package of R.18

To reduce overfitting bias, we used the area under the receiver operating characteristic curve (AUC) and calibration with 1000 bootstrap samples to measure the predictive performance of the nomogram. For convenience of clinical use, a novel scoring model was established to make clinical prediction easier. To estimate the performance of the scoring model, we used AUC, sensitivity, specificity, and accuracy. All statistical test results were considered significant when p was lower than 0.05. All statistical analyses were performed in R-3.6.2 (R Foundation for Statistical Computing, Vienna, Austria).19
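For readers less familiar with these performance measures, the sketch below shows how AUC, sensitivity, specificity, and accuracy can be computed from observed outcomes and model scores. It is a plain-Python illustration, not the R code used in the study; the function names, the rule that a score above the threshold counts as a positive prediction, and the omission of the bootstrap step are all assumptions made for brevity.

```python
from typing import Dict, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both outcome classes are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def threshold_metrics(labels: Sequence[int], scores: Sequence[float],
                      threshold: float) -> Dict[str, float]:
    """Sensitivity, specificity, and accuracy for score > threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s > threshold
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }
```

The pairwise definition of AUC used here is equivalent to the area under the ROC curve and avoids having to build the curve itself, at the cost of quadratic running time.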

Results

Patient Characteristics

As shown in Table 1, the statistical analysis included 1018 eligible patients divided into a training cohort (808 patients) and a validation cohort (210 patients). Men accounted for about half of the cohorts (52.4 %), and the median age was 59.0 years (IQR, 48.0–67.0 years). The median tumor size was 64 mm (IQR, 45.0–89.8 mm), and local invasion was mainly tumor invasive extension (35 %). Thymoma accounted for most of the thymic epithelial tumors (74.8 %), and the most common subtype was low-risk thymoma (40.5 %). Lymph node metastasis was found in 9.5 % of the study cohort.

Table 1 Characteristics of patients in the training and validation groups

Feature Selection Based on LASSO

Using least absolute shrinkage and selection operator (LASSO) regression with 10-fold cross-validation, a lambda (λ) value of 4.79 with a log (λ) of 0.68 was chosen (1-SE criterion), and features with non-zero coefficients were selected as the risk factors for lymph node involvement in thymic epithelial tumors, as shown in Fig. 2. From eight features, this study selected three: age, extension, and histology type.
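The 1-SE criterion picks the largest λ whose cross-validated error is within one standard error of the minimum error, which favors a sparser model. The sketch below illustrates that rule on made-up cross-validation values; the function name and the example numbers are hypothetical and are not the study's results. It also checks that the reported log (λ) of 0.68 is consistent with a base-10 logarithm of 4.79.

```python
import math
from typing import Sequence


def lambda_one_se(lambdas: Sequence[float], cv_mean: Sequence[float],
                  cv_se: Sequence[float]) -> float:
    """Largest lambda with mean CV error <= (minimum error + its SE)."""
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    limit = cv_mean[best] + cv_se[best]
    return max(lam for lam, err in zip(lambdas, cv_mean) if err <= limit)


# Hypothetical cross-validation summary, for illustration only.
example_lambdas = [0.5, 1.0, 2.0, 4.0, 8.0]
example_cv_mean = [0.30, 0.28, 0.29, 0.305, 0.40]
example_cv_se = [0.03, 0.03, 0.03, 0.03, 0.03]
chosen = lambda_one_se(example_lambdas, example_cv_mean, example_cv_se)

# Consistency check on the values reported in the text.
reported_log_lambda = round(math.log10(4.79), 2)
```

In this made-up example the minimum error occurs at λ = 1.0, but the rule selects λ = 4.0 because its error still lies within one standard error of that minimum.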

Fig. 2 Feature selection using the least absolute shrinkage and selection operator (LASSO) regression model

Construction of the Prognostic Model

As shown in Fig. 3, a nomogram was established based on feature selection. The predicted AUC of the nomogram was 0.80 (95 % confidence interval [CI] 0.75–0.85) for the training cohort, and 0.82 (95 % CI 0.70–0.93) for the validation cohort, as shown in Fig. 4A. Detailed scores of all the variables in the nomogram are shown in Table 2.

Fig. 3 Lymph node metastasis of thymic epithelial tumor nomogram

Fig. 4 Receiver operating characteristic (ROC) curves for lymph node metastasis prediction for patients using (a) the nomogram and (b) the Lymphatic Node Metastasis Risk Scoring System (LNMRS)

Table 2 Detailed scores of each predictor in the nomogram and the LNMRS

Based on the score of each variable in the nomogram, a simpler and more generalizable model, called the Lymphatic Node Metastasis Risk Scoring System (LNMRS), was constructed, with the detailed scores shown in Table 2. The predicted AUC of the LNMRS was 0.80 (95 % CI 0.75–0.85) for the training cohort, and 0.82 (95 % CI 0.70–0.93) for the validation cohort, as shown in Table 3. The receiver operating characteristic (ROC) curve is shown in Fig. 4B. The calibration curves show prediction curves close to the standard curve, as shown in Fig. 5.

Table 3 Model performance in the nomogram and the LNMRS

Fig. 5 Calibration curves for lymph node metastasis prediction for patients in the training cohort using (a) the nomogram and (b) the Lymphatic Node Metastasis Risk Scoring System (LNMRS)

We scored the entire cohort population using the LNMRS model and plotted the scores of both cohorts on a kernel-density map based on the incidence of lymph node metastasis. We determined a score of 13 to be the optimal threshold, whereby patients with a score lower than 13 have a low risk of metastasis and those with a score higher than 13 have a high risk of metastasis, as detailed in Fig. 6. For example, if a 40-year-old patient has pathologic thymic carcinoma and an extension of adjacent organs/structures, then this person has an LNMRS score of 17, which indicates a high risk of lymph node metastases based on the kernel-density map.
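The threshold rule can be written as a short function. This sketch takes a total LNMRS score as input and does not reproduce the per-variable points from Table 2. Because the text defines only scores lower and higher than 13, the handling of a score of exactly 13 is an assumption: it is returned as a separate label rather than assigned to either group.

```python
LNMRS_THRESHOLD = 13  # optimal threshold reported in the text


def lnmrs_risk_group(total_score: int, threshold: int = LNMRS_THRESHOLD) -> str:
    """Classify a total LNMRS score as low or high risk of metastasis."""
    if total_score < threshold:
        return "low risk"
    if total_score > threshold:
        return "high risk"
    # Assumption: the text does not say how a score equal to 13 is handled.
    return "at threshold"


# The worked example from the text: a total score of 17.
example_group = lnmrs_risk_group(17)
```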

Fig. 6 Kernel-density plot of the whole cohort score using the Lymphatic Node Metastasis Risk Scoring System (LNMRS)

Discussion

Currently, no predictive model exists for lymph node metastasis in thymic epithelial tumors. In this study, we developed a simple nomogram-based model called the Lymph Node Metastasis Risk Scoring System (LNMRS), which includes age, tumor extension, and histologic type. This prediction model had an AUC of 0.80 (95 % CI 0.75–0.85) for the training set and an AUC of 0.82 (95 % CI 0.70–0.93) for the validation set, with good discrimination and calibration. With only three variables, our model was not only objective and accurate, but also easier to generalize to clinical studies.

Some research has shown that lymph node status is a significant prognostic factor for patients with thymic epithelial tumors.2,3,20,21 Findings suggest that nodal sampling or lymph node dissection can be performed to acquire accurate staging and prediction of prognosis.2,22 Our analysis of 1018 patients found that lymph node metastasis is related to age, pathologic type, and tumor extent. This conclusion also was reached in another study.23 In addition, we noted that some patients with negative lymph node findings had high postoperative scores, which raises the question of whether preventive measures, such as adjuvant radiotherapy and individualized postoperative follow-up assessment, could be used for this group of patients.

The National Comprehensive Cancer Network (NCCN) suggests that patients with R0 resection need not be treated with chemotherapy or radiotherapy, but should be surveilled for recurrence with an annual chest computed tomography (CT) scan. However, lymph node status could not be shown clearly for patients with R0 resection.24

In our study, the probability of lymph node metastasis was calculated based on a nomogram with personal clinical information. Patients with R0 resection and a high predicted probability of lymphatic metastasis may be more likely to experience lymph node metastasis. Nevertheless, no study exists to support postoperative adjuvant therapy for such patients. Therefore, further research is needed.

The SEER database has a massive amount of clinical information for researchers to perform a large range of clinical studies. Based on the role of lymph node metastasis in disease progression, lymph node prediction using relevant variables from the SEER database has been performed for different malignancies with proven results.25,26,27

However, some limitations of the SEER database are inevitable. First, the SEER database gathers only patients’ clinical information, and neither the consistency nor the standardization of patient treatment could be ensured. Second, the follow-up period for the validation set was not long enough, so survival rates could not be assessed. Finally, some variable categories are not specific to a particular neoplasm (e.g., invasive carcinoma confined to gland of origin in CS Extension) and therefore do not distinguish Masaoka stage 1 from stage 2 neoplasms.

Conclusion

In our research, we developed a new model (LNMRS) for patients with thymic epithelial tumors based on the SEER database. This new model demonstrated good predictive accuracy, with an AUC of 0.80 in the training cohort and 0.82 in the validation cohort. The model could be a useful tool for predicting lymph node status in thymic epithelial tumors.