Background

Lung cancer is the second most commonly diagnosed cancer, with 2,206,771 new cases and 1,796,144 deaths worldwide, accounting for 11.4% of all new cancer cases and 18.0% of all cancer deaths [1]. Adenocarcinoma accounts for more than 40% of lung cancer cases and remains the predominant histological subtype of non-small cell lung cancer (NSCLC) [2]. The overall 5-year survival rate of patients with lung adenocarcinoma is as low as 19.4% [3].

The tumor-node-metastasis (TNM) stage is the main determinant of treatment strategy for patients with lung adenocarcinoma. Adjuvant therapy is not recommended for stage I patients after radical lung resection, whereas it is advisable for patients with stage II or higher disease [4,5,6]. For the same T1 or T2 stage, different pathological nodal (pN) statuses correspond to different TNM stages, which significantly affects the postoperative treatment plan [7]. In some cases, systematic lymph node dissection (SLND) is not completed, for example because of inadequate operative skill, concern about severe complications or life-threatening risk from further lymph node dissection, or the surgeon’s judgment that the risk of lymph node metastasis is low. Such patients lack true pN staging.

Therefore, there is an urgent need for a practical and highly accurate method to assess pN staging in patients without SLND. Several prediction models for pN staging or lymph node metastasis have been established [8,9,10,11,12,13]. However, these models either perform unsatisfactorily, with C-statistics < 0.8 in the validation cohort [8, 9, 11, 12], or require predictors from complex or costly tests [10, 13]. Few studies have focused on high-accuracy prediction of pN status based on features of the resected cancer specimen in patients with lung adenocarcinoma (LUAD) who undergo lung cancer resection without complete lymph node dissection. Hence, we analyzed the correlation between lymph node metastasis and the clinicopathological characteristics of LUAD and developed a pN prediction model to guide management after LUAD resection alone.

Methods

Source of data

This was a retrospective, single-center study in which clinicopathological information was extracted from the Sun Yat-Sen University Cancer Center (SYSUCC) database by a well-trained clinician and verified by a second clinician. The Ethics Committee of the center approved the study (approval number B2020-255-01). The requirement for informed consent was waived because of the study’s retrospective nature and because no identifying material was required. The inclusion criteria were as follows: (1) lung cancer resection, including wedge resection, segmentectomy, lobectomy, bilobectomy, or pneumonectomy, combined with SLND at the Department of Thoracic Surgery of SYSUCC between January 1, 2011, and April 27, 2021; (2) a detailed pathology report, including the components of the cancer specimen and the status of nerve tract, visceral pleural, and lymphovascular infiltration; and (3) a pathological subtype of lung adenocarcinoma. The exclusion criteria were as follows: (1) multiple primary lung cancers, (2) any type of neoadjuvant therapy, and (3) incomplete clinical or pathological data.

Finally, 2,696 eligible patients were enrolled in the study. Using random assignment, 1,918 patients were allocated to the development cohort. The remaining 718 patients constituted the validation cohort and took no part in model building.

Clinicopathological factors

The following items were retrieved from the electronic medical records: age, sex, smoking history, alcohol drinking history, family cancer history, individual cancer history, nerve tract infiltration (NTI), visceral pleural infiltration (PI), lymphovascular infiltration (LVI), tumor location, differentiation, components of the cancer specimen, tumor size, and lymph node station status. The pathological evaluation complied with the International Association for the Study of Lung Cancer (IASLC)/American Thoracic Society (ATS)/European Respiratory Society (ERS) lung adenocarcinoma classification system published in 2011 [14]. For simplicity, differentiation was classified into three categories: (1) only highly differentiated components (HD), (2) medium-differentiated components without low-differentiated components (MD), and (3) with low-differentiated components (LD). Tumor size was defined as the size of the tumor in the formalin-fixed specimen. SLND was performed in all cases according to the definition in the European Society of Thoracic Surgeons guidelines, in which all mediastinal tissue containing lymph nodes is removed systematically within anatomical landmarks and, as a minimum requirement, at least three mediastinal nodal stations, one of which must be the subcarinal station, are excised [15]. Thirty-six cases (3 with right lung cancer and 33 with left lung cancer) underwent bilateral mediastinal lymph node dissection by mediastinoscopy. LVI was defined as the presence of malignant cells within vascular or lymphatic spaces on hematoxylin and eosin staining; PI and NTI were defined analogously.

Statistical analysis

Univariate analysis was conducted in the development cohort to test the correlation between clinicopathological factors and pN status, with the statistical test selected according to the data. For categorical variables, we used Pearson’s χ2 test if all expected counts under the null hypothesis were greater than 5, the continuity-corrected χ2 test if any expected count was at least 1 but less than 5, and Fisher’s exact test if any expected count was less than 1. For continuous variables, we used the t-test if both groups were normally distributed with homogeneous variance, and the Wilcoxon rank-sum and signed-rank tests otherwise. Categorical variables were transformed into dummy variables to make them easier to interpret in the model. Multivariate logistic regression was then performed on the factors that were statistically significant in univariate analysis, to adjust for confounding and to screen for important factors. Two-tailed P values < 0.05 were considered statistically significant. All statistical analyses were conducted using R software (version 3.6.3 [2020-02-29]).
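The test-selection rule for categorical variables can be sketched as follows. This is an illustrative Python sketch, not the R code used in the study; the function names are hypothetical, and treating an expected count of exactly 5 as sufficient for Pearson’s test is an assumption, because the text leaves that boundary open.

```python
def expected_counts(table):
    """Expected cell counts of an r x c contingency table under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def choose_categorical_test(table):
    """Pick a test from the smallest expected count (hypothetical helper).

    Assumption: an expected count of exactly 5 is handled by Pearson's test.
    """
    smallest = min(min(row) for row in expected_counts(table))
    if smallest < 1:
        return "fisher_exact"
    if smallest < 5:
        return "chi2_continuity_corrected"
    return "pearson_chi2"
```

For a 2 × 2 table of a binary factor against pN status, `choose_categorical_test([[10, 2], [3, 5]])` selects the continuity-corrected test because the smallest expected count is 2.8.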

Model development and validation

Logistic regression was used to build the prediction model for pN status in the development cohort because it is widely applied to binary outcomes. Model development comprised two main steps. First, a backward stepwise algorithm was applied to the factors identified in univariate analysis to select the factors incorporated into the final model and to eliminate redundant ones. During selection, the Akaike information criterion, an estimator of prediction error, was used to identify the set of factors with the minimum prediction error. The selected factors were then re-analyzed by multivariable analysis to ensure that all factors in the final model were significant (p < 0.05). Coefficients, odds ratios (OR), and 95% confidence intervals (CI) were calculated for each variable in the final logistic model. Second, a risk score (RS) was generated by summing the covariates multiplied by their model coefficients. The Youden index, defined as sensitivity + specificity − 1, was used to determine the optimal RS cut-off value. The validation cohort was used to test the stability of the established model.
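The cut-off selection by Youden index can be illustrated with a short Python sketch. It assumes that a patient is predicted positive when the risk score is greater than or equal to the candidate cut-off, a convention the text does not state; the function name and the exhaustive search over observed scores are illustrative choices, not the authors' implementation.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    scores: risk scores; labels: 1 = node-positive, 0 = node-negative.
    Assumption: a score >= cutoff counts as a positive prediction.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both outcome classes are required")
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
        j = tp / positives + tn / negatives - 1
        if j > best_j:  # keep the first (lowest) cutoff on ties
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j
```

Each observed score is tried as a candidate cut-off, which is sufficient because sensitivity and specificity change only at observed values.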

Model assessment

We assessed model performance in both cohorts using the following measures. The C-statistic, which is equivalent to the area under the receiver operating characteristic (ROC) curve, indicated the overall discrimination of the model. Sensitivity was the proportion of participants with a positive prediction among those who were truly positive, and specificity was the proportion with a negative prediction among those who were truly negative. Accuracy was the proportion of true-positive and true-negative predictions among all participants. Positive predictive value (PPV) was the proportion of true positives among participants with a positive prediction, and negative predictive value (NPV) was the proportion of true negatives among participants with a negative prediction. The sensitivities and specificities at different cut-off values were used to draw the ROC curves for the two cohorts. Decision curve analysis (DCA) was used to evaluate the clinical value of the model by determining whether using it to make clinical decisions yields a benefit over alternative decision criteria at a given threshold probability. A clinical impact curve was also used to evaluate the model, and a nomogram was plotted to visualize the relative weights of the predictor variables and to simplify use of the model.
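The performance measures defined above follow directly from the four cells of a confusion matrix, as the Python sketch below illustrates. The net-benefit function uses the formula commonly applied in decision curve analysis (true-positive rate minus false-positive rate weighted by the odds of the threshold probability); the text does not spell this formula out, so it is included here as an assumption, and all names are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy, PPV and NPV from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),  # positive predictions among true positives
        "specificity": tn / (tn + fp),  # negative predictions among true negatives
        "accuracy": (tp + tn) / total,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def net_benefit(tp, fp, n, threshold):
    """Net benefit at a threshold probability (assumed standard DCA formula)."""
    return tp / n - (fp / n) * threshold / (1 - threshold)
```

With made-up counts of 75 true positives, 40 false positives, 160 true negatives and 25 false negatives, `classification_metrics(75, 40, 160, 25)` gives a sensitivity of 0.75 and a specificity of 0.8; these counts are for illustration only and are not study data.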

Results

Among the participants, 509 patients (18.88%) had pN1/2 disease (Table 1). For right lung adenocarcinoma, the stations with the largest numbers of resected lymph nodes were #2-4 and #7, whereas for left lung adenocarcinoma they were #5-6 and #7 (Fig. 1). Table 2 shows that the development and validation cohorts were well balanced (p > 0.05 for all clinicopathological factors). The results of the univariate and multivariate analyses of clinicopathological features in the development cohort are listed in Table 3. In univariate analysis, eighteen factors correlated with lymph node metastasis (all p values < 0.01): tumor size, NTI, PI, LVI, right upper lobe location (RUL), differentiation (MD, LD, HD), six cancer specimen components (acinar, micropapillary, papillary, solid, lepidic, mucous), and four predominant patterns (solid, lepidic, micropapillary, acinar). In the multivariable analysis, eight factors remained significantly correlated: tumor size (p < 0.001), NTI (p = 0.011), PI (p = 0.019), LVI (p < 0.001), RUL (p = 0.001), micropapillary component (p = 0.024), lepidic component (p < 0.001), and micropapillary predominance (p < 0.001). Finally, the backward stepwise algorithm yielded nine factors for the final model: NTI (p = 0.013), PI (p = 0.017), LVI (p < 0.001), RUL (p = 0.001), LD (p < 0.001), micropapillary component (p = 0.019), lepidic component (p < 0.001), micropapillary predominance (p < 0.002), and tumor size (p < 0.001) (Table 4).

Fig. 1

Heatmap of the number of resected nodes at each nodal station for right lung cancer and left lung cancer. Each row represents one patient and each column represents a nodal station

Table 1 Resected nodal numbers and positive patient numbers in each nodal station for right lung cancer and left lung cancer
Table 2 Clinicopathological features of development and validation cohorts
Table 3 Univariate and multivariate analysis of clinicopathological features in the development cohort
Table 4 Logistic regression coefficients and odds ratios of the variables resulting from the backward stepwise algorithm

A novel risk-scoring method for lymph node metastasis prediction was established using the following equation:

$$\begin{aligned}RS = {} & -3.265 + 0.861 \times \mathrm{NTI} + 0.354 \times \mathrm{PI} + 1.906 \times \mathrm{LVI} - 0.527 \times \mathrm{RUL} + 1.040 \times \mathrm{LD} \\ & + 0.405 \times \text{micropapillary component} - 1.590 \times \text{lepidic component} \\ & + 1.354 \times \text{micropapillary predominant} + 0.442 \times \text{tumor size}\end{aligned}$$

where NTI, PI, LVI, RUL, LD, micropapillary component, lepidic component, and micropapillary predominant are binary variables coded 1 for present and 0 for absent, and tumor size is the mean of the three-dimensional lengths of the tumor in centimeters. The likelihood of lymph node metastasis increased with the RS value. According to the OR of each factor in the final model, NTI, PI, LVI, LD, micropapillary component, micropapillary predominance, and larger tumor size increased the probability of lymph node metastasis; LVI was the largest contributor (OR: 6.728, 95% CI: 4.837–9.400), followed by micropapillary predominance and LD. Right upper lobe location (OR: 0.590, 95% CI: 0.431–0.800) and lepidic component (OR: 0.204, 95% CI: 0.102–0.371) decreased the probability of lymph node metastasis in the development cohort. The optimal RS cut-off, corresponding to the highest Youden index (0.568), was −1.430. In the development cohort, the C-statistic by the bootstrap resampling method was 0.861 (95% CI: 0.842–0.883), and the sensitivity and specificity were 0.754 (95% CI: 0.706–0.798) and 0.814 (95% CI: 0.794–0.833), respectively. In the validation cohort, the C-statistic by the bootstrap resampling method was 0.861 (95% CI: 0.8406–0.8740), and the sensitivity and specificity were 0.686 (95% CI: 0.607–0.757) and 0.811 (95% CI: 0.778–0.841), respectively (Table 5).

The nomogram based on the logistic regression model (Fig. 2) simplifies interpretation and use of the model. In the nomogram, each indicator value is converted to points by drawing a vertical line from the indicator value to the point axis. The probability of lymph node metastasis is then read from the total number of points. The calibration curve (Fig. 3) of the development cohort showed good agreement between the observed and predicted outcomes. The area under the ROC curve (Fig. 4) was 0.861, significantly better than that of the non-informative ROC curve (the diagonal solid line). The decision curve in Fig. 5 compares the prediction model with the alternative clinical decision strategies by calculating the net benefit at all possible thresholds. We also drew a clinical impact curve to compare the number of people classified as positive (high risk) with the number of true positives at each threshold probability and to measure the cost-benefit ratio at different probability thresholds. At a probability cut-off of 0.5, the cost-benefit ratio was approximately 1:1, which can serve as a reference for other probability thresholds (Fig. 6).
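As a worked illustration of the risk-score equation, the Python sketch below computes RS from the published coefficients and converts it to a predicted probability. Two points are assumptions rather than statements from the text: because RS already contains the model intercept, it is treated as the linear predictor of the logistic model, so the probability is taken as 1 / (1 + exp(−RS)); and a patient is labeled high risk when RS is greater than or equal to the cut-off of −1.430. The dictionary keys and function names are hypothetical.

```python
import math

# Intercept, coefficients and cut-off copied from the equation and text above.
INTERCEPT = -3.265
COEFFICIENTS = {
    "nti": 0.861,
    "pi": 0.354,
    "lvi": 1.906,
    "rul": -0.527,
    "ld": 1.040,
    "micropapillary_component": 0.405,
    "lepidic_component": -1.590,
    "micropapillary_predominant": 1.354,
    "tumor_size_cm": 0.442,
}
CUTOFF = -1.430


def risk_score(features):
    """RS for one patient; binary features are 0/1, missing keys count as 0."""
    return INTERCEPT + sum(
        coef * features.get(name, 0) for name, coef in COEFFICIENTS.items()
    )


def predicted_probability(rs):
    """Assumed logistic transform of RS into a probability."""
    return 1.0 / (1.0 + math.exp(-rs))


def is_high_risk(rs):
    """Assumption: RS at or above the cut-off is classified as high risk."""
    return rs >= CUTOFF
```

For a hypothetical patient with LVI present, no other risk features, and a 2.0 cm tumor, RS = −3.265 + 1.906 + 0.442 × 2.0 = −0.475, which lies above the cut-off, so this patient would be classified as high risk under the stated assumptions.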

Table 5 Model performance in development and validation cohorts
Fig. 2

A nomogram to predict the probability of lymph node metastasis in LUAD patients with a single tumor of ≤ 5 cm. For NTI, PI, LVI, RUL, LD, micropapillary (MP) component, lepidic component, and MP predominant, “1” indicates that the feature is present and “0” that it is absent

Fig. 3

The calibration curve of the prediction model. The x-axis shows the probabilities predicted by the final logistic regression model and the y-axis shows the actual probabilities. The calibration curve shows good agreement between the observed and predicted outcomes

Fig. 4

Receiver operating characteristic (ROC) curve of the prediction model. The area under the ROC curve is significantly greater than that of the non-informative ROC curve (the diagonal solid line)

Fig. 5

Decision curve analysis of the model. The black horizontal line is the net benefit of referring none of the patients for lymph node metastasis. The green curve is the net benefit of referring all patients for lymph node metastasis. The red curve represents the logistic regression model

Fig. 6

Clinical impact curve of the model. The red solid line represents the patients classified as positive (at high risk) by the model, and the blue dashed line represents the true-positive patients at each threshold probability

Discussion

An accurate assessment of the lymph node status involves SLND. Lymph node status in NSCLC, especially pathological status, is important for prognosis and for guiding postoperative therapeutic strategies [15].

Complete lymph node dissection is generally indicated and essential for lung cancer resection in most hospitals, especially during surgery for NSCLC with cT1a-2bN0-1M0 [16]. However, thoracic surgeons do not always follow this standard of care. In a survey of lung cancer patient care in 729 hospitals, Stewart AK et al. [17] found that, among patients receiving surgery, only 57.8% had lymph nodes either sampled or removed from the mediastinum, and as many as 42.2% had no mediastinal lymph nodes evaluated. We infer that poor patient performance status (advanced age, severe comorbidities, etc.), limited surgical skill, and complex anatomy contributed to incomplete resection. Moreover, considering the potentially grave complications (such as recurrent nerve palsy, bleeding, chylothorax, and phrenic nerve palsy) and the longer operative time, many researchers have examined whether some patients with lung cancer can forgo SLND. Joe B. Putnam, Jr, and colleagues [18] published the results of the ACOSOG Z0030 trial, the largest randomized controlled trial comparing outcomes between complete lymphadenectomy and mediastinal lymph node sampling in patients whose systematic and thorough pre-resection sampling of the mediastinal and hilar nodes was negative. The trial showed that mediastinal lymph node dissection (MLND) did not benefit patients with early-stage NSCLC, which indicates that patients with N0 or N1 (less than hilar) disease confirmed by mediastinoscopy or systematic lymph node sampling during pulmonary resection can avoid systematic lymph node dissection. Conversely, patients without systematic lymph node dissection are at risk of undetected N2 disease. Our study provides a tool to assess N status and predict the probability of lymph node involvement (N1 or N2 disease).

In this study, we found that NTI, PI, LVI, LD, micropapillary component, micropapillary predominance, and larger tumor size were positively correlated with lymph node metastasis, while the location of the right upper lobe and lepidic component negatively impacted the rate of lymph node metastasis.

There is no doubt that the likelihood of cancer invasiveness and metastasis, including lymph node metastasis, increases with tumor size [19,20,21]. Few studies have examined the correlation between NTI and lymph node metastasis in patients with lung cancer. Our study found a positive correlation and, for the first time, introduced NTI into a prediction model for N status. Carcinoma cells migrate by three main routes: lymphatic vessels, blood vessels, and serosal surfaces. NTI is a further, often overlooked route of carcinoma cell dissemination [22]. The neural sheath contains a low-resistance plane that serves as a conduit for cell migration [23], which may explain the positive correlation between NTI and lymph node metastasis.

PI was another factor correlated with lymph node metastasis, consistent with the studies by Yu et al. [24] and Kudos et al. [25]. Moreover, the positive effects of LVI, LD, micropapillary component, micropapillary predominance, and large tumor size on lymph node metastasis have been well investigated [26,27,28]. RUL was a protective factor against lymph node metastasis. Yi Tan’s study [29] showed similar results: compared with RUL, tumors located in the left lower lobe (LLL), left upper lobe (LUL), right middle lobe (RML), and right lower lobe (RLL) (ordered by decreasing OR) displayed a higher risk of lymph node metastasis. Ting [30] reported that in patients with lung adenocarcinoma ≤ 3 cm, the lepidic component was significantly associated with histologic subtype, TNM stage, and lymph node metastasis (P < 0.05), whereas in those with tumors larger than 3 cm, this association did not exist.

Several models for predicting lymph node metastasis of lung adenocarcinoma have been reported previously. Zheng [31] developed a radiomics model (RM) using a support vector machine and extremely randomized trees based on 18F-FDG PET/CT features to predict mediastinal lymph node metastasis; the RM achieved an AUC of 0.81 (95% CI: 0.771–0.848), a sensitivity of 0.794, and a specificity of 0.704. Keiju Aokage et al. [32] proposed a multivariable logistic regression model based on clinical and radiological factors, with C-statistics of 0.8041 and 0.7972, sensitivities of 95.7% and 95.4%, and specificities of 46.0% and 40.5% for the development and external validation sets, respectively. Zang [33] reported a four-predictor model (larger consolidation size, central tumor location, abnormal tumor marker status, and clinical N1/N2 stage) for the preoperative prediction of lymph node involvement in patients with clinical stage T1aN0-2M0 non-small cell lung cancer, achieving AUCs of 0.842 and 0.810 in the training and test groups, respectively. These models focused on preoperative clinical information and showed moderate discrimination (AUC: 0.7972–0.842), but they ignored the pathological information from resected specimens, which our study confirmed to be more relevant to lymph node metastasis. Our model incorporated these key factors and achieved AUC values of 0.861 (95% CI: 0.842–0.883) and 0.861 (95% CI: 0.8406–0.8740) for the development and validation cohorts, respectively.

The variables in our model can be obtained easily from medical records and pathology reports, so patients bear no additional examinations or financial cost after surgery. The nomogram is user-friendly because all variables except tumor size are binary, so the user can quickly locate the points for each variable and read the probability of lymph node metastasis from the total points. The model is appropriate for patients with a resected lung adenocarcinoma specimen containing one tumor < 5 cm, including wedge resection specimens. We recommend further adjuvant therapy and more intensive follow-up for patients whom the model predicts to be at high risk.

Our study had some limitations. First, as a retrospective study, it was inherently prone to selection bias. Second, its single-center design limits generalizability; despite the large sample and relatively high AUC, the model requires further multicenter validation. Third, the model depends largely on pathological results, which can vary between observers. A pathomics method that automatically extracts and calculates features from pathological section images may therefore help reduce inter-observer variability. Most importantly, the true effects of this prediction tool on treatment strategy and outcome remain unclear and need to be assessed in a multicenter, prospective, randomized controlled study.

Conclusion

We discovered that NTI, PI, LVI, LD, micropapillary component, micropapillary predominance, and larger tumor size were independent positive predictors of lymph node metastasis in patients with LUAD with a single tumor size ≤ 5 cm after lung resection, whereas RUL and lepidic component were independent negative predictors. The established model could serve as a prediction tool to distinguish patients with pN1/2 from those with pN0 and support treatment strategies for patients with LUAD without SLND.