In the coming years, the rate of surgically resectable early-stage lung cancers will rise as lung cancer screening routines are established [1, 2]. As a result, interdisciplinary teams will face the challenge of choosing the right patient pathways and treatment modalities to achieve the best outcome for each patient. Although primary surgical resection remains the gold standard for the treatment of UICC stage I cancer, comorbidities may favor non-surgical treatment options [3].

Treatment planning in resectable lung cancer usually relies on an algorithm that evaluates fitness for surgery and focuses mainly on cardiac and pulmonary function testing. However, other medical conditions that increase the risk of postoperative complications and surgical mortality are currently not routinely considered. Thoracic surgeons are trying to establish risk scores that may provide decision guidance when interdisciplinary tumor boards have to weigh primary surgical therapy against other treatment modalities, such as stereotactic body radiotherapy or radiofrequency ablation. The ESTS EuroLung scores, based on the ESTS database, are such a tool: they estimate the risk of 30-day postoperative morbidity and 30-day postoperative mortality from clinical values [4,5,6]. They consist of two groups of scores: EuroLung1 scores for morbidity and EuroLung2 scores for mortality, each with a parsimonious variant to simplify calculation. Aggregate EuroLung scores have been established to group patients into different risk classes for morbidity and mortality. Moreover, both scores were updated in 2019 from their original versions published in 2016. The EuroLung scores have also been published as a freely available smartphone app [4, 5]. They have been applied to European, Canadian, and Japanese cohorts with inconclusive results [7, 8]. This study compares the EuroLung scores from 2016, from 2019, and from the app, and is the first to test them in a pure video-assisted thoracoscopic surgery (VATS) cohort, whereas the scores were originally derived from data covering both open and minimally invasive approaches.

Materials and methods

Patient selection

All patients scheduled for an anatomic VATS resection for primary lung cancer at our department from 02/2009 to 07/2019 were retrospectively analyzed. In line with our ethical directive, patients under the age of 18 (n = 4) were excluded. To preemptively avoid possible confounders, patients over the age of 80 (n = 20) were also excluded, because at our department this cohort represents a highly preselected group with above-average performance status (no mortality and 4.2% morbidity). No other exclusion criteria were applied, leaving 718 patients for further analysis. Permission for retrospective data analysis of our VATS cohort was granted by the local ethics committee (Registration Number: AN5163,327/4.17,382/5.2).

Data administration

Data were collected in a prospectively maintained database. Collected data included sex, age at operation, coronary artery disease (CAD), chronic kidney disease (CKD), cerebrovascular disease (CVD), body mass index (BMI), type of resection (e.g., extended resection), predicted postoperative forced expiratory volume in one second (ppoFEV1), postoperative morbidity, and death.

Postoperative morbidity was defined according to the EuroLung score and included, but was not limited to: respiratory failure, prolonged mechanical ventilation (> 24 h), acute heart failure, reintubation, pneumonia, atelectasis requiring bronchoscopy, pulmonary edema, prolonged air leak, embolism, acute respiratory distress syndrome (ARDS), stroke, transient ischemic attack (TIA), acute kidney failure, arrhythmia requiring treatment, and myocardial infarction. In accordance with the definition of the EuroLung scores, postoperative morbidity and mortality were only included if they occurred during the first 30 days after surgery.

The definition of pneumonia in our database matches that of Fernandez et al. [9].

EuroLung scores

The EuroLung1 and EuroLung2 scores from 2016 and 2019, shown in Table 1, were calculated using data from 47,960 and 82,383 anatomic lung resections, respectively, documented in the ESTS database (07/2007–08/2015 and 01/2007–12/2018) [4, 5].

Table 1 Evolution of EuroLung scores over time (EuroLung App: formula based on Brunelli et al. [4] and Pompili et al. [7])

Statistical analysis

Statistical analysis was performed using IBM SPSS Statistics 26 (IBM Corporation, Armonk, NY, USA) and included the methods recommended by Altman et al. for external validation of prognostic tests [10].

Pearson’s chi-squared test or Fisher’s exact test was used to identify relationships between categorical variables. One-way analysis of variance was used to compare means of numerical variables. For every EuroLung score, binary logistic regression was used to test whether its computational variables differed significantly between our patients with and without EuroLung morbidity/mortality, in order to examine whether the variables in our cohort behave differently from those in the study by Brunelli et al. [4]. The Hosmer–Lemeshow test was used to assess goodness of fit of the logistic regression scores. The area under the receiver operating characteristic curve (AUROC) was calculated to compare the discriminative ability of the scores, and the DeLong test was used to compare AUROCs between scores. Because the analysis relates a nominal variable to a metric one, the eta coefficient, obtained via cross-tabulation, was used to quantify the correlation between each score and the observed morbidity/mortality. Calibration was assessed using calibration-in-the-large and the calibration slope. The study was performed in accordance with the TRIPOD statement for prediction model validation [11].
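As an illustration of what the AUROC measures, the following minimal sketch computes it in its Mann–Whitney form, i.e., as the probability that a randomly chosen patient with the event received a higher predicted risk than a randomly chosen patient without it. The analysis in this study was performed in SPSS; the function name `auroc` and the example values are hypothetical and are not study data.

```python
def auroc(scores, outcomes):
    """AUROC via the Mann-Whitney formulation.

    scores   -- predicted risks (any monotone score works)
    outcomes -- 1 if the event (e.g. 30-day morbidity) occurred, else 0
    Tied scores between an event and a non-event count as one half.
    """
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    non_events = [s for s, y in zip(scores, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    wins = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                wins += 1.0
            elif e == n:
                wins += 0.5
    return wins / (len(events) * len(non_events))


# Hypothetical predicted risks and outcomes, NOT data from this cohort.
predicted = [0.05, 0.10, 0.20, 0.30, 0.15, 0.40]
observed = [0, 0, 1, 0, 0, 1]
print(f"AUROC = {auroc(predicted, observed):.3f}")
```

A value of 0.5 corresponds to discrimination no better than chance and 1.0 to perfect separation of patients with and without the event.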

Results were expressed as means. Statistical significance was assumed for a p-value < 0.05.

Results

A total of 718 patients were analyzed. Overall patient characteristics and the respective morbidity and mortality characteristics are shown in Supplementary Tables 1, 2, 3, and 4.

Every patient in the cohort was scheduled for a primary anatomic VATS resection for primary lung cancer (100%). Our observed 30-day morbidity was 10.45% and observed 30-day mortality was 0.70%.

Morbidity

In our cohort, 75 of 718 patients (10.45%) suffered from postoperative morbidity, as defined by Brunelli et al., in the first 30 days after surgery; this rate was lower than predicted by the EuroLung scores [4, 5]. The relationship between 30-day morbidity and demographic data, risk scores, and perioperative morbidities is shown in Table 2.

Table 2 Relationship between 30-day morbidity and demographic data, risk scores, and perioperative morbidities

Using the various EuroLung scores, the calculated morbidity ranged from 11.11 to 20.85%. The parsimonious EuroLung1 (2019) gave the most accurate prediction, 11.11% (95%CI, 10.76–11.56%), compared with the cohort's observed morbidity rate of 10.45%. Patients with postoperative morbidity had significantly higher scores than patients without in all available morbidity scores (EuroLung1 (2016): p < 0.001; EuroLung1 (2019): p < 0.001; parsimonious EuroLung1 (2016): p < 0.001; parsimonious EuroLung1 (2019): p < 0.001; EuroLung1 App: p < 0.001). Patients with morbidity also had a significantly higher EuroLung1 aggregate score (6.84 (95%CI, 6.03–7.65) vs. 5.54 (95%CI, 5.30–5.79); p = 0.001).

All EuroLung scores showed only a weak individual correlation (EuroLung1 (2016): η = 0.155; EuroLung1 (2019): η = 0.168; parsimonious EuroLung1 (2016): η = 0.156; parsimonious EuroLung1 (2019): η = 0.174; EuroLung1 App: η = 0.173; EuroLung1 aggregate score: η = 0.122). In accordance with these results, none of the scores demonstrated high discrimination: the AUROC was 0.660 for the EuroLung1 App, 0.646 for the EuroLung1 (2019), 0.645 for the EuroLung1 (2016), 0.642 for both parsimonious EuroLung1 scores (2016 & 2019), and 0.599 for the EuroLung1 aggregate score. The AUROC of the parsimonious EuroLung1 (2019), which showed the most accurate prediction and the highest η-value, differed significantly from that of the EuroLung1 aggregate score (p = 0.010) and the EuroLung1 App (p = 0.032). The remaining EuroLung scores showed no statistically significant difference in AUROC. The EuroLung1 App showed significantly better discrimination than the EuroLung1 aggregate score (p < 0.001) and both parsimonious EuroLung1 scores (p = 0.032/0.032), but not than the EuroLung1 (2016) (p = 0.220) or the EuroLung1 (2019) (p = 0.217). Respective ROC curves are shown in Fig. 1.
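For readers less familiar with the η coefficient, the sketch below shows one common way to compute it: as the square root of the between-group sum of squares of the metric variable (here, a risk score) divided by its total sum of squares, with the groups defined by the nominal variable (here, event yes/no). This is an illustrative reimplementation under that assumption, not the SPSS routine used in the study; the function name `eta` and the example values are hypothetical.

```python
import math


def eta(values, groups):
    """Correlation ratio (eta) between a nominal grouping and a metric variable.

    values -- metric variable, e.g. a predicted risk per patient
    groups -- nominal variable, e.g. 1 = morbidity observed, 0 = none
    """
    grand_mean = sum(values) / len(values)
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    # Between-group sum of squares: group size times squared mean deviation.
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in by_group.values()
    )
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        return 0.0  # no variation in the metric variable at all
    return math.sqrt(ss_between / ss_total)


# Hypothetical scores and outcomes, NOT data from this cohort.
scores = [1.0, 2.0, 3.0, 4.0]
event = [0, 0, 1, 1]
print(f"eta = {eta(scores, event):.3f}")
```

η ranges from 0 (group means are identical) to 1 (the grouping explains all variation in the score), which is why values between 0.12 and 0.18 are read as a weak individual correlation.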

Fig. 1

ROC curves of EuroLung1 scores

The Hosmer–Lemeshow goodness-of-fit test was non-significant for all morbidity scores, indicating no evidence of poor fit [EuroLung1 (2016 & 2019): p = 0.958; parsimonious EuroLung1 (2016 & 2019): p = 0.996; EuroLung1 aggregate score: p = 0.919].
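The logic of the Hosmer–Lemeshow test can be sketched as follows: patients are sorted by predicted risk and split into g groups, observed and expected event counts are compared within each group, and the resulting statistic is referred to a chi-squared distribution with g − 2 degrees of freedom. The code below is a simplified illustration under these assumptions (equal-sized groups, g = 10 by default, a closed-form p-value that is only valid for an even number of degrees of freedom); it is not the SPSS implementation used in the study, and all names and example values are hypothetical.

```python
import math


def hosmer_lemeshow(predicted, outcomes, groups=10):
    """Return (statistic, degrees of freedom) for `groups` risk groups."""
    pairs = sorted(zip(predicted, outcomes))
    n = len(pairs)
    if n < groups:
        raise ValueError("need at least one patient per group")
    stat = 0.0
    for k in range(groups):
        chunk = pairs[k * n // groups:(k + 1) * n // groups]
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # expected number of events
        observed = sum(y for _, y in chunk)   # observed number of events
        mean_p = expected / size              # must lie strictly in (0, 1)
        stat += (observed - expected) ** 2 / (size * mean_p * (1 - mean_p))
    return stat, groups - 2


def chi2_sf_even_df(x, df):
    """Upper-tail chi-squared probability; exact closed form for even df."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term = total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Hypothetical example with two risk groups, NOT data from this cohort.
risks = [0.25] * 4 + [0.75] * 4
events = [0, 0, 0, 0, 1, 1, 1, 1]
stat, _ = hosmer_lemeshow(risks, events, groups=2)
print(f"statistic = {stat:.3f}")
# With the default 10 groups the reference distribution has 8 df:
print(f"p for a statistic of 2.5 on 8 df = {chi2_sf_even_df(2.5, 8):.3f}")
```

A large p-value means that observed and expected event counts do not differ detectably across risk groups; it does not by itself demonstrate good calibration, particularly when the number of events is small.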

Calibration-in-the-large showed a graphical trend toward systematically too high predictions, while at the same time showing too extreme risk estimations in the calibration slope, as visualized in Fig. 2. EuroLung1 (2019) and parsimonious EuroLung1 (2019) showed the best calibration-in-the-large with an intercept close to 0 (a = −0.007/−0.007). Moreover, they also showed the tightest estimation spread with their respective calibration slopes being the closest to 1 (b = 0.935/0.911).

Fig. 2

Calibration plots of A EuroLung1 (2016), B EuroLung1 (2019), C EuroLung1 App, D parsimonious EuroLung1 (2016) and E parsimonious EuroLung1 (2019). (a) Calibration-in-the-large calculated as the logistic regression model intercept given that the calibration slope equals 1; (b) calibration slope in a logistic regression model with the linear predictor as the sole predictor; (c) c-statistic indicating discriminative ability. Triangles represent deciles of subjects grouped by similar predicted risk
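The two calibration measures defined in the caption can be reproduced with a small logistic regression. The sketch below is an illustration only: it assumes that the linear predictor of each score is the logit of its predicted risk, fits both models by Newton–Raphson, and uses hypothetical function names and example values rather than study data.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def calibration_in_the_large(lp, y, iters=50):
    """Intercept a of logit(P(y=1)) = a + lp, with the slope fixed at 1."""
    a = 0.0
    for _ in range(iters):
        p = [expit(a + x) for x in lp]
        grad = sum(yi - pi for yi, pi in zip(y, p))
        hess = sum(pi * (1.0 - pi) for pi in p)
        a += grad / hess  # Newton-Raphson update
    return a


def calibration_slope(lp, y, iters=50):
    """Fit logit(P(y=1)) = a + b * lp; return (a, b). b is the slope."""
    a, b = 0.0, 1.0
    for _ in range(iters):
        p = [expit(a + b * x) for x in lp]
        w = [pi * (1.0 - pi) for pi in p]
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * x for yi, pi, x in zip(y, p, lp))
        h00 = sum(w)
        h01 = sum(wi * x for wi, x in zip(w, lp))
        h11 = sum(wi * x * x for wi, x in zip(w, lp))
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 Newton system explicitly.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Hypothetical, perfectly calibrated example (NOT study data): four patients
# predicted at 25% with one event, four predicted at 75% with three events.
lp = [logit(0.25)] * 4 + [logit(0.75)] * 4
y = [1, 0, 0, 0, 0, 1, 1, 1]
print(calibration_in_the_large(lp, y))  # close to 0
print(calibration_slope(lp, y))         # close to (0, 1)
```

An intercept below 0 indicates that predicted risks are on average too high, and a slope below 1 indicates that predictions are too extreme (too high for high-risk and too low for low-risk patients).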

To further investigate the impact of individual risk factors on morbidity, a binary logistic regression analysis was performed for each risk score. For the EuroLung1 and parsimonious EuroLung1 (2016 & 2019), lower ppoFEV1% was associated with a higher risk of postoperative complications (EuroLung1 (2016 & 2019): p = 0.041; parsimonious EuroLung1 (2016 & 2019): p = 0.042). Male sex was a significant risk factor in the aggregate EuroLung1 score (p = 0.025).

The relationship between the EuroLung1 aggregate score and our observed morbidity rate is shown in Fig. 3.

Fig. 3

Relationship between EuroLung1 aggregate score and our morbidity rates. (AEL1 Score Aggregate EuroLung1 Score)

A subgroup analysis did not show a difference in observed morbidity for patients with neoadjuvant therapy (12.3 vs. 10.2% in patients without neoadjuvant therapy, p = 0.547).

Mortality

Postoperative 30-day mortality was observed in five patients (0.7%) and was lower than predicted by any EuroLung score. The closest estimate was given by the parsimonious EuroLung2 with 1.10% (95%CI, 1.01–1.19%), followed by the EuroLung2 (2019) with 1.11% (95%CI, 1.03–1.21%), the EuroLung2 App with 1.29% (95%CI, 1.07–1.51%), and the EuroLung2 (2016) with 1.40% (95%CI, 1.29–1.51%), compared with the cohort's observed mortality rate of 0.7%. The relationship between 30-day mortality and demographic data, risk scores, and perioperative morbidities is shown in Table 3.
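The comparison of observed and predicted mortality can be retraced with simple arithmetic. The sketch below uses only the figures reported above (5 deaths among 718 patients and the mean predicted mortality of each score); the observed-to-expected (O/E) ratio is added here purely for illustration and is not an analysis reported in this study.

```python
N_PATIENTS = 718
OBSERVED_DEATHS = 5

# Mean predicted 30-day mortality (%) per score, as reported in the text.
predicted_mortality = {
    "parsimonious EuroLung2": 1.10,
    "EuroLung2 (2019)": 1.11,
    "EuroLung2 App": 1.29,
    "EuroLung2 (2016)": 1.40,
}


def observed_rate_percent(events, total):
    """Observed event rate in percent."""
    return 100.0 * events / total


def expected_events(predicted_percent, total):
    """Number of events implied by a mean predicted rate."""
    return predicted_percent / 100.0 * total


observed = observed_rate_percent(OBSERVED_DEATHS, N_PATIENTS)
print(f"observed mortality: {observed:.2f}%")
for name, pred in predicted_mortality.items():
    exp = expected_events(pred, N_PATIENTS)
    ratio = OBSERVED_DEATHS / exp  # O/E below 1: fewer deaths than predicted
    print(f"{name}: expected {exp:.1f} deaths, O/E = {ratio:.2f}")
```

Even the closest score implies roughly eight deaths where five were observed, which illustrates how strongly a handful of events drives the comparison in a cohort of this size.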

Table 3 Relationship between 30-day mortality and demographic data, risk scores, and perioperative morbidities

Patients with observed mortality did not show significantly higher EuroLung scores (EuroLung2 (2016): p = 0.695; EuroLung2 (2019): p = 0.769; parsimonious EuroLung2: p = 0.811; EuroLung2 App: p = 0.983). Also, EuroLung2 aggregate scores (2016 & 2019) did not differ between groups (p = 0.505, p = 0.510).

All EuroLung scores showed only a very weak individual correlation (both EuroLung2 aggregate scores (2016 & 2019): η = 0.025; EuroLung2 (2016): η = 0.011; EuroLung2 (2019): η = 0.015; parsimonious EuroLung2: η = 0.009; EuroLung2 App: η = 0.000). In accordance with these results, none of the scores demonstrated high discrimination: the AUROC was 0.673 for the EuroLung2 (2016), 0.656 for the EuroLung2 (2019), 0.645 for the parsimonious EuroLung2, 0.641 for the EuroLung2 App, 0.610 for the aggregate EuroLung2 (2016), and 0.596 for the aggregate EuroLung2 (2019). The AUROCs of the available EuroLung scores did not differ significantly from one another. Respective ROC curves are shown in Fig. 4.

Fig. 4

ROC curves of EuroLung2 scores

The Hosmer–Lemeshow goodness-of-fit test was non-significant for all mortality scores, indicating no evidence of poor fit (EuroLung2 (2016 & 2019): p = 0.937; parsimonious EuroLung2: p = 0.961; EuroLung2 aggregate score (2016): p = 0.926; EuroLung2 aggregate score (2019): p = 0.313).

For the computational variables, binary logistic regression for the EuroLung2 (2016 & 2019) scores identified lower ppoFEV1% and CAD as significant risk factors for mortality (p = 0.040, p = 0.033). For the parsimonious EuroLung2, only ppoFEV1% was significant (p = 0.030), and for the EuroLung2 aggregate score (2016), CAD had a significant impact (p = 0.025). For the EuroLung2 aggregate score (2019), no significant variable was found.

The relationship of the EuroLung2 aggregate score with our observed mortality rate is shown in Fig. 5.

Fig. 5

Relationship between EuroLung2 aggregate score and our mortality rates. AEL2 Score Aggregate EuroLung2 Score

Two patients died of ARDS, two died of sepsis, and one developed both complications and subsequently died. Notably, all five patients had low EuroLung2 aggregate scores (see Fig. 5). Interestingly, three of the five patients had a history of solid organ transplantation (kidney: n = 2, liver: n = 1). Postoperative mortality differed significantly between patients with a history of solid organ transplantation and non-transplant patients (p < 0.001).
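To illustrate how such a comparison of two small groups can be tested, the sketch below implements a two-sided Fisher's exact test for a 2 × 2 table. The table itself is an assumption: the number of transplant recipients is not stated in this section, and 13 is inferred here from the 23.1% mortality quoted in the Discussion (3 of 13). Whether Fisher's exact test or the chi-squared test was used for this particular comparison is likewise not stated, so the result should be read as a plausibility check only.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# ASSUMED table (rows: transplant / no transplant; columns: died / survived).
# 13 transplant recipients is inferred from 3 deaths = 23.1%, not reported here.
p_value = fisher_exact_two_sided(3, 10, 2, 703)
print(f"p = {p_value:.6f}")
```

Under these assumed counts the p-value lies well below 0.001, which is consistent with the reported result.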

A subgroup analysis did not show a difference in observed mortality for patients with neoadjuvant therapy (0 vs. 0.8% in patients without neoadjuvant therapy, p = 1.000).

Discussion

Despite efforts to reduce smoking, lung cancer remains the leading cause of cancer death. To reduce lung cancer-associated mortality, screening routines are being implemented successfully. As a result, more early-stage lung cancers are being diagnosed, which increases both the number of potentially resectable lung cancers and the demand for individual risk stratification.

The ESTS EuroLung scores were established to calculate the individual risk of postoperative morbidity and mortality and to help guide treatment decisions. So far, the scores have not been definitively validated in other cohorts. The scores can be used in two ways: first, the overall observed morbidity and mortality can be compared with the predicted outcome as a marker of quality of care, comparing a center with the average of the ESTS database; second, the individual predicted risk can be used to guide decision making, but only once the scores have been validated externally.

The aim of this study was to validate the EuroLung scores in our patient cohort, which consists only of primary anatomic VATS resections. As data from our patients are not included in the ESTS database, this also serves as an external validation.

Our results show that the parsimonious EuroLung1 (2019; 11.11%; 95%CI, 10.74–11.49%) agrees best with our cohort's observed morbidity rate of 10.45%. Despite this, the correlation with individual patient morbidity was only weak (η = 0.155), showing insufficient precision. Although the EuroLung1 (2019) showed rather good calibration, with an intercept of −0.007 and a calibration slope of 0.935, its discrimination was weak, with a c-statistic of 0.646.

In the binary logistic regression analysis, only ppoFEV1% proved to be associated with increased morbidity in our cohort. This emphasizes the importance of preoperative lung function tests in the treatment algorithm of lung cancer. It is even more relevant because pulmonary prehabilitation programs have been shown to reduce postoperative morbidity [12].

Comparing the EuroLung2 scores with our cohort, we showed that the observed mortality (0.7%) was lower than predicted by the ESTS EuroLung2 scores. Further analysis showed that lower ppoFEV1% correlated with higher 30-day mortality. We also found a high mortality rate in patients with a history of solid organ transplantation (23.1%). A higher 90-day mortality after surgical treatment of lung cancer in patients after solid organ transplantation was also described recently by Drevet et al. [13]. Solid organ transplantation has so far not been evaluated in the EuroLung scores, as it is not recorded in the ESTS database, but given the increasing evidence it should be considered in future updates.

To investigate possible confounders for this discrepancy between expected and observed morbidity and mortality, we compared the patient characteristics of the ESTS database with our own VATS database. Our patients showed a lower ppoFEV1% (72.7 vs. 62.9) and a higher prevalence of diabetes (2.7% vs. 12.5%). In contrast to the EuroLung database, our cohort consists only of VATS patients (vs. 13.1% and 26% in the ESTS database at the time of publication of the EuroLung scores 2016 and 2019), which might decrease the postoperative complication rate, as a VATS approach has been shown to reduce postoperative morbidity such as pneumonia, intensive care admission, bleeding, or the need for reoperation. Even in the case of conversion to open surgery, primary VATS cases do not show higher complication rates [4, 14,15,16]. Analyses of various institutional VATS programs have shown that the surgeon’s experience does not correlate with the number of major intraoperative complications, but does correlate with a higher number of non-oncological conversions to open surgery during the first 100 cases. These data reinforce the recommendation of Petersen and Hansen that VATS programs and surgeons should be able to perform at least 25 VATS lobectomies per year in order to complete the respective learning curve in an adequate amount of time and thus hopefully reduce conversion-related morbidity [17, 18]. Only a few of the variables used to calculate the EuroLung scores proved to have a significant impact on morbidity and mortality in our cohort.

Regarding postoperative mortality, even the lowest predicted rate was more than 50% higher than the observed mortality (1.1% vs. 0.7%), again showing only weak individual correlation. The reason for the discrepancy is unclear. On the one hand, the benefits of minimally invasive surgery might be underestimated in the EuroLung scores because of the low number of VATS procedures in the ESTS database. On the other hand, as shown by Decaluwe et al., almost 25% of 30-day mortality after a scheduled anatomic VATS resection is linked to major intraoperative complications, which cannot be predicted [17]. However, the intraoperative complication rate does not seem to differ between a primary VATS and a thoracotomy approach [19, 20]. Moreover, potential concerns that more extended tumor stages are the reason for higher morbidity rates in thoracotomy can be dismissed, as major pulmonary resections can also be performed safely by VATS without an elevated postoperative complication rate [21].

Future EuroLung scores may perform better in VATS cohorts, as the amount of VATS data in the ESTS database is growing. As Moons et al. recommend, a prognostic model that does not perform well in new populations should be updated with the new patient data rather than replaced by a new model [22]. We might also be missing important clinical details that are not covered in the ESTS database, such as frailty, sarcopenia, morbid obesity, anemia, solid organ transplantation, or other known risk factors for unfavorable postoperative outcome [13, 23,24,25,26,27].

According to our results, the EuroLung scores can be used to benchmark quality of care in Europe, but should not be used to preclude patients from surgical treatment of lung cancer, because of their weak individual correlation. The various risk scores can be used for more detailed patient consenting, to set realistic expectations, and to screen for patients who might benefit most from preoperative rehabilitation efforts. The inclusion of other clinical factors, such as frailty scores or sarcopenia screening, might improve the accuracy of the risk scores.

Limitations

The fact that our database consists only of primary VATS patients might influence study outcome, as the prognostic EuroLung scores have been established on a mixed cohort with a rather high thoracotomy rate.

The retrospective character is no limitation of this study, as it was designed as an external model validation study. Although treatment methods and patient selection might have changed over the years, this should not affect the validity of our results, because the ESTS database, on which the EuroLung scores are based, includes patients between June 2007 and December 2018.

Our validation of the EuroLung2 scores has to be interpreted with caution, as the study population had a rather low number of events. For the same reason, no adequate calibration analysis was possible.

Conclusion

The decision for or against surgery for lung cancer remains highly individual for each patient and should not be based on currently available risk scores. A calculated risk score should not prevent patients from receiving surgery for lung cancer. Risk score calculation should rather be used for improved patient consenting and for comparing postoperative outcomes with other departments. Currently, many large retrospective databases, such as the ESTS database, lack promising new risk factors, making it difficult if not impossible to establish more precise risk prediction models with these databases. Future efforts should aim to include such variables, for example sarcopenia or a history of solid organ transplantation, in further adaptations of the risk scores.