External validation of risk prediction scores in patients undergoing anatomic video-assisted thoracoscopic resection

Background EuroLung Risk scores were established to predict postoperative morbidity and mortality in patients undergoing anatomic lung resections. We aimed to perform an external validation of the EuroLung scores, which were calculated from data of the European Society of Thoracic Surgeons database, in our video-assisted thoracoscopic surgery cohort. Methods All available EuroLung scores were calculated for 718 patients scheduled for anatomic video-assisted thoracoscopic surgery resections between 2009 and 2019. Morbidity and mortality according to the definitions of the EuroLung scores were analyzed in a prospectively maintained database. Results The overall observed complication rate was 10.45%. Scores showed weak individual correlation (η = 0.155–0.174). The EuroLung1 app score showed the largest area under the receiver operating characteristic (ROC) curve with 0.660. Binary logistic regression analysis showed that predicted postoperative forced expiratory volume in 1 s was associated with increased complications in both EuroLung1 and parsimonious EuroLung1 scores. Thirty-day mortality was 0.7% (predicted 1.10–1.40%) and was associated with predicted postoperative forced expiratory volume in 1 s for both EuroLung2 and parsimonious EuroLung2 scores. The EuroLung2 (2016) showed the largest area under the ROC curve with 0.673. Only a very weak eta correlation between predicted and observed mortality was found for aggregate EuroLung2, EuroLung2 (2016), EuroLung2 (2019), and parsimonious EuroLung2 (2016) (η = 0.025/0.015/0.011/0.009). Conclusion EuroLung scores help to estimate postoperative morbidity. However, even with the highest possible aggregate EuroLung scores, only 50% of patients suffer from postoperative morbidity. Although calibration of the scores was acceptable, discrimination between predicted and observed events was poor. Consequently, individual correlation between predicted and observed events is weak. 
Therefore, EuroLung scores may be best used to compare institutional quality of care to the European Society of Thoracic Surgeons database but should not be used to preclude patients from surgical treatment. Supplementary Information The online version contains supplementary material available at 10.1007/s00464-022-09786-7.

radiotherapy or radiofrequency ablation. The ESTS EuroLung scores, based on the ESTS database, represent such a tool for the risk calculation of 30-day postoperative morbidity and 30-day postoperative mortality based on clinical values [4][5][6]. The ESTS EuroLung scores consist of two groups of scores: EuroLung1 scores for morbidity and EuroLung2 scores for mortality, with parsimonious variants for both to simplify calculation. Aggregate EuroLung scores have been established for grouping patients into different classes of risk for morbidity and mortality. Moreover, the two scores were updated in 2019 from their original versions published in 2016. The EuroLung scores have also been published in the form of a freely available smartphone app [4,5]. The EuroLung scores have been applied to European, Canadian, and Japanese cohorts with inconclusive results [7,8]. This study compares the EuroLung scores from 2016, from 2019, and from the app, and is the first to test them in a pure video-assisted thoracoscopic surgery (VATS) cohort, whereas the scores were originally calculated using data from both open and minimally invasive approaches.

Patient selection
All patients scheduled for an anatomic VATS resection for primary lung cancer at our department from 02/2009 to 07/2019 were retrospectively analyzed. In line with our ethical directive, patients under the age of 18 (n = 4) were excluded. To preemptively avoid possible confounders, patients over the age of 80 (n = 20) were excluded, because this patient cohort represents a highly preselected group with above-average performance status at our department (no mortality and 4.2% morbidity). No other exclusion criteria were used. A total of 718 patients remained for further analysis. Permission for retrospective data analysis of our VATS cohort was granted by the local ethics committee (Registration Number: AN5163,327/4.17,382/5.2).

Data administration
Data were collected in a prospectively maintained database. Collected data included sex, age at operation, coronary artery disease (CAD), chronic kidney disease (CKD), cerebrovascular disease (CVD), body mass index (BMI), type of resection (e.g., extended resection), predicted postoperative forced expiratory volume in one second (ppoFEV1), postoperative morbidity, and death.
Postoperative morbidity was defined according to the EuroLung score and included, but was not limited to: respiratory failure, prolonged mechanical ventilation (> 24 h), acute heart failure, reintubation, pneumonia, atelectasis requiring a bronchoscopy, pulmonary edema, prolonged air leak, embolism, acute respiratory distress syndrome (ARDS), stroke, transient ischemic attack (TIA), acute kidney failure, arrhythmia requiring treatment, and myocardial infarction. In accordance with the definition of the EuroLung scores, postoperative morbidity and mortality were only included if they occurred during the first 30 days after surgery.
Pneumonia definition in our database matches the definition of Fernandez et al. [9].

Statistical analysis
Statistical analysis was performed using IBM SPSS Statistics 26 (IBM Corporation, Armonk, NY, USA) and included the methods recommended by Altman et al. for external validation of prognostic tests [10].
Pearson's chi-squared test or Fisher's exact test was used for identifying relationships between categorical variables. One-way analysis of variance was used for comparing means between various numerical variables. For all EuroLung scores, binary logistic regression was used to test their computational variables for significance between our patient groups with and without EuroLung morbidity/mortality. This was done to examine whether the variables of our cohort differ from those in the study by Brunelli et al. [4]. The Hosmer-Lemeshow test was used to test the goodness of fit of the logistic regression scores. The area under the receiver operating characteristic curve (AUROC) was calculated to compare the predictivity of the scores. The DeLong test was used to compare the AUROC between the available scores. For the relationship between nominal and metric variables, eta correlation was used to calculate the correlation between a score and the observed morbidity/mortality via cross-tabulation. Calibration was assessed by using calibration-in-the-large and calibration slope. The study was performed in accordance with the TRIPOD statement for Prediction Model Validation [11].
Results were expressed as means. Statistical significance was assumed for a p-value < 0.05.
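As a minimal illustration of two of the measures named above, the following Python sketch computes an AUROC as the proportion of concordant event/non-event pairs (ties counted as one half) and the eta coefficient as the square root of the between-group share of the score variance. The function names and the eight toy values are hypothetical and are not taken from the study database; the reported analysis was run in SPSS, and this sketch is not that implementation.

```python
from statistics import mean

def auroc(scores, outcomes):
    """AUROC as the share of (event, non-event) pairs in which the event
    case has the higher score; tied pairs count one half."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def eta(scores, outcomes):
    """Correlation ratio eta: sqrt(between-group SS / total SS) of a metric
    score across the groups of a nominal outcome."""
    grand = mean(scores)
    total_ss = sum((s - grand) ** 2 for s in scores)
    between_ss = 0.0
    for group in set(outcomes):
        members = [s for s, y in zip(scores, outcomes) if y == group]
        between_ss += len(members) * (mean(members) - grand) ** 2
    return (between_ss / total_ss) ** 0.5

# Hypothetical toy data: predicted risk and observed 30-day event (1 = yes).
predicted = [0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.25, 0.30]
observed = [0, 0, 0, 1, 0, 0, 1, 1]

print("AUROC:", round(auroc(predicted, observed), 3))  # 13 of 15 pairs concordant
print("eta:", round(eta(predicted, observed), 3))
```

An AUROC of 0.5 corresponds to chance-level discrimination, and eta ranges from 0 (no separation of group means) to 1 (scores fully determined by outcome group).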

Results
A total of 718 patients were analyzed. Overall patient characteristics and respective morbidity and mortality characteristics are shown in Supplementary Tables 1, 2, 3, and 4. Every patient in the cohort was scheduled for a primary anatomic VATS resection for primary lung cancer (100%). Our observed 30-day morbidity was 10.45% and our observed 30-day mortality was 0.70%.

Morbidity
In our cohort, 75 out of 718 patients (10.45%) suffered from postoperative morbidity, as defined by Brunelli et al., in the first 30 days after surgery, and this rate was lower than the rates predicted by the EuroLung scores [4,5]. The relationship between 30-day morbidity and demographic data, risk scores, and perioperative morbidities is shown in Table 2.
Using the various EuroLung scores, the calculated morbidity ranged from 11.11 to 20.85%. The parsimonious EuroLung1 (2019) showed the most accurate prediction with 11.11% (95% CI, 10.76-11.56%) in comparison to the cohort's observed morbidity rate of 10.45%. Patients with a postoperative morbidity showed significantly higher EuroLung scores in all available morbidity scores than patients without (EuroLung1 , 0.599 for the EuroLung1 aggregate score and did not prove high discrimination. The parsimonious EuroLung1 (2019), which showed the most accurate prediction and the highest η-value, had a statistically different AUROC than the EuroLung1 aggregate score (p = 0.010) and the EuroLung1 App (p = 0.032). The rest of the EuroLung scores showed no statistically different AUROC. The EuroLung1 App showed significantly better discrimination than the EuroLung1 aggregate score (p < 0.001) and both parsimonious EuroLung1 scores (p = 0.032/0.032), but not than the EuroLung1 (2016) (p = 0.220) and EuroLung1 (2019) (p = 0.217). Respective ROC curves are shown in Fig. 1.
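The gap between observed and predicted morbidity can be made concrete with a short calculation. The sketch below recomputes the observed rate from the reported counts (75 of 718), attaches a Wilson score interval, and forms observed-to-expected ratios against the lowest and highest mean predicted morbidity (11.11% and 20.85%). The Wilson interval and the function name are additions for illustration only; they were not part of the reported analysis.

```python
import math

def wilson_ci(events, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z = 1.96)."""
    p = events / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half

events, n = 75, 718          # observed 30-day morbidity, as reported
observed = events / n
low, high = wilson_ci(events, n)

# Lowest and highest mean predicted morbidity reported across the scores.
predicted_range = (0.1111, 0.2085)
oe_ratios = [observed / p for p in predicted_range]

print(f"observed rate {observed:.4f}, interval {low:.4f} to {high:.4f}")
print("observed/expected ratios:", [round(r, 2) for r in oe_ratios])
```

Under these assumptions the lowest predicted rate falls inside the interval around the observed rate, whereas the highest predicted rate is roughly twice the observed one.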
The Hosmer-Lemeshow test for goodness of fit was not significant for any of the morbidity scores, and their fit was therefore considered valid. Calibration-in-the-large showed a graphical trend toward systematically too high predictions, while the calibration slope indicated too extreme risk estimates, as visualized in Fig. 2. EuroLung1 (2019) and parsimonious EuroLung1 (2019) showed the best calibration-in-the-large with an intercept close to 0 (a = −0.007/−0.007). Moreover, they also showed the tightest estimation spread, with their respective calibration slopes being the closest to 1 (b = 0.935/0.911).
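The two calibration measures can be illustrated with a short sketch. Assuming the common definitions, calibration-in-the-large is the intercept a of a logistic model in which the logit of the predicted risk enters as a fixed offset, and the calibration slope is the coefficient b when that logit enters as the only covariate. The function names, the Newton-Raphson fitting loop, and the 20 toy patients below are hypothetical and only demonstrate the idea; they do not reproduce the reported values.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def calibration_in_the_large(pred, y, iters=50):
    """Intercept a of logit(P(y=1)) = a + LP, with LP = logit(pred) as offset."""
    lp = [logit(p) for p in pred]
    a = 0.0
    for _ in range(iters):
        mu = [sigmoid(a + l) for l in lp]
        grad = sum(yi - m for yi, m in zip(y, mu))
        hess = sum(m * (1.0 - m) for m in mu)
        a += grad / hess
    return a

def calibration_slope(pred, y, iters=50):
    """Intercept and slope (a, b) of logit(P(y=1)) = a + b * LP."""
    lp = [logit(p) for p in pred]
    a, b = 0.0, 1.0
    for _ in range(iters):
        mu = [sigmoid(a + b * l) for l in lp]
        w = [m * (1.0 - m) for m in mu]
        g0 = sum(yi - m for yi, m in zip(y, mu))
        g1 = sum((yi - m) * l for yi, m, l in zip(y, mu, lp))
        h00 = sum(w)
        h01 = sum(wi * l for wi, l in zip(w, lp))
        h11 = sum(wi * l * l for wi, l in zip(w, lp))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = gradient.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Hypothetical toy cohort: 10 patients predicted at 20% (2 events) and
# 10 patients predicted at 60% (4 events), so predictions run too high.
pred = [0.2] * 10 + [0.6] * 10
y = [1, 1] + [0] * 8 + [1] * 4 + [0] * 6

a_citl = calibration_in_the_large(pred, y)
a_fit, b_fit = calibration_slope(pred, y)
print("calibration-in-the-large:", round(a_citl, 3))
print("calibration slope:", round(b_fit, 3))
```

In this toy cohort the intercept is negative (predictions are too high on average) and the slope is below 1 (predictions are too extreme), the same direction of miscalibration described above.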
For further investigation of the impact of risk factors for morbidity a binary logistic regression analysis was performed for each risk score. For the EuroLung1 and parsimonious EuroLung1 ( The relationship between the EuroLung1 aggregate score and our observed morbidity rate is shown in Fig. 3. A subgroup analysis did not show a difference in observed morbidity for patients with neoadjuvant therapy (12.3 vs. 10.2% in patients without neoadjuvant therapy, p = 0.547).
Patients with observed mortality did not show significantly higher EuroLung scores (EuroLung2 (2016)

The Hosmer-Lemeshow test for goodness of fit was not significant for any of the mortality scores, and their fit was therefore considered valid. For the computational variables, binary logistic regression for the EuroLung2 (2016 & 2019) scores showed lower ppoFEV1% and CAD to be significant risk factors for mortality (p = 0.040, p = 0.033). For the parsimonious EuroLung2 only ppoFEV1% showed significance (p = 0.030), and for the EuroLung2 aggregate score (2016) CAD showed a significant impact (p = 0.025). For the EuroLung2 aggregate score (2019) no significant variable was found.
The relationship of the EuroLung2 aggregate score with our observed mortality rate is shown in Fig. 5.
Two patients died of ARDS, two patients suffered lethal sepsis, and one patient suffered from both complications and subsequently died. Notably, all of these patients had low EuroLung2 aggregate scores (see Fig. 5). Interestingly, three out of five patients had a history of solid organ transplantation (kidney: n = 2, liver: n = 1). We found a statistically significant difference in postoperative mortality between patients after solid organ transplantation and non-transplant patients (p < 0.001).
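A comparison of this kind can be sketched with a two-sided Fisher's exact test. The 2×2 counts below are an assumption: 13 transplant recipients is inferred from the three deaths and the 23.1% mortality reported for this group in the Discussion, which leaves 705 non-transplant patients with two deaths. Whether Fisher's exact test or Pearson's chi-squared test produced the reported p-value is not stated, so this is an illustration rather than a reproduction.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# Assumed counts (inferred, not reported as a table):
# rows = transplant / non-transplant, columns = died / survived.
p_value = fisher_exact_two_sided(3, 10, 2, 703)
print("two-sided p:", p_value)
```

With these assumed counts the p-value falls below 0.001, in line with the reported significance level.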
A subgroup analysis did not show a difference in observed mortality for patients with neoadjuvant therapy (0 vs. 0.8% in patients without neoadjuvant therapy, p = 1.000).

Discussion
Despite efforts to reduce smoking, lung cancer remains the leading cause of cancer death. To reduce lung cancer-associated mortality, successful efforts are being made to implement screening routines. As a result, more early-stage lung cancers are being diagnosed, increasing the number of potentially resectable lung cancers and the demand for individual risk stratification.
The ESTS EuroLung scores were established to calculate individual risk for postoperative morbidity and mortality and to help guide treatment decisions. So far, the scores have not been definitively validated in other cohorts. The scores can be used in two ways: first, the overall observed morbidity and mortality can be compared to the predicted outcome as a marker for quality of care, comparing a center to the average of the ESTS database; second, the individual predicted risk can be used to guide decision making, but only once the scores have been validated externally.

Fig. 1 ROC-Curves of EuroLung1 scores
The aim of this study was to validate the EuroLung scores in our patient cohort, consisting only of primary anatomic VATS resections. As data from our patients are not included in the ESTS database, this could also serve as an external validation.
Our results show that the parsimonious EuroLung1 (2019; 11.11%; 95% CI, 10.74-11.49%) displays the best correlation with our cohort's observed morbidity rate of 10.45%. Despite this, the correlation with individual patient morbidity was only weak (η = 0.155), showing insufficient precision. Although the EuroLung1 (2019) showed rather good calibration with an intercept of −0.007 and a calibration slope of 0.935, the discrimination was weak with a c-statistic of 0.646.
In the binary logistic regression analysis, only ppoFEV1% was shown to be associated with increased morbidity in our cohort. This emphasizes the importance of preoperative lung function tests in the treatment algorithm of lung cancer. It is even more relevant because pulmonary prehabilitation programs have been shown to reduce postoperative morbidity [12].
Comparing the EuroLung2 scores with our cohort, we showed that observed mortality (0.7%) was lower than that predicted by the ESTS EuroLung2 scores. Further analysis showed that lower ppoFEV1% correlated with higher 30-day mortality. We also found a high rate of mortality in patients with a history of solid organ transplantation (23.1%). A higher 90-day mortality after surgical treatment of lung cancer in patients after solid organ transplantation was also described recently by Drevet et al. [13]. Solid organ transplantation has so far not been evaluated in the EuroLung scores, as it is not recorded in the ESTS database, but due to increasing evidence it should be considered in future updates.
To investigate possible confounders for this discrepancy between expected and observed morbidity and mortality, we compared the patient characteristics of the ESTS database with our own VATS database. Compared with the ESTS database, our patients showed a lower ppoFEV1% (72.7 vs. 62.9) and a higher proportion of diabetes (2.7% vs. 12.5%). In contrast to the EuroLung database, our cohort consists of only VATS patients (vs. 13.1% and 26% in the ESTS database at the time of publication of the EuroLung scores 2016 and 2019), which might decrease the postoperative complication rate, as a VATS approach has been shown to reduce postoperative morbidity such as pneumonia, intensive care admission, bleeding, or the need for reoperation. Even in the case of conversion to open surgery, primary VATS cases do not show higher complication rates [4,14,15,16]. Analyses of various institutional VATS programs have shown that the surgeon's experience does not correlate with the number of major intraoperative complications, but does correlate with a higher number of non-oncological conversions to open surgery during the first 100 cases. These data reinforce the recommendation of Petersen and Hansen for VATS programs and surgeons to be able to perform at least 25 VATS lobectomies per year to complete the respective learning curve in an adequate amount of time and thus hopefully reduce conversion-related morbidity [17,18]. Only a few variables used to calculate EuroLung scores proved to have a significant impact on morbidity and mortality in our cohort.
Regarding postoperative mortality, the lowest predicted rate was more than 50% higher than the observed mortality (1.1% vs. 0.7%), again showing only weak individual correlation. The reason for the discrepancy is unclear. On the one hand, benefits of minimally invasive surgery might be underestimated in the EuroLung scores due to the low number of VATS procedures in the ESTS database. On the other hand, as shown by Decaluwe et al., almost 25% of 30-day mortality after a scheduled anatomic VATS resection is linked to major intraoperative complications, which cannot be predicted [17]. However, the intraoperative complication rate does not seem to differ between a primary VATS and a thoracotomy approach [19,20]. Moreover, potential concerns that more extended tumor stages are the reason for higher morbidity rates in thoracotomy can be dismissed, as major pulmonary resections can also be safely performed by VATS without an elevated postoperative complication rate [21].
Fig. 5 Relationship between EuroLung2 aggregate score and our mortality rates. AEL2 Score Aggregate EuroLung2 Score
Perhaps future EuroLung scores will perform better on VATS cohorts, as the number of VATS data in the ESTS database is growing. As Moons et al. recommend, a prognostic model that does not perform well in new populations should be updated with the new patient data rather than replaced by a new model [22]. Also, we might miss important clinical details that were not covered in the ESTS database, like frailty, sarcopenia, morbid obesity, anemia, solid organ transplantation, or other known risk factors of unfavorable postoperative outcome [13,23,24,25,26,27].
According to our results, the EuroLung scores can be used to benchmark quality of care in Europe, but should not be used to preclude patients from surgical treatment of lung cancer due to their weak individual correlation. The various risk scores can be used for more detailed patient consenting, to set realistic expectations, and to screen for patients who might benefit most from preoperative rehabilitation efforts. The inclusion of other clinical factors, such as frailty scores or sarcopenia screening, might improve the accuracy of the risk scores.

Limitations
The fact that our database consists only of primary VATS patients might influence study outcome, as the prognostic EuroLung scores have been established on a mixed cohort with a rather high thoracotomy rate.
The retrospective character is not a limitation of this study, as the study design was set as an external model validation study. Although treatment methods and patient selection might have changed over the years, this should not impact the validity of our results, because the ESTS database, on which the EuroLung scores are based, includes patients between June 2007 and December 2018.
Our validation of the EuroLung2 scores has to be interpreted with caution, as the study population had a rather low number of events. Consequently, an adequate calibration analysis was not possible.

Conclusion
The decision for or against surgery for lung cancer remains a highly individual decision for each patient and should not be based upon currently available risk scores. A calculated risk score should not prevent patients from receiving surgery for lung cancer. Risk score calculation should rather be used for improved patient consenting and comparison of postoperative outcome with other departments. Currently, many large retrospective databases, such as the ESTS database, lack promising new risk factors, making it difficult if not impossible to establish more precise risk prediction models with these databases. Future efforts should aim at including these variables, such as sarcopenia or history of solid organ transplantation, for further adaptations of the risk score.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.