Introduction

Transcatheter aortic valve replacement (TAVR) is the treatment option of choice in inoperable and high-risk patients with severe aortic valve stenosis [1, 2] and has recently also shown favorable outcomes in intermediate-risk [3, 4] and low-risk patients [5, 6]. Patient and procedural characteristics determine the risk for adverse clinical outcomes. Cardiovascular societies recommend clinical decision-making for TAVR within the Heart Team [1, 2], assisted by the use of statistical risk scoring models for individual patient risk stratification. Classical surgical risk models such as logistic EuroSCORE I (LogES I) [7, 8], its update EuroSCORE II (ES II) [9], or the Society of Thoracic Surgeons Predicted Risk of Mortality model (STS PROM) [10] are routinely used to gauge operative risk; however, they are known to overestimate mortality risk in TAVR [11,12,13]. Several TAVR-specific risk assessment models have been developed from national registry databases to optimize the predictive performance in the context of TAVR [14,15,16]. However, the optimal approach to risk prediction in TAVR remains unknown, as the performance of these models has been poor in validation studies [12, 17, 18].

We aimed to evaluate the 30-day mortality prediction performance of six surgical and TAVR-specific risk models (Table 1) in our German Rhine Transregio Aortic Diseases cohort of patients undergoing TAVR.

Table 1 Risk model characteristics

Methods

Study population and data collection

This study was performed as an all-comer analysis of patients treated with either transfemoral (TF) or transapical (TA) TAVR between 2008 and 2018 at the Heart Centers in Düsseldorf and Bonn, Germany, within the Rhine Transregio research consortium on aortic diseases. All patients were referred for TAVR procedures by local heart teams according to contemporary clinical practice and provided written informed consent for inclusion in prospective registries, with collection of clinical, procedural and follow-up data. The study was performed in accordance with the Declaration of Helsinki. The institutional Ethics Committee of the Heinrich-Heine University approved the study protocol (4080).

Clinical outcomes were systematically assessed using the Valve Academic Research Consortium consensus statement [19]. The primary clinical outcome of interest for risk model performance evaluation was 30-day mortality; secondary in-hospital outcomes are additionally reported.

Statistics

Statistical and graphical data analysis was performed using Excel (Microsoft, USA), SPSS (version 23.0, SPSS Inc., Chicago, IL, USA), MedCalc 18.10 (MedCalc Software, Belgium) and GraphPad Prism (version 7.0, GraphPad Software, San Diego, CA, USA). Continuous data are described as means ± standard deviation (SD), ordinal/categorical data as counts and % of total, and receiver-operating-characteristic (ROC) curve analysis is summarized as c-indices (area-under-the-curve) with 95% confidence intervals (CI). Continuous variables were evaluated for normal distribution using the Kolmogorov–Smirnov test and compared with either Student’s t tests (for normally distributed samples) or Mann–Whitney-U tests; categorical data were compared with Chi-squared tests or Fisher’s exact tests (for expected values < 5); ROC curves were compared using a non-parametric approach according to DeLong et al. [20]. All statistical tests were two-tailed, and a p value < 0.05 was considered statistically significant.
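As an illustration of how a c-index can be read, the following minimal Python sketch computes the area under the ROC curve in its pairwise (Mann–Whitney) form: the proportion of all pairs of one deceased and one surviving patient in which the deceased patient had the higher predicted risk, with ties counted as one half. This is not the software used in the study; the function name `c_index` and the toy data are hypothetical.

```python
from itertools import product


def c_index(predicted_risk, died):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form)."""
    events = [p for p, d in zip(predicted_risk, died) if d]
    non_events = [p for p, d in zip(predicted_risk, died) if not d]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for e, n in product(events, non_events):
        if e > n:
            concordant += 1.0   # deceased patient ranked higher: concordant pair
        elif e == n:
            concordant += 0.5   # tied predictions count as half
    return concordant / (len(events) * len(non_events))


# Hypothetical toy data, not taken from the study cohort.
toy_risk = [0.02, 0.05, 0.10, 0.20, 0.30, 0.40]
toy_died = [0, 0, 1, 0, 0, 1]
toy_c = c_index(toy_risk, toy_died)
```

A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of patients with and without the event.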

Six risk models for the prediction of 30-day mortality (Table 1: LogES I [7, 8], ES II [9], STS PROM [10], FRANCE-2 [15], OBSERVANT [14] and GAVS-II [16]) were calculated for individual patients. LogES I, ES II and STS PROM were routinely available in the databases and used for heart team decision-making. FRANCE-2, OBSERVANT and GAVS-II were calculated as follows: (1) relevant variables (for details, see Supplementary material) were weighted according to model definitions; (2) weighted variables were transformed into 30-day mortality probability predictions according to model definitions. Missing values necessary for risk model calculation were imputed to the mean (continuous) or median (categorical) of the whole population. Risk model discrimination accuracy was evaluated using ROC analysis and the c-index [area-under-the-curve (AUC)] as a cumulative measure; risk model calibration accuracy/goodness-of-fit was evaluated by stratification of patients into risk quintiles and comparison of observed vs. expected events within risk strata; additionally, it was formally tested by calculation of a logistic regression model for 30-day mortality, with the score value as independent variable and the Hosmer–Lemeshow goodness-of-fit test [21]. Prespecified analyses of patients stratified by TF vs. TA approach, time period of the procedure and prosthetic valve device types were additionally performed to account for learning curves and technical developments.
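The two calculation steps and the imputation rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the study's implementation: it assumes a model of logistic form (as in the logistic EuroSCORE), whereas the published definitions of the individual models may instead map a point score to a probability. The function names, variable names and all weights are hypothetical.

```python
import math
from statistics import mean, median


def impute(values, center):
    """Replace missing entries (None) with a population summary of the observed ones."""
    fill = center([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


def logistic_risk(intercept, weights, patient):
    """Step 1: weight the variables; step 2: transform the sum into a probability."""
    linear_predictor = intercept + sum(
        weights[name] * patient[name] for name in weights
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical weights for illustration only; not taken from any published model.
weights = {"age_per_year": 0.05, "prior_cardiac_surgery": 0.8}
ages = impute([82.0, None, 78.0], mean)      # continuous variable: mean
redo = impute([0, 1, 1, None], median)       # categorical variable: median
patient = {"age_per_year": ages[1], "prior_cardiac_surgery": redo[3]}
risk = logistic_risk(-8.0, weights, patient)
```

Imputing to the population mean or median keeps every patient in the analysis, at the price of pulling imputed patients toward the average risk.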

Results

Patient population

A total of 2946 patients underwent TAVR between 2008 and 2018 at the German Heart Centers of Düsseldorf (n = 1605) and Bonn (n = 1341) and were included in this analysis, with 2625 TF TAVR (89%) and 321 TA TAVR (11%) procedures. Mean age was 80.9 ± 6.1 years (Suppl. Table 1). Patients had a high cardiovascular risk profile, including comorbidities of cardiovascular disease (67% coronary artery disease), previous cardiac interventional or surgical procedures (39% previous PCI, 15% previous CABG) and diabetes mellitus (30%). Characteristics differed significantly between patients treated with TF and TA TAVR procedures (Suppl. Table 1).

Procedural characteristics across the TAVR era

TAVR developed from a rare procedure with high mortality (2008/2009: n = 61 with ~ 10% mortality) to a high-volume routine procedure (2018: n = 538 with 1.5% mortality) with low adverse event rates and shortened procedure duration (Table 2), which reflects a learning curve as well as continuous technical development. Patient age and estimated surgical mortality risk (LogES I) declined only marginally. TF TAVR became the primary access of choice (95.5% of procedures in 2018) over TA TAVR. The self-expandable Medtronic valves in different generations (early: CoreValve; newer generations: Evolut R and Evolut R Pro) were preferred to the balloon-expandable Edwards valves (early: Sapien XT; newer generation: Sapien 3) in TF TAVR; the latter were the primary choice in TA TAVR (Suppl. Table 1).

Table 2 Patient and procedural characteristics across the study period

Clinical outcomes

The primary outcome of 30-day mortality occurred in 3.7% of patients overall and was more frequent in TA than in TF TAVR patients (7.5% vs. 3.2%; p < 0.0001). Secondary clinical outcomes according to VARC-2 criteria are additionally reported in Suppl. Table 2.

Risk model performance for the prediction of 30-day mortality

Risk model characteristics are provided in Table 1 and in the Supplementary material.

Model discrimination

Risk model discrimination performance indices are reported in Table 3A, Fig. 1 and Table 4. ROC analyses showed mediocre performance of all risk models, without significant differences between them. Numerically, STS PROM (c-index 0.67, 95% CI 0.62–0.72) performed best, followed by FRANCE-2 (c-index 0.66, 95% CI 0.60–0.71) and ES II (c-index 0.65, 95% CI 0.60–0.70); OBSERVANT performed worst (c-index 0.60, 95% CI 0.55–0.66). All risk models performed worse than in their original validation cohorts (Table 1), except for FRANCE-2 (c-index 0.66 vs. 0.59 original). STS PROM showed the most consistent performance across TF and TA TAVR subgroups (both c-indices 0.66), while OBSERVANT performed numerically worst in both.

Table 3 Risk model performance for the prediction of 30-day mortality risk in TAVR
Fig. 1 Risk model discrimination for the prediction of 30-day mortality. Model discrimination (ROC curves) of the six risk models for the prediction of 30-day mortality, for all studied patients (a), patients with TF TAVR only (b) and patients with TA TAVR only (c)

Table 4 Risk model discrimination stratified by time period and device type

The stratified analyses for time period and for new vs. old generation devices (Table 4) showed no significant differences and no visible performance trend in any risk model over the years.

Model calibration

Mean predicted mortality risk (Table 3B) exceeded observed mortality in all models, ranging from 5.8 ± 5.0% (OBSERVANT) to 23.4 ± 15.9% (LogES I). Predictions were significantly higher in the TA TAVR subgroup for LogES I, ES II, STS PROM and FRANCE-2, while the OBSERVANT and GAVS-II models showed no differences between subgroups.

Graphical analysis of calibration is displayed in Fig. 2. Overestimation of mortality risk was especially pronounced in high-risk patients, while the lower risk quintiles were more adequately calibrated (e.g. STS PROM, ES II, OBSERVANT models). The classical LogES I surgical model grossly overestimated mortality in all risk strata.

Fig. 2 Risk model calibration for the prediction of 30-day mortality. Graphical model calibration comparison of the six risk models, stratified into risk quintiles for observed vs. predicted 30-day mortality
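The quintile-based comparison of observed vs. predicted mortality can be sketched as follows. This is a minimal Python illustration, not the analysis code of the study: it assumes that patients are sorted by predicted risk and cut into five strata of (nearly) equal size, and the function name and toy data are hypothetical.

```python
def calibration_by_quintile(predicted_risk, died, n_groups=5):
    """Return (mean predicted risk, observed event rate, n) for each risk stratum."""
    paired = sorted(zip(predicted_risk, died))   # ascending predicted risk
    n = len(paired)
    rows = []
    for g in range(n_groups):
        stratum = paired[g * n // n_groups:(g + 1) * n // n_groups]
        if not stratum:
            continue                             # fewer patients than strata
        expected = sum(p for p, _ in stratum) / len(stratum)
        observed = sum(d for _, d in stratum) / len(stratum)
        rows.append((expected, observed, len(stratum)))
    return rows


# Hypothetical toy data, not taken from the study cohort.
toy_risk = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10]
toy_died = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
toy_rows = calibration_by_quintile(toy_risk, toy_died)
```

In a well-calibrated model, the mean predicted risk and the observed event rate are close in every stratum; overestimation shows as a predicted risk above the observed rate.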

Results of formal statistical testing for goodness-of-fit are displayed in Suppl. Table 3: STS PROM and GAVS-II showed significant Hosmer–Lemeshow p values (indicating inadequate calibration) of the respective logistic regression models for the whole population, GAVS-II also across TF and TA TAVR subgroups. The test could not be calculated for OBSERVANT due to lack of events in some risk groups.
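For orientation, the Hosmer–Lemeshow statistic itself can be sketched in a few lines of Python. This is a minimal illustration, not the software output reported here: it assumes that the predicted probabilities (in the study, those of the refitted logistic regression model) are already available, groups patients by ascending predicted risk, and sums the squared differences between observed and expected deaths, scaled by the binomial variance of each group. The function name and toy data are hypothetical.

```python
def hosmer_lemeshow_statistic(predicted_risk, died, n_groups=10):
    """Sum over risk groups of (observed - expected)^2 / (n * p * (1 - p))."""
    paired = sorted(zip(predicted_risk, died))   # ascending predicted risk
    n = len(paired)
    statistic = 0.0
    for g in range(n_groups):
        group = paired[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:
            continue
        size = len(group)
        expected = sum(p for p, _ in group)      # expected number of deaths
        observed = sum(d for _, d in group)      # observed number of deaths
        mean_risk = expected / size
        variance = size * mean_risk * (1.0 - mean_risk)
        if variance == 0.0:
            raise ValueError("group with mean predicted risk of 0 or 1")
        statistic += (observed - expected) ** 2 / variance
    return statistic


# Hypothetical toy data, not taken from the study cohort.
toy_stat = hosmer_lemeshow_statistic([0.5, 0.5, 0.5, 0.5], [0, 0, 1, 1], n_groups=2)
```

The p value is then obtained from a chi-squared distribution, conventionally with the number of groups minus two degrees of freedom for a fitted model (for example with `scipy.stats.chi2.sf`); a small p value indicates a poor fit.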

Discussion

The main results of this comparative external validation study of six risk models for the prediction of 30-day mortality in patients undergoing TAVR at two German high-volume Heart Centers from 2008 to 2018 are (1) surgical and TAVR-specific risk models showed similarly mediocre discrimination performance (c-indices 0.60–0.67); (2) all models overestimated 30-day mortality risk and were poorly calibrated, especially in high-risk patients; (3) no significant influence of time of procedure or device type on the results could be found.

Patient risk stratification for coronary procedures is well established, ranging from the elective setting to acute coronary syndromes [22,23,24], but the risk of TAVR procedures is thus far considerably less predictable. Patients suffer from a multitude of interdependent comorbid conditions, and frailty in particular, an essential risk factor, is difficult to classify [25, 26]. The expansion of TAVR from highest-risk/inoperable patients into intermediate- and lower-risk groups introduces bias. On the procedural side, there are clear associations of center volume [27], operator experience [28], and choice of approach (TA vs. TF) [29] with clinical outcomes.

Classical surgical risk models (LogES I, STS PROM, ES II) have well-known limitations in TAVR [13]. LogES I is no longer recommended for use in TAVR [1]; we included the model for historical comparisons. National TAVR registries have provided platforms for the development of TAVR-specific risk models [14,15,16], which were mostly developed in the “early years” of TAVR (Table 1). Resulting limitations in their performance have already been seen when applied outside of their original populations [30, 31]. External validation studies with a design similar to ours were performed in the United Kingdom [12] and in the Netherlands [18] and also yielded disappointing results.

This analysis from the German Transregio Aortic Diseases cohort elucidates risk prediction in the whole era of TAVR, which has developed from an experimental procedure in 2008 to a routine and high-volume alternative to surgical aortic valve replacement in 2018. With steadily increasing operator experience and new-generation devices, procedure counts have dramatically increased and complication rates declined over the study period. The TF approach has become the routine access route.

Surgical risk models performed as expected [13]: discrimination analysis showed essentially similar performance of the STS PROM [10], LogES I [7, 8] and ES II [9] models. None came close to matching the discrimination performance in their original surgical patient cohorts. Calibration analysis underlined the general overestimation of 30-day mortality risk, most pronounced in the LogES I model. Surgical models are thus confirmed to have severe limitations in judging TAVR risk.

However, dedicated TAVR-specific models also disappointed. FRANCE-2 (overall c-index 0.66, 95% CI 0.60–0.71) performed better than in the original validation cohort (c-index 0.59 [15]) and in the UK (c-index 0.62 [12]) and Netherlands (c-index 0.63 [18]) external validation studies, but it was not superior to surgical models and was slightly worse than in an Israeli external validation study (c-index 0.71 [30]). The model considerably overestimated risk (Fig. 2) in our patients. The GAVS-II model [16], derived from the German Aortic Valve Registry, was expected to be best adapted to German TAVR conditions, but it performed similarly to the other models in the overall cohort (c-index 0.63) and in TF and TA subgroups and could not match its performance in the development cohort (c-index 0.74), similar to the Dutch external validation study [18]. Numerically, its discrimination improved over time and in new vs. old generation devices (Table 4). The model also overestimated mortality (expected: 9.4%; Fig. 2). Reasons for the lower-than-expected performance may lie in the GAVS-II model having been developed in 55% surgical/45% TAVR patients from 2011–2012 (Table 1), who differed significantly in age (74 vs. 81 years) and comorbidities from our population [16]. The OBSERVANT model [14] discriminated numerically worst in the overall cohort and in TF/TA subgroups and could not match its performance in the original population (c-index 0.71), which confirms findings in other external validation studies [12, 18, 30]. However, of all analyzed models, its mean mortality prediction (5.8%) came closest to observed events (3.7%).

Taken together, this analysis shows that risk prediction in TAVR is still an unsolved issue: all tested models are severely limited in their performance, and dedicated TAVR-specific models are not superior to decade-old surgical scores. Using any of these models for risk stratification is better than a coin flip, but leaves ample room for improvement. A recently published analysis from the Netherlands [18] found results similar to the UK study [12] and to ours, which supports the validity of these findings across Europe. With TAVR increasingly becoming a routine procedure and 30-day mortality rates reaching all-time lows in high-volume Heart Centers, re-calibration of existing risk models or development of new models from growing registries is thus mandatory to improve accuracy [32, 33]. High-quality TAVR databases should enable us to produce risk models with performance comparable to those for surgical procedures or acute coronary syndrome. Incorporation of functional status/frailty assessment as important predictive factors may additionally improve model accuracy in the TAVR setting [34, 35].

Study limitations

All conclusions from our study are limited to risk model performance in German, or at best European, TAVR patients; findings might differ in other patient populations and procedural conditions. Bias may originate from retrospective calculation of the FRANCE-2, OBSERVANT and GAVS-II risk models. While data quality in our prospective databases was high at all times, there is an unavoidable risk of bias from changes in clinical practice and adverse event rates from 2008 to 2018. We accounted for this with sub-analyses of patients stratified by time of procedure and device generation (Table 4); however, their statistical power is limited.

Conclusions

Three surgical as well as three TAVR-specific risk scoring models showed mediocre performance in prediction of 30-day mortality risk for TAVR in the German Rhine Transregio Aortic Diseases cohort. Development of new or updated risk models is necessary to improve risk stratification.