Background

Endovascular treatment (EVT) has been shown to be a highly effective treatment for patients with ischemic stroke due to a proximal intracranial occlusion in the anterior circulation [1,2,3]. EVT is defined as arterial catheterization with a micro-catheter to the level of the occlusion, followed by mechanical thrombectomy or thrombus aspiration, or both, with or without delivery of a thrombolytic agent. Currently EVT is widely implemented in routine clinical practice, and the challenge is how to continuously improve the quality of this service.

Worldwide, healthcare systems and practices are being reorganized with a strong focus on measuring and improving outcomes of care. The quality of care for patients with a specific medical condition is judged by the achieved outcomes that are relevant for those patients. A central aspect of this development is benchmarking, comparing quality of care and specifically outcomes between healthcare providers, in this case EVT centers. If a specific center A has better outcomes for a certain condition than center B, this suggests that center B should copy the medical management strategy of center A in order to improve the quality of care in that center. An important problem of this approach, however, is the variability in baseline characteristics of patients (‘case-mix’) that often exists between centers. If center B treats more severely affected patients than center A, this might (partly) explain the better outcomes of center A. Moreover, especially if centers are relatively small, between-center differences in outcome may be caused by chance (‘random variation’). Therefore, between-center comparisons of outcome should be adjusted for case-mix and random variation. If not done properly, such comparisons are likely to miss their purpose and could even be counterproductive as clinicians may base their decisions on flawed information. Furthermore, following Donabedian’s framework for evaluating healthcare quality, the quality improvement potential is especially captured by variation in structures (‘How is care organized?’) and processes (‘What is done?’) of care [4]. For many diseases, however, it is unknown whether between-center variation in outcome reflects true differences in quality of care, captured by this framework.

Using data from a large nationwide registry, we aimed to assess the effect of structure and process indicators on between-center variation in outcome for ischemic stroke patients treated with EVT, while adjusting for case-mix and random variation.

Methods

Study design and patients

For this study we used data collected between March 2014 and November 2017 from the MR CLEAN Registry, a prospective, observational study in all 17 centers that perform EVT in the Netherlands (Supplementary Figure 1) [5]. This registry is unique since it includes clinical and neuro-imaging data of all patients treated with EVT in one country during a multi-year period and thereby reflects clinical practice. All patients undergoing EVT for acute ischemic stroke have been registered. Inclusion criteria were: age 18 years and older, treatment in a center that participated in the MR CLEAN trial, and proximal intracranial vessel occlusion in the anterior circulation (internal carotid artery (ICA), internal carotid artery terminus (ICA-T), middle (M1/M2) cerebral artery, or anterior (A1/A2) cerebral artery), as shown by computed tomography angiography (CTA). Details on the study design and objectives of the MR CLEAN Registry have been described elsewhere [5]. Overall, data from 3279 patients were included for the current analysis (Supplementary Figure 2).

The MR CLEAN Registry was approved by the ethics committee of the Erasmus University MC, Rotterdam, The Netherlands (MEC-2014-235). On the basis of this approval, the registry was approved by the research board of each participating center. At UMC Utrecht, approval to participate in the study was obtained from the center's own research board and ethics committee.

Case-mix indicators

For case-mix adjustment we used the following patient and neuro-imaging characteristics: age, sex, relevant medical history (i.e. previous stroke, atrial fibrillation, myocardial infarction, peripheral arterial disease, hypertension, diabetes mellitus, hypercholesterolemia), pre-stroke score on the modified Rankin Scale (mRS), the baseline score on the National Institutes of Health Stroke Scale (NIHSS) as a stroke-related neurologic deficit score, and occlusion location and collateral grade on CT angiography. These characteristics were selected based on clinical knowledge and previous studies [6, 7]. Two additional characteristics were considered as case-mix indicators, because these are strongly associated with outcome but cannot be influenced by the centers: time between stroke onset and arrival at the emergency department (ED) of the intervention center, and whether or not the patient had been admitted to another center before being transferred to the intervention center [8, 9]. Stroke onset was defined as the time point when the sudden appearance of stroke symptoms was witnessed by the patient or an observer. In cases where the time of first symptoms was unknown, onset was defined as the moment the patient was last seen well.
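The onset definition above can be illustrated with a minimal sketch. The function and argument names are hypothetical and are not taken from the registry's data dictionary; the sketch only assumes that both time points are stored as timestamps and that a missing witnessed onset is represented as None.

```python
from datetime import datetime
from typing import Optional


def stroke_onset(witnessed_onset: Optional[datetime],
                 last_seen_well: Optional[datetime]) -> Optional[datetime]:
    """Return the witnessed onset time if known, otherwise the last-seen-well time."""
    return witnessed_onset if witnessed_onset is not None else last_seen_well


def onset_to_door_minutes(onset: datetime, ed_arrival: datetime) -> float:
    """Minutes between stroke onset and arrival at the ED of the intervention center."""
    return (ed_arrival - onset).total_seconds() / 60.0
```

For a patient without a witnessed onset, the interval is therefore measured from the last-seen-well time, which can only overestimate the true onset-to-door time.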

Quality of care indicators

Quality of care indicators were defined using Donabedian’s framework comprising indicators of structure, process, and outcome [4]. Both center volume and year of admission reflect the experience of a center with EVT and were used as structure indicators. Center volume was defined as the percentage of all EVT-patients treated in each center relative to all EVT-patients treated in the Netherlands in the study period. In stroke care, high center volume was found to be associated with lower stroke-related mortality [10]. In other studies, higher volume stroke centers showed better outcomes on average [11,12,13]. Since EVT is a relatively new treatment in stroke care, we hypothesized (overall) performance to increase with calendar year and therefore added year of admission as an additional structure indicator.
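As an illustration of the center volume definition, the following sketch computes each center's share of all treated patients from a list of center identifiers, one entry per patient. The function name and the input layout are assumptions made for this example only.

```python
from collections import Counter
from typing import Dict, Iterable


def center_volume_percent(center_ids: Iterable[str]) -> Dict[str, float]:
    """Percentage of all EVT patients that were treated in each center."""
    counts = Counter(center_ids)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {center: 100.0 * n / total for center, n in counts.items()}
```

Applying the same function to the patients of a single calendar year would give a volume per center per year.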

Two process indicators were defined: time from arrival at the emergency department of the intervention center to groin puncture and the use of general anesthesia (yes/no). A significant negative association between ‘time-to-groin’ and outcome was found in previously published research, indicating that time delays before initiation of EVT have a negative effect on the likelihood of independent functional recovery at 90 days [8, 13,14,15,16,17]. So far, randomized controlled trials have observed no differences in outcomes between general anesthesia and conscious sedation [18,19,20]. In several meta-analyses of observational studies, however, conscious sedation was associated with better outcomes than general anesthesia [21,22,23]. Although general anesthesia reduces the risk of patient agitation, unnecessary use of general anesthesia increases time delay in the total process of care and intra-procedural complications, and may result in cerebral hypoperfusion (e.g. through fluctuations in blood pressure on general anesthesia induction and abnormal cerebral auto-regulation) [24, 25]. Overall, we expected the use of general anesthesia to influence patient outcome, and therefore defined this as an additional process indicator.

We used the modified Rankin Scale (mRS) score as the outcome indicator [26]. The mRS score is a commonly used measure of patients’ functional outcome after ischemic stroke, and ranges from 0 (no symptoms) to 6 (death). The mRS score was assessed at 90 days after EVT (± 14 days). Good functional outcome, defined as mRS 0–2, was used as a secondary outcome (see below).

Statistical analyses

Descriptive analyses

We used Pearson’s chi-square statistic and the non-parametric Kruskal-Wallis test for a univariable comparison of centers on case-mix and quality of care indicators. The predicted probability of good functional outcome can be considered an overall measure of each center’s case-mix. To calculate this, we first fitted an individual-level logistic regression model including all case-mix indicators as predictors and yes/no good functional outcome (mRS 0–2 at 90 days) as the dependent variable. The patient-level probabilities predicted by this model were then used to calculate the median predicted probability per center.
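The second step of this summary, going from patient-level predicted probabilities to one value per center, can be sketched as follows. The sketch assumes the predicted probabilities have already been obtained from a fitted logistic regression model; the function name and inputs are hypothetical.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Sequence


def median_predicted_probability(center_ids: Sequence[str],
                                 predicted_probs: Sequence[float]) -> Dict[str, float]:
    """Median patient-level predicted probability of good outcome, per center."""
    if len(center_ids) != len(predicted_probs):
        raise ValueError("center_ids and predicted_probs must have equal length")
    by_center = defaultdict(list)
    for center, prob in zip(center_ids, predicted_probs):
        by_center[center].append(prob)
    return {center: median(probs) for center, probs in by_center.items()}
```

A center with a low median predicted probability treats, on average, patients with a less favorable prognosis at baseline.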

Random effect regression models

In order to adjust for random variation and assess the effects of adjusting for case-mix and performance on structure and process indicators on between-center variation in outcome, we used random effect proportional odds regression modelling. A random center effect (intercept) accounts for the fact that the observed outcomes for lower-volume centers can take extreme values due to random variation. A proportional odds model exploits the full ordinal nature of the mRS as an outcome scale with more than two possible categories [27].

In all analyses, we used the inverse of the mRS score for each patient. Doing so allows us to interpret the estimates as the effects on the likelihood of a more favorable outcome, since a higher score on the inverse of the mRS means more favorable outcome (see above). We estimated common odds ratios with 95% confidence intervals on the patient level using four proportional odds regression models [27]. First, we fitted an ‘empty’, unadjusted model including only a random center effect, providing insight into between-center variation in outcome accounting only for random variation. In the second model, in addition to the random center effect, we adjusted for the individual-level fixed effects of case-mix indicators on outcome. In the third model, we added the fixed effects of the structure indicators to the model. Finally, the fourth model also contains the individual-level fixed effects of the process indicators.
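To make the structure of such a model concrete, the sketch below computes outcome category probabilities under a proportional odds model. It is an illustration under stated assumptions, not the fitted model of this study: the inverse mRS is assumed to be coded as 6 minus the mRS score, the model is written in the common parameterization logit P(Y <= k) = theta_k - eta, and eta stands for the sum of the fixed effects and the random center intercept. All names and threshold values are hypothetical.

```python
import math
from typing import List, Sequence


def expit(x: float) -> float:
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def reversed_mrs(mrs: int) -> int:
    """Assumed coding of the inverse mRS: 6 (no symptoms) down to 0 (death)."""
    if not 0 <= mrs <= 6:
        raise ValueError("mRS must be between 0 and 6")
    return 6 - mrs


def category_probabilities(thresholds: Sequence[float], eta: float) -> List[float]:
    """P(Y = k) for k = 0..K, given increasing thresholds theta_0..theta_{K-1}.

    eta is the linear predictor (fixed effects plus random center intercept);
    a higher eta shifts probability mass toward higher outcome categories.
    """
    cumulative = [expit(theta - eta) for theta in thresholds] + [1.0]
    probs = []
    previous = 0.0
    for value in cumulative:
        probs.append(value - previous)
        previous = value
    return probs
```

Under this parameterization, increasing eta by b multiplies the odds of exceeding any given category by exp(b), which is why a single common odds ratio summarizes the effect across all cut-points of the scale.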

Between-center variation

To assess the relative impact of adjusting for case-mix and performance on structure and process indicators on between-center variation in outcome, we compared the variance of the random center effect (tau2) across models. Essentially, tau2 reflects the amount of between-center variation in outcome. In addition, separately for each model, we constructed forest plots to visualize this variation using estimates of center-specific outcome (i.e. the random center effects). The predictive power of the four models was compared using Akaike’s Information Criterion (AIC), in which a lower AIC value indicates a higher predictive power for outcome [28].
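Both quantities in this paragraph have simple closed forms, sketched below. The first function translates tau2 into an approximate range of center effects on the odds ratio scale, assuming the random center intercepts are normally distributed on the log-odds scale with mean zero and variance tau2. The second is the standard definition of AIC. Function names are hypothetical and the sketch is not part of the original analysis.

```python
import math
from typing import Tuple


def center_or_range(tau2: float, z: float = 1.96) -> Tuple[float, float]:
    """Approximate central 95% range of center effects, as odds ratios.

    Assumes normally distributed random intercepts with variance tau2.
    """
    if tau2 < 0:
        raise ValueError("tau2 must be non-negative")
    half_width = z * math.sqrt(tau2)
    return math.exp(-half_width), math.exp(half_width)


def aic(log_likelihood: float, n_parameters: int) -> float:
    """Akaike's Information Criterion: 2k - 2 log L (lower is better)."""
    return 2.0 * n_parameters - 2.0 * log_likelihood
```

A tau2 of zero gives a range of (1.0, 1.0), meaning that all centers have the same expected outcome for comparable patients; larger values widen the range symmetrically on the log scale.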

Missing data

Multiple imputation was used to deal with missing data, which ranged from 0.7% (previous diabetes) to 6.3% (collateral grade) [5]. We fitted regression imputation models [29, 30] and imputed data five times, using the following variables: age, sex, medical history, pre-stroke mRS score, location of occlusion, collateral grade, baseline NIHSS score, whether the patient was transferred from another hospital, time intervals from onset to arrival at the ED, center volume per year, year of admission, time intervals from onset to groin puncture, and use of general anesthesia. Each imputed dataset was analyzed separately, after which the results were pooled.
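The pooling step can be sketched as follows. The text above does not name the pooling method; this example assumes Rubin's rules, the usual choice for combining estimates across multiply imputed datasets, and the function name is hypothetical.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(estimates: Sequence[float],
               variances: Sequence[float]) -> Tuple[float, float]:
    """Pool one parameter across m imputed datasets (assumed: Rubin's rules).

    estimates: the parameter estimate from each imputed dataset
    variances: the squared standard error from each imputed dataset
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    pooled_estimate = mean(estimates)
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled_estimate, math.sqrt(total)
```

The between-imputation term inflates the standard error to reflect the uncertainty caused by the missing values, so pooled confidence intervals are wider than those from any single imputed dataset.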

All statistical analyses were performed with R statistical software version 3.4.3 (R Foundation for Statistical Computing, Vienna, Austria), using the clmm function in the ordinal package. Statistical significance was assessed at P < 0.05 in all analyses.

Results

Descriptive analyses

At the center level, the median patient age ranged from 68 to 77 years, with statistically significant differences between centers (Table 1). Differences between centers were also statistically significant for previous stroke (range 0–26%), atrial fibrillation (13–37%), peripheral arterial disease (4–22%), hypertension (41–67%), hypercholesterolemia (15–50%), pre-stroke mRS, location of occlusion, collateral grade, baseline NIHSS score (range median score per center 13–17) and percentage of transferred patients from another hospital (0–77%). Median time from stroke onset to arrival at the ED of the intervention center ranged from 52 to 160 min across centers (P < 0.001). Importantly, the median predicted probability of good functional outcome (mRS 0–2 at 90 days), which can be interpreted as an overall measure of a center’s case-mix, varied between 17 and 50% across centers (P = 0.004) (Table 1 and Supplementary Table 1). The patient-level effect estimates of case-mix variables on outcome shown in Supplementary Figure 3 are comparable to the results of prior research [31].

Table 1 Case-mix characteristics of patients treated in intervention centers in MR CLEAN Registry

Relative to the total number of patients in our data, the number of EVT-patients varied significantly between centers across the 4 years (Fig. 1). Median time from arrival at the ED of the intervention center to groin puncture also varied substantially: between 74 and 125 min for non-transferred patients (P < 0.001) and between 20 and 60 min for transferred patients (P < 0.001). Variation in the use of general anesthesia (0–99%) was also statistically significant. Crude differences in outcome were statistically significant (P < 0.001) across centers for mRS values 0–6: no symptoms (1–18%), no significant disability (7–26%), slight disability (4–29%), moderate disability (4–20%), moderately severe disability (7–31%), severe disability (2–11%), and death (21–36%) (Table 2 and Supplementary Table 2).

Fig. 1

Center volume in each intervention year. Center volume is defined as percentage of all EVT patients treated in each center relative to all EVT patients treated in the Netherlands in that year

Table 2 Quality of care indicators of all 17 intervention centers in the Netherlands

Predictive power for outcome

Model 1 generated an AIC of 11,158, which dropped to 10,050 after adding case-mix variables in model 2, suggesting a considerably improved predictive power (Table 3). Adding structure (AIC = 10,043) and process indicators (AIC = 10,012) only slightly improved the predictive power further (Table 3).
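The size of each step can be read directly from the AIC values reported above. The short sketch below computes the drop in AIC between consecutive models; the variable and function names are chosen for this illustration only.

```python
from typing import List, Sequence, Tuple

# AIC values as reported in the text for models 1 to 4.
REPORTED_AIC: List[Tuple[str, int]] = [
    ("model 1", 11158),
    ("model 2", 10050),
    ("model 3", 10043),
    ("model 4", 10012),
]


def aic_drops(models: Sequence[Tuple[str, int]]) -> List[int]:
    """Decrease in AIC from each model to the next (positive means improvement)."""
    return [models[i][1] - models[i + 1][1] for i in range(len(models) - 1)]


if __name__ == "__main__":
    print(aic_drops(REPORTED_AIC))  # [1108, 7, 31]
```

The drop of 1108 points after adding case-mix indicators is far larger than the drops of 7 and 31 points for structure and process indicators.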

Table 3 Results from the random effect proportional odds regression analysis using the inverse of the modified Rankin Scale at 90 days as the dependent variable

Between-center variation in outcome

In model 1, which only adjusts for random variation, the between-center variation in outcome (tau2) was 0.040 (Table 3). This estimate more than doubled after adding case-mix indicators (model 2, tau2: 0.086), while adding structure indicators (model 3) and process indicators (model 4) left it almost unaffected (tau2: 0.081 and 0.089, respectively). This indicates that only a small amount of between-center variation was explained by variation in performance on structure and process indicators. This finding is also reflected in the forest plots (Fig. 2), which for each model show the estimated effects of all 17 centers on the likelihood of favorable outcome. Between-center variation increased particularly after case-mix indicators were added (compare Fig. 2b with Fig. 2a), while variation remained rather constant after adding structure (Fig. 2c) and process (Fig. 2d) indicators.
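One way to express these changes on a common scale is the proportional change in variance between two models, a summary sometimes used in multilevel analysis. It was not reported in this study; the sketch below applies it to the tau2 values given above as an illustration, with hypothetical names.

```python
def proportional_change_in_variance(tau2_reference: float, tau2_model: float) -> float:
    """(tau2_reference - tau2_model) / tau2_reference.

    Positive values mean the added indicators explain between-center variance;
    negative values mean the variance increased after adjustment.
    """
    if tau2_reference <= 0:
        raise ValueError("reference variance must be positive")
    return (tau2_reference - tau2_model) / tau2_reference


if __name__ == "__main__":
    # tau2 values as reported in the text for models 1 to 4.
    tau2 = {1: 0.040, 2: 0.086, 3: 0.081, 4: 0.089}
    print(proportional_change_in_variance(tau2[1], tau2[2]))  # case-mix vs. empty
    print(proportional_change_in_variance(tau2[2], tau2[3]))  # + structure
    print(proportional_change_in_variance(tau2[2], tau2[4]))  # + structure, process
```

With the reported values, the first comparison is strongly negative (variance more than doubles), whereas the comparisons against model 2 are close to zero.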

Fig. 2

Forest plots reporting random center effect (odds ratios and 95% confidence intervals) on inverse of modified Rankin Scale at 90 days in four models using random effect proportional odds regression analysis. a: Model 1 (unadjusted model); b: Model 2 (case-mix adjusted model); c: Model 3 (case-mix and structure indicators adjusted model); d: Model 4 (case-mix, structure and process indicators adjusted model)

Discussion

This study focused on assessing variation in outcome between centers treating ischemic stroke with EVT in the Netherlands, and the impact of differences in case-mix and in performance on structure and process indicators. Our results show that differences in case-mix have a much larger impact on between-center differences in outcome for stroke patients treated with EVT than differences in structure and processes of care across centers. Therefore, an (unadjusted) measure of functional outcome may not be a valid indicator for quality of care when comparing stroke centers.

Outcome indicators as measures of quality of care

Benchmarking initiatives increasingly use outcome measures to compare quality of care between centers. However, as found in this study, significant differences in performance on evidence- and consensus-based measures of (processes of) care may not be reflected in between-center differences in outcome. An important explanation for this is the observational nature of the data that are typically used in such benchmarking initiatives. When assessing the effect of interventions or processes of care on outcome in such data, confounding by indication is a major issue. For example, physicians might treat more severely affected patients faster than less severely affected patients [32, 33]. As disease severity is likely to influence outcome, in observational data it can confound the estimated relation with outcome insofar as severity is insufficiently accounted for. Although we adjusted for disease severity in several ways using measured case-mix variables, we cannot preclude the possibility that our estimated effect of time to treatment on outcome is to some extent biased. This could partly be because of a residual unmeasured case-mix effect, as well as unaccounted-for technical parameters. Knowing the association of center-level outcomes with process indicators and structural characteristics allows decision-makers to understand the determinants of performance and implement improvement strategies; in other words, all indicators should be retained to get an empirical (i.e., context specific) evaluation of the quality of care. If a center is an “outlying provider” even after accounting for patient case mix and process/structural characteristics, in-depth analyses and audit activities should be put in place to understand the reasons for this center’s outlying performance.

Given that detecting a beneficial effect of effective interventions on outcome can be difficult even in large methodologically sound randomized trials, observing effects of evidence- and consensus-based indicators of (processes of) care on between-center differences in outcome may be even harder in benchmarking initiatives using observational registry data. In other words, when observational data are used for benchmarking, which in practice is typically the case, good performance on structure and process indicators may not be reflected in favorable outcome due to unmeasured confounding factors. Therefore, identifying and benchmarking performance on indicators with a proven contribution to favorable outcomes should be an important future direction for stroke care quality assessment and improvement initiatives.

Between-center differences in outcome after case-mix adjustment

The difference between the unadjusted and adjusted estimated center effects has a statistical explanation and results from both imbalance and stratification [34, 35]. Good outcomes will generally be more difficult to achieve for patients who are more severely affected at baseline. Therefore, the observed outcomes of centers with a relatively ‘severe’ case-mix will on average be less favorable (i.e. biased downwards) as compared to those of centers with a relatively ‘mild’ case-mix (i.e. biased upwards). In general, adjusting for the imbalance in case-mix would then reduce observed variation in outcome between centers. However, in this study an opposite pattern is observed. After adjusting for case-mix, actual between-center differences in outcome become visible. Apparently, centers with a more severe case-mix tend to have relatively good outcomes and vice versa. Although counterintuitive, this still underlines the necessity of appropriate case-mix adjustment in benchmarking quality performance using observational data. Besides the effect of adjusting for the imbalance, more extreme center effects after adjustment could be a result of the stratification effect. Although “stratification” usually refers to conditioning on categorical subgroups (e.g. on sex), we also use this term when continuous variables are involved, for example, age. Adjustment will generally increase standard errors, and the stratification (adjustment) effect will lead to more extreme effect estimates [34].
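The imbalance mechanism can be illustrated with a deliberately simplified toy calculation. It treats the expected outcome of a center on the log-odds scale as the sum of a center effect and a severity effect, which ignores the stratification effect described above. All numbers and names are hypothetical and serve only to show how adjustment can widen, rather than narrow, the gap between two centers.

```python
from typing import Tuple


def crude_and_adjusted_gap(effect_a: float, effect_b: float,
                           severity_a: float, severity_b: float,
                           beta_severity: float) -> Tuple[float, float]:
    """Difference between centers A and B, before and after case-mix adjustment.

    effect_*:      true center effect (log-odds scale)
    severity_*:    mean baseline severity of the center's patients
    beta_severity: effect of one unit of severity on outcome (negative = worse)
    """
    crude_a = effect_a + beta_severity * severity_a
    crude_b = effect_b + beta_severity * severity_b
    crude_gap = crude_a - crude_b
    adjusted_gap = effect_a - effect_b  # severity difference removed
    return crude_gap, adjusted_gap


if __name__ == "__main__":
    # Center A performs better but treats more severely affected patients.
    crude, adjusted = crude_and_adjusted_gap(
        effect_a=0.3, effect_b=-0.3,
        severity_a=1.0, severity_b=0.0,
        beta_severity=-0.5,
    )
    print(crude, adjusted)
```

In this constructed example the crude gap between the centers is small because the severe case-mix of center A masks its better performance, while the adjusted gap recovers the full difference in center effects.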

Strengths and limitations

In previous studies [6, 7], a dichotomized version of the mRS was used (mRS ≥3 was considered as ‘poor outcome’) and analyzed using binary logistic regression models. In order to exploit the full ordinal nature of the mRS score, we used proportional odds regression analysis. In addition, the use of random effect analyses allowed us to estimate (the variance of) center outcomes adjusted for random variation and various other factors (i.e. case-mix, structure, and process indicators).

A first limitation of this study is the unavailability of other potentially contributing factors to between-center outcome differences, e.g. unmeasured patient characteristics, care processes, and center characteristics. A second limitation is that missing values may have introduced some bias, although we believe we have mitigated this issue considerably using multiple imputation, which is preferred over complete case analysis [36, 37]. A third limitation is the low number of second-level units (centers), with a resulting lack of precision in tau2 estimates. A final limitation is that we only analyzed one outcome. Although the mRS is used as an outcome in virtually all modern stroke trials, our conclusions might have been different if we had used other outcomes that are relevant to stroke patients treated with EVT, like patient-reported outcomes such as quality of life. In addition, even though the mRS is an appropriate tool for assessing patient disability after stroke care, it may not be easily transferable to outcomes research. After all, in many countries other than the Netherlands, the mRS is not routinely collected for all acute stroke patients. However, if we had used clinical outcomes that are commonly registered in administrative databases, such as short-term mortality or readmission, we would have had less certainty about our estimates because of the much smaller number of events in the context of acute stroke treatment. Moreover, from a clinical perspective these outcome measures are far less relevant in the stroke context compared to the mRS. Therefore, we used the mRS as the most powerful and clinically relevant outcome for our study.

Conclusions

In this study, we have demonstrated that between-center differences in performance on structure and process indicators have a small impact on between-center variation in functional outcome of ischemic stroke patients treated with EVT, while differences in case-mix affect this variation substantially. Thus, outcome indicators may not be valid and useful for comparing and improving the quality of stroke care based on observational data. Since variation in performance on structure and process indicators captures real quality improvement potential, these indicators should be used in future benchmarking initiatives. This is especially true when a strong association exists between those indicators and outcome, as is the case for time to treatment in ischemic stroke.