Background

Natalizumab [1, 2] and fingolimod [3, 4] are two high-efficacy treatments used in patients with relapsing-remitting multiple sclerosis (RRMS). Comparative effectiveness studies of these therapies have reported somewhat inconsistent results [5,6,7,8,9]. In particular, we focus on three studies that used data from three multiple sclerosis (MS) registries and differed in methods and conclusions [5,6,7]. We have already shown that some of this variability can be attributed to differences between the study populations [10, 11]. In the present work, we focus on the impact of methodological choices on the results, in particular the methods used to control treatment indication bias and to manage censoring in time-to-event analysis.

In the absence of randomized clinical trials, many decisions need to be made when conducting observational studies. Within the “target trial” framework developed by Hernán and Robins, we focus on two protocol components: first, the assignment procedure and, second, the causal contrast [12]. First, to emulate random assignment, we need to adjust for all known confounders [12]. The propensity score (PS), which can be used in several ways, is a popular tool for controlling the effect of indication bias on the results of comparisons of interventions [13, 14]. The studies in the Danish MS Registry and MSBase used PS matching [6, 7], while the study in OFSEP used PS weighting [5]. Second, attrition bias and informative censoring result from systematic differences in the follow-up duration between cohorts. Two causal contrasts, per-protocol and intention-to-treat, were considered to evaluate follow-up information. While the per-protocol framework includes only outcomes that were recorded while patients were exposed to the relevant intervention, the intention-to-treat framework mitigates the risk of informative censoring, which is of particular importance where the clinical outcomes of the interventions are delayed [12, 15]. The per-protocol framework was originally used in the studies in the Danish MS Registry and MSBase [6, 7], while the intention-to-treat framework was used in the OFSEP study [5]. Moreover, the study in MSBase used pairwise censoring, which censors the data within each PS-matched pair at the shorter of the two recorded follow-up times, in order to balance the analysed follow-up time between the groups [16].

The objective of this empirical study is to elucidate the influence of methodological decisions on the results of a comparison of two potent interventions, using the example of natalizumab and fingolimod among patients with MS and combined data from three large clinical registries [5,6,7].

Methods

Data source

This study is the result of a collaborative project [11, 17]. Longitudinal demographic and clinical data were extracted from MSBase on 15th of May 2018 [18, 19]. The Danish MS Registry cohort included all patients treated with natalizumab or fingolimod from 1st of July 2011, when fingolimod became available in Denmark, until 1st of March 2018 [20, 21]. The OFSEP cohort included data from 27 French university hospitals extracted from the European Database for Multiple Sclerosis (EDMUS) software in July 2014 [22]. No patient from OFSEP was recorded in MSBase. Danish patients who were recorded in both MSBase and the Danish MS Registry (2% of the Danish MS Registry) were excluded from MSBase and considered only in the Danish MS Registry.

Eligibility criteria

All patients were diagnosed with RRMS. The required disability follow-up consisted of a recorded visit with an Expanded Disability Status Scale (EDSS) [23] score assessment within six months before treatment initiation (the baseline visit), two post-baseline visits with EDSS assessments at least six months apart, and at least one on-treatment visit.

Interventions

The treatment of interest was the first exposure to natalizumab or fingolimod that started on or after 1st January 2011 and continued for a minimum of three months. Patients who had participated in randomized trials, or who had been treated before the study therapy with an off-label treatment (cyclophosphamide) or with therapies known to have an extended duration of effect [24,25,26] (mitoxantrone, alemtuzumab, cladribine, daclizumab, rituximab, ocrelizumab), were excluded. Each patient could contribute only once to the follow-up analysis. When multiple eligible treatment starts were recorded, the earliest treatment was considered.

Outcomes

Four outcomes were evaluated to compare the relative effectiveness of the two study therapies:

  • (1) Count of relapses.

  • (2) Time to first relapse.

  • (3) Time to first confirmed disability worsening event. Worsening was defined as an increase of ≥ 1.5 EDSS steps if baseline EDSS was 0, or 1.0 if baseline EDSS was 1.0–5.5, or 0.5 steps if baseline EDSS was > 5.5, and sustained at all consecutive visits over ≥ 6 months (confirmation cannot be preceded by a relapse within 30 days).

  • (4) Time to first confirmed disability improvement event. An improvement was defined as a decrease of 1.5 if baseline EDSS was 1.5, or 1.0 if baseline EDSS was 2.0–6.0, or 0.5 if baseline EDSS was > 6, sustained at all consecutive visits over ≥ 6 months.
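The baseline-dependent thresholds in outcomes (3) and (4) can be illustrated with a minimal Python sketch. The function names are hypothetical, six months is approximated as 182 days, and "sustained at all consecutive visits" is read as "sustained at every visit up to the first visit at least six months after onset"; these are assumptions, not the registries' implementation. The rule excluding confirmation within 30 days of a relapse is omitted for brevity.

```python
def worsening_threshold(baseline_edss):
    """Minimum EDSS increase counted as worsening, given the baseline EDSS."""
    if baseline_edss == 0:
        return 1.5
    if baseline_edss <= 5.5:
        return 1.0
    return 0.5


def improvement_threshold(baseline_edss):
    """Minimum EDSS decrease counted as improvement; None if no rule is stated."""
    if baseline_edss == 1.5:
        return 1.5
    if 2.0 <= baseline_edss <= 6.0:
        return 1.0
    if baseline_edss > 6.0:
        return 0.5
    return None  # e.g. baseline EDSS of 0 or 1.0


def first_confirmed_worsening(visits, baseline_edss, confirm_days=182):
    """visits: list of (day, edss) sorted by day. Returns the onset day or None.

    Assumption: the relapse-within-30-days rule is not modelled here.
    """
    target = baseline_edss + worsening_threshold(baseline_edss)
    for i, (day, edss) in enumerate(visits):
        if edss < target:
            continue
        for later_day, later_edss in visits[i + 1:]:
            if later_edss < target:
                break  # worsening was not sustained
            if later_day - day >= confirm_days:
                return day  # sustained at every visit up to confirmation
    return None
```

For example, with a baseline EDSS of 2.0, the visit series (90, 2.0), (180, 3.0), (270, 3.0), (400, 3.5) yields a confirmed worsening with onset at day 180.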

The end of the analysed follow-up (or, for the count of relapses, of the analysed period) depended on the definition of right-censoring (see below).

Assignment procedure: propensity score matching and weighting

In the present work, baseline was defined as the date of the start of the index therapy. To emulate the random assignment of treatments at baseline, the PS [13, 27] was defined as the probability of being treated with natalizumab, conditional on the following baseline characteristics (selected on the basis of expert opinion and prior analyses): sex, age, MS duration (from first MS symptoms to baseline), EDSS score, number of previous treatments and, evaluated over the preceding 12 months, the number of relapses and the nature of the clinical activity recorded (disability worsening only, relapses only, both, or no clinical activity). Country was added as a random effect. We estimated both the average treatment effect for the treated (ATT), which is the average treatment effect among those patients who were exposed to natalizumab, and the average treatment effect for the entire eligible population (ATE) [28]. One-to-one, greedy, nearest-neighbor, random matching on the PS was used, which approximates the ATT only [29]. Matching caliper values of 0.1 (used in the original studies), 0.2 (as recommended in the literature [30]) and 0.02 (to prioritize close matching) standard deviations of the PS were used. Two weighting procedures were explored. First, using inverse probability of treatment weighting (IPTW), the weights for a treated patient and for a control are defined as \(w_i = \frac{1}{p_i}\) and \(w_i = \frac{1}{1 - p_i}\), respectively, where \(p_i\) is the PS for patient \(i\). To reduce the problems caused by extreme weights, the weights were stabilized by multiplying them by the marginal probability of receiving the treatment actually received [31]; the stabilized weights are referred to as sIPTW. Second, using weighting by the odds [32], the weight for a treated patient is 1 and the weight for a control is defined as \(w_i = \frac{p_i}{1 - p_i}\). Weighting with IPTW allows estimation of the ATE, while weighting by the odds allows estimation of the ATT.
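The weighting formulas and the caliper matching described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used in the study (which was run in R). The function names are hypothetical, the caliper is applied to the sample standard deviation of the raw PS, matching is done without replacement, and "random" is interpreted as processing natalizumab-treated patients in random order.

```python
import random
from statistics import stdev


def ps_weights(treated, ps):
    """Return (iptw, siptw, odds) weights; treated[i] is 1 for natalizumab."""
    p_treat = sum(treated) / len(treated)  # marginal probability of treatment
    iptw, siptw, odds = [], [], []
    for z, p in zip(treated, ps):
        if z:
            iptw.append(1 / p)
            siptw.append(p_treat / p)            # stabilized IPTW
            odds.append(1.0)                     # ATT weight, treated
        else:
            iptw.append(1 / (1 - p))
            siptw.append((1 - p_treat) / (1 - p))
            odds.append(p / (1 - p))             # ATT weight, control
    return iptw, siptw, odds


def greedy_match(treated, ps, caliper_sd=0.2, seed=0):
    """1:1 greedy nearest-neighbour matching within a caliper.

    Assumptions: caliper = caliper_sd * SD of the raw PS, matching without
    replacement, treated patients processed in random order.
    """
    caliper = caliper_sd * stdev(ps)
    treated_idx = [i for i, z in enumerate(treated) if z]
    controls = [i for i, z in enumerate(treated) if not z]
    random.Random(seed).shuffle(treated_idx)
    pairs = []
    for t in treated_idx:
        if not controls:
            break
        c = min(controls, key=lambda j: abs(ps[j] - ps[t]))
        if abs(ps[c] - ps[t]) <= caliper:
            pairs.append((t, c))
            controls.remove(c)  # each control is used at most once
    return pairs
```

A narrower caliper discards more treated patients without a close control, which is the trade-off explored with the three caliper values.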

Causal contrast of interest

The intention-to-treat analysis retained all matched or weighted patients in the group of their initial treatment allocation, regardless of their subsequent exposure, until either the last data entry or the study outcome. The per-protocol analysis retained all matched or weighted patients until the date of treatment discontinuation (or the date of last data entry, if it occurred earlier). Pairwise censoring was used as a censoring technique after matching. In each pair, the study follow-up of both patients was censored when the follow-up of one of the two patients was censored. This approach prevented imbalance due to differential duration of follow-up in the matched groups.
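The three censoring rules can be made concrete with a small Python sketch. Times are in days from baseline, the function names are hypothetical, and the pairwise rule is read as "both patients share the earlier of the two censoring times"; this is an illustrative assumption rather than the study's code.

```python
def follow_up_end(last_entry, treatment_stop=None, contrast="itt"):
    """Censoring time before the outcome is considered.

    'itt': last data entry; 'pp': treatment discontinuation if it is earlier.
    """
    if contrast == "pp" and treatment_stop is not None:
        return min(last_entry, treatment_stop)
    return last_entry


def pairwise_censor(censor_a, censor_b):
    """Both members of a matched pair share the shorter follow-up."""
    return min(censor_a, censor_b)


def observe(event_time, censor_time):
    """Return (time, event_observed) for one patient."""
    if event_time is not None and event_time <= censor_time:
        return event_time, True
    return censor_time, False


# Hypothetical pair: A stops treatment at day 700 and has an event at day 900;
# B stays on treatment, has no event, and is last seen at day 1200.
itt_a = observe(900, follow_up_end(1500, 700, "itt"))   # event is counted
pp_a = observe(900, follow_up_end(1500, 700, "pp"))     # censored at stop
shared = pairwise_censor(follow_up_end(1500, 700, "pp"),
                         follow_up_end(1200, None, "pp"))
pp_b_paired = observe(None, shared)                      # B censored with A
```

In this example the same event is counted under intention-to-treat but not under per-protocol, and pairwise censoring shortens patient B's follow-up from 1200 to 700 days.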

Sensitivity analysis without the positivity assumption

The primary analysis ensured that the positivity assumption was fulfilled by including only patients who commenced natalizumab or fingolimod after the more recent of the two therapies became available on 1st January 2011. In a sensitivity analysis, all patients who commenced a study therapy were included, irrespective of the commencement date. Therefore, patients who were considered ineligible in the primary analysis were included in this sensitivity analysis. Before 2011, patients with MS had no chance of receiving fingolimod and could only start natalizumab, which is why the positivity assumption was violated.

Statistical analysis

Characteristics of the patients included in the analyses, as well as of those excluded by the matching procedure, were described overall and by treatment group, before and after PS matching/weighting. Standardized mean differences (SMD) or Mahalanobis distances were computed, with 10% considered to be an acceptable difference [33]. Incidence of relapses was evaluated using a negative binomial model, with an offset term for follow-up duration. The cumulative hazards of first relapse, first EDSS improvement and first EDSS worsening were studied using Cox proportional hazards models with robust estimation of variance [34]. The models were either weighted by sIPTW or odds, or matched on PS. A cluster term (generalized estimating equations with negative binomial distribution) or a frailty term (Cox models) for the pair identifier was used. As the probability of disability worsening and improvement events is associated with the frequency of EDSS scores [35], models with time to disability outcomes were adjusted for annualized visit density. All analyses were conducted for both the intention-to-treat and the per-protocol causal contrasts. Analyses using matching were completed with and without pairwise censoring. Table 1 gives an overview of all the analytical approaches considered in the present work. The analyses were performed using R software (version 3.4.0).
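As an illustration of the balance diagnostic, the following Python sketch computes a standardized mean difference for one continuous covariate, optionally weighted. The function names are hypothetical and the simple weighted variance used here is an assumption; published weighted-SMD formulas differ in detail, and this is not the study's R code.

```python
from math import sqrt


def _mean_var(x, w):
    """Weighted mean and (population-style) weighted variance."""
    total = sum(w)
    m = sum(wi * xi for wi, xi in zip(w, x)) / total
    v = sum(wi * (xi - m) ** 2 for wi, xi in zip(w, x)) / total
    return m, v


def smd(x, treated, weights=None):
    """Standardized mean difference (treated minus control) for covariate x."""
    if weights is None:
        weights = [1.0] * len(x)
    x1 = [xi for xi, z in zip(x, treated) if z]
    w1 = [wi for wi, z in zip(weights, treated) if z]
    x0 = [xi for xi, z in zip(x, treated) if not z]
    w0 = [wi for wi, z in zip(weights, treated) if not z]
    m1, v1 = _mean_var(x1, w1)
    m0, v0 = _mean_var(x0, w0)
    pooled_sd = sqrt((v1 + v0) / 2)
    if pooled_sd == 0:
        return 0.0  # covariate is constant in both groups
    return (m1 - m0) / pooled_sd
```

Under the 10% criterion, a covariate would be considered acceptably balanced when the absolute value returned is at most 0.1.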

Table 1 Overview of the analytical approaches used in the present work according to the outcomes

Results

Patients’ characteristics

Overall, 5,148 patients were included in this study [10]; 1,989 (39%) were treated with natalizumab and 3,159 (61%) with fingolimod. Patients’ characteristics are described in Table 2 (overall median age at baseline: 37.7 years; median MS duration at baseline: 6.9 years). Most of the patients had clinically active disease and 70% had a baseline EDSS score equal to or greater than 2. Table 3 presents the median durations of follow-up (overall: 3.1 years (interquartile range (IQR): 2.0–4.5)). The median durations of natalizumab and fingolimod treatments were 2.0 (1.3–3.1) and 2.2 (1.2–3.6) years, respectively.

Table 2 Baseline characteristics of the overall study population, as well as the subgroups of patients unmatched and matched within different calipers
Table 3 Follow-up duration according to the outcomes of interest (in years)

Patients’ characteristics after propensity score balancing procedures (matching and weighting)

The distributions of PS showed a good overlap between the treatment groups, except in the tails (Fig. 1). The use of three caliper values for PS-matching led to three similar matched datasets (Table 2). The characteristics of the matched groups were comparable to the characteristics of the overall sample. The excluded patients tended to experience less disease activity. Table 4 presents patients’ characteristics by treatment group. Overall, 35% of patients treated with fingolimod had an EDSS score < 2 at treatment start while it was 22% in the group treated with natalizumab. The matching procedure improved the balance between the compared groups, except for the data source and the number of previous MS treatments.

Fig. 1

Distribution of propensity scores by treatment group (probability of being treated with natalizumab) 

Table 4 Characteristics at baseline according to treatment group in the overall population and when three matching calipers were used

Table 5 presents patients’ characteristics by treatment group after weighting on sIPTW or odds. The treatment groups were well balanced, with SMD or Mahalanobis distances around 10% for all patient characteristics, except for the number of previous MS treatments, as natalizumab tended to be prescribed as first treatment more frequently than fingolimod. Exposure following the study therapy is shown in Table S1.

Table 5 Characteristics at baseline by treatment group in the overall study sample, and cohorts weighted on sIPTW and odds

Comparison of effectiveness between natalizumab and fingolimod

Figure 2 summarises the results of all comparative analyses. While the estimated 95% confidence intervals of the estimated differences between natalizumab and fingolimod largely overlapped in all analyses, some variation in point estimates was observed.

Fig. 2

Estimated treatment effects for the 4 outcomes, 3 matching and 2 weighting strategies and 2 causal effects, with and without pairwise censoring in matched cohorts

With a few exceptions, the results of the analyses with matching and weighting led to the same conclusions, i.e., superiority of natalizumab (for relapse outcomes and EDSS improvement) or no evidence of difference (for EDSS worsening). Inconsistencies were observed mainly in the intention-to-treat frameworks, for relapse counts and first EDSS improvement. Weighting by the odds (ATT) tended to provide lower point estimates and similar margins of error of the relative effect compared to weighting by sIPTW (ATE). The value of the matching caliper did not influence the magnitude of the estimated differences.

Most of the variability in the estimates was linked to the causal contrast. The intention-to-treat paradigm led to less stable results, especially for the count of relapses and first EDSS improvement. For all outcomes except time to first EDSS worsening, the intention-to-treat analyses underestimated the differences between the therapies in comparison to per-protocol analyses with or without pairwise censoring. Per-protocol analyses and pairwise-censored analyses returned similar point estimates, even though the margin of error varied. In the pairwise-censored analyses, confidence intervals were narrower for relapse counts but wider for the disability outcomes compared to the per-protocol analysis.

Sensitivity analysis: positivity assumption

To test the effect of violating the positivity assumption, 7,118 patients were included irrespective of the date of their treatment start, of whom 3,726 were treated with natalizumab. The other baseline characteristics were similar to those of the main cohort (Table S3). The PS distribution was left-skewed in patients who commenced natalizumab before fingolimod became available (Figure S1). Using weighting, the comparison of the treatment effects on relapses was similar to the main analysis (Table 6). However, the point estimates for the difference in the treatment effects on EDSS worsening were substantially lower than in the primary analysis, although the confidence intervals overlapped. When matching was used, the estimates for EDSS outcomes were less influenced by the violation of the positivity assumption. Nevertheless, the estimates of the differences between treatment effects on relapses were substantially inflated when the assumption was violated, especially for the intention-to-treat causal effect.

Table 6 Comparison of treatment effect on relapses and disability violating the positivity assumption

Discussion

In this empirical study of a complex chronic neurological condition, with long-term follow-up data, several non-linear outcomes and a well-powered dataset, most of the methodological choices (PS matching/weighting, caliper values, weighting on IPTW vs. odds, and pairwise censoring) resulted in consistent overall conclusions, in accordance with two of the three original studies [5, 6], the pooled analysis [11] and a recent French head-to-head prospective study [36]. In a long-term longitudinal observational study with frequent changes of therapy, an intention-to-treat causal contrast tends to be associated with more variability in the observed effects than a per-protocol contrast. Importantly, violation of the positivity assumption had the most pronounced negative effect on the consistency of reported results.

Propensity score to control indication bias

Among the four methods using PS, matching and weighting have shown a superior performance to adjustment and stratification in achieving balance on baseline characteristics [37], reduction of bias and estimation of variance [38,39,40]. Therefore, we restricted our present work to PS matching and weighting. The results of the weighting and matching procedures were consistent, confirming that both methods performed well in sufficiently powered data sets and correctly specified models. The width of the matching caliper did not have much influence on the consistency of the results, confirming that 0.2 is a sufficiently conservative caliper, as previously reported [30]. The only detectable systematic variability was noted for the type of estimated effect, with the magnitude of the ATE effect trending towards higher values for relapse incidence and time to first relapse.

The matched study sample corresponds to an overlap between the fingolimod- and the natalizumab-treated target populations, with inclusion of comparable cases and exclusion of cases outside the common distribution of the PS (ATT effect of interest). Such reductions in sample size may restrict the study to a very specific sub-population and thus affect the precision and the generalizability of the results [41]. An IPTW-weighted sample is closer to the entire study population, especially where the ATE is the effect of interest. It is therefore not surprising, given that the use of natalizumab and fingolimod in MS differs in clinical settings, that we have observed differences in the point estimates obtained with the matched and weighted analyses. Weighting could potentially be subject to influential cases with extreme weights, which are excluded from matching, as they fall outside of the central portion of the PS distribution [42]. In this work, we used stabilized weights to mitigate the risk of influential cases, as an alternative to weight trimming or truncation [33].

Management of censoring

In the present study, most irregularities were related to the intention-to-treat causal contrast, which resulted in less stable and often deflated estimates compared with the per-protocol analysis. These fluctuations were more pronounced for the outcomes defined as counts of events and time to medium-term events (first disability worsening or improvement) than for time to short-term events (first relapse). The intention-to-treat analysis evaluates the association with the outcome irrespective of treatment status over time, and addresses the question of the effect of the treatment decision, irrespective of further persistence on the assigned therapy. Therefore, such an approach leads to conservative estimates, which explains the observed overall deflation of effect sizes in comparison to the per-protocol approach and the minimal impact on short-term outcomes.

On the other hand, patients and neurologists may be more interested in a per-protocol effect, which estimates the effect of an intervention while it is adhered to. However, a per-protocol treatment effect can be inflated by attrition bias and informative censoring, especially when one of the compared interventions is perceived a priori as being more effective [43]. This would lead to the selection of “treatment responders”, because patients who respond well to treatment are more likely to remain treated than non-responders [44]. In addition, the per-protocol requirement of adherence to treatment may introduce additional selection bias, which may limit the generalizability of conclusions [45], whereas the intention-to-treat approach preserves the balance established at baseline. A pairwise-censoring procedure can be combined with either causal contrast. Its purpose is to sustain the balance between the matched cohorts even when censoring or treatment cessation is systematically different between the compared groups. This sustained balance is achieved at the expense of losing part of the study follow-up due to right-censoring of the paired cases. However, in the present empirical analysis, per-protocol and pairwise-censored analyses led to similar conclusions and point estimates. The observed increase in the margin of error in the pairwise-censored analysis suggests some loss of power. Marginal structural models with IPTWs accounting for the probability of censoring may provide a more efficient solution, as they do not lead to loss of follow-up information [46,47,48].

Positivity assumption

The positivity assumption can be objectively assessed in several steps. First, the study timeline and area should be defined such that both treatments are available to all included patients. Second, the common support of the PS distribution in the two groups needs to be established [31]. In our main analysis, these two steps confirmed that the positivity assumption was met. To examine the importance of the positivity assumption, in a separate analysis, we allowed inclusion of patients before one of the studied therapies (fingolimod) became available. This included more natalizumab-treated patients from a time period when the probability of exposure to fingolimod was zero. The results of this analysis showed the most pronounced variability and the largest deviation from the primary analysis. Therefore, in a sufficiently powered longitudinal dataset, non-zero probability of exposure to both compared therapies at all baseline time-points is the most important of the methodological considerations explored in this study.
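The two assessment steps can be sketched in Python as follows. The function names are hypothetical, and defining common support as the overlap of the two groups' PS ranges is one common heuristic assumed here for illustration; the study does not specify how common support was quantified. The availability date of 1st January 2011 is taken from the primary analysis.

```python
from datetime import date


def both_therapies_available(treatment_start, available_from=date(2011, 1, 1)):
    """Step 1: the patient started when both therapies could be prescribed."""
    return treatment_start >= available_from


def common_support(ps, treated):
    """Step 2: overlap of the PS ranges of the two groups, as (low, high).

    Assumption: range overlap is used as a simple proxy for common support.
    """
    t = [p for p, z in zip(ps, treated) if z]
    c = [p for p, z in zip(ps, treated) if not z]
    return max(min(t), min(c)), min(max(t), max(c))


def inside_support(p, support):
    """True if a patient's PS lies within the common-support interval."""
    low, high = support
    return low <= p <= high
```

Patients who commenced natalizumab before 2011 fail the first check by construction, which corresponds to the violation examined in the sensitivity analysis.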

Limitations

Through consistency and exchangeability assumptions, it is assumed that there were no unmeasured confounders. Nevertheless, our study was limited by incomplete MRI data, while MRI activity is a known prognostic factor in MS [49]. Reassuringly, two of our three previous studies that accounted for MRI at treatment start showed results consistent with our primary analysis [5, 6].

In addition, heterogeneity of data in multisite registries (with potential differences in therapeutic practices, health care systems and treatment access) may increase variance of the associations between treatments and outcomes [50]. On the other hand, heterogeneity that is representative of clinical use of the compared therapies extends generalizability of the results. We have mitigated the potential heterogeneity in the present dataset by including country as a random term in the PS modeling.

Finally, this study did not attempt to compare the efficiency and robustness of different analytical methods, as this can be done only with simulation studies. Instead, we have focused on the evaluation of practical methodological questions in the context of a specific clinical choice.

Conclusion

This empirical study provides practical insights into the effects of several methodological choices on the estimates of the difference between two therapies in the context of a chronic neurological disease, in a sufficiently powered analysis with correctly specified models. Our results lead us to conclude that methodological considerations such as PS matching/weighting and their specifications, causal contrast and management of censoring have a negligible effect on the overall analyses, provided that the model assumptions are met. The choice between ATT and ATE as the preferred estimand should be driven by the clinical question of interest. In our clinical example, when both treatments can be prescribed to patients with relapsing–remitting MS following similar rules, there is no apparent reason to restrict the analysis to the natalizumab- or the fingolimod-treated patients, and ATE may be the preferred estimand.

A recent review highlighted the good practice in the use and reporting of PS in MS [41]. While methodological choices in observational studies remain challenging, our present work illustrates the priorities for methodological aspects of PS-based analyses of comparative treatment effectiveness in large registries.