Key Summary Points

A matching-adjusted indirect comparison (MAIC), a type of indirect treatment comparison method, may be used to compare the efficacy of different therapies when direct head-to-head comparisons do not exist.

The quality of MAICs should be carefully evaluated according to best practices, various sources of bias should be identified, and the results of a MAIC should be interpreted in the context of potential biases present.

The quality of the conduct and reporting varied greatly in the three identified MAIC publications in spinal muscular atrophy (SMA).

Findings from a MAIC can be misleading because of cross-trial differences in inclusion/exclusion criteria, baseline characteristics, definitions and assessment schedules of outcomes, and key baseline confounders not balanced after weighting, especially in the context of SMA.

Introduction

In the absence of randomized head-to-head trials directly comparing treatments, indirect treatment comparisons (ITCs) are increasingly used to understand the comparative efficacy of different treatments evaluated in separate trials [1, 2]. Comparing treatments originally evaluated in separate trials can be challenging because of differences in study design, characteristics of the trial populations, and outcome definitions and assessments.

ITC methods may be used when individual patient data (IPD) are available from one trial but only aggregate data (i.e., summary-level data such as means and proportions) are available from another trial. Some commonly used ITC methods include matching-adjusted indirect comparison (MAIC) [1,2,3] and simulated treatment comparison (STC) [4, 5]. Although there are advantages and disadvantages associated with each method [6], MAIC may be preferred over STC when working with time-to-event or other non-linear outcomes because of the bias incurred with STC when using non-linear regression models [7].

MAIC is a statistical method that attempts to account for cross-trial differences by applying a form of propensity score weighting to balance baseline covariate distributions across trial populations in an ITC [1, 3]. In brief, this method involves applying the inclusion/exclusion criteria and outcome definitions used in the comparator trial with aggregate data to the other trial with IPD. Individuals in the IPD population are then given a weight that reflects how likely they were to appear in the trial with aggregate data. The goal is that the weighted mean baseline characteristics of patients in the trial with IPD match the baseline characteristics reported for the trial with aggregate data. These steps can be implemented in either an anchored or unanchored MAIC. An anchored MAIC is an indirect comparison of treatments from two trials that have a connected network (i.e., share a common comparator such as a placebo arm), whereas an unanchored MAIC is one in which there is no connected network (i.e., no common comparator, as in single-arm studies). An anchored MAIC is preferred because it respects randomization within studies to remove confounding bias and enables researchers to detect cross-trial differences between the common control arms that indicate residual bias after MAIC weighting [6].
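The weighting step described above can be illustrated with a minimal sketch. It assumes the commonly used method-of-moments formulation, in which each patient's weight is the exponential of a linear combination of covariates centered on the aggregate-data means, and the coefficients are chosen so that the weighted IPD means equal the reported means. The function names (`solve_linear`, `maic_weights`), the two covariates, and all patient values are hypothetical and are not taken from any of the trials discussed in this article.

```python
import math


def solve_linear(a, b):
    """Solve the small linear system a @ x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def maic_weights(ipd, target_means, tol=1e-10, max_iter=50):
    """Return one weight per IPD patient so weighted means match target_means."""
    p = len(target_means)
    # Center each covariate on the mean reported by the aggregate-data trial.
    x = [[row[j] - target_means[j] for j in range(p)] for row in ipd]

    def weights(alpha):
        return [math.exp(sum(a * v for a, v in zip(alpha, row))) for row in x]

    alpha = [0.0] * p
    for _ in range(max_iter):
        w = weights(alpha)
        # Gradient of sum(w); it is zero exactly when the weighted means match.
        grad = [sum(wi * row[j] for wi, row in zip(w, x)) for j in range(p)]
        if max(abs(g) for g in grad) < tol:
            break
        hess = [[sum(wi * row[j] * row[k] for wi, row in zip(w, x))
                 for k in range(p)] for j in range(p)]
        step = solve_linear(hess, grad)
        # Backtracking: halve the Newton step until the objective does not increase.
        scale, current = 1.0, sum(w)
        for _ in range(30):
            trial = [a - scale * s for a, s in zip(alpha, step)]
            if sum(weights(trial)) <= current:
                break
            scale /= 2.0
        alpha = trial
    return weights(alpha)


# Hypothetical IPD rows: [age at first dose (months), baseline motor score].
ipd = [[2.0, 20.0], [3.0, 25.0], [4.0, 30.0], [5.0, 22.0],
       [6.0, 28.0], [3.5, 35.0], [4.5, 18.0], [5.5, 32.0]]
target = [3.8, 28.0]  # hypothetical means reported by the comparator trial
w = maic_weights(ipd, target)
weighted_means = [sum(wi * row[j] for wi, row in zip(w, ipd)) / sum(w)
                  for j in range(len(target))]
```

In this sketch the weights can only be found when the reported means lie within the range spanned by the IPD; when they do not, no set of positive weights can match the comparator population, which is one way that a lack of population overlap shows up in practice.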

Given the increasing popularity of MAIC for comparative efficacy research, it is crucial to identify best practices and understand limitations of this methodology in practice. The quality of a MAIC analysis and reporting can be variable, especially in the context of rare disease with small patient populations, high levels of patient heterogeneity across trials, and frequent use of single-arm designs. MAIC has recently been used to compare the efficacy of disease-modifying treatments in spinal muscular atrophy (SMA), a rare degenerative neuromuscular disease characterized by progressive muscle atrophy and weakness in which key baseline characteristics (such as disease duration or motor function status) can be highly predictive of treatment response [8]. SMA is a clinically heterogeneous disease, often classified as infantile-onset (type I) and later-onset (type II and III) SMA, based on age at symptom onset and severity of symptoms [9]. Three therapies are currently approved by the US Food and Drug Administration and the European Medicines Agency for the treatment of SMA: nusinersen (intrathecally administered antisense oligonucleotide for children and adults) [10, 11], onasemnogene abeparvovec (intravenously administered gene therapy for pediatric patients) [12, 13], and risdiplam (orally administered survival of motor neuron 2 (SMN2) splicing modifier for children and adults) [14, 15].

A critical appraisal of publications using MAIC to compare therapies for SMA has not previously been done. This article aims to evaluate the conduct and reporting of previously published studies using MAIC to compare treatments in SMA based on published best practices using a newly developed, consolidated checklist.

Methods

To identify published guidelines on MAIC best practices, a literature search was conducted in both PubMed and Embase from inception until April 8, 2022, using a combination of the following search terms: indirect treatment comparison, matching-adjusted indirect treatment comparison, best practices, educating, consensus, guidelines, or standards (see Table S1 Supplementary Material). Given the rarity of SMA, it was assumed that a search of these two databases was sufficient to ensure that all relevant studies were captured. Results were limited to English publications only. A total of 138 records were identified, of which 22 remained after title and abstract review. Following full-text review, a total of nine publications were retained.

Case studies of MAICs in SMA were identified through a second literature search using the search terms of indirect treatment comparison and spinal muscular atrophy in PubMed and Embase from inception until April 8, 2022 (see Table S2 Supplementary Material). All types of SMA were included in the literature search for completeness. When both a full publication and conference abstract based on the same analysis were identified, only the full publication was retained. Results were limited to English publications only. We identified 268 records of which 265 were not relevant to the topic or, in cases of conference proceedings, the full text publication was available. A total of three records remained after review.

This article is based on previously conducted studies and does not contain any new studies with human participants or animals performed by any of the authors.

Results

Literature Searches

The literature search of guidelines in MAIC identified a total of nine full-text publications, covering both standards of analysis and reporting, and included recommendations from the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) and the National Institute for Health and Care Excellence (NICE) [1,2,3,4, 6, 16,17,18,19]. We identified themes repeated across these nine publications that were highlighted as critical for the proper implementation and reporting of MAICs (Table S3 Supplementary Material). These were then consolidated into a checklist to inform our critiques on the conduct and reporting of MAICs in SMA: (1) justification for use of MAIC is clearly stated, (2) the included trials with respect to study population and design are comparable, (3) all known confounders and potential effect modifiers are adjusted for, (4) outcomes should be similar in definition and assessment, (5) baseline characteristics before and after adjustment are reported, along with weights, and (6) key details are reported.

The second search to identify MAICs in SMA yielded two full-text and one poster publication of three separate MAIC analyses [20,21,22]. Table 1 summarizes the MAICs including the treatments compared, trials included, whether an anchored/unanchored MAIC was used, and outcomes assessed in the MAIC. Table 2 summarizes the clinical trials compared in the MAICs with respect to their study design, treatment groups, sample size, and key inclusion/exclusion criteria. In a MAIC among the infantile-onset SMA population by Liao et al., IPD were from the randomized, sham-procedure controlled ENDEAR/SHINE trial (nusinersen) and aggregated data were from the single-arm STR1VE-US trial (onasemnogene abeparvovec) [20]. In a separate MAIC analysis among the infantile-onset population by Bischof et al., pooled IPD were taken from the STR1VE-US and START trials (onasemnogene abeparvovec) and aggregated data were taken from ENDEAR/SHINE (nusinersen) [21]. The third publication by Ribero et al. included both patient populations with infantile-onset and later-onset SMA [22]. For the infantile-onset SMA population, IPD from FIREFISH (risdiplam) were compared to aggregate data from ENDEAR (nusinersen); in the later-onset SMA population, IPD from SUNFISH Part 2 (risdiplam) were compared to aggregate data from CHERISH (nusinersen). Given that STR1VE-US, START, and FIREFISH did not have a comparator group, all MAICs for infantile-onset SMA were unanchored. For later-onset SMA, Ribero et al. were able to conduct an anchored MAIC because there was a common comparator group in SUNFISH Part 2 (placebo) and CHERISH (sham-procedure) [22].

Table 1 Summary of MAICs identified in SMA
Table 2 Summary of clinical trials compared in the SMA MAICs

Review of MAICs in SMA

A critical review of the identified MAICs in SMA was performed on the basis of six key items for assessing the conduct and reporting of MAICs that were consolidated from nine publications providing recommendations on best practices in indirect treatment comparisons such as MAIC.

Justification for Use of MAIC is Clearly Stated

Prior to conducting any analysis, the rationale for using MAIC, versus other methods of indirect comparison, should be provided. MAIC may be the optimal approach when there is a disconnected network, a single comparator group with many outcomes to be compared, and a non-linear outcome. In instances where a MAIC is chosen because IPD are available for one trial and aggregate-level data for another, it should be noted that this rationale alone does not ensure that the MAIC is feasible and valid. The evaluation of MAICs should consider whether all six items on the best practices checklist were followed and the potential biases that may result from deviations.

  • Case study in SMA: The three identified MAIC publications in SMA provided justification for the choice of using MAIC (Table 1). Unanchored MAIC methodology was used in all MAICs for infantile-onset SMA trials because of the lack of a connected network.

The Included Trials with Respect to Study Population and Design Are Comparable

When selecting trials to include in a MAIC, assessing the comparability of trials is important. Although there is no quantitative way of testing for similarity between trials [18], trials may be considered comparable if they are similar in terms of their inclusion/exclusion criteria, baseline characteristics, standard of care in common comparator arms, background treatments, and temporal setting, among other factors. To increase comparability of trials, the inclusion/exclusion criteria of the comparator trial with aggregate data can be applied to the trial with IPD. This can be done by excluding patients in the trial with IPD who could not have enrolled in the comparator trial with aggregate data as a result of the comparator trial's inclusion/exclusion criteria. For this to work, the inclusion/exclusion criteria in the trial with IPD should be equally or more inclusive than those of the trial providing aggregate data [2]. If the trial with IPD has more restrictive inclusion/exclusion criteria than the comparator trial with aggregate data, then it may not be possible to address differences in study populations, which may lead to biased comparisons. Further, the variables available in each trial, along with their distributions, should be presented. For baseline characteristics to be considered comparable, there should be overlap in the minimum and maximum values of a variable across trials. If there is limited/no overlap in the baseline characteristics of potential effect modifiers and confounders in comparisons across trial populations, then a MAIC may not be feasible.
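The restriction and overlap checks described above can be sketched in a few lines. This is an illustration under stated assumptions rather than a prescribed procedure: the helper names (`apply_comparator_criteria`, `ranges_overlap`), the patient records, the eligibility rules, and the reported ranges are all hypothetical.

```python
def apply_comparator_criteria(ipd, criteria):
    """Keep only IPD patients who satisfy every comparator-trial criterion."""
    return [patient for patient in ipd if all(rule(patient) for rule in criteria)]


def ranges_overlap(ipd, variable, comparator_min, comparator_max):
    """Check whether the IPD range of a variable overlaps the comparator's range."""
    values = [patient[variable] for patient in ipd]
    return min(values) <= comparator_max and max(values) >= comparator_min


# Hypothetical IPD records.
patients = [
    {"id": 1, "age_months": 3.0, "hypoxemia": False},
    {"id": 2, "age_months": 7.5, "hypoxemia": False},
    {"id": 3, "age_months": 5.0, "hypoxemia": True},
    {"id": 4, "age_months": 4.2, "hypoxemia": False},
]

# Hypothetical comparator-trial criteria expressed as predicates.
criteria = [
    lambda p: p["age_months"] < 6.0,  # upper age limit at first dose
    lambda p: not p["hypoxemia"],     # exclusion for hypoxemia at screening
]

eligible = apply_comparator_criteria(patients, criteria)
eligible_ids = [p["id"] for p in eligible]

# Compare the restricted IPD against hypothetical reported ranges.
overlap_ok = ranges_overlap(eligible, "age_months", 0.5, 5.9)
overlap_missing = ranges_overlap(eligible, "age_months", 6.0, 12.0)
```

Filtering of this kind can only narrow the IPD population. It cannot add back patients whom the IPD trial itself excluded, which is why the IPD trial needs criteria at least as inclusive as those of the comparator trial.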

  • Case study in SMA: There are important differences in the inclusion/exclusion criteria and baseline characteristics across trials in SMA to consider. For example, there is a lack of comparability in the exclusion criteria between trials regarding pulmonary events and pulmonary function, which are key factors influencing main outcomes of interest in SMA trials (e.g., permanent ventilation-free survival and overall survival in infantile-onset SMA). These differences are particularly notable for FIREFISH and ENDEAR, where greater exclusions were made in FIREFISH (e.g., excluded patients with hospitalization for pulmonary event within the last 2 months; with invasive ventilation or tracheostomy; requiring non-invasive ventilation or hypoxemia with or without ventilator support; and history of respiratory failure or severe pneumonia and had not fully recovered their pulmonary function at time of screening) than in ENDEAR (i.e., excluded patients with hypoxemia at screening) (Table 2). Incomparable exclusion criteria used in FIREFISH and ENDEAR may have enriched for a population in FIREFISH with less pulmonary burden compared to patients in ENDEAR that cannot be resolved through MAIC weighting, thus hindering a valid comparison of risdiplam and nusinersen, especially for the outcomes of permanent ventilation-free survival and overall survival [22].

All Known Confounders and Effect Modifiers Are Identified A Priori and Accounted for in the Analysis

In unanchored MAICs, where the evidence is disconnected because of the lack of a common comparator, both confounders and effect modifiers need to be accounted for in the MAIC weights. In anchored MAICs using randomized trials where the evidence is connected by a common comparator, only effect modifiers need to be accounted for in the MAIC weights (as there is expected to be no confounding due to randomization). Effect modifiers impact the generalizability of the treatment effects to the target population and therefore need to be balanced across trials. All potential confounders and effect modifiers need to be pre-specified, clinically plausible, measured, and defined similarly across trials [2, 6]. Evidence and assessment for effect modifier status should be provided. Not including key confounders and effect modifiers in MAIC weighting precludes the ability to fully account for cross-trial differences and therefore increases the possibility of residual confounding and lack of generalizability. The reporting of the analyses should describe how potential confounders and effect modifiers were identified a priori, and whether these variables were available in the studies being compared. MAICs using small trials in rare disease may be limited by the number of variables that can be included in the weighting model. In this situation, it may be preferable to prioritize including as many of the most important confounders and effect modifiers as possible. In the case where key variables were not available, an assessment of the potential biasing impact due to the lack of adjustment for key variables should be given.
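The distinction between anchored and unanchored comparisons can be made concrete with a short arithmetic sketch on the log-odds scale. It assumes a binary outcome, with treatment B and common comparator A in the weighted IPD trial and treatment C and comparator A in the aggregate-data trial; the labels A, B, and C, the function names, and the response proportions are hypothetical.

```python
import math


def log_odds(p):
    """Log odds of a response proportion p (0 < p < 1)."""
    return math.log(p / (1.0 - p))


def unanchored_log_or(p_b_weighted, p_c):
    """B vs C with no common comparator: arms are contrasted directly."""
    return log_odds(p_b_weighted) - log_odds(p_c)


def anchored_log_or(p_b_weighted, p_a_weighted, p_c, p_a_aggregate):
    """B vs C through comparator A: (B vs A, weighted IPD) minus (C vs A, aggregate)."""
    effect_b = log_odds(p_b_weighted) - log_odds(p_a_weighted)
    effect_c = log_odds(p_c) - log_odds(p_a_aggregate)
    return effect_b - effect_c


def control_arm_gap(p_a_weighted, p_a_aggregate):
    """Difference between common control arms after weighting (ideally near 0)."""
    return log_odds(p_a_weighted) - log_odds(p_a_aggregate)


# Hypothetical response proportions.
p_b, p_a_ipd = 0.60, 0.30  # weighted arms of the IPD trial
p_c, p_a_agg = 0.50, 0.20  # reported arms of the aggregate-data trial

unanchored = unanchored_log_or(p_b, p_c)
anchored = anchored_log_or(p_b, p_a_ipd, p_c, p_a_agg)
gap = control_arm_gap(p_a_ipd, p_a_agg)
```

In this sketch the anchored estimate equals the unanchored estimate minus the control-arm gap. An unanchored analysis has no control arms from which to estimate that gap, so it relies entirely on the weights having balanced both confounders and effect modifiers.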

  • Case study in SMA: Table 3 summarizes the baseline characteristics, along with factors known to impact treatment outcomes in SMA [8], for each MAIC identified in the review. Liao et al. included six confounders and effect modifiers in MAIC weighting while Bischof et al. used two and Ribero et al. used three in each of their two MAICs. While Liao et al. included the most comprehensive list of variables with similar definitions in the analysis, ventilatory and nutritional support were not included as weighting factors because of different definitions between trials (see Table S4 Supplementary Material) [20]. Notably, Bischof et al. did not include age at first dose or age at symptom onset, which are the strongest predictors of treatment response in SMA [23], as weighting factors; this omission may be an important source of bias. Although Bischof et al. weighted on nutritional support as defined by feeding tube, there may still be residual confounding as this may not have captured the full extent of baseline differences in swallowing and feeding difficulties across the trial populations (see Table S4 Supplementary Material). In the later-onset SMA MAIC by Ribero et al., known effect modifiers of SMA treatment such as age at symptom onset or disease duration at baseline were not included despite their availability in the data, which may lead to biased comparisons.

Table 3 Baseline characteristics used for weighting in MAIC analyses in SMA

Outcomes Should Be Similar in Definition and Assessment

The determination of which outcomes of interest to compare should be justified and may be based on key primary and secondary outcomes evaluated in the trials [19]. All included outcomes should be comparable and measured consistently across trials including their definition, schedule of assessment, statistical analysis method, length of follow-up, and potential loss to follow-up [2]. When outcome definitions and timing of assessments are not comparable, it is recommended not to make comparisons across trials [6]. It is important to consider both the direction and magnitude of the potential bias due to differences in outcome definitions and assessments on results.

  • Case study in SMA: Table 4 highlights key differences in outcome definitions and assessments across the trials included in the MAICs. In infantile-onset SMA, overall survival was defined similarly with comparable assessment schedules across trials. However, permanent ventilation was defined differently across studies with respect to duration required (Table 4), which may also impact the outcome of event-free survival. In addition, motor milestone outcomes assessed in MAICs were not consistently defined or were assessed at different times across trials. For example, START/STR1VE did not report a 24-month timepoint for walking independently or sitting unassisted. To make a comparison with the 24-month timepoint, Bischof et al. carried the 18-month results of STR1VE forward, which may be inappropriate as a greater number of patients could have achieved motor milestones if there had been longer follow-up. Although this difference in outcome assessment between ENDEAR/SHINE and START/STR1VE may have led to underestimation of the proportion of patients who achieved motor milestones in START/STR1VE, cross-trial differences in baseline characteristics and poor confounding control in the MAIC conducted by Bischof et al. may have potentially led to overestimation of treatment effects. The resulting net bias from all possible sources of bias remains unclear. In another example, motor milestone outcomes were assessed at 12 months in FIREFISH whereas ENDEAR ended early with an average length of 9 months of follow-up based on a positive benefit–risk assessment of a prespecified interim analysis. However, Ribero et al. did not use follow-up data from the extension study SHINE, biasing the observed results. Differences in the timing of assessment in SMA can impact the validity of a MAIC analysis as the achievement of motor milestones, such as sitting unassisted, is time dependent.

Table 4 Outcome definitions and assessments in clinical trials used in SMA MAICs

Baseline Characteristics Before and After Adjustment Are Reported, Along with Weights

MAIC weighting is similar to inverse propensity score weighting and involves assigning weights to patients in the trial with IPD that correspond to their odds of being enrolled in the comparator trial with aggregate data as compared to the trial with IPD [6]. MAIC uses inverse propensity score weighting to form weighted mean estimators of the expected mean outcomes of the treatments of interest, where the propensity scores are found using a method of moments [3]. After weighting on baseline confounders and effect modifiers, trial populations should be balanced such that the weighted means of the baseline characteristics in the trial with IPD match the baseline characteristics reported in the trial with aggregate data [2]. In addition, after weighting, the distribution of the weights should be reported to assess population overlap and to identify any overly influential individuals. When the trial populations are similar to begin with, each patient in the IPD trial would get a weight close to 1. Extreme weights indicate that the two populations are highly imbalanced across one or more baseline characteristics [2]. Thus, population characteristics before and after weighting, including means as well as standard deviations and/or ranges, should be reported to understand how well the populations are balanced. The distributions of other key prognostic factors and effect modifiers that were not included in the weighting model should also be reported to understand the extent of imbalance in these variables between the weighted trial with individual patient data and the trial with aggregate data. When calculating an estimate in a weighted sample, the effective sample size (ESS) reflects the number of independent non-weighted individuals that would be required to give an estimate with the same precision as the weighted sample estimate. While assessment of the sufficiency of an ESS is subjective, a small ESS may indicate widely imbalanced variables or little overlap between baseline characteristics, and can lead to low statistical power to detect differences between treatments [6]. Extreme weights, along with a small ESS, are indicative of possible lack of population overlap and decreased precision with corresponding increased uncertainty in the effect estimates.
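As a brief illustration of the ESS, the sketch below uses the usual approximation, the squared sum of the weights divided by the sum of the squared weights. The function names and both sets of weights are hypothetical; rescaling the weights to average 1 is an optional convenience that makes them easier to compare against the "weight close to 1" benchmark mentioned above.

```python
def effective_sample_size(weights):
    """Approximate ESS: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def rescaled_weights(weights):
    """Rescale weights so that they average 1 (the ESS is unchanged)."""
    mean_weight = sum(weights) / len(weights)
    return [w / mean_weight for w in weights]


# Hypothetical weights for ten patients from a well-overlapping population.
balanced = [1.0] * 10
# Hypothetical weights where one patient dominates the weighted sample.
extreme = [5.0, 1.0, 1.0, 1.0]

ess_balanced = effective_sample_size(balanced)
ess_extreme = effective_sample_size(extreme)
scaled_extreme = rescaled_weights(extreme)
```

With equal weights the ESS equals the number of patients. In the second example a single large weight reduces four patients to an ESS of roughly 2.3, which shows how an overly influential individual erodes precision.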

  • Case study in infantile-onset SMA: Table 5 summarizes the baseline covariates before and after weighting, as reported in the publications of MAICs for infantile-onset SMA. Liao et al. restricted the populations by using a subpopulation of 48 patients from ENDEAR/SHINE that met the key inclusion/exclusion criteria of STR1VE-US for age at first treatment (< 6 months); these 48 patients were all included in the final weighted population [20]. Liao et al. reported the pre- and post-weighting distributions of all six variables included in the weighting model. In addition, the distributions of important baseline variables (e.g., ventilatory and nutritional support) that were not used in weighting because of differences in their definitions across trials were also reported to assess whether the patient populations were likely balanced with respect to these additional variables after MAIC weighting. In contrast, Bischof et al. reported the pre- and post-weighting values of only the two covariates used to calculate weights, thus making it difficult to assess whether weighting achieved balanced trial populations in other important prognostic factors and effect modifiers. Despite using an unanchored MAIC, Ribero et al. weighted the baseline characteristics of the pooled FIREFISH data to both arms of ENDEAR and not just those who received nusinersen. Notably, although the inclusion/exclusion criteria of FIREFISH may have enriched a population with less pulmonary burden, the percentage of patients with ventilatory support is reported to be higher in FIREFISH than ENDEAR. This is most likely due to the different uses/purposes of pulmonary support at baseline across trials; of the infants in pooled FIREFISH receiving ventilatory or pulmonary care, over 88% were receiving it prophylactically instead of receiving it because of breathing problems that necessitated ventilatory support [24, 25]. Before weighting, the pooled FIREFISH sample had a greater mean age at first dose, higher proportion of female patients, higher mean CHOP-INTEND score, and lower proportion of patients with ventilatory support than the nusinersen arm of ENDEAR, as reflected in the before weighting section of Table 5. Ribero et al. presented the distribution of the weights following balancing of the population in the supplemental materials; these were, however, skewed towards low values suggesting lack of trial population overlap.

  • Case study in later-onset SMA: Table 6 summarizes the baseline covariates before and after weighting, as reported in the publications of MAICs for later-onset SMA. While Ribero et al. excluded patients who would not have been enrolled in CHERISH when creating the SUNFISH Part 2 subset, differences remained post-weighting in key variables, including sex, age at symptom onset, and disease duration [22]. Post-weighting, the placebo arm in SUNFISH had better HFMSE outcomes than the sham arm in CHERISH. Since the two trial populations were not comparable, inferences on relative efficacy on HFMSE endpoints could not be drawn [22]. Moreover, there was limited ability to make valid statistical inferences given small sample sizes. For instance, the reported 95% confidence interval (95% CI) for the odds ratio for the relative efficacy of risdiplam vs. nusinersen for RULM responders ranged from 0 to 117.94 [22]. When comparing risdiplam with nusinersen for the incidence of any serious adverse event, the reported 95% CI for the odds ratio was 0.88 to 37.6 million [22]. These examples of highly imprecise results further underscore the fundamental challenge of conducting MAICs in rare disease.

Table 5 Summary of baseline covariates before and after weighting, as reported in published MAICs of infantile-onset SMA
Table 6 Summary of baseline covariates before and after weighting, as reported in the published MAIC of later-onset SMA [22]

Key Details of a MAIC Should Be Reported

Finally, key details should be reported to improve the transparency of the conduct of a MAIC analysis. For example, key details include how standard errors were calculated to provide measures of uncertainty alongside effect estimates, and pre- and post-weighting results to convey the impact of adjustment on effect estimates [6]. Of note, reporting unweighted and weighted effect estimates alongside one another can highlight the degree of confounding present, especially in unanchored settings (e.g., comparison of single-arm trials).
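To illustrate reporting unweighted and weighted estimates side by side with a measure of uncertainty, the sketch below computes a weighted response proportion and a linearization (sandwich-type) standard error. This is one of several possible approaches and treats the estimated weights as fixed, which is a simplifying assumption; bootstrapping the whole weighting procedure is a common alternative. The function name, outcomes, and weights are hypothetical.

```python
import math


def weighted_mean_with_se(outcomes, weights):
    """Weighted mean and a sandwich-type standard error (weights treated as fixed)."""
    total = sum(weights)
    mean = sum(w * y for w, y in zip(weights, outcomes)) / total
    variance = sum((w * (y - mean)) ** 2
                   for w, y in zip(weights, outcomes)) / (total * total)
    return mean, math.sqrt(variance)


# Hypothetical binary outcomes (1 = milestone achieved) for six IPD patients.
outcomes = [1, 0, 1, 1, 0, 1]
# Hypothetical MAIC weights that up-weight the two non-responders.
maic_w = [0.2, 2.0, 0.2, 0.2, 2.0, 0.2]

unweighted_est, unweighted_se = weighted_mean_with_se(outcomes, [1.0] * len(outcomes))
weighted_est, weighted_se = weighted_mean_with_se(outcomes, maic_w)

# Report both estimates together so the impact of adjustment is visible.
report = {
    "unweighted": (round(unweighted_est, 3), round(unweighted_se, 3)),
    "weighted": (round(weighted_est, 3), round(weighted_se, 3)),
}
```

In this hypothetical example the response proportion falls from about 0.67 to about 0.17 after weighting. A shift of that size would signal that the unadjusted comparison was strongly influenced by differences in baseline characteristics.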

  • Case study in SMA: Table 7 summarizes the critical appraisal of published MAICs according to the checklist and describes whether these key details were reported in the MAICs in SMA.

Table 7 Critical appraisal of published MAICs in SMA according to consolidated checklist

Discussion

To make valid inferences regarding the comparative efficacy of treatments evaluated in separate trials using MAIC methodology, it is paramount to follow best practices. Although MAICs can be a helpful tool to increase the comparability of different trials, they may lead to dubious results if conducted when key assumptions are violated and best practices are not followed. The current paper summarizes guidelines on MAIC best practices and critically evaluates the conduct and reporting of three MAICs in SMA using the consolidated checklist. As highlighted in this paper, findings from a MAIC can be misleading as a result of cross-trial differences in inclusion/exclusion criteria, baseline characteristics, definitions and assessment schedules of outcomes, and key baseline confounders and effect modifiers not balanced after weighting. Results of a MAIC should be interpreted in the context of potential biases present.

In the applied examples of the MAICs conducted in SMA, we found important differences between included trials that may decrease the validity of existing indirect treatment comparisons. Across SMA trials, different inclusion/exclusion criteria were used with respect to age and pulmonary events and function, and key baseline characteristics, such as age at first dose, motor function, and ventilatory and nutritional support, differed. Varied definitions and assessments of key SMA outcomes, including permanent ventilation and motor function, were also noted. Two of the three identified MAICs were unable to adequately account for differences in baseline covariates, and included only two or three variables in the weights, thus leaving open a large possibility of residual confounding. This is problematic because differences in baseline characteristics, even if seemingly small, such as age at treatment initiation [23], disease duration and baseline ventilatory support [26], can have important effects on key SMA outcomes. As observed in Fig. 1, weighting on a more comprehensive set of variables (as per Liao et al., Table 3) versus an unweighted analysis (which was based on a subpopulation restricted on age only to match the inclusion/exclusion criteria of the comparator trial) resulted in a large difference in the probability of event-free survival, moving the hazard ratio from < 1 to > 1.

Fig. 1 Example of reporting weighted and unweighted analyses [20]. Weighted analysis considered the factors highlighted in Table 3. Unweighted analysis was conducted in a subpopulation of ENDEAR/SHINE created on the basis of an age restriction to match the inclusion criteria used in STR1VE. HR < 1.00 indicates a lower risk of an event in the STR1VE-US cohort than in the ENDEAR/SHINE cohort. HR > 1.00 indicates a higher risk of an event in the STR1VE-US cohort than in the ENDEAR/SHINE cohort. Shading denotes 95% CIs

Additionally, the considerations noted in this critical appraisal are aligned with those of multiple external and independently conducted health technology assessments, where the uncertainties regarding the observed treatment effects, as reported in Bischof et al. [21] and Ribero et al. [22], were noteworthy because of methodological issues including potential confounding due to differences in baseline characteristics that could not be adjusted for through MAIC. These assessments include, but are not limited to, the reimbursement reviews in Canada [27,28,29,30], France [31, 32], and Scotland [33, 34] (Table S5 Supplementary Material). Taken together, there are many potential sources of bias that should be considered when interpreting the results of existing MAICs, and it can be challenging to predict the direction and magnitude of the net bias when considering the totality of the issues. This underscores the importance of careful examination of the conduct and reporting of a MAIC to support evidence generation for decision-makers such as patients, clinicians, and regulatory and reimbursement agencies. MAICs may reduce observed cross-trial differences and provide decision-makers with comparative evidence when best practices are followed.