Background

Pulmonary arterial hypertension (PAH) is a rare and debilitating chronic disease of the pulmonary vasculature [1]. Disease progression is characterized by increasing pulmonary vascular resistance (PVR) and non-specific symptoms (e.g., dyspnoea during exercise, fatigue, chest pain, and light-headedness), and ultimately leads to right heart failure and premature death [1, 2]. Prior to the availability of PAH-specific therapies, median survival time was documented as 2.8 years in US patients with PAH [3]. The five-year survival rate in newly diagnosed patients is reported to be 61.2% [4].

Therapies in PAH have been approved with one or more routes of administration for three key pathogenesis pathways. Approved therapies targeting the nitric oxide pathway are the phosphodiesterase-5 inhibitors (PDE-5I): sildenafil (oral or intravenous [IV]) and tadalafil (oral), and the soluble guanylate cyclase stimulator (sGCS) riociguat (oral). Therapies targeting the endothelin pathway currently approved are macitentan, bosentan and ambrisentan, all administered orally. One of the endothelin receptor antagonist (ERA) drugs, sitaxentan, was authorised in Europe in 2006, but subsequently withdrawn due to liver toxicity [5]. Approved drugs targeting the prostacyclin [PGI2] pathway include epoprostenol (IV), iloprost (inhaled), treprostinil (IV, inhaled, oral, subcutaneous [SC]), beraprost (oral), and selexipag (oral), a selective non-prostanoid PGI2 receptor (IP receptor) agonist.

The treatment of PAH is guided by an evidence-based treatment algorithm published by the European Society of Cardiology and European Respiratory Society (ESC/ERS) [2]. The overall treatment goal is to achieve a low-risk status, associated with World Health Organization (WHO) Functional Class II, good exercise capacity (> 440 m in the 6-min walking distance [6MWD] test), and good right-ventricular function assessed using echocardiography. The latest guidance and proceedings (see Figure S1 in the electronic supplementary material) recommend either monotherapy or initial oral combination therapy for treatment-naïve patients at a low or intermediate risk of clinical worsening or death [2, 6]. For these patients, oral therapies are recommended; ERAs and PDE-5Is are therefore generally used as first-line treatment. For patients who fail to achieve an adequate clinical response (i.e. a low-risk status after 3 to 6 months) with initial therapy, treatment with sequential double or triple combination therapy is recommended. For high-risk treatment-naïve patients, an initial combination therapy regimen including a drug targeting the PGI2 pathway requiring continuous IV administration is indicated.

A lack of head-to-head treatment comparisons in randomized controlled trials (RCTs) has complicated clinical decision-making in PAH. As a result, a multitude of meta-analyses (MA; the synthesis of evidence from the same treatment comparisons assessed in clinical trials [7]) and network meta-analyses (NMA; the synthesis of both direct and indirect evidence to allow treatment comparisons that have not been directly assessed in clinical trials [7]) in PAH have been conducted.
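As a hedged illustration of what an indirect comparison involves, the sketch below applies the adjusted indirect comparison (Bucher) approach to two treatments that have each been compared only with a common comparator. The function name and all numbers are hypothetical assumptions for illustration; they are not taken from any study in this review, and a full NMA would involve considerably more than this single step.

```python
import math

def indirect_comparison(d_ab, se_ab, d_cb, se_cb):
    """Adjusted indirect comparison of A vs C through a common comparator B.

    d_ab, d_cb: relative effects of A vs B and C vs B on an additive scale
                (for example, log hazard ratios).
    se_ab, se_cb: standard errors of those effects.
    """
    d_ac = d_ab - d_cb                       # indirect estimate of A vs C
    se_ac = math.sqrt(se_ab**2 + se_cb**2)   # variances of the two inputs add
    ci = (d_ac - 1.96 * se_ac, d_ac + 1.96 * se_ac)  # approximate 95% interval
    return d_ac, se_ac, ci

# Hypothetical log hazard ratios versus placebo (illustrative values only).
d_ac, se_ac, ci = indirect_comparison(-0.50, 0.20, -0.30, 0.15)
print(d_ac, se_ac, ci)
```

Because the variances add, the indirect estimate is always less precise than either of the direct estimates it is built from, which is one reason the assumptions behind such comparisons deserve scrutiny.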

Given the absence of direct RCT comparisons and the evolution of disease definition, classification, trial designs, available therapies and treatment guidelines, it is important to better understand the quality of published MA and NMA in PAH and their alignment with clinical decision-making today. The objective of the study was to critically appraise the quality and validity of published MA and NMA studies in PAH and explore the impact of the findings from these studies on current decision-making.

Methods

Search strategy and data collection

A systematic literature review was conducted according to the recommendations of the Cochrane Collaboration [8] and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [9], to identify published evidence synthesis (i.e. MA and NMA) studies in PAH.

Searches were conducted from database inception to September 12, 2018 and updated on April 22, 2020 in Embase, Medline (including Medline-In-Process) and the Cochrane Database of Systematic Reviews via OVID, in line with the National Institute for Health and Care Excellence (NICE) technology appraisal guidelines and recommendations from the Centre for Reviews and Dissemination and the Cochrane Collaboration [10,11,12]. Supplementary searches included websites of selected health technology assessment agencies.

Retrieved records were assessed by one reviewer against the pre-specified PICOS criteria (Table S1 in the electronic supplementary material) and unblinded assessments were double checked by a second reviewer. Any discrepancies were resolved through discussion with a third reviewer. Studies were included if they met the following criteria: 1) adult patients with any etiology of PAH (pulmonary hypertension [PH] Group 1) [2], 2) at least two approved and available therapies or drug classes for treatment of PAH (to allow assessment of relative efficacy and safety of compared treatments), 3) full-text MA/NMA report. Details of the search methodology are provided in Tables S2a-h in the electronic supplementary material.

Key baseline characteristics of patients with PAH from the included RCTs were extracted to explore the extent of heterogeneity across the trials.

Study appraisal

A targeted review of published checklists for evidence synthesis studies was conducted. Checklists published by NICE [13], ISPOR [14], PRISMA [15] and GRADE [16] were identified. Criteria for checklist selection included:

  • Domains covered, such as relevance of research question, methods for establishing the evidence base, assessment for internal validity, statistical methods, and reporting of results

  • Suitability to present context, including applicability to different forms of evidence synthesis

  • Generalizability

  • Acceptability and recognition of the checklist

The ISPOR checklist was deemed the most appropriate as it covers all domains listed in the checklist selection criteria, is suited to the study objective and is applicable to different types of evidence synthesis.

Complementary questions were added to the 26-item ISPOR checklist with questions specific to the disease area and/or study objective. These additional questions are marked as such in the study assessment provided in Table S3 in the electronic supplementary material.

The ISPOR checklist provides for a quality grading whereby an overall assessment of ‘strong’, ‘neutral’ or ‘weak’ is given for each of the six domains (i.e. relevance, credibility, analysis, reporting quality & transparency, interpretation, conflict of interest). However, no explicit criteria are provided for scoring each domain. A set of domain-specific criteria for quality grading was therefore adopted, as described in Table 1. Study appraisals by one reviewer were double checked by a second reviewer.

Table 1 Criteria for scoring each domain in the checklist

Results

Study characteristics

A total of 52 MA and NMA studies met the inclusion criteria and were retained for data extraction and quality appraisal. From electronic database searches, 51 full-text publications were included. From the hand-search of publicly available websites of health technology assessment bodies, one report of the Canadian Agency for Drugs and Technologies in Health was included. The PRISMA diagram in Figure S2a-b (see electronic supplementary material) presents the search results.

The study characteristics of 52 publications included for appraisal are presented in Table 2. The publication year ranged between 2007 [44] and 2020 [39, 41, 49, 50] with most studies published in recent years. MAs were conducted in 35 studies [17, 19, 20, 22, 23, 26,27,28,29, 31, 35,36,37,38,39,40,41, 44,45,46,47,48, 51,52,53, 55, 58, 60,61,62, 65,66,67,68,69], NMAs in 15 studies [18, 21, 24, 25, 30, 32, 33, 42, 49, 50, 54, 56, 59, 63, 64], both NMA and MA in one study [57], and MA and disproportionality analysis in one study [34]. Of 52 studies, 47 evaluated the impact of PAH interventions in patients with PAH and PAH subgroups (based on aetiology, e.g. idiopathic PAH, familial PAH, connective tissue disease-associated PAH). Patients with PH including PAH and non-PAH patients (e.g. PH due to left sided heart disease) were investigated in four studies [20, 34, 44, 45] while patients with PAH were examined alongside other diseases (e.g. heart failure, prostate cancer) in two studies [46, 60].

Table 2 Characteristics of evidence synthesis studies

Baseline characteristics of patient populations in the included studies are presented in Fig. 1a-c. The average WHO Functional Class distribution, a measure of disease severity, was 0.6, 30.3, 63.7 and 5.4% for FC I, FC II, FC III and FC IV, respectively.

Fig. 1a-c Disease severity, PAH etiology and background therapy across included RCTs

With a number of exceptions [17, 20, 24, 25, 34, 35, 39,40,41, 46, 48, 56, 60, 61, 63, 64, 66], most studies investigated treatments targeting all three pathways. All the approved treatments (ERA, PDE-5Is, PRAs, prostacyclin and sGCS) were investigated in nine recent studies [27, 30, 38, 42, 47, 49, 50, 59, 65]. Some studies included treatments approved in limited markets such as beraprost [38, 40, 49, 50, 52, 53, 59, 63, 67]. In nine studies, drugs targeting one pathway only were investigated: prostacyclins in five studies [40, 48, 61, 63, 66] and ERAs in four studies [25, 46, 60, 64]. Fifteen studies [17, 20, 21, 24, 25, 32, 34, 35, 39, 46, 60, 62, 64, 65, 67] focused on oral treatments only. Besides the approved treatments, non-approved PAH treatments were included in seven studies: imatinib [52, 53, 56, 62], terbogrel [29, 62] and aspirin [52]. Despite being withdrawn in 2010, sitaxentan was assessed in four recent studies [25, 35, 59, 62]. Two studies omitted selexipag even though it was approved at the time of the study [30, 54].

The outcomes evaluated included clinical, hemodynamic, health-related quality of life (HRQoL) and safety outcomes. Frequently investigated clinical endpoints were 6MWD (as a standalone or within combined events) in 43 studies [17, 19, 20, 22,23,24,25,26,27,28,29,30,31,32,33, 35, 36, 39,40,41,42, 44, 45, 47,48,49,50,51,52, 54,55,56,57,58,59,60,61,62,63, 65,66,67,68] followed by mortality (all-cause or disease-specific) in 37 studies [18,19,20,21, 23, 25,26,27,28,29,30, 33, 37, 40,41,42, 44,45,46,47,48,49,50,51,52,53,54,55, 57, 60,61,62,63, 65,66,67,68], clinical worsening (standalone or in combined events) in 25 studies [18,19,20,21, 24,25,26,27, 30, 31, 33, 35, 37, 38, 42, 47, 54, 57, 59, 61, 62, 65,66,67,68] and WHO functional class improvement or deterioration in 24 studies [18,19,20, 24, 27, 29, 31,32,33, 35, 37, 40, 42, 44, 45, 47, 48, 55, 57, 59, 60, 63, 65, 67].

The most commonly employed tool for quality assessment was Cochrane’s risk of bias tool, employed in 21 studies [20, 21, 27, 32,33,34,35,36,37,38,39, 41, 47,48,49,51, 60, 62, 64] followed by Jadad scores used in 12 studies [17, 25, 26, 30, 40, 42, 48, 51, 61, 65,66,67]. There was no mention of quality appraisal being conducted in 10 studies [18, 24, 28, 29, 44, 45, 55, 58, 63, 64].

Quality appraisal

The quality assessment of the included studies is summarized in Fig. 2 by overall judgement (strong, neutral, weak) against each domain of the checklist. The number of studies receiving each judgement in each domain is given in Table 3. The detailed quality assessments are presented in Table S3 in the electronic supplementary material.

Fig. 2 Overview of quality assessment

Table 3 Number of studies with judgement in each domain

Relevance

Of the 52 studies reviewed, eight were scored as strong in terms of relevance, 26 as neutral, and the remaining 18 as weak.

Most studies included relevant populations. In some cases, the population was narrowly defined and thus not generalizable to an overall PAH population (e.g. focused on connective tissue disease-associated-PAH [36, 39]) while in others, it went beyond adult PAH populations (i.e. PH patients [group 2–5] or pediatric PAH were included). Some studies adopted a narrow research focus on 1–2 drug classes [17, 20, 25, 32, 35, 39, 40, 60, 61, 63, 64, 66] or oral therapies only [17, 20, 21, 32, 35, 39, 62, 64, 65, 67], often without explicit and/or adequate justification for such restrictions. Many included studies were highly selective in their choice of outcomes analyzed, 6MWD being the most frequently analyzed outcome.

Very few studies fulfilled the checklist item about the extent to which an evidence synthesis study is informative to decision makers today and aligned with current clinical practice and guidelines. Several papers did not explicitly state the research question or decision problem guiding the analysis [18, 21, 29, 33, 42, 55, 61]. Several other studies failed to justify the focus of their research question [17, 18, 20, 21, 25, 31, 32, 40, 45, 60, 62,63,64,65,66]. For example, some studies formulated research questions with a very narrow scope (e.g. oral treatments [17, 20, 21, 32, 62, 64, 65]) or included trials with non-PAH populations [34, 44, 45], therefore precluding determination of the optimal choice of therapy based on a comparison of all available treatment options. Some studies included unapproved or withdrawn treatments, while several studies drew conclusions at odds with current knowledge, guidelines and clinical practice. For example, claims of PDE-5I monotherapy being superior and a therapy of choice based on older, short-term trials (e.g. Singh 2006 [70], Galie 2005a [71]) are not aligned with evidence from more recent, longer-term studies suggesting that PDE-5I monotherapy is inferior to combination therapy (e.g. SERAPHIN [72], AMBITION [73], GRIPHON [74]). Such inconsistencies across studies challenge a robust interpretation of results for decision makers concerned with a comprehensive assessment of all approved treatments, given the dearth of direct comparisons in RCTs.

Credibility

Of the 52 studies reviewed, six were scored as strong in terms of credibility, 18 as neutral, and the remaining 28 as weak.

The majority of studies attempted to identify all relevant RCTs. Some studies did not search all of the most relevant databases, i.e. MEDLINE, Embase, CENTRAL [18, 29, 32, 34, 35, 44, 45, 52, 53, 68]. Several studies did not provide details of the search strategy [18,19,20,21, 24,25,26, 29, 31, 32, 35, 36, 39, 40, 44,45,46, 49, 50, 52,53,54,55, 59,60,61, 63, 67, 68] and one study did not provide any details on the search strategy and searched databases [58].

The proposed methodology was found to be relevant to answer the decision problem in almost all included studies. Some studies did not conduct a quality assessment of included RCTs [18, 24, 28, 44, 55, 58, 59]. Several studies did not provide the results of the RCT quality assessment or discuss implications for the analysis in case of poor quality RCTs [21, 23, 30, 31, 36, 39, 46, 62, 63, 68].

Given the absence of randomization across the RCTs included in an MA or NMA, the assessment of effect modifiers is essential to validate assumptions around homogeneity, consistency and transitivity [75, 76]. Effect modifiers are study and patient characteristics associated with treatment effects, capable of modifying (positively or negatively) the observed effect of a risk factor on disease status. Potential effect modifiers in PAH include patient baseline characteristics such as 6MWD, WHO functional class, disease duration, background therapies and etiology; and study design characteristics such as study duration and imputation rules. As the overview of design and patient baseline characteristics of included PAH RCTs (see Fig. 1a-c; Figure S3a-d in the electronic supplementary material) demonstrates, substantial between-study heterogeneity is a feature of every evidence synthesis study in PAH. The majority of studies did not offer a comprehensive assessment prior to analysis or identify imbalances in effect modifiers across the RCTs [17, 18, 20, 21, 23,24,25,26,27, 30, 32, 34, 39, 44,45,46,47, 49, 51, 53, 54, 58,59,60,61,62,63,64,65,66, 68, 69].

Analysis

Of the 52 studies reviewed, five were scored as strong in terms of analysis, 20 as neutral, and the remaining 27 as weak.

Preservation of study randomization of included RCTs was fulfilled by almost all included studies except in five studies with single-arm [36, 39, 56], retrospective comparative [35] or open-label extension design [58]. Several MAs adopted an approach whereby, for multi-arm trials, the control group was split and the sample size halved [34, 37, 60, 67]. Though outlined in the Cochrane Handbook for Systematic Reviews of Interventions [12], this approach effectively breaks randomization and should therefore be avoided. Other forms of evidence synthesis (e.g. NMA) are more appropriate in this case. Of the included NMA studies with closed loops, most assessed the consistency between the direct and indirect evidence [13, 14, 50, 59, 64].
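To make the consistency assessment mentioned above more concrete, the following is a minimal sketch of one common way to compare a direct and an indirect estimate of the same contrast within a closed loop: a z-test on their difference. It is not necessarily the method used by the cited studies; the function name and input values are hypothetical assumptions.

```python
import math

def inconsistency_test(d_direct, se_direct, d_indirect, se_indirect):
    """Z-test for the difference between direct and indirect estimates."""
    diff = d_direct - d_indirect
    # The two estimates are assumed to come from independent sets of trials.
    se_diff = math.sqrt(se_direct**2 + se_indirect**2)
    z = diff / se_diff
    # Two-sided p-value from the standard normal distribution.
    cdf = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    p = 2.0 * (1.0 - cdf)
    return diff, z, p

# Hypothetical log odds ratios for the same treatment contrast.
diff, z, p = inconsistency_test(-0.10, 0.30, -0.20, 0.25)
print(diff, z, p)
```

A large p-value here does not demonstrate consistency; with few trials per comparison such tests typically have low power, so they complement rather than replace an assessment of effect modifiers.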

Common types of analysis to address imbalance in the distribution of treatment effect modifiers include subgroup and sensitivity analysis, meta-regression and using individual patient data. Only about a third of included studies attempted to address between-study heterogeneity [22, 24, 33, 35, 37, 38, 40, 44, 48, 50,51,52,53, 56, 57, 61]. The majority of included studies (primarily MAs) used a fixed effects model unless marked heterogeneity was detected (typically assessed using the Cochran Q-test or I² statistic), in which case a random effects model was used [17, 20, 25, 29, 31, 34, 39, 44, 45, 48, 51, 59, 60, 62, 65,66,67]. Some studies only fitted a random effects model [19, 20, 23, 26, 27, 35, 40, 46, 47, 49, 50, 64], whereas others only fitted a fixed effects model [28, 30, 38]. The deviance information criterion commonly formed the sole criterion for assessing model fit in the included NMA studies [18, 21, 32] except for Tran et al. 2015 [57], Petrovic 2020a [48] and Petrovic 2020b [49] who assessed model fit based on deviance information criterion and a comparison of the residual deviance with the number of unconstrained data points.
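The heterogeneity statistics and the two model types referred to above can be sketched as follows. This is an illustrative inverse-variance implementation with a DerSimonian-Laird estimate of between-study variance, which is one common choice and not necessarily the estimator used in the appraised studies. The function name and the input data are hypothetical.

```python
def pool(effects, ses):
    """Inverse-variance pooling with Cochran's Q, I-squared and a
    DerSimonian-Laird random effects estimate. Needs at least two studies."""
    w = [1.0 / se**2 for se in ses]                        # fixed effects weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed)**2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0  # I-squared in percent
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                          # between-study variance
    w_re = [1.0 / (se**2 + tau2) for se in ses]            # random effects weights
    random_ = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    return {"fixed": fixed, "random": random_, "Q": q, "I2": i2, "tau2": tau2}

# Hypothetical mean differences in 6MWD (metres) and their standard errors.
res = pool([30.0, 45.0, 10.0, 60.0], [10.0, 12.0, 8.0, 15.0])
print(res)
```

When the estimated between-study variance is zero the two pooled estimates coincide; when it is positive, the random effects weights become more equal across trials, so small trials gain influence relative to the fixed effects analysis.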

Lastly, several studies pooled treatments at the class level, usually without sound justification for the assumption of a class effect. Very few studies refrained from lumping treatments, doses and co-treatments together [28, 48, 50, 55,56,57, 62, 64].

Reporting quality & transparency

Of the 52 studies reviewed, seven were scored as strong in terms of their reporting quality and transparency, 22 as neutral, and the remaining 23 as weak.

All included NMA studies presented a network diagram, except Zhang et al. 2016 [63]. Two of the 11 included NMA studies did not present details of the number and/or RCTs per pairwise comparison [18, 30]. Separate reporting of direct and indirect comparisons was omitted in six NMA studies [18, 25, 30, 49, 50, 56]. A ranking of interventions according to the reported treatment effects was provided by two-thirds of the included NMA studies [18, 25, 33, 42, 49, 50, 57, 59, 63, 64], some of which did not report associated uncertainty measures. The reporting of all pairwise contrasts between interventions, along with measures of uncertainty, was not adhered to by two of the 11 NMA studies [18, 56].

The reporting of individual study results was omitted or not fully reported by 14 of the 52 studies [21, 25, 30, 32, 38, 42, 46, 49, 50, 55, 57, 59, 63, 64]. Overall, 37 of the included studies either completely omitted a discussion or provided a very brief reference to heterogeneity across studies without a specific discussion of the potential impact of differences in patient characteristics on observed results [17,18,19,20,21, 23,24,25,26,27, 29,30,31,32, 34,35,36, 38, 39, 46, 47, 49, 51, 52, 54, 58,59,60,61,62,63,64,65, 67, 68].

Interpretation

Overall, 15 of the 52 studies reviewed were scored as strong in terms of their interpretation of study findings, 23 as neutral, and the remaining 14 as weak.

A number of studies were scored as ‘weak’ when authors did not contextualize results considering limitations [31, 34, 38, 39, 58, 63], or endorsed specific treatments over others without any discussion of between-study heterogeneity and/or despite pooling of active therapies [20, 21, 25, 33, 39, 59, 60]. For example, Jain et al. 2017 [33] combined trials [74, 77, 78] in their primary analysis that differed in patients’ severity level and provision of background therapies.

Conflict of interest

Among included studies, 22 were scored as strong in terms of conflict of interest, 16 as neutral, and the remaining 14 as weak.

Fewer than a third of all assessed studies provided either no information about conflicts of interest or insufficiently detailed author disclosures. Other studies reported no personal or financial relationships, or clearly stated author contributions in case of personal or financial relationships or affiliations that could have biased the respective study.

Discussion

The objective of this study was to systematically appraise all identified MA/NMA studies in PAH and assess their quality given that such studies are taken into consideration for evidence-based decision-making. To our knowledge, this is the first study of this type in PAH. Overall, the appraisal found most evidence synthesis studies to be of low quality.

Most included evidence syntheses were found not to have defined the decision problem (i.e. the research question underpinning a study), population, comparisons and outcomes in a way that is aligned with current clinical practice and treatment guidelines [2, 79]. Of note, the majority of the studies [18,19,20,21,22,23,24,25,26, 29, 30, 32, 34, 36, 40, 44,45,46,47,48,49, 51, 54, 55, 57,58,59,60, 62,63,64,65,66, 68] included trials that do not reflect today’s clinical practice. For example, the BREATHE-2 [80] and PACES [81] trials investigated bosentan and sildenafil, respectively, as add-on therapy to IV epoprostenol. By contrast, PAH management today typically involves treatment initiation of oral therapy with an ERA and/or PDE-5I in low or intermediate-risk patients comprising the vast majority of patients, whereas parenteral prostacyclins would only be considered or added for high-risk patients [6].

Notably, clinical trial design has evolved from a preponderance of small, short-term and often open-label studies in treatment-naïve patients with severe PAH to larger, longer-term and event-driven trials (such as COMPASS-2 [82], SERAPHIN [72], AMBITION [73], GRIPHON [74]) in largely treatment-experienced and less severe patient populations. Similarly, primary endpoint definition has gradually shifted from improvement in 6MWD to morbidity and mortality as a composite endpoint (with components such as all-cause death, PAH-related hospitalization or disease worsening) which is considered to be a more patient- and clinically relevant endpoint [83,84,85].

While these changes in trial design and PAH management pose challenges for studies synthesizing evidence generated across such large time spans, a transparent interpretation of findings in recent MA/NMA studies in relation to present clinical practice and guidance was found to be lacking.

A related shortcoming of appraised studies is the choice of outcomes analyzed, which was found to be selective, not comprehensive, and usually not accompanied by clear justification. The most commonly assessed outcome was 6MWD, despite the failure of multiple studies to consistently establish significant associations between 6MWD and clinically more relevant outcomes such as PAH-related hospitalization, lung transplantation, initiation of rescue therapy or death [28, 29, 44, 52, 86, 87]. Moreover, the assessed evidence synthesis studies generally presented neither a review of the outcome definitions and outcome measures of included trials nor an assessment of imputation rules for handling missing data.

Mortality was less commonly assessed, which reflects the inherent challenges in designing clinical trials of PAH therapies to detect statistically significant or clinically meaningful differences in mortality. Replication of earlier trials (e.g. Barst 1996 [78]) showing survival benefit over a very short time period and placebo-controlled RCTs comparing monotherapy with no therapy in treatment-naïve patients would be considered unethical today.

Another crucial drawback in most included studies is the lack of a thorough assessment of key effect modifiers prior to the analysis. As the graphs presenting patient baseline characteristics across PAH trials demonstrate (see Fig. 1a-c; Figure S3a-d in the electronic supplementary material), there is marked between-study heterogeneity. One recurring observation was that most evidence synthesis studies included a mix of PAH and non-PAH patient populations, as in the aerosolized iloprost randomized (AIR) study [88] which included PAH and chronic thromboembolic pulmonary hypertension (CTEPH) patients.

Only a handful of studies sought to address such potential systematic differences in effect modifiers by means of subgroup or sensitivity analyses or meta-regression. This may be due to the limited subgroup data available from published PAH RCTs, and to the smaller sample sizes associated with subgroup data, which result in wider uncertainty estimates and a lower likelihood of detecting significant relative treatment effects.

In terms of results synthesis, several studies were found to pool treatments at the drug class level. Best-practice guidelines in evidence synthesis, such as NICE DSU TSD 7 [13], recommend against pooling treatment doses or treatments into drug classes since characteristics of the underlying trial population or efficacy/safety trial results may be different.

This review has some limitations. A thorough assessment of the quality of MA/NMA studies is limited by the heterogeneity across included trials. A detailed assessment of between-study heterogeneity in each included MA/NMA was beyond the scope of the review. Nevertheless, a preliminary assessment of patients’ baseline characteristics of all PAH trials included across the appraised MA/NMA studies was considered reflective of most studies. Results or analyses relating to PAH subgroups by etiology, severity or age were not explored further because few or no studies focused on these specific sub-populations.

Conclusion

This is the first critical appraisal of published MA/NMA studies in PAH, suggesting overall low quality and validity of efforts synthesizing PAH evidence. As our study demonstrates, this has important implications for clinical decision-making and future research. First, the choice of optimal therapy to maximize patient outcomes should also be guided by a consideration of the limitations of published MA/NMA studies highlighted in this study. Second, future attempts at evidence synthesis in PAH should improve the level of validity and scrutiny to meaningfully address challenges arising from an evolving therapeutic landscape. This should include the definition of decision problems that are aligned with today’s clinical practice and treatment guidelines, justification of key analysis assumptions, a comprehensive interrogation of the evidence base prior to analysis, use of individual patient data to mitigate issues of heterogeneity, and a transparent presentation of results and associated uncertainty measures for all relevant outcomes.