Abstract
Background
The assumption of consistency, defined as agreement between direct and indirect sources of evidence, underlies the increasingly popular method of network meta-analysis. This assumption is often evaluated by statistically testing for a difference between direct and indirect estimates within each loop of evidence. However, the test is believed to be underpowered. We aim to evaluate its properties when applied to a loop typically found in published networks.
Methods
In a simulation study we estimate type I error, power and coverage probability of the inconsistency test for dichotomous outcomes using realistic scenarios informed by previous empirical studies. We evaluate test properties in the presence or absence of heterogeneity, using different estimators of heterogeneity and by employing different methods for inference about pairwise summary effects (Knapp-Hartung and inverse variance methods).
Results
As expected, power is positively associated with sample size and frequency of the outcome and negatively associated with the presence of heterogeneity. Type I error converges to the nominal level as the total number of individuals in the loop increases. Coverage is close to the nominal level in most cases. Different estimation methods for heterogeneity do not greatly affect test performance, but different methods to derive the variances of the direct estimates do affect inconsistency inference. The Knapp-Hartung method is more powerful, especially in the absence of heterogeneity, but exhibits larger type I error. The power for a ‘typical’ loop (comprising 8 trials and about 2000 participants) to detect a 35% relative change between direct and indirect estimation of the odds ratio was 14% for inverse variance and 21% for Knapp-Hartung methods (with type I error 5% in the former and 11% in the latter).
Conclusions
The study gives insight into the conditions under which the statistical test can detect important inconsistency in a loop of evidence. Although different methods to estimate the uncertainty of the mean effect may improve the test performance, this study suggests that the test has low power for the ‘typical’ loop. Investigators should interpret results very carefully and always consider the comparability of the studies in terms of potential effect modifiers.
Background
The validity of results from network meta-analysis depends on the plausibility of the transitivity assumption; that is, the comparability of studies informing the treatment comparisons with respect to the distribution of effect modifiers [1–3]. Lack of transitivity in a network can create statistical disagreement between direct and various sources of indirect evidence, often termed inconsistency [4]. Statistical evaluation of consistency is possible only when there are ‘closed loops of evidence’ in the network. The recent increase in applications of network meta-analysis has emphasised the need for methods to evaluate consistency and has motivated the development of statistical models [5–7] and methods [8–11].
Empirical evidence suggests that the prevalence of statistically significant loop inconsistency ranges from 2% to 17% [12–14]. However, little is known about factors that affect the detection of inconsistency. As expected, the power to detect inconsistency is positively associated with the number and size of trials, and both power and type I error increase when a fixed-effect model is assumed [15]. It has been argued that the presence and magnitude of heterogeneity (within-comparison variability) in a loop of evidence can affect inferences made about inconsistency, and empirical evidence has confirmed these claims by showing that different estimators of the heterogeneity variance are likely to have a considerable impact [14]. Finally, previous studies showed that inconsistency occurs more frequently in loops where one of the comparisons is informed by only one trial [14, 16, 17].
Although there are indications that the presence, magnitude and estimation method of heterogeneity might influence the detection of inconsistency, this association has not been studied extensively. For instance, the impact of two alternative methods to express uncertainty about the pairwise summary effects (inverse variance and Knapp-Hartung method [18, 19]) remains unclear. It has been shown that the Knapp-Hartung method outperforms inverse variance in coverage for the summary effect and that it is insensitive to the estimator of the heterogeneity used [20, 21]. We anticipate that differences in the properties of the two methods will affect the estimation of inconsistency.
The aim of this paper is to explore factors that affect the detection of inconsistency in a three-treatment network for a dichotomous outcome. The factors that we explore are associated with the amount of data available in the loop (such as number, size and distribution of trials across comparisons, frequency of events), the heterogeneity variance in the pairwise comparisons (presence or absence and estimation method) and the method for inference about pairwise summary effects (inverse variance or Knapp-Hartung). We consider only the log-odds ratio (LOR) as the effect size of interest. We conduct a simulation study considering realistic scenarios including only two-arm trials and we estimate type I error, power and coverage probability for the test of consistency. The simulation scenarios are informed by two previous empirical studies: a large collection of 303 loops from published networks of interventions [14] and a study about the empirical distribution of heterogeneity on dichotomous outcomes [22].
Methods
The inconsistency test
Consider a simple scenario with three competing treatments A, B and C and that there are trials that compare directly all three possible pairs of treatments. Evaluation of inconsistency in a triangular network requires first the estimation of three direct summary effects for each pairwise comparison. We denote the effect sizes (i.e. LORs) for the three pairs of treatments as {\widehat{\mathrm{\mu}}}_{\mathrm{AB}}^{\mathrm{DIR}},{\widehat{\mathrm{\mu}}}_{\mathrm{AC}}^{\mathrm{DIR}} and {\widehat{\mathrm{\mu}}}_{\mathrm{BC}}^{\mathrm{DIR}} with variances {\widehat{\mathrm{v}}}_{\mathrm{AB}}^{\mathrm{DIR}},{\widehat{\mathrm{v}}}_{\mathrm{AC}}^{\mathrm{DIR}} and {\widehat{\mathrm{v}}}_{\mathrm{BC}}^{\mathrm{DIR}} respectively. The superscript denotes the source of evidence (‘DIR’ for direct here or ‘IND’ indirect later) and the subscript denotes the treatment comparison. For any given comparison (e.g. BC) we estimate the indirect mean treatment effect, {\widehat{\mathrm{\mu}}}_{\mathrm{BC}}^{\mathrm{IND}}, as a simple contrast of two direct estimates involving the third treatment, and we compare it with the corresponding direct estimate {\widehat{\mathrm{\mu}}}_{\mathrm{BC}}^{\mathrm{DIR}}.
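The inconsistency factor (IF) for the loop ABC is estimated as
{\mathrm{I}\widehat{\mathrm{F}}}_{\mathrm{ABC}}=\left|{\widehat{\mathrm{\mu}}}_{\mathrm{BC}}^{\mathrm{DIR}}-{\widehat{\mathrm{\mu}}}_{\mathrm{BC}}^{\mathrm{IND}}\right|=\left|{\widehat{\mathrm{\mu}}}_{\mathrm{BC}}^{\mathrm{DIR}}-\left({\widehat{\mathrm{\mu}}}_{\mathrm{AC}}^{\mathrm{DIR}}-{\widehat{\mathrm{\mu}}}_{\mathrm{AB}}^{\mathrm{DIR}}\right)\right|
with variance
{\widehat{\mathrm{v}}}_{{\mathrm{IF}}_{\mathrm{ABC}}}={\widehat{\mathrm{v}}}_{\mathrm{AB}}^{\mathrm{DIR}}+{\widehat{\mathrm{v}}}_{\mathrm{AC}}^{\mathrm{DIR}}+{\widehat{\mathrm{v}}}_{\mathrm{BC}}^{\mathrm{DIR}} (1)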
The direction of the estimated IF is irrelevant to the evaluation of inconsistency and only the magnitude of its absolute value is of interest. The subscript in {\mathrm{I}\widehat{\mathrm{F}}}_{\mathrm{ABC}} refers to the loop in which inconsistency is estimated.
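Under the null hypothesis of consistency (H_{0} : IF = 0) a z-test is calculated
z={\mathrm{I}\widehat{\mathrm{F}}}_{\mathrm{ABC}}/\sqrt{{\widehat{\mathrm{v}}}_{{\mathrm{IF}}_{\mathrm{ABC}}}}
with a critical region z ≥ z_{α/2}. In the present study we select α = 0.05.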
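The test described in this subsection can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `inconsistency_test`, the returned dictionary keys, and the sign convention for the indirect estimate (AC minus AB) are assumptions, and the input values in the usage note are made up.

```python
import math


def inconsistency_test(mu_ab, mu_ac, mu_bc, v_ab, v_ac, v_bc, z_crit=1.96):
    """Loop-specific z-test for inconsistency in a triangle ABC (illustrative)."""
    # Indirect BC estimate as a contrast of the two direct estimates that
    # involve the third treatment A (assumed sign convention).
    mu_bc_ind = mu_ac - mu_ab
    # Only the magnitude of the disagreement is of interest.
    if_hat = abs(mu_bc - mu_bc_ind)
    # The three direct estimates come from separate sets of two-arm trials,
    # so their variances add.
    v_if = v_ab + v_ac + v_bc
    z = if_hat / math.sqrt(v_if)
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(z / math.sqrt(2.0))
    return {"IF": if_hat, "var": v_if, "z": z, "p": p_value, "reject": z >= z_crit}


# Made-up direct estimates (LOR scale) and variances for one loop.
result = inconsistency_test(-0.32, 0.0, 0.92, 0.04, 0.03, 0.05)
```

With these made-up inputs the estimated IF is 0.6 with variance 0.12, which is not enough to reject consistency at the 5% level.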
Estimation of variance
Equation (1) suggests that the method used to estimate the variance of the direct treatment effects {\mathrm{v}}_{\mathrm{AB}}^{\mathrm{DIR}},{\mathrm{v}}_{\mathrm{AC}}^{\mathrm{DIR}} and {\mathrm{v}}_{\mathrm{BC}}^{\mathrm{DIR}} will play an important role in the performance of the z-test for inconsistency. We consider two methods to estimate the direct variances and examine how they can affect the estimation of {\mathrm{v}}_{{\mathrm{IF}}_{\mathrm{ABC}}}. The first method is the usual inverse-variance method and the second method is an alternative approach proposed by Knapp and Hartung [19].
In a pairwise meta-analysis we either assume that trials estimate a single underlying effect size (fixed-effect model) or that the study-specific underlying effect sizes are different but drawn from the same distribution (random-effects model) with heterogeneity τ^{2}. Under the latter scenario, it is common to assume that heterogeneity is the same for all comparisons being made, i.e. {\mathrm{\tau}}_{\mathrm{AB}}^{2}={\mathrm{\tau}}_{\mathrm{AC}}^{2}={\mathrm{\tau}}_{\mathrm{BC}}^{2}={\mathrm{\tau}}^{2}. We adopt this assumption throughout the paper and we estimate τ^{2} using the DerSimonian and Laird estimator [23].
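In the inverse variance approach, the direct variances are simple functions of the sampling variances of the individual trials and the heterogeneity variance τ^{2}. Suppose that K_{AB}, K_{AC} and K_{BC} trials inform the AB, AC and BC comparisons respectively. If the sampling variances were the same for all trials (σ^{2}), the inverse variance estimator of the inconsistency variance would be
{\widehat{\mathrm{v}}}_{{\mathrm{IF}}_{\mathrm{ABC}}}=\left({\mathrm{\sigma}}^{2}+{\widehat{\mathrm{\tau}}}^{2}\right)\left(\frac{1}{{\mathrm{K}}_{\mathrm{AB}}}+\frac{1}{{\mathrm{K}}_{\mathrm{AC}}}+\frac{1}{{\mathrm{K}}_{\mathrm{BC}}}\right) (2)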
Consequently, {\widehat{\mathrm{v}}}_{{\mathrm{IF}}_{\mathrm{ABC}}} depends on the heterogeneity and decreases with the number and precision of the included trials.
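Under the equal-sampling-variance simplification, each direct variance is (σ^{2} + τ^{2})/K, so the inconsistency variance is (σ^{2} + τ^{2}) times the sum of the reciprocals of the trial counts. The sketch below evaluates this for a balanced and an imbalanced loop with 12 trials each. The function name and the value σ^{2} = 0.1 are illustrative assumptions; τ^{2} = 0.12 is the median heterogeneity for a subjective outcome quoted in this paper.

```python
def iv_inconsistency_variance(k_ab, k_ac, k_bc, sigma2, tau2):
    """Inconsistency variance when every trial has sampling variance sigma2."""
    # Each direct summary effect has variance (sigma2 + tau2) / K.
    return (sigma2 + tau2) * (1.0 / k_ab + 1.0 / k_ac + 1.0 / k_bc)


# Same total number of trials (12), different distribution across comparisons.
balanced = iv_inconsistency_variance(4, 4, 4, sigma2=0.1, tau2=0.12)
imbalanced = iv_inconsistency_variance(1, 4, 7, sigma2=0.1, tau2=0.12)
```

The imbalanced loop has the larger variance because the single-trial comparison contributes a full (σ^{2} + τ^{2}) on its own, which is in line with the lower power reported for imbalanced loops.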
An alternative approach to estimate each direct variance, and consequently {\mathrm{v}}_{{\mathrm{IF}}_{\mathrm{ABC}}}, is the approach proposed by Knapp and Hartung [19]. They derive the variance {\widehat{\mathrm{v}}}_{\mathrm{AB}}^{\mathrm{DIR}} as a generalised Q statistic divided by the product of the degrees of freedom (K_{AB} - 1) and the sum of the random-effects study weights [24]. It has been shown that the performance of this method is not influenced by the choice of the heterogeneity estimator [19, 21, 25, 26].
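The two variance estimators can be compared on a single pairwise comparison. The sketch below is an illustration under simplifying assumptions, not the authors' implementation: it estimates τ^{2} within one comparison with the DerSimonian and Laird moment estimator (the paper assumes a heterogeneity common to the whole loop), the function names are hypothetical, and the effect sizes are made up.

```python
def dl_tau2(y, v):
    """DerSimonian and Laird moment estimate of tau^2 for one comparison."""
    k = len(y)
    if k < 2:
        return 0.0  # heterogeneity cannot be estimated from a single trial
    w = [1.0 / vi for vi in v]
    sw = sum(w)
    mu_fe = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - mu_fe) ** 2 for wi, yi in zip(w, y))
    c = sw - sum(wi ** 2 for wi in w) / sw
    return max(0.0, (q - (k - 1)) / c)


def random_effects_summary(y, v, tau2):
    """Return (summary effect, inverse-variance variance, Knapp-Hartung variance)."""
    k = len(y)
    w = [1.0 / (vi + tau2) for vi in v]  # random-effects weights
    sw = sum(w)
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sw
    var_iv = 1.0 / sw
    if k < 2:
        return mu, var_iv, var_iv  # single trial: both methods coincide
    # Generalised Q statistic computed with the random-effects weights.
    q_gen = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, y))
    var_kh = q_gen / ((k - 1) * sw)
    return mu, var_iv, var_kh


# Made-up LORs with equal sampling variances; the effects are close together,
# so the DerSimonian and Laird estimate is truncated at zero.
y, v = [0.4, 0.5, 0.6], [0.04, 0.04, 0.04]
tau2 = dl_tau2(y, v)
mu, var_iv, var_kh = random_effects_summary(y, v, tau2)
```

In this homogeneous example the Knapp-Hartung variance is smaller than the inverse variance one, which matches the behaviour the paper reports when heterogeneity is absent.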
In summary, we estimate the variances of the direct pairwise summary effects by employing two different strategies: the inverse variance method using the DerSimonian and Laird estimator (IVDL) and the Knapp-Hartung method with the DerSimonian and Laird estimator (KHDL). When each comparison is addressed by a single trial (so that the loop includes 3 trials in total) estimation of heterogeneity is impossible. In these cases we use the fixed-effect model (by setting τ^{2} to zero) and consequently both IVDL and KHDL methods yield exactly the same results.
Simulation study
Empirical evidence to inform simulation scenarios
To inform the simulation scenarios we use a large collection of complex networks of interventions [14]. Figure 1 summarises some of the attributes of 303 loops from 40 published networks with dichotomous outcomes analysed using the LOR scale. The majority of the pairwise meta-analyses (93%) included fewer than ten trials. The median LOR is 0.32 with interquartile range (IQR) (0.13, 0.75). In 91% of the loops the common within-loop heterogeneity using the DerSimonian and Laird estimator is less than 0.5 and it is estimated at zero (when rounded to the second decimal) in 51% of the loops. The median IF is 0.36 with IQR (0.15, 0.80). The median number of trials per loop is 8 with IQR (6, 14) and the median loop sample size is 2256 with IQR (1026, 18890); the respective median number of trials and sample size per comparison are 2 with IQR (1, 4) and 706 with IQR (255, 2997). The most common type of primary outcome was subjective (43% of the networks), whereas 35% and 22% of the networks had reasonably objective outcomes (e.g. cause-specific mortality, major morbidity event) and all-cause mortality outcomes respectively. The majority of the networks (63%) compared pharmacological interventions versus placebo. For this comparison type and a subjective outcome, Turner et al. suggest that the distribution of the heterogeneity is reasonably approximated by a log-normal distribution τ^{2} ~ LN(-2.13, 1.58^{2}), with median τ^{2} = 0.12 and IQR (0.03, 0.34) [22]. Our empirical data seem to match this predictive distribution, though more data are needed since we have only 55 common within-loop heterogeneities estimated in networks with pharmacological interventions versus placebo comparison type and subjective outcome.
Simulation scenarios
We use subscripts k_{1}, k_{2} and k_{3} to refer to the three comparisons AB, AC and BC respectively, so that k_{1} = 1, …, K_{AB}, k_{2} = 1, …, K_{AC} and k_{3} = 1, …, K_{BC}, where K_{AB}, K_{AC}, K_{BC} represent the number of trials included in AB, AC and BC comparisons respectively. We examine both balanced direct comparisons, i.e. all comparisons include the same number of trials K_{AB} = K_{AC} = K_{BC} = K = 1, …, 7, and imbalanced direct comparisons, i.e. each comparison is informed by a different number of trials with K_{AB} = 1, K_{AC} = 4, K_{BC} = 7. Both balanced and imbalanced scenarios were selected, informed by the empirical data. In particular, the imbalanced scenario included a comparison with a single trial, because the majority (196 out of 303) of observed loops had this characteristic. We then set the second comparison to include a large number of trials (7 trials) and for the third comparison we selected the median between the two extremes (4 trials). We restrict our analysis to dichotomous outcome data measured using the odds ratio (OR) due to its mathematical properties [27–29]. Based on the results from the empirical study [14], we assume OR_{AB} = 1/exp(0.32) = 0.73 and OR_{AC} = 1 as the relative treatment effects for AB and AC respectively. We compute the OR for the BC comparison as
{\mathrm{OR}}_{\mathrm{BC}}=\exp\left(\log {\mathrm{OR}}_{\mathrm{AC}}-\log {\mathrm{OR}}_{\mathrm{AB}}+{\mathrm{IF}}_{\mathrm{ABC}}\right)
We select values IF_{ABC} = {0, 0.3, 0.45, 0.6, 1} to cover a range of plausible values for inconsistency as suggested by empirical data (Figure 1d). We consider two different distributions for heterogeneity that pertain to a subjective outcome (the most frequently reported outcome in our data) and all-cause mortality for comparisons between pharmacological interventions and placebo; according to [22] these are τ^{2} ~ LN(-2.13, 1.58^{2}) and τ^{2} ~ LN(-4.06, 1.45^{2}) (median τ^{2} = 0.02 with IQR (0.01, 0.04)).
For each combination of OR, IF_{ABC}, and τ^{2} we simulate the trial-specific underlying relative treatment effects from a normal distribution as
{\mathrm{LOR}}_{\mathrm{AB},{\mathrm{k}}_{1}}~\mathrm{N}\left({\mathrm{LOR}}_{\mathrm{AB}},\phantom{\rule{0.25em}{0ex}}{\mathrm{\tau}}^{2}\right), {\mathrm{LOR}}_{\mathrm{AC},{\mathrm{k}}_{2}}~\mathrm{N}\left({\mathrm{LOR}}_{\mathrm{AC}},\phantom{\rule{0.25em}{0ex}}{\mathrm{\tau}}^{2}\right) and {\mathrm{LOR}}_{\mathrm{BC},{\mathrm{k}}_{3}}~\mathrm{N}\left({\mathrm{LOR}}_{\mathrm{BC}},\phantom{\rule{0.25em}{0ex}}{\mathrm{\tau}}^{2}\right)\text{.}
Then, we generate arm-level data for each trial k_{1}, k_{2} and k_{3}. Without loss of generality we describe how to obtain arm-level data for an AB trial. We assume equal sample sizes across arms, that is {\mathrm{n}}_{\mathrm{A},{\mathrm{k}}_{1}}={\mathrm{n}}_{\mathrm{B},{\mathrm{k}}_{1}}=\mathrm{n}. The observed IQR for arm sample size in our empirical data is 51 to 270, and to represent moderate and large studies we generate studies with n ~ U(50, 150) and n ~ U(150, 300). We also consider n ~ U(20, 50) to generate data for very small studies. The numbers of events per arm, denoted {\mathrm{r}}_{\mathrm{A},{\mathrm{k}}_{1}} and {\mathrm{r}}_{\mathrm{B},{\mathrm{k}}_{1}}, are drawn from two binomial distributions {\mathrm{r}}_{\mathrm{A},{\mathrm{k}}_{1}}~\mathrm{B}\left({\mathrm{n}}_{\mathrm{A},{\mathrm{k}}_{1}},\phantom{\rule{0.25em}{0ex}}{\mathrm{p}}_{\mathrm{A},{\mathrm{k}}_{1}}\right) and {\mathrm{r}}_{\mathrm{B},{\mathrm{k}}_{1}}~\mathrm{B}\left({\mathrm{n}}_{\mathrm{B},{\mathrm{k}}_{1}},\phantom{\rule{0.25em}{0ex}}{\mathrm{p}}_{\mathrm{B},{\mathrm{k}}_{1}}\right) where {\mathrm{p}}_{\mathrm{A},{\mathrm{k}}_{1}} and {\mathrm{p}}_{\mathrm{B},{\mathrm{k}}_{1}} are the probabilities of the outcome in each trial arm. To define these probabilities we make assumptions about the average risk (AR) of the outcome in the trial assuming both frequent and rare events. To simulate frequent events we draw from a uniform distribution {\mathrm{AR}}_{\mathrm{AB},{\mathrm{k}}_{1}}~\mathrm{U}\left(0.25,\phantom{\rule{0.25em}{0ex}}0.75\right) and for rare events {\mathrm{AR}}_{\mathrm{AB},{\mathrm{k}}_{1}}~\mathrm{U}\left(0.05,\phantom{\rule{0.25em}{0ex}}0.15\right)\text{.}
Then the event probabilities in the arms are obtained as the solution to the equations
For frequent events and assuming no heterogeneity, the expected mean variance of LOR ranges from 0.04 to 0.25 depending on sample size. Variances for LOR for rare events range from 0.10 to 0.69.
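The arm-level event probabilities for a trial can be recovered numerically from its average risk and its trial-specific LOR. The sketch below is one possible reading of that step, not the authors' code. It assumes that the average risk is the mean of the two arm probabilities and that the LOR is logit(p_A) minus logit(p_B); the sign convention, the function names and the bisection approach are all assumptions made for illustration.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def arm_probabilities(avg_risk, lor, tol=1e-12):
    """Solve (p_a + p_b) / 2 = avg_risk and logit(p_a) - logit(p_b) = lor."""
    # p_b = 2 * avg_risk - p_a must stay inside (0, 1), which bounds p_a.
    lo = max(0.0, 2.0 * avg_risk - 1.0)
    hi = min(1.0, 2.0 * avg_risk)
    # logit(p_a) - logit(2 * avg_risk - p_a) is increasing in p_a, so bisect.
    for _ in range(200):
        p_a = 0.5 * (lo + hi)
        p_b = 2.0 * avg_risk - p_a
        if logit(p_a) - logit(p_b) > lor:
            hi = p_a
        else:
            lo = p_a
        if hi - lo < tol:
            break
    p_a = 0.5 * (lo + hi)
    return p_a, 2.0 * avg_risk - p_a


# Rare-event trial with average risk 0.10 and a trial-specific LOR of -0.32.
p_a, p_b = arm_probabilities(0.10, -0.32)
```

The returned probabilities can then be passed to a binomial sampler together with the arm sample size to generate the event counts.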
We then calculate the sample LOR and its variance as
If the simulated number of events in one of the study arms is zero, we add 0.5 to the cells of the 2 × 2 table. We repeat this process for all K_{AB} trials and then we perform a random-effects meta-analysis to obtain the summary effect size {\widehat{\mathrm{\mu}}}_{\mathrm{AB}}^{\mathrm{DIR}}. We follow the same process for comparisons AC and BC and then we estimate the inconsistency factor. Table 1 presents a summary of the simulation scenarios considered.
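The sample LOR and its variance for one simulated trial can be computed from the 2 × 2 table as sketched below. This is an illustration under stated assumptions: the variance is the usual sum of reciprocal cell counts, the LOR is taken as arm A relative to arm B (an assumed sign convention), and the continuity correction is applied whenever any cell is zero. The text only mentions zero events; extending the correction to zero non-events is an assumption made here to avoid division by zero. The function name is hypothetical.

```python
import math


def sample_lor(r_a, n_a, r_b, n_b):
    """Sample log-odds ratio (A relative to B) and its variance."""
    # Cells of the 2 x 2 table: events and non-events in each arm.
    a, b = float(r_a), float(n_a - r_a)
    c, d = float(r_b), float(n_b - r_b)
    if 0.0 in (a, b, c, d):
        # Continuity correction: add 0.5 to every cell of the table.
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    lor = math.log((a * d) / (b * c))
    var = 1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d
    return lor, var


# 10/100 events in arm A versus 20/100 in arm B.
lor, var = sample_lor(10, 100, 20, 100)
```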
For each scenario we analyse 1000 simulated triangular networks. Assuming a 5% significance level, we estimate the power of the test when true inconsistency is present (P(z ≥ 1.96 | IF ≠ 0)) and type I error when the null hypothesis is true (P(z ≥ 1.96 | IF = 0)). We compute the coverage probability for the confidence interval (CI) of inconsistency, which is the probability that the estimated interval for IF includes its true value. We carry out the simulations in the freely available software R 2.15.2 [30] using the self-programmed sims.fun function, which we have made available online (http://www.mtm.uoi.gr/index.php/materialfrompublicationssoftwareandprotocols).
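Type I error and power are both rejection rates over the simulated loops, and coverage is the proportion of intervals that contain the true IF. The sketch below shows these summaries together with the Monte Carlo standard error of a rejection rate. It is written in Python for illustration and is not the authors' sims.fun function; the function names and the toy z-values are assumptions.

```python
import math


def rejection_rate(z_values, z_crit=1.96):
    """Proportion of simulated loops with z >= z_crit, and its Monte Carlo SE."""
    n = len(z_values)
    rate = sum(1 for z in z_values if abs(z) >= z_crit) / n
    # Binomial standard error of an estimated proportion.
    mc_se = math.sqrt(rate * (1.0 - rate) / n)
    return rate, mc_se


def ci_covers(if_hat, var_if, true_if, z_crit=1.96):
    """True when the normal-theory interval for IF contains the true value."""
    half_width = z_crit * math.sqrt(var_if)
    return if_hat - half_width <= true_if <= if_hat + half_width


# Toy example: 50 rejections out of 1000 simulated loops.
rate, mc_se = rejection_rate([2.5] * 50 + [0.3] * 950)
```

With 1000 loops per scenario, a true rejection rate of 5% is estimated with a Monte Carlo standard error of about 0.007.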
In addition to the scenarios described above we also consider an extra scenario representing the ‘typical’ loop; that is a loop with the characteristics most commonly encountered in our collection of 303 loops. We specified this such that one comparison was informed by a single trial and the median number of studies per loop was 8, in line with the empirical evidence. The median loop sample size is 2300 (i.e. average trial arm size 144) [14]. Consequently, a loop with K_{AB} = 1, K_{AC} = 4, K_{BC} = 3, and n ~ U(120, 160) can be considered to be an ‘average sized loop’.
Results
Type I error
Figure 2 and Additional file 1: Figure S1 display the estimated type I error for equal and different numbers of trials across comparisons. In general, type I error is close to the nominal level for IVDL, but larger than 5% for many scenarios analysed with KHDL. The KHDL method generally yields smaller variances for IF, leading to larger type I errors (average type I error across all scenarios for IVDL: 0.07, average type I error across all scenarios for KHDL: 0.10, see also Figure 2a and b). Type I error converges to the nominal level more rapidly when τ^{2} = 0 for both IVDL and KHDL methods. The overall type I error approaches the nominal level as the number of trials increases for the same trial size. For example, for frequent events type I error reaches on average the nominal level when K = 5 for small sample sizes, and K = 4 for moderate and large sample sizes. In Table 2 we provide the type I error values for various simulation scenarios. When the total number of individuals included in the network ranges from 2400 to 3000 (i.e. close to the empirically estimated median loop size) type I error lies between 0.06 and 0.08. Type I error deviates from 5% considerably when an equal and small number of trials is considered across comparisons for all trial sizes (see Figure 2a ,b and Table 2).
For rare events, type I error departs from 5% more than it does for frequent events (Figure 2). Type I error is lower than its nominal level in most cases for IVDL especially when τ^{2} = 0, probably due to overestimation of τ^{2}. The KHDL method results again in considerably larger type I errors, which is probably due to the small variances of the mean treatment effects (average type I error across all scenarios for IVDL: 0.05, average type I error across all scenarios for KHDL: 0.08, see Figure 2c and d). Type I error is closer to the nominal level for IVDL when τ^{2} ≠ 0 for all sample sizes. All methods tend to improve their performance with increasing total number of trials included in the entire network (Figure 2 and Additional file 1: Figure S1).
Statistical power
Figure 3 and Additional file 2: Figure S2 present the power for IF = {0.3, 0.45, 0.6, 1} for both frequent and rare events when equal (Figure 3) and different (Additional file 2: Figure S2) numbers of trials are included in comparisons. As expected, the overall power increases both with number of trials included in the loop and with the trial size. Power increases when the trials included in a loop have comparable sample sizes. Results are aggregated over all estimation methods for heterogeneity and the different methods to estimate the variance of the direct summary effects. In Table 2 we provide the power values for various simulation scenarios when IF = 0.6 and frequent events are considered. When the total number of individuals included in the network ranges from 2400 to 3000, power ranges between 0.54 and 0.70 when an equal number of trials is assumed across comparisons but drops to 0.32 when each comparison has a different number of trials. As can be seen in equation (2), the distribution of trials across comparisons affects the estimation of inconsistency variance. This has an impact on power and the test is more powerful when trials are distributed uniformly across comparisons. Comparing, for example, the power of the test for the balanced scenario K_{AB} = 4, K_{AC} = 4, K_{BC} = 4 and the imbalanced scenario K_{AB} = 1, K_{AC} = 4, K_{BC} = 7 (each with 12 trials in the loop), power is higher when the distribution of trials is balanced across comparisons (ranges from 0.23 to 0.79) rather than imbalanced (ranges from 0.16 to 0.49) (see Table 2). The comparison of frequent (Figure 3a) and rare (Figure 3b) events indicates that power is larger for frequent events (average power across all scenarios for frequent events: 0.44, average power across all scenarios for rare events: 0.25). Rare events are associated with larger uncertainty for the direct mean treatment effects and thus the chances of identifying potentially important inconsistency decrease. 
It should be noted that the first summary result of each power curve pertains to the case where there is only one trial per comparison and heterogeneity is set to be zero. This has an impact on monotonicity especially when IF is low and trial size is large.
In Tables 3 and 4 we present the power for IVDL and KHDL methods. For frequent events the power to detect inconsistency does not vary significantly with the method used to estimate heterogeneity or to express uncertainty on the summary effects, although the Knapp-Hartung method is marginally more powerful, especially in the absence of heterogeneity. This is because, in many cases, the Knapp-Hartung method estimates smaller inconsistency variances compared with the inverse variance method. The median inconsistency standard error is 0.33 (IQR 0.21, 0.50) for KHDL and 0.40 (IQR 0.27, 0.57) for IVDL. As expected, when there is no heterogeneity, there is less uncertainty associated with each pairwise effect and the power to detect inconsistency increases for all IF values (Table 3).
The impact of heterogeneity is similar when the outcome is rare (average power across all IF values for KHDL: 0.24, average power across all IF values for IVDL: 0.21, see Table 3). Table 3 shows that the advantage of KHDL method when heterogeneity is zero becomes more pronounced for rare events (average power across all IF values for KHDL: 0.32, average power across all IF values for IVDL: 0.25, see Table 3).
Coverage probability and bias
We assess how often the 95% CI for inconsistency includes the assumed IF value used to generate the data. We plot the coverage probability for the 95% CI of IF in Additional file 3: Figure S3. The coverage probability is close to the nominal level (95%) for most settings. Rare events are associated with larger uncertainty and therefore provide slightly higher coverage than frequent events (average coverage across all scenarios for frequent events: 0.95, average coverage across all scenarios for rare events: 0.97). In Table 2 we provide the coverage values for various simulation scenarios when IF = 0.6. When the total number of individuals included in the network ranges from 2400 to 3000, coverage ranges from 0.95 to 0.96 (Table 2). Coverage does not change considerably when an equal or different number of trials is assumed across comparisons (Additional file 4: Figure S4).
In Additional file 5: Figure S5 and Additional file 6: Figure S6 we present the average relative bias \left(\left|\mathrm{I}\widehat{\mathrm{F}}-\mathrm{IF}\right|/\mathrm{IF}\right) for IF > 0. Relative bias decreases with the total number of individuals included in the network, the total number of trials, and the assumed IF value.
Tables 5 and 6 present the coverage probability for the 95% CI of IF using different methods to express uncertainty on the summary effects. The KHDL method reduces slightly the chances of including the true inconsistency factor in the 95% CI of IF, especially when there is no heterogeneity, as the mean treatment effects become more precise.
Characteristics of the inconsistency test in a ‘typical’ loop of evidence
The type I error in the ‘typical’ loop is 5% and 7% for subjective and all-cause mortality outcomes using IVDL and 11% and 12% using KHDL. The ‘typical’ loop of evidence with an all-cause mortality outcome has low power. The overall power ranges from 14% to 75% for IVDL and from 21% to 78% for KHDL depending on the magnitude of inconsistency. For a subjective outcome, which is associated with larger heterogeneity, power decreases to between 14% and 63% for IVDL and between 20% and 65% for KHDL. Coverage is close to the 95% nominal level (see Table 7).
Discussion
The increased use of network meta-analysis should be accompanied by caution when combining direct and indirect evidence via careful assessment of the consistency assumption. Protocols of network meta-analyses should present methods for the evaluation of inconsistency and define strategies to be followed when inconsistency is present. Several methodologies have been outlined in the literature to test inconsistency [4–9]. In this study, we evaluate the properties of the z-test for detecting inconsistency comparing direct and indirect estimates in triangular networks, generating 1000 loops for each scenario presented in Table 1. Although running more than 1000 simulations per scenario would have decreased the Monte Carlo error, we believe the main conclusions from our simulations are robust. Our scenarios are informed by previous large-scale empirical studies and hence are directly applicable [14, 22]. We use a variety of scenarios that involve the most commonly used meta-analytic tools for statistical inference regarding heterogeneity and the uncertainty of the mean treatment effects. The main advantage of this work is that it sheds light on factors that might affect the detection of inconsistency and have not been examined in the past, such as the use of the Knapp-Hartung variance for the direct summary effects. Our main findings are summarized below.

The assumption of consistency in network meta-analysis is often evaluated by performing a z-test within each loop of evidence.

The inconsistency test has low power for the ‘typical’ loop (comprising 8 trials and about 2000 participants) found in published networks. This study suggests that the probability to detect inconsistency when present is between 14% and 21% depending on the estimation method.

Power is positively associated with sample size and frequency of the outcome, and negatively associated with the underlying extent of heterogeneity.

Using the Knapp-Hartung method to estimate uncertainty around meta-analytic effects is slightly more powerful than the inverse variance approach.

Type I error converges to the nominal level as the total number of individuals included in the loop increases while coverage is close to the nominal level for most studied scenarios.

We recommend that investigators a) employ a variety of methods to evaluate inconsistency, b) interpret the magnitude of the estimated inconsistency factor and its confidence interval, c) adopt a sceptical stance towards statistically non-significant test results unless the loop of evidence contains a large amount of data, and d) always consider the comparability of the studies in terms of potential effect modifiers when judging the possibility of inconsistency.
Our simulation study shows that the inconsistency test has on average low power to detect inconsistency, in particular for rare outcomes (i.e. for IF = 0.3 and large trial sizes a rare outcome has event rate on average 0.10 with IQR (0.07, 0.13)). Bradburn et al. [31] state that the IVDL method may be “unsuitable when there are few events” and that it should be avoided. In the absence of heterogeneity and for a large number and size of trials the overall power for inconsistency might be adequate. A previous simulation study [15] also found that different ways to evaluate inconsistency (e.g. the Lu and Ades model [6], the node-splitting method [9]) have low power, in particular under random-effects models. Our study suggests that power is improved if the Knapp-Hartung method is used, especially in the absence of heterogeneity, although the type I error increases as well. This is because the estimated uncertainty around inconsistency is small with the Knapp-Hartung method. These findings agree with a previous simulation study, which showed that when heterogeneity is zero the Knapp-Hartung method yields a smaller variance for the mean treatment effects than the inverse variance method [21].
Several methods have been suggested to estimate heterogeneity τ^{2} [32, 33]. In the present study we also included the restricted maximum likelihood [34] and the empirical Bayes [35] estimators in conjunction with the inverse variance approach. Although the three estimators have different properties and performance in general, they have been shown to have comparable bias and mean squared error for estimating τ^{2} in the examined simulation scenarios (relatively small number of trials for each pairwise meta-analysis (fewer than 7) and median heterogeneity τ^{2} = 0.12) [32]. Consequently type I error, power and coverage were found to be similar between the three methods (data not shown) and we present results only from IVDL and KHDL. This agrees with an empirical study that compared five different estimators for the heterogeneity and showed that variability in the confidence intervals of the overall treatment effect was negligible across 920 Cochrane meta-analyses [36].
The inconsistency test, analogously to the heterogeneity test, has low power and we recommend that the point estimate of inconsistency and its 95% confidence interval are used instead to draw inferences about the presence and magnitude of inconsistency. In cases where the test is underpowered, the confidence intervals would include zero, small and large inconsistency values and should be interpreted as lack of evidence for or against the presence of inconsistency. If a test must be used, one possibility is to use a cut-off p-value of 0.10, as has been suggested for the heterogeneity test in pairwise meta-analysis [37, 38]. Empirical evidence showed that disagreement between direct and indirect comparisons is observed in about 1 in 10 loops, so this cut-point might be a reasonable choice [14]. In complex networks, instead of using multiple underpowered z-tests, global tests such as the design-by-treatment test can be used, although the power properties of the latter are unknown.
Some limitations in our study need to be acknowledged. We do not account for the possible impact of multi-arm trials on inconsistency and we only consider triangular networks. Our previous empirical study showed that a large majority (85%) of published networks of interventions involve trials with multiple arms, and that out of the total 1173 trials included in all 40 networks 116 (10%) were multi-arm trials. Further simulation studies are therefore needed to evaluate complex networks with multi-arm trials. In our simulation study we assume that all comparisons in the network share the same amount of heterogeneity. Turner et al. [22] showed that different amounts of heterogeneity can be expected for different outcomes or for different classes of interventions (e.g. pharmacological vs. non-pharmacological). Network meta-analyses typically consider only one outcome and often compare interventions of a similar nature. Hence the assumption of equal heterogeneities is often clinically reasonable as well as being statistically convenient. Most comparisons in networks comprise only a few studies, making estimation of heterogeneity challenging. If heterogeneity is believed to vary across comparisons, we can assume different parameters, which should be restricted to conform to special relationships according to the consistency assumption [39]. Finally, a thorough investigation of all available methods to evaluate inconsistency using realistic scenarios informed by empirical evidence would be needed for completeness [5–7].
This is the second simulation study to suggest that statistical evaluation of inconsistency has low power [15]. In our simulations we consider three-treatment networks for simplicity but analyse them using methods typically employed for network meta-analysis, e.g. assuming common heterogeneity in a one-stage analysis. As inconsistency is a property of a closed loop, we believe that our results are very relevant to full networks. Although our study is limited to simple three-treatment networks including only two-arm trials, we anticipate that the inconsistency test would show similarly low power in the presence of multi-arm studies: such studies are internally consistent and would contribute similar pairwise comparisons to evaluations of inconsistency. Further simulation studies might be needed to learn about the impact of assuming different heterogeneity parameters for different comparisons. Reliable estimation of different heterogeneity parameters will require a minimum number of studies for each comparison, a scenario which seldom occurs in published networks of interventions. The Knapp-Hartung method has been shown to be robust to the estimation of heterogeneity [21], so we suspect that conclusions would be similar to those drawn from the present study. It is therefore imperative for investigators to evaluate the assumption of consistency using epidemiological strategies and to compare carefully the involved studies with respect to the distribution of effect modifiers before embarking on data synthesis [3, 40].
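To give a rough sense of why power is low when the inconsistency estimate is imprecise, the sketch below evaluates the textbook power of a two-sided z-test under an idealised normal approximation with a known standard error. This is not the simulation procedure of the present study and will not reproduce its results; the function name, the standard errors and the choice of log(1.35) as the true inconsistency on the log odds ratio scale are assumptions made purely for illustration.

```python
import math
from statistics import NormalDist


def z_test_power(true_if, se, alpha=0.05):
    """Approximate power of a two-sided z-test for inconsistency.

    Assumes the inconsistency estimate is normally distributed around
    true_if with a known standard error se (an idealisation).
    """
    norm = NormalDist()
    crit = norm.inv_cdf(1 - alpha / 2)
    shift = abs(true_if) / se
    # Probability that the z statistic falls in either rejection region.
    return norm.cdf(-crit + shift) + norm.cdf(-crit - shift)


# Hypothetical true inconsistency: a ratio of odds ratios of 1.35.
true_if = math.log(1.35)

# Hypothetical standard errors of the inconsistency estimate.
for se in (0.10, 0.20, 0.30, 0.40):
    print(f"SE = {se:.2f}: approximate power = {z_test_power(true_if, se):.2f}")
```

Under this approximation the power equals the nominal level when the true inconsistency is zero and decreases towards it as the standard error grows, so loops with few studies or rare events leave little chance of detecting inconsistency of this size.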
Conclusions
Although the performance of the z-test for inconsistency might vary according to the method used to estimate the uncertainty of the overall mean treatment effect, the power remains generally low for the loop of evidence that typically features in networks of interventions. Particularly when data are sparse and a loop includes only a few studies or the outcome is rare, the inconsistency test is unlikely to be informative.
Abbreviations
CI: Confidence interval
DIR: Direct
IF: Inconsistency factor
IND: Indirect
IVDL: Inverse variance method using the DerSimonian and Laird estimator
IQR: Interquartile range
KHDL: Knapp-Hartung method using the DerSimonian and Laird estimator
LOR: Log-odds ratio
OR: Odds ratio
References
1. Caldwell DM, Ades AE, Higgins JP: Simultaneous comparison of multiple treatments: combining direct and indirect evidence. BMJ. 2005, 331: 897–900. 10.1136/bmj.331.7521.897.
2. Jansen JP, Fleurence R, Devine B, Itzler R, Barrett A, Hawkins N, Lee K, Boersma C, Annemans L, Cappelleri JC: Interpreting indirect treatment comparisons and network meta-analysis for health-care decision making: report of the ISPOR task force on indirect treatment comparisons good research practices: part 1. Value Health. 2011, 14: 417–428. 10.1016/j.jval.2011.04.002.
3. Salanti G: Indirect and mixed-treatment comparison, network, or multiple-treatments meta-analysis: many names, many benefits, many concerns for the next generation evidence synthesis tool. Res Synth Meth. 2012, 3: 80–97. 10.1002/jrsm.1037.
4. Bucher HC, Guyatt GH, Griffith LE, Walter SD: The results of direct and indirect treatment comparisons in meta-analysis of randomized controlled trials. J Clin Epidemiol. 1997, 50: 683–691. 10.1016/S0895-4356(97)00049-8.
5. Higgins JPT, Jackson D, Barrett JK, Lu G, Ades AE, White IR: Consistency and inconsistency in network meta-analysis: concepts and models for multi-arm studies. Res Synth Meth. 2012, 3: 98–110. 10.1002/jrsm.1044.
6. Lu GB, Ades AE: Assessing evidence inconsistency in mixed treatment comparisons. J Am Stat Assoc. 2006, 101: 447–459. 10.1198/016214505000001302.
7. White IR, Barrett JK, Jackson D, Higgins JPT: Consistency and inconsistency in multiple treatments meta-analysis: model estimation using multivariate meta-regression. Res Synth Meth. 2012, 3: 111–125. 10.1002/jrsm.1045.
8. Caldwell DM, Welton NJ, Ades AE: Mixed treatment comparison analysis provides internally coherent treatment effect estimates based on overviews of reviews and can reveal inconsistency. J Clin Epidemiol. 2010, 63: 875–882. 10.1016/j.jclinepi.2009.08.025.
9. Dias S, Welton NJ, Caldwell DM, Ades AE: Checking consistency in mixed treatment comparison meta-analysis. Stat Med. 2010, 29: 932–944. 10.1002/sim.3767.
10. Dias S, Welton NJ, Sutton AJ, Ades AE: NICE DSU Technical Support Document 4: inconsistency in networks of evidence based on randomised controlled trials. 2011, NICE Decision Support Unit. Available from http://www.nicedsu.org.uk
11. Salanti G, Marinho V, Higgins JP: A case study of multiple-treatments meta-analysis demonstrates that covariates should be considered. J Clin Epidemiol. 2009, 62: 857–864. 10.1016/j.jclinepi.2008.10.001.
12. Song F, Xiong T, Parekh-Bhurke S, Loke YK, Sutton AJ, Eastwood AJ, Alison J, Holland R, Chen YF, Glenny AM, Deeks JJ, Altman DG: Inconsistency between direct and indirect comparisons of competing interventions: meta-epidemiological study. BMJ. 2011, 343: d4909. 10.1136/bmj.d4909.
13. Xiong T, Parekh-Bhurke S, Loke YK, Abdelhamid A, Sutton AJ, Eastwood AJ, Holland R, Chen YF, Walsh T, Glenny AM, Song F: Overall similarity and consistency assessment scores are not sufficiently accurate for predicting discrepancy between direct and indirect comparison estimates. J Clin Epidemiol. 2013, 66: 184–191. 10.1016/j.jclinepi.2012.06.022.
14. Veroniki AA, Vasiliadis HS, Higgins JP, Salanti G: Evaluation of inconsistency in networks of interventions. Int J Epidemiol. 2013, 42: 332–345. 10.1093/ije/dys222.
15. Song F, Clark A, Bachmann MO, Maas J: Simulation evaluation of statistical properties of methods for indirect and mixed treatment comparisons. BMC Med Res Methodol. 2012, 12: 138. 10.1186/1471-2288-12-138.
16. Mills EJ, Ghement I, O'Regan C, Thorlund K: Estimating the power of indirect comparisons: a simulation study. PLoS One. 2011, 6: e16237. 10.1371/journal.pone.0016237.
17. Song F, Chen YF, Loke Y, Eastwood A, Altman D: Inconsistency between direct and indirect estimates remains more prevalent than previous observed. 2011, http://www.bmj.com/rapidresponse/2011/11/03/inconsistencybetweendirectandindirectestimatesremainsmoreprevalent
18. Hartung J: An alternative method for meta-analysis. Biom J. 1999, 41: 901–916. 10.1002/(SICI)1521-4036(199912)41:8<901::AID-BIMJ901>3.0.CO;2-W.
19. Knapp G, Hartung J: Improved tests for a random effects meta-regression with a single covariate. Stat Med. 2003, 22: 2693–2710. 10.1002/sim.1482.
20. Sidik K, Jonkman JN: A simple confidence interval for meta-analysis. Stat Med. 2002, 21: 3153–3159. 10.1002/sim.1262.
21. Sanchez-Meca J, Marin-Martinez F: Confidence intervals for the overall effect size in random-effects meta-analysis. Psychol Methods. 2008, 13: 31–48.
22. Turner RM, Davey J, Clarke MJ, Thompson SG, Higgins JP: Predicting the extent of heterogeneity in meta-analysis, using empirical data from the Cochrane Database of Systematic Reviews. Int J Epidemiol. 2012, 41: 818–827. 10.1093/ije/dys041.
23. DerSimonian R, Laird N: Meta-analysis in clinical trials. Control Clin Trials. 1986, 7: 177–188. 10.1016/0197-2456(86)90046-2.
24. DerSimonian R, Kacker R: Random-effects model for meta-analysis of clinical trials: an update. Contemp Clin Trials. 2007, 28: 105–114. 10.1016/j.cct.2006.04.004.
25. Sidik K, Jonkman JN: On constructing confidence intervals for a standardized mean difference in meta-analysis. Comm Stat Simulat Comput. 2003, 32: 1191–1203. 10.1081/SAC-120023885.
26. Makambi KH: The effect of the heterogeneity variance estimator on some tests of efficacy. J Biopharm Stat. 2004, 2: 439–449.
27. Engels EA, Schmid CH, Terrin N, Olkin I, Lau J: Heterogeneity and statistical significance in meta-analysis: an empirical study of 125 meta-analyses. Stat Med. 2000, 19: 1707–1728. 10.1002/1097-0258(20000715)19:13<1707::AID-SIM491>3.0.CO;2-P.
28. Deeks JJ: Issues in the selection of a summary statistic for meta-analysis of clinical trials with binary outcomes. Stat Med. 2002, 21: 1575–1600. 10.1002/sim.1188.
29. Eckermann S, Coory M, Willan AR: Indirect comparison: relative risk fallacies and odds solution. J Clin Epidemiol. 2009, 62: 1031–1036. 10.1016/j.jclinepi.2008.10.013.
30. R Development Core Team: R: a language and environment for statistical computing. 2011, Vienna, Austria: R Foundation for Statistical Computing, http://www.R-project.org. ISBN 3-900051-07-0.
31. Bradburn MJ, Deeks JJ, Berlin JA, Russell LA: Much ado about nothing: a comparison of the performance of meta-analytical methods with rare events. Stat Med. 2007, 26: 53–77. 10.1002/sim.2528.
32. Sidik K, Jonkman JN: A comparison of heterogeneity variance estimators in combining results of studies. Stat Med. 2007, 26: 1964–1981. 10.1002/sim.2688.
33. Viechtbauer W: Confidence intervals for the amount of heterogeneity in meta-analysis. Stat Med. 2007, 26: 37–52. 10.1002/sim.2514.
34. Raudenbush SW: Analyzing effect sizes: random effects models. The handbook of research synthesis and meta-analysis. Edited by: Cooper H, Hedges LV, Valentine JC. 2009, New York: Russell Sage Foundation, 295–315. 2nd edition.
35. Morris CN: Parametric empirical Bayes inference: theory and applications. J Am Stat Assoc. 1983, 78: 47–55. 10.1080/01621459.1983.10477920.
36. Thorlund K, Wetterslev J, Thabane L, Gluud C: Comparison of statistical inferences from the DerSimonian–Laird and alternative random-effects model meta-analyses – an empirical assessment of 920 Cochrane primary outcome meta-analyses. Res Synth Meth. 2012, 2: 238–253.
37. Fleiss JL: The statistical basis of meta-analysis. Stat Methods Med Res. 1993, 2: 121–145. 10.1177/096228029300200202.
38. Higgins JP, Thompson SG: Quantifying heterogeneity in a meta-analysis. Stat Med. 2002, 21: 1539–1558. 10.1002/sim.1186.
39. Lu G, Ades A: Modeling between-trial variance structure in mixed treatment comparisons. Biostatistics. 2009, 10: 792–805. 10.1093/biostatistics/kxp032.
40. Song F, Loke YK, Walsh T, Glenny AM, Eastwood AJ, Altman DG: Methodological problems in the use of indirect comparisons for evaluating healthcare interventions: survey of published systematic reviews. BMJ. 2009, 338: b1147. 10.1136/bmj.b1147.
Pre-publication history
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/14/106/prepub
Acknowledgements
GS, AAV and DM receive funding from the European Research Council (IMMA, grant Nr 260559). JPTH was funded in part by the UK Medical Research Council (programme number U105285807).
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
AAV, DM, JH and GS contributed to the conception and design of the study, and helped to draft the manuscript. AAV conducted the statistical analysis. All authors read and approved the final manuscript.
Electronic supplementary material
12874_2013_1120_MOESM1_ESM.pptx
Additional file 1: Figure S1: Type I error by sample sizes, frequency of events and loop sample size. Results are shown assuming a different number of trials (K) per comparison (K_{AB} = 1, K_{AC} = 4, K_{BC} = 7). The region within the horizontal dotted lines defines the confidence interval for the 5% nominal level. IVDL: inverse variance method using the DerSimonian and Laird estimator, KHDL: Knapp-Hartung method with the DerSimonian and Laird estimator. (PPTX 113 KB)
12874_2013_1120_MOESM2_ESM.pptx
Additional file 2: Figure S2: Power by inconsistency factor, frequency of events and loop sample size. We assume different number of trials (K) per comparison (K_{AB} = 1, K_{AC} = 4, K_{BC} = 7). Results are aggregated over different assumptions for the heterogeneity and methods to estimate the variances of the mean treatment effects. IF: inconsistency factor. (PPTX 80 KB)
12874_2013_1120_MOESM3_ESM.pptx
Additional file 3: Figure S3: Coverage probabilities of the 95% confidence interval for the inconsistency factor, frequency of events and loop sample size. We assume an equal number of trials per comparison (K_{AB} = K_{AC} = K_{BC} = K = 1, …, 7). Results are aggregated over different assumptions for the heterogeneity and methods to estimate the variances of the mean treatment effects. The region within the horizontal dotted lines defines the confidence interval for the 95% nominal level. The first summary result in each coverage probability line pertains to the case where there is a single trial per comparison and a fixed-effects model is employed. (PPTX 145 KB)
12874_2013_1120_MOESM4_ESM.pptx
Additional file 4: Figure S4: Coverage probabilities of the 95% confidence interval for the inconsistency factor (IF), frequency of events and loop sample size. We assume different number of trials (K) per comparison (K_{AB} = 1, K_{AC} = 4, K_{BC} = 7). Results are aggregated over different assumptions for the heterogeneity and methods to estimate the variances of the mean treatment effects. The region within the horizontal dotted lines defines the confidence interval for the 95% nominal level. (PPTX 87 KB)
12874_2013_1120_MOESM5_ESM.pptx
Additional file 5: Figure S5: Averaged relative bias assuming various scenarios for the inconsistency factor, the frequency of events and loop sample size. We assume equal number of trials per comparison (K_{AB} = K_{AC} = K_{BC} = K = 1, …, 7). Results are aggregated over different assumptions for the heterogeneity and methods to estimate the variances for the direct treatment effects. IF: inconsistency factor. (PPTX 127 KB)
12874_2013_1120_MOESM6_ESM.pptx
Additional file 6: Figure S6: Averaged relative bias assuming various scenarios for the inconsistency factor, the frequency of events and loop sample size. We assume different number of trials (K) per comparison (K_{AB} = 1, K_{AC} = 4, K_{BC} = 7). Results are aggregated over different assumptions for the heterogeneity and methods to estimate the variances of the mean treatment effects. IF: inconsistency factor. (PPTX 83 KB)
About this article
Cite this article
Veroniki, A.A., Mavridis, D., Higgins, J.P. et al. Characteristics of a loop of evidence that affect detection and estimation of inconsistency: a simulation study. BMC Med Res Methodol 14, 106 (2014). https://doi.org/10.1186/1471-2288-14-106
DOI: https://doi.org/10.1186/1471-2288-14-106