Background

Meta-analysis has become recognized as an objective means of summarizing evidence from disparate clinical trials [1]. It is particularly useful when the trials are small and the data are conflicting. Meta-analysis incorporates statistical approaches to pool aggregate data from clinical trials into a summary effect measure [2]. This measure then reflects the average effect of an intervention across all studies. However, meta-analysis is limited by the inclusion of poor-quality trials, which are prone to biased findings, and by the exclusion of unpublished trials. Methods for assessing the effect of these limitations on summary measures have been developed and are available [3-5].

At times, data from clinical trials may conform to continuous rate measures (events per person-time), in which the numerator represents a count of total events "x" and the denominator represents a given time duration multiplied by the number of subjects, e.g. health care visits per person-year. Data such as these are being reported more frequently in clinical trials, as evidenced by the inclusion of rate measures in recent Cochrane Systematic Reviews [6-9]. If the reported length of follow-up is the same across studies, e.g. 12 months, then meta-analysis might involve pooling the weighted within-study differences in the mean number of events per person between intervention and control groups, a method we will call the weighted mean difference (WMD) [10]. The interpretation is straightforward and reflects the change in "x" per unit time. However, if the reported length of follow-up varies across studies, e.g. 6 months versus 1 year, then meta-analysis could involve converting the per-study differences into a common metric prior to pooling. This is often accomplished by dividing the per-study differences between groups by the pooled standard deviation, a procedure known as the standardized weighted mean difference (SMD) [10]. This method accommodates varying follow-up time across studies. However, the interpretation is more difficult, since it reflects the difference between intervention and control groups measured in standard deviation units rather than natural time units.
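In symbols (the notation here is ours rather than the review's), a study i reporting treatment (T) and control (C) means, standard deviations, and sample sizes contributes:

```latex
% WMD and SMD for study i; bars denote arm means, s the arm SDs,
% n the arm sizes, and s_{p,i} the pooled standard deviation
\mathrm{WMD}_i = \bar{x}_{Ti} - \bar{x}_{Ci}, \qquad
\mathrm{SMD}_i = \frac{\bar{x}_{Ti} - \bar{x}_{Ci}}{s_{p,i}}, \qquad
s_{p,i} = \sqrt{\frac{(n_{Ti}-1)\,s_{Ti}^2 + (n_{Ci}-1)\,s_{Ci}^2}{n_{Ti}+n_{Ci}-2}}
```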

In this paper, we examined data from a recently published Cochrane Systematic Review that included continuous rate measures as outcomes. We compared different statistical approaches to pooling continuous rate measures reported with varying follow-up time. Specifically, we compared the SMD, considered the standard approach, with two alternative methods, incidence rate differences and incidence rate ratios. We examined the results from the different approaches in terms of the point estimates of treatment effect, their precision, and clinical interpretability. We are unaware of previously published studies that have attempted to address this problem.

Methods

Data were taken from a recently published Cochrane systematic review on the effects of asthma self-management education in children [11]. We selected the two outcomes involving continuous rate measures with the greatest number of contributing studies: days of school absence and emergency room (ER) visits. Our goal was to compare the standardized weighted mean difference with two alternative statistical approaches to pooling rate data, incidence rate differences and incidence rate ratios.

The standardized weighted mean difference (SMD) represents a weighted average of the per-study standardized differences in mean events per person between treatment and control groups. We first calculated standardized effect sizes for each study by subtracting the reported mean number of events in the control group from the reported mean number of events in the treatment group and dividing by the pooled standard deviation [10]. The per-study standardized effect sizes were then combined using both fixed- and random-effects models [12, 13]. The fixed-effects model is essentially a weighted average of the study-specific results in which the weight for each study is proportional to the inverse of the variance of the study-specific SMD. The random-effects model allows for variability among studies in the SMD by incorporating a term for the among-study variability into the weights. Fixed- and random-effects models will generally agree when there is little heterogeneity among studies.
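The sketch below illustrates this pooling in Python on invented summary data; it is not the review's data, and the paper's analyses were run in STATA. The random-effects step uses the DerSimonian-Laird estimate of among-study variance, one common implementation of the random-effects weights.

```python
"""Minimal sketch of fixed- and random-effects pooling of standardized
mean differences (SMDs). All numbers are invented for illustration."""
import numpy as np

# hypothetical per-study summaries: (mean_T, sd_T, n_T, mean_C, sd_C, n_C)
studies = [(2.1, 3.0, 50, 3.0, 3.2, 48),
           (1.4, 2.5, 30, 1.9, 2.4, 31),
           (4.0, 5.1, 120, 4.6, 5.0, 115)]

d, v = [], []
for mT, sT, nT, mC, sC, nC in studies:
    sp = np.sqrt(((nT - 1) * sT**2 + (nC - 1) * sC**2) / (nT + nC - 2))
    di = (mT - mC) / sp                      # per-study SMD
    d.append(di)
    # large-sample variance of the SMD
    v.append((nT + nC) / (nT * nC) + di**2 / (2 * (nT + nC)))
d, v = np.array(d), np.array(v)

# fixed effects: inverse-variance weighted average
w = 1 / v
fixed = np.sum(w * d) / np.sum(w)

# random effects: DerSimonian-Laird estimate of among-study variance tau^2
Q = np.sum(w * (d - fixed) ** 2)
tau2 = max(0.0, (Q - (len(d) - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
w_re = 1 / (v + tau2)
random = np.sum(w_re * d) / np.sum(w_re)

print(f"fixed-effects SMD = {fixed:.3f}, random-effects SMD = {random:.3f}")
```

When the among-study variance is estimated as zero, the two models coincide, matching the remark above that fixed- and random-effects results agree under little heterogeneity.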

To estimate stratified incidence rate differences (IRD) and stratified incidence rate ratios (IRR), we calculated incidence rates taking time explicitly into account. For each study, we knew the mean number of events (days absent or emergency room visits) and the number of months of observation according to the reported study design. We multiplied the mean by the sample size for each treatment arm to get the total number of events observed in each arm, e.g. the total number of days absent for all participants in the control group. We rounded this to the nearest whole number of events. To obtain the total person-time of follow-up, we assumed that there was no loss to follow-up during the study, i.e. all participants were observed for the entire length of the study. We multiplied the number of months of follow-up by the sample size for each arm to obtain the total number of person-months of follow-up. The study-specific rate of events per person-month for each arm was then the total number of events (days absent or emergency room visits) divided by the total number of person-months of observation for each arm.
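A minimal sketch of this reconstruction, with invented numbers rather than the review's data:

```python
"""Sketch of the rate reconstruction described above. Each arm's mean
event count is scaled up to a total, rounded to a whole number, and
divided by person-time computed under the assumption of complete
follow-up."""

def arm_rate(mean_events: float, n: int, months: float):
    """Return (total events, person-months, rate per person-month)."""
    events = round(mean_events * n)   # total events, nearest whole number
    person_months = months * n        # assumes no loss to follow-up
    return events, person_months, events / person_months

# hypothetical arm: mean 2.3 absences over 6 months among 40 children
print(arm_rate(2.3, 40, 6))   # (92, 240, 0.383...)
```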

The analysis of the rates used stratified IRD and IRR methods estimated in STATA (version 7). To obtain a summary stratified IRD, we used a program co-written by one of us (JAB) that implements a fixed-effects Mantel-Haenszel (M-H) procedure in STATA. Specifically, the program produced the estimates of the IRD and its variance described in Rothman and Greenland's textbook [14]. We also used an inverse-variance weighted average approach to estimate a random-effects model, first using STATA's "ird" command, saving the study-specific results, and then using the STATA command "meta" to compute the weighted average IRDs [13].
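The following sketch implements a stratified M-H rate difference of the kind described, following our reading of the estimator and variance in Rothman and Greenland [14]; the paper itself used the STATA program, and the data below are invented.

```python
"""Sketch of a Mantel-Haenszel stratified incidence rate difference.
Each stratum (study) is (a, T1, b, T0): events and person-time in the
treated arm, then events and person-time in the control arm."""
import math

strata = [(46, 240.0, 60, 246.0),    # hypothetical study 1
          (10, 180.0, 14, 186.0)]    # hypothetical study 2

num = den = var_num = 0.0
for a, T1, b, T0 in strata:
    T = T1 + T0
    num += (a * T0 - b * T1) / T     # M-H numerator contribution
    w = T1 * T0 / T                  # M-H weight for this stratum
    den += w
    var_num += w**2 * (a / T1**2 + b / T0**2)

ird = num / den
se = math.sqrt(var_num) / den
print(f"IRD = {ird:.4f} per person-month "
      f"(95% CI {ird - 1.96*se:.4f} to {ird + 1.96*se:.4f})")
```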

To obtain a summary stratified IRR, we used a fixed-effects M-H type procedure as implemented in the "ir" command in STATA, which should give results similar to fitting a Poisson regression model with indicator variables for "study." This M-H approach produces a summary estimate stratified on study. To take study-to-study variability into account, we also fit Poisson regression models allowing for clustering of the data by study, both with and without study indicator variables. The inclusion of indicator variables forces the comparison between treatments to be made within study, thereby mimicking the stratified analysis. In STATA, we also fit Poisson regression models using the "cluster" option, which uses a robust (Huber-White "sandwich") estimator of the variance [15]. The intent of fitting these models that allow for clustering was to inflate the variance estimates to account for among-study variability; as will be demonstrated, this does not affect the point estimates of treatment effect.
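As an illustration of the two model variants, the sketch below re-expresses them in Python with statsmodels rather than STATA 7; the data layout (one row per study arm, person-time as the Poisson exposure) and all numbers are our own invention.

```python
"""Sketch of Poisson models for stratified IRR estimation, with and
without a cluster-robust (sandwich) variance. Data are invented."""
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "events": [46, 60, 10, 14, 30, 38],
    "ptime":  [240.0, 246.0, 180.0, 186.0, 300.0, 288.0],
    "treat":  [1, 0, 1, 0, 1, 0],
    "study":  [1, 1, 2, 2, 3, 3],
})

# with study indicators: the treatment contrast is made within study,
# mimicking the stratified (M-H type) analysis
X = pd.get_dummies(df["study"], prefix="s", drop_first=True, dtype=float)
X["treat"] = df["treat"]
X = sm.add_constant(X)
fit = sm.GLM(df["events"], X, family=sm.families.Poisson(),
             exposure=df["ptime"]).fit()
print("IRR (stratified):", np.exp(fit.params["treat"]))

# same point estimate, but a robust (sandwich) variance clustered on
# study, analogous to STATA's "cluster" option: only the confidence
# interval changes, not the estimated treatment effect
fit_cl = sm.GLM(df["events"], X, family=sm.families.Poisson(),
                exposure=df["ptime"]).fit(cov_type="cluster",
                                          cov_kwds={"groups": df["study"]})
print("robust 95% CI for log IRR:", fit_cl.conf_int().loc["treat"].values)
```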

Our interest was in comparing the qualitative and, where possible, the quantitative results across the different methods. We were interested in differences in inference that could be made from the various models, which integrate information about the point estimates of treatment effects and the precision of their estimation but may vary in their assumptions. We also compared conclusions as to the heterogeneity of effects across studies. The methods based on weighted averages use a test of heterogeneity similar in principle to Cochran's Q statistic. The test for heterogeneity in the Poisson regression models is based on the interactions between the treatment variable and the study indicator variables. Most importantly, we were concerned with the clinical interpretability of the results. All p-values reported are two-sided and all confidence intervals are calculated at the 95% level.
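For the weighted-average methods, the heterogeneity test has the following form (a minimal sketch with invented inputs; under homogeneity, Q is approximately chi-squared with k - 1 degrees of freedom for k studies):

```python
"""Sketch of the weighted-average heterogeneity (Q) test, using
invented per-study effect estimates and variances."""
import numpy as np
from scipy import stats

est = np.array([-0.20, -0.05, -0.35, -0.10])   # per-study effects
var = np.array([0.010, 0.004, 0.020, 0.008])   # their variances

w = 1 / var
pooled = np.sum(w * est) / np.sum(w)           # fixed-effects estimate
Q = np.sum(w * (est - pooled) ** 2)            # heterogeneity statistic
p = stats.chi2.sf(Q, df=len(est) - 1)
print(f"Q = {Q:.2f}, p = {p:.3f}")
```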

Results

We illustrate the use of SMD, IRD, and IRR methods for pooling continuous rate measures using data from a published Cochrane systematic review and meta-analysis that examined the effect of self-management education on morbidity and health services outcomes in children and adolescents with asthma [11]. The meta-analysis included 32 separate trials, involving 3706 children and adolescents aged 2 to 18 years. The majority were small, randomized controlled trials that enrolled children with severe asthma. We abstracted data on two outcomes, days of school absence and ER visits, from the original published study. For each outcome, we abstracted the reported mean number of events, standard deviation, sample size, and observation time in months for treatment and control groups. We contacted study authors to obtain data missing from published reports. If appropriate measures of variance were neither reported nor obtained by author contact, we conservatively imputed pooled standard deviations from the reported t-statistic, or from the p-value when the t-statistic was not reported [16].
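A sketch of this imputation follows, using the general two-sample relations; the precise convention in [16] may differ.

```python
"""Back-solving the pooled SD from a two-sample t-statistic, or from a
p-value when only that is reported. Numbers are invented."""
import math
from scipy import stats

def pooled_sd_from_t(m1, m2, n1, n2, t):
    # rearranged two-sample t formula: t = (m1 - m2) / (sp * sqrt(1/n1 + 1/n2))
    return abs(m1 - m2) / (abs(t) * math.sqrt(1 / n1 + 1 / n2))

def t_from_p(p, n1, n2):
    # two-sided p; taking the reported bound (e.g. p = 0.05 for "p < 0.05")
    # gives a smaller t and hence a larger, more conservative imputed SD
    return stats.t.ppf(1 - p / 2, df=n1 + n2 - 2)

# hypothetical study: means 2.1 vs 3.0, n = 50 and 48, reported only "p < 0.05"
t = t_from_p(0.05, 50, 48)
print(pooled_sd_from_t(2.1, 3.0, 50, 48, t))
```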

Table 1 lists the treatment and control group sizes, mean number of events, standard deviations, rates (events/person-month), duration of follow-up, and standardized effect size for each of the 16 trials contributing data on school absences. Sample sizes ranged from 19 to 404 participants, and the duration of observation varied widely from 1 to 12 months. Most of the trials favored the treatment arm, i.e. negative effect sizes implied a reduction in school absences. However, larger studies tended to have standardized effect size estimates closer to the null.

Table 1 Characteristics of Studies Reporting on School Absences.*

Similarly, Table 2 lists the treatment and control group sizes, mean number of events, standard deviations, rates (events/person-month), duration of follow-up, and standardized effect size for each of the 12 trials contributing data on ER visits. Again, sample sizes ranged from a low of 14 to a high of 232, but the duration of follow-up was more homogeneous, with most trials reporting 12 months of observation. Again, most trials favored the treatment arm. As with school absences, larger studies tended to have standardized effect size estimates closer to the null.

Table 2 Characteristics of Studies Reporting on Emergency Room Visits.*

Table 3 presents the summary outcome measures for school absences. Effect sizes from the three methods gave qualitatively similar conclusions and suggest that treatment reduces school absences. Both fixed- and random-effects SMD gave identical estimates, since there was little to no statistical heterogeneity present (p = 0.61). IRD methods gave clinically interpretable results on the absolute scale. The fixed-effects results suggest that treatment results in an average reduction of 0.15 school absences per child per month (1.8 absences per year). Random-effects estimates were consistent with the fixed-effects results but with wider confidence intervals. IRR methods gave clinically interpretable results on the relative scale. These results suggest that treatment results in a 14% reduction in school absences. IRR estimates obtained using Poisson regression with Huber-White sandwich estimators gave a more conservative estimate than IRR estimates obtained using M-H procedures. The IRR estimate obtained without study indicators was similar to the IRR estimate with study indicators, suggesting that confounding by study was not present for this outcome (see the Appendix for further discussion of this point). Heterogeneity was statistically detected when data were pooled using IRD (p < 0.001) and IRR methods (p < 0.001) but not SMD, suggesting that treatment effects varied across studies when assessed in terms of rates, but not when assessed in terms of standard deviation units.

Table 3 Summary Outcome Measures for Days of School Absence.

Table 4 presents summary outcome measures for ER visits. Results were again qualitatively similar regardless of method and suggest that treatment reduces ER visits. Random-effects SMD gave a more conservative estimate with wider confidence intervals than the corresponding fixed-effects SMD, due to heterogeneity in effects across the studies (p = 0.05). IRD methods gave clinically interpretable results on the absolute scale: treatment results in an average reduction of 0.04 ER visits per child per month (one ER visit every other year). The estimate obtained by the random-effects model was consistent with that from the fixed-effects model but with wider confidence intervals. IRR methods gave clinically interpretable results on the relative scale: treatment results in a 23 to 34% reduction in ER visits. IRR estimates obtained using Poisson regression with Huber-White sandwich estimators gave a more conservative estimate than IRR estimates obtained using M-H procedures. The IRR estimate obtained without study indicators was closer to the null than the IRR estimate with study indicators, suggesting that confounding by study was present for this outcome (see Appendix). Heterogeneity was statistically present in IRD (p < 0.001) and IRR (p < 0.001) methods as well as for SMD for this outcome, suggesting that treatment effects varied across studies.

Table 4 Summary Outcome Measures for Emergency Room Visits.

Discussion

This paper presented three statistical methods of pooling continuous rate measures in which the denominator reflects varying duration of observation. All methods were fairly easy to implement using standard statistical software. Results were statistically consistent regardless of the method employed and suggested a significant treatment effect on average. All methods allowed for explicit adjustment for individual studies. Failure to take stratification by study into account, as illustrated in the Poisson models without study indicators, resulted in a different estimate for one outcome, ER visits, but not the other, school absences.

IRD methods gave clinically interpretable results on an absolute scale. These results suggest that treatment results in an average reduction of 0.15 school absences per person-month or roughly 2 days per person-year. These results also suggest that treatment results in an average of 0.04 fewer ER visits per person-month or roughly 1 fewer visit per person every 2 years. IRR methods gave clinically interpretable results on a relative scale. These results suggest that treatment results in a 14% reduction in school absences and a 34% reduction in ER visits.

The SMD results were not immediately clinically interpretable. On a standard deviation scale, these results suggest that treatment results in a modest reduction in school absences and ER visits. Conversion back to the original scale would allow for more clinically interpretable results but would require making an assumption about the size of the standard deviation and the event rate in the control group across studies. For standard deviations, it is not clear whether one should use a study-specific estimate of the standard deviation or an estimate pooled across studies. Additionally, the data can be skewed, in which case mean events might not appropriately represent the central tendency of the data.
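In symbols, such a back-conversion takes the following form (our notation; the choice of standard deviation is exactly the assumption at issue):

```latex
% back-conversion of a pooled SMD to the original (natural) scale;
% s^* is an assumed standard deviation, study-specific or pooled
\widehat{\Delta} = \widehat{\mathrm{SMD}} \times s^{*}
```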

Heterogeneity was statistically present for both outcomes when incidence rate-based methods were used, suggesting variability in treatment effects across studies, and for ER visits but not school absences when SMD was used. It should be kept in mind that, although all of these analyses attempt to address the same underlying substantive question (i.e., whether asthma education "works"), the SMD analyses address this question on a fundamentally different scale by converting measurements into standard deviation units. This difference in scale could well account for the different results of the heterogeneity tests.

Another alternative, which we tried but abandoned because of its non-standard nature, was simply to convert the time units from the various studies into a common scale and pool the data using WMD. We found (data not shown) slight but noticeable differences depending on whether we multiplied up for the shorter studies or down for the longer studies to achieve the common scale. For example, studies with 6-month and 12-month follow-up could be put on a common scale either by multiplying the 6-month study means and standard deviations by 2 or by dividing the 12-month study means and standard deviations by 2. These different approaches changed the per-study weights and produced slight differences in the summary measures. We believe that the fundamental problem with this approach is that it rests on the assumption that the event rates stay constant over the entire period of observation. This is also true for the rate models we did use, but unlike those models, multiplying up essentially imputes data beyond the actual period of observation. This has implications not only for the mean number of events, but possibly also for the variance estimates. For these reasons, we chose not to consider this approach any further.

There are limitations to these findings. First, we explored differences in the three approaches using data from only a single systematic review. However, the outcomes we chose had a sufficient number of contributing studies to assess for small differences among the approaches. Second, in calculating event rates for the incidence rate-based methods, we assumed complete follow-up of participants in each study. However, these methods are robust to incomplete follow-up when the number of events and the amount of time contributed by each participant are known, or when it can be assumed both that individuals lost to follow-up contribute no events or follow-up time and that loss to follow-up does not differ between treatment groups.

Conclusions

In this study, we demonstrated that the choice among the methods presented here for continuous rate measures had little effect on inference. SMD, IRD, and IRR methods all gave qualitatively similar estimates of effect and suggest that the intervention was effective for both outcomes. However, the choice of method clearly affected clinical interpretability. SMD, generally considered the standard method for analyzing rate measures of varying time duration, was not immediately interpretable. Stratified IRD allowed for clinical interpretability on an absolute scale. Stratified IRR or Poisson models allowed for clinical interpretability on a relative scale. For further discussion of the merits of absolute versus relative effects, we recommend that the reader consult additional references [10]. In addition, as we have shown, failure to incorporate study indicators in the Poisson analysis may produce different (and inappropriate) estimates of treatment effect. (For an explanation of why we consider this approach inappropriate, see the Appendix.) We recommend that statistical software packages used for meta-analysis consider adding stratified IRD and IRR procedures.

Appendix

Table 5 demonstrates the need to perform analyses stratified by study when comparing event rates between treatments. A similar argument would apply to the comparison of risks. Among epidemiologists, the principle demonstrated would be called "confounding by study"; among statisticians, it may be more familiar as an example of "Simpson's Paradox." In brief, the table presents a hypothetical example in which the baseline (control) rates differ markedly between studies. The feature that generates the problem is an imbalance in the amount of person-time between the treatment and control groups in the second study, perhaps as a result of unequal allocation of subjects to the two conditions.

Table 5 Example of Confounding by Study.

Within each study, the estimate of the relative risk is 0.5. Thus, any reasonable analysis that takes stratification by study into account (and averages the within-study treatment effects) would necessarily produce an average treatment effect of 0.5. Because of the associations noted above, the analysis ignoring study produces an estimated treatment effect of 0.32, a result clearly not representative of the results within either of the individual studies. Note that this concept is not the same as the usual concept of "heterogeneity," which generally refers to situations in which the treatment effect varies across studies. In our example, the treatment effect is constant across studies (on the relative rate scale), although the baseline rate varies dramatically between the studies.
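The following sketch reproduces the phenomenon with our own invented numbers (not those of Table 5): the within-study rate ratio is 0.5 in both strata, yet the crude ratio computed by summing events and person-time across studies falls far from 0.5, because baseline rates and the treated/control person-time split both differ by study.

```python
"""Sketch of confounding by study (Simpson's Paradox) for rate data.
Each study is (treated events, treated person-time, control events,
control person-time); numbers are invented for illustration."""

studies = [(20, 100.0, 40, 100.0),   # high baseline rate, balanced arms
           (4, 400.0, 2, 100.0)]     # low baseline rate, unbalanced arms

for i, (a, T1, b, T0) in enumerate(studies, 1):
    print(f"study {i}: IRR = {(a / T1) / (b / T0):.2f}")   # 0.50 in each

# collapsing over study: sum events and person-time, then take the ratio
a = sum(s[0] for s in studies); T1 = sum(s[1] for s in studies)
b = sum(s[2] for s in studies); T0 = sum(s[3] for s in studies)
print(f"crude (ignoring study): IRR = {(a / T1) / (b / T0):.2f}")  # 0.23
```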