Introduction

Optimal management of anemia in patients with chronic kidney disease (CKD) remains a controversial and polarizing issue. This article focuses on design and methodological issues and is not meant to be a systematic review or meta-analysis of all available trials; one such analysis was published recently. Four large clinical trials are critiqued: US Normal Hematocrit (USNH) [1], Canadian European Normalization of Hemoglobin (CENH) [2], Cardiovascular Risk Reduction by Early Anemia Treatment with Epoetin Beta (CREATE) [3] and Correction of Hemoglobin and Outcomes in Renal Insufficiency (CHOIR) [4]. Each was performed in a relatively specific group of anemic patients with CKD, and each had a sample size exceeding 500. The reader is referred to the primary articles for detailed descriptions of each study. While anemia management clearly differs in pediatric and adult CKD patients [5, 6], large-scale trials have yet to be undertaken in pediatric CKD. Nevertheless, trials from adult populations may help to inform current practice and the design of future trials in pediatric CKD.

Idiosyncrasies of target trials

Treatment target trials differ profoundly from double-blind, placebo-controlled trials, and these differences often impede trial interpretation. For example, target allocations are not usually concealed, a strategy that may help investigators to avail themselves fully of the therapeutic armamentarium to achieve treatment targets. Even trials that try to conceal target allocations tend towards being unmasked over time, as, all things being equal, patients requiring more intensive efforts for an unknown target to be reached are more likely to have been assigned to the intensive treatment arm.

Non-randomly assigned treatments (co-interventions) that may themselves influence the primary outcome are often used in the interval between randomization and primary outcome assessment. Highly imbalanced co-interventions can cloud the mechanisms underlying beneficial or harmful effects in primary treatment comparisons. The extent of co-intervention in hemoglobin target trials is extreme, as one sets out with the intention of using different amounts of erythropoiesis-stimulating agents (ESAs), intravenous iron, anti-hypertensive agents and transfusions.

Another type of trial which systematically introduces non-random elements involves immediate or delayed intervention. In the delayed intervention arm, treatment is predicated on a non-random element (time) that is not controlled by the study design. If one truly wishes to know the effects of hemoglobin level and duration of anemia, a factorial design would be required, such as one where patients are randomly assigned to one of four groups: maintain current hemoglobin level throughout; maintain current hemoglobin level, switch to higher hemoglobin level after a pre-specified time; immediate increase to higher hemoglobin level, maintain this level throughout; immediate increase to higher hemoglobin level, switch to lower hemoglobin level after the pre-specified time. In practice, such a design would be extremely difficult to implement.
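The four-group layout described above crosses the initial hemoglobin target with a change of target after the pre-specified time. As a minimal sketch of how such an allocation could be generated, the following Python example uses permuted blocks of four. The blocking scheme, arm labels, function name and seed are all assumptions made for illustration; none of the four trials discussed here used this design.

```python
import random
from collections import Counter

# The four arms of the hypothetical design:
# (initial hemoglobin target, target after the pre-specified switch time).
ARMS = [
    ("current", "current"),  # maintain current level throughout
    ("current", "higher"),   # maintain, then switch to the higher level
    ("higher", "higher"),    # immediate increase, maintained throughout
    ("higher", "lower"),     # immediate increase, then switch to the lower level
]


def allocate(n_patients: int, seed: int = 0) -> list[tuple[str, str]]:
    """Assign patients to arms using permuted blocks of four (one slot per arm)."""
    rng = random.Random(seed)  # fixed seed so the illustration is reproducible
    allocation: list[tuple[str, str]] = []
    while len(allocation) < n_patients:
        block = ARMS.copy()
        rng.shuffle(block)  # random order of the four arms within each block
        allocation.extend(block)
    return allocation[:n_patients]


assignments = allocate(400)
arm_counts = Counter(assignments)
for arm, count in arm_counts.items():
    print(arm, count)
```

Blocking keeps the four arms balanced in size as recruitment proceeds; it does nothing to ease the practical difficulties of running such a trial.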

Trial participants

Generalizability of trials is largely a matter of judgment, and numerical methods are not available to guide this process. It is very useful to know the number of potentially eligible subjects, the number approached and the number recruited. Unfortunately, very few trials report these items. Table 1 details the enrollment criteria used in the four studies under consideration. It is evident that no two trials have identical enrollment criteria, and efforts to combine studies based on the premise of homogeneity of enrolled subjects may be suspect. Some notable differences between studies include: overt vs incipient anemia; overt vs absent cardiac disease; end-stage renal disease vs non-end-stage renal disease.

Table 1 Enrollment criteria (GFR glomerular filtration rate, E exclusion criterion, BP blood pressure, RRT renal replacement therapy, I inclusion criterion, TSAT transferrin saturation)

Interventions

Table 2 illustrates hemoglobin targets in the four trials under consideration and the therapeutic strategies used to meet these targets. Only CENH incorporated concealment of treatment targets. While concealing treatment allocation from patients is intuitively important for quality of life outcomes, concealment from patients, healthcare professionals and outcome assessors may be important for several other reasons. Investigator blinding may be important for rating ‘hard’ clinical events, even when these are subsequently adjudicated by blinded clinical event committees. For example, basic physiological constructs suggest that profound anemia and polycythemia predispose to congestive heart failure and vascular thrombosis, respectively. Knowing the assigned target hemoglobin level may influence site investigators when confronted with classically difficult diagnostic challenges, such as differentiating extracellular fluid volume expansion from congestive heart failure and non-specific chest pain from angina pectoris. Unfortunately, blinded event committees have no control over what is written in case records at the site level by unblinded investigators and cannot correct site-level biases.

Table 2 Interventions

Erythropoiesis-stimulating agents (ESAs) are known to predispose to hypertension [7], and active surveillance of hemoglobin levels, ESA dose and blood pressure levels is desirable in clinical trials of different hemoglobin targets. Of the trials examined in this article, only CENH incorporated real-time, centrally controlled monitoring procedures for hemoglobin, iron and blood pressure levels, with weekly or bi-weekly treatment recommendations transmitted to the study sites within days of measurement of these parameters.

Table 2 shows the different strategies used to achieve target hemoglobin levels. Once again, it is clear that the degree of heterogeneity between studies is large. Although CREATE and CHOIR both examined patients with CKD that did not require dialysis, the former used a strategy of delayed intervention in one treatment arm, while the latter used one of immediate intervention; in addition, the treatment strategy in the latter trial is impossible to emulate, as the epoetin dosing strategy was only described for the first 3 weeks of the study. In this regard, it is debatable whether many clinicians would begin treating epoetin-naïve patients with chronic kidney disease with 10,000 units of epoetin per week, especially when the intent was to maintain current hemoglobin levels.

Analytical plans

With regard to primary outcomes, no two studies were alike (Table 2). Some of the implications of composite endpoints (used in three of these studies) may be worth exploring. In the CHOIR study, for example, the primary end point was the time to first occurrence of death, myocardial infarction, hospitalization for congestive heart failure (that did not include renal replacement therapy) or stroke; renal replacement therapy was a censoring event. With this design, subjects with the most severe episodes of heart failure (those requiring dialysis) and subjects receiving renal replacement therapy are censored. From an analytical perspective, they are treated identically to subjects exiting the study without any clinical event. With composite outcomes, only the first event is counted when a patient experiences multiple components of the composite outcome in rapid sequence. Thus, patients in whom myocardial infarction is quickly followed by congestive heart failure are treated identically to those with myocardial infarction alone.
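The consequences of a time-to-first-event composite with a censoring event can be made concrete with a small Python sketch. The event labels, the function name and the patient histories below are hypothetical and purely illustrative; they are not taken from CHOIR or from any of the other trials.

```python
from typing import NamedTuple

# Components of a hypothetical composite primary outcome and its censoring event.
COMPOSITE = {"death", "mi", "chf_hospitalization", "stroke"}
CENSORING = {"rrt"}  # renal replacement therapy ends follow-up without an event


class Outcome(NamedTuple):
    time: float   # months from randomization to first event or censoring
    event: bool   # True if a composite component occurred first


def primary_outcome(history: list[tuple[float, str]], follow_up_end: float) -> Outcome:
    """Reduce a patient's event history to a single time-to-first-event record."""
    for time, kind in sorted(history):
        if time > follow_up_end:
            break
        if kind in COMPOSITE:
            return Outcome(time, True)    # later events are never counted
        if kind in CENSORING:
            return Outcome(time, False)   # censored, like an event-free exit
    return Outcome(follow_up_end, False)


mi_only = primary_outcome([(6.0, "mi")], follow_up_end=12.0)
mi_then_chf = primary_outcome([(6.0, "mi"), (6.5, "chf_hospitalization")], follow_up_end=12.0)
rrt_then_chf = primary_outcome([(4.0, "rrt"), (5.0, "chf_hospitalization")], follow_up_end=12.0)
event_free = primary_outcome([], follow_up_end=12.0)

print(mi_only, mi_then_chf, rrt_then_chf, event_free)
```

In this sketch the patient with myocardial infarction followed by heart failure produces exactly the same record as the patient with myocardial infarction alone, and the patient who starts renal replacement therapy contributes no event, even though heart failure follows.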

Sample size considerations are important in clinical trials. It is critical to note the planned minimum detectable between-group difference in primary outcome rates, as well as that observed in practice. Plans for interim analyses and stopping rules deserve close scrutiny. Stopping rules are frequently employed in trials, with pre-defined mathematical rules used to detect unexpectedly early but clear differences in primary outcome rates. Figure 1 shows a typical example, using the O’Brien–Fleming boundary method [8] for a trial planning a total of four treatment comparisons; alpha, the chance of a false positive result, is set at the conventional level of 0.05 in this two-sided comparison. It is important to note that, because of multiple comparisons, each separate analysis consumes a portion of the available alpha (hence the term ‘alpha spending function’). Finally, it should also be noted that two-sided boundary P values for rejecting the null hypothesis are considerably less than 0.05 at all analyses except the last.

Fig. 1 O’Brien–Fleming boundary method for a clinical trial with power 0.80 and conventional alpha 0.05 in which four primary outcome comparisons (three interim and one final) were planned to occur at equal intervals. Open circles represent Z-score values to reject the null hypothesis at each analysis, and the corresponding figures are the one-sided nominal P values. x represents the cumulative alpha spent at each interim analysis. For two-tailed boundaries, multiply by a factor of 2
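The idea of alpha being consumed across analyses can be sketched numerically. The Python example below uses the Lan–DeMets O’Brien–Fleming-type spending function, alpha(t) = 2 − 2Φ(z / √t), where t is the fraction of planned information accrued and z is the two-sided critical value for the overall alpha. This is a widely used approximation and is not necessarily the calculation used to draw Fig. 1; the function name and the four equally spaced looks are assumptions made for illustration.

```python
from statistics import NormalDist


def obf_alpha_spent(info_fraction: float, alpha: float = 0.05) -> float:
    """Cumulative two-sided alpha spent at an information fraction in (0, 1]."""
    if not 0.0 < info_fraction <= 1.0:
        raise ValueError("info_fraction must lie in (0, 1]")
    std_normal = NormalDist()
    z_final = std_normal.inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = 0.05
    return 2.0 * (1.0 - std_normal.cdf(z_final / info_fraction ** 0.5))


looks = [0.25, 0.50, 0.75, 1.00]  # three interim analyses and one final analysis
cumulative = [obf_alpha_spent(t) for t in looks]
# Alpha newly spent at each look is the increase in the cumulative total.
incremental = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]

for t, total, step in zip(looks, cumulative, incremental):
    print(f"t={t:.2f}  cumulative alpha={total:.5f}  spent at this look={step:.5f}")
```

Very little alpha is spent at the early looks, which is why early stopping under this type of boundary requires extreme test statistics, and most of the 0.05 remains available for the final analysis.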

Among the four trials under review, USNH reported comparative findings that accounted for multiple comparisons; this issue is moot for CENH and CREATE, given that the null hypothesis was not rejected. For CHOIR, four interim analyses were planned; the “data and safety monitoring board recommended that the study be terminated in May 2005 at the time of the second interim analysis, even though neither the efficacy nor the futility boundaries had been crossed, because the conditional power for demonstrating a benefit for the high-hemoglobin group by the scheduled end of the study was less than 5% for all plausible values of the true effect for the remaining data” [5]. Formal reporting of actual test statistics at each interim analysis, perhaps plotted against the planned boundary conditions, would considerably enhance the critical appraisal process for CHOIR. Another approach that might enhance interpretation would be to report event numbers within treatment groups, risk ratios and P values at each of the interim analyses, as well as the final analysis. Needless to say, the decision to terminate a trial also includes patient-safety variables that are not a part of the primary study outcome. While both CREATE and USNH were terminated early without formally reaching a primary outcome stopping boundary, it is evident that there were safety issues in both trials, with imbalanced rates of renal replacement therapy in the former and imbalanced rates of vascular access thrombosis in the latter. At the time that CHOIR was terminated, it was difficult to see any rationale for trial termination, as no differences in primary outcome rates were apparent and non-primary major adverse event rates were similar in both arms.
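The conditional power cited in the quotation from the CHOIR report is the probability, given the interim data, that the final analysis would cross the efficacy threshold. A minimal Python sketch of the standard B-value calculation follows. It assumes a single fixed final critical value (ignoring any group-sequential adjustment), and the interim Z-score and information fraction are hypothetical values chosen for illustration, not figures from CHOIR.

```python
from statistics import NormalDist
from typing import Optional


def conditional_power(z_interim: float, info_fraction: float,
                      drift: Optional[float] = None,
                      alpha_one_sided: float = 0.025) -> float:
    """Probability that the final Z-score exceeds the critical value, given the interim Z.

    drift is the assumed expected final Z-score for the remaining data;
    if omitted, the current trend (interim estimate) is used.
    """
    if not 0.0 < info_fraction < 1.0:
        raise ValueError("info_fraction must lie strictly between 0 and 1")
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha_one_sided)
    b_value = z_interim * info_fraction ** 0.5      # B(t) = Z(t) * sqrt(t)
    if drift is None:
        drift = b_value / info_fraction             # current-trend assumption
    remaining = 1.0 - info_fraction
    shortfall = z_crit - b_value - drift * remaining
    return 1.0 - std_normal.cdf(shortfall / remaining ** 0.5)


# Hypothetical interim result: half the information, trend against the high target.
std_normal = NormalDist()
design_drift = std_normal.inv_cdf(0.975) + std_normal.inv_cdf(0.80)  # 80% power design
cp_trend = conditional_power(z_interim=-1.0, info_fraction=0.5)
cp_design = conditional_power(z_interim=-1.0, info_fraction=0.5, drift=design_drift)
print(f"current trend: {cp_trend:.6f}  design effect: {cp_design:.4f}")
```

Evaluating the function over a range of drift values is one way of examining the phrase "all plausible values of the true effect"; reporting the interim test statistics would allow readers to repeat such a calculation.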

With well-conducted trials, primary outcome events should be collated in as timely a manner as possible; in particular, when primary outcome-based stopping rules are about to be applied at a planned interim analysis, it is reasonable to expect that a high proportion of the primary events will be available to data and safety monitoring boards. For example, it would be useful to know how many events accrued between the last interim analysis and study termination. This information is not available for USNH and CREATE. While not available in the original study publication, this important information is contained in a clinical study report for CHOIR, available at the US National Institutes of Health website: http://www.clinicaltrials.gov [9, 10]. In that report, it is apparent that a total of 37 primary outcome events were counted at the first interim analysis, 145 at the last interim analysis and 222 shortly thereafter at the termination of the study. It is notable that 77 of the 222 primary events (35%) were added in the short interval between the last interim analysis and study completion.
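The event counts quoted from the clinical study report can be checked with simple arithmetic. The short Python sketch below uses only the three counts given above; the variable names are illustrative.

```python
# Primary outcome event counts quoted from the CHOIR clinical study report.
events_first_interim = 37
events_last_interim = 145
events_at_termination = 222

# Events added between the last interim analysis and study termination.
added_events = events_at_termination - events_last_interim

# Expressed as a share of the final total and as an increase over the interim count.
share_of_final_total = added_events / events_at_termination
relative_increase = added_events / events_last_interim

print(f"added events: {added_events}")
print(f"share of final total: {share_of_final_total:.1%}")
print(f"increase over last interim count: {relative_increase:.1%}")
```

The late-accruing events make up about 35% of the final total, which corresponds to an increase of roughly 53% over the count available at the last interim analysis.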

The over-arching rationale of randomization is to generate two groups that are identical for all measured and all unmeasured patient characteristics. It is worth considering, in advance, whether plans will be in place to deal with the likely occurrence of imperfect randomization. With perfect randomization, the estimated effect of an intervention should be insensitive to adjustment for baseline characteristics. If apparent treatment effects disappear when adjustment is made for baseline characteristics, the only safe course of action is to conclude that the null hypothesis cannot be rejected. Of the four studies, USNH and CENH formally reported plans for baseline covariate adjustment for primary outcome comparisons of the study interventions [2, 3].

Trial results

Even with well-designed trials, clear and comprehensive reporting can greatly help interpretation. The ability to publish supplementary information on journal websites means that a trade-off between readability and comprehensiveness should no longer be necessary. It is important that all patients entering a trial be accounted for and attributed at the conclusion of the trial. Similarly, when sequential measures are being compared, it is important to report the numbers being assessed, as well as the timing of these assessments.

In Table 3, baseline characteristics in the four major studies are compared by treatment arm. Statistically significant differences in baseline characteristics were present in each of the four studies. It seems natural to question whether imbalances in hypertension and history of coronary artery bypass surgery (CHOIR [5]), body mass index and beta blocker use (CREATE [4]), angina pectoris (USNH [2]) and age (CENH [3]) could have biased treatment comparisons; by extension, having observed these imbalances, it seems natural to expect analyses of the primary outcomes that control for them.

Table 3 Study results (CABG coronary artery bypass graft, CV cardiovascular, NHANES National Health and Nutrition Examination Survey, GFR glomerular filtration rate, LASA linear analog self assessment, KDQ kidney disease questionnaire, SF-36 short-form 36-item health survey, LV left ventricular, LVH left ventricular hypertrophy, FACIT functional assessment of chronic illness therapy, KDQOL kidney disease quality of life)

Table 3 also shows a comparison of hemoglobin levels achieved in the four studies. Inspection of the temporal evolution of hemoglobin levels showed that the initial rate of rise of hemoglobin in the first month was greater in CHOIR [5] than in USNH [2], CENH [3] and CREATE [4]. During the maintenance phase, none of the studies consistently achieved planned hemoglobin levels in the high target groups; statistical separation between the groups, however, was clear. Table 3 also compares epoetin doses used to achieve and maintain hemoglobin levels. Epoetin doses were markedly higher in the two studies from the United States of America [2, 5]. In the non-dialysis studies, these differences may have arisen, in part, by design, as all patients in CHOIR were treated with 10,000 U/week of epoetin alfa for 3 weeks [5], five times the recommended starting dose in the early intervention arm of CREATE [4]. A between-group difference in blood pressure levels was seen in the CHOIR study [4]. The other three studies showed equivalent blood pressure levels, albeit with a requirement for more antihypertensive agents in the high target arm in CENH [13].

With unadjusted analysis, CREATE and USNH showed no differences in the primary outcome of cardiovascular events [2, 4]; in contrast, there were more cardiovascular events with the high hemoglobin target in CHOIR [4]. Of the three trials examining cardiovascular outcomes, only USNH reported in the primary study publication a comparison of the primary outcome with adjustment for baseline variables, with findings almost identical to those without covariate adjustment [2]. With CHOIR, the clinical study report alluded to above did report a comparison of primary outcomes in which adjustment had been made for several baseline covariates, such as age, gender, race, renal history, National Health and Nutrition Examination Survey (NHANES) scores for clinical variables, diabetes, glomerular filtration rate, albumin, reticulocyte count, and iron status. In that analysis, the difference in the primary outcome was no longer statistically significant: the adjusted P value was 0.111, as opposed to 0.03 in the unadjusted analysis [9, 10]. This finding is noteworthy. With truly random assignment of patient-related characteristics that affect the primary outcome, adjustment for these characteristics should have no effect when randomly assigned treatments are compared. In consequence, it is impossible to refute the hypothesis that imbalanced assignment of non-treatment factors led to the observed disparity in event rates seen on unadjusted analysis. Once again, one is struck by the heterogeneity of the trial results, at least with unadjusted analytical strategies.

Quality of life was a secondary outcome in each of these trials. As such, these trials cannot be viewed as definitive for quality of life, even though the sample size in each of these studies was large. In addition, none of these trials concealed treatment allocation completely from patients, site investigators and outcome assessors. Three of these trials formally reported quality of life comparisons, albeit with incompletely overlapping arrays of instruments [24]. If one accepts that potential quality-of-life effects are mediated through changes in hemoglobin levels, it is important to know the timing of these assessments, as one would not expect to see quality of life differences before hemoglobin levels have separated. Two of the trials, including the single trial with patient blinding, reported quality of life benefits with higher hemoglobin targets [2, 3]. In contrast, the CHOIR study showed no clear benefit, and even a potential loss of quality of life with the higher hemoglobin target [in the emotional role subscale of the short-form 36-item (SF-36) instrument]; these findings are difficult to interpret, as the effect of imbalanced baseline characteristics on quality of life comparisons is unknown, and the numbers of patients and timing of quality of life assessments were not reported [4].

Two of the studies had echocardiographic outcomes, and higher hemoglobin targets had no effect on ventricular size [2, 3]. Three of the trials showed more unanticipated, non-primary, serious adverse events in the group assigned to the higher target. For example, the risk of renal replacement therapy was greater in CREATE [3], in spite of similar rates of change of glomerular filtration rate; vascular access loss was greater in USNH [1], and stroke was greater in CENH [2].

Conclusions

Even though a mere four trials were considered in this article, the differences between trials were remarkable at each level considered. Enrollment criteria, interventions studied, blinding, success of randomization and study findings showed important differences between studies. It is debatable, then, whether techniques such as meta-analysis should be employed in this universally heterogeneous environment. Similarly, it seems inadvisable to base one’s therapeutic approach entirely on the findings of a single study. Presumably, depending on the specific question, studies with better methodology should carry more weight when study findings are applied to patients. It would also seem reasonable to add a note of caution about studies employing interventions few clinicians would consider in day-to-day practice. Unfortunately, despite many years of trials, the ideal approach to treating anemia in patients with chronic kidney disease remains to be decided and continues to generate controversy. Considerably more research is needed, and the quality of the evidence that accrues from this research will be entirely dependent on the quality of the methods used. In the published literature, methodological issues deserve careful attention before research findings are translated into clinical practice.