Introduction

When the Scottish epidemiologist Archie Cochrane suggested that clinical practice should principally be guided by rigorously designed evaluations, in particular randomized clinical trials (RCTs), the reaction of the medical profession was largely negative. Critics argued that relying on impersonal, statistically derived "evidence" based on averages to determine clinical decision-making was antithetical to the practice of medicine, which should instead be based on a physician's expertise, acumen and clinical experience, and on knowing the individual patient and considering what is best for each person given their individual circumstances and needs [1–3].

Although "evidence-based medicine" has become the dominant paradigm for shaping clinical recommendations and guidelines, recent work demonstrates that many clinicians' initial concerns about "evidence-based medicine" come from the very real incongruence between the overall effects of a treatment in a study population (the summary result of a clinical trial) and deciding what treatment is best for an individual patient given their specific condition, needs and desires (the task of the good clinician) [47]. The answer, however, is not to accept clinician or expert opinion as a replacement for scientific evidence for estimating a treatment's efficacy and safety, but to better understand how the effectiveness and safety of a treatment varies across the patient population (referred to as heterogeneity of treatment effect [HTE]) so as to make optimal decisions for each patient.

The conventional method of examining whether treatment effects vary in a trial population is to divide patients into subgroups based on potentially influential characteristics. The main problem with this approach is that there are too many characteristics that can plausibly influence treatment effect. The result is a myriad of subgroup analyses that are typically both underpowered and, because of multiple comparisons, vulnerable to spurious false-positive results. For these reasons, subgroup analyses are usually "exploratory" and rarely actionable, leaving the clinician to assume that all patients meeting trial inclusion criteria should be treated similarly.

Herein, we propose a framework that directly addresses the problem of multiplicity in two ways. First, our framework prioritizes the analysis and reporting of multivariate risk-based HTE over conventional "one-variable-at-a-time" subgroup analysis. This recommendation is based on an understanding that HTE emerges from just a few fundamental risk dimensions: the risk of the primary study outcome (the main focus of our proposed approach), competing risk, the risk of treatment-related harm and direct treatment-effect modification [5–8]. These dimensions can often be summarized using multivariate prediction models, greatly simplifying subgroup analyses and substantially improving statistical power [9]. Second, this framework proposes that other subgroup analyses should be explicitly labeled either as primary subgroup analyses (well motivated by prior evidence and intended to produce clinically actionable results), which should be few in number and appropriately adjusted for multiple comparisons, or as secondary (exploratory) subgroup analyses (performed to inform future research).

Why the overall result from a clinical trial is sometimes unreliable for guiding clinical practice

When considering whether a patient is likely to benefit from a therapy, the most relevant measure of treatment effect is the absolute risk reduction (ARR) of a treatment (see Appendix 1), or its reciprocal, the number needed to treat (NNT) (see Appendix 1) [10, 11]. It is well known that a study's overall ARR or NNT will often not reflect a treatment's true ARR for many people in the trial, since a 25% relative risk reduction (RRR) (see Appendix 1) produces much more absolute benefit in high-risk patients than it does in low-risk patients (resulting in substantial HTE). For example, Table 1 shows results for a hypothetical treatment that reduces all study subjects' risk by 25%. The overall NNT of 50 greatly underestimates the benefit for high-risk subjects (NNT = 20) and greatly overestimates the benefit for the typical patient (NNT = 100).

Table 1 How summary results of clinical trials can be misleading even when everyone gets the same relative risk reduction.
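The arithmetic behind this pattern can be reproduced with a short calculation. The sketch below is illustrative only: the 25%/75% split and the baseline risks of 20% and 4% are assumed values chosen to yield the NNTs quoted above (20, 100, and 50 overall), and are not necessarily the exact composition of Table 1.

```python
# Minimal sketch (assumed numbers): a constant 25% relative risk reduction
# applied to risk strata with different baseline risks. The 25%/75% split and
# the baseline risks of 20% and 4% are hypothetical values chosen to reproduce
# the NNTs quoted in the text; Table 1 may use different figures.
RRR = 0.25

strata = {
    "high risk (25% of subjects)":    {"weight": 0.25, "baseline_risk": 0.20},
    "typical risk (75% of subjects)": {"weight": 0.75, "baseline_risk": 0.04},
}

overall_cer = sum(s["weight"] * s["baseline_risk"] for s in strata.values())

for name, s in strata.items():
    arr = s["baseline_risk"] * RRR          # absolute risk reduction in this stratum
    print(f"{name}: ARR = {arr:.1%}, NNT = {1 / arr:.0f}")

overall_arr = overall_cer * RRR             # same RRR applied to the overall event rate
print(f"overall: CER = {overall_cer:.1%}, ARR = {overall_arr:.1%}, "
      f"NNT = {1 / overall_arr:.0f}")
```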

Indeed, because a minority of high-risk patients may account for most trial adverse outcomes, and because even a small degree of treatment-related harm can nullify or outweigh benefits in low-risk patients, it does not take extreme assumptions to produce scenarios in which almost all individuals in the trial have an ARR that is substantially lower than that suggested by the summary results reported in the trial [6, 12, 13]. For example, Table 2 shows results that would emerge if the treatment reduces disease-related risk by 25% (just as in Table 1) but now also carries a 2 in 1000 risk of serious treatment-related harm (due to adverse events or major side-effects). In Scenario #1, the clinical trial's overall result suggests that the treatment has a moderate benefit (RRR = 12.5% and NNT = 100), despite the fact that 75% of study subjects received absolutely no net benefit (i.e., treatment-related harm equals treatment benefit). In Scenario #2, we see that if the difference in outcome risk between low- and high-risk patients is increased (i.e., the risk strata are more dissimilar), the summary results can still suggest an overall benefit of treatment even though the treatment risks outweigh the treatment benefits for 75% of study subjects (Table 2).

Table 2 How summary results can obscure situations where the typical patient receives no benefit or risks net harm
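The mechanics of such scenarios can be illustrated with a short calculation. In the sketch below, the 25% relative risk reduction and the 2-in-1000 harm come from the text; the 0.8% baseline risk is the level at which that harm exactly cancels the benefit, while the 10% high-risk baseline and the 75%/25% split are purely illustrative assumptions and do not reproduce the exact figures of Table 2.

```python
# Minimal sketch with assumed numbers (not the actual Table 2 scenarios):
# a 25% relative reduction in disease-related risk combined with a fixed
# 2-in-1000 risk of serious treatment-related harm. Subjects with a 0.8%
# baseline risk gain exactly 0.2% from treatment, so the harm cancels their
# benefit; the 10% high-risk baseline is purely illustrative.
RRR, HARM = 0.25, 0.002

strata = [
    {"name": "low risk (75% of subjects)",  "weight": 0.75, "baseline_risk": 0.008},
    {"name": "high risk (25% of subjects)", "weight": 0.25, "baseline_risk": 0.10},
]

overall_cer, overall_net_arr = 0.0, 0.0
for s in strata:
    net_arr = s["baseline_risk"] * RRR - HARM   # benefit minus harm in this stratum
    overall_cer += s["weight"] * s["baseline_risk"]
    overall_net_arr += s["weight"] * net_arr
    print(f"{s['name']}: net ARR = {net_arr:.2%}")

# The overall summary still looks favorable even though 75% of subjects gain nothing.
print(f"overall: net RRR = {overall_net_arr / overall_cer:.1%}, "
      f"net NNT = {1 / overall_net_arr:.0f}")
```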

While these examples illustrate cases in which the absence of risk-based analysis will result in harmful (or merely wasteful) over-treatment, under certain circumstances the opposite may also occur: a treatment's effect may be null overall even though it provides substantial benefit in a patient subgroup (typically one at high risk for the outcome of interest or at especially low risk of treatment-related harm) [14, 15].

Why risk stratified analyses should be performed whenever feasible

Although the degree of heterogeneity in risk shown in Tables 1 and 2 may seem extreme, such variability is actually quite common when risk heterogeneity is assessed using a multivariable prediction tool. It has been documented that outcome rates in the highest risk quartile (the 25% of study subjects with the highest predicted risk) in large clinical trials are often 5 to 20 times higher than in the lowest risk quartile [5, 16–20]. While the degree of risk heterogeneity may vary across medical domains, multiple independent risk factors exist for virtually any clinical outcome that would be the target of a therapeutic trial, and therefore substantial risk heterogeneity should be common. In turn, the presence of risk heterogeneity mathematically implies the presence of HTE on the absolute risk scale, regardless of whether there is also HTE on the relative risk scale.

Recent research has demonstrated that, even when there are large and clinically important differences in treatment effects across risk groups, conventional subgroup analyses (which assess HTE "one variable at a time") are inadequate to detect these differences because they do not account for the fact that multiple variables determine each patient's risk simultaneously [6, 9, 21–24]. Instead, they examine treatment-effect differences between groups that differ on only a single variable, falsely concluding a "consistency of treatment effect" across subgroups simply because the groups compared are more similar than dissimilar. Additionally, because conventional subgroup analyses involve multiple comparisons and split the overall sample into smaller sub-samples, they are both underpowered for detecting genuine subgroup effects (prone to false negatives) and, even more commonly, prone to false-positive findings [25–31]. Clinical trials, so analyzed, can thus result in treatment recommendations and guidelines that promote substantial over- and under-treatment.

There are better alternatives to one-variable-at-a-time subgroup analyses. Multivariable subgroup analysis is theoretically possible and has been shown to be potentially useful [5], but statistical power is usually inadequate in anything other than pooled analyses of data from multiple trials. Risk-based analyses using multivariable risk prediction tools are more often feasible and carry a lower risk of false-positive findings than single-variable subgroup analysis, when employed as a single pre-specified analysis that avoids the multiplicity of comparisons inherent in testing each sub-grouping variable separately [9]. Moreover, such an analysis will often have greater statistical power, as it compares patients who differ in multiple important characteristics simultaneously. Otherwise undetected yet clinically meaningful differences in relative treatment benefit have been demonstrated in many areas where multivariate risk-based approaches have been applied, most notably cardiovascular and cerebrovascular disease, but in other fields as well (Table 3).

Table 3 Examples of Clinically Important Risk-based Heterogeneity of Treatment Effect

A proposal for reporting clinical trials to provide more information on clinically important heterogeneity in treatment effects (HTE)

Several recent papers have addressed important considerations when conducting and interpreting subgroup analyses [5–7, 9, 14, 22, 27, 30, 32–40], but they did not recommend a specific framework for reporting HTE, and only a few previous papers have addressed how to deal with multivariable risk analyses. Herein, we propose some practical guidance for when and how such analyses should be performed and presented (summarized in Table 4). While this framework has not been subjected to a formal consensus-building process involving a broad sample of stakeholders and is therefore provisional, the approach is a synthesis of ideas and contributions made by many investigators [4–7, 9, 14, 16, 17, 27, 41], and is proposed as a considered basis for subsequent discussion, revision, and refinement.

Table 4 Checklist for Reporting on Subgroup Analyses & Heterogeneity in Treatment Effects

Recommendation #1: Evaluate and report on the distribution of baseline risk in the overall study population and in the separate treatment arms of the study by using a risk prediction tool

Although its importance was highlighted over a decade ago [12], reporting the distribution of baseline risk (see Appendix 1) is rarely done. As a result, the degree of baseline risk heterogeneity generally cannot be assessed in most published clinical trials, because it cannot be reconstructed when each risk factor's prevalence is listed individually.

The precise approach to presentation is not important, as long as it allows the reader to understand the distribution of predicted baseline risk (or the risk score of a risk index) in the study population. "Table 1" of a clinical trial report (which conventionally includes patient attributes for those in the different study arms) should include, at minimum, the population mean (± SD) and median predicted baseline risk (or risk score), with additional information on the population distribution if there is substantial skew in subject risk (such as quartiles/percentiles, a histogram or a box plot) (see Table 5). If the study includes a population that is largely homogeneous with regard to overall risk, the reader will know that generalizing the study results to those with substantially different risk would be speculative. If there is substantial heterogeneity in the study population, then reviewers will know that risk-stratified analysis is particularly important.

Table 5 Presenting the distribution of baseline risk in clinical trials

Finally, including this information in "Table 1" of a clinical trial allows the reader to assess whether there are important baseline differences between treatment arms on the most important baseline attribute (i.e., overall risk for the study's main outcome). It is common to note multiple modest deviations between treatment arms when baseline patient factors are listed one at a time. These differences typically have little influence on trial results, particularly when they combine so as to cancel each other out. However, similar differences in overall baseline risk may influence the trial result, so comparing the risk distribution between the treatment groups using a composite risk model can be informative and can facilitate risk adjustment.
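As an illustration of how such reporting might be generated, the sketch below summarizes model-predicted baseline risk overall and by treatment arm. The data frame, column names and use of pandas are assumptions for the example; any validated prediction model or risk index could supply the predicted-risk column.

```python
# Minimal sketch: mean (SD), median and quartiles of predicted baseline risk,
# overall and by arm, in the spirit of Table 5. `df`, `pred_risk` and `arm`
# are hypothetical names used for illustration.
import pandas as pd

def summarize_risk(df: pd.DataFrame, risk_col: str = "pred_risk",
                   arm_col: str = "arm") -> pd.DataFrame:
    """Distribution of predicted baseline risk, overall and by treatment arm."""
    groups = {"overall": df[risk_col]}
    groups.update({f"arm {a}": g[risk_col] for a, g in df.groupby(arm_col)})
    return pd.DataFrame({
        name: {"n": s.size, "mean": s.mean(), "sd": s.std(),
               "median": s.median(), "q1": s.quantile(0.25), "q3": s.quantile(0.75)}
        for name, s in groups.items()
    })

# Example usage (hypothetical data):
# df = pd.DataFrame({"pred_risk": model.predict_proba(X)[:, 1], "arm": arm})
# print(summarize_risk(df).round(3))
```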

Recommendation #2: Report how relative and absolute risk reduction varies by baseline risk, using a multivariable prediction tool

There are two fundamental reasons why all clinical trials should attempt to assess how net treatment benefit and safety vary as a function of predicted untreated risk: 1) it allows us to understand how absolute risk reduction varies across the study population even when relative risk reduction is constant (see Table 1); and 2) net relative risk reduction may not be constant across risk groups, particularly if there is even a small amount of treatment-related harm (see Table 2). For major clinical trials (those that assess a treatment's effect on mortality and major morbidity), it is usually possible to perform a risk-based analysis of HTE using an externally developed tool, since prediction tools to estimate overall risk have been developed for most major conditions and their complications (including cardiac disease, cancer, stroke, renal failure, and ICU and hospital mortality [see Additional file 1]). Testing risk-based HTE using internally developed models (based on a blinded regression analysis of the data using all treatment arms) may be useful when such external models do not exist. However, when available, we favor the use of an externally developed prediction model, since over-fitting of an internally developed model can exaggerate the apparent degree of risk heterogeneity.
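For the internally developed case, one possible reading of a "blinded" analysis is sketched below: the outcome model is fitted on the pooled arms using only baseline covariates, with the treatment indicator deliberately excluded. The use of scikit-learn and the variable names are assumptions for illustration, not a prescribed implementation.

```python
# Minimal sketch of an internally developed, blinded risk model.
# Assumptions: scikit-learn is available, `baseline` is a DataFrame of
# pre-randomization covariates, and `outcome` is a 0/1 array for the primary
# endpoint. Fitting on the pooled arms without the treatment indicator keeps
# the risk score blinded to assignment; over-fitting remains a concern and
# penalized or cross-validated fits may be preferable.
from sklearn.linear_model import LogisticRegression

def internal_risk_score(baseline, outcome):
    """Return each subject's predicted baseline risk from a blinded model."""
    model = LogisticRegression(max_iter=1000)
    model.fit(baseline, outcome)            # treatment arm deliberately excluded
    return model.predict_proba(baseline)[:, 1]
```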

In reporting risk stratified results, readers should be provided with the information needed to easily determine the amount of variation in ARR/NNT and RRR. An approach to presenting these results to a general readership is shown in Table 6. How statistical testing for HTE should be addressed, including for multivariable risk-stratified analyses, is discussed below (Recommendation #5).

Table 6 Presenting results showing heterogeneity in treatment effect (HTE)*
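A risk-stratified summary in the spirit of Table 6 could be computed as in the sketch below, which groups subjects by quartile of predicted baseline risk and reports CER, EER, ARR, RRR and NNT in each stratum. The array names and use of pandas are illustrative assumptions.

```python
# Minimal sketch: event rates and effect measures by quartile of predicted
# baseline risk. `pred_risk`, `treated` (0/1) and `event` (0/1) are assumed
# arrays of equal length from a hypothetical trial.
import numpy as np
import pandas as pd

def risk_stratified_effects(pred_risk, treated, event, n_groups: int = 4):
    df = pd.DataFrame({"risk": pred_risk, "treated": treated, "event": event})
    df["risk_group"] = pd.qcut(df["risk"], q=n_groups,
                               labels=[f"Q{i + 1}" for i in range(n_groups)])
    rows = []
    for grp, g in df.groupby("risk_group", observed=True):
        cer = g.loc[g["treated"] == 0, "event"].mean()   # control event rate
        eer = g.loc[g["treated"] == 1, "event"].mean()   # experimental event rate
        arr = cer - eer
        rows.append({"risk group": grp, "CER": cer, "EER": eer, "ARR": arr,
                     "RRR": arr / cer if cer > 0 else np.nan,
                     "NNT": 1 / arr if arr > 0 else np.nan})
    return pd.DataFrame(rows)
```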

Recommendation #3: Additional primary subgroup analysis for single variables should be pre-specified and limited to patient attributes with strong a priori pathophysiological or empirical justification

Here we define primary subgroup analyses as those subgroup comparisons that are well justified (hypothesis-testing, not hypothesis-generating) so as to yield potentially actionable results appropriate for guiding clinical care. Therefore, all primary subgroup comparisons must be fully specified and justified a priori.

The number of comparisons made in the primary subgroup analysis should be kept small to minimize false-positive results, since each additional subgroup comparison decreases the usefulness of the other primary subgroup analyses and should therefore exact a statistical penalty (see Recommendation #5). Often, no single-variable subgroup analysis (such as by age, sex or race) will be indicated as part of the primary subgroup analysis. Rather, these should generally be conducted as exploratory (secondary) analyses (see Recommendation #4), unless: 1) there exists previous empirical evidence from observational studies or from exploratory subgroup analyses in prior clinical trials; or 2) there are highly compelling reasons to believe the patient attribute is likely to importantly influence the relative treatment effect (such as time to treatment for time-sensitive therapies, or biomarkers that are strong candidates to be specific targets of therapy [e.g. estrogen receptor positivity in breast cancer]).

Prespecification of primary subgroups should include explicit definitions and categories of the subgroup variables, including cut-off thresholds for continuous or ordinal variables where these are used, and the anticipated direction of the effect modification. While it is ideal for analyses to be pre-specified at the time of trial initiation [22, 27], it is most important that all primary subgroup analyses be pre-specified prior to examination of the data, to ensure that analyses are not biased by multiple comparisons, including post-hoc changes in variable construction to better "fit the data". By conducting primary subgroup analyses that are few in number, fully pre-specified, hypothesis-driven and more statistically robust (see Recommendation #5), examinations of HTE can produce strong and actionable evidence regarding which patients are most likely to benefit from treatment.

Recommendation #4: Secondary (exploratory) subgroup analyses should be clearly distinguished from primary subgroup comparisons

Although we propose making a clear distinction between primary and secondary subgroup analyses, it would be a mistake to forgo secondary analyses. Secondary analyses can explore evidence of unexpected relationships between individual patient attributes and treatment effects. Although exploratory analyses are an important part of scientific discovery, it is critically important to understand that such analyses are mainly appropriate for hypothesis generation; the resulting hypotheses can then be tested (and usually disproved) in future studies. Although medical journals may be reluctant to report "exploratory" analyses, it would be quite easy to routinely include secondary subgroup analyses in an electronic appendix published online with the main results of a clinical trial, making them available to the scientific community and to future meta-analyses while keeping them distinct from the primary results.

Recommendation #5: All analyses conducted must be reported, and statistical testing of HTE should use appropriate methods (such as interaction terms) and avoid overinterpretation

Reporting must include results for all subgroup analyses conducted, including the multivariate risk-based analysis and the primary and secondary subgroup analyses, and the paper must state that the primary subgroup analyses were pre-specified. Because statistically significant benefit is likely to be absent in small subgroups, the correct approach is not to test the significance of the treatment effect within one subgroup or another, but to test whether the effect differs significantly between subgroups. Work by Brookes et al. suggests that the most statistically robust approach to assessing HTE is to use interaction terms in regression models [22, 23]. Further, they found that testing continuous variables (such as baseline LDL level) is substantially more statistically powerful than testing categorical variables (such as baseline LDL <100 vs. 100-145 vs. >145). Therefore, unless there is reason to believe that an effect is non-linear, HTE involving continuous variables should be tested using the full information in the continuous variable, although categorical results can be shown for simplified presentation in the results section (see Table 6).
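As an illustration of the interaction-term approach, the sketch below fits a logistic model with a treatment-by-covariate product term, using the continuous variable directly rather than categorizing it. The use of statsmodels, the variable names, and the synthetic stand-in data are assumptions for the example.

```python
# Minimal sketch of a treatment-by-covariate interaction test for HTE.
# Synthetic stand-in data are generated so the snippet runs; in practice `df`
# would hold the trial's 0/1 outcome, 0/1 treatment indicator, and a continuous
# baseline variable (e.g., a risk model's linear predictor or baseline LDL).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({"treated": rng.integers(0, 2, n),
                   "baseline_risk_score": rng.normal(size=n)})
logit = (-2 + 0.6 * df["baseline_risk_score"] - 0.4 * df["treated"]
         - 0.3 * df["treated"] * df["baseline_risk_score"])
df["event"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# The product term tests whether the (log-odds) treatment effect varies with
# the continuous baseline variable, rather than testing the treatment effect
# separately within subgroups.
fit = smf.logit("event ~ treated * baseline_risk_score", data=df).fit(disp=0)
print(fit.params)
print("interaction p-value:", fit.pvalues["treated:baseline_risk_score"])
```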

Where formal statistical testing fails to detect heterogeneity on the relative risk scale, the conservative assumption of a constant relative risk reduction across all risk groups may generally apply, especially if the study is large enough that the test for interaction is adequately powered. One should beware of the remaining possibility of false negatives (as well as false positives), especially in underpowered settings. Therefore, interpretation of interaction effects should be cautious and should also be viewed in the context of prior and external evidence.

Results of subgroup analyses should be presented so that ARR/NNT as well as RRR can be assessed across risk categories or other subgroups. Where multiple single-variable subgroup analyses are performed as part of the primary subgroup analysis, the significance threshold should be adjusted for multiple testing [42, 43].
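Where several pre-specified primary subgroup interactions are tested, a standard multiplicity adjustment can be applied to the interaction p-values, as in the sketch below. Holm's procedure is shown as one common choice rather than the only acceptable one, and the p-values are placeholders.

```python
# Minimal sketch: adjust interaction p-values from a few pre-specified primary
# subgroup analyses for multiple testing (Holm's step-down procedure).
from statsmodels.stats.multitest import multipletests

interaction_pvalues = [0.012, 0.20, 0.047]          # hypothetical results
reject, p_adjusted, _, _ = multipletests(interaction_pvalues, alpha=0.05,
                                         method="holm")
print(list(zip(p_adjusted.round(3), reject)))
```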

Caveats and Future Work

Ideally, a continually updated registry containing easily applicable, well-accepted, well-validated prediction tools for all the primary clinical outcomes used in trials for all major medical conditions would be available. We recognize that this is not currently the case and that the state of the predictive modeling literature is far from this ideal, even for fields with a long tradition of predictive modeling [44, 45]. However, although a well-accepted and validated prediction tool is not available for every condition, it is important to understand that testing for evidence of HTE using a risk-stratified analysis is a much easier task than determining how risk stratification should be used in clinical practice. Recent research has demonstrated that a risk prediction tool of even moderate predictive power can typically provide adequate statistical power for answering the scientific question of whether the RRR of treatment varies significantly as a function of baseline risk [9]. It has been shown that even a relatively mediocre prediction tool (AUROC 0.60 to 0.65) can substantially improve statistical power for detecting risk-based HTE over that achieved by examining even strong single risk factors one at a time [9]. Indeed, several commonly used scores, such as the Thrombolysis in Myocardial Infarction (TIMI) risk score (for acute coronary syndrome) and the CHADS2 score (for non-valvular atrial fibrillation), have discriminatory power in this range but have nevertheless proved useful in the detection of risk-based HTE (see Table 3) [46–50].
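One way to explore such power questions is by simulation, as in the sketch below, which repeatedly generates a trial in which the (log-odds) treatment effect grows with a modestly discriminating risk score and records how often the interaction test reaches significance. All parameter values (sample size, coefficients, number of replications) are assumptions for illustration and do not reproduce the cited analyses.

```python
# Minimal simulation sketch (all parameters assumed) of how one might estimate
# the power of a treatment-by-risk-score interaction test when the risk score
# has only modest discrimination.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

def one_trial(n=4000, beta_risk=0.5, beta_trt=-0.4, beta_int=-0.4):
    score = rng.normal(size=n)                       # standardized risk score
    treated = rng.integers(0, 2, size=n)
    # Outcome log-odds: modest risk discrimination, with a treatment benefit
    # that increases (on the log-odds scale) with the risk score.
    logit = -2.0 + beta_risk * score + treated * (beta_trt + beta_int * score)
    event = rng.binomial(1, 1 / (1 + np.exp(-logit)))
    df = pd.DataFrame({"event": event, "treated": treated, "score": score})
    fit = smf.logit("event ~ treated * score", data=df).fit(disp=0)
    return fit.pvalues["treated:score"] < 0.05

power = np.mean([one_trial() for _ in range(200)])
print(f"estimated power of the interaction test: {power:.0%}")
```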

Moreover, for many fields, it is likely that widely accepted predictive models will not remain static but will continuously improve with the addition of new informative predictors (e.g. previously unrecognized genetic risk factors). One can envision re-analyses of clinical trial results using more informative prediction models if and when such additional information has been collected. Such re-analyses should follow standards as robust as those noted above for the original risk-stratified analyses.

For trials that lack an adequate outcome prediction tool, risk tools can often be developed on pre-existing data during the trial planning phase, or prior to analysis. Use of internally developed risk models has been advocated [16, 51, 52], and several large trials have used this approach as the basis for testing risk-based HTE [53–55]. Future work should explore the degree to which over-fitting may bias such an approach and, if so, how best to avoid this. Regardless of the approach, in most instances in which a risk-based analysis shows significant HTE, the finding will be a call for rigorous follow-up research to assess and optimize clinically feasible risk prediction.

Other medical conditions may have multiple models that yield clinically different results, frequently at the individual patient level (where clinical recommendations may change depending on which model is used) and sometimes regarding the presence or absence of HTE overall. While future work is needed to address this issue, it should be noted that the ambiguity about how best to treat individuals in such cases is revealed, not created, by risk-based analysis.

This paper has focused exclusively on binary outcomes. Continuous outcomes can be approached with similar principles regarding testing for HTE and regarding primary and secondary subgroup analyses, but metrics such as ARR and RRR would need to be replaced by absolute and relative changes in the continuous measure of interest, and NNT is not pertinent to continuous outcomes unless the continuous measures are grouped into justifiable binary categories.

Additionally, we focused on heterogeneity in the dimension of outcome risk; other risk dimensions may also be important, such as the risk of treatment-related harm (for therapies with serious and common adverse events) [15] or competing risk (especially for conditions involving many patients with multiple morbidities, or for older patients in trials measuring longer-term outcomes) [8, 56–58]. Multivariate models predicting treatment-related adverse events, such as those developed to predict anticoagulant- or thrombolytic-related serious bleeding [59, 60] or surgical risks for specific procedures, may be useful in the first case, and comorbidity indices [56, 61] in the second. There are also examples where combining models of treatment-related harm with outcome risk models to stratify trial results in a risk-benefit scheme has yielded informative results [17, 21]. However, whether, when, and how to perform these complex analyses are methodologically fraught questions for which routine recommendations are difficult to make.

As we and others have noted elsewhere, we will never be able to get all the information needed to inform clinical practice and health policy from experimental trials [5, 27–29, 62, 63]. The approach we outline here may not be applicable or feasible for many trials, particularly early-phase trials, which tend to be small and explanatory in nature and often use surrogate rather than clinical endpoints. Furthermore, the above suggestions deal only with assessing HTE statistically in the context of trials, not with how best to promote the use of risk stratification in clinical practice. Despite these caveats and limitations, for pivotal, phase III clinical trials using clinically important outcomes, the suggested approach should usually be feasible and should substantially improve our ability to produce scientifically valid information on HTE to better inform clinical practice.

Conclusion

Implications for the peer-review and publishing of clinical trials

While it is well appreciated that outcome risk heterogeneity is common and can lead to clinically meaningful HTE, few clinical trials analyze the variation in treatment effect across the spectrum of patients in their studies, and subgroup analyses are performed and reported erratically [14, 30, 33, 35]. Though some argue that journals should not dictate the scientific questions that investigators address, for many important trials the results are not fully disclosed in the absence of a risk-based analysis. While risk-stratified results may emphasize the importance of treatment in high-risk patients and may even lead to the discovery of patient subgroups who benefit when the summary results of a trial are negative, such analyses may be particularly resisted when trial results are positive overall, given the obvious incentives for industry to get treatments approved for as broad a population as possible [14]. There are also incentives to selectively highlight positive exploratory subgroup analyses when overall results are negative. Therefore, inadequate investigation and reporting of HTE will likely continue to be a problem unless editors, granting agencies and government regulators insist upon it. The suggestions herein provide a framework for the development of implementable guidelines that might support routine examination and reporting of information essential for optimizing medical care for individuals.

Appendix

Appendix 1. Glossary

Baseline Risk

Risk of a particular event (in this paper, typically the primary study outcome) in the absence of the experimental therapy.

Event rate

Proportion or percentage of study participants in a group in which a particular event (typically the primary outcome) is observed. Control event rate (CER) and experimental event rate (EER) are used to refer to event rates in the control group and experimental group, respectively. In a clinical trial, baseline risk is best estimated by the observed control event rate (CER).

Relative Risk Reduction (RRR)

The proportional reduction in the rate of bad events between experimental (experimental event rate [EER]) and control (control event rate [CER]) patients in a trial, calculated as (CER - EER)/CER. We also use the term "net RRR" in this paper to emphasize that we are assessing the overall treatment benefit (treatment-related benefit minus treatment-related harm). This is simply the RRR when the outcome measure is a composite of all major outcomes related to the treatment, both those that are decreased and those that are increased by treatment. For parsimony, we consider here that all outcomes have similar importance, but this may not necessarily be generalizable (e.g. many composite outcomes in the literature are conglomerates of endpoints with very different connotations and clinical importance).

Absolute Risk Reduction (ARR)

The absolute arithmetic difference in event rates between the control group and the experimental group (CER - EER).

Number Needed to Treat (NNT)

The number of patients who need to be treated, on average, to prevent 1 additional bad outcome; calculated as 1/ARR.
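For convenience, the quantities defined above can be summarized in a single expression (directly restating the glossary definitions):

```latex
\mathrm{ARR} = \mathrm{CER} - \mathrm{EER}, \qquad
\mathrm{RRR} = \frac{\mathrm{CER} - \mathrm{EER}}{\mathrm{CER}}, \qquad
\mathrm{NNT} = \frac{1}{\mathrm{ARR}}
```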