Introduction

The number needed to treat (NNT) was originally used to provide greater clinical relevance to a small risk difference (RD) [1,2,3], particularly when that RD was converted to a percentage of control. For example, when the control (C) rate is 1% and the treatment (T) rate is 0.67% for an event of medical interest (e.g., myocardial infarction [MI]), the estimated absolute risk difference (ARD) is |0.0067 – 0.01| = 0.0033, or 0.33%. This becomes 33% (0.0033/0.01) when expressed as a percentage of C.

An interpretation of the ARD using the example above is that, on average, among 1000 patients treated with T and 1000 treated with C, one could expect 3.3 fewer patients to have the event in the T group than in the C group. Alternatively, one may wish to determine how many patients need to be treated to have one fewer event in the T group compared with the C group. The answer is provided by the NNT, which in this case would be 1/0.0033 = 303.03 patients. In general, NNT = 1/ARD. NNT is usually rounded up to the next integer, which in this example is 304.
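The arithmetic in this example can be reproduced with a minimal Python sketch. The function names are illustrative and not part of any published method, and the rounding guard is an implementation detail added only to keep floating-point noise from changing the rounded-up result.

```python
import math


def absolute_risk_difference(control_rate: float, treatment_rate: float) -> float:
    """Absolute difference between the control and treatment event rates."""
    return abs(control_rate - treatment_rate)


def number_needed_to_treat(control_rate: float, treatment_rate: float) -> int:
    """NNT = 1/ARD, rounded up to the next integer."""
    ard = absolute_risk_difference(control_rate, treatment_rate)
    if ard == 0:
        raise ValueError("NNT is undefined when the two rates are equal")
    # Round the reciprocal first so that floating-point noise
    # (e.g., 10.000000000000002) does not push the result up by one.
    return math.ceil(round(1.0 / ard, 9))


control, treatment = 0.01, 0.0067  # 1% vs. 0.67%, as in the MI example
ard = absolute_risk_difference(control, treatment)
print(f"ARD = {ard:.4f}")                                     # ARD = 0.0033
print(f"ARD as a fraction of C = {ard / control:.2f}")        # 0.33
print(f"NNT = {number_needed_to_treat(control, treatment)}")  # NNT = 304
```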

There is much debate in the literature [4, 5] about whether NNT is a useful measure to summarize treatment effect from a clinical trial with binary efficacy endpoints. The proponents of the summary measure find it appealing because they believe it is easy to interpret. However, some argue [6,7,8] that the basic definition of NNT is flawed and that NNT lacks statistical content.

The use of NNT originated in coronary heart disease risk trials. These trials were large because the event in question was relatively rare and because high power was needed to detect a clinically meaningful RD between the T and C groups. More recently, NNT has been used to interpret outcome data from clinical trials in many other disease areas, including multiple sclerosis (MS) [9,10,11,12,13], probably because of the purported ease of interpretation [14].

There is a need to clarify both the definition and use of NNT in summarizing outcome data from comparative relapsing-remitting MS (RRMS) trials so that reported NNTs can be critically appraised. This article provides an overview of the definition and methodological characteristics of NNT, and discusses the relative merits of its use in MS clinical trials, where endpoints are typically time-to-event or frequency of recurrent events.

Compliance with Ethics Guidelines

This article is based on previously conducted studies, and does not involve any new studies of human or animal subjects performed by any of the authors.

NNT for Dichotomous Data

Historically, the use of NNT was based on dichotomous efficacy data and reported in trials in which T was better than C. In a clinical trial of MI (see example in the “Introduction”), an NNT of 304 is interpreted as indicating that 304 patients would need to be treated, on average, with the drug to prevent one MI. Adverse event (AE) data are collected in such trials and, by definition, reflect potential harm to the patient, particularly when the AEs are serious. Therefore, it is of clinical interest to compute the number needed to harm (NNH), which implies that the incidence in the T group is greater than that in the C group (if the event represents a potential harm). Of note, if the event represents a benefit, but its incidence is less in the T group than in the C group, it would be appropriate to use NNH instead of NNT. This is another complexity of interpreting NNT.

NNH is computed in the same way as NNT but represents a different measure. Consider the trial in the previous example, in which the NNT for MI was 304 patients. If the rates of proteinuria were 0.5% in the T group and 0.4% in the C group, then ARD = 0.001 and NNH = 1000. Thus, if MI and proteinuria were assessed in the same trial, 304 patients would need to be treated, on average, to have one fewer patient experiencing an MI, and 1000 patients would need to be treated, on average, to observe one more patient with proteinuria.
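As a sketch of how the same calculation yields either NNT or NNH depending on the direction of the difference, the hypothetical helper below labels the result for an unfavourable event (such as MI or proteinuria). The function name and the labelling rule are assumptions made for illustration only.

```python
import math


def needed_number(control_rate: float, treatment_rate: float) -> tuple[str, int]:
    """Return ("NNT" or "NNH", value) for an unfavourable event.

    Fewer events on treatment is a benefit (NNT);
    more events on treatment is a harm (NNH).
    """
    diff = treatment_rate - control_rate
    if diff == 0:
        raise ValueError("Neither NNT nor NNH is defined when the rates are equal")
    label = "NNT" if diff < 0 else "NNH"
    # Round before taking the ceiling to absorb floating-point noise.
    return label, math.ceil(round(1.0 / abs(diff), 9))


print(needed_number(control_rate=0.01, treatment_rate=0.0067))   # MI
print(needed_number(control_rate=0.004, treatment_rate=0.005))   # proteinuria
```

With the rates quoted in the text, the first call gives an NNT of 304 for MI and the second gives an NNH of 1000 for proteinuria.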

Statistical Properties of ARD and NNT

Population parameters (e.g., the true ARD and NNT) are rarely known. Instead, data from a random sample of the population are used to estimate them, which gives rise to point estimates. Because different samples contain data on different individuals, estimates are expected to vary from one sample to another.

Point estimates and their variability are used to compute a confidence interval (CI) with lower and upper limits (L, U) on the population parameter. For any given data, the endpoints L and U are real numbers. To say that (L, U) is a 95% CI on the true ARD does not mean that the true ARD falls within this particular interval with 95% probability, but rather that 95% of the intervals so derived would contain the true ARD if the experiment were repeated many times.

The statistical methodology for computing CIs on the true ARD is well established. However, this is not the case for computing CIs on the true NNT. There are several reasons for this: (a) NNT is not defined in the interval (–1, +1), (b) the variance of NNT is difficult to estimate accurately, and (c) the sampling distribution appears to be triangular [15].

ARD is a summary statistic that synthesizes event information on patients in the T and C groups. Therefore, its variance consists of both within- and between-group variability, and it can be written in a closed form. This variance is a primary driver of the length of the CI on ARD. On the other hand, the NNT is a transformation of ARD, producing a single point estimate without quantifiable variability. A point estimate without its associated variability provides no information for one to judge the proximity of the sample estimate of treatment effect to the unknown population parameter (the true treatment effect) about which one wishes to make a statistical inference. Therefore, without an established method for quantifying the variability of a point estimate, it is impossible to know whether the estimated treatment effect is indicative of a real effect (see Hotelling [16]).
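To illustrate what a closed-form variance for ARD can look like, the sketch below uses the standard large-sample (Wald) variance for a difference of two independent binomial proportions. This specific formula is an assumption, since the text does not state which variance estimator is intended, and the event counts are hypothetical.

```python
import math
from statistics import NormalDist


def ard_wald_ci(events_c: int, n_c: int, events_t: int, n_t: int, level: float = 0.95):
    """Point estimate, standard error, and Wald CI for ARD = p_C - p_T."""
    p_c = events_c / n_c
    p_t = events_t / n_t
    ard = p_c - p_t
    # Closed-form variance: the sum of the two binomial variances.
    variance = p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t
    se = math.sqrt(variance)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% CI
    return ard, se, (ard - z * se, ard + z * se)


# Hypothetical counts: 40/100 events on control, 30/100 on treatment.
ard, se, (lower, upper) = ard_wald_ci(40, 100, 30, 100)
print(f"ARD = {ard:.3f}, SE = {se:.4f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

In this hypothetical case the interval spans zero, so the ARD would not be statistically significant at the 5% level even though the point estimate favours treatment.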

Several authors (e.g., Cook and Sackett [2], Schulzer and Mancini [17], Altman [18], and Bender [19]) have proposed different methods for computing the CI on NNT. For instance, Cook and Sackett [2] proposed computing the CI on ARD and then inverting the endpoints of the interval and reversing their order to get an approximate CI on NNT.

There are instances, however, where the method of Cook and Sackett [2] and Altman [18] for computing CIs on NNT can produce unreasonable results in which the CI does not contain the point estimate of NNT. One such situation is when the ARD is not statistically significant, in which case the T group may be descriptively favored over the C group, or vice versa. When there is no statistically significant difference between treatment groups, there is no finite single value corresponding to instances where NNT is meaningful [8]. For example, when the estimate of ARD in a clinical trial comparing T to C was 0.05 and the 95% CI on the true ARD was (–0.10, +0.20), the point estimate of NNT is 20 patients, and by inversion the approximate 95% CI on NNT has a lower endpoint of +5 and an upper endpoint of –10. Not only is the point estimate of NNT not in the CI, but the CI is not valid (i.e., a positive lower limit and negative upper limit).
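The inversion described here can be sketched in a few lines of Python using the numbers from the example. The function name is hypothetical; the code only reproduces the invert-and-reverse step and then checks the two problems noted in the text.

```python
def invert_ard_ci(ard_lower: float, ard_upper: float) -> tuple[float, float]:
    """Invert the ARD limits and reverse their order to get approximate NNT limits."""
    if ard_lower == 0 or ard_upper == 0:
        raise ValueError("A limit of zero has no finite reciprocal")
    return 1.0 / ard_upper, 1.0 / ard_lower


ard_estimate = 0.05
ard_ci = (-0.10, 0.20)

nnt_estimate = 1.0 / ard_estimate
nnt_lower, nnt_upper = invert_ard_ci(*ard_ci)

is_ordered = nnt_lower <= nnt_upper                        # False: +5 exceeds -10
contains_estimate = nnt_lower <= nnt_estimate <= nnt_upper  # False: 20 is outside

print(f"NNT = {nnt_estimate:.0f}, inverted limits = ({nnt_lower:.0f}, {nnt_upper:.0f})")
print(f"valid ordering: {is_ordered}, contains point estimate: {contains_estimate}")
```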

When there is no meaningful difference between treatment groups, NNT should not be computed. This is because when the ARD is not statistically significant, it is doubtful that the NNT has a meaningful clinical utility as there is little or no evidence that the observed ARD reflects the true state of nature. In general, if the (L, U) of a CI on the true ARD are negative and positive, respectively (ARD is not statistically significant), it is not advisable to generate a CI on the NNT by inverting the limits and reversing their order.

NNT in MS Clinical Trials

Common efficacy outcome data from RRMS trials include the percentages of patients with events (estimated as cumulative time-to-event percentages for disability progression or first relapse), the frequency of recurrent events (annualized relapse rate [ARR]), and the number of magnetic resonance imaging lesions. Therapies showing higher efficacy could also have serious harmful side effects. Thus, if there is a desire to compute the NNT for MS therapies, it would be appropriate to report both the NNT and NNH to aid the evaluation of the benefit-to-risk profiles of the treatments. The following discussion is intended to help with the appraisal of NNT when computed in MS trials.

Time-to-Event Efficacy Endpoints for NNT

A time-to-event endpoint measures “if” an event occurred and “when” the event occurred. Thus, for time to relapse or disability worsening, one can obtain the median (or other percentile) time to relapse or disability worsening as well as the proportion of patients with these events. When there is no censoring, the estimated cumulative percentage of patients with the event is the same as the binomial proportion (number relapsed/number at risk). For a time-to-event endpoint, the definition of NNT may be applied to cumulative time-to-event percentages. If a protocol specifies a fixed period of treatment, e.g., 2 years, and there is an interest in estimating the cumulative percentage of patients with events at 2 years, then time-to-event methods would have to be used given that the event is censored in some patients. If the 2-year cumulative relapse proportions are, for example, 30% for the T group and 40% for the C group, then ARD = 0.10 and NNT = 10; this is interpreted as indicating that 10 patients would need to be treated for up to 2 years in order to have, on average, one fewer patient experiencing a relapse.

NNT could be time dependent; hence, it should not be computed using time-to-event methods at intermediate points during the specified treatment period, as it may not lead to consistent, meaningful results. This is because NNT would be difficult to interpret for MS trials where the ARD was smaller early and larger later in the treatment period, as the implication is that a greater number of patients would need to be treated for a shorter period of time and fewer would need to be treated for a longer period of time. In the DEFINE study [20], the Kaplan–Meier estimates of the proportion of patients that relapsed at week 24 are 0.167 and 0.110 for placebo and delayed-release dimethyl fumarate (DMF; also known as gastro-resistant DMF) twice daily (BID), respectively. This gives an NNT of 18, suggesting that 18 patients would need to be treated for up to 24 weeks in order to have, on average, one fewer patient experiencing a relapse. However, the Kaplan–Meier estimates of the proportion of patients that relapsed at week 96 in the same study are 0.461 and 0.270 for placebo and DMF BID, respectively. This gives an NNT of 6; i.e., the same therapy (DMF) at the same dose and with the same effect on biology yields a different NNT when computed at a different time point in the same study. Additionally, NNT does not provide information on the average or median time before the event occurs. Therefore, the time (“when”) aspect of the time-to-event endpoint is often not reflected by NNT. Furthermore, NNT could vary between the first time-to-event and the end of treatment. In addition, the course and severity of MS can change unpredictably [21]. Thus, the computation of NNT at intermediate time points during a treatment period is of doubtful utility.
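The time dependence can be made concrete with a short sketch that applies NNT = 1/ARD to the Kaplan–Meier proportions quoted above for DEFINE. The dictionary layout and function name are illustrative assumptions; only the four proportions come from the text.

```python
import math


def nnt_from_proportions(control: float, treatment: float) -> int:
    """NNT = 1/ARD, rounded up, from two cumulative event proportions."""
    ard = abs(control - treatment)
    if ard == 0:
        raise ValueError("NNT is undefined when the proportions are equal")
    return math.ceil(round(1.0 / ard, 9))


# Week -> (placebo, DMF BID) Kaplan-Meier proportions relapsed, as quoted in the text.
km_estimates = {24: (0.167, 0.110), 96: (0.461, 0.270)}

nnt_by_week = {
    week: nnt_from_proportions(placebo, dmf)
    for week, (placebo, dmf) in km_estimates.items()
}
print(nnt_by_week)  # {24: 18, 96: 6}
```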

ARR for NNT

For recurrent events such as MS relapse, measures similar to NNT (referred to as NNT-like measures, NNT-L) have been proposed based on reducing the occurrence of events [22]. ARR is computed as the total number of relapses in a given period divided by the total number of person-years in that period. Because a patient may experience more than one relapse over the period of a study, relapses within each patient may be correlated. Additionally, the use of person-years assumes that the probability of relapse is constant over time. For example, 50 relapses in 100 patients over 10 years of exposure would give the same relapse rate as 50 relapses in 1000 patients for 1 year of exposure. This is because person-years of exposure is obtained by summing each patient's years of exposure over all patients. However, because the change in ARR over time has been documented in MS, computing NNT using ARR may confound well-known changes in ARR over time [23,24,25]. We point out that some authors compute NNT-L measures using the definition of NNT and interpret them as one interprets NNT. We argue that such an interpretation is wrong.
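A minimal sketch of the person-years calculation shows why the two scenarios in the example are indistinguishable once summarized as ARR. The function name and the per-patient exposure lists are assumptions made for illustration.

```python
def annualized_relapse_rate(total_relapses: int, exposure_years: list[float]) -> float:
    """ARR = total relapses divided by total person-years of exposure."""
    person_years = sum(exposure_years)  # summed over all patients
    if person_years <= 0:
        raise ValueError("Total exposure must be positive")
    return total_relapses / person_years


# 50 relapses in 100 patients, each followed for 10 years.
arr_long = annualized_relapse_rate(50, [10.0] * 100)
# 50 relapses in 1000 patients, each followed for 1 year.
arr_short = annualized_relapse_rate(50, [1.0] * 1000)

print(arr_long, arr_short)  # 0.05 0.05
```

Both designs contribute 1000 person-years, so the ARR alone cannot reveal whether relapses were spread evenly over time or concentrated early.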

The event-based NNT-L measure may be heavily influenced by small numbers of patients with multiple events (e.g., highly active patients) and may give a fractional value that is <1. When such a fractional value is rounded up to 1, it will suggest that one patient needs to be treated to prevent one relapse [26].

Table 1 presents hypothetical relapse data from 20 patients treated over 2 years, showing that when some patients have multiple relapses, the event-based NNT does not equal the traditional NNT. Thus, it is easy to misinterpret the event-based NNT in Table 1 as indicating that five patients need to be treated, on average, to prevent one relapse. However, this is half the number of patients that need to be treated to prevent one relapse when NNT interpretation is based on the traditional definition of NNT. Hence, a distinction should be made between NNT and NNT-L measures. Of note, NNT is based on ARD: (T%–C%)/100, i.e., the difference in the percentages of patients with a critical event of interest by a specified time of exposure. Therefore, ARR should not be used to compute NNT.
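The distinction can be sketched with invented per-patient relapse counts. These counts are not the data of Table 1; they are a separate hypothetical chosen so that a few patients have multiple relapses. The event-based measure is assumed here to be the reciprocal of the difference in mean relapses per patient over the treatment period, which is one plausible reading of an NNT-L measure rather than a definition taken from the text.

```python
import math


def ceil_reciprocal(difference: float) -> int:
    """Round 1/difference up to the next integer, guarding against float noise."""
    if difference == 0:
        raise ValueError("Undefined when the two groups do not differ")
    return math.ceil(round(1.0 / abs(difference), 9))


# Invented relapse counts per patient over 2 years (10 patients per group).
control_relapses = [2, 1, 1, 1, 0, 0, 0, 0, 0, 0]
treated_relapses = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]


def proportion_with_event(counts: list[int]) -> float:
    return sum(1 for c in counts if c > 0) / len(counts)


def mean_events(counts: list[int]) -> float:
    return sum(counts) / len(counts)


# Traditional NNT: based on the proportion of patients with at least one relapse.
traditional_nnt = ceil_reciprocal(
    proportion_with_event(control_relapses) - proportion_with_event(treated_relapses)
)
# Event-based NNT-L (assumed definition): based on relapses per patient.
event_based_nnt = ceil_reciprocal(
    mean_events(control_relapses) - mean_events(treated_relapses)
)

print(traditional_nnt, event_based_nnt)  # 10 5
```

A single control patient with two relapses is enough to halve the event-based figure relative to the traditional NNT in this invented data set.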

Table 1 Example of the difference between event-based NNT and traditional NNT

The original definition of NNT provides an answer in terms of number of patients. NNT-L (such as rate per year) does not have this property. At best, the computation of such event-based measures results in the number of patient-years of treatment, not the number of patients. For a chronic disease such as MS, in which treatment may be lifelong, describing efficacy in terms of a lower number of patient-years of treatment may not be relevant for clinical decisions. Again, the fact that MS disease course and severity can change unpredictably invalidates any attempt to interpret treatment effect in terms of number of patient-years of treatment.

Table 2 shows data from two clinical trials (Copolymer 1 [NCT00004814] and CONFIRM [NCT00451451]) [27] where glatiramer acetate (GA) was administered at a dose of 20 mg/day for 2 years. The first study [28] is a pivotal phase III comparative trial of patients with RRMS, and the second is a pivotal comparative trial of oral DMF, in which GA was included as a reference comparator for assay sensitivity [27]. Nearly the same relative rate reduction (RRR) vs. placebo on ARR was seen for Copolymer 1 and CONFIRM: 30% and 29%, respectively (Table 2). Nevertheless, the NNTs obtained from ARR and from ARD differ. The event-based NNTs from the two studies would suggest, fallaciously, that four and ten patients, respectively, would need to be treated with GA for one patient to benefit from treatment. However, when the traditional definition of NNT is applied to ARD, the NNTs are 15 and 12, respectively.

Table 2 Example of the same RRR but different NNT values from ARR and ARD

Two clinical trials (DEFINE [NCT00420212] and CONFIRM [20, 27]) assessed DMF 240 mg BID over a period of 2 years. Table 3 details different RRRs on ARR for the two studies, identical NNTs computed from ARR and different NNTs computed from ARD. The RRR vs. placebo was 53% and 44%, respectively, for DEFINE and CONFIRM. However, the NNTs computed from ARR were the same for both studies (6), and the NNTs computed from ARD were different from the NNTs computed from ARR. Because the NNTs computed using ARR from the DEFINE and CONFIRM studies are identical, the known changes in ARR over time are obscured [23,24,25].

Table 3 Example of different RRRs but the same NNT calculated from ARR. Data from the DEFINE and CONFIRM MS clinical trials [20, 27]

Direct Comparison of NNT Across Trials

Comparison of NNTs across MS trials warrants caution because of the possible difference in the underlying baseline risk between studies. Even though the baseline risk may be the same in the overall study populations, the baseline risk may differ within the levels of disease severity. Specifically, in MS trials, the computed NNT values in two studies may be the same and may suggest a treatment benefit in the overall study population, whereas the benefit might be limited to patient groups with low or intermediate risk, or who are newly diagnosed or treatment naïve. Thus, patient groups with a treatment benefit may differ across studies. Because the design, baseline characteristics, and percentages of these risk groups are likely to differ across studies, it should not be surprising that the NNT is different when compared across trials for the same therapy. As reported in Table 2, not only were the estimated NNTs for the comparison of GA vs. placebo different in the Copolymer 1 and CONFIRM trials, but the event-based and traditional NNTs produced conflicting results. The event-based NNT from Copolymer 1 was lower than that in the CONFIRM study; however, the converse was true with the traditional NNT. Additionally, ARDs, and hence NNTs, are greatly influenced by placebo risks and rates; therefore, they are study specific and cannot be generalized to all MS populations or compared across studies. As indicated by the GA example above, when comparing across studies, it is possible to arrive at a different conclusion using different or the same endpoints. It is only appropriate to compare NNTs directly when outcomes, study durations, and reference populations are similar, as results are directly related to the control or comparison group.

Furthermore, it is not advisable to perform a meta-analysis of individual NNT estimates from a series of clinical trials directly comparing treatments to get an overall NNT across the collection of trials. There are several reasons for this: (a) the average NNT from a meta-analysis is incompatible with the NNT obtained by taking the reciprocal of the average ARD from a meta-analysis; (b) the variance of individual NNTs is approximate, so the variance of the overall estimate of NNT across trials would compound the approximation; and (c) the variability of the control rate across trials may be large even when the ARD and NNT are consistent across trials, which impacts the interpretation of the combined estimate of NNT across trials (e.g., 100% T vs. 99% C and 5% T vs. 4% C both lead to an NNT of 100). Smeeth et al. [29] questioned the viability of conducting meta-analyses based on NNT. They acknowledge that, although the results are sometimes informative, they are usually misleading. In conclusion, if clinical trials are sufficiently similar to substantiate a scientifically meaningful meta-analysis of the RD or ARD, then meta-analysis [30] of RD or ARD across the trials should be performed to get an overall RD or ARD, after which the definition of NNT can be applied to the overall ARD.
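Reasons (a) and (c) can be illustrated with a small sketch. The two trial ARDs are hypothetical, and an unweighted mean stands in for a real meta-analytic estimate, which would normally weight trials (e.g., by inverse variance); both simplifications are assumptions made to keep the arithmetic visible.

```python
from statistics import mean

# (a) Averaging NNTs is not the same as inverting the average ARD.
trial_ards = [0.01, 0.10]                        # hypothetical ARDs from two trials
trial_nnts = [1.0 / ard for ard in trial_ards]   # 100 and 10

mean_of_nnts = mean(trial_nnts)                  # 55
nnt_from_mean_ard = 1.0 / mean(trial_ards)       # about 18.2

print(f"mean of NNTs = {mean_of_nnts:.1f}")
print(f"NNT from mean ARD = {nnt_from_mean_ard:.1f}")

# (c) Very different control rates can produce the same NNT.
rate_pairs = [(1.00, 0.99), (0.05, 0.04)]        # (T, C) rates quoted in the text
nnts_same = [round(1.0 / abs(t - c), 6) for t, c in rate_pairs]
print(nnts_same)                                 # both equal 100 after rounding
```

The gap between 55 and roughly 18 arises because the reciprocal is a nonlinear transformation, so the order of averaging and inverting matters.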

Conclusion

This article provides a review of the definition and methodological characteristics of NNT; it also discusses the relative merits of its use in comparative clinical RRMS trials. The authors caution against: (a) the use of NNT in comparative trials where the ARD is not statistically significant, (b) computing NNT using a time-to-event endpoint at intermediate time points, (c) computing NNT using ARR, and (d) making comparisons or performing inferential analyses of NNT across trials.