Introduction

In reaction to the “rosiglitazone case” [1], and as requested by the US Food and Drug Administration (FDA) [2], the pharmaceutical industry performed a large number of so-called “Cardiovascular Outcome Trials” (CVOTs) in the last decade [3]. Although these trials were primarily focused on showing non-inferiority against placebo, many of the investigated drugs also showed superiority and therefore informed and changed current guidelines for diabetes treatment [4, 5]. As non-inferiority has to be shown on the hazard ratio (HR) scale (with the upper limit of the two-sided 95% confidence interval not exceeding 1.3), CVOTs almost exclusively report these relative effect estimates to describe treatment differences. However, guidelines have long recommended [6,7,8] that treatment effects also be reported on an absolute scale, e.g. as numbers needed to treat (NNT), and only recently has this been recommended in the context of CVOTs as well [9]. An NNT gives the number of patients that have to be treated to prevent one additional event of the outcome in the treatment group. Positive values of the NNT can thus be interpreted in the CVOTs as the treatment being beneficial compared to placebo. A null effect of the treatment, corresponding to an HR of 1, is given by an NNT of infinity. Because the outcomes in the CVOTs are times to an event, NNTs are necessarily time dependent, that is, their values change with the duration of treatment [10].

Absolute effects should be reported next to relative ones, because treatment effects on relative scales appear more impressive to patients, physicians, and policy makers [11]. In addition, and from a more general, philosophical viewpoint, Sprenger and Stegenga [12] point to further deficiencies of relative effect measures. With respect to decision theory, absolute effect measures are more helpful for assessing and maximizing the utility of treatments. Concerning causal inference, absolute effect measures naturally combine assessments of causal strength, e.g. when mediators or combined outcomes are involved.

Nevertheless, and despite these guidelines, absolute effects are still underreported in the medical literature [13,14,15,16,17]. This is also true for the CVOTs, which almost exclusively report hazard ratios in the original publications.

There have been previous analyses of NNTs in the large CVOTs in type 2 diabetes [10, 18] with respect to the trials’ primary outcomes. We extend these analyses in the following by meta-analysing and comparing NNTs for the three different drug classes of DPP-4 inhibitors, GLP-1 receptor agonists, and SGLT2 inhibitors. In addition, we consider the outcome of all-cause mortality and offer NNTs for the trials’ observation times as well as an extrapolation of NNTs to 30 years of treatment.

Materials and methods

We performed what we propose to call a meta-analysis of “digitalized individual patient data”. We downloaded full texts and online supplements of the CVOTs that (1) were listed as completed or ongoing in Fig. 1 of Cefalu et al. [3], (2) had been finished and published by September 2020, and (3) compared a DPP-4 inhibitor, a GLP-1 receptor agonist, or an SGLT2 inhibitor to placebo. We used WebPlotDigitizer, version 4.2, [19] and the R code of Guyot et al. [20] to extract individual patients’ time-to-event information from the Kaplan–Meier plots in the original trial populations. Both methods have been shown to be reliable and valid [21,22,23].

Data were extracted for the respective trial’s primary outcome and the outcome of all-cause mortality. For calculating absolute treatment effects for time-to-event outcomes in a single trial, it is necessary to estimate the survival functions in both treatment groups. As this is not possible from standard Cox proportional hazards models, we fitted, for each outcome in each trial separately, a parametric Weibull regression model for the treatment effect. Weibull models are parametric proportional hazards models [24] and thus yield hazard ratios for the treatment effect, which can be compared to the hazard ratios from the original paper. From the respective Weibull model, we estimated monthly probability differences (treatment–control) for being free of the event of interest from month 1 to the respective trial’s maximal observation time. These probability differences were then inverted to arrive at estimates for the monthly number needed to treat. In supplemental Fig. 1, we explain this procedure by showing how to compute the NNT for all-cause mortality after 36 months from the EMPA-REG trial.
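The step from a fitted Weibull model to monthly NNTs can be illustrated with a minimal Python sketch. It is not the SAS code used for the analysis. It assumes a Weibull proportional hazards parameterization with cumulative hazard `hazard_ratio * scale * t**shape`, and the function names and all parameter values are hypothetical and are not estimates from any trial.

```python
import math


def weibull_survival(t, scale, shape, hazard_ratio=1.0):
    """Probability of being event-free at time t under a Weibull
    proportional hazards model (assumed parameterization:
    cumulative hazard = hazard_ratio * scale * t**shape)."""
    return math.exp(-hazard_ratio * scale * t ** shape)


def monthly_nnt(scale, shape, hazard_ratio, max_month):
    """Return (month, probability difference, NNT) for months 1..max_month."""
    rows = []
    for month in range(1, max_month + 1):
        s_control = weibull_survival(month, scale, shape)
        s_treat = weibull_survival(month, scale, shape, hazard_ratio)
        diff = s_treat - s_control  # treatment minus control
        # A difference of zero (no effect) corresponds to an NNT of infinity.
        nnt = math.inf if diff == 0 else 1.0 / diff
        rows.append((month, diff, nnt))
    return rows


# Hypothetical parameters, chosen for illustration only.
rows = monthly_nnt(scale=0.002, shape=1.1, hazard_ratio=0.85, max_month=48)
month, diff, nnt = rows[35]  # entry for month 36
print(f"month {month}: probability difference {diff:.4f}, NNT {nnt:.0f}")
```

In this sketch, the NNT is simply the reciprocal of the difference in event-free probabilities at each month, so it inherits its time dependence from the two survival functions.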

In the interest of achieving a lifetime perspective of treatment effects, we additionally projected NNTs until 360 months of treatment. This was done (due to the threat of competing risk by death for all other outcomes) only for the outcome of all-cause mortality and under the assumptions that patients would remain on the same treatment and that the treatment effect remains constant after trial completion.
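The following Python sketch illustrates what such a projection looks like under a constant hazard ratio. It again assumes a Weibull proportional hazards model with hypothetical parameters (`SCALE`, `SHAPE`, and `HR` are illustrative and not taken from any trial) and locates the month at which the projected NNT is smallest.

```python
import math


def nnt_at(month, scale, shape, hazard_ratio):
    """NNT at a given month under an assumed Weibull proportional
    hazards model with a constant hazard ratio."""
    s_control = math.exp(-scale * month ** shape)
    s_treat = math.exp(-hazard_ratio * scale * month ** shape)
    return 1.0 / (s_treat - s_control)


# Hypothetical parameters, chosen for illustration only.
SCALE, SHAPE, HR = 0.002, 1.3, 0.85

# Project the NNT for every month up to 30 years (360 months).
projection = [(m, nnt_at(m, SCALE, SHAPE, HR)) for m in range(1, 361)]
best_month, best_nnt = min(projection, key=lambda row: row[1])

print(f"minimum NNT {best_nnt:.1f} at month {best_month}")
print(f"NNT at month 360: {projection[-1][1]:.1f}")
```

With these illustrative values, the projected NNT first decreases, reaches a minimum, and then increases again, because the survival probabilities in both groups approach zero in the long run.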

To assess the validity of the extracted data, we compared hazard ratios from the original papers to those from the Weibull models by calculating intra-class correlation coefficients. In addition, and to assess the fit of the Weibull models graphically, we also plotted Kaplan–Meier estimates from the digitalized data together with predicted survival functions from these models.
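As an illustration of the agreement measure, the sketch below computes a one-way random-effects intra-class correlation for paired hazard ratios. The specific ICC variant is an assumption, since several definitions exist, and the paired values are made up for illustration and are not results from the trials.

```python
def icc_oneway(pairs):
    """One-way random-effects intra-class correlation, ICC(1,1),
    for paired measurements (k = 2 measurements per trial)."""
    n, k = len(pairs), 2
    means = [(a + b) / 2 for a, b in pairs]
    grand_mean = sum(means) / n
    # Mean square between trials.
    ms_between = k * sum((m - grand_mean) ** 2 for m in means) / (n - 1)
    # Mean square within trials (original vs. recomputed value).
    ms_within = sum(
        (a - m) ** 2 + (b - m) ** 2 for (a, b), m in zip(pairs, means)
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Made-up (original HR, recomputed HR) pairs, for illustration only.
hypothetical_pairs = [
    (0.86, 0.87), (0.99, 0.98), (0.74, 0.75), (1.02, 1.01), (0.68, 0.69),
]
icc = icc_oneway(hypothetical_pairs)
print(f"ICC = {icc:.3f}")
```

Values close to 1 indicate that the differences within pairs are small compared with the differences between trials.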

To summarize NNTs overall and in the three drug classes, we used random-effects inverse-variance meta-analysis methods. Meta-analyses were calculated separately for each single time point. All computations were performed on the probability difference scale, and results were transformed to the NNT scale only for display in figures and tables. We used SAS (SAS Institute Inc., Cary, NC, USA), version 9.4, for data management and analysis. The full data set is available in a public repository [25]. As the study does not include personalized data, we did not seek a vote of an ethics committee. The study was not pre-registered and had no previously published protocol because we were developing the statistical methods in parallel with the analysis.
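A minimal Python sketch of pooling at a single time point is given below. It assumes the DerSimonian–Laird estimator for the between-trial variance, which is one common choice for random-effects inverse-variance meta-analysis and is not necessarily the estimator used here. The probability differences and standard errors are hypothetical.

```python
import math


def random_effects_pool(estimates, std_errors):
    """Random-effects inverse-variance pooling with the
    DerSimonian-Laird estimate of the between-trial variance."""
    w = [1.0 / se ** 2 for se in std_errors]
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(estimates) - 1)) / c)
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se_pooled = math.sqrt(1.0 / sum(w_star))
    return pooled, se_pooled, tau2


def to_nnt(diff):
    """Invert a probability difference; zero maps to infinity."""
    return math.inf if diff == 0 else 1.0 / diff


# Hypothetical probability differences and standard errors at one month.
diffs = [0.012, 0.008, 0.015, 0.004]
ses = [0.004, 0.003, 0.006, 0.005]

pooled, se, tau2 = random_effects_pool(diffs, ses)
lower, upper = pooled - 1.96 * se, pooled + 1.96 * se

# Inversion swaps the limits: the upper limit of the difference gives the
# lower limit of the NNT. This simple inversion applies only if the
# confidence interval of the difference excludes zero.
print(f"Meta-NNT {to_nnt(pooled):.0f} "
      f"(95%-CI: {to_nnt(upper):.0f}, {to_nnt(lower):.0f})")
```

Pooling on the probability difference scale and inverting only at the end avoids averaging NNTs directly, which would be problematic because the NNT is infinite when there is no effect.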

Results

Overall, we obtained the original time-to-event information from 4 trials on DPP-4 inhibitors (CARMELINA, EXAMINE, SAVOR-TIMI 53, TECOS), 7 trials on GLP-1 receptor agonists (ELIXA, EXSCEL, HARMONY, LEADER, PIONEER, REWIND, SUSTAIN), and 8 trials on SGLT2 inhibitors (CANVAS, CREDENCE, DAPA-CKD, DAPA-HF, DECLARE-TIMI 58, EMPA-REG, EMPEROR-REDUCED, VERTIS-CV). We excluded the CAROLINA trial because it has no placebo control; the results of two other trials (FREEDOM-CVO, EMPEROR-PRESERVED) in Fig. 1 of Cefalu et al. [3] were not yet available as of September 2020.

Table 1 gives an overview of the included trials with the treatment and drug class under study, a description of the trial populations, and the exact definition of the primary outcome. As reported in Table 2, 19 trials with 17,501 events from 159,265 observations gave information on the primary outcome, and 13 trials with 8,888 events from 112,524 observations on all-cause mortality. Median follow-up times in the trials ranged from 14.1 to 64.9 months for the primary outcome, and from 15.6 to 65.4 months for all-cause mortality. The overall median follow-up time was 28.7 months for the primary outcome and 39.3 months for all-cause mortality. The hazard ratios from the original publications ranged from 0.61 to 1.02 for the primary outcome, and from 0.68 to 1.01 for all-cause mortality; the median hazard ratio was 0.87 for both outcomes. The originally reported hazard ratios, together with the hazard ratios computed from the digitalized data (for a Cox and a Weibull model), are also given in Table 2.

Table 1 Description of included trials (Abbreviations: SD = standard deviation, BMI = body mass index, CV = cardiovascular, MI = myocardial infarction, ESKD = end-stage kidney disease). When only medians and/or quartiles and/or minima/maxima were reported in the trials, we used the formula of Wan et al. [26] to calculate the respective means and standard deviations
Table 2 Description of included trials, separated by outcomes and drug classes (Abbreviation: HR = hazard ratio, CI = confidence interval)

Figure 1 shows the trials’ NNT time courses for both outcomes; annual NNT values after 12, 24, 36, and 48 months of treatment are also given in Table 3.

Fig. 1 NNTs for the single trials, separated by outcomes and drug classes (blue: DPP-4 inhibitors, yellow: GLP-1 receptor agonists, red: SGLT2 inhibitors) with their pointwise 95% confidence intervals. Estimates and confidence intervals are truncated from above at 100,000. Please note the logarithmic scale on the y-axis (Color figure online)

Table 3 Annual NNTs for years 1, 2, 3, and 4 with 95% confidence intervals, separated by outcomes and drug classes

The Meta-NNT time course, as summarized across all trials and drug classes, is given in Fig. 2. At the overall median follow-up times of 29 months for the primary outcome and 39 months for all-cause mortality, the estimated Meta-NNTs are 100 (95%-CI: 60, 303) and 128 (95%-CI: 85, 265), respectively.

Fig. 2 Meta-NNTs for the two outcomes across all trials and drug classes with their pointwise 95% confidence intervals. Meta-NNTs were calculated by standard random-effects inverse-variance meta-analysis methods, separately for each month. All computations were performed on the probability difference scale and only then transformed to the NNT scale. Please note the logarithmic scale on the y-axis. With respect to the primary outcome, Meta-NNTs were computed from the single trials’ primary outcomes which slightly differ in their definition (see Table 1)

With respect to Meta-NNTs in the three different drug classes under study (Fig. 3), NNT time courses are very similar for GLP-1 receptor agonists and SGLT2 inhibitors, whereas treatment effects with DPP-4 inhibitors are smaller.

Fig. 3 Overall Meta-NNTs for the two outcomes, separated by drug classes (blue: DPP-4 inhibitors, yellow: GLP-1 receptor agonists, red: SGLT2 inhibitors) with their pointwise 95% confidence intervals. Estimates and confidence intervals are truncated from above at 100,000. Please note the logarithmic scale on the y-axis. With respect to the primary outcome, Meta-NNTs were computed from the single trials’ primary outcomes which slightly differ in their definition (see Table 1) (Color figure online)

Considering the lifetime perspective of treatment, Fig. 4 gives the projected Meta-NNTs in the three drug classes. Because the probability of death naturally increases in the time course and in both treatment groups, these Meta-NNTs achieve a minimum, that is, a maximum treatment effect, but increase after having reached this minimum. That is, even when assuming a constant treatment effect, Meta-NNTs do not have a monotonically decreasing time course.

Fig. 4 Projected overall Meta-NNTs for all-cause mortality, separated by drug classes (blue: DPP-4 inhibitors, yellow: GLP-1 receptor agonists, red: SGLT2 inhibitors) with their pointwise 95% confidence intervals. Estimates and confidence intervals are truncated from above at 100,000. Please note the logarithmic scale on the y-axis. With respect to the primary outcome, Meta-NNTs were computed from the single trials’ primary outcomes which slightly differ in their definition (see Table 1) (Color figure online)

In supplemental Fig. 2, we give scatterplots to compare the originally reported hazard ratios to those from the fitted Weibull models on the digitalized data. As can be seen, the correspondence is excellent. The intra-class correlation was 99.8% (95%-CI: 99.5%, 100%) for the primary outcome, and 99.5% (95%-CI: 98.9%, 100%) for all-cause mortality, where the upper limit of the confidence interval was truncated at 100%.

In supplemental Fig. 3, we show the fit of the Weibull models to the extracted data by giving the Kaplan–Meier estimates in the two treatment groups for each outcome and each trial together with the 95% confidence intervals of the fitted Weibull survival functions. Again, there are no relevant differences that might compromise computation or interpretation of NNTs.

Discussion

Treatment effects in the large CVOTs of new antidiabetic drugs look less impressive if they are reported as numbers needed to treat (NNTs) instead of hazard ratios. For example, the overall median hazard ratio across all trials was found to be 0.87 for both outcomes under study, in favour of the trial drug. This corresponds to a hazard reduction of 13% for both outcomes, and one might be tempted to interpret this as meaning that 13 out of 100 patients, or roughly every 8th patient (because 13% is roughly an eighth), benefit from treatment [9]. This is clearly an overestimation. Instead, as shown in Fig. 2, 100 patients have to be treated for 29 months (the median follow-up time across all trials) to avoid one single event of the primary outcome, and 128 patients have to be treated for 39 months to avoid one single death.

Of course, the perceived overestimation of treatment effects by hazard ratios is especially large here due to the overall low number of events in the CVOTs, that is, the low baseline risk (displayed as the proportion of events in Table 2) for the outcomes in the placebo groups.
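The dependence on baseline risk can be made concrete with a short Python sketch. It assumes proportional hazards, under which the event-free probability in the treatment group equals that of the control group raised to the power of the hazard ratio. The hazard ratio of 0.87 is the median reported above. The control-group risks and the function name are hypothetical.

```python
def nnt_from_baseline(control_event_prob, hazard_ratio):
    """NNT implied by a hazard ratio at a given control-group event
    probability, assuming proportional hazards: S_treat = S_control**HR."""
    s_control = 1.0 - control_event_prob
    s_treat = s_control ** hazard_ratio
    return 1.0 / (s_treat - s_control)


# Hypothetical control-group event probabilities, same hazard ratio.
for risk in (0.05, 0.10, 0.20, 0.40):
    nnt = nnt_from_baseline(risk, hazard_ratio=0.87)
    print(f"control-group risk {risk:.0%}: NNT {nnt:.0f}")
```

Under these assumptions, the same hazard ratio translates into a much larger NNT when few events occur in the control group.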

In view of these considerable differences between relative and absolute treatment effects, it is no surprise that trial authors and sponsors do not actively communicate NNTs, and of course, the FDA did not insist on that in their 2008 guideline. Exceptions are found in CREDENCE [27] and DAPA-CKD [28], where NNTs are reported in the main paper; LEADER [29] and EMPA-REG OUTCOME [30] report NNTs in follow-up papers.

Up to the trials’ maximum observation times, we observe NNT courses to be largely decreasing, thus pointing to increasing treatment effects. It is tempting to speculate that this increase would continue with longer observation/treatment times. However, this is not the case: the NNT by definition (and at least for the outcome of mortality, which in the long run would occur for every patient) will reach a minimum and increase again thereafter. This is true even under the two assumptions of a constant treatment effect and all patients staying on their initial treatment. We observe this behaviour of a first-increasing-then-decreasing treatment effect also in our NNT projection for the CVOT data, which again confirms the validity of our approach. But of course, the results of these projections rely heavily on extrapolating beyond the trials’ observation times, and thus should be considered exploratory [10].

We restricted our study to the two outcomes reported here because one was the primary outcome as suggested by the FDA (see Table 1 for the slightly different definitions of the primary outcome in the respective trials) and the other (all-cause mortality) is the most unbiased clinical endpoint possible [31]. Moreover, all-cause mortality is the only outcome in the CVOTs that is not affected by competing risks.

There have been previous analyses of NNTs in the large CVOTs in type 2 diabetes [10, 18] where both research groups also relied on individual data extracted from the original publications. However, as compared to Davies et al. [18], we did not restrict our analysis to the class of GLP-1 receptor agonists, but report NNTs also for two other drug classes. Moreover, we also report on the additional outcome of all-cause mortality and offer an overall as well as a drug class-specific meta-analytic summary. Ludwig et al. [10] only report on a limited non-systematic sample of CVOTs in two drug classes and only give NNT estimates at the respective median follow-up time, although they emphasize correctly that there is not a single global NNT for a specific trial. In addition, Ludwig et al. [10] do not use the original data (with the additional option to validate the digitalized data against those from the original publication), but rely on formulas for summary data. Finally, Davies et al. [18] and Ludwig et al. [10] use different software tools for data extraction and different statistical models to arrive at NNT estimates as compared to our approach. However, differences between reported NNTs (see supplemental Table 1) are marginal, thus confirming in principle the consistency and validity of the approach of digital extraction of individual patient data from trial publications.

With respect to comparing originally reported hazard ratios and those computed from Weibull models within our own study, differences were also negligible. As such, it is also no disadvantage that the functional form of the NNT dynamics in time and all the derived NNTs at fixed time points reported here were determined by the parametric form. It might rather be considered an advantage that a smooth and plausible function was generated. Moreover, we saw that the full Weibull survival functions (from which the NNTs are directly derived) give excellent fits to the digitalized survival data. There are also other possible parametric assumptions for the outcome data (e.g. Davies et al. [18] used the Royston–Parmar model, which allows for even more flexible survival functions), but we chose the Weibull model because it allows for computing hazard ratios and thus a comparison to the digitalized data. It should also be noted that the original hazard ratios from the CVOTs were all computed from Cox models that assume proportional hazards in treatment groups across the trial course, and of course, this proportional hazards assumption in the original trials might also be wrong. In effect, we feel the Weibull assumption is not much more restrictive than the proportional hazards assumption; however, a parametric model additionally allows for directly estimating survival probabilities and NNTs.

It is fair to point to some limitations of our study. We did not perform a lege artis systematic review with a systematic literature search and a published study protocol. Instead, we relied on a published scheme of the large CVOTs, which we nevertheless consider complete. Indeed, we have waived a formal systematic review because a number of systematic reviews of the new antidiabetic drugs are already available and the focus of our work was a methodological one, that is, the computation of absolute effects and their comparison with the reported relative effects.

Of course, meta-analyses of absolute effects face the same threats as those for all other outcome types, with the most important question always being whether the studies are sufficiently homogeneous to be pooled. Admittedly, in the case of absolute effects there is the additional point that effect estimates depend on the underlying baseline risk of the event. However, the CVOTs that we use here had been pooled in numerous meta-analyses up to now, thus confirming that researchers in general judge clinical heterogeneity to be small.

In terms of the NNT as an effect measure, Hansen et al. [32] noted the problem of the “lottery-like” appearance of the NNT. Communicating a fixed value for the NNT (for example, 100, as calculated here for the global NNT with respect to the primary outcome) seems to imply that only exactly one in 100 patients benefits from treatment. While this is formally true for the primary outcome, it is clinically far more plausible that most patients will benefit from treatment, at least to some extent, because anti-glycaemic treatment lowers glucose for most patients and glucose lowering is clearly correlated with the risk of a cardiovascular outcome. Finally, there is solid empirical evidence that patients have problems with interpreting and understanding NNTs. Indeed, even largely different NNT values presented to randomized groups do not result in different proportions of patients accepting treatment [33,34,35]. We therefore agree with other researchers [36, 37] that the NNT should preferably not be used for communication with patients, but rather in research contexts and for communication with health professionals. However, it should be realized that relative effect measures like hazard ratios are also not well understood by patients [38].

All included trials address new antidiabetic drugs, but not other factors that also determine total cardiovascular risk (e.g. blood pressure, cholesterol, environmental background, genetic background, etc.). In addition, reporting of cardiovascular risk of trial populations was limited. Study designs with multifactorial therapeutic approaches and careful characterization of the underlying population risk (e.g. the NID-2 study [39]) are probably more appropriate to evaluate absolute treatment effects. However, our approach yields results at an average risk of all included patients and is thus a valid way to describe absolute treatment effects.

In summary, this study provides a comprehensive analysis of the absolute treatment effects of the newer antidiabetic drug classes (DPP-4 inhibitors, GLP-1 receptor agonists, and SGLT2 inhibitors) from the large CVOTs of new antidiabetic drugs. We found that the respective treatment effects look much less impressive when communicated on an absolute scale, as numbers needed to treat. For a valid overall picture of the benefit of these new antidiabetic drugs, trial authors should thus also report treatment effects on an absolute scale. Authorities responsible for approval should continue to ask for absolute effects estimates to enable health professionals and policy makers to make better informed decisions.