Dear Editor,


We read with great interest the article by Fahrbach and colleagues [1], which reported the results of a systematic literature review (SLR) and network meta-analysis (NMA) to estimate the relative clinical effectiveness of crisaborole 2% ointment compared with other topical treatments for mild-to-moderate atopic dermatitis (AD). We would like to point out some concerns regarding the methodology and outcomes presented in this article.

Firstly, it appears that the authors did not include all studies available for pimecrolimus (versus vehicle/placebo/cream) in their analysis. While the search algorithm and the inclusion/exclusion criteria used by the authors appear robust, they were able to find only one relevant publication for pimecrolimus (Eichenfield et al.) [2]. Of the various trials that compared pimecrolimus with vehicle in mild-to-moderate AD, we believe that Leung et al. [3] (which reported Investigator's Global Assessment [IGA] data at week 6), the Novartis study ASM981C2322 [4] (which reported IGA data at weeks 1, 2, and 4), and Emer et al. [5] (which reported Physician Global Assessment data at weeks 2 and 4) meet the inclusion criteria of the review and should have been included in the analysis.

In the article [1], relative treatment effects were analyzed in terms of the hazard ratio (HR), which takes into account the number of patients experiencing the event (here, achieving an Investigator’s Static Global Assessment [ISGA] score of 0/1) along with the timing of the event in the two arms being compared. While a single HR over the entire follow-up provides information on the likelihood of obtaining ISGA 0/1, it does not provide information on how soon these outcomes may occur. Because time-to-event analysis and the use of the HR as an endpoint are uncommon in AD, it would have been helpful if the authors had reported the time-dependent HR profile for each comparison in order to show how the relative effects evolve over time. Alternatively, more common metrics such as the relative risk and the odds ratio at various time points (weeks 1, 2, 3, and 4) could have been used. Reporting outcomes using such metrics would have enabled the study results to be validated against previously published SLRs and meta-analyses. For example, the HR for pimecrolimus versus vehicle was reported by the authors to be 1.28 (95% credible interval 0.92–1.78; probability that treatment is better than vehicle 93.5%) [1], with a credible interval that includes 1. This contrasts with the underlying study [2] and other previously published SLRs and meta-analyses [6,7,8], which show that pimecrolimus is significantly better than vehicle at weeks 4 and 6.
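As a rough illustration of the time-point metrics mentioned above, the following Python sketch computes the relative risk and the odds ratio, each with a Wald-type 95% confidence interval, from responder counts at individual weeks. The counts are hypothetical and are not taken from any of the trials discussed; the function names and the frequentist interval are our own assumptions, chosen only to show what week-by-week reporting could look like.

```python
import math

def relative_risk(events_trt, n_trt, events_ctl, n_ctl, z=1.96):
    """Relative risk with a Wald-type interval computed on the log scale."""
    rr = (events_trt / n_trt) / (events_ctl / n_ctl)
    se = math.sqrt(1 / events_trt - 1 / n_trt + 1 / events_ctl - 1 / n_ctl)
    return rr, rr * math.exp(-z * se), rr * math.exp(z * se)

def odds_ratio(events_trt, n_trt, events_ctl, n_ctl, z=1.96):
    """Odds ratio with a Wald-type interval computed on the log scale."""
    a, b = events_trt, n_trt - events_trt   # responders / non-responders, treatment
    c, d = events_ctl, n_ctl - events_ctl   # responders / non-responders, vehicle
    estimate = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return estimate, estimate * math.exp(-z * se), estimate * math.exp(z * se)

# HYPOTHETICAL responder counts (ISGA 0/1) by week, for illustration only:
# week -> (responders on treatment, n on treatment, responders on vehicle, n on vehicle)
hypothetical_counts = {
    1: (10, 100, 8, 100),
    2: (20, 100, 12, 100),
    4: (40, 100, 20, 100),
}

results = {
    week: {"rr": relative_risk(*counts), "or": odds_ratio(*counts)}
    for week, counts in hypothetical_counts.items()
}

for week, res in results.items():
    rr, rr_lo, rr_hi = res["rr"]
    or_, or_lo, or_hi = res["or"]
    print(f"week {week}: RR {rr:.2f} ({rr_lo:.2f} to {rr_hi:.2f}), "
          f"OR {or_:.2f} ({or_lo:.2f} to {or_hi:.2f})")
```

Presenting estimates in this form for each week would allow direct comparison with meta-analyses that report responder proportions at fixed time points.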

The analysis performed by the authors used a complementary log–log (clog-log) link to adjust for different follow-up durations across trials by assuming a Poisson process in each trial [1]. The authors point out that this requires the assumption that HRs are constant over the entire duration of follow-up [1]. This is a strong assumption that the authors have not validated, and it seems unlikely to hold across all the studies included in the review.
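To make the assumption concrete, the minimal Python sketch below shows what a constant-hazard (Poisson process) model implies. If the event rate in each arm is constant, the probability of response by time t is 1 - exp(-rate * t), so the difference between arms on the clog-log scale equals log(HR) at every time point, whereas the relative risk on the probability scale changes with follow-up. The vehicle rate and the HR used here are hypothetical values chosen for illustration, not estimates from the article [1].

```python
import math

def cloglog(p):
    """Complementary log-log transform: log(-log(1 - p))."""
    return math.log(-math.log(1.0 - p))

def event_probability(rate, t):
    """P(event by time t) under a constant hazard `rate` (Poisson process)."""
    return 1.0 - math.exp(-rate * t)

# HYPOTHETICAL inputs, for illustration only
vehicle_rate = 0.05   # assumed events per patient-week on vehicle
true_hr = 1.5         # assumed constant hazard ratio, active vs vehicle
weeks = [1, 2, 4, 6]

rows = []
for t in weeks:
    p_vehicle = event_probability(vehicle_rate, t)
    p_active = event_probability(vehicle_rate * true_hr, t)
    # The difference on the clog-log scale recovers log(HR) regardless of t
    recovered_hr = math.exp(cloglog(p_active) - cloglog(p_vehicle))
    # The relative risk, by contrast, depends on the length of follow-up
    rr = p_active / p_vehicle
    rows.append((t, p_vehicle, p_active, recovered_hr, rr))

for t, p_v, p_a, hr, rr in rows:
    print(f"week {t}: vehicle {p_v:.3f}, active {p_a:.3f}, HR {hr:.3f}, RR {rr:.3f}")
```

If the observed relative effects in the trials do not behave in this way over time, for example if the treatment effect appears early and then plateaus, the constant-HR assumption underlying the clog-log model would be questionable.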

The NMA was performed with baseline risk regression to adjust for differences in vehicle response across the studies included [1]. The authors mention that this was necessitated by the variation in vehicle composition and the heterogeneity in patient characteristics [1]. While the vehicle bases used in the trials were not identical (ointment in crisaborole and tacrolimus trials, cream in pimecrolimus trials), there is insufficient evidence in the literature to suggest that the response rate varies with the emollient used in the treatment; indeed, a 2017 Cochrane review did not find any reliable evidence for such a variation [9]. In fact, even though the two crisaborole trials (AD-301 and AD-302) used the same vehicle base, the vehicle response rates were dissimilar [10], indicating that the difference in vehicle response could be due to patient heterogeneity and unexplained random variation rather than a difference in vehicle composition. Furthermore, if the authors believe that all the vehicle arms are indeed different across the studies, an NMA may not be appropriate at all, since the vehicle would then not serve as a common comparator linking the treatments. If the authors had access to patient-level data, a matching-adjusted indirect comparison could have been carried out to validate and support the evidence reported in the article.

In the NMA [1], the adjustment for baseline risk via regression was based on a limited number of studies (three for tacrolimus, two for crisaborole, and one for pimecrolimus). We believe that this number of studies is too small to assess and quantify the relationship between vehicle response and relative treatment effects. Also, the assumption that the same relationship holds for all three treatments seems doubtful.

The authors present the results from the clog-log model with adjustment for baseline risk and class effects, on the grounds that it accounted for both effects and had the lowest deviance information criterion (DIC) [1]. All the models tested by the authors had a DIC of between 124.7 and 130 [1], a maximum difference of 5.3. A general rule when interpreting the DIC is that it could be misleading to report only the model with the lowest DIC if the difference in DIC is less than 5 and the models make very different inferences. It would therefore have been useful if the authors had also presented outcomes for the other five models to ascertain whether different models lead to different inferences.

We looked at the ISGA scores for crisaborole and pimecrolimus in the underlying studies used in the NMA [2, 10] (Table 1). Based on an unadjusted comparison, pimecrolimus would appear to be superior to crisaborole (Table 1). However, based on the NMA, the authors concluded that crisaborole was superior to pimecrolimus in terms of ISGA 0/1 at 28–42 days [1]. Given that the inference reverses depending on the model type used, it would be inappropriate to draw a conclusion based solely on a single model.

Table 1 Patients with an ISGA score of 0/1 (clear/almost clear) at week 4 in the crisaborole [10] and pimecrolimus [2] trials

While we believe that a simple random-effects NMA model that does not adjust for vehicle response may be inaccurate, drawing a conclusion based solely on the authors’ model [1] would also be misleading, and the comparative effectiveness of crisaborole and pimecrolimus (and probably crisaborole and tacrolimus) is therefore uncertain. Given the uncertainties in the NMA [1], head-to-head clinical trials would be required to assess the relative effectiveness of crisaborole versus pimecrolimus, and that of crisaborole versus tacrolimus.

In conclusion, there are uncertainties regarding the methodology used in the NMA, and the results of the analysis should be interpreted with caution.