Key Summary Points

A lack of consistency was observed in the assessment of treatment response in patients with atopic dermatitis (AD) both across clinical trials and between trials and health technology assessments (HTAs).

No content validations of the Eczema Area and Severity Index (EASI) and the Investigator’s Global Assessment (IGA) were found, and mixed results were observed between these measures and measures of itch, which was identified as a core patient-relevant symptom.

Including the Peak Pruritus Numerical Rating Scale (PP-NRS) as a measure of itch in clinical trials and HTA will better capture patient-relevant benefit and response.

Studies comparing the psychometric performance of different response criteria are needed to inform which are appropriate to use to compare different treatments in AD.

Introduction

Atopic dermatitis (AD) is a chronic inflammatory disease characterized by inflamed eczematous skin, dryness, pruritus (itch), skin pain and excoriations. Globally, the estimated prevalence of AD is 1–3% in adults, with a two- to threefold increase in incidence in industrialized countries over the past decades [1]. Itch is a core symptom of AD and has a substantial impact on quality of life by causing self-consciousness, bleeding, problems with concentration and sleep disturbance [2]. Reducing itch is the most important treatment goal in patients with AD [3].

Due to advances in our understanding of AD and many unmet therapeutic needs, new therapies have been investigated, including interleukin inhibitors such as dupilumab, lebrikizumab, tralokinumab, and nemolizumab, as well as Janus kinase inhibitors such as upadacitinib, baricitinib and abrocitinib [4]. The addition of new treatments to the rapidly changing AD treatment landscape means that regulators, health technology assessment (HTA) bodies and clinicians need to understand the efficacy of new treatments relative to existing options. Defining and assessing treatment response is key to determining a treatment’s comparative efficacy and value. Currently, in AD, there is no consensus on a standard outcome measure that should be used to assess response [5]. Response is assessed in trials using a variety of different clinician- or patient-reported measures and a range of different response criteria. The resulting patchwork of evidence across trials hinders the ability of clinicians, regulators, and payers to directly compare the efficacies of different treatments.

It is vital that measures used to determine response are psychometrically valid in the population in which they are being used. Measures should be valid, reliable and responsive in the target population; they should be able to detect meaningful change as well as clinically relevant differences in change across treatment groups; and they should comprehensively capture the symptoms that are important to patients [6, 7]. To assess whether this is the case for response measures and response criteria used in AD, this literature review followed a two-step approach. First, a scoping review was conducted to identify which measures and criteria are being used to determine treatment response in patients with AD in clinical trials and HTAs. Second, the authors systematically reviewed the evidence on the psychometric performance of those measures and criteria identified in step one as being used to determine response in clinical trials and HTAs in AD. Through these steps, the literature review aimed to understand the extent to which the response measures and criteria being used in clinical trials and HTAs in AD comprehensively capture the symptoms and treatment benefits important to patients.

Methods

Included Response Measures and Criteria

The definition of response combines two elements: the measure being used to assess response, and the criterion, or threshold, used for that measure to determine whether a patient would be defined as a responder or non-responder.
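To make this two-element definition concrete, the sketch below encodes one criterion shape mentioned later in this review, "EASI 50" (a ≥ 50% improvement from baseline). The function name and scores are hypothetical and purely illustrative:

```python
# Minimal sketch of a response criterion: a measure (here, a hypothetical EASI
# score) combined with a threshold rule. "EASI 50" (>= 50% improvement from
# baseline) is one criterion shape used in HTA submissions.

def easi50_responder(baseline: float, follow_up: float) -> bool:
    """Return True if the score improved by at least 50% from baseline."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    percent_improvement = 100 * (baseline - follow_up) / baseline
    return percent_improvement >= 50

print(easi50_responder(20.0, 4.0))   # 80% improvement -> True
print(easi50_responder(20.0, 12.0))  # 40% improvement -> False
```

The same measure paired with a different threshold (e.g. 75% or 90% improvement) defines a different, stricter criterion, which is why the measure and the criterion must be considered together.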

To determine the relevant outcome measures to include in this systematic review, as well as any response criteria associated with these measures, the authors conducted a scoping review of patient endpoints used to assess response in clinical trials and HTA submissions in AD. Inclusion and exclusion criteria for the scoping review are described in Table 1. The outcome measures used as primary endpoints to assess response in phase 3 clinical trials initiated in the last 10 years in adults and/or adolescents with moderate to severe AD were identified on clinicaltrials.gov (https://www.clinicaltrials.gov, accessed 10 June 2022). Primary endpoints were extracted because trials are powered to detect differences in their primary endpoints. HTA submissions were searched using the National Institute for Health and Care Excellence (NICE) and Canadian Agency for Drugs and Technologies in Health (CADTH) websites (https://www.nice.org.uk, accessed 15 June 2022 and https://www.cadth.ca, accessed 17 June 2022), and measures used to determine response in the economic model (base case or scenario analyses) were extracted.

Table 1 Scoping review criteria for HTA and clinical trials and the response measures and criteria identified

The scoping review identified 46 phase 3 trials and five HTAs. The outcome measures used either as primary endpoints in these clinical trials or as definitions of response in the HTA economic model were the clinician-reported Eczema Area and Severity Index (EASI) and Investigator’s Global Assessment (IGA) and the patient-reported Dermatology Life Quality Index (DLQI) and Peak Pruritus Numerical Rating Scale (PP-NRS) [2, 8, 9]. The scope of this systematic literature review therefore encompasses these four response measures and the criteria defined from them. A brief description of the included measures is provided in the supplementary material. While a variety of criteria defined from all these measures were used as the primary endpoints in AD clinical trials (Table 1), HTA submissions defined response using either the EASI alone or a combined criterion based on improvements in EASI and DLQI scores (a 50% improvement in EASI score and a ≥ 4-point improvement in DLQI score, i.e. “EASI 50 + DLQI ≥ 4”), which was not used in clinical trials.

Search Strategy

The MEDLINE and Embase databases were searched via ProQuest from database inception to July 21, 2022. The search strategy outlined in Table 2 comprised terms for psychometric validation, disease and symptoms, and included measures. All terms were searched for in titles and abstracts only, and wording variations were captured. The search strategy captured both journal articles and conference abstracts indexed in Embase. All search results were screened by a single reviewer (SJ) using the inclusion/exclusion criteria described in Table 3. All citations first underwent title and abstract screening. The full texts of any articles not excluded at title/abstract level were then evaluated for final inclusion by the same reviewer. Ten percent of the records were double screened by an additional reviewer (HP). Any conflicting decisions were discussed between the two reviewers (SJ and HP) until consensus was reached. The bibliographies of systematic reviews were hand searched to identify studies not captured by the database searches.

Table 2 Search terms and results
Table 3 Selection criteria for literature

Data Extraction

Relevant study and participant characteristics, the measures included, and the methods and results of psychometric testing were extracted using a Microsoft Excel form. Psychometric evidence was extracted in relation to validity (content, convergent, known-group, and structural), reliability (test–retest, inter-rater, and intra-rater) and responsiveness. Estimates of minimally important differences (MIDs) and minimal important changes (MICs) were also extracted where reported. Additional information on how these psychometric properties were defined and tested can be found in Table 4 and the online supplementary material. Any tests of the psychometric performance of a specific response criterion were also extracted. Data extraction was conducted by two reviewers (SJ and SS), with all studies double extracted.

Table 4 Psychometric properties and criteria for good performance

Assessment of Psychometric Performance

Predefined criteria for assessing psychometric performance in relation to each psychometric property were defined in accordance with consensus-based standards for the selection of health measurement instruments (COSMIN) and previous reviews in this area, and are shown in Table 4 [10,11,12]. Once data extraction for individual studies was performed, the overall evidence was assessed per measure and psychometric property. Overall ratings for each included response measure per psychometric property were defined (Fig. 1).

Fig. 1
figure 1

Definition of overall ratings for each included response measure per psychometric property

This article is based on previously conducted studies and does not contain any new studies with human participants or animals performed by any of the authors.

Results

Search Results

Of the 464 unique records retrieved from the MEDLINE and Embase databases via ProQuest, 399 records were excluded at title and abstract screening. Sixty-five papers were reviewed in full, of which 42 were excluded for the reasons outlined in Fig. 2. Twenty-three papers and conference abstracts representing 17 unique studies were included. Six conference abstracts reported on studies also covered by included full-text journal articles, while four conference abstracts were not covered by a full-text article. Hand-searching systematic reviews identified one additional study [13], giving a total of 18 unique studies that fulfilled the inclusion criteria for this review.

Fig. 2
figure 2

PRISMA flow diagram

Characteristics of Included Studies

The key characteristics of the 18 unique studies included in this review are summarized in Table 5. Studies were mainly conducted in the United States of America (USA, n = 7), Europe (n = 2) and Australia (n = 2). Most studies (n = 15/18) included adult participants, with the mean age ranging from 30.0 to 52.0 years, whereas one study included only adolescent participants [14]. The age of the participants was not reported in two studies [15, 16]. Where sex was reported (12/18), male participants accounted for 32.6–72.0% of the sample. Sample sizes varied greatly, ranging from 10 to 10,000 participants.

Table 5 Included study characteristics

Most studies assessed the validity and reliability of measures (n = 11 and n = 9, respectively). Six studies also investigated responsiveness, and four studies also estimated MIDs or MICs. Only one study assessed the psychometric performance of a response criterion defined from the measure under investigation [17]. The psychometric results per measure are summarized in Table 6.

Table 6 Summary of psychometric results per measure

EASI

The psychometric performance of the EASI was assessed in seven studies [9, 13, 15, 16, 18,19,20].

Validity

Two studies investigated convergent validity for the EASI [13, 18]. In Bozek 2017, strong correlations between EASI and the objective component of the Scoring Atopic Dermatitis instrument (oSCORAD), the IGA, and patients’ assessments of disease severity were reported (r = 0.66–0.87) [18]. Shim 2011 reported that the EASI was weakly and non-significantly correlated with a visual analogue scale (VAS) for itch (r = 0.17, P = 0.13) but moderately and significantly correlated with a VAS for sleep (r = 0.35, P = 0.002) [13].

Reliability

Three studies assessed intra-rater reliability [15, 16, 18]. Bozek 2017 reported good intra-rater reliability for EASI scores (intraclass correlation coefficient [ICC] = 0.71; the two assessments took place on the same day) [18]. Excellent intra-rater reliability was reported in Zhao 2017 and Zhao 2016 (coefficients not reported; the two assessments took place on the same day) [15, 16]. However, an important element of study quality for intra-rater reliability is that the time between administrations should be long enough to prevent easy recall of the initial rating, which is unlikely with same-day administrations [10].

Inter-rater reliability was assessed in three studies [9, 15, 16]. Zhao 2017 and Zhao 2016 reported good overall inter-rater reliability (Zhao 2017: ICC [95% CI] = 0.79 [0.61–0.92]; Zhao 2016: ICC = 0.85 in light-skinned patients and 0.79 in dark-skinned patients) [15, 16]. Hanifin 2001 reported good inter-rater reliability using the correlation coefficient of reliability (r-hat > 0.75; the assessments took place on consecutive days) [9].
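The ICCs reported in this section can be illustrated with a short sketch on synthetic ratings (not study data). It computes the two-way random-effects, absolute-agreement, single-rater variant, ICC(2,1), one common choice; the included studies do not all specify which variant they used:

```python
import numpy as np

def icc_2_1(x: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    x has shape (n_subjects, k_raters)."""
    n, k = x.shape
    grand = x.mean()
    # Mean squares for subjects (rows), raters (columns), and residual error.
    msr = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    msc = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)
    sse = ((x - grand) ** 2).sum() - msr * (n - 1) - msc * (k - 1)
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Synthetic example: three raters score five patients on a severity scale.
ratings = np.array([
    [10, 11, 10],
    [22, 20, 21],
    [35, 33, 34],
    [ 5,  6,  6],
    [48, 50, 49],
], dtype=float)
print(round(icc_2_1(ratings), 2))
```

With large between-patient variance and small between-rater disagreement, as here, the ICC approaches 1; values near the 0.7–0.8 range reported above reflect proportionally larger rater disagreement.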

Responsiveness

Only one study assessed the responsiveness of the EASI [20]. In Schram 2012, the area under the receiver operating characteristic curve (AUC) was 0.67 (95% CI = 0.60–0.76), suggesting poor responsiveness.
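As background on how such an AUC is obtained: responsiveness AUC analyses ask how well change scores discriminate anchor-defined improved patients from non-improved patients, and the AUC equals the probability that a randomly chosen improved patient has a larger change score than a randomly chosen non-improved one. A minimal sketch with invented change scores (not data from Schram 2012):

```python
def responsiveness_auc(changes_improved, changes_stable):
    """AUC = P(change score of an 'improved' patient exceeds that of a
    'stable' patient), counting ties as half a win. Equivalent to the
    Mann-Whitney U statistic divided by the number of pairs."""
    wins = 0.0
    for a in changes_improved:
        for b in changes_stable:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(changes_improved) * len(changes_stable))

# Hypothetical score improvements for anchor-defined improved vs stable patients.
improved = [12.0, 9.5, 15.0, 7.0]
stable = [2.0, 4.0, 9.5, 1.0]
print(responsiveness_auc(improved, stable))
```

An AUC of 0.5 means the change scores discriminate no better than chance, which is why a value of 0.67 is interpreted as poor responsiveness.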

MIDs and MICs were estimated in two studies [19, 20]. Schram 2012 reported an MID (anchor: 1-point improvement in IGA) of 6.6 points (standard deviation [SD], 5.9). In this study, the MID varied from 1.0 (IGA from 1 to 0) to 8.6 (IGA from 5 to 4) [20]. In Silverberg 2021, 1-point improvements in the Physician’s Global Assessment (PGA) and the Validated Investigator’s Global Assessment for Atopic Dermatitis (vIGA-AD) were associated with an approximately 50% decrease in EASI score, while a 1-point improvement in the Patient-Reported Global Assessment (PtGA) was associated with a 29.9% decrease in EASI score [19]. One-point improvements in PtGA, PGA, and vIGA-AD scores were associated with approximately 10.9-, 14.0-, and 14.9-point absolute decreases in EASI score. The percentage-change MIC for the EASI did not differ with AD severity (P = 0.61), but a significant difference was observed for the absolute-score MIC (P < 0.001) [19].
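Anchor-based MID estimation of the kind reported above typically averages the change in the target measure among patients whose anchor improved by its minimal increment. A sketch with hypothetical paired data (the function name and values are illustrative, not from the included studies):

```python
def anchor_based_mid(changes, anchor_changes, anchor_level=1):
    """Mean change in the target measure among patients whose anchor
    (e.g. a global assessment) improved by exactly `anchor_level` points."""
    selected = [c for c, a in zip(changes, anchor_changes) if a == anchor_level]
    if not selected:
        raise ValueError("no patients at the requested anchor level")
    return sum(selected) / len(selected)

# Hypothetical paired data: target-measure change and anchor improvement
# (in points) for six patients.
easi_changes = [6.0, 8.0, 2.0, 7.0, 12.0, 1.0]
iga_improvements = [1, 1, 0, 1, 2, 0]
print(anchor_based_mid(easi_changes, iga_improvements))  # mean of 6.0, 8.0, 7.0
```

Because the estimate depends on which anchor and which anchor transition is chosen, MIDs can vary substantially within a single study, as the range of 1.0 to 8.6 points in Schram 2012 illustrates.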

IGA

Five studies investigated the psychometric properties of the IGA [16,17,18, 21, 22].

Validity

Convergent validity was assessed in two studies, and known-group validity in one study [18, 21]. In Bozek 2017, strong correlations were observed between the IGA and both the EASI and the oSCORAD (r = 0.66–0.80) [18]. In Simpson 2022, strong correlations were observed with the EASI (r = 0.69–0.89) and body surface area (BSA; r = 0.50–0.75) [21]. Correlations with the Patient-Oriented Eczema Measure (POEM) and the DLQI were weak at baseline (r = 0.30–0.37) but moderate to strong at week 16 (r = 0.43–0.65).

In Simpson 2022, known-group validity was confirmed by comparing vIGA-AD to EASI and Patient Global Impression of Severity—Atopic Dermatitis (PGI-S-AD) severity groups [21]. Patients with a vIGA-AD = 4 were more likely to have worse disease severity on either the EASI or PGI-S-AD compared with patients with a vIGA-AD = 3 (P < 0.01).

Reliability

One study investigated the test–retest reliability of the vIGA-AD between baseline and week 1 and between weeks 4 and 8 [21]. When stability was defined using patients reporting no change on the PGI-S-AD, weighted kappas across trials ranged from 0.52 to 0.64, suggesting poor reliability. When an EASI change below the MID of 6.6 was used to define stability, weighted kappas ranged from 0.66 to 0.78 [21]. These results suggest borderline reliability according to the COSMIN criterion of kappa ≥ 0.7 [10].
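Weighted kappa, used in the test–retest analyses above, penalizes disagreements between ordinal ratings in proportion to their distance on the scale. A self-contained sketch with hypothetical vIGA-AD-style ratings (linear weights shown; the included study does not report its weighting scheme):

```python
import numpy as np

def weighted_kappa(r1, r2, n_categories, scheme="linear"):
    """Weighted kappa for two sets of ordinal ratings coded 0..n_categories-1.
    Disagreement weights are |i - j| / (c - 1) (linear) or its square (quadratic)."""
    c = n_categories
    obs = np.zeros((c, c))
    for a, b in zip(r1, r2):
        obs[a, b] += 1
    obs /= obs.sum()
    # Chance-expected joint distribution from the marginals.
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0))
    i, j = np.indices((c, c))
    w = np.abs(i - j) / (c - 1)
    if scheme == "quadratic":
        w = w ** 2
    return 1 - (w * obs).sum() / (w * expected).sum()

# Hypothetical test-retest ratings on a 0-4 global assessment for ten patients.
visit1 = [0, 1, 2, 3, 4, 2, 3, 1, 0, 4]
visit2 = [0, 1, 2, 3, 3, 2, 4, 1, 1, 4]
print(round(weighted_kappa(visit1, visit2, 5), 2))
```

A kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance, which is the scale against which the COSMIN threshold of 0.7 is applied.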

Intra-rater reliability was tested in two studies [18, 22]. Bozek 2017 reported an ICC of 0.54 ± 0.28 (the assessments took place on consecutive days) [18], while Simpson 2020 found higher intra-rater reliability, with ICCs > 0.8 both between same-day ratings and between ratings of the same photograph taken 5 months apart [22].

Two studies assessed inter-rater reliability. Zhao 2016 reported an ICC (95% CI) of 0.77 (0.58–0.91) in adult patients [16]. Simpson 2020 found ICCs and weighted kappa coefficients of 0.82–0.89, indicating good inter-rater reliability [22].

Responsiveness

One study tested the responsiveness of the vIGA-AD and estimated MIC thresholds [21]. This study reported good responsiveness, as the magnitude of improvement in vIGA-AD scores increased with greater improvement in EASI scores. In the same study, MICs were estimated using anchor-based methods. Overall, the clinical threshold was − 1.00 for minimal meaningful change, − 1.25 to − 1.50 for moderate change, and − 1.75 to − 2.00 for large change. Distribution-based methods gave estimates of − 0.25 (0.5 baseline SD) and − 0.65 (minimal detectable change with 95% confidence) [21].

Silverberg 2019 investigated the performance of the IGA response criterion of IGA ≤ 1 in patients with AD [17]. This study found that people defined as non-responders on the IGA (IGA > 1) nonetheless had clinically meaningful improvements in EASI, PP-NRS, EQ-5D-3L, DLQI, POEM, and BSA scores. This suggests that the IGA ≤ 1 response criterion may be too restrictive, failing to account for the meaningful benefits of itch relief, decreases in the extent and severity of AD lesions, and improvements in overall quality of life [17].

DLQI

Five studies investigated the psychometric properties of the DLQI [23,24,25,26,27].

Validity

One study reported good content validity for the DLQI in patients with AD, with 92% of patients reporting that the DLQI covered all the issues most relevant to them in relation to AD [23]. The four participants who reported that the DLQI did not assess their most important issues identified sleep disturbance as the most important symptom. Some participants reported that items concerning sports and sexual activity were not important to them, as they did not participate in these activities, but no participant considered any items conceptually irrelevant.

Convergent validity was assessed in four studies using clinical trial data in AD [23,24,25,26]. Silverberg 2019 reported strong correlations between DLQI and the Patient-Oriented SCORAD (PO-SCORAD; r = 0.71) and the POEM (r = 0.62), and moderate correlations with the PO-SCORAD itch subscore (PO-SCORAD-itch; r = 0.48) and the Numerical Rating Scale (NRS) for pain (NRS-pain; r = 0.43, P < 0.001 for all) [24]. Patel 2019 reported strong correlations with the Itchy Quality of Life (ItchyQOL), the 5-D Itch Scale (5-D itch), the NRS for average itch (NRS-itch), the POEM, and the SCORAD (r = 0.55–0.79), and a moderate correlation with the EASI (r = 0.44) [23]. In Holm 2005, the DLQI correlated strongly with a pruritus VAS (PRUVAS), a patient disease severity VAS (PTVAS), and an investigator overall assessment VAS (INVAS; r = 0.62–0.82), moderately with the 36-Item Short Form Health Survey mental component score (SF-36 MCS; r = − 0.46) and weakly with the 36-Item Short Form Health Survey physical component score (SF-36 PCS; r = − 0.27) [26]. Herd 1999 found a strong correlation between DLQI and Patient-Generated Index (PGI; r = 0.52, P < 0.001) and moderate correlations with health service costs (r = 0.47) and total costs (r = 0.34) [25].

Known-group validity was assessed in three studies. Patel 2019 found significant stepwise increases in DLQI score at each level of severity measured by the POEM, NRS-itch, EASI, and SCORAD instruments (P < 0.0001) [23]. Similarly, Holm 2005 found that DLQI scores were significantly associated with SCORAD severity groups (P < 0.0001) [26]. In Silverberg 2019, DLQI scores increased significantly with each increasing level of severity on self-reported global AD severity, the POEM, the PO-SCORAD, the PO-SCORAD-itch, the PO-SCORAD sleep subscore (PO-SCORAD-sleep), and the NRS-pain (analysis of variance, P < 0.0001 for all). AUC analysis showed that the DLQI was excellent at distinguishing between severe versus mild AD and good at distinguishing moderate versus mild or severe versus moderate AD, outperforming the 12-Item Short Form Health Survey (SF-12) [24].

One study assessed the structural validity of the DLQI. Sun 2020 used an explanatory factor analysis and found two factors: factor 1 (items 1–7) assessed personal life, and factor 2 (items 8–10) assessed social factors and treatments (eigenvalues = 5.00 and 1.13, respectively) [27]. A bifactor model indicated a good fit (root mean square error of approximation [RMSEA] = 0.046; comparative fit index [CFI] = 0.988), with standardized factor loadings on the general factor of 0.42–0.82. The global factor explained 93.5% of the common variance, whereas the specific factors explained 0.4% and 6.1%, indicating sufficient unidimensionality.

Reliability

Good internal consistency was confirmed in three studies reporting Cronbach’s alphas of 0.89–0.94 for the DLQI [24, 27].
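Cronbach's alpha, the internal-consistency statistic reported above, compares the sum of the item variances with the variance of the total score. A sketch with invented item-level data (not DLQI data):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (n_respondents, k_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical item scores (0-3, as on DLQI-style items) for six respondents.
scores = np.array([
    [3, 2, 3, 2],
    [1, 1, 0, 1],
    [2, 2, 2, 3],
    [0, 0, 1, 0],
    [3, 3, 2, 3],
    [1, 0, 1, 1],
])
print(round(cronbach_alpha(scores), 2))
```

When items move together across respondents, the total-score variance dwarfs the summed item variances and alpha approaches 1, which is the pattern behind the 0.89–0.94 values reported above.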

Responsiveness

The DLQI’s responsiveness was assessed in one study. Patel 2019 observed medium standardized effect sizes (Cohen’s d) in the anticipated direction in DLQI scores for participants who experienced a change in POEM score of ≥ 3.4 points (a previously determined MID), suggesting good responsiveness (absolute Cohen’s d = 0.65–0.72) [23].
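Standardized effect sizes of this kind are computed as mean change divided by a standard deviation; the sketch below uses the SD of baseline scores, one common convention (the exact denominator used by Patel 2019 is not assumed here, and the data are invented):

```python
import statistics

def standardized_effect_size(baseline, follow_up):
    """Mean change divided by the SD of baseline scores: one common
    definition of the standardized effect size for responsiveness."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return statistics.mean(changes) / statistics.stdev(baseline)

# Hypothetical quality-of-life scores before and after treatment (lower = better),
# so improvement yields a negative effect size.
before = [18, 22, 15, 20, 25, 17]
after = [10, 14, 9, 13, 18, 12]
print(round(standardized_effect_size(before, after), 2))
```

By the usual rule of thumb, absolute values around 0.5 are medium and around 0.8 or more are large, which is how the 0.65–0.72 range above is classified as medium.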

PP-NRS

Three studies investigated the performance of the PP-NRS in patients with AD [2, 14, 28].

Validity

Three studies reported good content validity in qualitative interviews with AD patients, with the PP-NRS found to be relevant, appropriate, well understood, and consistently interpreted [2, 14, 28]. All patients across two studies reported itch as a core concept/symptom of their AD [2, 28], while 93% reported at least one meaningful consequence of itch, including embarrassment/self-consciousness, bleeding, problems with concentration, and sleep disturbance [2]. Participants considered both worst and average itch over the past 24 h to be important and comprehensive in assessing itch, but found worst itch easier to rate and more important to improve with treatment [2].

Convergent validity was assessed in three studies using clinical trial data [2, 14, 28]. Moderate to strong correlations between PP-NRS and other patient-reported outcomes (PROs) were found in Yosipovitch 2018 (PROs: SCORAD-itch, Children’s Dermatology Life Quality Index–itch [CDLQI-itch], Pruritus Categorical Scale [PCS], and Patient Global Assessment of Disease [PGAD]) and Yosipovitch 2022 (PROs: DLQI, DLQI-itch, POEM, and POEM-itch) at baseline (r = 0.41–0.73) and strong correlations at week 16 (r = 0.64–0.83) [14, 28]. Both studies found weaker correlations with clinician-reported outcomes (Yosipovitch 2018: EASI and IGA; Yosipovitch 2022: EASI, IGA and BSA). These correlations were weak to moderate at baseline (r = 0.20–0.31) and moderate to strong at week 16 (r = 0.43–0.53). In Yosipovitch 2019, strong correlations were observed with the PCS, the DLQI-itch, and the SCORAD-itch VAS (r = 0.61–0.77), and weak correlations were observed with the EASI and the IGA (r = 0.09–0.24) at baseline [2].

Known-group validity was assessed in two studies [2, 14]. Baseline and week-16 PP-NRS scores differed predictably across CDLQI and PCS levels (F-statistic; all P < 0.0001) in Yosipovitch 2018 [14]. Similarly, scores varied significantly between categories, reflecting no/mild itch/symptoms versus severe itch/symptoms on the PCS, the DLQI, and the Patient Global Assessment of Disease Status (PGADS; P < 0.0001 for all comparisons) in Yosipovitch 2019 [2].

Reliability

Test–retest reliability was confirmed in three studies [2, 14, 28]. Yosipovitch 2018 reported that coefficients between baseline and week 2 and between weeks 15 and 16 exceeded the recommended threshold of 0.7 (the exact test and coefficients are unclear) [14]. Yosipovitch 2019 and 2022 reported ICCs ≥ 0.89, indicating very good test–retest reliability (2019: assessments at weeks 15 and 16; 2022: assessments at weeks 12 and 16), although only Yosipovitch 2022 reported testing this exclusively in participants defined as stable (by IGA score) across the test–retest periods [2, 28].

Responsiveness

Responsiveness was assessed in three studies using both correlations of mean changes with similar measures and effect sizes [2, 14, 28]. Yosipovitch 2018 reported moderate to strong correlations in change scores with similar PROs (r = 0.40–0.68) and significant patterns of mean change across PGAD levels (F-statistic, 23.7; P < 0.0001) [14]. Yosipovitch 2019 reported strong correlations of change scores with PCS, DLQI-itch, and SCORAD-itch VAS (r = 0.64–0.77) and moderate to strong correlations with changes in EASI and IGA scores (r = 0.46–0.50) [2]. Large effect size estimates of change were reported in all three studies [2, 14, 28].

Yosipovitch 2022 reported an MIC of 3 points, estimated with anchor-based methods, using the IGA and the Global Assessment of Change–Atopic Dermatitis (GAC-AD) as anchors [28]. In the qualitative phase, most participants stated that a 2-point (n = 6) or 3-point (n = 10) decrease in PP-NRS indicated meaningful improvement. Yosipovitch 2019 reported MIC estimates based on clinician‐reported and patient-reported anchors (EASI, IGA and PCS) ranging between 2.2- and 4.2-point improvements [2].

Discussion

The scoping review of clinical trials and HTA submissions revealed substantial inconsistency in how response is measured in AD. EASI, IGA, DLQI and PP-NRS were used to define response using various criteria, with little consistency either across clinical trials or between trials and HTAs. While a variety of criteria defined from these measures were used as primary endpoints in clinical trials, HTA submissions defined response using either EASI score alone or a combined criterion based on improvement in EASI and DLQI scores, which was not used in clinical trials. This lack of consistency makes it difficult for clinicians, regulators, and payers to directly compare the efficacies of different treatments and to make optimal treatment and resource allocation decisions. Psychometric evidence on the performance of response measures and criteria should guide decisions on which are most appropriate to use, facilitating consistency.

While this review identified some evidence on the psychometric performance of the measures being used to assess treatment response in AD, it also revealed important gaps in that evidence. Content validity was assessed only for the patient-reported DLQI and PP-NRS; no assessments of the content validity of the clinician-reported EASI and IGA were identified. Content validity is arguably the most important psychometric property, as it determines whether a measure covers what is important to patients without including irrelevant items, and is understood as intended [10, 12]. Content validation of the PP-NRS found that patients reported itch as a core symptom that had an important impact on their daily life and was a priority for treatment. While the EASI and the IGA are established measures of clinical severity, the results of several studies called into question their coverage of patient-relevant symptoms. One included study found that the EASI did not correlate with an itch VAS, indicating a lack of coverage of itch [13]. Another study found that people defined as non-responders by the IGA had clinically meaningful improvements in itch, in the extent and severity of AD lesions, and in overall quality of life, concepts not covered by the IGA [17]. It is therefore vital to investigate the content validity of the EASI and the IGA. These measures are frequently used to assess efficacy and response in clinical trials and HTAs, but they may miss key elements of patient-relevant disease impact and treatment benefit, including itch, leading to an inadequate understanding and estimation of treatment efficacy. Treatment efficacy and cost-effectiveness may consequently be undervalued in regulatory and HTA decision-making, decreasing the chances of treatment acceptance and reimbursement. Moreover, response criteria used in clinical trials and by HTA bodies are likely to find their way into the prescriber setting, where non-responders would not have their treatment reimbursed. If response assessment does not fully capture patient-relevant benefit, this may hamper patients’ access to tailored treatment.

Several studies estimated responder definitions for one of the included measures or reported responsiveness results using a predefined response criterion. Only one study investigated the performance of a predefined response criterion for the IGA [17]. However, no studies were identified that compared the psychometric performance of alternative response criteria to make recommendations on an appropriate and consistent definition of response. Such a definition could inform high-quality evidence synthesis for comparisons of the efficacy and value of different treatments. Such studies are available in other conditions, including rheumatoid and psoriatic arthritis and cancer [7, 29,30,31]. A good response criterion should capture the symptoms, impact of disease, and elements of treatment benefit that are important to patients, and it should be able to discriminate between patients receiving a meaningful treatment benefit and those who are not [7, 30]. Evidence is required on the comparative performance of the different criteria being used to define response to inform which are able to comprehensively capture patient-relevant treatment benefits and distinguish those patients receiving effective treatment. This evidence would pave the way for the standardization of response assessment, which would enable high-quality estimates of the comparative efficacy of treatments and evidence-based regulatory, HTA and clinical decision-making.

Limitations of This Review

The reliability of the results of the included studies is dependent on the quality of those studies. Some quantitative studies were performed using very small sample sizes. Results for convergent and known-group validity are dependent on the appropriateness of the comparator measures and the known groups defined. Not all investigations of test–retest reliability reported an anchor with which participants could be assumed to be stable in health over the test–retest period; including patients who were not stable would affect the results. As some studies were reported in conference abstracts only, sufficient detail to judge how statistical tests were performed was sometimes unavailable. Different versions of IGA scales were used across studies, and studies often failed to provide the exact wording of the IGA used; it is therefore unclear to what extent results on the IGA from different studies are comparable.

Lastly, this review was focused on the psychometric performance of measures and criteria which have been used to assess response in phase 3 clinical trials and HTAs in AD in the last 10 years. Other measures are available which capture patient-relevant endpoints such as itch, including SCORAD and POEM [32, 33]. Future research could investigate the extent to which such measures would be suitable to assess response as primary outcomes in clinical trials and HTAs in AD.

Conclusion

The current landscape of disjointed evidence on treatment response, with different response measures and criteria in use, makes direct comparisons of treatment efficacy nearly impossible for clinicians, regulators, and payers, impeding evidence-based treatment and sound resource allocation decisions. While content validation of the PP-NRS confirmed the importance of itch as a core symptom and treatment priority in AD, the EASI and IGA lack both coverage of itch and content validation. It is concerning that itch is currently not well covered in response assessments when a patient-relevant instrument with sound psychometric properties, the PP-NRS, is available. Including the PP-NRS in both clinical trials and HTAs will place more emphasis on patient-reported benefit and response. Although response thresholds have been estimated in some studies, no studies have compared the psychometric performance of different response criteria to inform which are appropriate for comparing treatments and to pave the way towards a consistent definition of response across trials and HTAs.