Key Summary Points

Why carry out this study?

The clinical manifestations of plaque psoriasis have been shown to severely impact patients’ quality of life and emotional well-being; therefore, it is important to measure patient experience alongside clinical parameters in the evaluation of treatments for plaque psoriasis.

The Psoriasis Symptoms and Impacts Measure (P-SIM) is a novel, patient-reported outcome tool developed to specifically capture patients’ experiences of the key signs, symptoms and impacts of plaque psoriasis; here, it was completed daily by patients at home on a handheld device on a 0–10 numeric rating scale, and item scores were derived as averages of daily scores over a week.

In these analyses, the psychometric properties of the P-SIM were evaluated using data pooled from the first 16 weeks of the BE VIVID and BE READY phase 3 trials investigating bimekizumab in the treatment of moderate to severe plaque psoriasis.

What was learned from this study?

In these analyses, the 14 item scores of the P-SIM demonstrated good reliability, validity and sensitivity to change in the assessment of patients’ experiences of key psoriasis signs, symptoms and impacts when completed daily on a handheld device over a period of 16 weeks.

Anchor-based analyses determined a 4-point decrease in item scores of the P-SIM as indicative of a marked clinically meaningful improvement when the P-SIM was completed daily on a handheld device; this threshold can be used to assess treatment effects over 16 weeks in patients with moderate to severe plaque psoriasis.

Introduction

Psoriasis is an immune-mediated, chronic, inflammatory skin disease that can take a variety of forms [1]. The symptoms most frequently experienced by patients with psoriasis include itching and scaling [2], and the clinical manifestations of psoriasis can have wide-ranging impacts on quality of life and emotional well-being [3,4,5]; therefore, it is important to measure patient experience alongside clinical parameters. Understanding patients’ experiences of psoriasis and needs due to their disease is key to supporting their treatment.

Patient-reported outcome measures have been developed to assess quality of life in patients with skin conditions. For example, the Dermatology Life Quality Index (DLQI) is widely used to assess quality of life changes in dermatology. However, it is not specific to psoriasis, so may not fully capture the experiences of patients with this disease [6, 7]. Psoriasis-specific tools have also been developed for use in specific clinical trials, such as the 16-item Psoriasis Symptom Diary [8, 9], the Psoriasis Symptoms Scale, which captures severity of pain, redness, itching and burning [10], and the single-item Itch Numeric Rating Scale [11], among others [12,13,14]. These instruments assess patient-reported signs and symptoms of psoriasis. However, they are not available for use in the clinical evaluation of new treatments, and some of these instruments may omit relevant signs, symptoms and impacts.

The Psoriasis Symptoms and Impacts Measure (P-SIM) was developed to capture patients’ experiences of the signs, symptoms and impacts of psoriasis in the bimekizumab in plaque psoriasis clinical programme. It was designed as a 14-item daily electronic diary that can be completed at home by the patient for ease of use. The development and validity of the P-SIM content have been described previously, including a literature review investigating commonly reported signs, symptoms and impacts of plaque psoriasis, clinical expert interviews and patient interviews for the purposes of concept elicitation, cognitive debriefing and usability testing [15]. Quantitative analyses following these steps led to the finalisation of the 14 items of the P-SIM [15]. In addition to content validity, the US Food and Drug Administration (FDA) also recommends assessment of other psychometric properties of newly developed patient-reported outcome measures (reliability, construct validity, and ability to detect change) [16], and determination of thresholds for interpreting the clinical meaningfulness of within-patient score changes over a predetermined time period (a responder definition [RD]) [16]. To date, no RD thresholds have been reported for the P-SIM.

Here, the psychometric properties of the P-SIM, as well as the suitability of the weekly missing data rule currently applied to the instrument, are assessed using data from the BE VIVID (NCT03370133) and BE READY (NCT03410992) phase 3 trials. RD thresholds for the P-SIM items are also estimated.

Methods

Study Design

Blinded data were pooled from the first 16 weeks of the BE VIVID and BE READY phase 3 trials investigating bimekizumab in the treatment of moderate to severe plaque psoriasis.

BE VIVID was a randomised, double-blinded, placebo- and active comparator-controlled phase 3 study (Supplementary Fig. 1a). Patients in BE VIVID were randomised 4:2:1 to bimekizumab 320 mg every 4 weeks (Q4W), ustekinumab 45/90 mg (by weight) administered at baseline and then every 12 weeks (Q12W) from week 4, or placebo for initial treatment (weeks 0–16). BE READY was a double-blinded, placebo-controlled, randomised withdrawal phase 3 study (Supplementary Fig. 1b). Patients in BE READY were randomised 4:1 to bimekizumab 320 mg Q4W or placebo for initial treatment (weeks 0–16). Both trials enrolled adult patients with a diagnosis of moderate to severe plaque psoriasis ≥ 6 months prior to screening, with baseline Psoriasis Area and Severity Index (PASI) ≥ 12 (on a scale from 0–72), ≥ 10% body surface area (BSA) affected and an Investigator’s Global Assessment (IGA) score ≥ 3 on a 5-point scale. The co-primary endpoints for both studies were 90% improvement in PASI (PASI 90) and IGA 0/1 (0 [clear] or 1 [almost clear] with ≥ 2-category improvement from baseline) responses at week 16. Full study designs and efficacy and safety outcomes have been reported previously [17, 18].

The study protocols, amendments and patient informed consent were reviewed by a national, regional or independent ethics committee or institutional review board. BE VIVID (NCT03370133) and BE READY (NCT03410992) were conducted in accordance with the current version of the applicable regulatory and International Conference on Harmonisation (ICH)-Good Clinical Practice (GCP) requirements, the ethical principles that have their origin in the principles of the Declaration of Helsinki, and the local laws of the countries involved.

Measurements and Outcomes

The P-SIM e-diary was to be completed daily in the evening by patients at home via a handheld electronic device from baseline through week 16. It consisted of 14 items, listed in Table 1. Each item was scored for worst severity or level of impact in the previous 24 h on a scale from 0 to 10 (0 meaning “no sign, symptom or impact”, 10 meaning “very severe sign, symptom or impact”). Average weekly scores were derived for each of the 14 P-SIM items. Weekly scores were considered missing if ≥ 4 daily scores were missing for that week.
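The weekly scoring rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the trial's derivation code: the function name `weekly_item_score`, the constant `MAX_MISSING_DAYS` and the use of `None` for a missed diary day are assumptions made for the example.

```python
from typing import Optional, Sequence

# Assumption for illustration: a week holds exactly 7 diary days, and a weekly
# score is set to missing when >= 4 daily scores are missing (so at most 3
# missing days are tolerated).
MAX_MISSING_DAYS = 3


def weekly_item_score(daily_scores: Sequence[Optional[float]]) -> Optional[float]:
    """Average one item's daily 0-10 scores over a 7-day week.

    `None` marks a day on which the diary was not completed. Returns `None`
    when the weekly score is considered missing under the rule.
    """
    if len(daily_scores) != 7:
        raise ValueError("expected exactly 7 daily entries")
    observed = [score for score in daily_scores if score is not None]
    n_missing = 7 - len(observed)
    if n_missing > MAX_MISSING_DAYS:
        return None  # >= 4 daily scores missing: weekly score is missing
    return sum(observed) / len(observed)
```

Under this sketch, a week with three missed days still yields an average of the four observed scores, whereas a week with four missed days yields no weekly score.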

Table 1 Items of the P-SIM

Other outcomes reported in BE VIVID and BE READY relevant to this validation analysis include PASI, IGA, DLQI, and Patient Global Assessment of Psoriasis (PGAP). The DLQI consists of ten items, each scored from 0 to 3, with 3 representing the highest impact (overall score range 0–30) [7]. DLQI item 1 (Question: “over the last week, how itchy, sore, painful, or stinging has your skin been?”; answer: “not at all”, “a little”, “a lot”, or “very much”) is specific to skin symptoms commonly seen in psoriasis and is considered individually in these analyses. The PGAP consists of a multiple-choice question (“How severe are your psoriasis-related symptoms right now?”), with answers scored from 1 to 5 (1, “no symptoms”; 2, “mild symptoms”; 3, “moderate symptoms”; 4, “severe symptoms”; 5, “very severe symptoms”). Both DLQI item 1 and the PGAP are verbal rating scales. Details of PASI and IGA score measurements can be found in the Supplementary Methods [19, 20]. The PGAP was completed on the same electronic device as the P-SIM; all other outcome data were collected on a tablet during on-site study visits (weeks 0, 1, 2, 4, 8, 12 and 16).

Confirmation of Weekly Scoring Rule

Analyses were conducted to establish whether alternatives to the currently applied weekly scoring rule for the P-SIM, in which weekly item scores are considered missing for patients in weeks where they had ≥ 4 missing daily scores in that item, impacted the variability of weekly item scores.

Patients with no missing data were included for analysis of each item. Scenarios for the number of days missing (1, 2, 3, 4, 5 or 6 days missing) were simulated using a bootstrapping method with replacement by randomly sampling the appropriate number of daily scores for that missing day scenario and calculating average weekly scores. For each missing day scenario for a given item, each patient’s simulated weekly score was the mean of 100 replications of a random selection of daily scores. Means and standard deviations (SDs) were calculated for each item for each patient for the non-missing case and each missing day scenario. Overall means and SDs were also calculated for each item by pooling the weekly scores of patients. The SDs for missing and non-missing scenarios for each item were compared by visual inspection and the Brown–Forsythe test.
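One possible reading of this simulation, for a single patient, item and week, is sketched below in Python. The interpretation that a scenario with k missing days draws 7 − k daily scores with replacement from the complete week is an assumption, as are the function name `simulated_weekly_score` and its parameters; the source does not give implementation details.

```python
import random
from statistics import mean


def simulated_weekly_score(daily_scores, n_missing, n_reps=100, rng=None):
    """Simulate a weekly score for a week with `n_missing` days missing.

    Assumed procedure: in each replication, draw (7 - n_missing) daily scores
    with replacement from the complete week and average them; the simulated
    weekly score is the mean over `n_reps` replications.
    """
    rng = rng or random.Random()
    n_kept = len(daily_scores) - n_missing
    if not 1 <= n_kept <= len(daily_scores):
        raise ValueError("n_missing must leave at least one daily score")
    replicate_means = [
        mean(rng.choices(daily_scores, k=n_kept)) for _ in range(n_reps)
    ]
    return mean(replicate_means)


# Hypothetical complete week of daily scores for one item (not trial data).
complete_week = [6, 5, 7, 6, 4, 5, 6]
scores_by_scenario = {
    k: simulated_weekly_score(complete_week, k, rng=random.Random(k))
    for k in range(1, 7)
}
```

The per-scenario scores obtained this way could then be pooled across patients to compare SDs against the non-missing case, as described above.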

Psychometric Validation

Construct Validity

Convergent validity was assessed for P-SIM items by calculating Spearman’s rank correlation coefficients at baseline and week 16 for inter-item correlations and correlations between P-SIM item scores and clinician-reported outcome (PASI total score and IGA score) and patient-reported outcome (DLQI total score, DLQI item 1 score and PGAP score) scores. Only pairs assessed on the same date were included in these analyses.

A correlation coefficient > 0.3 and ≤ 0.5 indicated moderate convergent validity, and > 0.5 indicated strong convergent validity [21]. It was hypothesised that P-SIM items would have strong correlations with each other, and moderate to strong correlations with PASI, IGA, DLQI and PGAP total scores. Items 1, 3, 4 and 8 (itching, skin pain, burning and irritation, respectively) of the P-SIM were expected to have strong correlations with DLQI item 1.
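As an illustration of how these thresholds might be applied, the following Python sketch computes Spearman's rank correlation (as the Pearson correlation of tie-averaged ranks) and maps the coefficient to the categories above. The function names and the label "below moderate" for coefficients ≤ 0.3 are assumptions for the example; the source names only the moderate and strong categories.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def convergent_validity(r):
    """Classify a coefficient using the thresholds stated in the text."""
    if r > 0.5:
        return "strong"
    if r > 0.3:
        return "moderate"
    return "below moderate"  # label assumed; not named in the source
```

For example, the coefficient of 0.49 reported later for item 14 with items 3 and 4 at baseline falls in the moderate category under this mapping.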

The ability of P-SIM items to discriminate between clinically different groups of patients, predefined according to relevant clinician-reported outcomes, was also assessed. Mean P-SIM item scores were assessed at week 16 in known subgroups of patients defined on the basis of absolute PASI total score thresholds (≤ 1, > 1 to ≤ 3, > 3 to < 5, ≥ 5 to < 12, and ≥ 12) and IGA scores (0, “clear”; 1, “almost clear”; 2, “mild”; 3, “moderate”; and 4, “severe”) at week 16. PASI and IGA are well-accepted clinical measures of psoriasis disease severity and were used to define the primary efficacy endpoints in the bimekizumab phase 3 studies, BE VIVID and BE READY. As expected, as a result of the inclusion criteria of the BE VIVID and BE READY trials (PASI ≥ 12, IGA ≥ 3), there was low variability in scores for both measures at baseline; assessment of known-groups validity therefore focused on week 16. The absolute PASI values 1, 3 and 5 used to define subgroups have been shown to provide reliable estimates of disease activity that can be used to define treatment goals for psoriasis treatment and facilitate clinical decisions [22], while an absolute PASI value of 12 was used to define moderate to severe plaque psoriasis in the BE VIVID and BE READY inclusion criteria. The IGA values used to define subgroups with clinically different psoriasis severity levels are the response options of the 5-point IGA scale, which has been shown to be a valid and reliable measure of psoriasis severity [20].

It was hypothesised that P-SIM item score means and SDs would be higher in patients with higher PASI and IGA scores. P values from the Kruskal–Wallis test were calculated to compare distributions among the known groups.
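For readers unfamiliar with the Kruskal–Wallis test, a minimal Python sketch of the H statistic (with the usual tie correction) is given below. It is not the analysis code used in the study; the function name `kruskal_wallis_h` is hypothetical, and the p value, which would conventionally be obtained by referring H to a chi-square distribution with (number of groups − 1) degrees of freedom, is deliberately left out.

```python
from collections import Counter


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a list of groups of scores.

    Ranks are assigned over the pooled data, with tied values receiving the
    average of the ranks they occupy; H is divided by the tie correction.
    """
    pooled = sorted(value for group in groups for value in group)
    n_total = len(pooled)
    counts = Counter(pooled)

    # First (1-based) position of each distinct value in the sorted data.
    first_pos = {}
    for pos, value in enumerate(pooled, start=1):
        first_pos.setdefault(value, pos)
    rank_of = {v: first_pos[v] + (counts[v] - 1) / 2 for v in counts}

    rank_sum_term = sum(
        sum(rank_of[v] for v in group) ** 2 / len(group) for group in groups
    )
    h = 12 / (n_total * (n_total + 1)) * rank_sum_term - 3 * (n_total + 1)

    tie_term = sum(c ** 3 - c for c in counts.values())
    correction = 1 - tie_term / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    return h / correction
```

Here each group would hold the week 16 scores of one P-SIM item for patients in one PASI or IGA subgroup; a larger H indicates greater separation of the score distributions across subgroups.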

Test–Retest Reliability

Intraclass correlation coefficients (ICCs; calculated as the ratio of the between-patient variance to the total variance) were calculated for each item’s score between baseline and week 2 in stable patients during this period. Stable patients were defined as those whose IGA score did not change in this interval. ICCs ≥ 0.70 were considered acceptable for test–retest reliability [23].
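The source does not state which ICC form was used. As a hedged illustration, the sketch below assumes a one-way random-effects ICC for two assessments per patient (baseline and week 2), estimated from the between-patient and within-patient mean squares; the function name `icc_oneway` and the example scores are hypothetical.

```python
from statistics import mean


def icc_oneway(pairs):
    """One-way random-effects ICC for paired (baseline, week 2) scores.

    Assumed form: (MSB - MSW) / (MSB + (k - 1) * MSW) with k = 2 assessments,
    which estimates between-patient variance over total variance.
    """
    n = len(pairs)
    k = 2
    if n < 2:
        raise ValueError("at least two patients are required")
    grand_mean = mean(value for pair in pairs for value in pair)
    patient_means = [mean(pair) for pair in pairs]
    # Between-patient mean square.
    msb = k * sum((m - grand_mean) ** 2 for m in patient_means) / (n - 1)
    # Within-patient mean square.
    msw = sum(
        (value - m) ** 2
        for pair, m in zip(pairs, patient_means)
        for value in pair
    ) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


# Hypothetical weekly item scores for three stable patients (not trial data).
stable_patients = [(2.0, 3.0), (5.0, 5.0), (8.0, 7.0)]
acceptable = icc_oneway(stable_patients) >= 0.70
```

In this sketch, identical baseline and week 2 scores for every patient give an ICC of 1, and the value falls as within-patient differences grow relative to between-patient differences.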

Sensitivity to Change Over Time

Spearman’s correlation coefficients and p values were calculated between changes from baseline to week 16 in P-SIM item scores and in other clinically relevant outcomes (PASI, IGA, PGAP, DLQI item 1 and DLQI). Correlations using the change from baseline to week 12 for PGAP were assessed as a sensitivity analysis; this was because a large amount of PGAP data were missing at week 16 as a result of technical limitations of the electronic device. A Spearman correlation ≥ 0.30 was considered to demonstrate acceptable responsiveness [24].

Determination of Responder Definition

Anchor-based analyses were performed in line with the recommendations for defining response thresholds for within-patient meaningful changes provided in the FDA patient-reported outcome guidance. Distribution-based analyses were conducted to provide supportive information around the variability of the scores, as per FDA guidance [16]. Triangulation was performed, examining the estimates from the anchor-based analyses and considering the distribution-based analyses as supportive information, to estimate values for clinically meaningful changes in each P-SIM item score.

For the anchor-based analyses, only anchors with Spearman correlation ≥ 0.30 between changes from baseline to week 16 in the anchor (week 12 for PGAP) and P-SIM items were considered. The anchors used, based on PASI, IGA, DLQI, DLQI item 1 and PGAP response, are shown in Supplementary Table 1. As a result of small sample sizes (n < 15) in some anchor response categories, some of the adjacent original categories were collapsed.

Actual mean change scores from baseline to week 16 were calculated for each level of improvement in each anchor. The empirical cumulative distribution function (eCDF) and probability density function (PDF) curves of observed changes at week 16 for each P-SIM item were plotted separately for each responder group for each anchor. These curves were examined to see if separation between levels of response on anchors could be observed. If so, actual change scores that optimally differentiated responders from non-responders were identified to support determination of the RD for each item. Effect sizes for each level of change in the anchors were estimated, and 95% confidence intervals for the observed mean change were calculated using bootstrapping methods.
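To make the eCDF step concrete, the following Python sketch builds an eCDF of change scores for each anchor response group and evaluates it at a candidate threshold. The group labels and change scores are invented for illustration, and `make_ecdf` is a hypothetical helper name; only the general technique is taken from the text.

```python
from bisect import bisect_right


def make_ecdf(changes):
    """Return F, where F(x) is the proportion of change scores <= x."""
    ordered = sorted(changes)
    n = len(ordered)
    if n == 0:
        raise ValueError("at least one change score is required")

    def ecdf(x):
        return bisect_right(ordered, x) / n

    return ecdf


# Hypothetical changes from baseline to week 16 in one P-SIM item, grouped by
# anchor response category (negative values indicate improvement).
changes_by_anchor_group = {
    "no change": [-1.0, 0.0, -0.5, 1.0, -2.0],
    "one-level improvement": [-4.5, -3.0, -5.0, -4.0, -2.5],
}

candidate_threshold = -4.0
proportion_reaching = {
    group: make_ecdf(changes)(candidate_threshold)
    for group, changes in changes_by_anchor_group.items()
}
```

A candidate threshold differentiates the groups well when the proportion reaching it is high in the improved group and low in the no change group, which corresponds to visible separation between the plotted curves.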

Distribution-based analyses used the standard error of measurement for P-SIM items, and half an SD of item scores at baseline and study visits [25, 26]. If the SD changed significantly over time, the baseline SD was used; otherwise the mean of the SDs across visits was used.
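The two distribution-based quantities can be written compactly. The sketch below assumes the conventional formula for the standard error of measurement, SD × √(1 − reliability), with the test–retest ICC standing in for reliability; the source does not state the formula used, and the function names are hypothetical.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability); conventional formula, assumed here."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * sqrt(1 - reliability)


def half_sd(sd):
    """Half a standard deviation of the item score."""
    return 0.5 * sd
```

With a hypothetical item SD of 2.5 and a reliability of 0.91, this gives an SEM of about 0.75 and a half-SD of 1.25, both well below a four-point change.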

Items 1 (itching), 3 (skin pain) and 5 (scaling) of the P-SIM were prioritised in the RD analyses owing to their clinical relevance, importance to patient experience, and use as efficacy endpoints in the BE VIVID and BE READY trials.

Results

Patient Disposition and Baseline Characteristics

In BE VIVID, 567 patients were randomised at baseline: 321 to bimekizumab, 163 to ustekinumab and 83 to placebo. In BE READY, 435 patients were randomised at baseline: 349 to bimekizumab and 86 to placebo. Blinded 16-week data were pooled for the 1002 patients randomised across both studies.

Demographics and baseline disease characteristics are shown in Table 2. No major differences were noted between studies, except for race/region; BE VIVID included patients from Japan (19.0%), while BE READY did not. The majority of patients were white (82.1%) and male (71.8%), and 40.0% of patients had a baseline PASI ≥ 20, with the mean BSA affected by psoriasis being 26.6%.

Table 2 Demographics and baseline characteristics

P-SIM Completion Rates

Completion rate (number of patients with a non-missing weekly item score divided by the number of patients randomised at baseline) for the P-SIM ranged from 89.0% at week 1 to 77.8% at week 16. There were no item-level missing P-SIM data; patients either completed all items or no items.

Confirmation of Weekly Scoring Rule

A total of 821 patients (81.9%) had at least one week with zero daily scores missing and were included in these analyses. For all items and for all missing day scenarios, the SDs found through simulations were not significantly different from those in the ‘zero days missing’ scenario, and the mean item scores were similar (Supplementary Table 2). This suggests that P-SIM item score mean and variability remain similar regardless of the missing score rule applied, confirming that the weekly scoring rule used here (in which weekly scores were considered missing if ≥ 4 daily scores were missing for that week) is acceptable.

Psychometric Validation

Construct Validity

As hypothesised, P-SIM items demonstrated strong inter-item convergent validity (Supplementary Table 3): almost all P-SIM items were strongly correlated (Spearman’s correlation coefficient, r > 0.5) with each other at both baseline and week 16, with the exception of item 14 (choice of clothing) with items 3 (skin pain) and 4 (burning) at baseline (r = 0.49).

As expected, P-SIM items 1, 3, 4 and 8 (itching, skin pain, burning and irritation, respectively) were strongly correlated with DLQI item 1 score at both time points (Supplementary Table 4). All P-SIM items were moderately to strongly correlated with DLQI total score, DLQI item 1 score and PGAP score at both baseline and week 16 (week 12 scores were used for PGAP because of substantial amounts of missing data at week 16). Lower correlations were observed between P-SIM outcomes and the clinician-reported outcomes, PASI total score and IGA score, at baseline. This can be explained by the low variability in PASI total score and IGA at baseline, and was expected owing to the inclusion criteria of both studies (PASI ≥ 12 and IGA ≥ 3). Strong correlations with clinician-reported outcomes were, however, observed at week 16 across nearly all items (with the exception of items 12 [fatigue] and 14 [choice of clothing], for which correlations were moderate, ranging from 0.46 to 0.50 at week 16).

The P-SIM behaved as expected across all items when discriminating between known groups of patients defined on the basis of clinically relevant outcomes, with higher P-SIM item scores seen in patient known groups with higher PASI total score and IGA score (Fig. 1 and Supplementary Fig. 2) at week 16.

Fig. 1

P-SIM items 1, 3 and 5 scores at week 16 by IGA score and PASI total score subgroups. a P-SIM item 1 (itching). b P-SIM item 3 (skin pain). c P-SIM item 5 (scaling). Data were pooled from BE VIVID and BE READY. Black circular markers indicate mean scores and blue circular markers indicate outliers. IGA Investigator’s Global Assessment, PASI Psoriasis Area and Severity Index, P-SIM Psoriasis Symptoms and Impacts Measure

Sensitivity to Change Over Time

Changes from baseline to week 16 in all P-SIM item scores had strong Spearman correlation coefficients (r > 0.50) with those in the other patient-reported outcomes (DLQI total score, DLQI item 1 and PGAP [change from baseline to week 12 was used for PGAP as a result of substantial missing data at week 16]), with the exception of P-SIM item 14 with DLQI item 1 (r = 0.50) (Table 3). Moderate correlations were seen between changes from baseline to week 16 in P-SIM item scores and the clinician-reported outcomes, PASI and IGA (r ≥ 0.38 for all items). Generally, changes from baseline in P-SIM items 12, 13 and 14 (fatigue, embarrassment and choice of clothing) had lower correlations with those in other outcome scores; however, changes from baseline in DLQI total score had the highest correlations with changes from baseline in P-SIM items 13 and 14.

Table 3 Spearman correlations between changes from baseline to week 16 in P-SIM item scores and other clinician- and patient-reported outcomes

Test–Retest Reliability

ICCs for all P-SIM items were well above the acceptability threshold of 0.70, ranging from 0.91 (item 5—scaling) to 0.98 (item 12—fatigue, item 13—embarrassment and item 14—choice of clothing) (Table 4). This indicates excellent test–retest reliability of P-SIM items.

Table 4 Intraclass correlation coefficients for P-SIM item scores

Determination of Responder Definition

Supplementary Table 1 shows the response categorisations used for each anchor. The mean P-SIM item score changes obtained for one level of improvement in the anchors were considered as appropriate RD thresholds. The eCDF curves for one level of improvement consistently separated from those in the no meaningful change groups. Collapsing of improvement categories for the PASI percentage change from baseline, PASI total score and IGA change from baseline anchors, as a result of small sample sizes in certain categories, mechanically increased the observed mean changes from baseline to week 16 in P-SIM item scores, making the thresholds obtained from these anchors more conservative.

Values obtained for one level of improvement in DLQI item 1 were prioritised over other anchors. This was because its change from baseline to week 16 generally had the highest correlation with those of P-SIM items, it is patient-reported, it is measured on an ordinal scale with directly interpretable verbal descriptors, and eCDF and PDF findings supported the thresholds obtained using it. Results obtained using anchors based on PASI total score, IGA change from baseline and DLQI total score were also considered in the triangulation process, though they were deprioritised as they did not meet all of the above criteria. Thresholds derived from IGA absolute score and PGAP change from baseline were considered as supportive only (IGA had only one improvement level, and PGAP had a substantial amount of missing data at week 16, leading to week 12 being used).

The P-SIM item RD thresholds found using mean score changes ranged from − 2.72 (item 12—fatigue) to − 4.39 (item 5—scaling) (Supplementary Table 5). Distribution-based values were much smaller than anchor-based values, as expected; these were considered as supportive only. The mean changes in P-SIM item scores obtained with one-point improvement on DLQI item 1, the prioritised anchor, were close to the four-point FDA-recommended threshold. A four-point threshold also discriminated between known groups based on other anchors and was thus considered appropriate for the RD across all items. Inspection of eCDF and PDF curves supported this finding for all items (P-SIM items 1, 3 and 5 were of primary consideration; their eCDF curves for DLQI item 1 and PASI are presented in Figs. 2 and 3). On the basis of these results, a four-point reduction on each P-SIM item is considered to represent a marked clinical improvement and can be used to define responders in patients with moderate to severe plaque psoriasis treated over a 16-week period.
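Applying the four-point RD to weekly item scores is straightforward; a minimal Python sketch is shown below. The function names and the example scores are hypothetical, and the choice of which patients enter the denominator is left to the analyst rather than encoded here.

```python
RD_THRESHOLD = 4.0  # four-point reduction on the 0-10 weekly item score


def is_responder(baseline, week16, threshold=RD_THRESHOLD):
    """True when the item score decreased by at least `threshold` points."""
    return (week16 - baseline) <= -threshold


def responder_rate(score_pairs, threshold=RD_THRESHOLD):
    """Proportion of (baseline, week 16) pairs meeting the RD."""
    if not score_pairs:
        raise ValueError("at least one patient is required")
    n_responders = sum(
        is_responder(baseline, week16, threshold)
        for baseline, week16 in score_pairs
    )
    return n_responders / len(score_pairs)


# Hypothetical (baseline, week 16) weekly scores for one item (not trial data).
example_pairs = [(7.0, 2.5), (6.0, 3.0), (3.0, 0.0), (8.0, 4.0)]
example_rate = responder_rate(example_pairs)
```

In this invented example, the patient with a baseline score of 3.0 reaches a score of 0 but cannot be classed as a responder, because a four-point reduction is not attainable from that baseline.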

Fig. 2

eCDF curves of observed changes from baseline to week 16 in P-SIM item scores by DLQI item 1 change score category. a P-SIM item 1 (itching). b P-SIM item 3 (skin pain). c P-SIM item 5 (scaling). Data were pooled from BE VIVID and BE READY. DLQI item 1 was prioritised for the RD anchor-based analysis as it is patient-reported, measured on a directly interpretable ordinal scale, and its change from baseline to week 16 had the highest correlation with those in P-SIM items. Negative changes from baseline in P-SIM item scores indicate improvement. Dotted lines show median values. eCDFs for itching, skin pain and scaling items are shown as these items were used as efficacy endpoints in the BE VIVID and BE READY trials; findings for other items were similar. eCDF empirical cumulative distribution function, DLQI Dermatology Life Quality Index, P-SIM Psoriasis Symptoms and Impacts Measure, RD responder definition

Fig. 3

eCDF curves of observed changes from baseline to week 16 in P-SIM item scores by PASI absolute score category. a P-SIM item 1 (itching). b P-SIM item 3 (skin pain). c P-SIM item 5 (scaling). Data were pooled from BE VIVID and BE READY. Negative changes from baseline in P-SIM item scores indicate improvement. Dotted lines show median values. eCDFs for itching, skin pain and scaling items are shown as these items were used as efficacy endpoints in the BE VIVID and BE READY trials; findings for other items were similar. eCDF empirical cumulative distribution function, PASI Psoriasis Area and Severity Index, P-SIM Psoriasis Symptoms and Impacts Measure

Discussion

This study aimed to evaluate the psychometric properties of the P-SIM to complement the previously published evidence for its content validity [15]. Specifically, comprehensive psychometric assessment of the P-SIM item scores, including convergent and known-groups validity, test–retest reliability and sensitivity to change over time, was complemented by assessment of the currently applied missing data rule and estimation of the RD threshold for P-SIM items, as recommended by FDA guidelines [16]. The 14 items are not combined into an overall summary score; while this means that the P-SIM does not produce one number providing an overview of all patient symptoms, it does allow individual and specific capture of various independent signs, symptoms and impacts of plaque psoriasis. Other instruments that use fewer items are simpler to analyse, but capture fewer concepts [11, 13].

Blinded data were used for the psychometric validation and RD analyses, removing sources of potential bias. The applied weekly missing score rule, in which weekly item scores were set to ‘missing’ if ≥ 4 daily scores were missing for that item, was found to be adequate. The P-SIM item scores were shown to have excellent test–retest reliability, with all ICCs ≥ 0.91, well above the acceptability threshold of 0.70; this demonstrates that the P-SIM is consistent in its measurement of signs, symptoms and impacts of plaque psoriasis in patients whose disease state has not changed, and that results are repeatable.

The P-SIM demonstrated strong inter-item correlations at both baseline and week 16, apart from item 14 (choice of clothing), which had a moderate correlation with items 3 (skin pain) and 4 (burning) at baseline. P-SIM item scores also showed moderate to strong correlations with PASI, IGA, DLQI, DLQI item 1 and PGAP scores at week 16, and with DLQI, DLQI item 1 and PGAP scores at baseline. The lower correlations seen with PASI and IGA scores at baseline may have been due to low variability in these measures upon trial entry, owing to the eligibility criteria for BE VIVID and BE READY (PASI ≥ 12 and IGA ≥ 3). Together, these results demonstrated good convergent validity of P-SIM items.

The P-SIM item scores also demonstrated the ability to discriminate between patient known groups defined on the basis of absolute PASI thresholds and IGA scores; statistically significant between-subgroup score differences were seen across all P-SIM items for nearly all known-group comparisons.

Changes from baseline to week 16 in P-SIM items were strongly correlated with those in other patient-reported outcomes and moderately correlated with those in clinician-reported outcomes. Previously it has been reported that patient-reported outcomes do not always correlate well with clinician-reported outcomes, highlighting the importance of including patient-reported outcomes in clinical studies to capture patient experiences [27, 28]. Changes from baseline in items 12 (fatigue), 13 (embarrassment) and 14 (choice of clothing) all generally had slightly weaker correlations with other outcomes, hypothesised to be due to the concepts measured being more distal and more difficult to assess compared with the more proximal skin symptoms assessed by items 1–11, which are also directly evaluated by clinician-reported outcomes and DLQI item 1. These results indicate that the P-SIM items are sensitive to changes in the concepts they are intended to measure.

RD thresholds were estimated, as recommended by the FDA guidance for industry on patient-reported outcome measures [16]. This was of particular importance as responder proportions in P-SIM items 1 (itching), 3 (skin pain) and 5 (scaling) have been identified as key secondary endpoints in bimekizumab phase 3 studies. Anchor-based analyses were prioritised, as recommended by the FDA [16]; the RD thresholds obtained ranged from a 2.72-point to a 4.39-point reduction (Supplementary Table 5) and supported the use of a four-point RD threshold to represent marked clinical improvement and to define responders in patients with moderate to severe plaque psoriasis in clinical trials. Though variations around this four-point RD threshold were seen, these were of limited amplitude. This is in line with FDA feedback and US prescribing information for compounds approved in psoriasis, which suggest that a four-point threshold is a relevant RD for a single-item score assessed on an 11-point numeric rating scale [29, 30].

These results support the use of the P-SIM to define efficacy endpoints measuring patient experiences of signs, symptoms and impacts of psoriasis. The use of a four-point RD threshold over an 11-point numeric rating scale throughout all 14 items represents a simple and stringent criterion for defining marked clinical improvement; however, responder analyses using this threshold to assess treatment effect will be restricted to patients reporting a score of ≥ 4 at baseline, as a four-point reduction cannot be achieved from a lower score. Complementary analyses or endpoints, such as attaining a P-SIM score of 0 or 1, could be considered, as has been done for similar instruments in trials of other compounds approved in plaque psoriasis [31, 32]. This would allow all symptomatic patients to be taken into account.

Lower RD thresholds for items 1, 3 and 5 (itch, skin pain and scaling, respectively) were determined previously in preliminary phase 2b data analyses (1.98–2.86-point reduction), and were considered to represent meaningful improvement in bimekizumab in psoriasis phase 3 study protocols. The more stringent four-point RD thresholds identified here instead indicate marked clinical improvement.

Limitations of this study include the proportion of missing P-SIM data; 304 of 1002 patients were excluded from the RD analysis because of missing data due to technical issues at baseline, week 16 or both. Although 30.3% of randomised patients were not included in the P-SIM RD estimation, the sample size included (698) is on par with or greater than sample sizes of psychometric validation studies of other instruments in psoriasis [8, 33, 34]. It is also important to note that those patients who were excluded from the RD analysis had similar baseline characteristics to those who were included (with the exception of region, where a larger proportion excluded were from North America [49.3% of those excluded versus 28.8% of those included]; Supplementary Table 6).

Furthermore, while the daily nature of P-SIM completion helps to capture how patients’ experiences may change from day to day, the regularity of assessment may affect overall completion rates; however, the simulation to confirm the weekly scoring rule demonstrated that the results do not vary substantially when the number of days included in the weekly score changes, thus showing that the weekly scores are robust with respect to variation between days. Completion rates were also high in this study even when using the 4-day missing weekly score rule.

There were also substantial amounts of missing data for the PGAP at week 16, limiting its use in analyses to a 12-week period instead. However, the RD analysis utilised several robust anchors (both patient- and clinician-reported), each yielding relatively similar results (predominantly between 3 and 5 points). eCDF curves for each anchor showed separation between those who experienced an improvement and those who experienced no change at a four-point improvement in P-SIM item scores. Thus, it is likely that conclusions would be similar if there had not been missing data on the PGAP at week 16.

A large proportion of the 1002 patients enrolled in the phase 3 BE VIVID and BE READY trials were male (71.8%) and white (82.1%). It has been suggested that male patients generally have greater severity of psoriasis [35], leading to a greater proportion being included in the trials; however, another study has indicated that female patients experience greater impact on their quality of life [36], and the vast majority of clinical trials exclude pregnant women, meaning that they do not appear in true proportion [37]. Additionally, while the majority of patients in the general moderate to severe psoriasis disease population are white, this study may not fully represent the experiences of patients with skin of colour, who may have increased impact of psoriasis on quality of life [38, 39]; for example, post-inflammatory hyperpigmentation has a greater impact on patients with skin of colour [40].

Conclusions

Here, the P-SIM weekly item scores demonstrated excellent test–retest reliability alongside good convergent validity, known-groups validity, and sensitivity to change in the concepts they are intended to measure. The weekly scoring rule applied was shown to be appropriate, and a four-point RD threshold was established for P-SIM item scores. This RD could be used to assess treatment effects on the signs, symptoms and impacts of psoriasis. Alongside previous studies confirming content validity of the P-SIM, this supports use of the instrument as an outcome measure in clinical trials in patients with psoriasis.