Blinded data were pooled from the first 16 weeks of the BE VIVID and BE READY phase 3 trials investigating bimekizumab in the treatment of moderate to severe plaque psoriasis.
BE VIVID was a randomised, double-blinded, placebo- and active comparator-controlled phase 3 study (Supplementary Fig. 1a). Patients in BE VIVID were randomised 4:2:1 to bimekizumab 320 mg every 4 weeks (Q4W), ustekinumab 45/90 mg (by weight) administered at baseline and then every 12 weeks (Q12W) from week 4, or placebo for initial treatment (weeks 0–16). BE READY was a double-blinded, placebo-controlled, randomised withdrawal phase 3 study (Supplementary Fig. 1b). Patients in BE READY were randomised 4:1 to bimekizumab 320 mg Q4W or placebo for initial treatment (weeks 0–16). Both trials enrolled adult patients with a diagnosis of moderate to severe plaque psoriasis ≥ 6 months prior to screening, with baseline Psoriasis Area and Severity Index (PASI) ≥ 12 (on a scale from 0–72), ≥ 10% body surface area (BSA) affected and an Investigator’s Global Assessment (IGA) score ≥ 3 on a 5-point scale. The co-primary endpoints for both studies were 90% improvement in PASI (PASI 90) and IGA 0/1 (0 [clear] or 1 [almost clear] with ≥ 2-category improvement from baseline) responses at week 16. Full study designs and efficacy and safety outcomes have been reported previously [17, 18].
The study protocols, amendments and patient informed consent were reviewed by a national, regional or independent ethics committee or institutional review board. BE VIVID (NCT03370133) and BE READY (NCT03410992) were conducted in accordance with the current version of the applicable regulatory and International Conference on Harmonisation (ICH)-Good Clinical Practice (GCP) requirements, the ethical principles that have their origin in the principles of the Declaration of Helsinki, and the local laws of the countries involved.
Measurements and Outcomes
The P-SIM e-diary was to be completed daily in the evening by patients at home via a handheld electronic device from baseline through week 16. It consisted of 14 items, listed in Table 1. Each item was scored for worst severity or level of impact in the previous 24 h on a scale from 0 to 10 (0 meaning “no sign, symptom or impact”, 10 meaning “very severe sign, symptom or impact”). Average weekly scores were derived for each of the 14 P-SIM items. Weekly scores were considered missing if ≥ 4 daily scores were missing for that week.
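The weekly scoring rule described above can be sketched in a few lines of Python. This is an illustration only, not the study's analysis code: the function name `weekly_item_score` is hypothetical, and the sketch assumes that the weekly score is the arithmetic mean of whichever daily scores are available in a 7-day week.

```python
from statistics import mean
from typing import Optional, Sequence


def weekly_item_score(daily_scores: Sequence[Optional[float]]) -> Optional[float]:
    """Average the available daily scores (0-10) for one item in one week.

    Missing days are passed as None. The weekly score is treated as
    missing (None) when 4 or more of the 7 daily scores are missing.
    Assumption: the average is taken over the non-missing days only.
    """
    if len(daily_scores) != 7:
        raise ValueError("expected exactly 7 daily scores")
    observed = [s for s in daily_scores if s is not None]
    n_missing = 7 - len(observed)
    if n_missing >= 4:
        return None
    return mean(observed)
```

For example, a week with three missing days still yields a score, whereas a week with four missing days does not.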
Other outcomes reported in BE VIVID and BE READY relevant to this validation analysis include PASI, IGA, DLQI, and Patient Global Assessment of Psoriasis (PGAP). The DLQI consists of ten items, each scored from 0 to 3, with 3 representing the highest impact (overall score range 0–30). DLQI item 1 (Question: “over the last week, how itchy, sore, painful, or stinging has your skin been?”; answer: “not at all”, “a little”, “a lot”, or “very much”) is specific to skin symptoms commonly seen in psoriasis and is considered individually in these analyses. The PGAP consists of a multiple-choice question (“How severe are your psoriasis-related symptoms right now?”), with answers scored from 1 to 5 (1, “no symptoms”; 2, “mild symptoms”; 3, “moderate symptoms”; 4, “severe symptoms”; 5, “very severe symptoms”). Both DLQI item 1 and the PGAP are verbal rating scales. Details of PASI and IGA score measurements can be found in the Supplementary Methods [19, 20]. The PGAP was completed on the same electronic device as the P-SIM; all other outcome data were collected on a tablet during on-site study visits (weeks 0, 1, 2, 4, 8, 12 and 16).
Confirmation of Weekly Scoring Rule
Analyses were conducted to establish whether alternatives to the currently applied weekly scoring rule for the P-SIM affected the variability of weekly item scores. Under the current rule, a patient’s weekly score for an item is considered missing in any week with ≥ 4 missing daily scores for that item.
Patients with no missing data were included for analysis of each item. Scenarios for the number of days missing (1, 2, 3, 4, 5 or 6 days missing) were simulated using a bootstrapping method with replacement by randomly sampling the appropriate number of daily scores for that missing day scenario and calculating average weekly scores. For each missing day scenario for a given item, each patient’s simulated weekly score was the mean of 100 replications of a random selection of daily scores. Means and standard deviations (SDs) were calculated for each item for each patient for the non-missing case and each missing day scenario. Overall means and SDs were also calculated for each item by pooling the weekly scores of patients. The SDs for missing and non-missing scenarios for each item were compared by visual inspection and the Brown–Forsythe test.
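One way to read the simulation for a single patient, item and week is sketched below. The function name and arguments are hypothetical, and the sketch assumes that a scenario with k missing days is simulated by drawing 7 − k daily scores with replacement from the patient's seven observed scores, averaging them, and repeating this 100 times.

```python
import random
from statistics import mean


def simulated_weekly_score(daily_scores, n_missing, n_reps=100, rng=None):
    """Simulate a weekly score for a scenario with `n_missing` missing days.

    `daily_scores` holds one patient's complete set of daily scores for a week.
    Each replication samples (len(daily_scores) - n_missing) scores with
    replacement and averages them; the result is the mean over replications.
    """
    rng = rng or random.Random()
    n_keep = len(daily_scores) - n_missing
    if not 1 <= n_keep <= len(daily_scores):
        raise ValueError("n_missing must leave at least one daily score")
    replicate_means = [
        mean(rng.choices(daily_scores, k=n_keep)) for _ in range(n_reps)
    ]
    return mean(replicate_means)
```

Pooling these simulated scores across patients gives the per-scenario SDs that can then be compared against the non-missing case.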
Convergent validity was assessed for P-SIM items by calculating Spearman’s rank correlation coefficients at baseline and week 16 for inter-item correlations and correlations between P-SIM item scores and clinician-reported outcome (PASI total score and IGA score) and patient-reported outcome (DLQI total score, DLQI item 1 score and PGAP score) scores. Only pairs assessed on the same date were included in these analyses.
A correlation coefficient > 0.3 and ≤ 0.5 indicated moderate convergent validity, and > 0.5 indicated strong convergent validity. It was hypothesised that P-SIM items would have strong correlations with each other, and moderate to strong correlations with PASI, IGA, DLQI and PGAP total scores. Items 1, 3, 4 and 8 (itching, skin pain, burning and irritation, respectively) of the P-SIM were expected to have strong correlations with DLQI item 1.
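The thresholds above can be illustrated with a small, dependency-free sketch. The helper names are hypothetical; Spearman's coefficient is computed as the Pearson correlation of average ranks, and the label for coefficients of 0.3 or lower ("below moderate") is a placeholder, since the text does not name that category.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation (undefined if either input is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def convergent_validity_label(rho):
    """Apply the thresholds stated in the text to a coefficient."""
    if rho > 0.5:
        return "strong"
    if rho > 0.3:
        return "moderate"
    return "below moderate"  # placeholder label, not from the source
```

In practice, only pairs of scores assessed on the same date would be passed to `spearman_rho`.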
The ability of P-SIM items to discriminate between clinically different groups of patients, predefined according to relevant clinician-reported outcomes, was also assessed. Mean P-SIM item scores were assessed at week 16 in known subgroups of patients defined on the basis of absolute PASI total score thresholds (≤ 1, > 1 to ≤ 3, > 3 to < 5, ≥ 5 to < 12, and ≥ 12) and IGA scores (0, “clear”; 1, “almost clear”; 2, “mild”; 3, “moderate”; and 4, “severe”) at week 16. PASI and IGA are well-accepted clinical measures of psoriasis disease severity and were used to define the primary efficacy endpoints in the bimekizumab phase 3 studies, BE VIVID and BE READY. As expected, as a result of the inclusion criteria of the BE VIVID and BE READY trials (PASI ≥ 12, IGA ≥ 3), there was low variability in scores for both measures at baseline; assessment of known-groups validity therefore focused on week 16. The absolute PASI values 1, 3 and 5 used to define subgroups have been shown to provide reliable estimates of disease activity that can be used to define treatment goals for psoriasis treatment and facilitate clinical decisions, while an absolute PASI value of 12 was used to define moderate to severe plaque psoriasis in the BE VIVID and BE READY inclusion criteria. The IGA values used to define subgroups with clinically different psoriasis severity levels are the response options of the 5-point IGA scale, which has been shown to be a valid and reliable measure of psoriasis severity.
It was hypothesised that P-SIM item score means and SDs would be higher in patients with higher PASI and IGA scores. P values from the Kruskal–Wallis test were calculated to compare distributions among the known groups.
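For readers unfamiliar with the Kruskal–Wallis test, the sketch below computes its H statistic, with the usual tie correction, for item scores split into known groups. The function name is hypothetical and the sketch stops at the statistic. A p value would be obtained by comparing H with a chi-square distribution with (number of groups − 1) degrees of freedom, for instance via `scipy.stats.kruskal`, which returns both.

```python
from collections import Counter


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with tie correction.

    `groups` is a list of lists, one list of item scores per known group
    (for example, one per PASI or IGA category). Undefined if every
    pooled value is identical.
    """
    pooled = [v for g in groups for v in g]
    n_total = len(pooled)
    counts = Counter(pooled)

    # Assign each distinct value the average rank of its tied block.
    rank_of = {}
    next_rank = 1
    for value in sorted(counts):
        t = counts[value]
        rank_of[value] = next_rank + (t - 1) / 2
        next_rank += t

    rank_sum_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n_total * (n_total + 1)) * rank_sum_term - 3 * (n_total + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
    tie_term = sum(t ** 3 - t for t in counts.values())
    return h / (1 - tie_term / (n_total ** 3 - n_total))
```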
Intraclass correlation coefficients (ICCs; calculated as the ratio of the between-patient variance to the total variance) were calculated for each item’s score between baseline and week 2 in stable patients during this period. Stable patients were defined as those whose IGA score did not change in this interval. ICCs ≥ 0.70 were considered acceptable for test–retest reliability.
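The text does not state which ICC form was used. As a minimal sketch, assuming a one-way random-effects ICC for two measurements per patient (baseline and week 2), the ratio of between-patient variance to total variance can be estimated from one-way ANOVA mean squares. The function name and input layout are hypothetical.

```python
from statistics import mean


def icc_oneway(pairs):
    """One-way random-effects ICC for (baseline, week 2) score pairs.

    `pairs` should contain only stable patients, i.e. those whose IGA
    score was unchanged between the two time points.
    Assumption: ICC = (MSB - MSW) / (MSB + (k - 1) * MSW) with k = 2.
    """
    n, k = len(pairs), 2
    grand_mean = mean(v for pair in pairs for v in pair)
    patient_means = [mean(pair) for pair in pairs]
    # Between-patient and within-patient mean squares.
    msb = k * sum((m - grand_mean) ** 2 for m in patient_means) / (n - 1)
    msw = sum(
        (v - m) ** 2 for pair, m in zip(pairs, patient_means) for v in pair
    ) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)
```

A value returned by this function would be compared against the 0.70 threshold to judge test–retest reliability.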
Sensitivity to Change Over Time
Spearman’s correlation coefficients and p values were calculated between changes from baseline to week 16 in P-SIM item scores and in other clinically relevant outcomes (PASI, IGA, PGAP, DLQI item 1 and DLQI). Correlations using the change from baseline to week 12 for PGAP were assessed as a sensitivity analysis because a large amount of PGAP data was missing at week 16 owing to technical limitations of the electronic device. A Spearman correlation ≥ 0.30 was considered to demonstrate acceptable responsiveness.
Determination of Responder Definition
Anchor-based analyses were performed in line with the recommendations for defining response thresholds for within-patient meaningful changes provided in the FDA patient-reported outcome guidance. Distribution-based analyses were conducted to provide supportive information around the variability of the scores, as per FDA guidance. Triangulation was performed, examining the estimates from the anchor-based analyses and considering the distribution-based analyses as supportive information, to estimate values for clinically meaningful changes in each P-SIM item score.
For the anchor-based analyses, only anchors with Spearman correlation ≥ 0.30 between changes from baseline to week 16 in the anchor (week 12 for PGAP) and P-SIM items were considered. The anchors used, based on PASI, IGA, DLQI, DLQI item 1 and PGAP response, are shown in Supplementary Table 1. As a result of small sample sizes (n < 15) in some anchor response categories, some of the adjacent original categories were collapsed.
Actual mean change scores from baseline to week 16 were calculated for each level of improvement in each anchor. The empirical cumulative distribution function (eCDF) and probability density function (PDF) curves of observed changes at week 16 for each P-SIM item were plotted separately for each responder group for each anchor. These curves were examined to see whether separation between levels of response on the anchors could be observed. If so, actual change scores that optimally differentiated responders from non-responders were identified to support determination of the responder definition (RD) for each item. Effect sizes for each level of change in the anchors were estimated, and 95% confidence intervals for the observed mean change were calculated using bootstrapping methods.
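Two of the quantities described above can be sketched as follows: the eCDF of change scores evaluated at a candidate threshold, and a bootstrap confidence interval for the mean change within one anchor response group. The function names are hypothetical, and the percentile bootstrap with 2000 resamples is an assumption, since the text does not specify the bootstrap variant.

```python
import random
from statistics import mean


def ecdf_value(changes, threshold):
    """Proportion of patients whose change score is <= threshold."""
    return sum(c <= threshold for c in changes) / len(changes)


def bootstrap_mean_ci(changes, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean change.

    Resamples the change scores with replacement `n_boot` times and
    returns the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    """
    rng = random.Random(seed)
    n = len(changes)
    boot_means = sorted(mean(rng.choices(changes, k=n)) for _ in range(n_boot))
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Evaluating `ecdf_value` at the same threshold for each anchor response group shows how well that threshold separates the groups.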
Distribution-based analyses used the standard error of measurement for P-SIM items, and half an SD of item scores at baseline and study visits [25, 26]. If the SD changed significantly over time, the baseline SD was used; otherwise the mean SD across visits was used.
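Both distribution-based quantities are simple to compute. The sketch below uses the conventional formula SEM = SD × √(1 − reliability); which reliability coefficient was used is not stated here, so taking the test–retest ICC as the reliability estimate is an assumption, as are the function names.

```python
import math
from statistics import stdev


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability), with reliability in [0, 1].

    Assumption: `reliability` is a test-retest ICC for the item.
    """
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1 - reliability)


def half_sd(scores):
    """Half of the sample standard deviation of the item scores."""
    return 0.5 * stdev(scores)
```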
Items 1 (itching), 3 (skin pain) and 5 (scaling) of the P-SIM were prioritised in the RD analyses owing to their clinical relevance, importance to patient experience, and use as efficacy endpoints in the BE VIVID and BE READY trials.