Key Summary Points

Prior to this study, there was a lack of publicly available instruments that adequately measured pruritus in pediatric patients with cholestatic liver diseases from the perspective of patients or their caregivers. Here, we assessed the measurement properties of newly developed tools, a patient-reported outcome (PRO) and an observer-reported outcome (ObsRO), called PRUCISION, for examining pruritus and sleep disturbance in such patients using data from a phase 3 study of odevixibat in patients with progressive familial intrahepatic cholestasis (PFIC).

Examination of the baseline distributions of the PRO and ObsRO data indicated that patients in PEDFIC 1 experienced significant pruritus and related sleep disturbance.

Psychometric analyses found that the ObsRO instrument reliably measures pruritus and sleep disturbance in children and adolescents with PFIC and represents a promising tool for assessing these symptoms over time from the perspective of their caregivers.

Introduction

Progressive familial intrahepatic cholestasis (PFIC) is a group of rare, autosomal-recessive liver diseases estimated to occur in approximately 1 in 75,000 children worldwide [1]. PFIC is characterized by impaired bile acid secretion and transport and the accumulation of bile components, such as bilirubin and bile acids, in the liver [2, 3]. As hepatic levels of these components increase, they can be excreted into the systemic circulation, where they may be associated with the development of jaundice and pruritus, respectively [3,4,5].

Pruritus is a problematic symptom for patients with cholestasis that can severely reduce quality of life, limit activities of daily living, and cause significant sleep deprivation [6,7,8]. Although reducing pruritus is a key treatment objective in these diseases, before this study there was a lack of publicly available instruments that adequately measured pruritus in pediatric patients from a patient or caregiver perspective [5]. The PRUCISION patient-reported outcome (PRO) and observer-reported outcome (ObsRO) pruritus instruments were developed to address this need. Their initial development included a review of the literature, discussion with expert clinicians, and interviews with pediatric patients with cholestatic liver diseases and their caregivers; this work is reported in a companion article in this issue. In that initial study, pruritus and associated sleep disturbance were identified as central to the experience of pediatric cholestatic liver disease.

The PRUCISION instruments were developed according to regulatory guidelines for establishing measurement properties (e.g., reliability, validity, and ability to detect change) of clinical outcome assessments [9, 10]. Based on these guidelines, reliability measures may include test–retest and inter- or intra-rater assessments [9]. Approaches for determining instrument construct validity, or experimentally demonstrating that an instrument measures core constructs, may include assessing convergent validity, or how well constructs that theoretically should be related to each other are observed to be related, and known-groups validity, or the degree to which an instrument can distinguish between clinically distinct groups [10]. The guidelines define the ability to detect change as the ability of an instrument to identify differences in scores over time in individuals or groups who have changed with respect to the measurement concept [9].

The objectives of this study were to assess such measurement properties (reliability, validity, and sensitivity to change) of the newly developed PRUCISION PRO and ObsRO instruments and to estimate a threshold for clinically meaningful change in pruritus score.

Methods

PRO and ObsRO Instruments

The PRO/ObsRO instruments focus on key symptoms of pediatric cholestatic liver disease: pruritus, sleep disturbance, and associated tiredness (Fig. 1a and b). Patients and/or caregivers captured details of these symptoms twice daily (each morning and evening) using an electronic diary (eDiary). The morning diary entry was used to record nighttime itching and scratching severity, aspects of sleep disturbance, and tiredness upon waking. The evening/bedtime diary entry was completed just before the patient went to bed and recorded the patient’s itching and scratching severity and tiredness during the day.

Fig. 1

PRUCISION PRO (a) and ObsRO (b) instruments used in the PEDFIC 1 study. ObsRO observer-reported outcome, PRO patient-reported outcome.

The PRO consists of seven questions and uses two different rating scales (Fig. 1a). The PRO morning diary questions 1, 2, 3, and 5 and the evening diary questions 1 and 2 are scored on a 5-point pictorial response scale, where higher scores indicate worse symptoms. The PRO morning diary question 4 has a “yes” or “no” (binary) response format.

The ObsRO instrument consists of eight questions and uses three response formats (Fig. 1b). The ObsRO morning diary question 1 and evening diary questions 1 and 2 use 5-point response scales similar to those on the PRO. ObsRO morning diary questions 2, 3, 4, and 5 have “yes” or “no” responses, and morning question 6 allows for a numeric response.

PEDFIC 1 Clinical Trial

Overview

The psychometric measurement properties (i.e., reliability, construct validity, sensitivity to change) of the PRO/ObsRO instruments were examined through analysis of data from the randomized, double-blind, placebo-controlled PEDFIC 1 study [11] (for a study flowchart, see Fig. 2). This phase 3 study enrolled children aged 6 months to 18 years with PFIC1 or PFIC2. The primary aim of the study was to assess the efficacy of the ileal bile acid transporter inhibitor odevixibat, as determined by improvements in pruritus and reductions in serum bile acids. The study also assessed safety and tolerability.

Fig. 2

Study flow chart

The PEDFIC 1 study consisted of a screening period with a duration of 5–8 weeks, a 24-week treatment period, and a 4-week follow-up period. The PEDFIC 1 study materials received central or local ethics committee (CEC and LEC, respectively) approval from all 45 study sites in the United States, Canada, Europe, Australia, and the Middle East that intended to enroll patients (i.e., from 10 CECs representing 21 sites and from 22 LECs representing 24 sites); the full list of ethics committees providing approval is available in the appendix of the PEDFIC 1 article [11]. In addition, the study was conducted in accordance with the Declaration of Helsinki and the International Conference on Harmonization Good Clinical Practice guidelines; all patients or their caregivers provided written consent prior to study participation.

Assessments

Patients and/or their caregivers were provided with an eDiary at the first screening visit for recording itching (PRO), observed scratching (ObsRO), and sleep disturbance (PRO and ObsRO; Fig. 1a and b) throughout the study. The PRO instrument was used in patients aged ≥ 8 years, and the ObsRO instrument was completed by patients’ caregivers, regardless of patient age.

Patients aged ≥ 8 years, caregivers, and clinicians completed the Global Impression of Change (GIC; called PGIC, CaGIC, CGIC, for patient-, caregiver-, and clinician-reported versions, respectively) and Global Impression of Symptoms (GIS; called PGIS, CaGIS, CGIS, for patient-, caregiver-, and clinician-reported versions, respectively) at randomization (GIS only) and at weeks 4, 12, and 24 of PEDFIC 1. The GIC assesses change in itching or scratching and sleep since starting the study drug using a 7-point scale that ranges from 1 (“very much better”) to 7 (“very much worse”). The GIS assesses itching (PGIS) or scratching (CaGIS and CGIS) and sleep (all versions) in the past week using a 5-point scale that ranges from 1 (“none”/no problems) to 5 (“very bad/very severe” problems).

During PEDFIC 1, caregivers and patients completed the Pediatric Quality of Life Inventory (PedsQL; version 4.0), an instrument designed to assess quality of life in children and adolescents [12]. The PedsQL examines functioning of the patient via four domains: physical, emotional, social, and school. The family impact module component of the PedsQL measures the impact of pediatric chronic health conditions on parents and the family; this assessment has eight domains: physical, emotional, social, and cognitive functioning, as well as communication, worry, daily activities, and family relationships. The PedsQL and family impact module were administered at randomization and week 24 in PEDFIC 1.

Analyses

Completion rates for PRO and ObsRO instruments were calculated as follows: the total number of actual completed eDiary entries was divided by the number of eDiary entries expected to be completed for a given time period (i.e., 14 days for baseline and 28 days for all 4-week interval time points).
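The completion-rate calculation described above can be sketched as follows. This is a minimal illustration, not the study's analysis code: the function name and the example counts are hypothetical, and the sketch assumes one expected entry per day for a given diary (morning or evening), which the text does not state explicitly.

```python
def completion_rate(completed_entries: int, days_in_interval: int) -> float:
    """Return completed / expected eDiary entries as a percentage.

    Assumption: one expected entry per day for a given diary (morning or
    evening), so the expected count equals the number of days in the interval.
    """
    if days_in_interval <= 0:
        raise ValueError("days_in_interval must be positive")
    if not 0 <= completed_entries <= days_in_interval:
        raise ValueError("completed_entries must lie between 0 and the expected count")
    return 100.0 * completed_entries / days_in_interval


BASELINE_DAYS = 14    # baseline interval, per the text
FOUR_WEEK_DAYS = 28   # each 4-week interval, per the text

# Hypothetical counts for one respondent (not study data).
baseline_rate = completion_rate(12, BASELINE_DAYS)
week24_rate = completion_rate(21, FOUR_WEEK_DAYS)
print(f"baseline: {baseline_rate:.1f}%, weeks 21-24: {week24_rate:.1f}%")
```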

A blinded analysis of ObsRO pruritus data from PEDFIC 1 was conducted to estimate a meaningful within-patient change threshold. The analysis included all available eDiary data captured as of June 29, 2020, which represented data from 48 patients. The identified threshold for meaningful change was then incorporated into the main PEDFIC 1 analysis plan prior to database lock and unblinding. All other analyses were based on data from the final locked database, with the analysis population comprising all randomized patients who received ≥ 1 dose of study treatment, and psychometric analyses performed on pooled treatment arms.

Psychometric Properties of PRUCISION Instruments

Descriptive Statistics at Baseline

Descriptive statistics for PRO/ObsRO instrument items were summarized at baseline, including mean (standard deviation [SD]) values and the distribution of scores to assess potential floor and ceiling effects.

Reliability

Reliability was assessed by calculating inter-item correlations between morning and bedtime PRO/ObsRO pruritus scores. In addition, test–retest reliability was assessed by calculating intraclass correlation coefficients (ICCs), where ICC values of 0.50–0.75 indicate moderate reliability, values of 0.75–0.90 indicate good reliability, and values > 0.90 indicate excellent reliability [13]. ICCs were calculated using two approaches, as follows: (1) by comparing weekly PRO/ObsRO scores from the first week (days −14 to −8) of the baseline interval (“test”) with those from the second week (days −7 to −1) of the baseline interval (“retest”); and (2) by comparing scores in stable patients (i.e., those with no change on the GIC at week 4 or those with the same responses at baseline and week 4 on the GIS) during the baseline interval (average from day −14 to day −1) with those at week 4 (using the monthly score).
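A test–retest ICC of this kind can be sketched as below. The text does not state which ICC model was used, so the choice of ICC(2,1) (two-way random effects, absolute agreement, single measurement) is an assumption, as are the function names and the example scores. The handling of the boundary values 0.75 and 0.90, which the interpretation bands leave ambiguous, is also an assumption.

```python
def icc_2_1(test, retest):
    """ICC(2,1) for two measurement occasions (assumed model, see lead-in)."""
    if len(test) != len(retest) or len(test) < 2:
        raise ValueError("need two equal-length lists with at least 2 patients")
    n, k = len(test), 2
    grand = (sum(test) + sum(retest)) / (n * k)
    row_means = [(a + b) / k for a, b in zip(test, retest)]
    col_means = [sum(test) / n, sum(retest) / n]
    # Two-way ANOVA sums of squares: patients (rows), occasions (columns).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for v in list(test) + list(retest))
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    denom = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    if denom == 0:
        raise ValueError("ICC is undefined when all scores are identical")
    return (ms_rows - ms_error) / denom


def reliability_label(icc):
    """Map an ICC to the bands quoted in the text (boundary handling assumed)."""
    if icc > 0.90:
        return "excellent"
    if icc >= 0.75:
        return "good"
    if icc >= 0.50:
        return "moderate"
    return "below moderate"


# Hypothetical weekly scores for four patients (not study data).
week1 = [1.0, 2.0, 3.0, 4.0]
week2 = [2.0, 3.0, 4.0, 5.0]
value = icc_2_1(week1, week2)
print(round(value, 3), reliability_label(value))
```

Because the assumed model measures absolute agreement, a uniform one-point shift between the two weeks lowers the ICC even though the rank order of patients is unchanged.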

Construct Validity

Construct validity was tested in two ways, through convergent validity and known-groups validity assessments. Convergent validity was assessed using Spearman correlations to compare baseline PRO/ObsRO scores and baseline PGIS, CaGIS, and CGIS item scores and PedsQL and PedsQL family impact module scores. Spearman correlations between 0.30 and 0.49 are considered moderate, and those ≥ 0.50 are considered strong [14]. Known-groups validity was evaluated by partitioning baseline responses from the GIS into two groups based on symptom severity as follows: responses of “none,” “mild,” and “moderate” were grouped and compared with responses of “severe”/“very severe.” A Student t test was used to test differences in PRO/ObsRO scores between these response groups.
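The two construct-validity checks can be sketched as follows, assuming tie-averaged ranks for the Spearman correlation and a pooled-variance Student t statistic. All function names and the example values are hypothetical. The sketch returns only the t statistic and degrees of freedom, since a p value would need a t-distribution routine outside the standard library.

```python
import math
from statistics import mean, variance


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


def strength(rho):
    """Bands from the text: 0.30-0.49 moderate, >= 0.50 strong."""
    r = abs(rho)
    if r >= 0.50:
        return "strong"
    if r >= 0.30:
        return "moderate"
    return "below moderate"


SEVERE_LABELS = {"severe", "very severe"}


def split_by_gis(scores, gis_labels):
    """Known groups: none/mild/moderate versus severe/very severe."""
    lower = [s for s, g in zip(scores, gis_labels) if g not in SEVERE_LABELS]
    higher = [s for s, g in zip(scores, gis_labels) if g in SEVERE_LABELS]
    return lower, higher


def student_t(group_a, group_b):
    """Pooled-variance t statistic and degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    t = (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


# Hypothetical baseline scores and GIS labels (not study data).
scores = [1.0, 2.0, 3.0, 3.0, 4.0, 5.0]
labels = ["mild", "moderate", "none", "severe", "very severe", "severe"]
lower, higher = split_by_gis(scores, labels)
print(student_t(lower, higher))
```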

Sensitivity to Change

Two approaches were also used to evaluate the instruments’ sensitivity to change. First, sensitivity to change was assessed by comparing mean changes from baseline to weeks 21–24 in PRO/ObsRO scores between “improved” (participants who answered “a little better,” “much better,” or “very much better” at week 24 on GIC scales) and “not improved” (participants who answered “no change,” “a little worse,” “much worse,” or “very much worse” at week 24 on GIC scales); mean differences in PRO/ObsRO scores between these groups were evaluated using analysis of covariance models. Second, Pearson correlations were calculated between the change from baseline to weeks 21–24 in the PRO/ObsRO scores and the GIC and GIS at week 24.
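The grouping and correlation steps can be sketched as below. This is a simplified illustration: it reports unadjusted group means rather than the analysis of covariance used in the study, whose covariates are not listed in the text. The numeric coding of the intermediate GIC options (2, 3, 5, and 6) is assumed from the stated endpoints, and all names and values are hypothetical.

```python
import math
from statistics import mean

# Assumed coding: 1 = very much better, 2 = much better, 3 = a little better,
# 4 = no change, 5 = a little worse, 6 = much worse, 7 = very much worse.
GIC_IMPROVED = {1, 2, 3}


def mean_change_by_gic(changes, gic_scores):
    """Unadjusted mean score change for 'improved' and 'not improved' groups."""
    improved = [c for c, g in zip(changes, gic_scores) if g in GIC_IMPROVED]
    not_improved = [c for c, g in zip(changes, gic_scores) if g not in GIC_IMPROVED]
    return mean(improved), mean(not_improved)


def pearson(x, y):
    """Pearson product-moment correlation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical change scores (weeks 21-24 minus baseline) and week 24 GIC.
changes = [-2.0, -1.5, -1.0, 0.0, 0.5]
gic = [1, 2, 3, 4, 5]
improved_mean, not_improved_mean = mean_change_by_gic(changes, gic)
print(improved_mean, not_improved_mean, round(pearson(changes, gic), 2))
```

With this coding, lower GIC values mean more improvement and negative change scores mean less scratching, so a positive correlation is the expected direction.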

Threshold for Clinically Meaningful Change

Distribution- and anchor-based analyses were used to estimate a threshold for within-patient meaningful change from baseline to weeks 12 and 24 for the ObsRO pruritus instrument. Using distribution-based analyses, the 0.5 SD and 1 standard error of the mean (SEM) values for baseline pruritus scores were calculated. Anchor-based analyses involved examining the degree of change in ObsRO pruritus score from baseline to week 12 or 24 among patients who experienced pruritus improvement in PEDFIC 1 based on GIC and GIS anchors. Each of the anchors included several response options. The responses were dichotomized in different ways to facilitate selection of the threshold that best differentiated responders and nonresponders. There were three dichotomized categories for both the GIC and GIS scales (Table 1). For both scales, definition 1 reflected the maximum amount of change, definition 3 reflected the smallest amount of change, and definition 2 fell between these two extremes. Correlations between each dichotomized level of the anchor and the ObsRO pruritus change score were calculated, with increased weight given to anchors with a correlation of 0.30 or greater. The anchor most highly correlated with the pruritus measure was used as the primary anchor. The 95% confidence interval (CI) for the mean change from baseline in pruritus value was also calculated for patients who were stable according to the GIC or GIS anchor. The lower bound of the 95% CI for this stable group was used as a comparison value in examining the meaningful change estimates from the other anchor categories.
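The distribution-based quantities can be sketched as follows. The text expands SEM as the standard error of the mean. In distribution-based threshold work the abbreviation often denotes the standard error of measurement, SD × √(1 − reliability). Which one was computed is not confirmed here, so the sketch shows both. The function names and input values are illustrative assumptions.

```python
import math


def half_sd(sd: float) -> float:
    """Half a standard deviation of the baseline score."""
    return 0.5 * sd


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SD * sqrt(1 - reliability); reliability could be a test-retest ICC."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return sd * math.sqrt(1.0 - reliability)


def standard_error_of_mean(sd: float, n: int) -> float:
    """SD / sqrt(n), the literal reading of 'standard error of the mean'."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sd / math.sqrt(n)


# Illustrative inputs only (not taken from the study tables).
sd = 0.60
print(half_sd(sd))
print(standard_error_of_measurement(sd, reliability=0.75))
print(standard_error_of_mean(sd, n=36))
```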

Table 1 GIC and GIS anchors and dichotomized response categories

The anchor-based analyses were used as the primary determinant for estimating a meaningful change threshold, with receiver operating characteristic (ROC) analyses and empirical cumulative distribution function (eCDF) plots used as complementary approaches. The threshold was evaluated as follows: (1) a primary anchor was selected based on correlations between the anchor and the pruritus measure; (2) the smallest median value for the primary anchor that exceeded the values from the distribution-based analyses and the lower bound of the 95% CI from the stable anchor category was selected as a candidate threshold value; (3) values were tabulated to evaluate the consistency between the primary and complementary threshold estimates; and (4) the final meaningful change estimate was rounded to the nearest 0.5 value on the 0–4 scale to increase the interpretability of the threshold value. Because the results for the 12- and 24-week intervals were similar, the remainder of this manuscript focuses on week 24 data (week 12 data are not shown).
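Steps (2) and (4) above can be sketched as below. This is one interpretation, not the study's procedure: "exceeded" is read as a larger magnitude of improvement than each comparison value, and changes are assumed to be negative when scratching decreases. The function names and all input values are hypothetical.

```python
def round_to_half(value: float) -> float:
    """Round to the nearest 0.5 on the score scale."""
    return round(value * 2) / 2


def candidate_threshold(median_changes, half_sd, sem, stable_ci_lower):
    """Pick the smallest-magnitude median improvement that clears every floor.

    median_changes: median change (negative = improvement) per anchor category.
    The floor is the largest of 0.5 SD, 1 SEM, and the magnitude of the lower
    95% CI bound in the stable group (interpretation assumed, see lead-in).
    """
    floor = max(half_sd, sem, abs(stable_ci_lower))
    eligible = [m for m in median_changes if m < 0 and abs(m) > floor]
    if not eligible:
        return None
    return max(eligible)  # the negative value closest to zero


# Hypothetical medians for three anchor definitions and hypothetical floors.
medians = [-1.80, -1.30, -0.95]
candidate = candidate_threshold(medians, half_sd=0.30, sem=0.21, stable_ci_lower=-0.45)
final = round_to_half(candidate)
print(candidate, final)
```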

Results

PEDFIC 1 Clinical Trial

Patient Demographics and Baseline Characteristics

Demographics and baseline characteristics of patients randomized in PEDFIC 1 are summarized in Table 2. Overall, 84% of patients (n = 52/62) were aged < 8 years, with 37% and 31% aged < 2 years and 2–4 years, respectively; these patients did not contribute PRO data to the analysis.

Table 2 PEDFIC 1 patient demographics and baseline characteristics

Instrument Completion Rates

At baseline, a total of 10 patients aged ≥ 8 years were eligible to complete the PRO; however, one eligible patient was unable to complete this measure due to disability. A total of 62 caregivers were asked to complete the ObsRO. Overall, the eDiary completion rates at baseline, indicating the percentage of scheduled daytime and nighttime scores that were recorded, were high (PRO: ≥ 79%; ObsRO: ≥ 96%). The completion rates for the PRO and ObsRO at week 24 were ≥ 76% and ≥ 93%, respectively.

Descriptive Analysis at Baseline

Baseline PRO itching and ObsRO scratching scores in PEDFIC 1 indicated that patients experienced substantial levels of itching/scratching, as most itching/scratching baseline scores were between 2 and 4 (Table 3). For PRO sleep disturbance and tiredness items, at least half the patients had average daily scores of 3–4 at baseline (Table 3). Approximately 60% of caregiver respondents reported that their child needed help falling asleep as a result of itching, needed soothing, or slept with the caregiver (i.e., co-sleep) all days during the baseline interval. For the itching/scratching and sleep disturbance PRO/ObsRO items, no floor effects were observed; however, ceiling effects (i.e., ≥ 25% of patients with average daily scores between 3 and 4) were observed.
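The ceiling-effect criterion quoted above (at least 25% of patients with average daily scores between 3 and 4) can be sketched as a simple proportion check. The function name and example scores are hypothetical, and the inclusive handling of the band edges is an assumption.

```python
def ceiling_effect(avg_daily_scores, low=3.0, high=4.0, min_share=0.25):
    """Return (share of patients in the top band, whether the criterion is met)."""
    if not avg_daily_scores:
        raise ValueError("at least one score is required")
    in_band = sum(low <= s <= high for s in avg_daily_scores)
    share = in_band / len(avg_daily_scores)
    return share, share >= min_share


# Hypothetical baseline averages on the 0-4 scale (not study data).
share, flagged = ceiling_effect([3.5, 3.0, 1.0, 2.0])
print(share, flagged)
```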

Table 3 Descriptive statistics for PRO and ObsRO daily pruritus scores at baseline

Psychometric Measurement Properties—PRO Instrument

The small sample size of patients who completed the PRO precluded a full psychometric validation of the PRO instrument.

Psychometric Measurement Properties—ObsRO Instrument

ObsRO Reliability

The mean (SD) ObsRO morning and bedtime pruritus scores at baseline (n = 62, each) were 2.84 (0.66) and 3.01 (0.57), respectively, and a strong inter-item correlation was found between these scores (r = 0.81).

On the ObsRO, moderate-to-strong correlations were found between daytime tiredness and all three scratching scores (r values ranged from 0.58 to 0.65), and strong correlations were observed between sleep disturbance items of needing help falling asleep, needing soothing, and sleeping with a caregiver (r values ranged from 0.68 to 0.87).

Comparisons between weekly ObsRO pruritus and sleep disturbance scores from the first week (“test”) to the second week (“retest”) of the baseline interval yielded ICCs ≥ 0.75, indicating acceptable reliability (Table 4) [13]. Comparisons between baseline and week 4 ObsRO pruritus and sleep disturbance scores in stable patients identified by the CaGIC yielded ICC values ≥ 0.69; these values were ≥ 0.54 when stable patients were identified by the CaGIS (Table 4).

Table 4 Test–retest reliability of ObsRO scores

ObsRO Construct Validity

Moderate-to-strong correlations (r ≥ 0.39) between ObsRO scratching and the GIS were observed (Table 5), demonstrating convergent validity. Moderate-to-strong correlations were also observed for several ObsRO sleep disturbance items and PGIS scratch scores (Table 5). In addition, strong correlations were observed between the ObsRO sleep disturbance items and PedsQL self-reported total scores (Table 6). Some ObsRO sleep disturbance items at baseline had moderate correlations with baseline PedsQL caregiver-reported school domain scores (Table 6).

Table 5 Convergent validity: correlations between baseline ObsRO scores and baseline GIS items
Table 6 Convergent validity: correlations between baseline ObsRO and baseline PedsQL self-reported or caregiver-reported scores

For known-groups validity analyses, group differences were in the expected direction (e.g., ObsRO pruritus scores increased as scores on the GIS indicated increased severity of pruritus and sleep disturbance) for all pruritus scores. These differences were significant for the ObsRO scratching and some sleep disturbance items when groups were based on CaGIS and CGIS scores (Table 7).

Table 7 Known-groups validity: mean differences in baseline ObsRO scratching scores between groups defined by the GIS

ObsRO Sensitivity to Change

Mean differences in ObsRO scratching, sleep disturbance, and tiredness scores between “improved” and “not improved” groups from the GIC were significant and in the expected direction (Table 8). For example, there were greater mean reductions in ObsRO scratching scores in the “improved” group per the CaGIC versus the “not improved” group at week 24. Similar results were found when groups were identified using the CGIC. Additionally, moderate-to-strong correlations (r ≥ 0.3) were found between ObsRO scratching scores and the GIC and GIS as reported by patients, caregivers, and clinicians (Table 9). In general, moderate-to-strong correlations were also identified between ObsRO tiredness and sleep disturbance items and the GIC and GIS (Table 9). Strong correlations were also identified between ObsRO scratching and some sleep disturbance items and PedsQL self-reported scores for physical functioning (Table 10). Additionally, moderate-to-strong correlations were observed between ObsRO scratching and some sleep disturbance items and PedsQL caregiver-reported scores, particularly in the school domain. ObsRO scratching and daytime tiredness items were strongly correlated with the PedsQL family impact module total scores and domain scores for physical functioning and daily activities (Table 11).

Table 8 Sensitivity to change analyses: mean change from baseline to weeks 21–24 in ObsRO pruritus scores by GIC response category at week 24
Table 9 Sensitivity to change analyses: correlation of change from baseline to weeks 21–24 in ObsRO scores and GIC or GIS items at week 24
Table 10 Sensitivity to change analyses: correlation between change from baseline to weeks 21–24 in ObsRO score and PedsQL self-reported and caregiver-reported scores at week 24
Table 11 Sensitivity to change: correlation between change from baseline to weeks 21–24 in ObsRO score and PedsQL family impact module scores at week 24

ObsRO Threshold for Clinically Meaningful Change in Pruritus

Distribution-based analyses showed that the 0.5 SD of the baseline daily ObsRO scratching score was 0.30, and 1 SEM at baseline was 0.21. These values served as a lower limit for the meaningful change threshold. Anchor-based analyses supported the use of CaGIS as an anchor for establishing a meaningful change in the ObsRO scratching score (Table 12). The smallest median ObsRO scratching change values for the CaGIS that exceeded the values from the distribution-based analyses and the lower bound of the 95% CI from the stable anchor category were −0.95 and −0.96, for the monthly and biweekly ObsRO pruritus scores, respectively, at week 24 (Table 13). These and other anchor-based and ROC analyses are included in Table 13.

Table 12 Correlations between change from baseline to week 24 in ObsRO scratching measures and anchors
Table 13 Summary of anchor-based and receiver operating characteristic analyses of daily ObsRO scratching score

The eCDF curves for change in ObsRO pruritus score shifted left with increasing improvement in CaGIS score (Fig. 3). Based on these analyses, a change of −1.00 from baseline to week 24 was selected as the final meaningful change threshold, indicating that a decrease in the average caregiver-reported scratching score of 1 or more points can be considered a clinically meaningful improvement in pruritus.
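An eCDF value and the resulting responder classification can be sketched as follows. The threshold of −1.00 is taken from the text. The function names and the change scores are hypothetical.

```python
MEANINGFUL_CHANGE = -1.00  # threshold reported in the text


def ecdf(changes, x):
    """Proportion of patients whose change from baseline is at or below x."""
    if not changes:
        raise ValueError("at least one change score is required")
    return sum(c <= x for c in changes) / len(changes)


def is_responder(change):
    """True when the score fell by 1 point or more."""
    return change <= MEANINGFUL_CHANGE


# Hypothetical change scores for two CaGIS groups (not study data).
more_improved = [-2.0, -1.0, -1.0, 0.0]
less_improved = [-1.0, -0.5, 0.0, 0.5]
print(ecdf(more_improved, MEANINGFUL_CHANGE), ecdf(less_improved, MEANINGFUL_CHANGE))
```

A curve that sits further left has a larger cumulative proportion at any given change value, which is why the more-improved group shows the higher proportion at the threshold in this example.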

Fig. 3

eCDF plots for change from baseline in daily ObsRO pruritus monthly score by CaGIS at week 24. CaGIS Caregiver Global Impression of Symptoms, eCDF empirical cumulative distribution function, ObsRO observer-reported outcome

Discussion

The results of this study indicate that a novel clinical outcome assessment, the PRUCISION ObsRO instrument, is a valid and reliable measure able to capture longitudinal changes in the severity of pruritus and sleep disturbance in patients with PFIC.

Using data from the phase 3 PEDFIC 1 study in patients with PFIC, the measurement characteristics of the ObsRO PRUCISION instrument, as well as the PRO PRUCISION instrument, were evaluated. PRO/ObsRO scores at PEDFIC 1 baseline indicated that these patients experienced significant pruritus and associated sleep disturbance prior to treatment intervention. Examination of baseline distribution data from PEDFIC 1 indicated that the PRUCISION instruments had ceiling effects (i.e., they may not be as sensitive to worsening over time), but that they were capable of detecting improvement as a result of treatment. Further analysis of the psychometric properties of the PRO was limited due to the small number of patients who completed the instrument during the study. Psychometric analysis of the PRUCISION ObsRO instrument demonstrated that the instrument had acceptable test–retest reliability overall. Further analyses supported the construct validity and sensitivity to change of the ObsRO instrument. The anchor-based analyses with ObsRO scores performed here indicated that a 1-point reduction corresponds to a meaningful improvement in pruritus.

The PRUCISION PRO/ObsRO instruments were developed to address a lack of adequate measurement tools for quantifying pruritus in pediatric patients with cholestatic liver disease. The analyses used for developing and validating the PRUCISION PRO/ObsRO instruments followed best practice guidelines to assess reliability, construct validity, and sensitivity to change [10, 15, 16]. The findings described here support the use of the ObsRO PRUCISION instrument to measure potential treatment benefits of odevixibat in patients with PFIC. In addition, because the PRUCISION instruments were developed with input from caregivers and patients with a range of cholestatic liver disorders (i.e., Alagille syndrome, biliary atresia, and primary sclerosing cholangitis; see companion article in this issue), these instruments can be applied to measure symptoms of pruritus and sleep disturbance in children with other cholestatic liver diseases, such as those associated with significant pruritus [5]. In fact, the primary outcome of ASSERT (NCT04674761), an interventional study in patients with Alagille syndrome that was initiated in 2021, is based on ObsRO PRUCISION scores. The PRUCISION instruments may also be useful in monitoring symptoms over time in patients who receive no or other treatment options.

Some limitations of this study warrant discussion. First, validation of the PRUCISION PRO instrument was limited by the small sample size of study participants aged ≥ 8 years who could complete the PRO. Second, another instrument, the Itch Reported Outcome (ItchRO) tool, was being developed while this study was underway [17]. The ItchRO tool was also intended to measure cholestatic pruritus in children and includes a 5-point response scale that ranges from 0 to 4 [18]. However, at the time the PEDFIC 1 study was designed, this tool was not publicly available for use or adaptation. Both the PRUCISION PRO/ObsRO instruments and the ItchRO tool contain morning and bedtime pruritus assessments [18], and validation analyses found that a ≥ 1-point reduction in observer-reported pruritus scores in both tools can be considered a clinically meaningful change [17]. However, the PRUCISION instruments contain unique questions, and the PRUCISION and ItchRO instruments were primarily validated in different patient populations (i.e., PRUCISION, in patients with PFIC; ItchRO, in patients with Alagille syndrome) [17].

Accumulation of bile acids in the liver and secondary spillover into the systemic circulation is another key feature of PFIC and other cholestatic liver diseases in children [2]. Although the relationship is not completely understood, the higher the level of serum bile acids, the greater the likelihood that patients may have pruritus [19, 20]. Therefore, future studies that include detailed investigations into the relationship between PRUCISION scores and measurements of serum bile acids would be valuable. In an initial post hoc analysis of pooled data from odevixibat-treated patients in the PEDFIC 1 and open-label extension PEDFIC 2 (NCT03659916) studies up to a data cutoff date of December 2020, a significant correlation was found between mean percentage change from baseline up to week 72 in serum bile acids and mean change in ObsRO PRUCISION score during the same interval [21]. Longer-term data that could address this relationship further are expected as PEDFIC 2 completes in 2023.

The results from this study support the validity of the PRUCISION ObsRO instrument for measuring pruritus and sleep disturbance in children with PFIC. In addition, based on the analyses conducted here, a change in ObsRO pruritus score of −1.00 (i.e., a reduction of 1 or more points) can be considered a clinically meaningful change. The ObsRO PRUCISION instrument is appropriate for evaluating the effect of treatment on pruritus and sleep disturbance in PFIC and other pediatric cholestatic liver diseases.