Introduction

Ulcerative Colitis (UC) is a chronic, idiopathic, relapsing inflammatory disease of the colonic mucosa with a prevalence of > 181.1/100,000 in North America and Europe [1, 2]. Patients with UC often suffer from frequent diarrhea, abdominal pain and discomfort, and urgent bowel movements, with potentially more severe complications such as toxic megacolon, perforation, and an increased risk of colon cancer [3]. Although the precise etiology of UC is unknown, the condition may be caused by unregulated and exaggerated immune responses to environmental triggers among genetically susceptible individuals [4].

Beyond the physical manifestations of the disease, the young age of onset and morbidities associated with UC and its therapies affect patients’ professional lives and social and emotional wellbeing, ultimately impacting their health-related quality of life (HRQoL) [4, 5]. Given these wide-ranging impacts, regulatory authorities have highlighted the importance of including validated patient-reported outcome (PRO) measures as primary outcomes in clinical trials of UC to assess non-physiologic symptoms and effects of treatment from the patient’s perspective, beyond the generic symptoms captured by existing validated inflammatory bowel disease questionnaires [6, 7]. Specifically, formal guidance has emphasized the importance of qualitative research to develop instruments that measure what is most important to patients about a condition or treatment (i.e., the instrument’s content covers the patient experience), and quantitative research to evaluate the instrument’s measurement properties [8,9,10].

The Ulcerative Colitis-Symptom Questionnaire (UC-SQ) was accordingly developed and validated in line with US and EU regulatory guidance to assess UC-related symptom severity and response to treatment in clinical trials of adults (18 years of age or older) with moderate-to-severe UC [9, 11]. The objective of this manuscript was to describe the development and validation of this instrument.

Materials and Methods

Two-Part Development of the UC-SQ Measure

The UC-SQ was developed in two parts in line with Food and Drug Administration (FDA) guidance; the study was reviewed and approved by the study team and received ethics approval. We used a mixed-methods approach, with the first part involving qualitative research to develop the content/structure of the instrument and assess its content validity. The second part focused on a quantitative assessment of the psychometric properties of the UC-SQ using data from a phase 2b clinical trial (registered at ClinicalTrials.gov, NCT02819635).

Part 1: Instrument Development and Content Validation

The UC-SQ was built on an existing pool of content-valid items and a conceptual framework developed by AbbVie for patients with Crohn’s disease (CD), given similarities in the symptom profiles of CD and UC. A three-part qualitative study refined the UC-SQ with adaptations implemented through patient and expert feedback as well as iterative review of the measure [9]. First, a targeted literature review was undertaken, followed by semi-structured concept elicitation (CE) interviews with patients with UC (mild or in remission). Patients were identified by two US clinical sites, located in Michigan and Oklahoma. The sites were trained by ICON staff on an Institutional Review Board (IRB)-approved study protocol detailing all processes for screening. Site personnel screened candidates according to inclusion/exclusion criteria (i.e., diagnosed with UC [active or in remission], 18 to 75 years of age, able to read and understand English, willing to sign an informed consent document and willing and able to participate in the study). Sites enrolled seven patients for telephone CE interviews. After obtaining informed consent from each participant, experienced interviewers (T.B. & J.D.) conducted the CE interviews.

Following a protocol/IRB amendment, a second set of 15 patients was selected for combined concept elicitation/cognitive interviewing (CE/CI) via the same screening and recruitment process. Eleven of the CE/CI interviews were conducted in person by experienced interviewers (T.B., J.D., & V.O.) at one of the clinical sites, while the other four were conducted by phone.

CE/CI interviews with direct input from patients in remission or with mild-to-severe UC were then combined. All participants signed IRB-approved consent forms before study participation and received monetary compensation for their time.

All interviews were audio-recorded and transcribed verbatim; transcripts were reviewed to remove personally identifiable data (e.g., names and/or locations) and to correct any transcription errors before qualitative analysis. MAXQDA software version 11 was used to organize data and code transcripts according to project-specific objectives. An initial codebook included pre-defined concepts and themes already included in the interview guide and theoretical model based on the agreed-upon coding strategy; however, analysts also added codes that emerged spontaneously during the interviews. Joffe and Yardley have described this analytic approach [12]. To ensure inter-rater reliability, researchers coded initial transcripts independently (T.B. & V.O.), modifying the codebook as necessary to ensure consistency across coders, such as through the use of hierarchical codes and the addition of new codes based on emerging concepts. During this process, any discrepancies were discussed among the analysts and reviewed by a senior researcher (J.D.) until consensus was reached.

Concept Elicitation and Item Generation

Findings from the initial part of the study, encompassing both expert input (N = 2) and patient concept elicitation interviews (N = 7), were compared against the content of the item pool developed for CD to assess the relevance and appropriateness of concepts/items to patients with UC. Where gaps were identified, new items were developed using terminology derived from patient interviews and following the format of the existing CD items (i.e., similar phrasing, recall period, and response options). Conversely, where CE findings showed existing concepts to be less relevant to patients with UC, corresponding items were either removed or retained for further examination during cognitive interviewing and further refined through team consensus. All concepts mentioned by participants and related to symptoms and/or HRQoL were included in a comprehensive set of generated items to develop the initial version of the questionnaire.

To ensure that a sufficient number of participant interviews were conducted, symptom and HRQoL concepts were summarized in a saturation grid until conceptual saturation was reached (i.e., when no new concept-level data were presented by patients) [13]. Interviewers used probing and/or follow-up questioning for symptoms that were not mentioned spontaneously by patients, including probes for symptoms added to the original item pool based on input from experts and the study team (i.e., tenesmus and urgency).

Complementing patient input, two practicing gastroenterologists with 10+ years of experience treating patients with UC/CD and with clinical research experience with PRO measures were invited to participate. Clinician interviews were conducted to gather feedback concerning UC symptoms and impacts as experienced by patients, particularly points of differentiation from CD, and to review the PRO instrument in terms of its applicability for UC to identify any missing or inappropriate concepts. The interviews followed a semi-structured guide, were conducted by phone, and were recorded for transcription and coding. Clinicians were compensated according to fair market value for their time participating in the study.

Pilot Testing/Combined Concept Elicitation and Cognitive Interviews

The feasibility of the draft UC questionnaire was subsequently tested in patients with UC in the US (N = 15) in a one-on-one interview conducted by trained interviewers with experience in qualitative data collection. To ensure saturation of concepts, a combined CE/CI methodology was used. During the joint CE/CI interviews, patients were asked about their disease experience, including symptoms and impacts (CE), before being presented with the draft UC instrument to review in the CI portion. Participants were questioned regarding the questionnaire’s comprehensiveness, the clarity and appropriateness of the instructions, items, and response options, and their interpretations and opinions of the relevance of each question. Patients were also asked if the recall period was adequate for the concepts covered. Patients were asked to identify areas of ambiguity and to suggest alternative wording in cases where they thought an item and/or procedure could be improved.

Part 2: Psychometric Validation

Study Population

The psychometric validation of UC-SQ was undertaken using data from a phase 2b, dose-ranging, multicenter, randomized, double-blind, placebo-controlled study (NCT02819635). This trial aimed at evaluating the dose–response, efficacy, and safety of different doses of upadacitinib compared to placebo as induction therapy for 8 weeks in subjects with moderately to severely active UC with inadequate response or intolerance to immunosuppressants, corticosteroids, and/or biologic therapies (n = 250). The participants were adults aged 18–75 years with a UC diagnosis for at least 90 days before baseline, confirmed by colonoscopy during the screening period, excluding current infection, colonic dysplasia and/or malignancy. Further, the participants had active UC with an Adapted Mayo Score of 5 to 9 points and endoscopic subscore of 2 to 3 (confirmed by a central reader) and demonstrated an inadequate response to, loss of response to, or intolerance to corticosteroids, immunosuppressants, and/or biologic therapies.

UC-SQ

The UC-SQ used in psychometric testing is a self-administered, condition-specific PRO instrument that measures UC-related intestinal symptoms (e.g., frequent bowel movements, abdominal pain, and loss of appetite) and extra-intestinal symptoms (e.g., lacking energy, joint pain, and difficulty sleeping) with a 7-day recall period. The UC-SQ is composed of 17 Likert-type items. Each item on the questionnaire can be scored on a five-point Likert scale to assess the frequency (0 = Never; 1 = Rarely; 2 = Sometimes; 3 = Often; 4 = Always) or intensity (0 = Not at all; 1 = A little bit; 2 = Somewhat; 3 = Quite a bit; 4 = Very much) of the symptom. Overall symptom scores are calculated by combining the individual items’ ratings (total score ranges from 0 to 68), with higher scores indicating greater severity. All responses are added with equal weight to obtain the total score. Following initial performance testing, two items (joint pain and constipation) were removed from the UC-SQ total score, resulting in a 15-item measure, with joint pain and constipation included as standalone, non-scored items.
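The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the instrument's official scoring algorithm: the item keys (such as `joint_pain` and `constipation`) and the function name `ucsq_total` are hypothetical, and the missing-data rules of Appendix 1 are not modeled. With 15 scored items each rated 0 to 4, the total under this rule would range from 0 to 60.

```python
# Hypothetical item keys; the actual questionnaire wording is not reproduced here.
NON_SCORED = {"joint_pain", "constipation"}  # standalone, non-scored items

def ucsq_total(responses):
    """Sum equally weighted 0-4 item ratings, skipping the non-scored items."""
    total = 0
    for item, rating in responses.items():
        if rating not in (0, 1, 2, 3, 4):
            raise ValueError(f"{item}: rating must be a whole number from 0 to 4")
        if item not in NON_SCORED:
            total += rating
    return total

# Made-up example: 15 scored items all rated 2, plus the two standalone items.
example = {f"item_{i}": 2 for i in range(1, 16)}
example.update({"joint_pain": 4, "constipation": 0})
print(ucsq_total(example))  # 30
```

The standalone items are still validated for range but do not change the total, mirroring their informational role.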

Other Clinical Outcomes Assessment (COA) Measures

In addition to the UC-SQ, the trial collected data on disease severity (using Full Mayo Scores), change in symptom severity (using the Patient Global Impression of Change [PGIC]), disease-specific HRQoL (using the Inflammatory Bowel Disease Questionnaire [IBDQ]), and generic HRQoL (using the Short Form 36-Item [SF-36] survey). Each patient participated in screening, baseline, 2-week, 4-week, and 8-week follow-up visits. Mayo scores were collected at baseline and the 8-week follow-up visit. The IBDQ, SF-36, and UC-SQ were collected at baseline and the 2-week, 4-week, and 8-week follow-up visits. The PGIC was collected at the 2-week, 4-week, and 8-week follow-up visits.

Statistical Analysis

Analyses were performed under a pre-specified statistical analysis plan. While AbbVie’s upadacitinib UC phase 2b trial (NCT02819635) included 250 patients, the UC-SQ was only administered at selected sites. It was not yet fully developed at the trial initiation and was added after the trial onset. Only 113 patients had UC-SQ scores, and psychometric validation analyses were applied to this subset of patients only. While part 1 of the study derived a 17-item UC-SQ, initial psychometric properties (item-level calculations, factor structure, internal consistency reliability) suggested that the joint pain and constipation items may not contribute to the measure’s performance. As a result, the remaining psychometric properties were assessed based on a 15-item UC-SQ total score excluding joint pain and constipation from the total score. Joint pain and constipation were retained in the questionnaire as informational, non-scored, standalone items. Procedures regarding handling of missing data are listed in Appendix 1.

Item-Level Analyses

Item-level analyses were evaluated using baseline data and included measures of central tendency (to assess the distribution of total scores), use of response categories for each item (i.e., frequency and percentage of patients in each response category), and an assessment of floor and ceiling effects [14].
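As a rough illustration of the floor and ceiling assessment, the sketch below computes the percentage of respondents choosing the lowest and highest response category for a single item. The function name and the ratings are made up for the example, and no particular cut-off for flagging an effect is assumed.

```python
from collections import Counter

def floor_ceiling(ratings, lowest=0, highest=4):
    """Return (floor %, ceiling %) for one item's ratings."""
    counts = Counter(ratings)
    n = len(ratings)
    return 100 * counts[lowest] / n, 100 * counts[highest] / n

# Made-up ratings for one item from ten respondents.
ratings = [0, 0, 0, 1, 2, 0, 4, 0, 3, 0]
floor_pct, ceiling_pct = floor_ceiling(ratings)
print(floor_pct, ceiling_pct)  # 60.0 10.0
```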

Factor Structure

An exploratory factor analysis was performed at baseline to evaluate the factor structure of the UC-SQ. Maximum likelihood extraction and direct oblimin rotation (i.e., allowing for correlation between extracted factors) on polychoric correlations were used to identify the factor solution. Standard metrics were used to determine the number of factors to retain, including scree plots, the number of eigenvalues exceeding 1.0, and the percentage of variance explained by the overall solution and by each factor. One-, two-, and three-factor solutions were tested.

Reliability

Internal consistency reliability was evaluated at baseline using the Cronbach’s alpha coefficient, with a target value of 0.7 indicating good internal consistency [15, 16], and using item-total correlations, with a significant correlation > 0.30 showing good homogeneity [16, 17].
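For readers less familiar with these statistics, a minimal sketch of Cronbach's alpha and of a corrected item-total correlation (each item against the sum of the remaining items) follows. It uses the conventional formulas and made-up ratings; the study itself used SAS, so the function names here are illustrative assumptions rather than the authors' code.

```python
from statistics import mean, variance

def cronbach_alpha(rows):
    """rows: one list of item ratings per respondent (complete cases only)."""
    k = len(rows[0])
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def pearson(x, y):
    """Pearson correlation computed from sums of squares."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def corrected_item_total(rows, j):
    """Correlate item j with the sum of the remaining items."""
    item = [r[j] for r in rows]
    rest = [sum(r) - r[j] for r in rows]
    return pearson(item, rest)

# Made-up ratings: five respondents, three items.
rows = [[0, 1, 1], [1, 1, 2], [2, 3, 2], [3, 3, 4], [4, 4, 4]]
print(round(cronbach_alpha(rows), 2))
print(round(corrected_item_total(rows, 0), 2))
```

Correlating each item with the total of the other items avoids inflating the correlation by including the item in its own total.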

Test–retest reliability was assessed between baseline and week two among patients with no change in the PGIC (i.e., stable patients). Mean differences and paired t tests were calculated to compare UC-SQ total scores (and the statistical significance of any change) between the two assessment visits. The intraclass correlation coefficient (ICC) with 95% confidence intervals was computed where ≥ 0.7 (absolute ICC and lower 95% confidence interval limit) indicates good reproducibility [18,19,20].
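The ICC can be computed in several forms, and the text does not state which one was used. As one possibility only, the sketch below implements a two-way, absolute-agreement, single-measurement ICC from the usual ANOVA mean squares; the function name and the scores are assumptions made for illustration.

```python
def icc_absolute_agreement(scores):
    """scores: one list of repeated measurements per patient, e.g. [baseline, week 2]."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(r) for r in scores) / (n * k)
    row_means = [sum(r) / k for r in scores]
    col_means = [sum(r[j] for r in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between patients
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between visits
    ss_total = sum((x - grand) ** 2 for r in scores for x in r)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Made-up total scores for three stable patients at two visits.
stable = [[10, 12], [20, 22], [30, 32]]
print(round(icc_absolute_agreement(stable), 3))
```

Because this form penalizes systematic shifts between visits, a uniform 2-point change across patients yields an ICC slightly below 1 even though the rank order is perfectly preserved.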

Validity

Known-groups validity was examined at baseline to determine whether the UC-SQ could distinguish between patients by disease severity. Disease severity was defined in three ways: (1) by Full Mayo Scores (clinical remission, mild disease, moderate disease, and severe disease); (2) by Mayo endoscopy subscore categories (inactive, mild, moderate, and severe disease); and (3) by IBDQ total score quartiles. Using analysis of variance (ANOVA) models, we assessed differences in UC-SQ scores by severity groups. Given the limited variability in disease severity at baseline due to the study inclusion criteria, known-groups validity was also explored at week 8.

Convergent/discriminant validity was assessed by examining correlations of UC-SQ scores with scores on the physical and mental component scores and the social functioning/role-emotional domains of the SF-36 and the IBDQ bowel, systemic, emotional, and social function subscale scores at baseline. Specifically, the UC-SQ was expected to be strongly correlated (r > 0.50) with the IBDQ bowel and systemic symptoms domains and moderately correlated (r = 0.30–0.49) with the SF-36 physical component score, indicating convergent validity [21, 22]. Lower correlations were anticipated between the UC-SQ and social functioning/role-emotional domains and the mental component score of the SF-36 and the emotional and social function subscales of the IBDQ, indicating discriminant validity.

Responsiveness

Responsiveness was assessed among groups of patients showing a change from baseline to week 8 as defined using specific anchor measures. Specifically, patients were classified as ‘improved’, ‘having no change’, or ‘worsened’ based on the PGIC. Further, patients were classified by clinical remission status (achieved remission vs. not) and by treatment group (treatment vs. placebo). Mean change between baseline and week 8 and effect size (ES) (i.e., mean change divided by baseline standard deviation) were calculated for the UC-SQ total score within and between subgroups of change, and effect sizes were compared [22]. ANOVA was used to determine whether the difference was statistically significant between the groups and whether there was any linear trend in change scores.
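The effect size defined above (mean change divided by the baseline standard deviation) can be illustrated as follows. The standardized response mean is included using its conventional definition (mean change divided by the standard deviation of the change scores), which is an assumption because the text does not define it; the scores are made up.

```python
from statistics import mean, stdev

def effect_size(baseline, week8):
    """Mean change from baseline divided by the baseline standard deviation."""
    changes = [w - b for b, w in zip(baseline, week8)]
    return mean(changes) / stdev(baseline)

def standardized_response_mean(baseline, week8):
    """Mean change divided by the standard deviation of the change scores."""
    changes = [w - b for b, w in zip(baseline, week8)]
    return mean(changes) / stdev(changes)

# Made-up total scores for five patients; negative change means improvement.
baseline = [30, 40, 20, 35, 25]
week8 = [20, 28, 15, 22, 20]
print(round(effect_size(baseline, week8), 2))
print(round(standardized_response_mean(baseline, week8), 2))
```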

Clinically Meaningful Change

Individual and group-level clinically meaningful change thresholds were derived in this study. Meaningful within-person change (MWPC; also previously defined as the clinically important responder (CIR) or “responder definition”) refers to the amount of change a patient would have to report to indicate that a treatment benefit has been experienced (individual-level change from baseline). The MWPC was derived using a range of anchor-based and distribution-based estimates. Specifically, the anchor-based approach employed individual change scores on the UC-SQ anchored on the patient’s impression of change category. PGIC responses of “minimally improved” or better were interpreted as showing clinical meaningfulness to patients, with “much better” or “better” also assessed. Clinical remission (as defined by an Adapted Mayo Score ≤ 2, with stool frequency subscore ≤ 1, rectal bleeding score of 0, and endoscopic subscore ≤ 1) was also assessed as a potential anchor. The MWPC analysis for each anchor then used a receiver operating characteristic (ROC) curve analysis with the average change scores dichotomized into response and non-response. ROC curves with calculation of the area under the curve (AUC) were used to identify the UC-SQ score change reflecting the best cut point (BCP). BCPs are those values that best identify meaningful change in terms of specified changes in the external criterion. The analytic results were supported by presenting probability density function (PDF) and empirical cumulative distribution function (CDF) curves for each comparison, which visually showed the separation between anchor groups and the extent of overlap. Distribution-based analyses complemented the anchor-based analyses in order to triangulate and check estimates using other methods and to estimate a clinically meaningful difference for group comparisons. Distribution-based estimates of the MWPC were derived using the minimal detectable change at 90% (MDC90).
Group-level change thresholds (i.e., the minimal clinically important difference [MCID], or the difference between two treatment groups that can be considered clinically important) were also derived using distribution-based estimates such as the ½ standard deviation (SD) and the standard error of measurement (SEM).
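The distribution-based quantities named above are commonly computed with the formulas sketched below. The text does not spell out the formulas used, so these are the conventional definitions and should be read as assumptions: SEM as the SD multiplied by the square root of one minus a reliability coefficient, and MDC90 as 1.645 × √2 × SEM.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement from an SD and a reliability coefficient."""
    return sd * math.sqrt(1 - reliability)

def mdc90(sem_value):
    """Minimal detectable change at the 90% confidence level."""
    return 1.645 * math.sqrt(2) * sem_value

def half_sd(sd):
    """Half-standard-deviation criterion."""
    return 0.5 * sd

print(round(mdc90(3.74), 2))  # 8.7
```

Under these conventional formulas, the SEM of about 3.74 reported in the Results corresponds to an MDC90 of roughly 8.7, in line with the reported value.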

SAS version 9.4 (SAS Institute, Cary, NC) was used for all statistical analyses.

Ethical Considerations

Ethical approval was obtained from central Ethics Committees for both Part 1 and 2 of the study (Salus IRB, Austin, TX), and the study was performed in accordance with the ethical standards as laid down in the 1964 Declaration of Helsinki and its later amendments.

Results

Part 1: Instrument Development and Content Validation

Interviews were conducted with 22 adult participants (age 18 to 75 years) recruited from two US clinical sites (Michigan and Oklahoma) with a UC diagnosis (active or in remission). Their demographic and clinical characteristics are summarized in Table 1. Mean age was 32.8 years (SD = 11.5). Females comprised 59.1% of the study population, and 50% of the patient sample were in remission at the time of screening based on clinician and patient rating of disease severity.

Table 1 Demographic and clinical characteristics of participants in the CE & CE/CI process

Concept Elicitation and Item Generation

The concept elicitation included seven 60-min participant interviews and two clinician interviews conducted by telephone. Spontaneously elicited and probed symptoms are distinguished in the saturation table in Online Appendix 2 (Supplementary Data Content: Supplementary Table 1), documenting concepts as they emerged across the patient sample. Fourteen symptom concepts were elicited across 22 total interviews. Of the 14 symptom concepts, only tenesmus was elicited through probing initially; the other 13 symptoms were spontaneously elicited. Reflecting the centrality of elicited symptom concepts to the patient experience of UC, 12/14 (86%) of concepts emerged by the third interview and 13/14 (93%) by the fourth, with saturation achieved by the fifth interview. The resulting impacts data were analyzed and showed a wide range of spontaneously reported symptom impacts on daily life and functioning. Thematic analysis showed correspondence between these impact concepts and the existing item pool (e.g., difficulty with physical activity, work/school, socializing, range of travel, and negative emotional impacts). The saturation table illustrates this correspondence in descending order of frequency. Thematic saturation was achieved early, by the third interview, reflecting the centrality of impact concepts to patients’ experience of the disease.

Two clinical experts, both US gastroenterologists treating patients with UC or CD, participated in an interview. One expert was identified based on prior consultation in developing the initial CD item pool, while the other was identified through publication records. Expert input confirmed a significant overlap in symptoms and impacts experienced by patients with UC and CD (frequent bowel movements, diarrhea, and fatigue representing core concepts for both conditions). They noted that abdominal pain was associated more with CD and blood in stool with UC in terms of differentiation. For missing concepts, both experts reported tenesmus, the feeling of incomplete evacuation, as a UC-specific symptom. They noted bowel urgency to be a concept frequently reported by patients with both UC and CD.

After review of the draft UC-PRO item pool (based on the existing CD PRO measure), experts reported that the response options captured the frequency/intensity dimensions of the concepts to which they applied and that a 7-day recall period was appropriate for assessment of UC. Instructions were reported to be readily understandable, and the administration procedure feasible for use in clinical research.

Pilot Testing/Combined Concept Elicitation and Cognitive Interviews

The combined CE/CI included 15 participants who took part in 60–90 min CE/CI interviews conducted in-person or by telephone. Item-level findings are further detailed in the Item Tracking Matrix (Supplementary Data Content: Supplementary Table 2) charting where the existing items, i.e., those included in the original CD item pool, were removed/retained for CI evaluation and where newly generated items were introduced and subsequently reviewed by patients. As shown by the Item Tracking Matrix, 22 items (17 from the existing, i.e., original CD item pool, and five newly developed) were left unchanged because CI participants interpreted the underlying concepts correctly and reported the items to be easily understandable. Findings regarding relevance were also reported, showing the concepts to be largely applicable to patients’ experience of UC. Findings related to the measure’s formal properties also show the instructions to be clear and comprehensible, the response options appropriate, and the recall period acceptable without posing any challenges to patients.

Part 2: Psychometric Validation

Sample Characteristics

The baseline characteristics of the subjects included in the psychometric validation substudy of the upadacitinib UC phase 2b trial (NCT02819635) are presented in Table 2. Participants had a mean age of 45.24 years (> 60.2% white male). The majority of participants had UC for over three years (mean disease duration = 8.38 years) and were classified as having extensive disease/pan colitis at baseline. Participants had a mean Full Mayo score of 9.40 and were classified as having moderate or severe Mayo Endoscopy subscores. The average IBDQ score among participants was 122.

Table 2 Baseline characteristics of participants with UC in AbbVie’s Upadacitinib UC phase 2b trial (NCT02819635)

Item-Level Analyses

Item-level analyses and the distribution of UC-SQ total scores at baseline are shown in Table 3. Overall, most items had approximately normal distributions. However, for items 9 and 15 (joint pain and constipation, respectively), ≥ 55% of the sample selected the minimum possible score (‘not at all’ or ‘never’). Further, 48% of the sample selected the minimum possible score on item 7 (nausea). The overall distribution of scores had a mean (SD) of 32.20 (10.92).

Table 3 Item-level analyses: summary of item-level responses on the UC-SQ at baseline among patients with UC

Factor Structure

Findings from factor analyses indicated that a unidimensional structure was most appropriate (Supplementary Data Content: Supplementary Table 3). Specifically, the one-factor solution for the UC-SQ showed that the first factor accounted for 62% of the variance in the data and was related to 12 of 17 items (abdominal pain, rectal pain, cramping, lack of energy, nausea, loss of appetite, difficulty sleeping, bloating, blood in stools, mucus in stools, bowel movements with empty bowels, and bowel movement urgency). Joint pain and constipation loaded on the second factor with a factor loading of < 0.30. Although bowel movement frequency, passing gas, and diarrhea did not meet the 0.40 threshold, they had a factor loading of > 0.35 on the first factor. The two- and three-factor structures did not explain more variance in the data than the unidimensional structure (data not shown).

Reliability

Internal consistency reliability was demonstrated for the UC-SQ with an overall Cronbach’s alpha value of 0.86. Cronbach’s alphas calculated with each item omitted ranged similarly, between 0.84 and 0.86. Internal consistency improved very slightly when joint pain and constipation were deleted individually (0.86 and 0.86, respectively). After removing the joint pain and constipation items from the instrument, the overall Cronbach’s alpha was 0.86 (suggesting good internal consistency), with similar Cronbach’s alphas with each item omitted (Supplementary Data Content: Supplementary Table 4). Further, most UC-SQ items had moderate-to-strong correlations with the remaining items’ total score (i.e., met the pre-specified threshold of 0.30). However, the joint pain and constipation items did not correlate well with the total score. Item-total correlations with joint pain and constipation removed were similar, indicating item homogeneity for all remaining items.

For the 15-item UC-SQ, test–retest reliability was demonstrated among stable patients (n = 21 reporting no change on the PGIC), with an ICC value of 0.88 and a minimal, albeit significant, change between baseline and week 2 (−2.62 points, p = 0.04) (Supplementary Data Content: Supplementary Table 5).

Validity

Regarding known-groups validity, the UC-SQ was able to differentiate between patients by disease severity (p < 0.01) based on the Full Mayo Scores, Mayo Endoscopy Scores, and IBDQ Score Quartiles (Table 4). Specifically, significant differences (p < 0.001) were observed at baseline between groups defined by full Mayo Scores and IBDQ total scores, whereby higher full Mayo Score groups (i.e., more severe disease) and lower IBDQ total score quartile groups (i.e., lower HRQoL) were associated with higher (i.e., worse) UC-SQ scores. There were no significant differences in UC-SQ scores by Mayo endoscopy subscore groups at baseline (p = 0.73), although at baseline all patients had moderate or severe endoscopic lesions. At week 8, when all categories of clinical and endoscopy severity were represented, similar findings were observed for the full Mayo Scores and IBDQ total scores. Furthermore, significant differences in UC-SQ scores were also observed between Mayo endoscopy subscore groups at this timepoint, such that higher Mayo endoscopy subscore groups (i.e., more severe disease) were associated with higher (i.e., worse) UC-SQ scores (p < 0.001).

Table 4 Known-groups validity: differences in UC-SQ scores between severity groups at baseline and week 8

Correlations between UC-SQ total scores and other relevant COA measures are presented in the Supplementary Data Content (Supplementary Table 6). Strong negative correlations (|r| > 0.70) between UC-SQ scores and IBDQ bowel and systemic symptoms subscale scores were observed, as expected given differences in the direction of scoring of the instruments. Moderate negative correlations (|r| = 0.40–0.70) were demonstrated between UC-SQ scores and SF-36 physical component summary scores, social functioning and mental component summary scores, and between UC-SQ scores and IBDQ emotional and social function subscale scores. A weak negative correlation (r = −0.37) was observed between UC-SQ scores and SF-36 role-emotional subscale scores.

Responsiveness

Changes in UC-SQ scores anchored on PGIC, clinical remission, and treatment groups were found to be statistically significant (Table 5; p < 0.01). Specifically, improvements in PGIC were linearly related to improvements (reductions) in UC-SQ scores. Achieving clinical remission was associated with greater improvements (reductions) in UC-SQ scores (p = 0.003). Finally, patients who received treatment had greater improvements (reductions) in UC-SQ scores than those who received a placebo (p = 0.002). Within-subgroup effect sizes and standardized response means (SRMs) for the various response categories were all important, with the exception of the PGIC deterioration category and placebo group, which had borderline small-to-moderate effect sizes. Between-subgroup effect sizes were important for all anchors.

Table 5 Responsiveness: UC-SQ change scores from baseline to week 8 by PGIC score groups, clinical remission status groups, and treatment groups

Clinically Meaningful Change

The MWPC was estimated using the PGIC and clinical remission as external anchors. The BCPs were defined by maximizing the percentage of “true” classifications based on correspondence with the global impression (with at least minimal improvement defined as “meaningful”). Accordingly, the BCP was a decrease of six points, reflecting improvement (Table 6). Using the more conservative definition of a MWPC (i.e., much/very much improved vs. minimally improved/no change/worsened), the BCP was a decrease of 14 points. Likewise, using the more objective anchor of clinical remission, the BCP was a decrease of 14 points. The 14-point average observed across the latter two anchors would reflect an overly stringent expectation regarding meaningful change. A distribution-based determination of minimum detectable change (within-patient) fell between the 6- and 14-point estimates, with MDC90 close to 8.7 points. Overall, these findings suggest that a 10-point improvement in UC-SQ scores is an appropriate MWPC threshold as the approximate midpoint of the plausible range of 6 to 14.
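To make the cut-point logic concrete, the sketch below scans candidate thresholds on the change score and keeps the one that maximizes the percentage of correct classifications against a binary anchor, and it computes the AUC as the probability that a randomly chosen responder improved more than a randomly chosen non-responder. The function names and the data are invented for illustration; this is not the SAS code used in the study.

```python
def best_cut_point(changes, responder):
    """Return (cut, accuracy), predicting 'responder' when change <= cut."""
    best = None
    for cut in sorted(set(changes)):
        correct = sum((c <= cut) == r for c, r in zip(changes, responder))
        accuracy = correct / len(changes)
        if best is None or accuracy > best[1]:
            best = (cut, accuracy)
    return best

def auc(changes, responder):
    """Share of responder/non-responder pairs in which the responder improved more."""
    pos = [c for c, r in zip(changes, responder) if r]
    neg = [c for c, r in zip(changes, responder) if not r]
    wins = sum((p < q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Made-up change scores (negative = improvement) and anchor-based responder flags.
changes = [-18, -14, -11, -9, -7, -4, -1, 3]
responder = [True, True, True, False, True, True, False, False]
print(best_cut_point(changes, responder))  # (-4, 0.875)
print(round(auc(changes, responder), 3))   # 0.867
```

When several thresholds tie on accuracy, this sketch keeps the most negative one; a real analysis would need an explicit tie-breaking rule.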

Table 6 Clinically meaningful change (MWPC and MCID Definition): mean of UC-SQ change scores, mean difference in change scores and AUC/BCP by Anchor groups

Distribution-based estimates for MCID were, as expected, somewhat lower (Table 6). These ranged from a SEM-based calculation of ~ 3.74 to a ½ SD criterion-based estimate of 5.18 points.

Discussion

This study aimed to develop and validate a PRO instrument (the UC-SQ) that evaluates the symptoms and severity of UC to better determine treatment efficacy in clinical trials in line with industry and regulatory guidelines for developing and validating COA measures.

The instrument development and content validation activities showed UC symptoms and impacts to be wide-ranging and varied, with subsequent thematic qualitative analyses demonstrating considerable overlap between UC and CD in terms of symptoms and impacts experienced by patients. Where gaps or omissions were identified in the initial item pool analysis and/or through expert input, cognitive interviewing showed the additional UC-SQ items developed to address the concepts to be comprehensible and relevant to patients with UC. Cognitive interviewing further showed the adapted UC-PRO measure to be appropriate with regard to the instructions, response options, and recall period employed. With these findings supporting content validity among adult patients with UC, validation can be furthered by assessing the PRO instrument’s psychometric properties in subsequent studies.

Following evidence of content validity, good validity evidence of the performance of the UC-SQ was found among participants in the upadacitinib UC phase 2b trial (NCT02819635) who were similar to participants in other studies of UC with regard to age, sex, and disease severity (i.e., sample is generally representative of the broader population with UC) [23]. Performance testing of the initial 17-item UC-SQ suggested approximately normal distributions for most items, except for “joint pain” and “constipation”, which demonstrated ceiling effects. These results indicate that, for the most part, the UC-SQ items are performing in ways that were expected, although there may be some items that do not contribute to measuring reliability. Items having asymmetric response frequency, showing that an item is difficult to endorse, can occur because the targeted symptom occurs irregularly or infrequently for many patients with UC. This is supported by data from the qualitative interviews, which showed that constipation and joint pain were not highly endorsed overall but were spontaneously mentioned by patients interviewed (4/22, 18% for each). These findings are also in line with other studies of patients with UC that show that joint pain and constipation are not as commonly reported among patients as other symptoms such as diarrhea, incontinence, fatigue, bowel movement urgency, and abdominal discomfort [4, 24,25,26,27,28].

Using an exploratory factor analysis, no meaningful latent structure which distinctly groups symptoms into separate categories was found. We, therefore, see no evidence that the UC-SQ should score separate domains at this point. While some items (i.e., bowel movement frequency, passing gas, and diarrhea) did not meet the 0.40 threshold in a one-factor solution, these items had factor loadings > 0.35 on Factor 1. In light of the borderline statistical importance of these symptoms and their clinical importance in the patient population with UC, these items should likely be retained in the measure. Joint pain and constipation, however, did not load onto any factor and thus may not be adding informational content to the scale as supported by findings from item-level analyses.

Upon additional testing, the 17-item UC-SQ demonstrated strong internal consistency, as evidenced by an overall Cronbach’s alpha of 0.86. Similar Cronbach’s alphas with each item omitted were observed. Internal consistency improved very slightly when joint pain and constipation were deleted individually, suggesting that these items may not aid the measure’s score as a whole. All item-total correlations met the pre-specified threshold of 0.30, except joint pain and constipation, further suggesting that these items may not correlate with the scale overall in the adult population with UC. When analyses were repeated using the 15-item UC-SQ with joint pain and constipation removed from the total score, excellent internal consistency reliability was demonstrated, with moderate-to-strong item-total correlations for all items in the UC-SQ.
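To make the internal consistency analysis concrete, the sketch below computes Cronbach's alpha and the alpha obtained when one item is omitted, using the standard formula alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores). The function names and the toy response matrix are hypothetical; the example only illustrates how an item that does not track the others can lower alpha.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def alpha_if_item_deleted(rows, j):
    """Cronbach's alpha recomputed with item j (0-based index) removed."""
    reduced = [[v for i, v in enumerate(r) if i != j] for r in rows]
    return cronbach_alpha(reduced)

# Toy data: items 0 and 1 move together; item 2 does not follow them.
rows = [
    [1, 1, 3],
    [2, 2, 1],
    [3, 3, 2],
]
print(cronbach_alpha(rows))
print(alpha_if_item_deleted(rows, 2))
```

In this toy matrix, removing the third item raises alpha sharply. The effect reported for joint pain and constipation in the UC-SQ was much smaller, with only a very slight improvement.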

The initial psychometric testing of the UC-SQ indicated that joint pain and constipation might not be symptoms that are prevalent enough among the population with UC to warrant inclusion in a measure aimed at assessing symptoms of this condition. These symptoms may have been identified during concept elicitation because patients were in remission or had mild UC. Nonetheless, while infrequently reported, these symptoms may still be of importance to some patients’ perceived UC disease manifestation. Accordingly, these items were removed from the UC-SQ total score but retained as non-scored, informational, standalone items. The remaining psychometric properties were explored based on a 15-item UC-SQ unidimensional score. Reproducibility of this 15-item UC-SQ was strong, as evidenced by an ICC of 0.88 and a minimal change in scores between baseline and week 2. This is a strong result considering the small subsample of complete subjects available for the analysis (n = 33). While change over the 2 weeks was borderline significant, this could be attributable to a placebo or treatment effect.

Strong support was also found for the known-groups validity of the 15-item UC-SQ. Specifically, the UC-SQ was able to differentiate between disease severity groups as defined by full Mayo Scores and IBDQ total scores in the expected direction at baseline and week 8. Significant differences were also observed among groups defined by Mayo endoscopy subscores at week 8 but not at baseline, which is in line with the recruitment strategy (only patients with moderate-to-severe UC were included in the trial) and is likely due to the lack of variability in disease severity at baseline. The results for convergent validity showed strong and significant correlations with criterion measures in line with a priori hypotheses. Findings suggest that while UC-SQ scores are more strongly associated with physical symptoms as expected, the symptoms of UC may also negatively impact social and emotional wellbeing, thus explaining the stronger than anticipated correlation between UC-SQ total scores and other mental component scores.

In contrast to the UC-PRO/SS diary, which was developed and validated using an observational study [29], this study demonstrated the responsiveness of the UC-SQ using an interventional study; that is, the instrument was shown to be responsive in patients who also indicated a change on other instruments in the expected direction. These findings provide evidence for the ability of the UC-SQ to detect a change when a change in the concept of interest has occurred. Finally, estimates for clinically meaningful change were obtained from several sources and using two anchors. Based on triangulation of findings from various analyses, a 10-point improvement in UC-SQ scores (i.e., decrease) appears to be a reasonable threshold to estimate within-patient clinically meaningful change.

There were several strengths in each part of the study. In the concept elicitation part of the study, clinician and patient assessment of disease severity were conducted at baseline and recruitment was monitored to ensure all UC severities were represented. At least one participant with UC belonged to each severity category; however, half of the sample were categorized as in remission, requiring patients to extrapolate on previous flare-ups to describe their UC symptoms and impacts. Another notable strength of the UC-SQ is the incorporation of an impacts domain designed to assess functioning related to disease status. These multi‐domain instruments may be able to support claims related to symptom and HRQoL improvement. Strengths of the psychometric validation study include using the upadacitinib UC phase 2b trial (NCT02819635), which collected various other PRO and clinical measures at different time points for use as reference instruments to assess the planned psychometric properties. Further, given that the study leveraged patient-level data from an interventional clinical trial, tests of sensitivity to change over time and evaluation of clinically meaningful change thresholds were possible as objective change had occurred in many trial participants throughout the trial. The analysis in this study utilized data from an 8-week trial, which provided information on the variability in UC disease over time, and the ability to assess change, including worsening and improvement.

Several limitations to the CE and CE/CI study should also be acknowledged. First, there was a lack of variation in clinical disease activity and racial/ethnic background among the development population. The sample included only one participant with severe clinician-rated UC and two participants with moderate UC; 19/22 participants were categorized as in remission or with mild UC. Further validation in a more diverse population may increase the validity of the questionnaire. Second, the UC-SQ was developed in an English‐speaking US population; translatability assessments are needed to ensure linguistic and cultural equivalency in different country contexts. No evaluation of patient literacy was conducted, but some variation in educational attainment was captured at screening. In part 2 of the study, there was a limitation in selecting anchors embedded in the trial design; that is, no patient global impression of severity was included among the trial instruments. This study relies on a retrospective PGIC; thus, changes in symptom severity cannot be estimated directly (i.e., through measurement of severity at baseline and subsequent time points in the study). For the assessment of test–retest reliability, the UC-SQ was administered 2 weeks after the first dose of the study drug, which may have influenced patient responses. The questionnaire should ideally be scheduled closer to the baseline visit to capture truly “stable patients”.

Conclusion

In conclusion, this study provides evidence in support of reliability, content and construct validity, and responsiveness to change for the UC-SQ. Further, clinically meaningful change thresholds were derived for the tool, which is critical in the interpretation of patient outcomes when evaluating treatment efficacy. These findings indicate that the UC-SQ is fit-for-purpose as a key endpoint in pivotal trials, capable of capturing UC symptom severity of patients in clinical practice.