Introduction

Crohn’s disease (CD) is a chronic, idiopathic, immune-mediated disease with recurrent cycles of relapse and remission [1, 2]. Prevalence estimates range between 201 (USA) and 319 (Canada) per 100,000 adults [1, 3]. Incidence was 20.2 per 100,000 person-years in North America, 12.7 per 100,000 person-years in Europe, and 5.0 per 100,000 person-years in Asia and the Middle East [4].

Inflammation associated with CD is transmural and occurs in a discontinuous manner along the gastrointestinal (GI) tract [5]. Bowel urgency (BU), the sudden or immediate need to have a bowel movement, is a frequent [5,6,7] and bothersome [8,9,10,11,12] symptom for patients with CD and one of the most important symptoms that they want to gain control of with treatment [8, 13]. Symptoms associated with CD often correspond to the disease behavior and location [14]. Chronic inflammation in the anorectal region and small intestine, for example, is known to cause symptoms such as BU [5, 15]. However, BU can occur without evidence of bowel inflammation in patients with CD [7, 16].

CD is associated with considerable morbidity and significantly diminished health-related quality of life (HRQOL). New treatments for CD should provide symptomatic relief [17], which is best evaluated with patient-reported outcome (PRO) measures. The US Food and Drug Administration (FDA) is encouraging the development of PRO measures to be incorporated into clinical testing for new therapies to assess concepts that are important to patients [5, 6]. Clinical trials have historically relied heavily on clinical or endoscopic indices, rather than PRO measures [7]. Consequently, there is a need for the development and validation of PRO instruments that assess important concepts. Currently, no PRO measures that assess BU have been validated for patients with CD.

The Urgency Numeric Rating Scale (NRS) was developed and validated in ulcerative colitis (UC), and allows respondents to rate the severity of their bowel urgency over the past 24 h using a 0 (“no urgency”) to 10 (“worst possible urgency”) scale [18]. This study was designed to explore the patient experience of CD, confirm the importance and relevance of BU in this population, cognitively debrief the Urgency NRS to establish content validity, explore meaningful change and BU remission, and assess the measurement properties of the Urgency NRS.

Methods

This mixed-method observational study consisted of a cross-sectional qualitative interview component (Part A) and a longitudinal quantitative web-based survey (Part B). Patients were recruited via clinical sites (Part A, with some patients also referred to Part B) and through a research recruitment vendor (Part B only) between August 2020 and February 2021. All participants were adults (18+ years old) with a clinically confirmed diagnosis of CD who lived in the US and spoke English. Clinical diagnosis of CD was confirmed by medical records review for patients referred by clinical sites, or by another accepted form of evidence of diagnosis provided by participants recruited via the research recruitment vendor. Participants provided informed consent prior to taking part in the study. This study was conducted in compliance with Good Clinical Practice guidelines, including International Conference on Harmonization Guidelines. The study protocols were approved by Advarra (Columbia, MD).

Part A. Qualitative interviews

Six US clinical sites recruited an initial target of 25 adult patients with a documented diagnosis of moderate-to-severe CD, based on clinical, endoscopic, radiologic and laboratory examination per national guidelines for CD diagnosis [19]. Sample size estimation was based on expectations to achieve saturation [20, 21]. A purposive sampling approach was used. All participants were required to be currently experiencing, or to have experienced, CD symptoms within the past 3 months, based on self-report. Patients were excluded if they had an ileostomy, colostomy, or intra-abdominal surgery within 3 months or had a comorbid condition which may have confounded discussion of their CD symptoms.

Interview recruitment targeted approximately even distribution of patients with each of the following 5 CD subtypes, as reported by the clinical site based on medical records: Type 1: Small bowel involvement only, including isolated ileitis; Type 2: Colonic involvement, with or without small bowel involvement (Proximal ± Transverse Colon only); Type 3: Colonic involvement, with or without small bowel involvement (rectal only); Type 4: Colonic involvement, with or without small bowel involvement (rectal + distal colon only); and Type 5: Colonic involvement, with or without small bowel involvement (pancolitis).

Qualitative interview procedures

Telephone interviews were conducted by 4 interviewers who were trained in qualitative interview methods. A semi-structured interview guide consisting of concept elicitation and cognitive debriefing sections was used in conducting the interviews. Open-ended questions about participants’ experiences with CD allowed them to describe their overall symptom experience before focusing on BU. Participants experiencing BU described it in greater detail, including its impact on daily life and whether this symptom occurred during CD remission. The interviewer instructed the participants to complete the PRO measures of interest (below) and then asked questions about interpretation, clarity, relevance, and feasibility of the individual measures and items. Meaningful change (the minimum amount of improvement that would make a treatment worth taking) was discussed for some items. PROs of interest included 3 single items: the Patient Global Impression of Change (PGIC), Urgency NRS and Overall CD Symptom Patient Global Rating of Severity (PGRS) (Supplemental Fig. 1).

Fig. 1 Distribution of Bowel Urgency NRS Severity Level Categories by Individual Participant. Each participant was asked to provide a score range for the 3 levels of severity (i.e., mild, moderate, severe); therefore, each participant is represented by 3 bars. Green = “mild” ranges; Yellow = “moderate” ranges; Red = “severe” ranges. Four participants indicated that “Severe” was “a 10” rather than providing a range; these participants are represented by the red squares

Qualitative interview data analysis

Interview transcripts without personal health information were loaded into ATLAS.ti (version 8.0 or higher) for qualitative coding and analysis [22, 23]. A coding dictionary was developed based on the interview guide and the themes and concepts that emerged during the interviews. Coded transcripts were reviewed for quality control. The elicitation data were assessed to document saturation [23, 24] of CD symptom concepts. Interview responses were analyzed to compare and tally the number of novel symptoms that were observed per interview. For the cognitive portion, coding was used to examine the relevance, clarity, and appropriateness of the PRO measures of interest, and to assess participant discussion around meaningful change. Sociodemographic and clinical characteristics were assessed using descriptive statistics.

Part B. Quantitative web-based survey

The primary web survey recruitment source was the research recruitment vendor, although qualitative interview participants could participate. Participants self-reported a diagnosis of CD that was supported during screening by some form of evidence, such as a signed letter from the participant’s clinician or an official summary of a medical appointment or procedure listing the diagnosis. Participants were excluded if they self-reported having any comorbid health conditions which might have confounded survey responses. Participants reporting CD symptoms in fewer than 10 days of the past month (“asymptomatic”) were allowed to participate, up to 20% of the total survey sample.

Web survey procedures

Eligible participants received an email to create an account, provide consent, and start the survey via the vendor’s platform, Baseline Plus, managed by Cisiv. Participants completed survey questions daily for 14 days. Participants were required to respond to all questions at all time points; however, an “opt out” response option was provided for participants who did not want to answer a given question. Data were considered “missing” for items where participants used this “opt out” response. Reminders were sent to participants who missed a survey entry day. Participants completing at least the Day 1/Baseline survey entry were remunerated between $75.00 and $225.00 USD based on the number of entries completed.

Assessments

The survey administration schedule is shown in Supplemental Table 1. All measures utilizing a weekly mean score (i.e., mean of Days 1 through 7 and of Days 8 through 14) required ≥ 4 responses each week. Respondents were asked to answer the Urgency NRS during all 14 days of the survey, and two weekly mean scores were calculated. The PROs used to assist in validating the Urgency NRS were the Overall CD Symptom PGRS (administered daily), the PGIC (administered once), the Abdominal Pain NRS (administered daily), the Functional Assessment of Chronic Illness Therapy (FACIT)-Fatigue [25] (administered weekly), and the Patient Global Impression of Severity (PGIS)-Fatigue [26] (administered weekly). The Abdominal Pain NRS is an 11-point scale ranging from 0, indicating “no pain,” to 10, indicating “Pain as bad as you can imagine.” The Bowel Movement (BM) Count, which assessed the number of BMs the participant had over the previous 24 h, was also used. The Abdominal Pain NRS and the BM Count were completed daily and mean weekly scores were calculated.
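The weekly scoring rule described above can be illustrated with a minimal sketch. It assumes that “opt out” and skipped days are both represented as `None`; the function names and the sample diary are hypothetical and are not taken from the study’s analysis code.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple

MIN_RESPONSES_PER_WEEK = 4  # threshold stated in the text


def weekly_mean(daily_scores: Sequence[Optional[float]]) -> Optional[float]:
    """Mean of the answered daily scores in one 7-day window.

    Returns None when fewer than 4 days were answered, so the weekly
    score is treated as missing rather than computed from sparse data.
    """
    answered = [s for s in daily_scores if s is not None]
    if len(answered) < MIN_RESPONSES_PER_WEEK:
        return None
    return mean(answered)


def two_weekly_means(
    fourteen_days: Sequence[Optional[float]],
) -> Tuple[Optional[float], Optional[float]]:
    """Split a 14-day diary into Week 1 (Days 1-7) and Week 2 (Days 8-14)."""
    if len(fourteen_days) != 14:
        raise ValueError("expected exactly 14 daily entries")
    return weekly_mean(fourteen_days[:7]), weekly_mean(fourteen_days[7:])


# Hypothetical diary: 5 answered days in Week 1, only 2 in Week 2.
diary = [4, 5, None, 6, 5, None, 4, 3, None, None, None, 2, None, None]
week1, week2 = two_weekly_means(diary)
```

In this example Week 1 has enough responses to be scored, whereas Week 2 does not and is returned as missing.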

Table 1 Patient demographics and disease characteristics

Test–retest reliability of the Urgency NRS

The stability and reproducibility of the Urgency NRS weekly score was assessed within a stable population by comparing the Week 1 and Week 2 assessments. Statistical significance was based on paired t-tests. Stable subjects were defined as having a response of “no change” on the PGIC on Day 14 as the primary method, followed by all subjects who had a change less than |1| on their weekly scores between Weeks 1 and 2 in the PGRS, BM Count, and/or the Abdominal Pain NRS.

Intraclass correlation coefficients (ICCs) and effect size (ES) for the Urgency NRS mean scores were calculated for Week 1 and Week 2. ICCs range from 0 to 1.0, with higher scores indicating a more stable instrument, and values > 0.70 generally indicating strong test–retest reliability [27, 28]. A minimal mean difference of < 0.20 in ES (i.e., standardized mean difference) between Week 1 and Week 2, as well as lack of a significant difference (p > 0.05), were used to support stability of the Urgency NRS. Test–retest analyses were only conducted if there were ≥ 30 participants in the stable groups [29].
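As an illustration of these two statistics, the following sketch computes an ICC and a standardized mean difference for paired weekly scores. The text does not state which ICC form or which standardizer was used; the two-way random-effects, absolute-agreement, single-measure form (often written ICC(2,1)) and division by the Week 1 standard deviation are assumptions made here for the example only, as are the function names and sample data.

```python
from statistics import mean, stdev
from typing import Sequence


def icc_2_1(week1: Sequence[float], week2: Sequence[float]) -> float:
    """ICC(2,1) for two assessments per participant (assumed ICC form)."""
    n = len(week1)
    if n != len(week2) or n < 2:
        raise ValueError("need paired scores for at least 2 participants")
    k = 2  # number of assessments (Week 1, Week 2)
    rows = list(zip(week1, week2))
    grand = mean(list(week1) + list(week2))
    row_means = [mean(r) for r in rows]
    col_means = [mean(week1), mean(week2)]
    # Mean squares from the two-way ANOVA decomposition.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (x - row_means[i] - col_means[j] + grand) ** 2
        for i, r in enumerate(rows)
        for j, x in enumerate(r)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def standardized_mean_difference(
    week1: Sequence[float], week2: Sequence[float]
) -> float:
    """(mean Week 2 - mean Week 1) / SD of Week 1 (assumed standardizer)."""
    return (mean(week2) - mean(week1)) / stdev(week1)


# Hypothetical stable participants whose scores all shift by one point.
w1 = [1.0, 3.0, 5.0, 7.0]
w2 = [2.0, 4.0, 6.0, 8.0]
icc = icc_2_1(w1, w2)
es = standardized_mean_difference(w1, w2)
strong_reliability = icc > 0.70   # threshold stated in the text
minimal_difference = abs(es) < 0.20  # threshold stated in the text
```

The example shows why both statistics are reported: a uniform one-point shift leaves the rank ordering intact, so the ICC stays high, but the standardized mean difference exceeds the 0.20 threshold.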

Construct validity of the Urgency NRS

Construct validity [27] of the Urgency NRS was assessed by comparing it with PROs of interest. Correlations were classified as small (< 0.3), moderate (0.3 to 0.6), or large (> 0.6) [30]. A correlation coefficient greater than 0.3 indicated convergent validity [31]. A priori hypotheses specified expected moderate to large correlations between the Urgency NRS and the PGRS, BM Count, and the Abdominal Pain NRS, but small to moderate correlations with the fatigue-related measures (PGIS-Fatigue and FACIT-Fatigue).
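The classification thresholds above can be written as a small helper. This sketch assumes the thresholds apply to the absolute value of the coefficient, since some of the comparator measures may be scored in the opposite direction; the function names are hypothetical.

```python
def classify_correlation(r: float) -> str:
    """Label a correlation as small (< 0.3), moderate (0.3-0.6), or large (> 0.6)."""
    magnitude = abs(r)  # assumption: direction is ignored when classifying
    if magnitude < 0.3:
        return "small"
    if magnitude <= 0.6:
        return "moderate"
    return "large"


def supports_convergent_validity(r: float) -> bool:
    """Coefficient magnitude greater than 0.3, per the stated criterion."""
    return abs(r) > 0.3


# Hypothetical coefficients for illustration.
labels = [classify_correlation(r) for r in (0.71, 0.44, -0.53, 0.10)]
```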

Known-groups validity of the Urgency NRS

Known-groups validity is the extent to which scores from an instrument are distinguishable between groups of subjects that differ by a relevant clinical indicator [32]. To evaluate known-groups validity, the Week 1 score for Urgency NRS scale was analyzed by Week 1 Overall CD Symptom PGRS, Abdominal Pain NRS, and BM Count using the following groups:

  • Overall CD PGRS: 0– < 3, 3–5

  • Abdominal Pain NRS: Below and ≥ median score; 0–4 vs. 5–10

  • BM Count: Below and ≥ median score

The predefined groups were collapsed as necessary to ensure all comparison groups contained at least 10 participants. The analysis of variance (ANOVA) models included the Urgency NRS scores as the dependent variable and the known-group criterion variable as the independent variable to assess the significance of Week 1 mean differences for each group. Analyses were repeated with Week 2 scores.
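A minimal sketch of the known-groups comparison follows, using a median split on a criterion measure and a one-way ANOVA F statistic computed from first principles. The helper names and sample data are hypothetical, and the p-value is omitted because it requires the F distribution; a statistics library (for example, `scipy.stats.f_oneway`) would normally supply it.

```python
from statistics import mean, median
from typing import List, Sequence, Tuple


def median_split(
    criterion: Sequence[float], outcome: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Split outcome scores into 'below median' and '>= median' criterion groups."""
    cut = median(criterion)
    below = [y for x, y in zip(criterion, outcome) if x < cut]
    at_or_above = [y for x, y in zip(criterion, outcome) if x >= cut]
    return below, at_or_above


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> Tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    group_means = [mean(g) for g in groups]
    ss_between = sum(
        len(g) * (m - grand) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical data: weekly BM Count as criterion, Urgency NRS as outcome.
bm_count = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
urgency = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
low, high = median_split(bm_count, urgency)
f_stat, df_b, df_w = one_way_anova_f([low, high])
```

In the study, groups with fewer than 10 participants were to be collapsed before analysis; the tiny groups here are for illustration only.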

Results

A. Qualitative interviews

Patient demographics and clinical characteristics

Thirty-five participants with a mean age of 45.1 years (SD 15.83) were recruited from 6 clinical sites. Participants were mostly female (65.7%), non-Hispanic or Latino (97.1%), and White (80.0%). Thirteen participants (37.1%) had a college degree whereas 11 (31.4%) had only finished high school/secondary school (Table 1).

Overall, the mean time since CD diagnosis was 11.9 years (SD 13.7). The most commonly reported treatments and comorbid health conditions are summarized in Table 1.

The relevance of BU as a symptom of CD is underscored by the observation that, of the 35 participants interviewed, 34 (97%) participants reported, either spontaneously (n = 13) or through probing (n = 21), experiencing BU that they attributed to their CD. BU was reported as one of the most bothersome CD symptoms by 15 participants, followed by increased abdominal pain/cramping (n = 12), frequency of BMs (n = 6), diarrhea (n = 5), and fatigue (n = 4). The highest proportion of participants indicating that BU was their most bothersome symptom had Type 4 [colonic involvement, with or without small bowel involvement (rectal+distal colon only)] (Supplemental Fig. 2).

Fig. 2 Meaningful Improvement on the Urgency NRS (N = 30). Participants provided their current Urgency NRS score, and the point where they felt a decrease would be meaningful. The flat end of the arrow represents the participants’ current scores, and the tip of the arrow represents the point at which the change would become meaningful to them. The numbers in the figure indicate the number of participants with that response across the full sample

Experience and frequency of BU

Table 2 and Supplemental Table 2 provide examples of quotes participants used to describe BU. Of the 34 participants who experienced BU due to CD, 22 (65%) reported that they experienced BU every day or with nearly every BM. Of the participants reporting BU every day, there was a trend suggesting that the highest proportions were among Types 1 or 2, and the least among Type 3. Half of the participants reported that BU fluctuated over time while the others reported it as stable or consistent.

Table 2 Common terminology used for “Urgency”

Most participants (n = 28) noted that BU is worse depending on certain foods or drinks. Eight participants indicated BU is worse in the morning, and three noted their BU can be triggered by certain activities (e.g., walking, exercising).

Severity of BU

BU severity was described by some participants in terms of the time available to get to a toilet (e.g., having 2 to 3 min (n = 6) or ≤ 1 min (n = 4)), while others used descriptors such as “urgent” (n = 5), “severe” (n = 2), “desperate” (n = 1), or “not very urgent” with time to get to the toilet (n = 5). Still others rated the severity of pain (“six out of 10” (n = 1) or “painful” (n = 1)).

Association of BU and frequency of BMs

Thirty-three participants were asked whether their BU and BM frequency were related, and 15 (45%) reported that BU and BM frequency always or usually co-occur (i.e., they tend to have more BU on days when they also have more frequent BMs), whereas 18 (55%) noted they could experience one symptom without the other.

Impacts of BU

Interview participants reported major impacts on their daily activities due to BU and resulting incontinence (Supplemental Table 3). The most commonly reported impacts involved recreation or hobbies, needing to always be aware of toilets when in public and having to stay home more. Twelve participants indicated a mental and/or emotional impact. Nine participants described a general impact on daily activities, noting that their BU prevents them from doing things that “normal” people can do. Participants spoke about having less anxiety around finding restrooms and avoiding accidents if their BU improved. Participants noted that they would not be able to leave the home if their BU was more severe.

Table 3 Participant descriptions of each level of urgency on the NRS

Responses to the Urgency NRS

The mean score for the cohort was 4.4 (SD 3.03), with a median score of 4.0 (Supplemental Table 4). The mean scores for the individual CD subtypes ranged from 3.2 (SD 2.99; Type 1 CD) to 5.4 (SD 3.00; Type 2 CD). Each of the response options was selected at least once.

Table 4 Urgency NRS Test–retest reliability: week 1 and week 2

Cognitive debriefing of the Urgency NRS

Understanding and interpretation

All but one participant (3%) found the Urgency NRS item clear in meaning and easy to answer. The remaining participant reported having some trouble reading in general. Participants were able to differentiate the different severity levels of BU when asked to discuss why they selected a particular response on the Urgency NRS. Twenty-nine participants (83%) used the correct recall period (24 h) when responding, whereas 5 participants (14%) seemed to think only of the waking hours on the day of the interview; and 1 participant (3%) did not directly comment on their recall period.

Interpretation of mild, moderate, and severe urgency on the NRS

In general, participants associated Urgency NRS scores of > 0–3 with “Mild”, > 3–7 with “Moderate”, and > 7–10 with “Severe” urgency (Fig. 1). Individual definitions of “Moderate” often overlapped with those for “Mild” and “Severe” (Fig. 1). In general, BU while in CD remission corresponded to the “mild” range, although some participants (n = 7) suggested they could have an Urgency NRS score as high as a 4 while still in CD remission.
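The general score ranges that participants described can be expressed as a simple lookup. This is only a sketch of those qualitative ranges, not a validated responder threshold; the function name and the “none” label for a score of 0 (the scale’s “no urgency” anchor) are assumptions for the example.

```python
def urgency_severity_band(score: float) -> str:
    """Map an Urgency NRS score (0-10) to the band participants generally described."""
    if not 0 <= score <= 10:
        raise ValueError("Urgency NRS scores range from 0 to 10")
    if score == 0:
        return "none"      # assumed label for the 'no urgency' anchor
    if score <= 3:
        return "mild"      # > 0 to 3
    if score <= 7:
        return "moderate"  # > 3 to 7
    return "severe"        # > 7 to 10


bands = [urgency_severity_band(s) for s in (0, 3, 3.5, 7, 8)]
```

Because the function accepts non-integer values, it can also be applied to weekly mean scores, which fall between the integer response options.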

Participants described BU at each severity level, as illustrated by selective quotes (Table 3; Supplemental Table 5). In general, the “Mild” range of BU was described as not requiring them to get to the restroom as quickly. “Moderate” BU was described as starting to be disruptive, but still manageable, as they had to get to the restroom more quickly. “Severe” BU was equated with not being able to leave the home and having to stay very close to a toilet at all times, significantly impacting daily activities and HRQOL. Thirty-three participants were asked what their BU would be like if they were in CD remission; 15 participants (45% of those answering the question) said that they would have none, whereas 17 participants (52%) said they could still experience some (2–4 on the Urgency NRS) BU.

Table 5 Urgency NRS construct validity correlations with overall CD symptom PGRS, bowel frequency count, abdominal pain NRS, PGIS-Fatigue and FACIT-Fatigue

Interpretation of PGRS

Twenty-seven participants (77%) reported that BU was a factor in their response; 12 mentioned this spontaneously, and 15 confirmed when probed.

Interpretation of PGIC

When asked to describe their thoughts while deciding how to answer the Overall CD Symptom PGIC, 7 participants (20%) reported thinking about abdominal pain or cramping, 4 (11%) about the number or frequency of bowel movements, and 4 (11%) about BU. Ten participants (29%) spoke more generally about considering different medications they had been on and comparing how they felt at various points to the present time.

Meaningful change on the Urgency NRS

Using their current score as a starting point, 13 participants (starting Urgency NRS score range: 1–8) reported that a reduction of 1 point would be a meaningful improvement, and another 10 participants (starting Urgency NRS score range: 2–9) indicated that a reduction of 2 points would be meaningful (Supplemental Table 6; Fig. 2). All 7 participants who indicated that a minimum improvement of 3, 4, or 5 points would be needed had a starting score of ≥ 5 (Supplemental Fig. 2).

Table 6 Known-groups validity of the Urgency NRS

B. Quantitative survey

Participant demographics and clinical characteristics

Of the 76 participants who completed the web-survey, 16 (21.1%) had also completed the interview as part of the earlier study component, whereas 60 were newly recruited. The mean age of the sample was 41.9 years (SD 13.24). Participants were mostly female (n = 50; 65.8%), White (n = 63; 82.9%), non-Latino or Hispanic (n = 72; 94.7%) and had a college or postgraduate degree (n = 60; 78.9%) (Table 2). Disease characteristics are summarized in Table 2.

PRO descriptive characteristics for Urgency NRS

A total of 76 participants completed the Baseline/Day 1 survey entry, 64 participants (84.2%) completed Day 7, and 66 participants (86.8%) completed Day 14. There were sufficient responses to calculate mean scores for 76 participants (100%) for Week 1 and 74 participants (97.4%) for Week 2. One participant opted out of the BM Count item during each of their survey entry days, but there were no other missing responses across any of the other PRO instruments.

Mean scores for the Urgency NRS were 4.7 (SD 2.26) at Week 1 and 4.3 (SD 2.42) at Week 2 (Table 4). The medians (range) were 4.5 (0–10) and 4.0 (0–10) for Weeks 1 and 2, respectively. Floor and ceiling effects were minimal, with 1 (1.3%) participant having minimal or maximal values at Week 1, and 1 (1.4%) participant having minimal and 2 (2.7%) having maximal values at Week 2. The frequency distribution of NRS scores is shown in Supplemental Fig. 3.

Test–retest reliability

Thirty-seven participants responded “No Change” to the PGIC at Day 14. The ICC was 0.88, indicating strong test–retest reliability. However, the effect size of 0.18 was just below the maximum acceptable level for supporting stability of the score, and the Urgency NRS score difference between the assessments at Weeks 1 and 2 was statistically significant (p = 0.03) (Table 4). Test–retest analyses within the stable populations defined using the PGRS, BM Count, or Abdominal Pain NRS showed strong reliability based on the ICC (Table 4).

Construct validity for the Urgency NRS

At Week 1, the Urgency NRS score was highly correlated with the PGRS and the Abdominal Pain NRS (0.71 and 0.65, respectively), and it was moderately correlated (0.44 to 0.53) with the remaining PRO measures (Table 5). All correlations were statistically significant and within the a priori hypotheses. At Week 2, the Urgency NRS score was more highly correlated with the PGRS and the Abdominal Pain NRS (0.77 and 0.73, respectively). These correlations were higher than hypothesized a priori. All other correlations were moderate (|0.48| to |0.53|) and were again within the hypothesized ranges (Table 5).

Known-groups validity of the Urgency NRS

Each of the predefined severity groups contained at least 10 participants and no further group collapsing was necessary. As expected, more severe Urgency NRS scores were associated with more severe responses to the selected PROs (PGRS, BM Count, and Abdominal Pain NRS). All comparisons were statistically significant at both Weeks 1 and 2 (Table 6).

Discussion

BU is an important, relevant symptom of CD that varies in intensity and is highly valued in treatment goals to improve HRQOL [8, 13]. As such, daily assessment of BU severity should be considered as a clinical endpoint when designing clinical trials. The interviews documented the burdensome nature of BU for patients and its substantial impact on HRQOL. Here, the proportions of participants experiencing BU are higher than those in some other reports [7]. Clinical and physical assessments of IBD do not provide an accurate measure of the patient’s own perception of their disease and their HRQOL [33]. This dissociation between a patient’s perception of their HRQOL and clinical measurements of disease activity has been demonstrated with patients who have other diseases [33], underlining the importance of devising reliable, reproducible measurements that can help assess the severity of patients’ symptoms from their perspective.

The Urgency NRS has been validated in patients with UC [18, 34]. The results presented here support its use in CD. Findings indicate that the Urgency NRS is a simple, reliable, reproducible, valid, and interpretable PRO scale that can be used to assess one of the most troublesome CD symptoms. We found strong support for content validity for the Urgency NRS in both the qualitative interviews and the web survey, and a large majority of participants endorsed BU as a symptom of their CD in qualitative interviews. Participants could define different levels of BU severity, describe daily life impacts, and score them differently on the Urgency NRS. Patients largely agreed on the rating range for mild urgency (scores of 1–3 on the Urgency NRS), described as “normal urgency”, moderate urgency (scores of 4–7 on the Urgency NRS) and severe urgency (scores > 7 on the Urgency NRS). These thresholds could aid patients, healthcare providers, and other stakeholders in interpreting the Urgency NRS score, although further studies are needed to quantitatively assess within-patient responder thresholds. In addition, it will be important to further explore patient perceptions regarding symptom remission, given that our exploratory results indicate that some patients feel that CD remission is a total absence of symptoms, whereas others indicated that mild symptoms could be present in remission.

The web survey results also provide strong support of content validity for the Urgency NRS, with very minimal floor effects at both Week 1 and Week 2. The findings from the web survey also indicate that the Urgency NRS has good measurement properties. Test–retest reliability of the Urgency NRS was strong when using the PGRS and BM Count as anchors, and moderate when using the PGIC. Construct and known-groups validity against PGRS, BM Count, the Abdominal Pain NRS, PGIS-Fatigue, and FACIT-Fatigue scores were overall strong and within ranges hypothesized a priori. This further supports previous studies that found that the coexistence of BU, fatigue, and abdominal pain is especially burdensome for CD patients [35].

Limitations

Recruitment for the qualitative interviews was limited by available patient pools and recruitment timelines, and so the sample might not be representative of the greater CD population. This is a limitation commonly cited for qualitative research, especially in those with smaller sample sizes [36]. The purpose of the qualitative research was to assure content validity rather than to make generalizable conclusions.

In examining test–retest reliability, score ranges were used instead of the response categories to assess known groups for the overall symptom PGRS because it used a 24-h recall period and an average score was calculated over 7 days. The analysis plan used smaller ranges of numbers corresponding to the discrete verbal categorical responses (0– < 1 = “None,” 1– < 2 = “Very Mild,” etc.). However, final groupings were collapsed due to insufficient N in some of the categories (i.e., a priori, we had stated that group sizes < 10 would be collapsed).

The web survey was a pilot study to assess the psychometric properties of the Urgency NRS, and the findings should be confirmed with clinical trial data. The majority of web survey participants were recruited via a research recruitment vendor and thus may not be representative of other CD patients in the US. Although CD diagnosis was confirmed for each of these participants, no clinical data were available to clinically define or confirm disease severity. Many participants were unaware of their CD subtype, and we were unable to verify responses for those who did report a subtype. Another limitation was that a healthy control group was not included.

Conclusions

BU is an important symptom of CD. The evidence provided herein demonstrates that the Urgency NRS has content validity, test–retest reliability, and construct validity. The Urgency NRS is a well-defined and reliable PRO instrument that is suitable to be used in clinical studies to evaluate a treatment benefit in patients with CD.