Introduction

Diarrhea-predominant irritable bowel syndrome (IBS-D) is a common and burdensome condition, especially in individuals with moderate to severe IBS-D, who suffer significantly impaired quality of life and high healthcare costs [1–3]. As no biomarkers or clinical measures of disease activity are currently available in IBS-D, diagnosis and treatment rely on direct patient report of signs and symptoms. IBS-D diagnostic criteria rely exclusively on the evaluation of symptoms, and the recently published US Food and Drug Administration (FDA) IBS guidance [4] on the design of IBS interventional trials recommends that primary endpoints in IBS-D trials consist of patient-reported symptom assessments [5, 6].

Historically, clinical trial primary endpoints in IBS have relied on single-item assessments that ask patients to judge whether they have experienced “adequate symptom relief” or “satisfactory relief” over the entire trial [4]. The limitations of such single-item assessments in covering the breadth of IBS-D symptomatology have been well documented [4, 7, 8]. None of these measures meet the FDA patient-reported outcome (PRO) guidance [9] in terms of content validity, nor do they adhere to the agency’s roadmap for clinical outcome assessments [10, 11]. In addition, the FDA no longer considers a global measure of change to be adequate as a primary endpoint [4, 9]. Consequently, the FDA’s IBS guidance highlights the need to develop multi-item, patient-reported measures in line with the agency’s PRO guidance. While there are established measures such as the IBS-Severity Scoring System (IBS-SSS) [12] and the IBS-Quality of Life Questionnaire (IBS-QOL), these historical measures were not intended to investigate benefits of treatment in the clinical setting, nor do they meet the needs of the IBS-D population (i.e., context of use) [13]. Moreover, they do not meet the standard of rigor set forth in the FDA PRO guidance [9]. New measures should be developed based on qualitative research with patients and must be designed to capture the cardinal symptoms of IBS-D, including abdominal pain, bowel function, and bloating [7, 8, 14].

To meet this need, the new IBS-D Daily Symptom Diary and Symptom Event Log (hereafter, “Event Log”) was developed via qualitative research among IBS-D patients, in accordance with the FDA PRO guidance [9]. A full account of the qualitative development of this instrument is provided elsewhere [15]. This prior research provides evidence that the instrument’s items demonstrate content validity and assess the full measurement continuum. The hypothesized conceptual framework based on the qualitative research is provided in Fig. 1.

Fig. 1 Hypothesized conceptual framework

Once qualitative research has established that a new PRO measures all concepts that are clinically relevant and important to patients, in a manner that patients understand and can respond to, the next step is to evaluate the initial measurement properties of the instrument. This evaluation can then be used to develop a scoring algorithm and to guide potential item deletion. This article presents initial results from a US-based, multicenter, non-interventional observational study of the cross-sectional psychometric properties of the IBS-D Daily Symptom Diary and Event Log.

Methods

Patients

Patients with clinician-verified diagnoses of mild, moderate, or severe IBS-D (per Rome III criteria) were recruited into a US-based, multicenter, and non-interventional observational study. The targeted distribution of the sample population was ~50 patients with mild, 100 with moderate, and 50 with severe disease (based on clinician reports). Patients were recruited from general practice and gastroenterology clinics between June 2012 and November 2012 and were eligible for inclusion if they met the inclusion and exclusion criteria outlined in Table 1.

Table 1 Inclusion and exclusion criteria

Written informed consent was obtained from all patients prior to inclusion in the study. All study forms were approved by a centralized, independent ethics committee, in accordance with the revised Declaration of Helsinki [16]. Patients were free to discontinue participation in the study at any time.

Demographics and clinician IBS-D severity rating

Clinicians completed a case report form during an initial screening visit, confirming that the patient had a diagnosis of IBS-D (per Rome III criteria) and met all inclusion and exclusion criteria. Clinicians also assessed the patient’s IBS-D symptom severity using a 5-point graded scale of “very mild,” “mild,” “moderate,” “severe,” or “very severe.” Once patients were screened into the study, they were asked to complete a baseline demographic and health information form.

Patient-reported outcomes

To evaluate the newly developed symptom diary, several well-established PRO instruments were included in the study, covering generic and disease-specific measures of symptoms and impact on health-related quality of life (HRQoL). Patients completed seven PRO instruments: the newly developed IBS-D Daily Symptom Diary and Event Log (described below) [15]; the IBS-SSS [12]; the 12-item Short-Form Health Survey (SF-12) [17]; the IBS-QOL instrument [13]; the Patient Global Impression of Severity (PGI-S); and 24-h recall and 7-day recall versions of the Patient Global Impression of Change (PGI-C) [18]. The IBS-SSS, the IBS-QOL, the SF-12, the PGI-S, and the PGI-C are described in Table 2.

Table 2 Other instruments

IBS-D Daily Symptom Diary

The IBS-D Daily Symptom Diary is a 7-item PRO diary measuring abdominal pain, stomach pain, abdominal pressure, bloating, abdominal cramping, frequency of gas, and the occurrence of accidents. Five of the symptoms are rated on an 11-point numerical rating scale with 0 representing absence of symptoms and 10 representing severe symptoms. Frequency of gas is measured on an ordinal scale from “none of the time” to “all of the time,” and the occurrence of accidents has a “Yes/No” response option. The recall period for all items is the past 24 h. Daily scores and weekly scores (i.e., means of the daily scores) were used for each item in the analyses.

IBS-D Symptom Event Log

The IBS-D Symptom Event Log consists of three questions that ask the patient to rate, with respect to each individual bowel movement, the immediacy of need (1 = “no immediate need” to 5 = “extreme immediate need”), consistency of the bowel movement (pictorial 8-point scale: 1 = “like marbles or hard rocks” to 8 = “just liquid”), and whether the bowels were completely emptied (“Yes/No”). All items are completed after each bowel movement, and the date and time of each bowel movement are recorded. Daily scores and weekly scores (i.e., means of the daily scores) were used for each item in the analyses.

Administration of PRO instruments

All data were collected on a paper case report form. Center personnel were trained to instruct patients in a standardized way to reduce data collection errors and enhance questionnaire completion compliance. Clinician severity ratings and patient demographic and health information were collected at baseline (day 1). The PRO measures administered in the study were assessed across two intervals: period 1 (study weeks 1 and 2; days 1–14) and period 2 (study weeks 3 and 4; days 15–28). The schedule of assessment is summarized in Table 3.

Table 3 Schedule of assessments

All data were manually entered into a password-protected database; standards of quality control, including proportional double data entry, were observed. Date and time of completion were captured for the daily diary. All eligible patients who completed at least one item of the IBS-D PRO instruments at day 1, period 1 were included in the analysis population.

Statistical methods

Psychometric analyses were performed to evaluate the item-level measurement properties of the instrument. Demographic and health information of the study population was summarized using descriptive statistics. Continuous variables were described by presenting the frequency, mean, standard deviation, median, minimum, maximum, and instances of missing data. Categorical variables were described by presenting the number and percentage of patients in each category and the number of missing data (the percentage in each category was calculated including the proportion of patients with missing values). Quality of completion was assessed for the IBS-D Daily Symptom Diary and Event Log at day 1 through day 15 as the number and percentage of patients missing responses. Items with missing data >10 % were flagged and considered candidates for deletion. If >10 % of patients endorsed the lowest or highest categories on a given item on the IBS-D Daily Symptom Diary and Event Log, the item was investigated for floor or ceiling effects, respectively. Floor or ceiling effects that are too pronounced could interfere with the ability of the score to detect improvement, deterioration, or difference between groups in a clinical trial; however, floor and ceiling effects must be interpreted in the context of the study sample and the condition being studied.
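The completion and floor/ceiling checks described above can be illustrated with a minimal sketch. The function name `flag_item`, the use of `None` for a missing response, and the choice to compute floor and ceiling percentages among respondents only are assumptions made for illustration; the study's analyses were performed in SAS and the exact denominators are not stated here.

```python
from typing import Dict, Optional, Sequence


def flag_item(responses: Sequence[Optional[int]], lowest: int, highest: int,
              threshold: float = 0.10) -> Dict[str, object]:
    """Apply the >10 % screening rules to one item's responses for one day.

    None marks a missing response (an illustrative convention).
    """
    n_total = len(responses)
    observed = [r for r in responses if r is not None]
    if n_total == 0 or not observed:
        raise ValueError("at least one observed response is required")

    pct_missing = (n_total - len(observed)) / n_total
    # Assumption: floor/ceiling percentages use respondents as the denominator.
    pct_floor = sum(r == lowest for r in observed) / len(observed)
    pct_ceiling = sum(r == highest for r in observed) / len(observed)

    return {
        "pct_missing": pct_missing,
        "missing_flag": pct_missing > threshold,   # candidate for deletion
        "floor_flag": pct_floor > threshold,       # investigate floor effect
        "ceiling_flag": pct_ceiling > threshold,   # investigate ceiling effect
    }


# Made-up responses on a 0-10 numerical rating scale, one value per patient.
example = flag_item([0, 0, 5, 10, None, 3, 4, 6, 7, 8], lowest=0, highest=10)
```

A flag raised by this kind of screen is only a prompt for review; as noted above, floor and ceiling effects still have to be interpreted in the context of the sample and the condition.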

Inter-item correlations were evaluated for the IBS-D Daily Symptom Diary and Event Log items. Correlations >0.80 suggested potential redundancy and thus potential candidates for deletion [19].
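As a rough illustration of this redundancy screen, the sketch below computes Spearman correlations (Pearson correlation of average ranks, which handles ties) for every pair of items and lists the pairs above 0.80. The helper names and the toy scores are hypothetical and are not study data.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def redundant_pairs(items, cutoff=0.80):
    """Return (item_a, item_b, rho) for every item pair with rho > cutoff."""
    flagged = []
    for a, b in combinations(items, 2):
        rho = spearman(items[a], items[b])
        if rho > cutoff:
            flagged.append((a, b, rho))
    return flagged


# Toy day-1 scores for six hypothetical patients.
toy_items = {
    "abdominal_pain": [1, 2, 3, 4, 5, 6],
    "stomach_pain": [1, 2, 3, 5, 4, 6],
    "frequency_of_gas": [3, 1, 6, 2, 5, 4],
}
flagged = redundant_pairs(toy_items)
```

With these toy scores only the abdominal pain and stomach pain pair exceeds the cutoff; whether a flagged item is actually deleted remains a judgment that also draws on the qualitative evidence.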

The emphasis in a psychometric study is on evaluation of the magnitude of relationships between variables and the overall pattern of results rather than significance testing. As such, no adjustments are generally used for multiplicity of tests. For many psychometric analyses, significance tests are not traditionally used. Where specific significance tests are used, the threshold for statistical significance was p < 0.05. Statistical analyses were performed for the study using Statistical Analysis System version 9 (SAS Institute, Cary, NC, USA).

Psychometric evaluation of the instrument: reliability

Test–retest reliability measures the stability of a score over multiple administrations of an instrument to the same patient [20]. The time period for assessment is critical in chronic, symptomatic, or event-driven conditions because response variability may be high due to the nature of the disease. In this study, test–retest reliability was assessed by comparing 7-day average scores for the IBS-D Daily Symptom Diary and Event Log items between study weeks 1 and 4. The subgroup of stable patients for this analysis was determined using the PGI-C Week assessment at study week 4. Patients who responded “no change” on this measure were included in the test–retest analysis population. The intraclass correlation coefficient (ICC) was used to evaluate test–retest reliability, with ICCs >0.70 considered evidence of acceptable reliability [20].
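For readers unfamiliar with the ICC, the following sketch shows one common form, ICC(2,1) (two-way random effects, absolute agreement, single measurement), applied to paired weekly averages. The ICC form used in the study is not specified in this section, so this choice, the function name, and the example values are assumptions for illustration only.

```python
def icc_2_1(week1, week4):
    """ICC(2,1) for paired scores from the same patients at two time points.

    Assumes at least two patients and some between-patient variation.
    """
    n = len(week1)  # number of patients
    k = 2           # number of administrations
    if n != len(week4) or n < 2:
        raise ValueError("need paired scores for at least two patients")

    grand = (sum(week1) + sum(week4)) / (n * k)
    row_means = [(a + b) / k for a, b in zip(week1, week4)]
    col_means = [sum(week1) / n, sum(week4) / n]

    # Two-way ANOVA sums of squares: patients (rows), weeks (columns), error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for v in list(week1) + list(week4))
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Made-up weekly average scores for five stable patients.
icc = icc_2_1([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
acceptable = icc > 0.70
```

Other ICC forms (for example, consistency rather than absolute agreement) can give different values for the same data, so the form should be reported alongside the coefficient.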

Psychometric evaluation of the instrument: construct validity

The construct validity of the IBS-D Daily Symptom Diary and Event Log was examined via assessment of concurrent and clinical/known-groups validity.

Concurrent validity was assessed via evaluating correlations of the IBS-D Daily Symptom Diary and Event Log weekly average item scores with the IBS-SSS (no specified recall period), IBS-QOL (4-week recall period), and SF-12 (1-week recall period) at day 8. Spearman correlation coefficients were calculated and described as strong (0.60), moderate (0.40), or low (0.30) [21]. Low to moderate correlations were expected between IBS-D Daily Symptom Diary items and SF-12 scores, and moderate to high correlations were expected between IBS-D Daily Symptom Diary items and IBS-QOL and IBS-SSS scores.
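The descriptive labels can be applied mechanically, as in the sketch below. It assumes that the three values are lower bounds on the absolute value of the coefficient and that anything below 0.30 is labeled "negligible"; both are interpretive assumptions, as is the function name.

```python
def describe_correlation(rho: float) -> str:
    """Label a correlation by magnitude, ignoring its sign.

    Assumption: 0.60, 0.40, and 0.30 are treated as lower bounds.
    """
    size = abs(rho)
    if size >= 0.60:
        return "strong"
    if size >= 0.40:
        return "moderate"
    if size >= 0.30:
        return "low"
    return "negligible"  # assumed label for values below 0.30


labels = [describe_correlation(r) for r in (0.57, -0.48, 0.32, 0.13)]
```

Taking the absolute value matters because higher scores mean worse symptoms on the diary but better status on some comparison measures, so negative coefficients are expected.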

Clinical (or known-groups) validity is a measure of the ability of items to discriminate between patient subgroups expected to respond differently based on the severity of their condition [22]. Clinician-reported IBS-D severity at baseline was the primary classification variable used for the assessment of clinical validity [22]. Secondary measures used to define comparison groups for clinical validity assessment in the present study included the PGI-S at day 8 and the derived presence or absence of flare on day 1. Flare on a given day was defined as three or more bowel movements recorded in the IBS-D Symptom Event Log with a rating of 7 or 8 on the pictorial scale and an immediate need rating of moderate or greater. A “non-flare day” was defined as fewer than three bowel movements recorded in the IBS-D Symptom Event Log with a rating <7 on the pictorial scale and an immediate need rating of less than moderate. Analysis of variance and t tests were used to compare differences among groups, with differences considered significant at the p < 0.05 level.
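The flare definition can be expressed as a small rule, sketched below under two assumptions: the consistency and immediacy criteria are applied to the same bowel movement, and a rating of 3 corresponds to "moderate immediate need". The class and function names are hypothetical, and the sketch covers only the flare criterion, not the separately worded non-flare definition.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class BowelEvent:
    consistency: int  # pictorial scale, 1 (hard) to 8 ("just liquid")
    immediacy: int    # 1 ("no immediate need") to 5 ("extreme immediate need")


MODERATE_IMMEDIACY = 3  # assumed scale value for "moderate immediate need"


def is_flare_day(events: Iterable[BowelEvent]) -> bool:
    """True if three or more events are rated 7-8 with at least moderate need."""
    qualifying = sum(
        1 for e in events
        if e.consistency >= 7 and e.immediacy >= MODERATE_IMMEDIACY
    )
    return qualifying >= 3


# Made-up Event Log entries for a single day.
day = [BowelEvent(7, 3), BowelEvent(8, 4), BowelEvent(7, 5), BowelEvent(4, 2)]
flare = is_flare_day(day)
```

Days that do not meet this criterion are not automatically "non-flare days" under the wording above, so a full implementation would need the study's exact handling of intermediate days.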

Results

Study population

A total of 202 patients (132 females; 65.3 %) were enrolled in the study, with 200 patients completing the study. Based on patient self-report, the majority of patients had moderate IBS-D (n = 106, 52.5 %) and the remaining patients were mostly distributed between mild (n = 46, 22.8 %) and severe (n = 44, 21.8 %) symptoms, with few reporting very mild (n = 5, 2.5 %) or very severe (n = 1, 0.5 %) symptoms. The mean age of the patient population was 46.3 years ± 14.4 (range 18–79 years), and the majority of patients were Caucasian (n = 122, 60.4 %) and had a high school diploma or some college or other educational certification (n = 126, 62.4 %). Additional demographic information is provided in Table 4.

Table 4 Demographic characteristics at baseline (day 1)

IBS-D Daily Symptom Diary and Event Log measurement properties

Descriptive statistics

Overall, patients were compliant in their completion of the IBS-D Daily Symptom Diary and Event Log: only 12 patients (5.9 %) had at least one missing item on any of the 15 study days on which quality of completion was assessed, and no patient missed more than 2 items on any particular day. No patient missed items every day, nor was there a pattern of a single item being missed. In addition, item-level missing data were very low, with no items having more than ~4.0 % missing data. Thus, quality of completion indicated there were no patient- or item-specific issues causing missing data and results did not suggest any specific item as a candidate for deletion. Patients utilized the full response scale on all items, and average symptom severity on the IBS-D Daily Symptom Diary items ranged between 3.6 and 4.1 across the items at the baseline (day 1) assessment, with 26.3 % of patients reporting gas most or all of the time and only 7.4 % experiencing an accident on day 1. Patients reported an average of three events (bowel movements) at baseline (day 1) on the IBS-D Symptom Event Log, with complete emptying ~54 % of the time, a mean stool consistency of 5 (“soft chunks or clumps”), and a mean immediacy rating of 3 (“moderate immediate need”). One patient, recruited with severe disease, reported 42 episodes in a day, which was confirmed upon qualitative review of the source data.

Review of the floor and ceiling effects revealed that 5–17 % of the patient population chose the lowest possible response for a particular item on day 1. Floor effects greater than the 10 % a priori criterion were present for all of the daily symptom diary items except frequency of gas; however, given that the highest percentage scoring at floor for any one item was 17 %, the floor effects were considered marginal. No ceiling effects (>10 % scoring at ceiling) were observed for any of the Daily Symptom Diary items.

Inter-item correlations within the IBS-D Diary

Inter-item correlations were examined using data from day 1 (Table 5). With the exception of correlations with the frequency of gas and accidents items, moderate to high inter-item correlations were observed among all items of the IBS-D Daily Symptom Diary and Event Log (day 1 range 0.67–0.90). The inter-item correlations were highest between the two items measuring severity of abdominal pain and stomach pain (r = 0.90 for the daily report at day 1). These two items appear to be redundant, suggesting one can be deleted. The correlations of these two items with the items asking about abdominal cramps and abdominal pressure were also close to or above 0.80, suggesting all of these abdominal symptoms are closely related. Of note, point-biserial correlation coefficients were generated between IBS-D Daily Symptom Diary item 7 (accidents) and the other IBS-D Daily Symptom Diary items at day 8. These correlations were low (range 0.19–0.28). This finding is likely due to the very low frequency of report for accidents on a given day. As larger clinical datasets become available, it will be of value to further investigate the relationship between this item and others by using a known-groups approach and by evaluating the variable over a period longer than 1 day.

Table 5 Inter-item correlations–Spearman correlations of IBS-D daily symptom diary items at day 1

Test–retest reliability

Test–retest reliability was evaluated by comparing 7-day average scores of individual items on the IBS-D Daily Symptom Diary and Event Log between weeks 1 and 4, among 115 stable patients who reported “no change” from baseline in their symptoms on the PGI-C Week at study week 4. All abdominal symptom items except frequency of gas met the threshold for test–retest reliability (ICC ≥ 0.70), with ICC scores from 0.78 to 0.83. The ICC for frequency of gas was 0.66, marginally below the threshold. Item 7 (accidents) used a Yes/No dichotomous scale, and therefore, Cohen’s kappa statistics were generated for a single-day score rather than weekly scores. Reliability results for item 7 were well below the threshold; however, these results are reported only between day 22 and day 28. Accidents on a given day are reported with very low frequency and daily symptoms are highly variable, both of which will weaken the ICC.
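Because the accidents item is dichotomous, agreement between two days is summarized with Cohen's kappa rather than an ICC. The sketch below shows the standard calculation for paired Yes/No responses coded 1/0; the function name and the example responses are made up for illustration.

```python
def cohens_kappa(day_a, day_b):
    """Cohen's kappa for paired binary (1 = Yes, 0 = No) responses."""
    n = len(day_a)
    if n == 0 or n != len(day_b):
        raise ValueError("need the same non-zero number of responses per day")

    # Observed agreement: proportion of patients answering the same both days.
    p_observed = sum(a == b for a, b in zip(day_a, day_b)) / n

    # Agreement expected by chance from each day's marginal "Yes" rate.
    yes_a = sum(day_a) / n
    yes_b = sum(day_b) / n
    p_expected = yes_a * yes_b + (1 - yes_a) * (1 - yes_b)

    if p_expected == 1:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (p_observed - p_expected) / (1 - p_expected)


# Made-up accident reports for ten patients on two single days.
kappa = cohens_kappa([1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
                     [1, 0, 0, 0, 0, 0, 0, 0, 1, 1])
```

In this toy example raw agreement is 80 %, yet kappa is only about 0.52, which illustrates how a rarely endorsed item can show modest chance-corrected agreement even when most patients answer "No" on both days.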

For the IBS-D Symptom Event Log, the mean number of events also met or surpassed the threshold of 0.70; however, mean immediacy (ICC = 0.64) and stool consistency (ICC = 0.66) narrowly missed the 0.70 threshold. The ICC score for the incomplete evacuation question fell short of the threshold (ICC = 0.46) (Table 6).

Table 6 Test–retest reliability—study week 1 and study week 4 (stable group)

Construct validity: correlations between symptoms and events (concurrent validity)

Moderate correlations were observed between immediacy of need and all abdominal items (range 0.50–0.56) except frequency of gas (0.32) (Table 7). The number of daily events had small to moderate correlations with abdominal pain (0.40) and abdominal cramps (0.41) but smaller correlations with stomach pain (0.34) and bloating (0.29). Mean stool consistency had small to moderate correlations with all abdominal items except frequency of gas, where the correlation was negligible (0.13). However, all abdominal symptoms demonstrated very low correlations with daily percentage of completely emptied bowels. Frequency of gas yielded a low correlation with all event log items (Table 7). Accidents at day 8 (results not shown) also yielded low correlations with the event log items (range −0.08 to −0.27).

Table 7 Spearman correlations between the IBS-D daily symptom diary and event log at week 1

Construct validity: correlations with generic and disease-specific measures (concurrent validity)

A logical pattern of correlations was also observed between IBS-D Daily Symptom Diary items and the concurrent scores at day 8 (Table 8). As hypothesized, correlations among the domains of the SF-12 and the IBS-D Daily Symptom Diary items were extremely low for all SF-12 domains except “bodily pain,” which was moderately correlated with the IBS-D Daily Symptom Diary items related to abdominal pain, stomach pain, and abdominal cramps (range −0.42 to −0.48). Correlations were not calculated for the accidents item.

Table 8 Spearman correlations of IBS-D daily symptom diary at day 8 (a) IBS-QOL and (b) SF-12

As expected for the disease-specific measures, moderate correlations were observed between IBS-QOL domains and the individual symptom items of the IBS-D Daily Symptom Diary, except the IBS-QOL sexual domain, for which correlations were low (range −0.27 to −0.33). Also as expected, the IBS-D Daily Symptom Diary items all correlated most highly with the only concurrent symptom measure, the IBS-SSS, with correlations ranging from 0.53 to 0.57 for all items except frequency of gas, which correlated with the IBS-SSS at 0.32 and had low correlations with all concurrent domains. Overall, a logical pattern of correlations supported the validity of the Daily Symptom Diary items as measures of IBS symptoms.

Construct validity: clinical (known-groups) validity

Statistically significant differences in each of the IBS-D Daily Symptom Diary item scores were observed between groups defined by patients’ ratings on the PGI-S as none/mild, moderate, or severe/very severe. All IBS-D Daily Symptom Diary item scores increased monotonically across the PGI-S-defined groups, indicating that patients reporting worse global severity ratings also had worse symptom scores on the diary (Fig. 2). With regard to the accidents item (results not shown), a greater number and percentage of patients in the severe/very severe group (n = 26, 60.47 %) reported having accidents compared with the moderate group (n = 20, 21.05 %), with the fewest patients reporting accidents in the none/mild group (n = 6, 10.71 %). Despite the difference in the day of data collection (day 8 for the PGI-S vs. the 7 days prior for the accident item), these results are quite strong, indicating item 7 should be evaluated carefully in future analyses with an eye toward how it might best be incorporated into scoring with the ordinal rating scale items.

Fig. 2 IBS-D Daily Symptom Diary by PGI-S response at day 8. IBS-D Daily Symptom Diary Items 1–5 are scored on a numeric rating scale from 0 to 10, with higher scores representing more severe symptoms

Patients experiencing a flare day reported significantly higher symptom severity on each of the IBS-D Daily Symptom Diary items except for frequency of gas (Fig. 3).

Fig. 3 IBS-D Daily Symptom Diary by flare experience at day 1

Discussion

The goal of this study was to psychometrically assess the initial, item-level measurement properties of the IBS-D Daily Symptom Diary and Event Log items.

The descriptive statistics showed minimal missing data and good response distributions, suggesting the response scales were fully utilized. Floor effects were slightly higher than predicted for all items except one, but still considered acceptable given that they were evaluated using data from a single day; IBS-D symptoms are highly variable and high symptom severity is not expected as a daily occurrence. The proportion of subjects scoring at ceiling was minimal.

Test–retest reliability of the IBS-D Daily Symptom Diary and Event Log met the threshold for all items with the exception of the gas (ICC = 0.66), mean immediacy (ICC = 0.64), stool consistency (ICC = 0.66), and incomplete evacuation (ICC = 0.46) items. As the ICC scores for gas, immediacy, and stool consistency only narrowly failed to meet the 0.70 threshold, and these are highly variable symptoms, these results are not considered of concern. The low reliability estimates for incomplete evacuation and accidents are of greater concern and should be reanalyzed as additional data become available. In particular, given the very low frequency of report of accidents on a given day, test–retest reliability for accidents should be evaluated in a larger clinical sample by looking at periods of time that will be used as endpoints rather than looking at single-day occurrences.

Inter-item correlations were highest between the two items measuring severity of abdominal pain and stomach pain, suggesting that the items are measuring very similar concepts and are possibly redundant. This finding provides evidence that patients may think that abdominal pain and stomach pain are the same concept, which is consistent with qualitative data in which patients were thinking about the same part of their body when responding to these items [15]. With regard to the qualitative data, six of the 11 subjects in the cognitive debriefing interviews stated that the stomach pain and abdominal pain items were the same, while only three reported a difference [15]. Further, when patients indicated location on a diagram, there was no consistent indication that patients made a distinction between abdomen and stomach. Finally, given that cramping and pressure items use the term “abdominal” and considering that both pain items were well understood during qualitative testing (i.e., cognitive debriefing), the “abdominal pain” item will be retained and the “stomach pain” item removed [15].

The abdominal pain and stomach pain items were also relatively closely related to the abdominal cramps and abdominal pressure items. It could be argued that retention of just the abdominal pain item is sufficient and that the other three do not add great additional value. However, the abdominal cramps item seemed to discriminate best between flare and non-flare days in the known-groups analyses, and so arguably it provides valuable additional information. Therefore, all three items (abdominal pain, cramps, and pressure) will be retained for further testing.

Evaluation of concurrent validity demonstrated a logical pattern of correlations with concurrent measures, which supports the validity of the IBS-D Daily Symptom Diary and Event Log items as measures of IBS-D symptoms. In accordance with predictions, the IBS-D Daily Symptom Diary showed mostly low correlations with the HRQoL questionnaires and higher, moderate correlations with the IBS-SSS. Furthermore, the instrument correlated more highly with the disease-specific IBS-QOL than with the generic SF-12. The lack of overlap in recall periods between the IBS-D Daily Symptom Diary and Event Log and the concurrent measures is likely to have reduced the magnitude of some correlations. For example, the IBS-QOL has a 4-week recall period, and thus, it is perhaps not surprising that its correlation with a weekly average of IBS-D Daily Symptom Diary item scores was low, in addition to the content differences (i.e., a symptom-based measure vs. a quality of life-based measure). The similar content of the IBS-D Daily Symptom Diary and the IBS-SSS likely explains their moderate correlations. Overall, these findings support the concurrent validity of the instrument.

Statistically significant differences in each of the IBS-D Daily Symptom Diary item scores were observed between groups defined by patients’ ratings on the PGI-S, indicating that patients with more severe IBS-D respond toward the more severe end of the response scales. These results provide strong evidence that the different items of the IBS-D Daily Symptom Diary and Event Log are able to discriminate among patients of differing severity.

The findings reported here, as well as the previously conducted qualitative research, provide strong evidence in support of the initial psychometric validity of the IBS-D Daily Symptom Diary 24-h recall questions and the IBS-D Symptom Event Log questions.

The next step is the creation of summary scores assessing abdominal and bowel symptoms; this work is ongoing.