The relationship among multiple patient-reported outcomes measures for patients with ulcerative colitis receiving treatment with MMX® formulated delayed-release mesalamine

Purpose Ulcerative colitis (UC) is associated with impaired health-related quality of life (HRQL) and work-related outcomes (WRO). This analysis examined correspondences among measures of HRQL and WRO in patients with UC, as well as the magnitude of each measure's responsiveness to disease activity and treatment.
Methods An open-label, prospective trial of delayed-release mesalamine tablets formulated with MMX® technology included 8 weeks of treatment for patients with active mild-to-moderate UC (n = 137) and 12 months of maintenance treatment for patients with quiescent UC (n = 206). Spearman correlations (ρ) measured inter-domain associations across measures of generic HRQL [12-item Short-Form Health Survey (SF-12v2)], disease-specific HRQL [Short Inflammatory Bowel Disease Questionnaire (SIBDQ)], and disease-specific WRO [Work Productivity and Activity Impairment for Specific Health Problems (WPAI:SHP)]. Responsiveness to disease activity and treatment was assessed for each instrument.
Results Changes in scores from baseline to week 8 were moderately correlated across all instrument domains: 65 of 80 (81 %) between-instrument inter-domain correlations were of moderate magnitude (0.30 < ρ < 0.70), with an average magnitude of 0.42 [95 % confidence interval (CI) 0.38–0.46]. Associations with symptom measures were stronger for SIBDQ (|average ρ| = 0.41; 95 % CI 0.34–0.48) and WPAI:SHP (0.40; 0.30–0.47) than SF-12v2 (0.30; 0.27–0.34). SIBDQ was most sensitive to treatment [effect size (d_z) for change from baseline to week 8 = 0.62; 95 % CI 0.35–0.89], followed by WPAI:SHP (d_z = 0.43; 0.32–0.54) and SF-12v2 (d_z = 0.33; 0.27–0.39).
Conclusion While the SIBDQ showed the greatest overall responsiveness to disease activity and treatment, all three patient-reported outcomes instruments provided complementary interpretive information regarding the impact of UC treatment.

Previous research on patients with UC shows improvements in HRQL [17-22] and WRO [18,23] following treatment when accompanied by decreases in disease activity. For example, both Irvine et al. [20] and Reinisch et al. [23] reported that patients with UC who demonstrated clinical response following treatment had significantly better scores on generic and disease-specific measures of HRQL [the 36-item Short-Form health outcomes survey (SF-36) and the Inflammatory Bowel Disease Questionnaire (IBDQ), respectively] than non-responders. Furthermore, Reinisch et al. [23] found that clinical remission predicted significantly greater improvements in work attendance and work productivity, and a decreased likelihood of receiving disability benefits.
Cross-sectional studies of patients with UC have typically found concordance between generic and disease-specific HRQL [2,23-25]. A cross-sectional study by Bernklev et al. [26] that examined the simultaneous relations among generic and disease-specific HRQL and WRO found that both IBDQ and SF-36 scores predicted absenteeism and work disability payments. Cross-sectional studies by Cohen et al. [10] and Gibson et al. [11] found that HRQL (SIBDQ, SF-36) and WRO [Work Productivity and Activity Impairment survey (WPAI)] were associated with disease severity and fatigue, respectively, in patients with UC. Given that few studies have captured the simultaneous impact of treatment on disease-specific HRQL, generic HRQL, and WRO for patients with UC, the degree to which these outcomes are interrelated, and the sensitivity and responsiveness of these outcomes to treatment and disease activity, have not been fully established.
The current analysis examines associations among PRO instruments measuring generic and disease-specific HRQL [the 12-item Short-Form Health Survey, version 2 (SF-12v2) and the Short IBDQ (SIBDQ), respectively] and disease-specific WRO [the WPAI: Specific Health Problem (WPAI:SHP)], as well as the extent to which these outcomes are negatively associated with disease activity for patients with mild-to-moderate UC who participated in an open-label prospective trial of delayed-release mesalamine tablets formulated with MMX® (Cosmo Technologies Ltd, Wicklow, Ireland) technology (hereafter referred to as delayed-release mesalamine). The objective of the current analysis is to test several hypotheses regarding the interrelation among these PRO measures, their relative sensitivity to treatment, and their relative responsiveness to changes in disease activity for patients with UC in this clinical treatment trial.

Study design
Data included in the current analysis were collected from the Strategies in Maintenance for Patients Receiving Long-term Therapy (SIMPLE) study [27], a multicenter, prospective, single-treatment, open-label trial (NCT00446849). This study consisted of a screening period, followed by two phases: an 8-week acute phase, and a 12-month maintenance phase. Figure 1 presents a flowchart of the study design. A more detailed description of the SIMPLE study has been presented elsewhere [27].
Patients diagnosed with mild-to-moderate active UC at screening were entered into the acute phase, where they received daily MMX mesalamine 2.4-4.8 g/day for 8 weeks. Dose titration in increments of 1.2 g was implemented when necessary throughout the acute phase. Data for all PRO instruments were collected at the acute phase baseline and at the 8-week endpoint.
Patients with quiescent UC at screening, as well as those who achieved quiescence by the end of the acute phase, were able to participate in the 12-month maintenance phase. In this phase, patients received daily MMX mesalamine 2.4 g/day for 12 months. Data for all PRO instruments were collected from three on-site visits over the 12 months: at baseline, 6, and 12 months (or early withdrawal). This trial was approved by Institutional Review Boards at each study site. Only patients who provided written informed consent at screening were able to enroll in this trial.

SF-12v2 (generic HRQL)
The SF-12v2 is a 12-item self-report survey of HRQL with a 4-week recall period [28]. Item responses afford calculation of eight domains representing separate dimensions of functional health and well-being: physical functioning (PF), role physical (RP; role limitations due to physical problems), bodily pain (BP), general health perceptions (GH), vitality (VT), social functioning (SF), role emotional (RE; role limitations due to emotional problems), and mental health (MH). Physical component summary (PCS) and mental component summary (MCS) scores are computed by summing weighted domain scores. SF-12v2 domains and summary scores were standardized using a T-score metric (mean = 50, standard deviation = 10) based on a US general population normative sample. Higher scores indicate better health outcomes on all domains and summary scores.

SIBDQ (disease-specific HRQL)
The SIBDQ [29] consists of 10 items chosen from among the 32 items on the original IBDQ instrument. This instrument has demonstrated good psychometric properties (i.e., reliability, responsiveness, and construct and criterion validity) in assessment of disease-specific HRQL within the UC patient population [29-32]. Item responses yield four domain scores: bowel symptoms (BS), systemic symptoms (SS), emotional function (EF; calculated by summing responses to three items capturing the frequency of depression, stress, and anger), and social function (calculated by summing responses to two items capturing the frequency of having to cancel social activities, and being limited in social activities). Responses to each item are also summed to create a total SIBDQ score. Response options for each item range from 1 to 7; thus, possible scores range from 3 to 21 for the three-item BS and EF domains, and from 2 to 14 for the two-item SS and social function domains, with total scores ranging from 10 to 70. For all domains and the total score, higher scores indicate better health outcomes.

WPAI:SHP (WRO)
The WPAI:SHP consists of six items that can be used to measure the impact of a person's specific health problem (in this case, UC) on WROs, including work time missed, decreases in productivity, and impairment in daily nonwork-related activities (e.g., childcare) during the preceding 7 days [33]. The WPAI:SHP has been psychometrically validated within samples of patients with a variety of gastrointestinal disorders, including gastroesophageal reflux disease [34,35], Crohn's disease [36], and irritable bowel syndrome [37].
For patients employed over the previous 7 days, four domains were calculated based on item responses: absenteeism (the percentage of work time missed due to a patient's UC), presenteeism (the percentage of impairment while working due to a patient's UC relative to their work productivity when healthy), overall work impairment (the percentage of overall work impairment due to a patient's UC), and activity impairment (the percentage of impairment in non-work activities due to a patient's UC). Only scores for the activity impairment domain were computed for patients not employed in the previous 7 days. All domain scores range from 0 to 100 %, with lower scores on all domains signifying better WRO (i.e., less impairment).
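
The four domain definitions above can be illustrated with a short sketch. The article does not print its scoring formulas, so the arithmetic below follows the commonly published WPAI scoring conventions and should be read as an assumption; the function name and parameter names are hypothetical, chosen only for this example.

```python
from typing import Dict, Optional


def wpai_scores(employed: bool,
                hours_missed: float = 0.0,
                hours_worked: float = 0.0,
                productivity_rating: float = 0.0,
                activity_rating: float = 0.0) -> Dict[str, Optional[float]]:
    """Return WPAI:SHP domain scores as percentages (0-100).

    Assumed inputs (hypothetical names): hours missed because of UC and
    hours actually worked in the past 7 days, plus two 0-10 ratings of
    impairment while working and impairment in non-work activities.
    """
    scores: Dict[str, Optional[float]] = {
        "absenteeism": None,
        "presenteeism": None,
        "overall_work_impairment": None,
        # Activity impairment is scored for every patient, employed or not.
        "activity_impairment": 10.0 * activity_rating,
    }
    if not employed:
        return scores

    total_hours = hours_missed + hours_worked
    if total_hours > 0:
        absent = hours_missed / total_hours
        scores["absenteeism"] = 100.0 * absent
        if hours_worked > 0:
            present = productivity_rating / 10.0
            scores["presenteeism"] = 100.0 * present
            # Overall impairment combines time missed with reduced
            # productivity during the time that was actually worked.
            scores["overall_work_impairment"] = 100.0 * (
                absent + (1.0 - absent) * present
            )
    return scores
```

Under these assumptions, a patient who missed 8 of 40 scheduled hours and rated on-the-job impairment at 5 out of 10 would score 20 % absenteeism, 50 % presenteeism, and roughly 60 % overall work impairment.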

UC symptoms
Two UC symptoms, stool frequency (STF) and rectal bleeding severity (RBS), were measured using single-item patient reports. Measures for each of these symptoms are considered crucial for determining the status of disease in patients with UC, as indicated by their inclusion in two well-established measures of disease activity: the Ulcerative Colitis Disease Activity Index (UC-DAI) [38] and the Mayo score [39]. Previous research has found evidence that STF and RBS items alone are sufficient to estimate disease activity in patients with UC [40].
Patients provided once-daily responses on each via telephone or Internet. For the STF item, patients indicated whether their number of bowel movements that day was the same or only 1 more than their normal frequency (0), 2 or 3 more than their normal frequency (1), or at least 4 more than their normal frequency (2). For the RBS item, patients indicated whether they had no rectal bleeding in their stool (0), streaks of blood in their stool (1), obvious blood in their stool (2), or mostly blood in their stool (3) on the current day. At the time of each on-site visit, the patient's three most recent responses to each of these items were averaged to create a total score for each symptom. Lower scores on both measures indicate better outcomes.
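
As a minimal sketch of the symptom scoring described above, the following function averages the three most recent daily responses. The function name, the validation rules, and the example diary values are illustrative assumptions, not part of the study's data handling.

```python
from typing import Sequence


def symptom_score(daily_responses: Sequence[int],
                  max_code: int,
                  window: int = 3) -> float:
    """Average the `window` most recent daily responses for one symptom.

    `daily_responses` is ordered oldest to newest. `max_code` is the
    highest valid response code (2 for STF, 3 for RBS in the text).
    """
    recent = list(daily_responses)[-window:]
    if len(recent) < window:
        raise ValueError("not enough daily responses to score the symptom")
    if any(r < 0 or r > max_code for r in recent):
        raise ValueError("response code outside the valid range")
    return sum(recent) / float(window)


# Invented diary entries for one patient before an on-site visit.
stf_score = symptom_score([2, 2, 1, 1, 0], max_code=2)
rbs_score = symptom_score([1, 0, 0, 0], max_code=3)
```

How the trial handled patients with fewer than three diary entries is not stated in the text; raising an error here is simply one conservative choice.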

Patient baseline characteristics
Descriptive statistics (i.e., means and standard deviations for continuous variables, frequencies and percentages for categorical variables) were calculated for patient characteristics (e.g., age, gender, and employment status) and values of outcome measures for the full baseline sample of patients in each of the acute and maintenance phases. Descriptive statistics were also calculated separately at maintenance phase baseline for two subsamples of patients in the maintenance phase: those who were identified as quiescent at screening and thus entered the maintenance phase directly (maintenance phase-only subsample), and those who were identified with active disease at screening and thus only entered the maintenance phase after achieving quiescence at the end of the acute phase (acute + maintenance phase subsample).
Baseline values of patient characteristics and outcome scores were compared between the maintenance phase-only and acute + maintenance phase subsamples to establish whether the subsamples were similar enough to justify combining them into a single analysis group. Comparisons of SF-12v2, SIBDQ, and UC symptom scores between these subsamples were conducted in previous analyses of these data [19,22]; comparisons of patient characteristics and WPAI:SHP scores between the two groups were conducted here. Comparisons between continuous variables were conducted using independent samples t tests (two-tailed), while comparisons between categorical variables (gender, employment status) were based on Fisher's exact test (two-tailed).

Correspondence among PRO instruments
The objective of this analytic approach was to estimate the strength of relations among outcomes captured by the three PRO instruments. Analyses falling under this approach were designed to test several hypotheses regarding the relative magnitude of associations among PRO instruments.
Based on previous empirical findings described above, and given the conceptual relatedness among each of these outcomes, Hypothesis 1 was that changes in SF-12v2, SIBDQ, and WPAI:SHP domain scores from baseline to 8-week endpoint during the acute phase would, in general, be moderately correlated (i.e., most correlation coefficients falling within the range of 0.3-0.7).
Since the WPAI:SHP measures a different construct (WRO) than that shared by the other two instruments (HRQL), Hypothesis 2 was that the average inter-domain correlation between the SF-12v2 and the WPAI:SHP would be smaller than the average inter-domain correlation between the SF-12v2 and the SIBDQ.
Also, because the SIBDQ and WPAI:SHP are both designed to capture the impact of disease-specific outcomes, as opposed to generic health outcomes measured by the SF-12v2, Hypothesis 3 was that the average interdomain correlation between SIBDQ and WPAI:SHP scores would be larger than the average inter-domain correlation between SF-12v2 and WPAI:SHP scores.
To test Hypotheses 1-3, we examined correlations among changes in scores for all domains from each of the three PRO instruments. Change scores for each PRO domain were calculated by subtracting patients' acute phase baseline score from their acute phase 8-week endpoint score. Spearman correlation coefficients between all change scores were computed to estimate the direction and magnitude of associations.
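
The change-score and correlation steps can be sketched as follows. This is a plain-Python illustration that computes Spearman's coefficient as the Pearson correlation of average ranks; the helper names and the four-patient data are invented for the example and do not reproduce the study's SPSS analysis.

```python
from math import sqrt
from typing import List, Sequence


def change_scores(baseline: Sequence[float],
                  endpoint: Sequence[float]) -> List[float]:
    """Endpoint minus baseline for each patient."""
    return [e - b for b, e in zip(baseline, endpoint)]


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented scores for four patients (not study data).
sibdq_change = change_scores([40, 35, 50, 42], [55, 45, 52, 60])
activity_change = change_scores([60, 50, 30, 70], [30, 30, 30, 20])
rho = spearman(sibdq_change, activity_change)
```

Because higher SIBDQ scores indicate better outcomes while higher WPAI:SHP scores indicate worse outcomes, a negative coefficient between these two change scores reflects concordant improvement.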
To estimate the relative strength of associations among the PRO instruments, we calculated the average inter-domain correlation between each instrument pair using Fisher's method [41], which Monte Carlo simulations have shown to produce less biased estimates of mean correlation coefficients [42,43], according to the following procedure. First, Spearman coefficients were transformed into z-scores using Fisher's r-to-z transformation [41] based on the following equation:

$$z_\rho = \frac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right) \qquad (1)$$

Next, the average z-score ($\bar{z}$) was computed as the sum of all z-scores divided by the number of z-scores. Finally, the average z-score was transformed back into the average correlation coefficient using the inverse of Fisher's r-to-z transformation, based on the following equation:

$$\bar{\rho} = \frac{e^{2\bar{z}}-1}{e^{2\bar{z}}+1} \qquad (2)$$

For each correlation coefficient, a 95 % confidence interval (CI) was calculated using the following procedure. First, the correlation coefficient (ρ) was transformed into a z-score (z_ρ) using Fisher's r-to-z transformation (Eq. 1). Second, the standard error for z_ρ (SE z_ρ) was calculated [41,44]. Third, the 95 % CI for z_ρ was calculated as z_ρ ± 1.96 × SE z_ρ. Fourth, the limits of the 95 % CI for z_ρ were transformed into the 95 % CI for ρ using the inverse of Fisher's r-to-z transformation (Eq. 2).
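
The averaging and confidence-interval procedure can be sketched in a few lines. Fisher's r-to-z transformation is the inverse hyperbolic tangent, so `math.atanh` and `math.tanh` implement Eq. 1 and Eq. 2 directly. The standard-error formula used here, 1/sqrt(n - 3), is the classical Fisher expression and is an assumption: the exact formula the authors applied to Spearman coefficients is not recoverable from the text. Function names are hypothetical.

```python
from math import atanh, sqrt, tanh
from typing import Sequence, Tuple


def mean_correlation(rhos: Sequence[float]) -> float:
    """Average correlations on the Fisher z scale, then back-transform."""
    z_scores = [atanh(r) for r in rhos]      # Eq. 1 applied to each rho
    z_bar = sum(z_scores) / len(z_scores)    # arithmetic mean of z-scores
    return tanh(z_bar)                       # Eq. 2


def correlation_ci(rho: float, n: int,
                   z_crit: float = 1.96) -> Tuple[float, float]:
    """Approximate 95 % CI for a single correlation based on n pairs."""
    if n <= 3:
        raise ValueError("n must exceed 3")
    z_rho = atanh(rho)
    se = 1.0 / sqrt(n - 3)                   # ASSUMED standard error
    return tanh(z_rho - z_crit * se), tanh(z_rho + z_crit * se)


# Illustrative values only (not taken from the study tables).
average_rho = mean_correlation([0.35, 0.48, 0.52])
lower, upper = correlation_ci(0.42, n=103)
```

Working on the z scale keeps the interval inside (-1, 1) and makes it asymmetric around ρ, which is why the back-transformation in the fourth step is needed.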

Responsiveness of PRO instruments to disease activity and sensitivity to treatment
The objective of this analytic approach was to estimate the relative degree to which changes in each of the three PRO instruments corresponded to changes in UC symptoms (i.e., STF and RBS) over the course of treatment. Analyses falling under this approach were designed to test hypotheses regarding the responsiveness among instruments to disease activity and their sensitivity to treatment.
Since both the SIBDQ and WPAI:SHP, but not the SF-12v2, explicitly assess the impact of UC-related symptoms on patient outcomes, Hypothesis 4 was that the correlations between changes in SIBDQ and WPAI:SHP scores and changes in UC symptoms would generally be larger than correlations between changes in SF-12v2 scores and changes in these symptoms.
Given previously established findings from this trial that HRQL was lower for patients who experienced clinical recurrence (based on the recurrence of symptoms) at the 12-month maintenance phase endpoint as compared to nonrecurrent patients [19,22], and following the same logic of the previous hypothesis, Hypothesis 5 was that differences in change scores between recurrent and non-recurrent patients would be relatively larger for the SIBDQ and WPAI:SHP than for the SF-12v2.
Finally, because disease-specific HRQL captures more proximally the impact of treatment on patient outcomes than does generic HRQL or WRO, Hypothesis 6 was that the SIBDQ would exhibit greater sensitivity to acute treatment than would the SF-12v2 or WPAI:SHP.
The responsiveness of HRQL and WRO to disease activity was captured using two analytic approaches. First, the correspondences between changes in PRO domain scores and changes in symptom scores during the acute phase were examined using Spearman correlations. Change scores for symptom measures were calculated by subtracting each patient's acute phase baseline score from their acute phase 8-week endpoint score. To test the relative strength of associations between the different PRO instruments and the measures of disease activity in Hypothesis 4, we calculated the average correlations across all domain scores within each instrument with scores on each symptom measure using Fisher's r-to-z transformation procedure described above.
The responsiveness of each PRO instrument to changes in disease activity was also assessed by comparing PRO domain scores between patients who did or did not exhibit clinical recurrence at the 12-month maintenance phase assessment. Patients were classified as exhibiting clinical recurrence if they reported 4 or more bowel movements per day above their normal frequency and the presence of rectal bleeding, urgency to defecate, or abdominal pain. Univariate analysis of covariance (ANCOVA) models, with recurrence status as a between-subjects factor and patients' age, gender, body mass index (BMI), and maintenance baseline domain value as covariates, statistically compared recurrent and non-recurrent patients on each instrument domain. Cohen's d effect sizes [45] for standardized differences between independent-group estimated means (footnote 3) were calculated for all comparisons to indicate the strength of the effect of classification group for each domain score. Interpretation of these effects followed Cohen's conventional guidelines for interpretation of magnitude (i.e., small effect size: d = 0.20; medium effect size: d = 0.50; large effect size: d = 0.80).
The sensitivity of each PRO instrument to acute treatment was examined using paired-sample t tests to compare mean scores between baseline and 8-week assessments. Magnitude of change was estimated using Cohen's d_z effect sizes [45] for standardized mean differences across dependent samples (footnote 4).
No imputation techniques were used for patients missing data at a visit; only observed values were analyzed at each time point. Average correlations and effect sizes were calculated using Microsoft Excel (2007; Redmond, WA, USA). All other statistical analyses were performed using SPSS for Windows, version 17.0.2 (2009; Chicago, IL, USA).
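
The treatment-sensitivity calculation can be illustrated with a short sketch. It uses the usual definition of d_z for dependent samples, the mean of the paired differences divided by their standard deviation, which is stated here as an assumption about how the effect sizes were obtained. The function names and the four-patient data are invented.

```python
from math import sqrt
from statistics import mean, stdev
from typing import List, Sequence


def paired_differences(baseline: Sequence[float],
                       endpoint: Sequence[float]) -> List[float]:
    """Endpoint minus baseline, using only patients observed at both visits."""
    return [e - b for b, e in zip(baseline, endpoint)]


def cohens_dz(baseline: Sequence[float], endpoint: Sequence[float]) -> float:
    """Standardized mean change for dependent samples."""
    diffs = paired_differences(baseline, endpoint)
    return mean(diffs) / stdev(diffs)


def paired_t(baseline: Sequence[float], endpoint: Sequence[float]) -> float:
    """Paired-sample t statistic; equals d_z multiplied by sqrt(n)."""
    diffs = paired_differences(baseline, endpoint)
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Invented domain scores for four patients (not study data).
baseline_scores = [10.0, 12.0, 9.0, 11.0]
week8_scores = [12.0, 15.0, 10.0, 14.0]
dz = cohens_dz(baseline_scores, week8_scores)
t_stat = paired_t(baseline_scores, week8_scores)
```

The identity t = d_z × sqrt(n) shows why d_z is a convenient companion to the paired t test: it removes the dependence on sample size and leaves only the magnitude of change.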

Patient baseline characteristics
Table 1 presents descriptive statistics for patients' baseline age, gender, and employment status; domain and summary scores for the SF-12v2, SIBDQ, and WPAI:SHP; UC symptom scores for the full sample of patients in each of the acute and maintenance phases; and the maintenance phase baseline values of the maintenance phase-only and acute + maintenance phase subsamples. Previously published comparisons between these subsamples yielded no statistically significant group differences for any of the SF-12v2, SIBDQ, or UC symptom scores (all P > 0.05) [19,22]. Subsample comparisons of patient characteristics and WPAI:SHP scores conducted here found no statistically significant differences between the two groups in gender distribution, employment status, or any WPAI:SHP domains (all P > 0.05), although a statistically significant difference in age was observed (P < 0.05), with patients in the maintenance phase-only subsample being, on average, 5.5 years older than those in the acute + maintenance phase subsample.

Correspondence among changes in PRO domain scores during the acute phase
Spearman coefficients for inter-domain correlations among baseline-endpoint changes in SF-12v2, SIBDQ, and WPAI:SHP domain scores in the acute phase are presented in Table 2.

Responsiveness of PRO instruments to changes in disease activity
Spearman correlation coefficients between acute phase change scores of the PRO domains and symptom measures are presented in Table 3. Table 4 presents month 12 estimated mean SF-12v2, SIBDQ, and WPAI:SHP domain scores (adjusted for patients' age, gender, BMI, and maintenance baseline value) for patients with clinically recurrent or non-recurrent symptoms.

Footnote 3. Cohen's d effect sizes [45] were calculated using the following equation: [...], where F is derived from between-subjects ANCOVA with recurrence status as an independent factor, and age, gender, BMI, and maintenance baseline scale value as covariates.
Footnote 4. Cohen's d_z effect sizes [45] were calculated using the following equation: $d_z = (M_2 - M_1) / \sqrt{SD_1^2 + SD_2^2 - 2 r_{12} SD_1 SD_2}$, with $M_1$, $M_2$ and $SD_1$, $SD_2$ representing the means and standard deviations at each time and $r_{12}$ representing the correlation between scores at each time.

Sensitivity of SF-12v2, SIBDQ, and WPAI:SHP to acute treatment

Discussion
Findings from the current study provide several pieces of evidence regarding the correspondence among instruments measuring different PROs, and between each of these PRO instruments with measures of disease activity. Table 6 summarizes each of the six hypotheses tested in this analysis, as well as whether the findings were supportive or non-supportive of the hypothesized relationships among variables. Consistent with our initial hypothesis, inter-domain correlations for acute phase baseline-endpoint change scores across SF-12v2, SIBDQ, and WPAI:SHP instruments mostly ranged from 0.30 to 0.70, indicating generally moderate concordance in the improvement of each outcome over time. The consistency in scores across instruments also emerged from comparisons of scores following treatment, with all but one domain (RE on the SF-12v2) showing statistically significant improvement from acute phase baseline to 8-week endpoint. Finally, domains from all three of these instruments showed improvement with decreases in stool frequency and rectal bleeding during the acute phase, and all instruments were generally sensitive to patients' recurrence status at the maintenance phase endpoint.
Table 4 Comparison of estimated mean SF-12v2, SIBDQ, and WPAI:SHP scores (adjusted for age, gender, BMI, and baseline value) at 12-month maintenance phase endpoint for patients with clinically recurrent or non-recurrent symptoms
While the central results of this analysis indicated close correspondence among patient outcomes, several differences in their associations emerged that were consistent with our hypotheses. Our second and third hypotheses, which predicted that the association between SF-12v2 and WPAI:SHP scores would be weaker than the associations between SF-12v2 and SIBDQ scores (Hypothesis 2) and weaker than associations between SIBDQ and WPAI:SHP scores (Hypothesis 3), were both supported by the data. Specifically, the average correlation coefficient between changes in scores on the SF-12v2 and WPAI:SHP domains from baseline to the 8-week endpoint in the acute phase was smaller than for average correlations of change scores across domains for either of the other two pairings.
Given that the SIBDQ, but not the SF-12v2, explicitly probes the impact of symptoms on patients' perceptions of HRQL, we hypothesized that the SIBDQ would show greater sensitivity to disease activity than the SF-12v2, as indicated by stronger correlations with UC symptom scores (Hypothesis 4) and better discrimination between patients with clinically recurrent and non-recurrent status (Hypothesis 5). The observed results supported Hypothesis 4: The magnitude of the average correlation coefficient between UC symptom measures and SIBDQ domains (0.41, 95 % CI 0.34-0.48) was approximately 0.11 larger than that between symptoms and SF-12v2 domains (0.30, 95 % CI 0.27-0.34). [...] [25], who found that a continuous measure of UC symptom activity was more strongly correlated with IBDQ scores than with SF-36 scores, but that the IBDQ was not better than the SF-36 at discriminating patients classified by disease extent.
While we expected all instruments to show improvement over the course of treatment in the acute phase, particularly given that improvement in SIBDQ and SF-12v2 in this trial was previously established [19,22], our sixth hypothesis was that the SIBDQ would exhibit relatively greater sensitivity to treatment than the SF-12v2 since, as a disease-specific measure, the SIBDQ should more precisely capture the differences in HRQL related to UC symptoms and their improvement as a result of treatment. The data were generally supportive of this hypothesis: The mean effect size for standardized change in domain scores from baseline to endpoint was considerably larger for the SIBDQ (average d_z = 0.62, 95 % CI 0.35-0.89) than for the SF-12v2 (average d_z = 0.33, 95 % CI 0.27-0.39).
While all three instruments showed generally moderate levels of correspondence, each instrument provides a unique and important contribution to understanding the impact of UC, and the effect of treatment for UC, on patients' lives. The SIBDQ, as would be expected for a disease-specific measure, exhibited moderate-to-high responsiveness to disease activity, thus providing a reliable measure of treatment impact. The SF-12v2 showed moderate responsiveness to disease activity, and as a generic measure that is widely used across many studies and disease areas, it provides the opportunity for a contextual interpretation of the HRQL of patients with UC by facilitating comparisons with other disease samples and general population norms to understand the burden of UC and the degree to which this burden can be relieved through treatment. The WPAI:SHP, which, with the exception of the absenteeism domain, was also moderately responsive to changes in disease activity, allows for the most instrumental interpretation of the impact of UC on patients' lives.

Table 6 (excerpt)
Hypothesis 2: Correlations between SF-12v2 and SIBDQ domains will be larger than between SF-12v2 and WPAI:SHP domains
Rationale: The SF-12v2 and SIBDQ measure the same underlying construct (HRQL), while the WPAI:SHP measures a different construct (WRO)
Supported: Yes; the magnitude of the average inter-domain correlation between SF-12v2 and SIBDQ (0.44) was higher than between SF-12v2 and WPAI:SHP (-0.37)
Hypothesis 3: Correlations between SIBDQ and WPAI:SHP domains will be larger than between SF-12v2 and WPAI:SHP domains
Rationale: The SIBDQ and WPAI:SHP measure UC-specific health outcomes, while the SF-12v2 measures generic health outcomes
Supported: Yes; the magnitude of the average inter-domain correlation between SIBDQ and WPAI:SHP (0.47) was higher than between SF-12v2 and WPAI:SHP (0.37)
Hypothesis 4: Changes in UC symptoms from baseline to week 8 will correlate more highly with SIBDQ and WPAI:SHP domains than with SF-12v2 domains
Rationale: Because the SIBDQ and WPAI:SHP measure UC-specific health outcomes, while the SF-12v2 measures generic health outcomes, the former two instruments should be more responsive to changes in UC-specific symptoms

Conclusion
The findings of mostly moderate correlations among scores on the SF-12v2, SIBDQ, and WPAI:SHP, and between each of these instruments and clinical symptoms, as well as parallel responses to acute and maintenance MMX mesalamine daily treatment, indicate the consistency and correspondence of these instruments within this UC patient population. All three instruments demonstrated sensitivity to treatment and responsiveness to disease activity, with some predictable variations, and the types of outcomes they capture are complementary in terms of the interpretation they afford. Together, these findings indicate that it is appropriate and beneficial to administer all three of these instruments (or any combination of them, depending upon the objectives of the study) for the purpose of capturing the burden of UC and the impact of treatment on quality of life and/or work-related activities in clinical and outcomes research.