Background

Endometriosis is a common problem in women of reproductive age [1, 2]. In the United States, endometriosis affects approximately 6%–10% of women at some time in their life. Symptoms include intermenstrual bleeding, non-menstrual pelvic pain (NMPP), and pain during menstruation, intercourse, urination, and defecation. In addition to pain and bleeding, fatigue is experienced by 50%–87% of women with endometriosis and is considered by women to be one of the most burdensome symptoms [3,4,5]. Due to these symptoms, and their association with infertility, endometriosis has far-reaching consequences on a woman’s health-related quality of life, interfering with marital and sexual relationships, social life, employment, physical activities, and psychological function [4,5,6,7,8,9,10,11].

Although fatigue is an important part of the burden of endometriosis, it is not often included as an endpoint in clinical trials. Several options are available to assess patient-reported fatigue, including the Fatigue Severity Scale [12], Fatigue Impact Scale [13], and Brief Fatigue Inventory [14], but these have not been developed specifically for, or assessed for use among, women with endometriosis. The Patient-Reported Outcome Measurement Information System (PROMIS) includes fatigue-related item banks and several fatigue short forms that have been developed and assessed for performance in chronic conditions [15].

A recent qualitative study in women with moderate-to-severe endometriosis-associated pain supported the concept of fatigue as important in this population [16]. In addition, PROMIS Fatigue SF-6a was used as a secondary endpoint to measure patient-reported fatigue in EM-I, a phase III randomized, placebo-controlled clinical trial that assessed the safety and efficacy of elagolix in 871 women with moderate-to-severe endometriosis-related pain [17]. At baseline, 54%–74% of the patients reported frequently having fatigue-related issues. The study also showed that at 6 months, PROMIS Fatigue SF-6a T-scores decreased significantly more in women treated with either dose of elagolix than in women who received placebo, and decreased more in women reporting clinically meaningful reductions in dysmenorrhea, NMPP, and dyspareunia than in women who did not [18].

These findings support the possibility of using PROMIS Fatigue SF-6a as an appropriate endpoint in clinical trials of endometriosis treatments. Here, we describe the psychometric characteristics of PROMIS Fatigue SF-6a in this sample of women with endometriosis based on data collected during the EM-I trial.

Methods

Data source and sample

This analysis was based on data collected during the EM-I phase III clinical trial (NCT01620528) [17]. Participants in the trial were premenopausal women from the United States and Canada aged 18–49 years with a surgical diagnosis of endometriosis in the previous 10 years who were enrolled from July 2012 to May 2014. Women were excluded if they had clinically significant gynecological conditions other than endometriosis or chronic pain unrelated to endometriosis [17].

The study ran from May 22, 2012 to September 28, 2015. Participants were randomized 3:2:2 to placebo, elagolix 150 mg once daily, or elagolix 200 mg twice daily. The study included a washout period for women receiving hormonal therapies, a screening period of up to 100 days, and a 6-month treatment period. The co-primary endpoints were the proportions of women with a clinical response for dysmenorrhea and a clinical response for NMPP at 3 months. Secondary endpoints assessed in the current analysis included PROMIS Fatigue SF-6a, Endometriosis Health Profile-30 (EHP-30), the Health Related Productivity Questionnaire (HRPQ), severity of dysmenorrhea and NMPP, and the Patient Global Impression of Change (PGIC) at months 1, 3, and 6. Data on daily analgesic medication use was collected using an electronic diary.

PROMIS Fatigue SF-6a

PROMIS Fatigue SF-6a is a 6-item instrument with a recall period of the last 7 days [19]. Items include (1) “I feel fatigued”; (2) “I have trouble starting things because I am tired”; (3) “How run-down did you feel on average?”; (4) “How fatigued were you on average?”; (5) “How much were you bothered by your fatigue on average?”; and (6) “To what degree did your fatigue interfere with your physical functioning?”. Response options for each item range from “Not at all” (1 point) to “Very much” (5 points). The raw overall score (range 6–30) can be converted to a T-score; higher scores indicate greater fatigue [20]. A T-score more than one standard deviation (10 points) above the standardized mean of 50 for the United States population indicates fatigue that is worse than the United States norm.

EHP-30

EHP-30 is a 30-item disease-specific health-related quality of life instrument comprising five core domains (pain, control and powerlessness, emotional well-being, social support, and self-image) with a recall period of 4 weeks [21]. In addition to the core domains, the clinical study included the optional 5-item sexual relationship module of the EHP-30. Women were able to report if the whole sexual relationship module, or individual items, were not relevant. The EHP-30 was scored according to the developer’s manual, in which item responses are coded from 0 (Never) to 4 (Always) and domain scores are standardized to a range of 0 (best health status) to 100 (worst health status). Domain scores were calculated as the sum of the raw scores per domain divided by the maximum possible raw score of the items in the domain, multiplied by 100 [22]. The measure was used to characterize the impact on health-related quality of life in the sample, and the pain domain was used both to assess construct validity and as a responsiveness anchor.
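The domain scoring rule described above can be sketched as follows. This is a minimal illustration rather than the developer's scoring program: the function name is hypothetical, and leaving out items marked as not relevant before scoring is an assumption, not a rule taken from the manual.

```python
def ehp30_domain_score(item_responses):
    """Scale one EHP-30 domain to 0 (best health) .. 100 (worst health).

    item_responses: item scores coded 0 (Never) to 4 (Always).
    Assumption: the caller has already dropped items that were
    marked as not relevant.
    """
    if not item_responses:
        raise ValueError("at least one answered item is required")
    if any(r not in (0, 1, 2, 3, 4) for r in item_responses):
        raise ValueError("responses must be integers from 0 to 4")
    max_raw = 4 * len(item_responses)  # maximum possible raw score
    return 100.0 * sum(item_responses) / max_raw


# Hypothetical responses for a four-item domain.
print(ehp30_domain_score([2, 2, 2, 2]))
```

Dividing by the maximum possible raw score of the answered items keeps every domain on the same 0–100 scale regardless of how many items it contains.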

HRPQ

HRPQ is a 9-item questionnaire that assesses loss of productivity due to absenteeism and presenteeism [23] and has been assessed for use in women with endometriosis [24]. The questionnaire uses skip patterns to ensure that items are applicable to those who work outside the home (e.g. in full- or part-time employment) and those who work at home.

Dysmenorrhea and NMPP

During the trial, participants completed a daily electronic diary. Dysmenorrhea was assessed with the item “Choose the item that best describes your pain during the last 24 hours you had your period” and NMPP with the item “Choose the item that best describes your pain during the last 24 hours without your period”. Possible responses included “None” (No discomfort), “Mild” (Mild discomfort but I was easily able to do the things I usually do.), “Moderate” (Moderate discomfort or pain. I had some difficulty doing the things I usually do.), and “Severe” (Severe pain. I had great difficulty doing the things I usually do.). Responses were assigned scores as follows: 0, none; 1, mild; 2, moderate; 3, severe. Scores were averaged over the 35 calendar days immediately prior to and including the date of the first dose of study drug, as well as over the 28 days before each post-baseline assessment. As reported by Taylor and colleagues [17], a dysmenorrhea response was defined as no increase in analgesic use and a score change from baseline of at least −0.81 (dysmenorrhea responder; if not reached then dysmenorrhea non-responder) and an NMPP response was defined as no increase in analgesic use and a score change from baseline of at least −0.36 (NMPP responder; if not reached then NMPP non-responder). The thresholds were identified using receiver operating characteristic analysis with the PGIC as an anchor and evaluating changes in analgesic use.
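The responder definitions above can be sketched as follows. The two thresholds come from the text; the function names, the handling of missing diary days (skipped rather than imputed), and the boolean flag for analgesic use are illustrative assumptions and not the trial's actual derivation code.

```python
# Thresholds for change from baseline in mean daily pain score (from the text).
DYSMENORRHEA_THRESHOLD = -0.81
NMPP_THRESHOLD = -0.36


def mean_pain_score(daily_scores):
    """Average daily scores (0 none .. 3 severe) over an assessment window.

    Assumption: missing diary days are passed as None and skipped.
    """
    answered = [s for s in daily_scores if s is not None]
    if not answered:
        raise ValueError("no diary entries in the window")
    return sum(answered) / len(answered)


def is_responder(baseline_mean, followup_mean, analgesic_increased, threshold):
    """Responder: no increase in analgesic use and a decrease in mean
    score at least as large as the (negative) threshold."""
    change = followup_mean - baseline_mean
    return (not analgesic_increased) and change <= threshold


# Hypothetical participant: mean score falls from 2.0 to 1.5.
print(is_responder(2.0, 1.5, False, DYSMENORRHEA_THRESHOLD))  # False
print(is_responder(2.0, 1.5, False, NMPP_THRESHOLD))          # True
```

Because the thresholds are negative, "a change of at least −0.81" is read here as a change that is less than or equal to −0.81.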

PGIC

PGIC was assessed using the question “Since I started taking the study medication, my endometriosis related pain has: very much improved (1), much improved (2), minimally improved (3), no change (4), minimally worse (5), much worse (6), very much worse (7).”

Statistical analyses: general considerations

All analyses were performed in all randomized subjects who received at least one dose of treatment or placebo. Missing data were not imputed. SAS version 9.4 (SAS Institute, Cary, NC, USA) and Mplus version 7.11 (Los Angeles, CA) were used to perform analyses.

Assessment of ceiling and floor effects

Ceiling effects were explored based on the highest response option and floor effects based on the lowest response option.
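As a rough sketch of this check, the share of respondents choosing the lowest and highest option can be computed per item. The function name is hypothetical, and no cut-off for declaring a floor or ceiling effect is applied because the text does not state one.

```python
from collections import Counter


def floor_ceiling_shares(responses, lowest=1, highest=5):
    """Return (floor_share, ceiling_share) for one item.

    responses: item responses coded 1 ("Not at all") to 5 ("Very much");
    None marks a missing response and is ignored (no imputation).
    """
    answered = [r for r in responses if r is not None]
    if not answered:
        raise ValueError("no responses available")
    counts = Counter(answered)
    n = len(answered)
    return counts[lowest] / n, counts[highest] / n


# Hypothetical responses to a single item.
floor_share, ceiling_share = floor_ceiling_shares([1, 5, 5, 3, None])
print(floor_share, ceiling_share)
```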

Confirmatory factor analysis

A categorical confirmatory factor analysis using polytomous response options was performed to evaluate the fit of the PROMIS Fatigue SF-6a as a unidimensional fatigue scale. Model fit was assessed using recommendations by Reeve and colleagues [25]. Specifically, the model fit was evaluated by considering the comparative fit index (suggested cut point > 0.95); Tucker-Lewis Index (suggested cut point > 0.95); weighted root mean square residual (suggested cut point < 1.0); average absolute residual correlations (suggested cut point < 0.10); root mean square error of approximation (suggested cut point < 0.06).
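The cut points listed above can be applied mechanically once the fit indices have been estimated (the estimation itself was done in Mplus and is not reproduced here). The sketch below is illustrative only: the dictionary keys and the function name are assumptions.

```python
# Suggested cut points as listed in the text: (direction, value).
CUT_POINTS = {
    "cfi": (">", 0.95),                    # comparative fit index
    "tli": (">", 0.95),                    # Tucker-Lewis index
    "wrmr": ("<", 1.0),                    # weighted root mean square residual
    "mean_abs_residual_corr": ("<", 0.10), # average absolute residual correlation
    "rmsea": ("<", 0.06),                  # root mean square error of approximation
}


def check_fit(indices):
    """Map each supplied fit index to True if it meets its cut point."""
    results = {}
    for name, value in indices.items():
        direction, cut = CUT_POINTS[name]
        results[name] = value > cut if direction == ">" else value < cut
    return results


# An RMSEA of 0.112 does not meet the suggested cut point of < 0.06.
print(check_fit({"rmsea": 0.112}))
```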

Assessment of reliability

Internal consistency reliability of the PROMIS Fatigue was assessed by calculating Cronbach’s alpha for data collected at baseline. Values above 0.70 are generally considered acceptable for aggregate data [26]. Item performance for the PROMIS Fatigue SF-6a was evaluated by calculating Cronbach’s alpha with individual items deleted. Test-retest reliability (reproducibility) of PROMIS Fatigue SF-6a was assessed in patients with no change in PGIC at month 1; T-scores were compared between baseline and month 1 by paired t-test and intra-class correlations.
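For readers unfamiliar with the statistic, Cronbach's alpha and alpha with individual items deleted can be sketched in a few lines. This uses the standard formula based on item and total-score variances; it is not the SAS procedure used in the study, and the function names and example data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for complete item responses.

    rows: one list of item scores per respondent, all the same length.
    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def alpha_if_item_deleted(rows):
    """Alpha recomputed with each item left out in turn."""
    k = len(rows[0])
    return [cronbach_alpha([row[:j] + row[j + 1:] for row in rows])
            for j in range(k)]


# Hypothetical data: four respondents, three items scored 1-5.
data = [[1, 2, 1], [3, 3, 2], [4, 4, 5], [5, 5, 4]]
print(round(cronbach_alpha(data), 3))
print([round(a, 3) for a in alpha_if_item_deleted(data)])
```

If alpha stays close to the full-scale value whichever item is removed, no single item is driving or undermining the internal consistency.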

Assessment of construct validity

Convergent validity was assessed by calculating Pearson product-moment and Spearman’s rank correlation coefficients at baseline, month 3, and month 6 for PROMIS Fatigue SF-6a versus dysmenorrhea, NMPP, HRPQ, and EHP-30. Because the concepts being measured are not redundant but are hypothesized to be distally related, positive, low to moderate correlations were expected. Cohen’s conventions were used when interpreting correlation coefficients as 0.10 small, 0.30 moderate, and 0.50 large [27].
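The two coefficients and the interpretation bands can be sketched as below. Spearman's coefficient is computed as the Pearson correlation of average ranks, which is the usual treatment of ties; the function names and the "negligible" label for values under 0.10 are assumptions added for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cohen_correlation_label(r):
    """Interpretation bands from the text: 0.10 small, 0.30 moderate, 0.50 large."""
    size = abs(r)
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "moderate"
    if size >= 0.10:
        return "small"
    return "negligible"  # label for values below 0.10 is an assumption


print(cohen_correlation_label(0.34))
```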

Known groups validity was assessed by comparing PROMIS Fatigue SF-6a T-score at months 3 and 6 between the following subgroups: dysmenorrhea and NMPP responders versus non-responders; and PGIC (improved; no change; worsened). Responder status (responder or non-responder) for dysmenorrhea and NMPP was from the Endometriosis Daily Pain Impact diary score as noted in the clinical study [17]. General linear models (Proc GLM) were used to calculate F statistics and p-values; for multiple group pairwise comparisons Scheffe’s test was used to adjust for the multiple comparisons.

Assessment of responsiveness

The PROMIS Fatigue SF-6a was explored using a triangulation approach comprising anchor-based analyses, difference between means analyses, and use of clinically relevant indicators to test the instrument’s ability to detect change during the clinical study. Responsiveness is the ability of a measure to detect change when change is present [28].

In anchor-based analyses, the least-squares (LS)-mean change from baseline in PROMIS Fatigue SF-6a T-score at months 3 and 6 was compared for the following subgroups: PGIC (improved, no change, worsened) and EHP-30 pain domain responders (≥30-point decrease in EHP-30 pain domain score from baseline) versus non-responders. General linear models controlling for age and baseline PROMIS Fatigue score were used to calculate F statistics and p-values, with Scheffe’s test used to adjust for multiple comparisons.

In the assessment of the standardized difference between two means, effect size (Cohen’s d) was calculated for PROMIS Fatigue SF-6a at months 3 and 6. These analyses do not include direct patient feedback; they therefore cannot serve as the primary assessment of within-patient clinical meaningfulness and should be considered only as supportive [29]. Effect size was calculated by subtracting the baseline score from the post-baseline (month 3 or 6) score and dividing the result by the baseline standard deviation. As described by Cohen [27], effect sizes were classified as small (0.20), moderate (0.50), or large (0.80). Change from baseline was analyzed by paired t-test.
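The effect size calculation described above amounts to a single division, sketched here with hypothetical numbers rather than study data. The function names and the "below small" label for values under 0.20 are assumptions.

```python
def effect_size(mean_baseline, mean_followup, sd_baseline):
    """Standardized change: (post-baseline mean - baseline mean) / baseline SD."""
    if sd_baseline <= 0:
        raise ValueError("baseline standard deviation must be positive")
    return (mean_followup - mean_baseline) / sd_baseline


def cohen_d_label(d):
    """Classification from the text: 0.20 small, 0.50 moderate, 0.80 large."""
    size = abs(d)  # the sign only shows the direction of change
    if size >= 0.80:
        return "large"
    if size >= 0.50:
        return "moderate"
    if size >= 0.20:
        return "small"
    return "below small"  # label for values under 0.20 is an assumption


# Hypothetical T-scores: mean falls from 60 to 55 with a baseline SD of 10.
d = effect_size(60.0, 55.0, 10.0)
print(d, cohen_d_label(d))
```

A negative value indicates a decrease in T-score, which for this instrument means less fatigue.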

Clinically relevant indicators were used to explore the responder threshold, which was defined as the LS-mean change in PROMIS Fatigue SF-6a score from baseline that indicated a meaningful response to treatment. Responder thresholds for PROMIS Fatigue SF-6a T-score at 3 and 6 months were calculated for dysmenorrhea and NMPP responders and non-responders.

Results

Sample characteristics

This analysis included the 871 participants enrolled in the EM-I trial who received at least one dose of study treatment or placebo (Table 1). Mean age was 31.5 years and most were White (87.1%) and not Latino (84.0%). The majority of the women were employed (60.2% full time; 17.0% part-time). For the EHP-30 domains, mean scores at baseline were 58.2 (14.3) for pain, 49.3 (19.3) for emotional well-being, 69.8 (19.5) for control and powerlessness, 54.8 (25.6) for social support, 51.0 (27.6) for self-image, and 64.5 (24.7) for sexual relationships. The mean PROMIS Fatigue T-score at baseline was 63.3 (7.7) (range, 33.4–76.8).

Table 1 Sociodemographic characteristics and patient-reported data at baseline

Ceiling and floor effects

A ceiling effect was observed for one PROMIS Fatigue SF-6a item at baseline, “How much were you bothered by your fatigue on average?”, with 32.6% responding “Very much”. No other ceiling or floor effects were detected at baseline or at months 3 or 6.

Confirmatory factor analysis

Confirmatory factor analysis demonstrated strong fit for a unidimensional scale, with item factor loadings ranging from 0.808 to 0.942 (Table 2). Model fit was supported by the comparative fit index, Tucker-Lewis index, and all absolute residual correlations less than 0.10. The data were nonnormative and the weighted root mean square residual and RMSEA estimate was 0.112 (90% confidence interval, 0.093–0.132), which is higher than the acceptable value but not uncommon for small degrees of freedom [30, 31].

Table 2 Confirmatory factor analysis PROMIS Fatigue 6a at baseline

Reliability

Cronbach’s alpha was 0.93 at baseline and 0.91–0.92 when individual items were deleted, indicating that the items comprising PROMIS Fatigue SF-6a measured the same construct. In the 238 patients with no change in PGIC at month 1, the intraclass correlation for baseline versus month 1 was 0.7 and the paired t-test was statistically significant (t-value 2.84, p = 0.0049) for the PROMIS Fatigue T-score, indicating stable test-retest reliability.

Construct validity

Spearman’s rank correlation coefficients indicated a moderate correlation at baseline between PROMIS Fatigue SF-6a and the EHP-30 pain domain (0.34) and weak correlations between PROMIS Fatigue SF-6a and HRPQ work absenteeism (0.22), HRPQ work presenteeism (0.23), dysmenorrhea (0.17), and NMPP (0.17) (Table 3). At month 3, Spearman’s rank correlation coefficients ranged from 0.27 (dysmenorrhea) to 0.49 (EHP-30 pain domain), and at month 6, they ranged from 0.35 (HRPQ work absenteeism) to 0.60 (EHP-30 pain domain). Pearson product-moment correlations were found to be similar to the Spearman’s rank correlation coefficients (Table 3).

Table 3 Correlations (Spearman and Pearson) between PROMIS Fatigue Short Form 6a and patient-reported outcome measures at baseline, month 3, and month 6

Known groups validity

The mean PROMIS Fatigue T-scores at month 3 were significantly lower for dysmenorrhea and NMPP responders than for non-responders (Table 4). At month 3, the mean score was 54.5 for dysmenorrhea responders and 59.1 for non-responders; the mean score was 54.4 for NMPP responders and 59.8 for non-responders. Similar results for mean PROMIS Fatigue T-score between clinical responders and non-responders were seen at month 6. The mean PROMIS Fatigue T-score was also significantly lower in participants for whom PGIC showed improvement than in those for whom PGIC did not change (55.3 vs. 62.3, p < 0.001) or worsened (55.3 vs. 65.8, p < 0.001) at month 3; results at month 6 were similar (Table 5).

Table 4 Known groups validity: PROMIS Fatigue Short Form 6a T-score by dysmenorrhea and NMPP responder status at months 3 and 6
Table 5 PROMIS Fatigue Short Form 6a by PGIC: improved, no change, and worsened

Responsiveness

Two anchor-based approaches were used to evaluate PROMIS Fatigue SF-6a responsiveness: the patient-reported changes in PGIC (improved, no change, worsened) and responder status on the EHP-30 pain domain. At month 3, the mean changes in the PROMIS Fatigue SF-6a T-score were −7.9 (0.4) for participants who reported an improvement, −0.9 (0.8) for participants who reported no change, and 2.3 (1.3) for participants who reported a worsening using the PGIC (Table 5). The findings demonstrate that the PROMIS Fatigue SF-6a score responds when a change is identified. Similar findings were seen at month 6 using the PGIC as an anchor. At months 3 and 6, the PROMIS Fatigue SF-6a T-scores were significantly different between responders and non-responders on the EHP-30 pain domain (p < 0.0001 at both timepoints). The mean T-score change for EHP-30 responders versus non-responders was −10.3 (0.5) and −3.5 (0.4) at month 3 and −11.8 (0.5) and −3.3 (0.5) at month 6.

Distribution-based responsiveness analyses use only the PROMIS Fatigue SF-6a data to look for changes over time and do not include outside sources of data such as the PGIC or clinical indicators. The treatment groups were expected to improve between baseline and months 3 and 6. The placebo group was also expected to show some improvement in PROMIS Fatigue SF-6a scores over time due to a placebo effect. Table 6 reports the responsiveness findings for the total sample and then by treatment arm. For all groupings, at both timepoints, scores decreased, indicating less fatigue. In the total sample, the mean PROMIS Fatigue SF-6a T-score changed by −6.2 (9.8) from baseline to month 3, a significant decrease (p < 0.0001); at month 6, the change was −6.8 (10.4), also a significant decrease (p < 0.0001).

Table 6 Responsiveness of PROMIS Fatigue Short Form 6a at months 3 and 6

Clinically relevant indicators were used to assess the responsiveness of the PROMIS Fatigue SF-6a. For dysmenorrhea responders, the LS-mean change from baseline in PROMIS Fatigue SF-6a T-score was significantly greater than for those who did not have a dysmenorrhea response to treatment (Fig. 1). Dysmenorrhea responders had a score change of −8.8 (0.5) versus −4.0 (0.4) for dysmenorrhea non-responders at month 3, and −10.4 (0.5) versus −3.2 (0.5) at month 6 (both comparisons p < 0.0001). Score changes also differed significantly between NMPP responders and non-responders at both timepoints (Fig. 1).

Fig. 1

Clinically Relevant Responsiveness for PROMIS Fatigue Short Form 6a Change for Dysmenorrhea and NMPP Responders at Month 3 and Month 6. LS-mean indicates least-squares mean; NMPP, non-menstrual pelvic pain; PROMIS, Patient-Reported Outcome Measurement Information System; SE, standard error. * P value between responder and non-responder is p < 0.0001

Discussion

This analysis showed that for women with moderate-to-severe endometriosis-associated pain, PROMIS Fatigue SF-6a performs well, with evidence of reliability, construct validity, and responsiveness to change. The instrument measures a single construct, and the resulting scores could discriminate between pain severity groups, between responders and non-responders for clinically relevant endpoints (dysmenorrhea and NMPP), and between levels of patient self-assessed global change. Correlations with other patient-reported outcomes were moderate to weak at baseline, indicating that PROMIS Fatigue SF-6a is not redundant with other patient-reported outcomes.

For a patient-reported outcome measure to be fit for purpose, there needs to be evidence that the concept being measured is appropriate and applicable to the target population and that the measure performs well in the target population. The current findings complement the results of a previous study demonstrating content validity and appropriateness of PROMIS Fatigue SF-6a in this population [16]. They also provide quantitative evidence about the positive performance (reliability, validity, and responsiveness) of the instrument, therefore supporting a recent study based on PROMIS Fatigue SF-6a data from the EM-I trial, which showed that the women in the trial frequently experienced severe fatigue at baseline and that elagolix significantly improved fatigue after 6 months of treatment [18].

The mean T-score at baseline in this study was more than one standard deviation higher than the standardized mean for the United States population, indicating that the women included in this analysis experienced significantly greater fatigue than the US norm at study baseline. To put this in context, the baseline T-scores in this study were higher than those reported for back pain, cancer, chronic heart failure, chronic obstructive pulmonary disease exacerbation, stable chronic obstructive pulmonary disease, major depressive disorder, rheumatoid arthritis, muscular dystrophy, multiple sclerosis, post-polio syndrome, spinal cord injury, and chronic pelvic pain [32,33,34]. The baseline T-score in the present analysis was also close to the T-score obtained using the PROMIS Fatigue Short Form 7a in a recent study of adults with myalgic encephalomyelitis and chronic fatigue [35].

PROMIS Fatigue Short Form 7a has been similarly shown to have sound psychometric properties in pregnant women and patients with fibromyalgia, sickle cell disease, and cardio-metabolic risk [36].

The patient-focused drug development guidance series from the US Food and Drug Administration has provided insight into the Agency’s views about the development and use of patient-reported outcome measures as clinical study endpoints. The first draft guidance focuses on collecting comprehensive and representative input during product development [37]. The guidance details that patient experience data include information about the experience of a condition and its impact on the patient. The concept of fatigue has been shown to be important to this target population, and the use of a measure that assesses the concept is appropriate. Having the ability to reliably measure fatigue among women with moderate-to-severe endometriosis-related pain can be important from a clinical and humanistic perspective. The PROMIS Fatigue SF-6a can measure changes in fatigue and add value in clinical practice and research. This research is an example of how an existing measure can be assessed for use in a new population. The assessment of the psychometric properties and responsiveness supports the use of the PROMIS Fatigue SF-6a among women with moderate-to-severe endometriosis-related pain. The findings from this research could be used to identify a responder threshold that would indicate a treatment benefit among this target sample. Previous research with different versions of the PROMIS short form (e.g. the 17-item short form) has suggested a change of 3–5 points to indicate a responder [38].

A strength of this study is that the data were from a large randomized, controlled trial and were therefore of high quality. At the same time, the results may not be generalizable outside of the selected population, which may have a different racial makeup than the overall population of women in the US with endometriosis, or to women with milder endometriosis-associated pain. In addition, the results may not be generalizable to other PROMIS Fatigue measures for this population or to the use of PROMIS Fatigue SF-6a in other women’s health conditions.

Conclusion

In conclusion, this study showed that PROMIS Fatigue SF-6a performs well in women with moderate-to-severe endometriosis-related pain, with good reliability, validity, and responsiveness to change. The study also confirmed that fatigue is a common and severe problem in women with endometriosis, highlighting the need for a high-quality instrument for assessing fatigue as a treatment outcome in this population.