FormalPara Key Summary Points

Why carry out this study?

Fatigue is one of the most important symptoms reported by patients with systemic lupus erythematosus (SLE), yet is poorly understood and sub-optimally measured by existing patient-reported outcome (PRO) scales.

This study aimed to psychometrically evaluate the measurement properties of three PRO scales that purport to measure fatigue.

What was learned from the study?

Psychometric analyses of pooled, blinded data from two completed phase 3 SLE trials, EMBODY1 (NCT01262365) and EMBODY2 (NCT01261793), identified item misfit, suboptimal scale-to-sample targeting, and low reliability in the PRO scales assessed.

These findings highlight that the FACIT-F, SF-36 Vitality, and LupusQoL Fatigue scales have measurement limitations when used in SLE clinical trials.

These findings support the need for further patient-centered research to build an appropriate conceptualization of SLE fatigue to further support the development of a fit-for-purpose fatigue PRO scale for use in the context of SLE.

Digital Features

This article is published with digital features, including a summary slide, to facilitate understanding of the article. To view digital features for this article go to https://doi.org/10.6084/m9.figshare.14779596.

Introduction

Fatigue is one of the most common symptoms reported by patients with systemic lupus erythematosus (SLE) [1, 2], but despite this, it is poorly addressed by available treatments. As a concept, fatigue is complex, poorly understood, and sub-optimally measured, as the US Food and Drug Administration (FDA) acknowledged in its guidance on SLE [3]. Fatigue, however, is a key concept of interest (COI) [4,5,6] in industry-sponsored clinical trials [7,8,9]. Three legacy patient-reported outcome (PRO) scales are commonly used to measure fatigue. The Functional Assessment of Chronic Illness Therapy-Fatigue (FACIT-F) scale [10], originally developed to assess fatigue associated with anemia in cancer, has subsequently been used across cancer groups [11, 12], general US populations [13], and in rheumatology [14], including SLE [15]. The Medical Outcomes Study Short-Form (SF-36) Vitality scale, which is part of a larger generic health-related quality of life (HRQoL) PRO instrument [16, 17], was developed on the basis of the RAND Health Insurance Experiment (HIE) [18] and is a PRO mainstay in SLE trials [6, 19,20,21,22]. Finally, the Lupus Quality of Life Questionnaire (LupusQoL) Fatigue scale, which is also part of a multiscale disease-specific HRQoL PRO instrument developed and validated in SLE [23], is a common choice in SLE clinical trials (NCT01262365, NCT01261793) [7].

For PRO scales to be used to evaluate treatment benefit, they must first be shown to be fit for purpose [4, 24,25,26]. Current best-practice guidelines point to the need for a comprehensive and explicitly detailed COI (i.e., what the PRO scales aim to measure) in the specific context of use (i.e., the specific patient population in which the PRO scales will be used) [4, 27]. PRO scales should be well defined and reliable [25, 28, 29]. Although they are widely used, the published psychometric evidence supporting the FACIT-F, SF-36 Vitality, and LupusQoL Fatigue scales for use in SLE clinical trials is mixed. All scales have been found to be consistently reliable [23, 30,31,32], but evidence for validity (i.e., content, construct, and known groups) [23, 33,34,35] and ability to detect clinical change [31, 36, 37] is less convincing. Additionally, it is important to note that all three scales produce a single score for overall fatigue [38], whereas other widely used fatigue PRO scales, including some used in SLE, distinguish between physical/motor and mental/cognitive manifestations of fatigue [39, 40].

The FACIT-F, SF-36 Vitality, and LupusQoL Fatigue scales were used as exploratory endpoints in EMBODY1 and EMBODY2 (ClinicalTrials.gov identifiers NCT01262365 and NCT01261793). These two identical, phase 3, multicenter, randomized, double-blind, placebo-controlled studies (conducted at different geographic sites) assessed the efficacy and safety of epratuzumab in SLE. The study population included adult patients with moderately to severely active SLE who fulfilled the American College of Rheumatology (ACR) revised criteria for SLE [41, 42]. In line with the primary endpoint, no statistically significant differences were observed between the placebo and treatment groups for the PRO scales [7]. To better understand the psychometric performance of the FACIT-F, SF-36 Vitality, and LupusQoL Fatigue scales in these clinical trials, we present a post hoc psychometric analysis of the data using modern psychometric methods, including an exploratory analysis of pooled fatigue items to examine potential relative measurement benefits. This work is an exploratory exercise examining the impact of an item set that is psychometrically and conceptually more cohesive and clearer, which ultimately aims to inform the self-reported assessment of fatigue in SLE studies through ‘fit for purpose’ PRO scales.

Methods

Study Population

This post hoc psychometric analysis was conducted on pooled, blinded baseline and week 24 FACIT-F, SF-36 Vitality, and LupusQoL Fatigue data from patients enrolled in the EMBODY1 and EMBODY2 clinical trials. All EMBODY1 and EMBODY2 patients had either moderate or severe SLE disease activity as defined by the British Isles Lupus Assessment Group 2004 (BILAG-2004) [43] and SLE Disease Activity Index 2000 (SLEDAI-2K) [44] indices. The vast majority of patients were female (90%), with a mean age of 42 years in EMBODY1 and 41 years in EMBODY2, while time since diagnosis ranged from 0 to 43 years with a median of 6 years. The study design and population are described in detail elsewhere [7].

Compliance with Ethics Guidelines

The study protocol, amendments, and patient informed consent were reviewed by a national, regional, or Independent Ethics Committee (IEC) or Institutional Review Board (IRB). This study was conducted in accordance with the current version of the applicable regulatory and International Conference on Harmonisation (ICH)-Good Clinical Practice (GCP) requirements, the ethical principles that have their origin in the principles of the Declaration of Helsinki, and the local laws of the countries involved.

Patient and Public Involvement

No patients or members of the public were involved in the design, conduct, reporting, or dissemination plans of this work.

PRO Instruments and Scales

The FACIT-F is a 13-item fatigue PRO scale with a 7-day recall period [10]. Items are scored on a five-point Likert-type response scale ranging from 0 to 4. All items are summed to create a single fatigue score ranging from 0 to 52, with higher values representing lower levels of fatigue. The SF-36 Vitality scale is one of eight sub-scales within the SF-36 PRO instrument and comprises four items related to fatigue [16, 17]. Items are scored on a five-point Likert-type frequency scale ranging from all of the time to none of the time within a 4-week recall period, summed, and converted to norm-based 0–100 scores, with higher values representing more vitality (i.e., less fatigue). The LupusQoL Fatigue scale is one of eight scales comprising the LupusQoL PRO instrument and is made up of four fatigue items [23]. Items are scored on a five-point Likert-type frequency scale ranging from all of the time to never, within a 4-week recall period. A score from 0 to 100 is calculated for each domain scale by dividing the mean raw domain score by four and multiplying by 100, with higher scores representing less fatigue.
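For illustration, the scoring rules just described can be sketched in code. This is a minimal sketch rather than the instruments’ official scoring algorithms: the function names, the numeric 1–5 coding assumed for the SF-36 response options, and the omitted item-level recoding steps (e.g., reverse-scoring of individual FACIT-F items, the SF-36 norm-based standardization) are illustrative assumptions; the official rules live in the respective scoring manuals.

```python
# Illustrative scoring sketch; not the official scoring algorithms.
# Item-level reverse-coding and SF-36 norm-based standardization are handled
# by the instruments' scoring manuals and are only stubbed here.

def facit_f_total(items):
    """Sum of 13 items scored 0-4 (after any manual-specified recoding); range 0-52."""
    assert len(items) == 13 and all(0 <= x <= 4 for x in items)
    return sum(items)

def lupusqol_fatigue_score(items):
    """LupusQoL domain score: mean raw item score (0-4) divided by 4, times 100."""
    assert len(items) == 4 and all(0 <= x <= 4 for x in items)
    return (sum(items) / len(items)) / 4 * 100

def sf36_vitality_raw_0_100(items):
    """Simple 0-100 rescaling of the four-item raw sum (items coded 1-5 here).

    The published norm-based scoring additionally standardizes against
    general-population norms, which is not reproduced in this sketch.
    """
    assert len(items) == 4 and all(1 <= x <= 5 for x in items)
    raw = sum(items)                   # possible range 4-20
    return (raw - 4) / (20 - 4) * 100  # rescale to 0-100
```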

Rasch Measurement Theory

Psychometrics is an umbrella term for empirical evaluations of the measurement properties (e.g., reliability, validity, ability to detect change) of rating scales and tests [24], including PRO instruments and scales. Traditional psychometric methods have important limitations that are overcome by modern methods, such as Rasch Measurement Theory (RMT) [24, 29]. RMT analysis evaluates the extent to which the observed data fit predictions of the Rasch model, which in essence defines how a set of items should perform to generate reliable and valid measurements [45, 46]. The difference between expected and observed scores indicates the degree to which rigorous measurement is achieved [29, 45]. RMT analysis has three broad aims: (1) the evaluation of the scale-to-sample targeting; (2) the evaluation of the measurement continuum; and (3) the evaluation of the sample measurement.
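For reference, the unrestricted Rasch model for polytomous ordered responses referred to in the next subsection is commonly written in its partial credit parameterization as

$$
P(X_{ni} = x) = \frac{\exp\left(\sum_{k=0}^{x}(\theta_n - \delta_{ik})\right)}{\sum_{j=0}^{m_i}\exp\left(\sum_{k=0}^{j}(\theta_n - \delta_{ik})\right)}, \qquad x = 0, 1, \ldots, m_i,
$$

where $\theta_n$ is the location of person $n$ on the fatigue continuum, $\delta_{ik}$ is the $k$th threshold of item $i$, $m_i$ is the item’s maximum score, and the $k = 0$ term is defined as zero. Item fit is then assessed by comparing observed scores against the model-expected score $E[X_{ni}] = \sum_{x=0}^{m_i} x\,P(X_{ni} = x)$.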

RMT analyses, based on the unrestricted Rasch model for polytomous ordered responses [10], were conducted cross-sectionally on baseline data from week 0. Responsiveness (i.e., ability to detect change) analyses were conducted on longitudinal data from weeks 0 and 24 of the EMBODY trials. The goal of these analyses was to compare the PRO scales using all available pooled, blinded data, as opposed to comparing treatment arms. RUMM2030 [47] was used to conduct the RMT analyses and IBM SPSS 25.0 [48] was used for the responsiveness analyses. Responsiveness analyses were conducted on interval-level 0–100 transformed scores computed from the RMT-produced interval logit estimates corresponding to total raw scores.
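The exact anchoring of this 0–100 transformation is not specified above; the sketch below assumes a simple linear min–max rescaling of the RMT-produced person logits, with the function name and optional anchor parameters being hypothetical.

```python
import numpy as np

def logits_to_0_100(person_logits, logit_min=None, logit_max=None):
    """Linearly rescale Rasch person estimates (logits) onto a 0-100 interval scale.

    If no anchors are supplied, the observed extremes are used; fixed anchors
    (e.g., the logits corresponding to the minimum and maximum total raw scores)
    could equally be passed in.
    """
    x = np.asarray(person_logits, dtype=float)
    lo = np.nanmin(x) if logit_min is None else logit_min
    hi = np.nanmax(x) if logit_max is None else logit_max
    return (x - lo) / (hi - lo) * 100.0
```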

There were two stages of analysis: (1) evaluation of the measurement performance of the FACIT-F, SF-36 Vitality, and LupusQoL Fatigue scales; and (2) exploration of the potential measurement benefits of pooled fatigue items selected based on the best-performing items through an empirical post hoc analysis.

Stage 1: Measurement Performance Review of FACIT-F, SF-36 Vitality, and LupusQoL Fatigue Scales

There were four main areas of psychometric evaluation: (1) scale-to-sample targeting, (2) thresholds for item response options, (3) item-fit statistics, and (4) reliability. These are presented in more detail in Table 1 (columns 1 and 2) and elsewhere [3]. We examined group-level responsiveness by computing three standard indicators: two effect size statistics (Cohen’s effect size and the standardized response mean) and relative efficiency (based on squared t values from paired-samples t tests) [49, 50].
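A minimal sketch of how these group-level indicators can be computed is shown below, assuming the conventional definitions (effect size as mean change divided by the baseline standard deviation, SRM as mean change divided by the standard deviation of change scores, and relative efficiency as a ratio of squared paired t statistics); the function name is illustrative.

```python
import numpy as np
from scipy import stats

def group_responsiveness(baseline, week24):
    """Effect size, standardized response mean, and paired t statistic for one scale."""
    baseline = np.asarray(baseline, dtype=float)
    week24 = np.asarray(week24, dtype=float)
    change = week24 - baseline
    es = change.mean() / baseline.std(ddof=1)   # mean change / SD of baseline scores
    srm = change.mean() / change.std(ddof=1)    # mean change / SD of change scores
    t_stat, p_value = stats.ttest_rel(week24, baseline)
    return {"ES": es, "SRM": srm, "t": t_stat, "p": p_value}

# Relative efficiency of scale A versus a reference scale B:
#   RE = (t_A ** 2) / (t_B ** 2)
```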

Table 1 Summary of analysis and findings

We assessed individual-level responsiveness by computing the significance of each person’s change in each scale’s score [29]. The standard error of the difference (SED; i.e., the size of the error associated with each person’s change) was computed for each individual (SED = √((SE Time 1)² + (SE Time 2)²)). The significance of change was then determined by dividing each person’s change score by the SED. Significance of change values were categorized into five groups: (1) significant improvement = significant change ≤ −1.96; (2) non-significant improvement = −1.95 < significant change < 0; (3) no change = significant change = 0; (4) non-significant worsening = 0 < significant change < +1.95; and (5) significant worsening = significant change ≥ +1.96.
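The sketch below implements this individual-level classification directly from the formula above; the sign convention (negative change values treated as improvement) follows the categories listed and depends on each scale’s scoring direction.

```python
import math

def classify_change(change, se_time1, se_time2):
    """Classify an individual's change score using the standard error of the difference."""
    sed = math.sqrt(se_time1 ** 2 + se_time2 ** 2)
    z = change / sed
    if z <= -1.96:
        return "significant improvement"
    if z >= 1.96:
        return "significant worsening"
    if z < 0:
        return "non-significant improvement"
    if z > 0:
        return "non-significant worsening"
    return "no change"
```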

Stage 2: Construction and RMT Analysis of Pooled Fatigue Symptom Item Set

There were three steps in Stage 2: (1) review of findings from Stage 1 and the conceptual content of the FACIT-F, SF-36 Vitality, and LupusQoL Fatigue scales; (2) structuring and identifying a selection of items representing fatigue symptoms based on the empirical findings from Stage 1; and (3) analysis of the psychometric properties (as described in Stage 1) of the new pooled fatigue symptom item set and comparison against the original scales.

Results

Sample

Data from 1584 patients (n = 793 from EMBODY1 and n = 791 from EMBODY2) were used in these analyses. The sample (mean [SD] age, 42 [12] years; range, 18–64 years; 93% female; 75% white) included patients from a broad geographic distribution (36% USA, 41% EU, and 23% rest of world) with time since diagnosis ranging between 0 and 38 years (mean [SD], 10 [7] years). For the responsiveness analysis, available data from 1203 patients were used (n = 605 from EMBODY1 and n = 598 from EMBODY2).

Stage 1: Measurement Performance Review of FACIT-F, SF-36 Vitality, and LupusQoL Fatigue Scales

The FACIT-F demonstrated adequate targeting, as item thresholds covered 68% of the range of fatigue measured in the sample (Table 1, column 3), although some item bunching was observed (Fig. 1). In contrast, the SF-36 Vitality and LupusQoL Fatigue scales demonstrated suboptimal targeting: despite covering 63% and 71% of the range of fatigue measured in the sample, respectively, both scales showed gaps on the continuum, indicating areas on the metric where no scale item matched the levels of fatigue reported in the sample (Fig. 1).

Fig. 1
figure 1

Scale-to-sample targeting exemplars. The upper histograms (dark blue bars) represent the sample distribution for the scale total score, whereas the lower histograms (pale blue bars) represent the scale item threshold distribution plotted on the same linear measurement continuum (higher scores reflect better outcomes/lower fatigue). The green curve represents an inverse function of the standard error associated with each person’s measurement (the peak of the curve indicates the best point of measurement). The grey panels on the lower histograms signify areas on the continuum with sample measurements but no corresponding item thresholds

The response options for all three scales worked as intended; however, all scales demonstrated some item-fit issues (Table 1, columns 3–6). The worst-fitting items were from the FACIT-F and SF-36 Vitality; both scales demonstrated underestimation of fatigue (Fig. 2), with observed scores higher than expected at the lower end of the continuum and lower than expected at the higher end. In contrast, the worst-fitting LupusQoL Fatigue item displayed overestimation of fatigue, with the opposite pattern of observed scores (Fig. 2). Person separation indices (PSIs) for the FACIT-F and LupusQoL Fatigue scales ranged from 0.80 to 0.93, suggesting that these items sufficiently separated the sample; in contrast, the SF-36 Vitality scale demonstrated a low PSI (0.53), suggesting low reliability (Table 1, columns 3–6).

Fig. 2
figure 2

Item characteristic curve (ICC) exemplars. The ICC plots the scores expected by the Rasch model for each individual item on the y-axis at each level of the fatigue measurement continuum (x-axis), with higher scores representing better outcomes/lower fatigue. The blue dots represent observed scores in each of the ten class intervals of the fatigue levels. The closer the blue dots (observed scores) lie to the curve (expected scores), the better the item fit for the item under investigation

At the group level, all three fatigue scales showed a significant improvement of fatigue scores at week 24 (P < 0.001) with small and medium effect sizes (ES) and standardized response means (SRMs; Table 2, columns 9–12). The FACIT-F scale was the most responsive (ES = 0.35; SRM = 0.39) and the LupusQoL Fatigue the least responsive of the three scales (ES = 0.26; SRM = 0.24). At the individual level, the three different fatigue scales yielded different results regarding the percentage of patients reaching various degrees of improvement, worsening or no change in their fatigue levels at week 24, especially for the significant improvement and no change categories (Fig. 3). As assessed with the FACIT-F, 27% of patients reached significant improvement of fatigue scores at week 24 as opposed to 10% and 14% when assessed with the SF-36 Vitality and the LupusQoL Fatigue scales, respectively.

Table 2 Group-level responsiveness results
Fig. 3
figure 3

Individual-level responsiveness. Percentage of patients displaying significant improvement, significant worsening, non-significant improvement, non-significant worsening, or no change on the reviewed PRO instrument scales at week 24. Individual-level responsiveness analyses were conducted using pooled, blinded data in line with the methods described in Hobart & Cano 2009 (pages 151–152) [29]

Stage 2: Construction and RMT Analysis of Pooled Fatigue Items

Construction of Item Pool

The RMT findings were reviewed in reference to the item content of the three scales. Targeting findings suggested that the range of fatigue captured by the SF-36 Vitality and LupusQoL Fatigue items did not cover the range of fatigue issues displayed in the sample, and the analyses indicated low reliability for the SF-36 Vitality scale. Findings further demonstrated cohesiveness issues that called into question the legitimacy of the total scores, particularly for the FACIT-F and SF-36 Vitality scales. The item content of these scales was closely reviewed to consider whether their conceptual content might mirror these statistical misfit issues. Candidate problematic items were identified and eliminated from the item pool, resulting in a final selection of ten pooled fatigue items (Fig. 4).

Fig. 4
figure 4

Scale reconceptualization exemplar. Item-level content of the reviewed PRO instrument scales; *Items comprising the pooled Fatigue Symptoms Item Set

Items were selected on the basis of content clarity, quality, and conceptual relevance in relation to other fatigue items; items addressing the potential impact of fatigue on daily activities or its emotional consequences, and items addressing cognitive issues, were excluded. Subsequently, items demonstrating misfit were also excluded, whether the fit issues were hypothesized to relate to item content, such as items confounding the symptoms of fatigue with its impact (e.g., frustration and social activities), or to test-design issues. For example, some items were conceptually relevant to fatigue symptoms but were still associated with strong evidence of statistical misfit, such as the SF-36 item ‘Did you have a lot of energy?’ (Fig. 4). This finding could be attributed to test-design issues, namely that this was a positively worded item within an item pool of negatively worded items, which may have caused errors in the selected responses.

RMT Analysis of Pooled Fatigue Items and Comparison with Original Scales

The reconceptualized Fatigue Symptoms scale demonstrated adequate targeting and good reliability, with a PSI of 0.88; however, some fit issues persisted for two items, and one item had marginal problems with the five-point response scale (Table 1). Although item thresholds covered a smaller absolute range of the sample (49%) than the original three fatigue scales (Table 1), the reconceptualized Fatigue Symptoms scale showed an improved item continuum with fewer gaps compared with the SF-36 Vitality and LupusQoL Fatigue scales (Fig. 1). In terms of item fit, the reconceptualized Fatigue Symptoms scale showed improved statistical fit, especially in comparison to the FACIT-F and SF-36 Vitality scales (Table 1).

In the group-level responsiveness analysis, the reconceptualized Fatigue Symptoms scale showed a significant improvement in fatigue at week 24 (P < 0.001), in line with all of the original scales (Table 2), and was associated with the highest ES (0.41) and SRM (0.44). The reconceptualized scale also had the highest relative efficiency, suggesting it was the most sensitive scale for detecting change in fatigue (Table 2).

At the individual level, the reconceptualized Fatigue Symptoms scale also yielded different results with regard to the percentage of patients reaching significant or non-significant improvement, worsening, or no change in their fatigue levels at week 24 (Fig. 3), and was associated with the highest percentage of patients reaching significant improvement (28%) in comparison to the original scales.

Discussion

Our psychometric evaluation of the FACIT-F, SF-36 Vitality, and LupusQoL Fatigue scales in the context of the EMBODY clinical trials provided mixed findings, challenging the extent to which these PRO scales are fit to quantify fatigue in a valid and reliable way in SLE. The pooled fatigue items, comprising a selection of the best-performing and conceptually clearest items from the three original scales, improved but did not resolve the identified measurement issues. Importantly, the pooled fatigue items systematically enhanced sensitivity in detecting changes in fatigue levels. This pooled item set was not put forward to propose a new fatigue scale, but rather to examine the impact of an item set that is psychometrically and conceptually more cohesive and clearer. This item set, therefore, was used to further elaborate upon some of the limitations of the reviewed scales and to illustrate potential initial steps that could be used to develop a new fatigue PRO. This exercise demonstrated the importance and value of a scale’s conceptual underpinnings and clarity in the psychometric item design.

Findings from the RMT analysis revealed various issues. The FACIT-F demonstrated adequate targeting, indicating the relevance of the FACIT-F items in the population under measurement. However, fit analyses challenged the legitimacy of the FACIT-F total score. Strong evidence of statistical misfit was identified, suggesting the potential presence of multiple underpinning concepts within the scale’s content. The qualitative review of the FACIT-F item content further indicated that the items covered fatigue symptoms, as well as the functional and emotional impact of fatigue, supporting the multiple conceptual underpinnings of the scale.

The SF-36 Vitality and LupusQoL Fatigue scales demonstrated sub-optimal targeting, with findings indicating that the scales do not address all fatigue issues relevant in this population, accounting for the lack of precision associated with the scales’ scores. Furthermore, fit analyses also indicated some issues with the scales’ cohesiveness, while the reliability analysis challenged the SF-36 Vitality scale’s ability to detect differences in the sample.

The pooled fatigue items were conceptually clearer and less ambiguous, and showed good psychometric properties, including fit (especially compared with the FACIT-F) and targeting (especially compared with the SF-36 Vitality and LupusQoL Fatigue scales). In addition, although all of the scales demonstrated small to moderate improvements in fatigue scores at week 24, the pooled fatigue item set displayed the largest ES and SRM at the group level, and the highest percentage of ‘significant improvers’ at the individual level. The reconceptualized scale did not resolve all of the measurement issues. Of note, the FACIT-F demonstrated more optimal targeting, but this was probably due to its multidimensional content covering a wider range of HRQoL issues rather than its focusing on issues proximal to fatigue symptoms. However, the improvement in the sensitivity of the pooled fatigue items to clinical change, in comparison with the original scales, highlights the importance of a scale’s conceptual underpinning.

Our psychometric analysis findings challenge the extent to which the three reviewed scales quantify fatigue in a reliable and valid way, and consequently call into question whether they should be used in high-stakes decision making in relation to SLE fatigue. Regardless of previously published quantitative psychometric evidence [23, 30,31,32], it is critical that a scale purporting to measure a clinical concept [24] be evaluated using both qualitative and psychometric methods, and specifically that the scale's content validity (Do the items reflect all relevant aspects of the COI?) and face validity (Do the items ‘on their face’ look like they measure the target COI?) be established.

It is important to acknowledge three limitations. First, the study constitutes a post hoc psychometric analysis of existing clinical trial data relating to a specific sample of patients with moderate to severe SLE. It would therefore be of value to replicate these analyses in further SLE samples to establish the generalizability of the findings. Second, and related to the first limitation, the EMBODY clinical trials were not designed for the purpose of this post hoc psychometric analysis (e.g., sample size, power). However, the sample sizes (n = 1584 at baseline and n = 1203 for the responsiveness analyses) would be considered adequate, and power analysis is much less relevant for psychometric data analysis [51]. Third, the responsiveness analyses were conducted on pooled, blinded data, preventing any comparisons between treatment arms; instead, they focused on the relative sensitivity of the reviewed scales in detecting changes in fatigue levels. It is also important to state that the three PRO scales examined in this study were developed prior to regulatory guidelines articulating the importance of a clear definition and conceptualization of the construct under measurement in each context of use [4, 27]. Additionally, the FACIT-F and SF-36 were not developed specifically for use in SLE, while the SF-36 and LupusQoL were not developed specifically to assess fatigue, as the reviewed scales constitute only one of multiple components within these PRO instruments. The conceptual underpinning of an item set used to quantify an underlying COI is of fundamental importance, particularly when it is used to make high-stakes decisions affecting patients’ treatment and care [4]. Without a clearly and comprehensively defined COI adequately reflected in the range of items within a scale leading to a standalone score, all subsequent quantitative psychometric evidence can be misleading [52].

Conclusions

Our study findings indicate shortcomings of the reviewed scales in quantifying fatigue, while the exploratory reconceptualized item set demonstrated the benefits of a concept-driven approach in improving measurement properties. Establishing a PRO scale that is fit for purpose to quantify fatigue in SLE will require thorough and robust exploration of the COI in the specific context of SLE, in order to create an appropriate conceptualization of fatigue to support the content of a fatigue PRO scale. As new treatments for SLE are developed and tested, developing a fit-for-purpose fatigue PRO for the SLE context of use will be vital for adequately quantifying patient fatigue, in order to evaluate potential treatments for one of the most important and relevant symptoms in SLE.