Background

Fatigue is a common symptom associated with a wide range of chronic diseases [1] and has been frequently studied in many different patient populations. Fatigue has been defined as a sense of exhaustion, lack of energy, or tiredness distinct from sleepiness, sadness, or weakness [2, 3]. To minimise patient burden and to enhance response rates in research studies, a short, yet valid and reliable patient-reported outcome measure (PROM) for fatigue is important for the feasibility of studies on fatigue. Further, in order to compare health status across medical conditions, we need to know whether a PROM for fatigue can be used as a generic measure in populations as diverse as patients affected by stroke and patients living with the consequences of osteoarthritis.

One of the frequently used PROMs for fatigue is the 13-item Lee Fatigue Scale (LFS). It has primarily been used among adults with cancer [4, 5] and human immunodeficiency virus (HIV) [6, 7]. However, it has also been used in other populations, such as patients admitted to intensive care units [8], undergoing knee arthroplasty [9], recovering from stroke [10], and living with chronic obstructive pulmonary disease [11]. Nonetheless, PROMs with strong psychometric properties in one population need to be evaluated for use in other patient populations. We have previously shown how the psychometric properties of the LFS can vary in different populations and between countries [12], but these types of studies are rarely published. Rather, PROMs are often evaluated psychometrically in a single patient population, and then applied to other populations without further psychometric testing. Moreover, only PROMs that demonstrate strong and stable psychometric properties across a broad range of diverse patient populations should be considered generic measures suitable for use in any patient population.

We have previously evaluated the English version of the LFS in samples of patients with cancer [13] and people with HIV [14]. Using Rasch analysis, we also reduced the full version of the LFS (13 items) to a short version (5 items) with satisfactory validity and reliability [2]. That study showed that the short version yielded fatigue severity ratings similar to those of the full scale for 95% of patients, had sufficient sensitivity to separate the responses into three distinct fatigue groups (low, moderate and high severity), and demonstrated unidimensionality and internal scale validity [2]. In research studies and in clinical practice, short instruments are generally preferred in order to minimize the burden on participants and maximize adherence to the protocol.

In order to evaluate whether the LFS is suitable for use as a generic measure of fatigue severity, there is a need for further exploration of the psychometric properties of the LFS. The aim of this study was to evaluate the psychometric properties of the Norwegian 5-item version of the LFS in two different patient populations, adults with stroke and adults with osteoarthritis.

Methods

Design

This study has a cross-sectional design and includes initial pre-intervention data from two longitudinal studies, a multicentre randomised controlled trial evaluating the effect of an intervention promoting psychosocial well-being following stroke [10] and a longitudinal study investigating pain and other symptoms in patients with osteoarthritis undergoing total knee arthroplasty [15].

Participants and setting

Stroke sample

A total of 322 adult stroke survivors recruited from 11 acute stroke or rehabilitation units in university hospitals and other local hospitals providing acute care in Norway consented to participate in the trial [10]. The inclusion criteria were: adults ≥ 18 years of age, acute stroke within 4 weeks prior to inclusion, medically stable, sufficient cognitive functioning to participate (assessed by their physician/stroke team), able to understand and speak Norwegian, and able to give informed consent. Exclusion criteria were moderate to severe dementia, and serious somatic or psychiatric disease, as these conditions would likely have interfered with the ability to provide informed consent or fully participate in the study protocol.

Initial pre-intervention data were collected in structured in-person assessment interviews within 6 weeks after stroke onset. The data collector recorded the participant’s responses consisting of demographics and 5 items from the LFS [16], in addition to other PROMs included in the trial [10]. Psychometric properties of the LFS at baseline will be reported in this study.

Osteoarthritis sample

A total of 203 patients with osteoarthritis who were admitted for total knee arthroplasty at a surgical clinic in Oslo, Norway were included in the study. The inclusion criteria were: adults ≥ 18 years of age, ability to read, write and understand Norwegian, and scheduled for unilateral primary total knee arthroplasty. Patients undergoing unicompartmental or revision surgery were excluded. A comprehensive description of the study participants and data collection measures has been reported elsewhere [15].

The initial data collection was performed prior to surgery, after admission to the hospital. Patients independently completed paper questionnaires assessing demographic characteristics, 5 items from the original LFS and several other measures included in the osteoarthritis study [15]. Psychometric properties of the LFS at baseline will be reported in this study.

Lee Fatigue Scale

Fatigue severity was measured in both samples using the same 5-item LFS. The following items with the anchor phrases from the original 13-item version [16] were used: item 1 “not at all tired” to “extremely tired”, item 4 “not at all fatigued” to “extremely fatigued”, item 5 “not at all worn out” to “extremely worn out”, item 16 “carrying on a conversation is no effort at all” to “carrying on a conversation is a tremendous chore”, and item 17 “I have absolutely no desire to close my eyes” to “I have a tremendous desire to close my eyes”. All items were rated on a numeric rating scale from 0 to 10; higher scores indicate higher fatigue severity. Items 1, 4 and 5 were also included in a short version of the LFS evaluated through Rasch analysis among women living with HIV [2]. Although items 16 and 17 were not included in that previous short form, they were found to support the scale’s unidimensionality and internal scale validity when the original 13-item version of the LFS was assessed among women with HIV [14].

Statistical analysis

The analysis of the LFS was guided by the use of a Rasch rating scale model [17]. The Rasch model is a confirmatory model in which the data have to meet the model requirements to form a valid and unidimensional measurement scale, whereas other item response theory (IRT) models are exploratory models aiming to describe the variance in the data. Due to a technical error in the scoring of the LFS in the stroke sample, scores of 0 and 1 were both recorded as 1. To obtain a similar rating scale for both samples, scores of 0 in the osteoarthritis sample were recoded as 1 so that both samples were scored on a rating scale of 1–10. Thus, the original rating scale of 0–10 was transformed to a 10-level rating scale for both samples. This rating scale has been used successfully in earlier Rasch analyses of the LFS [14]. The transformed 10-category raw scores from the 5-item LFS were analyzed using the WINSTEPS Rasch computer software program version 3.91.0.0 [18]. The analyses were performed using a systematic stepwise approach similar to that used in previous studies [12, 19, 20].
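The recoding described above can be illustrated with a minimal Python sketch. The function name and the example responses are hypothetical and are not taken from the study data; the sketch only shows how collapsing 0 into 1 turns the 0–10 scale into a 10-level scale.

```python
def recode_to_1_10(score):
    """Collapse a rating of 0 into 1 so that a 0-10 scale becomes a 1-10 scale."""
    if not 0 <= score <= 10:
        raise ValueError(f"LFS item score out of range: {score}")
    return max(score, 1)

# Hypothetical responses from one participant on the five items (0-10 scale)
raw = [0, 3, 1, 10, 0]
recoded = [recode_to_1_10(s) for s in raw]
print(recoded)  # [1, 3, 1, 10, 1]
```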

First, an evaluation of the psychometric properties of the rating scale was conducted. The criterion used was that the average measures for each response category on each item should advance monotonically, as evidenced by an Outfit Mean Square (MnSq) value of less than 2.0 for each of the step calibrations [21].
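As an illustration of this first criterion, the following Python sketch checks whether a set of category average measures increases from one category to the next and whether the accompanying Outfit MnSq values stay below 2.0. The function names and all numbers are hypothetical; in practice these values would be read from the Rasch software output.

```python
def advances_monotonically(average_measures):
    """True if each category's average measure exceeds that of the category below."""
    return all(later > earlier
               for earlier, later in zip(average_measures, average_measures[1:]))

def outfit_acceptable(outfit_mnsq, limit=2.0):
    """True if every Outfit MnSq value is below the chosen limit."""
    return all(value < limit for value in outfit_mnsq)

# Hypothetical average measures (in logits) for rating categories 1 to 10
ordered = [-2.1, -1.4, -0.9, -0.3, 0.2, 0.6, 1.1, 1.7, 2.2, 2.9]
disordered = [-2.1, -1.4, -1.6, -0.3, 0.2, 0.6, 1.1, 1.7, 2.2, 2.9]

print(advances_monotonically(ordered))     # True
print(advances_monotonically(disordered))  # False
print(outfit_acceptable([0.9, 1.1, 1.0, 1.3, 0.8]))  # True
```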

The second step aimed to evaluate the fit of the item responses [17]. Any item that did not show acceptable goodness-of-fit to the Rasch model was removed, and the psychometric properties of the remaining items were re-analyzed until all items demonstrated acceptable goodness-of-fit, defined as Infit MnSq values between 0.7 and 1.3 [22]. In the third step, we evaluated the level of unidimensionality in the generated LFS measures through a principal component analysis (PCA) of the residuals, with the criterion that the first latent dimension should explain at least 50% of the total variance [23].
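To make the item fit criterion concrete, the sketch below computes infit and outfit mean squares for a single item from observed ratings, model-expected ratings and model variances, using the standard definitions (infit: sum of squared residuals divided by the sum of model variances; outfit: mean of squared standardized residuals). The function names and the input values are hypothetical, and treating the 0.7 and 1.3 limits as inclusive is an assumption. In the study itself these statistics were produced by the Rasch software.

```python
def fit_mean_squares(observed, expected, variance):
    """Return (infit, outfit) mean squares for one item across persons.

    observed: rating given by each person
    expected: model-expected rating for each person
    variance: model variance of each person's rating
    """
    sq_resid = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Infit is information-weighted: squared residuals over summed variances
    infit = sum(sq_resid) / sum(variance)
    # Outfit is the unweighted mean of the squared standardized residuals
    outfit = sum(r / w for r, w in zip(sq_resid, variance)) / len(observed)
    return infit, outfit

def acceptable_infit(infit, lower=0.7, upper=1.3):
    """Assumed inclusive range check for the Infit MnSq criterion."""
    return lower <= infit <= upper

# Hypothetical values for four persons on one item
infit, outfit = fit_mean_squares(
    observed=[3, 5, 7, 2],
    expected=[4, 5, 6, 3],
    variance=[1.0, 2.0, 1.0, 2.0],
)
print(infit, outfit)            # 0.5 0.625
print(acceptable_infit(infit))  # False: lower than the acceptable range
```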

The fourth step evaluated aspects of person response validity. The criterion for evaluating person goodness-of-fit was to reject Infit MnSq values of 1.4 or higher or associated with a z-value of 2 or higher, accepting that 5% of the sample may by chance fail to demonstrate acceptable goodness-of-fit without threatening evidence of person response validity [24, 25, 26]. We also examined ceiling and floor effects by determining the number of respondents obtaining minimum and maximum scores on the scale. Up to 10% of the sample with minimum or maximum scores was considered acceptable.
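The floor and ceiling check can be sketched as follows. The function name and the total scores are hypothetical; the sketch assumes the recoded 1–10 rating scale, so that the minimum total for a 5-item scale is 5 and the maximum is 50.

```python
def floor_ceiling(total_scores, n_items, min_rating=1, max_rating=10):
    """Return the proportions of respondents at the minimum and maximum total score."""
    lowest = n_items * min_rating
    highest = n_items * max_rating
    n = len(total_scores)
    floor = sum(1 for s in total_scores if s == lowest) / n
    ceiling = sum(1 for s in total_scores if s == highest) / n
    return floor, ceiling

# Hypothetical total scores on a 5-item scale for ten respondents
totals = [5, 5, 12, 30, 50, 23, 17, 8, 41, 5]
floor, ceiling = floor_ceiling(totals, n_items=5)

print(floor, ceiling)   # 0.3 0.1
print(floor > 0.10)     # True: floor effect exceeds the 10% criterion
print(ceiling > 0.10)   # False: ceiling effect is within the criterion
```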

In the fifth step, Differential Item Functioning (DIF) analyses were performed in order to evaluate the stability of the LFS response patterns in relation to diagnosis (stroke or osteoarthritis) using the Mantel-Haenszel statistic for polytomous scales with log-odds estimators [27, 28], as reported from the WINSTEPS program. Statistics with Bonferroni-corrected p-values < 0.01 were considered indicative of DIF.
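One way to apply such a threshold is sketched below, under the assumption that each item's p-value is multiplied by the number of items tested and then compared with 0.01. The item labels and p-values are invented for illustration and are not study results; the Mantel-Haenszel statistics themselves are not computed here.

```python
def bonferroni_dif_flags(p_values, alpha=0.01):
    """Flag items whose Bonferroni-adjusted p-value falls below alpha.

    p_values: dict mapping an item label to its unadjusted p-value
    """
    m = len(p_values)  # number of comparisons (one per item)
    adjusted = {item: min(1.0, p * m) for item, p in p_values.items()}
    return {item: p_adj < alpha for item, p_adj in adjusted.items()}

# Invented p-values for five items labelled A to E
flags = bonferroni_dif_flags(
    {"A": 0.20, "B": 0.35, "C": 0.04, "D": 0.0004, "E": 0.001}
)
print(flags)
# {'A': False, 'B': False, 'C': False, 'D': True, 'E': True}
```

An equivalent formulation compares the unadjusted p-values with 0.01 divided by the number of items.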

The last two steps assessed several aspects of the scale’s reliability. In the sixth step, person separation reliability (i.e., ability to separate participants into distinct fatigue groups) was evaluated by calculating the scale’s person separation index [29]. The separation index reflects the number of statistically different groups that the scale can identify in the sample, considering the range and precision of individual person estimates. An index above 1.5 is required to ensure that the scale can differentiate people with at least two different levels of fatigue. Lastly, in the seventh step, we assessed the scale’s internal consistency reliability by reporting both Cronbach’s alpha reliability coefficient of the raw scores and Rasch-equivalent person reliability coefficient for the final unidimensional scale, as well as the Pearson correlation coefficient between the LFS sum scores and the Rasch-generated measures. Coefficients > 0.80 indicated acceptable internal consistency reliability.
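The reliability quantities in these two steps can be illustrated with a short Python sketch. It uses three standard formulas: Cronbach's alpha from item and total score variances, the separation index G = sqrt(R / (1 - R)) for a reliability coefficient R, and the commonly cited strata formula (4G + 1) / 3. The function names and the small data set are hypothetical, and the strata formula is shown only as one common way to translate a separation index into a number of distinguishable levels, not necessarily the calculation used in this study.

```python
from math import sqrt
from statistics import variance

def cronbach_alpha(responses):
    """responses: list of per-person lists, one rating per item."""
    k = len(responses[0])
    item_vars = [variance([person[i] for person in responses]) for i in range(k)]
    total_var = variance([sum(person) for person in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def separation_from_reliability(r):
    """Separation index G corresponding to a reliability coefficient r."""
    return sqrt(r / (1 - r))

def strata(separation):
    """Number of statistically distinct levels implied by a separation index."""
    return (4 * separation + 1) / 3

# Hypothetical ratings: three persons answering three items identically
alpha = cronbach_alpha([[1, 1, 1], [5, 5, 5], [9, 9, 9]])
print(round(alpha, 3))                             # 1.0
print(round(separation_from_reliability(0.8), 3))  # 2.0
print(round(strata(1.5), 2))                       # 2.33
```

With this formula, a separation index of 1.5 corresponds to about 2.3 strata, which is in line with the requirement that at least two levels of fatigue can be differentiated.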

In addition to the steps described above for psychometric analysis of the LFS, characteristics of the study samples were summarized and compared using SPSS statistical software, version 25 [30]. Differences in means and standard deviations (SD) between the two patient samples were assessed with independent sample t-tests for continuous and normally distributed variables. Categorical variables were assessed using frequencies and percentages, and the stroke and osteoarthritis samples were compared using Chi-square analysis. P-values < 0.05 were considered statistically significant.
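As an illustration of the comparison of categorical variables, the sketch below computes a Pearson chi-square statistic for a 2 × 2 table in plain Python, without continuity correction. The counts are hypothetical and do not come from Table 1; the critical value 3.841 is the standard chi-square threshold for one degree of freedom at the 0.05 level. The study itself used SPSS for these comparisons.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical counts: rows are the two samples, columns are two categories
table = [[30, 20],
         [20, 30]]
chi2 = pearson_chi_square(table)

print(chi2)          # 4.0
print(chi2 > 3.841)  # True: exceeds the critical value for df = 1, alpha = 0.05
```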

Results

Characteristics of the two patient samples are described in Table 1. Compared with the sample of stroke patients, the osteoarthritis sample had a larger proportion of women, had more education, and was more likely to be employed.

Table 1 Fatigue scores and demographic characteristics of the two patient samples and the overall sample

The LFS rating scale demonstrated acceptable outcomes in relation to the established criteria. In addition, all ten rating scale steps were used with a frequency above 100 scores for all scale steps. When analyzing the infit mean square statistics for the five included items, only one item out of five demonstrated acceptable goodness-of-fit (see Table 2). Two of the LFS items demonstrated higher than acceptable MnSq statistics (#16 carry on a conversation and #17 close eyes), and two items demonstrated lower than acceptable MnSq (#1 tired and #4 fatigued). Because items with higher fit statistics are a greater threat to unidimensionality compared to items with lower fit statistics, the two items with higher than acceptable MnSq (#16 and #17) were initially excluded and the remaining three items (#1, #4, and #5) were re-analysed. In the second iteration, the three remaining items demonstrated an acceptable range for goodness-of-fit to the model.

Table 2 Overview of the statistical approach, criteria, and results of the Rasch analysis of the LFS short form scale when used with people with stroke and osteoarthritis (n = 525)

The unidimensionality of both the 5-item and 3-item LFS scales was also acceptable, as the first latent variable accounted for 63.2% and 81.6% of the variance in the fatigue scores, respectively. Additionally, the proportion of the sample demonstrating misfit to the Rasch model for the 5-item LFS scale (4.6%) was within the set criterion of < 5% and was close to the criterion for the 3-item scale (5.6%). The number of respondents with maximum scores indicated negligible ceiling effects, but the number of respondents with minimum scores on both the 5-item and 3-item scales indicated a moderate floor effect.

The DIF analysis indicated that two of the misfitting items showed DIF in relation to disease group, but once those two items were removed, the 3-item scale revealed no systematic differences in relation to diagnosis across any of the three remaining LFS items. The separation index of the LFS scale also increased from 1.82 to 2.49 after deleting the two items demonstrating misfit, indicating the 3-item scale was able to distinguish three statistically distinct groups of fatigue. Measures of internal consistency indicated that the 3-item scale met all set criteria and performed better than the 5-item LFS scale. See Table 2 for a summary of the findings.

Discussion

Findings from our study showed that a 3-item version of the LFS had better psychometric properties than the 5-item version. The 3-item LFS showed unidimensionality, accounted for a large proportion of explained variance, and was able to differentiate three statistically distinct fatigue severity groups. In a previous psychometric evaluation in a sample of women with HIV [14], all 13 items in the original LFS met the set criteria for item goodness-of-fit (internal validity) and explained 52.1% of the variance in scores. However, in studies of fatigue in which participants often experience a sense of exhaustion, lack of energy or tiredness distinct from sleepiness, shorter scales that are less burdensome for patients to complete would be preferable.

As shown in this analysis, scales with more items are not always better. They may lack unidimensionality, which poses challenges to the generation of meaningful total scores and may indicate that the use of subscales is warranted. Moreover, the inclusion of poorly performing items may actually reduce the ability of the scale to distinguish distinct groups based on level of severity, as occurred in this analysis, in which the 5-item scale could only distinguish two severity groups, while the 3-item scale was able to distinguish three severity groups. Even though these groups are based on statistical calculations, future studies could explore the clinical relevance and potential cut-offs for determining such group allocations. This could be a logical step for future research now that there is evidence that the 3-item scale is sensitive enough to detect statistically distinct groups.

Another interesting aspect is that the items in the 3-item version all demonstrated acceptable goodness-of-fit within the set ranges, indicating that the response patterns all contribute to the underlying measure, without evidence of over- or underfit as in the 5-item version, supporting validity evidence of internal structure. The three remaining items (Worn out; Tired; Fatigued) are conceptually more similar than the excluded items that involve interactions or behaviors (Conversation; Close eyes), also indicating evidence of validity of the test concept.

There were moderate floor effects for both the 3- and 5-item LFS, but the 3-item version had a higher mean value than the 5-item version. The low severity scores may have been due to our reduction of the rating scale from 0–10 to 1–10. However, other studies on the LFS have also reported floor effects [13], so this may be an issue with the LFS regardless of the slightly modified rating scale. Another likely explanation for the moderate floor effects is that many of the patients in these two samples were not experiencing fatigue or were limiting their activity to minimize their fatigue.

The 3-item LFS was not biased by diagnosis, as indicated by the lack of DIF and similar response patterns across two different patient groups. Combined with the results of prior studies among patients with cancer and HIV [13, 14], this study provides additional evidence that short versions of the LFS can be used as a generic PROM for fatigue. In particular, the finding that the items retained in this 3-item version of the LFS were also retained in a 5-item version of the scale evaluated among women living with HIV [2] suggests some degree of consistency, even across different patient populations and across different languages.

One challenge resulting from the use of different short versions of the LFS is how to compare scores from versions containing different items. One solution might be to use Rasch analysis to generate a stable and disease-generic item hierarchy that can be used to select subsets of items for specific studies and still generate comparable measures through conversion tables or computer-adaptive testing (CAT) procedures. Even though some similarities in the item hierarchies from our earlier studies occur in relation to these findings [14], more in-depth analyses with larger samples are required in order to establish such a disease-generic item pool.

Based on the findings from this study, the idea may arise that a single “perfect” fatigue item could perform as well as multi-item versions assessing subtly different aspects of fatigue. Although this could be explored in this sample and others where the LFS has been evaluated using Rasch analysis [13, 14], another body of evidence from qualitative research suggests that multiple items may be necessary, as the phenomenon of fatigue may be perceived by patients in multiple and complex ways [31, 32, 33]. Thus, the balancing act between developing psychometric excellence and measuring fatigue’s complex manifestations is likely to continue.

The translation of any PROM requires the use of stringent procedures to ensure conceptual equivalence in the new translation compared with the original language version [34]. Conceptual equivalence is closely linked to cultural relevance, as culture is a primary determinant of language [34]. A lack of conceptual equivalence and cultural relevance may lead to a risk of misinterpretation of items or concepts and, consequently, low content validity in the translated version. One challenge with the Norwegian language translation of the 5-item LFS used in this study is that the wording of the items is difficult to differentiate in the Norwegian language. In the stroke study, the LFS was administered as an in-person assessment interview, which helped in delineating the respondents’ understanding of each concept. Some respondents had difficulties in distinguishing items 1 and 5, and this was discussed during the assessment. However, our analysis shows that neither of these items is redundant, as the goodness-of-fit indicates that both should be retained in the 3-item version of the LFS.

Strengths and limitations

The strengths of this secondary analysis included the relatively large samples from two diverse patient populations and the thorough evaluation of the psychometric properties. However, a significant limitation is that this study evaluated only five of the original 13 LFS items, so it remains unclear whether the three items retained in this analysis represent the best three items for inclusion in a brief fatigue severity PROM. It would therefore be interesting to determine whether the 3-item version generated in this analysis outperforms other potential short versions, particularly the 5-item short form developed in the prior study of women living with HIV [2].

In addition, this study evaluated a Norwegian version of the LFS. Translation of the concept of fatigue into Norwegian and perhaps other languages is difficult, since some English words and phrases do not have direct equivalents in Norwegian or other languages. Thus, generalization of the findings from this Norwegian study to other populations must be done with caution.

Finally, the mode of data collection was not identical in the stroke and osteoarthritis samples. The stroke sample was interviewed in person, while the osteoarthritis sample completed the questionnaire independently. Thus, the patients with stroke may have received help in understanding the meaning of the different items, whereas the patients with osteoarthritis were left to interpret the items on their own. Although no DIF was found based on diagnostic group, the differing data collection mode for the two samples may have introduced bias in the interpretation of the items and, thus, may have influenced the results.

Conclusions

The results of this study indicate that a 3-item version of the Lee Fatigue Scale has acceptable psychometric properties and is sufficiently generic for use as a PROM for fatigue severity with patients post-stroke and patients living with osteoarthritis. Future research should be conducted to evaluate the validity of the 3-item version for use among other clinical populations.