Introduction

Fatigue is a common and debilitating symptom for stroke survivors, potentially affecting every part of their daily activities and quality of life [1]. Stroke clinicians are currently advised to systematically assess fatigue in all patients [2, 3]. However, there is insufficient evidence to support the use of any routine treatment or prevention strategies [4, 5], and the pathophysiology of post-stroke fatigue (PSF) is largely unknown [1]. A critical barrier in PSF research is the lack of a patient-reported outcome measure (PROM) for fatigue with sound psychometric properties [6, 7].

There is growing recognition that content validity is the most important measurement property of a PROM [8]. Content validity can be defined as “the degree to which the content of a measurement instrument is an adequate reflection of the construct to be measured” [9]. It is advised that content validity should be demonstrated before evaluating other psychometric properties [10, 11]. However, establishing content validity in PROMs that assess unobservable constructs such as fatigue is challenging and requires several steps involving qualitative methods [9]. In a prior qualitative study [12], we explored the experience of fatigue with stroke survivors and health professionals. The findings resulted in a conceptual framework that outlined PSF as a multidimensional phenomenon. Two important dimensions were fatigue characteristics (e.g., intensity, timing) and fatigue interference (i.e., emotional, cognitive, activity, and social impacts of fatigue). A clear definition of the construct to be measured is a prerequisite for item development [8]. Despite this, recent fatigue measures lack a clear definition, measure heterogeneous aspects of the construct, and lack high-quality evidence of content validity [6].

In stroke research, the Fatigue Severity Scale (FSS) is the most-used PROM for fatigue. However, the FSS lacks evidence of content validity and evidence of its psychometric properties is limited, particularly among people with stroke [6, 13, 14]. For example, the FSS does not assess important features such as mental versus physical fatigue or diurnal variations [6, 15]. Although other, more complex PROMs for fatigue exist, a recent review found no multidimensional fatigue questionnaires had been adequately validated in people with stroke [16]. With the growing recognition of PSF as a unique clinical entity and the increasing use of PROMs as endpoints in clinical studies, there is a clear need to develop a PSF-specific PROM with robust measurement properties that capture multidimensional features of PSF.

After establishing content validity, guidelines also recommend extensive field testing to obtain insight into the structural validity of the data [9, 11, 17]. Structural validity can be defined as “the degree to which the scores of a measurement instrument are an adequate reflection of the dimensionality of the construct to be measured” [18]. Rasch analysis applies a restrictive measurement model that builds a linear interval measure invariant across test-takers, and it is suitable for item reduction and for examining structural validity and item characteristics in detail [9, 19]. Further, Rasch analysis can obtain stable parameter estimates with smaller sample sizes than 2-parameter item response theory models [20]. The fatigue instruments currently used in stroke populations have limited evidence of their content validity; studies have often moved straight to assessing other types of validity, reliability, and responsiveness [13, 16, 21]. This approach could introduce several potential biases in these instruments’ scores [8]. To move forward in PSF research, there is a need for a fatigue PROM designed for and validated in the stroke population and developed following established PROM guidelines [16]. The aim of this study was, therefore, to develop such a measure and evaluate its content validity, structural validity, and internal consistency.

Method

Design

This study had a mixed-methods design involving three iterative steps (Fig. 1), as described by de Vet et al. [9]. An expert panel developed the instrument’s initial items, cognitive interviews were conducted to evaluate content validity, and an online questionnaire was used to collect data for a Rasch analysis guiding item reduction and evaluating structural validity and internal consistency. Reporting of the item development and cognitive interviews follows the consolidated criteria for reporting qualitative research [22], and reporting of the Rasch analysis follows the Rasch reporting guideline for rehabilitation research [23].

Fig. 1 The three iterative steps in the development of the Norwegian FCIM

Item development

We first established an expert panel, using convenience sampling to include a balanced group with diverse backgrounds. The expert panel consisted of three stroke researchers and clinicians from Lovisenberg Diaconal Hospital (LDS), two PROM researchers (one from LDS and one external expert on Rasch analysis), and two stroke patients with fatigue serving as user consultants. The first author (IJS), a nurse and current PhD student, led six meetings (four in person at LDS and two on Zoom), each lasting 1–4 h (with breaks). Between meetings, all group members read and commented on written versions of the instrument. The expert panel developed the items and response categories for the fatigue instrument based on our conceptual framework of fatigue, derived from a previously published qualitative study in which we defined PSF as “an experience of mental, physical or general feeling of exhaustion and tiredness, with a discrepancy between the level of activity and the level of fatigue” [12]. The panel developed the initial instrument informed by PROM development guidelines [8, 24–27]. The aim was to develop an instrument that could measure characteristics and interference (in two subscales), have the potential to measure improvement (for use in intervention studies), assess the level and duration of fatigue experienced prior to stroke (i.e., pre-stroke fatigue), and be relatively short to ensure feasibility. The expert panel developed version 1.0 of the instrument, named the Fatigue Characteristics and Interference Measure (FCIM), through regular meetings from January through June 2020. We also consulted a speech therapist for advice on adapting the instrument for people with aphasia and/or reading difficulties. Finally, to ensure item clarity and to test the interview technique, we performed two pilot interviews with the user consultants.

Cognitive interviews—developing evidence of content validity

Next, we conducted cognitive interviews with stroke patients. The aim was to establish the content validity of FCIM version 1.0 by evaluating comprehensibility, retrieval, judgment and communication, comprehensiveness, and relevance [26, 28]. A stroke user organization in Norway invited members to participate via text messages, e-mail, and its Facebook page. Inclusion criteria were a stroke within the last 2 years, age over 18 years, and living within driving distance of Oslo. Fifteen stroke survivors were purposively sampled to ensure variability in age, gender, fatigue severity, and time since stroke diagnosis. Face-to-face individual interviews were conducted during August and September 2020; eight in participants’ homes and seven at LDS in Oslo. As part of the interview, participants completed FCIM version 1.0 and a questionnaire including a 7-item version of the FSS (FSS7) [14] and questions about their sociodemographic characteristics and relevant medical history.

Cognitive interviews were conducted with the Three-Step Test-Interview (TSTI) technique [29] and followed a semi-structured interview guide (Online Resource 1). TSTI was developed as an aid to identify problems in newly developed instruments. It consists of the following three steps:

  1. Observing response behavior and concurrent thinking aloud. The interviewer takes notes and observes behavior such as hesitation and correction of response category. The participants are also instructed to think aloud and verbalize their thoughts when filling out the instrument. The aim is to make the participants’ immediate thoughts about the instrument observable for the interviewer.

  2. Follow-up probing based on the behavior or expressed thoughts collected in step 1, where the aim is to clarify and complete only the primary data previously collected.

  3. Debriefing aimed at eliciting experiences and opinions, such as potential problems, possible improvements, and the instrument’s completeness.

Audio recordings and field notes were taken during the interviews. Recordings were transcribed verbatim and subjected to a deductive content analysis facilitated by NVivo 12 [30]. Each item was analyzed separately, using a categorization matrix based on Tourangeau’s four-stage cognitive model [28], which includes comprehension, retrieval, judgment, and response. Finally, we assessed the completeness of the instrument as a whole. We made changes to the instrument after 10 interviews, and again after completing all 15 interviews, resulting in FCIM version 2.0.

Rasch analysis—developing evidence of structural validity and internal consistency

We conducted a cross-sectional study with a convenience sample of stroke patients (N = 169). Participants were recruited through the website of a stroke user organization. Inclusion criteria were adults with a self-reported stroke diagnosis who could read Norwegian. Data were collected between January and March 2021. Participants responded to an online questionnaire including sociodemographic information, relevant medical history, FSS7, and FCIM version 2.0.

FCIM version 2.0 was analyzed using a Rasch model, which calculates the probability of a specified response for both persons and items along the same linear scale (representing the latent trait). This enables transformation of ordinal raw scores into an interval-level variable (expressed in logits). Winsteps (version 5.2.0.0), R (version 4.1.2), and SPSS (version 28.0.0.0) were used to conduct statistical analyses and generate graphs. Since the FCIM is a new instrument and includes items with different response categories, we applied a Partial Credit Model (PCM), which makes no assumption that thresholds are equidistant across items [31]. We then assessed rating scale functioning according to Linacre’s guidelines to determine whether the scale was suitable for Rasch analysis [32] (Table 1). The Rasch analysis addressed two main aims: item reduction and evaluation of the FCIM’s structural validity and internal consistency. This process involved repeated analyses (iterations) of each subscale. Following each iteration, the measurement properties were evaluated and items not meeting pre-established criteria were removed. Iterations were repeated until the results were satisfactory, as outlined in Table 1.
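
To make the PCM concrete, the following minimal Python sketch (for illustration only; our analyses used Winsteps) computes the model’s category probabilities for a single item. The threshold values in the usage line are hypothetical, chosen only to illustrate a 5-point item.

```python
import numpy as np

def pcm_category_probs(theta, deltas):
    """Partial Credit Model category probabilities for one item.

    theta  : person location in logits
    deltas : the item's m step difficulties (Andrich thresholds) in logits,
             which the PCM allows to differ from item to item
    Returns an array of m + 1 probabilities, one per response category.
    """
    # Category x gets exp of the cumulative sum of (theta - delta_k) up to x;
    # category 0 corresponds to an empty sum (= 0).
    cumsums = np.concatenate(([0.0], np.cumsum(theta - np.asarray(deltas))))
    expd = np.exp(cumsums - cumsums.max())  # subtract max for numerical stability
    return expd / expd.sum()

# Hypothetical 5-point item with four thresholds; person located at 1 logit.
print(pcm_category_probs(1.0, [-1.5, -0.5, 0.5, 1.5]))
```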

Table 1 Overview of measurement properties and pre-set criteria assessed for and through Rasch analysis

Rasch analysis uses fit statistics to estimate how well the items’ and persons’ raw data fit the model assumptions [33]. Fit statistics are presented as unstandardized infit and outfit mean squares (MNSQ) and as standardized fit statistics (z-values). The MNSQ residuals show the degree of randomness; an MNSQ value of 1.0 indicates perfect fit. Values less than 1.0 indicate overfit to the model (i.e., the observations are too predictable), while values greater than 1.0 indicate underfit (i.e., there is more randomness in the data than the Rasch model expects). Consistent with earlier empirical studies [14], we evaluated standardized infit statistics because they are more sensitive to unexpected response patterns on items targeted to the person. This contrasts with outfit, which is more sensitive to unexpected observations on very easy or hard items, with high outfit MNSQ often resulting from a few random responses by low performers [33, 34].
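
As an illustration of these definitions, below is a minimal numpy sketch of the standard infit and outfit computations, assuming persons-by-items matrices of observed scores, model-expected scores, and model variances (such as can be exported from Rasch software; this is not the Winsteps code used in the study).

```python
import numpy as np

def mnsq_fit(observed, expected, variance, axis=0):
    """Unstandardized infit and outfit mean squares.

    observed, expected, variance : persons x items arrays of raw responses,
    model-expected scores, and model variances.
    axis=0 gives item statistics; axis=1 gives person statistics.
    """
    sq_resid = (observed - expected) ** 2
    outfit = (sq_resid / variance).mean(axis=axis)              # unweighted mean square
    infit = sq_resid.sum(axis=axis) / variance.sum(axis=axis)   # information-weighted
    return infit, outfit
```

The weighting explains the sensitivity difference described above: infit down-weights responses far from an item’s location, whereas outfit gives every squared standardized residual equal weight.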

Structural validity

Each subscale’s unidimensionality was assessed separately using principal components analysis (PCA) of the residuals [35]. Although the conceptual framework of fatigue is multidimensional, the statistical unidimensionality of each subscale/dimension needs to be ensured. Local dependency between items was evaluated with Yen’s Q3 statistic, which computes raw-score residual correlations [36]. Local independence means that the variance left after removing the contribution of the latent trait is only random, normally distributed noise [35]. If two items are locally dependent, it indicates either that they contribute to some other dimension, or that they duplicate some feature of each other (redundancy-dependency) [35].
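
For illustration, a minimal sketch of the Q3 computation, under the same assumptions as the fit sketch above (persons-by-items matrices of observed and model-expected scores):

```python
import numpy as np

def yen_q3(observed, expected):
    """Yen's Q3: pairwise correlations between item residuals after the
    latent-trait contribution has been removed."""
    residuals = observed - expected
    q3 = np.corrcoef(residuals, rowvar=False)  # items x items correlation matrix
    np.fill_diagonal(q3, np.nan)               # self-correlations are not informative
    return q3
```

Item pairs whose Q3 value exceeds the pre-set critical value (Table 1) are flagged as locally dependent.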

Internal consistency

Targeting of persons was reported with the mean measure score (theta) for persons and the mean standard error. Well-targeted measures have similar mean locations for persons and items. A positive mean value for persons indicates higher levels of fatigue compared to the average of the scale (set at 0), and a negative value indicates lower levels of fatigue [37]. We also reported Wright item maps, which display on the same scale both the individual participants’ ability measures and the individual items’ difficulty calibrations (including the step calibrations [Andrich thresholds]) [35]. Precision was evaluated using the person and item separation indices with their associated reliabilities [35]. Consistency in item correlations was assessed with the Kuder-Richardson Formula 20 (KR-20).
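
Two of these quantities have simple closed forms, sketched below for illustration. Note that KR-20 strictly applies to dichotomously scored items; for polytomous items such as the FCIM’s, Cronbach’s alpha is the analogous statistic, so the dichotomous function here is purely illustrative.

```python
import numpy as np

def separation_index(reliability):
    """Separation index G from separation reliability R: G = sqrt(R / (1 - R)).
    For example, a reliability of 0.8 corresponds to a separation of 2.0."""
    return np.sqrt(reliability / (1.0 - reliability))

def kr20(scores):
    """Kuder-Richardson Formula 20 for a persons x items matrix of 0/1 scores."""
    k = scores.shape[1]
    p = scores.mean(axis=0)                      # per-item proportion endorsing
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of person sum scores
    return (k / (k - 1)) * (1 - (p * (1 - p)).sum() / total_var)
```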

Score-to-measure conversion

Rasch analysis converts the raw scores to interval data (logits). To facilitate clinical interpretation of FCIM, we provide score-to-measure tables for both subscales.
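
In use, a score-to-measure table is a simple lookup, as in the sketch below. The two endpoint values are the Characteristics-subscale extremes reported in the Results; the intermediate entries would be taken from Online Resource 6 and are omitted here.

```python
# Raw sum score -> Rasch person measure (logits) for the Characteristics subscale.
score_to_measure = {
    6: -5.96,   # minimum possible raw score
    # ... entries for raw scores 7-29 from Online Resource 6 ...
    30: 6.37,   # maximum possible raw score
}

def raw_to_logits(raw_sum):
    """Convert a complete raw sum score to its interval-level Rasch measure."""
    return score_to_measure[raw_sum]
```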

Person fit

The groups of persons with misfitting MNSQ and z-values were compared to the rest of the sample with respect to age, gender, and fatigue (FSS severe fatigue vs. no/mild/moderate fatigue combined), using Student’s t test or Fisher’s exact test, as appropriate.
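
A minimal sketch of these comparisons with scipy (the ages and counts below are hypothetical placeholders, not study data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
age_misfit = rng.normal(62, 10, size=10)   # hypothetical ages, misfitting persons
age_rest = rng.normal(60, 10, size=159)    # hypothetical ages, rest of the sample

t_stat, p_age = stats.ttest_ind(age_misfit, age_rest)  # Student's t test (continuous)

# Gender (or severe vs. non-severe fatigue) as a 2x2 contingency table:
# rows = misfit / rest, columns = e.g., female / male (hypothetical counts)
table = [[6, 4], [90, 69]]
odds_ratio, p_gender = stats.fisher_exact(table)
```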

Uniform differential item functioning (DIF)

DIF analysis was conducted to investigate whether previously known subgroups (gender and age [1, 38]) had significantly different responses to items despite equal levels of the underlying trait [35].
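
As a sketch of what a uniform DIF test computes: the item is calibrated separately in each subgroup, and the difference between the two difficulty estimates is tested against its joint standard error (a contrast-style test of this kind underlies the DIF tables in Winsteps; the degrees of freedom, supplied by the caller here, would in practice follow a Welch-type approximation).

```python
import numpy as np
from scipy import stats

def uniform_dif(d1, se1, d2, se2, df):
    """Uniform DIF contrast between two subgroups.

    d1, se1 : item difficulty estimate and standard error in subgroup 1 (logits)
    d2, se2 : the same quantities in subgroup 2
    df      : degrees of freedom for the t test
    """
    contrast = d1 - d2
    t = contrast / np.sqrt(se1**2 + se2**2)
    p = 2 * stats.t.sf(abs(t), df)  # two-sided p value
    return contrast, t, p
```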

Ethical considerations

The study was conducted following the Declaration of Helsinki and was approved by the Regional Medical and Health Ethics Committee of Southeastern Norway (REK) (reference 2017/1741). All participants provided written informed consent.

Results

Instrument development by expert panel

First, the expert panel generated 160 items related to our conceptual framework [12]. The panel then selected the best-worded and most relevant, comprehensive, and discriminating items [18]. The FCIM is based on a reflective model and was developed through thorough discussions and according to relevant guidelines [8, 26]. Based on the conceptual framework, the instrument was divided into two subscales: Characteristics and Interference. This initial version consisted of our definition of PSF and 32 items with a 7-day recall period. The Characteristics subscale had 10 items with a 5-point Likert scale ranging from “not at all” to “very much.” Three item pairs (fatigued, items 1 and 2; mental fatigue, items 3 and 4; and physical fatigue, items 5 and 6) were designed to ask the same question with slightly different wording (Table 2). We included all three item pairs so that the next steps could reveal which wording stroke patients preferred. The Interference subscale had 20 items with a 5-point Likert scale ranging from “never” to “all the time.” In addition, we included two pre-stroke fatigue items to help distinguish post-stroke fatigue from pre-existing levels of fatigue. Based on the speech therapist’s input, we used bold font for essential words in each item. This process resulted in FCIM version 1.0.

Table 2 Version 2.0 of the Norwegian Fatigue Characteristics and Interference Measure (FCIM)

Content validity testing by cognitive interviews

FCIM version 1.0 was then assessed in 15 cognitive interviews with stroke patients. Characteristics of the participants are presented in Table 3. The interviews lasted between 24 and 92 min (mean 53). Details of the cognitive interview results are displayed in an item-tracking matrix (Online Resource 2). Preliminary analysis after the first 10 interviews showed difficulties with comprehension of the PSF definition, and minor difficulties with comprehension and judgment in seven items in the Interference subscale. We edited the instrument and presented the updated version in the next five interviews, resulting in improved understanding of these items and the PSF definition. After 15 interviews, we changed the ordering of items in the Interference subscale and removed two items due to comprehension problems and lack of relevance. We also flagged five items (2, 4, 6, 27, and 28) because of similarity, comprehension, and judgment issues. Despite these potential issues, we decided to keep the flagged items temporarily and investigate their performance in the Rasch analysis. This resulted in FCIM version 2.0 with 10 Characteristics items, 18 Interference items, and two pre-stroke fatigue items (Table 2). Except for the five flagged items, we found no significant problems relating to comprehension, judgment, relevance, or completeness of the items.

Table 3 Patient characteristics for the cognitive interview and Rasch analysis samples

Evaluating structural validity and internal consistency with Rasch analysis

FCIM version 2.0 was further evaluated in a sample of 169 patients with stroke who responded to an online questionnaire (Table 3). There were no missing data. First, we evaluated the functioning of both subscales against Linacre’s guidelines (Table 1) [32]. Both subscales fulfilled the criteria of a unimodal distribution with peaks in the center, more than 10 observations in each category, outfit MNSQ < 2.0, and step calibrations (Andrich thresholds) that advanced monotonically by between 1 and 5 logits (Online Resources 3 and 4).

Characteristics subscale

The first iteration with all 10 items revealed misfit in item 10 (evening fatigue), so we removed this item (Table 4). The second iteration displayed overfit in item 2 (exhausted), which was also removed. In the third iteration, all eight items demonstrated acceptable fit to the Rasch model. We then assessed the dimensionality of the remaining eight items with a PCA of the residuals. The variance explained by the latent trait was just above 55%; however, the eigenvalue of the first residual component was slightly elevated, justifying further investigation. Residual correlations between items 3 and 4 (mental fatigue), as well as items 5 and 6 (physical fatigue), were above the critical value. This was expected, since these item pairs were almost identical and had been flagged in the cognitive interviews. We removed items 4 and 6 and re-ran the analysis. Removing these locally dependent items improved the results, indicating unidimensionality (Table 4; Online Resource 5). The remaining six items also demonstrated evidence of local independence, with no positive residual correlations. The mean person measure was about 1 logit higher than the mean item measure. The Wright map (Fig. 2) shows that the subscale works across different levels of fatigue in this sample. In addition, the 6-item subscale demonstrated acceptable KR-20 and person separation (Table 4). Slightly exceeding our criterion, 10 persons (5.9%) demonstrated misfit to the Rasch model. However, the misfitting group did not differ significantly from the rest of the sample in age, gender, or fatigue. Uniform DIF was not detected in relation to gender, but significant DIF was found for item 8 (fatigued around noon) in relation to age: the age group 61–83 was more likely to agree with this item than the age group 45–60 (p = 0.0064). The final 6-item Characteristics subscale demonstrated evidence of good structural validity and internal consistency. Characteristics subscale raw scores range from 6 to 30 and correspond to Rasch person measures of −5.96 to 6.37 logits. A score-to-measure table is provided in Online Resource 6.

Table 4 Overview of Rasch analysis results at each step in the iterative item removal process
Fig. 2 Wright map displaying the 6-item Characteristics subscale. The left side displays individual participants’ ability measures (based on their mean logits), presented both for the total sample and separately for females and males. The right side displays the individual items’ difficulty calibrations, including the difficulty of each step calibration (Andrich thresholds). Both the participants’ ability measures and the individual item difficulty calibrations are spaced along the common vertical axis, with the logits presented on the right side [35]

Interference subscale

Item goodness-of-fit statistics from the first iteration of the 18-item Interference subscale indicated misfit in items 15 (bath/shower) and 16 (dressed/undressed) (Table 4). After removal of these two items, the remaining 16 items demonstrated acceptable fit, with infit MNSQ values that met our specified criteria. PCA showed that 62% of the variance was explained by the latent trait; however, as the eigenvalue of the first residual component was greater than 2 and some of the residual item correlations were above 0.3 (indicating locally dependent items), further investigation was justified. Items 14 (gathering thoughts) and 18 (completing tasks) each demonstrated a higher-than-expected residual correlation with two other items (Table 4); thus, we removed items 14 and 18. We additionally removed items 27 (hobbies) and 28 (pleasant activities), since the cognitive interviews had flagged them as redundant from a content validity perspective. In the final subscale, the mean person measure was 0.55 logits higher than the mean item measure, and the Wright map (Fig. 3) shows that the subscale works well across all levels of fatigue in this sample. The final Interference subscale also had a high person separation index and reliability. Fourteen persons (8.28%) demonstrated misfit to the Rasch model on the Interference subscale, which exceeded our criterion (Table 4). However, the misfitting group did not differ significantly from the rest of the sample in age, gender, or fatigue. No significant uniform DIF was found in relation to gender or age groups. Interference subscale raw scores range from 12 to 60 and correspond to Rasch person measures of −7.54 to 8.09 logits. A score-to-measure table is provided in Online Resource 7. In sum, the final 12-item Interference subscale demonstrated good fit to the Rasch model, good structural validity, and internal consistency (Table 4; Online Resource 8).

Fig. 3 Wright map displaying the 12-item Interference subscale. The left side displays individual participants’ ability measures (based on their mean logits), presented both for the total sample and separately for females and males. The right side displays the individual items’ difficulty calibrations, including the difficulty of each step calibration (Andrich thresholds). Both the participants’ ability measures and the individual item difficulty calibrations are spaced along the common vertical axis, with the logits presented on the right side [35]

A Pearson correlation indicated a strong positive relationship (r = 0.77, p < 0.001) between each individual’s Rasch measures from the 6-item Characteristics subscale and the 12-item Interference subscale.

Discussion

In this study, we developed and evaluated the Norwegian Fatigue Characteristics and Interference Measure (FCIM), a new 20-item PROM for PSF.

Several previous studies have concluded that fatigue is a multidimensional phenomenon [12, 39]. In this study, based on our conceptual framework and the results of the cognitive interviews, the fatigue dimensions of Characteristics and Interference were separated into two subscales. In clinical settings, this separation is an advantage: the Characteristics subscale can be used in all post-stroke phases, whereas the Interference subscale can only be used after the initial acute phase, once the patient has experienced how fatigue interferes with their life. In addition, differentiating these subscales may support the targeting of different types of interventions. For example, a specific intervention might have no effect on fatigue’s intensity (measured by items in the Characteristics dimension) but could alter fatigue’s interference with the person’s activities. However, the subscales were relatively highly correlated, and future studies with larger and more diverse samples are needed to statistically confirm this two-dimensional structure.

A high-quality PROM needs to be feasible in addition to having evidence of validity and reliability. To avoid respondents losing concentration or becoming fatigued, it is recommended that an instrument not be too extensive or time-consuming [9]. Thus, we aimed for a relatively small number of items in the final FCIM. Three items were removed due to high infit MNSQ values, indicating that responses to these items were less predictable than the Rasch model expected. This might indicate that these items capture an additional construct [35]. For example, a possible explanation of the misfit in item 10 is that evening fatigue can be commonly experienced even by people with low levels of overall fatigue, as it might reflect the normal circadian rhythm of lower energy levels later in the day [40]. Keeping this item could have biased the results. A clinical advantage of using a Rasch model to generate individual measures (in logits) is that these measures are based on the individual pattern of responses, not the sum score. Thus, even for respondents with some missing data (e.g., due to fatigue), a Rasch model can generate a comparable measure for use in clinical practice or research (although with a larger individual standard error).
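
To illustrate this last point, the sketch below estimates a person measure by maximum likelihood from an incomplete response pattern, reusing the illustrative pcm_category_probs helper sketched in the Method section. Extreme (all-minimum or all-maximum) patterns would need special handling and are omitted here.

```python
import numpy as np

def estimate_theta(responses, item_deltas, n_iter=20):
    """Maximum-likelihood person measure from an incomplete response pattern.

    responses   : list of (item_index, observed_category) for answered items only
    item_deltas : dict mapping item_index -> array of step difficulties (logits)
    Missing items are simply absent from `responses`, so the estimate (and its
    larger standard error) is based on the observed pattern alone.
    """
    theta = 0.0
    for _ in range(n_iter):
        score, info = 0.0, 0.0
        for i, x in responses:
            probs = pcm_category_probs(theta, item_deltas[i])
            cats = np.arange(len(probs))
            e = (cats * probs).sum()             # model-expected score
            v = ((cats - e) ** 2 * probs).sum()  # model variance (information)
            score += x - e
            info += v
        theta += score / info                    # Newton-Raphson step
    return theta, 1.0 / np.sqrt(info)            # measure and its standard error
```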

In our previous study, we found that stroke survivors qualitatively described fatigue in a variety of ways [12], a finding also reflected in existing fatigue PROMs, which use a wide range of expressions to capture fatigue [6]. Even slight differences in item wording are known to affect responses [41]. To identify the best fatigue wording, we included three item pairs with slightly different wording in the Characteristics subscale. Based on the cognitive interviews, we flagged these items and retained them for the Rasch analysis. Not surprisingly, the Rasch analysis indicated local dependence or overfit to the model in these items. It seems likely that the residual correlations between these items indicated duplication (redundancy-dependency) rather than multidimensionality [36]. In retrospect, we could have removed these items after the cognitive interviews; however, it is unlikely that keeping them changed our overall results.

Another limitation of our study was the loss of some separation ability due to the subscales’ item reduction, although the instrument’s structural validity and internal consistency increased, as did its feasibility. Person misfit was slightly higher than expected and should be monitored more closely in larger studies, as a larger sample (> 200) would offer more powerful data on persons and context [24]. In addition, we used a convenience sampling method, and despite our sample’s diverse sociodemographic and clinical characteristics, a random sampling method could yield more robust results.

An advantage of our study is that the FCIM was developed based on qualitative data. Our previous review of PROMs used in PSF research showed that existing instruments included items confounded by other post-stroke sequelae [6], such as “Do you feel weak?” Including such items could bias the results in a stroke population. Fatigue characteristics and interference are common in other diagnoses, and the FCIM has the potential to be used across different patient groups after assessment of its measurement properties in those groups.

Conclusion

In this study, we have developed the FCIM, a patient-reported outcome measure for post-stroke fatigue that includes two subscales measuring fatigue’s Characteristics and Interference, plus two pre-stroke fatigue items. This study has shown that the FCIM comprehensively captures the essential experiences of fatigue and thus demonstrates evidence of content validity. Using Rasch analysis on the two separate subscales, we removed misfitting and locally dependent items, which resulted in both subscales demonstrating evidence of structural validity and internal consistency. The findings suggest that a scale with relatively few items (drawn from a larger item pool) can be clinically sufficient to generate valid measures of different experiences of fatigue for a vulnerable target group. Further assessment of the FCIM is still necessary before it can be used in clinical practice and research. The most important next step is to investigate the FCIM’s ability to detect change over time (i.e., responsiveness) and its relationship to other instruments. Responsiveness is especially important for determining whether the FCIM can be used as an outcome measure in intervention studies. Such studies can also serve as a foundation for using the FCIM to support intervention planning, with the aim of minimizing fatigue’s severity and interference in daily life for stroke patients. Future studies should also evaluate the FCIM’s measurement properties in other patient populations, as fatigue characteristics and interference are common outcomes of many diseases and disorders. While the FCIM is currently available only in Norwegian, we aim to translate and cross-culturally validate the instrument in an English-speaking sample.