Fatigue is a common subjective health complaint that significantly affects patients with chronic diseases [1,2,3]. For patients with substance use disorders (SUDs), fatigue is associated with underlying physical and mental health problems—such as hepatitis C virus infection, depression, and an increased risk of suicide [4,5,6]. However, we have scarce evidence about fatigue in the SUD population, and validated questionnaires for surveying fatigue and investigating how it varies over time do not exist for this patient group. To ensure high-quality clinical trials focusing on fatigue, developing a validated and customized fatigue questionnaire aimed for SUD patients is therefore of particular interest.

Patients with SUDs often live in a chaotic situation with extensive medical and psychosocial health challenges, including polysubstance use, substance intoxications and withdrawals, psychiatric comorbidities (e.g., attention deficit hyperactivity disorder, psychosis disorders, or personality disorders), chronic viral hepatitis, financial risk, and temporary living situations [7, 8]. This might make surveying fatigue with questionnaires particularly challenging and might influence the patients’ fatigue experiences and how they respond to questionnaires [9,10,11]. Using simple wording and phrases, avoiding the use of questions which differ in subtle ways which make them appear repetitive, and administering measures that produce reliable and valid scores based on very few questions, might be essential to ensuring reliable results in clinical trials on fatigue in this population.

The nine-item Fatigue Severity Scale (FSS-9) and the Visual Analog Fatigue Scale (VAFS) are two well-known fatigue questionnaires. The evidence for the validity of the FSS-9 has been obtained for a wide range of chronic infectious and neurological diseases—such as hepatitis C virus infection [1], Parkinson’s disease [12], myasthenia gravis [2], systemic lupus erythematosus [13], and stroke [3]; while the VAFS has shown evidence for a good validity and reliability in post-stroke patients [14]. In two validation studies, the VAFS was used alongside the FSS-9 with high correlations [15, 16]. Furthermore, excellent validity and reliability were achieved when the FSS-9 was shortened into a seven-item version [17, 18], while uncertain results were reported when a five-item version was validated [19].

With this validation study we sought to address two gaps in the literature on the assessment of fatigue in patients with SUDs: (1) to evaluate whether the scores produced by a shortened version of FSS-9 can be interpreted similarly to the scores produced by the FSS-9, and (2) to determine whether there is value to adding the VAFS to either the FSS-9 or the shortened version of FSS-9.

We hypothesised that the FSS-9 and VAFS questionnaires can create a short version of the FSS-9 which demonstrates strong evidence of reliability and validity when used with patients who have SUDs.


Data source

We drew data from a nested cohort from the INTRO-HCV trial that collected data on patients with SUDs in Bergen and Stavanger, Norway [20]. We recruited patients receiving opioid agonist therapy in Bergen and Stavanger and patients with SUDs receiving healthcare from the primary health clinics in the municipality of Bergen. This study included all patients in the cohort who had answered the FSS-9 and/or VAFS in the study period from May 2016 to January 2020.

Data collections

All included patients were invited to an annual health assessment, including FSS-9 and VAFS measurements and a survey of their current sociodemographic situation. We collected all data in a health register using data collection software (Checkware®) under the supervision of research nurses.

Study sample

We conducted 917 health assessments of 655 patients, and this included 916 FSS-9 measurements and 915 VAFS measurements during the study period. We defined a measurement as when at least one of the items in the FSS-9 or the VAFS were answered during a health assessment. Baseline was defined as the first health assessment including measurements of the FSS-9 or VAFS when the health assessments were listed chronologically. The FSS-9 and VAFS were completely answered in 914 health assessments. For the remaining three health assessments, one patient only completed the VAFS and not the FSS-9, one only answered five of the nine items on the FSS-9 and did not complete the VAFS, and a third completed the FSS-9 and not the VAFS. Of the 655 included patients, 188 completed two health assessments, while 37 patients completed three health assessments. The time intervals between the annual health assessments varied with a mean of 12 months (standard deviation (SD): 4 months) (Additional file 1). Due to the relatively small number of patients with three health assessments, we used two health assessments when estimating internal consistency reliability and construct validity. For patients with three annual health assessments, we only included the first (baseline) and second health assessments in these analyses.

Measuring fatigue

We used the FSS-9 and VAFS to measure the level of fatigue. The FSS-9 measures fatigue during the past week, and it includes items regarding: mental and physical functioning, motivation, exercise, carrying out certain duties, and interference with work, family, or social life. The VAFS measures the patient’s general experience of fatigue. The FSS-9 was answered on a Likert scale from 1 (no fatigue) to 7 (worst fatigue) and the VAFS was answered by placing a mark on a line from 0 (no fatigue) to 10 (worst fatigue) that represent the fatigue level. The data collection software only allowed valid responses for each question and prompted for responses to unanswered questions before submission in order to minimise missing data. In a previous study, the US-English version of the FSS-9 has been translated into Norwegian by a qualified native Norwegian-speaking translator and back-translated into US-English by an authorised native US-English-speaking translator (Additional file 2) [21].

Statistical analysis

We used Stata/SE 16.0 (StataCorp, TX, USA) for descriptive analysis and IBM SPSS version 24.0 and Mplus version 8.4 for reliability analysis (Cronbach’s α if-item-deleted and Item-Total correlation), for confirmatory factor analysis (CFA), and for linear mixed model (LMM) analysis (Mplus: TwoLevel analysis). The threshold for statistical significance was set to P < 0.05 for all analyses unless otherwise stated.

Internal consistency of the FSS-9 and the shortened version of FSS-9 (FSS-3), and these scales including the VAFS

We calculated the internal consistency of the FSS-9 and this scale including the VAFS at baseline and at the second health assessment. As a part of the validation study, we explored whether there was value in adding the VAFS to the FSS-9 by evaluating if the VAFS added more information than captured by the FSS-9. Cronbach’s α was considered to show good internal consistency if Cronbach’s α was above 0.70 [22,23,24]. We then shortened the FSS-9 by deleting the item that resulted in the highest Cronbach’s α value for the remaining items (alpha-if-item-deleted analysis). The remaining items’ Cronbach’s α coefficients were recalculated, and the next item was deleted. If the remaining scale showed almost equal Cronbach’s α values after removing one or another item, clinical experience was used in the decision of what item we removed. We deleted items that were less adaptable to patients with SUDs, for example, items about employment (unemployment was common in this population) and items with complex phrases and wordings that could be difficult to understand for patients with SUDs when they were intoxicated or went through substance withdrawals. Furthermore, we calculated Cronbach’s α for the VAFS plus the shortened version of FSS-9 (FSS-3) at baseline and at the second health assessment. Due to strong inter-item correlations and good reliability in previous studies evaluating the FSS-9 alongside the VAFS [15, 16], we expected that the VAFS would not provide much added variability than that captured by the FSS-9 and FSS-3 questionnaires at baseline and at the second health assessment.

Longitudinal confirmatory factor analysis for evaluating the fit of the FSS-3 and FSS-9, and these scales including the VAFS

We used CFA models to test the structure of the items in the FSS-3 and FSS-9, and these scales including the VAFS at baseline and at the second health assessment in order to evaluate the relationships between the items and their underlying latent factors [22, 25,26,27,28]. We expected that both the FSS-3 and FSS-9 should support one-dimensional models. The VAFS was added to the FSS-3 and FSS-9 to examine whether the VAFS provided more added variability in fatigue than captured by the FSS-3/FSS-9. This should be indicated by a less than a perfect correlation between the FSS-3/FSS-9 and VAFS. Further, we used longitudinal CFA in order to test measurement invariance for the FSS-3. First, we estimated a free model with all unique parameter values. We then tested for constraints in the model by setting the factor loadings within each item equal to each other at baseline and at the second health assessment. Third, we tested for equality within the residuals over time. The last model constrained the intercept values for the indicators. We used the Wald test to compare model restrictions. All CFA models were evaluated with standard fit measures: χ2, degrees of freedom, p values, Comparative fit index, Tucker Lewis Index, Root Mean Square Error of Approximation with 90% confidence interval, and the probability of close fit. A well-fitted model should have a statistically non-significant χ2, values of Comparative fit index and Tucker Lewis Index should be above 0.95, and Root Mean Square Error of Approximation should preferably be below 0.05 (close fit) [26]. Root Mean Square Error of Approximation above 0.10 is considered to be a poorly fitted model [26]. We used the modification index to explore model improvements if the goodness of fit measures indicated a poorly fitted model (χ2 difference test). We analysed all variables as continuous variables due to the relatively high number of categories in the ordinal variables (FSS-9 items ranged from 1 to 7, and the VAFS ranged from 0 to 10). The CFAs were run using the Robust Maximum Likelihood estimator. According to previous studies showing good reliability and strong inter-item correlations of the FSS-9 and this scale including the VAFS [15, 16], we expected increased support for the FSS-3 reflecting fatigue as one dimension with stronger levels and homogeneity in the factor loadings than for the FSS-9, also when including the VAFS at baseline and at the second health assessment. This should be the consequence of reducing the scale by using the most relevant measurement indicators in this population of respondents. Furthermore, we expected that the longitudinal CFA would support measurement invariance over time for the FSS-3 [26]. Measurement invariance is indicated if each measurement indicator is equally important for the underlying factor over time, with equal factor loadings and intercepts within each item over time. In addition, strict invariance would be supported if residuals for each item are equal over time.

Linear mixed model analysis for evaluating changes in the FSS-3 and FSS-9 sum scores and the VAFS score

We used a LMM analysis (Mplus multilevel modelling: TwoLevel) to evaluate linear changes from baseline in the sum scores of the FSS-3 and FSS-9, the scores in the separate FSS-9 items, and the VAFS score. We included all 917 health assessments. First, we estimated a full random intercept random slope model, which gave us the mean and individual variance in terms of both level and change together with the relationship between level and change [29]. We re-estimated the model as a random intercept fixed slope model if the covariance between the intercept and the slope variance was statistically non-significant. We used the Mplus Maximum Likelihood Robust estimator to correct standard errors for potential deviation from normality [30]. In addition, interclass correlations were estimated. We used full information maximum likelihood in order to use all available measurements. The full information maximum likelihood assumes ‘missing at random’ [31]. Based on separate variables as indicators of the same underlying construction, we expected similar changes over time in the indicators and as seen in the FSS-9 sum score.

Ethics approval and consent to participate

The study was reviewed and approved by the Regional Ethical Committee for Health Research (REC) West, Norway (reference number: 2017/51/REK Vest, dated 29.03.2017/20.04.2017). Each patient provided written informed consent prior to enrolling in the study.


Patient characteristics at baseline

Seventy-one percent were males, and the mean age was 43 years (Table 1). Half had more than primary school as the highest level of education. Eighty-three percent received opioid agonist therapy, and 42% had injected substances in the last 30 days leading up to the health assessment. The mean values of the FSS-9 items varied from 4.43 to 5.38 at baseline (Table 2). For the VAFS, the mean value was 5.19 at baseline. The FSS-9 and VAFS variables were slightly left-skewed (skewness ranged from − 1.14 to − 0.29) and tended towards a flattened distribution (kurtosis ranged from − 1.39 to − 0.09).

Table 1 Baseline characteristics of patients [numbers (n) and percentages (%)]
Table 2 Descriptive information of the Nine-item Fatigue Severity Scale (FSS-9) and Visual Analog Fatigue Scale (VAFS) at baseline (FSS-9: N = 654, VAFS: N = 655)

Internal consistency of the FSS-3 and FSS-9, and these scales including the VAFS

The nine-item Fatigue Severity Scale’s Cronbach’s α was 0.94 at baseline and 0.93 at the second health assessment (Additional file 3). For the FSS-3, retaining items 5–7 from the FSS-9, Cronbach’s α was 0.87 at baseline and 0.84 at the second health assessment. The internal consistency was not substantially affected when we added the VAFS to the FSS-3 and FSS-9 at baseline (FSS-3 plus VAFS: α = 0.87; FSS-9 plus VAFS: α = 0.94), and at the second health assessment (FSS-3 plus VAFS: α = 0.85; FSS-9 plus VAFS: α = 0.93).

Longitudinal confirmatory factor analysis for evaluating the fit of the FSS-3 and FSS-9, and these scales including the VAFS

The results from the CFAs comparing the fit of items in the FSS-3 and FSS-9, and these scales including the VAFS at baseline and at the second health assessment are shown in Table 3. At baseline, the unidimensional model with unique factor loadings, residuals, and intercept values resulted in a borderline fitted model, with a Comparative Fit Index and Tucker Lewis Index below the suggested levels and a Root Mean Square Error of Approximation point estimate just at the level of a poor fit for the FSS-9. The modification index showed that the estimation of the residual covariance between items 2 and 3 improved the model with the best result (Δχ2 = 80.1, degree of freedom (df) = 1, P < 0.001). The factor loadings for the nine items of the FSS-9 in this model were (ranged from items 1 to 9): 0.66, 0.74, 0.77, 0.78, 0.85, 0.79, 0.84, 0.84, and 0.84. The FSS-3 showed a well-fitted model. The FSS-3 and FSS-9 were highly correlated with r = 0.95, P < 0.001, giving 90% explained variance after estimating factor scores. The VAFS, together with the FSS-3 and FSS-9, respectively, gave identical results. The correlations with VAFS were r = 0.68, P < 0.001 (FSS-3) and r = 0.70, P < 0.001 (FSS-9). We obtained relatively similar results for the factor models at the second health assessment for both the FSS-3 and FSS-9 with and without the VAFS. The full FSS-9 version showed the factor loadings to be (ranged from items 1 to 9) 0.59, 0.66, 0.75, 0.69, 0.84, 0.74, 0.85, 0.80, and 0.79. The longitudinal CFA model based on the FSS-3 supported time-invariant equal factor loadings and equal residuals between the baseline and the second health assessment (Fig. 1). The correlation between the models at baseline and at the second measurement was r = 0.52, P < 0.001. A small reduction in the model fit was found if the intercept values were constrained to be equal within each measured item between health assessments. However, this simpler model was still a well-fitted model.

Table 3 Confirmatory factor analysis results at baseline and at the second health assessment
Fig. 1
figure 1

Longitudinal confirmatory factor analysis of the FSS-3 in patients with substance use disorders. FSS-3: Three-item Fatigue Severity Scale. The longitudinal factor analysis structure is shown with the upper circles representing factor 1 and the factor loadings between factor 1 and three separate question items at two different time points (baseline and second health assessment) for the three-item Fatigue Severity Scale [26]. The arrows at the bottom represent the residuals and alfa (α) values for each question item’s standardised mean values. The arrow between the circles represents the correlation (0.52) between the confirmatory factor analyses. The values of 1.86 and 1.78 on the left side of the circles represent the latent variables’ variance

Linear mixed model analysis for evaluating changes in the FSS-3 and FSS-9 sum scores and the VAFS score

The linear mixed model analysis showed considerable intra-individual clustering for the FSS-3 and FSS-9 sum scores, the VAFS score, and the separate items’ scores (Table 4). However, the intraclass correlation coefficient estimated variations of 0.18 to 0.52, and these showed more variation over time in some variables than in others. The random slope and covariance between the intercept and the slope were statistically non-significant for all models. This indicated an equal linear change from baseline at the individual level. The re-estimated linear random intercept fixed slope models showed a small increase in the items 1–4, while no similar change was found in the other items. A mean change was also found in the VAFS variable.

Table 4 Linear mixed model analysis results for the FSS-9, FSS-3 and the Visual analog Fatigue Scale (VAFS)


The present study shows that the FSS-9 can be shortened to the FSS-3, with most of the included variance and validity retained. The value in adding the VAFS to the FSS-3 and FSS-9 did not provide much added variability. Questionnaires that are easily understood and with few items are essential to ensure high completion rates among patients with SUDs. Our findings showed that the full-scale FSS-9 had good internal reliability when tested empirically based on internal consistency. The internal consistency measured for the FSS-3 was nearly as high for the FSS-3 as FSS-9, even though it was just below a suggested 0.90 threshold [24, 32]. This supports the use of the FSS-3 as a patient-reported outcome measure in group-level comparisons, but its use in clinical practice would be controversial. The FSS-3 results from the longitudinal CFA showed a well-fitted unidimensional model with equal factor loadings and equal residuals when comparing baseline to the second health assessment. The FSS-3 and FSS-9 were almost perfectly correlated, which was in line with what we expected considering the homogeneity of the FSS-9. The factor analysis supported longitudinal stability in the indicators and confirmed the longitudinal measurement invariance. Although the fatigue level varied, the LMM analysis showed that the scoring structure was substantially stable and equal over time from baseline, which supported the validity of the indicators. This finding was somehow surprising considering the chaotic life situations and extensive substance use of many patients, which could contribute to substantial individual changes in item scores over time.

The reliability analyses showed high internal consistency for the FSS-3 and FSS-9, and the internal consistency only decreased from 0.94 to 0.87 when reducing the number of items from nine (FSS-9) to three (FSS-3). Cronbach’s α is related to the ratio between the mean covariance and the total variance, and the number of items, which makes the use of α thresholds to some degree arbitrary [22]. Our results showed that the reductions in the Cronbach’s α were accounted for by the number of items and not the covariance/total variance ratio, which is a less problematic reason for this small amount of reduction of internal consistency reliability. A homogenous and substantially equal internal consistency was also found in studies evaluating the FSS-9 in other chronic diseases such as stroke [3], hepatitis C virus infection [1], and multiple sclerosis [19], as well as studies that have shortened the Fatigue Severity Scale from nine to seven items [17, 18, 33]. In the present study, most patients were polysubstance users, of which nearly 40% had injected substances during the past 30 days. This may contribute to substantial changes in medical and psychosocial factors affecting the health assessment, for example, being affected by substances, living temporarily on the street, or having a lack of income, thus making the present highly reliable and very short FSS-3 questionnaire useful in further fatigue surveys.

The confirmatory factor analyses demonstrated unidimensional models for the FSS-9 at baseline and at the second health assessment, which was improved when adding the residual covariance between items 2 and 3. This means that the single-factor model did not fully capture these items’ responses. The explanations for this might be related to similar phrasing and wording [34], as well as the order of items 2 and 3, which might affect patients’ perception and interpretation, and increase their confusion. Moreover, the unidimensional FSS-9 factor model was in line with models reported in other studies that have validated the FSS-9 [2, 21, 33]. In those studies, however, small study samples have been a substantial limitation, contributing to a potential risk of overlooking underlying multidimensional models [2, 33]. When using a relatively large cohort of patients with SUDs, the present study showed that the unidimensional factor models of the FSS-3 and FSS-9 were maintained. Therefore, regardless of sample sizes, one can assume that the unidimensional factor models are generally well-fitted for the FSS-3 and FSS-9.

The linear mixed model analysis showed that items 5–7 included in the FSS-3 remain substantially stable and equal over time between the annual health assessments compared to the separate items 1–4 and VAFS. The result might give further arguments for the better validity of the FSS-3 questionnaire, and the FSS-3 is assumed to be less sensitive to fluctuations compared to the FSS-9. This points out that the FSS-3 might be preferred when evaluating changes in fatigue over time in the SUD population.

Strengths and limitations

This study had some strengths. First, we collected data from patients who are difficult to reach in both research and health care. Of those, 225 patients were followed up by two or three annual health assessments, making longitudinal analyses possible. Second, patients recruited to this study answered the FSS-9 and VAFS questionnaires under different mental and psychosocial conditions, for example, when they were going through substance intoxications or withdrawal and living on the street, which might increase the generalisation of the results. The study also had some limitations. First, the majority of the patients were recruited from opioid agonist therapy, making this validation study more transferable to opioid agonist therapy populations than other substance dependence populations. Second, the FSS-9 questionnaire used was translated from US-English into Norwegian and back-translated into US-English by native Norwegian and US-English-speaking translators [21], but as far as we are aware, no specific protocol for high-quality translations was used [35]. This might slightly reduce the external validity of our results. Third, the time intervals between the baseline and the second health assessment and between the second and the third health assessment varied, which could have affected how patients scored the FSS-9 and VAFS. Forth, comparing the FSS-3 to the FSS-9, the FSS-3 might increase the risk of common method bias considering that three items are more likely to be recalled and more accessible in the short-term memory than nine items [36]. Previous studies have not evaluated the impact of common method bias; however, high reliability was achieved when validating a shortened FSS-9 into seven items in various study samples [17, 37]. Nevertheless, these studies detected cross-sample differences between items 3, 5, 6, and 9, which corresponded to two items (5 and 6) in the FSS-3 questionnaire. This points out the need for further validation and shortening studies on the FSS-3 when adapting it to other populations. Fifth, the psychometric properties of the FSS-3 are not thoroughly investigated [34, 38]. We had assessed the construct validity of the FSS-3; however, we did not know that the criterion validity is conceptually equivalent to the FSS-9, and this scale including the VAFS. For further research, we call for more longitudinal data on the SUD population to improve the validity and the psychometric properties of the FSS-3 and make it even more useful for clinicians and SUD researchers.


The present study demonstrates that the FSS-9 can be shortened to just the FSS-3 among patients with SUDs. The value in adding the VAFS to the FSS-3 and FSS-9 did not provide much added variability. We found that the FSS-3 was more consistent in the structure of changes in fatigue levels compared to the FSS-9. The FSS-3 seems to be a useful patient-reported outcome measure of fatigue in this population at a group level, although the clinical relevance at an individual level remains more controversial.