Background

Ankylosing spondylitis (AS), sometimes referred to as radiographic axial spondyloarthritis (axSpA), is a chronic inflammatory rheumatic disease that affects the axial skeleton, potentially leading to structural and functional impairments and diminished quality of life [1,2,3,4]. Fatigue is reported by > 70% of patients with AS [5, 6] and is associated with higher levels of disease activity and functional disability, as well as worsened global wellbeing and mental health [7]. Due to its negative impact on health-related quality of life, fatigue is considered a core domain of disease assessment in AS randomized controlled trials (RCTs) [7,8,9].

The Functional Assessment of Chronic Illness Therapy-Fatigue (FACIT-F) scale is a questionnaire that evaluates an individual’s self-reported fatigue during their usual daily activities over the past week [10, 11]. Psychometric data analyses in patients with rheumatoid arthritis (RA) [10] and psoriatic arthritis (PsA) [12, 13] have demonstrated the reliability, validity, and internal consistency of FACIT-F. In patients with RA, FACIT-F was shown to differentiate patients according to clinical change (per American College of Rheumatology response criteria) [10]. In patients with PsA, FACIT-F was shown to differentiate patients according to disease activity (per Patient Global Assessment of Disease Activity [PtGA]) [12]. Furthermore, in patients with PsA, FACIT-F has been shown to be significantly correlated with actively inflamed joint counts and, to a lesser degree, swollen joint counts [13].

The psychometric properties of patient-reported outcomes (PROs), including FACIT-F, have been evaluated in patients with active AS treated with adalimumab or placebo [14]. As observed in patients with RA and PsA [10, 12, 13], FACIT-F had good internal consistency and was responsive to changes in clinical status (per Assessment of SpondyloArthritis International Society response criteria) [14]. However, observations were based on a relatively small sample size, and there remained scope for further psychometric analyses to increase understanding of FACIT-F as a measurement tool in AS RCTs.

Tofacitinib is an oral Janus kinase inhibitor that has been investigated for the treatment of adult patients with AS. The efficacy and safety of tofacitinib in patients with active AS, who had an inadequate response or intolerance to non-steroidal anti-inflammatory drugs (NSAIDs) have been demonstrated in phase 2 [15] and phase 3 [16] trials of 16 and 48 weeks’ duration, respectively. This post hoc analysis of data from these two clinical studies further evaluated the psychometric properties of FACIT-F in adult patients with active AS and assessed whether FACIT-F could differentiate between these patients according to disease activity.

Methods

Design and participants

FACIT-F validation was based on data from two RCTs of tofacitinib in patients with AS, details of which have been published previously [15, 16]. The first was a 16-week (12-week treatment, 4-week follow-up) phase 2, placebo-controlled, dose-ranging study (NCT01786668; hereby referred to as Study 1) of tofacitinib 2, 5, or 10 mg twice daily (BID) in patients (N = 207) with active AS [15]. The second was a 48-week phase 3 trial (NCT03502616; hereby referred to as Study 2) of tofacitinib 5 mg BID in patients (N = 269) with active AS. The study had a 16-week placebo-controlled double-blind phase; from Weeks 16–48, all patients received open-label tofacitinib [16].

In both studies, patients were aged ≥ 18 years, had a diagnosis of AS and fulfilled modified New York criteria for AS, documented with central reading of the radiograph of the sacroiliac joints. All patients had active disease at screening and baseline (defined as Bath AS Disease Activity [BASDAI] score ≥ 4, back pain score [BASDAI question 2] ≥ 4), and an inadequate response or intolerance to ≥ 2 NSAIDs. Patients could continue the following (stable) background therapies: NSAIDs; methotrexate (≤ 20 [Study 1] or ≤ 25 [Study 2] mg/week); sulfasalazine (≤ 3 g/day); and oral corticosteroids (< 10 [Study 1] or ≤ 10 [Study 2] mg/day of prednisone or equivalent).

This post hoc analysis used FACIT-F data from both RCTs, captured from all treatment groups (using the 13-item FACIT-F questionnaire [Additional file 1: Appendix 4, Fig. S1]) at baseline, and at Weeks 2, 4, 8, and 12 (both studies), and Week 16 (Study 2 only). The percentage of missing FACIT-F items in the studies was negligible.

Patient and public involvement

Patients were not directly involved in the design, recruitment, or conduct of the clinical studies. Studies were conducted in accordance with the Declaration of Helsinki and International Council for Harmonisation Guidelines for Good Clinical Practice and were approved by the institutional review board and/or independent ethics committee for each study center. Written, informed consent was provided by patients.

Psychometric analyses

Measurement model assessment

The FACIT-F scale is a 13-item questionnaire that evaluates an individual’s self-reported fatigue during their usual daily activities over the past week [10, 11]. The 13 items fall into either an Experience (5 items) or Impact (8 items) domain [11]. Experience items evaluate patients’ perceptions and severity of feeling, including tiredness, energy level, weakness, fatigue, and listlessness, while Impact items evaluate how fatigue impacts an individual’s daily functioning [11]. Each item is presented with a 5-point Likert scale ranging from 0 (‘not at all’) to 4 (‘very much’). After appropriate recoding so negatively phrased items are reverse scored, items are summed to calculate a FACIT-F total score ranging from 0–52, with higher scores representing less fatigue [13, 17].

The current FACIT-F measurement model is represented by both domains and total score. In the measurement model assessments, the latent construct ‘Experience’ (represented by the first-order factor f1) includes items 1, 2, 3, 4, and 7 of FACIT-F, and the latent construct ‘Impact’ (represented by the first-order factor f2) includes the remaining 8 items [11]. The latent aggregated factor (represented by the second-order factor f3) includes all Experience and Impact domains (Additional file 1: Appendix 4, Fig. S2).

Second-order confirmatory factor analysis (CFA) modeling tested the measurement structure of FACIT-F using Study 1 and 2 data (Additional file 1: Appendix 4, Fig. S2). As an indication of whether the model fits the data, the following simultaneous criteria were used [18]: 1) Bentler’s comparative fit index (CFI) > 0.90; 2) unstandardized path coefficients were statistically significant (p < 0.05); and 3) standardized path coefficients were > 0.40 and statistically significant (p < 0.05).

To support the dimensionality of the FACIT-F scale, supplemental analyses using bifactor CFA modeling were also performed. In a bifactor model, every item can be affected by only one general (overall) factor and by only one nuisance (domain) factor. Further details of the methodology are described in the Additional file 1: Appendix 1.

Internal consistency reliability

Internal consistency reliability was assessed using Cronbach’s coefficient alpha (α) and corrected item-to-total correlations. A Cronbach’s coefficient α ≥ 0.70 [19] and an item-to-total correlation ≥ 0.40 [18] were defined as acceptable.

Test–retest reliability

Intraclass Correlation Coefficients (ICCs) estimated test–retest reliability using baseline and Week 2 data and were calculated using a one-way random model (absolute agreement) [18, 20]. An ICC ≥ 0.70 was defined as acceptable [21]. Because of treatment intervention, a subgroup of ‘stable’ patients was used in the analysis, defined using PtGA scores captured during the primary studies. Patients were asked to score their overall disease activity over the last week using a numerical rating scale between 0 (‘no disease activity’) and 10 (‘very active disease’) in response to the question, “How active was your spondylitis on average during the last week?”. Patients were not made aware of their scores at baseline during assessment at Week 2. Test conditions and administration were consistent across visits. Two models were investigated, with only ‘stable’ patient data used in each: Model A assumed that ‘stable’ patients had the same PtGA score at baseline and Week 2; Model B was more ‘relaxed’ and assumed that ‘stable’ patients can change, but not more than 1 point in PtGA score from baseline to Week 2.

Convergent validity

Evidence of convergent validity (the extent to which two concepts are related to one another [22]) was evaluated by the Pearson correlation coefficients of the FACIT-F domain/total scores with the following set of PROs from Studies 1 and 2: PtGA, total back pain/nocturnal spinal pain due to AS, Short Form-36 Health Survey version 2 (SF-36v2), Bath AS Functional Index (BASFI), BASDAI, EuroQol-5 Dimension (EQ-5D) Utility Index, and AS Quality of Life (ASQoL). Although dependent on the nature of the measures being compared, and the time points evaluated (correlations are expected to be higher following a treatment intervention than at pre-treatment or baseline), under most circumstances a correlation of 0.4–0.8 may be taken as evidence of convergent validity for the target scale under consideration (in this case, FACIT-F) [18].

Known-groups validity

Known-groups validity was examined by evaluating differences in the reported FACIT-F domain/total scores among clinically distinct patient groups. This anchor-based approach used a repeated measures longitudinal model with the reported PtGA scores as the anchor and FACIT-F domain/total scores as the outcome. PtGA scores represented patient state from ‘no disease activity’ (PtGA score of 0) to ‘very active disease’ (PtGA score of 10).

Ability to detect change

Ability to detect change was based on the repeated measures longitudinal model, with change from baseline in PtGA scores at Weeks 2, 4, 8, and 12 (both studies), and Week 16 (Study 2 only), as the anchor and change from baseline in FACIT-F domain/total scores (at the same time points) as the outcome [18, 23] to examine the relationship between change from baseline in PtGA scores and change from baseline in FACIT-F domain/total scores.

Defining meaningful within-patient change

Meaningful within-patient change (MWPC; i.e., meaningful improvement or deterioration from the patients’ perspective) was estimated using a repeated measures longitudinal model (the same model used to define ability to detect change).

As there were not likely to be 11 distinct levels of differentiation, PtGA was transformed from a 0–10 numerical rating scale to a Patient Global Impression of Severity (PGIS) 0–4 category scale (Additional file 1: Appendix 4, Fig. S3 and Additional file 1: Appendix 3, Table S1). Thus, a 1-category difference on PGIS corresponded to a 2.5-category difference on PtGA, and a 2-category difference on PGIS corresponded to a 5.0-category difference on PtGA. As a result, MWPC was evaluated based on a 2.5-category change in PtGA, and separately, a 5.0-category change in PtGA was taken as clinically relevant change.

For FACIT-F domain/total scores, differences in mean scores between groups (numerator) divided by standard deviations at baseline (denominator) were used to estimate standardized effect sizes. These effect sizes provided a general set of thresholds or benchmarks through adjectival descriptors on the impact of an intervention, with values of 0.2 standard deviation units generally regarded as ‘small’, 0.5 as ‘medium’, and 0.8 as ‘large’ [24].

Empirical cumulative distribution functions (eCDFs) [25] were produced at the studies’ respective primary analysis time points: Week 12 (Study 1) and Week 16 (Study 2) (Additional file 1: Appendix 1, Supplemental methods).

Results

CFA model

Across the separate study analyses, the second-order CFA model fit the data well. CFI indices were > 0.90 at all time points (Table 1); the unstandardized path coefficients were statistically significant (p < 0.05), and standardized path coefficients were > 0.40 and statistically significant for all items (p < 0.05; Table 1). Supplemental analyses using bifactor CFA modeling supported the dimensionality of the scale (Additional file 1: Appendix 3, Table S2 and Additional file 1: Appendix 4, Figure S5).

Table 1 Second-order CFA model results

Internal consistency reliability

In both studies, the FACIT-F domain/total scores demonstrated excellent internal consistency, with Cronbach’s coefficient α ≥ 0.88 at all time points evaluated (Additional file 1: Appendix 3, Table S3). All but one (FACIT-F Impact domain, item 8, Week 12, Study 2) of the corrected item-to-total correlations were ≥ 0.40 for all domains in both studies.

Test–retest reliability

With both models, an acceptable (≥ 0.70) ICC was observed for the FACIT-F Experience domain (ICC: 0.75–0.86), Impact domain (ICC: 0.84–0.87), and total score (ICC: 0.86–0.89) (Additional file 1: Appendix 3, Table S4).

Convergent validity

In both studies, all correlations were in the hypothesized directions. Correlations at baseline and Week 12 (Study 1)/Week 16 (Study 2) between FACIT-F domain/total scores and all PROs assessed were statistically significant (p < 0.0001) and generally ≥ 0.40, supporting convergent validity (Table 2). This trend was consistent across all other time points in both studies (Additional file 1: Appendix 3, Table S5). Across time points, the strongest correlations (0.62–0.85) were generally between FACIT-F domain/total scores and the SF-36v2 vitality domain score and ASQoL (Table 2 and Additional file 1: Appendix 3, Table S5).

Table 2 Correlations to assess the convergent validity of FACIT-F versus other PRO measures in patients with active AS

Known-groups validity

An approximately linear relationship was observed between FACIT-F domain/total scores and PtGA scores in both studies (Additional file 1: Appendix 4, Fig. S4). In both studies, differences in the FACIT-F domain/total scores between the ‘no disease activity’ and ‘very active disease’ patient groups were statistically significant (all p < 0.0001; Table 3). Standardized effect sizes of all differences were large (≥ 1.17 standard deviation units), supporting known-groups validity.

Table 3 Known-groups validity for FACIT-F in patients with active AS

Ability to detect change

In both studies, ability to detect change was evidenced by an approximately linear relationship between changes from baseline in PtGA scores and FACIT-F domain/total scores (Fig. 1). When patients experienced a change in PtGA, values for FACIT-F domain/total scores changed accordingly. Deviations from linearity were likely due to small sample sizes at the extremes of the PtGA scores (the model with an anchor as a categorical variable does not impose any functional relationship between the outcome and anchor).

Fig. 1
figure 1

Relationship between change from baseline in FACIT-F Experience domain, Impact domain, and total scores, and change from baseline in PtGA scores in patients with active AS in a Study 1* and b Study 2. *NCT01786668 (phase 2 study). NCT03502616 (phase 3 study). AS, ankylosing spondylitis; FACIT-F, Functional Assessment of Chronic Illness Therapy-Fatigue; PtGA, Patient Global Assessment

Correlations between change from baseline in PtGA and FACIT-F domain/total scores are reported in Additional file 1: Appendix 3, Table S6. Across all time points in both studies, correlations between change from baseline in PtGA scores and change from baseline in FACIT-F domain/total scores ranged from − 0.53 to − 0.31; all were statistically significant (p < 0.0001).

Meaningful within-patient change

Across both studies, when a 2.5-category change in PtGA was taken as a clinically relevant change, the largest (i.e., most conservative) values estimated for MWPC for FACIT-F Experience and Impact domain scores and total score were 1.4, 1.8, and 3.1, respectively (Table 4). When a 5.0-category change in PtGA was taken as a clinically relevant change, these estimates were 2.8, 3.6, and 6.3, respectively (Table 4).

Table 4 MWPC estimations for FACIT-F Experience and Impact domain scores and total score

In Study 1, standard deviations at baseline for FACIT-F Experience and Impact domain scores, and total score were 4.3, 7.0, and 10.8, respectively. Therefore, when a 2.5-category change in PtGA was used for MWPC estimation, the MWPC values corresponded to standardized effect sizes of 0.33, 0.26, and 0.29 standard deviation units, respectively, which can be interpreted as ‘small’ effects (0.26 and 0.29) or an approximately ‘small-to-medium effect’ (0.33) [24]. When a 5.0-category change in PtGA was used, MWPC values corresponded to standardized effect sizes of 0.65, 0.52, and 0.59, respectively, which can be interpreted as ‘medium’ effects.

In Study 2, standard deviations at baseline for FACIT-F Experience and Impact domain scores and total scores were almost identical to those observed in Study 1 (4.1, 6.4, and 10.0, respectively), yielding numerically similar effect sizes (data not shown).

Overall, in both studies, eCDFs for changes in FACIT-F total score showed a clear separation of curves for categories of PtGA changes with a sufficient number of available observations per category (see Additional file 1: Appendix 4, Fig. S6).

Discussion

Fatigue is a core domain of disease assessment in AS studies [7, 8], yet data supporting the psychometric validity and reliability of the FACIT-F scale in patients with active AS are limited. This analysis of data from two clinical studies evaluated the quantitative measurement properties of FACIT-F to determine its suitability as a measure of fatigue in AS RCTs. It should be noted that the analyses reported here primarily followed US Food and Drug Administration (FDA) guidance on incorporating clinical outcome assessments (COAs) into endpoints for regulatory decision-making [26]. These guidelines are consistent with the recommendations of the OMERACT Filter 2.1, a framework designed to aid the development of core outcome measurement sets in rheumatology [27, 28].

Consistent with findings observed in a previous psychometric evaluation of FACIT-F in patients with PsA [12], the second-order and bifactor CFA modeling supported the measurement model and multidimensionality of the FACIT-F scale as an overall score with two distinguishable domains (Experience and Impact), in addition to a global domain (total score). Furthermore, in line with results seen in patients with RA [10], PsA [12, 13], and AS [14], the FACIT-F domain/total scores demonstrated excellent internal consistency across time points. The test–retest reliability analysis observed an acceptable ICC, again consistent with earlier analyses in PsA populations [12, 13].

FACIT-F domain/total scores correlated with all measured PROs, suggesting that physical and mental impacts of fatigue are closely linked to patient perception of AS. Consistent with findings in patients with AS [14] and PsA [12], the strongest correlations were seen between FACIT-F domain/total scores and SF-36v2 vitality domain scores. Interestingly, in this analysis, strong correlations were also observed between FACIT-F domain/total scores and ASQoL. Again, this is consistent with previous findings [14]. Together, these analyses suggest that patient-reported improvements in fatigue symptoms are associated with broader aspects of physical and emotional health in active AS.

Estimated MWPC values were consistent across both studies, when either a 2.5- or 5.0-category change on the anchor measure of PtGA was taken as a clinically relevant change. In line with previous findings [12], the known-groups validity analyses clearly demonstrated that the FACIT-F scale can differentiate between patients classified as having ‘no disease activity’ and ‘very active disease’ (PtGA scores of 0 and 10, respectively), as well as any pair of disease severity groups in between. FDA guidance on incorporating COAs into endpoints for regulatory decision-making defines the ability to detect change as, “Evidence that a COA can identify differences in scores over time in individuals or groups who have changed with respect to the measurement concept” [26]. These analyses demonstrated the sensitivity of FACIT-F to changes in PtGA scores and showed that FACIT-F is equally sensitive to increases and decreases in PtGA scores. These results were echoed in the eCDF analyses, which demonstrated that PGIS categories with more favorable change had a higher probability of having values greater than the given change in FACIT-F total score, while PGIS categories with less favorable change had a higher probability of responding less than or equal to the given change in total score. Across analyses, results from both studies were extremely similar, indicating a high degree of reproducibility.

With fatigue known to be a debilitating symptom across rheumatological diseases including AS, RA, and PsA [29], it is important to have a validated and reliable measure for evaluating fatigue across the spectrum of rheumatology patients. The consistency of findings from our evaluation in patients with AS with those previously reported in AS [14], RA [10], and PsA [12, 13] populations indicates that FACIT-F is an appropriate, uniform measure for fatigue in rheumatology.

Some limitations remain associated with these analyses. First, estimated MWPC may vary due to different methodology and natural sampling variation, and may not necessarily represent a minimal value. Second, there is no consensus as to what constitutes a meaningful change on the anchor measure. While it would have been desirable to perform test–retest reliability assessments before tofacitinib treatment began (i.e., during the screening [test] visit and baseline [retest] visit), FACIT-F assessments from the screening visit were not available. In addition, all patients had active AS; therefore, the reliability/validity of FACIT-F in patients with non-active AS remains unclear. Finally, it should be noted that the content validity and feasibility of FACIT-F in patients with AS and its validity/reliability across the spectrum of axSpA have not been empirically demonstrated to date. However, given the body of evidence available across other rheumatology indications [10, 12, 13], FACIT-F is likely to be a feasible, valid, and reliable measure of fatigue in patients with axSpA.

Conclusions

This quantitative analysis of data from two clinical studies of tofacitinib demonstrates the validity and reliability of the FACIT-F scale as a measure of fatigue in adult patients with active AS. Overall, these findings indicate that FACIT-F is a suitable measure for use in AS clinical trials.