Background

Development of a standardized approach to assess key elements of disease activity in rheumatoid arthritis (RA) clinical trials has been the goal of Outcome Measures in Rheumatology Clinical Trials (OMERACT), American College of Rheumatology (ACR), and European League Against Rheumatism (EULAR) groups [1,2,3]. The core sets of measures developed by these groups include assessments and composite indices that incorporate use of patient-reported outcomes (PROs) (e.g., daily functioning, change in disease activity), as well as clinical measures (e.g., erythrocyte sedimentation rate [ESR]) and clinician’s assessments (e.g., clinician assessment of disease activity), to quantify disease activity and change over time [2]. However, patient-centered research has indicated that key outcomes important to patients were not originally captured by the core sets, such as fatigue, sleep, and general wellness [4], morning stiffness [5], and the patient’s experience of social and psychological challenges and ability to cope [6].

Of these, fatigue is noted as one of the most common symptoms experienced by patients with RA [7]. Fatigue is a frequent and debilitating problem for patients with RA [8] and is second only to pain as the most bothersome patient-reported RA symptom [9]. The burden of fatigue in RA is well known, with prevalence estimates ranging from 42% to 90% of patients [7, 10, 11]. There is consistent agreement on the clinical relevance of fatigue and its impact on activities of daily living and overall health-related quality of life (HRQOL) in RA [12, 13]. Indeed, both the 2007 Patient Perspective Workshop at OMERACT and the 2008 ACR/EULAR Task Force recommended that the core set of measures of disease activity be updated so that all RA clinical trials report on fatigue [14, 15], although no specific instruments were endorsed.

Although fatigue in RA is a multidimensional concept [16,17,18], tiredness is a key component of fatigue. Arthritis Research UK [19] defines fatigue as “a feeling of extreme physical or mental tiredness,” and the 13-item Functional Assessment of Chronic Illness Therapy-Fatigue (FACIT-F) Scale—a widely used measure of fatigue across many diseases, including RA—has six items that address this key component with “tired” in their wordings (e.g., “I feel tired,” “I have trouble starting things because I am tired,” “I have trouble finishing things because I am tired”). There is wide variation in how patients use the word fatigue when describing their symptom experience, with terms like “physical and mental tiredness” commonly associated with fatigue [20].

Moreover, qualitative interviews with RA patients that focused on the development of a new PRO demonstrated that “tiredness” is a more commonly used term to describe this symptom experience than “fatigue” [21]. Specifically, these one-on-one interviews were designed to explore and better understand the terminology RA patients most often use to report this burdensome symptom. In these interviews, the majority of participants (n = 20, 71%) used “tiredness” to describe their RA symptom experiences, whereas fewer (n = 13, 46%) mentioned “fatigue” [21].

Despite the recommendations for the need to assess this chronic aspect of RA, there is currently no commonly used and well-validated instrument to assess the patient’s experience of this symptom in RA clinical trials [11]. To address this need, a daily electronic PRO diary single-item measure was created to assess the severity of worst tiredness from the patient’s perspective. To develop this single-item measure, referred to as Severity of Worst Tiredness, a targeted literature review and interviews with healthcare providers were conducted in order to ascertain the appropriate terminology to be used for the measure. In addition, qualitative concept elicitation and cognitive debriefing interviews with RA patients were conducted, to ensure that the content of the scale was being accurately captured by the instrument, as well as to confirm that the measure is relevant, easy to use, and easy to understand by patients with RA [21]. This supported the content validity of Severity of Worst Tiredness by confirming the relevance of tiredness as an RA symptom and the appropriateness of the term “tiredness” to describe this symptom. However, although the content validity of Severity of Worst Tiredness was demonstrated, the psychometric properties of the measure have not yet been assessed.

The purpose of the present study is to assess the psychometric properties (i.e., reliability, validity, and responsiveness) of the Severity of Worst Tiredness PRO in patients with moderately to severely active RA who participated in two Phase 3 clinical trials, RA-BEAM and RA-BUILD, for baricitinib.

Methods

Patient population

RA-BEAM

RA-BEAM (N = 1305) was a randomized, double-blind, double-dummy, placebo- and active-controlled, parallel-arm, 52-week study in patients aged ≥18 years with active RA (≥6/68 tender and ≥6/66 swollen joints; serum high-sensitivity C-reactive protein [hsCRP] ≥6 mg/L) with an inadequate response to methotrexate (MTX). The study was designed to assess improvements in disease activity, structural preservation, and PROs, including physical function, safety, and tolerability with oral baricitinib 4 mg once daily. Full details regarding the conduct of the study, as well as the primary efficacy and safety outcomes of this study have been reported previously [22].

RA-BUILD

RA-BUILD (N = 684) was a randomized, double-blind, placebo-controlled, parallel-group 24-week study in patients aged ≥18 years with active RA (≥6/68 tender and ≥6/66 swollen joints; hsCRP ≥3.6 mg/L [upper limit of normal 3.0 mg/L]) and an insufficient response (despite prior therapy) or intolerance to ≥1 csDMARDs. The study was designed to assess improvements in disease activity, structural preservation, and PROs, including physical function, safety, and tolerability with oral baricitinib 2 and 4 mg once daily. Full details regarding the conduct of the study, as well as the primary efficacy and safety outcomes of this study have been reported previously [23].

For both studies, the current analysis uses data from Weeks 0 to 12 and draws on other PRO and clinician-reported indicators of RA symptoms and severity assessed in the primary efficacy studies of RA-BEAM and RA-BUILD. Both studies were conducted with informed consent, under institutional review board approval, and in accordance with the Declaration of Helsinki (ClinicalTrials.gov number NCT01710358 [RA-BEAM] and NCT01721057 [RA-BUILD]).

Instruments used in the psychometric analyses

Patient-reported outcomes (PROs)

Severity of Worst Tiredness, Severity of Morning Joint Stiffness, Severity of Worst Joint Pain, and Duration of Morning Joint Stiffness

Severity of Worst Tiredness, Severity of Morning Joint Stiffness (MJS), and Severity of Worst Joint Pain are all single-item PROs designed to capture the severity of worst tiredness, MJS, and worst joint pain experienced that day, respectively. All three of these PROs are anchored at 0 and 10, where 0 represents “no tiredness,” “no joint stiffness,” or “no joint pain,” and 10 represents “tiredness as bad as you can imagine,” “joint stiffness as bad as you can imagine,” or “joint pain as bad as you can imagine,” respectively. The Duration of MJS is a single-item PRO designed to capture information on self-reported length of time, in minutes, that a patient’s MJS lasted each day. Durations recorded as >12 h (720 min) were censored at 720 min.

For RA-BEAM and RA-BUILD, all four PROs were assessed using a daily electronic diary through Week 12. The Day 1 assessment was the first assessment at the end of the patient’s day after the randomization visit (Week 0, Visit 2). The Week 1 assessment refers to the weekly average values from Days 2 to 8. Assessments at Weeks 2, 4, 8, and 12 refer to weekly average values of the 7 days prior to Weeks 2, 4, 8, and 12 visits, respectively. Recognizing that late shift workers (individuals who work outside of the hours of 9 am until 5 pm) could not complete the electronic diary (at home) at the end of Day 1, the Day 2 assessment (if available) was used to impute missing Day 1 values so that more patients could be included in the psychometric analyses utilizing the Day 1 value.
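The derivation of diary scores described above can be sketched as follows. This is a minimal illustration, not the study's analysis code: the diary is assumed to be a dictionary mapping study day to a 0 to 10 score (None for a missed entry), and the function names are hypothetical. Averaging whatever entries are available in a week is also an assumption, since no minimum number of entries is stated for the Week 1 to Week 8 averages.

```python
from statistics import mean

def day1_score(diary, impute_from_day2=False):
    """Return the Day 1 score; optionally fall back to Day 2 when Day 1 is missing."""
    score = diary.get(1)
    if score is None and impute_from_day2:
        score = diary.get(2)
    return score

def weekly_average(diary, first_day, last_day):
    """Mean of the non-missing daily scores from first_day to last_day, inclusive."""
    scores = [diary[d] for d in range(first_day, last_day + 1)
              if diary.get(d) is not None]
    return mean(scores) if scores else None

# Hypothetical patient who missed the Day 1 and Day 5 entries.
diary = {2: 6, 3: 7, 4: 5, 5: None, 6: 6, 7: 8, 8: 4}

day1_observed = day1_score(diary)                        # no Day 1 entry
day1_imputed = day1_score(diary, impute_from_day2=True)  # Day 2 value carried back
week1 = weekly_average(diary, 2, 8)                      # Week 1 = Days 2 to 8
```

Keeping the imputation behind an explicit flag mirrors the distinction between analyses restricted to observed Day 1 data and those that use the Day 2 value in its place.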

Medical Outcomes Study 36-Item Short Form Health Survey Version 2 Acute (SF-36)

The SF-36 is a generic, 36-item PRO that measures general health status. The SF-36 includes eight domains of health status evaluated over the prior week: physical function, role limitations–physical, bodily pain, general health perceptions, vitality, social function, role limitations–emotional, and mental health. Two component scores, the Physical Component Score (PCS) and the Mental Component Score (MCS), are derived based on the eight domain scores [24]. Domain and component scores are derived using established formulas [24], with higher scores indicating better health status or functioning.

Functional Assessment of Chronic Illness Therapy-Fatigue (FACIT-F)

The FACIT-F scale [25] is a brief, 13-item, symptom-specific questionnaire that specifically assesses the self-reported severity of fatigue caused by chronic disease and its impact on daily activities and functioning. A 5-point Likert-type scale (0 = not at all; 1 = a little bit; 2 = somewhat; 3 = quite a bit; 4 = very much) is used. The range of possible scores is 0 to 52, with 0 being the worst possible score (indicating greater fatigue) and 52 the best (indicating lesser fatigue).

Health Assessment Questionnaire-Disability Index (HAQ-DI)

The HAQ-DI assesses patient physical function or disability. The HAQ-DI contains 24 questions that query the degree of difficulty a person has in accomplishing tasks in eight functional areas (dressing, arising, eating, walking, hygiene, reaching, gripping, and activities). Responses in each functional area are scored from 0, indicating “no difficulty” in that area, to 3, indicating “inability to perform a task” in that area. The HAQ-DI total score, ranging from 0 to 3 (higher values indicate worse functioning), is obtained by summing the highest score within each functional area and dividing by the number of functional areas answered [26].
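The scoring rule described above (highest item score within each functional area, summed and divided by the number of areas answered) can be sketched as below. The data layout and function name are hypothetical, and the sketch implements only the rule stated in the text; any further scoring conventions from the instrument's manual are not represented.

```python
def haq_di(responses):
    """Compute a HAQ-DI total score.

    responses maps each functional area to a list of item scores (0 to 3),
    with None marking an unanswered item.
    """
    area_scores = []
    for items in responses.values():
        answered = [s for s in items if s is not None]
        if answered:                      # skip areas with no answered items
            area_scores.append(max(answered))
    if not area_scores:
        return None
    return sum(area_scores) / len(area_scores)

# Hypothetical responses; the "reaching" area was left unanswered.
example = {
    "dressing": [1, 0],
    "arising": [2, 1],
    "eating": [0, 0, 0],
    "walking": [1, 1],
    "hygiene": [3, 2, 1],
    "reaching": [None, None],
    "gripping": [1, 0, 0],
    "activities": [2, 2, 1],
}
score = haq_di(example)  # (1 + 2 + 0 + 1 + 3 + 1 + 2) / 7 areas answered
```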

Quick Inventory of Depressive Symptomatology Self-Rated-16 (QIDS-SR16)

The QIDS-SR16 is a 16-item PRO intended to assess the existence and severity of symptoms of depression as listed in the American Psychiatric Association’s Diagnostic and Statistical Manual of Mental Disorders, 4th Edition [27]. Patients were asked to consider each statement as it relates to the way they have felt for the past 7 days. There is a unique 4-point ordinal scale for each item, with scores ranging from 0 to 3 reflecting increasing depressive symptoms as the item score increases. The instrument measures nine core symptom domains that are used to define a depressive episode: sad mood; concentration; self-criticism; suicidal ideation; interest; energy/fatigue; sleep disturbance; decrease or increase in appetite or weight; and psychomotor agitation or retardation. The QIDS-SR16 total score is derived as the sum of the scores across the nine scale domains.

Patient’s assessment of pain

Patient’s pain was assessed at each study visit with the use of a 0–100 mm visual analogue scale (VAS), with higher scores indicating more severe pain. Specifically, patients were asked, “How much pain are you currently having because of your rheumatoid arthritis?”

Patient’s Global Assessment of Disease Activity (PtGA)

The PtGA was assessed at each study visit and is recorded on a 0–100 mm VAS, with higher scores indicating more active RA.

Clinician-reported assessments

Physician’s Global Assessment of Disease Activity (PhGA)

The PhGA was assessed at each study visit and is recorded on a 0–100 mm VAS, with higher scores indicating more active RA.

Clinical sign and symptom measures

American College of Rheumatology 20 (ACR20)

An ACR20 response (i.e., a binary variable indicating achieving or not achieving a response) was measured at each study visit and is defined as at least a 20% improvement from baseline in both tender joint count (TJC) (0 to 68) and swollen joint count (SJC) (0 to 66), and in at least three of the following five assessments: patient’s assessment of pain, PtGA, PhGA, HAQ-DI, and hsCRP.
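The ACR20 definition above can be expressed as a short rule. The sketch below is illustrative only: the dictionary keys are hypothetical names, all seven components are assumed to be scored so that lower values are better, and treating a zero baseline as no improvement is an assumption made to avoid division by zero.

```python
def pct_improvement(baseline, current):
    """Percent improvement from baseline for a measure where lower is better."""
    if baseline == 0:
        return 0.0  # assumption: no improvement can be credited from a zero baseline
    return 100.0 * (baseline - current) / baseline

def acr20_response(baseline, current, threshold=20.0):
    """True if both joint counts and at least 3 of the 5 other measures improve by >= threshold %."""
    joints_ok = all(
        pct_improvement(baseline[k], current[k]) >= threshold
        for k in ("tjc68", "sjc66")
    )
    others = ("pain", "ptga", "phga", "haq_di", "hscrp")
    n_improved = sum(
        pct_improvement(baseline[k], current[k]) >= threshold for k in others
    )
    return joints_ok and n_improved >= 3

# Hypothetical patient at baseline and at a follow-up visit.
base = {"tjc68": 20, "sjc66": 15, "pain": 60, "ptga": 65,
        "phga": 55, "haq_di": 1.5, "hscrp": 12}
visit = {"tjc68": 14, "sjc66": 12, "pain": 40, "ptga": 60,
         "phga": 40, "haq_di": 1.4, "hscrp": 6}
responder = acr20_response(base, visit)
```

In this example both joint counts improve by at least 20%, and pain, PhGA, and hsCRP meet the threshold while PtGA and HAQ-DI do not, so the patient counts as a responder.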

Clinical Disease Activity Index (CDAI)

The CDAI is a tool for measurement of disease activity in RA that integrates measures of physical examination, patient self-assessment, and evaluator assessment [28]. The CDAI was assessed at each study visit and is calculated by adding together scores from the following assessments: number of swollen joints (0 to 28), number of tender joints (0 to 28), PtGA on a VAS (0 to 10 cm), and PhGA on a VAS (0 to 10 cm). Total scores are calculated using established formulas [28]. Thresholds have been established for the CDAI (remission: ≤2.8; low disease activity: >2.8 to ≤10; moderate disease activity: >10 to ≤22; high disease activity: >22 to ≤76) [29].
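As a worked illustration of the calculation and thresholds above, the sketch below sums the four components and assigns a disease activity category. The function names are hypothetical; the conversion from millimetres to centimetres reflects the fact that PtGA and PhGA were recorded on 0 to 100 mm scales in these studies.

```python
def cdai(sjc28, tjc28, ptga_mm, phga_mm):
    """CDAI = swollen joints + tender joints + PtGA (cm) + PhGA (cm)."""
    return sjc28 + tjc28 + ptga_mm / 10 + phga_mm / 10

def cdai_category(score):
    """Map a CDAI score to the disease activity categories given in the text."""
    if score <= 2.8:
        return "remission"
    if score <= 10:
        return "low"
    if score <= 22:
        return "moderate"
    return "high"

# Hypothetical patient: 8 swollen and 10 tender joints, PtGA 65 mm, PhGA 50 mm.
score = cdai(8, 10, 65, 50)
category = cdai_category(score)
```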

Disease activity score (28 joints) (DAS28)

The DAS28 is a composite score that is based on a 28-joint count (both TJC 0 to 28 and SJC 0 to 28), hsCRP or ESR, and PtGA and was measured at each study visit. Total scores are calculated using established formulas [30]. Patients can be categorized into four groups (remission: <2.6; low disease activity: ≥2.6 to ≤3.2; moderate disease activity: >3.2 to ≤5.1; high disease activity: >5.1).
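For orientation, the sketch below computes a DAS28-ESR score and applies the categories listed above. The text only cites the established formulas [30]; the coefficients used here are those of the commonly published DAS28-ESR equation and are included as an assumption about what that reference contains. Function and parameter names are hypothetical.

```python
from math import log, sqrt

def das28_esr(tjc28, sjc28, esr, ptga_mm):
    """Commonly published DAS28-ESR equation (assumed; the text cites [30] without stating it)."""
    return (0.56 * sqrt(tjc28) + 0.28 * sqrt(sjc28)
            + 0.70 * log(esr) + 0.014 * ptga_mm)

def das28_category(score):
    """Map a DAS28 score to the four categories given in the text."""
    if score < 2.6:
        return "remission"
    if score <= 3.2:
        return "low"
    if score <= 5.1:
        return "moderate"
    return "high"

# Hypothetical patient: 10 tender and 8 swollen joints, ESR 30 mm/h, PtGA 60 mm.
score = das28_esr(10, 8, 30, 60)
category = das28_category(score)
```

Note that the remission boundary is a strict inequality (<2.6), unlike the CDAI thresholds, so a score of exactly 2.6 falls in the low disease activity group.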

Statistical analyses

Reliability (test-retest)

For the assessment of test-retest reliability (i.e., whether instrument scores are reproducible over time), stable patients were defined as those with a difference of ≤5 points [31] on the 0 to 100 PtGA between the two assessments in each period: Weeks 1 and 2, and Weeks 4 and 8. Intraclass correlation coefficients (ICCs) were calculated between Weeks 1 and 2 and again between Weeks 4 and 8 to evaluate test-retest reliability. An ICC of ≥0.70 was considered good agreement [32].
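The test-retest procedure above can be sketched in two steps: select stable patients, then compute an ICC on their paired scores. The ICC form is not specified in the text, so the two-way random-effects, absolute-agreement, single-measure form (often written ICC(2,1)) is used here purely as an assumption. Record keys and function names are hypothetical.

```python
def stable_patients(records, max_ptga_change=5):
    """Keep patients whose PtGA changed by no more than max_ptga_change points."""
    return [r for r in records
            if abs(r["ptga_retest"] - r["ptga_test"]) <= max_ptga_change]

def icc_2_1(scores):
    """ICC(2,1) for a list of rows, one per patient, each row [test, retest]."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between patients
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between occasions
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Hypothetical records: weekly mean tiredness and PtGA at test and retest.
records = [
    {"tired_test": 1, "tired_retest": 2, "ptga_test": 40, "ptga_retest": 43},
    {"tired_test": 5, "tired_retest": 6, "ptga_test": 55, "ptga_retest": 52},
    {"tired_test": 9, "tired_retest": 10, "ptga_test": 70, "ptga_retest": 75},
    {"tired_test": 3, "tired_retest": 8, "ptga_test": 30, "ptga_retest": 60},
]
stable = stable_patients(records)  # the last patient is excluded (PtGA change of 30)
icc = icc_2_1([[r["tired_test"], r["tired_retest"]] for r in stable])
```

Because the absolute-agreement form penalises systematic shifts between occasions, the uniform one-point increase in this example yields an ICC slightly below 1 even though the patients keep the same rank order.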

Convergent and discriminant validity (construct validity)

Construct validity is the degree to which scores from one measure are related to those of other measures in a manner that is theoretically consistent. Pearson correlations were used to assess the construct validity of Severity of Worst Tiredness. Correlations were calculated at Day 1 and Week 12 between Severity of Worst Tiredness and the scores of other clinical/PRO endpoints: Severity of MJS, Severity of Worst Joint Pain, Duration of MJS, SF-36 domain and component scores, FACIT-F, HAQ-DI, QIDS-SR16, patient’s assessment of pain, PtGA, TJC28, SJC28, PhGA, and hsCRP. The strength of the correlations was interpreted using Cohen’s conventions, where a correlation >0.5 is large, 0.3 to 0.5 is moderate, 0.1 to <0.3 is small, and <0.1 is insubstantial [33].
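A minimal sketch of this step is shown below: a Pearson correlation followed by classification of its absolute value under the conventions just stated. The function names and the example data are hypothetical; the labels are applied to the magnitude of r because several comparison instruments are scored in the opposite direction to Severity of Worst Tiredness.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def cohen_label(r):
    """Classify the magnitude of a correlation using the thresholds in the text."""
    size = abs(r)
    if size > 0.5:
        return "large"
    if size >= 0.3:
        return "moderate"
    if size >= 0.1:
        return "small"
    return "insubstantial"

# Hypothetical scores: tiredness (0 to 10, higher is worse) against a
# fatigue scale on which higher scores mean less fatigue.
tiredness = [2, 4, 5, 7, 9]
fatigue_scale = [45, 40, 30, 25, 12]
r = pearson_r(tiredness, fatigue_scale)
label = cohen_label(r)
```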

It was hypothesized that moderate or large correlations supporting convergent validity at Day 1 and Week 12 would be demonstrated between Severity of Worst Tiredness and PRO instruments measuring concepts related to tiredness or fatigue (FACIT-F, SF-36 Vitality), other RA pain-like symptoms (Severity of MJS, SF-36 Bodily Pain, Severity of Worst Joint Pain, patient’s assessment of pain), their impact on functioning (SF-36 Social Functioning, SF-36 Physical Functioning, HAQ-DI), and clinician-reported/laboratory assessments of disease activity (TJC28, SJC28, PhGA, and hsCRP). Discriminant validity was assessed by Pearson correlations at Day 1 and at Week 12 between Severity of Worst Tiredness and PROs measuring distally related concepts (SF-36 MCS, SF-36 Role Emotional, QIDS-SR16), for which small correlations were hypothesized.

Known-groups validity

Known-groups validity tests seek to demonstrate differences between two or more groups known to differ on the underlying construct [34]. An analysis of variance (ANOVA) model was used for the assessment of known-groups validity at Day 1 and Week 4 to distinguish mean Severity of Worst Tiredness between subgroups defined by the DAS28-ESR thresholds (<2.6; ≥2.6 and ≤3.2; >3.2 and ≤5.1; and >5.1) and CDAI (0.0 to ≤2.8; >2.8 to ≤10; >10 to ≤22; and >22 to ≤76). The Scheffé adjustment was used for multiple comparisons. When subgroup sample sizes were small (i.e., <5% of the total sample size for the subgroup), subgroups were combined.
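The known-groups comparison can be illustrated with a one-way ANOVA F statistic and a Scheffé statistic for a pairwise contrast. This is a sketch under stated assumptions rather than the study's model: the function names and data are hypothetical, and no p-values are produced because the F distribution is not available in the Python standard library. Each statistic would be compared against the critical value of F with (k − 1, N − k) degrees of freedom.

```python
def one_way_anova(groups):
    """Return (F, ms_within, df_between, df_within) for a dict of score lists."""
    all_scores = [x for g in groups.values() for x in g]
    grand = sum(all_scores) / len(all_scores)
    means = {name: sum(g) / len(g) for name, g in groups.items()}
    ss_between = sum(len(g) * (means[name] - grand) ** 2
                     for name, g in groups.items())
    ss_within = sum((x - means[name]) ** 2
                    for name, g in groups.items() for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    ms_within = ss_within / df_within
    return (ss_between / df_between) / ms_within, ms_within, df_between, df_within

def scheffe_statistic(groups, a, b):
    """Scheffe statistic for the contrast mean(a) - mean(b).

    The value is already divided by (k - 1), so it is compared directly
    with the critical F value on (k - 1, N - k) degrees of freedom.
    """
    _, ms_within, df_between, _ = one_way_anova(groups)
    n_a, n_b = len(groups[a]), len(groups[b])
    diff = sum(groups[a]) / n_a - sum(groups[b]) / n_b
    return diff ** 2 / (ms_within * (1 / n_a + 1 / n_b)) / df_between

# Hypothetical tiredness scores for two collapsed disease activity subgroups.
groups = {"das28_le_5.1": [2, 3, 4], "das28_gt_5.1": [6, 7, 8]}
f_stat, ms_within, df_between, df_within = one_way_anova(groups)
pairwise = scheffe_statistic(groups, "das28_gt_5.1", "das28_le_5.1")
```

With only two subgroups the Scheffé statistic coincides with the overall F statistic, which is a convenient sanity check; the adjustment only becomes more conservative than an unadjusted comparison when three or more subgroups are compared.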

Responsiveness

Responsiveness, or the ability of the instrument to detect change over time [35], of Severity of Worst Tiredness was evaluated using an analysis of covariance (ANCOVA) methodology to assess significant differences in mean change in Severity of Worst Tiredness from Day 1 to Week 12 between ACR20 responders and nonresponders at Week 12, controlling for Day 1 Severity of Worst Tiredness. A parallel analysis was also conducted to assess responsiveness using disease activity as measured by DAS28-hsCRP at Week 12, using the following subgroups: DAS28-hsCRP <2.6, DAS28-hsCRP ≥2.6 and DAS28-hsCRP ≤3.2, and DAS28-hsCRP >3.2. An overall statistically significant difference (p < 0.05) with statistically significant subgroup comparisons was hypothesized.
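The core of the ANCOVA described above is a linear model of change score on responder status with the Day 1 score as a covariate. The sketch below fits that model by ordinary least squares and returns the baseline-adjusted difference between responders and nonresponders. It is an illustration only: names and data are hypothetical, and standard errors and p-values, which would normally come from a statistics package, are omitted.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ancova_group_effect(change, responder, baseline):
    """Baseline-adjusted mean difference in change (responders minus nonresponders).

    Fits change = b0 + b1 * responder + b2 * baseline and returns b1.
    """
    rows = [[1.0, float(g), float(b)] for g, b in zip(responder, baseline)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, change)) for i in range(3)]
    return solve(xtx, xty)[1]

# Hypothetical data constructed so that responders improve by 2 points more
# than nonresponders at any given Day 1 score.
baseline = [4, 6, 8, 5, 7, 9]
responder = [1, 1, 1, 0, 0, 0]
change = [-3.0, -4.0, -5.0, -1.5, -2.5, -3.5]
effect = ancova_group_effect(change, responder, baseline)
```

Adjusting for the Day 1 score matters because patients with higher starting scores have more room to improve; without the covariate, a difference in baseline severity between the groups could be mistaken for a treatment-response effect.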

Handling of missing data

Only patients with Day 1 data were included in analyses at Day 1. For analyses of data at Week 12, scores collected in the 7 days prior to the Week 12 visit date were used. If there were fewer than 4 nonmissing assessments, the 7-day window was shifted back in time (toward baseline) one day at a time until there were 4 nonmissing assessments available in the 7-day window. Then, the average of the 4 assessments was used in the Week 12 analysis.
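The window-shifting rule for the Week 12 score can be sketched as follows. The diary is assumed to be a dictionary mapping study day to score (None for a missed entry); the function name is hypothetical. How far back the window may shift is not stated in the text, so stopping once the window would start before Day 1 is an assumption.

```python
def week12_average(diary, visit_day, window=7, min_obs=4, earliest_day=1):
    """Average the diary scores in the 7 days before the visit.

    If fewer than min_obs scores are available, slide the window back one
    day at a time until it contains min_obs scores. Returns None if no
    qualifying window exists.
    """
    end = visit_day - 1                      # last day before the visit
    while end - window + 1 >= earliest_day:
        days = range(end - window + 1, end + 1)
        scores = [diary[d] for d in days if diary.get(d) is not None]
        if len(scores) >= min_obs:
            return sum(scores) / len(scores)
        end -= 1                             # shift the window toward baseline
    return None

# Hypothetical patient with a Week 12 visit on study day 85. The initial
# window (days 78 to 84) holds only three entries, so it shifts back one day.
diary = {77: 5, 78: 6, 80: 4, 83: 7}
score = week12_average(diary, visit_day=85)
```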

Results

Baseline demographics for the total modified intent-to-treat population, patients with Day 1 diary scores, and patients with Week 12 diary scores are provided in Table 1. Scores for all other patient- and clinician-completed assessments, as well as clinical sign and symptom measures, are found in Table 2.

Table 1 Patient Demographic and Disease Characteristics for Patients with Electronic Diary Assessments at Day 1 and Patients with Week 12 Electronic Diary Assessments (mITT Population) for RA-BEAM and RA-BUILD
Table 2 Instrument Scores at Day 1 and Week 12 for RA-BEAM and RA-BUILD

A large amount of missing data was present at the Day 1 assessment period (Tables 1 and 2). These missing data were due to multiple reasons as shown in Additional file 1: Table S1, such as the diary device alarm not sounding until the following day or the diary device being given to the patient after Day 1. Sensitivity analyses with the imputation for missing data at Day 1 (n = 1041 for RA-BEAM and n = 563 for RA-BUILD, respectively) were conducted and demonstrated similar findings to the results presented here.

Reliability (test-retest)

ICCs for weekly mean Severity of Worst Tiredness ranged from 0.90 to 0.91 between Weeks 1 and 2 (RA-BEAM n = 412; RA-BUILD n = 185) and from 0.89 to 0.91 between Weeks 4 and 8 (RA-BEAM n = 417; RA-BUILD n = 215). These values provide evidence of substantial test-retest reliability among stable patients.

Convergent and discriminant validity

Results supporting convergent validity of Severity of Worst Tiredness in terms of its relationship with other clinical outcome assessments are presented in Table 3 for Day 1 and Table 4 for Week 12. At Day 1 in RA-BEAM and RA-BUILD, moderate-to-large associations were demonstrated between Severity of Worst Tiredness and other assessments measuring similar tiredness-like patient states. These associations were large at Week 12 in RA-BEAM and RA-BUILD, respectively, including the FACIT-F (r = −0.60 in both studies) and SF-36 Vitality (r = −0.52 and −0.51). Severity of Worst Tiredness also demonstrated moderate-to-large associations with measures of other RA symptoms of pain and stiffness at Day 1. These associations increased at Week 12 in RA-BEAM and RA-BUILD, respectively, including SF-36 Bodily Pain (r = −0.51 and −0.52), Worst Joint Pain (r = 0.82 and 0.83), Severity of MJS (r = 0.79 and 0.77), and patient’s assessment of pain (r = 0.69 and 0.65). Finally, Severity of Worst Tiredness demonstrated moderate-to-large associations with concepts related to patient physical and social functioning at Day 1 that increased at Week 12 in RA-BEAM and RA-BUILD, respectively, including SF-36 Social Functioning (r = −0.43 and −0.44), SF-36 Physical Functioning (r = −0.43 and −0.41), and HAQ-DI (r = 0.49 and 0.46). These findings provide support for the convergent validity of Severity of Worst Tiredness in patients with moderately to severely active RA.

Table 3 Pearson Correlations between Severity of Worst Tiredness and Other Instruments in RA-BEAM and RA-BUILD at Day 1
Table 4 Pearson Correlations between Severity of Worst Tiredness and Other Instruments in RA-BEAM and RA-BUILD at Week 12

Small-to-moderate correlations were observed between Severity of Worst Tiredness and SF-36 MCS (r = −0.38 and −0.31) and SF-36 Role Emotional (r = −0.35 and −0.21) at Day 1, as well as QIDS-SR16 (r = 0.40 and 0.37), in RA-BEAM and RA-BUILD, respectively, indicating that these assessments measure more distally related constructs as hypothesized.

Known-groups validity

Because of small sample sizes in the lower DAS28-ESR subgroups at Day 1 (i.e., <5% of the sample in each score category), patients were categorized into two subgroups: ≤5.1 and >5.1 (Table 5). At Day 1 in RA-BEAM and RA-BUILD, patients with higher DAS28-ESR scores reported a significantly greater Severity of Worst Tiredness score than those patients with lower DAS28-ESR scores (Table 5). Similar results were found for both studies at Week 4.

Table 5 Known-Groups Validity of Severity of Worst Tiredness Using DAS28-ESR Subgroups at Day 1 and Week 4

Similarly, because of small sample sizes, patients were categorized into two subgroups based on the CDAI at Day 1 (0.0 to ≤22.0 and >22.0 to ≤76.0) and three subgroups at Week 4 (0.0 to ≤10.0, >10.0 to ≤22.0, and >22.0 to ≤76.0). At Day 1, patients in the higher CDAI score subgroup experienced a significantly greater Severity of Worst Tiredness in both RA-BEAM and RA-BUILD than those patients in the lower CDAI score subgroup (Table 6). Similar results were found for both studies at Week 4 (Table 7). These findings provide evidence that Severity of Worst Tiredness is able to distinguish between known groups based on disease severity.

Table 6 Known-Groups Validity of Severity of Worst Tiredness Using CDAI Subgroups at Day 1
Table 7 Known-Groups Validity of Severity of Worst Tiredness Using CDAI Subgroups at Week 4

Responsiveness

The responsiveness of Severity of Worst Tiredness was supported through large and statistically significant differences in mean change from Day 1 to Week 12 in Severity of Worst Tiredness between ACR20 responders and nonresponders (Table 8). Similar findings supporting responsiveness of Severity of Worst Tiredness were seen when using DAS28-hsCRP as an anchor. Pairwise comparisons of mean change between the DAS28-hsCRP subgroups <2.6 versus >3.2 (p = 0.001 for both studies) and ≥2.6 and ≤3.2 versus >3.2 (p = 0.001 for both studies) were statistically significant (Table 8). However, the comparison between change scores for the subgroups <2.6 versus ≥2.6 and ≤3.2 was not statistically significant in either study.

Table 8 Change in Severity of Worst Tiredness from Day 1 to Week 12 among ACR20 and DAS28-hsCRP Groups

Discussion

An investigation into the psychometric properties of the Severity of Worst Tiredness PRO using data from patients with moderately to severely active RA provided support for the reliability, validity, and responsiveness of this measure. Analyses of test-retest reliability indicated strong agreement in Severity of Worst Tiredness scores across two assessment periods in stable patients. The construct (convergent and discriminant) validity of Severity of Worst Tiredness was also supported, as a priori hypotheses about the associations between Severity of Worst Tiredness and related PROs, clinician-reported measures, and laboratory assessments were confirmed at Day 1 and Week 12. Using the DAS28-ESR and CDAI as indicators of known clinical status, known-groups validity was supported, as mean Severity of Worst Tiredness values were significantly different between predefined groups. Lastly, Severity of Worst Tiredness demonstrated responsiveness to change from Day 1 to Week 12 when defining responders using the ACR20 or DAS28-hsCRP as an anchor.

Patients have identified tiredness/fatigue as a bothersome and debilitating disease-related symptom [8, 9], and despite improved treatment options for other RA symptoms, improvement in fatigue continues to be noted as an unmet need for patients with RA [18]. This was recently demonstrated in an analysis of data from the Leiden Early Arthritis Clinic cohort of patients with RA [36]. Cohort inclusion occurred when RA was confirmed at physical examination and symptom duration was <2 years (early RA). Early RA treatment strategies evolved over time: initial treatment for patients enrolled from 1993 to 1995 was nonsteroidal anti-inflammatory drugs (NSAIDs), with DMARDs used later in treatment; patients enrolled from 1996 to 1998 were treated with non-MTX DMARDs (usually hydroxychloroquine or sulfasalazine); and patients enrolled from 1999 to 2007 were treated with MTX. A longitudinal study of 626 patients from these three cohorts demonstrated that despite improved treatment strategies over time associated with less severe radiographic progression in RA, there was no effect on fatigue severity over many years of treatment in early RA patients (p = 0.96) [36]. The authors concluded that a reliable and valid PRO measure of this symptom is an important tool to aid clinicians in treating patients with RA, thereby facilitating doctor-patient communication to improve the quality of patient care, contribute to better patient outcomes, and help to address this need [37]. Thus, the Severity of Worst Tiredness PRO addresses this unmet need. Given the increasing use of electronic PRO diaries in clinical settings, this instrument could be utilized in clinical practice, where patients are asked to report their worst tiredness symptom daily, thus enhancing the dialogue between patients and care providers. The reliability of the instrument and its ability to detect change over time have been demonstrated and further support its use as a simple, single-item measure of RA-related tiredness.

Although Severity of Worst Tiredness displayed strong evidence of reliability, validity, and responsiveness, a key limitation of this study is the missing data at the Day 1 assessment. These data were missing for multiple reasons, such as missed diary alarms. However, sensitivity analyses conducted after imputing missing Day 1 Severity of Worst Tiredness scores left all study conclusions unchanged. The timespan of the baseline assessment is also a limitation in that it consisted of a single study day’s data rather than the average of up to 7 days of assessments, as used for the Week 12 endpoint.

Conclusion

The results from the present study demonstrate that the single-item, daily measure, Severity of Worst Tiredness, is suitable to validly and reliably measure a key symptom of RA that is important to patients with moderately to severely active RA.