Introduction

Rheumatoid arthritis (RA) is an inflammatory autoimmune disease characterised by chronic joint inflammation, which leads to pain, stiffness, loss of function, and fatigue [1,2,3]. All of these complaints can be measured with patient-reported outcome measures (PROMs) such as the visual analogue scale (VAS), the Likert scale, the verbal rating scale (VRS) and the numerical rating scale (NRS) [4,5,6]. During outpatient visits, PROMs are taking a more central place because they directly reflect patients’ personal perception and can identify key concerns that need to be addressed [7,8,9,10]. Studies have also shown that they correlate with the Disease Activity Score 28 (DAS28) assessed by the healthcare professional [11,12,13,14].

Recently, PROMs have become increasingly important in managing RA in daily clinical practice [6, 10, 15]. Using PROMs offers opportunities to improve understanding of patients’ experiences and responses to therapy, which would otherwise be missed [7, 8, 16]. In addition, remote monitoring of disease activity with PROMs can inform doctors about the disease course between visits, which could influence treatment and result in better long-term outcomes. Another advantage is that PROMs can be part of patient education, leading to improvements in therapy adherence [17, 18].

There are different PROMs for measuring the different aspects of RA. Currently, the most widely used PROM scale type in rheumatology is the VAS. Using a VAS to measure patient outcomes was previously recommended in both research and daily clinical practice for its simplicity and adaptability for use in composite indices and questionnaires covering multiple domains [9, 19]. However, in more recent studies, the VAS has been increasingly criticised. In many cases, its loosely defined character could undermine its construct validity and contribute to heterogeneity among patients’ ratings [4, 20]. For instance, it has been found that cognition and mood influence VAS results significantly [20,21,22]. Furthermore, certain groups of patients have problems using the VAS correctly [4, 9, 19, 23].

Alternatives to the VAS are the NRS, VRS and Likert scale. The NRS and the Likert scale in particular are described as simple and fast to administer. The NRS is described as easier to use than the VAS, more suitable for international use and usable over the telephone [19, 24, 25]. For the Likert scale, research in osteoarthritis indicates that Likert and VAS responses are highly correlated and yield similar precision, while Likert responses are easier to administer and interpret. However, for a given change in Likert score, the associated VAS values can vary across a very wide range [26,27,28].

Studies reporting on advantages and disadvantages of the four different PROMs often report contradictory findings. Therefore, no statement can be made as to whether one scale is generally better than another in terms of usefulness in daily clinical practice; rather, one scale type may be more suitable for a specific context or setting [27]. The objective of this study was to measure and compare the construct validity of four PROM scale types within four different domains. In addition, the reproducibility of these scales and patients’ preferences for PROM scale types were measured. Construct validity was defined in this study as the correlation between the PROM and the patient’s DAS28-3 score, and reproducibility was tested in a test-retest setting.

Materials and methods

This study was designed as a prospective longitudinal study to investigate two of the three key component criteria of the OMERACT (Outcome Measures in Rheumatology) filter, validity and reproducibility [29], for four commonly used PROMs (NRS, VAS, VRS and Likert scale), and to determine patients’ preferred way of measuring. Adult patients with RA according to ACR/EULAR criteria [30] were included in this study. This group consisted of patients with stable disease activity, as assessed by a rheumatologist. Patients were excluded if they were not able to speak and write Dutch sufficiently, if they had cognitive impairments, and/or if the medication regime was changed during the visit.

From a population of about 900 patients with RA at Bernhoven Uden, a teaching hospital in the south of the Netherlands, each patient visiting their rheumatologist for a regular appointment at the outpatient clinic between April 2016 and June 2016 was asked to participate in this research project. Flyers with information regarding this study were available in the waiting room, and each rheumatologist assessed whether patients were eligible for the study and asked them to participate. Directly after their doctor visits, patients were asked by the researcher (RvU) to complete a paper questionnaire in which they indicated the level of pain, fatigue, experienced disease activity and general well-being they had experienced in the previous week, using NRS, VAS, VRS and Likert scale, and which PROM they preferred for answering the question in each domain. In addition, all patients signed an informed consent form for the use of their medical records in this research. Lastly, all patients received a self-addressed envelope containing exactly the same questionnaire, to be filled out and returned by mail after 5 days. Clear instructions were given to all patients on how to fill out this questionnaire and for what purpose.

To assess construct validity, the DAS28 from the study visit was used as the gold standard, but because a VAS General Health is part of the standard DAS28 score, we used a modified score. This DAS28-3 score is based on the erythrocyte sedimentation rate (ESR) and a physician’s assessment of tender and swollen joints in 28 joints [31]. Because the DAS28-3 score is a score of disease activity and is not meant to measure pain, fatigue or general well-being, it acted as a surrogate gold standard for these three domains. The main purpose of this study was to compare the performance of the four PROM scale types within each domain, and not to determine the extent to which the PROMs correlate with the (surrogate) gold standard. To test the reproducibility of the aforementioned scales, patients completed the same questionnaire again 5 days after their doctor visit.
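As an illustration of how a three-variable score of this kind is computed, the sketch below uses the coefficients of the commonly published DAS28-3 (ESR) formula. The text above does not print the formula that was applied, so the coefficients, the function name and the parameter names are assumptions, and the patient values are hypothetical.

```python
import math


def das28_3_esr(tender_joints: int, swollen_joints: int, esr_mm_per_h: float) -> float:
    """Three-variable DAS28 based on ESR, without the patient global health VAS.

    Assumption: coefficients of the commonly published DAS28-3 (ESR) formula;
    the source text does not state the formula explicitly.
    """
    if not 0 <= tender_joints <= 28 or not 0 <= swollen_joints <= 28:
        raise ValueError("joint counts must lie between 0 and 28")
    if esr_mm_per_h <= 0:
        raise ValueError("ESR must be positive to take its logarithm")
    raw = (
        0.56 * math.sqrt(tender_joints)
        + 0.28 * math.sqrt(swollen_joints)
        + 0.70 * math.log(esr_mm_per_h)
    )
    # Rescaling step that puts the three-variable score on the DAS28 range
    return raw * 1.08 + 0.16


# Hypothetical patient: 4 tender joints, 2 swollen joints, ESR of 20 mm/h
print(round(das28_3_esr(4, 2, 20.0), 2))
```

Because the patient global assessment is left out of this score, correlating it with a patient-reported VAS does not correlate the VAS with itself.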

Patients were asked to describe different aspects of their disease burden (pain, fatigue, experienced disease activity and general well-being) on an NRS (scaled from 1 to 10), a VAS (on a 100 mm line), a VRS (“Not at all, Little, Much and Very Much”) and a five-point Likert scale (“Not at all, Little, Neutral, Much and Very Much”). The VRS and Likert scales were designed by the researchers and were assessed and approved by a patient panel of the Department of Rheumatology, Bernhoven. Pain, fatigue and experienced disease activity were scored progressively (a higher score meaning more pain, fatigue or disease activity, i.e. patients feeling worse), whereas general well-being was scored in the opposite direction (a higher score meaning a patient was feeling better). A clear definition of disease activity was formulated, included in the questionnaire, and explained to all participants in exactly the same manner. At the end of each specific domain, patients were asked which way of measuring they preferred with the question: “Which of the abovementioned methods do you prefer in order to indicate how much [pain] you experience because of your RA?”

All data were processed and analysed using SPSS version 22. Descriptive information was expressed as means and frequencies. To assess construct validity, the correlation with the (surrogate) gold standard was calculated using Pearson’s correlation coefficients for each PROM in each separate domain. Although not all PROMs were continuous variables, we found it reasonable to assume that the ordered categories of Likert scales and VRS are derived from applying cutoffs to an underlying (latent) continuous scale. To check this assumption, we inspected scatterplots with 95% confidence intervals and prediction intervals to see whether the relationships were indeed linear. For reproducibility, intraclass correlation coefficients (ICCs) were calculated; more specifically, we tested ICC agreement. Patients who did not return their second questionnaire were excluded at this point. Pearson’s chi-square test was used to compare categorical data. For all assessments, the significance level was set at α = 0.05. Patients’ preference was assessed with bar charts and pie charts. Lastly, we tested whether the PROMs’ correlation coefficients were statistically different from each other within each domain, for both construct validity [32] and reproducibility [33, 34].
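The analyses in this study were run in SPSS. Purely as an illustration of the two statistics named above, the following Python sketch computes a Pearson correlation and an agreement-type ICC for a test-retest pair. It assumes that "ICC agreement" refers to the single-measurement, absolute-agreement ICC from a two-way model, which the text does not specify; the function names and the example scores are hypothetical.

```python
import math
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def icc_agreement(test, retest):
    """Single-measurement, absolute-agreement ICC for two sessions.

    Assumption: this two-way form is what "ICC agreement" denotes; the
    source does not name the exact ICC model.
    """
    n, k = len(test), 2  # n patients, k measurement occasions
    rows = list(zip(test, retest))
    grand = mean(list(test) + list(retest))
    row_means = [mean(r) for r in rows]
    col_means = [mean(test), mean(retest)]
    # Two-way ANOVA sums of squares: patients, occasions, residual
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for r in rows for v in r)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical NRS scores: every patient scores one point higher at retest
first = [1, 2, 3, 4]
second = [2, 3, 4, 5]
print(pearson_r(first, second), icc_agreement(first, second))
```

In this example the Pearson correlation stays at 1 while the agreement ICC drops below 1, because a systematic shift between the two occasions counts against absolute agreement but not against linear association.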

The research proposal was presented to the Medical Ethical Committee of the Radboud University Medical Centre. The committee concluded that no specific obligations applied to this research.

Results

During the inclusion period, 350 RA patients visited the outpatient clinic, of whom 213 filled out the first questionnaire. The included group comprised 133 women (63%) and had an average age of 63.90 years (SD ± 12.85), an average DAS28-3 score of 2.99 (SD ± 1.09), and a mean disease duration of 8.76 years (SD ± 9.06). All patient characteristics are shown in Table 1, and the inclusion flowchart is shown in Fig. 1. After exclusion of ineligible patients, 211 patients remained for the construct validity assessment; after additionally excluding patients who did not return the second questionnaire, 153 remained for the reproducibility assessment. With regard to missing data, there was one missing value in questionnaire one. The percentage of missing data in questionnaire two varied between 4 and 8%. There were no noteworthy missing data with respect to other variables. Patients with missing values did not differ from other patients with respect to age, DAS28-3 score and disease duration.

Table 1 Patient characteristics
Fig. 1 Inclusion flowchart. Three hundred fifty patients with RA visited the outpatient clinic during the inclusion period (April 2016 to June 2016); 211 patients were included for the construct validity assessment and 153 patients for the reproducibility assessment

Construct validity

For each of the 211 patients included in the validity assessment, a DAS28-3 score was calculated from values recorded in their patient files. After this, Pearson correlation coefficients were calculated. These are shown in Table 2. All of these correlations differed significantly from zero, except for VRS in the fatigue domain (p > 0.05). In each domain, no large differences in correlation coefficients between the four PROMs were found and none of the pairwise comparisons had p < 0.05. The correlation coefficients for fatigue and general well-being were slightly lower than for pain and experienced disease activity. Scatterplots were made and outliers were inspected. These outliers were not different from the other patients in the study regarding age, gender, disease duration and PROM preference.

Table 2 Construct validity of the four PROMs in four domains (n = 211), correlation with the (surrogate) gold standard

Reproducibility

Reproducibility was assessed in 153 patients with ICCs (shown in Table 3). All correlation coefficients in all scales and all domains had p < 0.001. When comparing the ICCs with each other within each domain, VAS and NRS appeared to be more reliable than VRS and Likert for the domains pain and experienced disease activity (p < 0.05) (data not shown). In the domain fatigue, VAS appeared to be more reliable than Likert, while NRS was more reliable than both Likert and VRS. Lastly, there appeared to be no differences within the domain general well-being (p > 0.05). Moreover, there were no significant differences between VAS and NRS.

Table 3 Reproducibility of the four PROMs in four domains (n = 153)

Patients’ preferences

When assessing patients’ preferences (n = 212) for PROM scale type, NRS was the most preferred in every domain (varying from 44.3 to 45.8%). VAS consistently came second (20.8 to 24.5%). Third and fourth place shifted between VRS and Likert scale depending on the domain. This preference was seen in all domains, regardless of whether it was asked in the first or second questionnaire. There were no statistically significant associations between disease duration or age and PROM preference (all p values > 0.05). All groups preferred NRS most. Figure 2 shows the overall preference for PROM scale type. Figure 3 shows the bar charts of the PROM scale type preferences per domain.

Fig. 2 Pie chart of overall PROM preference on the first questionnaire. n = 212 patients with rheumatoid arthritis. VAS visual analogue scale, NRS numerical rating scale, VRS verbal rating scale (four-point scale), Likert (five-point scale)

Fig. 3 PROM scale type preferences per domain (pain, fatigue, experienced disease activity and general well-being) for the first questionnaire. NRS numerical rating scale, VAS visual analogue scale, VRS verbal rating scale. n = 212 patients with rheumatoid arthritis

Discussion

This study aimed to compare the construct validity and reproducibility of four frequently used PROM scale types within the domains pain, fatigue, experienced disease activity and general well-being in patients with RA. In addition, patients’ preference was evaluated. True construct validity could only be assessed in the domain experienced disease activity, because the DAS28-3 score is a measure of disease activity. For the other three domains, the DAS28-3 score was used as a surrogate gold standard, so that the correlations of the four PROMs with it could be compared within each domain.

All the correlation coefficients in the construct validity assessment were weak to moderate at best, as defined by Cohen [35]. The correlation coefficients were lower in fatigue than in the other domains, which can be explained by the fact that fatigue is less influenced by disease activity compared to the other constructs [36, 37], and because of its multifactorial nature [38,39,40]. Literature shows that there is disagreement between patients’ and physicians’ perspectives regarding disease outcomes [41,42,43,44]. A study by Lati et al. found comparable correlations for VAS of patient global assessment to DAS28 score [20].

Within the construct validity assessment, the differences between the four scale types were small and not statistically significant. In three out of the four domains, the NRS and VAS had the highest correlation coefficients. VAS and NRS performed best in the domains pain and experienced disease activity. This could be explained by the fact that pain and experienced disease activity are concepts that are easier to grasp than a more complex subjective phenomenon such as general well-being, and can therefore be scored more easily using numbered values. Regarding reproducibility, VAS and NRS scored consistently better than VRS and Likert within the domains pain and experienced disease activity. Within the domain fatigue, VAS scored better than Likert and NRS scored better than both Likert and VRS.

In recent years, the VAS as a tool in daily clinical practice and research has been increasingly criticised. Gossec et al. described that elderly and low-literacy populations and some cultural groups have difficulty using the VAS, although it is a quick and easy-to-apply scale for pain [4]. Hjermstad et al. reported lower compliance when the VAS is used compared with NRS and VRS. They also question the ability of patients to discriminate their pain intensity beyond ten levels, and therefore also question the clinical advantage of the VAS over a scale such as the NRS [45]. Lati et al. argued that VAS endpoints are usually not very clearly defined, and that these anchors do not reflect the full range of RA activity [20].

Reproducibility of the scales in our assessment was moderate to high, similar to earlier studies [9, 24, 46, 47]. The exception was the domain general well-being, in which all correlations, for both construct validity and reproducibility, were lower than in the other domains. In contrast with the other domains, the Likert scale and VRS scored highest in this category. Hasson et al., Bolognese et al. and Guyatt et al. described that a Likert scale is easier to use than a VAS, for both clinicians and patients, although wording is important when describing complex subjective phenomena [26,27,28]. General well-being is such a complex subjective concept; it is quite possible that patients need an easy-to-use and easy-to-understand scale (such as Likert or VRS) to score it appropriately. None of these studies compared a Likert scale with NRS or VRS, though.

In this study, the patients clearly preferred the NRS as scale type in all four PROM domains. The validity of this scale appeared to be at least as good as that of the other scales, and several other studies have shown its practicality in daily clinical practice [24]. Moreover, NRS appeared to be more reliable than VRS and Likert in the domains pain, fatigue and experienced disease activity. We found no statistically significant differences in PROM preference by age (young, middle and old) or disease duration (short, middle and long); all groups showed a preference for NRS, regardless of domain and/or questionnaire. Hjermstad et al. also found a preference for NRS in an age-mixed population, although this population did not consist purely of RA patients [45]. Bellamy et al. described some key characteristics for a scale to be useful in daily clinical practice: simplicity, brevity, rapid completion and ease of scoring, and pointed out that both Likert scales and the NRS comply with these characteristics. In addition, the NRS is more suitable for international use because it is not affected by language differences [25]. Hawker et al. described the NRS as a valid, reliable, easy and quick-to-administer way of measuring pain. They emphasise that the NRS can also be used verbally or by telephone, but point out the one-dimensional nature of this scale [19]. Williamson et al. described the NRS as being as sensitive as the VAS, but much easier to use [24].

Strengths and limitations

Some nuances need to be added to our results. For reproducibility, a concern could be that the high test-retest reproducibility resulted from too short a time interval between the first and second questionnaires. On this topic, Studenic et al. stated that reproducibility depends on the tested time interval: correlations decrease and smallest detectable differences increase with longer time intervals [48]. Against this, a longer interval would probably lower the response. In the present study, the response rate was 72.2%, which is acceptable. If the interval were increased, one should expect this percentage to decrease, because there is no guarantee that patients remember to return the questionnaire after a longer period of time. In addition, we wanted to assess reproducibility in patients with stable disease. With a longer time interval, we cannot be certain that all patients remain stable over that period. Second, the manner of measuring at baseline and after 5 days was not exactly the same: at baseline, patients were interviewed in the hospital, while after 5 days, patients filled out the questionnaire at home. This choice was made for practical reasons: patients would otherwise have needed to visit the hospital a second time within a relatively short period. Also, the objective of the study was to compare PROMs with each other and not to establish the exact test-retest correlation. We expected the effect of the questionnaire being filled out at home to be equal for all PROMs and domains, and we therefore consider the comparison of PROMs based on our results to be sound. Moreover, for reproducibility, this difference would lead to a minor underestimation of the correlation rather than an overestimation.

Regarding preference, the large majority preferring NRS could also be explained by the fact that NRS is a measurement tool widely used by healthcare professionals. Patients, and especially patients who were diagnosed a long time ago, could simply point out NRS as their preferred way of measuring because they have used it for many years. However, when only the preferences of patients diagnosed with RA in the last year were assessed, this preference was still present, regardless of domain or questionnaire (data not shown). Experience with NRS (or any of the other scales) due to comorbidities was not assessed.

In conclusion, this study shows that NRS is the preferred PROM scale for patients with RA and that NRS is at least as valid as the VAS, Likert or VRS. Moreover, regarding reproducibility, NRS appeared to be more reliable than VRS and Likert for measuring pain, fatigue and experienced disease activity. Other studies have shown the practicality of the NRS over the VAS; therefore, we think that the NRS is the preferred scale type for the different PROMs.