Background

Gout is an increasingly prevalent, monosodium urate crystal-driven inflammatory arthritis, commonly presenting as debilitating acute painful flares with associated redness and swelling of the affected joint(s). In some cases a chronic course may develop when increasing crystal deposition is left untreated, leading to visible urate crystal deposits (tophi) and joint damage, as well as extra-articular complications [1]. Along with the clinical manifestations, patients suffering from gout are often confronted with pain, physical impairment, work productivity loss, and emotional distress [2, 3]. Patient-reported outcome measures (PROMs) are commonly used to assess these consequences of gout in a variety of settings [4, 5].

When choosing a specific PROM to use from a number of alternatives, one should take into account the research context, feasibility of the instrument, comparability of scores with relevant earlier work, and the measurement properties of the instrument in the population of interest. Measurement properties are arguably a particularly important factor to consider, since they have a direct bearing on, for example, the ability of a study to demonstrate the desired effects, as well as the required sample size. Therefore, choosing the best instrument from a number of alternatives importantly contributes to the potential for the success of a study. Consequently, endorsements of specific instruments should be based on a comprehensive, critical evaluation of their content and the documented evidence supporting their measurement properties [6].

The OMERACT Gout Special Interest Group has endorsed various patient-reported outcome (PRO) instruments for use in acute and chronic gout clinical research [7,8,9,10,11,12,13]. However, these endorsements are based only on the opinions of experts, guided by analyses performed on data from a few selected clinical trials (n = 4) and one observational study, as well as a systematic review on the performance of specific measures in previous clinical trials of acute gout [14, 15]. Important measurement properties, such as reliability and validity, are not typically reported on in trial reports, nor can information about these properties necessarily be inferred from the reported results. Also, as information about measurement properties was derived from a small, selected number of studies, new or less popular instruments may have been underappreciated.

To date, no systematic evaluation has been performed of the available evidence supporting the measurement properties of the various PROMs available for use in gout [16, 17]. The objective of this systematic review was to identify all PROMs currently available for gout, and to critically appraise their content and measurement properties, in order to evaluate the current status regarding PROMs validated for use in gout patients, and to identify areas for future research.

Methods

Search strategy

To identify all available literature, a systematic literature search was performed in PubMed and EMBASE (from database inception up to August 15, 2017), using a modified but validated search strategy for papers on measurement properties of PROMs used in gout [18]. The exact search terms are included in the additional material (see Additional file 1). References of included studies and of systematic reviews of PROMs found in the search were first screened by title and, if relevant, abstracts were assessed for potentially relevant papers. Finally, for each included PROM, an additional PubMed search was performed to make sure all relevant papers were included.

Selection of literature

Published articles were included if (1) the study population consisted of gout patients and (2) the article reported on the development of a PROM or the evaluation of one or more of its measurement properties. We excluded (1) conference abstracts and poster presentations, (2) systematic review articles, and (3) articles published in any language other than English.

Duplicates of articles generated by the search strategy were removed using Microsoft Excel prior to screening. The titles and abstracts of the retrieved articles were then screened independently by two reviewers (MOV and CJ) for relatedness to gout and to the development or evaluation of a PROM. When the title or abstract left uncertainty regarding the eligibility criteria, the full-text article was retrieved and assessed. Disagreements on the eligibility of an article were discussed and resolved through consensus. A third reviewer (PtK) was consulted if disagreements remained unresolved. Full-text articles were retained, and the final decision on which studies to include was made through consensus after the articles had been read (MOV and PtK). Reasons for exclusion were noted and a flow chart of article selection was prepared according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement [19].

PROM characteristics

Descriptive characteristics of each instrument were extracted from the included studies or the initial publication of the instrument. The readability of each questionnaire was assessed using the Flesch-Kincaid Grade Level test. A grade level of 6 is recommended by the International Society for Quality of Life Research (ISOQOL) minimum standards for PROMs [20]. The availability of each instrument was also determined.
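As an illustration of the readability assessment, the standard Flesch-Kincaid Grade Level formula can be sketched as follows. The function name and the example counts are hypothetical, and the word, sentence and syllable counts are assumed to have been obtained beforehand (the text does not state which tool was used to count them).

```python
def flesch_kincaid_grade(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch-Kincaid Grade Level from pre-computed text counts.

    Standard formula:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


# Hypothetical questionnaire text: 100 words, 10 sentences, 150 syllables.
grade = flesch_kincaid_grade(n_words=100, n_sentences=10, n_syllables=150)
print(round(grade, 2))  # compare against the recommended grade level of 6
```

Shorter sentences and fewer syllables per word both lower the estimated grade level.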

Assessment of methodological quality

The methodological quality of each included study was assessed using the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) checklist [21]. Several deviations from the COSMIN checklist were deemed necessary in order to correspond better with advances in psychometric theory or with standard practices in gout and quality of life research [20, 22,23,24,25,26]. An overview of the criteria and our deviations from these is presented in Table 1. Two reviewers (MOV and PtK) independently completed the checklist, and final decisions about ratings were arrived at through consensus.

Table 1 Quality criteria for rating the measurement properties in accordance with Terwee et al. 2007, and deviations from COSMIN criteria for methodological quality

Assessment of measurement properties

The studies that were judged to be of high methodological quality using the COSMIN checklist were used to rate the measurement properties of the included PROM as either good (+) or poor (−), in accordance with quality criteria proposed by Terwee et al. 2007 [26]. Measurement properties of instruments for which only studies of insufficient methodological quality were available were rated as indeterminate (?), or zero (0) when no information was found for that measurement property. In cases where the same PROM was described in various studies of sufficient methodological quality, which resulted in different quality ratings for the same measurement property, the rating was designated as indecisive (+/−). Table 1 gives a description of each rated measurement property, along with the quality criteria applied.
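The rating rules described above can be sketched as a small decision function. This is only an illustration: the function name and the input format (one `(is_high_quality, rating)` pair per study) are assumptions, and ASCII "-" stands in for the minus sign used in the tables.

```python
from typing import List, Tuple


def rate_measurement_property(studies: List[Tuple[bool, str]]) -> str:
    """Combine study-level results into one rating for a measurement property.

    Each study is (is_high_quality, rating), with rating "+" or "-".
    Returns "0" (no information), "?" (indeterminate), "+", "-",
    or "+/-" (indecisive).
    """
    if not studies:
        return "0"  # no information found for this property
    ratings = {rating for is_high_quality, rating in studies if is_high_quality}
    if not ratings:
        return "?"  # only studies of insufficient methodological quality
    if ratings == {"+"}:
        return "+"
    if ratings == {"-"}:
        return "-"
    return "+/-"  # high-quality studies disagree


print(rate_measurement_property([(True, "+"), (True, "-")]))
print(rate_measurement_property([(False, "+")]))
```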

Content evaluation and assessment of content validity

Health concepts assessed by each PROM were characterized by linking the items to the International Classification of Functioning, disability and Health (ICF) using the 2016 ICF linking rules [27]. As our intention was to compare content between PROMs, we did not link health concepts to the ‘other specified’ or ‘unspecified’ ICF categories. To be rated as having a high content validity (+), ≥ 75% of the health concepts of the PROM had to have been included in either ICF core set [28, 29]. All health items related to emotions were linked to the ICF category ‘b152 emotional functions’.
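A minimal sketch of the content validity criterion is given below. It assumes that each health concept of a PROM has already been linked to one ICF category and that "either ICF core set" means the union of the two core sets. The ICF codes and set contents are illustrative only and do not reproduce the actual core sets.

```python
from typing import Iterable, Set


def content_validity_rating(linked_categories: Iterable[str],
                            core_set_a: Set[str],
                            core_set_b: Set[str],
                            threshold: float = 0.75) -> str:
    """Return "+" if at least `threshold` of the linked health concepts
    fall in either core set, otherwise "-"."""
    categories = list(linked_categories)
    if not categories:
        raise ValueError("at least one linked health concept is required")
    combined = core_set_a | core_set_b
    covered = sum(1 for category in categories if category in combined)
    return "+" if covered / len(categories) >= threshold else "-"


# Hypothetical example: four linked concepts, three of which are covered.
core_a = {"b280", "d450"}   # illustrative contents, not the real core set
core_b = {"b152", "d450"}   # illustrative contents, not the real core set
print(content_validity_rating(["d450", "b280", "b152", "d850"], core_a, core_b))
```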

Item response theory (IRT)

Although no quality criteria are currently available to judge the quality of studies that used IRT-based analysis, we provided a descriptive review of the results of the included studies that used these methods. As a minimum requirement for methodological quality, we required that at least 50 patients were included in the study for each item in the case of PROMs with dichotomous response categories, or 50 patients for each item step parameter for polytomous data [30]. For articles that used 2-parameter IRT modeling, we required a minimum of 250 patients to be included [31]. Furthermore, for a positive rating for methodological quality, the IRT model that was used had to be described in sufficient detail for the reader to understand its parametric structure, or references had to be included to sources that provide such descriptions. Finally, at least some evidence needed to be presented to support model-data fit.
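The sample size requirements above can be sketched as follows. This is one possible reading, not a definitive rule: it assumes that an item with m response categories has m - 1 step parameters, that the requirement is 50 patients multiplied by the total number of such parameters, and that the 250-patient minimum for 2-parameter models acts as an additional lower bound. All function and parameter names are hypothetical.

```python
def required_irt_sample(n_items: int, n_categories: int = 2,
                        two_parameter: bool = False) -> int:
    """Minimum number of patients under the assumptions stated above."""
    if n_items < 1 or n_categories < 2:
        raise ValueError("need at least one item and two response categories")
    # Dichotomous items have one parameter each; polytomous items have
    # one step parameter per category boundary.
    n_step_parameters = n_items * (n_categories - 1)
    required = 50 * n_step_parameters
    if two_parameter:
        required = max(required, 250)
    return required


def sample_is_adequate(n_patients: int, n_items: int, n_categories: int = 2,
                       two_parameter: bool = False) -> bool:
    return n_patients >= required_irt_sample(n_items, n_categories, two_parameter)


# Hypothetical study: 5 items with 5 response categories and 480 patients,
# i.e. 24 patients per step parameter.
print(sample_is_adequate(n_patients=480, n_items=5, n_categories=5))
```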

Results

The search resulted in 826 hits, of which 556 were screened after removal of the duplicates. After screening of the titles and abstracts, 33 were found to meet the inclusion criteria. Of these, another 19 were excluded, leaving a total of 14 studies for review (see Fig. 1) [15, 32,33,34,35,36,37,38,39,40,41,42,43,44]. Reference checking of systematic reviews and of these included papers, or the additional PubMed search of each included instrument, did not result in any additional studies eligible for review.

Fig. 1 Flow diagram of study selection

Instrument characteristics

The characteristics of the 13 included PROMs are summarized in Table 2 (see Additional file 2 for ICF linking perspectives and categorization of response options). Three of these specifically target gout patients, whereas the remaining instruments target patients with rheumatic diseases or generic populations. Pain and physical function were the most frequently assessed outcome domains. Seven out of eleven (64%) PROMs had Flesch-Kincaid grade level estimates lower than 6, suggesting that their items are easily understood by patients with varying reading proficiency levels (Table 2).

Table 2 General characteristics of the included patient-reported outcome measures (PROMs)

Content description & validity

The results of the ICF-linking exercise revealed that health concepts subsumed under the ICF chapter ‘d4 mobility’ were most frequently addressed in the items of the included PROMs (see Additional file 3). Each included PROM had at least one item related to ‘d4 mobility’. ‘b1 mental functions’ was the second most frequently addressed ICF chapter, because all health concepts related to emotional functioning were linked to this chapter. Health concepts related to ‘d8 major life areas’ were also frequently assessed, mainly due to the inclusion of the Rheumatoid Arthritis-Work Instability Scale (RA-WIS). Only three PROMs included content related to environmental factors, particularly the Health Assessment Questionnaire-Disability Index (HAQ-DI), for which scores can be adjusted in case patients need help from others or assistive devices to perform the activities.

Of the 32 PROMs, subscales, and total scales that were rated, 81% (n = 26) met the criteria for a positive rating for content validity (Table 3). However, the role functioning subscales of the Medical Outcomes Study 20-item Short Form Health Survey (MOS-20) and the Short Form-36 item version 2 (SF-36v2), the Work Productivity and Activity Impairment (WPAI) physical function Numeric Rating Scale (NRS), and several Arthritis Impact Measurement Scales (AIMS) subscales received negative ratings, mainly because a large number of their health concepts were too general to be linked to ICF second-level categories.

Table 3 Quality ratings of the measurement properties of the included instruments

Quality rating of measurement properties

Table 3 lists the quality ratings of the psychometric properties of the included PROMs.

Construct validity

The methodological quality for construct validity was frequently rated as poor, in the majority of cases because the authors specified no hypotheses with respect to expected correlations or mean differences. In studies with explicitly stated hypotheses, positive ratings were generally given, leading to mostly positive ratings for PROMs for which high quality studies of construct validity were available. However, the HAQ-DI was rated as inconclusive because 78% of hypotheses were confirmed in one study, whereas only 61% could be confirmed in another. In the latter study, hypotheses were not confirmed for some correlations with the subscales of the SF-36v2 (including emotional health, emotional role limitation, social), nor for correlations with outcomes such as the number of gout flares in the past month, Visual Analogue Scale (VAS) for pain, swollen joint count and physician global assessment.

Score reliability

The reliability of several multi-item PROMs was supported by high quality studies of single administration reliability. All instruments measuring physical function (HAQ-DI, Health Assessment Questionnaire-II (HAQ-II), SF-36v2 physical functioning subscale) received favorable ratings for reliability, as did the RA-WIS and a couple of the Gout Assessment Questionnaire 2.0 (GAQ2.0) subscales and total scale. The other subscales of the GAQ2.0 were rated either negatively, because the reliability coefficient was < 0.70, or as indecisive, when studies showed mixed results. The AIMS and MOS-20 were rated as indeterminate because the sample size used for the analysis was inadequate (< 50). None of the studies in which an analysis of test-retest reliability was performed was rated to be of high quality. The 20-item Tophus Impact Questionnaire (TIQ-20) was rated as indeterminate for test-retest reliability despite an intraclass correlation coefficient (ICC) > 0.70 and an otherwise appropriate design, because patients did not appear stable during the two measurement periods. The AIMS and the MOS-20 received an indeterminate rating because an inadequate sample size was used and the follow-up period of 8 weeks between measurements was deemed too long. For the other six questionnaires, no studies on test-retest reliability were found.

Responsiveness

The single-item pain measures (VAS, Likert and NRS) and the bodily pain subscale of the SF-36v2 were demonstrated to be able to detect clinically relevant changes over time. Of the PROMs measuring physical functioning, the HAQ-DI, SF-36v2 (physical functioning subscale, role physical subscale and the physical component summary score) and the single-item WPAI physical function NRS were rated positively, whereas the HAQ-II was rated as indeterminate because it was not clear how patients changed over time. For the same reason, the MOS-20 and AIMS also received an indeterminate rating. The subscales of the SF-36v2 and GAQ2.0 that were rated negatively received this rating because the demonstrated effect size was considered too small (< 0.30).

Floor and ceiling effects

The (sub)scale(s) of the gout-specific GAQ2.0 and the TIQ-20 both showed no floor or ceiling effects. Similarly, the pain instruments were also rated positively, as were the patient global assessment VAS and the general health subscale of the SF-36v2. The instruments for physical functioning showed contradictory results: the HAQ-II had floor or ceiling effects > 15%, the HAQ-DI had indecisive results, and the physical functioning subscale of the SF-36v2 and physical function NRS scale showed no floor or ceiling effects.
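The 15% criterion used above can be illustrated with a short sketch. It assumes the commonly used definition that a floor or ceiling effect is present when more than 15% of respondents obtain the lowest or highest possible score, respectively. The function name and the example scores are hypothetical.

```python
from typing import Dict, Sequence


def floor_ceiling_effects(scores: Sequence[float], min_score: float,
                          max_score: float, threshold: float = 0.15) -> Dict[str, object]:
    """Share of respondents at the lowest and highest possible score,
    and whether either share exceeds the threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    floor_share = sum(1 for s in scores if s == min_score) / n
    ceiling_share = sum(1 for s in scores if s == max_score) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share > threshold,
        "ceiling_effect": ceiling_share > threshold,
    }


# Hypothetical scores on a 0-3 scale (for example, a disability index).
result = floor_ceiling_effects([0, 0, 1, 2, 3, 3, 3, 3, 2, 1], min_score=0, max_score=3)
print(result)
```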

Item response theory (IRT)

There were four articles in which IRT was used. The methodological quality of the first study was rated negatively because only ~ 24 patients per threshold parameter were included, which makes it unlikely that these parameters, which were the subject of the analysis, were stably estimated [39]. In another study, the measurement invariance of the HAQ-DI with respect to diagnosis was examined [38]. The results suggest that patients with gout, osteoarthritis and rheumatoid arthritis respond differently to the HAQ-DI categories of walking, dressing, and activities. When these differences in response behavior were controlled for in the model, the authors found that the mean disability scores for the different disease groups changed slightly. This might impact the validity of cross-diagnostic comparisons using the HAQ-DI. Rasch analysis of the RA-WIS scale provided support for its unidimensionality [36]. Analysis of the locations of the items and persons on the latent measurement continuum revealed that targeting of the scale appeared to be poor, with most of the items clustering together at the middle of the continuum, whereas the distribution of patients was skewed to the right, with a pronounced ceiling effect. Despite this, global reliability was found to be high according to the patient separation index. Finally, Rasch analysis was also used in the development of the TIQ-20 [42]. That paper was rated negatively for methodological quality because a longitudinal IRT model was apparently used, but it was not described how the dependencies between the repeated measures were taken into account in the analysis.

Discussion

Brief summary

In the current study, we identified and critically reviewed the content and psychometric properties of PROMs currently available for gout, using a systematic approach. This paper can be used for determining areas where further research is required for specific PRO domains and measures in gout, especially regarding their measurement properties.

Strengths

The comprehensive literature search in various databases and the systematic approach applied throughout the review process are strengths of this study. In addition, this review is the first to critically review various measurement properties of commonly used PROMs in gout, including an assessment of the methodological quality of studies reporting on these measurement properties. For this purpose, standardized criteria were used to assess both the methodological quality of the included studies, using the COSMIN checklist, and the quality of the measurement properties, using quality criteria that were proposed by ISOQOL and Terwee et al. [20, 21, 26]. Furthermore, the content validity of the included PROMs was comprehensively assessed by linking their items to the ICF using standardized ICF linking procedures [27].

Weaknesses

There were some limitations to this study. First, our search was developed to find papers that evaluated measurement properties of PROMs used in gout. As a result, we may have missed PROMs used in gout for which no evaluation of the psychometric properties is yet available. For instance, several new generic item banks, such as those developed for the Patient-Reported Outcomes Measurement Information System project, were not included in this review for that reason [45]. Evaluation of the measurement properties of such measures in gout seems very relevant. Moreover, no ICF core set for gout is currently available. The comparative ICF core set we used consisted of the ICF core set for acute inflammatory arthritis and a preliminary core set of gout ICF categories considered relevant by a panel of expert physicians, derived in a recent study [28, 29]. The results regarding content validity should therefore be considered preliminary and interpreted with some caution. Another limitation of the evaluation of the content of the PROMs is that all health concepts related to emotional functioning (e.g., “Have you been very nervous?”) were linked to a single category, namely ‘b152 emotional functions’. Since health concepts relating to emotional functioning were the second most frequently addressed category in the included PROMs, and represented quite diverse emotional experiences, different PROMs could probably be characterized in more detail with respect to the various aspects of emotional functioning they assess. Finally, authors of the included papers were usually insufficiently clear about whether patients had active gouty arthritis or were studied in the so-called inter-critical periods of the disease. Properties of the included PROMs are likely to differ between these subpopulations, which limits the generalizability of our results. For future studies, we recommend that authors provide information on the percentage of patients with active arthritis included in the study.

Discussion on findings

The results of this study show that various PROMs are available for gout, covering the majority of the outcome domains that have been endorsed by OMERACT for use in clinical studies in this field. Notably, no studies were found that assessed the properties of PROMs for the OMERACT key outcomes of ‘joint swelling’ and ‘joint tenderness’, possibly because in many gout clinical studies these outcomes are not assessed with a PROM but by the physician [46, 47]. Nevertheless, patient reports of these domains have been used in gout clinical studies, so evaluation of their measurement properties is desirable [48, 49]. Also, no studies were found examining the measurement properties of instruments that can be used to derive health utilities for health-economic studies.

Only the physical functioning subscale of the SF-36v2 was rated favorably for all measurement properties in this systematic review. Moreover, in one of the included studies, a direct comparison with the HAQ-DI and HAQ-II showed that it was the only instrument without floor and ceiling effects, suggesting it better targets the disability levels of gout patients [37]. Therefore, current evidence suggests that the SF-36 physical functioning subscale can be recommended for assessing disability in gout. In measuring disability, the HAQ-DI was the only other instrument for which sufficient studies of high quality were available to provide a comprehensive evaluation of its measurement properties. However, this instrument scored inconclusively for construct validity, and floor and ceiling effects. Based on the current evidence, both the VAS and the SF-36v2 bodily pain subscale may be recommended for measuring pain, as almost all measurement properties were supported by high quality studies. However, in general, few studies have yet assessed the psychometric properties of single-item pain measures.

Of the gout-specific PROMs, the health status measuring GAQ2.0 was most extensively evaluated in the literature. Although its subscales showed no floor and ceiling effects, and were all rated as positive for content validity, confirming its items contain health concepts relevant for gout populations, the GAQ2.0 does not cover all recommended OMERACT outcome domains (e.g., no activity limitations scale). This potentially limits its usefulness for gout clinical research purposes. Moreover, the available evidence suggests poor reliability and non-responsiveness to change for half of its subscales, and it was one of the few PROMs with a poorer rating for ease of reading. The overall psychometric appraisal of the GAQ2.0 in this systematic review is in line with previously reported concerns regarding this instrument and therefore we suggest caution in use of this PROM [17]. For assessing health-related quality of life, the current evidence suggests the SF-36v2 may be used as an alternative.

For other instruments, no strong conclusions regarding their psychometric quality were possible, despite the availability of at least one study of most measurement properties for each instrument. With respect to construct validity, this was mostly because authors failed to specify hypotheses about the associations they expected to find. Construct validation is an iterative process in which confidence in the degree to which a PROM actually reflects the construct it intends to measure increases as applications of the measure consistently yield results that would be expected, given theories about how this construct relates to other constructs [50]. Therefore, especially for newly introduced PROMs, proper evaluation of construct validity requires researchers to be specific about expected relations among instruments included in the assessment; taking into account that the relations between the substantive constructs, measurement error and method of measurement all contribute to the observed relations between instruments. For instance, PROMs can be expected to have relatively high intercorrelations, and therefore only limited information about construct validity can be extracted from the finding that significant correlations exist between a number of PROMs. Neither is it the case that higher correlations are always indicative of greater construct validity. Assessments of test-retest reliability in acute gout are complicated by the often rapid improvement that occurs, even without treatment, in the clinical status of patients. This makes it challenging to select a population of stable patients, which led to the many indeterminate ratings in this review. Therefore, for multi-item PROMs, reliability should, in our opinion, be assessed using coefficients that can be calculated from the interitem covariance matrix, such as Cronbach’s alpha.
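As an illustration of a reliability coefficient computed from the interitem covariance matrix, Cronbach's alpha can be sketched as below. The function name and the example matrix are hypothetical; the matrix is assumed to be the square covariance matrix of the item scores.

```python
from typing import Sequence


def cronbach_alpha(cov: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha from a k x k interitem covariance matrix.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of sum score),
    where the variance of the sum score equals the sum of all matrix entries.
    """
    k = len(cov)
    if k < 2 or any(len(row) != k for row in cov):
        raise ValueError("covariance matrix must be square with at least 2 items")
    sum_score_variance = sum(sum(row) for row in cov)
    item_variances = sum(cov[i][i] for i in range(k))
    if sum_score_variance <= 0:
        raise ValueError("variance of the sum score must be positive")
    return (k / (k - 1)) * (1 - item_variances / sum_score_variance)


# Hypothetical 4-item scale: unit variances, all covariances equal to 0.5.
example = [[1.0 if i == j else 0.5 for j in range(4)] for i in range(4)]
alpha = cronbach_alpha(example)
print(round(alpha, 2), alpha >= 0.70)
```

Because the coefficient needs only a single administration, it avoids the need for a stable patient population between two measurements.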

Implications for practice

For clinicians working in the field of gout, it may be important to understand that little evidence is currently available on the measurement properties of commonly used PROMs and, more importantly, what consequences this may have for outcomes data when poorly supported PROMs are used. This matters in particular because some PROMs, for instance the single-item pain PROMs, may be used in daily practice to determine the severity of the pain associated with a gout flare. It also matters because evidence from clinical trials, in which PROMs are commonly used to collect data, is generally used for developing gout guidelines or management recommendations for daily clinical practice.

Implications for research

To ensure that high-quality patient-reported outcomes data are collected in gout research, it is essential that valid and reliable PROMs are used. Their use may enhance the feasibility of studies by, for example, reducing measurement error, which leads to a smaller required sample size. However, the results of this study show that the measurement properties of the PROMs commonly used in gout clinical research settings are weakly supported. To enhance their position in gout research, we recommend that more evidence on the validity and reliability of PROMs used in gout be made available. Choosing the most suitable PROM from a number of alternatives may then become easier, and endorsement of PROMs for measuring relevant gout outcomes in clinical research, as done by OMERACT, will ideally be based on solid evidence supporting their measurement properties.
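The link between measurement error and required sample size can be illustrated with a simple sketch. It is not taken from the reviewed studies and rests on textbook assumptions: a two-group comparison of means, a normal approximation for the sample size, and classical test theory, under which the observed standardized effect equals the true effect multiplied by the square root of the score reliability. All names and numbers are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(true_effect_size: float, reliability: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate patients per group for a two-sided two-sample comparison."""
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    if true_effect_size <= 0:
        raise ValueError("effect size must be positive")
    # Measurement error attenuates the observable standardized effect.
    observed_effect = true_effect_size * math.sqrt(reliability)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / observed_effect ** 2)


# Hypothetical true effect of 0.5 SD, measured perfectly versus with a
# PROM whose score reliability is 0.70.
print(n_per_group(0.5, reliability=1.0))
print(n_per_group(0.5, reliability=0.70))
```

Under these assumptions the required sample size scales with the inverse of the reliability, so a less reliable instrument requires proportionally more patients to detect the same underlying effect.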

Conclusions

In conclusion, this report presents the results of an evaluation of the content of commonly used PROMs in gout and of the literature supporting their measurement properties. The results suggest that PROMs are available to assess the majority of the recommended OMERACT core outcome domains for use in clinical research for acute and chronic gout. However, the SF-36 physical functioning subscale is the only PROM that currently meets all the quality criteria we imposed for this review. Many of the commonly used PROMs in this field are not yet well supported, and more studies on their measurement properties are needed among both acute and chronic gout populations.