Background

The SF-36 physical functioning scale (PF-10) [1, 2] and the Health Assessment Questionnaire disability index (HAQ-DI) [3, 4] are well-established instruments for measuring self-reported physical functioning. The SF-36 and the HAQ-DI were originally developed as generic measures to allow comparisons across populations [2, 5], but both instruments have also been thoroughly examined for use in several specific conditions, including rheumatoid arthritis (RA) [6].

Since the inclusion of patient-reported physical disability into core sets of outcomes for clinical trials and observational studies in RA [7, 8], an increasing number of RA studies now assess and report physical functioning. Although variation in the choice of instrument exists, the HAQ-DI and PF-10 are among the most frequently used [9, 10]. Both measures, however, differ considerably in their content, number of items, and scoring procedures, making it difficult to directly compare results obtained with the two scales. One way to overcome this problem is to link scores from the HAQ-DI and PF-10 [11]. This would allow the development of a concordance table, or crosswalk, to convert scores from one instrument to another and enable comparison of data from studies that used either one of the instruments.

Several methods are available for linking scale scores that vary in design, statistical techniques, and the degree to which exchangeability can be achieved [11, 12]. Item response theory (IRT) offers a flexible and powerful framework for score linking by its inherent ability to calibrate different items of the same concept on a common underlying metric [13–16]. Several examples of how to use IRT modeling to develop crosswalks between different instruments intended to measure the same health domain have been reported [17–20]. IRT, however, makes certain assumptions about the nature of the data, in particular with respect to dimensionality. A variety of models are available, which differ in their restrictiveness with respect to the assumptions made and the number of parameters used to describe items [21]. Consequently, the type of linking and the accuracy of the resulting crosswalk may depend in part on the specific IRT model used.

The most basic form of IRT-based linking is possible when the responses on the two instruments follow the same Rasch model; that is, if it can be shown that they pertain to the same unidimensional latent trait and that all items are equally discriminating. In the Rasch model, the observed sum score is a sufficient statistic for the latent trait estimate [22]. If the Rasch model fits, linking boils down to estimating the trait level associated with an observed score on instrument A and then finding the observed score on instrument B associated with that trait level. In this approach, the statistical equating error is merely a function of the reliability of the two instruments, that is, the reliability with which trait levels can be estimated using either of the two instruments.

If the Rasch model does not fit, a more general model can be used such as a two-parameter IRT model that includes a discrimination parameter for differentially weighting the association of items with the latent variable. Although this extension may improve model fit, linking is less straightforward as the observed sum score is no longer a sufficient statistic for the trait level and, conditional on an observed sum score, estimates of trait levels vary to some degree. In this approach, an observed score on instrument A is associated with an expected trait level and from this expectation an expected observed score on instrument B is estimated. As such, the resulting crosswalk contains a second source of statistical error, attributable to the variation of the trait level given observed sum scores. This error, in turn, is a function of the magnitude of the discrimination indices, that is, the strength of the association of the items with the latent variable.

The linking approach can be further generalized by assuming that the two instruments measure two different, yet correlated latent variables. This situation can be modeled by a two-dimensional IRT model, where the responses on each instrument pertain to their own latent variable and the two latent variables jointly follow a two-dimensional normal distribution. Again, the observed sum score on instrument B is estimated from the observed score on instrument A via the IRT model. In addition to the two sources of statistical error already identified, this model introduces an error associated with the magnitude of the correlation between the two latent variables, that is, the strength of the association between the two assumed latent scales.

To date, no studies have attempted to link PF-10 and HAQ-DI scores. Moreover, although many studies have reported high correlations between the instruments, the degree and consequences of the multidimensionality that would result from combining the scales are unclear. Some previous studies have suggested that the PF-10 and HAQ-DI, or a selection of its items used in the modified HAQ, do essentially measure the same concept [23, 24]. However, studies that examined whether items from both scales could actually be calibrated on a common IRT metric did not unequivocally support either a unidimensional or multidimensional latent structure [25, 26]. Moreover, these studies did not compare the performance of different IRT models to further examine the impact of multidimensionality.

This study presents the development and evaluation of a crosswalk between the PF-10 and the HAQ-DI in a large and clinically diverse sample of patients with RA who completed both instruments. The appropriateness of different IRT models is taken into account by comparing the calibrations and performance of a crosswalk based on a one-parameter Rasch model with those of its two-parameter and multidimensional extensions. The accuracy of the final crosswalk is cross-validated in an independent sample of patients with early RA participating in a treatment-to-target study.

Methods

Study populations

Two independent datasets were used for this study. The first dataset was used for IRT calibrations and development and comparison of the crosswalks between the PF-10 and the HAQ-DI. Accuracy and validity of the final crosswalk were tested in the second dataset.

Calibration sample

This dataset was derived from the Dutch Rheumatoid Arthritis Monitoring (DREAM) registry [27]. The DREAM registry is an observational multicenter cohort study that monitors the course of RA patients undergoing different treatment regimens in the Netherlands. Clinical, laboratory, and patient-reported outcomes are routinely collected and stored. Patient-reported outcomes are generally completed online. Within the different DREAM cohorts, 1791 unique patients simultaneously completed the SF-36 and HAQ-DI at least once between 2003 and 2012. For every patient, the first available simultaneous assessment was selected for analysis.

Cross-validation sample

The second, independent dataset included data from patients participating in the DREAM remission induction cohort [28]. The remission induction cohort consists of patients with early RA participating in a treat-to-target strategy aimed at achieving fast remission. The strategy has been shown to be highly effective, with the largest improvement in disease activity observed in the first 6 months of treatment [28]. Data from 532 patients who completed the HAQ-DI and SF-36 at baseline were used to cross-validate the accuracy of the Rasch-based crosswalk. To study the longitudinal performance of the crosswalk, available data of 276 patients who also completed the HAQ-DI and PF-10 after 6 months were used.

Measures

SF-36 physical functioning scale (PF-10)

The PF-10 is one of the eight scales of the SF-36 Health Survey and consists of 10 items measuring perceived current limitations in a variety of physical activities on a 3-point response scale from 1 (yes, limited a lot) to 3 (no, not limited at all). Where there are no missing data, observed PF-10 scores can have 21 possible values, with higher scores indicating more favorable levels of physical functioning. Using traditional scoring, scores on the PF-10 items are summed and linearly transformed to range between 0 and 100. Additionally, the summed scores can be standardized using norm-based scoring based on a mean score of 50 and a standard deviation of 10 in the 1998 US general population [29]. Previous Rasch modeling of the PF-10 indicated that the items form a unidimensional, hierarchical continuum with stable item difficulty estimates across diverse patient groups [24, 30].
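The traditional scoring described above can be sketched in a few lines of Python. This is a minimal illustration rather than the official SF-36 scoring algorithm: the function names are hypothetical, missing-data rules are ignored, and the reference mean and standard deviation for norm-based scoring are left as parameters because the published 1998 US norm values are not reproduced here.

```python
def pf10_scores(items):
    """Return (raw_sum, score_0_100) for ten PF-10 responses coded 1..3."""
    if len(items) != 10 or any(r not in (1, 2, 3) for r in items):
        raise ValueError("expected ten responses coded 1, 2, or 3")
    raw = sum(items)                        # ranges from 10 to 30 (21 values)
    scaled = (raw - 10) / (30 - 10) * 100   # linear transformation to 0..100
    return raw, scaled


def norm_based_score(scaled, ref_mean, ref_sd):
    """T-score (mean 50, SD 10) relative to reference-population values.

    ref_mean and ref_sd are placeholders for the published norm values.
    """
    return 50 + 10 * (scaled - ref_mean) / ref_sd
```

For example, a respondent who answers "limited a little" (2) on every item has a raw sum of 20 and a transformed score of 50.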

Health Assessment Questionnaire disability index (HAQ-DI)

The HAQ-DI contains 20 items measuring physical disabilities over the past week in eight categories of daily living: dressing and grooming, rising, eating, walking, hygiene, reach, grip, and activities. Each item is scored on a 4-point rating scale from 0 (without any difficulty) to 3 (unable to do). Additionally, the HAQ-DI contains four sections on the use of aids and devices or need for help from another person for performing activities in any of the eight categories. Two scoring methods can be used to calculate total HAQ-DI scores [31]. The standard disability index (SDI) adjusts category scores upwards for the use of aids or devices or help from others. The alternative disability index (ADI) does not take the use of aids and devices into account. For both scoring methods, the total disability score (HAQ-DI) is calculated by determining the highest score in each of the eight categories and then averaging the category scores. As a result, observed scores on the HAQ-DI can take on 25 possible values between 0 and 3, with higher values indicating more disability. Recent Rasch analyses have shown that the categories of the HAQ-DI constitute a unidimensional scale [24, 32].
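The two scoring rules can be illustrated with a short Python sketch. It assumes the conventional HAQ adjustment in which a category score below 2 is raised to 2 when aids, devices, or help are reported for that category; the text above only states that scores are adjusted upwards, so this rule, the function name, and the input layout are assumptions made for illustration.

```python
CATEGORIES = ("dressing", "rising", "eating", "walking",
              "hygiene", "reach", "grip", "activities")


def haq_di(item_scores, aids_or_help=(), standard=True):
    """Total HAQ-DI score between 0 and 3.

    item_scores: dict mapping each category to its item scores (0..3).
    aids_or_help: categories for which aids, devices, or help were reported.
    standard=True gives the SDI, standard=False the ADI.
    """
    category_scores = []
    for cat in CATEGORIES:
        score = max(item_scores[cat])        # highest item score in category
        if standard and cat in aids_or_help:
            score = max(score, 2)            # assumed upward adjustment rule
        category_scores.append(score)
    return sum(category_scores) / len(category_scores)
```

Because each of the eight category scores is an integer from 0 to 3, the average moves in steps of 0.125, which gives the 25 possible values mentioned above.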

Statistical analyses

IRT modeling

Maximum likelihood estimation was used to estimate the structural model parameters, and the latent disability levels of patients were estimated using the expected a posteriori (EAP) method throughout all IRT analyses. Model fit of all estimated models was assessed using Lagrange multiplier (LM) item fit statistics specifically targeted at polytomously scored items [33, 34]. Absolute differences (effect sizes; ES) between expected and observed item scores for high, average, and low scoring individuals were computed. In accordance with previous research, model fit was considered acceptable if all ES statistics were <0.10 [35, 36]. As the ES is weighted by the number of response categories, this cutoff reflects differences between observed and expected score frequencies of 2.5% for the HAQ-DI and 3.33% for the PF-10, respectively. All IRT analyses were performed with the MIRT software package [37].

Development of the crosswalk

Initial IRT analysis and cross-calibration of the PF-10 and HAQ-DI were performed in the calibration sample. To achieve consistent response patterns, PF-10 scores were reversed (so a lower score indicates better function) preceding analysis. Item parameters for the Rasch-based crosswalk were obtained using the polytomous partial credit model (PCM) [38]. First, the 10 PF-10 items and the eight HAQ-DI category scores were jointly calibrated in the same model. After the structural model parameters were estimated, questionnaire-specific scoring runs on the HAQ-DI and PF-10 items only were performed to estimate EAP scores associated with all possible total score levels and to create scoring tables mapping this relation. In these runs the item parameters of the HAQ-DI and PF-10 items, respectively, were fixed to the values obtained in the initial co-calibration. Subsequently, each possible total score was linked to the total score on the other instrument for which the absolute distance between EAP scores on the latent scale was the smallest. The total procedure was separately performed for both the HAQ-SDI based category scores and the HAQ-ADI based category scores.
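The scoring and linking steps can be sketched as follows. This is not the MIRT implementation used in the study. It is a minimal NumPy illustration that assumes the item step parameters have already been estimated, uses a standard normal prior on a fixed quadrature grid, and counts sum scores from zero. All parameter values and function names are hypothetical.

```python
import numpy as np


def pcm_probs(theta, steps):
    """Category probabilities of one partial credit item at trait level theta."""
    cum = np.concatenate(([0.0], np.cumsum(theta - np.asarray(steps, float))))
    expd = np.exp(cum - cum.max())           # subtract max for stability
    return expd / expd.sum()


def sum_score_dist(theta, items):
    """P(sum score = s | theta), built by convolving the item distributions."""
    dist = np.array([1.0])
    for steps in items:
        dist = np.convolve(dist, pcm_probs(theta, steps))
    return dist


def eap_per_sum_score(items, grid=np.linspace(-6, 6, 121)):
    """EAP trait estimate for every possible sum score (standard normal prior)."""
    prior = np.exp(-0.5 * grid ** 2)
    like = np.array([sum_score_dist(t, items) for t in grid])  # (grid, scores)
    post = like * prior[:, None]
    return (grid[:, None] * post).sum(axis=0) / post.sum(axis=0)


def crosswalk(eap_a, eap_b):
    """For each sum score on A, the sum score on B with the closest EAP."""
    return [int(np.argmin(np.abs(eap_b - e))) for e in eap_a]


# Hypothetical step parameters: ten 3-category items and eight 4-category items.
items_a = [[-1.0, 0.5]] * 10
items_b = [[-0.5, 0.5, 1.5]] * 8
table_a_to_b = crosswalk(eap_per_sum_score(items_a), eap_per_sum_score(items_b))
```

Under the partial credit model the sum score is sufficient for the trait level, so a single EAP value per sum score is enough to build the table. With a discrimination parameter, the same sum score can arise from response patterns with different trait estimates.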

Next, the validity and appropriateness of the Rasch-based crosswalk were evaluated by determining its precision in predicting HAQ-DI scores from PF-10 scores and vice versa, and by comparing the results with the precision of two additional crosswalks that were developed using the two-parameter and multidimensional extensions of the PCM. Using the same general approach as outlined above, we first re-estimated the model parameters using the generalized partial credit model (GPCM). The GPCM is a two-parameter IRT model for polytomous data which includes a discrimination parameter that accounts for the different reliability of individual items with respect to measuring the underlying latent trait. As such, the PCM is nested within the GPCM. Finally, a between-item, multidimensional GPCM was estimated. Again, the GPCM is nested within the multidimensional GPCM. In this model, all items were specified to load on their own questionnaire-specific dimension, and the relation between the dimensions was modeled by their correlation. Because in this model the two dimension-specific EAP scores are estimated concurrently, no separate scoring runs needed to be performed for the HAQ-DI and PF-10 to obtain questionnaire-specific EAP estimates associated with all possible total scores.
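The nesting of the three models can be made explicit with the usual GPCM parameterization. The notation below is an assumption, since the text does not define symbols: item i has response categories x = 0, ..., m_i, discrimination a_i, and step parameters δ_ij, and the unit variances of the latent variables are an identification convention chosen here for illustration.

```latex
% GPCM for item i (an empty sum is defined as zero)
P(X_i = x \mid \theta) =
  \frac{\exp\left( \sum_{j=1}^{x} a_i (\theta - \delta_{ij}) \right)}
       {\sum_{h=0}^{m_i} \exp\left( \sum_{j=1}^{h} a_i (\theta - \delta_{ij}) \right)},
\qquad x = 0, \dots, m_i .

% PCM: the special case a_i = 1 for all items.
% Between-item multidimensional GPCM: \theta is replaced by \theta_{d(i)},
% where d(i) \in \{1, 2\} indexes the questionnaire of item i, and
(\theta_1, \theta_2)^\top \sim
  N\!\left( \mathbf{0},
  \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right).
```

Setting ρ = 1 collapses the two dimensions into one and recovers the unidimensional GPCM, and additionally fixing all a_i to 1 recovers the PCM.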

Agreement between patients’ observed and predicted scores on the PF-10 and HAQ-DI was assessed by computing intraclass correlation coefficients (ICCs) with 95% confidence intervals using two-way mixed effects models with absolute agreement for single measurements (type A,1) [39]. ICCs were considered adequate for group level comparisons when ≥0.70 [40].
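For readers who want to reproduce the agreement statistic, the point estimate of the ICC for absolute agreement and single measurements can be computed from two-way ANOVA mean squares. The sketch below uses the standard mean-squares formula and omits the confidence interval. The function name and input layout are illustrative assumptions.

```python
import numpy as np


def icc_a1(scores):
    """ICC(A,1): two-way model, absolute agreement, single measurement.

    scores: array of shape (n_subjects, k_measurements), for example one
    column of observed scores and one column of crosswalk-predicted scores.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between measures
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k / n * (ms_cols - ms_err))
```

Unlike a consistency-type ICC, this coefficient is lowered by a systematic offset between the two columns, which is why it suits a comparison of observed and predicted scores.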

Cross-validation of the results

The final step of the analyses was to apply the crosswalk in the cross-validation sample and to evaluate the agreement between observed and predicted HAQ-DI and PF-10 scores. Agreement between patients’ observed and crosswalked scores on the PF-10 and HAQ-DI at baseline (n = 532) was again assessed by computing ICCs (type A,1). Additionally, Bland-Altman plots of the difference against the mean of predicted and observed scores were constructed [41, 42]. As a final test of the validity of the crosswalk, observed and predicted change scores and total effect sizes (Cohen’s d) were calculated for patients who completed both measures at baseline and 6-month follow-up (n = 276). The relative efficiency of the change scores to discriminate between responder status was analyzed using one-way analysis of variance (ANOVA) tests [43, 44]. The 28-joint Disease Activity Score (DAS28), a pooled index that includes a tender joint count, a swollen joint count, the erythrocyte sedimentation rate, and the patient’s global assessment of general health, was used as the external criterion for determining response to treatment [45]. Patients were classified as good responders at 6 months when the DAS28 score had improved at least 1.2 points and the final score was ≤3.2 [46]. For purposes of comparing results, relative validity (RV) coefficients with 95% bias-corrected and accelerated bootstrap confidence intervals [44, 47] for the predicted scores in relation to the actual observed scores were computed.
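The cross-validation statistics can be sketched in Python as follows. The sketch assumes that the RV coefficient is the ratio of one-way ANOVA F statistics, with the observed scores as the reference, which is the common definition but is not spelled out in the text. The bootstrap confidence intervals are omitted and all function names are hypothetical.

```python
import numpy as np


def limits_of_agreement(observed, predicted):
    """Mean difference and 95% limits of agreement (Bland-Altman)."""
    diff = np.asarray(observed, float) - np.asarray(predicted, float)
    mean_diff, sd = diff.mean(), diff.std(ddof=1)
    return mean_diff, mean_diff - 1.96 * sd, mean_diff + 1.96 * sd


def good_responder(das28_baseline, das28_followup):
    """Good response: improvement of at least 1.2 and final score <= 3.2."""
    return (das28_baseline - das28_followup) >= 1.2 and das28_followup <= 3.2


def f_statistic(groups):
    """One-way ANOVA F statistic for a list of 1-D samples."""
    groups = [np.asarray(g, float) for g in groups]
    all_values = np.concatenate(groups)
    grand = all_values.mean()
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def relative_validity(groups_predicted, groups_observed):
    """Assumed RV definition: F(predicted scores) / F(observed scores)."""
    return f_statistic(groups_predicted) / f_statistic(groups_observed)
```

Here each element of `groups_predicted` and `groups_observed` would hold the change scores of the patients in one response category, so an RV close to 1 means the predicted change scores separate the response groups about as well as the observed ones.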

Results

Patient characteristics

The calibration and cross-validation samples were comparable with respect to demographic characteristics (Table 1). However, baseline physical functioning levels were substantially better in the cross-validation sample, as measured with both the HAQ-DI and the PF-10. Patients in the cross-validation sample had moderately active disease on average at baseline according to the DAS28.

Table 1 Patient characteristics

Development of the Rasch-based crosswalk

Total scores on the PF-10 and HAQ-DI were strongly correlated (r = −0.75 for both the HAQ-SDI and HAQ-ADI). Both the Rasch-based co-calibration of the HAQ-SDI and PF-10 items and the co-calibration of HAQ-ADI and PF-10 items resulted in a model that adequately fitted the data according to the LM tests, with all accompanying ESs <0.10 (Additional file 1: Table S1 and S2).

Figure 1 presents the test information functions which describe the local reliability of the PF-10 and HAQ-SDI. Both scales measured an approximately equally wide range of physical functioning with high precision. Overall, the PF-10 was slightly more precise at better levels of physical functioning (i.e., lower theta values), whereas the HAQ-SDI tended to provide more information at worse levels of functioning.

Figure 1

Test information function curve (partial credit model) for the HAQ-SDI and PF-10 in relation to theta. The test information function is the sum of all separate item information functions. Higher positive theta scores indicate worse physical functioning.

Table 2 presents the resulting Rasch-based crosswalks between the PF-10 and HAQ-DI. Separate crosswalks are presented for the standard and alternative scoring rules of the HAQ-DI. As would be expected, predicted HAQ-ADI scores were generally lower than predicted HAQ-SDI scores for a given level of the PF-10. This effect was strongest in the range of HAQ-DI scores from 1 to 2, where for the same observed PF-10 total scores, the estimated HAQ-SDI scores were consistently 0.25 points (i.e., two score levels) higher than the HAQ-ADI scores. Observed HAQ-ADI and HAQ-SDI scores ≥2.75 were linked to locations on the EAP theta scale that were below the lowest possible score for the PF-10 scale. Conversely, observed PF-10 scores of 95 and 100 were linked to EAP scores that reflect levels of function that are not represented in the HAQ-DI. They were therefore linked to the value zero in the crosswalks.

Table 2 Rasch-based crosswalk for transforming PF-10 scores into HAQ-DI scores and vice versa

Comparative performance of the Rasch-based crosswalk

Model fit of the co-calibrations based on the two-parameter GPCM and the multidimensional IRT model improved marginally as compared with the Rasch model (Additional file 1: Table S3–S6). For both the GPCM and the multidimensional model, ESs were also <0.10 and generally slightly smaller than those observed in the Rasch model. The correlation between the latent dimensions in the multidimensional models was 0.73. The crosswalks based on the GPCM and multidimensional IRT model were almost identical to the Rasch-based crosswalk. Correlations between predicted scores based on the different crosswalks were very high (r’s >0.988). Moreover, the crosswalks based on the two-parameter and multidimensional models did not perform substantially better in terms of agreement between observed and predicted total scores on the PF-10 and HAQ-DI (Table 3). Considering that the Rasch-based calibration fitted the data well according to pre-specified criteria and that the agreement between observed and predicted scale scores did not improve much in the more general models, it was concluded that the Rasch-based crosswalk was adequate for converting total scale scores.

Table 3 Agreement (ICC, 95% CI) between observed and predicted total scale scores using crosswalks based on the different IRT models in the calibration sample (n = 1791)

Cross-validation of the results

The agreement between observed scores and scores predicted from the Rasch-based crosswalk was high in the cross-validation sample. The ICCs (95% CI) between predicted and actual scores were 0.78 (0.74 to 0.81) for the HAQ-SDI, 0.77 (0.72 to 0.80) for the HAQ-ADI and 0.79 (0.75 to 0.82) for the PF-10, indicating adequate agreement for group-level comparisons. Additionally, group mean differences on both scales were small in magnitude (Table 4). Intra-individual differences were similarly distributed above and below the mean and not related to the magnitude of the measurement (Figure 2). However, the limits of agreement were wide for both scales and showed substantial discrepancies in agreement within individual patients.

Table 4 Agreement between observed and predicted scores on the HAQ-DI and PF-10 in the cross-validation sample (n = 532)
Figure 2

Bland-Altman plots for agreement between observed and predicted HAQ-SDI (top left), HAQ-ADI (top right) and PF-10 scores (bottom left: predicted from HAQ-SDI, bottom right: predicted from HAQ-ADI). The y-axes represent the difference between observed and predicted scores. The x-axes represent the mean of observed and predicted scores. The horizontal solid lines represent the mean difference between both scores. The dashed lines represent the 95% limits of agreement.

With respect to the observed 6-month change scores in the total cross-validation sample (Table 5), standardized improvements were largest for the HAQ-ADI (ES = 0.55), closely followed by the HAQ-SDI (ES = 0.49) and the PF-10 (ES = 0.40). In terms of differentiating between levels of longitudinal treatment response, the HAQ-ADI was slightly more efficient than the HAQ-SDI and PF-10. Relative validity coefficients of the predicted scores were close to, and not significantly different from, those of the actual observed scores for all three scales.

Table 5 Baseline to 6-month effect sizes and mean (SD) changes in physical functioning scores across levels of DAS28 treatment response in the cross-validation sample (n = 276)

Discussion

This study used IRT methods to analyze and link two widely used scales for measuring physical functioning, the PF-10 and the HAQ-DI. Results showed that it was possible to develop a straightforward Rasch-based crosswalk between both scales that can be used to estimate scores on one scale from scores on the other in patients with RA. The Rasch-based crosswalk performed similarly to crosswalks based on its two-parameter and multidimensional extensions. The application of the crosswalk in an independent sample of patients with early RA indicated that the crosswalk can be validly used for group-level analyses in RA populations.

Test linking or test equating has long been the focus of research in educational and psychological settings [12, 48]. More recently, the desire for standardization has also found its way to health outcomes measurement. As in educational testing, linking of existing health outcome instruments could enhance meaningful comparison and interpretation of results across studies and populations. With the rise of IRT in health outcomes assessment, new techniques have become available to achieve this objective. This is reflected in an increasing number of studies that have linked different patient-reported measures using IRT-based methods, including several measures of physical functioning [15, 17, 19, 49–55]. These crosswalks allow researchers to compare their results with studies and populations where another instrument was used and may improve the common understanding of the specific underlying construct. Moreover, they may be particularly useful for compilation of findings in meta-analytic studies or longitudinal studies focusing on measuring effects or changes [56]. As such, crosswalks are an important step in achieving better interpretation and comparability of patient-reported outcome measures across different studies [57]. A next possible step in the standardization and promotion of a common measurement system of patient-reported outcomes is the development of large IRT-calibrated item banks such as those developed by the Patient-Reported Outcomes Measurement Information System (PROMIS) initiative [58]. These item banks can be used to build flexible short forms and computer adaptive tests for different populations or clinical conditions, while scores on these measures remain directly comparable. Recent studies have already shown the promise of this approach in RA [59].

The current study used an elaborate approach for cross-calibrating the HAQ-DI with the PF-10 and for developing and evaluating the crosswalk, most notably by comparing different IRT models. IRT linking studies usually do not explain or justify their use of a specific IRT model, such as the Rasch model or more general models. When using IRT analysis, however, the differences in model assumptions should be taken into account and the final model choice should be motivated by considering aspects such as the unidimensionality and the discrimination equality of the items [60]. Moreover, it should be shown to what degree the model used holds. In the case of using IRT for linking total scale scores, the specific model used may have consequences for the robustness and accuracy of the resulting crosswalk. This article presents a straightforward and practical IRT-based approach to linking total scale scores that includes comparing the fit and performance of different nested IRT models. This approach can be used for future studies aimed at linking different instruments intended to measure the same construct. An important feature of the approach is that it can be used for calibrating scales with polytomous items, which is the case with most patient-reported outcomes. In contrast to the Rasch model, fit tests based on statistics with known asymptotic distributions are rare for more complex models for polytomous items. Therefore, the presented approach uses the LM test throughout all fit analyses [34].

Additionally, most IRT linking studies to date have not tested the performance of the crosswalks in clinically different, independent samples. To our knowledge, this study is the first to cross-validate a crosswalk of physical functioning scales in a clinical setting. One recent study did validate a crosswalk for fatigue using data from a subsequent time point, but acknowledged that using an independent sample would have been preferable [56]. With the objective in mind of creating a robust crosswalk in this study, its development was performed in a large and diverse sample of RA patients with a wide range of physical functioning levels. Subsequently, the performance of the crosswalk was examined in a specific sample of patients with early disease.

The results of the IRT calibrations suggested that the PF-10 and the HAQ-DI essentially measure the same unidimensional construct and could be adequately fitted to the same Rasch model. The finding that the simple Rasch model performed similarly to more general models in calibrating both scales may have several theoretical and practical advantages [61–63]. An advantage in the case of total score linking is that each observed total instrument score is associated with only one latent trait (theta) score, making the resulting crosswalk more straightforward and robust against statistical error.

The evaluation of the measurement precision of the PF-10 and HAQ-DI under the Rasch model showed that the HAQ-DI and the PF-10 both measured a wide range of physical functioning in patients with RA. However, the HAQ-DI provided its optimal measurement precision at worse levels of physical function, whereas the PF-10 had better precision at somewhat better levels on the physical function continuum. This corresponds with previously reported ceiling effects of the HAQ-DI in less disabled populations [24, 64–66] and floor effects of the PF-10 in more disabled populations [67–70]. These effects were also apparent in the final crosswalk, where the HAQ-DI was better able to distinguish different scores at the lower end of the physical functioning spectrum and the PF-10 could better distinguish scores at the upper end. This supports previous findings that combining items from the HAQ-DI and PF-10 can reduce floor and ceiling effects and results in a scale with increased measurement precision and sensitivity to change across a wider range of physical functioning [25].

In the current study, separate crosswalks were developed for so-called standard (SDI) and alternative disability index (ADI) scoring of the HAQ-DI [5]. In the standard scoring method, the score on a category of daily living is corrected upwards when a respondent indicates the use of help from others or a device for performing one of the items in this category. Consequently, SDI scores are generally higher than ADI scores. Although the average difference between both scoring methods has been reported to be very small in general populations or populations with mild disability [71], SDI scores have been shown to be up to 0.15 to 0.26 points higher than ADI scores in samples with increasing disability levels [65, 72–74]. In the current study, this resulted in higher predicted scores for the SDI than for the ADI, especially for patients with worse levels of functioning. Therefore, care must be taken in using the correct crosswalk when converting PF-10 and HAQ-DI scores. Unfortunately, published studies do not always clearly specify which method was used to compute the HAQ-DI scores [75, 76]. If necessary and possible, researchers should therefore re-analyze the original data to compute the correct HAQ-DI scores.

Additionally, we presented the crosswalk for both the original and the norm-based scoring method of the PF-10. The original 0–100 scoring has been most frequently used in the literature to date. Since the introduction of version 2 of the SF-36, however, all eight scales can also be linearly transformed to T-scores based on normative data from the US general population [29]. This norm-based scoring method has become increasingly popular as it allows for easier interpretation of differences across scales and populations.

The two RA samples used to develop and evaluate the crosswalk in this study correspond with the two major populations of interest in current clinical studies in RA. The sample used to cross-calibrate the PF-10 and HAQ-DI represents the general and clinically diverse RA population seen in everyday clinical practice, and the distribution of age, sex, and functional disability scores in this sample corresponds closely with the characteristics reported in other large observational studies [77–79]. The cross-validation was performed in a sample of RA patients with a maximum symptom duration of one year. This population is gaining increasing research interest, mainly due to the development of effective biological treatments and the implementation of new treatment guidelines [80, 81]. The finding that the crosswalk also performed well in this very specific sample provides further support for its wide applicability in RA research.

It should be noted, however, that RA is characterized by very specific disease mechanisms and physical manifestations, such as a high frequency of dexterity problems. Consequently, the IRT item parameters of the HAQ-DI and PF-10 may vary between conditions and populations as was previously shown for the HAQ-DI across different rheumatic diseases [35]. Therefore, future studies should cross-validate the crosswalk in both general and other disease-specific populations.

Further, the crosswalk is not suitable for use at the individual patient level. Although ICCs between observed and predicted scores were adequate for group-level analyses, they were not sufficiently high to warrant individual-level analyses. This was confirmed by the Bland-Altman analyses, which showed that observed and predicted scores were characterized by high intra-individual variation. Therefore, crosswalked scores are not equivalent at an individual level and cannot be used interchangeably.

Conclusions

In sum, the crosswalk developed in this study enables the conversion of PF-10 scores into HAQ-DI scores and vice versa in patients with RA. Using the crosswalk will allow for group-level comparisons of data from studies that used either of the scales and can facilitate more meaningful interpretation and comparison of results. Future studies should examine the robustness of the crosswalk in other populations.