Background

Limitations in upper limb functioning, and the associated impact on daily activities, are common in musculoskeletal disorders, as well as in many other long-term conditions. One recent large-scale survey in China reported the prevalence of rheumatic pain in the neck, shoulder and elbow as 5.6, 3.1 and 1.4% respectively [1]. Another study in the USA, in a sample with a mean age of 68 years, found symptoms associated with the neck and shoulder to have a prevalence of 8 and 13% respectively [2]. An earlier large-scale study of those aged 16 years and above in the UK found the prevalence of pain, swelling or stiffness of the shoulder, elbow and hand to be 6.9, 3.1 and 6.6% respectively [3]. Thus, upper limb problems are common in the population, particularly among those with musculoskeletal disorders and in older people.

It is not surprising, therefore, that several Patient Reported Outcome Measures (PROMs) have been developed with the intention of assessing the extent of impairment of function, or of limitations in activities associated with the upper limb, in both children and adults [4, 5]. One such PROM, the Disabilities of the Arm, Shoulder and Hand scale (DASH), has gained widespread use across many chronic conditions [6, 7]. However, at the same time, concerns have been raised from a classical test theory perspective about the viability of the summated score of its 30 items, suggesting that the scale comprises more than one domain [8,9,10]. Similar concerns have been expressed from a modern test theory perspective when data from the scale have been fitted to the Rasch model, indicating a lack of unidimensionality, a breach of the local independence assumption among its items, and a lack of consistent scaling properties of its items [11,12,13]. Meanwhile, a shortened 11-item version, the QuickDASH, has emerged, which has also been criticized from both test perspectives [14,15,16].

This study seeks to examine the internal construct validity and other psychometric aspects of both the DASH and QuickDASH, from a modern test theory perspective, in a population of people with rheumatoid arthritis.

Methods

The National Research Ethics Service Committee North West - Greater Manchester North (12/NW/0841) and the University of Salford School of Health Sciences Ethics Panel provided ethical approval for this study. All participants provided written, informed consent.

Participants

Two recruitment strategies were applied in parallel. First, research nurses recruited participants in 17 Rheumatology out-patient clinics or identified them from department databases. Second, participants from a previously conducted outcome measure study in the same Rheumatology out-patient clinics who had agreed to be contacted for future research were re-checked for eligibility prior to consent. The following eligibility criteria were applied: 1) a confirmed diagnosis of rheumatoid arthritis (RA); 2) being able to read, write and understand English; and 3) having had no change (actual or planned) to their disease-modifying medication regimen in the previous three months.

Procedures

A questionnaire booklet was mailed to participants, collecting demographic and disease data: age, gender, marital, educational and employment status, disease duration, and RA disease-modifying medication.

The DASH and QuickDASH scoring

The DASH consists of 30 items scored on a 1–5 scale. The scoring instructions indicate that summating all items to a total score is acceptable, provided at least 27 of the 30 items have been completed. An algorithm is available to construct an overall standardized score of 0–100, including the handling of missing values. The 30-item scale can be said to assess upper limb functioning, comprising aspects of both pain and activities of daily living. The QuickDASH is formed from a subset of 11 items and is scored in a similar fashion. In the current study it was derived from participants’ completed DASH questionnaires.
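The scoring rule described above (the mean of completed items rescaled to 0–100, valid only when no more than three items are missing) can be sketched as follows. This is an illustrative Python implementation, not the official scoring algorithm:

```python
def dash_score(responses, n_required=27):
    """Standardized 0-100 DASH score from 30 item responses coded 1-5.

    `responses` is a list of 30 values, with None marking missing items.
    Returns None when fewer than `n_required` items were completed,
    mirroring the rule that at least 27 of 30 items must be answered.
    """
    answered = [r for r in responses if r is not None]
    if len(answered) < n_required:
        return None  # too many missing items to compute a score
    mean_item = sum(answered) / len(answered)
    return (mean_item - 1) * 25  # rescale the 1-5 item mean to 0-100

# A fully completed form with every item at the scale midpoint scores 50.
print(dash_score([3] * 30))  # -> 50.0
```

Averaging over completed items means a respondent with up to three missing answers still receives a score on the same 0–100 metric.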

Construct validity

Construct validity was examined through fit to the Rasch measurement model [17], with the data fitted using the RUMM2030 software [18]. The Rasch model is widely used in health outcomes research to determine the sufficiency of the raw score, unidimensionality, local independence, Differential Item Functioning (DIF), and the threshold ordering of polytomous items. Various published papers explain these aspects in some detail [19, 20]. In this study, fit to the model was determined through a non-significant Chi-Square (Benjamini-Hochberg adjusted p values with a 25% false discovery rate), and DIF was evaluated for age and gender. Any breach of the local (response) independence assumption was tested through item residual correlations of ≥0.2 above the average residual correlation [21, 22]. Where this breach occurred, items were summated into testlets (super items which simply add up the item set into one new item) to absorb the local dependency [23]. When these testlets were used to assess the scale, an additional indicator was available: the value ‘A’ in the RUMM2030 program, which is the proportion of common non-error variance retained in the resulting latent estimate. A value of 0.85 or above is considered sufficient to support a strong unidimensional general factor, consistent with the Explained Common Variance (ECV) found in the bi-factor literature [24, 25].
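The local dependency check and testlet construction can be sketched as follows. This is an illustrative Python/NumPy analogue, assuming a persons × items matrix of standardized item residuals; it is not RUMM2030's actual implementation:

```python
import numpy as np

def flag_dependent_pairs(residuals, margin=0.2):
    """Flag item pairs whose residual correlation is at least
    `margin` above the average off-diagonal residual correlation,
    the criterion used to detect local response dependency."""
    r = np.corrcoef(residuals, rowvar=False)  # items are columns
    n = r.shape[0]
    off_diagonal = r[~np.eye(n, dtype=bool)]
    cutoff = off_diagonal.mean() + margin
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if r[i, j] >= cutoff]

def make_testlet(item_scores, item_idx):
    """Sum a set of locally dependent items into one testlet
    (a 'super item') so that the dependency is absorbed."""
    return item_scores[:, item_idx].sum(axis=1)
```

Flagged pairs can then be inspected for clinically coherent clusters before the items are grouped into testlets.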

Precision

Finally, the precision of both the DASH and QuickDASH was considered through their Standard Error of Measurement (SEM) and the Smallest Detectable Difference (SDD), the latter expressed as a percentage of the operational range of the scale. The SDD is the value needed to show a meaningful difference above the level of error in the scale, and is calculated as 1.96 × √2 × SEM.
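These quantities can be sketched as follows; a minimal Python illustration assuming the usual definitions SEM = SD·√(1 − reliability) and SDD = 1.96·√2·SEM (the √2 arising because a change score carries measurement error from two occasions). These definitions reproduce the SDD values reported in the Results:

```python
import math

def sem(score_sd, reliability):
    """Standard Error of Measurement from the score SD and a
    reliability coefficient (e.g. the Person Separation Index)."""
    return score_sd * math.sqrt(1 - reliability)

def sdd(sem_value):
    """Smallest Detectable Difference for a change between
    two measurement occasions."""
    return 1.96 * math.sqrt(2) * sem_value

def sdd_as_percent(sdd_value, scale_min, scale_max):
    """SDD expressed as a percentage of the operational range."""
    return 100 * sdd_value / (scale_max - scale_min)
```

For example, the DASH SEM of 10.47 reported below yields an SDD of about 29 points, roughly 24% of the 30–150 score range.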

Results

Participants

Three hundred and thirty-seven subjects with confirmed RA completed the DASH questionnaire, with a mean age of 62.0 years (SD 12.1); 73.6% (n = 252) were female, 71.7% (n = 241) were married or living with a partner, and 10.8% (n = 36) had children living at home. Over half (n = 169; 50.8%) were retired, and a further 10.6% (n = 35) had retired early due to ill health.

DASH scores

The median score on the DASH using the standardized scoring system (i.e. range 0–100) was 33 (IQR 17.5–55.0). Only eight participants (2.4%) had more than three missing responses. A significant difference was observed between DASH scores by gender, with females having a higher (worse) score than males (Mann-Whitney U Z score = − 2.609; p = 0.009).

Rasch analysis

DASH-30

The data from the 30 items were fitted to the Rasch measurement model. The initial fit to support a unidimensional structure was poor (Chi-square = 518 (df 120), p < 0.001; item fit residual SD 2.6; person fit residual SD 1.48; reliability (Person Separation Index (PSI)) = 0.97). The local independence assumption was breached by 27 pairs of significant residual correlations (≥0.2 above the average residual correlation), indicating a pattern consistent with two potential domains: activity limitations, including mobility and self-care, and impairments of functions such as pain. For example, the items “Recreational activities in which you take some force or impact through your arm, shoulder or hand (e.g., golf, hammering, tennis, etc.)” and “Recreational activities in which you move your arm freely (e.g., playing frisbee, badminton, etc.)” had a residual correlation of 0.472. The residual correlation matrices of the DASH-30 and QuickDASH can be found in Additional file 1: Tables S1 and S2. Taking the domains identified, two testlets (super items) were created to absorb the local dependency. Fit then showed a Chi-Square of 7.6 (df 8), p = 0.469, and reliability (PSI) of 0.85, which was satisfactory. However, DIF was present for gender across both testlets (but not for age). Both testlets displayed the same pattern: DIF was present for the first class interval, representing the group with the least problems, and its magnitude did not exceed 0.11 logits. For higher levels of problems with, for example, PF (the testlet containing items 1–20), the magnitude of difference did not exceed 0.04 logits. The graphical interpretation of DIF on the PF testlet (items 1–20) is shown in Fig. 1. As such, no action was taken for DIF.

Fig. 1 Differential item functioning by gender for PF testlet

The unidimensional latent estimate so derived accounted for 88% of the total non-error variance, and was thus consistent with an ECV indicative of a strong general factor upon which all items load, supporting the unidimensionality of the total item set once the local dependency in the data had been accounted for. The t-test of the two independent estimates derived from the testlets showed that just 2.2% of estimates were significantly different, supporting unidimensionality. This was further supported by the latent correlation between the two testlets of 0.97. No individual showed a positive fit residual > 0.124 (where 2.5 and above would indicate misfit), suggesting that the total score adequately represents the profile of responses in all cases.
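The subset t-test used here can be sketched as follows; an illustrative implementation of the independent-estimates comparison, where each person's two testlet-based estimates are compared via their standard errors. RUMM2030's exact computation may differ:

```python
import numpy as np

def percent_significant(est_a, se_a, est_b, se_b):
    """Percentage of persons whose two subset-based estimates differ
    significantly (|t| > 1.96); values near or below 5% are taken
    to support unidimensionality."""
    est_a, est_b = np.asarray(est_a), np.asarray(est_b)
    se_a, se_b = np.asarray(se_a), np.asarray(se_b)
    t = (est_a - est_b) / np.sqrt(se_a ** 2 + se_b ** 2)
    return 100 * np.mean(np.abs(t) > 1.96)
```

Each person contributes one t-value, so the statistic is simply the proportion of persons for whom the two testlets tell measurably different stories.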

QuickDASH

Fit of the 11 items of the QuickDASH to the Rasch model was poor (Chi-Square 221.4 (df 44), p < 0.001; item fit residual SD 2.62; person fit residual SD 1.12; reliability (PSI) = 0.92). Again, the pattern of residual correlations indicated two domains. Applying the two-testlet solution resulted in satisfactory fit (Chi-Square 10.9 (df 8), p = 0.205; reliability (PSI) = 0.84; Cronbach’s α = 0.85). The t-test of the differences between the two testlet-derived estimates showed that just 1.6% of estimates were significantly different. The latent estimate indicated an ECV of 0.9. No person had a positive fit residual greater than 1.0, indicating that the raw score from all 11 items reflects the pattern of responses in all cases.

Measurement precision

The SEM of the DASH (on a score range of 30–150) was 10.47, and that of the QuickDASH (on a score range of 11–55) was 3.95. The SDDs were 29.0 and 10.9 respectively, amounting to 24.2% of the operational range of the DASH and 24.9% of the operational range of the QuickDASH. If the published figures on scale reliability were used, instead of estimates adjusted for the inflation caused by the breach of the local independence assumption, then the SEM of the DASH would be 5.03, and that of the QuickDASH 2.50. Consequently, the SDD would be 13.9 and 6.9 respectively, representing 11.6 and 15.7% of the scale widths.

Exchange between DASH-30 and QuickDASH

Given fit to the Rasch model, any subset of items should give a valid estimate of a person’s upper-limb functioning. Consequently, two new testlets were created: one from the QuickDASH set of 11 items, the second from the remaining 19 items. Fit to the model was excellent (Chi-Square = 6.71 (df 8), p = 0.569; reliability (PSI) = 0.96; Cronbach’s α = 0.89). The t-test of the differences between the two testlet-derived estimates showed that just 5.3% (95% CI: 2.5–8.2) of estimates were significantly different. The latent estimate indicated an ECV of 0.99, and the latent correlation between the two testlets was 1.0. There was no DIF by age or gender. As such, a transformation between the two versions of the scale was available. Table 1 gives the raw score (using the 0–4 item range, giving a total range of 0–120; add 30 to create the 30–150 range), the standardized score for each scale (range 0–100), and the associated interval-scale Rasch metric (linearly rescaled to the range 0–100). The latter can be used for parametric analyses, assuming appropriate distributions.

Table 1 Raw score to interval scale Rasch Metric for DASH and QUICKDASH

An exchange between the two scales can be obtained by looking at either the raw score or the standardized score, and the associated Rasch metric. For example, consider a standardized score on the DASH of 40.0, and its associated Rasch metric of 51.3. The nearest equivalent Rasch value on the QuickDASH is 51.0, and this is associated with a QuickDASH standardized score of 36.4. Thus, a DASH standardized score of 40.0 is equal to a QuickDASH standardized score of 36.4.
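The exchange described above amounts to a nearest-neighbour lookup on the Rasch metric. In the sketch below, only the 36.4/51.0 pair is taken from the worked example; the neighbouring rows are hypothetical placeholders included so that the lookup is runnable:

```python
def exchange_score(dash_rasch_metric, quickdash_table):
    """Return the QuickDASH standardized score whose Rasch metric
    is nearest to the given DASH Rasch metric."""
    standardized, _ = min(quickdash_table,
                          key=lambda row: abs(row[1] - dash_rasch_metric))
    return standardized

# (standardized score, Rasch metric) rows. Only (36.4, 51.0) comes
# from the worked example in the text; the other rows are
# illustrative placeholders, not values from Table 1.
quickdash_rows = [(31.8, 47.5), (36.4, 51.0), (40.9, 54.5)]

# A DASH standardized score of 40.0 has a Rasch metric of 51.3; its
# nearest QuickDASH metric is 51.0, giving a standardized score of 36.4.
print(exchange_score(51.3, quickdash_rows))  # -> 36.4
```

In practice the full conversion table (Table 1) would be used in place of the placeholder rows.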

Discussion

Taking the basic 30-item (DASH) and 11-item (QuickDASH) scales, modern test theory analyses showed that neither version of the DASH satisfied model expectations. Multidimensionality, and a breach of the local (response) independence assumption, were evident in both cases. Thus the initial findings are consistent with the previously published psychometric evidence for the scales [8,9,10,11,12,13,14,15,16, 26]. Yet in both scales, clues exist as to why these findings may misrepresent construct validity. These lie in the breach of the local independence assumption, whereby clusters of items can be found with varying degrees of residual correlation. In clinical scales this is quite common: in rehabilitation, for example, health professionals may need to know whether or not a patient can dress their upper body and their lower body, despite psychometric evidence showing a high residual correlation between the two activities. This breach of the local independence assumption has been shown both to inflate classical reliability and to be corrosive for interpretation of the Rasch model [27]. Historically this has led to potential misinterpretation of the construct validity of well-known scales [28]. For example, it can make thresholds appear disordered; in the current analysis, of the 5 thresholds that were disordered, four belonged to items that were locally dependent with one or more other items. Local dependency can also drive multidimensionality, which was also the case here. Thus, failure to apply suitable methods to accommodate this dependency may explain some of the challenges identified in previously published findings about the scales.

Consequently, having applied the testlet approach to accommodate local response dependency in the DASH and QuickDASH item sets, within their frame of reference of rheumatoid arthritis, their total (ordinal) scores are shown to be valid, in that both satisfy the Rasch model assumptions. It is possible that the frame of reference is crucial, as the DASH and QuickDASH are generic scales, applied across a wide range of conditions. Thus, previous findings in other conditions may reflect differences in how the scale works across different conditions [29]. However, a proper comparison cannot be made until the local dependency issue is firmly dealt with in those other conditions. Failure to take this into account may lead to erroneous conclusions about the construct validity of the scale, including the interpretation of multidimensionality driven by the local dependency. It may also, as shown above, lead to erroneous conclusions about the SEM and SDD of a scale: an inflated reliability will make a scale look much better than it really is, with a lower SEM and consequent SDD.

While the level of detail is reduced in the testlet-based analysis (although this is consistent with how the scales are used in practice), examination at the item level in diagnostic mode is also informative. There is little doubt that the item ‘Tingling’ is a major threat to fit within the Rasch model framework. In both the DASH and QuickDASH Rasch analyses, this item showed a magnitude of misfit at the individual item level much greater than all other items, consistent with the findings of earlier studies [8]. Furthermore, unlike most other items, which showed local dependency with one or more other items, ‘Tingling’ showed no such dependency at all. Yet when absorbed into the more abstract ‘Impairment of Functions’ testlet, along with other items such as pain, the testlet itself shows no signs of misfit. So perhaps the effect of the ‘Tingling’ item is mitigated at this level, as indeed it must be at the whole-test level. Thus, given that the focus of use is always on the whole scale score (or subscales when present), the question arises as to the appropriate focus for analysis. The two-testlet solution, as applied here, operates almost at the whole-scale level, similar to the classical test theory approach. Indeed, recent approaches in scale banking, whereby full scale scores for a particular construct are used as items, suggest a different approach to understanding scale construct validity, albeit still from the Rasch Measurement Theory perspective [30]. Here the emphasis is upon the scale score as a whole, co-calibrated with other total scores from other scales measuring the same construct; technically this does not even need the items, as the total score becomes the item in such an approach. As such, the focus of analysis becomes crucial. Where existing scales are being considered, their total (or subscale) score can be the initial focus, although if this fails, the analysis sharpens to the item level as the reason for failure is explored. When new scales are being developed, the focus should be reversed: each individual item should be considered in the context of all others before inclusion in the final version of the scale.

While classical test theory approaches have failed to confirm the factor structure of the scales, it is possible that a bi-factor solution may have worked, as the ‘A’ value in the RUMM2030 software is equivalent to the ECV in the bi-factor literature, being derived from the difference in reliability estimates between the item- and testlet-based solutions [24]. As the ECV in the current study was below 1, some (although not much) of the variance was discarded from the latent estimate; this variance would populate the secondary factors in a classical bi-factor model. Nevertheless, both the DASH and QuickDASH had ECV values consistent with a dominant unidimensional first factor, and the associated t-test confirmed that the latent estimates so derived were unidimensional [31].

One caveat to the above analysis is that a two-testlet solution comprises, in effect, just two items, and this may affect the power of the test of fit to the model. However, these two ‘items’ have a considerable score range, and the relevant indicator in the RUMM2030 program showed that the power of the test of fit was ‘excellent’.

Conclusion

The DASH and QuickDASH have been shown to satisfy Rasch model expectations in an RA population, having accounted for local dependency of items through a testlet approach. Consequently, their raw (and standardized) scores can be deemed a sufficient statistic, at the ordinal level, for upper limb functioning. Rasch-transformed interval-scaled estimates are available for the calculation of change scores and other mathematical and parametric procedures. Data from other conditions will need to be re-analyzed to ensure that any breach of the local independence assumption is adequately dealt with before other aspects are considered. Calculations of SEM and SDD for PROMs should always be based upon an unbiased estimate of reliability, having taken care of the inflation caused by a breach of the local independence assumption associated with all summative scales.