Efficient and precise Ultra-QuickDASH scale measuring lymphedema impact developed using computerized adaptive testing

Purpose This study aimed to evaluate and improve the accuracy and efficiency of the QuickDASH for assessing limb function in patients with upper extremity lymphedema using modern psychometric techniques. Method We conducted confirmatory factor analysis (CFA) and Mokken analysis to examine the unidimensionality assumption of the item response theory (IRT) model on data from 285 patients who completed the QuickDASH. We then fitted the data to Samejima’s graded response model (GRM), assessed the assumption of local independence of items, and calibrated the item responses for computerized adaptive testing (CAT) simulation. Results Initial CFA and Mokken analyses demonstrated good scalability of items and unidimensionality. However, the local independence assumption was violated between items 9 (severity of pain) and 11 (sleeping difficulty due to pain) (Yen’s Q3 = 0.46), and disordered thresholds were evident for item 5 (cutting food). After these breaches of assumptions were addressed, the re-analyzed GRM with the remaining 10 items achieved an improved fit. Simulation of CAT administration demonstrated a high correlation between scores on the CAT and the QuickDASH (r = 0.98). Items 2 (doing heavy chores) and 8 (limiting work or daily activities) were the most frequently used. The correlation between factor scores derived from the 11-item QuickDASH and from the Ultra-QuickDASH consisting of items 2 and 8 was as high as 0.91. Conclusion By administering just these two best-performing QuickDASH items, we can obtain estimates that are very similar to those obtained from the full-length QuickDASH, without the need for CAT technology. Supplementary Information The online version contains supplementary material available at 10.1007/s11136-021-02979-y.


Unidimensionality of scale
The unidimensionality assumption states that the scale under study measures only one dimension, or a single metric. If the scale embraces more than one dimension, the measure becomes uninterpretable. Dimensionality can be ascertained with factor analysis, either exploratory factor analysis (EFA) or confirmatory factor analysis (CFA); the choice between them depends on whether the underlying structure of the scale has already been established. In addition, Mokken analysis offers an alternative way to further verify the dimensionality explored by factor analysis [1].

Scalability of items
Scalability of items ensures that we obtain interval-level measurement and monotonically increasing item response functions. We assessed the scalability of items using Loevinger's H coefficient generated from Mokken analysis [2]. Items, or the scale as a whole, were considered sufficiently scalable only when Loevinger's H reached 0.30 or above [3].
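As an illustration of the scale-level coefficient, the following minimal sketch computes Loevinger's H as the sum of observed inter-item covariances divided by the sum of the maximum covariances attainable given each item's marginal distribution. The function names and the toy response matrices are hypothetical and are not taken from the study; dedicated Mokken software additionally reports item-level and item-pair coefficients and standard errors.

```python
from itertools import combinations


def covariance(x, y):
    """Population covariance of two equal-length score sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n


def loevinger_h(responses):
    """Scale-level Loevinger's H for a list of respondent rows.

    For each item pair, the maximum covariance given the two marginal
    distributions is obtained by sorting both score vectors, so that
    high scores are paired with high scores.
    Assumes no item is constant (otherwise the denominator can be zero).
    """
    items = list(zip(*responses))  # transpose: one tuple of scores per item
    observed = 0.0
    maximum = 0.0
    for x, y in combinations(items, 2):
        observed += covariance(x, y)
        maximum += covariance(sorted(x), sorted(y))
    return observed / maximum


# Hypothetical toy data: rows are respondents, columns are items.
perfectly_ordered = [[1, 1, 1], [2, 1, 1], [2, 2, 1], [3, 3, 2]]
noisy = [[1, 3, 2], [2, 1, 1], [3, 2, 3], [2, 2, 1], [1, 1, 2]]

h_ordered = loevinger_h(perfectly_ordered)
h_noisy = loevinger_h(noisy)
sufficient = h_noisy >= 0.30  # cut-off used in the text
```

In the perfectly ordered toy data every item ranks respondents identically, so the observed covariances equal their maxima and H reaches its upper bound of 1; any departure from that ordering lowers H.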

Graded response model (GRM)
The graded response model (GRM) [4], a flexible IRT model for polytomous responses, was fitted to the data. It allows discrimination to vary among items, keeps the same functional form when response categories are merged, and is easy to understand; these characteristics make it far superior to one-parameter models (e.g. the Rasch model) [5] and two-parameter models (e.g. the generalized partial credit model) [6]. Discrimination (a) and difficulty (b) parameters were estimated in the GRM analysis. The discrimination parameter reflects how well an item distinguishes between respondents at different levels of the underlying trait; the difficulty parameter gives the trait level at which a respondent has a 50% probability of endorsing a given response category or a higher one.
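To make the roles of the two parameters concrete, the sketch below evaluates the GRM category probabilities for a single item: the cumulative probability of responding in category k or higher is a logistic function of a(θ − b_k), and the probability of each category is the difference between adjacent cumulative probabilities. The function name and the parameter values are hypothetical and chosen only for illustration; they are not the calibrated QuickDASH estimates.

```python
import math


def grm_category_probabilities(theta, a, thresholds):
    """Return P(X = k) for k = 1..K under the graded response model.

    theta: latent trait level of the respondent
    a: item discrimination
    thresholds: ordered difficulty parameters b_1 < ... < b_{K-1}
    """
    # Cumulative probabilities P(X >= k), with P(X >= 1) = 1 and P(X >= K + 1) = 0.
    cumulative = [1.0]
    for b in thresholds:
        cumulative.append(1.0 / (1.0 + math.exp(-a * (theta - b))))
    cumulative.append(0.0)
    # Category probability = difference between adjacent cumulative probabilities.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical five-category item (four thresholds), respondent at theta = 0.
probs = grm_category_probabilities(theta=0.0, a=1.5, thresholds=[-1.0, 0.0, 1.0, 2.0])
```

Because the second threshold equals the respondent's trait level in this example, the probability of responding in category 3 or higher is exactly 0.5, which is the interpretation of the difficulty parameter given above. A larger a makes the cumulative curves steeper, so the item separates respondents near its thresholds more sharply.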

Local independence of items
Local independence means that the items of the scale should be uncorrelated after controlling for the latent variable. It was assessed with Yen's Q3 statistic, the correlation between item residuals. A residual correlation above 0.2 indicates a breach of the local independence assumption [7]. High residual correlations tend to occur when items are too similar, and they lead to inflated reliability and model misfit [8]. Three ways are currently available to address a Yen's Q3 larger than 0.2: deleting an item on sound grounds, retaining both items but entering only one of them into the analysis, or combining both items into a testlet.
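The residual-correlation check can be sketched as follows. The example assumes that model-expected item scores are already available for each respondent (in practice they come from the fitted GRM and the estimated trait levels); the function names and all numbers are hypothetical.

```python
from itertools import combinations
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def yen_q3(observed, expected, cutoff=0.2):
    """Return ({(i, j): Q3}, flagged_pairs) for all item pairs.

    observed, expected: lists of respondent rows (same shape).
    A pair is flagged when its residual correlation exceeds the cutoff.
    """
    n_items = len(observed[0])
    # Residual = observed score minus model-expected score, stored per item.
    residuals = [
        [obs_row[i] - exp_row[i] for obs_row, exp_row in zip(observed, expected)]
        for i in range(n_items)
    ]
    q3 = {
        (i, j): pearson(residuals[i], residuals[j])
        for i, j in combinations(range(n_items), 2)
    }
    flagged = [pair for pair, value in q3.items() if value > cutoff]
    return q3, flagged


# Hypothetical observed scores and model-expected scores for 5 respondents, 3 items.
observed = [[3, 3, 2], [2, 2, 3], [4, 4, 3], [1, 1, 2], [5, 5, 5]]
expected = [[2.5, 2.5, 2.5], [2.5, 2.5, 2.5], [3.8, 3.8, 3.2],
            [1.2, 1.2, 1.8], [5.0, 5.0, 5.0]]
q3, flagged = yen_q3(observed, expected)
```

In this toy example the first two items share identical residuals, so only that pair is flagged as locally dependent; it would then be handled by one of the three remedies listed above.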
Each item in this study had a polytomous response format with five categories, scored on a 5-point Likert scale from 1 to 5. Ordered categories mean that each category is modal, that is, the most probable response over some range of the latent trait; otherwise the overall model fit is negatively affected. Category threshold ordering was also examined by inspecting the item characteristic curves, to ensure interval-level measurement and to guarantee that each category is used in the same way across respondents. Categories with disordered thresholds are collapsed and rescored to restore the correct ordering.

Differential item functioning (DIF)
The absence of differential item functioning (DIF) means that scores on the patient-reported outcome measure (PROM) should not change because of demographic group membership [9]. DIF occurs when different groups have different probabilities of endorsing specific items even though they have the same level of ability. The bias caused by DIF can reduce validity for between-group comparisons, and its impact is greater in CAT because only a limited number of items is administered. Deleting or ignoring DIF items are the current practices for addressing this issue [10]. However, if more than 50% of the items show DIF, separate scales are suggested for the individual groups [11].