Measuring disability in multiple sclerosis: the WHODAS 2.0

Introduction: Reliable measurement of disability in multiple sclerosis (MS) using a comprehensive, patient self-reported scale, such as the World Health Organization Disability Assessment Schedule (WHODAS) 2.0, would be of clinical and research benefit.
Methods: In the Trajectories of Outcome in Neurological Conditions-MS study, the WHODAS 2.0 (WHODAS-36 items if working, WHODAS-32 items if not working, WHODAS-12 items short form) was examined using Rasch analysis in 5809 people with MS.
Results: The 36- and 32-item parallel forms, and the cognitive and physical domains, showed reliability consistent with individual or group use. The 12-item short form is valid for group use only. Interval-level measurement for parametric statistics can be derived from all three scales, which showed medium to strong effect sizes for discrimination across characteristics such as age, subtype, and disease duration. The smallest detectable difference for each scale was < 6 on the standardised 0–100 metric, i.e. less than 6% of the total range. There was no substantial differential item functioning (DIF) by age, gender, education, working full/part-time, or disease duration; the finding of no DIF for time or sample supports the use of the WHODAS 2.0 in longitudinal studies, with the 36- and 32-item versions and the physical and cognitive domains valid for individual patient follow-up.
Conclusions: Disability in MS can be comprehensively measured at interval level by the WHODAS 2.0, and validly monitored over time. Routine use of this self-reported measure in clinical and research practice would give valuable information on the trajectories of disability of individuals and groups.
Supplementary Information: The online version contains supplementary material available at 10.1007/s11136-023-03470-6.


Methods of Rasch Analysis
Data from each (sub)scale were tested against the requirements of the Rasch measurement model [1]. Briefly, these requirements are: i) unidimensionality; ii) monotonicity; iii) homogeneity; iv) local independence; and v) group invariance [2, 3]. Whichever set of items is to be added together to provide a score should satisfy all of these requirements. That is, the items should: i) measure one thing (domain/construct/trait); ii) show a probability of a positive response (or, in the case of polytomous items, of the transition from one response category to the next) that increases with underlying ability, as should the total score [4]; iii) retain the same hierarchical ordering at each level (or grouping) of the score [5]; iv) be conditionally (on the score) independent of one another [6]; and v) elicit the same responses, conditional on the total score, across groups such as age or gender, referred to as (the absence of) Differential Item Functioning (DIF) [3].
Each requirement is tested. A t-test is used to determine whether two separate groups of items deliver significantly different estimates, following the procedure given by Smith [7]. The hierarchical ordering of items across the scale is determined through a chi-square test of fit based on grouped scores. Monotonicity is evaluated through inspection of the item-category ordering. Conditional item dependence is determined through the correlation of residuals, where pair-wise correlations should not exceed the average residual correlation by more than 0.2 [8]. Should clusters of locally dependent items be found, consideration is given to grouping these into 'super items' or testlets (simply adding them together to make one larger item, the latter based on a priori defined groups) to absorb the local dependency [9]. In the RUMM2030 software, this gives a bi-factor equivalent solution retaining a specified proportion of the variance. This "Explained Common Variance" (ECV) is reported, whereby a value below 0.7 indicates that a multidimensional model is required, a value above 0.9 indicates a unidimensional model, and the grey area in between is undetermined, requiring further evidence [10].
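The residual-correlation rule above can be sketched in code. This is an illustrative check only, not the RUMM2030 implementation: it assumes a persons × items matrix of standardised Rasch residuals is already available, and flags item pairs whose residual correlation exceeds the average by more than 0.2.

```python
import numpy as np

def flag_local_dependence(residuals, threshold_above_mean=0.2):
    """Flag item pairs whose residual correlation exceeds the average
    residual correlation by more than `threshold_above_mean`.

    residuals: persons x items array of standardised Rasch residuals.
    Returns the average off-diagonal residual correlation and the list
    of flagged (i, j) item pairs.
    """
    corr = np.corrcoef(residuals, rowvar=False)
    n_items = corr.shape[0]
    # average of the off-diagonal (pair-wise) residual correlations
    mean_q3 = corr[np.triu_indices(n_items, k=1)].mean()
    flagged = [(i, j)
               for i in range(n_items)
               for j in range(i + 1, n_items)
               if corr[i, j] > mean_q3 + threshold_above_mean]
    return mean_q3, flagged
```

Flagged pairs would then be candidates for combination into a testlet, as described above.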
Consequently, an ECV of 0.9 or above is considered acceptable in the current analysis. If two parallel forms are created, either from a subscale structure (if present) or from the pattern of local dependency in the item set, a latent correlation of ≥ 0.9 between them is required. This is consistent with the reliability required for individual use [11].
Consequently, valid parallel forms require both a latent correlation of ≥ 0.9 and an ECV of ≥ 0.9.
Group invariance (DIF) is tested through an ANOVA of residuals for age, gender, duration since diagnosis, education level, and whether the patient is self-employed or employed, and working full-time or part-time. Should DIF be identified, a comparison of person estimates from the split and unsplit solutions is used to determine whether it is 'substantive' [12]. Where the difference is significant (by a paired t-test), the result is reported as an effect size, where a value higher than 0.1 is considered to represent substantive DIF [13]. If this is present, the scale works in different ways for the contextual factor under consideration, and results are reported separately. Finally, reliability is reported as both a Person Separation Index (PSI) and Cronbach's alpha. If the data are normally distributed the two are equivalent; otherwise the PSI tends to be lower, the more the data are skewed. Both are interpreted against the same benchmarks, so values below 0.7 are described as low, as they do not support group use.
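The 'substantive DIF' check can be illustrated as follows. This is a sketch under assumptions: the exact effect-size formula is not specified in the text, so a paired-samples standardised mean difference (mean of the person-estimate differences divided by their standard deviation) stands in for it here.

```python
import numpy as np
from scipy import stats

def dif_effect_size(theta_unsplit, theta_split):
    """Compare person estimates from the unsplit and split (DIF-adjusted)
    Rasch solutions.

    Returns the paired t statistic, its p-value, and a standardised
    effect size of the difference (assumed here to be the paired-samples
    standardised mean difference). Per the text, an effect size above
    0.1 would be read as substantive DIF.
    """
    d = np.asarray(theta_split) - np.asarray(theta_unsplit)
    t, p = stats.ttest_rel(theta_split, theta_unsplit)
    effect = abs(d.mean()) / d.std(ddof=1)
    return t, p, effect
```

In the analyses reported below, effect sizes of 0.02–0.09 fall well under the 0.1 threshold, which is why the unsplit solutions are retained.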
A hierarchical approach to seeking fit of the data to the model for existing scales is adopted, with level 1 as the priority (Supplementary file 2: Table S1). All aspects listed above must be met for any level of solution. Should the original data fail to fit the model at any level (i.e. at a level 5 solution), item deletion will be considered (level 6). If this fails, then level 7 tests whether the scale satisfies ordinal scaling; if not, level 8 indicates failure.

Methods of Trajectory Analysis
A group-based trajectory model was applied, which is designed to identify groups of individuals following similar developmental trajectories [14, 15]. It was implemented through the traj.ado add-on in Stata 17 [16]. The number and shape (via polynomial functions) of the trajectories were determined by analysing one- to five-group models without covariates. To accommodate attrition, a 'dropout' model was applied, specified in its basic form of constant dropout across assessment occasions [17]. The Bayesian Information Criterion (BIC) was used to determine the best-fitting model, with consideration also given to usefulness and parsimony. Average posterior probabilities above 0.7 were also deemed to indicate optimal fit [18]. Missing data were handled using a maximum likelihood approach based on a missing-at-random assumption.
The syntax for this approach is derived from a Stata 'add-on'. To obtain it, enter the following into the Stata command line: . net from http://www.andrew.cmu.edu/user/bjones/traj
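The average-posterior-probability criterion used to judge model adequacy can be sketched outside Stata. This is an illustrative Python check, not part of the traj package: it assumes the posterior group-membership probabilities have been exported as a matrix, assigns each person to their modal group, and computes the average posterior probability of the assigned members per group (values above 0.7 read as adequate fit, per [18]).

```python
import numpy as np

def average_posterior_probabilities(post):
    """Average posterior probability per trajectory group.

    post: n_people x n_groups matrix of posterior group-membership
    probabilities from a group-based trajectory model. Each person is
    assigned to their modal (highest-probability) group; the function
    returns, for each group, the mean posterior probability among the
    people assigned to it.
    """
    post = np.asarray(post)
    assigned = post.argmax(axis=1)          # modal group per person
    return np.array([post[assigned == g, g].mean()
                     for g in range(post.shape[1])])
```

A group whose average falls below 0.7 would suggest poorly separated trajectories, prompting reconsideration of the number of groups.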

A.4. Getting along
The 'sexual activities' item displayed substantial misfit and was removed. Following this, fit to the model was adequate. No DIF was observed. The same solution was required in the validation sample, but on this occasion two items displayed local dependency (LD).

A.5. Life Activities
The eight items of the 'life activities' domain proved a significant challenge, driven by the fact that the household and work items formed two distinct and strong clusters of local item dependency. Using these clusters as testlets enabled a weak solution, but one in which 30% of the variance had to be discarded to obtain a unidimensional latent estimate.
The household and work items were then examined as separate domains, of which only the latter reached adequate fit. The new household domain had substantial DIF by age, with an effect size of 1.6 for the difference between the unsplit and split solutions. The split solution offered only a weak solution. In the validation sample, the same problems were identified in the total score, requiring 34% of the variance to be discarded to obtain a weak solution. DIF was observed for age, but the curves were indistinguishable and so no further action was taken. The household subset could also not be resolved in the validation sample, including attempts to resolve it by item splitting on disease subtype. The four items did, however, form a valid ordinal scale (Loevinger's coefficient 0.92). The new work domain gave adequate fit.
In summary, the total score of Life Activities offers a weak solution as about one-third of the variance had to be discarded to achieve a unidimensional latent estimate.
After splitting the Life Activities domain into work and household item sets, the work items satisfied the Rasch model, but the household items did not, retaining only an ordinal structure.

A.6. Participation
Two pairs of locally dependent items were identified and combined into one testlet, with the remaining items forming a second testlet, giving adequate fit. No DIF was observed. The same solution was found for the validation sample, again giving adequate fit.

B.1. Physical
The physical component had relatively poor fit to the model, and displayed DIF for subtype on the 'getting around' item. However, the effect size of the difference in person estimates between the split and unsplit solutions was just 0.04, so no further action was taken and the unsplit solution was retained. The validation sample showed good fit to the model. DIF was evident for gender, but the effect size of the difference between the split and unsplit solutions was 0.09, and so no further action was taken.

B.2. Cognitive/social
The cognitive/social component had good fit to the model, but displayed DIF for subtype on the 'participation' item. However, the effect size of the difference in person estimates between the split and unsplit solutions was just 0.03, so no further action was taken and the unsplit solution was retained. A similar result appeared in the validation sample, with subtype showing DIF but with an effect size of 0.09.

C. Total
The total score of the WHODAS-36 showed good fit to the Rasch model, given a bi-factor equivalent solution. There was DIF by disease subtype, notably where PPMS deviated from the curve in both testlets. However, the effect size of the difference in person estimates between the split and unsplit solutions was 0.08, and so no further action was taken. Given the direction of the difference across the two testlets (one above, one below), it is almost certain that this DIF cancelled out at the test level. A similar result was found in the validation sample, with DIF by subtype, but with an effect size of just 0.02.

WHODAS-32
The total WHODAS-32 showed good fit to the model under a component strategy.
However, DIF was evident for disease subtype, with Primary Progressive MS showing a higher level of problems on the physical testlet and Secondary Progressive MS a lower level on the cognitive-social testlet. The effect size of the difference in person estimates between the split and unsplit solutions was 0.008, indicating no significant bias, and so no further action was taken. The same solution was found in the validation sample, but with no DIF.

WHODAS-12
Reliability (alpha) of the individual domains ranged from 0.71 to 0.94, and so a component approach was adopted. The total score had good fit to the model using the component strategy in the training sample (Table S1c). There was DIF by subtype: Primary Progressive MS showed a higher (worse) score on the physical component, but a lower score on the cognitive/social component. However, the effect size of the difference between the split and unsplit solutions was 0.08, and so no further action was taken. For the cognitive/social component, there was DIF by subtype, but the difference in curves was trivial and no further action was taken. The physical component had adequate fit to the model, good reliability, and no DIF.
The results were largely repeated in the validation sample, where fit for the total score with a component approach was adequate. DIF was evident for disease subtype on the physical component, but the curves were indistinguishable and no action was taken. Likewise, DIF by disease subtype was evident for the cognitive/social component, but with an effect size of just 0.013. The physical component had good fit.

Comments on Granularity and Reliability
As the granularity of the analysis increased, so did the disturbance of the model, mostly caused by variations in local item dependency across samples (for example, no Local Dependency (LD) in the training sample, but a single pair in the validation sample), or by DIF. At the domain level there was variation in reliability, particularly where there were significant floor effects, leading to a divergence between the Person Separation Index (PSI), which is affected by a skewed distribution, and Cronbach's alpha, which is not. Nevertheless, all domains retained reliability (alpha) consistent with at least group use. The cognition and mobility domains had high reliability in their original format.
When the training and validation samples were merged (n = 1050), the disturbances seen at the levels above were absent at the total score level. Importantly, there was never any DIF for time at any level of analysis, nor was there DIF by sample in the pooled data, supporting both use in longitudinal studies and cross-validation across samples.

Cross Validation
The training and validation samples were merged, and the total scores of the three versions were examined for the combined sample and for time (Table S2). All three versions showed good fit to the model, and cross-validation was supported by the absence of DIF across samples. Furthermore, there was no DIF by time, supporting the use of the scale in longitudinal studies.

How to use this nomogram
Provided the participant has answered all items in the scale, the scores assigned to each item (none = 0, mild = 1, moderate = 2, severe = 3, extreme = 4) are added together; this summed total is called the raw score, and it is ordinal. To obtain an interval-level estimate suitable for parametric analyses, read across the line to the appropriate column.
For example, if the WHODAS-36 was administered and all items answered, a total raw score of 115 is equal to an interval score of 57.0.If the WHODAS-32 was used, a total raw score of 115 is equal to an interval score of 63.7.
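The scoring steps above can be sketched as a small lookup. Only the two raw-to-interval pairs quoted in the text are real values; the full conversion table lives in the published nomogram, so the dictionary below is a hypothetical stub for illustration.

```python
# Stub conversion table: only the (raw, interval) pairs quoted in the
# text are real; the complete table comes from the published nomogram.
RAW_TO_INTERVAL = {
    "WHODAS-36": {115: 57.0},
    "WHODAS-32": {115: 63.7},
}

def interval_score(version, item_scores):
    """Sum complete item responses (each scored 0-4) to an ordinal raw
    score, then map it to the interval-level estimate via the nomogram
    table. Raises if any item is unanswered, since the nomogram assumes
    complete data; raises KeyError for raw scores absent from this stub.
    """
    if any(s is None for s in item_scores):
        raise ValueError("All items must be answered to use the nomogram.")
    raw = sum(item_scores)
    return RAW_TO_INTERVAL[version][raw]
```

For instance, a WHODAS-36 response set summing to a raw score of 115 maps to the interval score 57.0 given in the text.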

Supplementary File 2: Table S1. Strategies seeking fit of the data to the model (column headings: Level, Nature, Adjustments, Reporting, Chi-Square, ECV ≥ 0.9).
References
1. Rasch G. Probabilistic Models for Some Intelligence and Attainment Tests. Chicago: The University of Chicago Press; 1980.
2. Gustafsson J. Testing and obtaining fit of data to the Rasch model. Br J Math Stat Psychol. 1980;33(2):205-33.
3. Teresi JA, Kleinman M, Ocepek-Welikson K. Modern psychometric methods for detection of differential item functioning: application to cognitive assessment measures. Stat Med. 2000;19(11-12):1651-83.
4. Kang HA, Su YH, Chang HH. A note on monotonicity of item response functions for ordered polytomous item response theory models. Br J Math Stat Psychol.
5. Identification of local dependence in the Rasch model using residual correlations. Appl Psychol Meas. 2017;41(3):178-94.