A common, international, and interdisciplinary framework of disability measurement is important to develop effective and comparable policy and practice options[1, 2]. During the last decades, the definition of disability has moved from the biomedical and social models to the biopsychosocial model, emphasizing the dynamic and bidirectional relations between a health condition and contextual factors (personal and environmental). In order to reach a universally accepted conceptual framework to define and classify disability[3, 4], the World Health Organization (WHO) developed the International Classification of Functioning, Disability, and Health (ICF)[5, 6]. In the ICF, disability is described as "a difficulty in functioning at the body, person, or societal levels, in one or more life domains, as experienced by an individual with a health condition in interaction with contextual factors" [7].

As part of the ongoing development of the ICF conceptual model, the World Health Organization Disability Assessment Schedule 2.0 (WHODAS-2) was created in 1998 (as a substantially revised version of the WHO-DAS[8]) to assess disability based on the ICF model[9]. Other tools have traditionally been used to measure disability, such as the Indexes of activities of daily living (ADLs)[10], the Functional Limitations Profile[11], or the Functional Status Questionnaire[12], as well as a battery of instruments developed for specific populations (e.g., the Late Life Function and Disability Instrument for elders[13], and the Functional Disability Inventory for children[14]). Nevertheless, none of them was developed on the basis of the ICF biopsychosocial conceptual model.

Previous studies have evaluated the metric properties of the WHODAS-2 in specific samples, such as arthritis[15], systemic sclerosis[16], psychotic disorders[17], hearing loss[18], stroke[19], ankylosing spondylitis[20], depression and low back pain[21], schizophrenia[22], and patients in rehabilitation[23], among others[24]. However, data regarding the validity of the WHODAS-2 across a range of diagnoses, settings, and countries are missing. Moreover, these studies generally focused on reliability, validity or responsiveness, while the underlying factor structure has almost never been assessed. Available evidence confirming the original structure is only provided for a modified version (i.e. the WHODAS used in the WMH surveys initiative[25, 26]), while findings from WHODAS-2 exploratory factor analysis were not consistent with the proposed measurement model [23, 24]. Thus, a comprehensive evaluation of the conceptual model and metric properties of the WHODAS-2 is needed.

The 'Measuring Health and Disability in Europe: Supporting policy development-MHADIE'[8, 27] is a European multidisciplinary project which has as one of its main objectives the evaluation of the ICF model and related instruments in clinical and rehabilitative settings. As part of this international project, the aim of the present study was to assess the WHODAS-2 conceptual model and metric properties in a set of chronic and prevalent clinical conditions, both physical and mental disorders, accounting for a wide scope of disability in Europe.



The MHADIE is an observational, longitudinal, multicentric study of consecutive patients with different chronic conditions in 7 European centres from Czech Republic, Germany, Italy, Slovenia, and Spain. Evaluations were made at baseline and at 6 weeks and 3 months of follow-up. Background characteristics such as age, sex, education or occupational status were collected from all subjects. In addition, patients were clinically evaluated with disease-specific severity scales, and with standardised instruments measuring disability and quality of life.


Patients had to be over 18 years old and meet the diagnostic criteria for one of the following conditions: bipolar disorder, depression, osteoarthritis, osteoporosis, rheumatoid arthritis, chronic widespread pain (CWP), low back pain (LBP), ischemic heart disease (IHD), migraine, Parkinson disease, multiple sclerosis, traumatic brain injury (TBI), or stroke. Sample size was based on recommendations for exploratory and confirmatory factor analyses (at least 20 participants per variable), and balanced by disorder. Ethical approval from each institutional ethics committee and informed consent from each participant were obtained.

Measurement instruments

The World Health Organization Disability Assessment Schedule-2

The WHODAS-2 contains 36 items on functioning and disability with a recall period of 30 days[8] covering 7 domains: Understanding and Communicating (6 items), Getting around (5 items), Self-care (4 items), Getting along with others (5 items), Life activities: household (4 items), Life activities: work/school (4 items), and Participation in society (8 items). Response options range from 1 (no difficulty) to 5 (extreme difficulty or cannot do).

WHODAS-2 scores are computed for each domain by adding the item responses (the score computation allows for up to 30% of missing items per domain) and transforming them into a range from 0 to 100, with higher scores indicating higher levels of disability. A global score is also calculated from all 36 items, or from the 32 items remaining after excluding the Life activities: work/school items when that domain does not apply to the respondent. When less than 50% of items were missing, mean substitution (by domain) was used for imputation.
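As an illustration of the domain scoring described above, the following is a minimal sketch rather than the official WHODAS-2 scoring algorithm. It assumes a simple linear rescaling of the summed 1-5 responses to 0-100, and it assumes that a missing item is replaced by the respondent's own mean of the answered items in that domain; the function name `domain_score` and the `max_missing` parameter are hypothetical.

```python
from typing import Optional, Sequence


def domain_score(responses: Sequence[Optional[int]],
                 max_missing: float = 0.30) -> Optional[float]:
    """Sum the 1-5 item responses of one domain and rescale to 0-100.

    None marks a missing item. Assumptions (not taken from the official
    scoring manual): linear rescaling, and mean substitution using the
    respondent's own answered items in the same domain.
    """
    n_items = len(responses)
    answered = [r for r in responses if r is not None]
    if n_items == 0 or not answered:
        return None
    if any(r < 1 or r > 5 for r in answered):
        raise ValueError("responses must be coded 1 to 5")

    n_missing = n_items - len(answered)
    if n_missing / n_items > max_missing:
        return None  # too many missing items: no score is computed

    # Mean substitution: each missing item takes the mean of the answered ones.
    mean_answered = sum(answered) / len(answered)
    total = sum(answered) + n_missing * mean_answered

    # The raw sum ranges from n_items (all 1s) to 5 * n_items (all 5s).
    return 100.0 * (total - n_items) / (4 * n_items)
```

For example, `domain_score([1, None, 3, 5])` imputes the missing item with 3 and returns a mid-range score, whereas a domain with half of its items missing returns `None` under the default threshold.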

The Short Form-36 Health Survey (SF-36)

The SF-36 is a generic Health Related Quality of Life (HRQL) instrument measuring 8 domains: Physical Functioning, Role Physical, Bodily Pain, General Health, Vitality, Social Functioning, Role Emotional, and Mental Health [28]. Items are transformed into scores from 0 (worst possible health state) to 100 (best). A weighted addition of these domains allows the computation of two summary scores: Physical and Mental Component Summaries (PCS & MCS)[29, 30]. Scores were not computed for those individuals with more than 50% of missing items per domain. All patients were administered the SF-36 version 1, except those with bipolar disorder or depression, who completed version 2. The main differences between the two versions concern the number of response options of the Role domains, which was increased from 2 to 5, and minor changes in the mental health and vitality dimensions (from 6 to 5 response options)[31].

Disease-specific severity scales

As shown in Table 1, several different scales were used to evaluate the severity of the health conditions [32-40]. A consensus on the best way of classifying patients into different severity groups, in order to evaluate differences in WHODAS-2 scores, was reached between the researchers and the clinical specialists responsible for the patients' management. The criteria used for classifying patients as mild, moderate or severe are defined in Table 1. The sample sizes of the final groups are also shown.

Table 1 Health condition, severity scales and criteria to make groups.

Questionnaires were either self-administered or interviewer-administered. Proxy versions were occasionally used with those patients unable to respond due to the severity of the health condition leading to cognition or communication difficulties, such as aphasia.

Analytical strategy

Exploratory and Confirmatory factor analyses (EFA & CFA) were performed to assess WHODAS-2 structure and dimensionality. The global sample at baseline was divided into two random sub-samples, stratifying by pathology and severity group (n1 = 533 and n2 = 547). As WHODAS-2 responses are categorical variables, the factor analyses were based on polychoric correlations, and robust weighted least squares estimators were used[41, 42]. The first subsample (n1) was used to perform an EFA with oblique (quartimin) rotation[43]. The factor structure obtained by the EFA was then tested with a CFA using the second subsample (n2). The model to be confirmed was also specified with a general (global) second-order factor related to the specific factors. In this type of model, the general factor (second level) explains the correlation among the specific factors (first level)[44]. Goodness-of-fit was measured by the Root Mean Square Error of Approximation (RMSEA, adequate if below 0.08), and the Comparative Fit Index (CFI) and Tucker-Lewis Index (TLI), which are recommended to be over 0.95[45]. These analyses were conducted with Mplus 4.2 and missing values were considered missing at random[45].
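The stratified random split into two sub-samples can be sketched as follows. This is an illustrative sketch only: the function name `stratified_halves`, the record layout, and the rule for allocating the odd member of a stratum are assumptions, and the sketch aims for halves of nearly equal size, which is not necessarily how the study's sub-samples were drawn.

```python
import random
from collections import defaultdict


def stratified_halves(records, stratum_of, seed=0):
    """Randomly split records into two halves within each stratum.

    stratum_of maps a record to a sortable stratum label, for example
    a (pathology, severity) tuple.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)

    first, second = [], []
    for label in sorted(strata):
        members = strata[label]
        rng.shuffle(members)
        cut = len(members) // 2
        # An odd-sized stratum has one extra member; it goes to whichever
        # half is currently not larger, keeping the overall sizes balanced.
        if len(members) % 2 and len(first) <= len(second):
            cut += 1
        first.extend(members[:cut])
        second.extend(members[cut:])
    return first, second


# Hypothetical usage with made-up records.
patients = [
    {"id": i, "condition": c, "severity": s}
    for i, (c, s) in enumerate(
        [("stroke", "mild")] * 5
        + [("migraine", "severe")] * 4
        + [("depression", "moderate")] * 3
    )
]
efa_sample, cfa_sample = stratified_halves(
    patients, lambda p: (p["condition"], p["severity"])
)
```

Splitting within each stratum keeps the mix of conditions and severity levels similar in the exploratory and confirmatory halves.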

Distribution of WHODAS-2 and SF-36 scores was evaluated for the whole sample: means (SD), observed range, percentage of patients with missing domain scores, and floor and ceiling effects (proportion of patients with the worst and best possible score, respectively). Reliability was assessed in terms of internal consistency and reproducibility. The former was evaluated with the Cronbach's alpha coefficients computed with the whole sample at baseline[46]. To assess reproducibility, a sub-sample of stable patients (their clinical-severity not having changed at the six weeks evaluation) was identified. Concordance in the scores of stable patients was estimated with the Intra-class Correlation Coefficient (ICC)[47].
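Two of the descriptive quantities mentioned above, Cronbach's alpha and the floor and ceiling effects, can be sketched as follows. This is a minimal illustration that assumes complete item responses and uses the standard alpha formula based on sample variances; the function names are hypothetical. Following the definition given above, the floor is the worst possible score and the ceiling the best, which for the WHODAS-2 are taken here to be 100 and 0.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for complete data.

    rows: one list of item scores per respondent, all the same length.
    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    item_variances = sum(variance(col) for col in zip(*rows))
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        raise ValueError("total score has zero variance")
    return k / (k - 1) * (1 - item_variances / total_variance)


def floor_ceiling(scores, worst=100.0, best=0.0):
    """Percentage of respondents at the worst (floor) and best (ceiling) score."""
    valid = [s for s in scores if s is not None]
    if not valid:
        raise ValueError("no valid scores")
    floor_pct = 100.0 * sum(1 for s in valid if s == worst) / len(valid)
    ceiling_pct = 100.0 * sum(1 for s in valid if s == best) / len(valid)
    return floor_pct, ceiling_pct
```

The intra-class correlation is not sketched here because the text does not state which ICC form was used.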

Construct validity was assessed by 2 different approaches: the Multitrait Multimethod (MTMM) Matrix[48] and known groups. Taking into account similarity of content, Pearson correlations (MTMM) were hypothesized a priori to be moderate (0.4-0.6) between some of the WHODAS-2 domains and the SF-36 scores. Known groups were defined in two ways: first, based on the severity of the health condition (mild, moderate, and severe) and second, based on whether the patients were working or not due to their health condition (i.e. those who were on sick leave or who reported "ill health" as the main reason for not working for pay). Mean scores were compared with ANOVA and the magnitude of the difference between extreme groups was measured by an Effect Size coefficient (difference in mean scores between groups/pooled SD)[49].
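The Effect Size coefficient for known groups (difference in means divided by the pooled SD) can be sketched as follows. The text does not specify how the SD was pooled, so this sketch assumes the usual degrees-of-freedom weighted pooled variance; the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def effect_size_between(group_a, group_b):
    """(mean of group_b - mean of group_a) / pooled SD.

    Assumption: the pooled variance weights each group's sample variance
    by its degrees of freedom (n - 1).
    """
    na, nb = len(group_a), len(group_b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least 2 observations")
    pooled_variance = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    if pooled_variance == 0:
        raise ValueError("pooled variance is zero")
    return (mean(group_b) - mean(group_a)) / sqrt(pooled_variance)
```

With the mild group passed as `group_a` and the severe group as `group_b`, a positive value indicates higher (worse) disability scores in the severe group.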

To assess sensitivity to change, the only conditions included were those where an improvement was expected over the study period (all except bipolar disorder, osteoarthritis, Parkinson disease, and multiple sclerosis). Patients suffering from any of these pathologies with a positive change in the severity measure after 3 months were considered "clinically improved". Paired mean comparisons (t-test) between baseline and the third evaluation of these patients were conducted. In this case, the magnitude of the difference was also assessed with ES coefficients, but computed by dividing the difference in the scores between the two evaluations by the SD at baseline. An ES above 0.8 is considered high, one around 0.5 moderate, and one close to 0.2 low[50].
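The responsiveness coefficient described above differs from the known-groups ES only in its denominator, which is the SD at baseline. A minimal sketch follows, assuming paired scores for the same patients and a sign convention (baseline minus follow-up) chosen here so that a decrease in WHODAS-2 disability yields a positive value; the function name is hypothetical.

```python
from statistics import mean, stdev


def responsiveness_es(baseline, follow_up):
    """Mean paired change (baseline - follow-up) divided by the baseline SD."""
    if len(baseline) != len(follow_up):
        raise ValueError("baseline and follow-up must be paired")
    if len(baseline) < 2:
        raise ValueError("need at least 2 patients")
    sd_baseline = stdev(baseline)
    if sd_baseline == 0:
        raise ValueError("baseline SD is zero")
    changes = [b - f for b, f in zip(baseline, follow_up)]
    return mean(changes) / sd_baseline
```

For instruments scored in the opposite direction, such as the SF-36, the sign of the result would need to be reversed to represent improvement.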


Sample characteristics are shown in Table 2. More than half of the subjects were not working for pay (57.8%), and 49% of them (n = 327) reported a main reason: 184 retired and 75 with 'ill health'. The EFA showed the 7-factor model to be the most appropriate structure (Table 3). Most of the WHODAS-2 items (86%) presented the highest loading on their corresponding factor. Moreover, the highest factor loading of each item was above 0.5 in 75% of the cases. The CFA yielded CFI and TLI values above the standard of 0.95 (0.975 and 0.973, respectively), although the RMSEA (0.127) exceeded the 0.08 threshold; overall, the results supported the 7 domains proposed, as well as the global score.
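For clarity, the reported CFA indices can be checked against the cut-offs stated in the analytical strategy (CFI and TLI over 0.95, RMSEA below 0.08). The helper below is a hypothetical illustration of that comparison only; it does not reproduce the model fitting.

```python
def check_fit(cfi, tli, rmsea, min_cfi_tli=0.95, max_rmsea=0.08):
    """Return, per index, whether it meets the cut-off given in the text."""
    return {
        "CFI": cfi > min_cfi_tli,
        "TLI": tli > min_cfi_tli,
        "RMSEA": rmsea < max_rmsea,
    }


# Values reported for the confirmatory factor analysis.
reported = check_fit(cfi=0.975, tli=0.973, rmsea=0.127)
```

With the reported values, CFI and TLI meet their cut-off while the RMSEA does not.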

Table 2 Socio-demographic characteristics of global sample, and the reproducibility and improvement sub-samples.
Table 3 Quartimin rotated loadings* of the Exploratory Factor Analysis with 7 Factors.

The distribution characteristics and reliability coefficients of WHODAS-2 and SF-36 scores are reported in Table 4. The global WHODAS-2 mean score was 24.8 (SD = 19.3), ranging from 0.0 to 93.5. The proportion of missing values was lower than 16% for most of the WHODAS-2 domains (with the exception of 'life activities: work or school', which was not answered by 50.2% of the sample). The floor effect was not relevant, but quite a high ceiling effect was present in almost all domains, especially for 'Self-care' (53.6%). Cronbach's alpha was above 0.7 for all WHODAS-2 scales, and was highest for the two domains of 'Life activities' and for the Global score (0.94-0.98). The last column of Table 4 shows the results of the test-retest evaluation of reproducibility. The ICC was lower than Cronbach's alpha coefficient, but achieved the recommended standard of 0.7 for 4 of the domains.

Table 4 Distribution of scores and reliability coefficients for the WHODAS-2 and SF-36 domains

Table 5 presents the MTMM Matrix, where the correlations hypothesized as moderate (in bold) were confirmed. The global WHODAS-2 score was moderately correlated with most of the scores of the SF-36, with the main exception of 'Bodily pain', which presented quite low correlations with all the WHODAS-2 domains. The 'Participation in society' domain presented moderate to high correlations (0.4-0.6) with all the SF-36 dimensions. Moreover, moderate correlations not previously hypothesized were found between 'Life activities at work or school' and 'Social functioning' from the SF-36 (0.5), and between 'Life activities: household' and three of the SF-36 dimensions: 'Physical functioning' (0.6), 'Social functioning' (0.48), and 'Role physical' (0.47).

Table 5 Multitrait-multimethod matrix. Pearson correlation coefficients between the WHODAS-2 and the SF-36 scores.

The WHODAS-2 global score showed statistically significant differences among severity groups for all pathologies (Figure 1) with ES coefficients over 0.7 between mild and severe groups, except for low back pain. Table 6 shows mean scores of the specific domains by each severity group. Three of the WHODAS-2 domains ('Getting along with people', 'Life activities: household', and 'Life activities: work or school') presented non-significant differences among severity groups for more than half of the pathologies. For physical disorders, in general no significant differences across severity were observed in the 'Understanding and communicating' domain, and the ES coefficients were generally smaller than for the mental or neurological conditions. The results showed that at least 4 of the 7 WHODAS-2 domains differed significantly by severity group for all conditions, except stroke. Most of the mean differences between extreme groups presented an ES coefficient > 0.5.

Figure 1 WHODAS-2 global score for each severity group by pathology. Means and 95% confidence intervals are shown, together with the Effect Size (ES) coefficient between extreme groups. *No statistically significant difference.

Table 6 WHODAS-2 domain specific scores by disorder, according to severity level.

Almost all the WHODAS-2 scores showed statistically significant differences (p < 0.001) between working patients and those not working due to ill health (Figure 2), and all except 2 presented an ES above 0.5. For the SF-36 scores, only 3 out of 10 ES coefficients were moderate or high.

Figure 2 WHODAS-2 scores for patients working (dots) and not working-sick leave (striped). Means and 95% confidence intervals are shown, together with the Effect Size (ES) coefficient between working and not working-sick leave patients.

Figure 3 shows the mean change of the WHODAS-2 scores and SF-36 component summaries among the subsample of patients that had improved. The ES coefficients were moderate for 2 WHODAS-2 domains: 'Life Activities: work or school' (ES = 0.47), and 'Participation in Society' (ES = 0.66); and for the Global score (ES = 0.55). For the rest of the scores the ES was less than 0.4.

Figure 3 Mean change of the WHODAS-2 scores and the SF-36 component summaries after 3 months. Mean changes and 95% confidence intervals are shown, together with the Effect Size (ES) responsiveness coefficient.


This study confirms the conceptual model of the WHODAS-2, which has shown good metric properties among patients with chronic conditions in Europe in the MHADIE project: a very high reliability, good ability to discriminate among known groups and adequate capacity to detect change over time. Therefore, these results support the adequacy of the WHODAS-2 to measure disability in a wide range of physical and mental disorders.

The goodness of fit indices obtained with the CFA models together with the high factor loadings confirmed the 7-domain structure of the WHODAS-2 and the global score [44], as proposed by its developers. Some concerns should nevertheless be raised. The RMSEA was not below the recommended standard. CFA modification indexes (data not shown) suggested that the structural model underlying the data might be improved if some items from the 'Participation in Society' domain were relocated to some of the other factors. Nonetheless, accepting the original structure proposed by the developers would improve comparability with past and ongoing WHODAS-2 studies. Therefore, we suggest using the structure of the WHODAS-2 as it is now known, taking into account the expert-based validity criteria originally applied and that, despite the described concerns, our findings confirmed it in a heterogeneous sample. Moreover, the structure is quite consistent with previous results, both from specific populations[23, 24] and from the modified version[25].

The low proportion of missing values suggests that the WHODAS-2 is easy to complete for a wide range of patients, indicating its high feasibility. A large percentage of missing data was only found in the domain of activities at work or school (50.3%), which is clearly related to the proportion of respondents who were neither working nor studying. The moderate percentage of patients with the best possible score in several domains suggests that the WHODAS-2 may be unsuitable for differentiating among very low grades of disability. This may not be a limitation for measuring disability in patient samples, but one should be cautious when using it in other samples such as the general population, in which a very high ceiling effect has previously been shown[26]. Nonetheless, the distribution of the 'Participation in society' score merits a comment. No patient had the worst possible score (floor effect), and this domain presented the lowest ceiling effect (11%), indicating that it is able to characterize a wide range of scenarios and is perhaps reflective of the final common pathway in which disability is manifested in the societal context.

The high internal consistency coefficients indicate good reliability. All of them were above the standard proposed for group comparisons (0.7) [51], which is consistent with findings from previous studies[23, 15, 19, 21, 22, 24]. It is also remarkable that the internal consistency coefficient for the global score reaches the strictest standard recommended for individual comparisons, 0.95. Reproducibility was acceptable, with the exception of the 'Getting around' domain (ICC = 0.19). Because of the long test-retest period, patients' mobility may have improved or worsened over 6 weeks, even though disease severity did not change substantially. The only study in which the stability of the WHODAS-2 has been assessed presented excellent ICC coefficients (0.82-0.96) in patients with inflammatory arthritis[15].

The WHODAS-2, designed to cover disability, measures restrictions on daily life activities and social participation, while the Short Form-36 Health Survey addresses patients' physical and mental health. The moderate magnitude of the associations between the two instruments reflects how the WHODAS-2 and the SF-36 measure different aspects of related concepts (disability and HRQL, respectively). In fact, coefficients found in previously published studies[23, 15-18, 20, 21] were fairly similar to ours. These findings support the validity of the WHODAS-2 to measure disability and its use as an outcome which complements HRQL.

The WHODAS-2 is able to detect differences between clinical-severity groups. Those patients classified as severe reported worse disability scores than mild patients, with a large difference for most of the health conditions (66%), and a moderate difference for 25% of them. Poor discrimination ability among severity groups was found only for 3 of the WHODAS-2 domains ('Getting along with people', 'Life activities household' and 'Life activities work or school'). Besides this, the instrument detects differences between patients who were working at the time of the study and those who were not working due to their health condition. This is the first time that such an ability has been evaluated for the WHODAS-2, and it is especially remarkable when talking about disability, probably more so than being able to differentiate among severity groups (which has also been shown in other studies[15, 16, 22, 23]).

Coefficients of change at 3 months were moderate or low for all domains. However, the WHODAS-2 sensitivity to change may be underestimated in our study because of the characteristics of the MHADIE patients and design, such as the chronic profile of the conditions and the absence of an evaluated intervention. Moreover, this pattern of low improvement was also shown by the SF-36 (no physical change and moderate mental improvement), an instrument which has extensively demonstrated good responsiveness[52, 21]; this indicates the lack of a substantial real improvement in our sample rather than a problem of the WHODAS-2 in detecting change over time. In fact, a previous study has demonstrated that the WHODAS-2 is quite responsive (ES = 0.65) when change is measured after starting a treatment[21].

This study's results should be interpreted taking into account some limitations. Firstly, the study was not specifically designed for evaluating responsiveness, since the optimum design for this should include an intervention which would produce a clear improvement or an event closely related to deterioration. However, assuming that a change in severity would be accompanied by a change in self-perceived disability, patient improvement was measured indirectly due to the lack of a gold standard for disability change. Secondly, the interval for test-retest evaluation is longer than the standard period used to assess reproducibility. However, the selection strategy applied assured the needed stability and ICC coefficients showed agreement between evaluations. Moreover, it should be noted that different WHODAS-2 linguistic versions were administered according to the country setting, but analyzed as a whole. To test the equivalence of these versions, differential item functioning (DIF) analysis would be required [53]. However, this was not possible in our study because of the sample design, where most of the health conditions were recruited in only one country, making it impossible to differentiate the effect of these two variables. Finally, other minor limitations are related to version differences. The SF-36 v2 was used for Spanish patients with psychiatric disorders but, as versions 1 and 2 of the SF-36 are quite similar, no impact on results was expected. On the other hand, the use of proxy versions for those patients unable to respond was negligible.


Despite some limitations, as discussed above, the results provide considerable support for the use of the WHODAS-2 as a common, international, and interdisciplinary instrument to measure disability. Furthermore, it is of special relevance as the only measure based on the ICF biopsychosocial model. A strength of the study is that the underlying latent structure originally designed by the developers has been confirmed for the first time. Moreover, this was done in a heterogeneous sample (different health conditions in several European countries), which adds to the value of the results, together with the assessment of its good metric properties. In conclusion, the WHODAS-2 is adequate to evaluate disability in patients with chronic conditions, which may help to eliminate barriers to developing policies by providing evidence of these populations' needs.