Using WOMAC Index scores and personal characteristics to estimate Assessment of Quality of Life utility scores in people with hip and knee joint disease
This study aimed to determine whether Assessment of Quality of Life (AQoL) utility scores can be reliably estimated from Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) scores in people with hip and knee joint disease (arthritis or osteoarthritis).
WOMAC and AQoL data were analysed from 219 people recruited for a national population-based study. Generalised linear models were used to estimate AQoL utility scores based on WOMAC total and subscale scores and personal characteristics. Goodness of fit was assessed for each model, and plots of prediction errors versus actual AQoL utility scores were used to gauge bias.
Each model closely predicted the average AQoL utility score for the overall sample (actual mean AQoL 0.64, range of predicted means 0.63–0.64; actual median AQoL 0.71, range of predicted medians 0.68–0.69). No clear preferred model was identified, and overall, the models predicted 40–46 % of the variance in AQoL utility scores. The WOMAC function subscale model performed similarly to the total score model. The models functioned best at the mid-range of AQoL scores, with greater bias observed for extreme scores. Inaccuracies in individual-level estimates and low/high health-related quality of life (HRQoL) subgroup estimates were evident.
Reliable overall group-level estimates were produced, supporting the application of these techniques at a population level. Using WOMAC scores to predict individual AQoL utility scores is not recommended, and the models may produce inaccurate estimates in studies targeting patients with low/high HRQoL. Where pain and stiffness data are unavailable, the WOMAC function subscale can be used to generate a reasonable utility estimate.
Keywords: Arthritis; Osteoarthritis; Outcome measures; Quality of life
As in many countries, osteoarthritis (OA) is a major public health problem in Australia and a leading cause of disability. The hip and knee joints are commonly affected, with over 86,000 joint replacements performed in 2012 for severe joint disease. As part of our national study to explore the broader burden of hip and knee joint disease (arthritis and OA) in Australia [2, 3], we collected health status and health-related quality of life (HRQoL) data using the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) and Assessment of Quality of Life (AQoL) instruments, respectively. Disease-specific measures of well-being such as the WOMAC Index are valuable for health-related research as they incorporate domains considered relevant to the condition of interest (for example, pain and function). These measures can also be highly responsive to change [4–6], which is an important consideration when assessing intervention effectiveness or patient deterioration over time. On the other hand, generic (or non-disease-specific) health measures provide a more holistic overview of HRQoL by incorporating broader constructs such as social relationships and psychological health. A key advantage of generic HRQoL measures, such as the AQoL, is that they enable comparisons to be made between different health conditions and treatments. This is particularly relevant for funders of health care services (for example, governments and health insurers) who must determine priorities for allocating limited resources. Instruments such as the AQoL also generate utility scores, which can be used to calculate quality-adjusted life years (QALYs) and estimates of cost-utility. These data are paramount for health economic evaluations and assist policy decision-making about the value of new and existing interventions.
As disease-specific and generic measures of well-being provide complementary information, it is common for both types of instruments to be included in epidemiological and intervention studies. However, in studies where participant burden is a concern or for completed studies where only disease-specific data have been collected, there is great interest in statistical methods for estimating utility scores based on disease-specific scores. Using disease-specific data to estimate utility scores could also reduce bias from missing HRQoL data in longitudinal studies or randomised controlled trials. In the literature, this technique is commonly referred to as ‘mapping’, and the majority of research in this field has focussed on estimating EQ-5D (EuroQoL) HRQoL utility scores from other self-reported data. Mapping data to utility scores also has important health policy implications, and these methods have been used in submissions to the Pharmaceutical Benefits Advisory Committee in Australia and the National Institute for Health and Care Excellence in the UK to facilitate the evaluation of pharmaceutical and other interventions.
In rheumatology research, studies have mapped disease-specific WOMAC scores collected from people with hip or knee OA or knee pain to generic measures of HRQoL such as the EQ-5D [11, 12] and the Health Utilities Index Mark 3 (HUI3) [13, 14]. Researchers have also used data from joint-specific instruments such as the Oxford Hip and Knee Scores to estimate EQ-5D utility scores. Although the AQoL instrument has been used to assess HRQoL in a range of clinical and research settings including OA studies, we are not aware of any studies that have mapped disease-specific WOMAC data to AQoL utility scores. Whether reliable utility estimates can be generated remains unknown. Using data from a population-based study of hip and knee joint disease, this study aimed to determine whether WOMAC scores can be used to predict AQoL utility scores.
Materials and methods
Secondary analysis of data collected in a population-based survey.
Participants and procedure
Data from 237 individuals with hip arthritis, hip OA, knee arthritis and/or knee OA were extracted for the present study from a national survey. The sample selection, recruitment and data collection methods for the population-based survey have been reported previously [2, 18]. In brief, following approval from the Australian Electoral Commission (AEC), we obtained an extract from the federal electoral roll that was used to sample people aged ≥39 years from all eight Australian states and territories. As electoral enrolment is compulsory for Australians aged ≥18 years, the electoral roll provides comprehensive coverage of the Australian adult population. Name, age group, sex and address information was available from the extract. The selected sample was mailed an introductory letter, plain language statement and study questionnaire. Return of a completed questionnaire was deemed to constitute consent, as approved by The University of Melbourne Human Research Ethics Committee. In addition to screening for four doctor-diagnosed conditions (hip arthritis, hip OA, knee arthritis and knee OA), the study questionnaire was used to collect self-reported data including date of birth, country of birth, language spoken, marital status, highest level of education, height, weight and previous joint replacement surgery.
Body mass index (BMI, kg/m2) was calculated using height and weight data and classified into underweight (BMI < 18.5), normal weight (BMI 18.5–24.99), overweight (BMI 25–29.99) and obese (BMI ≥ 30) categories, according to World Health Organisation definitions. Residential location was classified as metropolitan or provincial/rural based on AEC ratings for each federal electoral division. Socio-economic status was approximated using postcodes to link to the Australian Socio-Economic Indexes for Areas (SEIFA) 2006 Index of Relative Socio-Economic Advantage and Disadvantage. The lowest SEIFA decile represents geographical areas with the greatest socio-economic disadvantage, while the highest decile represents areas with the greatest advantage, based on census data including level of education, occupation, housing status, single-parent homes and car ownership.
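The BMI classification described above can be sketched as follows. This is a minimal illustration only: the function names and the metric input units are assumptions, and the cut-points are those quoted in the text.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2 from weight (kg) and height (m)."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Map a BMI value to the four categories quoted in the text."""
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "normal weight"
    if value < 30:
        return "overweight"
    return "obese"


if __name__ == "__main__":
    # Hypothetical respondent, not study data
    value = bmi(weight_kg=80.0, height_m=1.80)
    print(round(value, 1), bmi_category(value))
```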
The WOMAC Index and AQoL-4D instruments were also included in the study questionnaire. The WOMAC Index is a disease-specific measure of health status, which contains 24 items covering pain (5 items), stiffness (2 items) and function (17 items). It produces pain, stiffness and function subscale scores, which are summed to produce a total WOMAC score. Scores range from 0 (least pain) to 20 (highest pain) for pain, 0 (least stiffness) to 8 (greatest stiffness) for stiffness, 0 (best function) to 68 (worst function) for function and 0 (best health) to 96 (worst health) for the total score. The AQoL-4D is a generic (non-disease-specific) measure of HRQoL. It contains 12 items and produces a utility score ranging from −0.04 (worst possible HRQoL) to 1.00 (full HRQoL). Negative AQoL utility scores indicate a health state worse than death. Three additional items (relating to illness) do not contribute to utility score calculation and were not administered. Australian normative data are available for the AQoL instrument. AQoL and WOMAC scores were not available for 18 individuals (7.6 %) due to missing responses, leaving data from 219 participants available for analysis.
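The WOMAC scoring described above can be illustrated with a short sketch. It assumes each item is scored 0 to 4 (implied by the subscale ranges quoted in the text) and that items are supplied in the order pain, stiffness, function; the function name and the ordering are illustrative assumptions, not part of the instrument's documentation.

```python
from typing import Dict, Sequence

# Item counts per subscale, as stated in the text
N_PAIN, N_STIFFNESS, N_FUNCTION = 5, 2, 17


def womac_scores(items: Sequence[int]) -> Dict[str, int]:
    """Sum 24 item responses into subscale scores and a total score.

    Assumes items are ordered pain, stiffness, function and each
    response lies in 0..4 (an assumption for this sketch).
    """
    expected = N_PAIN + N_STIFFNESS + N_FUNCTION
    if len(items) != expected:
        raise ValueError(f"expected {expected} items, got {len(items)}")
    if any(not 0 <= x <= 4 for x in items):
        raise ValueError("each item response must be between 0 and 4")
    pain = sum(items[:N_PAIN])
    stiffness = sum(items[N_PAIN:N_PAIN + N_STIFFNESS])
    function = sum(items[N_PAIN + N_STIFFNESS:])
    return {
        "pain": pain,
        "stiffness": stiffness,
        "function": function,
        "total": pain + stiffness + function,
    }


if __name__ == "__main__":
    # Worst possible responses reproduce the maximum scores quoted in the text
    print(womac_scores([4] * 24))
```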
Descriptive statistics and Chi-squared tests were used to identify personal characteristics that were likely to be associated with AQoL utility scores. To assist with the descriptive analysis, the AQoL utility scores were categorised according to tertiles (low, middle and high HRQoL). To allow for the skewed distribution and ceiling effect of the AQoL utility scores, we used the AQoL disutility (i.e. 1 − AQoL utility) as a continuous variable when exploring its relationship with WOMAC scores and other potential covariates. Generalised linear models (GLMs) with a log link and either a Gaussian or a Gamma family were explored to identify the model with the best fit. Mean absolute error (MAE), root mean squared error (RMSE), Akaike’s information criterion (AIC) and Bayesian information criterion (BIC) statistics were used to assess model fit. The MAE is the average of the absolute differences between observed and predicted values. The RMSE is the square root of the average squared prediction error and is more sensitive to larger errors. Optimal fit was inferred through the minimisation of these two measures. The AIC and BIC were used to assess the quality of the models based on a balance of goodness of fit and model complexity. Pseudo R-squared (R2) values provided an indication of the variance explained by each model and were used for comparison with published studies.
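The disutility transformation and the two error measures defined above can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, and the example values are invented for demonstration and are not study data.

```python
import math
from typing import Sequence


def disutility(utility: float) -> float:
    """AQoL disutility as defined in the text: 1 minus the utility score."""
    return 1.0 - utility


def mae(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error: average absolute observed-predicted difference."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error: square root of the mean squared error."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    return math.sqrt(mse)


if __name__ == "__main__":
    # Invented utilities for illustration only
    observed = [0.5, 0.7, 0.9]
    predicted = [0.6, 0.7, 0.7]
    print(mae(observed, predicted), rmse(observed, predicted))
```

Because the RMSE squares each error before averaging, the single larger error in the example raises it above the MAE, which is the sensitivity to larger errors noted in the text.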
The Gaussian family with log link provided a slightly lower RMSE, MAE and BIC than the Gamma family with log link and was therefore adopted as the base model type (Gamma distribution data are available upon request). Total WOMAC score was considered in Model A (Table 2), while each of the three WOMAC subscale scores was considered in Model B. Model C considered interactions between WOMAC subscales, and Model D considered the function subscale score only. Personal characteristics identified from the descriptive analysis or hypothesised to be associated with HRQoL were included in Model E, while personal characteristics significantly associated with the AQoL utility score were included in Model F.
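For orientation, the general form of a log-link model for disutility, and the back-transformation to a predicted utility score, can be written as below. The notation is an assumption introduced here for illustration: U_i is the AQoL utility score, D_i = 1 − U_i the disutility, W_i the total WOMAC score, and S_i and J_i stand for sex and previous joint replacement surgery. No coefficient values are implied.

```latex
% Notation assumed for illustration; not taken from the source.
\begin{align}
  D_i &= 1 - U_i \\
  % Model A: total WOMAC score only
  \log \mathrm{E}[D_i] &= \beta_0 + \beta_1 W_i \\
  % Model F: total WOMAC score, sex and previous joint replacement
  \log \mathrm{E}[D_i] &= \beta_0 + \beta_1 W_i + \beta_2 S_i + \beta_3 J_i \\
  % Back-transformation from predicted disutility to predicted utility
  \hat{U}_i &= 1 - \exp\!\left(\hat{\beta}_0 + \hat{\beta}_1 W_i\right)
  \quad \text{(Model A)}
\end{align}
```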
Pitman’s tests for differences in variance were used to examine agreement between each model and the base model (Model A). Prediction errors (predicted AQoL scores minus actual AQoL scores) were plotted against actual AQoL scores to assess model performance and bias across the full range of AQoL scores. Model accuracy was also assessed by comparing actual AQoL utility scores (mean, median and range) with predicted values for the overall sample and for the low, middle and high HRQoL tertiles. All statistical tests were two-sided and conducted at a significance level of 0.05. Statistical analysis was performed using Stata version 12 (StataCorp, Texas, USA).
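The tertile-based accuracy check described above can be sketched as follows. This is a simplified illustration: the rank-based tertile split, the function names and the example values are all assumptions made for demonstration, and the numbers are invented rather than taken from the study.

```python
from statistics import mean
from typing import Dict, List, Sequence

TERTILES = ("low", "middle", "high")


def tertile_labels(values: Sequence[float]) -> List[str]:
    """Rank-based split into three (near-)equal groups by ascending value."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [""] * n
    for rank, i in enumerate(order):
        labels[i] = TERTILES[rank * 3 // n]
    return labels


def mean_error_by_tertile(
    actual: Sequence[float], predicted: Sequence[float]
) -> Dict[str, float]:
    """Mean prediction error (predicted minus actual) within each tertile."""
    labels = tertile_labels(actual)
    result = {}
    for name in TERTILES:
        errors = [p - a for a, p, lab in zip(actual, predicted, labels) if lab == name]
        result[name] = mean(errors) if errors else float("nan")
    return result


if __name__ == "__main__":
    # Invented utilities for illustration only
    actual = [0.2, 0.3, 0.6, 0.7, 0.9, 1.0]
    predicted = [0.4, 0.5, 0.65, 0.65, 0.8, 0.8]
    print(mean_error_by_tertile(actual, predicted))
```

In this invented example the overall mean error is close to zero even though the low and high groups are clearly biased, which is the kind of pattern a tertile comparison is meant to expose.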
Results
Characteristics of the sample
Table columns: Overall sample (n = 219); Low HRQoL tertile (n = 73); Middle HRQoL tertile (n = 73); High HRQoL tertile (n = 73)
Table rows: Age, mean (SD); Female, n (%); Country of birth, n (%); Language spoken, n (%); Highest level of education, n (%) (Primary school or less); Marital status, n (%); BMI category, n (%); Living in provincial/rural area, n (%); Lowest 2 SEIFA deciles, n (%); Previous hip or knee replacement, n (%); WOMAC score, mean (SD) (Total WOMAC score)
Relationship between AQoL utility scores and WOMAC scores
Performance of the models
Table 2 Results of parameter and diagnostic tests used to predict AQoL disutility scores
Model columns: WOMAC subscales + interactions; WOMAC function subscale only; Total WOMAC + other variables; Total WOMAC + sex + previous JRS
Parameter rows: Total WOMAC score; Pain subscale score; Stiffness subscale score; Function subscale score; Pain score squared; Stiffness score squared; Function score squared; Pain × stiffness interaction; Pain × function interaction; Stiffness × function interaction; Previous joint replacement surgery; Completed trade/technical/university education; Obesity (Body Mass Index ≥30); Lowest two SEIFA deciles
Goodness of fit measures: Root mean squared error (RMSE); Mean absolute error (MAE); Akaike’s information criterion (AIC); Bayesian information criterion (BIC)
Pitman’s test for difference in variance: p = 0.197; p = 0.001; p = 0.390; p < 0.001; p = 0.081
When the WOMAC pain, stiffness and function subscales were included separately (Model B), only the function subscale was significantly associated with the AQoL disutility score (p = 0.006). For Model C, none of the included variables was significantly associated with the AQoL disutility score and no improvement to the goodness of fit statistics was evident. No improvement in model fit was observed for Model D (WOMAC function subscale only). However, the BIC value for Model D was slightly lower than for Model A, suggesting that using the WOMAC function score may be more efficient than using the total WOMAC score. Personal characteristics that were associated with AQoL utility scores (education, obesity and socio-economic status) were also included in Model E, together with the total WOMAC score, age, sex and previous joint replacement surgery status. However, only previous joint replacement was found to be significantly associated with the AQoL disutility score. After excluding non-significant variables, three variables were found to be significantly associated with the AQoL disutility score (Model F): total WOMAC score, previous joint replacement and sex.
Accuracy of the models
Accuracy of predicted AQoL utility scores
Table rows: Low HRQoL tertile; Middle HRQoL tertile; High HRQoL tertile
Using the models to predict individual AQoL scores
Although each model accurately predicted the overall sample mean, the models were unable to reliably predict AQoL utility scores at the individual level. This is demonstrated using individual data from two study participants who differed according to age, gender, joint replacement surgery status and WOMAC scores:
First participant, predicted AQoL utility score using:
the total WOMAC score only (Model A) was 0.45
the WOMAC function subscale score only (Model D) was 0.45
the total WOMAC score, sex and previous joint replacement (Model F) was 0.32
Second participant, predicted AQoL utility score using:
the total WOMAC score only (Model A) was 0.53
the WOMAC function subscale score only (Model D) was 0.53
the total WOMAC score, sex and previous joint replacement (Model F) was 0.56
Discussion
As the health care costs of OA continue to grow worldwide, consideration of cost-effectiveness is critical for ensuring the appropriate distribution of limited health care resources. In this context, it is clear that generic measures of HRQoL, such as the AQoL instrument, are essential for the comprehensive economic evaluation of interventions in clinical and policy settings. Mapping techniques can inform this process by using individual disease-specific data to estimate utility scores, if primary utility data are not available. This study is the first to consider mapping disease-specific WOMAC data to the AQoL utility score, and our methods have potential application across a range of rheumatology research settings given that WOMAC data are frequently collected in studies of OA and joint replacement surgery.
Using a population-based sample of people with hip or knee joint disease, we found that WOMAC scores could reliably predict average AQoL utility scores at the overall sample level. There was no clear preferred model, according to goodness of fit statistics and error plots, and the models explained between 40 and 46 % of the variance in AQoL scores. We also found that a model based on the WOMAC function subscale alone (representing 17 of 24 WOMAC items) was reasonably efficient and demonstrated similar fit to the total WOMAC score model. This is encouraging for situations where WOMAC pain and stiffness data are not collected or are missing. Overall, the models appeared to function best at the mid-range of AQoL scores, with evidence of bias for people with low or high AQoL scores. This is consistent with earlier work that mapped WOMAC scores to EQ-5D utility scores. However, as the application of these techniques relates to group-level or population-level quality of life or economic analysis (rather than estimating individual patient outcomes), it is likely that individual prediction errors will be balanced across the sample, and this was confirmed by our data. Prediction errors may relate in part to the different constructs covered by the WOMAC and AQoL instruments. Our subgroup analyses also suggest that mapping WOMAC scores to AQoL utility scores would not be appropriate for studies targeting people with low HRQoL (for example, those with end-stage joint disease) or very high HRQoL (for example, individuals with early disease) as it would produce biased group estimates. The reason for this finding is unclear, as the WOMAC Index has been used for samples with low and high health status, but could relate to subgroup differences in prioritising HRQoL.
In relation to model performance, our results were comparable to previous studies that used similar statistical techniques to estimate utility scores from WOMAC data [11–13]. However, direct comparison of findings is limited as none of these studies mapped data to the AQoL instrument. Our models demonstrated slightly better predictive performance than the models reported by Grootendorst and colleagues (predictive values between 39 and 40 % for the HUI3 measure), and similar performance to the model reported by Xie et al. (predictive value of 45 % for the EQ-5D measure). Similar to our findings, Grootendorst et al. found their derived model did not reliably predict individual utility scores, but was capable of estimating overall group scores. This was confirmed in a subsequent validation study. Grootendorst et al.’s preferred model included WOMAC scores, age, sex and OA severity to estimate HUI3 utility scores. This is similar to our Model F, although we included previous joint replacement surgery, which could be considered a proxy for joint disease severity. In contrast, Xie et al. included only the total WOMAC score in their preferred model, while Barton et al. included age and sex in addition to WOMAC scores. Differences in study populations and disease severity may partially account for between-study variation as clinical trial participants [11, 13, 14] and patients from hospital orthopaedic departments were used previously.
There are several limitations to this study which should be noted. The diagnosis of hip or knee joint disease was based on self-reported, doctor-diagnosed arthritis or OA, similar to the methods used for other population-based studies [28, 29]. The generalisability of the findings to other clinical populations is not known. Furthermore, although the WOMAC Index and AQoL instrument are available in a range of translated languages, our analyses relate only to the English versions. We also acknowledge that additional demographic and clinical variables (such as doctor-diagnosed co-morbidities) may have improved model performance and accuracy, but a limited number of variables were collected in the survey to minimise participant burden. Additionally, we have not tested our models using an independent data set containing WOMAC and AQoL data, and this is an important area for subsequent research. Similar to the two-stage methods used previously, we plan to soon undertake a validation study, which will examine the accuracy of these models when applied to different patient populations. Specifically, this external validation will use other OA research data sets to determine whether the models yield similar results to those obtained from the original data set. Using independent data sets that have not been part of the models’ development is a critical step in the mapping process and provides a more robust test of predictive performance. Finally, we acknowledge that the current cross-sectional design only allows prediction of utility scores at a single time point and we cannot confirm model performance over time. Future use of longitudinal data sets will enable us to determine whether WOMAC scores can be used to predict AQoL utility scores at different time points.
In conclusion, our study indicates that total WOMAC and function subscale scores can be used to reliably estimate average AQoL utility scores for a population-based sample with hip or knee joint disease. Although a range of models were tested, each produced similar results in terms of model performance and accuracy. Using WOMAC data to predict individual-level AQoL utility scores is not recommended and greater prediction errors were evident for people with low or high HRQoL, suggesting that mapping techniques are unlikely to be accurate for these groups.
Acknowledgments
We gratefully acknowledge the assistance of Alexandra Gorelik (Senior Statistician, Melbourne EpiCentre) for her statistical advice and constructive feedback. This research was supported in part by a Physiotherapy Research Foundation and United Pacific Industries Thermoskin Research Grant (#T09-THE003). Dr. Ackerman is supported by a National Health and Medical Research Council of Australia Public Health (Australian) Early Career Fellowship (#520004).
Conflict of interest
There are no conflicts of interest to declare.