Background

DEMQOL and DEMQOL-Proxy [1,2,3] are well known and widely used patient reported outcome measures (PROMs) for measuring health related quality of life (HRQL) in people with dementia (PWDs). DEMQOL and DEMQOL-Proxy provide the means to assess HRQL at all stages of dementia severity. DEMQOL is self-reported by the PWD and is appropriate for use in mild/moderate dementia; DEMQOL-Proxy is proxy-reported by a family carer on behalf of the PWD and can be used at all stages of dementia. The two instruments are intended to be used together.

The original development of DEMQOL and DEMQOL-Proxy was grounded in strong methodology and robust psychometric principles [1, 3]. However, the use and application of PROMs is changing. In addition to their use in randomised controlled trials (RCTs) and other evaluative studies, there is a growing interest in the use of PROMs as part of routine monitoring of the quality of health and social care [4]. Routine use of PROMs provides an opportunity to help drive changes in how health and social care are organised and delivered [5] and to improve quality. Consequently, it is necessary to re-evaluate the measurement properties of PROMs to ensure that they are fit for these new purposes. To this end, in this paper we report the re-evaluation of the psychometric properties of DEMQOL and DEMQOL-Proxy.

Modern psychometric methods such as those based on Item Response Theory (IRT) [6, 7] and Rasch Measurement Theory (RMT) [8, 9] provide more stringent psychometric methods than traditional methods derived from Classical Test Theory (CTT). Since April 2009 PROMs data are routinely collected for some elective surgical operations in England [4, 10] and similar use is under consideration for other conditions, including dementia [11, 12]. Methodological work has been undertaken to apply IRT or RMT to the measures for routine use [13,14,15,16,17,18], but for measures of HRQL in dementia this has been limited. Rasch methods have been used with DEMQOL and DEMQOL-Proxy as part of the development of a health state classification system for DEMQOL-U and DEMQOL-Proxy-U [19]. No work has yet used Rasch methods to evaluate the whole set of DEMQOL/DEMQOL-Proxy items in terms of the overall score.

The measurement of outcomes in dementia is challenging. Cognitive impairment can make it difficult for a PWD to provide a reliable self-report on their HRQL and it may be necessary to rely on a proxy report from a family member. Yet proxy reports are also methodologically challenging: proxies find it difficult to separate their own experience from that of the patient, and for more subjective constructs, such as HRQL, PWD-proxy agreement is likely to be lower [20]. These challenges mean it is important that we apply the best available methodological techniques to ensure that the dementia specific outcome measures used in health services research, health care monitoring and individual clinical management are of the highest quality.

Modern psychometric approaches such as IRT and RMT have four advantages over CTT [21, 22]. The scores obtained are invariant, i.e. independent of the sampling distribution of the items used, and items are located on a scale independent of the sampling distribution of the people in whom the scale is derived. They generate individual (rather than group) standard errors that clarify the degree of confidence in an individual’s scores. Since scores are invariant there is greater potential to measure clinically meaningful differences. Finally, missing data can be dealt with more efficiently. Both IRT and RMT use mathematical (logit) models to improve the measurement properties of scores derived from questionnaires but they differ in the approach to data that do not fit the model: IRT tends to add parameters to the model whereas RMT investigates the data to identify why the misfit occurred. We used RMT to evaluate DEMQOL and DEMQOL-Proxy because the Rasch paradigm allows us to achieve interval scales, to identify potential anomalies with items and response scales, and at the same time, keep the conceptual framework on which the items are based central. This is important to ensure content validity and to produce scores that are clinically meaningful. Anomalies that are identified within the Rasch paradigm can help us to understand which particular items and response options are candidates for improvement. It also allows us to begin to build an evidence base about the extent to which instruments achieve invariant comparison. For example, differential item functioning (DIF) helps us to understand if any items are biased in favour of particular groups of the population. DEMQOL/DEMQOL-Proxy include a range of items about different aspects of daily life which arguably could also be affected by the aging process itself, gender roles and expectations and the deteriorating nature of dementia where eventually patients lose insight about their condition. Our analyses therefore enable us to understand which (if any) items are responded to differently by people of different age, gender and dementia severity.

Methods

Sample

The data were collected within a large study investigating the impact of Memory Assessment Services (MAS) on HRQL of PWDs [23]. Each of 78 MASs, geographically spread across all regions of the country and representative of all MASs in England, recruited up to 25 consecutive patients with suspected dementia who were attending for a first referral (either at the clinic or at a home visit) and their family carers (if present). Patients or carers with insufficient English to understand the consent procedure or study materials were not eligible for inclusion in the study.

Instruments

DEMQOL consists of 28 questions and DEMQOL-Proxy consists of 31 questions, each assessed on a 4-point Likert-type response scale: a lot, quite a bit, a little, not at all. The questions were derived from five conceptual domains: health and well-being, cognitive functioning, daily activities, social relationships and self-concept [2] and, with the exception of the emotion items, all have the stem, “How worried have you been about…”. There is also an additional overall quality of life question, answered on a 4-point scale: very good, good, fair, poor. The items are scored according to a standard scoring algorithm [24] to produce an overall score where higher scores represent better HRQL. See Smith et al. [1,2,3] for details on the development and CTT-based validation of DEMQOL and DEMQOL-Proxy.

Data analysis

The use of modern psychometrics (IRT or Rasch methods) brings the opportunity to achieve more robust measurement by applying a mathematical approach to deriving scores based on a logit model. Modern psychometric methods are based on the relationship between a person’s location on the construct being measured (in this instance the level of their HRQL) and their probability of responding positively to each item. In contrast, traditional methods (such as Classical Test Theory) focus on the relationship between a person’s location on the construct and their observed total score on the scale. Thus, the analysis enables us to consider whether a measurement “ruler” has been successfully constructed. We evaluate this by considering whether i) response categories work as intended (threshold ordering); ii) the items map out a continuum that is relevant to the people being measured (targeting); iii) the items work together (item fit); iv) responses to one item bias responses to another item (response dependency); v) performance is stable across relevant groups (differential item functioning (DIF)); vi) items in the instrument represent a reliable unidimensional construct. The unique position of the Rasch paradigm is that when the data do not fit the model, the data (as opposed to the model) are scrutinised to determine the reasons why and to identify ways in which the items and/or response scales can be improved. Rasch based methods therefore provide a powerful set of diagnostic techniques which, in addition to generating more robust scores, can highlight ways to improve the instruments in the future.
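
For reference, the dichotomous form of the Rasch model expresses the relationship between person location and response probability as follows. The notation is assumed here for illustration and is not taken from the study: \theta_n denotes the location of person n and \delta_i the location of item i, both in logits.

```latex
P(X_{ni} = 1 \mid \theta_n, \delta_i) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}
```

Under this form, a person located exactly at the item location has a probability of 0.5 of responding positively.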

We conducted a Rasch analysis using RUMM2030 software to identify potential anomalies in the data indicating aspects of the instruments that were not working as intended [25]. Although all the items have the same 4-point Likert type scale, the unrestricted (partial credit) model was used as this was a diagnostic analysis and we wanted to evaluate whether each response scale was actually used similarly to each of the others.
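
To make the unrestricted (partial credit) parameterisation concrete, the following minimal Python sketch computes category probabilities for a single polytomous item whose thresholds are estimated separately from those of other items. It is an illustration only and is not the RUMM2030 implementation; the function name and the threshold values are hypothetical.

```python
import math

def category_probabilities(theta, thresholds):
    """Return P(X = 0), ..., P(X = m) for one item under the partial credit model.

    theta      -- person location in logits
    thresholds -- the item's m threshold locations in logits (hypothetical values)
    """
    # Cumulative sums of (theta - threshold); the empty sum for category 0 is 0.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(exponents)
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

# A 4-category item (three thresholds), evaluated for a person at 0 logits.
probs = category_probabilities(theta=0.0, thresholds=[-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
```

Because each item carries its own list of thresholds, the sketch places no constraint on the distances between categories being equal across items, which is the property that motivated the choice of the unrestricted model for a diagnostic analysis.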

All of the analyses were initially conducted for all items (28 for DEMQOL and 31 for DEMQOL-Proxy) and subsequently for a slightly smaller set of items that excluded the positive emotion items (23 items remaining for DEMQOL and 26 items remaining for DEMQOL-Proxy) as our early analyses [3] and preliminary work on this dataset (including parallel factor analysis – see Appendix) indicated that these items were conceptually different (trait items) and therefore represented a distinct dimension from the other items. We did not consider other reduced item sets because our aim was not to derive a shorter version of the scale. Rather, we aimed to retain as many of the original scale items as possible and evaluate their performance. Because the sample was large, all estimates were based on the full sample, but to avoid type I error, the sample size was adjusted (N = 500), within the RUMM programme, before calculating significance tests (p-values).

Targeting

Scale-to-sample targeting concerns the match between the range of HRQL measured by the DEMQOL items (and DEMQOL-Proxy items) and the range of HRQL in the sample of PWDs. This was evaluated by comparing the spread of person and item (threshold) locations.

Ordering of item thresholds

We evaluated whether the response categories were working as intended by a visual inspection of the threshold map. As each item has four response categories, there are three thresholds per item, which should be ordered logically. Disordered thresholds can indicate where respondents have misunderstood or been unable to use response categories consistently. Collapsing (or re-scoring) the disordered thresholds can help to provide an indication of how response categories can be improved.
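
The ordering check can also be expressed programmatically. The sketch below, in Python, flags items whose threshold estimates are not strictly increasing; it complements rather than replaces the visual inspection used in the analysis, and the item names and threshold values are hypothetical.

```python
def disordered_items(threshold_map):
    """Return the names of items whose thresholds are not strictly increasing."""
    flagged = []
    for item, thresholds in threshold_map.items():
        # Compare each threshold with the next one along the response scale.
        ordered = all(a < b for a, b in zip(thresholds, thresholds[1:]))
        if not ordered:
            flagged.append(item)
    return flagged

# Hypothetical threshold estimates (logits) for two 4-category items.
example = {
    "item_A": [-1.2, 0.1, 1.4],   # ordered
    "item_B": [-0.3, -0.9, 1.1],  # second threshold lies below the first
}
print(disordered_items(example))
```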

Item fit

The overall fit to the model was evaluated using chi-square. The fit of each item to the Rasch model was evaluated both statistically – fit residual within +/−2.5, chi-square statistic (Bonferroni corrected significance level) – and graphically (visual inspection of the item characteristic curve (ICC)). No single piece of information can confirm the fit of an item to the model and it is important therefore to consider all the evidence together.
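
The two statistical criteria can be combined into a simple screening step, sketched below in Python. The sketch assumes that the Bonferroni correction divides the nominal significance level by the number of items, and all item names and values are hypothetical. Flagged items would still need to be judged alongside their ICCs, since no single criterion is decisive.

```python
def flag_misfit(items, residual_limit=2.5, alpha=0.05):
    """Screen items on fit residual and Bonferroni-corrected chi-square p-value.

    items -- list of dicts with keys 'name', 'fit_residual' and 'chi_sq_p'
    Returns (name, large_residual, significant_chi_sq) for each flagged item.
    """
    # Assumption: the correction divides alpha by the number of items tested.
    corrected_alpha = alpha / len(items)
    report = []
    for item in items:
        large_residual = abs(item["fit_residual"]) > residual_limit
        significant = item["chi_sq_p"] < corrected_alpha
        if large_residual or significant:
            report.append((item["name"], large_residual, significant))
    return report

# Hypothetical diagnostic statistics for three items.
example_items = [
    {"name": "item_A", "fit_residual": 0.8, "chi_sq_p": 0.40},
    {"name": "item_B", "fit_residual": 3.1, "chi_sq_p": 0.02},
    {"name": "item_C", "fit_residual": -1.0, "chi_sq_p": 0.001},
]
print(flag_misfit(example_items))
```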

Differential item functioning (DIF)

DIF is concerned with the extent to which different groups within the sample exhibit different scores for the same amount of the construct being measured. In this analysis for DEMQOL groups were defined as follows: PWD sex, PWD age group (quartiles), and disease severity (≥ 24 versus <24 MMSE or equivalent based on published cut offs indicating dementia). For DEMQOL-Proxy we additionally defined groups according to the sex and age group (quartiles) of the carer and relationship to the PWD (spouse, son/daughter, other). We used ANOVA to evaluate both main effects for these groups (uniform DIF) and interactions between these groups and the class intervals (non-uniform DIF). The presence of uniform DIF can be corrected by calibrating problem items separately for each level of the group (known as “splitting” items). Items showing non-uniform DIF may need to be investigated and/or removed from the item set.

Local independence

The extent to which each item was independent of the others was evaluated by examining the residual correlation matrix. Pairs of items where the residuals were correlated >0.3 were flagged. In the short term, the presence of response dependence can be corrected by considering each pair of dependent items to identify which is conceptually higher order. The lower order item is then calibrated (or “split”) by each level of the higher order item [26]. This avoids the need to remove items and further compromise content validity.
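
The flagging rule for residual correlations can be sketched as follows in Python. The sketch assumes that standardised person-by-item residuals have already been exported from the Rasch software; the data structure, item names and values are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def dependent_pairs(residuals, cutoff=0.3):
    """Flag item pairs whose residuals correlate above the cutoff.

    residuals -- dict mapping item name to a list of residuals, one per person
    """
    flagged = []
    for a, b in combinations(sorted(residuals), 2):
        r = pearson(residuals[a], residuals[b])
        if r > cutoff:
            flagged.append((a, b, round(r, 2)))
    return flagged

# Hypothetical residuals for three items and five people.
example_residuals = {
    "item_A": [0.5, -1.0, 1.2, -0.4, 0.1],
    "item_B": [0.6, -0.8, 1.0, -0.5, 0.0],
    "item_C": [-0.2, 0.9, -0.7, 0.3, 0.4],
}
print(dependent_pairs(example_residuals))
```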

Unidimensionality

Item analysis by the Rasch model assumes unidimensional data. This was evaluated by prior factor analysis (Appendix) and principal components analysis (PCA) of the residuals to determine if there are any other identifiable dimensions in the data after the main “Rasch dimension” has been taken into account. If there is no interpretable pattern in the residuals then unidimensionality can be said to be supported [27]. Two subsets of four items were created from the highest and lowest loadings on the first principal component and a series of independent t-tests used to investigate whether the estimates for these two subsets differed significantly (percentage of individual t-tests outside the range ± 1.96). We computed Wilson 95% confidence intervals [28], as recommended by Brown, Cai, and DasGupta [29].
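
The Wilson interval for the proportion of significant t-tests can be computed directly from the count of significant tests and the number of people tested. The Python sketch below uses the standard Wilson score formula; the counts in the example are hypothetical and are not taken from the study data.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width

# Hypothetical counts: 60 of 500 person-level t-tests fall outside +/-1.96.
low, high = wilson_interval(60, 500)
print(f"{60 / 500:.1%} [{low:.1%}; {high:.1%}]")
```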

Reliability

Reliability was evaluated using the Person Separation Index (PSI), which is similar to Cronbach’s alpha. A value >0.7 is considered adequate.
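
A commonly given form of the PSI is the proportion of the observed variance in person location estimates that is not attributable to measurement error. The Python sketch below follows that form; whether the software uses the population or the sample variance is not stated here, so the use of the population variance is an assumption, and the example values are hypothetical.

```python
from statistics import mean, pvariance

def person_separation_index(locations, standard_errors):
    """(observed variance - mean error variance) / observed variance."""
    # Assumption: population variance of the person location estimates.
    observed_variance = pvariance(locations)
    # Mean of the squared standard errors of the person estimates.
    error_variance = mean(se ** 2 for se in standard_errors)
    return (observed_variance - error_variance) / observed_variance

# Hypothetical person locations (logits) and their standard errors.
psi = person_separation_index([-1.0, 0.0, 1.0, 2.0], [0.4, 0.4, 0.4, 0.4])
print(round(psi, 3))
```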

Rasch model based (logit) scores and their benefit

For both DEMQOL and DEMQOL-Proxy, we re-scored items with disordered thresholds (i.e. combining response categories as necessary). In addition, we resolved the items showing DIF (i.e. by splitting the relevant item and creating new items, one for each level of the person factor showing DIF) and/or local dependency (i.e. splitting the dependent item by the levels of the higher order item). We then generated Rasch model based scores (logits) for both resolved and unresolved versions. If the two versions were highly correlated, we retained the unresolved versions. The benefit of these scores over the raw scores was assessed by plotting them against the raw (original classically derived) scores. When the Rasch model based scores are different to the raw scores this will tend to give an ogive (“S”-shaped) curve.

Results

Descriptive characteristics of the sample

DEMQOL was completed by 1428 people with suspected dementia: 52% female, age range 42–98 years (mean age = 77.9, SD = 8.5) and 95% White or White British. DEMQOL-Proxy was completed by 1022 accompanying carers: 69% female, age range 16–94 years (mean age = 65.9, SD = 13.6), and 95% White or White British. Carers were predominantly the spouse (61%) or son/daughter (29%) of the PWD. Details of the sample are presented in Table 1.

Table 1 Demographic characteristics of PWD and carer

Overall fit to the model

For both DEMQOL and DEMQOL-Proxy the overall chi square statistic was non-significant (p = 0.99 and p = 0.11 respectively) suggesting that for both scales the data fit the model.

Targeting

Original item sets (DEMQOL and DEMQOL-Proxy)

For both DEMQOL and DEMQOL-Proxy, targeting of persons to item threshold locations could be improved (see Fig. 1a and b, respectively). In both cases, the spread of person locations (DEMQOL: SD = 0.915, DEMQOL-Proxy: SD = 0.888) covered the spread of item threshold locations well, though there was a lack of item thresholds at the high ends of the continuum.

Fig. 1 Person-item threshold location distribution for DEMQOL (a) and DEMQOL-Proxy (b)

Smaller item sets (DEMQOL and DEMQOL-Proxy)

For DEMQOL (23 items) (Fig. 2a) the range of item threshold locations is clearly smaller compared with the full set of items. For DEMQOL-Proxy (26 items) (Fig. 2b) the range of item threshold locations stayed almost the same because in contrast to DEMQOL, the highest located item thresholds included a wider range of items than just positive emotion items.

Fig. 2 Person-item threshold location distribution for DEMQOL (23 items) (a) and DEMQOL-Proxy (26 items) (b)

Ordering of item thresholds

Original item sets (DEMQOL and DEMQOL-Proxy)

Five DEMQOL items and four DEMQOL-Proxy items showed response options not working properly (disordered thresholds). For DEMQOL these were having been worried about: a) not having enough company, b) how you get on with people close to you, c) getting the affection that you want, d) getting help when you need it, and e) getting to the toilet in time. For DEMQOL-Proxy these were having been worried about: a) keeping him/herself clean (e.g. washing and bathing), b) keeping him/herself looking nice, c) using money to pay for things, and d) looking after his/her finances. For all of these items we found that the middle two categories (“quite a bit” and “a little”) were not used as intended.

Smaller item sets (DEMQOL and DEMQOL-Proxy)

For DEMQOL (23 items), the same five items as in the original item set showed disordered thresholds. For DEMQOL-Proxy (26 items) we found one item fewer than in the original item set: having been worried about looking after his/her finances was no longer flagged. This may be due to the slightly smaller sample size (N = 1021) available for this analysis.

Item fit

Original item sets (DEMQOL and DEMQOL-Proxy)

No DEMQOL or DEMQOL-Proxy items showed misfit to the model, considering the fit residuals, chi square values and the ICCs together (Table 2). However, four of the five DEMQOL positive emotion items (felt lively, full of energy, confident, cheerful, enjoying life) were among the items with the highest average threshold locations; the two highest (felt lively, full of energy) also showed large fit residuals (> +/− 2.5) and non-optimal fit to the ICC. We found this pattern largely replicated in DEMQOL-Proxy (Table 3), in particular for (felt) full of energy, lively and, to a lesser extent, cheerful.

Table 2 Diagnostic statistics for the original item set of DEMQOL (28 items)
Table 3 Diagnostic statistics for the original item set of DEMQOL-Proxy (31 items)

Smaller item sets (DEMQOL and DEMQOL-Proxy)

None of the 23 DEMQOL items nor the 26 DEMQOL-Proxy items showed misfit to the model, considering the fit residuals, chi square values and the ICCs together (Tables 4 and 5). However, items that showed large fit residuals (> +/− 2.5) in the original item sets now tended to fit slightly better for both DEMQOL (23 items) and DEMQOL-Proxy (26 items).

Table 4 Diagnostic statistics for the smaller item set of DEMQOL (23 items)
Table 5 Diagnostic statistics for the smaller item set of DEMQOL-Proxy (26 items)

Differential item functioning (DIF)

Original item sets (DEMQOL and DEMQOL-Proxy)

None of the DEMQOL items showed significant main effects (uniform DIF) for PWD sex, age group or severity. Three DEMQOL-Proxy items showed significant main effects. The item “feeling irritable” showed a significant main effect for patient age (carers of younger people report more irritability), patient sex (carers of men with dementia report more irritability) and relationship to the carer (spouse carers tending to report more irritability). The item “worried about forgetting what day it is” showed a significant main effect for severity (carers of people with MMSE scores <24 tending to report more worry about forgetting what day it is). The item “worried about not having enough company” showed a significant main effect for patient sex (carers of women with dementia reporting more worry about not having enough company), relationship to the carer (other carers tending to report more worry about not having enough company) and carer age (general trend for younger carers to report more worry about not having enough company). There were no significant interactions for any of the groups by class intervals.

Smaller item sets (DEMQOL and DEMQOL-Proxy)

None of the 23 DEMQOL items showed significant main effects for PWD sex, age group or severity. Three of the 26 DEMQOL-Proxy items showed significant main effects (uniform DIF). The item “feeling irritable” showed a significant main effect for patient sex (carers of men with dementia reporting more irritability) and patient age (carers of younger people reporting more irritability) and relationship to the carer (spouse carers tending to report more irritability). The item “worried about forgetting what day it is” showed significant main effects for severity (carers of people with MMSE scores <24 tending to report more worry about forgetting what day it is). The item “worried about not having enough company” showed significant main effects for patient sex (carers of women with dementia reporting more worry about not having enough company), carer age (younger carers tending to report more worry about not having enough company) and relationship to the carer (carers who are not a spouse reporting more worry about not having enough company). There were no significant interactions for any of the groups by class intervals.

Local independence

Original item sets (DEMQOL and DEMQOL-Proxy)

Four pairs of DEMQOL items showed local dependency; the correlations were 0.36 (felt cheerful/that you are enjoying life), 0.39 (felt lonely/worried about not having enough company), 0.46 (worried about how you get on with people close to you/getting the affection that you want) and 0.53 (felt full of energy/lively), respectively, see Table 2. Fourteen DEMQOL-Proxy items showed local dependency, with correlations ranging from 0.31 (e.g. felt frustrated/fed-up) to 0.66 (felt full of energy/lively), see Table 3.

Smaller item sets (DEMQOL and DEMQOL-Proxy)

In the smaller item set for DEMQOL two residual correlations >0.3 remained (Table 4): felt lonely/worried about not having enough company (0.40) and worried about how you get on with people close to you/getting the affection that you want (0.41). For DEMQOL-Proxy in the smaller item set we found 11 residual correlations >0.3 (Table 5). The largest ones were between felt sad/fed-up (0.42), having been worried about using money to pay for things/looking after his/her finances (0.47) and keeping him/herself clean/ looking nice (0.64); the large residual correlation between felt sad/fed-up was new.

Unidimensionality

Original item sets (DEMQOL and DEMQOL-Proxy)

Neither the 28 items in DEMQOL nor the 31 items in DEMQOL-Proxy formed a unidimensional scale. The PCA/t-test protocol showed that for DEMQOL the two subsets of measurements differed significantly for 12.3% [10.7; 14.1] of the cases at the 5% level and for 3.0% [2.0; 4.3] of the cases at the 1% level. For DEMQOL-Proxy they differed significantly for 12.0% [10.1; 14.1] at the 5% level and for 3.0% [1.9; 4.7] at the 1% level.

Smaller item sets (DEMQOL and DEMQOL-Proxy)

The smaller set of 23 items in DEMQOL formed an acceptably unidimensional scale though the smaller set of 26 items in DEMQOL-Proxy were still not unidimensional. For DEMQOL the two subsets of measurements differed significantly for 7.1% [5.9; 8.6] of the cases at the 5% level and for 1.1% [0.6; 2.1] of the cases at the 1% level. This is marginally more than can be expected by chance alone and is satisfactory, taking into account the large sample size [30]. For DEMQOL-Proxy the two subsets of measurements differed significantly for 11.9% [10.0; 14.0] of the cases at the 5% level and for 3.0% [1.9; 4.7] at the 1% level.

Reliability

Original item sets (DEMQOL and DEMQOL-Proxy)

For DEMQOL PSI = 0.90 and for DEMQOL-Proxy PSI = 0.91, suggesting that both instruments discriminate well among people in terms of their HRQL (i.e. high reliability).

Smaller item sets (DEMQOL and DEMQOL-Proxy)

The smaller item sets showed similar PSI statistics. For the smaller set of 23 DEMQOL items PSI = 0.87, and for the smaller set of 26 DEMQOL-Proxy items PSI = 0.91.

Rasch model based (logit) scores and their benefit

We derived Rasch model based scores for the smaller item sets (23 items for DEMQOL and 26 items for DEMQOL-Proxy) because of their generally better performance. For DEMQOL, we re-scored the five items with disordered thresholds. In addition, we resolved the two items that showed response dependency. Person location estimates with and without resolving for response dependency were highly correlated (ICC = 0.99); we therefore kept the original estimates.

For DEMQOL-Proxy, we re-scored three items with disordered thresholds. In addition, we resolved the 11 items that showed response dependency and split the three items that showed DIF. Person location estimates with and without resolving for these issues were highly correlated (ICC = 0.97); we therefore kept the original estimates.

The plots showing the benefit of the Rasch model based scores are shown in Fig. 3. The S-shaped curve clearly indicates that at the extremes of the distribution there is benefit from deriving the Rasch model based scores. For both DEMQOL (23 items) and DEMQOL-Proxy (26 items), a 10-point increase in terms of raw scores corresponds to a variable amount of increase in terms of logits, dependent on the person’s location on the raw score scale.
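
The non-linear relationship between raw scores and logits can be illustrated with a simplified Python sketch. For brevity it assumes dichotomous items rather than the 4-category items of DEMQOL and DEMQOL-Proxy, and the item locations are hypothetical. It inverts the expected raw score curve by bisection to obtain the logit value corresponding to a given raw score.

```python
import math

def expected_raw_score(theta, difficulties):
    """Expected total score on dichotomous Rasch items for a person at theta."""
    return sum(1 / (1 + math.exp(-(theta - d))) for d in difficulties)

def logit_for_raw_score(raw, difficulties, lo=-10.0, hi=10.0):
    """Invert the expected score curve by bisection.

    Valid for raw scores strictly between 0 and the number of items.
    """
    for _ in range(100):
        mid = (lo + hi) / 2
        if expected_raw_score(mid, difficulties) < raw:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical set of 20 items with locations spread evenly from -2 to 2 logits.
difficulties = [-2 + 4 * i / 19 for i in range(20)]

# The same 2-point raw score increase, in the middle and near the top of the range.
middle_step = logit_for_raw_score(12, difficulties) - logit_for_raw_score(10, difficulties)
extreme_step = logit_for_raw_score(19, difficulties) - logit_for_raw_score(17, difficulties)
print(round(middle_step, 2), round(extreme_step, 2))
```

In this sketch the same raw score increase corresponds to a larger logit increase near the extreme of the scale than in the middle, which is the pattern summarised by the S-shaped curve.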

Fig. 3 Relationship between raw scores and measurements (logits) for DEMQOL (23 items) (a) and DEMQOL-Proxy (26 items) (b)

Discussion

We have improved the scoring of DEMQOL and DEMQOL-Proxy using RMT and developed scores that can provide more robust and meaningful estimates of change and in addition are potentially appropriate for use with individual patients as part of the clinical decision making process. Neither of these was possible with the original CTT based scores. We have also identified a set of items about positive emotion included in the original questionnaires that do not have strong measurement properties. These items need further qualitative investigation to understand how they could be written more appropriately. In addition, we have identified that the response options may not be as easy for respondents to use as was originally reported. This also needs further qualitative investigation. Nonetheless, using the new Rasch-based scores will potentially mean that at the group level evaluative studies will be able to report estimates of change that are more precise. Consequently, decisions based on these studies will be more robust and more easily justified. For example, while many researchers using CTT-based scores assume that points on the scale are equally distanced [31] (i.e. interval), in fact their level of measurement is merely ordinal. There is no information about the actual distances between points on the scale. Consequently, change scores derived from ordinal scores (e.g. at baseline and follow up) can be difficult to interpret, as the distance between points on the scale may be different at baseline compared with follow up.

We are not advocating that a shorter version of DEMQOL/DEMQOL-Proxy should be administered. DEMQOL and DEMQOL-Proxy are already widely used and should continue to be administered in the standard form (28 items for DEMQOL and 31 items for DEMQOL-Proxy). The improved scores derived here can be calculated for existing datasets or for new data collected using the standard questionnaires. The three available scores for DEMQOL/DEMQOL-Proxy (original classically developed scores, DEMQOL-U /DEMQOL-Proxy-U and the new Rasch developed scores reported here) are based on the same conceptual framework [2]. Each score is a trade-off between measurement for a particular purpose and content validity. Future users should choose the measure appropriate for their purpose. The removal of the positive items in the Rasch scores does not mean that they are unimportant for HRQL in dementia, merely that in their current form and when combined with the other items in the scale, these items do not work as they were intended. Future qualitative work should investigate how these items could be improved to enable them to be retained in the scores.

Future work should also evaluate the effect of Rasch scoring (as described here) on the evaluation of change using DEMQOL and DEMQOL-Proxy. This could be retrospective using existing datasets or prospective. The S-shaped curve in Fig. 3 suggests that most of the difference between the original scores and the new Rasch scores will be seen at the extremes of the distribution. In a normal sample distribution, the effect of the Rasch scores at the group level may therefore be small. The Rasch scores, however, provide added potential for use at the individual level.

Our removal of the positive emotion items means there is item content that had been identified as important to PWD’s HRQL [2] that is not represented in the new Rasch scores for DEMQOL and DEMQOL-Proxy. A similar issue also occurred in the development of the original DEMQOL and DEMQOL-Proxy scales [1, 3] in that the items representing the domain of self concept were removed at the item reduction stage. Both of these are examples of the trade off that can sometimes occur between content validity and measurement properties. Our recommendation is that future work should prioritise investigation of the wording of both the “positive emotion” and “self concept” items to develop better ways of asking these questions within the questionnaire format. The targeting diagrams suggest that there are some parts of the continuum of HRQL not represented by items in the questionnaire, particularly at the “higher” end of the HRQL scale. Further qualitative work is needed to investigate these two issues. This further understanding of the construct of HRQL that underlies DEMQOL and DEMQOL-Proxy would also help to improve the apparent lack of unidimensionality of the items in the Rasch based DEMQOL-Proxy score.

Secondly, for some items, response options appear not to work as intended (i.e. disordered thresholds). The category probability curves and item threshold locations suggest that this may be because respondents do not distinguish between the two categories at the extremes of the response scale (i.e. between “a lot” and “quite a bit” and between “a little” and “not at all”). Alternatively, the labels of the two middle categories (“quite a bit” and “a little”) may not be meaningful. We have temporarily resolved this issue by re-scoring the items as dichotomous items by collapsing the two categories at either end of the response scale, but future work should investigate why for some items the response categories are not working.

Although this analysis improves the scores of DEMQOL and DEMQOL-Proxy, the progressive severity of dementia presents additional measurement challenges. In particular, with increasing severity there is likely to be a point where self-report of HRQL is no longer possible. Using DEMQOL-Proxy partially solves this problem, but it is well known that agreement of self and proxy reports is relatively low for subjective, non-observable constructs such as HRQL [20]. One of the possible reasons for lack of agreement between self- and proxy- reports is that the two different reporters use different constructs to define what we call HRQL. Further analysis using the Rasch model could build on these results to address this problem by equating the Rasch scores reported here for DEMQOL and DEMQOL-Proxy to determine if they can be placed on a single scale. Equating would evaluate whether DEMQOL and DEMQOL-Proxy can be placed on a common metric and therefore whether the two instruments actually measure the same construct. If this were the case, then DEMQOL-Proxy scores could be used with confidence even when self-report was no longer possible.

The current analyses were conducted on a large, representative sample of people attending a first appointment at MAS for suspected dementia [23]. The benefits of the Rasch analysis reported here are therefore based on data from people with relatively mild cognitive impairment and their carers. Future developments should investigate the effect on the model fit of including people with more severe cognitive impairment in the sample (particularly for DEMQOL-Proxy). Further, as the questionnaires are standardised instruments, developed in English, people without enough English language to understand and complete the questionnaire were excluded from the study. It was therefore not possible to investigate DIF by ethnic groups and we do not know whether and to what extent items within DEMQOL/DEMQOL-Proxy are affected by the ethnic status of the participants.

Conclusion

We have established that DEMQOL and DEMQOL-Proxy can provide robust measurement of HRQL in dementia when scores are derived from analysis using the Rasch model. At the group level, estimates of change in evaluative studies will potentially be more precise than when using CTT-based scores and the Rasch based scores can also now be used at the individual level. This is an important improvement for making and justifying decisions. There are still a number of limitations. Further research into the anomalies that we have identified may further improve the two instruments in terms of breadth of content and optimizing answer categories. Furthermore, we need to investigate whether measurement properties are the same across ethnic groups and levels of dementia severity. In addition, in future work we will investigate whether DEMQOL and DEMQOL-Proxy can be placed on the same scale and, if so, whether a revised Rasch model based scoring algorithm can be produced. This would ensure that one could use DEMQOL-Proxy with confidence if a self-report on DEMQOL is no longer possible. Such an algorithm would be appropriate for use in both existing and new datasets.