Why do multi-attribute utility instruments produce different utilities: the relative importance of the descriptive systems, scale and ‘micro-utility’ effects

Purpose Health state utilities measured by the major multi-attribute utility instruments differ. Understanding the reasons for this is important for the choice of instrument and for research designed to reconcile these differences. This paper investigates these reasons by explaining pairwise differences between utilities derived from five multi-attribute utility instruments in terms of (1) their implicit measurement scales; (2) the structure of their descriptive systems; and (3) 'micro-utility effects', scale-adjusted differences attributable to their utility formulae. Methods The EQ-5D-5L, SF-6D, HUI 3, 15D and AQoL-8D were administered to 8,019 individuals. Utilities and unweighted values were calculated using each instrument. Scale effects were determined by the linear relationship between utilities, the effect of the descriptive system by comparison of scale-adjusted values and 'micro-utility effects' by the unexplained difference between utilities and values. Results Overall, 66 % of the differences between utilities were attributable to the descriptive systems, 30.3 % to scale effects and 3.7 % to micro-utility effects. Discussion Results imply that the revision of utility algorithms will not reconcile differences between instruments. The dominating importance of the descriptive system highlights the need for researchers to select the instrument most capable of describing the health states relevant for a study. Conclusions Reconciliation of inconsistent utilities produced by different instruments must focus primarily upon the content of the descriptive system. Utility weights primarily determine the measurement scale. Other differences, attributable to the utility formulae, are comparatively unimportant.


Introduction
Economic evaluation of interventions which affect health-related quality of life (HR-QoL) commonly employs cost-utility analyses (CUA) which prioritise interventions according to the cost per quality-adjusted life year (QALY). The estimation of QALYs is increasingly based upon the health state utilities predicted from a multi-attribute utility (MAU) instrument (MAUI). Each of these instruments has two components. First, the descriptive system (or classification) consists of a set of questions and response categories (items) which seek to describe a person's health. Secondly, the utility formula (or algorithm) converts the item responses into an index of utility on a scale from 0.00 (death) to 1.00 (best health).
A small number of MAUI dominate the literature. A review of articles listed on the Web of Science between 2005 and 2010 found 1,663 studies which had employed an MAUI [1]. Of these, 63 % used the EQ-5D; 15 % the HUI 2 or HUI 3; 9 % the SF-6D; and the remaining 15 % used the 15D, QWB or one of the new Assessment of Quality of Life (AQoL) instruments. The descriptive systems of these instruments, which are described in Table 1, differ significantly in size and content. Three of the instruments (EQ-5D, HUI 3 and 15D) have a preponderance of items which relate to physical health. The SF-6D has an equal number of items in the two broad domains of physical and psycho-social health, and the AQoL-8D has a preponderance of items in the psycho-social domain. Conceptually, HUI 3 has a 'within the skin' descriptive system: it focuses upon an individual's body functions. The other instruments are conceptualised primarily, but not exclusively, in terms of handicap (more recently described by the WHO as activity and participation [2]), i.e. the effect of a health state on a person's ability to function in a social environment. The items combine to describe between 3,125 and $2.4 \times 10^{23}$ health states (EQ-5D-5L and AQoL-8D, respectively).
Dissimilar descriptive systems need not result in different predicted utilities. Each of the MAUI was constructed with a common endpoint, namely the measurement of the strength of preferences for health states. These may be described in a number of ways and, in principle, each of these ways, coupled with appropriate utility weights, might produce comparable measurement. (Analogously, the weight of an object may be measured with almost identical results using scales which employ a spring, a balancing of physical weights or electronic measurement techniques.) Thus, for example, with a complete 'within the skin' description, individuals might envisage the consequences for their 'activity and participation'.
Similarly, brief health state descriptions might result in the same average utility as obtained from a more detailed instrument with discrepancies generated by the greater detail of the larger instrument averaging zero. In these cases, the superficially large differences in the appearance of items might mask the similarity of the instruments' predictions.
The evidence, however, does not support this possibility. The 2005-2010 review identified 392 head-to-head comparisons of the main instruments [1]. The authors generally found a low correspondence between utilities predicted by different instruments. For example, in the three large-scale surveys containing five MAUI published to date, it was found that, on average, only 56, 42 and 57 %, respectively, of the variance of one instrument could be explained by another instrument [3][4][5].
Each MAUI was created with the intention of employing the same scale, on which 1.00 and 0.00 represent best health and death, respectively, and units quantify the desired trade-off between length of life and HR-QoL. Nevertheless, the range of utilities predicted by the major instruments varies from 1.59 for the EQ-5D-5L (i.e. -0.59 to +1.00) to 0.797 for the SF-6D [1]. This implies that the effective scales used by instruments differ and that differences in instrument utilities are, in part, explained by this. Casual comparison cannot determine the extent to which the differences between instruments are a result of these scale effects, differences in the descriptive systems and/or differences in the preferences of people interviewed to obtain utility weights. Our review of the literature did not identify studies which analyse this question. Only one study, by Whitehurst et al. [6], has compared the utilities from two instruments, the EQ-5D and SF-6D, using comparable scaling methods (DCE) to derive the utility weights. The study's conclusion, that the common scaling method did not ameliorate differences in utilities and that differences are probably attributable to the dissimilar descriptive systems, is of importance for the future direction of a research programme which seeks to reconcile the differences. It implies that research which improves the precision of utility scoring formulae will not reconcile the differences. Rather, descriptive systems will need to be revised.
(The number of possible health states is determined by the number of items and the number of response categories per item. The EQ-5D-5L has 5 items, each with 5 response levels, and therefore $5^5 = 3{,}125$ possible health states.)
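The count of possible health states mentioned above is the product of the number of response levels across items. The following is a minimal sketch in Python; the function name `count_health_states` is hypothetical, and only the EQ-5D-5L structure (5 items with 5 levels each) is taken from the text.

```python
from math import prod


def count_health_states(levels_per_item):
    """Number of distinct health states a descriptive system can describe.

    levels_per_item: one entry per item, giving its number of response levels.
    """
    return prod(levels_per_item)


# EQ-5D-5L: 5 items, each with 5 response levels (as stated in the text).
eq5d5l_states = count_health_states([5] * 5)
print(eq5d5l_states)
```

The same function applies to instruments whose items have differing numbers of response levels, since the list may contain a different entry for each item.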
The aim of the present article is to further investigate the reasons for the differences between predicted utilities. It does so by comparing instrument utilities pairwise and disaggregating the differences into three components: differences attributable to the two instrument scales, differences in the structure of the descriptive systems and the effect of the utility formula after taking account of the two previous effects. To avoid misleading connotations, this last amount is termed the 'micro-utility effect'.
Methods and data used in the study are outlined below, and results presented in the following section. Their significance for the practice and future development of cost-utility analyses is then discussed. It is concluded that there is a need to refocus future developmental research to eliminate the causes of inconsistent utility measurement identified here.

Methods and data
Data
A multi-instrument comparison (MIC) survey was carried out in six countries: Australia, Canada, Germany, Norway, the UK and the USA. The online survey was administered by a global panel company, CINT Pty Ltd. The survey was approved by the Monash University Human Research Ethics Committee, Monash University, Melbourne, Australia, reference number CF11/3192-2011001748.
Respondents were initially asked to indicate whether they had a chronic disease and to rate their overall health on a visual analogue scale (VAS) where 0 represented death and 100 represented 'best possible health' (physical, mental and social). Quotas were then used to obtain a demographically representative sample of the 'healthy' public, defined by the absence of chronic disease and by a score above 70 on the VAS. Quotas were also applied to obtain a target number of respondents in each of seven chronic disease areas, viz. arthritis, asthma, cancer, depression, diabetes, hearing loss and heart disease.
Each respondent completed a total of 12 questionnaires: seven MAU instruments, three subjective well-being instruments, the ICECAP capabilities instrument, a self TTO and a VAS. Responses were subjected to a set of stringent edit procedures based upon a comparison of duplicated or similar questions and a minimum completion time. Edit procedures, the questionnaire and its administration are described in Richardson et al. [7]. Country-specific results of the edit procedures are available [8], and the database is available online [9].

Methods
The methods detailed below are illustrated in Fig. 1. This plots scores, $S_i$ and $S_j$, derived by summing item responses from two MAU instruments, MAUI$_i$ and MAUI$_j$, on the horizontal axis, and the corresponding utilities, $U$, and values, $V$, on the vertical axis. Values are a linear transformation of scores and are represented by the lines XY and ZY. Due to the micro-utility effects of the MAU formulae, the corresponding instrument utilities are scattered randomly around the two lines. The differing measurement scales embodied in the utility formulae are illustrated by the differing slopes of XY and ZY. For a given individual, A, the scores from the unweighted instruments, $S_i^A$ and $S_j^A$, differ. Application of the two MAUI formulae results in estimates of utility which differ by $(U_i^A - U_j^A)$. The aim of the analysis below is to attribute this difference to a difference in the scale and to the effect attributable to the structure of the descriptive systems, which results in the difference $S_i^A - S_j^A$. Terminology used in the remainder of the paper is defined in Box 1.

Measuring differences
For each respondent, absolute (sign-free) differences, $|U_i - U_j|$, were calculated for each instrument pair. (Consequently, two differences of -0.6 and +0.4 will average 0.5, not -0.1.)
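The distinction between averaging absolute and signed differences can be illustrated with a short sketch. The two respondents and the function names below are hypothetical; the utilities are chosen only so that the differences are -0.6 and +0.4, as in the example above.

```python
def mean_absolute_difference(u_i, u_j):
    """Average of per-respondent |U_i - U_j| (sign-free differences)."""
    return sum(abs(a - b) for a, b in zip(u_i, u_j)) / len(u_i)


def mean_signed_difference(u_i, u_j):
    """Average of per-respondent U_i - U_j, where opposite signs cancel."""
    return sum(a - b for a, b in zip(u_i, u_j)) / len(u_i)


# Two hypothetical respondents whose differences are -0.6 and +0.4.
u_i = [0.2, 0.9]
u_j = [0.8, 0.5]

print(round(mean_absolute_difference(u_i, u_j), 3))
print(round(mean_signed_difference(u_i, u_j), 3))
```

The absolute average retains the full size of both discrepancies, whereas the signed average lets them offset one another and so understates the disagreement between instruments.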
In the second stage, scores, $S_i$, were subjected to a linear transformation to obtain 'values' which are calibrated on the same scale as the corresponding utilities (XY, ZY in Fig. 1). To achieve this, an OLS linear regression, Eq. 2, was estimated for each instrument between utilities, $U_i$, and scores, $S_i$:
$$U_i = a + b S_i + \mathit{res}_i \tag{2}$$
Values, $V$, were calculated by deleting the residual, $\mathit{res}_i$, i.e. $V_i = a + b S_i$. Values calculated in this way are therefore a linear transformation of unweighted scores, $S$. Utilities, $U_i$, determine the scale upon which values $V_i$ are calibrated. Values differ from utilities by the 'micro-utility effect' included in $\mathit{res}_i$.
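The construction of values from scores can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: the helper names `ols` and `values_from_scores` and the five data points are hypothetical, and a simple closed-form least-squares fit stands in for whatever statistical software was used.

```python
def ols(x, y):
    """Return intercept a and slope b of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def values_from_scores(scores, utilities):
    """Regress utilities on scores; values are the fitted line, residuals the rest."""
    a, b = ols(scores, utilities)
    values = [a + b * s for s in scores]
    residuals = [u - v for u, v in zip(utilities, values)]
    return values, residuals


# Hypothetical unweighted scores and utilities for five respondents.
scores = [5, 8, 12, 15, 20]
utilities = [1.00, 0.85, 0.70, 0.40, 0.10]

values, residuals = values_from_scores(scores, utilities)
print([round(v, 3) for v in values])
print([round(r, 3) for r in residuals])
```

Because the fit includes an intercept, the residuals sum to zero and the values share the mean of the utilities, which is why values sit on the same scale as the utilities they were derived from.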

Removing scale effects
In each pairwise comparison of MAU$_i$ and MAU$_j$, the effect of scale was removed by rotating $U_j$ and $V_j$ to be on the same scale as $U_i$. This was achieved by regressing $U_i$ upon $U_j$ and $V_j$ as shown in Eqs. 3 and 4:
$$U_i = a_1 + b_1 U_j + \mathit{res}_1 \tag{3}$$
$$U_i = a_2 + b_2 V_j + \mathit{res}_2 \tag{4}$$
where $\mathit{res}_1$ and $\mathit{res}_2$ are residuals attributable to micro-utility effects and measurement error. Rotated utilities and values for MAU$_j$ were obtained from the linear component of these equations as defined by Eqs. 3′ and 4′:
$$U_j(u_i) = a_1 + b_1 U_j \tag{3$'$}$$
$$V_j(u_i) = a_2 + b_2 V_j \tag{4$'$}$$
where $U_j(u_i)$ and $V_j(u_i)$ are, respectively, the utility and value from MAU$_j$ rotated to be on the same scale as $U_i$.
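The rotation step can be sketched in the same way. The helper names and the data are hypothetical, and the closed-form fit is an assumption standing in for the OLS routine actually used; the sketch only shows that the rotated series is the fitted (linear) part of the regression of one instrument's utilities on the other's.

```python
def ols(x, y):
    """Return intercept a and slope b of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def rotate_to_scale(target, source):
    """Rotate `source` onto the scale of `target`: keep only the fitted line."""
    a, b = ols(source, target)
    return [a + b * s for s in source]


# Hypothetical utilities from two instruments for the same five respondents.
u_i = [1.00, 0.80, 0.55, 0.30, -0.10]
u_j = [0.95, 0.85, 0.75, 0.60, 0.45]

u_j_rotated = rotate_to_scale(u_i, u_j)
print([round(u, 3) for u in u_j_rotated])

# Regressing the target on the rotated series recovers intercept 0 and slope 1.
a_check, b_check = ols(u_j_rotated, u_i)
print(round(a_check, 6), round(b_check, 6))
```

The same function applied to values in place of utilities gives the rotated values, since only the right-hand-side variable of the regression changes.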

Confirmation of result
The effect of the linear adjustment (3′) may be shown by substituting Eq. 3′ into Eq. 3:
$$U_i = U_j(u_i) + \mathit{res}_1 \tag{5}$$
Similarly, substituting Eq. 4′ into Eq. 4:
$$U_i = V_j(u_i) + \mathit{res}_2 \tag{5$'$}$$
Equations 5 and 5′ confirm that in principle $U_j(u_i)$ and $V_j(u_i)$ are on the same linear scale as $U_i$, varying from $U_i$ by $\mathit{res}_1$ and $\mathit{res}_2$, respectively, which include the effects of differing descriptive systems, micro-utility effects and an error term. To test empirically the success with which scale effects were removed by these procedures, OLS regressions were estimated between differences in the scale-adjusted utilities and values: Eq. 6. With linear relationships between variables, a perfect alignment of scales would result in $a_3 = 0$; $b_3 = 1.00$. Nonlinearities in the relationships would result in $a_3 = 0$ (a property of OLS regression) but possible deviation from $b_3 = 1.00$.

Measuring the three components
Disaggregation of the differences between utilities employed the following relationships: $A = |U_i - U_j|$: pairwise difference in utilities which are to be explained. Combining the effects: $A$ = Scale ($C$) + Descriptive system ($D$) + Micro-utility ($B - D$).
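The way the three components add up to the difference being explained can be sketched as below. The definitions of B, C and D used here are assumptions inferred from the surrounding description (B as the difference between utilities remaining after scale adjustment, D as the scale-adjusted difference between values, C as the remainder A - B); they are not quoted from the source. The function names and all data are hypothetical.

```python
def mean_abs_diff(x, y):
    """Average absolute (sign-free) difference between two series."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def decompose(u_i, u_j, u_j_rot, v_i, v_j_rot):
    """Split the mean |U_i - U_j| into scale, descriptive and micro-utility parts."""
    A = mean_abs_diff(u_i, u_j)      # difference to be explained (from the text)
    B = mean_abs_diff(u_i, u_j_rot)  # ASSUMED: difference left after scale adjustment
    D = mean_abs_diff(v_i, v_j_rot)  # ASSUMED: scale-adjusted difference between values
    C = A - B                        # ASSUMED: part removed by the scale adjustment
    parts = {"scale": C, "descriptive_system": D, "micro_utility": B - D}
    shares = {name: 100 * size / A for name, size in parts.items()}
    return A, parts, shares


# Hypothetical inputs: utilities, rotated utilities, values and rotated values.
u_i = [1.00, 0.80, 0.55, 0.30, -0.10]
u_j = [0.95, 0.85, 0.75, 0.60, 0.45]
u_j_rot = [0.98, 0.77, 0.56, 0.24, -0.07]
v_i = [0.97, 0.79, 0.57, 0.28, -0.06]
v_j_rot = [0.96, 0.77, 0.57, 0.25, -0.05]

A, parts, shares = decompose(u_i, u_j, u_j_rot, v_i, v_j_rot)
print(round(A, 4))
print({name: round(share, 1) for name, share in shares.items()})
```

Under these assumed definitions the three parts sum to A by construction, so the percentage shares sum to 100 even when an individual component is negative.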

Data
Data were obtained from 9,665 individuals. Edit procedures resulted in the removal of 17 % of the total. Table 2 presents the age-gender and educational status of the remaining 8,019 respondents. Because quotas were imposed, the proportion of respondents from each country is similar. For the same reason, the age, gender and educational profiles of respondents within each country are similar. The numbers recruited from the disease areas varied from 772 for cancer to 943 for heart disease. The 1,760 'public' respondents were obtained by combining country samples which closely matched the age-gender profile in each country. There were few missing data as the online program did not permit respondents to proceed until questions were completed. Individuals who did not answer the final question were excluded. This resulted in a final sample of 8,019. A detailed comparison of utilities is given in Richardson et al. [5].
Table 3 reports summary statistics for the five instruments and the correlation between utilities and values. With the exception of the 15D, mean utilities are similar, varying from 0.68 to 0.74 in the full sample and from 0.83 to 0.88 in the public sample. Despite this similarity, the distributions of utilities differ significantly. Reflecting scale differences, the standard deviation of the observations in the full sample varies by about 100 %, from 0.27 for HUI 3 to 0.13 for the 15D and 0.14 for the SF-6D. Ceiling effects (U = 1.00) vary from 19.1 % (EQ-5D) to 0.3 % (AQoL-8D), and the percentage with a utility below 0.4 varies from 0.3 % for the 15D and 1.3 % for the SF-6D to 13.9 % for HUI 3 and 14.7 % for AQoL-8D. Values obtained from unweighted scores necessarily have the same means as utilities as they were obtained from the regression of utilities upon scores. However, as utilities are not a linear function of scores, the range of values differs from the range of utilities.
Nevertheless, the correlation between values and utilities is very high, exceeding 0.89 in all cases and rising to 0.99 for the 15D.

Rescaling
The linear regressions used to rotate the scales of utilities and values are reported in Table 4. The 'b' coefficient indicates the extent to which, on average, incremental change in the 'independent' (right-hand side) instrument utility or value must be compressed or expanded to be on the same scale as the 'dependent' (left-hand side) instrument. From the regression between HUI 3 and 15D utilities, increments of the 15D utility must be expanded by a factor of 1.75 for equivalence with the HUI 3 scale. In contrast, increments of utility on the AQoL-8D must be compressed by a factor of 0.47 for equivalence with incremental utilities measured by the 15D. The test of the success of the rescaling of instruments is reported in Table 5. Reflecting the properties of the OLS regressions used to rotate the scales, a = 0 in every regression indicating that each of the variables used in the regressions has the same mean (equal to the mean of U i ). In each case, the slope parameter, b, is close to but deviates from 1.00 reflecting nonlinearities in the relationship. In the disaggregation of effects, the imperfect alignment of scales will result in an increased micro-utility effect.
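As a worked illustration of how the 'b' coefficients reported above rescale increments, the two factors quoted in the text can be applied to example increments. The increments of 0.04 and 0.10 are hypothetical and chosen only for the arithmetic.

```latex
\begin{align*}
\Delta U_{\text{HUI 3}} &\approx 1.75 \times \Delta U_{\text{15D}},
  & \text{e.g. } 1.75 \times 0.04 &= 0.07 \\
\Delta U_{\text{15D}} &\approx 0.47 \times \Delta U_{\text{AQoL-8D}},
  & \text{e.g. } 0.47 \times 0.10 &= 0.047
\end{align*}
```

On this reading, a gain that looks small on the compressed 15D scale corresponds to a considerably larger gain when expressed on the HUI 3 scale.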

Disaggregation
The decomposition of the pairwise differences in utilities is reported in Table 6.

Discussion
Discrepancies between utilities predicted by different MAU instruments have been observed in a very large number of studies [1]. Consistent with these, the present study also identifies quantitatively large differences. Across all pairwise comparisons, the average difference in utilities predicted for the 8,019 survey respondents was 0.135. To put this figure in perspective, an incremental change in utility of 0.135 for seven people is almost equivalent to the difference between death and full health for a single person: that is, the difference is quantitatively large with correspondingly large implications for the outcome of an economic evaluation.
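The comparison in the preceding paragraph rests on the following arithmetic, where 1.00 is the difference between death (0.00) and full health (1.00) for one person:

```latex
7 \times 0.135 = 0.945 \approx 1.00
```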
The chief conclusion from the present study is that these differences are primarily the result of differences in the descriptive systems. While these explain an average of 66.0 % of the difference between utilities, their importance in pairwise comparisons varies from 27.4 % in the comparison of the 15D and AQoL-8D to 101.6 % of the difference between HUI 3 and AQoL-8D. These results are plausible. As scale effects account for a larger part of the difference between 15D and AQoL-8D than for any other instrument pair, the relative importance of the remaining effects is consequently reduced. In Table 1, the 15D descriptive system uniquely shares with AQoL-8D items relating to sleep and intimacy, and the two instruments have the largest number of items describing depression and anxiety. In contrast, the 'within the skin' descriptive system of HUI 3 has no items relating to social relationships, which constitute a major part of the AQoL-8D descriptive system. The more surprising result is that the principal effect of differing utility weights is via their effect upon measurement scales and not upon the micro-utility effect. The scale effects are large in comparisons involving 15D, and from Table 3, the 15D has the lowest standard deviation, implying the greatest compression of utilities. Scale effects are also large in the comparison of SF-6D with both HUI 3 and AQoL-8D. From Table 3, the SF-6D has the second lowest standard deviation and the HUI 3 and AQoL-8D have the largest standard deviations.
Table 5 Regression of scale-free difference between utilities and difference between values
Table 6 Decomposition of $(U_i - U_j)$ (columns: pairwise comparison; absolute differences; per cent of $(U_i - U_j)$; micro-utility $(B - D)$)
After taking account of differences in the descriptive system and scale, the residual micro-utility effect is generally positive: the effect contributes to an explanation of differences. In three cases in Table 6, it is negative, suggesting that the effect partially compensates for other differences. With one exception, the effect is small. The exception is the estimated micro-utility effect in the comparison of EQ-5D and SF-6D. From Table 3, the relationship between SF-6D and EQ-5D is particularly nonlinear, with a rapid decrease in SF-6D utilities at the top end of the scale, where 19 % of EQ-5D utilities but only 1.3 % of SF-6D utilities are equal to 1.00. The pattern reverses as health deteriorates, with 1.3 and 8.9 % of observations below 0.4 for the SF-6D and EQ-5D, respectively. Using present methods, the effect of nonlinearities in the relationship between utilities is attributed to the micro-utility effect.
The respective magnitudes of the three effects employed in the disaggregation have implications for the practice and future development of CUA. First, the identification of significant scale effects implies that these should be eliminated by mapping utilities to a common scale in any ranking of interventions which have employed different MAUI. Mapping functions between each pair of instruments have been estimated by Chen et al. [16] from the database used in the present study and are available on the AQoL website.
Secondly, the results call into question the usefulness of past and future research which is justified by the need to incorporate particular preferences. Unique preferences in Australia, Canada, Finland and the UK would have resulted in significant micro-utility effects in the comparison of the MAUI which derived utilities from representative samples in those countries. The small effects found here suggest that differences in utilities attributed to national preferences are probably the result of differences in the methodologies used to derive utility formulae. Minimally, before new results can be attributed to unique preferences, the effects of the methods upon utilities must be taken into account.
Finally, as the differences between utilities were primarily attributable to differences in the instruments' descriptive systems, these differences will not be fully eliminated by mapping to a common scale or by the re-estimation of utilities. This implies that the results of a CUA may depend upon the choice of MAUI. Elsewhere, we argue that the most sensitive instrument in a disease area should be selected and utilities transformed to the scale of a single instrument [5]. The comparison of results from different instruments will remain imperfect but will be superior to the use of a single instrument which is more sensitive to some health states than to others.
A caveat to the present results is that the effect of measurement error-the inconsistent and erroneous completion of two questionnaires-will result in a larger apparent effect of the descriptive systems. The problem is difficult to circumvent as survey respondents are fallible. However, it is unlikely to have had a large impact. The MIC data were subjected to eight separate edit procedures to delete inconsistent results. These were based upon the comparison of repeated and similar questions and resulted in the removal of 17 % of respondents from the database before analyses commenced. Remaining inconsistencies are unlikely to explain the magnitude of the effects identified here. A more plausible explanation is that the effect is a correct reflection of the very significant differences in the descriptive systems which are apparent from the casual comparison of the instruments.
A final caveat to the results is that they are necessarily based upon particular published utility formulae. While the effect of the descriptive systems is independent of the utility weighting, both the scale and micro-utility effects could vary substantially with a change in the utility formula.

Conclusions
The validity of CUA is compromised by the inconsistent results of the MAUI used to estimate QALYs. A significant body of research has sought to increase the validity of utility measurement by refining the methods used for eliciting utilities, or by deriving utilities from nationally representative samples. The present paper has investigated the extent to which such research is likely to reconcile the inconsistencies in the MAUI. The results suggest that utility weights are important, accounting for 34 % of the difference between instrument utilities. But their impact is primarily via a scale effect: different utility formulae use different scales for the calibration of utility, and these account for 30.3 of the 34.0 percentage points of the difference between utilities attributable to utility weights. It is possible that this result is attributable to differences in the modelling methodologies that have been adopted. After adjusting for this, the residual effect of different formulae, the 'micro-utility effect', is relatively small. This implies that there is little scope for reconciling the numerical values obtained from different instruments by achieving greater precision in the relative values assigned to items.
The dominant determinant of the difference between utilities is the difference between descriptive systems. A necessary condition for achieving comparability between utilities, QALYs and, therefore, the results of cost-utility analyses is the use of instruments with comparable descriptive systems or the adjustment of results to take account of structural and scale differences.