Are different measures of self-rated health comparable? An assessment in five European countries
- Cite this article as:
- Jürges, H., Avendano, M. & Mackenbach, J.P. Eur J Epidemiol (2008) 23: 773. doi:10.1007/s10654-008-9287-6
Objective: Self-rated health (SRH) is widely used to compare population health across countries, but comparability is often hampered by the use of different versions of this item. This study compares the WHO-recommended version (ranging from ‘very good’ to ‘very bad’) with the US version (ranging from ‘excellent’ to ‘poor’) in European countries. Methods: Data came from the Survey of Health, Ageing and Retirement in Europe (SHARE). Both the WHO and US versions of SRH were measured in representative samples of Europeans aged 50+ (n = 11,643) in five countries. Concordance between the two SRH versions and differences in their associations with demographics, chronic diseases, functioning and depression were assessed using ordered probit regression. Results: The US version has a more symmetric distribution and larger variance than the WHO version. Although the WHO version discriminates better at the negative end, the US version shows better discrimination at the positive end of the scale. Sixty-nine percent of respondents provided literally concordant answers, while only about one-third provided relatively concordant answers. Overall, however, fewer than 10% of respondents were discordant in either sense. The two versions were strongly correlated (polychoric correlation = 0.88), had similar associations with demographics and health indicators, and showed a similar pattern of international variation. Conclusion: Health levels based on different measurements of SRH are not directly comparable and require rescaling of items. However, both versions represent parallel assessments of the same latent health variable. We did not find evidence that the WHO version is preferable to the US version as the standard measure of SRH in European countries.
Keywords: Self-rated health; World health; International comparisons; Research design; Europe
Self-rated health is an independent predictor of mortality [1, 2, 3, 4, 5, 6, 7, 8], and it is the most widely used comprehensive health measurement recommended by the World Health Organization (WHO) [5, 10]. Although differences have been observed between countries in self-rated health levels [10, 11], measurements vary in wording and scale across surveys [7, 12]. It is not known whether self-rated health variations across countries are due to true health differences or to the use of different measurements of self-rated health.
Two five-point scale versions of self-rated health have been used in international surveys. The first comprises answer categories ranging from ‘very good’ to ‘very bad’, and has been recommended by WHO-Europe and the European community health monitoring programme [7, 8, 13]. The second version ranges from ‘excellent’ to ‘poor’ and has been primarily applied in the US. It is not known whether the two versions are directly comparable, which hampers international comparisons across surveys that use different measurements. As opposed to the US version, the WHO version has been hypothesised to comprise a balanced set of two positive categories (very good, good), one neutral category (fair), and two negative categories (bad, very bad). However, no studies have empirically examined these advantages of the WHO version, and the scientific evidence for recommending this version remains scarce.
This study compares the WHO and the US versions of self-rated health across five European countries. We applied both measurements in a sample of over 11,000 respondents to the Survey of Health, Ageing and Retirement in Europe (SHARE) in these countries. To our knowledge, this is the first study to assess differences in the distribution of different versions of self-rated health, and in their association with demographic and health variables across countries.
Study population and data collection
Details on the SHARE study in Europe have been described elsewhere [14, 15]. Briefly, in 2004, a survey was conducted in representative samples of the non-institutionalised population aged 50+ in Sweden, Denmark, Germany, the Netherlands, France, Switzerland, Austria, Italy, Spain and Greece (n = 22,777). Interviews were face-to-face and took place in the household. Trained interviewers conducted interviews using a computer-assisted personal interviewing program. The set-up allowed each country to use exactly the same underlying structure and questionnaire [14, 15].
Table 1 Original language answer categories for self-rated health using the European (WHO) and the US versions in five European countries: The SHARE study
Version | Self-rated health (WHO) | Self-rated health (US)
Generic (English) version | 1 Very good; 5 Very bad | 2 Very good
German | 1 Sehr gut; 5 Sehr schlecht | 2 Sehr gut
Spanish | 1 Muy buena; 5 Muy mala | 2 Muy buena
Dutch | 1 Heel goed; 5 Heel slecht | 2 Heel goed
Calibrated sampling weights were designed to adjust for the complex sampling design and non-response in each country. However, because the present study does not compare population parameters, we did not apply sampling weights. Because we examine intra-individual consistency of responses to both versions of self-rated health, applying weights would not alter our results.
Individuals were asked to rate their health separately using the WHO version (very good, good, fair, bad, or very bad) and the US version (excellent, very good, good, fair, or poor) of self-rated health. The order of presentation was randomised: each respondent received one of the two versions at the beginning of the health survey and the other at the end, with half of the sample assigned to each order. Table 1 summarises the original categories used in each country.
Demographic and health covariates
The following variables were assessed: (1) Age and sex. (2) Highest level of education, reclassified into three levels using the UNESCO International Standard Classification of Education (ISCED-97): “low” (ISCED 0–2), “medium” (ISCED 3, 4), and “high” (ISCED 5, 6). (3) Chronic diseases ever diagnosed by a doctor, including heart disease, stroke, hypertension, hypercholesterolemia, diabetes, lung disease, asthma, arthritis, osteoporosis, cancer, ulcer, Parkinson disease, cataracts, and hip fracture. Information on these diagnoses was based on self-reported information only. Individuals’ answers were summarised in three categories: no condition, one or two conditions, and three or more conditions. (4) Symptoms as measured by self-report of back or joint pain, angina or chest pain, breathlessness, persistent cough, swollen legs, sleeping problems, falls and fear of falling, dizziness, stomach or intestine problems, and incontinence. Answers were summarised in three categories: no symptom, one or two symptoms, and three or more symptoms. (5) Limitations with ADL (activities of daily living), measured by a validated scale of limitations individuals have with basic activities, namely dressing, walking, bathing, eating, getting in and out of bed, and using the toilet. (6) Limitations with IADL (instrumental activities of daily living), measured by a validated scale of limitations with the following activities: using a map, cooking, shopping, telephoning, taking medications, working in the house, and managing money. Limitations with ADLs and IADLs were summarised in three categories: no limitation, one or two limitations, and three or more limitations. (7) Depression as measured by the Euro-Depression scale (EURO-D), a scale of depression symptoms validated for the European population. A EURO-D score higher than three is indicative of depressive symptomatology and was used to dichotomise this variable.
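The three-level summaries and the depression cut-off described above can be sketched in a few lines of Python. This is only an illustration of the stated coding rules; the function names `count_category` and `depressive_symptomatology` are hypothetical and do not come from the SHARE data files.

```python
def count_category(n_items):
    """Summarise a count of conditions, symptoms or limitations in three levels."""
    if n_items < 0:
        raise ValueError("a count cannot be negative")
    if n_items == 0:
        return "none"
    if n_items <= 2:
        return "one or two"
    return "three or more"


def depressive_symptomatology(euro_d_score):
    """Dichotomise EURO-D: a score higher than three indicates depression."""
    return euro_d_score > 3


# Hypothetical respondent: 2 chronic conditions, 4 symptoms, EURO-D score 3.
summary = {
    "conditions": count_category(2),
    "symptoms": count_category(4),
    "depressed": depressive_symptomatology(3),
}
```

Note that a score of exactly three falls in the non-depressed group, which matches the "0–3" versus "4 or higher" split used in the tables.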
Methods of analysis
Concordance measures. Literal concordance occurs when an individual’s response to both versions is verbally consistent regardless of the self-rated health version (e.g., respondent answers “very good” to both the US and WHO version). Combinations of either the two highest positive or the two highest negative ratings possible in both scales were also classified as concordant. Relative concordance occurs when an individual’s responses to both versions are consistent in terms of their position in the self-rated health scale. This assumes that individuals use the scale midpoint as an anchor or population average.
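The two concordance definitions can be made concrete with a small Python sketch. The set of literally concordant pairs below is one reading of the rule stated above (identical wording, plus the two top US ratings paired with the top WHO rating, and the two bottom WHO ratings paired with the bottom US rating); it is an assumption, not the authors' published coding, and all names are hypothetical.

```python
# Answer categories, ordered from best (position 1) to worst (position 5).
WHO = ["very good", "good", "fair", "bad", "very bad"]
US = ["excellent", "very good", "good", "fair", "poor"]

# Assumed literal pairs, written as (WHO answer, US answer).
LITERAL_PAIRS = {
    ("very good", "excellent"),
    ("very good", "very good"),
    ("good", "good"),
    ("fair", "fair"),
    ("bad", "poor"),
    ("very bad", "poor"),
}


def literally_concordant(who_answer, us_answer):
    """True when the two answers are verbally consistent."""
    return (who_answer, us_answer) in LITERAL_PAIRS


def relatively_concordant(who_answer, us_answer):
    """True when both answers occupy the same position on their scale."""
    return WHO.index(who_answer) == US.index(us_answer)


def classify(who_answer, us_answer):
    """Return both concordance flags for one respondent."""
    return {
        "literal": literally_concordant(who_answer, us_answer),
        "relative": relatively_concordant(who_answer, us_answer),
    }
```

Under this reading, a respondent answering 'good' (WHO) and 'very good' (US) is relatively but not literally concordant, while a respondent answering 'good' (WHO) and 'fair' (US) is discordant in either sense.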
Polychoric correlations were calculated by maximum likelihood using R 2.7.0, and assuming that general health is a normally distributed continuous latent variable divided into ordered levels. A correlation close to one indicates that both scales measure the same concept. We used both Chi-squared tests and root mean square errors of approximation (RMSEA) to test the assumption of normality of latent health.
Ordered probit regressions [22, 23] were used to assess whether the associations of self-rated health with demographic and health variables differed for the WHO and US versions. The latent continuous variable ‘general health’ is modelled as a linear function of covariates. Coefficients summarise the effect of a one-unit increase in the explanatory variables on the continuous (latent) outcome variable. Country effects were measured by effect coding (effects are measured relative to the grand mean). Cross-equation tests (based on a seemingly unrelated estimation of the two ordered probit equations) were used to assess whether effect sizes differ significantly between the two versions. Analyses were conducted using Stata 9.2.
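As a rough illustration of the ordered probit set-up, the sketch below computes the probability of each answer category from a linear index and a set of cut points. It assumes categories ordered from the lowest to the highest latent value and a standard normal error; the cut points in the usage lines are made-up values, not estimates from the paper, and `category_probabilities` is a hypothetical helper rather than a Stata or R function.

```python
from statistics import NormalDist


def category_probabilities(linear_index, cut_points):
    """Ordered probit probabilities for each category.

    P(category k) = Phi(c_k - xb) - Phi(c_{k-1} - xb), where the cut points
    c_1 < ... < c_{K-1} are ascending and xb is the linear index.
    """
    phi = NormalDist().cdf
    cdf_values = [0.0] + [phi(c - linear_index) for c in cut_points] + [1.0]
    return [upper - lower for lower, upper in zip(cdf_values, cdf_values[1:])]


# Illustrative cut points for a five-category item (assumed values).
cuts = [-1.5, -0.5, 0.5, 1.5]
baseline = category_probabilities(0.0, cuts)
shifted = category_probabilities(0.5, cuts)  # higher latent health
```

A higher linear index moves probability mass towards the upper categories, which is why coefficients can be read as effects on the latent outcome regardless of how each version cuts the scale.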
Differences in distributions
Table 2 Marginal distributions of self-rated health using the US and WHO versions among men and women aged 50 years and over in five European countries: The SHARE study
Individuals appear to be in better health when confronted with the US version. Whereas 27.3% reported being in very good or excellent health in response to the US version, only 15.5% reported ‘very good’ health (the top category) in response to the WHO version (Table 2). Similarly, whereas about 7% of respondents reported that their health was poor when presented with the US version, about 9.7% reported that their health was bad or very bad when presented with the WHO version. Thus, the same verbal representations elicited different assessments in the WHO and US versions.
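One way to read these shares is through the latent-variable assumption used for the polychoric correlations: if latent health is standard normal, each reported share implies a cut point on the latent scale. The sketch below is an illustration under that assumption only (unconditional, with higher values meaning better health), and the variable names are hypothetical.

```python
from statistics import NormalDist


def cut_point(share_below):
    """Latent cut point that leaves the given share of respondents below it."""
    return NormalDist().inv_cdf(share_below)


# Shares taken from the text above; "about 7%" is treated as 0.07.
us_very_good_cut = cut_point(1 - 0.273)   # lower bound of 'very good' (US)
who_very_good_cut = cut_point(1 - 0.155)  # lower bound of 'very good' (WHO)
us_poor_cut = cut_point(0.07)             # upper bound of 'poor' (US)
who_bad_cut = cut_point(0.097)            # upper bound of 'bad' (WHO)
```

Under these assumptions the label 'very good' starts at a higher latent health level in the WHO version than in the US version, which is consistent with the two items cutting the same latent scale at different points.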
Table 3 Cross-tabulation of SRH (self-rated health) between the WHO and US versions (row percentages) among men and women aged 50 years and over in five European countries: The SHARE study
Margins: Total (col. %); Total (row %)
Table 4 Degree of concordance between the WHO and US version of the self-rated health items among men and women aged 50 years and over in five European countries: The SHARE study
Rows: % Literally concordant (a); % Relatively concordant (b)
Tests: Chi-squared (df = 15) (b); Chi-squared (df = 4) (a)
Cross-country differences in concordance and discordance rates are statistically significant, as suggested by the chi-squared test statistic. This result also holds if all covariates discussed in the next section are held constant. The overall polychoric correlation between the two versions was 0.882 (Table 4). Correlations were highest in Germany, the Netherlands and Greece, and lowest in Spain. Although Chi-squared tests reject the assumption of normality of latent health, root mean square errors of approximation (RMSEA) indicate a good to acceptable fit, overall and in each country separately.
Differences in associations with covariates
Table 5 Description of health covariates (percentages) among men and women aged 50 years and over in five European countries: The SHARE study
Covariate | Categories | P
[label missing] | | P < 0.001
[label missing] | | P < 0.001
[label missing] | | P < 0.001
Chronic conditions | No diagnosed condition; One or two conditions; Three or more conditions | P < 0.001
Symptoms | One or two symptoms; Three or more symptoms | P < 0.001
(I)ADL limitations | No (I)ADL limitation; One or two (I)ADL limitations; Three or more (I)ADL limitations | P < 0.001
Depression | Depression score 0–3; Depression score 4 or higher | P < 0.001
Table 6 Ordered probit regressions (fully adjusted models) of self-rated health for the WHO and US item versions and cross-equation tests (N = 11,622) among men and women aged 50 years and over in five European countries: The SHARE study
Columns: Self-rated health (WHO); Self-rated health (US)
Covariate | Categories | P
[label missing] | | P = 0.006
[label missing] | | P = 0.670
[label missing] | | P = 0.084
Chronic conditions | No chronic conditions; One or two chronic conditions; Three or more chronic conditions | P = 0.004
Symptoms | One or two symptoms; Three or more symptoms | P = 0.787
(I)ADL problems | No (I)ADL problems; One or two (I)ADL problems; Three or more (I)ADL problems | P = 0.326
Depression | Depression score 0–3; Depression score 4 or higher | P = 0.472
[label missing] | | P < 0.001
Using different versions of self-rated health had little influence on the ranking of countries in terms of their self-rated health. For both versions, self-rated health conditional on covariates was best in Greece and the Netherlands, and worst in Germany (Table 6). The only exceptions were Austria and Spain, where ranks changed depending on the self-rated health version used. For the other countries, self-rated health rankings were identical for the two items.
Cross-equation tests of parameter differences for the two versions of self-rated health were also computed separately by country (results not shown). In Germany, the Netherlands and Austria, there were no significant differences between the WHO and US versions in their associations with any of the covariates. In Spain, we found differences only for the number of conditions. In Greece, associations with age and education were different between versions, but associations with other variables did not differ.
Although WHO has recommended the WHO version as the standard measurement of self-rated health in the European context [7, 8], our results suggest that this version is not clearly superior to the US version. The WHO version discriminates better at the negative end, but the US version is more symmetric and shows better discrimination at the positive end. Individual answers to both items are not fully consistent, and appear to be more concordant in a literal than in a relative sense. Despite these discrepancies, fewer than 10% of respondents were discordant in either sense. The US and WHO versions are highly correlated. They show very similar associations with demographic and health indicators, and they show a similar pattern of variation across countries. Overall, although the two measures are not directly comparable, they are in fact different categorisations of latent continuous health.
The strength of this study is the measurement of two self-rated health versions and covariates in several countries. However, some limitations should be considered. Data were only available for individuals aged 50 years and over. As younger individuals are on average healthier, measuring self-rated health in a younger cohort would result in a larger proportion of individuals reporting good health. In younger populations, the US version might be more appropriate because it discriminates better at the positive end. In addition, respondents were presented with both versions of self-rated health along with other health status measurements. The order of presentation (at the beginning or end) may have had an impact on the health ratings. However, we tackled this problem by randomising the order of presentation of both versions, and analyses not shown in this paper indicate that presentation order had little impact on individuals’ levels of self-rated health.
Comparison with previous studies
The predictive power of subjective global health assessments has been shown in numerous studies [1, 2, 5, 25]. To our knowledge, this is the first study to show that the two most commonly used versions of subjective global health are not directly comparable within and across countries, but relate similarly to other covariates. Consistent with findings from single populations, we found that different measures of self-rated health are strongly correlated. Our results confirm findings from previous research suggesting that different measures of self-rated health represent parallel assessments of subjective health.
Differences between countries in the level of self-rated health and the association of this variable with socioeconomic and health factors have been reported [10, 11, 26, 27, 28, 29, 30, 31]. Our results suggest that even if self-rated health is assessed in all countries using a 5-point scale, bias may yet be present due to differences in the wording of response categories. Thus, cross-country comparisons of population health based on different versions of the self-rated health item may lead to spurious health variations across populations. On the other hand, the associations of self-rated health with demographic factors such as socioeconomic status were similar for the two self-rated health item versions. Thus, comparisons of how demographic and other factors relate to self-rated health across surveys using a different 5-point self-rated health scale [2, 10, 28, 29, 30, 32] are unlikely to be biased.
Interpretation and implications
Most health and social surveys contain only one version of the self-rated health item. This raises the question of whether it is possible to combine data from different surveys that use different versions of this item. Two-thirds of respondents in our study gave literally concordant answers. Thus, one option would be to collapse the two top categories of the US version and the two bottom categories of the WHO version, resulting in a four-point comparable scale. However, although this would minimise differences, this approach would still result in an overestimation of average health in surveys that use the US version. A second alternative is to achieve comparability of different versions of self-rated health by appropriately rescaling items. For instance, two surveys using different self-rated health measures but similar measures for other variables can be made comparable by imputing conditional probabilities obtained from surveys such as SHARE. Consider again the conditional probabilities shown in Table 3. In order to ‘convert’ the WHO into the US version, a random number, say X, could be drawn for each respondent from a uniform distribution on the zero-to-one interval. A respondent who has answered ‘very good’ to the WHO version would then be coded as being in ‘excellent’ health if X < 0.379 (thus with a 37.9% probability), as being in ‘very good’ health if 0.379 ≤ X < 0.379 + 0.513 = 0.892 (thus with a 51.3% probability), as being in ‘good’ health if 0.892 ≤ X < 0.997, and as being in ‘fair’ health if 0.997 ≤ X ≤ 1.000. A respondent who has answered ‘good’ to the WHO version would be coded as being in ‘excellent’ health if X < 0.036, and so on. This procedure preserves the marginal distribution of the US version. It could also be repeated several times, yielding multiple imputations.
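The conversion procedure described above can be sketched in Python as follows. Only the row for a WHO answer of ‘very good’ is filled in, using the interval bounds quoted in the text (0.105 and 0.003 are derived from 0.997 − 0.892 and 1 − 0.997); the remaining rows would have to be taken from Table 3. The names `CONDITIONAL` and `convert_who_to_us` are hypothetical.

```python
import random
from itertools import accumulate

US_CATEGORIES = ["excellent", "very good", "good", "fair", "poor"]

# P(US answer | WHO answer), in the order of US_CATEGORIES.
# Assumption: only the 'very good' row is reproduced here.
CONDITIONAL = {
    "very good": [0.379, 0.513, 0.105, 0.003, 0.0],
}


def convert_who_to_us(who_answer, x):
    """Map a WHO answer to a US answer using a uniform draw x in [0, 1]."""
    row = CONDITIONAL[who_answer]
    for category, upper in zip(US_CATEGORIES, accumulate(row)):
        if x < upper:
            return category
    # Guard against round-off at the top of the interval: fall back to the
    # last category that has a positive probability.
    return [c for c, p in zip(US_CATEGORIES, row) if p > 0][-1]


# One imputation for a few hypothetical respondents who answered 'very good'.
rng = random.Random(2008)  # fixed seed so the draw is reproducible
imputed = [convert_who_to_us("very good", rng.random()) for _ in range(5)]
```

Repeating the last step with different seeds would give the multiple imputations mentioned in the text.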
An important finding of this study is that respondents tend to be more concordant in a literal than in a relative sense. This finding might appear to contradict the view that individuals conceive the scale midpoint as the population average health when judging their own health status, independently of the verbal representation. In fact, since two-thirds of our sample selected the equivalent verbal representation in both items, it would seem that respondents try to be consistent in a literal sense, regardless of the relative position of the answer categories. The main implication is that using a 5-point scale is not enough to ensure comparability, because individuals react differently to various verbal representations when judging their health. As a consequence, comparisons between studies using different verbal answer categories are likely to be biased.
Although levels of self-reported health based on the US and WHO versions are not directly comparable, they are in fact different categorisations of the same latent continuous variable. In particular, both scales have the same properties with respect to demographics and health indicators. Thus, data from surveys using different self-rated health versions could still be used to compare associations of covariates with general health, even though overall health levels cannot be compared. However, this may require the use of appropriate statistical models that interpret self-rated health as different categorisations of an underlying (latent) continuous health variable.
WHO recommends the use of the WHO version as standard measurement of self-rated health in European populations. In our data, we found very little support for this directive. One of the central arguments of the WHO and related reports is that the WHO version comprises a balanced scale of five categories, two of which are positive (very good, good), one neutral (fair), and two negative (bad, very bad) [7, 8]. In our study, however, this balanced set of categories resulted in a skewed distribution of self-rated health. In terms of statistical efficiency, the US version has in fact some advantages. Responses to the US version are more evenly distributed across the 5-point scale, resulting in smaller standard errors of the estimated ordered probit parameters. The fact that both versions are similarly associated with demographic and health determinants further weakens the case for recommending the WHO version. Thus, in studies of older European populations, there does not seem to be a strong argument for preferring the WHO version. Moreover, the choice of a self-rated health version should be based on several considerations, including aspects such as the age distribution of the population studied, because in older populations, the WHO version tends to show a skewed distribution. These results invite a reassessment of WHO recommendations.
This paper uses data from release 1 of SHARE 2004. The SHARE data collection has been primarily funded by the European Commission through the 5th framework program (project QLK6-CT-2001-00360 in the thematic program Quality of Life). Additional funding came from the US National Institute on Aging (U01 AG09740-13S2, P01 AG005842, P01 AG08291, P30 AG12815, Y1-AG-4553-01 and OGHA 04-064). Data collection in Austria (through the Austrian Science Fund, FWF), Belgium (through the Belgian Science Policy Office) and Switzerland (through BBW/OFES/UFES) was nationally funded. The SHARE data set is introduced in Börsch-Supan et al.; methodological details are contained in Börsch-Supan and Jürges. Mauricio Avendano was supported by a grant from the Netherlands Organization for Scientific Research (NWO, grant no. 451-07-001) and a Fellowship from the Erasmus University.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.