Self-rated health is an independent predictor of mortality [1–8], and it is the most widely used comprehensive health measurement [9] recommended by the World Health Organization (WHO) [5, 10]. Although differences have been observed between countries in self-rated health levels [10, 11], measurements vary in wording and scale across surveys [7, 12]. It is not known whether self-rated health variations across countries are due to true health differences or to the use of different measurements of self-rated health.

Two five-point scale versions of self-rated health have been used in international surveys. The first comprises answer categories ranging from ‘very good’ to ‘very bad’, and has been recommended by WHO-Europe and the European community health monitoring programme [7, 8, 13]. The second version ranges from ‘excellent’ to ‘poor’ and has been primarily applied in the US. It is not known whether the two versions are directly comparable, which hampers international comparisons across surveys that use different measurements [7]. As opposed to the US version, the WHO version has been hypothesised to comprise a balanced set of two positive categories (very good, good), one neutral category (fair), and two negative categories (bad, very bad) [7]. However, no studies have empirically examined these advantages of the WHO version, and the scientific evidence for recommending this version remains scarce.

This study compares the WHO and the US versions of self-rated health in five European countries. We applied both measurements in a sample of over 11,000 respondents to the Survey of Health, Ageing and Retirement in Europe (SHARE). To our knowledge, this is the first study to assess differences in the distribution of different versions of self-rated health, and in their association with demographic and health variables across countries.


Study population and data collection


Details on the SHARE study in Europe have been described elsewhere [14, 15]. Briefly, in 2004, a survey was conducted in representative samples of the non-institutionalised population aged 50+ in Sweden, Denmark, Germany, the Netherlands, France, Switzerland, Austria, Italy, Spain and Greece (n = 22,777). Interviews were face-to-face and took place in the household. Trained interviewers conducted interviews using a computer-assisted personal interviewing program. The set-up allowed each country to use exactly the same underlying structure and questionnaire [14, 15].

The present analysis is based on data for Austria, Germany, the Netherlands, Spain and Greece, because overlapping answer categories for the two versions of self-rated health were verbally identical in these countries (Table 1). Other countries were excluded because translated answers were not verbally equivalent for both versions. All participants were drawn from probability samples of the underlying 50+ population in each country. Samples for Germany and the Netherlands were drawn from regional registries; the sample for Spain was drawn from a national population registry; and samples for Austria and Greece were drawn using telephone directories as sampling frames and pre-screening in the field of eligible sample participants [14, 15]. Household response rate was 55.6% in Austria (n = 1,893); 63.4% in Germany (n = 2,866); 61.6% in the Netherlands (n = 2,731); 53.0% in Spain (n = 2,252); and 63.1% in Greece (n = 1,901) [14]. Response rates are comparable to those observed in other European surveys [14]. The total final sample for analysis comprised 11,643 participants.

Table 1 Original language answer categories for self-rated health using the European (WHO) and the US versions in five European countries: The SHARE study

Calibrated sampling weights were designed to adjust for the complex sampling design and non-response in each country [14]. However, because the present study does not compare population parameters, we did not apply sampling weights. Because we examine intra-individual consistency of responses to both versions of self-rated health, applying weights would not alter our results.

Self-rated health

Individuals were asked to rate their health separately using the WHO version (very good, good, fair, bad, or very bad) and the US version (excellent, very good, good, fair, or poor) of self-rated health. The order of presentation was randomised: each version was presented at the beginning of the health survey to half of the sample and at the end to the other half. Table 1 summarises the original categories used in each country.

Demographic and health covariates

The following variables were assessed. (1) Age and sex. (2) Highest level of education, reclassified into three levels using the UNESCO International Standard Classification of Education (ISCED-97) [16]: “low” (ISCED 0–2), “medium” (ISCED 3, 4), and “high” (ISCED 5, 6). (3) Chronic diseases ever diagnosed by a doctor, including heart disease, stroke, hypertension, hypercholesterolemia, diabetes, lung disease, asthma, arthritis, osteoporosis, cancer, ulcer, Parkinson’s disease, cataracts, and hip fracture. Information on these diagnoses was based on self-reports only. Answers were summarised in three categories: no condition, one or two conditions, and three or more conditions. (4) Symptoms, as measured by self-report of back or joint pain, angina or chest pain, breathlessness, persistent cough, swollen legs, sleeping problems, falls and fear of falling, dizziness, stomach or intestine problems, and incontinence. Answers were summarised in three categories: no symptom, one or two symptoms, and three or more symptoms. (5) Limitations with ADL (activities of daily living), measured by a validated scale of limitations with basic activities, namely dressing, walking, bathing, eating, getting in and out of bed, and using the toilet [17]. (6) Limitations with IADL (instrumental activities of daily living), measured by a validated scale of limitations with the following activities: using a map, cooking, shopping, telephoning, taking medications, working in the house, and managing money. Limitations with ADL and IADL were each summarised in three categories: no limitation, one or two limitations, and three or more limitations. (7) Depression, as measured by the Euro-Depression (EURO-D) scale, a scale of depressive symptoms validated for the European population. A EURO-D score higher than three indicates depressive symptomatology and was used as the cut-off to dichotomise this variable [18].
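The three-level summaries and the EURO-D cut-off described above can be sketched in a few lines of Python. This is only an illustration of the coding rules as stated in the text; the function names `count_category` and `eurod_depressed` are hypothetical and do not come from SHARE or from the original analysis.

```python
def count_category(n):
    """Collapse a count of conditions, symptoms or limitations into three levels."""
    if n < 0:
        raise ValueError("a count cannot be negative")
    if n == 0:
        return "none"
    if n <= 2:
        return "one or two"
    return "three or more"


def eurod_depressed(score):
    """Dichotomise EURO-D: a score higher than three indicates depressive symptomatology."""
    return score > 3


# Hypothetical respondent with 2 chronic conditions, 4 symptoms and a EURO-D score of 3.
respondent = {
    "conditions": count_category(2),
    "symptoms": count_category(4),
    "depressed": eurod_depressed(3),
}
```

Note that a score of exactly three falls below the cut-off, because the text specifies "higher than three".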

Methods of analysis

We assessed the distribution and cross-tabulations of self-rated health ratings, and examined concordance between the WHO and US measurements in three ways:

  1. Concordance measures. Literal concordance occurs when an individual’s response to both versions is verbally consistent regardless of the self-rated health version (e.g., respondent answers “very good” to both the US and WHO version). Combinations of either the two highest positive or the two highest negative ratings possible in both scales were also classified as concordant. Relative concordance occurs when an individual’s responses to both versions are consistent in terms of their position in the self-rated health scale. This assumes that individuals use the scale midpoint as an anchor or population average [19].

  2. Polychoric correlations were calculated by maximum likelihood [20] using R 2.7.0, and assuming that general health is a normally distributed continuous latent variable divided into ordered levels [20]. A correlation close to one indicates that both scales measure the same concept. We used both Chi-squared tests and root mean square errors of approximation (RMSEA) to test the assumption of normality of latent health [21].

  3. Ordered probit regressions [22, 23] were used to assess whether the associations of self-rated health with demographic and health variables differed for the WHO and US versions. The latent continuous variable ‘general health’ is modelled as a linear function of covariates. Coefficients summarise the effect of a one-unit increase in the explanatory variables on the continuous (latent) outcome variable. Country effects were measured by effect coding (effects are measured relative to the grand mean). Cross-equation tests (based on a seemingly unrelated estimation of the two ordered probit equations) were used to assess whether effect sizes differ significantly between the two versions. Analyses were conducted using Stata 9.2.
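To make the two concordance definitions in the first item concrete, the following Python sketch classifies a pair of answers. It rests on two assumptions that are not spelled out in the text: that US ‘poor’ and WHO ‘bad’ count as the same verbal category (they are the fourth overlapping category, identical in the original languages), and that the endpoint combinations are counted as literally concordant. The names `classify` and `concordance_rates` are hypothetical.

```python
US_SCALE = ["excellent", "very good", "good", "fair", "poor"]
WHO_SCALE = ["very good", "good", "fair", "bad", "very bad"]

# (US answer, WHO answer) pairs treated as literally concordant.
LITERAL_PAIRS = {
    ("very good", "very good"),
    ("good", "good"),
    ("fair", "fair"),
    ("poor", "bad"),             # assumption: same word in the original languages
    ("excellent", "very good"),  # the two highest positive ratings
    ("poor", "very bad"),        # the two highest negative ratings
}


def classify(us_answer, who_answer):
    """Return the set of concordance types for one respondent."""
    kinds = set()
    if (us_answer, who_answer) in LITERAL_PAIRS:
        kinds.add("literal")
    # Relative concordance: same position (1st to 5th) on both scales.
    if US_SCALE.index(us_answer) == WHO_SCALE.index(who_answer):
        kinds.add("relative")
    return kinds or {"discordant"}


def concordance_rates(pairs):
    """Share of respondents who are literally concordant, relatively concordant, or discordant."""
    counts = {"literal": 0, "relative": 0, "discordant": 0}
    for us_answer, who_answer in pairs:
        for kind in classify(us_answer, who_answer):
            counts[kind] += 1
    return {kind: count / len(pairs) for kind, count in counts.items()}
```

Because an endpoint pair such as (‘excellent’, ‘very good’) is counted under both definitions, the three shares returned by `concordance_rates` can sum to more than one.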


Differences in distributions

Table 2 shows the distribution of answers to both self-rated health items. The WHO version had a more skewed distribution than the US version. In the WHO version, only 1.6% of participants rated their health as “very bad” (the bottom category), whereas more than 15% selected the top category “very good”. In contrast, about the same proportion of individuals selected the top and bottom categories in the US version.

Table 2 Marginal distributions of self-rated health using the US and WHO versions among men and women aged 50 years and over in five European countries: The SHARE study

Individuals appear to be in better health when confronted with the US version. Whereas 27.3% reported being in very good or excellent health in response to the US version, only 15.5% reported ‘very good’ health (the top category) in response to the WHO version (Table 2). Similarly, whereas about 7% of respondents reported that their health was poor when presented with the US version, about 9.7% reported that their health was bad or very bad when presented with the WHO version. Thus, the same verbal presentations elicited different assessments in the WHO and US versions.

Cross-tabulations in Table 3 show higher levels of literal than relative concordance. For instance, among those who reported good health in response to the WHO version, only 24.9% reported very good health (relative concordance), whereas 65.7% reported good health (literal concordance) when presented with the US version. Only about 10% of these participants reported excellent, fair or poor health, which were discordant ratings. The total percentage of concordant ratings is shown in Table 4. Percentages add up to more than 100% (69.0% + 30.1% + 8.1% = 107.2%), because cases at the scale endpoints can be concordant both relatively and literally; this implies that about 7% of respondents were concordant in both senses. Overall, 69.0% of participants provided literally concordant answers, whereas only 30.1% provided relatively concordant answers. Responses were discordant for only 8.1% of participants.

Table 3 Cross-tabulation of SRH (Self-rated health) between the WHO and US versions (row percentages) among men and women aged 50 years and over in five European countries: The SHARE study
Table 4 Degree of concordance between the WHO and US version of the self-rated health items among men and women aged 50 years and over in five European countries: The SHARE study

Cross-country differences in concordance and discordance rates are statistically significant according to a chi-squared test. This result also holds if all covariates discussed in the next section are held constant. The overall polychoric correlation between the two versions was 0.882 (Table 4). Correlations were highest in Germany, the Netherlands and Greece, and lowest in Spain. Although chi-squared tests reject the assumption of normality of latent health, root mean square errors of approximation (RMSEA) indicate a good to acceptable fit, overall and in each country separately.
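For readers unfamiliar with polychoric correlations, the sketch below shows one way such a coefficient can be estimated from a 5 × 5 cross-tabulation using only the Python standard library. It is not the R routine used in the study: it is a simplified two-step estimator (thresholds taken from the marginal proportions, then the likelihood maximised over the correlation alone), the bivariate normal probabilities are approximated by Simpson's rule, and all function names are hypothetical. Every row and column of the table must contain at least one observation.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution
LIMIT = 8.0         # finite stand-in for plus or minus infinity


def bvn_cdf(a, b, rho, steps=400):
    """P(X <= a, Y <= b) for a standard bivariate normal, integrated by Simpson's rule."""
    lo, hi = -LIMIT, min(a, LIMIT)
    if hi <= lo:
        return 0.0
    s = math.sqrt(1.0 - rho * rho)
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        weight = 1 if k in (0, steps) else (4 if k % 2 else 2)
        # Given X = x, Y is normal with mean rho * x and standard deviation s.
        total += weight * STD.pdf(x) * STD.cdf((b - rho * x) / s)
    return total * h / 3.0


def cell_probs(row_cuts, col_cuts, rho):
    """Cell probabilities implied by the thresholds and the latent correlation rho."""
    a = [-LIMIT] + list(row_cuts) + [LIMIT]
    b = [-LIMIT] + list(col_cuts) + [LIMIT]
    F = [[bvn_cdf(x, y, rho) for y in b] for x in a]
    return [[F[i + 1][j + 1] - F[i][j + 1] - F[i + 1][j] + F[i][j]
             for j in range(len(b) - 1)] for i in range(len(a) - 1)]


def thresholds(margin_counts):
    """Latent cut-points from cumulative marginal proportions (all categories non-empty)."""
    total, running, cuts = sum(margin_counts), 0.0, []
    for count in margin_counts[:-1]:
        running += count
        cuts.append(STD.inv_cdf(running / total))
    return cuts


def polychoric(table):
    """Two-step estimate of the polychoric correlation for a table of counts."""
    row_cuts = thresholds([sum(row) for row in table])
    col_cuts = thresholds([sum(col) for col in zip(*table)])

    def neg_loglik(rho):
        probs = cell_probs(row_cuts, col_cuts, rho)
        return -sum(n * math.log(max(p, 1e-12))
                    for counts, ps in zip(table, probs) for n, p in zip(counts, ps))

    # Golden-section search for the minimum on (-0.99, 0.99).
    lo, hi, g = -0.99, 0.99, (math.sqrt(5) - 1) / 2
    c, d = hi - g * (hi - lo), lo + g * (hi - lo)
    fc, fd = neg_loglik(c), neg_loglik(d)
    while hi - lo > 1e-3:
        if fc < fd:
            hi, d, fd = d, c, fc
            c = hi - g * (hi - lo)
            fc = neg_loglik(c)
        else:
            lo, c, fc = c, d, fd
            d = lo + g * (hi - lo)
            fd = neg_loglik(d)
    return (lo + hi) / 2
```

Calling `polychoric(table)` with the counts underlying Table 3 (WHO categories in rows, US categories in columns) would give an estimate comparable in spirit to the reported 0.882, although a full maximum likelihood fit that estimates thresholds and correlation jointly may differ slightly.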

Differences in associations with covariates

The distribution of covariates varied significantly across countries (Table 5). For instance, Spaniards and Greeks had relatively low levels of education as compared to Germans and Austrians. More than two-thirds reported one or more diagnosed conditions or symptoms, and 20% reported at least one limitation with ADL. The prevalence of chronic diseases, symptoms and limitations was highest in Spain and lowest in the Netherlands. The prevalence of depression was much higher in the Southern countries (Spain and Greece) than in Austria, Germany, or the Netherlands.

Table 5 Description of health covariates (percentages) among men and women aged 50 years and over in five European countries: The SHARE study

Table 6 shows ordered probit regression models, which summarise the effect of a one-unit increase in the explanatory variables on latent general health. With the exception of three country effects, all variables were significantly associated with both versions of self-rated health. Standard errors were marginally smaller for the US version, which reflects its more even distribution as compared to the WHO version. As assessed by cross-equation tests, the associations of self-rated health with most demographic and health variables were statistically indistinguishable between the US and WHO versions (Table 6), with two exceptions. First, the effect of being older than 80 years on self-rated health was significantly larger for the WHO version, which probably reflects its better discriminative power at the negative end of the scale. Second, chronic diseases were more strongly associated with the WHO than with the US version, although this difference was small. Overall, the WHO and US versions were very similarly associated with other variables.

Table 6 Ordered probit regressions (fully adjusted models) of self-rated health for the WHO and US item versions and cross-equation tests (N = 11,622) among men and women aged 50 years and over in five European countries: The SHARE study

Using different versions of self-rated health had little influence on the ranking of countries in terms of their self-rated health. For both versions, self-rated health conditional on covariates was best in Greece and the Netherlands, and worst in Germany (Table 6). The only exceptions were Austria and Spain, whose ranks changed depending on the self-rated health version used. For the other countries, self-rated health rankings were identical for the two items.

Cross-equation tests of parameter differences for the two versions of self-rated health were also computed separately by country (results not shown). In Germany, the Netherlands and Austria, there were no significant differences between the WHO and US versions in their associations with any of the covariates. In Spain, we found differences only for the number of conditions. In Greece, associations with age and education were different between versions, but associations with other variables did not differ.


Although WHO has recommended the WHO version as the standard measurement of self-rated health in the European context [7, 8], our results suggest that this version is not clearly superior to the US version. The WHO version discriminates better at the negative end, but the US version is more symmetric and shows better discrimination at the positive end. Individual answers to both items are not fully consistent, and appear to be more concordant in a literal rather than a relative sense. Despite these discrepancies, fewer than 10% of respondents were discordant in both senses. The US and WHO versions are highly correlated. They show very similar associations with demographic and health indicators, and they show a similar pattern of variation across countries. Overall, although the two measures are not directly comparable, they are in fact different categorisations of latent continuous health.

The strength of this study is the measurement of two self-rated health versions and covariates in several countries. However, some limitations should be considered. Data were only available for individuals aged 50 years and over. As younger individuals are on average healthier, measuring self-rated health in a younger cohort would result in a larger proportion of individuals reporting good health. In younger populations, the US version might be more appropriate because it discriminates better at the positive end. In addition, respondents were presented with both versions of self-rated health along with other health status measurements. The order of presentation (at the beginning or end) may have had an impact on the health ratings [24]. However, we tackled this problem by randomising the order of presentation of both versions, and analyses not shown in this paper indicate that presentation order had little impact on individuals’ levels of self-rated health.

Comparison with previous studies

The predictive power of subjective global health assessments has been shown in numerous studies [1, 2, 5, 25]. To our knowledge, this is the first study to show that the two most commonly used versions of subjective global health are not directly comparable within and across countries, but relate similarly to other covariates. Consistent with findings from single populations [12], we found that different measures of self-rated health are strongly correlated. Our results confirm findings from previous research suggesting that different measures of self-rated health represent parallel assessments of subjective health [12].

Differences between countries in the level of self-rated health and the association of this variable with socioeconomic and health factors have been reported [10, 11, 26–31]. Our results suggest that even if self-rated health is assessed in all countries using a 5-point scale, bias may yet be present due to differences in the wording of response categories. Thus, cross-country comparisons of population health based on different versions of the self-rated health item may lead to spurious health variations across populations. On the other hand, the associations of self-rated health with demographic factors such as socioeconomic status were similar for the two self-rated health item versions. Thus, comparisons of how demographic and other factors relate to self-rated health across surveys using a different 5-point self-rated health scale [2, 10, 28–30, 32] are unlikely to be biased.

Interpretation and implications

Most health and social surveys contain only one version of the self-rated health item. This raises the question of whether it is possible to combine data from different surveys that use different versions of this item. Two-thirds of respondents in our study gave literally concordant answers. Thus, one option would be to collapse the two top categories of the US version and the two bottom categories of the WHO version, resulting in a four-point comparable scale. However, although this would minimise differences, this approach would still result in an overestimation of average health in surveys that use the US version. A second alternative is to achieve comparability of different versions of self-rated health by appropriately rescaling items. For instance, two surveys using different self-rated health measures but similar measures for other variables can be made comparable by imputing conditional probabilities obtained from surveys such as SHARE. Consider again the conditional probabilities shown in Table 3. In order to ‘convert’ the WHO into the US version, a random number, say X, could be drawn for each respondent from a uniform distribution on the zero-to-one interval. A respondent who has answered ‘very good’ to the WHO version would then be coded as being in ‘excellent’ health if X < 0.379 (thus with a 37.9% probability), as being in ‘very good’ health if 0.379 ≤ X < 0.379 + 0.513 = 0.892 (thus with a 51.3% probability), as being in ‘good’ health if 0.892 ≤ X < 0.997, and as being in ‘fair’ health if 0.997 ≤ X ≤ 1.000. A respondent who has answered ‘good’ to the WHO version would be coded as being in ‘excellent’ health if X < 0.036, and so on. This procedure preserves the marginal distribution of the US version. It could also be repeated several times, yielding multiple imputations [33].
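The conversion procedure described above can be sketched in Python as follows. Only the row for WHO ‘very good’ is filled in, because it is the only row spelled out completely in the text; the remaining rows would have to be copied from Table 3. The names `CONDITIONAL`, `convert_who_to_us` and `impute` are hypothetical.

```python
import random
from itertools import accumulate

US_LABELS = ["excellent", "very good", "good", "fair", "poor"]

# P(US answer | WHO answer), in the order of US_LABELS.
# Only the 'very good' row is given in full in the text; other rows
# (for 'good', 'fair', 'bad', 'very bad') must be taken from Table 3.
CONDITIONAL = {
    "very good": [0.379, 0.513, 0.105, 0.003, 0.0],
}


def convert_who_to_us(who_answer, x, conditional=CONDITIONAL):
    """Map a WHO answer to a US answer using a uniform draw x from [0, 1)."""
    probs = conditional[who_answer]
    for label, upper in zip(US_LABELS, accumulate(probs)):
        if x < upper:
            return label
    # x at or beyond the last cut-off (rounding): last category with positive probability.
    return [label for label, p in zip(US_LABELS, probs) if p > 0][-1]


def impute(who_answers, n_imputations=5, seed=0):
    """Repeat the conversion several times, yielding multiple imputed data sets."""
    rng = random.Random(seed)
    return [[convert_who_to_us(answer, rng.random()) for answer in who_answers]
            for _ in range(n_imputations)]
```

For example, `convert_who_to_us("very good", 0.5)` returns ‘very good’, since 0.379 ≤ 0.5 < 0.892. Analyses would then be run on each imputed data set and the results combined, as is usual for multiple imputation [33].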

An important finding of this study is that respondents tend to be more concordant in a literal than in a relative sense. This finding might appear to contradict the view that individuals conceive the scale midpoint as the population average health when judging their own health status, independently of the verbal representation [19]. In fact, since two-thirds of our sample selected the equivalent verbal representation in both items, it would seem that respondents try to be consistent in a literal sense, regardless of the relative position of the answer categories. The main implication is that using a 5-point scale is not enough to ensure comparability, because individuals react differently to various verbal representations when judging their health. As a consequence, comparisons between studies using different verbal answer categories are likely to be biased.

Although levels of self-reported health based on the US and WHO versions are not directly comparable, they are in fact different categorisations of the same latent continuous variable. In particular, both scales have the same properties with respect to demographics and health indicators. Thus, data from surveys using different self-rated health versions could still be used to compare associations of covariates with general health, even though overall health levels cannot be compared. However, this may require the use of appropriate statistical models that interpret self-rated health as different categorisations of an underlying (latent) continuous health variable.
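One way to write down such a model is sketched below. The notation is ours and is an illustrative simplification, not the specification estimated in this study: it assumes a single latent health variable with common coefficients and version-specific thresholds, whereas we estimated a separate ordered probit equation for each version and tested the equality of coefficients.

```latex
\begin{aligned}
H_i^{*} &= x_i^{\top}\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0,1),\\
y_i^{\mathrm{WHO}} &= k \iff \tau^{\mathrm{WHO}}_{k-1} < H_i^{*} \le \tau^{\mathrm{WHO}}_{k},\\
y_i^{\mathrm{US}} &= k \iff \tau^{\mathrm{US}}_{k-1} < H_i^{*} \le \tau^{\mathrm{US}}_{k},
\qquad k = 1,\dots,5,\\
\Pr\bigl(y_i^{v} = k \mid x_i\bigr) &= \Phi\bigl(\tau^{v}_{k} - x_i^{\top}\beta\bigr) - \Phi\bigl(\tau^{v}_{k-1} - x_i^{\top}\beta\bigr),
\qquad v \in \{\mathrm{WHO}, \mathrm{US}\},
\end{aligned}
```

Here $H_i^{*}$ is latent general health, $x_i$ the covariates, $\Phi$ the standard normal distribution function, and $\tau^{v}_{0} = -\infty$, $\tau^{v}_{5} = +\infty$. Under this formulation, differences in wording affect only the thresholds $\tau^{v}_{k}$, so the coefficients $\beta$ remain comparable across surveys even when the distributions of answers are not.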

WHO recommends the use of the WHO version as the standard measurement of self-rated health in European populations. In our data, we found very little support for this recommendation. One of the central arguments of the WHO and related reports is that the WHO version comprises a balanced scale of five categories, two of which are positive (very good, good), one neutral (fair), and two negative (bad, very bad) [7, 8]. In our study, however, this balanced set of categories resulted in a skewed distribution of self-rated health. In terms of statistical efficiency, the US version in fact has some advantages. Responses to the US version are more evenly distributed across the 5-point scale, resulting in smaller standard errors of the estimated ordered probit parameters. The fact that both versions are similarly associated with demographic and health determinants further weakens the case for recommending the WHO version. Thus, in studies of older European populations, there does not seem to be a strong argument for preferring the WHO version. Moreover, the choice of a self-rated health version should be based on several considerations, including the age distribution of the population studied, because in older populations the WHO version tends to show a skewed distribution. These results invite a reassessment of WHO recommendations.