1 Introduction

According to projections by the World Health Organization (WHO), depression will be the leading cause of Disability Adjusted Life Years (DALYs) lost in high income countries in 2030 (Mathers and Loncar 2006). Depression influences the risk of other chronic conditions such as heart failure, heart disease and stroke (Blazer 2003; Larson et al. 2001; Liebetrau et al. 2008; Arbelaez et al. 2007; Barth et al. 2004). Previous studies indicate that depression prevalence varies between countries (Castro-Costa et al. 2007), and within countries (Lorant et al. 2003). Disentangling true variations in depression prevalence from variations attributable to differences in reporting is an essential part of understanding the causes of these differences.

Direct international comparisons of depression prevalence between European countries are scarce, but the existing evidence suggests large differences across Europe. Castro-Costa et al. (2007) used a representative sample of the European population aged 50 or above, taken from the Survey of Health, Ageing and Retirement in Europe (SHARE), to evaluate the prevalence of depressive symptoms in Europe. Using the EURO-D depression scale, which was specifically developed as a standardized measure of depression across European countries (Prince et al. 1999b), their study revealed variation in the prevalence of depressive symptoms across Western European countries. For all symptoms, the prevalence was higher in Mediterranean countries (France, Italy and Spain), lower in Northern European countries (Sweden and Denmark), and of average level in Western European countries (Castro-Costa et al. 2007). Copeland et al. (1999) compared random community samples of older people (65+) using the Geriatric Mental State–AGECAT (GMS-AGECAT) package in eight European cities. Adjusting for gender, they classified centers into two groups: the high depression prevalence group (17.3–23.6%), which comprised London, Berlin, Verona and Munich; and the low prevalence group (8.8–12.0%), which comprised Iceland, Liverpool, Zaragoza, Dublin and Amsterdam (Copeland et al. 1999, 2004; Castro-Costa et al. 2007). Prince et al., using the same dataset and the adjusted EURO-D scale as outcome, found the lowest prevalence of depressive symptoms in the UK and Ireland, followed by Mediterranean countries (France, Spain, Italy), Benelux countries, Nordic countries and finally Germany (Prince et al. 1999a). Zunzunegui et al. performed a cross-national analysis of depressive symptoms among older adults between 75 and 84 years old, and found the highest prevalence in Italy, followed by Israel, Sweden, Leganes (Spain), and the Netherlands (Zunzunegui et al. 2007).

These cross-national studies do not show a clear geographical pattern of differences in prevalence of depression across Europe. Although Castro-Costa et al. and Prince et al. found a North–South divide, other studies show different patterns. A limitation of these studies is their focus on Western European countries only, while little is known about depressive symptoms in Eastern and Central Europe. In addition, studies were based on samples that were neither comparable nor nationally representative, and only a few studies relied on a standardized measure of depression across countries.

Studies on the association between depressive symptoms and socioeconomic status (SES) have shown relatively consistent results, suggesting that lower socioeconomic status is associated with higher depression rates. The results of a meta-analysis of socioeconomic inequalities in depression covering 51 prevalence studies, five incidence studies, and four persistence studies indicate that low-SES individuals have higher odds of: (1) being depressed (odds ratio = 1.81, p < 0.001); (2) developing a new depressive episode (odds ratio = 1.24, p < 0.004); and (3) suffering from persistent depression (odds ratio = 2.06, p < 0.001) (Lorant et al. 2003). Miech and Shanahan (2000) analyzed the association between socioeconomic status and depression over the life-course, using a nationally representative sample of 2,031 adults between the ages of 18 and 90 in the US. Their study suggests that the strength of the association between depression and educational level increases with age. In addition, higher prevalence of physical health problems among lower-educated adults accounted for most of the diverging gap in depression (Miech and Shanahan 2000). Fryers et al. reviewed major European population studies published from 1980 to 2005 on the distribution of common mental disorders, and found a higher prevalence of anxiety and depression in socially disadvantaged groups (Fryers et al. 2005). Overall, although these studies have been conducted in different settings and applied different methodologies, they consistently show an inverse association between socioeconomic status and the prevalence of depression.

Typically, population-wide studies on depression are based on self-reports of symptoms rather than on clinical diagnosis. It is therefore unknown to what extent observed differences reflect true variations in depressive symptoms. Cultural, socioeconomic and demographic factors may influence the reference scale that respondents use when rating the presence and severity of their own depressive symptoms (Bago d’Uva et al. 2008b). Differences in response scales, also known as reporting heterogeneity (King et al. 2004), can result in variations in ratings that are not attributable to true differences in depressive symptoms. If reporting heterogeneity is systematic across countries or socioeconomic groups, estimated differences in prevalence across these groups based on self-reports may be biased.

Over recent years, the anchoring vignette approach has been developed to quantify and correct for reporting heterogeneity in subjective categorical self-assessments. Applied to the domain of depression, anchoring vignettes are concrete descriptions of depressive symptoms of hypothetical individuals, which participants are asked to rate on the same scale that they use to rate their own level of depressive symptoms. As vignettes describe fixed levels of depressive symptoms, variation in respondents' ratings of a given vignette is in theory attributable to heterogeneity in the scale respondents use to assess depressive symptoms. Based on this information, the Hierarchical Ordered Probit (HOPIT) model estimates the magnitude of reporting heterogeneity and uses this to identify differences in depressive symptoms that are not attributable to heterogeneity.

The HOPIT model has recently been used to examine cross-national differences in several outcomes other than depression. Kapteyn et al. (2007) used this approach to analyze work disability differences between the Netherlands and the United States. Their results showed that Dutch respondents are more likely to report work disability than their US counterparts, but about half of this difference is explained by reporting heterogeneity. More detailed descriptions and applications of the HOPIT model are available elsewhere (King et al. 2004; Salomon et al. 2004; Bago d’Uva et al. 2008b; Kristensen and Johansson 2008).

In this paper, we compare prevalence rates of mood, sleeping and concentration problems, which are symptoms included in the EURO-Depression scale. We hypothesize that cross-national and socioeconomic differences in depressive symptoms are attributable to systematic differences in reporting styles. We first examine cross-national differences in the prevalence of self-reported depressive symptoms by country and educational level. In a second step, we use the HOPIT model to assess the extent to which cross-national and socioeconomic variations in depressive symptoms are attributable to reporting heterogeneity.

2 Methodology

2.1 Data

This study is based on data from the Survey of Health, Ageing and Retirement in Europe (SHARE), a longitudinal investigation of the health, social networks, economic situation, and well-being of Europeans aged 50 years and over (Börsch-Supan and Jürges 2005). A drop-off questionnaire of SHARE contains self-assessments and vignette evaluations in several health and work disability dimensions and four other domains of satisfaction and well-being, and was applied to a sub-sample of the main SHARE sample. We refer to these data as the COMPARE sample (www.compare-project.org). A description of the development of the vignettes is given elsewhere (Van Soest 2008). We have used the two waves available in the COMPARE sample, 2004 (4,544 participants) and 2006/2007 (7,186 participants). Three vignettes per item were available in the first wave. In the second wave this was cut back to one vignette per item in order to shorten the questionnaire. The description of these repeated vignettes was identical in both waves. All vignettes are presented in the Appendix. For respondents who participated in both waves (n = 1,991), we only use data from the second wave. Our findings are robust to using data from the first wave for respondents who participated in both waves. We dropped 330 individuals with missing values on at least one variable, resulting in a final sample of 9,409 adults from 11 European countries. Table 1 summarizes the distribution of basic covariates of the sample.

Table 1 Descriptive variables of covariates by country

2.1.1 Depression Measure

In this analysis, we used three items of the EURO-D (Prince et al. 1999b), a standardized scale of depressive symptoms designed to enhance cross-national comparability. The EURO-D consists of 12 items: mood, pessimism, death wish, guilt, sleep, interest, irritability, appetite, fatigue, concentration, enjoyment and tearfulness. The COMPARE dataset includes self-reports and anchoring vignettes in the dimensions of depression, sleep and concentration. Self-reports and anchoring vignettes are rated on the scale: 1 = no problems; 2 = mild problems; 3 = moderate problems; 4 = severe problems; 5 = extreme problems. Since only a small number of respondents reported more than moderate problems (i.e. categories 4 and 5) on each of the depressive items, we collapsed the scale into: 1 = no problems, 2 = mild problems, and 3 = at least moderate problems.
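The recoding described above can be sketched in a few lines of Python. This is only an illustration of the collapsing rule stated in the text; the function name `collapse_rating` and the example ratings are hypothetical and not taken from the COMPARE data.

```python
def collapse_rating(rating: int) -> int:
    """Map a 5-point rating (1-5) onto the 3-category scale.

    Categories 3, 4 and 5 are merged into 3 ("at least moderate problems").
    """
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"rating must be an integer from 1 to 5, got {rating!r}")
    return min(rating, 3)


# Hypothetical ratings, one per original category.
ratings = [1, 2, 3, 4, 5]
collapsed = [collapse_rating(r) for r in ratings]
```

The same rule would apply to self-reports and vignette ratings alike, since both are recorded on the same scale.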

2.1.2 Education

Two main education items were included in SHARE wave 2, namely the highest completed educational level and the number of years of schooling. To enhance comparability of national educational levels, we transformed national levels into the 1997 International Standard Classification of Education (ISCED-97) of the United Nations Educational, Scientific, and Cultural Organization (UNESCO). The distribution of education varies greatly across European countries. For example, if we consider the categorization ISCED 0–2, ISCED 3 and ISCED 4–6, the lowest educational category comprises 17.01% of respondents in Germany, but 80.15% in Spain. We therefore operationalize education based on the number of years of schooling. This variable varies between 0 and 25, with an average of 11 years. Because years of schooling are entered as a continuous variable in all models, we used smooth non-parametric LOESS function curves (Cleveland 1979) to examine deviations from linearity. Results indicate that associations with depressive symptoms are approximately linear, which justifies entering years of schooling linearly, without a quadratic term, in our models.

2.2 Analysis

For each item, we estimate both ordered probit and HOPIT models to analyze cross-national and socioeconomic differences in depressive symptoms. This section gives a short description of both models.

2.2.1 The Basic Model: Ordered Probit

The self-reported ratings of depressive symptoms are ordinal categorical variables. The ordered probit model is the standard model for estimation of the probabilities of ordinal responses. This model assumes an underlying true latent variable \( Y_{ij}^{*} \) with i indicating the individual and j the depressive symptom. The latent variable is constructed as follows:

$$ Y_{ij}^{*} = \beta X_{i} + \varepsilon_{i} ,\quad \varepsilon_{i} \sim N(0,\sigma^{2} ) $$
(1)

where \( X_{i} \) is a vector of covariates and \( \beta \) is a vector of coefficients to be estimated. The latent variable \( Y_{ij}^{*} \) corresponds to the observed ordinal ratings \( y_{ij} \) in the following way:

$$ \begin{aligned} y_{ij} = & \, 1 \, \quad {\text{if}}\,\quad Y_{ij}^{*} < \, \tau^{1} \\ y_{ij} = & \, 2\quad {\text{ if}}\,\quad \tau^{1} \le Y_{ij}^{*} < \, \tau^{2} \\ y_{ij} = & \, 3\quad {\text{ if}}\,\quad Y_{ij}^{*} \ge \, \tau^{2} \\ \end{aligned} $$
(2)

where \( \tau^{1} < \tau^{2} \) are the cut-points on the latent scale between the response categories. These thresholds are equal for all respondents, which corresponds to assuming no reporting heterogeneity. However, if reporting heterogeneity is present, \( \beta \) reflects a mixture of the true association between the covariates and \( y_{ij} \), and systematic differences in reporting \( y_{ij} \) associated with the covariates.
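To make the mapping in Eqs. 1 and 2 concrete, the following minimal Python sketch computes the three response probabilities implied by a given value of \( \beta X_{i} \) and fixed cut-points. The function names and the numerical values are illustrative assumptions, not estimates from this study.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ordered_probit_probs(linear_index: float, tau1: float, tau2: float,
                         sigma: float = 1.0) -> tuple:
    """Return (P(y=1), P(y=2), P(y=3)) for Y* = linear_index + eps, eps ~ N(0, sigma^2).

    linear_index stands for beta * X_i; tau1 < tau2 are cut-points shared by all respondents.
    """
    if not tau1 < tau2:
        raise ValueError("cut-points must satisfy tau1 < tau2")
    p1 = normal_cdf((tau1 - linear_index) / sigma)       # Y* < tau1
    p2 = normal_cdf((tau2 - linear_index) / sigma) - p1  # tau1 <= Y* < tau2
    p3 = 1.0 - p1 - p2                                   # Y* >= tau2
    return p1, p2, p3


# Hypothetical values: latent index 0, cut-points at 0 and 1.
probs = ordered_probit_probs(linear_index=0.0, tau1=0.0, tau2=1.0)
```

Because the cut-points are the same for everyone, any covariate that shifts reporting behaviour can only enter through `linear_index`, which is why \( \beta \) mixes true and reporting effects in this model.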

2.2.2 The Extended Model: HOPIT

The Hierarchical Ordered Probit (HOPIT) model is an extension of the ordered probit model that allows for systematic variation in the cut-points, and thus incorporates adjustment for reporting heterogeneity. In contrast with the ordered probit model, the HOPIT model estimates the cut-points between categories using the ratings of the anchoring vignettes given by respondents. As the level of depressive symptoms described in the vignettes is fixed across respondents, the variation in ratings can be attributed to reporting heterogeneity.

The HOPIT model consists of two main components: the first estimates the cut-points on the latent scale of depressive symptoms, and the second estimates the corrected respondent level of depressive symptoms on the same latent scale. The first component considers an unobserved latent variable \( Y_{ij}^{v*} \) that represents the underlying unobserved level of depressive symptom j represented by the vignettes. Formally, the latent variable \( Y_{ij}^{v*} \) is described by:

$$ Y_{ij}^{v*} = \alpha + \varepsilon_{ij}^{v} ,\quad \varepsilon_{ij}^{v} \sim N(0,1) $$
(3)

where \( \alpha \) is a vector of vignette-specific intercepts (the coefficients on dummy variables identifying the respective vignette). The observed vignette ratings \( y_{ij}^{v} \) result from the mapping of \( Y_{ij}^{v*} \) into three categories, using person-specific cut-points (k = 1, 2):

$$ \begin{aligned} y_{ij}^{v} = & \, 1\quad {\text{if}}\,\quad Y_{ij}^{v*} < \, \tau_{ij}^{1} \\ y_{ij}^{v} = & \, 2\quad {\text{if}}\,\quad \tau_{ij}^{1} \le Y_{ij}^{v*} < \, \tau_{ij}^{2} \\ y_{ij}^{v} = & \, 3\quad {\text{if}}\,\quad Y_{ij}^{v*} \ge \, \tau_{ij}^{2} \\ \end{aligned} $$
(4)

In order to allow for systematic variation in the cut-points, these are modeled as functions of covariates in the following way:

$$ \tau_{ij}^{k} = \gamma^{k} X_{i} $$
(5)

where \( \gamma^{k} \) is a vector of coefficients to be estimated. Note that the covariates are solely included in the estimation of the cut-points (i.e., in Eq. 5 but not in Eq. 3), which corresponds to assuming that there is no systematic variation in the perceived level of depressive symptoms represented by the vignettes (assumption of vignette equivalence, King et al. 2004).
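The vignette component (Eqs. 3 to 5) can be illustrated with a short Python sketch. It follows Eq. 5 literally, so it simply rejects parameter values for which the two cut-points are not ordered. The covariate layout (a constant plus a female dummy), the coefficient values and all function names are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def cut_points(gamma1, gamma2, x):
    """Eq. 5: person-specific cut-points tau^k = gamma^k X_i for k = 1, 2."""
    tau1 = sum(g * v for g, v in zip(gamma1, x))
    tau2 = sum(g * v for g, v in zip(gamma2, x))
    if not tau1 < tau2:
        raise ValueError("cut-points must satisfy tau1 < tau2")
    return tau1, tau2


def vignette_rating_probs(alpha_v: float, tau1: float, tau2: float) -> tuple:
    """Eqs. 3-4: probabilities of rating a vignette 1, 2 or 3 (unit error variance)."""
    p1 = normal_cdf(tau1 - alpha_v)
    p2 = normal_cdf(tau2 - alpha_v) - p1
    return p1, p2, 1.0 - p1 - p2


# Hypothetical covariate vector: [constant, female]; including a constant is an assumption.
gamma1 = [0.0, -0.2]
gamma2 = [1.0, -0.2]
alpha_v = 0.5  # hypothetical severity of one vignette on the latent scale

male_probs = vignette_rating_probs(alpha_v, *cut_points(gamma1, gamma2, [1.0, 0.0]))
female_probs = vignette_rating_probs(alpha_v, *cut_points(gamma1, gamma2, [1.0, 1.0]))
```

In this toy setting the negative coefficient on the female dummy lowers both cut-points, so the same vignette is less likely to be rated as "no problems" by women than by men, which is the kind of reporting difference the vignette ratings are meant to reveal.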

The second component of the HOPIT model describes the individual’s own level of depressive symptoms, imposing the cut-points given by the first component. As in the ordered probit model, the level of depressive symptoms \( Y_{ij}^{s*} \) and the observed categorical responses \( y_{ij}^{s} \) are defined as:

$$ Y_{ij}^{s*} = \beta Z_{i} + \varepsilon_{i}^{s} ,\quad \varepsilon_{i}^{s} \sim N(0,\sigma^{2} ) $$
(6)
$$ \begin{aligned} y_{ij}^{s} = & 1\quad {\text{if}}\,\quad Y_{ij}^{s*} < \, \tau_{ij}^{1} \\ y_{ij}^{s} = & 2\quad {\text{if}}\,\quad \tau_{ij}^{1} \le Y_{ij}^{s*} < \, \tau_{ij}^{2} \\ y_{ij}^{s} = & 3\quad {\text{if}}\,\quad Y_{ij}^{s*} \ge \, \tau_{ij}^{2} \\ \end{aligned} $$
(7)

where \( Z_{i} \) and \( \beta \) are defined analogously to the covariates and coefficients in Sect. 2.2.1, but \( \tau_{ij}^{1} \) and \( \tau_{ij}^{2} \) are allowed to be functions of covariates, defined by Eq. 5. This corresponds to assuming that individuals use the same response scale when rating the vignettes and their own situation (response consistency, King et al. 2004).

Using the HOPIT model, we compute the prevalence of depressive symptoms that would be observed under a counterfactual: that all respondents used an identical response scale. In the computation of the thresholds, all covariates in Eq. 5 will be set to counterfactual reference characteristics.
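The counterfactual calculation can be sketched as follows. The example assumes, for illustration only, that "having problems" means a rating of 2 or 3, so that only the first cut-point matters; the respondent records, cut-point values and function names are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_any_problem(latent_index: float, tau1: float, sigma: float = 1.0) -> float:
    """P(y >= 2): probability of reporting at least mild problems (Eqs. 6-7)."""
    return 1.0 - normal_cdf((tau1 - latent_index) / sigma)


# Two hypothetical respondents with the same latent level (beta * Z_i)
# but different own first cut-points (e.g. living in different countries).
respondents = [
    {"country": "A", "latent_index": 0.3, "own_tau1": 0.0},
    {"country": "B", "latent_index": 0.3, "own_tau1": 0.6},
]

# First cut-point implied by the reference characteristics (hypothetical value).
reference_tau1 = 0.0

# Probabilities using each respondent's own response scale.
own_scale = [prob_any_problem(r["latent_index"], r["own_tau1"]) for r in respondents]

# Counterfactual: everyone is assigned the reference response scale.
counterfactual = [prob_any_problem(r["latent_index"], reference_tau1) for r in respondents]
```

With their own scales the two respondents differ in the probability of reporting problems even though their latent levels are equal; under the common reference scale the difference disappears, so any gap that remains in a real application reflects differences in the latent level rather than in reporting.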

2.3 The Analysis

Our analysis aims to assess the influence of reporting heterogeneity on (a) cross-national differences and (b) socioeconomic differences in the prevalence of depressive symptoms. Models for cross-national differences include sex, age and country dummies, taking Sweden as the reference country. We estimate both the basic and the extended model and calculate predicted probabilities of depressive symptoms for 64-year-old males for each country using both models. Because the sample size is relatively small in each country, male and female samples are pooled to sustain power in the estimations. We chose 64 as the reference age because this is the mean age in the sample. For our second objective, we examine educational disparities in depressive symptoms in Europe by adding years of schooling to the model. We cannot assume similar education effects across Europe, and so we also allow for variation in these through inclusion of interaction terms between European area indicators and years of schooling. In order to increase statistical power, we consider, instead of country-specific education effects, interactions with indicators of the following areas: Nordic countries (i.e. Sweden and Denmark); Central-West countries (i.e. Germany, The Netherlands and Belgium); Mediterranean countries (i.e. France, Italy and Spain); and finally Eastern countries (i.e. Czech Republic and Poland).

3 Results

3.1 Cross-National Differences in Depressive Symptoms

Table 2 shows the estimated coefficients of thresholds between none and mild depressive problems in the HOPIT models for each item. Because the thresholds are measured on a latent scale, the coefficients have no quantitative interpretation. However, their signs and the p values indicate the direction and the significance of effects of the variables, i.e., a negative value indicates a lower reporting threshold with respect to the reference category. Results show that females have lower thresholds for reporting mood and sleep symptoms, indicating that they report symptoms more easily than men. Increasing age is associated with higher thresholds for reporting mood and sleep symptoms, but lower thresholds for reporting concentration problems.

Table 2 Estimation of the thresholds of depressive symptoms with the HOPIT model

Table 2 shows that there are large differences between countries. Compared to Sweden, respondents from the Netherlands, Belgium and Czech Republic use similar reference scales for mood and sleep problems, but they have higher thresholds when reporting concentration problems. When reporting sleeping problems, Danish respondents use a scale similar to the Swedish, but are more lenient when reporting other symptoms. Respondents from Greece, France, Italy and Spain have higher thresholds than the Swedish for reporting all symptoms. Polish respondents have higher thresholds for reporting mood and sleep symptoms, but report concentration symptoms more easily than Swedish respondents. Reporting thresholds for mood symptoms do not differ significantly between Swedish and German respondents, but the latter have higher thresholds for reporting sleeping and concentration problems.

Figure 1 presents the predicted probability of mood problems for a 64-year-old male in each country. The left-hand panel shows the results using the Ordered Probit model, which assumes fixed thresholds for all respondents. The right-hand panel presents the predicted probabilities obtained from the HOPIT model, using the response scale of a 64-year-old Swedish male as reference. Comparison of the outcomes of the basic model with the counterfactual situation shows that countries have similar rankings with regard to the probabilities of mood problems. In other words, the cross-national differences in probabilities remain after the Swedish response scale has been applied to all countries. Figures 2 and 3 show the same results for sleeping and concentration problems. Again, the cross-national differences are not explained by differences in reporting scales. Cross-national differences remain, and the rankings of the countries are similar in both the HOPIT and Probit models. For sleeping problems, the contrast between Italy, France, Greece and Poland vis-à-vis the Nordic countries is somewhat more marked after controlling for reporting heterogeneity. This reflects the fact that respondents from these countries have higher thresholds for reporting sleeping problems than respondents in Nordic European countries, as shown in Table 2. For concentration problems, the probability of symptoms increases in all countries when applying the response scale of a 64-year-old Swedish male, except in Poland, for which the prevalence decreases. Based on the HOPIT model, respondents from Czech Republic have the highest probability of concentration problems.

Fig. 1 Predicted probability of having mood problems for a 64-year-old male

Fig. 2 Predicted probability of having sleeping problems for a 64-year-old male

Fig. 3 Predicted probability of having concentration problems for a 64-year-old male

Table 3 shows the rate ratios comparing the probability of depressive symptoms in Sweden to the probability in other European countries. The rate ratios are computed using probabilities of 64-year-old males. A rate ratio below/above 1 indicates a lower/higher probability of depressive symptoms compared to Sweden. For Central-West European countries (Germany, the Netherlands, Belgium) and Denmark, rate ratios (RRs) of mood problems are similar for both the Probit and HOPIT models. For Mediterranean European countries (France, Italy, Greece and Spain) and Poland, rate ratios from the HOPIT model are somewhat larger than ratios from the Probit model. Overall, cross-national differences in the prevalence of mood problems remain and in some instances increase somewhat after controlling for reporting heterogeneity.
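As a small worked illustration of how such a rate ratio is formed, the Python sketch below divides a country's predicted probability by the Swedish one. The probabilities are made-up numbers for illustration, not values from Table 3, and the function name is hypothetical.

```python
def rate_ratio(p_country: float, p_reference: float) -> float:
    """Ratio of a country's predicted probability to that of the reference country."""
    if not 0.0 < p_reference <= 1.0:
        raise ValueError("reference probability must lie in (0, 1]")
    if not 0.0 <= p_country <= 1.0:
        raise ValueError("country probability must lie in [0, 1]")
    return p_country / p_reference


# Hypothetical predicted probabilities for a 64-year-old male.
p_sweden = 0.20
p_other = 0.30

rr = rate_ratio(p_other, p_sweden)  # above 1: higher probability than in Sweden
```

Computing the ratio once from the Probit probabilities and once from the HOPIT counterfactual probabilities gives the two sets of rate ratios that are compared in the text.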

Table 3 Rate ratios comparing probabilities of depressive symptoms in European countries to those in Sweden

For sleeping problems, almost all rate ratios increase after adjustment for reporting heterogeneity, the exception being the Netherlands. RRs for Denmark and Greece are significantly lower than 1 in the basic model, while in the counterfactual situation the RR is not significant for Denmark, and for Greece it is significantly larger than 1. Rate ratios for concentration problems show a similar pattern to those of sleeping problems, with adjustment for reporting heterogeneity significantly increasing almost all RRs. In addition, the RRs of Germany, the Netherlands and Czech Republic are insignificant in the Probit model but significant in the HOPIT model. These results suggest that differences in sleeping and concentration problems are not explained by reporting heterogeneity.

3.2 Differences in Mental Health Problems by Education

In order to assess whether systematic reporting heterogeneity by level of education is present, we now specify a separate model that includes years of schooling. Table 4 shows the estimated coefficients (and their p values) in thresholds between having no problems and mild problems, for each of the three depressive symptoms. In the case of mood problems, number of years of schooling has a negative effect (significant at the 5% level) in the Central-West and Mediterranean European countries, indicating that higher educated respondents report problems more easily. Educational differences in reporting of sleeping problems are only significant (at the 5% level) for Central-West countries, also with a negative effect. In the estimation of concentration problems, the estimated effect of years of schooling is positive (significant at the 5% level) in Nordic countries, suggesting that more highly educated respondents in these countries apply a higher threshold before reporting problems. Systematic differences in reporting depressive symptoms by age, gender and country remained similar to our earlier findings, indicating that these differences are robust to educational differences in reporting.

Table 4 Estimation of the thresholds of depressive symptoms with the HOPIT model including years of schooling

Rate ratios in Fig. 4 compare the probability of having problems for a 64-year-old man with 3 years of schooling with the probability for a 64-year-old man with 17 years of schooling, i.e. for the bottom 5% of the educational distribution and the top 5%. A rate ratio above 1 indicates a higher probability of problems for respondents with 3 years of schooling than for respondents with 17 years of schooling. Again, we included interaction terms of years of schooling by European area. However, the last bar in the figures corresponds to the educational effect of the pooled European sample, excluding the interaction terms. The rate ratios incorporating reporting heterogeneity are estimated for the counterfactual situation that all respondents use the response scale of a 64-year-old male with 3 years of schooling in their respective country.

Fig. 4 Rate ratios comparing low and high education

Correcting for reporting heterogeneity does not attenuate rate ratios comparing low and high education (Fig. 4). This suggests that different reporting scales do not explain educational differences in depressive symptoms. For all countries pooled, counterfactual rate ratios for mood and sleeping problems are greater than rate ratios from the Probit model. This is expected, since the estimated coefficients of years of schooling on the thresholds are mostly negative, which indicates that more highly educated people are more likely to report problems. Therefore, without a correction for reporting heterogeneity in the analysis, the educational differences are somewhat underestimated. Although the confidence intervals in the separate areas are large, the impact of correcting for reporting heterogeneity is most pronounced in Central-Western and Mediterranean countries. For concentration problems, the rate ratio for the pooled European sample in the counterfactual analysis is somewhat lower than in the Probit model. However, educational differences in concentration problems are still present in the counterfactual analysis, indicating that reporting heterogeneity does not completely explain these differences. The influence of differences in reporting by years of schooling is largest in Central-Western and Mediterranean countries, and as expected from the positive coefficient in Table 4, we find a reversed effect in Nordic countries after controlling for heterogeneity.

4 Discussion

This study investigates the effect of reporting behavior on cross-national and socioeconomic differences in depression symptoms. Our results indicate that differences in the prevalence of depressive symptoms between countries are generally not explained by systematic cross-national differences in reporting thresholds. Similarly, within countries, differences in depressive symptoms by education are not explained by differences in reporting styles. Our findings suggest that variations in depressive symptoms in Europe are not attributable to differences in reporting styles, but might result from variations in the causes of depressive symptoms between countries and education groups.

Our study is the first to examine the role of reporting heterogeneity in explaining depression variations across and within European countries. As a first attempt, it has some limitations that should be considered. The HOPIT model is based upon two main assumptions. First, response consistency must be satisfied, which assumes that respondents use the same response scale for the vignettes as for their own self-assessment of symptoms. Van Soest et al. (2007) assessed this assumption by comparing a subjective measure of drinking problems with an objective measure. Albeit for a different outcome measure than the ones considered here, their results do support the assumption of response consistency. The use of vignettes improved the fit of the model and raised the correlation between the subjective and the objective measures, which is in line with the main purpose of the vignette methodology (Van Soest et al. 2007). Bago d’Uva et al. (2010) also test for response consistency by comparing reporting heterogeneity inferred from conditioning on objective measures with that obtained from vignettes, using English data in the domain of concentration problems. Their results do not reject this assumption with respect to education (Bago d’Uva et al. 2010).

The second assumption in the HOPIT model implies that the level of depressive problems in vignettes is perceived in the same way by all respondents (vignette equivalence), irrespective of their age, sex, income, education, country of residence or other socio-demographic variables (Salomon et al. 2004). This assumption is difficult to test and has therefore been seldom tested. It is nevertheless supported by a high degree of consistency across individuals in the ranking of vignettes in six health domains (including concentration problems) and eight responsiveness domains, suggesting that they are understood similarly across age and education groups (Murray et al. 2003). On the other hand, more recent results of a formal, more demanding test applied to English data in the domain of concentration do not support this assumption with respect to education (Bago d’Uva et al. 2010). Further results of that paper, comparing vignette adjustment with adjustment by means of objective measures, suggest also that the vignette methodology may not be able to revise estimated disparities in concentration problems by education in the correct direction. A subset of similar objective measures is available in the dataset used here and has been exploited by Vonková and Hullegie (2010). They assessed whether the vignette adjustment improves correlation between self-reports of concentration and objective measures. The results are mixed, showing weakening correlation when one of the vignettes is used (the one which was kept between waves one and two) but an improvement when either of the other two is used. Their analysis does not however permit assessment of the quality of the adjustment of education gradients and country rankings.

The subjective nature of the underlying constructs of the items makes vignette equivalence a rather strong assumption. It not only makes it disputable that the interpretation of the level of problems described in the vignettes does not vary systematically across respondents, but also raises the issue of whether there may be varying interpretations of the problem itself. The extent to which these problems may affect the ability of vignettes to correct differential cut-point shift is unclear. In this light, our findings should be interpreted cautiously as a first approximation to the problem of reporting heterogeneity in depressive symptoms, rather than as a final proof of the absence of reporting heterogeneity between countries and educational groups.

On the other hand, although several depressive symptoms, including the items of mood and concentration, are indeed by definition subjective, some of the components of the EURO-D scale do have a clear objective counterpart. For these items vignette equivalence is probably less restrictive. Examples include: troubles with sleep or recent changes in sleep patterns; diminution in appetite or desire for food; changes in eating patterns; fatigue; and events of tearfulness (crying). Of these ‘more objective’ items, we had vignettes on sleep. The vignettes for this item asked respondents to rate hypothetical individuals experiencing presumably objective behaviours such as: waking up two nights a week; taking 2 h every night to fall asleep; and waking up once every hour during the night and taking about 15 min to fall back asleep again. Unlike the mood and concentration items, these descriptions of sleep patterns are not by definition subjective (they do not refer to an individual’s feelings), but instead refer to objective behaviour, which individuals are asked to judge using a given scale. The fact that, for sleep patterns too, correcting for reporting heterogeneity did not diminish the cross-country differences strengthens our conclusion that at least some of the items used to measure depressive symptoms are not strongly influenced by reporting heterogeneity. However, whether this finding holds for the more subjective components of the scale is indeed uncertain.

Previous studies have found that reporting heterogeneity explains cross-national differences in some health-related outcomes. Kapteyn et al. (2007) studied reporting differences in work disability between the US and the Netherlands. They found that more than half of the observed difference in reported work disability originates from the fact that residents of these two countries use different response scales in answering standard questions on whether they have a work disability. Essentially, for the same level of actual work disability, Dutch respondents have lower response thresholds for claiming disability than American respondents (Kapteyn et al. 2007). Our study did not find strong support for the role of reporting heterogeneity in explaining the large differences in depression observed across countries. This suggests that reporting heterogeneity across countries might play an important role for some physical health outcomes but not for mental health outcomes.

Salomon et al. (2004) and Bago d’Uva et al. (2008a, b) analyzed educational disparities in various health outcomes before and after correction for reporting heterogeneity. The outcomes of these studies are somewhat mixed. Bago d’Uva et al. (2008b) rejected reporting homogeneity across educational groups in India, Indonesia and China, where correcting for reporting heterogeneity somewhat reduced disparities in health by education. However, using European data, Bago d’Uva et al. (2008a) found that higher educated older Europeans are more likely to rate a given health state negatively, so that without correction educational differences are underestimated or even go undetected. These results are similar to our own, where educational differences in health do not diminish but tend to increase after adjustment for reporting heterogeneity.

The COMPARE sample includes vignettes and self-reports on only three of the twelve items of the EURO-D scale. The lack of information on the other items does not allow us to draw conclusions concerning the influence of reporting heterogeneity on the EURO-D depression scale as a whole. In addition, the fact that reporting heterogeneity differed somewhat across items and across countries makes it even harder to synthesize a single effect that can be generalized to the EURO-D measure. Thus, although we find no strong evidence of reporting heterogeneity for the three items assessed, it is still possible that heterogeneity exists in other items not assessed in our study. Whether this may lead to reporting differences in the overall EURO-D scale needs to be further examined.

The SHARE sample excludes the institutionalized population and therefore includes only adults living in the community. Observed cross-national variations in depression in our study may therefore be attributable to differential sample selection by country. In particular, institutionalization rates are generally higher in Northern and Central-Western European countries than in the Eastern and Southern parts of Europe. Depressive symptoms are likely to be associated with the risk of institutionalization, which may contribute to the low rates of depressive symptoms in Central-Western and Northern countries. To examine the extent of this bias, we performed a sensitivity analysis restricting the sample to ages 50–64, ages at which institutionalization rates are relatively low. We found the same pattern as for the entire sample, suggesting that institutionalization differences do not account for cross-national and educational variations in depressive symptoms.

When analysing educational differences across countries we compared the 5% highest and 5% lowest educational groups. We performed sensitivity analyses to assess whether our results are robust to this definition. We calculated the Relative Index of Inequality (RII) for categories of education based upon the International Standard Classification of Education (ISCED). On the basis of the RII we estimated rate ratios based on the total effect of the educational ranking on the three items for each European area as well as for all countries pooled together. These analyses confirm our original findings: we find differences in the prevalence of depressive symptoms by education, which remain largely unchanged after controlling for reporting heterogeneity using the vignette approach.
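As an illustration of the general idea behind the RII, the sketch below assigns each ordered educational category a rank score (the midpoint of its range in the cumulative population distribution) and regresses the log prevalence on that score; the exponentiated slope approximates the rate ratio between the bottom and the top of the educational hierarchy. This is a simplified group-level approximation under stated assumptions, not the model used in this study: the function names, the weighted least-squares fit and the group sizes and prevalences in the usage note are all hypothetical.

```python
import math

def ridit_scores(group_sizes):
    """Rank score per group: midpoint of its range in the cumulative
    distribution (0 to 1). Groups must be ordered from highest to
    lowest education."""
    total = sum(group_sizes)
    scores, cumulative = [], 0.0
    for n in group_sizes:
        scores.append((cumulative + n / 2) / total)
        cumulative += n
    return scores

def relative_index_of_inequality(group_sizes, prevalences):
    """Approximate RII as exp(slope) of a size-weighted least-squares
    regression of log prevalence on the rank score (an assumption made
    for this sketch; regression on individual data is more usual)."""
    x = ridit_scores(group_sizes)
    y = [math.log(p) for p in prevalences]
    w = group_sizes
    sw = sum(w)
    x_mean = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_mean = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_mean) * (yi - y_mean) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_mean) ** 2 for wi, xi in zip(w, x))
    return math.exp(sxy / sxx)
```

For example, with hypothetical group sizes `[50, 30, 20]` (highest to lowest education) the rank scores are 0.25, 0.65 and 0.9. An RII above 1 indicates a higher prevalence of the symptom towards the lower end of the educational ranking.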

Our results provide some support for the hypothesis that variations in depressive symptoms by country and education are the result of variations in the risk factors for depression, rather than an artifact of reporting heterogeneity. An important part of the differences may be explained by physical health problems. It is known that physical health is worse in lower educational groups (Kunst et al. 2005), which may underlie their higher prevalence of depressive symptoms, as several studies have found (Geerlings et al. 2000; Lenze et al. 2001; Braam et al. 2005; Koster et al. 2006). In our study, we see that in countries where overall physical health is known to be worse, such as Spain and Italy (Jürges 2005), the prevalence of depressive symptoms is also higher. In addition, differences in the effectiveness of depression treatment across European countries and socioeconomic levels may influence the duration of depression, leading to a high point prevalence of depression.

Economic factors may also be important in explaining differences in the prevalence of depressive symptoms. Lorant et al. (2007) showed that a lower material standard of living is associated with increased depressive symptoms and caseness of major depression. The higher prevalence of depression in Poland, the Czech Republic and some of the Southern Mediterranean countries may be partly explained by higher rates of unemployment, less favorable employment conditions, and more economic hardship (Lyberaki and Tinios 2008; Siegrist and Wahrendorf 2008). Future studies should examine whether these economic and social factors explain the large variations in depressive symptoms observed in Europe.

In summary, using the vignette approach, we find no strong evidence that cross-national and educational differences in depressive symptoms across Europe are attributable to reporting heterogeneity. However, whether the HOPIT model assumptions hold for subjective outcomes such as depressive symptoms, and whether the cross-country differences in depression are attributable to variations in risk factors such as social and economic conditions, behaviour or physical health, are relevant questions that deserve further investigation.