1 Introduction

Demographic ageing will be a major determinant of long-run economic development in Europe. The extent of the demographic changes, which combine increasing life expectancy with low birth rates, is dramatic and will deeply affect future labour and financial markets. The expected strain on public budgets, and especially on social security, has already received prominent attention, and both the political debate and the scientific literature aim at finding sustainable reforms to resolve the financial imbalance of public pensions (Economic Policy Committee 2000; Gruber and Wise 2004; Bonoli and Shinkawa 2005). The design of employment policies targeted at persons with disabilities plays an important role in pursuing the financial equilibrium of modern welfare states. Since the normal retirement age is gradually increasing throughout Europe, disability schemes may constitute an alternative path for early exit from the labour force. Indeed, enrolment in disability programs often entails the permanent receipt of benefits and may act as a bridge towards eligibility for standard retirement schemes.

As Burkhauser and Daly (2002) emphasize, for most people the onset of a disability does not automatically imply the inability to carry out a job, and work-based policies can significantly increase employment. However, the design of efficient reforms in this area requires a better understanding of the determinants of work limiting health problems among older workers. To this end, researchers have started to analyse data from general surveys, where individuals are asked to self-report their work disability status, together with their health and socio-economic characteristics. While physical measures of health are often related to specific features of a given health domain and provide information on health rather than work capacity, the advantage of using this self-reported measure is that it summarizes in a single index a variety of factors determining the work disability status of individuals. Indeed, as pointed out by Haveman and Wolfe (2000), a motor disability may turn out to be less limiting in labour market activities than much milder health limitations, such as back pain, combined with low education levels. At the same time, self-assessments of work disability are highly subjective: if different individuals have different beliefs about the concept of disability, their self-reports will not be comparable. This heterogeneity in reporting styles is called differential item functioning (DIF). Several papers (Kapteyn et al. 2007; King et al. 2004; King and Wand 2007) show how vignettes can be used to control for DIF and to assess whether differences in disability rates across countries and socio-economic groups are genuine or merely reflect differences in response scales. For example, Kapteyn et al. (2007) show that allowing for heterogeneity in reporting styles halves the raw differential in disability rates between the Netherlands and the US.

In this paper we study the determinants of work disability reporting using data from the 2004 wave of SHARE and anchoring vignettes on eight European countries, collected under the COMPARE project. SHARE is the first European multi-country dataset containing data on vignettes and allows us to investigate whether the differences in reporting styles found by Kapteyn et al. (2007) between the US and the Netherlands are present even when we compare relatively similar countries in continental Europe. Börsch-Supan (2005) documents that continental Europe is characterized by large cross-country heterogeneity in disability insurance enrolment rates, which cannot be explained by differences in health status or in the demographic composition of the population. Our analysis intends to shed light on these differentials in work-disability rates across European countries by combining a wide set of individual socioeconomic and health status characteristics with macroeconomic indicators describing the institutional background of the country of residence.

The plan of the paper is as follows. Section 2 describes the data used in the analysis and reports some descriptive statistics. Section 3 presents the Hopit model. Section 4 presents and discusses the main estimation results as well as a counterfactual simulation. Section 5 concludes the paper and draws some comments on the basis of our findings.

2 Data and Descriptive Statistics

Data are drawn from the first wave of the Survey of Health, Ageing and Retirement in Europe (SHARE) on individuals aged 50 or over in eight European countries (Sweden, Germany, The Netherlands, Belgium, France, Spain, Greece and Italy). SHARE collects detailed information about a number of aspects that characterize the socioeconomic condition of the elderly, such as physical and mental health, employment, wealth and social support (see Börsch-Supan et al. 2005, 2008). As part of the COMPARE project, after the personal interview (CAPI, Computer Assisted Personal Interview), a subset of respondents is asked to fill in an additional paper-and-pencil questionnaire, which contains questions on their own disability status and on that of hypothetical persons described in particular situations and conditions (the anchoring vignettes). The vignettes cover three different domains of work disability, namely pain, affect and heart disease, and were collected in both the first and the second wave of the survey. However, while the 2004 questionnaire included three vignettes for each domain of work disability, in 2006 this number was reduced to just one. In this paper we use data from the first wave in order to exploit the information gain resulting from a wider set of vignettes when correcting for heterogeneity in reporting styles.

In the survey, respondents are first asked to report the severity of any work limiting health problems they might have; the exact wording of the question is: "Do you have any impairment or health problem that limits the amount or kind of work you can do?". Then, they are asked to evaluate the extent to which they think that the people described in the nine vignettes are limited in the kind or amount of work they can do.

For each domain of work disability, three vignettes are presented. Examples are:

  • Pain: “[Kevin] suffers from back pain that causes stiffness in his back especially at work but is relieved with low doses of medication. He does not have any pains other than this generalized discomfort.”

  • Affect: “[Anthony] generally enjoys his work. He gets depressed every 3 weeks for a day or two and loses interest in what he usually enjoys but is able to carry on with his day-to-day activities on the job.”

  • Heart disease: “[Eve] has had heart problems in the past and she has been told to watch her cholesterol level. Sometimes if she feels stressed at work she feels pain in her chest and occasionally in her arms.”

How much is [Kevin/Anthony/Eve] limited in the kind or amount of work [he/she] could do?

For both the self-evaluation and the vignettes, the possible answers are "none", "mild", "moderate", "severe" and "extreme".

For identification, we keep only those respondents who give a valid answer to both the self-evaluation question and at least one vignette. We also select respondents up to 64 years old, so our final sample includes 2,464 individuals. Table 1 shows that our respondents are predominantly female (about 56%), tend to live with a partner (83%) and are on average 56.4 years old. More than 50% of them are either overweight or obese, and only 7.5% report limitations with instrumental activities of daily living.

Table 1 Description of the variables included in the regressions

Figure 1 shows that there is large cross-country variability in self-reported work disability rates, where we define a person as disabled if she reports being moderately, severely or extremely limited in the kind or amount of work she can do. In particular, while the percentage of respondents who report having work limiting health problems is over 25% in Sweden, Germany and Spain, it is around 12% in the Netherlands and 6% in Greece. These cross-country differences can be only partly explained by institutional factors such as enrolment and eligibility rules, as suggested by Börsch-Supan (2008). The variation is very large, especially given that we are considering relatively similar countries in continental Europe, and it raises some doubts about the inter-cultural comparability of self-reported measures of work disability: is the variation observed in the data genuine or does it just reflect differences in reporting styles? Prima facie evidence that reporting heterogeneity might play a role is provided by Fig. 2. The figure shows the evaluation that respondents in different countries give of the hypothetical persons described in three of the vignettes (Kevin, Anthony and Eve). It is interesting to note that Sweden and Spain, where we observe the highest work disability rates, are also the countries with the highest proportions of respondents evaluating the persons described in the vignettes as work disabled. In all countries "Eve" is considered the one with the most severe work limiting health problems, while "Anthony" is rated as the least work disabled (only in Sweden is there no statistically significant difference between how Kevin and Anthony are evaluated).

Fig. 1 Self-reported work disability rates by country

Fig. 2 Proportions of respondents who rate Anthony, Kevin and Eve as work disabled

3 The Hopit Model

We use the econometric specification introduced by King et al. (2004) in its parametric version, the so-called hierarchical ordered probit model (Hopit). The model consists of two components: a self-assessment equation and a vignette equation.

3.1 Self-Assessment Component

Let us denote with \(Y_{i}^{\ast}\) the level of work disability perceived by individual i = 1, …, n:

$$ \begin{aligned} Y_{i}^{\ast} &=X_{i}\beta +\varepsilon _{i};\\ \varepsilon _{i}|X_{i} &\sim N(0,1), \end{aligned} \tag{1} $$

where \(X_{i}\) is a vector of observable individual characteristics and \(\varepsilon_{i}\) is a normally distributed error term encompassing unobserved factors relevant for the determination of work disability levels. In the data we do not observe the latent variable \(Y_{i}^{\ast}\) but an ordered categorical variable \(Y_{i}\), which takes values 1 (“none”), 2 (“mild”), 3 (“moderate”), 4 (“severe”) and 5 (“extreme”) through the mechanism:

$$ Y_{i}=j \quad \hbox{if}\,\tau _{i}^{j-1}<Y_{i}^{\ast}\leq \tau _{i}^{j},\quad j=1,\ldots,5. \tag{2} $$

The main difference between the standard ordered probit and the Hopit model is that the thresholds \(\tau_{i}^{j}\) are individual-specific and reflect individual characteristics:

$$ \begin{aligned} \tau_{i}^{0}&=-\infty; \quad \tau _{i}^{5}=\infty;\\ \tau_{i}^{1}&=X_{i}\gamma^{1}; \end{aligned} \tag{3} $$
$$ \tau_{i}^{j}=\tau _{i}^{j-1}+\exp (X_{i}\gamma ^{j}), \quad j=2,3,4. \tag{4} $$
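As a rough illustration of the threshold construction above and of the mapping from the latent variable to the reported category, the following Python sketch may help. The function names (`thresholds`, `category`) and the numerical values are hypothetical and chosen only for illustration; they are not taken from the estimation code used in the paper.

```python
import math


def thresholds(x, gammas):
    """Return [tau^1, tau^2, tau^3, tau^4] for one respondent.

    x      : list of covariates for the respondent (hypothetical layout)
    gammas : list of four coefficient vectors gamma^1, ..., gamma^4
    """
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    taus = [dot(x, gammas[0])]                       # tau^1 = X * gamma^1
    for g in gammas[1:]:                             # tau^j = tau^{j-1} + exp(X * gamma^j)
        taus.append(taus[-1] + math.exp(dot(x, g)))
    return taus


def category(latent, taus):
    """Map a latent value to the observed category j in 1..5,
    where j satisfies tau^{j-1} < latent <= tau^j."""
    return 1 + sum(latent > t for t in taus)


# Illustrative values only: a constant plus one covariate.
example_taus = thresholds([1.0, 0.0], [[0.5, 1.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
example_rating = category(1.0, example_taus)
```

Because each increment is an exponential, the thresholds are strictly increasing for any parameter values, which is the reason for this parameterization.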

3.2 Vignette Component

To separately identify the parameters in β and γ, we need to use the additional information provided by vignette evaluations. As mentioned above, each respondent in our sample is asked to answer nine vignette questions describing hypothetical persons with different levels of work disability. Let us denote with \(Z_{il}^{\ast }, l=1,\ldots,9\) the underlying work disability of the person described in the vignette l as perceived by respondent i. We assume that:

$$ \begin{aligned} Z_{il}^{\ast} &=\theta _{l}+\nu _{il};\\ \nu _{il} &\sim N(0,\sigma _{v}^{2}), \end{aligned} \tag{5} $$

where \(\nu_{il}\) is a normally distributed error term independent of \(\varepsilon_{i}.\) The assumption of vignette equivalence implies that the situation described in the vignettes is, on average, perceived by respondents in the same way and formally restricts \(\theta_{l}\) not to vary over i. As a result, the variability in the vignette evaluations provided by respondents is only due to the error term \(\nu_{il}\) in Eq. (5) and to the heterogeneous thresholds defined in Eqs. (3) and (4). As before, \(Z_{il}^{\ast}\) is unobservable and we observe instead vignette ratings \(Z_{il}\) on a scale that goes from 1 (“none”) to 5 (“extreme”). \(Z_{il}^{\ast}\) is related to \(Z_{il}\) as follows:

$$ Z_{il}=j \quad \hbox{if}\,\tau_{i}^{j-1}<Z_{il}^{\ast}\leq \tau _{i}^{j},\quad j=1,\ldots,5. \tag{6} $$

The two components of the model are connected via the utilization of the same set of thresholds in (2) and in (6). By restricting the thresholds to be the same, we assume that the same reporting styles are used both for self-assessments and vignette evaluations. This hypothesis is commonly referred to as the response consistency assumption.

There is a growing literature discussing the validity of the response consistency assumption (see Van Soest et al. 2007; Bago d’Uva et al. 2009). Van Soest et al. (2007) analyze drinking behaviour and extend the standard Hopit model by relaxing the response consistency assumption. They formally allow for the fact that the reporting styles used by individuals to evaluate themselves may be different from those adopted to evaluate vignettes. Parameter identification relies on the availability of an objective indicator of the domain of interest, which is combined with self-assessments and vignette evaluations. An objective indicator is suitable in this framework when it is unaffected by reporting heterogeneity and is driven by the same underlying latent process that determines self-assessments (one-factor assumption). To this end, Van Soest et al. (2007) take advantage of the information on the number of drinks consumed by the respondents and find that vignette-based corrections appear quite effective in bringing objective and subjective measures closer together. In the context of drinking behaviour, which is clearly unidimensional, assuming that the proposed indicator is objective and that the one-factor assumption holds is plausible. However, work disability is a multi-dimensional concept influenced by a combination of demographic, health and labour demand characteristics: thus, it is hard to think of an objective indicator that satisfies the one-factor assumption and makes it possible to test response consistency formally in our sample. The objective indicator used by Datta Gupta et al. (2010), namely grip strength, is clearly an unsatisfactory proxy.

Following King et al. (2004), the parameters of the self-assessment and the vignette components are jointly estimated by conditional maximum likelihood. The estimation is carried out using the gllamm (Generalized Linear Latent and Mixed Models) program for Stata (Rabe-Hesketh et al. 2004; Rabe-Hesketh and Skrondal 2008).
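To make the structure of the joint likelihood more concrete, the sketch below computes the log-likelihood contribution of a single respondent implied by the self-assessment and vignette components. It is only an illustration in plain Python, not the gllamm routine used for estimation. All function and argument names are hypothetical, and the sketch assumes, in addition to what is stated in the text, that the vignette errors are independent across vignettes for a given respondent.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def thresholds(x, gammas):
    """tau^1 = X*gamma^1 and tau^j = tau^{j-1} + exp(X*gamma^j), j = 2, 3, 4."""
    taus = [dot(x, gammas[0])]
    for g in gammas[1:]:
        taus.append(taus[-1] + math.exp(dot(x, g)))
    return taus


def cat_prob(j, mean, sd, taus):
    """P(tau^{j-1} < mean + sd * e <= tau^j) with e ~ N(0, 1), for j in 1..5."""
    cuts = [-math.inf] + list(taus) + [math.inf]
    upper = norm_cdf((cuts[j] - mean) / sd)
    lower = norm_cdf((cuts[j - 1] - mean) / sd)
    return upper - lower


def loglik_i(y, z, x, beta, gammas, theta, sigma_v):
    """Log-likelihood contribution of one respondent.

    y       : self-assessment, an integer in 1..5
    z       : list of vignette ratings (1..5), with None for missing ratings
    theta   : list of vignette-specific means theta_l
    sigma_v : standard deviation of the vignette error
    """
    taus = thresholds(x, gammas)          # same thresholds in both components
    ll = math.log(cat_prob(y, dot(x, beta), 1.0, taus))
    for l, rating in enumerate(z):
        if rating is not None:            # skip vignettes left unanswered
            ll += math.log(cat_prob(rating, theta[l], sigma_v, taus))
    return ll
```

Summing `loglik_i` over respondents gives the objective that a numerical optimizer would maximize over the parameters. The shared call to `thresholds` is where the response consistency assumption enters.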

3.3 Counterfactuals

The main objective of the vignette approach is to estimate the DIF of each respondent and correct for it. Once model estimates are available, corrections for DIF are straightforward. The researcher can define a benchmark (for instance, a country) according to the explanatory variables included in the model and then compute adjusted distributions of the observed variable for all respondents, using the benchmark scale instead of respondents’ own scales. The benchmark scale is calculated by predicting the thresholds under the assumption that all respondents belong to the benchmark group, regardless of their actual condition. Let us suppose that a dummy variable identifies the benchmark group of interest. The benchmark scale is defined by setting this dummy to 1 for all respondents when predicting the thresholds. As for the other variables, we consider their actual values in the sample. The adjusted distributions of the self-evaluation reports are now defined according to a common scale and can be compared meaningfully, since differences across groups are only due to genuine heterogeneity in the outcome of interest (in our case, work disability) and no longer to heterogeneity in reporting styles.

These exercises are the so-called “counterfactuals” and they may answer key questions like: how many people in Country X would report being work disabled if respondents of Country X used the response scales of respondents of Country Y?

In this case, if the threshold equations include a dummy variable for Country Y, we estimate the benchmark scale by predicting the thresholds under the assumption that all respondents live in Country Y, so that the corresponding dummy is always equal to 1.
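The counterfactual exercise described above can be sketched as follows. This is a hedged illustration rather than the procedure actually coded for the paper: the function `counterfactual_rate`, the column layout of the covariate vector and the choice to average predicted probabilities over respondents are all assumptions made here. In particular, the sketch sets the remaining country dummies to 0 when imposing the benchmark, which the text does not state explicitly. A respondent is counted as work disabled when the reported category is "moderate" or higher, that is, when the latent variable exceeds the second threshold.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def thresholds(x, gammas):
    """tau^1 = X*gamma^1 and tau^j = tau^{j-1} + exp(X*gamma^j), j = 2, 3, 4."""
    taus = [dot(x, gammas[0])]
    for g in gammas[1:]:
        taus.append(taus[-1] + math.exp(dot(x, g)))
    return taus


def counterfactual_rate(sample, beta, gammas, country_cols, benchmark_col):
    """Average predicted probability of reporting category 3 or higher
    when every respondent uses the benchmark country's thresholds.

    sample        : list of covariate vectors, one per respondent
    country_cols  : positions of the country dummies in each vector (assumed layout)
    benchmark_col : position of the benchmark country's dummy, or None if the
                    benchmark is the omitted reference country
    """
    total = 0.0
    for x in sample:
        x_bench = list(x)
        for k in country_cols:            # assumption: switch off all country dummies
            x_bench[k] = 0.0
        if benchmark_col is not None:     # then switch on the benchmark dummy
            x_bench[benchmark_col] = 1.0
        taus = thresholds(x_bench, gammas)
        xb = dot(x, beta)                 # latent index keeps the actual covariates
        total += 1.0 - norm_cdf(taus[1] - xb)   # P(Y* > tau^2)
    return total / len(sample)
```

Calling the function once per benchmark country on the subsample of a given country yields rates that differ only because of the imposed response scale, since the latent index is computed from each respondent's actual characteristics.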

4 Results

In the empirical analysis we control for a large set of explanatory variables, which include demographics (gender and a quadratic polynomial in age), education, employment, marital status, mental and physical health status (grip strength, body mass index, presence of arthritis, mobility, limitations with activities of daily living and instrumental activities of daily living, etc.) and cognitive abilities, as well as country dummies (see Table 1 for more detailed information on each of these variables). Note that for health we only include objective indicators, since subjective questions on health might also be affected by the reporting bias.

The results of our estimation are reported in Table 2. The first column shows the estimates obtained using a baseline model that does not allow for any threshold variation across respondents, the second column reports the estimates for the self-assessment component of the Hopit model, while the results for the threshold equations of the Hopit model are presented in columns 3–6.

Table 2 Hopit model, determinants of work disability—2004 wave, country dummies, 9 vignettes. The first column refers to a baseline Hopit specification not allowing for threshold variation across individuals

The baseline specification in column 1 is nested in the more general Hopit model, since it sets all the parameters in the threshold equations except the constant to 0. This restricted model is almost identical to a standard ordered probit model in that it does not take into account potential differences in reporting styles. A formal likelihood-ratio test strongly rejects the restricted model, which does not allow for response scale variation, against the more general model that corrects for the DIF bias. Therefore, in what follows we focus only on the results obtained with the Hopit model.

The estimates in column 2 show that the major determinants of work disability are health conditions. Apart from being currently a smoker, all physical and mental health variables are strongly significant and show the expected signs: healthier individuals are less likely to be work-disabled. The results also highlight a significant gender differential: men are more likely to have work limiting health conditions than women, while the other socio-demographic variables do not seem to play a major role in explaining the variability in perceived work-disability across individuals. It is interesting to note that, even after controlling for socio-economic characteristics and health indicators, there is still significant cross-country heterogeneity in work-disability rates. In particular, respondents living in Mediterranean countries are those associated with the lowest perceived levels of work-disability.

The estimates in columns 3–6 of Table 2 show that the thresholds significantly depend on a number of variables, such as country dummies, age, education, employment and several health conditions. In order to show the importance of controlling for heterogeneity in reporting styles, we carry out a counterfactual exercise as described in Sect. 3.3. Figure 3 reports the percentage of individuals who would rate themselves as work disabled if they used, respectively, the reporting styles of the Spanish and of the Dutch respondents. We focus on Spain and The Netherlands because Fig. 2 suggests that their respondents adopt different response scales when answering vignette questions. Indeed, Spanish respondents are more likely to consider the hypothetical persons described in the vignettes as work-disabled. In addition, if the institutional background affects the concept of work-disability underlying self-reports, it is worth comparing these two countries because they are associated with different architectures of work-disability insurance, as suggested by the evidence in Börsch-Supan (2005). Figure 3 confirms once again that response scales do play a role. When the Dutch thresholds are imposed, the resulting work-disability rates in all countries are systematically lower than those obtained using the Spanish thresholds. As an example, the work-disability rate in France is about 6% according to the Dutch reporting style but it increases to 11% when the Spanish reporting style is adopted. Given the same perceived level of work disability \(Y^{\ast},\) Spanish respondents are more likely to declare the presence of work limiting health problems than their Dutch counterparts.

Fig. 3 Counterfactual: Spanish and Dutch thresholds

4.1 The Role of Institutions

In line with the results of Table 2, Fig. 3 reveals that, even when we correct for the DIF bias, we still find large cross-country variation in work-disability rates. To understand to what extent this heterogeneity is due to institutional differences, we re-estimate the same specifications as before, replacing the country dummies with three macro-economic indicators released at the country level by Eurostat for 2004: the Harmonised Index of Consumer Prices, the employment rate for the population in the age group 55–64 and public expenditure on the disability function as a percentage of national GDP. The first index is the indicator of inflation and price stability used by the European Central Bank: it reflects cross-country variability in the price of goods and services, including those associated with the presence of disability, such as health care. The employment rate for individuals aged 55–64 is intended to summarize the characteristics of the labour market policies targeted at the elderly as well as the incentives to remain at work provided by Social Security in terms of eligibility requirements for pension benefits and replacement rates. Finally, public expenditure on the disability function measures the generosity of the welfare state in maintaining the income of work-disabled persons. Our results are reported in Table 3. Overall, the parameters on socioeconomic characteristics and objective health indicators are unaffected by replacing the country dummies with the macro-economic indicators. The DIF bias matters even in this case, and a formal likelihood-ratio test still shows that the hypothesis of common thresholds across individuals is not supported by the data. Focusing on the relationship between work-disability and macro-economic indicators, we find that the employment rate for individuals aged 55–64 and the generosity of the disability function are two relevant predictors of the perceived level of work-disability. Indeed, respondents in countries with higher employment rates are less likely to be work-disabled. This result might suggest that introducing active policies that stimulate the labour market attachment of the elderly results in a lower propensity to opt out of the labour force via enrolment in disability programs, which is often a pathway to retirement. By contrast, a more generous disability function is associated with higher work disability rates. Our results support the hypothesis that the cross-country differentials found in Table 2 and Fig. 3 are related to macro-economic indicators and that countries providing older individuals with higher chances of employment and less generous disability schemes are those where individuals are less likely to report work limiting health problems.

Table 3 Hopit model, determinants of work disability—2004 wave, macro-indicators, 9 vignettes. The first column refers to a baseline Hopit specification not allowing for threshold variation across individuals

Further, the estimates in Table 3 show that employment rates and the generosity of disability schemes affect individual reporting styles for work-disability. One possible explanation for this result is that more generous disability insurance schemes can induce individuals to classify themselves as work-disabled even in the presence of mild limitations. In Fig. 4 we show the percentage of respondents in each country who would rate themselves as work disabled if they used the reporting styles of the countries with the least and the most generous disability benefits, respectively Greece and Sweden. Interestingly, we find that more generous disability benefits would induce respondents of all countries to consider the same health problems as more severe, ceteris paribus.

Fig. 4 Counterfactual: the effect of the generosity of disability benefits

5 Conclusion

The proportion of elderly individuals in European societies is steadily rising, and this demographic trend makes it necessary to design a welfare state able to foster their social and economic inclusion. A thorough understanding of the composition of the disabled population therefore helps the development of policies aimed at (1) reducing and preventing health impairments that limit working capacity, (2) supporting the rehabilitation of disabled persons and, finally, (3) stimulating the participation of this population group in the economic activities of their countries. However, measuring work disability in practice is a complicated task. Physical measures of health can be expensive and are often related to specific features of a given health domain. Moreover, they provide information on health rather than working capacity. Self-reported measures of work disability are, therefore, the common solution adopted by researchers. In fact, several surveys ask respondents to self-assess their work-disability status according to a predetermined multi-item scale. The relationship between this indicator of work-disability and the explanatory factors of interest is then studied via standard ordered probit techniques. However, subjective judgements might be biased.

Whenever we aim at comparing self-assessments across countries, we should take into account the fact that individuals in different countries can interpret or understand the same question in different ways, because they use different scales to evaluate themselves. These inter-personal and inter-cultural differences in interpreting, understanding or using response categories for the same question are called differential item functioning (DIF).

The vignette methodology developed by King et al. (2004) is a generalization of the common ordered probit model, where DIF is modelled through variations in the thresholds. The thresholds determine the individual response scales and are allowed to vary with individual characteristics and across countries.

Using vignette data from the 2004 wave of the Survey of Health, Ageing and Retirement in Europe (SHARE), the empirical analysis in this paper focuses on individuals aged 50–64 in eight European countries and looks at the determinants of work-disability differentials across countries. Indeed, our data show that in seemingly similar European countries there are wide raw differences in the proportions of individuals who declare themselves work-disabled. We first estimate cross-country heterogeneity in work-disability by conditioning on a wide set of socioeconomic and health indicators collected at the individual level. Our findings show that cross-country differentials in work-disability are still sizeable even after conditioning on our set of control factors. Moreover, a formal likelihood-ratio test confirms that DIF is an issue in our sample and the data do not support the hypothesis that reporting styles (i.e. the thresholds) are invariant across individuals. In particular, reporting styles are affected by the country of residence, age, education, employment and health status.

To better understand the role of institutions in explaining the cross-country heterogeneity in work-disability rates, we analyze how these rates are influenced by a set of macroeconomic indicators released by Eurostat for each country in our sample: the Harmonised Index of Consumer Prices, the employment rate for individuals aged 55–64 and public expenditure on the disability function as a percentage of national GDP. We find that respondents living in countries with higher employment rates and less generous disability schemes are less likely to suffer work limitations. This evidence seems to suggest that policies stimulating the labour market attachment of the elderly and tightening the requirements for disability benefits might actually reduce the financial burden on disability insurance schemes that threatens their long-run sustainability.

Finally, we find that institutions affect work-disability reporting by modifying individuals’ reporting styles. A counterfactual simulation shows that individuals living in countries with more generous disability schemes are more likely to declare themselves work-disabled. This result informs the literature on work-disability by showing that evaluating the impact of policies only on the basis of self-evaluations might lead to misleading results: institutional changes can affect not only the true work-disability status of the elderly but also the response scales they use in providing self-evaluations.