Empirical Economics, Volume 31, Issue 2, pp 333–367

Semiparametric estimation of conditional mean functions with missing data

Combining parametric moments with matching
Original Paper

Abstract

A new semiparametric estimator for estimating conditional expectation functions from incomplete data is proposed, which integrates parametric regression with nonparametric matching estimators. Besides its applicability to missing data situations due to non-response or attrition, the estimator can also be used for analyzing treatment effect heterogeneity and statistical treatment rules, where data on potential outcomes is missing by definition. By combining moments from a parametric specification with nonparametric estimates of mean outcomes in the non-responding population within a GMM framework, the estimator seeks to balance a good fit in the responding population with low bias in the non-responding population. The estimator is applied to analyzing treatment effect heterogeneity among Swedish rehabilitation programmes.

Introduction

In many empirical applications interest lies in estimating a conditional mean function E[Y∣X], yet often the outcome variable Y is only observed for a part of the sample. In this paper, a new semiparametric estimator for dealing with such situations is proposed. This estimator combines parametric regression with nonparametric matching estimators (Heckman et al. 1997a, 1998a) to reduce the bias of the estimated conditional mean function in the subpopulation where Y is unobserved.

Consider two motivating examples for the applicability of this estimator. Missing data is one example. In much applied work with survey data, item non-response and panel attrition are frequent. Data may be missing on Y for some individuals, while data on X is still available for them. For example, with panel data, X may refer to the response in the baseline survey, whereas the observability of Y in follow-up periods depends on attrition.

Counterfactual outcomes, treatment effects and treatment choice are a second example where the situation analyzed in this paper applies. Consider a situation where each member of a population chooses one of two options: An unemployed person may or may not take part in an active labour market programme, a physician may choose between two different therapies for a patient etc.1 For analyzing the effects of treatment it is necessary to contrast the expected outcome if choosing the first option with the outcome if choosing the other option, given some covariates X. Since every individual can be observed only in one of the two states, half of the potential outcomes data is missing by definition. For the treated, their counterfactual outcome in the case of non-treatment is unobserved, whereas for the non-treated their counterfactual treatment outcome cannot be observed. Estimates of the expected counterfactual mean functions are needed to analyze heterogeneity in the effects. Such estimates are also a basic input for statistical treatment rules, where assignment to treatment is based on predictions of the expected treatment effects for each individual. For example, expected unemployment duration with and without active labour market programmes may be used (by the case worker) to allocate unemployed persons to such programmes. Statistical treatment rules may thus assist in a more precise targeting of policies.2

The proposed semiparametric estimator combines the information on Y and X for the respondents with the information on X for the non-respondents in a generalized method of moments (GMM) framework. The estimator is applicable in situations where data is missing at random conditional on X (Little and Rubin 1987) or where, conditional on X, treatment assignment is ignorable (selection is on observables).3 Validity of this assumption often requires a high dimensional X vector, making purely nonparametric estimation unreliable in finite samples.

Under this assumption, the conditional mean function E[Y∣X] is identified from the data of the respondents. The non-responding observations are of no value for identification. In a parametric or semiparametric framework, however, the information contained in the X observations of the non-respondents may help in obtaining more precise estimates. The reason is that a parametric estimator which completely neglects the non-responding observations may inadvertently fit a regression plane that is heavily biased in the non-responding population. A parametric estimator using only the responding observations seeks to minimize MSE in the responding population. This may, however, not be the best fit with respect to the entire population, if the density of X differs between respondents and non-respondents (which usually is the case in a treatment evaluation context, where participants and non-participants are often rather dissimilar in their characteristics).

The basic idea of the semiparametric estimator is to use nonparametric estimates of the mean counterfactual outcomes for the non-responding population to measure the average bias of the parametric regression plane in the non-responding population. These mean counterfactual outcomes can be estimated nonparametrically at rate \({\sqrt n }\) by (propensity score) matching estimators (Hahn 1998; Heckman et al. 1998b). As these estimates do not depend on the specification of the parametric regression plane, they can be used to quantify the bias of the parametric model in the non-responding population for any value of the coefficient vector. The semiparametric estimator attempts to choose the regression plane such that it fits well in the responding population and has low bias in the non-responding population.

The asymptotic properties of this estimator are investigated in Section 2, while Section 3 analyzes its finite sample properties in a Monte Carlo simulation. In Section 4, the estimator is applied to analyzing treatment effect heterogeneity and treatment choice among Swedish rehabilitation programmes for long-term sick. The treatment effects for participation in workplace, educational and medical rehabilitation on employment are estimated on an individual level, and a statistical treatment selection rule based on these estimates is illustrated. Section 5 concludes. Appendices A and B contain further results. A supplementary Appendix with proofs and additional material is available on the internet: http://www.siaw.unisg.ch/froelich.

Semiparametric estimation of conditional mean functions

Interest lies in estimating a conditional mean function E[Y∣X] from a sample of iid observations with missing data: \(\{X_i, D_i, Y_i D_i\}_{i=1}^{n}\), where \(X_i \in \Re ^K\) is a vector of covariates and Di∈{0,1} is a missing value indicator. The outcome variable of interest \(Y_{i} \in {\Re }^{V} \) is only observed when Di=1. When Di=0, it is unobserved. The covariates Xi are always observed.4

Data on Y may be missing due to non-response or attrition. Alternatively, data may be missing by definition. A particular example for this is treatment evaluation. An individual may choose between different treatment options, and according to her choice an outcome is observed. The outcomes she would have realized had she chosen differently cannot be observed, though. Suppose treatment is binary: an individual takes part in a particular programme or does not. Denote by \(Y_i^0, Y_i^1\) her potential outcomes. \(Y_i^1\) is the outcome that would be realized if she participated in treatment, whereas \(Y_i^0\) would be realized otherwise. By definition, the outcome \(Y_i^1\) is missing for non-participants, whereas \(Y_i^0\) is missing for participants. Hence, half of the potential outcomes are missing. Evaluating the effect of treatment, however, requires estimates of the missing counterfactual outcomes.

A maintained assumption is that, conditional on X, data are missing at random (Little and Rubin 1987):
$$ E{\left[ {Y\left| X \right.,D = 1} \right]} = E{\left[ {Y\left| X \right.,D = 0} \right]}\quad {\text{w}}.{\text{p}}.1 $$
(1)
With this assumption, the conditional mean function is nonparametrically identified from the D=1 observations. The non-responding observations (D=0) are of no value. Plausibility of assumption 1 often requires a high-dimensional X vector to control for all confounding covariates; for treatment evaluation, for example, X must contain all variables that affected the selection into treatment as well as the potential outcomes. With a high-dimensional X vector, however, nonparametric regression of Y on X could be very imprecise in small samples due to the curse of dimensionality.
Parametric regression, on the other hand, could lead to heavily biased estimates, particularly in the non-responding population. Let φ(x, θ) with coefficient vector θ be a parametric specification of the expectation function. If the specification is correct, a true coefficient vector θ0 exists, such that
$$ E{\left[ {Y\left| X \right.} \right]} = \varphi {\left( {X,\theta _{0} } \right)}\quad {\text{w}}.{\text{p}}.1. $$
(2)
Otherwise, no true coefficient vector exists and an estimator based on squared loss seeks to choose θ to minimize the sum of variance and squared bias. A parametric estimator which uses only the D=1 observations (and neglects the D=0 observations) attempts to obtain a good fit of the regression plane in the responding population. This regression plane, however, may not fit well in the non-responding population, if the distributions of X in these two populations differ, e.g. if the treated and the non-treated are rather dissimilar in their characteristics.

The idea of the semiparametric approach is to estimate the average bias of the parametric regression plane in the non-responding population and to use this estimate to choose θ such that the regression plane fits well in the responding population and, at the same time, is on average (almost) unbiased in the non-responding population. This could proceed iteratively or, as suggested in this paper, simultaneously in a single estimator where goodness-of-fit in the responding population and average bias in the non-responding population are traded off against each other in a method of moments framework.

The average bias of the regression plane in the non-responding population is E[φ(X,θ)∣D=0]−E[Y∣D=0] for a given value of θ. Estimating this average bias thus requires an estimate of the mean outcome in the non-responding population E[Y∣D=0] that is consistent even when the parametric model is misspecified. Such an estimate can be obtained by nonparametric matching estimators. By the missing at random assumption 1, the mean outcome in the non-responding population is nonparametrically identified as
$$ {\mathop E\limits_{S_{x} } }{\left[ {Y\left| D \right. = 0} \right]} = {\mathop E\limits_{S_{x} } }{\left[ {E{\left[ {Y\left| X \right.,D = 1} \right]}\left| D \right. = 0} \right]}. $$
(3)

The definition of the mean outcome is restricted to the support of X in the responding population Sx={x: f_{X∣D=1}(x)>0}, because E[Y∣X,D=1] is not identified outside the support. In principle, as shown in Hahn (1998) and Heckman et al. (1998b), the mean outcome (Eq. 3) can be estimated at \({\sqrt n }\)-rate by a matching estimator which is based on nonparametric estimates of E[Y∣X,D=1] obtained from the D=1 sample.

In practice, however, propensity score matching is much more widespread than matching with respect to X, because of the difficulties with nonparametric estimation of E[Y∣X,D=1] when X is high-dimensional. Propensity score matching rests on nonparametric estimates of E[Y∣p(X)=ρ, D=1], where p(x)=P(D=1∣X=x) is the one-dimensional propensity score. The justification for propensity score matching is based on the results of Rosenbaum and Rubin (1983), who showed that the mean outcome in the non-responding population is also identified as
$$ {\mathop E\limits_{S_{x} } }{\left[ {Y\left| D \right. = 0} \right]} = {\mathop E\limits_{S_{x} } }{\left[ {E{\left[ {Y\left| {p{\left( X \right)}} \right.,D = 1} \right]}\left| D \right. = 0} \right]}. $$
Although, from an asymptotic perspective, propensity score matching provides no advantages vis-à-vis matching on X (Hahn 1998), it is much more convenient in practice since it requires only one-dimensional nonparametric regression, and it is very widely applied.5 A propensity score matching estimate of the mean outcome is
$$ {\mathop {\widehat{E}}\limits_{S_{x} } }{\left[ {Y\left| D \right. = 0} \right]} = \frac{{\Sigma \widehat{m}{\left( {\widehat{p}_{i} } \right)} \cdot {\left( {1 - D_{i} } \right)} \cdot 1{\left( {\widehat{p}_{i} > 0} \right)}}} {{\Sigma {\left( {1 - D_{i} } \right)} \cdot 1{\left( {\widehat{p}_{i} > 0} \right)}}}, $$
(4)
where \(\widehat{p}_{i} = \widehat{p}{\left( {X_{i} } \right)}\) is an estimate of the propensity score and \(\widehat{m}\) is a nonparametric estimator of m(ρ)=E[Y∣p(X)=ρ, D=1] for ρ>0.6 Heckman et al. (1998b) showed that the propensity score matching estimator is \({\sqrt n }\)-consistent and asymptotically normal under certain conditions on the estimators of p and m. Hence, the propensity score need not be known; it can be estimated parametrically or nonparametrically.
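For concreteness, the following minimal Python sketch implements the matching estimate of Eq. 4; it assumes that the propensity score estimates p_hat are already available (e.g. from a probit or logit fit), and the Gaussian kernel and the bandwidth value are illustrative choices rather than part of the estimator's definition.

```python
import numpy as np

def nw_on_pscore(p_eval, p_d1, y_d1, bandwidth):
    """Nadaraya-Watson regression of Y on the propensity score, using only the
    D = 1 observations, evaluated at the points p_eval (Gaussian kernel)."""
    u = (p_eval[:, None] - p_d1[None, :]) / bandwidth
    w = np.exp(-0.5 * u ** 2)
    return (w @ y_d1) / w.sum(axis=1)

def matched_mean_outcome(y, d, p_hat, bandwidth=0.05):
    """Matching estimate of E[Y | D = 0] as in Eq. 4: average the fitted values
    m_hat(p_hat_i) over the non-respondents on the common support."""
    keep0 = (d == 0) & (p_hat > 0)            # support condition 1(p_hat > 0)
    m_hat = nw_on_pscore(p_hat[keep0], p_hat[d == 1], y[d == 1], bandwidth)
    return m_hat.mean()
```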
With this nonparametric estimate of E[YD=0], the average bias of the regression plane φ(x,θ) in the non-responding population can be estimated as
$$\frac{{\Sigma {\left\{ {\varphi {\left( {X_{i} ,\theta } \right)} - \widehat{m}{\left( {\widehat{p}_{i} } \right)}} \right\}} \cdot {\left( {1 - D_{i} } \right)} \cdot 1{\left( {\widehat{p}_{i} > 0} \right)}}}{{\Sigma {\left( {1 - D_{i} } \right)} \cdot 1{\left( {\widehat{p}_{i} > 0} \right)}}}.$$
(5)
Since the denominator of Eq. 5 does not depend on θ, it suffices to consider only the numerator
$${\sum\limits_i {{\left\{ {\varphi {\left( {X_{i} ,\theta } \right)} - \widehat{m}{\left( {\widehat{p}_{i} } \right)}} \right\}} \cdot {\left( {1 - D_{i} } \right)}1{\left( {\widehat{p}_{i} > 0} \right)}} }$$
(6)
for quantifying the average bias for different values of θ.7 The semiparametric estimator attempts to tilt the parametric regression plane so as to keep the bias in the non-responding population small, while at the same time obtaining a good fit in the responding population.
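A sketch of the average bias in Eq. 5, assuming the matched values m_hat_vals = m̂(p̂_i) have already been computed for each observation (e.g. with the routine above) and that phi evaluates the parametric specification:

```python
import numpy as np

def average_bias(theta, X, d, p_hat, m_hat_vals, phi):
    """Estimated average bias of the regression plane among the
    non-respondents (Eq. 5), restricted to the common support."""
    keep0 = (d == 0) & (p_hat > 0)
    return np.mean(phi(X[keep0], theta) - m_hat_vals[keep0])
```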
This idea can be refined by estimating the average bias not only in the entire non-responding population but also in subpopulations thereof. Let Λ(x) be an L×1 vector-valued indicator function that defines L different subpopulations. For example,
$$\Lambda \left( x \right) = \left( {\begin{array}{*{20}c} 1 \\ {1\left( {x_{{\rm gender}} = {\rm male}} \right)} \\ {1\left( {x_{{\rm age}} > 40} \right)} \\ \end{array} } \right),$$
(7)
would define three separate subpopulations: all, men, and age above 40 years. In analogy to Eq. 6, the (numerators of the) average biases in these different subpopulations are
$$\sum\limits_i {\left\{ {\Lambda \left( {X_i } \right) \otimes \varphi \left( {X_i ,\theta } \right) - \widehat m_{VL} \left( {\widehat p_i } \right)} \right\} \cdot \left( {1 - D_i } \right)1\left( {\widehat p_i > 0} \right),} $$
(8)
which is a VL dimensional vector, where V is the dimension of Y and L is the number of subpopulations. ⊗ is the Kronecker product operator. \(\widehat{m}_{{VL}} {\left( \cdot \right)}\) is the VL×1 column vector of all stacked nonparametric estimates of E[Y∣p] for all subpopulations, multiplied by the population indicator function.8

In principle, the semiparametric approach could proceed iteratively. First, the parametric model is estimated to obtain values of θ. With these \( \widehat{\theta } \), the average biases are estimated and these bias estimates are then used to obtain new estimates of θ. A more convenient approach can be obtained by integrating both aims (goodness-of-fit in the responding population and low bias in the non-responding population) in a single estimator based on moment conditions. One set of moment conditions is given by the average biases (Eq. 8), which have expectation zero in the case of correct parametric specification.

To achieve not only a low bias in the non-responding population but also a good fit in the responding population, a second set of moment conditions is needed to reflect this aim. Under correct specification, the parametric model 2 implies E[A(X)·(Y−φ(X, θ0))]=0 with A(X) an instrument matrix. That is, the weighted distance between the observed outcomes and the regression plane is zero at θ0 for any A(·). Since Y can be observed only for the D=1 observations, the corresponding empirical moment function is
$${\sum\limits_i {A{\left( {X_{i} } \right)}{\left( {Y_{i} - \varphi {\left( {X_{i} ,\theta } \right)}} \right)}D_{i} ,} }$$
(9)
which has expectation zero for θ0. A fully parametric estimator of the regression model 2 chooses θ such that the empirical moment function 9 is zero. This corresponds to a just-identified parametric GMM estimator with an instrument matrix of dimension K×V, where K is the dimension of θ.
The proposed semiparametric estimator attempts to set both sets of moment functions to zero in order to obtain a good fit in the responding population and to minimize bias in the non-responding population. It seeks to choose θ such that the combined moment vector
$$g_n \left( {\theta ,\widehat m_{VL}, \widehat {p}} \right) = \frac{1}{n}\sum\limits_i {\left( {\begin{array}{*{20}c} {A\left( {X_i } \right) \cdot \left( {Y_i - \varphi \left( {X_i ,\theta } \right)} \right) \cdot D_i } \\ {\Lambda \left( {X_i } \right) \otimes \varphi \left( {X_i ,\theta } \right) \cdot \left( {1 - D_i } \right)1\left( {\widehat p_i > 0} \right)} \\ \end{array} } \right)} - \left( {\begin{array}{*{20}c} {0_K } \\ {\widehat \mu } \\ \end{array} } \right) $$
(10)
is close to zero, where \(\widehat{\mu }\) is the nonparametric part:
$$ \widehat{\mu } = \frac{1} {n}{\sum\limits_i {\widehat{m}_{{VL}} {\left( {\widehat{p}_{i} } \right)} \cdot {\left( {1 - D_{i} } \right)}1{\left( {\widehat{p}_{i} > 0} \right)}.} } $$
The moment vector gn is of length K+VL. The first K moments are evaluated for the observations with Di=1, since Y can be observed only for the respondents. The second set of moments measures the bias of the regression plane in the L non-responding subpopulations. Since the number of moments exceeds the number of coefficients by VL, generally it will not be possible to set gn exactly to zero. The GMM estimator therefore seeks to minimize a quadratic form and estimates θ as
$$\widehat{\theta }_{n} = \arg {\mathop {\min }\limits_\theta }\;g^{\prime }_{n} Wg_{n} ,$$
(11)
where W is a positive semidefinite weighting matrix and preliminary estimates of m and of the propensity score p are plugged in. The semiparametric estimator thus seeks a balance between goodness-of-fit in the responding population and small bias in the non-responding population. The particular choice of W determines the respective weights given to these two objectives. If the weight matrix W contains non-zero elements only in the upper K×K sub-matrix, the semiparametric estimator is identical to the parametric estimator. Both estimators are obviously also identical if L=0. Hence the parametric model is contained in the semiparametric framework.

Estimation of θ is straightforward. First, the propensity score is estimated, e.g. by probit or logit. Second, \(\widehat{\mu }\) is estimated by propensity score matching, separately for the different subpopulations. In principle, any propensity score matching routine can be used. With \(\widehat{\mu }\) estimated, the moment function (Eq. 10) depends only on θ, and the quadratic form of the average moment function (Eq. 11) can be minimized, for any choice of W. For example, W might be chosen as a diagonal matrix which gives half of the weights to the first K moments and the other half to the second VL moments.9
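The following schematic Python sketch puts these steps together for a scalar outcome (V=1), reusing the nw_on_pscore helper from the sketch above. The linear specification, the choice of instruments A(X) and the numerical optimizer are illustrative assumptions, and m̂ is estimated here within each subpopulation, which is one possible reading of the matching step; none of this is prescribed by the estimator itself.

```python
import numpy as np
from scipy.optimize import minimize

def subpop_mu_hat(y, d, p_hat, Lam, bandwidth=0.05):
    """mu_hat: matched mean outcomes, estimated separately in each of the
    L subpopulations defined by the indicator matrix Lam (n x L)."""
    n = len(y)
    mu = np.zeros(Lam.shape[1])
    for l in range(Lam.shape[1]):
        keep0 = (d == 0) & (p_hat > 0) & (Lam[:, l] > 0)
        keep1 = (d == 1) & (Lam[:, l] > 0)
        m_hat = nw_on_pscore(p_hat[keep0], p_hat[keep1], y[keep1], bandwidth)
        mu[l] = m_hat.sum() / n          # (1/n) sum over the l-th subpopulation
    return mu

def gmm_objective(theta, X, y, d, p_hat, A, Lam, mu_hat, W, phi):
    """Quadratic form g_n' W g_n of the combined moment vector (Eq. 10), V = 1."""
    n = len(y)
    g_fit = A.T @ ((y - phi(X, theta)) * d) / n                  # K parametric moments
    keep0 = (d == 0) & (p_hat > 0)
    g_bias = Lam[keep0].T @ phi(X[keep0], theta) / n - mu_hat    # L bias moments
    g = np.concatenate([g_fit, g_bias])
    return g @ W @ g

# usage sketch with a linear specification and A(X) = (1, X):
# phi = lambda X, theta: np.column_stack([np.ones(len(X)), X]) @ theta
# theta_hat = minimize(gmm_objective, theta_init,
#                      args=(X, y, d, p_hat, A, Lam, mu_hat, W, phi)).x
```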

Under certain conditions on the propensity score matching estimator and with a correct specification of the regression model 2, the semiparametric GMM estimator \(\widehat{\theta }_{n} \) is consistent and \({\sqrt n }\)-asymptotically normal with approximate variance
$$\frac{1}{n}{\left( {G\prime WG} \right)}^{{ - 1}} G\prime WE{\left[ {JJ\prime } \right]}WG{\left( {G\prime WG} \right)}^{{ - 1}} ,$$
(12)
where \( G = E{\left[ {\frac{{\partial g_{n} {\left( {\theta _{0} ,\widehat{m}_{VL} ,\widehat{p}} \right)}}}{{\partial \theta '}}} \right]} \) is the expected gradient, and
$$ J = g{\left( {Y,D,X,\theta _{0} ,m_{VL} } \right)} - {\left( {\begin{array}{*{20}c} {0_{K} } \\ {\frac{1}{{\lambda _{1} }}E{\left[ {\Psi _{11,m} {\left( {Y,D,X;X_{2} } \right)}{\left( {1 - D_{2} } \right)}\left| {Y,D,X} \right.} \right]}} \\ \vdots \\ {\frac{1}{{\lambda _{L} }}E{\left[ {\Psi _{VL,m} {\left( {Y,D,X;X_{2} } \right)}{\left( {1 - D_{2} } \right)}\left| {Y,D,X} \right.} \right]}} \\ \end{array} } \right)} - {\left( {\begin{array}{*{20}c} {0_{K} } \\ {E{\left[ {\Psi _{11,p} {\left( {Y,D,X;X_{2} } \right)}{\left( {1 - D_{2} } \right)}\left| {Y,D,X} \right.} \right]}} \\ \vdots \\ {E{\left[ {\Psi _{VL,p} {\left( {Y,D,X;X_{2} } \right)}{\left( {1 - D_{2} } \right)}\left| {Y,D,X} \right.} \right]}} \\ \end{array} } \right)}, $$
where the expectation operator is with respect to X2 and D2, and \(\lambda _{l} = {\mathop {\lim }\limits_{n \to \infty } }\frac{{n_{{l,1}} }}{n}\), with nl,1 the number of D=1 observations belonging to subpopulation l. The influence functions Ψvl,p and Ψvl,m take account of the variance due to the preliminary estimators \(\widehat{p}\) and \(\widehat{m}\), respectively. Proofs and expressions for Ψvl,p,Ψvl,m are given in the supplementary Appendix.

One condition for this result is that the preliminary estimators \(\widehat{p}\) and \(\widehat{m}\) are asymptotically linear with trimming. Parametric and nonparametric local polynomial regression estimators belong to this class as shown in Heckman et al. (1998b), provided certain regularity conditions are met. Hence, for the propensity score estimated by a probit or logit and m estimated by Nadaraya–Watson kernel or local linear regression, \(\widehat{\theta }_{n} \) is asymptotically normally distributed in correctly specified models. For nearest neighbour regression, on the other hand, this does not seem to hold.10

The choice of W determines the weights given to the two objectives of the estimator: goodness-of-fit in the responding population and low bias in the non-responding population. It thereby also affects the properties of the estimator. With a correct parametric specification, the efficient weighting matrix would be the inverse of the covariance matrix of the moment vector [EJJ′]−1 (Hansen 1982). This efficient GMM estimator can be obtained by a two step procedure. First, an arbitrary initial weighting matrix W is chosen to obtain the first step estimates of θ. With these estimates, \( {\left[ {\widehat{E}JJ^{\prime } } \right]}^{{ - 1}} \) is estimated and then used as the weighting matrix in the second step. If the parametric model is misspecified, on the other hand, the second step GMM estimator is not necessarily superior to the first step GMM estimator, since the ‘efficient’ weighting by [EJJ′]−1 takes only the variance but not the bias of the parametric specification into account. This leads to a weighting matrix which assigns most of the weight to the K parametric moments and little to the nonparametric moments, because the variance of the nonparametric estimates is much higher than that of the parametric moments. However, the uncertainty that stems from not knowing the true form of the conditional expectation function is not incorporated in these weights. Hence, such considerations of robustness to misspecification are neglected in the weighting matrix [EJJ′]−1.
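As a sketch, the second-step weighting could be computed as below, assuming an array J_rows stacking one estimated influence vector per observation (including the matching correction terms of J) is available; estimating those influence vectors themselves is not shown here.

```python
import numpy as np

def efficient_weight(J_rows):
    """Second-step GMM weight: inverse of the estimated moment covariance,
    i.e. an estimate of [E JJ']^{-1}, from an n x (K+VL) array of influence vectors."""
    n = J_rows.shape[0]
    return np.linalg.inv(J_rows.T @ J_rows / n)

# the second step then re-minimizes the quadratic form with W = efficient_weight(J_rows)
```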

Apart from the purpose of estimation, the GMM estimator can also be used as a specification test. Using the J-test of overidentifying restrictions of Hansen (1982), the correctness of the parametric model can be tested. The statistic \( n \cdot g^{\prime }_{n} \widehat{\Omega }g_{n} \), with \(\widehat{\Omega }\) a consistent estimate of [EJJ′]−1, is asymptotically χ2 distributed with degrees of freedom equal to the number of overidentifying restrictions VL:
$$n \cdot g^{\prime }_{n} \widehat{\Omega }g_{n} {\mathop \to \limits^d }\chi ^{2}_{{{\left( {VL} \right)}}} ,$$
(13)
under the null hypothesis of correct specification.
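Given the average moment vector g_n evaluated at the estimate and a consistent estimate omega_hat of [EJJ′]−1 (both assumed to be available from the GMM step), the test could be computed as in this sketch:

```python
import numpy as np
from scipy.stats import chi2

def j_test(g_n, omega_hat, n, v, l):
    """Hansen's J-test of overidentifying restrictions (Eq. 13)."""
    stat = n * g_n @ omega_hat @ g_n
    return stat, chi2.sf(stat, df=v * l)    # p-value from the chi2(VL) distribution
```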

In this section, a semiparametric estimator for estimating a parametric regression plane with lower bias in the non-responding population has been proposed, and several properties of this estimator have been derived. In correctly specified models, and with particular propensity score matching estimators, the GMM estimator is \({\sqrt n } - {\rm\text{consistent}}\) and asymptotically normal. The GMM objective function is asymptotically χ2 distributed and can be used for testing the correctness of the parametric model. On the other hand, if the model is misspecified, the GMM estimator attempts to choose a regression plane with low bias among the non-respondents while maintaining a good fit among the respondents. To examine the behaviour of this estimator and of the specification test in finite samples, a Monte Carlo simulation is conducted in the next section.

Monte Carlo simulation

In a small Monte Carlo experiment the finite sample properties of the semiparametric estimator of the conditional mean function E[YX] are assessed. The simulations should give some indications on the performance of the semiparametric estimator in comparison to parametric estimation under correct and under incorrect specification. In addition, the sensitivity to the number of subpopulations L and their size, to the choice of the estimator \(\widehat{m}\) and to the weighting matrix W is examined. Finally, the properties of the J-test are analyzed, which, however, turn out to be rather unsatisfactory.

The mean squared errors of the parametric, the first step and the second step GMM estimators are simulated for different simulation designs. The outcome variable Y is one-dimensional. Hence, V=1 and the number of overidentifying moments is equal to L. The parametric estimator is equivalent to the GMM estimator with L=0 subpopulations. The first and second step GMM estimators are computed for different numbers of subpopulations L to examine their sensitivity to the number of overidentifying moments. The weighting matrix W for the first step GMM estimator is diagonal, with the first K entries being 1/K and the remaining entries being 1/L. Hence, equal weight is given to the parametric and to the nonparametric moments as a whole. The second step GMM estimator uses the inverse covariance matrix \( {\left[ {\widehat{E}JJ\prime } \right]}^{{ - 1}} \) as weighting matrix, which is evaluated at the first step coefficient estimates using the asymptotic expression given in the previous section.
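The first-step weighting matrix described above can be written as a small helper; it is a direct transcription of the 1/K and 1/L weights.

```python
import numpy as np

def first_step_weight(K, L):
    """Diagonal weighting matrix: weight 1/K on each of the K parametric moments
    and 1/L on each of the L nonparametric moments, so each block of moments
    receives half of the total weight."""
    return np.diag(np.concatenate([np.full(K, 1.0 / K), np.full(L, 1.0 / L)]))
```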

The Monte Carlo simulations proceed by repeatedly drawing estimation and validation samples from the same population, estimating the coefficients θ from the estimation sample and computing the mean squared error (MSE) in the validation sample. The estimation sample \(\{(X_i, D_i, Y_i D_i)\}_{i=1}^{n}\) consists of 500 or 2,000 observations, respectively, with Yi observed only if Di=1. The validation sample contains 10,000 draws of X and D. With the coefficients \( \widehat{\theta } \) estimated from the estimation sample, the expected outcomes \(\widehat{E}{\left[ {Y\left| X \right.} \right]}\) are imputed by \( \varphi {\left( {X,\widehat{\theta }} \right)} \) for all observations of the validation sample and compared with the true expected outcomes E[Y∣X] to simulate the MSE.

In each replication, first the nonparametric mean outcomes \(\widehat{\mu }\) are estimated by propensity score matching, separately for each subpopulation. The propensity scores \(p_i\) are estimated by probit and the regression curves m(p) are estimated nonparametrically in the various subpopulations either by Nadaraya–Watson kernel regression or by local linear ridge regression. Ridge regression is a variant of local linear regression with better small sample properties. Local linear regression is well known for its favourable asymptotic properties (Fan 1992), but in small samples it can be very erratic because of zero or near-zero denominators in the calculation of the estimator. By adding a ridge parameter to the denominator, ridge regression can avoid the high variance problems of the unmodified local linear estimator. At the same time, with the ridge parameter converging to zero with growing sample size, both estimators are asymptotically equivalent, see Seifert and Gasser (1996, 2000). In essence, ridge regression is a convex combination of the Nadaraya–Watson kernel and the local linear estimator, where the weight given to the local linear estimator increases with growing sample size. In a comparison study of the properties of alternative propensity score matching estimators in finite samples (Frölich 2004, 2005), propensity score matching based on ridge regression clearly dominated matching based on local linear regression and also often performed slightly better than Nadaraya–Watson kernel based matching. In the Monte Carlo simulations below, results are given for Nadaraya–Watson kernel matching (with Gaussian kernel) and for ridge matching (with Epanechnikov kernel).11 The bandwidth is chosen by leave-one-out cross validation from the grid: 0.0001, 0.0001·1.4^1, ..., 0.0001·1.4^28, ∞. With \(\widehat{\mu }\) estimated, the GMM estimator can be computed.

In addition to the GMM estimator, an alternative semiparametric estimator is included in the Monte Carlo simulations. This estimator is based on the idea of first imputing the missing values of Y in the non-responding sample via higher-dimensional nonparametric regression. Second, the parametric regression plane is fitted by least squares using the observed Y for the D=1 observations and the imputed values for the D=0 observations. This LSIR estimator (least squares imputed residuals) estimates \(\widehat{\theta }_{n} \) by minimizing the imputed squared residuals
$$\arg {\mathop {\min }\limits_\theta }{\sum\limits_i {{\left\{ {Y_{i} D_{i} + {\left( {1 - D_{i} } \right)}\widehat{E}{\left[ {Y\left| {X_{i} } \right.} \right]} - \varphi {\left( {X_{i} ,\theta } \right)}} \right\}}^{2} ,} }$$
(14)
where \(\widehat{E}{\left[ {Y\left| {X_{i} } \right.} \right]}\) is a nonparametric estimate of E[Y∣X=Xi]. It is estimated from the responding sample by kernel regression using a multiplicative Gaussian kernel and a single bandwidth.12 The bandwidth is chosen by leave-one-out cross validation from the grid: 0.002, 0.002·1.3^1, ..., 0.002·1.3^28, ∞.
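A minimal sketch of the LSIR criterion in Eq. 14, assuming the nonparametric imputations y_imputed = Ê[Y∣X_i] have already been computed for the D=0 observations (the kernel regression step itself is omitted here):

```python
import numpy as np
from scipy.optimize import minimize

def lsir_objective(theta, X, y, d, y_imputed, phi):
    """Sum of squared 'imputed residuals' (Eq. 14): observed Y for D = 1,
    nonparametrically imputed E[Y|X] for D = 0."""
    target = y * d + (1 - d) * y_imputed
    return np.sum((target - phi(X, theta)) ** 2)

# theta_lsir = minimize(lsir_objective, theta_init, args=(X, y, d, y_imputed, phi)).x
```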

A conceptual difference between the GMM and the LSIR estimator is that the latter attempts to minimize squared bias conditional on X, whereas the former aims at minimizing squared bias conditional on larger subpopulations (the L subpopulations). By restricting itself to larger subpopulations, all nonparametric components in the GMM estimator (i.e. the \(\widehat{\mu }\)) converge at \({\sqrt n } - {\rm{rate}}.\) On the other hand, the nonparametric estimates of E[YX] in the LSIR estimator converge at lower rates, if X contains at least one continuous variable.

The properties of these estimators are examined for different simulation designs. The X characteristics consist of 3 explanatory variables (Xi1, Xi2, Xi3) drawn from the (non-symmetric) χ²(2), χ²(3), χ²(4) distributions and divided by 2, 3, 4, respectively, to standardize their means. Di is determined by Di=1(Xi1+Xi2+Xi3+ɛi>4.5), with ɛ standard normally distributed. The mean of D is 0.46.

The \(Y_i\) data are generated according to one of three different DGPs:
  • DGP 1: \(Y_{i} = X_{i1}^{2} + X_{i2}^{2} + X_{i3}^{2} + \xi _{i} \)

  • DGP 2: \(Y_{i} = {\sqrt {X_{{i1}} - 0.5} } + 2{\sqrt {X_{{i2}} - 0.5} } - {\sqrt {X_{{i3}} - 0.5} } + \xi _{i} \)

  • DGP 3: \(Y_{i} = X_{i1} X_{i2} + X_{i1} X_{i3} + X_{i2} X_{i3} + \xi _{i} \),

with ξ a standard normal error term.
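The simulation design can be summarized in a short sketch; only DGP 1 is shown, and the outcome equations of DGPs 2 and 3 would be substituted analogously.

```python
import numpy as np

def draw_sample(n, rng):
    """One estimation sample from the design of this section (DGP 1)."""
    # regressors: chi-square(2), (3), (4) variables divided by 2, 3, 4 (mean one each)
    x = np.column_stack([rng.chisquare(df, n) / df for df in (2, 3, 4)])
    # missingness indicator: D = 1(X1 + X2 + X3 + eps > 4.5)
    d = (x.sum(axis=1) + rng.standard_normal(n) > 4.5).astype(int)
    y = (x ** 2).sum(axis=1) + rng.standard_normal(n)     # DGP 1 outcome
    return x, d, y

# x, d, y = draw_sample(500, np.random.default_rng(0))
```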
Four different parametric specifications φ(x, θ) are examined. All are linear models and vary in their set of regressors:

Specification   K   Regressors
φ0              4   \(const, X_{i1}, X_{i2}, X_{i3}\)
φ1              4   \(const, X_{i1}^{2}, X_{i2}^{2}, X_{i3}^{2}\)
φ2              4   \(const, \sqrt{X_{i1} - 0.5}, \sqrt{X_{i2} - 0.5}, \sqrt{X_{i3} - 0.5}\)
φ3              7   \(const, X_{i1}, X_{i2}, X_{i3}, X_{i1}X_{i2}, X_{i1}X_{i3}, X_{i2}X_{i3}\)

Specification φ0 is incorrect for all DGPs, φ1 is correct only for DGP 1, φ2 is correct only for DGP 2 and φ3 is correct only for DGP 3.

To assess the sensitivity of the GMM estimator to the number of subpopulations, different numbers of subpopulations (L=1, 4, 7, 10 and 14) are included. (L=0 corresponds to OLS.) If the mean squared error does not decrease noticeably with L, additional subpopulations would seem to be of little value. This would imply that in empirical applications of the estimator a very small L would often suffice, thereby reducing computation time. A natural procedure for defining the subpopulations would begin with the largest population and subsequently include smaller and smaller subpopulations, because the precision in estimating the average bias decreases in smaller subpopulations. The first subpopulation is the entire (non-responding) population. Subpopulations two to four are defined by X1<1.5, X2<1.5, and X3<1.5, respectively, and each contains about 60% of the entire non-responding population. Subpopulations five to seven are defined by {X1<1.5 ∧ X2<1.5}, {X1<1.5 ∧ X3<1.5} and {X2<1.5 ∧ X3<1.5}, respectively, with each covering about 37% of the population. Subpopulations eight to ten each contain about 30% and are defined by X1<1, X2<1, and X3<1, respectively. Finally, subpopulations 11 to 14 are X1>2, X2>2, X3>2, and {X1<1.5 ∧ X2<1.5 ∧ X3<1.5}, respectively, and cover only about 20% of the population.13 Subpopulations with less than ten responding observations or less than ten non-responding observations are dropped in the GMM estimator to reduce the impact of very imprecise estimates.
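Reading the set definitions above as intersections (which is consistent with the reported subpopulation sizes), the fourteen indicator columns could be encoded as in the following sketch; this encoding is illustrative, not taken from the original implementation.

```python
import numpy as np

def subpop_indicators(x):
    """Indicator matrix (n x 14) for the subpopulations used in the simulations,
    ordered from the largest to the smallest subpopulation."""
    x1, x2, x3 = x[:, 0], x[:, 1], x[:, 2]
    cols = [np.ones(len(x)),
            x1 < 1.5, x2 < 1.5, x3 < 1.5,
            (x1 < 1.5) & (x2 < 1.5), (x1 < 1.5) & (x3 < 1.5), (x2 < 1.5) & (x3 < 1.5),
            x1 < 1, x2 < 1, x3 < 1,
            x1 > 2, x2 > 2, x3 > 2,
            (x1 < 1.5) & (x2 < 1.5) & (x3 < 1.5)]
    return np.column_stack(cols).astype(float)
```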

Table 1 gives the simulation results for sample size 500 for OLS, LSIR and the GMM estimators with ridge regression and for different numbers of overidentifying moments L. The four columns labelled DGP1 show the MSE when the true data generating process is DGP1 and the parametric specifications φ0, φ1, φ2 or φ3, respectively, are used (the correctly specified combinations are DGP1–φ1, DGP2–φ2 and DGP3–φ3). Whereas the upper half of the table gives the MSE in the entire population, the lower half refers to the D=0 population only. Table A.1 shows the results for sample size 2000. Tables A.2 and A.3 give the respective results when Nadaraya–Watson kernel regression is used instead of ridge regression.
Table 1
Mean squared error (sample size 500, ridge matching estimator)

                 DGP 1                     DGP 2                     DGP 3
             φ0    φ1    φ2    φ3      φ0    φ1    φ2    φ3      φ0    φ1    φ2    φ3

MSE in the entire population
OLS (L=0)    9.7   0.0  22.0  13.4     9.6  36.9   2.1  12.4     2.5   4.5   6.3   0.0
LSIR         7.0   0.3  17.3   7.4    13.5  34.1   8.9  15.0     1.8   4.2   3.3   0.3
GMM1 L=14    6.8   0.0  17.6   7.5    10.3  33.7   4.9  13.8     1.6   4.3   3.1   0.1
GMM1 L=10    6.8   0.0  17.5   7.5    10.3  33.8   5.3  14.6     1.6   4.3   3.1   0.1
GMM1 L=7     6.8   0.0  17.7   7.5    10.4  33.8   5.6  15.0     1.6   4.3   3.1   0.1
GMM1 L=4     6.8   0.0  17.8   7.4    10.2  33.6   5.4  14.7     1.6   4.3   3.1   0.1
GMM1 L=1     6.8   0.0  18.3   7.9    10.3  33.5   4.0  12.9     1.6   4.3   3.2   0.1
GMM2 L=14    7.9   0.0  18.7   8.7    11.1  37.3   4.2  13.2     1.9   4.7   3.5   0.1
GMM2 L=10    7.3   0.0  17.8   8.1     9.9  34.8   3.0  11.9     1.7   4.6   3.2   0.0
GMM2 L=7     7.2   0.0  17.7   8.1     9.7  34.6   2.8  11.8     1.7   4.4   3.3   0.0
GMM2 L=4     7.1   0.0  17.6   8.0     9.5  34.4   2.7  11.5     1.7   4.4   3.3   0.0
GMM2 L=1     7.1   0.0  18.1   7.9     9.2  34.1   2.3  11.5     1.8   4.3   3.5   0.0

MSE in D=0 population only
OLS (L=0)    9.4   0.0  17.1  16.5    11.0  42.5   2.3  14.9     2.7   1.9   9.3   0.0
LSIR         1.9   0.4   5.8   1.9    17.7  36.4  11.4  19.9     0.6   1.4   1.5   0.4
GMM1 L=14    2.3   0.0   8.3   2.0    10.9  29.5   6.0  12.5     0.6   0.9   2.3   0.1
GMM1 L=10    2.2   0.0   8.3   1.9    10.8  29.5   6.2  12.6     0.6   0.9   2.3   0.1
GMM1 L=7     2.3   0.0   8.7   2.0    11.0  29.6   6.5  13.1     0.6   0.9   2.3   0.1
GMM1 L=4     2.3   0.0   8.8   2.0    10.7  29.3   6.2  12.7     0.6   0.9   2.5   0.1
GMM1 L=1     2.5   0.0   9.9   2.5    11.1  29.4   5.2  13.5     0.7   0.9   2.7   0.1
GMM2 L=14    1.9   0.0   6.2   2.2    12.1  32.0   5.2  13.9     0.7   0.9   1.9   0.1
GMM2 L=10    1.8   0.0   6.3   1.8    10.6  29.0   3.6  12.2     0.7   0.8   2.1   0.0
GMM2 L=7     2.0   0.0   7.2   1.8    10.5  29.6   3.3  12.3     0.7   0.8   2.4   0.0
GMM2 L=4     2.0   0.0   7.3   1.9    10.2  29.9   3.1  12.0     0.9   0.8   2.7   0.0
GMM2 L=1     2.7   0.0   9.6   2.8    10.1  33.1   2.6  12.8     1.1   0.9   3.4   0.0

Mean squared error for parametric least squares (OLS), semiparametric least squares imputed residuals (LSIR) and first and second step semiparametric GMM (GMM1 and GMM2) for the three different data generating processes DGP1, DGP2, DGP3 and the four different parametric regression models φ0, φ1, φ2, φ3. The correctly specified combinations are DGP1–φ1, DGP2–φ2 and DGP3–φ3. The results for DGP2 are multiplied by 100. The OLS estimator uses only the data from the respondents (D=1) and is equivalent to the GMM estimator with L=0 overidentifying moments. The GMM estimators are computed with different numbers of overidentifying moments L. The lower half of the table gives the MSE in the non-responding (D=0) population only. Results based on 5,000 replications.

Examining the first three rows of Table 1, it can be seen that for misspecified models both the LSIR and the GMM estimator usually perform better than OLS. (For DGP2, however, this is true only for the GMM estimator and only for sample size 2000.) For correctly specified models, both semiparametric estimators are less precise than OLS, with LSIR always being worse than GMM. In general, the GMM estimator has a smaller or equal MSE compared to the LSIR estimator. In misspecified models, the semiparametric GMM estimator leads to reductions in MSE, relative to OLS, of about 20–45% for DGP1 and 5–50% for DGP3. For DGP2, the MSE of the GMM estimator lies within ±10% of the MSE of OLS. This indicates that semiparametric estimation can lead to quite sizeable efficiency gains in misspecified models, although these are not always guaranteed. On the other hand, the efficiency losses in correctly specified models are often small in absolute terms, compared to the precision gains in misspecified models. In DGP1 (with specification φ1) and DGP3 (with φ3), the MSE increases only by less than 0.1 from OLS to the GMM estimator. In DGP2 (with φ2), however, the GMM estimator performs clearly worse than OLS.

Examining the results for the first step GMM estimator with different numbers of overidentifying moments L, no clear monotonic relationship can be detected. While the MSE decreases with the number of moments in DGP1-φ2 and DGP1-φ3, it first increases and then decreases in DGP2-φ2 and DGP2-φ3. In the other cases, the MSE hardly changes with the number of moments. This indicates that the value of additional overidentifying moments may be small, such that in applications of this estimator a relatively small L should suffice.

The second step GMM estimator, on the other hand, is more sensitive to the number of moments and its MSE generally tends to increase with L. This may be due to a less precise estimation of the weighting matrix, whose dimension increases with L. The second step estimator often tends to have a higher MSE than the first step estimator, unless the model is correctly specified. The latter comes as expected since the second step weighting matrix usually assigns more weight to the parametric moments than the initial weighting matrix used in the first step estimator.

The lower half of Table 1 shows the precision of the various estimators in the non-responding population, which is simulated by using only the D=0 observations of the validation sample. This is relevant if one is interested in estimating E[Y∣X] only for the non-respondents. A typical example would be the analysis of the treatment effect on the treated for different values of X. While the qualitative results are similar to the previous discussion, the precision gains of the semiparametric estimators for misspecified models are now much larger. For DGP1, MSE is reduced by 50–90% vis-à-vis OLS. For DGP2, the reductions are 5–30%, and they are 55–75% for DGP3.

Table A.1 shows the simulation results for sample size 2000. The semiparametric estimators have become more precise relative to OLS, and the GMM estimator now dominates OLS in all misspecified models. The LSIR estimator, on the other hand, is still worse than OLS in DGP2-φ0 and DGP2-φ3. The first step GMM estimator remains rather robust to the number of moments L included, while the MSE of the second step GMM estimator still often increases with the number of moments. The second step estimator now performs worse than the first step GMM in almost all misspecified models. This is in accordance with the discussion at the end of Section 2, because with increasing sample size bias becomes more important relative to variance. As the weighting matrix for the second step GMM is based only on variance considerations, too little weight is given to the overidentifying nonparametric moments. Overall, for the D=0 population (lower half of Table A.1), the MSE of the first step GMM estimator is about 60–90% (DGP1), 20–40% (DGP2) and 60–80% (DGP3) lower than for OLS in the misspecified models.

Tables A.2 and A.3 give the results when Nadaraya–Watson kernel regression is used instead of ridge regression in the GMM estimators. The results are very similar, with kernel regression performing a little worse for sample size 500 and a little better with sample size 2000.

Although no strong conclusions can be drawn from this limited Monte Carlo study, the results seem to indicate that the semiparametric estimators can lead to substantially more precise estimates of E[Y∣X] in misspecified models, while maintaining good properties in correctly specified models. Reductions in MSE of 5–50% are feasible. If interest is in estimating E[Y∣X] only for the non-responding population, e.g. for analyzing average treatment effects on the treated, the reductions in MSE are even larger and can be up to 90%. Although both the LSIR and the GMM estimators perform well in misspecified models, the GMM estimator usually leads to larger reductions and has better properties in correctly specified models. In particular, the first step GMM estimator appeared to be superior to the second step estimator. A moderate number of overidentifying moments L seems to suffice to attain the precision gains. The choice of the nonparametric regression estimator does not seem to matter much. The results with Nadaraya–Watson kernel regression and local linear ridge regression were very similar. Hence, as a practical recommendation, any propensity score matching estimator can be used for estimating \(\widehat{\mu }\) in just a small number of subpopulations L.

The previous discussion focussed on the estimation of conditional mean functions E[Y∣X]. The proposed GMM estimator, however, can also be used for specification testing, using the J-test statistic (Eq. 13). In the supplementary Appendix, the size and the power of this test are examined. The results are not very favourable, though, as the test often tends to over-reject. A likely reason for this size distortion is the use of cross-validation for choosing the bandwidth value. Whereas cross-validation trades off variance against bias, centrality of the test statistic relies on undersmoothing. Hence, if the proposed GMM estimator were to be used for specification testing, a different data-driven technique for bandwidth selection would be needed. For the purpose of estimation, on the other hand, cross-validation seems to work well, as the simulations of this section indicated.

Treatment choice among Swedish rehabilitation programmes

To illustrate the applicability of the proposed estimator, treatment effect heterogeneity among Swedish rehabilitation programmes for long-term sick is analyzed. Conditional expectation functions are estimated for the different programmes, which can be used to analyze individual heterogeneity in the effects and to determine the potential for policy improvements through better targeting of programmes.

Heterogeneity in treatment effects has been somewhat neglected in the recent literature on programme evaluation, which concentrated largely on estimating average treatment effects.14 If treatment effects are heterogeneous, however, it is important to determine which individuals benefit most from which programmes in order to give advice on how policies should be targeted to obtain a more efficient allocation of programmes and participants. Taking treatment effect heterogeneity into account is relevant for many social and economic policies. For example, many evaluations of active labour market policies found negative or zero average treatment effects. It could be possible, though, that some individuals would benefit greatly from such programmes, whereas the majority does not. Instead of completely eliminating such programmes, better targeting might be more sensible.

Treatment effect heterogeneity is at the center of the analysis of optimal statistical treatment rules, as e.g. in Wald (1950), Heckman et al. (1997b), Black et al. (2003), Manski (2000, 2004) and Dehejia (2004). An optimal statistical treatment rule attempts to assign individuals to programmes in a welfare-maximizing way. Suppose a policy consists of R different, mutually exclusive programmes and each individual of an eligible population chooses exactly one of these. (One of these programmes may be labelled ‘non-participation’.) The potential outcomes for an individual i are \(Y_i^1,...,Y_i^R\), of which one will be realized according to the programme chosen. A statistical treatment rule assigns individuals to programmes on the basis of observed characteristics \(X_{i} \in {\Re }^{K} \). If the planner’s social welfare function is utilitarian, see Manski (2000, 2004), the optimal programme r* for an individual with characteristics x is
$$ r*{\left( x \right)} = {\mathop {\arg \;\max }\limits_{r \in {\left\{ {1, \ldots ,R} \right\}}} }E{\left[ {Y^{r} \left| X \right. = x} \right]}. $$
(15)
Most policies pursue multiple, often conflicting, goals, which can be measured through different outcome variables, e.g. health status, disability status, employment situation, wages, programme costs etc. The potential outcomes are then vector-valued \( Y^{1} ,..,Y^{R} \in {\Re }^{V} \), and a utility weighting function \( u{\left( \cdot \right)}: {\Re }^{V} \mapsto {\Re } \) is needed to trade off the V outcome variables. For a linear weighting function, the optimal programme is r*(x)=arg max_r u(E[Y^r∣X=x]).
Deriving optimal treatment choices thus requires estimates of the conditional expectation functions E[Y^r∣X]. These can be estimated from data \(\{(X_i, D_i, Y_i)\}_{i=1}^{n}\) on past programme participants, where D∈{1,..,R} indicates the programme an individual had participated in. Since for each observation only the outcome Y=Y^D corresponding to the programme D the individual had participated in can be observed, the expected potential outcomes are a priori not identified.15 Because of this selection problem, only E[Y^r∣X, D=r] is identified. However, if X includes all variables that affected the potential outcomes as well as the assignment under the former selection process, the potential outcomes Y^r are conditionally independent of the programme an individual had participated in
$$E{\left[ {Y^{r} \left| X \right.} \right]} = E{\left[ {Y^{r} \left| X \right.,D = r} \right]}\quad {\text{w}}{\text{.p}}.1.$$
(16)
Accordingly, E[Y^r∣X=x] is identified for all x with P(D=r∣X=x)>0. This corresponds to the situation described in the previous sections with data on Y^r being observed only for a part of the population and interest being in the population mean function E[Y^r∣X]. The proposed semiparametric estimator can thus be applied separately to each of the R different potential outcomes.
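Given one estimated coefficient vector per programme, applying the treatment rule of Eq. 15 for a single outcome is then a simple argmax over the R programme-specific predictions, as in this sketch (the function and argument names are illustrative, and phi denotes the common parametric specification):

```python
import numpy as np

def optimal_programme(x_row, theta_by_programme, phi):
    """Statistical treatment rule of Eq. 15: choose the programme r whose
    estimated conditional mean phi(x, theta_r) is largest for this individual."""
    predictions = [phi(x_row, theta_r) for theta_r in theta_by_programme]
    return int(np.argmax(predictions)) + 1    # programmes numbered 1, ..., R
```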

With this approach the success of Swedish rehabilitation programmes in re-integrating the long-term sick in the labour market is examined. In a retrospective analysis, the optimal programme is determined for each individual. Comparing the average employment rate ensuing with this optimal allocation to the observed employment rate gives an indication of the potential for policy improvement through a better allocation of participants to programmes. This application is merely meant as an illustration of the approach. Using only a single outcome variable, employment, would not do justice to the multi-faceted goals of rehabilitation programmes if strong policy conclusions were to be drawn. Rehabilitation programmes aim not only at restoring lost working capacity, but also at improving mental and physical health, and their costs would also need to be taken into account. A more comprehensive analysis, however, was not possible due to data availability.

Swedish rehabilitation programmes

The Swedish social insurance system provides a wide variety of supportive actions for people in need. One of these is the coordination and financing of vocational rehabilitation for long-term sick individuals. Persons who have been employed for at least one month are covered by the public sickness insurance and are eligible for sickness benefit when they become ill. Sickness cases that last for more than four weeks are considered long-term sick, and appropriate measures for these cases are examined. If sickness is expected to be permanent or of longer duration, a disability pension is granted. Otherwise, rehabilitation actions should be initiated if they can restore (at least partly) a person’s working ability within reasonable time, i.e. in less than one year. The local insurance offices mediate in this process by coordinating rehabilitative actions with the employer and the employee and by financing vocational rehabilitation.

The rehabilitation actions consist of a wide variety of different programmes and measures, targeted at different groups and pursuing different goals. They can roughly be summarized into vocational and non-vocational measures. The vocational programmes aim at improving employability to guide individuals back into the competitive labour market and consist of work training and educational training. Work training can be with the current employer or at a new place of work. The former requires the cooperation from the employer to make the training feasible. Unemployed individuals on long-term sickness are often offered work training at sheltered public workplaces. Educational rehabilitation comprises various forms of classroom education.

The non-vocational measures consist of medical and social rehabilitation. Social rehabilitation contains, for example, programmes for individuals with alcohol, drug or psychiatric problems. These measures are not coordinated by the insurance office. Individuals with severe health problems may receive different forms of rehabilitation in parallel or sequentially.

In the following analysis, these activities are categorized into four different types of rehabilitation: No rehabilitation, workplace rehabilitation, educational rehabilitation and medical and social rehabilitation.

Rehabilitative activities pursue a variety of different goals. Vocational rehabilitation aims at re-integration into the labour market. Medical and social rehabilitation, on the other hand, are rather intended to restore physical and mental health and basic work capacity and to re-establish the independence of the sick individual from medical or therapeutic assistance. In the following analysis only a single outcome variable is examined: successful integration into the labour market at the end of the sickness spell. The main reason for restricting the analysis to the employment outcome is that the available data seem to be sufficiently informative to make the conditional independence assumption (16) plausible with respect to the employment outcome, but not with respect to other outcome variables, e.g. health status. The data is very informative about the selection process into treatment and seems to contain most or all relevant factors that simultaneously determined rehabilitation assignment and subsequent employment outcomes. Even if not all relevant factors are included, the resulting bias is likely to be small relative to the variance in the process of finding employment, as employment is driven by many other factors besides health. With respect to subsequent health outcomes, however, conditional independence may not hold, because the health history data is not sufficiently precise. As many of the variables on health history and medical recommendations are binary, they indicate only the incidence of health problems but not their severity. As certain health problems may be highly autocorrelated over time, this might have led to a large bias.

The process of entering (publicly financed) rehabilitative actions in the years 1991 to 1994 was as follows. A person who falls sick or becomes injured first notifies her employer or the local social insurance office thereof.16 If sickness continues for more than four weeks, a rehabilitation assessment should be carried out within the following eight weeks, which consists of various medical and non-medical examinations. On the basis of this assessment a decision about the appropriateness of vocational rehabilitation should be reached: If rehabilitation assistance is not necessary and recovery is expected within a year, the individual draws sickness benefits until healthy. If sickness seems to last for more than a year (even with rehabilitation), the individual will be granted disability pension and the case is closed. If, on the other hand, rehabilitation seems necessary, economically advisable and it is expected that the sick person can regain her working capacity within a year, a rehabilitation plan is established.

This plan is made by the insurance office (IO) officer, taking into account the rehabilitative needs, the medical assessments, budgetary constraints as well as the individual’s preferences. In the first instance, the insurance office’s task is to coordinate the provision of vocational rehabilitation.17 The employer is obliged to facilitate workplace rehabilitation, according to his possibilities, through transfers, changes in duties and work hours, work training, education, adjustments to the current workplace etc. For unemployed persons, and also when the employer is not able or not willing to cooperate, the insurance office offers alternative rehabilitative measures, which it purchases from hospitals and private providers of work training and education.18 Individuals may demand but have neither the right to receive rehabilitation nor the obligation to participate. It is mainly the IO officer who determines which rehabilitation measures are to be offered. The officers have clear guidelines to follow for assessing the need and success chances of rehabilitative measures and they do not face any incentive structures for discriminating against particular groups. In case of participation in vocational rehabilitation, individuals receive an additional rehabilitation allowance. After rehabilitation, the sick person may be either healthy or still sick. If still sick, her recovery chances are re-assessed and she either re-enters the pool of long-term sick or is granted disability pension.

Data

The data used in this study is taken from the Riks-LS data set, which has been collected by the National Social Insurance Board (RFV) for the purpose of evaluating the efficacy of vocational rehabilitation. The survey was conducted in the second half of 1994 and the beginning of 1995 and retrospectively analyzed 75,000 sickness cases that had received sickness benefit for a period of at least 60 consecutive days between July 1991 and June 1994. The caseworkers in charge of these cases were surveyed about the development and assessment of the sickness case. Data collection was organized in the form of three independent cross-sections, according to the fiscal years 1991/92, 1992/93 and 1993/94. Cases were followed up until closure of the case or at most until December 1994, the end of the data collection period.

From this data set, a sample of 6,287 cases in five counties in Western Sweden is analyzed. The sample contains only persons not older than 55 years and not receiving pension benefits. Individuals in full-time education are excluded as well as individuals with missing data on sickness and rehabilitation history. Of the 6,287 observations, 3,502 did not receive any rehabilitative measures, 1,118 participated in workplace rehabilitation, 360 in educational rehabilitation and 1,307 in medical and social rehabilitation.19

The data set provides rich information about the socioeconomic characteristics of the individuals, details on their health status and the selection into rehabilitation. The information about the individual prior to the beginning of the sickness spell comprises age, gender, marital status, citizenship, education, occupation and labour market position, previous health record, previous participation in vocational rehabilitation, employment status, earnings and earnings loss due to sickness. The individual’s environment is characterized by county of residence, community type, local unemployment rate and year of sickness registration. Information at the time of sickness registration contains the medical institution that registered the sick leave, the initial degree of sickness, indications of alcohol or drug abuse, and the medical diagnosis. The data set also contains crucial information about the rehabilitation assessment. In particular, the initial medical recommendation, the caseworker’s non-medical recommendation, and the organization that carried out the assessment are recorded, revealing important characteristics of the sick person before entering rehabilitation. These expert opinions include subjective judgements about the sick person’s ability, determination and employment chances and are crucial for the conditional independence assumption (16).

Table 2 gives descriptive statistics for selected variables by treatment group. Unskilled blue collar workers are somewhat over-represented in workplace rehabilitation, while white collar workers are under-represented in educational rehabilitation. Unemployed long-term sick are often found in educational rehabilitation. Educational rehabilitation also seems to be targeted towards persons with repeated sickness spells and previous participation in vocational rehabilitation. In addition, individuals whose sickness was registered by a psychiatric or social medicine centre, or with alcohol/drug problems or a psychiatric disorder, are over-represented in educational rehabilitation. Differences between the treatment groups can also be found in the records of the vocational rehabilitation assessment. ‘Wait and see’ recommendations are prevalent among the non-participants and the medical and social rehabilitation group,20 whereas a positive vocational rehabilitation recommendation is very frequent in the other two groups. Overall, these statistics indicate that educational rehabilitation, in particular, seems to be targeted at those cases most difficult to re-integrate into the labour market.
Table 2

Descriptive statistics by treatment groups (means or shares in %)

| Variable | All | No Rehabilitation | Workplace | Educational | Medical and social |
|---|---|---|---|---|---|
| Age (years) | 40.5 | 40.9 | 39.6 | 39.0 | 40.5 |
| Male | 45 | 45 | 45 | 46 | 46 |
| Married | 52 | 53 | 53 | 45 | 52 |
| Labour market position: | | | | | |
|  Blue collar, unskilled | 45 | 42 | 52 | 47 | 47 |
|  Blue collar, skilled | 20 | 20 | 23 | 23 | 20 |
|  White collar worker | 23 | 26 | 20 | 16 | 21 |
|  Self-employed | 12 | 13 | 5 | 14 | 12 |
| Unemployed at beginning of sickness | 19 | 20 | 9 | 32 | 21 |
| Income (in SEK) | 1,307 | 1,303 | 1,340 | 1,268 | 1,300 |
| Previous sickness (days sick in last 6 months): | | | | | |
|  <15 days | 59 | 62 | 58 | 47 | 57 |
|  >60 days | 22 | 20 | 24 | 35 | 22 |
| Participation in vocational rehabilitation in last 12 months | 11 | 7 | 15 | 23 | 14 |
| Local unemployment rate (in %) | 6.52 | 6.45 | 6.59 | 6.71 | 6.63 |
| Community type: | | | | | |
|  Urban/suburban region | 26 | 31 | 17 | 21 | 21 |
|  Major/middle large city | 14 | 13 | 11 | 11 | 21 |
|  Industrial city | 12 | 10 | 14 | 11 | 16 |
|  Rural and other | 49 | 47 | 58 | 57 | 43 |
| Registration of current sickness spell, registered by: | | | | | |
|  Health care centre/hospital | 80 | 81 | 81 | 73 | 79 |
|  Psych./social medicine centre | 8 | 7 | 6 | 14 | 10 |
|  Private or others | 12 | 11 | 13 | 13 | 11 |
| Degree of sickness is 100% sick leave | 86 | 84 | 92 | 91 | 86 |
| Indications of alcohol or drug abuse | 6 | 6 | 3 | 10 | 8 |
| Diagnosis: | | | | | |
|  Psychiatric | 18 | 18 | 13 | 28 | 18 |
|  Circulation | 4 | 5 | 4 | 3 | 2 |
|  Respiratory | 2 | 2 | 3 | 4 | 2 |
|  Digestion | 3 | 4 | 3 | 1 | 2 |
|  Musculoskeletal | 44 | 39 | 51 | 44 | 51 |
|  Injuries | 14 | 15 | 15 | 11 | 12 |
|  Other | 15 | 18 | 13 | 10 | 12 |
| Rehabilitation needs assessment, case assessed: | | | | | |
|  By employer | 23 | 17 | 40 | 25 | 25 |
|  By insurance office | 16 | 13 | 16 | 33 | 22 |
|  IO on behalf of employer | 11 | 8 | 14 | 13 | 17 |
|  Not needed | 26 | 36 | 10 | 9 | 16 |
|  Not carried out | 23 | 26 | 19 | 20 | 20 |
| Medical VR recommendation: | | | | | |
|  Wait and see | 55 | 61 | 40 | 37 | 56 |
|  VR needed and defined | 26 | 14 | 47 | 55 | 34 |
|  Eligible for disability pension | 6 | 9 | 3 | 2 | 4 |
|  Not satisfactory/unclear | 12 | 16 | 10 | 6 | 6 |
| Non-medical VR recommendation: | | | | | |
|  Wait and see | 63 | 76 | 36 | 37 | 59 |
|  VR needed and defined | 32 | 17 | 63 | 62 | 38 |
|  Eligible for disability pension | 5 | 7 | 1 | 1 | 3 |
| End of sickness: | | | | | |
|  Case closed as of December 1994 | 87 | 91 | 82 | 81 | 80 |
|  Returns to regular employment | 46.3 | 48.3 | 52.4 | 28.9 | 40.5 |
| Number of observations | 6,287 | 3,502 | 1,118 | 360 | 1,307 |

Sample means in each treatment group multiplied by 100 (except age)

At the end of a sickness case, the exit destination is recorded, which can be returning to or entering regular employment, working at a sheltered workplace, entering full-time education, being unemployed, receiving disability pension, or ‘other destinations’. At the end of the data collection period in December 1994, however, some cases remained unclosed. These are treated as still sick and represent about 10 to 20% of the observations, as shown at the bottom of Table 2. As regards the exit destinations, about 46% of all cases left sickness towards employment. For non-participants this employment rate is 48%, while it is 52% for the participants in workplace rehabilitation, 29% for the participants in educational rehabilitation and 41% for the participants in medical rehabilitation.

Estimation results

The gross employment rates given at the end of Table 2, however, are not informative about the re-integration success of the rehabilitation programmes because the participants in the different programmes differ in their characteristics. To adjust for these differences between the treatment groups, Table 3 shows estimates of the population mean potential outcomes E[YNo], E[YWork], E[YEdu] and E[YMed], obtained by ridge matching. These estimates indicate that, on average, medical rehabilitation and particularly educational rehabilitation have not been successful in fostering integration into the labour market. It is possible, though, that for some individuals medical and/or educational rehabilitation are the best choices for improving employment chances. Therefore, expected potential outcomes conditional on covariates X need to be estimated.
Table 3

Nonparametric estimates of mean potential outcomes (in %)

| Estimated | \(\widehat{E[Y^{No}]}\) | \(\widehat{E[Y^{Work}]}\) | \(\widehat{E[Y^{Edu}]}\) | \(\widehat{E[Y^{Med}]}\) |
|---|---|---|---|---|
| Re-employment rate | 46.0 | 45.6 | 32.9 | 41.0 |

Mean potential outcomes estimated by propensity score ridge matching

Using the semiparametric GMM estimator developed in the previous sections, the optimal programme is estimated retrospectively for each of the 6,287 observations in the data set. With the outcome variable employment status being binary, the expectation function is specified by a probit
$$E\left[ Y^{r} \mid X = x \right] \doteq \Phi\left( x'\theta^{r} \right) \qquad \forall r \in \left\{ No,\; Work,\; Edu,\; Med \right\},$$
(17)
where Φ is the cdf of the standard normal distribution. The coefficients θr may differ between the programmes r ∈ {No, Work, Edu, Med}. The scores of the likelihood function \(\frac{\partial \ln \ell\left( x'\theta^{r} \right)}{\partial \theta^{r}} = \frac{\phi\left( x'\theta^{r} \right)}{\Phi\left( x'\theta^{r} \right)\left( 1 - \Phi\left( x'\theta^{r} \right) \right)}\, x \left( y - \Phi\left( x'\theta^{r} \right) \right)\) are taken as the instruments Ar(Xi) in Eq. 10. In the estimator, L=11 different subpopulations are included, as defined in Table B.1. The vector of covariates X contains 38 explanatory characteristics (plus a constant), which are given in Table B.2. These covariates include socioeconomic variables, indicators of sickness history, of the current sickness spell and of the rehabilitation assessment.
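To make the construction of these instruments concrete, the following minimal Python sketch (not the paper’s original implementation) evaluates the probit score contributions for a given coefficient vector; the function name and the use of NumPy/SciPy are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def probit_scores(theta, X, y):
    """Probit log-likelihood scores used as instruments A^r(X_i) for one programme r.

    X : (n, k) design matrix including a constant, y : (n,) binary outcome,
    theta : (k,) coefficient vector. Row i equals
    phi(x_i'theta) / (Phi(x_i'theta)(1 - Phi(x_i'theta))) * x_i * (y_i - Phi(x_i'theta)).
    """
    index = X @ theta
    Phi = np.clip(norm.cdf(index), 1e-10, 1 - 1e-10)  # guard against division by zero
    phi = norm.pdf(index)
    weight = phi / (Phi * (1.0 - Phi))
    return (weight * (y - Phi))[:, None] * X
```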

The participation probabilities \(\widehat{p}^{r} \) are estimated by probit and the support restriction is implemented by discarding all observations with \( \widehat{p}^{r}_{i} \) below the lowest participation probability among the participants in programme r. The regression curves mr(pr) are estimated for each subpopulation separately by ridge matching, using only the observations belonging to that subpopulation. The bandwidth is chosen by least-squares cross validation. (The implied mean potential outcomes for all subpopulations are given in Table B.1).
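The following sketch illustrates, under simplifying assumptions, this preliminary nonparametric step: the support restriction that discards observations below the lowest participation probability among the participants, and a smoother of the outcome on the estimated propensity score with a leave-one-out cross-validated bandwidth. A Gaussian-kernel local linear regression is used here only as a simple stand-in for the Seifert–Gasser ridge matching estimator actually employed; all function names are hypothetical.

```python
import numpy as np

def trim_support(p_hat, participated):
    """Keep only observations whose estimated participation probability is at least
    the minimum probability observed among the participants in programme r."""
    return p_hat >= p_hat[participated].min()

def local_linear(p_grid, p_obs, y_obs, bandwidth):
    """Gaussian-kernel local linear regression of y on the propensity score
    (a simplified stand-in for ridge matching)."""
    fitted = np.empty(len(p_grid))
    for j, p0 in enumerate(p_grid):
        w = np.exp(-0.5 * ((p_obs - p0) / bandwidth) ** 2)
        X = np.column_stack([np.ones_like(p_obs), p_obs - p0])
        XtW = X * w[:, None]
        beta = np.linalg.pinv(XtW.T @ X) @ (XtW.T @ y_obs)
        fitted[j] = beta[0]          # intercept = fitted value at p0
    return fitted

def cv_bandwidth(p_obs, y_obs, grid):
    """Least-squares leave-one-out cross-validation over a bandwidth grid
    (deliberately simple and therefore slow)."""
    best_h, best_err = None, np.inf
    for h in grid:
        err = 0.0
        for i in range(len(p_obs)):
            mask = np.arange(len(p_obs)) != i
            m_i = local_linear(np.array([p_obs[i]]), p_obs[mask], y_obs[mask], h)[0]
            err += (y_obs[i] - m_i) ** 2
        if err < best_err:
            best_h, best_err = h, err
    return best_h
```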

With these preliminary estimates, the coefficients θNo, θWork, θEdu, θMed are estimated by the GMM estimator (Eq. 11). Using these coefficient estimates, the expected potential outcomes \(\widehat{Y}^{{No}}_{i} ,\widehat{Y}^{{Work}}_{i} ,\widehat{Y}^{{Edu}}_{i} ,\widehat{Y}^{{Med}}_{i} \) are predicted for each observation, and the optimal programme for observation i is defined as the programme with the largest predicted outcome. Table 4 shows the number of observations for whom the optimal programme is: No, workplace, educational or medical rehabilitation, respectively. It can be seen that in spite of No and workplace rehabilitation being optimal for the majority of all individuals, educational rehabilitation would still have been the optimal choice for 1,519 persons.
Table 4

Distribution of optimal programme (by largest estimate)

| Best programme is | No rehabilitation | Workplace | Educational | Medical |
|---|---|---|---|---|
| For so many individuals | 1,865 | 1,860 | 1,519 | 1,043 |

Number of individuals for whom No rehabilitation, workplace rehabilitation, educational rehabilitation or medical rehabilitation, respectively, is the programme with the largest estimated potential outcome
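Once the GMM coefficient estimates are available, deriving the allocation in Table 4 amounts to predicting Φ(x′θr) for each programme and picking the largest prediction per person. A minimal sketch of this step (hypothetical function and variable names, not the paper’s code):

```python
import numpy as np
from scipy.stats import norm

def predict_optimal_programme(X, theta_by_prog):
    """Predict Phi(x'theta^r) for every programme r and pick the largest.

    theta_by_prog : dict mapping programme label -> estimated coefficient vector.
    Returns the (n, R) matrix of predicted potential outcomes and, per observation,
    the label of the programme with the highest prediction."""
    labels = list(theta_by_prog)
    preds = np.column_stack([norm.cdf(X @ theta_by_prog[r]) for r in labels])
    best = [labels[j] for j in preds.argmax(axis=1)]
    return preds, best
```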

In the analysis so far, the sampling variability of the estimated potential outcomes has been neglected. If the estimates are very noisy, any programme could turn out to be the best, rendering the results meaningless. To take this into account, a programme is considered as optimal for individual i only if the probability that it corresponds to the maximum of \(\widehat{Y}^{{No}}_{i} ,\widehat{Y}^{{Work}}_{i} ,\widehat{Y}^{{Edu}}_{i} ,\widehat{Y}^{{Med}}_{i} \) exceeds a certain threshold. Hence, the programme ri* is defined as the optimal programme if
$$ P\left( \arg\max_{r} \left\{ \widehat{Y}^{No}_{i}, \widehat{Y}^{Work}_{i}, \widehat{Y}^{Edu}_{i}, \widehat{Y}^{Med}_{i} \right\} = r^{*}_{i} \right) \geq 1 - \alpha . $$
The probability measure can be simulated via bootstrapping of the coefficient estimates \(\widehat{\theta }^{{No}} ,\widehat{\theta }^{{Work}} ,\widehat{\theta }^{{Edu}} ,\widehat{\theta }^{{Med}} \). A threshold of 1−α=0.7, for example, requires that in at least 70% of the bootstrap iterations, ri* corresponds to the maximum of the estimated potential outcomes. If no programme satisfies this condition, the optimal programme for individual i is undefined. Whether a programme is identified as optimal thus depends on the precision of the estimated coefficients \(\widehat{\theta }^{{No}} ,\widehat{\theta }^{{Work}} ,\widehat{\theta }^{{Edu}} ,\widehat{\theta }^{{Med}}\). If any of these coefficients is estimated with high variance, the corresponding potential outcome will sometimes be estimated to be higher than the other outcomes and sometimes lower, unless the differences in the levels of the potential outcomes are large. Imprecise estimates of \(\widehat{\theta }^{{No}} ,\widehat{\theta }^{{Work}} ,\widehat{\theta }^{{Edu}} ,\widehat{\theta }^{{Med}} \) reduce the likelihood that any of the programmes will be identified as optimal. Hence, if the estimates are imprecise and treatment effects are small, the optimal programme will most often be undefined.
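A sketch of how such a bootstrap-based decision rule could be implemented, assuming the bootstrapped coefficient vectors have already been obtained (the names and data layout are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

def bootstrap_optimal(X, theta_draws, alpha=0.3):
    """Declare programme r optimal for observation i only if r attains the maximum
    predicted outcome in at least a share (1 - alpha) of the bootstrap draws.

    theta_draws : dict mapping label -> (B, k) array of bootstrapped coefficients.
    Returns a list with the optimal label per observation, or None if undefined."""
    labels = list(theta_draws)
    B = next(iter(theta_draws.values())).shape[0]
    n = X.shape[0]
    win_counts = np.zeros((n, len(labels)))
    for b in range(B):
        preds = np.column_stack([norm.cdf(X @ theta_draws[r][b]) for r in labels])
        win_counts[np.arange(n), preds.argmax(axis=1)] += 1
    shares = win_counts / B
    best = shares.argmax(axis=1)
    return [labels[j] if shares[i, j] >= 1 - alpha else None
            for i, j in enumerate(best)]
```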

The appropriate choice of α depends on the importance of alternative objectives and considerations in the programme choice. If treatment assignment is required to be strictly deterministic and to depend only on X, and if no supply-side constraints or waiting lists could delay the availability of treatment, assignment should always be to the programme with the highest estimated outcome. On the other hand, if the estimated potential outcomes are only one of many determinants of the treatment choice, a significant preponderance of evidence, say 1−α=0.7 or 0.8, is desirable, so that noisy estimates are disregarded. If the statistical evidence is insufficient to reach this threshold, alternative criteria should guide the selection. These may include programme goals that are not easily quantifiable (and thus cannot be included in the utility weighting function discussed at the beginning of Section 4), waiting lists if treatment places are limited, conjectures about general equilibrium effects of certain treatments that cannot be quantified, and so forth. The more important these alternative goals and criteria are, the more certainty will be expected from the statistical system before its predictions are taken into consideration. In addition, the choice of α should also depend on the number of options to choose from. Generally, the larger the number of available programmes, the smaller 1−α should be, because a level of 1−α=0.7 can easily be reached if there are only two options to choose from, but will be much more restrictive if there are ten different programmes.

Table 5 gives the number of individuals for whom No, workplace, educational or medical rehabilitation, respectively, is optimal with at least 90, 70, 60 or 50% simulated probability, as well as the number for whom the optimal programme is undefined. This table again shows that educational rehabilitation would be the optimal programme for some individuals, in spite of a substantial amount of uncertainty which is visible from the large number of individuals without defined optimal programme (even at the 0.5 level).
Table 5

Distribution of optimal programme (for different levels of α)

| Best programme is | No rehabilitation | Workplace | Educational | Medical | Undefined |
|---|---|---|---|---|---|
| With 90% probability | 142 | 100 | 23 | 16 | 6,006 |
| With 70% probability | 618 | 540 | 294 | 180 | 4,655 |
| With 60% probability | 920 | 893 | 552 | 352 | 3,570 |
| With 50% probability | 1,302 | 1,386 | 905 | 606 | 2,088 |

Number of individuals for whom the corresponding programme is the estimated optimal programme with probability 1−α=0.9, 0.7, 0.6 or 0.5, respectively. 350 bootstrap replications

It is revealing to compare this simulated optimal allocation with the allocation actually observed. Table 6 cross-tabulates the number of participants in a certain programme Di ∈ {No, Work, Edu, Med} against their optimal programme ri* ∈ {No, Work, Edu, Med, Undefined}, for different levels of α. The table shows that for the individuals participating in No rehabilitation, an optimal programme is determined in 866 cases at the 1−α=0.7 level, which is No rehabilitation in 399 cases, workplace rehabilitation in 198 cases, educational rehabilitation in 156 cases and medical rehabilitation in 113 cases. Particularly striking are the results for educational rehabilitation. Of the participants in educational rehabilitation, only very few (19 cases) would have been assigned to educational rehabilitation in the optimal allocation, whereas most of the 294 cases for whom educational rehabilitation seems to be optimal actually participated in other programmes. These results are similar at the 0.6 and 0.5 levels, corroborating the finding that the participants in educational rehabilitation are not well selected. The fraction of misclassification Δ (in %) indicates the mismatch between the optimal and the actual allocation. Leaving aside the individuals for whom no optimal programme is defined, Δ is computed as the number of individuals for whom actual selection Di and optimal choice ri* do not coincide (off-diagonal elements) divided by the total number of individuals with a defined optimal programme. Table 6 shows that at a probability level of 0.7, more than 60% of the optimal programme choices differ from the actual allocation, indicating a substantial potential for improving programme selection. The misclassification level increases to 67% at the 0.5 level, which might be attributable to additional noise, because the optimal programme classifications become less clear-cut.
Table 6

Optimal treatment choice versus actual allocation

| Actual allocation Di \ Optimal allocation ri* | No rehabilitation | Workplace | Education | Medical | Undefined |
|---|---|---|---|---|---|
| 1−α=70% | | | | | |
| Di=No | 399 | 198 | 156 | 113 | 2,636 |
| Di=Work | 76 | 170 | 73 | 19 | 780 |
| Di=Edu | 22 | 53 | 19 | 8 | 258 |
| Di=Med | 121 | 119 | 46 | 40 | 981 |
| Δ (%) | 61.5 | | | | |
| 1−α=60% | | | | | |
| Di=No | 586 | 352 | 312 | 213 | 2,039 |
| Di=Work | 122 | 249 | 119 | 40 | 588 |
| Di=Edu | 30 | 79 | 37 | 13 | 201 |
| Di=Med | 182 | 213 | 84 | 86 | 742 |
| Δ (%) | 64.7 | | | | |
| 1−α=50% | | | | | |
| Di=No | 828 | 603 | 507 | 351 | 1,213 |
| Di=Work | 179 | 338 | 196 | 82 | 323 |
| Di=Edu | 48 | 114 | 56 | 25 | 117 |
| Di=Med | 247 | 331 | 146 | 148 | 435 |
| Δ (%) | 67.4 | | | | |

Number of participants in programme Di ∈ {No, Work, Edu, Med} with optimal programme ri* ∈ {No, Work, Edu, Med, Undefined}, at different levels of α. The fraction of misclassification Δ in % is defined as the number of cases for whom Di and ri* do not coincide (off-diagonal elements) relative to the total number of cases with a defined optimal programme, leaving aside the undefined cases
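As a cross-check on the Δ figures reported in Table 6, the misclassification fraction can be computed directly from the actual and optimal programme labels. A minimal sketch (hypothetical function name):

```python
def misclassification_rate(actual, optimal):
    """Delta: share (in %) of cases with a defined optimal programme whose actual
    programme differs from the optimal one (off-diagonal share); cases with an
    undefined optimal programme (None) are left aside."""
    defined = [(a, o) for a, o in zip(actual, optimal) if o is not None]
    if not defined:
        return float('nan')
    mismatched = sum(a != o for a, o in defined)
    return 100.0 * mismatched / len(defined)
```

Applied to the 1−α=0.7 panel of Table 6, the defined cases number 1,632, of which 1,004 lie off the diagonal, giving Δ ≈ 61.5%, in line with the reported value.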

Since the optimal programme choices are estimated on an individual level, the optimal allocation is difficult to summarize by a few numbers. Nevertheless, Table 7 gives some summary statistics to compare the optimal targeting with the targeting as is (as was discussed in Section 4.2). The table provides the means of selected characteristics by treatment groups according to the optimal and to the actual allocation, at the 1−α=0.5 level. Table B.2 gives this comparison for all 38 characteristics included in the estimator. In the first four columns the mean characteristics are given according to the optimal allocation. (The 2,088 individuals without defined optimal treatment are not included.) The last four columns provide the average characteristics among the actual participants and correspond to the respective figures in Table 2.
Table 7

Average characteristics by treatment group: optimal vs. actual allocation

| Variable | Optimal N | Optimal W | Optimal E | Optimal M | Actual N | Actual W | Actual E | Actual M |
|---|---|---|---|---|---|---|---|---|
| Age: 18–35 years | 12 | 20 | 59 | 52 | 31 | 34 | 37 | 31 |
| Age: 46–55 years | 40 | 62 | 10 | 30 | 41 | 31 | 32 | 36 |
| Gender: male | 56 | 36 | 48 | 44 | 45 | 45 | 46 | 46 |
| Employment status: unemployed | 2 | 27 | 47 | 2 | 20 | 9 | 32 | 21 |
| Labour market position: blue collar, high educated | 43 | 9 | 19 | 17 | 20 | 23 | 23 | 20 |
| Occupation in manufacturing | 51 | 23 | 23 | 38 | 30 | 38 | 32 | 32 |
| Previous sickness days: >60 days | 19 | 32 | 25 | 5 | 20 | 24 | 35 | 22 |
| Prior participation in vocational rehabilitation | 4 | 15 | 21 | 0 | 7 | 15 | 23 | 14 |
| Medical diagnosis: psychiatric | 20 | 21 | 11 | 15 | 18 | 13 | 28 | 18 |
| Medical recommendation: wait and see | 79 | 64 | 19 | 53 | 61 | 40 | 37 | 56 |
| Predicted employment probability | 69.3 | 52.6 | 54.5 | 67.2 | 48.5 | 51.9 | 30.2 | 41.1 |

Means or shares in percent. The columns titled optimal allocation give the average characteristics by treatment groups (N=No rehabilitation, W=workplace rehabilitation, E=educational rehabilitation, M=medical and social rehabilitation) if allocation were according to the optimal choices (ri* at the 1−α=0.5 level). The columns labelled actual allocation provide these figures according to the observed allocation (Di). The last row shows the predicted potential employment probabilities

A striking difference with respect to age can be seen. Whereas average age does not vary much by treatment group in the actual allocation, the optimal choice seems to depend strongly on the individual’s age. The young are clearly over-represented among those who are advised to participate in medical and, particularly, in educational rehabilitation, while only very few of the 46–55 year olds are best served by educational rehabilitation. With respect to gender, it seems that men should more often receive No rehabilitation, whereas women might benefit more from workplace rehabilitation. Regarding prior unemployment, it is noteworthy that only few unemployed are advised to participate in No or in medical rehabilitation, whereas they represent about half of those advised into educational rehabilitation. Educated blue collar workers are less frequently found among those served best by workplace rehabilitation, whereas manufacturing workers are over-represented among those advised to No rehabilitation. For individuals who had been sick for more than 60 days in the last 6 months or who had participated in vocational rehabilitation before, medical rehabilitation is hardly ever an unambiguously optimal choice. Furthermore, in the optimal allocation, individuals with psychiatric problems and those for whom a wait and see strategy has been advised are clearly under-represented in educational rehabilitation, compared to the actual allocation. Generally, the differences in the characteristics are much more pronounced in the optimal than in the actual allocation.

In the last row of Table 7, the predicted potential employment outcomes are averaged within the treatment groups according to the optimal and to the actual allocation. The predicted average employment rates in the actual treatment groups correspond quite well to the observed rates of Table 2. When the participants are re-allocated to the programmes in an optimal way, substantial increases in the predicted employment rates are achieved. To summarize this analysis, it is illuminating to tentatively predict the overall employment rate that could have been achieved through an optimal allocation. When allocating all individuals to their optimal programme, if defined at the 0.5 level, and all other individuals, for whom no optimal programme is defined, randomly to any programme (with equal probability), the predicted average employment rate is 54.5%. If, on the other hand, the individuals without defined optimal programme are allocated randomly to either No or workplace rehabilitation, the predicted employment rate is 55.7%. Thus, compared to the current selection process and to the employment rates that would be expected if all individuals were assigned to the same programme (see Table 3), an increase in the employment rate of about 9 percentage points could be possible through an improved participant allocation.
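The re-allocation exercise just described can be summarized in a few lines: individuals with a defined optimal programme are assigned to it, and the expected outcome of the remaining individuals is averaged over the programmes to which they would be randomized. The sketch below is illustrative only; the array layout and names are assumptions.

```python
import numpy as np

def simulated_employment_rate(preds, optimal_idx, fallback_idx):
    """Average predicted employment (in %) when everyone with a defined optimal
    programme is assigned to it, and the remaining cases are spread with equal
    probability over the programme columns listed in fallback_idx.

    preds : (n, R) predicted potential employment probabilities,
    optimal_idx : (n,) programme column index per person, -1 where undefined."""
    rates = np.empty(preds.shape[0])
    defined = optimal_idx >= 0
    rates[defined] = preds[defined, optimal_idx[defined]]
    # expected outcome under an equal-probability random assignment for the rest
    rates[~defined] = preds[~defined][:, fallback_idx].mean(axis=1)
    return 100.0 * rates.mean()
```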

These findings indicate a substantial heterogeneity in the treatment effects between individuals. The treatment effects depend on the X characteristics and the optimal programme varies with X. These conditional-on-X treatment effects, however, are not fully taken into account by the case workers in their choices of the programmes. The limited sensitivity of the case workers’ choices to their clients’ observed characteristics has already been noted in Table 7. For example, whereas the usefulness of educational rehabilitation clearly seems to decrease with age, the actual allocation to educational rehabilitation depends only little on age. By fully exploiting the differences in the conditional-on-X treatment effects, the employment rate could have been raised substantially, provided that the treatment effects are consistently estimated. The reasons for this large unexploited potential are probably manifold. On the one hand, case workers may not know the conditional-on-X treatment effects. They may also be constrained in their choices, e.g. due to limited numbers of workplace rehabilitation places. Furthermore, they might seek the cooperation of the sick person in their choices. The most important reason, however, is likely to be their different objective function. Rehabilitation serves several purposes and rapid employment is only one of these. Health and sustainability considerations might be accorded a much larger importance.

If educational rehabilitation were no longer available, the predicted average employment rate would be 54.9%, when individuals without defined optimal programme are assigned randomly to either No or to workplace rehabilitation. Thus, although educational rehabilitation is the optimal programme for some individuals, their second-best choice seems not to be much worse.

Similar results are also obtained for different sets of X variables and different moment specifications (see the sensitivity analysis in the supplementary Appendix). Compared to the above optimal allocation (with 11 subpopulations), the optimal allocations that would result if 1, 6, 16 or 21, respectively, subpopulations were included are not very different. The fraction of misclassification Δ (in %) between the main specification and any of these other specifications is at most 0.1% at the 1−α=0.7 level, at most 2.4% at the 0.6 level and at most 11% at the 0.5 level. On the other hand, if the set of 11 subpopulations is maintained but the set of explanatory variables X is altered, the estimated optimal allocations change more markedly. With a set of 28 or 30 variables, the resulting allocations are still very similar: Δ is about 0.5, 5 and 14.5% at the 0.7, 0.6, 0.5 level, respectively. However, when leaving out relevant information on sickness history, diagnosis, geographic location (and retaining only 24 variables), the misclassification rates increase to 15.8, 26.4 and almost 40%, respectively, at the different levels of 1−α. Hence, detailed information seems to be necessary to obtain informed programme choices.

Conclusions

In this paper a new semiparametric estimator for estimating conditional mean functions from incomplete data has been developed. It applies to situations where data is missing due to non-response or where it is missing by definition, e.g. in the analysis of treatment effects, where only one of the different potential outcomes can be observed for each individual.

This estimator integrates parametric regression with nonparametric matching to obtain more precise estimates in the subpopulation with missing data. Nonparametric matching estimates are used as an anchor for reducing bias in the missing-data subpopulation while retaining a reasonable fit in the full-data subpopulation. A small Monte Carlo simulation showed that considerable reductions in MSE vis-à-vis a fully parametric estimator can be achieved in misspecified parametric models. On the other hand, the efficiency losses in correctly specified models seem to be rather small. The applicability of the estimator has been illustrated by an analysis of treatment effect heterogeneity in Swedish rehabilitation programmes.

Analyzing individual heterogeneity in treatment effects is highly relevant for policy evaluation. In many evaluation studies, small or negative estimates of average treatment effects indicate an ineffective policy. These average effects, however, may mask a considerable heterogeneity in the effects between the individuals. It is important to know whether the effect is as negative for all individuals or whether it harms some while it benefits others. Estimating treatment effects on a disaggregated level, i.e. conditional on characteristics X, can help to assess the extent of treatment effect heterogeneity. These estimates can then be used to appraise the potential for policy improvements due to a better participant allocation. By predicting the treatment effects for each individual, the expected outcomes if assigned to the optimal programmes can be simulated. Comparing these with the observed outcomes gives an estimate of the effectiveness of the allocation process. For example, in the application to the Swedish rehabilitation programmes, the simulated optimal employment outcome is 56%, compared to an observed employment rate of 46%.

Footnotes

  1.

    The estimation of average treatment effects has been intensively analyzed, in particular for active labour market programmes and rehabilitation programmes. See, for example, Aakvik (2003), Abbring and van den Berg (2004) and the Special Issue on ‘Long term unemployment and social assistance’, Empirical Economics (1/2, 1998). The focus of this paper is on the heterogeneity in treatment effects, which could be exploited to improve the average effectiveness of policies through a better participant allocation.

  2.

    For a more detailed discussion see Wald (1950); Heckman et al. (1997b); Black et al. (2003); Manski (2000; 2004) and Dehejia (2004).

  3.

    See Rubin (1974); Heckman and Robb (1985); Barnow et al. (1981); Lechner (1999).

  4.

    E.g. in the case of panel attrition, X may refer to information collected in the baseline period.

  5.

    See e.g. Angrist (1998), Heckman et al. (1998b); Dehejia and Wahba (1999); Lechner (1999); Gerfin and Lechner (2002) and Jalan and Ravallion (2003), among many others.

  6.

    The support restriction is incorporated by considering only observations with \(\widehat{p}_{i} > 0\), because \(S_x = \{x: f_{X \mid D=1}(x) > 0\} = \{x: p(x) > 0\}\).

  7.

    This permits a simpler derivation of the asymptotic properties. For the practical implementation, either Eq. 5 or Eq. 6 can be used.

  8.

    More precisely, let \(\widehat{m}_{{vl}} {\left( \rho \right)}\) for ρ>0 be an estimator of the expectation E[Yvp(X)=ρ, Λl(X)=1], i.e. the expectation of the v-th variable of the outcome vector Y conditional on the propensity score in the l-th subpopulation. Let \(\widehat{m}_{l} {\left( \cdot \right)} = {\left( {\widehat{m}_{{1l}} {\left( \cdot \right)}, \ldots ,\widehat{m}_{{vl}} {\left( \cdot \right)}, \ldots \widehat{m}_{{Vl}} {\left( \cdot \right)}} \right)}^{\prime } \) be the element-wise-defined estimator of the outcome vector Y in the population l, i.e. of E[Yp(X)=ρ, Λl(X)=1]. Stacking these estimators for the L subpopulations and multiplying element-wise with the population indicator function gives \( \widehat{m}_{{VL}} {\left( {\widehat{p}{\left( {X_{i} } \right)}} \right)} = {\left( {\widehat{m}^{\prime }_{1} {\left( {p{\left( {X_{i} } \right)}} \right)} \cdot \Lambda _{1} {\left( {X_{i} } \right)}, \ldots \widehat{m}^{\prime }_{l} {\left( {\widehat{p}{\left( {X_{i} } \right)}} \right)} \cdot \Lambda _{l} {\left( {X_{i} } \right)}, \ldots ,\widehat{m}^{\prime }_{L} {\left( {\widehat{p}{\left( {X_{i} } \right)}} \right)} \cdot \Lambda _{L} {\left( {X_{i} } \right)}} \right)}^{\prime } \).

  9.

    When a standard propensity score matching routine is used, care should be exercised to ensure that the lower VL moments in (10) are summed over the same observations as in \(\widehat{\mu }\) and are scaled in the same way. For example, if the propensity score matching routine estimates the mean counterfactual outcome \(\frac{\sum \widehat{m}_{VL}\left( \widehat{p}_i \right)\left( 1 - D_i \right)1\left( \widehat{p}_i > 0 \right)}{\sum \left( 1 - D_i \right)1\left( \widehat{p}_i > 0 \right)}\) instead of \(\frac{\sum \widehat{m}_{VL}\left( \widehat{p}_i \right)\left( 1 - D_i \right)1\left( \widehat{p}_i > 0 \right)}{n}\), then the lower VL moments must also be scaled accordingly.

  10.

    This includes one-to-one or pair matching.

  11.

    Using Epanechnikov instead of Gaussian kernel, and vice versa, led to largely similar results.

  12.

    The X data are scaled in the estimator to mean zero and variance one.

  13.

    The expected outcomes vary considerably among these subpopulations. Whereas with DGP 1, the expected outcome is 13.1 for the respondents and 5.3 for the non-respondents, the outcome difference between respondents and non-respondents can be as large as 8.2 (for subpopulations ten and eleven) and as small as 0.8 (for subpopulation fourteen). Similar heterogeneity occurs for DGP 2 and 3. For instance, in DGP 2 the expected outcome for the respondents is usually larger than for the non-respondents, but this relationship is reversed in subpopulation five. In DGP 2, the expected outcomes for respondents and non-respondents are 2.2 and 1.5, respectively, and in DGP 3 these figures are 9.6 and 4.3.

  14.

    See Angrist and Krueger (1999) and Heckman et al. (1999) for an overview.

  15.

    Unless the past participants have been assigned randomly to the programmes.

  16.

    Regularly employed individuals receive sickness benefits from the employer for the first two weeks and from the insurance office thereafter. Unemployed and self-employed individuals receive benefits directly from the insurance office. Sickness benefits amount to 80% of previous earnings, adjusted for the degree of lost working capacity and capped at an upper ceiling, and can be received for an unlimited period.

  17.

    Medical and social rehabilitation are not coordinated by the insurance office.

  18.

    The insurance offices themselves do not conduct rehabilitative activities.

  19.

    A number of cases received more than one type of rehabilitation. Since it is known neither whether these measures were given in parallel or sequentially nor in which order, these cases were assigned to what was presumably the first or principal rehabilitative measure received. In most cases this was medical rehabilitation, which is likely to be the first measure. Second priority is given to workplace rehabilitation, since workplace rehabilitation is usually full-time while educational training may run alongside. For further details on the data see Frölich et al. (2004).

  20.

    The reason for the latter is that the assessment refers to vocational rehabilitation.


Acknowledgment

The author is also affiliated with the Institute for the Study of Labor (IZA), Bonn. I am grateful for discussions and comments to Bo Honoré, Francois Laisney, Michael Lechner, Ruth Miquel, Oivind Nilsen, Jeff Smith, the editor and three anonymous referees. This research was supported by the Swiss National Science Foundation (project NSF 4043-058311) and the Grundlagenforschungsfonds HSG (project G02110112).

References

  1. Aakvik A (2003) Estimating the employment effects of education for disabled workers in Norway. Empir Econ 28:515–533
  2. Abbring J, van den Berg G (2004) Analyzing the effect of dynamically assigned treatments using duration models, binary treatment models, and panel data models. Empir Econ 29:5–20
  3. Angrist J (1998) Estimating the labour market impact of voluntary military service using social security data. Econometrica 66:249–288
  4. Angrist J, Krueger A (1999) Empirical strategies in labor economics. In: Ashenfelter O, Card D (eds) The handbook of labor economics, III. North-Holland, New York, pp 1277–1366
  5. Barnow B, Cain G, Goldberger A (1981) Selection on observables. Evaluation Studies Review Annual 5:43–59
  6. Black D, Smith J, Berger M, Noel B (2003) Is the threat of reemployment services more effective than the services themselves? Evidence from random assignment in the UI system. Am Econ Rev 93:1313–1327
  7. Dehejia R (2004) Program evaluation as a decision problem. J Econ, forthcoming
  8. Dehejia R, Wahba S (1999) Causal effects in non-experimental studies: reevaluating the evaluation of training programmes. J Am Stat Assoc 94:1053–1062
  9. Fan J (1992) Design-adaptive nonparametric regression. J Am Stat Assoc 87:998–1004
  10. Frölich M (2004) Finite sample properties of propensity-score matching and weighting estimators. Rev Econ Stat 86:77–90
  11. Frölich M (2005) Matching estimators and optimal bandwidth choice. Stat Comput 15(3):197–215
  12. Frölich M, Heshmati A, Lechner M (2004) A microeconometric evaluation of rehabilitation of long-term sickness in Sweden. J Appl Econ 19:375–396
  13. Gerfin M, Lechner M (2002) Microeconometric evaluation of the active labour market policy in Switzerland. Econ J 112:854–893
  14. Hahn J (1998) On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66:315–331
  15. Hansen LP (1982) Large sample properties of generalized method of moments estimators. Econometrica 50:1029–1054
  16. Heckman J, Robb R (1985) Alternative methods for evaluating the impact of interventions. In: Heckman J, Singer B (eds) Longitudinal analysis of labour market data. Cambridge University Press, Cambridge
  17. Heckman J, Ichimura H, Todd P (1997) Matching as an econometric evaluation estimator: evidence from evaluating a job training programme. Rev Econ Stud 64:605–654
  18. Heckman J, Smith J, Clements N (1997) Making the most out of programme evaluations and social experiments: accounting for heterogeneity in programme impacts. Rev Econ Stud 64:487–535
  19. Heckman J, Ichimura H, Todd P (1998) Matching as an econometric evaluation estimator. Rev Econ Stud 65:261–294
  20. Heckman J, Ichimura H, Smith J, Todd P (1998) Characterizing selection bias using experimental data. Econometrica 66:1017–1098
  21. Heckman J, LaLonde R, Smith J (1999) The economics and econometrics of active labour market programs. In: Ashenfelter O, Card D (eds) The handbook of labor economics, III. North-Holland, New York, pp 1865–2097
  22. Jalan J, Ravallion M (2003) Estimating the benefit incidence of an antipoverty program by propensity-score matching. J Bus Econ Stat 21:19–30
  23. Lechner M (1999) Earnings and employment effects of continuous off-the-job training in east Germany after unification. J Bus Econ Stat 17:74–90
  24. Little R, Rubin D (1987) Statistical analysis with missing data. Wiley, New York
  25. Manski C (2000) Identification problems and decisions under ambiguity: empirical analysis of treatment response and normative analysis of treatment choice. J Econ 95:415–442
  26. Manski C (2004) Statistical treatment rules for heterogeneous populations. Econometrica 72:1221–1246
  27. Rosenbaum P, Rubin D (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70:41–55
  28. Rubin D (1974) Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol 66:688–701
  29. Seifert B, Gasser T (1996) Finite-sample variance of local polynomials: analysis and solutions. J Am Stat Assoc 91:267–275
  30. Seifert B, Gasser T (2000) Data adaptive ridging in local polynomial regression. J Comput Graph Stat 9:338–360
  31. Wald A (1950) Statistical decision functions. Wiley, New York

Copyright information

© Springer-Verlag 2006

Authors and Affiliations

  1. Department of Economics, University College London, London, UK
  2. University of St. Gallen, SIAW, St. Gallen, Switzerland
