Semiparametric estimation of conditional mean functions with missing data
Abstract
A new semiparametric estimator for estimating conditional expectation functions from incomplete data is proposed, which integrates parametric regression with nonparametric matching estimators. Besides its applicability to missing data situations due to non-response or attrition, the estimator can also be used for analyzing treatment effect heterogeneity and statistical treatment rules, where data on potential outcomes is missing by definition. By combining moments from a parametric specification with nonparametric estimates of mean outcomes in the non-responding population within a GMM framework, the estimator seeks to balance a good fit in the responding population with low bias in the non-responding population. The estimator is applied to analyzing treatment effect heterogeneity among Swedish rehabilitation programmes.
Introduction
In many empirical applications interest lies in estimating a conditional mean function E[Y∣X], yet often the outcome variable Y is observed for only part of the sample. In this paper, a new semiparametric estimator for dealing with such situations is proposed. This estimator combines parametric regression with nonparametric matching estimators (Heckman et al. 1997a, 1998a) to reduce the bias of the estimated conditional mean function in the subpopulation where Y is unobserved.
Consider two motivating examples for the applicability of this estimator. Missing data is one example. In most applied work with survey data, item non-response and/or panel attrition are frequent. Data may be missing on Y for some individuals, while data on X is still available for them. For example, with panel data, X may refer to the response in the baseline survey, whereas the observability of Y in follow-up periods depends on attrition.
Counterfactual outcomes, treatment effects and treatment choice are a second example where the situation analyzed in this paper applies. Consider a situation where each member of a population chooses one of two options: an unemployed person may or may not take part in an active labour market programme, a physician may choose between two different therapies for a patient, etc.^{1} For analyzing the effects of treatment it is necessary to contrast the expected outcome if choosing the first option with the outcome if choosing the other option, given some covariates X. Since every individual can be observed only in one of the two states, half of the potential outcomes data is missing by definition. For the treated, their counterfactual outcome in the case of non-treatment is unobserved, whereas for the non-treated their counterfactual treatment outcome cannot be observed. Estimates of the expected counterfactual mean functions are needed to analyze heterogeneity in the effects. Such estimates are also a basic input for statistical treatment rules, where assignment to treatment is based on predictions of the expected treatment effects for each individual. For example, expected unemployment duration with and without active labour market programmes may be used (by the case worker) to allocate unemployed persons to active labour market programmes. Such statistical treatment rules may thus assist in a more precise targeting of policies.^{2}
The proposed semiparametric estimator combines the information on Y and X for the respondents with the information on X for the non-respondents in a generalized method of moments (GMM) framework. The estimator is applicable in situations where data is missing at random conditional on X (Little and Rubin 1987) or where, conditional on X, treatment assignment is ignorable (selection is on observables).^{3} Validity of this assumption often requires a high dimensional X vector, making purely nonparametric estimation unreliable in finite samples.
Under this assumption, the conditional mean function E[Y∣X] is identified from the data of the respondents. The non-responding observations are of no value for identification. In a parametric or semiparametric framework, however, the information contained in the X observations of the non-respondents may help in obtaining more precise estimates. The reason for this is that a parametric estimator which completely neglects the non-responding observations may inadvertently fit a regression plane that is heavily biased in the non-responding population. A parametric estimator using only the responding observations seeks to minimize MSE in the responding population. This may, however, not be the best fit with respect to the entire population if the density of X differs between respondents and non-respondents (which is usually the case in a treatment evaluation context, where participants and non-participants are often rather dissimilar in their characteristics).
The basic idea of the semiparametric estimator is to use nonparametric estimates of the mean counterfactual outcomes for the non-responding population to measure the average bias of the parametric regression plane in the non-responding population. These mean counterfactual outcomes can be estimated nonparametrically at rate \({\sqrt n }\) by (propensity score) matching estimators (Hahn 1998; Heckman et al. 1998b). As these estimates do not depend on the specification of the parametric regression plane, they can be used to quantify the bias of the parametric model in the non-responding population for any value of the coefficient vector. The semiparametric estimator attempts to choose the regression plane such that it fits well in the responding population and has low bias in the non-responding population.
The asymptotic properties of this estimator are investigated in Section 2, while Section 3 analyzes its finite sample properties in a Monte Carlo simulation. In Section 4, the estimator is applied to analyzing treatment effect heterogeneity and treatment choice among Swedish rehabilitation programmes for the long-term sick. The treatment effects for participation in workplace, educational and medical rehabilitation on employment are estimated on an individual level, and a statistical treatment selection rule based on these estimates is illustrated. Section 5 concludes. Appendices A and B contain further results. A supplementary Appendix with proofs and additional material is available on the internet: http://www.siaw.unisg.ch/froelich.
Semiparametric estimation of conditional mean functions
Interest lies in estimating a conditional mean function E[Y∣X] from a sample of iid observations with missing data: {X_{i}, D_{i}, Y_{i}D_{i}}_{i=1}^{n}, where \(X_i \in \Re ^K\) is a vector of covariates and D_{i}∈{0,1} is a missing value indicator. The outcome variable of interest \(Y_{i} \in {\Re }^{V} \) is only observed when D_{i}=1. When D_{i}=0, it is unobserved. The covariates X_{i} are always observed.^{4}
Data on Y may be missing due to non-response or attrition. Alternatively, data may be missing by definition. A particular example for this is treatment evaluation. An individual may choose between different treatment options, and according to her choice an outcome is observed. The outcomes she would have realized had she chosen differently cannot be observed, though. Suppose treatment is binary: an individual takes part in a particular programme or does not. Denote by Y_{i}^{0},Y_{i}^{1} her potential outcomes. Y_{i}^{1} is the outcome that would be realized if she participated in treatment, whereas Y_{i}^{0} would be realized otherwise. By definition, the outcome Y_{i}^{1} is missing for non-participants, whereas Y_{i}^{0} is missing for participants. Hence, half of the potential outcomes are missing. Evaluating the effect of treatment, however, requires estimates of the missing counterfactual outcomes.
The idea of the semiparametric approach is to estimate the average bias of the parametric regression plane in the non-responding population and to use this estimate to choose θ such that the regression plane fits well in the responding population and, at the same time, is on average (almost) unbiased in the non-responding population. This could proceed iteratively or, as suggested in this paper, simultaneously in a single estimator where goodness-of-fit in the responding population and average bias in the non-responding population are traded off against each other in a method of moments framework.
The definition of the mean outcome is restricted to the support of X in the responding population S_{x}={x:f_{X∣D=1}(x)>0}, because E[Y∣X,D=1] is not identified out of the support. In principle, as shown in Hahn (1998) and Heckman et al. (1998b), the mean outcome (Eq. 3) can be estimated at \({\sqrt n }\)-rate by a matching estimator which is based on nonparametric estimates of E[Y∣X,D=1] obtained from the D=1 sample.
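A matching estimator of this kind can be sketched as follows. The sketch assumes that the propensity scores have already been estimated and that the mean outcome of interest is the average of E[Y∣p(X), D=1] over the non-responding observations. The function name `matching_mean`, the Gaussian kernel, the fixed bandwidth `h` and the simulated data are illustrative assumptions, and no trimming or support restriction is applied.

```python
import numpy as np

def matching_mean(p, D, Y, h):
    """Kernel matching estimate of the mean outcome among non-respondents.

    p : estimated propensity scores, shape (n,)
    D : response indicator (1 = Y observed), shape (n,)
    Y : outcomes; entries with D == 0 are ignored, shape (n,)
    h : kernel bandwidth (assumed to be given)
    """
    p1, y1 = p[D == 1], Y[D == 1]          # respondents
    p0 = p[D == 0]                         # non-respondents
    # Gaussian kernel weights between each non-respondent and each respondent
    u = (p0[:, None] - p1[None, :]) / h
    w = np.exp(-0.5 * u ** 2)
    # Nadaraya-Watson estimate of E[Y | p, D=1] at each non-respondent's score
    m_hat = (w @ y1) / w.sum(axis=1)
    # Average the imputed outcomes over the non-responding sample
    return m_hat.mean()

# Hypothetical data: response probability and outcome both depend on p
rng = np.random.default_rng(0)
p = rng.uniform(0.2, 0.8, size=400)
D = (rng.uniform(size=400) < p).astype(int)
Y = np.where(D == 1, 1.0 + 2.0 * p, np.nan)   # Y is missing when D == 0
mu_hat = matching_mean(p, D, Y, h=0.05)
```

In this toy design the target is 1 + 2·mean(p) among the non-respondents, so `mu_hat` can be compared with that value directly.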
In principle, the semiparametric approach could proceed iteratively. First, the parametric model is estimated to obtain values of θ. With these \( \widehat{\theta } \), the average biases are estimated and these bias estimates are then used to obtain new estimates of θ. A more convenient approach can be obtained by integrating both aims (goodness-of-fit in the responding population and low bias in the non-responding population) in a single estimator based on moment conditions. One set of moment conditions is given by the average biases (Eq. 8), which have expectation zero in the case of correct parametric specification.
Estimation of θ is straightforward. First, the propensity score is estimated, e.g. by probit or logit. Second, \(\widehat{\mu }\) is estimated by propensity score matching, separately for the different subpopulations. In principle, any propensity score matching routine can be used. With \(\widehat{\mu }\) estimated, the moment function (Eq. 10) depends only on θ, and the quadratic form of the average moment function (Eq. 11) can be minimized, for any choice of W. For example, W might be chosen as a diagonal matrix which gives half of the weights to the first K moments and the other half to the second VL moments.^{9}
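The final minimization step can be illustrated with a minimal sketch. Since Eqs. 10 and 11 are not reproduced here, the sketch rests on assumptions of its own: a linear specification φ(x, θ) = x′θ with a scalar outcome (V=1), parametric moments equal to the least-squares normal equations among respondents, and bias moments equal to the average fitted value minus \(\widehat{\mu }\) among the non-respondents of each subpopulation. Under these assumptions the average moment vector is linear in θ and the quadratic form has a closed-form minimizer. The function name `gmm_linear` and the data are hypothetical.

```python
import numpy as np

def gmm_linear(X, D, Y, groups, mu_hat, W):
    """Minimize gbar(theta)' W gbar(theta) for a linear model (assumed moment functions).

    X : (n, K) regressors, D : (n,) response indicator, Y : (n,) outcomes,
    groups : (n, L) 0/1 subpopulation indicators, mu_hat : (L,) matching estimates,
    W : (K + L, K + L) weighting matrix.
    """
    n = X.shape[0]
    y_obs = np.where(D == 1, Y, 0.0)        # unobserved Y never enters
    Xr = X * D[:, None]                     # rows of non-respondents set to zero
    # Parametric moments: (1/n) sum D_i x_i (y_i - x_i'theta) = a1 - B1 theta
    a1 = Xr.T @ y_obs / n
    B1 = Xr.T @ X / n
    # Bias moments: (1/n) sum (1 - D_i) 1{i in l} (x_i'theta - mu_l) = B2 theta - a2
    G = groups * (1 - D)[:, None]
    B2 = G.T @ X / n
    a2 = G.sum(axis=0) * mu_hat / n
    # Stack so that gbar(theta) = a - B theta, then solve the first-order condition
    a = np.concatenate([a1, -a2])
    B = np.vstack([B1, -B2])
    return np.linalg.solve(B.T @ W @ B, B.T @ W @ a)

# Hypothetical noiseless example in which the linear model is correct
rng = np.random.default_rng(1)
n, K, L = 300, 3, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 2, n), rng.uniform(0, 2, n)])
D = (rng.uniform(size=n) < 0.5).astype(int)
theta0 = np.array([1.0, 2.0, -1.0])
Y = np.where(D == 1, X @ theta0, np.nan)
groups = np.column_stack([np.ones(n), X[:, 1] < 1.0]).astype(float)
# Stand-in for the matching estimates: true means among non-respondents
mu_hat = np.array([(X @ theta0)[(D == 0) & (groups[:, l] == 1)].mean()
                   for l in range(L)])
# Half of the total weight on the K parametric moments, half on the L bias moments
W = np.diag(np.r_[np.full(K, 0.5 / K), np.full(L, 0.5 / L)])
theta_hat = gmm_linear(X, D, Y, groups, mu_hat, W)
```

Because all moments are exactly zero at `theta0` in this example, `theta_hat` reproduces `theta0` for any positive definite W. With a nonlinear φ the quadratic form would have to be minimized numerically instead.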
One condition for this result is that the preliminary estimators \(\widehat{p}\) and \(\widehat{m}\) are asymptotically linear with trimming. Parametric and nonparametric local polynomial regression estimators belong to this class as shown in Heckman et al. (1998b), provided certain regularity conditions are met. Hence, for the propensity score estimated by a probit or logit and m estimated by Nadaraya–Watson kernel or local linear regression, \(\widehat{\theta }_{n} \) is asymptotically normally distributed in correctly specified models. For nearest neighbour regression, on the other hand, this does not seem to hold.^{10}
The choice of W determines the weights given to the two objectives of the estimator: goodness-of-fit in the responding population and low bias in the non-responding population. It thereby also affects the properties of the estimator. With a correct parametric specification, the efficient weighting matrix would be the inverse of the covariance matrix of the moment vector [EJJ′]^{−1} (Hansen 1982). This efficient GMM estimator can be obtained by a two-step procedure. First, an arbitrary initial weighting matrix W is chosen to obtain the first step estimates of θ. With these estimates, \( {\left[ {\widehat{E}JJ^{\prime } } \right]}^{{ - 1}} \) is estimated and is then used as the weighting matrix in the second step. If the parametric model is misspecified, on the other hand, the second step GMM estimator is not necessarily superior to the first step GMM estimator, since the ‘efficient’ weighting by [EJJ′]^{−1} takes only the variance but not the bias of the parametric specification into account. This leads to a weighting matrix which assigns most of the weight to the K parametric moments and little to the nonparametric moments, because the variance of the nonparametric estimates is much higher than that of the parametric moments. However, the uncertainty that stems from not knowing the true form of the conditional expectation function is not incorporated in these weights. Hence robustness to misspecification is not reflected in the weighting matrix [EJJ′]^{−1}.
In this section, a semiparametric estimator for estimating a parametric regression plane with lower bias in the non-responding population has been proposed, and several properties of this estimator have been derived. In correctly specified models, and with particular propensity score matching estimators, the GMM estimator is \(\sqrt{n}\)-consistent and asymptotically normal. The GMM objective function is asymptotically χ^{2} distributed and can be used for testing the correctness of the parametric model. On the other hand, if the model is misspecified, the GMM estimator attempts to choose a regression plane with low bias among the non-respondents while maintaining a good fit among the respondents. To examine the behaviour of this estimator and of the specification test in finite samples, a Monte Carlo simulation is conducted in the next section.
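The way the second step weighting penalizes noisy moments can be seen in a small sketch. It assumes that the moment contributions of each observation, evaluated at the first step estimates, are collected in the rows of a matrix `J`, and it uses the plain sample analogue of [EJJ′]^{−1}. This is a simplification: it ignores the contribution of the estimated \(\widehat{p}\) and \(\widehat{\mu }\) to the variance of the moments. The function name and the numbers are hypothetical.

```python
import numpy as np

def second_step_weight(J):
    """Inverse of the sample second-moment matrix of the moment contributions.

    J : (n, q) array; row i holds the moment vector of observation i,
        evaluated at the first step estimate of theta.
    """
    n = J.shape[0]
    S = J.T @ J / n          # sample analogue of E[J J']
    return np.linalg.inv(S)

# Two moments: the second has four times the variance of the first
J = np.array([[ 1.0,  0.0],
              [-1.0,  0.0],
              [ 0.0,  2.0],
              [ 0.0, -2.0]])
W2 = second_step_weight(J)
```

Here the sample second-moment matrix is diag(0.5, 2), so `W2` is diag(2, 0.5): the noisier moment receives a quarter of the weight of the other one, regardless of how informative it is about misspecification.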
Monte Carlo simulation
In a small Monte Carlo experiment the finite sample properties of the semiparametric estimator of the conditional mean function E[Y∣X] are assessed. The simulations should give some indications on the performance of the semiparametric estimator in comparison to parametric estimation under correct and under incorrect specification. In addition, the sensitivity to the number of subpopulations L and their size, to the choice of the estimator \(\widehat{m}\) and to the weighting matrix W is examined. Finally, the properties of the J-test are analyzed, which, however, turn out to be rather unsatisfactory.
The mean squared errors of the parametric, the first step and the second step GMM estimators are simulated for different simulation designs. The outcome variable Y is one-dimensional. Hence, V=1 and the number of overidentifying moments is equal to L. The parametric estimator is equivalent to the GMM estimator with L=0 subpopulations. The first and second step GMM estimators are computed for different numbers of subpopulations L to examine their sensitivity to the number of overidentifying moments. The weighting matrix W for the first step GMM estimator is diagonal, with the first K entries being 1/K and the remaining entries being 1/L. Hence, equal weight is given to the parametric and to the nonparametric moments, as a whole. The second step GMM estimator uses the inverse covariance matrix \( {\left[ {\widehat{E}JJ\prime } \right]}^{{ - 1}} \) as weighting matrix, which is evaluated at the first step coefficient estimates using the asymptotic expression given in the previous section.
The Monte Carlo simulations proceed by repeatedly drawing estimation and validation samples from the same population, estimating the coefficients θ from the estimation sample and computing the mean squared error (MSE) in the validation sample. The estimation sample {(X_{i}, D_{i}, Y_{i}D_{i})}_{i=1}^{n} consists of either 500 or 2000 observations, with Y_{i} observed only if D_{i}=1. The validation sample contains 10000 draws of X and D. With the coefficients \( \widehat{\theta } \) estimated from the estimation sample, the expected outcomes \(\widehat{E}{\left[ {Y\left| X \right.} \right]}\) are imputed by \( \varphi {\left( {X,\widehat{\theta }} \right)} \) for all observations of the validation sample and compared with the true expected outcomes E[Y∣X] to simulate the MSE.
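The first step weighting matrix described above can be built in a few lines. The function name `first_step_weight` is a hypothetical helper, and the values K=4 and L=14 are only an example.

```python
import numpy as np

def first_step_weight(K, L):
    """Diagonal W with entries 1/K for the K parametric moments
    and 1/L for the L nonparametric moments (requires L >= 1)."""
    return np.diag(np.concatenate([np.full(K, 1.0 / K), np.full(L, 1.0 / L)]))

W = first_step_weight(K=4, L=14)
```

The diagonal entries of each block sum to one, so the two groups of moments carry the same total weight however many subpopulations are included.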
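The MSE computation on the validation sample can be sketched as follows. The helper `validation_mse`, the linear `phi`, the quadratic true mean and the tiny validation sample are illustrative assumptions; the second return value restricts the average to the D=0 observations of the validation sample.

```python
import numpy as np

def validation_mse(X_val, D_val, theta_hat, phi, true_mean):
    """MSE of phi(X, theta_hat) against the true E[Y|X] on a validation sample.

    Returns the MSE over all validation draws and over the D == 0 draws only.
    """
    sq_err = (phi(X_val, theta_hat) - true_mean(X_val)) ** 2
    return sq_err.mean(), sq_err[D_val == 0].mean()

# Hypothetical example: a linear fit evaluated against a quadratic truth
phi = lambda X, theta: X @ theta            # parametric specification
true_mean = lambda X: X[:, 1] ** 2          # true E[Y|X], known in a simulation
X_val = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
D_val = np.array([1, 0, 0])
mse_all, mse_d0 = validation_mse(X_val, D_val, np.array([0.0, 1.0]), phi, true_mean)
```

The squared errors are 0, 0 and 4, so the overall MSE is 4/3 and the MSE among the D=0 draws is 2.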
In each replication, first the nonparametric mean outcomes \(\widehat{\mu }\) are estimated by propensity score matching, separately for each subpopulation. The propensity scores p_{i} are estimated by probit and the regression curves m(p) are estimated nonparametrically in the various subpopulations either by Nadaraya–Watson kernel regression or by local linear ridge regression. Ridge regression is a variant of local linear regression with better small sample properties. Local linear regression is well known for its favourable asymptotic properties (Fan 1992), but in small samples it can be very erratic because of zero or near-zero denominators in the calculation of the estimator. By adding a ridge parameter to the denominator, ridge regression can avoid the high variance problems of the unmodified local linear estimator. At the same time, with the ridge parameter converging to zero with growing sample size, asymptotically both estimators are equivalent, see Seifert and Gasser (1996, 2000). In essence, ridge regression is a convex combination of the Nadaraya–Watson kernel and the local linear estimator, where the weight given to the local linear estimator increases with growing sample size. In a comparison study of the properties of alternative propensity score matching estimators in finite samples (Frölich 2004, 2005), propensity score matching based on ridge regression clearly dominated matching based on local linear regression and also often performed slightly better than Nadaraya–Watson kernel based matching. In the Monte Carlo simulations below, results are given for Nadaraya–Watson kernel matching (with Gaussian kernel) and for ridge matching (with Epanechnikov kernel).^{11} The bandwidth is chosen by leave-one-out cross validation from the grid: 0.0001, 0.0001·1.4^{1},..., 0.0001·1.4^{28}, ∞. With \(\widehat{\mu }\) estimated, the GMM estimator can be computed.
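The bandwidth search can be sketched for the Nadaraya–Watson case with a Gaussian kernel. The grid is the one stated above. The function name `loo_cv_bandwidth`, the rule of skipping bandwidths at which some observation has no neighbours with positive weight, and the simulated data are assumptions of this sketch; in the matching context the regression would be run on the respondents within each subpopulation.

```python
import numpy as np

# Grid from the text: 0.0001 * 1.4**j for j = 0, ..., 28, plus an infinite bandwidth
GRID = np.append(1e-4 * 1.4 ** np.arange(29), np.inf)

def loo_cv_bandwidth(p, y, grid=GRID):
    """Leave-one-out cross-validation for Nadaraya-Watson regression of y on p."""
    diff = p[:, None] - p[None, :]
    best_h, best_score = None, np.inf
    for h in grid:
        # Gaussian kernel weights; h = inf gives equal weights (a global mean)
        w = np.exp(-0.5 * (diff / h) ** 2)
        np.fill_diagonal(w, 0.0)            # leave observation i out of its own fit
        denom = w.sum(axis=1)
        if np.any(denom <= 0.0):            # some point has no usable neighbours
            continue
        score = np.mean((y - (w @ y) / denom) ** 2)
        if score < best_score:
            best_h, best_score = h, score
    return best_h, best_score

# Hypothetical data with a clearly nonlinear regression curve
rng = np.random.default_rng(2)
p = rng.uniform(0, 1, 200)
y = np.sin(6 * p) + 0.1 * rng.standard_normal(200)
h_star, cv_star = loo_cv_bandwidth(p, y)
```

Including ∞ in the grid lets cross-validation fall back on a simple mean when the data show no dependence on the propensity score; for the curved example above a finite bandwidth is selected.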
A conceptual difference between the GMM and the LSIR estimator is that the latter attempts to minimize squared bias conditional on X, whereas the former aims at minimizing squared bias conditional on larger subpopulations (the L subpopulations). By restricting itself to larger subpopulations, all nonparametric components in the GMM estimator (i.e. the \(\widehat{\mu }\)) converge at \(\sqrt{n}\)-rate. On the other hand, the nonparametric estimates of E[Y∣X] in the LSIR estimator converge at lower rates if X contains at least one continuous variable.
The properties of these estimators are examined for different simulation designs. The X characteristics consist of 3 explanatory variables (X_{i1}, X_{i2}, X_{i3}) drawn from the (non-symmetric) χ_{(2)}^{2}, χ_{(3)}^{2}, χ_{(4)}^{2} distribution and divided by 2,3,4, respectively, to standardize their mean. D_{i} is determined by D_{i}=1(X_{i1}+X_{i2}+X_{i3}+ɛ_{i}>4.5), with ɛ standard normally distributed. The mean of D is 0.46.
DGP 1: Y_{i}=X_{i1}^{2}+X_{i2}^{2}+X_{i3}^{2}+ξ_{i}
DGP 2: \(Y_{i} = {\sqrt {X_{{i1}} - 0.5} } + 2{\sqrt {X_{{i2}} - 0.5} } - {\sqrt {X_{{i3}} - 0.5} } + \xi _{i} \)
DGP 3: Y_{i}=X_{i1}X_{i2}+X_{i1}X_{i3}+X_{i2}X_{i3}+ξ_{i}
Specification | K | Regressors |
---|---|---|
φ_{0} | 4 | Const, X_{i1}, X_{i2}, X_{i3} |
φ_{1} | 4 | Const, X_{i1}^{2}, X_{i2}^{2}, X_{i3}^{2} |
φ_{2} | 4 | Const, \(\sqrt{X_{i1} - 0.5}\), \(\sqrt{X_{i2} - 0.5}\), \(\sqrt{X_{i3} - 0.5}\) |
φ_{3} | 7 | Const, X_{i1}, X_{i2}, X_{i3}, X_{i1}X_{i2}, X_{i1}X_{i3}, X_{i2}X_{i3} |
To assess the sensitivity of the GMM estimator to the number of subpopulations, the estimators are computed with L=1, 4, 7, 10 and 14 subpopulations. (L=0 corresponds to OLS.) If the mean squared error does not fall appreciably with L, additional subpopulations would seem to be of little value. This would imply that in empirical applications of the estimator a very small L would often suffice, thereby reducing computation time. A natural procedure for defining the subpopulations would begin with the largest population and subsequently include smaller and smaller subpopulations, because the precision in estimating the average bias decreases in smaller subpopulations. The first subpopulation is the entire (non-responding) population. Subpopulations two to four are defined by X_{1}<1.5, X_{2}<1.5, and X_{3}<1.5, respectively, and each contains about 60% of the entire non-responding population. Subpopulations five to seven are defined by {X_{1}<1.5 ∧ X_{2}<1.5}, {X_{1}<1.5 ∧ X_{3}<1.5} and {X_{2}<1.5 ∧ X_{3}<1.5}, respectively, with each covering about 37% of the population. Subpopulations eight to ten each contain about 30% and are defined by X_{1}<1, X_{2}<1, and X_{3}<1, respectively. Finally, subpopulations 11 to 14 are X_{1}>2, X_{2}>2, X_{3}>2, and {X_{1}<1.5 ∧ X_{2}<1.5 ∧ X_{3}<1.5}, respectively, and each covers only about 20% of the population.^{13} Subpopulations with fewer than ten responding observations or fewer than ten non-responding observations are dropped in the GMM estimator to reduce the impact of very imprecise estimates.
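The rule for dropping small subpopulations can be sketched as follows. The helper `keep_subpopulations` and the small artificial sample are hypothetical; the indicator matrix would in practice be built from the subpopulation definitions given above.

```python
import numpy as np

def keep_subpopulations(groups, D, min_obs=10):
    """Indices of subpopulations with at least min_obs responding
    and at least min_obs non-responding observations."""
    groups = np.asarray(groups, dtype=bool)
    n_resp = (groups & (D == 1)[:, None]).sum(axis=0)
    n_nonresp = (groups & (D == 0)[:, None]).sum(axis=0)
    return np.flatnonzero((n_resp >= min_obs) & (n_nonresp >= min_obs))

# Artificial sample: 20 respondents followed by 20 non-respondents
D = np.r_[np.ones(20, dtype=int), np.zeros(20, dtype=int)]
idx = np.arange(40)
groups = np.column_stack([
    np.ones(40, dtype=bool),        # entire population: 20 and 20
    (idx < 5) | (idx >= 20),        # only 5 respondents, so it is dropped
    (idx < 15) | (idx >= 25),       # 15 respondents and 15 non-respondents
])
kept = keep_subpopulations(groups, D)
```

Only the first and third subpopulation survive in this example, so the corresponding two bias moments would enter the GMM estimator.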
Table 1 Mean squared error (sample size 500, ridge matching estimator)
Estimator | L | DGP 1, φ_{0} | DGP 1, φ_{1} | DGP 1, φ_{2} | DGP 1, φ_{3} | DGP 2, φ_{0} | DGP 2, φ_{1} | DGP 2, φ_{2} | DGP 2, φ_{3} | DGP 3, φ_{0} | DGP 3, φ_{1} | DGP 3, φ_{2} | DGP 3, φ_{3} |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
OLS | L=0 | 9.7 | 0.0 | 22.0 | 13.4 | 9.6 | 36.9 | 2.1 | 12.4 | 2.5 | 4.5 | 6.3 | 0.0 |
LSIR | | 7.0 | 0.3 | 17.3 | 7.4 | 13.5 | 34.1 | 8.9 | 15.0 | 1.8 | 4.2 | 3.3 | 0.3 |
GMM1 | L=14 | 6.8 | 0.0 | 17.6 | 7.5 | 10.3 | 33.7 | 4.9 | 13.8 | 1.6 | 4.3 | 3.1 | 0.1 |
GMM1 | L=10 | 6.8 | 0.0 | 17.5 | 7.5 | 10.3 | 33.8 | 5.3 | 14.6 | 1.6 | 4.3 | 3.1 | 0.1 |
GMM1 | L=7 | 6.8 | 0.0 | 17.7 | 7.5 | 10.4 | 33.8 | 5.6 | 15.0 | 1.6 | 4.3 | 3.1 | 0.1 |
GMM1 | L=4 | 6.8 | 0.0 | 17.8 | 7.4 | 10.2 | 33.6 | 5.4 | 14.7 | 1.6 | 4.3 | 3.1 | 0.1 |
GMM1 | L=1 | 6.8 | 0.0 | 18.3 | 7.9 | 10.3 | 33.5 | 4.0 | 12.9 | 1.6 | 4.3 | 3.2 | 0.1 |
GMM2 | L=14 | 7.9 | 0.0 | 18.7 | 8.7 | 11.1 | 37.3 | 4.2 | 13.2 | 1.9 | 4.7 | 3.5 | 0.1 |
GMM2 | L=10 | 7.3 | 0.0 | 17.8 | 8.1 | 9.9 | 34.8 | 3.0 | 11.9 | 1.7 | 4.6 | 3.2 | 0.0 |
GMM2 | L=7 | 7.2 | 0.0 | 17.7 | 8.1 | 9.7 | 34.6 | 2.8 | 11.8 | 1.7 | 4.4 | 3.3 | 0.0 |
GMM2 | L=4 | 7.1 | 0.0 | 17.6 | 8.0 | 9.5 | 34.4 | 2.7 | 11.5 | 1.7 | 4.4 | 3.3 | 0.0 |
GMM2 | L=1 | 7.1 | 0.0 | 18.1 | 7.9 | 9.2 | 34.1 | 2.3 | 11.5 | 1.8 | 4.3 | 3.5 | 0.0 |
MSE in D=0 population only | | | | | | | | | | | | | |
OLS | L=0 | 9.4 | 0.0 | 17.1 | 16.5 | 11.0 | 42.5 | 2.3 | 14.9 | 2.7 | 1.9 | 9.3 | 0.0 |
LSIR | | 1.9 | 0.4 | 5.8 | 1.9 | 17.7 | 36.4 | 11.4 | 19.9 | 0.6 | 1.4 | 1.5 | 0.4 |
GMM1 | L=14 | 2.3 | 0.0 | 8.3 | 2.0 | 10.9 | 29.5 | 6.0 | 12.5 | 0.6 | 0.9 | 2.3 | 0.1 |
GMM1 | L=10 | 2.2 | 0.0 | 8.3 | 1.9 | 10.8 | 29.5 | 6.2 | 12.6 | 0.6 | 0.9 | 2.3 | 0.1 |
GMM1 | L=7 | 2.3 | 0.0 | 8.7 | 2.0 | 11.0 | 29.6 | 6.5 | 13.1 | 0.6 | 0.9 | 2.3 | 0.1 |
GMM1 | L=4 | 2.3 | 0.0 | 8.8 | 2.0 | 10.7 | 29.3 | 6.2 | 12.7 | 0.6 | 0.9 | 2.5 | 0.1 |
GMM1 | L=1 | 2.5 | 0.0 | 9.9 | 2.5 | 11.1 | 29.4 | 5.2 | 13.5 | 0.7 | 0.9 | 2.7 | 0.1 |
GMM2 | L=14 | 1.9 | 0.0 | 6.2 | 2.2 | 12.1 | 32.0 | 5.2 | 13.9 | 0.7 | 0.9 | 1.9 | 0.1 |
GMM2 | L=10 | 1.8 | 0.0 | 6.3 | 1.8 | 10.6 | 29.0 | 3.6 | 12.2 | 0.7 | 0.8 | 2.1 | 0.0 |
GMM2 | L=7 | 2.0 | 0.0 | 7.2 | 1.8 | 10.5 | 29.6 | 3.3 | 12.3 | 0.7 | 0.8 | 2.4 | 0.0 |
GMM2 | L=4 | 2.0 | 0.0 | 7.3 | 1.9 | 10.2 | 29.9 | 3.1 | 12.0 | 0.9 | 0.8 | 2.7 | 0.0 |
GMM2 | L=1 | 2.7 | 0.0 | 9.6 | 2.8 | 10.1 | 33.1 | 2.6 | 12.8 | 1.1 | 0.9 | 3.4 | 0.0 |
Examining first the first three rows of Table 1, it can be seen that for misspecified models both the LSIR and the GMM estimator usually perform better than OLS. (For DGP2, however, this is true only for the GMM estimator and only for sample size 2000.) For correctly specified models, both semiparametric estimators are less precise than OLS, with LSIR always being worse than GMM. In general, the MSE of the GMM estimator is smaller than or equal to that of the LSIR estimator. In misspecified models, the semiparametric GMM estimator leads to reductions in MSE, relative to OLS, of about 20–45% for DGP1 and 5–50% for DGP3. For DGP2, the MSE of the GMM estimator is within ±10% of the MSE of OLS. This indicates that semiparametric estimation can lead to quite sizeable efficiency gains in misspecified models, although these are not guaranteed. On the other hand, the efficiency losses in correctly specified models are often small in absolute terms, compared to the precision gains in misspecified models. In DGP1 (with specification φ_{1}) and DGP3 (with φ_{3}), the MSE increases by less than 0.1 from OLS to the GMM estimator. In DGP2 (with φ_{2}), however, the GMM estimator performs clearly worse than OLS.
Examining the results for the first step GMM estimator with different numbers of overidentifying moments L, no clear and monotonic relationship can be detected. While the MSE decreases with the number of moments in DGP1-φ_{2} and DGP1-φ_{3}, it first increases and then decreases in DGP2-φ_{2} and DGP2-φ_{3}. In the other cases, the MSE hardly changes with the number of moments. This indicates that the value of additional overidentifying moments may be small, such that in applications of this estimator a relatively small L should suffice.
The second step GMM estimator, on the other hand, is more sensitive to the number of moments and its MSE generally tends to increase with L. This may be due to a less precise estimation of the weighting matrix, whose dimension increases with L. The second step estimator often tends to have a higher MSE than the first step estimator, unless the model is correctly specified. The latter comes as expected since the second step weighting matrix usually assigns more weight to the parametric moments than the initial weighting matrix used in the first step estimator.
The lower half of Table 1 shows the precision of the various estimators in the non-responding population, which is simulated by using only the D=0 observations of the validation sample. This is relevant if one is interested in estimating E[Y∣X] only for the non-respondents. A typical example would be the analysis of the treatment effect on the treated for different values of X. While the qualitative results are similar to the previous discussion, the precision gains of the semiparametric estimators for misspecified models are now much larger. For DGP1, MSE is reduced by 50–90% vis-à-vis OLS. For DGP2 the reductions are 5–30%, and for DGP3 they are 55–75%.
Table A.1 shows the simulation results for sample size 2000. The semiparametric estimators have become more precise relative to OLS, and the GMM estimator now dominates OLS in all misspecified models. The LSIR estimator, on the other hand, is still worse than OLS in DGP2-φ_{0} and DGP2-φ_{3}. The first step GMM estimator remains rather robust to the number of moments L included, while the MSE of the second step GMM estimator still often increases with the number of moments. The second step estimator now performs in almost all misspecified models worse than the first step GMM. This is in accordance with the discussion at the end of Section 2, because with increasing sample size bias becomes more important relative to variance. As the weighting matrix for the second step GMM is only based on variance considerations, too little weight is given to the overidentifying nonparametric moments. Overall, for the D=0 population (lower half of Table A.1), the MSE of the first step GMM estimator is about 60–90% (DGP1), 20–40% (DGP2) and 60–80% (DGP3) lower than for OLS, in the misspecified models.
Tables A.2 and A.3 give the results when Nadaraya–Watson kernel regression is used instead of ridge regression in the GMM estimators. The results are very similar, with kernel regression performing a little worse for sample size 500 and a little better with sample size 2000.
Although no strong conclusions can be drawn from this limited Monte Carlo study, the results seem to indicate that the semiparametric estimators can lead to substantially more precise estimates of E[Y∣X] in misspecified models, while, on the other hand, maintaining good properties in correctly specified models. Reductions in MSE of 5–50% are feasible. If interest is in estimating E[Y∣X] only for the non-responding population, e.g. for analyzing average treatment effects on the treated, reductions in MSE are even larger and can be up to 90%. Although both the LSIR and the GMM estimators perform well in misspecified models, the GMM estimator usually leads to larger reductions and has better properties in correctly specified models. In particular, the first step GMM estimator appeared to be superior to the second step estimator. A moderate number of overidentifying moments L seems to suffice to attain the precision gains. The choice of the nonparametric regression estimator does not seem to matter much: the results with Nadaraya–Watson kernel regression and local linear ridge regression were very similar. Hence, as a practical recommendation, any propensity score matching estimator can be used for estimating \(\widehat{\mu }\) in just a small number of subpopulations L.
The previous discussion focussed on the estimation of conditional mean functions E[Y∣X]. The proposed GMM estimator, however, can also be used for specification testing, using the J-test statistic (Eq. 13). In the supplementary Appendix, the size and the power of this test are examined. The results are not very favourable, though, as the test often tends to over-reject. A likely reason for this size distortion is the use of cross-validation for choosing the bandwidth value. Whereas cross-validation trades off variance against bias, centrality of the test statistic relies on undersmoothing. Hence, if the proposed GMM estimator were to be used for specification testing, a different data-driven technique for bandwidth selection would be needed. For the purpose of estimation, on the other hand, cross-validation seems to work well, as the simulations of this section indicate.
Treatment choice among Swedish rehabilitation programmes
To illustrate the applicability of the proposed estimator, treatment effect heterogeneity among Swedish rehabilitation programmes for long-term sick is analyzed. Conditional expectation functions are estimated for the different programmes, which can be used to analyze individual heterogeneity in the effects and to determine the potential for policy improvements through better targeting of programmes.
Heterogeneity in treatment effects has been somewhat neglected in the recent literature on programme evaluation, which concentrated largely on estimating average treatment effects.^{14} If treatment effects are heterogeneous, however, it is important to determine which individuals benefit most from which programmes in order to give advice on how policies should be targeted to obtain a more efficient allocation of programmes and participants. Taking treatment effect heterogeneity into account is relevant for many social and economic policies. For example, many evaluations of active labour market policies found negative or zero average treatment effects. It could be possible, though, that some individuals would benefit greatly from such programmes, whereas the majority does not. Instead of completely eliminating such programmes, better targeting might be more sensible.
With this approach the success of Swedish rehabilitation programmes in re-integrating long-term sick in the labour market is examined. In a retrospective analysis, the optimal programme is determined for each individual. Comparing the average employment rate ensuing with this optimal allocation to the observed employment rate gives an indication of the potential for policy improvement through a better allocation of participants to programmes. This application is merely meant as an illustration of the approach. Using only a single outcome variable, employment, would not do justice to the multi-facetted goals of rehabilitation programmes if strong policy conclusions were to be drawn. Rehabilitation programmes aim not only at restoring lost working capacity, but also at improving mental and physical health, and also their costs would need to be taken into account. A more comprehensive analysis, however, was not possible due to data availability.
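The retrospective allocation exercise can be sketched as follows. The sketch assumes that predicted employment probabilities under each programme are already available for every individual from the estimated conditional expectation functions. The function name `optimal_allocation` and all numbers are hypothetical and are not estimates from the Swedish data.

```python
import numpy as np

def optimal_allocation(pred):
    """Pick, for each individual, the programme with the highest predicted outcome.

    pred : (n, R) array; pred[i, r] is the predicted employment probability
           of individual i under programme r.
    Returns the chosen programme per individual and the mean predicted
    employment rate under this allocation.
    """
    best = pred.argmax(axis=1)
    return best, pred.max(axis=1).mean()

# Hypothetical predictions for three individuals and four programme types
pred = np.array([[0.30, 0.50, 0.40, 0.20],
                 [0.60, 0.55, 0.70, 0.65],
                 [0.45, 0.40, 0.35, 0.30]])
best, rate_opt = optimal_allocation(pred)
```

In this example the chosen programmes are 1, 2 and 0, and the predicted employment rate under the optimal allocation is 0.55, which would then be compared with the observed employment rate.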
Swedish rehabilitation programmes
The Swedish social insurance system provides a wide variety of supportive actions for people in need. One of these is the coordination and financing of vocational rehabilitation for long-term sick individuals. Persons who have been employed for at least one month are covered by the public sickness insurance and are eligible for sickness benefit when becoming ill. Sickness cases that last for more than four weeks are considered long-term, and appropriate measures for them are examined. If sickness is expected to be permanent or of longer duration, a disability pension is granted. Otherwise, rehabilitation actions should be initiated if they can restore (at least partly) a person’s working ability within reasonable time, i.e. in less than one year. The local insurance offices mediate in this process by coordinating rehabilitative actions with the employer and the employee and by financing vocational rehabilitation.
The rehabilitation actions consist of a wide variety of different programmes and measures, targeted at different groups and pursuing different goals. They can roughly be summarized into vocational and non-vocational measures. The vocational programmes aim at improving employability to guide individuals back into the competitive labour market and consist of work training and educational training. Work training can be with the current employer or at a new place of work. The former requires the cooperation from the employer to make the training feasible. Unemployed individuals on long-term sickness are often offered work training at sheltered public workplaces. Educational rehabilitation comprises various forms of classroom education.
The non-vocational measures consist of medical and social rehabilitation. Social rehabilitation contains, for example, programmes for individuals with alcohol, drug or psychiatric problems. These measures are not coordinated by the insurance office. Individuals with severe health problems may receive different forms of rehabilitation in parallel or sequentially.
In the following analysis, these activities are categorized into four different types of rehabilitation: No rehabilitation, workplace rehabilitation, educational rehabilitation and medical and social rehabilitation.
Rehabilitative activities pursue a variety of different goals. Vocational rehabilitation aims at re-integration into the labour market. Medical and social rehabilitation, on the other hand, are intended to restore physical and mental health and basic work capacity and to re-establish the sick individual’s independence from medical or therapeutic assistance. In the following analysis only a single outcome variable is examined: successful integration into the labour market at the end of the sickness spell. The main reason for restricting the analysis to the employment outcome is that the available data seem to be sufficiently informative to make the conditional independence assumption (16) plausible with respect to the employment outcome but not with respect to other outcome variables, e.g. health status. The data is very informative about the selection process into treatment and seems to contain most or all relevant factors that simultaneously determined rehabilitation assignment and subsequent employment outcomes. Even if not all relevant factors are included, the resulting bias is likely to be small relative to the variance in the process of finding employment, as employment is driven by many other factors besides health. With respect to subsequent health outcomes, however, conditional independence may not hold, because the health history data is not sufficiently precise. As many of the variables on health history and medical recommendations are binary, they indicate only the incidence of health problems but not their severity. As certain health problems may be highly autocorrelated over time, this might have led to a large bias.
The process of entering (publicly financed) rehabilitative actions in the years 1991 to 1994 was as follows. A person who falls sick or becomes injured first notifies her employer or the local social insurance office.^{16} If sickness continues for more than four weeks, a rehabilitation assessment, which consists of various medical and non-medical examinations, should be carried out within the following eight weeks. On the basis of this assessment a decision about the appropriateness of vocational rehabilitation should be reached: If rehabilitation assistance is not necessary and recovery is expected within a year, the individual draws sickness benefits until healthy. If sickness seems likely to last for more than a year (even with rehabilitation), the individual will be granted disability pension and the case is closed. If, on the other hand, rehabilitation seems necessary and economically advisable and it is expected that the sick person can regain her working capacity within a year, a rehabilitation plan is established.
This plan is made by the insurance office (IO) officer, taking into account the rehabilitative needs, the medical assessments, budgetary constraints as well as the individual’s preferences. In the first instance, the insurance office’s task is to coordinate the provision of vocational rehabilitation.^{17} The employer is obliged to facilitate workplace rehabilitation, according to his possibilities, through transfers, changes in duties and work hours, work training, education, adjustments to the current workplace etc. For unemployed persons and also when the employer is not able or not willing to cooperate, the insurance office offers alternative rehabilitative measures, which it purchases from hospitals and private providers of work training and education.^{18} Individuals may request rehabilitation, but they have neither the right to receive it nor the obligation to participate. It is mainly the IO officer who determines which rehabilitation measures are to be offered. The officers have clear guidelines to follow for assessing the need for and the chances of success of rehabilitative measures, and they do not face any incentive structures for discriminating against particular groups. In case of participation in vocational rehabilitation, individuals receive an additional rehabilitation allowance. After rehabilitation, the sick person may be either healthy or still sick. If still sick, her recovery chances are re-assessed and she either re-enters the pool of long-term sick or is granted disability pension.
Data
The data used in this study is taken from the Riks-LS data set, which has been collected by the National Social Insurance Board (RFV) for the purpose of evaluating the efficacy of vocational rehabilitation. The survey was conducted in the second half of 1994 and the beginning of 1995 and retrospectively analyzed 75,000 sickness cases in which sickness benefit had been received for a period of at least 60 consecutive days between July 1991 and June 1994. The caseworkers in charge of these cases completed questionnaires on the development and assessment of the sickness case. Data collection was organized in the form of three independent cross-sections, according to the fiscal years 1991/92, 1992/93 and 1993/94. Cases were followed up until closure of the case or at most until December 1994, the end of the data collection period.
From this data set, a sample of 6,287 cases in five counties in Western Sweden is analyzed. The sample contains only persons not older than 55 years and not receiving pension benefits. Individuals in full-time education are excluded as well as individuals with missing data on sickness and rehabilitation history. Of the 6,287 observations, 3,502 did not receive any rehabilitative measures, 1,118 participated in workplace rehabilitation, 360 in educational rehabilitation and 1,307 in medical and social rehabilitation.^{19}
The data set provides rich information about the socioeconomic characteristics of the individuals, details on their health status and the selection into rehabilitation. The information about the individual prior to the beginning of the sickness spell comprises age, gender, marital status, citizenship, education, occupation and labour market position, previous health record, previous participation in vocational rehabilitation, employment status, earnings and earnings loss due to sickness. The individual’s environment is characterized by county of residence, community type, local unemployment rate and year of sickness registration. Information at the time of sickness registration contains the medical institution that registered sick leave, initial degree of sickness, indications of alcohol or drug abuse, and medical diagnosis. The data set contains crucial information about the rehabilitation assessment. In particular, the initial medical recommendation, the caseworker’s non-medical recommendation, and the organization that carried out the assessment are recorded, revealing important characteristics of the sick person before entering rehabilitation. These experts’ opinions include subjective judgements about the sick person’s ability, determination and employment chances and are crucial for the conditional independence assumption (16).
Descriptive statistics by treatment groups (means or shares in %)
Variable | All | No Rehabilitation | Workplace | Educational | Medical and social |
---|---|---|---|---|---|
Age (years) | 40.5 | 40.9 | 39.6 | 39.0 | 40.5 |
Male | 45 | 45 | 45 | 46 | 46 |
Married | 52 | 53 | 53 | 45 | 52 |
Labour market position: | |||||
Blue collar, unskilled | 45 | 42 | 52 | 47 | 47 |
Blue collar, skilled | 20 | 20 | 23 | 23 | 20 |
White collar worker | 23 | 26 | 20 | 16 | 21 |
Self-employed | 12 | 13 | 5 | 14 | 12 |
Unemployed at beginning of sickness | 19 | 20 | 9 | 32 | 21 |
Income (in SEK) | 1,307 | 1,303 | 1,340 | 1,268 | 1,300 |
Previous sickness (days sick in last 6 months) | |||||
<15 days | 59 | 62 | 58 | 47 | 57 |
>60 days | 22 | 20 | 24 | 35 | 22 |
Participation in vocational rehabilitation in last 12 months | 11 | 7 | 15 | 23 | 14 |
Local unemployment rate (in %) | 6.52 | 6.45 | 6.59 | 6.71 | 6.63 |
Community type: | |||||
Urban/suburban region | 26 | 31 | 17 | 21 | 21 |
Major/middle large city | 14 | 13 | 11 | 11 | 21 |
Industrial city | 12 | 10 | 14 | 11 | 16 |
Rural and other | 49 | 47 | 58 | 57 | 43 |
Registration of current sickness spell: | |||||
Registration by | |||||
Health care centre/hospital | 80 | 81 | 81 | 73 | 79 |
Psych./social medicine centre | 8 | 7 | 6 | 14 | 10 |
By private or others | 12 | 11 | 13 | 13 | 11 |
Degree of sickness is 100% sick leave | 86 | 84 | 92 | 91 | 86 |
Indications of alcohol or drug abuse | 6 | 6 | 3 | 10 | 8 |
Diagnosis: | |||||
Psychiatric | 18 | 18 | 13 | 28 | 18 |
Circulation | 4 | 5 | 4 | 3 | 2 |
Respiratory | 2 | 2 | 3 | 4 | 2 |
Digestion | 3 | 4 | 3 | 1 | 2 |
Musculoskeletal | 44 | 39 | 51 | 44 | 51 |
Injuries | 14 | 15 | 15 | 11 | 12 |
Other | 15 | 18 | 13 | 10 | 12 |
Rehabilitation needs assessment: | |||||
Case assessed | |||||
By employer | 23 | 17 | 40 | 25 | 25 |
By insurance office | 16 | 13 | 16 | 33 | 22 |
IO on behalf of employer | 11 | 8 | 14 | 13 | 17 |
Not needed | 26 | 36 | 10 | 9 | 16 |
Not carried out | 23 | 26 | 19 | 20 | 20 |
Medical recommendation: | |||||
VR wait and see | 55 | 61 | 40 | 37 | 56 |
VR needed and defined | 26 | 14 | 47 | 55 | 34 |
Eligible for disability pension | 6 | 9 | 3 | 2 | 4 |
Not satisfactory/unclear | 12 | 16 | 10 | 6 | 6 |
Non-medical recommendation: | |||||
VR wait and see | 63 | 76 | 36 | 37 | 59 |
VR needed and defined | 32 | 17 | 63 | 62 | 38 |
Eligible for disability pension | 5 | 7 | 1 | 1 | 3 |
End of sickness: | |||||
Case closed as of December 1994 | 87 | 91 | 82 | 81 | 80 |
Returns to regular employment | 46.3 | 48.3 | 52.4 | 28.9 | 40.5 |
Number of observations | 6,287 | 3,502 | 1,118 | 360 | 1,307 |
At the end of a sickness case, the exit destination is recorded, which can be returning to or entering regular employment, working at a sheltered workplace, entering full-time education, being unemployed, receiving disability pension, or ‘other destinations’. At the end of the data collection period in December 1994, however, some cases remained unclosed. These are considered as still sick and represent about 10 to 20% of the observations, as shown at the bottom of Table 2. As regards the exit destinations, about 46% of all cases left sickness for employment. For non-participants this employment rate is 48%, and it is 52% for the participants in workplace rehabilitation, 29% for the participants in educational rehabilitation and 41% for the participants in medical and social rehabilitation.
Estimation results
Nonparametric estimates of mean potential outcomes (in %)
Estimated | \(\widehat{{E{\left[ {Y^{{No}} } \right]}}}\) | \(\widehat{{E{\left[ {Y^{{Work}} } \right]}}}\) | \(\widehat{{E{\left[ {Y^{{Edu}} } \right]}}}\) | \(\widehat{{E{\left[ {Y^{{Med}} } \right]}}}\) |
---|---|---|---|---|
Re-employment rate | 46.0 | 45.6 | 32.9 | 41.0 |
The participation probabilities \(\widehat{p}^{r} \) are estimated by probit and the support restriction is implemented by discarding all observations with \( \widehat{p}^{r}_{i} \) below the lowest participation probability among the participants in programme r. The regression curves m^{r}(p^{r}) are estimated for each subpopulation separately by ridge matching, using only the observations belonging to that subpopulation. The bandwidth is chosen by least-squares cross validation. (The implied mean potential outcomes for all subpopulations are given in Table B.1).
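The support restriction described above can be illustrated with a minimal sketch. The function name, the variable names and the numbers are hypothetical; the probabilities stand in for probit predictions, which are not re-estimated here.

```python
def apply_support_restriction(p_hat, participates):
    """Return the indices of observations kept for programme r.

    Observations whose estimated participation probability lies below the
    lowest value found among actual participants in r are discarded.
    """
    lowest = min(p for p, d in zip(p_hat, participates) if d)
    return [i for i, p in enumerate(p_hat) if p >= lowest]


# Hypothetical estimated probabilities and participation indicators.
p_hat = [0.02, 0.10, 0.35, 0.07, 0.60]
participates = [False, True, True, False, True]

kept = apply_support_restriction(p_hat, participates)
print(kept)  # indices of the observations inside the support
```

In this toy example the lowest probability among participants is 0.10, so the two non-participants with probabilities 0.02 and 0.07 are dropped.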
Distribution of optimal programme (by largest estimate)
Best programme is | No rehabilitation | Workplace | Educational | Medical |
---|---|---|---|---|
For so many individuals | 1,865 | 1,860 | 1,519 | 1,043 |
The appropriate choice of α depends on the importance of alternative objectives and considerations in the programme choice. If treatment assignment is required to be strictly deterministic and should only depend on X and if no supply-side constraints or waiting lists could delay the availability of treatment, assignment should always be to the programme with the highest estimated outcome. On the other hand, if the estimated potential outcomes are only one of many determinants for the treatment choice, a significant preponderance of evidence, say 1−α=0.7 or 0.8, is desired, to neglect all noisy estimates. If the statistical evidence is insufficient to reach this threshold, alternative criteria should guide the selection. These may include programme goals that are not easily quantifiable (and thus cannot be included in the utility weighting function discussed in the beginning of Section 4), waiting-lists if treatment places are limited, conjectures about general equilibrium effects of certain treatments, which cannot be quantified, and so forth. The more important these alternative goals and criteria are, the more certainty will be expected from the statistical system before taking its predictions into consideration. In addition, the choice of α should also depend on the number of options to choose from. Generally, the larger the number of available programmes, the smaller 1−α should be, because a level of 1−α=0.7 can easily be reached if there are only two options to choose from, but will be much more restrictive if there are ten different programmes.
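The role of the threshold 1−α can be sketched in code. The sketch assumes that, for one individual, a set of simulated draws of the predicted potential outcomes is available (for instance from the estimated distribution of the predictions); how these draws are generated is defined elsewhere in the paper and is not reproduced here. The function name and the numbers are hypothetical.

```python
from collections import Counter


def choose_programme(draws, one_minus_alpha):
    """Pick the programme that is best in at least a share 1-alpha of draws.

    draws: list of dicts mapping programme name -> simulated predicted outcome.
    Returns the programme name, or None if no programme reaches the threshold.
    """
    # Count how often each programme has the highest simulated outcome.
    wins = Counter(max(d, key=d.get) for d in draws)
    best, count = wins.most_common(1)[0]
    share = count / len(draws)
    return best if share >= one_minus_alpha else None


# Hypothetical draws: workplace rehabilitation is best in 6 of 10 draws.
draws = [{"No": 0.40, "Work": 0.55}] * 6 + [{"No": 0.50, "Work": 0.45}] * 4

print(choose_programme(draws, 0.5))  # threshold reached
print(choose_programme(draws, 0.7))  # evidence insufficient
```

With only two options one of them always reaches a share of at least 0.5, whereas with many options the largest share can be far smaller, which is why a given level of 1−α becomes more restrictive as the number of programmes grows.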
Distribution of optimal programme (for different levels of α)
Best programme is | No rehabilitation | Workplace | Educational | Medical | Undefined |
---|---|---|---|---|---|
With 90% probability | 142 | 100 | 23 | 16 | 6,006 |
With 70% probability | 618 | 540 | 294 | 180 | 4,655 |
With 60% probability | 920 | 893 | 552 | 352 | 3,570 |
With 50% probability | 1,302 | 1,386 | 905 | 606 | 2,088 |
Optimal treatment choice versus actual allocation
Actual allocation | No rehabilitation | Workplace | Education | Medical | Undefined |
---|---|---|---|---|---|
Optimal allocation r_{i}* (1−α=70%) | |||||
D_{i}=No | 399 | 198 | 156 | 113 | 2,636 |
D_{i}=Work | 76 | 170 | 73 | 19 | 780 |
D_{i}=Edu | 22 | 53 | 19 | 8 | 258 |
D_{i}=Med | 121 | 119 | 46 | 40 | 981 |
Δ(%) | 61.5 | ||||
Optimal allocation r_{i}* (1−α=60%) | |||||
D_{i}=No | 586 | 352 | 312 | 213 | 2,039 |
D_{i}=Work | 122 | 249 | 119 | 40 | 588 |
D_{i}=Edu | 30 | 79 | 37 | 13 | 201 |
D_{i}=Med | 182 | 213 | 84 | 86 | 742 |
Δ(%) | 64.7 | ||||
Optimal allocation r_{i}* (1−α=50%) | |||||
D_{i}=No | 828 | 603 | 507 | 351 | 1,213 |
D_{i}=Work | 179 | 338 | 196 | 82 | 323 |
D_{i}=Edu | 48 | 114 | 56 | 25 | 117 |
D_{i}=Med | 247 | 331 | 146 | 148 | 435 |
Δ(%) | 67.4 | ||||
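As a check on the table, the Δ values can be recomputed from the cell counts. The sketch below assumes that Δ is the share of individuals with a defined optimal programme whose actual programme differs from the optimal one; this reading is an assumption, but it reproduces the three reported figures. The function and variable names are hypothetical.

```python
PROGRAMMES = ["No", "Work", "Edu", "Med"]


def misclassification_rate(crosstab):
    """Share (in %) of individuals with a defined optimal programme whose
    actual programme differs from it.

    crosstab[actual] lists counts by optimal programme in PROGRAMMES order,
    followed by the count with undefined optimal programme.
    """
    defined = sum(sum(row[:4]) for row in crosstab.values())
    agree = sum(crosstab[p][k] for k, p in enumerate(PROGRAMMES))
    return 100 * (1 - agree / defined)


table_70 = {"No": [399, 198, 156, 113, 2636], "Work": [76, 170, 73, 19, 780],
            "Edu": [22, 53, 19, 8, 258], "Med": [121, 119, 46, 40, 981]}
table_60 = {"No": [586, 352, 312, 213, 2039], "Work": [122, 249, 119, 40, 588],
            "Edu": [30, 79, 37, 13, 201], "Med": [182, 213, 84, 86, 742]}
table_50 = {"No": [828, 603, 507, 351, 1213], "Work": [179, 338, 196, 82, 323],
            "Edu": [48, 114, 56, 25, 117], "Med": [247, 331, 146, 148, 435]}

for label, tab in [("70%", table_70), ("60%", table_60), ("50%", table_50)]:
    print(label, round(misclassification_rate(tab), 1))
```

At the 70% level, for example, 1,632 individuals have a defined optimal programme and 628 of them actually received it, so Δ = 1 − 628/1,632 ≈ 61.5%.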
Average characteristics by treatment group: optimal vs. actual allocation
Variable | Category | Optimal allocation: N | W | E | M | Actual allocation: N | W | E | M |
---|---|---|---|---|---|---|---|---|---|
Age: | 18–35 years | 12 | 20 | 59 | 52 | 31 | 34 | 37 | 31 |
| 46–55 years | 40 | 62 | 10 | 30 | 41 | 31 | 32 | 36 |
Gender: | Male | 56 | 36 | 48 | 44 | 45 | 45 | 46 | 46 |
Employment status: | Unemployed | 2 | 27 | 47 | 2 | 20 | 9 | 32 | 21 |
Labour market position: | Blue collar, high educated | 43 | 9 | 19 | 17 | 20 | 23 | 23 | 20 |
Occupation in: | Manufacturing | 51 | 23 | 23 | 38 | 30 | 38 | 32 | 32 |
Previous sickness days | >60 days | 19 | 32 | 25 | 5 | 20 | 24 | 35 | 22 |
Prior participation in | Vocational rehabilitation | 4 | 15 | 21 | 0 | 7 | 15 | 23 | 14 |
Medical diagnosis: | Psychiatric | 20 | 21 | 11 | 15 | 18 | 13 | 28 | 18 |
Medical recommend. | Wait and see | 79 | 64 | 19 | 53 | 61 | 40 | 37 | 56 |
Predicted employment | Probability | 69.3 | 52.6 | 54.5 | 67.2 | 48.5 | 51.9 | 30.2 | 41.1 |
A striking difference with respect to age can be seen. Whereas average age does not vary much by treatment groups in the actual allocation, the optimal choice seems to depend strongly on the individual’s age. The young are clearly over-represented among those who are advised to participate in medical and, particularly, in educational rehabilitation, while only very few of the 46–55 year olds are best served by educational rehabilitation. With respect to gender it seems that men should more often be assigned to No rehabilitation, whereas women might benefit more from workplace rehabilitation. Regarding prior unemployment, it is noteworthy that only few unemployed are advised to participate in No or in medical rehabilitation, whereas they represent about half of those advised to participate in educational rehabilitation. Educated blue collar workers are less frequently found among those served best by workplace rehabilitation, whereas manufacturing workers are over-represented among those advised to receive No rehabilitation. For individuals who had been sick for more than 60 days in the last 6 months or who had participated in vocational rehabilitation before, medical rehabilitation is hardly ever an unambiguously optimal choice. Furthermore, in the optimal allocation, individuals with psychiatric problems and those for whom a wait and see strategy has been advised are clearly under-represented in educational rehabilitation, compared to the actual allocation. Generally the differences in the characteristics are much more pronounced in the optimal than in the actual allocation.
In the last row of Table 7, the predicted potential employment outcomes are averaged within the treatment groups according to the optimal and to the actual allocation. The predicted average employment rates in the actual treatment groups correspond quite well to the observed rates of Table 2. When re-allocating the participants to the programmes in an optimal way, substantial increases in the predicted employment rates are achieved. To summarize this analysis, it is illuminating to tentatively predict the overall employment rate that could have been achieved through an optimal allocation. When all individuals with an optimal programme defined at the 0.5 level are allocated to that programme, and all other individuals, for whom no optimal programme is defined, are allocated randomly to any programme (with equal probability), the predicted average employment rate is 54.5%. If, on the other hand, the individuals without a defined optimal programme are allocated randomly to either No or workplace rehabilitation, the predicted employment rate is 55.7%. Thus, compared to the current selection process and to the employment rates that would be expected if all individuals were assigned to the same programme (see Table 3), an increase in the employment rate of about 9 percentage points could be possible through an improved participant allocation.
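The simulated allocation rule can be written down as a short sketch. It assumes that predicted employment probabilities are available for every individual and programme, and it replaces the random assignment of individuals without a defined optimal programme by its expectation, i.e. the equally weighted mean over the fallback programmes. All names and numbers are hypothetical and do not come from the data.

```python
def predicted_rate(predictions, optimal, fallback):
    """Expected employment rate under the allocation rule.

    predictions: list of dicts, programme -> predicted employment probability.
    optimal: list with the optimal programme per individual, or None if undefined.
    fallback: programmes among which undefined cases are assigned at random
              with equal probability (evaluated here in expectation).
    """
    total = 0.0
    for pred, best in zip(predictions, optimal):
        if best is not None:
            total += pred[best]
        else:
            total += sum(pred[r] for r in fallback) / len(fallback)
    return total / len(predictions)


# Two hypothetical individuals; the second has no defined optimal programme.
predictions = [{"No": 0.6, "Work": 0.4, "Edu": 0.2, "Med": 0.4},
               {"No": 0.5, "Work": 0.7, "Edu": 0.3, "Med": 0.5}]
optimal = ["No", None]

rate_all = predicted_rate(predictions, optimal, ["No", "Work", "Edu", "Med"])
rate_two = predicted_rate(predictions, optimal, ["No", "Work"])
print(rate_all, rate_two)
```

Restricting the fallback set to programmes with higher average predicted outcomes raises the simulated rate, which mirrors the difference between the two scenarios reported in the text.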
These findings indicate substantial heterogeneity in the treatment effects between individuals. The treatment effects depend on the X characteristics and the optimal programme varies with X. These conditional-on-X treatment effects, however, are not fully taken into account by the case workers in their choices of the programmes. The limited sensitivity of the case workers’ choices to their clients’ observed characteristics has already been noted from Table 7. For example, whereas the usefulness of educational rehabilitation clearly seems to decrease with age, the actual allocation to educational rehabilitation depends only little on age. By fully exploiting the differences in the conditional-on-X treatment effects, the employment rate could have been raised substantially, provided that the treatment effects are consistently estimated. There may be several reasons for this large unexploited potential. Case workers may not know the conditional-on-X treatment effects. They may also be constrained in their choices, e.g. due to limited numbers of workplace rehabilitation places. Furthermore, they might seek the cooperation of the sick person in their choices. The most important reason, however, is likely to be that they pursue different objective functions. Rehabilitation serves several purposes and rapid employment is only one of these. Health and sustainability considerations might be accorded much greater importance.
If educational rehabilitation were no longer available, the predicted average employment rate would be 54.9%, when individuals without defined optimal programme are assigned randomly to either No or to workplace rehabilitation. Thus, although educational rehabilitation is the optimal programme for some individuals, their second-best choice seems not to be much worse.
Similar results are also obtained for different sets of X variables and different moment specifications (see the sensitivity analysis in the supplementary Appendix). Compared to the above optimal allocation (with 11 subpopulations), the optimal allocations that would result if 1, 6, 16 or 21 subpopulations were included are not very different. The fraction of misclassification Δ between the main specification and any of these other specifications is at most 0.1% at the 1−α=0.7 level, at most 2.4% at the 0.6 level and at most 11% at the 0.5 level. On the other hand, if the set of 11 subpopulations is maintained but the set of explanatory variables X is altered, the estimated optimal allocations change more markedly. With a set of 28 or 30 variables, the resulting allocations are still very similar: Δ is about 0.5, 5 and 14.5% at the 0.7, 0.6 and 0.5 levels, respectively. However, when relevant information on sickness history, diagnosis and geographic location is left out (retaining only 24 variables), the misclassification rates increase to 15.8, 26.4 and almost 40%, respectively, at the different levels of 1−α. Hence, detailed information seems to be necessary to obtain informed programme choices.
Conclusions
In this paper a new semiparametric estimator for estimating conditional mean functions from incomplete data has been developed. It applies to situations where data is missing due to non-response or where it is missing by definition, e.g. in the analysis of treatment effects, where only one of the different potential outcomes can be observed for each individual.
This estimator integrates parametric regression with nonparametric matching to obtain more precise estimates in the subpopulation with missing data. Nonparametric matching estimates are used as an anchor for reducing bias in the missing-data subpopulation while retaining a reasonable fit in the full-data subpopulation. A small Monte Carlo simulation showed that considerable reductions in MSE vis-a-vis a fully parametric estimator can be achieved in misspecified parametric models. On the other hand, the efficiency losses in correctly specified models seem to be rather small. The applicability of the estimator has been illustrated by an analysis of treatment effect heterogeneity in Swedish rehabilitation programmes.
Analyzing individual heterogeneity in treatment effects is highly relevant for policy evaluation. In many evaluation studies, small or negative estimates of average treatment effects indicate an ineffective policy. These average effects, however, may mask a considerable heterogeneity in the effects between the individuals. It is important to know whether the effect is as negative for all individuals or whether it harms some while it benefits others. Estimating treatment effects on a disaggregated level, i.e. conditional on characteristics X, can help to assess the extent of treatment effect heterogeneity. These estimates can then be used to appraise the potential for policy improvements due to a better participant allocation. By predicting the treatment effects for each individual, the expected outcomes if assigned to the optimal programmes can be simulated. Comparing these with the observed outcomes gives an estimate of the effectiveness of the allocation process. For example, in the application to the Swedish rehabilitation programmes, the simulated optimal employment outcome is 56%, compared to an observed employment rate of 46%.
Footnotes
- 1.
The estimation of average treatment effects has been intensively analyzed, in particular for active labour market programmes and rehabilitation programmes. See, for example, Aakvik (2003), Abbring and van den Berg (2004) and the Special Issue on ‘Long term unemployment and social assistance’ (Empirical Economics (1/2), 1998). The focus of this paper is on the heterogeneity in treatment effects, which could be exploited to improve the average effectiveness of policies through a better participant allocation.
- 2.
- 3.
- 4.
E.g. in the case of panel attrition, X may refer to information collected in the baseline period.
- 5.
- 6.
The support restriction is incorporated by considering only observations with \(\widehat{p}_{i} > 0\), because S_{x}={x:f_{X∣D=1}(x)>0}={x:p(x)>0}
- 7.
- 8.
More precisely, let \(\widehat{m}_{vl}(\rho)\) for ρ>0 be an estimator of the expectation E[Y_{v}∣p(X)=ρ, Λ_{l}(X)=1], i.e. the expectation of the v-th variable of the outcome vector Y conditional on the propensity score in the l-th subpopulation. Let \(\widehat{m}_{l}(\cdot) = \left(\widehat{m}_{1l}(\cdot), \ldots, \widehat{m}_{vl}(\cdot), \ldots, \widehat{m}_{Vl}(\cdot)\right)^{\prime}\) be the element-wise-defined estimator of the conditional expectation of the outcome vector Y in subpopulation l, i.e. of E[Y∣p(X)=ρ, Λ_{l}(X)=1]. Stacking these estimators for the L subpopulations and multiplying element-wise with the population indicator function gives \(\widehat{m}_{VL}(\widehat{p}(X_{i})) = \left(\widehat{m}^{\prime}_{1}(\widehat{p}(X_{i})) \cdot \Lambda_{1}(X_{i}), \ldots, \widehat{m}^{\prime}_{l}(\widehat{p}(X_{i})) \cdot \Lambda_{l}(X_{i}), \ldots, \widehat{m}^{\prime}_{L}(\widehat{p}(X_{i})) \cdot \Lambda_{L}(X_{i})\right)^{\prime}\).
- 9.
When a standard propensity score matching routine is used, care should be exercised to ensure that the lower VL moments in (10) are summed over the same observations as in \(\widehat{\mu }\) and are scaled in the same way. For example, if the propensity score matching routine estimates the mean counterfactual outcome as \(\frac{\sum \widehat{m}_{VL}(\widehat{p}_{i})(1 - D_{i})1(\widehat{p}_{i} > 0)}{\sum (1 - D_{i})1(\widehat{p}_{i} > 0)}\) instead of \(\frac{\sum \widehat{m}_{VL}(\widehat{p}_{i})(1 - D_{i})1(\widehat{p}_{i} > 0)}{n}\), then the VL moments must also be scaled accordingly.
- 10.
This includes one-to-one or pair matching.
- 11.
Using Epanechnikov instead of Gaussian kernel, and vice versa, led to largely similar results.
- 12.
The X data are scaled in the estimator to mean zero and variance one.
- 13.
The expected outcomes vary considerably among these subpopulations. Whereas with DGP 1, the expected outcome is 13.1 for the respondents and 5.3 for the non-respondents, the outcome difference between respondents and non-respondents can be as large as 8.2 (for subpopulations ten and eleven) and as small as 0.8 (for subpopulation fourteen). Similar heterogeneity occurs for DGP 2 and 3. For instance, in DGP 2 the expected outcome for the respondents is usually larger than for the non-respondents, but this relationship is reversed in subpopulation five. In DGP 2, the expected outcomes for respondents and non-respondents are 2.2 and 1.5, respectively, and in DGP 3 these figures are 9.6 and 4.3.
- 14.
- 15.
Unless the past participants have been assigned randomly to the programmes.
- 16.
Regularly employed individuals receive for the first two weeks sickness benefits from the employer and afterwards from the insurance office. Unemployed and self-employed individuals receive benefits directly from the insurance office. Sickness benefits amount to 80% of previous earnings, adjusted for the degree of lost working capacity and cut at an upper ceiling, and can be received for an unlimited period.
- 17.
Medical and social rehabilitation are not coordinated by the insurance office.
- 18.
The insurance offices themselves do not conduct rehabilitative activities.
- 19.
A number of cases received more than one type of rehabilitation. Since it is known neither whether these measures were given in parallel or sequentially nor in which order they were given, these cases were assigned to the supposedly first or principal of the rehabilitative measures received. In most cases this has been medical rehabilitation, which is likely to be the first measure. The second priority is given to workplace rehabilitation, since workplace rehabilitation is usually full-time while educational training may operate alongside. For further details on the data see Frölich et al. (2004).
- 20.
The reason for the latter is that the assessment refers to vocational rehabilitation.
Notes
Acknowledgment
The author is also affiliated with the Institute for the Study of Labor (IZA), Bonn. I am grateful for discussions and comments to Bo Honoré, Francois Laisney, Michael Lechner, Ruth Miquel, Oivind Nilsen, Jeff Smith, the editor and three anonymous referees. This research was supported by the Swiss National Science Foundation (project NSF 4043-058311) and the Grundlagenforschungsfonds HSG (project G02110112).
References
- Aakvik A (2003) Estimating the employment effects of education for disabled workers in Norway. Empir Econ 28:515–533CrossRefGoogle Scholar
- Abbring J, van den Berg G (2004) Analyzing the effect of dynamically assigned treatments using duration models, binary treatment models, and panel data models. Empirical Econ 29:5–20CrossRefGoogle Scholar
- Angrist J (1998) Estimating labour market impact of voluntary military service using social security data. Econometrica 66:249–288CrossRefGoogle Scholar
- Angrist J, Krueger A (1999) Empirical strategies in labor economics. In: Ashenfelter O, Card D (eds) The handbook of labor economics, III. North-Holland, New York, pp 1277–1366Google Scholar
- Barnow B, Cain G, Goldberger A (1981) Selection on observables. Evaluation Studies Review Annual 5:43–59Google Scholar
- Black D, Smith J, Berger M, Noel B (2003) Is the threat of reemployment services more effective than the services themselves?—evidence from random assignment in the UI system. Am Econ Rev 93:1313–1327CrossRefGoogle Scholar
- Dehejia R (2004) Program evaluation as a decision problem. forthcoming in J EconGoogle Scholar
- Dehejia R, Wahba S (1999) Causal effects in non-experimental studies: reevaluating the evaluation of training programmes. J Am Stat Assoc 94:1053–1062CrossRefGoogle Scholar
- Fan J (1992) Design-adaptive nonparametric regression. J Am Stat Assoc 87:998–1004CrossRefGoogle Scholar
- Frölich M (2004) Finite sample properties of propensity-score matching and weighting estimators. Rev Econ Stat 86:77–90CrossRefGoogle Scholar
- Frölich M (2005) Matching estimators and optimal bandwidth choice. Stat Comput 15(3):197–215CrossRefGoogle Scholar
- Frölich M, Heshmati A, Lechner M (2004) A microeconometric evaluation of rehabilitation of long-term sickness in Sweden. J Appl Econ 19:375–396
- Gerfin M, Lechner M (2002) Microeconometric evaluation of the active labour market policy in Switzerland. Econ J 112:854–893
- Hahn J (1998) On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66:315–331
- Hansen LP (1982) Large sample properties of generalized method of moment estimators. Econometrica 50:1029–1054
- Heckman J, Robb R (1985) Alternative methods for evaluating the impact of interventions. In: Heckman J, Singer B (eds) Longitudinal analysis of labour market data. Cambridge University Press, Cambridge
- Heckman J, Ichimura H, Todd P (1997) Matching as an econometric evaluation estimator: evidence from evaluating a job training programme. Rev Econ Stud 64:605–654
- Heckman J, Smith J, Clements N (1997) Making the most out of programme evaluations and social experiments: accounting for heterogeneity in programme impacts. Rev Econ Stud 64:487–535
- Heckman J, Ichimura H, Todd P (1998) Matching as an econometric evaluation estimator. Rev Econ Stud 65:261–294
- Heckman J, Ichimura H, Smith J, Todd P (1998) Characterizing selection bias using experimental data. Econometrica 66:1017–1098
- Heckman J, LaLonde R, Smith J (1999) The economics and econometrics of active labour market programs. In: Ashenfelter O, Card D (eds) The handbook of labor economics, III. North-Holland, New York, pp 1865–2097
- Jalan J, Ravallion M (2003) Estimating the benefit incidence of an antipoverty program by propensity-score matching. J Bus Econ Stat 21:19–30
- Lechner M (1999) Earnings and employment effects of continuous off-the-job training in east Germany after unification. J Bus Econ Stat 17:74–90
- Little R, Rubin D (1987) Statistical analysis with missing data. Wiley, New York
- Manski C (2000) Identification problems and decisions under ambiguity: empirical analysis of treatment response and normative analysis of treatment choice. J Econ 95:415–442
- Manski C (2004) Statistical treatment rules for heterogeneous populations. Econometrica 72:1221–1246
- Rosenbaum P, Rubin D (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70:41–55
- Rubin D (1974) Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol 66:688–701
- Seifert B, Gasser T (1996) Finite-sample variance of local polynomials: analysis and solutions. J Am Stat Assoc 91:267–275
- Seifert B, Gasser T (2000) Data adaptive ridging in local polynomial regression. J Comput Graph Stat 9:338–360
- Wald A (1950) Statistical decision functions. Wiley, New York