Abstract
In this paper, we develop a Heckman selection model with endogenous covariates. The model is easy to estimate in any econometrics software that supports maximum likelihood estimation of the Heckman selection model. Its most important benefit is that it provides an easy-to-interpret measure of the composition of the fully observed sample with respect to unobservables. As an example, we apply our model to the composition of the female full-time, full-year workforce, as studied by Mulligan and Rubinstein (Q J Econ 123:1061–1110, 2008). We find that their conclusion that the female workforce was negatively selected in the late 1970s is robust to accounting for the potential endogeneity of education in a Heckman selection model. However, we find that accounting for endogeneity leads to a large increase in the estimated returns to education.
Notes
By “fully observed sample” we mean those observations that have non-missing values in the outcome variable of interest. Individuals in the “partially observed sample”, by contrast, have a missing value in the outcome variable of interest.
In a related approach, Blundell et al. (1998) estimated labor supply elasticities, controlling for endogeneity of covariates and sample selectivity. Their approach was quite specific to that particular application. The reader may find the exposition in this paper to be more general.
The approach undertaken here to accommodate the endogeneity problem is known as a “control function approach” in the literature (see, e.g., Wooldridge (2010), pp. 126–29).
We also provide in the “Appendix” at the end of this paper a small Monte Carlo simulation study which analyzes the finite sample performance of the FIML estimator and compares its estimates to the (biased) estimates based on the ordinary Heckman selection model which does not control for endogeneity. Moreover, we provide an application of our estimator to the well-known Mroz (1987) labor supply data set in order to compare our results with those of Wooldridge (2010), who did the same using his estimator.
We obtained our data files from the IPUMS-USA database (Ruggles et al. 2010).
Mulligan and Rubinstein (2008) argued that they did not want to identify the main equation parameters by functional form assumptions alone, hence they selected an instrumental variable for the selection equation.
It might be argued that marital status is endogenous as well. We therefore replicated our analysis without the marital status dummies; the results remained qualitatively similar.
For the appropriateness of these instrumental variables, cf. the discussion in Card (1999), pp. 1822–1826.
In addition, joint significance of \(\psi _{11}\) and \(\psi _{21}\) is rejected as well (\(p\) value of 0.2532).
References
Ahn H, Powell JL (1993) Semiparametric estimation of censored selection models with a nonparametric selection mechanism. J Econ 58:3–29. doi:10.1016/0304-4076(93)90111-H
Amemiya T (1985) Advanced econometrics. Basil Blackwell, Oxford
Angrist JD, Imbens GW, Rubin DB (1996) Identification of causal effects using instrumental variables. J Am Stat Assoc 91:444–455
Angrist JD, Krueger AB (1991) Does compulsory school attendance affect schooling and earnings? Q J Econ 106:979–1014. doi:10.2307/2937954
Blundell R, Duncan A, Meghir C (1998) Estimating labor supply responses using tax reforms. Econometrica 66:827–861
Bound JB, Jaeger DA, Baker RM (1995) Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. J Am Stat Assoc 90:443–450. doi:10.1080/01621459.1995.10476536
Card D (1999) The causal effect of education on earnings. In: Ashenfelter O, Card D (eds) Handbook of labor economics, vol 3. Elsevier, Amsterdam, pp 1801–1863
Chib S, Greenberg E, Jeliazkov I (2009) Estimation of semiparametric models in the presence of endogeneity and sample selection. J Comput Graph Stat 18:321–348. doi:10.1198/jcgs.2009.07070
Das M, Newey WK, Vella F (2003) Nonparametric estimation of sample selection models. Rev Econ Stud 70:33–58. doi:10.1111/1467-937X.00236
Davidson R, MacKinnon JG (1993) Estimation and inference in econometrics. Oxford University Press, New York
Gallant AR, Nychka DW (1987) Semi-nonparametric maximum likelihood estimation. Econometrica 55:363–390
Heckman JJ (1978) Dummy endogenous variables in a simultaneous equation system. Econometrica 46:931–959
Heckman JJ (1979) Sample selection bias as a specification error. Econometrica 47:153–161
Imbens GW, Angrist JD (1994) Identification and estimation of local average treatment effects. Econometrica 62:467–475
Mroz TA (1987) The sensitivity of an empirical model of married women’s hours of work to economic and statistical assumptions. Econometrica 55:765–799
Mulligan CB, Rubinstein Y (2008) Selection, investment, and women’s relative wages over time. Q J Econ 123:1061–1110. doi:10.1162/qjec.2008.123.3.1061
Newey WK (1987) Efficient estimation of limited dependent variable models with endogenous explanatory variables. J Econ 36:231–250
Newey WK (2009) Two-step series estimation of sample selection models. Econometrics J 12:S217–S229. doi:10.1111/j.1368-423X.2008.00263.x
Powell JL (1987) Semiparametric estimation of bivariate limited dependent variable models. Manuscript. University of California, Berkeley
Rivers D, Vuong QH (1988) Limited information estimators and exogeneity tests for simultaneous probit models. J Econ 39:347–366. doi:10.1016/0304-4076(88)90063-2
Ruggles S, Alexander JT, Genadek K, Goeken R, Schroeder MB, Sobek M (2010) Integrated public use microdata series: version 5.0 [machine-readable database]. Minneapolis, University of Minnesota
Semykina A, Wooldridge JM (2010) Estimating panel data models in the presence of endogeneity and selection. J Econ 157:375–380. doi:10.1016/j.jeconom.2010.03.039
Smith RJ, Blundell RW (1986) An exogeneity test for a simultaneous equation tobit model with an application to labor supply. Econometrica 54:679–685
Stock JH, Wright JH, Yogo M (2002) A survey of weak instruments and weak identification in generalized method of moments. J Bus Econ Stat 20:518–529
Wooldridge JM (2010) Econometric analysis of cross section and panel data, 2nd edn. The MIT Press, Cambridge
Appendices
Appendix 1
In this “Appendix”, we show how the asymptotic covariance matrix of the LIML estimator must be corrected in order to account for the estimation of the regressors \(\varvec{\varepsilon }_1\), \(\varvec{\varepsilon }_2\) and \(\varvec{\varepsilon }_3\). First, let \(\varvec{\alpha }\equiv (\text {vec}(\varvec{\varDelta })',\text {vec}(\varvec{\varLambda })',\text {vec}(\varvec{\varUpsilon })')'\) and \(l(\varvec{\tilde{\theta }},\varvec{\hat{\alpha }})=\sum _{i=1}^nl_i(\varvec{\tilde{\theta }},\varvec{\hat{\alpha }})\) be the limited information log-likelihood function. Provided there exists an interior solution, we can write the first order condition from maximizing this likelihood function as
A first-order asymptotic expansion about \(\varvec{\hat{\tilde{\theta }}}=\varvec{\tilde{\theta }}\) gives, after rearranging and premultiplying by \(\sqrt{n}\),
Expanding the gradient about \(\varvec{\hat{\alpha }}=\varvec{\alpha }\) yields
If
then
where \(\mathbf C=\mathbf H^{-1}(\mathbf M+\mathbf J\mathbf V\mathbf J')\mathbf H^{-1}\). This follows because the covariance between \(\frac{\partial l_i(\varvec{\tilde{\theta }},\varvec{\alpha })}{\partial \varvec{\tilde{\theta }}}\) and \((\varvec{\hat{\alpha }}-\varvec{\alpha })\) is zero, as shown by Smith and Blundell (1986).
Note that implementing the LIML estimator in econometrics software yields an asymptotic covariance of \(\mathbf H^{-1}\mathbf M\mathbf H^{-1}\), because the software does not know that some regressors have been estimated. Hence, one must add the correction term \(\mathbf H^{-1}(\mathbf J\mathbf V\mathbf J')\mathbf H^{-1}\) to this expression in order to obtain the correct asymptotic covariance.
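The chain of expansions behind this correction can be sketched as follows. The definitions of \(\mathbf H\), \(\mathbf M\), \(\mathbf J\) and \(\mathbf V\) used here are assumptions: they follow the usual two-step estimation setup and were chosen so that the result agrees with the formula for \(\mathbf C\) stated above.

\[
\frac{\partial l(\varvec{\hat{\tilde{\theta }}},\varvec{\hat{\alpha }})}{\partial \varvec{\tilde{\theta }}}=\mathbf 0,
\]
\[
\sqrt{n}\,(\varvec{\hat{\tilde{\theta }}}-\varvec{\tilde{\theta }})
=-\left[ \frac{1}{n}\frac{\partial ^2 l(\varvec{\tilde{\theta }},\varvec{\hat{\alpha }})}{\partial \varvec{\tilde{\theta }}\,\partial \varvec{\tilde{\theta }}'}\right] ^{-1}
\frac{1}{\sqrt{n}}\frac{\partial l(\varvec{\tilde{\theta }},\varvec{\hat{\alpha }})}{\partial \varvec{\tilde{\theta }}}+o_p(1),
\]
\[
\frac{1}{\sqrt{n}}\frac{\partial l(\varvec{\tilde{\theta }},\varvec{\hat{\alpha }})}{\partial \varvec{\tilde{\theta }}}
=\frac{1}{\sqrt{n}}\frac{\partial l(\varvec{\tilde{\theta }},\varvec{\alpha })}{\partial \varvec{\tilde{\theta }}}
+\left[ \frac{1}{n}\frac{\partial ^2 l(\varvec{\tilde{\theta }},\varvec{\alpha })}{\partial \varvec{\tilde{\theta }}\,\partial \varvec{\alpha }'}\right] \sqrt{n}\,(\varvec{\hat{\alpha }}-\varvec{\alpha })+o_p(1).
\]

Suppose, as an assumption, that the scaled score \(\frac{1}{\sqrt{n}}\frac{\partial l(\varvec{\tilde{\theta }},\varvec{\alpha })}{\partial \varvec{\tilde{\theta }}}\) converges in distribution to \(N(\mathbf 0,\mathbf M)\), that \(\sqrt{n}(\varvec{\hat{\alpha }}-\varvec{\alpha })\) converges in distribution to \(N(\mathbf 0,\mathbf V)\), and that the two bracketed matrices converge in probability to \(\mathbf H\) and \(\mathbf J\), respectively. With a zero covariance between the score and \((\varvec{\hat{\alpha }}-\varvec{\alpha })\), the variances of the two terms add, so that

\[
\sqrt{n}\,(\varvec{\hat{\tilde{\theta }}}-\varvec{\tilde{\theta }})\;\overset{d}{\longrightarrow }\;N\!\left( \mathbf 0,\;\mathbf H^{-1}(\mathbf M+\mathbf J\mathbf V\mathbf J')\mathbf H^{-1}\right) .
\]

The sign of \(\mathbf H\) does not matter here because \(\mathbf H^{-1}\) enters on both sides of the sandwich.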
Appendix 2
In this “Appendix”, we use Monte Carlo simulations to study the finite-sample properties of our FIML estimator and to gauge the bias that occurs if one does not account for endogeneity. The results of these simulations are presented in Table 6.
The first column of Table 6 contains the specification. We distinguish between four benchmark cases. In the first case, endogeneity is only present in the main equation. In particular, it is assumed that
and
Note that we have assumed a relatively high correlation between the main and the selection equation. Hence, we focus our attention on situations where sample selection bias is indeed a problem.
In the second case, endogeneity is only present in the selection equation:
and
In the third case, there is one common variable in both equations which is endogenous:
and
Finally, in the fourth case, it is assumed that both equations include an endogenous variable which is exclusive for each equation:
and
Throughout, \(X_{1i},\, Z_{1i},\, Z_{2i}\) and \(Z_{3i},\, i=1,\dots ,n\), are scalars which have been simulated from a standard normal distribution. For each of the four cases, these random numbers have been drawn once and kept fixed during simulation. In total, each simulation encompasses 1,000 repetitions in which parameter estimates have been computed. Table 6 presents the mean of these estimates over the repetitions, along with the corresponding standard deviations. Note that in accordance with the notation in Sect. 2 of the main text, the \(\beta \)’s in Table 6 refer to the parameters of the main equation, while the \(\gamma \)’s refer to the parameters of the selection equation.
In order to gauge the finite-sample performance of the estimator outlined in Sect. 3, Table 6 contains simulation results for different sample sizes. For each sample size, Table 6 displays the results for the FIML estimator presented in Sect. 3 (“IV”) and contrasts these results with those obtained when using the ordinary estimator for the sample selection model which does not account for endogeneity (“non-IV”). To save space, only the estimates for the parameters of the main equation and selection equation are presented.
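The simulation design described above (regressors drawn once and kept fixed, errors redrawn in each repetition, mean and standard deviation of the estimates reported) can be sketched in a few lines. This is only an illustration of the design, not the estimator of Sect. 3: it swaps in a plain linear IV slope for FIML, omits sample selection altogether, and uses hypothetical parameter values (`beta`, `rho`, `n`, `reps`) that are not taken from the paper.

```python
import math
import random


def mean_sd(values):
    """Return the mean and the sample standard deviation of a list."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / (len(values) - 1)
    return m, math.sqrt(var)


def slope(w, x, y):
    """Slope of y on x using w as instrument (w = x gives OLS, w = z gives IV)."""
    wbar = sum(w) / len(w)
    wc = [wi - wbar for wi in w]  # demeaning w absorbs the intercept
    num = sum(a * b for a, b in zip(wc, y))
    den = sum(a * b for a, b in zip(wc, x))
    return num / den


def simulate(n=200, reps=1000, beta=1.0, rho=0.5, seed=0):
    """Monte Carlo with a fixed instrument z and redrawn errors.

    Hypothetical design (an assumption, not the paper's specification):
        x = z + e,  y = beta * x + u,  corr(e, u) = rho.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]  # drawn once, kept fixed
    non_iv, iv = [], []
    for _ in range(reps):
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        u = [rho * ei + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0) for ei in e]
        x = [zi + ei for zi, ei in zip(z, e)]
        y = [beta * xi + ui for xi, ui in zip(x, u)]
        non_iv.append(slope(x, x, y))  # ignores endogeneity of x
        iv.append(slope(z, x, y))      # uses z as instrument for x
    return {"non-IV": mean_sd(non_iv), "IV": mean_sd(iv)}


results = simulate()
for name, (m, sd) in results.items():
    print(f"{name}: mean={m:.4f}, sd={sd:.4f}")
```

Under this hypothetical design the non-IV slope converges to roughly `beta + rho / 2` rather than `beta`, which mirrors, in a much simpler setting, the kind of bias the “non-IV” columns of Table 6 are meant to expose.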
In specification (i), where there is only one endogenous variable included in the main equation, the IV estimator performs well with respect to the estimates of the main equation, even for \(n=100\). However, the estimates for the selection equation are upward biased in finite samples; this property is common to all specifications (i)–(iv). In specification (ii), where there is only one endogenous variable in the selection equation, the estimator for the main equation does well for \(n\ge 200\). This is also true for specification (iii) with a common endogenous variable in both equations. When each equation contains an exclusive endogenous variable (specification (iv)), good results are obtained for \(n\ge 500\).
Note that the estimates for the selection equation are subject to a normalization rule. This is the reason why the performance of the IV estimator does not appear “perfect.” However, as is well known, only coefficient ratios are identified in binary choice models. Put differently, one should not consider the raw coefficients given in Table 6 but rather coefficient ratios. For example, in specification (iii) for \(n=1{,}000\), the mean of the second coefficient divided by the first gives 0.7018, whereas the mean of the third coefficient divided by the first gives 0.2991. Thus, the parameters of the selection equation are also well estimated by the FIML procedure.
By contrast, in most cases the non-IV estimator yields severely biased estimates of the parameters of the main equation across all specifications. For instance, for a sample size of \(n=1{,}000\), the bias ranges from 13 to 248.1%. However, the estimates of the selection equation are sometimes relatively close to their true values (specifications (i) and (iii)). Note especially that the estimates of the parameters of the main equation are severely biased even if endogeneity is only present in the selection equation (specification (ii)). This result, which is due to the nonlinearity of the underlying model, has so far received little attention in the literature.
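The point about normalization can be illustrated with a short sketch: rescaling all coefficients of a binary choice index by a common positive factor leaves the coefficient ratios unchanged. The coefficient values and the helper `coefficient_ratios` below are hypothetical and are not taken from Table 6.

```python
def coefficient_ratios(gamma):
    """Return gamma[j] / gamma[0] for every j >= 1."""
    base = gamma[0]
    if base == 0:
        raise ValueError("the first coefficient must be non-zero")
    return [g / base for g in gamma[1:]]


# Hypothetical selection-equation coefficients under one normalization.
gamma_a = [1.0, 0.7, 0.3]
# The same index under a different scale normalization (all coefficients doubled).
gamma_b = [2.0 * g for g in gamma_a]

print(coefficient_ratios(gamma_a))
print(coefficient_ratios(gamma_b))
```

Both calls print the same ratios, which is why the comparison with the true values is better made on ratios than on raw coefficients.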
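For reference, a percentage bias of this kind can be computed as sketched below. The definition used here, the deviation of the mean estimate from the true value relative to the true value, is an assumption about how the figures were obtained, and the numbers in the example are hypothetical rather than entries of Table 6.

```python
def relative_bias_percent(mean_estimate, true_value):
    """Bias of the mean estimate as a percentage of the true parameter value."""
    if true_value == 0:
        raise ValueError("relative bias is undefined for a true value of zero")
    return 100.0 * (mean_estimate - true_value) / true_value


# Hypothetical example: true coefficient 1.0, mean estimate 1.13 over all repetitions.
print(round(relative_bias_percent(1.13, 1.0), 1))
```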
Overall, the results show that the FIML-IV estimator from Sect. 3 outperforms the ordinary estimator for the sample selection model, especially with respect to the parameters in the main equation and in case of large sample sizes. Moreover, the results indicate that the bias in the parameter estimates may be substantial if one does not account for endogeneity.
Appendix 3
In this “Appendix”, we present an application of our FIML estimator to the labor supply data set introduced by Thomas Mroz (1987). Our goal is to compare our results with those of Wooldridge (2010), who also applied his estimator to this data set.
The Mroz data set is quite popular and is often used to illustrate the performance of estimators that account for sample selectivity. The data set covers 753 married women, of whom 428 are working. It contains information not only on relevant labor market characteristics of the women (such as the wage, educational attainment and experience) but also on private characteristics such as the number of children, the “non-wife income” and the educational attainment of the parents and the husband. The number of children and the non-wife income help identify the selection equation, while the educational attainment of the parents and the husband may serve as instrumental variables for education. These variables are assumed to satisfy an exclusion restriction in the sense that they directly affect only the probability of labor market participation and educational attainment, respectively, but not the wage rate.
For this data set, we estimated a wage equation for married women. However, as a wage equation can only be fitted to the subsample of women who are actually working, a simple regression with the women’s wage as the dependent variable may yield inconsistent parameter estimates due to the possibility of sample selection. Hence, the appropriate model to estimate the wage equation should be a sample selection model. A variable which is commonly included as an explanatory variable is education. However, there might be some background variables like ability which cannot be observed and, thus, are captured within the error terms. These variables are likely to affect not only wages and labor force participation, but education as well. Therefore, a priori education should not be regarded as exogenous. The consequences of falsely treating an endogenous variable like education as exogenous have been illustrated in Appendix 2; hence, estimates from the ordinary sample selection model may be severely biased.
We estimated the following model. The main equation contains the natural logarithm of the hourly wage as its dependent variable; explanatory variables are experience, experience squared and education. The selection equation includes experience, experience squared, non-wife income, age, the number of children younger than 6 years in the household, the number of children aged 6 years or older in the household and education. Since education is treated as endogenous, instrumental variables are needed for estimation. Following Wooldridge (2010), we chose mother’s education, father’s education and husband’s education as instrumental variables for education (see footnote 8). Means and standard deviations of these variables are presented in Table 7.
Table 8 provides estimation results for the ordinary sample selection model (“non-IV”) and the sample selection model with endogeneity (“IV”). The first part of this table contains the parameter estimates for the variables of the main equation, as well as estimates of the selection parameter \(\tilde{\rho }\) and the endogeneity parameter \(\psi _{11}\). This last parameter indicates whether endogeneity of education is relevant in the main equation. The second part presents the parameter estimates for the selection equation. It also includes the endogeneity parameter \(\psi _{21}\), which indicates whether endogeneity of education is relevant in the selection equation. Finally, the third part includes the parameter estimates of the exogenous variables and instrumental variables with respect to education. In analogy with the instrumental variables terminology, this part has been labeled “first stage.”
The results show significance of education in the main and the selection equation. Moreover, the instrumental variables for education employed in the “first stage” are highly significant. The remaining variables possess the expected signs. However, the estimates of \(\tilde{\rho }\), \(\psi _{11}\) and \(\psi _{21}\) are not significantly different from zero, indicating that there is neither a selection bias nor an endogeneity bias present (see footnote 9). These results are in line with those reported by Wooldridge (2010), who draws similar conclusions. Given that there seems to be neither a sample selection bias nor an endogeneity bias present, this agreement is not surprising.
Schwiebert, J. Estimation and interpretation of a Heckman selection model with endogenous covariates. Empir Econ 49, 675–703 (2015). https://doi.org/10.1007/s00181-014-0881-z
DOI: https://doi.org/10.1007/s00181-014-0881-z
Keywords
- Sample selection model
- Endogenous covariates
- Gender wage gap
- Composition of the female workforce
- Female labor force participation