Estimation and interpretation of a Heckman selection model with endogenous covariates

Empirical Economics

Abstract

In this paper, we develop a Heckman selection model with endogenous covariates. Estimation of this model is easy and can be done within any econometrics software that supports maximum likelihood estimation of the Heckman selection model. The most important benefit of our model is that it provides an easy-to-interpret measure of the composition of the fully observed sample with respect to unobservables. As an example, we apply our model to the study of the composition of the female full-time, full-year workforce, as has been done by Mulligan and Rubinstein (Q J Econ 123:1061–1110, 2008). We find that their conclusion that the female workforce was negatively selected in the late 1970s is robust to accounting for the potential endogeneity of education in a Heckman selection model. However, we find that accounting for endogeneity leads to a large increase in the estimated returns to education.

Notes

  1. By “fully observed sample” we mean those observations that have non-missing values in the outcome variable of interest. By contrast, individuals from the “partially observed sample” have a missing value in the outcome variable of interest.

  2. In a related approach, Blundell et al. (1998) estimated labor supply elasticities, controlling for endogeneity of covariates and sample selectivity. Their approach was quite specific to that particular application. The reader may find the exposition in this paper to be more general.

  3. The approach undertaken here to accommodate the endogeneity problem is known as a “control function approach” in the literature (see, e.g., Wooldridge (2010), pp. 126–29).

  4. We also provide in the “Appendix” at the end of this paper a small Monte Carlo simulation study which analyzes the finite sample performance of the FIML estimator and compares its estimates to the (biased) estimates based on the ordinary Heckman selection model which does not control for endogeneity. Moreover, we provide an application of our estimator to the well-known Mroz (1987) labor supply data set in order to compare our results with those of Wooldridge (2010), who did the same using his estimator.

  5. We obtained our data files from the IPUMS-USA database (Ruggles et al. 2010).

  6. Mulligan and Rubinstein (2008) argued that they did not want to identify the main equation parameters by functional form assumptions alone, hence they selected an instrumental variable for the selection equation.

  7. It might be argued that marital status is endogenous as well. We thus replicated our analysis without the marital status dummies. However, our results did not change much qualitatively.

  8. For the appropriateness of these instrumental variables, cf. the discussion in Card (1999), pp. 1822–1826.

  9. In addition, joint significance of \(\psi _{11}\) and \(\psi _{21}\) is rejected as well (\(p\) value of 0.2532).

References

  • Ahn H, Powell JL (1993) Semiparametric estimation of censored selection models with a nonparametric selection mechanism. J Econ 58:3–29. doi:10.1016/0304-4076(93)90111-H

  • Amemiya T (1985) Advanced econometrics. Basil Blackwell, Oxford

  • Angrist JD, Imbens GW, Rubin DB (1996) Identification of causal effects using instrumental variables. J Am Stat Assoc 91:444–455

  • Angrist JD, Krueger AB (1991) Does compulsory school attendance affect schooling and earnings? Q J Econ 106:979–1014. doi:10.2307/2937954

  • Blundell R, Duncan A, Meghir C (1998) Estimating labor supply responses using tax reforms. Econometrica 66:827–861

  • Bound JB, Jaeger DA, Baker RM (1995) Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. J Am Stat Assoc 90:443–450. doi:10.1080/01621459.1995.10476536

  • Card D (1999) The causal effect of education on earnings. In: Ashenfelter O, Card D (eds) Handbook of labor economics, vol 3. Elsevier, Amsterdam, pp 1801–1863

  • Chib S, Greenberg E, Jeliazkov I (2009) Estimation of semiparametric models in the presence of endogeneity and sample selection. J Comput Graph Stat 18:321–348. doi:10.1198/jcgs.2009.07070

  • Das M, Newey WK, Vella F (2003) Nonparametric estimation of sample selection models. Rev Econ Stud 70:33–58. doi:10.1111/1467-937X.00236

  • Davidson R, MacKinnon JG (1993) Estimation and inference in econometrics. Oxford University Press, New York

  • Gallant AR, Nychka DW (1987) Semi-nonparametric maximum likelihood estimation. Econometrica 55:363–390

  • Heckman JJ (1978) Dummy endogenous variables in a simultaneous equation system. Econometrica 46:931–959

  • Heckman JJ (1979) Sample selection bias as a specification error. Econometrica 47:153–161

  • Imbens GW, Angrist JD (1994) Identification and estimation of local average treatment effects. Econometrica 62:467–475

  • Mroz TA (1987) The sensitivity of an empirical model of married women’s hours of work to economic and statistical assumptions. Econometrica 55:765–799

  • Mulligan CB, Rubinstein Y (2008) Selection, investment, and women’s relative wages over time. Q J Econ 123:1061–1110. doi:10.1162/qjec.2008.123.3.1061

  • Newey WK (1987) Efficient estimation of limited dependent variable models with endogenous explanatory variables. J Econ 36:231–250

  • Newey WK (2009) Two-step series estimation of sample selection models. Econometrics J 12:S217–S229. doi:10.1111/j.1368-423X.2008.00263.x

  • Powell JL (1987) Semiparametric estimation of bivariate limited dependent variable models. Manuscript. University of California, Berkeley

  • Rivers D, Vuong QH (1988) Limited information estimators and exogeneity tests for simultaneous probit models. J Econ 39:347–366. doi:10.1016/0304-4076(88)90063-2

  • Ruggles S, Alexander JT, Genadek K, Goeken R, Schroeder MB, Sobek M (2010) Integrated public use microdata series: version 5.0 [machine-readable database]. Minneapolis, University of Minnesota

  • Semykina A, Wooldridge JM (2010) Estimating panel data models in the presence of endogeneity and selection. J Econ 157:375–380. doi:10.1016/j.jeconom.2010.03.039

  • Smith RJ, Blundell RW (1986) An exogeneity test for a simultaneous equation tobit model with an application to labor supply. Econometrica 54:679–685

  • Stock JH, Wright JH, Yogo M (2002) A survey of weak instruments and weak identification in generalized method of moments. J Bus Econ Stat 20:518–529

  • Wooldridge JM (2010) Econometric analysis of cross section and panel data, 2nd edn. The MIT Press, Cambridge

Author information

Corresponding author

Correspondence to Jörg Schwiebert.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 174 KB)

Appendices

Appendix 1

In this “Appendix”, we show how the asymptotic covariance matrix of the LIML estimator must be corrected in order to account for the estimation of the regressors \(\varvec{\varepsilon }_1\), \(\varvec{\varepsilon }_2\) and \(\varvec{\varepsilon }_3\). First, let \(\varvec{\alpha }\equiv (\text {vec}(\varvec{\varDelta })',\text {vec}(\varvec{\varLambda })',\text {vec}(\varvec{\varUpsilon })')'\) and \(l(\varvec{\tilde{\theta }},\varvec{\hat{\alpha }})=\sum _{i=1}^nl_i(\varvec{\tilde{\theta }},\varvec{\hat{\alpha }})\) be the limited information log-likelihood function. Provided there exists an interior solution, we can write the first order condition from maximizing this likelihood function as

$$\begin{aligned} \sum _{i=1}^n\frac{\partial l_i(\varvec{\hat{\tilde{\theta }}},\varvec{\hat{\alpha }})}{\partial \varvec{\tilde{\theta }}}=0. \end{aligned}$$
(51)

An asymptotic first order expansion about \(\varvec{\hat{\tilde{\theta }}}=\varvec{\tilde{\theta }}\) gives after rearranging and pre-multiplication with \(\sqrt{n}\)

$$\begin{aligned} \sqrt{n}(\varvec{\hat{\tilde{\theta }}}-\varvec{\tilde{\theta }})=\left( -\frac{1}{n}\sum _{i=1}^n\frac{\partial ^2l_i(\varvec{\tilde{\theta }},\varvec{\alpha })}{\partial \varvec{\tilde{\theta }}\partial \varvec{\tilde{\theta }}'}\right) ^{-1}\frac{1}{\sqrt{n}}\sum _{i=1}^n\frac{\partial l_i(\varvec{\tilde{\theta }},\varvec{\hat{\alpha }})}{\partial \varvec{\tilde{\theta }}}+o_p(1). \end{aligned}$$
(52)

Expanding the gradient about \(\varvec{\hat{\alpha }}=\varvec{\alpha }\) yields

$$\begin{aligned} \sqrt{n}(\varvec{\hat{\tilde{\theta }}}-\varvec{\tilde{\theta }})=&\left( -\frac{1}{n}\sum _{i=1}^n\frac{\partial ^2l_i(\varvec{\tilde{\theta }},\varvec{\alpha })}{\partial \varvec{\tilde{\theta }}\partial \varvec{\tilde{\theta }}'}\right) ^{-1}\frac{1}{\sqrt{n}}\sum _{i=1}^n\frac{\partial l_i(\varvec{\tilde{\theta }},\varvec{\alpha })}{\partial \varvec{\tilde{\theta }}}\nonumber \\&+\left( -\frac{1}{n}\sum _{i=1}^n\frac{\partial ^2l_i(\varvec{\tilde{\theta }},\varvec{\alpha })}{\partial \varvec{\tilde{\theta }}\partial \varvec{\tilde{\theta }}'}\right) ^{-1}\left( \frac{1}{n}\sum _{i=1}^n\frac{\partial ^2 l_i(\varvec{\tilde{\theta }},\varvec{\alpha })}{\partial \varvec{\tilde{\theta }}\partial \varvec{\alpha }'}\right) \sqrt{n}(\varvec{\hat{\alpha }}-\varvec{\alpha })+o_p(1). \end{aligned}$$
(53)

If

$$\begin{aligned}&\displaystyle -\frac{1}{n}\sum _{i=1}^n\frac{\partial ^2l_i(\varvec{\tilde{\theta }},\varvec{\alpha })}{\partial \varvec{\tilde{\theta }}\partial \varvec{\tilde{\theta }}'}\overset{p}{\longrightarrow }\mathbf H \text { pos. def.}\end{aligned}$$
(54)
$$\begin{aligned}&\displaystyle \frac{1}{\sqrt{n}}\sum _{i=1}^n\frac{\partial l_i(\varvec{\tilde{\theta }},\varvec{\alpha })}{\partial \varvec{\tilde{\theta }}}\overset{d}{\longrightarrow }\mathcal N(\mathbf 0,\mathbf M)\end{aligned}$$
(55)
$$\begin{aligned}&\displaystyle \frac{1}{n}\sum _{i=1}^n\frac{\partial ^2 l_i(\varvec{\tilde{\theta }},\varvec{\alpha })}{\partial \varvec{\tilde{\theta }}\partial \varvec{\alpha }'}\overset{p}{\longrightarrow }\mathbf J\end{aligned}$$
(56)
$$\begin{aligned}&\displaystyle \sqrt{n}(\varvec{\hat{\alpha }}-\varvec{\alpha })\overset{d}{\longrightarrow }{\mathcal {N}}(\mathbf{0},\mathbf{V}), \end{aligned}$$
(57)

then

$$\begin{aligned} \sqrt{n}(\varvec{\hat{\tilde{\theta }}}-\varvec{\tilde{\theta }})\overset{d}{\longrightarrow }\mathcal N(\mathbf{0},\mathbf{C}), \end{aligned}$$
(58)

where \(\mathbf C=\mathbf H^{-1}(\mathbf M+\mathbf J\mathbf V\mathbf J')\mathbf H^{-1}\). This follows because the covariance between \(\frac{\partial l_i(\varvec{\tilde{\theta }},\varvec{\alpha })}{\partial \varvec{\tilde{\theta }}}\) and \((\varvec{\hat{\alpha }}-\varvec{\alpha })\) is zero, as shown by Smith and Blundell (1986).

Note that implementing the LIML estimator in standard econometrics software yields an asymptotic covariance of \(\mathbf H^{-1}\mathbf M\mathbf H^{-1}\), because the software does not know that some regressors have been estimated. Hence, one must add the correction term \(\mathbf H^{-1}(\mathbf J\mathbf V\mathbf J')\mathbf H^{-1}\) to this expression in order to obtain the correct asymptotic covariance.
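The correction described above can be sketched numerically. The following is a minimal illustration, assuming that consistent estimates of \(\mathbf H\), \(\mathbf M\), \(\mathbf J\) and \(\mathbf V\) are already available as NumPy arrays; the function name `corrected_covariance` and the toy matrices are hypothetical and are not taken from the paper.

```python
import numpy as np

def corrected_covariance(H, M, J, V):
    """Return (naive, corrected) asymptotic covariance matrices.

    naive     = H^{-1} M H^{-1}            (what standard software reports)
    corrected = H^{-1} (M + J V J') H^{-1}  (accounts for estimated regressors)
    """
    H_inv = np.linalg.inv(H)
    naive = H_inv @ M @ H_inv
    correction = H_inv @ (J @ V @ J.T) @ H_inv
    return naive, naive + correction

# Hypothetical toy inputs: two second-step parameters, one first-step parameter.
H = np.array([[2.0, 0.0], [0.0, 4.0]])
M = H.copy()  # M = H is assumed here only to keep the toy example simple
J = np.array([[1.0], [2.0]])
V = np.array([[0.5]])

naive, corrected = corrected_covariance(H, M, J, V)
print(naive)
print(corrected)
```

Because \(\mathbf J\mathbf V\mathbf J'\) is positive semidefinite, the corrected variances on the diagonal can never be smaller than the naive ones.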

Appendix 2

In this “Appendix”, we use Monte Carlo simulations in order to study the finite-sample properties of our FIML estimator and in order to gauge the bias which occurs if one does not account for endogeneity. The results of these simulations are presented in Table 6.

Table 6 Monte Carlo results

The first column of Table 6 contains the specification. We distinguish between four benchmark cases. In the first case, endogeneity is only present in the main equation. In particular, it is assumed that

$$\begin{aligned} \begin{array}{llllllll} y_i^*&{}=.2&{}+.4\,X_{1i}&{}+.9\,X_{2i}&{}&{}&{}+u_i\\ z_i^*&{}=1&{}&{}&{}+.7\,W_{1i}&{}&{}+v_i\\ X_{2i}&{}=.5&{}+1.5\,X_{1i}&{}&{}-.2\,W_{1i}&{}+.7\,Z_{1i}&{}+\varepsilon _{1i} \end{array} \end{aligned}$$

and

$$\begin{aligned} \text {Cov}[(u_i,v_i,\varepsilon _{1i})']=\begin{pmatrix}1&{}&{}\\ .9&{}1&{}\\ .5&{}.4&{}2\end{pmatrix}\!. \end{aligned}$$

Note that we have assumed a relatively high correlation between the main and the selection equation. Hence, we focus our attention on situations where sample selection bias is indeed a problem.

In the second case, endogeneity is only present in the selection equation:

$$\begin{aligned} \begin{array}{llllllll} y_i^*&{}=.2&{}+.4\,X_{1i}&{}&{}&{}+u_i\\ z_i^*&{}=1&{}+.7\,X_{1i}&{}+.3\,W_{2i}&{}&{}+v_i\\ W_{2i}&{}=.5&{}+1.5\,X_{1i}&{}&{}+.7\,Z_{2i}&{}+\varepsilon _{2i} \end{array} \end{aligned}$$

and

$$\begin{aligned} \text {Cov}[(u_i,v_i,\varepsilon _{2i})']=\begin{pmatrix}1&{}&{}\\ .9&{}1&{}\\ .5&{}.4&{}2\end{pmatrix}\!. \end{aligned}$$

In the third case, there is one common variable in both equations which is endogenous:

$$\begin{aligned} \begin{array}{llllllll} y_i^*&{}=.2&{}+.4\,X_{1i}&{}&{}+.9\,C_i&{}&{}+u_i\\ z_i^*&{}=1&{}&{}+.7\,W_{1i}&{}+.3\,C_i&{}&{}+v_i\\ C_i&{}=.5&{}+1.5\,X_{1i}&{}-.2\,W_{1i}&{}&{}+.7\,Z_{3i}&{}+\varepsilon _{3i} \end{array} \end{aligned}$$

and

$$\begin{aligned} \text {Cov}[(u_i,v_i,\varepsilon _{3i})']=\begin{pmatrix}1&{}&{}\\ .9&{}1&{}\\ .5&{}.4&{}2\end{pmatrix}\!. \end{aligned}$$

Finally, in the fourth case, it is assumed that both equations include an endogenous variable which is exclusive for each equation:

$$\begin{aligned} \begin{array}{llllllll} y_i^*&{}=.2&{}+.4\,X_{1i}&{}+.9\,X_{2i}&{}&{}&{}&{}+u_i\\ z_i^*&{}=1&{}+.7\,X_{1i}&{}&{}+.3\,W_{2i}&{}&{}&{}+v_i\\ X_{2i}&{}=.5&{}+1.5\,X_{1i}&{}&{}&{}+.7\,Z_{1i}&{}&{}+\varepsilon _{1i}\\ W_{2i}&{}=-2&{}+1.8\,X_{1i}&{}&{}&{}&{}+.6\,Z_{2i}&{}+\varepsilon _{2i} \end{array} \end{aligned}$$

and

$$\begin{aligned} \text {Cov}[(u_i,v_i,\varepsilon _{1i},\varepsilon _{2i})']=\begin{pmatrix}1&{}&{}&{}\\ .9&{}1&{}&{}\\ .5&{}.4&{}2&{}\\ .4&{}.5&{}1&{}2\end{pmatrix}. \end{aligned}$$

Throughout, \(X_{1i},\, Z_{1i},\, Z_{2i}\) and \(Z_{3i},\, i=1,\dots ,n\), are scalars which have been simulated from a standard normal distribution. For each of the four cases, these random numbers have been drawn once and kept fixed during simulation. In total, each simulation encompasses 1,000 repetitions in which parameter estimates have been computed. Table 6 presents the mean of these estimates over the repetitions, along with the corresponding standard deviations. Note that in accordance with the notation in Sect. 2 of the main text, the \(\beta \)’s in Table 6 refer to the parameters of the main equation, while the \(\gamma \)’s refer to the parameters of the selection equation.
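As an illustration of the design just described, one repetition of the fourth case could be generated along the following lines. This is a sketch under stated assumptions, not the code used for Table 6: the errors are assumed to be jointly normal, the outcome is assumed to be observed only when \(z_i^*>0\), and all function and variable names are hypothetical.

```python
import numpy as np

# Covariance of (u, v, eps1, eps2) in case (iv). The text prints only the
# lower triangle; it is mirrored here to give the full symmetric matrix.
COV_IV = np.array([
    [1.0, 0.9, 0.5, 0.4],
    [0.9, 1.0, 0.4, 0.5],
    [0.5, 0.4, 2.0, 1.0],
    [0.4, 0.5, 1.0, 2.0],
])

def draw_exogenous(n, rng):
    """Draw X1, Z1, Z2 once from N(0, 1); they stay fixed across repetitions."""
    return rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(n)

def simulate_case_iv(x1, z1, z2, rng):
    """Generate one repetition of case (iv) given fixed exogenous variables."""
    n = x1.shape[0]
    # Assumption: errors are jointly normal with covariance COV_IV.
    u, v, e1, e2 = rng.multivariate_normal(np.zeros(4), COV_IV, size=n).T
    x2 = 0.5 + 1.5 * x1 + 0.7 * z1 + e1    # endogenous, main equation only
    w2 = -2.0 + 1.8 * x1 + 0.6 * z2 + e2   # endogenous, selection equation only
    y_star = 0.2 + 0.4 * x1 + 0.9 * x2 + u
    z_star = 1.0 + 0.7 * x1 + 0.3 * w2 + v
    selected = z_star > 0                  # assumed selection rule
    y = np.where(selected, y_star, np.nan) # outcome missing when not selected
    return y, selected, x2, w2

rng = np.random.default_rng(0)
x1, z1, z2 = draw_exogenous(500, rng)
y, selected, x2, w2 = simulate_case_iv(x1, z1, z2, rng)
print(selected.mean())  # share of fully observed observations
```

Calling `simulate_case_iv` repeatedly with the same `x1`, `z1` and `z2` mirrors the statement that the exogenous draws are kept fixed while only the errors change between repetitions.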

In order to gauge the finite-sample performance of the estimator outlined in Sect. 3, Table 6 contains simulation results for different sample sizes. For each sample size, Table 6 displays the results for the FIML estimator presented in Sect. 3 (“IV”) and contrasts these results with those obtained when using the ordinary estimator for the sample selection model which does not account for endogeneity (“non-IV”). To save space, only the estimates for the parameters of the main equation and selection equation are presented.

In specification (i) where there is only one endogenous variable included in the main equation, the IV estimator performs well with respect to the estimates of the main equation, even for \(n=100\). However, the estimates for the selection equation are upward biased in finite samples; this property is common in all specifications (i)-(iv). In specification (ii) where there is only one endogenous variable in the selection equation, the estimator for the main equation does well for \(n\ge 200\). This is also true for specification (iii) with a common endogenous variable in both equations. When each equation contains an exclusive endogenous variable (specification (iv)), good results are obtained for \(n\ge 500\).

Note that the estimates for the selection equation are subject to a normalization rule. This is why the performance of the IV estimator does not appear to be “perfect.” However, as is well known, only coefficient ratios are identified in binary choice models. Put differently, one should consider coefficient ratios rather than the raw coefficients given in Table 6. For example, in specification (iii) for \(n=1{,}000\), the mean of the second coefficient divided by the first is 0.7018, whereas the mean of the third coefficient divided by the first is 0.2991; the corresponding true ratios are 0.7 and 0.3. Thus, the parameters of the selection equation are also well estimated by the FIML procedure.
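The point about coefficient ratios can be made concrete with a small sketch. The scale factor of 1.05 below is purely hypothetical and only stands in for whatever rescaling the normalization rule induces; the helper name `coefficient_ratios` is likewise an assumption.

```python
import numpy as np

def coefficient_ratios(gamma):
    """Divide every coefficient after the first by the first coefficient."""
    gamma = np.asarray(gamma, dtype=float)
    return gamma[1:] / gamma[0]

# True selection-equation coefficients of specification (iii): 1, .7, .3
true_gamma = np.array([1.0, 0.7, 0.3])

# Hypothetical rescaling of all coefficients by a common positive factor
scaled_gamma = 1.05 * true_gamma

print(coefficient_ratios(true_gamma))
print(coefficient_ratios(scaled_gamma))
```

Both calls return the same ratios, which is why raw coefficients that look biased upward can still correspond to well-estimated ratios.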

By contrast, the non-IV estimator yields severely biased estimates of the parameters of the main equation in most cases, across all specifications. For instance, for a sample size of \(n=1{,}000\), the bias ranges from 13 % to 248.1 %. However, the estimates of the selection equation are sometimes relatively close to their true values (specifications (i) and (iii)). Nevertheless, the estimates of the parameters of the main equation are severely biased even if endogeneity is present only in the selection equation (specification (ii)). This result, which is due to the nonlinearity of the underlying model, has not yet received much attention in the literature.
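Summary statistics of this kind can be computed from the replication output as sketched below. The text does not define the percentage bias explicitly; the sketch assumes it is the absolute deviation of the mean estimate from the true value, relative to the true value. The function name and the two-replication input are hypothetical and do not reproduce any entry of Table 6.

```python
import numpy as np

def summarize_replications(estimates, true_values):
    """Mean, standard deviation and relative bias (in %) over repetitions.

    estimates   : array of shape (repetitions, parameters)
    true_values : array of shape (parameters,)
    """
    estimates = np.asarray(estimates, dtype=float)
    true_values = np.asarray(true_values, dtype=float)
    mean = estimates.mean(axis=0)
    sd = estimates.std(axis=0, ddof=1)
    # Assumed definition: |mean estimate - truth| / |truth|, in percent
    rel_bias_pct = 100.0 * np.abs(mean - true_values) / np.abs(true_values)
    return mean, sd, rel_bias_pct

# Hypothetical estimates from two repetitions of a two-parameter model
estimates = [[0.5, 1.0], [0.4, 1.2]]
true_values = [0.4, 0.9]
mean, sd, rel_bias_pct = summarize_replications(estimates, true_values)
print(mean, sd, rel_bias_pct)
```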

Overall, the results show that the FIML-IV estimator from Sect. 3 outperforms the ordinary estimator for the sample selection model, especially with respect to the parameters in the main equation and in case of large sample sizes. Moreover, the results indicate that the bias in the parameter estimates may be substantial if one does not account for endogeneity.

Appendix 3

In this “Appendix”, we present an application of our FIML estimator to the labor supply data set introduced by Thomas Mroz (1987). Our goal is to compare our results with those of Wooldridge (2010), who also applied his estimator to this data set.

The Mroz data set is quite popular and is often used to illustrate the performance of estimators which account for sample selectivity. The data set consists of 753 married women, of whom 428 are working. We have information not only about relevant labor market characteristics of the women (such as the wage, educational attainment and experience) but also about private characteristics such as the number of children, the “non-wife income” and the educational attainment of the parents and the husband. The number of children and the non-wife income help identify the selection equation, while the educational attainment of the parents and the husband may serve as instrumental variables for education. These variables are assumed to satisfy an exclusion restriction in the sense that they directly affect only the probability of labor market participation and educational attainment, respectively, but not the wage rate.

For this data set, we estimated a wage equation for married women. However, as a wage equation can only be fitted to the subsample of women who are actually working, a simple regression with the women’s wage as the dependent variable may yield inconsistent parameter estimates due to the possibility of sample selection. Hence, the appropriate model to estimate the wage equation should be a sample selection model. A variable which is commonly included as an explanatory variable is education. However, there might be some background variables like ability which cannot be observed and, thus, are captured within the error terms. These variables are likely to affect not only wages and labor force participation, but education as well. Therefore, a priori education should not be regarded as exogenous. The consequences of falsely treating an endogenous variable like education as exogenous have been illustrated in Appendix 2; hence, estimates from the ordinary sample selection model may be severely biased.

We estimated the following model: The main equation contains the natural logarithm of the hourly wage as its dependent variable; explanatory variables are experience, experience squared and education. The selection equation includes experience, experience squared, non-wife income, age, number of children under 6 years of age in the household, number of children aged 6 years or older in the household and education. Since education is treated as endogenous, instrumental variables are needed for estimation. Following Wooldridge (2010), we chose mother’s education, father’s education and husband’s education as instrumental variables for education (see Note 8). Means and standard deviations of these variables are presented in Table 7.
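For orientation, the verbal specification above can be written out as a three-equation system. This is only a sketch: the variable labels and the coefficient symbols \(\beta _j\), \(\gamma _j\), \(\delta \) and \(\pi _j\) are illustrative assumptions, \(\mathbf {x}_i\) is assumed to collect the exogenous variables entering the first stage, and the wage is assumed to be observed only if \(z_i^*>0\).

$$\begin{aligned} \log (\text {wage}_i)&=\beta _0+\beta _1\,\text {exper}_i+\beta _2\,\text {exper}_i^2+\beta _3\,\text {educ}_i+u_i\\ z_i^*&=\gamma _0+\gamma _1\,\text {exper}_i+\gamma _2\,\text {exper}_i^2+\gamma _3\,\text {nwifeinc}_i+\gamma _4\,\text {age}_i+\gamma _5\,\text {kidslt6}_i+\gamma _6\,\text {kidsge6}_i+\gamma _7\,\text {educ}_i+v_i\\ \text {educ}_i&=\mathbf {x}_i'\varvec{\delta }+\pi _1\,\text {motheduc}_i+\pi _2\,\text {fatheduc}_i+\pi _3\,\text {huseduc}_i+\varepsilon _i \end{aligned}$$

The exclusion restrictions are visible directly: the three instruments appear only in the education equation, and non-wife income, age and the two child variables appear in the selection equation but not in the wage equation.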

Table 7 Descriptive statistics for the Mroz data

Estimation results are given in Table 8. In Table 8, estimation results for the ordinary sample selection model (“non-IV”) and the sample selection model with endogeneity (“IV”) are provided. The first part of this table contains the parameter estimates for the variables of the main equation, as well as estimates of the selection parameter \(\tilde{\rho }\) and the endogeneity parameter \(\psi _{11}\). This last parameter indicates whether endogeneity of education is relevant in the main equation. The second part presents the parameter estimates for the selection equation. Additionally included is the endogeneity parameter \(\psi _{21}\), which indicates whether endogeneity of education is relevant in the selection equation. Finally, the third part includes the parameter estimates of the exogenous variables and instrumental variables with respect to education. In analogy with the instrumental variables terminology, this part has been labeled “first stage.”

Table 8 Estimation of a wage equation for married women based on the Mroz data

The results show that education is significant in both the main and the selection equation. Moreover, the instrumental variables for education employed in the “first stage” are highly significant. The remaining variables have the expected signs. However, the estimates of \(\tilde{\rho }\), \(\psi _{11}\) and \(\psi _{21}\) are not significantly different from zero, indicating that neither a selection bias nor an endogeneity bias is present (see Note 9). These results are in line with those reported by Wooldridge (2010), who draws similar conclusions. Given that there seems to be neither a sample selection bias nor an endogeneity bias, this agreement is not surprising.

About this article

Cite this article

Schwiebert, J. Estimation and interpretation of a Heckman selection model with endogenous covariates. Empir Econ 49, 675–703 (2015). https://doi.org/10.1007/s00181-014-0881-z

  • DOI: https://doi.org/10.1007/s00181-014-0881-z
