1 Introduction

In this paper we consider an extension of the familiar Heckman (1979) sample selection model, which can be written as follows:

$$\begin{aligned} y_{1i}&= \textbf{x}_{1i}'{\alpha }+ u_{1i}, \end{aligned}$$
(1)
$$\begin{aligned} y_{2i}&= \mathbb {I}(\textbf{x}_i'{\beta }+ \textbf{z}_i' {\eta }+ u_{2i} \ge 0), \end{aligned}$$
(2)

where \(\mathbb {I}(\cdot )\) in the selection Eq. (2) denotes the indicator function, which takes the value one if its argument is true and zero otherwise, \((u_{1i}, u_{2i})\) are the error terms and the outcome variable \(y_{1i}\) is observed only if the selection variable \(y_{2i} = 1\). The main Eq. (1) contains a \(k_1 \times 1\) vector of explanatory variables \(\textbf{x}_{1i}\) and we are interested in estimating and testing the coefficient vector \({\alpha }\). The explanatory variables of the selection equation are separated into two parts, \(\textbf{x}_i=(\textbf{x}_{1i},\textbf{x}_{2i})\) and \(\textbf{z}_i\). In the traditional version of the model there is no distinction between \(\textbf{x}_{2i}\) and \(\textbf{z}_i\) – both represent the explanatory variables that are present in the selectivity model but not in the main equation of interest. These are the well-known exclusion restrictions that facilitate identification of \({\alpha }\). In our setting, for reasons that will become clear shortly, we wish to differentiate between \(\textbf{x}_{2i}\) and \(\textbf{z}_i\).

Our extension is to introduce a high-dimensional vector of explanatory variables into the selectivity model (2), which may or may not belong to the model. The vector \(\textbf{x}\) is a low-dimensional \(k \times 1\) vector of selection determinants that we wish to keep in the model no matter what. The vector \(\textbf{z}\) is a high-dimensional \(p \times 1\) vector of potential controls, where p can be as large as the (pre-selection) sample size N or larger and where we do not know which of these controls are important, if any. The vector \({\beta }\) is a \(k \times 1\) vector of coefficients on \(\textbf{x}\), which can be a target of inference too. The vector \({\eta }\), by contrast, is just a \(p \times 1\) nuisance parameter vector.

This extension has many empirical applications in economics where we have a well-defined list of regressors for the main equation, rooted in economic theory (e.g., consumer and labor theory), while what determines selection into the sample is less certain (see, e.g., Roy 1951; Heckman and Honore 1990). The classic examples are the estimation of the female labor supply and wage functions (see, e.g., Heckman 1979; Arellano and Bonhomme 2017), which may be subject to selection bias as determinants of the sample selection are confounded with the behavioral functions of interest. We return to women’s labor force participation and labor supply decisions in our empirical application section.

Our objective is to consistently estimate \({\alpha }\) in the outcome Eq. (1) under a potential sample selection bias arising from the fact that in the observed sample

$$\begin{aligned} {\mathbb {E}}(y_{1i}| \textbf{x}_i, \textbf{z}_i, y_{2i}=1) = \textbf{x}_{1i}'{\alpha }+ {\mathbb {E}}(u_{1i}| \textbf{x}_i, \textbf{z}_i, y_{2i}=1) \ne \textbf{x}_{1i}'{\alpha }, \end{aligned}$$

unless \({\mathbb {E}}(u_{1i}| \textbf{x}_i, \textbf{z}_i, y_{2i}=1)=0\), which is a questionable assumption in practice. Heckman (1979) assumed joint normality of \((u_{1i}, u_{2i})\) and showed that \({\mathbb {E}}(u_{1i}| \textbf{x}_i, \textbf{z}_i, y_{2i}=1)=\gamma \lambda (\textbf{x}_i'{\beta }+ \textbf{z}_i' {\eta })\), where \(\lambda (\cdot ) = \phi (\cdot )/\Phi (\cdot )\) is known as the inverse Mills ratio. The two-step heckit procedure is (a) to run maximum likelihood estimation (MLE) for the probit of \(y_{2i}\) on \((\textbf{x}_{i}, \textbf{z}_i)\) and use the estimates \(({\hat{{\beta }}}, {\hat{{\eta }}})\) to obtain \({\hat{\lambda }}_i \equiv \lambda (\textbf{x}_i'{\hat{{\beta }}}+ \textbf{z}_i' {\hat{{\eta }}})\) and then (b) to regress \(y_{1i}\) on \(\textbf{x}_{1i}\) and \({\hat{\lambda }}_i\). Under correct specification, the resulting estimators \({\hat{{\alpha }}}\) and \({{\hat{\gamma }}}\) are consistent and the usual t-test on \({{\hat{\gamma }}}\) can be used to test for selection bias. If the null of no bias is rejected, the standard errors of the second step have to be corrected for the first-step estimation error, which is done either via full MLE using normality of the errors or via an analytic correction to the variance in the second step.
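For concreteness, the following minimal sketch implements the classical two-step procedure in Python, assuming arrays y1, y2, X1 (main-equation regressors) and XZ (all selection-equation regressors) are already in memory; all names are hypothetical, the sketch is not our DS-HECK implementation, and it omits the second-step variance correction just described.

```python
# A minimal sketch of the classical two-step heckit (not the DS-HECK command);
# y1, y2, X1, XZ are hypothetical NumPy arrays, and the second-step standard
# errors below are NOT corrected for the generated inverse Mills ratio.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def two_step_heckit(y1, y2, X1, XZ):
    # Step (a): probit of the selection dummy y2 on all selection regressors.
    XZc = sm.add_constant(XZ)
    probit = sm.Probit(y2, XZc).fit(disp=0)
    index = XZc @ probit.params
    imr = norm.pdf(index) / norm.cdf(index)          # inverse Mills ratio

    # Step (b): OLS of y1 on X1 and the inverse Mills ratio, selected sample only.
    sel = y2 == 1
    W = sm.add_constant(np.column_stack([X1[sel], imr[sel]]))
    ols = sm.OLS(y1[sel], W).fit()
    return ols.params                                # last element: gamma-hat
```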

The high-dimensionality of \(\textbf{z}_i\) poses a challenge in applying the traditional two-step procedure: we cannot include all the variables in \(\textbf{z}_i\) in the first step. If p is larger than N, the probit with all \(\textbf{x}_i\) and \(\textbf{z}_i\) is infeasible, and even if p is substantially smaller than N but still large, including all these variables can cause difficulties in MLE convergence.

In order to make estimation feasible, it is common to impose a certain structure on \(\eta \), known in the literature on regularized estimation as a sparsity scenario. In particular, we assume that only a few elements in the coefficient vector \(\eta \) are substantially different from zero. Although we assume that \(\eta \) is sparse, we do not know which elements are non-zero and a consistent model selection technique is required. A popular approach to regularizing linear models is the least absolute shrinkage and selection operator (lasso) developed by Tibshirani (1996). The method penalizes the objective function with an \(l_1\)-norm of the coefficients. This shrinks the irrelevant coefficients to zero and thus serves as a model selection tool. However, even for purely linear models, this approach faces well-known challenges.
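As a toy illustration of this selection property (with an arbitrary data-generating process and an arbitrary penalty level, not the tuning discussed later), lasso applied to a sparse linear model typically returns exact zeros for the irrelevant coefficients:

```python
# Toy example: lasso sets the coefficients on irrelevant regressors to exactly
# zero and thus acts as a model selection device. Data and penalty are arbitrary.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, p = 500, 50
Z = rng.standard_normal((N, p))
eta = np.zeros(p)
eta[:3] = [1.0, -1.0, 0.5]                 # sparse truth: only 3 non-zeros
y = Z @ eta + rng.standard_normal(N)

fit = Lasso(alpha=0.1).fit(Z, y)
print("selected columns:", np.flatnonzero(fit.coef_))   # typically [0 1 2]
```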

First, lasso makes mistakes. Like many other model selection techniques, lasso does not always find all the relevant covariates, especially when some coefficients are small. Failure to account for the fact that the covariates have been selected by lasso results in invalid inference: the model selection mistakes made by lasso cause the distribution of the resulting naive estimator to be biased and non-normal. For example, Leeb and Pötscher (2008a, 2008b) and Pötscher and Leeb (2009) show that the normal approximation for the naive lasso estimator will produce misleading inference. Belloni et al. (2014b), Belloni et al. (2016b), and Chernozhukov et al. (2018) derive estimators that are robust to the mistakes made by lasso. Such robust estimators are often referred to as Neyman orthogonal (NO) estimators because they can be viewed as extensions of an approach proposed by Neyman (1959).

The second challenge is choosing the lasso tuning parameter. Lasso’s ability to select relevant covariates depends on the method used to choose the tuning parameters. Belloni et al. (2014b) propose a plug-in method and show that NO estimators perform well on linear models under that method. Belloni et al. (2016b) extend the linear lasso to logit models and show good performance using a simplified version of the plug-in method. Drukker and Liu (2022) extend the plug-in method to cross-sectional generalized linear models and provide Monte Carlo evidence that their extension works well in finite samples.

In this paper, we develop NO estimation for the model in (1)–(2), which we call the double-selection Heckman procedure, or DS-HECK. The DS-HECK estimator draws upon the classical two-step heckit estimator and the double-selection lasso for high-dimensional generalized linear models proposed by Belloni et al. (2016b). We detail the steps involved in the estimation, work out the estimator’s properties and derive the variance corrections. We also provide new insights into how NO estimation is linked to results on redundancy of knowledge in moment-based estimation considered by Breusch et al. (1999) and Prokhorov and Schmidt (2009).

The rest of the paper is organized as follows. Section 2 describes and studies the DS-HECK estimator. In Sect. 3, we present simulation results that demonstrate the excellent finite-sample performance of DS-HECK. In Sect. 4, we apply DS-HECK to estimate married women’s wages using the 2013 PSID wave, in the presence of high-dimensional controls and potential sample selection bias. Finally, Sect. 5 concludes.

2 The DS-HECK estimator

2.1 Settings

We maintain the standard assumption of the Heckman sample selection model.

Assumption 1

(a) \((\textbf{x}, \textbf{z}, y_2)\) are always observed, \(y_1\) is observed only when \(y_2 =1\); (b) \((u_1, u_2)\) is independent of \(\textbf{x}\) and \(\textbf{z}\) with zero mean; (c) \(u_2 \sim N(0, 1)\); (d) \( \mathop {\mathbb {E}} (u_1 | u_2) = \gamma u_2\).

Assumption 1 is in essence the same as in Wooldridge (2010, p. 803). Part (a) describes the nature of sample selection. Part (b) assumes that \(\textbf{x}\) and \(\textbf{z}\) are exogenous. Part (c) is restrictive but needed to derive the conditional expectation of \(y_1\) given that it is observed. Part (d) requires linearity in the conditional expectation of \(u_1\) given \(u_2\), and it holds when \((u_1, u_2)\) is bivariate normal. However, it also holds under weaker assumptions when \(u_1\) is not normally distributed.
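A small numerical sketch (with illustrative names and parameter values) of an error pair satisfying parts (b)–(d) without joint normality: take \(u_2\) standard normal and \(u_1 = \gamma u_2 + v\), where \(v\) is an independent, mean-zero but skewed error, so that \( \mathop {\mathbb {E}} (u_1 | u_2) = \gamma u_2\) even though \(u_1\) is not normal.

```python
# Errors satisfying Assumption 1(b)-(d) without joint normality of (u1, u2):
# u1 = gamma*u2 + v with v independent of u2, mean zero and skewed.
import numpy as np

rng = np.random.default_rng(0)
N, gamma = 100_000, 0.5
u2 = rng.standard_normal(N)
v = rng.chisquare(df=3, size=N) - 3            # mean-zero, skewed, non-normal
u1 = gamma * u2 + v

# crude check of E(u1 | u2) = gamma*u2: the OLS slope of u1 on u2 is near gamma
print(np.polyfit(u2, u1, 1)[0])
```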

Additionally, we impose a sparsity scenario on \({\eta }\).

Assumption 2

\(\eta \) is sparse; that is, most of the elements of \({\eta }\) are zero. Specifically, \(||{\eta }||_0 \le s\), where \(||\cdot ||_0\) denotes the number of non-zero components of a vector. We require s to be small relative to the sample size N. In particular, \(\frac{s^2\log ^2(\max (p, N))}{N} \longrightarrow 0\).

This assumption follows Belloni et al. (2016b). In the settings of generalized linear models, it allows for the estimation of the nuisance parameter in the selection equation at the rate \(o(N^{-1/4})\) (see their Condition IR). In our settings, this rate is needed to guarantee the consistent estimation of \({\beta }\).

Under Assumption 1, it is easy to show that

$$\begin{aligned} \mathop {\mathbb {E}} (y_{1} | \textbf{x}, \textbf{z}, y_2 = 1) = \textbf{x}_1'\alpha + \gamma \lambda (\textbf{x}'{\beta }+ \textbf{z}'\eta ), \end{aligned}$$
(3)

where \(\lambda (\cdot ) = \phi (\cdot )/\Phi (\cdot )\) is the inverse Mills ratio. In essence, this is the classic formulation of Heckman (1979) where the presence of \(\textbf{x}_2\) and \(\textbf{z}\) in the selection equation (but not in the main equation of interest) permits estimation of the model even when the inverse Mills ratio is close to being linear in its argument.

We wish to explore the behavior of this conditional expectation with respect to potential errors in the choice of \(\textbf{z}\). It is easy to see from applying the mean-value theorem to the inverse Mills ratio evaluated at \(\textbf{x}'{\beta }\), that we can rewrite (3) as follows:

$$\begin{aligned} \mathop {\mathbb {E}} (y_{1} | \textbf{x}, \textbf{z}, y_2 = 1)&= \textbf{x}_1'{\alpha }+ \gamma \lambda (\textbf{x}'{\beta }) +\gamma \lambda ^{(1)}(q) \textbf{z}'\eta \nonumber \\&= \textbf{x}_1'{\alpha }+ \gamma \lambda (\textbf{x}'{\beta }) + \textbf{z}'\omega , \end{aligned}$$
(4)

where q is a point between \(\textbf{x}'{\beta }+ \textbf{z}'\eta \) and \(\textbf{x}'{\beta }\), \(\lambda ^{(1)}(\cdot )\) is the first-order derivative of \(\lambda (\cdot )\), and \(\omega = \gamma \lambda ^{(1)}(q) \eta \). It is well known that \(\lambda ^{(1)}(\cdot )\) is monotone and bounded between -1 and 0 (see, e.g., Sampford 1953). We note that \(\omega \) depends on q, \(\gamma \) and \(\eta \), and that if \(\eta \) is sparse then \(\omega \) is sparse too.
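A brief numerical sketch of these two facts, using the standard closed form \(\lambda ^{(1)}(v) = -\lambda (v)\,(v + \lambda (v))\):

```python
# Numerical check that the derivative of the inverse Mills ratio,
# lambda'(v) = -lambda(v) * (v + lambda(v)), is monotone and lies in (-1, 0).
import numpy as np
from scipy.stats import norm

def inv_mills(v):
    return norm.pdf(v) / norm.cdf(v)

def inv_mills_deriv(v):
    lam = inv_mills(v)
    return -lam * (v + lam)

for v in np.linspace(-4, 4, 9):
    print(f"v = {v:+.1f}   lambda = {inv_mills(v):7.4f}   lambda' = {inv_mills_deriv(v):+.4f}")
# every printed derivative is strictly between -1 and 0
```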

Proposition 1

The vector \(\omega \) in Eq. (4) inherits the same sparsity properties as the vector \(\eta \), i.e. \(||\omega ||_0=||{\eta }||_0\).

Proof. Sketches of the proofs of all less obvious propositions are given in the Appendix.

This Proposition makes it clear that a Heckman model with sparsity in the selection equation can be written as a heckit model with the same sparsity scenario in the main equation of interest.

Next we derive some conditions on the linear approximation of the inverse Mills ratio using \(\textbf{z}\) in the selected sample; such conditions are common in lasso-based model selection but new in the context of the Heckman model.

Let n denote the size of the selected sample, defined as follows:

$$\begin{aligned} n = \sum _{i=1}^N{\mathbb {I}}(y_{2i}=1). \end{aligned}$$

Then, for the selected observations, we can write

$$\begin{aligned} y_{1i} = \textbf{x}_{1i}'\alpha + g(\textbf{x}_i, \textbf{z}_i) + \epsilon _i, \end{aligned}$$

where \(g(\textbf{x}_i, \textbf{z}_i)=\gamma \lambda (\textbf{x}_i'\beta + \textbf{z}_i'\eta )\), \({\mathbb {E}}(\epsilon _i|\textbf{x}_i, \textbf{z}_i) = 0\) and \({\mathbb {V}}(\epsilon _i|\textbf{x}_i, \textbf{z}_i) = \sigma ^2\). We follow Belloni et al. (2014b) and write \(g(\textbf{x}_i, \textbf{z}_i)\) in a linear form subject to a bound on the approximation error:

$$\begin{aligned} y_{1i} = \textbf{x}_{1i}'\alpha + \gamma \lambda (\textbf{x}_i'\beta ) + \textbf{z}_i'\delta + r_i + \epsilon _i, \end{aligned}$$
(5)

where \(r_i, i=1,\ldots , n,\) is the approximation error such that \(\sqrt{\frac{1}{n}\sum _{i=1}^n {\mathbb {E}}r_i^2} = O\left( \sqrt{\frac{s}{n}}\right) \). Additionally, we assume that the selected and pre-selection sample sizes are of the same order.

Assumption 3

Equation (5) holds in the selected sample with \(n: \frac{n}{N}\rightarrow c\in (0,1)\) and with

$$\begin{aligned} \sqrt{\frac{1}{n}\sum _{i=1}^n {\mathbb {E}}r_i^2} = O\left( \sqrt{\frac{s}{n}}\right) . \end{aligned}$$

The assumption on the approximation error follows Belloni et al. (2014b). Similar to Assumption 2, Assumption 3 ensures that we can estimate the nuisance parameter in the selected sample at the rate \(o(n^{-1/4})\). In the context of the Heckman model, it implies that \(||\delta ||_0=||\omega ||_0=||\eta ||_0 \le s\) and that \(\delta \) is also estimated at the rate \(o(N^{-1/4})\) since n and N are of the same order. Example-specific primitive conditions that ensure Assumption 3 holds are discussed by Belloni et al. (2014b, Section 4.1) for parametric and nonparametric cases, with the parametric example in their Section 4.1.1 being most relevant for our setting.

Next we investigate how to consistently estimate this model accounting for the high-dimensional nuisance parameter in both equations.

2.2 Estimation of the selection equation

Clearly, if we knew the true value of \({\beta }\), we could treat \(\lambda (\textbf{x}'{\beta })\) as a known variable and estimate \({\alpha }\) and \(\gamma \), treating \(\delta \) as a nuisance parameter. So we start with consistent estimation of \({\beta }\) in Eq. (2) using the approach of Belloni et al. (2016b) combined with the parameter tuning of Drukker and Liu (2022).

The estimation involves three steps (an illustrative code sketch is given after the list):

  • Step 1 (post-lasso probit). We start by estimating a penalized probit of \(y_2\) on \(\textbf{x}\) and \(\textbf{z}\) using the lasso penalty:

    $$\begin{aligned} ({\hat{{\beta }}}, {\hat{{\eta }}})&= \mathop {\mathrm {arg\,min}}\limits _{{\beta }, {\eta }} \mathbb {E}_N (\Lambda _i({\beta }, {\eta })) + \uplambda _1 ||({\beta }, {\eta })||_1, \end{aligned}$$

    where \( \mathbb {E}_N \) denotes the sample mean of N observations, \(\Lambda _i(\cdot )\) is the negative log-likelihood for the probit model, \(||\cdot ||_1 \) is the lasso (\(l_1\)) norm of the parameters and \(\uplambda _1\) is a tuning parameter chosen using the plug-in method of Drukker and Liu (2022). This produces a subset of the variables in \(\textbf{z}\) indexed by \(support({\hat{{\eta }}})\), where for a p-vector v, \(support(v):=\{ j \in \{1,..., p\}: v_j \ne 0\}\). These variables are used in the post-lasso probit:

    $$\begin{aligned} ({\tilde{{\beta }}}, {\tilde{{\eta }}})&= \mathop {\mathrm {arg\,min}}\limits _{{\beta }, {\eta }} \mathbb {E}_N (\Lambda _i({\beta }, {\eta })): support({\eta }) \subseteq support({\hat{{\eta }}}) \end{aligned}$$

    As a result, we obtain the sparse probit estimates \(({\tilde{{\beta }}}, {\tilde{{\eta }}})\) where \({\tilde{{\eta }}}\) contains only a few non-zero elements. Belloni et al. (2016b) propose using these estimates to construct weights \({\hat{f}}_i = {\hat{w}}_i/{\hat{\sigma }}_i\), where \({\hat{w}}_i = \phi (\textbf{x}_i'{\tilde{{\beta }}} + \textbf{z}_i'{\tilde{{\eta }}})\), and \({\hat{\sigma }}_i^2 = \Phi (\textbf{x}_i'{\tilde{{\beta }}} + \textbf{z}_i'{\tilde{{\eta }}}) (1 - \Phi (\textbf{x}_i'{\tilde{{\beta }}} + \textbf{z}_i'{\tilde{{\eta }}}))\), for \(i=1,\ldots , N\).

  • Step 2. We use the weights from Step 1 to run a weighted lasso regression in which, for each variable \(x_j\) in \(\textbf{x}\), \(j=1, \ldots , k\), we run the penalized regression of \({\hat{f}}_i x_{ij}\) on \({\hat{f}}_i \textbf{z}_i\),

    $$\begin{aligned} {\hat{\theta }}_j = \mathop {\mathrm {arg\,min}}\limits _{\theta _j} \mathbb {E}_N ({\hat{f}}_i^2(x_{ij} - \textbf{z}_i'\theta _j)^2) + \uplambda _2||\theta _j||_1, \end{aligned}$$

    where \(\uplambda _2\) is chosen by the plug-in method of Drukker and Liu (2022). For each element of \(\textbf{x}\), this produces a selection from the variables in \(\textbf{z}\) indexed by \(support({{\hat{\theta }}}_j), j=1, \ldots , k\).

  • Step 3 (double-selection probit). We use the variables selected from \(\textbf{z}\) in Steps 1 and 2 to run the probit of \(y_2\) on \(\textbf{x}\) and the union of the sets of variables selected in Steps 1 and 2:

    $$\begin{aligned} ({\check{{\beta }}}, {\check{\eta }}) = \mathop {\mathrm {arg\,min}}\limits _{\beta , \eta } \mathbb {E}_N (\Lambda _i({\beta }, \eta ){\hat{f}}_i/{\hat{\sigma }}_i), \end{aligned}$$

    where \(support(\eta ) \subseteq support({{\hat{{\eta }}}}) \cup support({{\hat{\theta }}}_1) \cup \ldots \cup support({\hat{\theta }}_k)\).
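The following simplified sketch illustrates Steps 1–3 in Python, assuming NumPy arrays y2 (N,), X (N, k) and Z (N, p); all names are hypothetical, the penalty levels lam1 and lam2 are placeholders for the plug-in rules of Drukker and Liu (2022), and penalty scaling conventions differ across implementations.

```python
# A simplified sketch of the selection-equation Steps 1-3 (not the dsheckman
# command). Penalty levels are placeholders; the plug-in tuning rules are not
# reproduced here.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm
from scipy.optimize import minimize
from sklearn.linear_model import Lasso

def ds_probit(y2, X, Z, lam1=0.05, lam2=0.05, tol=1e-8):
    N, k = X.shape
    XZ = np.column_stack([X, Z])

    # Step 1: l1-penalized probit of y2 on (x, z), then post-lasso probit on
    # x and the selected elements of z.
    pen = sm.Probit(y2, XZ).fit_regularized(method="l1", alpha=lam1, disp=0)
    sel1 = set(np.flatnonzero(np.abs(np.asarray(pen.params)[k:]) > tol))
    cols1 = list(range(k)) + [k + j for j in sorted(sel1)]
    post = sm.Probit(y2, XZ[:, cols1]).fit(disp=0)
    idx = XZ[:, cols1] @ post.params
    w = norm.pdf(idx)
    sig = np.sqrt(norm.cdf(idx) * (1.0 - norm.cdf(idx)))
    f = w / sig                                          # f_i = w_i / sigma_i

    # Step 2: weighted lasso of f_i * x_ij on f_i * z_i for each column of x.
    sel2 = set()
    for j in range(k):
        fit = Lasso(alpha=lam2).fit(f[:, None] * Z, f * X[:, j])
        sel2 |= set(np.flatnonzero(np.abs(fit.coef_) > tol))

    # Step 3: probit of y2 on x and the union of the selected controls, with
    # the log-likelihood contributions weighted by f_i / sigma_i.
    union = sorted(sel1 | sel2)
    W3 = np.column_stack([X, Z[:, union]])
    wt = f / sig

    def negll(b):
        xb = W3 @ b
        return -np.mean(wt * (y2 * norm.logcdf(xb) + (1 - y2) * norm.logcdf(-xb)))

    b_check = minimize(negll, np.zeros(W3.shape[1]), method="BFGS").x
    return b_check[:k], union                 # (beta-check, selected controls)
```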

In the setting of generalized linear models, Belloni et al. (2016b) show that the double-selection probit corrects for the omitted variable bias introduced by a naive application of lasso to Eq. (2). The intuition is that the double selection using weights \({\hat{f}}_i\) “neymanizes” Step 3. That is, it ensures that the estimation error from the first step does not affect the estimated parameter vector of the last step.

It follows from Belloni et al. (2016b, Theorem 1) that, under Assumption 2, \({\check{{\beta }}}\) is a consistent estimator of \({\beta }\) and its variance can be obtained from Step 3 using the well-known “sandwich” formula for probit. For example, in Stata it can be obtained using the vce(robust) syntax. We obtain a regular \(\sqrt{N}\)-consistent estimator of \(\beta \) with standard inference, even though a penalized, non-\(\sqrt{N}\)-consistent estimator is used to carry out model selection for the high-dimensional nuisance parameter \(\eta \).

2.3 Connection to redundancy of moment conditions

Belloni et al. (2016b) use Neyman orthogonalization to obtain their result. In this section we show how the NO argument relates to the concept of moment redundancy pioneered by Breusch et al. (1999). This offers an alternative way of arriving at the weights derived by Belloni et al. (2016b).

The key insight of Belloni et al. (2016b) is that the weights \(f_i\) ensure the validity of the main moment condition:

$$\begin{aligned} {\mathbb {E}} g_i(\beta _0, \eta _0, \theta _0) \equiv {\mathbb {E}} [y_{2i} - \Phi (\textbf{x}'_i\beta _0 +\textbf{z}_i'\eta _0)](f_i\textbf{x}_i - f_i\textbf{z}_i'\theta _0) = 0, \end{aligned}$$
(6)

which has to hold simultaneously with the condition

$$\begin{aligned} {\mathbb {E}}\frac{\partial }{\partial \eta } g_i(\beta _0, \eta _0, \theta _0) = 0. \end{aligned}$$

It is easy to see that Eq. (6) holds due to \({\mathbb {E}}(y_{2i}|\textbf{x}_i, \textbf{z}_i) = \Phi (\textbf{x}'_i\beta _0 +\textbf{z}_i'\eta _0)\) and that for \(f_i = \frac{\phi (\textbf{x}'_i\beta _0 +\textbf{z}_i'\eta _0)}{\sigma _i^2}\), where \(\sigma _i^2={\mathbb {V}}(y_{2i}|\textbf{x}_i, \textbf{z}_i)=\Phi (\textbf{x}_i'{\beta }_0 + \textbf{z}_i'{\eta }_0) (1 - \Phi (\textbf{x}_i'{\beta }_0 + \textbf{z}_i'{\eta }_0))\), the zero expected derivative condition holds, too. See Eq. (28) of Belloni et al. (2016b).
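To see why these weights deliver the orthogonality, treat \(\textbf{x}_i\) as a scalar for notational simplicity (as we do below) and note that, when differentiating \(g_i\) with respect to \(\eta \), every term in which the derivative acts on \(f_i\) is multiplied by \(y_{2i} - \Phi (\textbf{x}'_i\beta _0 +\textbf{z}_i'\eta _0)\) and therefore vanishes after conditioning on \((\textbf{x}_i, \textbf{z}_i)\). The only surviving term gives

$$\begin{aligned} {\mathbb {E}}\frac{\partial }{\partial \eta '} g_i(\beta _0, \eta _0, \theta _0) = -{\mathbb {E}}\, \phi (\textbf{x}'_i\beta _0 +\textbf{z}_i'\eta _0)\, f_i (\textbf{x}_i - \textbf{z}_i'\theta _0)\textbf{z}_i' = -{\mathbb {E}}\, \sigma _i^2 f_i^2 (\textbf{x}_i - \textbf{z}_i'\theta _0)\textbf{z}_i' = 0, \end{aligned}$$

where the second equality uses \(\phi (\textbf{x}'_i\beta _0 +\textbf{z}_i'\eta _0) f_i = \sigma _i^2 f_i^2\) and the last equality holds because \(\theta _0\) is the coefficient of the weighted linear projection of \(\textbf{x}_i\) on \(\textbf{z}_i\), as made explicit below.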

What has not been noted is that these conditions correspond to what Prokhorov and Schmidt (2009) call moment and parameter redundancy (M/P-redundancy), that is, the situation in which neither the knowledge of the additional moment conditions nor the knowledge of the parameter they identify helps improve the efficiency of estimation.

To see this, let \(\textbf{x}_i\) be a scalar for notational simplicity, and write the moment conditions identifying \(\beta \) and \(\eta \) as follows:

$$\begin{aligned} {\mathbb {E}}h_{1i}(\beta _0, \eta _0) \equiv {\mathbb {E}}(y_{2i}-\Phi (\textbf{x}_i' \beta _0 + \textbf{z}_i'\eta _0) )f_i\textbf{x}_i = 0, \end{aligned}$$
(7)
$$\begin{aligned} {\mathbb {E}}h_{2i}(\beta _0, \eta _0) \equiv {\mathbb {E}}(y_{2i}-\Phi (\textbf{x}_i' \beta _0 + \textbf{z}_i'\eta _0) )f_i\textbf{z}_i = 0, \end{aligned}$$
(8)

where the subscript “0” on a parameter denotes the true value. These moment conditions correspond to the first order conditions of the probit and stem from the specification \({\mathbb {P}}(y_{2i}=1|\textbf{x}_i, \textbf{z}_i) = \Phi (\textbf{x}'_i\beta _0 +\textbf{z}_i'\eta _0)\).

This is the system of moment conditions considered by Breusch et al. (1999) in the Generalized Method of Moments (GMM) framework. See their Eq. (6). They show that the (optimal) GMM estimation based on Eqs. (7)–(8) is equivalent to the estimation based on Eq. (8) and the error in the linear projection of Eq. (7) on Eq. (8). Using their notation, we can write the equivalent moment system as follows:

$$\begin{aligned} {\mathbb {E}}[h_{1i}(\beta _0, \eta _0)- \Omega _{12}\Omega _{22}^{-1}h_{2i}(\beta _0, \eta _0)]&= 0, \end{aligned}$$
(9)
$$\begin{aligned} {\mathbb {E}}h_{2i}(\beta _0, \eta _0)&= 0, \end{aligned}$$
(10)

where \(\Omega _{12}\) and \(\Omega _{22}\) are the relevant parts of the moment variance matrix

$$\begin{aligned} \Omega \equiv {\mathbb {V}}\left[ \begin{array}{c}h_{1i}(\beta _0, \eta _0)\\ h_{2i}(\beta _0, \eta _0)\end{array}\right] =\left[ \begin{array}{cc}\Omega _{11}&\Omega _{12}\\ \Omega _{21}&\Omega _{22}\end{array}\right] . \end{aligned}$$
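Since \({\mathbb {E}}[(y_{2i}-\Phi (\textbf{x}'_i\beta _0 +\textbf{z}_i'\eta _0))^2 \mid \textbf{x}_i, \textbf{z}_i] = \sigma _i^2\), these blocks take a simple form:

$$\begin{aligned} \Omega _{12} = {\mathbb {E}}\,\sigma _i^2 f_i^2\, \textbf{x}_i \textbf{z}_i', \qquad \Omega _{22} = {\mathbb {E}}\,\sigma _i^2 f_i^2\, \textbf{z}_i \textbf{z}_i', \end{aligned}$$

so that \(\Omega _{12}\Omega _{22}^{-1}\) is the coefficient of the weighted linear projection of \(f_i\textbf{x}_i\) on \(f_i\textbf{z}_i\).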

It is easy to see that Eq. (9) coincides with Eq. (6) subject to the additional notation that \(\theta '_0 = \Omega _{12}\Omega _{22}^{-1}={\mathbb {E}}\sigma ^2_i f^2_i \textbf{x}_i \textbf{z}_i' [{\mathbb {E}}\sigma ^2_i f^2_i \textbf{z}_i \textbf{z}'_i]^{-1}\). It is also easy to see that the entire estimation problem can be written in the GMM framework as follows:

$$\begin{aligned} {\mathbb {E}} g_i(\beta _0, \eta _0, \theta _0) \equiv {\mathbb {E}} [y_{2i} - \Phi (\textbf{x}'_i\beta _0 +\textbf{z}_i'\eta _0)](f_i\textbf{x}_i - f_i\textbf{z}_i'\theta _0) = 0, \end{aligned}$$
(11)
$$\begin{aligned} {\mathbb {E}}h_{2i}(\beta _0, \eta _0) \equiv {\mathbb {E}}(y_{2i}-\Phi (\textbf{x}_i' \beta _0 + \textbf{z}_i'\eta _0) )f_i\textbf{z}_i = 0, \end{aligned}$$
(12)
$$\begin{aligned} {\mathbb {E}}h_{3i}(\theta _0) \equiv {\mathbb {E}}\sigma _i f_i \textbf{z}_i(\sigma _i f_i\textbf{x}_i - \sigma _i f_i\textbf{z}_i'\theta _0)=0, \end{aligned}$$
(13)

where the first equation is Eq. (6) above, the second equation is the moment condition that identifies \(\eta _0\) and the third equation is a modified (through the inclusion of the scalar \(\sigma _i\)) version of the OLS first-order conditions used to estimate \(\theta _0\).

We note that, due to a separability result of Ahn and Schmidt (1995), we cannot improve on the estimation of \(\theta _0\) by estimating it jointly with \((\beta _0, \eta _0)\) because the additional conditions (11)–(12) only determine \(\theta _0\) in terms of \(\beta _0\) and \(\eta _0\). See also Prokhorov and Schmidt (2009, Statement 6). We further note that by Statement 7 of Prokhorov and Schmidt (2009), joint estimation of the entire vector \((\beta _0, \eta _0, \theta _0)\) is equivalent to a two-step estimation where \(\theta _0\) is estimated first and the second step is adjusted for the estimation error of the first step.

More importantly, because the correlation between the moment functions in Eqs. (11) and (12) is zero and the expected derivative of Eq. (11) with respect to \(\eta \) is zero, the condition of partial redundancy of Breusch et al. (1999, Theorem 7) holds (in their notation \(G_{21}=0\) and \(\Omega _{21}=0\)). This means the moment condition (12) is redundant for the estimation of \(\beta \) (M-redundancy). Additionally, these conditions are sufficient to show that the knowledge of the value of \(\eta _0\) is redundant (P-redundancy). See Statement 4 of Prokhorov and Schmidt (2009). So the NO condition of Belloni et al. (2016b) corresponds to a well established situation in GMM estimation when neither the knowledge of the parameter \(\eta _0\), nor the knowledge of the moment condition (12) helps estimate \(\beta _0\).Footnote 1

2.4 Choice of penalty parameter

The penalty parameters \(\lambda _1\) and \(\lambda _2\) can be derived analytically, but data-driven methods are typically used. Their theoretical validity and practical performance have been well studied. For example, compared with BIC or plug-in methods, cross-validation and AIC typically under-penalize (over-select), including too many variables in an attempt to reduce bias. Moreover, when too many variables are selected, the sparsity assumption required for the double lasso to work is violated.

Plug-in methods have been shown to perform well in a multitude of settings (see, e.g., Drukker and Liu 2022; Belloni et al. 2014b, 2016a).
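As an illustration, one version of the plug-in rule used in this literature for the linear lasso (see, e.g., Belloni et al. 2014b) sets \(\uplambda = 2c\sqrt{n}\,\Phi ^{-1}(1-\gamma /(2p))\) with constants such as \(c=1.1\) and \(\gamma =0.1/\log (n)\); the exact rules of Drukker and Liu (2022) for the probit step, and the software-specific scaling of the penalty, are not reproduced in this sketch.

```python
# A sketch of a plug-in penalty level of the kind used for the linear lasso in
# this literature; the constants c and gamma are common defaults, not
# necessarily those of Drukker and Liu (2022), and scaling conventions differ
# across implementations.
import numpy as np
from scipy.stats import norm

def plugin_lambda(n, p, c=1.1, gamma=None):
    if gamma is None:
        gamma = 0.1 / np.log(n)
    return 2 * c * np.sqrt(n) * norm.ppf(1 - gamma / (2 * p))

print(plugin_lambda(n=2000, p=1000))   # penalty level (before penalty loadings)
```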

2.5 Estimation of the main equation

We can now return to the estimation of \({\alpha }\) and \(\gamma \). Similar to Belloni et al. (2016b), Belloni et al. (2014b) observe that the direct application of lasso to linear models with a high-dimensional nuisance parameter results in biased estimation of the parameter of interest, which in their case is a scalar treatment effect. They propose a double selection procedure. We follow their approach subject to a few modifications that reflect the specifics of our main equation.

First, with a consistent estimator of \({\beta }\), a natural estimator of the inverse Mills ratio in Eq. (4) is as follows:

$$\begin{aligned} \widehat{\lambda (\textbf{x}_i'{\beta })} = \phi (\textbf{x}_i'{\check{{\beta }}})/\Phi (\textbf{x}_i'{\check{{\beta }}}). \end{aligned}$$

It is also natural to account for the fact that this is a generated regressor when constructing the variance matrix, something we consider later.

Second, Belloni et al. (2014b) derive their results for a scalar parameter. Because the variables of interest \(\textbf{x}\) form a vector, we need to extend the original double selection lasso estimation to vectors. We provide the details of this extension using the NO arguments in Appendix A.

We can now discuss the estimation of the main equation, which combines the double selection lasso of Belloni et al. (2014b) and the parameter tuning of Drukker and Liu (2022).Footnote 2 It proceeds in three steps (an illustrative code sketch follows the list):

Step 1. We run the lasso regression of \(\textbf{y}_1\) on \(\textbf{z}\):

$$\begin{aligned} {\check{\theta }}_y = \mathop {\mathrm {arg\,min}}\limits _{\theta _y} {\mathbb {E}}_n\left[ (y_{1i} - \textbf{z}_i'\theta _y)^2\right] + \uplambda _1 ||{\theta }_y||_1. \end{aligned}$$

This produces a subset of \(\textbf{z}\) indexed by \(support({\check{\theta }}_y)\).

Step 2. For each variable \(x_{1j}\) in \(\textbf{x}_1\), \(j=1, \ldots , k_1\), we run the lasso regression of \(x_{1j}\) on \(\textbf{z}\):

$$\begin{aligned} {\check{\theta }}_j = \mathop {\mathrm {arg\,min}}\limits _{\theta _j} {\mathbb {E}}_n\left[ (x_{1ij} - \textbf{z}_i'\theta _j)^2\right] + \uplambda _2 ||\theta _j||_1. \end{aligned}$$

Additionally, we run the lasso regression of \(\widehat{\lambda (\textbf{x}_i'{\beta })}\) on \(\textbf{z}\):

$$\begin{aligned} {\check{\theta }}_{\lambda }= \mathop {\mathrm {arg\,min}}\limits _{\theta _\lambda } {\mathbb {E}}_n\left[ (\widehat{\lambda (\textbf{x}_i'{\beta })} - \textbf{z}_i'\theta _\lambda )^2\right] + \uplambda _2 ||\theta _\lambda ||_1. \end{aligned}$$

This step produces subsets of \(\textbf{z}\) indexed by \(support({\check{\theta }}_j), j=1, \ldots , k_1\), and \(support({\check{\theta }}_\lambda )\).

Step 3. We run the regression of \(y_{1i}\) on \(\textbf{x}_{1i}\), \(\widehat{\lambda (\textbf{x}_i'{\beta })}\), and the union of the sets selected in Steps 1 and 2:

$$\begin{aligned} (\widehat{\alpha }, \widehat{\gamma }, \widehat{\delta }) = \mathop {\mathrm {arg\,min}}\limits _{\alpha , \gamma , \delta } {\mathbb {E}}_n \left[ (y_{1i} - \textbf{x}_{1i}'\alpha - \widehat{\lambda (\textbf{x}_i'{{\beta }})}\gamma - {\textbf{z}}_i'\delta )^2\right] , \end{aligned}$$

where \(support(\delta ) \subseteq support({\check{\theta }}_y) \cup support({\check{\theta }}_1) \cup \ldots \cup support({\check{\theta }}_{k_1})\cup support({\check{\theta }}_{\lambda })\).
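A simplified sketch of these three steps in Python, assuming the selected-sample arrays y1 (n,), X1 (n, k1), Z (n, p) and the estimated inverse Mills ratio imr_hat (n,) from the previous subsection are available; names are hypothetical and the penalty levels again stand in for the plug-in rules.

```python
# A simplified sketch of the main-equation double selection (not the dsheckman
# command); standard errors from the final OLS are NOT yet corrected for the
# estimated inverse Mills ratio (see Sect. 2.6).
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import Lasso

def ds_main_equation(y1, X1, Z, imr_hat, lam1=0.05, lam2=0.05, tol=1e-8):
    # Step 1: lasso of y1 on z.
    sel = set(np.flatnonzero(np.abs(Lasso(alpha=lam1).fit(Z, y1).coef_) > tol))

    # Step 2: lasso of each column of x1, and of the estimated Mills ratio, on z.
    for target in list(X1.T) + [imr_hat]:
        coef = Lasso(alpha=lam2).fit(Z, target).coef_
        sel |= set(np.flatnonzero(np.abs(coef) > tol))

    # Step 3: OLS of y1 on x1, the Mills ratio and the union of selected controls.
    W = sm.add_constant(np.column_stack([X1, imr_hat, Z[:, sorted(sel)]]))
    fit = sm.OLS(y1, W).fit()
    k1 = X1.shape[1]
    return fit.params[1:k1 + 1], fit.params[k1 + 1]      # (alpha-hat, gamma-hat)
```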

Proposition 2

Under Assumptions 1–3, the DS-HECK estimation in Steps 1–3 above is consistent for \(\alpha \) and \(\gamma \) and post-double-selection inference on \(\alpha \) and \(\gamma \) is valid.

The DS-HECK estimator corrects the bias generated by applying the lasso directly to Eq. (4). The simulation experiments we report in Sect. 3 illustrate the size of this bias.

Following Belloni et al. (2014b), we can claim that inference about the vector \((\alpha ', \gamma )\) is valid but, unlike Belloni et al. (2014b), it is valid up to the variance matrix correction reflecting the post-lasso probit estimation of \({\beta }\).Footnote 3

2.6 Variance matrix estimation

We start with some new notation. Let \({\hat{\lambda }}_i = \widehat{\lambda (\textbf{x}_i'{{\beta }})}=\lambda (\textbf{x}_i'{\check{{\beta }}})\) and define

$$\begin{aligned} \xi _i = {\hat{\lambda }}_i({\hat{\lambda }}_i + \textbf{x}_i' {\check{{\beta }}}), \end{aligned}$$

where \({\check{{\beta }}}\) is obtained by the double selection probit. Let e denote the vector of residuals from the last step of the double selection lasso estimation, with typical element \(e_i, i=1, \ldots , n\). That is,

$$\begin{aligned} e_i = y_{1i} - \textbf{x}_{1i}'\widehat{\alpha } - {\widehat{\lambda }}_i \widehat{\gamma } - {\textbf{z}}_i'\widehat{\theta }, \end{aligned}$$

where \(support(\theta ) \subseteq support({\check{\theta }}_y) \cup support({\check{\theta }}_1) \cup \ldots \cup support({\check{\theta }}_{k_1})\cup support({\check{\theta }}_{\lambda }) \). Let W denote the matrix containing \(\textbf{x}_1\), the \(n\times 1\) vector of \({\hat{\lambda }}_i\)’s, and the variables in \(\textbf{z}\) that survived the double selection. Let R be an \(n\times n\) diagonal matrix, with diagonal elements \((1- {\hat{\rho }}^2\xi _i)\), where \({\hat{\rho }} = \widehat{\gamma }/\widehat{\sigma }\) and \({\hat{\sigma }}^2 = (e'e + \widehat{\gamma }^2\sum _i \xi _i)/n\).

Proposition 3

A consistent estimator of the variance matrix of the DS-HECK estimator \( (\widehat{\alpha }', \widehat{\gamma }, \widehat{\theta }')\) is

$$\begin{aligned} V = {\hat{\sigma }}^2 (W'W)^{-1}(W'RW + Q )(W'W)^{-1}, \end{aligned}$$

where

$$\begin{aligned} Q = {\hat{\rho }}^2 (W' D \textbf{x}) V_b (\textbf{x}'D W) \end{aligned}$$

and \(V_b\) is the “sandwich” variance matrix for the double selection probit estimator \({\check{{\beta }}}\) and D is the diagonal matrix with diagonal elements \(\xi _i\).

The variance for \(\widehat{\alpha }\) and \(\widehat{\gamma }\) is the upper \((k_1+1)\times (k_1+1)\) submatrix of V. The dsheckman command implements this variance estimator.
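The formula translates directly into code. The sketch below assumes the arrays W, e, xi, X (the selection-equation regressors for the selected observations), Vb and the scalar gamma_hat are already available; all names are hypothetical and this is an illustration of the formula, not the dsheckman implementation.

```python
# A direct sketch of the variance formula in Proposition 3; inputs (names
# hypothetical): W (n x m) = [x1, imr, selected z], e (n,) Step-3 residuals,
# xi (n,) with xi_i = imr_i*(imr_i + x_i' beta_check), X (n x k) the
# selection-equation x's on the selected sample, Vb (k x k) the sandwich
# variance of beta_check, gamma_hat the Mills-ratio coefficient.
import numpy as np

def dsheck_variance(W, e, xi, X, Vb, gamma_hat):
    n = W.shape[0]
    sigma2 = (e @ e + gamma_hat**2 * xi.sum()) / n        # sigma-hat squared
    rho2 = gamma_hat**2 / sigma2                          # rho-hat squared
    R = np.diag(1.0 - rho2 * xi)
    D = np.diag(xi)
    WtW_inv = np.linalg.inv(W.T @ W)
    Q = rho2 * (W.T @ D @ X) @ Vb @ (X.T @ D @ W)
    V = sigma2 * WtW_inv @ (W.T @ R @ W + Q) @ WtW_inv
    return V   # the upper (k1+1) x (k1+1) block is the variance of (alpha, gamma)
```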

3 Monte Carlo simulations

To evaluate the finite-sample performance of DS-HECK, we conduct a simulation study using four estimators: (i) ordinary least squares on the selected sample (OLS), (ii) the Heckman two-step estimator based on the true model (Oracle), (iii) the Heckman two-step estimator using lasso to select variables in Eq. (2) (Naive), and (iv) the proposed double selection Heckman estimator (DS). We have implemented the DS-HECK estimator in the Stata command dsheckman, available on the authors’ web pages along with data sets and codes for the simulations and application, and we describe its syntax in Appendix C.

OLS is inconsistent unless there is no sample selection bias, i.e. \(\gamma =0\). Naive is inconsistent due to the model selection errors made by lasso. Moreover, Naive does not provide valid inference as it is not robust to the model selection bias. In contrast, DS is expected to retain consistency in the presence of sample selection bias and show robustness against the model selection bias. Oracle is expected to behave like the standard Heckman estimator under the true model but, in practice, Oracle is infeasible since we do not know the true model.

3.1 Setup

Our data generating process is as follows:

$$\begin{aligned} y_1&= 1 + x_1 + x_2 + u_1\\ y_2&= {\mathbb {I}} ( \textbf{x}' \beta _0 + \textbf{z}'\eta _0 + u_2 \ge 0) \end{aligned}$$

where \(u_2 \sim N(0, 1)\) and where \(u_1 = \gamma u_2 + \epsilon \) and \(\epsilon \sim N(0, 1)\) is independent of \(u_2\). We vary the strength of the selection bias by setting \(\gamma \) to be 0.4, 0.5, 0.6, 0.7, and 0.8 and we observe \(y_1\) only when \(y_2 = 1\).

The selection equation is generated using nine non-zero variables in \(\textbf{z}\), of which four have a relatively large effect and five a relatively small one:

$$\begin{aligned}&\textbf{x}' \beta _0 + \textbf{z}'\eta _0 = -1.5 + x_1 - x_2 + z_1 - z_2 + 0.046z_3 + z_5 \\&\quad - 0.046z_{10} - 0.046z_{11} + 0.046z_{12} - 0.046z_{15} + z_{20}. \end{aligned}$$

The value 0.046 is chosen so that it violates the so-called “beta-min” condition and causes lasso to make model selection mistakes (see, e.g., Liu et al. 2020; Drukker and Liu 2022). The sample size is 2000. The number of replications is 1000.

We consider two scenarios for p, the dimension of \(\textbf{z}\): (i) \(p = 1000\), fewer variables than observations; (ii) \(p=2100\), more variables than observations. The variables are generated using a Toeplitz correlation structure with decreasing dependence. In particular, let Z be the matrix of dimension \(N \times p\) containing \(\textbf{z}\). Then,

$$\begin{aligned} Z = M L' \end{aligned}$$

where M is \(N \times p\) and has the typical element \((\zeta _{ij} -15)/\sqrt{30}\), where \(\zeta _{ij} \sim \chi ^2(15)\) and L is the Cholesky decomposition of a symmetric Toeplitz matrix V of dimension \(p \times p\) such that its elements obey the following laws: \(V_{i,j} = V_{i-1, j-1}\) and \(V_{1, j}= j^{-1.3}\).

The variables \(\textbf{x}\) are also correlated and they are generated as functions of \(\textbf{z}\). In particular,

$$\begin{aligned} x_1&= z_3 +z_{10} + z_{11} +z_{12} + z_{15} + \epsilon _{x_1} \\ x_2&= 0.5(z_3 +z_{10} + z_{11} +z_{12} + 2z_{15}) + \epsilon _{x_2} \end{aligned}$$

where \(\epsilon _{x_1}\) and \(\epsilon _{x_2}\) follow a Toeplitz structure similar to Z.

As a result, for the selected sample, the true model is

$$\begin{aligned} y_1 = 1 + x_1 + x_2 + \gamma \lambda (\textbf{x}' \beta _0 + \textbf{z}'\eta _0) + u \end{aligned}$$

where \(\lambda (\cdot ) = \phi (\cdot )/\Phi (\cdot )\) and the true parameter values are \(\beta _1 = \beta _2 = 1\) and \(\gamma = 0.4, 0.5, 0.6, 0.7\), or 0.8. The standard deviation of \(\lambda ^{(1)}(\cdot )\) is about 0.3.Footnote 4
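For completeness, the DGP above can be reproduced along the following lines in Python; the only departure from the description is that the errors \(\epsilon _{x_1}\) and \(\epsilon _{x_2}\) are drawn here as independent standardized \(\chi ^2(15)\) variables rather than with the Toeplitz structure of Z, which is an assumption of this sketch.

```python
# A sketch of the simulation DGP, using 1-based labels z_1,...,z_p mapped to
# columns 0,...,p-1 of Z. The construction of eps_x1 and eps_x2 is simplified
# relative to the text (independent standardized chi-square draws).
import numpy as np
from scipy.linalg import toeplitz, cholesky

rng = np.random.default_rng(0)
N, p, gamma = 2000, 1000, 0.5                    # gamma: one of the values considered

# Toeplitz-correlated controls: Z = M L', M has standardized chi-square(15) entries.
M = (rng.chisquare(15, size=(N, p)) - 15) / np.sqrt(30)
V = toeplitz((np.arange(p) + 1.0) ** -1.3)       # first row: V[1, j] = j^(-1.3)
L = cholesky(V, lower=True)
Z = M @ L.T

eps = (rng.chisquare(15, size=(N, 2)) - 15) / np.sqrt(30)
z = lambda j: Z[:, j - 1]                        # 1-based access to z_j
x1 = z(3) + z(10) + z(11) + z(12) + z(15) + eps[:, 0]
x2 = 0.5 * (z(3) + z(10) + z(11) + z(12) + 2 * z(15)) + eps[:, 1]

index = (-1.5 + x1 - x2 + z(1) - z(2) + 0.046 * z(3) + z(5)
         - 0.046 * z(10) - 0.046 * z(11) + 0.046 * z(12)
         - 0.046 * z(15) + z(20))
u2 = rng.standard_normal(N)
u1 = gamma * u2 + rng.standard_normal(N)
y2 = (index + u2 >= 0).astype(int)
y1 = 1 + x1 + x2 + u1                            # observed only when y2 == 1
```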

3.2 Results

For each estimator, we report the following measures: (i) the true value of the parameter (True), (ii) the mean squared error (MSE), (iii) the average of the estimates across simulations (Mean), (iv) the standard deviation of the estimates across simulations (SD), (v) the average standard error across simulations (\({\overline{SE}}\)) and (vi) the rejection rate for the \(H_0\) that the parameter equals its true value at the nominal 5% level of significance (Rej. rate).

We report the simulation results for \(\beta _1\), \(\beta _2\), and \(\gamma \) in Tables 1, 2 and 3, respectively. Several observations are clear from the tables. First, OLS is badly biased and the bias is greater when selection is stronger. Second, Naive is also biased and it fails to provide valid inference at any value of \(\gamma \) and p. This demonstrates that Naive is not robust to the model selection errors. The rejection rate and MSE increase as \(\gamma \) increases, which is expected because a greater \(\gamma \) value indicates a greater sample selection bias. Third, Oracle shows consistency and rejection rates close to the nominal 5% significance level. Fourth, DS performs similarly to Oracle for all values of \(\gamma \) and p. In particular, its MSE is consistently smaller than that of Naive and OLS, and \({\overline{SE}}\) is close to SD, which shows that the proposed variance adjustment works well. Finally, the Rej. rate for DS is near the 5% significance level, which confirms that our estimator offers valid inference.

Table 1 Simulation results for \(\beta _1\)
Table 2 Simulation results for \(\beta _2\)
Table 3 Simulation results for \(\gamma \)

4 Application to female earnings estimation

4.1 Labor force participation and earnings

Estimation of the female earnings equation is a topic of long-standing interest among economists. Early investigations of the labor supply decisions of married women, including both participation and hours, date back to Gronau (1974) and Heckman (1974), who were among the first to highlight sample selection bias stemming from the labor supply decision. Labor market decisions by women over the later decades have been studied and documented by Mroz (1987), Ahn and Powell (1993), Neumark and Korenman (1994), Vella (1998), Devereux (2004), Blau and Kahn (2007), Mulligan and Rubinstein (2008), Cavalcanti and Tavares (2008), Bar et al. (2015), among others.

Numerous applied works have extensively scrutinized the empirical aspects of the sample selection problem when estimating female labor market behavior. The determinants of the earnings equations for married women are similar to those of men and have been mostly agreed upon. These determinants traditionally include women’s education, experience, age, tenure and location. The hallmark of correcting for sample selection bias is finding some appropriate exclusion restriction(s) (i.e., variable(s) affecting selection but not the earnings) in order to ensure proper identification of the model parameters. Two main competing choices of such exclusion restrictions have been exploited for estimation of labor market earnings for married women: non-wife/husband’s income and the existence or number of (young) children. The underlying argument is that these two sets of variables affect the labor supply decision of married women but not their earnings. Huber and Mellace (2014) provide an overview of the related literature on sample selection bias in the female earnings equations.

Cavalcanti and Tavares (2008) provide an alternative view and argue that the declining price and wider availability of home appliances play a crucial role in explaining the rise in female labor force participation. This suggests a long list of potential exclusion restrictions. Furthermore, the exact nature and functional form of the chosen exclusion restriction(s) in the selection equation are uncertain. For example, should work experience include years of part-time employment? Should educational attainment be measured in full years of completed education or in milestones such as high school or college, as a replacement for or complement to years of education? Similarly, should age enter the model in a linear or quadratic form?

Our goal in this section is to illustrate the performance of the DS-HECK procedure on the following earnings equation:

$$\begin{aligned} \log \left( earnings\right) =\alpha _{0}+\alpha _{1}education+\alpha _{2}{\textbf{x}}_1+\alpha _{3}state\text { }dummies+u_1, \end{aligned}$$
(14)

where \(\log \left( earnings\right) \) is the natural logarithm of the individual’s total annual labor income, education is the person’s completed years of education, \(state\text { }dummies\) is a vector containing a full set of state dummies, and \(u_1\) is an idiosyncratic error. The vector \({\textbf{x}}_1\) varies across the exact specification we consider and can contain age and/or work experience.

To address the potential self-selection bias we employ DS-HECK as a data-driven procedure for choosing the explanatory variables and functional form (among high-dimensional options provided) in the following labor force participation equation:

$$\begin{aligned} inlf={\mathbb {I}}({\textbf{x}}'\beta +{\textbf{z}}'\eta +u_2\ge 0), \end{aligned}$$
(15)

where inlf is the dummy variable that is equal to one for those women who are in the labor force at the time of the interview and zero otherwise, and \(u_2\) is an idiosyncratic error. The vector \({\textbf{x}}\) includes all the explanatory variables from Eq. (14) (both in a linear and quadratic functional form) as well as exclusion restrictions.

In practice, \({\textbf{x}}\) is constructed as follows. To simplify notation, denote all the explanatory variables from Eq. (14) as \({\textbf{x}}_1\). First, we run a lasso probit of inlf on the high-dimensional controls \({\textbf{w}}\), where \({\textbf{w}}\) includes \({\textbf{x}}_1\) and other high-dimensional controls; denote the selected variables as \({\textbf{x}}_2\). Second, \({\textbf{x}}\) is the union of \({\textbf{x}}_1\) and \({\textbf{x}}_2\). All the controls in \({\textbf{w}}\) that are not selected are used as \({\textbf{z}}\).

4.2 Sample construction

We obtain our sample from the 2013 wave of the Panel Study of Income Dynamics (PSID) where we focus on the sub-population of white married women. The choice of explanatory variables reflects their availability in the PSID and the traditional set of regressors used in the existing literature on female labor force participation and earnings. Specifically, the explanatory variables we collect from the PSID include information on the educational attainment of the individual (both as the number of years completed and as a set of indicators for milestone achievements in education), a set of indicators for whether the individual obtained her education in the USA, outside the USA, or both, as well as a set of indicators for the educational levels of the individual’s parents, work experience of the individual, age and geographical location of the individual (captured by a set of dummy variables for the current state where the individual is located), a set of indicators reflecting the Beale-Ross rural–urban continuum code for the individual’s current residence, and an indicator for whether the individual owns a vehicle. Table 4 contains a description of the key explanatory variables.

Table 4 Description of key PSID variables

The set of (potential) exclusion restrictions includes the number of children in the household under 18 years of age, an indicator for whether there are any children age 15 years old or younger in the individual’s household, annual labor income of the husband, child care expenses, and household major expenditure (i.e. expenditure on household furnishings and equipment, including household textiles, furniture, floor coverings, major appliances, small appliances and miscellaneous housewares). While admittedly far from ideal, this last variable is the closest information we find in the PSID to capture household expenditure on major household appliances, which allows us to test the argument of Cavalcanti and Tavares (2008). Finally, the dependent variables for the earnings and selection equations are (the natural logarithm of) the individual’s total annual labor income and the indicator for whether the individual is in the labor force, respectively.

Our sample contains 1,989 white married women, of whom 1,294 are in the labor force and 695 are not. Table 5 reports summary statistics for key variables in the dataset. A set of dummy variables for the current state as well as a set of indicators reflecting the Beale-Ross rural–urban continuum code are omitted to save space. A total of 46 states are present in our sample, with Delaware, District of Columbia, Hawaii, New Mexico, and Rhode Island omitted from our sample during data cleaning. We note that some women report being neither employed nor (temporarily) unemployed while also reporting non-zero labor income during that time. There are 161 such women in the sample. We treat these individuals as not being in the labor force.

Table 5 Key sample characteristics

4.3 Empirical findings

We consider several specifications when estimating the earnings equation subject to sample selection bias. Table 6 reports the key results for both equations obtained using DS-HECK. The top panel provides coefficient estimates (as well as their standard errors) for \(\alpha _1, \alpha _2\) and the coefficient on the inverse Mills ratio while the bottom panel provides estimates (and standard errors) for \(\beta \). In addition to the reported estimates, each specification contains two more sets of estimates which we do not report in the table to save space. First, we do not report the coefficients on the full set of state and urban-rural dummies included in both equations. Second, for each specification there are the selected controls in both equations; the number of such controls is reported at the bottom of the respective panels but the coefficients themselves are not reported.

We note that following the original Heckman specification, the explanatory variables present in the earnings equation (\(\textbf{x}_1\)) are always kept in the labor force participation equation. Columns (1) and (2) report the estimates when work experience enters the two equations, with and without the quadratic term, while age is not included. Columns (3) and (4) report the estimates when age is included but experience is not. Columns (5) and (6) contain both but differ in whether experience squared is included. Child care expenditure is always selected in the post-lasso probit step and we view it as an important exclusion restriction (part of \(\textbf{x}\) not in \(\textbf{x}_1\)), so it is reported separately from other controls (\(\textbf{z}\)). The variables If kids under 15, Number of kids, Husband’s income, and Major expenses are not selected in the post-lasso probit step but are sometimes selected in Step 2 of the selection model estimation, in which case they are counted under selected controls.

Column (8) reports the estimates of the traditional Heckman specification with Experience, Experience\(^2\), Age and Age\(^2\) in both equations and Child care expenditure, If kids under 15, Number of kids, Husband’s income, and Major expenses in the participation equation. No selection is used on additional controls. In column (7) we report a specification that forces exactly the same variables as in the classical Heckman specification in column (8) to be included and, additionally, permits the lasso to select any additional controls.

Table 6 Estimation results

As Table 6 suggests, the signs of all the reported coefficient estimates are as expected, and they are highly significant for the most part. We note that of the five potential exclusion restrictions of interest, the lasso selects child care expenditure as the relevant covariate for the labor force participation equation. Finally, we note that the results for the labor force participation equation reported in columns (7) and (8) are similar to those reported in columns (5) and (6), except that (7) and (8) use the additional exclusion restrictions, some of which turn out to be statistically significant, while (5) and (6) drop them.

Next we focus on the estimates of the labor income equation. According to the results reported in Table 6, we conclude that the educational level of the individual plays a crucial role in explaining labor income for white married women in 2012. When statistically significant, the estimated rate of return to education ranges from 5.6% to almost 9% depending on the specification. Furthermore, there is evidence that the individual’s age is more important, both statistically and practically, than work experience for explaining the individual’s labor income in our sample. We note that when the individual’s age (in any functional form) is used in the labor income equation, the rate of return to education is statistically significant.

Most importantly, Table 6 suggests that the inverse Mills ratio is highly statistically significant in all specifications implying that the correction for the sample selection was needed. Given the economic interpretation of the estimated coefficients, their signs and economic as well as statistical significance, specification (5) seems most attractive in light of the existing studies on the topic. Interestingly, the traditional Heckman specification produces results that are close to those reported in column (5).

5 Conclusion

We have proposed an extension of the traditional Heckman sample selection model by incorporating it into a model selection framework with many controls. A double application of the lasso to each part of the model permits valid inference on the parts of the model that are of interest to empirical economists. We detail the steps involved in consistent estimation with valid standard errors and we provide a new Stata command to implement it. We also investigate aspects of Neyman orthogonality that relate it to the concept of redundancy in GMM estimation.

Lasso and double selection in linear models have recently been subject to scrutiny in cases where lasso under-selects controls in finite samples, even under a sparsity scenario, so that double selection estimators can have severe omitted variable biases (see, e.g., Wuthrich and Zhu 2021; Lahiri 2021). This happens when the signal from the variables lasso works on is weak and they do not get selected by either of the two selection procedures. The solution proposed by Wuthrich and Zhu (2021) is to resort in such cases to regular OLS estimation with high-dimensional variance matrix computations, which is computationally demanding and works only when \(p<n\).

We show how large the errors committed by the naive application of lasso can be and we provide an application to a classic problem in labor economics where using our method leads to a few new insights. We provide a user-friendly and versatile Stata command, which can help empirical economists use the proposed methodology. The command as well as the simulation and application data are made available on the authors’ web pages.

Finally we note that the results of this paper can be extended to other consistent methods of model selection beyond lasso, such as the Dantzig Selector proposed by Candes and Tao (2007).