DS-HECK: double-lasso estimation of Heckman selection model

We extend the Heckman (1979) sample selection model by allowing for a large number of controls that are selected using lasso under a sparsity scenario. The standard lasso estimation is known to under-select, causing an omitted variable bias in addition to the sample selection bias. We outline the adjustments required to restore consistency of lasso-based estimation and inference for vector-valued parameters of interest in such models. The adjustments include double lasso for both the selection equation and the main equation and a correction of the variance matrix. We also connect the estimator with results on redundancy of moment conditions. We demonstrate the effect of the adjustments using simulations and we investigate the determinants of female labor market participation and earnings in the US using the new approach. The paper comes with dsheckman, a dedicated Stata command for estimating double-selection Heckman models.


Introduction
In this paper we consider an extension of the familiar Heckman (1979) sample selection model, which can be written as follows:

y_{1i} = x_{1i}α + u_{1i},  (1)

y_{2i} = I(x_iβ + z_iη + u_{2i} > 0),  (2)

where I(·) in the selection Eq. (2) denotes the indicator function, which takes the value one if its argument is true and zero otherwise, (u_{1i}, u_{2i}) are the error terms and the outcome variable y_{1i} is observed only if the selection variable y_{2i} = 1. The main Eq.
(1) contains a k_1 × 1 vector of explanatory variables x_{1i} and we are interested in estimating and testing the coefficient vector α. The explanatory variables of the selection equation are separated into two parts, x_i = (x_{1i}, x_{2i}) and z_i. In the traditional version of the model there is no distinction between x_{2i} and z_i: both represent the explanatory variables that are present in the selection model but not in the main equation of interest. These are the well-known exclusion restrictions that facilitate identification of α. In our setting, for reasons that will become clear shortly, we wish to differentiate between x_{2i} and z_i. Our extension is to introduce a high-dimensional vector of explanatory variables in the selection model (2) which may or may not belong to the model. The vector x is a low-dimensional k × 1 vector of selection determinants that we wish to keep in the model no matter what. The vector z is a high-dimensional p × 1 vector of potential controls, where p can be as large as the (pre-selection) sample size N or larger and where we do not know which of these controls are important, if any. The vector β is a k × 1 vector of coefficients on x, which can be a target of inference too. The vector η, on the contrary, is just a p × 1 nuisance parameter vector.
This extension has many empirical applications in economics where we have a well defined list of regressors for the main equation which has roots in economic theory (e.g., consumer and labor theory) while what determines selection into the sample is less certain (see, e.g., Roy 1951; Heckman and Honore 1990). The classic examples are the estimation of the female labor supply function and wage functions (see, e.g., Heckman 1979; Arellano and Bonhomme 2017), which may be subject to selection bias as determinants of the sample selection are confounded with the behavioral functions of interest. We return to women's labor force participation and labor supply decisions in our empirical application section.
Our objective is to consistently estimate α in the outcome Eq. (1) under a potential sample selection bias arising from the fact that in the observed sample E(y_{1i}|x_i, z_i, y_{2i} = 1) = x_{1i}α + E(u_{1i}|x_i, z_i, y_{2i} = 1) ≠ x_{1i}α, unless E(u_{1i}|x_i, z_i, y_{2i} = 1) = 0, which is a questionable assumption in practice. Heckman (1979) assumed joint normality of (u_{1i}, u_{2i}) and showed that E(u_{1i}|x_i, z_i, y_{2i} = 1) = γ λ(x_iβ + z_iη), where λ(·) = φ(·)/Φ(·) is known as the inverse Mills ratio. The two-step heckit procedure is (a) to run the maximum likelihood estimation (MLE) for the probit of y_{2i} on (x_i, z_i) and use the estimates (β̂, η̂) to obtain λ̂_i ≡ λ(x_iβ̂ + z_iη̂) and then (b) to regress y_{1i} on x_{1i} and λ̂_i. Under correct specification, the resulting estimators α̂ and γ̂ are consistent and the usual t-test on γ can be used to test for selection bias. If the null of no bias is rejected, the standard errors of the second step have to be corrected for the first-step estimation error, which is done via a full MLE using normality of the errors or via an analytic correction to the variance in the second step.
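As a point of reference before the high-dimensional extension, the classical two-step procedure can be sketched in a few lines. The following snippet is our own illustration on a toy simulated design (the coefficient values and variable names are ours, not the paper's): a probit first step fitted by Fisher scoring, the inverse Mills ratio, and an augmented OLS second step.

```python
import math
import numpy as np

# Toy DGP for a Heckman selection model: z is an excluded variable that
# shifts selection but not the outcome; gamma = 0.5 drives the selectivity.
rng = np.random.default_rng(0)
n = 5000
x = rng.standard_normal(n)
z = rng.standard_normal(n)
u2 = rng.standard_normal(n)
u1 = 0.5 * u2 + 0.5 * rng.standard_normal(n)   # gamma = 0.5
y2 = (0.5 + x + z + u2 > 0).astype(float)      # selection indicator
y1 = 1.0 + 2.0 * x + u1                        # outcome, used only where y2 = 1

norm_cdf = np.vectorize(lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))
norm_pdf = lambda t: np.exp(-0.5 * t ** 2) / math.sqrt(2.0 * math.pi)

def probit_fit(X, y, iters=25):
    """Probit MLE via Fisher scoring."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        xb = X @ b
        F = np.clip(norm_cdf(xb), 1e-10, 1 - 1e-10)
        f = norm_pdf(xb)
        score = X.T @ (f * (y - F) / (F * (1 - F)))
        info = X.T @ (X * (f ** 2 / (F * (1 - F)))[:, None])
        b = b + np.linalg.solve(info, score)
    return b

# Step (a): probit of y2 on (1, x, z), then the inverse Mills ratio.
Xsel = np.column_stack([np.ones(n), x, z])
b_hat = probit_fit(Xsel, y2)
lam = norm_pdf(Xsel @ b_hat) / norm_cdf(Xsel @ b_hat)

# Step (b): OLS of y1 on (1, x, lambda_hat) in the selected sample.
sel = y2 == 1
W = np.column_stack([np.ones(sel.sum()), x[sel], lam[sel]])
coef, *_ = np.linalg.lstsq(W, y1[sel], rcond=None)
alpha0_hat, alpha1_hat, gamma_hat = coef
```

The second-step coefficient on λ̂ estimates γ, and a t-test on it (with corrected standard errors) is the usual test for selection bias.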
The high-dimensionality of z_i poses a challenge in applying the traditional two-step procedure. First, we cannot include all the variables in z_i in the first step because there are too many of them. If p is larger than N, the probit with all of x_i and z_i is infeasible, and even if p is substantially smaller than N but still large, including all these variables can cause difficulties in MLE convergence.
In order to make estimation feasible, it is common to impose a certain structure on η, known in the literature on regularized estimation as a sparsity scenario. In particular, we assume that only a few elements in the coefficient vector η are substantially different from zero. Although we assume that η is sparse, we do not know which elements are non-zero, so a consistent model selection technique is required. A popular approach to regularizing linear models is the least absolute shrinkage and selection operator (lasso) developed by Tibshirani (1996). The method penalizes the objective function with an l_1-norm of the coefficients. This shrinks the irrelevant coefficients to zero and thus serves as a model selection tool. However, even for purely linear models, this approach has well-known challenges.
First, lasso makes mistakes. Failure to account for the fact that the covariates have been selected by lasso results in invalid inference. The reason is that lasso, like many other model selection techniques, does not always find all the relevant covariates, especially when some coefficients are small. Model selection mistakes made by lasso cause the distribution of this naive estimator to be biased and nonnormal. For example, Leeb and Pötscher (2008a, b) and Pötscher and Leeb (2009) show that the normal approximation for the naive lasso estimator will produce misleading inference. Belloni et al. (2014b), Belloni et al. (2016b), and Chernozhukov et al. (2018) derive estimators that are robust to the mistakes made by lasso. Such robust estimators are often referred to as Neyman orthogonal (NO) estimators because they can be viewed as extensions of an approach proposed by Neyman (1959).
The second challenge is choosing the lasso tuning parameter. Lasso's ability to select relevant covariates depends on the method used to choose the tuning parameters. Belloni et al. (2014b) propose a plug-in method and show that NO estimators perform well on linear models under that method. Belloni et al. (2016b) extend the linear lasso to logit models and show good performance using a simplified version of the plug-in method. Drukker and Liu (2022) extend the plug-in method to cross-sectional generalized linear models and provide Monte Carlo evidence that their extension works well in finite samples.
In this paper, we develop NO estimation for the model in (1)-(2) which we call double-selection Heckman procedure, or DS-HECK. The DS-HECK estimator draws upon the classical two-step heckit estimator and the double-selection lasso for the high-dimensional generalized linear models proposed by Belloni et al. (2016b). We detail the steps involved in the estimation, work out the estimator properties and derive the variance corrections. We also provide new insights into how NO estimation is linked to results on redundancy of knowledge in moment-based estimation considered by Breusch et al. (1999) and Prokhorov and Schmidt (2009).
The rest of the paper is organized as follows. Section 2 describes and studies the DS-HECK estimator. In Sect. 3, we present simulation results that demonstrate an excellent performance of DS-HECK in finite samples. In Sect. 4, we apply DS-HECK to estimate married women's wage using the 2013 PSID wave, in the presence of high-dimensional controls and potential sample selection bias. Finally, Sect. 5 concludes.

Settings
We maintain the standard assumptions of the Heckman sample selection model.

Assumption 1 (a) (x, z, y_2) are always observed, y_1 is observed only when y_2 = 1; (b) (u_1, u_2) is independent of (x, z); (c) u_2 ∼ N(0, 1); (d) E(u_1|u_2) = γ u_2.

Assumption 1 is in essence the same as in Wooldridge (2010, p. 803). Part (a) describes the nature of sample selection. Part (b) assumes that x and z are exogenous. Part (c) is restrictive but needed to derive the conditional expectation of y_1 given that it is observed. Part (d) requires linearity in the conditional expectation of u_1 given u_2, and it holds when (u_1, u_2) is bivariate normal. However, it also holds under weaker assumptions when u_1 is not normally distributed.
Additionally, we impose a sparsity scenario on η.
Assumption 2 η is sparse; that is, most of the elements of η are zeros. Namely, ||η||_0 ≤ s, where ||·||_0 denotes the number of non-zero components of a vector. We require s to be small relative to the sample size N; in particular, s² log²(max(p, N))/N → 0.

This assumption follows Belloni et al. (2016b). In the setting of generalized linear models, it allows for the estimation of the nuisance parameter in the selection equation at the rate o(N^{-1/4}) (see their Condition IR). In our setting, this rate is needed to guarantee the consistent estimation of β.
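To get a sense of magnitudes, the ratio s² log²(max(p, N))/N can be evaluated at the dimensions used later in the simulation section (N = 2000, p = 2100, s = 9 non-zero coefficients). This is our own back-of-the-envelope illustration.

```python
import math

# Evaluate s^2 log^2(max(p, N)) / N at the simulation section's dimensions.
# The sparsity condition is an asymptotic rate requirement, so the value
# below (about 2.37) is only indicative, not a finite-sample pass/fail bound.
N, p, s = 2000, 2100, 9
ratio = s ** 2 * math.log(max(p, N)) ** 2 / N
print(round(ratio, 2))  # prints 2.37
```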
Under Assumption 1, it is easy to show that

E(y_1|x, z, y_2 = 1) = x_1α + γ λ(xβ + zη),  (3)

where λ(·) = φ(·)/Φ(·) is the inverse Mills ratio. In essence, this is the classic formulation of Heckman (1979) where the presence of x_2 and z in the selection equation (but not in the main equation of interest) permits estimation of the model even when the inverse Mills ratio is close to being linear in its argument.
We wish to explore the behavior of this conditional expectation with respect to potential errors in the choice of z. Applying the mean-value theorem to the inverse Mills ratio evaluated at xβ, we can rewrite (3) as follows:

E(y_1|x, z, y_2 = 1) = x_1α + γ λ(xβ) + zω,  (4)

where q is a point between xβ + zη and xβ, λ^{(1)}(·) is the first-order derivative of λ(·), and ω = γ λ^{(1)}(q)η. It is well known that λ^{(1)}(·) is monotone and bounded between −1 and 0 (see, e.g., Sampford 1953). We note that ω depends on q, γ and η, and that if η is sparse then ω is sparse too.
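The bound on λ^{(1)} is easy to verify numerically. The following self-contained check (our own illustration) uses the standard identity λ^{(1)}(t) = −λ(t)(t + λ(t)):

```python
import math
import numpy as np

# Check that the derivative of the inverse Mills ratio lambda(t) = phi(t)/Phi(t)
# stays strictly inside (-1, 0) and is monotone increasing, using the identity
# lambda'(t) = -lambda(t) * (t + lambda(t)). The grid avoids the far left tail,
# where Phi(t) underflows in double precision.
t = np.linspace(-5.0, 5.0, 2001)
Phi = np.vectorize(lambda v: 0.5 * (1.0 + math.erf(v / math.sqrt(2.0))))(t)
phi = np.exp(-0.5 * t ** 2) / math.sqrt(2.0 * math.pi)
lam = phi / Phi                 # inverse Mills ratio
dlam = -lam * (t + lam)         # its first derivative, in (-1, 0)
```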
Proposition 1 Under Assumptions 1 and 2, the conditional expectation (4) satisfies the same sparsity scenario as the selection equation; that is, ||ω||_0 ≤ ||η||_0 ≤ s.

Proof Sketches of the proofs of all less obvious propositions are given in the Appendix.

This Proposition makes it clear that a Heckman model with sparsity in the selection equation can be written as a heckit model with the same sparsity scenario in the main equation of interest.
Next we derive some conditions on the linear approximation of the inverse Mills ratio using z in the selected sample, which is common for lasso-based model selection but new in the context of the Heckman model.
Let n denote the size of the selected sample, defined as n = Σ_{i=1}^{N} y_{2i}. Then, for the selected observations, we can write

E(y_{1i}|x_i, z_i, y_{2i} = 1) = x_{1i}α + γ λ(x_iβ) + g(x_i, z_i).

We follow Belloni et al. (2014b) and write g(x_i, z_i) in a linear form subject to a bound on the approximation error:

Assumption 3 g(x_i, z_i) = z_iδ + r_i, where r_i, i = 1, ..., n, is the approximation error such that (1/n) Σ_{i=1}^{n} E r_i² = O(s/n). Additionally, we assume that the selected and pre-selection sample sizes are of the same order.

The assumption on the approximation error follows Belloni et al. (2014b). Similar to Assumption 2, Assumption 3 ensures that we can estimate the nuisance parameter in the selected sample at the rate o(n^{-1/4}). In the context of the Heckman model, it implies that ||δ||_0 = ||ω||_0 = ||η||_0 ≤ s and that δ is also estimated at o(N^{-1/4}) since n and N are of the same order. Example-specific primitive conditions that ensure Assumption 3 holds are discussed by Belloni et al. (2014b, Section 4.1) for parametric and nonparametric cases, with the parametric example in Section 4.1.1 being most relevant for our setting.
Next we investigate how to consistently estimate this model accounting for the high-dimensional nuisance parameter in both equations.

Estimation of the selection equation
Clearly, if we knew the true value of β, we could treat λ(xβ) as a known variable and estimate α and γ, treating δ as a nuisance parameter. So we start with a consistent estimation of β in Eq. (2) using the approach of Belloni et al. (2016b) combined with the parameter tuning of Drukker and Liu (2022).
The estimation involves three steps.

Step 1 (post-lasso probit). We start by estimating a penalized probit of y_2 on x and z using the lasso penalty:

(β̃, η̃) = argmin E_N [ℓ_i(x_iβ + z_iη)] + (λ_1/N) ||η||_1,

where E_N denotes the sample mean over N observations, ℓ_i(·) is the negative log-likelihood contribution for the probit model, ||·||_1 is the lasso (l_1) norm of the parameters and λ_1 is a tuning parameter chosen using the plug-in method of Drukker and Liu (2022). This produces a subset of the variables in z indexed by support(η̃), where for a p-vector v, support(v) := {j ∈ {1, ..., p} : v_j ≠ 0}. These variables are used in the post-lasso probit, which re-estimates the selected model without the penalty. As a result, we obtain the sparse probit estimates (β̂, η̂), where η̂ contains only a few non-zero elements. Belloni et al. (2016b) propose using these estimates to construct the weights f̂_i employed in Step 2.

Step 2 (weighted lasso). Using the weights f̂_i from Step 1, for each variable x_j in x, j = 1, ..., k, we run the penalized regression of f̂_i x_{ij} on f̂_i z_i, with the penalty level λ_2 chosen by the plug-in method of Drukker and Liu (2022). For each element of x, this produces a selection from the variables in z indexed by support(θ̂_j), j = 1, ..., k.
Step 3 (double-selection probit). We use the variables selected from z in Steps 1 and 2 to run the probit of y_2 on x and the union of the sets of variables selected in Steps 1 and 2. In the setting of generalized linear models, Belloni et al. (2016b) show that the double-selection probit corrects for the omitted variable bias introduced by a naive application of lasso to Eq. (2). The intuition is that the double selection using weights f̂_i "neymanizes" Step 3. That is, it ensures that the estimation error from the first step does not affect the estimated parameter vector of the last step.
It follows from Belloni et al. (2016b, Theorem 1) that, under Assumption 2, β̂ is a consistent estimator of β and its variance can be obtained from Step 3 using the well-known "sandwich" formula for probit. For example, in Stata it can be obtained using the vce(robust) syntax. We obtain a regular √N-consistent estimator of β with standard inference, even though a penalized non-√N estimator is being used to carry out model selection for the high-dimensional nuisance parameter η. Belloni et al. (2016b) use the Neyman orthogonalization to obtain their result. In the next section we show how the NO argument relates to the concept of moment redundancy pioneered by Breusch et al. (1999). This offers an alternative way of arriving at the weights derived by Belloni et al. (2016b).

Connection to redundancy of moment conditions
The key insight of Belloni et al. (2016b) is that the weights f_i ensure the validity of the main moment condition (6) identifying β, which has to hold simultaneously with the zero expected derivative (Neyman orthogonality) condition with respect to the nuisance parameter. It is easy to see that Eq. (6) holds due to E(y_{2i}|x_i, z_i) = Φ(x_iβ + z_iη), and that the zero expected derivative condition holds, too. See Eq. (28) of Belloni et al. (2016b).
What has not been noted is that these conditions correspond to what Prokhorov and Schmidt (2009) call moment and parameter redundancy (M/P-redundancy), that is the situation when neither the knowledge of the additional moment conditions nor the knowledge of the parameter they identify help improve efficiency of estimation.
To see this, let x_i be a scalar for notational simplicity, and consider the moment conditions identifying β and η, given in Eqs. (7) and (8), where the subscript "0" on a parameter denotes the true value. These moment conditions correspond to the first-order conditions of the probit and stem from the specification E(y_{2i}|x_i, z_i) = Φ(x_iβ_0 + z_iη_0). This is the system of moment conditions considered by Breusch et al. (1999) in the Generalized Method of Moments (GMM) framework. See their Eq. (6). They show that the (optimal) GMM estimation based on Eqs. (7)-(8) is equivalent to the estimation based on Eq. (8) and the error in the linear projection of Eq. (7) on Eq. (8).
Using their notation, we can write the equivalent moment system as in Eqs. (9)-(10), where Ω_{12} and Ω_{22} are the relevant parts of the moment variance matrix Ω. It is easy to see that Eq. (9) coincides with Eq. (6) subject to the additional notation θ_0 = Ω_{12}Ω_{22}^{-1}. It is also easy to see that the entire estimation problem can be written in the GMM framework as a stacked system, where the first equation is Eq. (6) above, the second equation is the moment condition that identifies η_0 and the third equation is a modified (through the inclusion of the scalar σ_i) version of the OLS first-order conditions used to estimate θ_0. We note that, due to a separability result of Ahn and Schmidt (1995), we cannot improve on the estimation of θ_0 by estimating it jointly with (β_0, η_0) because the additional conditions (11)-(12) only determine θ_0 in terms of β_0 and η_0. See also Prokhorov and Schmidt (2009, Statement 6). We further note that by Statement 7 of Prokhorov and Schmidt (2009), joint estimation of the entire vector (β_0, η_0, θ_0) is equivalent to a two-step estimation where θ_0 is estimated first and the second step is adjusted for the estimation error of the first step.
More importantly, because the correlation between the moment functions in Eqs. (11) and (12) is zero and the expected derivative of Eq. (11) with respect to η is zero, the condition of partial redundancy of Breusch et al. (1999, Theorem 7) holds (in their notation G_{21} = 0 and Ω_{21} = 0). This means the moment condition (12) is redundant for the estimation of β (M-redundancy). Additionally, these conditions are sufficient to show that the knowledge of the value of η_0 is redundant (P-redundancy). See Statement 4 of Prokhorov and Schmidt (2009). So the NO condition of Belloni et al. (2016b) corresponds to a well-established situation in GMM estimation when neither the knowledge of the parameter η_0, nor the knowledge of the moment condition (12) helps estimate β_0 . 1

Choice of penalty parameter
The penalty parameters λ_1 and λ_2 can be derived analytically, but typically data-driven methods are used, and their theoretical validity and practical performance have been well studied. For example, cross-validation and AIC typically under-penalize (over-select), including too many variables in order to reduce bias, compared with BIC or plug-in methods. When too many variables are selected, the sparsity assumption required for the double lasso to work is violated.
Plug-in methods have been shown to perform well in a multitude of settings (see, e.g., Drukker and Liu 2022; Belloni et al. 2014b, 2016a).

Estimation of the main equation
We can now return to the estimation of α and γ . Similar to Belloni et al. (2016b), Belloni et al. (2014b) observe that the direct application of lasso to linear models with a large-dimensional nuisance parameter results in a biased estimation of the parameter of interest, which in their case is a scalar treatment effect. They propose a double selection procedure. We follow their approach subject to a few modifications that reflect the specifics of our main equation.
First, with a consistent estimator of β, a natural estimator of the inverse Mills ratio in Eq. (4) is λ̂_i ≡ λ(x_iβ̂). It is also natural to account for the fact that this is a generated regressor when constructing the variance matrix, something we consider later.
Second, Belloni et al. (2014b) derive their results for a scalar parameter. Because the variables of interest x form a vector, we need to extend the original double selection lasso estimation to vectors. We provide the details of this extension using the NO arguments in Appendix A.
We can now discuss the estimation of the main equation, which combines the double selection lasso of Belloni et al. (2014b) and parameter tuning 2 by Drukker and Liu (2022). It proceeds in three steps.

Step 1′. We run the lasso regression of y_1 on z. This produces a subset of z indexed by support(θ̂_y).
Step 2′. For each variable x_{1j} in x_1, j = 1, ..., k_1, we run the lasso regression of x_{1j} on z. Additionally, we run the lasso regression of λ(x_iβ̂) on z. This step produces subsets of z indexed by support(θ̂_j), j = 1, ..., k_1, and support(θ̂_λ).
Step 3′. We run the regression of y_{1i} on x_{1i}, λ(x_iβ̂), and the union of the sets selected in Steps 1′ and 2′.

Proposition 2 Under Assumptions 1-3, the DS-HECK estimation in Steps 1′-3′ above is consistent for α and γ and post-double-selection inference on α and γ is valid.
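A purely linear analogue of the three steps above makes the union-of-supports logic concrete. The sketch below is our own illustration: the inverse Mills ratio term is dropped (it would enter like another column of interest), the lasso is a bare-bones coordinate-descent implementation, and the penalty level is hand-picked for the toy data rather than chosen by the plug-in method.

```python
import numpy as np

# Stylized double selection for a linear model y = a*d + (controls) + e, where
# d is the variable of interest and z_0 is a confounder among p = 100 controls.
rng = np.random.default_rng(1)
n, p = 400, 100
Z = rng.standard_normal((n, p))
d = Z[:, 0] + 0.5 * rng.standard_normal(n)            # d is confounded by z_0
y = 1.0 * d + 2.0 * Z[:, 0] + 0.5 * Z[:, 1] + rng.standard_normal(n)

def lasso_cd(X, y, lam, iters=50):
    """Coordinate-descent lasso via soft thresholding (unstandardized)."""
    b = np.zeros(X.shape[1])
    ss = (X ** 2).sum(axis=0)
    for _ in range(iters):
        for j in range(X.shape[1]):
            r = y - X @ b + X[:, j] * b[j]            # partial residual
            rho = X[:, j] @ r
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / ss[j]
    return b

lam = 50.0
s1 = set(np.flatnonzero(lasso_cd(Z, y, lam)))         # step 1: controls predicting y
s2 = set(np.flatnonzero(lasso_cd(Z, d, lam)))         # step 2: controls predicting d
keep = sorted(s1 | s2)                                # union of the two selections
W = np.column_stack([d, Z[:, keep]])
coef, *_ = np.linalg.lstsq(W, y, rcond=None)          # step 3: post-double-selection OLS
a_hat = coef[0]
```

Because z_0 confounds both d and y, a single selection on the outcome equation alone could drop it when its coefficient is small; taking the union of the outcome-side and treatment-side selections is what protects the final OLS step.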
The DS-HECK estimator corrects the bias generated by applying the lasso directly to Eq. (4). The simulation experiments we report in Sect. 3 illustrate the size of this bias.
Following Belloni et al. (2014b), we can claim that inference about the vector (α , γ ) is valid but, unlike Belloni et al. (2014b), it is valid up to the variance matrix correction reflecting the post-lasso probit estimation of β. 3

Variance matrix estimation
We start with some new notation. Let λ̂_i ≡ λ(x_iβ̂), where β̂ is obtained by the double-selection probit. Let e denote the vector of residuals from the last step of the double-selection lasso estimation, with typical element e_i, i = 1, ..., n; that is, e_i is the difference between y_{1i} and its fitted value from the final regression. Let W denote the matrix containing x_1, the n × 1 vector of λ̂_i's, and the variables in z that survived the double selection. Let R be an n × n diagonal matrix with diagonal elements e_i².

Proposition 3 A consistent estimator of the variance matrix of the DS-HECK estimator combines the second-step "sandwich" with a correction for the estimation of β̂, where V_b is the "sandwich" variance matrix for the double-selection probit estimator β̂ and D is the diagonal matrix with diagonal elements ξ_i.
The variance for α and γ is the upper (k 1 + 1) × (k 1 + 1) submatrix of V . The dsheckman command implements this variance estimator.
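As a reference for the uncorrected piece, the second-step sandwich (W′W)^{-1} W′RW (W′W)^{-1} can be computed in a few lines. The snippet below is our own sketch on toy data, under the assumption that R carries the squared residuals e_i² on its diagonal (the usual heteroskedasticity-robust choice); the additional V_b term that corrects for the estimated β̂ inside λ̂ is not included.

```python
import numpy as np

# The building block of the variance estimator: the OLS "sandwich"
# (W'W)^{-1} W'RW (W'W)^{-1} with R = diag(e_i^2), on heteroskedastic toy data.
rng = np.random.default_rng(2)
n = 500
W = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = W @ np.array([1.0, 2.0, -1.0]) + rng.standard_normal(n) * (1 + 0.5 * np.abs(W[:, 1]))

coef, *_ = np.linalg.lstsq(W, y, rcond=None)
e = y - W @ coef                                 # residuals from the last step
bread = np.linalg.inv(W.T @ W)
meat = W.T @ (W * (e ** 2)[:, None])             # W' diag(e^2) W
V = bread @ meat @ bread                         # robust sandwich variance
se = np.sqrt(np.diag(V))
```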

Monte Carlo simulations
To evaluate the finite-sample performance of DS-HECK, we conduct a simulation study using four estimators: (i) ordinary least squares on the selected sample (OLS), (ii) the Heckman two-step estimator based on the true model (Oracle), (iii) the Heckman two-step estimator using lasso to select variables in Eq. (2) (Naive), and (iv) the proposed double-selection Heckman estimator (DS). We have implemented the DS-HECK estimator in the Stata command dsheckman, available on the authors' web pages along with data sets and codes for the simulations and application, and we describe its syntax in Appendix C.
OLS is inconsistent unless there is no sample selection bias, i.e. γ = 0. Naive is inconsistent due to the errors made by lasso. Moreover, Naive does not provide valid inference as it is not robust to the model selection bias. In contrast, DS is expected to retain consistency in the presence of sample selection bias and to be robust against the model selection bias. Oracle is expected to behave like the standard Heckman estimator under the true model but, in practice, Oracle is infeasible since we do not know the true model.

Setup
Our data generating process is as follows: u_2 ∼ N(0, 1) and u_1 = γ u_2 + ε, where ε ∼ N(0, 1) is independent of u_2. We vary the strength of the selection bias by setting γ to 0.4, 0.5, 0.6, 0.7, and 0.8 and we observe y_1 only when y_2 = 1.
The selection equation is generated using nine non-zero variables in z, of which four have a relatively large effect and five a relatively small one. The value 0.046 is chosen so that it violates the so-called "beta-min" condition and causes lasso to make model selection mistakes (see, e.g., Liu et al. 2020; Drukker and Liu 2022). The sample size is 2000. The number of replications is 1000. We consider two scenarios for p, the dimension of z: (i) p = 1000, fewer variables than observations; (ii) p = 2100, more variables than observations. The variables are generated using a Toeplitz correlation structure with decreasing dependence. In particular, let Z be the matrix of dimension N × p containing z. Then Z = ML′, where M is N × p with typical element (ζ_{ij} − 15)/√30, ζ_{ij} ∼ χ²(15), and L is the Cholesky factor of a symmetric Toeplitz matrix V of dimension p × p. The variables x are also correlated and they are generated as functions of z. In particular,

x_1 = z_3 + z_10 + z_11 + z_12 + z_15 + ε_{x_1},
x_2 = 0.5(z_3 + z_10 + z_11 + z_12 + 2z_15) + ε_{x_2},

where ε_{x_1} and ε_{x_2} follow a Toeplitz structure similar to Z.
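The correlated design can be reproduced in a few lines. The sketch below is our own illustration of the construction: standardized χ²(15) shocks with a Toeplitz correlation imposed through a Cholesky factor. The decay rule 0.7^{|i−j|} is an illustrative choice of ours, not the paper's exact law.

```python
import numpy as np

# Draw standardized chi-square(15) shocks M (mean 0, variance 1) and impose
# a Toeplitz correlation structure V on the columns via its Cholesky factor.
rng = np.random.default_rng(3)
N, p = 2000, 50
zeta = rng.chisquare(15, size=(N, p))
M = (zeta - 15.0) / np.sqrt(30.0)                 # chi2(15): mean 15, variance 30
idx = np.arange(p)
V = 0.7 ** np.abs(idx[:, None] - idx[None, :])    # illustrative Toeplitz matrix
L = np.linalg.cholesky(V)
Z = M @ L.T                                       # rows of Z have covariance V
```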

Results
For each estimator, we report the following measures: (i) true value of the parameter (True), (ii) mean squared error (MSE), (iii) average of estimates across simulations (Mean), (iv) standard deviation of estimates across simulations (SD), (v) average standard error across simulations (SE) and (vi) rejection rate for the H_0 that the parameter equals its true value at the nominal 5% level of significance (Rej. rate).
We report the simulation results for β_1, β_2, and γ in Tables 1, 2 and 3, respectively. Several observations are clear from the tables. First, OLS is badly biased and the bias is greater when selection is stronger. Second, Naive is also biased and it fails to provide valid inference at any value of γ and p. This demonstrates that Naive is not robust to the model selection errors. The rejection rate and MSE increase as γ increases, which is expected because greater γ values indicate a greater sample selection bias. Third, Oracle shows consistency and rejection rates close to the nominal 5% significance level. Fourth, DS performs similarly to Oracle for all values of γ and p. In particular, its MSE is consistently smaller than that of Naive and OLS, and SE is close to SD, which shows that the proposed variance adjustment works well. Finally, the Rej. rate for DS is near the 5% significance level, which supports the claim that our estimator offers valid inference.

Labor force participation and earnings
Estimation of the female earnings equation is a topic of long-standing interest among economists. Early investigations of the labor supply decisions, including both participation and hours by married women, date back to Gronau (1974) and Heckman (1974), who were among the first to highlight sample selection bias stemming from the labor supply decision. Labor market decisions by women over the later decades have been studied and documented by Mroz (1987), Ahn and Powell (1993), Neumark and Korenman (1994), Vella (1998), Devereux (2004), Blau and Kahn (2007), Mulligan and Rubinstein (2008), Cavalcanti and Tavares (2008), Bar et al. (2015), among others.
Numerous applied works have extensively scrutinized the empirical aspects of the sample selection problem when estimating female labor market behavior. The determinants of the earnings equations for married women are similar to those of men and have been mostly agreed upon. These determinants traditionally include women's education, experience, age, tenure and location. The hallmark of correcting for sample selection bias is finding some appropriate exclusion restriction(s) (i.e., variable(s) affecting selection but not the earnings) in order to ensure proper identification. The exclusion restrictions traditionally used in this literature are non-wife (husband's) income and the existence or number of (young) children. The underlying argument is that these two sets of variables affect the labor supply decision of married women but not their earnings. Huber and Mellace (2014) provide an overview of the related literature on sample selection bias in the female earnings equations. Cavalcanti and Tavares (2008) provide an alternative view and argue that the declining price and wider availability of home appliances play a crucial role in explaining the rise in female labor force participation. This suggests a long list of potential exclusion restrictions. Furthermore, the exact nature and functional form of the chosen exclusion restriction(s) in the selection equation is uncertain. For example, should labor work experience include years of part-time employment? Should educational attainment be measured in years completed or as a set of milestone indicators? We estimate the following earnings equation:

log(earnings) = α_0 + α_1 education + α_2 x_1 + α_3 state dummies + u_1,  (14)

where log(earnings) is the natural logarithm of the individual's total annual labor income, education is the person's completed years of education, state dummies is a vector containing a full set of state dummies, and u_1 is an idiosyncratic error. The vector x_1 varies across the exact specification we consider and can contain age and/or work experience.
To address the potential self-selection bias we employ DS-HECK as a data-driven procedure for choosing the explanatory variables and functional form (among the high-dimensional options provided) in the following labor force participation equation:

inlf = I(xβ + zη + u_2 > 0),

where inlf is the dummy variable that is equal to one for those women who are in the labor force at the time of the interview and zero otherwise, and u_2 is an idiosyncratic error. The vector x includes all the explanatory variables from Eq. (14) (both in linear and quadratic form) as well as exclusion restrictions.
In practice, x is constructed as follows. To simplify notation, denote all the explanatory variables from Eq. (14) as x_1. First, we run a lasso probit of inlf on high-dimensional controls w, where w includes x_1 and some other high-dimensional controls; denote the selected variables as x_2. Second, x is the union of x_1 and x_2. All the non-selected controls in w are used as z.

Sample construction
We obtain our sample from the 2013 wave of the Panel Study of Income Dynamics (PSID) where we focus on the sub-population of white married women. The choice of explanatory variables reflects their availability in the PSID and the traditional set of regressors used in the existing literature on female labor force participation and earnings. Specifically, the explanatory variables we collect from the PSID include information on the educational attainment of the individual (both as the number of years completed and as a set of indicators for milestone achievements in education), a set of indicators for whether the individual obtained her education in the USA, outside the USA, or both, as well as a set of indicators for the educational levels of the individual's parents, work experience of the individual, age and geographical location of the individual (captured by a set of dummy variables for the current state where the individual is located), a set of indicators reflecting the Beale-Ross rural-urban continuum code for the individual's current residence, and an indicator for whether the individual owns a vehicle. Table 4 contains a description of the key explanatory variables.
The set of (potential) exclusion restrictions includes the number of children in the household under 18 years of age, an indicator for whether there are any children age 15 years old or younger in the individual's household, annual labor income of the husband, child care expenses, and household major expenditure (i.e. expenditure on household furnishings and equipment, including household textiles, furniture, floor coverings, major appliances, small appliances and miscellaneous housewares). While admittedly far from ideal, this last variable is the closest information we find in the PSID to capture household expenditure on major household appliances, which allows us to test the argument of Cavalcanti and Tavares (2008). Finally, the dependent variables for the earnings and selection equations are (the natural logarithm of) the individual's total annual labor income and the indicator for whether the individual is in the labor force, respectively. Our sample contains 1,989 white married women, of whom 1,294 are in the labor force and 695 are not. Table 5 reports summary statistics for key variables in the dataset. A set of dummy variables for the current state as well as a set of indicators reflecting the Beale-Ross rural-urban continuum code are omitted to save space. A total of 46 states are present in our sample, with Delaware, District of Columbia, Hawaii, New Mexico, and Rhode Island omitted from our sample during data cleaning. We note that some women report being neither employed nor (temporarily) unemployed while also reporting non-zero labor income during that time. There are 161 such women in the sample. We treat these individuals as not being in the labor force.

Empirical findings
We consider several specifications when estimating the earnings equation subject to sample selection bias. Table 6 reports the key results for both equations obtained using DS-HECK. The top panel provides coefficient estimates (as well as their standard errors) for α 1 , α 2 and the coefficient on the inverse Mills ratio, while the bottom panel provides estimates (and standard errors) for β. In addition to the reported estimates, each specification contains two more sets of estimates which we do not report in the table to save space. First, we do not report the coefficients on the full set of state and urban-rural dummies included in both equations. Second, for each specification there are the controls selected in both equations; the number of such controls is reported at the bottom of the respective panels, but the coefficients themselves are not reported.
We note that, following the original Heckman specification, the explanatory variables present in the earnings equation (x 1 ) are always kept in the labor force participation equation. Columns (1) and (2) report the estimates when work experience enters the two equations, with and without the quadratic term, while age is not included. Columns (3) and (4) report the estimates when age is included but experience is not. Columns (5) and (6) contain both but differ in whether experience squared is included. Child care expenditure is always selected in the post-lasso probit step and we view it as an important exclusion restriction (part of x not in x 1 ), so it is reported separately from the other controls (z). The variables If kids under 15, Number of kids, Husband's income and Major expenses are not selected in the post-lasso probit step but are sometimes selected in Step 2 of the selection model estimation, in which case they are counted under selected controls.
Column (8) reports the estimates of the traditional Heckman specification with Experience, Experience 2 , Age and Age 2 in both equations and Child care expenditure, If kids under 15, Number of kids, Husband's income, Major expenses in the participation equation. No selection is used on additional controls. In column (7) we report a specification that forces exactly the same variables as in the classical Heckman specification in column (8) to be included and additionally, permits the lasso to select any additional controls.
As Table 6 suggests, the signs of all the reported coefficient estimates are as expected, and most are highly significant. We note that, of the five potential exclusion restrictions of interest, the lasso selects child care expenditure as the relevant covariate for the labor force participation equation. Finally, we note that the results for the labor force participation equation reported in columns (7) and (8) are similar to those reported in columns (5) and (6), except that (7) and (8) use the additional exclusion restrictions, some of which turn out to be statistically significant, while (5) and (6) drop them.

[Table 6: Estimation results]

Next we focus on the estimates of the labor income equation. According to the results reported in Table 6, we conclude that the educational level of the individual plays a crucial role in explaining labor income for white married women in 2012. When statistically significant, the estimated rate of return to education ranges from 5.6% to almost 9%, depending on the specification. Furthermore, there is evidence that the individual's age is more important, both statistically and practically, than work experience for explaining labor income in our sample. We note that whenever the individual's age (in any functional form) is included in the labor income equation, the rate of return to education is statistically significant.
Most importantly, Table 6 suggests that the inverse Mills ratio is highly statistically significant in all specifications, implying that the correction for sample selection was needed. Given the economic interpretation of the estimated coefficients, their signs, and their economic as well as statistical significance, specification (5) seems most attractive in light of the existing studies on the topic. Interestingly, the traditional Heckman specification produces results that are close to those reported in column (5).

Conclusion
We have proposed an extension of the traditional Heckman sample selection model by incorporating it into a model selection framework with many controls. A double application of the lasso to each part of the model permits valid inference on the parts of the model that are of interest to empirical economists. We detail the steps involved in consistent estimation with valid standard errors, and we provide a new Stata command to implement them. We also investigate aspects of Neyman orthogonality that relate it to the concept of redundancy in GMM estimation.
Lasso and double selection in linear models have recently been subject to scrutiny in cases where lasso under-selects controls in finite samples, even under a sparsity scenario, so that double selection estimators suffer severe omitted variable biases (see, e.g., Wuthrich and Zhu 2021; Lahiri 2021). This happens when the signal from the variables lasso works on is weak and they are not selected by either of the two selection procedures. The solution proposed by Wuthrich and Zhu (2021) is to resort in such cases to regular OLS estimation with high-dimensional variance matrix computations, which is computationally demanding and works only when p < n.
We show how large the errors committed by the naive application of lasso can be and we provide an application to a classic problem in labor economics where using our method leads to a few new insights. We provide a user-friendly and versatile Stata command, which can help empirical economists use the proposed methodology. The command as well as the simulation and application data are made available on the authors' web pages.
Finally we note that the results of this paper can be extended to other consistent methods of model selection beyond lasso, such as the Dantzig Selector proposed by Candes and Tao (2007).
Funding Open Access funding enabled and organized by CAUL and its Member Institutions. Masayuki Hirukawa gratefully acknowledges financial support from the Japan Society for the Promotion of Science (grant number 19K01595).

Declarations
Conflicts of interest Masayuki Hirukawa declares that he has no conflict of interest. Di Liu declares that he has no conflict of interest. Irina Murtazashvili declares that she has no conflict of interest. Artem Prokhorov declares that he has no conflict of interest.
Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A Neyman orthogonality of the "partialling-out" approach

In this appendix we show that: 1. the "partialling-out" moment conditions are Neyman orthogonal; 2. the resulting estimator is root-N consistent and asymptotically normal (and the "partialling-out" approach still provides valid inference for θ even without cross-fitting); 3. the "partialling-out" approach is equivalent to the double selection lasso.
We start with a general definition of Neyman orthogonal moments (see, e.g., Chernozhukov et al. 2018, Definition 2.1). Suppose we wish to estimate a low-dimensional parameter θ 0 using the moment conditions E[ψ(W; θ 0 , η 0 )] = 0, where W contains the random variables and η 0 is a nuisance parameter vector, which can be high-dimensional. Suppose we use machine learning techniques to estimate η 0 . In essence, Neyman orthogonality means that small mistakes in the estimation of η will not disturb the consistent estimation of θ 0 . Formally, Neyman orthogonality relies on the concept of the pathwise (Gateaux) derivative. Let D r denote the Gateaux derivative of the moment condition ψ with respect to η along the path η 0 + r(η − η 0 ), that is, D r [η − η 0 ] = ∂ r E[ψ(W; θ 0 , η 0 + r(η − η 0 ))], for all r ∈ [0, 1). When D r is evaluated at r = 0, we denote it as D 0 [η − η 0 ].

Definition 1 The moment condition E[ψ(W; θ 0 , η 0 )] = 0 is Neyman orthogonal if D 0 [η − η 0 ] = 0 for all admissible values of the nuisance parameter η.
Now we show Neyman orthogonality of the "partialling-out" approach. For notational simplicity, we group the equations for D j ( j = 1, . . . , k).
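To fix ideas, the following sketch (in our notation and for a scalar D for simplicity, following the standard calculation in Chernozhukov et al. 2018) spells out the orthogonality of the partialling-out moment function, with nuisance η = (l, m), l 0 (X) = E[Y | X] and m 0 (X) = E[D | X]:

```latex
% Partialling-out moment function with nuisance \eta = (l, m):
\psi(W;\theta,\eta) \;=\; \bigl(Y - l(X) - \theta\{D - m(X)\}\bigr)\,\bigl(D - m(X)\bigr).
% Differentiating E[\psi(W;\theta_0,\eta_0 + r(\eta-\eta_0))] at r = 0,
% the l-direction contributes
-\,E\bigl[\{l(X)-l_0(X)\}\{D - m_0(X)\}\bigr] \;=\; 0,
% because E[D - m_0(X)\mid X] = 0, while the m-direction contributes
E\bigl[\theta_0\{m(X)-m_0(X)\}\{D - m_0(X)\}\bigr]
  \;-\; E\bigl[U\{m(X)-m_0(X)\}\bigr] \;=\; 0,
% because both expectations vanish after conditioning on X.
% Hence D_0[\eta-\eta_0] = 0: the moment condition is Neyman orthogonal.
```

Small estimation errors in l and m therefore enter the moment condition only to second order.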
Next, we provide the asymptotic distribution. Let θ̂ be a solution to (1/n) Σ i ψ(W i ; θ, η̂) = 0, given η̂. By Theorem 3.1 of Chernozhukov et al. (2018), θ̂ is a consistent estimator of θ 0 , and it is asymptotically normal at the √N rate. Its variance is

Σ = J 0 ^{-1} E[ψψ'] (J 0 ^{-1})', where J 0 = E[∂ψ/∂θ | θ = θ 0 ].

We can simplify this expression to arrive at the following formula:

Σ = (E[V V'])^{-1} E[U^2 V V'] (E[V V'])^{-1}. (23)

Equation (23) leads to the following straightforward estimator of Σ:

Σ̂ = (Σ i V i V i ')^{-1} (Σ i U i ^2 V i V i ') (Σ i V i V i ')^{-1},

where V i and U i are the residuals from Eqs. (18) and (19), respectively. Evidently, the variance estimator is the classic heteroskedasticity-consistent estimator. The last step is to show that double selection lasso is equivalent to the "partialling-out" approach. We start with the equation y = D'θ 0 + g 0 (X) + U.
If we partial out X from both y and D, the equation becomes

y − E[y | X] = (D − E[D | X])'θ 0 + U. (25)

It is easy to see that the solution for θ in Eq. (25) comes from the Neyman orthogonal moment condition using the moment function in Eq. (21). Therefore, double selection is equivalent to the Robinson-style "partialling-out" approach.
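The equivalence is also easy to see numerically. The following minimal Python sketch (our own simulated design; not the dsheckman implementation) checks that the coefficient on D from the full least-squares regression equals the coefficient from regressing the partialled-out y on the partialled-out D, and then forms the heteroskedasticity-consistent standard error from the residuals as described above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=(n, 3))                                # controls
D = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(size=n)    # regressor of interest
y = 2.0 * D + X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

def ols(A, b):
    """Least-squares coefficients of b on the columns of A."""
    return np.linalg.lstsq(A, b, rcond=None)[0]

# full regression of y on (D, X): coefficient on D
theta_full = ols(np.column_stack([D, X]), y)[0]

# partialling-out: residualize y and D on X, then regress U on V
U = y - X @ ols(X, y)
V = D - X @ ols(X, D)
theta_po = (V @ U) / (V @ V)

# Frisch-Waugh-Lovell: the two estimates coincide up to floating point
print(abs(theta_full - theta_po))

# heteroskedasticity-consistent standard error from the residuals
eps = U - theta_po * V
se = np.sqrt(np.sum(V**2 * eps**2)) / np.sum(V**2)
print(theta_po, se)
```

The printed gap between the two estimates is at machine-precision level, illustrating that double selection and partialling-out are two routes to the same estimator.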

B Sketch of Proofs of Propositions 2 and 3
Proof of Proposition 2. The proof is similar to that of Theorem 1 of Belloni et al. (2014b) and essentially consists of verifying that the assumptions of that theorem are satisfied under our Assumptions 1-3.
Proof of Proposition 3. The construction of a variance estimator outlined in the paragraphs immediately preceding the proposition is similar to the standard correction for the first-step estimation error in the Heckman (1979) procedure (see, e.g., Cameron and Trivedi 2005, Section 16.10.2). An important difference is that we now include the effect of the controls selected in the last step of the double selection lasso estimation when computing the residuals.

The dsheckman command takes the following arguments:

• depvar specifies the dependent variable in the main equation, which corresponds to y 1 in Eq. (1).
• indepvars specifies the independent variables in the main equation, which correspond to x 1 in Eq. (1).
• depvar_s specifies the dependent variable in the selection equation, which corresponds to y 2 in Eq. (2).
• indepvars_s specifies the independent variables in the selection equation, which correspond to x and z in Eq. (2).
• selvars() specifies x in Eq. (2). If this option is not used, x is constructed in two steps: (a) run the lasso probit of depvar_s on indepvars_s using the plugin penalty; denote the selected variables by x 2 ; (b) construct x as the union of x 1 and x 2 .
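The two-step construction of x can be sketched as follows in Python, using scikit-learn's L1-penalized logit as a stand-in for Stata's plugin-penalty lasso probit (the penalty level C, the variable names and the simulated data are illustrative assumptions, not the command's internals):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 500, 20
Z = rng.normal(size=(n, p))
names = [f"z{j}" for j in range(p)]

# selection indicator driven by the first two candidate controls only
y2 = (Z[:, 0] + 0.8 * Z[:, 1] + rng.normal(size=n) > 0).astype(int)

# step (a): L1-penalized ("lasso") logit of depvar_s on indepvars_s;
# C is a hand-picked penalty here, standing in for the plugin choice
sel = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(Z, y2)
x2 = {names[j] for j in np.flatnonzero(sel.coef_[0])}

# step (b): x is the union of the main-equation regressors x1 and x2
x1 = {"educ", "exper"}
x = x1 | x2
print(sorted(x))
```

With this design the two truly relevant controls are selected in step (a), so x contains the main-equation regressors plus the lasso-selected selection determinants.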