Regularization and variable selection in Heckman selection model

Sample selection arises when the outcome of interest is partially observed in a study. A common challenge is the requirement for exclusion restrictions. That is, some of the covariates affecting missingness mechanism do not affect the outcome. The drive to establish this requirement often leads to the inclusion of irrelevant variables in the model. A suboptimal solution is the use of classical variable selection criteria such as AIC and BIC, and traditional variable selection procedures such as stepwise selection. These methods are unstable when there is limited expert knowledge about the variables to include in the model. To address this, we propose the use of adaptive Lasso for variable selection and parameter estimation in both the selection and outcome submodels simultaneously in the absence of exclusion restrictions. By using the maximum likelihood estimator of the sample selection model, we constructed a loss function similar to the least squares regression problem up to a constant, and minimized its penalized version using an efficient algorithm. We show that the estimator, with proper choice of regularization parameter, is consistent and possesses the oracle properties. The method is compared to Lasso and adaptively weighted L1\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$L_{1}$$\end{document} penalized Two-step method. We applied the methods to the well-known Ambulatory Expenditure Data.


Introduction
Sample selection arises when the outcome of interest is non-randomly missing for a subset of the sample, resulting in a sample that is not representative of the population under study. This problem is ubiquitous in empirical economics, social sciences B Emmanuel O. Ogundimu emmanuel.ogundimu@durham.ac.uk 1 Department of Mathematical Sciences, Durham University, Durham, UK and medical research. Consider, as an example, the ambulatory expenditures data ( Cameron and Trivedi (2010)), where the amount of money spent on the medical services is expected to be linked with the decision to spend. The analysis of the data could proceed in two ways: analysis of only the positive expenditures without taking into account the zero expenditures or analysis of both the positive and zero expenditures. These methods will lead to biased estimates and efficiency loss. An optimal solution based on the use of sample selection model incorporates the decision to spend in order to model the amount of expenditures. Heckman (1976) introduced a model for sample selection and several extensions in the parametric framework have been proposed ( Marchenko and Genton (2012), Ogundimu and Hutton (2016), Lee (1983)). The estimation method is often based on the full information maximum likelihood (FIML) due to the maximization of the joint likelihood function of both the outcome and selection submodels simultaneously. An alternative specification of the model is the two-step procedure, where the problem is treated as a model misspecification problem due to omitted covariates ( Heckman (1979). A key drawback in the use of the Two-step estimator (and to a lesser extent FIML estimator) is its susceptibility to collinearity in the absence of an exclusion restriction ( Leung and Yu (2000)). An exclusion restriction implies that there are variables in the selection submodel that are absent in the outcome equation. This is to avoid multicollinearity as a consequence of the linearity of the inverse Mills ratio over a wide range of its support. In the absence of an exclusion restriction, model identifiability relies on the non-linearity of the inverse Mills ratio. In general, both estimators have the same set of assumptions, and two-step estimator is indeed less sensitive to the assumptions in a very specific case of measurement error (see Stapleton and Young (1984) and Leung and Yu (2000)). Sartori (2003) noted that when theory points to identical covariates in both components of the model, a common practice by applied researchers is a "mad" search for an exclusion restriction. This practice is dangerous because including extraneous variables may lead to specification error. Further, the impact of extraneous variables as exclusion restrictions was demonstrated in Ogundimu and Collins (2019). It was shown that the use of these variables in the selection submodel can constitute a nuisance in the model estimation instead of alleviating the problem of collinearity. In particular, if the extraneous variable(s) for exclusion restriction does not affect selection in the population but it is correlated with a true covariate and with selection in the sample, including it in the equation can bias the estimates of the effect of the covariates. Also, the use of an exclusion restriction that is not based on clear theoretical knowledge has been shown to produce results that vary both in effect size and precision ( Genbäck et al. (2015)). Genbäck et al. (2015) used set identification to compute bounds on regression parameters in sample selection model. Although the method does not impose any exclusion restriction and distributional assumption, correct coverage of the bounds relies on specifying an interval for the correlation parameter that contains the true value. We took a different approach from Genbäck et al. (2015) by keeping parametric assumption and develop penalized regression that can shrink variables that are not true exclusion restriction to zero while simultaneously selecting variables that are associated with the outcome and selection submodels.
Penalized regression with convex and non-convex penalties such as least absolute shrinkage and selection operator (Lasso -Tibshirani (1996)), elastic net ( Zou and Hastie (2005)), adaptive Lasso ( Zou (2006)), smoothly clipped absolute deviation (SCAD -Fan and Li (2001)) and minimax concave penalty (MCP -Zhang (2010)) have been widely applied for variable selection in generalized linear models and survival models. To the best of our knowledge, limited work has been done in penalized variable selection in sample selection settings. An example that is so close and yet so far to what we pursue here is, perhaps, the paper of Caner and Fan (2010), where adaptive Lasso was used in two-stage least squares regression for removing weak instrumental variables. Another important contribution is the use of the so-called outcome-adaptive Lasso for selecting covariates for inclusion in propensity score models to account for confounding bias ( Shortreed and Ertefaie (2017)). While the propensity score models and the sample selection models are correction methods for selection bias, it has been opined that the latter should be preferred when the error terms in the substantive outcome submodel and the selection process are correlated ( Antonakis et al. (2010)). This is the motivation for the new approach that we propose in the present article. Theoretically, the direct application of the penalty terms to the likelihood function of sample selection model is possible but it is computationally challenging due to the two components of the model and the possibility of different covariates in the components.
To circumvent this, we propose adaptive Lasso (ALasso) penalized sample selection model for the selection of variables and efficient estimation of parameters for both the outcome and selection submodels of the sample selection model. The proposed method is based on the use of second-order Taylor series expansion to approximate the likelihood function with respect to the maximum likelihood estimator (MLE). The resulting least squares approximation is then solved subject to adaptively weighted L 1 penalty using the coordinate descent algorithm. The ultimate goal, from application perspective, is to identify and consistently estimate parameters in the components. We also extend the method to Lasso and adaptively weighted L 1 penalized Twostep method. These methods are compared with the standard significance testing at α = 0.05 (P-value). That is, variables with a P-value higher than 5% is deemed non-significant and removed from the model. The rest of the paper is organized as follows. Section 2 introduces the Heckman selection model and potential specification issues that are germane to the proposed approach. A penalized version of the model is presented in Sect. 3, while the computational routine and asymptotic properties of the estimator are evaluated in Sect. 4. In Sect. 5, the finite sample properties of the proposed model are studied and applied to the Ambulatory Expenditure data set. A discussion is provided in Sect. 6. Technical details are given in the "Appendix".

Sample selection model
The model consists of two equations: an outcome equation and a selection equation (1) respectively, where β = (β 0 , β 1 , . . . , β p ) and γ = (γ 0 , γ 1 , . . . , γ q ) are unknown parameters with corresponding covariates x i = (1, x i1 , . . . , x i p ) and w i = (1, w i1 , . . . , w iq ), σ is the variance, and (ε 1i , ε 2i ) are random errors with means zero and correlation ρ. It is possible for x i and w i to overlap. We observe the indicator S i = I (S i > 0) such that the outcome Y i = Y i S i is the observed part of the selected sample. Notice that the variance of S i is set to 1 because we only observed its sign and it is not identifiable in (2). The conditional density of the observed data, is the basis of the unification of sample selection problems as skew distributions given by Arellano-Valle et al. (2006). Under the additional assumption of bivariate normal errors, it is straightforward to show that the PDF in (3) is where θ = (β, γ , σ, ρ). The parameter ρ ∈ (-1,1) determines the correlation between Y i and S i , and hence the nature and severity of the selection process. The complete density of the sample selection model has a continuous component, with conditional density given by (4), and a discrete component. The discrete component is often modeled with probit regression as P(S = 1|w) = {Φ(γ T w)} s {1 − Φ(γ T w)} 1−s . The log-likelihood function for n observations is therefore If the assumed model is correct and the Gaussian assumption holds, then the MLE based on equation (5) is √ n−consistent for θ and asymptotically normal under general conditions ( Nicoletti and Peracchi (2001)). These properties are essential for the validity of the proposed method in Sect. 3.
The Two-step estimator is derived from the conditional expectation of the observed data and is given by where Λ(·) = φ(·)/Φ(·) is the inverse Mills ratio, φ(·) and Φ(·) are standard normal density and cumulative distribution function respectively. To obtain the estimates of ρ and σ from (6), the average of the conditional variance, is equated to the observed residual variance in the second-stage least square regression. The penalized regression in Sect. 3 also depends on the correct specification of the model in (5). A possible indication of model misspecification is the non-convergence of the model as a result of the violation of the requirements that |ρσ | ≤ σ and σ ≥ 0 ( Copas and Li (1997)). That is, the estimates of ρ is close to ±1. Even if reparametrization of ρ allows for model convergence, equation (5) is not, in general, globally concave and the model can converge to a local maximum. Further, when the non-intercept variables in w have coefficients equal to zero, Λ(γ T w) is a constant resulting in the failure of the regression model.

Penalized sample selection model
The general form of a penalized estimator for sample selection model is given by where p λ 1 (·) and p λ 2 (·) are penalty functions with tuning parameters λ 1 and λ 2 . In this case, the regression coefficients for the two equations have different penalties and p λ 1 (β 0 ) = p λ 2 (γ 0 ) = 0 to ensure the intercepts of the equations are unpenalized. This article considers the case where p λ 1 (·) = p λ 2 (·) = p λ (·). We can re-write (8) as where ( p+q) = s is the combined dimension of β and γ , and σ and ρ are unpenalized. The Lasso penalized regression proposed by Tibshirani (1996) is based on The R.H.S. of (9) is the L 1 -penalty, which shrinks small coefficients to zero to obtain sparse representation of the solution. Here, λ ≥ 0 is a tuning parameter controlling the amount of shrinkage. Since the Lasso penalizes all the regression coefficients equally, it over-penalizes the important variables thereby resulting in biased estimators. The lack of the oracle property ( Fan and Li (2001)) of Lasso prompted the development of the adaptive Lasso ( Zou (2006)) with this property. The oracle property implies the method is consistent in variable selection, unbiased and asymptotically normal. The adaptive Lasso estimator is a generalization of the Lasso penalty. Unlike Lasso, the adaptive Lasso penalty function penalizes the coefficients of different covariates at a different rate by using adaptive weights. This is accomplished by the introduction of individual weights, τ d in the penalty as where τ = (τ 1 , τ 2 , . . . , τ s ) are chosen in a data dependent way. The adaptive Lasso estimator for the sample selection model is the solution of Often, the weights, τ d , is set to 1/|θ d | δ for some appropriately chosen δ > 0. For simplicity we set δ = 1 andθ as the MLE.
In addition to the proposed method, we adapted the adaptive Lasso approach to the Two-step estimator in (6). For this, a standard probit model is penalized as Letγ = (γ 1 ,γ 2 ), whereγ 1 corresponds to the h nonzero components (h ≤ q) ofγ andγ 2 are its zero elements. Next, the quantity, Λ(γ T 1 w h ) is formed and taken as an additional covariate in the second stage least squares regression with adaptive Lasso penalty. The coefficients of Λ(γ T 1 w h ) is left unpenalized in the second stage regression. The covariance matrix generated in the second stage regression is inconsistent, but will not be addressed in this paper.
An interesting question is whether the model in equation (10) (or its approximation that we proposed next) is without loss of generality. We justified the adequacy of the model by analysing directly equation (8) using a smooth approximation to the where the LHS of the equation is the required approximation. We choose = 10 −6 . This allows for the derivation of the analytical score and Hessian. The optimization routine is based on the Newton-Raphson method with the tuning parameter selected using BIC . The minimum λ is taken over two-dimensional grid search. Initial empirical evaluation of the performance of this method with our proposal shows that the former does not offer significant improvement over the latter. In particular, the computational time grows as the number of variables increases. Consequently, we did not investigate this approach any further.

The optimization routine
We approximate the penalized function in (10) by a second-order Taylor expansion with respect toθ , the unpenalized MLE in (5). Thus, −l(θ ) can be approximated by the quadratic form 2 is the pseudo response vector, ∇ 2 l(θ ) = X T X such that X is derived as the Cholesky decomposition of ∇ 2 l(θ ). The notations ∇l(θ ) = −∂l(θ )/∂θ is the gradient vector and ∇ 2 l(θ ) = −∂ 2 l(θ )/∂θ∂θ T is the Hessian matrix. Thus, the minimization problem in (10) becomes which is in the form of least squares problem subject to adaptive Lasso penalization. This can be solved using efficient minimization algorithms such as least angle regression (lars -Efron et al. (2004)) and the coordinate descent algorithm ( Friedman et al. (2010)). We adopt the latter, and update the parameters in the model aŝ Recall that four parameters in θ (σ, ρ and the intercepts of outcome and selection models) are unpenalized. Although we estimated all the parameters in the model, these four parameters are not set to zero in S(z, δ). The optimal tuning parameter, λ can be estimated by using AIC, BIC and GCV (generalized cross-validation) criteria. The BIC criterion is known to identify the true model with probability tending to one. We therefore use BIC to select optimal λ: where 0 ≤ d f λ ≤ s is the degree of freedom corresponding to the number of nonzero coefficients ofθ . The optimal value of λ is computed over a grid of candidate values of λ between λ = 0 and λ = λ max , with step size of 0.1, where λ max is the value of λ for which the entire vector ofθ is zero. We allowed optimal λ = 0 for the unregularized solution since degenerate cases are uncommon in FIML estimator.

Variance estimation
In low dimensional setting such as the case here, it is possible to first select covariates by using Lasso and adaptive Lasso, and thereafter obtain parameter estimates and their standard errors with maximum likelihood method. The disadvantage of this is the loss of the benefit of the estimation of parameters and selection of variables simultaneously, which is inherent in estimators with oracle properties.
The following regularity conditions are necessary to establish the asymptotic properties of the adaptive Lasso estimator for the model. C1. Identifiability: Suppose that y i , i = 1, . . . , n are i.i.d with a mixed probability mass and density function f (y i ; θ 0 ). The parameters in the model are identifiable Since the information matrix is nonsingular at the true parameter vector, the local identifiability of the model parameters can be inferred (see Appendix A.1 in the supplementary material for the derivation of the expected information matrix). Inherent in the identification condition is the assumption that there is no multicollinearity among the variables in x i and w i and that there is no perfect correlation between x i and Λ, the inverse Mills ratio. In particular, conditions 1 -3 ensure thatθ is consistent while conditions 1 -4 is required for its asymptotic normality.
The main asymptotic results are obtained in a similar version to Fan and Li (2001) by using the regularity conditions:

Theorem 2 (Oracle properties). If
√ nλ n → 0 and nλ n → ∞, as n → ∞ thenθ has the following properties: (i) Sparsity:θ 2 = 0; (ii) Asymptotic normality: √ n(θ 1 − θ 10 ) d − → N (0, I −1 1 (θ 10 )). Note that the consistency and sparsity ofθ in Theorems 1 and 2(i) imply that the adaptive lasso estimator is consistent in variable selection. Theorem 2(ii), on the other hand, implies that the adaptive lasso estimator for the nonzero coefficients is efficient as if the irrelevant covariates are known. The proof of both theorems are given in Appendix A.2.

Numerical studies
In this section we use simulation and a real data to compare the finite sample performance of the adaptive Lasso, Lasso and the Two-step estimators for sample selection models.

Simulation study
The data is generated under two scenarios -strong and weak signals.

Scenario 1: Large effect
The data for the outcome submodel was generated from Y i = β T x i + σ ε 1i , where β T = (0.5, 1, 1, 1.5, 0, 0, 0, 0), β 0 = 0.5. For the setting with exclusion restriction, we generated data for the selection process as S i = γ T w i + ε 2i , where γ T = (1.5, 1, 1, 1, 0, 0, 0, 0, 1), γ 0 = 1.5. We exclude the last element of γ for the case with no exclusion restriction (that is, x = w). The covariates x i and w i are independent of the error terms ε T i = (ε 1i , ε 2i ). The error terms are generated from a bivariate normal distribution with Σ = σ 2 ρσ ρσ 1 , where σ = 1 and the correlation ρ = {0, 0.3, 0.5, 0.7}. The covariates x 1 , . . . , x 8 are generated such that their distribution are marginally standard normal with pairwise correlations corr(x j , x k ) = | j−k| . We take = 0.5 to allow for moderate correlation between the covariates. We only observe the values of Y i when S i > 0. This leads to approximately 30% unobserved cases in the realized sample. Two sample sizes, n = 500 and 1000 are evaluated using 1000 replications. We also evaluated the impact of error distribution misspecification, where the error terms are generated from a bivariate Student's t distribution with degree of freedom ν = 5. For this, we consider the setting under the absence of the exclusion restriction variable and sample size of 1000.

Scenario 3: Shrinkage of false exclusion restriction variable
We investigate the performance of adaptive lasso in detecting non valid exclusion restriction. The data was generated as in scenario 1 with a true exclusion restriction. We consider the case where two variables (X 9 , X 10 ) are used as exclusion restriction whereas these variables are not associated with the outcome model and not predictive of missingness (no_cor_outcome). The second data is generated such that the two variables are associated with the outcome but not predictive of missingness (cor_outcome).
The accuracy of the estimators are evaluated using mean squared errors, where V is the population covariance matrix and the summary of the median over 1000 replications is obtained. We also assess the performance of the methods using sensitivity (mean of proportion of nonzero coefficients that were correctly identified) and specificity (mean of proportion of zero coefficients that were correctly identified) as well as the comparison of the model based and the empirical standard errors.
For computational convenience, all the model parameters are estimated with the support (−∞, ∞). We used estimation metric atanhρ = ln{(1 + ρ)/(1 − ρ)}/2 for ρ, where atanhρ is the inverse hyperbolic tangent of ρ, and logarithm of σ . P-value is based on significance testing at 5% level of significance. The "oracle" estimate is based on the model fitted using the variables that have non-zero coefficients.
Tables 1 and 2 summarize the sensitivity, specificity and the mean square error (MSE) for the models with sample sizes of 500 and 1000 in the presence of the Table 1 Simulation results for exclusion restriction-sample size =500 exclusion restriction variable. The three estimators correctly identified all the nonzero coefficients (sensitivity = 1) in both submodels. The advantage of ALasso and Twostep methods are more pronounced for specificity, in which case at least 95% of zero coefficients in both submodels are correctly identified (specificity ranges between 95% and 98%). The larger the sample size, the better the performance of the methods in the consistency of variable selection across the two submodels. Lasso has higher specificity under the selection submodel than the outcome submodel, and its performance is generally better with larger sample size. There is no clear distinction in the effect of correlation (ρ) on sensitivity and specificity. Overall, ALasso and Two-step estimators perform better than the Lasso in terms of MSE. The impact of ρ slightly manifested here: the accuracy of the estimated nonzero coefficients increases as ρ increases for both Lasso and ALasso. In particular, MSE improves as sample size increases. The results also show that ALasso is consistently superior to significance testing at α = 5% level (P-value).

Method
Tables 3 and 4 summarize the simulation results for the models with sample sizes of 500 and 1000 in the absence of the exclusion restriction variable. The patterns are similar to the corresponding results in Tables 1 and 2. Interestingly, the performance of the measures without the exclusion restriction variable is consistently better than the corresponding results under the exclusion restriction. This result has practical implications -if the model relies on identification through the inverse Mills ratio, then ALasso and Two-step estimators would identify the true variables in the model without exclusion restrictions with probability tending to 1.
The results of fitting the models to a data generated from a Student's t error distribution in the absence of exclusion restrictions can be found in Appendix A.3 of the supplementary material. The MSE of the parameter estimates in the selection submodel are lower than the corresponding results under the data generated from the normal error model (Table 4). This, however, does not affect the specificity of the estimators. In particular, the specificity of the ALasso and the Two-step methods are higher than the corresponding results in Table 4. Further, the specificity under the outcome model for ALasso is slightly lower than the corresponding results in Table  4. Overall, ALasso is more robust to misspecification of the error distribution than the Two-step estimator.
The impact of covariates with weak effect on the proposed methods can be found in Appendix A.3. As expected, Lasso performs better than the other methods in terms of sensitivity in both the selection and outcome equations. This result is not surprising as Lasso is generally known to include true covariates, but also some irrelevant covariates ( Meinshausen and Bühlmann (2006)). A striking result is the poor performance of the two-step estimator in terms of sensitivity but superior performance with regards to specificity in the outcome equation. A possible reason for this is the collinearity between the inverse Mills ratio and covariates with weak effects in the second stage regression as these covariates can be easily pushed to zero. Table 5 shows the results of the number of times the variables that are used for exclusion restrictions are selected. The performance of ALasso estimator is better when the false exclusion restriction variables (X 9 , X 10 ) are not associated with the outcome than when the variables are associated with it. Overall, the true exclusion restriction variable (X 8 ) is not shrunk to zero.  ). Income is included for the exclusion restriction criteria although its use for this purpose is debatable (see Cameron and Trivedi 2010;Marchenko and Genton 2012). Table 6 shows the results of the application of the proposed methods to the data. We obtain the same results for the selection normal model and the Lasso estimator. This is in consonance with the simulation result where Lasso tends to retain more irrelevant variables in the model than adaptive Lasso. Both adaptive Lasso and Twostep estimators set two variables from the outcome model (education and insurance status) to zero. The two variables are also reported as non-significant at 5% significance level under the classical selection normal model. As noted earlier, the regularization methods eliminate the need to retain or remove covariates based on arbitrary thresholds.
An important hypothesis of no sample selection is H 0 : ρ = 0 (equivalently H 0 : atanhρ = 0 and σρ for the adaptive Lasso and Two-step methods respectively). The reported Wald test for the hypothesis under the classical selection model gave a P-value of 0.380. This hypothesis will not be rejected at 5% significance level. The implication of this is that the amount of money spent is unrelated with the decision to spend, and the outcome can be analyzed separately. As noted by Marchenko and Genton (2012), and previous authors who analyzed the data, this conclusion is not plausible.
The inference for sample selection bias under adaptive Lasso suggests the existence of selection bias at 5% significance level ( p = 0.039). Although the proposed adaptive Lasso estimator was developed under the selection normal model, the result is in agreement with the analysis of the same data by using the selection-t model proposed in Marchenko and Genton (2012). In particular, the coefficient estimate of atanhρ under the selection normal model is -0.131 whereas adaptive Lasso estimate is -0.323. This estimate is close to the estimate of atanhρ (atanhρ = −0.322) that was reported in Marchenko and Genton (2012). Similarly, the corresponding parameter estimate IMR -σρ parameter for inverse Mills ratio. *P-value = 0.039; **P-value = 0.000 for the inverse Mills ratio (σρ) under the Two-step method resulted in p = 0.000. In addition, a naive analysis based on missing not at random (MAR) assumption (ρ = 0) for non-penalized and penalized models did not remove any variable in the outcome equation.

Concluding remarks
This paper proposed a method for variable selection and estimation of covariate effects in sample selection models. Data sets that provide information about sample selection are becoming increasingly common in many fields. If these data sets are analyzed and variable selection is carried out using conventional methods, then the conclusions can be misleading. Although many statistical procedures have been developed in the literature for the analysis of selected samples, there is no existing research on regularized variable selection methods for this model. This is, perhaps, due to the association between covariates and the outcome submodel as well as the association between covariates and the selection submodel.
Based on the results of our simulation and data analysis, we recommend the use of the ALasso estimator especially in the presence of weak covariate effects. Application of the ALasso estimator to the Ambulatory expenditure data further corroborate the usefulness of the method. A key advantage of the proposed estimator is that there is no need to specify arbitrary thresholds in order to determine the importance of covariates in the model, which is not the case with the use of P-value.
The applicability of our method depends on correct specification of the sample selection model. If the profile likelihood of ρ is very flat, then there is limited information about selectivity in the data. It is possible to extend our proposal to high dimensional data settings by avoiding the non-differentiability of Lasso and ALasso penalty functions using approximation of different norms. Also, it is possible to accomplish model and variable selection simultaneously by using various parametric extensions of sample selection models such as the selection-t model (Marchenko and Genton 2012) and the selection skew-normal model (Ogundimu and Hutton 2016), robust extensions (Zhelonkin et al. 2016) or by using alternative estimation techniques such as the EM-algorithm (Zhao et al. 2020). In particular, the oracle property of the penalized estimators is a pointwise asymptotic feature and does not necessarily hold for all the points in the parameter space ( Leeb and Pötscher (2005)). Consequently, the problem of post selection inference for non-random sample deserves further investigation.