Abstract
Consider observation of a phenomenon of interest subject to selective sampling due to a censoring mechanism regulated by some other variable. In this context, an extensive literature exists linked to the so-called Heckman selection model. A great deal of this work has been developed under the assumption that the underlying probability distributions are Gaussian; considerably less work has dealt with other distributions. We examine a general construction which encompasses a variety of distributions and allows various options for the selection mechanism, focusing especially on the case of a discrete response. Inferential methods based on the pertaining likelihood function are developed.
References
Ahn H, Powell JL (1993) Semiparametric estimation of censored selection models with a nonparametric selection mechanism. J Econom 58:3–29
Azzalini A, Capitanio A (2014) The skew-normal and related families. In: IMS monographs series. Cambridge University Press, Cambridge
Copas JB, Li HG (1997) Inference for non-random samples (with discussion). J R Stat Soc Ser B 59:55–95
Greene W (1998) Sample selection in credit-scoring models. Jpn World Econ 10:299–316
Greene WH (2012) Econometric analysis, 7th edn. Pearson Education Ltd, Harlow
Heckman JJ (1976) The common structure of statistical models of truncation, sample selection and limited dependent variables, and a simple estimator for such models. Ann Econ Soc Meas 5:475–492
Heckman JJ (1979) Sample selection bias as a specification error. Econometrica 47:153–161
Marchenko YV, Genton MG (2012) A Heckman selection-\(t\) model. J Am Stat Assoc 107:304–317
Marra G, Radice R (2017) GJRM: generalised joint regression models with binary/continuous/discrete/survival margins. R package version 0.1-4
Marra G, Wyszynski K (2016) Semi-parametric copula sample selection models for count responses. Comput Stat Data Anal 104:110–129
McCullagh P, Nelder JA (1989) Generalized linear models, 2nd edn. Chapman & Hall/CRC, London
Newey WK (2009) Two-step estimation of sample selection models. Econom J 12:S217–S229
Prieger JE (2002) A flexible parametric selection model for non-normal data with application to health care usage. J Appl Econom 17:367–392
Riphahn RR, Wambach A, Million A (2003) Incentive effects in the demand for health care: a bivariate panel count data estimation. J Appl Econom 18:387–405
Terza JV (1998) Estimating count data models with endogenous switching: sample selection and endogenous treatment effects. J Econom 84:129–154
Van de Ven WPMM, Van Praag BMS (1981) The demand for deductibles in private health insurance: a probit model with sample selection. J Econom 17(2):229–252 (Corrigendum in 22(3):395, 1983)
Wyszynski K, Marra G (2017) Sample selection models for count data in R. Comput Stat. https://doi.org/10.1007/s00180-017-0762-y
Wooldridge J (2010) Econometric analysis of cross section and panel data, 2nd edn. The MIT Press, Cambridge
Zhelonkin M, Genton MG, Ronchetti E (2016) Robust inference in sample selection models. J R Stat Soc Ser B 78:805–827
Acknowledgements
We are grateful to two reviewers for insightful comments leading to appreciable improvement in presentation with respect to an earlier version of the paper. Hyoung-Moon Kim’s research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2015R1D1A1A01059161). Hea-Jung Kim’s research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2015R1D1A1A01057106).
Appendix: Score function and Hessian matrix
In cases of interest in applications, the density function \(f\) is a member of the exponential family which enters the formulation of generalized linear models; hence we focus on this situation. Following essentially the notation of McCullagh and Nelder (1989), we write the baseline density (or probability function, in the discrete case) as
\[ f(y; \vartheta , \psi ) = \exp \left\{ \frac{y\,\vartheta - b(\vartheta )}{a(\psi )} + d(y, \psi )\right\} \tag{27} \]
where \(a(\cdot ), b(\cdot )\) and \(d(\cdot )\) are known functions. In some cases, the dispersion parameter \(\psi \) is known; important instances of this type are the Poisson and the binomial distributions.
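As an illustration of the two known-dispersion cases just mentioned, the following minimal sketch assumes the standard McCullagh and Nelder form \(\exp \{[y\vartheta - b(\vartheta )]/a(\psi ) + d(y,\psi )\}\) with \(a(\psi )=1\), and checks that it reproduces the usual Poisson and binomial probability functions. All function names are hypothetical and chosen only for this example.

```python
import math

def expfam_pmf(y, theta, a, b, d):
    """Evaluate exp{(y*theta - b(theta))/a + d(y)} for a known dispersion a."""
    return math.exp((y * theta - b(theta)) / a + d(y))

def poisson_pmf_expfam(y, mu):
    # Poisson: theta = log(mu), b(theta) = exp(theta), a = 1, d(y) = -log(y!)
    return expfam_pmf(y, math.log(mu), 1.0, math.exp,
                      lambda k: -math.lgamma(k + 1))

def binomial_pmf_expfam(y, m, p):
    # Binomial count with m trials: theta = logit(p),
    # b(theta) = m*log(1 + exp(theta)), a = 1, d(y) = log C(m, y)
    theta = math.log(p / (1 - p))
    b = lambda t: m * math.log1p(math.exp(t))
    d = lambda k: math.log(math.comb(m, k))
    return expfam_pmf(y, theta, 1.0, b, d)

def poisson_pmf_direct(y, mu):
    # Textbook Poisson probability function, for comparison
    return math.exp(-mu) * mu**y / math.factorial(y)

def binomial_pmf_direct(y, m, p):
    # Textbook binomial probability function, for comparison
    return math.comb(m, y) * p**y * (1 - p)**(m - y)

# Largest discrepancy between the two representations over a few values
poisson_gap = max(abs(poisson_pmf_expfam(y, 2.5) - poisson_pmf_direct(y, 2.5))
                  for y in range(10))
binomial_gap = max(abs(binomial_pmf_expfam(y, 8, 0.3) - binomial_pmf_direct(y, 8, 0.3))
                   for y in range(9))
```

Both gaps should be at the level of floating-point rounding, which confirms that in these two families only \(\vartheta \) (and hence \(\beta \)) needs to be estimated in the baseline density.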
On inserting expression (27) in (14), the log-likelihood function becomes
whose derivatives with respect to the parameters \(\beta , \gamma , \psi \) are as follows:
where \(V_i=a_i(\psi )\,b''(\vartheta _i)=\mathrm{var}\{Y_i\}\), \(\mathbb {E}\{Y_i\}=\mu _i=b'(\vartheta _i)\), \(g_0=G_0'\), and \(g(\cdot )\) is the link function, defined by \(g(\mu _i)=x_i^{\top }\beta \).
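The identities \(\mu _i=b'(\vartheta _i)\) and \(V_i=a_i(\psi )b''(\vartheta _i)\) can be checked numerically. The sketch below does so for a Poisson baseline with the canonical log link, under the assumptions \(b(\vartheta )=e^{\vartheta }\) and \(a_i(\psi )=1\); the covariate and coefficient values are hypothetical, and the derivatives of \(b\) are approximated by finite differences rather than taken from the paper's expressions.

```python
import math

def derivs(b, theta, h=1e-4):
    """Central finite-difference approximations to b'(theta) and b''(theta)."""
    b1 = (b(theta + h) - b(theta - h)) / (2 * h)
    b2 = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2
    return b1, b2

def poisson_moments(mu, kmax=200):
    """Mean and variance of Poisson(mu) by direct summation of the pmf."""
    pmf = [math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))
           for k in range(kmax)]
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    return mean, var

x_i = [1.0, 0.5]     # hypothetical covariate vector (intercept and one regressor)
beta = [0.2, 1.0]    # hypothetical regression coefficients
eta = sum(x * c for x, c in zip(x_i, beta))   # linear predictor x_i' beta

# Canonical log link: g(mu) = log(mu), so theta_i coincides with eta
theta_i = eta
mu_i, b2 = derivs(math.exp, theta_i)   # mu_i approximates b'(theta_i)
V_i = 1.0 * b2                         # a_i(psi) = 1 for the Poisson case

# Moments computed directly from the probability function, for comparison
mean, var = poisson_moments(math.exp(eta))
```

Here `mu_i` and `V_i` should agree with `mean` and `var` up to finite-difference error, and `math.log(mu_i)` should recover `eta`, which is the link relation \(g(\mu _i)=x_i^{\top }\beta \).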
The second order derivatives of (28) are given by the following expressions:
Cite this article
Azzalini, A., Kim, HM. & Kim, HJ. Sample selection models for discrete and other non-Gaussian response variables. Stat Methods Appl 28, 27–56 (2019). https://doi.org/10.1007/s10260-018-0427-1
DOI: https://doi.org/10.1007/s10260-018-0427-1