## Abstract

Consider the observation of a phenomenon of interest subject to selective sampling due to a censoring mechanism regulated by some other variable. In this context, an extensive literature exists linked to the so-called Heckman selection model. A great deal of this work has been developed under the assumption that the underlying probability distributions are Gaussian; considerably less work has dealt with other distributions. We examine a general construction which encompasses a variety of distributions and allows various options for the selection mechanism, focusing especially on the case of a discrete response. Inferential methods based on the pertaining likelihood function are developed.

## References

Ahn H, Powell JL (1993) Semiparametric estimation of censored selection models with a nonparametric selection mechanism. J Econom 58:3–29

Azzalini A, Capitanio A (2014) The skew-normal and related families. In: IMS monographs series. Cambridge University Press, Cambridge

Copas JB, Li HG (1997) Inference for non-random samples (with discussion). J R Stat Soc Ser B 59:55–95

Greene W (1998) Sample selection in credit-scoring models. Jpn World Econ 10:299–316

Greene WH (2012) Econometric analysis, 7th edn. Pearson Education Ltd, Harlow

Heckman JJ (1976) The common structure of statistical models of truncation, sample selection and limited dependent variables, and a simple estimator for such models. Ann Econ Soc Meas 5:475–492

Heckman JJ (1979) Sample selection bias as a specification error. Econometrica 47:153–161

Marchenko YV, Genton MG (2012) A Heckman selection-\(t\) model. J Am Stat Assoc 107:304–317

Marra G, Radice R (2017) GJRM: generalised joint regression models with binary/continuous/discrete/survival margins. R package version 0.1-4

Marra G, Wyszynski K (2016) Semi-parametric copula sample selection models for count responses. Comput Stat Data Anal 104:110–129

McCullagh P, Nelder JA (1989) Generalized linear models, 2nd edn. Chapman & Hall/CRC, London

Newey WK (2009) Two-step estimation of sample selection models. Econom J 12:S217–S229

Prieger JE (2002) A flexible parametric selection model for non-normal data with application to health care usage. J Appl Econom 17:367–392

Riphahn RR, Wambach A, Million A (2003) Incentive effects in the demand for health care: a bivariate panel count data estimation. J Appl Econom 18:387–405

Terza JV (1998) Estimating count data models with endogenous switching: sample selection and endogenous treatment effects. J Econom 84:129–154

Van de Ven WPMM, Van Praag BMS (1981) The demand for deductibles in private health insurance: a probit model with sample selection. J Econom 17(2):229–252 (Corrigendum in 22(3):395, 1983)

Wyszynski K, Marra G (2017) Sample selection models for count data in R. Comput Stat. https://doi.org/10.1007/s00180-017-0762-y

Wooldridge J (2010) Econometric analysis of cross section and panel data, 2nd edn. The MIT Press, Cambridge

Zhelonkin M, Genton MG, Ronchetti E (2016) Robust inference in sample selection models. J R Stat Soc Ser B 78:805–827

## Acknowledgements

We are grateful to two reviewers for insightful comments that led to an appreciable improvement in presentation with respect to an earlier version of the paper. Hyoung-Moon Kim’s research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2015R1D1A1A01059161). Hea-Jung Kim’s research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2015R1D1A1A01057106).

## Appendix: Score function and Hessian matrix

In cases of interest in applications, the density function *f* is a member of the exponential family which enters the formulation of generalized linear models; hence we focus on this situation. Following essentially the notation of McCullagh and Nelder (1989), we write the baseline density (or probability function, in the discrete case) as

\[ f(y; \vartheta , \psi ) = \exp \left\{ \frac{y\,\vartheta - b(\vartheta )}{a(\psi )} + d(y, \psi ) \right\} \tag{27} \]

where \(a(\cdot ), b(\cdot )\) and \(d(\cdot )\) are known functions. In some cases, the dispersion parameter \(\psi \) is known; important instances of this type are the Poisson and the binomial distributions.
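As a concrete check of this notation, the following minimal sketch writes the Poisson probability function in the generic exponential-family form \(\exp \{[y\vartheta - b(\vartheta )]/a(\psi ) + d(y,\psi )\}\) of McCullagh and Nelder (1989) and compares it with the textbook expression. The choices \(\vartheta = \log \mu \), \(b(\vartheta ) = e^{\vartheta }\), \(a(\psi ) = 1\) and \(d(y) = -\log y!\) are the standard ones for the Poisson case; the function names are illustrative assumptions and do not come from the paper.

```python
import math


def expfam_logpdf(y, theta, a_psi, b, d):
    """Generic log-density: (y*theta - b(theta)) / a(psi) + d(y)."""
    return (y * theta - b(theta)) / a_psi + d(y)


def poisson_logpmf_expfam(y, mu):
    """Poisson log-pmf expressed through the exponential-family form."""
    theta = math.log(mu)                        # canonical parameter
    return expfam_logpdf(
        y,
        theta,
        1.0,                                    # a(psi) = 1: dispersion is known
        math.exp,                               # b(theta) = exp(theta)
        lambda v: -math.lgamma(v + 1),          # d(y) = -log(y!)
    )


def poisson_logpmf_direct(y, mu):
    """Textbook Poisson log-pmf, used only as a cross-check."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)


if __name__ == "__main__":
    for y in range(5):
        print(y, poisson_logpmf_expfam(y, 2.5), poisson_logpmf_direct(y, 2.5))
```

The two columns printed for each value of \(y\) should agree up to floating-point error, which is all this sketch is meant to show.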

On inserting expression (27) in (14), the log-likelihood function becomes

whose derivatives with respect to the parameters \(\beta , \gamma , \psi \) are as follows:

where \(V_i = a_i(\psi )\,b''(\vartheta _i) = \mathrm {var}\{Y_i\}\), \(\mathbb {E}\{Y_i\} = \mu _i = b'(\vartheta _i)\), \(g_0 = G_0'\), and \(g\) is the link function, defined by \(g(\mu _i) = x_i^{\top }\beta \).
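The identities \(\mu _i = b'(\vartheta _i)\) and \(V_i = a_i(\psi )\,b''(\vartheta _i)\) can be verified numerically. The sketch below does so for a binomial proportion \(Y = K/m\) with \(K \sim \mathrm {Binomial}(m, \pi )\), assuming the usual parametrization \(\vartheta = \log \{\pi /(1-\pi )\}\), \(b(\vartheta ) = \log (1 + e^{\vartheta })\) and \(a(\psi ) = 1/m\). Derivatives of \(b\) are approximated by central finite differences, and all function names are hypothetical helpers introduced for illustration only.

```python
import math


def b(theta):
    """Cumulant function of the binomial family: log(1 + exp(theta))."""
    return math.log1p(math.exp(theta))


def num_deriv(f, x, h=1e-4):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def num_second_deriv(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2


def binomial_proportion_moments(m, pi):
    """Exact mean and variance of Y = K/m, K ~ Binomial(m, pi), by enumeration."""
    probs = [math.comb(m, k) * pi**k * (1 - pi) ** (m - k) for k in range(m + 1)]
    mean = sum(p * k / m for k, p in enumerate(probs))
    var = sum(p * (k / m - mean) ** 2 for k, p in enumerate(probs))
    return mean, var


def glm_moments(m, pi):
    """Mean b'(theta) and variance a(psi) * b''(theta) from the GLM identities."""
    theta = math.log(pi / (1 - pi))   # canonical (logit) parameter
    a_psi = 1.0 / m                   # assumed dispersion term for a proportion
    return num_deriv(b, theta), a_psi * num_second_deriv(b, theta)


if __name__ == "__main__":
    print(glm_moments(10, 0.3))
    print(binomial_proportion_moments(10, 0.3))
```

Both calls should return approximately the same pair of values, with agreement limited only by the finite-difference error.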

The second order derivatives of (28) are given by the following expressions:

## About this article

### Cite this article

Azzalini, A., Kim, HM. & Kim, HJ. Sample selection models for discrete and other non-Gaussian response variables.
*Stat Methods Appl* **28**, 27–56 (2019). https://doi.org/10.1007/s10260-018-0427-1
