1 Introduction

In this paper we introduce inferential procedures, based on identification sets, for regression parameters in situations where a continuous outcome (the response in a linear regression model) is not observed for all individuals. A vast part of the literature on missing outcomes deals with situations where the missingness mechanism is independent of the outcome, conditionally (or not) on observed covariates, the so-called missing (completely) at random mechanism; see Little and Rubin (2002). In this case, parameters are identified and inference can be performed with standard procedures. When this assumption does not hold, the missingness mechanism is said to be non-ignorable, and several contributions are concerned with introducing other restrictions to obtain identification, such as monotonicity (Manski 2003, Chap. 8) or conditional independence restrictions (e.g. pattern mixture models, Daniels and Hogan 2008; Little 2009).

An alternative, which we pursue here, is to determine a region of the parameter space, which we call the identification set, that contains all parameters identified under plausible missing data mechanisms, and to propose inferential procedures accordingly. In this sense, this contribution is in line with Vansteelandt and Goetghebeur (2001), Manski (2003), Imbens and Manski (2004), Vansteelandt et al. (2006) and Horowitz and Manski (2006). Results available in this literature on set identification with non-ignorable nonresponse require the outcome to be bounded.

In this paper we focus on set identification of regression parameters when the unbounded outcome is continuous and the missingness mechanism is non-ignorable. We do so in a framework where the outcome (continuous valued) and the missingness indicator (binary variable) are regressed parametrically against a set of covariates, yielding an outcome equation and a selection (missingness mechanism) equation, respectively. The sets depend on the parameter \(\rho \), the correlation between the residuals of the two equations. We show that the identification set can be bounded when only mild restrictions are imposed on the missing data model.

We avoid making strong distributional assumptions, initially focusing on a probit regression model for the selection equation and later relaxing it to a more general class. Notice that assuming a probit selection equation allows for identification of the outcome equation parameters, and estimation can be performed via either maximum likelihood (ML) or two-stage least squares (TSLS), see Heckman (1979). The first procedure relies on the assumption of joint normality of the error terms and is very sensitive to misspecification (Olsen 1982; Wooldridge 2003, p. 566), and thus TSLS has become widely used. However, this method too suffers from serious finite-sample instability due to collinearity. That is usually addressed by using exclusion restriction assumptions, whereby some covariates excluded from the outcome equation are assumed to predict the missingness mechanism (Little 1985). However, it is well known (Puhani 2000) that results can be sensitive to the choice of exclusion restrictions, since different assumptions lead to different conclusions on the parameters of interest. We provide further illustration of this issue with a follow-up study on body mass index (BMI). By allowing for set identification, our approach avoids the use of such restriction assumptions in studies where no strong theory is available to justify them. Furthermore, the theory applies also to situations outside normality.

When only sets of possible values are identified, Vansteelandt et al. (2006) have provided an inferential framework; for instance, they propose to combine the estimated sets with sampling variation to yield a \((1-\alpha ) 100\,\%\) uncertainty region, which covers the identification set with a probability of at least \(1-\alpha \). In this paper, we deduce uncertainty intervals for the parameters of interest in our context. Uncertainty intervals are the counterpart, under set identification, of confidence intervals under point identification.

A related stream of the literature has developed methods to assess the sensitivity of the inference to departures from the missing at random assumption; see, e.g., de Luna and Lundin (2014), Little et al. (2012), Andridge and Little (2011), Rosenbaum (2010), Copas and Eguchi (2005), Imbens (2003) and Scharfstein et al. (1999). The uncertainty intervals that we introduce may be used as a tool for sensitivity analysis, as we illustrate in our case study. Our approach is in this respect closely related to the one proposed by Copas and Li (1997), as the selection parameter \(\theta \) in their paper is a transformation of \(\rho \). Copas and Li (1997) build a profile log-likelihood for \(\theta \) in order to carry out a sensitivity analysis. Similar models and methods are used for sensitivity analysis to publication bias in meta-analysis (Copas 2013; Henmi et al. 2007).

In Sect. 2 we present a motivating example, a follow-up study on individual BMI increase within a ten year interval. We introduce the model, discuss identification and illustrate the instability of the results under different exclusion restrictions. Section 3.1 contains the results on set identification under the probit assumption for the missingness mechanism. The latter assumption is relaxed in Sect. 3.2. In Sect. 3.3 we deduce the uncertainty intervals taking sampling variation into account. The BMI study is presented in detail in Sect. 4, illustrating the results obtained in the paper. Finite-sample properties are illustrated in a simulation study in Sect. 5, where the data generating mechanisms are chosen to mimic a textbook case study on unobserved women's wages due to non-participation in the labour market. The paper is concluded in Sect. 6.

2 Motivating studies

We utilize two different studies to motivate the contribution of this paper. The first study is concerned with finding predictors of BMI increase within a ten year interval, between 40 and 50 years of age; see Sect. 4 for more details. The second study estimates a wage offer function for married women and is used as background in the simulation study of Sect. 5. In both cases we have an outcome that is observed only for a selected subsample: in the BMI study selection is due to drop out, where some individuals have no BMI measure ten years after the first measure; in the wage offer study selection arises because wage is observed only for those women participating in the labour force, i.e. women without employment are assumed to have a latent wage offer. In both cases we may use the following model. Let

$$\begin{aligned} y= \nu _2+ \mathbf{x}^T \varvec{\beta }+\eta _2 \end{aligned}$$
(1)

be the outcome equation, where the outcome \(y\) (BMI change or wage in the above-mentioned examples) is observed only for individuals with \(z=1\) (no drop out and labour force participation, respectively, in the above examples); this selection is modelled as \(z=I (z^*>0)\) with

$$\begin{aligned} z^*= \nu _1 + \mathbf{x}^T \varvec{\delta }+ \eta _1. \end{aligned}$$
(2)

Let us further assume that \(\eta _1 \sim N(0,1)\), \(\text{ E }(\eta _2)=0\) and \(\text{ Var } (\eta _2) = \sigma _2^2\). Note that \(\eta _1\) has variance one without loss of generality. We allow for the errors to be correlated (non-ignorable selection) such that \(\eta _2 = \rho \sigma _2 \eta _1 + \varepsilon \), where \(\rho \) is the correlation between \(\eta _1\) and \(\eta _2\). The variable \(\varepsilon \) is independent of \(\eta _1\), and has zero mean and variance \(\sigma _{\varepsilon }^2\); we make no further assumptions about its distribution. The parameter of interest is \(\varvec{\beta }\). Consistent estimation of \(\varvec{\beta }\) can be obtained with a maximum likelihood estimator or a two-stage least squares (TSLS) estimator (Heckman 1979; Wooldridge 2003, Sect. 17.4).
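
To fix ideas, the following minimal sketch simulates data from the outcome equation (1) and the selection equation (2) with correlated errors. All numerical values and variable names are hypothetical, chosen for illustration only; \(\varepsilon \) is drawn as Gaussian for simplicity, although the model leaves its distribution unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 2
rho, sigma2 = -0.3, 1.0                  # hypothetical degree of selection and outcome sd

x = rng.normal(size=(n, p))              # covariates
beta = np.array([0.5, -0.2])             # hypothetical outcome coefficients
delta = np.array([0.8, 0.3])             # hypothetical selection coefficients
nu1, nu2 = 0.2, 1.0                      # hypothetical intercepts

eta1 = rng.normal(size=n)                              # eta_1 ~ N(0, 1)
eps = rng.normal(scale=sigma2 * np.sqrt(1 - rho**2), size=n)
eta2 = rho * sigma2 * eta1 + eps                       # corr(eta_1, eta_2) = rho

y = nu2 + x @ beta + eta2                              # outcome equation (1)
z = (nu1 + x @ delta + eta1 > 0).astype(int)           # selection: z = I(z* > 0)
y_obs = np.where(z == 1, y, np.nan)                    # y observed only when z = 1
```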

Table 1 presents the results of fitting the selection equation (probit regression) for the sample of 4,648 males for whom BMI is observed at 40 years of age, of whom 1,324 do not show up at the 50 years of age call (selection by drop out). The table displays the covariates available as well as their \(\delta \) coefficients and corresponding p-values. We notice that seven out of sixteen variables are significant at the five percent level, see Table 1 (first two columns). A backward elimination procedure was used and the final model is also given in Table 1 (last two columns). The subsequent analyses are made by restricting the set of covariates in the probit regression to those which are significant. The outcome equation is then fitted using different estimators and the results are displayed in Table 2: ordinary least squares (i.e. assuming missingness is ignorable; first two columns of the table, denoted OLS), and TSLS without exclusion restrictions (last two columns, denoted TSLS no ER). We can note here that the OLS and TSLS results differ. In fact, leaving \(\rho \) free, \(\varvec{\beta }\) is not well identified. This can be illustrated by considering

$$\begin{aligned} E (y\mid \mathbf{x}, z=1)= \nu _2+\mathbf{x}^T \varvec{\beta }+ \rho \sigma _2 \lambda (u), \end{aligned}$$
(3)

where \(u=\mathbf{x}^T \varvec{\delta }+\nu _1\) and \(\lambda (u)=\frac{\phi (u)}{\Phi (u)}\), with \(\phi (\cdot )\) and \(\Phi (\cdot )\) the standard normal density and cumulative distribution function, respectively. The term \(\lambda (u)\) is known in the literature as the inverse Mills’ ratio. It is clear from (3) that OLS will be biased if \(\rho \ne 0\). In applications the inverse Mills’ ratio is often close to linear in \(u\) (Puhani 2000; Jonsson 2012), and this is also the case in our example, see Fig. 1. Since the second stage of TSLS is a regression of \(y\) on \(\mathbf{x}\) and \(\lambda (u)\), this implies a collinearity problem, generating large standard errors (parameters are non-significant), see Table 2 (last two columns). To avoid collinearity, TSLS is usually performed with exclusion restrictions on some variables in the outcome equation. Indeed, assuming that some components of \(\varvec{\beta }\) are zero while the corresponding components of \(\varvec{\delta }\) are not ensures that the Mills’ ratio is not close to linear in \(u\); see e.g. Wooldridge (2003, p. 564). However, unless exclusion restrictions are available from scientific theories, such assumptions are controversial.

Table 1 Results of probit regression (2) for the BMI change case study.
Table 2 Results of TSLS with different exclusion restrictions and OLS; the stars indicate that the variable is included in the first stage.
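
To make the collinearity mechanism concrete, the following fragment carries out the two stages on the simulated data from the sketch above (reusing its variables, and assuming `statsmodels` and `scipy` are available); it is an illustrative sketch, not the estimator used in the case study. The near-linearity of \(\lambda (u)\) in \(u\) shows up as a large condition number of the second-stage design matrix.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

# First stage: probit regression of z on x, using the full sample
X1 = sm.add_constant(x)
probit = sm.Probit(z, X1).fit(disp=0)
u_hat = X1 @ probit.params                     # estimated linear predictor u
mills = norm.pdf(u_hat) / norm.cdf(u_hat)      # inverse Mills' ratio lambda(u)

# Second stage: OLS of y on x and lambda(u), over the selected subsample
sel = z == 1
X2 = np.column_stack([X1[sel], mills[sel]])
tsls = sm.OLS(y[sel], X2).fit()

# A nearly linear lambda(u) makes X2 nearly collinear: large condition
# number and inflated standard errors for the second-stage coefficients.
print(np.linalg.cond(X2), tsls.bse)
```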

Fig. 1 The inverse Mills’ ratio as a function of the linear predictor \(u\)

Table 2 also contains TSLS results based on different exclusion restrictions: in columns seven and eight (TSLS ER1) we have excluded one covariate, ’Unemployment benefits’, from the outcome equation; in columns five and six (TSLS ER2) we have excluded another, ’log(spouse earnings/earnings)’; in columns three and four (TSLS ER3) we have excluded both of them. All exclusion restrictions are made on covariates significant in the probit regression but not in the OLS fit. We obtain p-values for the coefficient of the inverse Mills’ ratio of 71, 12, 10 and \(4\,\%\), indicating non-ignorable selection in the latter case only. The results differ most between TSLS without exclusion restrictions (Table 2, last two columns) and the other fits, as expected, due to collinearity. The results also differ between OLS and TSLS with exclusion restrictions, and, most worryingly, between the three fits with different exclusion restriction assumptions. The clearest example of this is ’Parent leave benefits’, which is estimated at \(-0.11\) (p-value \(24\,\%\)) in the OLS fit and at \(-0.41\), \(-0.25\) and \(-0.25\) (p-values 13, 7 and \(5\,\%\)) in the TSLS fits with exclusion restrictions. Another example is ’Sick leave benefits’, which is estimated at \(0.06\) (p-value \(57\,\%\)) in the OLS fit and at \(0.40\), \(0.21\) and \(0.22\) (p-values 19, 16 and \(13\,\%\)) in the TSLS fits with exclusion restrictions. A conclusion of this exercise is that unless one has clear theoretical knowledge of which variables among those affecting selection should be excluded from the outcome equation, results may vary, both in effect size and precision. This can happen irrespective of the inverse Mills’ ratio being significant or not, and will be even more apparent if we include all variables in the probit regression. Similar findings are reported in Lennox et al. (2012).

In this paper we avoid the above described problems (collinearity with the inverse Mills’ ratio, need for exclusion restrictions, instability of results with respect to the exclusion restrictions chosen) by proposing identification sets for \(\varvec{\beta }\), valid for a certain degree of selection to be specified in advance.

3 Theory

3.1 Model and identification set

We reformulate the model from Sect. 2 in matrix form. Let \(\mathbf{y}\) be an \(N\)-vector with the complete outcome and \(\mathbf{X}\) the \((N \times (p+1))\) complete data regression matrix, i.e.

$$\begin{aligned} \mathbf{X}= \left[ \begin{array}{cc} 1 & \mathbf{x}_1^T \\ 1 & \mathbf{x}_2^T \\ \vdots & \vdots \\ 1 & \mathbf{x}_N^T \end{array} \right] . \end{aligned}$$

The model can be written as follows:

$$\begin{aligned} \mathbf{y}= \mathbf{X}\left[ \begin{array}{c} \nu _2 \\ \varvec{\beta }\end{array} \right] + \varvec{\eta }_2, \end{aligned}$$

the outcome equation, where \(\mathbf{y}\) and \(\varvec{\eta }_2\) are vectors of dimension \(N\), and

$$\begin{aligned} \mathbf {z^*}= \mathbf{X}\left[ \begin{array}{c} \nu _1 \\ \varvec{\delta }\end{array} \right] + \varvec{\eta }_1, \end{aligned}$$

the selection equation, where \(\mathbf {z^*}\) and \(\varvec{\eta }_1\) are vectors of dimension \(N\). As earlier, we assume that all elements of \(\varvec{\eta }_1\) are i.i.d. N\((0,1)\) and all elements of \(\varvec{\eta }_2\) are i.i.d. with zero mean and homogeneous variance \(\sigma _2^2\). Also, \(\varvec{\eta }_2= \rho \sigma _2 \varvec{\eta }_1+\varvec{\varepsilon }\), where \(\rho \) is the correlation between the corresponding components of \(\varvec{\eta }_1\) and \(\varvec{\eta }_2\), and \(\varvec{\varepsilon }\) is independent of \(\varvec{\eta }_1\) and has elements with zero mean and variance \(\sigma _\varepsilon ^2\); we make no further assumptions about its distribution.

Let \(\mathbf{y}_s\) be an \(n\)-vector (\(n<N\)) with the observed outcomes and \(\mathbf{X}_s\) the corresponding \((n \times (p+1))\) incomplete data regression matrix. Then the OLS estimates of the linear regression coefficients of \(\mathbf{y}_s\) on \(\mathbf{X}_s\) are:

$$\begin{aligned} \left[ \begin{array}{c} \hat{\nu _2}_{OLS} \\ \hat{\varvec{\beta }}_{OLS} \end{array} \right] =(\mathbf{X}_s^T \mathbf{X}_s)^{-1} \mathbf{X}_s^T \mathbf{y}_s. \end{aligned}$$
(4)

Note that \(\text{ E }(\mathbf{y}\mid \mathbf{X})=\mathbf{X}\left[ \nu _2,\, \varvec{\beta }^T\right] ^T\) but \(\text{ E }(\mathbf{y}_s \mid \mathbf{X}_s) \ne \mathbf{X}_s \left[ \nu _2,\, \varvec{\beta }^T\right] ^T\) if missingness is non-ignorable. Let \(\lambda (u)\) be the inverse Mills’ ratio as introduced in Sect. 2. We have:

$$\begin{aligned} \text{ E }\left( \left[ \begin{array}{c} \hat{\nu _2}_{OLS}\\ \hat{\varvec{\beta }}_{OLS} \end{array} \right] \right)&= \text{ E }[\text{ E }((\mathbf{X}_s^T \mathbf{X}_s)^{-1} \mathbf{X}_s^T \mathbf{y}_s \mid \mathbf{X}_s) ]\nonumber \\&= \text{ E }[(\mathbf{X}_s^T \mathbf{X}_s)^{-1} \mathbf{X}_s^T \text{ E }(\mathbf{y}_s \mid \mathbf{X}_s)] \nonumber \\&= \text{ E }\left[ (\mathbf{X}_s^T \mathbf{X}_s)^{-1} \mathbf{X}_s^T \left( \mathbf{X}_s \left[ \begin{array}{c} \nu _2 \\ \varvec{\beta }\end{array} \right] + \rho \sigma _2 \varvec{\lambda }_u\right) \right] \nonumber \\&= \left[ \begin{array}{c} \nu _2 \\ \varvec{\beta }\end{array} \right] +\rho \sigma _2 \text{ E }[(\mathbf{X}_s^T \mathbf{X}_s)^{-1} \mathbf{X}_s^T \varvec{\lambda }_u ] \end{aligned}$$
(5)

where \(\varvec{\lambda }_u^T= [\lambda (u_1),\; \; \lambda (u_2), \ldots ,\lambda (u_n)]\), i.e. the values of the inverse Mills’ ratio for the \(n\) observations.

To get an identification set for \(\varvec{\beta }\) we use (5). We see that in order to correct the bias of the OLS estimator, both \(\rho \) and \(\sigma _2\) are needed. Since we know that \(\rho \) ranges between \(-1\) and \(+1\), the strategy we pursue here is to provide bounds for \(\sigma _2\), which will depend on \(\rho \), and then let our identification set depend on a restricted subset of plausible values for \(\rho \).

Let \(\sigma _r^2=\text{ E }(\text{ Var }(y \mid \mathbf{x}, z=1))\) and \(\tilde{\sigma }_1^2(\mathbf{x})=\text{ Var }(z^* \mid \mathbf{x}, z=1)\). Since \(\sigma ^2_\varepsilon =\sigma _2^2(1-\rho ^2)\) we have:

$$\begin{aligned} \sigma _r^2&= \text{ E }(\text{ Var }(\eta _2 \mid \mathbf{x}, z=1))= \text{ E }\left[ \text{ Var }\left( {\rho \sigma _2} \eta _1 + \varepsilon \mid \mathbf{x}, z=1\right) \right] \\&= \text{ E }\left[ \sigma _\varepsilon ^2 + {\rho ^2 \sigma _2^2} \tilde{\sigma }_1^2(\mathbf{x}) \right] = \text{ E }\left[ \sigma _2^2-\rho ^2\sigma _2^2 +{\rho ^2\sigma _2^2} \tilde{\sigma }_1^2(\mathbf{x})\right] \\&= \sigma _2^2\left( 1-\rho ^2\left( 1-\text{ E }\left[ {\tilde{\sigma }_1^2(\mathbf{x})}\right] \right) \right) \end{aligned}$$

where \(0 \le 1-\text{ E }\left[ {\tilde{\sigma }_1^2(\mathbf{x})}\right] \le 1\), since \(\tilde{\sigma }_1^2(\mathbf{x})\le \text{ Var }(z^* \mid \mathbf{x})=1\) for all \(\mathbf{x}\). Hence, we get the inequality:

$$\begin{aligned} \sigma _r^2 \le \sigma _2^2 \le \frac{\sigma _r^2}{1-\rho ^2}. \end{aligned}$$
(6)

From (5) and (6) we now can obtain identification sets for all components of \(\varvec{\beta }\). Let:

$$\begin{aligned} b_{1, j}&= \text{ E }\left( \hat{\beta }_j\right) - \rho _{min}\frac{\sigma _r}{\sqrt{1-\rho _{min}^2}}\, \varvec{e}_j^T\, \text{ E }[(\mathbf{X}_s^T \mathbf{X}_s)^{-1} \mathbf{X}_s^T \varvec{\lambda }_u ],\\ b_{2, j}&= \text{ E }\left( \hat{\beta }_j \right) - \rho _{min}\sigma _r\, \varvec{e}_j^T\, \text{ E }[(\mathbf{X}_s^T \mathbf{X}_s)^{-1} \mathbf{X}_s^T \varvec{\lambda }_u ],\\ b_{3, j}&= \text{ E }\left( \hat{\beta }_j \right) - \rho _{max}\frac{\sigma _r}{\sqrt{1-\rho _{max}^2}}\, \varvec{e}_j^T\, \text{ E }[(\mathbf{X}_s^T \mathbf{X}_s)^{-1} \mathbf{X}_s^T \varvec{\lambda }_u ],\\ b_{4, j}&= \text{ E }\left( \hat{\beta }_j \right) - \rho _{max}\sigma _r\, \varvec{e}_j^T\, \text{ E }[(\mathbf{X}_s^T \mathbf{X}_s)^{-1} \mathbf{X}_s^T \varvec{\lambda }_u ], \end{aligned}$$

for \(j=1, \ldots ,p\), where \(\varvec{e}_j \) is a \((p+1)\)-vector with all elements 0 except the \((j+1)\)th, which is 1. Then the lower (\(\beta _{l,j}\)) and upper (\(\beta _{u,j}\)) bounds of the identification set are:

$$\begin{aligned} \left[ \beta _{l, j},\, \beta _{u, j}\right] =\left[ \text{ min }(b_{1, j} , b_{2, j}, b_{3, j} , b_{4, j}),\; \text{ max }(b_{1, j} , b_{2, j}, b_{3, j} , b_{4, j}) \right] \end{aligned}$$
(7)

We can see that if we only know that \(\rho \in [-1,1]\), then the identification sets range from \(-\infty \) to \(+\infty \). In cases where we have knowledge about \(\rho \), e.g., \(\rho \in [\rho _{min},\rho _{max}]\) with \(-1<\rho _{min}\) or \(\rho _{max}<1\) (or both), we get a bounded identification set for \(\beta _j\).
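
The bounds translate directly into a plug-in procedure (anticipating the estimation strategy of Sect. 3.3). The sketch below, which assumes a probit selection equation and the imports of the earlier fragments, computes sample analogues of \(b_{1,j},\ldots ,b_{4,j}\) for all coefficients at once; the function name is ours, and the code is an illustration of (5)–(7), not a polished implementation.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def identification_set(y, x, z, rho_min, rho_max):
    """Plug-in estimate of the bounds (7); requires |rho| < 1 at both
    endpoints. Returns component-wise lower and upper bounds (the first
    component is the intercept)."""
    X = sm.add_constant(x)
    probit = sm.Probit(z, X).fit(disp=0)
    u_hat = X @ probit.params
    mills = norm.pdf(u_hat) / norm.cdf(u_hat)            # lambda(u)

    sel = z == 1
    Xs, ys = X[sel], y[sel]
    ols = sm.OLS(ys, Xs).fit()
    sigma_r = np.sqrt(ols.mse_resid)                     # estimate of sigma_r

    # sample analogue of E[(Xs' Xs)^{-1} Xs' lambda_u] appearing in (5)
    bias_dir = np.linalg.solve(Xs.T @ Xs, Xs.T @ mills[sel])

    b = []
    for r in (rho_min, rho_max):
        # the two sigma_2 bounds from (6): sigma_r and sigma_r / sqrt(1 - r^2)
        for s2 in (sigma_r, sigma_r / np.sqrt(1 - r**2)):
            b.append(ols.params - r * s2 * bias_dir)     # b_{1,j}, ..., b_{4,j}
    b = np.array(b)
    return b.min(axis=0), b.max(axis=0)                  # [beta_l, beta_u]
```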

3.2 Relaxing distributional assumptions

Let \(\text{ E }(y \mid \mathbf{x})=\nu _2 + \mathbf{x}^T \varvec{\beta }\), \(\text{ Var }(y \mid \mathbf{x})=\sigma _2^2\) and

$$\begin{aligned} P(z=1 \mid y,\mathbf{x})=\text{ exp } \left[ H\left( \alpha _0 + \frac{\rho }{\sqrt{1-\rho ^2}} \frac{(y-\nu _2-\mathbf{x}^T \varvec{\beta })}{\sigma _2}\right) \right] , \end{aligned}$$

where \(H\) is a known differentiable function and \(\alpha _0=\frac{\nu _1 +\mathbf{x}^T \varvec{\delta }}{\sqrt{1-\rho ^2}}\). Under some additional regularity assumptions we have (see Appendix):

$$\begin{aligned} \left[ \begin{array}{c} \nu _2 \\ \varvec{\beta }\end{array} \right] =\text{ E }\left( \left[ \begin{array}{c} \hat{\nu _2}_{OLS} \\ \hat{\varvec{\beta }}_{OLS} \end{array} \right] \right) - \sigma _r \frac{\rho }{\sqrt{1-\rho ^2}}\text{ E }[(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T\varvec{H}'_{\alpha _0} ]+O(\rho ^2). \end{aligned}$$

Note that under the model assumptions of Sect. 2 we have \(H=\log \Phi \), so that \(\varvec{H}'_{\alpha _0}=\varvec{\lambda }_{\alpha _0}\), and we get:

$$\begin{aligned} \left[ \begin{array}{c} \nu _2 \\ \varvec{\beta }\end{array} \right] =\text{ E }\left( \left[ \begin{array}{c} \hat{\nu _2}_{OLS} \\ \hat{\varvec{\beta }}_{OLS} \end{array} \right] \right) - \sigma _r \frac{\rho }{\sqrt{1-\rho ^2}}\text{ E }[(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T\varvec{\lambda }_{\alpha _0}]+O(\rho ^3) \end{aligned}$$

which corresponds to the bounds in (7), up to the order of the remainder term.

3.3 Taking the sampling variability into account: uncertainty intervals

With \(\hat{\beta }_j\) we denote the \(j\)-th element of \(\hat{\varvec{\beta }}_{OLS}\); see (4). The bounds (7) can be estimated from the observed data, by using \(\hat{\beta }_j\) for \(\text{ E }\left( \hat{\beta }_j\right) \), and by estimating \(\text{ E }[(\mathbf{X}_s^T \mathbf{X}_s)^{-1} \mathbf{X}_s^T \varvec{\lambda }_u ]\) with \((\mathbf{X}_s^T \mathbf{X}_s)^{-1} \mathbf{X}_s^T \varvec{\lambda }_{\hat{u}}\), where the parameters of \(u\) are estimated with a probit regression to yield \(\hat{u}\). Also \(\sigma _r^2\) can be estimated with the residual sample variance of the OLS fit, thereby implying a slight overestimation of \(\sigma _r^2\). The latter can be seen with the following asymptotic argument:

$$\begin{aligned} \sigma _{OLS}^2&= \text{ Var }(y-{\nu _2}_{OLS}- \mathbf{x}^T {\varvec{\beta }}_{OLS}\mid z=1) \\&= \sigma _r^2+ \text{ Var }(\text{ E }(y-{\nu _2}_{OLS}- \mathbf{x}^T {\varvec{\beta }}_{OLS}\mid \mathbf{x}, z=1)\mid z=1) > \sigma _r^2, \end{aligned}$$

where \(\sigma _{OLS}^2\), \(\nu _{2OLS}\) and \({\varvec{\beta }}_{OLS}\) are the limits in probability of the OLS estimators, and where the necessary regularity conditions are assumed for the first equality to hold. The overestimation is slight if \(\text{ E }(y\mid \mathbf{x}, z=1)\) is close to linear as a function of \(\mathbf{x}\), which is often the case in applications (Puhani 2000).

These estimates induce sampling variability into the estimated identification set. This variability is incorporated to create uncertainty intervals with a confidence level of at least \((1-\alpha )100\,\%\):

$$\begin{aligned} \left[ \hat{\beta }_{l, j} - c_{\frac{\alpha }{2}} \text{ se }(\hat{\beta }_{l, j} ),\; \hat{\beta }_{u,j} + c_{\frac{\alpha }{2}} \text{ se }(\hat{\beta }_{u, j} )\right] , \end{aligned}$$

where \(c_{\frac{\alpha }{2}}\) is the \((1-\alpha /2)\) quantile of the standard normal distribution; this is justified since \(\hat{\beta }_{l, j}\) and \(\hat{\beta }_{u, j}\) are asymptotically normal. This is a strong uncertainty region as defined in Vansteelandt et al. (2006); that is, it covers all values in the identification set \([\beta _{l, j}, \beta _{u, j}]\) with at least \((1-\alpha )100\,\%\) probability.

Estimation of \(\text{ se }(\hat{\beta }_{l,j})\) and \(\text{ se }(\hat{\beta }_{u,j})\) can be performed with bootstrap techniques, since all estimated quantities are identified. In this paper, however, we simply use the standard errors of the OLS estimates \(\hat{\beta }_j\) to construct the uncertainty intervals. This implies an underestimation of the sampling variability, but our simulations suggest that this is compensated for by the otherwise conservative use of strong uncertainty intervals.
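
A sketch of the resulting interval construction, reusing the hypothetical `identification_set` function from Sect. 3.1 above and, as in the paper, the OLS standard errors in place of bootstrap estimates:

```python
import statsmodels.api as sm
from scipy.stats import norm

def uncertainty_interval(y, x, z, rho_min, rho_max, alpha=0.05):
    """Uncertainty intervals of Sect. 3.3: the estimated bounds are widened
    by c_{alpha/2} times the OLS standard errors (a slight underestimate of
    the sampling variability, as discussed above)."""
    beta_l, beta_u = identification_set(y, x, z, rho_min, rho_max)
    X = sm.add_constant(x)
    sel = z == 1
    se = sm.OLS(y[sel], X[sel]).fit().bse      # OLS standard errors
    c = norm.ppf(1 - alpha / 2)
    return beta_l - c * se, beta_u + c * se

# e.g., intervals under an assumed non-positive degree of selection:
# lo, hi = uncertainty_interval(y, x, z, rho_min=-0.5, rho_max=0.0)
```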

4 Predictors of BMI changes for middle age men

The analysis is performed on data collected via the Västerbotten Intervention Programme (VIP) (Norberg et al. 2010). VIP was initiated in 1985 to counter the high prevalence of cardiovascular disease in Västerbotten county, in northern Sweden. From 1991 all residents turning 40, 50 and 60 have been asked to participate. We study all married or cohabiting 40 year old males born 1950–1956 who have chosen to participate, looking for predictors of BMI change from 40 to 50 years of age. By using Swedish personal identity numbers, these data are linked to socioeconomic and demographic information. At the 50 year call only 3,324 of the 4,648 males who came to the 40 year call returned, so we have a dropout of \(28.5\,\%\). With such a level of dropout, we may question the reliability of standard OLS techniques, which rely on the missing at random assumption. In particular, a possibility could be that individuals who do not show up for the second check up have a larger increase in BMI than those who do (corresponding to a negative \(\rho \)).

In Table 3 we present uncertainty intervals built under three different sets of values for \(\rho \): \((-0.9, 0)\), \((-0.5, 0)\) and \((-0.5, 0.5)\), together with confidence intervals obtained by assuming missing at random (i.e., setting \(\rho =0\) and using OLS). The first two sets of values for \(\rho \) express the assumption that \(\rho \) is not positive; the third is an interval containing both negative and positive values. The uncertainty intervals are obtained from the data as described in Sect. 3.3.

Table 3 Width and center of 95 % uncertainty and confidence intervals

The results obtained are best displayed graphically, as done in Fig. 2 for the three variables which are significant at the 5 % level in the OLS analysis. Results show that only the two covariates ’Baseline BMI’ and ’Positive self-reported health’ have a non-zero negative effect under all ranges of \(\rho \) considered. Their UIs contain the value zero only under a rather extreme negative correlation (i.e. \(-\)0.98 or lower). On the other hand, ’Tobacco use’ has a non-zero positive effect under all ranges of \(\rho \) considered, although the UI may contain zero for values of \(\rho \) larger than 0.52. Such considerations are the added value with respect to the analyses summarised in Table 2.

Fig. 2 Graphical display of the OLS estimates (dot), the 95 % confidence intervals (CI) and uncertainty intervals (UI) obtained with \(\rho \) in \([-0.5, 0.5]\) and \([-0.9, 0]\), for the three variables of Table 3 which were significant at the 5 % level in the OLS fit.

Note that ’Positive self-reported health’ is the only significant variable that was not significant in the probit regression. If the component of \(\varvec{\delta }\) corresponding to ’Positive self-reported health’ is zero, the corresponding component of \(\hat{\varvec{\beta }}_{OLS}\) is not distorted and our identification set reduces to a point, see Hutton and Stanghellini (2010). The corresponding uncertainty interval is then equivalent to a confidence interval with the same level. On the other hand, if an element of \(\varvec{\delta }\) is small, this will be reflected in the estimates and we will still get a rather narrow uncertainty interval. For this reason, when constructing the uncertainty intervals, we have used all available covariates in the probit regression, see Table 1.

5 A simulation study based on a wage offer study

5.1 Design of the study

The design is an attempt to mimic the characteristics of the case study on married women’s wages mentioned in Sect. 2. The study focused on estimating the wage offer equation (1) given a set of observed covariates, with a selected sample since wage is observed only for the women who work; see Mroz (1987) and, for a more recent analysis, Wooldridge (2003, Chap. 17.4). The covariates used are ’Household income–woman’s income’ (\(nwifeinc\)), ’Educational attainment in years’ (\(educ\)), ’Years of labour market experience’ (\(exper\)), ’Age’, ’Number of children 5 years or younger’ (\(kids5\)) and ’Number of children 6–18 years old’ (\(kids618\)).

The simulated samples in this study are obtained by drawing with replacement a given number of units out of the 753 women in the study. We use their true values on all explanatory variables, but simulate a new response variable using models (1) and (2), setting \(\rho = 0.1 ,\; 0.2\) or \(0.4\). The other parameters (\(\varvec{\delta }\), \(\varvec{\beta }\) and \(\sigma _2\)) are set to the estimates obtained from TSLS applied to the original dataset, with all covariates included in the selection equation and with \(age\), \(kids5\) and \(kids618\) excluded from the outcome equation. More specifically, given \(\mathbf{x}\) we simulate data from the following model:

$$\begin{aligned} y&= -0.452+ \mathbf{x}\, [0.006,\,\! 0.097,\,\! 0.039,\,\! -0.001,\,\! 0,\,\! 0,\,\! 0]^T + \rho \cdot \sigma _2 \cdot \eta _1 + \varepsilon , \\ z^*&= 0.270+ \mathbf{x}\, [-0.012,\,\! 0.131,\,\! 0.123,\,\! -0.002,\,\! -0.053, \,\! -0.868,\,\! 0.036]^T + \eta _1, \end{aligned}$$

where \(\mathbf{x}= [nwifeinc, educ, exper, exper^2, age, kids5, kids618]\), \(\eta _1 \sim N(0,1)\) and \(\sigma _2= 0.662\). In order to mimic the marginal distribution of the observed women’s wages, the distribution of \(\varepsilon \) is chosen to be a centered gamma: \(\varepsilon = \text{ E }(G) - G\), where \(G\) is gamma distributed with equal shape and scale parameters (i.e. both parameters are equal to \(\text{ Var }(\varepsilon )^{1/3}\)). From our model assumptions (see Sect. 3.1) we also impose that \(\sqrt{\text{ Var }(\varepsilon )}= \sigma _2\sqrt{1-\rho ^2}\), implying that \(\sqrt{\text{ Var }(\varepsilon )}= 0.659,\, 0.649,\, 0.607\) for the three values of \(\rho \), respectively. In this study we generate 10,000 replicates of samples with sizes 100, 350 and 753.
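
The centered-gamma construction can be reproduced as follows (a minimal sketch; note that for a gamma variable with common shape and scale parameter \(k\) we have \(\text{ E }(G)=k^2\) and \(\text{ Var }(G)=k^3\), so \(k=\text{ Var }(\varepsilon )^{1/3}\) gives the required variance). The sanity check compares the simulated standard deviation of \(\varepsilon \) with its target value \(\sigma _2\sqrt{1-\rho ^2}\).

```python
import numpy as np

rng = np.random.default_rng(1)
rho, sigma2 = 0.2, 0.662
var_eps = sigma2**2 * (1 - rho**2)       # Var(eps) = sigma_2^2 (1 - rho^2)
k = var_eps ** (1 / 3)                   # common shape and scale parameter

g = rng.gamma(shape=k, scale=k, size=100_000)
eps = k * k - g                          # E(G) - G, with E(G) = shape * scale = k^2

# mean ~ 0 and sd ~ 0.649 for rho = 0.2, matching the value in the text
print(round(eps.mean(), 3), round(eps.std(), 3), round(np.sqrt(var_eps), 3))
```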

For the identification sets (7) we let \(\rho \in [0,\, 0.5]\) and compute uncertainty intervals as described in Sect. 3.3. We apply the TSLS procedure without restrictions and with two different sets of exclusion restrictions: TSLS E1, where we exclude three variables (\(age\), \(kids5\) and \(kids618\)) from the outcome equation, so that TSLS E1 corresponds to the data generating mechanism of the study; and TSLS E2, where four variables are excluded from the outcome equation (i.e., also \(nwifeinc\)), so that TSLS E2 is a misspecified model. Finally, OLS estimates are also produced.

5.2 Results

Results for the \(\varvec{\beta }\) coefficient corresponding to \(educ\) are summarized in Fig. 3, where the widths of the uncertainty intervals and confidence intervals over the 10,000 replicates are reported with box plots. Empirical coverages are also given in the figure. As expected, TSLS without exclusion restrictions yields wide confidence intervals due to collinearity problems (variance inflation factors ranging from around 10 to 100). TSLS E1 (correctly specified model) yields tighter confidence intervals and empirical coverage close to the nominal level. TSLS E2 gives too low empirical coverage due to model misspecification (a problem that increases with sample size), as does OLS in all cases for the same reason.

Fig. 3 Box plots of the widths of 95 % uncertainty intervals and 95 % confidence intervals for the regression coefficient of \(educ\), varying \(\rho \) and the sample size. The empirical coverage of the intervals is given above each box.

Uncertainty intervals are not directly comparable to confidence intervals, since they converge to a non-degenerate interval as the sample size grows. Thus, uncertainty intervals are expected to be wider than confidence intervals based on a correctly specified model (TSLS E1). Uncertainty intervals should, and in our simulations do, imply an empirical coverage rate higher than the nominal level, since they are constructed to take into account the uncertainty due to the unknown parameter \(\rho \). By letting \(\rho \) be uncertain we avoid the need for prior knowledge about exclusion restrictions, and we see that using an incorrect exclusion restriction (TSLS E2) can lead to serious under-coverage. However, one should also note that the coverage of the uncertainty intervals relies on correct a priori information on \(\rho \), i.e. an interval for \(\rho \) containing the true value. Using an interval for \(\rho \) not containing the true value will typically yield too low coverage. Using \(\rho \in [-0.5, 0]\) instead of \(\rho \in [0, 0.5]\) in the above simulations yielded empirical coverages often below 95 %, although higher than the coverages obtained with OLS, since \(\rho =0\) is included.

Finally, it is worth noting that the coefficient of the inverse Mills’ ratio (obtained in the second stage of TSLS) is not significant in most of the replicates (even with non-zero \(\rho \)), making the corresponding test of no selection unreliable in practice, i.e. the data carry little information on whether the sample is selected or not.

6 Discussion

We have shown how to compute bounds on the parameters of a regression model with missing continuous outcome without making strong untestable assumptions about the missing data mechanism. The bounds make evident which inferences can be made under reasonably mild restrictions on the value of \(\rho \), the correlation between the unmeasured factor that drives the missingness mechanism and the residuals of the regression model under study. This is especially important with large datasets, where the sampling variation is small and the lack of knowledge about \(\rho \) is therefore the major source of uncertainty. Furthermore, these bounds can be computed without imposing any exclusion restriction, and they contain the missing at random assumption (\(\rho =0\)) as a particular case. They therefore provide an indication of the impact that the untestable assumptions have on the inference about the parameters. Note that the simulations show that correct coverage of the uncertainty intervals relies on specifying an interval for \(\rho \) containing the true value. An alternative to bounds is to use Bayesian inference, where a posterior distribution of the parameters of interest is deduced by integrating out the nuisance parameter \(\rho \) (Daniels and Hogan 2008; Rubin 1977). Our approach has the advantage of relaxing distributional assumptions. Since the bounds are based on standard OLS techniques, they are also easy to compute using standard statistical software.