Unified approach for regression models with nonmonotone missing at random data

Abstract

The unified approach (Chen and Chen in J R Stat Soc B 62(3):449–460, 2000) uses a working regression model to extract information from auxiliary variables in a two-stage study and thereby compute an efficient estimator of the regression parameter. As far as we know, the method is limited to data missing completely at random in a simple monotone missing data pattern. In this research, we extend the unified approach to estimate regression models with nonmonotone missing at random data. We describe an inverse probability weighting estimator conditional on estimators from a set of working regression models, which contain information from incomplete data and auxiliary variables. The proposed method is flexible and can easily accommodate incomplete data and auxiliary variables. We investigate the finite-sample performance of the proposed estimators using simulation studies and further illustrate the estimation method on a case–control study investigating the risk factors of hip fractures.

Introduction

In applications of regression, data are often incomplete for numerous individuals either by chance or by design. The partial questionnaire design (PQD, Wacholder et al. 1994) was developed for lengthy questionnaires or other burdensome data collection processes, where subsets of variables are measured for different, but overlapping, groups of subjects; a PQD generates data with nonmonotone patterns of missingness. More broadly, data where values are missing by chance typically have nonmonotone patterns. For example, in a case–control study of the risk factors of hip fracture among male veterans carried out at the University of Illinois at Chicago College of Medicine (Barengolts et al. 2001), only 237 out of 436 subjects have a complete record of the nine potentially important risk factors. There are 38 missing data patterns. (See Table 2 in Chen (2004).)

Methods proposed for regression models with data missing in monotone missing data patterns may not be able to deal with nonmonotone missing data patterns (e.g., Fitzmaurice et al. 2009; Han 2014; Lipsitz et al. 1999; Little and Rubin 2002; Zhao and Lipsitz 1992; Zhao et al. 2009). Multiple imputation (Rubin 1987, 1996; Scheuren 2005) is often used for handling nonmonotone missing data problems; however, it can be very challenging to find proper imputation models to accommodate flexible models in practice. The multiple imputation through conditional semiparametric odds ratio models for covariates and the Markov chain Monte Carlo sampling approach (Chen et al. 2011) is more flexible but computationally complex. The maximum likelihood method in Ibrahim et al. (2005) and the three techniques described in Chatterjee and Li (2010) require certain covariates to be categorical variables. The method in Lipsitz and Ibrahim (1996) depends on parametric assumptions for the joint distribution of the covariates. Robins et al. (1994) and van der Laan and Robins (2003) propose a class of semiparametric efficient estimators through an augmented estimating equation; however, the optimal functions required in the augmented estimating equation are in general not available. Practical modifications of the optimal functions (e.g., Sun and Tchetgen 2018; Tsiatis 2006) often reduce the estimation efficiency of the augmented estimating equation-based estimation methods. Nonparametric methods proposed for the simple monotone missing data pattern in Breunig and Haan (2019) and Breunig et al. (2018) may be extended to nonmonotone missing data patterns. For example, to extend the fractional probability weight method in Breunig et al. (2018) we need to define a multinomial variable (e.g., Sun and Tchetgen 2018) to identify multiple missing data patterns or a binary indicator variable for each missing data pattern. Investigation into this extension is valuable, but as far as we can see there is no simple, direct solution.

In a two-stage study, Chen and Chen (2000) propose using a working regression model to extract information from stage I observations and describe a unified approach to compute a more efficient estimator conditional on the nuisance estimator from the working regression model. The approach produces a consistent estimator of the regression parameter regardless of the correctness of the working regression model. As far as we know, the method is limited to data missing completely at random (MCAR, Rubin 1976) in a simple monotone missing data pattern. In this research, we generalize the above idea to deal with regression models with nonmonotone missing at random (MAR) data and describe a conditional inverse probability weighting estimator (IPW, Horvitz and Thompson 1952), given estimators from a set of working regression models, which contain information from incomplete data and auxiliary variables. The proposed estimation method is robust and computationally simple and can be easily implemented using standard computing software.

The rest of the article is organized as follows. Section 2 introduces notation. Section 3 describes the general idea of the new estimation method through IPW for regression models with nonmonotone MAR data, where we assume that the missing data probabilities are either known or can be parametrically modeled (Robins et al. 1994; Sun and Tchetgen 2018). Asymptotic results and variance estimate for the estimator are provided. Section 4 uses simulation studies to examine the finite-sample performance of the proposed method. Section 5 analyzes data from a case–control study on risk factors associated with hip fractures in veterans. Section 6 concludes with some remarks on applications and extensions.

Notation

Let Y be a response variable, X denote a vector of covariates, and \(f(X; \beta )\) represent the conditional mean of Y given X, where \(\beta\) is a vector of parameters. We are interested in estimating the parameter \(\beta\). To simplify the notation, we assume that data in X are MAR in arbitrary nonmonotone missing data patterns, but data in Y are fully observed. We will see that the method can also handle a MAR response, or a response and covariates that are both MAR. (See simulation (B) in Sect. 4 where both Y and X are MAR.)

Before we introduce the method, it is important to distinguish the two different ways to count missing data patterns. Suppose there are p covariates in X, that is, \(X=(X_1, \ldots , X_p)^T\). Let N be the total number of subjects. To illustrate the idea, let us consider a small data example shown in Table 1 with \(p=5\) variables and \(N=8\) observations, where the lowercase \(x_{ij}\)’s are the observed values and “?” represents a missing value. Suppose that we divide the subjects \(i=1, \ldots , 8\) into groups, {1, 2}, {3, 4}, {5, 6}, and {7, 8}, where the subjects in the same group have the same variables observed. We see that the first group contains the fully observed subjects and the remaining three groups represent three distinct missing data patterns. Alternatively, we can divide the variables \(j=1, \ldots , 5\) into groups, {1, 2}, {3}, and {4, 5}, where the variables in the same group are either observed or missing together. We see that the second group of variables is fully observed, while the remaining two groups represent two distinct missing data patterns. Thus, we can divide either the subjects or the variables into groups according to the missing data patterns. Assuming that the sample size N is much larger than p and \(p>1\), the maximum number of missing data patterns is \(2^p-1\) by subject and p by variable.

Table 1 A small data example

To introduce the new method, it is convenient to denote the missing data patterns by variable. Let us partition the covariates in X according to the missing data patterns and denote the partition as \(X=(X^T_1, \ldots , X^T_q)^T\) such that each \(X_j\), \(j=1, \ldots , q,\) is a vector of covariates which are observed or missing together. Here, q is the total number of distinct missing data patterns. In an extreme case, when each covariate has a unique missing data pattern, \(X_j\) is a scalar and \(q=p\), the total number of variables in X. For example, the covariates in our small data example can be denoted as \(X=(X^T_1, X^T_2, X^T_3)^T\) with \(q=3\), where the three vectors of covariates correspond to the three groups of variables {1, 2}, {3}, and {4, 5}.
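The two ways of counting patterns can be checked mechanically. The following is a minimal sketch, assuming a hypothetical observation mask that is consistent with the description of the small data example (the actual layout of Table 1 is not reproduced here, so which subject group misses which variables is an assumption). The helper names `patterns_by_subject` and `groups_by_variable` are illustrative only.

```python
import numpy as np

# Hypothetical observation mask: rows are subjects 1..8, columns are
# variables 1..5; True means observed. This layout is an assumption.
observed = np.ones((8, 5), dtype=bool)
observed[2:4, 0:2] = False            # subjects 3, 4 miss variables 1, 2
observed[4:6, 3:5] = False            # subjects 5, 6 miss variables 4, 5
observed[6:8, [0, 1, 3, 4]] = False   # subjects 7, 8 miss variables 1, 2, 4, 5


def patterns_by_subject(mask):
    """Distinct row patterns among subjects with at least one missing value."""
    incomplete = mask[~mask.all(axis=1)]
    return {tuple(row) for row in incomplete}


def groups_by_variable(mask):
    """Group 1-based variable indices whose observation indicators coincide."""
    groups = {}
    for j in range(mask.shape[1]):
        groups.setdefault(tuple(mask[:, j]), []).append(j + 1)
    return list(groups.values())


n_subject_patterns = len(patterns_by_subject(observed))
variable_groups = groups_by_variable(observed)
# Variable groups that contain at least one missing value
missing_groups = [g for g in variable_groups if not observed[:, g[0] - 1].all()]
```

Under this mask the variable groups are {1, 2}, {3}, and {4, 5}, so \(q=3\), and the groups with missing values are {1, 2} and {4, 5}.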

We define indicator variables \(R_j\) as \(R_j=1\) if \(X_j\) is observed and 0 otherwise for \(j=1, \ldots , q\). For convenience, we denote the covariate vector X as \(X_0\) and define \(R_0=1\) if \(R_1=R_2=\cdots =R_q=1\) and 0 otherwise. If \(R_{0i}=1,\) we have a complete observation, otherwise we have an incomplete observation. Let n be the number of complete observations, and we require that \(n>p\). For \(i=1, \ldots , N,\) we define the missing data probabilities as

$$\begin{aligned}&\pi _{ji}=P(R_{ji}=1|Y_i,X_{0i}), j=1, \ldots , q, \text{ and } \\&\pi _{0i} = P(R_{0i}=1|Y_i,X_{0i}), \end{aligned}$$

where \(\pi _{ji} \ge \pi _{0i}\). Throughout, we suppose that \(\pi _{0i}> c > 0\) with probability 1 for some c, and \((Y_i, X^{T}_{1i}, \ldots , X^{T}_{qi}, R_{1i}, \ldots , R_{qi})\), \(i={1, \ldots , N}\) are independent and identically distributed.

Estimation methods

We start with the case where the missing data probabilities \(\pi _{ji}\), \(j=0,1, \ldots , q,\) are known. For example, in a PQD the missing data probabilities are known functions of the fully observed variables. Then, we extend the method to the case where the missing data probabilities can be estimated through parametric regression models.

Unified estimator based on known missing data probabilities

For the covariate vector \(X_j\), \(j=1, \ldots , q,\) defined in Sect. 2, we specify a parametric function \(f_j(X_j; \gamma_j)\) as the conditional mean of Y given \(X_j\), where \(\gamma _j\) is a vector of parameters. We call \(f_j(X_j; \gamma _j)\), \(j=1, \ldots , q,\) a set of working regression models or surrogate models and \(\gamma =(\gamma ^{T}_1, \ldots , \gamma ^{T}_q)^{T}\) a vector of surrogate parameters. For convenience, we denote the model of interest \(f(X_0; \beta )\) as \(f_0(X_0; \beta )\). Next, we review standard estimators for the regression parameters, \(\beta\) and \(\gamma\), discuss the association of the estimators, and then propose a new estimator for \(\beta\) based on the standard estimators.

We know that when the missing data probabilities depend on the response variable, given the covariates in the regression model, the regular complete case (CC) estimators can be biased. Therefore, we consider inverse probability weighting (IPW) estimators derived from IPW estimating equations (Horvitz and Thompson 1952; Lawless et al. 1999).

Let

$$\begin{aligned} S_i(\beta ) = \frac{R_{0i}}{\pi _{0i}} w_{0}(X_{0i}) \{Y_i-f_0(X_{0i}; \beta )\}, \end{aligned}$$
(1)

where \(w_0(X_{0i})\) is a vector corresponding to known functions of \(X_{0i}\), and \(\beta ^*\) be the unique solution of the IPW estimating equation:

$$\begin{aligned} E\{S_i(\beta )\}=0. \end{aligned}$$

Similarly, let \(\phi _{i}(\gamma )=\{\phi _{1i}(\gamma _1)^T, \cdots , \phi _{qi}(\gamma _q)^T\}^T\) with

$$\begin{aligned} \phi _{ji}(\gamma _j) = \frac{R_{0i}}{\pi _{0i}} w_{j}(X_{ji})\{Y_i-f_j(X_{ji};\gamma _j)\}, \end{aligned}$$
(2)

and \(\gamma ^*\) be the unique solution of the IPW estimating equation

$$\begin{aligned} E\{\phi _i(\gamma )\}=0. \end{aligned}$$

We note that the above estimating function for \(\gamma\) is based on the complete observations only and is therefore not efficient. A more efficient estimating function for \(\gamma\) can be built as follows. Let \(\varphi _{i}(\gamma )=\{\varphi _{1i}(\gamma _1)^T, \cdots , \varphi _{qi}(\gamma _q)^T\}^T\) with

$$\begin{aligned} \varphi _{ji}(\gamma _j) = \frac{R_{ji}}{\pi _{ji}} w_{j}(X_{ji})\{Y_i-f_j(X_{ji}; \gamma _j)\}, \end{aligned}$$
(3)

where all the observed data, including the complete and incomplete observations, can make contributions to \(\varphi _{i}(\gamma )\). For MAR data, it can be shown that \(E\{\phi _i(\gamma )\}=E\{\varphi _i(\gamma )\}\) and \(\gamma ^*\) is also the unique solution of the IPW estimating equation:

$$\begin{aligned} E\{\varphi _i(\gamma )\}=0\end{aligned}$$
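The equality \(E\{\phi _i(\gamma )\}=E\{\varphi _i(\gamma )\}\) stated above can be sketched by iterated expectations. The sketch uses only the definitions of \(\pi _{ji}\) and \(\pi _{0i}\) in Sect. 2, the positivity condition \(\pi _{ji} \ge \pi _{0i}> c > 0\), and the fact that \(X_{ji}\) is a sub-vector of \(X_{0i}\); it is an outline rather than a full proof. Conditioning on \((Y_i, X_{0i})\), for each \(j=1, \ldots , q\),

$$\begin{aligned} E\{\varphi _{ji}(\gamma _j)\}&= E\left[ \frac{E(R_{ji}|Y_i,X_{0i})}{\pi _{ji}} w_{j}(X_{ji})\{Y_i-f_j(X_{ji}; \gamma _j)\}\right] = E\left[ w_{j}(X_{ji})\{Y_i-f_j(X_{ji}; \gamma _j)\}\right] , \\ E\{\phi _{ji}(\gamma _j)\}&= E\left[ \frac{E(R_{0i}|Y_i,X_{0i})}{\pi _{0i}} w_{j}(X_{ji})\{Y_i-f_j(X_{ji}; \gamma _j)\}\right] = E\left[ w_{j}(X_{ji})\{Y_i-f_j(X_{ji}; \gamma _j)\}\right] . \end{aligned}$$

Both weighted estimating functions therefore have the same expectation as the unweighted full-data estimating function, and hence share the same root \(\gamma ^*\).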

Then, we obtain IPW estimators \(\hat{\beta }\), \(\hat{\gamma }\), and \(\bar{\gamma }\) by solving the IPW estimating equations (4)–(6), respectively.

$$\begin{aligned} \sum _{i=1}^N S_i(\beta )= & {} 0,~~ \end{aligned}$$
(4)
$$\begin{aligned} \sum _{i=1}^N \phi _i(\gamma )= & {} 0,~\text{ and }\end{aligned}$$
(5)
$$\begin{aligned} \sum _{i=1}^N \varphi _i(\gamma )= & {} 0.~~~ \end{aligned}$$
(6)

We emphasize that \(\hat{\gamma }\) and \(\bar{\gamma }\) are estimators of \(\gamma\) corresponding to different estimating equations. For convenience, in the rest of the article we write the parameter in (3) and (6) as \(\tau\), so that \(\varphi _{i}(\gamma )\) becomes \(\varphi _{i}(\tau )\), the estimator \(\bar{\gamma }\) becomes \(\hat{\tau }\), and \(\tau ^*=\gamma ^*\).
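For a linear mean model, equations (4)–(6) reduce to weighted normal equations. The following is a minimal sketch, assuming a linear working model \(f_j(X_j; \gamma _j)=(1, X_j^T)\gamma _j\) with \(w_j(X_j)=(1, X_j^T)^T\); this choice and the function name `ipw_linear_fit` are illustrative assumptions, not the only option covered by the text.

```python
import numpy as np


def ipw_linear_fit(design, y, observed, prob):
    """Solve sum_i (R_i / pi_i) * d_i * (y_i - d_i' theta) = 0 for theta.

    design   : (N, k) array with rows d_i (include a column of ones)
    y        : (N,) fully observed response
    observed : (N,) 0/1 indicator R_i
    prob     : (N,) probabilities pi_i, assumed strictly positive
    Rows with observed == 0 receive weight zero, so their (possibly
    missing) design entries are replaced by 0 before solving.
    """
    r = np.asarray(observed, dtype=float)
    weights = r / np.asarray(prob, dtype=float)
    d = np.where(r[:, None] == 1, design, 0.0)
    lhs = d.T @ (weights[:, None] * d)
    rhs = d.T @ (weights * y)
    return np.linalg.solve(lhs, rhs)


# Toy check with a noiseless line y = 1 + 2x: any positive weights recover (1, 2).
x = np.array([0.0, 1.0, 2.0, 3.0, np.nan, np.nan])
y = 1.0 + 2.0 * np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
design = np.column_stack([np.ones_like(x), x])
r_j = np.array([1, 1, 1, 1, 0, 0])   # X_j observed
r_0 = np.array([1, 1, 0, 1, 0, 0])   # complete observation
pi_j = np.full(6, 0.7)
pi_0 = np.array([0.5, 0.4, 0.6, 0.5, 0.3, 0.4])

gamma_hat = ipw_linear_fit(design, y, r_0, pi_0)  # form of equation (5)
tau_hat = ipw_linear_fit(design, y, r_j, pi_j)    # form of equation (6)
```

The same function gives \(\hat{\beta }\) in (4) when `design` holds the full covariate vector \(X_0\) and the weights are \(R_{0i}/\pi _{0i}\). With real data the two fits differ, and that difference is what the unified estimator exploits.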

Proposition 1

Under standard regularity conditions, \(\hat{\beta }\), \(\hat{\gamma }\), and \(\hat{\tau }\) are consistent and asymptotically normal

$$\begin{aligned} \sqrt{N}\left( \begin{array}{c} \hat{\beta }-\beta ^*\\ \hat{\gamma }-\gamma ^*\\ \hat{\tau }-\tau ^* \end{array} \right) \overset{d}{\rightarrow } N (0, [E\{\varGamma _i(\beta ^*,\gamma ^*,\tau ^*)\}]^{-1} var\{U_i(\beta ^*,\gamma ^*,\tau ^*)\} \nonumber \\ \times [E\{\varGamma _i(\beta ^*,\gamma ^*,\tau ^*)\}]^{-1^T}),~~~~~~~~~~~~ \end{aligned}$$
(7)

where \(\varGamma _i(\beta ^*,\gamma ^*,\tau ^*)=\bigtriangledown _{(\beta ^T,\gamma ^T,\tau ^T)} U_i(\beta ^*,\gamma ^*,\tau ^*)\) with \(U_i(\beta ,\gamma ,\tau )=\{S_i^T(\beta ),\phi _i^T(\gamma ),\varphi _i^T(\tau )\}^T\).
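The sandwich form in (7) can be estimated by plugging in empirical moments. The following is a minimal sketch, assuming the per-subject values of \(U_i\) at the estimates and the empirical mean of its Jacobian are already available; the function name `sandwich_cov` is hypothetical.

```python
import numpy as np


def sandwich_cov(U, G):
    """Plug-in estimate of Var(theta_hat) from the sandwich form.

    U : (N, d) array, row i is U_i evaluated at the estimates
    G : (d, d) array, empirical mean of the Jacobian of U_i
    Returns G^{-1} Var(U_i) G^{-T} / N, i.e. the asymptotic covariance
    of sqrt(N) * (theta_hat - theta*) divided by N.
    """
    U = np.asarray(U, dtype=float)
    G = np.asarray(G, dtype=float)
    N = U.shape[0]
    centered = U - U.mean(axis=0)
    meat = centered.T @ centered / N
    G_inv = np.linalg.inv(G)
    return G_inv @ meat @ G_inv.T / N


# Toy usage: estimating a mean mu from U_i(mu) = y_i - mu, whose derivative is -1.
y = np.array([1.0, 2.0, 3.0, 4.0])
U = (y - y.mean())[:, None]
G = np.array([[-1.0]])
var_mean = sandwich_cov(U, G)
```

In the toy case the result reduces to the familiar sample variance divided by N. For the stacked vector \((\hat{\beta }^T, \hat{\gamma }^T, \hat{\tau }^T)^T\), the Jacobian is block diagonal because each block of \(U_i\) depends on only one of the three parameters.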

The above proposition indicates that the conditional distribution of \(\sqrt{N}(\hat{\beta }-\beta ^*)\), given \(\sqrt{N}(\hat{\gamma }- \hat{\tau }),\) is asymptotically normal with mean

$$\begin{aligned} \sqrt{N}Cov(\hat{\beta }, \hat{\gamma }-\hat{\tau }) \{Var(\hat{\gamma }-\hat{\tau })\}^{-1} (\hat{\gamma }-\hat{\tau }). \end{aligned}$$

Therefore, we propose to estimate \(\beta ^*\) as

$$\begin{aligned} \hat{\hat{\beta }}=\hat{\beta }-\widehat{Cov}(\hat{\beta }, \hat{\gamma }-\hat{\tau }) \{\widehat{Var}(\hat{\gamma }-\hat{\tau })\}^{-1} (\hat{\gamma }-\hat{\tau }), \end{aligned}$$
(8)

where \(\widehat{Cov}(\hat{\beta }, \hat{\gamma }-\hat{\tau })\) and \(\widehat{Var}(\hat{\gamma }-\hat{\tau })\) are the empirical estimates. The asymptotic variance of \(\hat{\hat{\beta }}\) can be estimated as

$$\begin{aligned} \widehat{Var}(\hat{\hat{\beta }})= \widehat{Var}(\hat{\beta }) -\widehat{Cov}(\hat{\beta }, \hat{\gamma }-\hat{\tau }) \{\widehat{Var}(\hat{\gamma }-\hat{\tau })\}^{-1} \widehat{Cov}(\hat{\gamma }-\hat{\tau }, \hat{\beta }). \end{aligned}$$
(9)

The second term in (9) is positive semidefinite. Therefore, asymptotically the new estimator \(\hat{\hat{\beta }}\) is at least as efficient as the standard IPW estimator \(\hat{\beta }\). Importantly, the consistency of \(\hat{\hat{\beta }}\) does not depend on the correct specification of the parametric working models \(f_j(X_j; \gamma _j)\), \(j=1, \cdots , q\). However, adequately specified working regression models can improve the efficiency of \(\hat{\hat{\beta }}.\) (See the numerical studies in Sections 4 and 5.)
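Equations (8) and (9) are a linear adjustment of \(\hat{\beta }\) by the contrast \(\hat{\gamma }-\hat{\tau }\). The following is a minimal sketch, assuming a joint covariance estimate `V` of the stacked vector \((\hat{\beta }, \hat{\gamma }, \hat{\tau })\) is available (for example from the sandwich form in Proposition 1 divided by N); the function name `unified_estimate` and the toy numbers are hypothetical.

```python
import numpy as np


def unified_estimate(beta_hat, gamma_hat, tau_hat, V):
    """Return the adjusted estimate (8) and its variance estimate (9).

    V is the estimated joint covariance of (beta_hat, gamma_hat, tau_hat),
    stacked in that order.
    """
    b = np.atleast_1d(np.asarray(beta_hat, dtype=float))
    g = np.atleast_1d(np.asarray(gamma_hat, dtype=float))
    t = np.atleast_1d(np.asarray(tau_hat, dtype=float))
    p, m = b.size, g.size
    # Linear map A with A @ (beta, gamma, tau) = (beta, gamma - tau)
    A = np.zeros((p + m, p + 2 * m))
    A[:p, :p] = np.eye(p)
    A[p:, p:p + m] = np.eye(m)
    A[p:, p + m:] = -np.eye(m)
    W = A @ np.asarray(V, dtype=float) @ A.T
    V_bb = W[:p, :p]   # Var(beta_hat)
    C = W[:p, p:]      # Cov(beta_hat, gamma_hat - tau_hat)
    D = W[p:, p:]      # Var(gamma_hat - tau_hat)
    K = np.linalg.solve(D, C.T).T   # C @ D^{-1}, using symmetry of D
    beta_ue = b - K @ (g - t)
    var_ue = V_bb - K @ C.T
    return beta_ue, var_ue


# Toy scalar example with made-up numbers (not from the article).
V = np.array([[4.0, 1.0, 0.0],
              [1.0, 2.0, 0.5],
              [0.0, 0.5, 1.0]])
beta_ue, var_ue = unified_estimate(1.0, 0.6, 0.2, V)
```

In this toy example the covariance of \(\hat{\beta }\) with the contrast is 1 and the variance of the contrast is 2, so the estimate moves from 1 to 0.8 and the variance drops from 4 to 3.5.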

We call the new estimator \(\hat{\hat{\beta }}\) in (8) a unified estimator (UE). The differences between our proposed unified estimator and the unified estimator in Chen and Chen (2000) include the following: (i) Our proposed estimator can deal with MAR data, but the estimator in Chen and Chen (2000) requires MCAR data; (ii) our estimator can use all the data including the fully observed data and the partially observed data, but the estimator in Chen and Chen (2000) can only use the values in the fully observed rows or the fully observed columns for a data set with missing values; and (iii) the estimator in Chen and Chen (2000) is derived from the conditional distribution of \(\sqrt{N}(\hat{\beta }-\beta ^*)\), given \(\sqrt{N}(\hat{\gamma }- \gamma ^*)\), and then it replaces the unknown fixed \(\gamma ^*\) with an estimator \(\hat{\tau }\). Therefore, the resulting estimator neglects the variance of \(\hat{\tau }\) and the covariances between \(\hat{\tau }\) and \((\hat{\beta }, \hat{\gamma })\). However, for MCAR data when there is a single missing group, it can be shown that the two estimators are equivalent.

Unified estimator based on estimated missing data probabilities

In practice, the missing data probabilities are often unknown. Next, we explain how to compute a unified estimator when the missing data probabilities are estimated from parametric models. (See Sun and Tchetgen 2018 for models for the missing data probabilities.)

Suppose the missing data probabilities are estimated from a parametric model \(\psi _{i}(\alpha )\) where \(\alpha\) is a vector of parameters. Let \(\alpha ^*\) be the unique solution of the estimating equation

$$\begin{aligned} E\{\psi _i(\alpha )\}=0. \end{aligned}$$

We replace the missing data probabilities in equations (1)–(3) with \(\pi _{0i}(\alpha ),\pi _{1i}(\alpha ), \cdots , \pi _{qi}(\alpha )\), and denote the corresponding functions as \(S_{i, \alpha }(\beta ), \phi _{i, \alpha }(\gamma ), \varphi _{i, \alpha }(\tau )\). Let \(\hat{\alpha }\) be the solution of the estimating equation

$$\begin{aligned} \sum _{i=1}^N \psi _i(\alpha ) = 0, \end{aligned}$$
(10)

and denote the corresponding IPW estimators as \(\hat{\beta }_{\hat{\alpha }}\), \(\hat{\gamma }_{\hat{\alpha }}\), and \(\hat{\tau }_{\hat{\alpha }}\).
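As one concrete instance of equation (10), the following is a minimal sketch that takes \(\psi _i(\alpha )\) to be the score of a logistic regression of an indicator on fully observed predictors and solves it by Newton–Raphson. The logistic form, the function name `fit_logistic`, and the toy data are assumptions for illustration; the text allows any correctly specified parametric model.

```python
import numpy as np


def fit_logistic(Z, r, n_iter=25):
    """Solve sum_i Z_i * (r_i - expit(Z_i' alpha)) = 0 by Newton-Raphson.

    Z : (N, k) predictors, including a column of ones
    r : (N,) 0/1 indicator (for example R_j or R_0)
    """
    alpha = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ alpha))
        score = Z.T @ (r - p)
        info = Z.T @ (Z * (p * (1.0 - p))[:, None])
        alpha = alpha + np.linalg.solve(info, score)
    return alpha


# Toy data: one binary predictor; observed fractions are 1/2 and 2/3.
Z = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]])
r = np.array([0.0, 1.0, 1.0, 1.0, 0.0])
alpha_hat = fit_logistic(Z, r)
pi_hat = 1.0 / (1.0 + np.exp(-Z @ alpha_hat))  # fitted probabilities
```

The fitted probabilities `pi_hat` would then replace the known probabilities in the weights of the IPW estimating equations. The sketch has no safeguard against separation or a singular information matrix, which a production routine would need.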

Proposition 2

Assuming the missing data model \(\psi _i(\alpha )\) is correctly specified, under standard regularity conditions \(\hat{\beta }_{\hat{\alpha }}\), \(\hat{\gamma }_{\hat{\alpha }}\), and \(\hat{\tau }_{\hat{\alpha }}\) are consistent and asymptotically normal

$$\begin{aligned} \sqrt{N}\left( \begin{array}{c} \hat{\beta }_{\hat{\alpha }}-\beta ^*\\ \hat{\gamma }_{\hat{\alpha }}-\gamma ^*\\ \hat{\tau }_{\hat{\alpha }}-\tau ^* \end{array} \right) \overset{d}{\rightarrow } N(0, [E\{\varGamma _{i, \alpha ^*}(\beta ^*,\gamma ^*,\tau ^*)\}]^{-1} var\{Res_{i}(\beta ^*,\gamma ^*,\tau ^*,\alpha ^*)\}\nonumber \\ \times [E\{\varGamma _{i, \alpha ^*}(\beta ^*,\gamma ^*,\tau ^*)\}]^{-1^T}),~~~~~~~~~~~~~~~~ \end{aligned}$$
(11)

where \(\varGamma _{i, \alpha ^*}(\beta ^*,\gamma ^*,\tau ^*)=\bigtriangledown _{(\beta ^T,\gamma ^T,\tau ^T)} U_{i, \alpha ^*}(\beta ^*,\gamma ^*,\tau ^*)\) with \(U_{i, \alpha }(\beta ,\gamma ,\tau )=\{S^T_{i, \alpha }(\beta ), \phi ^T_{i, \alpha }(\gamma ),\varphi ^T_{i, \alpha }(\tau )\}^T\), and \(Res_{i}(\beta ^*,\gamma ^*,\tau ^*,\alpha ^*)=U_{i, \alpha ^*}(\beta ^*,\gamma ^*,\tau ^*)-E\{U_{i, \alpha ^*}(\beta ^*,\gamma ^*,\tau ^*) \times \psi _i^T(\alpha ^*)\}[E\{\psi _i(\alpha ^*)\psi _i^T(\alpha ^*)\}]^{-1}\psi _i(\alpha ^*)\).

Then, following the same idea as in Section 3.1 we get a unified estimator \(\hat{\hat{\beta }}_{\hat{\alpha }}\)

$$\begin{aligned} \hat{\hat{\beta }}_{\hat{\alpha }}=\hat{\beta }_{\hat{\alpha }}-\widehat{Cov}(\hat{\beta }_{\hat{\alpha }}, \hat{\gamma }_{\hat{\alpha }}-\hat{\tau }_{\hat{\alpha }}) \{\widehat{Var}(\hat{\gamma }_{\hat{\alpha }}-\hat{\tau }_{\hat{\alpha }})\}^{-1} (\hat{\gamma }_{\hat{\alpha }}-\hat{\tau }_{\hat{\alpha }}), \end{aligned}$$
(12)

and its asymptotic variance can be estimated as

$$\begin{aligned} \widehat{Var}(\hat{\hat{\beta }}_{\hat{\alpha }})= \widehat{Var}(\hat{\beta }_{\hat{\alpha }}) -\widehat{Cov}(\hat{\beta }_{\hat{\alpha }}, \hat{\gamma }_{\hat{\alpha }}-\hat{\tau }_{\hat{\alpha }}) \{\widehat{Var}(\hat{\gamma }_{\hat{\alpha }}-\hat{\tau }_{\hat{\alpha }})\}^{-1} \widehat{Cov}(\hat{\gamma }_{\hat{\alpha }}-\hat{\tau }_{\hat{\alpha }}, \hat{\beta }_{\hat{\alpha }}). \end{aligned}$$
(13)

It can be shown that \(\hat{\hat{\beta }}_{\hat{\alpha }}\) has properties similar to those of \(\hat{\hat{\beta }}\) if the model for the missing data probabilities is correctly specified. We note that, even when the true missing data probabilities are known, using estimated missing data probabilities instead of the known ones can be more efficient (see Breslow et al. 2009; Chatterjee et al. 2003; Lawless et al. 1999; Robins et al. 1994).

Simulation studies

The simulation study has two parts. In part (A), we assume that the covariates are MAR, but the response variable is fully observed, while in part (B) we assume that both the response variable and the covariates are MAR. We consider a linear regression model

$$\begin{aligned} Y=\beta _0+\beta _1X_1+\beta _2X_2+\beta _3X_3+\epsilon \end{aligned}$$
(14)

and a logistic regression model

$$\begin{aligned} logit\{P(Y=1|X_1,X_2,X_3)\}=\beta _0+\beta _1X_1+\beta _2X_2+\beta _3X_3. \end{aligned}$$
(15)

In part (A), \(X_2\) is generated from the exponential distribution with mean 1, and \(X_1\), \(X_3,\) and \(\epsilon\) are generated independently from the standard normal distribution. We assume that \(\{Y, X_1\}\) are fully observed but both \(X_2\) and \(X_3\) are MAR with missing indicator variables \(R_2\) and \(R_3\) generated as follows:

$$\begin{aligned}&logit\{P(R_3=1| Y, X_1)\}=\alpha _{30}+\alpha _{31}Y+\alpha _{32}X_1, ~\text{ and }~\\&logit\{P(R_2=1|Y, X_1, R_3)\}=\alpha _{20}+\alpha _{21}Y+\alpha _{22}X_1+\alpha _{23}R_3. \end{aligned}$$

Let \(\beta =(\beta _0, \beta _1, \beta _2, \beta _3)^T=(0.1,1,1,1)^{T}\) and \(\alpha _{(A)}=(\alpha _{20}, \alpha _{21}, \alpha _{22}, \alpha _{23}, \alpha _{30}, \alpha _{31}, \alpha _{32})^T = ( -1, 0.2,0.2,0.2,0.8, 0.2, 0.2)^T\) in the linear regression model, and \(\beta =(-1.2,1,1,1)^{T}\) and \(\alpha _{(A)}=(-1, 0.3,0.2,0.2,0.9, 0.2, 0.2)^T\) in the logistic regression model. Here, the number of distinct missing patterns is \(q=3\). The working regression models are \(f_1(X_1; \gamma _1)\), \(f_2(X_2; \gamma _2),\) and \(f_3(X_3; \gamma _3)\). We notice that \(X_1\) is fully observed, and we may include \(X_1\) in each working model to improve model fitting. Therefore, we compute another unified estimator using a different set of working regression models: \(f_1(X_1; \gamma _1)\), \(f_2\{(X_1, X_2); \gamma _2\}\) and \(f_3\{(X_1, X_3); \gamma _3\}\). We denote the corresponding estimators from the two different sets of working models as UE (I) and UE (II), respectively. We use linear regression models and logistic regression models as the working regression models for the cases where the true response models of interest are the linear regression model and the logistic regression model, respectively. We note that in the logistic regression case, the logistic working regression models are misspecified, but still useful for increasing efficiency. The unified estimates are computed using known \(\pi _{ji}\), estimated \(\pi _{ji}\) from the true models, and estimated \(\pi _{ji}\) from logistic regression models for the marginal missing data probabilities, respectively. We note that the logistic regression model for the marginal distribution of \(R_0\), given \((Y, X_1)\), is not the true model.
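The data-generating process of part (A) for the linear model can be sketched as follows. The parameter values are those given above; the function name `simulate_part_a`, the returned dictionary layout, and the way the known probabilities are derived from the two logistic models are assumptions for illustration, not a description of the authors' code.

```python
import numpy as np


def expit(z):
    return 1.0 / (1.0 + np.exp(-z))


def simulate_part_a(N, rng, beta=(0.1, 1.0, 1.0, 1.0),
                    alpha2=(-1.0, 0.2, 0.2, 0.2), alpha3=(0.8, 0.2, 0.2)):
    """Generate one data set for part (A), linear regression model (14)."""
    x1 = rng.standard_normal(N)
    x2 = rng.exponential(1.0, N)          # exponential with mean 1
    x3 = rng.standard_normal(N)
    eps = rng.standard_normal(N)
    y = beta[0] + beta[1] * x1 + beta[2] * x2 + beta[3] * x3 + eps

    # R_3 depends on (Y, X_1); R_2 depends on (Y, X_1, R_3).
    pi3 = expit(alpha3[0] + alpha3[1] * y + alpha3[2] * x1)
    r3 = rng.binomial(1, pi3)
    lin2 = alpha2[0] + alpha2[1] * y + alpha2[2] * x1
    r2 = rng.binomial(1, expit(lin2 + alpha2[3] * r3))

    # Probabilities given (Y, X_1) implied by the two logistic models:
    # pi0 = P(R_2 = 1, R_3 = 1), pi2 = P(R_2 = 1) after summing over R_3.
    pi0 = pi3 * expit(lin2 + alpha2[3])
    pi2 = pi0 + (1.0 - pi3) * expit(lin2)
    return {"y": y, "x1": x1, "x2": x2, "x3": x3,
            "r2": r2, "r3": r3, "r0": r2 * r3,
            "pi0": pi0, "pi2": pi2, "pi3": pi3}


data = simulate_part_a(1000, np.random.default_rng(2024))
```

The implied marginal probability `pi2` is a mixture of two logistic curves, so a single logistic regression of \(R_2\) on \((Y, X_1)\) is only an approximation; the same holds for \(R_0\), in line with the remark above.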

Table 2 Simulation results

In part (B), we consider the setting used in Sun and Tchetgen (2018), where X follows a truncated normal distribution on the support \(X \in [0,2]^3\) with \(X_1\sim N(1, 0.5^2)\), \(X_2 \sim N(X_1+X_1^2, 0.5^2),\) and \(X_3 \sim N(X_2+0.8X_1X_2, 0.5^2)\), and Y follows the logistic regression model in (15). Let R be a three-level multinomial random variable, \(R=1, 2,\) or 3 with the following probabilities:

$$\begin{aligned}&logit\{P(R=2|Y, X_1, X_2, X_3)\}=\alpha _{20}+\alpha _{21}Y+\alpha _{22}X_1+\alpha _{23}X_2+\alpha _{24}X_3,\\&logit\{P(R=3|Y, X_1, X_2, X_3)\}=\alpha _{30}+\alpha _{31}Y+\alpha _{32}X_1+\alpha _{33}X_2+\alpha _{34}X_3. \end{aligned}$$

If \(R_i=1,\) we have a complete observation \((Y_i, X_{1i},X_{2i},X_{3i})\); if \(R_i=2,\) we have \((Y_i, X_{1i})\); and if \(R_i=3,\) we have \((X_{2i}, X_{3i})\). Let \(\alpha _{(B)}=(\alpha _{20}, \alpha _{21}, \alpha _{22}, \alpha _{23}, \alpha _{24}, \alpha _{30}, \alpha _{31}, \alpha _{32}, \alpha _{33}, \alpha _{34})^T\). Let \(\beta =(-2.5,0.7,0.8,1)^{T}\). We consider three missing data settings: (1) MAR, where the missingness depends on the observed covariates and response variable with \(\alpha _{(B)}=(-0.8,-1.8,0.2,0,0,-1.2,0,0,0.3,0.3)\); (2) MAR, where the missingness depends on the observed covariates but not the response variable with \(\alpha _{(B)}=(-0.8,0,0.2,0,0,-1.2,0,0,0.3,0.3)\); and (3) MNAR, where the missingness depends on both the observed and the missing covariates but not the response variable with \(\alpha _{(B)}=(-1,0,0.3,-0.1,-0.2,-1.4,0,0.4,-0.8,0.1)\). The working regression models are \(f_1(X_1; \gamma _1)\) and \(f_2\{(X_2, X_3); \gamma _2\}\). To compare with the estimator proposed in Sun and Tchetgen (2018), we report results of the full data MLEs, CC estimates, IPW estimates, and UEs in the same format as Sun and Tchetgen (2018).

Table 3 Simulation results

The simulation results for part (A) based on sample size \(N=1000\) with 1000 replications are given in Table 2. We see that (i) in general the UEs have smaller biases and standard errors (s.e.’s) compared to the IPW estimates, especially for the logistic regression model, and the empirical \(95\%\) coverage probabilities are close to the nominal level for the logistic regression model and slightly conservative for the linear regression model; (ii) the estimates using the estimated \(\pi _{ji}\) are slightly more efficient than those using the known \(\pi _{ji}\), and we also note that the results under the true model for \(\pi _{ji}\) are very similar to those based on the logistic regression models for the marginal missing data probabilities, as the logistic regression models fit the data adequately in our case; (iii) the UE (II) with fully observed \(X_1\) included in each working model has a smaller bias and s.e. compared to the corresponding UE (I) in most of the cases, especially for the logistic regression model.

Table 3 shows the simulation results of part (B) for sample sizes \(N=1000\) and 2000 with 1000 replications. We see that in most of the cases the UEs have smaller biases and RMSEs (square roots of the empirical mean squared errors) compared to the IPW estimates. We also report an alternative UE (see \(\hbox {UE}^*\) in Table 3) computed from the CC estimates of \((\beta ,\gamma ,\tau )\), which does not require the missing data probabilities. We see that when the missingness does not depend on the response variable, including the MNAR case (see settings (2) and (3)), the \(\hbox {UE}^*\)’s have smaller RMSEs compared to the IPW estimates and their empirical \(95\%\) coverage probabilities are close to the nominal level. This is because the CC analysis is valid when the missingness does not depend on the response variable. When data are MNAR, sufficient models for the missing data probabilities are not available. Unless the missing data probabilities are known, consistent IPW estimates and UEs cannot be computed. Therefore, we may use the alternative \(\hbox {UE}^*\) when sufficient models for the missing data probabilities are not available and the missingness does not depend on the response variable.

Example

Table 4 Analysis of hip fracture data

We consider a case–control study of risk factors of hip fractures among male veterans where covariate values were missing by chance. The study was carried out at the University of Illinois at Chicago College of Medicine (Barengolts et al. 2001; Chen 2004; Chen et al. 2011), where a case was matched with a control on age and race, and 25 potential risk factors in addition to age and race were recorded. One major analysis is fitting a logistic regression model with nine potentially important risk factors identified in preliminary exploratory analyses.

There are 436 subjects in the study. The number of persons with complete observations is 237, and the overall percentage of covariate values missing is \(10.81\%\). Because each covariate has a unique missingness pattern, we set \(q=9\), and we have \(X_j\), \(R_j\) for \(j=1, \ldots , 9\). The list of variables is given in Table 2 of Chen (2004).

We also observe that the \(R_j\)’s within each of the three pairs, \((R_1, R_2)\), \((R_3, R_4),\) and \((R_5, R_6)\), are very similar although not identical. Therefore, as an alternative we can set \(q=6\) by combining \(X_1\) with \(X_2\), \(X_3\) with \(X_4\), and \(X_5\) with \(X_6\), that is, we assume that there are 6 distinct missing groups. We note that with \(q=9\) we need 9 working models which can use all the observed covariate values in the analysis, while with \(q=6\) we only need 6 working models but 32 observed covariate values cannot be used in the analysis. For example, with variables \(X_1\) and \(X_2\) (Etoh and Smoke) combined, there are 8 individuals for which Etoh is observed but not Smoke, so these 8 individuals cannot be used in the working model \(f_1\{(X_1,X_2);\tau _1\}\). However, the working regression models with \(q=6\) are closer to the regression model of interest. Following Chen (2004), we assume that the covariates are MAR. We estimate the missing data probabilities, \(\pi _{ji}\)’s, using logistic regression models with hip fracture (the binary outcome variable), age, and race as predictors.

We report the unified estimates (UE (I) and UE (II) for \(q=9\) and 6, respectively), together with the CC estimates and the IPW estimates, in Table 4. For comparison, we have added the results of the semiparametric maximum likelihood (SPML) estimates of Chen (2004) and the multiple imputation (MI) estimates of Chen et al. (2011). We see that the unified estimates have smaller s.e.’s than the CC and IPW estimates, and the unified estimates with \(q=6\) are more efficient than the corresponding estimates with \(q=9\). This suggests that combining covariates with similar \(R_j\)’s can improve the estimation efficiency of the unified estimates when the proportion of observed covariate values that cannot be used in the analysis due to combining \(R_j\)’s is not high. The unified estimates are very close to the SPML estimates and the MI estimates, although the s.e.’s of the SPML estimates and the MI estimates are slightly smaller than those of the unified estimates. The IPW estimate and the unified estimates both indicate that LevoT4 is a significant factor, whereas the SPML estimate and the MI estimate do not. The differences seen here in some of the estimates cannot really be assessed without examining the validity of the MAR assumption and the assumptions in the semiparametric likelihood approach and MI methods.

Discussion

The proposed unified estimation method uses a set of parametric working regression models to recover information from incomplete observations and auxiliary variables, which is convenient in accommodating arbitrary nonmonotone missing data patterns and incorporating auxiliary variables in the analysis. To improve estimation efficiency, we may use the estimated missing data probabilities instead of the known missing data probabilities if sufficient models for the missing data probabilities can be identified. We can always include auxiliary variables in the models for the missing data probabilities to improve the model fitting. Models and some discussion on modeling the missing data probabilities for nonmonotone MAR data are provided in Sun and Tchetgen (2018) and Zhao (2020). When sufficient models for the missing data probabilities are not available and the missingness does not depend on the response variable, a reliable UE can be computed based on the CC estimates; this includes MNAR data, as in the simulation studies.

In practice, to improve estimation efficiency we should select a set of working regression models which are highly correlated with the original regression model of interest. This is often achieved by choosing the set of working regression models the same as the original regression model of interest with different sets of covariates and auxiliary variables according to missing data patterns as we did in the numerical studies. Therefore, the missing data patterns and auxiliary variables define a set of unique working regression models, which further standardizes the proposed estimation method.

Finally, compared with the augmented estimating equations (Robins et al. 1994; Sun and Tchetgen 2018), the unified estimation method uses a set of “full” data models (Robins et al. 1994) to deal with the missing data problem in regression analysis. It does not require approximating unknown functions, as the augmented estimating equation approach does, and it can be easily implemented in standard statistical software. The method is also very flexible: it can be applied to a wide range of regression models, including but not limited to generalized linear models, Cox proportional hazards models, parametric models for longitudinal data and partially linear models, as long as the estimating equations can be formulated.

References

  • Barengolts, E., Karanouh, D., Kolodny, L., Kukreja, S.: Risk factors for hip fractures in predominantly African-American veteran male population. J. Bone Miner. Res. 16, S170 (2001)

  • Breslow, N.E., Lumley, T., Ballantyne, C.M., Chambless, L.E., Kulich, M.: Improved Horvitz-Thompson estimation of model parameters from two-phase stratified samples: applications in epidemiology. Stat. Biosciences 1, 32–49 (2009)

  • Breunig, C., Haan, P.: Nonparametric regression with selectively missing covariates. arXiv:1810.00411v2 [econ.EM], 1–37 (2019)

  • Breunig, C., Mammen, E., Simoni, A.: Nonparametric estimation in case of endogenous selection. J. Econom. 202, 268–285 (2018)

  • Chatterjee, N., Chen, Y., Breslow, N.E.: A pseudo-score estimator for regression problems with two-phase sampling. J. Am. Stat. Assoc. 98, 158–168 (2003)

  • Chatterjee, N., Li, Y.: Inference in semiparametric regression models under partial questionnaire design and nonmonotone missing data. J. Am. Stat. Assoc. 105, 787–797 (2010)

  • Chen, H.Y.: Nonparametric and semiparametric models for missing covariates in parametric regression. J. Am. Stat. Assoc. 99, 1176–1189 (2004)

  • Chen, H.Y., Xie, H., Qian, Y.: Multiple imputation for missing values through conditional semiparametric odds ratio models. Biometrics 67, 799–809 (2011)

  • Chen, Y.H., Chen, H.: A unified approach to regression analysis under double-sampling designs. J. R. Stat. Soc. B 62(3), 449–460 (2000)

  • Fitzmaurice, G., Davidian, M., Verbeke, G., Molenberghs, G.: Longitudinal data analysis. Chapman and Hall/CRC, Boca Raton (2009)

  • Han, P.: Multiply robust estimation in regression analysis with missing data. J. Am. Stat. Assoc. 109, 1159–1173 (2014)

  • Horvitz, D.G., Thompson, D.J.: A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 47, 663–685 (1952)

  • Ibrahim, J.G., Chen, M.H., Lipsitz, S.R., Herring, A.H.: Missing-data methods for generalized linear models: A comparative review. J. Am. Stat. Assoc. 100, 332–346 (2005)

  • van der Laan, M.J., Robins, J.M.: Unified Methods for Censored Longitudinal Data and Causality. Springer-Verlag, New York (2003)

  • Lawless, J.F., Kalbfleisch, J.D., Wild, C.J.: Semiparametric methods for response-selective and missing data problems in regression. J. R. Stat. Soc. B 61(2), 413–438 (1999)

  • Lipsitz, S.R., Ibrahim, J.G.: A conditional model for incomplete covariates in parametric regression models. Biometrika 83(4), 916–922 (1996)

  • Lipsitz, S.R., Ibrahim, J.G., Zhao, L.: A weighted estimating equation for missing covariate data with properties similar to maximum likelihood. J. Am. Stat. Assoc. 94, 1147–1160 (1999)

  • Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. Wiley, New York (2002)

  • Robins, J.M., Rotnitzky, A., Zhao, L.P.: Estimation of regression coefficients when some regressors are not always observed. J. Am. Stat. Assoc. 89, 846–866 (1994)

  • Rubin, D.B.: Inference and missing data. Biometrika 63(3), 581–592 (1976)

  • Rubin, D.B.: Multiple Imputation for Nonresponse in Surveys. Wiley, New York (1987)

  • Rubin, D.B.: Multiple imputation after 18+ years. J. Am. Stat. Assoc. 91, 473–489 (1996)

  • Scheuren, F.: Multiple imputation: how it began and continues. Am. Stat. 59, 315–319 (2005)

  • Sun, B., Tchetgen, E.J.T.: On inverse probability weighting for nonmonotone missing at random data. J. Am. Stat. Assoc. 113, 369–379 (2018)

  • Tsiatis, A.: Semiparametric Theory and Missing Data. Springer, New York (2006)

  • Wacholder, S., Carroll, R.J., Pee, D., Gail, M.G.: The partial questionnaire design for case-control studies. Stat. Med. 13, 623–634 (1994)

  • Zhao, L.P., Lipsitz, S.: Designs and analysis of two-stage studies. Stat. Med. 11, 769–782 (1992)

  • Zhao, Y.: Statistical inference for missing data mechanisms. Stat. Med. (2020). https://doi.org/10.1002/sim.8727

  • Zhao, Y., Lawless, J.F., McLeish, D.L.: Likelihood methods for regression models with expensive variables missing by design. Biometrical J. 51, 123–136 (2009)

Acknowledgements

We thank Professor Donald L. McLeish, Professor Jerald F. Lawless, the associate editor, and the anonymous reviewers for their helpful comments and suggestions. We are grateful to Professor Hua Yun Chen for letting us use the hip fracture data.

Author information

Corresponding author

Correspondence to Yang Zhao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This research was partially supported by a grant from the Natural Sciences and Engineering Research Council of Canada (YZ).

About this article

Cite this article

Zhao, Y., Liu, M. Unified approach for regression models with nonmonotone missing at random data. AStA Adv Stat Anal 105, 87–101 (2021). https://doi.org/10.1007/s10182-020-00389-y

  • DOI: https://doi.org/10.1007/s10182-020-00389-y

Keywords

  • Inverse probability weighting
  • Nonmonotone missing at random data
  • Working regression models