We start with the case where the missing data probabilities \(\pi _{ji}\), \(j=0,1, \ldots , q,\) are known. For example, in a PQD the missing data probabilities are known functions of the fully observed variables. Then, we extend the method to the case where the missing data probabilities can be estimated through parametric regression models.
Unified estimator based on known missing data probabilities
For the covariate vector \(X_j\), \(j=1, \ldots , q,\) defined in Sect. 2, we specify a parametric function \(f_j(X_j; \gamma_j)\) as the conditional mean of Y given \(X_j\), where \(\gamma _j\) is a vector of parameters. We call \(f_j(X_j; \gamma _j)\), \(j=1, \ldots , q,\) a set of working regression models or surrogate models and \(\gamma =(\gamma ^{T}_1, \ldots , \gamma ^{T}_q)^{T}\) a vector of surrogate parameters. For convenience, we denote the model of interest \(f(X_0; \beta )\) as \(f_0(X_0; \beta )\). Next, we review standard estimators for the regression parameters, \(\beta\) and \(\gamma\), discuss the association of the estimators, and then propose a new estimator for \(\beta\) based on the standard estimators.
We know that when the missing data probabilities depend on the response variable, given the covariates in the regression model, the regular complete case (CC) estimators can be biased. Therefore, we consider inverse probability weighting (IPW) estimators derived from IPW estimating equations (Horvitz and Thompson 1952; Lawless et al. 1999).
Let
$$\begin{aligned} S_i(\beta ) = \frac{R_{0i}}{\pi _{0i}} w_{0}(X_{0i}) \{Y_i-f_0(X_{0i}; \beta )\}, \end{aligned}$$
(1)
where \(w_0(X_{0i})\) is a vector corresponding to known functions of \(X_{0i}\), and \(\beta ^*\) be the unique solution of the IPW estimating equation:
$$\begin{aligned} E\{S_i(\beta )\}=0. \end{aligned}$$
Similarly, let \(\phi _{i}(\gamma )=\{\phi _{1i}(\gamma _1)^T, \cdots , \phi _{qi}(\gamma _q)^T\}^T\) with
$$\begin{aligned} \phi _{ji}(\gamma _j) = \frac{R_{0i}}{\pi _{0i}} w_{j}(X_{ji})\{Y_i-f_j(X_{ji};\gamma _j)\}, \end{aligned}$$
(2)
and \(\gamma ^*\) be the unique solution of the IPW estimating equation
$$\begin{aligned} E\{\phi _i(\gamma )\}=0. \end{aligned}$$
We note that the above estimating function for \(\gamma\) uses the complete observations only and is therefore not efficient. A more efficient estimating function for \(\gamma\) can be built as follows. Let \(\varphi _{i}(\gamma )=\{\varphi _{1i}(\gamma _1)^T, \cdots ,\) \(\varphi _{qi}(\gamma _q)^T\}^T\) with
$$\begin{aligned} \varphi _{ji}(\gamma _j) = \frac{R_{ji}}{\pi _{ji}} w_{j}(X_{ji})\{Y_i-f_j(X_{ji}; \gamma _j)\}, \end{aligned}$$
(3)
where all the observed data, including the complete and incomplete observations, can make contributions to \(\varphi _{i}(\gamma )\). For MAR data, it can be shown that \(E\{\phi _i(\gamma )\}=E\{\varphi _i(\gamma )\}\) and \(\gamma ^*\) is also the unique solution of the IPW estimating equation:
$$\begin{aligned} E\{\varphi _i(\gamma )\}=0. \end{aligned}$$
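To see why \(E\{\phi _i(\gamma )\}=E\{\varphi _i(\gamma )\}\) under MAR, a standard iterated-expectation argument can be sketched for one component, under the working assumption that \(\pi _{ji}=E(R_{ji}\mid Y_i, X_i)\):
$$\begin{aligned} E\{\varphi _{ji}(\gamma _j)\} = E\left[ \frac{E(R_{ji}\mid Y_i, X_i)}{\pi _{ji}} w_j(X_{ji})\{Y_i-f_j(X_{ji};\gamma _j)\}\right] = E\big [ w_j(X_{ji})\{Y_i-f_j(X_{ji};\gamma _j)\}\big ], \end{aligned}$$
and the same argument applied to \(\phi _{ji}(\gamma _j)\), with \(R_{0i}/\pi _{0i}\) in place of \(R_{ji}/\pi _{ji}\), yields the same right-hand side.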
Then, we obtain IPW estimators \(\hat{\beta }\), \(\hat{\gamma }\), and \(\bar{\gamma }\) by solving the IPW estimating equations (4)–(6), respectively.
$$\begin{aligned} \sum _{i=1}^N S_i(\beta ) = 0, \end{aligned}$$
(4)
$$\begin{aligned} \sum _{i=1}^N \phi _i(\gamma ) = 0, \text{ and } \end{aligned}$$
(5)
$$\begin{aligned} \sum _{i=1}^N \varphi _i(\gamma ) = 0. \end{aligned}$$
(6)
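For a linear working model \(f_0(X_0;\beta )=X_0^T\beta\) with \(w_0(X_{0i})=X_{0i}\), equation (4) has a closed-form solution: IPW estimation reduces to weighted least squares with weights \(R_{0i}/\pi _{0i}\). A minimal numerical sketch (the simulation design and parameter values are illustrative, not from the paper):

```python
import numpy as np

def ipw_linear(y, X, R, pi):
    # Solve sum_i (R_i / pi_i) X_i (y_i - X_i' beta) = 0 for beta:
    # weighted least squares with inverse-probability weights R_i / pi_i.
    w = R / pi
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

rng = np.random.default_rng(0)
N = 20000
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])          # design for f_0(X_0; beta)
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(size=N)

# Missingness depends on the (fully observed) response, so a complete-case
# analysis would be biased here, while the IPW fit remains consistent.
pi = 1.0 / (1.0 + np.exp(-(0.2 + 0.5 * y)))   # known probabilities pi_{0i}
R = rng.binomial(1, pi)                       # complete-case indicators R_{0i}

beta_ipw = ipw_linear(y, X, R, pi)            # close to beta_true for large N
```

The same solver applies to (5) and (6) component-wise, with each surrogate design \(X_{ji}\) and indicator/weight pair in place of \(X_{0i}\) and \(R_{0i}/\pi _{0i}\).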
We emphasize that \(\hat{\gamma }\) and \(\bar{\gamma }\) are estimators of \(\gamma\) corresponding to different estimating equations. For convenience, in the rest of the article we will introduce a new parameter \(\tau\) and denote \(\varphi _{i}(\gamma )\) as \(\varphi _{i}(\tau )\); the corresponding estimator becomes \(\hat{\tau }\).
Proposition 1
Under standard regularity conditions, \(\hat{\beta }\), \(\hat{\gamma }\), and \(\hat{\tau }\) are consistent and asymptotically normal
$$\begin{aligned} \sqrt{N}\left( \begin{array}{c} \hat{\beta }-\beta ^*\\ \hat{\gamma }-\gamma ^*\\ \hat{\tau }-\tau ^* \end{array} \right) \overset{d}{\rightarrow } N\left( 0, [E\{\varGamma _i(\beta ^*,\gamma ^*,\tau ^*)\}]^{-1} Var\{U_i(\beta ^*,\gamma ^*,\tau ^*)\} [E\{\varGamma _i(\beta ^*,\gamma ^*,\tau ^*)\}]^{-T}\right) , \end{aligned}$$
(7)
where \(\varGamma _i(\beta ^*,\gamma ^*,\tau ^*)=\bigtriangledown _{(\beta ^T,\gamma ^T,\tau ^T)} U_i(\beta ^*,\gamma ^*,\tau ^*)\) with \(U_i(\beta ,\gamma ,\tau )=\{S_i^T(\beta ),\phi _i^T(\gamma ),\varphi _i^T(\tau )\}^T\).
The above proposition indicates that the conditional distribution of \(\sqrt{N}(\hat{\beta }-\beta ^*)\), given \(\sqrt{N}(\hat{\gamma }- \hat{\tau })\), is asymptotically normal with mean
$$\begin{aligned} \sqrt{N}Cov(\hat{\beta }, \hat{\gamma }-\hat{\tau }) \{Var(\hat{\gamma }-\hat{\tau })\}^{-1} (\hat{\gamma }-\hat{\tau }). \end{aligned}$$
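This is the usual conditional-mean formula for jointly normal vectors: for such a pair \((A, B)\),
$$\begin{aligned} E(A \mid B) = E(A) + Cov(A, B)\{Var(B)\}^{-1}\{B-E(B)\}, \end{aligned}$$
applied with \(A=\sqrt{N}(\hat{\beta }-\beta ^*)\) and \(B=\sqrt{N}(\hat{\gamma }-\hat{\tau })\); since \(\gamma ^*=\tau ^*\), both \(E(A)\) and \(E(B)\) vanish asymptotically, leaving the displayed expression.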
Therefore, we propose to estimate \(\beta ^*\) as
$$\begin{aligned} \hat{\hat{\beta }}=\hat{\beta }-\widehat{Cov}(\hat{\beta }, \hat{\gamma }-\hat{\tau }) \{\widehat{Var}(\hat{\gamma }-\hat{\tau })\}^{-1} (\hat{\gamma }-\hat{\tau }), \end{aligned}$$
(8)
where \(\widehat{Cov}(\hat{\beta }, \hat{\gamma }-\hat{\tau })\) and \(\widehat{Var}(\hat{\gamma }-\hat{\tau })\) are the empirical estimates. The asymptotic variance of \(\hat{\hat{\beta }}\) can be estimated as
$$\begin{aligned} \widehat{Var}(\hat{\hat{\beta }})= \widehat{Var}(\hat{\beta }) -\widehat{Cov}(\hat{\beta }, \hat{\gamma }-\hat{\tau }) \{\widehat{Var}(\hat{\gamma }-\hat{\tau })\}^{-1} \widehat{Cov}(\hat{\gamma }-\hat{\tau }, \hat{\beta }). \end{aligned}$$
(9)
The second term in (9) is positive semi-definite. Therefore, asymptotically the new estimator \(\hat{\hat{\beta }}\) is at least as efficient as the standard IPW estimator \(\hat{\beta }\). It is important to note that the consistency of \(\hat{\hat{\beta }}\) does not depend on the set of parametric working models \(f_j(X_j; \gamma _j)\), \(j=1, \cdots , q\). However, adequately specified working regression models can improve the efficiency of \(\hat{\hat{\beta }}\). (See the numerical studies in Sections 4 and 5.)
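The combination step in (8)–(9) is a short matrix computation once the three point estimates and empirical (co)variance estimates are in hand. The sketch below assumes per-observation influence-function contributions have already been computed (e.g., from the sandwich form in Proposition 1); the function name and the synthetic inputs are illustrative, not part of the paper:

```python
import numpy as np

def unified_estimator(beta_hat, gamma_hat, tau_hat, inf_beta, inf_gamma, inf_tau):
    # Apply (8) and (9): adjust beta_hat using the contrast gamma_hat - tau_hat.
    # inf_* are N x dim arrays of estimated influence contributions, so that
    # e.g. Var(beta_hat) is estimated empirically by inf_beta' inf_beta / N^2.
    N = inf_beta.shape[0]
    d = inf_gamma - inf_tau                  # contributions to gamma_hat - tau_hat
    cov_bd = inf_beta.T @ d / N**2           # Cov-hat(beta_hat, gamma_hat - tau_hat)
    var_d = d.T @ d / N**2                   # Var-hat(gamma_hat - tau_hat)
    adjust = cov_bd @ np.linalg.solve(var_d, gamma_hat - tau_hat)
    var_beta = inf_beta.T @ inf_beta / N**2  # Var-hat(beta_hat)
    var_ue = var_beta - cov_bd @ np.linalg.solve(var_d, cov_bd.T)
    return beta_hat - adjust, var_ue

# Illustrative inputs: the gamma contributions are correlated with the beta
# contributions, so the contrast carries information about beta_hat; the
# "estimates" are toy stand-ins (averaged contributions), for shape only.
rng = np.random.default_rng(1)
N = 5000
inf_beta = rng.normal(size=(N, 2))
inf_gamma = 0.6 * inf_beta + rng.normal(size=(N, 2))
inf_tau = rng.normal(size=(N, 2))
beta_hat = inf_beta.mean(axis=0)
gamma_hat, tau_hat = inf_gamma.mean(axis=0), inf_tau.mean(axis=0)

beta_ue, var_ue = unified_estimator(beta_hat, gamma_hat, tau_hat,
                                    inf_beta, inf_gamma, inf_tau)
```

By construction, the diagonal of the returned variance estimate never exceeds that of the plain IPW variance estimate, mirroring the efficiency claim for (9).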
We call the new estimator \(\hat{\hat{\beta }}\) in (8) a unified estimator (UE). The differences between our proposed unified estimator and the unified estimator in Chen and Chen (2000) include the following: (i) our proposed estimator can deal with MAR data, but the estimator in Chen and Chen (2000) requires the data to be MCAR; (ii) our estimator can use all the data, including the fully observed data and the partially observed data, but the estimator in Chen and Chen (2000) can only use the values in the fully observed rows or the fully observed columns of a data set with missing values; and (iii) the estimator in Chen and Chen (2000) is derived from the conditional distribution of \(\sqrt{N}(\hat{\beta }-\beta ^*)\), given \(\sqrt{N}(\hat{\gamma }- \gamma ^*)\), and then replaces the unknown fixed \(\gamma ^*\) with an estimator \(\hat{\tau }\); the resulting estimator therefore neglects the variance of \(\hat{\tau }\) and the covariances between \(\hat{\tau }\) and \((\hat{\beta }, \hat{\gamma })\). However, for MCAR data with a single missing group, it can be shown that the two estimators are equivalent.
Unified estimator based on estimated missing data probabilities
In practice, the missing data probabilities are often unknown. Next, we explain how to compute a unified estimator when the missing data probabilities are estimated from parametric models. (See Sun and Tchetgen 2018 for models for the missing data probabilities.)
Suppose the missing data probabilities are estimated from a parametric model \(\psi _{i}(\alpha )\) where \(\alpha\) is a vector of parameters. Let \(\alpha ^*\) be the unique solution of the estimating equation
$$\begin{aligned} E\{\psi _i(\alpha )\}=0. \end{aligned}$$
We replace the missing data probabilities in equations (1)–(3) with \(\pi _{0i}(\alpha ),\pi _{1i}(\alpha ), \cdots ,\) \(\pi _{qi}(\alpha )\), and denote the corresponding functions as \(S_{i, \alpha }(\beta ), \phi _{i, \alpha }(\gamma ), \varphi _{i, \alpha }(\tau )\). Let \(\hat{\alpha }\) be the solution of the estimating equation
$$\begin{aligned} \sum _{i=1}^N \psi _i(\alpha ) = 0, \end{aligned}$$
(10)
and denote the corresponding IPW estimators as \(\hat{\beta }_{\hat{\alpha }}\), \(\hat{\gamma }_{\hat{\alpha }}\), and \(\hat{\tau }_{\hat{\alpha }}\).
Proposition 2
Assuming the missing data model \(\psi _i(\alpha )\) is correctly specified, under standard regularity conditions \(\hat{\beta }_{\hat{\alpha }}\), \(\hat{\gamma }_{\hat{\alpha }}\), and \(\hat{\tau }_{\hat{\alpha }}\) are consistent and asymptotically normal
$$\begin{aligned} \sqrt{N}\left( \begin{array}{c} \hat{\beta }_{\hat{\alpha }}-\beta ^*\\ \hat{\gamma }_{\hat{\alpha }}-\gamma ^*\\ \hat{\tau }_{\hat{\alpha }}-\tau ^* \end{array} \right) \overset{d}{\rightarrow } N\left( 0, [E\{\varGamma _{i, \alpha ^*}(\beta ^*,\gamma ^*,\tau ^*)\}]^{-1} Var\{Res_{i}(\beta ^*,\gamma ^*,\tau ^*,\alpha ^*)\} [E\{\varGamma _{i, \alpha ^*}(\beta ^*,\gamma ^*,\tau ^*)\}]^{-T}\right) , \end{aligned}$$
(11)
where \(\varGamma _{i, \alpha ^*}(\beta ^*,\gamma ^*,\tau ^*)=\bigtriangledown _{(\beta ^T,\gamma ^T,\tau ^T)} U_{i, \alpha ^*}(\beta ^*,\gamma ^*,\tau ^*)\) with \(U_{i, \alpha }(\beta ,\gamma ,\tau )=\{S^T_{i, \alpha }(\beta ),\) \(\phi ^T_{i, \alpha }(\gamma ),\varphi ^T_{i, \alpha }(\tau )\}^T\), and \(Res_{i}(\beta ^*,\gamma ^*,\tau ^*,\alpha ^*)=U_{i, \alpha ^*}(\beta ^*,\gamma ^*,\tau ^*)-E\{U_{i, \alpha ^*}(\beta ^*,\gamma ^*,\tau ^*) \times \psi _i^T(\alpha ^*)\}[E\{\psi _i(\alpha ^*)\psi _i^T(\alpha ^*)\}]^{-1}\psi _i(\alpha ^*)\).
Then, following the same idea as in Section 3.1, we get a unified estimator \(\hat{\hat{\beta }}_{\hat{\alpha }}\):
$$\begin{aligned} \hat{\hat{\beta }}_{\hat{\alpha }}=\hat{\beta }_{\hat{\alpha }}-\widehat{Cov}(\hat{\beta }_{\hat{\alpha }}, \hat{\gamma }_{\hat{\alpha }}-\hat{\tau }_{\hat{\alpha }}) \{\widehat{Var}(\hat{\gamma }_{\hat{\alpha }}-\hat{\tau }_{\hat{\alpha }})\}^{-1} (\hat{\gamma }_{\hat{\alpha }}-\hat{\tau }_{\hat{\alpha }}), \end{aligned}$$
(12)
and its asymptotic variance can be estimated as
$$\begin{aligned} \widehat{Var}(\hat{\hat{\beta }}_{\hat{\alpha }})= \widehat{Var}(\hat{\beta }_{\hat{\alpha }}) -\widehat{Cov}(\hat{\beta }_{\hat{\alpha }}, \hat{\gamma }_{\hat{\alpha }}-\hat{\tau }_{\hat{\alpha }}) \{\widehat{Var}(\hat{\gamma }_{\hat{\alpha }}-\hat{\tau }_{\hat{\alpha }})\}^{-1} \widehat{Cov}(\hat{\gamma }_{\hat{\alpha }}-\hat{\tau }_{\hat{\alpha }}, \hat{\beta }_{\hat{\alpha }}). \end{aligned}$$
(13)
It can be shown that \(\hat{\hat{\beta }}_{\hat{\alpha }}\) has properties similar to those of \(\hat{\hat{\beta }}\) if the model for the missing data probabilities is correctly specified. We note that, even when the true missing data probabilities are known, using estimated missing data probabilities instead can be more efficient (see Breslow et al. 2009; Chatterjee et al. 2003; Lawless et al. 1999; Robins et al. 1994).
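As one concrete possibility, \(\psi _i(\alpha )\) in (10) is often taken to be the score of a logistic regression of an observation indicator on fully observed variables. A sketch under that assumption (the Newton solver and simulation design are illustrative; the paper does not prescribe a specific model):

```python
import numpy as np

def fit_missingness_model(R, Z, iters=25):
    # Logistic model P(R=1 | Z) = expit(Z'alpha), fitted by Newton-Raphson.
    # Solving sum_i Z_i {R_i - expit(Z_i'alpha)} = 0 is one choice of the
    # estimating equation (10), with psi_i(alpha) the logistic score.
    alpha = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(Z @ alpha)))
        grad = Z.T @ (R - p)                           # score vector
        hess = Z.T @ (Z * (p * (1.0 - p))[:, None])    # observed information
        alpha = alpha + np.linalg.solve(hess, grad)
    return alpha

rng = np.random.default_rng(2)
N = 50000
z = rng.normal(size=N)
Z = np.column_stack([np.ones(N), z])      # fully observed predictors of missingness
alpha_true = np.array([0.3, 0.8])
pi_true = 1.0 / (1.0 + np.exp(-(Z @ alpha_true)))
R = rng.binomial(1, pi_true)              # observation indicators

alpha_hat = fit_missingness_model(R, Z)
pi_hat = 1.0 / (1.0 + np.exp(-(Z @ alpha_hat)))   # plug-in weights use R_i / pi_hat_i
```

The fitted probabilities \(\hat{\pi }_i=\pi _i(\hat{\alpha })\) then replace the known \(\pi _{ji}\) in the IPW estimating equations, yielding \(\hat{\beta }_{\hat{\alpha }}\), \(\hat{\gamma }_{\hat{\alpha }}\), and \(\hat{\tau }_{\hat{\alpha }}\).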