Appendix
A.1. E-step of the EM algorithm for continuous covariates
In the E-step of the EM algorithm developed in Sect. 3, we need to calculate the expectations \(E(Z_i|\mathbf{O}_i,\theta ^{(d)} )\) and \(E(W_i|\mathbf{O}_i,\theta ^{(d)} )\). As described there, when the missing covariates are categorical, these expectations are finite summations and can be expressed in closed form. For continuous covariates, however, this is no longer the case and instead we have to deal with integrals that do not have closed forms. More specifically, we have that
$$\begin{aligned} E(Z_i|\mathbf{O}_i,\theta ^{(d)} )&= \int _{\mathbf{X}_{i}^{mis}}\frac{\Lambda ^{(d)}(V_i)\,\text{ exp }(\beta _1^{(d)'}\mathbf{X}_i^{obs}+\beta _2^{(d)'}\mathbf{X}_{i}^{mis})\,\delta _{1i}}{1-\text{ exp }\{-\Lambda ^{(d)}(V_i)\,\text{ exp }(\beta _1^{(d)'}\mathbf{X}_i^{obs}+\beta _2^{(d)'}\mathbf{X}_{i}^{mis})\}}\\&\quad \times \, p(\mathbf{X}_{i}^{mis}|\mathbf{O}_i,\theta ^{(d)} )\,d\mathbf{X}_{i}^{mis} , \end{aligned}$$
and
$$\begin{aligned} E(W_i|\mathbf{O}_i,\theta ^{(d)} )&= \int _{\mathbf{X}_{i}^{mis}}\frac{\{\Lambda ^{(d)}(U_i) -\Lambda ^{(d)}(V_i)\}\,\text{ exp }(\beta _1^{(d)'}\mathbf{X}_i^{obs} +\beta _2^{(d)'}\mathbf{X}_{i}^{mis})\,\delta _{2i}}{1-\text{ exp }[-\{\Lambda ^{(d)}(U_i)- \Lambda ^{(d)}(V_i)\}\,\text{ exp }(\beta _1^{(d)'}\mathbf{X}_i^{obs}+\beta _2^{(d)'}\mathbf{X}_{i}^{mis})]}\\&\quad \times \, p(\mathbf{X}_{i}^{mis}|\mathbf{O}_i,\theta ^{(d)} )\,d\mathbf{X}_{i}^{mis} \end{aligned}$$
by using the notation defined before.
To calculate the above integrals, following Herring and Ibrahim (2001), one can employ a Monte Carlo estimation approach, which draws samples from
$$\begin{aligned} p_{ij}&= P(\mathbf{X}_i^{mis}|\mathbf{O}_i,\theta ^{(d)} )=\frac{f(U_i, V_i, \delta _{1i}, \delta _{2i}, \delta _{3i}|\mathbf{X}_i^{obs},\mathbf{X}_i^{mis})\,f(\mathbf{X}_i^{obs}, \mathbf{X}_i^{mis}; \gamma ^{(d)})}{\int _{\mathbf{X}_{i}^{mis}} f(U_i, V_i, \delta _{1i}, \delta _{2i}, \delta _{3i}|\mathbf{X}_i^{obs}, \mathbf{X}_i^{mis})\,f(\mathbf{X}_i^{obs}, \mathbf{X}_i^{mis}; \gamma ^{(d)})\,d\mathbf{X}_{i}^{mis}}\\&\propto f(U_i, V_i, \delta _{1i}, \delta _{2i}, \delta _{3i}|\mathbf{X}_i^{obs}, \mathbf{X}_i^{mis})\,f(\mathbf{X}_i^{obs}, \mathbf{X}_i^{mis}; \gamma ^{(d)}) . \end{aligned}$$
Note that \(f(U_i, V_i, \delta _{1i}, \delta _{2i}, \delta _{3i}|\mathbf{X}_i^{obs}, \mathbf{X}_i^{mis})\) is log-concave (Ibrahim et al. 1999), and if \(f(\mathbf{X}_i^{obs},\mathbf{X}_i^{mis};\gamma ^{(d)})\) belongs to the exponential family, then the logarithm of \(P(\mathbf{X}_i^{mis}|\mathbf{O}_i,\theta ^{(d)} )\) is concave. It follows that one can use the Gibbs sampler together with the adaptive rejection algorithm (Gilks and Wild 1992) to sample from \(P(\mathbf{X}_i^{mis}|\mathbf{O}_i,\theta ^{(d)} )\).
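To make the sampling step concrete, the following is a minimal sketch of drawing from \(p(\mathbf{X}_i^{mis}|\mathbf{O}_i,\theta ^{(d)})\) for a single continuous missing covariate. It assumes a normal covariate model \(f(x^{mis}|x^{obs};\gamma )\) and a simplified current-status-type likelihood contribution, and it uses a random-walk Metropolis step in place of the Gibbs/adaptive rejection scheme of Gilks and Wild (1992), purely for illustration; all names and numerical settings are hypothetical.

```python
import numpy as np


def log_target(x_mis, x_obs, lambda_v, delta1, beta1, beta2, gamma_mean, gamma_sd):
    """Unnormalized log of p(x_mis | O_i, theta^(d)) for one subject:
    log f(observed data | x_obs, x_mis) + log f(x_mis | x_obs; gamma), up to a constant.
    The data part is a simplified current-status-type contribution,
    F(V_i)^{delta1} * S(V_i)^{1 - delta1} with S(t) = exp{-Lambda(t) exp(beta'x)}."""
    eta = np.exp(beta1 * x_obs + beta2 * x_mis)              # exp(beta1'x_obs + beta2'x_mis)
    log_surv = -lambda_v * eta                               # log S(V_i)
    log_lik = delta1 * np.log1p(-np.exp(log_surv)) + (1.0 - delta1) * log_surv
    log_cov = -0.5 * ((x_mis - gamma_mean) / gamma_sd) ** 2  # normal model f(x_mis | x_obs; gamma)
    return log_lik + log_cov


def draw_x_mis(n_draws, x_obs, lambda_v, delta1, beta1, beta2,
               gamma_mean, gamma_sd, step=0.5, burn=500, seed=1):
    """Draw s_{i,1},...,s_{i,n_i} from p(x_mis | O_i, theta^(d)) with a
    random-walk Metropolis sampler (stand-in for the Gibbs/adaptive rejection step)."""
    rng = np.random.default_rng(seed)
    x = gamma_mean                                           # start at the conditional mean
    lp = log_target(x, x_obs, lambda_v, delta1, beta1, beta2, gamma_mean, gamma_sd)
    draws = []
    for it in range(burn + n_draws):
        prop = x + step * rng.standard_normal()
        lp_prop = log_target(prop, x_obs, lambda_v, delta1, beta1, beta2, gamma_mean, gamma_sd)
        if np.log(rng.uniform()) < lp_prop - lp:             # Metropolis accept/reject
            x, lp = prop, lp_prop
        if it >= burn:
            draws.append(x)
    return np.array(draws)
```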
More specifically, for the determination of \(E(Z_i|\mathbf{O}_i,\theta ^{(d)} )\), for each subject with missing covariates \(\mathbf{X}_i^{mis}\), we first apply the Gibbs sampler and the adaptive rejection algorithm to draw a sample \(s_{i,1},\ldots ,s_{i,n_{i}}\) of size \(n_i\) from \(p(\mathbf{X}_i^{mis}|\mathbf{O}_i,\theta ^{(d)})\). Then the conditional expectation can be approximated by
$$\begin{aligned} E(Z_i|\mathbf{O}_i,\theta ^{(d)} )\approx \frac{1}{n_{i}}\sum _{k=1}^{n_{i}}\frac{\Lambda ^{(d)}(V_i)\,\text{ exp }(\beta _1^{(d)'}\mathbf{X}_i^{obs}+\beta _2^{(d)'}s_{i,k})\,\delta _{1i}}{1-\text{ exp }\{-\Lambda ^{(d)}(V_i)\,\text{ exp }(\beta _1^{(d)'}\mathbf{X}_i^{obs}+\beta _2^{(d)'}s_{i,k})\}} . \end{aligned}$$
In comparison with the categorical covariate situation, the above operation can be regarded as replacing each \(\mathbf{X}_i^{mis}\) by \(n_{i}\) sampled values with equal weights. It is apparent that \(E(W_i|\mathbf{O}_i,\theta ^{(d)} )\) can be calculated similarly.
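Given draws \(s_{i,1},\ldots ,s_{i,n_i}\) obtained as above, the approximation of \(E(Z_i|\mathbf{O}_i,\theta ^{(d)})\) is simply the equally weighted average of the integrand over the draws. A minimal sketch under the same simplified, hypothetical setup as before:

```python
import numpy as np


def e_step_z(draws, x_obs, lambda_v, delta1, beta1, beta2):
    """Monte Carlo approximation of E(Z_i | O_i, theta^(d)): average the integrand
    over the sampled values s_{i,k} of the missing covariate (equal weights 1/n_i)."""
    s = np.asarray(draws, dtype=float)
    eta = np.exp(beta1 * x_obs + beta2 * s)        # exp(beta1'x_obs + beta2's_{i,k})
    num = lambda_v * eta * delta1                  # Lambda^{(d)}(V_i) exp(.) delta_{1i}
    den = 1.0 - np.exp(-lambda_v * eta)            # 1 - exp{-Lambda^{(d)}(V_i) exp(.)}
    return float(np.mean(num / den))


# Hypothetical usage, reusing the sampler sketched earlier:
# draws = draw_x_mis(1000, x_obs=0.3, lambda_v=0.8, delta1=1, beta1=0.5, beta2=-0.2,
#                    gamma_mean=0.0, gamma_sd=1.0)
# ez_i = e_step_z(draws, x_obs=0.3, lambda_v=0.8, delta1=1, beta1=0.5, beta2=-0.2)
```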
A.2. Proofs of the asymptotic properties
In this Appendix, we sketch the proofs of the consistency and asymptotic normality of the proposed estimators given in Theorem 1 by employing empirical process theory and nonparametric techniques. Define \({P}f=\int f(x)\,dP(x)\) and \({P}_n f = n^{-1} \sum \limits _{i=1}^{n} f(X_i)\) for a function f, a probability measure P and a sample \(X_1, \ldots , X_n\). For the proofs, we need the following regularity conditions.
- (A1) Assume that \(\Lambda (\tau _1)<\infty \), \(\Lambda (\tau _2)<\infty \), and there exists a positive constant a such that \(P ( V - U> a ) > 0\). Also, the union of the supports of U and V is contained in the interval \([r_1, r_2]\) with \(0<r_1<r_2< +\infty \).

- (A2) The function \(\Lambda _0\) is continuously differentiable on \([r_1, r_2]\) and satisfies \( M^{-1}<\Lambda _0(r_1)<\Lambda _0(r_2)< M\) for some positive constant M.

- (A3) The set of covariates (X, Z) has bounded support.

- (A4) The conditional distribution \(f(\mathbf{X}_i^{mis}|\mathbf{X}_i^{obs}; \gamma )\) is identifiable and has continuous second-order derivatives with respect to \(\gamma \), and \(-E_0[(\partial ^2/\partial \gamma ^2)\,\text{ log }f(\mathbf{X}_i^{mis}|\mathbf{X}_i^{obs}; \gamma _0)]\) is positive definite.

- (A5) For any \((\theta , \Lambda )\) near \((\theta _0, \Lambda _0)\), \({P}_0\{\text{ log }L(\theta , \Lambda )-\text{ log }L(\theta _0, \Lambda _0)\}\leqslant -K(\Vert \theta -\theta _0\Vert ^2+\Vert \Lambda -\Lambda _0\Vert ^2)\) for a fixed constant \(K>0\).
First, we prove the consistency, for which we verify the conditions of Theorem 5.7 of Van der Vaart (1998). Let \(BV_\omega [r_1, r_2]\) denote the class of functions whose total variation on \([r_1, r_2]\) is bounded by a given constant \(\omega \). Then the class of functions
$$\begin{aligned} F_\omega =\left\{ \int \limits _{0}^{U_k}\text{ exp }\{\beta ^{T}X_i\}d\Lambda (s): \Lambda \in BV_\omega [r_1, r_2]\right\} \end{aligned}$$
is contained in the convex hull of the class of functions \(\{I(U_k\geqslant s)\text{ exp }(\beta ^{T}X_i)\}\) and thus it is a Donsker class. Furthermore,
$$\begin{aligned} \text{ exp }\left( -\int \limits _{0}^{U_k}\text{ exp }\{\beta ^{T}X_i\}d\Lambda (s)\right) -\text{ exp }\left( -\int \limits _{0}^{U_{k+1}}\text{ exp }\{\beta ^{T}X_i\}d\Lambda (s)\right) \end{aligned}$$
is bounded away from zero. Therefore, \(l(\theta , {\hat{\alpha }}|\mathbf{O})=\text{ log }L(\theta , {\hat{\alpha }}|\mathbf{O})\) belongs to some Donsker class, due to the preservation property of Donsker classes under Lipschitz-continuous transformations. Then we can conclude that \(\sup _{\theta \in \Theta _n}|{P}_nl(\theta , {\hat{\alpha }}|\mathbf{O})-{P}l(\theta , {\hat{\alpha }}|\mathbf{O})|\) converges in probability to 0 as \(n\rightarrow \infty \).
Now we verify that the other condition of Theorem 5.7 of Van der Vaart (1998) also holds; that is, for any \(\varepsilon >0\), we have
$$\begin{aligned} \sup _{d(\theta , \theta _0)>\varepsilon }Pl(\theta ,{\hat{\alpha }}|\mathbf{O}) <Pl(\theta _0, {\hat{\alpha }}|\mathbf{O}) . \end{aligned}$$
Note that this condition is satisfied if the model is identifiable. By condition (A4) and arguments similar to those used in the proof of Theorem 2.1 of Chang et al. (2007), we can show the identifiability of the model parameters. Therefore, by Theorem 5.7 of Van der Vaart (1998), we have \(d({\hat{\theta }}_n, \theta _0)= o_p(1)\), which completes the proof of the consistency.
Before proving the asymptotic normality, we need to establish the convergence rate. For this, we first control the covering number of the class \({{\mathcal {L}}}=\{l(\theta ,{\hat{\alpha }}|\mathbf{O}):\theta \in \Theta \}\) through the following lemma.
Lemma 1
Assume that Conditions (A1), (A3)–(A4) hold. Then the covering number of the class \({{\mathcal {L}}} = \{l(\theta ,{\hat{\alpha }}|\mathbf{O}): \theta \in \Theta \}\) satisfies
$$\begin{aligned} N(\epsilon , {{\mathcal {L}}}, L_2(P))=O(\epsilon ^{-1}). \end{aligned}$$
Proof of Lemma 1
The proof is similar to those of Zeng et al. (2016) and Hu et al. (2017) and is thus omitted. \(\square \)
To establish the convergence rate, for any \(\eta >0\), define the class \({{\mathcal {F}}}_\eta =\{l(\theta _{n0}, {\hat{\alpha }}|\mathbf{O})-l(\theta , {\hat{\alpha }}|\mathbf{O}): \theta \in \Theta , d(\theta , \theta _{n0})\leqslant \eta \}\) with \(\theta _{n0}=(\beta _0,\Lambda _{n0})\). Following the calculations in Shen and Wong (1994, p. 597), we can establish that \(\text{ log }N_{[\,]}(\epsilon , {{\mathcal {F}}}_{\eta }, \parallel \cdot \parallel _{2})\leqslant CN \,\text{ log }(\eta /\epsilon )\) with \(N=m+1\), where \(N_{[\,]}(\epsilon , {{\mathcal {F}}}, d)\) denotes the bracketing number (see Definition 2.1.6 of Van der Vaart and Wellner 1996) of a function class \({{\mathcal {F}}}\) with respect to the metric or semi-metric d. Moreover, some algebraic calculations lead to \(\parallel l(\theta _{n0},{\hat{\alpha }}|\mathbf{O})-l(\theta , {\hat{\alpha }}|\mathbf{O})\parallel _{2}^2\leqslant C\eta ^2\) for any \(l(\theta _{n0}, {\hat{\alpha }}|\mathbf{O})-l(\theta , {\hat{\alpha }}|\mathbf{O})\in {{\mathcal {F}}}_\eta \). Therefore, by Lemma 3.4.2 of Van der Vaart and Wellner (1996), we obtain
$$\begin{aligned} E_P\parallel n^{1/2}(P_n-P)\parallel _{{{\mathcal {F}}}_{\eta }}\leqslant C J_{[\,]}(\eta , {{\mathcal {F}}}_\eta , \parallel \cdot \parallel _{2})\left\{ 1+\frac{J_{[\,]}(\eta ,{{\mathcal {F}}}_\eta , \parallel \cdot \parallel _{2})}{\eta ^2n^{1/2}}\right\} , ~~~~~~~~(S) \end{aligned}$$
where \(J_{[\,]}(\eta , {{\mathcal {F}}}_\eta , \parallel \cdot \parallel _{2})=\int _{0}^\eta \{\text{ log }N_{[\,]}(\epsilon , {{\mathcal {F}}}_{\eta }, \parallel \cdot \parallel _{2})\}^{1/2}\,d\epsilon \). The right-hand side of (S) yields \(\phi _n(\eta )=C\eta ^{1/2}\big (1+\frac{\eta ^{1/2}}{\eta ^{2} n^{1/2}}M_1\big )\), where \(M_1\) is a positive constant. Then \(\phi _n(\eta )/\eta \) is a decreasing function of \(\eta \), and \(n^{2/3}\phi _n(n^{-1/3})=O(n^{1/2})\). According to Theorem 3.4.1 of Van der Vaart and Wellner (1996), we can conclude that \(d({\hat{\theta }}_n, \theta _0)=O_p(n^{-1/3})\).
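For completeness, the arithmetic behind the rate \(n^{-1/3}\) is as follows: writing \(\phi _n(\eta )=C(\eta ^{1/2}+M_1\,\eta ^{-1}n^{-1/2})\) and taking \(r_n=n^{1/3}\),
$$\begin{aligned} r_n^{2}\,\phi _n(1/r_n)=n^{2/3}\,\phi _n(n^{-1/3})=C\big (n^{2/3}\,n^{-1/6}+M_1\,n^{2/3}\,n^{1/3}\,n^{-1/2}\big )=C(1+M_1)\,n^{1/2}, \end{aligned}$$
so that \(r_n^{2}\phi _n(1/r_n)\lesssim n^{1/2}\), which is (up to constants) the growth condition required by Theorem 3.4.1 of Van der Vaart and Wellner (1996) for the rate \(r_n=n^{1/3}\).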
Now we prove the asymptotic normality of \({\hat{\beta }}_n\). Following the proof of Theorem 2 in Zeng et al. (2016), one can obtain that
$$\begin{aligned} \sqrt{n} ( {{\hat{\beta }}}_n - \beta _0 )=\big (E[\{l_\beta -l_\Lambda (s^*)\}\{l_\beta -l_\Lambda (s^*)\}^{T}]\big )^{-1}G_n\{l_\beta -l_\Lambda (s^*)\}+o_p(1), \end{aligned}$$
where \(l_\beta \) is the score function for \(\beta \) and \( l_\Lambda (s^*)\) is the score function along the submodel \(d\Lambda _{\epsilon , s^*}=(1+\epsilon s^*)d\Lambda \). This implies that the influence function for \({\hat{\beta }}_n\) is exactly the efficient influence function, so that \(\sqrt{n} ( {{\hat{\beta }}}_n - \beta _0 )\) converges in distribution to a zero-mean normal random vector whose covariance matrix attains the semiparametric efficiency bound. \(\square \)