Annals of the Institute of Statistical Mathematics, Volume 65, Issue 4, pp 617–638

# Robust estimation in joint mean–covariance regression model for longitudinal data

## Authors

• Xueying Zheng
• Department of Statistics and Actuarial Science, The University of Hong Kong
• W.K. Fung
• Department of Statistics and Actuarial Science, The University of Hong Kong
• Zhongyi Zhu
• Department of Statistics, School of Management, Fudan University

DOI: 10.1007/s10463-012-0383-8

Zheng, X., Fung, W.K. & Zhu, Z. Ann Inst Stat Math (2013) 65: 617. doi:10.1007/s10463-012-0383-8

## Abstract

In this paper, we develop a robust method for jointly estimating the mean and covariance in regression models for longitudinal data within the framework of generalized estimating equations (GEE). The proposed approach integrates robust methods with joint mean–covariance regression modeling. Robust generalized estimating equations using bounded scores and leverage-based weights are employed for the mean and covariance to achieve robustness against outliers. The resulting estimators are shown to be consistent and asymptotically normally distributed. Simulation studies are conducted to investigate the effectiveness of the proposed method. As expected, the robust method outperforms its non-robust version under contamination. Finally, we illustrate the method by analyzing a hormone data set. By downweighing the potential outliers, the proposed method not only shifts the estimates in the mean model, but also shrinks the range of the innovation variance, leading to a more reliable estimate of the covariance matrix.

### Keywords

Covariance matrix; Generalized estimating equation; Longitudinal data; Modified Cholesky decomposition; Robustness

## 1 Introduction

Longitudinal data are characterized by repeated observations over time on the same subject, and observations within a subject tend to be correlated. In a marginal model, the correlation among a subject’s repeated measurements is typically not of primary interest, but it must be taken into account to make proper inference. In fact, ignoring the within-subject correlation can result in an inefficient estimator of the regression parameters; see Qu et al. (2000) and Wang (2003). A well-modeled covariance can decrease the bias of mean estimation for longitudinal data with missing values (Daniels and Zhao 2003) and can maintain a reliable estimate of the covariance matrix even when the mean model is not correctly specified (Pan and Mackenzie 2003). In some cases, the correlation structure plays a role as important as that of the mean structure, which suggests that estimation of the covariance matrix is crucial in longitudinal studies.

Accordingly, a substantial recent literature considers the mean and covariance matrix simultaneously. Pourahmadi (1999, 2000) developed a parametric joint mean and covariance model based on the modified Cholesky decomposition, but this method does not handle irregularly observed measurements. Pan and Mackenzie (2003) exploited a reparameterisation of the marginal covariance matrix to extend Pourahmadi’s work to irregular cases. Wu and Pourahmadi (2003) proposed non-parametric smoothing to regularize the estimation of the covariance matrix, using the two-step estimation of Fan and Zhang (2000). To relax the parametric assumption on the mean and covariance structure, Fan et al. (2007) and Fan and Wu (2008) proposed semiparametric models for the mean and covariance structure. However, these works considered only normal or nearly normal data. For longitudinal data analysis, Liang and Zeger (1986) introduced the technique of generalized estimating equations, which Ye and Pan (2006) developed for both mean and covariance parameters. To relax the parametric assumption in Ye and Pan (2006), Leng et al. (2010) proposed a semiparametric model for the mean and covariance structure.

The GEE approach has some built-in robustness since no specification of the full likelihood is required. However, in a longitudinal data set, one outlier at the subject level may generate a set of outliers in the sample because of the repeated measurements. A study of robustness is therefore required, since estimating equations are highly sensitive to outliers in the sample. Robust regression methods have been developed separately for estimation of the mean parameters and for estimation of the covariance matrix. An incomplete list of recent works on robust GEE methods includes Cantoni (2004), He et al. (2005), Wang et al. (2005a), Qin and Zhu (2007), Qin et al. (2009) and Croux et al. (2012).

However, as far as we know, there is little discussion of robust estimation for joint mean and covariance models. Croux et al. (2012) considered robustification of the mean and covariance by setting up estimating equations for both the mean and the dispersion parameter. A limitation of their approach is that it assumes an inflexible covariance structure determined by two parameters. In this article, following Ye and Pan (2006) and He et al. (2005), we establish a set of robust generalized estimating equations for analyzing a parametric joint mean and covariance regression model for longitudinal data. Robust generalized estimating equations using bounded scores and leverage-based weights are developed for the mean and covariance to achieve robustness against outliers. The advantage of the proposed joint model lies in modeling the covariance matrix with a moderate number of parameters, rather than assuming a fixed structure or introducing an unstructured covariance matrix that incurs the curse of dimensionality.

Similar to He et al. (2005), Mallows-type weights are used to downweigh the effect of leverage points, while a bounded score function on the Pearson residuals is employed to reduce the effect of outliers in the response. Mallows-type weights have also been used by Qin and Zhu (2007) in a generalized semiparametric mixed model and by Qin et al. (2009) in a generalized partial linear mixed model for longitudinal data analysis. The resulting estimators for the regression coefficients in both the mean and covariance are shown to be consistent and asymptotically normally distributed. In the simulation studies, we first show that applying the robust method in the joint model yields better estimators for both mean and covariance parameters under contamination. Second, we find that robustification of both the mean and the covariance matrix estimation is necessary, since the triple-robust estimating method performs far better than the mean-robust estimating method. In the hormone data analysis, the main advantage of the robust method lies in successfully detecting potentially influential outliers at both the subject level and the observation level, which results in more reliable estimates of both the mean and covariance.

The rest of the article is organized as follows: In Sect. 2, we formulate the robust joint mean and covariance model and introduce the estimation methods. Theoretical properties are established in this section as well. Simulation studies are presented in Sect. 3. Finally, we carry out a hormone data set analysis to illustrate the proposed method in Sect. 4.

## 2 Methodology

### 2.1 Joint mean–covariance model

Suppose that we have a sample of $$m$$ subjects. Let $$y_i=(y_{i1},\dots ,y_{in_i})^{\prime }$$ be the $$n_i$$ repeated measurements at time points $$t_i=(t_{i1},\dots ,t_{in_i})^{\prime }$$ for the $$i$$th subject. Let $$E(y_i)=\mu _i=(\mu _{i1},\dots ,\mu _{in_i})^{\prime }$$ and $$Cov(y_i)=\Sigma _i$$ be the $$n_i\times 1$$ mean vector and $$n_i\times n_i$$ covariance matrix of $$y_i$$, respectively.

According to the positive definite property of the covariance matrix $$\Sigma _i$$, there exists a unique lower triangular matrix $$\varPhi _i$$ with $$1$$’s being the diagonal entries and a unique diagonal matrix $$D_i$$ with positive diagonals such that
\begin{aligned} \varPhi _i\Sigma _i \varPhi _i^{\prime }=D_i. \end{aligned}
(1)
As indicated by Pourahmadi (1999), $$\varPhi _i$$ and $$D_i$$ have clear statistical interpretation. The lower-diagonal entries of $$\varPhi _i$$ are the negatives of the autoregressive coefficients $$\phi _{ijk}$$ defined in
\begin{aligned} \hat{y}_{ij}=\mu _{ij}+\sum ^{j-1}_{k=1}\phi _{ijk}(y_{ik}-\mu _{ik}), \end{aligned}
(2)
which is the linear least squares predictor of $$y_{ij}$$ based on its predecessors $$y_{i(j-1)},\dots ,y_{i1}$$. The diagonal entries $$\sigma ^2_{ij}$$ of $$D_i$$ can be seen as the innovation variance $$\sigma ^2_{ij}=var(\epsilon _{ij})$$, where $$\epsilon _{ij}=y_{ij}-\hat{y}_{ij}$$.
The modified Cholesky decomposition guarantees that $$\Sigma _i$$ is positive definite, and so has the advantage that $$\phi _{ijk}$$ and $$\text{ log}\ \sigma ^2_{ij}$$ are unconstrained. Similar to Ye and Pan (2006), we adopt three generalized linear models for the mean, generalized autoregressive parameters and innovation variances:
\begin{aligned} g(\mu _{ij})=x^{\prime }_{ij}\beta ,\, \phi _{ijk}=z^{\prime }_{ijk}\gamma ,\, \text{ log}\ \sigma ^2_{ij}=z^{\prime }_{ij}\lambda , \end{aligned}
(3)
where $$i=1,\dots ,m,\,j=1,\dots ,n_i,\,k=1, \dots , j-1,\,x_{ij},\,z_{ijk}$$ and $$z_{ij}$$ are $$p\times 1,\,q\times 1$$ and $$d\times 1$$ vectors of covariates, and $$\beta ,\,\gamma$$ and $$\lambda$$ are the associated parameters. The known link function $$g(\cdot )$$ is assumed to be monotone and differentiable. The covariates $$z_{ijk}$$ and $$z_{ij}$$ may contain the baseline covariates, the time and the associated interactions. Ye and Pan (2006) recommend polynomials in the time lag as the covariates for the autoregressive components:
\begin{aligned} z_{ijk}=(1,\ (t_{ij}-t_{ik}),\ (t_{ij}-t_{ik})^2,\ldots ,\ (t_{ij}-t_{ik})^{q-1})^{\prime }. \end{aligned}
In the above models, estimation of the autoregressive coefficients and innovation variances is treated as being as important as estimation of the mean.
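To make the decomposition (1) and the models in (3) concrete, the following minimal sketch builds $$\varPhi _i$$, $$D_i$$ and $$\Sigma _i$$ for one subject and checks that $$\varPhi _i\Sigma _i\varPhi _i^{\prime }=D_i$$. The function names, the observation times and the covariates `z_var` are hypothetical, and the parameter values are only illustrative.

```python
import math

def build_covariance(times, gamma, lam, z_var):
    """Return (Phi, d, Sigma) for one subject under model (3).

    times : observation times t_1..t_n
    gamma : autoregressive coefficients, phi_jk = z_jk' gamma
    lam   : innovation-variance coefficients, log sigma_j^2 = z_j' lam
    z_var : covariate vectors z_j for the variances (hypothetical input)
    """
    n = len(times)
    phi = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(j):
            lag = times[j] - times[k]
            # z_jk = (1, lag, lag^2, ...), so phi_jk is a polynomial in the lag
            phi[j][k] = sum(g * lag ** p for p, g in enumerate(gamma))
    # Innovation variances sigma_j^2 = exp(z_j' lam), positive by construction
    d = [math.exp(sum(a * z for a, z in zip(lam, z_var[j]))) for j in range(n)]
    # Phi is unit lower triangular with -phi_jk below the diagonal
    Phi = [[1.0 if j == k else -phi[j][k] for k in range(n)] for j in range(n)]
    # L = Phi^{-1}, obtained by forward substitution
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = 1.0
        for c in range(j):
            L[j][c] = sum(phi[j][k] * L[k][c] for k in range(c, j))
    # Sigma = L D L'
    Sigma = [[sum(L[a][k] * d[k] * L[b][k] for k in range(n)) for b in range(n)]
             for a in range(n)]
    return Phi, d, Sigma

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

times = [0.0, 0.3, 0.7, 1.0]
z_var = [(1.0, x) for x in (0.4, -1.2, 0.9, 0.1)]
Phi, d, Sigma = build_covariance(times, (0.2, 0.3), (-0.5, 0.2), z_var)
check = matmul(matmul(Phi, Sigma), transpose(Phi))  # should equal diag(d)
```

Because $$\Sigma _i$$ is assembled from unconstrained $$\phi _{ijk}$$ and $$\text{ log}\ \sigma ^2_{ij}$$, it is positive definite for any real values of $$\gamma$$ and $$\lambda$$.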

### 2.2 Robust generalized estimating equations for the mean

To estimate the regression parameter $$\beta$$ in Sect. 2.1, generalized estimating equations (Liang and Zeger 1986) can be represented as
\begin{aligned} \sum ^m_{i=1} \dot{\mu }_i^{\prime } V_i^{-1} (y_i -\mu _i)=0, \end{aligned}
(4)
where $$V_i$$ is the covariance matrix of $$y_i$$ assumed as $$V_i = A_i^{1/2} R(\alpha ) A_i^{1/2}$$ and $$\dot{\mu }_i$$ is the first derivative of $$\mu (\cdot )$$ evaluated at $$(x_{i1},\dots ,x_{in_i})^{\prime }\beta$$. Here, $$A_i$$ is a diagonal matrix with the marginal variance of $$y_i$$ as the diagonal component, and $$R(\alpha )$$ is a working correlation matrix.

He et al. (2005) mentioned that the GEE approach has some built-in robustness since it requires no specification of the full likelihood. However, estimating equations are highly sensitive to outliers in the sample. In longitudinal studies, an outlier in a subject-level measurement can generate multiple outliers in the sample. Sinha (2004) also emphasized that a small proportion of the data may come from an arbitrary distribution rather than the assumed one, and that such deviations from the underlying distribution can produce outliers or influential observations in the data. Therefore, a robust method is desirable to mitigate the effect of outliers and to obtain bounded influence functions.

To improve the efficiency of the GEE estimator under contamination, He et al. (2005) defined the following robust generalized estimating equations (RGEE) for the mean parameter:
\begin{aligned} \sum _{i=1}^{m}X_i^{\prime }\Delta _{i}(\mu _i(\beta ))(V^{\beta }_{i}) ^{-1}h^{\beta }_{i}(\mu _{i}(\beta ))=0, \end{aligned}
where $$X_i=(x_{i1},\dots ,x_{in_i})^{\prime },\,\Delta _{i}(\mu _{i}(\beta ))= \text{ diag}\{\dot{\mu }_{i1}(\beta ),\ldots ,\dot{\mu }_{in_{i}}(\beta ) \}$$, with $$\dot{\mu }_{ij}(\cdot )$$ denoting the first derivative of $$\mu (\cdot )$$ evaluated at $$x^{\prime }_{ij}\beta$$; and $$h^{\beta }_i(\mu _i(\beta ))=W^{\beta }_i[\psi ^{\beta }(\mu _i(\beta )) -C^{\beta }_i(\mu _i(\beta ))]$$. In $$h^{\beta }_i (\mu _i(\beta ))$$, they employed a Huber function $$\psi$$ on the Pearson residuals and a weighting matrix $$W^{\beta }_i$$ to control the influence of outliers. The term $$C^{\beta }_i(\mu _i(\beta ))$$ is included to ensure the consistency of the estimating equation. Detailed explanations of the notation are given in the following subsections.

### 2.3 Robust estimating equations for joint mean and covariance model

In the previous subsections, we have constructed the joint mean–covariance model. However, we have introduced robust generalized estimating equation only for the mean. Now we propose the following robustified generalized estimating equations for $$\theta =(\beta ^{\prime },\ \gamma ^{\prime },\ \lambda ^{\prime })^{\prime }$$
\begin{aligned} U(\theta )=(U_1(\beta )^{\prime },\ U_2(\gamma )^{\prime },\ U_3(\lambda )^{\prime })^{\prime }. \end{aligned}
To be specific, the detailed estimating equations for the mean, generalized autoregressive parameters and innovation variances are
\begin{aligned} U_{1}(\beta )&= \sum _{i=1}^{m}X_i^{\prime }\varDelta _{i}(\mu _i(\beta ))(V^{\beta } _{i})^{-1}h^{\beta }_{i}(\mu _{i}(\beta ))=0,\end{aligned}
(5)
\begin{aligned} U_{2}(\gamma )&= \sum _{i=1}^{m}T^{\prime }_{i}(V^{\gamma }_{i})^{-1}h^{\gamma } _i(\hat{r}_{i}(\gamma ))=0,\end{aligned}
(6)
\begin{aligned} U_{3}(\lambda )&= \sum _{i=1}^{m}Z_i^{\prime } D_{i}(V^{\lambda }_{i})^{-1}h^ {\lambda }_{i}(\sigma _{i}^{2}(\lambda ))=0, \end{aligned}
(7)
where $$h^{\beta }_i (\mu _i(\beta ))=W^{\beta }_i[\psi ^{\beta }(\mu _i(\beta ))-C^{\beta }_i (\mu _i(\beta ))]$$, $$h^{\gamma }_i(\hat{r}_i(\gamma ))=W^{\gamma }_i[\psi ^{\gamma }(\hat{r}_i(\gamma ))-C^{\gamma }_i(\hat{r}_i(\gamma ))]$$ and $$h^{\lambda }_i(\sigma ^2_i(\lambda ))=W^{\lambda }_i[\psi ^{\lambda } (\sigma ^2_i(\lambda ))-C^{\lambda }_i(\sigma _i^2(\lambda ))]$$ form the core of the estimating equations, with $$\psi ^{\beta },\,\psi ^{\gamma },\,\psi ^{\lambda },\,C^{\beta }_i,\, C^{\gamma }_i,\,C^{\lambda }_i,\,W^{\beta }_i,\,W^{\gamma }_i$$ and $$W^{\lambda }_i$$ to be specified later. Here $$X_i=(x_{i1},\dots ,x_{in_i})^{\prime }$$ and $$Z_i=(z_{i1},\dots ,z_{in_i})^{\prime }$$ are the design matrices, and $$r_i$$ and $$\hat{r}_i$$ are $$n_i\times 1$$ vectors with $$j$$th components $$r_{ij}=y_{ij}-\mu _{ij}$$ and $$\hat{r}_{ij}=E(r_{ij}|r_{i1},\dots ,r_{i(j-1)})=\sum ^{j-1}_{k=1}\phi _{ijk}r_{ik}$$, where the empty sum $$\sum ^0_{k=1}$$ is taken to be zero when $$j=1$$. In $$U_3(\lambda )$$, $$\epsilon ^2_i$$ and $$\sigma ^2_i$$ are $$n_i\times 1$$ vectors with $$j$$th components $$\epsilon ^2_{ij}$$ and $$\sigma ^2_{ij}$$, respectively, where $$\epsilon _{ij}=y_{ij}-\hat{y}_{ij}$$, so that $$E(\epsilon ^2_i)=\sigma ^2_i$$. The matrix $$\varDelta _{i}(\mu _{i}(\beta ))=\text{ diag}\{ \dot{\mu }_{i1}(\beta ), \ldots ,\dot{\mu }_{in_{i}}(\beta )\}$$, with $$\dot{\mu }_{ij}(\cdot )$$ denoting the first derivative of $$\mu (\cdot )$$ evaluated at $$x^{\prime }_{ij}\beta$$; $$T^{\prime }_i=\partial \hat{r}^{\prime }_i/\partial \gamma$$ is the $$q\times n_i$$ matrix with $$j$$th column $$\partial \hat{r}_{ij}/\partial \gamma =\sum ^{j-1}_{k=1}r_{ik}z_{ijk}$$, which follows from $$\phi _{ijk}=z^{\prime }_{ijk}\gamma$$; and $$D_i=\text{ diag}\{ \sigma ^2_{i1},\dots ,\sigma ^2_{in_i}\}$$.
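The working quantities entering $$U_2(\gamma )$$ and $$U_3(\lambda )$$ can be computed directly from their definitions. The sketch below is illustrative only: the function name, the data and the parameter values are hypothetical. It uses $$\epsilon _{ij}=r_{ij}-\hat{r}_{ij}$$, which follows from (2), and the derivative $$\partial \hat{r}_{ij}/\partial \gamma =\sum _{k<j}r_{ik}z_{ijk}$$ implied by $$\phi _{ijk}=z^{\prime }_{ijk}\gamma$$.

```python
def working_residuals(y, mu, gamma, z):
    """Residual quantities for one subject.

    y, mu : responses and means (length n)
    gamma : autoregressive coefficients (length q)
    z     : z[j][k] is the q-vector z_jk for k < j
    Returns r, r_hat, eps and T (n rows, q columns; row j is d r_hat_j / d gamma).
    """
    n, q = len(y), len(gamma)
    r = [y[j] - mu[j] for j in range(n)]
    # phi_jk = z_jk' gamma
    phi = [[sum(g * zp for g, zp in zip(gamma, z[j][k])) for k in range(j)]
           for j in range(n)]
    # r_hat_j = sum_{k<j} phi_jk r_k (empty sum is zero for the first observation)
    r_hat = [sum(phi[j][k] * r[k] for k in range(j)) for j in range(n)]
    # Innovations eps_j = y_j - y_hat_j = r_j - r_hat_j
    eps = [r[j] - r_hat[j] for j in range(n)]
    T = [[sum(r[k] * z[j][k][p] for k in range(j)) for p in range(q)]
         for j in range(n)]
    return r, r_hat, eps, T

times = [0.0, 0.5, 1.0]
gamma = (0.2, 0.3)
z = [[(1.0, times[j] - times[k]) for k in range(j)] for j in range(len(times))]
r, r_hat, eps, T = working_residuals([1.0, 2.0, 0.5], [0.5, 1.0, 1.0], gamma, z)
```

Since $$\hat{r}_i$$ is linear in $$\gamma$$, the identity $$\hat{r}_i=T_i\gamma$$ holds and gives a quick consistency check on `T`.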

Furthermore, $$V^\beta _{i}=A_{i}^{-1/2}\Sigma _{i}$$, where $$A_{i}$$ is the diagonal matrix formed from the diagonal elements of $$\Sigma _{i}$$; $$V^{\gamma }_{i}=D_i^{1/2}$$; and $$V^{\lambda }_{i}=\widetilde{A}_i^{-1/2}\widetilde{\Sigma }_i$$, where $$\widetilde{A}_{i}$$ is the diagonal matrix formed from the diagonal elements of $$\widetilde{\Sigma }_{i}$$. Similar to Ye and Pan (2006), the sandwich working covariance structure $$\widetilde{\Sigma }_i=B_i^{1/2}R_i(\delta )B_i^{1/2}$$ can be used to model the true $$\widetilde{\Sigma }_i=Var(\epsilon ^2_i)$$, with $$B_i=2\text{ diag}\{ \sigma ^4_{i1},\dots ,\sigma ^4_{in_i}\}$$ and with $$R_i(\delta )$$ mimicking the correlation between $$\epsilon ^2_{ij}$$ and $$\epsilon ^2_{ik}$$ through a new parameter $$\delta$$. Typical structures for $$R_{i}(\delta )$$ include compound symmetry and AR(1). Although Ye and Pan (2006) gave no particular suggestion on how to choose the structure and the value of $$\delta$$, the parameter $$\delta$$ has little effect on the estimation in practice, which is also confirmed in the simulation study reported in a later section. Moreover, we will show that the working correlation structure also has little effect on the estimates. In fact, $$r_{i}$$ in $$U_{2}(\gamma )$$ and $$\epsilon _{i}^{2}$$ in $$U_{3}(\lambda )$$ play a role similar to that of $$y_{i}$$ in $$U_{1}(\beta )$$ and can be viewed as working responses. Hence the ideas behind Eqs. (6) and (7) agree with that behind Eq. (5), which underlines the importance of estimating the covariance matrix.
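As a small illustration of the working covariance $$\widetilde{\Sigma }_i=B_i^{1/2}R_i(\delta )B_i^{1/2}$$, the sketch below builds it for the two typical structures mentioned above. The function name and the numerical values are hypothetical, and the AR(1) correlation is indexed by observation order rather than by time lag, which is an assumption made here for simplicity.

```python
import math

def working_cov_eps2(sigma2, delta, structure="ar1"):
    """Working covariance of the squared innovations for one subject.

    sigma2    : innovation variances sigma_j^2
    delta     : working correlation parameter
    structure : "ar1" (delta^|j-k|) or "cs" (compound symmetry)
    """
    n = len(sigma2)
    # Diagonal of B_i = 2 diag(sigma^4)
    b = [2.0 * s2 ** 2 for s2 in sigma2]

    def corr(j, k):
        if j == k:
            return 1.0
        if structure == "ar1":
            return delta ** abs(j - k)
        if structure == "cs":
            return delta
        raise ValueError("unknown structure: " + structure)

    return [[math.sqrt(b[j]) * corr(j, k) * math.sqrt(b[k]) for k in range(n)]
            for j in range(n)]

sigma2 = [1.0, 0.5, 2.0]
S_ar1 = working_cov_eps2(sigma2, 0.3, "ar1")
S_cs = working_cov_eps2(sigma2, 0.3, "cs")
```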

### 2.4 Huber’s score function $$\psi$$ and weights $$w_{ij}$$

In the core of the estimating equations, $$\psi ^{\beta }(\mu _i)=\psi (A_i^{-1/2}(y_i-\mu _i)),\,\psi ^{\gamma }(\hat{r}_i)= \psi (D^{-1/2}_i(r_i-\hat{r}_i))$$ and $$\psi ^{\lambda }(\sigma ^2_i)=\psi (\widetilde{A}_i^{-1/2}(\epsilon ^2_i-\sigma ^2_i))$$. The function $$\psi (\cdot )$$ is chosen to limit the influence of outliers in the response variable, and a common choice is Huber’s score function $$\psi _{c}(x)=\min \{ c,\max \{ -c,x\} \}$$ for some constant $$c$$, normally chosen to be between 1 and 2.

Huber’s score function is the most widely used bounded score in robust estimation. It truncates large Pearson residuals symmetrically, which ensures the asymptotic normality of the estimator. The tuning constant $$c$$ controls the trade-off between robustness and asymptotic efficiency. In practice, $$c=1.345,\,c=1.5$$ or $$c=2$$ can be used depending on the seriousness of the contamination in a data set. Our simulations with different values of $$c$$ indicate that the choice of $$c$$ is not critical for obtaining a good robust estimate. In this article, we use $$c=2$$ in our implementation, which is sufficient to demonstrate the improvement in efficiency gained by adopting robust estimating equations.
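For reference, Huber’s score function $$\psi _{c}(x)=\min \{ c,\max \{ -c,x\} \}$$ can be written in a couple of lines. The sketch below is only an illustration; the function name is our own.

```python
def huber_psi(x, c=2.0):
    """Huber's score: the identity on [-c, c], truncated to +/- c outside."""
    return min(c, max(-c, x))

# A small residual passes through unchanged, while large ones are capped at c.
small = huber_psi(0.7)
large = huber_psi(6.0)
negative = huber_psi(-3.5, c=1.345)
```

A standardized residual of 6 thus contributes no more to the estimating equation than a residual of 2 when $$c=2$$.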

To ensure Fisher consistency, we use $$C^{\beta }_{i}(\mu _{i})=E[\psi (A_{i}^{-1/2}(y_{i}-\mu _{i}))],\, C^{\gamma }_i(\hat{r}_i)=E[\psi (D^{-1/2}_i(r_i-\hat{r}_i))]$$ and $$C^{\lambda }_i(\sigma ^2_i)=E[\psi (\widetilde{A}^{-1/2}_i(\epsilon ^2_i-\sigma ^2_i))]$$. Under the assumption that the $$y_i$$ are normally distributed, the three expectations depend only on the choice of the constant $$c$$ in Huber’s score function. In what follows, we use $$C^{\beta }_i=0,\,C^{\gamma }_i=0$$ and $$C^{\lambda }_i=-0.05$$, which are calculated under the normality assumption.
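The dependence of these constants on $$c$$ alone can be checked numerically. The sketch below is a rough illustration under stated assumptions: the innovations are taken to be $$\epsilon _{ij}\sim N(0,\sigma ^2_{ij})$$, so that $$Var(\epsilon ^2_{ij})=2\sigma ^4_{ij}$$ and the standardized working response is $$(z^2-1)/\sqrt{2}$$ with $$z\sim N(0,1)$$. The function names and the midpoint-rule integration are our own choices. Under this particular standardization the sketch gives zero for the mean constant, by symmetry, and a small negative number for the variance constant, of the same order as the value quoted above.

```python
import math

def huber_psi(x, c=2.0):
    return min(c, max(-c, x))

def normal_expectation(f, lo=-10.0, hi=10.0, steps=100_000):
    """Midpoint-rule approximation of E[f(Z)] for Z ~ N(0, 1)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * h
        total += f(z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)

c = 2.0
# Mean equation: the standardized residual is N(0, 1), so the expectation vanishes.
c_beta = normal_expectation(lambda z: huber_psi(z, c))
# Variance equation (assumed standardization): (eps^2 - sigma^2) / (sqrt(2) sigma^2).
c_lambda = normal_expectation(
    lambda z: huber_psi((z * z - 1.0) / math.sqrt(2.0), c))
```

The negative sign arises because the standardized squared innovation is right-skewed: only its upper tail is truncated at $$c$$, while its lower bound $$-1/\sqrt{2}$$ never reaches $$-c$$.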

In general, unless the true distribution is correctly specified, the expectations $$C_i$$ used in the estimating equations to ensure Fisher consistency are not available. Qin and Zhu (2007) discussed the difficulty of calculating $$C_i$$. They noted that $$C_i$$ can be calculated easily for binary data, since $$y_{ij}$$ takes only the values 0 and 1, but is hard to obtain for data following other distributions, because the expectation involves intractable integrals. In such situations, numerical integration methods or approximations are required to evaluate $$C_i$$. Wang et al. (2005b) provided a bias correction method for robust estimating functions.

The matrices $$W_i^{\beta }=\text{ diag}(w_{i1}^{\beta },\ldots ,w_{in_{i}}^{\beta }) ,\,W_i^{\gamma }=\text{ diag}(w_{i1}^{\gamma },\ldots ,w_{in_{i}}^{\gamma })$$ and $$W_i^{\lambda }=\text{ diag}(w_{i1}^{\lambda },\ldots ,w_{in_{i}}^{\lambda })$$ are diagonal weighting matrices. Their diagonal entries $$w_{ij}$$ assign a separate weight to each observation, rather than a single common weight to all observations from one subject.

Following Qin et al. (2009), we choose the weight function $$w_{ij}$$ as a function of the Mahalanobis distance in the form
\begin{aligned} w_{ij}=w(p_{ij})=\text{ min}\left\{ 1,\ \left[\frac{b_0}{(p_{ij}-m_p)^{\prime }S^{-1}_p(p_{ij}-m_p)}\right]^{\rho /2}\right\} , \end{aligned}
where $$\rho \ge 1$$, and $$m_p$$ and $$S_p$$ are robust estimates of the location and scale of $$p_{ij}$$, such as the minimum covariance determinant estimators; here $$p_{ij}$$ is the covariate vector in the design space. As a result, $$w_{ij}$$ is a weight function that can downweigh any leverage point in the design space. In the following simulation study, $$b_0$$ is chosen as the $$95$$th percentile of the chi-squared distribution with degrees of freedom equal to the dimension of $$p_{ij}$$, and $$\rho$$ is fixed at 1. Moreover, since $$z_{ij}=x_{ij}$$ (the design spaces for $$\beta$$ and $$\lambda$$ are the same), we choose $$p_{ij}=x_{ij}$$ for all three weighting matrices and denote them by $$W_i=\text{ diag}(w_{i1},\ldots ,w_{in_{i}})$$ for simplicity.
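A minimal sketch of the weight function for a scalar covariate is given below. It is an illustration under stated assumptions rather than the implementation used in the paper: the median and the scaled median absolute deviation stand in for the minimum covariance determinant estimators, the function name is hypothetical, and $$b_0$$ is the 95th percentile of the chi-squared distribution with one degree of freedom.

```python
import statistics

CHI2_1_Q95 = 3.841  # 95th percentile of chi-squared with 1 df (about 1.96 squared)

def mallows_weights(p, b0=CHI2_1_Q95, rho=1.0):
    """Mallows-type weights min{1, (b0 / d^2)^(rho/2)} for scalar covariates p.

    Location and scale use the median and 1.4826 * MAD, a simple robust
    stand-in (assumption) for the MCD estimators mentioned in the text.
    """
    m = statistics.median(p)
    s = 1.4826 * statistics.median([abs(v - m) for v in p])
    weights = []
    for v in p:
        d2 = ((v - m) / s) ** 2  # squared Mahalanobis distance in one dimension
        weights.append(1.0 if d2 == 0.0 else min(1.0, (b0 / d2) ** (rho / 2.0)))
    return weights

x = [-0.5, 0.1, 0.3, -0.2, 0.4, 0.0, 8.0]  # the last point is a leverage point
w = mallows_weights(x)
```

Points within the bulk of the design space keep weight 1, while the leverage point at 8.0 receives a weight well below 1.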

The three proposed robust estimating equations, which lie within the framework of generalized estimating equations and do not require the normality assumption, extend the method of Ye and Pan (2006), since our equations can resist contamination and downweigh potentially influential points. By introducing the modified Cholesky decomposition, the positive definiteness of the covariance matrix is guaranteed. A different variance–correlation decomposition is implemented in Fan et al. (2007). Furthermore, the dimension of the parameter space of the covariance matrix is substantially reduced, which allows us to consider regression models for the generalized autoregressive parameters and innovation variances simultaneously with the mean. Most importantly, we apply Mallows-type robust estimation to the mean and covariance jointly, which provides more thorough robustness than the single robust estimating equation established in He et al. (2005).

### 2.5 Estimators of parameters

A quasi-Fisher scoring algorithm is applied to solve for $$\beta ,\,\gamma$$ and $$\lambda$$ iteratively. We first choose starting values for $$\beta ,\,\gamma$$ and $$\lambda$$. In the special case of working independence, $$R_{i}=I$$, convenient starting values are $$\gamma ^{(0)}=0$$ and $$\lambda ^{(0)}=0$$, and (5) no longer depends on $$\gamma$$ and $$\lambda$$. Hence, the initial estimate $$\beta ^{(0)}$$ of $$\beta$$ is taken to be the solution to (5) in this special case, that is, the robust GEE estimator under a working independence covariance structure.

Given $$\varSigma _i$$, we solve (5) to find the estimate of $$\beta$$ using the iterative procedure
\begin{aligned} \beta ^{(k+1)}=\beta ^{(k)}+\left.\left\{ \left[\sum ^m_{i=1}X_i^{\prime }\varDelta _i(V^{\beta } _i)^{-1}\varGamma ^{\beta }_i\varDelta ^{\prime }_iX_i\right]^{-1}\sum ^m_{i=1}X_i^{\prime }\varDelta _i(V ^{\beta }_i)^{-1}h^{\beta }_i(\mu _i(\beta ))\right\} \right|_{\beta =\beta ^{(k)}}, \end{aligned}
(8)
where $$\varGamma ^{\beta }_{i}=E\dot{h}^{\beta }_{i}(\mu _{i}(\beta ))=E\partial h^{\beta }_{i}(\mu _{i})/\partial \mu _{i}|_{\mu _{i}=\mu _{i}(\beta )}$$, for $$i=1,\ldots ,m$$. In practice, if we know the distribution of the data, such as a normal or $$t$$ distribution, we can calculate the expectation analytically. Otherwise, when the true underlying distribution is unknown, we can use the sample mean or sample median to approximate the expectation.
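To illustrate how an update of the form (8) behaves, the sketch below treats the simplest special case only: an identity link, working independence with a known common scale, and a single covariate plus intercept. These simplifications, the function names, the variance 2 for the covariate and the simulated data are all assumptions made for illustration; this is not the full joint algorithm. Under normality $$E\psi _c^{\prime }(Z)=P(|Z|\le c)$$, which plays the role of $$\varGamma ^{\beta }_i$$ here.

```python
import math
import random

def huber_psi(x, c=2.0):
    return min(c, max(-c, x))

def robust_mean_fit(x, y, w, scale, c=2.0, tol=1e-10, max_iter=500):
    """Fisher scoring for sum_j w_j (1, x_j)' psi((y_j - b0 - b1 x_j) / scale) = 0."""
    # E[psi'(Z)] = P(|Z| <= c) for Z ~ N(0, 1)
    gamma_c = math.erf(c / math.sqrt(2.0))
    # Entries of the weighted cross-product matrix X'WX (fixed across iterations)
    s00 = sum(w)
    s01 = sum(wi * xi for wi, xi in zip(w, x))
    s11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = s00 * s11 - s01 * s01
    b0 = b1 = 0.0
    for _ in range(max_iter):
        u0 = u1 = 0.0
        for xi, yi, wi in zip(x, y, w):
            score = wi * huber_psi((yi - b0 - b1 * xi) / scale, c)
            u0 += score
            u1 += score * xi
        # Step = (scale / gamma_c) * (X'WX)^{-1} X'W psi
        step0 = scale / gamma_c * (s11 * u0 - s01 * u1) / det
        step1 = scale / gamma_c * (s00 * u1 - s01 * u0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1

rng = random.Random(2013)
n = 1000
x = [rng.gauss(0.0, math.sqrt(2.0)) for _ in range(n)]
y = [0.5 + xi + rng.gauss(0.0, 1.0) for xi in x]
for idx in rng.sample(range(n), 20):  # shift 2% of the responses by +6
    y[idx] += 6.0
b0_hat, b1_hat = robust_mean_fit(x, y, [1.0] * n, scale=1.0)
```

Because the truncated scores bound the contribution of each shifted response, the estimates should stay close to the values 0.5 and 1 used to generate the data.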
Given $$\beta$$ and $$\lambda$$, $$\gamma$$ can be updated approximately through
\begin{aligned} \gamma ^{(k+1)}=\gamma ^{(k)}+\left.\left\{ \left[ \text{ E}\sum ^m_{i=1}T^{\prime }_i(V^{\gamma }_i)^{-1}\varGamma ^{\gamma }_iT _i\right]^{-1}\sum ^m_{i=1}T^{\prime }_i(V^{\gamma }_i)^{-1}h^{\gamma }_i(\hat{r}_i(\gamma ) )\right\} \right|_{\gamma =\gamma ^{(k)}}, \end{aligned}
(9)
with $$\varGamma ^{\gamma }_i=E\dot{h}^{\gamma }_i(\hat{r}_i(\gamma ))=E\partial h^{\gamma }_i(\hat{r}_i)/\partial \hat{r}_i|_{\hat{r}_i=\hat{r}_i(\gamma )}$$, for $$i=1,\ldots ,m$$.
Finally, given $$\beta$$ and $$\gamma$$, the innovation variance parameters $$\lambda$$ can be updated using
\begin{aligned} \lambda ^{(k+1)}=\lambda ^{(k)}+\left.\left\{ \left[ \sum ^m_{i=1}Z_i^{\prime }D_i(V^{\lambda }_i) ^{-1}\varGamma ^{\lambda }_iD^{\prime }_iZ_i\right]^{-1}\sum ^m_{i=1}Z_i^{\prime } D_i(V^{\lambda }_i) ^{-1}h^{\lambda }_i(\sigma ^2_i(\lambda ))\right\} \right|_{\lambda =\lambda ^{(k)}}, \end{aligned}
(10)
where $$\varGamma ^{\lambda }_{i}=E\dot{h}^{\lambda }_{i}(\sigma ^2_{i}(\lambda ))=E\partial h^{\lambda }_{i}(\sigma ^2_{i})/\partial \sigma ^2_{i}|_{\sigma ^2_{i}= \sigma ^2_{i}(\lambda )}$$, for $$i=1,\ldots ,m$$.
In summary, these sets of parameters can be estimated using weighted least squares. The main algorithm proceeds iteratively as follows:
1. Step 1: Select an initial value $$(\beta ^{(0)^{\prime }},\gamma ^{(0)^{\prime }},\lambda ^{(0)^{\prime }})^{\prime }$$ and use model (3) to form $$\varPhi _{i}^{(0)}$$ and $$D_{i}^{(0)}$$. Then $$\varSigma _{i}^{(0)}$$, the starting value of $$\varSigma _{i}$$, is obtained.

2. Step 2: Use the weighted least squares updates (8)–(10) to calculate the estimators $$\beta ^{(1)},\,\gamma ^{(1)}$$ and $$\lambda ^{(1)}$$ of $$\beta ,\,\gamma$$ and $$\lambda$$, respectively.

3. Step 3: Replace $$\beta ^{(0)},\,\gamma ^{(0)}$$ and $$\lambda ^{(0)}$$ with the estimators $$\beta ^{(1)},\,\gamma ^{(1)}$$ and $$\lambda ^{(1)}$$.

Repeat Steps 2–3 until the parameter estimates converge.

In simulations, the proposed robust method works well under different contaminations. When the sample size is moderate, the convergence difficulty of the non-robust method lies mostly in the non-convergence of $$\hat{ \lambda }_m$$, especially under serious contamination.

The robust method is expected to outperform the non-robust method substantially under serious contamination. However, the non-robust method has difficulty producing a reliable result under heavy contamination, so in the simulation studies we can only compare the two methods under mild contamination.

### 2.6 Asymptotic properties and hypotheses testing

Following Ye and Pan (2006), we can obtain the following theorems.

### Theorem 1

Suppose the generalized estimating equations have only one root $$\hat{\theta }_m=(\hat{\beta }^{\prime }_m,\ \hat{\gamma }^{\prime }_m,\ \hat{\lambda }^{\prime }_m)^{\prime }$$. Under the mild regularity conditions stated in the Appendix, the generalized estimating equation estimator $$\hat{\theta }_m$$ is strongly consistent for the true value $$\theta _0=(\beta ^{\prime }_0,\ \gamma ^{\prime }_0,\ \lambda ^{\prime }_0)^{\prime }$$; that is, $$\hat{\theta }_m\rightarrow \theta _0$$ almost surely as $$m\rightarrow \infty$$.

We denote $$V_m =(v_m^{kl})_{k,l=1,2,3}$$ as the covariance matrix of the function $$U(\theta )/\sqrt{m}=(U_1^{\prime }(\beta ),\ U_2^{\prime }(\gamma ),\ U_3^{\prime }(\lambda ))/\sqrt{m}$$, where $$v_m^{kl}=m^{-1}cov(U_k,U_l)$$ for $$k\ne l$$ and $$v_m^{kk}=m^{-1}var(U_k)$$ $$(k,\ l=1,\ 2,\ 3)$$. For the following theorem, the covariance matrix $$V_m$$ evaluated at the true value $$\theta _0$$ is assumed to be positive definite. Furthermore, at $$\theta _0$$ we assume that
\begin{aligned} V_m=\left( \begin{array}{c@{\quad }c@{\quad }c} v^{11}_m&v^{12}_m&v^{13}_m \\ v^{21}_m&v^{22}_m&v^{23}_m \\ v^{31}_m&v^{32}_m&v^{33}_m \\ \end{array} \right)\rightarrow V=\left( \begin{array}{c@{\quad }c@{\quad }c} v^{11}&v^{12}&v^{13} \\ v^{21}&v^{22}&v^{23} \\ v^{31}&v^{32}&v^{33} \\ \end{array} \right)\ \text{ as}\ m\rightarrow \infty . \end{aligned}

### Theorem 2

Under the regularity conditions stated in the Appendix, the generalized estimating equation estimator $$\hat{\theta }_m=(\hat{\beta }^{\prime }_m,\ \hat{\gamma }^{\prime }_m,\ \hat{\lambda }^{\prime }_m)^{\prime }$$ is asymptotically normally distributed with
\begin{aligned} \sqrt{m}\left( \begin{array}{c} \hat{\beta }_m-\beta _0 \\ \hat{\gamma }_m-\gamma _0 \\ \hat{\lambda }_m-\lambda _0 \\ \end{array} \right)\rightarrow \text{ N} \left\{ 0, \left( \begin{array}{c@{\quad }c@{\quad }c} v^{11}&0&0 \\ 0&v^{22}&0 \\ 0&0&v^{33} \\ \end{array} \right)^{-1}\left( \begin{array}{c@{\quad }c@{\quad }c} v^{11}&v^{12}&v^{13} \\ v^{21}&v^{22}&v^{23} \\ v^{31}&v^{32}&v^{33} \\ \end{array} \right)\left( \begin{array}{c@{\quad }c@{\quad }c} v^{11}&0&0 \\ 0&v^{22}&0 \\ 0&0&v^{33} \\ \end{array} \right)^{-1} \right\} \end{aligned}
in distribution as $$m\rightarrow \infty$$, where the matrices $$v^{kl}$$ $$(k,l=1,2,3)$$ are evaluated at the true value $$\theta =\theta _0$$.

The proofs are given in the Appendix.

Note that when the responses $$y_i$$ are normally distributed, we have $$v^{kl}=0\ (k\ne l)$$ and the asymptotic covariance matrix in Theorem 2 reduces to $$\{ \mathrm{diag}(v^{11},\ v^{22},\ v^{33})\}^{-1}$$.

For inference, we use a robust estimator for the covariance matrix of $$\hat{\beta }$$:
\begin{aligned} Cov(\hat{\beta })=(\hat{H}_m)^{-1}\hat{K}_m(\hat{H}_m)^{-1}, \end{aligned}
where $$\hat{H}_m$$ and $$\hat{K}_m$$ are defined by
\begin{aligned}&\hat{H}_m=\sum ^m_{i=1} X_i^{\prime } \varDelta _i (V^\beta _i({\hat{\beta }_m}))^{-1}\varGamma _i^\beta \varDelta _i X_i,\\&\hat{K}_m =\sum ^m_{i=1}X_i^{\prime } \varDelta _i (V^\beta _i({\hat{\beta }_m}))^{-1}h_i^\beta (\mu _i({\hat{\beta }_m}))h_i^\beta (\mu _i({\hat{\beta }_m}))^{\prime }(V^\beta _i({\hat{\beta }_m}))^{-1}\varDelta _i X_i. \end{aligned}
In this sandwich-type estimator of the covariance matrix, we adopt the same Mallows-type weights and Huber’s function to control the influence of outliers. The covariance matrices of $$\hat{\gamma }$$ and $$\hat{\lambda }$$ can be estimated in an analogous way. In the same manner as He et al. (2005), we compare the average estimated standard errors with the Monte Carlo standard errors in simulations. Overall, the standard error estimation works well for different AR(1) correlation structures, whether or not contamination is present. Similar findings were obtained in Leng et al. (2010). As a result, we consider the asymptotic covariance formula quite acceptable as a large-sample approximation.

For hypothesis testing, within the framework of generalized estimating equations, the quasi-score test based on the derivative of the generalized estimating equations may be constructed. See Ye and Pan (2006) for details.

## 3 Simulation study

In this section, simulations including contaminated cases are conducted to assess the performance of the proposed robust method. Four estimation methods are considered: NR refers to the non-robust method, which is given in Ye and Pan (2006). HR$$_{m}$$ means the half-robust method on the mean. In other words, we only adopt the robust estimating equation (5), which is the estimating equation for the mean. In contrast, we have HR$$_{c}$$ that stands for the other half-robust method on the covariance matrix only. The R (robust) method is our proposed method which includes all three robust estimating equations. Note that the non-robust estimators of $$\beta ,\,\gamma$$ and $$\lambda$$ are defined through the same equations except that $$\psi (x)=x$$ and $$W_i = I_i$$, where $$I_i$$ are $$n_i \times n_i$$ identity matrices.

Study 1. The following Gaussian linear model is used:
\begin{aligned} y_{ij}=\beta _0+\beta _1x_{ij}+e_{ij},\ i=1,\dots ,m;\ j=1,\dots ,n_i, \end{aligned}
where $$m=100,\,x_{ij}\sim N(0,2),\,\beta _1=1,\,\beta _0=0.5$$ and $$e_i=(e_{i1},\dots ,e_{in_i})^{\prime }\sim N(0,\varSigma _i)$$.

The error term $$(e_{i1},\dots ,e_{in_i})$$ is generated from a multivariate normal distribution with mean 0 and covariance $$\varSigma _i$$ satisfying $$\varPhi _i\varSigma _i\varPhi ^{\prime }_i=D_i$$, where $$\varPhi _i$$ and $$D_i$$ are described in Sect. 2.1 with $$z_{ijk}=(1,\ (t_{ij}-t_{ik}))^{\prime }$$ and $$z_{ij}=x_{ij}$$. Two specifications are considered: Case (1) $$\gamma =(0.2,\ 0.3)^{\prime },\,\lambda =(-0.5,\ 0.2)^{\prime }$$ and Case (2) $$\gamma =(0.2,\ 0)^{\prime },\,\lambda =(-0.5,\ 0.2)^{\prime }$$. The two cases differ only in the choice of $$\gamma _2$$.

Similar to the sampling scheme in Fan et al. (2007), the observation times are regularly scheduled but may be missing in practice. Missing at random is considered. More precisely, each subject has a set of scheduled time points $$\{ 0,1,\dots ,12\}$$, in which each element (except time 0) has a $$20~\%$$ probability of being missing. A uniform $$[0,1]$$ random variable is added to each non-missing scheduled time. This results in irregular (not on a grid) observed time points $$t_{ij}$$ per individual, and $$t_{ij}$$ is then transformed onto the interval $$[0,1]$$.
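
The sampling scheme for the observation times can be sketched as below. The text does not state how $$t_{ij}$$ is mapped onto $$[0,1]$$; dividing by 13 (the largest attainable jittered time) is an assumption made here for illustration, and `sample_times` is a hypothetical helper name.

```python
import numpy as np

def sample_times(rng, p_missing=0.2, last=12):
    """Draw the irregular observation times of one subject."""
    scheduled = np.arange(last + 1)                 # 0, 1, ..., 12
    keep = rng.random(last + 1) >= p_missing        # each time kept w.p. 0.8
    keep[0] = True                                  # time 0 is never missing
    jitter = rng.random(int(keep.sum()))            # uniform jitter on [0, 1)
    t = scheduled[keep] + jitter
    return t / (last + 1)                           # assumed rescaling to [0, 1]
```

Each call yields between 1 and 13 strictly increasing time points, so the number of observations $$n_i$$ varies across subjects.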

To study robustness, we denote the no-contamination situation by NC and consider the following three contaminations:
1. C1: randomly choose $$2~\%$$ of $$x_{ij}$$ to be $$x_{ij}-3$$;
2. C2: randomly choose $$2~\%$$ of $$y_{ij}$$ to be $$y_{ij}+6$$;
3. C3: randomly choose $$2~\%$$ of $$x_{ij}$$ to be $$x_{ij}-3$$ and $$2~\%$$ of $$y_{ij}$$ to be $$y_{ij}+6$$.
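
The three contamination schemes can be sketched as follows. The code assumes that the $$2~\%$$ is taken over the pooled observations of all subjects, that the contaminated $$x_{ij}$$ and $$y_{ij}$$ are chosen independently, and that $$y_{ij}$$ is generated before $$x_{ij}$$ is shifted; none of these details is stated explicitly in the text. The function name `contaminate` is hypothetical.

```python
import numpy as np

def contaminate(x, y, scheme, rng, frac=0.02):
    """Return contaminated copies of the pooled covariate and response vectors.

    scheme is one of "NC", "C1", "C2", "C3".
    """
    x, y = x.copy(), y.copy()
    n = x.size
    k = int(round(frac * n))                       # number of contaminated points
    if scheme in ("C1", "C3"):
        idx = rng.choice(n, size=k, replace=False)
        x[idx] -= 3.0                              # shift the covariate
    if scheme in ("C2", "C3"):
        idx = rng.choice(n, size=k, replace=False)
        y[idx] += 6.0                              # shift the response
    return x, y
```

Under these assumptions C1 creates leverage points in the covariate, C2 creates outliers in the response, and C3 combines both.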

We consider 200 replications for the simulation. Table 1 shows the performance of the NR, HR$$_m$$, HR$$_{c}$$ and R estimators for both the mean and the covariance in Case 1. It is found that the non-robust, half-robust and robust estimation perform equally well in the case of uncontaminated data (NC), although there is some loss of efficiency in the robust method with slightly larger MSE of $$\lambda _1$$. However, for the contaminated data, the robust method generally achieves smaller biases resulting in smaller MSEs.
Table 1

Simulation results of bias and MSE for $$\beta ,\,\gamma$$ and $$\lambda$$ in Study 1 (Case 1)

| Parameter | Method | NC Bias | NC MSE | C1 Bias | C1 MSE | C2 Bias | C2 MSE | C3 Bias | C3 MSE |
|---|---|---|---|---|---|---|---|---|---|
| $$\beta _0=0.5$$ | NR | 0.001 | 0.0004 | 0.059 | 0.0044 | 0.117 | 0.0159 | 0.174 | 0.0332 |
| | HR$$_m$$ | 0.000 | 0.0004 | 0.038 | 0.0022 | 0.045 | 0.0035 | 0.096 | 0.0115 |
| | HR$$_{c}$$ | 0.000 | 0.0004 | 0.056 | 0.0041 | 0.117 | 0.0159 | 0.170 | 0.0318 |
| | R | 0.000 | 0.0004 | 0.029 | 0.0017 | 0.029 | 0.0020 | 0.067 | 0.0066 |
| $$\beta _1=1$$ | NR | 0.000 | 0.0001 | -0.041 | 0.0019 | 0.001 | 0.0003 | -0.042 | 0.0021 |
| | HR$$_m$$ | 0.000 | 0.0001 | -0.030 | 0.0011 | 0.001 | 0.0002 | -0.036 | 0.0016 |
| | HR$$_{c}$$ | 0.000 | 0.0001 | -0.045 | 0.0022 | 0.001 | 0.0004 | -0.043 | 0.0023 |
| | R | 0.000 | 0.0001 | -0.032 | 0.0012 | 0.001 | 0.0002 | -0.035 | 0.0015 |
| $$\gamma _1=0.2$$ | NR | 0.001 | 0.0003 | 0.026 | 0.0014 | 0.057 | 0.0049 | 0.059 | 0.0050 |
| | HR$$_m$$ | 0.001 | 0.0003 | 0.026 | 0.0014 | 0.058 | 0.0051 | 0.061 | 0.0052 |
| | HR$$_{c}$$ | 0.001 | 0.0003 | 0.025 | 0.0013 | 0.053 | 0.0046 | 0.057 | 0.0048 |
| | R | 0.001 | 0.0003 | 0.025 | 0.0013 | 0.054 | 0.0047 | 0.059 | 0.0051 |
| $$\gamma _2=0.3$$ | NR | -0.033 | 0.0033 | -0.102 | 0.0177 | -0.270 | 0.0881 | -0.295 | 0.1010 |
| | HR$$_m$$ | -0.036 | 0.0033 | -0.104 | 0.0180 | -0.278 | 0.0926 | -0.303 | 0.1056 |
| | HR$$_{c}$$ | -0.037 | 0.0033 | -0.097 | 0.0163 | -0.250 | 0.0796 | -0.284 | 0.0954 |
| | R | -0.041 | 0.0032 | -0.100 | 0.0169 | -0.257 | 0.0840 | -0.298 | 0.1038 |
| $$\lambda _1=-0.5$$ | NR | 0.000 | 0.0016 | 0.388 | 0.1532 | 0.958 | 0.9198 | 1.101 | 1.213 |
| | HR$$_m$$ | 0.000 | 0.0016 | 0.390 | 0.1541 | 0.961 | 0.9256 | 1.103 | 1.219 |
| | HR$$_{c}$$ | -0.002 | 0.0025 | 0.186 | 0.0371 | 0.364 | 0.1369 | 0.546 | 0.302 |
| | R | -0.002 | 0.0025 | 0.181 | 0.0352 | 0.363 | 0.1360 | 0.548 | 0.305 |
| $$\lambda _2=0.2$$ | NR | 0.003 | 0.0000 | -0.184 | 0.0346 | -0.116 | 0.0166 | -0.193 | 0.0389 |
| | HR$$_m$$ | 0.002 | 0.0004 | -0.186 | 0.0352 | -0.169 | 0.0167 | -0.193 | 0.0391 |
| | HR$$_{c}$$ | 0.004 | 0.0005 | -0.084 | 0.0078 | -0.039 | 0.0024 | -0.126 | 0.0172 |
| | R | 0.004 | 0.0005 | -0.081 | 0.0072 | -0.036 | 0.0023 | -0.125 | 0.0169 |

NR refers to the non-robust method; HR$$_{m}$$ means the half-robust method on the mean; HR$$_{c}$$ stands for the other half-robust method on the covariance matrix; R is the proposed robust method which adopts all three robust estimating equations
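
The bias and MSE entries of such a table are presumably the usual Monte Carlo summaries over the replications; the section does not spell out the formulas, so the sketch below is an assumption, and `bias_and_mse` is a hypothetical helper.

```python
import numpy as np

def bias_and_mse(estimates, truth):
    """Monte Carlo bias and mean squared error over replications.

    estimates : array of shape (replications,) or (replications, parameters)
    truth     : true parameter value(s)
    """
    est = np.asarray(estimates, dtype=float)
    bias = est.mean(axis=0) - truth
    mse = ((est - truth) ** 2).mean(axis=0)
    return bias, mse
```

Under this definition the MSE equals the squared bias plus the Monte Carlo variance of the estimates, so a small bias alone does not guarantee a small MSE.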

First we look into the performance in estimating $$\beta$$. For $$\beta _0$$, we notice that in C1, C2 and C3, the half-robust method on the mean outperforms the non-robust method and the half-robust method on the covariance. Meanwhile, the robust method performs better than the half-robust method on the mean and is therefore the best performer for $$\beta _0$$ among the four methods. This does not happen in the estimation of $$\beta _1$$. For $$\beta _1$$, the robust method R and the half-robust method HR$$_m$$ have similar performance. Both have much smaller MSEs than the non-robust method and the other half-robust method HR$$_c$$, but no great difference can be detected between R and HR$$_m$$ themselves.

Next we turn to the estimation of the parameters in the covariance matrix. All four methods show little difference in estimating $$\gamma$$, because the corresponding covariates for $$\gamma$$ only contain $$t$$, which is not contaminated at all. This supports, in one way, that the proposed robust method performs equally well when there is no contamination in the covariates for $$\gamma$$. On the other hand, the robust method has no advantage under no contamination. As for $$\lambda$$, while the half-robust method for the mean performs as poorly as the non-robust method (in both biases and MSEs), the robust estimators show a clear advantage (and are even slightly better than the half-robust method for the covariance). For both $$\lambda _1$$ and $$\lambda _2$$, the robust estimators have about half of the biases and one quarter of the MSEs of the non-robust and mean half-robust methods. Overall, the robust method performs favorably in comparison with the non-robust and the half-robust methods in the estimation of the covariance matrix. This suggests that the better estimate of the covariance matrix in the robust method results in a better estimate of the mean parameter $$\beta _0$$. In summary, the proposed robust method generally outperforms both the half-robust methods and the non-robust method under the different contaminations in the simulation study.

In the previous discussion, we did not provide a method to choose the structure $$R_i(\delta )$$. Instead, we considered the typical AR(1) structure with $$\delta$$ equal to 0, 0.2, 0.5 and 0.8 to test the sensitivity to the choice of $$\delta$$. Table 2 summarizes the MSEs for $$\beta ,\,\gamma$$ and $$\lambda$$ for the different values of $$\delta$$ in Case 1 without contamination and under contamination C3. From the table, we observe very similar performance when we select $$\delta =0,\ 0.2$$ or $$0.5$$. When $$\delta =0.8$$, the MSEs for the mean parameter $$\beta$$ still stay close to those for the other values of $$\delta$$. However, the increase in the MSEs for $$\lambda$$ cannot be ignored in the case without contamination. We suppose that this choice of $$\delta$$ is somewhat far from the truth, which leads to less efficient estimation compared with the other choices of $$\delta$$. In sum, we may conclude that the choice of $$\delta$$ or the structure has no significant influence on the mean model in our simulation. Similar conclusions can be obtained from the performance of estimation under C1 and C2, and thus we omit the results. In the rest of the article, we select $$\delta$$ equal to 0 for convenience. We have considered simulations for Case 2 as well. The results are similar to those of Case 1 and so they are omitted for brevity.
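Table 2

Mean squared errors for estimates using different $$\delta$$’s in Study 1 ($$\times 100$$)

| Setting | Parameter | Method | $$\delta =0$$ | $$\delta =0.2$$ | $$\delta =0.5$$ | $$\delta =0.8$$ |
|---|---|---|---|---|---|---|
| NC | $$\beta _0$$ | NR | 0.034 | 0.034 | 0.034 | 0.034 |
| | | R | 0.036 | 0.037 | 0.037 | 0.037 |
| | $$\beta _1$$ | NR | 0.009 | 0.009 | 0.009 | 0.009 |
| | | R | 0.012 | 0.012 | 0.012 | 0.012 |
| | $$\gamma _1$$ | NR | 0.036 | 0.036 | 0.036 | 0.036 |
| | | R | 0.037 | 0.037 | 0.037 | 0.037 |
| | $$\gamma _2$$ | NR | 0.333 | 0.333 | 0.334 | 0.335 |
| | | R | 0.338 | 0.338 | 0.339 | 0.340 |
| | $$\lambda _1$$ | NR | 0.169 | 0.168 | 0.183 | 0.306 |
| | | R | 0.270 | 0.270 | 0.293 | 0.456 |
| | $$\lambda _2$$ | NR | 0.044 | 0.050 | 0.064 | 0.076 |
| | | R | 0.052 | 0.057 | 0.075 | 0.089 |
| C3 | $$\beta _0$$ | NR | 3.14 | 3.14 | 3.14 | 3.16 |
| | | R | 0.57 | 0.57 | 0.57 | 0.58 |
| | $$\beta _1$$ | NR | 0.21 | 0.21 | 0.21 | 0.21 |
| | | R | 0.15 | 0.15 | 0.15 | 0.15 |
| | $$\gamma _1$$ | NR | 0.46 | 0.46 | 0.47 | 0.47 |
| | | R | 0.46 | 0.46 | 0.46 | 0.46 |
| | $$\gamma _2$$ | NR | 9.71 | 9.71 | 9.79 | 9.87 |
| | | R | 9.60 | 9.61 | 9.63 | 9.70 |
| | $$\lambda _1$$ | NR | 122 | 122 | 124 | 128 |
| | | R | 29.7 | 29.9 | 30.9 | 33.1 |
| | $$\lambda _2$$ | NR | 3.68 | 3.64 | 3.62 | 3.61 |
| | | R | 1.66 | 1.61 | 1.58 | 1.59 |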
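
For reference, an AR(1) working correlation matrix of the kind varied in this sensitivity check can be sketched as below. The code assumes the correlation decays with the index lag $$|j-k|$$ between observations rather than with the actual time lag, which the text does not specify; `ar1_corr` is a hypothetical name.

```python
import numpy as np

def ar1_corr(n, delta):
    """AR(1) working correlation: entry (j, k) equals delta ** |j - k|."""
    idx = np.arange(n)
    return float(delta) ** np.abs(idx[:, None] - idx[None, :])
```

With `delta = 0` the matrix reduces to the identity, which corresponds to the working-independence choice adopted for the rest of the article.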

In addition to the bias and MSE criteria that we considered, two loss functions are introduced to see how the four methods work in estimating the covariance matrix. They are the entropy loss
\begin{aligned} L_1(\varSigma ,\ \hat{\varSigma })=m^{-1}\sum ^m_{i=1}\{ \text{trace}(\varSigma _i\hat{\varSigma }_i^{-1})-\log |\varSigma _i\hat{\varSigma }_i^{-1}|-n_i\}, \end{aligned}
and the quadratic loss
\begin{aligned} L_2(\varSigma ,\ \hat{\varSigma })=m^{-1}\sum ^m_{i=1}\text{trace}(\varSigma ^{-1}_i\hat{\varSigma }_i - I_i)^2, \end{aligned}
where $$\varSigma _i$$ is the true covariance matrix and $$\hat{\varSigma }_i$$ is its estimator. Each of these losses is 0 when $$\hat{\varSigma }_i=\varSigma _i$$ and positive otherwise. The entropy loss is the same as the Kullback–Leibler loss after switching the roles of the covariance matrix and its inverse. As indicated by Levina et al. (2008), the entropy loss is a more appropriate measure if the covariance matrix itself is the primary object of interest.
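
A minimal sketch of how the two losses $$L_1$$ and $$L_2$$ could be computed is given below. It reads the squared term in $$L_2$$ as the trace of the squared matrix $$(\varSigma ^{-1}_i\hat{\varSigma }_i - I_i)^2$$, which is the usual form of the quadratic loss but is an interpretation of the notation; the function names are hypothetical.

```python
import numpy as np

def entropy_loss(sigmas, sigma_hats):
    """L1: average of trace(S S_hat^{-1}) - log|S S_hat^{-1}| - n_i."""
    total = 0.0
    for S, S_hat in zip(sigmas, sigma_hats):
        M = S @ np.linalg.inv(S_hat)
        _, logdet = np.linalg.slogdet(M)        # log-determinant, numerically stable
        total += np.trace(M) - logdet - S.shape[0]
    return total / len(sigmas)

def quadratic_loss(sigmas, sigma_hats):
    """L2: average of trace((S^{-1} S_hat - I)^2)."""
    total = 0.0
    for S, S_hat in zip(sigmas, sigma_hats):
        A = np.linalg.solve(S, S_hat) - np.eye(S.shape[0])
        total += np.trace(A @ A)
    return total / len(sigmas)
```

As a quick check, if $$\hat{\varSigma }_i=2\varSigma _i$$ with $$n_i=2$$, the entropy loss is $$2\log 2-1\approx 0.386$$ and the quadratic loss is 2, while both vanish when the estimate equals the truth.
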
Now we evaluate the overall performance of the covariance matrix estimation. Table 3 demonstrates that the robust method substantially reduces both the entropy loss and the quadratic loss, especially when the contamination is heavier, although it has larger losses in the case of no contamination. Here we see clearly that the mean half-robust method offers little improvement in resisting contamination when estimating the covariance matrix, since its losses are nearly the same as those of the non-robust method. In contrast, the half-robust method on the covariance works as well as the robust method in covariance matrix estimation.
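Table 3

Entropy loss and quadratic loss in estimating $$\varSigma$$ in Study 1

| Loss | Method | NC | C1 | C2 | C3 |
|---|---|---|---|---|---|
| Entropy loss | NR | 0.05 | 0.81 | 2.51 | 3.40 |
| | HR$$_{m}$$ | 0.05 | 1.09 | 2.54 | 3.76 |
| | HR$$_{c}$$ | 0.06 | 0.27 | 0.77 | 1.38 |
| | R | 0.06 | 0.29 | 0.76 | 1.41 |
| Quadratic loss | NR | 0.26 | 5.32 | 25.1 | 29.9 |
| | HR$$_{m}$$ | 0.26 | 6.16 | 25.1 | 31.1 |
| | HR$$_{c}$$ | 0.42 | 1.52 | 5.52 | 10.2 |
| | R | 0.42 | 1.55 | 5.50 | 10.2 |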

It should be noted that the results in Table 3 are obtained from 200 replications with successful convergence in estimation. The robust method converged in all situations. However, the non-robust and the half-robust methods did not converge in a few percent of the simulations. Thus, the robust method is recommended, since the non-convergence problem should not be neglected. Moreover, we find that the heavier the contamination, the poorer the convergence of the non-robust and half-robust methods. This is one of the reasons why we only compare the performance of the four methods under relatively mild contaminations.

Study 2. This study is designed to compare the performance of the proposed robust method and the non-robust method when the data sets are from non-normal distributions. The half-robust methods are not considered because they are outperformed by the robust method. The setting is similar to that in Study 1 except that the error terms $$(e_{i1},\dots ,e_{in_i})$$ are drawn from (a) a multivariate $$t$$-distribution with 3 degrees of freedom and covariance matrix $$\varSigma _i$$ and (b) a mixed multivariate normal distribution with 30 % coming from a normal distribution $$N(-0.7\times \mu _{mn}, \varSigma _i)$$ and the other 70 % from $$N(0.3\times \mu _{mn}, \varSigma _i)$$, where $$\mu _{mn}$$ will be specified later. Note that the error terms drawn from (b) are asymmetric in distribution. Only the case of no contamination and the case of the third contamination C3 are considered.
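
The two error distributions can be sketched as follows. Several details are assumptions: the $$t$$ errors are rescaled so that their covariance (not their scale matrix) equals $$\varSigma _i$$, the mixture component is drawn once per subject, and $$\mu _{mn}$$ is treated as a scalar applied to every coordinate. The function names are hypothetical.

```python
import numpy as np

def mvt_errors(rng, sigma, df=3):
    """One draw from a multivariate t with df degrees of freedom and covariance sigma."""
    n = sigma.shape[0]
    scale = sigma * (df - 2) / df              # scale matrix giving Cov(e) = sigma
    z = rng.multivariate_normal(np.zeros(n), scale)
    w = rng.chisquare(df) / df
    return z / np.sqrt(w)

def mixed_normal_errors(rng, sigma, mu_mn):
    """One draw from 0.3 N(-0.7 mu_mn, sigma) + 0.7 N(0.3 mu_mn, sigma)."""
    n = sigma.shape[0]
    shift = -0.7 * mu_mn if rng.random() < 0.3 else 0.3 * mu_mn
    return rng.multivariate_normal(np.full(n, shift), sigma)
```

The mixture has mean $$0.3\times (-0.7\mu _{mn})+0.7\times (0.3\mu _{mn})=0$$, so the errors remain centered while becoming increasingly skewed as $$\mu _{mn}$$ grows.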

We report the MSEs of the mean parameters $$\beta _0$$ and $$\beta _1$$ in Table 4, together with the entropy loss and quadratic loss in Table 5. Results for error terms from the normal distribution, the $$t$$ distribution and the mixed normal distributions with $$\mu _{mn}=0.5,\ 1$$ and $$2$$ are listed (the latter are termed MN0.5, MN1 and MN2, respectively). It is predictable that the MSE of $$\beta$$ increases when the error terms are more asymmetric, because the bias-correction terms $$C^{\beta }_i,\,C^{\gamma }_i$$ and $$C^{\lambda }_i$$ in the estimating equations can be misspecified. Meanwhile, under contamination the robust method always has a smaller MSE for $$\beta$$ and smaller losses for the covariance matrix than the non-robust method.
Table 4

Mean squared errors for $$\beta _0$$ and $$\beta _1$$ in Study 2 ($$\times 100$$)

| Setting | Parameter | Method | Normal | t(3) | MN0.5 | MN1 | MN2 |
|---|---|---|---|---|---|---|---|
| NC (no contamination) | $$\beta _0$$ | NR | 0.035 | 0.036 | 0.087 | 0.277 | 1.008 |
| | | R | 0.036 | 0.040 | 0.094 | 0.300 | 1.181 |
| | $$\beta _1$$ | NR | 0.009 | 0.010 | 0.010 | 0.013 | 0.015 |
| | | R | 0.012 | 0.012 | 0.013 | 0.014 | 0.016 |
| C3 (contamination 3) | $$\beta _0$$ | NR | 3.33 | 3.33 | 3.43 | 3.68 | 4.50 |
| | | R | 0.66 | 0.66 | 0.76 | 1.09 | 2.21 |
| | $$\beta _1$$ | NR | 0.20 | 0.20 | 0.20 | 0.20 | 0.20 |
| | | R | 0.15 | 0.15 | 0.13 | 0.14 | 0.15 |


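Table 5

Entropy loss and quadratic loss in estimating $$\varSigma$$ in Study 2

| Setting | Loss | Method | Normal | t(3) | MN0.5 | MN1 | MN2 |
|---|---|---|---|---|---|---|---|
| NC (no contamination) | Entropy loss | NR | 0.035 | 0.039 | 0.265 | 1.351 | 3.476 |
| | | R | 0.044 | 0.049 | 0.251 | 1.316 | 3.413 |
| | Quadratic loss | NR | 0.216 | 0.216 | 1.002 | 3.452 | 7.976 |
| | | R | 0.359 | 0.365 | 0.660 | 2.374 | 5.730 |
| C3 (contamination 3) | Entropy loss | NR | 5.229 | 5.220 | 5.490 | 6.053 | 7.320 |
| | | R | 2.178 | 2.164 | 2.523 | 3.357 | 5.067 |
| | Quadratic loss | NR | 43.43 | 42.44 | 42.51 | 41.57 | 40.01 |
| | | R | 13.70 | 13.61 | 14.42 | 15.31 | 17.51 |
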

Furthermore, we find it interesting that, for the asymmetric error distributions, the robust covariance matrix estimator has even smaller entropy (and quadratic) losses than the non-robust estimator under no contamination (Table 5). This supports the view that the robust method yields a better estimate of the covariance matrix, which can be seriously affected by outliers, non-normal errors or misspecification of the underlying distribution. In all, Study 2 demonstrates that the proposed robust method is able to accommodate the effect of outliers and to improve the efficiency of parameter estimation under non-normal or asymmetric distributions.

## 4 Real data analysis

We apply the proposed method to analyze the longitudinal data of a hormone study on progesterone (Zhang et al. 1998). This data set involves a total of 492 observations among the 34 subjects. The log-transformed progesterone level is taken to be the response ($$y_{ij}$$). Other than time ($$t$$), two covariates age (AGE) and body mass index (BMI) are considered. Following Zhang et al. (1998) and other researchers who have imposed a non-linear relationship between $$y_{ij}$$ and $$t_{ij}$$, we consider the following model:
\begin{aligned} y_{ij}&= \beta _0+\beta _1 \text{ AGE}_i+\beta _2\text{ BMI}_i+\beta _3t_{ij}+\beta _4t_{ij}^2+\beta _5t_{ij}^3+e_{ij},\\ \phi _{ijk}&= (1,\ (t_{ij}-t_{ik}),\ (t_{ij}-t_{ik})^2,\ (t_{ij}-t_{ik})^3)^{\prime }\ \gamma ,\\ \text{ log}\sigma ^2_{ij}&= (1,\ \text{ AGE}_i,\ \text{ BMI}_i,\ t_{ij},\ t_{ij}^2,\ t_{ij}^3)^{\prime }\ \lambda . \end{aligned}
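
The covariate vectors entering the three sub-models above can be sketched as follows; `mean_design` and `lag_design` are hypothetical helper names, and any scaling of AGE, BMI or time applied in the paper is not reproduced here.

```python
import numpy as np

def mean_design(age, bmi, t):
    """Rows (1, AGE, BMI, t, t^2, t^3) for one subject.

    The same rows serve the mean model and the log innovation variance model.
    """
    t = np.asarray(t, dtype=float)
    one = np.ones_like(t)
    return np.column_stack([one, age * one, bmi * one, t, t**2, t**3])

def lag_design(t_j, t_k):
    """Vector (1, d, d^2, d^3) with d = t_j - t_k, for the autoregressive parameter."""
    d = float(t_j) - float(t_k)
    return np.array([1.0, d, d**2, d**3])
```

The mean and the log innovation variance share one design matrix, while the generalized autoregressive parameter $$\phi _{ijk}$$ depends on the data only through the time lag.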
Table 6 lists the intercepts and the regression coefficients for AGE and BMI obtained from both the robust and the non-robust methods, as well as those of the GEE method for comparison. The effects of AGE and BMI are found to be insignificant for both the robust and non-robust methods because of the large standard errors. In Table 6, the response looks less negatively affected by the body mass index in the robust model than in the non-robust model, while it is more positively affected by age in the robust estimation. The obvious numerical differences between the robust and non-robust estimates imply that the data may be contaminated.
Table 6

Regression coefficient estimates and standard deviations (in parentheses) for AGE and BMI of the hormone data

GEE

NR

R

Intercept

$$0.87_{(0.13)}$$

$$0.80_{(0.14)}$$

AGE

$$2.05 _{(1.96)}$$

$$2.48_{(2.16)}$$

$$3.95_{(2.05)}$$

BMI

$$-1.72_{(2.26)}$$

$$-1.71_{(2.92)}$$

$$-0.42_{(3.49)}$$

The weights $$w_{ij}$$ in our robust method are calculated from $$p_{ij}=(\text{ AGE}_i, \text{ BMI}_i)$$. The heavily downweighted points are from subject 18 (a cluster of points from case 244 to case 263), with $$w_{ij}=0.459$$. A closer inspection of the data set shows that subject 18 has an extremely high BMI of 38. To look further into robustness, we consider the standardized residual $$s_{ij}$$, which is the $$j$$th component of $$\hat{\varSigma }_i^{-1/2}(y_i-\hat{\mu }_i)$$. Case 10 appears to be the most extreme point, with $$s_{ij}=-4.58$$. The progesterone level of the 10th observation for subject 1 (case 10) is 2.46, which is very different from its neighboring observations 9 and 11, measured one day before and one day after, with progesterone levels of 12.8 and 13.4, respectively. In fact, the other 13 observations on subject 1 range from 8.5 to 13.4. In particular, this observation is the lowest progesterone level in the whole data set. Therefore, we conclude that case 10 is a clear outlier from subject 1, which is consistent with Fung et al. (2002). When the sample size is moderate, a subject-level potential outlier can have a significant influence on estimation and inference. Subject 24 is a potential outlier, as the mean of its standardized residuals is 2.66, with $$s_{ij}$$ of cases 337–346 ranging from 2.09 to 3.80. Subject 24 turns out to be a young woman with very low BMI and the highest average progesterone level. Consequently, we find that the robust method substantially downweighs the effect of both the subject-level and the observation-level outliers. This is the main reason that the robust method leads to a shift in the estimated coefficients of the mean model compared with its non-robust version. We believe that the robust estimation is more reliable. Nevertheless, the large standard errors of the estimates suggest that a much larger sample is needed to reach any concrete finding.
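
The standardized residuals $$s_{ij}$$ used in this diagnosis can be computed along the following lines. The sketch assumes the symmetric inverse square root of $$\hat{\varSigma }_i$$ obtained from an eigendecomposition; the text does not say which matrix square root is used, and a Cholesky-based root would give different components. The function name is hypothetical.

```python
import numpy as np

def standardized_residuals(y, mu_hat, sigma_hat):
    """Components of Sigma_hat^{-1/2} (y - mu_hat) for one subject."""
    vals, vecs = np.linalg.eigh(sigma_hat)               # sigma_hat is symmetric PD
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T     # symmetric inverse square root
    return inv_sqrt @ (np.asarray(y, dtype=float) - np.asarray(mu_hat, dtype=float))
```

If the model fits, these components are roughly uncorrelated with unit variance, so values such as $$-4.58$$ stand out as unusually extreme.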

Figure 1 displays the fitted curves for the cubic polynomial of time. From Fig. 1, we can see that the fitted curves decrease in the first 7 days and increase steadily afterwards. They reach a peak around the 23rd day of the cycle and then decrease again. The trajectory of the mean curve is very similar to that in Mao et al. (2011). Furthermore, although there are outliers in the data set, their effects on the estimation of the mean curve are not large, as we can see from Fig. 1 that the robust and non-robust estimates are rather close to each other.

Figure 2a plots the estimated generalized autoregressive parameters $$\phi$$ against the time lag between measurements in the same subject, which is also modeled as a cubic polynomial. The graph shows that the generalized autoregressive parameter decreases sharply from 0.6 to 0 if the time lag is less than 8 days and then drops slowly when the lag becomes larger. It is noted that the robust and non-robust estimates of the parameter, which is essentially a mean parameter as seen from (2), are also quite close to each other. From Fig. 2b, it is observed that the non-robust estimate of the innovation variance fluctuates more strongly than the robust estimate. The figure gives some idea of how the robust method works. The shrinkage of the estimated innovation variance (from above 0.6 to less than 0.5) suggests that our method can downweigh the effect of outliers to achieve robustness. Unlike the estimation of the mean parameter (Fig. 1), the non-robust estimate of the variance parameter (innovation variance) is strongly affected by outliers and is rather different from the robust estimate (Fig. 2b). Although little difference can be detected in the mean model from Fig. 1, the robust method improves the estimation of the covariance matrix, especially of the innovation variance.

## 5 Discussion

In this paper, we propose a simultaneous robust model for the mean and covariance matrix of longitudinal data. The proposed method has the following advantages and properties: (i) the robust covariance model guarantees positive definiteness based on the covariance decomposition, with a proper statistical interpretation; (ii) it controls the influence of outliers in the mean and covariance models simultaneously, which yields more reliable estimation for the joint mean and covariance model; (iii) the robust algorithm has a much greater chance of converging to a solution than the non-robust algorithm. The robust estimating equations proposed here should enhance the development of joint mean and covariance models for longitudinal data.

A limitation of the proposed method is that the model may include redundant covariates. If we have no prior knowledge of the covariance structure, then we are prone to include all the time and mean associated variables. Redundant covariates may bring in outliers and also increase the computational burden. Robust estimating equations that can serve the goal of both estimating and penalizing the models with too many covariates are under development.

## Acknowledgments

The authors are grateful to the reviewers, the Associate Editor, and the Co-Editor for their insightful comments and suggestions which have improved the manuscript significantly.