Estimation of random-effects model for longitudinal data with nonignorable missingness using Gibbs sampling

The missing data problem is common in longitudinal or repeated measurements data. When the missingness mechanism is nonignorable, the distribution of the observed response and indicators of missingness should be modelled jointly using either ‘shared random-effects model’ or ‘correlated random-effects model’. However, computational challenges arise in the model fitting due to intractable numerical integration involved in the log-likelihood function. We provide alternative modeling of ‘correlated random-effects model’ using latent variables and propose a simple algorithm based on Gibbs sampling for estimation of associated parameters. The method is illustrated through simulation and the analysis of a real data set arising from an autism study.


Introduction
In designed longitudinal studies, the aim is to estimate the mean response at a certain time based on fixed or time-varying covariates. For studies with long follow up periods, the proportion of individuals with missing data can be substantial. Inference based on the observed data may lead to biased and unreliable results. Ample amount of research works on the modeling of longitudinal data with ignorable missingness are available in the literature (Little and Rubin 2002). In this paper, we focus on modeling of longitudinal data when the missingness mechanism is nonignorable. In these cases, the distribution of the observed response and indicators of missingness should be modelled jointly. Such joint models can be classified into either pattern mixture models or selection models. Little (1995) provided a detailed overview of pattern mixture models and selection models for longitudinal studies with missing data due to informative B Prajamitra Bhuyan bhuyan.prajamitra@gmail.com 1 Department of Mathematics, Imperial College London, London, UK dropout. Siddiqui and Ali (1998) considered random-effects pattern-mixture models and provided estimates of the associated parameters by averaging the estimates obtained from different subsets of the data depending on the missing data-patterns. Daniels and Hogan (2000) proposed a reparameterization of the pattern mixture model that allows consideration of a wide range of nonignorable missing-data mechanisms. Diggle and Kenward (1994) proposed the use of a selection model with a logistic regression form to deal with informative dropout. Baker (1995) considered repeated binary data with nonignorable and non-monotone missingness. Troxel et al. (1998) extended those of Diggle and Kenward (1994) for non-monotone and nonignorable missing data. However, its implementation is computationally challenging. Rotnitzky et al. (1998) considered inverse probability weighted estimating equations for the nonignorable missing and provided a simultaneous estimation of the dropout probability and mean response based on the selection model. Minini and Chavance (2004) considered a log-linear model and provided a sensitivity analysis for the longitudinal binary data with nonignorable missingness.
In the context of survival analysis, shared random-effects (SRE) model proposed by Wu and Carrol (1998), is popularly used as an alternative to the selection models. De and Tu (1994) and Schluchter (1994) suggested some extensions of SRE models. Have et al. (1998) and Pulkstenis et al. (1998) adopted the SRE model for binary longitudinal data with the informative dropouts. Tsonaka et al. (2009) considered semi-parametric shared parameter model for the modelling of the response variable with non-monotone and nonignorable missingness. In the aforementioned SRE models, the selection model and the response model have exactly the same random component.
In many situations, the latent factors affecting the missingness could be different from those affecting the response; however, they are correlated due to common risk factors. In order to model such phenomena, Lin et al. (2010) considered an interesting generalization of SRE model by using correlated random effects model. An underlying assumption for the random-effects model is that, conditional on the random effects, the missingness is independent of the response. Note that ignorable missingness is a special case of the correlated random-effects model if the two random effects are independent. A main concern in the random-effects models is computational challenges arise in the model fitting due to intractable numerical integration involved in the loglikelihood function. In order to overcome such difficulties, well-known approximation methods like the Gauss-Hermite quadrature (Pinheiro and Bates 1995) and the Laplace approximation (Breslow and Clayton 1993) are exploited for estimation purposes. Lin et al. (2010) expressed the likelihood as a ratio of two integrals and then approximated the numerator and denominator using the Laplace formula. In order to estimate the associated model parameters, one needs to evaluate first and second derivatives and maximize the two integrands.
We propose alternative modeling of the observed response and indicators of missingness based on correlated latent variables. In particular, we develop regression models with the covariates having a time-varying effect and time-invariant effects on the latent variables involving correlated random effects. A simple Gibbs sampler is developed following Albert and Chib (1993), where in each iteration, we sample the model parameters as well as the latent variables. Our method is simple because it is based on the Gibbs sampler, and it is fast since we estimate the parameters for both the models simultaneously in an automated manner and avoids the computational challenges posed by intractable log-likelihood functions typically encountered in the frequentist method. The rest of the paper is organized as the follows. In Sect. 2, we discuss our proposed model and the Bayesian estimation method in detail. We analyze data from a longitudinal study of the social development of children with autism in Sect. 3. Simulation studies are performed for assessing the effectiveness of the proposed method and the results are finally summarized in Sect. 4. In Sect. 5, we provide outlines of possible future work and some concluding remarks.

Proposed model
In the following presentation, we consider a continuous response measured over m different time points from n subjects. We consider a set of covariates, some of which possibly have a time-varying effect on the response. The response for the ith subject at the tth time point, which we denote by Y i (t), can thus be modelled as the following: where we consider J , J and L denote the number of covariates with time-varying fixed effects, time-invariant fixed effects and random effects on response, respectively. Subject-specific random effects u i = (u 1i , . . . , u Li )'s capture the longitudinal dependence and are assumed to be iid N L (0, Σ u ). The residuals e i (t)'s are assumed to be iid N (0, σ 2 e ). Note that the above model is a special case of generalized varying coefficient model for longitudinal data, introduced by Sentrk et al. (2013).
In general, the data from longitudinal studies involving large number of people possesses some missing responses. In many situations, the missing data mechanism is ignorable and it can be well handled using several available methodologies under the assumption of missing at random (MAR) (Little 1995). However, when there are nonignorable missing values in the response variable, the models in Eq. (1) will produce biased estimates under the MAR assumption. Such data are not so uncommon, specially in biomedical studies and social sciences (Little and Rubin 1987, Ch-1). In order to address such issues, we first define U i (t) as is a latent random variable. We then rewrite the regression model given in Eq. (1) in terms of the latent random variables as follows In addition, we consider the latent variable U * i (t) and write Now we consider the following model for missing data mechanism with some covariates as where the random effects , and the residuals i (t) are iid N (0, 1). In order to incorporate the possible correlation between the response variable Y i (t) and the missing indicator U i (t), we consider u i and v i are correlated random vectors following a multivariate normal distribution with mean vector 0 and covariance matrix For the aforementioned models, we can use the usual models for multivariate longitudinal data (Diggle et al. 2002, p. 332) but this requires values of the latent variables at each step. We propose a Bayesian estimation method for simultaneous estimation of the parameters associated with the joint model using Gibbs sampling.

Modelling time-varying coefficients
One of the major advantages of the longitudinal studies is to incorporate age effect in the modeling and its capacity to distinguish changes in the response within and across individuals over time (Diggle et al. 2002, p. 1). In order to model the effects of the respective covariates over time, we have considered the time-varying coefficients β j (t) and θ k (t) in Eqs. (2) and (3), respectively. Since parametric nature of β j (t) and θ k (t) is not known in advance, we consider semi-parametric approach of modelling the time-varying coefficients using Legendre polynomials (LP) basis functions. These Polynomials have already been proven as powerful tool by several authors for semiparametric regression (Marie and Sen 1985;Meyer 2000;Cui and Zhu 2006;Bhuyan et al. 2019).
The general form of a Legendre Polynomial of order r is given by the following sum where L = r 2 or r −1 2 , whichever is an integer. These polynomials are defined over [− 1, 1] and are orthogonal to each other in this interval in the sense that the inner product 1 −1 P r (x)P s (x)dx = 0, for r = s. First few LPs are as the following: In our context, we denote the r th order Legendre polynomial (LP) at time t by P r (t). We transform the original time points t to the adjusted time points t = −1 + 2( t−t min t max −t min ), for fitting the orthogonal LP over the range [− 1, 1], where t min and t max are the smallest and the highest time points respectively. Let P (r ) (t ) = [P 0 (t ), P 1 (t ), . . . , P r (t )] T denote the family of the first r + 1 basis functions and express the functions β j (t ) and θ k (t ) as some linear combinations of these basis functions: The optimal orders r 1 and r 2 may be chosen by the information criteria, e.g. AIC/BIC etc. Unless great flexibility is required, very low values of r l such as 1 or 2 will suffice, for l = 1, 2. For example, let us consider J = 2, J = 0, L = 1, Then Eqs. (2) and (3) reduces to and respectively, where α j = (α j0 , α j1 ) T 's and λ j = (λ j0 , λ j1 ) T 's are suitably adjusted for the change in location and scale in time, for j = 1, 2. The models (4) and (5) account for the age effects and its interaction with the covariate X 2i (t). Note that the parameters involved in LPs possesses interesting interpretations and it is easy to implement. Moreover, computational complexities (e.g., knot selection, knot location) related to the other basis functions can be automated.

Bayesian estimation using Gibbs sampler
We employ a Bayesian approach of estimating the model parameters for the Eqs. (2) and (3) using the Gibbs sampler. Let Θ = [a, b, γ , δ, σ 2 e , Σ] denote the set of all the model parameters involved in the Eqs. (2) and (3), where the bold symbols denote the vector of the respective coefficients. Note that, one needs to sample from the joint posterior of the model parameters and unknown latent variables. Let , and U = (U 11 (t), . . . , U nm (t)), and write the joint posterior density for the latent variables and the parameters associated with the proposed model as where π(·) is the joint prior for Θ, and f (·, ·) and g(·, ·) are the joint distribution of respectively. Following the traditional Bayesian regression, we consider the non-informative priors for (a, γ , σ 2 e ) and (b, δ). In addition, we consider maximal data information prior for Σ. Therefore the joint prior distribution π (a, b, γ , δ, σ 2 e , Σ), can be expressed as the product of π(a, γ , σ 2 e ) ∝ 1 σ 2 e , π(b, δ) ∝ 1, and π(Σ) ∝ 1 |Σ| . The posterior distributions of Θ conditional on Y * , U * , Y , U can be derived routinely and hence we skip the details. The full conditionals for (a, b, γ , δ), σ 2 e , and Σ are normal, inverse gamma and inverse-Wishart, respectively. The latent variable and respectively. Note that the full conditional densities of the latent variables, given by (6) and (7), are also from standard densities, and hence, one can directly apply Gibbs sampler algorithm for estimation of model parameters Θ.

Data analysis
We apply our proposed method on the data arising from a prospective longitudinal study of the social development of children with autism. A total of 214 children participated in the study who were divided into three diagnostic groups at 2 years of age: autism, pervasive developmental disorder (PDD), and nonspectrum children. We consider a subset of 158 autism spectrum disorder (ASD) children, and social development information was collected for each child at ages 2, 3, 5, 9 and 13 years based on a parent-reported survey. The objective was to assess the development trajectories of these children's socialization for different language proficiency groups. The response variable, Vineland socialization age equivalent (VSAE), was a combined score that included assessments of interpersonal relationships, play/leisure time activities, and coping skills. However, the measurements corresponding to all children at each age are not available, which resulted in 22% missing in the response variable. The children's language development was assessed by the Sequenced Inventory of Communication Development (SICD) score at age 2, and children were categorized into three groups (SICDEGP) based on their SICD scores. The data were collected by researchers at the University of Michigan and analyzed in West et al. (2007, Ch-6) under MAR assumption.
Let Y i (t) be the log(VSAE) score of the ith child at time t, where t = log(Age). Lin et al. (2010) observed that the general trend of the VSAE score is increasing with age, while there is a substantial variation of the VSAE scores among the children, and hence the logarithmic transformation has been considered. In order to incorporate the categorical variable SCIDEGP, we introduce two dummy variables SCI2 and SCI3 representing the second and third level of the SICD group, respectively, and we take SICDEGP = 1 as the reference group. As discussed in the Sect. 2.1, we consider first order LP for modeling of time-varying coefficients and the models (4) and (5) can be rewritten as and U * i (t) = λ 10 +λ 11 t +λ 20 SIC2 2i +λ 21 tSIC2 2i +λ 30 SIC3 2i +λ 31 tSIC3 2i +v 1i + i (t), respectively. The model parameters are estimated by the Gibbs sampler, as discussed in Sect. 2.2. We run MCMC for 50,000 iterations and discard the first 10,000 iterations as burn-in. We also thin the chains by saving every 10th iteration. The convergences of the chains are monitored graphically using trace plots, plots of the autocorrelation function (ACF) and partial autocorrelation function (PACF) for the parameters of interest (see "Appendix"). The results are summarized in Table 1, which are consistent with the  Our primary interest is to estimate the interaction effect between SICDEGP and age. Note that SICDEGP group 1 has been taken as the reference group and the estimate of the interaction effect of the log(Age) with the corresponding SIC3 is larger than that of with SIC2 in magnitude.
One can interpret that children in higher SICD groups can easily be socialized as they grow up. As expected, we find that age is positively associated with the missingness of VSAE score. Interestingly, the children in higher SICD group are more likely to be observed as age increases compared to those in lower SICD group. The estimates of σ 2 e and Σ are 0.18 and 0.151 − 0.056 − 0.056 0.038 , respectively. The negative estimate for Σ uv suggests that children with higher VSAE scores are more likely to have missing outcomes. In order to asses the usefulness of the joint analysis, we test H 0 : |ρ uv | ≤ δ, where ρ uv is the correlation coefficient between the random effects u 1i and v 1i . We calculate the posterior probability P(|ρ uv | > δ|Data) for different choices of δ, and the results are presented in Fig. 1. It is evident that the posterior probability against the null hypothesis is very high and hence, we can conclude that the joint modeling is useful for our analysis.

Simulation study
In order to study the performance of the proposed method, we generate data from the following model: where β 2 (t) = β 20 + β 21 t, and θ 2 (t) = θ 20 + θ 21 t. We first generate data for n = 100 subjects at m = 5 evenly spaced time points, with β 1 = 10, β 20 = 2, β 21 = 5 γ 1 = 15, θ 1 = 0, θ 20 = 15, θ 21 = −3, and δ 1 = −1. The time dependent covariate X 2i (t) is generated from uniform distribution with support (0, 2) and the time-invariant covariate Z 1i is generated from Bernoulli distribution with mean 0.6. We consider the error variance σ 2 e = 9, and the covariance matrix associated with the random effect parameters Σ = 2 − 1 − 1 2 . Under the above choice of the parameters, almost 20% observations corresponding to the response variable are missing, which is in line with the data in the real example discussed in the previous section. For the purpose of comparison, we also generate complete data from the model with the same parameter choices. For each simulated data, we generates samples from the posterior distribution of the parameters β 1 , β 20 , β 21 , and γ 1 using the Gibbs sampler, and computed posterior mean and posterior standard deviations. This is repeated 1000 and the estimates are averaged over the 1000 simulations. We have also performed the same exercise under MAR assumption. In order to compare the performance of these models, we presented relative bias (RB), posterior standard deviation (SD) along with coverage probability in Table 2. Comparing the results from the complete and missing data analysis, we find that the estimates from our proposed method based on missing data mechanism is as good as complete data estimates. It is also evident that the proposed model performs better compared to MAR in terms of relative bias. Next, we generate data for n = 200 subjects at m = 10 evenly spaced time with the same parameter choice, which resulted in 40% missing in the response variable. The results are summarized in Table 3. As expected, posterior standard deviation decreases for the estimates for both the complete and missing data analysis with increase in the number of observations. Even with this higher percentage (40%) of missing observations, our method seems to perform reasonably well in comparison with complete data analysis.
Here also, the proposed model is superior compared to MAR with respect to relative bias and coverage probability.
In order to study the sensitivity of the proposed method under mis-specified selection model, we consider the aforementioned response model with the missing data mechanism given by  with η = 0.5 and ξ = 35. Note that the missing indicator U i (t) depends on the response Y i (t) through the fixed effect parameter η. Next, we generate data for n = 100, m = 5, and n = 200, m = 10, which resulted in 30% missing in the response variable. The summarized results are presented in Tables 4 and 5. It is not surprising that the estimates from the proposed and existing method are biased. To compare the performance of these models, we compute Bayesian information criterion (BIC) for each of the simulated datasets and it is interesting to observe that the average BIC values for the proposed model are much smaller than those of MAR (see Tables 4, 5) under the mis-specified selection mechanism. We have also computed the Deviance information criterion (DIC), and the results are similar.

Discussion
In this paper, a Bayesian methodology has been developed for estimation of the correlated random-effects model for longitudinal data with nonignorable missingness. Unlike the existing frequentist methods that require approximation of the intractable log-likelihood function, we provide a simple estimation methodology using Gibbs sampler. Our method is easy to implement, and as a special case, it is applicable to various other models. For example, traditional SRE models are special cases of the correlated random-effects models, which are popularly used for modeling nonignorable missing data. Moreover, the proposed method is also readily applicable to the data with ignorable missingness. The simulation results indicate that the estimates from our proposed method with missing information in response, are as good as compared to the estimates from complete data analysis. Moreover, the performance of the proposed method is superior compared to MAR even under the mis-specified selection mechanism.
In this paper, we have considered longitudinal data with a continuous response variable. Since the underlying latent response is assumed to be continuous, one can also consider our approach for the purpose of modeling a binary or count response variable with minor modification. For non-normal error and/or random effects models, the full conditionals may not be from standard distributions. One can generalize the proposed method for such cases and employ the Metropolis-Hasting algorithm for estimation purpose. Sometimes, the data sets from longitudinal studies may possess outliers along with missing information not only in the response but also in the covariates. It may be an interesting problem to deal with such data and develop a Bayesian methodology for detection of outliers in the presence of missing values.