Hidden three-state survival model for bivariate longitudinal count data

A model is presented that describes bivariate longitudinal count data by conditioning on a progressive illness-death process where the two living states are latent. The illness-death process is modelled in continuous time, and the count data are described by a bivariate extension of the binomial distribution. The bivariate distributions for the count data approach include the correlation between two responses even after conditioning on the state. An illustrative data analysis is discussed, where the bivariate data consist of scores on two cognitive tests, and the latent states represent two stages of underlying cognitive function. By including a death state, possible association between cognitive function and the risk of death is accounted for.


Introduction
In ageing research, longitudinal data are often collected for various tests of cognitive function. For example, the English Longitudinal Study of Ageing (ELSA) collects data on literacy, numeracy, memory, and information processing (Huppert et al. 2006). In ELSA and often in other studies, the range for test scores is in the form of counts; that is, {0, 1, .., m}, where m is a test-specific integer. Important questions in ageing research concern individual change in cognitive function in the population. When longitudinal data are used, death during follow-up cannot be ignored in the statistical analysis because the risk of death is likely to be correlated with change in cognitive function (Muniz-Terrera et al. 2011).
Typically, cognitive function is seen as a latent trait that explains observed test scores. Of specific interest is to detect when a change in the observed scores is indicative of an irreversible decline in cognitive function. When two tests are available, it make sense to look at options to model both tests simultaneously. For this reason, we propose a statistical framework which allows inference on change in cognitive function by modelling longitudinal count data for two correlated tests conditional on two latent states of cognitive function, and with death as an observable third state.
The three-state process is modelled in continuous time using the Markov assumption. Because the two living states are latent, the three-state model is called a hidden Markov model; see also, for example, Baum and Petrie (1966), Satten and Longini (1996), Jackson (2011), and Zucchini et al. (2016). The count data are conditionally modelled by the bivariate binomial distribution as introduced by Altham and Hankin (2012). The approach is general and can also be used in other areas of applied statistics. For example, if longitudinal count data are collected on biomarkers, and dropout during the follow-up is correlated with the biomarkers, then the same methodology can be used.
Our hidden Markov model can be seen as an alternative for a joint-model approach where a model for a longitudinal response is combined with a two-state survival model (alive versus death); see, for example, Rizopoulos (2012). Using two latent living states in addition to the death state allows us to describe the data in a way which is very close to our understanding of cognitive function as a latent trait. By making the three-state model progressive, we model the assumption that cognitive decline is an irreversible process. In addition, we can use a fitted model to predict latent cognitive impairment without specifying thresholds on the scale of cognitive tests. This will be illustrated in the application.
Examples of papers that discuss change of cognitive function conditional on a multi-state model are Dantan et al. (2011), Van den Hout et al. (2015), andRouanet et al. (2016). These papers combine a multi-state model with a model for a univariate test (or latent trait) for cognitive function. Our statistical framework is an extension of this approach by introducing a bivariate state-dependent distribution. This allows us to model two tests for cognitive function simultaneously. Our hidden Markov model is similar in aim and structure to the modelling in Cook et al. (2004). Cook et al. discuss hidden Markov models where dependence in multivariate tests is taken into account by discrete distributions through log-linear models. They combine a discrete-time two-state Markov model with log-linear models for a bivariate binary response variable. Our model differs in two important aspects. First, we specify a continuous-time Markov model so that we can deal with intervalcensored data and irregular follow-up times. Second, our model is specified for count data which motivates another model for the bivariate state-dependent distributions.
For Y t and Z t , data are collected at pre-scheduled times; that is, the longitudinal sampling is designed independently from Y t and Z t . This creates interval-censored data in the sense that we do not observed values of Y t and Z t between two scheduled time points. This kind of data is also called panel data. We assume that the time of death is observed exactly.
We explain the conditioning on the three-state progressive process by an example. The three-state process X t is defined by transitions 1 → 2, 1 → 3, and 2 → 3. Assume data (y 1 , y 2 ) and (z 1 , z 2 ) are observed at times (t 1 , t 2 ). Let p() be the generic notation of a probability mass function. It follows that (1) The summation in (1) is defined for all possible latent-state combinations for (t 1 , t 2 ) taking into account that the process is progressive. Because count data are observed at t 1 and t 2 , we know that at these times the individual is alive.
We simplify (1) by assuming condition independence across time in the sense that Equations (1) and (2) illustrate the main modelling approach. It will be comprised of three parts: state-dependent bivariate binomial distributions for p(y j , z j |x j ), for j = 1, 2, an illness-death model for transition probabilities p(x 2 |x 1 ), and a logistic regression model for the initial distribution p(x 1 ).
We could go one step further and assume independence between Y t and Z t conditional on latent state. This would imply p(y j , z j |x j ) = p(y j |x j ) p(z j |x j ), for j = 1, 2. This assumption underlies some of the hidden multi-state models that are discussed in the literature; see, for example, Jackson et al. (2016). This will be explored in the data analysis as a comparison to the bivariate approach.

Model
This section introduces the three submodels that define the encompassing hidden Markov model for the bivariate longitudinal count data. The first model is the bivariate binomial distribution of the count data conditional on the latent state. The second model is the stochastic Markov process for the two latent states and death. The third model is the distribution of the initial state. Altham (1978) introduced the following extension of the standard binomial distribution

532
A. van den Hout, G. Muniz-Terrera where 0 < p < 1, θ > 0, and For θ > 1 or θ < 1, the distribution defines under-or overdispersion, respectively, relative to the standard binomial distribution. The bivariate version of (3) is presented in Altham and Hankin (2012) and is given by It is possible to define this distribution for the case where the sample space for Y and Z is not the same (replace the corresponding m by m Y and m Z ).
For the three-state model in the next section, we will specify two bivariate binomials by conditioning on the two living states; that is, p(Y , Z |X = 1) and p(Y , Z |X = 2). To manage the number of parameters, restrictions can be defined across these two distributions. In the data analysis in Sect. 4, for example, we assume that the correlation parameter φ is the same for the two distributions.

Progressive illness-death process
Methods for multi-state models for longitudinal data are well established; see, for example, Hougaard (1999) and Aalen et al. (2008). For the case where transition times are interval-censored, see, for example, Kalbfleisch and Lawless (1985) and Jackson (2011). The following follows the presentation in Van den Hout (2017), who describes parametric time-dependent hazard models for interval-censored data.
Let q rs (t) denote the hazard for transition r → s at time t. Given time interval (t 1 , t 2 ], the cumulative hazard functions for leaving state 1 and 2 are given by respectively. Given that state 3 is an absorbing state and that the model is progressive, if follows that q 21 (t) = q 31 (t) = q 32 (t) = 0. Transition probabilities p rs (t 1 , t 2 ) = P X t 2 = s|X t 1 = r are given by

Examples of hazard models for
The framework is very flexible in the sense that the choice of parametric models can vary across the transitions. For example, a hazard model can be defined that is exponential for 1 → 3 and 2 → 3, and Gompertz for 1 → 2. Alternatively, we can define a hazard model that is Gompertz for 1 → 3 and 2 → 3, and Weibull for 1 → 2.
For the exponential model, transition probabilities p rs (t 1 , t 2 ) can be derived in closed form. For the time-dependent Gompertz and Weibull models, or for combinations thereof, probabilities p 12 (t 1 , t 2 ) are computed by numerical approximation of the one-dimensional integrals; for the details see the appendix.
It is possible to fit the time-dependent models by using a piecewise-constant approximation for the hazards. For example, the Gompertz model can be fitted in the R-package msm by approximating q rs (u) for u ∈ (t 1 , t 2 ] by q rs (t 1 ); see Jackson (2011). In what follows, we do not use this piecewise-constant approximation.
The parametric hazard models can be extended in a standard way to include effects of covariates. As an example, if we would like to investigate the effect of covariate w, then we can specify q rs (t) = λ rs exp(ξ rs t) exp(β rs w).

Initial distribution
As shown in the introduction, a model is needed to estimate the state prevalence; that is, the initial distribution p(x t ) for time t > 0 and x t ∈ {1, 2}. We use the standard logistic regression model and define p(X t = 1) = 1 − p(X t = 2). Covariates can be included in the usual way.

Likelihood function and estimation
Estimation of model parameters is undertaken by maximising the log-likelihood function. Let i = 1, ..., N denote individuals, and t i j the time of the jth observation for individual i, where j = 1, ..., n i . The count data at time t i j are given by pairs (y i j , z i j ). Let y i = (y i1 , ..., y in i ), and y = (y 1 , ..., y N ). Define z in a similar manner. Let x i j denote the state at t i j , which is latent except when death is observed. Conditional on the effect of time t and possible covariates, we assume that the three-state process is Markovian; that is, only the current state determines where the process goes next. This allows us the decompose the probability mass function for a series of states into univariate transition probabilities. For example, Let vector ψ contain all model parameters. Using the conditional Markov assumption and the assumption about independence across time (see the introduction), the likelihood function is given by The logarithm of the likelihood function is maximised over the parameter space of ψ to obtain the maximum likelihood estimates. The maximisation is undertaken using the general-purpose optimiser optim in the R software (R Core Team 2013). In optim, we choose the quasi-Newton method BFGS (developed in 1970 by Broyden, Fletcher, Goldfarb and Shanno). The numerical integration needed to compute probabilities p 12 (t 1 , t 2 ) is implemented using the composite Simpson's rule.
For efficient use of the general-purpose optimiser, all parameters with restricted parameters space are transformed such that they can be estimated over an unrestricted parameter space. For example, if 0 < ψ < 1, then define ψ = exp(γ )/(1 + exp(γ )) and estimate ψ by estimating γ ∈ R. The delta-method may be used to estimate the variance of ψ given estimated variance of γ . Alternatively, simulation may be used by sampling from the asymptotic normal distribution of the maximum likelihood estimate; see Mandel (2013). For the above example this implies sampling values of γ and use these to derive the distribution of ψ.
For the latent living states we parameterise the binomial distributions such that state 2 is the state with a lower cognitive ability. We illustrate this for the bivariate distribution p(Y , Z |X = x), where x ∈ {1, 2}. First we parameterise p Y |x=1 = exp(γ )/(1 + exp(γ )) for γ ∈ R, so that 0 < p Y |x=1 < 1. Next we define for ν ∈ R. This implies p Y |x=1 > p Y |x=2 > 0. We do the same for p Z |x=1 and p Z |x=2 .

Data
The English Longitudinal Study of Ageing (ELSA) is a multidisciplinary open cohort study that features an extensive range of data from a representative sample of men and women living in England who are aged 50 and over. Here we use longitudinal data on cognitive function that are available in the waves 1-5 (2002-2011). Data from ELSA can be obtained via the Economic and Social Data Service (www.esds.ac.uk). For reasons of confidentiality, age is given in integers.
There are 11,932 individuals in the ELSA baseline sample. For the current analysis, we sampled N = 1000 individuals randomly from the baseline conditional on three restrictions. First, we restricted the sampling to individuals being 50 years or older at the ELSA baseline (because participants' partners are also sampled in ELSA, there are individuals younger than 50 in the ELSA baseline). Secondly, the sampling is restricted to individual younger than 90 years at baseline without a censored age during followup. Thirdly, the sampling is restricted to those who have at least two observations or one observation and a time of death. The resulting subsample has 540 women and 460 men. Frequencies for number of observations per individual (including death) are 174, 146, 156, 516, and 8 for number of observations 2, 3, 4, 5, and 6, respectively. The data are used for illustrative purposes only and should not be used to inform clinical practice.
We analyse longitudinal count data for the test of verbal learning and recall. For this test, people are required to learn a list of 10 common words, and are asked to recall the words immediately (Y ) and later on in the interview (Z ). This is a common test in studies of ageing. For example, it is also used the US Health and Retirement Study (website: hrsonline.isr.umich.edu).
As to be expected, individuals remember more words in the direct recall compared to the delayed recall. The top panel of Fig. 1 shows the sample distribution of Y and Z at the baseline. The distribution clearly shifts to the left, from the sample mean 5.44 for Y to the sample mean 3.92 for Z . One would expect a high correlation between Y and Z given that these concern the recall of the same words, and this is confirmed by the Spearman correlation which is 0.67 for the observations at baseline. The bottom panel of Fig. 1 shows the longitudinal scores on Y and Z for a subsample of 30 individuals. The 30 trajectories illustrate that the word recalls are noisy processes. Nevertheless, already in the depicted trajectories there is some evidence of a decline in cognitive function when people get older.
During the follow-up, 195 of the 1000 individuals die. A dropout rate of 19.5% is too high to ignore in the data analysis, especially if the process of interest is associated with ageing. This motivates the inclusion of a death state in the stochastic process.

Models
We will discuss a series of models for the ELSA data. The information criterion by Akaike (AIC, Akaike 1974) is used to compare the models. Given that ELSA is an epidemiological study on ageing, we choose age as the time scale.
In all models, the initial distribution for the two living states is described by the logistic regression (5) which includes age as a covariate. For the state-dependent distribution we distinguish three options: two independent binomial distributions, two independent extended binomial distributions (3), and the bivariate extended binomial distribution (4).
For the three-state model, we consider exponential (E), Gompertz (G), and Weibull (W) hazards. The models are denoted by using the letters for the transitions 1 → 2, 1 → 3, and 2 → 3 respectively. For example, GWW denotes a Gompertz hazard for 1 → 2, and Weibull hazards for death. For the exponential model EEE, it is clear that the bivariate binomial distribution outperforms the other two options for the state-dependent distribution; see the AICs in Table 1. This is as expected, given the correlation between the immediate recall and the delayed recall.
Given that decline of cognitive function is correlated with age, one would expect that time-dependent hazards models are better than the exponential hazard model. This is confirmed by the results in Table 1. For the bivariate binomial distribution, using the Gompertz model GGG yields a substantial lower AIC compared to EEE: AIC = 29646 and AIC = 29732, respectively. The same holds for the Weibull model WWW with AIC = 29651.
As reported in Table 1, we investigated a range of models, including models with different parameter assumptions for the three hazards. The lowest AIC is 29636 for a model with three Gompertz hazards (GGG). This model has age (denoted by t) and gender (denoted by x = 0 for women, and x = 1 for men) as predictors for the initial distribution; that is, The three Gompertz hazards are extended by including gender as covariate: The state-dependent distributions are bivariate binomials, where the dispersion parameters θ Y and θ Z , and the correlation parameter φ are assumed to be the same for the two latent states. This overall model has 19 parameters; Table 2 presents the estimates and their estimated standard errors. As expected, the effects of age (the ξ s) are positive and indicate that an older age is associated with a higher risk of progressing through the states. For the three transition hazards, only the estimated hazard for 2 → 3 indicates a difference between men and women, with the former being at a higher risk (hazard ratio exp(α 23 ) = 1.694).
The parameters for the state-dependent distributions for the direct recall (Y ) and the delayed recall (Z ) are estimated using transformations, among which the transformation (6). Table 2 presents the parameters as used in the definition of the distributions and with estimated standard errors derived by simulation (see Sect. 3). Given that θ Y and θ Z are both larger than 1, the state distributions are more peaked than the standard binomial. Parameter φ controls the dependence between the distributions of Y and Z within each state, with Y and Z being independent iff φ = 1. Although more can be said about the interpretation of φ (see Altham and Hankin 2012), in practice the best option is to look at graphs and at the estimated means. Figure 2 nicely shows the difference in the fitted state-dependent bivariate binomial distributions for Y and Z . Comparing the distributions for state 1 and 2, we see that the distribution for state 2 is shifted to the lower scores on Y and Z . Numerically this is confirmed by the expected value of the fitted distributions. For state 1, we have E[(Y , Z )|X = 1] = (6. 37, 5.19), and for state 2 we have E[(Y , Z )|X = 2] = (4.36, 2.60).  The shift to the lower scores is additionally illustrated by the marginal distributions in Fig. 3. This graph also shows that the marginal distributions of the delayed recall (Z ) are more distinct than those of the direct recall (Y ). This implies that the delayed recall is better at discriminating the two latent cognitive states.
These scores were specified such that A represents an individual with very good recall, C represent an individual whose scores show a decreasing trend, and individual B is more or less in the middle of these two trends. Figure 4 shows the predicted probability to be in state 1 for the three individuals. Note that death is a competing risk in this prediction, so it is not the case that P(X t = 1) = 1 − P(X t = 2), for t representing a higher age than 74.
The graph shows that we can classify individuals into the latent states given past observations. For example, at age 74 and using 0.5 as cut-point, we would classify A and B in state 1, and C in state 2. The graph also shows that the model can be used in practice to plan future health care. For individual B, for example, the model predicts a classification in state 2 from about age 79 and onward (using again 0.5 as cut-point). If there are interventions to delay cognitive decline, then for B such an intervention should be planned in the next five years. The 90% confidence bands in Fig. 4 show that a classification such as above is subject to substantial uncertainty. In general, adding covariates in the submodels may help to reduce this uncertainty up to some extent.

Validation
As acknowledged in the literature, model validation for time-dependent multi-state processes is severely hampered when transitions are interval-censored and observation times vary across individuals; see, for example, Gentleman et al. (1994) and Titman (2009). This problem also holds for the models reported here where we use a threestate process which is latent for the two living states. However, observed death times are exact, and this can be used to check part of the model. The idea is to fit Kaplan-Meier survival curves where the time scale is the study time and death is defined as the event. These curves are then compared to the model-based prediction conditional on the observed data at the baseline of the study. This will not validate all aspects of the model, but it can be used as a heuristic tool to check observed and predicted survival (Gentleman et al. 1994).
The method in the previous section can be used to undertake the model-based prediction. Denote the time scale of the study in years as t * , with t * = 0 at the baseline. Using the same notation as before, let y i1 and z i1 be the observed count data at t * = 0, for all i ∈ {1, ..., N }. The model-based survival at time t * > 0 is p(X t * = 3|y i1 , z i1 , ψ = ψ), which is to be compared with the Kaplan-Meier survival curves. To do this, we first sample the latent baseline state at t * = 0 using p(X t * =0 = 1|y i1 , z i1 , ψ = ψ) for all i = 1, .., N . Next we compute the Kaplan-Meier survival curves and the model-based survival conditional on the sampled baseline state. Figure 5 shows the comparison. Variation in the model-based individual curves is due to individual variation at baseline with respect to age, gender, and the scores on Y and Z . The regular yearly steps in the Kaplan-Meier survival curves are caused by the fact that age is only available in integers for reasons of confidentiality.
Given that only baseline test scores are used, and that the baseline state is latent, the comparisons in Fig. 5 give some confidence in the fitted model. At least with respect to fitting observed and right-censored death times, the model performs well.

Conclusion
In biostatistics, it often the case that the process of interest is latent and that it is investigated via observable response variables. A process of interest in ageing research is change of cognitive function in the older population. Response variables in this case can be scores on cognitive tests. For an application in ageing research with two longitudinal test scores, our approach consists of assuming two latent states for cognitive function and specifying bivariate state-specific distributions for the test scores. In addition, we include death as a third state in our model to take into account possible correlation between change of cognitive function and mortality.
For the response variable, we specify bivariate state-dependent distributions which allow modelling the correlation between two responses even after conditioning on the latent state. In the application, the state-dependent distribution is a bivariate extension of the well-known univariate binomial distribution. In the data analysis, it was shown that a fitted distribution can be explored both numerically and graphically. Goodness of fit and prediction was discussed and illustrated as well.
The underlying progressive three-state model that we use, is very flexible with respect to the specification of the time-dependency. Although the chosen model specifies Gompertz distributions for all three transition hazards, we have also explored hazard models with both Gompertz and Weibull hazards.
The fitting of the three-state model is computationally intensive because some of the transition probabilities are estimated by numerical integration. An alternative would be to use a piecewise-constant approximation of the time-dependent hazards, but the performance of such an approximation would rely heavily on the time grid that is used. Of course, the numerical integration also relies on a grid-based approximation. However, the latter is used to compute one entry in the transition matrix, whereas the former is a piecewise-constant approximation of the whole matrix. Ultimately, both approaches will produce the same results when fine-tuned. We prefer the numerical integration because it is more directly aimed at the quantity that we need to estimate and it is easier to fine tune.
In our hidden Markov model, we assume independence across time and use a bivariate distribution to model possible correlation between response variables. In the current context, where we have longitudinal data clustered within individuals, an alternative would be to use a model with individual-specific random effects. These random effects can be used to capture possible dependence across time or dependence between response variables. For discrete-time hidden Markov models, examples of this approach can be found in Altman (2007) and Maruotti (2011). These mixed hidden Markov models are computationally intensive because the random-effects structure requires numerical integration or simulation-based methods. Continuous-time mixed hidden Markov models can be developed in a similar vein, but will extend the computational challenge.
We have used the bivariate generalisation of the binomial distribution as introduced by Altham and Hankin (2012). Because the fitting the model is undertaken by a generalpurpose optimiser, it is relatively easy to use other distributions. For example, other bivariate generalisations of the binomial distribution are presented in Marshall and Olkin (1985) and Biswas and Hwang (2002). The distribution by Altham and Hankin (2012) was chosen not only because it extends the binomial distribution from univariate to bivariate in a natural way, but also because the extension includes parameters for both over-dispersion and under-dispersion relative to the corresponding binomial distribution.
We hope that our approach will help to maximise the use of available data in longitudinal studies, where it is customary to collect data on multiple cognitive tests, and, consequently, improve knowledge about the processes investigated.