A two-stage approach for Bayesian joint models: reducing complexity while maintaining accuracy

Several joint models for longitudinal and survival data have been proposed in recent years. In particular, many authors have preferred to employ the Bayesian approach to model more complex structures, make dynamic predictions, or use model averaging. However, Markov chain Monte Carlo methods are computationally very demanding and may suffer from convergence problems, especially for complex models with random effects, which is the case for most joint models. These issues can be overcome by estimating the parameters of each submodel separately, leading to a natural reduction in the complexity of the joint modelling, but often producing biased estimates. Hence, we propose a novel two-stage approach that uses the estimates from the longitudinal submodel to specify an informative prior distribution for the random effects when estimating them within the survival submodel. In addition, as a bias correction mechanism, we incorporate the longitudinal likelihood function in the second stage, with its fixed effects set according to the estimation using only the longitudinal submodel. Based on simulation studies and real applications, we empirically compare our proposal with the joint specification and standard two-stage approaches, considering different types of longitudinal responses (continuous, count and binary) that share information with a Weibull proportional hazards model. The results show that our estimator is more accurate than its two-stage competitor and as good as jointly estimating all parameters. Moreover, the novel two-stage approach significantly reduces the computational time compared to the joint specification.


Introduction
Joint modelling of longitudinal and survival data has become very popular in medical applications (Ibrahim and Molenberghs 2009; Neuhaus et al. 2009; Wu et al. 2011). This class of models is a useful tool when it is necessary to study the association between repeated measurements and the time until an event of interest (Papageorgiou et al. 2019). When both sources of information are considered simultaneously in a single model, the estimates are less biased and there is a gain in statistical efficiency, since clinical hypotheses assume in advance that the longitudinal and survival data are connected in some way (Muthén et al. 2009; Ibrahim et al. 2010).
Correspondence: Danilo Alvares (danilo.alvares@mrc-bsu.cam.ac.uk) and Valeria Leiva-Yamaguchi (valeria.leiva@mrc-bsu.cam.ac.uk), MRC Biostatistics Unit, University of Cambridge, Cambridge, UK.
The complexity of joint modelling often entails two critical issues. The first is related to identifiability problems due to the large number of parameters (Wu 2009; Gould et al. 2015; Papageorgiou et al. 2019). The second is that the calculation of the marginal likelihood and survival functions involves integrals without closed form, so numerical integration methods are required, making the inferential process more time-consuming (Lesaffre and Spiessens 2001; Pinheiro and Chao 2006; Rizopoulos et al. 2009; Wu et al. 2010; Barrett et al. 2015).
Recently, Alvares and Rubio (2021) proposed a new specification for joint models of longitudinal and survival data that is flexible and does not require numerical integration. Their proposal splits the longitudinal linear predictor into time-dependent and time-independent components, which are then shared with the survival submodel. This strategy has some limitations, presented by the authors, which will not be discussed here. Another recent and very promising proposal is the use of the integrated nested Laplace approximation (INLA) for joint models (van Niekerk et al. 2020). However, this approach is not yet accessible to non-expert INLA users.
A more classical alternative is the two-stage approach (Self and Pawitan 1992; Tsiatis et al. 1995). In this case, the longitudinal submodel is fitted separately and the estimates of the shared parameters and random effects are then plugged into the survival submodel. Several authors have applied this strategy in order to circumvent identifiability issues and reduce processing time (Murawska et al. 2012; Guler et al. 2017; Donnelly et al. 2018; Mauff et al. 2020). Nevertheless, the main disadvantage of this approach is that the survival parameter estimates are often biased (Tsiatis and Davidian 2004; Rizopoulos 2012).
From the Bayesian perspective, some bias correction strategies have been proposed for the two-stage approach (Tsiatis et al. 1995;Wulfsohn and Tsiatis 1997;Viviani et al. 2014;Leiva-Yamaguchi and Alvares 2021). Still, Markov chain Monte Carlo (MCMC) methods are quite demanding and their use may be impractical (Ye et al. 2008;Andrinopoulou et al. 2018).
In this paper, we propose a novel two-stage approach that uses the estimates from the longitudinal submodel to specify an informative prior distribution for the random effects when estimating them within the survival submodel. In addition, as a bias correction mechanism, we incorporate the longitudinal likelihood function in the second stage, with its fixed effects set according to the estimation using only the longitudinal submodel.
The rest of the paper is organised as follows. Section 2 introduces a generic formulation for Bayesian joint models of longitudinal and survival data. Section 3 presents a standard two-stage approach to inference in the joint modelling framework. Section 4 describes our two-stage proposal. Sections 5 and 6 compare the bias and computational time of the three inferential strategies using simulated and real data, respectively. Finally, Sect. 7 discusses the results obtained and potential extensions of our methodology.

Bayesian joint model formulation
Bayesian joint models for longitudinal and survival data usually assume a full joint probability distribution given by (Armero et al. 2018):

f(y, s, b, θ y , θ s , θ b ) = f(y, s | b, θ y , θ s ) f(b | θ b ) π(θ y ) π(θ s ) π(θ b ),   (1)

where y and s represent the longitudinal and the survival process, respectively. Random effects are denoted by b and have θ b as their hyperparameters; θ y and θ s are the parameters and hyperparameters of each process. The factors on the right-hand side of (1) are the conditional joint distribution of the processes y and s given b, θ y , and θ s , f(y, s | b, θ y , θ s ); the conditional distribution of b given θ b , f(b | θ b ); and the (independent) prior distributions of θ y , θ s and θ b , π(θ y ), π(θ s ) and π(θ b ), respectively. Typically, the conditional joint distribution of y and s is factorised as (Wu and Carroll 1988):

f(y, s | b, θ y , θ s ) = f(y | b, θ y ) f(s | b, θ y , θ s ),   (2)

where both processes are considered conditionally independent given b, θ y , and θ s . Note that if the longitudinal parameters and hyperparameters, θ y , are not shared with the survival process, then f(s | b, θ y , θ s ) = f(s | b, θ s ). In the next subsections, we give more details about the components of (1) and (2).

Longitudinal submodel
Typically, the terms f(y | b, θ y ) in (2) and f(b | θ b ) in (1) for an individual i at time t are expressed through a generalised linear mixed specification:

y i (t) | b i , θ y ∼ EF(μ i (t | b i , θ y )),   b i | θ b ∼ N(0, Σ b ),   (3)

where EF represents an exponential family distribution with mean μ i (t | ·); θ y contains the parameters and hyperparameters of μ i (t | ·); and θ b = Σ b . The mean trajectory function can be formulated in different ways, but the most common corresponds to the linear mixed model specification g(μ i (t | ·)) = x i (t)⊤ β + z i (t)⊤ b i , where g(·) is a link function, x i (t) and z i (t) are covariates of the individual i at time t, β is the vector of fixed effects, and b i represents a K-dimensional vector of individual random effects (Pinheiro and Bates 2000).
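To make the mean trajectory concrete, the following minimal Python sketch evaluates μ i (t | ·) for a random-intercept-and-slope linear predictor with one binary covariate. The paper's own implementation is in rstan; the function name, the predictor structure, and all values here are illustrative assumptions.

```python
import math

def mu(t, beta, b_i, x_i, link_inv=lambda eta: eta):
    """Mean trajectory mu_i(t | b_i, theta_y) for an assumed linear mixed predictor
    eta = (beta0 + b0) + (beta1 + b1) * t + beta2 * x_i,
    mapped through the inverse link of the chosen exponential family."""
    beta0, beta1, beta2 = beta
    b0, b1 = b_i
    eta = (beta0 + b0) + (beta1 + b1) * t + beta2 * x_i
    return link_inv(eta)

# identity link (Normal response)
normal_mu = mu(2.0, beta=(1.0, 0.5, -0.3), b_i=(0.2, -0.1), x_i=1)

# inverse-logit link (Bernoulli response)
bernoulli_mu = mu(0.0, beta=(0.0, 0.5, 0.0), b_i=(0.0, 0.0), x_i=0,
                  link_inv=lambda eta: 1.0 / (1.0 + math.exp(-eta)))
```

Swapping `link_inv` for `math.exp` would give the Poisson (log-link) case.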

Survival submodel
We will introduce some key elements of survival analysis in order to characterise the term f(s | b, θ y , θ s ) in (2) using a hazard function. Let T* i denote the event time for individual i, C i the censoring time, T i = min{T* i , C i } the observed time, and δ i = I(T* i ≤ C i ) the event indicator. The proportional hazards specification is commonly used to model these problems (Kumar and Klefsjö 1994) and at the same time incorporate the longitudinal information from submodel (3). So, the hazard function of the survival time T i of individual i is expressed by:

h i (t | b i , θ y , θ s ) = h 0 (t | θ s ) exp{x i ⊤ γ + α g i (t | b i , θ y )},   (4)

where h 0 (t | θ s ) represents an arbitrary baseline hazard function at time t, x i is a covariate vector with coefficients γ, and g i (t | b i , θ y ) is a temporal function of terms that come from the longitudinal submodel and can be specified in different ways, such as μ i (t | ·), dμ i (t | ·)/dt, and ∫ 0 t μ i (v | ·) dv (Rizopoulos 2012). The component g i (t | ·) has the role of connecting both processes, while α measures the strength of this association. Finally, θ s includes the parameters of h 0 (t | ·), γ and α.
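A minimal sketch of evaluating this hazard, assuming a Weibull baseline of the form h0(t) = φ exp(γ0) t^(φ−1) (one common parameterisation, and an assumption here) and a current-value association g i (t) = μ i (t | ·):

```python
import math

def weibull_h0(t, phi, gamma0):
    # Baseline hazard h0(t | theta_s); an assumed Weibull form with
    # shape phi, where exp(gamma0) acts as the scale
    return phi * math.exp(gamma0) * t ** (phi - 1.0)

def hazard(t, phi, gamma0, gamma1, alpha, x_i, g_i):
    # h_i(t | b_i, theta_y, theta_s) = h0(t | theta_s) * exp(gamma1 * x_i + alpha * g_i(t))
    return weibull_h0(t, phi, gamma0) * math.exp(gamma1 * x_i + alpha * g_i(t))

# current-value association: g_i(t) = mu_i(t | .)
h_val = hazard(1.0, phi=1.5, gamma0=-2.0, gamma1=0.3, alpha=0.5,
               x_i=1, g_i=lambda t: 0.2 * t)
```

Passing a derivative or a cumulative version of μ i (t | ·) as `g_i` covers the other association structures mentioned above.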

Prior distributions
We assume independent and diffuse marginal prior distributions (Gelman et al. 2013). More specifically, all longitudinal and survival regression coefficients (including the association parameter) follow Normal distributions with mean zero and large variance; σ follows a weakly-informative half-Cauchy(0, 5) distribution (Gelman 2006); and the random effects covariance matrix Σ b follows an inverse-Wishart distribution (Schuurman et al. 2016). Once the baseline hazard function h 0 (t | θ s ) is defined, diffuse priors are also specified for its parameters.

Standard two-stage (STS) approach
As a potential competitor, we use the approach of Tsiatis, De Gruttola and Wulfsohn (Tsiatis et al. 1995) adapted to the Bayesian framework (Leiva-Yamaguchi and Alvares 2021). Specifically, in the first stage, we calculate the maximum a posteriori (MAP) estimates of the random effects b and parameters θ y from the longitudinal submodel fitted separately. The posterior mean or median can also be used instead. In the second stage, we incorporate the temporal function into the survival submodel as g i (t | b̂ i , θ̂ y ), where b̂ i and θ̂ y are the MAPs of b i and θ y , and then the posterior distribution of θ s is calculated.
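The plug-in nature of the STS second stage can be sketched with a toy survival likelihood. The constant (exponential) hazard below is a deliberate simplification standing in for (4), and the data and parameter values are illustrative assumptions: the stage-1 MAPs b̂ enter as known constants, and only the survival parameters are optimised.

```python
import numpy as np

def surv_loglik(T, delta, b, lam, alpha):
    """Toy survival log-likelihood with constant hazard
    h_i(t) = lam * exp(alpha * b_i) standing in for (4):
    sum_i [delta_i * (log lam + alpha * b_i) - lam * exp(alpha * b_i) * T_i]."""
    eta = alpha * np.asarray(b, dtype=float)
    return float(np.sum(delta * (np.log(lam) + eta) - lam * np.exp(eta) * T))

# STS second stage: stage-1 MAPs b_hat are plugged in as fixed constants
# and only the survival parameters are estimated (here by a grid search)
T = np.array([1.0, 2.0, 0.5])
delta = np.array([1, 0, 1])
b_hat = np.array([0.1, -0.2, 0.0])          # stage-1 MAPs of the random effects
lam_grid = np.linspace(0.05, 3.0, 500)
lam_sts = lam_grid[np.argmax([surv_loglik(T, delta, b_hat, lam, alpha=0.5)
                              for lam in lam_grid])]
```

Because b̂ carries no posterior uncertainty into this stage, the resulting survival estimates tend to be over-confident, which is exactly the bias discussed above.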

Novel two-stage (NTS) approach
Like all two-stage methods for joint models, the first stage fits the longitudinal submodel separately. Here, we calculate the MAPs of the longitudinal and random effects parameters, which will be denoted by θ̂ y and θ̂ b , respectively. Note that it is not necessary to estimate the random effects b themselves; therefore, we can run a Bayesian estimation algorithm using the marginalised likelihood function (integrating out the random effects) of the longitudinal submodel (3).
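For the Normal response, this marginalisation is available in closed form; the sketch below is our own Python illustration (not the paper's rstan code), for a single individual with assumed designs X i and Z i.

```python
import numpy as np

def marginal_loglik(y, X, Z, beta, Sigma_b, sigma2):
    """Closed-form marginal log-likelihood of one individual's Normal
    longitudinal submodel, with the random effects integrated out:
    y_i | theta_y ~ N(X_i beta, Z_i Sigma_b Z_i^T + sigma2 * I).
    (For Poisson/Bernoulli responses this integral has no closed form
    and must be handled numerically or within the sampler.)"""
    V = Z @ Sigma_b @ Z.T + sigma2 * np.eye(len(y))
    r = y - X @ beta
    _, logdet = np.linalg.slogdet(V)
    return float(-0.5 * (len(y) * np.log(2.0 * np.pi) + logdet
                         + r @ np.linalg.solve(V, r)))

# one individual observed at t = 0, 1, 2 with design X = [1, t, x], Z = [1, t]
t_obs = np.array([0.0, 1.0, 2.0])
X = np.column_stack([np.ones(3), t_obs, np.ones(3)])   # x_i = 1
Z = np.column_stack([np.ones(3), t_obs])
ll = marginal_loglik(np.array([1.1, 1.4, 2.0]), X, Z,
                     beta=np.array([1.0, 0.5, -0.3]),
                     Sigma_b=np.diag([0.3, 0.05]), sigma2=0.25)
```

Maximising the sum of such terms over β, Σ b and σ² yields the stage-1 MAPs θ̂ y and θ̂ b without ever sampling b.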
In the second stage, we estimate the random effects b and parameters θ s , replacing the full joint probability distribution (1) with:

f(y, s, b, θ s | θ̂ y , θ̂ b ) = f(y | b, θ̂ y ) f(s | b, θ̂ y , θ s ) π(b | θ̂ b ) π(θ s ),   (5)

where the f(·)'s and π(·)'s have the same functional form as in (1) and (2), except that the parameters θ y and θ b are now assumed to be known, so it is not necessary to include their prior distributions in (5). Note that when using the θ b estimated from the longitudinal submodel (i.e., θ̂ b ), the random effects b become fixed effects, further reducing the complexity of (5) with respect to the full joint probability distribution (1).
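The second-stage target can be sketched as an unnormalised log posterior. In this illustration (our own, not the paper's rstan code), the likelihood terms are passed in as callables, and the informative prior on b is a Normal built from the stage-1 estimate Σ̂ b, assumed independent across individuals.

```python
import numpy as np

def nts_log_posterior(b, theta_s, loglik_long, loglik_surv,
                      Sigma_b_hat, log_prior_s):
    """Unnormalised log posterior of the NTS second stage, following (5):
    log f(y | b, theta_y_hat)            -> loglik_long(b)   (bias correction)
    + log f(s | b, theta_y_hat, theta_s) -> loglik_surv(b, theta_s)
    + log pi(b | theta_b_hat)            -> informative N(0, Sigma_b_hat) prior
    + log pi(theta_s)                    -> log_prior_s(theta_s)."""
    b = np.atleast_2d(b)                 # one row of random effects per individual
    inv = np.linalg.inv(Sigma_b_hat)
    _, logdet = np.linalg.slogdet(Sigma_b_hat)
    k = Sigma_b_hat.shape[0]
    log_prior_b = (-0.5 * np.sum((b @ inv) * b)
                   - 0.5 * b.shape[0] * (k * np.log(2.0 * np.pi) + logdet))
    return float(loglik_long(b) + loglik_surv(b, theta_s)
                 + log_prior_b + log_prior_s(theta_s))
```

A sampler or optimiser (e.g. Stan, as used in the paper) would target this function over (b, θ s ) only, with θ̂ y and Σ̂ b held fixed.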
Essentially, our proposal simplifies/approximates the likelihood function of the joint approach by assuming that the shared parameters can be satisfactorily estimated from their MAPs using the longitudinal submodel alone. In particular, as n i grows, the longitudinal submodel f(y | b, θ y ) becomes the dominant part of the conditional joint likelihood function (see Eq. (2)), implying that the survival submodel contributes little to the MAP of θ y . The MAP of θ s is then estimated considering θ y as a nuisance parameter (Murphy and van der Vaart 2000) and assuming that the contribution of the random effects and priors to the posterior distribution is similar under either JS or NTS. See Appendix 1 for technical details.

Simulation studies
Simulation studies were performed to empirically corroborate that our methodological proposal (see Sect. 4) alleviates the dual problem of "biased estimation versus high computational time" when compared to the joint specification (JS, see Sect. 2) and standard two-stage (STS, see Sect. 3) approaches. Algorithm 1 summarises a simulation scheme for joint models of longitudinal and survival data (Rizopoulos 2022; Austin 2012).
Algorithm 1 Simulation scheme for joint models. In this algorithm, x i denotes a binary covariate, b i the vector of random effects, T* i the event time (not always observed), C i the censoring time, T i the observed time, and δ i the event indicator. Additionally, n, Δ, and t max represent the number of individuals, the interval between longitudinal observations, and the maximum observational time, respectively.
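The scheme of Algorithm 1 can be sketched as follows for the Normal case, simulating each event time by inverting the cumulative hazard H i (T* i ) = −log(U i ), U i ∼ U(0, 1). The paper works in R; the parameter values, the Weibull form of h 0 , and the uniform censoring mechanism below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# illustrative parameter values (not those of Table 1)
beta, sigma = (1.0, 0.2, -0.5), 0.5
phi, gamma0, gamma1, alpha = 1.4, -3.0, 0.3, 0.2
Sigma_b = np.array([[0.3, 0.0], [0.0, 0.05]])
n, Delta, t_max = 5, 1.0, 20.0

def cum_hazard(h, t, grid_size=400):
    # trapezoidal approximation of H(t) = int_0^t h(v) dv
    grid = np.linspace(1e-6, t, grid_size)
    vals = np.array([h(v) for v in grid])
    return float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(grid)))

def simulate_one():
    x = int(rng.integers(0, 2))                        # binary covariate x_i
    b = rng.multivariate_normal([0.0, 0.0], Sigma_b)   # random effects b_i
    mu = lambda t: (beta[0] + b[0]) + (beta[1] + b[1]) * t + beta[2] * x
    h = lambda t: phi * np.exp(gamma0) * t ** (phi - 1) * np.exp(gamma1 * x + alpha * mu(t))
    target = -np.log(rng.uniform())                    # solve H(T*) = -log(U)
    if cum_hazard(h, t_max) < target:
        T_star = t_max                                 # event beyond follow-up
    else:
        lo, hi = 1e-6, t_max                           # bisection on H(t) = target
        for _ in range(40):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if cum_hazard(h, mid) < target else (lo, mid)
        T_star = 0.5 * (lo + hi)
    C = rng.uniform(0.0, t_max)                        # censoring time C_i
    T, delta = min(T_star, C), int(T_star <= C)        # observed time and indicator
    obs_times = np.arange(0.0, T + 1e-9, Delta)        # longitudinal grid (step Delta)
    y = rng.normal([mu(t) for t in obs_times], sigma)  # Normal responses
    return T, delta, obs_times, y

data = [simulate_one() for _ in range(n)]
```

Each individual thus contributes longitudinal observations only up to the observed time T i, which is what ties the two processes together in the simulated data.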
We analysed the joint formulation based on submodels (3) and (4) considering longitudinal continuous, count and binary responses, where the mean of the longitudinal submodel, μ i (t | ·), was shared with the survival submodel.

Longitudinal continuous response
The longitudinal continuous process for an individual i at time t is written as a linear mixed-effects model:

y i (t) = μ i (t | ·) + ε i (t) = β 0 + b 0i + (β 1 + b 1i ) t + β 2 x i + ε i (t),   ε i (t) ∼ N(0, σ 2 ),   b i = (b 0i , b 1i )⊤ ∼ N(0, Σ b ),   (6)

where θ y = (β 0 , β 1 , β 2 , σ 2 )⊤ and θ b = Σ b . The covariate x i is a binary group indicator. The hazard function of the survival time T i of individual i is specified as follows:

h i (t | b i , θ y , θ s ) = h 0 (t | θ s ) exp{γ 1 x i + α μ i (t | ·)},   (7)

where h 0 (t | θ s ) is a Weibull baseline hazard with φ and exp(γ 0 ) being its shape and scale parameters, respectively, and θ s = (φ, γ 0 , γ 1 , α)⊤.

Longitudinal count response
The longitudinal count process for an individual i at time t is written as a Poisson mixed-effects model:

y i (t) ∼ Poisson(μ i (t | ·)),   log(μ i (t | ·)) = β 0 + b 0i + (β 1 + b 1i ) t + β 2 x i ,   (8)

where θ y = (β 0 , β 1 , β 2 )⊤, the random effects are defined as in (6), and the survival submodel is specified as in (7).

Longitudinal binary response
Similar to the previous case, the longitudinal binary process for an individual i at time t is written as a Bernoulli mixed-effects model:

y i (t) ∼ Bernoulli(μ i (t | ·)),   logit(μ i (t | ·)) = β 0 + b 0i + (β 1 + b 1i ) t + β 2 x i ,   (9)

where θ y = (β 0 , β 1 , β 2 )⊤, the random effects are defined as in (6), and the survival submodel is specified as in (7).

Simulation settings
For each scenario (Normal, Poisson, and Bernoulli), we simulated 300 datasets with n = 500, Δ = 1, and t max = 20. Table 1 presents the specified parameter values for each scenario. Figure 1 shows the proportion of the number of longitudinal observations per individual in the simulated data using the parameter values from Table 1. The resulting censoring rates varied between 26% and 40% (Normal), 22% and 35% (Poisson), and 16% and 27% (Bernoulli).

Results
The longitudinal submodels for continuous, count and binary responses were set as in (6), (8) and (9), respectively, and the survival submodels as in (7). Prior distributions were specified as in Sect. 2.3. The MCMC configuration was defined as follows: 2000 iterations with warm-up of 1000 for the joint model using the joint specification (JS) and for the longitudinal submodel from the novel two-stage (NTS) and standard two-stage (STS) approaches. Additionally, 1000 iterations with warm-up of 500 were set to run the survival submodel from both two-stage approaches. All models were implemented using the rstan R-package version 2.21.7 (Stan Development Team 2022). Simulations were performed on a Dell laptop with 2.2 GHz Intel Core i7, 32 GB RAM, OS Windows. Table 2 displays the running times of each estimation approach.
Unsurprisingly, STS had the lowest computational time, while the NTS required less processing time than the JS approach.
From an inferential perspective, we analysed the posterior mean for the group (γ 1 ) and association (α) parameters in order to study the (potential) bias produced by two-stage approaches by ignoring the joint nature between the longitudinal and survival processes. Figure 2 shows the bias (posterior mean -true parameter value) and standard deviation for the group parameter using the JS, NTS and STS estimation approaches in each simulated scenario.
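One simple way to summarise these quantities across the simulated replicates is sketched below; this is an illustrative helper of our own (the bias definition follows the text, while the standard deviation here is the Monte Carlo spread of the posterior means across datasets, which is one possible summary).

```python
import numpy as np

def bias_and_sd(post_means, true_value):
    """Monte Carlo bias (mean of posterior means minus the true value)
    and spread of the estimator across simulated datasets."""
    post_means = np.asarray(post_means, dtype=float)
    return (float(np.mean(post_means - true_value)),
            float(np.std(post_means, ddof=1)))
```

For example, `bias_and_sd(gamma1_means, 1.0)` would be applied to the 300 posterior means of γ 1 from one scenario, where `gamma1_means` is a hypothetical vector of fitted values.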
The group parameter was robustly estimated by the three approaches in all scenarios. However, as expected, the standard deviations of the STS estimators are smaller than those of their competitors. This happens because sharing the MAPs of the random effects without correction causes losses in the variability of such components, leading to an artificial and incorrect reduction in the uncertainty of the survival parameter estimators. Furthermore, it is important to note that NTS and JS produced nearly equivalent standard deviations. Figure 3 shows the bias and standard deviation for the association parameter using the JS, NTS and STS estimation approaches in each simulated scenario. The association parameter was unbiased using JS and NTS, while the STS approach underestimated the positive associations (α = 2 for Normal and α = 3 for Bernoulli) and overestimated the negative one (α = −2 for Poisson). In addition, our proposal presented standard deviations for the association parameter with negligible differences relative to those of the JS estimator, while STS again reduced such standard deviations.
The reduction of complexity of the inferential process by using two-stage strategies was quite clear when analysing computational times. However, only our proposal corrected the estimation bias without affecting the estimator variance, producing results equivalent to the JS approach.
Appendix 1 shows complementary results using the same scenarios as in Table 1, but increasing the number of repeated measurements per individual (30 ≤ n i ≤ 50). Once again, our two-stage strategy corrected the estimation bias as well as kept the standard deviation of the estimator equivalent to that from the JS approach.

Applications
We used three real datasets, available in the JM R-package version 1.5-2 (Rizopoulos 2022), to illustrate the performance of the novel two-stage (NTS, see Sect. 4) approach against the joint specification (JS, see Sect. 2) and the standard two-stage (STS, see Sect. 3) approaches. A brief description of each dataset is presented below:
• Acquired immunodeficiency syndrome (aids): This dataset includes information about 467 patients with advanced human immunodeficiency virus infection during antiretroviral treatment who had failed or were intolerant to zidovudine therapy (Goldman et al. 1996). The longitudinal variable "CD4" was transformed by applying the square root; and "drug" was set as a group variable (zalcitabine = 0 and didanosine = 1).
• Liver cirrhosis (prothro): This dataset includes 488 patients with histologically verified liver cirrhosis, with 237 patients randomised to treatment with prednisone and the remainder receiving placebo (Andersen et al. 1993). The longitudinal variable "pro" was divided by 10 and truncated to the nearest integer; and "treat" was set as a group variable (placebo = 0 and prednisone = 1).
• Primary biliary cirrhosis (pbc): This dataset includes 312 patients with primary biliary cirrhosis, a rare autoimmune liver disease, at the Mayo Clinic, where 158 patients were randomised to D-penicillamine and 154 to placebo (Murtaugh et al. 1994). The longitudinal variable "serBilir" was dichotomised into ≤ 2 (0, control group) or > 2 (1, late stage of PBC); and "drug" was set as a group variable (placebo = 0 and D-penicil = 1).
Figure 4 shows the frequency of the number of longitudinal observations per individual in the three datasets. Based on the longitudinal response type of each dataset, the joint models are specified as follows: (6) and (7) for the aids dataset; (8) and (7) for the prothro dataset; and (9) and (7) for the pbc dataset. Prior distributions and MCMC configuration were defined as in Sect. 5.5.
We provide the code for reproducibility at www.github.com/daniloalvares/BJM-NTS2. Table 3 shows the posterior summary of the group (γ 1 ) and association (α) parameters estimated from the JS, NTS and STS approaches for each dataset, as well as the average computational time.
In the three applications, the group parameter (γ 1 ) estimates are very similar, except for small differences in the posterior mean using the STS approach, whose 95% credible intervals are nonetheless similar to those of the other approaches. On the other hand, the biased estimation using the STS approach is more noticeable for the association parameter (α). When using the Normal submodel (aids dataset), JS and NTS are almost identical, while STS has a slightly higher point estimate and a narrower 95% credible interval. The equivalence between JS and NTS is maintained for the joint models for count (prothro) and binary (pbc) data, and in both cases the estimation bias using STS is more prominent, with credible intervals that do not include the posterior mean of α calculated by the JS approach (reference estimation). Again, STS remains the fastest option, while NTS has reasonable average times, significantly lower than those of the JS approach.

Discussion
We have proposed a novel two-stage (NTS) approach that can be used in a Bayesian joint analysis of longitudinal and survival data. We have simplified the complex nature of the joint specification (JS) by using a two-stage strategy that at the same time corrects the well-known bias from a standard two-stage (STS) approach.
NTS was shown to be successful, producing satisfactory results both in simulated scenarios and in real applications considering different types of longitudinal responses (continuous, count and binary) that share information with a Weibull proportional hazards model.
In practical terms, the group parameter (γ 1 ) was not significantly affected by the use of two-stage approaches. However, we have once again observed that the association parameter (α) is biased when STS is used. This issue is mitigated by replacing STS with NTS. Furthermore, we highlight that the strength of the association does not affect the NTS estimation, even with few longitudinal observations per individual. In particular, the strong association present in the simulation studies and also observed in the pbc dataset modelling did not produce bias, as typically happens with regression calibration approaches (Nevo et al. 2020). Moreover, our proposal had a much lower computational time than the JS approach in all scenarios.
Theoretically, the correction of the shared information should be enough so that all other characteristics of the survival model (baseline hazard and coefficients of baseline variables) do not suffer estimation bias compared to the JS. However, in this paper, we have not investigated different baseline hazard specifications to corroborate such behaviour, but we hope to do so in future work. Another advantage of the proposed two-stage approach is that it can be easily generalised to more complex longitudinal (e.g., skewed or multiple longitudinal data) and survival (e.g., competing-risks or multi-state data) submodels.
Appendix 1

Following (5), the function

ℓ(b, θ s ) = log f(y | b, θ̂ y ) + log f(s | b, θ̂ y , θ s )

is specified as the log-likelihood function in the NTS second stage.
So far we have omitted the specification of the random effects and priors for the estimation of the MAPs of θ y and θ s . We assume that the distribution of b, f(b | θ b ), in the JS approach has the same form as the prior distribution of the fixed effects b, π(b | θ̂ b ), in the NTS approach. Hence, the contribution of the random effects to the MAPs of θ y and θ s is the same under both approaches. Additionally, even if the random effects distributions are misspecified, Theorem 1 of Rizopoulos, Verbeke and Molenberghs (Rizopoulos et al. 2008) shows that, as n i grows, the estimates of θ y and θ s are minimally affected. On the other hand, we assume that the prior distributions are proper and weakly-informative, so they play an irrelevant role in calculating the MAPs of θ y and θ s . Furthermore, as the sample size increases, the choice of prior distributions has minimal impact.

Appendix 2

See Figs. 5 and 6.