Bayesian inference for continuous-time hidden Markov models with an unknown number of states

We consider the modeling of data generated by a latent continuous-time Markov jump process with a state space of finite but unknown dimension. Typically in such models, the number of states has to be pre-specified, and Bayesian inference for a fixed number of states has not been studied until recently. In addition, although approaches to address the problem for discrete-time models have been developed, no method has been successfully implemented for the continuous-time case. We focus on reversible jump Markov chain Monte Carlo, which allows trans-dimensional moves among models with different numbers of states, in order to perform Bayesian inference for the unknown number of states. Specifically, we propose an efficient split-combine move which can facilitate the exploration of the parameter space, and demonstrate that it can be implemented effectively at scale. Subsequently, we extend this algorithm to the context of model-based clustering, allowing the numbers of states and clusters to both be determined during the analysis. The model formulation, inference methodology, and associated algorithm are illustrated by simulation studies. Finally, we apply this method to real data from a Canadian healthcare system in Quebec. Supplementary Information The online version contains supplementary material available at 10.1007/s11222-021-10032-8.


Introduction
Continuous-time Markov processes on a finite state space have been widely used as models for irregularly spaced longitudinal data, as they correspond to plausible data generating representations. In almost all cases, the process is observed only at a number of discrete time points, rather than being continuously observed. This problem arises in a broad collection of practical settings, from public health surveillance to molecular dynamics.
For example, healthcare systems and electronic health records represent large volumes of data that allow the calculation of longitudinal health trajectories; however, such health records are recorded only when patients interact with the health system. Likelihood-based inference for the infinitesimal generator of a continuous-time Markov jump process has been detailed, for example, in Jacobsen (1982). However, in settings such as those identified above, inference for the infinitesimal generator becomes more difficult. Bladt and Sørensen (2005) investigated likelihood-based inference for discretely observed continuous-time Markov processes, while Tancredi (2019) proposed approximate Bayesian methods to facilitate the computation for such models.
In a related class of problems, the observed data are not directly representative of the Markov process, or similarly the process is observed with measurement error. In those cases, a hidden Markov model (HMM) is more appropriate: this model assumes that an unobserved stochastic process governs the generating model for observations, and assumptions of the Markov property are imposed on the unobserved sequence, with observations usually modeled as independent conditional on the hidden Markov process. There has been broad interest in applying the continuous-time HMM (CTHMM) in recent years, such as in ecological studies (Mews et al., 2020) and in medical research (Lange et al., 2015; Alaa and Van Der Schaar, 2018; Lange et al., 2018), with a predominant focus on frequentist approaches. More recently, Williams et al. (2020) and Luo et al. (2021) implemented a fully Bayesian CTHMM using different missing data likelihood formulations for the underlying Markov chain. Even as these models have been proposed and implemented, the number of states has had to be pre-specified. Determining the number of hidden states is a challenge addressed in earlier work (see for example Celeux and Durand, 2008; Pohle et al., 2017). Luo et al. (2021) suggested using the BIC to select the number of states via the expectation-maximization algorithm before performing Bayesian inference with a fixed number of states. Luo et al. (2021) extended Bayesian CTHMMs to finite and Dirichlet mixture model-based clustering procedures for clustering individuals, which allows Markov chain Monte Carlo (MCMC) to change the dimension of the number of clusters, but still relies on the assumption that the number of states is pre-specified.

Bayesian model determination approaches have been a longstanding focus of interest in Bayesian inference (see, for example, Carlin and Chib, 1995; Green, 1995; Godsill, 2001).
In particular, reversible jump MCMC (Green, 1995) has provided a general solution by constructing trans-dimensional moves that exploit the dynamics of the Metropolis-Hastings (MH) algorithm in a fixed dimension, allowing movement across parameter spaces of different dimensions. Richardson and Green (1997) developed a reversible jump MCMC approach to univariate Normal mixture models, and subsequently Robert et al. (2000) extended this work to discrete-time hidden Markov models with Normal mixtures. In their work, they used two specific types of reversible jump moves to explore the parameter space, i.e., split-combine and birth-death moves. Stephens (2000) introduced an alternative MCMC approach, using a birth-death point process to infer the number of components in the Normal mixture model setting, and Cappé et al. (2003) demonstrated the limit-case equivalence of the reversible jump and continuous-time methodologies for both mixture models and discrete-time HMMs. In this paper, we focus on constructing reversible jump MCMC for CTHMMs which allows the number of hidden states to be inferred via the posterior distribution.
For a better understanding of the dynamic changes of individual trajectories, it is helpful to cluster individual trajectories and to study the pattern in each group to explore the variation in trajectories. Many of the relevant methods may be classified as model-based clustering procedures, where clustering is achieved by consideration of parametric likelihood- or density-based calculations, with the number of clusters determined by information criteria, such as AIC or BIC (Dasgupta and Raftery, 1998; Fraley and Raftery, 1998). Similarly, however, in such calculations, the number of clusters has to be fixed, and determining the number of clusters is a challenge addressed by many clustering algorithms. We address this problem subsequently by extending our reversible jump MCMC procedures to allow the number of clusters to be inferred during the analysis.
The rest of the paper is organized as follows. In Section 2, we describe the CTHMM-GLM. Section 3 presents fully Bayesian inference via reversible jump MCMC, specifically a split-combine move to update the number of states, followed by fixed-dimensional MCMC updates.
Section 4 extends the reversible jump MCMC approach to model-based clustering, allowing the numbers of states and clusters to vary simultaneously. Simulation examples examining the performance of the proposed MCMC are presented in Section 5. Finally, we present the results from applying this method to a chronic obstructive pulmonary disease (COPD) cohort in Section 6, and discuss these results in Section 7.

A Continuous-Time Hidden Markov Model
We presume that the data $\{O_1, \dots, O_T\}$ recorded at observation time points $\{\tau_1, \dots, \tau_T\}$ arise as a consequence of a latent continuous-time Markov chain (CTMC) $\{X_s, s \in \mathbb{R}_+\}$ taking values on the finite integer set $\{1, 2, \dots, K\}$. Note that observations are indexed using an integer index (that is, $O_t$), and that the latent process is indexed using a continuous-valued index (that is, $X_{\tau_t}$).
Conditional on the latent process, we assume the observations are drawn from an exponential family model with density $f(O_t \mid X_{\tau_t} = k)$. If there are time-varying explanatory variables $Z \in \mathbb{R}^D$, a generalized linear model (GLM) with link function $g(\cdot)$ and regression coefficients $\beta_k$ for state $k$ is adopted. Define the matrix $B = (\beta_{d,k})$ for $d = 1, \dots, D$ and $k = 1, \dots, K$ as the GLM coefficient matrix. Finally, let $S_t = (S_{t,1}, \dots, S_{t,K})$ be an indicator random vector with $S_{t,k} = 1$ if $X_{\tau_t} = k$ and $0$ otherwise.
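To make the observation model concrete, the following is a minimal sketch of the state-dependent log-density under a Poisson GLM with log link; the function name and arguments are our own, not the paper's implementation.

```python
import math
import numpy as np

def poisson_glm_loglik(o_t, z_t, beta_k):
    """log f(O_t | X_{tau_t} = k) for a Poisson GLM with log link,
    i.e. E[O_t | X = k, Z_t] = exp(z_t' beta_k)."""
    mu = float(np.exp(np.dot(z_t, beta_k)))          # state-specific mean
    return o_t * math.log(mu) - mu - math.lgamma(o_t + 1)
```

For an intercept-only model, for example, one takes $z_t = (1)$ and $\beta_k$ equal to the state's intercept.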
In the assumed model, the data generating mechanism is specified via (i) the latent state model $X_s \mid \Theta \sim \mathrm{CTMC}(\pi, Q)$, where $Q$ is the infinitesimal generator and $\pi$ is the initial distribution for the continuous-time Markov process, and (ii) the observation model $g(\mathrm{E}[O_t \mid X_{\tau_t} = k, Z_t]) = Z_t^\top \beta_k$. Recall that the structural constraint on $Q$ is that its off-diagonal elements $\{q_{j,k},\ j, k = 1, \dots, K,\ j \ne k\}$ are positive, and that its rows sum to zero. In this paper, we impose no other constraints, although to do so would be straightforward: for example, we might wish to restrict certain $q_{j,k}$ to obey linear constraints such as equality to zero. In the model, the observations $\{O_1, \dots, O_T\}$ are assumed mutually conditionally independent given $\{X_s\}$; this assumption is common, but can be easily relaxed. With the Markov chain observed discretely at different time points, one could compute the likelihood function for $Q$ as in Jackson et al. (2003) and Williams et al. (2020); however, to facilitate the MCMC algorithm, we consider the complete but unobserved trajectory of $\{X_s\}$ as a collection of auxiliary variables in a missing data formulation: the unobserved trajectory comprises a collection of states and transition times that completely describe the latent path over any finite interval. The detailed derivation of the complete data likelihood, $L(\Theta)$, is given in the Supplement.
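As an illustration of this complete-data formulation, the contribution of a fully observed latent path factorizes into exponential holding times and jump probabilities. A minimal sketch under that standard factorization (our own naming; the paper's full derivation is in its Supplement):

```python
import numpy as np

def ctmc_path_loglik(Q, states, times, t_end):
    """Log-likelihood of a fully observed CTMC path under generator Q.

    states[i] is the (0-indexed) state occupied from times[i] until
    times[i+1], or until t_end for the final segment.
    """
    ll = 0.0
    for i, k in enumerate(states):
        t0 = times[i]
        t1 = times[i + 1] if i + 1 < len(times) else t_end
        rate = -Q[k, k]                 # total exit rate of state k
        ll += -rate * (t1 - t0)         # exponential holding-time term
        if i + 1 < len(states):         # observed jump k -> next state
            ll += np.log(Q[k, states[i + 1]])
    return ll
```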
Bayesian inference for this model with the number of states K fixed has been fully studied by Luo et al. (2021), where an MCMC scheme based on simulating the complete latent path for each individual is developed; this MCMC scheme relies upon the rejection sampling approach of Hobolth and Stone (2009) to sample the latent paths in an efficient fashion.
Bayesian inference using the complete data likelihood formulation is appealing as it produces posterior samples of the full unobserved state sequences and latent continuous-time process, which allows inference to be made for individual-level trajectories across the entire observation window, and which is useful for computing posterior distributions for pathwise aggregate features on individual trajectories.

Reversible Jump MCMC for CTHMMs
First, we add the number of states K as an additional parameter and extend the MCMC algorithm to allow inference to be made via the posterior distribution for K. There are several different approaches that can be adopted, which we outline below and in the Supplement. First, we study split/combine moves for states/pairs of states, similar in spirit to the split/merge moves of Richardson and Green (1997) and Dellaportas and Papageorgiou (2006).
The Supplement also gives detailed descriptions and simulation examples for inference on K via a birth-death point process following Stephens (2000). In our analysis, we restrict attention to the case where transition dynamics are modeled homogeneously across the sample, that is, $q_{n,l,m} \equiv q_{l,m}$ for all $n, l, m$.

Markov chain Monte Carlo methodology
One iteration of the MCMC algorithm that incorporates the required trans-dimensional move proceeds using the following two types of move:
1. A split/combine move that considers splitting a hidden state into two, or combining two hidden states into one.
2. With the number of states K fixed, update the model parameters using standard MCMC moves:
• update the latent state indicators $S_{n,t}$;
• update the parameters associated with the observation process, $B$;
• update the initial distribution $\pi$;
• update the infinitesimal generator $Q$.
The Supplement gives detailed procedures for updating the model parameters with a fixed number of states, which were extensively studied in Luo et al. (2021). For the split and combine moves, we implement the reversible jump algorithm of Green (1995).
Consider a proposal from the current model state $(K, \Theta)$ to a new state $(K', \Theta')$ using the proposal density $q(K', \Theta'; K, \Theta) = q_1(K'; K)\, q_2(\Theta'; \Theta)$, that is, using independent proposals for the two components. The acceptance probability for this form of proposal using the MH procedure is given by
$$\alpha(K', \Theta'; K, \Theta) = \min\left\{1,\ \frac{p(K', \Theta' \mid o)\, q(K, \Theta; K', \Theta')}{p(K, \Theta \mid o)\, q(K', \Theta'; K, \Theta)}\right\},$$
where $p(K, \Theta_K \mid o)$ is the posterior distribution of $(K, \Theta_K)$ given the observed data $o$, which can, up to proportionality, be decomposed into the marginal (or 'incomplete data') likelihood of the data $L(o \mid \Theta_K, K)$ times the prior distribution for $(K, \Theta_K)$: $p(K, \Theta_K \mid o) \propto L(o \mid \Theta_K, K)\, p_0(\Theta_K \mid K)\, p_0(K)$. Our algorithm relies upon the ability to compute the marginal likelihood efficiently for any $\Theta_K$; however, this is a standard 'forward' calculation for CTHMMs.
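The 'forward' calculation referred to above can be sketched as follows, assuming the observation log-densities have already been evaluated at each time point; the names and structure are ours, and scipy's matrix exponential is used for the interval transition probabilities.

```python
import numpy as np
from scipy.linalg import expm

def cthmm_marginal_loglik(pi, Q, obs_loglik, obs_times):
    """Forward algorithm for a CTHMM: marginal log-likelihood of the data,
    integrating out the latent chain.  obs_loglik[t, k] = log f(O_t | X = k).
    Forward weights are rescaled at each step to avoid underflow."""
    alpha = pi * np.exp(obs_loglik[0])           # forward weights at time 0
    ll = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for t in range(1, len(obs_times)):
        # transition probabilities over the inter-observation interval
        P = expm(Q * (obs_times[t] - obs_times[t - 1]))
        alpha = (alpha @ P) * np.exp(obs_loglik[t])
        ll += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return ll
```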

Split and Combine Moves
To construct efficient split and combine moves under the reversible jump framework, we adopt the idea of centered proposals of Brooks et al. (2003). The proposal is designed to produce similar likelihood contributions for the current and proposed parameters. The combine move is designed to choose a state $k$ at random and to select another state $k'$ according to a proximity criterion. The reverse split move randomly selects a state $k$ to split into two states, say $k'$ and $k''$, and checks whether the corresponding adjacency condition holds; if this condition is not met, the split move is rejected immediately.

Split Move
We consider an update that changes $K \to K+1$, requiring the generation of a new hidden state. For this move, we set $q_1(K; K+1) = q_1(K+1; K)$ for each $K$, so the acceptance probability reduces to
$$\alpha(K+1, \Theta_{K+1}; K, \Theta_K) = \min\left\{1,\ r(K+1, \Theta_{K+1}; K, \Theta_K \mid o)\, \frac{q_2(\Theta_K; \Theta_{K+1})}{q_2(\Theta_{K+1}; \Theta_K)}\right\}, \quad (1)$$
where we denote the posterior ratio in the final term by
$$r(K+1, \Theta_{K+1}; K, \Theta_K \mid o) = \frac{p(K+1, \Theta_{K+1} \mid o)}{p(K, \Theta_K \mid o)}.$$
First, we randomly select a state on which to perform the split move. Without loss of generality, we consider the case where state $K$ is to be split into new states $K$ and $K+1$.
We propose the new $(K+1)$-dimensional infinitesimal generator $Q'_{K+1}$ using, in particular, the updates
$$q'_{K,j} = q_{K,j}, \qquad q'_{K+1,j} = q_{K,j}, \qquad 1 \le j < K,$$
with $Q_K = \{q_{i,j}\}_{1 \le i,j \le K}$ from the original $K$-state model and $Q'_{K+1} = \{q'_{i,j}\}_{1 \le i,j \le K+1}$; the remaining new elements are generated using $p_{0Q}(\cdot)$, the prior distribution for the elements of $Q$, which is assumed to be Gamma$(a, b)$.
In this way, the new stationary probabilities $s'$ of the CTMC associated with $Q'_{K+1}$, satisfying $s' Q'_{K+1} = 0$, obey $s'_j = s_j$ for $1 \le j < K$ and $s'_K + s'_{K+1} = s_K$, where $s$ is the vector of stationary probabilities associated with $Q_K$ (satisfying $s Q_K = 0$). The dynamical properties of the CTMC are thus preserved. The observation process parameters associated with the new state $K+1$ are generated from a Normal proposal centered at the current parameters for state $K$, and the remaining elements of $B_{K+1}$ are set equal to the elements of $B_K$. In addition, we generate a weight $w \sim \text{Beta}(2, 2)$ to split the initial probability for state $K$ in $\pi_K = (\pi_1, \dots, \pi_K)$ into $\pi'_K = w\pi_K$ and $\pi'_{K+1} = (1 - w)\pi_K$, with the rest remaining the same. In the acceptance probability in (1), the ratio of the proposal densities can be computed term by term; for the observation parameters, the numerator comes from the Jacobian of the transformation that creates the proposed coefficients, where $p(\cdot)$ is the Normal density with mean $\beta_{1,K}$ and variance $c^2$.
For the initial distribution, the numerator comes from the Jacobian of the transformation that generates $\pi'$.
Therefore the MH acceptance probability, with prior distribution $p_0$, follows from (1), where $b_K$ is the probability of choosing the split move and $d_{K+1} = 1 - b_K$ is the probability of choosing the combine move.
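The deterministic parts of this split proposal can be sketched as follows. This is our own simplified illustration: the rows out of the split state are copied into both new states, the new cross rates are drawn from the Gamma$(a, b)$ prior, and the initial probability is split with a Beta$(2, 2)$ weight. Copying the incoming rates to both new states is a simplification for illustration; the paper splits them so that the stationary distribution is preserved.

```python
import numpy as np

rng = np.random.default_rng(1)

def split_state(Q, pi, k, a=1.0, b=2.0):
    """Sketch of a split proposal: state k is split into states k and K (new).

    Rows out of the split states copy the original row; the new cross rates
    are drawn from the Gamma(a, b) prior (rate parameterization); the initial
    probability of state k is split with a Beta(2, 2) weight.
    """
    K = Q.shape[0]
    Qn = np.zeros((K + 1, K + 1))
    Qn[:K, :K] = Q
    Qn[K, :K] = Q[k, :K]                 # new state copies row k
    Qn[:K, K] = Q[:K, k]                 # incoming rates copied (assumption)
    Qn[k, K] = rng.gamma(a, 1.0 / b)     # cross rates from the prior
    Qn[K, k] = rng.gamma(a, 1.0 / b)
    for i in range(K + 1):               # restore rows-sum-to-zero constraint
        Qn[i, i] = 0.0
        Qn[i, i] = -Qn[i].sum()
    w = rng.beta(2.0, 2.0)
    pin = np.append(pi, (1.0 - w) * pi[k])
    pin[k] = w * pi[k]
    return Qn, pin
```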

Combine Move
For the update from $K+1$ to $K$ states, we consider the following move. Without loss of generality, we consider how to combine states $K$ and $K+1$ into a single new state $K$. For the current configuration $Q_{K+1}$, we propose the move to $Q'_K$ by combining the rows for states $K$ and $K+1$; the remaining $q'_{i,j}$, $i \ne j$, are obtained by copying $Q_{K+1}$ and discarding $q_{K,K+1}$ and $q_{K+1,K}$, with the diagonal terms of $Q'_K$ set so that each row sums to zero. It can be verified that the stationary probabilities $s' = (s'_1, \dots, s'_K)$ associated with $Q'_K$ are $s'_j = s_j$ for $1 \le j < K$ and $s'_K = s_K + s_{K+1}$. For the parameters in the observation process, we propose combined coefficients, with the remaining elements $\beta_{m,j}$ for $j < K$ taken to be the same as in the current parameter configuration $B_{K+1}$. Finally, the initial distribution $\pi_{K+1} = (\pi_1, \dots, \pi_K, \pi_{K+1})$ simply moves to $\pi'_K = (\pi'_1, \dots, \pi'_K)$, where $\pi'_K = \pi_K + \pi_{K+1}$ and $\pi'_j = \pi_j$ for $j < K$. This is the reverse move corresponding to the split move described above, with, essentially, $w = \pi_K / \pi'_K$ and $p(\cdot)$ the density of Beta$(2, 2)$. For the infinitesimal generator, the reverse move for $q_{i,K}$, $1 \le i < K$, is the same as in the split move. The reverse move for $q_{K,j}$, $1 \le j < K$, can be viewed as a weighted combination with $u_0 = s_K / (s_K + s_{K+1})$ and $u_1$ a weight parameter: if we choose $u_0 = u_1$, then $q'_{K,j} = q_{K,j}$ and $q'_{K+1,j} = q_{K,j}$, which reverses what was proposed for the split move.
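A sketch of this combine construction, together with the stationary-probability computation it relies on, is given below. Weighted averaging of the outgoing rows by the stationary probabilities is our reading of the centered proposal; the exact form in the paper may differ in detail.

```python
import numpy as np

def stationary(Q):
    """Stationary distribution s solving s Q = 0 with sum(s) = 1."""
    K = Q.shape[0]
    A = np.vstack([Q.T, np.ones(K)])
    b = np.append(np.zeros(K), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def combine_states(Q, pi, s):
    """Sketch of the combine move for the last two states K and K+1:
    their outgoing rows are averaged with weights from the stationary
    probabilities s, incoming rates are added, initial probabilities are
    summed, and the cross rates q_{K,K+1}, q_{K+1,K} are discarded."""
    K = Q.shape[0] - 1                       # target number of states
    u0 = s[K - 1] / (s[K - 1] + s[K])        # weight for state K
    Qn = np.zeros((K, K))
    Qn[:K - 1, :K - 1] = Q[:K - 1, :K - 1]
    Qn[K - 1, :K - 1] = u0 * Q[K - 1, :K - 1] + (1 - u0) * Q[K, :K - 1]
    Qn[:K - 1, K - 1] = Q[:K - 1, K - 1] + Q[:K - 1, K]
    for i in range(K):                       # rows-sum-to-zero constraint
        Qn[i, i] = 0.0
        Qn[i, i] = -Qn[i].sum()
    pin = pi[:K].copy()
    pin[K - 1] = pi[K - 1] + pi[K]           # sum the initial probabilities
    return Qn, pin
```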
In terms of $B$, since we mimicked the proposal for $q_{K,j}$, the corresponding proposal ratio equals 1, and the MH acceptance probability for the combine move, $\alpha(K, \Theta_K; K+1, \Theta_{K+1})$, is the minimum of 1 and the reciprocal of the split-move ratio.

Model-Based Clustering of CTHMMs

We now extend the model to a finite mixture of CTHMMs with $M$ components, with cluster membership indicator $C_n$ for subject $n$ and mixture weights given a symmetric Dirichlet prior with $\delta = 1$, making the weight distribution uniform. The complete-data likelihood for subject $n$ then factorizes over the components, where $\mathbf{1}(C_n = m)$ is the indicator function. A subject is assigned to component $m$ according to the posterior probability of membership given in (5). In reality, the model parameters $\Theta$ and the values of the latent states $X_n$ are not known, and must be inferred from the observed data.

Number of States
In Luo et al. (2021), a reversible jump algorithm based on the marginalized model in (6) was used to update $M$; we incorporate this move into the algorithm of Section 3 to construct a clustering mechanism which allows the number of clusters and the number of states to be determined together during the analysis. We first apply the reversible jump MCMC algorithm of Section 3 to update the number of states in each component. We then update the number of components according to a split-combine move, where the combine move only involves components with the same number of states. We summarize one iteration of this clustering mechanism as follows:
1. Update the number of states for each component using the algorithm in Section 3.2.
2. Update the component membership indicators $C_n$ according to the posterior probability in (5).
3. Propose a split-combine move to update the number of components; if the move is accepted, update the model parameters in the corresponding components.
4. With the number of components $M$ fixed, each with fixed number of states $K_m$, $m = 1, \dots, M$, update the model parameters using standard MCMC moves in each component; the details are given in the Supplement.
For any empty component resulting from Step 3, we generate model parameters from the prior distributions. The split and combine moves are carried out on the marginalized model as in Luo et al. (2021), where component labels and latent processes are marginalized out of the calculation. As when updating the number of states, we update $M$ by considering a proposal from the current state $(M, \Theta)$ to a new state $(M', \Theta')$ using the proposal density $q(M', \Theta'; M, \Theta) = q_1(M'; M)\, q_2(\Theta'; \Theta)$, that is, using independent proposals for the two components. The acceptance probability for this proposal is given by
$$\alpha(M', \Theta'; M, \Theta) = \min\left\{1,\ \frac{p(M', \Theta' \mid o)\, q(M, \Theta; M', \Theta')}{p(M, \Theta \mid o)\, q(M', \Theta'; M, \Theta)}\right\},$$
where $p(M, \Theta \mid o)$ is the posterior distribution of $(M, \Theta)$ given the observed data $o$, which can, up to proportionality, be decomposed into the marginal likelihood of the data $L(o \mid \Theta, M)$ times the prior distribution for $(M, \Theta)$, with prior distribution $p_0$: $p(M, \Theta \mid o) \propto L(o \mid \Theta, M)\, p_0(\Theta \mid M)\, p_0(M)$.
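Both this acceptance probability and the one in Section 3 share the same log-scale structure, which can be sketched generically as follows; the function names are ours, and the log-Jacobian argument covers the deterministic transformations used in the split and combine moves.

```python
import numpy as np

def log_accept_ratio(loglik_new, logprior_new, loglik_cur, logprior_cur,
                     logq_reverse, logq_forward, log_jacobian=0.0):
    """Log of the MH ratio p(new|o) q(cur; new) / [p(cur|o) q(new; cur)],
    with an optional log-Jacobian for deterministic transformations."""
    return ((loglik_new + logprior_new) - (loglik_cur + logprior_cur)
            + logq_reverse - logq_forward + log_jacobian)

def accept(log_ratio, rng):
    """Accept with probability min(1, exp(log_ratio))."""
    return np.log(rng.uniform()) < log_ratio
```

Working on the log scale avoids overflow when the marginal likelihoods of the two models differ by many orders of magnitude.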
We will discuss trans-dimensional moves for updating the number of components in more detail below.

Split/Combine Move for Updating the Number of Clusters
The combine move is designed to choose a component, $m$ say, at random and to select another component with the same number of states to merge with it; the reverse split move divides a randomly chosen component into two.

Split Move
We consider an update that changes the number of components from $M \to M+1$. Without loss of generality, we aim to split the $M$th component with CTHMM parameters $\Theta_M = \{\pi_M, Q_M, B_M\}$ into two components, requiring $K_M$ new hidden states to be generated, with corresponding parameters $\Theta' = \{\pi', Q', B'\}$ and $\Theta'' = \{\pi'', Q'', B''\}$. To implement the idea of centering proposals, we use a deterministic proposal for $Q$ and $\pi$, setting $Q' = Q'' = Q_M$ and $\pi' = \pi'' = \pi_M$. For the observation parameters $B$, we use a similar centered proposal. For the mixture weights, let $w \sim \text{Beta}(2, 2)$ and assign the two new components weights equal to $w$ and $(1 - w)$ times the current weight of component $M$.
Defining the posterior ratio analogously to the split move for states, the acceptance probability for this proposal follows, where $b_M$ is the probability of choosing the split move, $d_{M+1} = 1 - b_M$ is the probability of choosing the combine move, $p_\beta(\cdot)$ is the Normal density with mean $\beta_{1,K}$ and variance $c^2$, and $p(\cdot)$ is the Beta$(2, 2)$ density.

Combine Move
For the combine move, we need to choose two components with the same number of states and update from $M+1 \to M$ components. Again, without loss of generality, we consider combining the $(M+1)$th and $M$th components, both with $K_M$ states, into one component with proposed parameters $\Theta' = \{\pi', Q', B'\}$. We first find the stationary probabilities, $s_M$ and $s_{M+1}$, associated with $Q_M$ and $Q_{M+1}$. To combine $Q_M$ and $Q_{M+1}$ into $Q'$, the off-diagonal rates are combined, and the diagonal entries are set as $q'_{k,k} = -\sum_{j \ne k} q'_{k,j}$ for $k = 1, \dots, K_M$, so that each row sums to zero. For the observation process parameters $B$ and the initial distribution $\pi$, we likewise combine the two components' values and, for $\pi$, rescale so that the result sums to 1. For the mixture weights, the move is essentially to add up the weights of the two corresponding components. The acceptance probability is obtained as the reciprocal of the corresponding split-move ratio.

Simulation
In this section we demonstrate the performance of the proposed reversible jump and birth-death (in the Supplement) MCMC approaches for the CTHMM.

Identifying the number of states
In the first example, we demonstrate the performance of MCMC in estimating the number of states, and investigate how performance degrades as the problem becomes more challenging. We consider a four-state model. The initial distribution $\pi$ is set to $(0.35, 0.25, 0.2, 0.2)$. For each subject, we first construct the continuous-time Markov process from the generator $Q$, giving a continuous-time realization of the latent sequence $\{X_s, 0 \le s \le 15\}$, and uniformly at random extract $T$ observation time points between 0 and 15, where $T \sim \text{Uniform}(20, 60)$. We assume that the first observation is made at time 0. We generate data for 1000 subjects. The prior distributions for the elements of $Q$ and for $\pi$ are specified as independent Gamma$(1, 2)$ and Dirichlet$(1, \dots, 1)$.
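The data-generating scheme described above can be sketched as a standard Gillespie-style simulation followed by uniform sampling of observation times; the function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ctmc(Q, pi, t_end):
    """Simulate a CTMC path (jump times and states) on [0, t_end]."""
    K = Q.shape[0]
    state = rng.choice(K, p=pi)
    times, states = [0.0], [state]
    t = rng.exponential(1.0 / -Q[state, state])   # first holding time
    while t < t_end:
        rates = Q[state].clip(min=0.0)            # off-diagonal exit rates
        state = rng.choice(K, p=rates / rates.sum())
        times.append(t)
        states.append(state)
        t += rng.exponential(1.0 / -Q[state, state])
    return times, states

def observe(times, states, t_end, T):
    """Sample T observation times uniformly on [0, t_end] (first at 0)
    and read off the latent state occupied at each."""
    obs_t = np.sort(np.append(0.0, rng.uniform(0.0, t_end, T - 1)))
    idx = np.searchsorted(times, obs_t, side="right") - 1
    return obs_t, [states[i] for i in idx]
```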
A non-informative prior is imposed for $B$. We use a Poisson$(3.5)$ distribution as the prior for the number of states, and initiate the model with one hidden state.
The posterior distributions of the number of hidden states for the different cases are shown in Table 1, with the trace plots displayed in the Supplement. The trace plots demonstrate that our reversible jump MCMC algorithm has extensively explored the parameter space. The posterior modes for the Normal case with σ = 1 and the Poisson case are both four, indicating that the proposed MCMC algorithm can identify the number of states from which the data were simulated. However, when we increase σ to 1.5 and 2, the posterior modes for the two cases are five. In those cases, the distributions of the number of hidden states are more dispersed. In the trace plots for those two cases, the MCMC sampler is more likely to explore the higher-dimensional parameter space, resulting in fewer iterations spent on the four-state model.

Replications: Identifying the number of states
Subsequently, we run 100 replications of the MCMC on the same data set of 500 subjects, with the same parameter configuration and prior settings as Section 5.1, for the Normal case with σ = 1.
In each replication, we run 50,000 iterations in total. Figure 1 displays the posterior distribution of the number of states over the 100 replications after 10,000, 20,000, 30,000 and 50,000 iterations. In the figure, the proposed RJMCMC algorithm generates consistent results across almost all replications, with the majority having posterior mode four.
As the number of iterations increases, the variation of the posterior distribution becomes smaller. After 50,000 iterations, 99 out of 100 replications have posterior mode four, which demonstrates the stability of the proposed algorithm.

Identifying the number of states: Intercept Only
In this example, the data are generated with only an intercept in the GLM. The purpose of this example is to show how the performance differs from the previous examples, especially with respect to the value of σ in the Normal case. The simulation is configured with three latent states, the associated population generator Q, and coefficient matrices
• Gaussian case: B = (−4, 0, 5),
• Poisson case: B = (log(1.5), log(4), log(5)).
The initial distribution π for the continuous-time Markov process is set to (0.5, 0.4, 0.1). As in the first example, we construct the continuous-time Markov process from the generator Q, giving a continuous-time realization of the latent state process {X_s, 0 ≤ s ≤ 15}, and extract T observation time points uniformly at random between 0 and 15, where T ∼ Uniform(20, 60), with the first observation at time 0 in order to estimate the initial probability π. We generate data for 1000 subjects in each case. The prior distributions for the elements of Q and for π are specified as independent Gamma(1, 2) and Dirichlet(1, . . ., 1). The prior on the state means is N(0, 1) in the Normal case and Gamma(10, 10) in the Poisson case.
Again, we use a Poisson(3.5) distribution as the prior for the number of states, and we initiate the model with one hidden state.
The results are shown in Table 2, with trace plots of the number of states in the Supplement. As there are fewer parameters in this example, the posterior modes for all cases are three, and all cases spend over 90% of iterations on the three-state model. Unlike the previous cases, the algorithm is less likely to explore higher dimensions compared to models with covariates. We notice that the sampler repeatedly visits the four-state model and then quickly merges back to the three-state model, and this prevents the MCMC sampler from moving to a higher dimension. In general, the proposed algorithm shows good mixing performance in this example.
Identifying numbers of clusters and states

In the final simulation example, we examine the clustering algorithm with four clusters. Data are generated with 400, 500, 450 and 550 subjects for each cluster, respectively. We use the same prior distributions as in Section 5.1 for the model parameters.
Trace plots for the number of clusters for different cases are shown in the Supplement.
For the number of clusters, all cases, except Normal σ = 1.5, have posterior mode four, which is the true number of clusters from which the data are generated. For the Normal cases, we observe a monotonically decreasing trend in the posterior probability of four clusters as σ increases. The results for the Poisson case are similar to the Normal σ = 1 case. Conditional on four-cluster iterations, trace plots of the number of states are displayed in the Supplement, with missing segments representing non-four-cluster iterations. The posterior modes of the number of states conditional on four-cluster models are consistent with the configuration from which the data are generated. For the Normal σ = 1 and Poisson cases, trace plots of the number of states are similar to the one-cluster example. For Normal σ = 1.5 and 2, we do not observe mixing as good as in the previous examples, and there are also fewer four-cluster iterations. When σ = 1.5, the posterior modes of the number of states conditional on the four-cluster model are still consistent with the true data configuration; however, when σ = 2, it is not easy to identify the number of states in each cluster. Compared to previous cases, this is a more difficult problem because of the complexity and flexibility of the proposed algorithm. For example, when updating the number of clusters, a combine move is unlikely to succeed until two similar clusters have the same number of states. In our example, we set the probability of the combine move for updating the number of clusters to 0.7 to account for this issue. Overall, the algorithm performed well in selecting the number of clusters and states in well-separated scenarios.
Real Data Analysis: Health Surveillance of COPD Patients

Our real example relates to healthcare surveillance for the chronic condition COPD in the greater Montreal area, Canada. In 1998, a 25% random sample was drawn from the registry of the Régie de l'assurance maladie du Québec (RAMQ, the Québec provincial health authority) with a residential postal code in the census metropolitan area of Montreal. At the start of every following year, 25% of those who were born in, or moved to, Montreal within the previous year were sampled to maintain a representative cohort. Follow-up ended when people died or changed their residential address to outside of Montreal. This administrative database includes outpatient diagnoses and procedures submitted through RAMQ billing claims, and procedures and diagnoses from inpatient claims.
Using established case-definitions based on diagnostic codes (Lix et al., 2018), COPD patients were enrolled with an incident event occurring after a minimum of two years at risk with no events. Patients were followed from January 1998, starting from the time of their first diagnosis, until December 2014. Physicians only observed these patients during medical visits, which occurred when patients chose to interact with the healthcare system, and at which information, including the number of prescribed medications, was collected. However, as this information was only available for patients with drug insurance, we restrict the cohort to patients over 65 years old with COPD, as prescription data are available for all of these patients. It is widely believed that the progression of COPD can be well-modeled as a progression through a small number of discrete states which approximate severity (GOLD Executive Committee, 2017). We are interested in identifying those states and modeling transitions between these discrete states, which reflects the performance of the healthcare system over time.
In our analysis, the outcome observations are the number of prescribed medications at the time when patients visited the physician: these are modeled using a Poisson model. In addition, the type of healthcare utilization at each visit was also recorded: hospitalization (HOSP), specialist visit (SPEC), general practitioner visit (GP) and emergency department visit (ER). In total, 4,597 COPD patients are included in this analysis, all with drug plans and with at least five years of follow-up.

Identifying the number of states
First, we carry out our analysis to identify the number of states. The analysis is initiated as a one-state model. The prior distributions for the elements of Q and for π are specified as independent Gamma(1, 2) and Dirichlet(1, . . ., 1). We use a Poisson(3) distribution as the prior for the number of states.

With covariates
We implement the model including the types of healthcare utilization as covariates in the observation model. A non-informative prior is imposed for B. We run the proposed trans-dimensional MCMC algorithm for 20,000 iterations.
Table 4 shows the posterior distribution of the number of states. The trace plot (Figure 2) confirms that the algorithm has fully explored the parameter space. Although the mode of the posterior distribution of the number of states is five, the sampler also spends over 40% of iterations in the four-state model. Table 5 contains the exponentiated B coefficients conditional on the five-state model. On average, from State 1 to 5 the number of drugs taken increases; however, within each state, the numbers of drugs across the different healthcare utilizations are approximately the same. Therefore, it is plausible to consider fitting the intercept-only model without the time-varying covariate, which we do in the next section.

Without covariates
We run the reversible jump trans-dimensional MCMC algorithm for 20,000 iterations without the time-varying covariate, with a Gamma(10, 10) prior placed on the mean number of drugs. Table 4 shows the posterior distribution of the number of states. The trace plot (Figure 2) confirms that the algorithm has extensively explored the parameter space. Unlike the model with the time-varying covariate, the MCMC algorithm spends most of its time exploring the less complex models, i.e., the three-state and four-state models.
The posterior mode of the number of hidden states is four. Table 6 contains the expected number of drugs prescribed for patients in each state, with associated 95% credible intervals.
As in the model with covariates, the number of drugs taken increases on average from State 1 to State 4; however, the mean number of drugs prescribed in each state is smaller than in the previous five-state model with the time-varying covariate.

Identifying numbers of clusters and states
Next, we implement the clustering algorithm to group trajectories with distinct stochastic properties. In the one-cluster model above, we did not observe much distinction in the number of drugs across the different healthcare utilizations. We therefore cluster the patient trajectories using the intercept-only model.
We present results based on 10,000 MCMC iterations after initialization from a one-cluster model with one hidden state. The mode of the posterior distribution of the number of clusters is three (5358 out of 10,000 iterations). Table 7 and Figure 3 present the posterior distributions and trace plots of the number of clusters and of the numbers of states conditional on the three-cluster iterations. The posterior modes of the numbers of states are four, two, and two for Clusters 1, 2, and 3, respectively. For a summary output, each subject is assigned the cluster membership given by its posterior mode conditional on the three-cluster iterations. Table 8 shows the posterior mean number of drugs for the three-cluster model, along with the number of patients in each cluster. Cluster 1 has the greatest number of patients and a posterior mode of four states, consistent with the results of the one-cluster model in Section 6.1. The separation between Clusters 2 and 3 arises mainly from the parameters of the underlying Markov process, as $q_{12}$ and $q_{21}$ in Cluster 3 are ten times greater than those in Cluster 2. This suggests that transitions between States 1 and 2 are more frequent in Cluster 3. Also, Cluster 3 has on average the smallest number of drugs prescribed, indicating that patients in this cluster are possibly at an early stage of COPD.

Discussion
We have developed a reversible jump MCMC algorithm for the CTHMM-GLM with an unknown number of states and clusters, implemented within a fully Bayesian framework. This model can deal with challenges typically encountered in latent multistate modeling, in particular, irregular visit times that vary from individual to individual. Our approach uses a split/combine move to explore the trans-dimensional parameter space, extending the fixed-dimensional MCMC proposed by Luo et al. (2021). Simulation studies demonstrated that the proposed MCMC approach can identify the number of states and the number of clusters of the true data generating mechanism. We were able to apply the developed methods to a real data set from Quebec, Canada, comprising more than four thousand COPD patients tracked over twenty years. Our work demonstrates that, with a careful construction of the trans-dimensional proposal, the reversible jump MCMC algorithm can achieve the desired performance in terms of identifying the number of states and the number of clusters simultaneously.
For the number of states and the number of clusters, a standard prior specification is adopted that is exchangeable in form with respect to the state/cluster labels. The MCMC algorithm could therefore suffer from the label-switching problem, which has been addressed by Jasra et al. (2005). In this paper, we primarily considered finite mixture formulations to facilitate the trans-dimensional moves between different numbers of states and clusters using the reversible jump MCMC algorithm. Bayesian nonparametric procedures, specifically those using Dirichlet process models, have become popular tools for exploring trans-dimensional parameter spaces, where the models are limiting versions of exchangeable finite mixture models. Dirichlet process models are now widely used in density estimation and clustering, with implementation via MCMC sampling approaches (Neal, 2000). It would be interesting to apply Dirichlet process models to CTHMMs to identify the number of states and the number of clusters simultaneously, and to compare aspects of finite and Dirichlet process mixture formulations, noting their similarities and where they differ. In our proposed model, the state space must be discrete; more generally, there may be health conditions that necessitate the use of a continuous latent process, allowing the latent continuous state to be interpreted as an index or a score. Bayesian formulations for diffusion or jump processes have been studied in the context of financial data, although such formulations are not common in the analysis of health data. For example, one could use the features included in comorbidity indices to measure multimorbidity in terms of the ability to predict future mortality and health services use. Further studies are needed to address this issue and to facilitate the generation of hypotheses about the performance of the healthcare system in managing patients with chronic disease.
Supplementary Materials for "Bayesian inference for continuous-time hidden Markov models with an unknown number of states"

A Likelihood for a Continuous-Time Hidden Markov Model
Figure 4 provides a schematic of the presumed data generating structure for one subject. To construct the complete data likelihood, suppose that $\{X_s\}$ has been observed continuously in the time interval $[0, \tau]$. The likelihood function for Q for individual n is (Bladt and Sørensen, 2005)
$$L_n(Q) = \prod_{l=1}^{K} \prod_{m \neq l} q_{l,m}^{N_{n,l,m}(\tau)} \exp\{-q_{l,m} R_{n,l}(\tau)\},$$
where $N_{n,l,m}(\tau)$ is the number of transitions from state l to state m in the time interval $[0, \tau]$, $R_{n,l}(\tau)$ is the total time that the process has spent in state l in $[0, \tau]$ for individual n, and the $q_{l,m}$ are the off-diagonal elements of Q. Note that the quantities $N_{n,l,m}(\tau)$ and $R_{n,l}(\tau)$ are unobserved, but can be computed given a complete realization of the latent process on $[0, \tau]$.
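In log form, this likelihood depends on the path only through the transition counts and holding times; a minimal sketch (array names are ours):

```python
import numpy as np

def complete_data_loglik(Q, N_counts, R_times):
    """Complete-data log-likelihood of the generator Q for one subject:
    each ordered pair (l, m), l != m, contributes
    N[l, m] * log(q_{l,m}) - q_{l,m} * R[l]."""
    K = Q.shape[0]
    ll = 0.0
    for l in range(K):
        for m in range(K):
            if l != m:
                ll += N_counts[l, m] * np.log(Q[l, m]) - Q[l, m] * R_times[l]
    return ll
```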
To construct the likelihood for Θ for N independent subjects, we let $O_{n,t}$ ($t = 1, \ldots, T_n$) denote the observations for subject n; $p_{j,k}(\Delta_{n,t})$ records the probability of transition from state j to state k in the interval $\Delta_{n,t}$, and $N^{j,k}_{n,l,m}(\Delta_{n,t})$ and $R^{j,k}_{n,l}(\Delta_{n,t})$ are the amended versions of $N_{n,l,m}$ and $R_{n,l}$ computed conditional on starting in state j and ending in state k over the interval $\Delta_{n,t}$.
Bayesian inference for this model with the number of states K fixed has been fully studied by Luo et al. (2021), where an MCMC scheme based on simulating the complete latent path for each individual is developed; this MCMC scheme relies upon the rejection sampling approach of Hobolth and Stone (2009) to sample the latent paths in an efficient fashion. Bayesian inference using the complete data likelihood formulation is appealing as it produces posterior samples of the full unobserved state sequences and latent continuous-time process, which allows inferences to be made for individual-level trajectories across the entire observation window, and which is useful for computing posterior distributions for pathwise aggregate features of the individual trajectories. Since the complete data likelihood for Q requires a continuously observed Markov chain, we first simulate the full path before updating Q.
- Simulate $N_{n,l,m}(\Delta_{n,t})$ and $R_{n,l}(\Delta_{n,t})$ from the Markov jump process step-by-step with infinitesimal generator $Q^{(i-1)}$ through the intervals $[\tau_{n,t}, \tau_{n,t+1})$, initiated at $X_{n,\tau_{n,t}}$ with the previously sampled end point $X_{n,\tau_{n,t+1}}$. Simulating sample paths conditional on the endpoints can be achieved efficiently by using modified rejection sampling (which avoids simulating constant sample paths when it is known that at least one state change must take place), as proposed by Hobolth and Stone (2009), from which $\{X_s\}$ is recovered and the jump time points are generated.
- Sample the $\{q_{i,j}^{(i)}\}_{1 \le i \neq j \le K}$, given the fully recovered Markov process, from independent Gamma distributions, with shape parameter given by the prior shape plus the total number of $i \to j$ transitions and rate parameter given by the prior rate plus the total time spent in state i.
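A simplified sketch of endpoint-conditioned path simulation via rejection sampling; the helper name, the seed, and the truncated-exponential first holding time follow the general idea of the modified sampler of Hobolth and Stone (2009), not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_conditioned_path(Q, a, b, T, max_tries=10_000):
    """Rejection-sample a CTMC path on [0, T] from state a to state b.
    When a != b, the first holding time is drawn from an exponential
    truncated to (0, T), which forces at least one jump (the 'modified'
    variant that avoids proposing constant paths)."""
    K = Q.shape[0]
    for _ in range(max_tries):
        t, state, path = 0.0, a, [(0.0, a)]
        first = True
        while True:
            rate = -Q[state, state]
            if first and a != b:
                # Truncated exponential on (0, T): guarantees one jump.
                u = rng.uniform()
                dt = -np.log(1.0 - u * (1.0 - np.exp(-rate * T))) / rate
                first = False
            else:
                dt = rng.exponential(1.0 / rate)
            if t + dt > T:
                break
            t += dt
            probs = Q[state].clip(min=0.0)   # jump probabilities
            probs /= probs.sum()
            state = rng.choice(K, p=probs)
            path.append((t, state))
        if state == b:        # accept only paths ending in b
            return path
    raise RuntimeError("no accepted path")
```

The transition counts $N$ and holding times $R$ are then read off the accepted path.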

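The Gamma update in the last step above is the standard conjugate update implied by the complete-data likelihood and a Gamma prior; a sketch (assuming prior parameters alpha, beta matching the Gamma(1, 2) prior, and sufficient statistics summed over subjects and intervals):

```python
import numpy as np

rng = np.random.default_rng(2)

def gibbs_update_Q(N_counts, R_times, alpha=1.0, beta=2.0):
    """Conjugate Gibbs update for the off-diagonal rates: given the
    complete-data sufficient statistics N (transition counts) and R
    (holding times), draw
        q_{i,j} ~ Gamma(alpha + N[i, j], rate = beta + R[i]),
    then set the diagonal so each row of Q sums to zero."""
    K = N_counts.shape[0]
    Q = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            if i != j:
                Q[i, j] = rng.gamma(alpha + N_counts[i, j],
                                    1.0 / (beta + R_times[i]))
        Q[i, i] = -Q[i].sum()
    return Q
```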
C Birth-Death MCMC for CTHMMs
In this section, an alternative approach to inferring the number of hidden states, via a birth-death process, is introduced. Instead of constructing a reversible jump sampler, Stephens (2000) described a birth-death method which views each component of the mixture as a point in the parameter space; the birth or death of a state occurs as a marked point process. One iteration of the MCMC algorithm incorporating the birth-death process is as follows:
1. Given the state of the parameter $\Theta^{(t)}$ at time t, sample $\Theta^{(t+1)}$ by running the birth-death process for a fixed time $t_0$, and set $K^{(t+1)}$ to the resulting number of states.
2. With the number of states K fixed, update the following parameters given $\Theta^{(t)}$:
• Update the latent state indicators $S_{n,t}$.
• Update the parameters associated with the observation process, B.

We use exactly the same data generating mechanism as in Section 5.3 of the main paper.
The prior distributions are also the same as in that example. We initiate the model with one hidden state. We update the number of hidden states using the birth-death process with a fixed time $t_0$ and birth rate $\lambda_b = 1$.
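A skeleton of one birth-death sweep run for virtual time $t_0$ can be sketched as follows. The uniform placeholder death rates and the Beta birth weight are assumptions of this sketch; the full algorithm of Stephens (2000) uses likelihood-based death rates:

```python
import numpy as np

rng = np.random.default_rng(3)

def birth_death_step(weights, t0=1.0, birth_rate=1.0, death_rate_fn=None):
    """Run a birth-death process on mixture weights for virtual time t0,
    with birth rate lambda_b = 1 as in the text.  `death_rate_fn` maps
    the weight vector to per-component death rates; a uniform placeholder
    stands in for the likelihood-based rates of the full algorithm."""
    if death_rate_fn is None:
        death_rate_fn = lambda w: np.ones(len(w))
    w = list(weights)
    t = 0.0
    while True:
        # No deaths allowed when only one component remains.
        rates = death_rate_fn(np.asarray(w)) if len(w) > 1 else np.zeros(len(w))
        total = birth_rate + rates.sum()
        t += rng.exponential(1.0 / total)    # time to the next event
        if t > t0:
            return np.asarray(w) / np.sum(w)  # renormalized weights
        if rng.uniform() < birth_rate / total:
            w.append(rng.beta(1.0, len(w)))   # birth: add a component
        else:
            j = rng.choice(len(w), p=rates / rates.sum())
            del w[j]                          # death: remove component j
```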
The trace plots of the number of states for all cases are shown in Figure 5, with the corresponding posterior distributions in Table 9. As the trace plots show, the number of states changes more frequently than under the reversible jump MCMC. With a smaller $t_0$, the posterior distribution is more concentrated on three and four states. However, in all cases the posterior mode is four rather than three (the true number of states from which the data were generated), although the posterior probabilities of the three- and four-state models are close.

E Simulation: Trace Plots
4 Model-Based Clustering for CTHMMs

So far, we have constructed fully Bayesian inference for a CTHMM via reversible jump MCMC, allowing the number of states to vary during the analysis. We now extend this methodology to cluster trajectories based on a CTHMM with an unknown number of states. Specifically, we employ model-based clustering procedures to cluster individuals based on the component model parameters that determine the mixture form. The basic formulation of the model envisages that the population is composed of distinct sub-populations, each with distinct stochastic properties. For a CTHMM, this corresponds to each group having a potentially different parameter $\Theta = (\pi, Q, B)$ and number of states K. Luo et al. (2021) develop model-based clustering for CTHMMs under finite and infinite mixture models with a fixed number of states. We incorporate this finite mixture model structure into the proposed reversible jump MCMC, allowing both the number of states and the number of clusters to be inferred during the analysis. There is a crucial distinction between the number of components M in the mixture model and the number of clusters M* in the data, defined as the number of components used to generate the observed data, or the number of "filled" mixture components. In the algorithm described below, we focus on specifying a prior on the number of components M, which implicitly places a prior on M* (Miller and Harrison, 2018); however, in our simulation and real examples, the proposed split move rarely generates an empty component. For a comprehensive investigation of M and M* in different trans-dimensional algorithms, see Frühwirth-Schnatter et al. (2020).

Let M be the number of components and $C_n$ be the cluster membership indicator for individual n. Each individual n = 1, 2, . . ., N is presumed to be a member of a component labelled 1, 2, . . ., M, where $w_m = P(C_n = m)$ is the prior probability that individual n is assigned to component m, subject to $\sum_{m=1}^{M} w_m = 1$. The following hierarchy leads to model-based clustering procedures for the CTHMM:
$$M \sim p_0(M), \text{ a mass function on } \{1, 2, 3, \ldots\},$$
$$(w_1, \ldots, w_M) \mid M \sim \text{Dirichlet}(\delta, \ldots, \delta).$$
2. Update the number of components by splitting a component or combining components with the same number of states. If a component with $K_m$ states is chosen in the split move, the move is to consider splitting the component into two, both with $K_m$ states; if two components with the same number of states $K_m$ are selected in the combine move, the move is to combine them into one component with $K_m$ states.
3. Given the parameters in each component, update the component membership for each individual according to the posterior probability (5).
The reverse split move randomly selects a component m to split into two components, say m and $m^*$, and checks the condition $\|B_{m^*} - B_m\|^2 < \|B_j - B_m\|^2$ for all $j \neq m$. If this condition is not met, the split move is rejected.
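The adjacency condition above can be checked directly; a sketch (function and argument names are ours), where each row of B holds one component's observation-model coefficients:

```python
import numpy as np

def split_is_valid(B, m, m_star):
    """Check the adjacency condition that keeps the split/combine pair
    reversible: component m_star must be the nearest component to m in
    the B-parameter space, i.e. ||B[m_star] - B[m]||^2 < ||B[j] - B[m]||^2
    for every other component j."""
    d_star = np.sum((B[m_star] - B[m]) ** 2)
    for j in range(len(B)):
        if j in (m, m_star):
            continue
        if np.sum((B[j] - B[m]) ** 2) <= d_star:
            return False   # some other component is at least as close
    return True
```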

Figure 1 :
Figure 1: Posterior distribution of the number of states for the Normal case with σ = 1, based on 100 replications for the same dataset.

Figure 2 :
Figure 2: Application: trace plots of the number of states over 20,000 iterations to identify the number of states. The left panel is the observation model with the types of healthcare utilization as covariates; the right panel is the model without covariates.

Figure 3 :
Figure 3: Application: Trace Plot for the Number of Clusters

Figure 4 :
Figure 4: Schematic of the presumed data generating mechanism for the CTHMM. $O_s$ represents the process underlying the observed data, with observation time points denoted $\tau_t$, $t = 1, \ldots, T$; $X_s$ represents the hidden Markov process. Reproduced from Luo et al. (2021).

Table 2 :
Example 5.3: Posterior distribution of the number of states (Intercept Only).

Table 3 :
Example 5.4: Posterior distribution of the number of clusters with varying states.

Table 4 :
Application: Posterior distribution of the number of states corresponding to

Table 6 :
Application: Expected number of drugs for the intercept-only model over the time spent in each state.

Table 8 :
Application: Expected number of drugs for the three-cluster Poisson model.
the t-th observation for subject n, with associated observation time $\tau_{n,t}$. The complete data likelihood derived from $\{O_n\}$ and $\{X_{n,\tau_n}\}$ can be factorized as $L(\Theta) \equiv L(O, X \mid \Theta) = L(X \mid \Theta) L(O \mid X, \Theta)$, where $O = \{O_{n,t}\}$ and $X = \{X_{n,\tau_{n,t}}\}$ for $n = 1, \ldots, N$ and $t = 1, \ldots, T_n$, with $\Delta_{n,t} = \tau_{n,t+1} - \tau_{n,t}$. The complete data log-likelihood written in terms of the latent state indicator random vectors {S k

Table 9 :
BDMCMC: Posterior distribution of the number of states (Intercept Only)