Model-based two-way clustering of second-level units in ordinal multilevel latent Markov models

In this paper, an ordinal multilevel latent Markov model based on separate random effects is proposed. In detail, two distinct second-level discrete effects are considered in the model, one affecting the initial probability vector and the other affecting the transition probability matrix of the first-level ordinal latent Markov process. To model these separate effects, we consider a bi-dimensional mixture specification that allows to avoid unverifiable assumptions on the random effect distribution and to derive a two-way clustering of second-level units. Starting from a general model where the two random effects are dependent, we also obtain the independence model as a special case. The proposal is applied to data on the physical health status of a sample of elderly residents grouped into nursing homes. A simulation study assessing the performance of the proposal is also included.

variable that probabilistically evolves over time according to a first-order Markov chain (Wiggins 1973; Bartolucci et al. 2013).
In the last decades, numerous extensions of the basic latent Markov model have been proposed to incorporate additional model features. These include multilevel modeling, which is in order when sample units are grouped and the number of groups is considerable. Examples of applications of multilevel latent Markov models (MLMMs) concern longitudinal datasets of individuals grouped in households (e.g., Koukounari et al. 2013), students in classes (e.g., Bartolucci et al. 2011) or workers in firms (e.g., Bartolucci and Lupparelli 2016). Hereafter, we will refer to such groups as to secondlevel units (SLUs). In these settings, the multilevel structure of the data is typically accounted for by introducing a second-level latent variable affecting the observed responses coming from the same group-either directly or through the first-level latent Markov chain-via a set of random effects. This approach is somewhat related to other proposals, which also make use of continuous or discrete random effects in latent Markov models, though not in a formal multilevel setting (Altman 2007;Maruotti and Rydén 2009;Maruotti and Rocci 2012;Marino and Alfò 2016;Marino et al. 2018).
In the multilevel approaches mentioned before, the second-level latent variable is discrete with a fixed number of states, resulting in a clustering of SLUs. In particular, in Bartolucci and Lupparelli (2016) such a clustering structure is assumed to be timevarying and to directly affect the distribution of the observed responses, together with some covariates and the first-level latent process. In the other multilevel approaches, SLU clustering is assumed to be time-invariant and is considered as a determinantpossibly with some covariates-of the first-level latent process, which in turn has a probabilistic influence on the observed categorical variables. In this case, the firstlevel process can be interpreted as a (possibly multidimensional) underlying trait the outcome variables are a measure of.
When SLU clustering is assumed to be time-invariant, each state of the secondlevel latent variable is associated to a multivariate support point occurring with a certain probability. In detail, every support point identifies a global effect on all the components of the first-level Markovian process, meaning that there is no separation, for instance, between the effect on the initial probabilities and that on the transition probabilities of the Markov chain. This model feature might represent an important limitation when SLU effects on initial and transition probabilities have a substantially different interpretation and a two-way classification of SLUs according to these different dimensions is therefore needed.
A motivating example is represented by longitudinal datasets collecting health indicators for elderly residents grouped in Nursing Homes (NHs). In this context, the first-level latent trait may be an ordinal one representing one domain of residents' overall health status and the main interest resides in detecting NH effects on the transition probabilities. Indeed, these are typically associated to the quality of the health care service they provide. Conversely, NH effects on the initial probabilities usually reflect admission policies, since NHs may have different tendencies to admit residents in more severe health conditions. With reference to this problem, some recently proposed solutions rely on modeling the transition probabilities as a function of fixed NH effects (Bartolucci et al. 2009) or introducing a bivariate Gaussian secondlevel latent variable . Both these approaches allow to obtain an NH ranking along the dimension of the SLU effects on transition probabilities only. However, the former is likely to result in poor effect estimation when the number of NHs is high or some NHs have just a few residents, while the latter relies on the Gaussianity assumption for the second-level latent effects, which is not testable and requires a remarkable computational effort to approximate integrals that do not admit a closed-form solution.
In this paper, we modify the setting in  introducing a discrete (rather than continuous) bivariate random effect. Such an approach has a twofold aim: (i) avoiding unverifiable parametric assumptions on the second-level latent variable; (ii) obtaining a direct two-way clustering of SLUs. In this respect, an ordinal twoway MLMM is introduced. This is based on the specification of a joint probability matrix assigning weights to the discrete support points of a bivariate second-level latent variable. When the two components of this latent variable are independent, the entries of such a joint probability matrix reduce to the product of the corresponding marginal probabilities, leading to an independence model. That is, the latter is nested into the proposed ordinal two-way MLMM, so that standard theory can be applied in a hypothesis testing perspective.
The paper is organized as follows. In Sect. 2, we set the notation and formulate the model, reporting also the details for its estimation procedure (Sect. 2.2). Moreover, we introduce the independence model (Sect. 2.3). In Sect. 3, we reconsider the motivating example described above and fit the proposed two-way MLMM to a longitudinal health care dataset referring to residents hosted in the NHs of Umbria, a region of central Italy. In Sect. 4, we report evidence from a small simulation study conducted to evaluate the model parameter estimators in a setting similar to the one considered in the empirical application, while in Sect. 5 some concluding remarks are given.

The model
hi J ) denote the vector containing the J categorical response variables for the i-th unit in the h-th SLU at time occasion t. Every item can have a different number of response categories labelled from 1 to c j , with j = 1, . . . , J . Each of the H SLUs has its own number of units n h , so that the overall sample size is n = H h=1 n h . The number of measurement occasions T hi ≤ T is unit-specific, with T denoting the maximum number of occasions. Response vectors can be collected across time, i.e. Y hi = (Y (1) hi , . . . , Y (T hi ) hi ), and across first-level units, i.e. Y h = (Y h1 , . . . , Y hn h ). The same notation applies to vectors of individual covariates, that we denote by X (t) hi . For each unit, the Markovian latent process ) is a collection of T hi categorical unobserved variables with k v latent states labelled from 1 to k v . SLU random effects affecting such a process are assumed to be time-invariant and determined by a bivariate latent variable denoted by Z h = (U h , W h ). Here, U h affects the initial probabilities, while W h affects the transition probabilities. Formally, U h and W h represent two separate latent variables defined on k u and k w support points; these are denoted by ψ u (u = 1, . . . , k u ) and ξ w (w = 1, . . . , k w ), respectively.
To formalize the two-way SLU clustering procedure, we denote by SLU-uw (u = 1, . . . , k u , w = 1, . . . , k w ) the group of SLUs sharing the u-th level of U h and the w-th level of W h . Every SLU belongs to the SLU-uw group with probability To emphasize the two-way nature of our clustering approach, the above joint probabilities can be arranged in the probability matrix whose rows and columns sum to respectively. These marginals are collected in the vectors τ u. = (τ 1. , . . . , τ k u . ) and τ .w = (τ .1 , . . . , τ .k w ).
As typical in settings where the primary interest lies in modeling the latent trait, we assume covariates and SLU latent variables to influence the individual Markovian processes V hi but not the measurement model, that is, the model for the response variables given the latent trait. Specifically, such a dependence structure leads to define the initial probabilities the first-order transition probabilities . . , T hi ), and the conditional response probabilities The resulting path diagram is depicted in Fig. 1 Path diagram of the two-way MLMM for a unit with T hi = 4 measurement occasions As mentioned in Sect. 1, we assume that the latent trait V (t) hi is ordinal, with states corresponding to increasing intensities of a certain attribute. In the presence of an ordinal trait, a number of parametrizations for the initial and transition probabilities can be adopted. These include, among others, the adjacent category, global and continuation logit parametrizations; see Bartolucci et al. (2013) for an overview. Here, for the initial probabilities a global logit parametrization based on the proportional odds model (McCullagh 1980) is used; that is, . The parameter β 0v in the equation above is an intercept varying with the logit equations, ψ u represents the effect due to SLU h belonging to latent group u, while β 1 is a vector of fixed regression coefficients related to individual covariates. The global logit parametrization requires the components of β 0 = (β 01 , . . . , β 0k v −1 ) be non-increasing to ensure the cumulative probabilities are non-decreasing. Further, to identify the model, we set ψ 1 = 0 together with the order constraints ψ 1 ≤ · · · ≤ ψ k u . With regard to the (k v × k v ) transition probability matrices, we specify a global logit model similar to the previous one; that is, Again, the components of γ 1 = (γ 11 , . . . , γ 1k v −1 ) must be non-increasing, whereas γ 0 = (γ 01 , . . . , γ 0k v ) is a vector of additive shifts introduced to diversify the transition probability distributions among the rows of the transition matrices. No mathematical constraints are posed on the elements of γ 0 , apart from the necessary identifiabil-ity constraint γ 01 = 0. The vector γ 2 is a time-invariant vector of fixed regression coefficients-constant across logit equations-for the effect of individual-level covariates, while ξ w represents the effect due to SLU h belonging to latent group w. These effects are also time-invariant and constant across logit equations, subject to the identifiability constraints ξ 1 = 0 and ξ 1 ≤ · · · ≤ ξ k w . These order constraints, together with those on the support points of U h , ensure that the SLU-uw clusters are univocally identified, tackling the well-known problem of label switching (Stephens 2000) at the second level.

Two-step maximum likelihood estimation
The model log-likelihood can be written as where θ is a vector collecting all the parameters of the model and is obtained by summing over all the possible configurations of v h and z h . The logarithm of each term in the summation above, that we denote by h (θ ), can be explicitly related to the model parameters. In detail, given the model assumptions we have In the expression above, a number of binary indicator variables are present. Specifically, a hi (v,v) indicates whether the same unit is in latent state v at time t and in latent statev at time t − 1. Moreover, c h (u) and d h (w) indicate whether the h-th SLU belongs to the u-th and w-th cluster formed with respect to the SLU effect on the initial and transition probabilities, respectively, with e h (u, w) denoting the joint membership of the same SLU to these two clusters (i.e., to the SLU-uw group). Finally, f (t) hi j (y) takes value 1 if at time t unit hi responds with category y at the j-th item.
Given the overall complexity of the proposed model, full maximum likelihood (ML) methods are likely to be unfeasible or highly unstable (Di Mari et al. 2016). Therefore, a two-step approach (Bartolucci et al. 2014) is undertaken. In the first step, conditional response probabilities φ j yv (i.e., the measurement model) are estimated by fitting a latent class (LC) model (Lazarsfeld and Henry 1968;Goodman 1974) where all the individual records are pooled together. In the second step, ML estimates of these conditional response probabilities,φ j yv , are plugged through (4) in the log-likelihood (3), which is then maximized with respect to the remaining model parameters, θ (i.e., the latent model). The two-step method was proved to be consistent (Bartolucci et al. 2014(Bartolucci et al. , 2015 and is less likely to suffer from the instability issues mentioned above (Montanari and Pandolfi 2018).
At both steps, the log-likelihood maximization process is performed via an Expectation-Maximization (EM) algorithm (Dempster et al. 1977). While EM estimation of LC models is well-established and standard software is available for its implementation (see Sect. 3.2), that of the latent model requires a more in-depth discussion due to the specifics of the two-way MLMM we have introduced. Formally, taking the conditional response probability estimates as fixed, at the second step the complete-data log-likelihood can be regarded as a function of θ only and written aŝ In the above expression,ˆ h (θ ) is as in the right-hand side of (4) withφ j yv in place of φ j yv , while the binary indicator variables a (t) (u, w) and f (t) hi j (y) are unchanged. Since they refer to the components of the latent processes, all these variables but f (t) hi j (y) are not observed. The first stage of the EM algorithm (E-step) consists in computing their conditional expectations given the observed data. These are denoted by adding the hat symbol to the corresponding indicator variables (e.g.,â hi (v)). In practice, the expressions forâ u, w) reduce to a list of posterior probabilities, which is reported in Appendix A. These are used in the following stage of the algorithm (M-step). At this stage, the value of θ is updated according to a constrained maximization of the expected second-step complete-data log-likelihood , which is performed in a way such that all the order constraints due to the global logit parametrization (see Sect. 2.1) as well as the probability constraint k u u=1 k w w=1 τ uw = 1 are met. The EM algorithm alternates the E-step and the M-step until convergence. Notice that the expression above directly involves the initial and transition probabilities. Thus, the present estimation framework based on the EM algorithm is valid regardless of the parametrization used for these probabilities. However, parametrizations alternative to the global logit might not require the aforementioned order constraints, that would be removed from the M-step.
As typical in complex latent variable models, the EM solution might correspond to a local maximum of the likelihood function rather than to the global one. A circumvention of this problem usually consists in performing multiple initializations of the algorithm, with the number of runs being a multiple of the number of latent groups at any unit level. In the two-step approach we undertake, this strategy has to be carried out separately for the measurement and the latent model; see Appendix B for further details in the context of our application in Sect. 3.
Once the ML estimate for the latent model parameter vector,θ , is obtained, its estimated variance-covariance matrix can be computed via the sandwich formula (White 1980;Royall 1986; see also Spagnoli et al. 2018). In the above equation, denoting by s h (θ ) the independent contribution of the h-th SLU to the score function

H(θ ) = ∂ s(θ )/∂θ represents the Hessian information matrix while S(θ ) is given by
In practice, H(θ ) is obtained by numerical derivation of the score s(θ ), whose expression is reported, together with that of its components s h (θ ), in Appendix C.
Although it is known to be robust to some degree of model misspecification, it is important to underline that the sandwich estimator in (5) ignores the sampling variability of theφ j yv estimates, which are taken as fixed in place of the true φ j yv values in its expression. In principle, this might lead to systematically underestimate the variances in the true variance-covariance matrix Cov(θ ); see Bakk and Kuha (2018) for a related discussion and an adjusted variance estimator in the context of LC models. However, this problem becomes negligible when an estimator with good finite-sample properties is used at the first step. This seems to be the case in our setting, where the first-step estimator of the φ j yv probabilities is based on a pooled LC model including all the available observations, which results in an increased sample size. An empirical account of the first-step estimator performance is reported in the simulation study, where the explored context is rather similar to the one considered in the empirical application (see Sect. 4 for details). Finally, it is worth to mention that within this estimation framework also the model selection procedure consists of two phases. The first phase involves the choice of k v , that is, the number of latent states characterizing the first-level unit latent process V hi . The second phase is about determining the value of the pair (k u , k w ), that is, the number of groups of second-level units formed across the two dimensions of the proposed two-way MLMM. With regard to the empirical application considered here, these two phases are addressed in Sects. 3.2 and 3.3.

The independence model
As mentioned in Sect. 1, a nice feature of the proposed MLMM is that it nests the independence model. That is, it nests the model assuming independence among the latent variables influencing the initial and the transition probabilities of the first-level latent Markov chain, which allows the two SLU clustering procedures not to be influenced from one another. In the path diagram of Fig. 1, this would result in deleting the arrow connecting U h and W h .
As stated above, when U h and W h are independent, every joint probability τ uw reduces to the product of the corresponding marginals The estimation of model parameters may proceed as detailed in Sect. 2.2, by simply replacing (6) in Eq.
(3) (via (4)) and solving with respect to τ u. and τ .w , under the constraints k u u=1 τ u. = k w w=1 τ .w = 1. This result can be fruitfully employed also in a hypothesis testing perspective. To this end, we start by defining the following logit transformation for the joint masses τ uw : The ζ uw parameters directly provide a measure of the dependence between U h and W h and are null when Eq. (6) holds (see, e.g., Spagnoli et al. 2018). In this sense, the independence model occurs when ζ = (ζ 11 , . . . , ζ k u −1,k w −1 ) = 0. Defining the system of hypotheses standard Wald-type, score, or likelihood ratio test (LRT) statistics may be employed.
As regards the latter, we may compute where (θ 0 ) and (θ 1 ) denote the likelihood values obtained at convergence of the EM algorithm under H 0 and H 1 , respectively. From standard theory, it is known that all these statistics are asymptotically equivalent and follow a χ 2 distribution with (k u − 1) × (k w − 1) degrees of freedom under the null hypothesis.

The LTCF dataset
The proposed two-way MLMM is used to analyze a longitudinal dataset collecting information on the health status of elderly residents hosted in the NHs of Umbria, a region of central Italy. Data are gathered within the Suite interRAI protocol (Carpenter and Hirdes 2013), an international system the regional government of Umbria has been adhering to for many years. In detail, NH residents are administered the Long Term Care Facilities (LTCF) questionnaire (Hirdes et al. 2008;Kim et al. 2015), which is specifically designed to investigate their health conditions within these facilities. For this application, we consider a set of n = 1548 residents, grouped in H = 43 NHs, whose first observation falls in the second semester of 2017. These residents are then followed up for the years 2018 and 2019. According to the LTCF protocol, each resident should be administered a questionnaire every six months and whenever a significant change in their health status is acknowledged by the NH staff. For this reason, we have collected all the available observations for each resident, removing those with a subsequent one taken within seven days. In these cases, the second observations are almost identical to those deleted, probably because they are just amendments rather than actual new measurements. The resulting dataset comprises 5582 observations. Notice that the rationale leading to the construction of such a dataset is somewhat different from that in , where a panel dataset was obtained by taking the observations that were closest to predetermined six-month spaced dates.
Although the LTCF questionnaire investigates many health domains (physical limitations, psychological conditions, auditory and view sphere etc.), we here focus on a specific section named Activities of Daily Living (ADL). This section includes J = 10 ordinal response variables measuring resident difficulties in taking everyday actions like walking, getting dressed or maintaining personal hygiene. All these items have c j = c = 6 categories, with labels increasing with the experienced difficulty level. A description of the items is contained in the upper part of Table 1, together with the associate frequency distributions. In this setting, the Markovian individual process V hi is meant to represent the overall level of physical impairment, justifying the assumption that the latent trait is ordinal (see Sect. 2.1) as well as the global logit parametrization in Eqs. (1) and (2). It is important to point out that, within the Suite interRAI protocol, questionnaires are filled in by the NH staff rather than by the residents personally. This fact should guarantee that compilation criteria are objective and time-invariant, thereby preventing from the well-known issue of response shift (Sprangers and Schwartz 1999;Visser et al. 2005). Response shift occurs with self-reported questionnaires and consists in respondents changing their internal standards and reconceptualizing quality of life domains between measurement occasions. It is likely to threaten the identification of true changes in the latent trait, resulting in the misspecification of the measurement model (Oort 2005). Since in our approach the conditional response probabilities are assumed to be time-invariant and their estimation is performed by pooling all the individual records together (see Sects. 2.1 and 2.2), ruling out the presence of response shift appears to be of the essence.
Another relevant aspect concerns resident dropout, that can be due to death or to other reasons not explicitly linked to health status transition. While the latter can essentially be treated like an ignorable missing data mechanism (Rubin 1976;Little 1995), the former clearly corresponds to a worsening in personal health conditions. In other words, death can be considered as an outcome of residents' health status trajectories rather than a source of missing data. For these reasons, we identify it with the k v -th latent state of the individual Markov process. Such a latent state has two specific features. In detail, it is an absorbing state and it has null probability at t = 1. This means that residents can not be dead at the first measurement occasion and can not revert to other latent states once they have reached it. Clearly, this requires the constraints π hi (k v |u) = 0, as well as π (t) hi (v|k v , w) = 0 and π (t) hi (k v |k v , w) = 1, for every value of u and w and for v = 1, . . . , k v − 1. These constraints are met simply by setting β 0k v −1 = −∞ and γ 0k v = +∞. Thus, these two parameters are not to be estimated anymore. To identify dropout in the data, the 430 residents dropping out due to death are assigned an extra observation dated on the day of death with all the ADL items set to the additional response category c + 1 = 7. In terms of conditional response probabilities, this univocal association corresponds in practice to φ j,c+1,v = 0 and φ j,c+1,k v = 1 (v = 1, . . . , k v − 1; j = 1, . . . , J ). Therefore, the final dataset contains N = 6012 observations, with about 3.88 observations per resident on average.
As for the covariates in the latent model, resident age (in years) and gender (a binary variable taking value 1 for males) form the individual covariate vector x (1) hi appearing in Eq. (1). These two covariates have been proved in different studies mentioned in Sect. 1 to affect both the initial probabilities and the probability of transition between latent states. Furthermore, in Eq. (2), the time distance from the previous collection occasion (x (t) hi3 , measured in months) is included as an additional covariate, together with age and gender, in the vector x (t) hi to account for the different amount of time occurring from the previous observation on the same unit. Given the data collection mechanism discussed above, such a variable is expected to play a role also in measuring the overall quality of NH care services. Indeed, since questionnaires are administered prior to the six-month schedule if a sensible change in the health conditions is observed, such a variable is expected to be related to the probability that a transition occurs, though with some important provisos discussed in Sect. 3.3. Within this scheme, the latent variable U h is meant to capture NH effects on the initial probabilities that are unexplained by age and gender, whereas W h models NH effects on the transition probabilities unexplained by age, gender and time distance. In other words, time distance and W h are two concomitant components of the overall NH performance. This is to some degree different from the panel-based approach in , where time distance was only included in the covariate set to adjust for unequally spaced measurements and not used in the NH performance evaluation process.
Since the date of death is recorded by the NH staff, all the covariates are available also for the dropout-related extra observations. The bottom part of Table 1 contains some summary statistics of these covariates, as well as of the number of residents and observations available for each NH. It is worth to highlight that some young residents are present in the dataset, although the interRAI protocol is in principle designed for elderly people. However, these cases are quite rare (there are only 9 residents aged less than 50), and are typically associated to rather serious impairments or mental handicap.

Measurement model
As stated in Sect. 2.2, the two-way MLMM is estimated via a two step procedure. In this section, we show results for the first estimation step, i.e., for the measurement model, obtained via a pooled LC analysis without covariates performed on the ADL items presented in Sect. 3.1. ML estimation of the pooled LC models was performed via the poLCA R package (Linzer and Lewis 2011).
To address model selection, we have explored a set of models by letting k v vary from 2 to 10. The model with k v = 1 was not included since in the LTCF data we always have to account for the death-related extra latent state. That is, we always have at least two latent states; see Sect. 3.1. In Fig. 2, the Bayesian Information Criteria (BIC, Schwarz 1978) of the estimated models are plotted. In theory, models with a lower BIC should be preferred. Nevertheless, in the context of latent variable models this index is known to have a tendency to inflate the number of latent states (Pohle et al. 2017;Bacci et al. 2014). Specifically, when the BIC reaches a minimum within a reasonably large set of candidate models, then the model with the minimum value is typically a reliable choice. However, if it keeps decreasing with the consecutive reductions being always smaller, then it is a good practice to consider also alternative factors in the final model choice. This appears to be the case for the ADL-LC models considered here, where we observe a sensible reduction of the BIC until k v = 6, with some kind of stabilization for k v ≥ 7 (Fig. 2).
In principle, the poLCA package is designed for nominal (not ordinal) responses and latent variables. However, starting from the estimated conditional response prob-abilities, ordinality can be assessed a posteriori by means of the scoreŝ For each pair ( j, v),ŝ jv is a normalized score laying in the 0-1 range and representing the average impairment level of residents in latent state v with respect to the j-th ADL.
In line with the model formulation highlighted in Sect. 2.1, lower labels are assigned to states with lower values ofŝ jv . Clearly, in the ordinal setting considered here it would be desirable that the same latent state ordering is maintained for every item. As a matter of fact, this ordering univocality among items is another relevant factor the choice of k v might be based upon.
To investigate this issue in the LTCF data, theŝ jv scores for all the fitted models are reported in Fig. 3. Specifically, in each plot of the 3 × 3 panel latent states are ordered according to theŝ jv scores of the first item, with a line for each latent state (v = 1, . . . , k v ) joining theŝ jv values across the remaining items. Looking at the plots, it is possible to notice that for k v ≥ 7 (as well as for k v = 3) the lines tend to overlap each other, meaning that the latent state ordering is not univocal. Another noteworthy feature is that when k v ≥ 4, the LC model correctly identifies the last extra latent state associated to death, whereŝ jk v = 1 by virtue of φ j,c+1,k v = 1 ( j = 1, . . . , J ).
In light of such considerations, a natural choice for the final measurement model is k v = 6. The resulting conditional response probability structure (not shown) is aligned with those obtained in similar settings . With regard to theŝ jv scores, we evince that the last two items (bed mobility and eating) present the smallest values for every latent state. This finding is sensible since they represent abilities that are usually lost latest by residents. As a sensitivity check, the whole LC analysis was repeated including φ j,c+1,k v = 1 ( j = 1, . . . , J ) as a formal probability constraint. This procedure involves fitting models for the remaining k v − 1 latent states on the reduced dataset obtained removing the 430 extra observations. The conclusions we may draw are substantially equivalent.

Latent model
Taking the estimated conditional response probabilities as fixed parameters, a number of two-way MLMMs are fitted. In detail, we let k u and k w vary between 1 and 3, so that 9 possible models are inspected overall. Before showing the estimation results, a more detailed discussion about the nature of the effect of x (t) hi3 (distance from the previous observation) in Eq. (2) is necessary. Specifically, it is important to observe that almost 80% of the values of x (t) hi3 lie between 5 and 7 months, that is, in the neighbourhood of the canonical measurement distance of 6 months. Because of this inflation, the effect of x (t) hi3 is likely to be blurred in this neighbourhood, and more distinct for x (t) hi3 < 5. This intuition is confirmed by a sensitivity analysis we performed on the estimated measurement model. Specifically, we assigned each record a latent state based on a maximum-a-posteriori rule. In this way, we were able to build a binary variable   (2). An extensive investigation about the relationship between x (t) hi3 and this variable, including the estimation of several logistic regression models and other empirical analyses, led to opt for the linear spline encoded by the covariatẽ In practice, the effect of x (t) hi3 on the worsening probability (that is, the probability of moving towards more impaired latent states) is assumed to be constant before 3 months, then it is assumed to be linear (on the logit scale) between 3 and 5 months, and then constant again after 5 months. This functional form allows to account for the data inflation around x vector for every fitted model. Also, age andx (t) hi3 were centered at the values of 80 years and 6 months, respectively, in order to ease the overall interpretation of β 0 , γ 0 and γ 1 . Table 2 reports the log-likelihood, the number of parameters in the latent model, and the BIC index for all the fitted models. With regard to the log-likelihood, for each value of k u we notice that its increase is sensibly greater when moving from k w = 1 to k w = 2 than when moving from k w = 2 to k w = 3. The same argument applies swapping k u and k w . This intuition is confirmed by the BIC index, that reaches its minimum for the (2, 2) model. In other words, the log-likelihood gain due to the introduction of a third SLU cluster for the effect on either the initial and the transition probabilities is not worth the increase in the overall model complexity. As a consequence, we consider the (2, 2) model as the final one. For such a model, diagnostic checks based  Table 3 reports the estimates, together with their standard errors, for all the parameters of the latent (2, 2) model but the probability matrix T . From this table, we may conclude that, for both the initial and transition probabilities, the estimated effect of age and gender is in line with that observed in similar analyses of the LTCF data referring to other years (Bartolucci et al. 2009;. In detail, given the global logit parametrization in Eqs. (1) and (2), the positive effect of age shows that older residents are more likely to find themselves in a worse physical health status at admission (β 11 = 0.012), as well as to move towards more serious latent states at the following occasions (γ 21 = 0.016). However, theβ 11 estimate is barely significant ( p-value 0.098). As for the effect of gender, we can conclude that males typically have lower physical impairment at the first occasion (β 12 = −0.493, p-value 5.68 × 10 −6 ), but are also more likely to worsen their condition (γ 22 = 0.107, p-value 0.030) with respect to females. Regarding the effect ofx (t) hi3 , its estimated coefficient γ 23 = −1.464 is highly significant ( p-value 1.29 × 10 −8 ) and shows that shorter  Fig. 4 QQplots of ordinary normal pseudo-residuals for the (2, 2) model distances between observations-within the interval [3, 5] months-are associated to higher probabilities of moving towards higher (i.e., more impaired) latent states. This effect is in line with subject-matter related expectations, since sudden changes in residents' health status are more likely to correspond to a worsening in their physical conditions. Looking at the second-level latent variables, it is possible to notice thatψ 2 = 1.0356 andξ 2 = 0.9174 are both significantly different from zero ( p-values 4.16 × 10 −7 and 6.25 × 10 −4 respectively), thereby corroborating the adoption of the (2, 2) model. In other words, there is evidence of the presence of 4 SLU clusters corresponding to the 4 combinations of the joint effect Z h = (U h , W h ) on the initial and transition probabilities of the first-level-unit Markov process. Recalling the notation introduced in Sect. 2.1, we label these clusters by SLU-uw (u, w = 1, 2).
Combining the estimates above withψ 2 andξ 2 , it is possible to compute the estimated correlation between U h and W h , which is equal to -0.420. Such an estimate suggests a deviation from the independence model, which is confirmed by a formal LRT. Specifically, the log-likelihood of the (2, 2) independence model is -52596.1, so that the LRT statistic is equal to 7.072 ( p-value 0.008 under the null χ 2 (1) distribution). Sinceτ 11 ≈ 0, we can conclude that there are no NHs belonging to the SLU-11 cluster. Conversely, the model suggests that around 8% of the NHs belong to cluster SLU-21. The remaining 92% of NHs are placed in the other two clusters. Of these, around 73% are estimated to belong to cluster SLU-12.
In order to understand which group each NH is more likely to belong to, one can rely on the posterior SLU probabilitiesê h (u, w) estimated at convergence of the EM algorithm. All these probabilities butê h (1, 1), which are always lower than 10 −3 , are reported in the right-hand side of Table 3. It is possible to pinpoint three NHs belonging to the SLU-21 group with a very low degree of uncertainty (h ∈ {12, 18, 26}). The other NHs are estimated to split between clusters SLU-12 and SLU-22 with a varying degree of uncertainty.
To characterize the identified clusters, the estimated initial probability distribution of the latent states for an 80-year-old female resident hosted in an NH belonging to the SLU-1w clusters (w = 1, 2) iŝ π hi (u = 1) = 0.175 0.185 0.178 0.189 0.273 0.000 , whereas the vector for a resident with the same individual covariate pattern hosted in an NH of the SLU-2w clusters (w = 1, 2) iŝ In line with the parameter constraints ψ 2 ≥ ψ 1 = 0, this shows that the NHs in the latter clusters tend to admit residents in worse conditions with respect to physical impairment.
For the same reference resident, the 6-month ahead (x (t) hi3 = 0) transition matrix associated to the SLU-u1 clusters (u = 1, 2) iŝ It is worth to recall that the two matrices above differ only for the residual NH effect W h . Again, conditional on the previous latent state and the covariates, in the first matrix a lower chance of worsening the physical conditions or of dying is observed, corresponding to a better performance of the NHs belonging to the SLU-u1 clusters. However, to properly evaluate the overall performance of an NH in avoiding the worsening of physical impairment, one should also take into account the effect of thex (t) hi3 covariate: as previously mentioned, worsening is more likely as its value decreases.
Although primarily intended as a SLU clustering tool, in the context of this application the proposed two-way MLMM may be used also for NH performance evaluation and ranking purposes. For example, an overall index can be built given by the sum of the averaged contribution, on the global logit scale of Eq. (2), of the observed values of x (t) hi3 in a NH and the corresponding posterior expected value of W h , controlling for age and gender. However, in principle such a procedure should be conducted separately for the groups formed with respect to the NH effect on the initial probabilities. In this way, the lack of independence between the latent variables U h and W h is properly accounted for. In a policy evaluation perspective, this approach would discourage the adoption of unfair practices like adverse selection of residents at baseline Montanari and Doretti 2019). We argue that these group-dependent measures would be more naturally obtained from the present discrete-effect model rather than from continuous-effect models. However, this kind of development is beyond the scope of the present paper.

Simulation study
In this section, we present some evidence from a small simulation study conducted within the two-way MLMM framework. For the reasons outlined in Sect. 2.2, the main purpose of this simulation is checking the performance of the proposed two-step procedure in a setting close to the LTCF application rather than comparing different estimation methods. To this end, B = 500 datasets are generated from a two-way MLMM with k v = 6, k u = 2, k w = 2 and the same probability constraints as in Sect. 3.2 for the sixth state of the individual level latent process.
Each dataset includes H = 60 SLUs with n h = 50 first-level units. At baseline, a normal variate with mean equal to 83 and variance equal to 49 and a Bernoulli variate with probability 0.3 are generated for every first-level unit to mimic the distributions of age (in years) and gender (1=male) in the LTCF data. However, to simplify the whole setting, each unit is assigned T = 3 equally spaced measurement occasions. Thus, the x (t) hi3 covariate is removed and a constant number of observations across datasets (N = 9000) is generated. Since we assume the canonical distance of 6 months between measurements, at t = 3 age is deterministically increased of 1 unit with respect to t = 1, whereas at t = 2 such an increase is random. This mechanism reflects the fact that residents are born in different times of the year.
The data generating process follows the scheme outlined in Sect. 2.1 according to a sequential procedure. First, SLUs are assigned to SLU-uw clusters based on the SLU probabilities τ uw . Conditional on this assignment, the values of the covariates and the remaining latent model parameters, initial and transition probabilities are then determined and used to simulate the first-level unit Markov chains v hi (h = 1, . . . , H ; i = 1, . . . , n h ). The numerical values for the latent model parameters are reported in the second column of Table 4. Notice that two scenarios are considered which differ for the values of the T matrix only, with the first scenario corresponding to an independence model. Finally, given the realized v hi processes, J = 10 ordinal items with c + 1 = 7 categories are generated according to a conditional response probability array similar to the one of the LTCF data (not shown). In detail, items are set to the seventh category with probability 1 if a unit is in the sixth latent state and with probability 0 otherwise, in line with the conditional response probability constraints introduced in Sect. 3.1.
The two-step estimator of the two-way MLMM fitted on the simulated datasets is as described in Sect. 2.2. For each dataset, at the first step 10 random starts are run, whereas at the second step the true latent model parameter vector is taken as initial guess. This combined strategy allows to avoid local maxima issues as well as to speed up the overall computational time. As expected, the performance of the firststep LC model estimator is highly remarkable. Indeed, for the conditional response probabilities the maximum absolute bias is 0.0016 for the first scenario and 0.0022 for the second scenario, while the maximum simulation standard deviation is 0.0195 for the first scenario and 0.0201 for the second scenario. Importantly, discrepancies about latent state ordering across different items are absent.
While the LC measurement model is always fitted with 6 latent states like in the true data generating process, at the second step all the latent models with (k u , k w ) ∈  {1, 2, 3} 2 are estimated. In this way, we are able to control the performance of the BIC index as a model selection tool for the second step. In detail, for the first scenario only for 3 of the 500 datasets (3, 2) model is selected in place of the true one. For the second scenario, the (2, 3) model is selected 6 times, the (3, 2) model is selected 2 times, while in the other cases the correct (2, 2) model is chosen. These results denote a good performance of the BIC index in this context. With regard to the performance of the proposed estimation procedure at the latent layer, results are summarized in Table 4. For each parameter, the true value is accompanied by the main summary statistics of the simulation distribution, including the standard deviation (SD) and the relative root mean squared error (RRMSE). From this table, it is possible to observe that parameter estimators exhibit very small bias and a modest degree of variability, with some deviations occurring for the first element of β 0 . Also, results are very stable across the two scenarios, with little differences only concerning the minimum and maximum values.
The last column of Table 4 contains the average estimated standard error (AESE) for each parameter. As mentioned in Sect. 2.2, in principle these values might be prone to underestimate the true standard deviations of the parameter estimators, due to the estimator in (5) ignoring the sampling variability concerning the conditional response probabilities. However, the results in Table 4 show that AESEs are quite close to Monte Carlo standard deviations (that is, the SD column in the table), apart from some degree of overestimation observed for the SLU-group probabilities. This is not surprising given the very good performance of the measurement model estimator and the adoption of a robust estimation method like the sandwich estimator. Similar patterns are observed for the models fitted with (k u , k w ) = (2, 2). However, when k u or k w are greater than 2, the variability of the estimators of the ψ 3 and ξ 3 parameters is sometimes underestimated. It seems reasonable to ascribe this fact to a model misspecification (we recall that data are generated under the (2,2) model) rather than to first-step variability.

Conclusions
In this paper, a two-way multilevel latent Markov model (MLMM) is introduced, which is in order when sample units are grouped into second level units (SLUs). In these settings, the multilevel structure of the data is typically accounted for by introducing second-level latent variables affecting the observed responses via a set of random effects. These random effects are assumed to be discrete, resulting in a clustering of SLUs. The main methodological innovation of the proposed model consists in the fact that the vector of discrete second-level random effects does not have a univocal influence on the whole first-level Markovian process. Indeed, distinct effects for the initial and transition probabilities are specified, so that SLUs can be clustered along the two dimensions separately, using the Cartesian product of the support points for the random effects on the initial probabilities and those for the transition probabilities. This feature is particularly appealing for those applications where SLU random effects on these two components have a substantially different interpretation. An independence test between the effects on the initial probabilities and the effects on the transition probabilities is also proposed. This test allows to detect when the SLU clustering process along a dimension is independent of that along the other.
An application of the proposal is illustrated with reference to health care data collected on residents hosted in Nursing Homes (NHs), which are the SLUs at issue. NH effects on the initial probabilities of the latent states of first-level units can be ascribed to different admission policies (i.e., NHs tend to admit residents in different health conditions at baseline), whereas those on the transition probabilities are related to the quality of the health care service provided (i.e., NHs with their actions generate different impacts on their residents' probabilities of moving to a different health status). For such data, modeling the NH effects beside those of the available covariates on the transition probabilities may allow to build indicators of NH performances in taking care of their residents.
Although it appears to be absent in the considered application (see Sect. 3.1), response shift might characterize other settings involving latent Markov models. Therefore, solutions to this issue in the latent Markov framework are desirable. In particular, one could adopt a time-varying parametrization for the conditional response probabilities, and perform its estimation accordingly. This approach seems to be the categorical counterpart to the one developed for structural equation models (Oort 2005), which refers to linear models and continuous latent traits and response variables.
The proposed model is tailored to an ordinal first-level latent trait. As a consequence, the marginal SLU clusters defined along the two random effect dimensions also have a conceptual ordering and, thus, a clear interpretation. While the extension to nonordinal settings is possible, interpretation issues might arise. In this sense, a partially ordered set approach could be helpful. Such an approach has already been introduced within the hidden Markov model framework (Ip et al. 2013) and would be particularly sensible for those datasets where indicators measuring several latent domains are collected. Its extension to a multilevel setting would be a promising topic for future research.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
The probabilities involved in the expressions above can be obtained from the parametrized ones (i.e., the initial, transition and SLU probabilities), by using Bayes' theorem and/or the forward and backward Baum-Welch recursions (Baum et al. 1970;Welch 2003), which are commonly adopted in the hidden Markov model literature.

Appendix B: multiple initialization of EM algorithms
With regard to the measurement model, 20 × k v replications are run for every model (k v ∈ 2, . . . , 10) to overcome local maxima issues. As for the latent model, a combined strategy is undertaken which consists of the following steps: (i) fit the three models with k w = 1 adopting multiple random starting points, (ii) use the corresponding solutions as a basis for the starting vectors of the other models, with the remaining parameters varying according to a deterministic grid. Such a grid allows to explore all the plausible values in the parametric subspaces.
With regard to step (i), we perform 50 × k u random starts with initial values given by (recall that k v = 6 with β 05 = −∞ and γ 06 = +∞). In the above, 1 , 2 and 3 are drawn from a standard normal distribution. Moreover, the elements of regression coefficient vectors β 1 and γ 2 are sampled from a N (0, 0.01) distribution, whereas (ψ 2 , . . . , ψ k u ) are drawn from non-overlapping uniform distributions whose centers are spaced by 0.7. Finally, the probabilities T = (τ 11 , . . . , τ k u 1 ) are sampled from a Dirichlet distribution with parameter vector given by 3 · 1 k u .