A two-step estimator for multilevel latent class analysis with covariates

We propose a two-step estimator for multilevel latent class analysis (LCA) with covariates. In the first step, the measurement model for the observed items is estimated; in the second step, covariates are added to the model while the measurement model parameters are kept fixed. We discuss model identification and derive an Expectation-Maximization algorithm for efficient implementation of the estimator. By means of an extensive simulation study we show that (1) this approach performs similarly to existing stepwise estimators for multilevel LCA but with much reduced computing time, and (2) it yields approximately unbiased parameter estimates with a negligible loss of efficiency compared to the one-step estimator. The proposal is illustrated with a cross-national analysis of predictors of citizenship norms.


Introduction
Latent class analysis (LCA) is used to create a clustering of units based on a set of observed variables, expressed in terms of an underlying unobserved classification. When it is applied to hierarchical (multilevel) data where lower-level units are nested in higher-level ones, the basic latent class model can be extended to account for this data structure. This extension can be seen as a random coefficients multinomial logistic model (see, for instance, Agresti et al., 2000) for an unobserved categorical variable that is measured by several observed indicators, with a higher-level latent class variable in the role of a categorical random effect (Vermunt, 2003). Multilevel LCA has become more popular in the social sciences in recent years, for example in educational sciences (Fagginger Auer et al., 2016; Grilli et al., 2016, 2022; Grilli & Rampichini, 2011; Mutz & Daniel, 2013), economics (Paccagnella & Varriale, 2013), epidemiology (Tomczyk et al., 2015; Rindskopf, 2006; Zhang et al., 2012; Horn et al., 2008), sociology (Da Costa & Dias, 2015; Morselli & Glaeser, 2018), and political science (Ruelens & Nicaise, 2020). In most of these examples, the multilevel LCA model also includes covariates that are used as predictors of the clustering, and substantive research questions often focus on the coefficients of the covariates.
For single-level LCA with covariates, the current mainstream recommendation is to use stepwise methods that separate the estimation of the measurement model for the observed indicators from the estimation of the structural model for the latent variables given the covariates (see, e.g., Bakk & Kuha, 2018; Di Mari et al., 2020; Di Mari & Maruotti, 2022; Vermunt, 2010). This is practically convenient because when the set of covariates is changed, only the structural model rather than the full model needs to be re-estimated. Different structural models can even be considered by different researchers at different times. Stepwise estimation can also avoid biases which can arise when all the parameters are instead estimated together in a simultaneous (one-step) approach. In that case, misspecifications in one part of the model can also bias the parameter estimates in other parts (Bakk & Kuha, 2018).
In multilevel LCA, the one-step approach is particularly cumbersome because of increased estimation time, especially with multiple covariates possibly defined at different levels. In that context, there is still a need for further research on bias-adjusted, efficient stepwise estimators. Recently, Bakk et al. (2022) and Di Mari et al. (2022) proposed a "two-stage" estimator for this purpose. The parameters of the measurement model are estimated in the first stage, without including the covariates. This stage is further broken down into three steps. In the first of them, initial estimates of the measurement model are obtained from a single-level LC model, ignoring the multilevel structure. The latent class probabilities of the multilevel LC model are then estimated, keeping the measurement parameters from the first step fixed. Third, to stabilize the estimated measurement model and to account for possible interaction effects, the multilevel model is estimated again, now keeping the latent class parameters fixed. The estimated measurement parameters from this last step of the first stage are then held fixed in the second stage, where the model for the latent classes given covariates is estimated.
This method has been shown to greatly simplify model construction and interpretation compared to the one-step estimator, with almost identical results if model assumptions are not violated, and with enhanced algorithmic stability and improved speed of convergence. In addition, the two-stage estimator exhibits an increased degree of robustness compared to the simultaneous approach in the presence of measurement noninvariance (Bakk et al., 2022).
A difficulty in this two-stage technique is deriving an asymptotic covariance matrix that takes into account the multi-step procedure. Conditioning on the first-stage estimates as if they were known, even though they are estimates with a sampling distribution, introduces a downward bias in the standard errors, a phenomenon that is well known also in the context of stepwise structural equation models (Skrondal & Kuha, 2012; Oberski & Satorra, 2013). For two-step single-level LCA, the standard errors can be corrected in a straightforward way (Bakk & Kuha, 2018), but this is more difficult for two-stage LCA due to the conditioning on multiple steps.
The two-stage approach is still in some ways more involved than it needs to be. In this paper we show that it is possible to simplify it into a more straightforward two-step estimator, still retaining its good performance but with a further reduced computation time. This approach is closely motivated by two-step estimation as it is used for single-level LCA. In the first step, the full multilevel measurement model is estimated in one go, but without covariates. In the second step, covariates are included in the model, keeping the measurement model parameters fixed at their estimates from the first step.
With this two-step estimator, we contribute to the existing literature in several ways: (1) we establish model identification for the multilevel LC model under standard assumptions, as a foundation for correct measurement model estimation; (2) we derive a step-by-step EM algorithm with closed-form formulas to handle the computation of the two-step estimator; and (3) we derive the correct asymptotic variance-covariance matrix of the second-step estimator of the structural model, drawing on the theory of pseudo maximum likelihood estimation (Gong & Samaniego, 1981).
We evaluate the finite sample properties of our proposal by means of an extensive simulation study. Cross-national data on citizenship norms from the International Association for the Evaluation of Educational Achievement survey are analyzed to illustrate the proposal, and possible extensions are discussed in the conclusions.

The multilevel latent class model with covariates
Let Y ij = (Y ij1 , . . ., Y ijH ) be a vector of observed responses, where Y ijh denotes the response of individual i = 1, . . ., n j in group j = 1, . . ., J on the h-th categorical indicator variable ("item"), with h = 1, . . ., H. The data have a hierarchical (multilevel) structure where the individuals are nested within the groups. In the following, we will also refer to individuals as the "low-level units", and groups as the "high-level units". Let Y j = (Y 1j , . . ., Y n j j ) denote the set of responses for all the low-level units belonging to high-level unit j, with Y j for different j taken to be independent of each other. For simplicity of exposition, we focus below on the case where the items Y ijh are dichotomous, but the idea and methods of two-step estimation proposed here apply in a straightforward way also to polytomous items.
Let W j be a categorical latent variable (i.e. a latent class (LC) variable) defined at the high level, with possible values m = 1, . . ., M and probabilities P (W j = m) = ω m > 0, and let ω = (ω 1 , . . ., ω M ) . Given a realization of W j , let X ij be a categorical latent variable defined at the low level, with possible values t = 1, . . ., T , and conditional probabilities P (X ij = t|W j = m) = π t|m > 0. We collect all the π t|m in the M × T matrix Π. The X ij for the same j are taken to be conditionally independent given W j , so that

P(X 1j = t_1 , . . ., X n j j = t_{n_j} | W j = m) = ∏_{i=1}^{n_j} π_{t_i |m} .

This shows that the high-level latent class W j serves as a categorical random effect which accounts for associations between the low-level latent classes X ij for different low-level units i within the same high-level unit j.
The items Y j are treated as observed indicators of the latent classes. A multilevel latent class model specifies the probability of observing a particular response configuration for a high-level unit j as

P(Y j ) = Σ_{m=1}^{M} ω_m ∏_{i=1}^{n_j} Σ_{t=1}^{T} π_{t|m} P(Y ij | X ij = t, W j = m)
        = Σ_{m=1}^{M} ω_m ∏_{i=1}^{n_j} Σ_{t=1}^{T} π_{t|m} ∏_{h=1}^{H} P(Y ijh | X ij = t, W j = m),     (1)

where P(Y ijh | X ij = t, W j = m) denotes the conditional probability mass function of the h-th item, given the latent class variables X ij and W j . The second line further assumes that the responses Y ijh for different items h are conditionally independent given (X ij , W j ), a standard assumption which we make throughout.
Model (1) is a general formulation which is equivalent to an unrestricted multi-group latent class model. Most applications, however, use a more restricted version which assumes that the item response probabilities do not depend directly on the high-level latent class W j (Vermunt, 2003; Lukociene et al., 2010; this model is represented in Figure 1, if we omit the covariates Z ij which will be introduced below). We will also make this assumption throughout this paper. Model (1) is also similar to the multilevel item response model of Gnaldi et al. (2016), but with categorical latent variables at both levels. The response probabilities are then given by

P(Y j ) = Σ_{m=1}^{M} ω_m ∏_{i=1}^{n_j} Σ_{t=1}^{T} π_{t|m} ∏_{h=1}^{H} P(Y ijh | X ij = t).     (2)

Therefore, within each high-level latent class W j , the model for the items has the form of a standard (single-level) LC model with X ij as the latent class (McCutcheon, 1987; Goodman, 1974; Hagenaars, 1990). When the items Y ijh are binary with values 0 and 1, we denote φ h|t = P(Y ijh = 1 | X ij = t), and denote by Φ the H × T matrix of all the φ h|t .
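The data-generating process just described can be sketched in a few lines of code. The following Python snippet (with illustrative parameter values of our own, not taken from the paper) draws W j, then X ij given W j, and finally the binary items Y ijh given X ij, using the restricted model in which items depend only on the low-level class:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Illustrative values (assumptions for this sketch): M = 2 high-level
# classes, T = 3 low-level classes, H = 4 binary items.
omega = np.array([0.6, 0.4])                     # P(W_j = m)
Pi = np.array([[0.7, 0.2, 0.1],                  # P(X_ij = t | W_j = m), rows sum to 1
               [0.1, 0.3, 0.6]])
Phi = np.array([[0.9, 0.5, 0.1],                 # phi_{h|t} = P(Y_ijh = 1 | X_ij = t)
                [0.8, 0.5, 0.2],                 # one row per item h, one column per class t
                [0.9, 0.4, 0.1],
                [0.8, 0.6, 0.2]])                # H x T

def simulate(J, n_j):
    """Draw W_j, then X_ij | W_j, then Y_ijh | X_ij (balanced groups)."""
    W = rng.choice(len(omega), size=J, p=omega)
    X = np.stack([rng.choice(Pi.shape[1], size=n_j, p=Pi[m]) for m in W])  # J x n_j
    # Conditional independence of the items given X_ij: one Bernoulli draw per item
    Y = rng.binomial(1, Phi.T[X])                # J x n_j x H
    return W, X, Y

W, X, Y = simulate(J=50, n_j=20)
```

The array `Phi.T[X]` looks up the column of Φ for each unit's class, so the Bernoulli probabilities vary only through X ij, exactly as in (2).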
It can be shown that the model is identified (in a generic sense, see Allman et al. 2009) under a standard set of assumptions:

Proposition 2.1 (Identification). Suppose that the following conditions hold: (A.1) φ h|t ≠ φ h|s for all h = 1, . . ., H and for t ≠ s; and (A.2) the M × T matrix Π has rank M . Then the multilevel LC model (2) is identified when M ≤ T and n j ≥ 3, for all j = 1, . . ., J.
The proof of Proposition 2.1 follows the same lines as in Gassiat et al. (2016), who proved identification of finite state space nonparametric hidden Markov models, and applies the results of Theorem 9 of Allman et al. (2009). The fact that all φ h|t are distinct is sufficient for linear independence of the Bernoulli random variables. For n j = 3, using the assumption of conditional independence of low-level units given high-level class W j , the distribution of (Y 1j , Y 2j , Y 3j ) factorizes as the product of three terms.

Figure 1. Graphical representation of a multilevel latent class model which includes a low-level latent class variable X ij nested in a high-level latent class variable W j , and covariates Z ij for X ij . Here the response probabilities for items Y ijh depend directly only on X ij .
We make three ancillary comments on Proposition 2.1. First, for the unrestricted multilevel LC model (1), if an assumption analogous to (A.1) holds (i.e. if all success probabilities of the Bernoulli random variables are distinct), we can relax (A.2) and prove identification using Theorem 9 of Allman et al. (2009) (in the related context of mixtures of finite mixtures with Gaussian components, a similar argument is used by Di Zio et al., 2007). Second, for longitudinal and multilevel data, generic identification of the measurement model does not require any condition on the number of items, provided that conditions (A.1) and (A.2) are satisfied. Third, although we have discussed identification specifically for binary items and Bernoulli conditional distributions, the identification result extends also to polytomous items if we can assume, analogously to (A.1), that all conditional category-class response probabilities are distinct. This guarantees linear independence of the corresponding multinomial random variables.
Covariates can be included in the multilevel LC model to predict latent class membership in both the low- and high-level classes. Let Z ij = (1, Z 1j , Z 2ij ) be a vector of K covariates, which can include high-level (Z 1j ) and low-level (Z 2ij ) variables. For X ij we can consider the multinomial logistic model

P(X ij = t | W j = m, Z ij ) = exp(γ'_{tm} Z ij ) / [1 + Σ_{s=2}^{T} exp(γ'_{sm} Z ij )],     (3)

where γ tm is a K-vector of regression coefficients for each t = 2, . . ., T and m = 1, . . ., M . When only the intercept term is included, so that Z ij = 1, then γ tm = log(π t|m /π 1|m ) in the notation of the model without covariates above. We denote by Γ the (T − 1)M × K matrix of all the parameters in the γ tm vectors.
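The conditional class probabilities of this multinomial logistic model can be computed as a softmax over the linear predictors, with class t = 1 as the baseline. The sketch below (function name and shapes are our own) also verifies the stated intercept-only identity γ tm = log(π t|m /π 1|m ):

```python
import numpy as np

def class_probs(Gamma_m, Z):
    """P(X_ij = t | W_j = m, Z_ij) for one high-level class m.

    Gamma_m: (T-1) x K coefficients gamma_{tm}, t = 2..T (class 1 is baseline).
    Z:       n x K design matrix whose first column is a constant 1.
    """
    eta = np.hstack([np.zeros((Z.shape[0], 1)), Z @ Gamma_m.T])  # n x T, baseline logit 0
    eta -= eta.max(axis=1, keepdims=True)                        # numerical stability
    p = np.exp(eta)
    return p / p.sum(axis=1, keepdims=True)

# Intercept-only check: gamma_{tm} = log(pi_{t|m} / pi_{1|m}) recovers pi_{t|m}
pi_m = np.array([0.5, 0.3, 0.2])
Gamma_m = np.log(pi_m[1:] / pi_m[0]).reshape(-1, 1)   # (T-1) x 1 intercepts
Z = np.ones((1, 1))
p = class_probs(Gamma_m, Z)                           # close to [[0.5, 0.3, 0.2]]
```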
A model for W j can be specified similarly, now using only the high-level covariates Z* j = (1, Z 1j ) , as

P(W j = m | Z* j ) = exp(α'_m Z* j ) / [1 + Σ_{l=2}^{M} exp(α'_l Z* j )],     (4)

where α m , for m = 2, . . ., M , are regression coefficients. Although this too is straightforward, for ease of exposition and simplicity of notation we will below not consider models with covariates for W j , but present the two-step estimator only for the case where Z* j = 1 and thus α m = log(ω m /ω 1 ). The focus of interest is then on the model for the low-level (individual-level) latent class X ij , and the high-level (group-level) latent class W j serves primarily as a random effect which accounts for intra-group associations between the X ij . We further assume that the observed items Y j are conditionally independent of the covariates Z ij given the latent class variable X ij . This means that the measurement of X ij by Y ij is taken to be invariant with respect to the covariates. With these assumptions, and denoting Z j = (Z 1j , . . ., Z n j j ) , the model that we will consider is finally of the form

P(Y j | Z j ) = Σ_{m=1}^{M} ω_m ∏_{i=1}^{n_j} Σ_{t=1}^{T} P(X ij = t | W j = m, Z ij ) ∏_{h=1}^{H} P(Y ijh | X ij = t);     (5)

see also the graphical representation of the model in Figure 1. This model is identified when the corresponding model without covariates is identified, as long as the design matrix of all the Z ij s has full column rank (for an analogous condition for identifiability in the context of single-level latent class models with covariates, see G.-H. Huang & Bandeen-Roche 2004 and Ouyang & Xu 2022).

Previous methods of estimation
We denote the parameters of the model in (5) as θ = (θ 1 , θ 2 ) , where θ 1 = vec(Φ) are the parameters of the measurement model for the items Y j and θ 2 = (vec(Γ) , ω ) are the parameters of the structural model for the latent class variables (X ij , W j ) given the covariates Z ij . Maximum likelihood (ML) estimates of these parameters can be obtained by maximizing the log-likelihood ℓ(θ) = Σ_{j=1}^{J} log P(Y j |Z j ) with respect to all the parameters together. This is the simultaneous or one-step method of estimation for the model. It has serious disadvantages, however. The full model needs to be re-estimated whenever the covariates in the structural model are changed, which can be computationally demanding because of the complexity of such multilevel models. Further, because all the parameters are estimated together, misspecification in one part of the model may also destabilize parameters in other parts of the model (Vermunt, 2010; Asparouhov & Muthén, 2014).
Because of the complexity of the one-step approach, in practice the classical three-step method of estimation is more often used. In step 1, model (2) without covariates is estimated. In step 2, this model is used to assign respondents to the latent classes X ij and W j , conditional on their observed responses Y j ; how this is done for the multilevel LC model is described in detail in Vermunt (2003). In step 3, the assigned latent classes are modelled given covariates, treating the classes now as observed variables. This is straightforward to do. However, it yields biased estimates of the parameters of the structural model, because the assigned classes are potentially misclassified versions of the true latent classes.
Because of this bias in the classical three-step approach, bias-adjusted stepwise methods are needed. One such method for multilevel LC models with covariates is the two-stage estimator proposed by Di Mari et al. (2022); see also Bakk et al. (2022). It involves the following two stages:

A) First stage: Unconditional multilevel LC model building (measurement model construction).
Step 1: A single-level latent class model is fitted for Y ij given the low-level latent class X ij , ignoring the multilevel structure of the data. This gives an initial estimate of Φ.
Step 2.a: The multilevel model without covariates (equation 2) is estimated, keeping Φ fixed at its estimated value from Step 1. This gives estimates of ω and Π.
Step 2.b: The two-level model is estimated again, now keeping ω and Π fixed at their estimates from Step 2.a. This gives the estimate of Φ which is taken forward to the second stage.
B) Second stage: Inclusion of covariates in the model (structural model construction).
Step 3: The multilevel model (5) with covariates is estimated, keeping the measurement parameters Φ fixed at their estimates from the first stage. This gives the two-stage estimates of the structural parameters θ 2 .
While effective, the two-stage approach has some shortcomings. Although Steps 2.a and 2.b each estimate only part of the measurement model parameters, computationally they do not save much effort, because the most challenging part of the estimation (the E-step of the EM algorithm; see below) is required by both steps. Fixing the response probabilities is also not enough to prevent label switching of the classes from one step to the next in the first stage, since this can occur simultaneously at both the low and high levels. Finally, estimating the correct form of the second-stage information matrix, which should take the variability of the previous steps into account, is difficult due to the sequential re-updating of the measurement model. These complications make it desirable to look for more straightforward bias-adjusted stepwise approaches for the multilevel LC model. Such a method, the two-step estimator, is described next.

Two-step estimator for the model with covariates
We propose to amend the two-stage estimator by concentrating all of the measurement modeling into a single step 1, in which we estimate the multilevel LC model without covariates. The estimated parameters of the measurement model for the items Y ij from this step are then taken forward as fixed to step 2, where the structural model for the latent classes given covariates is estimated.
Step 2 is thus the same as the second stage of two-stage estimation, but the three steps of its first stage are here collapsed into the single step 1.
The two-step estimation procedure for multilevel LC models that is described in this section has been implemented in the R package multilevLCA (Lyrvall et al., 2023), which can be downloaded from CRAN. The package's routines have been used for the simulations and data analysis in Sections 5 and 6 of the paper.

Step 1 - Measurement model
In the first step, a simple multilevel LC model without covariates is fitted to the data. Given the data defined above, the log-likelihood for this step is

ℓ 1 (Φ, Π, ω) = Σ_{j=1}^{J} log P(Y j ),     (6)

where P(Y j ) is given by (2). This is maximized to find the ML estimate of the parameters of this model. Direct (numerical) maximization is possible, either with suitable constraints or by adopting well-known logistic re-parametrizations, but it quickly becomes infeasible even for a moderate number of low- and/or high-level classes. A more practical alternative for maximizing (6) is the Expectation-Maximization (EM) algorithm (Dempster et al., 1977), which is what we propose here.
A standard implementation of EM would involve computing M × T^{n_j} joint posterior probabilities, which is infeasible already with a few low-level units per high-level unit. Instead, our implementation of the EM algorithm follows closely the upward-downward method of Vermunt (2003) for computing the joint posteriors of the low- and high-level classes (see also Vermunt, 2008), in which the number of joint posterior probabilities to be computed is only a linear function of the number of low-level units per high-level unit. Here we describe in detail the E and M steps of the algorithm, with the step-by-step implementation that we use to obtain the estimates in Step 1.
Using standard EM terminology, let us introduce the following augmenting variables: w j,m = I(W j = m) and v i,j,t,m = I(X ij = t, W j = m), where I(·) denotes the indicator function. Defining the complete-data sample as {(Y j , w j,m , v i,j,t,m )}, the complete-data log-likelihood (CDLL) for the first step can be specified as

ℓ c1 = Σ_j Σ_m w j,m log ω_m + Σ_j Σ_i Σ_t Σ_m v i,j,t,m log π_{t|m} + Σ_j Σ_i Σ_t Σ_m v i,j,t,m Σ_h log P(Y ijh |X ij = t),

where we have dropped the argument (Φ, Π, ω) from ℓ c1 for simplicity of notation. In the E step, the missing data are imputed by conditional expectations given the observed data and current values for the unknown model parameters. More specifically, this involves the computation of the following expected CDLL

Q 1 = Σ_j Σ_m û j,m log ω_m + Σ_j Σ_i Σ_t Σ_m v̂ i,j,t,m log π_{t|m} + Σ_j Σ_i Σ_t Σ_m v̂ i,j,t,m Σ_h log P(Y ijh |X ij = t),     (9)

where

û j,m = P(W j = m|Y j ) = ω_m P(Y j |W j = m) / Σ_{m'} ω_{m'} P(Y j |W j = m'),  with  P(Y j |W j = m) = ∏_i Σ_t π_{t|m} ∏_h P(Y ijh |X ij = t).     (10)

To compute the conditional expectation of v i,j,t,m , we use the fact that the joint probability P(X ij = t, W j = m|Y j ) can be written as P(W j = m|Y j )P(X ij = t|W j , Y j ), where P(W j = m|Y j ) is already available from (10). Note that, given the model assumptions, P(X ij = t|W j = m, Y j ) = P(X ij = t|W j = m, Y ij ), which we use to compute the following desired quantity

v̂ i,j,t,m = P(X ij = t, W j = m|Y j ) = û j,m π_{t|m} ∏_h P(Y ijh |X ij = t) / Σ_{t'} π_{t'|m} ∏_h P(Y ijh |X ij = t'),     (12)

where in the last step we are using the assumption that the joint probability function of the response variables depends on high-level class membership only through low-level class membership. For the unrestricted multi-group LC model, the expression (12) would be adapted straightforwardly.
In the M step of the algorithm, the expected CDLL (9) is maximized with respect to the model parameters (Φ, Π, ω), subject to the usual sum-to-one constraints on probabilities. This yields the following closed-form updates:

ω̂_m = (1/J) Σ_j û j,m ,   π̂_{t|m} = Σ_j Σ_i v̂ i,j,t,m / Σ_j n_j û j,m ,   φ̂_{h|t} = Σ_j Σ_i Σ_m v̂ i,j,t,m y ijh / Σ_j Σ_i Σ_m v̂ i,j,t,m .     (13)

Starting from initial values for the model parameters, the algorithm iterates between the E- and the M-steps until some convergence criterion is met, e.g. until the difference between the log-likelihood values of two subsequent iterations falls below some threshold value.
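The E- and M-steps above can be written down compactly for balanced groups. The following is a readable sketch of the upward-downward computations and the closed-form updates, not the multilevLCA implementation (which also handles unbalanced groups and keeps all bookkeeping on the log scale); the demo data and starting values at the end are arbitrary:

```python
import numpy as np

def em_step1(Y, omega, Pi, Phi, max_iter=500, tol=1e-8):
    """EM for the multilevel LC model without covariates (balanced groups).

    Y: J x n x H binary array; omega: (M,); Pi: M x T; Phi: H x T.
    Illustrative sketch of the upward-downward scheme described in the text.
    """
    omega, Pi, Phi = omega.copy(), Pi.copy(), Phi.copy()
    ll_old = -np.inf
    for _ in range(max_iter):
        # E step: f[j,i,t] = prod_h phi_{h|t}^y (1 - phi_{h|t})^(1-y)
        f = np.exp(Y @ np.log(Phi) + (1 - Y) @ np.log(1 - Phi))   # J x n x T
        a = np.einsum('jit,mt->jim', f, Pi)                       # P(Y_ij | W_j = m)
        logjoint = np.log(omega) + np.log(a).sum(axis=1)          # J x M
        c = logjoint.max(axis=1, keepdims=True)
        u = np.exp(logjoint - c)
        ll = (c.ravel() + np.log(u.sum(axis=1))).sum()            # log-likelihood (6)
        u /= u.sum(axis=1, keepdims=True)                         # u_{j,m} = P(W_j = m | Y_j)
        q = np.einsum('jit,mt->jitm', f, Pi) / a[:, :, None, :]   # P(X_ij = t | W_j = m, Y_ij)
        v = q * u[:, None, None, :]                               # P(X_ij = t, W_j = m | Y_j)
        # M step: closed-form updates (13)
        omega = u.mean(axis=0)
        Pi = (v.sum(axis=(0, 1)) / (Y.shape[1] * u.sum(axis=0))).T
        r = v.sum(axis=3)                                         # P(X_ij = t | Y_j)
        Phi = np.einsum('jit,jih->ht', r, Y) / r.sum(axis=(0, 1))
        Phi = np.clip(Phi, 1e-6, 1 - 1e-6)                        # guard against log(0)
        if ll - ll_old < tol:
            break
        ll_old = ll
    return omega, Pi, Phi, ll

# Demo run on arbitrary binary data with M = T = 2 and H = 6
rng = np.random.default_rng(0)
Y_demo = rng.binomial(1, 0.5, size=(30, 10, 6))
omega_h, Pi_h, Phi_h, ll = em_step1(Y_demo,
                                    np.array([0.5, 0.5]),
                                    np.array([[0.6, 0.4], [0.3, 0.7]]),
                                    rng.uniform(0.3, 0.7, size=(6, 2)))
```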
As for all mixture models, the log-likelihood function can have several local optima, and there is no guarantee that the solution found by the EM algorithm is the global optimum (Wu, 1983). To better explore the likelihood surface, multiple starting value strategies are typically implemented (among others, see Biernacki et al., 2003; Maruotti & Punzo, 2021). The easiest, and most common, approach is to initialize the EM algorithm randomly from several different starting points. However, even for relatively simpler models, the multiple random starting value strategy is often outperformed by more refined techniques (Biernacki et al., 2003).
For any stepwise estimator, the initialization strategy of the earlier steps is particularly relevant because subsequent steps are conditional on estimates from previous steps. In our step 1, we suggest implementing the following hierarchical initialization strategy (for a similar approach in a related context, see for instance Catania & Di Mari, 2021; Catania et al., 2022): (1) Perform a single-level K-modes clustering (Z. Huang, 1997; MacQueen, 1967), with K = M . For each j = 1, . . ., J: -let Ẇ ij be the outcome class assignment for unit i in group j; -specify W̃ j as the most frequent assigned class among the n j observations belonging to group j, and let W̃ ij = W̃ j for all i = 1, . . ., n j .
The relative sizes of the resulting high-level classes are used to initialize ω. The entries of ω, before being carried over to the actual estimation step, can be sorted in increasing or decreasing order.
(2) Fit a single-level T -class LC model on the pooled data, ignoring the multilevel structure.
Note that the K-modes algorithm can be employed here as well to initialize the single-level LCA. The estimated output is organized as follows: -the response probabilities are passed on to the EM algorithm as a start for Φ; -let X̃ ij be the maximum a posteriori class assignment for unit i in group j. Cross-tabulate X̃ and W̃, where X̃ = (X̃ 11 , . . ., X̃ n J J ) , and W̃ = (W̃ 11 , . . ., W̃ n J J ) .
From the T × M table of joint counts, compute the conditional (relative) counts of X̃ given W̃ to initialize Π.
The low-level classes can be re-ordered by letting low-level cluster 1 be the one with the highest average probability of scoring a "1" on all items, cluster 2 the one with the second highest average probability of scoring a "1" on all items, and so on.
Note that the suggested rule to re-order low-level classes is only an example of a rule that is often (but not always) useful. This is because, if there are many items, or some items measure rare characteristics, the joint probability of scoring "1" on all of them together might be a number so small as to be overwhelmed by sampling error or even by machine imprecision. That would effectively bring label switching back again. In cases like these, we suggest implementing alternative reordering principles.
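The cross-tabulation in steps (1)-(2) can be illustrated concretely. The helper below is hypothetical (it is not part of multilevLCA's API): it takes preliminary hard assignments, such as those produced by K-modes and the single-level LCA, and turns them into starting values for ω and Π:

```python
import numpy as np

def init_from_assignments(X_hat, W_group, group_idx, T, M):
    """Starting values for omega and Pi from hard class assignments.

    X_hat:     low-level class assignment of each unit (from single-level LCA);
    W_group:   modal high-level class of each group (from K-modes, per group j);
    group_idx: group index of each unit. Hypothetical illustration helper.
    """
    # relative sizes of the high-level classes over groups initialize omega
    omega0 = np.bincount(W_group, minlength=M) / len(W_group)
    # joint counts of (W, X) over units, then conditional relative counts
    counts = np.zeros((M, T))
    np.add.at(counts, (W_group[group_idx], X_hat), 1)
    Pi0 = counts / counts.sum(axis=1, keepdims=True)
    return omega0, Pi0

# six units in two groups: group 0 has modal class 0, group 1 has modal class 1
X_hat = np.array([0, 1, 1, 2, 0, 2])
W_group = np.array([0, 1])
group_idx = np.array([0, 0, 0, 1, 1, 1])
omega0, Pi0 = init_from_assignments(X_hat, W_group, group_idx, T=3, M=2)
```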
Running the EM algorithm to convergence from the above starting values, the solution with the highest log-likelihood (6) provides us with estimates ω̂, Π̂, Φ̂. Of these, ω̂ and Π̂ are discarded, and vec(Φ̂) = θ̂ 1 are retained as the estimates of the measurement parameters θ 1 from this step 1.

Step 2 - Model for class membership
In the second step of estimation, the parameters θ 2 of the model for the latent classes in Equation (5) are estimated, keeping the measurement parameters θ 1 fixed at their step-1 estimates θ̂ 1 (see Figure 2). These step-2 estimates are obtained by maximizing the pseudo log-likelihood function

ℓ 2 (θ 2 ) = Σ_{j=1}^{J} log P(Y j |Z j )

with respect to θ 2 . Here log P(Y j |Z j ) is given by equation (5), except that θ 1 are regarded as fixed and known values rather than unknown parameters. The EM algorithm that we propose for this step works similarly to the one that we used for the first step. In particular, under the definition of the augmenting variables given in Section 4.1, the CDLL is given by

ℓ c2 = Σ_j Σ_m w j,m log ω_m + Σ_j Σ_i Σ_t Σ_m v i,j,t,m log P(X ij = t|W j = m, Z ij ) + terms not depending on θ 2 ,

where we have dropped the argument (θ 2 |θ 1 = θ̂ 1 ) from ℓ c2 for ease of notation. Note that the E step is analogous to that described in Section 4.1, except that now the low-level class probabilities conditional on high-level membership depend on covariates. In the M step the expected CDLL, obtained by substituting the missing values with expectations computed using formulas analogous to (10) and (12), is maximized with respect to θ 2 only. Whereas the update for ω is given by (13), to derive the update for the regression coefficients note that v̂ i,j,t,m = P(X ij = t, W j = m|Y j ) can be written as the product of û j,m = P(W j = m|Y j ) and q̂ i,j,t|m = P(X ij = t|W j , Y j ). Thus, estimates of Γ can be found by solving the equations

Σ_{j=1}^{J} û j,m Σ_{i=1}^{n_j} [q̂ i,j,t|m − P(X ij = t|W j = m, Z ij )] Z ij = 0,  t = 2, . . ., T,  m = 1, . . ., M,

which are weighted sums of M sets of multinomial logistic score equations, each with weights q̂ i,j,t|m .
Stepwise estimation is well known to enhance algorithmic stability and speed of convergence (Bakk & Kuha, 2018; Bartolucci et al., 2015; Di Mari & Maruotti, 2022; Skrondal & Kuha, 2012). However, class labels in models with multiple hidden layers can still be switched, and keeping the response probabilities fixed cannot prevent this, as there are still M! possible permutations of the high-level class labels. We handle this issue by initializing ω at its estimate from the first step, and by taking log(π̂ t|m /π̂ 1|m ) to initialize the intercepts γ 0tm , for all m = 1, . . ., M and t = 2, . . ., T . The other elements of Γ are initialized at zero.
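For the M-step update of Γ, the score of the expected CDLL described above can be evaluated directly. The following sketch (hypothetical helper of ours, flattened over all low-level units) computes that score; setting it to zero, e.g. by Newton or quasi-Newton iterations, gives the update. The sanity check at the end confirms that the score vanishes at the intercept-only solution:

```python
import numpy as np

def score_gamma(Gamma, Z, q, u):
    """Score of the expected CDLL with respect to Gamma (step-2 M step).

    Gamma: M x (T-1) x K coefficients (class t = 1 is the baseline);
    Z:     N x K design matrix over all low-level units;
    q:     N x T x M posteriors q_{i,j,t|m};
    u:     N x M posteriors u_{j,m}, replicated over the units of group j.
    """
    grad = np.empty_like(Gamma)
    for m in range(Gamma.shape[0]):
        eta = np.hstack([np.zeros((Z.shape[0], 1)), Z @ Gamma[m].T])
        p = np.exp(eta - eta.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)              # model P(X = t | W = m, Z)
        resid = q[:, 1:, m] - p[:, 1:]                 # posterior minus model probability
        grad[m] = (u[:, m, None] * resid).T @ Z        # (T-1) x K score block
    return grad

# Sanity check: intercept-only model evaluated at the solution has zero score
pi = np.array([0.5, 0.3, 0.2])
Gamma = np.log(pi[1:] / pi[0]).reshape(1, 2, 1)        # M = 1, T = 3, K = 1
Z = np.ones((5, 1))
q = np.tile(pi, (5, 1))[:, :, None]
u = np.ones((5, 1))
g = score_gamma(Gamma, Z, q, u)                        # all entries ~0
```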

Selecting the number of latent classes
The description of the two-step estimation procedure above takes the numbers of latent classes at both the lower and higher levels as given. The selection of these numbers is a separate exercise. It is normally carried out without covariates, and the selected numbers of classes are then held fixed when covariates are added. This is also in line with general recommendations for LCA with covariates (Masyn, 2017).
The selection of the numbers of classes could be considered as a joint exercise for both the high and low levels together, but a generally used recommendation is to use instead a hierarchical procedure which selects them one at a time (Lukociene et al., 2010). First, simple LC models are fitted at the lower level and the number of classes for it (T ) is selected. Second, this number is held fixed, and multilevel LC models are fitted and compared to select the number of classes at the higher level (M ). Third, the selected M is fixed, and model selection for the multilevel model is done again at the lower level, to obtain the final value of T . A still simpler approach would skip the third step (Vermunt, 2003), but including it allows us to check whether the selected number of lower-level classes changes once the within-group associations induced by the high-level classes are allowed for.
This hierarchical approach can be used with any method of estimating the models. However, when combined with our two-step estimator, simultaneously selecting the numbers of classes of the measurement model at both levels is also feasible. Practically, this is possible by leveraging an efficient integration of the above initialization strategy with parallel (multi-core) estimation over all plausible values of T and M .
The best candidate values of M and T can be selected with standard information criteria, like AIC or BIC. For the final choice, we suggest balancing the use of information criteria with the evaluation of low- and high-level class separation and, perhaps most importantly, the substantive inspection of the candidate model configurations. For a wider discussion of this issue see, among others, Di Mari et al. (2022) and Magidson & Vermunt (2004). In the social sciences, one of the most commonly used measures of class separation is the entropy-based R 2 of Magidson (1981). The latter can be defined at both the lower and higher levels to judge class separation (see Di Mari et al., 2022; Lukociene et al., 2010).
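One common form of the entropy-based R 2, shown here as a hedged single-level sketch (the exact definition used at each level follows the references above), compares the average entropy of the posterior class probabilities with the entropy of the marginal class proportions:

```python
import numpy as np

def entropy_r2(post):
    """Entropy-based R-squared for class separation.

    post: N x C matrix of posterior class probabilities (rows sum to 1).
    One common form: 1 minus the average posterior entropy relative to the
    entropy of the estimated marginal class proportions.
    """
    eps = 1e-12                                   # guard against log(0)
    avg_post_entropy = -(post * np.log(post + eps)).sum(axis=1).mean()
    marginal = post.mean(axis=0)
    marginal_entropy = -(marginal * np.log(marginal + eps)).sum()
    return 1 - avg_post_entropy / marginal_entropy

r2_sharp = entropy_r2(np.array([[1.0, 0.0], [0.0, 1.0]]))   # close to 1: perfect separation
r2_flat = entropy_r2(np.array([[0.5, 0.5], [0.5, 0.5]]))    # close to 0: no separation
```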

Statistical properties of the two-step estimator
Our two-step estimator is an instance of pseudo maximum likelihood estimation (Gong & Samaniego, 1981). Such estimators are consistent and asymptotically normally distributed under very general regularity conditions. The conditions and a proof of consistency can be found in Gourieroux & Monfort (1995, Sec. 24.2.4). Let the true parameter vector be θ = (θ 1 , θ 2 ) . If the one-step ML estimator of θ is itself consistent for θ, in order to prove consistency of our two-step estimator θ̂ it suffices to show that (1) θ 1 and θ 2 can vary independently of each other, and (2) θ̂ 1 is consistent for θ 1 . These conditions are satisfied in our case: (1) is true by construction of the model, and (2) is satisfied since θ̂ 1 from step 1 is a ML estimate of the measurement model parameters of the multilevel LC model without covariates, and these parameters are taken to be the same as in the model with covariates.
Let ℓ(θ 1 , θ 2 ) denote the joint log-likelihood function for the model, let s θ 2 (θ 1 , θ 2 ) denote the mean score N^{-1} ∂ℓ(θ 1 , θ 2 )/∂θ 2 evaluated at (θ 1 , θ 2 ), where N denotes the overall sample size, and let

I(θ 1 , θ 2 ) = [ I 11  I 12 ; I 21  I 22 ]

be the Fisher information matrix, partitioned according to (θ 1 , θ 2 ). In addition, let us suppose that √N (θ̂ 1 − θ 1 ) converges in distribution to N(0, Σ 11 ). Then, using the results of Theorem 2.2 of Gong & Samaniego (1981) (see also Parke, 1986),

√N (θ̂ 2 − θ 2 ) → N(0, V ),  with  V = V 2 + V 1 ,     (20)

where θ̂ 2 is the proposed two-step estimator and

V 2 = I 22^{-1} ,   V 1 = I 22^{-1} I 21 Σ 11 I 21' I 22^{-1} .

Intuitively, V 2 describes the variability in θ̂ 2 given the step-one estimates θ̂ 1 , and V 1 the additional variability arising from the fact that θ 1 are not known but rather estimated by θ̂ 1 , with their own sampling variability.
Let s ij,θ 2 (θ̂ 1 , θ̂ 2 ) be the individual contribution to the score of low-level unit i belonging to high-level group j, evaluated at the parameter estimates of the first and second step respectively. To compute this score we use the well-known fact that ∂ℓ(θ)/∂θ = ∂Q/∂θ (Oakes, 1999), where Q = E[ℓ c (θ)]. All such quantities are available from the above EM algorithm without any extra effort. Therefore, I 22 and I 21 can be estimated respectively as

Î 22 = N^{-1} Σ_j Σ_i s ij,θ 2 (θ̂ 1 , θ̂ 2 ) s ij,θ 2 (θ̂ 1 , θ̂ 2 )'

and

Î 21 = N^{-1} Σ_j Σ_i s ij,θ 2 (θ̂ 1 , θ̂ 2 ) s ij,θ 1 (θ̂ 1 , θ̂ 2 )' .

An estimate Σ̂ 11 can be obtained analogously by fitting model (2). We give details on the derivations of the desired quantities in the appendix.
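Once estimates of I 22 , I 21 , and Σ 11 are in hand, assembling the corrected covariance matrix is plain matrix algebra. A minimal numpy sketch, with made-up input matrices purely for illustration:

```python
import numpy as np

def two_step_vcov(I22, I21, Sigma11, N):
    """Corrected asymptotic covariance of the step-2 estimator.

    V2 = I22^{-1} is the naive piece that conditions on the step-1 estimates;
    V1 = I22^{-1} I21 Sigma11 I21' I22^{-1} adds the step-1 sampling
    variability (Gong & Samaniego, 1981). Returns (V2 + V1) / N, i.e. the
    finite-sample scale variance of the step-2 estimates.
    """
    I22_inv = np.linalg.inv(I22)
    V2 = I22_inv
    V1 = I22_inv @ I21 @ Sigma11 @ I21.T @ I22_inv
    return (V2 + V1) / N

# Made-up 2-parameter example: only the first component loads on theta_1
I22 = np.array([[2.0, 0.0], [0.0, 2.0]])
I21 = np.array([[1.0], [0.0]])
Sigma11 = np.array([[1.0]])
V = two_step_vcov(I22, I21, Sigma11, N=100)
```

The correction only inflates the variance of parameters whose score is correlated with the step-1 score, which is why ignoring it biases standard errors downward.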
Note that Equation (20) shows that there is a loss of efficiency of the two-step estimator with respect to the simultaneous ML estimator. This important theoretical and practical aspect will be investigated in the simulation study, although we expect the loss to be rather small, as very little information about θ2 should be contained in Y.

Settings
We conduct a simulation study to investigate the finite-sample properties of the proposed two-step estimator. It is compared with the simultaneous (one-step) estimator and the two-stage estimator of Bakk et al. (2022) and Di Mari et al. (2022). One-step estimation is the statistical benchmark, and the two-step estimator is evaluated in terms of its statistical and computational performance relative to this benchmark. The target measures we use for the comparison are the bias, standard deviations, confidence interval coverage rates, and computation time of the stepwise estimators compared with those of the simultaneous estimator. We compute both absolute standard deviations, to assess the efficiency of our estimator, and standard deviations relative to the one-step method, to investigate potential loss of efficiency with respect to the benchmark. Class separation and sample size are well-known determinants of the finite-sample behavior of stepwise estimators for LCA (Bakk & Kuha, 2018; Vermunt, 2010). We considered all combinations of larger and smaller sample sizes at the higher level (30, 50, or 100 higher-level units) and lower level (100 or 500 lower-level units), for a total of 6 sample size conditions. Data were generated from a multilevel LC model with 2 high-level classes, 3 low-level classes, 10 binary indicators, and one continuous covariate generated from a standard normal distribution. The random slopes γ2|1 and γ3|1 were set to -0.25 and -0.25, whereas γ2|2 and γ3|2 were set to 0.25 and 0.25, corresponding to a moderate magnitude on the logistic scale.
In multilevel LC models, separation plays a role at both the low and high levels (Lukociene et al., 2010). We manipulate low-level class separation by setting the response probabilities for the most likely responses to 0.7, 0.8, or 0.9, corresponding respectively to low, moderate, and large class separation. We remark that the low class separation condition can be considered an extreme scenario, in which LCA is hardly ever carried out in practice; nevertheless, we include it as a benchmarking condition. Class profiles are such that the first class has a high probability of scoring 1 on all items, the second class of scoring 1 on the last five items and 0 on the first five, and the third class is likely to score 0 on all items. At the high level, in the model for W, we manipulate class separation by altering the random intercept magnitudes, which are both relatively close to zero in the moderate separation case (-0.85, -1.38 and 0.85, 1.38) and further away from zero in the large separation case (-1.38, -2.07 and 1.38, 2.07). These simulation conditions are in line with previous studies on multilevel LCA (Lukociene et al., 2010; Park & Yu, 2018).
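One data-generating condition described above can be sketched as follows. This is a minimal illustration, assuming the large-separation intercepts and 0.9/0.1 response profiles; the uniform assignment of high-level classes and all variable names are simplifying assumptions, not the exact generating code used in the study:

```python
import numpy as np

rng = np.random.default_rng(1)
J, n_j = 30, 100        # high-level units and low-level units per group
T, M, H = 3, 2, 10      # low-level classes, high-level classes, binary items

# Random intercepts (large-separation condition) and random slopes for the
# logits of low-level classes 2 and 3 versus class 1, by high-level class W.
alpha = np.array([[-1.38, -2.07],    # W = 1
                  [ 1.38,  2.07]])   # W = 2
gamma = np.array([[-0.25, -0.25],    # W = 1
                  [ 0.25,  0.25]])   # W = 2

# Class profiles P(Y_h = 1 | X): class 1 scores 1 on all items, class 2 on the
# last five only, class 3 on none (large-separation condition, 0.9 vs 0.1).
profiles = np.array([[0.9] * 10,
                     [0.1] * 5 + [0.9] * 5,
                     [0.1] * 10])

data = []
for j in range(J):
    w = rng.integers(M)              # high-level class (uniform, for illustration)
    z = rng.normal(size=n_j)         # standard normal covariate
    # Multinomial logit for X given W and z, with class 1 as the baseline
    eta = np.column_stack([np.zeros(n_j),
                           alpha[w, 0] + gamma[w, 0] * z,
                           alpha[w, 1] + gamma[w, 1] * z])
    p = np.exp(eta) / np.exp(eta).sum(axis=1, keepdims=True)
    x = np.array([rng.choice(T, p=pi) for pi in p])
    # Binary item responses given the low-level class
    y = (rng.random((n_j, H)) < profiles[x]).astype(int)
    data.append((w, z, x, y))
```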
We generated 500 samples for each of the 36 crossed combinations of low-level and high-level sample size and low-level and high-level class separation (see Table 1). Data generation and model estimation were carried out in R (Venables et al., 2013), with the integration of C++ code for computational efficiency (Eddelbuettel & François, 2011).

Table 1. Simulation conditions. LL stands for Low-Level, HL stands for High-Level.

Results
All estimators show very similar values for bias (see Figures 3a-3b), and both the two-stage and two-step estimators yield nearly identical results compared to the simultaneous estimator. Relative efficiency with respect to the simultaneous estimator (Table B1, in the appendix) is approximately one for both stepwise estimators in all conditions, with the two-stage estimator doing very slightly worse in only one condition. Confidence interval coverages (Figure 4) are mostly very similar across the three estimators. We observe some undercoverage for all methods in the low-separation and small high-level sample size conditions. This may be due to the use of expected information matrices, rather than observed ones, to estimate the asymptotic variance-covariance matrix, to the fact that the contributions to the score are computed on high-level units, and to the overlap between classes.
The different estimators thus perform essentially identically. Where they differ is in their computational demands. Considering the computation time relative to the simultaneous estimator (Figure 5), we find that both stepwise estimators are always faster than the simultaneous estimator (up to four times), and the two-step estimator achieves this with one fewer step than the existing two-stage competitor.

Analysis of cross-national citizenship norms with multilevel LCA

In this empirical example, we analyze citizenship norms in a diverse set of countries. The data are taken from the International Civic and Citizenship Education Study (ICCS) conducted by the International Association for the Evaluation of Educational Achievement (IEA). Prior research has used LCA to analyze the first two waves of this survey, conducted in 1999 and 2009, to investigate distinctive types of citizenship norms (Hooghe & Oser, 2015; Hooghe et al., 2016; Oser & Hooghe, 2013). We focus on the most recent round of the survey, from 2016 (Köhler et al., 2018). The data come from a survey of students in their eighth year of schooling. We have between 1300 and 7000 respondents in each of 24 countries, as shown in Table 2.
The respondents answered 12 questions (items) on how important they think different behaviours are for "being a good adult citizen".These behaviours were always obeying the law (labelled obey below), taking part in activities promoting human rights (rights), participating in activities to benefit people in the local community (local), working hard (work), taking part in activities to protect the environment (envir), voting in every national election (vote), learning about the country's history (history), showing respect for government representatives (respect), following political issues in the newspaper, on the radio, on TV or on the Internet (news), participating in peaceful protests against laws believed to be unjust (protest), engaging in political discussions (discuss), and joining a political party (party).
We treat these twelve items as indicators of individuals' perceptions of the duties of a citizen (citizenship norms). The data have a multilevel structure, with individuals as the low-level units and countries as the high-level units. As predictors of low-level latent class membership, we include the respondent's gender, socio-economic status (operationalised by the proxy measure of the number of books in their home), educational expectations, parental education, and whether the respondent is a non-native language speaker. For details on data cleaning and recoding, see Oser et al. (2023).
To compare with previous work on the same data, we fit a multilevel LC model with T = 4 low-level classes (of individuals within countries) and M = 3 high-level classes (of countries). The same data set was analyzed by Di Mari et al. (2022) with a multilevel LC model with random intercepts, estimated with a two-stage estimator. We extend Di Mari et al. (2022)'s model specification by allowing for both random intercepts and random slopes, and we fit the model with the proposed two-step estimator. As the two-step estimator has been shown to be computationally more efficient than the two-stage estimator with comparable statistical performance, we include only the benchmark simultaneous estimator in the comparison.
The measurement model presents very well separated classes at both levels (Table 3). At the lower level, the four latent classes are characterised by their conditional response probability patterns, shown in Figure 6. Two classes present response configurations relating to two relevant and well-known notions of citizenship norms. First, a "Duty" group, which places high importance on the acts of voting, discussing politics, and party activity, while manifesting relatively low interest in protecting human rights and activities to assist the local community. Second, an "Engaged" group, which displays greater emphasis on engaged attitudes such as protecting the environment, and attaches lower importance to more traditional citizenship activity items such as membership of political parties. In addition, we observe two classes with consistently high and consistently low probabilities of assigning importance to all of the behaviours, here labelled the "Maximal" and "Subject" classes respectively. At the higher level, the estimated model includes three latent classes for the countries, labelled below as HL1, HL2 and HL3. Considering first the conditional probabilities of the four individual-level classes given these country-level classes (see Table 4), we can see that HL1 clearly has the highest conditional probability for the individual "Duty" class, HL2 for the "Maximal" class, and HL3 for the "Engaged" class. The country classes do not differ in the probabilities of the passive "Subject" class of individuals, which are in any case consistently low. Table 5 shows the assignment of countries to the classes, when the assignment is made on the basis of highest posterior probabilities given the survey responses in the countries. Here there are no very clear patterns. Only two countries (Denmark and the Netherlands) are assigned to HL1, while the other two classes each include a fairly heterogeneous subset of the remaining countries.

Table 6 presents estimates of the parameters of main interest in the analysis: the coefficients of the structural model for the lower-level classes given individual-level covariates, separately within each of the higher-level classes. We note first that the one-step and two-step estimates and their standard errors are very similar, as would be expected given the previous simulation results.
Considering the coefficients themselves, note that they compare each of the other classes to the "Maximal" class, for whom all of the behaviours are to a greater or lesser extent considered important to good citizenship. Compared to this class, the relative probability of the (overall quite small) "Subject" class, for whom none of the behaviours are important, is higher for individuals who are boys, speak the native language at home, have fewer books at home, and have low educational aspirations. The probabilities of the "Engaged" class, who are partly similar to the "Maximal" class but place less importance on many of the traditional political activities, are relatively higher for girls, those with a larger number of books at home, and native speakers. For the "Duty" class, which differs from the "Engaged" class in placing much less importance on direct activism, the probabilities relative to "Maximal" are higher for boys and those with low educational aspirations. For comparisons of other pairs of classes, these estimates also imply, for example, that the probabilities of "Engaged" relative to "Duty" are generally higher for girls than for boys. These patterns of the coefficients are broadly similar in each of the country classes, with some variation in detail.
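To illustrate how such coefficients translate into class probabilities, the following sketch evaluates a multinomial logit with "Maximal" as the reference class. The coefficient and intercept values are purely hypothetical placeholders chosen to reproduce the qualitative pattern described above, not the estimates of Table 6:

```python
import math

# Hypothetical log-odds coefficients versus the "Maximal" reference class,
# for a single binary covariate (e.g. gender coded 1 = boy, 0 = girl).
beta = {"Subject": 0.40, "Engaged": -0.30, "Duty": 0.50}       # hypothetical
intercepts = {"Subject": -2.0, "Engaged": 0.2, "Duty": -0.5}   # hypothetical

def class_probs(z):
    """Multinomial logit probabilities with 'Maximal' as baseline (eta = 0)."""
    eta = {"Maximal": 0.0}
    eta.update({c: intercepts[c] + beta[c] * z for c in beta})
    denom = sum(math.exp(e) for e in eta.values())
    return {c: math.exp(e) / denom for c, e in eta.items()}

boys, girls = class_probs(1.0), class_probs(0.0)

# With these hypothetical signs, the probability of "Engaged" relative to
# "Duty" comes out higher for girls than for boys, mirroring the pattern
# of coefficient comparisons discussed in the text.
ratio_boys = boys["Engaged"] / boys["Duty"]
ratio_girls = girls["Engaged"] / girls["Duty"]
```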
Finally, we report the CPU time of estimation and the number of iterations until convergence for the two approaches (Table 7). In this real-data example, the two-step estimator takes only about 22 seconds to reach convergence, with 26 EM iterations. The one-step estimator requires 261 iterations and a running time of around 4.5 minutes. Each iteration takes about 0.93 seconds for the one-step estimator, while the two-step estimator uses 0.85 seconds per iteration and far fewer EM iterations overall.

Table 7. CPU time of estimation in seconds, and number of iterations until convergence, for the two methods (one-step and two-step estimators).

Discussion
In this paper we proposed a two-step estimator for the multilevel latent class model with covariates. It concentrates the estimation of the measurement model in a single first step; in the second step, covariates are added to the model, keeping the measurement model parameters fixed. The approach represents a simplification over the recently proposed two-stage estimator (Bakk et al., 2022), having only two steps instead of multiple sub-steps for estimating the measurement model.
We discussed model identification of the unconditional model, derived an Expectation-Maximization algorithm for efficient estimation in both steps, and presented second-step asymptotic standard errors that account for the variability of the first step. The simplified two-step procedure makes it possible to apply the standard theory of Gong & Samaniego (1981) to obtain unbiased standard errors, a further improvement over the two-stage estimator. An effective initialization strategy, using (dissimilarity-based) cluster analysis, was also proposed.
In the simulation study, we observed that the performance of the proposed estimator in terms of bias is very similar to that of the benchmark simultaneous (full-information ML) estimator, and similar to that of the two-stage estimator, with nearly no efficiency loss. The two-step estimator was up to 4 times faster than the simultaneous estimator. It should be mentioned that, in conditions where the entropy of the LC model is low, all estimators show relatively higher variability and bias, a finding in line with previous research on stepwise estimators for single-level LC models (Vermunt, 2010).
In the real-data example, we found interesting lower- and higher-level class configurations, consistent with the existing literature on citizenship norms (see, e.g., Oser et al., 2022). The structural model allows us to investigate the associations between covariates and the latent classes, including the possibility of group-level heterogeneous effects of covariates on lower-level class membership. In addition, we found a considerable difference in CPU running time between the one-step and two-step estimators, even larger than what we observed in the more controlled simulation environment: whereas the former required 4.5 minutes to reach convergence, the latter needed only 22 seconds. From an applied user's perspective, such a gain in CPU time can be substantial on a larger scale. As an example, consider a data set with larger low- and high-level sample sizes: if simultaneous estimation took 2 hours, our two-step estimator would produce final estimates in roughly 12 minutes. Based on the existing literature on two-step estimators (see, e.g., Di Mari & Maruotti, 2022), we expect this gap to increase with model complexity, i.e. with the number of lower/higher-level classes and/or available predictors. The difference in time is also multiplied when models are estimated repeatedly, for example when different sets of covariates or different numbers of latent classes are explored.
Some issues deserve future research. First, while we describe two possible approaches for class selection in Section 4.3, this is not the main focus of the current work; further research should investigate class selection with the different estimators. Second, we have proposed estimates of the asymptotic variance-covariance matrix based on the outer product of the score. Deriving Hessian- and/or sandwich-based (White, 1982) standard errors, e.g. for small high-level sample sizes and complex sampling scenarios, is an interesting topic for future work. Third, we have discussed multimodality of the likelihood surface, a long-standing and well-known feature of mixture models in general. The properties of the EM algorithm, such as monotonicity and global convergence, have been studied extensively over the years (see, e.g., Redner & Walker, 1984). The EM algorithm has several advantages, e.g. low cost per iteration, economy of storage, and ease of programming. However, in practice, due to multimodality, convergence to a global or local optimum depends on the choice of the starting point (Wu, 1983). To date, there is no systematic study, either theoretical or simulation-based, of the behavior of the EM algorithm with two-step estimators. We speculate that, since the second step operates in a lower-dimensional space than simultaneous estimation, two-step estimators should somewhat mitigate the initialization problem. This point, not being the focus of the current work, certainly deserves specialized attention; for this, and related matters, we defer to future research.
The Q function of Equation (17) can be rewritten under the log-linear parametrizations introduced above, except for the second block; the second block of the ij-th contribution to the score decomposes into K + 1 generic contributions.

Table B1. Average relative efficiency of the two-step and two-stage estimators relative to the one-step estimator (SD over benchmark one-step SD), averaged over covariate effects.

Figure 2. Step 2 of the two-step estimation: estimating the structural model for low-level latent classes X ij given covariates Z ij and high-level latent classes W j, keeping measurement model parameters for items Y ijh fixed at their estimates from Step 1.

Figure 3. Line graphs of estimated bias for the one-step, two-step, and two-stage estimators, for the 36 simulation conditions, averaged over the 500 replicates. Error bars are based on mean bias +/- Monte Carlo standard deviations.
Figure 4. Confidence interval coverage rates for the one-step, two-step, and two-stage estimators.
Figure 5. Computation time of the stepwise estimators relative to the simultaneous estimator.

Figure 6. Measurement model at the lower (individual) level: line graph of the class-conditional response probabilities.
Table 2. Number of respondents per country in the third wave (2016) of the IEA survey used for the analysis.
Table 3. Summary statistics for the measurement model.
Table 5. Assignment of countries to the high-level classes, based on the maximum a posteriori (MAP) classification rule. M = 3.

Table 6. Estimated coefficients of the structural models, i.e. multinomial logistic models for membership of the four individual-level latent classes conditional on covariates, separately within each of the three country-level latent classes (HL1, HL2 and HL3). The "Maximal" class is taken as the reference level for the response class. The number of books available in the respondent's home is treated as a proxy for the respondent's socio-economic status. Both the simultaneous (one-step) and the proposed two-step estimators of the same parameters are shown, with standard errors in parentheses. *** p-value < 0.01, ** p-value < 0.05, * p-value < 0.1.