Tempered expectation-maximization algorithm for the estimation of discrete latent variable models

Maximum likelihood estimation of discrete latent variable (DLV) models is usually performed by the expectation-maximization (EM) algorithm. A well-known drawback is related to the multimodality of the log-likelihood function so that the estimation algorithm can converge to a local maximum, not corresponding to the global one. We propose a tempered EM algorithm to explore the parameter space adequately for two main classes of DLV models, namely latent class and hidden Markov. We compare the proposal with the standard EM algorithm by an extensive Monte Carlo simulation study, evaluating both the ability to reach the global maximum and the computational time. We show the results of the analysis of discrete and continuous cross-sectional and longitudinal data referring to some applications of interest. All the results provide supporting evidence that the proposal outperforms the standard EM algorithm, and it significantly improves the chance to reach the global maximum. The advantage is relevant even considering the overall computing time.


Introduction
A latent variable model is a statistical model in which the distribution of the response variables is affected by one or more variables that are not directly observable. Here, we consider two special classes of discrete latent variable (DLV) models (Bartolucci et al. 2022) that are frequently employed to analyze continuous and categorical response variables.
The latent class (LC) model (Lazarsfeld and Henry 1968;Goodman 1974;Lindsay et al. 1991) assumes individual-specific latent variables having a discrete distribution with a finite number of support points. The hidden (or latent) Markov (HM) model (Zucchini and Guttorp 1991;Bartolucci et al. 2013;Zucchini et al. 2016) represents a generalization of the LC model to the case of longitudinal data and the latent process is frequently assumed as a first-order Markov chain. Both models are used as modelbased clustering methods, and in particular, the HM model allows a dynamic clustering where each unit may move between clusters across time.
Maximum likelihood estimation (MLE) of DLV models is usually performed by using the expectation-maximization (EM) algorithm (Baum et al. 1970;Dempster et al. 1977;McLachlan and Krishnan 2008). This approach is straightforward to implement, and it is available in many software packages; among others, we mention MultiLCIRT  and LMest (Bartolucci et al. 2017) in the R software (R Core Team 2022) for the estimation of LC and HM models, respectively.
A particular drawback of MLE is related to the multimodality of the log-likelihood function which is especially observed with the DLV models. Consequently, the EM algorithm could converge to a local maximum, not corresponding to the global one. Multi-start strategies employing both deterministic and random rules to initialize the model parameters are generally adopted. Although this approach encourages a more accurate exploration of the parameter space, it is computationally intensive and does not ensure that the global optimum is reached. For an overview of different initialization strategies, some of which are based on a preliminary cluster analysis (Everitt et al. 2011), see, among others, Maruotti and Punzo (2021).
Tempering and annealing (Sambridge 2014) constitute a broad family of optimization methods; by means of a parameter known as temperature, they allow us to re-scale the target function and monitor the prominence of all possible maxima. In particular, these procedures are gradually attracted towards the global optimum by accurately defining a sequence of temperature values. The alternation of high and low values of the temperature allows us to deal with two opposite but fundamental issues: on one side, the algorithm is led to explore broad areas of the parameter space, thus escaping local sub-optimal modes (high temperatures); on the other side, the algorithm is able to perform a sharp optimization of the target function in a small area of the parameter space (low temperatures).
The following different tempering methods are defined according to the choices of temperature sequences. Simulated annealing (Kirkpatrick et al. 1983) makes use of a strictly decreasing temperature sequence: the initial temperature is sufficiently high so that the re-scaled function is relatively flat, and it decreases at each step, gradually restoring the original function. Simulated tempering (Geyer and Thompson 1995) assumes that the temperature may either increase or decrease according to a stochastic rule: a new proposed temperature level may be accepted or rejected according to a specific probability, and the process describing the temperature evolution follows a Markov chain. Parallel tempering (Geyer 1991;Falcioni and Deem 1999;Earl and Deem 2005) assumes an ensemble of Markov chains across all levels of the temperature sequence: at specified intervals, a swap between a pair of neighboring chains is proposed and accepted or rejected according to a certain probability.
Tempering techniques are employed, among others, in Barbu and Zhu (2013) and Robert et al. (2018) for simulating from complex multimodal statistical distributions by means of Markov chain Monte Carlo methods (Metropolis et al. 1953;Hastings 1970). On the other hand, the use of these procedures within the EM algorithm is quite scarce. Hofmann (1999) proposed tempering techniques for the EM algorithm in the context of probabilistic latent semantic analysis. For what concerns finite Gaussian mixture models, recently, Lartigue et al. (2022) proposed a general class of deterministic approximated versions of the EM algorithm following previous proposals in Yuille et al. (1994), Ueda and Nakano (1998), and Zhou and Lange (2010).
In the following, dealing with DLV models, we propose a general approach. In particular, we explicitly focus on LC and HM models because these are among the most utilized LDV models in data analysis. However, the proposal can easily be adapted to the aforementioned finite mixture models and to other DLV models. We explore two different temperature sequences, including a non-monotone one, also evaluating the computational time efficiency. Up to our knowledge, for the first time, we deal with the problem of temperature sequence tuning, inspecting the performance of the tempered EM (T-EM) algorithm with both optimally tuned and fixed temperature sequences. Finally, we show the behavior of the algorithm for the selection of the optimal model. The implemented code for the proposal is written for the open source software R (R Core Team 2022). It is based on some functions of the package LMest (Bartolucci et al. 2017), and it is available at the following link in the GitHub repository: https:// github.com/LB1304/T-EM.
The remainder of the paper is organized as follows. In Sect. 2 we outline the LC and HM model formulations and the MLE of the model parameters through the EM algorithm. In Sect. 3 we provide details on the proposed T-EM algorithm for both models. In Sect. 4 we summarize the main findings of an extensive simulation study aimed to assess the performance of the proposal by comparing it with the standard EM algorithm for many different scenarios. We also evaluate the proposed algorithm in connection with different initialization strategies, and compare the overall computing time. In Sect. 5 we apply the T-EM algorithm to estimate LC and HM models using a variety of data types. In Sect. 6 we provide some conclusions. Appendix A supplies more details on the settings used for the simulation studies, while Appendices B and C provide additional simulation results. Finally, Supplementary Information (SI) contains the full outcomes of every sample under each simulated scenario.

Model formulation
In the following, mainly borrowing from Bartolucci et al. (2013) we briefly summarize model notations and implementations of the standard MLE of the model parameters carried out through the EM algorithm; see also  and Pandolfi et al. (2021).

Latent class model
Considering cross-sectional data and for a single individual, let Y = (Y 1 , . . . , Y r ) denote the vector of response variables; we assume that each variable Y j is categorical with the same number c of categories, labeled from 0 to c−1. Note that the formulation of the model may be easily adapted to the case of continuous response variables. The LC model relies on a single latent variable U with k support points that identify the latent classes in the population, labeled from 1 to k. According to the assumption of local independence, the response variables are conditionally independent given the latent variable. The model parameters are the weight of each latent class, denoted by π u = p(U = u), u = 1, . . . , k, and the conditional probability of each response variable given the latent variable, denoted by φ j y|u = p(Y j = y|U = u), for y = 0, . . . , c − 1, j = 1, . . . , r , and u = 1, . . . , k.
In order to estimate the model parameters, collected in the vector θ , on the basis of a sample of n independent observations y i , i = 1, . . . , n, the incomplete data loglikelihood denoted as (θ ) is maximized considering the complete data log-likelihood, where a juy = n i=1 I (u i = u, y i j = y) is the frequency of subjects that are in latent class u and respond by y at the j-th response variable, and b u = n i=1 I (u i = u) is the number of sample units in latent class u, with I (·) denoting the indicator function.

Hidden Markov model
With reference to longitudinal data and for a single individual, let r ) denote the occasion-specific response variables for each time t = 1, . . . , T , and let Y denote the vector of responses, which is made of the union of the vectors Y (t) , t = 1, . . . , T . Given a latent process U = (U (1) , . . . , U (T ) ) having a discrete distribution with k states, the latent model parameters are the initial probabilities, denoted by π u = p(U (1) = u), u = 1, . . . , k, and the transition probabilities denoted by π (t) u|ū = p(U (t) = u|U (t−1) =ū), t = 2, . . . , T ,ū, u = 1, . . . , k. Note that it is possible to include a constraint corresponding to the hypothesis that the latent process is time homogeneous so that the transition probabilities do not depend on time occasion t: π (t) u|ū = π u|ū , t = 2, . . . , T . The HM model in its basic formulation (Bartolucci et al. 2013) relies on the following three main assumptions, which can be suitably relaxed: r are conditionally independent given U (t) , for t = 1,…, T ; • U follows a first-order Markov chain with state space {1, . . . , k}, where k is the number of latent states.

Hidden Markov model with categorical response variables
Let Y (t) j , j = 1, . . . , r , t = 1, . . . , T , denote the categorical response variable with c categories, where the conditional probabilities are defined as in Section 2.1.
Given a sample of n observations, the complete data log-likelihood is expressed as is the number of subjects that, at time occasion t, are in latent state u and have outcome y for the j-th response variable, is the number of subjects that move from latent stateū to latent state u at time occasion t.

Hidden Markov model with continuous response variables
The response vectors Y (t) , t = 1, . . . , T , are assumed to follow a conditional Gaussian distribution, that is, with state-specific mean vectors μ u ∈ R r , u = 1, . . . , k, and variance-covariance matrix ∈ R r ×r constant across latent states under the assumption of homoscedasticity. This latter assumption may be relaxed to allow for heteroscedasticity across latent states.
The complete data log-likelihood function is where f ( y (t) i |u) denotes the probability density function of a multivariate Gaussian distribution with parameters μ u and , z is an indicator function equal to 1 if subject i is in latent stateū at time t − 1 and moves to latent state u at time t.

Expectation-maximization algorithm
Maximum likelihood estimation of model parameters is performed through the EM algorithm. Once the parameters are initialized, the EM algorithm alternates the following steps until a suitable convergence criterion is satisfied: • E-step: compute the conditional expected value of * (θ) given the observed data and the value of the parameters at the previous step: • M-step: maximize the expected value Q(θ ; θ (h−1) ) and so update the model parameters: The computation of the expected values at the E-step is based on the following conditional probabilities, generically referred to as q (·). For the LC model we consider q(u| y) = p(U = u|Y = y), while for the HM model we define q (t) (u| y) = p(U (t) = u|Y = y) and q (t) (ū, u| y) = p(U (t−1) =ū, U (t) = u|Y = y). In the following section, we present some details on the EM algorithm, and we show the tempering technique.

Tempered expectation-maximization algorithm
The T-EM algorithm is implemented by adjusting the computation of the expected frequencies in the E-step. In the following we define some general rules for the tempering constants, and we show details of the T-EM algorithm for the LC and HM models.
The family of tempered probabilities has the following expression: where q(·) denotes the original conditional probability, τ is a suitable parameter, known as temperature and varying over the interval [1, +∞), and m is a normalizing constant. At each E-step of the T-EM algorithm, the conditional expected frequencies are computed accordingly. Regarding the temperature, the choice τ → +∞ yields q (τ ) (·) to a uniform distribution, while τ = 1 recovers the original posterior probability q(·). Therefore, we define a sequence of temperature values (τ h ) h≥1 , where h is the algorithm iteration number, so that: (i) the initial temperature τ 1 is sufficiently large, implying that the corresponding tempered distributionq (τ 1 ) (·) is relatively flat and (ii) the temperature value τ h tends towards 1 as the algorithm iteration counter increases. The resulting sequence, denoted as tempering profile, guarantees a proper convergence of the algorithm (Lartigue et al. 2022). We consider the following two tempering profiles: • a monotonically decreasing exponential profile, which is defined as where α ≥ 1 and β ≥ 0 are two constants chosen so as to ensure flexibility in the profile shape; • a non-monotonic profile with oscillations of gradually smaller amplitude, which is expressed as with constants β, ρ, τ 0 > 0 and 0 < α < 1. This profile has more parameters to tune, but it guarantees a very high level of flexibility. Here tanh(·) indicates the hyperbolic tangent, while sinc(x) = sin(π x)/(π x) (with sinc(0) = 1) denotes the normalized sine cardinal function. In this case, the sequence (τ h ) h≥1 may assume values that are smaller than 1 or even negative; although this is not an issue from a strictly mathematical perspective, a tempering step with negative temperature lacks a proper interpretation. Therefore, in practice, we can force the tempering profile to be always greater than or equal to 1 by taking τ h = max{τ h , δ}, with δ ≥ 1 (in this work we fix δ = 1).
The abbreviations M-T-EM and O-T-EM are used for monotonic (2) and oscillating (3) tempering profiles, respectively.

Tuning of tempering profiles
The selection of optimal tempering constants for both profiles may be carried out through a grid-search procedure; in the following, the term grid will denote the sequence of values considered for a constant, while the term step-size will refer to the distance between two consecutive values.
For the monotonic profile the only two constants are simple to interpret: β controls the value of the initial temperature, while α adjusts the decrease rate of the temperature. Lower values of both make the contribution of tempering insignificant; at the extreme, α = 1 and β = 0 recover the standard EM algorithm. Although it is not possible to provide precise and rigorous rules for the selection of these constants, some guidelines hold in general: (i) avoid very high values of α and β. Indeed, beyond certain values, the target function can not be flattened further, and only the computational time would increase. This sort of "threshold" values are unfortunately data-dependent, but we recommend not exceeding α = 15 and β = 5; (ii) choose step-sizes for each grid such that the distance between two consecutive values of α will result much smaller than the one between two successive values of β. Indeed, the monotonic profile is much more sensitive to variations in α than in β; we suggest, for example, a ratio of about 1:10; (iii) avoid increasing β without a corresponding growth of α (while the opposite has no shortcomings). This would lead to a fast decrease in the value of the temperature; accordingly, the target function would not be warped back to its original shape in a gradual way, and the algorithm could possibly be brought far from the global mode; (iv) typically, for each type of data there are many possible suitable tempering configurations, and an important step is to locate a rough range for the constants. After that, although the tuning process can be further refined, most of the tempering configurations chosen within that range would provide good results; (v) various factors such as number of observations, of response variables, and of latent components would guide the choice of this "unrefined" range. For example, estimating a model with many latent components typically requires higher values of α and β with respect to a model with fewer components.
The same guidelines illustrated above should also be taken into account for the oscillating profile, where, however, there are more constants to tune. Their practical interpretation is, in this case, slightly different: T 0 controls the initial temperature, ρ the distance between two consecutive peaks of the profile, β the amplitude of the oscillations, and α the global decrease rate.
The following steps for tuning the tempering profile are derived from the aforementioned rules and are successfully employed to estimate the models for the applications presented in Sect. 5: (1) define grids for all the tempering constants, starting with large step-sizes; (2) estimate the model using the T-EM algorithm with these "unrefined" grids for the tempering constants employing a much smaller number of starting values with respect to that required with the standard EM algorithm; (3) identify the optimal tempering constants by comparing values of the log-likelihood function at convergence; (4) improve the tuning procedure, if necessary, in a smaller region of the tempering constants space and repeat the same procedure (points 2 and 3) using the same small number of different starting values.
A final note, which is effective for both profiles, is that in order to achieve a proper convergence, the algorithm needs to be run until the temperature is steadily close to 1. After that, the last step is conducted with the temperature precisely equal to 1 in order to retrieve the shape of the original log-likelihood function. Typically, this approach increases the number of steps that are required for the algorithm to converge, especially in the case of the oscillating profile. The code written for this proposal is implemented in R and it is freely available at the following link in the GitHub repository: https:// github.com/LB1304/T-EM.

T-EM algorithm for the latent class model with categorical response variables
In the following, we provide some details of the tempered distribution (1) defined for the LC model with categorical response variables considering a suitable tempering profile τ h :q The corresponding pseudo-code is shown in the box Algorithm 1. The E-and M-step of the T-EM algorithm are implemented as follows: • E-step: compute the conditional expected values of a juy and b u revised according to the rules , thus updating the parameters as: Algorithm 1

T-EM algorithm for the hidden Markov model with categorical response variables
A more refined formulation for the tempered distribution in (1) is required to estimate the HM model. Once the tempering profile τ h is chosen, we obtain the following tempered distributions:q The pseudo-code is shown in the box Algorithm 2. In this setting, the steps of the T-EM algorithm are: • E-step: compute the revised conditional expected value of every frequency a uu , so as to obtain the conditional expected value Q(θ; θ (h−1) ); in particular, we have the following explicit expressions: Similarly to the standard EM algorithm, posterior probabilitiesq (t;τ h ) (u| y i ) and q (t;τ h ) (ū, u| y i ) may be efficiently computed by a backward recursion; see Bartolucci et al. (2013, pp 61-64) for further details.
• M-step: by maximizing Q(θ ; θ (h−1) ) update the parameters as follows: Algorithm 2 T-EM algorithm for HM model with categorical response variables j y|u . 7: end while

T-EM algorithm for hidden Markov model with continuous response variables
Regarding the HM model with continuous response variables, the pseudo-code is shown in the box Algorithm 3. Similarly to the previous case, the steps of the resulting T-EM algorithm are as follows: • M-step: maximize Q(θ ; θ (h−1) ) and update the model parameters as follows:

Simulation study
We conducted an extensive Monte Carlo simulation study to evaluate the performance of the T-EM algorithm. In the following, we illustrate the simulation schemes for each different model specifications and summarize the main results.

Settings of the experimental scenarios
The settings involved in each model are different values of sample size n, number of response variables r , categories for each variable c, time occasions T , and latent components k. We define a baseline scenario (setting A, see Tables 16, 17, and 18 in Appendix A) for each model, characterized by n = 500, r = 6, c = 3, T = 5, and k = 3. In addition, more scenarios (settings from B to F in Appendix A) are obtained by doubling, one at a time, the value of each feature. In Tables 16, 17, and 18 in Appendix A also the values of the models' parameters are presented. For each scenario, 50 different samples are drawn. For each of the simulated samples, we estimate 100 times both the model with correctly specified latent structure and that with misspecified latent structure, using each time different starting values randomly selected and employing the standard EM algorithm and the two proposed versions of the T-EM algorithm. The choice to also fit misspecified models allows us to show in more detail the features of the proposed tempering approach.
The convergence of the algorithms is checked on the basis of both the relative change in the log-likelihood of two consecutive steps, and the distance between the corresponding parameter vectors. We stop the algorithm when both criteria are satisfied: where θ (h) is the vector of parameter estimates obtained at the h-th iteration of the M-step and ε 1 and ε 2 are tolerance levels equal to 10 −8 and 10 −4 , respectively.
Regarding the algorithm initialization, we adopt a starting rule based on normalized random numbers (Bartolucci et al. 2013). In more details, each initial (π u ) and transition (π (t) u|ū ) probability is initialized with a random number drawn from a uniform distribution between 0 and 1. Then, they are normalized so that k u=1 π u = 1 and k u=1 π (t) u|ū = 1. Similarly, we draw each φ j y|u from the uniform distribution and we normalize these parameters so that c−1 y=0 φ j y|u = 1. In the case of continuous response variables, the mean vectors μ u are drawn from a multivariate Gaussian distribution, whereas is initialized with the observed variance-covariance matrix. As suggested in Bartolucci et al. (2013), combining deterministic and random starting values is a proper approach. Therefore, in Sect. 4.4 we analyze the behavior of tempering in connection with a different initialization strategy.

Simulation results
The EM and T-EM algorithms are compared according to the following criteria: 1. Global maximum achievement: the highest of the maximized log-likelihood values over all 100 initial values, denoted byˆ max , is considered as the global maximum, and a log-likelihood value at convergence denoted byˆ is considered close to this value once it satisfies (ˆ max −ˆ )/|ˆ max | <ε, whereε is a suitable threshold; 2. Average distance from the global maximum computed over the 100 log-likelihood valuesˆ 1 , . . . ,ˆ 100 and expressed as 100  Table 16 of the Appendix A for the LC model 3. Low mean square error of the estimated model parameters with respect to the true model parameters, computed only for models with a correctly specified latent structure; 4. Low mean and median of the log-likelihood values at convergence.
In particular, in this first part of the simulation study, we analyze the performance of the M-T-EM algorithm when the tempering profile is optimally tuned through a gridsearch procedure. The following values for the tempering constants are kept fixed throughout the simulation studies: α ranging from 1 to 15 with a step-size equal to 1 and β ranging from 0 to 2, with a step-size equal to 0.1. In order to show the flexibility of the method, we use the same grid for each model. However, efficient ad hoc grids may be set according to the model and observed data. The results are summarized in the following, and the full outcomes related to every sample under each simulated scenario are reported in the SI.
Criterion 1 is the most important, providing a suitable measure of performance of the algorithm. In this regard, the main results are summarized in Figs. 1, 2, and 3, representing the frequencies of global maximum with respect to the LC model, HM model with categorical response variables, and HM model with continuous response variables, respectively. From all these figures it clearly emerges that the M-T-EM algorithm ensures better performance in each considered scenario.
Regarding the estimation of models whose latent structure is correctly specified, in particular (see Figs. 1a, 2a, and 3a), the improvement with respect to the standard EM algorithm is very relevant: the M-T-EM is generally able to detect the global maximum in the overwhelming majority of cases, and the frequency of convergence to the global mode is very close, or even equal, to 100%. Only in estimating models with many latent states (up to 6), this percentage is slightly reduced, even if the M-T-EM still remains the algorithm providing the best performance. As an example, we consider the  Table 18 of the Appendix A for the HM model with continuous response variables HM model with categorical response variables and in the particular setting F (see the last plot in Fig. 2a): in this case the frequency of convergence to the global maximum is, on average, equal to 29% when the standard EM algorithm is used, and up to 52% when the M-T-EM algorithm is employed. Moreover, this frequency is always lower than 75% with the EM, while it reaches 100% with M-T-EM (though only in a few cases).      All the algorithms are less efficient in steadily detecting the global mode when models with misspecified latent components are estimated (see Figs. 1b, 2b, and 3b). The M-T-EM algorithm always provides the best performance, and in many scenarios the improvement is very relevant: in setting D of the LC model (Fig. 1b) the frequency of convergence to the global mode increases from 18 to 41%; in setting C of the HM model with categorical responses (Fig. 2b) for some samples this frequency reaches 100%.
In Tables 1, 2, and 3, for each one of the simulated scenarios, we show the number of samples in which the global maximum is reached at least half of the times (> 50%), almost always (> 95%), or almost never (< 10%). These results provide supporting evidence for the conclusions drawn so far. In particular, when the considered models are estimated with the correct latent structure, the M-T-EM algorithm performs really well, and significantly better than the standard EM algorithm. For example, this enhancement is evident in setting C of the HM model with continuous response vari- ables, where we observe that 40 samples reach the global mode with high frequency compared to none with the standard EM algorithm. An analogous improvement is noticeable for the case with 6 latent states but referred to the frequency of convergence to the global maximum more than half the time. In the case of models estimated with the wrong latent structure and many components, we show another important result, not highlighted so far: the number of samples in which the global maximum is almost never reached (< 10% of times) diminishes when the M-T-EM algorithm is employed. We also consider the mean distance from the global mode to measure how far the obtained maximum is from the global one. In particular, although all settings provide similar results, we notice that when dealing with correctly specified models, the mean distance decreases to zero when the M-T-EM algorithm is employed, thus confirming that the global maximum is almost always reached. In Appendix B, all detailed results are provided in Figs. 7, 8, and 9.
Finally, we also provide the mean square error of the estimated model parameters with respect to the true values, once the models are estimated with the correct latent structure. The results, summarized in Table 4, show that the mean square error values are always smaller with the M-T-EM algorithm than with the standard EM algorithm, thus highlighting that the estimated model parameters are more accurate by employing the former.

Results in terms of computational time
Having assessed the good performance of the proposed M-T-EM algorithm in locating the global maximum, we also compare the computational time required for convergence with that required by the EM algorithm for the same simulation settings illustrated above. Tempering constants are chosen as presented in Sect. 4.2. The estimation is performed by employing an Intel(R) Core(TM) i7-8700T CPU @ 2.40GHz Windows desktop with 8 GB of RAM.
The main results, summarized in Table 5, show that when estimating LC and HM models with continuous response variables, the EM and M-T-EM algorithms show very similar computing times. The EM algorithm generally remains the fastest even if the difference with the M-T-EM is negligible. When dealing with correctly specified HM models with continuous response variables, the M-T-EM algorithm is faster than the EM algorithm. Conversely, for the case of the HM model with categorical response variables it is the slowest, requiring up to 6.5 times the computational time of the EM algorithm. These two opposite behaviors are due to the different implementations of the T-EM algorithm: the one for the HM model with categorical responses requires an additional loop to the code with respect to the other two models.

Initialization of the T-EM algorithm
In this section we consider different initialization strategies of the model parameters to evaluate the effect of the different choices in detecting the global maximum and reducing the computational time. For continuous data, as proposed by Leroux and Puterman (1992), and following McLachlan and Basford (1988), we initialize the parameters according to the partition obtained by applying the k-means method (MacQueen 1967). Maruotti and Punzo (2021) inspected this initialization approach and a few others, concluding that the k-means strategy provides the best results. A similar initialization is employed for discrete data applying the k-modes algorithm (Huang 1998). Initial values are computed as follows: • proportion of observations assigned to cluster u at the first time occasion for the initial probabilities (π u ); • proportion of transition (or persistence) estimated from clusterū to cluster u for the transition probabilities (π (t) u|ū ); • proportion of observations assigned to cluster u who responded with category y to the response variable j for the conditional probabilities (φ j y|u ); • maximum likelihood estimator on the observations of cluster u for the mean vectors (μ u ); • maximum likelihood estimator on all the observations under the hypothesis of homoscedasticity for the variance-covariance matrix ( ).
We consider the same samples and starting values used in Sect. 4.2, comparing the performance of the EM and the M-T-EM algorithms. In general, when the estimation of correctly specified models is considered, the standard EM algorithm benefits from the adoption of a k-means initialization using this kind of strategy, therefore, the results obtained with the EM and the M-T-EM algorithms are very similar.
In Table 6, for each scenario, we report the number of samples in which the standard EM algorithm with k-means initialization does not converge to the global maximum, which is instead reached by the M-T-EM algorithm with the same starting values. It is important to remark that M-T-EM algorithm does not behave worse than the standard EM algorithm in all the other samples, but both algorithms converge to the same value. Further analyses conducted on correctly specified HM models with continuous response variables and k = 2 latent states highlight that in such case the global maximum is always reached also by the EM algorithm with k-means initialization.
We also compare random and k-means initializations for the M-T-EM algorithm. The results, summarized in Table 7, show that the k-means initialization works properly. Indeed this strategy significantly reduces the number of iterations required for convergence, and hence the computational time. In particular we report, along with the number of samples in which the M-T-EM algorithm with k-means initialization reaches the global maximum, the average number of iterations required by the two initialization strategies to converge. We notice that apart from some cases with many latent components, the global maximum is almost always reached by the M-T-EM algorithm when initialized with the k-means approach. As for the decrease in the number of iterations, the advantage is particularly evident when dealing with HM model with continuous responses; in this case, it is dropped up to one sixth.
In the case of models where the latent structure is not correctly specified, the situation is less well defined: likewise the previous case, the results obtained comparing EM and M-T-EM algorithms initialized with k-means strategy are very similar for some samples (Table 8), highlighting that the standard EM algorithm may sometimes benefit Table 7 Percentage of samples in which the M-T-EM algorithm with k-means initialization reaches the global maximum and number of iterations until convergence with random and k-means (or k-modes) initialization when the latent structure of the models is correctly specified  from the adoption of this initialization strategy. However, when M-T-EM is employed, this improvement does not always correspond to an advantage when using the k-means initialization with respect to the random one. As shown in Table 9, the number of samples that benefit from this initialization strategy is quite limited and usually does not reach the 50%. Finally, also in this case the k-means initialization provides some benefits from the point of view of the number of iterations until convergence, even if less pronounced than in the case of models with correctly specified latent structures.

The role of the oscillating tempering profile
Although the M-T-EM algorithm ensures significant improvements in terms of ability to detect the global maximum, in some cases the frequency of convergence to this global mode remains inferior to 100%. A possible remedy is represented by the oscillating profile, which is able to explore the parameter space more deeply than the monotonic one. In the following we focus only on the LC model,  where we show the percentage of times the global maximum is reached and the mean distance from the global maximum for the three versions of the algorithm. Employing the oscillating profile, we notice a further improvement compared to the results analyzed in Sect. 4.2: the global maximum is reached on average about 18% of times with the standard EM algorithm, which increases up to 38% with the M-T-EM algorithm, and up to 60% with the oscillating version. It is also interesting to evaluate the number of samples in which the global maximum is reached almost surely (< 95%); this number, as reported in Table 1, was equal to 0 and 2 with EM and M-T-EM algorithms, respectively. Using the O-T-EM algorithm instead it increases to 18 samples. As for the mean distance from the global maximum, we notice that this value decreases accordingly, following the general advantage of the O-T-EM algorithm over the monotonic version. This optimal behavior of the tempered algorithm with oscillating profile results, however, in a much higher computational time, as reported in Table 10. This aspect sometimes makes the employment of the O-T-EM algorithm rather complex; in particular, when it is applied to the HM model with categorical responses, the convergence is extremely slow, and the M-T-EM could be the most appropriate choice.

Analysis of the T-EM algorithm with fixed tempering profile
Lastly, we check the performance of the T-EM algorithm when it is not optimally tuned, but the tempering constants are fixed in advance. With this aim, for each inspected scenario, a short list of different configurations of tempering constants is considered for applying the M-T-EM algorithm to all samples. In the analysis of the results, the tempered version is considered as the best choice only when it outperforms the standard EM algorithm with respect to all the four criteria introduced in Sect. 4.2. Otherwise, if at least one criterion shows a better result with the standard EM algorithm, the latter is preferred. In this way, we carry out a very rigorous analysis.
Tables 19, 20, and 21 in the Appendix C report for each scenario the configuration of tempering constants which exhibits the best performance. Results are highly satisfactory in most cases: given a fixed configuration, the M-T-EM algorithm outmatches the standard version in around 50% of samples in almost all the analyzed scenarios. In other words, once a configuration of tempering constants is set appropriately by a grid-search procedure over a specific sample, it generally remains valid for around 50% of other samples. This percentage increases up to 100% in some scenarios, especially when the latent structure of the model is correctly specified: the considered configuration of tempering constants provides optimal results in all samples. Similar results are achieved in the case of oscillating tempering profile analyzing setting E of the LC model when the latent structure is correctly defined: the best configuration of tempering constants (α = 0.9, β = 50, ρ = 5, and T 0 = 10) performs well with 62% of the considered samples. It is clear that there are still some cases that require experimenting with the tempering constants to yield good performance; however, in our opinion, this represents a first significant improvement that allows avoiding specific settings for models and types of data.

Applications
To explore the performance of the T-EM algorithm when dealing with real-world cases, we apply it to cross-sectional and longitudinal data; we specifically address the problem of selecting the best number of components for LC and HM models.

Evaluation of anxiety and depression
We consider data derived from the administration of 14 ordinal items measuring anxiety and depression in a sample of 201 Italian oncological patients (Zigmond and Snaith 1983). Items are measured according to four response categories ranging from 0 to 3 and corresponding to the lowest and to the highest level of anxiety or depression, respectively. Data are available in the R package MultiLCIRT ).
The LC model allows to discover subpopulations of patients with similar intensity levels of these two pathologies. The model is estimated with both EM and T-EM algorithms with a number of latent components k ranging from 1 to 4 to perform model selection. The Bayesian Information Criterion (Schwarz 1978, B I C) is employed at this purpose penalizing the maximized log-likelihood function for the model complexity.
For the M-and O-T-EM algorithms the following two configurations of tempering constants are used and held fixed over the values of k: α = 42 and β = 1.5 for the monotonic version, and ρ = 90, τ 0 = 10, β = 20, and α = 0.8 for the oscillating one. In the following, we show the results only for values of k for which there is a significant difference on the global maximum reached by employing the EM and the T-EM algorithms. Figure 5 refers to the maximum log-likelihood values reached by each algorithm for every model. As it is evident, while the EM algorithm spreads out over a wide range of values, both tempered algorithms always converge to a single value appearing as the global mode.
Results based on the O-T-EM algorithm are reported in Table 11, where it can be seen that the optimal number of components corresponding to the minimum value of B I C is three. It is important to remark that the results are always obtained using the same configuration of tempering constants as presented above. Therefore, we highlight again the considerable level of flexibility of the proposed method.

Discovering criminal trajectories
We consider longitudinal data on conviction histories of a cohort of n = 10, 000 offenders followed from the age of criminal responsibility (10 years) until age 40. As described in Research Development and Statistics Directorate (1998), offenses are grouped into the following 10 typologies: violence against the person, sexual offenses, burglary, robbery, theft and handling stolen goods, fraud and forgery, criminal damage, drug offenses, motoring offenses, and other offenses. Binary response variables (r = 10) indicate if the offender has committed a crime during six age bands (T = 6) of length equal to five years. An HM model was proposed for the analysis of these data in Bartolucci et al. (2007) and Pennoni (2014) to identify typologies of criminal behavior and types of criminal career specialization over time.
Results of estimating a time heterogeneous HM model with the M-T-EM algorithm for a number of states ranging from 1 to 5 are reported in Table 12. The optimal number of latent states corresponding to the minimum value of BIC is four. The M-T-EM algorithm with parameters α = 2 and β = 1.5 is compared with the EM algorithm according to the same procedure illustrated in Sect. 4.2: for each value of k, 100 different starting values are randomly chosen to initialize both versions of the algorithm. As shown in Table 13, when the chosen HM model is estimated, even in this context, the T-EM guarantees better performance. More specifically, with the proposed algorithm, the frequency of global maximum is higher: the M-T-EM algorithm reaches the global mode 98 times, while the standard EM algorithm only 73. Moreover, the mean distance from the global optimum decreases to almost zero, and the mean of log-likelihood values increases accordingly; only the median value remains essentially unchanged, with just a very slight enhancement.

Analyzing countries development
We consider data obtained from the World Bank's World Development Indicators (The World Bank Group 2018) on n = 175 countries collected for T = 5 years (from 2011 to 2015) on r = 6 continuous response variables: life expectancy at birth, total population between the ages 0-14, percentage of population with access to electricity, percentage of population using the internet, share of electricity generated by renewable power plants, and fertility rate. A logit transformation is applied to the variables expressed in a percentage scale, and a Box-Cox transformation (Box and Cox 1964) to all the variables. Results of the estimation of a time heterogeneous HM model on the transformed data with the O-T-EM algorithm for a number of states ranging from 1 to 10 are reported in Table 14. In order to check the assumption on the conditional distribution we check the posterior density of each response variable once the units are allocated according to maximum a posteriori rule; results (available from the authors upon request) seem satisfactory. In this case, the advantages of using the tempering approach are even more evident: 1. it guarantees convergence to the global maximum. Indeed, for most values of k, the maximized log-likelihood value is higher than that of the EM algorithm, showing that the standard EM algorithm cannot correctly detect such a value. Moreover, the mean distance from the global maximum also shows significant improvements, with the standard EM leads us to choose eight components, whereas the T-EM algorithm suggests seven components. BIC values are always smaller than those obtained with the standard algorithm; 3. it exhibits an appealing level of flexibility: there is no need to change the optimal set of tempering constants (fixed at α = 0.6, β = 110, ρ = 5, and τ 0 = 20) once the HM model is fitted for a number of states ranging from 5 to 10. For values of k from 2 to 4, another unique configuration of tempering constants proves to be the best (α = 0.5, β = 120, ρ = 5, and τ 0 = 10).
Focusing on the log-likelihood values shown in Fig. 6 related to the selected model with seven states, we notice that the O-T-EM algorithm always avoids lower values in favor of the higher ones of the maximized log-likelihood. These are reached much more frequently with respect to the EM algorithm.
As already illustrated with the simulation study presented in Sect. 4 and also shown in Table 15, the O-T-EM algorithm is more demanding in terms of computational time with respect to the EM algorithm; however, it has superior performance. Moreover, we notice that on average, a single execution of the T-EM algorithm requires the same time of approximately 10 runs of the standard algorithm. It is important to note that after 1,000 executions performed with 1,000 different random starting values, the EM algorithm is still unable to detect the global maximum (according to the definition provided by the first criterion in Sect. 4) obtained with the O-T-EM algorithm, and equal to −15,821.86, since its highest reached value is −15,834.97. Neither a higher number of random starting values (up to 10,000 in our study), nor the k-means initialization strategy allows us to improve its performance.

Conclusions
The likelihood of discrete latent variable models is typically multimodal, and convergence to a point that it is not the global maximum is a severe limitation of all the algorithms employed for maximum likelihood estimation of the model parameters. To reduce the chance of local maxima at convergence when the expectation-maximization (EM) algorithm is employed, the model parameters are typically initialized with a multiple-try strategy, employing deterministic and random values. Then, maximum likelihood estimate of the parameters corresponds to the highest log-likelihood at convergence of the algorithm. In this paper, a new powerful estimation algorithm based on annealing and tempering techniques is proposed in this context. The underlying idea of the tempered EM (T-EM) algorithm is flattening the target function and then gradually warping it back towards the original one. The ability of the algorithm to remain close enough to the dominant maximum is related to the slowness and the graduation of the warping process, which, in turn, is controlled by a sequence of parameters known as the temperature or tempering profile. Two main classes of such profiles usable with many models to be estimated are tested and compared: a monotonically decreasing exponential profile, easy to tune, and an oscillating profile, having more parameters to tune and ensuring best performances with a very high level of flexibility.
An accurate Monte Carlo simulation study is carried out considering two general classes of discrete latent variable models: latent class and hidden Markov models. We compare the performance of the standard EM algorithm with the proposed ones. This comparison is carried out by evaluating the ability to reach the global maximum and the computational time. From the results of the simulation study and those of the applications we show that the proposed algorithms outperform the standard EM, increasing significantly the chance to get to the global maximum in the overwhelming majority of cases. In particular, when an optimally tuned tempering profile is employed, the improvement with respect to the EM algorithm is remarkable: the T-EM algorithm can reach the global mode with a high frequency, generally escaping all local suboptimal maxima. We detect that the variant with the oscillating profile shows the best performance, sightly outperforming also the monotonic version in most cases.
Estimating the models with the proposed algorithms on categorical and continuous data, having a cross-sectional or longitudinal structure, we also show their good performance in choosing the proper number of latent components. According to the results obtained for the HM model we argue that the proposal may be especially useful for the estimation of the model parameters with complex data structures involving the inclusion of covariates, missing values, and drop-out.
An additional appealing feature of the proposal is the high level of flexibility of the tempering profiles: once a grid-search procedure is employed to set the tempering constants, these constants remain valid also when data with similar characteristics are used to estimate the model parameters. Moreover, a broad range of values generally performs optimally in many different applied contexts.
Future works may consider the relevant issue of finding a new family of tempering profiles that combine the excellent performance of the oscillating profile with the simple tuning procedure and the fast execution time of the monotonic profile. Other relevant research directions include the exploration of the T-EM algorithm in connection with other maximization algorithms; the most natural choice in this regard is to apply a tempering approach to a direct maximization algorithm, such as Newton-Raphson. The algorithm would also benefit from a more efficient implementation, through the C++ language in order to reduce the computation time. Finally, another possible research line would be to explore and compare the performance of genetic algorithms (Pernkopf and Bouchaffra 2005) with the proposed tempering techniques.  • conditional response probabilities are kept fixed considering scenario A (see Tables  16, 17,  • for the HM model with continuous response variables, the same conditional distribution holds for all response variables; for example, with k = 3 latent states, the mean vector μ = [−2, 0, 2] is fixed for each response variable; • the variance-covariance matrix is computed as the sample covariance matrix of the data.    Table 17 of the Appendix A for the HM model with categorical response variables

B Additional simulation results
In this section we report additional details on the results of the simulation study in Sect. 4.2. In particular, for each considered scenario (see Tables 16, 17, and Table 18 of the Appendix A for the HM model with continuous response variables

C Numerical results for the analysis of T-EM algorithm with fixed tempering profiles
In this section we present results obtained from the simulation studies comparing the EM algorithm with the T-EM algorithm with fixed tempering profiles; the analysis carried out on the basis of the results is reported in Sect. 4.6. See Tables 19, 20, and 21.