Tempered expectation-maximization algorithm for the estimation of discrete latent variable models

Brusa, Luca; Bartolucci, Francesco; Pennoni, Fulvia

doi:10.1007/s00180-022-01276-7

Tempered expectation-maximization algorithm for the estimation of discrete latent variable models

Original paper
Open access
Published: 07 October 2022

Volume 38, pages 1391–1424, (2023)
Cite this article

Download PDF

You have full access to this open access article

Computational Statistics Aims and scope Submit manuscript

Tempered expectation-maximization algorithm for the estimation of discrete latent variable models

Download PDF

2521 Accesses
Explore all metrics

A Publisher Correction to this article was published on 17 January 2023

This article has been updated

Abstract

Maximum likelihood estimation of discrete latent variable (DLV) models is usually performed by the expectation-maximization (EM) algorithm. A well-known drawback is related to the multimodality of the log-likelihood function so that the estimation algorithm can converge to a local maximum, not corresponding to the global one. We propose a tempered EM algorithm to explore the parameter space adequately for two main classes of DLV models, namely latent class and hidden Markov. We compare the proposal with the standard EM algorithm by an extensive Monte Carlo simulation study, evaluating both the ability to reach the global maximum and the computational time. We show the results of the analysis of discrete and continuous cross-sectional and longitudinal data referring to some applications of interest. All the results provide supporting evidence that the proposal outperforms the standard EM algorithm, and it significantly improves the chance to reach the global maximum. The advantage is relevant even considering the overall computing time.

Dirichlet process and its developments: a survey

Article 01 February 2022

Linear mixed model with Laplace distribution (LLMM)

Article 29 March 2016

R-VGAL: a sequential variational Bayes algorithm for generalised linear mixed models

Article Open access 06 April 2024

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

A latent variable model is a statistical model in which the distribution of the response variables is affected by one or more variables that are not directly observable. Here, we consider two special classes of discrete latent variable (DLV) models (Bartolucci et al. 2022) that are frequently employed to analyze continuous and categorical response variables.

The latent class (LC) model (Lazarsfeld and Henry 1968; Goodman 1974; Lindsay et al. 1991) assumes individual-specific latent variables having a discrete distribution with a finite number of support points. The hidden (or latent) Markov (HM) model (Zucchini and Guttorp 1991; Bartolucci et al. 2013; Zucchini et al. 2016) represents a generalization of the LC model to the case of longitudinal data and the latent process is frequently assumed as a first-order Markov chain. Both models are used as model-based clustering methods, and in particular, the HM model allows a dynamic clustering where each unit may move between clusters across time.

Maximum likelihood estimation (MLE) of DLV models is usually performed by using the expectation-maximization (EM) algorithm (Baum et al. 1970; Dempster et al. 1977; McLachlan and Krishnan 2008). This approach is straightforward to implement, and it is available in many software packages; among others, we mention MultiLCIRT (Bartolucci et al. 2014) and LMest (Bartolucci et al. 2017) in the R software (R Core Team 2022) for the estimation of LC and HM models, respectively.

A particular drawback of MLE is related to the multimodality of the log-likelihood function which is especially observed with the DLV models. Consequently, the EM algorithm could converge to a local maximum, not corresponding to the global one. Multi-start strategies employing both deterministic and random rules to initialize the model parameters are generally adopted. Although this approach encourages a more accurate exploration of the parameter space, it is computationally intensive and does not ensure that the global optimum is reached. For an overview of different initialization strategies, some of which are based on a preliminary cluster analysis (Everitt et al. 2011), see, among others, Maruotti and Punzo (2021).

Tempering and annealing (Sambridge 2014) constitute a broad family of optimization methods; by means of a parameter known as temperature, they allow us to re-scale the target function and monitor the prominence of all possible maxima. In particular, these procedures are gradually attracted towards the global optimum by accurately defining a sequence of temperature values. The alternation of high and low values of the temperature allows us to deal with two opposite but fundamental issues: on one side, the algorithm is led to explore broad areas of the parameter space, thus escaping local sub-optimal modes (high temperatures); on the other side, the algorithm is able to perform a sharp optimization of the target function in a small area of the parameter space (low temperatures).

The following different tempering methods are defined according to the choices of temperature sequences. Simulated annealing (Kirkpatrick et al. 1983) makes use of a strictly decreasing temperature sequence: the initial temperature is sufficiently high so that the re-scaled function is relatively flat, and it decreases at each step, gradually restoring the original function. Simulated tempering (Geyer and Thompson 1995) assumes that the temperature may either increase or decrease according to a stochastic rule: a new proposed temperature level may be accepted or rejected according to a specific probability, and the process describing the temperature evolution follows a Markov chain. Parallel tempering (Geyer 1991; Falcioni and Deem 1999; Earl and Deem 2005) assumes an ensemble of Markov chains across all levels of the temperature sequence: at specified intervals, a swap between a pair of neighboring chains is proposed and accepted or rejected according to a certain probability.

Tempering techniques are employed, among others, in Barbu and Zhu (2013) and Robert et al. (2018) for simulating from complex multimodal statistical distributions by means of Markov chain Monte Carlo methods (Metropolis et al. 1953; Hastings 1970). On the other hand, the use of these procedures within the EM algorithm is quite scarce. Hofmann (1999) proposed tempering techniques for the EM algorithm in the context of probabilistic latent semantic analysis. For what concerns finite Gaussian mixture models, recently, Lartigue et al. (2022) proposed a general class of deterministic approximated versions of the EM algorithm following previous proposals in Yuille et al. (1994), Ueda and Nakano (1998), and Zhou and Lange (2010).

In the following, dealing with DLV models, we propose a general approach. In particular, we explicitly focus on LC and HM models because these are among the most utilized LDV models in data analysis. However, the proposal can easily be adapted to the aforementioned finite mixture models and to other DLV models. We explore two different temperature sequences, including a non-monotone one, also evaluating the computational time efficiency. Up to our knowledge, for the first time, we deal with the problem of temperature sequence tuning, inspecting the performance of the tempered EM (T-EM) algorithm with both optimally tuned and fixed temperature sequences. Finally, we show the behavior of the algorithm for the selection of the optimal model. The implemented code for the proposal is written for the open source software R (R Core Team 2022). It is based on some functions of the package LMest (Bartolucci et al. 2017), and it is available at the following link in the GitHub repository: https://github.com/LB1304/T-EM.

The remainder of the paper is organized as follows. In Sect. 2 we outline the LC and HM model formulations and the MLE of the model parameters through the EM algorithm. In Sect. 3 we provide details on the proposed T-EM algorithm for both models. In Sect. 4 we summarize the main findings of an extensive simulation study aimed to assess the performance of the proposal by comparing it with the standard EM algorithm for many different scenarios. We also evaluate the proposed algorithm in connection with different initialization strategies, and compare the overall computing time. In Sect. 5 we apply the T-EM algorithm to estimate LC and HM models using a variety of data types. In Sect. 6 we provide some conclusions. Appendix A supplies more details on the settings used for the simulation studies, while Appendices B and C provide additional simulation results. Finally, Supplementary Information (SI) contains the full outcomes of every sample under each simulated scenario.

2 Model formulation

In the following, mainly borrowing from Bartolucci et al. (2013) we briefly summarize model notations and implementations of the standard MLE of the model parameters carried out through the EM algorithm; see also Bartolucci et al. (2014) and Pandolfi et al. (2021).

2.1 Latent class model

Considering cross-sectional data and for a single individual, let $ {\varvec{Y}} = (Y_1, \ldots , Y_r)' $ denote the vector of response variables; we assume that each variable $ Y_j $ is categorical with the same number c of categories, labeled from 0 to $ c-1 $. Note that the formulation of the model may be easily adapted to the case of continuous response variables. The LC model relies on a single latent variable U with k support points that identify the latent classes in the population, labeled from 1 to k. According to the assumption of local independence, the response variables are conditionally independent given the latent variable. The model parameters are the weight of each latent class, denoted by $ \pi _u = p(U = u) $, $ u = 1, \ldots , k $, and the conditional probability of each response variable given the latent variable, denoted by $ \phi _{jy \vert u} = p(Y_j = y \vert U = u) $, for $y = 0, \ldots , c-1 $, $ j = 1, \ldots , r $, and $ u = 1, \ldots , k $.

In order to estimate the model parameters, collected in the vector $ \varvec{\theta } $, on the basis of a sample of n independent observations $ \varvec{y}_i, i = 1, \ldots , n$, the incomplete data log-likelihood denoted as $\ell (\varvec{\theta })$ is maximized considering the complete data log-likelihood, given by

$$\begin{aligned} \ell ^*(\varvec{\theta }) = \sum _{j = 1}^{r}\sum _{u = 1}^{k}\sum _{y = 0}^{c-1} a_{juy} \log \phi _{jy \vert u} + \sum _{u = 1}^{k} b_u \log \pi _u, \end{aligned}$$

where $ a_{juy} = \sum _{i = 1}^{n} I(u_i = u, y_{ij} = y) $ is the frequency of subjects that are in latent class u and respond by y at the j-th response variable, and $ b_u = \sum _{i = 1}^{n} I(u_i = u) $ is the number of sample units in latent class u, with $ I(\cdot ) $ denoting the indicator function.

2.2 Hidden Markov model

With reference to longitudinal data and for a single individual, let $ \varvec{Y}^{(t)} = (Y_1^{(t)}, \ldots , Y_r^{(t)})' $ denote the occasion-specific response variables for each time $ t = 1, \ldots , T $, and let $ \varvec{Y} $ denote the vector of responses, which is made of the union of the vectors $ \varvec{Y}^{(t)} $, $ t = 1, \ldots , T $. Given a latent process $ \varvec{U} = ( U^{(1)}, \ldots , U^{(T)} )' $ having a discrete distribution with k states, the latent model parameters are the initial probabilities, denoted by $ \pi _u = p(U^{(1)} = u) $, $ u = 1, \ldots , k $, and the transition probabilities denoted by $ \pi _{u \vert {\bar{u}}}^{(t)} = p(U^{(t)} = u \vert U^{(t-1)} = {\bar{u}}) $, $ t = 2, \ldots , T $, $ {\bar{u}}, u = 1, \ldots , k $. Note that it is possible to include a constraint corresponding to the hypothesis that the latent process is time homogeneous so that the transition probabilities do not depend on time occasion t: $ \pi _{u \vert {\bar{u}}}^{(t)} =\pi _{u \vert {\bar{u}}}$, $ t = 2, \ldots , T $.

The HM model in its basic formulation (Bartolucci et al. 2013) relies on the following three main assumptions, which can be suitably relaxed:

$ \varvec{Y}^{(1)}, \ldots , \varvec{Y}^{(T)} $ are conditionally independent given $ \varvec{U} $;
$ Y_1^{(t)}, \ldots , Y_r^{(t)} $ are conditionally independent given $ U^{(t)} $, for $ t = 1, $..., T;
$ \varvec{U} $ follows a first-order Markov chain with state space $\{1, \ldots , k\}$, where k is the number of latent states.

2.2.1 Hidden Markov model with categorical response variables

Let $ Y_j^{(t)} $, $ j = 1, \ldots , r $, $ t = 1, \ldots , T $, denote the categorical response variable with c categories, where the conditional probabilities are defined as in Section 2.1.

Given a sample of n observations, the complete data log-likelihood is expressed as

$$\begin{aligned} \ell ^*(\varvec{\theta }) = \sum _{j=1}^{r} \sum _{t=1}^{T} \sum _{u=1}^{k}\sum _{y=0}^{c-1} a_{juy}^{(t)} \log \phi _{jy \vert u} + \sum _{u=1}^{k} b_u^{(1)}\log \pi _u +\sum _{t=2}^{T}\sum _{{\bar{u}}=1}^{k}\sum _{u=1}^{k} b_{{\bar{u}}u}^{(t)} \log \pi _{u \vert {\bar{u}}}^{(t)}, \end{aligned}$$

where $ a_{juy}^{(t)} = \sum _{i=1}^{n} I(u_i^{(t)} = u, \ y_{ij}^{(t)} = y) $ is the number of subjects that, at time occasion t, are in latent state u and have outcome y for the j-th response variable, $ b_{u}^{(t)} = \sum _{i=1}^{n} I(u_i^{(t)} = u) $ is the number of subjects in latent state u at time occasion t, and $ b_{{\bar{u}}u}^{(t)} = \sum _{i=1}^{n} I(u_i^{(t-1)} = {\bar{u}}, u_i^{(t)} = u) $ is the number of subjects that move from latent state $ {\bar{u}} $ to latent state u at time occasion t.

2.2.2 Hidden Markov model with continuous response variables

The response vectors $ \varvec{Y}^{(t)} $, $ t = 1, \ldots , T $, are assumed to follow a conditional Gaussian distribution, that is,

$$\begin{aligned} \varvec{Y}^{(t)} \vert U^{(t)} = u \sim {\mathcal {N}}(\varvec{\mu }_u, \varvec{\Sigma }), \quad u = 1, \ldots , k, \end{aligned}$$

with state-specific mean vectors $\varvec{\mu }_u \in {\mathbb {R}}^r $, $ u = 1, \ldots , k $, and variance-covariance matrix $ {\varvec{\Sigma }} \in {\mathbb {R}}^{r \times r} $ constant across latent states under the assumption of homoscedasticity. This latter assumption may be relaxed to allow for heteroscedasticity across latent states.

The complete data log-likelihood function is

$$\begin{aligned} \ell ^*(\varvec{\theta })&= \sum _{i = 1}^{n}\sum _{t = 1}^{T}\sum _{u = 1}^{k} z_{iu}^{(t)} \log f(\varvec{y}_i^{(t)} \vert u)\\&\quad + \sum _{i = 1}^{n}\sum _{u = 1}^{k} z_{iu}^{(1)} \log \pi _u + \sum _{i = 1}^{n}\sum _{t = 2}^{T}\sum _{{\bar{u}} = 1}^{k}\sum _{u = 1}^{k} z_{i{\bar{u}}u}^{(t)} \log \pi _{u \vert {\bar{u}}}^{{(t)}}, \end{aligned}$$

where $ f(\varvec{y}_i^{(t)} \vert u) $ denotes the probability density function of a multivariate Gaussian distribution with parameters $ \varvec{\mu }_u $ and $ \varvec{\Sigma } $, $ z_{iu}^{(t)} = I(u_i^{(t)} = u) $ is an indicator function equal to 1 if subject i is in latent state u at time occasion t, and $ z_{i{\bar{u}}u}^{(t)} = I(u_i^{(t-1)} = \bar{u}, u_i^{(t)} = u) $ is an indicator function equal to 1 if subject i is in latent state $ {\bar{u}} $ at time $ t-1 $ and moves to latent state u at time t.

2.3 Expectation-maximization algorithm

Maximum likelihood estimation of model parameters is performed through the EM algorithm. Once the parameters are initialized, the EM algorithm alternates the following steps until a suitable convergence criterion is satisfied:

E-step: compute the conditional expected value of $ \ell ^*(\varvec{\theta }) $ given the observed data and the value of the parameters at the previous step:
$$\begin{aligned} {\mathcal {Q}}(\varvec{\theta }; \varvec{\theta }^{(h-1)}) = {\mathbb {E}}_{\varvec{\theta }^{(h-1)}}[\ell ^*(\varvec{\theta }) | \varvec{y}]; \end{aligned}$$
M-step: maximize the expected value $ {\mathcal {Q}}(\varvec{\theta }; \varvec{\theta }^{(h-1)}) $ and so update the model parameters:
$$\begin{aligned} \varvec{\theta }^{(h)} = \underset{\varvec{\theta }}{\arg \max } \, {\mathcal {Q}}(\varvec{\theta }; \varvec{\theta }^{(h-1)}). \end{aligned}$$

The computation of the expected values at the E-step is based on the following conditional probabilities, generically referred to as $ q(\cdot )$. For the LC model we consider $ q(u \vert \varvec{y}) = p(U = u \vert \varvec{Y} = \varvec{y}) $, while for the HM model we define $ q^{(t)}(u \vert \varvec{y}) = p(U^{(t)} = u \vert \varvec{Y} = \varvec{y}) $ and $ q^{(t)}({\bar{u}}, u \vert \varvec{y}) = p(U^{(t-1)} = {\bar{u}}, \ U^{(t)} = u \vert \varvec{Y} = \varvec{y}) $. In the following section, we present some details on the EM algorithm, and we show the tempering technique.

3 Tempered expectation-maximization algorithm

The T-EM algorithm is implemented by adjusting the computation of the expected frequencies in the E-step. In the following we define some general rules for the tempering constants, and we show details of the T-EM algorithm for the LC and HM models.

The family of tempered probabilities has the following expression:

$$\begin{aligned} {\tilde{q}}^{(\tau )}(\cdot ) { = m^{-1} } q(\cdot )^{\nicefrac {1}{\tau }}, \end{aligned}$$

(1)

where $ q(\cdot ) $ denotes the original conditional probability, $ \tau $ is a suitable parameter, known as temperature and varying over the interval $ [1, +\infty ) $, and m is a normalizing constant. At each E-step of the T-EM algorithm, the conditional expected frequencies are computed accordingly. Regarding the temperature, the choice $ \tau \rightarrow +\infty $ yields $ {\tilde{q}}^{(\tau )}(\cdot ) $ to a uniform distribution, while $ \tau = 1 $ recovers the original posterior probability $ q(\cdot ) $. Therefore, we define a sequence of temperature values $ (\tau _h)_{h\ge 1} $, where h is the algorithm iteration number, so that: (i) the initial temperature $ \tau _1 $ is sufficiently large, implying that the corresponding tempered distribution $ {\tilde{q}}^{(\tau _1)}(\cdot ) $ is relatively flat and (ii) the temperature value $ \tau _h $ tends towards 1 as the algorithm iteration counter increases. The resulting sequence, denoted as tempering profile, guarantees a proper convergence of the algorithm (Lartigue et al. 2022).

We consider the following two tempering profiles:

a monotonically decreasing exponential profile, which is defined as
$$\begin{aligned} \tau _h = 1 + e^{\beta - \nicefrac {h}{\alpha }}, \end{aligned}$$
(2)
where $ \alpha \ge 1 $ and $ \beta \ge 0 $ are two constants chosen so as to ensure flexibility in the profile shape;
a non-monotonic profile with oscillations of gradually smaller amplitude, which is expressed as
$$\begin{aligned} \tau _h = \tanh \left( \frac{h}{2\rho } \right) + \left( \tau _0 - \beta \ \frac{2\sqrt{2}}{3\pi } \right) \ \alpha ^{\nicefrac {h}{\rho }} \beta \ \text {sinc}\left( \frac{3\pi }{4} + \frac{h}{\rho } \right) , \end{aligned}$$
(3)
with constants $ \beta , \rho , \tau _0 > 0 $ and $ 0< \alpha < 1 $. This profile has more parameters to tune, but it guarantees a very high level of flexibility. Here $ \tanh (\cdot ) $ indicates the hyperbolic tangent, while $ \text {sinc}(x) = \sin (\pi x)/(\pi x)$ (with $ \text {sinc}(0) = 1 $) denotes the normalized sine cardinal function. In this case, the sequence $ (\tau _h)_{h \ge 1} $ may assume values that are smaller than 1 or even negative; although this is not an issue from a strictly mathematical perspective, a tempering step with negative temperature lacks a proper interpretation. Therefore, in practice, we can force the tempering profile to be always greater than or equal to 1 by taking $ \tau _h = \max \{\tau _h, \delta \} $, with $ \delta \ge 1 $ (in this work we fix $ \delta = 1 $).

The abbreviations M-T-EM and O-T-EM are used for monotonic (2) and oscillating (3) tempering profiles, respectively.

3.1 Tuning of tempering profiles

The selection of optimal tempering constants for both profiles may be carried out through a grid-search procedure; in the following, the term grid will denote the sequence of values considered for a constant, while the term step-size will refer to the distance between two consecutive values.

For the monotonic profile the only two constants are simple to interpret: $ \beta $ controls the value of the initial temperature, while $ \alpha $ adjusts the decrease rate of the temperature. Lower values of both make the contribution of tempering insignificant; at the extreme, $ \alpha = 1 $ and $ \beta = 0 $ recover the standard EM algorithm. Although it is not possible to provide precise and rigorous rules for the selection of these constants, some guidelines hold in general: (i) avoid very high values of $\alpha $ and $\beta $. Indeed, beyond certain values, the target function can not be flattened further, and only the computational time would increase. This sort of “threshold” values are unfortunately data-dependent, but we recommend not exceeding $\alpha = 15$ and $\beta = 5$; (ii) choose step-sizes for each grid such that the distance between two consecutive values of $\alpha $ will result much smaller than the one between two successive values of $\beta $. Indeed, the monotonic profile is much more sensitive to variations in $\alpha $ than in $\beta $; we suggest, for example, a ratio of about 1:10; (iii) avoid increasing $\beta $ without a corresponding growth of $\alpha $ (while the opposite has no shortcomings). This would lead to a fast decrease in the value of the temperature; accordingly, the target function would not be warped back to its original shape in a gradual way, and the algorithm could possibly be brought far from the global mode; (iv) typically, for each type of data there are many possible suitable tempering configurations, and an important step is to locate a rough range for the constants. After that, although the tuning process can be further refined, most of the tempering configurations chosen within that range would provide good results; (v) various factors such as number of observations, of response variables, and of latent components would guide the choice of this “unrefined” range. For example, estimating a model with many latent components typically requires higher values of $\alpha $ and $\beta $ with respect to a model with fewer components.

The same guidelines illustrated above should also be taken into account for the oscillating profile, where, however, there are more constants to tune. Their practical interpretation is, in this case, slightly different: $T_0$ controls the initial temperature, $\rho $ the distance between two consecutive peaks of the profile, $\beta $ the amplitude of the oscillations, and $\alpha $ the global decrease rate.

The following steps for tuning the tempering profile are derived from the aforementioned rules and are successfully employed to estimate the models for the applications presented in Sect. 5:

(1)
define grids for all the tempering constants, starting with large step-sizes;
(2)
estimate the model using the T-EM algorithm with these “unrefined” grids for the tempering constants employing a much smaller number of starting values with respect to that required with the standard EM algorithm;
(3)
identify the optimal tempering constants by comparing values of the log-likelihood function at convergence;
(4)
improve the tuning procedure, if necessary, in a smaller region of the tempering constants space and repeat the same procedure (points 2 and 3) using the same small number of different starting values.

A final note, which is effective for both profiles, is that in order to achieve a proper convergence, the algorithm needs to be run until the temperature is steadily close to 1. After that, the last step is conducted with the temperature precisely equal to 1 in order to retrieve the shape of the original log-likelihood function. Typically, this approach increases the number of steps that are required for the algorithm to converge, especially in the case of the oscillating profile. The code written for this proposal is implemented in R and it is freely available at the following link in the GitHub repository: https://github.com/LB1304/T-EM.

3.2 T-EM algorithm for the latent class model with categorical response variables

In the following, we provide some details of the tempered distribution (1) defined for the LC model with categorical response variables considering a suitable tempering profile $ \tau _h $:

$$\begin{aligned} {\tilde{q}}^{(\tau _h)}(u \vert \varvec{y}_i) = \frac{q(u \vert \varvec{y}_i)^{\nicefrac {1}{\tau _h}}}{\sum _{{v} = 1}^{k} q({v} \vert \varvec{y}_i)^{\nicefrac {1}{\tau _h}}}. \end{aligned}$$

The corresponding pseudo-code is shown in the box Algorithm 1. The E- and M-step of the T-EM algorithm are implemented as follows:

E-step: compute the conditional expected values of $ a_{juy} $ and $ b_u $ revised according to the rules
$$\begin{aligned} {\tilde{b}}^{(\tau _h)}_u = \sum _{i = 1}^{n} {\tilde{q}}^{(\tau _h)}(u \vert \varvec{y}_i) \qquad \text {and} \qquad {\tilde{a}}^{(\tau _h)}_{juy}= \sum _{i = 1}^{n} I(y_{ij} = y) {\tilde{q}}^{(\tau _h)}(u \vert \varvec{y}_i) \end{aligned}$$
to obtain the conditional expected value $ {\mathcal {Q}}(\varvec{\theta }; \varvec{\theta }^{(h-1)}) $.
M-step: maximize $ {\mathcal {Q}}(\varvec{\theta }; \varvec{\theta }^{(h-1)}) $, thus updating the parameters as:
$$\begin{aligned} \pi ^{(\tau _h)}_u = \frac{{\tilde{b}}^{(\tau _h)}_u}{n} \qquad \text {and} \qquad \phi ^{(\tau _h)}_{jy \vert u} = \frac{{\tilde{a}}^{(\tau _h)}_{juy}}{{\tilde{b}}^{(\tau _h)}_u}. \end{aligned}$$

3.3 T-EM algorithm for the hidden Markov model with categorical response variables

A more refined formulation for the tempered distribution in (1) is required to estimate the HM model. Once the tempering profile $ \tau _h $ is chosen, we obtain the following tempered distributions:

$$\begin{aligned} {\tilde{q}}^{(t; \tau _h)}(u \vert \varvec{y}_i) = \frac{q^{(t)}(u \vert \varvec{y}_i)^{\nicefrac {1}{\tau _h}}}{\sum _{{v}=1}^{k} q^{(t)}({v} \vert \varvec{y}_i)^{\nicefrac {1}{\tau _h}}} \end{aligned}$$

and

$$\begin{aligned} {\tilde{q}}^{(t; \tau _h)}({\bar{u}}, u \vert \varvec{y}_i) = \frac{q^{(t)}({\bar{u}}, u \vert \varvec{y}_i)^{\nicefrac {1}{\tau _h}}}{\sum _{\bar{v}=1}^{k}\sum _{{v}=1}^{k} q^{(t)}({{\bar{v}}, v} \vert \varvec{y}_i)^{\nicefrac {1}{\tau _h}}}. \end{aligned}$$

The pseudo-code is shown in the box Algorithm 2. In this setting, the steps of the T-EM algorithm are:

E-step: compute the revised conditional expected value of every frequency $ a_{juy}^{(t)} $, $ b_{u}^{(t)} $, and $ b_{{\bar{u}}u}^{(t)} $ , so as to obtain the conditional expected value $ {\mathcal {Q}}(\varvec{\theta }; \varvec{\theta }^{(h-1)}) $; in particular, we have the following explicit expressions:
$$\begin{aligned} \begin{aligned} {\tilde{a}}_{juy}^{(t; \tau _h)}&= \sum _{i=1}^{n} I(y_{ij}^{(t)} = y) {\tilde{q}}^{(t; \tau _h)}(u \vert \varvec{y}_i),\\ {\tilde{b}}_u^{(t; \tau _h)}&= \sum _{i=1}^{n} {\tilde{q}}^{(t; \tau _h)}(u \vert \varvec{y}_i),\\ {\tilde{b}}_{{\bar{u}}u}^{(t; \tau _h)}&= \sum _{i=1}^{n} {\tilde{q}}^{(t; \tau _h)}({\bar{u}}, u \vert \varvec{y}_i). \end{aligned} \end{aligned}$$
Similarly to the standard EM algorithm, posterior probabilities $ {\tilde{q}}^{(t; \tau _h)}(u \vert \varvec{y}_i) $ and $ {\tilde{q}}^{(t; \tau _h)}({\bar{u}}, u \vert \varvec{y}_i) $ may be efficiently computed by a backward recursion; see Bartolucci et al. (2013, pp 61–64) for further details.
M-step: by maximizing $ {\mathcal {Q}}(\varvec{\theta }; \varvec{\theta }^{(h-1)}) $ update the parameters as follows:
$$\begin{aligned} \pi _u^{(\tau _h)} = \frac{{\tilde{b}}_u^{(1; \tau _h)}}{n}, \qquad \pi _{u \vert {\bar{u}}}^{(t; \tau _h)} = \frac{{\tilde{b}}_{{\bar{u}}u}^{(t; \tau _h)}}{{\tilde{b}}_{{\bar{u}}}^{(t-1; \tau _h)}}, \qquad \text {and} \qquad \phi _{jy \vert u}^{(\tau _h)} = \frac{\sum _{t=1}^{T} {\tilde{a}}_{juy}^{(t; \tau _h)}}{\sum _{t=1}^{T} {\tilde{b}}_u^{(t; \tau _h)}}. \end{aligned}$$

3.4 T-EM algorithm for hidden Markov model with continuous response variables

Regarding the HM model with continuous response variables, the pseudo-code is shown in the box Algorithm 3. Similarly to the previous case, the steps of the resulting T-EM algorithm are as follows:

E-step: compute the conditional expected value ${\mathcal {Q}}(\varvec{\theta }; \varvec{\theta }^{(h-1)}) $ considering $ z_{iu}^{(t)} $ and $ z_{i{\bar{u}}u}^{(t)} $:
$$\begin{aligned} {\tilde{z}}_{iu}^{(t; \tau _h)} = {\tilde{q}}^{(t; \tau _h)}(u \vert \varvec{y}_i) \qquad \text {and} \qquad {\tilde{z}}_{i{\bar{u}}u}^{(t; \tau _h)} = {\tilde{q}}^{(t; \tau _h)}({\bar{u}}, u \vert \varvec{y}_i). \end{aligned}$$
M-step: maximize $ {\mathcal {Q}}(\varvec{\theta }; \varvec{\theta }^{(h-1)}) $ and update the model parameters as follows:
$$\begin{aligned} \begin{aligned} \varvec{\mu }_u^{(\tau _h)}&= \frac{1}{\sum _{i = 1}^{n}\sum _{t = 1}^{T} {\tilde{z}}_{iu}^{(t; \tau _h)}} \sum _{i = 1}^{n}\sum _{t = 1}^{T} {\tilde{z}}_{iu}^{(t; \tau _h)} \varvec{y}_i^{(t)},\\ \varvec{\Sigma }^{(\tau _h)}&= \sum _{i = 1}^{n}\sum _{t = 1}^{T}\sum _{u = 1}^{k} \frac{{\tilde{z}}_{iu}^{(t; \tau _h)} (\varvec{y}_i^{(t)} - \varvec{\mu }_u)(\varvec{y}_i^{(t)} - \varvec{\mu }_u)'}{nT},\\ \pi _u^{(\tau _h)}&= \frac{\sum _{i = 1}^{n} {\tilde{z}}_{iu}^{(1; \tau _h)}}{n},\\ \pi _{u \vert {\bar{u}}}^{({t; }\tau _h)}&= \frac{\sum _{i = 1}^{n}\sum _{t = 2}^{T} {\tilde{z}}_{i{\bar{u}}u}^{(t{; \tau _h})}}{\sum _{i = 1}^{n}\sum _{t = 2}^{T} {\tilde{z}}_{iu}^{(t-1; \tau _h)}}. \end{aligned} \end{aligned}$$

4 Simulation study

We conducted an extensive Monte Carlo simulation study to evaluate the performance of the T-EM algorithm. In the following, we illustrate the simulation schemes for each different model specifications and summarize the main results.

4.1 Settings of the experimental scenarios

The settings involved in each model are different values of sample size n, number of response variables r, categories for each variable c, time occasions T, and latent components k. We define a baseline scenario (setting A, see Tables 16, 17, and 18 in Appendix A) for each model, characterized by $ n = 500 $, $ r = 6 $, $ c = 3 $, $ T = 5 $, and $ k = 3 $. In addition, more scenarios (settings from B to F in Appendix A) are obtained by doubling, one at a time, the value of each feature. In Tables 16, 17, and 18 in Appendix A also the values of the models’ parameters are presented. For each scenario, 50 different samples are drawn. For each of the simulated samples, we estimate 100 times both the model with correctly specified latent structure and that with misspecified latent structure, using each time different starting values randomly selected and employing the standard EM algorithm and the two proposed versions of the T-EM algorithm. The choice to also fit misspecified models allows us to show in more detail the features of the proposed tempering approach.

The convergence of the algorithms is checked on the basis of both the relative change in the log-likelihood of two consecutive steps, and the distance between the corresponding parameter vectors. We stop the algorithm when both criteria are satisfied:

$$\begin{aligned} \frac{ \ell (\varvec{\theta }^{(h)}) - \ell (\varvec{\theta }^{(h-1)}) }{ \vert \ell (\varvec{\theta }^{(h)}) \vert } < \varepsilon _1 \end{aligned}$$

and

$$\begin{aligned} \underset{s}{\max }\ | {\theta }^{(h)}_s - {\theta }^{(h-1)}_s | < \varepsilon _2, \end{aligned}$$

where $ \varvec{\theta }^{(h)} $ is the vector of parameter estimates obtained at the h-th iteration of the M-step and $ \varepsilon _1 $ and $ \varepsilon _2 $ are tolerance levels equal to $ 10^{-8} $ and $ 10^{-4} $, respectively.

Regarding the algorithm initialization, we adopt a starting rule based on normalized random numbers (Bartolucci et al. 2013). In more details, each initial ($ \pi _u $) and transition ($ \pi _{u \vert {\bar{u}}}^{(t)} $) probability is initialized with a random number drawn from a uniform distribution between 0 and 1. Then, they are normalized so that $ \sum _{u=1}^{k} \pi _u = 1 $ and $ \sum _{u=1}^{k} \pi _{u \vert {\bar{u}}}^{(t)} = 1 $. Similarly, we draw each $ \phi _{jy \vert u} $ from the uniform distribution and we normalize these parameters so that $ \sum _{y=0}^{c-1} \phi _{jy \vert u} = 1 $. In the case of continuous response variables, the mean vectors $ \varvec{\mu }_u $ are drawn from a multivariate Gaussian distribution, whereas $ \varvec{\Sigma } $ is initialized with the observed variance-covariance matrix. As suggested in Bartolucci et al. (2013), combining deterministic and random starting values is a proper approach. Therefore, in Sect. 4.4 we analyze the behavior of tempering in connection with a different initialization strategy.

4.2 Simulation results

The EM and T-EM algorithms are compared according to the following criteria:

1.
Global maximum achievement: the highest of the maximized log-likelihood values over all 100 initial values, denoted by $ {\hat{\ell }}_{\textsc {max}} $, is considered as the global maximum, and a log-likelihood value at convergence denoted by $ {\hat{\ell }} $ is considered close to this value once it satisfies $ ({\hat{\ell }}_{\textsc {max}} - {\hat{\ell }}) / |{\hat{\ell }}_{\textsc {max}} |< {\tilde{\varepsilon }} $, where $ {\tilde{\varepsilon }} $ is a suitable threshold;
2.
Average distance from the global maximum computed over the 100 log-likelihood values $ {\hat{\ell }}_1, \ldots , {\hat{\ell }}_{100} $ and expressed as $ \sum _{s = 1}^{100} ({\hat{\ell }}_{\textsc {max}} - {\hat{\ell }}_s) / 100 $;
3.
Low mean square error of the estimated model parameters with respect to the true model parameters, computed only for models with a correctly specified latent structure;
4.
Low mean and median of the log-likelihood values at convergence.

In particular, in this first part of the simulation study, we analyze the performance of the M-T-EM algorithm when the tempering profile is optimally tuned through a grid-search procedure. The following values for the tempering constants are kept fixed throughout the simulation studies: $ \alpha $ ranging from 1 to 15 with a step-size equal to 1 and $ \beta $ ranging from 0 to 2, with a step-size equal to 0.1. In order to show the flexibility of the method, we use the same grid for each model. However, efficient ad hoc grids may be set according to the model and observed data. The results are summarized in the following, and the full outcomes related to every sample under each simulated scenario are reported in the SI.

Criterion 1 is the most important, providing a suitable measure of performance of the algorithm. In this regard, the main results are summarized in Figs. 1, 2, and 3, representing the frequencies of global maximum with respect to the LC model, HM model with categorical response variables, and HM model with continuous response variables, respectively. From all these figures it clearly emerges that the M-T-EM algorithm ensures better performance in each considered scenario.

Regarding the estimation of models whose latent structure is correctly specified, in particular (see Figs. 1a, 2a, and 3a), the improvement with respect to the standard EM algorithm is very relevant: the M-T-EM is generally able to detect the global maximum in the overwhelming majority of cases, and the frequency of convergence to the global mode is very close, or even equal, to $100\%$. Only in estimating models with many latent states (up to 6), this percentage is slightly reduced, even if the M-T-EM still remains the algorithm providing the best performance. As an example, we consider the HM model with categorical response variables and in the particular setting F (see the last plot in Fig. 2a): in this case the frequency of convergence to the global maximum is, on average, equal to $29\%$ when the standard EM algorithm is used, and up to $52\%$ when the M-T-EM algorithm is employed. Moreover, this frequency is always lower than $75\%$ with the EM, while it reaches $100\%$ with M-T-EM (though only in a few cases).

All the algorithms are less efficient in steadily detecting the global mode when models with misspecified latent components are estimated (see Figs. 1b, 2b, and 3b). The M-T-EM algorithm always provides the best performance, and in many scenarios the improvement is very relevant: in setting D of the LC model (Fig. 1b) the frequency of convergence to the global mode increases from 18 to 41%; in setting C of the HM model with categorical responses (Fig. 2b) for some samples this frequency reaches 100%.

Table 1 Number of samples in which the global maximum is reached with frequency $ <10\% $, $ >50\% $, or $ >95\% $, using EM (highlighted in bold) and M-T-EM (highlighted in italic) algorithms under the simulated scenarios presented in Table 16 of the Appendix A for the LC model

Tempered expectation-maximization algorithm for the estimation of discrete latent variable models

Abstract

Similar content being viewed by others

Dirichlet process and its developments: a survey

Linear mixed model with Laplace distribution (LLMM)

R-VGAL: a sequential variational Bayes algorithm for generalised linear mixed models

1 Introduction

2 Model formulation

2.1 Latent class model

2.2 Hidden Markov model

2.2.1 Hidden Markov model with categorical response variables

2.2.2 Hidden Markov model with continuous response variables

2.3 Expectation-maximization algorithm

3 Tempered expectation-maximization algorithm

3.1 Tuning of tempering profiles

3.2 T-EM algorithm for the latent class model with categorical response variables

3.3 T-EM algorithm for the hidden Markov model with categorical response variables

3.4 T-EM algorithm for hidden Markov model with continuous response variables

4 Simulation study

4.1 Settings of the experimental scenarios

4.2 Simulation results

4.3 Results in terms of computational time

4.4 Initialization of the T-EM algorithm

4.5 The role of the oscillating tempering profile

4.6 Analysis of the T-EM algorithm with fixed tempering profile

5 Applications

5.1 Evaluation of anxiety and depression

5.2 Discovering criminal trajectories

5.3 Analyzing countries development

6 Conclusions

Change history

17 January 2023

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Supplementary Information

Supplementary file 1 (pdf 206 KB)

Appendices

Appendix

A Characteristics of the simulated scenarios

B Additional simulation results

C Numerical results for the analysis of T-EM algorithm with fixed tempering profiles

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation