The model
Let \(\left\{ {\mathcal {X}}_{it}; i=1,\ldots ,I, t=1,\ldots ,T\right\} \) be a sequence of matrix-variate balanced longitudinal observations recorded on I units at T time points, with \({\mathcal {X}}_{it}\in {\mathbb {R}}^{P\times R}\), and let \(\left\{ S_{it}; i=1,\ldots ,I, t=1,\ldots ,T\right\} \) be a first-order Markov chain defined on the state space \(\left\{ 1,\ldots ,k,\ldots ,K\right\} \). As mentioned in Sect. 1, an HMM is a particular type of dependent mixture model consisting of two parts: the underlying unobserved process \(\left\{ S_{it}\right\} \), which satisfies the Markov property, i.e.
$$\begin{aligned}&\text {Pr}\left( S_{it}=s_{it} | S_{i1}=s_{i1}, \ldots ,S_{it-1}=s_{it-1}\right) \\&\quad =\text {Pr}\left( S_{it}=s_{it} | S_{it-1}=s_{it-1}\right) , \end{aligned}$$
and the state-dependent observation process \(\left\{ {\mathcal {X}}_{it}\right\} \) for which the conditional independence property holds, i.e.
$$\begin{aligned}&f\left( {\mathcal {X}}_{it}={\mathbf {X}}_{it} \,|\, {\mathcal {X}}_{i1}={\mathbf {X}}_{i1}, \ldots ,{\mathcal {X}}_{it-1}={\mathbf {X}}_{it-1},S_{i1}=s_{i1}, \ldots ,S_{it}=s_{it}\right) \\&\quad = f\left( {\mathcal {X}}_{it}={\mathbf {X}}_{it} \,|\, S_{it}=s_{it} \right) , \end{aligned}$$
where \(f(\cdot )\) is a generic probability density function (pdf). Therefore, the unknown parameters in an HMM comprise both the parameters of the Markov chain and those of the state-dependent pdfs. In detail, the parameters of the Markov chain are the initial probabilities \(\pi _{ik}=\text {Pr}\left( S_{i1}=k\right) \), \(k=1,\ldots ,K\), where K is the number of states, and the transition probabilities
$$\begin{aligned} \pi _{ik|j} = \text {Pr}\left( S_{it}=k|S_{it-1}=j\right) , \quad t=2,\ldots ,T \text { and } j,k=1,\ldots ,K, \end{aligned}$$
where k refers to the current state and j to the one previously visited. To simplify the discussion, we consider homogeneous HMMs, that is, \(\pi _{ik|j}=\pi _{k|j}\) and \(\pi _{ik}=\pi _{k}\), \(i=1,\ldots ,I\). We collect the initial probabilities in the K-dimensional vector \(\varvec{\pi }\), whereas the time-homogeneous transition probabilities are arranged in the \(K \times K\) transition matrix \(\varvec{\Pi }\).
Table 1 Nomenclature, covariance matrix structure, and number of free parameters in \(\varvec{\Phi }_{1},\ldots ,\varvec{\Phi }_{K}\) for the parsimonious models obtained via the eigen decomposition of the state covariance matrices. \({\varvec{I}}\) is the identity matrix

Regarding the conditional density for the observed process, it is given by the matrix-normal distribution, i.e.
$$\begin{aligned}&\phi \left( {\mathbf {X}}_{it}|S_{it}=k;\varvec{\theta }_{k}\right) \nonumber \\&\quad = \frac{\exp \left\{ -\frac{1}{2}\,\text{ tr }\left[ \varvec{\Sigma }_{k}^{-1}({\mathbf {X}}_{it}-{\mathbf {M}}_{k})\varvec{\Psi }_{k}^{-1}({\mathbf {X}}_{it}-{\mathbf {M}}_{k})'\right] \right\} }{(2\pi )^{\frac{PR}{2}}|\varvec{\Sigma }_{k}|^{\frac{R}{2}}|\varvec{\Psi }_{k}|^{\frac{P}{2}}}, \end{aligned}$$
(1)
where \({\mathbf {M}}_{k}\) is the \(P \times R\) matrix of means, \(\varvec{\Sigma }_{k}\) is the \(P\times P\) covariance matrix containing the covariances between the P rows, \(\varvec{\Psi }_{k}\) is the \(R\times R\) covariance matrix containing the covariances of the R columns and \(\varvec{\theta }_{k}=\left\{ {\mathbf {M}}_{k},\varvec{\Sigma }_{k},\varvec{\Psi }_{k}\right\} \). For an exhaustive description of the matrix-normal distribution and its properties see Gupta and Nagar (2018).
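To make (1) concrete, a minimal base-R sketch of the log-density follows; the function name and argument layout are our own choices and not those of the FourWayHMM package.

```r
# A minimal base-R sketch of the log of the matrix-normal density in (1);
# X and M are P x R matrices, Sigma is P x P, Psi is R x R.
dmatnorm_log <- function(X, M, Sigma, Psi) {
  P <- nrow(X); R <- ncol(X)
  E <- X - M
  quad <- sum(diag(solve(Sigma, E) %*% solve(Psi, t(E))))  # tr[Sigma^-1 E Psi^-1 E']
  -0.5 * quad - (P * R / 2) * log(2 * pi) -
    (R / 2) * as.numeric(determinant(Sigma)$modulus) -     # log|Sigma|
    (P / 2) * as.numeric(determinant(Psi)$modulus)         # log|Psi|
}
```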
Parsimonious models
As discussed in Sect. 1, a way to reduce the number of parameters of the model is to introduce parsimony in the covariance matrices via the well-known eigen decomposition introduced by Celeux and Govaert (1995). Specifically, a \(Q \times Q\) covariance matrix can be decomposed as
$$\begin{aligned} \varvec{\Phi }_{k} = \lambda _{k}\varvec{\Gamma }_{k}\varvec{\Delta }_{k}\varvec{\Gamma }_{k}', \end{aligned}$$
(2)
where \(\lambda _{k}=|\varvec{\Phi }_{k}|^{1/Q}\), \(\varvec{\Gamma }_{k}\) is a \(Q \times Q\) orthogonal matrix of the eigenvectors of \(\varvec{\Phi }_{k}\) and \(\varvec{\Delta }_{k}\) is the \(Q \times Q\) diagonal matrix with the scaled eigenvalues of \(\varvec{\Phi }_{k}\) (such that \(|\varvec{\Delta }_{k}| = 1\)) located on the main diagonal. The decomposition in (2) has some useful practical interpretations. From a geometric point of view, \(\lambda _{k}\) determines the volume, \(\varvec{\Gamma }_{k}\) governs the orientation, and \(\varvec{\Delta }_{k}\) determines the shape of the kth state. From a statistical point of view, as well-documented in Greselin and Punzo (2013), Bagnato and Punzo (2021) and Punzo and Bagnato (2021), the columns of \(\varvec{\Gamma }_{k}\) govern the orientation of the principal components (PCs) of the kth state, the diagonal elements in \(\varvec{\Delta }_{k}\) are the normalized variances of these PCs, and \(\lambda _{k}\) can be interpreted as the overall volume of the scatter in the space spanned by the PCs of the kth state. By imposing constraints on the three components of (2), the fourteen parsimonious models of Table 1 are obtained.
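As an illustration of (2), the following base-R sketch (function and variable names are our own) recovers the three components from a given covariance matrix and verifies the reconstruction.

```r
# A base-R sketch of the decomposition in (2): volume (lambda),
# orientation (Gamma) and shape (Delta) of a covariance matrix Phi.
decompose_cov <- function(Phi) {
  Q <- nrow(Phi)
  e <- eigen(Phi, symmetric = TRUE)
  lambda <- det(Phi)^(1 / Q)             # |Phi|^(1/Q)
  Gamma  <- e$vectors                    # orthogonal matrix of eigenvectors
  Delta  <- diag(e$values / lambda, Q)   # scaled eigenvalues, |Delta| = 1
  list(lambda = lambda, Gamma = Gamma, Delta = Delta)
}

# sanity check: Phi is recovered as lambda * Gamma Delta Gamma'
Phi <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
d <- decompose_cov(Phi)
max(abs(d$lambda * d$Gamma %*% d$Delta %*% t(d$Gamma) - Phi))  # ~ 0
```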
Considering that we have two covariance matrices in (1), this would yield \(14 \times 14 = 196\) parsimonious MV-HMMs. However, there is a non-identifiability issue since \(\varvec{\Psi }\otimes \varvec{\Sigma }= \varvec{\Psi }^{*} \otimes \varvec{\Sigma }^{*}\) if \(\varvec{\Sigma }^{*}= a\varvec{\Sigma }\) and \(\varvec{\Psi }^{*}= a^{-1}\varvec{\Psi }\). As a result, \(\varvec{\Sigma }\) and \(\varvec{\Psi }\) are identifiable only up to a multiplicative constant a (Sarkar et al. 2020). To avoid this issue, the column covariance matrix \(\varvec{\Psi }\) is restricted to have \(|\varvec{\Psi }|= 1\), implying that in (2) the parameter \(\lambda _{k}\) is unnecessary. This reduces the number of models related to \(\varvec{\Psi }\) from 14 to 7, i.e., \({\varvec{I}}, \varvec{\Delta }, \varvec{\Delta }_{k}, \varvec{\Gamma }\varvec{\Delta }\varvec{\Gamma }',\varvec{\Gamma }\varvec{\Delta }_{k}\varvec{\Gamma }', \varvec{\Gamma }_{k}\varvec{\Delta }\varvec{\Gamma }_{k}', \varvec{\Gamma }_{k}\varvec{\Delta }_{k}\varvec{\Gamma }_{k}'\). Therefore, we obtain \(14 \times 7 = 98\) parsimonious MV-HMMs.
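The identifiability issue, and the effect of the \(|\varvec{\Psi }|=1\) restriction that resolves it, can be checked numerically with a few lines of base R (the values below are arbitrary):

```r
# Numerical check of the non-identifiability of the Kronecker structure.
a <- 3
Sigma <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
Psi   <- matrix(c(1, 0.3, 0.3, 1), 2, 2)
max(abs(kronecker(Psi, Sigma) - kronecker(Psi / a, a * Sigma)))  # ~ 0

# Restricting |Psi| = 1 fixes the scale:
Psi1 <- Psi / det(Psi)^(1 / nrow(Psi))   # rescale so that |Psi1| = 1
det(Psi1)                                # 1
```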
Maximum likelihood estimation
To fit our MV-HMMs, we use the expectation-conditional maximization (ECM) algorithm (Meng and Rubin 1993). The ECM algorithm is a variant of the classical expectation-maximization (EM) algorithm (Dempster et al. 1977), in which the M-step is replaced by a sequence of simpler and computationally more convenient CM-steps.
Let \({\mathcal {S}}=\left\{ {\mathbf {X}}_{it}; i=1,\ldots ,I, t=1,\ldots ,T\right\} \) be a sample of matrix-variate balanced longitudinal observations. Then, the incomplete-data likelihood function is
$$\begin{aligned} L\left( \varvec{\Theta }|{\mathcal {S}}\right)= & {} \prod _{i=1}^{I} \varvec{\pi }' \varvec{\phi }\left( {\mathbf {X}}_{i1}\right) \varvec{\Pi }\varvec{\phi }\left( {\mathbf {X}}_{i2}\right) \varvec{\Pi }\ldots \varvec{\phi }\left( {\mathbf {X}}_{iT-1}\right) \\&\quad \varvec{\Pi }\varvec{\phi }\left( {\mathbf {X}}_{iT}\right) {\varvec{1}}_K, \end{aligned}$$
where \(\varvec{\phi }\left( {\mathbf {X}}_{it}\right) \) is a \(K \times K\) diagonal matrix with the conditional densities \(\phi \left( {\mathcal {X}}_{it}={\mathbf {X}}_{it} | S_{it}=k\right) \), \(k=1,\ldots ,K\), on the main diagonal, \({\varvec{1}}_K\) is a vector of K ones and \(\varvec{\Theta }\) contains all the model parameters. In this setting, \({\mathcal {S}}\) is viewed as incomplete because, for each observation, we do not know its state membership and its evolution over time. For this reason, let us define the unobserved state membership vector \({\varvec{z}}_{it}=\left( z_{it1},\ldots ,z_{itk},\ldots ,z_{itK}\right) '\) and the unobserved state transition matrix
$$\begin{aligned}{\varvec{z}}{\varvec{z}}_{it}= \begin{bmatrix} zz_{it11} &{} \ldots &{} zz_{it1k} &{} \ldots &{} zz_{it1K}\\ \vdots &{} &{} \vdots &{} &{} \vdots \\ zz_{itj1} &{} \ldots &{} zz_{itjk} &{} \ldots &{} zz_{itjK}\\ \vdots &{} &{} \vdots &{} &{} \vdots \\ zz_{itK1} &{} \ldots &{} zz_{itKk} &{} \ldots &{} zz_{itKK}\end{bmatrix}, \end{aligned}$$
where
$$\begin{aligned} z_{itk}= {\left\{ \begin{array}{ll} 1 &{} \quad \text {if } S_{it}=k \\ 0 &{} \quad \text {otherwise} \end{array}\right. } \quad \text {and}\quad zz_{itjk}= {\left\{ \begin{array}{ll} 1 &{} \quad \text {if } S_{it-1}=j \text { and } S_{it}=k \\ 0 &{} \quad \text {otherwise.} \end{array}\right. } \end{aligned}$$
Therefore, the complete data are \({\mathcal {S}}_c=\Big \{{\mathbf {X}}_{it},{\varvec{z}}_{it},{\varvec{z}}{\varvec{z}}_{it}; i=1,\ldots ,I, t=1,\ldots ,T\Big \}\) and the corresponding complete-data log-likelihood is
$$\begin{aligned} l_{c}\left( \varvec{\Theta }|{\mathcal {S}}_{c}\right) = l_{c_1}\left( \varvec{\pi }|{\mathcal {S}}_{c}\right) +l_{c_2}\left( \varvec{\Pi }|{\mathcal {S}}_{c}\right) +l_{c_3}\left( \varvec{\theta }|{\mathcal {S}}_{c}\right) , \end{aligned}$$
(3)
with \(\varvec{\theta }=\left\{ \varvec{\theta }_k; k=1,\ldots ,K\right\} \) and
$$\begin{aligned} l_{c_1}\left( \varvec{\pi }|{\mathcal {S}}_{c}\right)&= \sum \limits _{i=1}^{I}\sum \limits _{k=1}^{K} z_{i1k} \log \left( \pi _k\right) \\ l_{c_2}\left( \varvec{\Pi }|{\mathcal {S}}_{c}\right)&= \sum \limits _{i=1}^{I}\sum \limits _{t=2}^{T}\sum \limits _{k=1}^{K}\sum \limits _{j=1}^{K} zz_{itjk} \log \left( \pi _{k|j}\right) \\ l_{c_3}\left( \varvec{\theta }|{\mathcal {S}}_{c}\right)&= \sum \limits _{i=1}^{I}\sum \limits _{t=1}^{T}\sum \limits _{k=1}^{K} z_{itk} \Bigg \{ -\frac{PR}{2}\log \left( 2\pi \right) -\frac{R}{2}\log |\varvec{\Sigma }_{k}| -\frac{P}{2}\log |\varvec{\Psi }_{k}| \\&\quad -\frac{1}{2}\,\text{ tr }\left[ \varvec{\Sigma }_{k}^{-1}({\mathbf {X}}_{it}-{\mathbf {M}}_{k})\varvec{\Psi }_{k}^{-1}({\mathbf {X}}_{it}-{\mathbf {M}}_{k})'\right] \Bigg \}. \end{aligned}$$
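As a toy illustration of these indicators (with an arbitrary state path; not package code), they can be constructed as follows.

```r
# Indicators z_itk and zz_itjk for a single unit with K = 3 states
# observed at T = 5 time points; the state path s is arbitrary.
K <- 3; Tt <- 5
s <- c(1, 2, 2, 3, 1)                    # a hypothetical state path
z <- outer(s, seq_len(K), `==`) * 1L     # T x K membership indicators
zz <- array(0L, dim = c(Tt, K, K))       # T x K x K transition indicators
for (t in 2:Tt) zz[t, s[t - 1], s[t]] <- 1L
```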
In the following, adopting the notation used in Tomarchio et al. (2021a), parameters marked with one dot denote the updates at the previous iteration, while parameters marked with two dots denote the updates at the current iteration. Furthermore, we implemented the ECM algorithm used for fitting all 98 parsimonious MV-HMMs in the HMM.fit() function of the FourWayHMM package (Tomarchio et al. 2021b) for the R statistical software (R Core Team 2019).
E-Step The E-step requires the calculation of the conditional expectation of (3), given the observed data \({\mathcal {S}}\) and the current estimates \({\dot{\varvec{\Theta }}}\). Therefore, we need to replace \(z_{itk}\) and \(zz_{itjk}\) with their conditional expectations, namely \(\ddot{z}_{itk}\) and \(\ddot{zz}_{itjk}\). This can be done efficiently by exploiting a forward recursion approach (Baum et al. 1970; Baum 1972; Welch 2003).
Let us start by defining the forward probability
$$\begin{aligned} \gamma _{itk}=\text {Pr}\left( {\mathcal {X}}_{i1}={\mathbf {X}}_{i1},\ldots ,{\mathcal {X}}_{it}={\mathbf {X}}_{it},S_{it}=k\right) , \end{aligned}$$
that is, the joint probability of observing the partial sequence up to time t and of being in state k at time t, and the corresponding backward probability
$$\begin{aligned} \beta _{itk}=\text {Pr}\left( {\mathcal {X}}_{it+1}={\mathbf {X}}_{it+1},\ldots ,{\mathcal {X}}_{iT}={\mathbf {X}}_{iT}|S_{it}=k\right) . \end{aligned}$$
It is known that the computation of the forward and backward probabilities is susceptible to numerical overflow errors (Farcomeni 2012). To prevent, or at least reduce, the risk of such errors, the well-known scaling procedure suggested by Durbin et al. (1998) can be implemented (for additional details, see also Zucchini et al. 2017). Then, the updates required in the E-step can be computed as
$$\begin{aligned} \ddot{z}_{itk}= & {} \frac{\gamma _{itk}\beta _{itk}}{\sum \limits _{h=1}^{K}\gamma _{ith}\beta _{ith}} \quad \text {and} \\ \ddot{zz}_{itjk}= & {} \frac{\gamma _{i\left( t-1\right) j}\pi _{k|j}\phi \left( {\mathbf {X}}_{it}|S_{it}=k\right) \beta _{itk}}{\sum \limits _{h=1}^{K}\gamma _{iTh}}. \end{aligned}$$
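A minimal base-R sketch of these scaled recursions for a single unit is given below; the data layout (a \(T \times K\) matrix of state-dependent densities) and all names are our own assumptions, not the package's API.

```r
# Scaled forward-backward recursions (Durbin et al. 1998) for one unit.
# dens is a T x K matrix with entries phi(X_t | S_t = k), pi0 the initial
# probabilities and Pi the K x K transition matrix.
forward_backward <- function(dens, pi0, Pi) {
  Tt <- nrow(dens); K <- ncol(dens)
  ga <- be <- matrix(0, Tt, K)
  sc <- numeric(Tt)                                # scaling constants
  ga[1, ] <- pi0 * dens[1, ]
  sc[1] <- sum(ga[1, ]); ga[1, ] <- ga[1, ] / sc[1]
  for (t in 2:Tt) {                                # scaled forward pass
    ga[t, ] <- (ga[t - 1, ] %*% Pi) * dens[t, ]
    sc[t] <- sum(ga[t, ]); ga[t, ] <- ga[t, ] / sc[t]
  }
  be[Tt, ] <- 1
  for (t in (Tt - 1):1)                            # scaled backward pass
    be[t, ] <- (Pi %*% (dens[t + 1, ] * be[t + 1, ])) / sc[t + 1]
  z <- ga * be / rowSums(ga * be)                  # posteriors z_itk
  # transition posteriors follow from the scaled quantities as
  # zz[t, j, k] = ga[t - 1, j] * Pi[j, k] * dens[t, k] * be[t, k] / sc[t]
  list(z = z, logLik = sum(log(sc)))
}
```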
CM-Step 1 Consider \(\varvec{\Theta }=\left\{ \varvec{\Theta }_1,\varvec{\Theta }_2\right\} \), where \(\varvec{\Theta }_1=\left\{ \pi _k,\varvec{\Pi },{\mathbf {M}}_{k},\varvec{\Sigma }_{k};k=1,\ldots ,K\right\} \) and \(\varvec{\Theta }_2=\Big \{\varvec{\Psi }_{k};k=1,\ldots ,K\Big \}\). At the first CM-step, we maximize the expectation of (3) with respect to \(\varvec{\Theta }_1\), keeping \(\varvec{\Theta }_2\) fixed at \({\dot{\varvec{\Theta }}}_2\). In particular, we obtain
$$\begin{aligned} \ddot{\pi }_k= & {} \frac{\sum _{i=1}^I \ddot{z}_{i1k}}{I}, \quad \ddot{\pi }_{k|j} = \frac{\sum _{i=1}^I \sum _{t=2}^T \ddot{zz}_{itjk}}{\sum _{i=1}^I \sum _{t=2}^T \sum _{k=1}^K \ddot{zz}_{itjk}} \quad \text {and} \\ \ddot{{\mathbf {M}}}_k= & {} \frac{\sum _{i=1}^I \sum _{t=1}^T \ddot{z}_{itk}{\mathbf {X}}_{it}}{\sum _{i=1}^I \sum _{t=1}^T \ddot{z}_{itk}}. \end{aligned}$$
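These closed-form updates translate directly into code; the sketch below assumes (hypothetically) that the posteriors are stored as arrays z (\(I \times T \times K\)) and zz (\(I \times T \times K \times K\), defined for \(t \ge 2\)), and the data X as a list over units of lists over times of \(P \times R\) matrices.

```r
# Closed-form CM-step 1 updates for pi, Pi and M_k under the assumed layouts.
update_pi <- function(z) colMeans(z[, 1, ])
update_Pi <- function(zz) {
  num <- apply(zz[, -1, , , drop = FALSE], c(3, 4), sum)  # sum over i and t >= 2
  num / rowSums(num)                                      # normalize each row j
}
update_M <- function(X, z, k) {
  num <- 0
  for (i in seq_len(dim(z)[1])) for (t in seq_len(dim(z)[2]))
    num <- num + z[i, t, k] * X[[i]][[t]]
  num / sum(z[, , k])
}
```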
The update for \(\varvec{\Sigma }_k\) depends on the parsimonious structure considered. For notational simplicity, let \({\ddot{{\mathbf {Y}}}}=\sum _{k=1}^K {\ddot{{\mathbf {Y}}}}_k\) be the update of the within-state row scatter matrix, where \({\ddot{{\mathbf {Y}}}}_k = \sum _{i=1}^I \sum _{t=1}^T \ddot{z}_{itk}\left( {\mathbf {X}}_{it}-{\ddot{{\mathbf {M}}}}_k\right) {{\dot{\varvec{\Psi }}}}_k^{-1}\left( {\mathbf {X}}_{it}-{\ddot{{\mathbf {M}}}}_k\right) '\) is the update of the row scatter matrix related to the kth state. The updates for the 14 parsimonious structures of \(\varvec{\Sigma }_k\) are:
-
Model EII [\(\varvec{\Sigma }_k=\lambda {\varvec{I}}\)]. In this setting, maximizing Eq. (3) reduces to the maximization of
$$\begin{aligned} -\frac{PRTI}{2}\log \lambda -\frac{1}{2\lambda }\,\text{ tr }\left( {\ddot{{\mathbf {Y}}}}\right) . \end{aligned}$$
Thus, we can obtain \(\lambda \) as
$$\begin{aligned} {\ddot{\lambda }} = \frac{\,\text{ tr }\left\{ {\ddot{{\mathbf {Y}}}}\right\} }{PRTI}. \end{aligned}$$
-
Model VII [\(\varvec{\Sigma }_k=\lambda _{k}{\varvec{I}}\)]. In this case, maximizing Eq. (3) reduces to the maximization of
$$\begin{aligned} -\frac{PR}{2} \sum \limits _{k=1}^{K} \log \lambda _k \sum \limits _{i=1}^{I}\sum \limits _{t=1}^{T} \ddot{z}_{itk}-\frac{1}{2}\sum \limits _{k=1}^{K}\frac{1}{\lambda _k}\,\text{ tr }\left( {\ddot{{\mathbf {Y}}}}_k\right) . \end{aligned}$$
Thus, we can obtain \(\lambda _k\) as
$$\begin{aligned} {\ddot{\lambda }}_k = \frac{\,\text{ tr }\left\{ {\ddot{{\mathbf {Y}}}}_k\right\} }{PR \sum _{i=1}^I \sum _{t=1}^T \ddot{z}_{itk}}. \end{aligned}$$
-
Model EEI [\(\varvec{\Sigma }_k=\lambda \varvec{\Delta }\)]. Here, maximizing Eq. (3) reduces to the maximization of
$$\begin{aligned} -\frac{PRTI}{2}\log \lambda -\frac{1}{2\lambda }\,\text{ tr }\left( \varvec{\Delta }^{-1}{\ddot{{\mathbf {Y}}}}\right) . \end{aligned}$$
Applying Corollary A.1 of Celeux and Govaert (1995), we can obtain \(\lambda \) and \(\varvec{\Delta }\) as
$$\begin{aligned} {\ddot{\varvec{\Delta }}} =\frac{\text {diag}\left( {\ddot{{\mathbf {Y}}}}\right) }{\left| \text {diag}\left( {\ddot{{\mathbf {Y}}}}\right) \right| ^\frac{1}{P}} \quad \text {and} \quad {\ddot{\lambda }} = \frac{\left| \text {diag}\left( \ddot{{\mathbf {Y}}}\right) \right| ^\frac{1}{P}}{RTI}. \end{aligned}$$
-
Model VEI [\(\varvec{\Sigma }_k=\lambda _{k}\varvec{\Delta }\)]. In this setting, maximizing Eq. (3) reduces to the maximization of
$$\begin{aligned} -\frac{PR}{2}\sum \limits _{k=1}^{K} \log \lambda _k \sum \limits _{i=1}^{I}\sum \limits _{t=1}^{T} \ddot{z}_{itk}-\sum \limits _{k=1}^{K}\frac{1}{2\lambda _k}\,\text{ tr }\left( \varvec{\Delta }^{-1}{\ddot{{\mathbf {Y}}}}_k\right) . \end{aligned}$$
Applying Corollary A.1 of Celeux and Govaert (1995), we can obtain \(\varvec{\Delta }\) and \(\lambda _{k}\) as
$$\begin{aligned} {\ddot{\varvec{\Delta }}}= & {} \frac{\text {diag}\left( \sum \limits _{k=1}^K \dot{\lambda }_k^{-1}{\ddot{{\mathbf {Y}}}}_k\right) }{\left| \text {diag}\left( \sum \limits _{k=1}^K {{\dot{\lambda }}}_k^{-1}{\ddot{{\mathbf {Y}}}}_k\right) \right| ^\frac{1}{P}} \quad \text {and}\\ {\ddot{\lambda }}_k= & {} \frac{\,\text{ tr }\left\{ {\ddot{{\mathbf {Y}}}}_k {\ddot{\varvec{\Delta }}}^{-1}\right\} }{PR\sum _{i=1}^I \sum _{t=1}^T \ddot{z}_{itk}}. \end{aligned}$$
-
Model EVI [\(\varvec{\Sigma }_k=\lambda \varvec{\Delta }_{k}\)]. In this case, maximizing Eq. (3) reduces to the maximization of
$$\begin{aligned} -\frac{PRTI}{2}\log \lambda -\frac{1}{2\lambda }\sum \limits _{k=1}^{K} \,\text{ tr }\left( \varvec{\Delta }_k^{-1}{\ddot{{\mathbf {Y}}}}_k\right) . \end{aligned}$$
Also in this case, by using Corollary A.1 of Celeux and Govaert (1995), we can obtain \(\varvec{\Delta }_{k}\) and \(\lambda \) as
$$\begin{aligned} {\ddot{\varvec{\Delta }}}_k = \frac{\text {diag}\left( {\ddot{{\mathbf {Y}}}}_k\right) }{\left| \text {diag}\left( {\ddot{{\mathbf {Y}}}}_k\right) \right| ^\frac{1}{P}} \quad \text {and} \quad {\ddot{\lambda }} = \frac{\sum \limits _{k=1}^K\left| \text {diag}\left( {\ddot{{\mathbf {Y}}}}_k\right) \right| ^\frac{1}{P}}{RTI}. \end{aligned}$$
-
Model VVI [\(\varvec{\Sigma }_k=\lambda _{k}\varvec{\Delta }_{k}\)]. Here, maximizing Eq. (3) reduces to the maximization of
$$\begin{aligned} -\frac{PR}{2}\sum \limits _{k=1}^{K} \log \lambda _k \sum \limits _{i=1}^{I}\sum \limits _{t=1}^{T}\ddot{z}_{itk}-\sum \limits _{k=1}^{K}\frac{1}{2\lambda _k}\,\text{ tr }\left( \varvec{\Delta }_k^{-1}{\ddot{{\mathbf {Y}}}}_k\right) . \end{aligned}$$
Again, by using Corollary A.1 of Celeux and Govaert (1995), we can obtain \(\varvec{\Delta }_{k}\) and \(\lambda _{k}\) as
$$\begin{aligned} {\ddot{\varvec{\Delta }}}_k = \frac{\text {diag}\left( {\ddot{{\mathbf {Y}}}}_k\right) }{\left| \text {diag}\left( {\ddot{{\mathbf {Y}}}}_k\right) \right| ^\frac{1}{P}}\quad \text {and} \quad {\ddot{\lambda }}_k = \frac{\left| \text {diag}\left( {\ddot{{\mathbf {Y}}}}_k\right) \right| ^\frac{1}{P}}{R\sum _{i=1}^I \sum _{t=1}^T \ddot{z}_{itk}}. \end{aligned}$$
-
Model EEE [\(\varvec{\Sigma }_k=\lambda \varvec{\Gamma }\varvec{\Delta }\varvec{\Gamma }'\)]. In this setting, given that \(\varvec{\Sigma }_1=\cdots =\varvec{\Sigma }_K\equiv \varvec{\Sigma }\), maximizing Eq. (3) reduces to the maximization of
$$\begin{aligned} -\frac{RTI}{2}\log |\varvec{\Sigma }|-\frac{1}{2}\,\text{ tr }(\varvec{\Sigma }^{-1}{\ddot{{\mathbf {Y}}}}). \end{aligned}$$
Applying Theorem A.2 of Celeux and Govaert (1995), we can update \(\varvec{\Sigma }\) as
$$\begin{aligned} \ddot{\varvec{\Sigma }}= \frac{{\ddot{{\mathbf {Y}}}}}{RTI}. \end{aligned}$$
-
Model VEE [\(\varvec{\Sigma }_k=\lambda _{k}\varvec{\Gamma }\varvec{\Delta }\varvec{\Gamma }'\)]. In this case, it is convenient to write \(\varvec{\Sigma }_k=\lambda _{k}{\mathbf {C}}\), where \({\mathbf {C}}= \varvec{\Gamma }\varvec{\Delta }\varvec{\Gamma }'\). Thus, maximizing Eq. (3) reduces to the maximization of
$$\begin{aligned} -\frac{PR}{2}\sum \limits _{k=1}^{K} \log \lambda _k \sum \limits _{i=1}^{I}\sum \limits _{t=1}^{T}\ddot{z}_{itk}-\sum \limits _{k=1}^{K}\frac{1}{2\lambda _k}\,\text{ tr }\left( {\mathbf {C}}^{-1}{\ddot{{\mathbf {Y}}}}_k\right) . \end{aligned}$$
Applying Theorem A.1 of Celeux and Govaert (1995), we can update \({\mathbf {C}}\) and \(\lambda _{k}\) as
$$\begin{aligned} {\ddot{{\mathbf {C}}}} = \frac{\sum \limits _{k=1}^K {{\dot{\lambda }}}_k^{-1}{\ddot{{\mathbf {Y}}}}_k}{\left| \sum \limits _{k=1}^K {{\dot{\lambda }}}_k^{-1}{\ddot{{\mathbf {Y}}}}_k\right| ^{\frac{1}{P}}}\quad \text {and} \quad {\ddot{\lambda }}_k = \frac{\,\text{ tr }\left\{ {\ddot{{\mathbf {C}}}}^{-1} {\ddot{{\mathbf {Y}}}}_k \right\} }{PR\sum _{i=1}^I \sum _{t=1}^T \ddot{z}_{itk}}. \end{aligned}$$
-
Model EVE [\(\varvec{\Sigma }_k=\lambda \varvec{\Gamma }\varvec{\Delta }_{k}\varvec{\Gamma }'\)]. Here, maximizing Eq. (3) reduces to the maximization of
$$\begin{aligned} -\frac{PRTI}{2}\log \lambda -\frac{1}{2\lambda }\sum \limits _{k=1}^{K} \,\text{ tr }(\varvec{\Gamma }'{\ddot{{\mathbf {Y}}}}_k\varvec{\Gamma }\varvec{\Delta }_{k}^{-1}). \end{aligned}$$
Given that there is no analytical solution for \(\varvec{\Gamma }\) when the other parameters are held fixed, an iterative Minorization-Maximization (MM) algorithm (Browne and McNicholas 2014) is employed. In detail, a surrogate function can be constructed as
$$\begin{aligned} f\left( \varvec{\Gamma }\right) = \sum \limits _{k=1}^K \,\text{ tr }\left\{ {\ddot{{\mathbf {Y}}}}_k\varvec{\Gamma }\varvec{\Delta }_{k}^{-1}\varvec{\Gamma }'\right\} \le S + \,\text{ tr }\left\{ {{\dot{{\varvec{F}}}}}\varvec{\Gamma }\right\} , \end{aligned}$$
where S is a constant and \({{\dot{{\varvec{F}}}}} = \sum _{k=1}^K\Big (\varvec{\Delta }_{k}^{-1} {{\dot{\varvec{\Gamma }}}}' {\ddot{{\mathbf {Y}}}}_k - e_k \varvec{\Delta }_{k}^{-1} {{\dot{\varvec{\Gamma }}}}'\Big )\), with \(e_k\) being the largest eigenvalue of \({\ddot{{\mathbf {Y}}}}_k\). The update of \(\varvec{\Gamma }\) is given by \({\ddot{\varvec{\Gamma }}} = {{\dot{{\varvec{G}}}}}{{\dot{{\varvec{H}}}}}'\), where \({{\dot{{\varvec{G}}}}}\) and \({{\dot{{\varvec{H}}}}}\) are obtained from the singular value decomposition of \({{\dot{{\varvec{F}}}}}\). This process is repeated until a specified convergence criterion is met, yielding the update \({\ddot{\varvec{\Gamma }}}\). Then, we obtain the updates for \(\varvec{\Delta }_{k}\) and \(\lambda \) as
$$\begin{aligned} {\ddot{\varvec{\Delta }}}_k = \frac{\text {diag}\left( {\ddot{\varvec{\Gamma }}}' {\ddot{{\mathbf {Y}}}}_k {\ddot{\varvec{\Gamma }}}\right) }{\left| \text {diag}\left( {\ddot{\varvec{\Gamma }}}' {\ddot{{\mathbf {Y}}}}_k {\ddot{\varvec{\Gamma }}}\right) \right| ^\frac{1}{P}}\quad \text {and} \quad {\ddot{\lambda }} = \frac{\sum \limits _{k=1}^K \,\text{ tr }\left( {\ddot{\varvec{\Gamma }}} {\ddot{\varvec{\Delta }}}_k^{-1} {\ddot{\varvec{\Gamma }}}'{\ddot{{\mathbf {Y}}}}_k\right) }{PRTI}. \end{aligned}$$
-
Model VVE [\(\varvec{\Sigma }_k=\lambda _{k}\varvec{\Gamma }\varvec{\Delta }_{k}\varvec{\Gamma }'\)]. In this case, maximizing Eq. (3) reduces to the maximization of
$$\begin{aligned} -\frac{PR}{2}\sum \limits _{k=1}^{K} \log \lambda _k \sum \limits _{i=1}^{I}\sum \limits _{t=1}^{T}\ddot{z}_{itk}-\sum \limits _{k=1}^{K}\frac{1}{2\lambda _k}\,\text{ tr }(\varvec{\Gamma }'{\ddot{{\mathbf {Y}}}}_k\varvec{\Gamma }\varvec{\Delta }_{k}^{-1}). \end{aligned}$$
Again, there is no analytical solution for \(\varvec{\Gamma }\), and its update is obtained by employing the MM algorithm as described for the EVE model. Then, the updates for \(\varvec{\Delta }_{k}\) and \(\lambda _{k}\) are
$$\begin{aligned} {\ddot{\varvec{\Delta }}}_k = \frac{\text {diag}\left( {\ddot{\varvec{\Gamma }}}' {\ddot{{\mathbf {Y}}}}_k {\ddot{\varvec{\Gamma }}}\right) }{\left| \text {diag}\left( {\ddot{\varvec{\Gamma }}}' {\ddot{{\mathbf {Y}}}}_k {\ddot{\varvec{\Gamma }}}\right) \right| ^\frac{1}{P}}\quad \text {and} \quad {\ddot{\lambda }}_k = \frac{\left| \text {diag}\left( {\ddot{\varvec{\Gamma }}}' {\ddot{{\mathbf {Y}}}}_k {\ddot{\varvec{\Gamma }}}\right) \right| ^{\frac{1}{P}}}{R\sum _{i=1}^I \sum _{t=1}^T \ddot{z}_{itk}}. \end{aligned}$$
-
Model EEV [\(\varvec{\Sigma }_k=\lambda \varvec{\Gamma }_{k}\varvec{\Delta }\varvec{\Gamma }_{k}'\)]. Here, maximizing Eq. (3) reduces to the maximization of
$$\begin{aligned} -\frac{PRTI}{2}\log \lambda -\frac{1}{2\lambda }\sum \limits _{k=1}^{K} \,\text{ tr }(\varvec{\Gamma }_{k}'{\ddot{{\mathbf {Y}}}}_k\varvec{\Gamma }_{k}\varvec{\Delta }^{-1}). \end{aligned}$$
An algorithm similar to the one proposed by Celeux and Govaert (1995) can be employed here. In detail, the eigen decomposition \({\ddot{{\mathbf {Y}}}}_k={\ddot{{\varvec{L}}}}_k {\ddot{\varvec{\Omega }}}_k {\ddot{{\varvec{L}}}}_k'\) is first considered, with the eigenvalues in the diagonal matrix \({\ddot{\varvec{\Omega }}}_k\) arranged in descending order and the orthogonal matrix \({\ddot{{\varvec{L}}}}_k\) composed of the corresponding eigenvectors. Then, we obtain the updates for \(\varvec{\Gamma }_k\), \(\varvec{\Delta }\) and \(\lambda \) as
$$\begin{aligned} {\ddot{\varvec{\Gamma }}}_k={\ddot{{\varvec{L}}}}_k , \quad {\ddot{\varvec{\Delta }}} = \frac{\sum \limits _{k=1}^K {\ddot{\varvec{\Omega }}}_k}{\left| \sum \limits _{k=1}^K {\ddot{\varvec{\Omega }}}_k\right| ^\frac{1}{P}}\quad \text {and} \quad {\ddot{\lambda }} = \frac{\left| \sum \limits _{k=1}^K {\ddot{\varvec{\Omega }}}_k\right| ^\frac{1}{P}}{RTI}. \end{aligned}$$
-
Model VEV [\(\varvec{\Sigma }_k=\lambda _k\varvec{\Gamma }_{k}\varvec{\Delta }\varvec{\Gamma }_{k}'\)]. In this setting, maximizing Eq. (3) reduces to the maximization of
$$\begin{aligned} -\frac{PR}{2}\sum \limits _{k=1}^{K} \log \lambda _k \sum \limits _{i=1}^{I}\sum \limits _{t=1}^{T}{\ddot{z}}_{itk}-\sum \limits _{k=1}^{K}\frac{1}{2\lambda _k}\,\text{ tr }(\varvec{\Gamma }_k'{\ddot{{\mathbf {Y}}}}_k\varvec{\Gamma }_k\varvec{\Delta }^{-1}). \end{aligned}$$
By using the same algorithm applied for the EEV model, the updates for \(\varvec{\Gamma }_k\), \(\varvec{\Delta }\) and \(\lambda _{k}\) are
$$\begin{aligned} {\ddot{\varvec{\Gamma }}}_k= & {} {\ddot{{\varvec{L}}}}_k , \quad {\ddot{\varvec{\Delta }}} = \frac{\sum \limits _{k=1}^K {{\dot{\lambda }}}_k^{-1} {\ddot{\varvec{\Omega }}}_k}{\left| \sum \limits _{k=1}^K {{\dot{\lambda }}}_k^{-1} {\ddot{\varvec{\Omega }}}_k\right| ^\frac{1}{P}}\quad \text {and} \\ {\ddot{\lambda }}_k= & {} \frac{\,\text{ tr }\left\{ {\ddot{\varvec{\Omega }}}_k {\ddot{\varvec{\Delta }}}^{-1} \right\} }{PR\sum _{i=1}^I \sum _{t=1}^T \ddot{z}_{itk}}. \end{aligned}$$
-
Model EVV [\(\varvec{\Sigma }_k=\lambda \varvec{\Gamma }_{k}\varvec{\Delta }_{k}\varvec{\Gamma }_{k}'\)]. For this model, we first write \({\mathbf {C}}_k = \varvec{\Gamma }_k\varvec{\Delta }_k\varvec{\Gamma }_k'\). Then, maximizing Eq. (3) reduces to the maximization of
$$\begin{aligned} -\frac{PRTI}{2}\log \lambda -\frac{1}{2\lambda }\sum \limits _{k=1}^{K} \,\text{ tr }({\ddot{{\mathbf {Y}}}}_k{\mathbf {C}}_k^{-1}). \end{aligned}$$
The updates of this model can be obtained in a fashion similar to that of the EVI model, except for the fact that \({\mathbf {C}}_k\) is not diagonal. Thus, by employing Theorem A.1 of Celeux and Govaert (1995) we can update \({\mathbf {C}}_k\) and \(\lambda \) as
$$\begin{aligned} {\ddot{{\mathbf {C}}}}_k = \frac{{\ddot{{\mathbf {Y}}}}_k}{\left| {\ddot{{\mathbf {Y}}}}_k\right| ^{\frac{1}{P}}}\quad \text {and} \quad {\ddot{\lambda }} = \frac{\sum \limits _{k=1}^K \left| {\ddot{{\mathbf {Y}}}}_k\right| ^{\frac{1}{P}}}{RTI}. \end{aligned}$$
-
Model VVV [\(\varvec{\Sigma }_k=\lambda _{k}\varvec{\Gamma }_{k}\varvec{\Delta }_{k}\varvec{\Gamma }_{k}'\)]. In this well-known case, maximizing Eq. (3) reduces to the maximization of
$$\begin{aligned} -\frac{R}{2}\sum \limits _{k=1}^K\log |\varvec{\Sigma }_k|\sum \limits _{i=1}^{I}\sum \limits _{t=1}^{T}{\ddot{z}}_{itk}-\frac{1}{2}\sum \limits _{k=1}^K\,\text{ tr }\left( \varvec{\Sigma }_k^{-1}{\ddot{{\mathbf {Y}}}}_k\right) . \end{aligned}$$
Applying Theorem A.2 of Celeux and Govaert (1995), we update \(\varvec{\Sigma }_k\) as
$$\begin{aligned} {\ddot{\varvec{\Sigma }}}_k = \frac{{\ddot{{\mathbf {Y}}}}_k}{R\sum _{i=1}^I \sum _{t=1}^T \ddot{z}_{itk}}. \end{aligned}$$
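To fix ideas, the following sketch (under the same assumed data layouts as above) computes the row scatter matrices \({\ddot{{\mathbf {Y}}}}_k\) and two representative updates, EII and VVV; the intermediate models follow the same pattern from their respective formulas.

```r
# Row scatter matrix Y_k for state k, with M and Psi lists of per-state
# mean and (previous-iteration) column covariance matrices.
row_scatter <- function(X, z, M, Psi, k) {
  Y <- 0
  for (i in seq_len(dim(z)[1])) for (t in seq_len(dim(z)[2])) {
    E <- X[[i]][[t]] - M[[k]]
    Y <- Y + z[i, t, k] * E %*% solve(Psi[[k]], t(E))   # z * E Psi^-1 E'
  }
  Y
}
# EII: Sigma_k = lambda * I, with lambda = tr(Y) / (P R T I)
update_Sigma_EII <- function(Y_list, z, P, R) {
  lambda <- sum(vapply(Y_list, function(Y) sum(diag(Y)), numeric(1))) /
    (P * R * dim(z)[1] * dim(z)[2])
  replicate(length(Y_list), lambda * diag(P), simplify = FALSE)
}
# VVV: Sigma_k = Y_k / (R * sum_it z_itk)
update_Sigma_VVV <- function(Y_list, z, R) {
  lapply(seq_along(Y_list), function(k) Y_list[[k]] / (R * sum(z[, , k])))
}
```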
CM-Step 2 At the second CM-step, we maximize the expectation of the complete-data log-likelihood with respect to \(\varvec{\Theta }_{2}\), keeping \(\varvec{\Theta }_{1}\) fixed at \({\ddot{\varvec{\Theta }}}_{1}\). The update for \(\varvec{\Psi }_k\) depends on which of the 7 parsimonious structures is considered. For notational simplicity, let \({\ddot{{\mathbf {W}}}}=\sum _{k=1}^K {\ddot{{\mathbf {W}}}}_k\) be the update of the within-state column scatter matrix, where \({\ddot{{\mathbf {W}}}}_k = \sum _{i=1}^I \sum _{t=1}^T \ddot{z}_{itk}\left( {\mathbf {X}}_{it}-{\ddot{{\mathbf {M}}}}_k\right) '{\ddot{\varvec{\Sigma }}}_k^{-1}\left( {\mathbf {X}}_{it}-{\ddot{{\mathbf {M}}}}_k\right) \) is the update of the column scatter matrix related to the kth state. In detail, we have:
-
Model II [\(\varvec{\Psi }_k={\varvec{I}}\)]. This is the simplest model and no parameters need to be estimated.
-
Model EI [\(\varvec{\Psi }_k=\varvec{\Delta }\)]. In this setting, maximizing Eq. (3) reduces to the maximization of
$$\begin{aligned} -\frac{1}{2}\,\text{ tr }\left( {\ddot{{\mathbf {W}}}}\varvec{\Delta }^{-1}\right) . \end{aligned}$$
Applying Corollary A.1 of Celeux and Govaert (1995), we can obtain \(\varvec{\Delta }\) as
$$\begin{aligned} {\ddot{\varvec{\Delta }}} = \frac{\text {diag}\left( {\ddot{{\mathbf {W}}}}\right) }{\left| \text {diag}\left( {\ddot{{\mathbf {W}}}}\right) \right| ^\frac{1}{R}}. \end{aligned}$$
-
Model VI [\(\varvec{\Psi }_k=\varvec{\Delta }_k\)]. Here, maximizing Eq. (3) reduces to the maximization of
$$\begin{aligned} -\frac{1}{2}\sum \limits _{k=1}^K\,\text{ tr }\left( {\ddot{{\mathbf {W}}}}_k\varvec{\Delta }_k^{-1}\right) . \end{aligned}$$
Applying Corollary A.1 of Celeux and Govaert (1995), we can update \(\varvec{\Delta }_k\) as
$$\begin{aligned} {\ddot{\varvec{\Delta }}}_k = \frac{\text {diag}\left( {\ddot{{\mathbf {W}}}}_k\right) }{\left| \text {diag}\left( {\ddot{{\mathbf {W}}}}_k\right) \right| ^\frac{1}{R}}. \end{aligned}$$
-
Model EE [\(\varvec{\Psi }_k=\varvec{\Gamma }\varvec{\Delta }\varvec{\Gamma }'\)]. In this case, given that \(\varvec{\Psi }_1=\cdots =\varvec{\Psi }_K\equiv \varvec{\Psi }\), maximizing Eq. (3) reduces to the maximization of
$$\begin{aligned} -\frac{1}{2}\,\text{ tr }\left( {\ddot{{\mathbf {W}}}}\varvec{\Psi }^{-1}\right) . \end{aligned}$$
Applying Theorem A.2 of Celeux and Govaert (1995), we can update \(\varvec{\Psi }\) as
$$\begin{aligned} {\ddot{\varvec{\Psi }}} = \frac{{\ddot{{\mathbf {W}}}}}{\left| {\ddot{{\mathbf {W}}}} \right| ^\frac{1}{R}}. \end{aligned}$$
-
Model VE [\(\varvec{\Psi }_k=\varvec{\Gamma }\varvec{\Delta }_k\varvec{\Gamma }'\)]. In this setting, maximizing Eq. (3) reduces to the maximization of
$$\begin{aligned} -\frac{1}{2}\sum \limits _{k=1}^K\,\text{ tr }\left( \varvec{\Gamma }'{\ddot{{\mathbf {W}}}}_k\varvec{\Gamma }\varvec{\Delta }_k^{-1}\right) . \end{aligned}$$
Similarly to the EVE and VVE models in CM-step 1, there is no analytical solution for \(\varvec{\Gamma }\) when the other parameters are held fixed. Therefore, the MM algorithm is implemented by following the same procedure described for the EVE model, replacing \({\ddot{{\mathbf {Y}}}}_k\) with \({\ddot{{\mathbf {W}}}}_k\). Then, the update of \(\varvec{\Delta }_k\) is
$$\begin{aligned} {\ddot{\varvec{\Delta }}}_k= \frac{\text {diag}\left( {\ddot{\varvec{\Gamma }}}' {\ddot{{\mathbf {W}}}}_k {\ddot{\varvec{\Gamma }}}\right) }{\left| \text {diag}\left( {\ddot{\varvec{\Gamma }}}' {\ddot{{\mathbf {W}}}}_k {\ddot{\varvec{\Gamma }}}\right) \right| ^\frac{1}{R}}. \end{aligned}$$
-
Model EV [\(\varvec{\Psi }_k=\varvec{\Gamma }_k\varvec{\Delta }\varvec{\Gamma }_k'\)]. Here, maximizing Eq. (3) reduces to the maximization of
$$\begin{aligned} -\frac{1}{2}\sum \limits _{k=1}^K\,\text{ tr }\left( \varvec{\Gamma }_k'{\ddot{{\mathbf {W}}}}_k\varvec{\Gamma }_k\varvec{\Delta }^{-1}\right) . \end{aligned}$$
By using the same approach as for the EEV and VEV models, and replacing \({\ddot{{\mathbf {Y}}}}_k\) with \({\ddot{{\mathbf {W}}}}_k\), we obtain the updates of \(\varvec{\Gamma }_k\) and \(\varvec{\Delta }\) as
$$\begin{aligned} {\ddot{\varvec{\Gamma }}}_k={\ddot{{\varvec{L}}}}_k \quad \text {and} \quad {\ddot{\varvec{\Delta }}} = \frac{\sum \limits _{k=1}^K {\ddot{\varvec{\Omega }}}_k}{\left| \sum \limits _{k=1}^K {\ddot{\varvec{\Omega }}}_k\right| ^\frac{1}{R}}. \end{aligned}$$
-
Model VV [\(\varvec{\Psi }_k=\varvec{\Gamma }_k\varvec{\Delta }_k\varvec{\Gamma }_k'\)]. In this fully unconstrained case, maximizing Eq. (3) reduces to the maximization of
$$\begin{aligned} -\frac{1}{2}\sum \limits _{k=1}^K\,\text{ tr }\left( {\ddot{{\mathbf {W}}}}_k\varvec{\Psi }_k^{-1}\right) . \end{aligned}$$
Applying Theorem A.2 of Celeux and Govaert (1995), we update \(\varvec{\Psi }_k\) as
$$\begin{aligned} {\ddot{\varvec{\Psi }}}_k = \frac{{\ddot{{\mathbf {W}}}}_k}{\left| {\ddot{{\mathbf {W}}}}_k\right| ^\frac{1}{R}}. \end{aligned}$$
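Analogously to CM-step 1, a sketch of \({\ddot{{\mathbf {W}}}}_k\) and of the EE and VV updates follows (same assumed structures; note that dividing by \(\det (\cdot )^{1/R}\) enforces the \(|\varvec{\Psi }|=1\) restriction):

```r
# Column scatter matrix W_k for state k, with Sigma a list of the
# current-iteration row covariance matrices.
col_scatter <- function(X, z, M, Sigma, k) {
  W <- 0
  for (i in seq_len(dim(z)[1])) for (t in seq_len(dim(z)[2])) {
    E <- X[[i]][[t]] - M[[k]]
    W <- W + z[i, t, k] * t(E) %*% solve(Sigma[[k]], E)  # z * E' Sigma^-1 E
  }
  W
}
update_Psi_EE <- function(W, R) W / det(W)^(1 / R)        # common Psi
update_Psi_VV <- function(W_k, R) W_k / det(W_k)^(1 / R)  # state-specific Psi_k
```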
Table 2 Average MSEs of the parameter estimates for the EII-II MV-HMM. The average is computed among the MSEs of the elements of each estimated parameter, over the K states and the 50 data sets in each scenario

A note on the initialization strategy
To start our ECM algorithm, we followed the approach of Tomarchio et al. (2020), where a generalization of the short-EM initialization strategy proposed by Biernacki et al. (2003) has been implemented. It consists in H short runs of the algorithm from several random positions. The term “short” means that the algorithm is run for a few iterations s, without waiting for convergence. In this manuscript, we set \(H=100\) and \(s=1\). Then, the parameter set producing the largest log-likelihood is used to initialize the ECM algorithm. In both simulated and real data analyses this procedure has shown stable results after multiple runs. Operationally, this initialization strategy is implemented in the HMM.init() function of the FourWayHMM package.
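Schematically, and assuming a hypothetical ecm() routine that runs the ECM algorithm from random starting values for at most max_iter iterations and returns the attained log-likelihood and parameters (in practice, this is the role played by HMM.init()), the strategy reads:

```r
# Short-EM initialization: H short runs of s iterations each, then a full
# run from the best short-run parameter set. ecm() is hypothetical.
H <- 100; s <- 1
short_runs <- lapply(seq_len(H), function(h)
  ecm(X, K, start = "random", max_iter = s))
logliks <- vapply(short_runs, function(r) r$logLik, numeric(1))
best <- short_runs[[which.max(logliks)]]
fit <- ecm(X, K, start = best$par)   # full run, until convergence
```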