Parsimonious hidden Markov models for matrix-variate longitudinal data

Hidden Markov models (HMMs) have been extensively used in the univariate and multivariate literature. However, in recent years there has been increasing interest in the analysis of matrix-variate data. In this manuscript we introduce HMMs for matrix-variate balanced longitudinal data, by assuming a matrix normal distribution in each hidden state. Such data are arranged in a four-way array. To address possible overparameterization issues, we consider the eigen decomposition of the covariance matrices, leading to a total of 98 HMMs. An expectation-conditional maximization algorithm is discussed for parameter estimation. The proposed models are first investigated on simulated data, in terms of parameter recovery, computational times and model selection. Then, they are fitted to a four-way real data set concerning the unemployment rates of the Italian provinces, evaluated by gender and age classes over the last 16 years.


Introduction
Multivariate longitudinal data have been widely analyzed in the literature (Verbeke et al. 2014; Verdam and Oort 2019). By focusing on balanced data, i.e. those where each unit is observed at all times, they are usually presented in the standard three-way format, where units, times and variables are arranged in software-ready manners. Because of their three-way structure, multivariate balanced longitudinal data have recently been arranged in a matrix-variate fashion (Huang et al. 2019; Viroli 2011b): for each unit i = 1, . . . , I, we observe a P × T matrix, where P and T denote the number of variables and times, respectively. Then, such data have been used for model-based clustering via matrix-variate mixture models (see e.g. Melnykov and Zhu 2019; Tomarchio et al. 2022; Zhu and Melnykov 2021). This allows both for clustering units in homogeneous groups, defined according to similarities between matrix-variate data, and for separately modeling the association between variables and times. Unfortunately, this procedure has two side effects: (a) using the time on either the rows or the columns of the matrices reduces the types of longitudinal data structures that can be arranged in a matrix-variate framework. For instance, spatio-temporal data are used either to analyze P variables observed at T times for R different locations (Viroli 2011b) or to evaluate one measurement on R locations at T times on a set of I units (Viroli 2011a). However, it is not possible to jointly consider P variables at R locations for T times on I units. A possible solution could be to combine locations and times in a single RT-dimension, as done by Viroli (2011a), but this implies a loss in terms of interpretability as well as an increase in the number of parameters of the estimated models, given the higher dimensionality of the matrices.
Another example consists of two-factor data, which have been commonly considered in longitudinal settings (see e.g. Brunner and Puri 2001;Fitzmaurice and Ravichandran 2008;Noguchi et al. 2012). Such data have been recently used in matrix-variate mixture models by Sarkar et al. (2020) in a not-longitudinal way, given that the factors fill the two dimensions of the P × R matrices for the I units, and an additional dimension for the time is required.
To summarize, it would be necessary to move from three-way to four-way arrays in order to properly consider and model all the discussed data features.
(b) The matrix-variate clustering approaches mentioned in (a) assume time-constant clustering, i.e. it is not possible for the sample units to move across clusters over time, and the evolution over time of the clustering structure is completely overlooked. Time-varying heterogeneity is an important feature specific to longitudinal data analysis and, as such, appropriate modeling strategies should be considered. Hidden Markov models (HMMs) have been extensively used to address this longitudinal data peculiarity (Altman 2007; Maruotti 2011; Bartolucci et al. 2012; Zucchini et al. 2017). Being (dependent) mixtures, HMMs simultaneously allow for clustering units and for modeling the evolution of the clustering over time.
To jointly consider the aspects in (a) and (b), in this manuscript we introduce and discuss HMMs for matrix-variate balanced longitudinal data (MV-HMMs), with a specific application to the two-factor longitudinal case. Such kind of data can be arranged in a four-way array of dimension P × R × I × T. A side effect of working with four-way data is the potentially large number of parameters involved. This often occurs because of the (row- and column-specific) covariance matrices, since P(P + 1)/2 and R(R + 1)/2 unique parameters must be estimated. One of the most classical ways of addressing this overparameterization issue involves the eigen decomposition of the covariance matrices introduced by Celeux and Govaert (1995). This decomposition offers remarkable flexibility and a geometric interpretation in terms of volume, shape and orientation of the hidden states (for other approaches available in the HMM literature, see Maruotti et al. 2017). By using the eigen decomposition of the covariance matrices, we obtain a family of 98 parsimonious MV-HMMs that will be described in Sect. 2.2, after the presentation of the general model (Sect. 2.1). In this framework, model parameters can be estimated by a full maximum likelihood method based on the Expectation Conditional Maximization (ECM) algorithm (Meng and Rubin 1993) and on recursions widely used in the HMM literature (Baum et al. 1970). An iterative Minorization-Maximization (MM) algorithm (Browne and McNicholas 2014) is also adopted to update some of the parameters related to a subset of the parsimonious MV-HMMs during the ECM algorithm.
In Sect. 3, we illustrate the proposal by a large-scale simulation study in order to investigate the empirical behavior of the proposed approach with respect to several aspects, such as the number of observed times, the number of hidden states, the data dimensionality and the association structure between factor levels. We focus on goodness of clustering and parameter recovery, as well as on computational times and model selection procedures. Furthermore, in Sect. 4 we test the proposal by analyzing a sample taken from the Italian National Institute of Statistics on the unemployment rate in 98 Italian provinces recorded over 16 years, also covering the 2008 crisis. We examine the unemployment rate arranged as a two-factor design, i.e. taking into account gender and age classes, by allowing some dynamics in the evolution of unemployment. We obtain a flexible model by including different associations across levels, changing according to the inferred dynamics, and by accounting for unobserved characteristics influencing changes in the provinces' unemployment patterns. For comparison purposes, we added two competing approaches that could be used if our models were not available, thus coercing the data into a three-way structure: (i) mixtures of parsimonious matrix-variate normal distributions and (ii) parsimonious multivariate normal HMMs. Finally, Sect. 5 summarizes the key aspects of our proposal along with possible future extensions.

The model
Let {X_it; i = 1, . . . , I, t = 1, . . . , T} be a sequence of matrix-variate balanced longitudinal observations recorded on I units over T times, with X_it ∈ R^(P×R), and let {S_it; i = 1, . . . , I, t = 1, . . . , T} be a first-order Markov chain defined on the state space {1, . . . , k, . . . , K}. As mentioned in Sect. 1, a HMM is a particular type of dependent mixture model consisting of two parts: the underlying unobserved process {S_it} that satisfies the Markov property, i.e.

Pr(S_it = s_it | S_i1 = s_i1, . . . , S_i,t−1 = s_i,t−1) = Pr(S_it = s_it | S_i,t−1 = s_i,t−1),

and the state-dependent observation process {X_it} for which the conditional independence property holds, i.e.

f(X_it | S_i1 = s_i1, . . . , S_iT = s_iT, X_i1, . . . , X_i,t−1, X_i,t+1, . . . , X_iT) = f(X_it | S_it = s_it),
where f(·) is a generic probability density function (pdf). Therefore, the unknown parameters in a HMM involve both the parameters of the Markov chain and those of the state-dependent pdfs. In detail, the parameters of the Markov chain are the initial probabilities π_ik = Pr(S_i1 = k), k = 1, . . . , K, where K is the number of states, and the transition probabilities π_ik|j = Pr(S_it = k | S_i,t−1 = j), where k refers to the current state and j to the one previously visited. To simplify the discussion, we will consider homogeneous HMMs, that is π_ik|j = π_k|j and π_ik = π_k, i = 1, . . . , I. We collect the initial probabilities in the K-dimensional vector π, whereas the time-homogeneous transition probabilities are inserted in the K × K transition matrix Π.
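To fix ideas, the homogeneous hidden chain described above can be simulated in a few lines. The following Python sketch is our own illustration (it is not part of the FourWayHMM package, and the parameter values are hypothetical): it draws a state path from given initial probabilities π and transition matrix Π.

```python
import numpy as np

def simulate_chain(pi, Pi, T, rng):
    """Simulate one path of a homogeneous first-order Markov chain.

    pi : (K,) initial probabilities; Pi : (K, K) transition matrix with
    Pi[j, k] = Pr(S_t = k | S_{t-1} = j).
    """
    states = np.empty(T, dtype=int)
    states[0] = rng.choice(len(pi), p=pi)
    for t in range(1, T):
        # the next state depends only on the current one (Markov property)
        states[t] = rng.choice(len(pi), p=Pi[states[t - 1]])
    return states

rng = np.random.default_rng(0)
pi = np.array([0.6, 0.4])
Pi = np.array([[0.9, 0.1],
               [0.2, 0.8]])
path = simulate_chain(pi, Pi, 10, rng)
```

Over a long horizon, the empirical state frequencies of such a path approach the stationary distribution of Π, which for the toy matrix above is (2/3, 1/3).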
Regarding the conditional density of the observed process, it will be given by a matrix normal distribution, i.e.

f(X_it | S_it = k) = (2π)^(−PR/2) |Σ_k|^(−R/2) |Ψ_k|^(−P/2) exp{−(1/2) tr[Σ_k^(−1) (X_it − M_k) Ψ_k^(−1) (X_it − M_k)']},    (1)

where M_k is the P × R matrix of means, Σ_k is the P × P covariance matrix containing the covariances between the P rows, Ψ_k is the R × R covariance matrix containing the covariances of the R columns, and θ_k = {M_k, Σ_k, Ψ_k}. For an exhaustive description of the matrix normal distribution and its properties see Gupta and Nagar (2018).
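As a numerical illustration of the density above (ours, with hypothetical parameter values), the following Python sketch evaluates the matrix normal log-density directly and cross-checks it against the equivalent PR-variate normal for the column-stacked vector, vec(X) ~ N(vec(M), Ψ ⊗ Σ).

```python
import numpy as np
from scipy.stats import multivariate_normal

def matrix_normal_logpdf(X, M, Sigma, Psi):
    """Log-density of the matrix normal MN(M, Sigma, Psi) at a P x R matrix X."""
    P, R = X.shape
    D = X - M
    quad = np.trace(np.linalg.inv(Sigma) @ D @ np.linalg.inv(Psi) @ D.T)
    _, ldS = np.linalg.slogdet(Sigma)   # log|Sigma|
    _, ldP = np.linalg.slogdet(Psi)     # log|Psi|
    return -0.5 * (P * R * np.log(2 * np.pi) + R * ldS + P * ldP + quad)

# cross-check against the PR-variate normal with Kronecker covariance
rng = np.random.default_rng(1)
P_, R_ = 2, 3
M = rng.normal(size=(P_, R_))
A = rng.normal(size=(P_, P_)); Sigma = A @ A.T + P_ * np.eye(P_)
B = rng.normal(size=(R_, R_)); Psi = B @ B.T + R_ * np.eye(R_)
X = rng.normal(size=(P_, R_))
lp1 = matrix_normal_logpdf(X, M, Sigma, Psi)
lp2 = multivariate_normal.logpdf(X.flatten(order="F"),
                                 mean=M.flatten(order="F"),
                                 cov=np.kron(Psi, Sigma))
```

The agreement of `lp1` and `lp2` follows from the identities vec(X) ~ N(vec(M), Ψ ⊗ Σ) and |Ψ ⊗ Σ| = |Ψ|^P |Σ|^R.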

Parsimonious models
As discussed in Sect. 1, a way to reduce the number of parameters of the model is to introduce parsimony in the covariance matrices via the well-known eigen decomposition introduced by Celeux and Govaert (1995). Specifically, a Q × Q covariance matrix Σ_k can be decomposed as

Σ_k = λ_k Γ_k Δ_k Γ_k',    (2)

where λ_k = |Σ_k|^(1/Q), Γ_k is a Q × Q orthogonal matrix of the eigenvectors of Σ_k and Δ_k is the Q × Q diagonal matrix with the scaled eigenvalues of Σ_k (such that |Δ_k| = 1) located on the main diagonal. The decomposition in (2) has some useful practical interpretations. From a geometric point of view, λ_k determines the volume, Γ_k governs the orientation, and Δ_k denotes the shape of the kth state. From a statistical point of view, as well documented in Greselin and Punzo (2013), Bagnato and Punzo (2021) and Punzo and Bagnato (2021), the columns of Γ_k govern the orientation of the principal components (PCs) of the kth state, the diagonal elements of Δ_k are the normalized variances of these PCs, and λ_k can be interpreted as the overall volume of the scatter in the space spanned by the PCs of the kth state. By imposing constraints on the three components of (2), the fourteen parsimonious models of Table 1 are obtained.
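The volume/shape/orientation decomposition in (2) can be computed and verified directly; the following Python sketch (our illustration, with a toy matrix) extracts λ, Γ and Δ and reconstructs the original matrix.

```python
import numpy as np

def vso_decomposition(S):
    """Decompose a covariance matrix as S = lam * Gamma @ Delta @ Gamma.T,
    with lam = |S|^(1/Q) (volume), Delta diagonal with |Delta| = 1 (shape)
    and Gamma orthogonal (orientation)."""
    Q = S.shape[0]
    evals, Gamma = np.linalg.eigh(S)         # eigenvectors -> orientation
    lam = np.prod(evals) ** (1.0 / Q)        # |S|^(1/Q) -> volume
    Delta = np.diag(evals / lam)             # scaled eigenvalues -> shape
    return lam, Gamma, Delta

S = np.array([[4.0, 1.0],
              [1.0, 2.0]])
lam, Gamma, Delta = vso_decomposition(S)
S_back = lam * Gamma @ Delta @ Gamma.T
```

Constraining any of the three components (equal across states, identity, and so on) gives exactly the fourteen structures of Table 1.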
Considering that we have two covariance matrices in (1), this would yield 14 × 14 = 196 parsimonious MV-HMMs. However, there is a non-identifiability issue, since Σ ⊗ Ψ = Σ* ⊗ Ψ* whenever Σ* = aΣ and Ψ* = a^(−1)Ψ. As a result, Σ and Ψ are identifiable only up to a multiplicative constant a (Sarkar et al. 2020). To avoid such an issue, the column covariance matrix is restricted to have |Ψ| = 1, implying that the parameter λ_k in (2) is unnecessary for Ψ_k. This reduces the number of models related to Ψ_k from 14 to 7, i.e., I, Δ, Δ_k, ΓΔΓ', ΓΔ_kΓ', Γ_kΔΓ_k', Γ_kΔ_kΓ_k'. Therefore, we obtain 14 × 7 = 98 parsimonious MV-HMMs.
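The identifiability issue and its fix are easy to verify numerically. The Python sketch below (ours) shows that rescaling Σ by a and Ψ by 1/a leaves the Kronecker product unchanged, and that normalizing Ψ to unit determinant (absorbing the constant into Σ) resolves the ambiguity without changing the overall covariance.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(2, 2)); Sigma = A @ A.T + np.eye(2)
B = rng.normal(size=(3, 3)); Psi = B @ B.T + np.eye(3)

# (a*Sigma) kron (Psi/a) equals Sigma kron Psi for any a > 0
a = 2.5
K1 = np.kron(a * Sigma, Psi / a)
K2 = np.kron(Sigma, Psi)

# normalize Psi to |Psi| = 1 and absorb the constant into Sigma
R = Psi.shape[0]
c = np.linalg.det(Psi) ** (1.0 / R)
Psi_star = Psi / c          # unit-determinant column covariance
Sigma_star = Sigma * c      # rescaled row covariance
```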

Maximum likelihood estimation
To fit our MV-HMMs, we use the expectation-conditional maximization (ECM) algorithm (Meng and Rubin 1993). The ECM algorithm is a variant of the classical expectation-maximization (EM) algorithm (Dempster et al. 1977), from which it differs in that the M-step is replaced by a sequence of simpler and computationally convenient CM-steps.
Let S = {X_it; i = 1, . . . , I, t = 1, . . . , T} be a sample of matrix-variate balanced longitudinal observations. Then, the incomplete-data likelihood function is

L(Θ | S) = Π_{i=1}^{I} π' P(X_i1) Π P(X_i2) · · · Π P(X_iT) 1_K,

where P(X_it) is the K × K diagonal matrix with f(X_it; θ_k), k = 1, . . . , K, on the main diagonal, 1_K is a vector of K ones and Θ contains all the model parameters. In this setting, S is viewed as incomplete because, for each observation, we do not know its state membership and its evolution over time. For this reason, let us define the unobserved state memberships z_it = (z_it1, . . . , z_itk, . . . , z_itK)', where z_itk = 1 if unit i is in state k at time t and 0 otherwise, and the unobserved state transitions z_itjk, where z_itjk = 1 if unit i is in state j at time t − 1 and in state k at time t and 0 otherwise, for i = 1, . . . , I and t = 2, . . . , T. The corresponding complete-data log-likelihood is

l_c(Θ | S_c) = Σ_i Σ_k z_i1k log π_k + Σ_i Σ_{t=2}^{T} Σ_j Σ_k z_itjk log π_k|j + Σ_i Σ_t Σ_k z_itk log f(X_it; θ_k),    (3)

with θ = {θ_k; k = 1, . . . , K}. In the following, by adopting the notation used in Tomarchio et al. (2021a), the parameters marked with one dot will represent the updates at the previous iteration and those marked with two dots the updates at the current iteration. Furthermore, we implemented the ECM algorithm used for fitting all the 98 parsimonious MV-HMMs in the HMM.fit() function of the FourWayHMM package (Tomarchio et al. 2021b) for the R statistical software (R Core Team 2019).
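The matrix-product form of the incomplete-data likelihood can be illustrated with a toy example. The Python sketch below (ours; for simplicity it uses univariate Gaussian emissions in place of the matrix normal densities) computes the likelihood via the product π' P(x_1) Π P(x_2) · · · Π P(x_T) 1_K and checks it against a brute-force sum over all state paths.

```python
import itertools
import numpy as np
from scipy.stats import norm

def hmm_likelihood(x, pi, Pi, mus, sds):
    """Likelihood via the matrix-product form pi' P(x_1) Pi P(x_2) ... Pi P(x_T) 1_K."""
    v = pi * norm.pdf(x[0], mus, sds)            # row vector pi' P(x_1)
    for xt in x[1:]:
        v = (v @ Pi) * norm.pdf(xt, mus, sds)    # right-multiply by Pi P(x_t)
    return v.sum()                               # ... times 1_K

def brute_force(x, pi, Pi, mus, sds):
    """Same likelihood by summing over all K^T state paths (checking only)."""
    K, T = len(pi), len(x)
    total = 0.0
    for path in itertools.product(range(K), repeat=T):
        p = pi[path[0]] * norm.pdf(x[0], mus[path[0]], sds[path[0]])
        for t in range(1, T):
            p *= Pi[path[t - 1], path[t]] * norm.pdf(x[t], mus[path[t]], sds[path[t]])
        total += p
    return total

pi = np.array([0.5, 0.5])
Pi = np.array([[0.8, 0.2],
               [0.3, 0.7]])
mus, sds = np.array([0.0, 3.0]), np.array([1.0, 1.0])
x = np.array([0.1, 2.9, 3.2, -0.4])
L = hmm_likelihood(x, pi, Pi, mus, sds)
```

The matrix-product form costs O(TK^2) per unit, against the O(K^T) of the naive sum, which is why the forward recursion is the standard computational device for HMMs.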

E-Step

The E-step requires the calculation of the conditional expectation of (3), given S_c and the current estimates Θ̇. Therefore, we need to replace z_itk and z_itjk with their conditional expectations, namely z̈_itk and z̈_itjk. This can be efficiently done by exploiting a forward recursion approach (Baum et al. 1970; Baum 1972; Welch 2003). Let us start by defining the forward probability

α_it(k) = f(X_i1, . . . , X_it, S_it = k),

that is the probability of seeing the partial sequence up to time t and finishing up in state k at time t, and the corresponding backward probability

β_it(k) = f(X_i,t+1, . . . , X_iT | S_it = k).

These quantities can be computed recursively as α_i1(k) = π_k f(X_i1; θ_k), α_it(k) = [Σ_j α_i,t−1(j) π_k|j] f(X_it; θ_k), β_iT(k) = 1 and β_it(k) = Σ_j π_j|k f(X_i,t+1; θ_j) β_i,t+1(j). It is known that the computation of the forward and backward probabilities is susceptible to numerical overflow errors (Farcomeni 2012). To prevent, or at least to decrease, the risk of such errors, the well-known scaling procedure suggested by Durbin et al. (1998) can be implemented (for additional details, see also Zucchini et al. 2017). Then, the updates required in the E-step can be computed as

z̈_itk = α_it(k) β_it(k) / Σ_h α_it(h) β_it(h),
z̈_itjk = α_i,t−1(j) π̇_k|j f(X_it; θ̇_k) β_it(k) / Σ_h α_iT(h).

CM-Step 1

At the first CM-step, we maximize the expectation of (3) with respect to Θ_1, fixing Θ_2 at Θ̇_2. In particular, we obtain

π̈_k = (1/I) Σ_i z̈_i1k,
π̈_k|j = Σ_i Σ_{t=2}^{T} z̈_itjk / Σ_i Σ_{t=2}^{T} Σ_h z̈_itjh,
M̈_k = Σ_i Σ_t z̈_itk X_it / Σ_i Σ_t z̈_itk.

The update for Σ_k depends on the parsimonious structure considered. For notational simplicity, let Ÿ = Σ_{k=1}^{K} Ÿ_k be the update of the within-state row scatter matrix, where

Ÿ_k = Σ_i Σ_t z̈_itk (X_it − M̈_k) Ψ̇_k^(−1) (X_it − M̈_k)',

and let n_k = Σ_i Σ_t z̈_itk. The structure-specific updates parallel those derived by Celeux and Govaert (1995) for multivariate Gaussian mixtures, and can be summarized as follows.
- EII: maximizing (3) reduces to the maximization of a function of λ alone, yielding λ̈ = tr(Ÿ)/(PRIT).
- VII: maximizing (3) reduces to the maximization of a function of λ_k alone, yielding λ̈_k = tr(Ÿ_k)/(PRn_k).
- EEI: applying Corollary A.1 of Celeux and Govaert (1995), we obtain the updates λ̈ and Δ̈.
- VEI: applying Corollary A.1 of Celeux and Govaert (1995), we obtain the updates Δ̈ and λ̈_k.
- EVI: applying Corollary A.1 of Celeux and Govaert (1995), we obtain the updates Δ̈_k and λ̈.
- VVI: again by Corollary A.1 of Celeux and Govaert (1995), we obtain the updates Δ̈_k and λ̈_k.
- EEE: given that Σ_1 = · · · = Σ_K ≡ Σ, applying Theorem A.2 of Celeux and Govaert (1995) we update Σ̈.
- VEE: it is convenient to write Σ_k = λ_k C, where C = ΓΔΓ'. Applying Theorem A.1 of Celeux and Govaert (1995), we update C̈ and λ̈_k.
- EVE: there is no analytical solution for Γ while keeping the other parameters fixed, so an iterative Minorization-Maximization (MM) algorithm (Browne and McNicholas 2014) is employed. In detail, a surrogate function is constructed from the matrices Ÿ_k, with e_k being the largest eigenvalue of Ÿ_k; the update is Γ̈ = ĠḢ', where Ġ and Ḣ are obtained from the singular value decomposition of Ḟ, the matrix arising from the surrogate. This process is repeated until a specified convergence criterion is met. Then, we obtain the updates Δ̈_k and λ̈.
- VVE: again, there is no analytical solution for Γ, and its update is obtained by the MM algorithm as described for the EVE model. Then, we obtain the updates Δ̈_k and λ̈_k.
- EEV: an algorithm similar to the one proposed by Celeux and Govaert (1995) can be employed. In detail, the eigen decomposition Ÿ_k = L_k Ω_k L_k' is first considered, with the eigenvalues in the diagonal matrix Ω_k in descending order and the orthogonal matrix L_k composed of the corresponding eigenvectors. Then, we obtain the updates Γ̈_k, Δ̈ and λ̈.
- VEV: by using the same algorithm applied for the EEV model, we obtain the updates Γ̈_k, Δ̈ and λ̈_k.
- EVV: we first write C_k = Γ_k Δ_k Γ_k'. The updates can be obtained in a fashion similar to the EVI model, except for the fact that C_k is not diagonal. Thus, applying Theorem A.1 of Celeux and Govaert (1995), we update C̈_k and λ̈.
- VVV: in this well-known case, applying Theorem A.2 of Celeux and Govaert (1995), we update Σ̈_k.
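The scaled forward-backward recursions used in the E-step can be sketched compactly. The Python code below is our own illustration (not the package's implementation): `dens[t, k]` holds the state-conditional density f(X_it; θ_k) for one unit, and the scaling constants both stabilize the recursion and return the log-likelihood as a by-product.

```python
import numpy as np

def forward_backward(dens, pi, Pi):
    """Scaled forward-backward recursions for one unit.

    dens : (T, K) state-conditional densities f(x_t; theta_k).
    Returns gamma[t, k] = Pr(S_t = k | x_1, ..., x_T) and the log-likelihood.
    """
    T, K = dens.shape
    alpha = np.empty((T, K)); c = np.empty(T)
    alpha[0] = pi * dens[0]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]          # scale to avoid underflow
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ Pi) * dens[t]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        beta[t] = (Pi @ (dens[t + 1] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta                             # posterior memberships
    return gamma, np.log(c).sum()                    # log-likelihood = sum log c_t

dens = np.array([[0.4, 0.1],
                 [0.2, 0.5],
                 [0.3, 0.3]])
pi = np.array([0.6, 0.4])
Pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])
gamma, ll = forward_backward(dens, pi, Pi)
```

Since the same scale factors multiply α and divide β, the products α_t(k)β_t(k) are already the normalized posteriors, and the transition posteriors z̈_itjk follow from the analogous scaled products.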

CM-Step 2

At the second CM-step, we maximize the expectation of the complete-data log-likelihood with respect to Θ_2, keeping Θ_1 fixed at Θ̈_1. The update for Ψ_k depends on which of the 7 parsimonious structures is considered. For notational simplicity, let Ẅ = Σ_{k=1}^{K} Ẅ_k be the update of the within-state column scatter matrix, where

Ẅ_k = Σ_i Σ_t z̈_itk (X_it − M̈_k)' Σ̈_k^(−1) (X_it − M̈_k).

The structure-specific updates can be summarized as follows.
- Δ and Δ_k: following Celeux and Govaert (1995), we update the diagonal matrices Δ̈ and Δ̈_k, recalling that no volume parameter is involved for Ψ_k.
- ΓΔΓ': given that Ψ_1 = · · · = Ψ_K ≡ Ψ, applying Theorem A.2 of Celeux and Govaert (1995) we update Ψ̈.
- ΓΔ_kΓ': similarly to the EVE and VVE models in CM-Step 1, there is no analytical solution for Γ while keeping the other parameters fixed. Therefore, the MM algorithm is implemented by following the same procedure explained for the EVE model and by replacing Ÿ with Ẅ. Then, we obtain the update Δ̈_k.
- Γ_kΔΓ_k': by using the same approach of the EEV and VEV models, and by changing Ÿ with Ẅ, we obtain the updates Γ̈_k and Δ̈.
- Γ_kΔ_kΓ_k': applying Theorem A.2 of Celeux and Govaert (1995), we update Ψ̈_k.

Table 2 Average MSEs of the parameter estimates for the EII-II MV-HMM. The average is computed among the MSEs of the elements of each estimated parameter, over the K states and 50 data sets in each scenario
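For a single state with unconstrained row and column covariances, the alternation between CM-steps 1 and 2 mirrors the classical "flip-flop" algorithm for Kronecker-structured covariance estimation. The following Python sketch (ours, on simulated matrix normal data; not the FourWayHMM routines) illustrates the alternating updates together with the |Ψ| = 1 normalization of Sect. 2.2.

```python
import numpy as np

def flip_flop(X, M, n_iter=20):
    """Alternating MLE updates of the row (Sigma) and column (Psi) covariances
    for X[i] ~ MN(M, Sigma, Psi); Psi is rescaled to |Psi| = 1 each pass."""
    n, P, R = X.shape
    D = X - M
    Psi = np.eye(R)
    for _ in range(n_iter):
        Pinv = np.linalg.inv(Psi)
        # Sigma update: (1/(nR)) sum_i D_i Psi^{-1} D_i'
        Sigma = np.einsum('npr,rs,nqs->pq', D, Pinv, D) / (n * R)
        Sinv = np.linalg.inv(Sigma)
        # Psi update: (1/(nP)) sum_i D_i' Sigma^{-1} D_i
        Psi = np.einsum('npr,pq,nqs->rs', D, Sinv, D) / (n * P)
        Psi /= np.linalg.det(Psi) ** (1.0 / R)   # identifiability: |Psi| = 1
    return Sigma, Psi

# simulate X_i = M + L_S Z_i L_P', so that cov(vec(X_i)) = Psi kron Sigma
rng = np.random.default_rng(3)
n, P, R = 4000, 2, 3
Sigma_true = np.array([[2.0, 0.5], [0.5, 1.0]])
Psi_raw = np.array([[1.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 1.0]])
Psi_true = Psi_raw / np.linalg.det(Psi_raw) ** (1 / 3)   # unit determinant
Ls, Lp = np.linalg.cholesky(Sigma_true), np.linalg.cholesky(Psi_true)
Z = rng.normal(size=(n, P, R))
M = np.zeros((P, R))
X = M + Ls @ Z @ Lp.T
Sigma_hat, Psi_hat = flip_flop(X, M)
```

In the actual ECM algorithm each scatter matrix is additionally weighted by the posterior memberships z̈_itk, but the alternating structure is the same.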

A note on the initialization strategy
To start our ECM algorithm, we followed the approach of Tomarchio et al. (2020), where a generalization of the short-EM initialization strategy proposed by Biernacki et al. (2003) has been implemented. It consists of H short runs of the algorithm from several random positions. The term "short" means that the algorithm is run for a small number of iterations s, without waiting for convergence. In this manuscript, we set H = 100 and s = 1. Then, the parameter set producing the largest log-likelihood is used to initialize the ECM algorithm. In both the simulated and real data analyses, this procedure has shown stable results over multiple runs. Operationally, this initialization strategy is implemented in the HMM.init() function of the FourWayHMM package.
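The short-EM idea is general and easy to illustrate on a simpler model. The Python sketch below (ours; a univariate two-component Gaussian mixture stands in for the MV-HMM, and all names are hypothetical) runs H short EM starts of s iterations each and keeps the one with the largest log-likelihood.

```python
import numpy as np

def em_step(x, w, mu, sd):
    """One EM iteration for a two-component univariate Gaussian mixture."""
    # normalizing constants cancel in the responsibilities
    d = np.stack([w[k] * np.exp(-0.5 * ((x - mu[k]) / sd[k]) ** 2) / sd[k]
                  for k in range(2)])
    r = d / d.sum(axis=0)
    w = r.mean(axis=1)
    mu = (r * x).sum(axis=1) / r.sum(axis=1)
    sd = np.sqrt((r * (x - mu[:, None]) ** 2).sum(axis=1) / r.sum(axis=1))
    return w, mu, sd

def loglik(x, w, mu, sd):
    d = sum(w[k] * np.exp(-0.5 * ((x - mu[k]) / sd[k]) ** 2)
            / (sd[k] * np.sqrt(2 * np.pi)) for k in range(2))
    return np.log(d).sum()

def short_em_init(x, H=100, s=1, seed=0):
    """H short runs of s EM iterations from random starts; keep the best."""
    rng = np.random.default_rng(seed)
    best, best_ll = None, -np.inf
    for _ in range(H):
        mu = rng.choice(x, size=2, replace=False)    # random start
        w, sd = np.array([0.5, 0.5]), np.array([x.std(), x.std()])
        for _ in range(s):
            w, mu, sd = em_step(x, w, mu, sd)
        ll = loglik(x, w, mu, sd)
        if ll > best_ll:
            best_ll, best = ll, (w, mu, sd)
    return best

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(0, 1, 150), rng.normal(5, 1, 150)])
w, mu, sd = short_em_init(x)
```

Selecting the best of many cheap starts is the key trade-off: each run is too short to converge, but the ensemble explores the likelihood surface far more widely than one long run.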

Overview
In this section, we examine different aspects of our MV-HMMs through large-scale simulation studies. Given the high number of models introduced, we will only focus on two of them for the sake of simplicity. In detail, we consider the EII-II MV-HMM, which provides an example of a model having the same covariance structure for Σ and Ψ, and the VVE-EV MV-HMM, which provides an example of a model having an opposite covariance structure for Σ and Ψ. Furthermore, the EII-II MV-HMM is also the most parsimonious model, whereas the VVE-EV MV-HMM is one of the models for which the MM algorithm is used. For each model, several experimental conditions are evaluated. Specifically, we set I = 100 and consider two dimensions for the matrices (labeled D_1 when P = R = 2 and D_2 when P = 4 and R = 8), two numbers of times (T_1 = 5 and T_2 = 10), two numbers of hidden states (K = 2 and K = 4), and two levels of overlap (labeled O_1 and O_2). Therefore, 2 × 2 × 2 × 2 = 16 scenarios are analyzed and, for each of them, 50 data sets are generated by the considered MV-HMM. The parameters used to generate the data are reported in Appendix A.

Discussion
First of all, we evaluate the recovery and consistency of the estimated parameters by computing the mean square errors (MSEs). Considering the high number of parameters that should be reported, we follow an approach similar to the one used by Farcomeni and Punzo (2020), i.e. we calculate the average among the MSEs of the elements of each estimated parameter over the K states, allowing us to summarize the MSE of each parameter in a single number. Furthermore, before showing the obtained results, it is important to underline the well-known label switching issue, caused by the invariance of the likelihood function under relabeling of the model states (Frühwirth-Schnatter 2006). There are no generally accepted labeling methods, and we simply attribute the labels by looking at the estimated M_k. Tables 2 and 3 report the average MSEs, computed after fitting the EII-II and VVE-EV MV-HMMs, with the corresponding K, to the respective data sets. Note that the column covariance matrix is not reported in Table 2, since it is not estimated in the EII-II MV-HMM. As we can see, the MSEs can be considered negligible in all the considered scenarios. Regardless of the data dimensionality, it is interesting to note that, for a fixed overlap, the MSEs improve as T increases. Note also that, for fixed T, their values generally improve as we move from O_1 to O_2, reflecting the lower overlap among the states in the latter case. Additionally, when the VVE-EV MV-HMM is considered, the MM algorithm used for estimating Γ_k seems to produce reliable values. Another aspect that is interesting to evaluate is the computational time required for fitting the MV-HMMs.

Table 3 Average MSEs of the parameter estimates for the VVE-EV MV-HMM. The average is computed among the MSEs of the elements of each estimated parameter, over the K states and 50 data sets in each scenario
In detail, on each of the above data sets, all the 98 MV-HMMs are now fitted for the corresponding K , and their computational times (in seconds) are illustrated by using the heat maps of Figs. 1 and 2.
Computation is performed on a Windows 10 PC, with an AMD Ryzen 7 3700x CPU and 16.0 GB of RAM, using the 64-bit R statistical software; the proc.time() function of the base package is used to measure the elapsed time. As it is reasonable to expect, the computational time grows as T increases in each scenario, and it decreases when we pass from O_1 to O_2, highlighting the easier estimation in the latter case. Furthermore, the computational time roughly triples when we move from fitting MV-HMMs with K = 2 to MV-HMMs with K = 4 hidden states, and approximately quadruples when we compare D_1 to D_2. It is interesting to note that the EVE-VE and VVE-VE MV-HMMs, which are the two models for which we use an MM algorithm for estimating both covariance matrices, are the most time-consuming, with a computational burden that seems to double with respect to the other models. This is particularly evident in the O_2 scenarios.
The total computational time can be strongly reduced by exploiting parallel computing. In detail, Table 4 shows the overall time taken by fitting the 98 MV-HMMs sequentially (the default in R) and by parallelizing them on 14 cores. As we can see, the computational burden is decreased by a factor of about 10, and all the models can be fitted in a reasonably fast way (with some exceptions in the O_1 scenarios).
Lastly, the capability of the Bayesian information criterion (BIC; Schwarz 1978) to identify the true parsimonious structure and the correct number of groups is investigated. This is because, so far, we have fitted models with K equal to the true number of states, and we need to assess whether the BIC, one of the most widely used tools in model-based clustering, works accurately. Therefore, on each of the above data sets, the 98 MV-HMMs are fitted for K ∈ {1, . . . , K* + 1}, where K* is the true number of states, and the number of times the true parsimonious structure is selected by the BIC is reported in Table 5. First of all, in each scenario, the true K* has almost always been selected by the best fitting model according to the BIC, with only 6 exceptions for the VVE-EV model with dimension D_2, overlap O_1, K = 4 states and T_1 times, and 3 exceptions for the VVE-EV model with dimension D_2, overlap O_1, K = 4 states and T_2 times. Additionally, we notice that in almost all the cases the parsimonious structure of the true data-generating model has been identified by the BIC. In those few cases where the BIC selects a wrong model, this is because of an incorrect choice of the parsimonious structure for one of the two covariance matrices Σ or Ψ.
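Model selection via the BIC only needs the maximized log-likelihood and a count of free parameters. The Python sketch below (ours; the parameter counts reflect our reading of the nomenclature, and the "smaller is better" BIC convention is one of the two in common use) counts the free parameters of a homogeneous MV-HMM for two extreme covariance structures.

```python
import numpy as np

def n_parameters(K, P, R, n_sigma, n_psi):
    """Free parameters of a homogeneous MV-HMM: (K-1) initial probabilities,
    K(K-1) transition probabilities, K mean matrices of size P x R, plus the
    parameters of the row (Sigma) and column (Psi) covariance structures."""
    return (K - 1) + K * (K - 1) + K * P * R + n_sigma + n_psi

def bic(loglik, n_par, n_obs):
    """BIC in the 'smaller is better' convention."""
    return -2.0 * loglik + n_par * np.log(n_obs)

K, P, R = 2, 2, 2
# most parsimonious EII-II: Sigma = lambda*I shared across states (1 parameter),
# Psi = I (0 parameters) -- our reading of the nomenclature
p_eii_ii = n_parameters(K, P, R, n_sigma=1, n_psi=0)
# richest VVV-VV: K full P x P matrices plus K unit-determinant R x R matrices
p_vvv_vv = n_parameters(K, P, R,
                        n_sigma=K * P * (P + 1) // 2,
                        n_psi=K * (R * (R + 1) // 2 - 1))
```

The unit-determinant constraint on Ψ_k removes one free parameter per state on the column side, which is exactly the identifiability restriction of Sect. 2.2.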

Overview
In this section, we analyze data concerning the unemployment rate in the Italian provinces (NUTS 3, according to the European Nomenclature of Territorial Units for Statistics). The data come from the Italian National Institute of Statistics (ISTAT), a public research organization and the main producer of official statistics in the service of citizens and policy-makers in Italy, and are freely accessible at http://dati.istat.it/#. In detail, we investigate the I = 98 Italian provinces for which the unemployment rate is available from the beginning of the data collection at the provincial level (2004) to 2019. This implies that we are considering T = 16 years of data. Note that, to obtain a balanced data set, some provinces are not included in the analysis since they were available for only a few years. For each province, the unemployment rate is recorded in a two-factor format. The first factor, gender, has two levels (i.e. P = 2): males and females. The second factor, age, has three levels (i.e. R = 3) driven by the age class: 15-24, 25-34 and 35-older. Therefore, the whole data set is presented in a four-way array having dimension 2 × 3 × 98 × 16.

Fig. 1 Heat maps of the average computational time for fitting the 98 MV-HMMs, computed over 50 data sets generated by a EII-II MV-HMM with D_1 and K = 2 (a), D_1 and K = 4 (b), D_2 and K = 2 (c), D_2 and K = 4 (d)

Fig. 2 Heat maps of the average computational time for fitting the 98 MV-HMMs, computed over 50 data sets generated by a VVE-EV MV-HMM with D_1 and K = 2 (a), D_1 and K = 4 (b), D_2 and K = 2 (c), D_2 and K = 4 (d)
In analyzing this data set, several aspects are worth investigating. The first concerns the existence of areas with similar unemployment levels among the Italian provinces.
According to the existing literature on this topic (see, e.g., Cracolici et al. 2007, 2009), unemployment rates appear to vary widely across the country but, when analyzed at the provincial level, tend to be spatially clustered; in other terms, provinces show a certain amount of spatial autocorrelation. To include such information in the analysis, we adapted to matrix-variate longitudinal data an approach similar to that introduced by Scrucca (2005). Specifically, Scrucca (2005) proposed a clustering procedure based on the standardized Getis and Ord measure of local spatial autocorrelation (Getis and Ord 1992), herein labeled G. He applied such an approach to the analysis of the unemployment rates of the municipalities in the Umbria region (NUTS 2), but similar implementations have also been carried out in other applicative fields (see, e.g., Holden and Evans 2010; Appice et al. 2013). In our case, to implement this approach we
- computed an I × I symmetric spatial weight matrix which takes values equal to 1 for neighbouring provinces and 0 otherwise. We define neighbours via the symmetric relative graph criterion (Toussaint 1980; Jaromczyk and Toussaint 1992);
- computed, for a fixed t, x_it = vec(X_it), where vec(·) is the vectorization operator, thus transforming the P × R matrices of each province into PR-dimensional vectors. Then, we calculated G_j(x_it) for the j-th variable (j = 1, . . . , PR) on the i-th unit (i = 1, . . . , I) as in Scrucca (2005). Such a procedure is repeated for each t, with t = 1, . . . , T.
From an interpretative point of view, high (low) values of G_j(x_it) indicate the possibility of a local cluster of high (low) unemployment rates concerning the i-th province and its neighborhood.
The obtained values, which contain both spatial and unemployment information, are finally re-arranged, for each province, in the original P × R matrix-variate structure and used in the subsequent analyses.
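The local Getis-Ord idea can be illustrated on a toy configuration. In the Python sketch below (ours; Scrucca (2005) uses the closed-form standardization of Getis and Ord, whereas here, as a simpler stand-in, we standardize the neighborhood sum by Monte Carlo permutation of the remaining values), a positive value flags a local cluster of high rates and a negative value a cluster of low rates.

```python
import numpy as np

def getis_ord_G(x, W, i, n_perm=999, seed=0):
    """Local spatial statistic for unit i: the weighted sum of neighboring
    values, standardized by permutation of the values at the other units."""
    rng = np.random.default_rng(seed)
    others = np.delete(np.arange(len(x)), i)
    w = W[i, others]
    obs = w @ x[others]                  # observed neighborhood sum
    xo = x[others].copy()
    sims = np.empty(n_perm)
    for b in range(n_perm):
        rng.shuffle(xo)                  # reallocate values at random
        sims[b] = w @ xo
    return (obs - sims.mean()) / sims.std()

# toy example: five provinces on a line, high rates clustered at one end
x = np.array([10.0, 9.0, 5.0, 1.0, 0.0])
W = np.zeros((5, 5))
for a in range(4):
    W[a, a + 1] = W[a + 1, a] = 1.0      # adjacency on the line
```

In the application, this computation is repeated for each of the PR variables and each year t, and the resulting scores replace the raw unemployment rates.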
Another aspect of interest is the strength of time dependence as measured by the transition probability matrix, as well as how the provinces move between the hidden states. This latter aspect can be of particular interest in light of the great recession that occurred globally in 2007-2009, which led Italy to be one of the most affected countries.
As mentioned in Sect. 1, we compare the performance of our models with those of two alternative approaches that could be used if our models were not available:
1. mixtures of parsimonious matrix-variate normal distributions (MVN-Ms). To use such models, we collapse the I and T dimensions into a single IT dimension, obtaining a P × R × IT array. In doing this, we remove the modeling of the temporal structure of the data and lose interpretability because of the coercion of the data into a three-way array, leading to the issues discussed at points (a) and (b) of Sect. 1. A total of 98 parsimonious models is still obtained;
2. parsimonious multivariate normal HMMs (M-HMMs).
To use these models, we vectorize the P × R matrices of each province into PR-dimensional vectors, thus obtaining a PR × I × T array. While in this way we are still modeling the temporal structure of the data, the estimated model has the disadvantages mentioned at point (a) of Sect. 1. Notice that in this case we have a total of 14 parsimonious models.
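Both coercions amount to reshaping the four-way array. The Python sketch below (ours; C-order indexing is used for illustration, whereas a column-stacking vec(·) would require Fortran order) shows the two three-way arrangements and checks that no values are lost.

```python
import numpy as np

P, R, I, T = 2, 3, 98, 16
X = np.arange(P * R * I * T, dtype=float).reshape(P, R, I, T)

# (i) MVN-Ms: collapse units and times into a single I*T dimension,
# discarding the temporal structure
X_mvn = X.reshape(P, R, I * T)

# (ii) M-HMMs: vectorize each P x R matrix, discarding the two-factor
# structure but keeping time
X_hmm = X.reshape(P * R, I, T)
```

Either way, information is sacrificed: (i) loses the ordering in time, (ii) loses the separate row/column (gender/age) association structure that the matrix normal models explicitly.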

Discussion
All the competing models are fitted to the data for K ∈ {1, . . . , 9} and the corresponding results are reported in Table 6. Firstly, we notice that the overall best model according to the BIC is the VEE-EE MV-HMM with K = 8 hidden states. A similar number of states is also detected by the best of the M-HMMs, which has a VEE parsimonious structure but a worse BIC than our best model. Conversely, K = 6 hidden states are chosen for the best of the MVN-Ms which, despite a parsimonious structure similar to that of our best model, has by far the worst BIC value. Thus, the obtained results seem to suggest that (i) modeling the temporal structure is relevant for our data and (ii) the data coercion leads to worse fitting performance.
By focusing on the VEE-EE MV-HMM, and before graphically showing how the detected states cluster the Italian provinces, useful insights can be gained from its estimated parameters, starting with the estimated mean matrices of the hidden states. As we can note, the states can be sorted according to growing unemployment levels, in both the gender and age factors. More in detail, as we move from the first to the eighth state the unemployment levels rise, and each state becomes worse than the previous ones from any point of view. We can also observe that in the first four states the unemployment levels for males are lower than, or very similar to, those for females, whereas in the last four states the opposite behavior seems to occur. It is also worth reporting that in the first and fifth states the differences between the two genders decrease as the age classes increase, whereas in the seventh and eighth states (i.e., the worst states) such differences become larger for growing age classes. As for the gender-related covariance structure, we have different volumes (λ̂_1 = 0.11, λ̂_2 = 0.13, λ̂_3 = 0.19, λ̂_4 = 0.32, λ̂_5 = 0.37, λ̂_6 = 0.30, λ̂_7 = 0.33 and λ̂_8 = 0.53) but common orientation and shape matrices across the states. We can note how the size of the state scatter, as measured by the volumes, roughly increases as we move from the best to the worst states in terms of unemployment; instead, there is no need to over-parametrize the model in terms of shape and orientation, because the states share the same matrices. As for the age-related covariance structure, the variance is higher for the 15-24 age class, and relatively similar between the other two age classes. Lastly, it is worth analyzing the estimated transition probability matrix: transitions mostly occur between adjacent states, whereas they are practically null among distant states.
Furthermore, state persistence seems to roughly decrease as we move from the first to the seventh state. However, it increases for the last state; that is, it appears more difficult for the provinces clustered in the most troubled state to improve their position.
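The persistence pattern can be quantified through expected sojourn times: in an HMM the time spent in state k before leaving is geometric with parameter 1 − πkk, so its expectation is 1/(1 − πkk). The transition matrix below is a small hypothetical example with the same qualitative structure, not the estimated one.

```python
import numpy as np

# Hypothetical 4-state transition matrix with the qualitative pattern
# described in the text: heavy diagonal, transitions only between
# adjacent states, and the last state the most persistent.
Pi = np.array([[0.95, 0.05, 0.00, 0.00],
               [0.04, 0.90, 0.06, 0.00],
               [0.00, 0.07, 0.85, 0.08],
               [0.00, 0.00, 0.03, 0.97]])

# Sojourn time in state k is Geometric(1 - Pi[k, k]),
# so the expected dwell time is 1 / (1 - Pi[k, k]).
expected_dwell = 1.0 / (1.0 - np.diag(Pi))
```

Here the last state has the longest expected dwell time, mirroring the observation that provinces in the most troubled state find it hardest to move out.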
We have also tested the null hypothesis that the lengths of the segments within each state are geometrically distributed, as assumed by HMMs. To this end, we defined a simple union-intersection multiple testing procedure based on the intersection of K χ² goodness-of-fit tests, which compare the observed lengths of the segments with the theoretical ones (for other similar tests see, e.g., Maruotti et al. 2021). The obtained (unadjusted) p-values are then used to compute the adjusted p-values, which are directly comparable with the significance level α, according to the step-down procedure by Holm (1979). We notice that the minimum among the K adjusted p-values is 0.44; thus, we cannot reject the null hypothesis for any reasonable value of α.
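A minimal sketch of this procedure, assuming a decoded state sequence is available: segment lengths are extracted per state, a χ² statistic compares their observed counts with those expected under a geometric law (pooling the upper tail), and the raw p-values are adjusted with Holm's step-down rule. The helper names and the binning choice are illustrative, not taken from the paper.

```python
import numpy as np

def segment_lengths(states, k):
    # Lengths of the maximal runs ("segments") of state k in a decoded sequence.
    lengths, run = [], 0
    for s in states:
        if s == k:
            run += 1
        else:
            if run:
                lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths

def geom_chi2_stat(lengths, p_stay, max_len=5):
    # chi^2 statistic comparing observed segment lengths with the counts
    # expected under Geometric(1 - p_stay); lengths >= max_len are pooled.
    lengths = np.asarray(lengths)
    obs = np.array([np.sum(lengths == l) for l in range(1, max_len)]
                   + [np.sum(lengths >= max_len)], dtype=float)
    pmf = [p_stay ** (l - 1) * (1.0 - p_stay) for l in range(1, max_len)]
    probs = np.array(pmf + [p_stay ** (max_len - 1)])   # tail: P(L >= max_len)
    exp = probs * lengths.size
    return np.sum((obs - exp) ** 2 / exp)

def holm(pvals):
    # Holm's (1979) step-down adjustment of raw p-values.
    pvals = np.asarray(pvals)
    order = np.argsort(pvals)
    m = len(pvals)
    adj, running = np.empty(m), 0.0
    for rank, idx in enumerate(order):
        running = max(running, (m - rank) * pvals[idx])
        adj[idx] = min(1.0, running)
    return adj
```

Each χ² statistic would then be referred to the appropriate χ² distribution to obtain a raw p-value; the union-intersection test rejects only if all K adjusted p-values fall below α.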
We now report information on how the provinces have changed their state over the years and where the detected states are geographically located. This can be better understood by looking at the maps of the Italian provinces in Fig. 3, which are colored according to state memberships. Note that the provinces not included in the analysis are colored in gray. For simplicity, we do not plot a map for each of the 16 years of data, and we limit ourselves to reporting equidistant years covering the entire time period considered.
Starting from the first year of analysis, i.e. 2004, in Fig. 3a we can recognize several clusters of provinces that, as we move towards the south, belong to states with higher unemployment rates. After some years characterized by relatively few changes among the states, the economic recession produced its effects in the years 2008-2009, when many provinces, mainly located in the northern-central part of the country, started to perform badly (see Fig. 3b). In the subsequent years, there have been a number of switches among adjacent states, bringing some provinces to better states and others to worse ones (see Fig. 3c), although in the majority of cases the overall situation remains relatively distant from that of 2004. However, when the last year of analysis is considered in Fig. 3d, it is possible to perceive a slight trend change, with signs of recovery especially for the provinces located in the northern-central part of the country. In any case, these positive indications were dramatically arrested by the COVID-19 pandemic, whose effects will have serious repercussions in the coming years.

Conclusions
In this manuscript we introduced parsimonious hidden Markov models for matrix-variate balanced longitudinal data. Being (dependent) mixture models, they allow the recovery of homogeneous latent subgroups and, simultaneously, provide a meaningful interpretation of how the sample units move between the hidden states over time. Parsimony has been introduced via the eigen decomposition of the state covariance matrices, producing a family of 98 MV-HMMs. An ECM algorithm has been illustrated for parameter estimation. At first, the parameter recovery of our algorithm has been evaluated under different scenarios, providing good results. This is particularly interesting for those MV-HMMs that use an MM algorithm at each step of the ECM algorithm. Relatedly, we have analyzed the computational times for fitting all the 98 MV-HMMs. The computational burden of the MV-HMMs using an MM algorithm is considerably higher, even if all the MV-HMMs can be fitted reasonably fast when parallel computing is considered. The BIC has proven to be reliable in detecting both the true number of states in the data and the parsimonious structure. The real-data example has shown the usefulness of our MV-HMMs, firstly in comparison with the two alternative approaches and, secondly, in the interpretation of the detected states at the province level.
There are different possibilities for further work, some of which are worth mentioning. First of all, we can extend our MV-HMMs by using skewed or heavy-tailed state-dependent probability density functions (McNicholas 2017, 2019; Tomarchio et al. 2020, 2022), in order to model features commonly present in real data. A further avenue would deal with the regression setting (Viroli 2012), where covariates shared by all units in the same hidden state are used. This can be done both in a fixed and in a random covariates framework (Tomarchio et al. 2021a). Finally, another possibility would be to extend our models in order to deal with unbalanced or missing data.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A.1 Scenarios related to D 1

The mean matrix of the second state (M 2) is obtained by adding a constant c to each element of M 1, the value of which depends on the level of overlap. Specifically, we set c = 2 when O 1 is considered, whereas c = 5 when O 2 is examined. When K = 4, the first two hidden states have the same {Σ k, Ψ k, M k; k = 1, 2} as before. Clearly, the covariance matrices of the third and fourth hidden states for the EII-II MV-HMM are still equal to those of the first two states, whereas they differ for the VVE-EV MV-HMM. To obtain M 3 and M 4 we add c = 4 and c = −2 to each element of M 1, respectively, when O 1 is considered. Otherwise, when O 2 is considered, we add c = 10 and c = −5 to each element of M 1, respectively.

A.2 Scenarios related to D 2
As concerns the parameters used to generate the data when K = 2, and similarly to Sect. A.1, we add constants c to each element of M 1 to obtain M 2. In detail, for both models we set c = 0.5 when O 1 is considered, whereas c = 5 when O 2 is examined. When K = 4, the first two hidden states have the same {Σ k, Ψ k; k = 1, 2} and M 1 as before. Clearly, the covariance matrices of the third and fourth hidden states for the EII-II MV-HMM are still equal to those of the first two states, whereas they differ for the VVE-EV MV-HMM. Then, for both MV-HMMs we set π as described in Sect. A.1. Also in this case, we add constants c to each element of M 1 to obtain the other three mean matrices. Specifically, when O 1 is considered, we set c = 1, c = −1.5 and c = 2 for the EII-II MV-HMM and c = 0.5, c = −0.5 and c = 1 for the VVE-EV MV-HMM. Conversely, when O 2 is considered, we fix c = 5, c = 10 and c = −5 for both MV-HMMs.
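To illustrate how such scenarios can be generated, the sketch below draws a matrix normal observation via the standard factorization X = M + Σ^{1/2} Z Ψ^{1/2} and builds a shifted second-state mean. Only the mean-shift mechanism (here c = 2, as in scenario O 1 of Sect. A.1 with K = 2) follows the text; the dimensions, seed and identity covariances are placeholder assumptions.

```python
import numpy as np

def rmatnorm(M, Sigma, Psi, rng):
    # One draw from the matrix normal MN(M, Sigma, Psi) via
    # X = M + L_S @ Z @ L_P.T, where Sigma = L_S L_S', Psi = L_P L_P'
    # and Z has i.i.d. N(0, 1) entries.
    L_S = np.linalg.cholesky(Sigma)
    L_P = np.linalg.cholesky(Psi)
    Z = rng.standard_normal(M.shape)
    return M + L_S @ Z @ L_P.T

rng = np.random.default_rng(0)
P, T = 3, 4                     # hypothetical row/column dimensions
M1 = np.zeros((P, T))
c = 2.0                         # overlap-controlling shift (O 1, K = 2)
M2 = M1 + c                     # second-state mean: add c to every element
X = rmatnorm(M1, np.eye(P), np.eye(T), rng)
```

This construction gives vec(X) ~ N(vec(M), Ψ ⊗ Σ), so larger shifts c produce less overlapping states, as in the O 2 scenarios.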