1 Introduction

Nowadays, real data take on increasingly complex structures, with the consequence that traditional multivariate (2-way) techniques often turn out to be inadequate for conducting statistical analyses. A growing number of contributions in the model-based clustering literature have recently focused on the modeling of matrix-variate (3-way) data, where a \(d \times r\) matrix is observed for each statistical unit, with d denoting the variables and r the locations, occasions or time points (see, e.g., Viroli 2011a, b; Gallaugher and McNicholas 2018, 2020; Melnykov and Zhu 2018, 2019; Sarkar et al. 2020; Tomarchio et al. 2020, 2021). Although these contributions model more complex data, they remain constrained to the 3-way structure. Thus, more flexible tools are needed for modeling data having more general structures, i.e., an M-way or tensor-variate form.

With a specific focus on the finite mixture modeling literature, to the best of our knowledge, only a few contributions based on tensor-variate data have been introduced (see Tait and McNicholas 2019; Sarkar et al. 2021; Mai et al. 2022). Thus, our paper aims to expand this branch of the statistical literature. To this end, we first introduce two tensor-variate distributions, i.e., the tensor-variate shifted exponential normal (TVSEN) and tensor-variate tail inflated normal (TVTIN) distributions. As the tensor-variate normal (TVN) distribution generalizes the multivariate normal (MN) and matrix-variate normal (MVN) distributions to Mth-order tensors (Manceur and Dutilleul 2013), the TVSEN and TVTIN distributions analogously generalize their multivariate and matrix-variate counterparts (Punzo and Bagnato 2020, 2021; Tomarchio et al. 2020, 2022a). Then, we use the TVSEN and TVTIN distributions for model-based clustering by introducing the corresponding finite mixture models, labeled TVSEN-M and TVTIN-M.

There are several advantages to using the TVSEN-M and TVTIN-M models over (a) the TVN mixture model (TVN-M) and (b) their multivariate and matrix-variate versions fitted to the rearranged data. In detail:

  (a)

    the tails of the TVN components are often too light for a correct modeling of many real data. This is due to the presence of atypical points that make the traditional normality assumption inadequate. Therefore, it is necessary to use distributions having a more flexible tail behavior, such as the TVSEN and TVTIN. Indeed, our distributions have an additional parameter \(\theta \) that controls the heaviness of the tails, allowing for a better fit of the data in each cluster.

  (b)

    Unlike multivariate or matrix-variate mixture models, tensor-variate mixtures are more general and not forced to work with data having a specific structure. Hypothetically, an M-way dataset can be rearranged into a 3-way form via the so-called matricization or unfolding process, which consists of reordering the elements of an M-way structure into a matrix (for each observation) (Kolda and Bader 2009). Then, the existing 3-way models could be used for modeling the reshaped dataset. However, this process presents the following issues, which explain why it should be avoided:

    (1)

      there are multiple ways to rearrange the same M-way dataset via the matricization operator (Manceur and Dutilleul 2013). This would lead to different datasets and distinct interpretations of the results.

    (2)

      Once a dimension is folded into another, there is a loss of information, since that dimension can no longer be analyzed through its own specific parameters.

    (3)

      The number of parameters related to the covariance/scale matrices of each mixture component substantially increases even with tensors having low dimensions in each mode. This can also cause issues with model selection.

    All these problems are dramatically inflated if the data are further rearranged into a 2-way form via the vectorization process. Conversely, our models overcome all these issues.

Despite their greater parsimony compared to 3-way or 2-way approaches, tensor-variate mixtures can suffer from overparameterization issues. To alleviate this problem, we adopt the eigen-decomposition of the component scale matrices (Celeux and Govaert 1995), obtaining three families of parsimonious tensor-variate mixtures: two related to our TVSEN and TVTIN distributions and one, obtained as a by-product, for the existing TVN distribution. Notice that, unlike the existing families of (eigen) parsimonious mixture models for 3-way or 2-way data, where the number of parsimonious models is fixed at 98 (Sarkar et al. 2020) or 14 (Celeux and Govaert 1995), respectively, for tensor-variate mixtures there is no fixed number of models, since it grows with the order M of the tensor. Thus, as M grows, the number of models to be initialized and fitted substantially increases. To shorten this process, we implement and discuss a convenient initialization and fitting strategy that reduces the total number of parsimonious models to be fitted until convergence for a given dataset.

The rest of the paper is organized as follows. Section 2 outlines some preliminary and notational concepts. In Sect. 3, the TVSEN and TVTIN distributions are introduced, along with the corresponding families of parsimonious mixture models. Parsimonious mixtures of TVN distributions are also herein introduced as a by-product. In Sect. 4, details on parameter estimation, based on variants of the well-known expectation-maximization (EM) algorithm (Dempster et al. 1977), are illustrated. Section 5 is devoted to computational aspects, such as the initialization and fitting strategy implemented. In Sect. 6, we conduct different analyses on simulated data showing the adequacy of the procedures discussed in Sect. 5, the usefulness of considering heavy-tailed models in the presence of atypical observations, and the advantages of tensor-variate mixtures compared to their 3-way and 2-way versions. Two analyses on real data, having a 4-way and 5-way structure, respectively, are then discussed in Sect. 7. Finally, Sect. 8 concludes the paper with a summary.

2 Notation and preliminaries

In Sect. 2.1, we provide some key notational concepts that are extensively used throughout the manuscript. Notice that, to avoid abuse of notation, we do not distinguish between a random tensor (or matrix) and its observed realization. Further details about the concepts and the notation herein presented can be found in Kolda and Bader (2009) and Min et al. (2022).

In Sects. 2.2 and 2.3 the TVN distribution and the tensor-variate normal scale mixture (TVNSM) model are recalled, respectively.

2.1 Notation

A random tensor \(\mathcal {X}\in I\hspace{-1.4mm}R^{p_1 \times \cdots \times p_M}\) is called a tensor of order M. We define \(p = \prod _{m=1}^{M} p_m\) and \(p_{-m}=\prod _{m'=1, m'\ne m}^{M} p_{m'}\). The vectorization of a tensor \(\mathcal {X}\) is denoted as \(\text {vec}\left( \mathcal {X}\right) \in I\hspace{-1.4mm}R^{p \times 1}\), with \(X_{i_1,\ldots ,i_M}\) being its jth element, where \(j=1+\sum _{m=1}^M(i_m-1)\prod _{m'=1}^{m-1}p_{m'}\). The mode-m matricization of \(\mathcal {X}\) is denoted as \(\textbf{X}^{(m)} \in I\hspace{-1.4mm}R^{p_m \times p_{-m}}\) and, as introduced in Sect. 1, consists of the rearrangement of the elements of a tensor into a matrix. The mode-m product of a tensor \(\mathcal {X}\) with a matrix \(\textbf{A}\in I\hspace{-1.4mm}R^{d \times p_m}\) is defined as \(\mathcal {X}\times _{m} \textbf{A}\), and produces a tensor of dimension \(p_1 \times \cdots \times p_{m-1} \times d \times p_{m+1} \times \cdots \times p_{M}\). For a list of matrices \(\left\{ \textbf{A}\right\} =\left\{ \textbf{A}_1,\ldots ,\textbf{A}_M\right\} \), with \(\textbf{A}_m \in I\hspace{-1.4mm}R^{d_m \times p_{m}}\), we define \(\mathcal {X}\times \left\{ \textbf{A}\right\} = \mathcal {X}\times _1 \textbf{A}_1 \cdots \times _M\textbf{A}_M\). The Frobenius norm of a tensor \(\mathcal {X}\) is defined as \(||\mathcal {X}||_{F}= \left( \sum _{i_1,\ldots ,i_M}X^{2}_{i_1,\ldots ,i_M}\right) ^{1/2}\).
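For illustration, the operations above can be reproduced in base R. This is a minimal sketch: the helper names unfold(), mode_prod() and fnorm() are ours and do not belong to any package.

```r
# Minimal base-R sketch of the notation of Sect. 2.1
p <- c(2, 3, 4)                        # a p_1 x p_2 x p_3 tensor
X <- array(rnorm(prod(p)), dim = p)

# vec(X): column-major vectorization, matching the index map
# j = 1 + sum_m (i_m - 1) prod_{m' < m} p_{m'}
vecX <- as.vector(X)

# Mode-m matricization X^(m): mode m becomes the row index
unfold <- function(X, m) {
  p <- dim(X)
  matrix(aperm(X, c(m, seq_along(p)[-m])), nrow = p[m])
}

# Mode-m product X x_m A, with A a d x p_m matrix
mode_prod <- function(X, A, m) {
  p <- dim(X)
  Y <- array(A %*% unfold(X, m), dim = c(nrow(A), p[-m]))
  aperm(Y, order(c(m, seq_along(p)[-m])))  # restore the original mode order
}

# Frobenius norm
fnorm <- function(X) sqrt(sum(X^2))
```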

It should also be noted that when a sample of random tensors \(\mathcal {X}_i, i=1,\ldots , N\), is considered, the order of the resulting dataset is \(M+1\). For example, a 4-way dataset consists of random tensors \(\mathcal {X}_i \in I\hspace{-1.4mm}R^{p_1 \times p_2 \times p_3}, i=1,\ldots , N\).

2.2 The tensor-variate normal distribution

A random tensor \(\mathcal {X}\in I\hspace{-1.4mm}R^{p_1 \times \cdots \times p_M}\) follows a TVN distribution, denoted by \(\mathcal {T}\mathcal {V}\mathcal {N}_{p_1 \times \cdots \times p_M}\left( \mathcal {M},\left\{ \varvec{\Sigma }\right\} \right) \), if its probability density function (pdf) is

$$\begin{aligned} f\left( \mathcal {X};\mathcal {M},\left\{ \varvec{\Sigma }\right\} \right) = \left( 2\pi \right) ^{-\frac{p}{2}}\prod _{m=1}^{M}|\varvec{\Sigma }_m|^{-\frac{p_{-m}}{2}} \exp \left[ -\frac{\varvec{\delta }\left( \mathcal {X};\mathcal {M},\left\{ \varvec{\Sigma }\right\} \right) }{2}\right] , \end{aligned}$$
(1)

where \(\mathcal {M}\in I\hspace{-1.4mm}R^{p_1 \times \cdots \times p_M}\) is the mean tensor, \(\varvec{\Sigma }_m \in I\hspace{-1.4mm}R^{p_m \times p_{m}}\) is the covariance matrix of the mth dimension, and \(\varvec{\delta }\left( \mathcal {X};\mathcal {M},\left\{ \varvec{\Sigma }\right\} \right) =||\left( \mathcal {X}-\mathcal {M}\right) \times \left\{ \varvec{\Sigma }^{-\frac{1}{2}}\right\} ||_{F}^2\) represents the squared Mahalanobis distance (Bagherian et al. 2022).
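As a concrete illustration, the squared Mahalanobis distance and the log of the pdf in (1) can be sketched in R by reusing unfold() and mode_prod() from the sketch of Sect. 2.1; maha2() and dtvn_log() are illustrative names of ours.

```r
# Hedged sketch of delta(X; M, {Sigma}) and of the TVN log-density in Eq. (1);
# Sigma is a list of the M covariance matrices, X and M arrays of equal dims
maha2 <- function(X, M, Sigma) {
  Z <- X - M
  for (m in seq_along(Sigma)) {            # (X - M) x_m Sigma_m^{-1/2}
    e <- eigen(Sigma[[m]], symmetric = TRUE)
    S_inv_sqrt <- e$vectors %*% diag(1 / sqrt(e$values), nrow(Sigma[[m]])) %*%
      t(e$vectors)
    Z <- mode_prod(Z, S_inv_sqrt, m)
  }
  sum(Z^2)                                 # squared Frobenius norm
}

dtvn_log <- function(X, M, Sigma) {
  p <- dim(X); P <- prod(p)
  logdet <- sum(vapply(seq_along(Sigma), function(m)
    (P / p[m]) * as.numeric(determinant(Sigma[[m]])$modulus), numeric(1)))
  -0.5 * (P * log(2 * pi) + logdet + maha2(X, M, Sigma))
}
```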

Important properties of the TVN distribution are discussed in Manceur and Dutilleul (2013), Tait and McNicholas (2019), Gallaugher et al. (2021), and Sarkar et al. (2021). Here, we limit ourselves to recalling that the TVN, MVN, and MN distributions are related to each other as follows:

$$\begin{aligned}&\mathcal {X}\sim \mathcal {T}\mathcal {V}\mathcal {N}_{p_1 \times \cdots \times p_M}\left( \mathcal {M},\left\{ {\varvec{\Sigma }}\right\} \right) \equiv \nonumber \\&{\textbf {X}}^{(m)} \sim \mathcal {M}\mathcal {V}\mathcal {N}_{p_m \times p_{-m}}\left( {\textbf {M}}^{(m)},{\varvec{\Sigma }}_m,\bigotimes _{\begin{array}{c} m'=M,\\ m'\ne m \end{array}}^{1}{\varvec{\Sigma }}_{m'}\right) \equiv \nonumber \\&\text{ vec }\left( \mathcal {X}\right) \sim \mathcal {M}\mathcal {N}_{p} \left( \text{ vec }\left( \mathcal {M}\right) , \bigotimes _{m=M}^{1}{\varvec{\Sigma }}_m\right) , \end{aligned}$$
(2)

where \(\mathcal {M}\mathcal {V}\mathcal {N}\) and \(\mathcal {M}\mathcal {N}\) identify the MVN and MN distributions, respectively, \(\bigotimes \) represents the Kronecker product and \(\text {vec}(\cdot )\) is the vectorization operator. However, despite the equivalence among these densities, the use of the TVN distribution for the analysis of tensor-variate data brings the advantages discussed in Sect. 1.

We also recall an identifiability issue related to the properties of the Kronecker product, similar to that existing in the matrix-variate case (Gallaugher et al. 2021; Sarkar et al. 2021). This aspect can be addressed by imposing restrictions on the determinant of \(M-1\) covariance matrices (Sarkar et al. 2021). Specifically, we set \(|\varvec{\Sigma }_m|=1, m= 1,\ldots ,M-1\).
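In practice, this restriction can be enforced by a simple rescaling; a one-line sketch in R (the removed scale can be transferred to the unconstrained \(\varvec{\Sigma }_M\), leaving the Kronecker product unchanged):

```r
# Hedged sketch: rescale a scale matrix so that |Sigma_m| = 1
normalize_det <- function(S) S / det(S)^(1 / nrow(S))
```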

2.3 The tensor-variate normal scale mixture model

A random tensor \(\mathcal {X}\in I\hspace{-1.4mm}R^{p_1 \times \cdots \times p_M}\) follows a TVNSM distribution if its pdf is

$$\begin{aligned} f\left( \mathcal {X};\mathcal {M},\left\{ \varvec{\Sigma }\right\} ^*,\varvec{\theta }\right) = \int _{S_{h}} \mathcal {T}\mathcal {V}\mathcal {N}_{p_1 \times \cdots \times p_M}\left( \mathcal {M},\left\{ \varvec{\Sigma }\right\} ^*\right) h\left( w;\varvec{\theta }\right) dw, \end{aligned}$$
(3)

where \(\left\{ \varvec{\Sigma }\right\} ^* = \left\{ \varvec{\Sigma }_1/w,\varvec{\Sigma }_2,\ldots ,\varvec{\Sigma }_M\right\} \) and \(h\left( w;\varvec{\theta }\right) \) is the pdf of a mixing random variable W, defined over \(S_{h}\subseteq \left[ 0,\infty \right) \) and depending on the parameter(s) \(\varvec{\theta }\). Model (3) generalizes the TVN distribution to more flexible distributions that can accommodate the non-normal tail behavior often present in real data. Specifically, by varying \(h\left( w;\varvec{\theta }\right) \), different tensor-variate distributions can be obtained. Some examples in this direction can be found in Arashi (2021) and Gallaugher et al. (2021).

3 Methodology

Here, we first introduce the two new tensor-variate distributions (Sect. 3.1). Then, we discuss their greater flexibility in the tail behavior compared to the TVN distribution (Sect. 3.2), and their use within a finite mixture modeling framework (Sect. 3.3).

3.1 Two new tensor-variate distributions

Definition 3.1

A random tensor \(\mathcal {X}\in I\hspace{-1.4mm}R^{p_1 \times \cdots \times p_M}\) follows a TVSEN distribution if its pdf is given by

$$\begin{aligned} f\left( \mathcal {X};\mathcal {M},\left\{ \varvec{\Sigma }\right\} ,\theta \right) = \frac{\theta \exp (\theta )}{(2\pi )^{\frac{p}{2}}}\prod _{m=1}^{M}|\varvec{\Sigma }_m|^{-\frac{p_{-m}}{2}}\, \varphi _{\frac{p}{2}}\left( \frac{\varvec{\delta }\left( \mathcal {X};\mathcal {M},\left\{ \varvec{\Sigma }\right\} \right) }{2}+\theta \right) , \end{aligned}$$
(4)

where \(\mathcal {M}\in I\hspace{-1.4mm}R^{p_1 \times \cdots \times p_M}\) is the mean tensor, \(\varvec{\Sigma }_m \in I\hspace{-1.4mm}R^{p_m \times p_{m}}\) is the scale matrix for the mth dimension, \(\theta >0\) is a tailedness parameter, and \(\varphi _{\nu }(z)\) is the Misra function (Misra 1940), which can be written as \(\varphi _{\nu }(z)=\Gamma \left( \nu +1,z\right) z^{-\left( \nu +1\right) }\) in terms of the upper incomplete gamma function \(\Gamma (a,z)\). In symbols, \(\mathcal {X}\sim \mathcal {T}\mathcal {V}\mathcal {S}\mathcal {E}\mathcal {N}_{p_1 \times \cdots \times p_M}\left( \mathcal {M},\left\{ \varvec{\Sigma }\right\} ,\theta \right) \).

The pdf in (4) is a special case of the TVNSM model in (3) when a shifted exponential distribution with pdf \(h_{\text {\tiny SE}}(w;\theta )=\theta \exp \left[ -\theta \left( w-1\right) \right] \) on \(S_{h_{\text {\tiny SE}}}=\left( 1,\infty \right) \) is chosen for W. Indeed, substituting \(h_{\text {\tiny SE}}\) into (3) yields

$$\begin{aligned} \frac{\theta \exp (\theta )}{(2\pi )^{\frac{p}{2}}}\prod _{m=1}^{M}|\varvec{\Sigma }_m|^{-\frac{p_{-m}}{2}} \int _{1}^{\infty } w^{\frac{p}{2}} \exp \left\{ -w\left[ \frac{\varvec{\delta }\left( \mathcal {X};\mathcal {M},\left\{ \varvec{\Sigma }\right\} \right) }{2}+\theta \right] \right\} dw. \end{aligned}$$
(5)

Considering that the integral in (5) is equal to

$$\begin{aligned} \Gamma \left( \frac{p}{2}+1,\frac{\varvec{\delta }\left( \mathcal {X};\mathcal {M},\left\{ \varvec{\Sigma }\right\} \right) }{2}+\theta \right) \left[ \frac{\varvec{\delta }\left( \mathcal {X};\mathcal {M},\left\{ \varvec{\Sigma }\right\} \right) }{2}+\theta \right] ^{-\left( \frac{p}{2}+1\right) }, \end{aligned}$$

with \(\Gamma (a,z)\) representing the upper incomplete gamma function (Abramowitz and Stegun 1964), the pdf in (4) is obtained.
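A hedged numerical sketch of (4) follows, reusing maha2() from Sect. 2.2; the upper incomplete gamma function is obtained from base R's pgamma(), and the Misra function is implemented through the closed form implied by the derivation above. For large p, a log-scale implementation would be preferable to avoid overflow.

```r
# Upper incomplete gamma Gamma(a, z) and Misra function
# phi_nu(z) = Gamma(nu + 1, z) * z^{-(nu + 1)}
gamma_inc <- function(a, z) gamma(a) * pgamma(z, shape = a, lower.tail = FALSE)
misra     <- function(nu, z) gamma_inc(nu + 1, z) * z^(-(nu + 1))

# Hedged sketch of the TVSEN pdf in Eq. (4)
dtvsen <- function(X, M, Sigma, theta) {
  p <- dim(X); P <- prod(p)
  dets  <- prod(vapply(Sigma, function(S) det(S)^(-(P / nrow(S)) / 2),
                       numeric(1)))
  delta <- maha2(X, M, Sigma)
  theta * exp(theta) * (2 * pi)^(-P / 2) * dets *
    misra(P / 2, delta / 2 + theta)
}
```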

Definition 3.2

A random tensor \(\mathcal {X}\in I\hspace{-1.4mm}R^{p_1 \times \cdots \times p_M}\) follows a TVTIN distribution if its pdf is given by

$$\begin{aligned} f\left( \mathcal {X};\mathcal {M},\left\{ \varvec{\Sigma }\right\} ,\theta \right) =&\ \frac{2\pi ^{-\frac{p}{2}} \prod _{m=1}^{M}|\varvec{\Sigma }_m|^{-\frac{p_{-m}}{2}}}{\theta \,\varvec{\delta }\left( \mathcal {X};\mathcal {M},\left\{ \varvec{\Sigma }\right\} \right) ^{\frac{p}{2}+1}} \\&\times \left[ \Gamma \left( \frac{p}{2}+1,\left( 1-\theta \right) \frac{\varvec{\delta }\left( \mathcal {X};\mathcal {M},\left\{ \varvec{\Sigma }\right\} \right) }{2}\right) -\Gamma \left( \frac{p}{2}+1,\frac{\varvec{\delta }\left( \mathcal {X};\mathcal {M},\left\{ \varvec{\Sigma }\right\} \right) }{2}\right) \right] , \end{aligned}$$
(6)

where \(\mathcal {M}\in I\hspace{-1.4mm}R^{p_1 \times \cdots \times p_M}\) is the mean tensor, \(\varvec{\Sigma }_m \in I\hspace{-1.4mm}R^{p_m \times p_{m}}\) is the scale matrix for the mth dimension, and \(\theta \in (0,1)\) is a tailedness parameter. In symbols, \(\mathcal {X}\sim \mathcal {T}\mathcal {V}\mathcal {T}\mathcal {I}\mathcal {N}_{p_1 \times \cdots \times p_M}\left( \mathcal {M},\left\{ \varvec{\Sigma }\right\} ,\theta \right) \).

The pdf in (6) is obtained via the TVNSM model in (3) when a uniform distribution having pdf \(h_{\text {\tiny U}}(w;\theta )=1/\theta \) on \(S_{h_{\text {\tiny U}}}=\left( 1-\theta ,1\right) \) is used for the mixing random variable W. Indeed, substituting \(h_{\text {\tiny U}}\) into (3) yields

$$\begin{aligned} \frac{(2\pi )^{-\frac{p}{2}}\prod _{m=1}^{M}|\varvec{\Sigma }_m|^{-\frac{p_{-m}}{2}}}{\theta } \int _{1-\theta }^{1} w^{\frac{p}{2}} \exp \left[ -\frac{w}{2}\, \varvec{\delta }\left( \mathcal {X};\mathcal {M},\left\{ \varvec{\Sigma }\right\} \right) \right] dw. \end{aligned}$$
(7)

Since the integral in (7) is equal to

$$\begin{aligned} \left[ \Gamma \left( \frac{p}{2}+1,\left( 1-\theta \right) \frac{\varvec{\delta }\left( \mathcal {X};\mathcal {M},\left\{ \varvec{\Sigma }\right\} \right) }{2}\right) -\Gamma \left( \frac{p}{2}+1,\frac{\varvec{\delta }\left( \mathcal {X};\mathcal {M},\left\{ \varvec{\Sigma }\right\} \right) }{2}\right) \right] \left[ \frac{2}{\varvec{\delta }\left( \mathcal {X};\mathcal {M},\left\{ \varvec{\Sigma }\right\} \right) }\right] ^{\frac{p}{2}+1}, \end{aligned}$$

the pdf in (6) is obtained.
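Analogously, a hedged sketch of (6), reusing gamma_inc() and maha2() from the previous sketches:

```r
# Hedged sketch of the TVTIN pdf in Eq. (6)
dtvtin <- function(X, M, Sigma, theta) {
  p <- dim(X); P <- prod(p)
  dets  <- prod(vapply(Sigma, function(S) det(S)^(-(P / nrow(S)) / 2),
                       numeric(1)))
  delta <- maha2(X, M, Sigma)
  2 * pi^(-P / 2) * dets / (theta * delta^(P / 2 + 1)) *
    (gamma_inc(P / 2 + 1, (1 - theta) * delta / 2) -
       gamma_inc(P / 2 + 1, delta / 2))
}
```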

Corollary 3.1

Let \(\mathcal {T}\mathcal {V}\mathcal {D}_{p_1 \times \cdots \times p_M}\left( \mathcal {M},\left\{ \varvec{\Sigma }\right\} ,\theta \right) \) represent either the TVSEN or TVTIN distribution. Let \(\mathcal {M}\mathcal {V}\mathcal {D}_{p_m \times p_{-m}}\left( \textbf{M}^{(m)},\varvec{\Sigma }_m,\bigotimes _{\begin{array}{c} m'=M,\\ m'\ne m \end{array}}^{1}\varvec{\Sigma }_{m'},\theta \right) \) and \(\mathcal {M}\mathcal {D}_p\left( \text {vec}\left( \mathcal {M}\right) , \bigotimes _{m=M}^{1}\varvec{\Sigma }_m,\theta \right) \) identify the corresponding matrix-variate and multivariate distributions. Then, the following statements are equivalent:

$$\begin{aligned}&\mathcal {X}\sim \mathcal {T}\mathcal {V}\mathcal {D}_{p_1 \times \cdots \times p_M}\left( \mathcal {M},\left\{ \varvec{\Sigma }\right\} ,\theta \right) \equiv \\&\textbf{X}^{(m)} \sim \mathcal {M}\mathcal {V}\mathcal {D}_{p_m \times p_{-m}}\left( \textbf{M}^{(m)},\varvec{\Sigma }_m,\bigotimes _{\begin{array}{c} m'=M,\\ m'\ne m \end{array}}^{1}\varvec{\Sigma }_{m'},\theta \right) \equiv \\&\text {vec}\left( \mathcal {X}\right) \sim \mathcal {M}\mathcal {D}_{p} \left( \text {vec}\left( \mathcal {M}\right) , \bigotimes _{m=M}^{1}\varvec{\Sigma }_m,\theta \right) . \end{aligned}$$
(8)

The proof is an easy application of the properties of the TVN distribution shown in (2) and of the functional form of the TVNSM model presented in (3), from which both the TVSEN and TVTIN distributions are derived. A similar discussion can be found in Gallaugher et al. (2021).

Fig. 1 Multiplicative factors \(u(\theta )\) and \(v(\theta )\) for the two distributions

3.2 Moments of the two tensor-variate distributions

In this section, we provide some of the moments of our tensor-variate distributions. In addition to the mean, we give details about the covariance and kurtosis, which help assess the influence of atypical points on a distribution (Rachev et al. 2010). Specifically, by using the relationships in (8) and the discussions of Manceur and Dutilleul (2013) and Punzo and Bagnato (2020, 2021), we have

$$\begin{aligned} \mathbb {E}(\mathcal {X})&= \mathcal {M}, \nonumber \\ \text {Var}(\mathcal {X})&= u(\theta ) \bigotimes _{m=M}^{1}\varvec{\Sigma }_m, \end{aligned}$$
(9)
$$\begin{aligned} \text {Kurt}(\mathcal {X})&= v(\theta )\, p(p+2), \end{aligned}$$
(10)

where \(u(\theta )= \mathbb {E}(1/W) \) and \(v(\theta )=\mathbb {E}[(1/W)^2]/[\mathbb {E}(1/W)]^2\) are multiplicative factors governing the deviation from the nested TVN distribution. It can be noted that, since \(\mathbb {E}[(1/W)^2]\ge [\mathbb {E}(1/W)]^2\), the excess kurtosis (with respect to the TVN distribution) is non-negative, with equality holding only in the degenerate case \(W \equiv 1\). Thus, except for this limit case, the resulting distribution is leptokurtic, and \(\theta \) can be interpreted as a tailedness parameter.

Specifically, \(u(\theta )\) and \(v(\theta )\) are equal to

$$\begin{aligned} u(\theta ) = {\left\{ \begin{array}{ll} \theta \exp (\theta )\, \varphi _{-1}(\theta ) & \quad \text {if}\hspace{0.5em} \mathcal {X}\sim \mathcal {T}\mathcal {V}\mathcal {S}\mathcal {E}\mathcal {N}_{p_1 \times \cdots \times p_M}\left( \mathcal {M},\left\{ \varvec{\Sigma }\right\} ,\theta \right) \\ -\dfrac{\ln (1-\theta )}{\theta } & \quad \text {if}\hspace{0.5em} \mathcal {X}\sim \mathcal {T}\mathcal {V}\mathcal {T}\mathcal {I}\mathcal {N}_{p_1 \times \cdots \times p_M}\left( \mathcal {M},\left\{ \varvec{\Sigma }\right\} ,\theta \right) \end{array}\right. }, \end{aligned}$$
(11)

and

$$\begin{aligned} v(\theta ) = {\left\{ \begin{array}{ll} \dfrac{\theta [1-u(\theta )]}{u(\theta )^2} & \quad \text {if}\hspace{0.5em} \mathcal {X}\sim \mathcal {T}\mathcal {V}\mathcal {S}\mathcal {E}\mathcal {N}_{p_1 \times \cdots \times p_M}\left( \mathcal {M},\left\{ \varvec{\Sigma }\right\} ,\theta \right) \\ \dfrac{\theta ^2}{(1-\theta )\ln ^2(1-\theta )} & \quad \text {if}\hspace{0.5em} \mathcal {X}\sim \mathcal {T}\mathcal {V}\mathcal {T}\mathcal {I}\mathcal {N}_{p_1 \times \cdots \times p_M}\left( \mathcal {M},\left\{ \varvec{\Sigma }\right\} ,\theta \right) \end{array}\right. }. \end{aligned}$$
(12)

Their graphical representation is illustrated in Fig. 1a and 1b for the TVSEN and TVTIN distributions, respectively.
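The two pairs of factors are easily reproduced numerically; a hedged base-R sketch, where \(\varphi _{-1}(\theta )=\Gamma (0,\theta )\) coincides with the exponential integral \(E_1(\theta )\) and is computed by numerical integration:

```r
# Hedged sketch of Eqs. (11)-(12)
E1 <- function(theta) integrate(function(t) exp(-t) / t, theta, Inf)$value

u_sen <- function(theta) theta * exp(theta) * E1(theta)
v_sen <- function(theta) theta * (1 - u_sen(theta)) / u_sen(theta)^2
u_tin <- function(theta) -log(1 - theta) / theta
v_tin <- function(theta) theta^2 / ((1 - theta) * log(1 - theta)^2)

u_sen(5)    # ~0.85 < 1: deflated variance, approaching 1 as theta grows
v_tin(0.9)  # ~1.53 > 1: leptokurtic behavior
```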

Starting with the TVSEN distribution, we see that \(u(\theta )\) is an increasing function of \(\theta \) and, since its values are smaller than 1, \(\text {Var}(\mathcal {X})\) is a deflated version of \(\bigotimes _{m=M}^{1}\varvec{\Sigma }_m\) (the variance of a TVN distribution). However, when \(\theta \rightarrow \infty \), then \(u(\theta )\rightarrow 1\) and \(\text {Var}(\mathcal {X})\rightarrow \bigotimes _{m=M}^{1}\varvec{\Sigma }_m\). As concerns \(v(\theta )\), it is a decreasing function of \(\theta \) and, since its values are greater than 1, \(\text {Kurt}(\mathcal {X})\) is greater than \(p(p+2)\) (the kurtosis of a TVN distribution), implying a leptokurtic behavior. Nevertheless, when \(\theta \rightarrow \infty \), then \(v(\theta )\rightarrow 1\) and \(\text {Kurt}(\mathcal {X})\rightarrow p(p+2)\).

Table 1 Nomenclature and number of free parameters in \(\varvec{\Sigma }_{1},\ldots ,\varvec{\Sigma }_{K}\) for the parsimonious models obtained via the eigen-decomposition of the component scale matrices

When the TVTIN distribution is considered, we observe that \(u(\theta )\) is an increasing function of \(\theta \) and, since its values are greater than 1, \(\text {Var}(\mathcal {X})\) is an inflated version of \(\bigotimes _{m=M}^{1}\varvec{\Sigma }_m\). However, when \(\theta \rightarrow 0\), then \(u(\theta )\rightarrow 1\) and \(\text {Var}(\mathcal {X})\rightarrow \bigotimes _{m=M}^{1}\varvec{\Sigma }_m\). Here, \(v(\theta )\) is an increasing function of \(\theta \) and, since its values are greater than 1, \(\text {Kurt}(\mathcal {X})\) is greater than \(p(p+2)\), implying leptokurtosis. Nevertheless, when \(\theta \rightarrow 0\), then \(v(\theta )\rightarrow 1\) and \(\text {Kurt}(\mathcal {X})\rightarrow p(p+2)\).

3.3 Parsimonious tensor-variate mixture models

Finite mixture models provide a principled probabilistic approach to the statistical modeling of many real phenomena. In a tensor-variate framework, a random tensor \(\mathcal {X}\in I\hspace{-1.4mm}R^{p_1 \times \cdots \times p_M}\) arises from a finite mixture model if its pdf can be written as

$$\begin{aligned} g(\mathcal {X};\varvec{\Omega })=\sum _{k=1}^K\pi _k f(\mathcal {X};\varvec{\Theta }_k), \end{aligned}$$
(13)

where \(\pi _k\) is the mixture weight of the kth component, with \(\pi _k>0\) and \(\sum _{k=1}^{K} \pi _k =1\), \(f(\mathcal {X};\varvec{\Theta }_k)\) is the pdf of the kth tensor-variate component having parameters \(\varvec{\Theta }_k\), and \(\varvec{\Omega }\) contains all parameters of the mixture. When the TVSEN and TVTIN distributions are used for \(f(\mathcal {X};\varvec{\Theta }_k)\), we obtain the corresponding TVSEN-M and TVTIN-M models, respectively.

As mentioned in Sect. 1, to introduce parsimony in our mixture models, we consider the eigen-decomposition of the component scale matrices. We recall that a scale matrix \(\varvec{\Sigma }_k \in I\hspace{-1.4mm}R^{d \times d}\) can be decomposed as

$$\begin{aligned} \varvec{\Sigma }_{k} = \lambda _{k} \varvec{\Gamma }_{k} \varvec{\Delta }_{k} \varvec{\Gamma }_{k}', \end{aligned}$$
(14)

where \(\lambda _{k}=|\varvec{\Sigma }_{k}|^{1/d}\), \(\varvec{\Gamma }_{k}\) is a \(d \times d\) orthogonal matrix whose columns are the normalized eigenvectors of \(\varvec{\Sigma }_{k}\), and \(\varvec{\Delta }_{k}\) is the scaled (\(|\varvec{\Delta }_{k}|=1\)) diagonal matrix of the eigenvalues of \(\varvec{\Sigma }_{k}\). By imposing constraints on the three components of (14), the family of 14 parsimonious structures reported in Table 1 is obtained.
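A hedged sketch of how (14) can be computed in R (eig_decomp() is an illustrative name of ours):

```r
# Hedged sketch of the eigen-decomposition in Eq. (14):
# Sigma_k = lambda_k * Gamma_k %*% Delta_k %*% t(Gamma_k), with |Delta_k| = 1
eig_decomp <- function(Sigma) {
  d <- nrow(Sigma)
  e <- eigen(Sigma, symmetric = TRUE)
  lambda <- det(Sigma)^(1 / d)
  list(lambda = lambda,                     # volume
       Gamma  = e$vectors,                  # orientation (eigenvectors)
       Delta  = diag(e$values / lambda, d)) # shape, scaled so |Delta| = 1
}
```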

Considering that for a tensor of order M there are M scale matrices, the total number of parsimonious structures for each mixture model should be \(14^M\). However, by incorporating the constraint on the determinant of \(M-1\) scale matrices discussed in Sect. 2.2, the number of parsimonious structures for these scale matrices reduces to the 7 reported in Table 2. Thus, the final number of parsimonious structures for each mixture model is \(7^{M-1}\times 14\).
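The count \(7^{M-1}\times 14\) is easily enumerated; in the sketch below, the 14 three-letter labels are the usual Celeux-Govaert nomenclature, while the 7 two-letter labels for the determinant-constrained modes are an assumption of ours patterned on the examples of Sect. 6.1 (the full set is in Table 2).

```r
# Hedged sketch: enumerate the parsimonious structures of Sect. 3.3
full14  <- c("EII", "VII", "EEI", "VEI", "EVI", "VVI", "EEE",
             "VEE", "EVE", "VVE", "EEV", "VEV", "EVV", "VVV")
constr7 <- c("II", "EI", "VI", "EE", "VE", "EV", "VV")  # assumed labels

model_grid <- function(M) {
  parts <- c(rep(list(constr7), M - 1), list(full14))
  apply(expand.grid(parts, stringsAsFactors = FALSE), 1, paste, collapse = "-")
}
length(model_grid(3))  # 7^2 * 14 = 686 structures for a 3rd-order tensor
```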

Table 2 Nomenclature and number of free parameters in \(\varvec{\Sigma }_{1},\ldots ,\varvec{\Sigma }_{K}\), with \(|\varvec{\Sigma }_{k}|=1\), for the parsimonious models obtained via the eigen-decomposition of the component scale matrices

Notice that an additional source of parsimony could have been introduced by forcing \(\theta _k\) to be tied across mixture components. However, this would double the number of parsimonious models to \(7^{M-1}\times 14 \times 2\). Considering the small gain in parsimony that such a choice would produce and, conversely, the increased computational effort required, we decided not to consider this constraint.

4 Parameter estimation

A standard way of estimating the parameters of model (13) is the maximum likelihood approach, which is generally implemented via the expectation-maximization (EM) algorithm. Unfortunately, we cannot directly use the EM algorithm for parameter estimation, since at the M-step:

  1.

    the updates for \(M-1\) scale matrices depend on the values of the scale matrices estimated at the previous iteration;

  2.

    the update for \(\theta \) tends to 0 as the number of iterations of the algorithm grows, leading the algorithm to fail to converge (TVTIN-Ms only; see Appendix A for further details).

Thus, we implement two variants of the EM algorithm: an expectation-conditional maximization (ECM) algorithm (Meng and Rubin 1993) for TVN-Ms and TVSEN-Ms and an alternating expectation-conditional maximization (AECM) algorithm (Meng and van Dyk 1997) for TVTIN-Ms. As we will show in Sect. 4.1, both algorithms have the same structure, with minor differences that are readily highlighted. Thus, to avoid redundancy, we present them together. Additionally, we focus our discussion on TVSEN-Ms and TVTIN-Ms, which are the core of our paper, and comment on the differences concerning the algorithm used for TVN-Ms in Sect. 4.2.

4.1 EM-based algorithms for TVSEN-Ms and TVTIN-Ms

In EM-based algorithms, the observed data \(\mathcal {S}=\left\{ \mathcal {X}_i; i=1,\ldots ,N\right\} \), where N is the sample size, are viewed as incomplete. Conversely, the complete data are \(\mathcal {S}_c=\left\{ \left( \mathcal {X}_i,\textbf{z}_i,w_i\right) ; i=1,\ldots ,N \right\} \), where \(\textbf{z}_i=(z_{i1},\dots ,z_{iK})'\), with \(z_{ik}=1\) if observation i belongs to group k and \(z_{ik}=0\) otherwise, and \(w_i\) is the realization of the distribution-specific latent variable W discussed in Sect. 3.1.

For the application of the algorithms, it is convenient to use the tensor-variate normal scale mixture model representation, since it allows the factorization of the complete-data log-likelihood function as

$$\begin{aligned} \ell _{c}\left( \varvec{\Omega };\mathcal {S}_{c}\right) = \ell _{1c}\left( \varvec{\pi };\mathcal {S}_{c}\right) +\ell _{2c}\left( \mathcal {M}_k,\left\{ \varvec{\Sigma }_k\right\} ^*; \mathcal {S}_{c}\right) +\ell _{3c}\left( \theta _k;\mathcal {S}_{c}\right) , \end{aligned}$$
(15)

where

$$\begin{aligned} \ell _{1c}\left( \varvec{\pi };\mathcal {S}_{c}\right)&=\sum _{i=1}^{N}\sum _{k=1}^{K} z_{ik} \ln \left( \pi _k\right) , \nonumber \\ \ell _{2c}\left( \mathcal {M}_k,\left\{ \varvec{\Sigma }_k\right\} ^*;\mathcal {S}_{c}\right)&=\sum _{i=1}^{N}\sum _{k=1}^{K} z_{ik}\Bigg [-\frac{p}{2}\ln \left( 2\pi \right) +\frac{p}{2}\ln \left( w_{ik}\right) \nonumber \\&\quad -\frac{1}{2}\sum _{m=1}^{M}p_{-m}\ln |\varvec{\Sigma }_{m_k}| -\frac{w_{ik}}{2}\varvec{\delta }\left( \mathcal {X}_i;\mathcal {M}_k,\left\{ \varvec{\Sigma }_k\right\} \right) \Bigg ], \end{aligned}$$
(16)

and \(\ell _{3c}\left( \theta _k;\mathcal {S}_{c}\right) \) differs according to the considered mixture model. Specifically, for TVSEN-Ms it is equal to

$$\begin{aligned} \ell _{3c}\left( \theta _k;\mathcal {S}_{c}\right) = \sum _{i=1}^{N}\sum _{k=1}^{K} z_{ik}\left[ \ln \left( \theta _k\right) -\theta _k\left( w_{ik}-1\right) \right] , \end{aligned}$$

whereas for TVTIN-Ms, it is equal to

$$\begin{aligned} \ell _{3c}\left( \theta _k;\mathcal {S}_{c}\right) = \sum _{i=1}^{N}\sum _{k=1}^{K} z_{ik} \left\{ -\ln \left( \theta _k\right) +\ln \left[ \mathbb {1}_{\left( 1-\theta _k,1\right) }\left( w_{ik}\right) \right] \right\} , \end{aligned}$$
(17)

where \(\mathbb {1}_{A}\left( \cdot \right) \) is the indicator function on the set A. In the following, the quantities marked with one dot correspond to the updates at the previous iteration and those marked with two dots represent the updates at the current iteration. The algorithms proceed as follows.

E-step

At the E-step, we have to compute the conditional expectation of (15). Thus, we first calculate the posterior probability that \(\mathcal {X}_i\) belongs to the kth mixture component

$$\begin{aligned} \ddot{z}_{ik} = \frac{{\dot{\pi }}_k f\left( \mathcal {X}_i;{{\dot{\mathcal {M}}}}_k,\left\{ {{\dot{\varvec{\Sigma }}}}_k\right\} ,{{\dot{\theta }}}_k\right) }{\sum \nolimits _{h=1}^K{\dot{\pi }}_h f\left( \mathcal {X}_i;{{\dot{\mathcal {M}}}}_h,\left\{ {{\dot{\varvec{\Sigma }}}}_h\right\} ,{{\dot{\theta }}}_h\right) }, \end{aligned}$$

where, depending on the considered mixture model, \(f(\cdot )\) represents either the TVSEN or the TVTIN distribution. Then, for TVSEN-Ms, we have to compute

$$\begin{aligned} \ddot{w}_{ik} = \frac{\varphi _{\frac{p}{2}+1}\left( \frac{\varvec{\delta }\left( \mathcal {X}_i;{{\dot{\mathcal {M}}}}_k,\left\{ {{\dot{\varvec{\Sigma }}}}_k\right\} \right) }{2}+{{\dot{\theta }}}_k\right) }{\varphi _{\frac{p}{2}}\left( \frac{\varvec{\delta }\left( \mathcal {X}_i;{{\dot{\mathcal {M}}}}_k,\left\{ {{\dot{\varvec{\Sigma }}}}_k\right\} \right) }{2}+{{\dot{\theta }}}_k\right) }, \end{aligned}$$

which corresponds to the expected value of a left-truncated gamma distribution on the interval \(\left( 1,\infty \right) \), with parameters \(\left( p/2\right) +1\) and \(\varvec{\delta }\left( \mathcal {X}_i;{{\dot{\mathcal {M}}}}_k,\left\{ {{\dot{\varvec{\Sigma }}}}_k\right\} \right) /2+{{\dot{\theta }}}_k\), whereas for TVTIN-Ms we need to calculate

$$\begin{aligned} \ddot{w}_{ik} =&\ \frac{2}{\varvec{\delta }\left( \mathcal {X}_i;{{\dot{\mathcal {M}}}}_k,\left\{ {{\dot{\varvec{\Sigma }}}}_k\right\} \right) } \nonumber \\&\times \frac{\Gamma \left( \frac{p}{2}+2,\left( 1-{{\dot{\theta }}}_{k}\right) \frac{\varvec{\delta }\left( \mathcal {X}_i;{{\dot{\mathcal {M}}}}_k,\left\{ {{\dot{\varvec{\Sigma }}}}_k\right\} \right) }{2}\right) -\Gamma \left( \frac{p}{2}+2,\frac{\varvec{\delta }\left( \mathcal {X}_i;{{\dot{\mathcal {M}}}}_k,\left\{ {{\dot{\varvec{\Sigma }}}}_k\right\} \right) }{2}\right) }{\Gamma \left( \frac{p}{2}+1,\left( 1-{{\dot{\theta }}}_{k}\right) \frac{\varvec{\delta }\left( \mathcal {X}_i;{{\dot{\mathcal {M}}}}_k,\left\{ {{\dot{\varvec{\Sigma }}}}_k\right\} \right) }{2}\right) -\Gamma \left( \frac{p}{2}+1,\frac{\varvec{\delta }\left( \mathcal {X}_i;{{\dot{\mathcal {M}}}}_k,\left\{ {{\dot{\varvec{\Sigma }}}}_k\right\} \right) }{2}\right) }, \end{aligned}$$
(18)

that corresponds to the expected value of a doubly-truncated gamma distribution on the interval \(\left( 1-\theta _k,1\right) \), with parameters \(\left( p/2\right) +1\) and \(\varvec{\delta }\left( \mathcal {X}_i;{{\dot{\mathcal {M}}}}_k,\left\{ {{\dot{\varvec{\Sigma }}}}_k\right\} \right) /2\).
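A hedged sketch of these E-step quantities, reusing misra() and gamma_inc() from Sect. 3.1 (post_z() works on the log scale for numerical stability; delta denotes the squared Mahalanobis distance under the current parameters and P the product of the tensor dimensions):

```r
# Posterior probabilities from an N x K matrix of component log-densities
post_z <- function(logdens, pi_k) {
  lp <- sweep(logdens, 2, log(pi_k), "+")
  lp <- lp - apply(lp, 1, max)             # stabilize before exponentiating
  exp(lp) / rowSums(exp(lp))
}

# E-step w for TVSEN-Ms: ratio of Misra functions
w_sen <- function(delta, theta, P)
  misra(P / 2 + 1, delta / 2 + theta) / misra(P / 2, delta / 2 + theta)

# E-step w for TVTIN-Ms, as in Eq. (18)
w_tin <- function(delta, theta, P) {
  z1 <- (1 - theta) * delta / 2
  z2 <- delta / 2
  (2 / delta) * (gamma_inc(P / 2 + 2, z1) - gamma_inc(P / 2 + 2, z2)) /
    (gamma_inc(P / 2 + 1, z1) - gamma_inc(P / 2 + 1, z2))
}
```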

CM-step 1

At the first CM-step, we obtain the following updates for \(\pi _k\) and \(\mathcal {M}_k\)

$$\begin{aligned} \ddot{\pi }_k = \frac{\sum \nolimits _{i=1}^N \ddot{z}_{ik}}{N} \quad \text {and} \quad {{\ddot{\mathcal {M}}}}_k = \frac{\sum \nolimits _{i=1}^N \ddot{z}_{ik}\ddot{w}_{ik}\mathcal {X}_i}{\sum \nolimits _{i=1}^N \ddot{z}_{ik}\ddot{w}_{ik}}. \end{aligned}$$

Additionally, for TVSEN-Ms, we also get the update of the tailedness parameter \(\theta _k\), i.e.

$$\begin{aligned} \ddot{\theta }_k = \frac{\sum \nolimits _{i=1}^N \ddot{z}_{ik}}{\sum \nolimits _{i=1}^N \ddot{z}_{ik}\left( \ddot{w}_{ik}-1\right) }. \end{aligned}$$
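These closed-form updates are one-liners in R; a hedged sketch for component k, where Xlist is the N-list of observed arrays and z, w the N x K matrices from the E-step:

```r
# Hedged sketch of CM-step 1
pi_new <- colSums(z) / nrow(z)             # mixture weights

zw      <- z[, k] * w[, k]
M_new_k <- Reduce(`+`, Map(`*`, Xlist, zw)) / sum(zw)  # weighted mean tensor

theta_sen_k <- sum(z[, k]) / sum(z[, k] * (w[, k] - 1))  # TVSEN-Ms only
```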
Table 3 Parameter updates related to \(\varvec{\Sigma }_{m_k}, m=1,\ldots , M-1\)

CM-step 2

At this step, we update the scale matrices \(\left\{ \varvec{\Sigma }_k\right\} \). The updates differ according to the parsimonious structure considered, but they are all based on the so-called “flip-flop” algorithm discussed in Manceur and Dutilleul (2013) and recently considered in the tensor-variate mixture modeling framework by Sarkar et al. (2021). To this aim, we first have to consider the mode-m matricization of \(\mathcal {X}_i\) and \({{{\ddot{\mathcal {M}}}}}_k\) for updating the following scatter matrices

$$\begin{aligned} \ddot{\textbf{S}}_{m_k} = \sum _{i=1}^{N} \ddot{z}_{ik}\ddot{w}_{ik} \left( \mathcal {X}_i-{{\ddot{\mathcal {M}}}}_k\right) ^{(m)} \left( \bigotimes _{\begin{array}{c} m'=M,\\ m'\ne m \end{array}}^{1}\overset{\sim }{{{\dot{\varvec{\Sigma }}}}}{}_{m'_k}^{-1}\right) \left[ \left( \mathcal {X}_i-{{\ddot{\mathcal {M}}}}_k\right) ^{(m)}\right] ', \end{aligned}$$
(19)

where the symbol \(\overset{\sim }{{{\dot{\varvec{\Sigma }}}}}_{m'_k}\) means that, as we move from the estimation of the first scale matrix \(\varvec{\Sigma }_{1_k}\) to the last \(\varvec{\Sigma }_{M_k}\), in the Kronecker product in (19) we replace the scale matrices estimated in the previous iteration with those estimated at the current iteration; for specific details see Manceur and Dutilleul (2013). Then, for ease of reading, we compactly report in Table 3 the updates concerning the parsimonious structures related to \(\varvec{\Sigma }_{m_k}, m=1,\ldots , M-1\).

In Table 3, \(\ddot{\textbf{S}}_m=\sum _{k=1}^K \ddot{\textbf{S}}_{m_k}, m=1,\ldots ,M-1\). Regarding the VE model, a minorization-maximization (MM) algorithm (Browne and McNicholas 2014) is implemented, in the fashion of Sarkar et al. (2020), to which we refer for specific details. Here, we limit ourselves to reporting that \({{\dot{\textbf{R}}}}_m\) and \({{\dot{\textbf{P}}}}_m'\) are obtained from the singular value decomposition of \({{\dot{\textbf{F}}}}_m=\sum _{k=1}^K {{\dot{\varvec{\Delta }}}}_{m_k}^{-1}{{\dot{\varvec{\Gamma }}}}_m'\ddot{\textbf{S}}_{m_k}-\omega _{m_k}{{\dot{\varvec{\Delta }}}}_{m_k}^{-1}{{\dot{\varvec{\Gamma }}}}_m'\), with \(\omega _{m_k}\) the largest eigenvalue of \(\ddot{\textbf{S}}_{m_k}\), which is the quantity involved in the minorization-maximization process. As concerns the EV model, \(\textbf{L}_{m_k}\) and \(\textbf{H}_{m_k}\) are obtained from the eigen-decomposition of \(\ddot{\textbf{S}}_{m_k}=\textbf{L}_{m_k}\textbf{H}_{m_k}\textbf{L}_{m_k}'\); see Sarkar et al. (2020) for further details.

Lastly, we update the parsimonious structures related to the last scale matrix \(\varvec{\Sigma }_M\). They are compactly reported in Table 4, where \(\ddot{\textbf{S}}_M=\sum _{k=1}^K \ddot{\textbf{S}}_{M_k}\).

Table 4 Parameter updates related to \(\varvec{\Sigma }_{M_k}\)

For the EVE and VVE models, an MM algorithm similar to that mentioned for the VE model is implemented. For the EEV and VEV models, we use a strategy similar to that used for the EV model.

CM-step 3

This step involves only the estimation of the tailedness parameter \(\theta _k\) for TVTIN-Ms. We first define the “partial” complete-data log-likelihood function

$$\begin{aligned} \ell _{c}\left( \varvec{\Omega };\mathcal {S}_{c}\right) = \ell _{1c}\left( \varvec{\pi };\mathcal {S}_{c}\right) + \sum _{i=1}^{N}\sum _{k=1}^{K} z_{ik} \ln f_{\text {\tiny TVTIN}}(\mathcal {X}_i;\mathcal {M}_k,\left\{ \varvec{\Sigma }_k\right\} ,\theta _k). \end{aligned}$$

The term “partial” refers to the definition of the complete data, which are now viewed as \(\mathcal {S}_{c}=\left\{ \left( \mathcal {X}_i,\textbf{z}_i\right) ; i=1,\ldots , N \right\} \). Then, \(\ddot{\theta }_k\) is determined by numerically maximizing

$$\begin{aligned} \sum _{i=1}^{N} \ddot{z}_{ik} \ln f_{\text {\tiny TVTIN}}(\mathcal {X}_i;{{\ddot{\mathcal {M}}}}_k,\left\{ \ddot{\varvec{\Sigma }}_k\right\} ,\theta _k) \end{aligned}$$

over \(\theta _k\in \left( 0,1\right) \), \(k=1,\ldots ,K\).
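Since no closed-form solution exists, this one-dimensional maximization can be carried out with base R's optimizer; a hedged sketch reusing the dtvtin() density sketch of Sect. 3.1 (M_k and S_k denote the current mean tensor and list of scale matrices):

```r
# Hedged sketch of CM-step 3: numerical update of theta_k for TVTIN-Ms
theta_new_k <- optimize(
  function(th) sum(z[, k] *
    log(vapply(Xlist, dtvtin, numeric(1), M = M_k, Sigma = S_k, theta = th))),
  interval = c(1e-6, 1 - 1e-6), maximum = TRUE)$maximum
```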

4.2 Some notes on TVN-Ms parameter estimation

When TVN-Ms are considered, in the context of the ECM algorithm, the complete data are \(\mathcal {S}_c=\{\left( \mathcal {X}_i,\textbf{z}_i\right) ; i=1,\ldots , N \}\), and the corresponding complete-data log-likelihood function is

$$\begin{aligned} \ell _{c}\left( \varvec{\Omega };\mathcal {S}_{c}\right) = \ell _{1c}\left( \varvec{\pi };\mathcal {S}_{c}\right) +\ell _{2c}\left( \mathcal {M}_k,\left\{ \varvec{\Sigma }_k\right\} ;\mathcal {S}_{c}\right) , \quad k=1,\ldots ,K, \end{aligned}$$

where \(\ell _{1c}\) is the same as in (16), while

$$\begin{aligned} \ell _{2c}\left( \mathcal {M}_k,\left\{ \varvec{\Sigma }_k\right\} ;\mathcal {S}_{c}\right) =&\sum _{i=1}^{N}\sum _{k=1}^{K} z_{ik} \Bigg [-\frac{p}{2}\ln \left( 2\pi \right) \\&-\frac{1}{2}\sum _{m=1}^{M}p_{-m}\ln |\varvec{\Sigma }_{m_k}|\\&-\frac{\varvec{\delta }\left( \mathcal {X}_i;\mathcal {M}_k,\left\{ \varvec{\Sigma }_k\right\} \right) }{2}\Bigg ]. \end{aligned}$$

At the E-step, we have to update only the posterior probabilities as

$$\begin{aligned} \ddot{z}_{ik} = \frac{{\dot{\pi }}_k f_{\text {\tiny TVN}}\left( \mathcal {X}_i;{{\dot{\mathcal {M}}}}_k,\left\{ {{\dot{\varvec{\Sigma }}}}_k\right\} \right) }{\sum \nolimits _{h=1}^K{\dot{\pi }}_h f_{\text {\tiny TVN}}\left( \mathcal {X}_i;{{\dot{\mathcal {M}}}}_h,\left\{ {{\dot{\varvec{\Sigma }}}}_h\right\} \right) }. \end{aligned}$$

At CM-steps 1 and 2, we update \(\pi _k\), \(\mathcal {M}_k\) and \(\left\{ \varvec{\Sigma }_k\right\} \). In the procedures and equations discussed above, we simply set \(\ddot{w}_{ik}=1\) wherever it appears.

5 Computational aspects

As introduced in Sect. 1, the number of parsimonious models to be initialized and fitted increases as M grows. Thus, in Sect. 5.1 and Sect. 5.2 we discuss the initialization strategy and the fitting design adopted to expedite both processes, respectively.

All the analyses are conducted by using the R statistical software, which is run on a Windows 10 PC, with AMD Ryzen 7 3700x CPU, and 16.0 GB RAM. Parallel computing over 14 cores is exploited when multiple models have to be fitted to a given dataset.

5.1 Initialization strategy

The choice of the starting values is an important aspect of any EM-based algorithm. Here, we implement the short-EM initialization strategy introduced by Biernacki et al. (2003). This procedure consists of B runs of the algorithm, from different random positions, for a very small number of iterations \(t_1\). The parameter set producing the largest log-likelihood value is used to initialize the considered algorithm. The advantage of this procedure lies in the variability of the initial values granted by the several random starts; however, it may slow down the initialization process. Thus, instead of running a short-EM for each parsimonious model, we adopt the following scheme: we initialize via short-EM only the simplest model belonging to each family (refer to Tables 1 and 2), and we use its estimates to start the EM-based algorithms of the connected models. We investigate the goodness of this initialization scheme in the simulated analyses of Sect. 6.1. Notice that, throughout the manuscript, we set \(B = 50\) and \(t_1 = 1\).
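A hedged sketch of the short-EM routine; fit_em() is a placeholder of ours for the ECM/AECM fitter capped at a maximum number of iterations.

```r
# Hedged sketch of the short-EM initialization (Biernacki et al. 2003):
# B short runs from random starts; keep the parameters of the best one
short_em <- function(data, K, fit_em, B = 50, t1 = 1) {
  runs <- lapply(seq_len(B), function(b)
    fit_em(data, K, init = "random", max_iter = t1))
  best <- which.max(vapply(runs, function(r) r$loglik, numeric(1)))
  runs[[best]]$par                         # starting values for the full run
}
```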

Table 5 Average MSEs of the parameter estimates for TVSEN-Ms along with corresponding standard errors (in round brackets)
Table 6 Average MSEs of the parameter estimates for TVTIN-Ms along with corresponding standard errors (in round brackets)

5.2 Fitting strategy

Once the initial values have been obtained, another important aspect concerns the strategy for expediting the fitting of our models. To this purpose, we adopt a strategy resembling the short-EM. Specifically, we run all the models for a small number of iterations \(t_2\). Then, we compute the Bayesian information criterion (BIC; Schwarz 1978) for each model and sort all of them according to their BIC, thus obtaining a ranking. Subsequently, we fit until convergence only the subset of models ranked in the top \(Y\%\) according to the BIC, where Y is a value to be chosen.
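A hedged sketch of this two-stage design follows; fit_em() and bic() are placeholders of ours, and we assume the smaller-is-better BIC convention (flip the ordering otherwise).

```r
# Hedged sketch of the fitting strategy of Sect. 5.2
short_fit <- function(models, data, fit_em, bic, t2 = 10, Y = 0.01) {
  short  <- lapply(models, function(m) fit_em(data, model = m, max_iter = t2))
  ranked <- order(vapply(short, bic, numeric(1)))      # rank by BIC
  top    <- models[ranked[seq_len(ceiling(Y * length(models)))]]
  lapply(top, function(m) fit_em(data, model = m))     # run to convergence
}
```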

In the implementation of this fitting scheme, several elements must be investigated, which concern:

  1.

    the number of iterations \(t_2\) to consider;

  2.

    the percentage value Y to be selected;

  3.

    the capability of the BIC to detect the correct model after \(t_2\) iterations and within the percentage value Y.

We investigate these aspects in the simulated analyses of Sect. 6.2.

6 Simulated analyses

In this section, we present three simulation studies. Specifically, in Sect. 6.1, we discuss whether by using the proposed initialization scheme we can obtain appropriate parameter estimates and data classification. Since the likelihood function of finite mixture models is generally multimodal, a bad initialization can easily lead to local maximizers and produce inaccurate parameter estimates (Melnykov and Melnykov 2012; Najar et al. 2017). For illustrative purposes, we first generate data from two models having a parsimonious structure which, for each dimension, is the most different from that used for its initialization. Specifically, both for TVSEN-Ms and TVTIN-Ms, we generate data from the VI-VI-VVI and VV-VV-VVV parsimonious models that, according to the scheme illustrated in Sect. 5.1, are initialized using the EI-EI-EEI and EE-EE-EEE structures, respectively. Furthermore, we also generate data from the EI-EV-EVV parsimonious model to provide an example with different structures among all the dimensions.

In Sect. 6.2, we investigate whether the proposed fitting strategy produces reliable results. To do this, both for TVSEN-Ms and TVTIN-Ms, we simulate from the three data generating models (DGMs) presented in Sect. 6.1. For each dataset, we fit all the models for \(K \in \left\{ 1,2,3\right\} \). As experimental factors, we consider the number of iterations \(t_2\) (5 or 10) and the percentage of best models Y to be picked by the BIC (1% or 5%). Therefore, by combining these experimental factors we have the following four scenarios:

  1.

    \(t_2=5\) and \(Y=1\%\);

  2.

    \(t_2=5\) and \(Y=5\%\);

  3.

    \(t_2=10\) and \(Y=1\%\);

  4.

    \(t_2=10\) and \(Y=5\%\).

We try to understand which of these configurations produces the best results in terms of efficiency and identification of the correct DGM. Once the best scenario has been identified, we run the subset of models until convergence, and we analyze whether the BIC correctly selects the true but unknown DGM as the overall best-fitting model.

In Sect. 6.3, we investigate the arguments discussed in items (a) and (b) of Sect. 1. We first evaluate item (a), concerning the effect of outlying observations on classification and model selection when TVN-Ms, TVSEN-Ms, and TVTIN-Ms are compared. Then, we unfold the 4-way datasets into the possible 3-way and 2-way configurations, and fit the matrix-variate and multivariate versions of the above models to the rearranged data accordingly. The obtained results are compared to those of their tensor-variate counterparts, and the drawbacks mentioned in item (b) are illustrated.

6.1 Simulation study 1

Throughout this study, we consider data having \(p_1 = 2\), \(p_2=3\), \(p_3=4\) (i.e., 4-way arrays), and \(K=2\). The parameters used for the DGMs are reported in Appendix B. We also set two values for the sample size, i.e., \(N \in \{200, 400\}\). For each value of N, we generate 50 datasets per DGM. We then fit the corresponding parsimonious model with \(K=2\) to each dataset. We compute the mean squared error (MSE) to evaluate the accuracy of the parameter estimates when the proposed initialization scheme is adopted. Given the high number of parameters that should be reported, we follow an approach similar to the one used by Farcomeni and Punzo (2020) and Tomarchio (2022), i.e., we calculate the average among the MSEs of the elements of each estimated parameter, thus summarizing the MSEs in a single number. The results are contained in Table 5 for TVSEN-Ms and Table 6 for TVTIN-Ms. Standard errors for the average MSEs are reported in round brackets.

Fig. 2 Parallel coordinate plots for TVSEN-Ms random datasets having VI-VI-VVI (a), VV-VV-VVV (b) and EI-EV-EVV (c) parsimonious structures. Each subplot refers to the values of the first variable (in the first dimension), over the \(3\times 4\) occasions for the 200 observations

Fig. 3 Parallel coordinate plots for TVTIN-Ms random datasets having VI-VI-VVI (a), VV-VV-VVV (b) and EI-EV-EVV (c) parsimonious structures. Each subplot refers to the values of the first variable (in the first dimension), over the \(3\times 4\) occasions for the 200 observations

Table 7 Number of times each parsimonious model is present in the ranking produced by the BIC over the 50 datasets generated for each scenario, after the implementation of our initialization strategy

The average MSEs, and the corresponding standard errors, are quite similar and close to 0 for all the structures and models considered, supporting the goodness of the implemented initialization strategy. We also note that the average MSE values systematically improve when N increases, evidencing the consistency of the parameter estimators. Regarding TVSEN-Ms, this improvement ranges between 13.34% (for \(\varvec{\Sigma }_{3_2}\) when data are generated from the VV-VV-VVV) and 77.78% (for \(\varvec{\Sigma }_{1_2}\) when data are generated from the VI-VI-VVI), with an average improvement of approximately 45%. Similarly, TVTIN-Ms show an improvement that ranges from 25.93% (for \(\varvec{\Sigma }_{3_1}\) when data are generated from the VI-VI-VVI) to 86.49% (for \(\theta _1\) when data are generated from the VI-VI-VVI), with an average improvement of around 50%. A similar discussion also applies to the standard errors which, following the pattern discussed above for the average MSEs, regularly decrease when N moves from 200 to 400.

Nevertheless, there are some differences among the average MSEs, and the corresponding standard errors, especially those concerning \(\varvec{\Sigma }_{3_2}\) for the VI-VI-VVI and VV-VV-VVV of TVSEN-Ms and TVTIN-Ms, and \(\mathcal {M}_2\) for the VI-VI-VVI and VV-VV-VVV of TVTIN-Ms. The reason for such behavior lies in the relative closeness between the two groups that these parsimonious configurations create (and, for \(\varvec{\Sigma }_{3_k}\), also in the higher variability of the values used; see Appendix B). This is supported by the parallel coordinate plots of random datasets generated by each parsimonious structure for TVSEN-Ms and TVTIN-Ms, illustrated in Figs. 2 and 3, respectively. For ease of visualization, in both figures we limit ourselves to reporting the values of the first variable (in the first dimension), over the \(3\times 4\) occasions for the 200 observations.

From a classification perspective, we compute the adjusted Rand index (ARI; Hubert and Arabie 1985), which measures the agreement between two classifications, with a maximum of 1 indicating identical solutions. We report that for TVSEN-Ms, regardless of the parsimonious structure considered, the average ARI over the 50 datasets per DGM is 1, whereas for TVTIN-Ms it is equal to 0.99 (due to a handful of misclassifications in some datasets). Thus, despite the relative closeness between the two groups appearing in Figs. 2 and 3, the correct data classification is regularly retrieved.

6.2 Simulation study 2

Throughout this study, we use the same parameters and settings presented in Sect. 6.1. The results related to our fitting strategy are presented in Table 7.

As concerns TVSEN-Ms, each model is always classified in the top \(Y\%\) by the BIC in the corresponding datasets, regardless of the number of iterations considered. Conversely, when TVTIN-Ms are analyzed, some issues are encountered in the scenarios having \(t_2=5\). Indeed, when the VI-VI-VVI and VV-VV-VVV DGMs are considered, the true models are not classified in the top \(Y\%\) by the BIC on several occasions. The situation improves when we pass from \(Y=1\%\) to \(Y=5\%\), but some problems persist. On the contrary, when we move to \(t_2=10\) iterations, we obtain the same (good) results obtained for TVSEN-Ms. Thus, it seems that \(t_2=5\) iterations are not enough to correctly include the true DGMs in the subset of models to run until convergence. In light of these results, and to reduce potential issues, we decided to implement our fitting strategy by setting \(t_2=10\), both for TVSEN-Ms and TVTIN-Ms. Considering that, in the scenarios having \(t_2=10\), there are no differences between the \(Y=1\%\) and \(Y=5\%\) cases, we hereafter set \(Y=1\%\) with the aim of making the fitting procedure more efficient.

By focusing on the scenario with \(t_2=10\) and \(Y=1\%\), we now report in Table 8 the results related to running the subset of models until convergence. As we can see, the BIC almost always identifies the DGMs as the best-fitting models in their corresponding datasets, both for TVSEN-Ms and TVTIN-Ms.

Table 8 Number of times the BIC identifies the correct data generating model (over 50 datasets per model), at convergence of the algorithm, after the implementation of our fitting strategy with \(t_2=10\) and \(Y=1\%\)

6.3 Simulation study 3

In this study, we generate 100 datasets of size \(N=200\) from TVN-Ms having a VV-VV-VVV parsimonious structure, \(p_1 = 2\), \(p_2=3\), \(p_3=4\), \(K=3\), and parameters reported in Appendix C. In each dataset, 15% of the observations are randomly selected and, for each of these observations, one of its values is randomly chosen and multiplied by a constant \(c=4\). Thus, we are introducing an atypical value along one dimension per selected observation.

Over the generated datasets, we fit TVN-Ms, TVSEN-Ms, and TVTIN-Ms with VV-VV-VVV parsimonious structure and \(K=3\). Additionally, we unfold every 4-way dataset into the following 3-way configurations: \(p_1 \times p_2p_3\), \(p_2 \times p_1p_3\), and \(p_3 \times p_1p_2\). On each unfolded dataset, we fit the (unconstrained) matrix-variate versions of the above models, i.e. matrix-variate normal mixtures (MVN-Ms), matrix-variate shifted exponential normal mixtures (MVSEN-Ms), and matrix-variate tail-inflated normal mixtures (MVTIN-Ms) with \(K=3\). Lastly, every 4-way dataset is vectorized into \(p_1p_2p_3\)-dimensional vectors, and the (unconstrained) multivariate versions of the considered models, i.e. multivariate normal mixtures (MN-Ms), multivariate shifted exponential normal mixtures (MSEN-Ms), and multivariate tail-inflated normal mixtures (MTIN-Ms), are fitted with \(K=3\).

We start by analyzing the classification performance. In this regard, we illustrate in Fig. 4 the box plots of the ARI values computed for the considered models over the 100 datasets.

Fig. 4 Box plots of the ARI values computed for the considered models over the 100 datasets

Starting with the tensor-variate mixtures (Fig. 4, plot on the right), we see that the accuracy of the partitions obtained via TVSEN-Ms and TVTIN-Ms is far better than that of TVN-Ms. This is due to the tails of the mixture components that, for TVN-Ms, are not heavy enough to adequately model data with such atypical observations, destroying the true underlying group structure. Indeed, the median ARI values for TVSEN-Ms and TVTIN-Ms are 0.94 and 0.92, respectively, while that of TVN-Ms is 0.53. Additionally, the variability of the estimated partitions is far higher for TVN-Ms than for TVSEN-Ms and TVTIN-Ms. A similar pattern is broadly replicated when the data are rearranged into the 3-way and 2-way configurations, and the normal-based mixtures are compared to the heavy-tailed approaches.

Moving to the analysis of the matrix-variate mixtures (Fig. 4, plot in the center), we notice that the classification performance differs depending on how the data are rearranged. In particular, from a clustering perspective, rearranging the data as \(p_3 \times p_1p_2\) matrices allows for obtaining better ARI values. Since the results vary according to the selected 3-way configuration, this poses the question of which data rearrangement to select when working with real tensor-variate data for which the true labels are not available. On the contrary, by using the natural tensor-variate structure of the data, a tensor-based approach avoids such a problem. Additionally, the variability of the ARI values obtained via MVSEN-Ms and MVTIN-Ms is higher than that of their tensor-variate counterparts.

Finally, considering the results of the multivariate mixtures (Fig. 4, plot on the left), we note a marked deterioration of the produced classifications. Indeed, the median ARI values of all three multivariate approaches are below 0.50, and they also show similar variabilities.

A second element worth investigating concerns model selection. In detail, we now fit, over the same datasets, the corresponding mixture models for \(K\in \left\{ 1,\ldots ,4\right\} \). The number of times each K is chosen by the BIC over the 100 datasets is reported in Table 9.

Table 9 Number of times each K is selected by the BIC over 100 datasets

Starting with the multivariate mixtures, we immediately note that \(K=1\) is selected 100% of the time for MSEN-Ms and MTIN-Ms. The reason for this behavior lies in the increased number of parameters that such data rearrangement causes, with direct effects on the penalty term of the BIC. This problem is partially mitigated for MN-Ms because of their well-known tendency to select additional mixture components in the presence of atypical observations (see, e.g., Peel and McLachlan 2000).

When the matrix-variate mixtures are considered, we notice that the selection of K generally depends on the considered 3-way configuration. This is particularly true for MVN-Ms, which show the best behavior when the \(p_2 \times p_1p_3\) arrangement is considered, and for MVTIN-Ms, which provide the best results when the data have the \(p_3 \times p_1p_2\) arrangement. Anyway, MVN-Ms generally tend to select an additional component, mimicking what was discussed for their multivariate counterparts, as also discussed by Tomarchio et al. (2022b).

As concerns TVSEN-Ms and TVTIN-Ms, we see that they select \(K=3\) the majority of times. Thus, by considering the data in their tensor-variate structure, we avoid potential model selection issues caused by a wrong selection of the 3-way configuration. Similarly to before, TVN-Ms mainly select an additional mixture component because of the inadequacy of the mixture components’ tails.

A third aspect to be reported is the difference among the models in terms of the number of parameters to be estimated. Such information is displayed in Table 10 for the models we considered in our analysis.

Table 10 Number of parameters of the considered models for each K

As observed, vectorizing the data results in a significant increase in the number of parameters for the multivariate mixtures, which in turn contributes to the aforementioned model selection issues. Regarding the matrix-variate data arrangements, the number of parameters can vary substantially depending on the considered 3-way configuration. The original tensor-variate structure, by contrast, leads to the lowest number of parameters. Thus, we reach the greatest level of parsimony when the data are analyzed in their natural tensor-variate structure.
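As an illustration of this parsimony gain, the following back-of-the-envelope R computation compares the free scale parameters per component for vectorized versus tensor-variate (Kronecker-separable) modeling under the fully unconstrained structures; the mode dimensions are purely illustrative, and the identifiability constraints linking the mode scale matrices are ignored for simplicity.

```r
# Free scale parameters per mixture component (unconstrained case):
# vectorized data use one (p1*p2*p3) x (p1*p2*p3) scale matrix, whereas the
# tensor-variate model uses one smaller scale matrix per mode.
p <- c(p1 = 2, p2 = 4, p3 = 2)  # illustrative mode dimensions

n_vectorized <- prod(p) * (prod(p) + 1) / 2  # 16 x 16 matrix -> 136 parameters
n_tensor     <- sum(p * (p + 1) / 2)         # 3 + 10 + 3 = 16 parameters

c(vectorized = n_vectorized, tensor = n_tensor)
```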

The final aspect we assessed is the computational effort. Specifically, we show in Fig. 5 the computational times per iteration (in seconds) for the considered models over the 100 datasets.

Fig. 5

Box plots of the computational times per iteration (in sec.) computed for the considered models over the 100 datasets

As is reasonable to expect, the transition from the multivariate to the matrix-variate mixtures, and in turn to the tensor-variate mixtures, increases the computational cost. This growth is more pronounced for the normal-based mixtures, whose computational time per iteration rises overall by a factor of 20, and less severe for the tail-inflated normal-based mixtures, whose time per iteration increases overall by a factor of 2. In any case, mixture models based on the tail-inflated normal distribution are the most computationally intensive (due to the absence of a closed-form solution for the \(\theta \) update; refer to Sect. 4), whereas the normal-based mixtures are the most efficient (because of their easier estimation). To summarize, tensor-variate mixtures are not, per se, much more computationally costly than their multivariate and matrix-variate counterparts. The main challenge with their fitting lies in the potentially high number of models to be fitted, as already mentioned in Sect. 1. The heuristic strategy proposed in Sect. 5 tries to alleviate this concern.
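For reference, the quantity displayed in Fig. 5 can be obtained along the lines of the following R sketch; the fields `elapsed` and `iter` of the fitted objects are hypothetical placeholders for the total fitting time (in seconds) and the number of EM iterations.

```r
# Minimal sketch: computational time per EM iteration, as in Fig. 5.
# `fits` is a hypothetical list of fitted models, each recording its total
# elapsed time in seconds (`elapsed`) and its number of iterations (`iter`).
time_per_iter <- sapply(fits, function(fit) fit$elapsed / fit$iter)
boxplot(time_per_iter, ylab = "seconds per iteration")
```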

7 Real data analyses

In this section, we fit our TVSEN-Ms and TVTIN-Ms, as well as TVN-Ms for comparative purposes, to two real datasets having a 4-way (Sect. 7.1) and a 5-way (Sect. 7.2) structure. In both datasets, all the parsimonious models are fitted for \(K\in \left\{ 1,\ldots ,7\right\} \). This means that \(7^2 \times 14 \times 7 = 4802\) and \(7^3 \times 14 \times 7 = 33614\) models should be fitted to the two datasets, respectively. Thus, we decided to fit all the models until convergence for the first dataset, while we implemented our fitting strategy for the second dataset.

7.1 First example

The first analysis concerns the australia.soybean dataset contained in the agridat package (Wright 2022) for the R statistical software. It contains information on yield and other traits of \(N=58\) genotypes of soybeans, grown in four locations across two years in Australia. Therefore, it is a 4-way dataset having the following structure: traits \(\times \) locations \(\times \) years \(\times \) genotypes. The 58 genotypes of soybeans consist of 43 lines (40 local Australian selections from a cross, their two parents, and another which was used as a parent in earlier trials) and 15 other lines, 12 of which were from the USA.

This dataset was considered by Basford and McLachlan (1985) and Viroli (2011a) after its rearrangement into two-way and three-way structures, respectively. However, this leads to the important concerns discussed in Sect. 1, such as increased data dimensionality and reduced interpretability. On the contrary, we use this dataset in its natural 4-way structure. Farmers and researchers in this applied field generally want to investigate the main effects of average production quality among locations and between years. Specifically, common research questions concern whether to change production practices because of observed differences across locations, and whether differences across years are large enough to justify future production modifications (Jaynes and Colvin 1997; Porter et al. 1998). These aspects can be better analyzed by using the 4-way structure of the data, where traits, locations, and years are considered separately and not (in some way) folded into each other.

Similarly to Basford and McLachlan (1985) and Viroli (2011a), we limit our attention to the following two traits: seed yield and seed protein percentage. Thus, the structure of the dataset is \(2 \times 4 \times 2 \times 58\). Notice that, to avoid the so-called boundary bias problem (Tomarchio and Punzo 2020), we map the considered variables to \(I\hspace{-1.4mm}R\) via the logarithmic (for seed yield) and logit (for seed protein percentage/100) transformations.
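For reproducibility, a minimal R sketch of these transformations is reported below; we assume the `yield` and `protein` columns of the australia.soybean data frame hold the two traits (with protein expressed in percent), and `qlogis()` is the logit function in base R.

```r
# Minimal sketch of the variable transformations used above, assuming the
# australia.soybean data frame stores seed yield in `yield` and seed protein
# percentage in `protein`.
library(agridat)
data(australia.soybean)

yield_t   <- log(australia.soybean$yield)             # log: (0, Inf) -> R
protein_t <- qlogis(australia.soybean$protein / 100)  # logit: (0, 1) -> R
```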

7.1.1 Results

According to the BIC, the best parsimonious models selected for TVN-Ms, TVSEN-Ms, and TVTIN-Ms share the same number of \(K=5\) groups. Additionally, the parsimonious structure of the scale matrices is VI-EE-EEE for TVSEN-Ms and TVTIN-Ms, while it is VE-EE-VEE for TVN-Ms. Therefore, from this point of view, there is perfect agreement between our models, which instead differ from the TVN-Ms. This is also evidenced by the BIC values of TVSEN-Ms (\(-972.42\)) and TVTIN-Ms (\(-971.08\)), which are better than the BIC of TVN-Ms (\(-953.10\)), thus indicating a better fit to the data.

It is interesting to report that if, instead of fitting all the models until convergence, we had used the fitting strategy presented in Sect. 5.2, the same models discussed above would have been selected for TVN-Ms, TVSEN-Ms, and TVTIN-Ms. This result further supports the use of the proposed strategy, since it also reduces the computational times by more than two-thirds.

In terms of genotype classification, our findings indicate that all models classify the genotypes in a similar way. In detail, there are only three differences between the classifications produced by TVSEN-Ms and TVTIN-Ms, whereas these models differ from TVN-Ms by 6 and 7 units, respectively. Additionally, these differences only involve units clustered in two out of the five groups. Given these results, in the following we focus on the classification produced by the best-fitting model (TVSEN-Ms).

The classification of the 58 genotypes into the five groups detected by the best TVSEN-Ms is reported in Table 11.

Table 11 Classification of the 58 genotypes into the five groups detected by the best TVSEN-Ms

To interpret the five detected groups, we start from the information contained in Basford and Tukey (1998), which provides a detailed discussion of this dataset. In particular, groups 1 and 2 cluster the 15 overseas genotypes. These lines are the earlier maturing genotypes, where maturity is based on average days to flowering. Conversely, the genotypes clustered in group 5 have the longest days to flowering, thus leading to very late relative maturity. Groups 3 and 4 aggregate genotypes having mid-to-late maturity. Varying levels of maturity among soybean lines can imply distinct susceptibility to weather conditions, even when they are grown at the same site in the same year, and this impacts their overall performance.

To further investigate the detected clusters, we report the parallel coordinate plots of the estimated mean tensors \(\textbf{M}_k\): Fig. 6 refers to the seed yield variable, while Fig. 7 refers to seed protein. In turn, each figure is split across the four locations, and each location separately considers the two years of analysis. A minimal sketch of how such plots can be produced follows.
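Before turning to the figures, we sketch in R how such a parallel coordinate plot can be drawn; the matrix `M_hat` is a hypothetical arrangement of the estimated means, with the \(K=5\) groups on the rows and, for one trait, the eight location-year cells on the columns.

```r
# Minimal sketch of a parallel coordinate plot of estimated group means.
# `M_hat` is a hypothetical K x 8 matrix: K = 5 groups (rows) over the
# eight location-year cells (columns) for, e.g., the log seed yield.
K <- nrow(M_hat)
matplot(t(M_hat), type = "b", lty = 1, pch = 16, col = seq_len(K),
        xlab = "location-year cell", ylab = "estimated mean")
legend("topleft", legend = paste("group", seq_len(K)),
       col = seq_len(K), lty = 1, pch = 16)
```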

Fig. 6

Parallel coordinate plots for the means estimated by TVSEN-Ms concerning the seed yield variable, over the four locations and two years, colored according to the estimated classification

Fig. 7

Parallel coordinate plots for the means estimated by TVSEN-Ms concerning the seed protein variable, over the four locations and two years, colored according to the estimated classification

Starting with Fig. 6, we note that the mean seed yield of all the groups generally increases when passing from 1970 to 1971. The reason for such behavior may lie in the soybean rust epidemic and late infestations of green vegetable bugs, which affected the four locations in different ways in 1970 (Basford and Tukey 1998). Thus, it is reasonable that in 1971 almost all the groups produced a better yield. At Redland Bay, these problems were particularly intense for later maturing genotypes, and indeed the five groups seem to be sorted by maturity there. Nevertheless, it is interesting to note that groups 1 and 2 (the overseas early maturing genotypes), despite their greater yield at this location, are the only ones showing a decreasing tendency. Thus, tracking the evolution of yield production over time, across the locations, allows for an immediate comparison of the groups’ behavior.

In 1970, group 2 shows the highest mean seed yield in Lawes, Brookstead, and Redland Bay, whereas group 1 has the lowest mean seed yield in Lawes, Brookstead, and Nambour. Therefore, although these two groups are both composed of early maturing soybeans, their behavior across the locations is quite different. The soybeans clustered in group 2 seem to adapt better to the climatic and edaphic conditions of these locations, generating a greater yield. Thus, from this point of view, the genotypes aggregated in group 2 should be preferred to those of group 1.

It is also interesting to note that, with the exclusion of 1970 at Brookstead, groups 3, 4, and 5 are always ranked in the same (decreasing) order. Thus, for the mid-to-late maturity groups, the yield performance seems to be a decreasing function of maturity, opposite to what was discussed above for the earlier maturity groups. In 1971, group 2 continues its good yield performance, while group 5 presents the worst mean seed yield in three locations out of four.

When Fig. 7 is considered, we see that all the groups increase their mean seed protein when moving from 1970 to 1971, with the sole exception of group 1 in Lawes. Group 5 has the highest mean seed protein in all the locations over both years, closely followed by group 4. It is interesting to note that late maturity genotypes, such as those in groups 4 and 5, despite their low seed yield (refer to Fig. 6), are those producing the highest protein levels across the locations. This information can be useful for farmers in deciding whether to invest in groups of soybeans producing a greater yield or a higher protein content.

To better illustrate the tail behavior of the detected groups, and to show how they deviate from the normality assumption, we now report the estimated tailedness parameters of TVSEN-Ms: \(\theta _1=1.39\), \(\theta _2=0.59\), \(\theta _3=0.36\), \(\theta _4=0.85\), and \(\theta _5=1.04\). As we can see, the first and last groups have slightly heavier-than-normal tails, whereas the other three groups show thicker tails (refer to Fig. 1). This supports the necessity of using mixture component distributions that allow for a more flexible tail behavior than is achievable with the TVN distribution.

7.2 Second example

The second dataset, herein labeled Italy.work, comes from the Italian National Institute of Statistics (ISTAT), a public research organization and the main producer of official statistics in the service of citizens and policy-makers in Italy, and is accessible at http://dati.istat.it/#. Specifically, we analyze data concerning the inactivity and unemployment rates for \(N=106\) Italian provinces (NUTS3, according to the European Nomenclature of Territorial Units for Statistics). Both variables, which are often jointly analyzed (see, e.g., Armstrong 1997; Gregg and Wadsworth 2010), are naturally made available by ISTAT in a four-way structure. Indeed, each variable is presented via tables containing information on three factors: gender, age, and time. The first factor, gender, has two levels: males and females. The second factor, age, has levels defined by age class. We select the age classes such that they are: (i) available for both variables, (ii) non-overlapping, and (iii) covering the full age range. Thus, the following four age classes are selected: \(15-24\), \(25-34\), \(35-49\), and \(50-74\). The third factor, time, allows the selection of the time horizon. Here, we select the four years spanning from 2018 to 2021, which are those ISTAT made available at the provincial level, at the time of data access, after the new regulations defined by the Council and European Commission (European Parliament and Council 2019). Thus, the whole dataset is presented as a 5-way array having dimensions \( 2 \times 2 \times 4 \times 4 \times 106\); a minimal sketch of this arrangement is reported below. Notice that the original data contained 107 provinces, but the province of Vibo Valentia is excluded because of some missing values in the unemployment rate in 2020. As before, we map both variables to \(I\hspace{-1.4mm}R\) via the logit transformation.
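The sketch below shows how such a 5-way array can be built in R; the long-format data frame `istat_long` and its `rate` column (in percent, sorted in the factor order given below) are hypothetical placeholders for the actual ISTAT export.

```r
# Minimal sketch: arranging the ISTAT data as the 2 x 2 x 4 x 4 x 106 array
# described above, with the logit transformation applied. `istat_long` is a
# hypothetical long-format data frame whose `rate` column (in percent) is
# sorted so that the variable index varies fastest and the province slowest.
X <- array(qlogis(istat_long$rate / 100),
           dim = c(2, 2, 4, 4, 106),
           dimnames = list(variable = c("inactivity", "unemployment"),
                           gender   = c("males", "females"),
                           age      = c("15-24", "25-34", "35-49", "50-74"),
                           year     = as.character(2018:2021),
                           province = NULL))
```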

It should be noted that, via the above ISTAT website, it is possible to obtain a simplified version of the considered dataset by aggregating the gender and age information into single levels. In particular, the two genders can be aggregated into a unique level called "total" that, despite the label, is not the sum of the values for males and females, but rather a balance between the two. Similarly, the age classes can be grouped into a unique level called "\(15-74\)" that, as for gender, balances in some way the values of the different age classes. Note that, for both factors, the way in which the values are balanced across the levels is not specified. This would reduce the dataset to a 3-way array having dimensions \(2 \times 4 \times 106\), and matrix-variate mixture models could be used for the analysis. However, as we show in Sect. 7.2.2, this approach leads to several drawbacks compared to the analysis conducted on the 5-way dataset in Sect. 7.2.1. In a nutshell, we lose information and interpretability both at the cluster-specific and at the overall data partition level.

7.2.1 Results

Similarly to the first dataset, the best parsimonious models selected by the BIC for TVN-Ms, TVSEN-Ms, and TVTIN-Ms share the same number of groups, which in this case is \(K=6\). However, the parsimonious structure for the scale matrices is here slightly different among the selected models. In detail, we have a VE-EV-EE-EVE structure for TVSEN-Ms, a VE-EV-EV-VVE structure for TVTIN-Ms, and a VE-EV-EE-VVE structure for TVN-Ms. Thus, all the models agree in identifying VE and EV structures for the variables and the gender factor, respectively, but provide divergent conclusions for the other two scale matrices. In any case, despite these minor discrepancies, TVSEN-Ms and TVTIN-Ms provide a better fit also on this dataset, with BIC values equal to 6778.51 and 6818.63, respectively, compared to a BIC of 7291.36 for TVN-Ms.

From a classification viewpoint, our findings indicate that TVSEN-Ms and TVTIN-Ms detect a very similar grouping structure, since they differ only in the assignment of four provinces. Conversely, their classifications differ from that of TVN-Ms by eight and twelve units, respectively. Nevertheless, all the models agree in classifying the same units in groups 2, 3, and 4, while there are differences in groups 1, 5, and 6. To keep this section to a reasonable length, we focus our discussion on the classification produced by the best-fitting model (TVSEN-Ms), but substantially similar conclusions can be drawn by analyzing that of TVTIN-Ms.

The classification of the 106 provinces into the six groups detected by the best TVSEN-Ms is reported in Table 12.

Table 12 Classification of the 106 provinces into the six groups detected by the best TVSEN-Ms

To interpret the six detected groups, we report in Figs. 8 and 9 the parallel coordinate plots of the estimated mean tensors \(\textbf{M}_k\) for TVSEN-Ms, separately for the inactivity rate and unemployment rate variables, respectively.

Fig. 8

Parallel coordinate plots for the means estimated by the best-fitting TVSEN-Ms concerning the inactivity rate variable, over the four age classes and four years, colored according to the estimated classification and with line type according to the gender classes

Fig. 9

Parallel coordinate plots for the means estimated by the best-fitting TVSEN-Ms concerning the unemployment rate variable, over the four age classes and four years, colored according to the estimated classification and with line type according to the gender classes

Analyzing Fig. 8 from a global point of view, we note that the mean inactivity rates: (i) roughly decrease as the age class increases, (ii) are generally higher for females, and (iii) behave quite regularly over time. More in detail, for the 15–24 age class, we note that all the groups present an upward spike in 2020. A reason for such behavior lies in the drastic socio-economic restrictions adopted by Italy to face Covid-19 in its initial stages. This exceptional event had a greater impact on inactivity than unemployment, as discussed by Verick et al. (2022). Interestingly, the inactivity rate patterns for males in group 5 are quite different from those of the other groups. On a related note, the inactivity rates for males in groups 1 to 4 are very close together and slightly better than those of the corresponding females in the same groups.

Fig. 10

Map of the Italian provinces colored by the classification produced by the best-fitting TVSEN-M according to the BIC

When moving to the 25–34 age class, we observe temporal evolutions of the mean inactivity rates comparable to those observed for the 15–24 age class. However, the males from the provinces clustered in group 5 exhibit a behavior opposite to that observed in the 15–24 age class. Furthermore, the inactivity rates for males in groups 1 to 4 are still close together but now present, by far, the best results, and are more distant from the corresponding females; the latter, in turn, are very similar to the males of group 6.

The 35–49 age class shows temporal configurations that are mostly different from those observed in the previously discussed age classes. Indeed, with some notable exceptions, the evolution of the mean inactivity rates over time appears smoother. Groups 1 to 4, which performed similarly in the 15–24 and 25–34 age classes, now present different patterns, both for males and females. In particular, males in groups 3 and 4 reach the lowest mean inactivity rates, whereas those in groups 1 and 2 severely deteriorate in performance. Likewise, females in groups 2 and 4 have better mean inactivity rates than those in groups 1 and 3.

The last age class, 50–74, shows a temporal pattern of the mean inactivity rates resembling that of the 35–49 age class. Here, males in groups 1 and 2 show the lowest mean values, while those in groups 3 and 4 experience a significant decline in their mean values; hence, the behavior differs from that shown for the 35–49 age class. It is also worth mentioning that males in group 5 gradually improve their performance when passing from the 25–34 to the 50–74 age class, where they also achieve their best mean inactivity rates.

As concerns Fig. 9, we note that the results for the mean unemployment rates (partially) mimic those of the mean inactivity rates, since they decrease as the age classes grow and females generally have higher rates than males. However, their temporal evolutions are less clear and smooth than those of the mean inactivity rates. Starting with the 15–24 age class, we note that the distances among the six groups are quite narrow. Moreover, within each group, the mean unemployment rates of males and females rarely follow the same temporal structure. Nevertheless, males in group 1 seem to provide the lowest mean unemployment values, while females in group 6 show the highest values in three years out of four.

Fig. 11

Parallel coordinate plots for the means estimated by the best-fitting MVN-Ms for both variables, over the four years, colored according to the estimated classification

Fig. 12

Map of the Italian provinces colored by the classification produced by the best-fitting MVN-M according to the BIC

For the 25–34 age class, we see behavior opposite to that observed for the 15–24 age class. Indeed, the mean unemployment rates vary considerably among the six groups. Specifically, groups 5 and 6 are now well separated from groups 1 to 4. In addition, we start to notice a more distinct separation between the males of groups 1 to 4 and their corresponding females. Unlike the 15–24 age class, and with the exclusion of males in group 4 and females in group 6, there seems to be a decreasing trend in the mean unemployment rates over the years.

Moving to the 35–49 age class, we observe a relative convergence of the six groups compared to the 25–34 age class. The separation between the mean unemployment rates of males in groups 1 to 4 (which are still the best) and the corresponding females is more marked here. We also notice that females in group 6 have better values than males in the same group over all four years, the only case in which this occurs.

Finally, for the 50–74 age class, we note a sharp decrease in the mean unemployment rates for males in groups 3 and 4 over the four years, while females in group 6 again show the worst performance.

To summarize, simply by moving from one age class to another, and by separately examining genders and years, different behaviors within and between the groups can be disclosed. Such detailed group interpretability would be lost if the data were unfolded (see Sect. 1) or extracted with a lower level of detail (see Sect. 7.2.2).

To further interpret the six clusters, we illustrate in Fig. 10 how the provinces have been clustered across Italy. Note that the omitted province of Vibo Valentia is colored in gray.

We notice that groups 5 and 6, which generally present the worst behavior for both variables, are located in the central-southern part of the country. Conversely, groups 1 and 2, which mainly provide the best results for both variables, cover most of northern Italy. Group 3 aggregates some provinces in the extreme east of northern Italy and between the Piemonte and Lombardia regions (north-west), while group 4 mainly clusters provinces in the east-central part of Italy and those bordering Switzerland.

Lastly, to better support the necessity of using mixture components with heavier-than-normal tails, we now report the estimated tailedness parameters for TVSEN-Ms: \(\theta _1=0.48\), \(\theta _2=0.34\), \(\theta _3=0.51\), \(\theta _4=0.62\), \(\theta _5=0.93\), and \(\theta _6=0.34\). Hence, with the exclusion of group 5, all the other clusters have quite heavy tails (refer to Fig. 1).

7.2.2 An alternative analysis by using 3-way data

As introduced in Sect. 7.2, the data could be extracted from the ISTAT website with a lower level of detail, i.e., without distinction among genders and age classes. The obtained 3-way dataset could then be analyzed via the matrix-variate versions of the tensor-variate mixture models considered up to now. To evaluate the consequences of such a choice, we fit the 98 parsimonious MVN-Ms, MVSEN-Ms, and MVTIN-Ms (whose unconstrained versions were already considered in Sect. 6.3) to the 3-way dataset for \(K\in \left\{ 1,\ldots ,7\right\} \).

All the models detect the same \(K=3\) groups, with “EVI” and “EE” parsimonious structures for the two scale matrices. Additionally, according to the BIC, the best-fitting model is that based on MVN-Ms (BIC = \(-1097.51\)), closely followed by MVSEN-Ms and MVTIN-Ms, having BIC values equal to \(-1086.22\) and \(-1086.11\), respectively. The estimated mean matrices for the MVN-Ms are reported via the parallel coordinate plots in Fig. 11, while the Italian political map colored by the detected classification is illustrated in Fig. 12.

As we note, there are several differences compared to the results in Sect. 7.2.1, which we briefly discuss below.

  • The best-fitting model is now based on MVN-Ms, even though the same classification is obtained by the best parsimonious configuration of each mixture model.

  • Only three groups have been detected and, from the analysis of Fig. 11, they are a summarized version of the results in Figs. 8 and 9. Apart from the ordering of their values, there are no very relevant differences among the three groups. By using such a 3-way dataset, we lose cluster interpretability and information. Indeed, we have seen in Sect. 7.2.1 that the behavior of provinces within and between clusters varies by age class, gender, and time.

  • On a related note, the clustering solution reported in Fig. 12 is a blurred version of that in Fig. 10. This underlines that details are lost and the whole clustering process is affected.

8 Conclusions

In this work, we have introduced two generalizations of the tensor-variate normal (TVN) distribution: the tensor-variate shifted exponential normal (TVSEN) and the tensor-variate tail-inflated normal (TVTIN) distributions. Both distributions allow for a more flexible tail behavior, thus providing better modeling of data exhibiting non-normal features. We have introduced the related mixture models, TVSEN-Ms and TVTIN-Ms, for model-based clustering in a parsimonious setting based on the eigen-decomposition of the component scale matrices. As a by-product, the parsimonious version of TVN-Ms has also been introduced. EM-based algorithms have been illustrated for parameter estimation. The initialization and fitting strategies proposed to facilitate both processes produced good results in the simulation analyses. Specifically, parameter and classification recovery, as well as the capability of the BIC to detect the true data-generating model, exhibited convincing outcomes. An additional simulation study showed the advantages of using heavy-tailed models in the presence of atypical observations, and how the unfolding and vectorization processes lead to different and worse classification and model selection results (in addition to increasing the number of parameters). The two real-data analyses showed, first, the flexibility of our methodology in modeling data of different dimensionality, i.e., 4-way and 5-way, respectively. Second, the advantages of using TVSEN-Ms and TVTIN-Ms over TVN-Ms were evidenced in terms of data fitting. Third, the classifications produced for both datasets, as well as their interpretation via the estimated parameters, were analyzed in depth, highlighting the presence of clusters with different characteristics.