1 Introduction

In this paper we introduce methodological applications arising from the cluster analysis of covariance matrices. Throughout, we show that appropriate clustering criteria on these objects provide useful tools for the analysis of classic problems in Multivariate Analysis. The chosen framework is that of multivariate classification under a Gaussian Mixture Model, a setting where a suitable reduction of the parameters involved is a fundamental goal, leading to the Parsimonious Model. We focus on this hierarchized model, designed to explain data with a minimum number of parameters, by introducing intermediate categories associated with clusters of covariance matrices.

Gaussian Mixture Model approaches to discriminant and cluster analysis are well-established and powerful tools in multivariate statistics. For a fixed number K, both methods aim to fit K multivariate Gaussian components to a data set in \({\mathbb {R}}^d\); the key difference is whether the labels providing the source group of the data are known (supervised classification) or unknown (unsupervised classification). In the supervised problem, we handle a data set with N observations \(y_1,\ldots ,y_N\) in \({\mathbb {R}}^d\) and associated labels \(z_{ {i},{ {k}}}, {i}=1,\ldots ,N\), \({ {k}}=1,\ldots , {K}\), where \(z_{ {i},{ {k}}}=1\) if the observation \(y_{ {i}}\) belongs to group k and 0 otherwise. Denoting by \(\phi (\cdot \vert \mu ,\Sigma )\) the density of a multivariate Gaussian distribution on \({\mathbb {R}}^d\) with mean \(\mu \) and covariance matrix \(\Sigma \), we seek to maximize the complete log-likelihood function

$$\begin{aligned} CL\Bigl (\pmb \pi ,\pmb \mu ,\pmb \Sigma \Bigr ) = \sum _{{ {i}}=1}^N\sum _{ {k}=1}^{ {K}} z_{ {i},{ {k}}} \log \Biggl ( \pi _{ {k}} \phi (y_{ {i}}\vert \mu _{ {k}},\Sigma _{ {k}})\Biggr ), \end{aligned}$$
(1)

with respect to the weights \(\pmb {\pi }=(\pi _1,\ldots ,\pi _{ {K}})\) with \(0 \le \pi _{ {k}}\le 1, \ \sum _{ {k}=1}^{ {K}} \pi _{ {k}}=1\), the means \(\pmb {\mu }=(\mu _1,\ldots ,\mu _{ {K}})\) and the covariance matrices \(\pmb {\Sigma }=(\Sigma _1,\ldots ,\Sigma _{ {K}})\). In the unsupervised problem the labels \(z_{ {i},{ {k}}}\) are unknown, and fitting the model involves the maximization of the log-likelihood function

$$\begin{aligned} L\Bigl ( \pmb \pi ,\pmb \mu ,\pmb \Sigma \Bigr ) = \sum _{{ {i}}=1}^N\log \Biggl (\sum _{ {k}=1}^{ {K}} \pi _{ {k}} \phi (y_{ {i}}\vert \mu _{ {k}},\Sigma _{ {k}})\Biggr ) \, \end{aligned}$$
(2)

with respect to the same parameters. This maximization is more complex, and it is usually performed via the EM algorithm (Dempster et al. 1977), in which the following two steps are repeated iteratively: the E step, which computes the expected values of the unobserved variables \(z_{ {i},{ {k}}}\) given the current parameters, and the M step, which finds the parameters maximizing the complete log-likelihood (1) for the values \(z_{ {i},{ {k}}}\) computed in the E step. Therefore, both model-based techniques require the maximization of (1), for which the optimal values of the weights and the means are easily computed:

$$\begin{aligned} n_{ {k}}= \sum _{{ {i}}=1}^N z_{ {i},{ {k}}}, \quad \pi _{ {k}}=\frac{n_{ {k}}}{N}, \quad \mu _{ {k}} =\frac{\sum _{{ {i}}=1}^N z_{ {i},{ {k}}} y_{ {i}}}{n_{ {k}}}. \end{aligned}$$
(3)

With these optimal values, if we denote \(S_{ {k}}=(1/n_{ {k}})\sum _{{ {i}}=1}^N z_{ {i},{ {k}}} (y_{ {i}}-\mu _{ {k}})(y_{ {i}}-\mu _{ {k}})^T\), the problem of maximizing (1) with respect to \(\Sigma _1,\ldots ,\Sigma _{ {K}}\) is equivalent to the problem of maximizing

$$\begin{aligned} (\Sigma _1,\ldots ,\Sigma _{ {K}}) \mapsto \sum _{ {k}=1}^{ {K}} \hspace{0.2cm} \log \Bigl (W_d\bigl (n_{ {k}}S_{ {k}}\vert n_{ {k}},\Sigma _{ {k}}\bigr )\Bigr ) \, \end{aligned}$$
(4)

where \( W_d( \cdot \vert n_{ {k}},\Sigma _{ {k}})\) is the d-dimensional Wishart distribution with parameters \(n_{ {k}},\Sigma _{ {k}}\). For even moderate dimension d, the large number of parameters involved, relative to the size of the data set, can result in poor behavior of standard unrestricted methods. In order to improve the solutions, regularization techniques are often invoked. In particular, many authors have proposed estimating the maximum likelihood parameters under additional constraints on the covariance matrices \(\Sigma _1,\ldots ,\Sigma _{ {K}}\), which leads to maximizing (4) under these constraints. Among these proposals, a prominent place is occupied by the so-called Parsimonious Model, a broad set of hierarchized constraints capable of adapting to conceptual situations that may occur in practice.
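As a concrete illustration of the quantities in (3) and of the scatter matrices \(S_{ {k}}\) defined below, the following minimal R sketch computes them from a data matrix and a 0/1 label matrix (the function and argument names are ours, not from the paper's software):

```r
# Weights, means (3) and scatter matrices S_k from labeled data
# y: N x d data matrix; z: N x K matrix with z[i, k] = 1 if y_i belongs to group k
group_moments <- function(y, z) {
  N <- nrow(y); K <- ncol(z)
  n  <- colSums(z)                      # n_k
  pi <- n / N                           # pi_k
  mu <- t(z) %*% y / n                  # K x d matrix whose k-th row is mu_k
  S  <- lapply(seq_len(K), function(k) {
    yc <- sweep(y, 2, mu[k, ])          # center the observations at mu_k
    crossprod(yc * z[, k], yc) / n[k]   # S_k = (1/n_k) sum_i z_ik (y_i - mu_k)(y_i - mu_k)^T
  })
  list(n = n, pi = pi, mu = mu, S = S)
}
```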

A common practice in multivariate statistics consists in assuming that the covariance matrices share a common part of their structure. For example, if \(\Sigma _1=\ldots =\Sigma _{ {K}}=I_d\), the clustering method described in (2) reduces to k-means. If we assume common covariance matrices \(\Sigma _1=\ldots =\Sigma _{ {K}}=\Sigma \), the procedure coincides with linear discriminant analysis (LDA) in the supervised case (1), and with the method proposed in Friedman and Rubin (1967) in the unsupervised case (2). The general theory organizing these relationships between covariance matrices is based on the spectral decomposition, beginning with the analysis of Common Principal Components (Flury 1984, 1988). In the discriminant analysis setting, the use of the spectral decomposition was first proposed in Flury et al. (1994), and in the clustering setting in Banfield and Raftery (1993). The term “Parsimonious model” and the fourteen levels given in Table 1 were introduced in Celeux and Govaert (1995) for the clustering setting and later, in Bensmail and Celeux (1996), for the discriminant setup.

Table 1 Parsimonious levels based on the spectral decomposition of \(\Sigma _1,\ldots ,\Sigma _{ {K}}.\)

Given a positive definite covariance matrix \(\Sigma _{ {k}}\), the spectral decomposition of reference is

$$\begin{aligned}\Sigma _{ {k}} = \gamma _{ {k}} \beta _{ {k}} \Lambda _{ {k}} \beta _{ {k}}^T \,\end{aligned}$$

where \(\gamma _{ {k}}={\text {det}}(\Sigma _{ {k}})^{1/d}> 0\) governs the size of the groups, \(\Lambda _{ {k}}\) is a diagonal matrix with positive entries and determinant equal to 1 that controls the shape, and \(\beta _{ {k}}\) is an orthogonal matrix that controls the orientation. Given K covariance matrices \(\Sigma _1,\ldots ,\Sigma _{ {K}}\), the spectral decomposition makes it possible to establish the fourteen different parsimonious levels in Table 1, allowing the parameters associated with size, shape and orientation to be common or to differ across groups. To fit a Gaussian Mixture Model under a parsimonious level \({\mathscr {M}}\) in Table 1, we must face the maximization of (4) under the parsimonious restriction. That is, we should find

$$\begin{aligned} \hat{\pmb \Sigma }= \underset{ {\pmb \Sigma }\in {\mathscr {M}}}{{\text {argmax}}}\ \sum _{ {k}=1}^{ {K}} \hspace{0.1cm} \log \Bigl (W_d\bigl (n_{ {k}}S_{ {k}}\vert n_{ {k}},\Sigma _{ {k}}\bigr )\Bigr ) , \end{aligned}$$
(5)

where we say that \( {\pmb \Sigma }=(\Sigma _1,\ldots ,\Sigma _{ {K}})\in {\mathscr {M}}\) if the K covariance matrices satisfy the level. We remark that the Common Principal Components model (Flury 1984, 1988) plays a key role in this hierarchy, which is in any case based on simple geometric interpretations.
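For concreteness, the size, shape and orientation parameters of a single covariance matrix can be recovered from its eigendecomposition. A minimal R sketch (helper names are ours):

```r
# Decompose Sigma = gamma * beta %*% Lambda %*% t(beta), with det(Lambda) = 1
spectral_parts <- function(Sigma) {
  d <- nrow(Sigma)
  e <- eigen(Sigma, symmetric = TRUE)
  gamma  <- det(Sigma)^(1 / d)        # size
  Lambda <- diag(e$values / gamma)    # shape (diagonal, unit determinant)
  beta   <- e$vectors                 # orientation (orthogonal)
  list(gamma = gamma, Lambda = Lambda, beta = beta)
}

Sigma <- matrix(c(4, 1, 1, 2), 2, 2)
p <- spectral_parts(Sigma)
all.equal(p$gamma * p$beta %*% p$Lambda %*% t(p$beta), Sigma)  # TRUE
```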

Restrictions are also often used to address a well-known problem that appears in model-based clustering: the unboundedness of the log-likelihood function (2). With no additional constraints, the problem of maximizing (2) is not even well defined, a fact that can lead to uninteresting spurious solutions, where some groups are associated with a few, almost collinear, observations. Although we will also use these restrictions, we will not pursue this line further in this work. A review of approaches for dealing with this problem can be found in García-Escudero et al. (2017).

The aim of this paper is to introduce a generalization of equation (5) that yields a likelihood-based classification associated with intermediate parsimonious levels. Let \(G \in \{1,\ldots ,K\}\) and \(\pmb u=(u_1,\ldots ,u_K)\) be any vector in \(\lbrace 1,\ldots , G \rbrace ^{K}\). Given a parsimonious level \({\mathscr {M}}\), we can formulate a model in which the theoretical covariance matrices \(\Sigma _1,\ldots ,\Sigma _K\) are assumed to satisfy the level \({\mathscr {M}}\) within each of the G classes defined by \(\pmb u\). For instance, let \(K=7\), \(G=3\), \({\mathscr {M}}\) = VVE and take \(\pmb u=(1,1,2,3,1,2,1)\). This implies

$$\begin{aligned} \Sigma _{{k}}&= \gamma _{{k}}\beta _1\Lambda _{{k}} \beta _1^T, \quad {k}=1,2,5,7, \\ \Sigma _{{k}}&= \gamma _{{k}}\beta _2\Lambda _{{k}} \beta _2^T, \quad {k}=3,6, \\ \Sigma _{{k}}&= \gamma _{{k}}\beta _3\Lambda _{{k}} \beta _3^T ,\quad {k}=4 \ . \end{aligned}$$

Following (5), the estimation of the original covariance matrices involves maximizing (4) within \({\mathscr {M}}_{\pmb u}\), the set of covariance matrices satisfying \(\{\Sigma _k: u_k=g\} \in {\mathscr {M}}\) for all \(g=1,\ldots ,G\). Using the maximized log-likelihood as a measure for the appropriateness of \(\pmb u\), the optimal \(\hat{\pmb u}\) would provide a classification for \(S_1,\ldots ,S_K\) according to the level \({\mathscr {M}}\). Precise definitions will be provided in Sect. 2. We will present an iterative procedure to simultaneously compute the optimal classification and covariance matrix estimators through the modification of equation (5) given by

$$\begin{aligned} \bigl (&\hat{\textbf{u}}, {\hat{\varvec{\Sigma }}} \bigr ) \\ {}&=\underset{\pmb u , \pmb \Sigma \in {\mathscr {M}}_{\pmb u}}{{\text {argmax}}} \Biggl (\sum _{g=1}^G \sum _{k:u_k=g}\log \Bigl (W_d(n_k S_k\vert n_k,\Sigma _k)\Bigr )\Biggr ) .\nonumber \end{aligned}$$
(6)

Solving this equation will allow us to fit Gaussian Mixture Models with intermediate parsimonious levels, in which the common parameters of a parsimonious level are shared within each of the G classes given by the vector of indexes \( {\pmb {\hat{u}}}\), but vary between the different classes. In the previous example, we obtain three classes of covariance matrices that share their principal directions within each class, resulting in a better interpretation of the final classification and allowing a considerable reduction of the number of parameters to be estimated. We will use these ideas for fitting Gaussian Mixture Models in discriminant analysis and cluster analysis. To avoid unboundedness of the objective function in the clustering framework, we will impose the determinant and shape constraints of García-Escudero et al. (2020), which are fully implemented in the MATLAB toolbox FSDA (Riani et al. 2012). We will analyze some examples where the proposed models fit the data with fewer parameters and greater interpretability, comparing favorably with the 14 parsimonious models. As is becoming usual in the literature, we will use the Bayesian Information Criterion (BIC) to carry out the comparisons between different models; this applies to all examples considered in the text. It has been noticed by many authors that BIC selection works properly in model-based clustering, as well as in discriminant analysis. Fraley and Raftery (2002) includes a detailed justification for the use of BIC, based on previous references. A summary of the comparison of BIC with other techniques for model selection can also be found in Biernacki and Govaert (1999).

The paper is organized as follows. Section 2 approaches the problem of the parsimonious classification of covariance matrices given by equation (6), focusing on its computation for the most interesting restrictions in terms of dimensionality reduction and interpretability. Throughout, we will only work with models based on the parsimonious levels of proportionality (VEE) and common principal components (VVE), although the extension to other levels is straightforward. Section 3 applies the previous theory to the estimation of Gaussian Mixture Models in cluster analysis and discriminant analysis, including some simulation examples for illustration. Section 4 includes real data examples, where we will see the gain in interpretability that can arise from these solutions. Some conclusions are outlined in Sect. 5. Finally, Appendix A includes theoretical results, Appendix B provides some additional simulation examples and Appendix C explains technical details about the algorithms. Additional graphical material is provided in the Online Supplementary Figures document.

2 Parsimonious classification of covariance matrices

Given \(n_1,\ldots ,n_{ {K}}\) independent observations from K groups with different distributions, and \(S_1,\ldots ,S_{ {K}}\) the corresponding sample covariance matrices, a group classification may be provided according to different similarity criteria. In the general case, given a similarity criterion f depending on the sample covariance matrices and the sample sizes, the problem of classifying the K covariance matrices into G classes, \(1\le G \le { {K}}\), typically consists in solving the equation

$$\begin{aligned} {\hat{ {\pmb u}}}= \underset{ {\pmb u} \in {\mathscr {H}}}{\text {argmax}} \quad \sum _{g=1}^G f\Bigl ( \hspace{0.05cm} \bigl \lbrace (S_{ {k}},n_{ {k}}): {u}_{ {k}}= g \bigr \rbrace \Bigr ), \end{aligned}$$

where \({\mathscr {H}} = \bigl \lbrace {\pmb u}=(u_1,\ldots ,u_K) \in \lbrace 1,\ldots , G\rbrace ^{ {K}}: \forall \ g = 1,\ldots ,G \quad \exists \ {k} \text { verifying } {u}_{ {k}}=g \bigr \rbrace \). In this work, we focus on the Gaussian case, proposing different similarity criteria based on the parsimonious levels that arise from the spectral decomposition of a covariance matrix.

Multivariate procedures based on parsimonious decompositions assume that the theoretical covariance matrices \(\Sigma _1,\ldots ,\Sigma _{ {K}}\) jointly satisfy one level \({\mathscr {M}}\) out of the fourteen in Table 1. To elaborate on this idea, we now introduce some useful notation. In a parsimonious model \({\mathscr {M}}\), we write \((\Sigma _1,\ldots ,\Sigma _{ {K}}) \in {\mathscr {M}}\) if these matrices share some common parameters C and have variable parameters \(\pmb V = (V_1,\ldots ,V_{ {K}})\) (specified in the model \({\mathscr {M}}\)). We will denote by \(\Sigma (V_{ {k}},C)\) the covariance matrix with the size, shape and orientation parameters associated with \((V_{ {k}},C)\). Therefore, under the parsimonious level \({\mathscr {M}}\), we are assuming that

$$\begin{aligned} \Sigma _{ {k}} = \Sigma (V_{ {k}},C) \quad \quad {k}=1,\ldots , {K}. \end{aligned}$$

If the \(n_{ {k}}\) observations of group k are independent and arise from a distribution \(N(\mu _{ {k}},\Sigma _{ {k}})\), then, according to the arguments in the introduction, it is natural to take the maximized log-likelihood in (5) under the parsimonious level \({\mathscr {M}}\) as a similarity criterion for the covariance matrices. This allows us to measure their resemblance in the features associated with the common part of the decomposition in the theoretical model. Thus, the similarity criterion for the parsimonious level \({\mathscr {M}}\) is

$$\begin{aligned} f_{{\mathscr {M}}}\Bigl (\bigl \lbrace (S_{ {k}},n_{ {k}}), {k}=1,\ldots ,r&\bigr \rbrace \Bigr ) \nonumber \\ =\underset{V_1,\ldots ,V_r, C }{\max }\ \sum _{ {k}=1}^r \hspace{0.2cm} \log \Bigl (W_d\bigl (&n_{ {k}}S_{ {k}}\vert n_{ {k}},\Sigma (V_{ {k}},C)\bigr )\Bigr ) . \end{aligned}$$

Consequently, given a level of parsimony \({\mathscr {M}}\), the covariance matrix classification problem in G classes consists in solving the equation

$$\begin{aligned} {\hat{ {\pmb u}}} = \underset{ {\pmb u} \in {\mathscr {H}}}{\text {argmax}}&\ \sum _{g=1}^G f_{{\mathscr {M}}}\Bigl ( \bigl \lbrace (S_{ {k}},n_{ {k}}) : {u}_{ {k}}= g \bigr \rbrace \Bigr )\nonumber \\ =\underset{ {\pmb u} \in {\mathscr {H}}}{\text {argmax}}&\Biggl (\underset{V_1,\ldots ,V_{ {K}}, C_1,\ldots ,C_G }{\max }\ \sum _{g=1}^G \nonumber \\ \sum _{ {k}: {u}_{ {k}}=g} \log&\Bigl (W_d\bigl (n_{ {k}}S_{ {k}}\vert n_{ {k}},\Sigma (V_{ {k}},C_g)\bigr )\Bigr )\Biggr ) . \end{aligned}$$
(7)

In order to avoid the combinatorial problem of maximizing within \({\mathscr {H}}\), denoting the variable parameters by \(\pmb V=(V_1,\ldots ,V_{ {K}})\) and the common parameters by \(\pmb C=(C_1,\ldots ,C_G)\), we focus on the problem of maximizing

$$\begin{aligned} W( {\pmb {u}},\pmb { V}, \pmb {C})&\nonumber \\ =\sum _{g=1}^G \hspace{0.1 cm}\sum _{ {k}: {u}_{ {k}}=g}&\hspace{0.1cm}\log \Biggl (W_d\Bigl (n_{ {k}} S_{ {k}}\vert n_{ {k}},\Sigma (V_{ {k}},C_g)\Bigr )\Biggr ) \ , \end{aligned}$$

since the value \( {{ {\pmb u}}}\) maximizing this function agrees with the optimal \( {\hat{ {\pmb u}}}\) in (7). This problem will be referred to as Classification \(\pmb G\)-\(\pmb {{\mathscr {M}}}\). From the expression of the d-dimensional Wishart density, we can see that maximizing W is equivalent to minimizing with respect to the same parameters the function

$$\begin{aligned} \sum _{g=1}^G \sum _{ {k}: {u}_{ {k}}=g} n_{ {k}} \Biggl ( \log \Bigl (\bigl \vert \Sigma (V_{ {k}},C_g)\bigr \vert \Bigr )+{\text {tr}}\Bigl (\Sigma (V_{ {k}},C_g)^{-1}S_{ {k}}\Bigr )\Biggr ). \end{aligned}$$
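This criterion is straightforward to evaluate: since the double sum simply runs over all k with \(\Sigma _{ {k}}=\Sigma (V_{ {k}},C_{ {u}_{ {k}}})\), it is a sum of terms \(n_{ {k}}\bigl (\log \vert \Sigma _{ {k}}\vert + {\text {tr}}(\Sigma _{ {k}}^{-1}S_{ {k}})\bigr )\). A minimal R sketch (names are ours):

```r
# Criterion to be minimized: sum_k n_k * ( log det(Sigma_k) + tr(Sigma_k^{-1} S_k) ),
# where Sigma[[k]] already incorporates the common parameters of the class of k
classification_criterion <- function(S, n, Sigma) {
  sum(sapply(seq_along(S), function(k) {
    n[k] * (log(det(Sigma[[k]])) + sum(diag(solve(Sigma[[k]], S[[k]]))))
  }))
}
```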

Maximization can be achieved through a simple modification of the CEM algorithm (Classification Expectation Maximization, introduced in Celeux and Govaert (1992)), for any of the fourteen parsimonious levels. A sketch of the algorithm is presented here:

Classification \(\pmb G\)-\(\pmb {{\mathscr {M}}:}\) Starting from an initial estimate \(\pmb {C^0}=(C_1^0,\ldots ,C_G^0)\) of the common parameters, which may be taken as the parameters of G different matrices \(S_{ {k}}\) randomly chosen among \(S_1,\ldots ,S_{ {K}}\), the \(m^{th}\) iteration consists of the following steps:

  • u-V step: Given the common parameters \(\pmb {C^m}=(C_1^m,\ldots , C_G^m)\), we maximize with respect to the partition \( {\pmb u}\) and the variable parameters \(\pmb {V}\). For each \( {k} = 1,\ldots , {K}\), we compute

    $$\begin{aligned} {\tilde{V}} _{ {k},g} = \quad \underset{V}{{\text {argmax }}}\ W_d\Bigl (n_{ {k}} S_{ {k}} \big \vert n_{ {k}},\Sigma (V,C_g)\Bigr ) \quad \nonumber \end{aligned}$$

    for \(1\le g\le G\), and we define:

    $$\begin{aligned} {u}_{ {k}}^{m+1} = \underset{g\in \lbrace 1,\ldots ,G\rbrace }{{\text {argmax }}} \ W_d\Bigl (n_{ {k}} S_{ {k}} \big \vert n_{ {k}}, \Sigma ({\tilde{V}}_{ {k},g},C_g)\Bigr ) . \end{aligned}$$
  • V-C step: Given the partition \(\pmb { {u}^{m+1}}\), we compute the values \((\pmb {V^{m+1}},\pmb {C^{m+1}})\) maximizing \(W(\pmb { {u}^{m+1}},\pmb { V}, \pmb {C})\). The maximization can be done individually for each of the groups created, by maximizing for each \(g=1,\ldots ,G\) the function

    $$\begin{aligned} (\lbrace&V_{{k}} \rbrace _{k: {u_k}=g}, C_g) \\&\longmapsto \sum _{k: {u_k}=g}\hspace{0.1cm}\log \Bigl (W_d\bigl (n_k S_k\vert n_k,\Sigma (V_k,C_g)\bigr )\Bigr ) \ , \end{aligned}$$

    The maximization for each of the 14 parsimonious levels can be done, for instance, with the techniques in Celeux and Govaert (1995). The methodology proposed therein for common orientation models uses modifications of the Flury algorithm (Flury and Gautschi 1986). However, for these models we will use the algorithms subsequently developed by Browne and McNicholas (2014a, 2014b), often implemented in the software available for parsimonious model fitting, which allow a more efficient estimation of the common orientation parameters.

For each of the fourteen parsimonious models, the variable parameters in the solution \(\pmb {{\hat{V}}}\) may be computed as a function of the parameters \(( {\pmb {\hat{u}}},\pmb {{\hat{C}}})\), the sample covariance matrices \(S_1,\ldots ,S_{ {K}}\) and the sample sizes \(n_1,\ldots ,n_{ {K}}\). Therefore, the function W can be written as \(W(\pmb { {u}},\pmb {C})\), and the maximization can be seen as a particular case of the coordinate descent algorithm explained in Bezdek et al. (1987).

As already noted, we develop the algorithm only for the two most interesting parsimonious levels. First of all, we restrict attention to models flexible enough that the solution of (6) with \(G= {K}\) (no grouping is assumed) coincides with the unrestricted solution, \({{\hat{\Sigma }}}_{ {k}}=S_{ {k}}\). The first six models do not satisfy this condition. For the last eight models, the number of covariance parameters is

$$\begin{aligned}\delta _{\text {VOL}}\cdot 1 + \delta _{\text {SHAPE}}\cdot (d-1) + \delta _{\text {ORIENT}} \cdot \frac{d(d-1)}{2} \,\end{aligned}$$

where \(\delta _{\text {VOL}}, \delta _{\text {SHAPE}}\) and \(\delta _{\text {ORIENT}}\) take the value 1 if the corresponding parameter is assumed to be common, and K if it is assumed to be variable between groups. When d and K are large, the main source of variation in the number of parameters is whether the orientation is common or variable, followed by whether the shape is common or variable. For example, if \(d=9, K=6\), the numbers of parameters related to each constraint are detailed in Table 2.

Table 2 Number of parameters associated with each feature when \(K=6,d=9\)
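A small helper that reproduces such counts from the formula above may clarify the bookkeeping; a sketch in R (the function name and the two evaluated configurations are ours):

```r
# Covariance parameter count: size + shape + orientation contributions
n_cov_par <- function(d, K, vol_common, shape_common, orient_common) {
  (if (vol_common) 1 else K) +                      # size
  (if (shape_common) 1 else K) * (d - 1) +          # shape
  (if (orient_common) 1 else K) * d * (d - 1) / 2   # orientation
}
n_cov_par(9, 6, FALSE, FALSE, FALSE)  # VVV (everything variable): 270
n_cov_par(9, 6, FALSE, FALSE, TRUE)   # VVE (common orientation):   90
```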

Our primary motivation is exemplified by Table 2: to provide alternatives to the models with variable orientation. To that end, we look for models whose orientation varies across G classes, with \(1\le G \le { {K}}\). We consider the case where size and shape are variable across all groups (G different Common Principal Components, G-CPC) and also the case where the shape parameters are additionally common within each of the G classes (proportionality to G different matrices, G-PROP). Apart from the parameter reduction, these models can provide an easier interpretation of the variables involved in the problem, which is often a hard task in multidimensional problems with several groups. We keep the size variable, since it does not cause a major increase in the number of parameters, and it is easy to interpret. Therefore, the models we are considering are the following (a sketch of the u-V assignment step for the G-CPC case is given after the list):

  • \({\textbf {Classification G-CPC}}\): We are looking for G orthogonal matrices \(\pmb {\beta }=(\beta _1,\ldots ,\beta _G)\) and a vector of indexes \( {\pmb {u}}=( {u}_1,\ldots , {u}_{ {K}}) \in {\mathscr {H}}\) such that

    $$\begin{aligned} \Sigma _{ {k}} = \gamma _{ {k}} \beta _{ {u}_{ {k}}} \Lambda _{ {k}} \beta _{ {u}_{ {k}}}^T \quad {k}=1,\ldots , {K}\, \end{aligned}$$

    where \(\pmb {\gamma }=(\gamma _1,\ldots ,\gamma _{ {K}})\) and \(\pmb {\Lambda }=(\Lambda _1,\ldots ,\Lambda _{ {K}})\) are the variable size and shape parameters. The number of parameters is \( { {K}}+ { {K}}(d-1) + G d(d-1)/2\). In the situation of Table 2, taking \(G=2\) the number of parameters is 126, while allowing for variable orientation it is 270. To solve (7), we have to find a vector of indexes \( {\pmb {\hat{u}}}\), G orthogonal matrices \(\pmb {{{\hat{\beta }}}}\) and variable parameters \(\pmb {{{\hat{\gamma }}}}\) and \(\pmb {{{\hat{\Lambda }}}}\) minimizing

    $$\begin{aligned}&( {\pmb {u}},\pmb {\Lambda },\pmb {\gamma }, \pmb {\beta }) \nonumber \\&\longmapsto \sum _{g=1}^G \sum _{ {k}: {u}_{ {k}}=g} n_{ {k}} \left( d\log \bigl (\gamma _{ {k}}\bigr )+ \frac{1}{\gamma _{ {k}}}{\text {tr}}\left( \Lambda _{ {k}}^{-1}\beta _g^TS_{ {k}}\beta _g\right) \right) . \end{aligned}$$
    (8)
  • \({\textbf {Classification G-PROP}}\): We are looking for G orthogonal matrices \(\pmb {\beta }=(\beta _1,\ldots ,\beta _G)\), G shape matrices \(\pmb {\Lambda }=(\Lambda _1,\ldots ,\Lambda _G)\) and \( {\pmb u}=( {u}_1,\ldots , {u}_{ {K}})\in {\mathscr {H}}\) such that

    $$\begin{aligned}\Sigma _{ {k}} = \gamma _{ {k}} \beta _{ {u}_{ {k}}} \Lambda _{ {u}_{ {k}}} \beta _{ {u}_{ {k}}}^T \quad {k}=1,\ldots , { {K}}\,\end{aligned}$$

    where \(\pmb {\gamma }=(\gamma _1,\ldots ,\gamma _{ {K}})\) are the variable size parameters. The number of parameters is \( { {K}}+ G (d-1) + G d(d-1)/2\). In the situation of Table 2, the number of parameters if we take \(G=2\) is 94. To solve (7), we have to find a vector of indexes \( {\pmb {\hat{u}}}\), G orthogonal matrices \(\pmb {{{\hat{\beta }}}}\), G shape matrices \(\pmb {{{\hat{\Lambda }}}}\) and the variable size parameters \(\pmb {{{\hat{\gamma }}}}\) minimizing

    $$\begin{aligned}&( {\pmb {u}},\pmb {\Lambda },\pmb {\gamma }, \pmb {\beta }) \nonumber \\&\longmapsto \sum _{g=1}^G \sum _{ {k}: {u}_{ {k}}=g} n_{ {k}} \left( d\log \bigl (\gamma _{ {k}}\bigr )+ \frac{1}{\gamma _{ {k}}}{\text {tr}}\left( \Lambda _g^{-1}\beta _g^TS_{ {k}}\beta _g\right) \right) . \end{aligned}$$
    (9)
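For the G-CPC level, the inner maximization of the u-V step can be carried out in closed form: profiling (8) over \(\gamma _{ {k}}\) and \(\Lambda _{ {k}}\) for a fixed orientation \(\beta _g\) gives \(\gamma _{ {k}}={\text {det}}\bigl ({\text {diag}}(\beta _g^TS_{ {k}}\beta _g)\bigr )^{1/d}\), \(\Lambda _{ {k}}={\text {diag}}(\beta _g^TS_{ {k}}\beta _g)/\gamma _{ {k}}\), and a profiled criterion equal, up to a constant not depending on g, to \(n_{ {k}}\log {\text {det}}\bigl ({\text {diag}}(\beta _g^TS_{ {k}}\beta _g)\bigr )\), a standard fact for CPC likelihoods. A minimal R sketch of this assignment step, under these assumptions and with illustrative names (the full algorithms are in Section C.2 of the Appendix):

```r
# u-V step for Classification G-CPC: assign each S_k to the class g whose common
# orientation beta[[g]] minimizes n_k * log det( diag( t(beta_g) %*% S_k %*% beta_g ) )
uv_step_gcpc <- function(S, n, beta) {
  K <- length(S); G <- length(beta)
  crit <- sapply(seq_len(G), function(g)
    sapply(seq_len(K), function(k) {
      dkl <- diag(t(beta[[g]]) %*% S[[k]] %*% beta[[g]])  # diagonal of the rotated S_k
      n[k] * sum(log(dkl))                                # profiled criterion for (k, g)
    }))
  u     <- apply(crit, 1, which.min)                      # class assignment for each k
  gamma <- sapply(seq_len(K), function(k)                 # profiled size parameters
    prod(diag(t(beta[[u[k]]]) %*% S[[k]] %*% beta[[u[k]]]))^(1 / nrow(S[[k]])))
  list(u = u, gamma = gamma)
}
```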
Fig. 1

Classification of \(S_1,\ldots ,S_{100}\), represented by their 95% confidence ellipses. The first row shows the classes and the estimated common axes given by the 4-CPC model, and the second row shows the classes and the estimated proportional matrices given by the 4-PROP model

Explicit algorithms for finding the minimum of (8) and (9) are given in Section C.2 in the Appendix. The results given by both algorithms are illustrated in the following example, where we have randomly created 100 covariance matrices \(\Sigma _1,\ldots ,\Sigma _{100}\) according to:

$$\begin{aligned} \Sigma _{ {k}} = X\Bigl ({\text {U}}(\alpha ) {\text {Diag}}(1,Y) {\text {U}}(\alpha )^T\Bigr )\quad {k}=1,\ldots ,100 \, \end{aligned}$$

where \({\text {U}}(\alpha )\) represents the rotation of angle \(\alpha \), \({\text {Diag}}(1,Y)\) is the diagonal matrix with entries 1, Y, and \(X,Y,\alpha \) are uniformly distributed random variables with distributions:

$$\begin{aligned}X\sim U\bigl (0.5,2\bigr )\, \ Y\sim U\bigl (0,0.5\bigr ) \,\ \alpha \sim U\bigl (0,\pi \bigr ).\end{aligned}$$

For each \( {k}=1,\ldots ,100\), we have taken \(S_{ {k}}\) as the sample covariance matrix computed from 200 independent observations from a distribution \(N(0,\Sigma _{ {k}})\), and we have applied 4-CPC and 4-PROP to obtain two classifications of \(S_1,\ldots ,S_{100}\), one for each criterion. Figure 1 shows the 95% confidence ellipses representing the sample covariance matrices associated with each class (coloured lines), together with the estimates of the common axes or the common proportional matrix within each class (black lines).
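A minimal R sketch of this data-generating scheme (using MASS::mvrnorm; note that cov() uses the usual \(n-1\) denominator, while the paper's \(S_{ {k}}\) uses \(1/n_{ {k}}\)):

```r
library(MASS)
set.seed(1)
rot <- function(alpha) matrix(c(cos(alpha), sin(alpha), -sin(alpha), cos(alpha)), 2, 2)

# 100 random covariance matrices Sigma_k = X * U(alpha) Diag(1, Y) U(alpha)^T
Sigma <- lapply(1:100, function(k) {
  X <- runif(1, 0.5, 2); Y <- runif(1, 0, 0.5); alpha <- runif(1, 0, pi)
  X * rot(alpha) %*% diag(c(1, Y)) %*% t(rot(alpha))
})

# Sample covariance matrices S_k from 200 draws of N(0, Sigma_k)
S <- lapply(Sigma, function(Sig) cov(mvrnorm(200, mu = c(0, 0), Sigma = Sig)))
```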

3 Gaussian mixture models

In a Gaussian Mixture Model (GMM), data are assumed to be generated by a random vector with probability density function:

$$\begin{aligned}f(y)= \sum _{ {k}=1}^{ {K}} \pi _{ {k}} \phi (y\vert \mu _{ {k}},\Sigma _{ {k}})\,\end{aligned}$$

where \(0\le \pi _{ {k}} \le 1, \ \sum _{ {k}=1}^{ {K}} \pi _{ {k}}=1\). The idea of introducing the covariance matrix restrictions given by the parsimonious decomposition in the estimation of GMMs has become a common tool for statisticians, and the methods are implemented in many R packages. In this paper we use for comparison the results given by the package mclust (Fraley and Raftery 2002; Scrucca et al. 2016), although many other widely known packages exist (Rmixmod: Lebret et al. (2015); mixtools: Benaglia et al. (2009)). The aim of this section is to explore how we can fit GMMs in different contexts with the intermediate parsimonious models explained in Sect. 2, allowing the common part of the covariance matrices in the decomposition to vary between G classes. That is, with the same notation as in Sect. 2, we want to study GMMs with density function

$$\begin{aligned} f(y)= \sum _{g=1}^G\sum _{ {k}: {u}_{ {k}}=g} \pi _{ {k}} \phi \Bigl (y\big \vert \mu _{ {k}},\Sigma (V_{ {k}},C_g)\Bigr ) \, \end{aligned}$$
(10)

where \( {\pmb u}=( {u}_1,\ldots , {u}_{ {K}})\in {\mathscr {H}}\) is a fixed vector of indexes, \(\pmb V = (V_1,\ldots ,V_{ {K}})\) are the variable parameters, \(\pmb {C}=( C_1,\ldots , C_G)\) are the parameters shared within each of the G classes and \(\Sigma (V_{ {k}},C_g)\) is the covariance matrix with the parameters given by \((V_{ {k}},C_g)\). The following subsections exploit the potential of these particular GMMs for cluster analysis and discriminant analysis. A more general situation where only part of the labels are known could also be considered, following the same line as in Dean et al. (2006), but it will not be discussed in this work.

As already noted in the Introduction, the criterion we are going to use for selecting among all the estimated models is BIC (Bayesian Information Criterion), choosing the model with the highest value of the BIC approximation given by

$$\begin{aligned}\text {BIC} = 2\cdot \text {loglikelihood} - \log (N) \cdot p \,\end{aligned}$$

where N is the number of observations and p is the number of independent parameters to be estimated in the model. This criterion is used for the comparison of the intermediate models G-CPC and G-PROP with the fourteen parsimonious models estimated in the software R with the functions in the mclust package. In addition, within the framework of discriminant analysis, the quality of the classification given by the best models, in terms of BIC, is also compared using cross validation techniques.

3.1 Model-based clustering

Given \(y_1,\ldots ,y_N\) independent observations of a d-dimensional random vector, clustering methods based on fitting a GMM with K groups seek to maximize the log-likelihood function (2). From the fourteen possible restrictions considered in Celeux and Govaert (1995), we can compute fourteen different maximum likelihood solutions in which size, shape and orientation are common or not between the K covariance matrices. For a particular level \({\mathscr {M}}\) in Table 1, the fitting requires the maximization of the log-likelihood

$$\begin{aligned} L\Bigl (\pmb {\pi },\pmb {\mu } ,\pmb {V}&,C\Big \vert y_1,\ldots ,y_N\Bigr ) \\&=\sum _{{ {i}}=1}^N\log \Biggl (\sum _{ {k}=1}^{ {K}} \pi _{ {k}} \phi \Bigl (y_{ {i}}\big \vert \mu _{ {k}},\Sigma (V_{ {k}},C)\Bigr )\Biggr ) \ , \end{aligned}$$

where \(\pmb {\pi }=(\pi _1,\ldots ,\pi _{ {K}})\) are the weights, with \(0\le \pi _{ {k}}\le 1,\) \( \sum _{ {k}=1}^{ {K}} \pi _{ {k}}=1\), \(\pmb {\mu }=(\mu _1,\ldots ,\mu _{ {K}})\) the means, \(\pmb {V}=(V_1,\ldots ,V_{ {K}})\) the variable parameters and C the common parameters. Estimation under the parsimonious restriction is performed via the EM algorithm. In the GMM context, we can see the complete data as pairs \((y_{ {i}},z_{ {i}})\), where \(z_{ {i}}\) is an unobserved random vector such that \(z_{ {i},{ {k}}}=1\) if the observation \(y_{ {i}}\) comes from distribution k, and \(z_{ {i},{ {k}}}=0\) otherwise.

With the ideas of Sect. 2, we are going to fit Gaussian Mixture Models with parsimonious restrictions, but allowing the common parameters to vary between different classes. Assuming a parsimonious level of decomposition \({\mathscr {M}}\) and a number \(G\in \lbrace 1,\ldots , { {K}}\rbrace \) of classes, we are supposing that our data are independent observations from a distribution with density function (10). The log-likelihood function given a fixed vector of indexes \( {\pmb u}\) is

$$\begin{aligned} L_{ {\pmb u}}\Bigl (\pmb {\pi },\pmb {\mu }&,\pmb {V},\pmb {C}\Big \vert y_1,\ldots ,y_N\Bigr ) \\ {}&=\sum _{{ {i}}=1}^N\log \Biggl (\sum _{g=1}^G\sum _{ {k}: {u}_{ {k}}=g} \pi _{ {k}} \phi \Bigl (y_{ {i}}\big \vert \mu _{ {k}},\Sigma (V_{ {k}},C_g)\Bigr )\Biggr ) . \end{aligned}$$

For each \( {\pmb u}\in {\mathscr {H}}\), we can fit a model. In order to choose the best value for the vector of indexes \( {\pmb u}\), we should compare the BIC values given by the different estimated models. As the number of parameters is the same for every \( {\pmb u}\), the best value for \( {\pmb u}\) can be obtained by taking

$$\begin{aligned} { {\hat{ {\pmb u}}}} = \underset{ {\pmb {u}} \in {\mathscr {H}}}{\text {argmax}} \Bigl [ \underset{\pmb {\pi },\pmb {\mu },\pmb {V},\pmb {C}}{\text {max}} \quad L_{ {\pmb {u}}}\Bigl (\pmb {\pi },\pmb {\mu },\pmb {V},\pmb {C}\Big \vert y_1,\ldots ,y_N\Bigr ) \Bigr ]. \end{aligned}$$

In order to avoid the combinatorial problem of maximizing within \({\mathscr {H}}\), we treat \( {\pmb u}\) as a parameter and focus on the problem of maximizing

$$\begin{aligned} L\Bigl (\pmb {\pi },&\pmb {\mu }, {\pmb {u}},\pmb {V},\pmb {C}\Big \vert y_1,\ldots ,y_N\Bigr ) \nonumber \\&=\sum _{{ {i}}=1}^N\log \Biggl (\sum _{g=1}^G\sum _{ {k}: {u}_{ {k}}=g} \pi _{ {k}} \phi \Bigl (y_{ {i}}\big \vert \mu _{ {k}},\Sigma (V_{ {k}},C_g)\Bigr )\Biggr ) , \end{aligned}$$
(11)

that will be referred to as Clustering G-\(\pmb {{\mathscr {M}}}\). Therefore, given the unobserved variables \(z_{ {i},{ {k}}}\), for \( {k}=1,\ldots , {K}\) and \({ {i}}=1,\ldots ,N\), the complete log-likelihood is

$$\begin{aligned}&CL\Bigl (\pmb {\pi },\pmb {\mu } , {\pmb u}, \pmb {V},\pmb C\Big \vert y_1,\ldots ,y_N,z_{1,1},\ldots , z_{ {N,K}}\Bigr ) \nonumber \\&\quad =\sum _{{ {i}}=1}^N \left[ \sum _{g=1}^G\sum _{ {k}: {u}_{ {k}}=g} z_{ {i},{ {k}}} \log \Biggl ( \pi _{ {k}} \phi \Bigl (y_{ {i}}\big \vert \mu _{ {k}},\Sigma (V_{ {k}},C_g)\Bigr )\Biggr )\right] \ . \end{aligned}$$
(12)

The proposal of this section is to fit this model given a parsimonious level \({\mathscr {M}}\) and fixed values of K and \(G \in \lbrace 1,\ldots , { {K}}\rbrace \), introducing also constraints to avoid the unboundedness of the log-likelihood function (11). For this purpose, we introduce the determinant and shape constraints studied in García-Escudero et al. (2020). For \( {k}=1,\ldots , {K}\), denote by \((\lambda _{ {k},1},\ldots ,\lambda _{ {k},d})\) the diagonal elements of the shape matrix \(\Lambda _{ {k}}\) (which may coincide within classes). We impose K constraints controlling the shape of each group, in order to avoid solutions that are almost contained in a lower-dimensional subspace, and a size constraint in order to avoid clusters with a disproportionately small volume. Given \(c_{sh},c_{vol}\ge 1\), we impose:

$$\begin{aligned} \frac{\underset{l=1,\ldots ,d}{{\text {max}}}\lambda _{ {k},l}}{\underset{l=1,\ldots ,d}{{\text {min}}}\lambda _{ {k},l}}\le c_{sh}, \ {k}=1,\ldots , {K},\quad \frac{\underset{ {k}=1,\ldots , {K}}{{\text {max}}}\gamma _{ {k}}}{\underset{ {k}=1,\ldots , {K}}{{\text {min}}}\gamma _{ {k}}}\le c_{vol} \, \end{aligned}$$
(13)
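The constraints (13) are simple to verify for a given set of covariance matrices. A minimal R check (names are ours); in the actual algorithms the constraints are not merely checked but enforced, via the optimal truncation procedure of García-Escudero et al. (2020):

```r
# Check the shape and volume constraints (13) for a list of covariance matrices
check_constraints <- function(Sigma_list, c_sh, c_vol) {
  d     <- nrow(Sigma_list[[1]])
  gamma <- sapply(Sigma_list, function(S) det(S)^(1 / d))   # sizes gamma_k
  shape_ok <- sapply(seq_along(Sigma_list), function(k) {
    lambda <- eigen(Sigma_list[[k]], symmetric = TRUE)$values / gamma[k]  # shape eigenvalues
    max(lambda) / min(lambda) <= c_sh
  })
  vol_ok <- max(gamma) / min(gamma) <= c_vol
  list(shape = shape_ok, volume = vol_ok)
}
```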

Remark 1

With these restrictions, the theoretical problem of maximizing (11) is well defined. If Y is a random vector following a distribution \({\mathbb {P}}\), the problem consists in maximizing

$$\begin{aligned} \int \log \Biggl (\sum _{g=1}^G\sum _{ {k}: {u}_{ {k}}=g} \pi _{ {k}} \phi \Bigl (y\big \vert \mu _{ {k}},\Sigma (V_{ {k}},C_g)\Bigr )\Biggr ) \, d{\mathbb {P}}(y) \end{aligned}$$
(14)

with respect to \(\pmb {\pi },\pmb {\mu }, {\pmb {u}},\pmb {V},\pmb {C}\), defined as above, and verifying (13). If \({{\mathbb {P}}} _N\) stands for the empirical measure \({{\mathbb {P}}} _N=(1/ N) \sum _{ {i}=1}^N \delta _{\left\{ y_{ {i}}\right\} }\), by replacing \({{\mathbb {P}}} \) by \({{\mathbb {P}}} _N\), we recover the original sample problem of maximizing (11) under the determinant and shape constraints (13). This approach guarantees that the objective function is bounded, allowing results to be stated in terms of existence and consistence of the solutions (see Section A in the Appendix).

Now, we are going to give a sketch of the EM algorithm used for the estimation of these intermediate parsimonious clustering models, for each of the fourteen levels.

Clustering G-\(\pmb {{\mathscr {M}}: }\) Starting from an initial solution of the parameters \(\pmb {\pi ^{0}},\pmb {\mu ^{0}},\pmb { {u}^{0}}\), \(\pmb {V^{0}},\pmb {C^{0}}\), we have to repeat the following steps until convergence:

  • E step: Given the current values of the parameters \(\pmb {\pi ^{m}},\pmb {\mu ^{m}},\pmb { {u}^{m}}\), \(\pmb {V^{m}},\pmb {C^{m}}\), we compute the posterior probabilities (an R sketch of this step is given after these two steps)

    $$\begin{aligned} z_{ {i},{ {k}}} = \frac{\pi _{ {k}}^m \phi \Bigl (y_{ {i}}\vert \mu _{ {k}}^m,\Sigma \bigl (V_{ {k}}^m,C_{ {u}_{ {k}}}^m\bigr )\Bigr )}{\sum _{l=1}^{ {K}} \pi _l^m \phi \Bigl (y_{ {i}}\vert \mu _l^m,\Sigma \bigl (V_l^m,C_{ {u}_l}^m\bigr )\Bigr )} \end{aligned}$$
    (15)

    for \( {k}=1,\ldots , {K}, \ { {i}}=1,\ldots ,N\).

  • M step: In this step, we have to maximize (12) given the expected values \(\lbrace z_{ {i},{ {k}}} \rbrace _{ {i},{ {k}}}\). The optimal values for \(\pmb {\pi ^{m+1}},\pmb {\mu ^{m+1}}\) are given by (3). With these optimal values, if we denote \(S_{ {k}}= (1/n_{ {k}})\sum _{{ {i}}=1}^N z_{ {i},{ {k}}} (y_{ {i}}-\mu _{ {k}}^{m+1})(y_{ {i}}-\mu _{ {k}}^{m+1})^T\), then we have to find the values \(\pmb { {u}^{m+1}}\), \(\pmb {V^{m+1}},\pmb {C^{m+1}}\) verifying the determinant and shape constraints (13) maximizing

    $$\begin{aligned} (&{\pmb u}, \pmb {V},\pmb C) \longmapsto CL\Bigl (\pmb {\pi ^{ {m+1}}},\\&\quad \pmb {\mu ^{ {m+1}}} , {\pmb u},\pmb {V},\pmb C\Big \vert y_1,\ldots ,y_N,z_{1,1},\ldots ,z_{ {N,K}}\Bigr ) \ . \end{aligned}$$

    If we remove the determinant and shape constraints, the solution of this maximization coincides with the classification problem presented in Sect. 2 for the computed values of \(n_1,\ldots ,n_{ {K}}\) and \(S_1,\ldots ,S_{ {K}}\). A simple modification of that algorithm, computing at each step the optimal size- and shape-constrained parameters (instead of the unconstrained versions) with the optimal truncation algorithm presented in García-Escudero et al. (2020), allows the maximization to be completed. Determinant and shape constraints can be incorporated in the algorithms together with the parsimonious constraints following the lines developed in García-Escudero et al. (2022).
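The E step (15) is the usual one for Gaussian mixtures, with \(\Sigma _{ {k}}=\Sigma (V_{ {k}}^m,C_{ {u}_{ {k}}}^m)\). A minimal R sketch, assuming the package mvtnorm is available (names are ours):

```r
library(mvtnorm)

# E step (15): posterior probabilities z_ik
# y: N x d data; pi: weights; mu: K x d means; Sigma: list of K covariance matrices
e_step <- function(y, pi, mu, Sigma) {
  K <- length(pi)
  dens <- sapply(seq_len(K), function(k)
    pi[k] * dmvnorm(y, mean = mu[k, ], sigma = Sigma[[k]]))  # N x K matrix
  dens / rowSums(dens)                                       # normalize each row
}
```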

As already noted in Sect. 2, we keep only the clustering models G-CPC and G-PROP, the most interesting in terms of parameter reduction and interpretability. For these models, explicit algorithms are explained in Section C.3 in the Appendix. Now, we are going to illustrate the results of the algorithms in two simulation experiments:

  • Clustering G-CPC: In this example, we simulate \(n=100\) observations from each of 6 Gaussian distributions, with means \(\mu _1,\ldots ,\mu _6\) and covariance matrices verifying

    $$\begin{aligned} \Sigma _{ {k}} =&{\gamma }_{ {k}}\beta _1\Lambda _{ {k}} \beta _1^T, \quad {k}=1,2,3, \\ \Sigma _{ {k}} =&{\gamma }_{ {k}}\beta _2\Lambda _{ {k}} \beta _2^T ,\quad {k}=4,5,6 \ . \end{aligned}$$

    The first plot of Fig. 2 shows the 95% confidence ellipses of the six theoretical Gaussian distributions together with the 100 independent observations simulated from each distribution. The second plot represents the clusters created by the maximum likelihood solution for the 2-CPC model, taking \(c_{sh}=c_{vol}=100\). The numbers labeling the ellipses represent the class of covariance matrices sharing the orientation. The third plot represents the best solution estimated by mclust for \( {K}=6\), corresponding to the parsimonious model VEV, with equal shape and variable size and orientation. The BIC value of the 2-CPC model (31 d.f.) is \(-\)3937.08, whereas the best model VEV (30 d.f.) estimated with mclust has BIC value \(-\)3960.07. Therefore, the GMM estimated with the 2-CPC restriction has a higher BIC than all the parsimonious models. Finally, the number of observations assigned to clusters different from the original ones is 82 for the 2-CPC model and 91 for the VEV model.

  • Clustering G-PROP: In this example, we simulate \(n=100\) observations from each of 6 Gaussian distributions, with means \(\mu _1,\ldots ,\mu _6\) and covariance matrices verifying:

    $$\begin{aligned} \Sigma _{ {k}} =&{\gamma }_{ {k}} A_1, \quad {k}=1,2,3, \\ \Sigma _{ {k}} =&{\gamma }_{ {k}} A_2 ,\quad {k}=4,5,6 \ . \end{aligned}$$

    Figure 3 is analogous to Fig. 2, but for the proportionality case. The BIC value for the 2-PROP model (27 d.f.) with \(c_{sh}=c_{vol}=100\) is \(-\)3873.127, whereas the BIC value for the best model fitted by mclust is \(-\)3919.796, which corresponds to the unrestricted model VVV (35 d.f.). Now, the number of observations assigned to clusters different from the source groups is 64 for the 2-PROP model, while it is 71 for the VVV model.

Fig. 2

From left to right: 1. Theoretical Gaussian distributions and observations simulated from each distribution. 2. Solution estimated by clustering through 2-CPC model. 3. Best clustering solution estimated by mclust in terms of BIC

Fig. 3

From left to right: 1. Theoretical Gaussian distributions and observations simulated from each distribution. 2. Solution estimated by clustering through 2-PROP model. 3. Best clustering solution estimated by mclust in terms of BIC

Remark 2

Note that, by imposing appropriate constraints in the clustering problem, we can significantly decrease the number of parameters while keeping a good fit to the data. Figure 3 shows this effect. Moreover, constraints also have a clear interpretation in cluster analysis problems, since we are looking for groups that are forced to have a particular shape. Therefore, different constraints can lead to clusters with different shapes. This is what happens in Fig. 2, where introducing the right constraints makes the clusters created more similar to the original ones. Of course, in the absence of prior information, it is not possible to know the appropriate constraints in advance, and the most reasonable approach is to select a model according to a criterion, such as BIC, that penalizes the fit by the number of parameters.

To evaluate the sensitivity of BIC for detecting the true underlying model, we have used the models described in the two previous examples. Once a model and a particular sample size n (= 50, 100, 200) have been chosen, the simulation plan produces a sample containing n random elements generated from each \(N(\mu _{ {k}},\Sigma _{ {k}}),\ {k}=1,\ldots ,6\). We repeated every simulation plan 1000 times, comparing for every sample the BIC obtained for the underlying clustering model with that of the best parsimonious model estimated by mclust. Table 3 includes the proportions of times in which the 2-CPC or 2-PROP model improves on the best mclust model in terms of BIC for each value of n. Of course, the accuracy of the approach should depend on the dimension, the number of groups, the overlap between groups, and so on. However, even in the case of large overlap, as in the present examples, the proportions reported in Table 3 show that moderate values of n suffice to get very high proportions of success. Appendix B contains additional simulations supporting the suitability of BIC in this framework.

Table 3 Proportions of times in which clustering 2-CPC or 2-PROP model improves the best mclust model in terms of BIC, for each sample size n

3.2 Discriminant analysis

The parsimonious model introduced in Bensmail and Celeux (1996) for discriminant analysis has been developed in conjunction with model-based clustering. The R package mclust (Fraley and Raftery 2002; Scrucca et al. 2016) also includes functions for fitting these models, denoted by EDDA (Eigenvalue Decomposition Discriminant Analysis). In this context, given a parsimonious level \({\mathscr {M}}\) and a number G of classes, we can also consider fitting an intermediate model for each fixed \( {\pmb u}\in {\mathscr {H}}\), by maximizing the complete log-likelihood

$$\begin{aligned} C&L_{ {\pmb u}}\Bigl (\pmb {\pi },\pmb {\mu } , \pmb {V},\pmb C\Big \vert y_1,\ldots ,y_N,z_{1,1},\ldots ,z_{ {N,K}}\Bigr ) \nonumber \\&=\sum _{{ {i}}=1}^N \left[ \sum _{g=1}^G\sum _{ {k}: {u}_{ {k}}=g} z_{ {i},{ {k}}} \log \Biggl ( \pi _{ {k}} \phi \Bigl (y_{ {i}}\vert \mu _{ {k}},\Sigma (V_{ {k}},C_g)\Bigr )\Biggr )\right] \ . \end{aligned}$$
(16)

Model comparison is done through BIC, and consequently we could try to choose \( {\pmb u}\) maximizing the log-likelihood (11). However, given that the model fitting maximizes the complete log-likelihood (16), it is not unreasonable to look for the value of \( {\pmb u}\) maximizing (16). Proceeding in this manner, we can think of \( {\pmb u}\) as a parameter, and the problem consists in maximizing (12). Model estimation is simple using the model-based clustering algorithms: a single iteration of the M step computes the values of the parameters. A new set of observations can be classified by computing the posterior probabilities with formula (15) of the E step, and assigning each new observation to the group with the highest posterior probability. Since the groups are known, the complete log-likelihood (12) is bounded under mild conditions, and imposing eigenvalue constraints is not required, although it may be of interest in some examples with almost degenerate variables. To summarize the quality of the classification given by the best models (selected through BIC) in the different examples, other indicators based directly on classification errors are provided:

  • MM: Model Misclassification, or training error. Proportion of observations misclassified by the model fitted with all observations.

  • LOO: Leave One Out error.

  • CV(R,p): Cross-Validation error. Each observation is treated as labeled with probability p and as unlabeled with probability \(1-p\); we compute the proportion of unlabeled observations misclassified by the model fitted to the labeled observations. The indicator CV(R,p) is the mean of the proportions obtained in R repetitions of this process. When several classification methods are compared, the same R random partitions are used to compute the values of this indicator.
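A minimal R sketch of the CV(R,p) indicator, using an EDDA fit from mclust as the classifier standing in for any of the models compared (function and argument names are ours):

```r
library(mclust)

# CV(R, p): average misclassification rate over R random labeled/unlabeled splits
cv_rp <- function(X, labels, R = 100, p = 0.75) {
  errs <- replicate(R, {
    lab  <- runif(nrow(X)) < p                               # labeled with probability p
    fit  <- MclustDA(X[lab, ], labels[lab], modelType = "EDDA")
    pred <- predict(fit, newdata = X[!lab, ])$classification
    mean(pred != labels[!lab])                               # error on the unlabeled part
  })
  mean(errs)
}
```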

In line with the previous section, only the discriminant analysis models G-CPC and G-PROP are considered. Tables 4 and 5 show the results of applying these models to the simulation examples of Figs. 2 and 3. In both situations, the classification obtained with our model slightly improves on that given by mclust.

Table 4 Classification results for data in Fig. 2 for the best mclust model and 2-CPC
Table 5 Classification results for data in Fig. 3 for the best mclust model and 2-PROP

As we did in the clustering setting, in order to evaluate the sensitivity of BIC for the detection of the true underlying model, simulations have been repeated 1000 times, for each sample size n (=30, 50, 100, 200). Table 6 shows the proportions of times in which 2-CPC or 2-PROP model improves the best mclust model in terms of BIC for each value of n.

Table 6 Proportions of times in which discriminant analysis 2-CPC or 2-PROP model improves the best mclust model in terms of BIC, for each sample size n

Remark 3

In discriminant analysis, the weights \(\pmb \pi =(\pi _1,\ldots ,\pi _{ {K}})\) might not be considered as parameters. Model-based methods assume that observations from the \( {k}^{th}\) group follow a distribution with density function \(f(\cdot ,\theta _{ {k}})\). If \(\pi _{ {k}}\) is the proportion of observations of group k, the classifier minimizing the expected misclassification rate is the Bayes classifier, which assigns an observation y to the group with the highest posterior probability

$$\begin{aligned} P\bigl (y \in \text {Group } {k}\bigr ) = \frac{\pi _{ {k}} f(y,\theta _{ {k}})}{\sum _{l=1}^{ {K}} \pi _l f(y,\theta _l)}. \end{aligned}$$
(17)

The values of \(\pmb \pi ,\theta _1,\ldots ,\theta _{ {K}}\) are usually unknown, and the classification is performed with estimations \(\pmb {{{\hat{\pi }}}},{\hat{\theta }}_1,\ldots ,{\hat{\theta }}_{ {K}}\). Whereas \({\hat{\theta }}_1,\ldots ,{\hat{\theta }}_{ {K}}\) are always parameters estimated from the sample, the values of \(\pmb {{{\hat{\pi }}}}\) may be seen as part of the classification rule, if we think that they represent a characteristic of a particular sample we are classifying, or real parameters, if we assume that the observations \((z_{ {i}},y_{ {i}})\) arise from a GMM such that

$$\begin{aligned} z_{ {i}} \sim {\text {mult}}\Bigl (1,\lbrace 1,\ldots , { {K}}\rbrace ,&\lbrace \pi _1,\ldots , \pi _{ {K}} \rbrace \Bigr )\\ y_{ {i}} \big \vert z_{{ {i}}} \sim f\bigl (&\cdot , \theta _{z_{ {i}}}\bigr ) \ , \end{aligned}$$

where mult() denotes the multinomial distribution, and the weights satisfy \(0\le \pi _{ {k}}\le 1\), \(\sum _{ {k}=1}^{ {K}} \pi _{ {k}} =1\). In accordance with mclust, for model comparison we do not consider \(\pmb {\pi }\) as parameters, although including them would only add a constant to all the BIC values computed. However, in order to define the theoretical problem, the situation where \(\pmb {\pi }\) is considered a parameter is more interesting. If (Z, Y) is a random vector following a distribution \({\mathbb {P}}\) in \(\lbrace 1,\ldots , { {K}}\rbrace \times {\mathbb {R}}^d\), the theoretical problem consists in maximizing

$$\begin{aligned} \int \sum _{g=1}^G\sum _{ {k}: {u}_{ {k}}=g} {\text {I}}(z= {k}) \log \Bigl ( \pi _{ {k}} \phi \bigl (y\big \vert \mu _{ {k}},\Sigma (V_{ {k}},C_g)\bigr )\Bigr ) \, d{\mathbb {P}}(z,y) \end{aligned}$$
(18)

with respect to the parameters \(\pmb \pi , \pmb \mu , {\pmb u},\pmb V, \pmb C\). Given N observations \((z_{ {i}},y_{ {i}}), \ { {i}}=1,\ldots ,N\) of \({\mathbb {P}}\), the problem of maximizing (18) agrees with the sample problem presented above the remark when taking the empirical measure \({\mathbb {P}}_N\), with the obvious relation \(z_{ {i},{ {k}}}={\text {I}}(z_{ {i}}= {k})\). Arguments like those presented in Section A in the Appendix for the cluster analysis problem would give existence and consistency of solutions also in this setting.

4 Real data examples

To illustrate the usefulness of the G-CPC and G-PROP models in both settings, we show four real data examples in which our models outperform, in terms of BIC, the best parsimonious models fitted by mclust. The first two examples are intended to illustrate the methods on simple and well-known data sets, while the last two involve greater complexity.

4.1 Cluster analysis: IRIS

Here we revisit the famous Iris data set, which consists of observations of four features (length and width of sepals and petals) on 50 samples of each of three species of Iris (setosa, versicolor and virginica), and is available in the base distribution of R. We apply the model-based clustering functions of the package mclust, fixing the number of clusters at 3, to obtain the best parsimonious model in terms of BIC value. Table 7 compares this model with the models 2-CPC and 2-PROP, fitted with \(c_{sh}=c_{vol}=100\). With some abuse of notation, we include in the table the Model Misclassification (MM), representing here the number of observations assigned to clusters different from the original groups, after matching the clusters created with the original groups in the natural way.
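A minimal R sketch of the mclust baseline used in Table 7 (our models are not part of mclust, so only the parsimonious fit is shown):

```r
library(mclust)

data(iris)
fit <- Mclust(iris[, 1:4], G = 3)        # best parsimonious model for 3 clusters, by BIC
fit$modelName                            # selected parsimonious level
fit$bic                                  # its BIC value (higher is better)
table(fit$classification, iris$Species)  # clusters vs. the true species
```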

Table 7 Iris data solutions for clustering with mclust, 2-CPC and 2-PROP
Fig. 4

Clustering obtained from the 2-PROP model in the Iris data set. Colors represent the clusters created. The ellipses are the contours of the estimated mixture densities, grouped into the classes given by the indexes in black. Point shapes represent the original groups. Observations lying in clusters different from their original groups are marked with red circles

From Table 7 we can see that the best clustering model in terms of BIC is the 2-PROP model. Figure 4 shows the clusters created by this model. These clusters coincide with the real groups, except for four observations. This example also shows the advantage of the intermediate models G-CPC and G-PROP in terms of interpretability. In the solution found with G-PROP, the covariance matrices associated with two of the three clusters are proportional. Each cluster represents a group of individuals with similar features which, in the absence of labels, we could see as a subclassification of the Iris flowers. In the subclassification associated with the groups with proportional covariance matrices, both groups share not only the principal directions, but also the same proportions of variability along those directions. In many biological studies, principal components are of great importance. When working with phenotypic variables, principal components may be interpreted as “growing directions” (see e.g. Thorpe 1983). From the estimated model, we can conclude that in the Iris data it is reasonable to think that there are three groups, two of them with a similar “growing pattern”, since not only are the principal directions the same, but the shape is also common. This biological interpretation will become even more evident in the following example.

4.2 Discriminant analysis: CRABS

The data set consists of measurements of 5 features on a set of 200 crabs from two species, orange and blue, and from both sexes, and it is available in the R package MASS (Venables and Ripley 2002). For each species and sex (labeled OF, OM, BF, BM) there are 50 observations. The variables are measurements in mm of the following features: frontal lobe (FL), rear width (RW), carapace length (CL), carapace width (CW) and body depth (BD). Applying the classification function of the mclust library, the best parsimonious model in terms of BIC is EEV. Table 8 shows the results for the EEV model, together with the discriminant analysis models 2-CPC and 2-PROP, with \(c_{sh}=c_{vol}=100000\) (with these values, the solutions agree with the unrestricted solutions).
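A minimal R sketch of the mclust (EDDA) baseline in Table 8, using the four species-by-sex groups as labels:

```r
library(mclust)
library(MASS)

data(crabs)
X   <- crabs[, c("FL", "RW", "CL", "CW", "BD")]   # the five morphological measurements
grp <- interaction(crabs$sp, crabs$sex)           # the four species-by-sex groups
fit <- MclustDA(X, grp, modelType = "EDDA")       # parsimonious discriminant analysis
summary(fit)                                      # selected model and training error
```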

Table 8 Crabs data solutions for discriminant analysis with mclust, 2-CPC and 2-PROP

The results show that the comparisons given by BIC can differ from those obtained by cross-validation techniques, partly because BIC mainly measures the fit of the model to the data. However, in the parsimonious context, model selection is usually performed via BIC, in order to avoid the very time-consuming process of evaluating every possible model with cross-validation techniques.

Figure 1 in the Online Supplementary Figures represents the solution estimated by the 2-PROP model. The solution given by this model allows for a better biological interpretation than the one given by the parsimonious model EEV, where the orientation varies across the 4 groups, making the comparison quite complex. In the 2-PROP model, the groups of males of both species share proportional matrices, and the same is true for the females. Returning to the biological interpretation of the previous example, under the 2-PROP model we can state that crabs of the same sex have the same “growing pattern”, despite being from different species.

4.3 Cluster analysis: gene expression cancer

In this example, we work with the Gene expression cancer RNA-Seq Data Set, which can be downloaded from the UCI Machine Learning Repository. This data set is part of the data collected by “The Cancer Genome Atlas Pan-Cancer analysis project” (Weinstein et al. 2013). The considered data set consists of a random extraction of gene expressions of patients having different types of tumor: BRCA (breast carcinoma), KIRC (kidney renal clear-cell carcinoma), COAD (colon adenocarcinoma), LUAD (lung adenocarcinoma) and PRAD (prostate adenocarcinoma). In total, the data set contains the information of 801 patients, and for each patient we have 20531 variables, which are the RNA sequencing values of 20531 genes. To reduce the dimensionality and to apply model-based clustering algorithms, we have removed the genes with almost zero sum of squares (\(< 10^{-5}\)) and applied PCA to the remaining genes. We have taken the first 14 principal components, the minimum number of components retaining more than 50 \(\%\) of the total variance. Applying model-based clustering methods looking for 5 groups on this reduced data set, we have found that 3-CPC, fitted with \(c_{sh}=c_{vol}=1000\), improves on the BIC value obtained by the best parsimonious model estimated by mclust. The results obtained from 3-CPC, presented in Table 9, also improve substantially on the assignment error made by mclust. Figure 2 in the Online Supplementary Figures shows the projection of the solution obtained by 3-CPC onto the first six principal components computed in the preprocessing steps.
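A minimal R sketch of the preprocessing described above, assuming the expression matrix has already been loaded into a numeric matrix `expr` with the 801 patients in rows and the 20531 genes in columns (interpreting “sum of squares” as the centered sum of squares is our assumption):

```r
# Remove nearly constant genes and reduce the dimension with PCA
ss   <- apply(expr, 2, function(x) sum((x - mean(x))^2))      # per-gene sum of squares
expr <- expr[, ss >= 1e-5]                                    # drop genes with almost zero variation
pc   <- prcomp(expr)                                          # principal component analysis
npc  <- which(cumsum(pc$sdev^2) / sum(pc$sdev^2) > 0.5)[1]    # smallest number of PCs above 50% (14 here)
Y    <- pc$x[, 1:npc]                                         # reduced data used for clustering
```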

Table 9 Cancer data solutions for clustering with mclust and 3-CPC

4.4 Discriminant analysis: Italian olive oil

Table 10 Olive oil discriminant analysis with mclust, 2-CPC, 3-CPC and 3-PROP

The data set contains information about the composition, in percentages, of eight fatty acids (palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic and eicosenoic) found in the lipid fraction of 572 Italian olive oils, and it is available in the R package pdfCluster (Azzalini and Menardi 2014). The olive oils are labeled according to a two-level classification: 9 different areas that are in turn grouped into three regions.

  • SOUTH: Apulia North, Calabria, Apulia South, Sicily.

  • SARDINIA: Sardinia inland, Sardinia coast.

  • CENTRE-NORTH: Umbria, Liguria east, Liguria west.

In this example, we have evaluated the performance of different discriminant analysis models for the problem of classifying the olive oils by area. The best parsimonious model fitted with mclust is the VVE model, with variable size and shape and equal orientation. Note that, due to the dimension \(d=8\), there is a significant difference in the number of parameters between models with common and variable orientation. Therefore, BIC selection will tend to choose models with common orientation, despite the fact that this hypothesis might not be very accurate. This suggests that intermediate models could be of great interest also in this example. Given that the last variable, eicosenoic, is almost degenerate in some areas, we fit the models with \(c_{sh}=c_{vol}=10000\), and the shape constraints are active in some groups. We have found 3 different intermediate models improving on the BIC value obtained with mclust. Results are displayed in Table 10.

The best solution found in terms of BIC is given by the 3-CPC model, which is also the solution with the best values for the other indicators. The classification of the areas in classes given in this solution is:

  • CLASS 1: Umbria.

  • CLASS 2: Apulia North, Calabria, Apulia South, Sicily.

  • CLASS 3: Sardinia inland, Sardinia coast, Liguria east, Liguria west.

Note that the areas in class 2 exactly coincide with the areas of the South region. This classification coincides with the separation into classes given by 3-PROP, whereas the 2-PROP model grouped class 1 and class 3 together. These facts support the claim that our intermediate models have been able to exploit the apparent difference between the covariance structure of the South region and that of the others. When we look for a three-class separation, instead of splitting the areas from the Centre-North and Sardinia into these two regions, all Centre-North and Sardinia areas are grouped together, except Umbria, which forms a group on its own. Figure 3 in the Online Supplementary Figures represents the solution in the principal components of the group Umbria, where we can appreciate the characteristics of this area. The plot corresponding to the second and third variables shows clear differences in some of its principal components. Additionally, we can see that it is also the area with the least variability in many directions. In conclusion, the variability of the olive oils from this area clearly behaves differently. This could be related to the geographical situation of Umbria (the only non-insular and non-coastal area under consideration).

5 Conclusions and further directions

Cluster analysis of structured data opens up interesting research prospects. This fact is widely known and exploited in applications where the data themselves share some common structure; clustering techniques are, for instance, a key tool in functional data analysis. More recently, the underlying structures of the data have increased in complexity, leading, for example, to considering probability distributions as data and to using innovative metrics, such as the earth-mover or Wasserstein distances. This configuration has been used in cluster analysis, for example, in del Barrio et al. (2019), from a classical perspective, but also including new perspectives: meta-analysis of procedures, aggregation facilities, and so on. Nevertheless, to the best of our knowledge, this is the first occasion in which a clustering procedure is used as a selection step (of an intermediate model) in an estimation problem. Our proposal allows improvements in the estimation process and, arguably, often a gain in the interpretability of the estimation, thanks to the chosen framework: classification through the Gaussian Mixture Model.

The presented methodology enhances the so-called parsimonious model, leading to the inclusion of intermediate models. These are linked to geometric considerations on the ellipsoids associated with the covariance matrices of the underlying populations that compose the mixture. Such considerations are precisely the essence of the parsimonious model. The intermediate models arise from clustering covariance matrices, considered as structured data, using a similarity measure based on the likelihood. Clustering these objects through other similarities could be appropriate when looking for tools with different goals. In particular, we emphasize the possibility of clustering based on metrics like the Bures–Wasserstein distance. The role played here by BIC would have to be tested in the corresponding configurations or, alternatively, replaced by appropriate penalties for choosing between other hierarchical models.

Feasibility of the proposal is an essential requirement for a serious assessment of a statistical tool. The algorithms considered in the paper are simple adaptations of the Classification Expectation Maximization algorithm, but we think that they could still be improved. We will pursue this challenge, also looking for feasible computations for similarities associated with new pre-established objectives.

In summary, throughout the paper we have used clustering to explore similarities between groups according to predetermined patterns. In this wider setup, clustering is not a goal in itself; it can be an important tool for specialized analyses.

6 Supplementary material

Supplementary figures: Online document with additional graphs for the real data examples. Repository: Github repository containing the R scripts with the algorithms and workflow necessary to reproduce the results of this work. Simulation data of the examples are also included. (https://github.com/rvitores/ImprovingModelChoice).