Improving Model Choice in Classification: An Approach Based on Clustering of Covariance Matrices

This work introduces a refinement of the Parsimonious Model for fitting a Gaussian Mixture. The improvement is based on the consideration of clusters of the involved covariance matrices according to a criterion, such as sharing Principal Directions. This and other similarity criteria that arise from the spectral decomposition of a matrix are the bases of the Parsimonious Model. We show that such groupings of covariance matrices can be achieved through simple modifications of the CEM (Classification Expectation Maximization) algorithm. Our approach leads to propose Gaussian Mixture Models for model-based clustering and discriminant analysis, in which covariance matrices are clustered according to a parsimonious criterion, creating intermediate steps between the fourteen widely known parsimonious models. The added versatility not only allows us to obtain models with fewer parameters for fitting the data, but also provides greater interpretability. We show its usefulness for model-based clustering and discriminant analysis, providing algorithms to find approximate solutions verifying suitable size, shape and orientation constraints, and applying them to both simulation and real data examples.


Introduction
In this paper we introduce methodological applications arising from the cluster analysis of covariance matrices. Throughout, we will show that appropriate clustering criteria on these objects provide useful tools in the analysis of classic problems in Multivariate Analysis. The chosen framework is that of multivariate classification under a Gaussian Mixture Model, a setting where a suitable reduction of the involved parameters is a fundamental goal leading to the Parsimonious Model. We focus on this hierarchized model, designed to explain data with a minimum number of parameters, by introducing intermediate categories associated with clusters of covariance matrices.

(email: david.rodriguez.vitores@uva.es. The research leading to these results received funding from MCIN/AEI/10.13039/501100011033/FEDER under Grant Agreement No PID2021-128314NB-I00.)
Gaussian Mixture Model approaches to discriminant and cluster analysis are well-established and powerful tools in multivariate statistics. For a fixed number K, both methods aim to fit K multivariate Gaussian distributed components to a data set in R^d, with the key difference that the labels providing the source group of the data are known (supervised classification) or unknown (unsupervised classification). In the supervised problem, we handle a data set with N observations y_1, ..., y_N on R^d and associated labels z_{i,k}, i = 1, ..., N, k = 1, ..., K, where z_{i,k} = 1 if the observation y_i belongs to group k and 0 otherwise. Denoting by φ(·|μ, Σ) the density of a multivariate Gaussian distribution on R^d with mean μ and covariance matrix Σ, we seek to maximize the complete log-likelihood function (1) with respect to the weights π = (π_1, ..., π_K), with 0 ≤ π_k ≤ 1 and Σ_{k=1}^K π_k = 1, the means μ = (μ_1, ..., μ_K) and the covariance matrices Σ = (Σ_1, ..., Σ_K). In the unsupervised problem the labels z_{i,k} are unknown, and fitting the model involves the maximization of the log-likelihood function with respect to the same parameters. This maximization is more complex, and it is usually performed via the EM algorithm [Dempster et al., 1977], in which the following two steps are repeated iteratively: the E step, which consists in computing the expected values of the unobserved variables z_{i,k} given the current parameters, and the M step, in which we look for the parameters maximizing the complete log-likelihood (1) for the values z_{i,k} computed in the E step. Therefore, both model-based techniques require the maximization of (1), for which optimal values of the weights and the means are easily computed. With these optimal values, if we denote by n_k and S_k the resulting group sizes and sample covariance matrices, the problem of maximizing (1) with respect to Σ_1, ..., Σ_K is equivalent to the problem of maximizing (4). For even moderate dimension d, the large number of involved parameters in relation to the size of the data set could result in poor behavior of standard unrestricted methods. In order to improve the solutions, regularization techniques are often invoked.
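The closed-form E and M updates described above can be sketched as follows. This is a minimal NumPy illustration of the standard responsibility and weighted-moment formulas the text refers to as (3), not the authors' implementation; the function name and interface are ours.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(Y, pi, mu, Sigma):
    """One EM iteration for a Gaussian mixture (illustrative sketch).

    Y: (N, d) data; pi: (K,) weights; mu: (K, d) means;
    Sigma: (K, d, d) covariance matrices.
    Returns updated (pi, mu, S, n), where S[k] is the weighted sample
    covariance of group k and n[k] its effective size."""
    N, d = Y.shape
    K = len(pi)
    # E step: posterior probabilities z_{i,k} (responsibilities)
    dens = np.column_stack([pi[k] * multivariate_normal.pdf(Y, mu[k], Sigma[k])
                            for k in range(K)])
    z = dens / dens.sum(axis=1, keepdims=True)
    # M step: closed-form updates for weights and means
    n = z.sum(axis=0)                    # effective group sizes n_k
    pi_new = n / N
    mu_new = (z.T @ Y) / n[:, None]
    # Weighted scatter matrices S_k, the input of the covariance problem (4)
    S = np.empty((K, d, d))
    for k in range(K):
        R = Y - mu_new[k]
        S[k] = (z[:, k, None] * R).T @ R / n[k]
    return pi_new, mu_new, S, n
```

Iterating this step to convergence, with the covariance update dictated by the chosen parsimonious constraint, yields the fitting procedure described in the text.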
In particular, many authors have proposed estimating the maximum likelihood parameters under some additional constraints on the covariance matrices Σ_1, ..., Σ_K, which leads us to solve the maximization of (4) under these constraints.
Among these proposals, a prominent place is occupied by the so-called Parsimonious Model, a broad set of hierarchized constraints capable of adapting to conceptual situations that may occur in practice.
A common practice in multivariate statistics consists in assuming that covariance matrices share a common part of their structure. For example, if Σ_1 = ... = Σ_K = I_d, the clustering method described in (2) gives just k-means. If we assume common covariance matrices Σ_1 = ... = Σ_K = Σ, the procedure coincides with linear discriminant analysis (LDA) in the supervised case (1), and with the method proposed in Friedman and Rubin [1967] in the unsupervised case (2). The general theory organizing these relationships between covariance matrices is based on the spectral decomposition, beginning with the analysis of Common Principal Components [Flury, 1984, 1988]. In the discriminant analysis setting, the use of the spectral decomposition was first proposed in Flury et al. [1994], and in the clustering setting in Banfield and Raftery [1993]. The term "Parsimonious Model" and the fourteen levels given in Table 1 were introduced in Celeux and Govaert [1995] for the clustering setting and later, in Bensmail and Celeux [1996], for the discriminant setup. Given a positive definite covariance matrix Σ_k, the spectral decomposition of reference is Σ_k = γ_k β_k Λ_k β_k^T, where γ_k = det(Σ_k)^{1/d} > 0 governs the size of the group, Λ_k is a diagonal matrix with positive entries and determinant equal to 1 that controls the shape, and β_k is an orthogonal matrix that controls the orientation. Given K covariance matrices Σ_1, ..., Σ_K, the spectral decomposition enables us to establish the fourteen different parsimonious levels in Table 1, allowing differences or not in the parameters associated to size, shape and orientation. To fit a Gaussian Mixture Model under a parsimonious level M in Table 1, the maximization of (4) is carried out under the corresponding constraints, leading to equation (5). The aim of this paper is to introduce a generalization of equation (5) that allows us to give a likelihood-based classification associated to intermediate parsimonious levels. Let G ∈ {1, ..., K} and u = (u_1, ..., u_K) be any vector in {1, ..., G}^K.
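The size-shape-orientation decomposition just described can be computed directly from an eigendecomposition. The following sketch assumes the conventions stated above, namely γ_k = det(Σ_k)^{1/d} and det(Λ_k) = 1; the function name is ours.

```python
import numpy as np

def parsimonious_decomposition(Sigma):
    """Split a covariance matrix into size, shape and orientation,
    Sigma = gamma * beta @ Lambda @ beta.T  (illustrative sketch).

    gamma = det(Sigma)^(1/d) > 0 is the size, Lambda a diagonal shape
    matrix with det(Lambda) = 1, beta an orthogonal orientation matrix."""
    d = Sigma.shape[0]
    eigvals, beta = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
    gamma = np.prod(eigvals) ** (1.0 / d)   # det(Sigma)^(1/d)
    Lambda = np.diag(eigvals / gamma)       # normalized so that det = 1
    return gamma, Lambda, beta
```

Each of the fourteen parsimonious levels then amounts to constraining one or more of these three factors to be common across the K groups.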
Given a parsimonious level M, we can formulate a model in which we assume that the theoretical covariance matrices Σ_1, ..., Σ_K verify the parsimonious level M within each of the G classes defined by u.
For instance, let K = 7, G = 3, M = VVE and take u = (1, 1, 2, 3, 1, 2, 1). This implies that Σ_1, Σ_2, Σ_5, Σ_7 share a common orientation, as do Σ_3 and Σ_6, while Σ_4 forms a class of its own. Following (5), the estimation of the original covariance matrices involves maximizing (4) within M_u, the set of covariance matrices satisfying {Σ_k : u_k = g} ∈ M for all g = 1, ..., G. Using the maximized log-likelihood as a measure of the appropriateness of u, the optimal û would provide a classification for S_1, ..., S_K according to the level M. Precise definitions will be provided in Section 2. We will present an iterative procedure to simultaneously compute the optimal classification and the covariance matrix estimators through the modification of equation (5) given by (6). Solving this equation will allow us to fit Gaussian Mixture Models with intermediate parsimonious levels, in which the common parameters of a parsimonious level are shared within each of the G classes given by the vector of indexes û, but vary between the different classes. In the previous example, we obtain three classes of covariance matrices that share their principal directions within each class, resulting in a better interpretation of the final classification and allowing a considerable reduction of the number of parameters to be estimated. We will use these ideas for fitting Gaussian Mixture Models in discriminant analysis and cluster analysis.
To avoid unboundedness of the objective function in the clustering framework, we will impose the determinant and shape constraints of García-Escudero et al. [2020], which are fully implemented in the MATLAB toolbox FSDA [Riani et al., 2012]. We will analyze some examples where the proposed models fit the data with fewer parameters and more interpretability, comparing favorably with the 14 parsimonious models. We point out that, as is becoming usual in the literature, to carry out the comparisons between different models we will use the Bayesian Information Criterion (BIC). This applies to all examples considered in the text. It has been noticed by many authors that BIC selection works properly in model-based clustering, as well as in discriminant analysis. Fraley and Raftery [2002] includes a detailed justification for the use of BIC, based on previous references. A summary of the comparison of BIC with other techniques for model selection can also be found in Biernacki and Govaert [1999].
The paper is organized as follows. Section 2 approaches the problem of the parsimonious classification of covariance matrices given by equation (6), focusing on its computation for the most interesting restrictions in terms of dimensionality reduction and interpretability. Throughout, we will only work with models based on the parsimonious levels of proportionality (VEE) and common principal components (VVE), although the extension to other levels is straightforward. Section 3 applies the previous theory to the estimation of Gaussian Mixture Models in cluster analysis and discriminant analysis, including some simulation examples for their illustration. Section 4 includes real data examples, where we will see the gain in interpretability that can arise from these solutions. Some conclusions are outlined in Section 5. Finally, Appendix A includes theoretical results, Appendix B provides some additional simulation examples and Appendix C explains technical details about the algorithms. Additional graphical material is provided in the Online Supplementary Figures document.

Parsimonious Classification of Covariance Matrices
Given n_1, ..., n_K independent observations from K groups with different distributions, and S_1, ..., S_K the sample covariance matrices, a group classification may be provided according to different similarity criteria. In the general case, given a similarity criterion f depending on the sample covariance matrices and the sample sizes, the problem of classifying K covariance matrices in G classes, 1 ≤ G ≤ K, typically consists in solving the corresponding optimization problem. In this work, we focus on the Gaussian case, proposing different similarity criteria based on the parsimonious levels that arise from the spectral decomposition of a covariance matrix.
Multivariate procedures based on parsimonious decompositions assume that the theoretical covariance matrices Σ_1, ..., Σ_K jointly verify one level M out of the fourteen in Table 1. To elaborate on this idea, we include now some useful notation. In a parsimonious model M, we write (Σ_1, ..., Σ_K) ∈ M if these matrices share some common parameters C, and they have variable parameters V_1, ..., V_K. We will denote by Σ(V_k, C) the covariance matrix with the size, shape and orientation parameters associated to (V_k, C). Therefore, under the parsimonious level M, we are assuming that Σ_k = Σ(V_k, C) for k = 1, ..., K. If the n_k observations of group k are independent and arise from a distribution N(μ_k, Σ_k), then n_k S_k follows a d-dimensional Wishart distribution with parameters n_k, Σ_k. Therefore, given the level of parsimony M, it is natural to consider the maximized log-likelihood under the level M as a similarity criterion for the covariance matrices. This allows us to measure their resemblance in the features associated to the common part of the decomposition in the theoretical model. Consequently, given a level of parsimony M, the covariance matrix classification problem in G classes consists in solving equation (7). In order to avoid the combinatorial problem of maximizing within H, denoting the variable parameters by V = (V_1, ..., V_K) and the common parameters by C = (C_1, ..., C_G), we focus on the problem of maximizing the function W, since the value u maximizing this function agrees with the optimal û in (7). This problem will be referred to as Classification G-M. From the expression of the d-dimensional Wishart density, we can see that maximizing W is equivalent to minimizing, with respect to the same parameters, the corresponding discrepancy function. The maximization can be achieved through a simple modification of the CEM algorithm (Classification Expectation Maximization, introduced in Celeux and Govaert [1992]), for any of the fourteen parsimonious levels. A sketch of the algorithm is presented here. Starting from an initial solution of the common parameters, which may be taken as the parameters of G different matrices S_k randomly chosen among S_1, ..., S_K, the m-th iteration consists of the following steps. In the classification step, we maximize with respect to the partition u and the variable parameters V: for each k = 1, ..., K, we compute the optimal variable parameters Ṽ_{k,g} for 1 ≤ g ≤ G, and assign k to the class attaining the largest criterion value. In the maximization step, the maximization can be done individually for each of the classes created, by maximizing for each g = 1, ..., G the corresponding function of the common parameters C_g. The maximization for each of the 14 parsimonious levels can be done, for instance, with the techniques in Celeux and Govaert [1995]. The methodology proposed therein for common orientation models uses modifications of the Flury algorithm [Flury and Gautschi, 1986]. However, for these models we will use the algorithms subsequently developed by Browne and McNicholas [2014b,a], often implemented in the software available for parsimonious model fitting, which allow more efficient estimation of the common orientation parameters.
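As a rough illustration of the CEM scheme just sketched, the following code implements a simplified Classification G-PROP: each class is described by a normalized matrix Γ_g with det(Γ_g) = 1, and Σ_k = γ_k Γ_{u_k}. This is a sketch under our own simplifications, not the authors' algorithm (which covers all fourteen levels and uses the estimators cited in the text); the function name, stopping rule and initialization are ours.

```python
import numpy as np

def classify_gprop(S, n, G, n_iter=50, seed=0):
    """CEM-style sketch of Classification G-PROP: assign each sample
    covariance S[k] to one of G classes with a common normalized matrix
    Gamma[g] per class (det = 1), and Sigma_k = gamma_k * Gamma[u_k]."""
    K, d, _ = S.shape
    rng = np.random.default_rng(seed)
    # Initialize common parameters from G randomly chosen S_k, normalized
    Gamma = S[rng.choice(K, G, replace=False)].copy()
    Gamma /= np.linalg.det(Gamma)[:, None, None] ** (1.0 / d)
    u = np.zeros(K, dtype=int)
    for _ in range(n_iter):
        # C step: the Wishart criterion for class g reduces to the
        # size estimate gamma_{k,g} = tr(Gamma_g^{-1} S_k) / d; pick the
        # class minimizing it
        gam = np.array([[np.trace(np.linalg.solve(Gamma[g], S[k])) / d
                         for g in range(G)] for k in range(K)])
        u = gam.argmin(axis=1)
        # M step: update each Gamma_g from its assigned matrices,
        # renormalizing to determinant one
        for g in range(G):
            idx = np.where(u == g)[0]
            if len(idx) == 0:
                continue
            M = sum(n[k] * S[k] / gam[k, g] for k in idx)
            Gamma[g] = M / np.linalg.det(M) ** (1.0 / d)
    return u, Gamma
```

With matrices that are exactly proportional within classes, the alternation typically recovers the true partition in a couple of iterations.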
For each of the fourteen parsimonious models, the variable parameters in the solution, V̂, may be computed as a function of the parameters (û, Ĉ), the sample covariance matrices S_1, ..., S_K and the sample sizes n_1, ..., n_K. Therefore, the function W could be written as W(u, C), and the maximization could be seen as a particular case of the coordinate descent algorithm explained in Bezdek et al. [1987].
As already noted, we focus on the development of the algorithm only for two particular (the most interesting) parsimonious levels. First of all, we are going to keep models flexible enough that the solution of (6), when taking G = K (no grouping is assumed), coincides with the unrestricted solution Σ̂_k = S_k. The first six models do not verify this condition. For the last eight models, the number of covariance parameters is δ_VOL + δ_SHAPE (d − 1) + δ_ORIENT d(d − 1)/2, where δ_VOL, δ_SHAPE and δ_ORIENT take the value 1 if the given parameter is assumed to be common, and K if it is assumed to be variable between groups. When d and K are large, the main source of variation in the number of parameters is related to considering common or variable orientation, followed by considering common or variable shape. For example, if d = 9, K = 6, the numbers of parameters related to each constraint are detailed in Table 2. Our primary motivation is exemplified through Table 2: to raise alternatives to the models with variable orientation. For that, we look for models with orientation varying in G classes, with 1 ≤ G ≤ K.
We consider the case where size and shape are variable across all groups (G different Common Principal Components, G-CPC) and also the case where the shape parameters are additionally common within each of the G classes (proportionality to G different matrices, G-PROP). Apart from the parameter reduction, these models can provide an easier interpretation of the variables involved in the problem, which is often a hard task in multidimensional problems with several groups. We keep the size variable, since it does not cause a major increase in the number of parameters, and it is easy to interpret. Therefore, the models we are considering are:
• Classification G-CPC: We look for G orthogonal matrices β = (β_1, ..., β_G) and a vector of indexes u = (u_1, ..., u_K) ∈ H such that Σ_k = γ_k β_{u_k} Λ_k β_{u_k}^T, where γ = (γ_1, ..., γ_K) and Λ = (Λ_1, ..., Λ_K) are the variable size and shape parameters. The number of parameters is K + K(d − 1) + Gd(d − 1)/2. In the situation of Table 2, taking G = 2 the number of parameters is 126, while allowing for variable orientation it is 270. To solve (7), we have to find a vector of indexes û, G orthogonal matrices β and variable parameters γ and Λ minimizing (8).
• Classification G-PROP: We look for G orthogonal matrices β = (β_1, ..., β_G), G shape matrices Λ = (Λ_1, ..., Λ_G) and a vector of indexes u ∈ H such that Σ_k = γ_k β_{u_k} Λ_{u_k} β_{u_k}^T, where γ = (γ_1, ..., γ_K) are the variable size parameters. The number of parameters is K + G(d − 1) + Gd(d − 1)/2. In the situation of Table 2, the number of parameters if we take G = 2 is 94. To solve (7), we have to find a vector of indexes û, G orthogonal matrices β, G shape matrices Λ and the variable size parameters γ minimizing (9).
Explicit algorithms for finding the minimum of (8) and (9) are given in Section C.2 in the Appendix. The results given by both algorithms are illustrated in the following example, where we have randomly created 100 covariance matrices Σ_1, ..., Σ_100 according to Σ_k = X_k U(α_k) Diag(1, Y_k) U(α_k)^T, where U(α) represents the rotation of angle α, Diag(1, Y) is the diagonal matrix with entries 1, Y, and X, Y, α are uniformly distributed random variables. For each k = 1, ..., 100, we have taken S_k as the sample covariance matrix computed from 200 independent observations from a distribution N(0, Σ_k), and we have applied 4-CPC and 4-PROP to obtain different classifications of S_1, ..., S_100. The partitions obtained by both methods allow us to classify the covariance matrices according to both criteria.
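A simulation of this kind can be sketched as follows. The uniform ranges for X, Y and α below are placeholders, since the exact distributions of the example are not reproduced here.

```python
import numpy as np

def random_sigma(rng):
    """Simulate one covariance matrix in the spirit of the example:
    Sigma = X * U(alpha) @ Diag(1, Y) @ U(alpha).T.
    The uniform ranges are hypothetical stand-ins for those in the text."""
    X = rng.uniform(1.0, 5.0)         # hypothetical size range
    Y = rng.uniform(2.0, 10.0)        # hypothetical shape range
    alpha = rng.uniform(0.0, np.pi)   # hypothetical angle range
    U = np.array([[np.cos(alpha), -np.sin(alpha)],
                  [np.sin(alpha),  np.cos(alpha)]])
    return X * U @ np.diag([1.0, Y]) @ U.T

rng = np.random.default_rng(1)
Sigmas = [random_sigma(rng) for _ in range(100)]
```

Sample covariance matrices S_k can then be drawn from N(0, Σ_k) observations and fed to the classification algorithms.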
Figure 1 shows the 95% confidence ellipses representing the sample covariance matrices associated to each class (coloured lines) together with the estimations of the common axes or the common proportional matrix within each class (black lines).

Gaussian Mixture Models
In a Gaussian Mixture Model (GMM), data are assumed to be generated by a random vector whose probability density function is a weighted combination of Gaussian components. The idea of introducing covariance matrix restrictions given by the parsimonious decomposition in the estimation of GMMs has become a common tool for statisticians, and methods are implemented in R in many packages. In this paper we use for the comparison the results given by the package mclust [Fraley and Raftery, 2002, Scrucca et al., 2016], although there exist many other widely known packages (Rmixmod: Lebret et al. [2015]; mixtools: Benaglia et al. [2009]). The aim of this section is to explore how we can fit GMMs in different contexts with the intermediate parsimonious models explained in Section 2, allowing the common part of the covariance matrices in the decomposition to vary between G classes. That is, with the same notation as in Section 2, we want to study GMMs with density function (10), where C = (C_1, ..., C_G) are the common parameters among classes and Σ(V_k, C_g) is the covariance matrix with the parameters given by (V_k, C_g). The following subsections exploit the potential of these particular GMMs for cluster analysis and discriminant analysis.
A more general situation where only part of the labels are known could also be considered, following the same line as in Dean et al. [2006], but it will not be discussed in this work.
As already noted in the Introduction, the criterion we are going to use for model selection among all the estimated models is BIC (Bayesian Information Criterion), choosing the model with the highest value of the BIC approximation 2·ℓ(θ̂) − p·log N, where ℓ(θ̂) is the maximized log-likelihood, N is the number of observations and p is the number of independent parameters to be estimated in the model. This criterion is used for the comparison of the intermediate models G-CPC and G-PROP with the fourteen parsimonious models estimated in R with the functions in the mclust package. In addition, within the framework of discriminant analysis, the quality of the classification given by the best models, in terms of BIC, is also compared using cross-validation techniques.
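The BIC convention used here (higher is better) and the parameter counts of the two intermediate models given in Section 2 can be written as small helpers. These simply mirror the formulas quoted in the text; the function names are ours.

```python
import numpy as np

def bic(loglik, p, N):
    """BIC approximation used for model comparison in the text:
    2 * maximized log-likelihood minus p * log(N); higher is better
    under this sign convention."""
    return 2.0 * loglik - p * np.log(N)

def n_params_gcpc(K, G, d):
    """Covariance parameters of Classification G-CPC (from the text):
    K sizes + K shapes (d - 1 free entries each) + G orientations."""
    return K + K * (d - 1) + G * d * (d - 1) // 2

def n_params_gprop(K, G, d):
    """Covariance parameters of Classification G-PROP (from the text):
    K sizes + G shapes + G orientations."""
    return K + G * (d - 1) + G * d * (d - 1) // 2
```

For the Table 2 situation (d = 9, K = 6, G = 2) these counts reproduce the values 126 and 94 quoted in Section 2.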

Model-Based Clustering
Given y_1, ..., y_N independent observations of a d-dimensional random vector, clustering methods based on fitting a GMM with K groups seek to maximize the log-likelihood function (2). From the fourteen possible restrictions considered in Celeux and Govaert [1995], we can compute fourteen different maximum likelihood solutions in which size, shape and orientation are common or not between the K covariance matrices. For a particular level M in Table 1, the fitting requires the maximization of the log-likelihood, where π = (π_1, ..., π_K) are the weights, with 0 ≤ π_k ≤ 1 and Σ_k π_k = 1, V = (V_1, ..., V_K) the variable parameters and C the common parameters. Estimation under the parsimonious restriction is performed via the EM algorithm. In the GMM context, we can see the complete data as pairs (y_i, z_i), where z_i is an unobserved random vector such that z_{i,k} = 1 if the observation y_i comes from distribution k, and z_{i,k} = 0 otherwise.
With the ideas of Section 2, we are going to fit Gaussian Mixture Models with parsimonious restrictions, but allowing the common parameters to vary between different classes. Assuming a parsimonious level of decomposition M and a number G ∈ {1, ..., K} of classes, we suppose that our data are independent observations from a distribution with density function (10). For each u ∈ H, we can fit a model by maximizing the log-likelihood function given the fixed vector of indexes u. In order to choose the best value for the vector of indexes u, we should compare the BIC values given by the different models estimated. As the number of parameters is the same, the best value for u is the one maximizing the log-likelihood. In order to avoid the combinatorial problem of maximizing within H, we can take u as if it were a parameter, and we focus on the problem of maximizing (11), which will be referred to as Clustering G-M. Therefore, given the unobserved variables z_{i,k}, for k = 1, ..., K and i = 1, ..., N, the complete log-likelihood is (12). The proposal of this section is to fit this model given a parsimonious level M and fixed values of K and G ∈ {1, ..., K}, introducing also constraints to avoid the unboundedness of the log-likelihood function (11). For this purpose, we introduce the determinant and shape constraints studied in García-Escudero et al. [2020]. For k = 1, ..., K, denote by (λ_{k,1}, ..., λ_{k,d}) the diagonal elements of the shape matrix Λ_k (which may be the same within classes).
We impose K constraints controlling the shape of each group, in order to avoid solutions that are almost contained in a subspace of lower dimension, and a size constraint in order to avoid the presence of very small clusters, given by (13) for constants c_sh, c_vol ≥ 1.
Remark 1. With these restrictions, the theoretical problem of maximizing (11) is well defined. If Y is a random vector following a distribution P, the problem consists in maximizing the population counterpart of (11) with respect to π, μ, u, V, C, defined as above and verifying (13). If P_N stands for the empirical measure P_N = (1/N) Σ_{i=1}^N δ_{y_i}, by replacing P by P_N we recover the original sample problem of maximizing (11) under the determinant and shape constraints (13). This approach guarantees that the objective function is bounded, allowing results to be stated in terms of existence and consistency of the solutions (see Section A in the Appendix). We now give a sketch of the EM algorithm used for the estimation of these intermediate parsimonious clustering models, for each of the fourteen levels.
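The effect of a shape constraint can be illustrated by a naive truncation of the shape eigenvalues: clip them so their ratio does not exceed c_sh and renormalize to determinant one. This is only a crude stand-in for the optimal truncation algorithm of García-Escudero et al. [2020] referred to below, shown just to convey the idea.

```python
import numpy as np

def truncate_shape(lams, c_sh):
    """Naive shape-constraint illustration: clip eigenvalues from below
    so that max/min <= c_sh, then renormalize to det = 1 (a crude
    stand-in for the optimal truncation algorithm cited in the text)."""
    lams = np.asarray(lams, dtype=float)
    d = lams.size
    clipped = np.clip(lams, lams.max() / c_sh, None)  # enforce the ratio
    return clipped / np.prod(clipped) ** (1.0 / d)    # renormalize det to 1
```

Since renormalization rescales all eigenvalues by the same factor, the eigenvalue ratio (and hence the constraint) is preserved.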
Clustering G-M: Starting from an initial solution of the parameters C^0, we have to repeat the following steps until convergence:
• E step: Given the current parameters C^m, we compute the posterior probabilities ẑ_{i,k} for k = 1, ..., K, i = 1, ..., N.
• M step: In this step, we have to maximize (12) given the expected values {ẑ_{i,k}}_{i,k}. The optimal values for π^{m+1}, μ^{m+1} are given by (3). With these optimal values, if we denote by n_k and S_k the resulting group sizes and weighted scatter matrices, it remains to find parameters C^{m+1} verifying the determinant and shape constraints (13) and maximizing the corresponding objective. If we remove the determinant and shape constraints, the solution of this maximization coincides with the classification problem presented in Section 2 for the computed values of n_1, ..., n_K and S_1, ..., S_K. A simple modification of that algorithm, computing at each step the optimal size and shape constrained parameters (instead of the unconstrained version) with the optimal truncation algorithm presented in García-Escudero et al. [2020], allows the maximization to be completed. Determinant and shape constraints can be incorporated in the algorithms together with the parsimonious constraints following the lines developed in García-Escudero et al. [2022]. As already noted in Section 2, we keep only the clustering models G-CPC and G-PROP, the most interesting in terms of parameter reduction and interpretability. For these models, explicit algorithms are explained in Section C.3 in the Appendix. We now illustrate the results of the algorithms in two simulation experiments:
• Clustering G-CPC: In this example, we simulate n = 100 observations from each of 6 Gaussian distributions, with means μ_1, ..., μ_6 and covariance matrices verifying a 2-CPC structure. In Figure 2, we can see in the first plot the 95% confidence ellipses of the six theoretical Gaussian distributions together with the 100 independent observations simulated from each of these distributions.
The second plot represents the clusters created by the maximum likelihood solution for the 2-CPC model, taking c_sh = c_vol = 100. The numbers labeling the ellipses represent the class of covariance matrices sharing the orientation. Finally, the third plot represents the best solution estimated by mclust for K = 6, corresponding to the parsimonious model VEV, with equal shape and variable size and orientation. The BIC value of the 2-CPC model (31 d.f.) is -3937.08, whereas the best model VEV (30 d.f.) estimated with mclust has BIC value -3960.07. Therefore, the GMM estimated with the 2-CPC restriction has a higher BIC than all the parsimonious models. Finally, the number of observations assigned to clusters different from the original ones is 82 for the 2-CPC model and 91 for the VEV model.
• Clustering G-PROP: In this example, we simulate n = 100 observations from each of 6 Gaussian distributions, with means μ_1, ..., μ_6 and covariance matrices verifying a 2-PROP structure. Now, the number of observations wrongly assigned to the source groups is 64 for the 2-PROP model, while it is 71 for the VVV model.
Remark 2. Note that, by imposing appropriate constraints in the clustering problem, we can significantly decrease the number of parameters while keeping a good fit to the data. Figure 3 shows this effect. Moreover, constraints also have a clear interpretation in cluster analysis problems, since we are looking for groups that are forced to have a particular shape. Therefore, different constraints can lead to clusters with different shapes. This is what happens in Figure 2, where by introducing the right constraints we have managed to make the clusters created more similar to the original ones. Of course, in the absence of prior information, it is not possible to know the appropriate constraints, and the most reasonable approach is to select a model according to a criterion that penalizes the fit by the number of parameters, such as the BIC.
To evaluate the sensitivity of BIC for the detection of the true underlying model, we have used the models described in the two previous examples. Once a model and a particular sample size n (= 50, 100, 200) have been chosen, the simulation plan produces a sample containing n random elements generated from each N(μ_k, Σ_k), k = 1, ..., 6. We repeated every simulation plan 1000 times, comparing for every sample the BIC obtained for the underlying clustering model with that of the best parsimonious model estimated by mclust.

Discriminant Analysis
The parsimonious model introduced in Bensmail and Celeux [1996] for discriminant analysis has been developed in conjunction with model-based clustering.
The R package mclust [Fraley and Raftery, 2002, Scrucca et al., 2016] also includes functions for fitting these models, denoted by EDDA (Eigenvalue Decomposition Discriminant Analysis). In this context, given a parsimonious level M and a number G of classes, we can also consider fitting an intermediate model for each fixed u ∈ H, by maximizing the complete log-likelihood. To assess the quality of the resulting classifiers, we consider the following error indicators:
- LOO: Leave-One-Out error.
-CV(R,p): Cross Validation error.Considering each observation as labeled or unlabeled with probability p and 1 − p, we compute the proportion of unlabeled observations misclassified by the model fitted with the labeled observations.
The indicator CV(R,p) represents the mean of the proportions obtained in R repetitions of the process.When several classification methods are compared, the same R random partitions are used to compute the values of this indicator.
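The CV(R, p) indicator described above can be sketched generically as follows. Here `fit` and `predict` are hypothetical placeholders for any classifier interface; the function name and defaults are ours.

```python
import numpy as np

def cv_error(fit, predict, X, z, R=20, p=0.75, seed=0):
    """CV(R, p) sketch: in each of R repetitions, mark each observation
    as labeled with probability p, fit on the labeled part and record the
    misclassification rate on the unlabeled part; return the mean rate."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(R):
        labeled = rng.random(len(X)) < p
        if labeled.all() or not labeled.any():
            continue  # degenerate split, skip this repetition
        model = fit(X[labeled], z[labeled])
        pred = predict(model, X[~labeled])
        errs.append(np.mean(pred != z[~labeled]))
    return float(np.mean(errs))
```

When several methods are compared, the same random splits (same seed) should be reused, as the text prescribes.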
In the line of the previous section, only the discriminant analysis models G-CPC and G-PROP are considered. Tables 4 and 5 show the results of applying these models to the simulation examples of Figures 2 and 3. In both situations, the classification obtained with our model slightly improves that given by mclust.
As we did in the clustering setting, in order to evaluate the sensitivity of BIC for the detection of the true underlying model, simulations have been repeated 1000 times for each sample size n (= 30, 50, 100, 200). Table 6 shows the proportion of times in which the 2-CPC or 2-PROP model improves the best mclust model in terms of BIC for each value of n.
The values of π, θ_1, ..., θ_K are usually unknown, and the classification is performed with estimations π̂, θ̂_1, ..., θ̂_K. Whereas θ̂_1, ..., θ̂_K are always parameters estimated from the sample, the values of π may be seen as part of the classification rule, if we think that they represent a characteristic of the particular sample we are classifying, or as real parameters, if we assume that the observations (z_i, y_i) arise from a GMM in which the labels follow a multinomial distribution with weights π. In accordance with mclust, for model comparison we are not considering π as parameters, although their consideration would only mean adding a constant to all the BIC values computed. However, in order to define the theoretical problem, the situation where we consider π as a parameter is more interesting. If (Z, Y) is a random vector following a distribution P on {1, ..., K} × R^d, the theoretical problem consists in maximizing (18) with respect to the parameters π, μ, u, V, C. Given N observations (z_i, y_i), i = 1, ..., N of P, the problem of maximizing (18) agrees with the sample problem presented above the remark when taking the empirical measure P_N, with the obvious relation z_{i,k} = I(z_i = k). Arguments like those presented in Section A of the Appendix for the cluster analysis problem would give existence and consistency of solutions also in this setting.

Real Data Examples
To illustrate the usefulness of the G-CPC and G-PROP models in both settings, we show four real data examples in which our models outperform the best parsimonious models fitted by mclust in terms of BIC. The first two examples are intended to illustrate the methods on simple and well-known data sets, while the latter two involve greater complexity.

Cluster Analysis: IRIS
Here we revisit the famous Iris data set, which consists of observations of four features (length and width of sepals and petals) of 50 samples of each of three species of Iris (setosa, versicolor and virginica), and is available in the base package of R. We apply the model-based clustering functions of the mclust package, fixing the number of clusters to search at 3, to obtain the best parsimonious model in terms of BIC value. In Figure 4 we can see the clusters created by this model. These clusters coincide with the real groups, except for four observations. From this example we can also see the advantage of the intermediate models G-CPC and G-PROP in terms of interpretability. In the solution found with G-PROP, the covariance matrices associated with two of the three clusters are proportional. Each cluster represents a group of individuals with similar features which, in the absence of labels, we could see as a subclassification within the Iris species. In this subclassification, the two groups with proportional covariance matrices share not only the principal directions, but also the same proportion of variability along those directions. In many biological studies, principal components are of great importance. When working with phenotypic variables, principal components may be interpreted as "growing directions" (see e.g. Thorpe [1983]). From the estimated model we can conclude that, in the Iris data, it is reasonable to think that there are three groups, two of them with a similar "growing pattern", since not only the principal components but also the shape is common. This biological interpretation will become even more evident in the following example.
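The analysis above is carried out in R with mclust. For readers working in Python, an analogous (though not parsimonious) fit can be sketched with scikit-learn; the unrestricted 'full' covariance model stands in here for the mclust decompositions, which scikit-learn does not implement:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

# Fit a 3-component Gaussian mixture to the four Iris measurements.
X, y = load_iris(return_X_y=True)
gm = GaussianMixture(n_components=3, covariance_type="full",
                     n_init=5, random_state=0).fit(X)
labels = gm.predict(X)
```

The resulting labels can then be compared against the known species, as in Figure 4.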

Discriminant Analysis: CRABS
The data set consists of measures of 5 features over a set of 200 crabs from two species, orange and blue, and from both sexes, and is available in the R package MASS [Venables and Ripley, 2002]. For each species and sex (labeled OF, OM, BF, BM) there are 50 observations. The variables are measures in mm of the following features: frontal lobe (FL), rear width (RW), carapace length (CL), carapace width (CW) and body depth (BD). Applying the classification function of the mclust library, the best parsimonious model in terms of BIC is EEV. Table 8 shows the results for the EEV model, together with the discriminant analysis models 2-CPC and 2-PROP with c_sh = c_vol = 100000 (with these values, the solutions agree with the unrestricted solutions). The results show that the comparison given by BIC can differ from those obtained by cross-validation techniques, partially because BIC mainly measures the fit of the data to the model. However, in the parsimonious context, model selection is usually performed via BIC in order to avoid the very time-consuming process of evaluating every possible model with cross-validation techniques.
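The cross-validation comparison mentioned above can be sketched generically. The following is not the paper's parsimonious classifier: it uses scikit-learn's QDA (a Gaussian classifier with unrestricted covariances) on synthetic data standing in for the crabs measurements, just to show how a cross-validated error rate is obtained:

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the crabs data: two well-separated Gaussian
# groups of 50 observations in dimension 5.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 5)),
               rng.normal(3.0, 1.0, size=(50, 5))])
y = np.repeat([0, 1], 50)

# 5-fold cross-validated misclassification rate of the classifier.
scores = cross_val_score(QuadraticDiscriminantAnalysis(), X, y, cv=5)
cv_error = 1.0 - scores.mean()
```

Evaluating every candidate model this way is what makes cross-validation so much more expensive than a single BIC computation per model.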
Figure D1 represents the solution estimated by the 2-PROP model. The solution given by this model allows a better biological interpretation than the one given by the parsimonious model EEV, where the orientation varies along the 4 groups, making the comparison quite complex. In the 2-PROP model, the covariance matrices of the four groups are gathered into two classes of proportional matrices, so the groups can be compared directly through their common principal directions and shapes.

Discriminant Analysis: OLIVE OIL

The data set contains information about the composition in percentage of eight fatty acids (palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic and eicosenoic) found in the lipid fraction of 572 Italian olive oils, and is available in the R package pdfCluster [Azzalini and Menardi, 2014]. The olive oils are labeled according to a two-level classification: 9 different areas that are grouped in turn into three different regions.
In this example, we have evaluated the performance of different discriminant analysis models for the problem of classifying the olive oils among the areas.
The best parsimonious model fitted with mclust is the VVE model, with variable size and shape and equal orientation. Note that, due to the dimension d = 8, there is a significant difference in the number of parameters between models with common and variable orientation. Therefore, BIC selection will tend to choose models with common orientation, despite the fact that this hypothesis might not be very accurate. This suggests that intermediate models could be of great interest in this example as well.
Given that the last variable eicosenoic is almost  10.
The best solution found in terms of BIC is given by the 3-CPC model, which is also the solution with the best values for the other indicators. The classification of the areas into classes given by this solution is: • CLASS 1: Umbria. • CLASS 2: the areas from the South region. • CLASS 3: the remaining areas, from the Centre-North and Sardinia regions.
Note that the areas in class 2 exactly agree with the areas from the South region. This classification coincides with the separation into classes given by 3-PROP, whereas the 2-PROP model grouped class 1 and class 3 together. These facts support the idea that our intermediate models have been able to take advantage of the apparent difference in structure between the covariance matrices from the South region and the others. When we look for a three-class separation, instead of splitting the areas from the Centre-North and Sardinia into these two regions, all Centre-North and Sardinia areas are grouped together, except Umbria, which forms a group on its own. Figure D3 represents the solution in the principal components of the group Umbria, and we can appreciate the characteristics of this area. The plot corresponding to the second and third variables allows us to see clear differences in some of its principal components. Additionally, we can see that it is also the area with the least variability in many directions. In conclusion, a different behavior of the variability in the olive oils from this area seems clear. This could be related to the geographical situation of Umbria (the only non-insular and non-coastal area under consideration).

Conclusions and further directions
Cluster analysis of structured data opens up interesting research prospects. This fact is widely known and used in applications where the data themselves share some common structure; thus, clustering techniques are a key tool in functional data analysis.
More recently, the underlying structures of the data have increased in complexity, leading, for example, to considering probability distributions as data and to using innovative metrics, such as earth-mover or Wasserstein distances. This configuration has been used in cluster analysis, for example, in del Barrio et al. [2019], from a classical perspective, but also including new perspectives such as meta-analysis of procedures and aggregation facilities. Nevertheless, to the best of our knowledge, this is the first occasion in which a clustering procedure is used as a selection step (of an intermediate model) in an estimation problem. Our proposal allows improvements in the estimation process and, arguably, often a gain in the interpretability of the estimation, thanks to the chosen framework: classification through the Gaussian Mixture Model.
The presented methodology enhances the so-called parsimonious model by including intermediate models. They are linked to geometrical considerations on the ellipsoids associated with the covariance matrices of the underlying populations that compose the mixture. These considerations are precisely the essence of the parsimonious model. The intermediate models arise from clustering covariance matrices, considered as structured data, using a similarity measure based on the likelihood. Clustering these objects through other similarities could be appropriate when looking for tools for different goals. In particular, we emphasize the possibility of clustering based on metrics such as the Bures-Wasserstein distance. The role played here by the BIC would have to be tested in the corresponding configurations or, alternatively, replaced by appropriate penalties for choosing between other hierarchical models. Feasibility of the proposal is an essential requirement for a serious trial of a statistical tool. The algorithms considered in the paper are simple adaptations of the Classification Expectation Maximization algorithm, but we think that they could still be improved. We will pursue this challenge, looking also for feasible computations for similarities associated with new pre-established objectives.
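For the Bures-Wasserstein alternative mentioned above, the squared distance between two covariance matrices has a closed form; a minimal sketch:

```python
import numpy as np
from scipy.linalg import sqrtm

def bures_wasserstein_sq(A, B):
    # Squared Bures-Wasserstein distance between covariance matrices:
    # d^2(A, B) = tr(A) + tr(B) - 2 tr((A^{1/2} B A^{1/2})^{1/2})
    rA = sqrtm(A)
    cross = sqrtm(rA @ B @ rA)
    return float(np.trace(A) + np.trace(B) - 2.0 * np.real(np.trace(cross)))

A = np.diag([4.0, 1.0])
B = np.diag([1.0, 1.0])
d2 = bures_wasserstein_sq(A, B)
```

Replacing the likelihood-based similarity by this metric in the clustering of covariance matrices is one of the directions suggested in the text.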
In summary, throughout the paper we have used clustering to explore similarities between groups according to predetermined patterns. In this wider setup, clustering is not a goal in itself; it can be an important tool for specialized analyses.

A Theoretical results
In this section we elaborate on the comments of Remark 1. Given a parsimonious model M and fixed values of K, G, c_vol and c_sh, the problem consists in maximizing the function (14) over Θ^{M,G}_{c_vol,c_sh}, the set of parameters π, µ, u, V, C associated with the clustering model G-M verifying the size and shape constraints (13). Using the same notation as in García-Escudero et al. [2020], where S^d_{>0} is the set of positive definite symmetric real matrices, we define a suitable map. This map and Lemma 1 in García-Escudero et al. [2020] allow us to replicate the proofs of Propositions 1 and 2 in García-Escudero et al. [2015] to prove the following theorems on the existence and consistency of the solutions.
Theorem 1. If P is a probability distribution that is not concentrated on K points, and E_P||·||² < ∞, then the maximum of (14) is achieved at some parameter in Θ^{M,G}_{c_vol,c_sh}.

Given independent observations {y_i}_{i=1}^∞ of the distribution P, for each N we can define the empirical distribution P_N. The sample problem of maximizing (14) under the constraint (13) coincides with the distributional problem presented here when we take the probability P_N. Therefore, Theorem 1 also guarantees the existence of the solution of the empirical problem corresponding to large enough samples drawn from an absolutely continuous distribution.
We use the notation θ_0 for any constrained maximizer of the theoretical problem for the underlying distribution P, and let {θ_N}_{N=1}^∞ be a sequence of empirical solutions for the sequence of empirical sample distributions {P_N}_{N=1}^∞. The following result states consistency, under similar assumptions as in Theorem 1, when the maximizer of the theoretical problem is assumed to be unique.
Theorem 2. Let us assume that P is not concentrated on K points, that E_P||·||² < ∞, and that θ_0 ∈ Θ^{M,G}_{c_vol,c_sh} is the unique constrained maximizer of (14) for P. If {θ_n}_{n=1}^∞ is a sequence of empirical maximizers of (14) with θ_n ∈ Θ^{M,G}_{c_vol,c_sh}, then θ_n → θ_0 almost surely.

B Additional simulations
At the suggestion of a reviewer, we present two additional simulation examples that reinforce the ideas presented in Section 3.1. For the sake of brevity, we only give the results for the more involved clustering problem. We point out two basic ideas. Since we have introduced a broader family of models, model selection will be more challenging than within the fourteen parsimonious models. This is clearly seen in the former example, but, with a sufficiently large sample size, BIC is still able to select the true model. In the latter example, we emphasize that our extension of the parsimonious model is not redundant.
First, we repeat the two-dimensional simulation experiment described in Section 3.1, but assuming the VVE model. This example allows us to deal with two different situations: the true underlying model verifies the VVE (1-CPC) model, so it also verifies the 2-CPC model, but it does not verify the 2-PROP model. For a sample with n = 50 observations from each group, we compute the VVE, 2-CPC and 2-PROP solutions for clustering. Results are shown in Figure B1, where we can appreciate that both the VVE and 2-CPC models fit the data perfectly, while the constraint of 2-PROP does not allow a good fit of the data. This is also reflected in Table B1, where the BIC values are computed. The best model in terms of BIC is VVE, but 2-CPC is also competitive; 2-PROP gives much worse BIC values. Finally, as we did in Table 3, the simulations have been repeated 1000 times for different sample sizes n. In each simulation, we compare the BIC values obtained for 2-CPC and 2-PROP with the BIC value obtained for the true underlying model VVE. Results are shown in Table B2. The second example is similar to the 2-CPC example in Section 3.1, but now in dimension d = 10. We consider K = 6 distributions, with G = 2 classes. Parameters were created so that we get a favorable but not trivial situation for applying clustering algorithms. Figure D4 shows a sample created with n = 100 observations from each group. For this sample, we fit the clustering model 2-CPC and compare it with the best model estimated by mclust.
The results of this simulation are given in Table B3. The main advantage of considering our intermediate models against the 14 parsimonious models estimated by mclust in this particular example is that mclust selects the model VVE, which is not exactly verified, because the VVV model involves a substantially larger number of parameters (395 for clustering, 390 for discriminant analysis). This leads to a significant improvement in the BIC value of the 2-CPC model. As a result, when we repeated the simulation 1000 times with different sample sizes n (= 50, 100, 200), our model 2-CPC improved on the best model estimated by mclust in terms of BIC in 100% of the simulations, for all the values of n considered.

C Algorithms

C.1 Optimal truncation
In the algorithms presented, we will repeatedly use the optimal truncation algorithm explained in Section 3.1 in García-Escudero et al. [2020], which was introduced in Fritz et al. [2013].
Given d ≥ 0 and a fixed restriction constant c ≥ 1, the m-truncated value of d is defined as d itself when m ≤ d ≤ c·m, as m when d < m, and as c·m when d > c·m, with m_opt being the optimal threshold value. Obtaining that optimal threshold value only requires the maximization of a real-valued function, and m_opt can be efficiently obtained by performing only 2·J·L + 1 evaluations through a procedure which can be fully vectorized [Fritz et al., 2013].
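The truncation operator just described can be sketched as follows. The brute-force grid search over the threshold m is a stand-in for the exact 2·J·L + 1-evaluation procedure of Fritz et al. [2013], and the likelihood-type loss below is an assumption standing in for the paper's objective:

```python
import numpy as np

def truncate(d, m, c):
    # m-truncated values: clip each eigenvalue into [m, c*m].
    return np.clip(d, m, c * m)

def objective(d, n, m, c):
    # Weighted likelihood-type loss: sum_j n_j * (log t_j + d_j / t_j).
    t = truncate(d, m, c)
    return np.sum(n * (np.log(t) + d / t))

def optimal_truncation(d, n, c, grid_size=2000):
    # Scan a dense grid of thresholds m and keep the minimizer.
    d = np.asarray(d, dtype=float)
    n = np.asarray(n, dtype=float)
    grid = np.linspace(d.min() / c, d.max(), grid_size)
    vals = [objective(d, n, m, c) for m in grid]
    m_opt = grid[int(np.argmin(vals))]
    return truncate(d, m_opt, c)

d = np.array([0.5, 1.0, 8.0])
n = np.array([10.0, 10.0, 10.0])
t = optimal_truncation(d, n, c=4.0)
```

After truncation, the ratio between the largest and smallest value cannot exceed c, while values already satisfying the constraint are left untouched when c is large enough.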
In the algorithms of the following sections, when working with proportionality models, we will in several situations minimize a function of a certain type, where β is an orthogonal matrix with columns β_l, l = 1, ..., d, the γ_1, ..., γ_r are size parameters verifying the size constraint for c_vol, and the λ_1, ..., λ_d are common shape parameters verifying the shape constraint for c_sh and Π_{l=1}^d λ_l = 1. In this situation, the minimization can be carried out iteratively, taking into account that: • With the sizes and shapes fixed, the minimization with respect to β can be done with the algorithms proposed in Browne and McNicholas [2014a].
• With the orientation and shapes fixed, the optimal unconstrained values of the sizes, γ_k^opt, have a closed form. Therefore, the optimal restricted values for the sizes are OT_{c_vol}({n_k}_{k=1}^r; {γ_k^opt}_{k=1}^r).
• With the orientation and sizes fixed, the optimal unconstrained values of the shapes, λ_1^opt, ..., λ_d^opt, have a closed form. The optimal values verifying the constraint c_sh are OT_{c_sh}({1}; {λ_1^opt, ..., λ_d^opt}), and, because of the reasoning in Section 3.3 of García-Escudero et al. [2020], the optimal values also verifying Π_{l=1}^d λ_l = 1 are obtained by normalizing the result of the optimal truncation operator.
When working with CPC models, we will often have to minimize a slightly different type of function. In this case, analogous comments apply to the minimization with respect to the sizes and the orientation matrix. For the shape matrices: • With the orientation and sizes fixed, the optimal unconstrained values of the shapes, λ_{k,1}^opt, ..., λ_{k,d}^opt, have a closed form. For each k = 1, ..., r, the optimal values verifying the constraint c_sh are the result of the operator OT_{c_sh}({1}; {λ_{k,1}^opt, ..., λ_{k,d}^opt}), and the optimal values also verifying Π_{l=1}^d λ_{k,l} = 1 are obtained by normalizing the result of that truncation.
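The normalization step mentioned above, rescaling the truncated shape parameters so that their product is 1, is simply a division by the geometric mean; a minimal sketch:

```python
import numpy as np

def normalize_shape(lam):
    # Rescale positive shape parameters so that prod(lam) = 1,
    # i.e. divide each entry by the geometric mean.
    lam = np.asarray(lam, dtype=float)
    return lam / np.exp(np.mean(np.log(lam)))

lam = normalize_shape([4.0, 1.0, 1.0])
```

Note that the division preserves the ratios between shape parameters, so the c_sh constraint remains satisfied after normalizing.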

C.2 Classification G-CPC/G-PROP
In this section we develop the algorithms for the covariance matrix classification models G-CPC and G-PROP minimizing (8) and (9). Since these algorithms are included in the algorithms for cluster analysis, determinant and shape constraints are also included. When focusing on the original problem of Section 2, these constraints should be omitted, which can be done by taking c_vol = c_sh = ∞. The input of the algorithm is Classification G-CPC/PROP(S_1, ..., S_K, n_1, ..., n_K, G, c_sh, c_vol, nstart_1), where S_1, ..., S_K are the sample covariance matrices, n_1, ..., n_K the sample sizes, G the number of classes, c_sh and c_vol the values of the constants for the determinant and shape constraints, and nstart_1 the number of random initializations. The parameters of the minimization are u = (u_1, ..., u_K), γ = (γ_1, ..., γ_K), β = (β_1, ..., β_G) and Λ = (Λ_1, ..., Λ_s), where s = K in G-CPC and s = G in G-PROP; they are also the output of the algorithm. A detailed presentation of the algorithm is given as follows:

1. Initialization: We start by taking a random vector of indexes u^0 ∈ H. Then we take:
β^0: For each g = 1, ..., G, we take k such that u^0_k = g, and we define β_g as the eigenvectors of S_k.
Λ^0: → G-PROP: For each g = 1, ..., G, taking the same k as before, we compute the constrained values.

2. Iterations: The following steps are repeated until convergence:
• u-V step: Based on the current parameters Λ^m, we optimize with respect to u and the variable parameters of each parsimonious model. The variable parameters will also be optimized in the following step, so their values are not updated here. The size parameters γ do not affect the selection of the best u, so it is enough to find, for each k = 1, ..., K, the value of u_k for which, taking the common parameters C_{u_k}, we obtain a lower value in the minimization with respect to the variable parameters.
→ G-PROP: The parameters Λ and β are common; we only minimize with respect to u, for each k = 1, ..., K.
→ G-CPC: The parameters β are common, for each k = 1, ..., K.
• Parameter step: Based on the current classification u and the parameters Λ^m, we optimize with respect to γ, β, Λ. This optimization requires iterations. Setting s = 0, and considering the initial solutions, the following steps are repeated until convergence:
s: s = s + 1.
γ^s: Update the size parameters.

Evaluate the target function:
Steps 1 and 2 are repeated nstart_1 times. At each repetition, we evaluate the target function (8) or (9), and we keep the parameters estimated in the iteration with the best value of the target function.
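As an illustration of the classification scheme, the following toy sketch clusters covariance matrices by shared principal directions with a CEM-style alternation. It is a simplification, not the paper's algorithm: the likelihood-based criterion is replaced by Flury's diagonality measure, and size and shape constraints are omitted:

```python
import numpy as np

def cpc_misfit(S, beta):
    # Flury-type diagonality loss of S in the axes beta:
    # log det(diag(B' S B)) - log det(S) >= 0, with equality iff
    # the columns of beta are eigenvectors of S.
    D = beta.T @ S @ beta
    return float(np.sum(np.log(np.diag(D))) - np.linalg.slogdet(S)[1])

def classify_covariances(S_list, G, n_iter=20, seed=0):
    # Toy CEM-style loop: alternate (i) assigning each covariance
    # matrix to the common orientation it fits best and (ii) refitting
    # each orientation as the eigenvectors of the pooled matrices of
    # its class.
    rng = np.random.default_rng(seed)
    u = rng.integers(0, G, size=len(S_list))
    for _ in range(n_iter):
        betas = []
        for g in range(G):
            members = [S for S, ug in zip(S_list, u) if ug == g]
            pooled = np.sum(members, axis=0) if members else S_list[0]
            betas.append(np.linalg.eigh(pooled)[1])
        u = np.array([int(np.argmin([cpc_misfit(S, b) for b in betas]))
                      for S in S_list])
    return u, betas
```

Several random restarts, as in the nstart_1 loop above, would be needed in practice to escape poor local solutions.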

C.3 Clustering G-CPC/G-PROP
In this section we give a detailed explanation of the algorithms for model-based clustering G-CPC and G-PROP presented in Section 3.1 for minimizing (11). The algorithms for fitting the corresponding discriminant analysis models can be easily deduced from these. The input of the clustering algorithm is: X, G, K, c_sh, c_vol, nstart_1, nstart_2, where X is the matrix with N observations of d variables, G is the number of classes, K is the number of clusters, c_sh and c_vol are the values for the determinant and shape constraints, nstart_1 is the number of random initializations in the classification algorithm, and nstart_2 is the number of random initializations in the clustering algorithm. The parameters of the minimization are π = (π_1, ..., π_K), µ = (µ_1, ..., µ_K), u = (u_1, ..., u_K), γ = (γ_1, ..., γ_K), β = (β_1, ..., β_G) and Λ = (Λ_1, ..., Λ_s), where s = K in G-CPC and s = G in G-PROP. A detailed presentation of the algorithm is given as follows: 1. Initialization: We start by taking a random vector of indexes u^0 ∈ H. Then we apply the constrained clustering algorithm of [García-Escudero et al., 2008, Fritz et al., 2012] to a random sample of length N/2 of X, with number of groups K, eigenvalue constraint given by c = min{c_sh, c_vol} and a suitable number of starts.
We consider as initial solution a random perturbation of the values obtained, based on S = cov(X) (this simple initial solution verifies the determinant and shape constraints independently of the values of c_sh and c_vol). 2. Iterations: The E and M steps are repeated until convergence: • M step: In this step, we maximize the complete log-likelihood (12) given the expected values {z_{i,k}}_{i,k}.

Evaluate the target function:
Steps 1 and 2 are repeated nstart_2 times. At the end of each initialization, we evaluate the target function (11), and we keep the parameters estimated in the iteration with the best value of the target function.
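The E step that pairs with the M step above computes the expected indicators z_{i,k} as posterior responsibilities; a minimal sketch for Gaussian components:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    # E step: z_{i,k} = pi_k f(y_i; mu_k, Sigma_k) / sum_j pi_j f(y_i; mu_j, Sigma_j)
    dens = np.column_stack([
        w * multivariate_normal.pdf(X, mean=m, cov=S)
        for w, m, S in zip(weights, means, covs)])
    return dens / dens.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [5.0, 5.0]])
Z = e_step(X,
           weights=[0.5, 0.5],
           means=[np.zeros(2), 5.0 * np.ones(2)],
           covs=[np.eye(2), np.eye(2)])
```

Each row of Z sums to one, and well-separated observations receive responsibilities close to 0 or 1.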

D Supplementary figures
This appendix includes illustrative graphs for the best models, in terms of BIC, fitted in the last three real data examples presented in the paper "Improving Model Choice in Classification: An Approach Based on Clustering of Covariance Matrices": • Crabs data set: 2-PROP discriminant analysis model. • Gene expression cancer data set: 3-CPC clustering model. • Italian olive oil data set: 3-CPC discriminant analysis model.
We also include a graph of the simulation data used in the second example of Appendix B.

Figure 1: Classification of S_1, ..., S_100, represented by their 95% confidence ellipses. The first row shows the class and axis estimations given by the 4-CPC model, and the second row shows the class and proportional matrix estimations given by the 4-PROP model.

Figure 2: From left to right: 1. Theoretical Gaussian distributions and observations simulated from each distribution. 2. Solution estimated by clustering through the 2-CPC model. 3. Best clustering solution estimated by mclust in terms of BIC.

Figure 3 is analogous to Figure 2, but in the proportionality case. The BIC value for the 2-PROP model (27 d.f.) with c_sh = c_vol = 100 is -3873.127, whereas the BIC value for the best model fitted by mclust is -3919.796, which corresponds to the unrestricted model VVV (35 d.f.). Now, the number of observations wrongly assigned with respect to the source groups is 64 for the 2-PROP model, while it is 71 for the VVV model.

Figure 3: From left to right: 1. Theoretical Gaussian distributions and observations simulated from each distribution. 2. Solution estimated by clustering through the 2-PROP model. 3. Best clustering solution estimated by mclust in terms of BIC.

Figure 4: Clustering obtained from the 2-PROP model in the Iris data set. Color represents the clusters created. The ellipses are the contours of the estimated mixture densities, grouped into the classes given by the indexes in black. Point shapes represent the original groups. Observations lying in different clusters from the original ones are marked with red circles.

Figure B1: From left to right: 1. Theoretical Gaussian distributions and observations simulated from each distribution in the VVE example. 2. Solution estimated by clustering through the 2-CPC model. 3. Solution estimated by clustering through the 2-PROP model. 4. VVE clustering solution estimated with mclust.

Figure D1: Discriminant analysis obtained from the 2-PROP model in the Crabs data set. Color represents the classification created by the model, and the associated ellipses are the contours of the estimated mixture densities, grouped into the classes given by the indexes in black. Point shapes represent the original groups. Misclassified observations are marked with red circles.

Figure D2: Clustering obtained from the 3-CPC model in the Gene expression cancer data set. Color represents the clusters created by the model, and the associated ellipses are the contours of the estimated mixture densities, grouped into the classes given by the indexes in black. Point shapes represent the original groups. Observations lying in different clusters from the original ones are marked with red circles.

Figure D3: Discriminant analysis obtained from the 3-CPC model in the Italian Olive Oil data set. Color represents the classification created by the model, and the associated ellipses are the contours of the estimated mixture densities, grouped into the classes given by the indexes in black. Point shapes represent the original groups. Misclassified observations are marked with red circles.

Figure D4: Data simulated from the 2-CPC model in the ten-dimensional example presented in Appendix B, represented in the common principal components of the first class. Color represents the original groups, and the associated ellipses are the contours of the true mixture densities, grouped into the classes given by the indexes in black.

Table 2
Table 3 includes the proportion of times in which the 2-CPC or 2-PROP model improves on the best mclust model in terms of BIC for each value of n. Of course, the accuracy of the approach should depend on the dimension, the number of groups, the overlap, and so on. However, even in the case of a large overlap, as in the present examples, the proportions reported in Table 3 show that moderate values of n suffice to get very high proportions of success. Appendix B contains additional simulations supporting the suitability of BIC in this framework.

Table 3:
Proportions of times in which the clustering 2-CPC or 2-PROP model improves on the best mclust model in terms of BIC, for each sample size n.

Table 4:
Classification results for the data in Figure 2 for the best mclust model and 2-CPC.

Table 5:
Classification results for the data in Figure 3 for the best mclust model and 2-PROP.

Table 6:
Proportions of times in which the discriminant analysis 2-CPC or 2-PROP model improves on the best mclust model in terms of BIC, for each sample size n.

If π_k is the proportion of observations of group k, the classifier minimizing the expected misclassification rate is known as the Bayes classifier, and it assigns an observation y to the group with the highest posterior probability, proportional to π_k f(y, θ_k). Remark 3. In discriminant analysis, the weights π = (π_1, ..., π_K) might not be considered as parameters. Model-based methods assume that observations from the k-th group follow a distribution with density function f(·, θ_k).
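The Bayes rule just described can be sketched directly; the weights, means and covariance matrices below are made-up illustration values:

```python
import numpy as np
from scipy.stats import multivariate_normal

def bayes_classify(y, weights, means, covs):
    # Bayes classifier: assign y to the group maximizing the
    # (unnormalized) posterior pi_k * f(y; mu_k, Sigma_k).
    scores = [w * multivariate_normal.pdf(y, mean=m, cov=S)
              for w, m, S in zip(weights, means, covs)]
    return int(np.argmax(scores))

means = [np.zeros(2), np.array([4.0, 4.0])]
covs = [np.eye(2), np.eye(2)]
k1 = bayes_classify(np.array([3.5, 3.8]), [0.5, 0.5], means, covs)
k0 = bayes_classify(np.array([0.1, -0.2]), [0.5, 0.5], means, covs)
```

In the model-based setting of the paper, the f(·, θ_k) are Gaussian densities with the parsimoniously constrained covariance matrices.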
Table 7 compares this model with the models 2-CPC and 2-PROP, fitted with c_sh = c_vol = 100. With some abuse of notation, we include in the table the Model Misclassification (MM), representing here the number of observations assigned to clusters different from the original ones, after identifying the created clusters with the original ones in a logical manner.

Table B1:
Clustering results in the VVE example for VVE, 2-CPC and 2-PROP models.

Table B2:
Proportions of times in which the clustering 2-CPC or 2-PROP model improves on the model VVE in terms of BIC, for each sample size n.
lower BIC value than VVE. The 2-CPC model is verified, it is more flexible than VVE, and the difference in the number of parameters is only one; thus, this is a rather complicated setting for model selection. Even in this case, if the sample size n is large enough, BIC is able to select the true model in almost all cases.

Table B3:
Clustering results in the 10-dimensional example for the best mclust model and 2-CPC.
Each figure is accompanied by the corresponding explanatory caption. For a better understanding of the models, the graphics should be enlarged in order to distinguish the different colors and shapes. The ellipses in the figures show the contours of the fitted multivariate component densities. While the contours of proportional matrices projected onto a lower-dimensional subspace are still proportional, it is important to notice that the analogous property does not hold for Common Principal Components, unless the subspace is generated by a subset of the Common Principal Components.