1 Introduction

In several fields, such as genetics, economics, engineering and the social sciences, data often present a complex structure where variables are measured on different scales: some are continuous, others ordinal. If the goal of the analysis is to find subgroups in the population, all information should be properly used. In other words, all the variables should contribute symmetrically, i.e. in the same way, to the estimation of the groups. However, the literature has for the most part been developed for continuous variables. In this framework, several clustering methods exist, mainly divided into distance-based, such as k-means, and model-based. Within the model-based approach, finite Gaussian mixture models are the most commonly used for clustering continuous data (Hennig et al. 2015). They are intensively used in many fields and for different classification purposes (e.g. unsupervised, semi-supervised and supervised). Their success is mainly due to the ease with which they can be fitted and interpreted. From a clustering point of view, they provide a coherent strategy for classifying data while accounting for uncertainty through probabilities: each mixture component can be interpreted as a sub-population, i.e. a cluster.

The same framework cannot be directly applied to ordinal data. The challenge in modeling ordinal data is mainly due to their lack of metric properties. For this reason, among practitioners it is still common to analyze ordinal data following a naive approach whereby their nature is ignored: ranks are treated as interval-scaled, and clustering techniques developed for continuous data are applied. However, the resulting estimates are biased and the clustering structure may be wrong (see e.g. Dolan 1994; DiStefano 2002; Rhemtulla et al. 2012 in the SEM framework; see e.g. Ranalli and Rocci 2016, 2017 in the clustering framework). It follows that ordinal variables should be modeled properly. This can be achieved by adopting the underlying response variable approach (URV, Jöreskog 1990; Lee et al. 1990; Muthén 1984), where the ordinal variables are assumed to be generated by thresholding some latent continuous variables. This approach allows us to cluster mixed-type data (continuous and ordinal variables) while satisfying two main requirements: dealing with ordinal data properly and modeling the dependencies between ordinal and continuous variables. Both continuous and ordinal variables follow a heteroscedastic Gaussian mixture model, the ordinal variables being variates of the mixture that are only partially observed through a discretization (see e.g. Ranalli and Rocci 2017; Everitt 1988).

When adopting mixture models for mixed-type data, two main, closely related issues must be faced as the dimensionality of the data increases: the number of parameters grows polynomially, and a large number of ordinal variables makes full maximum likelihood estimation infeasible.

To solve the first issue, the model should be made more parsimonious in terms of the number of parameters to estimate. To this aim, appropriate reparameterizations need to be assumed for the covariance matrices. In the literature, there exists a general class of parsimonious mixture models for continuous data obtained by imposing a factor decomposition on the component-specific covariance matrices. The loadings and the error variances of the factor model may be constrained to be equal or unequal across mixture components (McNicholas and Murphy 2008; McLachlan et al. 2003; Ghahramani and Hinton 1996). More precisely, Ghahramani and Hinton (1996) constrain the error variances to be equal across groups, McLachlan et al. (2003) impose no constraints, and McNicholas and Murphy (2008) use eight models with varying constraints on the loadings and/or error variances. This means that even if the number of variables \(P\) is high, it is still possible to estimate the component-specific covariance matrices with few latent factors \(K\) \((K\ll P)\).
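To give a rough idea of the savings, the following sketch counts the free covariance parameters per component under an unrestricted covariance and under a factor decomposition \(\varvec{\varSigma }_g=\varvec{\varLambda }_g\varvec{\varLambda }_g^{\prime }+\varvec{\varPsi }_g\); the counts ignore the rotational indeterminacy of the loadings and are indicative only, and the function names are ours.

# Indicative count of free covariance parameters per mixture component.
def n_params_full(P):
    return P * (P + 1) // 2        # unrestricted symmetric covariance: quadratic in P

def n_params_factor(P, K):
    return P * K + P               # loadings (P x K) plus diagonal uniquenesses: linear in P

for P in (10, 50, 100):
    print(P, n_params_full(P), n_params_factor(P, 3))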

In our proposal, we define a general class of parsimonious mixture models for mixed-type data by introducing several possible parsimonious reparameterizations for the covariance matrices, starting from the idea of McNicholas and Murphy (2008), formulated only for continuous data. In particular, we introduce twelve models: eight are constrained, while four are semi-constrained and thus more flexible. In the first class of models, we consider different constraints on the loadings and/or error variances. In the second one, the latent factors are the same in each cluster but with different variances.

As regards the second issue, we note that maximum likelihood estimation is rather complex. Indeed, the presence of ordinal variables requires the computation of many high-dimensional integrals, whose evaluation becomes computationally demanding as the number of ordinal variables increases. The problem is usually solved by substituting the likelihood function with a surrogate function. More precisely, we replace the full likelihood with the composite likelihood (Lindsay 1988), defined as the product of m-dimensional marginal or conditional events. Composite likelihood methods are flexible ways to create consistent estimators, which inherit the main desirable properties of maximum likelihood estimators: under some regularity conditions (Molenberghs and Verbeke 2005), they are asymptotically unbiased and normally distributed, with variance given by the inverse of the Godambe information (Lindsay 1988; Varin et al. 2011). Moreover, they enjoy varying degrees of robustness (Xu and Reid 2011), and they are fully efficient and identical to the full maximum likelihood estimators in exponential families under a certain closure property (Mardia et al. 2009). In general, full efficiency is not easy to achieve and is strictly linked to how the composite likelihood is designed, but in all cases the approach is much more efficient in terms of computational complexity. In the current work, a composite likelihood approach is adopted for model estimation. The surrogate function is built as the product of all possible marginals of two ordinal and all continuous variables. However, as long as sparsity is not a problem and computations are feasible, it is possible to use a higher m, including more ordinal variables. The computation of parameter estimates is carried out through an EM-type algorithm based on the complete-data composite log-likelihood.

The remainder of the paper is organised as follows. Section 2 introduces the general model. Section 3 describes the estimation procedure and some issues concerning classification, model selection and identifiability. A theoretical comparison with the most closely related models is presented in Sect. 4, while the results of a simulation study are presented in Sect. 5. A real data analysis is conducted in Sect. 6 and some concluding remarks are given in Sect. 7. The models presented in this work have been implemented in MatLab code, which may be found online at https://github.com/moniar412/parsFMMmixdata.

2 Model

Let \({\textbf {y}}^{\bar{O}}=[y_1,\ldots , y_{P-O}]\) and \({\textbf {x}}=[x_{P-O+1}, \ldots , x_P]\) be \({\bar{O}}=P-O\) continuous variables and O ordinal variables, respectively. The associated categories for each ordinal variable are denoted by \(c_{i}=1,\ldots , C_{i}\) with \(i={\bar{O}}+1,\ldots , P\).

Following the underlying response variable approach, the observed variables \({\textbf {x}}\) are considered as a discretization of continuous latent variables \({\textbf {y}}^{O}=[y_{{\bar{O}}+1},\ldots , y_P]\). The latent relationship between \({\textbf {x}}\) and \({\textbf {y}}^O\) is explained by a threshold model defined as follows,

$$\begin{aligned} \gamma _{c_i-1}^{(i)} \le y_{i} < \gamma _{c_{i}}^{(i)} \Leftrightarrow x_{i}=c_{i}, \end{aligned}$$

where \(-{\infty } =\gamma _{0}^{(i)}< \gamma _{{1}}^{(i)}<\ldots< \gamma _{{C_i-1}}^{(i)}< \gamma _{{C_i}}^{(i)}=+{\infty } \) are the thresholds defining the \(C_i\) categories.
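As a small illustration of the threshold model, the following sketch discretizes a simulated latent variate into four categories; the thresholds and the standard normal latent variate are arbitrary choices made only for the example.

import numpy as np

# Threshold model: x_i = c_i whenever gamma_{c_i - 1} <= y_i < gamma_{c_i}.
gamma = np.array([-np.inf, 0.0, 1.0, 2.0, np.inf])      # C_i = 4 categories

y_latent = np.random.default_rng(0).normal(size=1000)   # latent continuous variate
x_ordinal = np.digitize(y_latent, gamma[1:-1]) + 1       # observed categories coded 1, ..., C_i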

According to our proposal, \({\textbf {y}}=[{\textbf {y}}^{{\bar{O}}},{\textbf {y}}^O]\) follows a finite mixture of factor analyzers (McNicholas and Murphy 2008; McLachlan et al. 2003; Ghahramani and Hinton 1996)

$$\begin{aligned} f({\textbf {y}})=\sum _{g=1}^G p_g\phi ({\textbf {y}};\varvec{\mu }_g, \varvec{\varLambda }_g\varvec{\varLambda }_g^{\prime }+\varvec{\varPsi }_g) \end{aligned}$$

where \(\phi \) is the multivariate normal density, \(\varvec{\varLambda }_g\) is the \(P\times K\) matrix of factor loadings, and \(\varvec{\varPsi }_g\) is the diagonal matrix of uniquenesses. The latter can be assumed to be of the isotropic form \(\psi _g{\textbf {I}}\), leading to the probabilistic principal component analysis model (Tipping et al. 1999). Each term may be constrained to be equal or unequal across mixture components. Imposing, or not, such constraints generates the family of eight parsimonious Gaussian mixture models (PGMMs) described in Table 1, introduced by McNicholas and Murphy (2008) in the context of continuous data. Each member of this family of models has a number of covariance parameters that is linear in the data dimensionality. By assuming a common covariance structure, an even more parsimonious model can be used.

Table 1 The covariance structure of latent parsimonious Gaussian mixture models with constrained or unconstrained factor loadings matrix
Table 2 The covariance structure of latent parsimonious Gaussian mixture models with a semiconstrained factor loadings matrix

With respect to the proposal of McNicholas and Murphy (2008), we decided to add some extra flexibility while maintaining a certain degree of parsimony. We introduce four new models, see Table 2, that lie between the first and the last four models of Table 1 in terms of flexibility. This is achieved by assuming that the matrix of factor loadings can be written in the form \(\varvec{\varLambda } {\textbf {L}}_g\), where \({\textbf {L}}_g\) is a positive definite diagonal matrix of factor saliences. The interpretation is the following: the latent factors are the same in each cluster but with different variances, recorded by the matrices \({\textbf {L}}_g\). This is a particular form of factorial invariance, first introduced by Cattell (1944) and then developed by several authors in the context of three-way analysis, where the same variables are measured on the same subjects on different occasions (Carroll and Chang 1970; Harshman et al. 1970). It has also been extended, and successfully applied, to the case of multi-group factor analysis, where the same variables are observed on different groups of observations (see Stegeman and Lam 2016 and references therein).
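As an illustration, the following sketch builds the component covariance matrices of a semi-constrained model with a common loading matrix, cluster-specific saliences and unconstrained diagonal uniquenesses, using the common-factor representation \(\varvec{\varLambda }{\textbf {L}}_g\varvec{\varLambda }^{\prime }\) that also appears in (4) below; the dimensions and the uniform ranges are arbitrary choices made only for the example.

import numpy as np

rng = np.random.default_rng(1)
P, K, G = 8, 2, 3

# Common loadings Lambda, cluster-specific diagonal saliences L_g and uniquenesses Psi_g;
# the common-factor part of the g-th covariance is Lambda L_g Lambda'.
Lambda = rng.uniform(-1, 1, size=(P, K))
Sigma = []
for g in range(G):
    L_g = np.diag(rng.uniform(0, 2, size=K))      # factor variances differ across clusters
    Psi_g = np.diag(rng.uniform(0, 1, size=P))    # diagonal error variances
    Sigma.append(Lambda @ L_g @ Lambda.T + Psi_g)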

A nice feature of the semi-constrained models in Table 2 is that, under mild conditions, the factors are unique. In other terms, it is not possible to rotate the factors as in the classical factor analysis model. This property can be shown by using the following result found by Kruskal (1977). Let us denote by k-rank(Z) the so-called k-rank of a matrix Z. It is defined as the largest number k such that every subset of k columns of Z is linearly independent. Moreover, let (A, B, C) and (\({\textbf {A}}_T\), \({\textbf {B}}_T\), \({\textbf {C}}_T\)) be two triplets of matrices with K columns such that

$$\begin{aligned} {\textbf {A}}\text{ diag }({\textbf {c}}_g){\textbf {B}}^{\prime } = {\textbf {A}}_T\text{ diag }({\textbf {c}}_{Tg}){\textbf {B}}^{\prime }_T \end{aligned}$$
(1)

with \(g=1,2,\ldots , G\), where \(\text{ diag }({\textbf {d}})\) is the diagonal matrix having the elements of the vector \({\textbf {d}}\) on the main diagonal, and \({\textbf {c}}_g\) [\({\textbf {c}}_{Tg}\)] is the g-th row of the \(G \times K\) matrix \({\textbf {C}}\) [\({\textbf {C}}_T\)]. Kruskal (1977) has shown that if

$$\begin{aligned} k\text{-rank }({\textbf {A}})+k\text{-rank }({\textbf {B}})+k\text{-rank }({\textbf {C}})\ge 2K+2 \end{aligned}$$
(2)

then there exists a permutation matrix \({\textbf {P}}\) and three diagonal matrices \({\textbf {D}}_A\), \({\textbf {D}}_B\) and \({\textbf {D}}_C\), for which \({\textbf {D}}_A{\textbf {D}}_B{\textbf {D}}_C = {\textbf {I}}\), where I denotes the identity matrix, such that

$$\begin{aligned} {\textbf {A}}_T={\textbf {A}}{\textbf {P}}{\textbf {D}}_A,\quad {\textbf {B}}_T={\textbf {B}}{\textbf {P}}{\textbf {D}}_B,\quad {\textbf {C}}_T={\textbf {C}}{\textbf {P}}{\textbf {D}}_C. \end{aligned}$$
(3)

In words, if (2) holds then the solution (A, B, C) is unique up to scaling and a simultaneous column permutation. Although Kruskal’s condition has been extended by other authors (see Giordani et al. 2020 for an overview), what follows is based on this condition because practitioners mainly refer to it in their applications. In our case, let us suppose that the part of the covariance matrices due to the common factors does not have a unique representation, so that it is possible to write

$$\begin{aligned} \varvec{\varLambda }{\textbf {L}}_g\varvec{\varLambda }^{\prime }=\varvec{\varLambda }_T{\textbf {L}}_{Tg}\varvec{\varLambda }_T^{\prime } \end{aligned}$$
(4)

with \(g=1,2,\ldots ,G\). Denoting by \({\textbf {M}}\) the \(G\times K\) matrix having the diagonal of \({\textbf {L}}_g\) as its g-th row, from Kruskal’s result we deduce that if

$$\begin{aligned} 2 \cdot k\text {-rank}(\varvec{\varLambda })+k\text{-rank }({\textbf {M}})\ge 2K+2 \end{aligned}$$
(5)

then \(\varvec{\varLambda }\) [\({\textbf {M}}\)] differs from \(\varvec{\varLambda }_T\) [\({\textbf {M}}_T\)] only in the scaling and/or position of the columns. It is very important to note that inequality (5) is satisfied if \(\varvec{\varLambda }\) and \({\textbf {M}}\) are of full column rank, as is usual in practical applications.
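A brute-force numerical check of condition (5) is sketched below: the k-rank is computed by testing every subset of columns, which is viable only for small matrices, and all the names and the example dimensions are ours.

import numpy as np
from itertools import combinations

def k_rank(Z, tol=1e-10):
    """Largest k such that every subset of k columns of Z is linearly independent."""
    k = 0
    for size in range(1, Z.shape[1] + 1):
        if all(np.linalg.matrix_rank(Z[:, list(cols)], tol=tol) == size
               for cols in combinations(range(Z.shape[1]), size)):
            k = size
        else:
            break
    return k

# Condition (5): 2 * k-rank(Lambda) + k-rank(M) >= 2K + 2, where row g of M is diag(L_g).
rng = np.random.default_rng(2)
P, K, G = 8, 2, 3
Lambda = rng.uniform(-1, 1, size=(P, K))
M = rng.uniform(0, 2, size=(G, K))
print(2 * k_rank(Lambda) + k_rank(M) >= 2 * K + 2)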

For a random i.i.d. sample of size N, the log-likelihood is

$$\begin{aligned} \ell (\varvec{\theta })=\sum _{n=1}^N\log \left[ \sum _{g=1}^G p_g \phi ({\textbf {y}}^{{\bar{O}}}_n; \varvec{\mu }^{{\bar{O}}}_g,\varvec{\varSigma }_g^{{\bar{O}}{\bar{O}}}) \pi ({\textbf {x}}_n;\varvec{\mu }_{ng}^{O\mid {\bar{O}}},\varvec{\varSigma }_g^{O \mid {\bar{O}}},\varvec{\gamma }) \right] \end{aligned}$$

where

$$\begin{aligned} \varvec{\mu }_g^{{\bar{O}}}= & {} \text{ E }[{\textbf {y}}^{{\bar{O}}}\mid g], \varvec{\varSigma }_g^{{\bar{O}}{\bar{O}}}=\text{ V }({\textbf {y}}^{{\bar{O}}}\mid g), \\ \varvec{\mu }_{ng}^{O\mid {\bar{O}}}= & {} \text{ E }[{\textbf {y}}^{O}\mid {\textbf {y}}^{{\bar{O}}}_n,g], \varvec{\varSigma }_g^{O\mid {\bar{O}}}=\text{ V }({\textbf {y}}^{O}\mid {\textbf {y}}^{{\bar{O}}}_n, g) \end{aligned}$$

and

$$\begin{aligned} \pi ({\textbf {x}}_n;\varvec{\mu }_{ng}^{O\mid {\bar{O}}},\varvec{\varSigma }_g^{O \mid {\bar{O}}},\varvec{\gamma })= & {} \int _{\gamma _{c_{{\bar{O}}+1}-1}^{({\bar{O}}+1)}}^{\gamma _{c_{{\bar{O}}+1}}^{({\bar{O}}+1)}}\cdots \int _{\gamma _{c_P-1}^{(P)}}^{\gamma _{c_{P}}^{(P)}} \phi ({\textbf {y}}^{O};\varvec{\mu }_{ng}^{O \mid {\bar{O}}},\varvec{\varSigma }_g^{O \mid {\bar{O}} })\,d{\textbf {y}}^{O} \end{aligned}$$

is the probability of the response pattern \({\textbf {x}}_n\) in the g-th mixture component, with mean and covariance matrix conditioned on the continuous variables. As said before, the covariance matrices could have different structures according to the specific parsimonious model chosen (see Tables 1 and 2). This likelihood causes non-trivial computational problems due to the presence of multidimensional integrals. In the next section we solve this problem through the use of a composite likelihood.
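To make the computational burden concrete, the sketch below evaluates a rectangle probability such as \(\pi \) by inclusion-exclusion over the corners of the integration region, using the multivariate normal CDF; the number of CDF evaluations grows as \(2^{O}\), which is precisely what motivates the composite likelihood of the next section. The helper name and the numerical values are ours, and infinite thresholds would have to be replaced by suitably large finite values before calling the CDF.

import numpy as np
from itertools import product
from scipy.stats import multivariate_normal

def rectangle_prob(lower, upper, mean, cov):
    """P(lower <= Y < upper) for a multivariate normal via inclusion-exclusion
    over the corners of the rectangle (2^O CDF evaluations)."""
    mvn = multivariate_normal(mean=mean, cov=cov)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    prob = 0.0
    for corner in product((0, 1), repeat=lower.size):
        corner = np.asarray(corner)
        point = np.where(corner == 1, upper, lower)
        prob += (-1) ** int((corner == 0).sum()) * mvn.cdf(point)
    return prob

# Three ordinal variates, each falling in the cell [0, 1) of the thresholds [0, 1, 2]
cov = 0.7 * np.eye(3) + 0.3 * np.ones((3, 3))
print(rectangle_prob([0.0] * 3, [1.0] * 3, mean=[0.5, 0.0, -0.5], cov=cov))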

3 Estimation

As suggested in Ranalli and Rocci (2017) and references therein, a composite likelihood approach can be adopted. It allows us to simplify the problem by replacing the full likelihood with a surrogate function based on m-dimensional marginals. It is a robust estimation method and its estimators have been proven to be consistent, asymptotically unbiased and normally distributed under some mild regularity conditions (Lindsay 1988; Varin et al. 2011; Molenberghs and Verbeke 2005). In general they are less efficient than the full maximum likelihood estimators, or than estimators obtained with a higher m, but in many cases the loss in efficiency is very small or negligible (Lindsay 1988; Mardia et al. 2009).

In the sequel, we refer to the case based on \(O(O-1)/2\) marginal distributions, each composed of two ordinal variables and all \({\bar{O}}\) continuous variables. The composite log-likelihood is then the sum of \(O(O-1)/2\) sub-log-likelihoods, one for each marginal distribution. In formulas,

$$\begin{aligned} c\ell (\varvec{\theta })= \sum _{i=1}^{O-1} \sum _{j=i+1}^O \sum _{n=1}^N \log \left[ \sum _{g=1}^G p_g \phi ({\textbf {y}}_n^{{\bar{O}}};\varvec{\mu }_g^{{\bar{O}}},\varvec{\varSigma }_g^{{\bar{O}}{\bar{O}}}) \pi ({\textbf {x}}_n^{ij};\varvec{\mu }_{ng}^{ij\mid {\bar{O}}},\varvec{\varSigma }_g^{ij\mid {\bar{O}}},\varvec{\gamma }^{ij})\right] , \end{aligned}$$

where \(\pi ({\textbf {x}}_n^{ij};\varvec{\mu }_{ng}^{ij\mid {\bar{O}}},\varvec{\varSigma }_g^{ij\mid {\bar{O}}},\varvec{\gamma }^{ij})\) is the conditional probability of the response pattern \({\textbf {x}}_n^{ij}\), i.e. the response pattern \({\textbf {x}}_n\) restricted to the variables i and j only, given all the \({\bar{O}}\) continuous variables, i.e. \({\textbf {Y}}^{{\bar{O}}}={\textbf {y}}_n^{{\bar{O}}}\), while \(\varvec{\gamma }^{ij}\) is the set of thresholds for variables i and j. This conditional probability is obtained by integrating the density of a bivariate normal distribution with parameters \((\varvec{\mu }_{ng}^{ij\mid {\bar{O}}},\varvec{\varSigma }_g^{ij\mid {\bar{O}}})\) between the corresponding threshold parameters contained in \(\varvec{\gamma }^{ij}\). The computation of parameter estimates is carried out by running a standard EM algorithm simultaneously on each sub-likelihood, all sharing the same set of parameters. We start by writing the complete composite log-likelihood \(c\ell _c(\varvec{\theta })\), introducing the group membership matrix \({\textbf {z}}^{ij}\) indicating whether observation n belongs to mixture component g in the sub-likelihood corresponding to the marginal distribution of the ordinal variables i and j and all the continuous variables, as follows

$$\begin{aligned} c\ell _c(\varvec{\theta })= \sum _{i=1}^{O-1} \sum _{j=i+1}^O \sum _{n=1}^N \sum _{g=1}^G z_{ng}^{ij}\log \left[ p_g \phi ({\textbf {y}}_n^{{\bar{O}}};\varvec{\mu }_g^{{\bar{O}}},\varvec{\varSigma }_g^{{\bar{O}}{\bar{O}}}) \pi ({\textbf {x}}_n^{ij};\varvec{\mu }_{ng}^{ij\mid {\bar{O}}},\varvec{\varSigma }_g^{ij\mid {\bar{O}}},\varvec{\gamma }^{ij})\right] , \end{aligned}$$

The E-step requires the computation of the expected value of the complete-data composite log-likelihood given the current estimates of the model parameters. This is given by

$$\begin{aligned} Q(\varvec{\theta }\mid \hat{\varvec{\theta }}^{(r-1)})=\text{ E}_{\varvec{\theta }^{(r-1)}}\left[ c\ell _c(\varvec{\theta };{\textbf {y}}^{{\bar{O}}},{\textbf {x}},{\textbf {z}}\mid {\textbf {y}}^{{\bar{O}}},{\textbf {x}}) \right] . \end{aligned}$$

At the r-th iteration, the E-step consists of updating the group membership matrix \({\textbf {z}}^{ij}\) of order \(N\times G\) as

$$\begin{aligned} {\hat{z}}_{ng}^{ij}= \dfrac{p_g \phi ({\textbf {y}}_n^{{\bar{O}}};\varvec{\mu }_g^{{\bar{O}}},\varvec{\varSigma }_g^{{\bar{O}}{\bar{O}}}) \pi ({\textbf {x}}_n^{ij};\varvec{\mu }_{ng}^{ij\mid {\bar{O}}},\varvec{\varSigma }_g^{ij\mid {\bar{O}}},\varvec{\gamma }^{ij})}{\sum _{h=1}^G p_h \phi ({\textbf {y}}_n^{{\bar{O}}};\varvec{\mu }_h^{{\bar{O}}},\varvec{\varSigma }_h^{{\bar{O}}{\bar{O}}}) \pi ({\textbf {x}}_n^{ij};\varvec{\mu }_{nh}^{ij\mid {\bar{O}}},\varvec{\varSigma }_h^{ij\mid {\bar{O}}},\varvec{\gamma }^{ij})}, \end{aligned}$$

for \(i=1,\ldots , O-1\), \(j=i+1,\ldots , O\). Then, given the E-step, the M-step is performed in blocks. First, at iteration r, the mixing weights are updated by averaging the group membership matrices; then the complete composite log-likelihood is maximized with respect to the other parameters. Since the parameter estimates of the mixture components do not have a closed form, we use an optimization routine to obtain all the parameter estimates (apart from \(p_1,\ldots ,p_G\)). More precisely, we use an optimization routine (“fmincon”) in Matlab based on a quasi-Newton approximation (for more details see MATLAB (2013)). Any other optimization routine can be used: in any case, the complete composite log-likelihood needs to be coded such that each block of marginals is weighted by the corresponding group membership matrix updated in the E-step. Given the parameter estimates, the E-step can be performed once again. The E and M steps are repeated until convergence is reached. We halted the estimation process and assumed convergence to the maximum when the relative difference between two consecutive composite log-likelihood values was less than \(10^{-5}\).
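The E-step update above is straightforward to code once the conditional rectangle probabilities are available; a minimal sketch for a single pair (i, j) is given below, where the array names and the choice of passing the rectangle probabilities as precomputed input are our own.

import numpy as np
from scipy.stats import multivariate_normal

def e_step_pair(p, mu_cont, Sigma_cont, pi_ij, Y_cont):
    """Posterior memberships z^{ij} for one pair (i, j) of ordinal variables.

    p          : (G,) mixing weights
    mu_cont    : list of G mean vectors of the continuous variables
    Sigma_cont : list of G covariance matrices of the continuous variables
    pi_ij      : (N, G) conditional probabilities pi(x_n^{ij} | y_n^{cont}, g)
    Y_cont     : (N, P_cont) observed continuous variables
    """
    fit = np.column_stack([
        p[g] * multivariate_normal(mu_cont[g], Sigma_cont[g]).pdf(Y_cont) * pi_ij[:, g]
        for g in range(len(p))
    ])
    return fit / fit.sum(axis=1, keepdims=True)   # each row sums to one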

3.1 Classification

As regards the classification, each observation is assigned to the component with the maximum fit according to the CMAP criterion (Ranalli and Rocci 2017). In the context of standard mixture models, the classification of the observations is usually based on the MAP criterion, i.e. the observation is assigned to the component with the maximum posterior probability. However, since the composite likelihood is constructed as the product of \(O(O-1)/2\) sub-likelihoods, following the same principle, the fit of each observation is obtained by multiplying the corresponding \(O(O-1)/2\) fits

$$\begin{aligned} s_{gn}(\varvec{\theta })= \prod _{i=1}^{O-1} \prod _{j=i+1}^O \left[ p_g \phi ({\textbf {y}}_n^{{\bar{O}}};\varvec{\mu }_g^{{\bar{O}}},\varvec{\varSigma }_g^{{\bar{O}}{\bar{O}}}) \pi ({\textbf {x}}_n^{ij};\varvec{\mu }_{ng}^{ij\mid {\bar{O}}},\varvec{\varSigma }_g^{ij\mid {\bar{O}}},\varvec{\gamma }^{ij}) \right] , \end{aligned}$$

In order to express the fit in terms of degrees of membership, the fit of each observation is normalized so that it varies between 0 and 1, that is

$$\begin{aligned} \max _g \dfrac{s_{gn}}{\sum _{h=1}^G s_{hn}}. \end{aligned}$$
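A compact way to implement this classification rule, working on the log scale to avoid underflow when multiplying many per-pair fits, is sketched below; the array layout is our own convention.

import numpy as np

def cmap_classify(log_fit_pairs):
    """log_fit_pairs: (n_pairs, N, G) array with the logs of
    p_g * phi(y_n; ...) * pi(x_n^{ij}; ...) for every pair (i, j).
    Returns hard labels and normalized degrees of membership."""
    log_s = log_fit_pairs.sum(axis=0)                        # log of the product over pairs
    s = np.exp(log_s - log_s.max(axis=1, keepdims=True))     # stabilized before normalizing
    membership = s / s.sum(axis=1, keepdims=True)
    return membership.argmax(axis=1), membership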

3.2 Model selection

In the estimation procedure, we assume that the number of mixture components and the structure of the covariance matrices are fixed. In practice, they are often unknown and thus have to be selected from the data. A criterion to select the best model could be the so-called composite BIC (Gao and Song 2010). However, its use requires the computation of the gradient of the contribution of each observation to the composite log-likelihood, see Ranalli and Rocci (2017) for details. This makes its use rather cumbersome or infeasible when the dimensionality of the data increases. For this reason, in this work the best model is chosen by selecting the one minimizing the additive BIC, that is, the sum of the BICs computed for each sub-likelihood. We refer to the additive BIC as aBIC. The idea is quite simple: each BIC should attain its minimum at the true model, and so should their sum. The only weakness in this reasoning is the fact that the BICs are computed by using the composite likelihood estimates instead of the full likelihood ones. However, if the sample size is large enough they should not be very different.
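Under this reading of the aBIC, a minimal sketch of the computation is given below, assuming that the maximized value of each sub-log-likelihood and the number of free parameters appearing in it are available; the function name is ours.

import numpy as np

def additive_bic(sub_loglik, sub_nparams, N):
    """aBIC = sum over the O(O-1)/2 sub-likelihoods of the usual BIC,
    -2 * loglik + nparams * log(N), each evaluated at the composite estimates."""
    sub_loglik = np.asarray(sub_loglik, dtype=float)
    sub_nparams = np.asarray(sub_nparams, dtype=float)
    return float(np.sum(-2.0 * sub_loglik + sub_nparams * np.log(N)))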

3.3 Identifiability

A further important point of the proposed model that is worth discussing is parameter identifiability. To estimate both thresholds and component parameters when all the ordinal variables have at least three categories, we set the first two thresholds to 0 and 1, respectively. This identification constraint allows us to identify uniquely the means and variances of the latent variates of the mixture components (ignoring the label switching problem), as well described in Millsap and Yun-Tein (2004). This parameterization is equivalent to the one used by Jöreskog and Sörbom (1996), where the means and the variances of the latent variables (of the first component, in the mixture framework) are set to 0 and 1, respectively. There is a one-to-one correspondence between the two sets of parameters. If there are binary variables, then their unique threshold should be set to zero and their variances set equal to 1 (while their means should still be kept free). However, this is a necessary condition, not a sufficient one. Within a full maximum likelihood approach, it is well known that a sufficient condition for local identifiability is the non-singularity of the information matrix, while a necessary condition is that the number of parameters must be less than or equal to the number of canonical parameters. Such conditions should be modified when model parameters are estimated by maximizing a composite likelihood. The sufficient condition should be reformulated by investigating the Godambe information matrix, that is, the analogue of the information matrix in composite likelihood estimation. However, as far as we know, such a modification has not been formally investigated yet.

As for the necessary condition, we note that only some marginal distributions are involved in the composite likelihood. This implies that we have to count the number of canonical parameters by considering only the ones involved in such marginals. As an example, if there are only ordinal variables, the number of canonical parameters equals the number of non-redundant parameters involved in the bivariate marginals. This equals the number of parameters of a log-linear model with only two-factor interaction terms. In particular, given a \(C_{{\bar{O}}+1}\times C_{{\bar{O}}+2}\times \ldots \times C_{O}\) contingency table, such a number is

$$\begin{aligned} \sum _{i={\bar{O}}+1}^{O}(C_{i}-1)+\sum _{i={\bar{O}}+1}^{O-1}\sum _{j=i+1}^{O}(C_i-1)(C_j-1). \end{aligned}$$
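A small helper implementing this count, for an arbitrary vector of category numbers, might look as follows (the function name is ours).

def n_canonical_params(categories):
    """Number of non-redundant parameters in the univariate and bivariate marginals
    of ordinal variables with the given numbers of categories (formula above)."""
    main = sum(C - 1 for C in categories)
    pairwise = sum((categories[i] - 1) * (categories[j] - 1)
                   for i in range(len(categories) - 1)
                   for j in range(i + 1, len(categories)))
    return main + pairwise

print(n_canonical_params([4, 4, 4, 4]))   # 12 + 6 * 9 = 66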

However, heuristically, we are always able to see if a model is not identified, that is when the same maximized likelihood (or composite likelihood) is obtained with different parameter estimates.

The factorial reparameterisation of a component-specific covariance is not uniquely identified for the models in Table 1. Indeed, we note that it has the same rotational freedom that characterizes the classical factor analysis model: only the subspaces generated by the columns of \(\varvec{\varLambda }_g\) are identified. In order to estimate such subspaces, we impose some constraints on the model parameters, in complete analogy with what is usually done in the factor analysis model. In this way, we select a particular solution, one which is convenient to find, and leave the experimenter to apply whatever rotation he thinks desirable, as suggested by Lawley and Maxwell (1962). In particular, we require a lower triangular form in the first K rows of the loading matrix. Of course, after the estimation the parameter matrices can be rotated to enhance the interpretation. In Sect. 2, we have shown that such rotational freedom disappears for the models in Table 2. However, in this case the columns of the matrix of factor loadings \(\varvec{\varLambda }\) can be arbitrarily rescaled by adjusting the matrices \({\textbf {L}}_g\) accordingly. We remove such ambiguity by setting \({\textbf {L}}_1={\textbf {I}}\). It is important to note that rules for the identifiability of a factor analysis model (Shapiro 1985), like the so-called Ledermann bound (Ledermann 1937) for the number of factors

$$\begin{aligned} K \le P+(1-\sqrt{8P+1})/2, \end{aligned}$$

hold for the unconstrained models, while they could probably be relaxed for the constrained ones. Finally, we mention that in factor analysis there is the possibility that the estimate of the error variance of a variable, the so-called uniqueness, is exactly zero. Such a possibility is named a Heywood case and is considered, in the factor analysis field, an improper solution, because it corresponds to assuming that one of the common factors coincides with one of the variables. In our experience we did not encounter Heywood cases; however, should this happen, we suggest introducing some constraints, e.g. \(\varvec{\varPsi }_g=\psi _g{\textbf {I}}\), or eliminating the variables causing the Heywood case (see Farooq 2022 and references therein).

4 Related models

The present proposal can be considered an extension of the work of Ranalli and Rocci (2017), where mixed-type data, with ordinal and continuous variables, are used to cluster a sample of observations. The model is a finite mixture of Gaussians where some variates, the ones corresponding to the ordinal variables, are observed only through a discretization. The parameters are estimated by maximizing a composite likelihood built on three blocks. The first is given by all the continuous variables, the second by all the bivariate marginals obtained considering pairs of ordinal variables, and the third by the marginals obtained considering one ordinal variable and all the continuous ones. In this paper we refine this scheme, eliminating the first two blocks and extending the third by including two ordinal variables and all the continuous ones in each sub-likelihood. This modification allows us to improve computational efficiency without worsening the quality of the estimates. Another improvement over Ranalli and Rocci (2017) is the introduction of several possible parsimonious reparameterizations for the covariance matrices, starting from the idea of McNicholas and Murphy (2008), formulated only for continuous data. A similar approach has also been adopted by Mcparland and Gormley (2015). It is a model-based clustering procedure for data of mixed type based on latent variables. The latter, following a mixture of Gaussian distributions, generate the observed data of mixed type: continuous, ordinal, binary or nominal. It employs a parsimonious diagonal covariance structure for the latent variables, leading to six clustering models of varying complexity. Each model can be estimated using the package clustMD available in R. The main differences with our approach are that the threshold parameters are estimated in a separate step using the single variables, and that, although nominal variables are also considered, the approach is essentially based on the local independence assumption, i.e. the variables are independent conditionally on the groups. As a side note, it is necessary to caution the reader about the presence of a further model in the R package clustMD, called the BD model, for which there is no theoretical explanation of the assumptions underlying the corresponding data generation process. Due to the lack of information about the model and the method/algorithm used for the parameter estimation, we decided to exclude it from the main analysis. In the following simulation study, we focus only on the first six parsimonious models included in the R package clustMD. However, as explicitly requested by an anonymous reviewer, we have also considered the BD model as a possible competitor of our proposal in the supplementary material, although, given the lack of its description, we have not been able to make reasonable comments on the results.

The mixture of factor analyzers model (Ghahramani and Hinton 1996) has been extended to mixed-type data by McParland et al. (Mcparland et al. 2014, 2017). Compared to our proposal there are some differences: the authors estimate the model using a Bayesian approach; furthermore, they also consider variables measured on a nominal scale, but they constrain the diagonal matrix of uniquenesses, \(\varvec{\varPsi }\) in our notation, to be equal to the identity matrix in each component.

Finally, it is important to note that there are approaches where the variables do not play a symmetric role. For example, this happens in Murphy and Murphy (2020), where the model for each component is a regression and the distribution of some continuous variables is formulated conditionally on some covariates that may be categorical. A similar example is the proposal of Ingrassia et al. (2015), where the regressions are univariate and the covariates are random with a joint distribution built on the hypothesis of local independence.

5 Simulation study

In this section, we illustrate and discuss the results of a large simulation study aimed at assessing the effectiveness of the maximum composite likelihood estimator under different settings in terms of sample size, number of components and factors, and number of variables and categories. The composite estimator has also been compared to the full likelihood one in terms of precision and computational time. A further comparison has been made in terms of the Adjusted Rand index (Hubert and Arabie 1985) by considering the naive approach, i.e. our model treating all variables as if they were continuous, and the six parsimonious models of clustMD (Mcparland and Gormley 2015). Furthermore, as explicitly requested by an anonymous reviewer, we have also considered the BD model as a possible competitor of our proposal in the supplementary material. Finally, we tested the effectiveness of the aBIC in finding the correct number of components.

The experiments are conducted by generating the data from the (SC)UU model with eight variables: four continuous and four ordinal with four categories. In some experiments the number of variables and the number of categories have been increased to 15 (of which 10 ordinal) and 10, respectively. For \(G=2\), we specify the mixture weights as \(p_g=[0.30, 0.70]\) and the group-specific mean vectors as

$$\begin{aligned} \varvec{\mu }_1= & {} [-0.5, 0.5, 1, 1, -1, 2, -2, -1], \\ \varvec{\mu }_2= & {} [1.5, 1.5, 0, 0, 1, 0, 2, 0]. \end{aligned}$$

For \(G=3\), we specify the mixture weights as \(p_g=[0.25, 0.35, 0.40]\) and add the further group-specific mean vector,

$$\begin{aligned} \varvec{\mu }_3=[-0.5, -0.5, -1, -1, 0, -2, 0, -1]. \end{aligned}$$

Finally, \(\varvec{\varLambda }\) is randomly drawn from a uniform distribution on the interval \([-1, 1]\), the diagonal elements of \({\textbf {L}}_g\) are randomly drawn from a uniform distribution on the interval [0, 2], and a reasonable level of error is added by generating the diagonal elements of \(\varvec{\varPsi }_g\) from a uniform distribution on [0, 1]. The thresholds for the ordinal variables are [0, 1, 2] when the categories are four. When the categories are five or ten, the thresholds are [0, 1, 1.5, 2.5] and [0, 1, 1.5, 1.833, 2.166, 2.499, 2.832, 3.165, 3.498], respectively.
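A minimal sketch of one way to generate data consistent with this design (for \(G=2\), four continuous and four ordinal variables with four categories) is reported below; it follows one reading of the (SC)UU structure, \(\varvec{\varSigma }_g=\varvec{\varLambda }{\textbf {L}}_g\varvec{\varLambda }^{\prime }+\varvec{\varPsi }_g\), and is not the code actually used in the study.

import numpy as np

rng = np.random.default_rng(0)
N, P, O, K, G = 500, 8, 4, 2, 2                      # 4 continuous + 4 ordinal variables
p = np.array([0.30, 0.70])
mu = np.array([[-0.5, 0.5, 1, 1, -1, 2, -2, -1],
               [ 1.5, 1.5, 0, 0,  1, 0,  2,  0]])
gamma = np.array([-np.inf, 0.0, 1.0, 2.0, np.inf])   # thresholds for four categories

Lambda = rng.uniform(-1, 1, size=(P, K))
L = rng.uniform(0, 2, size=(G, K))                   # diagonal saliences per group
Psi = rng.uniform(0, 1, size=(G, P))                 # diagonal error variances per group

z = rng.choice(G, size=N, p=p)                       # latent group labels
Y = np.empty((N, P))
for g in range(G):
    idx = (z == g)
    Sigma_g = Lambda @ np.diag(L[g]) @ Lambda.T + np.diag(Psi[g])
    Y[idx] = rng.multivariate_normal(mu[g], Sigma_g, size=idx.sum())

Y_cont = Y[:, :P - O]                                # observed continuous variables
X_ord = np.digitize(Y[:, P - O:], gamma[1:-1]) + 1   # ordinal variables via thresholding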

Maximum composite likelihood estimates are computed by following the EM-like algorithm previously described. We halted the estimation process and assumed convergence to the maximum when the relative difference between two consecutive composite log-likelihood values was less than \(10^{-5}\). To initialize the model parameters, we worked out the output of the Gaussian mixture model in which we treated all variables as if they were continuous. The initial values for the thresholds have been computed as follows: for each variable, we considered the empirical relative frequency of each category and then minimized the quadratic difference between this frequency and the corresponding probability of the mixture. As regards the factor loadings, starting from the component-specific covariance matrix (the output of the Gaussian mixture model) we estimate a factor analysis model. Then we rotate the obtained loading matrix in order to obtain a lower triangular form for the square sub-matrix given by the first K rows. The error variances are obtained as the difference between the main diagonal of the within covariance matrix of the Gaussian mixture model output and \(\varvec{\varLambda }_g\varvec{\varLambda }_g^{\prime }\). We averaged \(\varvec{\varLambda }_g\) to get \(\varvec{\varLambda }\). Finally, \({\textbf {L}}_g\) is set to the identity. Of course, this initialization can be adapted properly to accommodate the other cases listed in Tables 1 and 2. The choice of initial values influences the speed of convergence of the algorithm and its ability to reach the global maximum. We suggest using a rational start because, in our experience, a purely random initialization is generally far worse than our rational start in terms of local optima and computational time. However, our rational start does not guarantee reaching the global optimum and further studies are needed to improve it.
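A simplified sketch of the threshold initialization described above is given below: for one ordinal variable, the free thresholds (beyond the two fixed at 0 and 1) are chosen to minimize the squared difference between the empirical category frequencies and the category probabilities implied by the univariate Gaussian mixture obtained in the naive step. All names, the optimizer and the crude starting values are our own choices, not the original implementation.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def init_thresholds(freq, weights, means, sds):
    """freq: empirical relative frequencies of the C categories of one ordinal variable;
    weights, means, sds: parameters of the univariate Gaussian mixture from the naive fit.
    Assumes C >= 4, with the first two thresholds fixed at 0 and 1."""
    C = len(freq)

    def category_probs(gamma):
        cuts = np.concatenate(([-np.inf], gamma, [np.inf]))
        cdf = sum(w * norm.cdf((cuts - m) / s) for w, m, s in zip(weights, means, sds))
        return np.diff(cdf)                      # mixture-implied category probabilities

    def loss(free):
        gamma = np.concatenate(([0.0, 1.0], np.sort(free)))
        return np.sum((category_probs(gamma) - np.asarray(freq)) ** 2)

    start = 1.0 + np.arange(1, C - 2, dtype=float)        # crude starting values 2, 3, ...
    res = minimize(loss, start, method="Nelder-Mead")
    return np.concatenate(([0.0, 1.0], np.sort(res.x)))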

We analyzed the performance of an estimator by computing, for each sample, the squared Euclidean distance between the estimates and the true values for different sets of parameters: group-specific means (\({\varvec{\mu }_c}\), \({\varvec{\mu }_o}\)), thresholds (\({\varvec{\gamma }}\)), mixture weights (p), factor loadings (\({\varvec{\varLambda }}\)), saliences (L), and uniquenesses (\({\varvec{\varPsi }}\)). These indexes serve to evaluate the accuracy of the estimators. In order to identify uniquely the sign of the columns of the loading matrices, in the simulation study we impose the first non-zero element in each column to be positive. The performance in recovering the cluster structure is measured by the Adjusted Rand Index (ARI) (Hubert and Arabie 1985), which is a measure of agreement between the estimated and the true cluster memberships. It takes its maximum (value one) in the case of perfect agreement.

In what follows, we describe the different experiments that compose our simulation study.

5.1 Efficiency of the composite estimator

In this experiment we assess the performance of the composite likelihood estimator over different scenarios obtained by combining three factors: number of observations (\(N=500, 1000\)), number of groups (\(G=2, 3\)) and number of latent factors (\(K=2, 4\)). For each scenario, we generated 250 samples and computed the aforementioned measures of performance. The results are reported in Table 3.

Table 3 Quartiles, mean and standard deviation (in brackets) of the imprecision indexes for the parameter estimates, and the Adjusted Rand index (ARI) obtained by the EM-like algorithm over 250 samples generated in different settings from the model (SC)UU

Overall, we note the expected behaviour of the composite estimator. The sample size N and the model complexity influence the imprecision of the estimates and the goodness of the classification. In particular, when N increases the imprecision decreases while the ARI increases, i.e. the estimator performs better. On the contrary, when the model complexity increases, in terms of the number of components G and/or latent factors K, the estimator efficiency decreases.

5.2 Comparison between the composite and the full likelihood estimators

In this experiment we compare the composite likelihood with the full likelihood approach. This allows us to evaluate both the statistical efficiency (the full likelihood is more precise) and the computational efficiency (the composite likelihood is much faster). We generated 250 samples of size \(N=500\), with \(G=3\) components and \(K=4\) latent factors. Table 4 displays the results in terms of imprecision indexes and computational time.

Table 4 Quartiles, mean and standard deviation (in brackets) of the imprecision indexes for the parameter estimates obtained by maximizing the composite likelihood and the full likelihood over 250 samples generated from the model (SC)UU with \(G=3\), \(K=4\), \(N=500\)

In terms of error in parameter estimation, the full likelihood is slightly more efficient. However, it is more than 4 times slower in median, and 11 times slower in mean, than the composite likelihood. It follows that the loss in statistical efficiency of the composite likelihood is well repaid in terms of computational time.

The experiment has been extended by considering, on the same samples, also the naive approach, where the ordinal variables are treated as continuous, and the six models of the clustMD package. Being different models with different parameters, the performances have been compared only in terms of goodness of recovery using the ARI. The results are reported in Table 5.

Table 5 Quartiles, mean and standard deviation (in brackets) of the ARI for the estimated partition obtained by different estimators over 250 samples generated from the model (SC)UU with \(G=3\), \(K=4\) and \(N=500\)

In terms of recovering the clustering structure, the full likelihood is the best, even if the composite likelihood shows similar results (0.80 vs. 0.79 in terms of means). The ARI for the naive case is 0.6 in mean. The six parsimonious models show ARI values inferior to the others, between 0.58 and 0.60. This is probably due to the fact that they assume local independence, which in this case does not hold.

5.3 The effect of a higher number of categories

In this experiment we evaluate the effect of a higher number of categories for the ordinal variables. This allows us to evaluate how the error in parameter estimation changes with the number of categories. Some categories could have zero or very low frequencies, but this is not a problem for the algorithm. Furthermore, we compare the clustering performance, in terms of ARI, of the proposal with the naive case (i.e. our model treating all variables as if they were continuous) and the six parsimonious models available in the package clustMD. We considered the following design: number of observations (\(N=500\)), number of groups (\(G=2,3\)) and number of latent factors (\(K=4\)). The number of variables is 8, of which 4 are ordinal. For the ordinal variables we considered two cases: 5 or 10 categories for each ordinal variable. The other model parameters are generated as described previously under the model (SC)UU. We generated 250 samples.

Table 6 Quartiles, mean and standard deviation (in brackets) of the imprecision indexes for the parameter estimates obtained by maximizing the composite likelihood and the ARI for the estimated partition obtained by different estimators over 250 samples generated from the model (SC)UU with \(G=2,3\), \(K=4\), \(N=500\), ordinal variables with 5 or 10 categories

Table 6 displays the results in terms of imprecision indexes and ARI values. As regards the performance of the composite likelihood estimator, overall we noted that when the model complexity increases, in terms of the number of components G or a higher number of categories, the estimator efficiency decreases. In terms of recovering the clustering structure, we noted that in some cases the algorithm of the six models available in the R package clustMD was not able to provide a solution. Looking at the ARI values, our proposal is still the best (mean values varying between 0.78 and 0.87), while the naive case turns out to be the worst (mean values varying between 0.46 and 0.71). The six parsimonious models show lower ARI values; the mean values are between 0.54 and 0.84.

5.4 The effect of larger number of ordinal variables

Finally, in the last experiment we evaluate the effect of a higher number of ordinal variables. This allows us to evaluate how sparsity in the data could affect parameter estimation and classification performance. Some profiles could have zero or very low frequencies, but this is not a problem for the algorithm. Furthermore, we compare the clustering performance, in terms of ARI, of the proposal with the naive case (i.e. our model treating all variables as if they were continuous) and the six parsimonious models available in the package clustMD. We considered the following design: number of observations (\(N=1000\)), number of groups (\(G=2,3\)) and number of latent factors (\(K=4\)). The number of variables is 15, of which 10 are ordinal. We considered 5 categories for each ordinal variable.

For \(G=2\), we specify the group-specific mean vector by

$$\begin{aligned} \varvec{\mu }_1= & {} [0.5, 0.5, 1, 1, 0.5, 0.5, 1, 1, 1, 1, -1, 2, -2, -1, -1], \\ \varvec{\mu }_2= & {} [1.5, 1.5, 0, 0, 1.5, 1.5, 0, 0, 0, 0, 1, 0, 2, 0, 0]. \end{aligned}$$

For \(G=3\), we added the further group-specific mean vector,

$$\begin{aligned} \varvec{\mu }_3=[0.5, -0.5, -1, -1, 0.5, -0.5, -1, -1, -1, -1, 0, -2, 0, -1, -1]. \end{aligned}$$

The other model parameters are generated as described previously under the model (SC)UU. We generated 250 samples.

Table 7 Quartiles, mean and standard deviation (in brackets) of the imprecision indexes for the parameter estimates obtained by maximizing the composite likelihood and the ARI for the estimated partition obtained by different estimators over 250 samples generated from the model (SC)UU with \(G=2,3\), \(K=4\), \(N=1000\), 15 variables where 10 are ordinal

Table 7 displays the results in terms of imprecision indexes and ARI values. As regards the performance of the composite likelihood estimator, we do not observe particular issues. Once again, when the complexity of the model increases in terms of the number of components G, the estimator efficiency decreases, as expected. In terms of recovering the clustering structure, we noted that in many cases the algorithm of the six models available in the R package clustMD was not able to provide a solution, especially when \(G=3\). This may be due to the presence of zero-frequency categories. Also in this case, we noted that as the number of groups increases the ARI values decrease, although less than in the previous cases. This may be explained by the larger sample size (\(N=1000\)). More in detail, looking at the ARI values, our proposal is still the best: the mean values are equal to 0.97 and 0.79 for \(G=2\) and \(G=3\), respectively. The naive model turns out to be the worst when \(G=3\) (mean value 0.53), while the other six parsimonious models show lower ARI values in mean, varying between 0.69 and 0.85 when \(G=2\) and between 0.64 and 0.66 when \(G=3\).

5.5 A simulation study for the model selection

To complete our large simulation study, we tested the effectiveness of the aBIC in finding the correct number of components.

We considered the following design: number of observations (\(N=1000\)), number of groups (\(G=3\)) and number of latent factors (\(K=2\)). The number of variables is 8, of which 4 are ordinal. The ordinal variables have 5 categories. The other model parameters are generated as described previously under the model (SC)UU. We generated 250 samples. For each sample we fitted the true model, keeping the number of latent factors fixed. Further work is needed to calibrate the penalization term of the additive BIC.

Table 8 Percentage distribution of the number of components chosen by aBIC over 250 samples generated from the model (SC)UU with \(G=3\), \(K=2\) and \(N=1000\)

Looking at the results in Table 8, the aBIC is able to choose the right number of components most of the time (82.80%). In conclusion, even if the idea of measuring the complexity of the model by counting the number of parameters involved in each sub-likelihood may seem simplistic, in this context it gives good results. This does not exclude that further improvements are possible and needed.

6 Real data analysis

We apply the proposal to a set of data taken from the survey carried out by the Italian National Statistical Institute (ISTAT) in 2015 on academic graduates’ vocational integration, in which a sample of graduates who had attained their university degree four years earlier was interviewed. The aim of the survey is to detect graduates’ employment conditions about four years after graduation. We select the following variables: final grade for high school, age at graduation, final grade for the MSc degree, seven variables detecting job satisfaction, four variables detecting the propensity to move abroad, monthly income, length of study (in years), and gap between graduation and job (in months). We focus only on students with a Master’s degree in Economics and Statistics. Furthermore, we restricted the analysis to the observations with non-missing values. The final dataset is composed of 1033 students and 16 variables (12 ordinal variables and 4 continuous variables). The seven variables detecting job satisfaction have 11 categories, while the four variables detecting the propensity to move abroad have 4 categories. We fitted the twelve models to the data for \(G=2,3,4\) and \(K=1,2,3\), and computed the additive BIC values for each model. The values are shown in Fig. 1.

Fig. 1 Model selection according to additive BIC with \(G=2,3,4\) and \(K=1,2,3\)

The best model is the one minimizing the additive BIC. The model with the lowest additive BIC value was a three-component mixture with the UUU covariance structure and a single common factor. The factors change over the groups, as do the error terms for each variable. Table 9 reports the empirical medians and proportions within each group.

Table 9 Empirical medians and proportions within each group based on the classification given by the best model, i.e. UUU with \(G=3\) and \(K=1\)
Table 10 Correlations between the variables and the single common factor of each group

Looking at Table 9, it is possible to note that the first two groups are very similar in terms of final grade, satisfaction with the current job, and salary, although the median values and proportions are slightly different in terms of willingness to move abroad. On the other hand, in the third group the final mark, the satisfaction with the current job and the monthly income are the lowest, making this group quite different in terms of academic graduates’ vocational integration. More in detail, the first group is the smallest (18.48%) and is composed of the youngest graduates. They got the highest median final mark for the degree (110); they are very happy and satisfied with their current work; they are willing to move abroad only for a better qualification (93.10%) or a higher salary (69.83%); their median monthly income is 1750 euros, and finally, 50% of graduates started to work just two months after their graduation. The second group is the largest (55.83%), composed mainly of graduates who got 108 as median final mark for the degree. Furthermore, they are quite satisfied with their current job; they are also willing to move abroad for a better qualification (70.07%) or a higher salary (70.07%), although the percentages are lower than those in the first group; 50% of graduates found a job at most three months after graduation and their median monthly income is 1700 euros. The last group is composed of 26% of graduates. Half of them got 106.5 or less as final mark; they are not so satisfied with their current job, mainly in terms of long-term prospects (43.83%), salary (43.21%) and career prospects (40.12%). However, the proportion of graduates who want to move abroad is very similar to that of the second group; 50% of graduates found their job at most three months after the degree, but their median monthly income is the lowest, that is 1300 euros.

In each group we have a single factor that, being unique, acts as a sort of overall score. Looking at the correlations between the variables and the latent factors, we note that, as expected, they are all positive (see Table 10). The only exception is the grade for the degree in the second group, which is negative. This implies that in the second group this variable has a negative correlation with the other variables, while the correlations are positive in the other groups. In practice, in the second group a higher grade is associated with a lower degree of happiness and willingness to move. A possible explanation could be that in this group the expectations of the students are strictly related to the grade for the degree: a high grade implies high expectations that are rarely satisfied; a low grade implies low expectations that are easier to satisfy.

Furthermore, we fitted the data with the six flexible models available in the R package clustMD. However, we noted some issues: if some categories have zero frequencies or there are many categories, the algorithms are not able to provide a solution. To overcome this issue, we merged categories so that all the categories have non-zero frequencies and the number of categories is reduced. The resulting solutions are not directly comparable with ours; for this reason we do not report the results.

7 Discussion

In this paper, we have introduced a general class of parsimonious Gaussian mixture models for clustering ordinal and continuous variables. It includes, as special cases, known parameterizations proposed for continuous data, such as the mixture of factor analysers and the mixture of probabilistic principal component analysers. In order to increase the flexibility, we also adopted a new parameterization originally proposed in the context of multiway and multigroup data. The main advantage is that the number of covariance parameters grows linearly with the number of variables, rather than polynomially. This feature, along with the maximum composite likelihood estimation, makes the application of such models possible even for high-dimensional mixed-type data. The effectiveness of the proposal has been tested through a simulation study. Additionally, this class of models appears to be particularly good at modelling situations where some of the variables are highly correlated within the groups, as expected in high-dimensional data. The application to university graduates’ employment conditions about four years after graduation in Italy indicates that the model gives excellent clustering performance. The clusters found using the models showed a strong ability to capture the group structure, defining the main features of the graduates in each group.

Finally, we summarize some limits and possible extensions of our proposal. First of all, we note that count and nominal variables are not considered. Their introduction would be quite easy under a local independence assumption, i.e. assuming that the observed variables are independent within the components. However, this assumption is in contrast with the spirit of this work, where the dependencies among the variables within a component are modelled through a factor analysis model. Such a modelling is not trivial; for example, some authors argue that in some cases factor analysis is not applicable to nominal data (Revuelta et al. 2019), and it would probably lead to a different way of building the composite likelihood. All in all, further studies are needed. Another direction of extension of the present work is to introduce other ways to constrain the parameters of the factor analysis model across the groups. As an example, McNicholas and Murphy (2010) propose to rewrite the diagonal matrix of uniquenesses \(\varvec{\varPsi }_g\) as the product of a scalar \(\omega _g\) and a diagonal matrix \(\varvec{\varDelta }_g\) with determinant equal to 1. New parsimonious parameterizations are obtained by constraining the first or the second factor of the product to be the same across the components. In our model, such constraints could be applied to the uniqueness matrices and to the salience matrices \({\textbf {L}}_g\). Other models could simply be obtained by allowing the number of factors to be cluster specific. However, this does not mean that finding the best fitting model based on information criteria is trivial. Indeed, it may turn out to be time consuming, since we need to simultaneously choose \(G, K_1, \ldots , K_G\), and the covariance parameterisation. To overcome this issue it could be useful to adopt a penalization approach. As an example, by introducing a lasso term, as explored in a different context (see e.g. Khalili and Chen (2007); Chen and Khalili (2008)), it may be possible to define a regularization path for the model selection. This would allow us to fit a lower number of models and to choose the best one based on a pre-specified information criterion. This potential solution could be developed in a future work.