The goal of the simulation study is to explore to what extent GMM recovery with the EM algorithm and the BIC is impacted by observing ordinal variables with c categories instead of continuous variables. We study the drop in performance as a function of c conditional on the number of true components, the separation between them, the number of variables, and the sample size.
Setup
Considered simulation conditions
We independently vary the number of components in the mixture models \(K\ \in \{2,3,4 \}\), the number of dimensions \(p \in \{2, 3, \dots , 10 \}\), the pairwise Kullback–Leibler divergence DKL \(\in \{2,3.5,5 \}\) and the total sample size \(N\ \in \{1000,2500,10,000 \}\). We choose the data-generating GMMs such that we consider highly overlapping mixture components (DKL = 2), somewhat overlapping mixture components (DKL = 3.5), and clearly separated mixture components (DKL = 5; see Fig. 1).
We choose GMMs such that the pairwise DKL between components are equal, which is necessary to meaningfully compare results across variations of K. In the bivariate case (p = 2), we can arrange the means of components in an equidistant way for \(K\ \in \{2,3 \}\) and we therefore only vary the means and not the variances and covariances across mixture components. We fix all variances to \(\sigma ^{2}=\sqrt {0.25}\) and set all covariances to zero. For K = 4, it is impossible to place the component means equidistant from each other, and we therefore vary the covariances to obtain the same DKL as in \(K\ \in \{2,3 \}\). For dimensions p ≥ 3, it is again possible for \(K\ \in \{2,3,4 \}\) to place the means at pairwise equidistant locations and we therefore only vary the means and keep the covariance matrix constant. In these cases, we use numeric optimization to choose arrangements of component means that have the desired pairwise DKL. For a detailed description for how we defined the means of the mixture components, see Appendix A. We set the mixing probability of each mixture component to \(\frac {1}{K}\).
Mapping from continuous to ordinal data
Figure 2 illustrates the mapping from continuous variables to ordinal categories with the following procedure. For each variable separately, we calculated the 0.50% and 99.50% quantiles and then construct c equally spaced categories between those quantiles. The 1% data lying outside this interval on each side are mapped into the nearest category. We chose this procedure to avoid that extreme values can meaningfully influence the mapping for small sample sizes N. This implies that the borders of the grid are defined by the 0.50% and 99.50% quantiles of the Gaussian mixture. The labels of the c categories are defined as the midpoints of the intervals defined by the c + 1 thresholds (including the boundaries defined by the quantiles as thresholds). For example, if the thresholds are − 1,0,1, then the category labels would be − 0.5,0.5. We choose these category labels to ensure that the ordinal data are on a similar scale as the continuous data. This is required to meaningfully assess estimation errors on the component parameters because it would not make sense to compare the estimated mean of component 1 with corresponding the true mean, if we already know that all values in the observed data are a times bigger.
Estimation method
We estimated GMMs with the EM algorithm and the BIC using the implementation in the R-package mclust version 5.4.6 (Scrucca et al., 2016). To limit the number of parameters of GMMs, we constrained the GMMs to be isomorphic, that is, to have equal variances and covariances. Thus, in this case, the GMM is equal to a latent profile analysis. Note that this class of GMMs is correctly specified for all simulation conditions, except the one for p = 2, K = 4, where we also varied the covariances. We consider the sequence of \(K \in \{1, \dots , 7 \}\). In order to estimate expected performance measures, we compute averages over 100 repetitions of the design. The repetitions use the same mixture models but differ in the generation of Gaussian noise.
In some applications one might be specifically interested in differences in covariances across components. For example, clinical psychologists analyze the covariances between symptoms of mental disorders (Borsboom, 2017) to uncover possible causal interactions between symptoms (however, note that this non-trivial, see e.g., Ryan et al., 2019; Haslbeck et al., 2021). Mixture models have been suggested as a way to determine whether individuals with/without diagnosis differ in their symptom covariances (Haslbeck et al., 2021; Brusco et al., 2019; De Ron et al., 2021). We therefore run the simulation design a second time in which we freely estimate all covariance matrices. The code to reproduce the simulation study and all results and figures in the paper can be found at https://github.com/jmbh/OrdinalGMMSim_reparchive.
Results
We discuss the results of the constrained estimation in the main text and the results of the unconstrained estimation in Appendix B. Figure 3 displays the probability of recovering the correct number of components (i.e., accuracy) as a function of K (rows), DKL (columns), p (orange gradient), and the number of ordinal categories (x-axis), fixed for N = 10,000. We discuss the results for N = 10,000 first because they are the most illustrative of the effects of the misspecification by ordinal scales.
We first consider the bivariate case (p = 2). We see that, across the variation of DKL and for \(K\ \in \{2,3 \}\), the performance is very low for small number of categories and tends to increase as the number of categories becomes larger. This increase is more rapid when K = 3 and DKL are larger. The exceptions are the cases with K = 4. This is the only condition in which components varied in their covariances and hence estimation with diagonal covariances cannot recover these mixtures. This explains that accuracy is at zero for these cases (bottom row). In the bivariate case, we are able to visually inspect the data and the estimated component means: Fig. 4 displays the simulated data for K = 2, DKL = 5 for different numbers of ordinal categories and the estimated component means (red Xs) in the first iteration of the simulation. If the data are continuous (top left panel), the component means are estimated correctly, as expected. However, when thresholding the continuous data into 12 ordinal categories, we estimate five components which are placed close to the true components means. When decreasing the number of categories to five, the over-estimation of K increases. From c = 4 on it decreases again and for c = 2 we underestimate with \(\hat K = 1\). This shows, at least for the binary case, that observing ordinal categories strongly impacts GMM recovery.
Considering p > 2 in Fig. 3, we observe a peculiar behavior in the accuracy that begins at zero at c = 2, increases at c = 3, decreases again at c = 4, and then increases to 1 as c increases further. The reason for this perhaps surprising behavior is that for small c the estimated K first increases from underestimation to overestimation, and then decreases again towards the correct K. This pattern is more pronounced for the conditions with more than two variables. Figure 5 displays this behavior in box plots of the estimated number of components \(\hat K\), separate for \(p \in \{2,\dots , 10\}\) and fixed for K = 2 and DKL = 2.
We see that across all conditions with p > 2, the mean estimated \(\hat K\) increases from c = 3 to c = 4 and then decreases again. This decrease after c = 4 is steeper for higher p. This non-monotone behavior of the location of the distribution of estimated \(\hat K\)s from underestimation (at p = 2) to increasing overestimation (for p ≤ 4) towards correct estimation (for p ≥ 4) explains the perhaps surprising non-monotone behavior of the accuracy measure as a function of c in Fig. 3. The underestimation at c = 2 is due to the fact that estimation for K > 1 fails due to singular covariance matrices of some of the components. Typically, the problem is that one component has a large variance and all other components have zero variance. This problem could be addressed by providing priors on the variances (Fraley et al., 2012; Scrucca et al., 2016). This behavior cannot be explained by the geometry of the true component means because we generate the configurations randomly and independently in each of the 100 simulation runs (see Appendix A).
Our best explanation is that two forces are at play: if the number of categories c is low the data looks roughly unimodal because the tails of the components collapse into grid points that have a probability mass similar to the grid points close to the true component means. In those cases, we expect that the components fall on the grid point masses. When going from c = 3 to c = 4, more such grid points are available, leading to higher K. When further increasing c, the true components become better separated because the grid points between the component means have smaller probability. This separation happens more quickly for larger p, which might explain why overestimation declines more quickly as a function of c for larger p.
From Fig. 3, we see that the number of true components K and the pairwise distance DKL clearly impacts accuracy. However, we also see that accuracy is predominantly driven by the number of variables p and the number of ordinal categories c. To summarize our findings, we therefore average accuracy across K and DKL and display it conditional on p and c in Fig. 6. We see that performance is extremely low if we have few variables and few ordinal categories. However, when having more than a few variables and more than a few ordinal categories, performance improves dramatically. For example, with only four variables and 12 ordinal categories we already achieve an average accuracy of 0.97. Or, if we have ten variables, c = 5 categories are enough to achieve high accuracy (0.97). Generally, average accuracy is high if p > 5 and c > 5. Thus, if we remain in those situations, GMMs can also be recovered with ordinal variables, if the sample size is high enough. The situation is a bit more complex for lower sample sizes, for which performance drops considerably when the components of the data-generating GMM are less well separated. However, even for N = 1000, performance is high for p > 5,c > 7 if the components are well separated (see below).
So far, we only considered results for sample size N = 10,000. Figure 7 displays the same aggregate performance as Fig. 6 but for sample sizes \(N\ \in \{1000,\ 2500 \}\). For N = 1000 (left panel), we see the same pattern as in Fig. 6 showing the results for N = 10,000 when c and p are small: when the number of categories and dimensions are too small, the recovery performance is very low. When both increase, the performance increases. However, different to the setting with N = 10,000, we see that the performance decreases again when increasing p > 4. The reason for this behavior is that the number of parameters increases linearly with p (specifically, K(p + 1) + (K − 1)), which increases the weight of the penalty term of the BIC. We do not see this effect in the N = 10,000 because the likelihood part in the BIC has a much higher weight because we have more data. For N = 2500 (right panel), we see a similar pattern, except that the drop in performance as a function of p is much smaller since we have more data relative to the number of parameters.
In the above results, we only considered to what extent the number of components K can be correctly estimated. However, we did not explore yet how well the parameters of the GMM can be estimated when K has been estimated correctly. The additional parameters are the K − 1 mixing probabilities and the means and covariance matrices of each of the K components. Here, we report how well the means and covariances of each of the K components are estimated. To this end, we consider only those simulation results from the main text in which K has been estimated correctly. In each case, it is necessary to map each component in the estimated model to its corresponding component in the data-generating model. We do this by computing average errors for all K! possible mappings and choose the one with the smallest error.
To keep the number of presented results manageable, we take a similar approach to above and average over the variations in K and DKL. Figure 8 displays the mean squared difference between true and estimated parameters as a function of sample size (rows), number of variables p (y-axes) and number of ordinal categories c (x-axes), separately for means, variances, and covariances (columns), and rounded to two decimals. The missing cells indicate that the correct K was estimated in none of the simulation runs.
We observe average absolute errors around 0.15 for means across variations of the number of variables p and the number of ordinal categories c. These errors are relatively large considering that the true means vary roughly between − 0.5 and 1.5. Interestingly, this error does not seem to decrease when increasing the sample size. This shows that the error is due to the misspecification created by the ordinal mapping. In contrast, when the data are continuous, we see that the estimation error is extremely small across N. This shows that for the continuous case, once we correctly estimated K, we can also expect precise estimates of the mean vector. The estimation error in the variances and covariances is very low across all conditions, as one would expect, since we constrained all covariances to be isomorphic, which corresponds to the data-generating GMM.
The above results are of the estimation method in which we constrained the covariance matrices to be diagonal with equal variances. Except for the condition K = 4, p = 2, these models are correctly specified because also the data-generating GMMs have diagonal covariances with equal variances. When estimating covariance matrices unconstrained, we would expect a drop in performance, since the model has many more parameters while the sample size remains constant. In addition, we expect this effect to be larger for conditions with more dimensions p because the parameters grow quadratically with p. We report the results of estimating GMMs with unconstrained covariance matrices and verify those theoretical predictions in Appendix B. In addition to verifying these predictions, the main result of those additional results is that one requires a relatively large sample size in order to estimate a GMM with unconstrained covariance matrices because the number of parameters grow quadratically with the number of variables p.
An additional way to assess the performance of mixture/clustering methods is to consider to what extent cases are classified to the correct component or cluster. The classification performance depends on how well the mixture components are separated and how well the mixture distribution is being estimated. In the present paper, the focus is on the extent to which Gaussian mixture models can be recovered from data in various scenarios. We therefore chose performance measures capturing how well K is estimated and how well the parameters of component models are estimated once K has been estimated correctly. Classification performance captures estimation performance less directly, since it is a function of both estimation performance and the separation between components in the true mixture model. For this reason, and to keep the paper concise, we did not include classification performance.