Abstract
The mixtures of factor analyzers model is a useful model-based clustering method that can avoid the curse of dimensionality in high-dimensional clustering. However, this approach is sensitive both to non-normality of the marginal variables and to outliers, which are commonly observed in multivariate experiments. We propose mixtures of Gaussian copula factor analyzers (MGCFA) for clustering high-dimensional data. This model has two advantages: (1) it allows different marginal distributions, which makes the mixture model more flexible to fit; and (2) it avoids the curse of dimensionality by embedding the factor-analytic structure in the component correlation matrices of the mixture distribution.
An EM algorithm is developed for fitting MGCFA. The proposed method is free of the curse of dimensionality and allows any parametric marginal distribution that best fits the data. Applied to both synthetic data and a microarray gene expression data set, it performs better than several existing clustering methods.
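To make the model structure concrete, the following is a minimal sketch of a Gaussian copula mixture density whose component correlation matrices have a factor-analytic form. It is an illustration under stated assumptions, not the authors' implementation: the function names, the rescaling of B Bᵀ + D to a correlation matrix, and the clipping constant are all hypothetical choices, and the EM fitting step is not shown.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp


def factor_correlation(loadings, uniquenesses):
    """Correlation matrix built from a factor-analytic covariance B B^T + D.

    Assumption: the covariance is rescaled to unit diagonal; the paper may
    parameterize the correlation matrix differently.
    """
    cov = loadings @ loadings.T + np.diag(uniquenesses)
    scale = 1.0 / np.sqrt(np.diag(cov))
    return cov * np.outer(scale, scale)


def gaussian_copula_logpdf(x, marginals, corr):
    """Log-density of one observation x under a Gaussian copula.

    `marginals` is a list of frozen scipy.stats distributions, one per variable.
    """
    # Probability integral transform, clipped so the normal scores stay finite.
    u = np.array([m.cdf(xj) for m, xj in zip(marginals, x)])
    u = np.clip(u, 1e-12, 1.0 - 1e-12)
    z = stats.norm.ppf(u)
    # Copula log-density: -0.5 log|R| - 0.5 z^T (R^{-1} - I) z
    _, logdet = np.linalg.slogdet(corr)
    quad = z @ np.linalg.solve(corr, z) - z @ z
    log_copula = -0.5 * logdet - 0.5 * quad
    # Add the marginal log-densities to obtain the joint log-density.
    log_marginals = sum(m.logpdf(xj) for m, xj in zip(marginals, x))
    return log_copula + log_marginals


def mixture_logpdf(x, weights, components):
    """Log-density of a finite mixture; components are (marginals, corr) pairs."""
    terms = [
        np.log(w) + gaussian_copula_logpdf(x, marg, corr)
        for w, (marg, corr) in zip(weights, components)
    ]
    return logsumexp(terms)


# Usage: three variables with different marginals and one latent factor.
marginals = [stats.gamma(a=2.0), stats.t(df=5), stats.norm(loc=1.0, scale=2.0)]
loadings = np.array([[0.8], [0.5], [-0.3]])
corr = factor_correlation(loadings, np.ones(3))
x = np.array([1.5, -0.2, 0.7])
print(gaussian_copula_logpdf(x, marginals, corr))
print(mixture_logpdf(x, [0.6, 0.4], [(marginals, corr), (marginals, np.eye(3))]))
```

With p variables and q factors, the factor-analytic form describes each correlation matrix through a p × q loading matrix and p uniquenesses rather than p(p − 1)/2 free correlations, which is the source of the parsimony in high dimensions. When every marginal is standard normal, the sketch reduces to an ordinary multivariate normal density with covariance equal to the correlation matrix.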
Cite this article
Zhang, L., Baek, J. Mixtures of Gaussian copula factor analyzers for clustering high dimensional data. J. Korean Stat. Soc. 48, 480–492 (2019). https://doi.org/10.1016/j.jkss.2018.12.001
DOI: https://doi.org/10.1016/j.jkss.2018.12.001