Mixtures of Gaussian copula factor analyzers for clustering high dimensional data

Zhang, Lili; Baek, Jangsun

doi:10.1016/j.jkss.2018.12.001

Published: 04 January 2019

Volume 48, pages 480–492, (2019)
Journal of the Korean Statistical Society

Lili Zhang¹ &
Jangsun Baek¹

Abstract

The mixture of factor analyzers is a useful model-based clustering method that can avoid the curse of dimensionality in high-dimensional clustering. However, this approach is sensitive both to diverse non-normalities in the marginal variables and to outliers, which are commonly observed in multivariate experiments. We propose mixtures of Gaussian copula factor analyzers (MGCFA) for clustering high-dimensional data. This model has two advantages: (1) it allows different marginal distributions, which makes the mixture model more flexible to fit, and (2) it avoids the curse of dimensionality by embedding the factor-analytic structure in the component-correlation matrices of the mixture distribution.

An EM algorithm is developed for fitting MGCFA. The proposed method is free of the curse of dimensionality and allows any parametric marginal distribution that best fits the data. It is applied to both synthetic data and a microarray gene expression data set for clustering, where it performs better than several existing methods.
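
To make the two ingredients named in the abstract concrete, the following is a minimal sketch of a single mixture component: a Gaussian copula whose correlation matrix has factor-analytic structure, combined with arbitrary parametric marginals. It is an illustration under stated assumptions, not the authors' implementation. The function names, the rescaling used to obtain a unit-diagonal correlation matrix, and the example marginals are all hypothetical choices; the abstract does not give the paper's exact parameterization, and the EM fitting step is not shown.

```python
import math
from statistics import NormalDist

import numpy as np

STD_NORMAL = NormalDist()


def factor_correlation(loadings, uniquenesses):
    """Correlation matrix with factor-analytic structure (assumed form).

    Builds Sigma = A A^T + diag(d) from a p x q loading matrix A and
    positive uniquenesses d, then rescales Sigma to unit diagonal.
    """
    A = np.asarray(loadings, dtype=float)      # shape (p, q)
    d = np.asarray(uniquenesses, dtype=float)  # shape (p,), all > 0
    sigma = A @ A.T + np.diag(d)
    scale = 1.0 / np.sqrt(np.diag(sigma))
    return sigma * np.outer(scale, scale)


def gaussian_copula_logdensity(u, corr):
    """Log density of a Gaussian copula at a point u in (0, 1)^p."""
    z = np.array([STD_NORMAL.inv_cdf(ui) for ui in u])
    _, logdet = np.linalg.slogdet(corr)
    # z^T (R^{-1} - I) z, computed without forming the inverse explicitly
    quad = z @ (np.linalg.solve(corr, z) - z)
    return -0.5 * logdet - 0.5 * quad


def joint_logdensity(x, cdfs, logpdfs, corr):
    """Copula log density at the marginal CDF values plus marginal log densities."""
    u = [cdf(xi) for cdf, xi in zip(cdfs, x)]
    log_margins = sum(logpdf(xi) for logpdf, xi in zip(logpdfs, x))
    return gaussian_copula_logdensity(u, corr) + log_margins


# Hypothetical example: p = 3 variables, q = 1 factor, three different marginals.
corr = factor_correlation([[0.9], [0.7], [0.5]], [0.4, 0.6, 0.8])
cdfs = [
    lambda x: 1.0 - math.exp(-x),          # exponential with rate 1
    STD_NORMAL.cdf,                        # standard normal
    lambda x: 1.0 / (1.0 + math.exp(-x)),  # standard logistic
]
logpdfs = [
    lambda x: -x,
    lambda x: -0.5 * x * x - 0.5 * math.log(2.0 * math.pi),
    lambda x: -x - 2.0 * math.log1p(math.exp(-x)),
]
value = joint_logdensity([0.8, -0.3, 1.2], cdfs, logpdfs, corr)
print(value)
```

In this sketch the dependence between variables enters only through the correlation matrix, which is determined by the p x q loadings and p uniquenesses rather than by a full set of pairwise correlations. A mixture would combine several such component densities, each with its own correlation matrix and marginals, weighted by mixing proportions.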

Author information

Authors and Affiliations

Department of Statistics, Chonnam National University, Gwangju, 61186, Korea
Lili Zhang & Jangsun Baek

Corresponding author

Correspondence to Jangsun Baek.

About this article

Cite this article

Zhang, L., Baek, J. Mixtures of Gaussian copula factor analyzers for clustering high dimensional data. J. Korean Stat. Soc. 48, 480–492 (2019). https://doi.org/10.1016/j.jkss.2018.12.001

Received: 13 November 2018
Accepted: 07 December 2018
Published: 04 January 2019
Issue Date: September 2019
DOI: https://doi.org/10.1016/j.jkss.2018.12.001
