
Statistics and Computing, Volume 23, Issue 5, pp 625–638

Using conditional independence for parsimonious model-based Gaussian clustering

  • Giuliano Galimberti
  • Gabriele Soffritti

Abstract

In the framework of model-based cluster analysis, finite mixtures of Gaussian components represent an important class of statistical models widely employed for dealing with quantitative variables. Within this class, we propose novel models in which constraints on the component-specific variance matrices allow us to define parsimonious Gaussian clustering models. Specifically, the proposed models are obtained by assuming that the variables can be partitioned into groups that are conditionally independent within components, thus producing component-specific variance matrices with a block diagonal structure. This approach allows us to extend the methods for model-based cluster analysis and to make them more flexible and versatile. In this paper, Gaussian mixture models are studied under the above-mentioned assumption. Identifiability conditions are proved, and the model parameters are estimated by maximum likelihood using the Expectation-Maximization algorithm. The Bayesian information criterion is proposed for selecting the partition of the variables into conditionally independent groups, and the consistency of this criterion is proved under regularity conditions. A hierarchical algorithm is suggested for examining and comparing models with different partitions of the set of variables. A wide class of parsimonious Gaussian models is also presented by parameterizing the component-variance matrices according to their spectral decomposition. The effectiveness and usefulness of the proposed methodology are illustrated with two examples based on real datasets.
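
The block diagonal structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the partition, the covariance matrix, and the helper names (`gaussian_logpdf`, `block_diagonal_cov`, `n_cov_params`) are hypothetical and chosen only for illustration. The sketch shows two consequences of the assumption for a single Gaussian component: the log-density factorizes over the groups of variables, and the number of free covariance parameters drops.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate normal, computed with NumPy only."""
    d = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

def block_diagonal_cov(cov, partition):
    """Zero out covariances between variables in different groups."""
    out = np.zeros_like(cov)
    for group in partition:
        idx = np.ix_(group, group)
        out[idx] = cov[idx]
    return out

def n_cov_params(partition):
    """Free covariance parameters in one component: sum of d_g (d_g + 1) / 2."""
    return sum(len(g) * (len(g) + 1) // 2 for g in partition)

# Hypothetical example: 4 variables split into groups {0, 1} and {2, 3}.
partition = [[0, 1], [2, 3]]
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 4))
full_cov = a @ a.T + 4 * np.eye(4)  # an arbitrary positive definite matrix
mean = np.zeros(4)
cov = block_diagonal_cov(full_cov, partition)

# Under the block diagonal structure, the joint log-density equals the
# sum of the log-densities of the groups.
x = rng.normal(size=4)
joint = gaussian_logpdf(x, mean, cov)
by_group = sum(
    gaussian_logpdf(x[g], mean[g], cov[np.ix_(g, g)]) for g in partition
)

full_params = n_cov_params([[0, 1, 2, 3]])  # unrestricted covariance matrix
block_params = n_cov_params(partition)      # block diagonal covariance matrix
```

In this toy setting the unrestricted matrix has 10 free parameters per component and the block diagonal one has 6. A reduction of this kind lowers the penalty term of a criterion such as the BIC when partitions are compared.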

Keywords

Bayesian information criterion; Conditional independence; EM algorithm; Gaussian mixture model; Spectral decomposition

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. Department of Statistics, University of Bologna, Bologna, Italy
