Advances in Data Analysis and Classification, Volume 10, Issue 4, pp 441–464

Factor probabilistic distance clustering (FPDC): a new clustering method

  • Cristina Tortora
  • Mireille Gettler Summa
  • Marina Marino
  • Francesco Palumbo (corresponding author)
Regular Article

Abstract

Factor clustering methods have been developed in recent years thanks to improvements in computational power. These methods perform a linear transformation of the data and a clustering of the transformed data, optimizing a common criterion. Probabilistic distance (PD)-clustering is an iterative, distribution-free, probabilistic clustering method. Factor PD-clustering (FPDC) is based on PD-clustering and involves a linear transformation of the original variables into a reduced number of orthogonal ones, using a criterion shared with PD-clustering. This paper demonstrates that Tucker3 decomposition can be used to accomplish this transformation. FPDC alternates between Tucker3 decomposition and PD-clustering on the transformed data until convergence is achieved. This method can significantly improve the performance of the PD-clustering algorithm; large data sets can thus be partitioned into clusters with increased stability and robustness of the results. Real and simulated data sets are used to compare FPDC with its main competitors: it performs equally well when clusters are elliptically shaped, and outperforms them when clusters are non-Gaussian in shape or the data are noisy.
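
As an illustration of the PD-clustering step only, the following is a minimal sketch and not the authors' implementation. It assumes the usual PD-clustering formulation attributed to Ben-Israel and Iyigun: for each point, the product of the membership probability and the distance to the cluster center is the same for every cluster, and each center is updated as a weighted mean with weights p²/d. The function names `pd_probabilities` and `pd_clustering`, the Euclidean distance, the `eps` guard, and the stopping rule are choices made for this example. The Tucker3 step of FPDC is not shown.

```python
import math


def pd_probabilities(point, centers, eps=1e-12):
    """Return (probabilities, distances) of one point to every center.

    Assumed PD principle: p_k * d_k is constant over clusters k, which
    gives p_k proportional to 1 / d_k. `eps` guards against d_k == 0.
    """
    dists = [max(math.dist(point, c), eps) for c in centers]
    inverse = [1.0 / d for d in dists]
    total = sum(inverse)
    return [v / total for v in inverse], dists


def pd_clustering(points, centers, max_iter=100, tol=1e-8):
    """Iterate probability and center updates until centers stop moving."""
    centers = [list(c) for c in centers]
    dim = len(points[0])
    for _ in range(max_iter):
        sums = [[0.0] * dim for _ in centers]
        weights = [0.0] * len(centers)
        for x in points:
            probs, dists = pd_probabilities(x, centers)
            for k, (p, d) in enumerate(zip(probs, dists)):
                u = p * p / d  # assumed center-update weight p^2 / d
                weights[k] += u
                for j in range(dim):
                    sums[k][j] += u * x[j]
        new_centers = [[s / w for s in row] for row, w in zip(sums, weights)]
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    # Crisp labels: assign each point to its most probable cluster.
    labels = []
    for x in points:
        probs, _ = pd_probabilities(x, centers)
        labels.append(max(range(len(centers)), key=lambda k: probs[k]))
    return centers, labels


if __name__ == "__main__":
    # Hypothetical toy data: two well-separated groups in the plane.
    data = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11)]
    fitted_centers, fitted_labels = pd_clustering(data, [(2, 2), (8, 8)])
    print(fitted_centers, fitted_labels)
```

In a sketch of FPDC along the lines the abstract describes, a routine like this would be run on the transformed (reduced) variables, alternating with the Tucker3 step until the criterion stops improving.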

Keywords

Factor clustering; Probabilistic distance clustering; Tucker3; k-means

Mathematics Subject Classification

62-07; 62H30

Notes

Acknowledgments

The authors are grateful to an associate editor and anonymous reviewers for their very helpful comments and suggestions, the cumulative effect of which has been a stronger manuscript.

Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Cristina Tortora (1)
  • Mireille Gettler Summa (2)
  • Marina Marino (3)
  • Francesco Palumbo (4, corresponding author)

  1. Department of Mathematics and Statistics, McMaster University, Hamilton, Canada
  2. CEREMADE, Université Paris Dauphine, Paris, France
  3. Dipartimento di Scienze Sociali, University of Naples Federico II, Naples, Italy
  4. Dipartimento di Scienze Politiche, University of Naples Federico II, Naples, Italy
