Computational Statistics, Volume 28, Issue 2, pp 789–807

Iterative factor clustering of binary data

  • Alfonso Iodice D’Enza
  • Francesco Palumbo
Original Paper

Abstract

Binary data are a special case in which both distance measures and co-occurrence measures can be adopted. Euclidean distance-based non-hierarchical methods, such as the k-means algorithm or one of its variants, can be used profitably. When the number of available attributes increases, however, the overall clustering performance usually worsens. In such cases, enhancing group separability requires removing irrelevant, redundant, and noisy information from the data. The present approach belongs to the category of attribute transformation strategies and combines clustering and factorial techniques to identify attribute associations that characterize one or more homogeneous groups of statistical units. Furthermore, it provides graphical representations that facilitate the interpretation of the results.
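
The link between distance and co-occurrence mentioned in the abstract can be illustrated with a small example. For two 0/1 vectors, the squared Euclidean distance equals the number of mismatching attributes, and it can be rewritten using the co-occurrence count. The following Python sketch is only an illustration of that general identity, not the authors' procedure; the function names and the two example vectors are assumptions introduced here.

```python
def squared_euclidean(x, y):
    """Squared Euclidean distance between two equal-length 0/1 vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def co_occurrence(x, y):
    """Number of attributes that are present (equal to 1) in both vectors."""
    return sum(a * b for a, b in zip(x, y))


def distance_from_co_occurrence(x, y):
    """Same distance, computed from attribute counts and the co-occurrence."""
    return sum(x) + sum(y) - 2 * co_occurrence(x, y)


# Hypothetical statistical units described by six binary attributes.
u = [1, 0, 1, 1, 0, 0]
v = [1, 1, 0, 1, 0, 0]

print(squared_euclidean(u, v))            # number of mismatching attributes
print(co_occurrence(u, v))                # attributes shared by both units
print(distance_from_co_occurrence(u, v))  # agrees with squared_euclidean
```

The identity holds because each entry is 0 or 1, so squaring an entry leaves it unchanged. This is why a distance-based method such as k-means and a co-occurrence-based view can both be applied to the same binary table.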

Keywords

  • Categorical attribute quantification
  • Correspondence analysis
  • Cluster analysis
  • Binary data

Copyright information

© Springer-Verlag 2012

Authors and Affiliations

  1. Dipartimento di Scienze Economiche, Università di Cassino, Cassino, Italy
  2. Dipartimento di Teoria e Metodi per le Scienze Umane e Sociali, Università degli Studi di Napoli ‘Federico II’, Naples, Italy