Iterative factor clustering of binary data

Abstract

Binary data represent a special case in which both distance-based and co-occurrence-based measures can be adopted. Euclidean distance-based non-hierarchical methods, such as the k-means algorithm or one of its variants, can be profitably used. However, as the number of available attributes increases, the overall clustering performance usually worsens. In such cases, enhancing group separability requires removing irrelevant and redundant noisy information from the data. The present approach belongs to the category of attribute-transformation strategies: it combines clustering and factorial techniques to identify attribute associations that characterize one or more homogeneous groups of statistical units. Furthermore, it provides graphical representations that facilitate the interpretation of the results.
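
As a rough illustration of the general idea only (not the authors' exact procedure, which the abstract does not spell out), the sketch below alternates a correspondence-analysis-style factorial step, computed on the standardized residuals of the binary indicator matrix, with a k-means step on the resulting factor scores; the partition obtained at each pass is used to redefine the factorial subspace at the next one. All names (iterative_factor_clustering, n_factors, and so on) and the specific alternation scheme are illustrative assumptions.

    import numpy as np
    from numpy.linalg import svd
    from scipy.cluster.vq import kmeans2

    def iterative_factor_clustering(Z, k, n_factors=2, n_iter=10, seed=0):
        # Z: n x Q binary (0/1) matrix; every attribute is assumed to occur at least once.
        # k: number of groups; n_factors should not exceed k (the group profile matrix
        # has rank at most k).
        rng = np.random.default_rng(seed)
        n, q = Z.shape
        P = Z / Z.sum()                       # correspondence matrix
        r = P.sum(axis=1, keepdims=True)      # row masses
        c = P.sum(axis=0, keepdims=True)      # column masses
        S = (P - r @ c) / np.sqrt(r @ c)      # standardized residuals (chi-square metric)
        labels = rng.integers(0, k, size=n)   # random starting partition
        for _ in range(n_iter):
            # Factorial step: SVD of the group-by-attribute profile matrix induced
            # by the current partition yields attribute quantifications.
            G = np.zeros((n, k))
            G[np.arange(n), labels] = 1.0     # indicator of the current partition
            B = G.T @ S
            _, _, Vt = svd(B, full_matrices=False)
            F = S @ Vt[:n_factors].T          # factor scores of the statistical units
            # Clustering step: k-means on the low-dimensional scores.
            _, labels = kmeans2(F, k, minit="++", seed=rng)
        return labels, F

    # Usage on random binary data:
    # Z = (np.random.default_rng(1).random((200, 30)) < 0.3).astype(float)
    # labels, scores = iterative_factor_clustering(Z, k=3)

The re-weighting of the factorial subspace by the current partition is what distinguishes this kind of iterative scheme from the classical tandem approach, where the factorial reduction is computed once and the clustering is run on fixed scores.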

Author information

Corresponding author

Correspondence to Francesco Palumbo.

About this article

Cite this article

Iodice D’Enza, A., Palumbo, F. Iterative factor clustering of binary data. Comput Stat 28, 789–807 (2013). https://doi.org/10.1007/s00180-012-0329-x

Keywords

  • Categorical attribute quantification
  • Correspondence analysis
  • Cluster analysis
  • Binary data