A Unified View on Clustering Binary Data

Li, Tao

doi:10.1007/s10994-005-5316-9

Published: 29 January 2006

Volume 62, pages 199–215 (2006)

Machine Learning

Abstract

Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. This paper studies the problem of clustering binary data, which occupy a special place in data analysis. A unified view of binary data clustering is presented by examining the connections among various clustering criteria. Experimental studies are conducted to verify these relationships empirically.

Author information

Authors and Affiliations

School of Computer Science, Florida International University, 11200 SW 8th Street, Miami, FL, 33199
Tao Li

Corresponding author

Correspondence to Tao Li.

About this article

Cite this article

Li, T. A Unified View on Clustering Binary Data. Mach Learn 62, 199–215 (2006). https://doi.org/10.1007/s10994-005-5316-9

Received: 14 March 2005
Revised: 30 September 2005
Accepted: 03 October 2005
Published: 29 January 2006
Issue Date: March 2006
DOI: https://doi.org/10.1007/s10994-005-5316-9
