Abstract
Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. This paper studies the problem of clustering binary data, which occupy a special place in data analysis. A unified view of binary data clustering is presented by examining the connections among various clustering criteria. Experiments are conducted to verify these relationships empirically.
Cite this article
Li, T. A Unified View on Clustering Binary Data. Mach Learn 62, 199–215 (2006). https://doi.org/10.1007/s10994-005-5316-9