Definition
Data clustering is informally defined as the problem of partitioning a set of objects into groups, such that the objects in the same group are similar, while the objects in different groups are dissimilar. Categorical data clustering refers to the case where the data objects are defined over categorical attributes. A categorical attribute is an attribute whose domain is a set of discrete values that are not inherently comparable. That is, there is no single ordering or inherent distance function for the categorical values, and there is no mapping from categorical to numerical values that is semantically meaningful.
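To make the absence of an inherent order or distance concrete, here is a minimal sketch in Python. The records, the attribute values, and the helper names `integer_distance` and `mismatch_count` are hypothetical and are not taken from this entry. Counting mismatched attributes (simple matching) is used only as one common stand-in for dissimilarity, not as a measure this entry prescribes. The sketch illustrates that a distance computed from an arbitrary integer encoding changes when the encoding changes, whereas the mismatch count does not depend on any encoding.

```python
from typing import Dict, Sequence

# Hypothetical records over two categorical attributes: (color, shape).
red_circle = ("red", "circle")
green_circle = ("green", "circle")
blue_circle = ("blue", "circle")


def integer_distance(a: Sequence[str], b: Sequence[str], codes: Dict[str, int]) -> int:
    """L1 distance after mapping each categorical value to an arbitrary integer code."""
    return sum(abs(codes[x] - codes[y]) for x, y in zip(a, b))


def mismatch_count(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of attributes on which the two records disagree (simple matching)."""
    return sum(x != y for x, y in zip(a, b))


# Two equally arbitrary encodings of the same categorical values.
codes_a = {"red": 0, "green": 1, "blue": 2, "circle": 0}
codes_b = {"red": 0, "blue": 1, "green": 2, "circle": 0}

# Under codes_a red is closer to green than to blue; under codes_b the reverse holds.
d_a = (integer_distance(red_circle, green_circle, codes_a),
       integer_distance(red_circle, blue_circle, codes_a))
d_b = (integer_distance(red_circle, green_circle, codes_b),
       integer_distance(red_circle, blue_circle, codes_b))

# The mismatch count treats all distinct values as equally different.
m = (mismatch_count(red_circle, green_circle),
     mismatch_count(red_circle, blue_circle))

print(d_a, d_b, m)
```

Because the two encodings rank the same pairs in opposite order, neither can be called semantically meaningful, which is the property that distinguishes categorical attributes from numerical ones.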
Motivation and Background
Clustering is a problem of great practical importance that has been the focus of substantial research in several domains for decades. As storage capacities grow, ever larger amounts of data become available for analysis and mining. Clustering plays an instrumental role in this...
Recommended Reading
Andritsos, P., Tsaparas, P., Miller, R. J., & Sevcik, K. C. (2004). LIMBO: Scalable clustering of categorical data. In Proceedings of the 9th international conference on extending database technology (EDBT) (pp. 123–146). Heraklion, Greece.
Barbarà, D., Couto, J., & Li, Y. (2002). COOLCAT: An entropy-based algorithm for categorical clustering. In Proceedings of the 11th international conference on information and knowledge management (CIKM) (pp. 582–589). McLean, VA.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Das, G., & Mannila, H. (2000). Context-based similarity measures for categorical databases. In Proceedings of the 4th European conference on principles of data mining and knowledge discovery (PKDD) (pp. 201–210). Lyon, France.
Fisher, D. H. (1987). Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2, 139–172.
Ganti, V., Gehrke, J., & Ramakrishnan, R. (1999). CACTUS: Clustering categorical data using summaries. In Proceedings of the 5th international conference on knowledge discovery and data mining (KDD) (pp. 73–83). San Diego, CA.
Gionis, A., Mannila, H., & Tsaparas, P. (2007). Clustering aggregation. ACM Transactions on Knowledge Discovery from Data, 1(1), Article No 4.
Gluck, M., & Corter, J. (1985). Information, uncertainty, and the utility of categories. In Proceedings of the 7th annual conference of the cognitive science society (COGSCI) (pp. 283–287). Irvine, CA.
Guha, S., Rastogi, R., & Shim, K. (1999). ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th international conference on data engineering (pp. 512–521). Sydney, Australia.
Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann.
Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3), 283–304.
Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Englewood Cliffs, NJ: Prentice-Hall.
Jarke, M., Lenzerini, M., Vassiliou, Y., & Vassiliadis, P. (1999). Fundamentals of data warehouses. Berlin: Springer.
Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604–632.
Zaki, M. J., Peters, M., Assent, I., & Seidl, T. (2005). CLICKS: An effective algorithm for mining subspace clusters in categorical datasets. In Proceedings of the 11th international conference on knowledge discovery and data mining (KDD) (pp. 736–742). Chicago, IL.
© 2011 Springer Science+Business Media, LLC
About this entry
Cite this entry
Andritsos, P., Tsaparas, P. (2011). Categorical Data Clustering. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_99
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-30768-8
Online ISBN: 978-0-387-30164-8