Context-Based Distance Learning for Categorical Data Clustering

Ienco, Dino; Pensa, Ruggero G.; Meo, Rosa

doi:10.1007/978-3-642-03915-7_8

Dino Ienco²⁰,
Ruggero G. Pensa²⁰ &
Rosa Meo²⁰

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 5772))

Included in the following conference series:

International Symposium on Intelligent Data Analysis

1909 Accesses
20 Citations

Abstract

Clustering data described by categorical attributes is a challenging task in data mining applications. Unlike numerical attributes, it is difficult to define a distance between pairs of values of the same categorical attribute, since they are not ordered. In this paper, we propose a method to learn a context-based distance for categorical attributes. The key intuition of this work is that the distance between two values of a categorical attribute A _i can be determined by the way in which the values of the other attributes A _j are distributed in the dataset objects: if they are similarly distributed in the groups of objects in correspondence of the distinct values of A _i a low value of distance is obtained. We propose also a solution to the critical point of the choice of the attributes A _j. We validate our approach on various real world and synthetic datasets, by embedding our distance learning method in both a partitional and a hierarchical clustering algorithm. Experimental results show that our method is competitive w.r.t. categorical data clustering approaches in the state of the art.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Han, J., Kamber, M.: Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, San Francisco (2000)
MATH Google Scholar
Kasif, S., Salzberg, S., Waltz, D., Rachlin, J., Aha, D.: Towards a framework for memory-based reasoning (manuscript, 1995) (in review)
Google Scholar
Huang, Z.: Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov. 2(3), 283–304 (1998)
Article Google Scholar
Guha, S., Rastogi, R., Shim, K.: Rock: A robust clustering algorithm for categorical attributes. In: Proc. of IEEE ICDE 1999 (1999)
Google Scholar
Andritsos, P., Tsaparas, P., Miller, R.J., Sevcik, K.C.: Scalable clustering of categorical data. In: Proc. of EDBT 2004, pp. 123–146 (2004)
Google Scholar
Zaki, M.J., Peters, M.: Clicks: Mining subspace clusters in categorical data via k-partite maximal cliques. In: Proc. of IEEE ICDE 2005, pp. 355–356 (2005)
Google Scholar
Ganti, V., Gehrke, J., Ramakrishnan, R.: Cactus-clustering categorical data using summaries. In: Proc. of ACM SIGKDD 1999, pp. 73–83 (1999)
Google Scholar
Barbara, D., Couto, J., Li, Y.: Coolcat: an entropy-based algorithm for categorical clustering. In: Proc. of CIKM 2002, pp. 582–589. ACM Press, New York (2002)
Google Scholar
Li, T., Ma, S., Ogihara, M.: Entropy-based criterion in categorical clustering. In: Proc. of ICML 2004, pp. 536–543 (2004)
Google Scholar
Ahmad, A., Dey, L.: A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set. Pattern Recogn. Lett. 28(1), 110–118 (2007)
Article Google Scholar
Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
MATH Google Scholar
Yu, L., Liu, H.: Feature selection for high-dimensional data: A fast correlation-based filter solution. In: Proc. of ICML 2003, Washington, DC (2003)
Google Scholar
Quinlan, R.J.: C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning. Morgan Kaufmann, San Francisco (1993)
Google Scholar
Strehl, A., Ghosh, J., Cardie, C.: Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research 3, 583–617 (2002)
MathSciNet MATH Google Scholar
Blake, C.L., Merz, C.J.: UCI repository of machine learning databases (1998), http://www.ics.uci.edu/~mlearn/MLRepository.html
Melli, G.: Dataset generator, perfect data for an imperfect world (2008), http://www.datasetgenerator.com
Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Data Management Systems. Morgan Kaufmann, San Francisco (2005)
MATH Google Scholar

Download references

Author information

Authors and Affiliations

Dept. of Computer Science, University of Torino, Italy
Dino Ienco, Ruggero G. Pensa & Rosa Meo

Authors

Dino Ienco
View author publications
You can also search for this author in PubMed Google Scholar
Ruggero G. Pensa
View author publications
You can also search for this author in PubMed Google Scholar
Rosa Meo
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Department of Mathematics, Imperial College London, South Kensington Campus, SW7 2PG, London, United Kingdom
Niall M. Adams
INSA Lyon, LIRIS CNRS UMR 5205, Bâtiment Blaise Pascal, University of Lyon, F-69621, Villeurbanne, France
Céline Robardet
Department of Information and Computer Science, Universiteit Utrecht, Utrecht, The Netherlands
Arno Siebes
INSA-Lyon, LIRIS CNRS UMR5205, University of Lyon, F-69621, Villeurbanne, France
Jean-François Boulicaut

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Ienco, D., Pensa, R.G., Meo, R. (2009). Context-Based Distance Learning for Categorical Data Clustering. In: Adams, N.M., Robardet, C., Siebes, A., Boulicaut, JF. (eds) Advances in Intelligent Data Analysis VIII. IDA 2009. Lecture Notes in Computer Science, vol 5772. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03915-7_8

Download citation

DOI: https://doi.org/10.1007/978-3-642-03915-7_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03914-0
Online ISBN: 978-3-642-03915-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics