Similarity-Based Text Clustering: A Comparative Study

Ghosh, J.; Strehl, A.

doi:10.1007/3-540-28349-8_3

Chapter

Summary

Clustering of text documents enables unsupervised categorization and facilitates browsing and search. Any clustering method has to embed the objects to be clustered in a suitable representational space that provides a measure of (dis)similarity between any pair of objects. While several clustering methods and associated similarity measures have been proposed for text clustering, there is no systematic comparative study of the impact of similarity measures on the quality of document clusters, possibly because the most popular cost criteria for evaluating cluster quality do not readily translate across qualitatively different measures. This chapter compares popular similarity measures (Euclidean, cosine, Pearson correlation, extended Jaccard) in conjunction with several clustering techniques (random, self-organizing feature map, hypergraph partitioning, generalized k-means, weighted graph partitioning) on a variety of high-dimensional, sparse vector data sets representing text documents as bags of words. Performance is measured by mutual information with a human-imposed classification. Our key findings are that, in the quasiorthogonal space of word frequencies: (i) cosine, correlation, and extended Jaccard similarities perform comparably; (ii) Euclidean distances do not work well; (iii) graph partitioning tends to be superior, especially when balanced clusters are desired; (iv) performance curves generally do not cross.
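
To make the four measures and the evaluation criterion named in the summary concrete, the following is a minimal sketch using common textbook definitions. These are assumptions: the summary does not give formulas, and the chapter itself may scale or normalize the quantities differently (for example, by converting Euclidean distance into a similarity or by normalizing mutual information). All function names are illustrative, and documents are assumed to be non-zero, equal-length word-count vectors.

```python
import math
from collections import Counter


def dot(x, y):
    """Inner product of two equal-length word-count vectors."""
    return sum(a * b for a, b in zip(x, y))


def euclidean_distance(x, y):
    """Straight-line distance; sensitive to document length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Cosine of the angle between x and y; ignores vector length.

    Assumes neither vector is all zeros.
    """
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


def pearson_correlation(x, y):
    """Cosine similarity of the mean-centered vectors.

    Assumes neither vector is constant across all terms.
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_similarity([a - mean_x for a in x],
                             [b - mean_y for b in y])


def extended_jaccard(x, y):
    """Real-valued generalization of the set-based Jaccard coefficient."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)


def mutual_information(clusters, classes):
    """Unnormalized mutual information (in bits) between two labelings."""
    n = len(clusters)
    joint = Counter(zip(clusters, classes))
    cluster_sizes = Counter(clusters)
    class_sizes = Counter(classes)
    return sum(
        (n_ij / n) * math.log2(n_ij * n / (cluster_sizes[c] * class_sizes[k]))
        for (c, k), n_ij in joint.items()
    )


# Two toy documents over a four-word vocabulary: doc_b is doc_a repeated twice.
doc_a = [2, 0, 1, 0]
doc_b = [4, 0, 2, 0]
```

In this toy case `doc_b` has the same word proportions as `doc_a` but twice the length. Cosine and Pearson treat the two documents as identical in direction, while the Euclidean distance between them is non-zero and the extended Jaccard value falls below 1. The example only shows how the measures differ in their sensitivity to document length; it does not reproduce the chapter's experiments.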

Author information

Authors and Affiliations

Department of ECE, University of Texas at Austin, 1 University Station C0803, Austin, TX, 78712-0240, USA
J. Ghosh
Leubelfingstrasse 110, 90431, Nurnberg, Germany
A. Strehl

Editor information

Editors and Affiliations

Department of Mathematics and Statistics, University of Maryland Baltimore County, 1000 Hilltop Circle, Baltimore, Maryland, 21250, USA
Jacob Kogan
Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County, 1000 Hilltop Circle, Baltimore, Maryland, 21250, USA
Jacob Kogan
Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County, 1000 Hilltop Circle, Baltimore, Maryland, 21250, USA
Charles Nicholas
School of Mathematical Sciences, Tel-Aviv University, Ramat Aviv, Tel-Aviv, 69978, Israel
Marc Teboulle

About this chapter

Cite this chapter

Ghosh, J., Strehl, A. (2006). Similarity-Based Text Clustering: A Comparative Study. In: Kogan, J., Nicholas, C., Teboulle, M. (eds) Grouping Multidimensional Data. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-28349-8_3

DOI: https://doi.org/10.1007/3-540-28349-8_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28348-5
Online ISBN: 978-3-540-28349-2
eBook Packages: Computer Science, Computer Science (R0)
