Definition
Text clustering automatically groups textual documents (for example, plain-text documents, web pages, and emails) into clusters based on the similarity of their content. The problem can be defined as follows. Given a set DS of n documents and a predefined number of clusters K (usually set by the user), DS is clustered into K document clusters DS1, DS2, …, DSK (i.e., DS1 ∪ DS2 ∪ … ∪ DSK = DS) so that documents in the same cluster are similar to one another while documents from different clusters are dissimilar [14].
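The definition can be made concrete with a small check. The following is a minimal sketch under stated assumptions, not a method from this entry: the toy documents, the helper names (jaccard, is_valid_clustering, within_and_between), the use of Jaccard word overlap as the similarity measure, and the restriction to hard (non-overlapping) clusters are all chosen here for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Word-overlap similarity between two documents (an assumed measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def is_valid_clustering(ds, clusters, k):
    """True if `clusters` splits the document ids of `ds` into k non-empty,
    disjoint groups whose union covers every document."""
    if len(clusters) != k or any(not c for c in clusters):
        return False
    seen = [doc_id for c in clusters for doc_id in c]
    return len(seen) == len(set(seen)) and set(seen) == set(range(len(ds)))


def mean_similarity(ds, pairs):
    """Average similarity over a list of (i, j) document-id pairs."""
    return sum(jaccard(ds[i], ds[j]) for i, j in pairs) / len(pairs) if pairs else 0.0


def within_and_between(ds, clusters):
    """Mean similarity of same-cluster pairs and of different-cluster pairs."""
    label = {i: c for c, members in enumerate(clusters) for i in members}
    all_pairs = list(combinations(range(len(ds)), 2))
    within = [(i, j) for i, j in all_pairs if label[i] == label[j]]
    between = [(i, j) for i, j in all_pairs if label[i] != label[j]]
    return mean_similarity(ds, within), mean_similarity(ds, between)


# Hypothetical toy collection DS with n = 4 documents and K = 2 clusters.
DS = [
    "stock market prices fall",
    "stock prices rise on market news",
    "football team wins match",
    "team loses football match",
]
clusters = [{0, 1}, {2, 3}]

print(is_valid_clustering(DS, clusters, 2))
print(within_and_between(DS, clusters))
```

A clustering that satisfies the definition should give a clearly higher within-cluster value than between-cluster value; any other similarity measure could be substituted for jaccard.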
Foundations
Text clustering consists of several important components, including document representation, text clustering algorithms, and performance measurements....
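As a rough illustration of how these three components fit together, the sketch below pairs one possible choice for each: TF-IDF vectors for document representation, k-means with cosine similarity as the clustering algorithm, and purity as the performance measure. These specific choices, the function names, the toy documents, and the seeding of centroids with the first K documents are assumptions made for the example; the entry text available here does not say which techniques it covers.

```python
import math
from collections import Counter


def normalise(vec):
    """Scale a sparse vector (dict term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def tfidf_vectors(docs):
    """Document representation: unit-length sparse TF-IDF vectors."""
    tokenised = [doc.lower().split() for doc in docs]
    n = len(docs)
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # idf = log(n / df) + 1 keeps terms that occur in every document non-zero
        vectors.append(normalise({t: c * (math.log(n / df[t]) + 1.0) for t, c in tf.items()}))
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def kmeans(vectors, k, iterations=10):
    """Clustering algorithm: k-means using cosine similarity.
    Assumption: the first k documents serve as initial centroids."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document joins its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        # Update step: each centroid becomes the normalised sum of its members.
        for c in range(k):
            total = Counter()
            for v, label in zip(vectors, labels):
                if label == c:
                    total.update(v)
            if total:
                centroids[c] = normalise(total)
    return labels


def purity(labels, truth):
    """Performance measure: share of documents in their cluster's majority class."""
    correct = 0
    for c in set(labels):
        classes = Counter(t for label, t in zip(labels, truth) if label == c)
        correct += classes.most_common(1)[0][1]
    return correct / len(labels)


# Hypothetical toy data; the first K = 2 documents come from different topics
# so that the simple seeding above starts from sensible centroids.
DOCS = [
    "stock market prices fall",
    "football team wins match",
    "stock prices rise on market news",
    "team loses football match",
]
TRUTH = ["finance", "sports", "finance", "sports"]

labels = kmeans(tfidf_vectors(DOCS), k=2)
print(labels, purity(labels, TRUTH))
```

Purity needs ground-truth class labels, so it applies only to benchmark collections; the fixed seeding is used purely to keep the example deterministic.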
Recommended Reading
Croft WB. Organizing and searching large files of documents. Ph.D. thesis, University of Cambridge; 1978.
Cutting DR, Karger DR, Pedersen JO, Tukey JW. Scatter/gather: a cluster-based approach to browsing large document collections. In: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 1992. p. 318–29.
Day WH, Edelsbrunner H. Efficient algorithms for agglomerative hierarchical clustering methods. J Classif. 1984;1(1):7–24.
Dhillon IS. Co-clustering documents and words using bipartite spectral graph partitioning, UT CS Technical report #TR. Department of Computer Sciences, University of Texas, Austin; 2001.
Dumais S, Platt J, Heckerman D, Sahami M. Inductive learning algorithms and representations for text categorization. In: Proceedings of the 7th International Conference on Information and Knowledge Management; 1998. p. 148–55.
Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Comput Surv. 1999;31(3):264–323.
Leouski AV, Croft WB. An evaluation of techniques for clustering search results. Technical report IR-76. Department of Computer Science, University of Massachusetts, Amherst; 1996.
Lewis DD. Representation quality in text classification: an introduction and experiment. In: Proceedings of the Workshop on Speech and Natural Language; 1990. p. 288–95.
MacQueen JB. Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability; 1967. p. 281–97.
Nagy G. State of the art in pattern recognition. Proc IEEE. 1968;56(5):836–62.
Sebastiani F. Machine learning in automated text categorization. ACM Comput Surv. 2002;34(1):1–47.
Steinbach M, Karypis G, Kumar V. A comparison of document clustering techniques. Technical report, University of Minnesota – Computer Science and Engineering; 2000.
van Rijsbergen CJ. Information retrieval. 2nd ed. London: Butterworths; 1979.
Yoo I, Hu XH. A comprehensive comparison study of document clustering for a biomedical digital library MEDLINE. In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries; 2006. p. 220–9.
© 2018 Springer Science+Business Media, LLC, part of Springer Nature
About this entry
Cite this entry
Li, H. (2018). Text Clustering. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_415
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science, Reference Module Computer Science and Engineering