A Survey of Text Clustering Algorithms

Aggarwal, Charu C.; Zhai, ChengXiang

doi:10.1007/978-1-4614-3223-4_4

Abstract

Clustering is a widely studied data mining problem in the text domain. The problem finds numerous applications in customer segmentation, classification, collaborative filtering, visualization, document organization, and indexing. In this chapter, we provide a detailed survey of the problem of text clustering. We study the key challenges of the clustering problem as it applies to the text domain. We discuss the key methods used for text clustering and their relative advantages. We also discuss a number of recent advances in the area in the context of social networks and linked data.

Author information

Authors and Affiliations

IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Charu C. Aggarwal
University of Illinois at Urbana-Champaign, Urbana, IL, USA
ChengXiang Zhai

Corresponding author

Correspondence to Charu C. Aggarwal.

Editor information

Editors and Affiliations

Thomas J. Watson Research Center, IBM, Skyline Drive 19, Hawthorne, 10532, New York, USA
Charu C. Aggarwal
University of Illinois at Urbana-Champaign, Urbana, 61801, Illinois, USA
ChengXiang Zhai

About this chapter

Cite this chapter

Aggarwal, C.C., Zhai, C. (2012). A Survey of Text Clustering Algorithms. In: Aggarwal, C., Zhai, C. (eds) Mining Text Data. Springer, Boston, MA. https://doi.org/10.1007/978-1-4614-3223-4_4

DOI: https://doi.org/10.1007/978-1-4614-3223-4_4
Published: 07 January 2012
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4614-3222-7
Online ISBN: 978-1-4614-3223-4
eBook Packages: Computer Science, Computer Science (R0)
