Abstract
Clustering is a widely studied data mining problem in the text domain. It finds numerous applications in customer segmentation, classification, collaborative filtering, visualization, document organization, and indexing. In this chapter, we provide a detailed survey of the problem of text clustering. We study the key challenges of the clustering problem as it applies to the text domain, and we discuss the key methods used for text clustering along with their relative advantages. We also discuss a number of recent advances in the area in the context of social networks and linked data.
Copyright information
© 2012 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Aggarwal, C.C., Zhai, C. (2012). A Survey of Text Clustering Algorithms. In: Aggarwal, C., Zhai, C. (eds) Mining Text Data. Springer, Boston, MA. https://doi.org/10.1007/978-1-4614-3223-4_4
DOI: https://doi.org/10.1007/978-1-4614-3223-4_4
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4614-3222-7
Online ISBN: 978-1-4614-3223-4
eBook Packages: Computer Science (R0)