Simultaneous Clustering and Dynamic Keyword Weighting for Text Documents

Frigui, Hichem; Nasraoui, Olfa

doi:10.1007/978-1-4757-4305-0_3



Abstract

In this chapter, we propose a new approach to unsupervised text document categorization based on a coupled process of clustering and cluster-dependent keyword weighting. The proposed algorithm is based on the K-Means clustering algorithm, and is therefore simple both computationally and to implement. Moreover, it learns a different set of keyword weights for each cluster, so that, as a by-product of the clustering process, each document cluster is characterized by a possibly different set of keywords. The cluster-dependent keyword weights have two advantages: they help in partitioning the document collection into more meaningful categories, and they can be used to automatically generate a compact description of each cluster in terms of not only the attribute values, but also their relevance. In particular, for text data, this approach can be used to automatically annotate the documents. We also extend the proposed approach to handle the inherent fuzziness in text documents by automatically generating fuzzy or soft labels instead of hard, all-or-nothing categorization, so that a text document can belong to several categories with different degrees of membership. The proposed approach handles noise documents by automatically designating one or two "noise magnet" clusters that draw most outliers away from the other clusters. The performance of the proposed algorithm is illustrated on real text document collections.
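The coupling of clustering and per-cluster keyword weighting described in the abstract can be illustrated with a minimal sketch. This is not the chapter's algorithm: the abstract gives no update equations, so the code below assumes a weighted squared Euclidean distance and an exponential weight rule (low within-cluster dispersion gives a high weight), both chosen only for brevity. The function name `weighted_kmeans`, the parameter `gamma`, and the toy data are hypothetical, and the fuzzy-label and noise-cluster extensions are not covered.

```python
import math

def weighted_kmeans(docs, init_centers, gamma=1.0, n_iter=20):
    """Toy K-Means that learns one keyword-weight vector per cluster.

    Illustrative assumption, not the chapter's update rule: weights are a
    normalized exp(-dispersion / gamma) over the terms of each cluster.
    """
    k, n_terms = len(init_centers), len(docs[0])
    centers = [list(c) for c in init_centers]
    # Start with uniform keyword weights in every cluster.
    weights = [[1.0 / n_terms] * n_terms for _ in range(k)]
    labels = [0] * len(docs)

    for _ in range(n_iter):
        # 1) Assign each document to the cluster with the smallest
        #    weighted distance, using that cluster's own weights.
        for i, x in enumerate(docs):
            dists = [
                sum(w * (xj - cj) ** 2
                    for w, xj, cj in zip(weights[c], x, centers[c]))
                for c in range(k)
            ]
            labels[i] = dists.index(min(dists))

        for c in range(k):
            members = [x for x, lab in zip(docs, labels) if lab == c]
            if not members:
                continue  # empty cluster: keep its old center and weights
            # 2) Move the center to the mean of its member documents.
            centers[c] = [sum(col) / len(members) for col in zip(*members)]
            # 3) Re-estimate keyword weights from per-term dispersion.
            disp = [
                sum((x[j] - centers[c][j]) ** 2 for x in members) / len(members)
                for j in range(n_terms)
            ]
            scores = [math.exp(-d / gamma) for d in disp]
            total = sum(scores)
            weights[c] = [s / total for s in scores]
    return labels, centers, weights

# Toy term-frequency vectors over three hypothetical keywords.
# Term 0 is stable in the first group, term 1 in the second,
# and term 2 varies widely in both.
docs = [
    [5.0, 0.0, 1.0],
    [5.0, 0.0, 4.0],
    [5.0, 1.0, 0.0],
    [0.0, 5.0, 3.0],
    [0.0, 5.0, 0.0],
    [1.0, 5.0, 4.0],
]
labels, centers, weights = weighted_kmeans(docs, [docs[0], docs[3]])
top_keyword = [w.index(max(w)) for w in weights]
```

In this sketch each cluster ends up with its own weight vector, so the highest-weighted terms can serve as a short description of the cluster, which is the by-product the abstract refers to. Note that this dispersion-based rule rewards any term whose value is consistent within a cluster, including terms that are consistently absent.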



Authors

Hichem Frigui
Olfa Nasraoui

Editor information

Editors and Affiliations

Department of Computer Science, University of Tennessee, 203 Claxton Complex, Knoxville, TN 37996-3450, USA
Michael W. Berry


About this chapter

Cite this chapter

Frigui, H., Nasraoui, O. (2004). Simultaneous Clustering and Dynamic Keyword Weighting for Text Documents. In: Berry, M.W. (ed.) Survey of Text Mining. Springer, New York, NY. https://doi.org/10.1007/978-1-4757-4305-0_3


DOI: https://doi.org/10.1007/978-1-4757-4305-0_3
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4419-3057-6
Online ISBN: 978-1-4757-4305-0
eBook Packages: Springer Book Archive
