Abstract
In this chapter, we propose a new approach to unsupervised text document categorization based on a coupled process of clustering and cluster-dependent keyword weighting. The proposed algorithm is based on the K-Means clustering algorithm, so it is computationally simple and easy to implement. Moreover, it learns a different set of keyword weights for each cluster. As a by-product of the clustering process, each document cluster is therefore characterized by a possibly different set of keywords. The cluster-dependent keyword weights have two advantages: they help partition the document collection into more meaningful categories, and they can be used to automatically generate a compact description of each cluster in terms of not only the attribute values, but also their relevance. In particular, for text data, this approach can be used to automatically annotate the documents. We also extend the proposed approach to handle the inherent fuzziness in text documents by automatically generating fuzzy or soft labels instead of a hard, all-or-nothing categorization, so that a text document can belong to several categories with different degrees of membership. The proposed approach handles noise documents by automatically designating one or two "noise magnet" clusters that draw most outliers away from the other clusters. The performance of the proposed algorithm is illustrated by using it to cluster real text document collections.
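As a rough illustration of the coupled process described in the abstract, the following Python sketch alternates the usual K-Means assignment and centroid updates with a per-cluster keyword weight update. The abstract does not state the update equations, so the inverse-dispersion weight rule, the `smooth` constant, the function name `weighted_kmeans`, and the toy term-count vectors are all assumptions made for illustration. This is not the chapter's algorithm, and it covers only the hard (non-fuzzy) case.

```python
def weighted_kmeans(docs, init_centers, n_iter=10, smooth=1.0):
    """Hard K-Means in which every cluster keeps its own keyword weights.

    docs: list of equal-length term-count vectors (one per document).
    init_centers: list of starting centroids (copied, not modified).
    smooth: assumed constant that keeps weights finite when a term has
            zero dispersion inside a cluster.
    Returns (labels, centers, weights); each weights[i] sums to 1.
    """
    n_terms = len(docs[0])
    k = len(init_centers)
    centers = [list(c) for c in init_centers]
    weights = [[1.0 / n_terms] * n_terms for _ in range(k)]
    labels = [0] * len(docs)

    for _ in range(n_iter):
        # Assignment: nearest centroid under that cluster's own weights.
        for j, x in enumerate(docs):
            dists = [
                sum(w * (a - b) ** 2
                    for w, a, b in zip(weights[i], x, centers[i]))
                for i in range(k)
            ]
            labels[j] = dists.index(min(dists))

        for i in range(k):
            members = [x for x, lab in zip(docs, labels) if lab == i]
            if not members:
                continue  # empty cluster: keep its old centroid and weights

            # Centroid update: per-term mean of the cluster's documents.
            centers[i] = [sum(col) / len(members) for col in zip(*members)]

            # Weight update (assumed heuristic, not the chapter's rule):
            # terms with low within-cluster dispersion get higher weight.
            disp = [
                sum((x[t] - centers[i][t]) ** 2 for x in members)
                for t in range(n_terms)
            ]
            inv = [1.0 / (d + smooth) for d in disp]
            total = sum(inv)
            weights[i] = [v / total for v in inv]

    return labels, centers, weights


# Hypothetical term counts for the vocabulary: ball, team, vote, misc.
toy_docs = [
    [5, 4, 0, 0], [4, 5, 0, 6], [5, 5, 0, 3],   # sports-like documents
    [0, 0, 5, 6], [0, 1, 4, 0], [0, 0, 5, 3],   # politics-like documents
]
labels, centers, weights = weighted_kmeans(
    toy_docs, init_centers=[toy_docs[0], toy_docs[3]]
)
```

In this toy run the fourth term varies widely inside both clusters, so it ends up with the smallest weight in each, while each cluster keeps its own separate weight vector.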
Copyright information
© 2004 Springer Science+Business Media New York
About this chapter
Cite this chapter
Frigui, H., Nasraoui, O. (2004). Simultaneous Clustering and Dynamic Keyword Weighting for Text Documents. In: Berry, M.W. (eds) Survey of Text Mining. Springer, New York, NY. https://doi.org/10.1007/978-1-4757-4305-0_3
DOI: https://doi.org/10.1007/978-1-4757-4305-0_3
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4419-3057-6
Online ISBN: 978-1-4757-4305-0
eBook Packages: Springer Book Archive