
Simultaneous Clustering and Dynamic Keyword Weighting for Text Documents

  • Chapter
Survey of Text Mining

Abstract

In this chapter, we propose a new approach to unsupervised text document categorization based on a coupled process of clustering and cluster-dependent keyword weighting. The proposed algorithm is based on the K-Means clustering algorithm, so it is computationally simple and easy to implement. Moreover, it learns a different set of keyword weights for each cluster, so that, as a by-product of the clustering process, each document cluster is characterized by a possibly different set of keywords. The cluster-dependent keyword weights have two advantages: they help partition the document collection into more meaningful categories, and they can be used to automatically generate a compact description of each cluster in terms of not only the attribute values, but also their relevance. In particular, for text data, this approach can be used to automatically annotate the documents. We also extend the proposed approach to handle the inherent fuzziness in text documents by automatically generating fuzzy or soft labels instead of a hard, all-or-nothing categorization, so that a text document can belong to several categories to different degrees. The proposed approach handles noise documents by automatically designating one or two noise magnet clusters that draw most outliers away from the other clusters. The performance of the proposed algorithm is illustrated by clustering real text document collections.
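To make the idea of coupling clustering with cluster-dependent keyword weights concrete, the following is a minimal sketch of a K-Means-style loop that keeps one weight vector per cluster. It is an illustration only, not the algorithm derived in the chapter: the function name `weighted_kmeans`, the squared Euclidean per-keyword distance, and the inverse-dispersion weight update are all assumptions chosen for brevity, since the abstract does not state the actual objective function or update equations.

```python
import numpy as np


def weighted_kmeans(X, k, n_iter=20, seed=0):
    """Toy K-Means variant with one keyword-weight vector per cluster.

    Assumption (not from the chapter): a keyword's weight in a cluster is
    inversely proportional to its within-cluster dispersion, and each
    cluster's weights are normalized to sum to 1.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialize centers from k distinct documents; start with uniform weights.
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    weights = np.full((k, d), 1.0 / d)
    labels = np.zeros(n, dtype=int)

    for _ in range(n_iter):
        # Per-keyword squared differences, shape (n, k, d).
        diff2 = (X[:, None, :] - centers[None, :, :]) ** 2
        # Cluster-dependent weighted distance, shape (n, k).
        dist = (diff2 * weights[None, :, :]).sum(axis=2)
        labels = dist.argmin(axis=1)

        for i in range(k):
            members = X[labels == i]
            if len(members) == 0:
                continue  # keep the previous center and weights
            centers[i] = members.mean(axis=0)
            # Keywords with low dispersion inside the cluster get high weight.
            dispersion = ((members - centers[i]) ** 2).mean(axis=0)
            inv = 1.0 / (dispersion + 1e-6)
            weights[i] = inv / inv.sum()

    return labels, centers, weights
```

Given a document-by-keyword matrix `X`, calling `labels, centers, weights = weighted_kmeans(X, k=2)` returns a hard label per document and a `(k, d)` weight matrix. Sorting row `i` of `weights` in descending order gives one possible keyword ranking for cluster `i`, which is the sense in which cluster-dependent weights can double as a compact cluster description. The fuzzy memberships and noise magnet clusters mentioned in the abstract are not covered by this sketch.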




Copyright information

© 2004 Springer Science+Business Media New York

About this chapter

Cite this chapter

Frigui, H., Nasraoui, O. (2004). Simultaneous Clustering and Dynamic Keyword Weighting for Text Documents. In: Berry, M.W. (ed.) Survey of Text Mining. Springer, New York, NY. https://doi.org/10.1007/978-1-4757-4305-0_3


  • DOI: https://doi.org/10.1007/978-1-4757-4305-0_3

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4419-3057-6

  • Online ISBN: 978-1-4757-4305-0

  • eBook Packages: Springer Book Archive
