Feature Selection and Document Clustering

  • Inderjit Dhillon
  • Jacob Kogan
  • Charles Nicholas

Abstract

Feature selection is a basic step in the construction of a vector space or bag-of-words model [BB99]. In particular, when the task is to partition a given document collection into clusters of similar documents, the choice of good features, together with good clustering algorithms, is of paramount importance. This chapter suggests two techniques for feature (term) selection along with a number of clustering strategies. The selection techniques significantly reduce the dimension of the vector space model. Examples illustrating the effectiveness of the proposed algorithms are provided.

Keywords

  • Feature Selection
  • Cluster Algorithm
  • Confusion Matrix
  • Document Collection
  • Term Selection

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. [BB99]
    M.W. Berry and M. Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM, Philadelphia, 1999.
  2. [BB02]
    P. Berkhin and J.D. Becher. Learning simple relations: Theory and applications. In Proceedings of the Second SIAM International Conference on Data Mining, Arlington, VA, pages 410–436, April 2002.
  3. [BGG+99a]
    D. Boley, M. Gini, R. Gross, E.-H. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. Document categorization and query generation on the World Wide Web using WebACE. AI Review, 13 (5,6): 365–391, 1999.
  4. [BGG+99b]
    D. Boley, M. Gini, R. Gross, E.-H. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. Partitioning-based clustering for Web document categorization. Decision Support Systems, 27 (3): 329–341, 1999.
  5. [Bol98]
    D.L. Boley. Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2 (4): 325–344, 1998.
  6. [Dam95]
    M. Damashek. Gauging similarity with n-grams: Language-independent categorization of text. Science, 267: 843–848, 1995.
  7. [DGK02]
    I.S. Dhillon, Y. Guan, and J. Kogan. Refining clusters in high-dimensional text data. In Proceedings of the Workshop on Clustering High Dimensional Data and Its Applications at the Second SIAM International Conference on Data Mining, I.S. Dhillon and J. Kogan, eds., pages 71–82. SIAM, Philadelphia, 2002.
  8. [DM01]
    I.S. Dhillon and D.S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42 (1): 143–175, Jan 2001. Also appears as IBM Research Report RJ 10147, Jul 1999.
  9. [DMK02]
    I.S. Dhillon, S. Mallela, and R. Kumar. Enhanced word clustering for hierarchical text classification. In KDD-2002, 2002.
  10. [DHS01]
    R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification, second edition. Wiley, New York, 2001.
  11. [For65]
    E. Forgy. Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications. Biometrics, 21 (3): 768, 1965.
  12. [GK02]
    E. Gendler and J. Kogan. Index terms selection for clustering large text data. In Proceedings of the Workshop on Text Mining at the Second SIAM International Conference on Data Mining, M.W. Berry, ed., pages 87–94, 2002.
  13. [GKSR+02]
    M. Ganapathiraju, J. Klein-Seetharaman, R. Rosenfeld, J. Carbonell, and R. Reddy. Rare and frequent n-grams in whole-genome protein sequences. In Proceedings of RECOMB’02: The Sixth Annual International Conference on Research in Computational Molecular Biology, 2002.
  14. [Gre94]
    G. Grefenstette. Explorations in Automatic Thesaurus Discovery. Kluwer Academic, Boston, 1994.
  15. [Kog01a]
    J. Kogan. Clustering large unstructured document sets. In Computational Information Retrieval, M.W. Berry, ed., pages 107–117, SIAM, Philadelphia, 2001.
  16. [Kog01b]
    J. Kogan. Means clustering for text data. In Proceedings of the Workshop on Text Mining at the First SIAM International Conference on Data Mining, M.W. Berry, ed., pages 47–57, 2001.
  17. [Kog02]
    J. Kogan. Computational information retrieval. Springer-Verlag Lecture Notes in Contributions to Statistics, H.R. Lerche, ed., 2002. To appear.
  18. [PN96]
    C. Pearce and C. Nicholas. TELLTALE: Experiments in a dynamic hypertext environment for degraded and multilingual data. Journal of the American Society for Information Science, 47: 263–275, 1996.
  19. [Por80]
    M.F. Porter. An algorithm for suffix stripping. Program, 14: 130–137, 1980.
  20. [SM83]
    G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.
  21. [SP95]
    H. Schütze and J. Pedersen. Information retrieval based on word senses. In Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, pages 161–175, 1995.
  22. [SS02]
    R. Shamir and R. Sharan. Algorithmic approaches to clustering gene expression data. In Current Topics in Computational Molecular Biology, T. Jiang, T. Smith, Y. Xu, and M.Q. Zhang, eds., pages 269–300, MIT Press, Cambridge, MA, 2002.
  23. [ST01]
    N. Slonim and N. Tishby. The power of word clusters for text classification. In Proceedings of the 23rd European Colloquium on Information Retrieval Research (ECIR), Darmstadt, 2001.
  24. [ZK02]
    Y. Zhao and G. Karypis. Comparison of agglomerative and partitional document clustering algorithms. In Proceedings of the Workshop on Clustering High Dimensional Data and Its Applications at the Second SIAM International Conference on Data Mining, I.S. Dhillon and J. Kogan, eds., pages 83–93. SIAM, Philadelphia, 2002.

Copyright information

© Springer Science+Business Media New York 2004
