Information Retrieval, Volume 10, Issue 6, pp 563–579

A new unsupervised method for document clustering by using WordNet lexical and conceptual relations

Abstract

Text document clustering provides an effective and intuitive navigation mechanism for organizing a large number of retrieval results by grouping documents into a small number of meaningful classes. Many well-known text clustering methods use a long list of words as the vector space, which is often unsatisfactory for two reasons: first, it keeps the dimensionality of the data very high, and second, it ignores important relationships between terms, such as synonymy or antonymy. Our unsupervised method addresses both problems by using ANNIE and WordNet lexical categories, together with the WordNet ontology, to create a well-structured document vector space whose low dimensionality allows common clustering algorithms to perform well. For the clustering step we have chosen bisecting k-means for its accuracy and the Multipole tree, a modified version of the Antipole tree data structure, for its speed.
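As a rough illustration of the idea of clustering in a low-dimensional category space, the following minimal sketch maps each document to counts over a handful of lexical categories and then applies a generic bisecting k-means. It is not the authors' implementation: the `TOY_CATEGORIES` lexicon is a hypothetical stand-in for WordNet lexical categories, the function names are invented for this example, splitting the cluster with the largest squared error is one common choice assumed here, and ANNIE, the WordNet ontology, and the Multipole tree are not modelled at all.

```python
import random
from collections import Counter

# Hypothetical toy lexicon standing in for WordNet lexical categories
# (illustrative only; this is not real WordNet data).
TOY_CATEGORIES = {
    "dog": "noun.animal", "cat": "noun.animal", "horse": "noun.animal",
    "bank": "noun.possession", "loan": "noun.possession",
    "money": "noun.possession", "run": "verb.motion", "walk": "verb.motion",
}
AXES = sorted(set(TOY_CATEGORIES.values()))  # one dimension per category


def to_vector(text):
    """Count words per category; words missing from the lexicon are skipped."""
    counts = Counter(TOY_CATEGORIES[w] for w in text.lower().split()
                     if w in TOY_CATEGORIES)
    return [float(counts[a]) for a in AXES]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]


def sse(vectors, idx):
    """Sum of squared distances from the members of idx to their centroid."""
    center = mean([vectors[i] for i in idx])
    return sum(sq_dist(vectors[i], center) for i in idx)


def two_means(points, rng, iters=20):
    """Plain 2-means; returns two lists of indices into points."""
    centers = rng.sample(points, 2)
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for i, p in enumerate(points):
            nearer_first = sq_dist(p, centers[0]) <= sq_dist(p, centers[1])
            groups[0 if nearer_first else 1].append(i)
        if not groups[0] or not groups[1]:
            break  # degenerate split, e.g. two identical seed points
        centers = [mean([points[i] for i in g]) for g in groups]
    return groups


def bisecting_kmeans(vectors, k, seed=0, trials=5):
    """Repeatedly bisect the cluster with the largest squared error."""
    rng = random.Random(seed)
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        target = max(clusters, key=lambda c: sse(vectors, c))
        if len(target) < 2:
            break
        pts = [vectors[i] for i in target]
        best = None
        for _ in range(trials):  # keep the best of several random bisections
            left, right = two_means(pts, rng)
            if not left or not right:
                continue
            score = sse(pts, left) + sse(pts, right)
            if best is None or score < best[0]:
                best = (score, left, right)
        if best is None:
            break  # the chosen cluster could not be split further
        clusters.remove(target)
        clusters.append([target[i] for i in best[1]])
        clusters.append([target[i] for i in best[2]])
    return clusters


docs = ["dog cat horse", "cat dog", "bank loan money", "money loan"]
vectors = [to_vector(d) for d in docs]
clusters = bisecting_kmeans(vectors, k=2)
print(clusters)
```

In this sketch each document becomes a vector with one entry per category rather than one entry per distinct word, which is the sense in which a category-based representation keeps the dimensionality low.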

Keywords

Clustering; Text documents; Bisecting k-means; Multipole; Antipole; WordNet

Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  1. Dipartimento di Matematica e Informatica, Università degli Studi di Catania, Catania, Italy