
Knowledge and Information Systems, Volume 28, Issue 2, pp 395–421

On ontology-driven document clustering using core semantic features

  • Samah Fodeh
  • Bill Punch
  • Pang-Ning Tan
Regular Paper

Abstract

Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. We show in this paper that an ontology can be used to greatly reduce the number of features needed to do document clustering. Our hypothesis is that polysemous and synonymous nouns are both relatively prevalent and fundamentally important for document cluster formation. We show that nouns can be efficiently identified in documents and that this alone provides improved clustering. We next show the importance of the polysemous and synonymous nouns in clustering and develop a unique approach that allows us to measure the information gain in disambiguating these nouns in an unsupervised learning setting. In so doing, we can identify a core subset of semantic features that represent a text corpus. Empirical results show that by using core semantic features for clustering, one can reduce the number of features by 90% or more and still produce clusters that capture the main themes in a text corpus.
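The abstract does not spell out how the information gain is computed, so the following is only a minimal sketch of the general idea of ranking features by information gain and keeping a small core subset. It assumes documents have already been reduced to sets of nouns and that a preliminary clustering supplies the labels; it scores plain term presence rather than sense disambiguation, so it is a stand-in and not the authors' method. The names `entropy`, `information_gain`, `select_core_features`, and `keep_fraction` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(has_term, cluster_labels):
    """Reduction in cluster-label entropy from knowing whether a term is present."""
    with_term = [c for h, c in zip(has_term, cluster_labels) if h]
    without_term = [c for h, c in zip(has_term, cluster_labels) if not h]
    n = len(cluster_labels)
    conditional = (
        (len(with_term) / n) * entropy(with_term)
        + (len(without_term) / n) * entropy(without_term)
    )
    return entropy(cluster_labels) - conditional


def select_core_features(doc_terms, cluster_labels, keep_fraction=0.1):
    """Keep the top fraction of terms ranked by information gain.

    doc_terms: list of sets of terms (assumed here to be nouns), one per document.
    cluster_labels: one preliminary cluster id per document (an assumption).
    """
    vocab = sorted({t for terms in doc_terms for t in terms})
    scored = []
    for term in vocab:
        has_term = [term in terms for terms in doc_terms]
        scored.append((information_gain(has_term, cluster_labels), term))
    # Highest gain first; ties broken alphabetically for a stable result.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    keep = max(1, int(len(vocab) * keep_fraction))
    return [term for _, term in scored[:keep]]


# Toy corpus: "bank" appears in both clusters, "river" only in the second.
docs = [{"bank", "money"}, {"bank", "loan"}, {"river", "bank"}, {"river", "water"}]
labels = [0, 0, 1, 1]
core = select_core_features(docs, labels, keep_fraction=0.1)
```

In this toy corpus the ambiguous term "bank" scores lower than "river", which separates the two clusters perfectly. The default `keep_fraction=0.1` mirrors the 90% reduction reported in the abstract.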

Keywords

Clustering, Information gain, Semantic features, Ontology, Dimensionality reduction

Copyright information

© Springer-Verlag London Limited 2011

Authors and Affiliations

  1. Yale University, New Haven, USA
  2. Michigan State University, East Lansing, USA
