Knowledge and Information Systems, Volume 19, Issue 3, pp 265–281

Using Wikipedia knowledge to improve text classification

  • Pu Wang
  • Jian Hu
  • Hua-Jun Zeng
  • Zheng Chen
Regular Paper

Abstract

Text classification has been widely used to assist users with the discovery of useful information from the Internet. However, traditional classification methods are based on the “Bag of Words” (BOW) representation, which only accounts for term frequency in the documents, and ignores important semantic relationships between key terms. To overcome this problem, previous work attempted to enrich text representation by means of manual intervention or automatic document expansion. The achieved improvement is unfortunately very limited, due to the poor coverage capability of the dictionary, and to the ineffectiveness of term expansion. In this paper, we automatically construct a thesaurus of concepts from Wikipedia. We then introduce a unified framework to expand the BOW representation with semantic relations (synonymy, hyponymy, and associative relations), and demonstrate its efficacy in enhancing previous approaches for text classification. Experimental results on several data sets show that the proposed approach, integrated with the thesaurus built from Wikipedia, can achieve significant improvements with respect to the baseline algorithm.
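As a rough illustration of the idea in the abstract, the following minimal sketch expands a plain BOW vector with weighted concepts drawn from a thesaurus. It is not the authors' framework: the toy thesaurus entries, the relation weights, and the function names (`bag_of_words`, `expand_bow`, `overlap`) are all hypothetical and chosen only to show how synonymy, hyponymy, and associative relations could let two documents with no shared words still overlap.

```python
from collections import Counter

# Hypothetical toy thesaurus (not from the paper):
# term -> list of (related concept, relation type).
TOY_THESAURUS = {
    "puma": [("cougar", "synonym"), ("felidae", "hypernym")],
    "cougar": [("puma", "synonym"), ("felidae", "hypernym")],
    "jaguar": [("felidae", "hypernym"), ("rainforest", "associative")],
}

# Assumed weights per relation type; illustrative values only.
RELATION_WEIGHTS = {"synonym": 1.0, "hypernym": 0.5, "associative": 0.25}


def bag_of_words(text):
    """Plain BOW: lowercase whitespace tokens mapped to their frequencies."""
    return Counter(text.lower().split())


def expand_bow(bow, thesaurus=TOY_THESAURUS, weights=RELATION_WEIGHTS):
    """Add weighted counts for concepts related to each term in the BOW."""
    expanded = {term: float(count) for term, count in bow.items()}
    for term, count in bow.items():
        for concept, relation in thesaurus.get(term, []):
            expanded[concept] = expanded.get(concept, 0.0) + weights[relation] * count
    return expanded


def overlap(a, b):
    """Dot product of two sparse term-weight vectors."""
    return sum(value * b.get(term, 0.0) for term, value in a.items())


doc_a = bag_of_words("puma sighted near town")
doc_b = bag_of_words("cougar attacks hiker")

# The plain BOW vectors share no terms; the expanded vectors do.
print(overlap(doc_a, doc_b))
print(overlap(expand_bow(doc_a), expand_bow(doc_b)))
```

In this sketch the two documents share no surface terms, so their plain BOW overlap is zero, while the expanded vectors overlap through the synonym pair and the shared broader concept. A real system would also need word sense disambiguation and a thesaurus actually mined from Wikipedia, neither of which is attempted here.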

Keywords

Text classification, Wikipedia, Thesaurus

Copyright information

© Springer-Verlag London Limited 2008

Authors and Affiliations

  1. Department of Computer Science, George Mason University, Fairfax, USA
  2. Machine Learning Group, Microsoft Research Asia, Beijing, China
