Phrase-Based Document Categorization

  • Cornelis H. A. Koster
  • Jean G. Beney
  • Suzan Verberne
  • Merijn Vogel
Chapter

Abstract

This chapter takes a fresh look at an old idea in Information Retrieval: the use of linguistically extracted phrases as terms in the automatic categorization of documents, and in particular the pre-classification of patent applications. Until now, Information Retrieval research has found little or no evidence that document categorization benefits from the application of linguistic techniques. Classification algorithms using even the most cleverly designed linguistic representations typically did not perform better than those using the simple bag-of-words representation. We have investigated the use of dependency triples as terms in document categorization, according to a dependency model based on the notion of aboutness and using normalizing transformations to enhance recall. We describe a number of large-scale experiments with different document representations, test collections and even languages, presenting evidence that adding such triples to the words in a bag-of-terms document representation may lead to a statistically significant increase in the accuracy of document categorization.
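To make the idea of a combined bag-of-terms concrete, the following is a minimal sketch and not the chapter's implementation. It assumes that dependency triples have already been extracted by some parser and are supplied as (head, relation, modifier) tuples. The function name `bag_of_terms`, the relation labels, and the bracketed string encoding of a triple are all hypothetical choices made for illustration.

```python
from collections import Counter


def bag_of_terms(words, triples):
    """Count word terms and dependency-triple terms in one shared bag.

    words   -- list of word tokens from a document
    triples -- list of (head, relation, modifier) tuples; assumed to come
               from a dependency parser, which is not shown here
    """
    # Plain bag-of-words: one term per lowercased token.
    bag = Counter(w.lower() for w in words)
    # Each triple is encoded as a single string term (hypothetical format)
    # so that it can sit in the same bag as the words.
    bag.update(f"[{h.lower()},{r},{m.lower()}]" for h, r, m in triples)
    return bag


# Hand-written example input; the relation labels are illustrative only.
words = ["the", "system", "classifies", "patent", "applications"]
triples = [
    ("system", "SUBJ", "classifies"),
    ("classifies", "OBJ", "applications"),
    ("applications", "ATTR", "patent"),
]
bag = bag_of_terms(words, triples)
```

The resulting counter can be passed to any classifier that accepts sparse term counts. Because words and triples share one bag, the triples add to the word features and do not replace them.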

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Cornelis H. A. Koster (1)
  • Jean G. Beney (2)
  • Suzan Verberne (1)
  • Merijn Vogel (1)
  1. Computing Science Institute ICIS, Univ. of Nijmegen, Nijmegen, The Netherlands
  2. Dept. Informatique, LCI, INSA de Lyon, Lyon, France
