Phrase-Based Document Categorization

Part of the book series: The Information Retrieval Series ((INRE,volume 29))

Abstract

This chapter takes a fresh look at an old idea in Information Retrieval: the use of linguistically extracted phrases as terms in the automatic categorization of documents, and in particular in the pre-classification of patent applications. Until now, Information Retrieval research has found little or no evidence that document categorization benefits from linguistic techniques: classification algorithms using even the most cleverly designed linguistic representations typically performed no better than those using a simple bag-of-words representation. We have investigated the use of dependency triples as terms in document categorization, following a dependency model that is based on the notion of aboutness and that uses normalizing transformations to enhance recall. We describe a number of large-scale experiments with different document representations, test collections and even languages, and present evidence that adding such triples to the words in a bag-of-terms document representation may lead to a statistically significant increase in the accuracy of document categorization.
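As a rough illustration of what "adding triples to the words in a bag-of-terms representation" can look like, the sketch below counts plain words and dependency triples in a single term bag. It is not the authors' implementation: the function name `bag_of_terms`, the bracketed string encoding of a triple, the relation labels (`SUBJ`, `OBJ`, `ATTR`) and the hand-written example triples are all assumptions made for this example, and no parser or normalizing transformation is involved.

```python
from collections import Counter


def bag_of_terms(words, triples):
    """Count words and dependency triples together in one bag of terms.

    `triples` holds (head, relation, modifier) tuples. In a real system
    they would come from a dependency parser; here they are supplied by
    hand, which is an assumption of this sketch.
    """
    # Plain bag-of-words part: lowercase each word and count it.
    bag = Counter(w.lower() for w in words)
    # Encode each triple as one string term so it can share the bag
    # with the single-word terms (hypothetical encoding).
    bag.update(
        f"[{head.lower()},{rel},{mod.lower()}]" for head, rel, mod in triples
    )
    return bag


# Hypothetical example sentence and hand-made triples.
words = ["the", "device", "measures", "blood", "pressure"]
triples = [
    ("device", "SUBJ", "measure"),
    ("measure", "OBJ", "pressure"),
    ("pressure", "ATTR", "blood"),
]
bag = bag_of_terms(words, triples)
```

A classifier that accepts sparse term counts can consume such a bag directly; the triple terms simply add extra features next to the word features.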

Notes

  1. The terms Categorization and Classification are used interchangeably.

  2. See www.ir-facility.org/research/evaluation/clef-ip-10.

  3. www.phasar.cs.ru.nl.

Corresponding author

Correspondence to Cornelis H. A. Koster.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Koster, C.H.A., Beney, J.G., Verberne, S., Vogel, M. (2011). Phrase-Based Document Categorization. In: Lupu, M., Mayer, K., Tait, J., Trippe, A. (eds) Current Challenges in Patent Information Retrieval. The Information Retrieval Series, vol 29. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19231-9_13

  • DOI: https://doi.org/10.1007/978-3-642-19231-9_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-19230-2

  • Online ISBN: 978-3-642-19231-9

  • eBook Packages: Computer Science, Computer Science (R0)
