Phrase-Based Document Categorization

Koster, Cornelis H. A.; Beney, Jean G.; Verberne, Suzan; Vogel, Merijn

doi:10.1007/978-3-642-19231-9_13

Part of the book series: The Information Retrieval Series (INRE, volume 29)

Abstract

This chapter takes a fresh look at an old idea in Information Retrieval: the use of linguistically extracted phrases as terms in the automatic categorization of documents, and in particular the pre-classification of patent applications. Until now, Information Retrieval research has found little or no evidence that document categorization benefits from the application of linguistic techniques: classification algorithms using the most cleverly designed linguistic representations typically did not perform better than those using the simple bag-of-words representation. We have investigated the use of dependency triples as terms in document categorization, according to a dependency model based on the notion of aboutness and using normalizing transformations to enhance recall. We describe a number of large-scale experiments with different document representations, test collections and even languages, presenting evidence that adding such triples to the words in a bag-of-terms document representation may lead to a statistically significant increase in the accuracy of document categorization.
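
As a rough illustration of what adding triples to the words in a bag-of-terms representation could look like, the following minimal Python sketch counts words and dependency triples in one shared term space. It is not the chapter's implementation: the function name `bag_of_terms`, the bracketed string encoding of a triple, and the relation labels `SUBJ` and `OBJ` are all assumptions made for this example, and the triples are written by hand rather than produced by a parser.

```python
from collections import Counter


def bag_of_terms(words, triples):
    """Count words and dependency triples in one shared term space.

    `triples` holds (head, relation, dependent) tuples. The encoding used
    here is hypothetical and chosen only for illustration.
    """
    # Plain bag-of-words part: lowercase each token and count it.
    terms = Counter(w.lower() for w in words)
    # Add each triple as a single extra term alongside the words.
    for head, rel, dep in triples:
        terms[f"[{head.lower()},{rel},{dep.lower()}]"] += 1
    return terms


# Hand-written example input (assumed, not taken from the chapter's data).
words = ["the", "valve", "controls", "the", "flow"]
triples = [("valve", "SUBJ", "control"), ("control", "OBJ", "flow")]

doc_terms = bag_of_terms(words, triples)
```

The resulting counter could then be passed to any classifier that accepts sparse term counts; the sketch only shows how the two kinds of terms share one representation.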

Notes

1. The terms Categorization and Classification are used interchangeably.
2. See www.ir-facility.org/research/evaluation/clef-ip-10.
3. www.phasar.cs.ru.nl.

Author information

Authors and Affiliations

Computing Science Institute ICIS, Univ. of Nijmegen, Nijmegen, The Netherlands
Cornelis H. A. Koster, Suzan Verberne & Merijn Vogel
Dept. Informatique, LCI, INSA de Lyon, Lyon, France
Jean G. Beney

Corresponding author

Correspondence to Cornelis H. A. Koster.

Editor information

Editors and Affiliations

Information Retrieval Facility, Donau-City Straße 1, Vienna, 1220, Austria
Mihai Lupu, Katja Mayer & John Tait
3LP Advisors, Post Rd. 7003, Dublin, 43016, Ohio, USA
Anthony J. Trippe

About this chapter

Cite this chapter

Koster, C.H.A., Beney, J.G., Verberne, S., Vogel, M. (2011). Phrase-Based Document Categorization. In: Lupu, M., Mayer, K., Tait, J., Trippe, A. (eds) Current Challenges in Patent Information Retrieval. The Information Retrieval Series, vol 29. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19231-9_13

DOI: https://doi.org/10.1007/978-3-642-19231-9_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19230-2
Online ISBN: 978-3-642-19231-9
eBook Packages: Computer Science, Computer Science (R0)
