Taming Wild Phrases

Koster, Cornelis H. A.; Seutter, Mark

doi:10.1007/3-540-36618-0_12

Cornelis H. A. Koster⁵ &
Mark Seutter⁵

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2633))

Included in the following conference series:

European Conference on Information Retrieval

1279 Accesses
11 Citations

Abstract

In this paper the suitability of different document representations for automatic document classification is compared, investigating a whole range of representations between bag-of-words and bag-of-phrases. We look at some of their statistical properties, and determine for each representation the optimal choice of classification parameters and the effect of Term Selection.

Phrases are represented by an abstraction called Head/Modifier pairs. Rather than just throwing phrases and keywords together, we shall start with pure HM pairs and gradually add more keywords to the document representation. We use the classification on keywords as the baseline, which we compare with the contribution of the pure HM pairs to classification accuracy, and the incremental contributions from heads and modifiers. Finally, we measure the accuracy achieved with all words and all HM pairs combined, which turns out to be only marginally above the baseline.

We conclude that even the most careful term selection cannot overcome the differences in Document Frequency between phrases and words, and propose the use of term clustering to make phrases more cooperative.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Avi Arampatzis, Jean Beney, C. H. A. Koster, Th.P. van der Weide, KUN on the TREC-9 Filtering Track: Incrementality, Decay, and Threshold Optimization for Adaptive Filtering Systems. The Ninth Text REtrieval Conference (TREC-9), Gaithersburg, Maryland, November 13–16, 2000.
Google Scholar
M. F. Caropreso, S. Matwin and F. Sebastiani (2001), A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization, In: A. G. Chin (Ed.), Text Databases and Document Management: Theory and Practice, Idea Group Publishing, Hershey, US, pp. 78–102.
Google Scholar
W. W. Cohen and Y. Singer (1999), Context-sensitive learning methods for text categorization. ACM Transactions on Information Systems 13,1, 100–111.
Google Scholar
I. Dagan, Y. Karov, D. Roth (1997), Mistake-Driven Learning in Text Categorization. In: Proceedings of the Second Conference on Empirical Methods in NLP, pp. 55–63.
Google Scholar
D. Evans and R. G. Lefferts (1994), Design and evaluation of the CLARIT-TREC-2 system. Proceedings TREC-2, NIST Special Publication 500-215, pp. 137–150.
Google Scholar
J. L. Fagan (1988), Experiments in automatic phrase indexing for document retrieval: a comparison of syntactic and non-syntactic methods, PhD Thesis, Cornell University.
Google Scholar
A. Grove, N. Littlestone, and D. Schuurmans (2001), General convergence results for linear discriminant updates. Machine Learning 43(3), pp. 173–210.
Article MATH Google Scholar
C. H. A. Koster, C. Derksen, D. van de Ende and J. Potjer, Normalization and matching in the DORO system. Proceedings of IRSG’99, 10pp.
Google Scholar
C. H. A. Koster, M. Seutter and J. Beney (2001), Classifying Patent Applications with Winnow, Proceedings Benelearn 2001, Antwerpen, 8pp.
Google Scholar
C. H. A. Koster and E. Verbruggen (2002), The AGFL Grammar Work Lab, Proceedings FREENIX/Usenix dy2002, pp 13–18.
Google Scholar
M. Krier and F. Zaccà (2001), Automatic Categorisation Applications at the European Patent Office, International CHemical Information Conference, Nimes, October 2001, 10 pp.
Google Scholar
Term Clustering of Syntactic Phrases (1990), Proceedings SIGIR 90, pp. 385–404.
Google Scholar
D. Lin (1995), A dependency-based method for evaluating broad-coverage parsers. Proceedings IJCAI-95, pp. 1420–1425.
Google Scholar
C. Peters and C. H. A. Koster (2002), Uncertainty-based Noise Reduction and Term Selection, Proceedings ECIR 2002, Springer LNCS 2291, pp 248–267.
Google Scholar
J. J. Rocchio (1971), Relevance feedback in Information Retrieval, In: Salton, G. (ed.), The Smart Retrieval system — experiments in automatic document processing, Prentice-Hall, Englewood Cliffs, NJ, pp 313–323.
Google Scholar
G. Ruge (1992), Experiments on Linguistically Based Term Associations, Information Processing & management, 28(3), pp. 317–332.
Article Google Scholar
T. Strzalkowski (1992), TTP: A Fast and Robust Parser for Natural Language, In: Proceedings COLING’ 92, pp 198–204.
Google Scholar
T. Strzalkowski, editor (1999), Natural Language Information Retrieval, Kluwer Academic Publishers, ISBN 0-7923-5685-3.
Google Scholar
Y. Yiming and J. P. Pedersen (1997), A Comparative Study on Feature Selection in Text Categorization. In: ICML 97, pp. 412–420.
Google Scholar

Download references

Author information

Authors and Affiliations

Dept. Comp. Sci., University of Nijmegen, The Netherlands
Cornelis H. A. Koster & Mark Seutter

Authors

Cornelis H. A. Koster
View author publications
You can also search for this author in PubMed Google Scholar
Mark Seutter
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Instituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, Via Giuseppe Moruzzi, 1, 56124, Pisa, Italy
Fabrizio Sebastiani

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Koster, C.H.A., Seutter, M. (2003). Taming Wild Phrases. In: Sebastiani, F. (eds) Advances in Information Retrieval. ECIR 2003. Lecture Notes in Computer Science, vol 2633. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36618-0_12

Download citation

DOI: https://doi.org/10.1007/3-540-36618-0_12
Published: 15 April 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-01274-0
Online ISBN: 978-3-540-36618-8
eBook Packages: Springer Book Archive

Publish with us

Policies and ethics