JHU Ad Hoc Experiments at CLEF 2008

McNamee, Paul

doi:10.1007/978-3-642-04447-2_21

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5706)

Included in the following conference series:

Workshop of the Cross-Language Evaluation Forum for European Languages

Abstract

For CLEF 2008, JHU conducted monolingual and bilingual experiments in the ad hoc TEL and Persian tasks. Additionally, we performed several post hoc experiments using previous CLEF ad hoc test sets in 13 languages.

In all three tasks we explored alternative methods of tokenizing documents, including plain words, stemmed words, automatically induced segments, a single selected n-gram from each word, and all n-grams from words (i.e., traditional character n-grams). Character n-grams demonstrated consistent gains over ordinary words in each of these three diverse sets of experiments. Measured by mean average precision, we observed relative gains of 50-200% on the TEL task, 5% on the Persian task, and 18% averaged over 13 languages from past CLEF evaluations.
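
The two n-gram tokenizations named above can be illustrated with a minimal Python sketch. It is not the system used in these experiments. The function names, the n-gram length of 4, whitespace word splitting, keeping short words whole, and the rule of selecting the n-gram with the lowest document frequency as the "single selected n-gram" are all assumptions made for illustration.

```python
from collections import Counter

def char_ngrams(word, n=4):
    """All overlapping character n-grams within one word.
    Assumption: words no longer than n are kept whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def tokenize_all_ngrams(text, n=4):
    """Index every word-internal n-gram (traditional character n-grams)."""
    return [g for w in text.lower().split() for g in char_ngrams(w, n)]

def ngram_document_frequency(docs, n=4):
    """Count, for each n-gram, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize_all_ngrams(doc, n)))
    return df

def tokenize_single_ngram(text, df, n=4):
    """Keep one n-gram per word.
    Assumption: the n-gram with the lowest document frequency is chosen;
    ties go to the leftmost candidate."""
    return [min(char_ngrams(w, n), key=lambda g: df.get(g, 0))
            for w in text.lower().split()]
```

Under these assumptions, `char_ngrams("retrieval")` yields the six 4-grams `retr`, `etri`, `trie`, `riev`, `ieva`, `eval`, while `tokenize_single_ngram` emits exactly one of a word's n-grams, so its index holds one term per word, as a stemmer's would.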

Author information

Authors and Affiliations

JHU Human Language Technology Center of Excellence, USA
Paul McNamee

Editor information

Editors and Affiliations

Istituto di Scienza e Tecnologie dell’Informazione, CNR, Pisa, Italy
Carol Peters
RWTH Aachen University, Aachen, Germany
Thomas Deselaers
University of Padua, Padua, Italy
Nicola Ferro
LSI-UNED, Madrid, Spain
Julio Gonzalo & Anselmo Peñas
Dublin City University, Dublin 9, Ireland
Gareth J. F. Jones
Helsinki University of Technology, Espoo, Finland
Mikko Kurimo
University of Hildesheim, Hildesheim, Germany
Thomas Mandl
Humboldt University Berlin, Germany
Vivien Petras

About this paper

Cite this paper

McNamee, P. (2009). JHU Ad Hoc Experiments at CLEF 2008. In: Peters, C., et al. Evaluating Systems for Multilingual and Multimodal Information Access. CLEF 2008. Lecture Notes in Computer Science, vol 5706. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04447-2_21

DOI: https://doi.org/10.1007/978-3-642-04447-2_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04446-5
Online ISBN: 978-3-642-04447-2
eBook Packages: Computer Science, Computer Science (R0)
