JHU/APL continued its participation in multilingual retrieval at CLEF in 2006. We again applied our hallmark technique for combating language diversity and morphological complexity: character n-gram tokenization. We participated in the ad hoc cross-language track and submitted both monolingual and bilingual runs. Our experimental results agree with our previous reports that n-grams perform especially well in linguistically complex languages, notably Bulgarian and Hungarian, where monolingual improvements of 27% and 70%, respectively, were observed compared to space-delimited word forms. As in CLEF 2005, our bilingual submissions made use of subword translation (statistical translation of character n-grams using aligned corpora) when parallel data were available, and of web-based machine translation when no suitable data were available to us.
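
To make the contrast between character n-grams and space-delimited word forms concrete, the following is a minimal sketch of character n-gram tokenization. It is not the system used for the runs described above: the function name `char_ngrams`, the default length `n=4`, the lowercasing, and the choice to let n-grams span word boundaries are all assumptions made for illustration, since the abstract does not state these details.

```python
def char_ngrams(text: str, n: int = 4) -> list[str]:
    """Return the overlapping character n-grams of `text`.

    Illustrative assumptions (not taken from the paper): text is
    lowercased, whitespace runs collapse to one space, the string is
    padded with one space on each side, and n-grams may span words.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Lowercase and collapse any run of whitespace to a single space.
    normalized = " ".join(text.lower().split())
    if not normalized:
        return []
    # Pad so that word-initial and word-final n-grams are marked.
    padded = f" {normalized} "
    # Slide a window of width n across the padded string.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def word_tokens(text: str) -> list[str]:
    """Baseline for comparison: space-delimited word forms."""
    return text.lower().split()


if __name__ == "__main__":
    sample = "cat dog"
    print(word_tokens(sample))
    print(char_ngrams(sample, 4))
```

Under these assumptions, inflected variants of a word share most of their n-grams even though they are distinct word forms, which is one plausible reading of why the approach helps in morphologically rich languages.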


Keywords: cross-language information retrieval; character n-gram tokenization; corpus-based translation





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Paul McNamee (1)
  1. Johns Hopkins University Applied Physics Laboratory, 11100 Johns Hopkins Road, Laurel, MD 20723-6099, USA
