Exploring New Languages with HAIRCUT at CLEF 2005

  • Paul McNamee
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4022)


JHU/APL has long espoused the use of language-neutral methods for cross-language information retrieval. This year we participated in the ad hoc cross-language track and submitted both monolingual and bilingual runs. We undertook our first investigations in the Bulgarian and Hungarian languages. In our bilingual experiments we used several non-traditional CLEF query languages such as Greek, Hungarian, and Indonesian, in addition to several western European languages. We found that character n-grams remain an attractive option for representing documents and queries in these new languages. In our monolingual tests n-grams were more effective than unnormalized words for retrieval in Bulgarian (+30%) and Hungarian (+63%). Our bilingual runs used subword translation (statistical translation of character n-grams using aligned corpora) when parallel data were available, and web-based machine translation when no suitable data could be found.
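
As a rough illustration of the character n-gram representation described in the abstract, the following Python sketch splits text into overlapping character n-grams. It is not the HAIRCUT implementation: the function name `char_ngrams`, the default length `n=4`, the lowercasing, and the space padding at the text edges are all assumptions made for this example.

```python
def char_ngrams(text, n=4):
    """Return overlapping character n-grams of `text` (illustrative sketch).

    Assumptions (not taken from the paper): text is lowercased, runs of
    whitespace are collapsed to one space, and a single space is added at
    each end so that the edges of the text are marked.
    """
    words = text.lower().split()
    if not words:
        return []
    normalized = " " + " ".join(words) + " "
    # Text shorter than n yields one short gram instead of nothing.
    if len(normalized) < n:
        return [normalized]
    # Slide a window of width n over the text, one character at a time;
    # windows may span the space between two words.
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("new cat"))
```

Because the tokens are produced without a stemmer, stopword list, or other language-specific resource, the same function can be applied unchanged to Bulgarian or Hungarian text, which is the sense in which the approach is language-neutral.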





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Paul McNamee, The Johns Hopkins University Applied Physics Laboratory, Laurel, USA
