JHU/APL Ad Hoc Experiments at CLEF 2006
JHU/APL continued its participation in multilingual retrieval at CLEF in 2006. We again applied our hallmark technique for combating language diversity and morphological complexity: character n-gram tokenization. This year we participated in the ad hoc cross-language track and submitted both monolingual and bilingual runs. Our results agree with our previous reports that n-grams perform especially well in linguistically complex languages, notably Bulgarian and Hungarian, where we observed monolingual improvements of 27% and 70%, respectively, over space-delimited word forms. As in CLEF 2005, our bilingual submissions used subword translation (statistical translation of character n-grams using aligned corpora) when parallel data were available, and web-based machine translation when no suitable data were available to us.
Keywords: Cross-language information retrieval, character n-gram tokenization, corpus-based translation
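
As an illustration of the character n-gram tokenization named above, the following minimal sketch splits text into overlapping character n-grams. The function name `char_ngrams`, the n-gram length of 4, the lowercasing step, and the use of `_` as a word-boundary marker are all assumptions made for this example; the abstract does not specify the settings used in the actual runs.

```python
def char_ngrams(text, n=4):
    """Return overlapping character n-grams of `text`.

    Assumptions for illustration only: text is lowercased, runs of
    whitespace are collapsed, and '_' marks word boundaries so that
    n-grams may span adjacent words.
    """
    normalized = "_" + "_".join(text.lower().split()) + "_"
    if len(normalized) < n:
        # Text shorter than n: return it as a single token.
        return [normalized]
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("juggling"))
    # Morphological variants share n-grams, which is why no stemmer is needed.
    shared = set(char_ngrams("juggling")) & set(char_ngrams("juggler"))
    print(sorted(shared))
```

Because related word forms such as "juggling" and "juggler" share several n-grams, a query containing one form can still match documents containing the other without language-specific stemming, which is consistent with the gains reported for morphologically rich languages.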