Exploring New Languages with HAIRCUT at CLEF 2005
JHU/APL has long espoused the use of language-neutral methods for cross-language information retrieval. This year we participated in the ad hoc cross-language track and submitted both monolingual and bilingual runs. We undertook our first investigations in the Bulgarian and Hungarian languages. In our bilingual experiments we used several query languages that are not traditional in CLEF, such as Greek, Hungarian, and Indonesian, in addition to several western European languages. We found that character n-grams remain an attractive option for representing documents and queries in these new languages. In our monolingual tests, n-grams were more effective than unnormalized words for retrieval in Bulgarian (+30%) and Hungarian (+63%). Our bilingual runs used subword translation (statistical translation of character n-grams using aligned corpora) when parallel data were available, and web-based machine translation when no suitable data could be found.
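To make the idea of representing text by character n-grams concrete, here is a minimal sketch of overlapping n-gram tokenization. It is illustrative only and is not taken from HAIRCUT: the function name `char_ngrams`, the default `n = 4`, the lowercasing, and the choice to let n-grams span word boundaries are all assumptions, since the abstract does not specify them.

```python
def char_ngrams(text: str, n: int = 4) -> list[str]:
    """Return the overlapping character n-grams of a text.

    Assumptions (not stated in the abstract): the text is lowercased,
    runs of whitespace collapse to a single space, one space pads each
    end, and n-grams may span word boundaries.
    """
    normalized = " ".join(text.lower().split())
    if not normalized:
        return []
    padded = f" {normalized} "
    if len(padded) < n:
        # Text shorter than n: keep it as a single short token.
        return [padded]
    # Slide a window of width n across the padded string, one character at a time.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("New York"))
```

Because the tokens come from a sliding window rather than from a dictionary or stemmer, the same routine applies unchanged to Bulgarian, Hungarian, or any other language, which is the sense in which the representation is language-neutral.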