A First Approach to CLIR Using Character N-Grams Alignment
This paper describes the technique for translating character n-grams that we developed for our participation in CLEF 2006. The approach avoids the need for word normalization during indexing or translation, and it can also handle out-of-vocabulary words. Because it does not rely on language-specific processing, it can be applied to very different languages, even when linguistic information and resources are scarce or unavailable. Our proposal makes extensive use of freely available resources and also aims to align n-grams faster than other similar approaches.
Keywords: Cross-Language Information Retrieval, character n-grams, translation algorithms, alignment algorithms, association measures
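To make the units involved concrete, the following minimal sketch splits text into overlapping character n-grams and scores a candidate source/target n-gram pair by how often the two co-occur in aligned sentences. The function names, the n-gram length of 4, the toy English/Spanish sentence pairs, and the use of the Dice coefficient as the association measure are all assumptions made for illustration; the abstract does not specify any of them, and this is not the system described in the paper.

```python
def char_ngrams(text, n=4):
    """Return overlapping character n-grams for each whitespace-separated word.

    Words no longer than n are kept whole. No stemming or other
    language-specific normalization is applied (only lowercasing).
    """
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def dice(src_gram, tgt_gram, sentence_pairs, n=4):
    """Dice coefficient of two n-grams over aligned sentence pairs.

    Assumption: co-occurrence is counted once per sentence pair.
    """
    src_count = tgt_count = both = 0
    for src_sentence, tgt_sentence in sentence_pairs:
        in_src = src_gram in set(char_ngrams(src_sentence, n))
        in_tgt = tgt_gram in set(char_ngrams(tgt_sentence, n))
        src_count += in_src
        tgt_count += in_tgt
        both += in_src and in_tgt
    total = src_count + tgt_count
    return 2 * both / total if total else 0.0


# Hypothetical toy parallel corpus (English, Spanish), invented for this sketch.
pairs = [
    ("potato", "patata"),
    ("tomato", "tomate"),
    ("potato soup", "sopa de patata"),
]

print(char_ngrams("tomatoes potato"))
print(dice("pota", "pata", pairs))
print(dice("pota", "toma", pairs))
```

In this toy corpus "pota" and "pata" always appear in the same sentence pairs, so they receive the highest possible score, while "pota" and "toma" never co-occur and score zero. Because the units are raw character sequences, a word absent from any dictionary still yields n-grams that can be matched.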