Abstract
This paper describes a new technique for the direct translation of character n-grams for use in Cross-Language Information Retrieval systems. Because it works on n-grams rather than words, the technique needs no word normalization during indexing or translation, and it can also handle out-of-vocabulary words. The approach is knowledge-light: it does not rely on language-specific processing, so it can be applied to very different languages even when linguistic information and resources are scarce or unavailable. Our proposal also aims to make the n-gram alignment process faster than in previous approaches.
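As a rough illustration of why character n-grams avoid word normalization, the following minimal sketch splits text into overlapping character n-grams. The function names, the default n = 4, the lowercasing step, and the choice to keep n-grams within word boundaries are assumptions made for this example; they are not taken from the paper.

```python
from __future__ import annotations


def char_ngrams(word: str, n: int = 4) -> list[str]:
    """Return the overlapping character n-grams of a single word.

    Words no longer than n are kept whole (an assumption of this sketch).
    """
    word = word.lower()
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def tokenize(text: str, n: int = 4) -> list[str]:
    """Split on whitespace and emit n-grams word by word.

    No stemming, lemmatization, or dictionary lookup is involved, so
    unknown words are indexed like any other word.
    """
    grams: list[str] = []
    for word in text.split():
        grams.extend(char_ngrams(word, n))
    return grams


if __name__ == "__main__":
    print(char_ngrams("tomatoes"))
    print(tokenize("Tomato soup"))
```

In this sketch, "tomato" and "tomatoes" share the n-grams "toma", "omat", and "mato", so the two forms can match without a stemmer, and a word absent from any dictionary still yields n-grams.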
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Vilares, J., Oakes, M.P., Vilares, M. (2007). Character N-Grams Translation in Cross-Language Information Retrieval. In: Kedad, Z., Lammari, N., Métais, E., Meziane, F., Rezgui, Y. (eds) Natural Language Processing and Information Systems. NLDB 2007. Lecture Notes in Computer Science, vol 4592. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73351-5_19
DOI: https://doi.org/10.1007/978-3-540-73351-5_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73350-8
Online ISBN: 978-3-540-73351-5
eBook Packages: Computer Science, Computer Science (R0)