Character N-Grams Translation in Cross-Language Information Retrieval

Vilares, Jesús; Oakes, Michael P.; Vilares, Manuel

doi:10.1007/978-3-540-73351-5_19

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4592)

Included in the following conference series:

International Conference on Application of Natural Language to Information Systems

Abstract

This paper describes a new technique for the direct translation of character n-grams for use in Cross-Language Information Retrieval systems. This solution avoids the need for word normalization during indexing or translation, and it can also deal with out-of-vocabulary words. This knowledge-light approach does not rely on language-specific processing, and it can be used with very different languages, even when linguistic information and resources are scarce or unavailable. Our proposal also aims to speed up the n-gram alignment process relative to previous approaches.
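
As a rough illustration of the indexing unit the abstract refers to, the following minimal sketch splits text into overlapping character n-grams. It is not the authors' implementation: the function name `char_ngrams`, the default size n = 4, the lowercasing, and the handling of words shorter than n are all assumptions made for this example.

```python
def char_ngrams(text: str, n: int = 4) -> list[str]:
    """Split text into overlapping character n-grams, word by word.

    Hypothetical helper for illustration only; n = 4 and the
    short-word rule below are assumptions, not taken from the paper.
    """
    grams: list[str] = []
    for word in text.lower().split():
        if len(word) <= n:
            # Assumption: a word no longer than n is kept whole.
            grams.append(word)
        else:
            # Slide a window of width n over the word, one character at a time.
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


print(char_ngrams("tomatoes"))
# → ['toma', 'omat', 'mato', 'atoe', 'toes']
```

Because the units are plain substrings, no stemmer or dictionary is involved at this stage, which is in keeping with the knowledge-light character described above.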

Author information

Authors and Affiliations

Department of Computer Science, University of A Coruña, Campus de Elviña s/n, 15071 – A Coruña, Spain
Jesús Vilares
School of Computing and Technology, University of Sunderland, St. Peter’s Campus, St. Peter’s Way, Sunderland – SR6 0DD, United Kingdom
Michael P. Oakes
Department of Computer Science, University of Vigo, Campus As Lagoas s/n, 32004 – Ourense, Spain
Manuel Vilares

Editor information

Zoubida Kedad, Nadira Lammari, Elisabeth Métais, Farid Meziane, Yacine Rezgui

About this paper

Cite this paper

Vilares, J., Oakes, M.P., Vilares, M. (2007). Character N-Grams Translation in Cross-Language Information Retrieval. In: Kedad, Z., Lammari, N., Métais, E., Meziane, F., Rezgui, Y. (eds) Natural Language Processing and Information Systems. NLDB 2007. Lecture Notes in Computer Science, vol 4592. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73351-5_19

DOI: https://doi.org/10.1007/978-3-540-73351-5_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73350-8
Online ISBN: 978-3-540-73351-5
eBook Packages: Computer Science, Computer Science (R0)
