A First Approach to CLIR Using Character N-Grams Alignment

  • Jesús Vilares
  • Michael P. Oakes
  • John I. Tait
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4730)


This paper describes the character n-gram translation technique we developed for our participation in CLEF 2006. The approach avoids the need for word normalization during indexing or translation, and it can also handle out-of-vocabulary words. Because it does not rely on language-specific processing, it can be applied to very different languages, even when linguistic information and resources are scarce or unavailable. Our proposal relies heavily on freely available resources and aims to make the n-gram alignment process faster than in other similar approaches.
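To make the idea of working with character n-grams rather than normalized words concrete, the following is a minimal sketch of overlapping n-gram tokenization. It is not the system described in the paper: the function name `char_ngrams`, the default length `n=4`, and the choice to keep short words whole are all assumptions made for illustration.

```python
def char_ngrams(text, n=4):
    """Split each word of `text` into overlapping character n-grams.

    Hypothetical helper for illustration only; n=4 and the handling of
    short words (kept whole) are assumptions, not taken from the paper.
    """
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            # Words no longer than n are kept as a single unit.
            grams.append(word)
        else:
            # Slide a window of width n over the word, one character at a time.
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


print(char_ngrams("tomatoes"))
# ['toma', 'omat', 'mato', 'atoe', 'toes']

# An inflected form shares n-grams with its base form without any stemming.
shared = set(char_ngrams("tomato")) & set(char_ngrams("tomatoes"))
print(sorted(shared))
# ['mato', 'omat', 'toma']
```

Because the two word forms share several n-grams, they can match at indexing or translation time without a stemmer or a dictionary entry for the inflected form, which is the property the abstract refers to when it mentions avoiding word normalization and handling out-of-vocabulary words.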


Cross-Language Information Retrieval; character n-grams; translation algorithms; alignment algorithms; association measures





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Jesús Vilares (1)
  • Michael P. Oakes (2)
  • John I. Tait (2)
  1. Departamento de Computación, Universidade da Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain
  2. School of Computing and Technology, University of Sunderland, St. Peter’s Campus, St. Peter’s Way, Sunderland SR6 0DD, United Kingdom
