Application of Variable Length N-Gram Vectors to Monolingual and Bilingual Information Retrieval

  • Daniel Gayo-Avello
  • Darío Álvarez-Gutiérrez
  • José Gayo-Avello
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3491)


Our group in the Department of Informatics at the University of Oviedo participated for the first time in two tasks at CLEF: monolingual (Russian) and bilingual (Spanish-to-English) information retrieval. Our main goal was to test the application to IR of a modified version of the n-gram vector space model (codenamed blindLight). This approach has been successfully applied to other NLP tasks, such as language identification and text summarization, and the results achieved at CLEF 2004, although not exceptional, are encouraging. The blindLight approach differs from classical techniques in two major ways: (1) relative frequencies are no longer used as vector weights but are replaced by n-gram significances, and (2) cosine distance is abandoned in favor of a new metric that is inspired by sequence alignment techniques but is less computationally expensive than they are. To perform cross-language IR, we developed a naive n-gram pseudo-translator similar to those described by McNamee and Mayfield or Pirkola et al.
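For orientation, the classical technique that blindLight departs from can be sketched as follows: documents become vectors of character n-grams weighted by relative frequency and are compared with the cosine measure. This is only a minimal illustration of that baseline, not of blindLight itself, since this summary does not define the significance weights or the alignment-inspired metric. The function names, the choice of n = 3, and the sample strings are assumptions made for the example.

```python
import math
from collections import Counter


def ngram_vector(text, n=3):
    """Map text to the relative frequencies of its overlapping character n-grams.

    Hypothetical helper: n=3 is an arbitrary choice for illustration.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    counts = Counter(grams)
    return {gram: count / total for gram, count in counts.items()}


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v[gram] for gram, weight in u.items() if gram in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up sample texts, used only to exercise the two functions.
query = ngram_vector("information retrieval")
doc_a = ngram_vector("retrieval of information")
doc_b = ngram_vector("protein biosynthesis")

sim_a = cosine_similarity(query, doc_a)
sim_b = cosine_similarity(query, doc_b)
print(sim_a > sim_b)
```

In this baseline, ranking documents for a query amounts to sorting them by `cosine_similarity`. According to the abstract, blindLight replaces both the weighting step and the comparison step.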


Keywords: Machine Translation; Pairwise Alignment; Parallel Corpus; Document Vector; International Financial Institution
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




References

  1. Salton, G., Wong, A., Yang, C.S.: A vector space model for information retrieval. Communications of the ACM 18(11), 613–620 (1975)
  2. D’Amore, R., Mah, C.P.: One-time complete indexing of text: Theory and practice. In: Proc. of SIGIR 1985, pp. 155–164 (1985)
  3. Kimbrell, R.E.: Searching for text? Send an n-gram! Byte 13(5), 297–312 (1988)
  4. Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1), 61–74 (1993)
  5. Ferreira da Silva, J., Pereira Lopes, G.: A Local Maxima method and a Fair Dispersion Normalization for extracting multi-word units from corpora. In: Proc. of MOL6 (1999)
  6. Ferreira da Silva, J., Pereira Lopes, G.: Extracting Multiword Terms from Document Collections. In: Proc. of VExTAL, Venice, Italy (1999)
  7. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals (English translation from Russian). Soviet Physics Doklady 10(8), 707–710 (1966)
  8. Gayo-Avello, D., Álvarez-Gutiérrez, D., Gayo-Avello, J.: Naive Algorithms for Key phrase Extraction and Text Summarization from a Single Document inspired by the Protein Biosynthesis Process. In: Ijspeert, A.J., Murata, M., Wakamiya, N. (eds.) BioADIT 2004. LNCS, vol. 3141, pp. 440–455. Springer, Heidelberg (2004)
  9. Gayo-Avello, D., Álvarez-Gutiérrez, D., Gayo-Avello, J.: One Size Fits All? A Simple Technique to Perform Several NLP Tasks. In: Vicedo, J.L., Martínez-Barco, P., Muñoz, R., Saiz Noeda, M. (eds.) EsTAL 2004. LNCS (LNAI), vol. 3230, pp. 267–278. Springer, Heidelberg (2004)
  10. Peters, C., Braschler, M., Di Nunzio, G., Ferro, N.: CLEF 2004: Ad Hoc Track Overview and Results Analysis. In: Peters, C., Clough, P., Gonzalo, J., Jones, G.J.F., Kluck, M., Magnini, B. (eds.) CLEF 2004. LNCS, vol. 3491, pp. 10–26. Springer, Heidelberg (2005)
  11. Peters, C.: What happened in CLEF 2004. In: Peters, C., Clough, P., Gonzalo, J., Jones, G.J.F., Kluck, M., Magnini, B. (eds.) CLEF 2004. LNCS, vol. 3491, pp. 1–9. Springer, Heidelberg (2005)
  12. Koehn, P.: Europarl: A Multilingual Corpus for Evaluation of Machine Translation, Draft (unpublished),
  13. Pirkola, A., Keskustalo, H., Leppänen, E., Känsälä, A., Järvelin, K.: Targeted s-gram matching: a novel n-gram matching technique for cross- and monolingual word form variants. Information Research 7(2) (2002)
  14. McNamee, P., Mayfield, J.: JHU/APL Experiments in Tokenization and Non-Word Translation. In: Peters, C., Gonzalo, J., Braschler, M., Kluck, M. (eds.) CLEF 2003. LNCS, vol. 3237, pp. 85–97. Springer, Heidelberg (2004)

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Daniel Gayo-Avello (1)
  • Darío Álvarez-Gutiérrez (1)
  • José Gayo-Avello (1)

  1. Department of Informatics, University of Oviedo, Oviedo, Spain
