One Size Fits All? A Simple Technique to Perform Several NLP Tasks

  • Daniel Gayo-Avello
  • Darío Álvarez-Gutiérrez
  • José Gayo-Avello
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3230)


Word fragments, or n-grams, have been widely used to perform different Natural Language Processing tasks such as information retrieval [1] [2], document categorization [3], automatic summarization [4] and even genetic classification of languages [5]. All these techniques share some common aspects: (1) documents are mapped to a vector space where n-grams are used as coordinates and their relative frequencies as vector weights, (2) many of them compute a context which plays a role similar to stop-word lists, and (3) cosine distance is commonly used for document-to-document and query-to-document comparisons. blindLight is a new approach related to these classical n-gram techniques, although it introduces two major differences: (1) relative frequencies are no longer used as vector weights but are replaced by n-gram significances, and (2) cosine distance is abandoned in favor of a new metric inspired by sequence alignment techniques, though less computationally expensive. This new approach can be used simultaneously to perform document categorization and clustering, information retrieval, and text summarization. In this paper we describe the foundations of this technique and its application to both a particular categorization problem (i.e., language identification) and information retrieval tasks.
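
As a rough illustration of the classical n-gram baseline summarized in the abstract (not of blindLight itself, whose significance weights and alignment-inspired metric are not specified here), the following Python sketch maps a text to a sparse vector of character n-gram relative frequencies and compares two such vectors with the cosine measure. The function names `ngram_vector` and `cosine_similarity`, the default n-gram size of 3, and the choice of character rather than word n-grams are assumptions made for the example; cosine distance can be taken as one minus the similarity.

```python
from collections import Counter
from math import sqrt


def ngram_vector(text, n=3):
    """Map text to a sparse vector: character n-gram -> relative frequency.

    Illustrative helper (hypothetical name); n=3 is an arbitrary default.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        # Text shorter than n characters yields an empty vector.
        return {}
    counts = Counter(grams)
    return {gram: count / total for gram, count in counts.items()}


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse n-gram vectors (dicts)."""
    # Only n-grams present in both vectors contribute to the dot product.
    dot = sum(weight * v[gram] for gram, weight in u.items() if gram in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    doc_a = ngram_vector("the quick brown fox")
    doc_b = ngram_vector("the quick brown dog")
    print(cosine_similarity(doc_a, doc_b))
```

Under this scheme, texts sharing many n-grams with similar relative frequencies score close to 1, and texts with no n-grams in common score 0.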


Language Identification, Parallel Corpus, Genetic Classification, Document Vector, Cosine Distance
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




References

  1. D’Amore, R., Mah, C.P.: One-time complete indexing of text: Theory and practice. In: Proc. of SIGIR 1985, pp. 155–164 (1985)
  2. Kimbrell, R.E.: Searching for text? Send an n-gram! Byte 13(5), 297–312 (1988)
  3. Damashek, M.: Gauging similarity with n-grams: Language-independent categorization of text. Science 267, 843–848 (1995)
  4. Cohen, J.D.: Highlights: Language and Domain-Independent Automatic Indexing Terms for Abstracting. JASIS 46(3), 162–174 (1995)
  5. Huffman, S.: The Genetic Classification of Languages by n-gram Analysis: A Computational Technique. Ph.D. thesis, Georgetown University (1998)
  6. Thomas, T.R.: Document retrieval from a large dataset of free-text descriptions of physician-patient encounters via n-gram analysis. Technical Report LA-UR-93-0020, Los Alamos National Laboratory, Los Alamos, NM (1993)
  7. Cavnar, W.B.: Using an n-gram-based document representation with a vector processing retrieval model. In: Proc. of TREC-3, pp. 269–277 (1994)
  8. Huffman, S.: Acquaintance: Language-Independent Document Categorization by N Grams. In: Proceedings of The Fourth Text REtrieval Conference (1995)
  9. Gayo-Avello, D., Álvarez-Gutiérrez, D., Gayo-Avello, J.: Naïve Algorithms for Keyphrase Extraction and Text Summarization from a Single Document Inspired by the Protein Biosynthesis Process. In: Ijspeert, A.J., Murata, M., Wakamiya, N. (eds.) BioADIT 2004. LNCS, vol. 3141, pp. 440–455. Springer, Heidelberg (2004)
  10. Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1), 61–74 (1993)
  11. Ferreira da Silva, J., Pereira Lopes, G.: A Local Maxima method and a Fair Dispersion Normalization for extracting multi-word units from corpora. In: Proc. of MOL6 (1999)
  12. Ferreira da Silva, J., Pereira Lopes, G.: Extracting Multiword Terms from Document Collections. In: Proc. of VExTAL, Venice, Italy (1999)
  13. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals (English translation from Russian). Soviet Physics Doklady 10(8), 707–710 (1966)
  14. Ziegler, D.: The Automatic Identification of Languages Using Linguistic Recognition Signals. PhD Thesis, State University of New York, Buffalo (1991)
  15. Souter, C., Churcher, G., Hayes, J., Johnson, S.: Natural Language Identification using Corpus-based Models. Hermes Journal of Linguistics 13, 183–203 (1994); Faculty of Modern Languages, Aarhus School of Business, Denmark
  16. Beesley, K.R.: Language Identifier: A Computer Program for Automatic Natural-Language Identification of Online Text. In: Language at Crossroads: Proceedings of the 19th Annual Conference of the American Translators Association, pp. 47–54 (1988)
  17. Dunning, T.: Statistical identification of language. Technical Report MCCS 94-273, New Mexico State University (1994)
  18. Kessler, B.: Computational Dialectology in Irish Gaelic. Dublin: EACL. In: Proceedings of the European Association for Computational Linguistics, pp. 60–67 (1995)
  19. Nerbonne, J., Heeringa, W.: Measuring Dialect Distance Phonetically. In: Coleman, J. (ed.) Proceedings of the Third Meeting of the ACL Special Interest Group in Computational Phonology, pp. 11–18 (1997)
  20. Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet. Cambridge University Press, Cambridge (1999)
  21. Jarvis, R.A., Patrick, E.A.: Clustering Using a Similarity Measure Based on Shared Near Neighbors. IEEE Transactions on Computers 22(11), 1025–1034 (1973)
  22. Verdaguer, P.: Grammaire de la langue catalane. Les origines de la langue, Curial (1999)
  23. Koehn, P.: Europarl: A Multilingual Corpus for Evaluation of Machine Translation, Draft (unpublished),

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Daniel Gayo-Avello (1)
  • Darío Álvarez-Gutiérrez (1)
  • José Gayo-Avello (1)

  1. Department of Informatics, University of Oviedo, Oviedo, Spain
