Extraction of Bilingual Cognates from Wikipedia

  • Pablo Gamallo
  • Marcos Garcia
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7243)


In this article, we propose a method to extract translation equivalents with similar spelling from comparable corpora. The method was applied to Wikipedia to extract a large number of Portuguese-Spanish bilingual terminological pairs that were not found in existing dictionaries. The resulting bilingual lexicon consists of more than 27,000 new pairs of lemmas and multiwords, with about 92% accuracy.
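The idea of filtering candidate pairs by spelling similarity can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, so a length-normalised edit distance is assumed here, and the function names, the 0.7 threshold, and the sample word pairs are all hypothetical. The candidate pairs are assumed to have been gathered from the comparable corpus beforehand.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def spelling_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus the length-normalised edit distance."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def cognate_candidates(pairs, threshold=0.7):
    """Keep (Portuguese, Spanish) pairs whose spelling is similar enough.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return [(pt, es) for pt, es in pairs
            if spelling_similarity(pt.lower(), es.lower()) >= threshold]


# Hypothetical candidate pairs: the first is a cognate, the second is not.
pairs = [("universidade", "universidad"), ("cachorro", "perro")]
print(cognate_candidates(pairs))
```

In this sketch only the first pair passes the filter, since the two words differ by a single character, while the second pair shares too few characters to reach the threshold.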


Edit Distance, Computational Linguistics, Internal Link, Translation Equivalent, Bilingual Dictionary
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Pablo Gamallo (1)
  • Marcos Garcia (1)
  1. Centro de Investigação em Tecnologias da Informação (CITIUS), Universidade de Santiago de Compostela, Galiza, Spain
