Extraction of Bilingual Cognates from Wikipedia

  • Pablo Gamallo
  • Marcos Garcia
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7243)


In this article, we propose a method to extract translation equivalents with similar spelling from comparable corpora. The method was applied to Wikipedia to extract a large number of Portuguese-Spanish bilingual terminological pairs that were not found in existing dictionaries. The resulting bilingual lexicon consists of more than 27,000 new pairs of lemmas and multiwords, with about 92% accuracy.
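
The abstract does not say which spelling-similarity measure the method uses, so the following is only a minimal sketch of the general idea of keeping candidate pairs whose spellings are close. It assumes a normalized edit distance and a cut-off of 0.7; the function names, the threshold, and the sample word pairs are all hypothetical illustrations and are not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def spelling_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical spelling (case-insensitive)."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a.lower(), b.lower()) / max(len(a), len(b))


def cognate_candidates(pairs, threshold=0.7):
    """Keep the (Portuguese, Spanish) pairs at or above the assumed threshold."""
    return [(pt, es) for pt, es in pairs
            if spelling_similarity(pt, es) >= threshold]


# Hypothetical input: the first pair is spelled alike, the second is not.
sample = [("livro", "libro"), ("janela", "ventana")]
print(cognate_candidates(sample))
```

A plain edit distance treats accented and unaccented letters as unrelated characters, so a real Portuguese-Spanish setting would likely need a measure that is more tolerant of regular spelling correspondences; the sketch only shows the filtering step.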





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Pablo Gamallo (1)
  • Marcos Garcia (1)
  1. Centro de Investigação em Tecnologias da Informação (CITIUS), Universidade de Santiago de Compostela, Galiza, Spain
