Abstract
In this article, we propose a method to extract translation equivalents with similar spelling from comparable corpora. The method was applied to Wikipedia to extract a large number of Portuguese-Spanish bilingual terminological pairs that were not found in existing dictionaries. The resulting bilingual lexicon consists of more than 27,000 new pairs of lemmas and multiwords, with about 92% accuracy.
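The abstract does not say how spelling similarity is measured, so the following is only an illustrative sketch and not the authors' method. It assumes a normalized Levenshtein (edit) distance and a hypothetical threshold of 0.7; the names `edit_distance`, `spelling_similarity`, and `filter_cognates` are invented for this example.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def spelling_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus the distance divided by the longer length."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def filter_cognates(pairs: List[Tuple[str, str]],
                    threshold: float = 0.7) -> List[Tuple[str, str]]:
    """Keep candidate (Portuguese, Spanish) pairs whose spelling is similar enough.

    The threshold is an assumption made for illustration, not a value from the paper.
    """
    return [(pt, es) for pt, es in pairs
            if spelling_similarity(pt, es) >= threshold]


if __name__ == "__main__":
    candidates = [("universidade", "universidad"), ("cachorro", "perro")]
    print(filter_cognates(candidates))
```

In a setting like the one described, such a filter would run over candidate pairs already proposed from the comparable corpus, so that spelling similarity acts as one piece of evidence rather than the sole criterion.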
This work was supported by the Ministerio de Ciencia e Innovación, within the project OntoPedia (ref. FFI2010-14986).
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gamallo, P., Garcia, M. (2012). Extraction of Bilingual Cognates from Wikipedia. In: Caseli, H., Villavicencio, A., Teixeira, A., Perdigão, F. (eds) Computational Processing of the Portuguese Language. PROPOR 2012. Lecture Notes in Computer Science, vol 7243. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28885-2_7
DOI: https://doi.org/10.1007/978-3-642-28885-2_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28884-5
Online ISBN: 978-3-642-28885-2
eBook Packages: Computer Science, Computer Science (R0)