Parallel Corpora for WordNet Construction: Machine Translation vs. Automatic Sense Tagging

  • Antoni Oliver
  • Salvador Climent
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7182)

Abstract

In this paper we present a methodology for WordNet construction based on the exploitation of parallel corpora with semantic annotation of the English source text. We are using this methodology to enlarge the Spanish and Catalan versions of WordNet 3.0, but it can also be applied to other languages. Because large parallel corpora with semantic annotation are rarely available, we explore two strategies to overcome this problem: machine translation of monolingual sense-tagged corpora, on the one hand, and automatic sense tagging of the source text of parallel corpora, on the other.

With these resources, the problem of acquiring a WordNet from parallel corpora can be seen as a word alignment task. Fortunately, this task is well studied, and some alignment algorithms are freely available.
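
As an illustration of how synset acquisition reduces to word alignment, the following minimal sketch counts, for each sense-tagged English token, the target-language lemma it is aligned with, and keeps the most frequent candidate per synset. The abstract does not specify the extraction procedure, so the function name `extract_variants`, the `min_count` threshold, the data layout, and the synset identifiers (`syn:dog`, etc.) are all hypothetical, and the alignment links are assumed to come from an external word aligner.

```python
from collections import Counter, defaultdict


def extract_variants(sentence_pairs, min_count=2):
    """Map each synset id to its most frequently aligned target lemma.

    Each sentence pair is (source, target, links), where:
      - source is a list of (english_lemma, synset_id_or_None) tuples,
      - target is a list of target-language lemmas,
      - links is a list of (source_index, target_index) alignment links.
    This layout is an assumption made for the sketch.
    """
    counts = defaultdict(Counter)
    for source, target, links in sentence_pairs:
        for src_idx, tgt_idx in links:
            _, synset = source[src_idx]
            if synset is None:  # token carries no sense tag; skip it
                continue
            counts[synset][target[tgt_idx]] += 1

    variants = {}
    for synset, counter in counts.items():
        lemma, freq = counter.most_common(1)[0]
        if freq >= min_count:  # drop candidates with too little evidence
            variants[synset] = lemma
    return variants


# Toy data with hypothetical synset ids and hand-written alignment links.
pairs = [
    ([("the", None), ("dog", "syn:dog"), ("bark", "syn:bark")],
     ["el", "perro", "ladrar"],
     [(0, 0), (1, 1), (2, 2)]),
    ([("a", None), ("dog", "syn:dog"), ("sleep", "syn:sleep")],
     ["un", "perro", "dormir"],
     [(0, 0), (1, 1), (2, 2)]),
    ([("the", None), ("dog", "syn:dog")],
     ["el", "can"],
     [(0, 0), (1, 1)]),
]

variants = extract_variants(pairs, min_count=2)
```

With this toy input only `syn:dog` reaches the threshold, so `variants` contains a single entry for it. A real setting would need lemmatisation of the target text and a less naive filter than a raw frequency cut-off.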

Keywords

lexical resources, wordnet, parallel corpora, machine translation, automatic sense tagging

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Antoni Oliver (1)
  • Salvador Climent (1)
  1. Universitat Oberta de Catalunya, Barcelona, Spain
