Language Resources and Evaluation, Volume 49, Issue 3, pp 601–635

Constructing a poor man’s wordnet in a resource-rich world

Original Paper

Abstract

In this paper we present a language-independent, fully modular and automatic approach to bootstrap a wordnet for a new language by recycling different types of already existing language resources, such as machine-readable dictionaries, parallel corpora, and Wikipedia. The approach, which we apply here to Slovene, takes into account monosemous and polysemous words, general and specialised vocabulary as well as simple and multi-word lexemes. The extracted words are then assigned one or several synset ids, based on a classifier that relies on several features including distributional similarity. Finally, we identify and remove highly dubious (literal, synset) pairs, based on simple distributional information extracted from a large corpus in an unsupervised way. Automatic, manual and task-based evaluations show that the resulting resource, the latest version of the Slovene wordnet, is already a valuable source of lexico-semantic information.
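The final filtering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the distributional measure, so everything here is an assumption made for illustration: the bag-of-words context window, the cosine measure, the threshold value, the toy corpus, the per-synset reference contexts, and the hypothetical synset ids `syn_finance` and `syn_river`. It is not the authors' implementation.

```python
from collections import Counter
from math import sqrt


def context_vector(word, sentences, window=2):
    """Count tokens that occur within `window` positions of `word`."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok == word:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def filter_pairs(candidates, synset_context, sentences, threshold=0.1):
    """Keep (literal, synset_id) pairs whose corpus context resembles the synset's context."""
    kept = []
    for literal, synset_id in candidates:
        sim = cosine(context_vector(literal, sentences), synset_context[synset_id])
        if sim >= threshold:  # assumed cut-off; pairs below it count as dubious
            kept.append((literal, synset_id))
    return kept


# Toy tokenised corpus (hypothetical): "banka" appears only in financial contexts.
sentences = [["banka", "odobri", "kredit"], ["denar", "hrani", "banka"]]
# Hypothetical reference contexts for two candidate synsets.
synset_context = {
    "syn_finance": Counter({"denar": 1, "kredit": 1}),
    "syn_river": Counter({"reka": 1, "voda": 1}),
}
candidates = [("banka", "syn_finance"), ("banka", "syn_river")]
kept = filter_pairs(candidates, synset_context, sentences)
```

In this toy setting the pair linking "banka" to the river sense shares no context words with the corpus occurrences and is dropped, while the financial sense is kept. A real system would build the vectors from a large corpus and tune the threshold on held-out data.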

Keywords

Wordnet development; Multilingual lexicon extraction; Word-sense disambiguation; Distributional similarity

Copyright information

© Springer Science+Business Media Dordrecht 2015

Authors and Affiliations

  1. Department of Translation, Faculty of Arts, University of Ljubljana, Aškerčeva 2, Ljubljana, Slovenia
  2. Alpage, INRIA Paris-Rocquencourt & Université Paris-Diderot, Paris, France
