Constructing a poor man’s wordnet in a resource-rich world

  • Original Paper
  • Language Resources and Evaluation

Abstract

In this paper we present a language-independent, fully modular and automatic approach to bootstrap a wordnet for a new language by recycling different types of already existing language resources, such as machine-readable dictionaries, parallel corpora, and Wikipedia. The approach, which we apply here to Slovene, takes into account monosemous and polysemous words, general and specialised vocabulary as well as simple and multi-word lexemes. The extracted words are then assigned one or several synset ids, based on a classifier that relies on several features including distributional similarity. Finally, we identify and remove highly dubious (literal, synset) pairs, based on simple distributional information extracted from a large corpus in an unsupervised way. Automatic, manual and task-based evaluations show that the resulting resource, the latest version of the Slovene wordnet, is already a valuable source of lexico-semantic information.

Notes

  1. http://www.cl.cam.ac.uk/research/nl/acquilex/ [06.07.2014].

  2. http://www.bartleby.com/62/ [06.07.2014].

  3. http://research.microsoft.com/nlp/Projects/MindNet.aspx [06.07.2014].

  4. http://web.media.mit.edu/hugo/conceptnet/ [06.07.2014].

  5. http://www.cyc.com/ [06.07.2014].

  6. http://compling.hss.ntu.edu.sg/omw/ [06.07.2014].

  7. In this paper we use the term monosemous for literals that appear in only one synset in the Princeton WordNet. While this is unproblematic in most cases, some words may only appear to be monosemous because the lexical resource is incomplete and is missing some of their senses.

  8. http://www.wiktionary.org/ [06.07.2014].

  9. http://species.wikimedia.org/ [06.07.2014].

  10. http://eurovoc.europa.eu/ [06.07.2014].

  11. http://www.islovar.org/ [06.07.2014].

  12. http://www.wikipedia.org/ [06.07.2014].

  13. See however (Erjavec and Fišer 2006) for preliminary experiments on building a Slovene wordnet from the Serbian wordnet (Krstev et al. 2004).

  14. The conversion from one synset inventory to another was achieved based on an automatic PWN 2.0 to 3.0 mapping (Erjavec, p.c.).

  15. This threshold of 2 was empirically found to offer the best balance between the number of related words (a threshold of 1 or 0 would have yielded too few, a threshold of 3 or more too many) and their relevance (a threshold of 3 or more gathers many literals that are not relevant as descriptors of the input synset).

  16. http://semanticvectors.googlecode.com [06.07.2014].

  17. http://lucene.apache.org [06.07.2014].

  18. The Slovene lemmatisation was performed using the ToTaLe system (Erjavec et al. 2005).

  19. In experiments applying this extension technique to the French wordnet WOLF, the same 0.1 threshold retains a higher proportion of candidates, namely 55,159 out of 177,980, with a much higher precision (83 %). The difference is related to the archaic words present in the Slovene-English dictionaries we use for extending sloWNet, and suggests that this dictionary is not the best resource for wordnet construction. It was nevertheless used because it is the only extensive bilingual dictionary available, a situation that is not uncommon in realistic research scenarios.

  20. http://wordnet.princeton.edu/wordnet/man/wnstats.7WN.html [06.07.2014].

  21. Note that SWN does not contain any adverbial synsets and only a few adjectival synsets.

  22. Note that the first versions of BabelNet did not contain any Slovene literals. Only the recently published BabelNet 2.0 does.

  23. 115 Slovene UWN (literal, synset) pairs have a literal that contains at least one comma, which seems to be more a separator between possible literals than part of unique literals. Moreover, some literals include a stress marker (mentioned above and removed from sloWNet since version 2.0). Before evaluating sloWNet 3.0 against the UWN, we “improved” the UWN by correcting these issues. Therefore, our evaluation is in a way biased in favour of UWN.

  24. http://presis.amebis.si/prevajanje/ [06.07.2014].

  25. http://translate.google.com/ [06.07.2014].

  26. July 6, 2014.

  27. http://nl.ijs.si/slowtool/ [06.07.2014].
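
Two of the conventions described in the notes above can be illustrated with a minimal sketch: the working definition of monosemous literals in note 7 and the comma clean-up applied to UWN literals in note 23. The function names and the toy synset ids are hypothetical and are not taken from the authors' tools. The sketch assumes that a wordnet is available as a plain list of (literal, synset id) pairs. The stress-marker removal mentioned in note 23 is left out because the notes do not specify the marker.

```python
from collections import defaultdict


def monosemous_literals(pairs):
    """Return the literals that occur in exactly one synset (cf. note 7).

    `pairs` is an iterable of (literal, synset_id) tuples; this input
    format is an assumption made for the sketch.
    """
    synsets_by_literal = defaultdict(set)
    for literal, synset_id in pairs:
        synsets_by_literal[literal].add(synset_id)
    return {lit for lit, syns in synsets_by_literal.items() if len(syns) == 1}


def split_comma_literals(pairs):
    """Treat a comma inside a literal as a separator between
    alternative literals (cf. note 23) and return the cleaned pairs."""
    cleaned = set()
    for literal, synset_id in pairs:
        for part in literal.split(","):
            part = part.strip()
            if part:  # skip empty fragments such as a trailing comma
                cleaned.add((part, synset_id))
    return cleaned
```

As note 7 cautions, a literal returned by `monosemous_literals` is monosemous only with respect to the given resource; senses that the resource lacks cannot be detected this way.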

References

  • Agirre, E., & Soroa, A. (2009). Personalizing pagerank for word sense disambiguation. In Proceedings of the 12th conference of the European chapter of the association for computational linguistics (EACL’09), Athens, Greece, pp. 33–41.

  • Arhar, Š., & Gorjanc, V. (2007). Korpus FidaPLUS: Nova generacija slovenskega referenčnega korpusa (The FidaPLUS corpus: A new generation of the Slovene reference corpus). Jezik in slovstvo, 52(2), 95–110.

  • Banerjee, S., & Pedersen, T. (2002). An adapted Lesk algorithm for word sense disambiguation using WordNet. In Computational linguistics and intelligent text processing, (pp. 136–145). Berlin: Springer.

  • Bernhard, D., & Gurevych, I. (2009). Combining lexical semantic resources with question and answer archives for translation-based answer finding. In Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP (ACL ’09), Suntec, Singapore, pp. 728–736.

  • Bond, F., & Foster, R. (2013). Linking and extending an open multilingual Wordnet. In Proceedings of the 51st annual meeting of the association for computational linguistics, Sofia, Bulgaria, pp. 1352–1362.

  • Carpuat, M., & Wu, D. (2007). Improving statistical machine translation using word sense disambiguation. In The 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL 2007), pp. 61–72.

  • Casses, B. (2010). Final paper of research experience for undergraduates for artificial intelligence, natural language processing and information retrieval, The University of Colorado at Colorado Springs. http://www.cs.uccs.edu/~jkalita/work/reu/REUFinalPapers2010/Casses.pdf.

  • Copestake, A., Sanfilippo, A., Briscoe, T., & de Paiva, V. (1993). The ACQUILEX LKB: An introduction. In T. Briscoe, A. Copestake, & V. de Paiva (Eds.), Inheritance, defaults and the lexicon (pp. 148–163). New York, NY: Cambridge University Press.

  • Cuadros, M., & Rigau, G. (2006). Quality assessment of large scale knowledge resources. In Proceedings of the 2006 conference on empirical methods in natural language processing (EMNLP ’06), Sydney, Australia, pp. 534–541.

  • Daumé III, H. (2004). Notes on CG and LM-BFGS optimization of logistic regression. Paper available at http://pub.hal3.name#daume04cg-bfgs, implementation available at http://hal3.name/megam/

  • Declerck, T., Pérez, A.G., Vela, O., Gantner, Z., & Manzano-Macho, D. (2006). Multilingual lexical semantic resources for ontology translation. In Proceedings of the international conference on language resources and evaluation (LREC 2006), Genova, Italy.

  • de Melo, G., & Weikum, G. (2009). Towards a universal wordnet by learning from combined evidence. In Proceedings of the 18th ACM conference on information and knowledge management (CIKM ’09). ACM, New York, NY, United States, pp. 513–522.

  • Diab, M. (2004). The feasibility of bootstrapping an Arabic WordNet leveraging parallel corpora and an English wordnet. In Proceedings of the Arabic language technologies and resources.

  • Dyvik, H. (2002). Translations as semantic mirrors: From parallel corpus to wordnet. In Post-proceedings of the ICAME 2002 conference (revised version), Gothenburg, Sweden.

  • Erjavec, T., & Fišer, D. (2006). Building Slovene WordNet. In Proceedings of the international conference on language resources and evaluation (LREC 2006), Genova, Italy.

  • Erjavec, T., Ignat, C., Pouliquen, B., & Steinberger, R. (2005). Massive multi-lingual corpus compilation: Acquis Communautaire and ToTaLe. In Z. Vetulani (Ed.), Human language technologies as a challenge for computer science and linguistics: In memory of Maurice Gross and Antonio Zampolli. Proceedings of the 2nd language & technology conference, April 21–23, 2005, Poznań, Poland, pp. 32–36. Poznań: Wydawnictwo Poznańskie Sp. z o.o.

  • Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.

  • Fišer, D., Ljubešić, N., & Kubelka, O. (2012). Addressing polysemy in bilingual lexicon extraction from comparable corpora. In N. Calzolari, K. Choukri, T. Declerck, M.U. Doğan, B. Maegaard, J. Mariani, J. Odijk, & S. Piperidis (Eds.), Proceedings of the eighth international conference on language resources and evaluation (LREC 2012), Istanbul, Turkey.

  • Fišer, D., & Novak, J. (2011). Visualizing sloWNet. In Proceedings of the conference on electronic lexicography in the 21st century: New applications for new users (eLEX2011), Bled, Slovenia.

  • Fišer, D., & Erjavec, T. (2009). Semantic concordances for Slovene. Cognitive Studies - Études cognitives, 9, 89–100.

  • Fišer, D., & Sagot, B. (2008). Combining multiple resources to build reliable wordnets. In Proceedings of the 11th international conference on text, speech and dialogue (TSD 2008), Brno, Czech Republic.

  • Fišer, D., & Vintar, Š. (2010). Uporaba wordneta za boljše razdvoumljanje pri strojnem prevajanju (Using wordnet for better disambiguation in machine translation). In Proceedings of the 13th international multiconference information society—IS 2010.

  • Fung, P. (1995). A pattern matching method for finding noun and proper noun translations from noisy parallel corpora. In Proceedings of the 33rd annual meeting on association for computational linguistics (ACL ’95), Cambridge, Massachusetts, United States, pp. 236–243.

  • Gabrilovich, E., & Markovitch, S. (2006). Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. In Proceedings of the 21st national conference on artificial intelligence (AAAI’06). AAAI Press, pp. 1301–1306.

  • Grad, A., & Leeming, H. (1999). Slovene–English dictionary. Zagreb: DZS.

  • Grad, A., Škerlj, R., & Vitorovič, N. (1999). English–Slovene dictionary. Zagreb: DZS.

  • Harabagiu, S., Moldovan, D., Pasca, M., Mihalcea, R., Surdeanu, M., Bunescu, R., Girju, R., Rus, V., & Morarescu, P. (2000). Falcon: Boosting knowledge for answer engines. In Proceedings of TREC-9, pp. 479–488.

  • Ide, N., Erjavec, T., & Tufiş, D. (2002). Sense discrimination with parallel corpora. In Proceedings of the ACL’02 workshop on word sense disambiguation: Recent successes and future directions (WSD ’02), Philadelphia, Pennsylvania, United States, pp. 61–66.

  • Kirkpatrick, B. (1987). Roget’s thesaurus of English words and phrases. Penguin: Penguin reference books.

  • Knight, K., & Luk, S. K. (1994). Building a large-scale knowledge base for machine translation. In Proceedings of the twelfth national conference on artificial intelligence (AAAI ’94), Seattle, Washington, United States, pp. 773–778.

  • Korošec, T., Fekonja, M., Jehart, A., Pečelin, F., & Ulčar, M. (2002). Vojaški slovar. Ljubljana: Ministrstvo za obrambo.

  • Krstev, C., Pavlović-Lažetić, G., & Obradović, I. (2004). Using textual and lexical resources in developing Serbian wordnet. Romanian Journal of Information Science and Technology, 7(1–2), 147–161.

  • Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 5th annual international conference on systems documentation (SIGDOC’86), Toronto, Canada, pp. 24–26.

  • Lin, D., Zhao, S., Qin, L., & Zhou, M. (2003). Identifying synonyms among distributionally similar words. In Proceedings of the 18th international joint conference on artificial intelligence (IJCAI 2003), Acapulco, Mexico, pp. 1492–1493.

  • Liu, H. (2003). Unpacking meaning from words: A context-centered approach to computational lexicon design. In P. Blackburn, C. Ghidini, R. M. Turner, & F. Giunchiglia (Eds.), Modeling and using context: Fourth international and interdisciplinary conference, context 2003. Springer, Stanford, California, United States, pp. 218–232.

  • Matuszek, C., Cabral, J., Witbrock, M., & Deoliveira, J. (2006). An introduction to the syntax and content of Cyc. In Proceedings of the 2006 AAAI spring symposium on formalizing and compiling background knowledge and its applications to knowledge representation and question answering, pp. 44–49.

  • Mihalcea, R., Sinha, R., & McCarthy, D. (2010). Semeval-2010 task 2: Cross-lingual lexical substitution. In Proceedings of the 5th international workshop on semantic evaluation (SemEval 2010). Los Angeles, California, United States, pp. 9–14.

  • Nastase, V. (2008). Topic-driven multi-document summarization with encyclopedic knowledge and spreading activation. In Proceedings of the conference on empirical methods in natural language processing (EMNLP ’08). Honolulu, Hawaii, pp. 763–772.

  • Navigli, R., & Ponzetto, S. P. (2010). Babelnet: Building a very large multilingual semantic network. In Proceedings of the 48th annual meeting of the association for computational linguistics, Uppsala, Sweden, pp. 216–225.

  • Navigli, R., & Ponzetto, S. P. (2012). BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193, 217–250.

  • Nie, J. Y. (2010). Cross-language information retrieval. Synthesis lectures on human language technologies. San Rafael, CA: Morgan & Claypool Publishers.

  • Orav, H., & Vider, K. (2004). Concerning the difference between a conception and its application in the case of the Estonian wordnet. In Proceedings of the 2nd international conference of the Global WordNet Association (GWC-2004), Brno, Czech Republic, pp. 285–290.

  • Pianta, E., Bentivogli, L., & Girardi, C. (2004). Fighting arbitrariness in wordnet-like lexical databases—A natural language motivated remedy. In Proceedings of the 1st international conference of the Global WordNet Association (GWC-2002). Mysore, India.

  • Ponzetto, S. P., & Navigli, R. (2009). Large-scale taxonomy mapping for restructuring and integrating Wikipedia. In Proceedings of the 21st international joint conference on artificial intelligence (IJCAI’09), Pasadena, California, United States, pp. 2083–2088.

  • Reiter, N., Hartung, M., & Frank, A. (2008). A resource-poor approach for linking ontology classes to Wikipedia articles. In J. Bos & R. Delmonte (Eds.), Semantics in text processing. STEP 2008 conference proceedings, research in computational semantics. College Publications, pp. 381–387.

  • Resnik, P., & Yarowsky, D. (1997). A perspective on word sense disambiguation methods and their evaluation. In Proceedings of the ACL SIGLEX workshop on tagging text with lexical semantics: Why, what, and how?, Washington, DC, United States, pp. 79–86.

  • Richardson, S. D., Dolan, W. B., & Vanderwende, L. (1998). Mindnet: Acquiring and structuring semantic information from text. In Proceedings of the 36th annual meeting of the association for computational linguistics and 17th international conference on computational linguistics, Montreal, Canada, pp. 1098–1102.

  • Rudnicka, E., Maziarz, M., Piasecki, M., & Szpakowicz, S. (2012). A strategy of mapping Polish Wordnet onto Princeton Wordnet. In Proceedings of COLING 2012: Posters. Mumbai, India, pp. 1039–1048.

  • Ruiz-Casado, M., Alfonseca, E., & Castells, P. (2005). Automatic assignment of Wikipedia encyclopedic entries to wordnet synsets. In Proceedings of advances in web intelligence.

  • Sagot, B., & Fišer, D. (2008). Building a free French wordnet from multilingual resources. In Proceedings of ontolex 2008, Marrakech, Morocco.

  • Sagot, B., & Fišer, D. (2011). Extending wordnets by learning from multiple resources. In LTC’11: 5th language and technology conference. Poznań, Poland. http://hal.inria.fr/hal-00655785

  • Sagot, B., & Fišer, D. (2012a). Automatic extension of WOLF. In Proceedings of the 6th international Global Wordnet Conference (GWC2012), Matsue, Japan.

  • Sagot, B., & Fišer, D. (2012b). Cleaning noisy synsets. In Proceedings of the international conference on language resources and evaluation (LREC 2012), Istanbul, Turkey.

  • Sornlertlamvanich, V. (2010). Asian wordnet: Development and service in collaborative approach. In Proceedings of the 5th international conference of the Global WordNet Association (GWC-2010), Mumbai, India.

  • Suchanek, F. M., Kasneci, G., & Weikum, G. (2008). Yago: A large ontology from Wikipedia and wordnet. Journal of Web Semantics, 6(3), 203–217.

  • Tavčar, A., Fišer, D., & Erjavec, T. (2012). slowcrowd: Orodje za popravljanje wordneta z izkoriščanjem moči množic (slowcrowd: A tool for correcting wordnet by harnessing the power of crowds). In Proceedings of the 8th language technologies conference, within the proceedings of the 15th international multiconference information society (IS 2012), Vol. C, Ljubljana, Slovenia, pp. 197–202.

  • Tiedemann, J. (2003). Recycling translations—Extraction of lexical data from parallel corpora and their application in natural language processing. Ph.D. thesis, Uppsala Universitet, Uppsala, Sweden (Studia Linguistica Upsaliensia 1).

  • Tufiş, D. (2000). BalkaNet—Design and development of a multilingual Balkan wordnet. Romanian Journal of Information Science and Technology Special Issue, 7, 107–124.

  • Tufiş, D., & Cristea, D. (2002). Methodological issues in building the Romanian Wordnet and consistency checks in Balkanet. In Proceedings of LREC 2002 workshop on wordnet structures and standardisation, Las Palmas, Spain, pp. 35–41.

  • Tufiş, D., Koeva, S., Erjavec, T., Gavrilidou, M., & Krstev, C. (2009). Building language resources and translation models for machine translation focused on south Slavic and Balkan languages. In Machačová, J., & Rohsmann, K. (Eds.), Scientific results of the SEE-ERA.NET pilot joint call, pp. 37–48.

  • Vossen, P. (Ed.). (1999). EuroWordNet: A multilingual database with lexical semantic networks for European languages. Dordrecht: Kluwer.

  • Weisscher, A. (2013). GWA base concepts. http://globalwordnet.org/gwa-base-concepts/

  • Widdows, D., & Ferraro, K. (2008). Semantic vectors: A scalable open source package and online technology management application. In Proceedings of the international conference on language resources and evaluation (LREC 2008), Marrakech, Morocco.

  • Wong, S. H. S. (2004). Fighting arbitrariness in wordnet-like lexical databases—A natural language motivated remedy. In Proceedings of the 2nd international conference of the Global WordNet Association (GWC-2004). Brno, Czech Republic, pp. 234–241.

  • Yokoi, T. (1995). The EDR electronic dictionary. Communications of the ACM, 38(11), 42–44.

Acknowledgments

The work described in this paper was funded in part by the French–Slovene PHC PROTEUS project 22718UC “Building Slovene–French linguistic resources: parallel corpus and wordnet” (2010–2011), by the French national grant ANR-09-CORD-008 “EDyLex” (2010–2013) and by the Slovene national postdoctoral grant Z6-3668.

Author information

Corresponding author

Correspondence to Benoît Sagot.

Cite this article

Fišer, D., Sagot, B. Constructing a poor man’s wordnet in a resource-rich world. Lang Resources & Evaluation 49, 601–635 (2015). https://doi.org/10.1007/s10579-015-9295-6

  • DOI: https://doi.org/10.1007/s10579-015-9295-6
