
Language Resources and Evaluation, Volume 48, Issue 2, pp 191–201

Is it possible to create a very large wordnet in 100 days? An evaluation

  • Krister Lindén
  • Jyrki Niemi
Project Note

Abstract

Wordnets are large-scale lexical databases of related words and concepts, useful for language-aware software applications. They have recently been built for many languages using various approaches. The Finnish wordnet, FinnWordNet (FiWN), was created by translating the more than 200,000 word senses in the English Princeton WordNet (PWN) 3.0 in 100 days. To ensure quality, the word senses were translated by professional translators. The direct translation approach was based on the assumption that most synsets in PWN represent language-independent real-world concepts. The semantic relations between synsets were therefore also assumed to be mostly language-independent, so the structure of PWN could be reused as well. This approach allowed the creation of an extensive Finnish wordnet directly aligned with PWN and also provided us with a translation relation and thus a bilingual wordnet usable as a dictionary. In this paper, we address several concerns raised with regard to our approach, many of them for the first time. We evaluate the craftsmanship of the translators by checking the spelling and translation quality, the viability of the approach by assessing the synonym quality on both the lexeme and the concept level, and the usefulness of the resulting lexical resource both for humans and in a language-technological task. We discovered no new problems compared with those already known in PWN. As a whole, the paper contributes to the scientific discourse on what it takes to create a very large wordnet. As a side effect of the evaluation, we extended FiWN to contain 208,645 word senses in 120,449 synsets, effectively making version 2.0 of FiWN currently the largest wordnet in the world by these statistics.

Keywords

Wordnet; Bilingual lexicon; Quality assessment; Knowledge representation; Word-sense disambiguation

Notes

Acknowledgments

We are grateful to Mirka Hyvärinen, Kristiina Muhonen and Paula Pääkkä for checking the long lists of words with potential spelling errors and part-of-speech mismatches. We also thank Mirka Hyvärinen and Pinja Pennala for their valuable contribution to the creation of the word-sense disambiguated test corpus and for the many hours spent on evaluating sets of words extracted from Wikipedia and Wiktionary. Mirka Hyvärinen also conducted the crowdsourcing experiment. This work was funded by the FIN-CLARIN and META-NORD projects. The META-NORD project has received funding from the European Union’s ICT Policy Support Programme as part of the Competitiveness and Innovation Framework Programme under grant agreement no. 270899.


Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  1. Department of Modern Languages, University of Helsinki, Helsinki, Finland
