Is it possible to create a very large wordnet in 100 days? An evaluation
Wordnets are large-scale lexical databases of related words and concepts, useful for language-aware software applications. They have recently been built for many languages using various approaches. The Finnish wordnet, FinnWordNet (FiWN), was created by translating the more than 200,000 word senses in the English Princeton WordNet (PWN) 3.0 in 100 days. To ensure quality, the word senses were translated by professional translators. The direct translation approach was based on the assumption that most synsets in PWN represent language-independent real-world concepts. The semantic relations between synsets were therefore also assumed to be mostly language-independent, so the structure of PWN could be reused as well. This approach allowed the creation of an extensive Finnish wordnet directly aligned with PWN. It also provided us with a translation relation and thus a bilingual wordnet usable as a dictionary. In this paper, we address several concerns raised with regard to our approach, many of them for the first time. We evaluate the craftsmanship of the translators by checking the spelling and translation quality, the viability of the approach by assessing the synonym quality on both the lexeme and the concept level, and the usefulness of the resulting lexical resource both for humans and in a language-technological task. We discovered no new problems compared with those already known in PWN. As a whole, the paper contributes to the scientific discourse on what it takes to create a very large wordnet. As a side effect of the evaluation, we extended FiWN to contain 208,645 word senses in 120,449 synsets, effectively making version 2.0 of FiWN currently the largest wordnet in the world by these statistics.
Keywords: Wordnet, Bilingual lexicon, Quality assessment, Knowledge representation, Word-sense disambiguation
We are grateful to Mirka Hyvärinen, Kristiina Muhonen and Paula Pääkkä for checking the long lists of words with potential spelling errors and part-of-speech mismatches. We also thank Mirka Hyvärinen and Pinja Pennala for their valuable contribution to the creation of the word-sense disambiguated test corpus and for the many hours spent on evaluating sets of words extracted from Wikipedia and Wiktionary. Mirka Hyvärinen also conducted the crowdsourcing experiment. This work was funded by the FIN-CLARIN and META-NORD projects. The META-NORD project has received funding from the European Union’s ICT Policy Support Programme as part of the Competitiveness and Innovation Framework Programme under grant agreement no. 270899.