Constructing a poor man’s wordnet in a resource-rich world

In this paper we present a language-independent, fully modular and automatic approach to bootstrap a wordnet for a new language by recycling different types of already existing language resources, such as machine-readable dictionaries, parallel corpora, and Wikipedia. The approach, which we apply here to Slovene, takes into account monosemous and polysemous words, general and specialised vocabulary as well as simple and multi-word lexemes. The extracted words are then assigned one or several synset ids, based on a classifier that relies on several features including distributional similarity. Finally, we identify and remove highly dubious (literal, synset) pairs, based on simple distributional information extracted from a large corpus in an unsupervised way. Automatic, manual and task-based evaluations show that the resulting resource, the latest version of the Slovene wordnet, is already a valuable source of lexico-semantic information.

  7. In this paper we use the term monosemous for literals that appear in only one synset in the Princeton WordNet. While this is unproblematic in most cases, some words may only appear to be monosemous because the lexical resource is incomplete and is missing some of their senses.

  13. See however (Erjavec and Fišer 2006) for preliminary experiments on building a Slovene wordnet from the Serbian wordnet (Krstev et al. 2004).

  14. The conversion from one synset inventory to another was achieved based on an automatic PWN 2.0 to 3.0 mapping (Erjavec, p.c.).

  15. This threshold of 2 was empirically found to offer the best balance between the number of related words (a threshold of 1 or 0 would have provided us with too few, a threshold of 3 or more with too many) and the relevance of the related words (a threshold of 3 or more gathers many literals which are not relevant as descriptors of the input synset).

  18. The Slovene lemmatisation was performed using the ToTaLe system (Erjavec et al. 2005).

  19. In experiments applying this extension technique to the French wordnet WOLF, the same 0.1 threshold leads to retaining a higher proportion of candidates, namely 55,159 out of 177,980, which have a much higher precision (83 %). This difference is related to the archaic words present in the Slovene-English dictionaries we use for extending sloWNet. It suggests that this dictionary is not the best resource for wordnet construction; it was nevertheless used since it is the only extensive bilingual dictionary available, a situation that is not uncommon in realistic research scenarios.

  21. Note that SWN does not contain any adverbial synsets and only a few adjectival synsets.

  22. Note that the first versions of BabelNet did not contain any Slovene literals. Only the recently published BabelNet 2.0 does.

  23. 115 Slovene UWN (literal, synset) pairs have a literal that contains at least one comma, which seems to act as a separator between possible literals rather than as part of a single literal. Moreover, some literals include a stress marker (mentioned above and removed from sloWNet since version 2.0). Before evaluating sloWNet 3.0 against the UWN, we “improved” the UWN by correcting these issues. Therefore, our evaluation is in a way biased in favour of UWN.

The work described in this paper was funded in part by the French–Slovene PHC PROTEUS project 22718UC “Building Slovene–French linguistic resources: parallel corpus and wordnet” (2010–2011), by the French national grant ANR-09-CORD-008 “EDyLex” (2010–2013) and by the Slovene national postdoctoral grant Z6-3668.

Fišer, D., Sagot, B. Constructing a poor man's wordnet in a resource-rich world. Lang Resources & Evaluation 49, 601–635 (2015).

