SALDO: a touch of yin to WordNet’s yang

  • Original Paper
  • Published in Language Resources and Evaluation

Abstract

The English-language Princeton WordNet (PWN) and some wordnets for other languages have been extensively used as lexical–semantic knowledge sources in language technology applications, due to their free availability and their size. The ubiquitousness of PWN-type wordnets tends to overshadow the fact that they represent one out of many possible choices for structuring a lexical–semantic resource, and it could be enlightening to look at a differently structured resource both from the point of view of theoretical–methodological considerations and from the point of view of practical text processing requirements. The resource described here—SALDO—is such a lexical–semantic resource, intended primarily for use in language technology applications, and offering an alternative organization to PWN-style wordnets. We present our work on SALDO, compare it with PWN, and discuss some implications of the differences. We also describe an integrated infrastructure for computational lexical resources where SALDO forms the central component.

Notes

  1. Indeed, according to the Global Wordnet Association http://www.globalwordnet.org/gwa/wordnet_table.htm, in order for a lexical resource to count as a wordnet, it must provide minimally the two lexical-semantic relations synonymy (in the form of synsets) and hyponymy.

  2. This is often called “hypernym(y)” in the literature on wordnets. Here, we have opted to use the etymologically more orthodox alternative hyperonym(y) (see Murphy 2003, 252, note 1).

  3. Gunilla Fredriksson worked together with Lennart Lönngren on the sense definitions and Ágnes Kilár did most of the programming in the original project.

  4. See http://spraakbanken.gu.se/eng/resource/saldo and http://spraakbanken.gu.se/eng/research/saldo/statistics.

  5. http://spraakbanken.gu.se/korp/ (Borin et al. 2012)

  6. Over the years, we have also used other terms for referring to the basic lexical-semantic relations in SALDO. At one time, a family metaphor was used, where the two descriptors were called mother (= primary) and father (= secondary). The main advantage of this was that it offered a natural family or kinship terminology for talking about various derived relations, such as “children”, “siblings”, “husband”, “grandmother”, etc. In some contexts, we have also used the terms main (= primary) and determinative (= secondary) descriptor.

  7. Many, but not all, construals of synonymy treat it as non-directional; see Murphy (2003), chapter 4, for a discussion. Note that the pairs sannolikhet ‘probability’ – sannolik ‘probable’ and liv ‘life’ – leva ‘live (v)’ would not be considered synonyms on some definitions of this term, because of the different parts of speech (see Sect. 1.2).

  8. Thus, the average number of children of a non-leaf node in SALDO is between two and three.

  9. The NSM semantic primes have undergone many revisions through the years. For a current version, see the Proposed semantic primes (2007) from the NSM homepage: http://www.une.edu.au/bcss/linguistics/nsm/.

  10. It is an interesting question—raised by an anonymous reviewer—whether the semantic relations defining SALDO could be extracted automatically from text, but strictly speaking this question is out of scope for this paper, where our focus is a comparison of two manually constructed lexical-semantic resources.

  11. For SALDO, this component is available both as a processing module written in Functional Morphology and as a full-form lexicon generated from SALDO (see Sect. 5). For PWN, the morphology comes in the form of a morphological processing tool called Morphy; see http://wordnet.princeton.edu/wordnet/man/morphy.7WN.html.

  12. Here, we use “word sense” in its usual meaning of a (Saussurean) linguistic sign, i.e., a particular kind of form–content pairing. On this view, the synset is obviously a different kind of unit. In the PWN, word senses and synsets are defined partly differently, but they obviously imply each other, due to the—somewhat problematic (Murphy, 2003, 159f)—definition of synonymy adopted by the compilers of PWN.

  13. http://spraakbanken.gu.se/eng/resource/saldoe.

  14. There are currently 37 different parts of speech (POS) used in SALDO, of 4 kinds: (1) ‘ordinary’ single-word POS, e.g. nn ‘noun’ (13 POS labels in total); (2) multi-word POS, nnm ‘multi-word noun’ (11 POS labels); (3) abbreviations, nna ‘noun abbreviation’ (7 POS labels); (4) compound parts and clause idioms (6 POS labels). See the SALDO statistics page at http://spraakbanken.gu.se/eng/research/saldo/statistics.

  15. Since SALDO is a fully connected graph, the relations—possibly in combination with some principles for assigning weights to them—can also be used to define a distance measure among word senses. This aspect will not be discussed here, however.

  16. For a project aiming at adding similar “evocation” links among the synsets of PWN, see Boyd-Graber et al. (2006).

  17. It seems to us that—at least in the computational linguistics literature—“multi-word expression” is a pre-theoretical, essentially negative characterization, which to boot is dependent on the vagaries of individual orthographies. Discussions of MWEs in computational linguistics rarely refer to the vast linguistic literature on the problems connected with defining the entity word in a cross-linguistically reasonable way (see, e.g., Anderson 1985; Aikhenvald 2007), and we are not aware of any typological studies to establish which kinds of MWEs there are cross-linguistically or how frequent they are across the world’s languages.

  18. http://spraakbanken.gu.se/saldo/ws.

  19. A number of formal checks are made each time a new update is checked into the resource repository, including checks that all descriptors are in the lexicon and that no circular structures have been introduced.

  20. http://spraakbanken.gu.se/karp.

  21. Lemgram is the term used in SALDO for the combination of a base form and an inflectional pattern.

  22. This seems to be another difference between SALDO and PWN: In SALDO, identifiers are held invariant through different versions of the resource, while it seems that in PWN this has not been a guiding principle. Hence, the linking even between monosemous lemmas in different PWN versions is not a trivial task (Daudé et al. 2000).

  23. Degrees of synonymy are a feature of the Synlex resource, a crowdsourced Swedish synonym lexicon (Kann and Rosell 2006); see Sect. 8.

  24. http://spraakbanken.gu.se/korp.

  25. http://spraakbanken.gu.se/eng/resource/swesaurus.

  26. Simple Knowledge Organization System, a W3C initiative; see http://www.w3.org/2004/02/skos/.

  27. See http://spraakbanken.gu.se/eng/resource/saldo.
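
The formal checks mentioned in note 19 (every descriptor is itself in the lexicon, and no circular structures exist) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual SALDO tooling: the dictionary layout, the function names, the use of None to mark a top node, and the restriction of the cycle check to primary-descriptor links are all hypothetical, and the sample entries are toy data rather than real SALDO senses.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy layout (an assumption, not the SALDO file format):
# sense id -> (primary descriptor or None for a top node, secondary descriptors)
Lexicon = Dict[str, Tuple[Optional[str], List[str]]]


def missing_descriptors(lexicon: Lexicon) -> List[Tuple[str, str]]:
    """Return (sense, descriptor) pairs whose descriptor is not a lexicon entry."""
    missing = []
    for sense, (primary, secondary) in lexicon.items():
        descriptors = ([primary] if primary is not None else []) + list(secondary)
        for descriptor in descriptors:
            if descriptor not in lexicon:
                missing.append((sense, descriptor))
    return missing


def find_primary_cycle(lexicon: Lexicon) -> Optional[List[str]]:
    """Follow primary-descriptor links upward from each sense.

    Returns the senses forming a cycle, or None if every chain ends
    at a top node (or at a descriptor missing from the lexicon).
    """
    done = set()  # senses already known not to lead into a cycle
    for start in lexicon:
        path: List[str] = []
        position: Dict[str, int] = {}
        node: Optional[str] = start
        while node is not None and node in lexicon and node not in done:
            if node in position:
                # We came back to a sense on the current path: a cycle.
                return path[position[node]:]
            position[node] = len(path)
            path.append(node)
            node = lexicon[node][0]
        done.update(path)
    return None


if __name__ == "__main__":
    toy: Lexicon = {
        "TOP": (None, []),
        "leva": ("TOP", []),
        "liv": ("leva", []),
    }
    print(missing_descriptors(toy))
    print(find_primary_cycle(toy))
```

Because each sense has at most one primary descriptor in this toy layout, following the primary links from every sense and remembering which senses were already cleared is enough to detect a cycle in linear time. Extending the check to secondary descriptors would need a general graph search instead.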

References

  • Aikhenvald, A. Y. (2007). Typological distinctions in word-formation. In T. Shopen (Ed.), Language typology and syntactic description (2nd ed.). Volume III: Grammatical categories and the lexicon. Cambridge: Cambridge University Press.

  • Anderson, S. R. (1985). Inflectional morphology. In T. Shopen (Ed.), Language typology and syntactic description (1st ed.). Volume III: Grammatical categories and the lexicon. Cambridge: Cambridge University Press.

  • Apresjan, Y. D. (2002). Principles of systematic lexicography. In M.-H. Corréard (Ed.), Lexicography and natural language processing. A festschrift in honour of B. T. S. Atkins (pp. 91–104). Euralex.

  • Baranov, A. N., & Dobrovol’skij, D. O. (2008). Aspekty teorii frazeologii. Moscow: Znak.

  • Boas, H. C. (Ed.). (2010). Contrastive studies in construction grammar. Amsterdam: John Benjamins.

  • Borin, L. (2005). Mannen är faderns mormor: Svenskt associationslexikon reinkarnerat. LexicoNordica 12, 39–54.

  • Borin, L. (2010). Med Zipf mot framtiden—en integrerad lexikonresurs för svensk språkteknologi. LexicoNordica 17, 35–54.

  • Borin, L. (2012). Core vocabulary: A useful but mystical concept in some kinds of linguistics. In D. Santos, K. Lindén, & W. Ng’ang’a (Eds.), Shall we play the Festschrift game? Essays on the occasion of Lauri Carlson’s 60th birthday (pp. 53–65). Berlin: Springer.

  • Borin, L., & Forsberg, M. (2011). Swesaurus—ett svenskt ordnät med fria tyglar. LexicoNordica 18, 17–39.

  • Borin, L., Forsberg, M., & Ahlberger, C. (2011). Semantic search in literature as an e-Humanities research tool: CONPLISIT—consumption patterns and life-style in 19th century Swedish literature. In NODALIDA 2011 conference proceedings (pp. 58–65). Riga: NEALT.

  • Borin, L., Forsberg, M., & Lönngren, L. (2008). The hunting of the BLARK—SALDO, a freely available lexical database for Swedish language technology. In J. Nivre, M. Dahllöf, & B. Megyesi (Eds.), Resourceful language technology. Festschrift in honor of Anna Sågvall Hein. Acta Universitatis Upsaliensis: Studia Linguistica Upsaliensia (pp. 21–32). Uppsala: Uppsala University, Department of Linguistics and Philology.

  • Borin, L., Forsberg, M., & Roxendal, J. (2012). Korp—the corpus infrastructure of Språkbanken. In Proceedings of LREC 2012 (pp. 474–478). Istanbul: ELRA.

  • Borin, L., Danélls, D., Forsberg, M., Kokkinakis, D., & Toporowska Gronostaj, M. (2010). The past meets the present in Swedish FrameNet++. In 14th EURALEX international congress (pp. 269–281). Leeuwarden: EURALEX.

  • Borin, L., Forsberg, M., Olsson, L.-J., & Uppström, J. (2012). The open lexical infrastructure of Språkbanken. In Proceedings of LREC 2012 (pp. 3598–3602). Istanbul: ELRA.

  • Boyd-Graber, J., Fellbaum, C., Osherson, D., & Schapire, R. (2006). Adding dense, weighted connections to WordNet. In GWC 2006 proceedings (pp. 29–35). Brno: Masaryk University.

  • Carstairs-McCarthy, A. (1999). The origins of complex language. Oxford: Oxford University Press.

  • Cruse, D. A. (2000). Aspects of the micro-structure of word meanings. In Y. Ravin & C. Leacock (Eds.), Polysemy: Theoretical and computational approaches (pp. 30–51). Oxford: Oxford University Press.

  • Daudé, J., Padró, L., & Rigau, G. (2000). Mapping wordnets using structural information. In Proceedings of ACL 2000. Hong Kong: ACL.

  • Erk, K. (2010). What is word meaning, really? (And how can distributional models help us describe it?). In Proceedings of the 2010 workshop on geometrical models of natural language semantics (pp. 17–26). Uppsala: ACL.

  • Fellbaum, C. (1998a). Introduction. In C. Fellbaum (Ed.), WordNet: An electronic lexical database (pp. 1–19). Cambridge, Mass: MIT Press.

  • Fellbaum, C. (Ed.). (1998b). WordNet: An electronic lexical database. Cambridge, Mass: MIT Press.

  • Fellbaum, C. (2005). Co-occurrence and antonymy. International Journal of Lexicography, 8(4), 281–303.

  • Forsberg, M. (2007). Three tools for language processing: BNF converter, Functional Morphology, and Extract. PhD diss, Göteborg University and Chalmers University of Technology.

  • Francopoulo, G. (Ed.). (2013). LMF: Lexical markup framework. London/Hoboken, NJ: ISTE/Wiley.

  • Goddard, C. (Ed.). (2008). Cross-linguistic semantics. Amsterdam: John Benjamins.

  • Goddard, C., & Karlsson, S. (2008). Re-thinking think in contrastive perspective: Swedish vs. English. In C. Goddard (Ed.), Cross-linguistic semantics (pp. 225–240). Amsterdam: John Benjamins.

  • Goldberg, A. E. (1995). Constructions: A construction grammar approach to argument structure. Chicago: University of Chicago Press.

  • Hanks, P. (2000). Do word meanings exist? Computers and the Humanities, 34(1–2), 205–215.

  • ISO. (2008). Language resource management—Lexical markup framework (LMF). International Standard ISO 24613.

  • Kann, V., & Rosell, M. (2006). Free construction of a free Swedish dictionary of synonyms. In Proceedings of the 15th NODALIDA conference, Joensuu 2005 (pp. 105–110). Department of Linguistics, University of Joensuu.

  • Kilgarriff, A. (1997). I don’t believe in word senses. Computers and the Humanities 31(2), 91–113.

  • Lönngren, L. (1988). Lexika, baserade på semantiska relationer. In Nordiske Datalingvistikdage og Symposium for datamatstøttet leksikografi og terminologi 1987 (pp. 229–236). Copenhagen: Handelshøjskolen i København, Institut for Datalingvistik.

  • Lönngren, L. (1989). Svenskt associationslexikon: Rapport från ett projekt inom datorstödd lexikografi. Centrum för datorlingvistik. Uppsala universitet. Rapport UCDL-R-89-1.

  • Lönngren, L. (1992). Svenskt associationslexikon. Del I-IV. Institutionen för lingvistik. Uppsala universitet.

  • Lönngren, L. (1998). A Swedish associative thesaurus. In Euralex ’98 proceedings, 2, 467–474.

  • Mel’čuk, I. A. (1974). Opyt teorii lingvističeskih modelej «\( Smysl \leftrightarrow Tekst\)». Moscow: Nauka.

  • Miller, G. A. (1998). Nouns in WordNet. In C. Fellbaum (Ed.), WordNet: An electronic lexical database (pp. 23–46). Cambridge, Mass: MIT Press.

  • Morris, J., & Hirst, G. (2004). Non-classical lexical semantic relations. In HLT-NAACL 2004: Workshop on computational lexical semantics (pp. 46–51). Boston: ACL.

  • Murphy, M. L. (2003). Semantic relations and the lexicon. Cambridge: Cambridge University Press.

  • Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2001). Multiword expressions: A pain in the neck for NLP. In Proc. of the 3rd international conference on intelligent text processing and computational linguistics (CICLing-2002) (pp. 1–15). Berlin: Springer.

  • Vossen, P. (Ed.). (1998). EuroWordNet: A multilingual database with lexical semantic networks for European languages. Dordrecht: Kluwer.

  • Wierzbicka, A. (1996). Semantics: Primes and universals. USA: Oxford University Press.

  • Wurzel, W. U. (1989). Inflectional morphology and naturalness. Dordrecht: Kluwer.

Acknowledgments

SALDO has been developed with Swedish public funding. After 2003, the University of Gothenburg through Språkbanken has financed the main part of the work on SALDO. During 2006–2008, the development of SALDO was partly supported by the Swedish Research Council project Library-Based Grammar Engineering (2005-4211; PI Aarne Ranta, Chalmers University of Technology). After 2008, part of the funding has come from the Swedish Research Council in the projects Safeguarding the future of Språkbanken (2007-7430; PI Lars Borin, Språkbanken, University of Gothenburg) and Swedish FrameNet++ (2010-6013; PI Lars Borin). We thank the three anonymous reviewers for their detailed comments and suggestions.

Author information

Corresponding author

Correspondence to Lars Borin.

About this article

Cite this article

Borin, L., Forsberg, M. & Lönngren, L. SALDO: a touch of yin to WordNet’s yang. Lang Resources & Evaluation 47, 1191–1211 (2013). https://doi.org/10.1007/s10579-013-9233-4

  • DOI: https://doi.org/10.1007/s10579-013-9233-4
