Both DWN and RBN are semantic lexical resources. RBN uses a traditional structure of form-meaning pairs, so-called Lexical Units. Lexical Units (LUs) are word senses in the lexical semantic tradition. They contain the linguistic knowledge that is needed to properly use the word in a specific meaning in a language. Since RBN follows a word-to-meaning view, the semantic and combinatoric information for each meaning typically clarifies the differences across the meanings. RBN thus focusses on the polysemy of words and typically represents condensed and generalised meanings from which more specific ones can be derived.
On the other hand, DWN is organised around the notion of synsets. Synsets are sets of synonyms that represent a single concept as defined by [14], e.g. box and luidspreker in Dutch are synonyms for loudspeaker. Synsets are conceptual units based on the lexicalisations in a language. In Wordnet, concepts are defined in a graph by lexical semantic relations, such as hypernyms (broader term), hyponyms (narrower term) and role relations. Typically in Wordnet, information is provided for the synset as a whole and not for the individual synonyms, thus presenting a meaning-to-word view on a lexical database and focussing on the similarities of word meanings. For example, word meanings that are synonyms have a single gloss or definition in Wordnet but have separate definitions in RBN as different lexical units. From a Wordnet point of view, the definitions of LUs from the same synset should be semantically equivalent and the LUs of a single word should belong to different synsets. From an RBN point of view, the LUs of a single word typically differ in terms of connotation, pragmatics, syntax and semantics, whereas synonymous words of the same synset can be differentiated along connotation, pragmatics and syntax but not semantics.
Outside the lexicon, an ontology provides a third layer of meaning. In Cornetto, SUMO [24] has been used as the ontological framework. SUMO provides good coverage, is publicly available, and all synsets in PWN are mapped to it. Through the equivalence relations from DWN to PWN, mappings to SUMO can be imported automatically. The concepts in an ontology are referred to as Terms. Terms represent types that can be combined in a knowledge representation language to form axioms. In principle, Terms are defined independently of language but according to principles of logic. In Cornetto, the ontology represents an independent anchoring of the pure relational meaning in Wordnet. The ontology is a formal framework that can be used to constrain and validate the implicit semantic statements of the lexical semantic structures, both for LUs and synsets. Further, the semantic anchoring to the ontology contributes to the development of semantic web applications for which language-specific lexicalisations of ontological types are useful.
A fourth layer is represented by Wordnet Domains [22]. Domains represent clusters of concepts that are related by a shared area of interest, such as sport, education or politics. Whereas different instruments can be subclasses of the same ontological Term (e.g. tank and ambulance are both of the type Vehicle), they may belong to different Domains (e.g. military and medical).
The Cornetto database (CDB) thus consists of 4 layers of information represented in two collections:
1. Collection of Lexical Units (LU), mainly derived from the RBN
2. Collection of Synsets, derived from DWN with mappings to PWN
3. Mappings to Terms and axioms in SUMO
4. Mappings to Domains in Wordnet Domains
Figure 10.1 shows an overview of the different data structures and their relations. There may be LUs that do not occur in synsets but there are no synonyms in synsets that are not LUs. The synsets are organised by means of internal relations such as hypernyms, while the LUs provide rich information on morphology, syntax and pragmatics. The synsets also point to external sources: the Princeton Wordnet (PWN), Wordnet Domains (DM) and the SUMO ontology. The Cornetto database is implemented in the Dictionary Editor and Browser (DEB II) platform [18], while the raw XML files are distributed by the TST centrale. The XML Schema file for the data can be downloaded from the Cornetto website.
Figure 10.2 provides a simplified overview of the interplay between the different data structures. Here, four meanings of band are defined according to their semantic relations in DWN, RBN, SUMO and Wordnet Domains. Black arrows represent hypernym relations while the dashed arrows represent other semantic relations such as a Mero-Member between ‘music group’ and ‘musician’. Note that the hypernym of each synset for band is similar to the SUMO term it maps to, e.g. middel (device) and Device. However, the SUMO terms are fully axiomatised externally, while the implications of the hypernym relation remain implicit.
In the next sections, we describe the data collections for the synsets, the lexical units and the mappings to SUMO terms in more detail.
10.3.1 Lexical Units
The data structure for the LUs is implemented as a list; every LU element has a unique identifier or c_lu_id. The database for LUs contains structures to represent the form, syntactic and morphological information, semantics, pragmatics, and usage examples. An example of the XML structure for the first sense of the noun band (tire) is shown in Fig. 10.3. The XML of this LU contains basic morpho-syntactic information (lines 3–8), some semantics (lines 11–15) and additional examples on the combinatorial behaviour of the word such as the lexical collocation de band oppompen (to inflate a tire) at line 41, and an idiomatic usage: uit de band springen (excessive behavior) at line 20.
For nouns, the morpho-syntactic information is relatively simple. Figure 10.4 shows the rich information provided for verbs, illustrated by the LU oppompen (to inflate). The syntax field (lines 12–16) specifies the transitivity, valency and complementation of this verb. The semantics field provides information about the caseframe (lines 20–28); oppompen is an action verb with a selection restriction on the agent (animate agent) and no further restrictions on the theme. Finally, both a canonical (line 37) and a textual example (line 38) are given with typical fillers for the theme of this verb: ‘tube’, ‘tire’ and ‘ball’. For a further description of the structure and contents, we refer to the Cornetto deliverable [11].
10.3.2 Synsets
Synsets are identified by a unique identifier or c_synset_id, which is used to reference synsets. An additional attribute, d_synset_id, links synsets to their source concepts in DWN in order to make the lookup for the alignment process more efficient. Each synset contains one or more synonyms; each of these synonym entries consists of a pointer to an LU (c_lu_id).
Figure 10.5 illustrates the structure in more detail for the synset band. It has luchtband (tire filled with air) as a synonym (lines 2–5). Further, the example shows that band has several semantic relations to other concepts such as a hypernym relation to ring (line 20) and to various instruments that apply to tires, such as bandenlichter (tire lever) at line 10, and bandrem (tire brake) at line 15. It also shows an EQ_SYNONYM relation to the English synset for tire at line 27, a relation to the domain transport at line 34 and a subclass relation (+) to the SUMO class Artifact at line 38.
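The pointer structure between the two collections can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Cornetto XML schema or the DEB II implementation: the identifiers c_lu_id and c_synset_id come from the text, while the `form` field, the helper functions and all identifier values such as "lu-1" are invented for the example. The sketch checks the constraint stated earlier, namely that an LU may stand outside any synset but every synonym in a synset must point to an existing LU.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalUnit:
    c_lu_id: str   # unique LU identifier, as named in the text
    form: str      # hypothetical field name for the word form


@dataclass
class Synset:
    c_synset_id: str                              # unique synset identifier
    synonyms: list = field(default_factory=list)  # c_lu_id pointers to LUs


def dangling_synonyms(lexical_units, synsets):
    """Return (c_synset_id, c_lu_id) pairs whose pointer matches no LU."""
    known = {lu.c_lu_id for lu in lexical_units}
    return [
        (synset.c_synset_id, lu_id)
        for synset in synsets
        for lu_id in synset.synonyms
        if lu_id not in known
    ]


def unattached_units(lexical_units, synsets):
    """Return the c_lu_id of every LU that no synset points to (permitted)."""
    used = {lu_id for synset in synsets for lu_id in synset.synonyms}
    return [lu.c_lu_id for lu in lexical_units if lu.c_lu_id not in used]


# Invented identifiers; real Cornetto identifiers look different.
UNITS = [
    LexicalUnit("lu-1", "band"),
    LexicalUnit("lu-2", "luchtband"),
    LexicalUnit("lu-3", "oppompen"),
]
SYNSETS = [Synset("syn-1", ["lu-1", "lu-2"])]

print(dangling_synonyms(UNITS, SYNSETS))
print(unattached_units(UNITS, SYNSETS))
```

In this toy data, band and luchtband share one synset and oppompen is an LU without a synset, which the constraint allows.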
10.3.3 SUMO Ontology Mappings
The SUMO ontology mappings provide the conceptual anchoring of the synsets and the lexical units. The mappings to Terms in SUMO have been imported from the equivalence relations of the synsets to Princeton Wordnet (PWN). Four basic relations are used in Princeton Wordnet and Cornetto:
- = : The synset is equivalent to the SUMO concept
- + : The synset is subsumed by the SUMO concept
- @ : The synset is an instance of the SUMO concept
- [ : The SUMO concept is subsumed by the synset
The mappings from PWN to SUMO consist of two placeholders: one for the four relations (=, +, @, [) and one for the SUMO term. In Cornetto, we extended this representation with a third placeholder to define more complex mappings from synsets to the SUMO ontology. For this, the above relations have been extended with all relations defined in SUMO (version April 2006). The relation name and two arguments represent a so-called triple. The arguments of the triples follow the syntax of the relation names in SUMO: the first slot is reserved for the relation, the second slot for a variable and the third slot contains either a SUMO term or an additional variable. The variables are expressed as integers, where the integer 0 is reserved to co-index with the referent of the synset that is being defined.
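The three-slot layout can be read mechanically. The following Python sketch assumes the parenthesised textual notation used in this chapter, for instance ( + , 0, Artifact); it says nothing about how triples are actually serialised in the Cornetto XML. The function `parse_triple` and the regular expression are hypothetical helpers written for this illustration.

```python
import re

# The four basic relations inherited from the PWN-to-SUMO mappings.
BASIC_RELATIONS = {
    "=": "synset is equivalent to the SUMO concept",
    "+": "synset is subsumed by the SUMO concept",
    "@": "synset is an instance of the SUMO concept",
    "[": "SUMO concept is subsumed by the synset",
}

# Slot 1: relation name or symbol; slot 2: integer variable;
# slot 3: SUMO term or a further integer variable.
TRIPLE_RE = re.compile(r"^\(\s*([^,\s]+)\s*,\s*(\d+)\s*,\s*([^,\s)]+)\s*\)$")


def parse_triple(text):
    """Split one textual triple into (relation, variable, term_or_variable)."""
    match = TRIPLE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a triple: {text!r}")
    relation, variable, third = match.groups()
    # An all-digit third slot is a variable; anything else is a SUMO term.
    third_slot = int(third) if third.isdigit() else third
    return relation, int(variable), third_slot


print(parse_triple("( + , 0, Artifact)"))
print(parse_triple("(resource, 0, 1)"))
```

A triple whose relation is one of the four symbols and whose variable is 0 corresponds to a mapping that could also be expressed in the original two-placeholder format.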
For example, the following expressions are possible in the Cornetto database:
1. Equality cirkel (circle): (=, 0, Circle)
2. Subsumption band (tire): (+, 0, Artifact)
3. Related bot (bone): (part, 0, Skeleton)
4. Axiomatised theewater (tea water): ((instance, 0, Water) (instance, 1, Making) (instance, 2, Tea) (resource, 0, 1) (result, 2, 1))
Relations directly imported from Princeton Wordnet will have the structure of 1 and 2. The triples in 3 and 4 are used to specify a complex mapping relation to the SUMO ontology, in case the basic mapping relations are not sufficient. This is especially the case for so-called non-rigid concepts [16], e.g. theewater (water used for making tea) is not a type of water but water used for some purpose. The triples given in 4 accordingly indicate that the synset refers to an instance of Water rather than a subclass and that this instance is involved in the process of making Tea as a resource.
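The difference between a basic and a complex mapping can be made concrete with a small Python sketch. It encodes the band and theewater mappings from the list above as plain tuples; the class `Triple`, its field names and the three helper functions are assumptions made for this illustration and do not reflect the Cornetto data format. The triples themselves are copied from the text without reinterpreting their argument order.

```python
from typing import NamedTuple, Union


class Triple(NamedTuple):
    relation: str            # a basic symbol or a SUMO relation name
    variable: int            # 0 co-indexes with the referent of the synset
    target: Union[int, str]  # a SUMO term or a further variable


BASIC_RELATIONS = {"=", "+", "@", "["}

BAND = [Triple("+", 0, "Artifact")]
THEEWATER = [
    Triple("instance", 0, "Water"),
    Triple("instance", 1, "Making"),
    Triple("instance", 2, "Tea"),
    Triple("resource", 0, 1),
    Triple("result", 2, 1),
]


def is_basic_mapping(mapping):
    """True for a single triple that uses one of the four basic relations."""
    return (
        len(mapping) == 1
        and mapping[0].relation in BASIC_RELATIONS
        and mapping[0].variable == 0
    )


def variables(mapping):
    """Collect every integer variable used in either argument slot."""
    found = set()
    for triple in mapping:
        found.add(triple.variable)
        if isinstance(triple.target, int):
            found.add(triple.target)
    return found


def referent_types(mapping):
    """SUMO terms that variable 0, the synset referent, is an instance of."""
    return [
        triple.target
        for triple in mapping
        if triple.relation == "instance" and triple.variable == 0
    ]


print(is_basic_mapping(BAND), is_basic_mapping(THEEWATER))
print(sorted(variables(THEEWATER)), referent_types(THEEWATER))
```

For theewater, variable 0 is typed as an instance of Water, while variables 1 and 2 introduce the additional participants (Making and Tea) that the resource and result triples connect to it.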