The LIME vocabulary (see Fig. 2) we present here, though inspired by the proposal in , is in fact very different because of the need for a better alignment with the overall scope of the working group and for accommodating the flexible publication scenario envisaged by lemon.
Following the conceptual model of the ontology-lexicon interface defined by lemon (see Requirement R1), we distinguish at the metadata level three entities:
the ontology (bearing semantic information),
the lexicon (bearing linguistic information),
the set of lexicalizations (intended as the mere correspondences between logical entities in the ontology and lexical entries in the lexicon).
From the perspective of a metadata vocabulary, LIME focuses on the representation of the relation between these three entities and summaries and descriptive statistics concerning these entities and their relations (see Requirement R5).
The three entities (ontology, lexicon and lexicalization set) are regarded as instances of
. While the lemon model introduces a subclass of
to represent lexica (
), no such subclass exists for lexicalizations. LIME introduces such a subclass,
, to describe the relation between the lexicon and the ontology in question. A
object thus holds all the relevant metadata and descriptive statistics about the lexicalizations that relate ontology elements in the ontology to lexical entries (possibly found in a lexicon).
Moving away from our original assumption that lexicalizations are embedded within an ontology, we allow each entity to be published independently or combined with others into a single resource (see Requirement R3). By allowing this freedom, we support the following scenarios:
a lexicon is published as a stand-alone resource, independently of any specific ontology. We further distinguish the following two cases:
an ontology contains a set of lexicalizations by means of entries in the lexicon (thus ontology + lexicalization as a single data source)
an ontology exists independently of the lexicon, and a third party publishes a lexicalization of the ontology by adopting the above lexicon (thus all the three datasets are separate entities)
a lexicon is created for a specific ontology:
the lexicon and lexicalizations for an existing ontology are published together.
an ontology is published along with its lexicon (ontology, lexicon and the set of lexicalizations published together).
Obviously, since an ontology may be lexicalized in several languages, and since a general-purpose lexicon may be reused across different ontologies, multiple combinations of the above cases may occur for any single resource. Finally, linguistic enrichment of ontologies may occur by means of links with lexical concepts, rather than links with specific lexical entries, as suggested by Pazienza and Stellato . The notion of Lexical Linkset accounts for this scenario, by specializing the notion of
to make explicit its linguistic value.
5.1 Describing (Domain) Datasets
From the LIME viewpoint, any RDF dataset may be lexicalized in a natural language or aligned with a set of lexical concepts. The term dataset is meant hereafter to encompass ontologies, SKOS concept schemes and in general any set of RDF triples. In the ontology-lexicon dualism, the dataset corresponds to the ontology, in the sense that it provides formal symbols that need grounding in a natural language.
At the metadata level, a dataset is then represented as an instance of the class
or a more specific subclass, e.g.
for vocabularies. LIME defines no specific term for describing the dataset that bears the semantic references for the ontology-lexicon interface. Instead, it recommends reusing appropriate metadata terms from the VoID specification (see Requirement R6). For instance, in the following excerpt:
<http://xmlns.com/foaf/0.1/> a voaf:Vocabulary;
dct:title "The Friend of a Friend (FOAF) Vocabulary"@en;
voaf:propertyNumber 62 .
we declare an instance of voaf:Vocabulary
describing the FOAF vocabulary. In the example, we show how to provide the name of the vocabulary, its home page (providing a unique key supporting data aggregation), a download file and the count of classes and properties. In the previous example, we followed LOV when reusing the URI of FOAF to provide additional metadata. This approach requires the publication of metadata via a SPARQL endpoint or some other API (Application Programming Interface). Alternatively, one can create a new URI for the metadata instance, so that it can be dereferenced. Meanwhile, the connection to the vocabulary is established via an
axiom, or some other uniquely identifying property.
5.2 Describing Lexica
A lexicon comprises a collection of lexical entries in a given natural language, and is generally independent from the semantic content of ontologies. The class
represents lexica in both the core (data) and metadata levels of the OntoLex specification. This class extends
, such that recommendations from the VoID specification apply.
Perhaps the most important fact about a lexicon is the language it refers to, which is an explicit marker of the applicability of the resource in given scenarios. This information can be represented either as a literal (according to ISO 639) through the property ontolex:language or as a resource (through the property dct:language), using any of the vocabularies assigning URIs to languages (e.g. http://www.lexvo.org/ or http://id.loc.gov/). The following example describes an English lexicon:
ex:myLexicon a ontolex:Lexicon;
void:triples 10000 .
The description above contains terms from VoID (see Requirement R6), e.g. to provide a data dump and a SPARQL endpoint. An agent may choose between the available types of access based on various criteria: (i) the suitability of the local triple store for handling the advertised number of triples, (ii) the need for specialized processing not provided by the SPARQL endpoint, (iii) the wish to avoid overloading the data provider with frequent or complex queries.
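The choice an agent faces between a data dump and a SPARQL endpoint can be sketched as a small decision function. This is only an illustration in Python: the class and function names, the parameters and every threshold are assumptions made for the example and are not part of LIME or VoID. Only the triple count (10000) is taken from the description above.

```python
from dataclasses import dataclass


@dataclass
class AccessOptions:
    """Access information advertised in the metadata (hypothetical container)."""
    triples: int        # value advertised via void:triples
    has_dump: bool      # a data dump is advertised
    has_sparql: bool    # a SPARQL endpoint is advertised


def choose_access(options: AccessOptions,
                  local_capacity: int,
                  needs_custom_processing: bool,
                  expected_queries: int,
                  polite_query_limit: int = 1000) -> str:
    """Return "dump" or "sparql" following criteria (i)-(iii).

    All thresholds are illustrative assumptions, not recommended values.
    """
    if not options.has_dump and not options.has_sparql:
        raise ValueError("no access method advertised")
    if not options.has_dump:
        return "sparql"
    if not options.has_sparql:
        return "dump"
    # (i) the local triple store cannot hold the advertised number of triples
    if options.triples > local_capacity:
        return "sparql"
    # (ii) processing that the endpoint does not offer requires a local copy
    if needs_custom_processing:
        return "dump"
    # (iii) many queries would burden the provider, so work on a local copy
    if expected_queries > polite_query_limit:
        return "dump"
    return "sparql"


my_lexicon = AccessOptions(triples=10000, has_dump=True, has_sparql=True)
print(choose_access(my_lexicon, local_capacity=1_000_000,
                    needs_custom_processing=False, expected_queries=50))
```

The order of the checks is itself a design choice: here a store that is too small rules out the dump before any other criterion is considered.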
To support the actual exploitation of a lexicon, LIME supports metadata about the way a lexicon has been encoded (see Requirement R4). The reason is that lemon does not commit to a specific catalog of linguistic categories (e.g. part-of-speech), but leaves the choice of a specific catalog to the user. The adopted catalog may be indicated as a value of the property
. This property is defined as a subproperty of
, to better qualify the specific association between the lexicon and the ontology providing linguistic categories. For instance, we can say that
ex:myLexicon uses LexInfo2 as a repository of linguistic annotations:
ex:myLexicon a ontolex:Lexicon;
An important metric indicating the usefulness of a lexicon is the number of lexical entries it contains (see Requirement R5):
ex:myLexicon lime:lexicalEntries 13 .
5.3 Describing Lexicalization Sets
We use the term lexicalization for the reified relation between a lexical entry and the ontological meaning it denotes. A collection of such lexicalizations is modeled by the class
, which in turn subclasses
. For example, the property foaf:knows can be lexicalized as “X is a friend of Y”, “X knows Y”, “X is acquainted with Y” etc., all corresponding to different lexicalizations.
is characterized (as an
) by the natural language it refers to, which can be indicated via the properties already used for the same purpose within
. Moreover, a
may play an associative function, as it may relate a dataset with a lexicon providing lexical entries. The properties
point to the dataset and the lexicon, respectively. The presence of explicit links to the dataset and the lexicon allows metadata indexes to answer queries that seek, for example, a lexicalization set in a given natural language for a given dataset (see Requirement R3). This is an example of an English lexicalization set for FOAF utilizing an OntoLex lexicon:
ex:LexicalizationSet a lime:LexicalizationSet;
lime:lexiconDataset ex:myLexicon .
The mandatory property
tells which dataset the lexicalization is about. Similarly, the optional property
holds a reference to the lexicon being used. This optionality allows supporting previous lexicalization models (see Requirement R2) that rely on plain literals (e.g. RDFS and SKOS) or introduce reified labels (e.g. SKOS-XL), but in any case have no separate notion of lexicon. It is thus necessary to introduce the mandatory property
, which holds the model used in a specific lexicalization set (see Requirement R4). We may say, for instance, that FOAF has an embedded lexicalization set expressed in RDFS:
<http://xmlns.com/foaf/0.1/> void:subset ex:embedLexSet .
ex:embedLexSet a lime:LexicalizationSet;
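The kind of lookup that explicit dataset and lexicon links make possible for a metadata index can be sketched in Python as follows. This is a minimal in-memory illustration, not an actual index implementation: the record class, its field names and the query function are hypothetical, and the language assigned to ex:embedLexSet is an assumption, since it is not stated in the text. The lexicon field is optional to reflect lexicalization models that have no separate notion of lexicon.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LexicalizationSetRecord:
    """Metadata of one lexicalization set (field names are illustrative)."""
    uri: str
    language: str                   # language tag, e.g. "en"
    reference_dataset: str          # mandatory: the dataset being lexicalized
    lexicon: Optional[str] = None   # optional: absent for RDFS/SKOS-style labels


def find_lexicalization_sets(index: List[LexicalizationSetRecord],
                             dataset: str,
                             language: str) -> List[LexicalizationSetRecord]:
    """Return the lexicalization sets for `dataset` in `language`."""
    return [record for record in index
            if record.reference_dataset == dataset
            and record.language == language]


FOAF = "http://xmlns.com/foaf/0.1/"
index = [
    LexicalizationSetRecord("ex:LexicalizationSet", "en", FOAF, "ex:myLexicon"),
    # Embedded RDFS labels: no separate lexicon; "en" is assumed here.
    LexicalizationSetRecord("ex:embedLexSet", "en", FOAF),
    LexicalizationSetRecord(":myItalianLexicalizationOfFOAF", "it", FOAF),
]

for record in find_lexicalization_sets(index, FOAF, "en"):
    print(record.uri, record.lexicon)
```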
Knowing that a dataset is lexicalized in a given natural language does not guarantee that the available linguistic information is useful. In particular, the value of a lexicalization set may be assessed by means of metrics (see Requirement R5). For instance, in the following excerpt:
:myItalianLexicalizationOfFOAF a lime:LexicalizationSet;
the property lime:partition (domain: lime:LexicalizationSet ⊔ lime:LexicalLinkset) points to a
, which is the subset of the lexicalization set dealing exclusively with instances of the class referenced by
. The properties
hold, respectively, the number of entities from the reference dataset and the number of lexical entries from the lexicon that participate in at least one lexicalization, while
holds the total number of lexicalizations. Additionally,
gives the average number of lexicalizations per resource, while
indicates the ratio of resources having at least one lexicalization. There is a certain level of redundancy among these properties, so it is at the discretion of the publisher to choose which of them to provide. For instance, if metadata for the lexicalized ontology is not available, then it is mandatory to provide ratios (as in the above example), whereas otherwise clients can combine counts (if available from both the lexicalization set and the reference dataset) to compute the ratios themselves.
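How the counts and ratios relate to each other can be illustrated with a short Python sketch. The function, the dictionary keys and the toy data are hypothetical and chosen only for this example; the keys are descriptive labels, not LIME property names. The sketch also assumes that the average is taken over all resources of the reference dataset, including those without any lexicalization.

```python
from typing import Dict, Set


def lexicalization_metrics(lexicalizations: Dict[str, Set[str]],
                           total_resources: int) -> Dict[str, float]:
    """Summarize a lexicalization set.

    `lexicalizations` maps each resource to the lexical entries it is paired
    with; `total_resources` is the number of resources of the relevant type
    in the reference dataset (a count the client needs from its metadata).
    """
    if total_resources <= 0:
        raise ValueError("total_resources must be positive")
    # Resources that take part in at least one lexicalization.
    lexicalized = {res for res, entries in lexicalizations.items() if entries}
    # Lexical entries that take part in at least one lexicalization.
    entries_used = set().union(*lexicalizations.values())
    # Total number of (resource, lexical entry) pairs.
    pairs = sum(len(entries) for entries in lexicalizations.values())
    return {
        "references": len(lexicalized),
        "lexical_entries": len(entries_used),
        "lexicalizations": pairs,
        # Assumption: averaged over all resources, lexicalized or not.
        "avg_lexicalizations": pairs / total_resources,
        "percentage": len(lexicalized) / total_resources,
    }


# Toy data: entry names are made up for the illustration.
toy = {
    "foaf:knows": {"know", "friend"},
    "foaf:name": {"name"},
    "foaf:mbox": set(),
}
print(lexicalization_metrics(toy, total_resources=4))
```

When only the counts are published, a client can derive the two ratios exactly as the last two dictionary entries do, provided that the total number of resources is available from the metadata of the reference dataset.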
5.4 Describing Lexical Concept Sets
is a subclass of
that defines a collection of
. It holds LIME-specific and other dataset-level metadata. Lexical concepts are instances of
(as ontolex:LexicalConcept is a subclass of
). In fact, following the pattern already adopted for the lexicon, we combined the concept scheme with the concept set, by making the latter a subclass of the former. It is possible to summarize the content of a concept set (see Requirement R5), by reporting (via the property
) the total number of lexical concepts in a concept set. Beyond the need for such summarizing information, the rationale for the class
is to support the publication of lexical concepts as a separate dataset (see Requirement R3). This, in turn, allows the independent publication of the linguistic realization of those concepts in different natural languages, e.g. several wordnets sharing the synsets from the English WordNet. However, lemon and LIME are also compatible with the approach to multilingual wordnets, in which each wordnet has its own set of synsets, while an inter-language index establishes a mapping between them. In the following excerpt, we define a
mappings between two
:ItalianWN_EnglishWN_index a void:Linkset;
void:linkPredicate skos:exactMatch .
5.5 Describing Conceptualizations
is a dataset relating a set of lexical concepts to a lexicon, indicated by the properties
, respectively. In the representation of wordnets, it plays a role like that of a
for the ontology lexicalization. A different class has been introduced, since the association between lexical concepts and words is different from the lexicalization of ontology concepts.
In addition to the explicit references to the lexicon and the lexical concept set, a conceptualization holds a number of summary metadata (see Requirement R5). The properties
hold the number of lexical entries and lexical concepts that have been associated, respectively.
5.6 Describing Lexical Link Sets
An interesting use of wordnets is to enrich an ontology with links to lexical concepts, which may provide a less ambiguous inter-lingua (than natural language, which has inherent lexical ambiguity) for the task of ontology matching.
To represent a collection of these links, we introduced
, which extends
with additional metadata tailored to this specific type of linking. The properties
clearly distinguish between the different roles that the linked datasets play from the perspective of the lemon model, whereas properties from the VoID vocabulary only deal with lower-level features, e.g. which dataset the subjects of the links belong to. Similarly to the case of
, the property lime:partition references a
dealing with a given resource type. Due to lack of space, we do not provide specific examples for the relevant metrics. However, they are analogous to the ones already discussed for lexicalization sets, except for the fact that they now refer to links rather than lexicalizations.