1 Introduction

The study of language and the development of natural language processing (NLP) applications require access to language resources (LRs). Recently, several digital repositories that index metadata for LRs have emerged, supporting the discovery and reuse of LRs. One of the most notable such initiatives is META-SHARE (footnote 1) [18], an open, integrated, secure and interoperable exchange infrastructure where LRs are documented, uploaded, stored, catalogued, announced, downloaded, exchanged and discussed, aiming to support reuse of LRs. Towards this end, META-SHARE has developed a rich metadata schema that allows LRs to be described across their whole lifecycle, from production to usage. The schema has been implemented as an XML Schema Definition (XSD) (footnote 2), and descriptions of specific LRs are available as XML documents.

Yet, META-SHARE is not the only source for discovering LRs and their descriptions; other sources include the catalogs of agencies dedicated to the promotion and distribution of LRs, such as ELRA (footnote 3) and LDC (footnote 4), other infrastructures such as the CLARIN Virtual Language Observatory (VLO) (footnote 5) [2], the Language Grid (footnote 6) and Alveo (footnote 7), the Open Language Archives Community (OLAC) (footnote 8), catalogs with crowd-sourced metadata, such as the LREMap (footnote 9) [5], and, more recently, repositories coming from various communities (e.g. OpenAire (footnote 10), EUDAT (footnote 11), etc.). The metadata schemes of all these sources vary with respect to their coverage and the set of specific metadata captured. Currently, it is not possible to query all these sources in an integrated and uniform fashion. The Web of Data is a natural scenario for exposing LR metadata in order to allow their automated discovery, sharing and reuse by humans or software agents. The benefits of this model, including interoperability, federation, expressivity and dynamicity, were laid out by Chiarcos et al. [7].

In this paper we contribute to the interoperability of these repositories by developing an ontology in the Web Ontology Language (OWL) [17] that allows us to represent the metadata schemes of these repositories under an extensible, open-world model (footnote 12). The proposed ontology is based on the ontology developed by Villegas et al. [22] for the University Pompeu Fabra’s (UPF) META-SHARE node (covering part of the original schema), which is extended to the complete schema (in order to cover all relevant LRs) and incorporates the consensus reached in the context of the W3C Linked Data for Language Technologies (LD4LT) Community Group (footnote 13). We show how this model interacts with the DCAT [16] vocabulary as well as the most prominent models in the CLARIN VLO data. Further, we describe the application of the model in two portals: the IULA LOD catalogue and Linghub (footnote 14).

The remainder of this paper is structured as follows: in Sect. 2 we describe the related work in the field of LR metadata harmonization. The development of the META-SHARE ontology is described in Sect. 3 and its application in Sect. 4. Finally, in Sect. 5 we consider the broader impact of this ontology as a tool for computational linguists and as a method to realize an architecture of (linked) data-aware services.

2 Related Work

The task of finding common vocabularies for linguistics is of wide interest and several general ontologies for linguistics have been proposed. The General Ontology for Linguistic Description [9, GOLD] was proposed as a common model for linguistic data, but its relatively limited scope and low coherence have not led to widespread adoption. An alternative approach that has been proposed is to use ontologies to create coherence among the resources, in particular by using ontologies to align different linguistic schemas [6].

This lack of consensus also extends to the description of LRs, even for non-linguistic concepts. In fact, there are as many metadata schemas for their descriptions as there are catalogs and repositories for their presentation (e.g. those used by ELRA and the LDC) and communities describing them (e.g. TEI [14] or CES [13]). The most widely used schema for the exchange of LRs is the one suggested by the Open Language Archives Community [1, OLAC], which builds on the Dublin Core metadata and has been criticized as being too reductionistic. Differences between the schemas lie in the range of features used and their labels and datatypes.

An important effort to harmonize metadata has been the ISO Data Category Registry (ISOcat DCR) [15], intended as a registry where metadata providers can register their concepts (Data Categories) and link them to those of other providers. A subset thereof was selected by metadata experts as the core elements for the description of LRs (“Athens Core”). The Component Metadata Infrastructure [4] proposed by CLARIN extends this principle of a common registry to include “components” and “profiles”: “components” consist of semantically close elements to be shared among different communities when producing “profiles” for specific LR types. However, as we observe in Sect. 3.5, this has in practice merely resulted in each contributing institute using its own scheme, with very little commonality between different institutes. To improve this situation it was recently proposed that the conversion of these CMDI schemas to RDF would enable better interoperability [21].

A different approach was taken for the design of the META-SHARE schema [11], which was based on a comparative study of the most widespread metadata schemas and catalog descriptions, analysis of user needs and discussions with metadata providers and experts in order to arrive at a common schema, taking into account previous initiatives and recommendations (cf. [8, 20]).

Fig. 1. The core of the META-SHARE model

3 The META-SHARE OWL Ontology

3.1 Original MS XSD Schema

The META-SHARE schema [11] has been designed not only as an aid for LR search and retrieval, but also as a means to foster their production, use and re-use by bringing together knowledge about LRs and related objects and processes, thus encoding information about the whole lifecycle of the LR from production to usage. The central entity of the META-SHARE schema is the LR per se, which encompasses both data sets (e.g., textual, audio and multimodal/multimedia corpora, lexical data, ontologies, terminologies, computational grammars, language models) and technologies (e.g., tools, services) used for their processing. In addition to the central entity, other entities are also documented in the schema; these are reference documents related to the LR (papers, reports, manuals etc.), persons/organizations involved in its creation and use (creators, distributors etc.), related projects and activities (funding projects, activities of usage etc.), accompanying licenses, etc., all described with metadata taken as far as possible from relevant schemas and guidelines (e.g. BibTeX for bibliographical references). The META-SHARE schema proposes a set of elements to encode specific descriptive features of each of these entities and relations holding between them, taking as a starting point the LR. Following the CMDI approach, these elements are grouped together into “components”. The core of the schema is the resourceInfo component (Fig. 1), which subsumes:

  • administrative components relevant to all LRs, e.g. identificationInfo (name, description and identifiers), distributionInfo (licensing and intellectual property rights information), usageInfo (information about the intended and actual use of the LR).

  • components specific to the resource type (corpus, lexical/conceptual resource, language model, tool/service) and media type (text, audio, video, image), which support the encoding of information relevant to resource/media combinations, e.g. text or audio parts of corpora, lexical/conceptual resources etc., such as language, formats, classification.

The META-SHARE schema recognises obligatory elements (minimal version) and recommended and optional elements (maximal version). An integrated environment supports the description of LRs, either from scratch or through uploading of XML files adhering to the META-SHARE metadata schema, as well as browsing, searching and viewing of the LRs.

3.2 Formal Modelling and Mapping Issues

In the META-SHARE XSD schema, elements are formalized as simple elements whereas components are formalized as complex-type elements. When mapping the XSD schema to RDF, elements can be naturally understood as properties (e.g. name, gender, etc.). Components (i.e. complex-type elements), however, deserve a careful analysis. General mapping rules from XSD to RDF establish that a local element with complex type translates into an object property and a class. We observed that the straightforward application of such a principle may yield unnecessarily verbose graphs. Thus, following Villegas et al. [22], we identified potentially removable nodes before undertaking the actual RDFication process. Embedded complex elements with cardinality of exactly one are identified as potentially removable, provided they contain neither text nor attributes. This allows for a simplification of the model: for example, in the chain resourceInfo \(\circ \) identificationInfo \(\circ \) resourceName, the identificationInfo property is not needed. Interestingly enough, the removal of the superfluous wrapping elements has also led to a change of philosophy in the schema and a need for restructuring in order to ensure that properties are attached to the most appropriate node, as exemplified and discussed in Sect. 3.4. Beyond this, we made extensions to our mapping strategy in order to improve the ontology, such as the following:

  • Removal of the Info suffix from the names of wrapping elements of components.

  • Improvement of names that created confusion, as already noted by the META-SHARE group and/or the LD4LT group; thus, resourceInfo was renamed LanguageResource and restrictionsOfUse became conditionsOfUse.

  • Generalization of concepts, e.g. replacing notAvailableThroughMetashare with availableThroughOtherDistributor.

  • Development of novel classes based on existing values, for example: \(\mathtt {Corpus} \equiv \exists \mathtt {resourceType}.\mathtt {corpus}\)

  • Grouping similar elements under novel superclasses, e.g. annotationType and genre values are structured in classes and subclasses better reflecting the relation between them. Indicatively, the superclass SemanticAnnotation can be used to bring together semantic annotation types, such as semantic roles, named entities, polarity, and semantic relations.

  • Extension of existing classes with new values and new properties (e.g. licenseCategory for licences).

The actual mapping was achieved by means of a custom domain-specific language called LIXR [?].
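The wrapper-removal rule and two of the refinements listed above (dropping the Info suffix and deriving a class from a value) can be illustrated with a minimal Python sketch. It is not the LIXR mapping itself: the record fragment, the blank-node labels, the placement of resourceType directly under resourceInfo, and the use of "occurs once under its parent" as a stand-in for the schema-level cardinality check are all assumptions made for illustration.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical record fragment; only the element names mentioned in the
# text are taken from the schema, the nesting is simplified.
RECORD = """
<resourceInfo>
  <identificationInfo>
    <resourceName>Example Corpus</resourceName>
  </identificationInfo>
  <resourceType>corpus</resourceType>
  <distributionInfo><availability>available</availability></distributionInfo>
  <distributionInfo><availability>restricted</availability></distributionInfo>
</resourceInfo>
"""


def property_name(tag):
    """Drop the 'Info' suffix from component names."""
    return tag[:-4] if tag.endswith("Info") else tag


def is_removable(elem, parent):
    """Wrapper with children, no text, no attributes, occurring exactly once.

    Counting occurrences in the instance is an assumption standing in for
    the cardinality declared in the XSD.
    """
    has_text = bool((elem.text or "").strip())
    occurs_once = len(parent.findall(elem.tag)) == 1
    return len(elem) > 0 and not has_text and not elem.attrib and occurs_once


def to_triples(elem, subject, triples=None, ids=None):
    """Walk the XML tree and emit (subject, property, object) tuples."""
    triples = [] if triples is None else triples
    ids = itertools.count(1) if ids is None else ids
    for child in elem:
        if len(child) == 0:
            # Simple element: becomes a property with a literal value.
            triples.append((subject, property_name(child.tag), (child.text or "").strip()))
        elif is_removable(child, elem):
            # Superfluous wrapper: attach its children to the current node.
            to_triples(child, subject, triples, ids)
        else:
            # Genuine component: object property plus a new node.
            node = f"_:b{next(ids)}"
            triples.append((subject, property_name(child.tag), node))
            to_triples(child, node, triples, ids)
    return triples


def add_defined_classes(triples):
    """Sketch of Corpus being equivalent to 'resourceType some corpus'."""
    typed = [(s, "rdf:type", "Corpus") for s, p, o in triples
             if p == "resourceType" and o == "corpus"]
    return triples + typed


triples = add_defined_classes(to_triples(ET.fromstring(RECORD.strip()), "_:lr"))
for triple in triples:
    print(triple)
```

In this sketch the identificationInfo wrapper disappears and resourceName attaches directly to the resource node, whereas the repeated distributionInfo component is kept as a node of its own.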

Fig. 2. An abridged example of a metadata entry represented with common metadata properties from DCAT and Dublin Core and novel properties from the META-SHARE ontology.

3.3 Interface with DCAT and Other Vocabularies

The META-SHARE model can be considered broadly similar to DCAT in that three of DCAT’s four main classes have a nearly exact match in META-SHARE. DCAT’s dataset corresponds nearly exactly to the resourceInfo tag and, similarly, distributions are similar to distributionInfo classes and catalogRecord is similar to metadataInfo. Thus, we introduced equivalent class relations between these elements. The fourth main class, catalog, covers a level not modelled by META-SHARE. DCAT uses Dublin Core properties for many parts of the metadata, and often these properties are found deeply nested in the META-SHARE description. For example, language is found in several places deeply nested under six tags (footnote 15). In META-SHARE this allows different media types in the resource to have different languages, e.g., the dialogues and the scripts of a video may be in English, whereas the subtitles can be in French and German. We still include this fine-grained metadata but also add the property at the resource level to indicate if any part of the resource is in the stated language. Similarly, some Dublin Core properties are not directly specified in the META-SHARE model, but can be inferred from related properties, e.g., Dublin Core’s ‘contributor’ follows (by means of a property chain) from people indicated as ‘annotators’, ‘evaluators’, ‘recorders’ or ‘validators’. Furthermore, several DCAT-specific properties, such as ‘download URL’, are nearly exactly equivalent to those in META-SHARE but occur in places that do not fit the domain and range of the properties. In this particular case, it was a simple fix to move the property to the enclosing DistributionInfo class. Inevitably, several properties from DCAT did not have equivalences in META-SHARE, notably ‘keyword’. Figure 2 shows a simplified example of a META-SHARE metadata entry.

In addition to DCAT, we also used other vocabularies to establish equivalences to parts of the model. In particular, we mapped to the Friend of a Friend (FOAF) ontology to describe people and organizations and the Semantic Web for Research Communities (SWRC) ontology to describe scientific publications.
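The lifting of nested language values to the resource level, and the inference of Dublin Core’s ‘contributor’ from role properties, can be sketched in a few lines of Python. This is an illustration only: the `ms:` property names, the `ms:mediaPart` link and the example identifiers are hypothetical, and an OWL reasoner applying the actual property chains would replace the hand-written traversal.

```python
from collections import deque

# Hypothetical property names; the ontology's actual IRIs may differ.
LANGUAGE = "ms:language"
CONTRIBUTOR_ROLES = {"ms:annotator", "ms:evaluator", "ms:recorder", "ms:validator"}


def reachable_nodes(triples, start):
    """Return every node reachable from `start` by following any property."""
    subjects = {s for s, _, _ in triples}
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for s, _, o in triples:
            if s == node and o in subjects and o not in seen:
                seen.add(o)
                queue.append(o)
    return seen


def lift_to_resource(triples, resource):
    """Infer resource-level dct:language and dct:contributor statements."""
    nodes = reachable_nodes(triples, resource)
    inferred = set()
    for s, p, o in triples:
        if s not in nodes:
            continue
        if p == LANGUAGE and s != resource:
            # Some part of the resource is in this language.
            inferred.add((resource, "dct:language", o))
        elif p in CONTRIBUTOR_ROLES:
            # Mimics a property chain ending in a contributor role.
            inferred.add((resource, "dct:contributor", o))
    return inferred


# A video whose dialogue is in English and whose subtitles are in French and German.
video = [
    ("ex:video", "ms:mediaPart", "_:audio"),
    ("_:audio", LANGUAGE, "en"),
    ("ex:video", "ms:mediaPart", "_:subtitles"),
    ("_:subtitles", LANGUAGE, "fr"),
    ("_:subtitles", LANGUAGE, "de"),
    ("_:subtitles", "ms:annotator", "ex:alice"),
]
inferred = lift_to_resource(video, "ex:video")
for triple in sorted(inferred):
    print(triple)
```

The fine-grained statements on the media parts are left untouched; the inferred triples are added alongside them.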

3.4 Licensing Module

A specific area where we made a significant effort to improve the modelling was the licensing information, in order to allow the formulation of clear and concise rights information for LRs. Some languages already exist for this purpose and, among them, ODRL 2.1 (Open Digital Rights Language) was chosen and extended. ODRL is a policy and rights expression language specified by the W3C ODRL Community Group (footnote 16), which defines a model for representing permissions, prohibitions and duties. The most common licenses (for software, data or general works) have already been expressed in ODRL in the RDF License dataset (footnote 17) [19] and can be pointed to when an LR is licensed with any of these. Extensions to the ODRL vocabulary have been made to represent some of the specificities of the LR domain. The specification also suggested changes, some of them structural, to the previous META-SHARE modelling, and to this extent we combined the existing META-SHARE licensing vocabulary with ODRL. In addition, we extended the model by adding some new properties and individuals based on requirements from the LD4LT community group (footnote 18). In particular, the generic conditions-of-use values of the META-SHARE schema have been exploited for creating RDF codes for non-standard licenses and are mapped to ODRL actions (e.g. the duty to attribute), and included in an RDF document, as shown in Fig. 3. This module has been published both as an independent module (footnote 19) and as part of the META-SHARE ontology.
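To make the mapping from conditions-of-use values to ODRL permissions, prohibitions and duties more concrete, the following Python sketch builds a small policy as plain sets. The condition labels and their assignment to ODRL actions are assumptions chosen for illustration; they are not the mapping published in the licensing module.

```python
# Hypothetical mapping from conditions-of-use values to (rule type, ODRL action).
CONDITION_TO_ODRL = {
    "attribution": ("duty", "odrl:attribute"),
    "nonCommercialUse": ("prohibition", "odrl:commercialize"),
    "noDerivatives": ("prohibition", "odrl:derive"),
}


def build_policy(conditions, permitted=("odrl:reproduce", "odrl:distribute")):
    """Assemble a policy from conditions of use and a set of base permissions."""
    policy = {"permission": set(permitted), "prohibition": set(), "duty": set()}
    for condition in conditions:
        rule_type, action = CONDITION_TO_ODRL[condition]
        policy[rule_type].add(action)
    # A prohibited action must not also appear among the permissions.
    policy["permission"] -= policy["prohibition"]
    return policy


def is_allowed(policy, action):
    """An action is allowed only if explicitly permitted and not prohibited."""
    return action in policy["permission"] and action not in policy["prohibition"]


policy = build_policy(["attribution", "nonCommercialUse"])
print(policy["duty"])
print(is_allowed(policy, "odrl:distribute"))
print(is_allowed(policy, "odrl:commercialize"))
```

Treating unlisted actions as not allowed is a design choice of this sketch rather than a statement about how ODRL policies must be evaluated.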

Fig. 3. An example of the modelling of licenses in a record

3.5 Harmonizing Other Resources with META-SHARE

While a basic level of interoperability can be established by using standard vocabularies such as DCAT and Dublin Core, this can only be done by sacrificing completeness and ignoring all metadata particular to language resources. For this reason, we use the META-SHARE model to represent and harmonize the metadata relating specifically to the domain of linguistics and language resources. As a proof-of-concept, we show how the META-SHARE ontology supports the harmonization of data from the CLARIN VLO. The CLARIN repository describes its resources using a small common set of metadata and a larger description defined by the Component Metadata Infrastructure [4, CMDI]. These metadata schemes are extremely diverse, as shown in Table 1. We will focus on the top five of these types, for which we have created corresponding mappings. Two of these schemes consist only of Dublin Core properties and so do not have specific language resource metadata. The most frequent tag, ‘Song’, is used to describe records of a database consisting of musical recordings. While many of the properties used by this tag (e.g., ‘number of stanzas’) have no correspondence in Dublin Core, they can be described with respect to existing elements of the META-SHARE Ontology. The Session tag is used to provide IMDI metadata [3], and has a very loose correspondence to META-SHARE. For instance, there are no corresponding properties to describe the participants of a media recording. This highlights the advantage of taking an open world, ontological approach as opposed to a fixed schema, in that we can easily introduce new properties while still reusing the META-SHARE properties where they are appropriate. The MODS metadata scheme [10] is in fact a general domain metadata framework. We found that 28 entities from META-SHARE corresponded to elements used in the ‘Song’ metadata of the Meertens Institute collection, and 37 to the IMDI metadata, although there was only minor overlap with the MODS scheme (in particular 4 entities used to describe language) as this scheme is not specific to language resources.
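The open-world strategy described here, reusing META-SHARE properties where a correspondence exists and introducing new properties otherwise, can be sketched as follows. The field names, the `ms:` targets and the `ext:` namespace for newly minted properties are all hypothetical and do not reproduce the actual CLARIN mappings.

```python
# Hypothetical correspondences between source fields and META-SHARE properties.
SESSION_MAPPING = {
    "Title": "ms:resourceName",
    "Language": "ms:languageName",
}


def mint_property(field, prefix="ext:"):
    """Create a camelCase property name for a field without a correspondence."""
    words = field.replace("_", " ").split()
    return prefix + words[0].lower() + "".join(w.capitalize() for w in words[1:])


def harmonize(record, mapping):
    """Rewrite a source record using mapped properties, minting new ones as needed."""
    harmonized = {}
    for field, value in record.items():
        prop = mapping.get(field) or mint_property(field)
        harmonized.setdefault(prop, []).append(value)
    return harmonized


session = {
    "Title": "Interview 12",
    "Language": "Dutch",
    "Participant role": "interviewer",  # no META-SHARE counterpart assumed
}
result = harmonize(session, SESSION_MAPPING)
for prop, values in result.items():
    print(prop, values)
```

With a fixed schema the unmapped field would have to be dropped or forced into an ill-fitting element; here it is kept under a new property while the mapped fields remain queryable through the shared vocabulary.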

Table 1. The top 10 most frequent component types in CLARIN and the institutes that use them. Abbreviations: MI \(=\) Meertens Institute (KNAW), MPI \(=\) Max Planck Institute (Nijmegen), BeG \(=\) Netherlands Institute for Sound and Vision, HI \(=\) Huygens Institute (KNAW), BBAW \(=\) Berlin-Brandenburg Academy of Sciences

4 Applications

4.1 IULA LOD Catalogue

The IULA-UPF CLARIN Competence Centre (footnote 20) aims to promote and support the use of technology and text analysis tools in Humanities and Social Sciences research. The centre includes a Catalogue (footnote 21) with information on language resources and technology. The Catalogue is based on the initial linked open data (LOD) version of the META-SHARE model as described in [22] and the original data generated from the UPF META-SHARE node (footnote 22). The source XML records were converted into RDF and augmented with service descriptions (not included in the UPF META-SHARE node) and relevant documentation (appropriate articles, documentation, sample data and results, illustrative experiments, examples from outstanding projects, illustrative use cases, etc.) to encourage potential users to embrace digital tools. Finally, the data was enriched with internal and external links. The resulting linked data maximised the information contained in the original repository and enabled data mashup techniques that get relevant data from DBpedia and DBLP (footnote 23). The catalogue demonstrates the benefits of the LOD framework and how it can easily be used as the basis for a web browser application that maximizes information and helps users to navigate the datasets in a comprehensive way.

4.2 Linghub

Linghub is a portal designed to allow common querying of metadata from multiple highly heterogeneous repositories. Currently, it draws not only from META-SHARE, but also from the LRE-Map [5], the CLARIN VLO [2] and DataHub, and is regularly updated with new/changed information. The repository is based on the DCAT and Dublin Core vocabularies. However, these models do not capture any specific linguistic information. For this reason, the ontology described in this paper has been integrated into the system to allow users to use META-SHARE as the basic vocabulary for querying linguistic information about language resources, and the mappings previously described have already been applied to data from LRE-Map and the CLARIN VLO. Linghub supports browsing and querying by several means, including faceted browsing, full-text search, SPARQL querying and related item search. As such, we believe that the portal, while not a direct collector of metadata, will enable users to find more language resources and do so more easily. See Fig. 4 for a screenshot of the Linghub web interface.

Fig. 4. META-SHARE data as displayed within the Linghub interface

The Linghub portal is thus a proof-of-concept for the level of harmonization that the use of a common ontology provides, as metadata originating from different repositories can be uniformly queried in Linghub in an integrated fashion. We adhere to an open architecture in which not only Linghub but other discovery services that aggregate and index data could potentially be developed.
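As a rough illustration of why a shared vocabulary makes uniform querying possible, the following Python sketch runs facet counts and a faceted selection over records that originate from different repositories but use the same property names. It does not reflect Linghub’s implementation; the records, property names and values are hypothetical.

```python
from collections import Counter

# Hypothetical harmonized records from three repositories.
RECORDS = [
    {"source": "META-SHARE", "dct:language": ["en"], "ms:resourceType": "corpus"},
    {"source": "LRE-Map", "dct:language": ["en", "fr"],
     "ms:resourceType": "lexicalConceptualResource"},
    {"source": "CLARIN VLO", "dct:language": ["nl"], "ms:resourceType": "corpus"},
]


def as_list(value):
    """Normalize a missing, single or multi-valued property to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def facet_counts(records, prop):
    """Count how many records carry each value of the given property."""
    counts = Counter()
    for record in records:
        counts.update(set(as_list(record.get(prop))))
    return counts


def select(records, facets):
    """Keep the records matching every (property, value) pair in `facets`."""
    def matches(record):
        return all(value in as_list(record.get(prop)) for prop, value in facets.items())
    return [record for record in records if matches(record)]


print(facet_counts(RECORDS, "ms:resourceType"))
english_corpora = select(RECORDS, {"ms:resourceType": "corpus", "dct:language": "en"})
print([record["source"] for record in english_corpora])
```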

5 Conclusion

This work represents only a starting point for the harmonization of language resources: it provides a standard ontology that can be used in the description of metadata of linguistic resources, but a number of challenges remain to be addressed. Firstly, the next step would be to make sure that not only metadata, but also the actual data is available on the Web in open web standards such as RDF so that data can be automatically crawled and analyzed. Secondly, linguistic data published on the Web should ideally follow the same format (e.g. RDF) so that it can be easily integrated and queried across datasets. This presupposes agreement on best practices for data publication and formats, and the Natural Language Processing Interchange Format (NIF) [12] is an obvious candidate for that. Thirdly, harmonization should be extended to the description of NLP services so that NLP services can be distributed across providers and repositories. The mechanisms for description of the functionality of NLP services should be extremely light-weight. Finally, input and output formats for services should be standardized and homogenized so that services can be easily composed to realize more complex workflows, without relying on too much parametrization. Workflows of services should be easily executable ‘on the cloud’. In order to scale, services should support parallelization, streaming and non-centralized processing. We believe that the development of common vocabularies such as the one presented in this paper should enable the emergence of a new paradigm supporting the discovery and exploitation of linguistic data and services across repositories.