One Ontology to Bind Them All: The META-SHARE OWL Ontology for the Interoperability of Linguistic Datasets on the Web
Abstract
META-SHARE is an infrastructure for sharing Language Resources (LRs) where significant effort has been put into providing carefully curated metadata about LRs. However, in the face of the flood of data that is used in computational linguistics, a manual approach cannot suffice. We present the development of the META-SHARE ontology, which transforms the metadata schema used by META-SHARE into an ontology in the Web Ontology Language (OWL) that can better handle the diversity of metadata found in legacy and crowd-sourced resources. We show how this model can interface with other more general-purpose vocabularies for online datasets and licensing, and apply this model to the CLARIN VLO, a large source of legacy metadata about LRs. Furthermore, we demonstrate the usefulness of this approach in two public metadata portals for information about language resources.
Keywords
Language resources and evaluation, Metadata, Ontologies, Harmonization
1 Introduction
The study of language and the development of natural language processing (NLP) applications require access to language resources (LRs). Recently, several digital repositories that index metadata for LRs have emerged, supporting the discovery and reuse of LRs. One of the most notable of these initiatives is META-SHARE [18], an open, integrated, secure and interoperable exchange infrastructure where LRs are documented, uploaded, stored, catalogued, announced, downloaded, exchanged and discussed, aiming to support reuse of LRs. Towards this end, META-SHARE has developed a rich metadata schema that describes LRs across their whole lifecycle, from production to usage. The schema has been implemented as an XML Schema Definition (XSD) and descriptions of specific LRs are available as XML documents.
Yet, META-SHARE is not the only source for discovering LRs and their descriptions; other sources include the catalogs of agencies dedicated to the promotion and distribution of LRs, such as ELRA and LDC, other infrastructures such as the CLARIN Virtual Language Observatory (VLO) [2], the Language Grid and Alveo, the Open Language Archives Community (OLAC), catalogs with crowd-sourced metadata, such as the LREMap [5], and, more recently, repositories coming from various communities (e.g. OpenAire, EUDAT etc.). The metadata schemes of all these sources vary with respect to their coverage and the set of specific metadata captured. Currently, it is not possible to query all these sources in an integrated and uniform fashion. The Web of Data is a natural setting for exposing LR metadata in order to allow their automated discovery, sharing and reuse by humans or software agents. The benefits of this model, including interoperability, federation, expressivity and dynamicity, were laid out by Chiarcos et al. [7].
In this paper we contribute to the interoperability of these repositories by developing an ontology in the Web Ontology Language (OWL) [17] that allows us to represent the metadata schemes of these repositories under an extensible, open-world model. The proposed ontology is based on the ontology developed by Villegas et al. [22] for the University Pompeu Fabra’s (UPF) META-SHARE node (covering part of the original schema), which is extended to the complete schema (in order to cover all relevant LRs) and incorporates the consensus reached in the context of the W3C Linked Data for Language Technologies (LD4LT) Community Group. We show how this model interacts with the DCAT [16] vocabulary as well as the most prominent models in the CLARIN VLO data. Further, we describe the application of the model in two portals, firstly the IULA LOD catalogue and secondly Linghub.
The remainder of this paper is structured as follows: in Sect. 2 we will describe the related work in the fields of LR metadata harmonization. The development of the META-SHARE ontology is described in Sect. 3 and its application in Sect. 4. Finally, in Sect. 5 we consider the broader impact of this ontology as a tool for computational linguists and as a method to realize an architecture of (linked) data-aware services.
2 Related Work
The task of finding common vocabularies for linguistics is of wide interest and several general ontologies for linguistics have been proposed. The General Ontology for Linguistic Description [9, GOLD] was proposed as a common model for linguistic data, but its relatively limited scope and low coherence have prevented widespread adoption. An alternative approach that has been proposed is to use ontologies to create coherence among the resources, in particular by using ontologies to align different linguistic schemas [6].
This lack of consensus resides also in the description of LRs, even for non-linguistic concepts. In fact, there are as many metadata schemas for their descriptions as catalogs and repositories for their presentation (e.g. those used by ELRA and the LDC) and communities describing them (e.g. TEI [14] or CES [13]). The most widely used schema for the exchange of LRs is the one suggested by the Open Language Archives Community [1, OLAC], which builds on the Dublin Core metadata and has been criticized as being too reductionistic. Differences between the schemas lie in the range of features used and their labels and datatypes.
An important effort to harmonize metadata has been the ISO Data Category Registry (ISOcat DCR) [15], intended as a registry where metadata providers can register their concepts (Data Categories) and link them to those of other providers. A subset thereof was selected by metadata experts as the core elements for the description of LRs (“Athens Core”). The Component Metadata Infrastructure [4] proposed by CLARIN extends this principle of a common registry to include “components” and “profiles”: “components” consist of semantically close elements to be shared among different communities when producing “profiles” for specific LR types. However, as we observe in Sect. 3.5, this has in practice merely resulted in each contributing institute using its own scheme, with very little commonality between different institutes. To improve this situation it was recently proposed that the conversion of these CMDI schemas to RDF would enable better interoperability [21].
Fig. 1. The core of the META-SHARE model
3 The META-SHARE OWL Ontology
3.1 Original MS XSD Schema
- administrative components relevant to all LRs, e.g. identificationInfo (name, description and identifiers), distributionInfo (licensing and intellectual property rights information), usageInfo (information about the intended and actual use of the LR).
- components specific to the resource type (corpus, lexical/conceptual resource, language model, tool/service) and media type (text, audio, video, image), which support the encoding of information relevant to resource/media combinations, e.g. text or audio parts of corpora, lexical/conceptual resources etc., such as language, formats, classification.
The META-SHARE schema recognises obligatory elements (minimal version) and recommended and optional elements (maximal version). An integrated environment supports the description of LRs, either from scratch or through uploading of XML files adhering to the META-SHARE metadata schema, as well as browsing, searching and viewing of the LRs.
3.2 Formal Modelling and Mapping Issues
- Removal of the Info suffix from the names of wrapping elements of components.
- Improvement of names that created confusion, as already noted by the META-SHARE group and/or the LD4LT group; thus, resourceInfo was renamed LanguageResource and restrictionsOfUse became conditionsOfUse.
- Generalization of concepts, e.g. replacing notAvailableThroughMetashare with availableThroughOtherDistributor.
- Development of novel classes based on existing values, for example: \(\mathtt{Corpus} \equiv \exists \mathtt{resourceType}.\mathtt{corpus}\)
- Grouping similar elements under novel superclasses, e.g. annotationType and genre values are structured in classes and subclasses that better reflect the relations between them. For instance, the superclass SemanticAnnotation can be used to bring together semantic annotation types, such as semantic roles, named entities, polarity, and semantic relations.
- Extension of existing classes with new values and new properties (e.g. licenseCategory for licences).
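The renaming rules and the value-based class definition listed above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function names `to_ontology_name` and `infer_classes` and the flat dictionary record are hypothetical and not part of any META-SHARE tooling. Only the three renamings and the Corpus definition are taken from the text.

```python
# Hypothetical sketch of the renaming rules; not the actual conversion code.

# Explicit renamings mentioned in the text.
RENAMES = {
    "resourceInfo": "LanguageResource",
    "restrictionsOfUse": "conditionsOfUse",
    "notAvailableThroughMetashare": "availableThroughOtherDistributor",
}

SUFFIX = "Info"


def to_ontology_name(xsd_name: str) -> str:
    """Map an XSD element name to the name used in the ontology."""
    # Explicit renamings take precedence over the generic suffix rule.
    if xsd_name in RENAMES:
        return RENAMES[xsd_name]
    # Generic rule: drop a trailing "Info" from wrapping elements.
    if xsd_name.endswith(SUFFIX) and len(xsd_name) > len(SUFFIX):
        return xsd_name[: -len(SUFFIX)]
    return xsd_name


def infer_classes(record: dict) -> set:
    """Mimic one direction of Corpus = exists resourceType.corpus."""
    classes = {"LanguageResource"}
    if record.get("resourceType") == "corpus":
        classes.add("Corpus")
    return classes


print(to_ontology_name("distributionInfo"))
print(sorted(infer_classes({"resourceType": "corpus"})))
```

In the ontology itself the Corpus axiom is an equivalence, so an OWL reasoner would also conclude the converse (every Corpus has the resourceType value corpus); the procedural check here covers only the direction from value to class.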
Fig. 2. An abridged example of a metadata entry represented with common metadata properties from DCAT and Dublin Core and novel properties from the META-SHARE ontology.
3.3 Interface with DCAT and Other Vocabularies
The META-SHARE model is broadly similar to DCAT, in that three of DCAT’s four main classes have a near-exact match in META-SHARE. DCAT’s dataset corresponds nearly exactly to the resourceInfo tag; similarly, distributions are similar to distributionInfo classes and catalogRecord is similar to metadataInfo. Thus, we introduced equivalent class relations between these elements. The fourth main class, catalog, covers a level not modelled by META-SHARE. DCAT uses Dublin Core properties for many parts of the metadata, and often these properties are found deeply nested in the META-SHARE description. For example, language is found in several places, nested under as many as six tags. In META-SHARE this allows different media types in the resource to have different languages, e.g., the dialogues and the scripts of a video may be in English, whereas the subtitles can be in French and German. We still include this fine-grained metadata but also add the property at the resource level to indicate whether any part of the resource is in the stated language. Similarly, some Dublin Core properties are not directly specified in the META-SHARE model but can be inferred from related properties, e.g., Dublin Core’s ‘contributor’ follows (by means of a property chain) from people indicated as ‘annotators’, ‘evaluators’, ‘recorders’ or ‘validators’. Furthermore, several DCAT-specific properties, such as ‘download URL’, are nearly exactly equivalent to those in META-SHARE but occur in places that do not fit the domain and range of the properties. In this particular case, it was a simple fix to move the property to the enclosing DistributionInfo class. Inevitably, several properties from DCAT did not have equivalents in META-SHARE, notably ‘keyword’. Figure 2 shows a simplified example of a META-SHARE metadata entry.
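The two inferences described above, deriving Dublin Core’s ‘contributor’ through a property chain and exposing nested language values at the resource level, can be sketched in Python over plain triples. This is a hedged illustration only: the property names (`annotation`, `annotator`, `hasPart`, and so on), the `ex:` identifiers and the functions are hypothetical, and treating the language lifting as a property chain is a choice made for this sketch. In the ontology such rules would be stated declaratively in OWL rather than in procedural code.

```python
# Triples are (subject, property, object) tuples; all names are illustrative.

def apply_chain(triples, first, second, target):
    """Property chain: from (a first b) and (b second c), infer (a target c)."""
    return {
        (a, target, c)
        for (a, p1, b) in triples if p1 == first
        for (b2, p2, c) in triples if p2 == second and b2 == b
    }


# Hypothetical chains: (first property, second property, inferred property).
CHAINS = [
    ("annotation", "annotator", "dc:contributor"),
    ("validation", "validator", "dc:contributor"),
    ("hasPart", "dc:language", "dc:language"),
]


def infer(triples):
    """Return the input triples plus everything one pass over CHAINS yields."""
    result = set(triples)
    for first, second, target in CHAINS:
        result |= apply_chain(triples, first, second, target)
    return result


triples = {
    ("ex:video1", "hasPart", "ex:video1#dialogue"),
    ("ex:video1", "hasPart", "ex:video1#subtitles"),
    ("ex:video1#dialogue", "dc:language", "en"),
    ("ex:video1#subtitles", "dc:language", "fr"),
    ("ex:video1#subtitles", "dc:language", "de"),
    ("ex:video1", "annotation", "ex:video1#ann"),
    ("ex:video1#ann", "annotator", "ex:alice"),
}

inferred = infer(triples)
languages = sorted(o for (s, p, o) in inferred
                   if s == "ex:video1" and p == "dc:language")
contributors = sorted(o for (s, p, o) in inferred
                      if s == "ex:video1" and p == "dc:contributor")
print(languages)      # ['de', 'en', 'fr']
print(contributors)   # ['ex:alice']
```

The fine-grained triples on the parts are kept, so a consumer that only understands Dublin Core can query the resource directly while the original nesting remains available.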
In addition to DCAT, we also used other vocabularies to establish equivalences to parts of the model. In particular, we mapped to the Friend of a Friend (FOAF) ontology to describe people and organizations and to the Semantic Web for Research Communities (SWRC) ontology to describe scientific publications.
3.4 Licensing Module
Fig. 3. An example of the modelling of licenses in a record
3.5 Harmonizing Other Resources with META-SHARE
Table 1. The top 10 most frequent component types in CLARIN and the institutes that use them. Abbreviations: MI \(=\) Meertens Institute (KNAW), MPI \(=\) Max Planck Institute (Nijmegen), BeG \(=\) Netherlands Institute for Sound and Vision, HI \(=\) Huygens Institute (KNAW), BBAW \(=\) Berlin-Brandenburg Academy of Sciences
Component root tag | Institutes | Frequency |
---|---|---|
Song | 1 (MI) | 155,403 |
Session | 1 (MPI) | 128,673 |
OLAC-DcmiTerms | 39 | 95,370 |
MODS | 1 (Utrecht) | 64,632 |
DcmiTerms | 2 (BeG,HI) | 46,160 |
SongScan | 1 (MI) | 28,448 |
media-session-profile | 1 (Munich) | 22,405 |
SourceScan | 1 (MI) | 21,256 |
Source | 1 (MI) | 16,519 |
teiHeader | 2 (BBAW, Copenhagen) | 15,998 |
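To make the lack of commonality in the table concrete, the following Python sketch sums the frequencies of components used by a single institute and compares them with the top-10 total. The figures are copied from the table above; the variable names are illustrative, and the calculation covers only these ten component types, not the whole of the CLARIN VLO.

```python
# (component root tag, number of institutes, frequency), copied from the table.
TOP_COMPONENTS = [
    ("Song", 1, 155_403),
    ("Session", 1, 128_673),
    ("OLAC-DcmiTerms", 39, 95_370),
    ("MODS", 1, 64_632),
    ("DcmiTerms", 2, 46_160),
    ("SongScan", 1, 28_448),
    ("media-session-profile", 1, 22_405),
    ("SourceScan", 1, 21_256),
    ("Source", 1, 16_519),
    ("teiHeader", 2, 15_998),
]

# Total instances across the ten most frequent component types.
total = sum(freq for _, _, freq in TOP_COMPONENTS)

# Instances whose component type is used by exactly one institute.
single = sum(freq for _, institutes, freq in TOP_COMPONENTS if institutes == 1)

share = single / total
print(f"{single:,} of {total:,} instances ({share:.1%}) "
      "use a component type found at only one institute")
```

On the table's figures this gives 437,336 of 594,864 instances, roughly 73.5%, for component types that are specific to a single institute.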
4 Applications
4.1 IULA LOD Catalogue
The IULA-UPF CLARIN Competence Centre aims to promote and support the use of technology and text analysis tools in Humanities and Social Sciences research. The centre includes a Catalogue with information on language resources and technology. The Catalogue is based on the initial linked open data (LOD) version of the META-SHARE model as described in [22] and the original data generated from the UPF META-SHARE node. The source XML records were converted into RDF and augmented with service descriptions (not included in the UPF META-SHARE node) and relevant documentation (appropriate articles, documentation, sample data and results, illustrative experiments, examples from outstanding projects, illustrative use cases, etc.) to encourage potential users to embrace digital tools. Finally, the data was enriched with internal and external links. The resulting linked data maximised the information contained in the original repository and enabled data mashup techniques that retrieve relevant data from DBpedia and DBLP. The catalogue demonstrates the benefits of the LOD framework and how it can easily serve as the basis for a web browser application that maximizes information and helps users navigate the datasets in a comprehensive way.
4.2 Linghub
Fig. 4. META-SHARE data as displayed within the Linghub interface
The Linghub portal is thus a proof-of-concept for the level of harmonization that the use of a common ontology provides, as metadata originating from different repositories can be uniformly queried in Linghub in an integrated fashion. We adhere to an open architecture in which not only Linghub but other discovery services that aggregate and index data could potentially be developed.
5 Conclusion
This work represents only a starting point for the harmonization of language resources: it provides a standard ontology that can be used to describe the metadata of linguistic resources, and a number of challenges remain to be addressed. Firstly, the next step is to make sure that not only the metadata but also the actual data is available on the Web in open web standards such as RDF, so that data can be automatically crawled and analyzed. Secondly, linguistic data published on the Web should ideally follow the same format (e.g. RDF) so that it can be easily integrated and queried across datasets. This presupposes agreement on best practices for data publication and formats, and the Natural Language Processing Interchange Format (NIF) [12] is an obvious candidate for that. Thirdly, harmonization should be extended to the description of NLP services so that NLP services can be distributed across providers and repositories. The mechanisms for describing the functionality of NLP services should be extremely light-weight. Finally, input and output formats for services should be standardized and homogenized so that services can be easily composed to realize more complex workflows, without relying on too much parametrization. Workflows of services should be easily executable ‘on the cloud’. In order to scale, services should support parallelization, streaming and non-centralized processing. We believe that the development of common vocabularies such as the one presented in this paper should enable the emergence of a new paradigm supporting the discovery and exploitation of linguistic data and services across repositories.
Acknowledgments
We are very grateful to the members of the W3C Linked Data for Language Technologies (LD4LT) for all the useful feedback received and for allowing this initiative to be developed as an activity of the group. This work is supported by the FP7 European project LIDER (610782), by the Spanish Ministry of Economy and Competitiveness (project TIN2013-46238-C4-2-R and a Juan de la Cierva grant), the Greek CLARIN Attiki project (MIS 441451) and the H2020 project CRACKER (645357).
References
- 1. Bird, S., Simons, G.: The OLAC metadata set and controlled vocabularies. In: Proceedings of the ACL 2001 Workshop on Sharing Tools and Resources, vol. 15, pp. 7–18. Association for Computational Linguistics (2001)
- 2. Broeder, D., Kemps-Snijders, M., Van Uytvanck, D., Windhouwer, M., Withers, P., Wittenburg, P., Zinn, C.: A data category registry- and component-based metadata framework. In: Proceedings of the Seventh Conference on International Language Resources and Evaluation, pp. 43–47 (2010)
- 3. Broeder, D., Offenga, F., Willems, D., Wittenburg, P.: The IMDI metadata set, its tools and accessible linguistic databases. In: Proceedings of the IRCS Workshop on Linguistic Databases, pp. 11–13 (2001)
- 4. Broeder, D., Windhouwer, M., Van Uytvanck, D., Goosen, T., Trippel, T.: CMDI: a component metadata infrastructure. In: Describing LRs with Metadata: Towards Flexibility and Interoperability in the Documentation of LR, pp. 1–4 (2012)
- 5. Calzolari, N., Del Gratta, R., Francopoulo, G., Mariani, J., Rubino, F., Russo, I., Soria, C.: The LRE map. Harmonising community descriptions of resources. In: Proceedings of the Eighth Conference on International Language Resources and Evaluation, pp. 1084–1089 (2012)
- 6. Chiarcos, C.: Ontologies of linguistic annotation: Survey and perspectives. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation, pp. 303–310 (2012)
- 7. Chiarcos, C., McCrae, J., Cimiano, P., Fellbaum, C.: Towards open data for linguistics: linguistic linked data. In: Oltramari, A., Vossen, P., Qin, L., Hovy, E. (eds.) New Trends of Research in Ontologies and Lexical Resources: Ideas, Projects, Systems, pp. 7–25. Springer, Heidelberg (2013)
- 8. Cieri, C., Choukri, K., Calzolari, N., Langendoen, D.T., Leveling, J., Palmer, M., Ide, N., Pustejovsky, J.: A road map for interoperable language resource metadata. In: Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., Tapias, D. (eds.) Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10). European Language Resources Association (ELRA), Valletta, Malta, May 2010
- 9. Farrar, S., Lewis, W., Langendoen, T.: A common ontology for linguistic concepts. In: Proceedings of the Knowledge Technologies Conference, pp. 10–13 (2002)
- 10. Gartner, R.: MODS: Metadata object description schema. JISC Techwatch report TSW, pp. 3–6 (2003)
- 11. Gavrilidou, M., Labropoulou, P., Desipri, E., Piperidis, S., Papageorgiou, H., Monachini, M., Frontini, F., Declerck, T., Francopoulo, G., Arranz, V., Mapelli, V.: The META-SHARE metadata schema for the description of language resources. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation, pp. 1090–1097 (2012)
- 12. Hellmann, S., Lehmann, J., Auer, S., Brümmer, M.: Integrating NLP using linked data. In: Alani, H., et al. (eds.) The Semantic Web – ISWC 2013. LNCS, vol. 8219, pp. 98–113. Springer, Heidelberg (2013)
- 13. Ide, N.: Corpus encoding standard: SGML guidelines for encoding linguistic corpora. In: Proceedings of the First International Language Resources and Evaluation Conference, pp. 463–470 (1998)
- 14. Ide, N., Véronis, J. (eds.): Text Encoding Initiative: Background and Contexts. Springer, Heidelberg (1995)
- 15. Kemps-Snijders, M., Windhouwer, M., Wittenburg, P., Wright, S.E.: ISOcat: corralling data categories in the wild. In: Proceedings of the Seventh Conference on International Language Resources and Evaluation (2008)
- 16. Maali, F., Erickson, J., Archer, P.: Data catalog vocabulary (DCAT). W3C recommendation, The World Wide Web Consortium (2014)
- 17. Motik, B., Patel-Schneider, P.F., Parsia, B., Bock, C., Fokoue, A., Haase, P., Hoekstra, R., Horrocks, I., Ruttenberg, A., Sattler, U., Smith, M.: OWL 2 web ontology language structural specification and functional-style syntax. W3C recommendation, The World Wide Web Consortium (2012)
- 18. Piperidis, S.: The META-SHARE language resources sharing infrastructure: principles, challenges, solutions. In: Proceedings of the Eighth Conference on International Language Resources and Evaluation, pp. 36–42 (2012)
- 19. Rodriguez-Doncel, V., Villata, S., Gomez-Perez, A.: A dataset of RDF licenses. In: Proceedings of the 27th International Conference on Legal Knowledge and Information System (JURIX), pp. 187–189 (2014)
- 20. Soria, C., Calzolari, N., Monachini, M., Quochi, V., Bel, N., Choukri, K., Mariani, J., Odijk, J., Piperidis, S.: The language resource strategic agenda: the flarenet synthesis of community recommendations. Lang. Resour. Eval. 48(4), 753–775 (2014). http://dx.doi.org/10.1007/s10579-014-9279-y
- 21. Ďurčo, M., Windhouwer, M.: From CLARIN component metadata to linked open data. In: Proceedings of the 3rd Workshop on Linked Data in Linguistics, pp. 13–17 (2014)
- 22. Villegas, M., Melero, M., Bel, N.: Metadata as linked open data: mapping disparate XML metadata registries into one RDF/OWL registry. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation, pp. 393–400 (2014)
Copyright information
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.