
1 Introduction

Linked Data is a set of best practices for publishing and interlinking structured data on the Web [1]. Linked Data employs Web technologies, such as HTTP, RDF, and URIs, to identify entities from various domains and connect them through typed links, thus building a Web of machine-readable data rather than human-readable documents. Controlled vocabularies and ontologies are the means by which organizations and communities from different disciplines formalize entities and their relations.

The Semantic Web, also called the Web of Data, is a constantly growing dataspace.Footnote 1 Besides the simple collection of data, the Semantic Web approach includes the provision of relationships between the data. “This collection of interrelated datasets on the Web can also be referred to as Linked Data”.Footnote 2 Semantic Web standards, such as RDF [2], OWL [3], and SPARQL [4], have been developed to describe semantic information on the Web, including the relationships between data and concepts, thereby providing the basis for Linked Data.

Regarding bioeconomy, the main topic of this book, the Semantic Web is a useful technology for integrating and publishing heterogeneous data—see also Section 7.6, “Enterprise Linked Data” below. This enables better querying and analysis of bioeconomy data and processes.

Linked Data, which started as an initiativeFootnote 3 of Tim Berners-Lee (the inventor of the World Wide Web), has increasingly become one of the most popular methods for publishing data on the Web. There are different reasons for this: on the one hand, it defines simple principles for publishing and interlinking structured data that is accessible by both humans and machines, enabling interoperability and information exchange [5]. For instance, improving data accessibility lowers the barriers to finding and reusing the data, while providing machine-readable data facilitates its integration into different applications. On the other hand, Linked Data makes it possible to discover more useful data through the connections with other datasets, and to exploit it in a more useful way through inferencing, semantic queries and rules. The term “Semantic Web” refers to W3C’s vision of the Web of linked data. Semantic Web technologies enable people to create data stores on the Web, build vocabularies, and write rules for handling data. Linked Data is empowered by technologies such as RDF, SPARQL, OWL, and SKOS. As a result, a growing number of datasets are becoming available in Linked Data format, as depicted in the Linked Open Data (LOD) cloudFootnote 4 diagram (Fig. 7.1). The widespread use of and interest in Linked Data has also resulted in the creation of guidelines and best practices on how to generate and publish it, as discussed later in this chapter.

Fig. 7.1 The Linked Open Data cloud diagram

Linked Data can be used and applied in virtually any application domain (as depicted in Fig. 7.1). It consists of both application data and data about other data or resources (metadata). In fact, Linked Data incorporates human- and machine-readable metadata along with it, making it self-describing [6]. Moreover, RDF, the underlying standard for Linked Data interchange and querying, was originally developed in the 1990s with an emphasis on the representation of metadata about Web resources; later, however, the vision of the Semantic Web was extended to the representation of semantic information in general, beyond simple RDF descriptions and Web documents as primary subjects of such descriptions [5], which provided the ground for the creation of the Linked Data initiative later on.

In the following, we discuss metadata in more detail, with a focus on agriculture and other bio-sectors, followed by more technical information on Linked Data and related best practices. Next, we present different usage scenarios and experiences of using Linked Data in DataBio.

2 Metadata

Metadata is, as its name implies, data about data. It describes the properties of a dataset or resource. Metadata can cover various types of information, which, according to [7], can be coarsely grouped into three categories: (i) descriptive metadata, which includes elements such as the title, abstract, author, and keywords, and is mostly used to discover and identify a dataset or another resource; (ii) structural metadata, which indicates how compound objects are put together (logical or physical relationships between objects and their parts); and (iii) administrative metadata, with elements such as the license, intellectual property rights, when and how the dataset was created, who can access it, etc. Datasets in agriculture are either added locally by a user, harvested from existing data portals, or fetched from operational systems or IoT ecosystems. The definition of a set of metadata elements is necessary to allow the identification of the vast amount of information resources for which metadata is created, their classification, and the identification of their geographic location and temporal reference, quality and validity, conformity with implementing rules on the interoperability of spatial data sets and services, constraints related to access and use, and the organization responsible for the resource.

Metadata of datasets and dataset series (particularly relevant for agriculture are the EO products derived from satellite imagery) should adhere to the INSPIRE Metadata RegulationFootnote 5, with added theme-specific metadata elements for the agriculture, forestry and fishery domains where necessary. This approach ensures that metadata created for the datasets, dataset series and services is compliant with the INSPIRE requirements as well as with international standards.Footnote 6, Footnote 7, Footnote 8 In addition, INSPIRE-conformant metadata may also be expressed through the DCAT Application Profile,Footnote 9 which defines a minimum set of metadata elements to ensure cross-domain and cross-border interoperability between metadata schemas used in European data portals. Such a mapping could support the inclusion of INSPIRE metadataFootnote 10 in the Pan-European Open Data PortalFootnote 11 for wider discovery across sectors beyond the geospatial domain.

In DCAT, a Distribution represents a way in which the data is made available. DCAT is a rather small vocabulary, which strategically leaves many details open, as it welcomes “application profiles”: more specific specifications built on top of DCATFootnote 12, e.g., GeoDCAT-APFootnote 13 as a geospatial extension. For sensors there is also SensorMLFootnote 14, a standard that can be used to describe a wide range of sensors, including both dynamic and stationary platforms and both in situ and remote sensors. Another possibility is the Semantic Sensor Network OntologyFootnote 15, which describes sensors and observations, and related concepts. It does not describe domain concepts, time, locations, etc.; these are intended to be included from other ontologies via OWL imports. This ontology was developed by the W3C Semantic Sensor Networks Incubator Group (SSN-XG).Footnote 16
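To make this concrete, the following minimal sketch shows how a dataset and one of its distributions could be described with DCAT and Dublin Core terms, covering descriptive elements (title, keyword) as well as administrative ones (license, issue date). It assumes the Python rdflib library; all URIs, titles and values are illustrative placeholders rather than actual DataBio metadata.

```python
# A minimal, illustrative DCAT description of a dataset and one distribution.
# Assumes the Python rdflib library; all URIs and values are placeholders.
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import DCAT, DCTERMS, RDF, XSD

g = Graph()

dataset = URIRef("http://example.org/dataset/crop-yield-2020")
distribution = URIRef("http://example.org/dataset/crop-yield-2020/csv")

# Descriptive metadata (title, keyword) and administrative metadata (license, issue date).
g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Crop yield observations 2020", lang="en")))
g.add((dataset, DCAT.keyword, Literal("agriculture")))
g.add((dataset, DCTERMS.license, URIRef("http://creativecommons.org/licenses/by/4.0/")))
g.add((dataset, DCTERMS.issued, Literal("2020-05-01", datatype=XSD.date)))

# A Distribution describes one way in which the data is made available.
g.add((dataset, DCAT.distribution, distribution))
g.add((distribution, RDF.type, DCAT.Distribution))
g.add((distribution, DCAT.downloadURL, URIRef("http://example.org/files/crop-yield-2020.csv")))
g.add((distribution, DCAT.mediaType, Literal("text/csv")))

print(g.serialize(format="turtle"))
```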

There is a need for metadata harmonization of the spatial and non-spatial datasets and services. GeoDCAT-AP is an obvious choice due to its strong focus on geographic datasets. The main advantage is that it enables users to query all geospatial datasets in a uniform way. GeoDCAT-AP is still very new, and the implementation of the new standard can provide feedback to OGC, W3C and JRC from both a technical and an end-user point of view. Several software components with varying support for GeoDCAT-AP are available in the DataBio architecture, namely MickaFootnote 17, CKANFootnote 18 [3], FedEO Gateway & CatalogFootnote 19, and GeoNetworkFootnote 20 [4]. For the DataBio purposes we also had to integrate the Semantic Sensor Network Ontology and SensorML.

For enabling compatibility with COPERNICUSFootnote 21, INSPIREFootnote 22, and GEOSSFootnote 23, the DataBio project made three extensions: (i) a module for extended harvesting of INSPIRE metadata to DCAT, based on XSLT and easy configuration; (ii) a module for user-friendly visualisation of INSPIRE metadata in CKAN; and (iii) a module to output metadata in GeoDCAT-AP or SensorDCAT, respectively. DataBio used the Micka and CKAN systems. Micka is a complex system for metadata management used for building Spatial Data Infrastructure (SDI) and geoportal solutions. It contains tools for editing and managing metadata of spatial data and services, as well as of other sources (documents, websites, etc.). Micka also fully supports GeoDCAT-AP and OpenSearch. CKAN supports DCAT to import or export its datasets. CKAN enables harvesting data from OGC:CSW catalogues, but not all mandatory INSPIRE metadata elements are supported. Unfortunately, the DCAT output does not fulfil all INSPIRE requirements, nor is GeoDCAT-AP fully supported.

For data identification, naming, and search keywords we used the INSPIRE data registry.Footnote 24 The INSPIRE infrastructure involves a number of items, which require clear descriptions and the possibility to be referenced through unique identifiers. Examples of such items include INSPIRE themes, code lists, application schemas or discovery services. Registers provide a means to assign identifiers to items and their labels, definitions and descriptions (in different languages). The INSPIRE Registry is a service giving access to INSPIRE semantic assets (e.g. application schemas, code-lists, themes), and assigning to each of them a persistent URI. As such, this service can be considered also as a metadata directory/catalogue for INSPIRE, as well as a registry for the INSPIRE “terminology”. Starting from June 2013, when the INSPIRE Registry was first published, several versions have been released, implementing new features based on the community’s feedback.

Also important is data lineage, which refers to the sources of information, such as entities and processes, involved in producing or delivering an artifact. Data lineage records the derivation history of a data product. This history can include the algorithms used, the process steps taken, the computing environment used, the data sources input to the processes, the organization or person responsible for the product, etc. Such provenance information helps data users determine the usability and reliability of the product. In the science domain, data provenance is especially important, since scientists need this information to determine the scientific validity of a data product and to decide whether such a product can be used as the basis for further scientific analysis.
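As an illustration, the following minimal sketch records a simple lineage chain with the W3C PROV-O vocabulary, which is one possible way to express such information (the Open Provenance Model mentioned later in this chapter is another). It assumes the Python rdflib library; all URIs and values are placeholders.

```python
# A minimal, illustrative lineage record using the W3C PROV-O vocabulary.
# Assumes the Python rdflib library; all URIs and values are placeholders.
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import PROV, RDF, XSD

g = Graph()

product = URIRef("http://example.org/data/yield-map-2020")      # the derived data product
source = URIRef("http://example.org/data/sentinel2-scene-123")  # an input data source
process = URIRef("http://example.org/process/ndvi-run-42")      # the processing activity
org = URIRef("http://example.org/org/pilot-partner")            # responsible organization

# The product, where it came from and how it was generated.
g.add((product, RDF.type, PROV.Entity))
g.add((product, PROV.wasDerivedFrom, source))
g.add((product, PROV.wasGeneratedBy, process))
g.add((product, PROV.wasAttributedTo, org))

# The processing activity: inputs, responsible party and completion time.
g.add((process, RDF.type, PROV.Activity))
g.add((process, PROV.used, source))
g.add((process, PROV.wasAssociatedWith, org))
g.add((process, PROV.endedAtTime, Literal("2020-06-15T12:00:00Z", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))
```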

3 Linked Data

As noted above, Linked Data refers to a set of best practices for publishing and interlinking structured data, thereby enabling it to be accessed by both humans and machines. Data interchange follows the RDF family of standards, and SPARQL is used for querying. In particular, the key concepts and technologies that support Linked Data are:

  • Any concept or entity can be identified by assigning a specific Uniform Resource Identifier (URI) to it.

  • HTTP for retrieving resources or their descriptions.

  • RDF, a generic graph-based data model used for structuring and linking data that describes real-world concepts or entities.

  • SPARQL, the standard RDF query language.

More in detail, RDF expresses data as triples of the form <subject, predicate, object>. A triple encodes the relation of the object to the subject through the predicate. The subject is a URI, or more generally an Internationalized Resource Identifier (IRI), which, as specified above, identifies a resource or a concept; the object may be either a literal (e.g., a number, string, or date) or a URI that references another resource. Triples that interlink resources constitute RDF links, which build the Web of Data.
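The following minimal sketch, assuming the Python rdflib library and an illustrative example namespace, shows how such triples can be created and then queried with SPARQL.

```python
# A small illustrative RDF graph and a SPARQL query over it.
# Assumes the Python rdflib library; the namespace and resources are placeholders.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/farm/")

g = Graph()
g.bind("ex", EX)

# Triples of the form <subject, predicate, object>.
g.add((EX.plot1, RDF.type, EX.Plot))
g.add((EX.plot1, RDFS.label, Literal("Plot 1")))
g.add((EX.plot1, EX.hasCrop, EX.wheat))            # object is another resource (an RDF link)
g.add((EX.plot1, EX.areaHectares, Literal(12.5)))  # object is a literal

# SPARQL query: find all plots and their crops.
results = g.query(
    """
    PREFIX ex: <http://example.org/farm/>
    SELECT ?plot ?crop
    WHERE {
        ?plot a ex:Plot ;
              ex:hasCrop ?crop .
    }
    """
)
for plot, crop in results:
    print(plot, crop)
```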

4 Linked Data Best Practices

The growing popularity of Linked Data has led to the definition of more detailed guidelines for the development and delivery of (open) data as Linked Data. For instance, for open government data (also applicable to LOD and the bioeconomy sector), the following best practices are recommended [8]:

  • To prepare the stakeholders by explaining the process of creating and maintaining the Linked Data.

  • To select a dataset which can be reused by others.

  • To model the data as data objects and their relations in an application-independent way.

  • To specify an appropriate license to ease data reuse, by declaring the origin, ownership and conditions that apply to reuse of the open data.

  • To use a well-considered URI naming strategy and implementation plan, based on HTTP URIs.

  • To describe the objects with previously defined standard vocabularies, extending them where needed.

  • To convert data to a Linked Data representation by scripting or other automated processes.

  • To provide machine access to the data, so that it can be found by search engines and other automated processes using standard Web mechanisms.

  • To announce new datasets on authoritative domains, thereby initiating an implicit social contract.

  • To maintain the data once published.

It is important to note that although these best practices were conceived for open government data, they can be applied in most cases to many other domains.

To help prepare stakeholders, there are at least three well-known lifecycle models (Hyland et al. [8], Hausenblas [9], Villazón-Terrazas et al. [10]) describing the process for publishing Linked Data. All of these models identify the common needs of specifying, modelling and publishing data in the standard open Web format (https://www.databio.eu/wp-content/uploads/2017/05/DataBio_D4.3-Data-sets-formats-and-models_public-version.pdf). However, even though all of the models deal with similar tasks, there are some differences between them. To discuss these tasks in more detail, we focus on one of the models, as their roles are similar and complementary. Specifically, Villazón-Terrazas et al. [10] define the following sub-tasks for each step:

  • Specification:

    • Identification and analysis of the data sources, by opening and publishing data that has not yet been opened up and published, and by reusing or leveraging data that has already been opened or published by others. This may require contacting specific data owners to get access to their legacy data.

    • Design of URIs, using meaningful URIs rather than opaque ones whenever possible. It is important to separate TBox (ontology model) URIs from ABox (instance) URIs.

    • Definition of the license of the data sources. It is also possible to reuse and apply an existing license.

  • Modelling:

    • Ontologies are ideally expressed in OWL or RDF(S), both being based on RDF.

    • Reusing the existing and available vocabularies.

    • Reusing available non-ontological resources, such as highly reliable websites, domain-related sites, government catalogs, etc.

  • Generation:

    • Transformation of the data sources selected in the specification activity into RDF, according to the vocabulary created in the modelling activity, using tools that convert source formats such as CSV and spreadsheets, relational databases (RDB) or XML (see the sketch after this list).

    • Data cleansing, which involves finding and fixing the possible errors specified in Hogan et al., including HTTP-level issues, such as accessibility and dereferenceability, and reasoning issues, such as namespaces without vocabulary and malformed or incompatible datatypes.

    • Linking with suitable datasets, discovering suitable relationships between the data items, and validating the relationships discovered.

  • Publishing:

    • Dataset publication, using tools for storing RDF (e.g. Virtuoso Universal Server, Jena, Sesame, 4store, YARS, OWLIM) that provide a SPARQL endpoint, together with a Linked Data front end (e.g. Pubby, Talis Platform, Fuseki).

    • Metadata publication, using VoID, which allows expressing metadata about RDF datasets, and the Open Provenance Model (OPM).

    • Dataset discovery, by registering the datasets in the CKAN registry and generating sitemap files for the dataset using sitemap4rdf.Footnote 25

  • Exploitation, the final step in the Linked Data publication workflow, refers to the application and exploitation of Linked Data for various purposes and applications across different platforms.
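The sketch below illustrates the generation step referenced above: transforming a tabular (CSV) source into RDF according to a chosen vocabulary. It assumes the Python csv and rdflib libraries; the file name, column names and vocabulary URIs are illustrative placeholders, not DataBio artifacts.

```python
# Sketch of the generation step: transforming a CSV source into RDF.
# Assumes the Python csv and rdflib libraries; the file name, column names
# and vocabulary URIs are illustrative placeholders, not DataBio artifacts.
import csv

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

VOCAB = Namespace("http://example.org/vocab#")    # placeholder vocabulary (TBox) URIs
DATA = Namespace("http://example.org/resource/")  # placeholder instance (ABox) URIs

g = Graph()
g.bind("ex", VOCAB)

# Each CSV row becomes one resource described by a handful of triples.
with open("plots.csv", newline="") as f:
    for row in csv.DictReader(f):  # expected columns: id, crop, area_ha
        plot = DATA[f"plot/{row['id']}"]
        g.add((plot, RDF.type, VOCAB.Plot))
        g.add((plot, VOCAB.crop, Literal(row["crop"])))
        g.add((plot, VOCAB.areaHectares, Literal(row["area_ha"], datatype=XSD.decimal)))

# Serialize the result so it can be loaded into an RDF store and published.
g.serialize(destination="plots.ttl", format="turtle")
```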

5 The Linked Open Data (LOD) Cloud

The LOD cloud comprises 1,255 datasets with 16,174 links (as of May 2020). Nevertheless, although large cross-domain datasets exist (like DBpediaFootnote 26 or WikidataFootnote 27) and some domains are well covered, like geography, government, and bioinformatics, this is still not the case for all domains. For instance, in the agriculture domain we can find relevant thesauri like AGROVOCFootnote 28 from FAOFootnote 29, or the National Agricultural Library’s Agricultural Thesaurus (NALT),Footnote 30 but there is still a lack of datasets related to agricultural facilities and farm management activities. A similar situation occurs in the fishery domain, where only some taxonomies for specific types of fish or regions are available, but no catch data exists, including, for example, locations, quantities, values, equipment used, vessels used, etc. This is also true in the forestry domain, where almost no specific Linked Open Data is available. This is in part due to the lack of standardized models for the representation of such data, even though some efforts in this direction have been made in the past, as discussed below.

The FOODIE project,Footnote 31 for instance, addressed this issue for the agriculture domain with the development of the FOODIE data modelFootnote 32 [11], which was reused and extended in the DataBio project. To ensure the maximum degree of data interoperability, the model is based on the INSPIRE generic data models, especially the data model for Agricultural and Aquaculture Facilities (AF), which it extends and specializes. In particular, the model was created based on AF version 1.0, where it was found that the INSPIRE AF modelFootnote 33 lacked a concept for an entity of finer granularity than Site. The key motivation was to represent a continuous area of agricultural land with one type of crop species, cultivated by one user using one farming mode (conventional vs. transitional vs. organic farming). Such a concept is called Plot and represents the main element in the model, especially because it is the level to which the majority of agro data is related. One level lower than Plot is the ManagementZone, which enables a more precise description of the land characteristics in fine-grained areas. Additionally, the FOODIE model includes concepts for crop and soil data, treatments, interventions, agricultural machinery, etc. Furthermore, the model reuses data types defined in ISO standards (ISO 19101, ISO/TS 19103, ISO 8601 and ISO 19115) as well as standardization efforts published under the INSPIRE DirectiveFootnote 34 (like the structure of unique identifiers). The model was discussed with several experts from various institutions, such as the Directorate General Joint Research Centre (DG JRC) of the EU Commission, the EU Global Navigation Satellite Systems Agency (GSA), the Czech Ministry of Agriculture, the Global Earth Observation System of Systems (GEOSS), and the German Kuratorium für Technik und Bauwesen in der Landwirtschaft (KTBL). The FOODIE data model was specified in the Unified Modeling Language (UML), like the INSPIRE models; [12] describes the process followed to transform this model into an OWL ontology in order to enable the publication of linked agricultural data. The FOODIE ontology follows a modular approach: while the core ontology includes all elements common to different applications, the ontology can be further specialized with profiles for particular application or country needs. In the DataBio project, for example, the FOODIE ontology was reused in several agriculture pilots, which resulted in the addition of several new elements to the core and in the creation of extensions for the specific needs of the pilots.
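As a rough illustration of the structure described above, the following sketch describes a Plot with one contained ManagementZone. It assumes the Python rdflib library and uses a placeholder namespace and placeholder property names rather than the official FOODIE ontology terms.

```python
# Illustrative description of a Plot with one ManagementZone, loosely following
# the structure described above. The namespace and property names below are
# placeholders, not the official FOODIE ontology terms.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

FOODIE = Namespace("http://example.org/foodie#")  # placeholder namespace
DATA = Namespace("http://example.org/resource/")

g = Graph()
g.bind("foodie", FOODIE)

plot = DATA["plot/42"]
zone = DATA["zone/42-1"]

# A Plot: a continuous area with one crop species, one user and one farming mode.
g.add((plot, RDF.type, FOODIE.Plot))
g.add((plot, RDFS.label, Literal("Plot 42")))
g.add((plot, FOODIE.cropSpecies, Literal("winter wheat")))
g.add((plot, FOODIE.farmingMode, Literal("organic")))

# A ManagementZone: a finer-grained area within the plot.
g.add((zone, RDF.type, FOODIE.ManagementZone))
g.add((plot, FOODIE.containsManagementZone, zone))  # placeholder property name

print(g.serialize(format="turtle"))
```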

Regarding the fishery domain, there have also been some previous efforts to fill this standard-model gap. For instance, in the NeOn project, FAO produced a network of fisheries ontologiesFootnote 35 that included a catch record pattern, water areas (e.g. FAO division areas), species taxonomic classifications, fisheries commodities, vessel classifications, gear classifications, etc. Unfortunately, the work did not continue and many of these resources are no longer available. Nevertheless, in the DataBio project, some of these resources were reused when possible (e.g. the catch pattern and species taxonomy), some others were re-created with further detail (e.g. water areas), and some new extensions were created to cover specific pilot needs, in order to publish linked fishery data from them.

6 Enterprise Linked Data (LED)

Although Linked Data is mostly known and used to publish open data, and to link different open datasets, the underlying technologies and approach can also be applied in a (partially) closed setting, e.g. an enterprise, where potentially some data cannot be made openly available. This is especially relevant for all sectors of the bioeconomy with sensitive and geo-based data. In fact, even if the enterprise data remains closed, or accessible only via access control mechanisms to selected parties, it can still be linked with open data and gain all the benefits of doing so.

According to [11], Linked Enterprise Data (LED) meshes all enterprise data (e.g. structured records, documents or office files), wherever it comes from, to create a global and unified information space from which new business information is created to solve operational needs. Hence, it federates the content of heterogeneous silos by interconnecting the data and creates a unified and coherent warehouse, called an information hub, that exposes and shares new knowledge objects [13]. Moreover, as it follows the same standards, links can be established with other datasets, either internal or external (e.g. the LOD cloud).

In order to restrict access to internal data, LED must be used in combination with access control mechanisms enabling compliance with privacy and security constraints, as described in the next section. Regarding the security of the stored RDF data, one of the most typical approaches to controlling access is to use different RDF graphs for the restricted datasets. An RDF graph is a set of RDF triples, normally identified by an IRI, which can be assigned different access control policies.
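The following sketch illustrates this idea of partitioning data into named graphs, to which a triple store can then attach different access control policies. It assumes the Python rdflib library; the graph IRIs and data are placeholders.

```python
# Sketch of partitioning data into named graphs, to which a triple store can
# attach different access control policies. Assumes the Python rdflib library;
# graph IRIs and data are placeholders.
from rdflib import Dataset, Literal, Namespace, URIRef
from rdflib.namespace import RDFS

EX = Namespace("http://example.org/")

ds = Dataset()

# One graph intended for publication and one restricted, internal graph.
public_graph = ds.graph(URIRef("http://example.org/graph/public"))
internal_graph = ds.graph(URIRef("http://example.org/graph/internal"))

public_graph.add((EX.plot1, RDFS.label, Literal("Plot 1")))
internal_graph.add((EX.plot1, EX.ownerName, Literal("Jane Farmer")))

# A query can be scoped to a single named graph.
results = ds.query(
    """
    SELECT ?s ?p ?o
    WHERE { GRAPH <http://example.org/graph/public> { ?s ?p ?o } }
    """
)
for row in results:
    print(row)
```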

For instance, Virtuoso, the RDF store used in the DataBio project, features SPARQL endpoints: Web services capable of providing more than read-only access to back-end graphs. Even though they are commonly general-purpose, SPARQL endpoints can also be purpose-specific, and their privileges may therefore be limited to specific Create, Read, Update, and/or Delete operations. The privileges provided by a given Virtuoso SPARQL endpoint may be based simply upon the endpoint’s URL, or upon sophisticated rules which associate specific user identities with specific database roles and privileges. Virtuoso offers three methods for securing SPARQL endpoints:

  • Digest Authentication via SQL Accounts

  • OAuth Protocol based Authentication

  • WebID Protocol based Authentication.

In the DataBio project, the first method was tested in order to restrict access to some of the pilot datasets. In particular, the process of setting up a secure Virtuoso SPARQL endpoint using Digest Authentication via SQL Accounts is as follows (a query sketch against such a secured endpoint is given after the steps):

  • Step 1: Create a user for a data graph.

  • Step 2: Assign the user to a specific user group with a specific role. The user should become a member of an appropriate group (e.g. SPARQL_SELECT, SPARQL_SPONGE, or SPARQL_UPDATE) in order to start using its graph-level privileges.

  • Step 3: Since some graphs are supposed to be confidential, first restrict the whole triple store by setting the overall graph-store permission.

  • Step 4: Set basic privileges for specific users, so that those users do not have global access to the graphs.

  • Step 5: Grant specific privileges on specific graphs to specific users:

    • The user can read from, but not write to, the personal system data graph.

    • The user can both read from and write to the personal system data graph.

    • Grant specific privileges on specific graphs to the public, where the graphs (e.g. dbpedia.org) are intended for public consumption, allowing:

      • READ but not WRITE;

      • READ and WRITE.
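To round off, the following sketch shows how a client could query a SPARQL endpoint secured with digest authentication, as in the Virtuoso setup outlined above. It assumes the Python SPARQLWrapper library; the endpoint URL, credentials and graph IRI are placeholders.

```python
# Sketch of querying a SPARQL endpoint secured with digest authentication,
# as in the Virtuoso setup outlined above. Assumes the Python SPARQLWrapper
# library; the endpoint URL, credentials and graph IRI are placeholders.
from SPARQLWrapper import DIGEST, JSON, SPARQLWrapper

endpoint = SPARQLWrapper("https://example.org/sparql-auth")  # placeholder endpoint URL
endpoint.setHTTPAuth(DIGEST)
endpoint.setCredentials("pilot_user", "pilot_password")      # placeholder SQL account
endpoint.setReturnFormat(JSON)

# Read from a restricted graph that this account has been granted access to.
endpoint.setQuery(
    """
    SELECT ?s ?p ?o
    FROM <http://example.org/graph/internal>
    WHERE { ?s ?p ?o }
    LIMIT 10
    """
)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["s"]["value"], binding["p"]["value"], binding["o"]["value"])
```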