Linked Data Usages in DataBio

One of the main goals of DataBio was the provision of solutions for big data management enabling, among others, the harmonisation and integration of a large variety of data generated and collected through various applications, services and devices. The DataBio approach to delivering such capabilities was based on the use of Linked Data as a federated layer to provide an integrated view over (initially) disconnected and heterogeneous datasets. The large number of data sources, ranging from mostly static to highly dynamic, led to the design and implementation of Linked Data Pipelines. The goal of these pipelines is to automate, as much as possible, the process of transforming and publishing different input datasets as Linked Data. In this chapter, we describe these pipelines and how they were applied to support different use cases in the project, including the tools and methods used to implement them.


Introduction
Linked Data has been extensively used in the DataBio project as a federated layer to support large-scale harmonization and integration of a large variety of data collected from various heterogeneous sources and to provide an integrated view of them. Accordingly, as part of the project, we generated a large number of linked datasets. In fact, the triplestore populated during the course of DataBio with Linked Data has over 1 billion triples, making it one of the largest semantic repositories related to agriculture. The dataset has been recognized by the EC Innovation Radar as an 'arable farming data integrator for smart farming.' In addition, we have deployed different endpoints providing access to some dynamic data sources in their native format as Linked Data by providing a virtual semantic layer on top of them.
Given the huge number of data sources and data formats addressed during the course of DataBio, such a layer was realized through the implementation of instantiations of a 'Generic Pipeline for the Publication and Integration of Linked Data,' which have been applied in different use cases related to the bioeconomy sectors. The main goal of these pipeline instances is to define and deploy (semi-)automatic processes that carry out the necessary steps to transform and publish different input datasets as Linked Data. Accordingly, they connect different data processing components to carry out the transformation of data into RDF [1] format, or the translation of queries to/from SPARQL [2] and the native data access interface, plus their linking, and include the mapping specifications to process the input datasets. Each pipeline instance is configured to support specific input dataset types (same format, model and delivery form), and they are created with the following general principles in mind:
• Pipelines must be directly re-executable and re-applicable (e.g., to extended/updated datasets).
• Pipelines must be easily reusable.
• Pipelines must be easily adaptable for new input datasets.
• Pipeline execution should be as automatic as possible; the final target is fully automated processes.
• Pipelines should support both (mostly) static data and data streams (e.g., sensor data).
Most of the Linked Data Publication pipeline instances discussed in this chapter perform the transformation and publication of agricultural data as Linked Data; however, there are also some pipelines that are focused on fishery data or on providing access to geospatial datasets metadata as Linked Data. The ultimate target is to query and access different heterogeneous data sources via an integrated layer, in compliance with any privacy and access control needs.
A high-level view of the end-to-end flow of the generic pipeline, aligned with the top-level DataBio generic pipeline, is depicted in Fig. 8.1. Following the best practices and guidelines for Linked Data publication [3,4], these pipelines (i) take as input selected datasets that are collected from heterogeneous sources (shapefiles, GeoJSON, CSV, relational databases, RESTful APIs), (ii) curate and/or preprocess the datasets when needed, (iii) select and/or create/extend the vocabularies (e.g., ontologies) for the representation of data in semantic format, (iv) process and transform the datasets into RDF triples according to the underlying ontologies, (v) perform any necessary post-processing operations on the RDF data, (vi) identify links with other datasets and (vii) publish the generated datasets as Linked Data, applying any required access control mechanisms. The transformation process depends on different aspects of the data, such as the format of the available input data, the purpose (target use case) of the transformation and the volatility of the data (how dynamic the data is). Based on these characteristics, there are two main approaches for transforming a dataset: (i) data upgrade or lifting, which consists of generating RDF data from the source dataset according to mapping descriptions and then storing it in a semantic triplestore (e.g., Virtuoso) and (ii) on-the-fly query transformation, which allows evaluating SPARQL queries over a virtual RDF dataset by rewriting those queries into the source query language according to the mapping descriptions. In the latter scenario, the data physically stays at its source, and a new layer is provided to enable access to it over the virtual RDF dataset.
In every transformation process, regardless of the method or tools chosen, a mapping specification has to be defined to specify the rules for mapping the source elements (e.g., table columns, JSON elements, CSV columns) into target elements (e.g., ontology terms). Generally, this specification is an RDF document itself, written in the RML/R2RML languages (and extensions) and/or in nonstandard extensions of SPARQL, e.g., in the case of the Tarql CSV-to-RDF transformation tool. The resulting datasets can thereafter be exploited through SPARQL queries or via a variety of user interfaces. The following diagram (Fig. 8.2) provides a simplified representation of the generic Linked Data Publication pipeline component view, which includes the software components and interfaces involved. More information is available in [5,6].
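The lifting approach described above can be illustrated with a minimal sketch: a small Python mapping table plays the role of an RML/R2RML specification, turning CSV rows into N-Triples. The column names, property IRIs and base IRI below are purely illustrative placeholders, not the actual DataBio mappings.

```python
import csv
import io

# Hypothetical mapping specification: CSV column -> ontology property IRI
# (stands in for what an RML/R2RML document would declare)
MAPPING = {
    "crop": "http://example.org/onto#cropSpecies",
    "year": "http://example.org/onto#harvestYear",
}
BASE = "http://example.org/field/"  # illustrative base IRI for subjects

def lift(csv_text, id_column="field_id"):
    """Yield one N-Triples line per mapped column of every CSV row."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{row[id_column]}>"
        for column, prop in MAPPING.items():
            if row.get(column):
                yield f'{subject} <{prop}> "{row[column]}" .'

sample = "field_id,crop,year\nF1,winter wheat,2017\nF2,maize,2018\n"
triples = list(lift(sample))
```

A real lifting tool such as GeoTriples additionally handles geometries, datatypes and the full RML vocabulary; the sketch only shows the core idea of a declarative column-to-property mapping driving the triple generation.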
The generic pipeline is available in the DataBioHub at https://mub.me/2f81.

Linked Data Pipeline Instantiations in DataBio
The Linked Data Pipeline, as described in the previous section, is a generalization of multiple instantiations, in particular of two project pilots and four additional experiments in DataBio. Thus, in order to show how this generic pipeline was applied in each of these use cases, we present in this section, for each of them, the pipeline view previously presented in [5], highlighting the specific methods and components used, along with a description of the tasks performed and the results achieved.

Linked Data in Agriculture Related to Cereals and Biomass Crops
This pipeline instance focused on the publication of INSPIRE-based agricultural Linked Data from the farm data collected in the cereals and biomass crop pilots, in order to query and access different heterogeneous data sources via an integrated layer. The input datasets used for this experiment include:
• Farm data (Rostenice pilot) that holds information about each field name, with the associated cereal crop classifications, arranged by year.
• Data about the field boundaries and crop yield potential maps of most of the fields in the Rostenice pilot farm in the Czech Republic.
• Yield records from two fields (Pivovarska and Predni) within the pilot farm that were harvested in 2017 and 2018.
The source datasets, collected as shapefiles, were transformed into RDF format and published as Linked Data, using the FOODIE ontology as the underlying model. The resulting linked datasets are available for querying and exploitation through the DataBio SPARQL endpoint deployed at PSNC's HPC facilities. In more detail, the tasks carried out were as follows:
• Definition of the data model to transform the input datasets into RDF. For this step, the FOODIE ontology [7], which is based on the INSPIRE schema and the ISO 19100 series of standards, was used as the base vocabulary and extended as needed (with a Czech pilot extension) in order to represent all the farm and open data from the input datasets. The extension includes data elements and relations from the input datasets that were not covered by the main FOODIE ontology but were critical for the pilot needs.
• Creation of an RDF mapping file that specifies how to map the contents of a dataset into RDF triples by matching the source dataset schema with the FOODIE ontology and its extensions. A generic RML/R2RML definition of the mapping file was generated from the input shapefiles using applications like GeoTriples and thereafter manually edited, as per the identified data model, to produce the final mapping definition. GeoTriples was also used to generate the RDF dump from the source data contents. The FOODIE ontology and its extension were used extensively in the mapping files to match the source dataset schemas.
• Loading of the generated RDF datasets into the DataBio Virtuoso triplestore. A SPARQL endpoint and a faceted search endpoint are available for querying and exploiting the Linked Data in the Virtuoso instance deployed at the PSNC infrastructure.
• Provision of an integrated view over the original datasets. As the source datasets were particularly large (especially when considering connections with open datasets), and the connections were not of equivalence (i.e., resources are related via some properties but are not equivalent), it was decided to use queries to access the integrated data on demand rather than link discovery tools like SILK or LIMES. Hence, cross-querying within the datasets was done in the Virtuoso SPARQL endpoint for some use cases to establish possible links between agricultural and related open datasets.
• To visualize and explore the Linked Data on a map, different application/system prototypes were created. One such map visualization component, called HS Layers NG, is available at https://app.hslayers.org/project-databio/land/.
The resulting linked datasets are accessible via https://www.foodie-cloud.org/sparql. A figure that maps the generic components identified in this pilot is given below (Fig. 8.3); the red markings highlight the components used in the pilot.

Linked Sensor Data from Machinery Management
This pipeline was implemented for the machinery management DataBio pilot, where sensor data from the SensLog service (used by the FarmTelemeter service) was transformed into Linked Data on the fly; i.e., the data stays at the source, and only a virtual semantic layer was created on top of it to access it as Linked Data. For modeling the sensor data, the following vocabularies/ontologies were selected:
1. The Semantic Sensor Network (SSN) ontology, for describing sensors and their observations, the involved procedures, the studied features of interest, the samples used to do so and the observed properties. A lightweight but self-contained core ontology called SOSA (Sensor, Observation, Sample, and Actuator) was actually used in this specific case to align the SensLog data.
2. The Data Cube Vocabulary and its SDMX ISO standard extensions, which were effective in aligning multidimensional survey data like that in SensLog. The Data Cube Vocabulary builds on well-known RDF vocabularies (SKOS, SCOVO, VOID, FOAF, Dublin Core).
The SensLog service uses a relational database (PostgreSQL) to store the data. Hence, in the mapping stage, the creation of the R2RML/RML definitions required different preprocessing tasks and some on-the-fly assumptions to engineer the alignment between the SensLog database and the ontologies/vocabularies.
Once the mapping file was generated (manually), the RDF data of the dataset was published using a D2RQ server, which enables accessing relational database sources as virtual RDF graphs. This on-the-fly approach allows publishing RDF data from large and/or live databases, so replicating the data into a dedicated RDF triplestore is not required. The Linked Data from the SensLog sensor data (version 1) was published in the PSNC infrastructure in a D2RQ server available at http://senslogrdf.foodie-cloud.org/. The associated SPARQL endpoint to query the data is available at: http://senslogrdf.foodie-cloud.org/sparql.
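The essence of such on-the-fly publication is query rewriting: a SPARQL triple pattern whose predicate is covered by the mapping is translated into SQL against the source tables. The toy sketch below illustrates only this core idea; D2RQ implements it far more generally, and the table and column names here are hypothetical, not the actual SensLog schema.

```python
# Hypothetical mapping from RDF predicates to (table, key column, value
# column) in the source relational database.
PREDICATE_TO_SQL = {
    "sosa:hasSimpleResult": ("observations", "obs_id", "value"),
    "sosa:resultTime":      ("observations", "obs_id", "time_stamp"),
}

def rewrite(pattern):
    """Rewrite a (?s, predicate, ?o) triple pattern into a SQL SELECT."""
    _, predicate, _ = pattern
    try:
        table, key, column = PREDICATE_TO_SQL[predicate]
    except KeyError:
        raise ValueError(f"predicate not mapped: {predicate}")
    # Each matched row yields one virtual triple (subject from key,
    # object from the mapped column); no data is materialized as RDF.
    return f"SELECT {key}, {column} FROM {table}"

sql = rewrite(("?obs", "sosa:hasSimpleResult", "?value"))
```

A full rewriter must also handle joins between patterns, filters and datatype conversions, which is precisely what the D2RQ mapping engine provides.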
The figure below ( Fig. 8.4) highlights the main components used in this pilot from the generic pipeline components.

Linked Open EU-Datasets Related to Agriculture and Other Bio Sectors
This pipeline focuses on EU and national open data from various heterogeneous sources covering a wide range of applications in the geospatial domain. The purpose was to experiment with these datasets by transforming them into Linked Data and exploiting them on various technology platforms for integration and visualization.
The sources for all of these data contents are widely heterogeneous and in various forms (e.g., in shapefiles, CSV format, JSON and in relational databases), which required extensive work to identify the most suitable mode for their transformation. This included a careful inspection of the input data contents in order to identify available ontologies/vocabularies, and any required extensions, necessary for the representation of such data in RDF format. Additionally, since the source datasets were in different formats, selecting the most suitable tools for their transformation was a key activity in order to create the correct (R2RML/RML) mapping definitions. Some of the input datasets, their formats and the ontologies/vocabularies used for the representation of data in semantic format are described below.
• Input data on land parcels and cadastral data (for the Czech Republic and Poland), erosion-endangered soil zones, water buffers and soil type classification is available as shapefiles. The ontologies used for the representation of such data included the INSPIRE-based FOODIE ontology, as well as different extensions created to cover all the necessary information (e.g., erosion zones and restricted areas near water bodies).
• The Farm Accountancy Data Network (FADN) data is available as a set of CSV files. The main ontologies used were the Data Cube Vocabulary and its SDMX ISO standard extensions, which were much more effective in aligning such multidimensional survey data. The Data Cube Vocabulary encompasses well-known RDF vocabularies like SKOS, SCOVO, VOID, FOAF, Dublin Core, etc. Preparing the mapping definitions from the input data sources required preprocessing actions to make them reusable for all types of FADN CSV data sources. Separate CSV files were manually created for each reusable common class type. Once mapping definitions were generated for each of the created CSV files, they were integrated into one mapping file covering all the components from the input data.
• The sample data input from Yelp is available as a set of JSON files. Different ontologies like the review vocabulary, FOAF, schema.org, POI, etc., were used to represent the elements from the input data in semantic format during the creation of the mapping definition.
The generation of RDF triples was carried out using different tools, depending on the source dataset format: for shapefiles, the GeoTriples tool was used, while for the JSON and CSV data the RML processor tool was used. The resulting RDF datasets were then loaded into the DataBio Virtuoso triplestore, providing SPARQL and faceted search endpoints for further exploitation.
Finally, for the provision of an integrated view over the original datasets in case of agricultural and open data, SPARQL queries were generated and additional links were discovered using tools like SILK. For visualization, platforms like HS Layers NG and Metaphactory were used as discussed in Chap. 13.
The resulting linked datasets are accessible via https://www.foodie-cloud.org/sparql. The figure below (Fig. 8.5) highlights the main components used in this pilot from the generic pipeline components.

Linked (Meta) Data of Geospatial Datasets
This pipeline focuses on the publication of metadata from geospatial datasets as Linked Data. There were two data sources that were transformed.
The first dataset was metadata collected from the public Lesproject Micka registry, which includes information on over 100 K geospatial datasets. Micka is a software package for spatial data/services metadata management according to ISO, OGC and INSPIRE standards, and it allows retrieving the metadata in RDF using GeoDCAT for the representation of geographic metadata compliant with the DCAT application profile for European data portals. Nevertheless, such metadata cannot be queried as Linked Data, and thus the goal was to make it available in this form in order to enable its integration with other datasets, e.g., Open Land Use (OLU). The publication process was thus straightforward: a dump of all the metadata in RDF format was generated from Micka and then loaded into the DataBio Virtuoso triplestore. Some example SPARQL queries were then created to identify connection points for integration, e.g., retrieving OLU entries and their metadata given a municipal code and type of area (e.g., agricultural land). The dataset is accessible via: https://www.foodie-cloud.org/sparql.
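A "connection point" query of the kind mentioned above can be sketched as a parameterized query builder. The prefixes and properties below (`ex:municipalCode`, `dct:title`) are placeholders for illustration only; the actual Micka/OLU vocabulary terms differ.

```python
# Build an example SPARQL query joining OLU entries with their metadata
# by municipal code. Property names are hypothetical placeholders.
def olu_metadata_query(municipal_code):
    """Return a SPARQL SELECT filtering OLU entries by municipal code."""
    return (
        "SELECT ?entry ?title WHERE { "
        f'?entry ex:municipalCode "{municipal_code}" ; '
        "dct:title ?title . }"
    )

q = olu_metadata_query("CZ0642")
```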
The second dataset was more challenging. The goal was to make Earth Observation (EO) collections and EO products metadata available as Linked Data via a SPARQL-compliant endpoint that makes requests to non-SPARQL back ends on the fly. Hence, we wanted to enable querying via SPARQL without harvesting all the metadata and storing it in a triplestore, accessing it instead dynamically via the existing online interfaces. The metadata was accessible via an OpenSearch interface provided by the FedEO Clearinghouse at Spacebel (http://geo.spacebel.be/opensearch/readme.html), which enables retrieving the metadata in different formats, including Atom/XML, RDF/XML, Turtle, GeoJSON and JSON-LD. We used JSON-LD, which already defines the semantic properties used to represent the metadata elements. These properties comprise terms from different standard and well-known vocabularies/ontologies like Dublin Core, DCAT, SKOS, VOID and OM-Lite, as well as from the OpenSearch specifications. Next, in order to enable access to a REST API via SPARQL queries that would allow linking with other linked datasets, we used the Metaphactory platform. Metaphactory (https://www.metaphacts.com/product) includes a component called Ephedra, a SPARQL federation engine aimed at processing hybrid queries. Ephedra provides a flexible declarative mechanism for including hybrid services in a SPARQL federation and implements a number of static and runtime query optimization techniques for improving hybrid SPARQL query performance [8]. The RDF data is exposed via a SPARQL endpoint provided by the Metaphactory platform (http://metaphactory.foodie-cloud.org/sparql?repository=ephedra). A demo interface has also been implemented to visualize the Linked Data in Metaphactory (entry point: http://metaphactory.foodie-cloud.org/resource/:ESA-datasets).
The figure below (Fig. 8.6) highlights the main components used in this pilot from the generic pipeline components. In the figure, the components related to the first sub-case (Micka) are highlighted in green, while the components related to the second sub-case (FedEO) are highlighted in orange.

Linked Fishery Data
This pipeline focuses on the catch record data from the fisheries of the Norwegian region. The purpose of this pipeline was to publish five years of historical catch record data as Linked Data and to perform experiments to exploit and visualize it on various platforms. The input data was in the form of CSV files containing the catch record data for each year.
• The first task was to identify which attributes of the data were most relevant for the transformation procedure and could be mapped to some existing ontology. Upon identifying such relevant data attributes from the main CSV file and carefully reviewing the most relevant ontologies/vocabularies, we decided to use 'catchrecord.owl', mostly in an extended version, for our mapping.
• The CSV files were extensively preprocessed so as to generate an R2RML/RML mapping definition using the GeoTriples tool. The mapping definitions were further analyzed and processed to settle on the final mapping definition for the transformation of the CSV data. During the creation of the mapping definitions, the possibility of integration with other linked datasets was also considered.
• The transformation to Linked Data was carried out from the final R2RML/RML mapping definitions using the RML Processor tool.
• After the transformation, a few post-processing steps were performed to make the data ready for upload to the DataBio Virtuoso triplestore.
• At present, the catch data from five years has been transformed and uploaded to the Virtuoso triplestore, providing SPARQL and faceted search endpoints for further exploitation.
For the purpose of showcasing the integration and visualization of the dataset, a Web interface was created using the Metaphactory platform, which includes map visualizations and representations of the data in the form of charts and graphs. This process is ongoing, and more experiments are planned. The interface is presently available at http://metaphactory.foodie-cloud.org/resource/:CatchDataNorway_v2. The resulting linked datasets are accessible via https://www.foodie-cloud.org/sparql and https://www.foodie-cloud.org/fct. The figure below (Fig. 8.7) highlights the main components used in this use case from the generic pipeline components.

Usage and Exploitation of Linked Data
The pipelines used in DataBio are part of an ongoing process and have yet to be tested on other use cases and input data types. For example, as a result of the pipelines involving the LPIS and Czech field data, it was possible to perform integration experiments on the dataset for various data integration use case scenarios.
As mentioned above, the datasets are deployed in the Virtuoso triplestore within PSNC and can be accessed via SPARQL and faceted search endpoints. The triplestore has over 1 billion triples, making it one of the largest semantic repositories related to agriculture.
The data in the triplestore is partitioned/organized into named graphs, where each named graph describes different contents and is identified by an IRI.
For example, the IRI <http://w3id.org/foodie/open/africa/GRIP> is the graph identifier of the African Roads Network dataset, which contains 27,586,675 triples.
Named graphs may be further composed of named subgraphs, as is the case with the LPIS Poland dataset, which provides information about land-parcel identification in Poland, is identified by the graph <http://w3id.org/foodie/open/pl/LPIS/>, and contains 727,517,039 triples. This graph contains, for example, the subgraph <http://w3id.org/foodie/open/pl/LPIS/lubelskie>, which refers to the data associated with the Lublin Voivodeship.
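Per-graph triple counts like those quoted above can be obtained with a single aggregate SPARQL query against the triplestore. The helper below only builds the query string (submitting it to an endpoint is out of scope here); the example graph IRI is the African Roads Network graph mentioned above.

```python
# Build SPARQL queries that count triples, either per named graph or for
# one specific graph IRI.
def graph_count_query(graph_iri=None):
    """Return a SPARQL query counting triples, optionally for one graph."""
    if graph_iri:
        return (f"SELECT (COUNT(*) AS ?triples) "
                f"WHERE {{ GRAPH <{graph_iri}> {{ ?s ?p ?o }} }}")
    # No graph given: group counts by graph, largest first.
    return ("SELECT ?g (COUNT(*) AS ?triples) "
            "WHERE { GRAPH ?g { ?s ?p ?o } } "
            "GROUP BY ?g ORDER BY DESC(?triples)")

q = graph_count_query("http://w3id.org/foodie/open/africa/GRIP")
```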
The table below shows some of the respective graphs produced by all the pipelines previously described and the number of triples contained in them. The official SPARQL and faceted search endpoints of the triplestore are https://www.foodie-cloud.org/sparql (Fig. 8.8) and https://www.foodie-cloud.org/fct. Regarding the sensor data described in Sect. 1.3.2, it is published on the fly, which serves the purpose of streaming transformation. This data can be accessed and linked through the following endpoints:
• SPARQL endpoint: http://senslogrdf.foodie-cloud.org/sparql
• SNORQL search endpoint: http://senslogrdf.foodie-cloud.org/snorql/
• Web-based visualization: http://senslogrdf.foodie-cloud.org/ (see Fig. 8.10).

Experiences in the Agricultural Domain
RDF links often connect entities from two different sources with relations that are not necessarily described in either data source. In the agricultural domain, this can mean linking fields of a specific crop type with the administrative region in which these fields reside, or finding whether plots intersect with a buffer zone of water bodies in their vicinity, e.g., as a means to control the level and amount of pesticides used in those plots. Creating such agricultural knowledge graphs is important for environmental, economic and administrative reasons. However, constructing links manually is time- and effort-intensive, so links between concepts should rather be discovered automatically. The basic idea of link discovery is to find data items within the target dataset which are logically connected to the source dataset. Formally, this means: given S and T, sets of RDF resources called source and target resources, respectively, and a relation R, the aim of link discovery methods is to find a mapping M = {(s, t) ∈ S × T : R(s, t)}. Naive computation of M requires quadratic time, testing for each s ∈ S and t ∈ T whether R holds, which is infeasible for large datasets; this has led to the development of link discovery tools that address this task.
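The naive computation of M is a direct transcription of the definition: a pairwise loop over S × T. The sketch below uses axis-aligned bounding-box overlap as a stand-in for a topological relation R; the resource identifiers and coordinates are invented for illustration. Real link-discovery tools exist precisely to avoid this quadratic loop, e.g., via spatial indexing or blocking.

```python
# Naive O(|S| * |T|) link discovery: M = {(s, t) in S x T : R(s, t)}.
def boxes_overlap(a, b):
    """R(s, t): bounding boxes (minx, miny, maxx, maxy) overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def naive_link_discovery(S, T, R=boxes_overlap):
    """Test R for every source/target pair; return the mapping M."""
    return [(s_id, t_id)
            for s_id, s in S.items()
            for t_id, t in T.items()
            if R(s, t)]

# Illustrative data: two plots, one water-body buffer zone.
plots = {"plot1": (0, 0, 2, 2), "plot2": (10, 10, 12, 12)}
zones = {"buffer1": (1, 1, 3, 3)}
links = naive_link_discovery(plots, zones)  # only plot1 overlaps buffer1
```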
In the agricultural domain, entities are mostly geospatial objects, and the relations are of a topological nature. Existing tools for link discovery, such as SILK and LIMES, are limited when it comes to geospatial data; therefore, as part of the DataBio project, we developed Geo-L, a system designed for the discovery of RDF spatial links based on topological relations.
The system provides flexible configuration options for defining the to-be-linked datasets for SPARQL-affine users, and it employs retrieval and caching mechanisms, resulting in efficient dataset management.
Geo-L uses PostgreSQL, an open-source object-relational DBMS, with the PostGIS extension as the database back end, which supports geospatial data processing.
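With a PostGIS back end, the pairwise relation test can be delegated to the database as a single spatial join using a topological predicate such as ST_Intersects. The helper below only assembles such a query string as an illustration; the table and column names are hypothetical, not Geo-L's actual internal schema.

```python
# Build a PostGIS spatial join delegating the topological relation test
# to the database. Table/column names are hypothetical placeholders.
def spatial_join_sql(source_table, target_table,
                     relation="ST_Intersects", geom_col="geom"):
    """Return a SQL spatial join using a PostGIS topological predicate."""
    return (f"SELECT s.id AS source_id, t.id AS target_id "
            f"FROM {source_table} s JOIN {target_table} t "
            f"ON {relation}(s.{geom_col}, t.{geom_col})")

sql = spatial_join_sql("lpis_plots", "water_buffers")
```

The design point is that PostGIS evaluates the join with its spatial indexes (GiST over the geometry columns), avoiding the naive quadratic comparison in application code.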
We conducted experiments to evaluate the performance of our proposed system by searching geospatial links based on topological relations between geometries of datasets of the foodie cloud, in particular subsets of OLU, SPOI and NUTS.
The experiments show that Geo-L outperforms state-of-the-art tools in terms of mapping time, accuracy and flexibility. It also proved more robust in handling errors in the data, as well as in managing large datasets.
We applied Geo-L to several use cases involving datasets from the foodie cloud, e.g.:
• Identifying fields from Czech LPIS data with a specific soil type, from Czech open data
• Identifying all fields in a specific region which grow the same type of crops as the one grown in a specific field over a given period of time
• Identifying plots from Czech LPIS data which intersect with buffer zones around water bodies.
An example of the last case is depicted in the image below (Fig. 8.11), where the overlap area between a plot and the buffer zone of a water body in its vicinity is colored orange.
The respective dataset, resulting from linking water bodies whose buffer zones are intersected by Czech LPIS plots, is available on the DBpedia Databus.

Experiences with DBpedia
DBpedia is a crowd-sourced, continuous community effort to extract structured information from Wikipedia and to make this information available as a knowledge graph on the Web. DBpedia allows querying this data and linking it to other datasets on the Web [9,10]. Currently, DBpedia is one of the central interlinking hubs in the Linked Open Data (LOD) cloud. With over 28 million described and localized things, it is one of the largest open datasets.
As part of the project, we constructed links between satellite entities available in the European Space Agency (ESA) thesaurus, whose recorded images are employed in DataBio pilots, and their respective DBpedia resources. These links are beneficial since the data in DBpedia is available in machine-readable form for further processing and, in addition, there are additional data and external links related to each satellite. We used a REST API to retrieve satellite names from the ESA thesaurus and queried for DBpedia resources matching these names, which were then identified as satellites based on their properties available in DBpedia. The listing depicted in Fig. 8.12 presents an excerpt from the link-data result. The links allow, on the one hand, access to other properties of the respective DBpedia resources and, on the other hand, enable other DBpedia users to access the ESA set. This dataset can be found as an artifact on the DBpedia Databus.
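The label-matching step described above can be sketched as a query builder: given a satellite name from the thesaurus, construct a DBpedia SPARQL query matching resources by English label. This is a simplified assumption of the actual procedure; in particular, the property-based satellite check is reduced here to a single `dbo:Satellite` type constraint, and the example name is illustrative.

```python
# Build a DBpedia lookup query for a satellite name retrieved from the
# ESA thesaurus. The type filter is a simplified stand-in for the full
# property-based identification used in the project.
def satellite_lookup_query(name):
    """Return a SPARQL query matching DBpedia resources by English label."""
    safe = name.replace('"', '\\"')  # escape quotes in the literal
    return (
        "SELECT ?resource WHERE { "
        f'?resource rdfs:label "{safe}"@en ; '
        "a dbo:Satellite . }"
    )

q = satellite_lookup_query("Sentinel-2")
```

Each matching `?resource` would then be linked to the corresponding ESA thesaurus entry, e.g., via an owl:sameAs or skos:exactMatch statement.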
DBpedia resources which refer to geographical regions include different important properties about those areas, such as temperature amplitudes and monthly precipitation. Such properties may be helpful, e.g., for the analysis of yields. These resources, however, do not contain the actual geometry of the regions. We used OpenStreetMap to retrieve data about regions and applied Geo-L to link DBpedia region resources with their geometries.
These geometries can then be helpful not only for DataBio or for agriculture in general, but may also be used for locating points of interest whose coordinates are known within a specific region, which was not previously possible.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.