To exemplify our approach, we will introduce a running example. In our example, we simulate an organization that aims to send a newsletter to its customers (a data processing purpose). Newsletters can only be sent to customers who have given their explicit consent. In our example, we have a relational database table “Customer” that contains last names, first names, and email addresses (amongst other things) (Table 1).
Table 1 The customers table

We will use that table to create a dataset schema file referencing that information. We assume, for the running example, that the dataset schema file will be stored at a particular location; this will be important, as the location of the file (an absolute URI) will be the base of the RDF graph. When creating a dataset for data processing purposes, a location (and thus, also, an absolute URI) will be generated.
Step 1: annotating the schema
We start off with the (re)use of a schema describing the tabular data, which we will annotate with information on where to fetch the data. Listing 1 depicts an RDF graph containing a minimal dataset definition of the tabular data using the CSVW vocabulary and the JSON-LD representation provided by [10]. We only retained the names of the columns as well as the property URLs, which give us an indication as to how values in these columns must be interpreted. The vocabulary allows one to prescribe mandatory constraints (i.e. whether values are required), data types, and patterns that values should comply with (amongst others). These constraints are useful for validating tabular data. The highlighted statements extend the tabular file definitions with mapping information. We refer to a relational database table called “Customer” with the structure depicted in Table 2.
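To give an impression of such a schema, the following is a minimal sketch in the spirit of Listing 1, shown here in Turtle rather than JSON-LD for brevity. The file name, the column names, the property URLs, and the ex: namespace are illustrative assumptions of ours, not the paper's actual schema.

    @prefix csvw: <http://www.w3.org/ns/csvw#> .
    @prefix ex:   <http://example.org/schema/customers#> .

    # A table pointing to its CSV file and its schema. We assume columns
    # form an ordered RDF list, and note that propertyUrl values are URI
    # templates (i.e. strings) in CSVW.
    ex:table a csvw:Table ;
        csvw:url "customers.csv" ;
        csvw:tableSchema ex:tableSchema .

    ex:tableSchema a csvw:Schema ;
        csvw:column (
            [ csvw:name "first_name" ; csvw:propertyUrl "http://xmlns.com/foaf/0.1/firstName" ]
            [ csvw:name "last_name"  ; csvw:propertyUrl "http://xmlns.com/foaf/0.1/lastName" ]
            [ csvw:name "email"      ; csvw:propertyUrl "http://xmlns.com/foaf/0.1/mbox" ]
        ) .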
Table 2 Schema of the relational database table “Customers”
We reuse R2RML’s predicates to further annotate the dataset schema, as R2RML already provides us with the predicates necessary to indicate where one can find the source tabular information in a relational database (see Listing 2, highlighted). It is important to note that we do not intend to create a valid R2RML document by reusing those predicates. We will, however, use them to generate a valid R2RML mapping in the next step of the process.
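The annotation itself can be as small as one statement. The following hedged sketch illustrates the idea behind Listing 2, assuming the annotation is attached to the schema resource of the previous sketch.

    @prefix rr: <http://www.w3.org/ns/r2rml#> .
    @prefix ex: <http://example.org/schema/customers#> .

    # Reusing an R2RML predicate to state where the source data lives;
    # attaching it to ex:tableSchema is an assumption of this sketch.
    ex:tableSchema rr:tableName "Customer" .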
The dataset schema is also serialized in JSON-LD [22], a common and increasingly popular representation for RDF thanks to the tooling available for it. One can translate this JSON-LD into another RDF serialization such as Turtle, but most of the column definitions will have no identifier (i.e. they will be blank nodes). Our approach of using RDF Data Cube had the additional benefit of having identifiers (URIs) for the different parts of the schema, which allows for an easier separation between schema and annotations. While this is also possible in JSON-LD, the way RDF Data Cube structures the data “forces” one to provide identifiers for those various components.
Step 2: generation of an R2RML mapping
We have chosen to adopt a declarative approach to generating the R2RML mapping via a sequence of SPARQL CONSTRUCT queries. The various queries can be summarized as follows: (1) create the triples maps (for mapping tables, views, or queries to RDF); (2) use the columns to create subject maps; (3) create predicate object maps for the columns; and (4) connect the dataset that will be generated with its schema.
We obtain an executable R2RML mapping by merging the models resulting from each SPARQL CONSTRUCT query. This model is not meant to be merged with the prior RDF graphs from Listing 2. Instead, it will be used to generate RDF that will be the “just-in-time” dataset. In a wider governance narrative, the resulting mapping may be stored to inform stakeholders of the provenance of the datasets. We now begin with the description of each query. Note that we have omitted prefixes and base declarations from each of the query listings for brevity.
The first query, shown in Listing 3, generates a logical table for each schema related to a table or view. A similar CONSTRUCT query is used for schemas related to a query, which use the rr:query predicate. The namespace ont: refers to the vocabulary we developed for this study, and is useful for attaching the different components of the R2RML mapping later on.
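The following is a hedged sketch of what such a query could look like (prefix declarations omitted, as in the paper's listings). The deterministic minting of the triples map IRI is an assumption we make so that the models produced by the separate queries can later be merged onto the same resources.

    # Sketch of the first CONSTRUCT query (cf. Listing 3).
    CONSTRUCT {
        ?tm rr:logicalTable [ rr:tableName ?table ] .
    }
    WHERE {
        ?schema rr:tableName ?table .
        # Mint a deterministic IRI for the triples map so that the
        # subsequent queries can attach subject and predicate object
        # maps to the same resource (an assumption of this sketch).
        BIND(IRI(CONCAT(STR(?schema), "-TM")) AS ?tm)
    }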
The CONSTRUCT query for generating the subject map is shown in Listing 4. The columns of the tabular data are used to identify each record; we use that information to generate the subject map of a triples map. The table columns corresponding to those schema columns are used to create a template that will identify each record in the dataset. An R2RML processor will use the template to generate values under which the information of each record is kept together in an appropriate data structure (e.g. a dictionary). Notice that on line 17 we rely on the strSplit function offered by Apache Jena’s SPARQL processor to split a string into a list; such a function is, unfortunately, not available in standard SPARQL. On lines 20 and 21 (grey), we include function calls which are part of R2RML-F, an extension of R2RML [23].
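As an illustration, a simplified sketch of such a query is shown below; it omits the strSplit and R2RML-F function calls mentioned above, and the template's base IRI is an assumption of ours.

    # Sketch of the subject-map query (cf. Listing 4), simplified.
    CONSTRUCT {
        ?tm rr:subjectMap [ rr:template ?template ] .
    }
    WHERE {
        {
            # Concatenate all column names into template placeholders,
            # e.g. "{first_name}-{last_name}-{email}".
            SELECT ?schema (GROUP_CONCAT(CONCAT("{", ?name, "}") ; SEPARATOR = "-") AS ?keys)
            WHERE {
                ?schema csvw:column/rdf:rest*/rdf:first ?col .
                ?col csvw:name ?name .
            }
            GROUP BY ?schema
        }
        BIND(IRI(CONCAT(STR(?schema), "-TM")) AS ?tm)
        BIND(CONCAT("http://data.example.org/customer/", ?keys) AS ?template)
    }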
Listing 5 provides the CONSTRUCT query for adding predicate object maps to the triples maps based on columns. The query for tabular data is fairly straightforward. We test for the presence of a csvw:propertyUrl that refers to an RDF predicate. As the use of csvw:propertyUrl is not mandatory, we construct a predicate based on the column’s name from the schema when that property is missing (lines 17–19). We note that a base URI (usually the URL or IRI of the dataset) is needed to provide absolute URIs for the column names. We avail of URI encoding to ensure that the URI of each predicate is valid.
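A hedged sketch of this query, under the same assumptions as the previous sketches (the namespace for fallback predicates is ours):

    # Sketch of the predicate-object-map query (cf. Listing 5).
    CONSTRUCT {
        ?tm rr:predicateObjectMap [
            rr:predicate ?pred ;
            rr:objectMap [ rr:column ?name ]
        ] .
    }
    WHERE {
        ?schema csvw:column/rdf:rest*/rdf:first ?col .
        ?col csvw:name ?name .
        OPTIONAL { ?col csvw:propertyUrl ?purl }
        # Use the declared propertyUrl when present; otherwise mint a
        # predicate from the (URI-encoded) column name under a base URI.
        BIND(COALESCE(IRI(STR(?purl)),
                      IRI(CONCAT("http://data.example.org/ns#", ENCODE_FOR_URI(?name)))) AS ?pred)
        BIND(IRI(CONCAT(STR(?schema), "-TM")) AS ?tm)
    }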
Finally, the dataset that will be generated with this mapping also needs to be connected to its schema. This is straightforward with the following CONSTRUCT query (in Listing 6). In CSVW, the schema refers to a particular file with the csvw:url predicate. We add this statement to the dataset that we will generate by executing the R2RML mapping as one of the final steps.
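A minimal sketch of what such a query could look like, assuming the file's location doubles as the dataset's IRI (cf. Step 1):

    # Sketch of the final CONSTRUCT query (cf. Listing 6): connect the
    # dataset that will be generated to its schema.
    CONSTRUCT {
        ?dataset csvw:tableSchema ?schema .
    }
    WHERE {
        ?schema csvw:url ?url .
        # The file location (an absolute URI) is used as the dataset IRI;
        # this is an assumption of this sketch.
        BIND(IRI(?url) AS ?dataset)
    }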
With these mappings, which are declarative and implemented as SPARQL CONSTRUCT queries, we are able to generate an executable R2RML mapping that will be used in the generation of the “just-in-time” compliant dataset in the next step of the process. Given our table “Customer” in Table 2 and the snippets from Listing 2, the R2RML mapping in Listing 7 is generated. While it is not explicitly stated that the resource is a rr:TriplesMap, an R2RML engine will infer it as such, since the domain of rr:logicalTable is rr:TriplesMap.
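An approximation of the generated mapping is given below; the triples map IRI, the foaf properties, and the template are illustrative assumptions carried over from the earlier sketches, not the paper's actual Listing 7.

    @prefix rr:   <http://www.w3.org/ns/r2rml#> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    # Hedged approximation of the generated R2RML mapping (cf. Listing 7).
    <#CustomerTM>
        rr:logicalTable [ rr:tableName "Customer" ] ;
        rr:subjectMap [
            rr:template "http://data.example.org/customer/{first_name}-{last_name}-{email}"
        ] ;
        rr:predicateObjectMap [
            rr:predicate foaf:firstName ;
            rr:objectMap [ rr:column "first_name" ]
        ] ;
        rr:predicateObjectMap [
            rr:predicate foaf:lastName ;
            rr:objectMap [ rr:column "last_name" ]
        ] ;
        rr:predicateObjectMap [
            rr:predicate foaf:mbox ;
            rr:objectMap [ rr:column "email" ]
        ] .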
Step 3: execution of the R2RML mapping
For the execution of our mapping, we rely on the R2RML implementation developed by [23], as some use cases are not supported by the R2RML specification (see the discussion). The mapping in Listing 7 contains no statements that fall outside R2RML’s scope, however, and should work with other implementations of the specification. The execution of this mapping generated three RDF triples for each record in the Customers table, which corresponds with the generated dataset in Fig. 1. An example of such a record is shown in Listing 8, representing “user_1” in the graph with their first and last name.
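For illustration, a record could look as follows; the subject IRI, the property choices, and the literal values are placeholders of ours, not the paper's actual output.

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    # A hedged impression of a generated record (cf. Listing 8).
    <http://data.example.org/customer/user_1>
        foaf:firstName "John" ;
        foaf:lastName  "Doe" ;
        foaf:mbox      "john.doe@example.org" .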
In the case of an XSD datatype, our R2RML processor checks whether a value that is generated by an object map corresponds with that datatype and reports when this is not the case. When a datatype that is not part of the XSD namespace is used for an object map, such as ex:myInteger, for instance, the literal is merely typed with that datatype. If no datatype is provided, the datatype of the literal depends on the datatype of the column (see Sect. 10.2 “Natural Mapping of SQL Values” of [16]).
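For example, an object map along the following lines (with an assumed "age" column and ex:age predicate of ours) would yield literals typed with ex:myInteger, without any validation against the XSD datatypes:

    @prefix rr: <http://www.w3.org/ns/r2rml#> .
    @prefix ex: <http://example.org/ns#> .

    # Hedged illustration: a non-XSD datatype on an object map.
    <#CustomerTM> rr:predicateObjectMap [
        rr:predicate ex:age ;
        rr:objectMap [ rr:column "age" ; rr:datatype ex:myInteger ]
    ] .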
Step 4: validating the generated RDF
We validate the generated RDF by checking the integrity constraints described in the schema. This is necessary, as the execution of any R2RML mapping according to a particular schema or standard does not guarantee that the resulting dataset complies with the constraints of that schema or standard. For RDF Data Cubes, the specification [13] presents a set of so-called integrity constraints, which are a sequence of SPARQL ASK queries. For CSVW, we rely on CSVW validators taking as input both the schema and the generated dataset to check whether the dataset adheres to the schema’s constraints.
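In the same spirit, a check on our running example could be phrased as an ASK query that flags a violation, e.g. a record missing a value for a required column; the properties used are the assumptions of the earlier sketches, not a constraint from either specification:

    # Hedged sketch of a validation check: does any record lack an email?
    # (true indicates a constraint violation)
    ASK {
        ?record foaf:firstName ?firstName .
        FILTER NOT EXISTS { ?record foaf:mbox ?email }
    }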
Related work on generating mappings
Though tools exist to convert between representations (such as from OLAP and CSV to RDF Data Cube [24]), we are interested in the state of the art in generating mappings from one representation to another. Related work in generating (R2RML) mappings from other representations is quite limited. The authors in [25], who proposed a declarative language for ontology-based data access where a single description results in an ontology, mappings, and rules for transforming queries, mentioned adopting R2RML because of its increasing uptake.
The Open Cube Toolkit [26] provides a D2RQ [27] extension for generating an RDF graph according to the RDF Data Cube Vocabulary using D2RQ’s R2RML support. The D2RQ data provider requires a mapping relating a table to a dataset using a bespoke XML mapping language. The XML file is then used to generate an R2RML mapping, which is then executed by D2RQ’s engine. Their approach is thus similar to ours in that it generates an executable R2RML file from the mapping. The limitations of their approach relate to their mapping language: it is bespoke, not expressed in RDF, and has not been declared in a particular namespace.