1 Introduction

The CSV file format is a convenient and popular way to exchange structured data, but the semantics of the information captured in such files are not explicit. RDF, on the other hand, provides one means to add meaning to data that software can process. The process of converting data in any non-RDF format (e.g., CSV and relational databases) into RDF is called uplift.

There are scenarios where it is necessary to manipulate data during the uplift process. Depending on the source data format, this can be very straightforward. With uplift languages for relational databases such as R2RML, for example, one can rely on the underlying RDBMS to support some data manipulation tasks (e.g., string concatenation). The same is true for XSPARQL [5], where one can use XQuery to manipulate the data contained in XML. Sometimes, however, relying on the underlying technology is not sufficient [4]. For CSV datasets there is no equivalent, standardized underlying technology. When data in CSV files has to be manipulated to generate RDF, one needs to resort to data pre- or post-processing. This increases the number of steps necessary to generate RDF and renders the whole data processing “pipeline” less transparent. One solution to this problem is to capture these manipulations as functions in the mappings.

We propose a method to incorporate functions into mapping languages that draws inspiration from, and generalizes, ideas presented in [4]. To demonstrate our method, we extend RML’s vocabulary and engine to include notions for function calls and parameter bindings. The main contributions of this paper can be summarized as follows: (i) a method to incorporate functions in a mapping language; (ii) an implementation of the method extending RML; and (iii) a demonstration of functions incorporated into mappings applied to a real-world dataset.

2 Related Work

One approach to support uplift is to use an annotation language to relate non-RDF data to RDF (i.e., “mappings”), for which an engine is built. Examples of this approach include R2RML and R2RML-F [4] for relational databases; SML [3] for relational databases and CSV; and RML [1] and KR2RML [7] for an even wider array of non-RDF data formats. These mapping languages usually have access to functionality provided by the underlying technology of the non-RDF data source. For CSV files, no such underlying technology exists, and one has to apply data pre- or post-processing techniques, which raises the problems explained in Sect. 1.

To the best of our knowledge, KR2RML is the only tool to support data manipulation functions inside a mapping language that does not rely on the underlying technology. Its authors provide an editor in which users can load data, input mappings, and create Python functions to manipulate that data. Once those functions are stored, however, several problems can be observed. First, much of the structured information describing the function is captured as a string, so that processing a mapping requires parsing both the file and that string. Secondly, the mapping becomes rather complex, which makes it more difficult for users to create similar mappings with other tools. Their editor, however, does facilitate the mapping creation process for their mapping language. In [4], an extension to R2RML called R2RML-F is proposed. R2RML-F adds support for capturing domain knowledge inside the mapping language for relational databases. Unlike KR2RML, functions in R2RML-F are captured as resources, expressed in the RDF data model, that mappings refer to. This allows functions to be reused in different mappings.

3 Incorporating Functions into Mapping Languages

In this section, we describe how we adopt ideas presented in [4] to develop a more generic, usable and amenable approach to incorporate functions into mapping languages. These functions can be used to capture both domain knowledge (e.g., transforming units) and other – more syntactic – data manipulation tasks (e.g., transforming values to create valid URIs). Function names are unique and each function must have one function name and one function body. A function body defines a function with a return statement; parameters are optional.
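The constraints above (unique names, exactly one body per name, a returned value, optional parameters) can be illustrated with a minimal Python sketch. This is not the RML extension itself: the `FunctionRegistry` class and the three example functions are hypothetical names introduced here only to make the constraints concrete.

```python
from typing import Callable, Dict
from urllib.parse import quote


class FunctionRegistry:
    """Hypothetical registry: each function name maps to exactly one body."""

    def __init__(self) -> None:
        self._functions: Dict[str, Callable[..., object]] = {}

    def register(self, name: str, body: Callable[..., object]) -> None:
        # Function names are unique, so a second definition is rejected.
        if name in self._functions:
            raise ValueError(f"function name already defined: {name}")
        self._functions[name] = body

    def call(self, name: str, *args: object) -> object:
        # Look up the single body bound to this name and apply it.
        return self._functions[name](*args)


registry = FunctionRegistry()

# Domain knowledge: a unit transformation (one parameter).
registry.register("milesToKm", lambda miles: miles * 1.609344)

# Syntactic manipulation: make a value usable inside a URI (one parameter).
registry.register(
    "toUriSegment",
    lambda text: quote(text.strip().replace(" ", "_")),
)

# Parameters are optional: this function takes none and still returns a value.
registry.register("defaultLanguageTag", lambda: "en")
```

Keeping names unique is what lets a mapping refer to a function by name alone, and hence reuse it in several places.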

Our proof-of-concept extends RML’s vocabulary and engine by introducing constructs for describing functions, function calls and parameter bindings. Listing 1 defines a function. This function takes one string as a parameter and returns a URL concatenated with the camel case version of that string. Although this function performs a simple string transformation, functions in this method are generic and capable of complex data transformations. Furthermore, functions work with any data format and can be reused in the mapping. Listing 2 demonstrates how the function is called. In R2RML, and by consequence RML, a Term Map generates an RDF term (see [1]). In our implementation, we introduced a new Term Map called a Function Valued Term Map that generates RDF terms based on the application of a function. The parameters are also Term Maps that are evaluated before the results are passed as arguments.
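The behaviour described above can be approximated in Python as follows. This is only a sketch of the evaluation order, not the RML vocabulary or engine: the namespace `BASE`, the exact camel-casing rule, and the helper names `column` and `function_valued` are assumptions made for illustration.

```python
import re
from typing import Callable, Dict, List

# Assumed placeholder namespace; the actual URL is not given in the text.
BASE = "http://example.org/seshat/"

Row = Dict[str, str]
TermMap = Callable[[Row], str]


def to_camel_case_uri(label: str) -> str:
    """Return BASE concatenated with the camel case version of label."""
    words = re.findall(r"[A-Za-z0-9]+", label)
    if not words:
        raise ValueError("label contains no usable characters")
    head = words[0].lower()
    tail = "".join(w[:1].upper() + w[1:].lower() for w in words[1:])
    return BASE + head + tail


def column(name: str) -> TermMap:
    """A simple term map that reads one column of the current row."""
    return lambda row: row[name]


def function_valued(func: Callable[..., str], parameters: List[TermMap]) -> TermMap:
    """A term map whose value is func applied to the evaluated parameters."""
    def term_map(row: Row) -> str:
        # Parameters are term maps themselves: evaluate them first,
        # then pass the results to the function as arguments.
        arguments = [parameter(row) for parameter in parameters]
        return func(*arguments)
    return term_map


# The function is defined once and can be bound to any column, any number of times.
predicate_map = function_valued(to_camel_case_uri, [column("Variable")])
```

Because parameters are ordinary term maps, a function-valued term map could itself serve as a parameter of another one in this sketch.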

4 Demonstration

The dataset used to demonstrate our approach comes from the Seshat: Global History Databank [6]. This project is developing a knowledge base that describes human history. The knowledge base is created by hand via a wiki, where contributors are expected to adhere to certain conventions to structure the facts. The Seshat dataset is currently made available for analysis as CSV files by scraping the wiki pages. A current development within the project is to gather the data into an OWL knowledge base, but predicates from the dataset differ from the predicates defined in the OWL ontology. Thus, there is a need to develop an approach to transform CSV values into the URIs of the ontology’s predicates.

For example, in the dataset one predicate is labeled as “Capital”, but in the ontology the predicate is seshat:capital. Other examples include “Language” and “Supracultural entity”. Table 1 shows a fragment of the data where the values for the column “Variable” are transformed into predicates using the mapping from Listings 1 and 2. Example RDF output is shown in Fig. 1.
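The transformation from “Variable” values to predicates can be sketched end to end in Python. Everything except the column name “Variable” and the three labels is an assumption: the namespace, the “Polity” and “Value” columns, and the placeholder cell values are invented for illustration and are not taken from the Seshat data.

```python
import csv
import io
import re
from typing import List

# Assumed placeholder namespace standing in for the seshat: prefix.
SESHAT = "http://example.org/seshat/"

# Placeholder rows; only the "Variable" labels come from the text.
SAMPLE_CSV = """Polity,Variable,Value
ExamplePolity,Capital,ExampleCity
ExamplePolity,Language,ExampleLanguage
ExamplePolity,Supracultural entity,ExampleEntity
"""


def variable_to_predicate(label: str) -> str:
    """Turn a column label into a camel-cased predicate URI."""
    words = re.findall(r"[A-Za-z0-9]+", label)
    if not words:
        raise ValueError("label contains no usable characters")
    head = words[0].lower()
    tail = "".join(w[:1].upper() + w[1:].lower() for w in words[1:])
    return SESHAT + head + tail


def uplift(csv_text: str) -> List[str]:
    """Generate one N-Triples-style line per CSV row."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = SESHAT + row["Polity"]
        # The predicate is computed by the function, not read verbatim.
        predicate = variable_to_predicate(row["Variable"])
        value = row["Value"]
        triples.append(f'<{subject}> <{predicate}> "{value}" .')
    return triples


triples = uplift(SAMPLE_CSV)
```

Here the manipulation happens while the triples are generated, so no separate pre- or post-processing pass over the CSV file is needed.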

Table 1. Excerpt of the CSV file shown as a table.
Fig. 1. RDF output.

5 Conclusions and Future Work

Most tools to convert data to RDF rely on underlying technology for data manipulation, but there is no manipulation language for CSV datasets. Moreover, when data manipulation is needed for CSV datasets, one depends on pre- or post-processing techniques, which adds complexity to the uplift process. One solution is to incorporate functions in mappings, but the state of the art does not offer a feasible way to do so. We tackled this problem by presenting a more amenable method to incorporate functions into mapping languages. We demonstrated our approach by extending RML’s vocabulary and engine and applying it to a real-world dataset.

Future work includes more use cases and experiments to compare performance and expressiveness of our approach and other mapping languages. Since functions can be considered software agents, one can also generate provenance information referring to these functions. Inspiration can be drawn from [2], who proposed a method for creating provenance information while generating RDF.