Creation of Knowledge Graphs

This chapter introduces how Knowledge Graphs are generated. The goal is to gain an overview of different approaches that were proposed and find out more details about the current prevalent ones. After reading this chapter, the reader should have an understanding of the different solutions available to generate Knowledge Graphs and should be able to choose the mapping language that best suits a certain use case.


Introduction
The real power of the Semantic Web will be realized once a significant number of software agents requiring information from different heterogeneous data sources become available. However, human and machine agents still have limited ability to interact with heterogeneous data, as most data is not available in the form of knowledge graphs, which are the fundamental cornerstone of the Semantic Web. Data sources have different structures (e.g., tabular, hierarchical), appear in heterogeneous formats (e.g., CSV, XML, JSON) and are accessed via heterogeneous interfaces (e.g., database interfaces or Web APIs).
Therefore, different approaches were proposed to generate knowledge graphs from existing data. In the beginning, custom implementations were proposed [67,292] and they remain prevalent today [71,177]; however, more generic approaches emerged as well. Such approaches were originally focused on data with specific formats, namely dedicated approaches for, e.g., relational databases [93], data in Excel (e.g. [274]), or in XML format (e.g. [272]). However, data owners who hold data in different formats need to learn and maintain several tools [111].
To deal with this, different approaches were proposed for integrating heterogeneous data sources while generating knowledge graphs. Those approaches follow different directions, but detaching the definition of the rules from their execution prevailed, because it renders the rules interoperable between implementations, whilst the systems that process those rules are use-case independent. To generate knowledge graphs, on the one hand, dedicated mapping languages were proposed, e.g., RML [111], and, on the other hand, existing languages for other tasks were repurposed as mapping languages, e.g., SPARQL-Generate [278].
We focus on dedicated mapping languages. The most prevalent dedicated mapping languages are extensions of R2RML [97], the W3C recommendation on knowledge graph generation from relational databases. RML was the first language proposed as an extension of R2RML, but there are more alternative approaches and extensions beyond the originally proposed language. Examples include xR2RML [305], for generating knowledge graphs from heterogeneous databases, and KR2RML [407], for generating knowledge graphs from heterogeneous data.
In the remainder of this chapter, we introduce the Relational to RDF Mapping Language (R2RML) [97] and the RDF Mapping Language (RML) [111] which was the first mapping language extending R2RML to support other heterogeneous formats. Then we discuss other mapping languages which extended or complemented R2RML and RML, or their combination.

R2RML
The Relational to RDF Mapping Language (R2RML) [97] is the W3C recommendation to express customized mapping rules from data in relational databases to generate knowledge graphs represented using the Resource Description Framework (RDF) [94]. R2RML considers any custom target semantic schema, which might be a combination of vocabularies. The R2RML vocabulary namespace is http://www.w3.org/ns/r2rml# and the preferred prefix is rr.
In R2RML, RDF triples are generated from the original data in the relational database based on one or more Triples Maps (rr:TriplesMap, Listing 4.1, line 3). Each Triples Map refers to a Logical Table.
A column-valued Term Map (rr:column, Listing 4.3, line 6) generates, by default, a literal from the value of a column in a given Logical Table's row. The language (rr:language, line 7) and datatype (rr:datatype) may optionally be defined.
A template-valued Term Map (rr:template, Listing 4.3, line 8) is a valid string template containing referenced columns and generates an IRI by default. To override the default term type, the term type (rr:termType) needs to be defined explicitly (rr:IRI, rr:Literal, or rr:BlankNode).
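The behaviour of column-valued and template-valued Term Maps can be illustrated with a minimal Python sketch. This is not an R2RML processor: the function names, the row representation as a dictionary, and the example IRI are assumptions made for illustration, and the percent-encoding only approximates the IRI-safe encoding of the recommendation (escaped curly braces in templates are not handled).

```python
import re
from urllib.parse import quote


def column_term(row, column, language=None, datatype=None):
    """Column-valued term map: builds a literal (the default term type)."""
    value = row[column]
    if value is None:
        return None  # a NULL value generates no term
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    lexical = '"' + escaped + '"'
    if language:
        return f"{lexical}@{language}"
    if datatype:
        return f"{lexical}^^<{datatype}>"
    return lexical


def template_term(row, template):
    """Template-valued term map: builds an IRI (the default term type)."""
    def fill(match):
        # Approximation of IRI-safe encoding: percent-encode the column value.
        return quote(str(row[match.group(1)]), safe="")
    return "<" + re.sub(r"\{([^{}]+)\}", fill, template) + ">"


# Hypothetical row of a logical table.
row = {"code": "BE", "name": "Belgium"}
subject = template_term(row, "http://example.com/country/{code}")
label = column_term(row, "name", language="en")
```

Here `subject` is an IRI built from the `code` column and `label` is a language-tagged literal built from the `name` column.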

RML
The RDF Mapping Language (RML) [110,111] expresses customized mapping rules from heterogeneous data structures, formats and serializations to RDF. RML is a superset of R2RML, aiming to extend its applicability and broaden its scope, adding support for heterogeneous data. RML keeps the mapping rules as in R2RML but excludes its database-specific references from the core model. This way, the input data that is limited to a certain database in R2RML (because each R2RML processor may be associated to only one database), becomes a broad set of one or more input data sources in RML.
RML provides a generic way of defining mapping rules referring to different data structures, combined with case-specific extensions, but remains backwards compatible with R2RML, as relational databases form such a specific case. With RML, the mapping rules that define how a knowledge graph is generated from a set of sources that together describe a certain domain can be defined in a combined and uniform way. The mapping rules may be re-used across different sources describing the same domain to incrementally form well-integrated datasets.
In the remainder of this subsection, we discuss in more detail data retrieval and transformations in RML, as well as other representations of RML.

Data Retrieval
Data can originally (i) reside on diverse locations, e.g., files or databases on the local network, or published on the Web; (ii) be accessed using different interfaces, e.g., raw files, database connectivity for databases, or different interfaces from the Web such as Web APIs; and (iii) have heterogeneous structures and formats, e.g., tabular, such as databases or CSV files, hierarchical, such as XML or JSON format, or semi-structured, such as HTML.
In this section, we explain how RML performs the retrieval and extraction steps required to obtain the data whose semantic representation is desired.
Logical Source. RML's Logical Source (rml:LogicalSource, Listing 4.5) extends R2RML's Logical Table and determines the data source with the data to generate the knowledge graph. The R2RML Logical Table definition

</countries>
Listing 4.6. Country data in XML format
Reference Formulation. RML deals with different data serialisations which use different ways to refer to data fractions. Thus, a dedicated way of referring to the data's fractions is considered, while the mapping definitions that define how the RDF terms and triples are generated remain generic. RML considers that any reference to the Logical Source should be defined in a form relevant to the input data, e.g. XPath for XML data or JSONPath for JSON data. To this end, the Reference Formulation (rml:referenceFormulation) declaration is introduced (Listing 4.7, line 4), indicating the formulation (for instance, a standard, query language or grammar) used to refer to its data.

Listing 4.7. A Logical Source specifies its Reference Formulation and iterator
Iterator. In R2RML, a per-row iteration is always assumed. As RML remains generic, the iteration pattern, if any, cannot always be implicitly assumed and needs to be determined. Therefore, the iterator (rml:iterator) is introduced (Listing 4.7, line 5). The iterator determines the iteration pattern over the data source and specifies the extract of the data that is considered during each iteration.
The iterator is not required to be explicitly mentioned in the case of tabular data sources, as the default per-row iteration is implied.
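The difference between an explicit iterator and the implied per-row iteration can be sketched in Python as follows. The sample data is made up (it is not the content of Listing 4.6), the function names are hypothetical, and the iterator is written as `./country` because Python's `xml.etree.ElementTree` evaluates a limited XPath subset relative to the root element; an RML mapping would typically state an absolute XPath expression instead.

```python
import csv
import io
import xml.etree.ElementTree as ET


def iterate_xml(text, iterator):
    """Hierarchical data: the iterator selects the extract for each iteration."""
    root = ET.fromstring(text)
    yield from root.findall(iterator)


def iterate_csv(text):
    """Tabular data: no iterator is needed, one iteration per row is implied."""
    yield from csv.DictReader(io.StringIO(text))


# Made-up sample data for illustration only.
XML_DATA = (
    "<countries>"
    "<country><name>Belgium</name></country>"
    "<country><name>France</name></country>"
    "</countries>"
)
CSV_DATA = "name\nBelgium\nFrance\n"

# References are resolved relative to the extract of the current iteration.
xml_names = [country.findtext("name") for country in iterate_xml(XML_DATA, "./country")]
csv_names = [row["name"] for row in iterate_csv(CSV_DATA)]
```

Both loops visit the same two records; only the way a record is delimited and referenced differs between the formats.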
Source. Data can originally reside in diverse, distributed locations and be accessed using different access interfaces [112]. Data can reside locally, e.g., in files or in a database on the local network, or can be published on the Web. Data can be accessed using diverse interfaces. For instance, metadata may describe how to access the data, such as a dataset's metadata description in the case of data catalogues, or dedicated access interfaces might be needed to retrieve data from a repository, such as database connectivity for databases, or different Web interfaces, such as Web APIs. RML considers an original data source, but the way this input is retrieved remains out of scope, in the same way that it remains out of scope for R2RML how the SQL connection is established. Corresponding vocabularies can describe how to access the data, for instance the dataset's metadata (Listing 4.8), hypermedia-driven Web APIs or services, SPARQL services, and database connectivity frameworks (Listing 4.9) [112].
RDF Term Maps are instantiated with data fractions referred to using a reference formulation relevant to the corresponding data format. Those fractions are derived from data extracted at a certain iteration from a Logical Source. Such a Logical Source is formed by data retrieved from a repository accessed as defined by the corresponding dataset or service description vocabulary.
Language Map. RML introduces a new Term Map for defining the language, the Language Map (rml:LanguageMap, Listing 4.10, line 5), which extends R2RML's language tag (rr:language). The Language Map allows not only constant values for language but also references derived from the input data. rr:language is considered then an abbreviation for the rml:languageMap, as rr:predicate is for the rr:predicateMap.
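The idea that a language tag may be either a constant or a value taken from the input data can be sketched as follows. The dictionary keys (`language`, `languageMap`, `constant`, `reference`) only mimic the corresponding properties; they are assumptions of this sketch, not the RDF syntax of RML.

```python
def language_of(term_map, record):
    """Resolve a language tag from a constant or from a reference into the data."""
    language_map = term_map.get("languageMap")
    if language_map is None and "language" in term_map:
        # The shortcut form is treated as a constant-valued language map.
        language_map = {"constant": term_map["language"]}
    if language_map is None:
        return None  # no language defined for this term map
    if "constant" in language_map:
        return language_map["constant"]
    return record.get(language_map["reference"])


# Hypothetical record in which the language is part of the data.
record = {"label": "België", "lang": "nl"}
constant_tag = language_of({"language": "en"}, record)
derived_tag = language_of({"languageMap": {"reference": "lang"}}, record)
```

With the shortcut, every generated literal receives the same tag; with a reference, each record can contribute its own tag.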

Data Transformations: FnO
Mapping rules involve (re-)modeling the original data, describing how objects are related by specifying correspondences between data in different schemas [126], and deciding which vocabularies and ontologies to use. Data transformations, as opposed to schema transformations that the mapping rules represent, are needed to support any changes in the structure, representation or content of data [367], for instance, performing string transformations or computations.
The Function Ontology (FnO) [102,104] describes functions uniformly, unambiguously, and independently of the technology that implements them. As RML extends R2RML with respect to schema transformations, the combination of RML with FnO extends R2RML with respect to data transformations.
A function (fno:Function) is an activity which has input parameters and an output, and implements one or more algorithms (Listing 4.11, line 1). A parameter (fno:Parameter) is a function's input value (Listing 4.11, line 4). An output (fno:Output) is the function's output value (Listing 4.11, line 5). An execution (fno:Execution) assigns values to the parameters of a function for a certain execution. An implementation (fno:Implementation) defines the internal workings of one or more functions.

Listing 4.11. A function described in FnO that splits a string
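The separation that FnO makes between a function's description, its implementation, and an execution can be mimicked in plain Python. The sketch below is not FnO syntax and does not reproduce Listing 4.11; the class names, the registry, and the `split` example are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Parameter:
    name: str
    type: str


@dataclass(frozen=True)
class Output:
    name: str
    type: str


@dataclass(frozen=True)
class Function:
    """Technology-independent description: no code, only the signature."""
    name: str
    expects: tuple
    returns: Output


# Implementations are registered separately from the descriptions.
IMPLEMENTATIONS = {}

split = Function(
    name="split",
    expects=(Parameter("text", "string"), Parameter("separator", "string")),
    returns=Output("parts", "list of strings"),
)
IMPLEMENTATIONS[split.name] = lambda text, separator: text.split(separator)


def execute(function, **values):
    """An execution assigns values to the parameters and runs an implementation."""
    expected = {parameter.name for parameter in function.expects}
    if set(values) != expected:
        raise ValueError(f"expected parameters {sorted(expected)}")
    return IMPLEMENTATIONS[function.name](**values)


parts = execute(split, text="Brussels,Ghent", separator=",")
```

Because the description carries no code, the same `split` description could be bound to a different implementation without changing any rule that refers to it.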
The Function Map (fnml:FunctionMap) is another Term Map, introduced as an extension of RML to facilitate the alignment of RML and FnO. A Function Map generates a term by executing a function instead of using a constant or a reference to the raw data values. Once the function is executed, its output value is the term generated by this Function Map. The fnml:functionValue property indicates which instance of a function needs to be executed to generate an output, and with which values.
A mapping (Listing 4.13, line 1) contains all definitions that state how subjects, predicates, and objects are generated. Each mapping definition is a key-value pair. The key sources (line 3) defines the set of data sources that are used to generate the entities. Each source is added to this collection via a key-value pair. The value is a collection with three keys: (i) the key access defines the local or remote location of the data source; (ii) the key reference formulation defines the reference formulation used to access the data source; and (iii) the key iterator (conditionally required) defines the path to the different records over which to iterate. The key subjects (line 5) defines how the subjects are generated. The key predicateobjects (line 6) defines how combinations of predicates and objects are generated. Below, the countries example (Listing 4.6) is shown in YARRRML:

[R2]RML Extensions and Alternatives
Other languages were proposed based on differentiation on (i) data retrieval and (ii) data transformations. The table below (

xR2RML
xR2RML [306] was proposed in 2014 at the intersection of R2RML and RML. xR2RML extends R2RML beyond relational databases and extends RML to include non-relational databases. It follows the RML paradigm but is specialized for non-relational databases, in particular NoSQL and XML databases. NoSQL systems have heterogeneous data models (e.g., key-value, document, extensible column, or graph store), as opposed to relational databases. xR2RML assumes, as R2RML does, that a processor executing the rules is connected to a certain database. How the connection or authentication is established against the database is out of the language's scope, as in R2RML. The preferred prefix of the xR2RML vocabulary is xrr and its namespace is http://www.i3s.unice.fr/ns/xr2rml#.
Data Source. Similarly to RML, an xR2RML Triples Map refers to a Logical Source (xrr:logicalSource, Listing 4.14, line 3), but similarly to R2RML, this Logical Source can be either an xR2RML base table (xrr:sourceName, for databases where tables exist) or an xR2RML view representing the results of executing a query against the input database (xrr:query, line 4).
Listing 4.14. xR2RML logical source over an XML database supporting XQuery
Iterator. xR2RML originally introduced the xrr:iterator, analogous to the rml:iterator, to iterate over the results. In a later version, xR2RML converged on using the rml:iterator (Listing 4.14, line 5).
Format or Reference Formulation. In contrast to RML, which considers a formulation (rml:referenceFormulation) to refer to its input data, xR2RML originally specified explicitly the format of the data retrieved from the database using the property xrr:format (Listing 4.15, line 2). For instance, where RML considers XPath, XQuery, or any other formulation to refer to data in XML format, xR2RML would refer to the format itself, e.g. xrr:XML. While RML allows for other kinds of query languages to be introduced, xR2RML decides exactly which query language to use. In an effort to converge with RML, xR2RML optionally considers a reference formulation.
Reference. Similar to RML, xR2RML uses a reference (xrr:reference) to refer to the data elements (Listing 4.14, line 7). xR2RML extends RML's reference to refer to data elements in data with mixed formats. xR2RML considers cases where different formats are nested; for instance, a JSON extract is embedded in a cell of a tabular structure. A path with mixed syntax consists of the concatenation of several path expressions separated by the slash '/' character.
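How a path with mixed syntax could be evaluated is sketched below for the case of a JSON extract embedded in a cell of a tabular row. The constructor names `Column(...)` and `JSONPath(...)` and the row content are assumptions of this sketch, and the JSONPath support is deliberately reduced to dotted keys; the xR2RML specification should be consulted for the actual syntax.

```python
import json
import re


def eval_jsonpath(document, path):
    """Tiny JSONPath subset (assumption): only '$' followed by dotted keys."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    for key in filter(None, path[1:].split(".")):
        document = document[key]
    return document


def evaluate(row, mixed_path):
    """Apply each path expression to the result of the previous one."""
    value = row
    for kind, expression in re.findall(r"(\w+)\(([^()]*)\)", mixed_path):
        if kind == "Column":
            value = value[expression]
        elif kind == "JSONPath":
            if isinstance(value, str):
                value = json.loads(value)  # the cell holds serialized JSON
            value = eval_jsonpath(value, expression)
        else:
            raise ValueError(f"unsupported path constructor: {kind}")
    return value


# Hypothetical row whose 'details' cell embeds a JSON document.
row = {"id": 1, "details": '{"capital": {"name": "Brussels"}}'}
capital = evaluate(row, "Column(details)/JSONPath($.capital.name)")
```

The first expression selects the cell and the second one navigates inside the JSON value stored in it.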

Collections and Containers. Several RDF terms can be generated by a Term Map during an iteration if multiple values are returned. This normally generates several triples, but it can also generate hierarchical values in the form of RDF collections or containers. To achieve the latter, xR2RML extends R2RML by introducing new term types (rr:termType): xrr:RdfList for rdf:List, xrr:RdfBag for rdf:Bag, xrr:RdfSeq for rdf:Seq and xrr:RdfAlt for rdf:Alt. All RDF terms produced by the Object Map during one Triples Map iteration step are then grouped as members of one term. To achieve this, two more constructs are introduced: Nested Term Maps and Push Downs.
Data Source. Mapping tabular data (e.g., CSV) into the Nested Relational Model is straightforward. The model has a one-to-one mapping of tables, rows, and columns, unless a transformation such as splitting a column occurs, which creates a new column that contains a nested table.
Mapping hierarchical data (e.g., JSON, XML) into the Nested Relational Model (NRM) requires, next to the mapping language, a translation algorithm for each data format. Such an algorithm is considered for data in XML and JSON format. If the data is in JSON, an object maps to a single-row table in the NRM with a column for each field. Each column is populated with the value of the appropriate field. Fields with scalar values do not need translation, but fields with array values are translated to their own nested tables: if the array contains scalar or object values, each array element becomes a row in the nested table. If the elements are scalar values like strings, as in the tags field, a default column name "values" is provided. If a JSON document contains a JSON array at the top level, each element is treated like a row in a database table. If the data is in XML format, its elements are treated like JSON objects, and its attributes and repeated child elements are treated as a single-row nested table in which each attribute is a column.
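The translation of JSON into nested tables described above can be sketched in a few lines of Python. A table is represented here as a list of rows and a row as a dictionary, which is an assumption of this sketch rather than the data structure of any actual processor. Treating an object-valued field as a single-row nested table is also an assumption, and arrays nested directly inside arrays are not handled.

```python
import json


def to_row(value):
    """Translate one JSON value into a row of a nested table."""
    if not isinstance(value, dict):
        # Scalar array elements get the default column name "values".
        return {"values": value}
    row = {}
    for field, field_value in value.items():
        if isinstance(field_value, (list, dict)):
            # Arrays become nested tables; nested objects (assumption)
            # become single-row nested tables.
            row[field] = to_nested_table(field_value)
        else:
            row[field] = field_value  # scalar values need no translation
    return row


def to_nested_table(value):
    """An object maps to a single-row table, an array to one row per element."""
    if isinstance(value, list):
        return [to_row(element) for element in value]
    return [to_row(value)]


# Made-up document with a scalar field, a scalar array and an object array.
document = json.loads(
    '{"name": "Belgium", "tags": ["eu", "benelux"], "cities": [{"name": "Ghent"}]}'
)
table = to_nested_table(document)
```

The resulting `table` has one row; its `tags` and `cities` columns each hold a nested table.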
References. To support mapping nested columns in the NRM, the column-valued term map is not limited to SQL identifiers, as it is in R2RML. A JSON array is used to capture the column names that make up the path to a nested column from the document root. The template-valued term map is also extended to include columns that do not exist in the original input but are the result of the transformations applied by the processor.
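Resolving such a JSON-array column path against a nested table could look as follows. The table literal and the function name are hypothetical, and the representation of tables as lists of dictionaries is an assumption of this sketch.

```python
import json


def column_values(table, path):
    """Collect the values of a (possibly nested) column.

    `path` is a JSON array of column names leading from the root table
    to the nested column.
    """
    names = json.loads(path)
    rows = table
    for name in names[:-1]:
        # Descend into the nested table stored in this column of every row.
        rows = [nested_row for row in rows for nested_row in row[name]]
    return [row[names[-1]] for row in rows]


# Hypothetical nested table with one row and a nested 'tags' table.
table = [{"name": "Belgium", "tags": [{"values": "eu"}, {"values": "benelux"}]}]
tags = column_values(table, '["tags", "values"]')
names = column_values(table, '["name"]')
```

A path with a single name behaves like a plain column reference, while longer paths descend one nested table per name.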
Joins. Joins are not supported because they are considered to be impractical and require extensive planning and external support.
Execution Planning. A tag (km-dev:hasWorksheetHistory) is introduced to capture the cleaning, transformation and modeling steps.
Data Transformations. The Nested Transformation Model can also be used to embed transformation functions. A transformation function can create a new set of nested tables instead of transforming the data values.

FunUL
FunUL [232] is an alternative to FnO for data transformations. FunUL allows the definition of functions as part of the mapping language. In FunUL, functions have a name and a body. The name needs to be unique. The body defines the function using a standardized programming language and has a return statement. A call refers to a function with an optional set of parameters.
The class rrf:Function defines a function (Listing 4.17, line 3). A function definition has two properties defining the name (rrf:functionName, line 4), and the function body (rrf:functionBody, line 5).
A function can be called using the property rrf:functionCall (Listing 4.17, line 13). This property refers to an rrf:Function with the property rr:function (line 14). Parameters are defined using rrf:parameterBindings (line 15).
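The idea of defining a function by name and body inside the mapping, and calling it with parameter bindings, can be sketched in Python. This is not FunUL syntax: the registry, the dictionary-based bindings, and the use of Python source text as the function body are assumptions of this sketch. Executing a body with `exec` is only acceptable for trusted mappings.

```python
FUNCTIONS = {}


def define_function(name, body):
    """Register a function given its unique name and its body as source text."""
    if name in FUNCTIONS:
        raise ValueError(f"function name already in use: {name}")
    namespace = {}
    exec(body, namespace)  # the body must define a callable called `name`
    FUNCTIONS[name] = namespace[name]


def call_function(name, parameter_bindings, record):
    """Call a registered function; each binding is a constant or a reference."""
    arguments = [
        binding["constant"] if "constant" in binding else record[binding["reference"]]
        for binding in parameter_bindings
    ]
    return FUNCTIONS[name](*arguments)


# Hypothetical function whose body ends in a return statement.
define_function("toUpper", "def toUpper(text):\n    return text.upper()\n")
result = call_function("toUpper", [{"reference": "name"}], {"name": "Belgium"})
```

In contrast to the FnO approach, the implementation travels with the mapping here, which keeps the mapping self-contained but ties it to the language of the function body.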