1 Introduction

In 2012, the Relational to RDF Mapping Language (R2RML) [37] was released as a W3C Recommendation. The R2RML ontology [8] provides a vocabulary to describe how an RDF graph should be generated from data in a relational database (RDB). Although R2RML gained wide adoption, its potential applicability beyond RDBs quickly appeared as a salient need [49, 63, 76, 87].

Targeting the generation of RDF from heterogeneous data sources other than RDBs, several extensions [49, 76, 87] preserving R2RML's core structure were proposed. As R2RML and this growing number of extensions were applied to a wider range of use cases, more limitations became evident [76]. Consequently, these languages were further extended with different features, e.g., the description of input data sources or output RDF (sub)graphs [76, 96], data transformations [40, 47, 66, 69], and support for RDF-star [48, 91]. Over the years, the RDF Mapping Language (RML) has gathered a large community of contributors and users, as well as a plethora of systems [26, 95] and benchmarks [31, 33, 59].

Until recently, there was no well-defined, agreed-upon set of features for the RML mapping language, nor an ontology marking consensus over the whole feature set. Consequently, it has become challenging for non-experts to fully comprehend this landscape and utilize all capabilities without investing substantial research effort. Therefore, the W3C Community Group on Knowledge Graph Construction [3], with more than 160 members, has convened every two weeks over the past three years to review the RML specification.

In this paper, we present the new modular RML ontology and the accompanying SHACL shapes [61] that complement the specification. We discuss the motivation and challenges that emerged when extending R2RML, the methodology we followed to design the new ontology while ensuring backward compatibility with R2RML, and the novel features that increase its expressiveness. The new RML ontology and specification are the result of an effort to (i) address multiple use cases from the community [30], (ii) streamline and integrate the various features proposed to support these use cases, and (iii) adopt agreed-upon design practices that yield a coherent, integrated whole consisting of a core ontology [97] and multiple feature-specific modules [41, 46, 64, 98]. The presented ontology consolidates the potential of RML, enabling the definition of mapping rules for constructing RDF graphs that were previously unattainable, and the development of systems in adherence with both R2RML and RML.

This paper is organized as follows: In Sect. 2, we present the relevant concepts of R2RML and RML. In Sect. 3, we outline the motivations that drive this work and the challenges we tackle. In Sect. 4, we describe the methodology employed to redesign the RML ontology while maintaining backward compatibility, and in Sect. 5 the modules introduced with the various features. In Sect. 6, we present the early adoption and potential impact, followed by related work in Sect. 7. We conclude the paper with a summary of the presented contributions and future steps in Sect. 8.

2 Background: R2RML

R2RML mapping rules (Listing 2) are grouped within a rr:TriplesMap (line 1), which contains one rr:LogicalTable, one rr:SubjectMap, and zero or more rr:PredicateObjectMaps. The rr:LogicalTable (lines 2–3) describes the input RDB, while the rr:SubjectMap (lines 4–5) specifies how the subjects of the triples are created. A rr:PredicateObjectMap (lines 6–9) generates the predicate-object pairs with one or more rr:PredicateMaps (line 7) and one or more rr:ObjectMaps (lines 8–9). Zero or more rr:GraphMaps, which indicate how named graphs are generated, can be assigned to both the rr:SubjectMap and the rr:PredicateObjectMap. It is also possible to join rr:LogicalTables by replacing an rr:ObjectMap with an rr:RefObjectMap, which uses the subject of another Triples Map, indicated by rr:parentTriplesMap, as the object of the triple. Such a join may carry a condition, expressed with rr:joinCondition, rr:child, and rr:parent. Subject Maps, Predicate Maps, Object Maps, and Graph Maps are subclasses of rr:TermMap, which defines how RDF terms are generated. Term Maps can be (i) constant-valued, i.e., always generating the same RDF term (line 7); (ii) column-valued, i.e., obtaining the RDF terms directly from the cells of a column in the RDB (line 9); or (iii) template-valued, i.e., composing the RDF terms from data in columns and constant strings (line 5).

Listing 2. Example of R2RML mapping rules.
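A minimal R2RML sketch consistent with this description is shown below (a reconstruction, not the paper's original listing; the table name, column names, and the ex: namespace are assumptions), with comments matching the line references in the text:

```turtle
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.com/> .

<#AthleteMap> a rr:TriplesMap ;                    # line 1: the Triples Map
  rr:logicalTable [                                # lines 2-3: input RDB table
    rr:tableName "ATHLETE" ] ;
  rr:subjectMap [                                  # lines 4-5: subject creation,
    rr:template "http://example.com/{ID}" ] ;      #   template-valued Term Map
  rr:predicateObjectMap [                          # lines 6-9
    rr:predicateMap [ rr:constant ex:name ] ;      # line 7: constant-valued
    rr:objectMap [
      rr:column "NAME" ] ] .                       # line 9: column-valued
```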

According to the R2RML specification, an R2RML processor is a system that, given a set of R2RML mapping rules and an input RDB, can construct RDF graphs. Therefore, an R2RML processor should have an SQL connection to the input RDB where the tables reside and a base IRI used to resolve the relative IRIs produced by the R2RML mapping rules.

3 Motivation and Challenges

In this section, we discuss the limitations of generalizing R2RML to construct RDF graphs from heterogeneous data sources, and their impact on the ontology. We also consider the extensions the ontology requires to construct RDF graphs that were not possible before, e.g., RDF collections and containers or RDF-star [56]. Based on these limitations, we group the challenges into the following high-level categories: data input and RDF output, schema and data transformations, collections and containers, and RDF-star.

Data Input and RDF Output. In R2RML, the desired RDF graph is constructed from tables residing in a single RDB. R2RML recommends hard-coding the connection to the RDB in the R2RML processor; hence, rules in a mapping document cannot refer to multiple input RDBs.

To date, a wide range of data formats and structures beyond RDBs is considered, such as CSV, XML, or JSON. These sources may be available locally or via Web APIs, statically or as streams. Thus, a flexible approach for constructing RDF graphs from a combination of these diverse inputs is desired [95]. The R2RML ontology needs to be extended to also describe the data source of each set of mapping rules, e.g., a NoSQL DB or a Web API, and the data format, e.g., CSV, JSON, or XML. In addition, a per-row iteration pattern is assumed for RDBs, but this may vary for other data formats.

RML [49] proposed how to describe heterogeneous data, originally assuming that the data appears in local files whose paths are specified as literal values. In parallel, xR2RML [76] proposed how to extend R2RML for the document-oriented MongoDB. A more concrete description of data sources and their access, e.g., RDBs, files, Web APIs, etc., was later proposed [50], relying on well-known vocabularies to describe the data sources, e.g., DCAT [74], VOID [23], or SPARQL-SD [101], and further extended [96] to also describe the output RDF (sub)graphs. The description of NULL values [94], predetermined in RDBs but not in other data sources, has not been addressed yet.

Schema and Data Transformations. Integrating heterogeneous data goes beyond schema-level transformations, as it usually involves additional data-level transformations [67]. The R2RML ontology describes schema transformations, i.e., the correspondences between the ontology and the data schema. It delegates data transformations and joins to the storage layer, using operators in SQL queries. However, not all data formats can leverage similar operators, e.g., JSON has no formal specification to describe its data transformations, nor do all formats' operators cover the same data transformations, e.g., XPath offers a different set of operators compared to SQL. Moreover, there are cases in which such pre-processing is not possible, e.g., for streaming data. Thus, the R2RML ontology needs to be extended to describe such data transformations.

RML+FnO [39], R2RML-F [47], its successor FunUL [66], and D-REPR [100] are examples of proposals providing support for data operations. However, only RML+FnO describes the transformation functions declaratively. RML+FnO has been well adopted by the community, being included in a number of RML-compliant engines [25, 58, 59, 86] and RML+FnO-specific translation engines [65, 85]. Nevertheless, a more precise definition is needed to address ambiguities, as well as a simplification of the (complex) constructs it introduced.

Collections and Containers. RDF containers represent open sets of RDF terms, ordered (rdf:Seq) or unordered (rdf:Bag, rdf:Alt). Their member terms are denoted with the rdf:_n properties. RDF collections refer solely to the type rdf:List, which represents a closed, ordered list of RDF terms. An RDF list is built using cons-pairs: the first cons-pair of a list refers to an element of that list with the rdf:first property, and to the remainder of the list with the rdf:rest property. All list elements are traversed via rdf:rest until the empty list rdf:nil is reached. Generating RDF collections and containers in R2RML, while possible, is cumbersome and limited. A container's rdf:_n properties can typically only be generated when a key in the form of a positive integer is yielded from the data source. By contrast, there is no elegant way to model a list's cons-pairs with R2RML if the list is of arbitrary length.
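For illustration, a two-element collection expands into the following cons-pair triples (illustrative Turtle; the ex: namespace and member IRIs are placeholders):

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:  <http://example.com/> .

# The collection (ex:a ex:b) stored as cons-pairs:
ex:subject ex:contains _:c1 .
_:c1 rdf:first ex:a ;        # first element of the list
     rdf:rest  _:c2 .        # remainder of the list
_:c2 rdf:first ex:b ;
     rdf:rest  rdf:nil .     # the empty list terminates the chain
```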

RDF containers and collections are needed in several projects [75]: e.g., both the Metadata Authority Description Schema [15] and W3C's XHTML Vocabulary [7] use RDF containers, and both OWL [102] and SHACL [68] use RDF collections. The xR2RML [76] vocabulary supported the generation of nested collections and containers within the same data source iteration (e.g., within one result among the results returned by the MongoDB database). Its vocabulary also allowed changing the iterator within a Term Map to yield nested collections and containers. By contrast, [45] provided terms for creating (nested) collections and containers both within an iteration (same row) and across iterations (across rows), as well as a property for retaining empty collections and containers. The ontology presented in [45] also provided directives for generating collections or containers whose members may have different term types, whereas [76]'s vocabulary supported a single term type. Neither vocabulary supported named collections and containers, nor their generation as subjects.

RDF-star. RDF-star [55] introduces the quoted triple term, which can be embedded in the subject or object of another triple. Quoted triples may be asserted (i.e., included in the graph) or not. RDF-star quickly gained popularity, leading to its adoption by a wide range of systems [4] (e.g., Apache Jena [24], Oxigraph [78]) and the formation of the RDF-star Working Group [5].

The inception of RDF-star came after R2RML; therefore, R2RML only considered the generation of RDF. The principal challenge is the generation of quoted and asserted triples, which requires a dedicated extension. RML-star [48] and R2RML-star [91] are extensions of R2RML to construct RDF-star graphs; however, the latter comes with limitations and is not backward compatible with R2RML. Our ontology includes the RML-star extension to enable the generation of RDF-star graphs while remaining backward compatible with R2RML.

4 Methodology

We followed the Linked Open Terms (LOT) methodology [80] to redesign the R2RML ontology, as well as to generalize and modularize it. The methodology includes four major stages: Requirements Specification, Implementation, Publication, and Maintenance. We describe below how we follow these steps to develop the RML ontology and the accompanying SHACL shapes.

Requirements. The requirements to build the RML ontology are mainly derived from three sources: (i) the legacy of the R2RML ontology, (ii) the scientific publications that proposed different extensions [63, 95], and (iii) the experience of the R2RML and RML community in building upon their limitations. The latter has been gathered from GitHub issues [20] and summarized as mapping challenges [13]. The complete set of requirements for each module can be accessed from the ontology portal [21]. These requirements cover both the base needs and fine-grained features for generating triples with mapping rules: on the one hand, how to generate subjects, predicates, objects, datatypes, language tags, and named graphs in both a static (constant) and dynamic (from data sources) manner (RML-Core); on the other hand, the description and access of input data sources and target output data (RML-IO), RDF collections and containers to create lists from diverse terms (RML-CC), data transformation functions with their desired output and input parameters (RML-FNML), and quoting Triples Maps to create asserted and non-asserted RDF-star triples (RML-star).

Implementation. We build the RML ontology from the requirements in a modular manner, maintaining backward compatibility with R2RML. We use a GitHub organization [2] to summarize issues and coordinate asynchronously.

Modularity. The ontology is composed of five modules: RML-Core, RML-IO, RML-CC, RML-FNML, and RML-star. We opt for a modular design to facilitate development and maintenance, as each module can be adjusted independently without affecting the rest. This choice also facilitates reuse and adoption, as RML processors can implement specific modules instead of the entire ontology.

Modeling. The modeling of each module is carried out independently. A version is drafted from the requirements and presented to the community. For this iteration step, we draft the proposal using ontology diagrams that follow the Chowlk notation [34], along with use cases and examples. Once it is agreed that the model is accurate and meets the requirements, the ontology is encoded.

Encoding. We encode the ontology using OWL [28] and its application profile using SHACL [68]. We deliberately use both to distinguish between the model, described in OWL, and the constraints, described as SHACL shapes. The latter specify how the different ontology constructs should be used within a mapping document, e.g., a Triples Map can only contain exactly one Subject Map. Hence, they allow validating the correctness of mapping rules, depending on which modules are used. This way, RML processors can indicate which modules they support and verify the mapping rules' compliance before executing them.
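As an illustration, the "exactly one Subject Map" constraint could be expressed along the following lines (a sketch only; the published shapes are more elaborate, e.g., they also account for shortcut properties):

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix rml: <http://w3id.org/rml/> .

[] a sh:NodeShape ;
  sh:targetClass rml:TriplesMap ;
  sh:property [
    sh:path rml:subjectMap ;   # every Triples Map needs exactly one Subject Map
    sh:minCount 1 ;
    sh:maxCount 1 ] .
```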

Backward Compatibility. The new RML ontology is backward compatible with the previous [R2]RML ontologies [17]. We first gather all terms affected by the RML-Core and RML-IO modules (the other modules only introduce new features) and define correspondences between the past and new resources. We identify two kinds of correspondences: (i) equivalences, if a resource is used in the same manner and its semantics are not significantly changed (e.g., rr:SubjectMap is equivalent to rml:SubjectMap); and (ii) replacements, if a resource is superseded by another one (e.g., rr:logicalTable is replaced by rml:logicalSource). A summary of these correspondences is available online [18], as well as a semantic version to enable automatic translation of mapping rules [17].
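Such correspondences can themselves be expressed as triples, supporting the automatic translation of mapping rules (a sketch; the exact properties used in the published semantic version [17, 18] may differ):

```turtle
@prefix rr:      <http://www.w3.org/ns/r2rml#> .
@prefix rml:     <http://w3id.org/rml/> .
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

# (i) equivalence: same usage, semantics not significantly changed
rr:SubjectMap owl:equivalentClass rml:SubjectMap .

# (ii) replacement: resource superseded by another one
rr:logicalTable dcterms:isReplacedBy rml:logicalSource .
```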

Evaluation. We evaluate the ontology with OOPS! [81] and check for inconsistencies using the HermiT reasoner. If all issues are solved, a module is deployed.

Table 1. List of modules of the RML ontology.

Publication. All modules of the RML ontology are managed and deployed independently from a separate GitHub repository, and published using a W3ID URL under the CC-BY 4.0 license. Each repository contains the ontology file, its documentation (created using Widoco [52]), requirements, associated SHACL shapes, and the module’s specification. We follow a unified strategy for the resources’ IRIs. The RML ontology resources use a single prefix IRI to make it convenient for users to convert their RML mappings to the new RML ontology, while clearly stating which module each resource belongs to. We publish the complete ontology at http://w3id.org/rml/, and a summary of all modules with links to all their related resources (i.e. SHACL shapes, issues, specifications, etc.) is available at the ontology portal [21] and in Table 1.

Maintenance. To ensure that the ontology is kept up to date with corrections and new features during its life cycle, the GitHub issue tracker of each module is used to gather suggestions for additions, modifications, and deletions. Every major modification is discussed asynchronously and in the W3C KG Construction Community Group meetings until agreed upon, which triggers another round of implementation and publication, leading to new releases.

5 Artifacts: Ontologies and Shapes

The RML ontology consists of 5 modules: (i) RML-Core (Sect. 5.1) describes the schema transformations, generalizes and refines the R2RML ontology, and becomes the basis for all other modules; (ii) RML-IO (Sect. 5.2) describes the input data and output RDF (sub)graphs; (iii) RML-CC (Sect. 5.3) describes how to construct RDF collections and containers; (iv) RML-FNML (Sect. 5.4) describes data transformations; and (v) RML-star (Sect. 5.5) describes how RDF-star can be generated. Figure 1 shows an overview of all modules of the RML ontology and how they are connected. The modules build upon RML-Core, which, in turn, builds upon R2RML, but are independent of one another. We illustrate each module by continuing the example of Sect. 2.

5.1 RML-Core: Schema Transformations

The RML-Core is the main module of the RML ontology; it generalizes and refines the R2RML ontology, and all other modules build on top of it. The RML-Core ontology consists of the same concepts as the R2RML ontology (rml:TriplesMap, rml:TermMap, rml:SubjectMap, rml:PredicateMap, rml:ObjectMap, rml:PredicateObjectMap, and rml:ReferencingObjectMap), but redefines them to distinguish them from their R2RML counterparts.

Fig. 1. RML ontology overview following the Chowlk diagram notation [34].

RML-Core refines the R2RML ontology by introducing the concept of Expression Map (rml:ExpressionMap) (Listing 4, lines 6 & 9). An Expression Map is a mapping construct that can be evaluated on a data source to generate values during the mapping process, the so-called expression values. The R2RML specification already included such mapping constructs (template-based, column-based, or constant-based), but they could only be applied to subject, predicate, object, and named graph terms. In RML, an Expression Map can be a template expression, specified with the property rml:template; a reference expression, specified with the property rml:reference; or a constant expression, specified with the property rml:constant. A Term Map becomes a subclass of Expression Map.

With the introduction of the Expression Map, the language, term type, parent and child properties can be specified using any Expression Map, and not only a predefined type of expression. To achieve this, the following concepts are introduced as subclasses of the Expression Map: (i) rml:LanguageMap, whose shortcut rml:language can be used if it is a constant-valued Expression Map; (ii) rml:DatatypeMap, whose shortcut rml:datatype can be used if it is a constant-valued Expression Map; (iii) rml:ParentMap, whose shortcut rml:parent can be used if it is a reference-valued Expression Map; and (iv) rml:ChildMap, whose shortcut rml:child can be used if it is a reference-valued Expression Map.

Listing 4 shows an example of a basic mapping to create RDF triples from the JSON file in Listing 3, and whose Logical Source is defined in Listing 5.

Listings 3 and 4. Example JSON input and basic RML mapping rules.
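A minimal reconstruction consistent with the description above could look as follows (the JSON structure, field names, and the ex: namespace are assumptions):

```turtle
# Listing 3 (assumed input, athletes.json):
#   { "people": [ { "name": "Duplantis", "mark": "6.21" }, ... ] }

@prefix rml: <http://w3id.org/rml/> .
@prefix ex:  <http://example.com/> .

<#AthleteMap> a rml:TriplesMap ;
  rml:logicalSource <#JSONSource> ;                # defined as in Listing 5
  rml:subjectMap [
    rml:template "http://example.com/{name}" ] ;   # template expression
  rml:predicateObjectMap [
    rml:predicateMap [ rml:constant ex:mark ] ;    # constant expression
    rml:objectMap [ rml:reference "mark" ] ] .     # reference expression
```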

5.2 RML-IO: Source and Target

RML-IO complements RML-Core by describing the input data sources and how they can be retrieved. To achieve this, RML-IO defines the Logical Source (rml:LogicalSource) for describing the input data, and the Source (rml:Source) for accessing it. The Logical Source specifies the grammar used to refer to the input data via the Reference Formulation (rml:ReferenceFormulation). For instance, in Listing 5, the Reference Formulation is JSONPath (Line 4). RML-IO refers to a set of predefined Reference Formulations (JSONPath, XPath, etc.), but others can be considered as well. Besides the Reference Formulation, the Logical Source also defines how to iterate over the data source through the iteration pattern, with the property rml:iteration (Line 5). In a Triples Map, the property rml:logicalSource specifies the Logical Source to use and should be specified once. The Source specifies how a data source is accessed by leveraging existing specifications; it also indicates when values should be considered NULL (rml:null) and, if needed, a query (rml:query), e.g., an SQL or SPARQL query. In a Logical Source, the property rml:source refers to exactly one Source.

Similarly, RML-IO includes the Logical Target (rml:LogicalTarget) and the Target (rml:Target) to define how the output RDF is exported. A Logical Target includes the property rml:serialization, to indicate the RDF serialization in which the output should be encoded, and rml:target, to refer to exactly one Target. The Logical Target can optionally be specified in any Term Map, e.g., a Subject or Graph Map. The Target is similar to the Source, indicating how the output target can be accessed, and has two properties: (i) rml:compression, to specify whether and how the RDF output is compressed, e.g., GZip, and (ii) rml:encoding, to define the encoding, e.g., UTF-8.

Both Source and Target consider the re-use of existing vocabularies to incorporate additional features to access data, such as DCAT [74], SPARQL-SD [101], VoID [23], D2RQ [36], and CSVW [93]. Listing 5 shows an example of an RML mapping that describes a JSON file (Listing 3) with a Logical Source (lines 1–5). A Logical Target is also used to export the resulting triples to a Turtle file with GZip compression (lines 7–12). Since the Logical Target is specified within the subject (line 18), all triples with that subject are exported to this target.

Listing 5. Example Logical Source and Logical Target.
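A reconstruction following the description above could be (the file paths, the DCAT-based source description, and the exact compression and serialization IRIs are assumptions):

```turtle
@prefix rml:     <http://w3id.org/rml/> .
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix formats: <https://www.w3.org/ns/formats/> .

<#JSONSource> a rml:LogicalSource ;                # Logical Source (lines 1-5)
  rml:source [ a rml:Source, dcat:Distribution ;
    dcat:downloadURL <file:///data/athletes.json> ] ;
  rml:referenceFormulation rml:JSONPath ;
  rml:iteration "$.people[*]" .

<#TurtleTarget> a rml:LogicalTarget ;              # Logical Target (lines 7-12)
  rml:target [ a rml:Target, dcat:Distribution ;
    dcat:downloadURL <file:///out/athletes.ttl.gz> ;
    rml:compression rml:gzip ] ;                   # assumed compression IRI
  rml:serialization formats:Turtle .
```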

5.3 RML-CC: Collections and Containers

As the RML Collections and Containers module is fundamentally new, the Community Group formulated a set of functional requirements that the specification should meet. One should be able to: (i) collect values from one or more Term Maps, including multi-valued Term Maps; (ii) control the generation of empty collections and containers; (iii) generate nested collections and containers; (iv) group different term types; (v) use a generated collection or container as a subject; and (vi) assign an IRI or blank node identifier to a collection or container. Based on these requirements, the RML-CC module was introduced, consisting of a new concept, the Gather Map (rml:GatherMap), two mandatory properties (rml:gather and rml:gatherAs), and two optional properties. Even though the module comprises few ontology terms, significant effort is expected from RML processors to support it; we therefore decided to keep it as a separate module.

We specified the Gather Map (rml:GatherMap) as a Term Map with two mandatory properties: (i) rml:gather, to specify the list of Term Maps used for the generation of a collection or container, and (ii) rml:gatherAs, to indicate what is generated (one of rdf:List, rdf:Bag, rdf:Seq, and rdf:Alt). The rml:gather property accepts any type of Term Map, including Referencing Object Maps (to use the subjects generated by another Triples Map), which are treated as multi-valued Term Maps. Other properties were defined with default values to facilitate the use of this extension: rml:allowEmptyListAndContainer and rml:strategy. By default, a Gather Map shall not yield empty containers and collections; the property rml:allowEmptyListAndContainer must be set to true to preserve them. Also by default, the values of multiple multi-valued Term Maps are appended from left to right; rml:append is the default rml:strategy. Alternatively, the rml:cartesianProduct strategy instructs the processor to carry out a Cartesian product between the terms generated by each Term Map. The rml:strategy property renders the vocabulary extensible: RML implementations may propose their own strategies for generating collections and containers from a list of multi-valued Term Maps.

Listing 6 demonstrates support for several of the aforementioned requirements: the collection of values from a multi-valued Term Map, the generation of a named collection, and the use of a collection as a subject. It generates a subject that is related to a list via ex:contains. The values of that list are collected from a multi-valued Term Map generating IRIs from the names, resulting in the following RDF: :Ranking23 ex:contains (:Duplantis :Guttormsen :Vloon).

Listing 6. Example Gather Map generating an RDF collection.
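A simplified reconstruction matching the generated RDF above could be (the JSON field names are assumptions, and the named-collection aspect of the original listing is omitted for brevity):

```turtle
@prefix rml: <http://w3id.org/rml/> .
@prefix ex:  <http://example.com/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<#RankingMap> a rml:TriplesMap ;
  rml:logicalSource <#JSONSource> ;
  rml:subjectMap [ rml:template "http://example.com/Ranking{id}" ] ;
  rml:predicateObjectMap [
    rml:predicate ex:contains ;
    rml:objectMap [                                 # a Gather Map
      rml:gather ( [                                # multi-valued Term Map
        rml:template "http://example.com/{people[*].name}" ] ) ;
      rml:gatherAs rdf:List ] ] .
```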

If a Gather Map is an empty Expression Map, a new blank node is created for the head node of each generated collection or container (the first cons-pair in the case of a collection). Conversely, when a template, constant, or reference is provided, the head node is assigned the generated IRI or blank node identifier. If unchecked, this may lead to the generation of collections or containers that share the same head node, which we refer to as ill-formed. Therefore, the specification details the behavior that a processor must adopt: when a Gather Map creates a named collection or container, it must first check whether a named collection or container with the same head node IRI or blank node identifier already exists; if so, it must append the new terms to the existing one.

5.4 RML-FNML: Data Transformations

The RML-FNML module enables the declarative evaluation of data transformation functions defined with the Function Ontology (FnO) [40] in RML. Thus, data transformation functions in RML are independent of specific processors. Functions and Executions are described with FnO, while FNML declares the evaluation of FnO functions on specific data sources. The evaluation of a function is defined through a Function Execution (rml:FunctionExecution), in which a Function Map (rml:FunctionMap) defines the function. The input values' definitions are provided through Term Maps using Inputs (rml:Input), which in turn include Parameter Maps (rml:ParameterMap) referring to Function Parameters defined by FnO. The Function Execution's output is declared using a Return Map (rml:ReturnMap) and referred to by the rml:return property, enabling the reference to a specific output among a function's multiple outputs.

Listing 7. Example data transformation with RML-FNML.

Listing 7 shows the use of a date-formatting function. Within an Object Map, the Function Execution (Line 7) and the type of Return (Line 8) are defined. The Function Execution describes which Function is used (Line 11) and its two Inputs: the data reference (Lines 12–14) and the output date format (Lines 15–17).
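A reconstruction following this description could be (the GREL function, parameter, and output IRIs, as well as the shortcut properties connecting parameters to values, are assumptions):

```turtle
@prefix rml:  <http://w3id.org/rml/> .
@prefix ex:   <http://example.com/> .
@prefix grel: <http://users.ugent.be/~bjdmeest/function/grel.ttl#> .

<#DatePOM> a rml:PredicateObjectMap ;
  rml:predicate ex:date ;
  rml:objectMap [
    rml:functionExecution <#FormatDate> ;        # which execution to evaluate
    rml:return grel:dateOut ] .                  # which output to use

<#FormatDate> a rml:FunctionExecution ;
  rml:function grel:toDate ;                     # FnO function (assumed IRI)
  rml:input
    [ a rml:Input ;                              # the data reference
      rml:parameterMap [ rml:constant grel:valueParam ] ;
      rml:inputValueMap [ rml:reference "date" ] ] ,
    [ a rml:Input ;                              # the output date format
      rml:parameterMap [ rml:constant grel:formatParam ] ;
      rml:inputValueMap [ rml:constant "dd/MM/yyyy" ] ] .
```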

5.5 RML-star: RDF-star Generation

The building block of RML-star [48] is the Star Map (rml:StarMap). A Star Map can be defined in a Subject Map or an Object Map, generating quoted triples in the corresponding positions of the output triples. The Triples Map generating the quoted triples is connected to a Star Map via the object property rml:quotedTriplesMap. Quoted Triples Maps specify whether they are asserted (rml:AssertedTriplesMap) or non-asserted (rml:NonAssertedTriplesMap). Listing 8 uses an Asserted Triples Map (lines 1–5) to generate triples with the marks of some athletes, annotated through a Star Map with the date on which the marks were accomplished (lines 7–11).

Listing 8. Example RML-star mapping with an Asserted Triples Map.
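A reconstruction based on this description could be (the JSON field names and the ex: properties are assumptions):

```turtle
@prefix rml: <http://w3id.org/rml/> .
@prefix ex:  <http://example.com/> .

<#MarksTM> a rml:AssertedTriplesMap ;            # lines 1-5: quoted & asserted
  rml:logicalSource <#JSONSource> ;
  rml:subjectMap [ rml:template "http://example.com/{name}" ] ;
  rml:predicateObjectMap [
    rml:predicate ex:mark ;
    rml:objectMap [ rml:reference "mark" ] ] .

<#AnnotationTM> a rml:TriplesMap ;               # lines 7-11: annotates them
  rml:logicalSource <#JSONSource> ;
  rml:subjectMap [
    rml:quotedTriplesMap <#MarksTM> ] ;          # Star Map in subject position
  rml:predicateObjectMap [
    rml:predicate ex:achievedOn ;
    rml:objectMap [ rml:reference "date" ] ] .
```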

6 Early Adoption and Potential Impact

Over the years, the RML mapping language has gathered a large community of contributors and users: a plethora of systems were developed [26, 95], benchmarks were proposed [31, 33, 59], and tutorials were given [9,10,11,12, 16, 99].

During the last decade, many initiatives have used the different extensions of R2RML that contributed to the modular RML ontology to construct RDF graphs from heterogeneous data, e.g., for COVID-19-related data [77, 84, 90], biodiversity [22, 62, 79], streaming data analysis and visualisation [38, 44], social networks' data portability [43], social media archiving [73], supply chain data integration [42], public procurement [88], agriculture [29], and federated advertisement profiles [72]. These extensions were also incorporated into services, e.g., Chimera [53], Data2Services [14], InterpretME [35], and even the Google Enterprise Knowledge Graph to construct and reconcile RDF [6].

As the modules of the RML ontology take shape, an increasing number of systems embrace its latest version. The RMLMapper [58], until now the RML reference implementation, currently supports the RML-Core and RML-IO modules. The SDM-RDFizer [59, 60] and Morph-KGC [25], two broadly used RML processors, already integrate the generation of RDF-star with the RML-star module. Additionally, Morph-KGC supports transformation functions through the RML-FNML module. The adoption of RML was facilitated by YARRRML [57], a human-friendly serialization of RML; Yatter [32] is a YARRRML translator that already translates this serialization to the RML-Core, RML-star, RML-IO, and RML-FNML modules. The increasing number of systems that support different modules illustrates the benefits of the modular approach of the RML ontology: each system implements a set of modules, without the necessity of supporting the complete ontology, while different use cases can choose a system based on the modules it supports.

As an increasing number of systems support the RML ontology proposed in this paper, several real-world use cases have already adopted the ontology as well. NORIA [92] extends RMLMapper in its MASSIF-RML [82] system, implemented at Orange, to construct RDF graphs for anomaly detection and incident management. InteGraph [70] uses RML to construct RDF graphs in the soil ecology domain, and CLARA [19] to construct RDF-star for educational modules, both using Morph-KGC. At the European level, two governmental projects use RML to integrate their data into RDF graphs. In the transport domain, the European Railway Agency constructs RDF from the Register of Infrastructure data, distributed in XML, using RML mapping rules [83] and taking advantage of the RML-IO module. The Public Procurement Data Space [54] is an ongoing project that integrates procurement data, distributed in various formats, from all EU member states using the e-Procurement Ontology [1], with mapping rules in RML using the RML-Core and RML-star modules on its roadmap.

The RML ontology and SHACL shapes are maintained by the W3C Knowledge Graph Construction Community Group, and part of the larger International Knowledge Graph Construction community. This community will continue to maintain these resources and a call for systems to incorporate the new specification and ontology will be launched. The aim is to have at least two reference implementations for each module in RML systems by the end of 2023.

7 Related Work

A large number of mapping languages have been proposed to enable the construction of RDF graphs from different data sources [63, 95]. Apart from the RDF-based languages that considerably influenced the RML ontology (see Sect. 3), we highlight here the popular alternatives that rely on the syntax of query languages (SPARQL-Generate [71], SPARQL-Anything [27], NORSE [89]), constraint languages (e.g., ShExML [51]), or data serialization languages (e.g., D-REPR [100]).

SPARQL-Generate [71] leverages the expressive power of the SPARQL query language and extends it with additional clauses to describe the input data. Its handling of input data is comparable to RML's: it supports a wide range of data sources, describes their access, and defines an iterator and reference formulations to describe the input data. However, while SPARQL-Generate describes input data and its access, it does not consider the specification of target outputs as the new RML ontology does. SPARQL-Generate supports collections and containers, but they can only be placed as objects; embedded collections and containers are not allowed. Lastly, despite being developed on top of Apache Jena, which already supports RDF-star and SPARQL-star, the GENERATE clause proposed by SPARQL-Generate does not currently support the construction of RDF-star graphs.

SPARQL-Anything [27] introduces Facade-X, which overrides the SERVICE clause to describe the input data. It implements all SPARQL and SPARQL-star features, including the generation of RDF-star graphs. However, the construction of well-formed collections and containers with the CONSTRUCT clause is limited. To overcome this, SPARQL-Anything proposes a bespoke function, fx:bnode, which returns the same blank node identifier for the same input. Hence, while blank nodes are addressed, the generation of lists remains complex. Both SPARQL-Generate and SPARQL-Anything offer limited support for data transformations, as they are bound to the functions provided by their respective implementations. While SPARQL allows custom functions, these are implementation-dependent; declarative data transformations, as the new RML ontology proposes, are not possible.
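As a sketch (the file name and the ex: property are hypothetical), the Facade-X view of a JSON file can be queried through the overridden SERVICE clause:

```sparql
PREFIX fx:  <http://sparql.xyz/facade-x/ns/>
PREFIX xyz: <http://sparql.xyz/facade-x/data/>
PREFIX ex:  <http://example.org/>

CONSTRUCT { ?s ex:name ?name . }
WHERE {
  # The x-sparql-anything: IRI scheme triggers the Facade-X translation
  # of the referenced resource into RDF.
  SERVICE <x-sparql-anything:location=people.json> {
    ?root fx:anySlot ?person .
    ?person xyz:name ?name .
    # fx:bnode returns the same blank node for the same input,
    # the workaround for stable identifiers discussed above.
    BIND(fx:bnode(?person) AS ?s)
  }
}
```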

More recently, Stadler et al. [89] presented an approach that processes SPARQL CONSTRUCT queries translated from RML mappings. It includes an extended implementation of the SERVICE operator to process non-RDF data sources. These sources are described with NORSE, a vocabulary tailored to this implementation that allows embedding RML source descriptions inside queries. Hence, this approach relies entirely on the expressiveness of RML features while leveraging SPARQL implementations.

Despite the expressive power of these languages, systems implementing them must provide complete SPARQL support and either extend the language with new clauses or modify the semantics of existing ones to support the construction of RDF graphs. By contrast, the modular approach presented for the RML ontology lets systems implement a basic set of features without forcing support for the entire language, and ensures long-term sustainability, as new modules of the ontology can be proposed without affecting current ones.
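To contrast with the query-based languages above, a minimal mapping needs only the RML-Core and RML-IO modules. The sketch below uses the new single rml: namespace; the file path and ex: terms are illustrative, and the source description follows the draft RML-IO module, whose details may still evolve:

```turtle
@prefix rml: <http://w3id.org/rml/> .
@prefix ex:  <http://example.org/> .

<#PersonMapping> a rml:TriplesMap ;
  rml:logicalSource [
    # RML-IO: where the data lives and how to iterate over it.
    rml:source [ a rml:RelativePathSource ; rml:path "people.json" ] ;
    rml:referenceFormulation rml:JSONPath ;
    rml:iterator "$.people[*]"
  ] ;
  # RML-Core: how each iteration becomes RDF terms.
  rml:subjectMap [
    rml:template "http://example.org/person/{id}" ;
    rml:class ex:Person
  ] ;
  rml:predicateObjectMap [
    rml:predicate ex:name ;
    rml:objectMap [ rml:reference "name" ]
  ] .
```

A system supporting only these two modules can process this mapping in full; targets, functions, or RML-star quoted triples would each pull in exactly one additional module.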

8 Conclusions and Future Steps

We present the RML ontology as a community-driven modular redesign of R2RML and its extensions for generating RDF graphs from heterogeneous data sources. Our work is driven by the limitations of R2RML and of the extensions proposed over the years. We present our motivation for following a modular design that is backward compatible with R2RML, and discuss how each module was designed, accompanied by its SHACL shapes, to address the identified challenges.

The quick adoption of the RML ontology by some of the most widely used RML systems, and the number of initiatives and companies that have already incorporated RML, create a favorable ecosystem for the adoption of RML as the standard for generating RDF graphs from heterogeneous data. The modular design allows us to adapt the relevant module to future requirements following an agile methodology. A thorough versioning system will keep track of new versions, and badges will be provided so that systems can indicate which modules, and which versions of those modules, they support.

As future steps, the community is willing to initiate the process of turning this resource into a W3C Recommendation. Hence, a Final Community Group Report will be published with all the resources presented in this paper, so that the Semantic Web community can start providing feedback on the specifications and, finally, a W3C Working Group charter can be drafted. From a technical perspective, we want to develop further use cases to ensure a thorough validation of the new implementations. Finally, the test cases for each module and the validation with SHACL shapes will also be further refined to provide an exhaustive validation resource.