1 Introduction

The research and development establishing the Web of Data has been driven by the opening and sharing of various heterogeneous data types, such as unstructured multimedia resources on the Web. Linked Open Data (LOD) has emerged as a powerful enabler to extend the current Web of Documents to a Web of interlinked data and, ultimately, into the Semantic Web. Using domain ontologies, LOD can exploit machine-readable data from the diverse data resources on the Web by means of web-based standards for encoding datasets and linking them to other published datasets. In the last decade, numerous best practices of LOD, including DBpedia, YAGO and CAMO, have been published by an increasing number of researchers, governments, public organizations and data providers, creating a global data space that interlinks billions of assertions: the Web of Linked Open Data [4, 7, 9]. As a result, LOD has evolved from a practical research idea into a very promising technology that can realize the Web as a platform for an intelligent information system with semantic search, query and reasoning capability. To accelerate the adoption of the LOD paradigm on the Web, the publication of machine-readable LOD sets should take precedence [4, 21].

Since a vast amount of useful data is still stored in relational databases (RDB), one of the most efficient ways to populate LOD sets is to map data in relational databases into RDF, which is the standard data model of LOD. Owing to the importance of mapping RDB to RDF (RDB2RDF), a variety of mapping approaches have been proposed [11, 16, 18]. The most important step towards RDB2RDF has been the two standard recommendations by the W3C RDB2RDF Working Group: Direct Mapping and the R2RML mapping language [1, 12, 14].

With these remarkable achievements for RDB2RDF, it was expected that the seamless integration of RDB data with RDF datasets towards the Web of Data could be easily realized. In practice, encouraged by the standard mapping language R2RML, many RDB2RDF systems adopting this common language have been developed in various areas [11]. However, in spite of earnest efforts to adopt this common mapping language, the publication of RDB data on the Web in machine-readable RDF format has not yielded the significant results that were expected. Although direct mapping and R2RML seem indispensable for RDB2RDF, they have some limitations [11, 13]. The complex mapping structures of R2RML written in Turtle hinder its practical application. Owing to this structural complexity, fully-fledged R2RML processors are hard to find. R2RML also does not address some common questions that arise in RDB2RDF, such as the implementation of the translation process and the way to access the mapped RDF datasets with SPARQL [19]. The translation of SPARQL queries into equivalent SQL queries is one of the crucial issues in RDB2RDF [5, 8].

This paper proposes a novel and practical method based on schema translation for RDB2RDF. Since the conceptual schema of RDB is similar to the ontological modeling of a certain domain, the mapping of RDB schema into RDF Schema is more effective than the conventional instance-based approaches. We describe how to resolve the intrinsic differences between RDB and RDF at the schema level. In addition, we show how to map the operational differences, such as graph pattern matching in RDF and JOIN operations in RDB. To extend this approach, we present a new schema-level mapping method and an effective way to implement a SPARQL endpoint in RDB.

The rest of the paper is organized as follows. Section 2 reviews related work on RDB-to-RDF mapping approaches, especially the direct mapping method, and analyses the principles behind the mapping approaches. Section 3 presents a detailed description of the proposed schema-based mapping: the underlying concepts, the mapping method and the mapping description. Section 4 deals with the schema-based mapping system architecture and the implementation of a SPARQL endpoint in RDB, with some typical mapping examples. Section 5 concludes the paper and discusses possible future work.

2 Related work

Since vast amounts of information are stored in RDB, RDB2RDF for the publication of RDB data on the Web and the integration of data from different RDBs has been a crucial research topic for LOD applications. A myriad of approaches, techniques, and corresponding tools for RDB2RDF have been proposed over the last decade. Alongside these research efforts, several studies have compared the approaches and techniques from diverse perspectives. For the motivations, underlying principles, specifications, capabilities, and categorizations of RDB2RDF, the reader is referred to these comprehensive surveys of the proposed approaches [3, 8, 11, 16, 17, 18].

The W3C RDB2RDF Working Group has proposed two standards for RDB2RDF: Direct Mapping and R2RML (Relational Database to RDF Mapping Language) [1, 12, 14]. Direct Mapping is the recommended approach to directly translate RDB data and its schema to an RDF representation. R2RML is a generic language for describing a set of customized mapping rules that transform RDB data into RDF datasets. These W3C publications have established a new, normalized discipline for standardized RDB2RDF mapping and the development of compliant tools. Since the emergence of R2RML, direct mapping has become the dominant approach to RDB2RDF.

Broadly speaking, the direct mapping approach uses simple rules to convert RDB data into RDF format in a straightforward manner as follows [10, 11]:

  • table-to-class: a table is translated into an ontological class identified by a URI.

  • row-to-resource: each row or tuple of a table is translated into a resource that has triple structure in RDF model.

  • column-to-predicate: each column of a table is translated into predicate in RDF triple, representing an ontological property.

  • primary key-to-subject: the primary key used as an identifier of the table is translated into subject with URI in RDF triple.

  • cell-to-literal value: each cell with a literal value is translated into object in RDF triple, representing a data property.

  • cell-to-resource with URI: each cell with a foreign key constraint is translated into a resource with URI in RDF triple, representing an object property.
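The six rules above can be sketched in a few lines of Python. This is only an illustrative sketch: the base URI `http://example.org/db/`, the function name `direct_map`, the sample row values, and the URI shapes are assumptions made for the example and do not reproduce the exact IRI patterns of the W3C Direct Mapping recommendation.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

BASE = "http://example.org/db/"  # hypothetical base URI, not from the paper


def direct_map(table: str, pk: str, rows: List[Dict[str, object]],
               fks: Dict[str, str]) -> List[Triple]:
    """Apply the six direct-mapping rules to the rows of one table.

    fks maps a foreign-key column to the name of the table it references.
    """
    triples: List[Triple] = []
    cls = f"<{BASE}{table}>"                      # table-to-class
    for row in rows:                              # row-to-resource
        subj = f"<{BASE}{table}/{row[pk]}>"       # primary key-to-subject
        triples.append((subj, "rdf:type", cls))
        for col, val in row.items():
            if col == pk or val is None:
                continue
            pred = f"<{BASE}{table}#{col}>"       # column-to-predicate
            if col in fks:                        # cell-to-resource with URI
                obj = f"<{BASE}{fks[col]}/{val}>"
            else:                                 # cell-to-literal value
                obj = f'"{val}"'
            triples.append((subj, pred, obj))
    return triples


# Hypothetical row of an ARTIST table whose DID column references PUBLISHER.
artist_rows = [{"ID": 1, "Name": "John", "DID": 10}]
artist_triples = direct_map("ARTIST", "ID", artist_rows, {"DID": "PUBLISHER"})
for t in artist_triples:
    print(" ".join(t), ".")
```

Note that every row yields its own set of triples, so the output grows with the instance data rather than with the schema.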

Figure 1 shows a typical example of direct mapping. Direct mapping is an efficient way to create RDF from relational data because it is completely automated [2]. In practice, however, the conventional direct mapping approach based on the above rules has some limitations. Direct mapping maps only the instances in RDB to RDF, since it does not consider the conceptual schema of RDB. As a result, it can neither capture data semantics nor guarantee the information preservation required to realize interoperable LOD. Since the separation between conceptual schema and instances is not considered, the mapping language is too complicated to describe the required mapping effectively. One of the most difficult problems in direct mapping is the translation of SPARQL queries into equivalent SQL queries. To resolve these limitations, we have to extend the functionalities of the direct mapping concept or consider other mapping approaches, such as the schema-level direct mapping proposed in this paper.

Fig. 1 Concept of direct mapping and RDB2RDF example

3 R2LD: Schema-based Graph Mapping of multimedia RDB to LOD

This section describes the concepts and the techniques of schema-based mapping for RDB2RDF. After reviewing the mapping requirements, we elaborate on the details of schema-based mapping, called R2LD, and explain how to describe mapping relations.

3.1 Requirements for RDB2RDF

The instances of LOD are based on RDF, the standard framework for expressing and interchanging information about resources on the Web. Resources can be anything, including documents, people, physical objects, multimedia and abstract concepts. In RDF, resources are represented in the form of subject–predicate–object expressions, known as triples. The subject denotes the resource, and the predicate denotes traits or aspects of the resource, and expresses a relationship between the subject and the object.

The RDF triples that become the instances in LOD can be visualized as a connected directed graph. A graph consists of nodes and arcs. The subjects and objects of the triples make up the nodes in the graph; the predicates form the arcs. Figure 2 shows the graph resulting from the sample triples [15].
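As a small illustration of this node-and-arc view, the following Python sketch builds an adjacency structure from a handful of triples. The triples, the `ex:` prefix and the literal values are made up for the example and are not the sample triples of Fig. 2.

```python
from collections import defaultdict

# Hypothetical sample triples (subject, predicate, object).
triples = [
    ("ex:John", "foaf:name", '"John"'),
    ("ex:John", "ex:work", "ex:Publisher1"),
    ("ex:Publisher1", "foaf:name", '"Acme"'),
]

nodes = set()              # subjects and objects become nodes
arcs = defaultdict(list)   # subject node -> [(predicate label, object node)]
for s, p, o in triples:
    nodes.update((s, o))
    arcs[s].append((p, o))

for s in sorted(arcs):
    for p, o in arcs[s]:
        print(f"{s} --{p}--> {o}")
```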

Fig. 2 Informal graph model of LOD

While the instances of LOD consist of the graph model of RDF, RDB data are contained in structured tables. A requirement analysis that considers the motivations and objectives of RDB2RDF is therefore necessary to develop an effective mapping method for two heterogeneous systems that have different constituent elements and query processing mechanisms. Some studies have analyzed formal requirements, such as information and semantics preservation [5, 6]. After analyzing diverse use cases that take relational data and expose it in patterns conforming to shared RDF schemata, the W3C proposed 11 core requirements and 5 optional requirements. To sum up the core requirements, the following should be the primary concerns for a mapping method [6, 13, 20]:

  • The local vocabularies used in RDB should be able to translate into common equivalent ontology vocabularies, since the main objective of RDB2RDF is to expose and publish RDB data on the Web. This means the mapping method should provide the efficient ways to redefine column names of RDB tables into ontology vocabularies used in LOD according to their semantic attribute.

  • The mapped RDB resources should be identified by URIs of LOD. Dereferencing an HTTP URI about RDF data should return appropriate information. This is mandatory for the mapped RDB data to be LOD.

  • The mapping method should provide not only translation of RDB table to RDF graph pattern, but also transformation of SPARQL queries to SQL queries. The mapping description plays the role of an arbitrator between RDB and RDF, unless the duplicated LOD sets of RDB data are generated.

  • The mapping method should have competent facility to handle the primary and foreign key constraints and M:N relationships. Since these are the principal mechanism to join tables in RDB, these play an essential role in constructing RDF graph model.

  • Some tables decomposed by normalization are used only to connect tables. The mapping method should provide a way to process these normalized tables, since they do not represent the conceptual schema. In addition, even tables composed of conceptual attributes should be decomposable whenever necessary while they are mapped into the RDF graph, so that a consistent data model is maintained.

As most requirement analyses have mentioned, the above requirements are mandatory for RDB2RDF; however, their indispensable functionalities are easily neglected. In particular, many RDB2RDF mapping approaches have revealed difficulties in handling normalized tables, foreign key constraints and SQL query generation. This paper proposes a more efficient, novel approach to deal with these issues.

3.2 Schema-based Graph Mapping approach

The conceptual schema of relational database is usually represented with entity-relationship diagram (ERD), which is almost identical to RDF data model. Accordingly, it can be expected that the RDB2RDF mapping approach should be based on RDB schema of ERD. This provides a consistent way of mapping and makes it possible to preserve information and conceptual structures of RDB in the process of the mapping.

An entity in ERD is a physical or logical object that can be uniquely identified in a domain; it is the conceptual element corresponding to the class of RDF Schema. Since RDF data with triple structure, which become LOD instances, are generated by means of the classes and properties of RDF Schema, the RDB schema defined by ERD should be the starting point of RDB2RDF. This schema-based mapping provides the notable feature of a separation of concerns between schema and instance. By focusing on the conceptual schema, the mapping method can realize seamless translation of RDB data to LOD instances at the schema level. The mapping description also becomes more succinct, since it need not specify the mapping definition for each instance as direct mapping does. Above all, schema-based mapping is expected to satisfy all the requirements for RDB2RDF.

3.2.1 Mapping RDB table schema to RDF graph

The schema-based mapping is applied to the primitive tables that contain conceptual attributes, not the relational tables that simply represent the association of tables. The primitive tables are usually defined in ERD as an entity, while the relational tables usually generated by means of the decomposition during the normalization are irrelevant to RDF data model. The mapping rules of schema-based mapping are straightforward as follows:

  • database-to-namespace: the database name is mapped into the namespace of RDF. Since the database can be regarded, in a sense, as the domain, it defines the domain vocabularies for tables and columns.

  • table-to-subject: the table name is mapped into the subject of RDF data model. While the subject of the triple in RDF model usually denotes the specified resources with URI, the subject by schema-based mapping plays a role of the organizer to compose the instances.

  • column-to-predicate: the column as a property is mapped into the predicate of RDF data model.

  • table.column-to-object: the cell value described in table.column is mapped into the object of RDF data model.

  • row-to-instance: each row of the table corresponds to RDF triple.

In schema-based mapping, the primitive table schema is preserved and represented in RDF graph as shown in Fig. 3. The mapped RDF graph shows the conceptual schema inherent in the table, not RDF data graph of the instance triple. Note that the mapped object value is represented by the object value term, ‘table.column’. The instance data will be generated according to RDF graph when SPARQL query is executed against the RDB.
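The schema-level rules can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `schema_map`, the database name `MEDIA`, and the `vocab` lookup table are hypothetical and only serve to show how one triple per column, rather than one per cell, is produced with a `table.column` value term as object.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def schema_map(database: str, table: str, columns: List[str],
               vocab: Dict[str, str]) -> List[Triple]:
    """Map a primitive table schema to schema-level triples.

    vocab optionally renames a 'TABLE.Column' to a common ontology term.
    """
    ns = database.lower()                        # database-to-namespace
    subject = f"{ns}:{table}"                    # table-to-subject
    triples: List[Triple] = []
    for col in columns:
        value_term = f"{table}.{col}"            # table.column-to-object
        predicate = vocab.get(value_term, f"{ns}:{col}")  # column-to-predicate
        triples.append((subject, predicate, value_term))
    return triples


# Hypothetical database name and vocabulary choice.
artist_schema = schema_map("MEDIA", "ARTIST", ["ID", "Name"],
                           {"ARTIST.Name": "foaf:name"})
for t in artist_schema:
    print(t)
```

The result depends only on the table schema, so its size stays the same however many rows the table holds.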

Fig. 3 Schema-based mapping method

A table in itself as a resource is mapped into a subject with namespace and can be additionally associated with the common ontology vocabularies, such as rdf:type and rdfs:subClassOf. The table column Ci corresponding to the predicate of RDF data can be translated into the well-known vocabularies, such as DC, FOAF, CAMO, and vocabularies in schema.org. In this manner, the schema-based mapping can realize semantic interoperability of the mapped RDF data by using a simple, effective mapping description. The object value of the predicate is described in the value term, TableName.Ci. The value term notation makes it possible to translate SPARQL query into the equivalent SQL query.

The subject of RDF triple usually denotes the specified resource. However, in schema-based mapping the subject acts as a virtual subject to compose the triple. In practical applications of LOD, since the instance can be identified by its properties, there is no need to use the superfluous identifier as the subject. Rather, the subject in schema-based mapping can represent the class type of the instance.

The schema-based mapping is more succinct and effective than direct mapping based on the instance. The schema-based mapping can satisfy the theoretical requirements of RDB2RDF as well as preserve the conceptual schema completely. The mapping approach need not be based on instance modeling, since the instances are stored in RDB and can be accessed at any time by SQL queries.

3.2.2 Types of the predicate

In relational database, the primary key and the foreign key are used to establish the relationships between the tables. Sometimes these key attributes are independent of the conceptual schema, required only to connect tables. Besides, the relationships between tables are not explicitly specified, only represented in the conceptual schema. Since the relationships are denoted by the predicate in RDF data model, the implied relationships by the key attributes should be explicitly redefined. There are two types of the predicate by its function.

Attribute predicate

This predicate represents the primitive concept and has the literal value described in the value term, TableName.Ci. In Fig. 4, for example, the column ARTIST.Name in the table ARTIST becomes the attribute predicate.

Fig. 4 RDB Tables containing RDB2RDF requirements

Link predicate

The link predicate originates from the relationships by the primary key and the foreign key. There are two types of link predicates depending on the target table: internal link (i-link) predicate for the recursive relationship and external link (e-link) predicate for the different tables. The object value of the link predicate has the relation expressions that represent the link equation, TableName.Ci = TableName.Cj. In Fig. 4, for example, the column ARTIST.CoWorker in the table ARTIST becomes the i-link predicate and has the value, ARTIST.CoWorker = ARTIST.ID, as the object, while the column ARTIST.DID becomes e-link predicate and has the value, ARTIST.DID = PUBLISHER.ID, as the object. The link predicate is represented with the ontology vocabulary in the RDF graph model.

The object value term or relation expression of the predicate can solve the difficult problems in RDB2RDF, such as M:N or foreign key relationships. The SQL query to access the instance data is compiled from SPARQL variables with the object value term or relation expression. Figure 4 is the typical example of RDB tables containing the unique features of RDB that should be considered in RDB2RDF. Although AP table is actually a normalized table, its own attributes, Role and Date, are added to show that schema-based mapping can handle the complexity of the relationships in RDB. The column CoWorker in ARTIST table is a recursive relation.
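The distinction between attribute, i-link and e-link predicates can be sketched as a simple classification over foreign key metadata. The function name `classify` and the dictionary layout are assumptions for illustration; the two foreign keys are the ones named in the text above.

```python
from typing import Dict, Tuple

ForeignKeys = Dict[Tuple[str, str], Tuple[str, str]]

# (table, column) -> (referenced table, referenced column), as described for Fig. 4.
FOREIGN_KEYS: ForeignKeys = {
    ("ARTIST", "CoWorker"): ("ARTIST", "ID"),
    ("ARTIST", "DID"): ("PUBLISHER", "ID"),
}


def classify(table: str, column: str, fks: ForeignKeys) -> Tuple[str, str]:
    """Return the predicate type and its object (value term or relation expression)."""
    target = fks.get((table, column))
    if target is None:
        return ("attribute", f"{table}.{column}")
    kind = "i-link" if target[0] == table else "e-link"
    return (kind, f"{table}.{column} = {target[0]}.{target[1]}")


for col in ("Name", "CoWorker", "DID"):
    print(col, classify("ARTIST", col, FOREIGN_KEYS))
```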

The schema-based mapping of these tables to the RDF graph model is shown in Fig. 5. The comprehensive view of the whole conceptual mapping at the schema level is clearly represented without introducing any of the complex structures found in direct mapping. The link predicates with the relation expression (dotted line) resolve many complicated mapping issues efficiently, such as the foreign key relationships shown in the AP table. Common ontology vocabularies can be freely adopted, as shown by foaf:name.

Fig. 5 Schema-based mapping of Fig. 4

Note that a new relation between tables such as accomplish can be added without any difficulties during RDB2RDF mapping description. This can enable more complete conceptualization rather than simple mapping of RDB to RDF and generation of a new model appropriate for LOD.

3.3 Mapping description of R2LD

The mapping description of the schema-based approach is straightforward, since all mapping information is revealed in the mapped RDF graph, as in Fig. 5. The mapping description is similar to the class and relation definitions in ontology development, since the mapped RDF graph is the conceptual schema of RDB. This paper presents a conceptual mapping description instead of a formal definition to give a clearer understanding of schema-based mapping. Figure 6 shows part of the mapping description of Fig. 5.

Fig. 6 Table representation of mapping description of Fig. 5 (part)

The mapping of the attribute predicate shown in Fig. 6a is straightforward. Several RDB vocabularies, such as ARTIST.Name, PUBLISHER.Name and PRODUCT.Name, can be mapped into the same vocabulary. The correct mapping can nevertheless be determined from other predicates during the processing of the RDF graph. In addition, common ontologies such as FOAF and GeoNames can be easily imported without any restrictions. This makes it possible to redesign and generate a more appropriate RDF graph model of the RDB schema.

The prominent feature of schema-based mapping is to use the join expression for link predicate as shown in Fig. 6b, c. Since the relations between resources are implicitly implemented with key types in RDB tables, it is reasonable to explicitly map the link predicate to the join expressions. The join expression can resolve the diverse difficult problems caused in RDB2RDF mapping. Any complex table relationships normalized or partitioned for database management can be accommodated with join expression in a simple and efficient manner.

The mapping description consists of simple one-to-one correspondence of the vocabularies between RDB and RDF. Many-to-one mapping shown in the attribute predicate can be also regarded as one-to-one since the ambiguity can be easily resolved in the process of RDF graph. The mapping description is also flexible enough that some useful ontology vocabularies can be easily added to realize more competent model of LOD. The simplicity of mapping description also provides easy implementation and applications.
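A mapping description of this kind can be held in two small lookup tables, and the many-to-one case reduces to picking the value term whose table matches the table already determined for the subject. The sketch below is an assumption about how such a description might be stored; the names `ATTRIBUTES`, `LINKS` and `value_term` are hypothetical, while the value terms and join expressions are those mentioned in the text.

```python
from typing import Dict, List

# Attribute predicates: one ontology term may cover several value terms.
ATTRIBUTES: Dict[str, List[str]] = {
    "foaf:name": ["ARTIST.Name", "PUBLISHER.Name", "PRODUCT.Name"],
}

# Link predicates: each maps to a join expression.
LINKS: Dict[str, str] = {
    "work": "ARTIST.DID = PUBLISHER.ID",
    "collaborate": "ARTIST.CoWorker = ARTIST.ID",
}


def value_term(predicate: str, table: str) -> str:
    """Resolve a many-to-one attribute predicate once the subject's table is known."""
    for term in ATTRIBUTES[predicate]:
        if term.split(".")[0] == table:
            return term
    raise KeyError(f"{predicate} is not mapped for table {table}")


print(value_term("foaf:name", "PUBLISHER"))
```

In this sketch the subject's table would typically be learned from a link predicate in the same graph pattern, which is how the ambiguity of a shared term such as foaf:name disappears.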

4 R2LD and SPARQL query processing of multimedia resources

To verify the effectiveness of schema-based mapping, this section describes an implementation of a SPARQL endpoint in RDB. The schema-based mapping can duplicate RDB to LOD as direct mapping does. Moreover, the translation of SPARQL to SQL is also straightforward. This section explains SPARQL-to-SQL translation under schema-based mapping.

4.1 Schema-based SPARQL endpoint architecture

The typical architecture of a schema-based RDB2RDF system is shown in Fig. 7. The conceptual schema or table relationships of RDB are mapped into an RDF data graph that holds the conceptual structures and table relationships. The mapping description is obtained from the RDB schema or ERD as described in Section 3.

Fig. 7 System architecture with schema-based mapping

The SPARQL endpoint to publish RDB data into LOD is implemented with the simple interface to RDB. In the conventional mapping approach, such as direct mapping, robustness and performance have been serious problems. However, schema-based mapping realizes robustness by conceptual schema mapping and performance by SQL query over RDB data. Schema-based mapping can implement SPARQL endpoint with a simple add-on interface without duplication of RDB.

4.2 SPARQL query mapping

In the schema-based mapping R2LD, the primary resource for writing SPARQL queries is the mapped RDF graph that represents the conceptual schema of RDB. The mapped RDF graph promotes conceptual thinking and thus provides a more efficient way to write SPARQL queries. Since nearly identical conceptual schemas are used in both SPARQL and SQL, SPARQL graph patterns are efficiently translated into SQL queries. The effectiveness of query processing in schema-based mapping is shown below through typical use cases of SPARQL queries.

4.2.1 Relationships between two tables with foreign keys

Although many solutions have been proposed, the primary and foreign key issues remain cumbersome obstacles in RDB2RDF mapping. Some proposed mapping descriptions are also too complicated to apply to real operational databases. However, in the schema-based mapping R2LD, the foreign key relationships are explicitly represented in the mapped RDF graph, and the detailed mapping methods are specified as join expressions in the mapping description. This provides an efficient way to resolve the implicit foreign key relationships in RDB.

For example, in Fig. 5, the query “What publisher does John work in?” in RDF graph of Fig. 8a is written in SPARQL as Fig. 8b. Writing SPARQL query over the mapped RDF graph is straightforward, as seen in Fig. 8b. From the mapping description in Fig. 6, “John” of foaf:name has the object value ARTIST.Name, and the link predicate work has join expression ARTIST.DID = PUBLISHER.ID, respectively. With all this information, the predicate name between ?y and ?z can be deduced into PUBLISHER.Name. As a result, SPARQL query can be translated into SQL query of Fig. 8c.
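The translation steps just described can be sketched as a two-pass procedure: link predicates first fix the table of each variable, then attribute predicates are resolved to value terms of those tables. The code below is a simplified sketch, not the paper's implementation: the function `translate`, the tuple encoding of the graph pattern, and the exact SQL text it emits are assumptions, and it handles neither self-joins nor proper SQL parameter binding.

```python
from typing import Dict, List, Tuple

Pattern = Tuple[str, str, str]

# Hypothetical excerpt of a mapping description (value terms and join expression from the text).
ATTRIBUTE_TERMS: Dict[str, List[str]] = {
    "foaf:name": ["ARTIST.Name", "PUBLISHER.Name", "PRODUCT.Name"],
}
JOIN_EXPRESSIONS: Dict[str, str] = {
    "work": "ARTIST.DID = PUBLISHER.ID",
}


def translate(select: List[str], patterns: List[Pattern]) -> str:
    var_table: Dict[str, str] = {}    # variable -> RDB table
    var_column: Dict[str, str] = {}   # object variable -> value term
    tables: List[str] = []
    where: List[str] = []

    # Pass 1: link predicates determine the tables and the join condition.
    for s, p, o in patterns:
        if p in JOIN_EXPRESSIONS:
            left, right = JOIN_EXPRESSIONS[p].split(" = ")
            var_table[s] = left.split(".")[0]
            var_table[o] = right.split(".")[0]
            for t in (var_table[s], var_table[o]):
                if t not in tables:
                    tables.append(t)
            where.append(JOIN_EXPRESSIONS[p])

    # Pass 2: attribute predicates resolve to the value term of the subject's table.
    for s, p, o in patterns:
        if p in ATTRIBUTE_TERMS:
            term = next(t for t in ATTRIBUTE_TERMS[p]
                        if t.split(".")[0] == var_table[s])
            if o.startswith("?"):
                var_column[o] = term
            else:
                # Naive quoting for illustration only; real code should bind parameters.
                where.append(f"{term} = '{o}'")

    columns = ", ".join(var_column[v] for v in select)
    return f"SELECT {columns} FROM {', '.join(tables)} WHERE {' AND '.join(where)}"


# "What publisher does John work in?" as a list of triple patterns.
query = [("?x", "foaf:name", "John"), ("?x", "work", "?y"), ("?y", "foaf:name", "?z")]
sql_text = translate(["?z"], query)
print(sql_text)
```

Under these assumptions the second foaf:name pattern is resolved to PUBLISHER.Name only because the link predicate work has already tied ?y to the PUBLISHER table.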

Fig. 8 Example: foreign key relationship

Some tables in RDB, such as CONTACT in Fig. 4, exist only for the practical purposes of database management and are not related to the conceptual schema. In this case, combining CONTACT with its master table ARTIST is natural. In schema-based mapping, any table combination needed for a more desirable RDF model can be easily accomplished with the join expression. For example, an RDF query graph combining CONTACT with ARTIST, as shown in Fig. 9, is possible.

Fig. 9 Example: Combining tables with foreign key relation

For example, in Fig. 9, the subject ?x is related to the tables ARTIST and CONTACT by means of its predicates, and the predicate name is resolved to ARTIST.Name. Since the join expression that combines ARTIST and CONTACT is obtained from the mapping description in Fig. 6c, the SPARQL query can be easily translated into an SQL query.

The join expression in schema-based mapping is an efficient way to solve the diverse problems caused in RDB2RDF. All complicated mapping problems can be accommodated into the mapping description. This grants more flexibility than rigid mapping for simple duplication of RDB.

4.2.2 Recursive relationships

Recursive relationships are easily implemented in both RDB and the RDF graph model. However, the mapping of recursive relationships is another matter in RDB2RDF. Although some mapping approaches suggest plausible solutions, these methods are inefficient and impractical. Schema-based mapping can easily deal with this cumbersome problem by means of the join expression.

A query involving the recursive relationship contained in the ARTIST table, such as “Who is the coworker of John?”, can be efficiently modeled as shown in Fig. 10. In fact, this recursive relationship is similar to a simple foreign key relationship. From the mapping description of Fig. 6b, the join expression of collaborate can be obtained in the same way as for the foreign key relations in the previous examples. As a result, the corresponding SQL query can be easily obtained as in Fig. 10c.
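The recursive case can be tried out against an in-memory SQLite database. The sketch below assumes that the i-link join expression ARTIST.CoWorker = ARTIST.ID is realized as a self-join with one table alias per SPARQL variable; the aliases `x` and `y`, the sample rows and the exact SQL text are illustrative assumptions and may differ from the query in Fig. 10c.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ARTIST (ID INTEGER PRIMARY KEY, Name TEXT, CoWorker INTEGER);
    INSERT INTO ARTIST VALUES (1, 'John', 2), (2, 'Mary', 1);
""")

# Graph pattern: ?x name "John" . ?x collaborate ?y . ?y name ?z
# The same table appears twice, once for ?x and once for ?y.
sql = """
    SELECT y.Name
    FROM ARTIST AS x, ARTIST AS y
    WHERE x.Name = ? AND x.CoWorker = y.ID
"""
coworkers = [row[0] for row in conn.execute(sql, ("John",))]
print(coworkers)
```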

Fig. 10 Example: Recursive relationship

4.2.3 Multi-tables relationships

SPARQL queries that involve many tables pose no special difficulty in schema-based mapping. The join expressions in the mapping description resolve the complicated table management as usual. The tables related to the SPARQL query in Fig. 11 are identified by means of the link predicates, work and contact, and the table join operations for the SQL query are composed from their join expressions in the mapping description.

Fig. 11 Example: Multi-tables and M:N relationship

Even though the SPARQL query uses the same predicate, such as name, for both ?x and ?y, this ambiguity can be resolved with the mapping information of other predicates. Schema-based mapping is flexible in the use of ontology vocabularies.

4.2.4 Multi-tables and M:N relationships

In schema-based mapping, any additional relationships can be inserted as a link predicate accomplish shown in Fig. 5. The additional relationships regardless of RDB schema can realize more reasonable RDF model.

The additional relationships contain many join operations and M:N relationships. In spite of the complexity of the additional relationships, these can be handled as usual as other link predicates in schema-based mapping. Even though the query in Fig. 12 contains the additional link predicate accomplish, SQL Query can be generated with its join expression in the mapping description as usual as other SPARQL query translation.
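An added link predicate that spans an M:N relationship can be sketched as a join expression made of several equalities chained through the mediating table. In the following SQLite example the AP column names `AID` and `PID`, the join expression assigned to accomplish, and all sample rows are hypothetical; only the table names and the AP attributes Role and Date come from the text.

```python
import sqlite3

# Hypothetical join expression for the added link predicate "accomplish".
ACCOMPLISH = ["ARTIST.ID = AP.AID", "AP.PID = PRODUCT.ID"]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ARTIST  (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE PRODUCT (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE AP (AID INTEGER, PID INTEGER, Role TEXT, Date TEXT);
    INSERT INTO ARTIST  VALUES (1, 'John'), (2, 'Mary');
    INSERT INTO PRODUCT VALUES (10, 'Album A'), (11, 'Album B');
    INSERT INTO AP VALUES (1, 10, 'singer', '2015-01-01'),
                          (1, 11, 'composer', '2016-01-01'),
                          (2, 10, 'producer', '2015-01-01');
""")

# Graph pattern: ?x name "John" . ?x accomplish ?y . ?y name ?z
sql = ("SELECT PRODUCT.Name FROM ARTIST, AP, PRODUCT "
       "WHERE ARTIST.Name = ? AND " + " AND ".join(ACCOMPLISH) +
       " ORDER BY PRODUCT.Name")
products = [row[0] for row in conn.execute(sql, ("John",))]
print(products)
```

The SPARQL side sees a single arc between artist and product, while the mediating table appears only in the generated SQL.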

Fig. 12 Example: Multi-tables and M:N relationship

4.2.5 Joined and merged-tables with M:N relationships

Any complicated SPARQL query can be easily translated into the corresponding SQL query in schema-based mapping. The example shown in Fig. 13 is conceptually intuitive and suitable for LOD. However, this query involves very complex relationship problems among the tables of RDB.

The variable ?a is related to the table CONTACT; however, the table is implicitly joined and merged with ARTIST. The variable ?b is related to the table AP, which originally mediates the primary and foreign key relationships as a result of normalization but has its own attributes. In addition, the link predicate accomplish related to the variable ?y, as seen in the previous example, is a condensed relation that merges several relations in RDB.

With the join expressions in the mapping description, the identification of the related tables is systematically tractable. After resolving the variables and predicates, the SQL query can be easily composed as in Fig. 13.

Fig. 13 Example: Joined and merged tables with M:N relationships

Schema-based mapping supports building more natural SPARQL queries suited to LOD applications and provides an efficient way of mapping them to SQL queries. This makes it possible to implement a SPARQL endpoint in RDB, which is vital for publishing RDB data as LOD sets.

5 Conclusion

The research and development involved in the realization of the Web of Data has been actively accomplished by opening and sharing various heterogeneous data types such as unstructured multimedia resources on the Web. Ontology-based LOD that allows computers to understand and process the data semantics is proposed as a standard data model. In this model, LOD becomes the enabler to realize the Web of Data by using the standard data model for both structured and unstructured information resources of the Web and a shared semantic representation.

High-quality data sets should be provided to realize diverse intelligent services on the Web. The conventional approaches to developing LOD sets, such as ontology-based LOD generation and the translation of RDB to LOD, suffer from the complexity and difficulty of the approaches, the handling of domain peculiarities and limited practical adaptability. In addition, the development of information systems based on LOD has made very little progress due to the lack of appropriate, specialized methodologies and tools for the instances of LOD sets.

The most effective and practical way to populate LOD sets is to publish the data stored in RDB on the Web in the standard form of RDF. Many studies on RDB-to-RDF mapping have been conducted to realize the Web of Data. Two standard recommendations have been proposed as important achievements by the W3C: Direct Mapping and the R2RML mapping language. However, a practical RDB-to-RDF mapping approach is still an open question.

This paper proposes a novel and practical RDB2RDF mapping method suitable for LOD at the conceptual schema level. Since the conceptual schema of RDB is similar to the ontological domain modeling of RDF, the proposed schema-based mapping R2LD can achieve more coherent mapping than the conventional direct mapping approaches by resolving the structural and operational differences, such as graph pattern matching and JOIN operations. The mapping description is straightforward on account of the compatible conceptual structures and can accommodate complex relationships in an effective manner. In addition, the implementation of schema-based mapping is simple and intuitive, as seen in several typical examples. The schema-based mapping R2LD thus provides an efficient way to implement a SPARQL endpoint in RDB, which is vital for disseminating LOD.