1 Introduction

The Resource Description Framework (RDF) [44] is a popular standard for storing and sharing factual information, predominantly created from sources on the World Wide Web. RDF data is represented as subject–predicate–object triples, usually modeled as a graph in which subjects and objects serve as nodes and predicates as edges. RDF stores, often called triplestores, are designed to support the storage of RDF data and its efficient querying by exposing a declarative query API standardized as SPARQL [63]. Interest in triplestores has grown steadily over the past decade. In particular, both research and industry employ these systems to store and query knowledge bases augmented with semantic annotations [59, 85]. Thus, a multitude of triplestore implementations are available, ranging from academic prototypes (e.g., RDF-3X [55] and Hexastore [81]) and community projects (e.g., JENA TDB [58] and Rya [64]) to commercial products (e.g., Virtuoso [24], GraphDB [57], and Neptune [13]).

RDF data differs from relational data in the complexity of its structure. In practice, real-life RDF datasets are highly heterogeneous in structure, especially compared to relational datasets [22]. This structural complexity causes query performance to vary substantially [3]. Converting RDF data to the relational model to exploit existing, mature RDBMS technologies (e.g., by Oracle [20] and IBM [17]) introduces unique challenges due to the substantial information heterogeneity and the inherent absence of a strict schema [67]. Together with the need to accommodate increasingly large RDF graphs, these unique properties of RDF data have led researchers and companies to propose RDF-specific designs rather than mere mappings from RDF to relational formats. Reviewing the features of the numerous available systems reveals that each employs its own set of design choices, each of which seems equally compelling to the casual observer. Recently, a different set of systems has been proposed to handle graph data represented as node- and edge-labeled multigraphs annotated with properties, i.e., property graphs (PG) [14]. Although both the RDF model and the PG model handle graph-shaped data, the two models differ substantially in the functionality they offer. In particular, while RDF represents data as a set of triples, PG DBMSs are designed to query labeled objects annotated with properties in the form of key-value pairs. Therefore, the analysis of core operations supported by triplestores differs substantially from that of the operations supported by PG DBMSs. (This can be seen, for instance, by comparing our analysis with the operations studied in a recent PG DBMS microbenchmark [47].) For instance, in a PG, we can select nodes having a specific label and a specific attribute set to a specific type while accessing only node objects, whereas in an RDF triplestore an equivalent query needs to query a set of triples instead.

Although there are several surveys of the many existing triplestores [2, 52, 56, 59], they are either limited to providing a taxonomy of the features offered by the systems (e.g., API and data-loading facilities) or classify them according to their underlying technology (e.g., relational versus native graph). Thus, these surveys offer a vast compendium of alternative systems but only little information regarding the design space in which their internal architectures reside.

Instead, in this work, we provide a unifying three-dimensional design space across three axes: Subdivision, Compression, and Redundancy (SCR, see Fig. 4). The design space defines the dimensions along which any storage system for RDF data must be designed. We then provide a review of the choices currently made along these dimensions as well as a discussion of unexplored options. To guide the analysis of this space and of current solutions within it, we define a corresponding feature space for query access patterns and a cost model to tie the different choices in the design space to their impact on the performance of specific query workloads.

Our approach is, in part, inspired by the Data Calculator [41]. However, while the Data Calculator focuses on optimizing a single low-level data structure for generic (i.e., non-SPARQL-specific) data access operations, our approach works at a higher level and focuses on evaluating the set of data representations employed by a triplestore, choosing design options that match the RDF access patterns the system must support. Thus, a triplestore developer could use our approach to identify access patterns for which the data representations in their triplestore are currently not optimized. For example, the Jena (TDB) system [58] does not feature any optimized data representation for filters selecting triples with a specific data type or a specific language tag, even though such filters are quite common in existing workloads (see Sect. 7). This becomes evident when analyzing Jena’s existing data representations within our proposed design space (see Tables 5, 6, and 7). Once the developer has identified this unmet pattern, they can examine which combination of decisions along the three design space dimensions they wish to employ to satisfy this new pattern. In this example, one possible solution would be to employ an additional subdivision of the existing ID space to differentiate between data types and languages. This can lead to a different data organization within an existing index or to a new access-pattern-specific index (in the latter case, we also move along the redundancy dimension). Since the Data Calculator is not aware of the presence of annotations such as language tags and of the possibility to represent them separately by subdividing the ID space, it can neither identify this unmet need nor provide an optimized solution for this access pattern. Similarly, the Data Calculator is not aware of other RDF-specific access patterns, e.g., reachability and path queries. Moreover, Jena employs three different main triple data representations that are all based on standard B+trees and identical in all the low-level features considered by the Data Calculator. Our proposed access-pattern feature space provides a way to differentiate between the RDF-specific access patterns each of these representations is required to support. Thus, we are able to evaluate and assess the compatibility of the set of data representations used in a triplestore with RDF-specific access patterns.

Our design space is based on intuitions widely shared across different data models and their DBMS [10]. However, these intuitions have not been formalized for RDF systems and existing RDF systems have not been analyzed based on them. We thereby make the following contributions.

  1. A design space for RDF data representations employed in a triplestore that is simple enough to be intuitive, yet, as we show, powerful enough to analyze the benefits and trade-offs of each design choice.

  2. A review of existing triplestores positioning them in this design space and identifying unexplored choices.

  3. A feature space for SPARQL query access patterns that allows characterizing query execution over RDF.

  4. A software tool to parse query workloads and analyze the patterns used.

  5. A comprehensive analysis of how design choices impact system performance when answering access patterns with specific features.

  6. An empirical evaluation of the prevalence of access patterns in commonly used query workloads.

The rest of the paper is structured as follows: Section 2 provides preliminary definitions, followed by the query access pattern feature space in Sect. 3. We then introduce our design space in Sect. 4, followed by a review of existing systems (Sect. 5). Section 6 presents our impact analysis, distilling important findings obtained by analyzing the systems' designs within our design space. This analysis is followed by the empirical evaluation in Sect. 7. Finally, we highlight the differences between this work and previous surveys in Sect. 8 and conclude by discussing the implications of our analysis and findings in Sect. 9.

2 Preliminaries

In this section, we formally define the basic concepts of RDF graphs and RDF graph patterns. Both revolve around the concept of RDF triples [44]. An RDF triple is a factual statement comprising a subject (s), a predicate (p), and an object (o). Subjects can be resources identified by Internationalized Resource Identifiers (IRIs) or anonymous nodes identified by internal IDs (called blank nodes); objects can additionally be literals. Predicates are always IRIs (resources) and never literals [62]. For example, the triple (ex:iri1, rdfs:label, "Human") states that the resource identified by the IRI ex:iri1 has an rdfs:label which is the literal string "Human". Collectively, nodes (i.e., resources, blank nodes, and literals) can be referred to as atoms. RDF allows knowledge to be explicitly recorded as a graph in which subjects and objects serve as nodes and triples as edges.

Definition 1

(RDF Triple/Statement & Graph) Given a set of IRIs \(\mathcal {I}\), blank nodes \(\mathcal {B}\), and literals \(\mathcal {L}\), a triple \((s,p,o) \in (\mathcal {I}\cup \mathcal {B})\times (\mathcal {I})\times (\mathcal {I}\cup \mathcal {B}\cup \mathcal {L})\) is called an RDF triple. In the \((s,p,o)\) triple, also called an RDF statement, s is the subject, p is the predicate, and o is the object. An RDF graph \({\mathcal {G}}\) is a set of RDF triples.

Fig. 1 Example of RDF data in graph format

Example 1

Consider the graph in Fig. 1, which is a graphical representation of a set of triples. In this case, ex:iri1, ex:iri3, rdfs:label, and ex:Sex are examples of IRIs in \(\mathcal {I}\), the first two with the role of nodes and the second two with the role of predicates. On the other hand, "11M" and 2018 are literals in \(\mathcal {L}\). Finally, (ex:iri1, ex:Sex, ex:iri3) is an (s, p, o) triple.
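As a point of reference for the definitions above, the following minimal Python sketch (our own illustration, not code from any surveyed system) stores the triples mentioned in Example 1 as plain tuples and answers a partially instantiated pattern by a naive scan.

```python
# A minimal sketch: an RDF graph stored as a Python set of
# (subject, predicate, object) tuples, using atoms mentioned in Example 1.
graph = {
    ("ex:iri1", "rdfs:label", "Human"),
    ("ex:iri1", "ex:Sex", "ex:iri3"),
}

def scan(triples, s=None, p=None, o=None):
    """Naive full scan: return all triples matching the constant positions;
    None stands in for an unconstrained (variable) position."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(scan(graph, s="ex:iri1"))      # both triples with subject ex:iri1
print(scan(graph, p="rdfs:label"))   # [('ex:iri1', 'rdfs:label', 'Human')]
```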

An RDF graph is queried by issuing a SPARQL [63] query to an evaluation engine, called a triplestore, whose core functionality is to compute answers based on the graph structures that match the query. SPARQL queries contain one or more basic graph patterns, which are sets of triple patterns, i.e., triples with zero or more of their components replaced by variables, formally defined as follows.

Definition 2

(Basic Graph Pattern [62]) Assume an infinite countable set of variables \(\mathcal {X}\). A Basic Graph Pattern (BGP) P is defined as a conjunction of a finite set of triple patterns \(P=\{t_1,{\ldots } \,,t_n\} \), with \(t_i\in P\) being a triple pattern defined as \(t_i{\in }(\mathcal {I}{\cup }\mathcal {X}){\times }(\mathcal {I}\cup \mathcal {X}){\times }(\mathcal {I}{\cup }\mathcal {L}{\cup }\mathcal {X})\).

Thus, triple patterns are (s, p, o) triples where any position may be replaced by a variable. Solutions to the variables are found by matching the triple patterns in the BGP with triples in the RDF graph.
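To illustrate how Definition 2 translates into evaluation, the sketch below (ours; the graph and BGP are invented) matches each triple pattern against the graph and accumulates variable bindings across the conjunction. It is a naive illustration of BGP semantics, not how a real triplestore evaluates queries.

```python
# A minimal sketch of BGP evaluation by naive triple-pattern matching.
# Variables are strings starting with '?'.
GRAPH = {
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:carol"),
    ("ex:bob", "rdfs:label", "Bob"),
}

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def match_triple(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    binding = dict(binding)
    for pat, val in zip(pattern, triple):
        if is_var(pat):
            if binding.get(pat, val) != val:
                return None          # variable already bound to a different atom
            binding[pat] = val
        elif pat != val:
            return None              # constant position does not match
    return binding

def eval_bgp(bgp, graph):
    """Return all bindings satisfying the conjunction of triple patterns."""
    bindings = [{}]
    for pattern in bgp:
        bindings = [b2 for b in bindings for t in graph
                    if (b2 := match_triple(pattern, t, b)) is not None]
    return bindings

# ?x knows ?y  AND  ?y has label ?name
bgp = [("?x", "ex:knows", "?y"), ("?y", "rdfs:label", "?name")]
print(eval_bgp(bgp, GRAPH))
# [{'?x': 'ex:alice', '?y': 'ex:bob', '?name': 'Bob'}]
```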

Moreover, there are special types of patterns that match sequences of edges satisfying a specific set of predicates, e.g., all nodes reachable by an arbitrary sequence of ex:childOf predicates. These patterns are called property paths. They are defined via a specialized form of expressions called path expressions (similar to regular expressions) and offer a succinct way to write parts of basic graph patterns and also extend matching of triple patterns to arbitrary length paths [35].

Definition 3

(Property paths [35]) SPARQL 1.1 defines a property path \(\mathbf {p}\) recursively as follows. A property path is (1) any resource \(a{\in }\mathcal {I}\); (2) given property paths \(\mathbf {p}_{1}\) and \(\mathbf {p}_{2}\), a property path is either a sequence of paths denoted by \(\mathbf {p}_{1}{/}\mathbf {p}_{2}\), a disjunction of paths denoted by \(\mathbf {p}_{1}{\mid }\mathbf {p}_{2}\), the inverse of a path denoted by \({\hat{\,}}\,{\mathbf {p}}_{1}\), a sequence of zero or more repetitions of the same path denoted by \(\mathbf {p}_{1}{*}\), a sequence of one or more repetitions of the same path denoted by \(\mathbf {p}_{1}{+}\), or zero or one occurrences of the path denoted by \(\mathbf {p}_{1}{?}\); alternatively, (3) given resources \(a_{1}, \ldots , a_{n}{\in }{\mathcal {I}}\), a property path is any of the expressions \(! a_{1}\), \(!^{\wedge } a_{1}\), \(!\left( a_{1}|\ldots | a_{n}\right) \), \(!\left( { }^{\wedge } a_{1}|\ldots |^{\wedge } a_{n}\right) \), and \(!\left( \left. a_{1}|\ldots | a_{j}\right| ^{\wedge } a_{j+1}|\ldots |^{\wedge } a_{n}\right) \), where ! denotes negation, \( ^{\wedge }\) denotes inversion, and | denotes disjunction.

Hence, property paths are expressions over the vocabulary \(\mathcal {I}\) of all IRIs [86]. The language does not allow the expression of negated property paths, but it is possible to express negation on IRIs, inverted IRIs, and disjunctions of combinations of IRIs and inverted IRIs. A property path triple is a tuple t of the form \((s, \mathbf {p}, o)\), where \(s, o {\in }(\mathcal {I}{\cup }\mathcal {X})\) and \(\mathbf {p}\) is a property path. Such a triple is a graph pattern that matches all pairs of nodes \(\langle s, o\rangle \) in an RDF graph that are connected by paths that conform to \(\mathbf {p}\).
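As an illustration of Kleene-star semantics, the following sketch (ours; the graph is invented) computes all nodes reachable from a start node via zero or more edges with a given predicate, i.e., the targets matched by a pattern of the form s p* ?o.

```python
# A minimal sketch of evaluating a Kleene-star property path p* with BFS.
from collections import deque

GRAPH = {
    ("ex:a", "ex:childOf", "ex:b"),
    ("ex:b", "ex:childOf", "ex:c"),
    ("ex:c", "ex:childOf", "ex:d"),
    ("ex:a", "ex:knows",   "ex:d"),
}

def reachable_star(graph, start, predicate):
    """Nodes o such that (start, predicate*, o) holds (zero or more hops)."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for s, p, o in graph:
            if s == node and p == predicate and o not in seen:
                seen.add(o)
                queue.append(o)
    return seen

print(reachable_star(GRAPH, "ex:a", "ex:childOf"))
# {'ex:a', 'ex:b', 'ex:c', 'ex:d'} -- includes the start node (zero hops)
```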

In its simplest form, a SPARQL query has the form “SELECT \(\mathbf {V}\) WHERE P”, with \(P=\{t_1,\ldots \,,t_n\}\) being a set of triple patterns (a BGP). Optionally, one or more FILTER clauses further constrain the variables in P. Let \(\mathcal {X}_{P}\) denote the finite set of variables occurring in P, i.e., \(\mathcal {X}_{P}\subset \mathcal {X}\) [63]; then \(\mathbf {V}\) is the vector of variables returned by the query such that \(\mathbf {V}\subseteq \mathcal {X}_{P}\). Additional operators such as UNION or OPTIONAL allow more than one BGP in a query by defining the non-conjunctive semantics of their combination. Finally, SPARQL queries can also make use of GROUP BY and aggregate operators.

SPARQL queries are declarative and are therefore designed to be decoupled from the physical data access methods used to retrieve the data. This decoupling allows specific triplestore implementations to use different data representations and query processing designs to dynamically match an appropriate execution plan with a given query. Furthermore, when answering a query, a single BGP can be decomposed into several component BGPs that are subsequently recombined before or after each component BGP is solved.

Like traditional queries, SPARQL queries can also specify how to change the content of the graph. In this case, the query either lists a set of new RDF triples to be inserted into the graph or a set of triples to be deleted from it.

3 Access patterns

To design a performant RDF triplestore, it is of crucial importance not only to understand the type of information that it will store but also how this information will be queried, i.e., the expected query workload. Specifically, SPARQL queries, decomposed into BGPs and their associated triple patterns, access the data representation in different ways. In this section, we describe how SPARQL queries and their constituent BGPs can be analyzed to identify standard access patterns.

Definition 4

(Access Pattern) An access pattern is the set of logical operations that, given a set of variable bindings over a graph, determine what data to access (and optionally to change) and what output to produce.

Note that the term is used differently in analyses of the relational model [27], where, given a relation, it refers only to what data is required as input and what tuples of the relation are produced as output. Moreover, this concept also differs from that of access paths, which refers instead to the alternative data structures that can be navigated to reach the desired data [72]. The need to satisfy the requirements of these access patterns guides the selection and design of appropriate data representations along the design space dimensions defined in the following section.

Given a query and a specific BGP from the query, the access pattern is determined by the triple patterns in the BGP and any additional operators assigned to it from the SPARQL query, e.g., filters, grouping, or aggregations. Here, we identify the feature space of access patterns (summarized in Table 1). This feature space is comprised of six dimensions (each dimension containing a set of alternative features), namely:

  • Constants The presence of constant values (as opposed to variables, which have bindings).

  • Filter The presence (or absence) of a filter on a range of values.

  • Traversal The complexity and type of the traversal described by different triple patterns of the BGP.

  • Pivot How different triple patterns of the BGP are linked together by some common atom in the BGP (here called a pivot).

  • Return The information expected to be returned.

  • Write Whether and how the BGP causes a change in the contents of the database.

Table 1 Feature space of access patterns for a SPARQL query

Constants A common feature of any BGP is the presence of variables in one or more of the subject (s), predicate (p), or object (o) positions (see Definition 2). The presence of a variable requires finding all the triples that match the remaining (non-variable, hence constant) positions. Since the triplestore must find all triples that contain the same constants in those positions, their presence in the BGP provides higher selectivity that can be exploited by filtering and indexing schemes. In particular, a fully instantiated triple pattern translates into an existence check for a single specific triple. Partially instantiated BGPs are those where some of the (s, p, o) positions are expressed as variables. Finally, when all three (s, p, o) positions are variables, we have an uninstantiated access pattern, which matches all the triples in the database.

Filter Filters impose conditions limiting the values that can be bound to one or more variables to a specific subset. Since filter operators have well-defined semantics for literals, and since literals can only appear as objects in a triple, filter operators retain triples based on the value assumed by a variable in the object position. Filter operations usually define open intervals (e.g., \(?o > 10\)), but when combined the query engine can translate them into closed intervals (e.g., \(10< ?o < 100\)). Moreover, SPARQL supports a special type of filter that depends on the type of the object. In practice, a query could return only values that are literals, only IRIs, or only blank nodes, i.e., a filter based on their membership in \(\mathcal {L}\), \(\mathcal {I}\), or \(\mathcal {B}\). For this purpose, SPARQL has the special native operators isLiteral, isBlank, and isURI. Similarly, RDF literal strings can be annotated with language tags, allowing a query to filter based on those tags. For instance, to distinguish that the string “Rome” is in French and not in English, RDF represents it as "Rome"@fr instead of "Rome"@en. Hence, a query could select only strings marked as @fr.
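The sketch below (using our own simplified term representation, not a standard RDF library API) illustrates the filter features just described: a closed numeric range, a term-kind test corresponding to isLiteral, and a language-tag test corresponding to LANG.

```python
# A minimal sketch of range, term-kind, and language-tag filters over bindings.
from dataclasses import dataclass

@dataclass(frozen=True)
class RDFTerm:
    value: object
    kind: str = "literal"   # "iri", "blank", or "literal"
    lang: str = None        # language tag for literal strings, e.g., "fr"

bindings = [
    {"?o": RDFTerm(42)},
    {"?o": RDFTerm("Rome", lang="fr")},
    {"?o": RDFTerm("ex:iri1", kind="iri")},
]

# FILTER(10 < ?o && ?o < 100) -- closed range over numeric literals
in_range = [b for b in bindings
            if b["?o"].kind == "literal"
            and isinstance(b["?o"].value, (int, float))
            and 10 < b["?o"].value < 100]

# FILTER(isLiteral(?o) && LANG(?o) = "fr")
french = [b for b in bindings
          if b["?o"].kind == "literal" and b["?o"].lang == "fr"]

print(len(in_range), len(french))  # 1 1
```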

Traversal Traversal types determine the number of triples that need to be traversed or considered, substantially impacting the query’s complexity. We identify three cases, namely (i) 1-hop traversal \(s{\rightarrow }o\) for a given specific predicate p; (ii) k-hops for a given sequence of predicates \(p_1,\dots ,p_k\); and (iii) a path for an unconstrained number of hops over some predicate p. Excluding the simple 1-hop, the other traversals are usually expressed by (and referred to as) property paths [62]. In all cases, a traversal can go in either direction (given a source, find targets, or given a target, find all sources). Moreover, in traversals with more than one hop, the intermediate nodes are not required to be returned.

Pivot Related to traversals, an essential feature of a BGP is the ability to connect multiple triple patterns in structural forms other than a sequence. When an atom appears in two or more different triple patterns of the BGP, we refer to it as a pivot. We note that the pivot can appear in the same position, e.g., always in the subject position, or in different positions, e.g., as the subject of one triple pattern and the object of another. This feature effectively determines the topology of the graph structure described by the query and can serve as the building block for more complex structures such as stars and snowflakes [16]. We use the term pivot and not join to distinguish between the access pattern required (pivot) and its physical instantiation (e.g., a join operation), which is dependent on the data representation. Some specific data representations can speed up pivot operations, avoiding relational table join-type operations altogether. In particular, we distinguish between a pivot involving only two triple patterns and a pivot where the same variable is shared among more than two triple patterns, e.g., in a star pattern [77]. In the latter case, recent studies have shown the advantage of employing worst-case optimal join algorithms [38], where multiple triple patterns sharing a pivot can be evaluated together. Moreover, while a pivot over the object of one triple and the subject of another can be seen as a 2-hop traversal, in general, one cannot categorically say that using two pivots is better than a 2-hop reachability index. Thus, in our analysis, these features are considered separately.

Return The solution to a SPARQL query is the set (or a subset) of the bindings of atoms (from the matching triples) to some variables. Given a BGP, most queries return all variable bindings that match the triple patterns. Yet, in other cases, not all matching triples and variable bindings are directly returned in the query result. When only a subset of the variables is returned, duplicate values may be returned, representing all of the triples found to match the BGP. We name this case Values (all), since the query returns values and all duplicates are returned (Table 1). When the DISTINCT keyword is used, only the set of distinct bindings for each variable is required. This case is named Values (distinct). Moreover, some queries could require the bindings for some variable to be returned in some predefined order. We refer to this case as Values (sorted). For some queries, it is sufficient to identify whether a variable binding satisfying the pattern exists, without returning the binding itself. In other cases, a query just needs to verify whether a specific path exists or, vice versa, whether a specific connection does not exist; the latter is addressed by negating an existence check. SPARQL ASK queries are an extreme example where the entire query returns no variable bindings but only the value true (exists) or false (does not exist) for a specific BGP. Finally, a query could be required to return aggregated values, i.e., the count of matches for a BGP or the max and min values for literal bindings.

Write SPARQL allows not only retrieving information from the database but also modifying its contents, i.e., write operations. Note that update operations are often unsupported, requiring instead a deletion and subsequent insertion of new triples. We consider these access patterns to distinguish read-only workloads from read-write workloads.

Example query analysis Consider the example in Fig. 2 (from a WikiData query log [16]) retrieving location coordinates of archaeological sites. Property path wdt:P31/wdt:P279* wd:Q1190554 is of the form p1 / p2* o, meaning that it would match two alternative paths: (1) 1-hop traversals over p1 = wdt:P31 reaching the target node o=wd:Q1190554 directly, and (2) *-hop traversals starting with one edge for wdt:P31 and reaching the object through a sequence of arbitrarily long paths matching the p2 = wdt:P279 triple pattern. Hence, the query contains two 1-hop traversals (marked with stars, one in each direction, since wdt:P279* is optional) and a composite property path (wdt:P31/wdt:P279*, see Definition 3). It contains a triple pattern with constants in both (p) and (o) (highlighted in dashed blue) and two triple patterns with a constant only in (p) (in solid blue). Moreover, the query contains both a closed range filter and two different special filters: one for the DATATYPE and one for the LANG property of literals. Finally, the ?event variable is a 3-way pivot, which can also be executed as a set of binary \(s\equiv s\) pivot access patterns.

Fig. 2 SPARQL query with annotated access patterns

Therefore, this feature space allows us to characterize each query with the requirements of the corresponding access patterns needed to answer it. In the following, we define a design space for data representations. Decisions for the different dimensions within the design space impact how efficiently the query’s access patterns are supported by the resulting data representations.

4 A design space for RDF data representations

Triplestores implement a set of data representations that support the data access patterns needed to solve the BGPs described in the previous section. A data representation stores a subset of the graph \(G{\subseteq }{\mathcal {G}}\). Given an access pattern comprised of a BGP P, a vector of variables to return \(\mathbf {V}\), and optionally filter and aggregation operations O over these variables, a data representation used to answer the given access pattern needs to provide a way to retrieve the values of \(\mathbf {V}\) from the information stored in G or to modify G according to \(\mathbf {V}\). Therefore, given an access pattern \(\mathcal {A}\) and a data representation \(\mathcal {D}\), if \(\mathcal {D}\) holds the information necessary to answer the access pattern \(\mathcal {A}\), we want to evaluate how well \(\mathcal {D}\) supports providing the correct instantiations of \(\mathbf {V}\) given P and O. Thus, the question is what is the cost, in terms of time, needed to compute the instantiations of \(\mathbf {V}\) and execute \(\mathcal {A}\) over \(\mathcal {D}\). Answering this question allows one, given two distinct data representations, to select the more appropriate one.

In this section, we begin by defining a basic cost model (Sect. 4.1) and the notion of compatibility between access patterns and data representations (Sect. 4.2). Then, we propose a design space for data representations over which design decisions can be made (Sect. 4.3). Together, these allow the evaluation of the fit between an access pattern and a design choice, which is later showcased in Sect. 6.

4.1 Cost model

Answering a BGP using a particular data representation incurs a cost, usually expressed in terms of the time it takes to retrieve all such answers. Estimating this cost for the different access patterns is crucial, both at design time and during query processing, and relies on a cost model assigning a cost to each type of basic operation. Several such models have been proposed for RDF [21, 29, 66], but they assume a simple two-tiered memory hierarchy and an underlying relational data representation. In an era of shared-memory clusters, large-RAM machines, SSDs, NVRAM, SIMD, and vectorized processing (e.g., [51]), assuming a single cost paradigm is no longer sensible.

What still holds true over all novel (existing or future) machine architectures and memory types is the fact that random seek operations incur a different cost than sequential read operations. As echoed by the authors of the RUM conjecture [9], “... in the 1970s one of the critical aspects of every database algorithm was to minimize the number of random accesses on disk; fast-forward 40 years and a similar strategy is still used, only now we minimize the number of random accesses to main memory”. The use of compression to minimize memory requirements and speed up retrieval of large result sets incurs additional costs in compression and decompression times. Although novel compression methods utilizing hardware to speed up these times are increasingly available (e.g., [6]), there remains a cost difference between accessing compressed and uncompressed data. We therefore follow the recent convention of employing a set of cost constants [12, 41] differentiating between random and sequential accesses and between compressed and uncompressed costs. The instantiation of these constants for the different representation options on the current hardware can be done at design time by measuring operation times on the current configuration. We use these constants to induce a ranking over different representations in the same design space dimension.

Definition 5

(Basic Operation Cost Constants) Read random (\(R_r\)) represents the average cost of accessing a random single data item (e.g., a key or a value) in a data representation given a pointer to the item. Read sequential (\(R_s\)), instead, measures the cost of reading a data item stored in a position succeeding the current one, i.e., stored sequentially within a contiguous region of memory (e.g., the next block on disk, or the next position in an array). For compressed data, we distinguish between read compressed sequential (\(R_{cs}\)) and read uncompressed sequential (\(R_{us}\)). Similarly defined constants for write operations are \(W_{r}\), \(W_{s}\) , \(W_{cs}\), and \(W_{us}\), respectively.
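As a rough illustration of how these constants could be instantiated at design time, the following sketch (ours) times sequential versus random accesses over an in-memory array and reports per-item averages; a real system would run analogous measurements against its actual on-disk or in-memory structures, where the gap between the two constants is typically much larger.

```python
# A sketch of instantiating cost constants by measurement: contrast a
# contiguous scan with random probes over a large Python list.
import random
import time

N = 2_000_000
data = list(range(N))

def per_item_sequential(items):
    start = time.perf_counter()
    total = 0
    for x in items:                 # contiguous, cache-friendly scan
        total += x
    return (time.perf_counter() - start) / len(items)

def per_item_random(items, probes):
    idx = [random.randrange(len(items)) for _ in range(probes)]
    start = time.perf_counter()
    total = 0
    for i in idx:                   # scattered accesses
        total += items[i]
    return (time.perf_counter() - start) / probes

R_s = per_item_sequential(data)        # estimate of the sequential read constant
R_r = per_item_random(data, 200_000)   # estimate of the random read constant
print(f"R_s ~ {R_s:.2e} s/item, R_r ~ {R_r:.2e} s/item, ratio ~ {R_r / R_s:.1f}")
```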

4.2 Data representation compatibility

To assess whether a data representation can efficiently support a specific access pattern, we define the notion of compatibility. Given an access pattern \(\mathcal {A}\) and a data representation \(\mathcal {D}\) able to answer \(\mathcal {A}\), we assume that there is a sequence of operations specified by an algorithm \(\varGamma \) defined over \(\mathcal {D}\) that can compute the answer \(\mathcal {S}\), i.e., \(\mathcal {S}= \varGamma (\mathcal {A}, \mathcal {D})\). The number and cost of the operations specified by \(\varGamma \) directly determine whether \(\mathcal {D}\) is a representation suitable for efficiently computing the answers to \(\mathcal {A}\). When measuring the number of operations required to be executed over \(\mathcal {D}\), we distinguish between random seek and sequential operations. In particular, an RDF data representation can be seek compatible or sequence compatible, defined as follows.

A data representation is seek compatible with an access pattern if one or more of the results required by the access pattern can be retrieved in a single random access step. For example, if the access pattern requires retrieving the objects that are related to the same subject s1 via the same predicate p1, i.e., \({\langle }s1,\,p1,\,?o{\rangle }\), a seek compatible representation is one that, given the pair of values \({\langle }s1, p1{\rangle }\), returns a pointer to the first element of this set (e.g., a hash table).

A representation is sequence compatible if all results required by the access pattern can be retrieved through sequential accesses without requiring (after the initial seek) any additional random seek to complete the result set. In the case of a hash table for \({\langle }s1,\,p1,\,?o{\rangle }\), if the pointer returned from the first seek is to a contiguous area of memory/disk containing all objects satisfying the access pattern, the representation is sequence compatible. However, if the objects are stored in a linked list, requiring additional random seeks to read, the representation is not sequence compatible. Moreover, we say that a representation is selection compatible with an access pattern if no unneeded results are retrieved. For example, if the variable ?o in our example is restricted to literals (with isLiteral), then any triple with an IRI as object is unneeded.

Definition 6

(Compatibility) Let \(\mathcal {A}\) be an access pattern and let \(\mathcal {S}\) be the set of results that are the answer to \(\mathcal {A}\) over a graph \({\mathcal {G}}\). Let \(\mathcal {D}\) be a data representation of \({\mathcal {G}}\). \(\mathcal {S}\) is a, possibly empty, set of tuples containing any combination of literals, blank nodes, and IRIs. Then, \(C^{r}_{1}(\mathcal {A}|\mathcal {D})\) is the number of random seek operations required to reach the first result of \(\mathcal {S}\) in \(\mathcal {D}\) or to ascertain that \(\mathcal {S}\equiv \emptyset \). We define \(C^{r}_{\varOmega }(\mathcal {A}|\mathcal {D})\) to be the cost in terms of number of random seek operations required to retrieve all results in \(\mathcal {S}\) from \(\mathcal {D}\) after having reached the first result in \(\mathcal {S}\). Correspondingly, \(C^{s}_{\varOmega }(\mathcal {A}|\mathcal {D})\) is the number of sequential read operations required to retrieve all subsequent results. Hence, the total cost to retrieve the results \(\mathcal {S}\) of \(\mathcal {A}\) over \(\mathcal {D}\) is:

$$\begin{aligned} C^{r}_{1}(\mathcal {A}|\mathcal {D}){\times }R_r + C^{r}_{\varOmega }(\mathcal {A}|\mathcal {D}){\times }R_r + C^{s}_{\varOmega }(\mathcal {A}|\mathcal {D}){\times }R_s . \end{aligned}$$
(1)

Hence, we say that a data representation \(\mathcal {D}\) is:

  • seek compatible with \(\mathcal {A}\) if \(C^{r}_{1}(\mathcal {A}|\mathcal {D})=\varepsilon \), where \(\varepsilon {\in }\mathbb {R}^{+}\) is some small system-dependent constant independent of \(|{\mathcal {G}}|\) (e.g., if \(\mathcal {D}\) is a hash-table the fixed cost to search elements in \(\mathcal {D}\) is usually approximated to the constant 1.2).

  • sequence compatible with \(\mathcal {A}\) if \(\mathcal {S}\) can be sequentially retrieved after the initial seek, that is \(C^{r}_{\varOmega }(\mathcal {A}|\mathcal {D})=0\) and \(C^{s}_{\varOmega }(\mathcal {A}|\mathcal {D})\le \mathrm{max}(|G|,|\mathcal {S}|)\).

  • selection compatible with \(\mathcal {A}\) if no excess results are retrieved, that is \(C^{r}_{\varOmega }(\mathcal {A}|\mathcal {D})+C^{s}_{\varOmega }(\mathcal {A}|\mathcal {D})\le |\mathcal {S}|\).
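To make Eq. (1) and the three conditions concrete, the sketch below (with placeholder constants and operation counts of our choosing) computes the total retrieval cost and the compatibility predicates for a hypothetical lookup.

```python
# A sketch of Eq. (1) and the compatibility conditions of Definition 6;
# c_r1, c_r_omega, and c_s_omega stand for C^r_1, C^r_Omega, and C^s_Omega.
R_r, R_s = 100.0, 1.0     # assumed random / sequential read costs
EPSILON = 1.2             # small system-dependent seek constant

def total_cost(c_r1, c_r_omega, c_s_omega):
    """Eq. (1): C^r_1 * R_r + C^r_Omega * R_r + C^s_Omega * R_s."""
    return c_r1 * R_r + c_r_omega * R_r + c_s_omega * R_s

def seek_compatible(c_r1):
    return c_r1 <= EPSILON

def sequence_compatible(c_r_omega, c_s_omega, graph_size, result_size):
    return c_r_omega == 0 and c_s_omega <= max(graph_size, result_size)

def selection_compatible(c_r_omega, c_s_omega, result_size):
    return c_r_omega + c_s_omega <= result_size

# Hypothetical case: a hash table on (s, p) answering <s1, p1, ?o> with 50
# results stored contiguously: one seek to the bucket, then 50 sequential reads.
print(total_cost(1, 0, 50))                                   # 150.0
print(seek_compatible(1),
      sequence_compatible(0, 50, 1_000_000, 50),
      selection_compatible(0, 50, 50))                        # True True True
```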

Fig. 3 Examples of data representations: a sorted file, b hash map, c property table, and d B+ tree

Example 2

(Compatibility of different representations) For instance, one way to store \({\mathcal {G}}\) is to represent each triple as a 3-tuple, and the entire dataset as a list of 3-tuples sorted by subject and then by predicate and object (Fig. 3a and d) with a clustered B+ tree index over them. In this representation, the cost of query processing would resemble that of a relational table with three attributes (s, p, o), all part of a primary index. This representation is sequence compatible with any 1-hop access pattern that binds s, both s and p, or all three positions. That is, the algorithm \(\varGamma (\langle \)s1, p1, ?o\(\rangle \), sorted(G); B+ tree) would first find the first tuple by performing log(|G|) steps traversing the B+ tree looking for s1, p1, and would then perform a linear scan over the file to retrieve the remaining tuples. However, the B+ tree is not seek compatible, because the time to find the first result depends on the size of the graph, i.e., it involves log(|G|) seek operations to reach the first result.

A different data representation is to employ a key-value data structure (similar to Fig. 3b) and use the pair subject-predicate as the key, and the object as the value. In this data structure, triples sharing the same s and p will store the list of objects contiguously. This data structure is both seek and sequence compatible for a traversal that, given s and p, retrieves all corresponding objects. Nevertheless, this representation is neither seek compatible nor sequence compatible if the query requires all edges for predicate p regardless of s.
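The following sketch (ours, with invented triples) contrasts the two representations of this example: a sorted list standing in for the clustered B+ tree, and a hash map keyed on (s, p) whose buckets store the objects contiguously.

```python
# A minimal sketch of the two representations contrasted in Example 2.
import bisect

TRIPLES = sorted([
    ("s1", "p1", "o1"), ("s1", "p1", "o2"), ("s1", "p2", "o3"),
    ("s2", "p1", "o4"),
])

def lookup_sorted(s, p):
    """Binary search to the first (s, p, *) entry, then a sequential scan."""
    i = bisect.bisect_left(TRIPLES, (s, p, ""))      # ~log|G| comparisons
    out = []
    while i < len(TRIPLES) and TRIPLES[i][:2] == (s, p):
        out.append(TRIPLES[i][2])                    # sequential reads
        i += 1
    return out

# Key-value representation: (s, p) -> contiguous list of objects.
SP_INDEX = {}
for s, p, o in TRIPLES:
    SP_INDEX.setdefault((s, p), []).append(o)

def lookup_hash(s, p):
    """Single seek to the bucket, then a sequential read of all objects."""
    return SP_INDEX.get((s, p), [])

print(lookup_sorted("s1", "p1"), lookup_hash("s1", "p1"))  # ['o1', 'o2'] twice
# Note: SP_INDEX is neither seek nor sequence compatible with <?s, p1, ?o>,
# since objects for p1 are scattered across buckets keyed by different subjects.
```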

In the following, we present a design space within which each data representation can be embedded. The definitions of cost and compatibility presented above allow us to analyze the advantages and drawbacks of the different choices in each dimension of the design space.

Fig. 4 The SCR system design space for RDF stores

4.3 The design space dimensions

When designing data representations for an RDF store, one can model the design decisions over three axes: Subdivision, Compression, and Redundancy (SCR, Fig. 4). Each of these orthogonal axes, whose properties are summarized in Table 2, represents a continuum along which a system can be positioned.

Table 2 Summary of data representation design space axes

Subdivision The subdivision axis determines how fragmented the data is. At one extreme, all data is stored contiguously in a single structure; at the other extreme, each edge and node is stored as a separate object with pointers to its neighbors in the induced graph. In between lie various data structures, such as B+ trees or hash maps, adapted to these settings. This axis contains design decisions such as sorting, grouping, and hashing. Each of these decisions creates an additional subdivision in the data. Increasing the extent of subdivision allows us to minimize the number of unneeded data items accessed to answer an access pattern (yielding fewer \(R_s\) and \(W_s\) operations that return items not in \(\mathcal {S}\)). For example, when using the single file approach, we would potentially need to read the entire file before finding a single required triple, while with a hash table, we move directly to the first matching tuple. Filter access patterns on large ranges of values, however, can be costly when data representations utilize extensive subdivision. The use of multi-core parallelized processing can ameliorate this cost, to some extent, by dividing the sequential retrieval tasks among cores that access the subdivided data in parallel. Subdivided data representations provide an additional benefit for multi-core systems, as they allow the creation of locking mechanisms with finer granularity, reducing wait times. Consider Fig. 3. Note that, in a sorted file (Fig. 3a), the sorting keys act as the simplest subdivision by collecting all triples with the same value together. On the other hand, with a hash table (Fig. 3b), given the target IRI, the hash function separates all relevant triples sharing the same key. Note that every time we move across subdivisions, we move into a different non-contiguous region of either disk or memory, increasing the \(R_r\) cost. Therefore, decisions along the subdivision axis largely determine whether a data structure is selection compatible, i.e., whether it bounds all and only the answers within a specific subdivision, but they also impact seek and sequence compatibility.

Compression The second axis is the compression axis. The goal of compression is to minimize the number of bits read to reach the first tuple in the result set (\(R_r\)) and the number of bits required to read and potentially store the result set for further processing (\(R_{cs}\)). The potentially negative impact of compression is, of course, the decompression required to evaluate a predicate in the access pattern if the access pattern is incompatible or partially incompatible. For example, consider a compressed data representation tuned for queries of the form (s, ?p, ?o) and \(s{\equiv }s\) access patterns (e.g., BitMat [11]). Using this representation to answer an s \(p^{*/+}\) o pattern (i.e., is o reachable from s through edges labeled with p) would require a potentially large number of row decompression operations to perform the \(o{\equiv }s\) pivot operations required to resolve the traversal. Since these may be spread around the data structure, this would incur a large number of \(R_r\) operations. Therefore, decisions along the compression axis heavily impact selection compatibility when data that does not contribute to the answer is compressed together with data relevant to an access pattern.
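One widely used point on this axis is dictionary encoding of IRIs and literals as integer IDs; the sketch below (ours, not the implementation of any surveyed system) shows the two mappings such a choice entails and how triples are encoded and decoded.

```python
# A minimal sketch of dictionary encoding: terms are replaced by integer IDs,
# and the term -> ID and ID -> term mappings are kept to encode queries and
# decode results.
class Dictionary:
    def __init__(self):
        self.term_to_id = {}     # hash map: efficient single lookups
        self.id_to_term = []     # array: term with ID i stored at position i

    def encode(self, term):
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

    def decode(self, i):
        return self.id_to_term[i]

d = Dictionary()
triples = [("ex:iri1", "rdfs:label", "Human"), ("ex:iri1", "ex:Sex", "ex:iri3")]
encoded = [tuple(d.encode(t) for t in triple) for triple in triples]
print(encoded)                   # [(0, 1, 2), (0, 3, 4)]
print(d.decode(encoded[0][2]))   # 'Human'
```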

Redundancy The third axis is the redundancy axis, which governs how many (redundant) copies of the data are stored in the system. By adding redundant data representations and indexes, it is possible to define ideal (seek, sequence, and selection compatible) data representations for each access pattern. However, this comes at the cost of having to store the same information multiple times. For example, by holding both an SPO clustered index and a PSO clustered index, each triple is stored twice. Thus, redundancy hinders compatibility with write access patterns. Moreover, design decisions including full/partial replication need to find a trade-off between storage space and efficient support of query access patterns. Hence, the cost of maintaining multiple representations is threefold: (i) Increased latency for delete and insert operations (higher \(W_r\) and \(W_s\)), possibly with reduced performance of read operations (higher \(R_{us}\)) as well, since additional (uncompressed) auxiliary data structures are required to store deltas until the cost of updating can be amortized over a large enough set of updates (e.g., as in RDF-3X [55]). The use of multi-core parallelized processing can avoid this additional cost, to some extent, by dividing the redundant update tasks among cores in parallel, allowing more rapid updates of the compressed structures. (ii) An increase in space requirements, subsequently straining the limited space in main memory and causing an increase in all R costs. (iii) An increase in query optimization time, because alternative structures to access the data result in a higher number of query execution plans that have to be considered.
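The following sketch (ours, using plain sorted lists in place of clustered indexes) illustrates the redundancy trade-off: two redundant sort orders make both subject-bound and predicate-bound patterns cheap to read, while every insert and delete must be applied twice.

```python
# A minimal sketch of redundant SPO- and PSO-ordered copies of the same triples.
import bisect
from itertools import takewhile

SPO, PSO = [], []   # two redundant sorted copies of the same graph

def insert(s, p, o):
    bisect.insort(SPO, (s, p, o))   # write paid once in the SPO copy ...
    bisect.insort(PSO, (p, s, o))   # ... and again in the PSO copy

def delete(s, p, o):
    SPO.remove((s, p, o))
    PSO.remove((p, s, o))

def by_subject(s):
    """<s, ?p, ?o>: binary search into SPO, then a contiguous range scan."""
    i = bisect.bisect_left(SPO, (s, "", ""))
    return list(takewhile(lambda t: t[0] == s, SPO[i:]))

def by_predicate(p):
    """<?s, p, ?o>: binary search into PSO, then a contiguous range scan."""
    i = bisect.bisect_left(PSO, (p, "", ""))
    return [(s, q, o) for q, s, o in takewhile(lambda t: t[0] == p, PSO[i:])]

insert("s1", "p1", "o1"); insert("s2", "p1", "o2"); insert("s1", "p2", "o3")
print(by_subject("s1"))    # [('s1', 'p1', 'o1'), ('s1', 'p2', 'o3')]
print(by_predicate("p1"))  # [('s1', 'p1', 'o1'), ('s2', 'p1', 'o2')]
```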

Example 3

(SCR space analysis) To illustrate the connection between the SCR space and the proposed cost model, consider again the different data representations in Fig. 3. A sorted file (Fig. 3a) divides the triples (cardinality \(|{\mathcal {G}}|\)) by the values of the first sort field (e.g., s) and then by the remaining sort key fields. Finding the first tuple with some specific value for s requires \(\log _2(|{\mathcal {G}}|)\) random seeks. Subsequently, reading all the relevant records (assuming a cardinality of \(|\mathcal {S}|\)) is a sequential read. The total cost for the Constant s access pattern is then \({R_r}{\times }{\log _2(|{\mathcal {G}}|)}{+}{R_s}{\times }{|\mathcal {S}|}\). Hash tables (Fig. 3b) improve upon this, i.e., move further along the Subdivision axis, by subdividing the space into buckets with the same key value. All keys with the same value are then divided into a linked list of blocks of predefined maximum size (k, with \(k=1\) in the figure). Therefore, this incurs one random seek to reach the first record, a sequential read of each bucket, and (\(m{=}{\lceil }{|\mathcal {S}|}/{k}{\rceil }\)) random seeks to move from bucket to bucket, for a total cost of \({R_r}{\times }{(1{+}m)}{+}{R_s}{\times }{|\mathcal {S}|}\). While random seek costs often dominate sequential ones by orders of magnitude, in the common case of a small answer set \(\mathcal {S}\), this additional subdivision improves read costs for this access pattern. Deploying a B+ tree, instead, would keep the same Subdivision but move along the Redundancy axis.
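The same arithmetic can be made concrete with placeholder constants and cardinalities (ours, not measurements from any system):

```python
# Placeholder instantiation of the two cost formulas above (values are ours).
import math

R_r, R_s = 100.0, 1.0            # assumed per-operation costs
G, S, k = 1_000_000, 20, 4       # |G| triples, |S| answers, hash block size k

sorted_file = R_r * math.log2(G) + R_s * S     # binary search, then scan
m = math.ceil(S / k)                           # bucket-to-bucket hops
hash_table = R_r * (1 + m) + R_s * S           # formula from Example 3

print(f"sorted file: {sorted_file:.0f}, hash table: {hash_table:.0f}")
# sorted file: 2013, hash table: 620 -- the extra subdivision pays off when |S| is small
```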

We now identify these options in existing RDF triplestores and place them within the SCR space.

5 Data representations in RDF triplestores

We now review a wide range of existing triplestores and the data representations they employ within the SCR design space. Here, we focus on centralized systems that have either been published as research prototypes or are commercially available. Our inclusion criterion is that a system allows the ingestion of RDF data and supports the SPARQL query language (including insertions and deletions). We exclude systems for which we could not find sufficiently detailed information regarding their data representation. Notable commercial exclusions are, therefore, Amazon Neptune [13], AllegroGraph [28], and StarDog [74], which do not disclose their internals. Notable non-commercial exclusions are HexaStore [81] and RDFBroker [73], which do not support SPARQL, and KAON2 [78], which is designed for OWL reasoning rather than SPARQL answering over RDF triples. Moreover, we also exclude HDT [25], which only supports data serialization.

Table 3 RDBMS-based Stores. <Prefix>#: representations that provide the number of triples with a given prefix
Table 4 Non-RDBMS Stores. <Prefix>#: representations that provide the number of triples with a given prefix

Tables 3 and 4 summarize the features of the reviewed systems over the design space dimensions. Different triplestores can now be compared based on the choices they have made on how to subdivide the data (Subdivision), whether and how to compress IDs, literals, and triples (Compression), and which redundant representations to maintain (Redundancy). Table 3 lists systems based on an underlying relational database system (RDBMS), and Table 4 lists systems using native graph storage. The year stated next to the system name is the publication year of the latest paper or technical report describing its features. Recent solutions appear to favor a native storage mechanism over using an RDBMS. Furthermore, the use of B+trees is becoming less prevalent, with recent systems favoring hash-based solutions. We now present a classification of the systems into the different choices in each dimension.

Table 5 Subdivision design choices

5.1 Subdivision

Table 5 summarizes the design choices over the subdivision dimension. Recall that subdivision aims to minimize the number of unneeded items read when seeking and reading. This minimization may come at the cost of increasing the number of random seeks required to reach the data items needed by the access pattern. Within the subdivision dimension, we identified four choices that system designers can make: (1) how to subdivide the main triple data, (2) whether and how to divide the ID space assigned to IRIs/literals, (3) how to subdivide the IRI/Literal\(\rightarrow \)ID data representation, and (4) how to subdivide the reverse ID\(\rightarrow \)IRI representation. Systems in the table are divided into four groups according to their approach toward subdividing the main triple data. The largest group of systems utilizes an underlying relational representation, either a column-based one (SW-store [1]) or a row-based one (the rest). The second-largest group utilizes tree-based representations. While Mulgara [54] uses an AVL tree, the rest use a B+tree. Four systems utilize a hash-based representation, and four opted for more specialized representations. We now review the systems in Table 5 by each of the main design choice categories.

Main triple data When reusing the underlying infrastructure and technologies of relational databases, designers must define how the RDF structure is mapped into a relational structure. 3store [33], RDFLib [46], and Virtuoso [24] use a large triple table with a field for each s,p,o atom together with some auxiliary indexes. Other relational-based systems use dynamic subdivision to create a set of relational tables.

Dynamic subdivision restructures the data representation according to the specific contents of the graph, its schema/ontology, or the query load. SW-Store [1] subdivides the triples by creating a collection of property tables (one for each predicate IRI), thus adding some compression, as the predicate value is encoded in the name of the subdivision rather than in each triple. However, the number of distinct predicates may be extremely large, causing a large schematic overhead. DB2 Graph [17], RDFBroker [73], and FlexTable [80] create tables for groups of predicates. DB2 Graph uses a fixed number of predicate columns populated according to their prevalence. RDFBroker creates a different table for each combination of predicates. FlexTable [80] allows a table to have different predicate columns at the data page level, thus dynamically opening new table pages with a slightly different schema when such instances appear in the data. 3XL [48] uses the backing ontology of the triplestore to create complex tables containing related information from several triples.

Tree-based representations index triples using B+trees as entry points. Jena (TDB) [58] and RDF-3X [55] are the most notable in this category. Trident [76] follows the relational subdivision but then concatenates all such subdivisions and exploits a B+tree on IDs during search. Hash-based solutions differ in the method by which the hash key is computed and in the organization of the buckets to which the key points. TripleT [82] and Oracle Spatial and Graph [84] use multiple hash tables for different permutations of s, p, o, such that one of the positions serves as the hash key and the other two positions are stored in the buckets. gStore [88] hashes the IRI and holds its adjacency list in the value, as well as a secondary tree-based construct (see Sect. 5.3). Chameleon [5] utilizes a hash table from IRI to a subdivision of the graph. Subdivisions are restructured dynamically according to the query workload.

We also report on four systems that utilize special representations instead. TripleBit [87] and BitMat [11] subdivide the data by predicate, thus creating a bit matrix for each predicate with rows representing subjects and columns representing objects. Parliament [45] uses a sorted file with offset pointers from each triple to the next one with the same subject to allow traversal. Effectively, this divides the triples into variable-length blocks of same-value parts. To navigate between these blocks, Parliament uses either the offset pointers or binary search. YARS [36] augments this approach by subdividing the sorted file into blocks accessible via a sorted sparse index.

Recently, several data representations have been suggested specifically to support worst-case-optimal join (WCOJ) algorithms, since those have been shown to be particularly efficient for processing n-way pivot access patterns. Although different implementations of these join algorithms exist [8, 18], most have been implemented as extensions to existing triplestore systems, and they are often read-only and provide limited support for the SPARQL standard. The most extensive implementation is provided in the Jena framework to support the Leapfrog Triejoin [38]. Here, the n-way pivot access pattern is supported through a fully redundant representation that adds three additional SPO permutation indexes to the default indexes in Jena (TDB). Moreover, it utilizes the prefix-sorted subdivision of a B+tree to support an implementation of the WCOJ algorithm that exploits the standard single-position pivot access patterns. On the other hand, an extreme case of non-redundant subdivision instead exploits a ring data structure comprised of a bidirectional cyclic suffix-string index that allows the retrieval of any permutation of SPO triples [8]. This representation is highly subdivided and incurs a substantial number of seek operations. To minimize the impact of these seeks, the authors propose an extensive compression mechanism that allows the structure to be kept in memory and thus minimizes the seek cost. However, as mentioned above, this comes at the expense of support for additional access patterns, most notably write and delete patterns.

ID Space Most systems do not subdivide the ID space. Of those that do, most subdivide the space between IRIs and literals by assigning different ID ranges rather than storing them separately, as only 3store [33] and Blazegraph [49] do.

Oracle [84] and BitMat [11] present unique choices for separating values with different roles. Oracle separates IDs assigned to IRIs of classes and properties from those assigned to other resources. BitMat distinguishes between resources used as properties, as subjects only, as objects only, or as either subjects or objects.

IRI/Literal\(\rightarrow \)ID & reverse ID\(\rightarrow \)IRI For systems that choose to replace IRIs/literals with IDs, the mapping mechanism is either a B+ tree, which supports range queries well, or a hash table, which is more efficient for single lookups.

Table 6 Compression design choices

5.2 Compression

Table 6 presents the triplestores’ compression choices grouped by the choice for the main triple store. Most early systems (circa 2003–2009) do not compress the main data at all. A number of systems (RDF-3X [55], Sparqling kleene [31], Virtuoso [24], YARS2 [36]) employ block-level compression. This approach entails organizing the triples in memory blocks in a manner that allows compression using techniques such as Huffman encoding [39]. YARS2 [36] also employs sparse representations (also used by gStore [88] and Triple-T [82]), where the indexes contain only the used values of a position rather than the full domain. SW-store [1] and TripleBit [87] both use some form of column compression, where the values are stored by column rather than by row, allowing the system to skip empty rows, store only the difference from the previous value (delta compression), or apply other column-compression methods. Trident [76] uses a combination of different compression mechanisms. In particular, triples are stored in so-called binary tables, since they encode only two of the three SPO positions. The triples are also represented either row-wise or column-wise within different blocks in order to exploit run-length encoding and other types of compression when some values are repeated. The final three solutions are derived from the subdivision choices made by the system. Using a classic relational row-based representation, DB2 Graph [17] chose to limit the number of predicate columns in its table to a fixed K; this reduces the chance of empty columns in record rows and can be considered a form of compression. BitMat [11] employs row-based compression of its bit matrix using methods similar to those employed for column-based compression, and Chameleon [5] utilizes order-preserving compression of its in-lined literals. We also note that the ring data structure employed to represent the graph via a set of suffix strings [8] is a way to compress the triples and to index all their SPO permutations, yet the current implementation for RDF is a read-only solution that requires further study.

A second design choice in the compression dimension is whether to avoid storing IRIs and literal values as repeated strings and instead use numerical IDs. Of the 20 systems reviewed, only eight chose not to do so. 3XL [48] and Chameleon [5] replace only IRIs with IDs: 3XL leaves all literals in their original form, and Chameleon compresses the literals in place using order-preserving string compression. Virtuoso and Jena (TDB) [58] inline small literals and encode the larger ones. The remaining systems replace all strings with IDs, regardless of their type.

ID compression is presented by increasing strength of compression. Int denotes the usage of sequential integers instead of the original string-based IRIs/literals. More compression can be achieved by compressing the IDs into fixed-length integers (e.g., Virtuoso [24]). In an example of extreme compression, TripleBit [87] compresses IDs into an average of 2–3 bytes by using variable-length compression instead. Notable systems missing from this analysis are Oracle [84] and GraphDB [15], which do not provide details regarding this design choice in their publicly available documentation.

Table 7 Redundancy

5.3 Redundancy

Table 7 presents a high-level summary of the implementations employing multiple redundant representations for the main triple data. On average, there are three redundant representations per system. Only one system (Sparqling kleene [31]) offers reachability indexes to answer triple patterns such as s p* o (i.e., property paths with Kleene-star patterns, see Definition 3). Only one system (gStore [88]) offers a data representation tuned to matching an N-way same-position pivot (a star-shaped subgraph structure). gStore is also notable for being the only system employing a Bloomier filter construct [19] (the VS*-tree) as a secondary data representation. Bloom filters allow determining the existence of a data item using a compact representation with low latency at the expense of false positives, but never false negatives. Bloomier filters [19] extend this capability to support several functions, in this case returning the vertices matching a set of PO/SP patterns. The use of a PO/PS hash map as part of the VS*-tree lookup process is an almost unique example of using two-position combination hash maps (e.g., sp\(\rightarrow \)o). Notably, a recent extension of the Jena (TDB) system to support a worst-case-optimal join algorithm [38] has added three additional B+-tree indexes to support this pattern. This results in a high level of redundancy of the data, which new approaches are trying to overcome with more compressed (and far less redundant) indexed representations [8]. Also, Trident [76] replicates the six permutations of the SPO positions but employs an adaptive storage representation to reduce the memory footprint.

Table 8 Summary of the survey of the features of existing systems

5.4 Summary

Table 8 summarizes the key insights obtained while reviewing the surveyed systems as detailed above. In particular, we list recent trends in implemented design choices and unexplored data representations.

Dimensional interdependence As a final note, while the three SCR dimensions are orthogonal and independent for the most part, the surveyed space shows a few cases where a design decision over one dimension impacts the availability of design decisions in others. This impact can either limit the availability of options or enable options that could not be chosen otherwise. For instance, choosing to replace IRIs with numerical IDs as a design choice over the compression dimension both enables and requires a choice of subdivision for the ID\(\rightarrow \)IRI and IRI\(\rightarrow \)ID mappings. Similarly, the compression design choice of using non-sequential integers to encode literals limits the ability of B+-trees and other sorted structures to support sequence-compatible access patterns such as closed ranges, making these subdivision choices less attractive.

6 Design space analysis

Table 9 Effect of different storage solutions for each SCR dimension to address access patterns for Constants in the query

In this section, we study the data representations within the design space introduced in Sect. 4, which can be employed to satisfy the requirements of each access pattern analyzed in Sect. 3. Each design decision is motivated by a specific use case but comes with inherent trade-offs. Our SCR model (Sect. 4) provides the necessary framework to explicitly analyze the implications of different design decisions. Therefore, we also provide an analysis of the existing design decisions listed in Sect. 5, identifying unexplored design choices.

6.1 Matching access patterns to the design space

In the following, we provide an analysis cross-referencing the access patterns in Table 1 to the SCR dimensions (Fig. 4) and their embodiments in Tables 5, 6, and 7. This links the requirements of a specific access pattern with an appropriate design choice.

Constants (Table 9) Our model shows that, for access patterns matching partially instantiated triples, the close link between the effects of subdivision, redundancy, and compression is particularly evident. As mentioned earlier, the presence of constants in an access pattern increases its selectivity. A high degree of subdivision is therefore highly beneficial to reduce the search space for fully and partially instantiated search patterns. Nevertheless, for partially instantiated triple patterns we obtain a benefit only if the subdivision is at least sequence compatible with the instantiated positions. Otherwise, when the representation is not sequence compatible, the system has to traverse more subsets with a larger cost in random reads (e.g., if we search for all triples with a specific predicate while using a hash table representation with buckets on subjects, as in Fig. 3). We conclude that, for partially instantiated triple patterns, high subdivision can be highly beneficial if it is perfectly compatible with the structure of the query, but strongly limits query performance if it is not. For very large range scans (or when no range condition applies), instead, the common approach of storing the entire dataset within a specific index can be highly detrimental if the data structure does not support an efficient sequential scan (e.g., a linked hash table).
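The following toy example (integer-encoded triples, a hash-map subdivision by subject, all names ours) illustrates this (in)compatibility: a subject-constant pattern resolves with a single bucket lookup, whereas a predicate-constant pattern must traverse every bucket.

```python
# Toy subdivision by subject: compatible accesses hit one bucket, while
# incompatible ones must scan all buckets (triples are integer-encoded).
from collections import defaultdict

triples = [(0, 10, 20), (0, 11, 21), (1, 10, 22), (2, 12, 23)]

by_subject = defaultdict(list)
for s, p, o in triples:
    by_subject[s].append((p, o))

# Compatible: constant subject -> one bucket lookup.
fast = by_subject[0]                                    # [(10, 20), (11, 21)]

# Incompatible: constant predicate -> every bucket is traversed.
slow = [(s, o) for s, po_list in by_subject.items()
        for (p, o) in po_list if p == 10]
assert slow == [(0, 20), (1, 22)]
```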

Compression could be beneficial for fully instantiated access if it allows searching efficiently for the compressed values. Yet, employing compression is detrimental if several decompression steps are required to check the existence of specific values. On the other hand, for partially instantiated triple patterns, given that we expect a larger intermediate result, compression could reduce data transfer times. Moreover, we could implement search through bitwise comparisons [87]. For uninstantiated access patterns, we will instead usually gain a substantial benefit from compression due to the reduced data transfer cost.

Finally, on the redundancy dimension, in the fully instantiated case a data representation designed to answer existence queries (e.g., Bloom filter) is highly beneficial. Instead, for partially instantiated queries we should exploit different tree or hash indexes as we search for bindings to the variable positions given the instantiated positions. Hence, multiple indexes are necessary to cover different combinations of instantiated and uninstantiated positions (e.g., \(s,p{\mapsto }o\) or \(s,o{\mapsto }p\)). In both cases, high redundancy provides advantages for partially and fully instantiated access patterns. Conversely, redundancy is ineffective for uninstantiated queries since an entire scan of the data will be required.
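As a sketch of such redundancy (again illustrative, not the layout of any specific system), one can maintain one two-position hash map per combination of instantiated positions, so that any pattern with two constants is answered by a single lookup.

```python
# Redundant two-position hash maps over integer-encoded triples: each map
# serves the patterns whose constants match its key positions.
from collections import defaultdict

triples = [(0, 10, 20), (0, 11, 21), (1, 10, 22)]

sp_to_o, so_to_p, po_to_s = defaultdict(set), defaultdict(set), defaultdict(set)
for s, p, o in triples:
    sp_to_o[(s, p)].add(o)
    so_to_p[(s, o)].add(p)
    po_to_s[(p, o)].add(s)

assert sp_to_o[(0, 10)] == {20}   # bindings for ?o given s=0, p=10
assert po_to_s[(10, 22)] == {1}   # bindings for ?s given p=10, o=22
```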

Filter (Table 10) For filters, when the subdivision is sequence compatible with the search range, it has a positive effect. When the representation is not sequence compatible, i.e., the values span multiple partitions, the subdivision will cause a larger cost. Instead, when the filter condition requires membership in a particularly small set of values and it is possible to enumerate those values (e.g., years between 2015 and 2020), we can actually exploit a hash table index by translating this access pattern into fully or partially instantiated triple patterns. Still, this approach requires the system to be aware of the domain or to have a precomputed set of values. For the case where no filter condition is specified, in the worst case we require a full scan of the data or the execution of a partially instantiated BGP (see above). Compression is beneficial in terms of data transfer but could be detrimental when attribute values need to be compared in a filter operation. In the redundancy dimension, on the other hand, for filters involving a large range of values or a range of values that cannot be enumerated, a tree index is a common effective solution, as long as it supports range scans, i.e., the data representation is sequence compatible after paying an initial seek cost, e.g., a B+-tree.
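A small sketch of the enumeration idea mentioned above (assuming a toy object-value hash index and a known value domain): a closed range over a handful of values is rewritten as a union of point lookups.

```python
# Rewriting FILTER(2015 <= ?year <= 2020) as point lookups on a hash index
# keyed by object value (illustrative; assumes the value domain is known).
from collections import defaultdict

by_object = defaultdict(set)   # object value -> set of (subject, predicate)
for s, p, o in [("ex:p1", "ex:year", 2016),
                ("ex:p2", "ex:year", 2019),
                ("ex:p3", "ex:year", 2010)]:
    by_object[o].add((s, p))

matches = set().union(*(by_object[y] for y in range(2015, 2021)))
assert matches == {("ex:p1", "ex:year"), ("ex:p2", "ex:year")}
```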

Table 10 Effect of different storage solutions for each SCR dimension to address Filter access patterns in the query
Table 11 Effect of different storage solutions for each SCR dimension to address Traversal access patterns in the query
Table 12 Effect of different storage solutions for each SCR dimension to address Pivot access patterns in the query

Traversals (Table 11) For traversal access patterns, 1-hop traversals are a special case of a partially instantiated triple pattern (i.e., SP and PO). Therefore, for 1-hop traversals the advantages of subdivision and redundancy are the same as those of the corresponding partially instantiated access pattern in Table 9. When the traversal is bound by a specific predicate, subdivision across edges of the same predicate is usually highly effective in reducing the search space and achieves particularly good performance when the data representation is seek-compatible with the access pattern retrieving the next edge in the path. On the other hand, for traversals that are not bound by a specific predicate, subdivision across the same subjects can provide some benefits, but subdivision across predicates or other kinds of subdivisions usually increases the number of required search steps since it is not seek-compatible. In general, compressing intermediate results can be highly effective for large intermediate result sets. However, using a compressed representation could be expensive if the compression does not allow (for instance) fast intersections (e.g., intersecting object and subject lists). Finally, exploiting redundancy, when the access pattern starts from a node and traverses a fixed number of hops with the same property (k-hops) or an unbounded number of hops (*/+-hops), specialized indexes can be highly effective [31].
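A toy predicate-bound k-hop traversal over an adjacency map keyed by (subject, predicate) illustrates why subdivision by predicate is seek-compatible with retrieving the next edge on the path (the layout is ours, for illustration only).

```python
# Predicate-bound k-hop traversal over a (subject, predicate) -> objects map.
from collections import defaultdict

adj = defaultdict(set)
for s, p, o in [("a", "knows", "b"), ("b", "knows", "c"), ("c", "knows", "d")]:
    adj[(s, p)].add(o)

def k_hop(start: str, predicate: str, k: int) -> set:
    frontier = {start}
    for _ in range(k):
        # Each step is one seek per frontier node within the predicate's edges.
        frontier = set().union(*(adj.get((n, predicate), set()) for n in frontier))
    return frontier

assert k_hop("a", "knows", 2) == {"c"}
```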

Pivot (Table 12) Access patterns that are joined together on the same subject, with the pivot joining either two or N different triple patterns (forming so-called star shapes [77]), are among the most optimized by existing triplestores (see Sect. 5). On the subdivision dimension, different schemes to partition over subjects are commonly applied to provide sequence compatibility and are then very effective. In particular, representing attributes as table columns in a relational model (e.g., Fig. 3c) has also proven highly beneficial [17]. Other similar clustered representations could in theory be adapted to answer specific types of pivots joining N different triple patterns. In the general case, WCOJ [38] requires different representations to access triples in sorted order. For pivot patterns on object to subject, the trade-offs are similar to a special case of a 2-hop traversal, although in this case the predicates involved are also distinct and the intermediate variable binding is usually returned. Therefore, one could employ ad hoc indexes for different 2-hop paths, deploy subdivisions based on predicates, or employ compressions that enable bitwise intersections. Compression can of course help data transfer, but it can also help search, e.g., via ID intersection across compressed representations. Finally, on the redundancy dimension, most indexes provide fast execution of these patterns, which is beneficial for modern join algorithms, e.g., WCOJ [38].
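A toy property-table layout for the N-way same-subject pivot (the predicate names are hypothetical): each subject is one row and each predicate one column, so a star-shaped BGP becomes a single row lookup rather than N joined triple patterns.

```python
# Property-table layout: subject -> {predicate: object}. A star BGP such as
# { ex:alice ex:name ?n . ex:alice ex:city ?c } resolves from one row.
property_table = {
    "ex:alice": {"ex:name": "Alice", "ex:age": 34, "ex:city": "Oslo"},
    "ex:bob":   {"ex:name": "Bob",   "ex:age": 29},
}

row = property_table["ex:alice"]
bindings = {"?n": row["ex:name"], "?c": row.get("ex:city")}
assert bindings == {"?n": "Alice", "?c": "Oslo"}
```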

Table 13 Effect of different storage solutions for each SCR dimension to address Return access patterns in the query
Table 14 Effect of different storage solutions for each SCR dimension to address Write access patterns in the query

Return (Table 13) The requirements of access patterns also depend on the type of information they extract from the evaluation of the BGP, i.e., the returned values. This means that, to return all the variable bindings, Subdivision can either increase the number of steps required to obtain the answers or provide a benefit by allowing the system to read less data, depending on its compatibility with the access pattern. One important advantage is provided by sorted representations when they are compatible with the sorting order imposed by the query (e.g., by an ORDER BY clause). On the other hand, when only distinct value bindings are needed, a key-value data structure could be exploited to scan just the keys, thus allowing even a scan-compatible access. Note that this may not be possible for arbitrary hash indexes. Also, for group-by aggregates, Subdivision could be beneficial when all the values over which to compute the aggregate reside within a single partition. A subdivision that is at least sequence compatible with the access via the group-by key would also provide improved performance.

Compression can be advantageous for the transmission of large intermediate results when returning all answers, but when returning results to the user, we will need to decompress all the variable bindings, incurring extra work. Note that some compression methods are particularly effective for, or even require, sorted data. Moreover, for returning distinct values, compression of repeated values (e.g., \(\langle \)key;count\(\rangle \) pairs) could improve the scan performance and avoid generating duplicates in the first place. When verifying existence, compression could be detrimental if decompression is required before checking the predicate. Finally, in the case of aggregate information, representations in the form of pre-aggregated values provide a large advantage, while compression increases the cost, except for counts when the actual values to be counted are not needed.

Redundancy does not introduce any further advantage or disadvantage when all values are required to be returned, unless the values need to be returned in a particular order. In this case, having a redundant representation compatible with the required order will save an expensive sorting step. Furthermore, when a search pattern verifies only existence, e.g., to verify whether node \(n_1\) is reachable from \(n_0\) without actually listing the paths that connect the two, Bloom filters and similar types of set-based indexes (i.e., higher redundancy) could be exploited effectively and provide perfect compatibility. Finally, storing aggregate data in distinct redundant data structures is highly beneficial to save computations.

Write (Table 14) Write operations usually involve two steps: (i) determining which tuples to insert or delete and (ii) materializing their insertion or deletion. The first step requires the same form of access pattern as read queries. However, in the second step, different data representations need to be updated to have one or more triples added or removed. In particular, with higher subdivision we usually only need to update small, localized data structures containing the tuple to insert or delete. This can also allow smaller localized locks, improving concurrency. Yet, when an insertion takes place, we may need to resize the data structure. In some cases, this results in a chain effect of redistributing elements among partitions, such as when a B+-tree node reaches its maximum capacity, or in a sorted list. In compressed representations, both inserts and deletes may require decompression and re-compression. On the other hand, structures like fixed-size bitmaps can be updated more easily with simple bit flips. Other structures, like Bloom filters, usually allow only insertions and no deletions. Finally, higher redundancy requires each write operation to be mirrored in every redundant copy.
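The following sketch shows why a fixed-size bitmap representation is cheap to update: inserting or deleting a triple flips a single bit, with no node splits or re-compression, assuming the subject/object ID space is fixed in advance (the sizes below are arbitrary).

```python
# One bitmap per predicate over a fixed (subject, object) ID space;
# inserts and deletes are single bit flips.
N_SUBJECTS, N_OBJECTS = 1024, 1024
bitmap_for_p = bytearray(N_SUBJECTS * N_OBJECTS // 8)

def bit_index(s: int, o: int) -> int:
    return s * N_OBJECTS + o

def insert(s: int, o: int) -> None:
    i = bit_index(s, o)
    bitmap_for_p[i // 8] |= 1 << (i % 8)

def delete(s: int, o: int) -> None:
    i = bit_index(s, o)
    bitmap_for_p[i // 8] &= 0xFF ^ (1 << (i % 8))

insert(3, 7)
assert bitmap_for_p[bit_index(3, 7) // 8] & (1 << (bit_index(3, 7) % 8))
delete(3, 7)
assert not bitmap_for_p[bit_index(3, 7) // 8]
```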

Table 15 Compatibility between Access patterns and design choices

6.2 A compatibility-based analysis

After exploring how the design space dimensions intersect with the access patterns, we now show how the notion of compatibility and the cost model presented in Sect. 4 can be used to assess the fit between existing solutions (as surveyed in Sect. 5) and these access patterns. Furthermore, our classification enables the identification of unexplored design options and the characterization of the optimal design choice for an access pattern.

Analysis methodology Starting with a specific cell in one of Tables 9, 10, 11, 12, or 13, one can investigate whether a new data representation could be employed in order to improve specific trade-offs. For example, consider range queries (i.e., a filter access pattern) and the redundancy dimension. In this case, we did not identify any existing solution that adopts a skip-list implementation (e.g., S3 [91]) or succinct indexes (e.g., SuRF [90]) optimized for both key-value search and range queries in order to support partially instantiated queries and ranges. Similarly, few systems explicitly employ stored representations (e.g., increasing redundancy) for the special type ranges (Filter \(\mathbb {T}\) in Table 1), which could be mapped to an enumerable range of predefined values (e.g., language tags).

A more advanced analysis can be performed for a specific query. Consider as an example the case of returning the count of values over a partially instantiated BGP with fixed predicate and variable subject and object, e.g., SELECT ?s (COUNT(?o) AS ?c) WHERE { ?s ex:friendOf ?o } GROUP BY ?s. For this case, we did not find systems that have a compressed representation of \(\langle \)key;count\(\rangle \) and a seek-compatible subdivision over predicates.
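A hypothetical sketch of such a representation (the names, layout, and ID assignments are ours): per predicate, a mapping from subject ID to a pre-computed count, so the group-by-count query above reduces to scanning a single predicate partition.

```python
# Per-predicate <subject; count> partitions over integer-encoded triples:
# SELECT ?s (COUNT(?o) AS ?c) ... GROUP BY ?s with a fixed predicate becomes
# one sequential scan of the partition for that predicate.
from collections import Counter, defaultdict

triples = [(0, 5, 20), (0, 5, 21), (1, 5, 22), (1, 6, 23)]  # predicate 5 plays the role of ex:friendOf

per_predicate_counts = defaultdict(Counter)
for s, p, o in triples:
    per_predicate_counts[p][s] += 1

answer = sorted(per_predicate_counts[5].items())   # [(0, 2), (1, 1)]
assert answer == [(0, 2), (1, 1)]
```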

Analysis results Table 15 presents a summary of the analysis insights. In particular, we show for read-only access patterns the best choice found across existing implementations as well as design choices that have not been explored despite (theoretically) providing better performance. We also point out those cases where some widespread design choices have important incompatibility with some access patterns, identifying cases where better options could be studied.

Considering the Subdivision dimension, many systems implement separate tables for triples sharing a specific predicate. For instance, queries that match triples with the same subject can usually exploit the clustered representation where several properties for the same subject are stored together in a single row [17]. Similarly, BGPs describing triples sharing the same object can exploit an analogous data organization. On the other hand, a constant in s or o can be exploited by a hash index of the form \(s\,{\mapsto }po\). Yet, no representation exploits subdivision for a pair of constants, e.g., \(sp\,{\mapsto }o\). For a fixed N-way pivot on the same variable, subdivisions like a property table, i.e., a table where each column is a property for the same subject, can be seen as efficient solutions, but are compatible only with a limited set of access patterns. Conversely, in all their forms, filter operators are, in general, efficiently executed over clustered and sorted representations for values (e.g., B+-trees), or other indexes that preserve order (e.g., skip lists). Yet, overall, we see that there is a strong incompatibility between filters on literals and the fact that all representations store literals and special types in mixed subdivisions. Also, we note that subdivision alone is not sufficient because it can be compatible with only a few access patterns, so (as also discussed in the paragraphs below) a system will always require redundant representations.

Our survey also reveals the absence of index compression solutions (e.g., [89]). For instance, we do not find any application of effective compression solutions for secondary representations, such as counting indexes (e.g., sp\(\rightarrow \)#) and structures like Bloom filters [26] or compact LSM-tries like SuRF [90] that can also support existence checks, which are prevalent in the existing workloads (as discussed in the next section).

For redundancy, while 1-hop \(s{\rightarrow }o\) queries are the most common and involve a single triple, specialized indexes have been proposed to speed up reachability and multi-hop path expressions (e.g., Sparqling kleene [31]). Apart from Parliament [45], one can rarely find data representations supporting traversal patterns as required by k-hop and *-hop access patterns. Furthermore, solutions tend to rely on standard B+-tree and hash map implementations, avoiding the use of hybrid structures (e.g., hybrid indexes [65]). An exception is the VS*-tree adopted by gStore [88], which, among other capabilities, can return all subjects or objects that match a specific set of predicates and values, thus addressing the needs of a subset of the N-way pivot access pattern where the variable has the same position in all triple patterns. Data representations that can efficiently return pre-computed match counts or other pre-computed aggregate values provide direct benefits (e.g., group-by-query clusters [4]). Nonetheless, redundancy is used to a large extent.

Recent advancements in query optimization have led to the introduction of worst-case optimal join (WCOJ) algorithms [8, 38] that make use of highly redundant representations. These algorithms bound query complexity by a factor of the final result size by performing an N-way join with a pivot over the same variable in all positions. Current implementations require sorted access to a single triple pattern for triples matching the BGP sharing the same variable. However, conceptually, a solution providing all mappings of a variable at once could support these algorithms better. The dyadic tree proposed to index gap boxes in the geometric resolution approach to multi-way joins [43] provides a possible avenue for providing such support. A gap box is a multi-dimensional representation of the uncovered portions of the domains of variables participating in a multi-way join. In the absence of such a representation for RDF data, one solution [38] (extending a Jena TDB system) uses a total of six B+-tree indexes to cover all position permutations and thus settles for a non-optimal (non-seek-compatible) fully redundant solution. Another recent solution [8] opts for a highly compressed and extremely subdivided wavelet-tree-based structure that is built as a bidirectional spo ring, which represents the entire graph as a string and indexes all its suffixes. However, the current implementation does not support many other access patterns, most notably those from the write dimension.

Thus, in practical applications one is required to consider a full-system design and to analyze the access patterns of an entire workload. We provide an example of such analysis in the following section.

7 Workload case analysis

In this section, we provide an empirical analysis of the access patterns present in different workloads from existing benchmarks and datasets. Moreover, we show how the analysis can be used to guide the choice of which data representations to adopt with respect to a query workload.

7.1 Analyzed workloads

Previous analyses (e.g., [70]) focused on specific features of the language and query complexity (i.e., the topology of the graph patterns). Therefore, such analyses are centered around those characteristics that mainly impact the query optimizer. Instead, here we present an analysis of the access patterns specific to the storage layer (described in Sect. 3), whose performance correlates directly with the available physical data representations. In this sense, this analysis also provides a complementary view to existing studies.

In particular, we employ 10 workloads: 5 from popular synthetic SPARQL benchmarks and 5 based on real-world query-logs. The synthetic benchmarks are the Lehigh University Benchmark (LUBM) [32], SP2Bench [71], the LDBC Social Network Benchmark (LDBC) [23], the FEASIBLE Benchmark [69], and the Waterloo SPARQL Diversity Test Suite (WatDiv) [3], comprising 14, 17, 19, 50, and 50 queries, respectively.

The real workloads are based on queries against public biological knowledge graphs (BioBench [83]), query logs from the DBpedia endpoint [16], user-submitted queries to the public WikiData endpoint [79], the Semantic Web Dog Food query log [53], and a popular benchmark for complex natural question answering over Freebase (Complex) [75]. The workloads contain 22, 46, 3.4 k, 64 k, and 169.7 k queries, respectively.

To identify the access patterns, we implemented a static analysis parser for SPARQL, available online.Footnote 7 The analyzer extracts the parse tree from each query with the Jena query parser [7] and automatically maps query constructs to the aforementioned access patterns (see Sect. 3 and Fig. 2). Hence, the input of the parser is a given query workload, and the output is an analysis of the prevalence of each specific access pattern in that workload. We envision that the provided tool can be used by practitioners to evaluate their own query logs for the prevalence of specific access patterns. These patterns can then be compared with the support provided by their systems of choice. Thus, for example, if a workload is characterized by an abundance of k-hops, it would be prudent to evaluate the support by the current system compared to an alternative system that provides a suitable representation, such as the VS*-tree provided by gStore [88] or the reachability index in Sparqling kleene [31].

7.2 Limitations

In general, the existence of an access pattern in a query does not mean that a specific system must utilize it to answer the query. For example, k-hop patterns such as {s1 p1/p2 ?o1} can be answered by converting the BGP into the \(O\equiv S\) pivot pattern {s1 p1 ?v1. ?v1 p2 ?o}. Another example is the sorted value return pattern, which can be used by different algorithms to answer other query patterns as well, e.g., by WCOJ [38] to perform the efficient intersection in an N-way pivot. Yet, in our analysis, we counted only queries explicitly requiring sorted output in the form of an ORDER BY clause. Also, we excluded from this analysis those ORDER BY clauses that followed an aggregation resulting from a GROUP BY clause, since the sorted output could not be directly derived from a sorted retrieval operation.

Table 16 Prevalence of access patterns in popular RDF benchmarks

7.3 Results and discussion

The aggregated results over all benchmarks are presented in Table 16. Constants are mainly used in predicate (P) and subject-predicate (SP) combinations. The abundance of P-constant patterns justifies the prevalent use of subdivision by predicates in the systems reviewed, which allows rapid reduction of the search space. SP combinations are better served by point-lookup mechanisms such as hash maps, yet this alone does not justify the redundancy of having indexes for all permutations of SPO since, for instance, patterns selecting objects (O) are quite rare.

The large number of binary same-position (e.g., \(S{\equiv }S\)) pivots requires systems to supplement the subdivision of data representations by predicate with other redundant data representations supporting access patterns compatible with selection by subject/object. Within the binary different-position pivots (BiD in Table 16), the \(O{\equiv }S\) pivot dominates, since very few queries involve the pivot in the predicate positions. The N-way pivots are highly prevalent in all benchmarks. Perhaps surprisingly, arbitrary-position variable-centric pivots (NwA in Table 16) are almost as prevalent as star patterns. The recent emergence of WCOJ algorithms [38] optimizing this access pattern has exposed the limitations of existing data representations in supporting this pattern and required the introduction of novel representations such as the recently proposed ring [8], which allows all-position pivots on a selected variable in a single representation.

Notably, a large number of benchmark queries employ special range filters, e.g., filters for specific languages (e.g., @en) or specific types of literals, i.e., numeric values vs. strings vs. dates. This is especially prevalent when querying multi-language knowledge graphs and when filtering values only for literals of a specific data type. This calls for indexes or subdivisions of triple objects by languages and data types. Yet, none of the data representations in the reviewed storage systems explicitly support such access patterns, although the benchmark queries suggest they could be especially useful. Conversely, the relative absence of closed/open range filters from most benchmarks corresponds well with the decreased reliance of systems on B+-trees observed in Sect. 5.3, as the relative advantage of B+-trees over hash-based lookup mechanisms is greatly diminished in the absence of closed/open range queries or other access patterns that require sorted access.

When examining the traversal patterns, it is evident that 1-hop traversals dominate, although k-hop traversals have a substantial presence as well. It is interesting to note that LDBC queries represent an outlier, since almost all of its queries employ k-hop traversals. For such workloads, a system supporting efficient k-hops (e.g., gStore [88]) could outperform systems that are forced to break these k-hops up into multiple 1-hop patterns. Moreover, traversals with unbounded path queries (marked as \(P*\)) are quite rare. This rarity renders reachability indexes less useful in practice when compared to the extra space and update cost they require. It is an open question whether the currently limited presence of these queries is due to the complexity of the query language (i.e., users are not familiar with expressing such queries) or to most systems not being optimized for these access patterns, so that such queries are known to be slow and are avoided by users.

Another important finding is that over 80% of all the queries can make use of existence access patterns. Recall that in these patterns, no triple values are required, but only a simple test of existence. This prevalence is especially striking when considering that almost no existing data representations specifically target existence queries. In many ways, the success of bit representations such as TripleBit [87] and BitMat [11] can be attributed to this prevalence. Therefore, data representations that are compatible with existence access patterns, such as Bloom filters, are an unexplored but highly promising addition to existing systems and could very likely replace less frequently used indexes (e.g., B+-trees on OSP permutations). There is a relatively small number of queries requiring an explicitly sorted result. This allows for the use of unsorted representations and algorithms for much of the retrieval, leaving the sorting for post-processing. Finally, we see that distinct values are often required, while aggregation is present in only a few queries. However, once more, since aggregation queries are currently computationally expensive, we cannot exclude that better performance for such queries would result in more frequent use.

One could also use Table 16 to perform a retrospective analysis of previously published benchmarks. For example, upon its introduction, DB2 Graph [17] was compared with RDF-3X [55] over both LUBM and DBpedia. Two major differences between these two benchmarks are the prevalence of P-position constant queries in DBpedia, as well as more queries returning distinct values and existence checks rather than full results. Thus, one would expect RDF-3X, with its P-centric data representations (which DB2 Graph lacks) and its use of IRI replacement, to perform comparatively better on DBpedia than on LUBM. Indeed, the more modern DB2 Graph was not able to outperform RDF-3X on DBpedia despite doing twice as well on LUBM.

8 Related surveys

Sakr and Al-Naymat [68] defined a taxonomy of data representations for relational-based RDF triplestores. In this work, in contrast, we review centralized RDF triplestores regardless of their underlying technology.

Modoni et al. [52] review triplestores in the context of their usefulness as meta data management systems used by other software in an enterprise scenario. Thus, they focus on technical and non-functional features such as user management, security, and programming language support, unlike our focus on data representation.

Ma et al. [50] review RDF triplestores by their logical data representation, dividing the landscape into relational (traditional) and non-relational (NoSQL). They further divide relational representations into vertical (a single triple table with the three columns s, p, and o), horizontal (s as row, p as column, and o as value, or s as row, p as table, and o as value), and type (multiple standard horizontal tables partitioned by the type of s). In this work, we abstract beyond relational/non-relational, looking at three orthogonal design dimensions.

Özsu [59] focuses on several representative solutions, contrasting the centralized and distributed approaches, and, within the centralized approach, lists a few architectural choices, such as whether to rely on a relational mapping from RDF to an RDBMS. Pan et al. [60] present a similar taxonomy, complementing it with a review of the prominent benchmarks used in the field and a short comparative empirical evaluation. In this work, we instead abstract across different architectural choices to identify the underlying design space.

Abdelaziz et al. [2] focus in their review on distributed triplestores and on large-scale benchmarks. Kaoudi and Manolescu [42] similarly limit their analysis to cloud-native solutions. In this work, we focus on centralized stores since the distribution of a data structure can be handled as an orthogonal aspect.

Wylot et al. [85] provide a taxonomy whose two main branches are centralized and distributed, albeit with a more detailed categorization of the centralized branch into six architectural approaches. The six types offered are: triple tables, property tables, index permutations, vertical partitioning, graph-based, and binary storage. While these correspond to some of the options in our subdivision and redundancy dimensions, the authors do not discuss additional solutions we identify, e.g., specific indexes for specific access patterns, such as hash indexes and reachability indexes. They do not explicitly address compression options (as we do in Table 6) and they do not discuss the compatibility between access patterns and design choices. In this paper, we identify the fine-grained design space options available to system designers and align them across three orthogonal dimensions to allow identifying new, previously unexplored, combinations. Therefore, our analysis is significantly more generalizable, as demonstrated by our ability to analyze 22 systems over the proposed SCR space, overcoming the limitations of Wylot et al. [85].

Pérez et al. [62] performed worst-case complexity analysis of SPARQL operators regardless of any indexes or physical data representations by counting the number of edges traversed in a conceptual graph representation for each pattern. In our work, we align between design choices for data representation and their effect on the cost of different access patterns.

The Data Calculator is based on the analysis provided by the RUM Conjecture [10], which explores the trade-offs between read times (R), update cost (U), and memory (or storage) overhead (M). The intuition behind the RUM Conjecture is also shared by our SCR space, although we perform a complementary analysis.

Finally, in the design of Peloton [61], Pavlo et al. present an overview of self-driving actions, i.e., the types of actions that a self-tuning (also called self-driving) relational system must support. These actions are divided into three classes: runtime actions, physical actions, and data actions. Moreover, these actions are limited to the relational data model. In practice, for the storage level, they only allow moving from a columnar layout to a row layout (and vice versa) or adding and dropping indexes. Nonetheless, we envision the possibility of expanding the Peloton framework to the case of triplestores and believe that the analysis provided here is a fundamental contribution in this direction.

9 Conclusions and future work

In this work, we introduce the new Subdivision-Compression-Redundancy (SCR) design space of data representations for RDF databases. We also introduce a new feature space for analyzing query workload access patterns. Together, they allow the analysis of RDF store design decisions, specifically, which data representations can effectively support a given workload. We performed an analysis of popular RDF store benchmarks along these lines and showed that multi-hop traversals and existence checks are broadly under-served by existing RDF store designs, which are oriented toward providing support for 1-hop traversals, pivots, and access patterns featuring a constant predicate value. Many of these design choices can be made at run-time (e.g., by using an automated tuning mechanism) following an inspection of a query load, or in advance for a planned workload. To map between query loads and design choices, we offer a simple cost model, a feature space over which query loads can be evaluated, and an analysis of the impact of these choices. Thus, we lay the ground for future RDF stores to design novel solutions over this space in an informed manner. Such designs can be achieved either manually or even semiautomatically, as proposed by recent efforts in self-designing data structures [40] and self-organizing relational database systems [61]. By cross-referencing the access patterns in a query workload with SCR design options, future RDF stores will be able to add system components that match the expected workload. In future work, we intend to explore the architecture and design principles of such systems.