Towards a Scalable Semantic-Based Distributed Approach for SPARQL Query Evaluation
Over the last two decades, the amount of data which has been created, published and managed using Semantic Web standards and especially via Resource Description Framework (RDF) has been increasing. As a result, efficient processing of such big RDF datasets has become challenging. Indeed, these processes require, both efficient storage strategies and query-processing engines, to be able to scale in terms of data size. In this study, we propose a scalable approach to evaluate SPARQL queries over distributed RDF datasets using a semantic-based partition and is implemented inside the state-of-the-art RDF processing framework: SANSA. An evaluation of the performance of our approach in processing large-scale RDF datasets is also presented. The preliminary results of the conducted experiments show that our approach can scale horizontally and perform well as compared with the previous Hadoop-based system. It is also comparable with the in-memory SPARQL query evaluators when there is less shuffling involved.
Recently, significant amounts of data have been created, published and managed using the Semantic Web standards. Currently, the Linked Open Data (LOD) cloud comprises more than 10000 datasets available online1 using the Semantic Web standards. RDF is a standard that represents data linked as a graph of resources following the idea of the linking structure of the Web and using URIs for representation.
To facilitate better maintenance and faster access to this scale of data, efficient data partitioning is needed. One of such partitioned strategies is semantic-based partitioning. It groups the facts based on the subject and its associated triples. We want to explore and evaluate the effect of semantic-based partitioning on query performance when dealing with such a volume of RDF datasets.
SPARQL is a W3C standard query language for querying data modeled as RDF. Querying RDF data efficiently becomes challenging when the size of the data increases. This has motivated a considerable amount of work on designing distributed RDF systems able to efficiently evaluate SPARQL queries [6, 20, 21]. Being able to query a large amount of data in an efficient and faster way is one of the key requirements for every SPARQL engine.
To address these challenges, in this paper, we propose a scalable semantic-based distributed approach2 for efficient evaluation of SPARQL queries over distributed RDF datasets. The main component of the system is the data partitioning and query evaluation over this data representation.
A scalable approach for semantic-based partitioning using the distributed computing framework, Apache Spark.
A scalable semantic-based query engine (SANSA.Semantic) on top of Apache Spark (under the Apache Licence 2.0).
Comparison with state-of-the-art engines and demonstrate the performance empirically.
The rest of the paper is structured as follows: Our approach for data modeling, data partitioning, and query translation using a distributed framework are detailed in Sect. 3 and evaluated in Sect. 4. Related work on the SPARQL query engines is discussed in Sect. 5. Finally, we conclude and suggest planned extensions of our approach in Sect. 6.
Here, we first introduce the basic notions used throughout the paper.
Apache Hadoop and MapReduce. Apache Hadoop is a distributed framework that allows for the distributed processing of large data sets across a cluster of computers using the MapReduce paradigm. Beside its computing system, it contains a distributed file system: the Hadoop Distributed File System (HDFS), which is a popular file system capable of handling the distribution of the data across multiple nodes in the cluster.
Apache Spark. Apache Spark is a fast and generic-purpose cluster computing engine which is built over the Hadoop ecosystem. Its core data structure is Resilient Distributed Dataset (RDD)  which are a fault-tolerant and immutable collections of records that can be operated in a parallel setting. Apache Spark provides a rich set of APIs for faster, in-memory processing of RDDs.
Data Partitioning. Partitioning the RDF data is the process of dividing datasets in a specific logical and/or physical representation in order to ease faster access and better maintenance. Often, this process is performed for improving the system availability, load balancing and query processing time. There are many different data partitioning techniques proposed in the literature. We choose to investigate the so-called semantic-based partitioning behaviors when dealing with large-scale RDF datasets. This partitioned technique was proposed in the SHARD  system. We have implemented this technique using in-memory processing engine, Apache Spark for better performance. A semantically partitioned fact is a tuple (S, R) containing pieces of information \(R \in (P, O)\) about the same S where S is a unique subject on the RDF graph and R represents all its associated facts i.e predicates P and objects O.
In this section, we present the system architecture of our proposed approach, the semantic-based partitioning, and mapping SPARQL to Spark Scala-compliant code.
3.1 System Architecture Overview
It consists of three main facets: Data Storage Model, SPARQL Query Fragments Translator, and Query Evaluator. Below, each facet is discussed in more details.
Data Storage Model. We model the RDF data following the concept of RDDs. RDDs are immutable collections of records, which represent the basic building blocks of the Spark framework. RDDs can be kept in-memory and are able to operate in parallel throughout the Spark cluster. We make use of SANSA ’s data representation and distribution layer for such representation.
Often flattening data is considered immature with respect to other data representation, we want to explore and investigate if it improves the performance of the query evaluation. We choose this representation for the reason of easy-storage and reuse while designing a query engine. Although, it slightly degrades the performance when it comes to multiple scans over the table when there are multiple predicates involved in the query. However, this is minimal, as Spark uses in-memory, caching operations. We will discuss this on the Sect. 4 into more detail.
SPARQL Query Fragments Translation. This process generates the Scala code in the format of Spark RDD operations using the key-value pairs mechanism. With Spark pairRDD, one can manipulate the data by splitting it into key-value pairs and group all associated values with the same keys. It walks through the SPARQL query (Step 4) using the Jena ARQ5 and iterate through clauses in the SPARQL query and bind the variables into the RDF data while fulfilling the clause conditions. Such iteration corresponds to a single clause with one of the Spark operations (e.g. map, filter, reduce). Often this operation needs to be materialized i.e the result set of the next iteration depends on the previous clauses and therefore a join operation is needed. This is a bottleneck since scanning and shuffling is required. In order to keep these joins as small as possible, we leverage the caching techniques of the Spark framework by keeping the intermediate results in-memory while the next iteration is performed. Finally, the Spark-Scala executable code is generated (Step 5) using the bindings corresponding the query. Besides simple BGP translation, our system supports UNION, LIMIT and FILTER clauses.
Query Evaluator. The mappings created as shown in the previous section can now be evaluated directly into the Spark RDD executable code. The result set of these operations is distributed data structure of Spark (e.g. RDD) (Step 6). The result set can be used for further processing and visualization using the SANSA-Notebooks (Step 7) .
3.2 Distributed Algorithm Description
This SPARQL query rewriter includes multiple Spark operations. First, partitioned data is mapped to a list of variable bindings satisfying the first basic graph pattern (BGP) of the query (line 2). During this process, the duplicates are removed and the intermediate result is kept in-memory (RDD) with the variable bindings as a key. The consequent step is to iterate through other variables and bind them by processing the upcoming query clauses and/or filtering the other ones unseen on the new clause. These intermediate steps perform Spark operations over both, the partitioned data and the previously bound variables which were kept on Spark RDDs.
In our evaluation, we observe the impact of semantic-based partitioning and analyze the scalability of our approach when the size of the dataset increases.
In the following subsections, we present the benchmarks used along with the server configuration setting, and finally, we discuss our findings.
4.1 Experimental Setup
We make use of two well-known SPARQL benchmarks for our experiments: the Waterloo SPARQL Diversity Test Suite (WatDiv) v0.6  and Lehigh University Benchmark (LUBM) v3.1 . The dataset characteristics of the considered benchmarks are given in Table 1.
WatDiv comes with a test suite with different query shapes which allows us to compare the performance of our approach and the other approaches. In particular, it comes with a predefined set of 20 query templates which are grouped into four categories, based on the query shape: star-shaped queries, linear-shaped queries, snowflake-shaped queries, and complex-shaped queries. We have used WatDiv datasets with 10M to 100M triples with scale factors 10 and 100, respectively. In addition, we have generated the SPARQL queries using WatDiv Query Generator.
Dataset characteristics (nt format).
#nr. of triples
We implemented our approach using Spark-2.4.0, Scala 2.11.11, Java 8, and all the data were stored on the HDFS cluster using Hadoop 2.8.0. All experiments were carried out on a commodity cluster of 6 nodes (1 master, 5 workers): Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10 GHz (32 Cores), 128 GB RAM, 12 TB SATA RAID-5. We executed each experiment three times and the average query execution time has been reported.
4.2 Preliminary Results
Performance analysis on large-scale RDF datasets.
Our evaluation results for performance analysis, sizeup analysis, node scalability, and breakdown analysis by SPARQL queries are shown in Table 2, Figs. 2, 3 and 4 respectively. In Table 2 we use “fail” whenever the system fails to complete the task and “n/a” when the task could not be completed due to a parser error (e.g. not able to translate some of the basic patterns to RDDs operations).
In order to evaluate our approach with respect to the speedup, we analyze and compare it with other approaches. This set of experiments was run on three datasets, Watdiv-10M, Watdiv-100M and LUBM-1K.
Scalability Analysis. In order to evaluate the scalability of our approach, we conducted two sets of experiments. First, we measure the data scalability (e.g. size-up) of our approach and position it with other approaches. As SHARD fails for most of the LUBM queries, we omit other queries on this set of experiments and choose only Q1, Q5, and Q14. Q1 has been chosen due to its complexity while bringing large inputs of the data and high selectivity, Q5 since it has considerably larger intermediate results due to the triangular pattern in the query, and Q14 mainly for its simplicity. We run experiments on three different sizes of LUBM (see Fig. 2). We keep the number of nodes constant i.e. 5 worker nodes and increase the size of the datasets to measure whether our approach deals with larger datasets.
We see that the query execution time for our approach grows linearly when the size of the datasets increases. This shows the scalability of our approach as compared to SHARD, in context of the sizeup. SHARD suffers from the expensive overhead of MapReduce joins which impact its performance, as a result, it is significantly worse than other systems.
Figure 3 shows the performance of systems on LUBM-1K dataset when the number of worker nodes varies. We see that as the number of nodes increases, the runtime cost of our query engine decreases linearly as compared with the SHARD, which keeps staying constant. SHARD performance stays constant (high) even when more worker nodes are added. This trend is due to the communication overhead SHARD needs to perform between map and reduce steps. The execution time of our approach decreases about 1.7 times (from 1,821.75 s down to 656.85 s) as the worker nodes increase from one to five nodes. SPARQLGX-SDE and Sparklify perform better when the number of nodes increases compared to our approach and SHARD.
Our main observation here is that our approach can achieve linear scalability in the performance.
Correctness. In order to assess the correctness of the result set, we computed the count of the result set for the given queries and compare it with other approaches. As a result of it, we conclude that all approaches return exactly the same result set. This implies the correctness of the results.
We can see from Fig. 4 that our approach performs better compared to Hadoop-based system, SHARD. This is due to the use of the Spark framework which leverages the in-memory computation for faster performance. However, the performance declines as compared to other approaches which use vertical partitioning (e.g., SPARQLGX-SDE on RDD and Sparklify on Spark SQL). This is due to the fact that our approach performs de-duplication of triples that involves shuffling and incurs network overhead. The results show that the performance of SPARQLGX-SDE decreases as the number of triple patterns involved in the query increases (see Q5) when compared to Sparklify. However, SPARQLGX-SDE performs better when there are simple queries (see Q14). This occurs because SPARQLGX-SDE must read the whole RDF graph each time when there is a triple pattern involved. In contrast to SPARQLGX-SDE, Sparklify performs better when there are more triple patterns involved (see Q5) but slightly worse when linear queries (see Q14) are evaluated.
Based on our findings and the evaluation study carried out in this paper, we show that our approach can scale up with the increasing size of the dataset.
5 Related Work
Partitioning of RDF Data. Centralized RDF stores use relational (e.g., Sesame ), property (e.g., Jena ), or binary tables (e.g., SW-Store ) for storing RDF triples or maintain the graph structure of the RDF data (e.g., gStore ). For dealing with big RDF datasets, vertical partitioning and exhaustive indexing are commonly employed techniques. For instance, Abadi et al.  introduce a vertical partitioning approach in which each predicate is mapped to a two-column table containing the subject and object. This approach has been extended in Hexastore  to include all six permutations of subject, predicate, and object (s, p, o). To improve the efficiency of SPARQL queries RDF-3X  has adopted exhaustive indices not only for all (s, p, o) permutations but also for their binary and unary projections. While some of these techniques can be used in distributed configurations as well, storing and querying RDF datasets in distributed environments pose new challenges such as the scalability. In our approach, we tackle partitioning and querying of big RDF datasets in a distributed manner.
Partitioning-based approaches for distributed RDF systems propose to partition an RDF graph in fragments which are hosted in centralized RDF stores at different sites. Such approaches use either standard partitioning algorithms like METIS  or introduce their own partitioning strategies. For instance, Lee et al.  define a partition unit as a vertex with its closest neighbors based on heuristic rules while DiploCloud  and AdPart  use physiological RDF partitioning based on RDF molecules. In our proposal, we use a semantic-based partitioning approach.
Hadoop-Based Systems. Cloud-based approaches for managing large-scale RDF mainly use NoSQL distributed data stores or employ various partitioning approaches on top of Hadoop infrastructure, i.e., the Hadoop Distributed File System (HDFS) and its MapReduce implementation, in order to leverage computational resources of multiple nodes. For instance, Sempala  is a Hadoop-based approach which serves as SPARQL-to-SQL approach on top of Hadoop. It uses Impala6 as a distributed SQL processing engine. Sempala uses unified vertical partitioning based on a single property table to improve the runtime of the star-shaped queries by excluding the joins. The limitation of Sempala is that it was designed only for that particular shape of the queries. PigSPARQL  uses Hadoop based implementation of vertical partitioning for data representation. It translates SPARQL queries into Pig7 LATIN queries and runs them using the Pig engine. A most recent approach based on MapReduce is RYA . It is a Hadoop based scalable RDF store which uses Accumulo8 as a distributed key-value store for indexing the RDF triples. One of RYA’s advantages is the power of performing join reorder. The main drawback of RYA is that it relies on disk-based processing increasing query execution times. Other RDF systems like JenaHBase  and H2RDF+  use the Hadoop database HBase for storing triple and property tables.
SHARD  is one approach which groups RDF data into a dedicated partition so-called semantic-based partition. It groups these RDF data by subject and implements a query engine which iterates through each of the clauses used on the query and performs a query processing. A MapReduce job is created while scanning each of the triple patterns and generates a single plan for each of the triple pattern which leads to a larger query plan, therefore, it contains too many Map and Reduces jobs. Our partitioning algorithm is based on SHARD, but instead of creating MapReduce jobs we employ the Spark framework in order to increase scalability.
In-Memory Systems. S2RDF  is a distributed query engine which translates SPARQL queries into SQL ones while running them on Spark-SQL. It introduces a data partitioning strategy that extends vertical partitioning with additional statistics, containing pre-computed semi-joins for query optimization. SPARQLGX  is similar to S2RDF, but instead of translating SPARQL to SQL, it maps SPARQL into direct Spark RDD operations. It is a scalable query engine which is capable of evaluating efficiently the SPARQL queries over distributed RDF datasets . It uses a simplified VP approach, where each predicate is assigned to a specific parquet file. As an addition, it is able to assign RDF statistics for further query optimization while also providing the possibility of directly query files on the HDFS using SDE. Recently, Sparklify  – a scalable software component for efficient evaluation of SPARQL queries over distributed RDF datasets has been proposed. The approach uses Sparqify9 as a SPARQL to SQL rewriter for translating SPARQL queries into Spark executable code. In our approach, intermediate results are kept in-memory in order to accelerate query execution over big RDF data.
6 Conclusions and Future Work
In this paper, we propose a scalable semantic-based query engine for efficient evaluation of SPARQL queries over distributed RDF datasets. It uses a semantic-based partitioning strategy as the data distribution and converts SPARQL to Spark executable code. By doing so, it leverages the advantages of the Spark framework’s rich APIs. We have shown empirically that our approach can scale horizontally and perform well as compared with the previous Hadoop-based system: the SHARD triple store. It is also comparable with other in-memory SPARQL query evaluators when there is less shuffling involved i.e. less duplicate values.
Our next steps include expanding our parser to support more SPARQL fragments and adding statistics to the query engine while evaluating queries. We want to analyze the query performance in the large-scale RDF datasets and explore prospects for the improvement. For example, we intend to investigate the re-ordering of the BGPs and evaluate the effects on query execution time.
This work was partly supported by the EU Horizon2020 projects Boost4.0 (GA no. 780732), BigDataOcean (GA no. 732310), SLIPO (GA no. 731581), and QROWD (GA no. 723088).
- 2.Abadi, D.J., Marcus, A., Madden, S., Hollenbach, K.J.: Scalable semantic web data management using vertical partitioning. In: Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, 23–27 September 2007, pp. 411–422 (2007)Google Scholar
- 3.Aluç, G., Hartig, O., Özsu, M.T., Daudjee, K.: Diversified stress testing of RDF data management systems. In: International Semantic Web Conference (2014)Google Scholar
- 5.Ermilov, I., et al.: The Tale of Sansa Spark. In: 16th International Semantic Web Conference, Poster & Demos (2017)Google Scholar
- 7.Graux, D., Jachiet, L., Geneves, P., Layaïda, N.: A multi-criteria experimental ranking of distributed SPARQL evaluators. In: 2018 IEEE International Conference on Big Data (Big Data), pp. 693–702. IEEE (2018)Google Scholar
- 9.Gurajada, S., Seufert, S., Miliaraki, I., Theobald, M.: TriAD: a distributed shared-nothing RDF engine based on asynchronous message passing. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, 22–27 June 2014, pp. 289–300 (2014)Google Scholar
- 11.Khadilkar, V., Kantarcioglu, M., Thuraisingham, B., Castagna, P.: Jena-HBase: a distributed, scalable and efficient RDF triple store. In: Proceedings of the 2012th International Conference on Posters & Demonstrations Track, ISWC-PD 2012, vol. 914, pp. 85–88 (2012)Google Scholar
- 13.Lehmann, J., et al.: Distributed semantic analytics using the SANSA stack. In: Proceedings of 16th International Semantic Web Conference - Resources Track (ISWC 2017) (2017)Google Scholar
- 15.Papailiou, N., Konstantinou, I., Tsoumakos, D., Karras, P., Koziris, N.: H2RDF+: high-performance distributed joins over large-scale RDF graphs. In: Proceedings of the 2013 IEEE International Conference on Big Data, 6–9 October 2013, Santa Clara, CA, USA, pp. 255–263 (2013)Google Scholar
- 16.Punnoose, R., Crainiceanu, A., Rapp, D.: Rya: a scalable RDF triple store for the clouds. In: Proceedings of the 1st International Workshop on Cloud Intelligence, Cloud-I 2012, pp. 4:1–4:8. ACM, New York (2012)Google Scholar
- 17.Rohloff, K., Schantz, R.E.: High-performance, massively scalable distributed systems using the MapReduce software framework: the SHARD triple-store. In: Programming Support Innovations for Emerging Distributed Applications, PSI EtA 2010, pp. 4:1–4:5. ACM, New York (2010)Google Scholar
- 18.Schätzle, A., Przyjaciel-Zablocki, M., Lausen, G.: PigSPARQL: mapping SPARQL to Pig Latin. In: Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011, pp. 4:1–4:8. ACM, New York (2011)Google Scholar
- 21.Stadler, C., Sejdiu, G., Graux, D., Lehmann, J.: Sparklify: A Scalable Software Component for Efficient evaluation of SPARQL queries over distributed RDF datasets. In: Proceedings of 18th International Semantic Web Conference (2019)Google Scholar
- 22.Weiss, C., Karras, P., Bernstein, A.: Hexastore: sextuple indexing for semantic web data management. PVLDB 1(1), 1008–1019 (2008)Google Scholar
- 23.Wilkinson, K.: Jena property table implementation. In: SSWS, Athens, Georgia, USA, pp. 35–46 (2006)Google Scholar
- 25.Zaharia, M., et al.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, USENIX (2012)Google Scholar
- 26.Zou, L., Mo, J., Chen, L., Özsu, M.T., Zhao, D.: gStore: answering SPARQL queries via subgraph matching. PVLDB 4(8), 482–493 (2011)Google Scholar
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.