Sempala: Interactive SPARQL Query Processing on Hadoop

Schätzle, Alexander; Przyjaciel-Zablocki, Martin; Neu, Antony; Lausen, Georg

doi:10.1007/978-3-319-11964-9_11

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8796)

Included in the following conference series:

International Semantic Web Conference

Abstract

Driven by initiatives like Schema.org, the amount of semantically annotated data is expected to grow steadily towards massive scale, requiring cluster-based solutions to query it. At the same time, Hadoop has become dominant in the area of Big Data processing, with large infrastructures already deployed and used in manifold application fields. For Hadoop-based applications, a common data pool (HDFS) provides many synergy benefits, making it very attractive to use these infrastructures for semantic data processing as well. Indeed, existing SPARQL-on-Hadoop (MapReduce) approaches have already demonstrated very good scalability; however, query runtimes are rather slow due to the underlying batch processing framework. While this is acceptable for data-intensive queries, it is not satisfactory for the majority of SPARQL queries, which are typically much more selective and require only small subsets of the data. In this paper, we present Sempala, a SPARQL-over-SQL-on-Hadoop approach designed with selective queries in mind. Our evaluation shows performance improvements by an order of magnitude compared to existing approaches, paving the way for interactive-time SPARQL query processing on Hadoop.

Author information

Authors and Affiliations

Department of Computer Science, University of Freiburg, Georges-Köhler-Allee 051, 79110, Freiburg, Germany
Alexander Schätzle, Martin Przyjaciel-Zablocki, Antony Neu & Georg Lausen

Editor information

Editors and Affiliations

Yahoo Labs, Diagonal 177, 08018, Barcelona, Spain
Peter Mika
Stanford University, 1265 Welch Road, 94305, Stanford, CA, USA
Tania Tudorache
University of Zurich, DDIS, Zurich, Switzerland
Abraham Bernstein
IBM Research, Yorktown Heights, NY, USA
Chris Welty
Information Sciences Institute and Department of Computer Science, University of Southern California, Los Angeles, CA, USA
Craig Knoblock
Google, USA
Denny Vrandečić & Natasha Noy
VU University Amsterdam, The Netherlands
Paul Groth
University of California, Santa Barbara, CA, USA
Krzysztof Janowicz
School of Computer Science, The University of Manchester, Manchester, UK
Carole Goble

About this paper

Cite this paper

Schätzle, A., Przyjaciel-Zablocki, M., Neu, A., Lausen, G. (2014). Sempala: Interactive SPARQL Query Processing on Hadoop. In: Mika, P., et al. The Semantic Web – ISWC 2014. ISWC 2014. Lecture Notes in Computer Science, vol 8796. Springer, Cham. https://doi.org/10.1007/978-3-319-11964-9_11

DOI: https://doi.org/10.1007/978-3-319-11964-9_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11963-2
Online ISBN: 978-3-319-11964-9
eBook Packages: Computer Science (R0)
