Processing SPARQL queries over distributed RDF graphs

Peng, Peng; Zou, Lei; Özsu, M. Tamer; Chen, Lei; Zhao, Dongyan

doi:10.1007/s00778-015-0415-0

Regular Paper
Published: 04 January 2016

Volume 25, pages 243–268 (2016)

The VLDB Journal

Peng Peng¹,
Lei Zou¹,
M. Tamer Özsu ORCID: orcid.org/0000-0002-8126-1717²,
Lei Chen³ &
…
Dongyan Zhao¹

6896 Accesses
84 Citations

Abstract

We propose techniques for processing SPARQL queries over a large RDF graph in a distributed environment. We adopt a “partial evaluation and assembly” framework. Answering a SPARQL query Q is equivalent to finding subgraph matches of the query graph Q over RDF graph G. Based on properties of subgraph matching over a distributed graph, we introduce local partial matches as partial answers in each fragment of RDF graph G. For assembly, we propose two methods: centralized and distributed assembly. We analyze our algorithms both theoretically and experimentally. Extensive experiments over both real and benchmark RDF repositories of billions of triples confirm that our method is superior to the state-of-the-art methods in both performance and scalability.

Notes

The statistic is reported at http://stats.lod2.eu/.
\(f_j(v)=NULL\) means that vertex v in query Q is not matched in local partial match \(PM_j\). It is formally defined in condition (2) of Definition 6.
In this paper, we use “\(\leftarrow \)” to denote the assignment operator.
An algorithm is called fixed-parameter tractable for a problem of size l, with respect to a parameter n, if it solves the problem in time O(f(n)g(l)), where f(n) can be any function but g(l) must be polynomial [10].
When we find local partial matches in fragment \(F_i\) and send them to join, we tag which vertices in local partial matches are internal vertices of \(F_i\).
We underline all extended vertices in serialization vectors.
A problem is said to have optimal substructure if an optimal solution can be constructed efficiently from optimal solutions of its subproblems [9]. This property is often used in dynamic programming formulations.
Note that, in this example, their cost values are the same, but in general they may differ.
We use the ANTLR v3 grammar, which implements the SPARQL grammar specification. It is available at http://www.antlr3.org/grammar/1200929755392/.
A triple pattern t is a “selective triple pattern” if it has no more than 100 matches in RDF graph G.
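
The last note's selectivity test can be illustrated with a minimal sketch. It assumes an in-memory list of triples and uses None to stand for a variable. The names count_matches, is_selective and SELECTIVITY_THRESHOLD are hypothetical and not taken from the paper, and the sketch ignores patterns that repeat the same variable.

```python
from typing import Iterable, Optional, Tuple

Triple = Tuple[str, str, str]
# None in a position stands for a variable, which matches any term.
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]

# "No more than 100 matches", as stated in the note.
SELECTIVITY_THRESHOLD = 100


def count_matches(pattern: Pattern, graph: Iterable[Triple]) -> int:
    """Count the triples that agree with every constant in the pattern."""
    return sum(
        all(p is None or p == t for p, t in zip(pattern, triple))
        for triple in graph
    )


def is_selective(pattern: Pattern, graph: Iterable[Triple],
                 threshold: int = SELECTIVITY_THRESHOLD) -> bool:
    """Return True if the pattern has at most `threshold` matches."""
    return count_matches(pattern, graph) <= threshold


# Toy data (made up for illustration): 150 typed resources, one name triple.
toy_graph = [(f"p{i}", "rdf:type", "Person") for i in range(150)]
toy_graph.append(("p0", "name", "Alice"))

type_pattern = (None, "rdf:type", "Person")  # matches 150 triples
name_pattern = (None, "name", None)          # matches 1 triple

print(is_selective(type_pattern, toy_graph))
print(is_selective(name_pattern, toy_graph))
```

On this toy data the type pattern exceeds the threshold and the name pattern does not. A real system would more likely estimate match counts from index statistics than scan the graph.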

References

Abadi, D.J., Marcus, A., Madden, S., Hollenbach, K.: SW-store: a vertically partitioned DBMS for semantic web data management. VLDB J. 18(2), 385–406 (2009)
Aluç, G., Hartig, O., Özsu, M.T., Daudjee, K.: Diversified stress testing of RDF data management systems. In: Proceedings of 13th International Semantic Web Conference, pp 197–212 (2014)
Astrahan, M.M., Blasgen, H.W., Chamberlin, D.D., Eswaran, K.P., Gray, J.N., Griffiths, P.P., King, W.F., Lorie, R.A., Mehl, J.W., Putzolu, G.R., Traiger, I.L., Wade, B.W., Watson, V.: System R: relational approach to database management. ACM Trans. Database Syst. 1, 97–137 (1976)
Atre, M.: Left Bit Right: for SPARQL join queries with OPTIONAL patterns (left-outer-joins). In: Proceedings of ACM SIGMOD International Conference on Management of Data, pp 1793–1808 (2015)
Atre, M., Chaoji, V., Zaki, M.J., Hendler, J.A.: Matrix “bit” loaded: a scalable lightweight join query processor for RDF data. In: Proceedings of 19th International World Wide Web Conference, pp 41–50 (2010)
Buneman, P., Cong, G., Fan, W., Kementsietsidis, A.: Using partial evaluation in distributed query evaluation. In: Proceedings of 32nd International Conference on Very Large Data Bases, pp 211–222 (2006)
Cong, G., Fan, W., Kementsietsidis, A.: Distributed query evaluation with performance guarantees. In: Proceedings of ACM SIGMOD International Conference on Management of Data, pp 509–520 (2007)
Cong, G., Fan, W., Kementsietsidis, A., Li, J., Liu, X.: Partial evaluation for distributed XPath query processing and beyond. ACM Trans. Database Syst. 37(4), 32 (2012)
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 3rd edn. MIT Press, Cambridge (2009)
Downey, R.G., Fellows, M.R., Vardy, A., Whittle, G.: The parametrized complexity of some fundamental problems in coding theory. SIAM J. Comput. 29(2), 545–570 (1999)
Dyer, M.E., Greenhill, C.S.: The complexity of counting graph homomorphisms. Random Struct. Algorithms 17(3–4), 260–289 (2000)
Fan, W., Li, J., Ma, S., Tang, N., Wu, Y., Wu, Y.: Graph pattern matching: from intractable to polynomial time. Proc. VLDB Endow. 3(1), 264–275 (2010)
Fan, W., Wang, X., Wu, Y.: Performance guarantees for distributed reachability queries. Proc. VLDB Endow. 5(11), 1304–1315 (2012)
Fan, W., Wang, X., Wu, Y., Deng, D.: Distributed graph simulation: impossibility and possibility. Proc. VLDB Endow. 7(12), 1083–1094 (2014)
Galarraga, L., Hose, K., Schenkel, R.: Partout: a distributed engine for efficient RDF processing. In: Proceedings of 23rd International World Wide Web Conference (Companion Volume), pp 267–268 (2014)
Görlitz, O., Staab, S.: SPLENDID: SPARQL endpoint federation exploiting VOID descriptions. In: Proceedings of ISWC 2011 Workshop on Consuming Linked Data (2011)
Guo, Y., Pan, Z., Heflin, J.: LUBM: a benchmark for OWL knowledge base systems. J. Web Semant. 3(2–3), 158–182 (2005)
Gurajada, S., Seufert, S., Miliaraki, I., Theobald, M.: TriAD: a distributed shared-nothing RDF engine based on asynchronous message passing. In: Proceedings of ACM SIGMOD International Conference on Management of Data, pp 289–300 (2014)
Harth, A., Hose, K., Karnstedt, M., Polleres, A., Sattler, K., Umbrich, J.: Data summaries for on-demand queries over linked data. In: Proceedings of 19th International World Wide Web Conference, pp 411–420 (2010)
Hartig, O., Özsu, M.T.: Linked data query processing (Tutorial). In: Proceedings of 30th International Conference on Data Engineering, pp 1286–1289 (2014)
Hose, K., Schenkel, R.: WARP: Workload-aware replication and partitioning for RDF. In: Proceedings of Workshops of 29th International Conference on Data Engineering, pp 1–6 (2013)
Huang, J., Abadi, D.J., Ren, K.: Scalable SPARQL querying of large RDF graphs. Proc. VLDB Endow. 4(11), 1123–1134 (2011)
Husain, M.F., McGlothlin, J.P., Masud, M.M., Khan, L.R., Thuraisingham, B.M.: Heuristics-based query processing for large RDF graphs using cloud computing. IEEE Trans. Knowl. Data Eng. 23(9), 1312–1327 (2011)
Jones, N.D.: An introduction to partial evaluation. ACM Comput. Surv. 28(3), 480–503 (1996)
Kaoudi, Z., Manolescu, I.: RDF in the clouds: a survey. VLDB J. 24(1), 67–91 (2015)
Karypis, G., Kumar, V.: Analysis of multilevel graph partitioning. In: Proceedings of ACM/IEEE Conference on Supercomputing. Article No. 29 (1995)
Khadilkar, V., Kantarcioglu, M., Thuraisingham, B.M., Castagna, P.: Jena-HBase: A distributed, scalable and efficient RDF triple store. In: Proceedings of International Semantic Web Conference Posters & Demos Track (2012)
Lee, K., Liu, L.: Scaling queries over big RDF graphs with semantic hash partitioning. Proc. VLDB Endow. 6(14), 1894–1905 (2013)
Lee, K., Liu, L., Tang, Y., Zhang, Q., Zhou, Y.: Efficient and customizable data partitioning framework for distributed big RDF data processing in the cloud. In: Proceedings of IEEE 6th International Conference on Cloud Computing, pp 327–334 (2013)
Li, F., Ooi, B.C., Özsu, M.T., Wu, S.: Distributed data management using MapReduce. ACM Comput. Surv. 46(3), 31 (2014)
Ma, S., Cao, Y., Huai, J., Wo, T.: Distributed graph pattern matching. In: Proceeding of 21st International World Wide Web Conference, pp 949–958 (2012)
Neumann, T., Weikum, G.: RDF-3X: a RISC-style engine for RDF. Proc. VLDB Endow. 1(1), 647–659 (2008)
Papailiou, N., Konstantinou, I., Tsoumakos, D., Koziris, N.: \(\text{H}_2\)RDF: adaptive query processing on RDF data in the cloud. In: Proceedings of 21st International World Wide Web Conference (Companion Volume), pp 397–400 (2012)
Papailiou, N., Tsoumakos, D., Konstantinou, I., Karras, P., Koziris, N.: \(\text{H}_2\)RDF+: an efficient data management system for big RDF graphs. In: Proceeding of ACM SIGMOD International Conference on Management of Data, pp 909–912 (2014)
Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of SPARQL. ACM Trans. Database Syst. 34(3) (2009)
Quilitz, B., Leser, U.: Querying Distributed RDF Data Sources with SPARQL. In: Proceeding of 5th European Semantic Web Conference, pp 524–538 (2008)
Rohloff, K., Schantz, R.E.: High-performance, massively scalable distributed systems using the MapReduce software framework: the shard triple-store. In: Proceeding of International Workshop on Programming Support Innovations for Emerging Distributed Applications, Article No. 4 (2010)
Saleem, M., Ngomo, A.N.: HiBISCuS: Hypergraph-based source selection for SPARQL endpoint federation. In: Proceeding of 11th Extended Semantic Web Conference, pp 176–191 (2014)
Saleem, M., Padmanabhuni, S.S., Ngomo, A.N., Iqbal, A., Almeida, J.S., Decker, S., Deus, H.F.: TopFed: TCGA tailored federated query processing and linking to LOD. J. Biomed. Semant. 5, 47 (2014)
Schmachtenberg, M., Bizer, C., Paulheim, H.: Adoption of best data practices in different topical domains. In: Proceeding of 13th International Semantic Web Conference, pp 245–260 (2014)
Schmidt, M., Görlitz, O., Haase, P., Ladwig, G., Schwarte, A., Tran, T.: FedBench: A benchmark suite for federated semantic data query processing. In: Proceeding of 10th International Semantic Web Conference, pp 585–600 (2011)
Schwarte, A., Haase, P., Hose, K., Schenkel, R., Schmidt, M.: FedX: Optimization techniques for federated query processing on linked data. In: Proceeding of 10th International Semantic Web Conference, pp 601–616 (2011)
Shang, H., Zhang, Y., Lin, X., Yu, J.X.: Taming verification hardness: an efficient algorithm for testing subgraph isomorphism. Proc. VLDB Endow. 1(1), 364–375 (2008)
Shao, B., Wang, H., Li, Y.: Trinity: a distributed graph engine on a memory cloud. In: Proceeding of ACM SIGMOD International Conference on Management of Data, pp 505–516 (2013)
Valiant, L.G.: A bridging model for parallel computation. Commun. ACM 33(8), 103–111 (1990)
Wang, L., Xiao, Y., Shao, B., Wang, H.: How to partition a billion-node graph. In: Proceeding of 30th International Conference on Data Engineering, pp 568–579 (2014)
Zeng, K., Yang, J., Wang, H., Shao, B., Wang, Z.: A distributed graph engine for web scale RDF data. Proc. VLDB Endow. 6(4), 265–276 (2013)
Zhang, X., Chen, L., Tong, Y., Wang, M.: EAGRE: towards scalable I/O efficient SPARQL query evaluation on the cloud. In: Proceeding of 29th International Conference on Data Engineering, pp 565–576 (2013)
Zhang, X., Chen, L., Wang, M.: Towards efficient join processing over large RDF graph using mapreduce. In: Proceeding of 24th International Conference on Scientific and Statistical Database Management, pp 250–259 (2012)
Zou, L., Özsu, M.T., Chen, L., Shen, X., Huang, R., Zhao, D.: gStore: a graph-based SPARQL query engine. VLDB J. 23(4), 565–590 (2014)

Author information

Authors and Affiliations

Institute of Computer Science and Technology, Peking University, Beijing, China
Peng Peng, Lei Zou & Dongyan Zhao
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada
M. Tamer Özsu
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
Lei Chen

Corresponding author

Correspondence to M. Tamer Özsu.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 109 KB)

About this article

Cite this article

Peng, P., Zou, L., Özsu, M.T. et al. Processing SPARQL queries over distributed RDF graphs. The VLDB Journal 25, 243–268 (2016). https://doi.org/10.1007/s00778-015-0415-0

Received: 30 March 2015
Revised: 10 September 2015
Accepted: 17 November 2015
Published: 04 January 2016
Issue Date: April 2016
DOI: https://doi.org/10.1007/s00778-015-0415-0
