The VLDB Journal, Volume 25, Issue 3, pp 355–380

Accelerating SPARQL queries by exploiting hash-based locality and adaptive partitioning

  • Razen Harbi
  • Ibrahim Abdelaziz
  • Panos Kalnis
  • Nikos Mamoulis
  • Yasser Ebrahim
  • Majed Sahli
Regular Paper

Abstract

State-of-the-art distributed RDF systems partition data across multiple computer nodes (workers). Some systems perform cheap hash partitioning, which may result in expensive query evaluation. Others try to minimize inter-node communication, which requires an expensive data preprocessing phase and leads to a high startup cost. A priori knowledge of the query workload has also been used to create partitions; these, however, are static and do not adapt to workload changes. In this paper, we propose AdPart, a distributed RDF system that addresses the shortcomings of previous work. First, AdPart applies lightweight partitioning to the initial data, distributing triples by hashing on their subjects; this keeps its startup overhead low. At the same time, the locality-aware query optimizer of AdPart takes full advantage of the partitioning to (1) support the fully parallel processing of join patterns on subjects and (2) minimize data communication for general queries by applying hash distribution of intermediate results instead of broadcasting, wherever possible. Second, AdPart monitors the data access patterns and dynamically redistributes and replicates the instances of the most frequent ones among workers. As a result, the communication cost for future queries is drastically reduced or even eliminated. To control replication, AdPart implements an eviction policy for the redistributed patterns. Our experiments with synthetic and real data verify that AdPart: (1) starts faster than all existing systems; (2) processes thousands of queries before other systems come online; and (3) gracefully adapts to the query load, being able to evaluate queries on billion-scale RDF data in sub-second time.
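
As an illustration of the subject-hash partitioning described above, the following is a minimal sketch and not AdPart's actual implementation. The function names, the use of CRC32 as the hash, and the example triples are all assumptions made for this example. It shows why a join of two patterns that share a subject variable can be answered by each worker on its own data: all triples with the same subject land on the same worker.

```python
import zlib
from collections import defaultdict

def worker_for(subject, num_workers):
    # Deterministic hash of the subject; CRC32 is an illustrative choice only.
    return zlib.crc32(subject.encode("utf-8")) % num_workers

def partition_by_subject(triples, num_workers):
    # Assign every (subject, predicate, object) triple to one worker.
    partitions = defaultdict(list)
    for s, p, o in triples:
        partitions[worker_for(s, num_workers)].append((s, p, o))
    return partitions

def local_subject_join(partition, pred_a, pred_b):
    # Join (?s pred_a ?x) with (?s pred_b ?y) using one worker's triples only.
    objects_a = defaultdict(list)
    for s, p, o in partition:
        if p == pred_a:
            objects_a[s].append(o)
    return [(s, x, o)
            for s, p, o in partition if p == pred_b
            for x in objects_a.get(s, [])]

# Hypothetical data, not taken from the paper.
triples = [
    ("ex:alice", "ex:worksFor", "ex:kaust"),
    ("ex:alice", "ex:advisor", "ex:bob"),
    ("ex:bob", "ex:worksFor", "ex:kaust"),
]
parts = partition_by_subject(triples, num_workers=4)

# Each worker joins locally; the union of local results is the full answer.
results = []
for worker_id in sorted(parts):
    results.extend(local_subject_join(parts[worker_id], "ex:worksFor", "ex:advisor"))
```

Joins on other positions (for example, the object of one pattern with the subject of another) do not have this property under subject hashing, which is where the abstract's hash distribution of intermediate results comes in.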

Keywords

  • Parallel and distributed RDF systems
  • SPARQL query processing
  • Main memory engines

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  2. University of Ioannina, Ioannina, Greece
  3. Microsoft Corporation, Redmond, USA
