Efficient SPARQL Query Evaluation via Automatic Data Partitioning

  • Tao Yang
  • Jinchuan Chen
  • Xiaoyan Wang
  • Yueguo Chen
  • Xiaoyong Du
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7826)

Abstract

The volume of RDF data has grown rapidly over the last five years; for example, the Linked Open Data cloud has grown from 2 billion to 50 billion RDF triples. Because they scale well, cloud computing platforms such as Hadoop are a good choice for processing queries over large data sets. Previous work on evaluating SPARQL queries with Hadoop mainly focuses on reducing the number of joins through careful splitting of HDFS files and through algorithms for generating Map/Reduce jobs. However, the way RDF data is partitioned also affects performance. Specifically, a good partitioning can greatly reduce, or even entirely avoid, cross-node joins, and thereby significantly reduce the cost of query evaluation. Building on HadoopDB, this work processes SPARQL queries in a hybrid architecture in which Map/Reduce handles the computing tasks, while an RDF query engine, RDF-3X, stores the data and evaluates join operations over local data. Based on an analysis of query workloads, we propose a novel algorithm for automatically partitioning RDF data. We also present an approximate solution for physically placing the partitions so as to reduce data redundancy. All the proposed approaches are evaluated through extensive experiments over large RDF data sets.
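The claim that partitioning can avoid cross-node joins can be illustrated with a toy sketch. This is not the workload-driven algorithm proposed in the paper; it only contrasts two simple placements (hashing by subject versus round-robin) on a star-shaped query. The triples, predicate names, and function names are all hypothetical.

```python
import zlib
from collections import defaultdict

# Toy RDF triples (subject, predicate, object); all names are made up.
TRIPLES = [
    ("alice", "worksFor", "deptA"),
    ("alice", "advisor", "bob"),
    ("bob", "worksFor", "deptA"),
    ("bob", "teaches", "course1"),
    ("carol", "worksFor", "deptB"),
    ("carol", "advisor", "bob"),
]

def star_join(triples, pred_a, pred_b):
    """Answer { ?x pred_a ?y . ?x pred_b ?z } over the given triples."""
    a, b = defaultdict(list), defaultdict(list)
    for s, p, o in triples:
        if p == pred_a:
            a[s].append(o)
        if p == pred_b:
            b[s].append(o)
    return {(s, y, z) for s in a for y in a[s] for z in b.get(s, [])}

def by_subject(triples, num_nodes):
    """Place every triple on the node chosen by a hash of its subject."""
    parts = defaultdict(list)
    for t in triples:
        # crc32 is deterministic, unlike Python's salted str hash.
        parts[zlib.crc32(t[0].encode("utf-8")) % num_nodes].append(t)
    return parts

def round_robin(triples, num_nodes):
    """Baseline: place triples in arrival order, ignoring their content."""
    parts = defaultdict(list)
    for i, t in enumerate(triples):
        parts[i % num_nodes].append(t)
    return parts

def local_only(parts, pred_a, pred_b):
    """Union of per-node answers, with no data exchanged between nodes."""
    result = set()
    for node_triples in parts.values():
        result |= star_join(node_triples, pred_a, pred_b)
    return result

if __name__ == "__main__":
    full = star_join(TRIPLES, "worksFor", "advisor")
    print("global answer:      ", sorted(full))
    print("subject-hash, local:", sorted(local_only(by_subject(TRIPLES, 2), "worksFor", "advisor")))
    print("round-robin, local: ", sorted(local_only(round_robin(TRIPLES, 2), "worksFor", "advisor")))
```

With subject hashing, all triples sharing a subject sit on one node, so the star query is answered completely by local joins. With round-robin placement on this data, the two predicates end up on different nodes and local evaluation alone finds nothing, so a cross-node join would be required. Queries that are not star-shaped would still cross nodes under subject hashing, which is the kind of case a workload-aware partitioning aims to handle.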

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Tao Yang (1)
  • Jinchuan Chen (2)
  • Xiaoyan Wang (1)
  • Yueguo Chen (2)
  • Xiaoyong Du (1)

  1. School of Information, Renmin University of China, China
  2. Key Laboratory of Data Engineering and Knowledge Engineering, Renmin University of China, MOE, China
