RDF partitioning for scalable SPARQL query processing
The volume of RDF data has increased dramatically in recent years, and cloud computing platforms such as Hadoop are considered a good choice for processing queries over huge data sets because of their scalability. Previous work on evaluating SPARQL queries with Hadoop mainly focuses on reducing the number of joins through careful splitting of HDFS files and algorithms for generating Map/Reduce jobs. However, the way RDF data is partitioned can also affect system performance. Specifically, a good partitioning solution can greatly reduce or even completely avoid cross-node joins, and significantly cut the cost of query evaluation. Based on HadoopDB, this work processes SPARQL queries in a hybrid architecture, where Map/Reduce takes charge of the computing tasks, and RDF query engines such as RDF-3X store the data and execute join operations. Based on an analysis of query workloads, this work proposes a novel algorithm for automatically partitioning RDF data and an approximate solution for physically placing the partitions in order to reduce data redundancy. It also discusses how to make a good trade-off between query evaluation efficiency and data redundancy. All of the proposed approaches have been evaluated through extensive experiments over large RDF data sets.
Keywords: RDF data, data partitioning, SPARQL query