Robust Runtime Optimization and Skew-Resistant Execution of Analytical SPARQL Queries on Pig

  • Spyros Kotoulas
  • Jacopo Urbani
  • Peter Boncz
  • Peter Mika
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7649)


We describe a system that incrementally translates SPARQL queries to Pig Latin and executes them on a Hadoop cluster. The system is designed to work efficiently on complex queries with many self-joins over huge datasets, avoiding job failures even when joins exhibit unexpectedly high value skew. To be robust against cost estimation errors, our system interleaves query optimization with query execution, determining the next steps to take based on data samples and statistics gathered during the previous step. Furthermore, we have developed a novel skew-resistant join algorithm that replicates tuples corresponding to popular keys. We evaluate the effectiveness of our approach both on a synthetic benchmark known to generate complex queries (BSBM-BI) and on a Yahoo! data analysis use case using RDF data crawled from the web. Our results indicate that our system is indeed capable of processing huge datasets without pre-computed statistics while exhibiting good load-balancing properties.
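
To make the idea of a skew-resistant join concrete, the following is a minimal Python sketch of one generic way to replicate tuples for popular keys. It is not the algorithm from the paper, whose details are not given in this abstract. The function names (`find_popular_keys`, `skew_resistant_join`), the sampling threshold, and the round-robin scattering are all assumptions made for illustration: popular keys are estimated from a sample, rows of the larger input carrying a popular key are spread evenly over all partitions, and the matching rows of the other input are replicated to every partition so that no join result is lost.

```python
import itertools
import random
from collections import Counter, defaultdict


def find_popular_keys(rows, sample_size, threshold, seed=0):
    """Estimate popular join keys from a uniform sample of (key, value) rows.

    Hypothetical helper: a key counts as popular if it makes up at least
    `threshold` of the sampled rows.
    """
    rng = random.Random(seed)
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    counts = Counter(key for key, _ in sample)
    return {key for key, n in counts.items() if n / len(sample) >= threshold}


def skew_resistant_join(left, right, num_partitions, popular_keys):
    """Equi-join two lists of (key, value) rows, simulating partitions.

    Returns the joined rows and the number of left rows per partition.
    """
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)
    round_robin = itertools.count()

    for key, value in left:
        if key in popular_keys:
            # Scatter popular-key rows evenly instead of hashing them
            # all to the same partition.
            part = next(round_robin) % num_partitions
        else:
            part = hash(key) % num_partitions
        left_parts[part].append((key, value))

    for key, value in right:
        if key in popular_keys:
            # Replicate so every scattered left row meets all its partners.
            targets = range(num_partitions)
        else:
            targets = [hash(key) % num_partitions]
        for part in targets:
            right_parts[part].append((key, value))

    results, loads = [], []
    for part in range(num_partitions):
        # Local hash join inside each simulated partition.
        index = defaultdict(list)
        for key, value in right_parts[part]:
            index[key].append(value)
        loads.append(len(left_parts[part]))
        for key, lvalue in left_parts[part]:
            for rvalue in index.get(key, []):
                results.append((key, lvalue, rvalue))
    return results, loads


if __name__ == "__main__":
    left = [("hot", i) for i in range(1000)] + [("a", 1), ("b", 2)]
    right = [("hot", "x"), ("hot", "y"), ("a", "z"), ("c", "w")]
    popular = find_popular_keys(left, sample_size=100, threshold=0.2)
    joined, loads = skew_resistant_join(left, right, 4, popular)
    print(len(joined), loads)
```

In this toy setup a plain hash partitioning would send all 1000 "hot" rows to a single partition, whereas scattering them keeps the per-partition load close to even at the cost of copying the few matching rows of the other input to every partition.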


Query Processing, Query Optimization, Query Execution, SPARQL Query, Link Open Data
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Spyros Kotoulas (1, 2)
  • Jacopo Urbani (1)
  • Peter Boncz (3, 2)
  • Peter Mika (4)

  1. IBM Research, Ireland
  2. Vrije Universiteit Amsterdam, The Netherlands
  3. CWI Amsterdam, The Netherlands
  4. Yahoo! Research Barcelona, Spain
