SPARQLGX: Efficient Distributed Evaluation of SPARQL with Apache Spark

  • Damien Graux
  • Louis Jachiet
  • Pierre Genevès
  • Nabil Layaïda
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9982)


SPARQL is the W3C standard query language for data expressed in the Resource Description Framework (RDF). The growing volume of available RDF data creates a pressing need for, and research interest in, efficient and scalable distributed SPARQL query evaluators. In this context, we propose SPARQLGX: our implementation of a distributed RDF datastore based on Apache Spark. SPARQLGX is designed to leverage existing Hadoop infrastructures for evaluating SPARQL queries. It relies on a translation of SPARQL queries into executable Spark code that adopts evaluation strategies according to (1) the storage method used and (2) statistics on the data. We show that SPARQLGX makes it possible to evaluate SPARQL queries on billions of triples distributed across multiple nodes, while providing attractive performance figures. We report on experiments that compare SPARQLGX with related state-of-the-art implementations, and we show that our approach scales better than these systems in terms of supported dataset size. With its simple design, SPARQLGX represents an interesting alternative in several scenarios.
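To make the idea of a storage-aware, statistics-driven evaluation more concrete, here is a minimal sketch in plain Python rather than Spark. It is not the SPARQLGX implementation. It assumes a storage method in which triples are grouped by predicate (vertical partitioning), uses the size of each predicate table as the only statistic, and evaluates a conjunction of triple patterns with constant predicates by joining on shared variables. All function names and the sample data are hypothetical.

```python
from collections import defaultdict

def vertical_partition(triples):
    """Assumed storage method: one (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)

def is_var(term):
    """Variables are written SPARQL-style, with a leading '?'."""
    return term.startswith("?")

def order_patterns(patterns, tables):
    """Assumed statistic: table size. Smaller predicate tables go first."""
    return sorted(patterns, key=lambda tp: len(tables.get(tp[1], [])))

def match_pattern(pattern, tables):
    """Yield one variable binding (a dict) per row matching the pattern."""
    s_term, pred, o_term = pattern
    for s, o in tables.get(pred, []):
        binding, ok = {}, True
        for term, value in ((s_term, s), (o_term, o)):
            if is_var(term):
                # The same variable used twice must bind to the same value.
                if binding.get(term, value) != value:
                    ok = False
                binding[term] = value
            elif term != value:
                ok = False
        if ok:
            yield binding

def evaluate_bgp(patterns, tables):
    """Join the patterns one by one on their shared variables."""
    solutions = [{}]
    for pattern in order_patterns(patterns, tables):
        rows = list(match_pattern(pattern, tables))
        solutions = [
            {**left, **right}
            for left in solutions
            for right in rows
            if all(left.get(k, v) == v for k, v in right.items())
        ]
    return solutions

# Hypothetical sample data and query, for illustration only.
triples = [
    ("alice", "worksFor", "inria"),
    ("bob", "worksFor", "cnrs"),
    ("carol", "worksFor", "inria"),
    ("inria", "locatedIn", "france"),
    ("alice", "knows", "bob"),
]
query = [("?p", "worksFor", "?org"), ("?org", "locatedIn", "france")]

tables = vertical_partition(triples)
result = evaluate_bgp(query, tables)
print(sorted(r["?p"] for r in result))
```

In this sketch the `locatedIn` table has a single row, so it is evaluated before the larger `worksFor` table, which keeps the intermediate result small. In a distributed setting the per-predicate lists would presumably correspond to distributed collections and the nested loop to a distributed join, but that mapping is an assumption and is not described in the abstract.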


Keywords: RDF system; Distributed SPARQL evaluation



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Damien Graux (1, 2, 3)
  • Louis Jachiet (1, 2, 3)
  • Pierre Genevès (1, 2, 3)
  • Nabil Layaïda (1, 2, 3)

  1. Inria, Paris, France
  2. CNRS, LIG, Grenoble, France
  3. Université Grenoble Alpes, Grenoble, France
