Abstract
The flexibility of the Resource Description Framework (RDF) has made it a very popular standard for representing data with an undefined or variable schema using the concept of triples. Its success has resulted in many large-scale multidisciplinary datasets, which have prompted the development of efficient RDF processing systems. Current approaches fall into two groups: the first adopts the relational model and stores the triples in tables, and the second creates data structures that model RDF data as a graph. The strategies of the first group scale more easily since they apply optimization strategies from the relational model, such as indexing and fragmentation. However, these approaches suffer many overheads when dealing with the complex queries (e.g., compound SPARQL graphs involving filters) that are prevalent in existing applications. On the other hand, graph-based systems, which use more complex data structures, fail to manage main memory efficiently and do not scale on hardware with limited resources. In this paper, we propose a novel approach to evaluating queries (Basic Graph Patterns, Wildcards, Aggregations and Sorting) on RDF data that combines RDF graph exploration with physical fragmentation of triples. We describe our graph-based storage and query evaluation models. Then, we detail the architecture of our system and explain at length the strategy, based on the Volcano execution model, used to manage main memory at query runtime. We conducted extensive experiments on synthetic and real datasets to evaluate the efficiency of our proposal, comparing its performance with a relational-based (Virtuoso), a graph-based (gStore) and an intensive-indexing (RDF-3X) approach. According to our evaluation, our system offers the best compromise between efficient query processing and scalability.
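As a rough illustration of the Volcano execution model mentioned above, the following minimal sketch shows pull-based operators that hand over one triple at a time through an open/next/close interface, so that intermediate results need not be materialized in main memory. It is not the system described in this paper: the class names `Scan` and `Filter`, the helper `run`, and the sample data are all hypothetical.

```python
class Scan:
    """Leaf operator: returns triples matching a pattern (None is a wildcard)."""

    def __init__(self, triples, pattern):
        self.triples = triples
        self.pattern = pattern
        self._it = None

    def open(self):
        self._it = iter(self.triples)

    def next(self):
        # Advance until one matching triple is found, then hand it to the parent.
        for t in self._it:
            if all(p is None or p == v for p, v in zip(self.pattern, t)):
                return t
        return None  # input exhausted

    def close(self):
        self._it = None


class Filter:
    """Unary operator: pulls from its child and keeps triples passing a test."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def open(self):
        self.child.open()

    def next(self):
        while (t := self.child.next()) is not None:
            if self.predicate(t):
                return t
        return None

    def close(self):
        self.child.close()


def run(op):
    """Drive a plan from the root, pulling one result at a time."""
    op.open()
    results = []
    while (t := op.next()) is not None:
        results.append(t)
    op.close()
    return results


# Hypothetical sample data and plan: subjects whose "age" is at least 18.
data = [("alice", "age", 30), ("bob", "age", 17), ("alice", "knows", "bob")]
plan = Filter(Scan(data, (None, "age", None)), lambda t: t[2] >= 18)
adults = run(plan)
```

Because each operator only holds its current position in the input, the memory footprint of a plan in this style depends on the operators' state rather than on the size of intermediate results.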
Notes
Similar to SQL queries with wildcard characters
Subject Predicate Object
Queries with variable predicates can be answered by query rewriting
In the rest of this paper, we use the term graph fragment instead of characteristic set to designate the physical split of SPO or OPS.
The set of predicates related to a subject (in the case of an SPO fragment) or to an object (in the case of an OPS fragment)
ϕ is used to denote an empty element
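The notes above describe graph fragments in terms of characteristic sets. As a hedged sketch of that idea, and not the storage model of this paper, the following code groups triples by the set of predicates attached to their subject (SPO) or to their object (OPS). The function name `graph_fragments`, its `key` parameter, and the sample triples are assumptions made for illustration.

```python
from collections import defaultdict


def graph_fragments(triples, key=0):
    """Group triples by the characteristic set of their key element.

    key=0 groups by subject (SPO-style); key=2 groups by object (OPS-style).
    """
    # First pass: collect the set of predicates seen with each key element.
    predicates = defaultdict(set)
    for t in triples:
        predicates[t[key]].add(t[1])

    # Second pass: triples whose key elements share the same predicate set
    # are placed in the same fragment.
    fragments = defaultdict(list)
    for t in triples:
        fragments[frozenset(predicates[t[key]])].append(t)
    return dict(fragments)


# Hypothetical sample data.
triples = [
    ("alice", "name", "Alice"),
    ("alice", "knows", "bob"),
    ("bob", "name", "Bob"),
    ("bob", "knows", "alice"),
    ("paper1", "title", "RDF"),
]
spo = graph_fragments(triples, key=0)
ops = graph_fragments(triples, key=2)
```

Here the subjects `alice` and `bob` share the predicate set {name, knows}, so their triples land in one SPO fragment, while `paper1` forms a fragment of its own.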
References
Abadi, D.J., Marcus, A., Madden, S.R., Hollenbach, K. (2007). Scalable semantic web data management using vertical partitioning. In Proceedings of the 33rd international conference on very large data bases (pp. 411–422): VLDB Endowment.
Aït-Kaci, H., Boyer, R., Lincoln, P., Nasr, R. (1989). Efficient implementation of lattice operations. ACM Transactions on Programming Languages and Systems (TOPLAS), 11(1), 115–146.
Al-Harbi, R., Abdelaziz, I., Kalnis, P., Mamoulis, N., Ebrahim, Y., Sahli, M. (2016). Accelerating SPARQL queries by exploiting hash-based locality and adaptive partitioning. VLDB Journal, 25(3), 355–380.
Atre, M., Srinivasan, J., Hendler, J.A. (2008). Bitmat: a main-memory bit matrix of RDF triples for conjunctive triple pattern queries. In Proceedings of the poster and demonstration session at the 7th international semantic web conference (ISWC2008), Karlsruhe, Germany, October 28.
Briggs, M. (2012). Db2 nosql graph store what why & overview.
Broekstra, J., Kampman, A., van Harmelen, F. (2002). Sesame: a generic architecture for storing and querying RDF and RDF schema. In The semantic web - ISWC, first international semantic web conference, Italy, June 9-12 (pp. 54–68).
Cyganiak, R. (2005). A relational algebra for sparql. Digital Media Systems Laboratory HP Laboratories Bristol. HPL-2005-170, p. 35.
Deppisch, U. (1986). S-tree: a dynamic balanced signature index for office retrieval. In Proceedings of the 9th annual international ACM SIGIR conference on research and development in information retrieval (pp. 77–87): ACM.
Du, J., Wang, H., Ni, Y., Yu, Y. (2012). Hadooprdf: a scalable semantic data analytical engine. In Intelligent computing theories and applications - 8th international conference, ICIC, China, July 25-29 (pp. 633–641).
Erling, O. (2012). Virtuoso, a hybrid rdbms/graph column store. IEEE Data Engineering Bulletin, 35(1), 3–8.
Fuentes-Lorenzo, D., Morato, J., Gómez, J.M. (2009). Knowledge management in biomedical libraries: a semantic web approach. Information Systems Frontiers, 11(4), 471–480.
Galicia, J., Mesmoudi, A., Bellatreche, L. (2019). Rdfpartsuite: bridging physical and logical RDF partitioning. In Big data analytics and knowledge discovery - 21st international conference, DaWaK 2019, Linz, Austria, August 26-29, 2019, Proceedings (pp. 136–150).
Görlitz, O., & Staab, S. (2011). SPLENDID: SPARQL endpoint federation exploiting VOID descriptions. In Proceedings of the second international workshop on consuming linked data, Bonn, Germany, October 23.
Graefe, G. (1994). Volcano - an extensible and parallel query evaluation system. IEEE Transactions on Knowledge and Data Engineering, 6(1), 120–135.
Gurajada, S., Seufert, S., Miliaraki, I., Theobald, M. (2014). Triad: a distributed shared-nothing RDF engine based on asynchronous message passing. In SIGMOD, USA, June 22-27 (pp. 289–300).
Huang, J., Abadi, D.J., Ren, K. (2011). Scalable SPARQL querying of large RDF graphs. PVLDB, 4(11), 1123–1134.
Janik, M., & Kochut, K. (2005). BRAHMS: a workbench RDF store and high performance memory system for semantic association discovery. In The semantic web - ISWC 2005, 4th international semantic web conference, ISWC, Galway, Ireland, November 6-10, 2005, Proceedings (pp. 431–445).
Karypis, G., & Kumar, V. (1998). A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1), 359–392.
Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., Van Kleef, P., Auer, S., et al. (2015). Dbpedia–a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web, 6(2), 167–195.
McBride, B. (2002). Jena: a semantic web toolkit. IEEE Internet Computing, 6, 55–59.
Mouzakitis, S., Papaspyros, D., Petychakis, M., Koussouris, S., Zafeiropoulos, A., Fotopoulou, E., Farid, L., Orlandi, F., Attard, J., Psarras, J. (2017). Challenges and opportunities in renovating public sector information by enabling linked data and analytics. Information Systems Frontiers, 19(2), 321–336.
Neumann, T., & Moerkotte, G. (2011). Characteristic sets: accurate cardinality estimation for rdf queries with multiple joins. In Data Engineering (ICDE) (pp. 984–994).
Neumann, T., & Weikum, G. (2008). Rdf-3x: a risc-style engine for rdf. Proceedings of the VLDB Endowment, 1(1), 647–659.
Papailiou, N., Konstantinou, I., Tsoumakos, D., Karras, P., Koziris, N. (2013). H2RDF+: high-performance distributed joins over large-scale RDF graphs. In Proceedings of the 2013 IEEE international conference on big data (pp. 255–263). USA.
Peng, P., Zou, L., Özsu, M.T., Chen, L., Zhao, D. (2016). Processing SPARQL queries over distributed RDF graphs. VLDB Journal, 25(2), 243–268.
Pérez, J., Arenas, M., Gutierrez, C. (2006). Semantics and complexity of sparql. In International Semantic Web Conference, (Vol. 4273 pp. 30–43): Springer.
Rohloff, K., & Schantz, R.E. (2011). Clause-iteration with mapreduce to scalably query datagraphs in the SHARD graph-store. In DIDC’11, Proceedings of the fourth international workshop on data-intensive distributed computing (pp. 35–44). San Jose.
Saleem, M., & Ngomo, A.N. (2014). Hibiscus: hypergraph-based source selection for SPARQL endpoint federation. In The semantic web: trends and challenges - 11th international conference, ESWC, Anissaras, Crete, Greece, May 25-29, 2014. Proceedings (pp. 176–191).
Schätzle, A., Przyjaciel-Zablocki, M., Skilevic, S., Lausen, G. (2011). Pigsparql: mapping SPARQL to pig latin. In Proceedings of the international workshop on semantic web information management, SWIM (p. 4). Greece.
Schätzle, A., Przyjaciel-Zablocki, M., Berberich, T., Lausen, G. (2015). S2X: graph-parallel querying of RDF with graphx. In Biomedical data management and graph online querying - VLDB 2015 workshops (pp. 155–168).
Schätzle, A., Przyjaciel-Zablocki, M., Skilevic, S., Lausen, G. (2016). S2RDF: RDF querying with SPARQL on spark. PVLDB, 9(10), 804–815.
Stephan, E.G., Elsethagen, T., Berg, L.K., Macduff, M.C., Paulson, P.R., Shaw, W.J., Sivaraman, C., Smith, W., Wynne, A. (2016). Semantic catalog of things, services, and data to support a wind data management facility. Information Systems Frontiers, 18(4), 679–691.
Udrea, O., Pugliese, A., Subrahmanian, V.S. (2007). GRIN: a graph based RDF index. In Proceedings of the twenty-second AAAI conference on artificial intelligence, July 22-26, Vancouver, British Columbia, Canada (pp. 1465–1470).
W3C. (2014). Rdf 1.1 concepts and abstract syntax. https://www.w3.org/TR/rdf11-concepts/, https://www.w3.org/TR/rdf-sparql-query/.
Weiss, C., Karras, P., Bernstein, A. (2008). Hexastore: sextuple indexing for semantic web data management. Proceedings of VLDB, 1(1), 1008–1019.
Wilkinson, K., Sayers, C., Kuno, H.A., Reynolds, D. (2003). Efficient RDF storage and retrieval in jena2. In Proceedings of SWDB’03, the first international workshop on semantic web and databases, co-located with VLDB 2003, Humboldt-Universität, Berlin, Germany, September 7-8 (pp. 131–150).
Zeng, K., Yang, J., Wang, H., Shao, B., Wang, Z. (2013). A distributed graph engine for web scale RDF data. PVLDB, 6(4), 265–276.
Zou, L., Mo, J., Chen, L., Özsu, M.T., Zhao, D. (2011). gstore: answering sparql queries via subgraph matching. Proceedings of the VLDB Endowment, 4(8), 482–493.
Zou, L., Özsu, M.T., Chen, L., Shen, X., Huang, R., Zhao, D. (2014). gstore: a graph-based SPARQL query engine. VLDB Journal, 23(4), 565–590.
Zouaghi, I., Mesmoudi, A., Galicia, J., Bellatreche, L., Aguili, T. (2020). Query optimization for large scale clustered rdf data. In 22nd international workshop on design, optimization, languages and analytical processing of big data, March 30, 2020. Copenhagen.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Khelil, A., Mesmoudi, A., Galicia, J. et al. Combining Graph Exploration and Fragmentation for Scalable RDF Query Processing. Inf Syst Front 23, 165–183 (2021). https://doi.org/10.1007/s10796-020-09998-z
DOI: https://doi.org/10.1007/s10796-020-09998-z