Combining Graph Exploration and Fragmentation for Scalable RDF Query Processing

Information Systems Frontiers

Abstract

The flexibility of the Resource Description Framework (RDF) has made it a very popular standard for representing data with an undefined or variable schema using the concept of triples. Its success has produced many large-scale multidisciplinary datasets that have prompted the development of efficient RDF processing systems. Current approaches fall into two groups: the first adopts the relational model and stores the triples in tables; the second builds data structures that model RDF data as a graph. The strategies of the first group scale more easily because they apply optimization techniques from the relational model, such as indexing and fragmentation. However, these approaches incur significant overhead when dealing with complex queries (e.g. compound SPARQL graphs involving filters), which are prevalent in existing applications. Graph-based systems, on the other hand, use more complex data structures, fail to manage main memory efficiently, and do not scale on hardware with limited resources. In this paper, we propose a novel approach to evaluating queries (Basic Graph Patterns, Wildcards, Aggregations and Sorting) on RDF data that combines RDF graph exploration with physical fragmentation of triples. We describe our graph-based storage and query evaluation models. We then detail the architecture of our system and explain at length the strategy, based on the Volcano execution model, used to manage main memory at query runtime. We conducted extensive experiments on synthetic and real datasets to evaluate the efficiency of our proposal, comparing its performance with a relational-based (Virtuoso), a graph-based (gStore) and an intensive-indexing (RDF-3X) approach. According to our evaluation, our system offers the best compromise between efficient query processing and scalability.
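
The Volcano execution model mentioned in the abstract (Graefe, 1994) structures a query plan as a tree of operators that each expose open, next and close, so that results are pulled one tuple at a time and only a small working set has to stay in main memory. The following is a minimal sketch of that idea applied to triple patterns, not the system's actual implementation: the class names Scan and Select, the helper run, and the in-memory list of triples are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class Scan:
    """Hypothetical leaf operator: hands out stored triples one at a time."""

    def __init__(self, triples: List[Triple]) -> None:
        self.triples = triples
        self.pos = 0

    def open(self) -> None:
        self.pos = 0

    def next(self) -> Optional[Triple]:
        if self.pos >= len(self.triples):
            return None  # None signals the end of the stream
        triple = self.triples[self.pos]
        self.pos += 1
        return triple

    def close(self) -> None:
        pass


class Select:
    """Hypothetical filter: keeps triples matching a pattern (None = variable)."""

    def __init__(self, child, pattern) -> None:
        self.child = child
        self.pattern = pattern

    def open(self) -> None:
        self.child.open()

    def next(self) -> Optional[Triple]:
        # Pull from the child until a triple matches or the input is exhausted.
        while True:
            triple = self.child.next()
            if triple is None:
                return None
            if all(p is None or p == v for p, v in zip(self.pattern, triple)):
                return triple

    def close(self) -> None:
        self.child.close()


def run(plan) -> List[Triple]:
    """Drive a plan to completion by repeatedly pulling from its root."""
    plan.open()
    results = []
    while True:
        triple = plan.next()
        if triple is None:
            break
        results.append(triple)
    plan.close()
    return results
```

A plan such as Select(Scan(triples), (None, "knows", None)) never materializes an intermediate result: each call to next on the root pulls only as many triples from the scan as are needed to produce one match.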

Notes

  1. https://github.com/bio2rdf/bio2rdf-scripts/wiki

  2. http://wiki.dbpedia.org

  3. Similar to SQL queries with wildcard characters

  4. Subject Predicate Object

  5. Queries with variable predicates can be answered by query rewriting

  6. https://hadoop.apache.org

  7. https://hbase.apache.org/

  8. In the rest of this paper we use the term graph fragment instead of characteristic set to designate the physical split of SPO or OPS.

  9. The set of predicates related to a subject (in the case of an SPO fragment) or to an object (in the case of an OPS fragment)

  10. ϕ is used to denote an empty element

  11. http://graphdb.ontotext.com/

  12. https://github.com/pkumod/gStore

  13. https://github.com/openlink/virtuoso-opensource

  14. Queries list: https://www.lias-lab.fr/~amesmoudi/papers/ISF2020/Queries.pdf
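
Notes 8 and 9 describe a graph fragment as a physical split of the SPO (or OPS) data driven by the set of predicates attached to each subject (or object). As a rough sketch of that grouping for the SPO case only, and assuming triples are plain Python tuples, the function below places all triples of subjects that share the same predicate set into the same fragment. The name spo_fragments is hypothetical, and the paper's actual fragmentation strategy may differ in its details.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def spo_fragments(triples: List[Triple]) -> Dict[FrozenSet[str], List[Triple]]:
    """Group triples by the predicate set of their subject (illustrative only)."""
    # First pass: collect the set of predicates attached to each subject.
    predicates_of = defaultdict(set)
    for s, p, _ in triples:
        predicates_of[s].add(p)

    # Second pass: route each triple to the fragment keyed by that set.
    fragments = defaultdict(list)
    for s, p, o in triples:
        fragments[frozenset(predicates_of[s])].append((s, p, o))
    return dict(fragments)
```

An OPS variant would follow the same two passes with the object in place of the subject.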

References

  • Abadi, D.J., Marcus, A., Madden, S.R., Hollenbach, K. (2007). Scalable semantic web data management using vertical partitioning. In Proceedings of the 33rd international conference on very large data bases (pp. 411–422): VLDB Endowment.

  • Aït-Kaci, H., Boyer, R., Lincoln, P., Nasr, R. (1989). Efficient implementation of lattice operations. ACM Transactions on Programming Languages and Systems (TOPLAS), 11(1), 115–146.

  • Al-Harbi, R., Abdelaziz, I., Kalnis, P., Mamoulis, N., Ebrahim, Y., Sahli, M. (2016). Accelerating SPARQL queries by exploiting hash-based locality and adaptive partitioning. VLDB Journal, 25(3), 355–380.

  • Atre, M., Srinivasan, J., Hendler, J.A. (2008). Bitmat: a main-memory bit matrix of RDF triples for conjunctive triple pattern queries. In Proceedings of the poster and demonstration session at the 7th international semantic web conference (ISWC2008), Karlsruhe, Germany, October 28.

  • Briggs, M. (2012). Db2 nosql graph store what why & overview.

  • Broekstra, J., Kampman, A., van Harmelen, F. (2002). Sesame: a generic architecture for storing and querying RDF and RDF schema. In The semantic web - ISWC, first international semantic web conference, Italy, June 9-12 (pp. 54–68).

  • Cyganiak, R. (2005). A relational algebra for sparql. Digital Media Systems Laboratory HP Laboratories Bristol. HPL-2005-170, p. 35.

  • Deppisch, U. (1986). S-tree: a dynamic balanced signature index for office retrieval. In Proceedings of the 9th annual international ACM SIGIR conference on research and development in information retrieval (pp. 77–87): ACM.

  • Du, J., Wang, H., Ni, Y., Yu, Y. (2012). Hadooprdf: a scalable semantic data analytical engine. In Intelligent computing theories and applications - 8th international conference, ICIC, China, July 25-29 (pp. 633–641).

  • Erling, O. (2012). Virtuoso, a hybrid rdbms/graph column store. IEEE Data Engineering Bulletin, 35(1), 3–8.

  • Fuentes-Lorenzo, D., Morato, J., Gómez, J.M. (2009). Knowledge management in biomedical libraries: a semantic web approach. Information Systems Frontiers, 11(4), 471–480.

  • Galicia, J., Mesmoudi, A., Bellatreche, L. (2019). Rdfpartsuite: bridging physical and logical RDF partitioning. In Big data analytics and knowledge discovery - 21st international conference, DaWaK 2019, Linz, Austria, August 26-29, 2019, Proceedings (pp. 136–150).

  • Görlitz, O., & Staab, S. (2011). SPLENDID: SPARQL endpoint federation exploiting VOID descriptions. In Proceedings of the second international workshop on consuming linked data, Bonn, Germany, October 23.

  • Graefe, G. (1994). Volcano - an extensible and parallel query evaluation system. IEEE Transactions on Knowledge and Data Engineering, 6(1), 120–135.

  • Gurajada, S., Seufert, S., Miliaraki, I., Theobald, M. (2014). Triad: a distributed shared-nothing RDF engine based on asynchronous message passing. In SIGMOD, USA, June 22-27 (pp. 289–300).

  • Huang, J., Abadi, D.J., Ren, K. (2011). Scalable SPARQL querying of large RDF graphs. PVLDB, 4(11), 1123–1134.

  • Janik, M., & Kochut, K. (2005). BRAHMS: a workbench RDF store and high performance memory system for semantic association discovery. In The semantic web - ISWC 2005, 4th international semantic web conference, ISWC, Galway, Ireland, November 6-10, 2005, Proceedings (pp. 431–445).

  • Karypis, G., & Kumar, V. (1998). A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal of Scientific Computing, 20(1), 359–392.

  • Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., Van Kleef, P., Auer, S., et al. (2015). Dbpedia–a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web, 6(2), 167–195.

  • McBride, B. (2002). Jena: a semantic web toolkit. IEEE Internet Computing, 6, 55–59.

  • Mouzakitis, S., Papaspyros, D., Petychakis, M., Koussouris, S., Zafeiropoulos, A., Fotopoulou, E., Farid, L., Orlandi, F., Attard, J., Psarras, J. (2017). Challenges and opportunities in renovating public sector information by enabling linked data and analytics. Information Systems Frontiers, 19(2), 321–336.

  • Neumann, T., & Moerkotte, G. (2011). Characteristic sets: accurate cardinality estimation for rdf queries with multiple joins. In Data Engineering (ICDE) (pp. 984–994).

  • Neumann, T., & Weikum, G. (2008). Rdf-3x: a risc-style engine for rdf. Proceedings of the VLDB Endowment, 1(1), 647–659.

  • Papailiou, N., Konstantinou, I., Tsoumakos, D., Karras, P., Koziris, N. (2013). H2RDF+: high-performance distributed joins over large-scale RDF graphs. In Proceedings of the 2013 IEEE international conference on big data (pp. 255–263). USA.

  • Peng, P., Zou, L., Özsu, M.T., Chen, L., Zhao, D. (2016). Processing SPARQL queries over distributed RDF graphs. VLDB Journal, 25(2), 243–268.

  • Pérez, J., Arenas, M., Gutierrez, C. (2006). Semantics and complexity of sparql. In International Semantic Web Conference, (Vol. 4273 pp. 30–43): Springer.

  • Rohloff, K., & Schantz, R.E. (2011). Clause-iteration with mapreduce to scalably query datagraphs in the SHARD graph-store. In DIDC’11, Proceedings of the fourth international workshop on data-intensive distributed computing (pp. 35–44). San Jose.

  • Saleem, M., & Ngomo, A.N. (2014). Hibiscus: hypergraph-based source selection for SPARQL endpoint federation. In The semantic web: trends and challenges - 11th international conference, ESWC, Anissaras, Crete, Greece, May 25-29, 2014. Proceedings (pp. 176–191).

  • Schätzle, A., Przyjaciel-Zablocki, M., Skilevic, S., Lausen, G. (2011). Pigsparql: mapping SPARQL to pig latin. In Proceedings of the international workshop on semantic web information management, SWIM (p. 4). Greece.

  • Schätzle, A., Przyjaciel-Zablocki, M., Berberich, T., Lausen, G. (2015). S2X: graph-parallel querying of RDF with graphx. In Biomedical data management and graph online querying - VLDB 2015 workshops (pp. 155–168).

  • Schätzle, A., Przyjaciel-Zablocki, M., Skilevic, S., Lausen, G. (2016). S2RDF: RDF querying with SPARQL on spark. PVLDB, 9(10), 804–815.

  • Stephan, E.G., Elsethagen, T., Berg, L.K., Macduff, M.C., Paulson, P.R., Shaw, W.J., Sivaraman, C., Smith, W., Wynne, A. (2016). Semantic catalog of things, services, and data to support a wind data management facility. Information Systems Frontiers, 18(4), 679–691.

  • Udrea, O., Pugliese, A., Subrahmanian, V.S. (2007). GRIN: a graph based RDF index. In Proceedings of the twenty-second AAAI conference on artificial intelligence, July 22-26, Vancouver, British Columbia, Canada (pp. 1465–1470).

  • W3C. (2014). Rdf 1.1 concepts and abstract syntax. https://www.w3.org/TR/rdf11-concepts/, https://www.w3.org/TR/rdf-sparql-query/.

  • Weiss, C., Karras, P., Bernstein, A. (2008). Hexastore: sextuple indexing for semantic web data management. Proceedings of VLDB, 1(1), 1008–1019.

  • Wilkinson, K., Sayers, C., Kuno, H.A., Reynolds, D. (2003). Efficient RDF storage and retrieval in jena2. In Proceedings of SWDB’03, the first international workshop on semantic web and databases, co-located with VLDB 2003, Humboldt-Universität, Berlin, Germany, September 7-8 (pp. 131–150).

  • Zeng, K., Yang, J., Wang, H., Shao, B., Wang, Z. (2013). A distributed graph engine for web scale RDF data. PVLDB, 6(4), 265–276.

  • Zou, L., Mo, J., Chen, L., Özsu, M.T., Zhao, D. (2011). gstore: answering sparql queries via subgraph matching. Proceedings of the VLDB Endowment, 4(8), 482–493.

  • Zou, L., Özsu, M.T., Chen, L., Shen, X., Huang, R., Zhao, D. (2014). gstore: a graph-based SPARQL query engine. VLDB Journal, 23(4), 565–590.

  • Zouaghi, I., Mesmoudi, A., Galicia, J., Bellatreche, L., Aguili, T. (2020). Query optimization for large scale clustered rdf data. In 22nd international workshop on design, optimization, languages and analytical processing of big data, March 30, 2020. Copenhagen.

Author information

Corresponding author

Correspondence to Amin Mesmoudi.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Khelil, A., Mesmoudi, A., Galicia, J. et al. Combining Graph Exploration and Fragmentation for Scalable RDF Query Processing. Inf Syst Front 23, 165–183 (2021). https://doi.org/10.1007/s10796-020-09998-z

  • DOI: https://doi.org/10.1007/s10796-020-09998-z
