Abstract
The Resource Description Framework (RDF) data model enables the construction of knowledge graphs over various domains. Ontologies encode information about the domain, and simple statements in the form of subject-predicate-object triples represent the data, facilitating the interlinking and exchange of Web data. However, this simplicity comes at the cost of executing a large number of joins to obtain the desired query results. At the same time, large ontological hierarchies further complicate query answering for systems that provide complete answers with respect to such ontological axioms. In this work we present PARJ, an in-memory RDF store that takes ontological hierarchies into consideration during join processing with very low performance overhead, avoids expensive preprocessing and materialization of implications, and is amenable to straightforward parallelization. Specifically, we present a join implementation that can achieve any desired degree of parallelism on arbitrary join queries over RDF graphs stored in memory using compact vertical partitioning. Our adaptive join processing approach takes advantage of complete or even partial ordering of the RDF data, which is stored compactly to increase spatial locality and keep memory consumption low. It is coupled with an ID-to-Position vector index that is used when the ordering does not allow efficient scanning of the input relation. Finally, we experimentally show the efficiency and scalability of our proposal.
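As a rough illustration of the storage and join ideas named in the abstract, the following minimal sketch stores one predicate as a vertically partitioned table sorted by subject ID. It probes that table either by continuing a short sequential scan or by jumping through an ID-to-position index. The names `PredicateTable`, `probe`, and `scan_window`, the dictionary-based index, and the fixed-window switching rule are all assumptions made for this example and are not taken from PARJ. Ontological hierarchies and parallel execution are left out.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


class PredicateTable:
    """One predicate of a vertically partitioned RDF graph (hypothetical layout).

    All triples (s, p, o) sharing the predicate p are kept as two parallel
    columns sorted by subject ID, so rows with equal subjects are contiguous.
    """

    def __init__(self, pairs: Iterable[Tuple[int, int]]) -> None:
        ordered = sorted(pairs)
        self.subjects: List[int] = [s for s, _ in ordered]
        self.objects: List[int] = [o for _, o in ordered]
        # ID-to-position index: first row of each subject ID (assumed design).
        self.first_pos: Dict[int, int] = {}
        for pos, s in enumerate(self.subjects):
            self.first_pos.setdefault(s, pos)

    def _locate(self, key: int, cursor: int, scan_window: int) -> int:
        """Return the first row whose subject equals key, or -1 if absent."""
        n = len(self.subjects)
        if cursor < n and self.subjects[cursor] <= key:
            # Key lies at or ahead of the cursor: try a bounded sequential scan.
            end = min(n, cursor + scan_window)
            pos = cursor
            while pos < end and self.subjects[pos] < key:
                pos += 1
            if pos < end:
                # The scan stopped on the first subject >= key inside the window.
                return pos if self.subjects[pos] == key else -1
        # Window exhausted or key behind the cursor: use the index instead.
        return self.first_pos.get(key, -1)

    def probe(self, keys: Iterable[int], scan_window: int = 4) -> Iterator[Tuple[int, int]]:
        """Yield (key, object) for every probe key that occurs as a subject."""
        cursor = 0  # always points at the first row of a run of equal subjects
        for key in keys:
            pos = self._locate(key, cursor, scan_window)
            if pos < 0:
                continue
            cursor = pos
            while pos < len(self.subjects) and self.subjects[pos] == key:
                yield key, self.objects[pos]
                pos += 1


# Hypothetical data: subject IDs joined against a stream of probe keys.
works_for = PredicateTable([(1, 10), (2, 10), (2, 11), (7, 12), (9, 13)])
print(list(works_for.probe([2, 3, 9, 1])))
# → [(2, 10), (2, 11), (9, 13), (1, 10)]
```

When the probe keys arrive in (nearly) sorted order, most lookups are served by the short scan and touch adjacent memory. Unordered keys, such as the final `1` above, fall back to the index. A smaller `scan_window` shifts work toward the index without changing the result.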
Acknowledgements
The present work was funded by the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No. 825258.
Cite this article
Bilidas, D., Koubarakis, M. In-memory parallelization of join queries over large ontological hierarchies. Distrib Parallel Databases 39, 545–582 (2021). https://doi.org/10.1007/s10619-020-07305-y
DOI: https://doi.org/10.1007/s10619-020-07305-y