Compacting frequent star patterns in RDF graphs

Abstract

Knowledge graphs have become a popular formalism for representing entities and their properties using a graph data model, e.g., the Resource Description Framework (RDF). An RDF graph comprises entities of the same type connected to objects or other entities using labeled edges annotated with properties. RDF graphs usually contain entities that share the same objects in a certain group of properties, i.e., they match star patterns composed of these properties and objects. When the number of entities or properties in these star patterns is large, the size of the RDF graph and query processing are negatively impacted; we refer to these star patterns as frequent star patterns. We address the problem of identifying frequent star patterns in RDF graphs and devise the concept of factorized RDF graphs, which denote compact representations of RDF graphs where the number of frequent star patterns is minimized. We also develop computational methods to identify frequent star patterns and generate a factorized RDF graph, where compact RDF molecules replace frequent star patterns. A compact RDF molecule of a frequent star pattern denotes an RDF subgraph that instantiates the corresponding star pattern. Instead of having all the entities match the original frequent star pattern, a surrogate entity is added and related to the properties of the frequent star pattern; it is linked to the entities that originally matched the frequent star pattern. Since the edges between the entities and the objects in the frequent star pattern are replaced by edges between these entities and the surrogate entity of the compact RDF molecule, the size of the RDF graph is reduced. We evaluate the performance of our factorization techniques on several RDF graph benchmarks and compare them with a baseline built on top of gSpan, a state-of-the-art algorithm for detecting frequent patterns. The results demonstrate the efficiency of the proposed approach and show that our techniques reduce the execution time of the baseline approach by at least three orders of magnitude. Additionally, RDF graph size can be reduced by up to 66.56% while the data represented in the original RDF graph is preserved.
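
The replacement of a frequent star pattern by a compact RDF molecule can be illustrated with a minimal sketch. It is not the authors' algorithm: the star's properties are given as input rather than discovered, and the function `factorize`, the threshold `min_entities`, the linking property `ex:hasMolecule`, the surrogate names `ex:molecule0`, ..., and the sensor-style sample data are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


def factorize(triples, properties, min_entities=2, link="ex:hasMolecule"):
    """Replace stars over `properties` shared by several entities with a surrogate.

    Illustrative only: `link` and the surrogate names are assumptions,
    not identifiers taken from the paper.
    """
    properties = set(properties)

    # Per subject, collect the (property, object) pairs that belong to the star.
    star_of = defaultdict(set)
    for s, p, o in triples:
        if p in properties:
            star_of[s].add((p, o))

    # Group subjects that have every star property and share the same objects.
    groups = defaultdict(list)
    for s, star in star_of.items():
        if {p for p, _ in star} == properties:
            groups[frozenset(star)].append(s)

    # Keep only stars matched by enough entities.
    frequent = {
        star: sorted(subjects)
        for star, subjects in groups.items()
        if len(subjects) >= min_entities
    }
    replaced = {s for subjects in frequent.values() for s in subjects}

    # Drop the star edges of the replaced entities, keep everything else.
    out = [t for t in triples if not (t[0] in replaced and t[1] in properties)]

    # Add one surrogate per frequent star and link each entity to it.
    ordered = sorted(frequent.items(), key=lambda kv: kv[1])
    for i, (star, subjects) in enumerate(ordered):
        surrogate = f"ex:molecule{i}"
        out.extend((surrogate, p, o) for p, o in sorted(star))
        out.extend((s, link, surrogate) for s in subjects)
    return out


# Three hypothetical observations sharing the same property, unit, and procedure.
triples = []
for i, value in enumerate(["21.5", "22.0", "19.8"], start=1):
    obs = f"ex:obs{i}"
    triples += [
        (obs, "ex:property", "ex:Temperature"),
        (obs, "ex:unit", "ex:Celsius"),
        (obs, "ex:procedure", "ex:SensorA"),
        (obs, "ex:value", value),
    ]

compact = factorize(triples, ["ex:property", "ex:unit", "ex:procedure"])
print(len(triples), len(compact))  # prints: 12 9
```

In this toy case, n = 3 entities sharing k = 3 property-object pairs contribute n · k = 9 edges originally and n + k = 6 edges after factorization, so the saving grows as either the number of entities or the number of properties in the star increases. The original edges remain recoverable by following the link to the surrogate entity.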

Notes

  1. https://lod-cloud.net/

  2. property type refers to rdf:type

  3. Available at: http://wiki.knoesis.org/index.php/LinkedSensorData

Acknowledgments

Farah Karim is supported by the German Academic Exchange Service (DAAD); this work is partially funded by the EU H2020 project IASiS (GA No.727658).

Author information

Corresponding author

Correspondence to Farah Karim.

About this article

Cite this article

Karim, F., Vidal, M. & Auer, S. Compacting frequent star patterns in RDF graphs. J Intell Inf Syst (2020). https://doi.org/10.1007/s10844-020-00595-9

Keywords

  • Semantic Web
  • RDF compaction
  • Linked data
  • Knowledge graph