HERMES: data placement and schema optimization for enterprise knowledge bases

  • Regular Paper
The VLDB Journal

Abstract

Enterprises create domain-specific knowledge bases (KBs) by curating and integrating their business data from multiple sources. To support a variety of query types over domain-specific KBs, we propose Hermes, an ontology-based system that allows storing KB data in multiple backends and querying them with different query languages. In this paper, we address two important challenges in realizing such a system: data placement and schema optimization. First, we identify the best data store for any query type and determine the subset of the KB that needs to be stored in this data store, while minimizing data replication. Second, we optimize how the data is organized for the best query performance. To choose the best data stores, we partition the data described by the domain ontology into multiple overlapping subsets based on the operations performed in a given query workload, and place these subsets in appropriate data stores according to their capabilities. Then, we optimize the schema on each data store to enable efficient querying. In particular, we focus on property graph schema optimization, which has been largely ignored in the literature. We propose two algorithms to generate an optimized schema from the domain ontology. We demonstrate the effectiveness of our data placement and schema optimization algorithms with two real-world KBs from the medical and financial domains. The results show that the proposed data placement algorithm generates near-optimal data placement plans with minimal data replication overhead, and the schema optimization algorithms produce high-quality schemas, achieving up to two orders of magnitude speed-up compared to alternative schema designs.
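The placement idea described in the abstract can be illustrated with a small sketch. This is not the algorithm from the paper: it is a simplified greedy heuristic written under stated assumptions, and every name in it (the store names, the operation names, the example workload, and the functions `partition_by_operation`, `place`, and `replication_overhead`) is hypothetical. It groups ontology concepts into possibly overlapping subsets by the operation a query performs, assigns each subset to a store that supports that operation, and counts how many extra concept copies result.

```python
# Illustrative sketch only. Store names, operation names, and the workload
# are hypothetical and do not come from the paper.

def partition_by_operation(workload):
    """Group ontology concepts by the operation that touches them.

    Subsets may overlap when a concept is used by several operation types.
    """
    subsets = {}
    for query in workload:
        subsets.setdefault(query["op"], set()).update(query["concepts"])
    return subsets


def place(subsets, capabilities):
    """Assign each subset to a store that supports its operation.

    Among capable stores, prefer the one that already holds the most of the
    subset's concepts, so that fewer new copies are created (an assumed
    greedy tie-breaking rule, not the paper's method).
    """
    placement = {}
    for op in sorted(subsets):
        concepts = subsets[op]
        candidates = [s for s in sorted(capabilities) if op in capabilities[s]]
        if not candidates:
            raise ValueError(f"no data store supports operation {op!r}")
        best = max(
            candidates,
            key=lambda s: len(placement.get(s, set()) & concepts),
        )
        placement.setdefault(best, set()).update(concepts)
    return placement


def replication_overhead(placement):
    """Count stored concept copies beyond one copy per distinct concept."""
    total = sum(len(concepts) for concepts in placement.values())
    distinct = len(set().union(*placement.values())) if placement else 0
    return total - distinct


# Hypothetical capabilities: the operation types each backend supports.
CAPABILITIES = {
    "relational": {"aggregation", "join"},
    "graph": {"path_traversal", "join"},
    "search": {"fuzzy_text_match"},
}

# Hypothetical workload: each query names one operation and its concepts.
WORKLOAD = [
    {"op": "aggregation", "concepts": {"Company", "FinancialMetric"}},
    {"op": "path_traversal", "concepts": {"Company", "Person"}},
    {"op": "fuzzy_text_match", "concepts": {"Person"}},
]

placement = place(partition_by_operation(WORKLOAD), CAPABILITIES)
for store in sorted(placement):
    print(store, sorted(placement[store]))
print("replication overhead:", replication_overhead(placement))
```

In this toy workload, concepts such as `Company` and `Person` are needed by more than one operation type, so they end up in more than one store. This is the replication overhead that a placement plan tries to keep small.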

Notes

  1. The terms ObjectProperty and Relationship are used interchangeably in this paper.

  2. Although inheritance and union are not ObjectProperties, we simplify the notation for presentation purposes.

  3. We make a distinction between stored data that is initially placed in the data stores and intermediate data that is generated during query execution.

  4. Access frequencies of concepts, relationships, and data properties in an ontology.

  5. The neighborhood concepts do not include the member concepts of \(c_i\).

  6. Db2 is a registered trademark of IBM Corporation.

  7. We make a distinction between stored data that is initially placed in the data stores and intermediate data that is generated during query execution.

Author information

Corresponding author

Correspondence to Chuan Lei.

Additional information

Chuan Lei, Vasilis Efthymiou, Fatma Özcan, Rana Alotaibi: Work done while at IBM Research.

About this article

Cite this article

Lei, C., Quamar, A., Efthymiou, V. et al. HERMES: data placement and schema optimization for enterprise knowledge bases. The VLDB Journal 32, 549–574 (2023). https://doi.org/10.1007/s00778-022-00756-y

  • DOI: https://doi.org/10.1007/s00778-022-00756-y
