HERMES: data placement and schema optimization for enterprise knowledge bases

Lei, Chuan; Quamar, Abdul; Efthymiou, Vasilis; Özcan, Fatma; Alotaibi, Rana

doi:10.1007/s00778-022-00756-y

Regular Paper
Published: 26 July 2022

Volume 32, pages 549–574, (2023)

The VLDB Journal

Chuan Lei ORCID: orcid.org/0000-0001-6265-9554¹,
Abdul Quamar²,
Vasilis Efthymiou³,
Fatma Özcan⁴ &
Rana Alotaibi⁵

Abstract

Enterprises create domain-specific knowledge bases (KBs) by curating and integrating their business data from multiple sources. To support a variety of query types over domain-specific KBs, we propose Hermes, an ontology-based system that stores KB data in multiple backends and queries them with different query languages. In this paper, we address two important challenges in realizing such a system: data placement and schema optimization. First, we identify the best data store for each query type and determine the subset of the KB that needs to be stored in this data store, while minimizing data replication. Second, we optimize how the data is organized for best query performance. To choose the best data stores, we partition the data described by the domain ontology into multiple overlapping subsets based on the operations performed in a given query workload, and place these subsets in appropriate data stores according to their capabilities. Then, we optimize the schema on each data store to enable efficient querying. In particular, we focus on property graph schema optimization, which has been largely ignored in the literature. We propose two algorithms to generate an optimized schema from the domain ontology. We demonstrate the effectiveness of our data placement and schema optimization algorithms with two real-world KBs from the medical and financial domains. The results show that the proposed data placement algorithm generates near-optimal data placement plans with minimal data replication overhead, and that the schema optimization algorithms produce high-quality schemas, achieving up to two orders of magnitude speed-up compared to alternative schema designs.
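
To make the data placement idea concrete, the following is a minimal sketch of capability-based placement. It is not the Hermes algorithm, which the abstract does not spell out. It assumes a toy workload in which each query has one operation type and a set of ontology concepts, and it uses a simple greedy rule: send each query to a capable store that needs the fewest newly copied concepts. The store names, operation types, concept names, and the functions `place` and `replication` are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical capabilities: the operation types each backend supports.
CAPABILITIES: Dict[str, Set[str]] = {
    "relational": {"join", "aggregate"},
    "graph": {"traversal", "join"},
    "search": {"keyword"},
}

# A query is modeled as (operation type, ontology concepts it touches).
Query = Tuple[str, Set[str]]


def place(workload: List[Query],
          capabilities: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Greedily assign each query's concepts to one capable store."""
    placement: Dict[str, Set[str]] = {store: set() for store in capabilities}
    for op, concepts in workload:
        # Stores able to run this operation, sorted for determinism.
        candidates = sorted(s for s, ops in capabilities.items() if op in ops)
        if not candidates:
            raise ValueError(f"no data store supports operation {op!r}")
        # Prefer the store that needs the fewest additional concepts,
        # so that already-placed data is reused instead of replicated.
        best = min(candidates,
                   key=lambda s: (len(concepts - placement[s]), s))
        placement[best] |= concepts
    return placement


def replication(placement: Dict[str, Set[str]]) -> int:
    """Count concept copies beyond one copy per distinct concept."""
    total = sum(len(concepts) for concepts in placement.values())
    distinct = len(set().union(*placement.values()))
    return total - distinct


# Hypothetical workload over a small financial ontology.
workload: List[Query] = [
    ("aggregate", {"Company", "Filing"}),
    ("traversal", {"Company", "Person"}),
    ("join", {"Company", "Filing"}),
    ("keyword", {"Filing"}),
]

plan = place(workload, CAPABILITIES)
for store in sorted(plan):
    print(store, sorted(plan[store]))
print("replicated concepts:", replication(plan))
```

In this toy run the join query is routed to the relational store because both of its concepts are already there, so it adds no replication. The resulting subsets overlap (for example, `Company` is held by both the relational and graph stores), which mirrors the overlapping subsets described in the abstract.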

Notes

The terms ObjectProperty and Relationship are used interchangeably in this paper.
Even if inheritance and union are not ObjectProperties, we simplify the notation for presentation purposes.
We make a distinction between stored data that is initially placed in the data stores and intermediate data that is generated during query execution.
Access frequencies of concepts, relationships, and data properties in an ontology.
The neighborhood concepts do not include the member concepts of \(c_i\).
Db2 is a registered trademark of IBM Corporation.

Author information

Authors and Affiliations

Instacart, San Francisco, USA
Chuan Lei
IBM Research - Almaden, San Jose, USA
Abdul Quamar
FORTH - Institute of Computer Science, Heraklion, Greece
Vasilis Efthymiou
Google, Mountain View, USA
Fatma Özcan
University of California - San Diego, San Diego, USA
Rana Alotaibi

Corresponding author

Correspondence to Chuan Lei.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Chuan Lei, Vasilis Efthymiou, Fatma Özcan, Rana Alotaibi: Work done while at IBM Research.

About this article

Cite this article

Lei, C., Quamar, A., Efthymiou, V. et al. HERMES: data placement and schema optimization for enterprise knowledge bases. The VLDB Journal 32, 549–574 (2023). https://doi.org/10.1007/s00778-022-00756-y

Received: 19 January 2021
Revised: 11 March 2022
Accepted: 08 June 2022
Published: 26 July 2022
Issue Date: May 2023
DOI: https://doi.org/10.1007/s00778-022-00756-y
