Query Rewriting for Heterogeneous Data Lakes

  • Rihan Hai
  • Christoph Quix
  • Chen Zhou
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11019)


The increasing popularity of NoSQL systems has led to the model of polyglot persistence, in which several data management systems with different data models are used. Data lakes realize the polyglot persistence model by collecting data from various sources, storing the data in its original structure, and providing the datasets for querying and analysis. Thus, one of the key tasks of data lakes is to provide a unified querying interface that can rewrite queries expressed in a general data model into a union of queries over data sources spanning heterogeneous data stores. To address this challenge, we propose a novel framework for query rewriting that combines logical methods for data integration based on declarative mappings with a scalable big data query processing system (i.e., Apache Spark) to efficiently execute the rewritten queries and to reconcile the query results into an integrated dataset. Because of the diversity of NoSQL systems, our approach is based on a flexible and extensible architecture that currently supports the major data structures such as relational data, semi-structured data (e.g., JSON, XML), and graphs. We show the applicability of our query rewriting engine with six real-world datasets and demonstrate its scalability using an artificial data integration scenario with multiple storage systems.
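
The idea of rewriting one query over a general model into a union of per-source queries can be illustrated with a minimal sketch. It is not the framework described above: it uses plain in-memory Python collections as stand-ins for the data stores instead of Apache Spark, and every name in it (SourceMapping, rewrite, execute, MAPPINGS, the sample records) is a hypothetical chosen for illustration. The sketch assumes that a source takes part in the rewritten query only if its mapping covers every requested attribute.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


@dataclass
class SourceMapping:
    """Hypothetical declarative mapping from global attributes to one source."""
    source: str                                # illustrative name of the store
    fetch: Callable[[], Iterable[Any]]         # yields records in the source's native shape
    extract: Dict[str, Callable[[Any], Any]]   # global attribute -> accessor on a raw record


def rewrite(attributes: List[str],
            mappings: List[SourceMapping]) -> List[Tuple[SourceMapping, List[str]]]:
    """Keep one subquery per source whose mapping covers all requested attributes."""
    return [(m, list(attributes)) for m in mappings
            if all(a in m.extract for a in attributes)]


def execute(attributes: List[str],
            predicate: Callable[[Row], bool],
            mappings: List[SourceMapping]) -> List[Row]:
    """Run each subquery, translate records to the global shape, and union the results."""
    result: List[Row] = []
    for mapping, attrs in rewrite(attributes, mappings):
        for raw in mapping.fetch():
            row = {a: mapping.extract[a](raw) for a in attrs}
            if predicate(row):
                result.append(row)
    return result


# Illustrative stand-ins for three heterogeneous stores (assumed sample data).
relational_rows = [("Ada", 36), ("Bob", 17)]                  # tuples (name, age)
json_docs = [{"person": {"name": "Cy", "age": 52}},
             {"person": {"name": "Di", "age": 12}}]           # nested documents
graph_nodes = [{"label": "Eve"}]                              # nodes without an age property

MAPPINGS = [
    SourceMapping("relational", lambda: relational_rows,
                  {"name": lambda t: t[0], "age": lambda t: t[1]}),
    SourceMapping("json", lambda: json_docs,
                  {"name": lambda d: d["person"]["name"],
                   "age": lambda d: d["person"]["age"]}),
    SourceMapping("graph", lambda: graph_nodes,
                  {"name": lambda n: n["label"]}),
]

# Global query: name and age of everyone aged 18 or older.
adults = execute(["name", "age"], lambda r: r["age"] >= 18, MAPPINGS)
```

In this sketch the graph source is skipped for the query on name and age because its mapping has no accessor for age, while a query on name alone would draw from all three sources. A real system would push predicates down to each store and run the union on a distributed engine rather than filtering in a Python loop.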


Keywords: Query Rewriting, Lake Data (DLs), Heterogeneous Data Stores, NoSQL Systems, Data Integration Scenarios
These keywords were added by machine and not by the authors.



This work has been partially funded by the German Federal Ministry of Education and Research (BMBF) (project HUMIT, grant no. 01IS14007A), the German Research Foundation (DFG) within the Cluster of Excellence “Integrative Production Technology for High Wage Countries” (EXC 128), and by the Joint Research (IGF) of the German Federal Ministry of Economic Affairs and Energy (BMWI, project charMant, IGF promotion plan 18504N).



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Databases and Information Systems, RWTH Aachen University, Aachen, Germany
  2. Fraunhofer-Institute for Applied Information Technology FIT, Sankt Augustin, Germany
