Encyclopedia of Big Data Technologies

Living Edition
Editors: Sherif Sakr, Albert Zomaya

Data Lake

Living reference work entry
DOI: https://doi.org/10.1007/978-3-319-63962-8_7-1

Definition

A data lake is a data repository in which datasets from multiple sources are stored in their original structures. It should provide functions to extract data and metadata from heterogeneous sources and to ingest them into a hybrid storage system. In addition, a data lake should offer a data transformation engine in which datasets can be transformed, cleaned, and integrated with other datasets. Finally, a data lake system should provide interfaces for exploring and querying both its data and its metadata.
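The four functions named in this definition (ingestion with metadata extraction, storage in the original structure, transformation, and metadata querying) can be illustrated with a small sketch. This is not an implementation from the entry: the class ToyDataLake, its method names, and the sample datasets are hypothetical, and a single in-memory dictionary stands in for the hybrid storage system.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable


@dataclass
class Dataset:
    name: str
    source: str
    raw: bytes  # payload kept exactly as it was received
    metadata: dict[str, Any] = field(default_factory=dict)


class ToyDataLake:
    """Hypothetical in-memory stand-in for a data lake (illustration only)."""

    def __init__(self) -> None:
        self._store: dict[str, Dataset] = {}

    def ingest(self, name: str, source: str, raw: bytes, fmt: str) -> Dataset:
        # Store the raw bytes untouched; derive only descriptive metadata.
        metadata: dict[str, Any] = {
            "format": fmt,
            "size_bytes": len(raw),
            "ingested_at": datetime.now(timezone.utc).isoformat(),
        }
        if fmt == "json":
            records = json.loads(raw)
            if isinstance(records, list) and records and isinstance(records[0], dict):
                metadata["fields"] = sorted(records[0].keys())
        elif fmt == "csv":
            lines = raw.decode("utf-8").splitlines()
            if lines:
                metadata["fields"] = lines[0].split(",")
        dataset = Dataset(name, source, raw, metadata)
        self._store[name] = dataset
        return dataset

    def transform(
        self, name: str, new_name: str, fn: Callable[[bytes], bytes], fmt: str
    ) -> Dataset:
        # A derived dataset is added next to the raw one, never in place of it.
        original = self._store[name]
        return self.ingest(new_name, f"derived:{name}", fn(original.raw), fmt)

    def find_by_field(self, field_name: str) -> list[str]:
        # Query the metadata only: which datasets expose a given field?
        return sorted(
            name
            for name, dataset in self._store.items()
            if field_name in dataset.metadata.get("fields", [])
        )


def drop_blank_lines(raw: bytes) -> bytes:
    # Example cleaning step: remove empty lines from a text payload.
    return b"\n".join(line for line in raw.splitlines() if line.strip()) + b"\n"


lake = ToyDataLake()
lake.ingest("sensors", "plant-a", b"id,temp\n1,20.5\n\n2,21.0\n", "csv")
lake.ingest("orders", "shop", b'[{"id": 1, "total": 9.5}]', "json")
lake.transform("sensors", "sensors_clean", drop_blank_lines, "csv")
print(lake.find_by_field("temp"))
```

Note that the transformation leaves the raw "sensors" dataset unchanged and registers the cleaned version under a new name, which mirrors the idea that the lake keeps data in its original structure.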

Overview

The term “data lake” (DL) was first mentioned by James Dixon in a 2010 blog post (https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/), in which he compared data marts to bottled water that has been cleansed, packaged, and structured for easy consumption. In contrast, a data lake manages the raw data as it is ingested from the data sources.

In the initial article (and in a later, more detailed article (https://jamesdixon.wordpress.com/2014/09/25/data-lakes-revisited/...

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Fraunhofer-Institute for Applied Information Technology FIT, Sankt Augustin, Germany
  2. RWTH Aachen University, Aachen, Germany

Section editors and affiliations

  • Maik Thiele (1)
  1. Database Systems Group, Technische Universität Dresden, Dresden, Germany