Abstract
Companies increasingly face the challenge of managing large, heterogeneous data and extracting the value it contains. In recent years, the data lake has therefore emerged as a novel concept for managing and exploiting such complex data. However, companies that want to implement a data lake in practice run into a variety of challenges, such as contradictory definitions and vague or missing concepts. This article uses concrete projects of a globally operating industrial company to identify existing challenges and derive requirements for data lakes. These requirements are compared with the available literature on data lakes and with existing research approaches. The comparison reveals five major research gaps: (1) unclear data modeling methods, (2) a missing data lake reference architecture, (3) an incomplete metadata management concept, (4) an incomplete data lake governance concept, and (5) a missing holistic implementation strategy.
Cite this article
Giebler, C., Gröger, C., Hoos, E. et al. Data Lakes auf den Grund gegangen. Datenbank Spektrum 20, 57–69 (2020). https://doi.org/10.1007/s13222-020-00332-0