Abstract
Over the past two decades, we have witnessed an exponential increase in data production worldwide. So-called big data generally come from transactional systems, and even more so from the Internet of Things and social media. They are mainly characterized by volume, velocity, variety and veracity issues, which strongly challenge traditional data management and analysis systems. The concept of data lake was introduced to address these issues. A data lake is a large, raw data repository that stores and manages all of a company's data, whatever their format. However, the data lake concept remains ambiguous or fuzzy for many researchers and practitioners, who often confuse it with the Hadoop technology. Thus, we provide in this paper a comprehensive state of the art of the different approaches to data lake design. We particularly focus on data lake architectures and metadata management, which are key issues in successful data lakes. We also discuss the pros and cons of data lakes and their design alternatives.
Acknowledgments
The research reported in this paper was funded by the Université Lumière Lyon 2 and the Auvergne-Rhône-Alpes Region through the COREL and AURA-PMI projects, respectively. The authors also sincerely thank the anonymous reviewers of this paper for their constructive comments and suggestions.
Cite this article
Sawadogo, P., Darmont, J. On data lake architectures and metadata management. J Intell Inf Syst 56, 97–120 (2021). https://doi.org/10.1007/s10844-020-00608-7
DOI: https://doi.org/10.1007/s10844-020-00608-7