On data lake architectures and metadata management

Journal of Intelligent Information Systems

Abstract

Over the past two decades, data production worldwide has increased exponentially. So-called big data come mainly from transactional systems and, even more so, from the Internet of Things and social media. They are chiefly characterized by volume, velocity, variety and veracity issues, which strongly challenge traditional data management and analysis systems. The concept of data lake was introduced to address these issues. A data lake is a large repository that stores and manages all company data in raw form, whatever their format. However, the concept remains ambiguous or fuzzy for many researchers and practitioners, who often confuse it with the Hadoop technology. In this paper, we therefore provide a comprehensive survey of the state of the art in data lake design approaches. We focus in particular on data lake architectures and metadata management, which are key to successful data lakes. We also discuss the pros and cons of data lakes and their design alternatives.

Acknowledgments

The research accounted for in this paper was funded by the Université Lumière Lyon 2 and the Auvergne-Rhône-Alpes Region through the COREL and AURA-PMI projects, respectively. The authors also sincerely thank the anonymous reviewers of this paper for their constructive comments and suggestions.

Author information

Corresponding author

Correspondence to Pegdwendé Sawadogo.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sawadogo, P., Darmont, J. On data lake architectures and metadata management. J Intell Inf Syst 56, 97–120 (2021). https://doi.org/10.1007/s10844-020-00608-7

  • DOI: https://doi.org/10.1007/s10844-020-00608-7
