Provenance as Essential Infrastructure for Data Lakes

  • Isuru Suriarachchi
  • Beth Plale
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9672)


The Data Lake is emerging as a Big Data storage and management solution that can store any type of data at scale and execute data transformations for analysis. This higher flexibility in storage increases the risk of a Data Lake becoming a data swamp. In this paper, we show how provenance contributes to data management within a Data Lake infrastructure. We study provenance integration challenges and propose a reference architecture for provenance usage in a Data Lake. Finally, we discuss the applicability of our tools in the proposed architecture.



This work is funded in part by a grant from the NSF, ACI-0940824.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. School of Informatics and Computing, Indiana University, Bloomington, USA
