Building Trust in Information, pp. 103–126
A Brief Tour Through Provenance in Scientific Workflows and Databases
Abstract
Within computer science, the term provenance has multiple meanings, due to different motivations, perspectives, and assumptions prevalent in the respective communities. This chapter provides a high-level “sightseeing tour” of some of those different notions and uses of provenance in scientific workflows and databases.
Keywords
Lineage, Prospective provenance, Provenance games, Provenance polynomials, Retrospective provenance, Why-not provenance
Acknowledgements
This work was supported in part by NSF grants ACI-1430508, DBI-{1147273, 1356751}, IIS-1118088, and SMA-1439603. With special thanks to Shawn Bowers, Timothy McPhillips, Manish K. Anand, Víctor Cuevas-Vicenttín, Saumen Dey, Lei Dou, Sven Köhler, Sean Riddle, and Daniel Zinn for fruitful years of collaboration on scientific workflows and database provenance. Also special thanks to Boris Glavic for comments on an earlier draft of this paper and for his collaboration on and implementation of games for why-not provenance.