Exploring Provenance in a Distributed Job Execution System

  • Christine F. Reilly
  • Jeffrey F. Naughton
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4145)


We examine provenance in the context of a distributed job execution system. Provenance information must be captured while a job executes in a distributed environment, because it is often lost once the job has finished. In this paper we discuss the types of information available within a distributed job execution system, how to capture that information, and what burdens capturing it places on the user and the system. We identify the data that we consider essential to capture and discuss the collection of provenance in the Quill++ project of Condor. We conclude that important provenance information can be captured in a distributed job execution system with relatively little intrusion on the user or the system.


Data Item; Infrastructure Information; Data Provenance; Provenance Information; Logical Provenance
These keywords were added by machine and not by the authors.



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Christine F. Reilly (1)
  • Jeffrey F. Naughton (1)
  1. Department of Computer Sciences, University of Wisconsin–Madison, Madison, USA
