A Noisy 10GB Provenance Database

  • Conference paper
Business Process Management Workshops (BPM 2011)

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 100)

Abstract

Provenance of scientific data is a key piece of the metadata record for the data’s ongoing discovery and reuse. Provenance collection systems capture provenance on the fly; however, the protocol between application and provenance tool may not be reliable. Consequently, the provenance record can be partial, partitioned, or simply inaccurate. We use a workflow emulator that models faults to construct a 10GB database of provenance that we know is noisy (that is, contains errors). We discuss the process of generating the provenance database and show early results on the kinds of provenance analysis enabled by a provenance database of this size.
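To make the idea of a noisy provenance record concrete, here is a minimal Python sketch of how an unreliable capture protocol could leave a record partial. It is not the emulator or the fault model used in the paper: the linear workflow, the notification tuples, the function names, and the uniform drop probability are all assumptions chosen for illustration.

```python
import random


def emit_notifications(n_tasks):
    """Return the complete provenance notifications for a hypothetical
    linear workflow: task i consumes data i and produces data i+1."""
    events = []
    for i in range(n_tasks):
        events.append(("invoked", f"task{i}"))
        events.append(("consumed", f"task{i}", f"data{i}"))
        events.append(("produced", f"task{i}", f"data{i + 1}"))
    return events


def drop_notifications(events, drop_rate, seed=0):
    """Assumed fault model: each notification is lost independently
    with probability drop_rate before it reaches the provenance store."""
    rng = random.Random(seed)
    return [e for e in events if rng.random() >= drop_rate]


def missing_notifications(complete, noisy):
    """List the notifications present in the complete record but absent
    from the noisy one (possible here because the truth is known)."""
    received = set(noisy)
    return [e for e in complete if e not in received]


if __name__ == "__main__":
    complete = emit_notifications(n_tasks=5)
    noisy = drop_notifications(complete, drop_rate=0.2, seed=42)
    lost = missing_notifications(complete, noisy)
    print(f"{len(noisy)} of {len(complete)} notifications captured")
    for event in lost:
        print("lost:", event)
```

Because the complete record is generated alongside the noisy one, the lost notifications can be listed exactly. That is the property that makes a synthetic noisy database useful for evaluating provenance analysis.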

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cheah, YW., Plale, B., Kendall-Morwick, J., Leake, D., Ramakrishnan, L. (2012). A Noisy 10GB Provenance Database. In: Daniel, F., Barkaoui, K., Dustdar, S. (eds) Business Process Management Workshops. BPM 2011. Lecture Notes in Business Information Processing, vol 100. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28115-0_35

  • DOI: https://doi.org/10.1007/978-3-642-28115-0_35

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28114-3

  • Online ISBN: 978-3-642-28115-0

  • eBook Packages: Computer Science, Computer Science (R0)
