Abstract
Provenance of scientific data is a key piece of the metadata record for the data’s ongoing discovery and reuse. Provenance collection systems capture provenance on the fly, however, the protocol between application and provenance tool may not be reliable. Consequently, the provenance record can be partial, partitioned, and simply inaccurate. We use a workflow emulator that models faults to construct a large 10GB database of provenance that we know is noisy (that is, has errors). We discuss the process of generating the provenance database, and show early results on the kinds of provenance analysis enabled by the large provenance.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Antonatos, S., Anagnostakis, K., Markatos, E.: Generating realistic workloads for network intrusion detection systems. In: ACM Workshop on Software and Performance, Redwood Shores, CA, USA (2004)
Bodnarchuk, R.R., Bunt, R.B.: A synthetic workload model for a distributed systems file server. In: Proceedings of the SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pp. 50–59 (1991)
Cui, Y., Widom, J.: Lineage tracing for general data warehouse transformations. VLDB Journal 12, 41–58 (2003)
Freire, J., Koop, D., Santos, E., Silva, C.T.: Provenance for Computational Tasks: A Survey. Computing in Science and Engineering 10(3), 11–21 (2008)
Frew, J., Metzger, D., Slaughter, P.: Automatic capture and reconstruction of computational provenance. Concurrency and Computation: Practice and Experience 20(5), 485–496 (2008)
Groth, P., Moreau, L.: Recording Process Documentation for Provenance. IEEE Transactionson Parallel and Distributed Systems 20(9), 1246–1259 (2009)
Kim, J., Deelman, E., Gil, Y., Mehta, G., Ratnakar, V.: Provenance Trails in the Wings/Pegasus System. Concurrency and Computation: Practice and Experience 20(5), 587–597 (2008)
Leake, D.B., Kendall-Morwick, J.: Towards Case-Based Support for e-Science Workflow Generation by Mining Provenance. In: Althoff, K.-D., Bergmann, R., Minor, M., Hanft, A. (eds.) ECCBR 2008. LNCS (LNAI), vol. 5239, pp. 269–283. Springer, Heidelberg (2008)
Leake, D., Kendall-Morwick, J.: Four Heads are Better than One: Combining Suggestions for Case Adaptation. In: McGinty, L., Wilson, D.C. (eds.) ICCBR 2009. LNCS, vol. 5650, pp. 165–179. Springer, Heidelberg (2009)
Lopez de Mantaras, R., McSherry, D., Leake, D., Smyth, B., Craw, S., Faltings, B., Maher, M.L., Cox, M., Forbus, K., Keane, M., Aamodt, A., Watson, I.: Retrieval, Revision, and Retention in CBR. Knowledge Engineering Review 20(3), 215–240 (2006)
Ludäscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger-Frank, E., Jones, M., Lee, E., Tao, J., Zhao, Y.: Scientific Workflow Management and the Kepler System. Concurrency and Computation: Practice & Experience, Special Issue on Scientific Workflows (2005)
Mehra, P., Wah, B.: Synthetic Workload Generation for Load-balancing Experiments. IEEE Parallel and Distributed Technology 3(3), 4–19 (1995)
Moreau, L., Plale, B., Miles, S., Goble, C., Missier, P., Barga, R., Simmhan, Y., Futrelle, J., McGrath, R., Myers, J., Paulson, P., Bowers, S., Ludaescher, B., Kwasnikowska, N., Van den Bussche, J., Ellkvist, T., Freire, J., Groth, P.: The Open Provenance Model. Technical report, Electronics and Computer Science, University of Southampton, (2008)
Noble, B.D., Satyanarayanan, M., Nguyen, G.T., Katz, R.H.: Trace-Based Mobile Network Emulation. In: Proceedings of SIGCOMM 1997, Cannes, France, pp. 51–61 (September 1997)
Ramakrishnan, L., Plale, B.: A Multi-Dimensional Classification Model for Workflow Characteristics. In: Workflow Approaches to New Data-centric Science, with ACM SIGMOD 2010, Indianapolis, IN (2010)
Ramakrishnan, L., Plale, B., Gannon, D.: WORKEM: Representing and Emulating Distributed Scientific Workflow Execution State. In: Proceedings of the 10th IEEE/ACM Int’l. Symposium on Cluster, Cloud and Grid Computing, Melbourne, Australia (2010)
Shirasuna, S.: A Dynamic Scientific Workflow System for the Web Services Architecture. PhD thesis, Indiana University (September 2007)
Simmhan, Y., Plale, B., Gannon, D.: Karma2: Provenance Management for Data Driven Workflows. International Journal of Web Services Research 5(2) (2008)
Simmhan, Y., Plale, B., Gannon, D.: Towards a Quality Model for Effective Data Selection in Collaboratories. In: IEEE Workshop on Workflow and Data Flow for Scientific Applications, held in conjunction with ICDE, Atlanta, GA (2006)
Simmhan, Y., Plale, B., Gannon, D.: A survey of data provenance in e-science. SIGMOD Record 34(3), 31–36 (2005)
Sreenivasan, K., Kleinman, A.J.: On the construction of a representative synthetic workload. Communications of the ACM, 127–133 (1974)
Widom, J.: Trio: A System for Integrated Management of Data, Accuracy, and Lineage. In: CIDR, Pacific Grove, California (January 2005)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cheah, YW., Plale, B., Kendall-Morwick, J., Leake, D., Ramakrishnan, L. (2012). A Noisy 10GB Provenance Database. In: Daniel, F., Barkaoui, K., Dustdar, S. (eds) Business Process Management Workshops. BPM 2011. Lecture Notes in Business Information Processing, vol 100. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28115-0_35
Download citation
DOI: https://doi.org/10.1007/978-3-642-28115-0_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28114-3
Online ISBN: 978-3-642-28115-0
eBook Packages: Computer ScienceComputer Science (R0)