Big Data Provenance: Challenges and Implications for Benchmarking

Glavic, Boris

doi:10.1007/978-3-642-53974-9_7

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8163)


Abstract

Data provenance is information about the origin and creation process of data. Such information is useful for debugging data and transformations, auditing, evaluating the quality of and trust in data, modeling authenticity, and implementing access control for derived data. Provenance has been studied by the database, workflow, and distributed systems communities, but provenance for Big Data, which we refer to as Big Provenance, is a largely unexplored field. This paper reviews existing approaches for large-scale distributed provenance and discusses potential challenges for Big Data benchmarks that aim to incorporate provenance data and provenance management. Furthermore, we examine how Big Data benchmarking could benefit from different types of provenance information. We argue that provenance can be used to identify and analyze performance bottlenecks, to compute performance metrics, and to test a system’s ability to exploit commonalities in data and processing.



Author information

Authors and Affiliations

Illinois Institute of Technology, 10 W 31st Street, Chicago, IL, 60615, USA
Boris Glavic


Editor information

Editors and Affiliations

Department of Electric and Computer Science, University of Toronto, 10 King’s College Road, SFB 540, M5S 3G4, Toronto, ON, Canada
Tilmann Rabl & Hans-Arno Jacobsen
Server Technologies, Oracle Corporation, 500 Oracle Parkway, 94065, Redwood Shores, CA, USA
Meikel Poess
Supercomputer Center, University of California San Diego, 9500 Gilman Drive, 92093-0505, La Jolla, CA, USA
Chaitanya Baru


Cite this paper

Glavic, B. (2014). Big Data Provenance: Challenges and Implications for Benchmarking. In: Rabl, T., Poess, M., Baru, C., Jacobsen, HA. (eds) Specifying Big Data Benchmarks. WBDB 2012. Lecture Notes in Computer Science, vol 8163. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-53974-9_7


DOI: https://doi.org/10.1007/978-3-642-53974-9_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-53973-2
Online ISBN: 978-3-642-53974-9
eBook Packages: Computer Science, Computer Science (R0)
