Trade-Offs in Automatic Provenance Capture

  • Manolis Stamatogiannakis
  • Hasanat Kazmi
  • Hashim Sharif
  • Remco Vermeulen
  • Ashish Gehani
  • Herbert Bos
  • Paul Groth
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9672)


Automatic provenance capture from arbitrary applications is a challenging problem. Several approaches have evolved to tackle it, most notably (a) system-event trace analysis, (b) compile-time static instrumentation, and (c) taint-flow analysis using dynamic binary instrumentation. Each approach offers different trade-offs in the granularity of captured provenance, integration requirements, and runtime overhead. While these aspects have been discussed separately, a systematic and detailed study that quantifies and elucidates them is still lacking. To fill this gap, we begin to explore these trade-offs by evaluating and measuring representative examples of each approach. We base our evaluation on UnixBench, a widely used benchmark suite within systems research, which should make our results easier to compare with those of future studies.
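To illustrate approach (a), system-event trace analysis, the following minimal sketch turns strace-style system-call records into process-to-file provenance edges. The record format, regex, and `provenance_edges` helper are simplifying assumptions for illustration only, not the pipeline evaluated in the paper:

```python
import re

# Simplified strace-style record: "<pid> <syscall>(<args>) = <retval>".
# Real strace output is richer (timestamps, signals, unfinished calls);
# this pattern is an assumption made for the sketch.
SYSCALL_RE = re.compile(
    r'^(?P<pid>\d+)\s+(?P<call>openat|read|write)'
    r'\((?P<args>.*)\)\s+=\s+(?P<ret>-?\d+)'
)

def provenance_edges(trace_lines):
    """Return (pid, relation, path) tuples for successful file opens."""
    edges = []
    for line in trace_lines:
        m = SYSCALL_RE.match(line)
        if not m or int(m.group('ret')) < 0:
            continue  # skip unparsable records and failed syscalls
        if m.group('call') == 'openat':
            # The quoted pathname is the second argument of openat(2).
            path = m.group('args').split('"')[1]
            edges.append((int(m.group('pid')), 'used', path))
    return edges

trace = [
    '1234 openat(AT_FDCWD, "/etc/passwd", O_RDONLY) = 3',
    '1234 openat(AT_FDCWD, "/missing", O_RDONLY) = -1',
]
print(provenance_edges(trace))  # [(1234, 'used', '/etc/passwd')]
```

This after-the-fact parsing is what makes trace analysis cheap to integrate (no recompilation, no binary rewriting), at the cost of coarse, file-level provenance granularity.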


Keywords: Provenance · SPADE · Taint tracking · LLVM · Strace



This material is based upon work supported by the National Science Foundation under Grant IIS-1116414. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Manolis Stamatogiannakis (1)
  • Hasanat Kazmi (2)
  • Hashim Sharif (2)
  • Remco Vermeulen (1)
  • Ashish Gehani (2)
  • Herbert Bos (1)
  • Paul Groth (3)
  1. Computer Science Institute, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
  2. SRI International, Menlo Park, USA
  3. Elsevier Labs, Amsterdam, The Netherlands
