Distributed and Parallel Databases, Volume 30, Issue 5–6, pp 351–370

MTCProv: a practical provenance query framework for many-task scientific computing

  • Luiz M. R. Gadelha Jr.
  • Michael Wilde
  • Marta Mattoso
  • Ian Foster


Scientific research is increasingly assisted by computer-based experiments. Such experiments are often composed of a vast number of loosely-coupled computational tasks that are specified and automated as scientific workflows. This large scale is also characteristic of the data that flows within such “many-task” computations (MTC). Provenance information can record the behavior of such computational experiments via the lineage of process and data artifacts. However, work to date has focused on lineage data models, leaving unsolved issues of recording and querying other aspects, such as domain-specific information about the experiments, MTC behavior given by resource consumption and failure information, or the impact of environment on performance and accuracy. In this work we present MTCProv, a provenance query framework for many-task scientific computing that captures the runtime execution details of MTC workflow tasks on parallel and distributed systems, in addition to standard prospective and data derivation provenance. To help users query provenance data, we provide a high-level interface that hides relational query complexities. We evaluate MTCProv using an application in protein science, and describe how important query patterns, such as correlations between provenance, runtime data, and scientific parameters, are simplified and expressed.


  • Provenance
  • Many-task computing
  • Database queries
  • Parallel and distributed computing



This work was supported in part by CAPES and CNPq, by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357, and by NSF under awards OCI-0944332 and OCI-1007115. We thank Swift users Aashish Adhikari, Andrey Rzhetsky, and Jon Monette for providing and running applications using MTCProv, and for helping us understand their provenance requirements.



Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Luiz M. R. Gadelha Jr. (1, 2)
  • Michael Wilde (3, 4)
  • Marta Mattoso (1)
  • Ian Foster (3, 4, 5)

  1. Computer Engineering Program, COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
  2. National Laboratory for Scientific Computing, Petrópolis, Brazil
  3. Mathematics and Computer Science Division, Argonne National Laboratory, Chicago, USA
  4. Computation Institute, Argonne National Laboratory and University of Chicago, Chicago, USA
  5. Department of Computer Science, University of Chicago, Chicago, USA
