Using Domain-Specific Data to Enhance Scientific Workflow Steering Queries

  • João Carlos de A.R. Gonçalves
  • Daniel de Oliveira
  • Kary A. C. S. Ocaña
  • Eduardo Ogasawara
  • Marta Mattoso
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7525)


In scientific workflows, provenance data helps scientists understand, evaluate, and reproduce their results. Provenance data generated at runtime can also support workflow steering mechanisms. Providing steering facilities for workflows is considered a challenge because of the dynamic demands that arise during execution. To steer a workflow, scientists should be able, for example, to suspend (or stop) an execution when the approximate solution meets (or deviates from) preset criteria. These criteria are commonly evaluated using both provenance data (execution data) and domain-specific data. We claim that the final decision on whether to interfere in the workflow execution may only become feasible when scientists can steer workflows using provenance data enriched with domain-specific data. In this paper we propose an approach based on specialized software components, named Data Extractor (DE), that acquire domain-specific data from the data files produced during a scientific workflow execution. A DE gathers domain-specific data from the produced data files and associates it with the existing provenance data in the provenance repository. We have evaluated the proposed approach using a real bioinformatics workflow for comparative genomics executed on the SciCumulus parallel cloud workflow engine.
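The idea of enriching provenance with domain-specific data can be illustrated with a minimal sketch. It is not the authors' Data Extractor implementation or the SciCumulus schema: the table names (`activation`, `domain_data`), the `key = value` file layout, and the `e_value` attribute are all assumptions made for illustration. The sketch parses the text of an output file, stores the extracted values next to the provenance record of the activation that produced it, and then runs a steering query that joins both kinds of data.

```python
import sqlite3


def create_repository(conn):
    """Create a toy provenance repository (hypothetical schema)."""
    conn.executescript("""
        CREATE TABLE activation (
            id INTEGER PRIMARY KEY,
            activity TEXT NOT NULL,
            status TEXT NOT NULL,
            output_file TEXT NOT NULL
        );
        CREATE TABLE domain_data (
            activation_id INTEGER NOT NULL REFERENCES activation(id),
            name TEXT NOT NULL,
            value REAL NOT NULL
        );
    """)


def extract_domain_data(file_text):
    """Hypothetical extractor: read numeric 'key = value' lines.

    Blank lines, '#' comments, and non-numeric values are skipped.
    """
    pairs = []
    for line in file_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        try:
            pairs.append((key.strip(), float(raw)))
        except ValueError:
            continue
    return pairs


def register_output(conn, activation_id, file_text):
    """Associate extracted values with an existing provenance record."""
    rows = [(activation_id, k, v) for k, v in extract_domain_data(file_text)]
    conn.executemany("INSERT INTO domain_data VALUES (?, ?, ?)", rows)
    return len(rows)


def activations_to_stop(conn, name, threshold):
    """Steering query: finished activations whose `name` exceeds threshold."""
    cur = conn.execute(
        """SELECT a.id
           FROM activation a
           JOIN domain_data d ON d.activation_id = a.id
           WHERE a.status = 'FINISHED' AND d.name = ? AND d.value > ?
           ORDER BY a.id""",
        (name, threshold),
    )
    return [row[0] for row in cur]
```

A scientist monitoring the run could call `activations_to_stop(conn, "e_value", 1e-5)` to list the activations whose extracted value violates a preset criterion. The join only becomes possible once the extractor has linked the file contents to the provenance records.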


Keywords (machine-generated, not supplied by the authors): Data Extractor; Cloud Environment; Thalassiosira Pseudonana; Virtual Cluster; Candidate Query


Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • João Carlos de A.R. Gonçalves (1)
  • Daniel de Oliveira (1)
  • Kary A. C. S. Ocaña (1)
  • Eduardo Ogasawara (1, 2)
  • Marta Mattoso (1)
  1. COPPE, Federal University of Rio de Janeiro, Brazil
  2. CEFET/RJ, Brazil
