Journal of Grid Computing, Volume 11, Issue 3, pp 381–406

A Case Study into Using Common Real-Time Workflow Monitoring Infrastructure for Scientific Workflows

  • Karan Vahi
  • Ian Harvey
  • Taghrid Samak
  • Daniel Gunter
  • Kieran Evans
  • David Rogers
  • Ian Taylor
  • Monte Goode
  • Fabio Silva
  • Eddie Al-Shakarchi
  • Gaurang Mehta
  • Ewa Deelman
  • Andrew Jones

Abstract

Scientific workflow systems support various workflow representations, operational modes, and configurations. Regardless of the system used, end users have common needs: to track the status of their workflows in real time, to be notified automatically of execution anomalies and failures, to perform troubleshooting, and to automate the analysis of workflow results. In this paper, we describe how the Stampede monitoring infrastructure was integrated with the Pegasus Workflow Management System and the Triana Workflow System in order to add generic real-time monitoring and troubleshooting capabilities to both systems. Stampede is an infrastructure that provides interoperable monitoring using a three-layer model: (1) a common data model to describe workflow and job executions; (2) high-performance tools to load workflow logs conforming to the data model into a data store; and (3) a common query interface. This paper describes the integration of the Stampede monitoring architecture with Pegasus and Triana and shows the new analysis capabilities that Stampede provides to these workflow systems. The successful integration of Stampede with these workflow engines demonstrates the generic nature of the Stampede monitoring infrastructure and its potential to provide a common monitoring platform across scientific workflow engines.
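
To make the three-layer model concrete, the sketch below is a minimal, hypothetical Python example, not the actual Stampede loader, schema, or query API. It assumes NetLogger-style key=value log lines with illustrative event and field names, normalizes them into a single relational table standing in for the common data model, and runs the kind of status query a common query interface could expose across workflow engines.

    # Minimal sketch of the three-layer idea (illustrative only):
    # parse key=value workflow log events into a common schema,
    # load them into a relational store, and query job status.
    import sqlite3
    import shlex

    def parse_event(line: str) -> dict:
        """Split a space-separated key=value log line into a dictionary."""
        return dict(token.split("=", 1) for token in shlex.split(line))

    # Hypothetical events resembling what a workflow engine might emit;
    # the event and field names are assumptions, not the Stampede schema.
    log_lines = [
        'ts=2012-04-01T10:00:00Z event=wf.job.start wf.id=wf-001 job.id=job-7',
        'ts=2012-04-01T10:05:42Z event=wf.job.end wf.id=wf-001 job.id=job-7 status=0',
    ]

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE job_event (
                      ts TEXT, event TEXT, wf_id TEXT, job_id TEXT, status INTEGER)""")

    # Layer 2: load normalized events into the data store.
    for line in log_lines:
        e = parse_event(line)
        conn.execute(
            "INSERT INTO job_event VALUES (?, ?, ?, ?, ?)",
            (e["ts"], e["event"], e["wf.id"], e["job.id"], e.get("status")),
        )

    # Layer 3: a common query interface can answer status questions for any
    # engine whose logs were normalized into the same schema.
    for row in conn.execute(
            "SELECT job_id, status FROM job_event WHERE event LIKE '%.end'"):
        print(row)

In the systems described in the paper, the Pegasus and Triana logs are mapped to the shared Stampede data model and loaded by dedicated high-performance loaders rather than by a toy table like this one; the sketch only illustrates the layering.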

Keywords

Scientific workflows · Real-time monitoring · Common monitoring infrastructure · Log analysis · Troubleshooting · Workflow performance data

Copyright information

© Springer Science+Business Media Dordrecht (outside the USA) 2013

Authors and Affiliations

  • Karan Vahi (1)
  • Ian Harvey (2)
  • Taghrid Samak (3)
  • Daniel Gunter (3)
  • Kieran Evans (2)
  • David Rogers (2)
  • Ian Taylor (2)
  • Monte Goode (3)
  • Fabio Silva (4)
  • Eddie Al-Shakarchi (2)
  • Gaurang Mehta (1)
  • Ewa Deelman (1)
  • Andrew Jones (2)

  1. USC Information Sciences Institute, Marina del Rey, USA
  2. School of Computer Science, Cardiff, UK
  3. Lawrence Berkeley National Laboratory, Berkeley, USA
  4. University of Southern California, Los Angeles, USA
