Guide to e-Science pp 353-382

Part of the Computer Communications and Networks book series (CCN)

Facilitating e-Science Discovery Using Scientific Workflows on the Grid

  • Jianwu Wang
  • Prakashan Korambath
  • Seonah Kim
  • Scott Johnson
  • Kejian Jin
  • Daniel Crawl
  • Ilkay Altintas
  • Shava Smallen
  • Bill Labate
  • Kendall N. Houk

Abstract

e-Science has been greatly enhanced by the growing capability and usability of cyberinfrastructure. This chapter explains how scientific workflow systems can facilitate e-Science discovery in Grid environments by providing features including scientific process automation, resource consolidation, parallelism, provenance tracking, fault tolerance, and workflow reuse. We first give an overview of the core services that support e-Science discovery. To demonstrate how these services can be seamlessly assembled, an open source scientific workflow system called Kepler is integrated into the University of California Grid. This architecture is being applied to a computational enzyme design process, a formidable and collaborative problem in computational chemistry that challenges our knowledge of protein chemistry. Our implementation and experiments demonstrate how the Kepler workflow system can make the scientific computation process automated, pipelined, efficient, extensible, stable, and easy to use.

Copyright information

© Springer-Verlag London Limited 2011

Authors and Affiliations

  • Jianwu Wang
    • 1
  • Prakashan Korambath
    • 2
  • Seonah Kim
    • 3
  • Scott Johnson
    • 3
  • Kejian Jin
    • 2
  • Daniel Crawl
    • 1
  • Ilkay Altintas
    • 1
  • Shava Smallen
    • 1
  • Bill Labate
    • 2
  • Kendall N. Houk
    • 3
  1. San Diego Supercomputer Center, UCSD, La Jolla, USA
  2. Institute for Digital Research and Education, UCLA, Los Angeles, USA
  3. Department of Chemistry and Biochemistry, UCLA, Los Angeles, USA
