Journal of Grid Computing, Volume 10, Issue 1, pp 5–21

An Evaluation of the Cost and Performance of Scientific Workflows on Amazon EC2

  • Gideon Juve
  • Ewa Deelman
  • G. Bruce Berriman
  • Benjamin P. Berman
  • Philip Maechling

Abstract

Workflows are used to orchestrate data-intensive applications in many different scientific domains. Workflow applications typically communicate data between processing steps using intermediate files. When tasks are distributed, these files are either transferred from one computational node to another, or accessed through a shared storage system. As a result, the efficient management of data is a key factor in achieving good performance for workflow applications in distributed environments. In this paper we investigate some of the ways in which data can be managed for workflows in the cloud. We ran experiments using three typical workflow applications on Amazon’s EC2 cloud computing platform. We discuss the various storage and file systems we used, describe the issues and problems we encountered deploying them on EC2, and analyze the resulting performance and cost of the workflows.

Keywords

Cloud computing, Scientific workflows

Copyright information

© Springer Science+Business Media B.V. 2012

Authors and Affiliations

  • Gideon Juve (1)
  • Ewa Deelman (1)
  • G. Bruce Berriman (2)
  • Benjamin P. Berman (3)
  • Philip Maechling (4)

  1. USC Information Sciences Institute, Marina Del Rey, USA
  2. NASA Exoplanet Science Institute, Infrared Processing and Analysis Center, Caltech, Pasadena, USA
  3. USC Epigenome Center, Los Angeles, USA
  4. Southern California Earthquake Center, Los Angeles, USA
