Computation and Storage Trade-Off for Cost-Effectively Storing Scientific Datasets in the Cloud

Abstract

Scientific applications are usually data intensive [1, 2], and the datasets they generate are often terabytes or even petabytes in size. As Szalay and Gray report in [3], science is in an exponential world, and the amount of scientific data is expected to double every year over the next decade and beyond. Producing scientific datasets involves a large number of computation-intensive tasks, e.g., in scientific workflows [4], and therefore takes a long time to execute. The generated datasets contain important intermediate or final results of the computation and need to be stored as valuable resources, for two reasons: (1) data can be reused: scientists may need to re-analyze the results or apply new analyses to the existing datasets [5]; (2) data can be shared: for collaboration, the computation results may be shared, so the datasets are used by scientists from different institutions [6]. Storing valuable generated datasets saves the cost of regenerating them when they are reused, as well as the waiting time that regeneration causes. However, the large size of scientific datasets makes storing them a major challenge.
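
To make the reuse argument concrete, the following is a minimal sketch, not the chapter's own model, of how one might compare keeping a dataset in cloud storage against deleting it and regenerating it on each use. The function name, the per-month accounting, and all prices and sizes are illustrative assumptions rather than figures from the text.

```python
def cheaper_option(size_gb, storage_price_per_gb_month,
                   regen_cpu_hours, cpu_price_per_hour, uses_per_month):
    """Compare the monthly cost of storing a dataset with regenerating it.

    All parameters are hypothetical inputs chosen for illustration:
    - size_gb: dataset size in gigabytes
    - storage_price_per_gb_month: storage price per GB per month
    - regen_cpu_hours: CPU hours needed to regenerate the dataset once
    - cpu_price_per_hour: compute price per CPU hour
    - uses_per_month: how often the dataset is expected to be reused
    """
    # Cost of keeping the dataset: paid every month regardless of usage.
    stored = size_gb * storage_price_per_gb_month
    # Cost of deleting the dataset: pay for one regeneration per use.
    regenerated = regen_cpu_hours * cpu_price_per_hour * uses_per_month
    if stored <= regenerated:
        return "store", stored
    return "regenerate", regenerated


# Assumed example: a 500 GB dataset that takes 100 CPU hours to rebuild.
rarely_used = cheaper_option(500, 0.10, 100, 0.10, uses_per_month=1)
often_used = cheaper_option(500, 0.10, 100, 0.10, uses_per_month=10)
print(rarely_used[0], often_used[0])
```

Under these assumed numbers, the rarely used dataset is cheaper to regenerate and the frequently reused one is cheaper to store. The waiting time caused by regeneration is not modeled in this sketch.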

Notes

  1. http://www.parkes.atnf.csiro.au/

  2. http://astronomy.swin.edu.au/supercomputing/

  3. http://www.atnf.csiro.au/

  4. http://astronomy.swin.edu.au/pulsar/?topic=apsr

  5. Bandwidth is another common kind of resource in the cloud. In [1], the authors state that the cost-effective way of doing science in the cloud is to upload all the application data to cloud storage and run all the applications with cloud services. We therefore assume that scientists upload all the original data to the cloud to conduct their experiments. Because transferring data within one cloud service provider's facilities is usually free, the data transfer cost of managing the application datasets is not counted. In [15], the authors discuss the scenario of running scientific applications across different cloud service providers.

  6. The prices may fluctuate from time to time according to market factors.

  7. Amazon cloud services offer different CPU instances at different prices; using more expensive, higher-performance instances reduces computation time. This creates a trade-off between time and cost [34], which is different from the trade-off between computation and storage and is therefore outside this chapter's scope.

  8. http://www.vmware.com/

  9. http://hadoop.apache.org/

References

  1. Deelman, E., G. Singh, M. Livny, B. Berriman, and J. Good. The Cost of Doing Science on the Cloud: the Montage Example. in ACM/IEEE Conference on Supercomputing (SC’08). pp. 1–12. 2008. Austin, Texas, USA.

  2. Ludascher, B., I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, and E.A. Lee, Scientific Workflow Management and the Kepler System. Concurrency and Computation: Practice and Experience, 2005. 18(10): pp. 1039–1065.

  3. Szalay, A.S. and J. Gray, Science in an Exponential World. Nature, 2006. 440: pp. 23–24.

  4. Deelman, E., D. Gannon, M. Shields, and I. Taylor, Workflows and e-Science: An Overview of Workflow System Features and Capabilities. Future Generation Computer Systems, 2009. 25(5): pp. 528–540.

  5. Bose, R. and J. Frew, Lineage Retrieval for Scientific Data Processing: A Survey. ACM Computing Surveys, 2005. 37(1): pp. 1–28.

  6. Burton, A. and A. Treloar. Publish My Data: A Composition of Services from ANDS and ARCS. in 5th IEEE International Conference on e-Science, (e-Science ’09) pp. 164–170. 2009. Oxford, UK.

  7. Foster, I., Z. Yong, I. Raicu, and S. Lu. Cloud Computing and Grid Computing 360-Degree Compared. in Grid Computing Environments Workshop (GCE’08). pp. 1–10. 2008. Austin, Texas, USA.

  8. Buyya, R., C.S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, Cloud Computing and Emerging IT Platforms: Vision, Hype, and Reality for Delivering Computing as the 5th Utility. Future Generation Computer Systems, 2009. 25(6): pp. 599–616.

  9. Amazon Cloud Services: http://aws.amazon.com/.

  10. Zaharia, M., A. Konwinski, A.D. Joseph, R. Katz, and I. Stoica. Improving MapReduce Performance in Heterogeneous Environments. in 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI’2008). pp. 29–42. 2008. San Diego, CA, USA.

  11. Adams, I., D.D.E. Long, E.L. Miller, S. Pasupathy, and M.W. Storer. Maximizing Efficiency by Trading Storage for Computation. in Workshop on Hot Topics in Cloud Computing (HotCloud’09). pp. 1–5. 2009. San Diego, CA, USA.

  12. Yuan, D., Y. Yang, X. Liu, and J. Chen. A Cost-Effective Strategy for Intermediate Data Storage in Scientific Cloud Workflows. in 24th IEEE International Parallel & Distributed Processing Symposium (IPDPS’10). pp. 1–12. 2010. Atlanta, Georgia, USA.

  13. Yuan, D., Y. Yang, X. Liu, G. Zhang, and J. Chen, A Data Dependency Based Strategy for Intermediate Data Storage in Scientific Cloud Workflow Systems. Concurrency and Computation: Practice and Experience, 2010. (http://dx.doi.org/10.1002/cpe.1636)

  14. Yuan, D., Y. Yang, X. Liu, and J. Chen. A Local-Optimisation based Strategy for Cost-Effective Datasets Storage of Scientific Applications in the Cloud. in 4th IEEE International Conference on Cloud Computing (Cloud2011). pp. 1–8. 2011. Washington DC, USA.

  15. Yuan, D., Y. Yang, X. Liu, and J. Chen, On-demand Minimum Cost Benchmarking for Intermediate Datasets Storage in Scientific Cloud Workflow Systems. Journal of Parallel and Distributed Computing, 2011. 72(2): pp. 316–332.

  16. Chiba, T., T. Kielmann, M.d. Burger, and S. Matsuoka. Dynamic Load-Balanced Multicast for Data-Intensive Applications on Clouds. in IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid2010). pp. 5–14. 2010. Melbourne, Australia.

  17. Juve, G., E. Deelman, K. Vahi, and G. Mehta. Data Sharing Options for Scientific Workflows on Amazon EC2. in ACM/IEEE Conference on Supercomputing (SC’10). pp. 1–9. 2010. New Orleans, Louisiana, USA.

  18. Li, J., M. Humphrey, D. Agarwal, K. Jackson, C.v. Ingen, and Y. Ryu. eScience in the Cloud: A MODIS Satellite Data Reprojection and Reduction Pipeline in the Windows Azure Platform. in 24th IEEE International Parallel & Distributed Processing Symposium (IPDPS’10). pp. 1–12. 2010. Atlanta, Georgia, USA.

  19. Yuan, D., Y. Yang, X. Liu, and J. Chen, A Data Placement Strategy in Scientific Cloud Workflows. Future Generation Computer Systems, 2010. 26(8): pp. 1200–1214.

  20. Eucalyptus. Available from: http://open.eucalyptus.com/.

  21. Nimbus. Available from: http://www.nimbusproject.org/.

  22. OpenNebula. Available from: http://www.opennebula.org/.

  23. Armbrust, M., A. Fox, R. Griffith, A.D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, A View of Cloud Computing. Commun. ACM, 2010. 53(4): pp. 50–58.

  24. Assuncao, M.D.d., A.d. Costanzo, and R. Buyya. Evaluating the Cost-Benefit of Using Cloud Computing to Extend the Capacity of Clusters. in 18th ACM International Symposium on High Performance Distributed Computing (HPDC’09). pp. 1–10. 2009. Garching, Germany.

  25. Kondo, D., B. Javadi, P. Malecot, F. Cappello, and D.P. Anderson. Cost-Benefit Analysis of Cloud Computing versus Desktop Grids. in 23rd IEEE International Parallel & Distributed Processing Symposium (IPDPS’09). pp. 1–12. 2009. Rome, Italy.

  26. Cho, B. and I. Gupta. New Algorithms for Planning Bulk Transfer via Internet and Shipping Networks. in IEEE 30th International Conference on Distributed Computing Systems (ICDCS). pp. 305–314. 2010. Genova, Italy.

  27. Gunda, P.K., L. Ravindranath, C.A. Thekkath, Y. Yu, and L. Zhuang. Nectar: Automatic Management of Data and Computation in Datacenters. in 9th Symposium on Operating Systems Design and Implementation (OSDI’2010). pp. 1–14. 2010. Vancouver, Canada.

  28. Bao, Z., S. Cohen-Boulakia, S.B. Davidson, A. Eyal, and S. Khanna. Differencing Provenance in Scientific Workflows. in 25th IEEE International Conference on Data Engineering (ICDE’09). pp. 808–819. 2009. Shanghai, China.

  29. Groth, P. and L. Moreau, Recording Process Documentation for Provenance. IEEE Transactions on Parallel and Distributed Systems, 2009. 20(9): pp. 1246–1259.

  30. Muniswamy-Reddy, K.-K., P. Macko, and M. Seltzer. Provenance for the Cloud. in 8th USENIX Conference on File and Storage Technology (FAST’10). pp. 197–210. 2010. San Jose, CA, USA.

  31. Osterweil, L.J., L.A. Clarke, A.M. Ellison, R. Podorozhny, A. Wise, E. Boose, and J. Hadley. Experience in Using A Process Language to Define Scientific Workflow and Generate Dataset Provenance. in 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering. pp. 319–329. 2008. Atlanta, Georgia: ACM.

  32. Foster, I., J. Vockler, M. Wilde, and Z. Yong. Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation. in 14th International Conference on Scientific and Statistical Database Management, (SSDBM’02). pp. 37–46. 2002. Edinburgh, Scotland, UK.

  33. Simmhan, Y.L., B. Plale, and D. Gannon, A Survey of Data Provenance in E-Science. SIGMOD Rec., 2005. 34(3): pp. 31–36.

  34. Garg, S.K., R. Buyya, and H.J. Siegel, Time and Cost Trade-Off Management for Scheduling Parallel Applications on Utility Grids. Future Generation Computer Systems, 2010. 26(8): pp. 1344–1355.

  35. Liu, X., D. Yuan, G. Zhang, J. Chen, and Y. Yang, SwinDeW-C: A Peer-to-Peer Based Cloud Workflow System, in Handbook of Cloud Computing, B. Furht and A. Escalante, Editors. 2010, Springer. pp. 309–332.

  36. Yang, Y., K. Liu, J. Chen, J. Lignier, and H. Jin. Peer-to-Peer Based Grid Workflow Runtime Environment of SwinDeW-G. in IEEE International Conference on e-Science and Grid Computing. pp. 51–58. 2007. Bangalore, India.

Corresponding author

Correspondence to Dong Yuan.

Copyright information

© 2011 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Yuan, D., Yang, Y., Liu, X., Chen, J. (2011). Computation and Storage Trade-Off for Cost-Effectively Storing Scientific Datasets in the Cloud. In: Furht, B., Escalante, A. (eds) Handbook of Data Intensive Computing. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1415-5_5

  • DOI: https://doi.org/10.1007/978-1-4614-1415-5_5

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-1414-8

  • Online ISBN: 978-1-4614-1415-5

  • eBook Packages: Computer Science (R0)
