Computation and Storage Trade-Off for Cost-Effectively Storing Scientific Datasets in the Cloud
Scientific applications are usually data intensive [1, 2], and the datasets they generate are often terabytes or even petabytes in size. As reported by Szalay and Gray, science is in an exponential world, and the amount of scientific data will double every year over the next decade and beyond. Producing scientific datasets involves a large number of computation-intensive tasks, e.g., with scientific workflows, and hence takes a long time to execute. These generated datasets contain important intermediate or final results of the computation and need to be stored as valuable resources, for two reasons: (1) data can be reused: scientists may need to re-analyze the results or apply new analyses to the existing datasets; (2) data can be shared: for collaboration, the computation results may be shared, so the datasets are used by scientists from different institutions. Storing valuable generated application datasets saves their regeneration cost when they are reused, as well as the waiting time caused by regeneration. However, the large size of scientific datasets makes storing them a major challenge.
- 1. Deelman, E., G. Singh, M. Livny, B. Berriman, and J. Good. The Cost of Doing Science on the Cloud: the Montage Example. in ACM/IEEE Conference on Supercomputing (SC’08). pp. 1–12. 2008. Austin, Texas, USA.
- 6. Burton, A. and A. Treloar. Publish My Data: A Composition of Services from ANDS and ARCS. in 5th IEEE International Conference on e-Science (e-Science ’09). pp. 164–170. 2009. Oxford, UK.
- 7. Foster, I., Z. Yong, I. Raicu, and S. Lu. Cloud Computing and Grid Computing 360-Degree Compared. in Grid Computing Environments Workshop (GCE’08). pp. 1–10. 2008. Austin, Texas, USA.
- 9. Amazon Cloud Services: http://aws.amazon.com/.
- 10. Zaharia, M., A. Konwinski, A.D. Joseph, R. Katz, and I. Stoica. Improving MapReduce Performance in Heterogeneous Environments. in 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI’2008). pp. 29–42. 2008. San Diego, CA, USA.
- 11. Adams, I., D.D.E. Long, E.L. Miller, S. Pasupathy, and M.W. Storer. Maximizing Efficiency by Trading Storage for Computation. in Workshop on Hot Topics in Cloud Computing (HotCloud’09). pp. 1–5. 2009. San Diego, CA, USA.
- 12. Yuan, D., Y. Yang, X. Liu, and J. Chen. A Cost-Effective Strategy for Intermediate Data Storage in Scientific Cloud Workflows. in 24th IEEE International Parallel & Distributed Processing Symposium (IPDPS’10). pp. 1–12. 2010. Atlanta, Georgia, USA.
- 13. Yuan, D., Y. Yang, X. Liu, G. Zhang, and J. Chen. A Data Dependency Based Strategy for Intermediate Data Storage in Scientific Cloud Workflow Systems. Concurrency and Computation: Practice and Experience, 2010. (http://dx.doi.org/10.1002/cpe.1636)
- 14. Yuan, D., Y. Yang, X. Liu, and J. Chen. A Local-Optimisation based Strategy for Cost-Effective Datasets Storage of Scientific Applications in the Cloud. in 4th IEEE International Conference on Cloud Computing (Cloud2011). pp. 1–8. 2011. Washington DC, USA.
- 16. Chiba, T., T. Kielmann, M.d. Burger, and S. Matsuoka. Dynamic Load-Balanced Multicast for Data-Intensive Applications on Clouds. in IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid2010). pp. 5–14. 2010. Melbourne, Australia.
- 17. Juve, G., E. Deelman, K. Vahi, and G. Mehta. Data Sharing Options for Scientific Workflows on Amazon EC2. in ACM/IEEE Conference on Supercomputing (SC’10). pp. 1–9. 2010. New Orleans, Louisiana, USA.
- 18. Li, J., M. Humphrey, D. Agarwal, K. Jackson, C.v. Ingen, and Y. Ryu. eScience in the Cloud: A MODIS Satellite Data Reprojection and Reduction Pipeline in the Windows Azure Platform. in 24th IEEE International Parallel & Distributed Processing Symposium (IPDPS’10). pp. 1–12. 2010. Atlanta, Georgia, USA.
- 20. Eucalyptus. Available from: http://open.eucalyptus.com/.
- 21. Nimbus. Available from: http://www.nimbusproject.org/.
- 22. OpenNebula. Available from: http://www.opennebula.org/.
- 24. Assuncao, M.D.d., A.d. Costanzo, and R. Buyya. Evaluating the Cost-Benefit of Using Cloud Computing to Extend the Capacity of Clusters. in 18th ACM International Symposium on High Performance Distributed Computing (HPDC’09). pp. 1–10. 2009. Garching, Germany.
- 25. Kondo, D., B. Javadi, P. Malecot, F. Cappello, and D.P. Anderson. Cost-Benefit Analysis of Cloud Computing versus Desktop Grids. in 23rd IEEE International Parallel & Distributed Processing Symposium (IPDPS’09). pp. 1–12. 2009. Rome, Italy.
- 26. Cho, B. and I. Gupta. New Algorithms for Planning Bulk Transfer via Internet and Shipping Networks. in IEEE 30th International Conference on Distributed Computing Systems (ICDCS). pp. 305–314. 2010. Genova, Italy.
- 27. Gunda, P.K., L. Ravindranath, C.A. Thekkath, Y. Yu, and L. Zhuang. Nectar: Automatic Management of Data and Computation in Datacenters. in 9th Symposium on Operating Systems Design and Implementation (OSDI’2010). pp. 1–14. 2010. Vancouver, Canada.
- 28. Bao, Z., S. Cohen-Boulakia, S.B. Davidson, A. Eyal, and S. Khanna. Differencing Provenance in Scientific Workflows. in 25th IEEE International Conference on Data Engineering (ICDE’09). pp. 808–819. 2009. Shanghai, China.
- 30. Muniswamy-Reddy, K.-K., P. Macko, and M. Seltzer. Provenance for the Cloud. in 8th USENIX Conference on File and Storage Technology (FAST’10). pp. 197–210. 2010. San Jose, CA, USA.
- 31. Osterweil, L.J., L.A. Clarke, A.M. Ellison, R. Podorozhny, A. Wise, E. Boose, and J. Hadley. Experience in Using A Process Language to Define Scientific Workflow and Generate Dataset Provenance. in 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering. pp. 319–329. 2008. Atlanta, Georgia: ACM.
- 32. Foster, I., J. Vockler, M. Wilde, and Z. Yong. Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation. in 14th International Conference on Scientific and Statistical Database Management (SSDBM’02). pp. 37–46. 2002. Edinburgh, Scotland, UK.
- 35. Liu, X., D. Yuan, G. Zhang, J. Chen, and Y. Yang, SwinDeW-C: A Peer-to-Peer Based Cloud Workflow System, in Handbook of Cloud Computing, B. Furht and A. Escalante, Editors. 2010, Springer. pp. 309–332.
- 36. Yang, Y., K. Liu, J. Chen, J. Lignier, and H. Jin. Peer-to-Peer Based Grid Workflow Runtime Environment of SwinDeW-G. in IEEE International Conference on e-Science and Grid Computing. pp. 51–58. 2007. Bangalore, India.