Abstract
Scientific applications are usually data intensive [1, 2], and the datasets they generate are often terabytes or even petabytes in size. As reported by Szalay and Gray in [3], science is in an exponential world, and the amount of scientific data will double every year over the next decade and beyond. Producing scientific datasets involves a large number of computation-intensive tasks, e.g., with scientific workflows [4], and hence takes a long time to execute. These generated datasets contain important intermediate or final results of the computation and need to be stored as valuable resources, for two reasons: (1) data can be reused: scientists may need to re-analyze the results or apply new analyses to the existing datasets [5]; (2) data can be shared: for collaboration, the computation results may be shared, so the datasets are used by scientists from different institutions [6]. Storing valuable generated application datasets saves their regeneration cost when they are reused, as well as the waiting time caused by regeneration. However, the large size of scientific datasets makes storing them a major challenge.
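The store-or-regenerate reasoning above can be illustrated with a minimal sketch. It is not the chapter's strategy: it assumes a single dataset that can be regenerated on its own (dependencies on other deleted datasets are ignored), and the function names, prices and usage figures are hypothetical values chosen only for illustration.

```python
def monthly_costs(size_gb, storage_price_per_gb_month,
                  regen_cpu_hours, cpu_price_per_hour, uses_per_month):
    """Return (storage_cost, regeneration_cost) per month for one dataset.

    Simplifying assumption: the dataset is regenerated once per use if it
    is not stored, and regeneration needs no other deleted datasets.
    """
    storage_cost = size_gb * storage_price_per_gb_month
    regeneration_cost = regen_cpu_hours * cpu_price_per_hour * uses_per_month
    return storage_cost, regeneration_cost


def should_store(size_gb, storage_price_per_gb_month,
                 regen_cpu_hours, cpu_price_per_hour, uses_per_month):
    """True if storing the dataset is no more expensive than regenerating it."""
    storage_cost, regeneration_cost = monthly_costs(
        size_gb, storage_price_per_gb_month,
        regen_cpu_hours, cpu_price_per_hour, uses_per_month)
    return storage_cost <= regeneration_cost


# Hypothetical prices: 0.15 per GB per month for storage, 0.10 per CPU hour.
rarely_used = should_store(100, 0.15, 20, 0.10, uses_per_month=2)
often_used = should_store(100, 0.15, 20, 0.10, uses_per_month=10)
print(rarely_used, often_used)
```

Under these assumed figures, a 100 GB dataset that takes 20 CPU hours to regenerate is cheaper to delete when it is reused twice a month and cheaper to keep when it is reused ten times a month, which is the trade-off between computation and storage in its simplest form.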
Notes
- 1.
- 2.
- 3.
- 4.
- 5.
Bandwidth is another common kind of resource in the cloud. In [1], the authors state that the cost-effective way of doing science in the cloud is to upload all the application data to cloud storage and run all the applications with cloud services. We therefore assume that scientists upload all the original data to the cloud to conduct their experiments. Because transferring data within one cloud service provider's facilities is usually free, the data transfer cost of managing the application datasets is not counted. The scenario of running scientific applications across different cloud service providers is discussed in [15].
- 6.
The prices may fluctuate from time to time according to market factors.
- 7.
Amazon cloud services offer different CPU instances at different prices, and using more expensive CPU instances with higher performance reduces computation time. There is thus a trade-off between time and cost [34], which is different from the trade-off between computation and storage and hence is outside this chapter's scope.
- 8.
- 9.
References
Deelman, E., G. Singh, M. Livny, B. Berriman, and J. Good. The Cost of Doing Science on the Cloud: the Montage Example. in ACM/IEEE Conference on Supercomputing (SC’08). pp. 1–12. 2008. Austin, Texas, USA.
Ludascher, B., I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, and E.A. Lee, Scientific Workflow Management and the Kepler System. Concurrency and Computation: Practice and Experience, 2005. 18(10): pp. 1039–1065.
Szalay, A.S. and J. Gray, Science in an Exponential World. Nature, 2006. 440: pp. 23–24.
Deelman, E., D. Gannon, M. Shields, and I. Taylor, Workflows and e-Science: An Overview of Workflow System Features and Capabilities. Future Generation Computer Systems, 2009. 25(5): pp. 528–540.
Bose, R. and J. Frew, Lineage Retrieval for Scientific Data Processing: A Survey. ACM Computing Surveys, 2005. 37(1): pp. 1–28.
Burton, A. and A. Treloar. Publish My Data: A Composition of Services from ANDS and ARCS. in 5th IEEE International Conference on e-Science, (e-Science ’09) pp. 164–170. 2009. Oxford, UK.
Foster, I., Z. Yong, I. Raicu, and S. Lu. Cloud Computing and Grid Computing 360-Degree Compared. in Grid Computing Environments Workshop (GCE’08). pp. 1–10. 2008. Austin, Texas, USA.
Buyya, R., C.S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, Cloud Computing and Emerging IT Platforms: Vision, Hype, and Reality for Delivering Computing as the 5th Utility. Future Generation Computer Systems, 2009. 25(6): pp. 599–616.
Amazon Cloud Services: http://aws.amazon.com/.
Zaharia, M., A. Konwinski, A.D. Joseph, R. Katz, and I. Stoica. Improving MapReduce Performance in Heterogeneous Environments. in 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI’2008). pp. 29–42. 2008. San Diego, CA, USA.
Adams, I., D.D.E. Long, E.L. Miller, S. Pasupathy, and M.W. Storer. Maximizing Efficiency by Trading Storage for Computation. in Workshop on Hot Topics in Cloud Computing (HotCloud’09). pp. 1–5. 2009. San Diego, CA, USA.
Yuan, D., Y. Yang, X. Liu, and J. Chen. A Cost-Effective Strategy for Intermediate Data Storage in Scientific Cloud Workflows. in 24th IEEE International Parallel & Distributed Processing Symposium (IPDPS’10). pp. 1–12. 2010. Atlanta, Georgia, USA.
Yuan, D., Y. Yang, X. Liu, G. Zhang, and J. Chen, A Data Dependency Based Strategy for Intermediate Data Storage in Scientific Cloud Workflow Systems. Concurrency and Computation: Practice and Experience, 2010. (http://dx.doi.org/10.1002/cpe.1636)
Yuan, D., Y. Yang, X. Liu, and J. Chen. A Local-Optimisation based Strategy for Cost-Effective Datasets Storage of Scientific Applications in the Cloud. in 4th IEEE International Conference on Cloud Computing (Cloud2011). pp. 1–8. 2011. Washington DC, USA.
Yuan, D., Y. Yang, X. Liu, and J. Chen, On-demand Minimum Cost Benchmarking for Intermediate Datasets Storage in Scientific Cloud Workflow Systems. Journal of Parallel and Distributed Computing, 2011. 72(2): pp. 316–332.
Chiba, T., T. Kielmann, M.d. Burger, and S. Matsuoka. Dynamic Load-Balanced Multicast for Data-Intensive Applications on Clouds. in IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid2010). pp. 5–14. 2010. Melbourne, Australia.
Juve, G., E. Deelman, K. Vahi, and G. Mehta. Data Sharing Options for Scientific Workflows on Amazon EC2. in ACM/IEEE Conference on Supercomputing (SC’10). pp. 1–9. 2010. New Orleans, Louisiana, USA.
Li, J., M. Humphrey, D. Agarwal, K. Jackson, C.v. Ingen, and Y. Ryu. eScience in the Cloud: A MODIS Satellite Data Reprojection and Reduction Pipeline in the Windows Azure Platform. in 24th IEEE International Parallel & Distributed Processing Symposium (IPDPS’10). pp. 1–12. 2010. Atlanta, Georgia, USA.
Yuan, D., Y. Yang, X. Liu, and J. Chen, A Data Placement Strategy in Scientific Cloud Workflows. Future Generation Computer Systems, 2010. 26(8): pp. 1200–1214.
Eucalyptus. Available from: http://open.eucalyptus.com/.
Nimbus. Available from: http://www.nimbusproject.org/.
OpenNebula. Available from: http://www.opennebula.org/.
Armbrust, M., A. Fox, R. Griffith, A.D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, A View of Cloud Computing. Commun. ACM, 2010. 53(4): pp. 50–58.
Assuncao, M.D.d., A.d. Costanzo, and R. Buyya. Evaluating the Cost-Benefit of Using Cloud Computing to Extend the Capacity of Clusters. in 18th ACM International Symposium on High Performance Distributed Computing (HPDC’09). pp. 1–10. 2009. Garching, Germany.
Kondo, D., B. Javadi, P. Malecot, F. Cappello, and D.P. Anderson. Cost-Benefit Analysis of Cloud Computing versus Desktop Grids. in 23rd IEEE International Parallel & Distributed Processing Symposium (IPDPS’09). pp. 1–12. 2009. Rome, Italy.
Cho, B. and I. Gupta. New Algorithms for Planning Bulk Transfer via Internet and Shipping Networks. in IEEE 30th International Conference on Distributed Computing Systems (ICDCS). pp. 305–314. 2010. Genova, Italy.
Gunda, P.K., L. Ravindranath, C.A. Thekkath, Y. Yu, and L. Zhuang. Nectar: Automatic Management of Data and Computation in Datacenters. in 9th Symposium on Operating Systems Design and Implementation (OSDI’2010). pp. 1–14. 2010. Vancouver, Canada.
Bao, Z., S. Cohen-Boulakia, S.B. Davidson, A. Eyal, and S. Khanna. Differencing Provenance in Scientific Workflows. in 25th IEEE International Conference on Data Engineering (ICDE’09). pp. 808–819. 2009. Shanghai, China.
Groth, P. and L. Moreau, Recording Process Documentation for Provenance. IEEE Transactions on Parallel and Distributed Systems, 2009. 20(9): pp. 1246–1259.
Muniswamy-Reddy, K.-K., P. Macko, and M. Seltzer. Provenance for the Cloud. in 8th USENIX Conference on File and Storage Technology (FAST’10). pp. 197–210. 2010. San Jose, CA, USA.
Osterweil, L.J., L.A. Clarke, A.M. Ellison, R. Podorozhny, A. Wise, E. Boose, and J. Hadley. Experience in Using A Process Language to Define Scientific Workflow and Generate Dataset Provenance. in 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering. pp. 319–329. 2008. Atlanta, Georgia: ACM.
Foster, I., J. Vockler, M. Wilde, and Z. Yong. Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation. in 14th International Conference on Scientific and Statistical Database Management, (SSDBM’02). pp. 37–46. 2002. Edinburgh, Scotland, UK.
Simmhan, Y.L., B. Plale, and D. Gannon, A Survey of Data Provenance in E-Science. SIGMOD Rec., 2005. 34(3): pp. 31–36.
Garg, S.K., R. Buyya, and H.J. Siegel, Time and Cost Trade-Off Management for Scheduling Parallel Applications on Utility Grids. Future Generation Computer Systems, 2010. 26(8): pp. 1344–1355.
Liu, X., D. Yuan, G. Zhang, J. Chen, and Y. Yang, SwinDeW-C: A Peer-to-Peer Based Cloud Workflow System, in Handbook of Cloud Computing, B. Furht and A. Escalante, Editors. 2010, Springer. pp. 309–332.
Yang, Y., K. Liu, J. Chen, J. Lignier, and H. Jin. Peer-to-Peer Based Grid Workflow Runtime Environment of SwinDeW-G. in IEEE International Conference on e-Science and Grid Computing. pp. 51–58. 2007. Bangalore, India.
© 2011 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Yuan, D., Yang, Y., Liu, X., Chen, J. (2011). Computation and Storage Trade-Off for Cost-Effectively Storing Scientific Datasets in the Cloud. In: Furht, B., Escalante, A. (eds) Handbook of Data Intensive Computing. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1415-5_5
DOI: https://doi.org/10.1007/978-1-4614-1415-5_5
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-1414-8
Online ISBN: 978-1-4614-1415-5
eBook Packages: Computer Science (R0)