Computation and Storage Trade-Off for Cost-Effectively Storing Scientific Datasets in the Cloud

Yuan, Dong; Yang, Yun; Liu, Xiao; Chen, Jinjun

doi:10.1007/978-1-4614-1415-5_5

Computation and Storage Trade-Off for Cost-Effectively Storing Scientific Datasets in the Cloud

Dong Yuan³,
Yun Yang³,
Xiao Liu³ &
…
Jinjun Chen⁴

Chapter
First Online: 11 November 2011

1578 Accesses
3 Citations

Abstract

Scientific applications are usually data intensive [1,~ 2], where the generated datasets are often terabytes or even petabytes in size. As reported by Szalay and Gray in [3], science is in an exponential world and the amount of scientific data will double every year over the next decade and future. Producing scientific datasets involves large number of computation intensive tasks, e.g., with scientific workflows [4], hence taking a long time for execution. These generated datasets contain important intermediate or final results of the computation, and need to be stored as valuable resources. This is because: (1) data can be reused – scientists may need to re-analyze the results or apply new analyses on the existing datasets [5]; (2) data can be shared – for collaboration, the computation results may be shared, hence the datasets are used by scientists from different institutions [6]. Storing valuable generated application datasets can save their regeneration cost when they are reused, not to mention the waiting time caused by regeneration. However, the large size of the scientific datasets is a big challenge for their storage.

This is a preview of subscription content, log in via an institution.

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 89.00; Price excludes VAT (USA)

Softcover Book: USD 119.99; Price excludes VAT (USA)

Hardcover Book: USD 169.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Notes

1.
http://www.parkes.atnf.csiro.au/
2.
http://astronomy.swin.edu.au/supercomputing/
3.
http://www.atnf.csiro.au/
4.
http://astronomy.swin.edu.au/pulsar/?topic=apsr
5.
Bandwidth is another common kind of resource in the cloud. In [1], the authors state that the cost-effective way of doing science in the cloud is to upload all the application data to the cloud storage and run all the applications with the cloud services. So we assume that the scientists upload all the original data to the cloud to conduct their experiments. Because transferring data within one cloud service provider's facilities is usually free, the data transfer cost of managing the application datasets is not counted. In [15], the authors discussed the scenario of running scientific applications among different cloud service providers.
6.
The prices may fluctuate from time to time according to market factors.
7.
Amazon cloud service offers different CPU instances with different prices, where using expensive CPU instances with higher performance would reduce computation time. There exists a trade-off of time and cost [34], which is different with the trade-off of computation and storage, hence is out of this chapter's scope.
8.
http://www.vmware.com/
9.
http://hadoop.apache.org/

References

Deelman, E., G. Singh, M. Livny, B. Berriman, and J. Good. The Cost of Doing Science on the Cloud: the Montage Example. in ACM/IEEE Conference on Supercomputing (SC’08). pp. 1–12. 2008. Austin, Texas, USA.
Google Scholar
Ludascher, B., I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, and E.A. Lee, Scientific Workflow Management and the Kepler System. Concurrency and Computation: Practice and Experience, 2005. 18(10): pp. 1039–1065.
Article Google Scholar
Szalay, A.S. and J. Gray, Science in an Exponential World. Nature, 2006. 440: pp. 23–24.
Article Google Scholar
Deelman, E., D. Gannon, M. Shields, and I. Taylor, Workflows and e-Science: An Overview of Workflow System Features and Capabilities. Future Generation Computer Systems, 2009. 25(5): pp. 528–540.
Article Google Scholar
Bose, R. and J. Frew, Lineage Retrieval for Scientific Data Processing: A Survey. ACM Computing Survey, 2005. 37(1): pp. 1–28.
Article Google Scholar
Burton, A. and A. Treloar. Publish My Data: A Composition of Services from ANDS and ARCS. in 5^th IEEE International Conference on e-Science, (e-Science ’09) pp. 164–170. 2009. Oxford, UK.
Google Scholar
Foster, I., Z. Yong, I. Raicu, and S. Lu. Cloud Computing and Grid Computing 360-Degree Compared. in Grid Computing Environments Workshop (GCE’08). pp. 1–10. 2008. Austin, Texas, USA.
Google Scholar
Buyya, R., C.S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, Cloud Computing and Emerging IT Platforms: Vision, Hype, and Reality for Delivering Computing as the 5th Utility. Future Generation Computer Systems, 2009. 25(6): pp. 599–616.
Article Google Scholar
Amazon Cloud Services: http://aws.amazon.com/.
Zaharia, M., A. Konwinski, A.D. Joseph, R. Katz, and I. Stoica. Improving MapReduce Performance in Heterogeneous Environments. in 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI’2008). pp. 29–42. 2008. San Diego, CA, USA.
Google Scholar
Adams, I., D.D.E. Long, E.L. Miller, S. Pasupathy, and M.W. Storer. Maximizing Efficiency by Trading Storage for Computation. in Workshop on Hot Topics in Cloud Computing (HotCloud’09). pp. 1–5. 2009. San Diego, CA, USA.
Google Scholar
Yuan, D., Y. Yang, X. Liu, and J. Chen. A Cost-Effective Strategy for Intermediate Data Storage in Scientific Cloud Workflows. in 24th IEEE International Parallel & Distributed Processing Symposium (IPDPS’10). pp. 1–12. 2010. Atlanta, Georgia, USA.
Google Scholar
Yuan, D., Y. Yang, X. Liu, G. Zhang, and J. Chen, A Data Dependency Based Strategy for Intermediate Data Storage in Scientific Cloud Workflow Systems. Concurrency and Computation: Practice and Experience, 2010. (http://dx.doi.org/10.1002/cpe.1636)
Yuan, D., Y. Yang, X. Liu, and J. Chen. A Local-Optimisation based Strategy for Cost-Effective Datasets Storage of Scientific Applications in the Cloud. in 4th IEEE International Conference on Cloud Computing (Cloud2011). pp. 1–8. 2011. Washington DC, USA.
Google Scholar
Yuan, D., Y. Yun, X. Liu, and J. Chen, On-demand Minimum Cost Benchmarking for Intermediate Datasets Storage in Scientific Cloud Workflow Systems. Journal of Parallel and Distributed Computing, 2011. 72(2): pp. 316–332.
Article Google Scholar
Chiba, T., T. Kielmann, M.d. Burger, and S. Matsuoka. Dynamic Load-Balanced Multicast for Data-Intensive Applications on Clouds. in IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid2010). pp. 5–14. 2010. Melbourne, Australia.
Google Scholar
Juve, G., E. Deelman, K. Vahi, and G. Mehta. Data Sharing Options for Scientific Workflows on Amazon EC2. in ACM/IEEE Conference on Supercomputing (SC’10). pp. 1–9. 2010. New Orleans, Louisiana, USA.
Google Scholar
Li, J., M. Humphrey, D. Agarwal, K. Jackson, C.v. Ingen, and Y. Ryu. eScience in the Cloud: A MODIS Satellite Data Reprojection and Reduction Pipeline in the Windows Azure Platform. in 24th IEEE International Parallel & Distributed Processing Symposium (IPDPS’10). pp. 1–12. 2010. Atlanta, Georgia, USA.
Google Scholar
Yuan, D., Y. Yang, X. Liu, and J. Chen, A Data Placement Strategy in Scientific Cloud Workflows. Future Generation Computer Systems, 2010. 26(8): pp. 1200–1214.
Article Google Scholar
Eucalyptus. Available from: http://open.eucalyptus.com/.
Nimbus. Available from: http://www.nimbusproject.org/.
OpenNebula. Available from: http://www.opennebula.org/.
Armbrust, M., A. Fox, R. Griffith, A.D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, A View of Cloud Computing. Commun. ACM, 2010. 53(4): pp. 50–58.
Article Google Scholar
Assuncao, M.D.d., A.d. Costanzo, and R. Buyya. Evaluating the Cost-Benefit of Using Cloud Computing to Extend the Capacity of Clusters. in 18th ACM International Symposium on High Performance Distributed Computing (HPDC’09). pp. 1–10. 2009. Garching, Germany.
Google Scholar
Kondo, D., B. Javadi, P. Malecot, F. Cappello, and D.P. Anderson. Cost-Benefit Analysis of Cloud Computing versus Desktop Grids. in 23th IEEE International Parallel & Distributed Processing Symposium (IPDPS’09). pp. 1–12. 2009. Rome, Italy.
Google Scholar
Cho, B. and I. Gupta. New Algorithms for Planning Bulk Transfer via Internet and Shipping Networks. in IEEE 30th International Conference on Distributed Computing Systems (ICDCS). pp. 305–314. 2010. Genova, Italy.
Google Scholar
Gunda, P.K., L. Ravindranath, C.A. Thekkath, Y. Yu, and L. Zhuang. Nectar: Automatic Management of Data and Computation in Datacenters. in 9th Symposium on Operating Systems Design and Implementation (OSDI’2010). pp. 1–14. 2010, Vancouver, Canada.
Google Scholar
Bao, Z., S. Cohen-Boulakia, S.B. Davidson, A. Eyal, and S. Khanna. Differencing Provenance in Scientific Workflows. in 25th IEEE International Conference on Data Engineering (ICDE’09). pp. 808–819. 2009. Shanghai, China.
Google Scholar
Groth, P. and L. Moreau, Recording Process Documentation for Provenance. IEEE Transactions on Parallel and Distributed Systems, 2009. 20(9): pp. 1246–1259.
Article Google Scholar
Muniswamy-Reddy, K.-K., P. Macko, and M. Seltzer. Provenance for the Cloud. in 8th USENIX Conference on File and Storage Technology (FAST’10). pp. 197–210. 2010. San Jose, CA, USA.
Google Scholar
Osterweil, L.J., L.A. Clarke, A.M. Ellison, R. Podorozhny, A. Wise, E. Boose, and J. Hadley. Experience in Using A Process Language to Define Scientific Workflow and Generate Dataset Provenance. in 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering. pp. 319–329. 2008. Atlanta, Georgia: ACM.
Google Scholar
Foster, I., J. Vockler, M. Wilde, and Z. Yong. Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation. in 14th International Conference on Scientific and Statistical Database Management, (SSDBM’02). pp. 37–46. 2002. Edinburgh, Scotland, UK.
Google Scholar
Simmhan, Y.L., B. Plale, and D. Gannon, A Survey of Data Provenance in E-Science. SIGMOD Rec., 2005. 34(3): pp. 31–36.
Article Google Scholar
Garg, S.K., R. Buyya, and H.J. Siegel, Time and Cost Trade-Off Management for Scheduling Parallel Applications on Utility Grids. Future Generation Computer Systems, 2010. 26(8): pp. 1344–1355.
Article Google Scholar
Liu, X., D. Yuan, G. Zhang, J. Chen, and Y. Yang, SwinDeW-C: A Peer-to-Peer Based Cloud Workflow System, in Handbook of Cloud Computing, B. Furht and A. Escalante, Editors. 2010, Springer. pp. 309–332.
Google Scholar
Yang, Y., K. Liu, J. Chen, J. Lignier, and H. Jin. Peer-to-Peer Based Grid Workflow Runtime Environment of SwinDeW-G. in IEEE International Conference on e-Science and Grid Computing. pp. 51–58. 2007. Bangalore, India.
Google Scholar

Download references

Author information

Authors and Affiliations

Faculty of Information and Communication Technologies, Swinburne University of Technology, Melbourne, Australia
Dong Yuan, Yun Yang & Xiao Liu
Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia
Jinjun Chen

Authors

Dong Yuan
View author publications
You can also search for this author in PubMed Google Scholar
Yun Yang
View author publications
You can also search for this author in PubMed Google Scholar
Xiao Liu
View author publications
You can also search for this author in PubMed Google Scholar
Jinjun Chen
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Dong Yuan .

Editor information

Editors and Affiliations

Dept. of Computer Science & Engineering, Florida Atlantic University, Boca Raton, 33431, Florida, USA
Borko Furht
LexisNexis, Boca Raton, 33487, Florida, USA
Armando Escalante

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Yuan, D., Yang, Y., Liu, X., Chen, J. (2011). Computation and Storage Trade-Off for Cost-Effectively Storing Scientific Datasets in the Cloud. In: Furht, B., Escalante, A. (eds) Handbook of Data Intensive Computing. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1415-5_5

Download citation

DOI: https://doi.org/10.1007/978-1-4614-1415-5_5
Published: 11 November 2011
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-1414-8
Online ISBN: 978-1-4614-1415-5
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics