Date: 11 Nov 2011

Computation and Storage Trade-Off for Cost-Effectively Storing Scientific Datasets in the Cloud

* Final gross prices may vary according to local VAT.

Get Access

Abstract

Scientific applications are usually data intensive [1,~ 2], where the generated datasets are often terabytes or even petabytes in size. As reported by Szalay and Gray in [3], science is in an exponential world and the amount of scientific data will double every year over the next decade and future. Producing scientific datasets involves large number of computation intensive tasks, e.g., with scientific workflows [4], hence taking a long time for execution. These generated datasets contain important intermediate or final results of the computation, and need to be stored as valuable resources. This is because: (1) data can be reused – scientists may need to re-analyze the results or apply new analyses on the existing datasets [5]; (2) data can be shared – for collaboration, the computation results may be shared, hence the datasets are used by scientists from different institutions [6]. Storing valuable generated application datasets can save their regeneration cost when they are reused, not to mention the waiting time caused by regeneration. However, the large size of the scientific datasets is a big challenge for their storage.