High Throughput Data-Compression for Cloud Storage

Nicolae, Bogdan

doi:10.1007/978-3-642-15108-8_1

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6265)

Included in the following conference series:

International Conference on Data Management in Grid and P2P Systems

Abstract

As the data volumes processed by large-scale, distributed, data-intensive applications grow rapidly, increasing I/O pressure is put on the underlying storage service, which is responsible for data management. One particularly difficult challenge for the storage service is to sustain high I/O throughput despite heavy access concurrency to massive data. Doing so requires massively parallel data transfers, which invariably lead to high bandwidth utilization. With the emergence of cloud computing, data-intensive applications have become attractive to a wide public that lacks the resources to maintain the expensive large-scale distributed infrastructures needed to run them. In this context, minimizing storage space and bandwidth utilization is highly relevant, because these resources are billed according to consumption. This paper evaluates the trade-off that results from transparently applying data compression to conserve storage space and bandwidth at the cost of a slight computational overhead. We aim to reduce storage space and bandwidth needs with minimal impact on I/O throughput under heavy access concurrency. Our solution builds on BlobSeer, a highly parallel distributed data management service specifically designed to enable reading, writing, and appending huge data sequences that are fragmented and distributed at a large scale. We demonstrate the benefits of our approach through extensive experiments on the Grid’5000 testbed.
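To make the trade-off described in the abstract concrete, the following is a minimal sketch of transparent chunk compression. It is not the paper's implementation and does not use BlobSeer. It assumes Python's standard zlib module as a stand-in compressor, and the one-byte header and the names pack_chunk, unpack_chunk, and measure are hypothetical, introduced only for illustration. Each chunk is compressed before it would be stored or sent, kept raw when compression does not shrink it, and the size ratio and compression time are reported so that space savings can be weighed against CPU cost.

```python
import os
import time
import zlib

# Hypothetical one-byte header marking how a chunk was stored (an assumption
# made for this sketch, not a format taken from the paper).
RAW, DEFLATE = b"\x00", b"\x01"


def pack_chunk(chunk: bytes, level: int = 1) -> bytes:
    """Compress a chunk; keep it raw if compression does not shrink it."""
    packed = zlib.compress(chunk, level)
    if len(packed) < len(chunk):
        return DEFLATE + packed
    return RAW + chunk


def unpack_chunk(stored: bytes) -> bytes:
    """Reverse pack_chunk, so callers never see the compressed form."""
    tag, body = stored[:1], stored[1:]
    return zlib.decompress(body) if tag == DEFLATE else body


def measure(chunk: bytes) -> tuple[float, float]:
    """Return (stored size / original size, seconds spent packing)."""
    start = time.perf_counter()
    stored = pack_chunk(chunk)
    elapsed = time.perf_counter() - start
    return len(stored) / len(chunk), elapsed


if __name__ == "__main__":
    samples = {
        "repetitive text": b"sensor=42 status=ok\n" * 50_000,
        "random bytes": os.urandom(1_000_000),
    }
    for name, chunk in samples.items():
        ratio, seconds = measure(chunk)
        print(f"{name}: stored/original = {ratio:.3f}, "
              f"pack time = {seconds * 1000:.1f} ms")
```

Highly repetitive data should shrink substantially, while random data falls back to raw storage and costs only the single header byte plus the time spent attempting compression. That wasted attempt is the kind of overhead a storage service has to weigh against the space and bandwidth it saves.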

Author information

Authors and Affiliations

IRISA, University of Rennes 1, Rennes, France
Bogdan Nicolae

Editor information

Editors and Affiliations

Institut de Recherche en Informatique de Toulouse (IRIT), Paul Sabatier University, 118, route de Narbonne, 31062, Toulouse Cedex, France
Abdelkader Hameurlain
Institut de Recherche en Informatique de Toulouse (IRIT), Paul Sabatier University, 118, route de Narbonne, 31062, Toulouse Cedex, France
Franck Morvan
Institute of Software Technology and Interactive Systems, Vienna University of Technology, Favoritenstr. 9/188, 1040, Wien, Austria
A Min Tjoa

About this paper

Cite this paper

Nicolae, B. (2010). High Throughput Data-Compression for Cloud Storage. In: Hameurlain, A., Morvan, F., Tjoa, A.M. (eds) Data Management in Grid and Peer-to-Peer Systems. Globe 2010. Lecture Notes in Computer Science, vol 6265. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15108-8_1

DOI: https://doi.org/10.1007/978-3-642-15108-8_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15107-1
Online ISBN: 978-3-642-15108-8
eBook Packages: Computer Science, Computer Science (R0)
