The Journal of Supercomputing, Volume 71, Issue 5, pp 1736–1753

An approach of fast data manipulation in HDFS with supplementary mechanisms

Abstract

The Hadoop framework has been widely deployed on clusters of many kinds to build large, scalable, and powerful systems for massive data processing on commodity hardware. The Hadoop Distributed File System (HDFS), the distributed storage component of Hadoop, is responsible for managing vast amounts of data effectively in large clusters. To use Map/Reduce, Hadoop's parallel processing infrastructure, the traditional workflow must first upload data from local file systems to HDFS. Unfortunately, when dealing with massive data, the uploading procedure becomes extremely time-consuming, which causes almost intolerable delays for urgent tasks, along with unnecessary waste of space due to replicated data. The primary contribution of this paper is the proposal of Zput and its supplementary mechanism, Zport. After describing the implementation, we introduce several refinements that are significant for runtime efficiency and performance. Evaluation results show that Zput can accelerate the local data uploading procedure by over 315.4%, while Zport can boost remote block distribution by over 190.3%. In addition, compatibility with upper-layer applications remains intact.

Keywords

Metadata manipulation; Block replication and placement; Distributed file system; Checksum calculation; Data compression

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  2. The Second Research Laboratory, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
