HDFS Pipeline Reformation to Minimize the Data Loss

  • Conference paper

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 105)

Abstract

Hadoop is a popular framework designed to process very large data sets. Files in Hadoop are usually very large, ranging from gigabytes to terabytes, and large Hadoop clusters store millions of such files. HDFS writes data into blocks through a pipeline process: the NameNode sends a list of available DataNodes, and the pipeline is formed from the DataNodes that have empty blocks. The DataNode replacement policy applied when a DataNode fails during the pipeline process can be customized through configuration parameters. Under this policy, the write operation can resume even with fewer DataNodes, down to a single DataNode. In the single-DataNode case, a failure of that node loses the data, since only one copy exists. This paper addresses the single-DataNode write scenario by pausing the write operation until more DataNodes become available in the pipeline; a pause is worthwhile compared with losing valuable data if the lone DataNode fails while the write is in progress.
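
For context, the replacement behaviour the abstract refers to is exposed in stock HDFS through the dfs.client.block.write.replace-datanode-on-failure.* client settings. The sketch below, assuming a Hadoop 2.x+ client on the classpath and a hypothetical target path, shows how a writer can tighten these settings so the client insists on a replacement DataNode rather than continuing with a shrinking pipeline; it illustrates the configuration surface only, not the reformation scheme proposed in the paper.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StrictPipelineWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Ask the client to find a replacement DataNode when one in the
            // write pipeline fails, instead of silently shrinking the pipeline.
            conf.setBoolean(
                "dfs.client.block.write.replace-datanode-on-failure.enable", true);
            // ALWAYS: request a replacement on any failure. DEFAULT only does so
            // for larger pipelines; NEVER keeps writing with the survivors, which
            // is how a pipeline can shrink to the risky single-DataNode case.
            conf.set(
                "dfs.client.block.write.replace-datanode-on-failure.policy", "ALWAYS");
            // false: fail the write if no replacement DataNode can be found,
            // rather than proceeding with fewer copies.
            conf.setBoolean(
                "dfs.client.block.write.replace-datanode-on-failure.best-effort", false);

            try (FileSystem fs = FileSystem.get(conf);
                 FSDataOutputStream out = fs.create(new Path("/tmp/pipeline-demo.txt"))) {
                out.writeBytes("data written through the HDFS write pipeline\n");
            }
        }
    }

With ALWAYS and best-effort=false, a stock client throws an IOException when no replacement DataNode is available; the paper's proposal differs in that the write would instead pause and resume once DataNodes rejoin the pipeline.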

Author information


Corresponding author

Correspondence to B. Purnachandra Rao.



Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Purnachandra Rao, B., Nagamalleswara Rao, N. (2019). HDFS Pipeline Reformation to Minimize the Data Loss. In: Satapathy, S., Bhateja, V., Das, S. (eds) Smart Intelligent Computing and Applications. Smart Innovation, Systems and Technologies, vol 105. Springer, Singapore. https://doi.org/10.1007/978-981-13-1927-3_27


  • DOI: https://doi.org/10.1007/978-981-13-1927-3_27

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-1926-6

  • Online ISBN: 978-981-13-1927-3

  • eBook Packages: Engineering, Engineering (R0)
