Abstract
Hadoop is a popular framework designed to process very large data sets. Files in Hadoop are typically large, ranging from gigabytes to terabytes, and large Hadoop clusters store millions of such files. HDFS writes data into blocks through a pipeline process: the NameNode sends the client a list of available DataNodes, and the pipeline is built from the DataNodes that have free blocks. The DataNode replacement policy applied when a DataNode fails during the pipeline process can be customized through configuration parameters. Under the existing behaviour, the write operation resumes even when fewer DataNodes remain, down to a single DataNode. In the single-DataNode case the data is at risk, because only one copy of it exists. This paper addresses the single-DataNode situation in the write operation by pausing the write until enough DataNodes are again available for the pipeline; such a pause is worthwhile compared with losing valuable data if that last DataNode fails while the write is in progress.
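As context for the replacement policy mentioned above, the sketch below shows how an HDFS client can set the existing client-side parameters dfs.client.block.write.replace-datanode-on-failure.enable, .policy and .best-effort before writing a file. It is a minimal Java illustration of the standard Hadoop API, not the reformation proposed in this paper; the output path /tmp/pipeline-demo.txt is only an example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PipelineWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Keep three replicas per block (the HDFS default).
        conf.set("dfs.replication", "3");
        // Let the client replace a failed DataNode in the write pipeline.
        conf.set("dfs.client.block.write.replace-datanode-on-failure.enable", "true");
        // ALWAYS requests a replacement whenever a pipeline DataNode fails;
        // DEFAULT applies heuristics based on the replication factor and pipeline size.
        conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "ALWAYS");
        // With best-effort set to false, the write fails if no replacement DataNode is found,
        // instead of continuing with fewer nodes.
        conf.set("dfs.client.block.write.replace-datanode-on-failure.best-effort", "false");

        // Write a small file through the pipeline using the configured client settings.
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/tmp/pipeline-demo.txt"))) {
            out.writeUTF("data written through the HDFS pipeline");
        }
    }
}

With best-effort left at false, the client raises an error rather than silently continuing on a shrinking pipeline; the reformation proposed in this paper instead pauses the write until replacement DataNodes become available.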