Abstract
Although most companies are expected to be running 1000-node Hadoop clusters by the end of 2020, Hadoop deployments are still accompanied by many challenges, such as security, fault tolerance, and flexibility. Hadoop is a software framework for handling big data, and it includes a distributed file system, the Hadoop Distributed File System (HDFS). HDFS achieves fault tolerance through a data replication technique: each data block is copied to multiple DataNodes, which provides reliability and availability. Although the replication technique works well, it wastes considerable time because it transfers replicas through a single pipeline. The proposed approach improves the performance of HDFS by transferring data blocks through multiple pipelines instead of a single pipeline. In addition, each DataNode updates its reliability value after each round and sends the updated value to the NameNode, which sorts the DataNodes according to that value. When a client submits a request to upload a data block, the NameNode replies with a list of highly reliable DataNodes, which yields high performance. The proposed approach is fully implemented, and the experimental results show that it improves the performance of HDFS write operations.
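The reliability-ranked DataNode selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name, the `report`/`select_datanodes` methods, and the numeric reliability values are all assumptions made for the example.

```python
import heapq

class NameNode:
    """Hypothetical sketch of the reliability-ranked DataNode selection
    described in the abstract. Names and structure are illustrative
    assumptions, not the paper's actual code."""

    def __init__(self):
        # DataNode id -> latest reported reliability value (higher is better)
        self.reliability = {}

    def report(self, datanode_id, value):
        # Each DataNode re-reports its reliability after every round.
        self.reliability[datanode_id] = value

    def select_datanodes(self, replication_factor=3):
        # Reply to a client's block-upload request with the most
        # reliable DataNodes, one target per replica pipeline.
        return heapq.nlargest(replication_factor,
                              self.reliability,
                              key=self.reliability.get)

nn = NameNode()
for node, r in [("dn1", 0.92), ("dn2", 0.75), ("dn3", 0.98), ("dn4", 0.60)]:
    nn.report(node, r)
print(nn.select_datanodes())  # → ['dn3', 'dn1', 'dn2']
```

With the replicas assigned to the top-ranked nodes, each replica can then be streamed over its own pipeline rather than chained through a single one.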
Rights and permissions
This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).
About this article
Cite this article
Elkawkagy, M., Elbeh, H. High Performance Hadoop Distributed File System. Int J Netw Distrib Comput 8, 119–123 (2020). https://doi.org/10.2991/ijndc.k.200515.007