Abstract
The Hadoop Distributed File System (HDFS) is a robust and flexible file system designed for reliably storing large volumes of data in distributed environments. Its storage model relies on data replication, and one of its central features is optimizing the placement of replicas across the cluster for fault tolerance, availability, and performance. To this end, the Replica Placement Policy (RPP) selects which nodes will store the data blocks. This work presents an experimental investigation of the different placement strategies available in HDFS. For a broader analysis, we consider the different stages in which replica placement is necessary, such as writing files to the system, re-replicating blocks among the nodes, and balancing the replica distribution in the cluster. The evaluation results provide a deeper understanding of the behavior of the policies and highlight the advantages and drawbacks of replica placement with respect to data availability, data locality, write and read throughput, and the overall performance of HDFS.
Notes
- 1.
The timeout for marking a DataNode (DN) as dead is relatively long (over 10 min by default) to avoid replication storms caused by state flapping of DNs [3]. To better suit performance-sensitive workloads, a shorter interval can be configured after which DNs are marked as stale and excluded from I/O operations.
- 2.
HDFS tries to satisfy a read request from the replica that is closest to the reader, so local replicas are preferred over remote ones. This reduces global bandwidth consumption and read latency [3].
- 3.
The RPP used by the file system is defined in a configuration file (hdfs-site.xml) by setting the parameter dfs.block.replicator.classname to the fully qualified class name of the desired policy.
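As an illustration of Notes 1 and 2, the following Python sketch models how a DN might be classified as live, stale, or dead from the time since its last heartbeat, and how replicas might be ordered by distance to the reader. It is a simplified model, not the HDFS implementation. The timeout formula and the default values are our reading of the Hadoop 3.3 defaults (dfs.heartbeat.interval, dfs.namenode.heartbeat.recheck-interval, and dfs.namenode.stale.datanode.interval) and should be checked against the deployed version. All function names and the (rack, host) representation are hypothetical.

```python
# Assumed Hadoop 3.3 defaults; verify against hdfs-default.xml for your version.
HEARTBEAT_INTERVAL_S = 3          # dfs.heartbeat.interval (seconds)
RECHECK_INTERVAL_MS = 300_000     # dfs.namenode.heartbeat.recheck-interval (ms)
STALE_INTERVAL_MS = 30_000        # dfs.namenode.stale.datanode.interval (ms)


def dead_node_timeout_ms(heartbeat_interval_s=HEARTBEAT_INTERVAL_S,
                         recheck_interval_ms=RECHECK_INTERVAL_MS):
    """Time without heartbeats after which a DN is considered dead.

    Assumed formula: 2 * recheck interval + 10 * heartbeat interval.
    """
    return 2 * recheck_interval_ms + 10 * 1000 * heartbeat_interval_s


def classify_datanode(ms_since_last_heartbeat,
                      stale_interval_ms=STALE_INTERVAL_MS,
                      dead_timeout_ms=None):
    """Return 'live', 'stale', or 'dead' for a DN (simplified model)."""
    if dead_timeout_ms is None:
        dead_timeout_ms = dead_node_timeout_ms()
    if ms_since_last_heartbeat > dead_timeout_ms:
        return "dead"
    if ms_since_last_heartbeat >= stale_interval_ms:
        return "stale"
    return "live"


def network_distance(reader, replica):
    """Distance between two (rack, host) locations in a two-level topology.

    0 = same node, 2 = same rack, 4 = different racks.
    """
    if reader == replica:
        return 0
    if reader[0] == replica[0]:
        return 2
    return 4


def order_replicas(reader, replicas):
    """Sort replica locations so that the closest one is tried first."""
    return sorted(replicas, key=lambda loc: network_distance(reader, loc))


if __name__ == "__main__":
    print(dead_node_timeout_ms())  # 630000 ms, i.e. 10.5 min with the defaults above
    reader = ("/rack1", "dn1")
    replicas = [("/rack2", "dn5"), ("/rack1", "dn2"), ("/rack1", "dn1")]
    print(order_replicas(reader, replicas))
```

With the assumed defaults, the dead-node timeout evaluates to 10.5 min, which is consistent with the "over 10 min" stated in Note 1, whereas the stale interval is only 30 s.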
- 4.
References
Abead, E.S., et al.: A comparative study of HDFS replication approaches. Int. J. IT Eng. 3, 5–11 (2015)
Achari, S.: Hadoop Essentials, 1st edn. Packt Publishing Ltd, Birmingham (2015)
Apache Software Foundation: Apache Hadoop. https://hadoop.apache.org/docs/r3.3.1/ (2021). Accessed 27 Sep 2021
Ciritoglu, H.E., et al.: Investigation of replication factor for performance enhancement in the Hadoop distributed file system. In: Companion of the 2018 ACM/SPEC International Conference on Performance Engineering, pp. 135–140 (2018)
Cloudera Inc: Scaling namespaces and optimizing data storage. https://docs.cloudera.com/runtime/7.2.6/scaling-namespaces/topics/hdfs-balancing-data-across-hdfs-cluster.html (2020). Accessed 3 Sep 2021
Dai, W., Ibrahim, I., Bassiouni, M.: An improved replica placement policy for Hadoop distributed file system running on cloud platforms. In: 2017 IEEE 4th International Conference on Cyber Security and Cloud Computing (CSCloud), pp. 270–275. IEEE (2017)
Fazul, R., Cardoso, P.V., Barcelos, P.P.: Improving data availability in HDFS through replica balancing. In: 2019 9th Latin-American Symposium on Dependable Computing (LADC), pp. 1–6. IEEE (2019)
Fazul, R.W.A., Barcelos, P.P.: Automation and prioritization of replica balancing in HDFS. In: Proceedings of the 36th Annual ACM Symposium on Applied Computing, pp. 35–38 (2021)
Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The Hadoop distributed file system. In: 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pp. 1–10. IEEE (2010)
Shwe, T., Aritsugi, M.: A data re-replication scheme and its improvement toward proactive approach. ASEAN Eng. J. 8(1), 36–52 (2018)
Turkington, G.: Hadoop Beginner’s Guide, 1st edn. Packt Publishing Ltd, Birmingham (2013)
White, T.: Hadoop: The Definitive Guide, 4th edn. O’Reilly Media Inc, Sebastopol (2015)
Acknowledgment
This work was developed with the support of CNPq - National Council for Scientific and Technological Development – Brazil. Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations.
Copyright information
© 2022 IFIP International Federation for Information Processing
About this paper
Cite this paper
Fazul, R.W.A., Barcelos, P.P. (2022). The HDFS Replica Placement Policies: A Comparative Experimental Investigation. In: Eyers, D., Voulgaris, S. (eds) Distributed Applications and Interoperable Systems. DAIS 2022. Lecture Notes in Computer Science, vol 13272. Springer, Cham. https://doi.org/10.1007/978-3-031-16092-9_10
DOI: https://doi.org/10.1007/978-3-031-16092-9_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16091-2
Online ISBN: 978-3-031-16092-9
eBook Packages: Computer Science, Computer Science (R0)