The HDFS Replica Placement Policies: A Comparative Experimental Investigation

Fazul, Rhauani Weber Aita; Barcelos, Patrícia Pitthan

doi:10.1007/978-3-031-16092-9_10

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13272)

Included in the following conference series:

IFIP International Conference on Distributed Applications and Interoperable Systems

Abstract

The Hadoop Distributed File System (HDFS) is a robust and flexible file system designed for reliably storing large volumes of data in distributed environments. Its storage model relies upon data replication and one of its central features is to optimize the placement of the replicas across the cluster for fault tolerance, availability, and performance. To this end, the Replica Placement Policy selects which nodes will store the data blocks. This work presents an experimental investigation of the different placement strategies available in HDFS. For a broader analysis, we consider different stages where the placement of the replicas is necessary, such as writing files in the system, re-replicating blocks among the nodes, and balancing the replica distribution in the cluster. The evaluation results allowed a deeper understanding of the behavior of the policies, in addition to highlighting the advantages and drawbacks of the replica placement concerning optimizations in data availability, data locality, write and read throughput, and in the overall performance of the HDFS.

Notes

1. The timeout to declare a DataNode (DN) dead is relatively long (over 10 min by default) to avoid replication storms caused by state flapping of DNs [3]. To better suit performance-sensitive workloads, a shorter interval can be configured to mark DNs as stale and exclude them from I/O operations.
2. HDFS tries to satisfy a read request from the replica closest to the reader, so local replicas are preferred over remote ones. This reduces global bandwidth consumption and read latency [3].
3. The Replica Placement Policy (RPP) used by the file system is defined in a configuration file (hdfs-site.xml) by setting the parameter dfs.block.replicator.classname to the fully qualified class name of the desired policy.
4. https://www.grid5000.fr.
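
Notes 1 and 3 can be illustrated with a short sketch. It is not taken from the paper. The property names and default values are assumed from the hdfs-default.xml shipped with Hadoop 3.3.x. The stale-node values and the choice of BlockPlacementPolicyRackFaultTolerant are arbitrary examples. The helper names (dead_node_timeout_s, build_hdfs_site, EXAMPLE_SETTINGS) are hypothetical. The first function reproduces the arithmetic behind the "over 10 min" default, and the second generates an hdfs-site.xml fragment that selects a placement policy and enables stale-node avoidance.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Defaults assumed from hdfs-default.xml (Hadoop 3.3.x); check your own release.
HEARTBEAT_INTERVAL_S = 3        # dfs.heartbeat.interval, in seconds
RECHECK_INTERVAL_MS = 300_000   # dfs.namenode.heartbeat.recheck-interval, in ms


def dead_node_timeout_s(heartbeat_s=HEARTBEAT_INTERVAL_S,
                        recheck_ms=RECHECK_INTERVAL_MS):
    """Seconds without heartbeats before the NameNode marks a DN as dead.

    Uses the rule 2 * recheck interval + 10 * heartbeat interval.
    """
    return 2 * recheck_ms / 1000 + 10 * heartbeat_s


def build_hdfs_site(properties):
    """Return hdfs-site.xml content (as a string) for a dict of properties."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    raw = ET.tostring(root, encoding="unicode")
    return minidom.parseString(raw).toprettyxml(indent="  ")


# Illustrative values only; the paper does not prescribe these settings.
EXAMPLE_SETTINGS = {
    # Note 3: fully qualified class name of the desired placement policy.
    "dfs.block.replicator.classname":
        "org.apache.hadoop.hdfs.server.blockmanagement."
        "BlockPlacementPolicyRackFaultTolerant",
    # Note 1: mark DNs as stale after 30 s without heartbeats ...
    "dfs.namenode.stale.datanode.interval": 30_000,
    # ... and steer reads and writes away from stale DNs.
    "dfs.namenode.avoid.read.stale.datanode": "true",
    "dfs.namenode.avoid.write.stale.datanode": "true",
}

if __name__ == "__main__":
    minutes = dead_node_timeout_s() / 60
    print(f"Default dead-node timeout: {minutes} min")
    print(build_hdfs_site(EXAMPLE_SETTINGS))
```

With the assumed defaults, the timeout works out to 630 s (10.5 min), which matches the "over 10 min" figure in note 1. The generated XML would go inside the NameNode's hdfs-site.xml, and a NameNode restart is typically needed for a placement policy change to take effect.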

References

Abead, E.S., et al.: A comparative study of HDFS replication approaches. Int. J. IT Eng. 3, 5–11 (2015)
Achari, S.: Hadoop Essentials, 1st edn. Packt Publishing Ltd, Birmingham (2015)
Apache Software Foundation: Apache Hadoop. https://hadoop.apache.org/docs/r3.3.1/ (2021). Accessed 27 Sep 2021
Ciritoglu, H.E., et al.: Investigation of replication factor for performance enhancement in the Hadoop distributed file system. In: Companion of the 2018 ACM/SPEC International Conference on Performance Engineering, pp. 135–140 (2018)
Cloudera Inc: Scaling namespaces and optimizing data storage. https://docs.cloudera.com/runtime/7.2.6/scaling-namespaces/topics/hdfs-balancing-data-across-hdfs-cluster.html (2020). Accessed 3 Sep 2021
Dai, W., Ibrahim, I., Bassiouni, M.: An improved replica placement policy for Hadoop distributed file system running on cloud platforms. In: 2017 IEEE 4th International Conference on Cyber Security and Cloud Computing (CSCloud), pp. 270–275. IEEE (2017)
Fazul, R., Cardoso, P.V., Barcelos, P.P.: Improving data availability in HDFS through replica balancing. In: 2019 9th Latin-American Symposium on Dependable Computing (LADC), pp. 1–6. IEEE (2019)
Fazul, R.W.A., Barcelos, P.P.: Automation and prioritization of replica balancing in HDFS. In: Proceedings of the 36th Annual ACM Symposium on Applied Computing, pp. 35–38 (2021)
Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The Hadoop distributed file system. In: 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pp. 1–10. IEEE (2010)
Shwe, T., Aritsugi, M.: A data re-replication scheme and its improvement toward proactive approach. ASEAN Eng. J. 8(1), 36–52 (2018)
Turkington, G.: Hadoop Beginner’s Guide, 1st edn. Packt Publishing Ltd, Birmingham (2013)
White, T.: Hadoop: The Definitive Guide, 4th edn. O’Reilly Media Inc, Sebastopol (2015)

Acknowledgment

This work was developed with the support of CNPq (National Council for Scientific and Technological Development), Brazil. Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several universities as well as other organizations.

Author information

Authors and Affiliations

Federal University of Santa Maria (UFSM), Santa Maria, RS, Brazil
Rhauani Weber Aita Fazul & Patrícia Pitthan Barcelos

Corresponding author

Correspondence to Rhauani Weber Aita Fazul.

Editor information

Editors and Affiliations

University of Otago, Dunedin, New Zealand
David Eyers
Athens University of Economics and Business, Athens, Greece
Spyros Voulgaris

Cite this paper

Fazul, R.W.A., Barcelos, P.P. (2022). The HDFS Replica Placement Policies: A Comparative Experimental Investigation. In: Eyers, D., Voulgaris, S. (eds) Distributed Applications and Interoperable Systems. DAIS 2022. Lecture Notes in Computer Science, vol 13272. Springer, Cham. https://doi.org/10.1007/978-3-031-16092-9_10

DOI: https://doi.org/10.1007/978-3-031-16092-9_10
Published: 06 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16091-2
Online ISBN: 978-3-031-16092-9
eBook Packages: Computer Science, Computer Science (R0)

Societies and partnerships

The International Federation for Information Processing