The HDFS Replica Placement Policies: A Comparative Experimental Investigation

  • Conference paper
Distributed Applications and Interoperable Systems (DAIS 2022)

Abstract

The Hadoop Distributed File System (HDFS) is a robust and flexible file system designed for reliably storing large volumes of data in distributed environments. Its storage model relies on data replication, and one of its central features is optimizing the placement of replicas across the cluster for fault tolerance, availability, and performance. To this end, the Replica Placement Policy (RPP) selects which nodes will store the data blocks. This work presents an experimental investigation of the different placement strategies available in HDFS. For a broader analysis, we consider the different stages at which replicas must be placed: writing files to the system, re-replicating blocks among the nodes, and balancing the replica distribution in the cluster. The evaluation results provide a deeper understanding of the behavior of the policies and highlight the advantages and drawbacks of each replica placement strategy with respect to data availability, data locality, write and read throughput, and the overall performance of HDFS.

Notes

  1. The timeout for declaring a DataNode (DN) dead is relatively long (over 10 min by default) to avoid replication storms caused by state flapping of DNs [3]. To better suit performance-sensitive workloads, a shorter interval can be configured after which DNs are marked as stale and excluded from I/O operations.

  2. HDFS tries to satisfy a read request from the replica closest to the reader, so local replicas are preferred over remote data. This reduces global bandwidth consumption and read latency [3].

  3. The RPP used by the file system is defined in a configuration file (hdfs-site.xml) by setting the parameter dfs.block.replicator.classname to the fully qualified class name of the desired policy.

  4. https://www.grid5000.fr.
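
Notes 1 and 3 both refer to settings in hdfs-site.xml, and a short sketch may make them more concrete. The Python snippet below is an illustration under stated assumptions, not part of the paper: it reproduces the arithmetic that, in Hadoop 3.x as far as we understand it, yields the dead-node timeout (twice the heartbeat recheck interval plus ten heartbeat intervals), and it renders property elements of the kind hdfs-site.xml contains. The default values, the property names other than dfs.block.replicator.classname, and the helper names dead_timeout_seconds and render_property are assumptions introduced here for illustration.

```python
from xml.sax.saxutils import escape

# Assumed defaults, as listed in hdfs-default.xml for Hadoop 3.3.x.
HEARTBEAT_INTERVAL_S = 3        # dfs.heartbeat.interval (seconds)
RECHECK_INTERVAL_MS = 300_000   # dfs.namenode.heartbeat.recheck-interval (ms)
STALE_INTERVAL_MS = 30_000      # dfs.namenode.stale.datanode.interval (ms)


def dead_timeout_seconds(heartbeat_interval_s=HEARTBEAT_INTERVAL_S,
                         recheck_interval_ms=RECHECK_INTERVAL_MS):
    """Seconds without heartbeats before a DN is considered dead.

    Assumed formula: 2 * recheck interval + 10 * heartbeat interval.
    """
    return 2 * recheck_interval_ms / 1000 + 10 * heartbeat_interval_s


def render_property(name, value):
    """Render one <property> element in the hdfs-site.xml layout."""
    return (
        "<property>\n"
        f"  <name>{escape(name)}</name>\n"
        f"  <value>{escape(str(value))}</value>\n"
        "</property>"
    )


if __name__ == "__main__":
    # Note 1: dead-node timeout versus the much shorter stale interval.
    print(f"dead after  {dead_timeout_seconds() / 60} min")
    print(f"stale after {STALE_INTERVAL_MS / 1000} s")

    # Note 3: selecting the RPP. The value shown is the class name of the
    # default policy shipped with HDFS; another policy would be selected
    # by substituting its fully qualified class name.
    print(render_property(
        "dfs.block.replicator.classname",
        "org.apache.hadoop.hdfs.server.blockmanagement."
        "BlockPlacementPolicyDefault",
    ))
```

With the assumed defaults, the computed timeout is 630 s (10.5 min), which is consistent with the "over 10 min" figure in note 1, while the stale interval is only 30 s.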

References

  1. Abead, E.S., et al.: A comparative study of HDFS replication approaches. Int. J. IT Eng. 3, 5–11 (2015)

  2. Achari, S.: Hadoop Essentials, 1st edn. Packt Publishing Ltd, Birmingham (2015)

  3. Apache Software Foundation: Apache Hadoop. https://hadoop.apache.org/docs/r3.3.1/ (2021). Accessed 27 Sep 2021

  4. Ciritoglu, H.E., et al.: Investigation of replication factor for performance enhancement in the Hadoop distributed file system. In: Companion of the 2018 ACM/SPEC International Conference on Performance Engineering, pp. 135–140 (2018)

  5. Cloudera Inc: Scaling namespaces and optimizing data storage. https://docs.cloudera.com/runtime/7.2.6/scaling-namespaces/topics/hdfs-balancing-data-across-hdfs-cluster.html (2020). Accessed 3 Sep 2021

  6. Dai, W., Ibrahim, I., Bassiouni, M.: An improved replica placement policy for Hadoop distributed file system running on cloud platforms. In: 2017 IEEE 4th International Conference on Cyber Security and Cloud Computing (CSCloud), pp. 270–275. IEEE (2017)

  7. Fazul, R., Cardoso, P.V., Barcelos, P.P.: Improving data availability in HDFS through replica balancing. In: 2019 9th Latin-American Symposium on Dependable Computing (LADC), pp. 1–6. IEEE (2019)

  8. Fazul, R.W.A., Barcelos, P.P.: Automation and prioritization of replica balancing in HDFS. In: Proceedings of the 36th Annual ACM Symposium on Applied Computing, pp. 35–38 (2021)

  9. Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The Hadoop distributed file system. In: 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pp. 1–10. IEEE (2010)

  10. Shwe, T., Aritsugi, M.: A data re-replication scheme and its improvement toward proactive approach. ASEAN Eng. J. 8(1), 36–52 (2018)

  11. Turkington, G.: Hadoop Beginner’s Guide, 1st edn. Packt Publishing Ltd, Birmingham (2013)

  12. White, T.: Hadoop: The Definitive Guide, 4th edn. O’Reilly Media Inc, Sebastopol (2015)


Acknowledgment

This work was developed with the support of CNPq - National Council for Scientific and Technological Development – Brazil. Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations.

Author information

Corresponding author

Correspondence to Rhauani Weber Aita Fazul.

Copyright information

© 2022 IFIP International Federation for Information Processing

About this paper

Cite this paper

Fazul, R.W.A., Barcelos, P.P. (2022). The HDFS Replica Placement Policies: A Comparative Experimental Investigation. In: Eyers, D., Voulgaris, S. (eds) Distributed Applications and Interoperable Systems. DAIS 2022. Lecture Notes in Computer Science, vol 13272. Springer, Cham. https://doi.org/10.1007/978-3-031-16092-9_10

  • DOI: https://doi.org/10.1007/978-3-031-16092-9_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16091-2

  • Online ISBN: 978-3-031-16092-9

  • eBook Packages: Computer Science, Computer Science (R0)
