Cluster Computing, Volume 18, Issue 4, pp 1465–1480

Optimizing data placement in heterogeneous Hadoop clusters



The data placement decisions of the Hadoop distributed file system (HDFS) are very important for data locality, which is a primary criterion for task scheduling in the MapReduce model and ultimately affects application performance. The existing rack-aware data placement strategy and replication scheme of HDFS work well with the MapReduce framework in homogeneous Hadoop clusters, but in practice such a data placement policy can noticeably reduce MapReduce performance and may increase energy dissipation in heterogeneous environments. In addition, HDFS applies a fixed default replication factor to every data block, which wastes storage space when the Hadoop system holds a lot of inactive data. In this paper, we propose a novel data placement strategy (SLDP) for heterogeneous Hadoop clusters. SLDP first uses a heterogeneity-aware algorithm to divide the nodes into several virtual storage tiers (VSTs), and then places data blocks across the nodes in each VST in a circular fashion according to the hotness of the data. Furthermore, SLDP uses hotness-proportional replication to save disk space and also provides an effective power control function. Experimental results on two real data-intensive applications show that SLDP is energy-efficient and space-saving, and that it significantly improves MapReduce performance in a heterogeneous Hadoop cluster.
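The abstract does not specify the SLDP algorithms, so the following Python sketch is only an illustration of the general idea of tiered, hotness-driven placement and not the authors' method. Everything in it is an assumption made for the example: the function names (assign_tiers, replica_count, place_blocks), the single per-node capacity score, the equal-size tier split, the linear mapping from hotness to a replica count between 1 and 3, and the round-robin reading of "circular" placement. Power control is left out entirely.

```python
from itertools import cycle


def assign_tiers(node_capacity, num_tiers):
    """Group nodes into tiers by descending capacity score (hypothetical metric)."""
    ranked = sorted(node_capacity, key=node_capacity.get, reverse=True)
    size = -(-len(ranked) // num_tiers)  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


def replica_count(hotness, max_hotness, min_replicas=1, max_replicas=3):
    """Scale the replica count linearly with hotness (assumed mapping)."""
    if max_hotness <= 0:
        return min_replicas
    frac = min(max(hotness / max_hotness, 0.0), 1.0)
    return min_replicas + round(frac * (max_replicas - min_replicas))


def place_blocks(block_hotness, tiers):
    """Send hotter blocks to faster tiers, cycling through each tier's nodes."""
    max_hot = max(block_hotness.values(), default=0)
    cursors = [cycle(tier) for tier in tiers]  # one round-robin cursor per tier
    placement = {}
    # Visit blocks from hottest to coldest.
    for block, hot in sorted(block_hotness.items(), key=lambda kv: -kv[1]):
        frac = hot / max_hot if max_hot > 0 else 0.0
        # Hotness near 1.0 maps to tier 0 (fastest); near 0.0 to the last tier.
        tier_idx = min(int((1.0 - frac) * len(tiers)), len(tiers) - 1)
        # Never ask for more replicas than the tier has nodes.
        n = min(replica_count(hot, max_hot), len(tiers[tier_idx]))
        placement[block] = [next(cursors[tier_idx]) for _ in range(n)]
    return placement


if __name__ == "__main__":
    nodes = {"n1": 4.0, "n2": 3.5, "n3": 1.0, "n4": 0.8}  # made-up scores
    tiers = assign_tiers(nodes, 2)
    print(tiers)
    print(place_blocks({"b1": 10, "b2": 5, "b3": 0}, tiers))
```

In this toy setup the hottest block receives the most replicas on the fastest tier, while a block that is never accessed keeps a single replica on the slowest tier, which is the space-saving effect the abstract attributes to hotness-proportional replication.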


Keywords: Hadoop cluster; HDFS; Data placement; Heterogeneous; Replica



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. School of Computer Science and Engineering, Southeast University, Nanjing, People’s Republic of China
