Abstract
Data placement decisions in the Hadoop distributed file system (HDFS) are very important for data locality, which is a primary criterion for task scheduling in the MapReduce model and ultimately affects application performance. The existing rack-aware data placement strategy and replication scheme of HDFS work well with the MapReduce framework in homogeneous Hadoop clusters, but in practice such a data placement policy can noticeably reduce MapReduce performance and may increase energy dissipation in heterogeneous environments. In addition, HDFS applies a fixed default replication factor to every data block, which wastes storage space when the Hadoop system holds a large amount of inactive data. In this paper, we propose a novel data placement strategy (SLDP) for heterogeneous Hadoop clusters. SLDP first adopts a heterogeneity-aware algorithm to divide the nodes into several virtual storage tiers (VSTs), and then places data blocks across the nodes in each VST circularly according to the hotness of the data. Furthermore, SLDP uses hotness-proportional replication to save disk space and also provides an effective power control function. Experimental results on two real data-intensive applications show that SLDP is energy-efficient and space-saving, and that it significantly improves MapReduce performance in a heterogeneous Hadoop cluster.
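To make the two ideas in the abstract concrete, the following is a minimal illustrative sketch and not the SLDP algorithm itself, whose details are not given here. It assumes each node has a single numeric capability score, that tiers are formed by simply ranking nodes on that score, and that the replica count scales linearly with hotness between hypothetical bounds of 1 and 3. The names `assign_tiers`, `replica_count`, and `place_block` are invented for illustration.

```python
from math import ceil


def assign_tiers(node_scores, num_tiers):
    """Rank nodes by capability score (assumed metric) and split them
    into at most num_tiers groups, most capable group first."""
    ranked = sorted(node_scores, key=node_scores.get, reverse=True)
    size = ceil(len(ranked) / num_tiers)
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


def replica_count(hotness, max_hotness, min_replicas=1, max_replicas=3):
    """Scale the replica count linearly with hotness, clamped to the
    hypothetical bounds min_replicas..max_replicas."""
    if max_hotness <= 0:
        return min_replicas
    fraction = min(max(hotness / max_hotness, 0.0), 1.0)
    return min_replicas + round(fraction * (max_replicas - min_replicas))


def place_block(hotness, max_hotness, tiers):
    """Map a block to a tier index (hottest data goes to tier 0) and
    return (tier_index, replicas)."""
    if max_hotness > 0:
        fraction = min(max(hotness / max_hotness, 0.0), 1.0)
    else:
        fraction = 0.0
    tier_index = min(int((1.0 - fraction) * len(tiers)), len(tiers) - 1)
    return tier_index, replica_count(hotness, max_hotness)


# Hypothetical four-node cluster with made-up capability scores.
scores = {"n1": 9.0, "n2": 3.0, "n3": 7.5, "n4": 1.0}
tiers = assign_tiers(scores, 2)
hot_placement = place_block(100, 100, tiers)
cold_placement = place_block(0, 100, tiers)
```

In this sketch a hot block lands in the most capable tier with the maximum number of replicas, while an inactive block lands in the least capable tier with a single replica, which is the intuition behind saving space on cold data. A real policy would also need rack awareness and fault-tolerance constraints that this sketch omits.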
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grant Nos. 61320106007, 61202449, 61572129, 61502097, and 61370207; the National High-tech R&D Program of China (863 Program) under Grant No. 2013AA013503; the China Fundamental Research Funds for the Central Universities under Grant No. 1109007115; the Jiangsu research prospective joint research project under Grant Nos. BY2012202 and BY2013073-01; the Jiangsu Provincial Key Laboratory of Network and Information Security under Grant No. BM2003201; and the Key Laboratory of Computer Network and Information Integration of the Ministry of Education of China under Grant No. 93K-9. It is also partially supported by the Collaborative Innovation Center of Novel Software Technology and Industrialization.
Cite this article
Xiong, R., Luo, J. & Dong, F. Optimizing data placement in heterogeneous Hadoop clusters. Cluster Comput 18, 1465–1480 (2015). https://doi.org/10.1007/s10586-015-0495-z
DOI: https://doi.org/10.1007/s10586-015-0495-z