Abstract
The MapReduce programming model and Hadoop have become the de facto standard for data-intensive applications. Hadoop maps tasks to the cluster nodes that hold the data those tasks require. Such a strategy is intuitively appealing for a cluster that is homogeneous in terms of both computation and storage capabilities. However, most commonplace clusters are heterogeneous, since nodes are added over a prolonged period. This necessitates an intelligent data placement strategy that accounts for the inherent heterogeneity among cluster nodes; ignoring it creates performance bottlenecks. In this paper, we propose a performance-based clustering of Hadoop nodes, followed by placement of data among the nodes according to that clustering. Performance-based profiling of nodes can be achieved by running multiple benchmarks offline and dividing the cluster nodes into two subsets, namely low-performance and high-performance nodes. Additionally, the execution of Hadoop tasks is monitored using Hadoop's task speculation mechanism, and computations for slow-running tasks are dynamically migrated based on prior knowledge of the data blocks associated with the task. Experiments demonstrate that the proposed intelligent data placement improves network utilization and cluster performance.
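The profiling and placement idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the two subsets are formed or how data is apportioned, so the largest-gap split rule, the speed-proportional block shares, and the names `split_nodes` and `place_blocks` are all assumptions made for illustration.

```python
from statistics import mean


def split_nodes(bench_times):
    """Split nodes into (high, low) performance subsets.

    bench_times maps a node name to a list of offline benchmark
    runtimes in seconds (lower is faster). Assumption for this sketch:
    the split point is the largest gap between consecutive mean runtimes.
    """
    scores = sorted((mean(t), n) for n, t in bench_times.items())
    if len(scores) < 2:
        return [n for _, n in scores], []
    gaps = [scores[i + 1][0] - scores[i][0] for i in range(len(scores) - 1)]
    cut = gaps.index(max(gaps)) + 1
    high = [n for _, n in scores[:cut]]
    low = [n for _, n in scores[cut:]]
    return high, low


def place_blocks(bench_times, total_blocks):
    """Assign block counts proportional to node speed (1 / mean runtime).

    Assumption for this sketch: faster nodes should hold more blocks so
    that more of their tasks can run on local data.
    """
    speed = {n: 1.0 / mean(t) for n, t in bench_times.items()}
    total = sum(speed.values())
    shares = {n: int(total_blocks * s / total) for n, s in speed.items()}
    # Blocks lost to rounding down go to the fastest nodes first.
    leftover = max(0, total_blocks - sum(shares.values()))
    for n in sorted(speed, key=speed.get, reverse=True)[:leftover]:
        shares[n] += 1
    return shares


if __name__ == "__main__":
    # Hypothetical benchmark runtimes for a three-node cluster.
    times = {"n1": [10, 10], "n2": [10, 10], "n3": [40, 40]}
    print(split_nodes(times))
    print(place_blocks(times, 90))
```

In this sketch the two fast nodes form the high-performance subset and receive a larger share of the blocks, while the slow node holds fewer. A real deployment would also have to respect HDFS replication and rack-awareness constraints, which are ignored here.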
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Paik, S.S., Goswami, R.S., Roy, D.S., Reddy, K.H. (2018). Intelligent Data Placement in Heterogeneous Hadoop Cluster. In: Bhattacharyya, P., Sastry, H., Marriboyina, V., Sharma, R. (eds) Smart and Innovative Trends in Next Generation Computing Technologies. NGCT 2017. Communications in Computer and Information Science, vol 827. Springer, Singapore. https://doi.org/10.1007/978-981-10-8657-1_43
DOI: https://doi.org/10.1007/978-981-10-8657-1_43
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-8656-4
Online ISBN: 978-981-10-8657-1
eBook Packages: Computer Science (R0)