A Cost-Effective Data Node Management Scheme for Hadoop Clusters in Cloud Environment

Vidhyasagar, B. S.; Perinbam, J. Raja Paul; Krishnamurthy, M.; Arunnehru, J.

doi:10.1007/978-981-15-4301-2_3


Part of the book series: Communications in Computer and Information Science (CCIS, volume 1203)

Included in the following conference series:

Symposium on Machine Learning and Metaheuristics Algorithms, and Applications

Abstract

The MapReduce framework in Hadoop is used to analyze large data sets held in a distributed storage system. The scheduler assigns MapReduce jobs to task nodes, which perform the map and reduce operations. Each node has slots (virtual cores) in which it processes map and reduce tasks. Map tasks are executed separately, before the reduce tasks. The execution order of jobs and the slot configuration of the cluster both affect CPU performance significantly. In this paper, we present effective DataNode assignment techniques for resource allocation in Hadoop MapReduce jobs. We ran a range of experiments on Amazon EC2 and on a physical machine to demonstrate that the proposed technique helps select an optimized set of nodes to assign as DataNodes in the Hadoop cluster. This significantly reduces node cost and improves job execution performance in the Hadoop cluster.
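
To make the effect of slot configuration concrete, the following is a minimal illustrative sketch, not the technique proposed in the paper. It assumes a simplified model in which each task is placed on the earliest-free slot and the reduce phase starts only after every map task has finished. The function names `phase_makespan` and `job_makespan` and the task durations are hypothetical.

```python
import heapq


def phase_makespan(durations, slots):
    """Finish time of one phase when each task takes the earliest-free slot.

    Assumption: tasks are dispatched in list order (a simple greedy model,
    not Hadoop's actual scheduler).
    """
    if slots < 1:
        raise ValueError("slots must be >= 1")
    free_at = [0.0] * slots  # time at which each slot becomes free
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)      # earliest-free slot
        heapq.heappush(free_at, start + d)  # slot is busy until start + d
    return max(free_at)


def job_makespan(map_durations, reduce_durations, map_slots, reduce_slots):
    """Total job time: the reduce phase begins after all map tasks finish."""
    return (phase_makespan(map_durations, map_slots)
            + phase_makespan(reduce_durations, reduce_slots))


if __name__ == "__main__":
    maps = [4, 3, 3, 2]   # hypothetical map task durations
    reduces = [5, 1]      # hypothetical reduce task durations
    for m, r in [(1, 2), (2, 1), (4, 2)]:
        print(m, r, job_makespan(maps, reduces, m, r))
```

Under this model the same workload finishes at different times depending only on how the slots are split between map and reduce tasks, which is the kind of sensitivity the abstract refers to.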

Author information

Authors and Affiliations

Information and Communication Engineering, Anna University, Chennai, Tamilnadu, India
B. S. Vidhyasagar
Department of ECE, Kings Engineering College, Chennai, Tamilnadu, India
J. Raja Paul Perinbam
Department of CSE, KCG College of Technology, Chennai, Tamilnadu, India
M. Krishnamurthy
Department of CSE, SRM Institute of Science and Technology, Chennai, Tamilnadu, India
J. Arunnehru

Corresponding author

Correspondence to B. S. Vidhyasagar.

Editor information

Editors and Affiliations

Indian Institute of Information Technology and Management - Kerala (IIITM-K), Trivandrum, India
Sabu M. Thampi
Simon Fraser University, Burnaby, BC, Canada
Ljiljana Trajkovic
Providence University, Taichung, Taiwan
Kuan-Ching Li
Indian Statistical Institute, Kolkata, West Bengal, India
Swagatam Das
Wrocław University of Technology, Wrocław, Poland
Michal Wozniak
Università degli Studi di Firenze, Florence, Italy
Stefano Berretti

Cite this paper

Vidhyasagar, B.S., Perinbam, J.R.P., Krishnamurthy, M., Arunnehru, J. (2020). A Cost-Effective Data Node Management Scheme for Hadoop Clusters in Cloud Environment. In: Thampi, S., Trajkovic, L., Li, KC., Das, S., Wozniak, M., Berretti, S. (eds) Machine Learning and Metaheuristics Algorithms, and Applications. SoMMA 2019. Communications in Computer and Information Science, vol 1203. Springer, Singapore. https://doi.org/10.1007/978-981-15-4301-2_3

DOI: https://doi.org/10.1007/978-981-15-4301-2_3
Published: 05 April 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-4300-5
Online ISBN: 978-981-15-4301-2
eBook Packages: Computer Science, Computer Science (R0)
