H-WorD: Supporting Job Scheduling in Hadoop with Workload-Driven Data Redistribution

Jovanovic, Petar; Romero, Oscar; Calders, Toon; Abelló, Alberto

doi:10.1007/978-3-319-44039-2_21

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9809)

Included in the following conference series:

East European Conference on Advances in Databases and Information Systems

Abstract

Today’s distributed data processing systems typically follow a query shipping approach and exploit data locality to reduce network traffic. In such systems, the distribution of data over the cluster resources plays a significant role, and a skewed distribution can harm the performance of the executing applications. In this paper, we address the challenges of automatically adapting the distribution of data in a cluster to the workload imposed by the input applications. We propose a generic algorithm, named H-WorD, which, based on the estimated workload over resources, suggests alternative execution scenarios of tasks and hence identifies the required transfers of input data a priori, so that data is brought close to the execution in time. We exemplify our algorithm in the context of MapReduce jobs in a Hadoop ecosystem. Finally, we evaluate our approach and demonstrate the performance gains of automatic data redistribution.
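The abstract does not spell out the algorithm, so the following is not H-WorD itself. It is only a minimal Python sketch of the general idea of workload-driven redistribution, under simplifying assumptions that are ours: each node runs its tasks one after another, all nodes start at time 0, and moving a task away from the node holding its input adds a fixed transfer time to the target node's load. The names `Task`, `makespan`, and `plan_redistribution` are hypothetical. The sketch starts from the data-local placement, then greedily moves tasks from the most loaded node to the least loaded one whenever that shortens the makespan (the time until the last job finishes), and records the data transfers that would have to happen in advance.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    node: str        # node that currently stores the task's input data
    duration: float  # estimated execution time when run data-locally
    transfer: float  # estimated time to copy the input to another node (assumption)


def makespan(load):
    """Time until the last node finishes, assuming every node starts at 0."""
    return max(load.values())


def plan_redistribution(tasks, nodes):
    """Greedy toy planner: returns (placement, transfers, per-node load)."""
    # Initial scenario: every task runs on the node that holds its data.
    placement = {t.name: t.node for t in tasks}
    load = {n: 0.0 for n in nodes}
    for t in tasks:
        load[t.node] += t.duration

    transfers = []  # (task name, source node, target node)
    improved = True
    while improved:
        improved = False
        src = max(load, key=load.get)  # most loaded node
        dst = min(load, key=load.get)  # least loaded node
        # Try the cheapest-to-move tasks first; each task is moved at most once.
        for t in sorted(tasks, key=lambda t: t.transfer):
            if t.node != src or placement[t.name] != src:
                continue
            new_src = load[src] - t.duration
            new_dst = load[dst] + t.duration + t.transfer
            # Accept the move only if the busier of the two nodes finishes earlier.
            if max(new_src, new_dst) < load[src]:
                load[src], load[dst] = new_src, new_dst
                placement[t.name] = dst
                transfers.append((t.name, src, dst))
                improved = True
                break
    return placement, transfers, load


# Skewed example: three tasks read data on n1, one task reads data on n2.
example = [
    Task("a", "n1", 4.0, 1.0),
    Task("b", "n1", 4.0, 1.0),
    Task("c", "n1", 4.0, 1.0),
    Task("d", "n2", 2.0, 1.0),
]
placement, transfers, load = plan_redistribution(example, ["n1", "n2"])
print(transfers)       # data movements to schedule before execution
print(makespan(load))  # estimated makespan after redistribution
```

In this example the data-local placement has an estimated makespan of 12.0 (all the work on `n1`); the sketch moves task `a` to `n2`, which gives loads of 8.0 and 7.0. A real scheduler would also have to model the network as a shared resource and the slots available per node, which this toy model ignores.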

Notes

1. We define makespan as the total time elapsed from the beginning of the execution of a set of jobs until the end of the last executing job [5].
2. WordCount example: https://wiki.apache.org/hadoop/WordCount.
3. TeraSort: https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/examples/terasort/package-summary.html.

References

1. Apache HBase. https://hbase.apache.org/. Accessed 02 Mar 2016
2. Cluster rebalancing in HDFS. http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Cluster+Rebalancing. Accessed 02 Mar 2016
3. Hadoop: capacity scheduler. http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html. Accessed 04 Mar 2016
4. Hadoop: fair scheduler. https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html. Accessed 04 Mar 2016
5. Błażewicz, J., Ecker, K.H., Pesch, E., Schmidt, G., Weglarz, J.: Handbook on Scheduling: From Theory to Applications. Springer Science & Business Media, Berlin (2007)
6. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
7. Guo, Z., Fox, G., Zhou, M.: Investigation of data locality in MapReduce. In: CCGrid, pp. 419–426 (2012)
8. Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F.B., Babu, S.: Starfish: a self-tuning system for big data analytics. In: CIDR, pp. 261–272 (2011)
9. Jin, J., Luo, J., Song, A., Dong, F., Xiong, R.: BAR: an efficient data locality driven task scheduling algorithm for cloud computing. In: CCGrid, pp. 295–304 (2011)
10. Kolisch, R., Hartmann, S.: Heuristic Algorithms for the Resource-Constrained Project Scheduling Problem: Classification and Computational Analysis. Springer, New York (1999)
11. Palanisamy, B., Singh, A., Liu, L., Jain, B.: Purlieus: locality-aware resource allocation for MapReduce in a cloud. In: SC, pp. 58:1–58:11 (2011)
12. Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The hadoop distributed file system. In: MSST, pp. 1–10 (2010)
13. Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., Saha, B., Curino, C., O’Malley, O., Radia, S., Reed, B., Baldeschwieler, E.: Apache hadoop YARN: yet another resource negotiator. In: ACM Symposium on Cloud Computing, SOCC 2013, Santa Clara, CA, USA, 1–3 October 2013, pp. 5:1–5:16 (2013)
14. Wang, W., Zhu, K., Ying, L., Tan, J., Zhang, L.: Map task scheduling in MapReduce with data locality: throughput and heavy-traffic optimality. In: INFOCOM, pp. 1609–1617 (2013)
15. Zaharia, M., Borthakur, D., Sarma, J.S., Elmeleegy, K., Shenker, S., Stoica, I.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In: EuroSys, pp. 265–278 (2010)

Acknowledgements

This work has been partially supported by the Secretaria d’Universitats i Recerca de la Generalitat de Catalunya under 2014 SGR 1534, and by the Spanish Ministry of Education grant FPU12/04915.

Author information

Authors and Affiliations

Universitat Politècnica de Catalunya, BarcelonaTech, Barcelona, Spain
Petar Jovanovic, Oscar Romero & Alberto Abelló
Université Libre de Bruxelles, Brussels, Belgium
Toon Calders
University of Antwerp, Antwerp, Belgium
Toon Calders

Corresponding author

Correspondence to Petar Jovanovic.

Editor information

Editors and Affiliations

MFF, Charles University, Prague, Czech Republic
Jaroslav Pokorný
Faculty of Sciences, University of Novi Sad, Novi Sad, Serbia
Mirjana Ivanović
Christian-Albrechts-Universität Kiel, Kiel, Germany
Bernhard Thalheim
VSB-Technical University Ostrava, Ostrava, Czech Republic
Petr Šaloun

About this paper

Cite this paper

Jovanovic, P., Romero, O., Calders, T., Abelló, A. (2016). H-WorD: Supporting Job Scheduling in Hadoop with Workload-Driven Data Redistribution. In: Pokorný, J., Ivanović, M., Thalheim, B., Šaloun, P. (eds) Advances in Databases and Information Systems. ADBIS 2016. Lecture Notes in Computer Science, vol 9809. Springer, Cham. https://doi.org/10.1007/978-3-319-44039-2_21

DOI: https://doi.org/10.1007/978-3-319-44039-2_21
Published: 14 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44038-5
Online ISBN: 978-3-319-44039-2
eBook Packages: Computer Science, Computer Science (R0)