Avoiding Slow Running Nodes in Distributed Systems
In distributed systems such as Hadoop, work is divided into tasks that execute in parallel on the nodes of a cluster. Stragglers, nodes that run 6–8 times slower than the median node, can degrade overall cluster performance by increasing job completion time. Existing solutions mainly take reactive measures after stragglers have been detected, which extends job completion time and wastes resources. More recently, proactive straggler-avoidance techniques have applied machine learning to improve task scheduling. This paper proposes a prognostic system that proactively avoids stragglers using predictive models. It has two stages: (1) building a prediction model, using distributed machine learning, that identifies straggler nodes before tasks are allocated, and (2) guiding the scheduler to assign tasks efficiently. This avoids or minimizes stragglers and leads to smarter scheduling. Compared with the default Hadoop scheduler, the proposed solution shows a significant improvement.
Keywords: Distributed systems, Hadoop, Stragglers, Predictive model, Machine learning
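The two stages described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: a single-machine perceptron stands in for the distributed learner, the two node metrics (CPU and memory utilization), the training data, the node names, and the functions `train_perceptron`, `predicts_straggler`, and `assign_tasks` are all hypothetical, and tasks are simply assigned round-robin to nodes that are not predicted to straggle.

```python
from itertools import cycle

def train_perceptron(samples, labels, epochs=100):
    """Stage 1 (stand-in): fit a linear classifier.
    Labels are +1 (straggler) or -1 (healthy)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified: shift the boundary
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:  # every training sample is classified correctly
            break
    return weights, bias

def predicts_straggler(model, metrics):
    """Return True if the model expects this node to straggle."""
    weights, bias = model
    return sum(w * m for w, m in zip(weights, metrics)) + bias > 0

def assign_tasks(tasks, node_metrics, model):
    """Stage 2 (stand-in): assign tasks round-robin to nodes that are
    not predicted to be stragglers."""
    healthy = [n for n, m in node_metrics.items()
               if not predicts_straggler(model, m)]
    # Fall back to all nodes if every node is flagged.
    candidates = healthy or list(node_metrics)
    return {task: node for task, node in zip(tasks, cycle(candidates))}

# Hypothetical history: (cpu_utilization, memory_utilization) per node sample.
history = [(0.2, 0.3), (0.9, 0.8), (0.3, 0.2), (0.95, 0.9),
           (0.4, 0.4), (0.85, 0.95), (0.1, 0.5), (0.8, 0.9)]
labels = [-1, 1, -1, 1, -1, 1, -1, 1]
model = train_perceptron(history, labels)

# Hypothetical current metrics; the heavily loaded node-b is skipped.
current = {"node-a": (0.2, 0.2), "node-b": (0.9, 0.9), "node-c": (0.5, 0.5)}
plan = assign_tasks(["t1", "t2", "t3"], current, model)
print(plan)
```

In a real cluster the prediction would be refreshed from live monitoring data before each scheduling decision, and the fallback to all nodes keeps jobs from stalling when every node is flagged.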
We are indebted to Puja, Ankit, Kalpesh and Aditi for their help in implementing the proposed solution.