Zeno: A Straggler Diagnosis System for Distributed Computing Using Machine Learning

  • Huanxing Shen
  • Cong Li
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10876)


Modern distributed computing frameworks for cloud computing and high performance computing typically speed up jobs by dividing a large job into small tasks that execute in parallel. Some tasks, however, may run far behind the others, which jeopardizes the job completion time. In this paper, we present Zeno, a novel system that automatically identifies and diagnoses stragglers using machine learning methods. First, the system identifies stragglers with an unsupervised clustering method that groups tasks by their execution time. It then uses a supervised rule learning algorithm to learn diagnosis rules that infer stragglers from their resource assignment and usage data. Zeno is evaluated on traces from a Google Borg system and an Alibaba Fuxi system. The results demonstrate that the system generates simple, easy-to-read rules that offer valuable insights and predict stragglers with decent performance.
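The two stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: two-cluster k-means on task durations stands in for the unsupervised clustering step, and a single-threshold rule stands in for the supervised rule learner. The function names, the feature names `machine_load` and `mem_request`, and the toy data are all hypothetical.

```python
from typing import Dict, List, Tuple


def two_means_1d(values: List[float], iters: int = 50) -> List[bool]:
    """Lloyd-style 2-means on scalars; True marks the slower cluster."""
    lo, hi = min(values), max(values)  # initial centroids (assumption)
    flags = [False] * len(values)
    for _ in range(iters):
        # Assign each value to the nearer of the two centroids.
        flags = [abs(v - hi) < abs(v - lo) for v in values]
        slow = [v for v, f in zip(values, flags) if f]
        fast = [v for v, f in zip(values, flags) if not f]
        if not slow or not fast:
            break
        new_lo, new_hi = sum(fast) / len(fast), sum(slow) / len(slow)
        if (new_lo, new_hi) == (lo, hi):
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return flags


def learn_stump(rows: List[Dict[str, float]],
                labels: List[bool]) -> Tuple[str, float, float]:
    """Find the single rule `feature > threshold` with the best accuracy."""
    best = ("", 0.0, -1.0)
    for feature in rows[0]:
        values = sorted({r[feature] for r in rows})
        # Candidate thresholds are midpoints between neighbouring values.
        for a, b in zip(values, values[1:]):
            threshold = (a + b) / 2
            hits = sum((r[feature] > threshold) == y
                       for r, y in zip(rows, labels))
            accuracy = hits / len(rows)
            if accuracy > best[2]:
                best = (feature, threshold, accuracy)
    return best


# Toy data (hypothetical): task durations in seconds and per-task features.
durations = [10.0, 11.0, 12.0, 10.0, 48.0, 52.0]
rows = [
    {"machine_load": 0.20, "mem_request": 0.2},
    {"machine_load": 0.30, "mem_request": 0.5},
    {"machine_load": 0.25, "mem_request": 0.3},
    {"machine_load": 0.30, "mem_request": 0.6},
    {"machine_load": 0.90, "mem_request": 0.4},
    {"machine_load": 0.95, "mem_request": 0.5},
]

stragglers = two_means_1d(durations)   # stage 1: label tasks by duration
rule = learn_stump(rows, stragglers)   # stage 2: explain labels with a rule
print(f"IF {rule[0]} > {rule[1]:.2f} THEN straggler (accuracy {rule[2]:.2f})")
```

The stump only searches rules of the form `feature > threshold`; a fuller rule learner would also consider the opposite direction and conjunctions of conditions.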


Distributed computing, Straggler diagnosis, Unsupervised clustering, Supervised rule induction



We thank Tai Huang and Jia Bao for their valuable comments and suggestions on an early draft of the paper. We thank the four anonymous reviewers for their valuable comments and criticisms, and Xing Zhao for checking the English of the paper. A previous description of the machine learning methods for straggler diagnosis appeared as a 6-page extended abstract at a workshop [13].


  1. Ananthanarayanan, G., Ghodsi, A., Shenker, S., Stoica, I.: Effective straggler mitigation: attack of the clones. In: Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation (NSDI 2013), pp. 185–198 (2013)
  2. Bailey, T., Jain, A.K.: A note on distance-weighted k-nearest neighbor rules. IEEE Trans. Syst. Man Cybern. 8(4), 311–313 (1978)
  3. Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.: Occam’s razor. Inf. Process. Lett. 24(6), 377–380 (1987)
  4. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
  5. Bremer, P.T., Mohr, B., Pascucci, V., Schulz, M. (eds.): Proceedings of the 2nd Workshop on Visual Performance Analysis (VPA 2015) (2015)
  6. Bremer, P.T., Gimenez, J., Levine, J.A., Schulz, M. (eds.): Proceedings of the 3rd International Workshop on Visual Performance Analysis (VPA 2016) (2016)
  7. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
  8. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pp. 137–150 (2004)
  9. Garraghan, P., Ouyang, X., Yang, R., McKee, D., Xu, J.: Straggler root-cause and impact analysis for massive-scale virtualized cloud datacenters. IEEE Trans. Serv. Comput. (2017)
  10. Gupta, S., Fritz, C., Price, R., Hoover, R., de Kleer, J., Witteveen, C.: ThroughputScheduler: learning to schedule on heterogeneous Hadoop clusters. In: Proceedings of the 10th International Conference on Autonomic Computing (ICAC 2013), pp. 159–165 (2013)
  11. Iba, W., Langley, P.: Induction of one-level decision trees. In: Proceedings of the 9th International Workshop on Machine Learning (ML 1992), pp. 233–240 (1992)
  12. Isard, M., Prabhakaran, V., Currey, J., Wieder, U., Talwar, K., Goldberg, A.: Quincy: fair scheduling for distributed computing clusters. In: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP 2009), pp. 261–276 (2009)
  13. Li, C., Shen, H., Huang, T.: Learning to diagnose stragglers in distributed computing. In: Proceedings of the 9th Workshop on Many-Task Computing on Clouds, Grids, and Supercomputers (MTAGS 2016), pp. 1–6 (2016)
  14. Lloyd, S.P.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)
  15. Ng, A.Y.: Feature selection, \(L_1\) vs. \(L_2\) regularization, and rotational invariance. In: Proceedings of the 21st International Conference on Machine Learning (ICML 2004) (2004)
  16. Ouyang, X., Garraghan, P., McKee, D., Townend, P., Xu, J.: Straggler detection in parallel computing systems through dynamic threshold calculation. In: Proceedings of the 30th International Conference on Advanced Information Networking and Applications (AINA 2016), pp. 414–421 (2016)
  17. Pelleg, D., Moore, A.W.: X-means: extending k-means with efficient estimation of the number of clusters. In: Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 727–734 (2000)
  18. Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986)
  19. Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., Saha, B., Curino, C., O’Malley, O., Radia, S., Reed, B., Baldeschwieler, E.: Apache Hadoop YARN: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC 2013) (2013)
  20. Verma, A., Pedrosa, L., Korupolu, M., Oppenheimer, D., Tune, E., Wilkes, J.: Large-scale cluster management at Google with Borg. In: Proceedings of the 10th European Conference on Computer Systems (EuroSys 2015) (2015)
  21. Yadwadkar, N.J., Hariharan, B., Gonzalez, J.E., Katz, R.: Multi-task learning for straggler avoiding predictive job scheduling. J. Mach. Learn. Res. 17(1), 3692–3728 (2016)
  22. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 2010) (2010)
  23. Zhang, Z., Li, C., Tao, Y., Yang, R., Tang, H., Xu, J.: Fuxi: a fault-tolerant resource management and job scheduling system at internet scale. Proc. VLDB Endow. 7(13), 1393–1404 (2014)

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Intel Corporation, Shanghai, People’s Republic of China
