Limits of Work-Stealing Scheduling

  • Željko Vrba
  • Håvard Espeland
  • Pål Halvorsen
  • Carsten Griwodz
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5798)

Abstract

The number of applications with many parallel, cooperating processes is steadily increasing, and developing efficient runtimes for their execution is an important task. Several frameworks have been developed, such as MapReduce and Dryad, but designing scheduling mechanisms that take both processing and communication requirements into account remains hard. In this paper, we explore the limits of the work-stealing scheduler, which has empirically been shown to perform well, and evaluate load balancing based on graph partitioning as an orthogonal approach. All of the algorithms are implemented in our Nornir runtime system, and our experiments on a multi-core workstation show that the main cause of performance degradation for work stealing is workloads that perform very little processing per message; we quantify this threshold exactly. It is for this type of workload that graph partitioning has the potential to outperform work stealing.
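To make the scheduling discipline under discussion concrete, the sketch below models deque-based work stealing in the style of the classic designs cited in the references (Arora et al.; Cilk). This is a minimal, illustrative Python model, not Nornir's implementation: the names (Worker, push, job) are hypothetical, and a production scheduler would use lock-free deques and careful backoff rather than a mutex and a sleep.

```python
import collections
import random
import threading
import time

class Worker(threading.Thread):
    """One scheduler thread with a private task deque.

    The owner pushes and pops at one end (LIFO, for locality); idle
    threads steal from the opposite end (FIFO), as in the classic
    work-stealing designs cited in the references.
    """

    def __init__(self, wid, workers):
        super().__init__(daemon=True)
        self.wid = wid
        self.workers = workers          # shared list of all workers
        self.deque = collections.deque()
        self.lock = threading.Lock()    # stand-in for a lock-free deque

    def push(self, task):
        with self.lock:
            self.deque.appendleft(task)

    def _pop_local(self):
        with self.lock:
            return self.deque.popleft() if self.deque else None

    def _steal(self):
        victim = random.choice(self.workers)
        if victim is self:
            return None
        with victim.lock:
            # Steal the oldest task: it tends to represent the most work.
            return victim.deque.pop() if victim.deque else None

    def run(self):
        while True:
            task = self._pop_local() or self._steal()
            if task is None:
                time.sleep(0.001)       # back off instead of busy-spinning
            else:
                task(self)              # the task may push follow-up work

if __name__ == "__main__":
    workers = []
    workers.extend(Worker(i, workers) for i in range(4))

    def job(worker, n=0):
        if n < 3:  # each task spawns two children onto the local deque
            worker.push(lambda w: job(w, n + 1))
            worker.push(lambda w: job(w, n + 1))
        print(f"task depth {n} ran on worker {worker.wid}")

    workers[0].push(lambda w: job(w))   # seed all work on one worker
    for w in workers:
        w.start()
    time.sleep(0.5)                     # let stealing spread the load
```

The sketch also illustrates the abstract's point: every steal pays a synchronization cost, so when each task carries very little work, stealing overhead dominates useful computation, which is precisely the regime where partitioning the process graph up front can win.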


References

  1. Lee, E.A.: The problem with threads. Computer 39(5), 33–42 (2006)
  2. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Proceedings of the Symposium on Operating Systems Design & Implementation (OSDI), Berkeley, CA, USA, p. 10. USENIX Association (2004)
  3. Valvag, S.V., Johansen, D.: Oivos: Simple and efficient distributed data processing. In: 10th IEEE International Conference on High Performance Computing and Communications (HPCC 2008), September 2008, pp. 113–122 (2008)
  4. Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: Proceedings of the ACM SIGOPS/EuroSys European Conference on Computer Systems, pp. 59–72. ACM, New York (2007)
  5. Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G., Kozyrakis, C.: Evaluating MapReduce for multi-core and multiprocessor systems. In: Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), Washington, DC, USA, pp. 13–24. IEEE Computer Society, Los Alamitos (2007)
  6. de Kruijf, M., Sankaralingam, K.: MapReduce for the Cell BE architecture. University of Wisconsin Computer Sciences Technical Report CS-TR-2007-1625 (2007)
  7. He, B., Fang, W., Luo, Q., Govindaraju, N.K., Wang, T.: Mars: a MapReduce framework on graphics processors. In: PACT 2008: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pp. 260–269. ACM, New York (2008)
  8. Vrba, Ž., Halvorsen, P., Griwodz, C.: Evaluating the run-time performance of Kahn process network implementation techniques on shared-memory multiprocessors. In: Proceedings of the International Workshop on Multi-Core Computing Systems (MuCoCoS) (2009)
  9. Arora, N.S., Blumofe, R.D., Plaxton, C.G.: Thread scheduling for multiprogrammed multiprocessors. In: Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (SPAA), pp. 119–129. ACM, New York (1998)
  10. Catalyurek, U., Boman, E., Devine, K., Bozdag, D., Heaphy, R., Riesen, L.: Hypergraph-based dynamic load balancing for adaptive scientific computations. In: Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007). IEEE, Los Alamitos (2007); Best Algorithms Paper Award
  11. Kahn, G.: The semantics of a simple language for parallel programming. In: Information Processing 74 (1974)
  12. Blumofe, R.D., Joerg, C.F., Kuszmaul, B.C., Leiserson, C.E., Randall, K.H., Zhou, Y.: Cilk: An efficient multithreaded runtime system. Technical report, Cambridge, MA, USA (1996)
  13. Blumofe, R.D., Papadopoulos, D.: The performance of work stealing in multiprogrammed environments (extended abstract). SIGMETRICS Perform. Eval. Rev. 26(1), 266–267 (1998)
  14. Saha, B., Adl-Tabatabai, A.R., Ghuloum, A., Rajagopalan, M., Hudson, R.L., Petersen, L., Menon, V., Murphy, B., Shpeisman, T., Sprangle, E., Rohillah, A., Carmean, D., Fang, J.: Enabling scalability and performance in a large scale CMP environment. SIGOPS Oper. Syst. Rev. 41(3), 73–86 (2007)
  15. Frigo, M., Leiserson, C.E., Randall, K.H.: The implementation of the Cilk-5 multithreaded language. In: Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation, Montreal, Quebec, Canada, June 1998, pp. 212–223 (1998); also in ACM SIGPLAN Notices 33(5) (May 1998)
  16. Catalyurek, U.V., Aykanat, C.: Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication. IEEE Transactions on Parallel and Distributed Systems 10(7), 673–693 (1999)
  17. Richardson, I.E.G.: H.264/MPEG-4 Part 10 white paper, http://www.vcodex.com/files/h264_overview_orig.pdf
  18. Mann, H.B., Whitney, D.R.: On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics (1947)
  19. Chevalier, C., Pellegrini, F.: PT-Scotch: A tool for efficient parallel graph ordering. Parallel Comput. 34(6-8), 318–331 (2008)

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Željko Vrba (1, 2)
  • Håvard Espeland (1, 2)
  • Pål Halvorsen (1, 2)
  • Carsten Griwodz (1, 2)
  1. Simula Research Laboratory, Oslo
  2. Department of Informatics, University of Oslo
