Cluster Computing, Volume 16, Issue 2, pp 299–319

Juggle: addressing extrinsic load imbalances in SPMD applications on multicore computers

  • Steven Hofmeyr
  • Juan A. Colmenares
  • Costin Iancu
  • John Kubiatowicz


We investigate proactive dynamic load balancing on multicore systems, in which threads are continually migrated to reduce the impact of processor/thread mismatches. Our goal is to enhance the flexibility of the SPMD-style programming model and enable SPMD applications to run efficiently in multiprogrammed environments. We present Juggle, a practical decentralized, user-space implementation of a proactive load balancer that emphasizes portability and usability. In this paper we assume perfect intrinsic load balance and focus on extrinsic imbalances caused by OS noise, multiprogramming, and mismatches of threads to hardware parallelism. Juggle shows performance improvements of up to 80% over static load balancing for oversubscribed UPC, OpenMP, and pthreads benchmarks. We also show that Juggle is effective in unpredictable, multiprogrammed environments, with up to a 50% performance improvement over the Linux load balancer and a 25% reduction in performance variation. We analyze the impact of Juggle on parallel applications and derive lower bounds and approximations for thread completion times. We show that results from Juggle closely match theoretical predictions across a variety of architectures, including NUMA and hyper-threaded systems.
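To make the thread-to-core mismatch concrete, the following is a minimal back-of-the-envelope sketch, not the paper's derivation or Juggle's algorithm. It assumes identical cores, equal work per thread (perfect intrinsic balance), fair time-sharing on each core, a barrier at the end, and zero migration cost. The function names `static_completion` and `ideal_balanced_completion` are hypothetical, chosen only for illustration.

```python
import math


def static_completion(n_threads: int, n_cores: int, work: float) -> float:
    """Time to reach the barrier when threads stay pinned to their cores.

    Assumption: threads are spread as evenly as possible, so the busiest
    core hosts ceil(n_threads / n_cores) threads that share it fairly.
    The slowest thread determines when the barrier is reached.
    """
    if n_threads <= 0 or n_cores <= 0:
        raise ValueError("n_threads and n_cores must be positive")
    threads_on_busiest_core = math.ceil(n_threads / n_cores)
    return work * threads_on_busiest_core


def ideal_balanced_completion(n_threads: int, n_cores: int, work: float) -> float:
    """Idealized completion time if every thread progresses at the same rate.

    Assumption: continual migration gives each thread an equal share of the
    total core time at no cost, so each thread runs at min(1, cores/threads)
    of full speed. Real migration overheads would make this larger.
    """
    if n_threads <= 0 or n_cores <= 0:
        raise ValueError("n_threads and n_cores must be positive")
    return work * max(1.0, n_threads / n_cores)


if __name__ == "__main__":
    # Example: 3 threads on 2 cores, one unit of work per thread.
    pinned = static_completion(3, 2, 1.0)
    balanced = ideal_balanced_completion(3, 2, 1.0)
    print(f"pinned: {pinned}, ideally balanced: {balanced}")
```

In this toy model the gap between the two values is largest when the thread count is just above a multiple of the core count, and it vanishes when the threads divide evenly among the cores.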


Proactive load balancing; Parallel programming; Single-program multiple-data parallelism; Operating system; Multicore



The authors acknowledge the support of DOE Grant #DE-FG02-08ER25849. Juan Colmenares and John Kubiatowicz acknowledge support of Microsoft (Award #024263), Intel (Award #024894), matching U.C. Discovery funding (Award #DIG07-102270), and additional support from Par Lab affiliates National Instruments, NEC, Nokia, NVIDIA, Samsung, and Sun Microsystems. No part of this paper represents the views and opinions of the sponsors.



Copyright information

© Springer Science + Business Media, LLC (outside the USA) 2012

Authors and Affiliations

  • Steven Hofmeyr (1)
  • Juan A. Colmenares (2)
  • Costin Iancu (1)
  • John Kubiatowicz (2)

  1. Lawrence Berkeley National Laboratory, Berkeley, USA
  2. Parallel Computing Laboratory, UC Berkeley, Berkeley, USA
