Just in Time Load Balancing

  • Conference paper
Languages and Compilers for Parallel Computing (LCPC 2012)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7760)

Abstract

Leveraging Loop Level Parallelism (LLP) is one of the most attractive techniques for improving program performance on emerging multi-cores. Ordinary programs contain a large number of parallel (DOALL) loops. However, emerging multi-core designs feature a rapid increase both in the number of on-chip cores and in the ways such cores share on-chip resources, such as the pipeline and the memory hierarchy, which leads to a growing number of possible high-performance configurations. This trend in emerging multi-core design makes attaining peak performance through the exploitation of LLP an increasingly complex problem.

In this paper, we propose a new iteration scheduling technique to speed up the execution of DOALL loops on complex multi-core systems. Our technique targets DOALL loops with a variable cost per iteration that exhibit either predictable or unpredictable behavior across multiple instances of the loop. In the predictable case, our technique runs a quick run-time pass to identify chunks of iterations containing the same amount of work, followed by a static assignment of such chunks to cores. If static parallel execution is not profitable, our technique can instead run the loop either sequentially or in parallel with dynamic scheduling, selecting an appropriate chunk size to optimize performance.
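
The paper's actual GCC/OpenMP implementation is not reproduced on this page; purely as an illustration of the two execution modes described above, the C/OpenMP sketch below builds chunks of roughly equal estimated work for a static assignment, and otherwise falls back to a dynamic schedule with a tuned chunk size. The helpers estimate_iteration_cost and work, and all parameter names, are hypothetical placeholders rather than the paper's API.

    #include <omp.h>
    #include <stdlib.h>

    /* Hypothetical per-iteration cost model and DOALL loop body. */
    extern double estimate_iteration_cost(long i);  /* quick run-time estimate */
    extern void   work(long i);                     /* independent iteration   */

    /* Mode 1 (predictable cost): a quick run-time pass builds chunks of
     * roughly equal total cost, which are then statically assigned to cores. */
    static void run_static_balanced(long n, int nthreads)
    {
        double *prefix = malloc((n + 1) * sizeof *prefix);
        prefix[0] = 0.0;
        for (long i = 0; i < n; ++i)                /* cheap sequential pass */
            prefix[i + 1] = prefix[i] + estimate_iteration_cost(i);
        double total = prefix[n];

        #pragma omp parallel num_threads(nthreads)
        {
            int t = omp_get_thread_num();
            /* Each thread takes the iteration range whose cumulative cost
             * covers its 1/nthreads share of the total estimated work. */
            double lo_cost = total * t / nthreads;
            double hi_cost = total * (t + 1) / nthreads;
            long lo = 0;
            while (lo < n && prefix[lo + 1] <= lo_cost) ++lo;
            long hi = lo;
            while (hi < n && prefix[hi + 1] <= hi_cost) ++hi;
            if (t == nthreads - 1) hi = n;          /* guard against rounding */
            for (long i = lo; i < hi; ++i)
                work(i);
        }
        free(prefix);
    }

    /* Mode 2 (unpredictable cost): dynamic scheduling with a chunk size
     * chosen to trade load balance against scheduling overhead. */
    static void run_dynamic(long n, long chunk)
    {
        #pragma omp parallel for schedule(dynamic, chunk)
        for (long i = 0; i < n; ++i)
            work(i);
    }

The prefix-sum-of-costs partitioning is only one plausible way to form equal-work chunks; the paper's run-time pass and its profitability test may differ in the details.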

We implemented our technique in GNU GCC/OpenMP and demonstrate promising results on three important linear algebra kernels, matrix multiply, Gauss-Jordan elimination, and adjoint convolution, for which near-optimal speedup over existing scheduling techniques is attained. Furthermore, we demonstrate the impact of our approach on the already parallelized program 470.lbm from SPEC CPU2006, which implements the Lattice Boltzmann Method. On 470.lbm, our technique attains a speedup of up to 65% on the state-of-the-art 4-core, 2-way Simultaneous Multi-Threading Intel Sandy Bridge architecture.
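
Again as a hedged, standard-OpenMP approximation (not the authors' modified GCC runtime), the just-in-time flavor of this decision can be mimicked with schedule(runtime) and omp_set_schedule, picking sequential execution, or a scheduling policy and chunk size, immediately before the loop runs. The parallel_is_profitable check and stream_update body below are hypothetical stand-ins for a 470.lbm-style kernel.

    #include <omp.h>

    extern void stream_update(long cell);        /* hypothetical LBM-style loop body  */
    extern int  parallel_is_profitable(long n);  /* hypothetical profitability check  */

    void run_collide_stream(long ncells, long chunk)
    {
        if (!parallel_is_profitable(ncells)) {
            for (long c = 0; c < ncells; ++c)    /* fall back to sequential execution */
                stream_update(c);
            return;
        }
        /* Choose the policy and chunk size just before the loop executes;
         * schedule(runtime) makes the loop honour this run-time choice. */
        omp_set_schedule(omp_sched_dynamic, (int)chunk);

        #pragma omp parallel for schedule(runtime)
        for (long c = 0; c < ncells; ++c)
            stream_update(c);
    }

omp_set_schedule and schedule(runtime) are standard OpenMP 3.0 features; the speedups reported above come from the paper's own scheduler, not from this sketch.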


Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cammarota, R., Nicolau, A., Veidenbaum, A.V. (2013). Just in Time Load Balancing. In: Kasahara, H., Kimura, K. (eds) Languages and Compilers for Parallel Computing. LCPC 2012. Lecture Notes in Computer Science, vol 7760. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37658-0_1

  • DOI: https://doi.org/10.1007/978-3-642-37658-0_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-37657-3

  • Online ISBN: 978-3-642-37658-0

  • eBook Packages: Computer Science (R0)
