Just in Time Load Balancing

  • Conference paper
Languages and Compilers for Parallel Computing (LCPC 2012)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7760)

Abstract

Leveraging Loop Level Parallelism (LLP) is one of the most attractive techniques for improving program performance on emerging multi-cores. Ordinary programs contain a large number of parallel (DOALL) loops. However, emerging multi-core designs feature a rapid increase both in the number of on-chip cores and in the ways such cores share on-chip resources, such as the pipeline and the memory hierarchy, which leads to a growing number of possible high-performance configurations. This trend in emerging multi-core design makes attaining peak performance through the exploitation of LLP an increasingly complex problem.

In this paper, we propose a new iteration scheduling technique to speed up the execution of DOALL loops on complex multi-core systems. Our technique targets DOALL loops with a variable cost per iteration that exhibit either predictable or unpredictable behavior across multiple instances of the loop. In the predictable case, our technique runs a quick run-time pass to identify chunks of iterations containing the same amount of work, followed by a static assignment of such chunks to cores. If static parallel execution is not profitable, our technique can instead run the loop either sequentially or in parallel with dynamic scheduling, selecting an appropriate chunk size to optimize performance.
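
The paper's actual GCC/OpenMP implementation is not reproduced on this page; purely as an illustration of the two execution modes described above, the C/OpenMP sketch below builds chunks of roughly equal estimated work for a static assignment, and otherwise falls back to a dynamic schedule with a tuned chunk size. The helpers estimate_iteration_cost and work, and all parameter names, are hypothetical placeholders rather than the paper's API.

    #include <omp.h>
    #include <stdlib.h>

    /* Hypothetical per-iteration cost model and DOALL loop body. */
    extern double estimate_iteration_cost(long i);  /* quick run-time estimate */
    extern void   work(long i);                     /* independent iteration   */

    /* Mode 1 (predictable cost): a quick run-time pass builds chunks of
     * roughly equal total cost, which are then statically assigned to cores. */
    static void run_static_balanced(long n, int nthreads)
    {
        double *prefix = malloc((n + 1) * sizeof *prefix);
        prefix[0] = 0.0;
        for (long i = 0; i < n; ++i)                /* cheap sequential pass */
            prefix[i + 1] = prefix[i] + estimate_iteration_cost(i);
        double total = prefix[n];

        #pragma omp parallel num_threads(nthreads)
        {
            int t = omp_get_thread_num();
            /* Each thread takes the iteration range whose cumulative cost
             * covers its 1/nthreads share of the total estimated work. */
            double lo_cost = total * t / nthreads;
            double hi_cost = total * (t + 1) / nthreads;
            long lo = 0;
            while (lo < n && prefix[lo + 1] <= lo_cost) ++lo;
            long hi = lo;
            while (hi < n && prefix[hi + 1] <= hi_cost) ++hi;
            if (t == nthreads - 1) hi = n;          /* guard against rounding */
            for (long i = lo; i < hi; ++i)
                work(i);
        }
        free(prefix);
    }

    /* Mode 2 (unpredictable cost): dynamic scheduling with a chunk size
     * chosen to trade load balance against scheduling overhead. */
    static void run_dynamic(long n, long chunk)
    {
        #pragma omp parallel for schedule(dynamic, chunk)
        for (long i = 0; i < n; ++i)
            work(i);
    }

The prefix-sum-of-costs partitioning is only one plausible way to form equal-work chunks; the paper's run-time pass and its profitability test may differ in the details.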

We implemented our technique in GNU GCC/OpenMP and demonstrate promising results on three important linear algebra kernels, matrix multiply, Gauss-Jordan elimination, and adjoint convolution, for which near-optimal speedup over existing scheduling techniques is attained. Furthermore, we demonstrate the impact of our approach on the already parallelized program 470.lbm from SPEC CPU2006, which implements the Lattice Boltzmann Method. On 470.lbm, our technique attains a speedup of up to 65% on the state-of-the-art 4-core, 2-way Simultaneous Multi-Threading Intel Sandy Bridge architecture.
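
Again as a hedged, standard-OpenMP approximation (not the authors' modified GCC runtime), the just-in-time flavor of this decision can be mimicked with schedule(runtime) and omp_set_schedule, picking sequential execution, or a scheduling policy and chunk size, immediately before the loop runs. The parallel_is_profitable check and stream_update body below are hypothetical stand-ins for a 470.lbm-style kernel.

    #include <omp.h>

    extern void stream_update(long cell);        /* hypothetical LBM-style loop body  */
    extern int  parallel_is_profitable(long n);  /* hypothetical profitability check  */

    void run_collide_stream(long ncells, long chunk)
    {
        if (!parallel_is_profitable(ncells)) {
            for (long c = 0; c < ncells; ++c)    /* fall back to sequential execution */
                stream_update(c);
            return;
        }
        /* Choose the policy and chunk size just before the loop executes;
         * schedule(runtime) makes the loop honour this run-time choice. */
        omp_set_schedule(omp_sched_dynamic, (int)chunk);

        #pragma omp parallel for schedule(runtime)
        for (long c = 0; c < ncells; ++c)
            stream_update(c);
    }

omp_set_schedule and schedule(runtime) are standard OpenMP 3.0 features; the speedups reported above come from the paper's own scheduler, not from this sketch.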


Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cammarota, R., Nicolau, A., Veidenbaum, A.V. (2013). Just in Time Load Balancing. In: Kasahara, H., Kimura, K. (eds) Languages and Compilers for Parallel Computing. LCPC 2012. Lecture Notes in Computer Science, vol 7760. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37658-0_1

  • DOI: https://doi.org/10.1007/978-3-642-37658-0_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-37657-3

  • Online ISBN: 978-3-642-37658-0

  • eBook Packages: Computer Science (R0)
