Abstract
Leveraging Loop-Level Parallelism (LLP) is one of the most attractive techniques for improving program performance on emerging multi-cores. Ordinary programs contain a large number of parallel (DOALL) loops. However, emerging multi-core designs feature a rapid increase in the number of on-chip cores, and the ways such cores share on-chip resources, such as the pipeline and the memory hierarchy, lead to an increase in the number of possible high-performance configurations. This trend in emerging multi-core design makes attaining peak performance through the exploitation of LLP an increasingly complex problem.
In this paper, we propose a new iteration scheduling technique to speed up the execution of DOALL loops on complex multi-core systems. Our technique targets DOALL loops whose cost per iteration varies and whose behavior across multiple instances of the loop is either predictable or unpredictable. In the former case, our technique runs a quick run-time pass to identify chunks of iterations containing the same amount of work, followed by a static assignment of those chunks to cores. If static parallel execution is not profitable, our technique can instead run the loop either sequentially or in parallel with dynamic scheduling, selecting an appropriate chunk size to optimize performance.
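A minimal sketch of the predictable case, in C with OpenMP. Everything here is an illustrative assumption rather than the paper's implementation: the linear cost model stands in for the quick run-time pass, and the prefix-sum partitioning stands in for identifying equal-work chunks before the static assignment to cores.

```c
/* Sketch: equal-work chunking plus static assignment for a DOALL loop
 * with variable (here, triangular) per-iteration cost. Hypothetical
 * code; the paper's run-time pass lives inside GCC/OpenMP support. */
#include <omp.h>
#include <stdio.h>

#define N 1024

/* Variable cost per iteration: iteration i does O(i) work,
 * as in Gauss-Jordan elimination or adjoint convolution. */
static double work(int i) {
    double s = 0.0;
    for (int j = 0; j <= i; j++) s += (double)j * 1e-6;
    return s;
}

int main(void) {
    static double out[N];
    double cost[N], total = 0.0;

    /* Quick run-time pass (assumed cost model): estimate the relative
     * cost of each iteration. A real pass would time a sampled
     * instance of the loop instead of assuming linear growth. */
    for (int i = 0; i < N; i++) { cost[i] = i + 1.0; total += cost[i]; }

    int nthreads = omp_get_max_threads();
    double target = total / nthreads;   /* equal work per core */

    /* Partition [0, N) into nthreads chunks of roughly equal work. */
    int start[nthreads + 1];
    start[0] = 0;
    double acc = 0.0;
    int t = 1;
    for (int i = 0; i < N && t < nthreads; i++) {
        acc += cost[i];
        if (acc >= target) { start[t++] = i + 1; acc = 0.0; }
    }
    while (t <= nthreads) start[t++] = N;

    /* Static assignment: each thread runs exactly one equal-work chunk. */
    #pragma omp parallel num_threads(nthreads)
    {
        int id = omp_get_thread_num();
        for (int i = start[id]; i < start[id + 1]; i++)
            out[i] = work(i);
    }

    printf("out[%d] = %f\n", N - 1, out[N - 1]);
    return 0;
}
```

With equal-work chunks, the load imbalance caused by the triangular iteration costs disappears even though each thread receives a different number of iterations.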
We implemented our technique in GNU GCC/OpenMP and demonstrate promising results on three important linear algebra kernels (matrix multiply, Gauss-Jordan elimination, and adjoint convolution), for which near-optimal speedup over existing scheduling techniques is attained. Furthermore, we demonstrate the impact of our approach on the already parallelized program 470.lbm from SPEC CPU2006, which implements the Lattice Boltzmann Method. On 470.lbm, our technique attains a speedup of up to 65% on a state-of-the-art 4-core, 2-way Simultaneous Multi-Threading Intel Sandy Bridge architecture.
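When behavior across loop instances is unpredictable, the fallback the abstract describes maps onto OpenMP's standard dynamic schedule. A hedged example of that clause follows; the chunk size 8 and the stencil body are illustrative placeholders, not 470.lbm's kernel or the paper's chunk-selection policy.

```c
/* Fallback path: plain OpenMP dynamic scheduling. The chunk size is
 * illustrative; the paper's technique selects it at run time. */
#include <omp.h>

void relax(const double *src, double *dst, int n) {
    #pragma omp parallel for schedule(dynamic, 8)
    for (int i = 1; i < n - 1; i++)
        dst[i] = 0.5 * (src[i - 1] + src[i + 1]);  /* stand-in stencil */
}
```

OpenMP's schedule(runtime) clause, together with the OMP_SCHEDULE environment variable, offers a standard way to change the policy and chunk size without recompiling, which is one hook a run-time selector can use.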
References
Henning, J.L.: Spec cpu2000: Measuring cpu performance in the new millennium. IEEE Computer 33(7), 28–35 (2000)
Henning, J.L.: SPEC CPU2006 benchmark descriptions. SIGARCH Computer Architecture News 34(4), 1–17 (2006)
Lundstrom, S.F., Barnes, G.H.: A controllable MIMD architecture. In: Advanced Computer Architecture, IEEE Computer Society Press, Los Alamitos (1986)
Polychronopoulos, C.D., Kuck, D.J.: Guided self-scheduling: A practical scheduling scheme for parallel supercomputers. IEEE Trans. Comput. 36(12), 1425–1439 (1987)
Hummel, S., Schonberg, E., Flynn, L.E.: Factoring: a method for scheduling parallel loops. Commun. ACM 35(8), 90–101 (1992)
Lucco, S.: A dynamic scheduling technique for irregular parallel programs, pp. 200–211 (1992)
Tzen, T.H., Ni, L.M.: Trapezoid self-scheduling: A practical scheduling scheme for parallel compilers. IEEE Trans. Parallel Distrib. Syst. 4(1), 87–98 (1993)
Yue, K.K., Lilja, D.J.: Parameter estimation for a generalized parallel loop scheduling algorithm. In: HICSS, p. 187 (1995)
Hancock, D.J., Ford, R.W., Freeman, T.L., Bull, J.M.: An investigation of feedback guided dynamic scheduling of nested loops. In: Proceedings of the International Workshop on Parallel Processing (2000)
Kejariwal, A., Nicolau, A., Banerjee, U., Veidenbaum, A.V., Polychronopoulos, C.D.: Cache-aware partitioning of multi-dimensional iteration spaces. In: Proceedings of SYSTOR (2009)
Williams, S., Waterman, A., Patterson, D.: Roofline: an insightful visual performance model for multicore architectures. Commun. ACMÂ 52(4) (2009)
Openmp, http://www.openmp.org
Gnu gcc v4.6, http://gcc.gnu.org/gcc-4.6/
Intel compilers, http://software.intel.com/en-us/articles/intel-compilers/
Aslot, V., Domeika, M., Eigenmann, R., Gaertner, G., Jones, W.B., Parady, B.: SPEComp: A New Benchmark Suite for Measuring Parallel Computer Performance. In: Eigenmann, R., Voss, M.J. (eds.) WOMPAT 2001. LNCS, vol. 2104, pp. 1–10. Springer, Heidelberg (2001)
Zhang, Y., Voss, M.: Runtime empirical selection of loop schedulers on hyperthreaded smps. In: 19th International Parallel and Distributed Processing Symposium (2005)
Bull, J.M., O’Neill, D.: A microbenchmark suite for openmp 2.0. SIGARCH Comput. Archit. News 29, 41–48 (2001)
Novillo, D.: Openmp and automatic parallelization in gcc. In: GCC Developers Summit (2006)
Mucci, P.J., Browne, S., Deane, C., Ho, G.: Papi: A portable interface to hardware performance counters. In: Proceedings of the Department of Defense HPCMP Users Group Conference, pp. 7–10 (1999)
Kernighan, B.W.: The C Programming Language, 2nd edn. Prentice Hall Professional Technical Reference (1988)
Pohl, T., Kowarschik, M., Wilke, J., Iglberger, K., Rüde, U.: Optimization and profiling of the cache performance of parallel lattice boltzmann codes. Parallel Processing Letters 13(4) (2003)
Flatt, H.P., Kennedy, K.: Performance of parallel processors. Parallel Computing 12(1), 1–20 (1989)
Lamport, L.: The Hyperplane Method for an Array Computer. In: Tse-Yun, F. (ed.) Parallel Processing. LNCS, vol. 24, pp. 113–131. Springer, Heidelberg (1975)
Banerjee, U.: Loop transformations for restructuring compilers - the foundations. Kluwer (1993)
Kruskal, C.P., Weiss, A.: Allocating independent subtasks on parallel processors. IEEE Trans. Softw. Eng. 11(10) (1985)
Aycock, J.: A brief history of just-in-time. ACM Comput. Surv. 35(2), 97–113 (2003)
Rauchwerger, L., Amato, N.M., Padua, D.A.: A scalable method for run-time loop parallelization. International Journal of Parallel Programming 23(6) (1995)
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cammarota, R., Nicolau, A., Veidenbaum, A.V. (2013). Just in Time Load Balancing. In: Kasahara, H., Kimura, K. (eds) Languages and Compilers for Parallel Computing. LCPC 2012. Lecture Notes in Computer Science, vol 7760. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37658-0_1
DOI: https://doi.org/10.1007/978-3-642-37658-0_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37657-3
Online ISBN: 978-3-642-37658-0