Static Compilation Analysis for Host-Accelerator Communication Optimization

Conference paper
Languages and Compilers for Parallel Computing (LCPC 2011)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 7146)

Abstract

We present an automatic, static program transformation that schedules and generates efficient memory transfers between a computer host and its hardware accelerator, addressing a well-known performance bottleneck. Our automatic approach uses two simple heuristics: to perform transfers to the accelerator as early as possible and to delay transfers back from the accelerator as late as possible. We implemented this transformation as a middle-end compilation pass in the PIPS/Par4All compiler. In the generated code, redundant communications due to data reuse between kernel executions are avoided. Instructions that initiate transfers are scheduled effectively at compile time. We present experimental results obtained with Polybench 2.0, some Rodinia benchmarks, and a real numerical simulation. We obtain an average speedup of 4 to 5 over a naïve parallelization using a modern GPU with Par4All, HMPP, and PGI, and of 3.5 over an OpenMP version running on a 12-core multiprocessor.
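
The transformation itself is a middle-end pass in PIPS/Par4All, and the paper evaluates its effect on generated host code. As a minimal sketch of that effect (not the compiler's actual output, and using hypothetical names relax_naive, relax_optimized, and relax_kernel), the C/CUDA fragment below contrasts a naïve per-kernel transfer schedule with the schedule the two heuristics would produce:

    /* Hypothetical example: an iterative computation whose data is
       reused across kernel launches. */
    #include <cuda_runtime.h>
    #include <stddef.h>

    void relax_naive(float *a, size_t n, int iters) {
        float *d_a;
        cudaMalloc((void **)&d_a, n * sizeof(float));
        for (int t = 0; t < iters; ++t) {
            /* Naive schedule: copy in and out around every launch,
               even though the host never touches 'a' inside the loop. */
            cudaMemcpy(d_a, a, n * sizeof(float), cudaMemcpyHostToDevice);
            /* relax_kernel<<<grid, block>>>(d_a, n);  -- launch elided */
            cudaMemcpy(a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
        }
        cudaFree(d_a);
    }

    void relax_optimized(float *a, size_t n, int iters) {
        float *d_a;
        cudaMalloc((void **)&d_a, n * sizeof(float));
        /* First heuristic: issue the host-to-accelerator transfer as
           early as possible, hoisting it out of the loop. */
        cudaMemcpy(d_a, a, n * sizeof(float), cudaMemcpyHostToDevice);
        for (int t = 0; t < iters; ++t) {
            /* relax_kernel<<<grid, block>>>(d_a, n);  -- launch elided */
        }
        /* Second heuristic: delay the transfer back until the host
           actually needs the data, sinking it below the loop. Together
           the two moves remove 2*(iters-1) redundant copies. */
        cudaMemcpy(a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_a);
    }

In the paper the scheduling decisions are made statically at compile time over whole programs; the sketch only illustrates the data-reuse case the abstract describes.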

References

  1. Amini, M., Ancourt, C., Coelho, F., Creusillet, B., Guelton, S., Irigoin, F., Jouvelot, P., Keryell, R., Villalon, P.: PIPS is not (just) polyhedral software. In: 1st International Workshop on Polyhedral Compilation Techniques, Impact (in Conjunction with CGO 2011) (April 2011)

  2. Ancourt, C., Coelho, F., Irigoin, F., Keryell, R.: A linear algebra framework for static High Performance Fortran code distribution. Scientific Programming 6(1), 3–27 (1997)

  3. Aubert, D., Amini, M., David, R.: A Particle-Mesh Integrator for Galactic Dynamics Powered by GPGPUs. In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2009, Part I. LNCS, vol. 5544, pp. 874–883. Springer, Heidelberg (2009)

  4. Augonnet, C., Thibault, S., Namyst, R., Wacrenier, P.-A.: StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience 23, 187–198 (2011)

  5. Bodin, F., Bihan, S.: Heterogeneous multicore parallel programming for graphics processing units. Sci. Program. 17, 325–336 (2009)

  6. Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J.W., Lee, S.H., Skadron, K.: Rodinia: A benchmark suite for heterogeneous computing. In: IEEE International Symposium on Workload Characterization (2009)

  7. Chen, Y., Cui, X., Mei, H.: Large-scale FFT on GPU clusters. In: 24th ACM International Conference on Supercomputing, ICS 2010 (2010)

  8. Creusillet, B., Irigoin, F.: Interprocedural array region analyses. Int. J. Parallel Program. 24(6), 513–546 (1996)

  9. Datta, K., Murphy, M., Volkov, V., Williams, S., Carter, J., Oliker, L., Patterson, D., Shalf, J., Yelick, K.: Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (2008)

  10. Fang, W., He, B., Luo, Q.: Database compression on graphics processors. Proc. VLDB Endow. 3, 670–680 (2010)

  11. Feautrier, P.: Parametric integer programming. RAIRO Recherche Opérationnelle 22 (1988)

  12. Gerndt, H.M., Zima, H.P.: Optimizing Communication in SUPERB. In: Burkhart, H. (ed.) CONPAR 1990 and VAPP 1990. LNCS, vol. 457, pp. 300–311. Springer, Heidelberg (1990)

  13. Gong, C., Gupta, R., Melhem, R.: Compilation techniques for optimizing communication on distributed-memory systems. In: ICPP 1993 (1993)

  14. Han, T.D., Abdelrahman, T.S.: hiCUDA: a high-level directive-based language for GPU programming. In: Proceedings of GPGPU-2. ACM (2009)

  15. HPC Project. Par4All automatic parallelization, http://www.par4all.org

  16. Huang, W., Ghosh, S., Velusamy, S., Sankaranarayanan, K., Skadron, K., Stan, M.R.: HotSpot: a compact thermal modeling methodology for early-stage VLSI design. IEEE Trans. Very Large Scale Integr. Syst. (May 2006)

  17. Irigoin, F., Jouvelot, P., Triolet, R.: Semantical interprocedural parallelization: an overview of the PIPS project. In: ICS 1991, pp. 244–251 (1991)

  18. Jablin, T.B., Prabhu, P., Jablin, J.A., Johnson, N.P., Beard, S.R., August, D.I.: Automatic CPU-GPU communication management and optimization. In: Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2011, pp. 142–151. ACM, New York (2011)

  19. Lee, S., Eigenmann, R.: OpenMPC: Extended OpenMP programming and tuning for GPUs. In: SC 2010, pp. 1–11 (2010)

  20. Lee, S., Min, S.-J., Eigenmann, R.: OpenMP to GPGPU: a compiler framework for automatic translation and optimization. In: PPoPP (2009)

  21. Ohshima, S., Hirasawa, S., Honda, H.: OMPCUDA: OpenMP Execution Framework for CUDA Based on Omni OpenMP Compiler. In: Sato, M., Hanawa, T., Müller, M.S., Chapman, B.M., de Supinski, B.R. (eds.) IWOMP 2010. LNCS, vol. 6132, pp. 161–173. Springer, Heidelberg (2010)

  22. Pouchet, L.-N.: The Polyhedral Benchmark suite 2.0 (March 2011)

  23. Wolfe, M.: Implementing the PGI accelerator model. In: GPGPU (2010)

  24. Yan, Y., Grossman, M., Sarkar, V.: JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs with CUDA. In: Sips, H., Epema, D., Lin, H.-X. (eds.) Euro-Par 2009. LNCS, vol. 5704, pp. 887–899. Springer, Heidelberg (2009)

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Amini, M., Coelho, F., Irigoin, F., Keryell, R. (2013). Static Compilation Analysis for Host-Accelerator Communication Optimization. In: Rajopadhye, S., Mills Strout, M. (eds) Languages and Compilers for Parallel Computing. LCPC 2011. Lecture Notes in Computer Science, vol 7146. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36036-7_16

  • DOI: https://doi.org/10.1007/978-3-642-36036-7_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-36035-0

  • Online ISBN: 978-3-642-36036-7

  • eBook Packages: Computer Science (R0)
