Abstract
Loop tiling is a loop transformation widely used to improve spatial and temporal data locality, to increase computation granularity, and to enable blocking algorithms, which are particularly useful when offloading kernels to computing units with small memories. When caches are not available or not used, data transfers and local storage must be software-managed, and useless remote communications can be avoided by exploiting data reuse between tiles. An important parameter of tiling is the size of the tiles, which impacts the size of the required local memory. However, for most analyses involving several tiles, as is the case for inter-tile data reuse, the tile sizes induce non-linear constraints unless they are numerical constants. This complicates, or even prevents, a parametric analysis with polyhedral optimization techniques.
This paper shows that, when tiles are executed in sequence along the tile axes, the parametric (with respect to tile sizes) analysis for inter-tile data reuse is nevertheless possible, i.e., one can determine, at compile-time and in a parametric fashion, the copy-in and copy-out data sets for all tiles, with inter-tile reuse, as well as the sizes of the induced local memories. When transfers are approximated, the situation is much more complex and requires a careful analysis to guarantee correctness when data are both read and written. We provide the mathematical foundations that make such approximations possible. Combined with hierarchical tiling, this result opens perspectives for the automatic generation of blocking algorithms, guided by parametric cost models, where blocks can be pipelined and/or can contain parallelism. Previous work on FPGAs and GPUs already showed the interest and feasibility of such automation with tiling, but in a non-parametric fashion.
Improved version of the IMPACT'14 paper (impact.gforge.inria.fr/impact2014).
© 2015 Springer-Verlag Berlin Heidelberg
Darte, A., Isoard, A. (2015). Exact and Approximated Data-Reuse Optimizations for Tiling with Parametric Sizes. In: Franke, B. (eds) Compiler Construction. CC 2015. Lecture Notes in Computer Science(), vol 9031. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46663-6_8