Abstract
Loop tiling is a loop transformation widely used to improve spatial and temporal data locality, to increase computation granularity, and to enable blocking algorithms, which are particularly useful when offloading kernels to computing units with small memories. When caches are not available or not used, data transfers and local storage must be software-managed, and useless remote communications can be avoided by exploiting data reuse between tiles. An important parameter of tiling is the size of the tiles, which impacts the size of the required local memory. However, for most analyses involving several tiles, as is the case for inter-tile data reuse, the tile sizes induce non-linear constraints unless they are numerical constants. This complicates, or even prevents, a parametric analysis with polyhedral optimization techniques.
This paper shows that, when tiles are executed in sequence along the tile axes, the parametric (with respect to tile sizes) analysis for inter-tile data reuse is nevertheless possible, i.e., one can determine, at compile-time and in a parametric fashion, the copy-in and copy-out data sets for all tiles, with inter-tile reuse, as well as the sizes of the induced local memories. When transfers are approximated, the situation is much more complex and requires a careful analysis to guarantee correctness when data are both read and written. We provide the mathematical foundations that make such approximations possible. Combined with hierarchical tiling, this result opens perspectives for the automatic generation of blocking algorithms, guided by parametric cost models, where blocks can be pipelined and/or can contain parallelism. Previous work on FPGAs and GPUs already showed the interest and feasibility of such automation with tiling, but in a non-parametric fashion.
Improved version of the IMPACT'14 paper (impact.gforge.inria.fr/impact2014).
© 2015 Springer-Verlag Berlin Heidelberg
Darte, A., Isoard, A. (2015). Exact and Approximated Data-Reuse Optimizations for Tiling with Parametric Sizes. In: Franke, B. (eds) Compiler Construction. CC 2015. Lecture Notes in Computer Science(), vol 9031. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46663-6_8