Exact and Approximated Data-Reuse Optimizations for Tiling with Parametric Sizes

  • Alain Darte
  • Alexandre Isoard
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9031)

Abstract

Loop tiling is a loop transformation widely used to improve spatial and temporal data locality, to increase computation granularity, and to enable blocking algorithms, which are particularly useful when offloading kernels to computing units with smaller memories. When caches are not available or not used, data transfers and local storage must be software-managed, and some useless remote communications can be avoided by exploiting data reuse between tiles. An important parameter of tiling is the tile sizes, which determine the size of the required local memory. However, for most analyses involving several tiles, as is the case for inter-tile data reuse, the tile sizes induce non-linear constraints, unless they are numerical constants. This complicates or prevents a parametric analysis with polyhedral optimization techniques.
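To see why parametric tile sizes break linearity, consider the classic rectangular tiling sketched below (a minimal hypothetical C example, not taken from the paper, with illustrative names Ti and Tj): iteration (i, j) belongs to tile (bi, bj) iff bi*Ti <= i < (bi+1)*Ti and bj*Tj <= j < (bj+1)*Tj. These constraints are affine when Ti and Tj are constants, but the products bi*Ti and bj*Tj multiply two unknowns when the tile sizes are symbolic parameters, so the tiled iteration domain is no longer a polyhedron.

    /* Rectangular tiling of a 2D loop nest with symbolic tile sizes
     * Ti x Tj (illustrative names).  The tile-origin expressions
     * bi*Ti and bj*Tj are quadratic in the unknowns, so the per-tile
     * iteration domain is not polyhedral in (bi, bj, i, j, Ti, Tj). */
    void tiled_update(int N, int Ti, int Tj,
                      double A[N][N], const double B[N][N]) {
      for (int bi = 0; bi * Ti < N; bi++)
        for (int bj = 0; bj * Tj < N; bj++)
          for (int i = bi * Ti; i < (bi + 1) * Ti && i < N; i++)
            for (int j = bj * Tj; j < (bj + 1) * Tj && j < N; j++)
              A[i][j] += B[j][i];
    }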

This paper shows that, when tiles are executed in sequence along tile axes, the parametric (with respect to tile sizes) analysis for inter-tile data reuse is nevertheless possible, i.e., one can determine, at compile-time and in a parametric fashion, the copy-in and copy-out data sets for all tiles, with inter-tile reuse, as well as sizes for the induced local memories. When transfers are approximated, the situation is much more complex and requires a careful analysis to guarantee correctness when data are both read and written. We provide the mathematical foundations to make such approximations possible. Combined with hierarchical tiling, this result opens perspectives for the automatic generation of blocking algorithms, guided by parametric cost models, where blocks can be pipelined and/or can contain parallelism. Previous work on FPGAs and GPUs has already shown the interest and feasibility of such automation with tiling, but only in a non-parametric fashion.
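As a concrete illustration of inter-tile reuse along one tile axis, the hedged sketch below (our own minimal example with a fixed tile size T rather than a parametric one, not the paper's algorithm) manages a local buffer for a 1D stencil. Each tile's read set overlaps the previous tile's by two elements, so every tile after the first copies in only T new elements instead of T + 2.

    #include <stdio.h>

    enum { N = 1024, T = 64 };     /* T plays the role of the tile size */

    static double A[N], B[N];
    static double local_A[T + 2];  /* one tile of A plus a 2-element halo */

    /* Stencil B[i] = 0.5*(A[i-1] + A[i+1]) for 1 <= i < N-1, tiled in
     * sequence along i.  Tile b computes i in [b*T, (b+1)*T) and reads
     * A[b*T-1 .. (b+1)*T]; the buffer caches A[lo-1+k] in local_A[k].
     * Tile b-1 already loaded everything up to A[b*T], so tile b keeps
     * the last two buffer cells as its halo and transfers only the T
     * new elements: this is the exact per-tile copy-in set.           */
    static void run_tiles(void) {
      for (int b = 0; b < N / T; b++) {
        int lo = b * T;
        if (b == 0) {
          for (int k = 1; k <= T + 1; k++)   /* full copy-in (A[-1] unused) */
            local_A[k] = A[k - 1];
        } else {
          local_A[0] = local_A[T];           /* halo reused from tile b-1 */
          local_A[1] = local_A[T + 1];
          for (int k = 2; k <= T + 1 && lo - 1 + k < N; k++)
            local_A[k] = A[lo - 1 + k];      /* copy-in of the T new cells */
        }
        for (int i = (lo == 0 ? 1 : lo); i < lo + T && i < N - 1; i++)
          B[i] = 0.5 * (local_A[i - lo] + local_A[i - lo + 2]);
      }
    }

    int main(void) {
      for (int i = 0; i < N; i++) A[i] = i;
      run_tiles();
      printf("B[1]=%g B[N-2]=%g\n", B[1], B[N - 2]);  /* expect 1 and 1022 */
      return 0;
    }

Here the copy-out set needs no analysis because B is written in place; in the general case where data are both read and written, the paper shows that such exact per-tile load/store sets, and the local-memory sizes they induce, can still be computed with T left as a parameter.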

Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Alain Darte (1)
  • Alexandre Isoard (1)

  1. Compsys, Computer Science Lab (LIP), CNRS, INRIA, ENS-Lyon, UCB-Lyon, Lyon, France
