Jagged Tiling for Intra-tile Parallelism and Fine-Grain Multithreading

  • Sunil Shrestha
  • Joseph Manzano
  • Andres Marquez
  • John Feo
  • Guang R. Gao
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8967)


In this paper, we develop a novel methodology that takes multithreaded many-core designs into account to better utilize memory and processing resources and to improve memory residence for tileable applications. It combines polyhedral analysis and transformation, in the form of PLUTO [6], with a highly optimized fine-grain tile runtime to exploit parallelism at all levels. The main contributions of this paper are multi-hierarchical tiling techniques that increase intra-tile parallelism, and a data-flow-inspired runtime library that allows parallel tiles to be expressed with an efficient synchronization registry. Our current implementation shows performance improvements of up to 32.25% on an Intel Xeon Phi board against instances produced by state-of-the-art compiler frameworks for selected stencil applications.
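To make the idea of hierarchical tiling concrete, the following is a minimal sketch in C of a two-level spatial tiling of one 5-point Jacobi sweep. It is not the jagged tiling scheme or the runtime described in the paper: it omits time tiling, tile skewing, and any synchronization registry. The grid size `N`, the function names, and the tile-size parameters `outer` and `inner` are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define N 64                              /* grid edge length (illustrative assumption) */
#define IDX(i, j) ((size_t)(i) * N + (size_t)(j))

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* One 5-point update; shared by both versions so results match exactly. */
static double update(const double *in, size_t i, size_t j) {
    return 0.2 * (in[IDX(i, j)] + in[IDX(i - 1, j)] + in[IDX(i + 1, j)]
                  + in[IDX(i, j - 1)] + in[IDX(i, j + 1)]);
}

/* Reference: untiled sweep over the interior points. */
void sweep_untiled(const double *in, double *out) {
    for (size_t i = 1; i < N - 1; i++)
        for (size_t j = 1; j < N - 1; j++)
            out[IDX(i, j)] = update(in, i, j);
}

/* Two-level tiling: outer tiles of edge `outer` (hypothetically sized for a
 * shared cache level), each split into inner tiles of edge `inner`.
 * Within a single Jacobi sweep the inner tiles are independent, so they are
 * the units that could be handed to separate hardware threads. */
void sweep_two_level(const double *in, double *out, size_t outer, size_t inner) {
    assert(outer > 0 && inner > 0 && inner <= outer);
    for (size_t io = 1; io < N - 1; io += outer) {
        for (size_t jo = 1; jo < N - 1; jo += outer) {
            size_t i_end = min_sz(io + outer, N - 1);
            size_t j_end = min_sz(jo + outer, N - 1);
            for (size_t ii = io; ii < i_end; ii += inner) {
                for (size_t ji = jo; ji < j_end; ji += inner) {
                    size_t ii_end = min_sz(ii + inner, i_end);
                    size_t ji_end = min_sz(ji + inner, j_end);
                    for (size_t i = ii; i < ii_end; i++)
                        for (size_t j = ji; j < ji_end; j++)
                            out[IDX(i, j)] = update(in, i, j);
                }
            }
        }
    }
}
```

Because the sweep reads only from `in` and writes only to `out`, any tile order is legal here, which is why the tiled and untiled versions produce identical output. Tiling across time steps, as targeted by the paper, introduces dependences between tiles and needs the polyhedral machinery that this sketch leaves out.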


Keywords: Iteration Space, Memory Hierarchy, Tile Size, Level Tile, Memory Access Latency
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


References

  1. perf: Linux profiling with performance counters
  2. Bandishti, V., Pananilath, I., Bondhugula, U.: Tiling stencil computations to maximize parallelism. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC 2012, Los Alamitos, CA, USA, pp. 40:1–40:11 (2012)
  3. Baskaran, M.M., et al.: Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories. In: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 1–10. ACM (2008)
  4. Bastoul, C.: Generating loops for scanning polyhedra: CLooG user's guide. Polyhedron 2, 10 (2004)
  5. Bikshandi, G., et al.: Programming for parallelism and locality with hierarchically tiled arrays. In: Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2006, pp. 48–57. ACM, New York (2006)
  6. Bondhugula, U., Ramanujam, J.: Pluto: a practical and fully automatic polyhedral parallelizer and locality optimizer (2007)
  7. Intel Open Source Technology Center: Open community runtime (2012)
  8. Cepeda, S.: Optimization and performance tuning for Intel Xeon Phi coprocessors, part 2: understanding and using hardware events (2012)
  9. Datta, K., Kamil, S., Williams, S., Oliker, L., Shalf, J., Yelick, K.: Optimization and performance modeling of stencil computations on modern microprocessors. SIAM Rev. (2008)
  10. Dursun, H., et al.: Hierarchical parallelization and optimization of high-order stencil computations on multicore clusters. J. Supercomput. 62(2), 946–966 (2012)
  11. Feautrier, P.: Some efficient solutions to the affine scheduling problem. I. One-dimensional time. Int. J. Parallel Program. 21(5), 313–347 (1992)
  12. Feautrier, P.: Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time. Int. J. Parallel Program. 21(6), 389–420 (1992)
  13. Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S.: Cache-oblivious algorithms. In: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, FOCS 1999, p. 285. IEEE Computer Society, Washington, DC (1999)
  14. Gan, G., Wang, X., Manzano, J., Gao, G.R.: Tile percolation: an OpenMP tile aware parallelization technique for the Cyclops-64 multicore processor. In: Sips, H., Epema, D., Lin, H.-X. (eds.) Euro-Par 2009. LNCS, vol. 5704, pp. 839–850. Springer, Heidelberg (2009)
  15. Griebl, M., Lengauer, C., Wetzel, S.: Code generation in the polytope model. In: Proceedings 1998 International Conference on Parallel Architectures and Compilation Techniques, pp. 106–111. IEEE (1998)
  16. Grosser, T., Verdoolaege, S., Cohen, A., Sadayappan, P.: The relation between diamond tiling and hexagonal tiling. In: HiStencils 2014, p. 65 (2014)
  17. Högstedt, K., Carter, L., Ferrante, J.: Selecting tile shape for minimal execution time. In: Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 201–211. ACM (1999)
  18. ET International: SWARM (SWift Adaptive Runtime Machine) (2012)
  19. Kim, D., et al.: Physical experimentation with prefetching helper threads on Intel's hyper-threaded processors. In: Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO 2004, p. 27. IEEE Computer Society, Washington, DC (2004)
  20. Kodukula, I., Ahmed, N., Pingali, K.: Data-centric multi-level blocking, pp. 346–357 (1997)
  21. Lewis, J., et al.: An automatic prefetching and caching system. In: 2010 IEEE 29th International Performance Computing and Communications Conference (IPCCC), pp. 180–187, December 2010
  22. Tanguay, D.O.J.: Compile-time Loop Splitting for Distributed Memory Multiprocessors. Laboratory for Computer Science, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science (1993)
  23. Theobald, K.B.: EARTH: An Efficient Architecture for Running Threads. McGill University, Montreal (1999)
  24. Wilde, D.K.: A library for doing polyhedral operations. Technical report (1997)
  25. Wolf, M.E., Lam, M.S.: A data locality optimizing algorithm. In: Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation, PLDI 1991, pp. 30–44. ACM, New York (1991)
  26. Wolfe, M.: More iteration space tiling. In: Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, Supercomputing 1989, pp. 655–664. ACM, New York (1989)
  27. Wolfe, M.: Iteration space tiling for memory hierarchies. In: Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing, pp. 357–361. Society for Industrial and Applied Mathematics, Philadelphia (1989)

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Sunil Shrestha (1)
  • Joseph Manzano (2)
  • Andres Marquez (2)
  • John Feo (2)
  • Guang R. Gao (1)

  1. CAPSL, University of Delaware, Newark, USA
  2. Pacific Northwest National Laboratory, Richland, USA
