TiDA: High-Level Programming Abstractions for Data Locality Management

  • Didem Unat
  • Tan Nguyen
  • Weiqun Zhang
  • Muhammed Nufail Farooqi
  • Burak Bastem
  • George Michelogiannakis
  • Ann Almgren
  • John Shalf
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9697)


The high energy cost of data movement compared to computation makes data locality management in programs critically important. Managing data locality manually is not a trivial task, and it complicates programming. Tiling is a well-known approach that provides both data locality and parallelism in an application. However, there is no standard programming construct for expressing tiling at the application level. We have developed a multicore programming model, TiDA, based on tiling, and implemented the model as C++ and Fortran libraries. The proposed programming model has three high-level abstractions: tiles, regions, and tile iterators. These abstractions hide the details of data decomposition, cache locality optimizations, and memory affinity management from the application. In this paper we unveil the internals of the library and demonstrate the performance and programmability advantages of the model on five applications running on multiple NUMA nodes. The library achieves up to a 2.10x speedup over OpenMP on a single compute node for simple kernels, and up to a 22x improvement over a single thread for a more complex combustion proxy application (SMC) on 24 cores. The MPI+TiDA implementation of geometric multigrid demonstrates a 30.9% performance improvement over MPI+OpenMP when scaling to 3072 cores if MPI communication overheads are excluded, and an 8.5% improvement if they are included.
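To make the three abstractions concrete, the following is a minimal C++ sketch of how a region, a tile, and a tile iterator could fit together. It is an illustration under stated assumptions, not the TiDA API: the names `Region`, `Tile`, `TileIterator`, and `scale_tiled` are hypothetical, the region is 2D and row-major, and the sketch is serial, so it leaves out the thread parallelism and NUMA memory affinity management that the abstract attributes to the library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// All names below are hypothetical and chosen for illustration only;
// they are not taken from the TiDA library.

// A tile: half-open index ranges [ilo, ihi) x [jlo, jhi) within a region.
struct Tile {
    std::size_t ilo, ihi, jlo, jhi;
};

// A region: a logically 2D array stored contiguously in row-major order.
struct Region {
    std::size_t ni, nj;
    std::vector<double> data;

    Region(std::size_t rows, std::size_t cols, double init = 0.0)
        : ni(rows), nj(cols), data(rows * cols, init) {}

    double& at(std::size_t i, std::size_t j) { return data[i * nj + j]; }
};

// A tile iterator: visits, one at a time, the tiles that cover a region.
// Tiles on the high edges are clipped to the region bounds.
class TileIterator {
public:
    TileIterator(const Region& r, std::size_t tile_i, std::size_t tile_j)
        : ni_(r.ni), nj_(r.nj), ti_(tile_i), tj_(tile_j) {
        assert(tile_i > 0 && tile_j > 0);  // precondition: non-empty tiles
    }

    bool valid() const { return i0_ < ni_ && j0_ < nj_; }

    Tile tile() const {
        return Tile{i0_, std::min(i0_ + ti_, ni_),
                    j0_, std::min(j0_ + tj_, nj_)};
    }

    void next() {
        j0_ += tj_;
        if (j0_ >= nj_) {  // end of a row of tiles: move to the next row
            j0_ = 0;
            i0_ += ti_;
        }
    }

private:
    std::size_t ni_, nj_, ti_, tj_;
    std::size_t i0_ = 0, j0_ = 0;
};

// Multiplies every element of the region by `factor`, one tile at a time,
// and returns the number of tiles visited. The loop body only sees tile
// bounds, so the tile size can change without touching the kernel.
std::size_t scale_tiled(Region& r, double factor,
                        std::size_t tile_i, std::size_t tile_j) {
    std::size_t tiles = 0;
    for (TileIterator it(r, tile_i, tile_j); it.valid(); it.next()) {
        const Tile t = it.tile();
        for (std::size_t i = t.ilo; i < t.ihi; ++i)
            for (std::size_t j = t.jlo; j < t.jhi; ++j)
                r.at(i, j) *= factor;
        ++tiles;
    }
    return tiles;
}
```

In this sketch, calling `scale_tiled` on a 5 x 7 region with 2 x 3 tiles visits nine tiles, with the last tile in each direction clipped to the region boundary. The point of the design is that the kernel is written against tile bounds only, so the decomposition can be tuned independently of the computation.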


Iteration Space · Adaptive Mesh Refinement · Tile Size · Multigrid Solver · NUMA Node



Dr. Unat is supported by the Marie Sklodowska Curie Reintegration Grant 655965 by the European Commission. Authors from KU are supported by the Turkish Science and Technology Research Centre Grant No: 215E285. Authors from LBNL were supported by the SciDAC Program and the Exascale Co-Design Program under the U.S. DOE contract DE-AC02-05CH11231. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. DOE under Contract No. DE-AC02-05CH11231. We would like to acknowledge and thank John Bell and Hakan Memisoglu for their input.


Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Didem Unat (1)
  • Tan Nguyen (2)
  • Weiqun Zhang (2)
  • Muhammed Nufail Farooqi (1)
  • Burak Bastem (1)
  • George Michelogiannakis (2)
  • Ann Almgren (2)
  • John Shalf (2)
  1. Koç University, Istanbul, Turkey
  2. Lawrence Berkeley National Laboratory, Berkeley, USA
