QUARC: An Array Programming Approach to High Performance Computing

  • Diptorup Deb
  • Robert J. Fowler
  • Allan Porterfield
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10136)

Abstract

We present QUARC, a framework for the optimized compilation of domain-specific extensions to C++. Driven by needs for programmer productivity and portable performance for lattice QCD, the framework focuses on stencil-like computations on arrays with an arbitrary number of dimensions. QUARC uses a template meta-programming front end to define a high-level array language. Unlike approaches that generate scalarized loop nests in the front end, the instantiation of QUARC templates retains high-level abstraction suitable for optimization at the object (array) level. The back end compiler (CLANG/LLVM) is extended to implement array transformations such as transposition, reshaping, and partitioning for parallelism and for memory locality prior to scalarization. We present the design and implementation.
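For contrast, the conventional expression-template technique that the abstract distinguishes QUARC from can be sketched as follows. This is only an illustration, not QUARC's API: the names `Expr`, `View`, `Shift`, `Sum`, `Array`, and `shift` are hypothetical, and the periodic boundary is an assumption. The stencil expression is held as a typed tree until assignment, at which point the front end itself emits the scalar loop; as described above, QUARC instead keeps the array-level form so that the back end can transform it before scalarization.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// All names here are hypothetical; this is NOT QUARC's API.

// CRTP base so that operator+ and shift() only match our expression types.
template <typename E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Leaf node: a non-owning, read-only view of array storage.
struct View : Expr<View> {
    const double* data;
    std::size_t n;
    View(const double* d, std::size_t len) : data(d), n(len) {}
    std::size_t size() const { return n; }
    double operator[](std::size_t i) const { return data[i]; }
};

// Element i reads src[(i + offset) mod n] (assumed periodic boundary).
template <typename E>
struct Shift : Expr<Shift<E>> {
    E src;
    long offset;
    Shift(const E& s, long o) : src(s), offset(o) {}
    std::size_t size() const { return src.size(); }
    double operator[](std::size_t i) const {
        const long n = static_cast<long>(src.size());
        const long j = ((static_cast<long>(i) + offset) % n + n) % n;
        return src[static_cast<std::size_t>(j)];
    }
};

// Element-wise sum of two expressions of equal size.
template <typename L, typename R>
struct Sum : Expr<Sum<L, R>> {
    L lhs;
    R rhs;
    Sum(const L& a, const R& b) : lhs(a), rhs(b) {
        assert(a.size() == b.size());
    }
    std::size_t size() const { return lhs.size(); }
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

template <typename L, typename R>
Sum<L, R> operator+(const Expr<L>& a, const Expr<R>& b) {
    return Sum<L, R>(a.self(), b.self());
}

template <typename E>
Shift<E> shift(const Expr<E>& e, long offset) {
    return Shift<E>(e.self(), offset);
}

class Array {
    std::vector<double> v;

public:
    explicit Array(std::size_t n, double init = 0.0) : v(n, init) {}
    std::size_t size() const { return v.size(); }
    double& operator[](std::size_t i) { return v[i]; }
    double operator[](std::size_t i) const { return v[i]; }
    View view() const { return View(v.data(), v.size()); }

    // Scalarization happens here, in the front end: one loop over elements.
    // Results go to a scratch buffer first so the source may alias *this.
    template <typename E>
    Array& operator=(const Expr<E>& e) {
        const E& x = e.self();
        assert(x.size() == v.size());
        std::vector<double> out(v.size());
        for (std::size_t i = 0; i < out.size(); ++i) out[i] = x[i];
        v.swap(out);
        return *this;
    }
};
```

With arrays `a` and `b` of equal size, a two-point stencil would be written `a = shift(b.view(), -1) + shift(b.view(), 1);`. The right-hand side has type `Sum<Shift<View>, Shift<View>>`, so the whole-array structure is visible only in the C++ type system and is gone once the assignment loop is instantiated.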

Keywords

Array-programming, Domain-specific languages

Notes

Acknowledgement

This work was supported in part by the DOE Office of Science SciDAC program on grants DE-FG02-11ER26050/DE-SC0006925 and DE-SC0008706.

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Diptorup Deb (1)
  • Robert J. Fowler (1)
  • Allan Porterfield (1)
  1. Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, USA