QUARC: An Array Programming Approach to High Performance Computing

  • Diptorup Deb
  • Robert J. Fowler
  • Allan Porterfield
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10136)


We present QUARC, a framework for the optimized compilation of domain-specific extensions to C++. Driven by needs for programmer productivity and portable performance for lattice QCD, the framework focuses on stencil-like computations on arrays with an arbitrary number of dimensions. QUARC uses a template meta-programming front end to define a high-level array language. Unlike approaches that generate scalarized loop nests in the front end, the instantiation of QUARC templates retains high-level abstraction suitable for optimization at the object (array) level. The back end compiler (CLANG/LLVM) is extended to implement array transformations such as transposition, reshaping, and partitioning for parallelism and for memory locality prior to scalarization. We present the design and implementation.
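To make the contrast with front-end scalarization concrete, the following is a minimal C++ sketch of an expression-template front end in which a stencil expression stays a typed tree of array-level nodes until it is assigned. All names here (`sketch::Array`, `Shift`, `Sum`, `shift`, `three_point_sum`) are hypothetical and are not QUARC's API. The sketch also evaluates the tree in ordinary library code, whereas QUARC is described as deferring such decisions to an extended Clang/LLVM back end.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {  // hypothetical names; not QUARC's API

// Dense 1-D array. Element loops appear only in the templated operator=.
class Array {
public:
    explicit Array(std::size_t n, double v = 0.0) : data_(n, v) {}
    std::size_t size() const { return data_.size(); }
    double operator[](std::size_t i) const { return data_[i]; }
    double& operator[](std::size_t i) { return data_[i]; }

    // Assigning an expression is the point where scalarization happens.
    // Assumes the expression does not read from *this (no aliasing check).
    template <typename Expr>
    Array& operator=(const Expr& e) {
        for (std::size_t i = 0; i < size(); ++i) data_[i] = e[i];
        return *this;
    }

private:
    std::vector<double> data_;
};

// Node: periodic shift of an array by a fixed offset (one stencil neighbor).
struct Shift {
    const Array* src;
    std::size_t off;  // offset normalized into [0, n)
    std::size_t size() const { return src->size(); }
    double operator[](std::size_t i) const { return (*src)[(i + off) % size()]; }
};

// Builds a Shift node; assumes a non-empty array.
inline Shift shift(const Array& a, long k) {
    const long n = static_cast<long>(a.size());
    assert(n > 0);
    return Shift{&a, static_cast<std::size_t>(((k % n) + n) % n)};
}

// Node: elementwise sum of two sub-expressions. Operands are stored by value,
// so the tree can be held in a variable and inspected before evaluation.
// (Passing an Array directly would copy it; use shift(a, 0) instead.)
template <typename L, typename R>
struct Sum {
    L lhs;
    R rhs;
    std::size_t size() const { return lhs.size(); }
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

template <typename L, typename R>
Sum<L, R> operator+(const L& l, const R& r) { return Sum<L, R>{l, r}; }

// Three-point neighbor sum: out[i] = a[i-1] + a[i] + a[i+1], periodic.
inline void three_point_sum(Array& out, const Array& a) {
    auto expr = shift(a, -1) + shift(a, 0) + shift(a, 1);  // no loop runs yet
    out = expr;                                            // one fused loop
}

}  // namespace sketch
```

In this sketch the type of `expr` is `Sum<Sum<Shift, Shift>, Shift>`, so the shape of the stencil is still visible as structure at the point of assignment. A front end that emitted scalar loop nests directly would have discarded that structure before any array-level transformation could use it.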


Array-programming · Domain-specific languages



This work was supported in part by the DOE Office of Science SciDAC program on grants DE-FG02-11ER26050/DE-SC0006925 and DE-SC0008706.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Diptorup Deb
  • Robert J. Fowler
  • Allan Porterfield

Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, USA
