Abstract
We present QUARC, a framework for the optimized compilation of domain-specific extensions to C++. Driven by the need for programmer productivity and portable performance in lattice QCD, the framework focuses on stencil-like computations on arrays with an arbitrary number of dimensions. QUARC uses a template meta-programming front end to define a high-level array language. Unlike approaches that generate scalarized loop nests in the front end, QUARC's template instantiation retains a high-level abstraction suitable for optimization at the object (array) level. The back-end compiler (CLANG/LLVM) is extended to implement array transformations, such as transposition, reshaping, and partitioning, for parallelism and memory locality prior to scalarization. We present the design and implementation.
Acknowledgement
This work was supported in part by the DOE Office of Science SciDAC program on grants DE-FG02-11ER26050/DE-SC0006925 and DE-SC0008706.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Deb, D., Fowler, R.J., Porterfield, A. (2017). QUARC: An Array Programming Approach to High Performance Computing. In: Ding, C., Criswell, J., Wu, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2016. Lecture Notes in Computer Science, vol 10136. Springer, Cham. https://doi.org/10.1007/978-3-319-52709-3_1
DOI: https://doi.org/10.1007/978-3-319-52709-3_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-52708-6
Online ISBN: 978-3-319-52709-3
eBook Packages: Computer Science, Computer Science (R0)