KART – A Runtime Compilation Library for Improving HPC Application Performance

  • Matthias Noack
  • Florian Wende
  • Georg Zitzlsberger
  • Michael Klemm
  • Thomas Steinke
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10524)


The effectiveness of ahead-of-time compiler optimization heavily depends on the amount of information available at compile time. Input-specific information that is only available at runtime cannot be used, although it often determines loop counts, branching predicates and paths, as well as memory-access patterns. It can also be crucial for generating efficient SIMD-vectorized code. This is especially relevant for the many-core architectures paving the way to exascale computing, which are more sensitive to code optimization. We explore the design space for using input-specific information at compile time and present KART, a C++ library solution that allows developers to compile, link, and execute code (e.g., C, C++, Fortran) at application runtime. Besides mere runtime compilation of performance-critical code, KART can be used to instantiate the same code multiple times using different inputs, compilers, and options. Other techniques like auto-tuning and code generation can be integrated into a KART-enabled application instead of being scripted around it. We evaluate runtimes and compilation costs for different synthetic kernels, and show the effectiveness for two real-world applications, HEOM and a WSM6 proxy.
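The general idea of runtime compilation can be illustrated with a minimal sketch. It is not KART's API, which is not shown here. It only assumes that a C compiler (`cc`, `gcc`, or `clang`) is on the PATH and uses Python's standard `ctypes` module to load the result. The names `KERNEL_TEMPLATE`, `find_c_compiler`, `build_kernel`, and `scaled_sum` are hypothetical. A loop count known only at runtime is written into the source as a literal, the source is compiled into a shared library, and the resulting function is called from the running program.

```python
import ctypes
import os
import shutil
import subprocess
import tempfile

# Hypothetical kernel: the trip count is substituted as a literal, so the
# compiler sees a fixed loop bound it can unroll or vectorize.
KERNEL_TEMPLATE = """
double scaled_sum(const double *x, double a) {
    double s = 0.0;
    for (int i = 0; i < %(n)d; ++i)
        s += a * x[i];
    return s;
}
"""


def find_c_compiler():
    """Return the path of the first C compiler found on PATH, or None."""
    for name in (os.environ.get("CC"), "cc", "gcc", "clang"):
        if name:
            path = shutil.which(name)
            if path:
                return path
    return None


def build_kernel(n, compiler=None, flags=("-O2",)):
    """Compile scaled_sum specialized for loop count n and return it."""
    compiler = compiler or find_c_compiler()
    if compiler is None:
        raise RuntimeError("no C compiler found on PATH")
    workdir = tempfile.mkdtemp(prefix="rtc_")
    src = os.path.join(workdir, "kernel.c")
    lib = os.path.join(workdir, "kernel.so")
    with open(src, "w") as f:
        f.write(KERNEL_TEMPLATE % {"n": n})
    # Compile and link into a shared library at application runtime.
    subprocess.run(
        [compiler, *flags, "-shared", "-fPIC", src, "-o", lib], check=True
    )
    fn = ctypes.CDLL(lib).scaled_sum
    fn.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_double]
    fn.restype = ctypes.c_double
    return fn


if __name__ == "__main__":
    n = 4  # stands in for a value that is only known at runtime
    kernel = build_kernel(n)
    data = (ctypes.c_double * n)(1.0, 2.0, 3.0, 4.0)
    print(kernel(data, 0.5))
```

Calling `build_kernel` again with a different `n`, compiler, or `flags` tuple yields another independent instance of the same kernel, which loosely mirrors the multiple-instantiation use case described above.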



This work is partially supported by Intel Corporation within the “Research Center for Many-core High-Performance Computing” (Intel PCC) at ZIB. We thank “The North-German Supercomputing Alliance - HLRN” for providing us access to the HLRN-III production system ‘Konrad’ and the Cray TDS system with Intel KNL nodes.

Intel, Xeon, and Xeon Phi are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

* Other names and brands are the property of their respective owners. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Matthias Noack (1)
  • Florian Wende (1)
  • Georg Zitzlsberger (2)
  • Michael Klemm (2)
  • Thomas Steinke (1)

  1. Zuse Institute Berlin, Berlin, Germany
  2. Intel Deutschland GmbH, Feldkirchen, Germany
