Lattice QCD on Intel® Xeon Phi™ Coprocessors

  • Bálint Joó
  • Dhiraj D. Kalamkar
  • Karthikeyan Vaidyanathan
  • Mikhail Smelyanskiy
  • Kiran Pamnany
  • Victor W. Lee
  • Pradeep Dubey
  • William Watson III
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7905)


Lattice Quantum Chromodynamics (LQCD) is currently the only known model-independent, non-perturbative computational method for calculations in the theory of the strong interactions, and is important in studies of nuclear and high-energy physics. LQCD codes use large fractions of supercomputing cycles worldwide and are often among the first to be ported to new high-performance computing architectures. The recently released Xeon Phi architecture from Intel Corporation features parallelism at the level of many x86-based cores, multiple threads per core, and vector processing units. In this contribution, we describe our experiences with optimizing a key LQCD kernel for the Xeon Phi architecture. On a single node, using single precision, our Dslash kernel sustains a performance of up to 320 GFLOPS, while our Conjugate Gradients solver sustains up to 237 GFLOPS. Furthermore, we demonstrate a fully 'native' multi-node LQCD implementation running entirely on Knights Corner (KNC) Xeon Phi nodes with minimal involvement of the host CPU. Our multi-node implementation of the solver has been strong-scaled to 3.9 TFLOPS on 32 KNCs.


High Performance Computing; Memory Bandwidth; Single Precision; Chunk Size; Intel Corporation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Bálint Joó (1)
  • Dhiraj D. Kalamkar (2)
  • Karthikeyan Vaidyanathan (2)
  • Mikhail Smelyanskiy (3)
  • Kiran Pamnany (2)
  • Victor W. Lee (3)
  • Pradeep Dubey (3)
  • William Watson III (1)

  1. Thomas Jefferson National Accelerator Facility, Newport News, U.S.A.
  2. Parallel Computing Lab., Intel Corporation, Bangalore, India
  3. Parallel Computing Lab., Intel Corporation, Santa Clara, U.S.A.
