Performance Evaluation of NWChem Ab-Initio Molecular Dynamics (AIMD) Simulations on the Intel® Xeon Phi™ Processor

  • Eric J. Bylaska
  • Mathias Jacquelin
  • Wibe A. de Jong
  • Jeff R. Hammond
  • Michael Klemm
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10524)


Ab-initio Molecular Dynamics (AIMD) methods are an important class of algorithms, as they enable scientists to understand the chemistry and dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. Many-core architectures such as the Intel® Xeon Phi™ processor are a promising target for these algorithms, as they can provide the computational power that is needed to solve interesting problems in chemistry. In this paper, we describe our effort to refactor the existing AIMD plane-wave method of NWChem from an MPI-only implementation into a scalable, hybrid code that employs MPI and OpenMP to exploit the capabilities of current and future many-core architectures. We describe the optimizations required to get close to optimal performance for the multiplication of the tall-and-skinny matrices that form the core of the computational algorithm. We present strong scaling results for the complete AIMD simulation on a test case of 256 water molecules, which strong-scales well on a cluster of 1024 nodes of Intel Xeon Phi processors. We compare this performance with that obtained on a cluster of dual-socket Intel® Xeon® E5-2698v3 processors.
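To make the notion of a tall-and-skinny matrix product concrete, the following is a minimal sketch in plain Python, not the NWChem implementation. It assumes one common shape of such a product, C = AᵀB, where A and B have many more rows than columns. The rows are split into slabs, each slab contributes a small partial product, and the partial products are summed, which mirrors the way a distributed code would combine per-process results with a reduction. The function names `local_product` and `tall_skinny_product` and the parameter `n_slabs` are hypothetical and chosen only for illustration.

```python
def local_product(a_rows, b_rows):
    """Partial product A_slab^T * B_slab for one slab of rows.

    a_rows has shape (rows, m) and b_rows has shape (rows, k);
    the result is a small m x k matrix.
    """
    m, k = len(a_rows[0]), len(b_rows[0])
    c = [[0.0] * k for _ in range(m)]
    for a_row, b_row in zip(a_rows, b_rows):
        for i, a_val in enumerate(a_row):
            for j, b_val in enumerate(b_row):
                c[i][j] += a_val * b_val
    return c


def tall_skinny_product(a, b, n_slabs=4):
    """Compute C = A^T * B by splitting the long row dimension into slabs.

    Each slab stands in for the rows owned by one process or thread
    (an assumption of this sketch); the final loop plays the role of
    the reduction that sums the partial results.
    """
    n = len(a)
    m, k = len(a[0]), len(b[0])
    step = (n + n_slabs - 1) // n_slabs  # rows per slab, rounded up
    partials = [
        local_product(a[start:start + step], b[start:start + step])
        for start in range(0, n, step)
    ]
    total = [[0.0] * k for _ in range(m)]
    for partial in partials:
        for i in range(m):
            for j in range(k):
                total[i][j] += partial[i][j]
    return total
```

Calling `tall_skinny_product(a, b, n_slabs=3)` on two lists of rows returns the same small matrix regardless of how many slabs are used; only the work distribution changes. The result is small while the inputs are long, so the per-slab work dominates and the reduction is comparatively cheap.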


Xeon Phi, Many-core, Chemistry, AIMD, Ab-initio, Molecular dynamics



This work was supported by the NWChem project in the William R. Wiley Environmental Molecular Sciences Laboratory (EMSL) and by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research ECP program (NWChemEx project). E.J.B. was also supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, Chemical Sciences, Geosciences, and Biosciences Division at PNNL, DE-AC06-76RLO 1830. EMSL operations are supported by the DOE’s Office of Biological and Environmental Research. M.J. and W.A.D. were partially supported by the Scientific Discovery through Advanced Computing (SciDAC) program funded by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research and Basic Energy Sciences. In particular, M.J. was supported by the FASTMath SciDAC institute. We wish to thank the Scientific Computing Staff, Office of Energy Research, and the U.S. Department of Energy for support through the NERSC NESAP program at the National Energy Research Scientific Computing Center (Berkeley, CA). This work was also supported by Intel as part of its Intel Parallel Computing Centers effort. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

Intel, Xeon, and Xeon Phi are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

* Other names and brands are the property of their respective owners.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.


Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Eric J. Bylaska (1, corresponding author)
  • Mathias Jacquelin (2)
  • Wibe A. de Jong (2)
  • Jeff R. Hammond (3)
  • Michael Klemm (4)

  1. Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, USA
  2. Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, USA
  3. Data Center Group, Intel Corporation, Portland, USA
  4. Software and Services Group, Intel Deutschland GmbH, Feldkirchen, Germany
