Performance of MD-Algorithms on Hybrid Systems-on-Chip Nvidia Tegra K1 & X1

  • Vsevolod Nikolskii
  • Vyacheslav Vecher
  • Vladimir Stegailov
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 687)


In this paper we consider the efficiency of hybrid systems-on-a-chip for high-performance calculations. Firstly, we build Roofline performance models for the systems considered using the Empirical Roofline Toolkit and compare the results with theoretical estimates. Secondly, we use LAMMPS as an example of a molecular dynamics package to demonstrate its performance and efficiency in various configurations running on Nvidia Tegra K1 & X1. Following the Roofline approach, we attempt to distinguish compute-bound and memory-bound conditions for the MD algorithm using the Lennard-Jones liquid model. The results are discussed in the context of LAMMPS performance on Intel Xeon CPUs and the Nvidia Tesla K80 GPU.
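The Roofline approach mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code and does not use the Empirical Roofline Toolkit; it only encodes the basic Roofline bound from Williams et al., in which attainable performance is the smaller of the peak compute rate and the product of memory bandwidth and arithmetic intensity. The function names and the hardware numbers below are hypothetical placeholders, not measurements of the Tegra K1 or X1.

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Attainable performance (GFLOP/s) under the basic Roofline model.

    intensity is the arithmetic intensity of the kernel in FLOP per byte
    of memory traffic.
    """
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_gflops / bandwidth_gbs


def classify(peak_gflops, bandwidth_gbs, intensity):
    """Label a kernel as memory-bound or compute-bound on a given device."""
    if intensity < ridge_point(peak_gflops, bandwidth_gbs):
        return "memory-bound"
    return "compute-bound"


if __name__ == "__main__":
    # Hypothetical device: 100 GFLOP/s peak, 10 GB/s memory bandwidth.
    peak, bandwidth = 100.0, 10.0
    for intensity in (2.0, 50.0):
        print(intensity,
              roofline_gflops(peak, bandwidth, intensity),
              classify(peak, bandwidth, intensity))
```

With these placeholder numbers the ridge point is 10 FLOP/byte, so a kernel at 2 FLOP/byte is limited by memory bandwidth and one at 50 FLOP/byte by peak compute throughput.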


Keywords: ARM, GPU, Maxwell, Kepler, Roofline, LAMMPS



HSE and MIPT provided funds for purchasing the hardware used in this study. The authors are grateful to the Forsite company for access to the server with the Nvidia Tesla K80. The authors acknowledge the Joint Supercomputer Centre of RAS for access to the MVS-100K and MVS-10P supercomputers. The work was supported by grant No. 14-50-00124 of the Russian Science Foundation.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Vsevolod Nikolskii (1, 3)
  • Vyacheslav Vecher (1, 2)
  • Vladimir Stegailov (1)

  1. Joint Institute for High Temperatures of RAS, Moscow, Russia
  2. Moscow Institute of Physics and Technology (State University), Dolgoprudny, Russia
  3. National Research University Higher School of Economics, Moscow, Russia
