Applying the Roofline Performance Model to the Intel Xeon Phi Knights Landing Processor
The Roofline Performance Model is a visually intuitive method for bounding the sustained peak floating-point performance of any given arithmetic kernel on any given processor architecture. In the Roofline, performance is nominally measured in floating-point operations per second and plotted as a function of arithmetic intensity (operations per byte of data moved). In this study we construct the Roofline for the Intel Knights Landing (KNL) processor by determining the sustained peak memory bandwidth and floating-point performance for all levels of the memory hierarchy, in all the different KNL cluster modes. We then determine arithmetic intensity and performance for a suite of application kernels being targeted for the KNL-based supercomputer Cori, and compare the results with current Intel Xeon processors. Cori is the next-generation supercomputer of the National Energy Research Scientific Computing Center (NERSC). Scheduled for deployment in mid-2016, it will be one of the earliest and largest KNL deployments in the world.
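As an illustration of the bound described above, the following minimal Python sketch computes the attainable performance of a kernel as the lesser of the compute ceiling and the bandwidth ceiling scaled by arithmetic intensity. The function names and all numeric values are hypothetical placeholders chosen for easy arithmetic; they are not measurements of KNL or of any other processor.

```python
def arithmetic_intensity(flops, bytes_moved):
    """Arithmetic intensity: floating-point operations per byte of data moved."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


def roofline_bound(intensity, peak_gflops, peak_bandwidth_gbs):
    """Upper bound on performance (GFLOP/s) for a kernel with the given
    arithmetic intensity (FLOP/byte), given a compute ceiling (GFLOP/s)
    and a memory bandwidth ceiling (GB/s)."""
    if intensity <= 0:
        raise ValueError("intensity must be positive")
    return min(peak_gflops, intensity * peak_bandwidth_gbs)


def ridge_point(peak_gflops, peak_bandwidth_gbs):
    """Arithmetic intensity at which the bandwidth and compute ceilings meet."""
    return peak_gflops / peak_bandwidth_gbs


# Hypothetical ceilings, NOT measured values for any real machine.
PEAK_GFLOPS = 1000.0        # assumed compute ceiling, GFLOP/s
PEAK_BANDWIDTH_GBS = 100.0  # assumed bandwidth ceiling, GB/s

# A hypothetical kernel performing 2 FLOPs for every 8 bytes moved.
low_ai = arithmetic_intensity(flops=2.0, bytes_moved=8.0)

# Below the ridge point the kernel is bandwidth-bound.
bandwidth_bound = roofline_bound(low_ai, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)

# Above the ridge point the kernel is compute-bound.
compute_bound = roofline_bound(20.0, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
```

Under these assumed ceilings the two regimes meet at an intensity of 10 FLOP/byte. Each level of the memory hierarchy has its own bandwidth ceiling, so the same calculation can be repeated with a different `peak_bandwidth_gbs` for each level.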
This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
This material is based upon work supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under Award Number DE-AC02-05CH11231.
J.D. was supported by the SciDAC Program on Excited State Phenomena in Energy Materials funded by the U.S. Department of Energy, Office of Basic Energy Sciences and of Advanced Scientific Computing Research, under Contract No. DE-AC02-05CH11231 at Lawrence Berkeley National Laboratory.