High Performance Optimizations for Nuclear Physics Code MFDn on KNL

  • Brandon Cook
  • Pieter Maris
  • Meiyue Shao
  • Nathan Wichmann
  • Marcus Wagner
  • John O’Neill
  • Thanh Phung
  • Gaurav Bansal
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9945)

Abstract

Initial optimization strategies and results for MFDn, a large-scale nuclear physics application code, running on a single KNL node are presented. The code constructs a very large sparse real symmetric matrix and computes a few of its lowest eigenvalues and eigenvectors through iterative methods. Challenges addressed include effectively utilizing MCDRAM with representative input data for production runs on 5,000 KNL nodes that require over 80 GB of memory per node, using OpenMP 4 to parallelize functions in the construction phase of the sparse matrices, and vectorizing those functions in spite of while-loops, conditionals, and lookup tables with indirect indexing. Moreover, hybrid MPI/OpenMP is employed not only to maximize the total problem size that can be solved per node, but also to minimize parallel scaling overhead by choosing the best combination of MPI ranks per node and OpenMP threads per rank. We describe a vectorized version of the popcount operation to avoid serialization on the intrinsic popcnt, which only operates on scalar registers. Additionally, we leverage SSE 4.2 string comparison instructions to determine nonzero matrix elements. By utilizing MCDRAM we achieve excellent Sparse Matrix–Matrix multiplication performance; in particular, using blocks of 8 vectors leads to a speedup of 6.4\(\times \) on KNL and 2.9\(\times \) on Haswell compared to the performance of repeated SpMVs. This optimization was essential in achieving a 1.6\(\times \) improvement on KNL over Haswell.

Keywords

Vectorization · MCDRAM · KNL · MFDn · Sparse matrix · SpMV

1 Introduction

Many-Fermion Dynamics, nuclear, or MFDn, is a configuration interaction (CI) code for nuclear structure calculations. It is a platform-independent Fortran 90 code using a hybrid MPI/OpenMP programming model, and is used on current supercomputers such as Edison at NERSC, Mira at ALCF, and Titan at OLCF for ab initio calculations of atomic nuclei with realistic nucleon–nucleon and three-nucleon forces [3, 7, 8, 9]. A calculation consists of generating a many-body basis space, constructing the many-body Hamiltonian matrix in this basis, obtaining the lowest eigenpairs, and calculating a set of observables from those eigenpairs. Key computational challenges for MFDn include effectively using the available aggregate memory, efficient construction of the matrix, and efficient sparse matrix–vector products in the solution of the eigenvalue problem.

In principle an infinite-dimensional basis space is needed for an exact representation of the many-body wavefunctions. However, in practice the basis space is truncated and observables are studied as a function of the truncation parameters. Typical basis space dimensions for large-scale production runs are of the order of several billion. The corresponding many-body matrix is extremely sparse, with tens of trillions of nonzero matrix elements, which are stored in core. This defines one of the key computational challenges for this code: effectively using the aggregate memory available in a cluster.

To accurately capture this need we developed a test code which uses representative data for production calculations on 5,000 Knights Landing (KNL) nodes (approximately half the size of Cori at NERSC) using over 80 GB of memory per node. In such a production run, half of the symmetric matrix is distributed in a two-dimensional fashion over the available MPI ranks. Each MPI rank constructs and stores its own sparse submatrix. The test code performs nearly all the computational work a single node would do in the production run but with the communication removed.

2 Target Architecture

The optimizations on MFDn presented in this work target the Cori supercomputer at the National Energy Research Scientific Computing Center (NERSC). Cori is a Cray XC40 based supercomputer deployed in two phases.

Phase 1
  • 1,630 compute nodes, each with 128 GB DDR4 @ 2133 MHz and two 16-core 2.3 GHz Intel Haswell processors

  • 1.92 PFLOP/s theoretical peak

  • 203 TB aggregate memory

  • Aries Dragonfly topology network

  • Deployed late-2015

Phase 2
  • Scheduled deployment mid-2016

  • Over 9,300 self-hosted Knights Landing (KNL) compute nodes

  • Over 1 PB aggregate memory (DDR4 and MCDRAM combined)

  • Aries Dragonfly topology network

As Cori Phase 2 is not available at the time of writing, we perform our tests using the following platform:
  • B0 stepping KNL preproduction white boxes

  • 64 cores @ 1.3 GHz with 4 logical threads per core

  • 16 GB MCDRAM

  • 96 GB DDR4 @ 2133 MHz

On Haswell we achieved the best performance with one MPI rank per socket (i.e., two MPI ranks per node, avoiding NUMA issues) and 16 threads per rank, resulting in one thread per physical core. Our initial tests on KNL showed that hyper-threading was beneficial and that 4 threads per core was best; see Fig. 1. Furthermore, one MPI rank with 256 threads was more efficient than two MPI ranks with 128 threads or four MPI ranks with 64 threads. In addition to being the most efficient, using one MPI rank per node avoids memory overhead from data replicated across MPI ranks and facilitates addressing large blocks of MCDRAM. Hence, unless otherwise noted, all tests reported below on KNL were done in quadrant+flat mode with one MPI rank and 256 threads. For allocations in MCDRAM we used the memkind [4] library and FASTMEM directives.
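For readers unfamiliar with memkind, the sketch below shows the style of allocation involved. It is a minimal C illustration of memkind's high-bandwidth-memory interface (hbw_malloc/hbw_free), not MFDn source; the Fortran code obtains the same effect with FASTMEM directives, and the array name and size here are placeholders.

    /* Minimal C sketch of a memkind-based MCDRAM allocation (not MFDn source). */
    #include <hbwmalloc.h>   /* high-bandwidth-memory interface of memkind */
    #include <stdio.h>

    int main(void) {
        size_t n = (size_t)1 << 28;               /* placeholder vector length */
        /* Request a block of 8 single-precision vectors from MCDRAM. */
        float *vecs = hbw_malloc(8 * n * sizeof(float));
        if (vecs == NULL) {
            fprintf(stderr, "hbw_malloc failed; MCDRAM may be unavailable\n");
            return 1;
        }
        /* ... eigensolver kernels would operate on vecs here ... */
        hbw_free(vecs);
        return 0;
    }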
Fig. 1.

OpenMP thread scaling of kernels on KNL B0 with 64 physical cores and up to 4 logical threads per core. Speedup is measured against wall time with 8 threads.

3 Optimization of SpMV/SpMM Kernel

The lowest few eigenvalues and eigenvectors of the very large sparse real symmetric Hamiltonian matrix are found with the iterative solvers Lanczos [6] or LOBPCG [5]. The key kernels in the iterative eigensolvers are the Sparse Matrix–Vector (SpMV) and Sparse transposed Matrix–Vector (SpMVT) products, as only half of the symmetric matrix is stored in order to save memory. The sparse matrix is stored in the CSB_Coo format [1, 2], which allows for efficient linear algebra operations on very sparse matrices, improved cache reuse on multicore architectures, and good thread scaling even when the same structure is used for both SpMV and SpMVT (as is the case in this application). The thread scalability of this kernel is shown in Fig. 1; it scales ideally with the number of physical cores, and hyper-threads provide a small additional benefit. The figure was generated with 8 simultaneous vectors allocated in MCDRAM; different numbers of vectors (not shown) display the same behavior.
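To make the role of the shared structure concrete, the following schematic C sketch (assumed plain COO storage, serial, with placeholder names; the production kernel uses the threaded CSB_Coo layout of [1, 2]) shows how each stored nonzero of the half matrix serves both the SpMV and the SpMVT products.

    /* Schematic sketch only: each stored nonzero A(r,c) of the half matrix
     * contributes to both y = A*x and yt = A^T*xt, so the matrix data is
     * read once for both products. */
    void spmv_and_spmvt(long nnz, const int *row, const int *col, const float *val,
                        const float *x, float *y, const float *xt, float *yt)
    {
        for (long k = 0; k < nnz; ++k) {
            y[row[k]]  += val[k] * x[col[k]];    /* SpMV  contribution */
            yt[col[k]] += val[k] * xt[row[k]];   /* SpMVT contribution */
        }
    }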

In a production run on 5,000 KNL nodes, over 80 GB of memory per node is required for a calculation. The nonzero matrix elements and corresponding indices account for 64 GB of that memory, and the input/output vectors account for up to 16 GB, depending on the specific problem and on the eigensolver that is used. Improving data reuse, utilizing vectorization, and effectively using as much aggregate bandwidth as possible are key challenges for computing sparse matrix–vector products in MFDn. To improve data reuse and allow for vectorization, in LOBPCG we replace SpMV with SpMM (i.e., SpMV operations on a block of vectors). To fully utilize AVX-512 instructions on KNL up to 16 single-precision vectors could be used, but we limit our study to 8 vectors due to memory requirements and the need to balance the simultaneous data use from each memory system. To access more memory bandwidth we explicitly place the input/output vectors in MCDRAM using the memkind [4] library and FASTMEM directives.
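The following schematic C sketch (again plain COO storage and placeholder names, not the CSB_Coo production kernel) illustrates why blocking helps: the matrix data, which dominates the memory traffic, is streamed once per SpMM instead of once per vector, and the short unit-stride loop over the m = 8 vectors is a natural target for vectorization.

    /* Schematic sketch only: one SpMM over a block of M vectors in place of
     * M repeated SpMVs.  X and Y are stored with the vector index fastest. */
    #define M 8
    void spmm(long nnz, const int *row, const int *col, const float *val,
              const float X[][M], float Y[][M])
    {
        for (long k = 0; k < nnz; ++k) {
            const float a   = val[k];
            const float *xr = X[col[k]];
            float       *yr = Y[row[k]];
            for (int v = 0; v < M; ++v)   /* unit stride, vectorizable */
                yr[v] += a * xr[v];
        }
    }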
Fig. 2.

Wall time for SpMM kernel on Haswell and KNL. KNL (DDR) means no MCDRAM was utilized. KNL (cache) indicates that cache mode was used. KNL (memkind) indicates that input/output vectors were explicitly kept in MCDRAM using directives.

Table 1.

Performance data for SpMM on m vectors for 2-socket Haswell with 2 MPI ranks and 16 OpenMP threads.

  m     AI      GFLOP/s    DDR GB/s
  1     0.23    23.2       122–125
  4     0.62    56.8       125
  8     0.80    67.5       122–125

Table 2.

Performance data for SpMM on m vectors for B0 KNL with 64 cores and 256 OpenMP threads.

  m     AI\(_{\text {DDR}}\)    AI\(_{\text {MCDRAM}}\)    AI\(_{\text {total}}\)    GFLOP/s    DDR GB/s    MCDRAM GB/s
  1     0.20                     0.33                        0.13                      17.1       83          55
  4     0.80                     0.36                        0.25                      62.4       80          170–190
  8     1.57                     0.37                        0.30                      109.1      71          290–310

To analyze the performance we measure the arithmetic intensity (AI, the ratio of FLOPs to data movement) of the SpMM operation. We used the dynamic instruction tracing capabilities of Intel's Software Development Emulator (SDE) to count the number of floating point operations. Due to the size of the matrices and vectors, the most relevant measure of data movement is the traffic at the main memory controllers. To measure the data movement at the DDR and MCDRAM controllers we use Intel's VTune Amplifier XE. In our experiments we used a fixed CSB_Coo block size \(\beta =16{,}000\) to maintain consistency, though this value should be adjusted to match the cache sizes of the target hardware and the matrix sparsity of a given Hamiltonian.

On Haswell, operating on blocks of vectors with SpMM operations resulted in a speedup of 2.9\(\times \) over operating on a single vector at a time. Due to the low AI of the SpMM operation, memory bandwidth is the limiting factor in performance. Sustained memory bandwidth on Haswell was measured to be \(B_{\tiny {\text {DDR}}}^{\tiny {\text {HSW}}}=128\) GB/s. Performance measurements are summarized in Tables 1 and 2. The theoretical peak performance in GFLOP/s is \(P(m) = B^{\tiny {\text {HSW}}}_{\tiny {\text {DDR}}} \cdot AI(m)\), where B is the bandwidth and AI is the arithmetic intensity. Our measurements show that we achieve a large fraction of the theoretical maximum performance and that we are utilizing nearly all of the available memory bandwidth. However, as m increases we achieve a lower fraction of the theoretical peak, a result of the larger working set increasing cache pressure. Tuning the CSB_Coo block size \(\beta \) could mitigate this effect, but is outside the scope of this work as the optimal choice depends on the specific hardware and on the sparsity structure and physics of each problem.
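As an illustration using the values in Table 1, the model gives
\[
P(1) = 128 \times 0.23 \approx 29\ \text{GFLOP/s} \qquad \text{and} \qquad P(8) = 128 \times 0.80 \approx 102\ \text{GFLOP/s},
\]
so the measured 23.2 GFLOP/s at \(m=1\) is roughly 79 % of the model prediction, while the measured 67.5 GFLOP/s at \(m=8\) is roughly 66 %.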

On KNL we use data from both DDR and MCDRAM. In this case two factors, the total data movement and the ratio of data movement on MCDRAM to that on DDR, are important. The measured peak sustained bandwidth on our KNL white boxes was \(B_{\tiny {\text {DDR}}}^{\tiny {\text {KNL}}}\) = 83 GB/s for DDR and \(B_{\tiny {\text {MCDRAM}}}^{\tiny {\text {KNL}}}=\) 390 GB/s for MCDRAM. For optimal performance the ratio of data moved on each controller should match the ratio of available bandwidth (\(R_{\max }=B_{\tiny {\text {MCDRAM}}}^{\tiny {\text {KNL}}}/B_{\tiny {\text {DDR}}}^{\tiny {\text {KNL}}} \approx 4.7\)) in order to fully utilize both memory systems. For \(m=1,4,8\) the measured ratios are \(R=0.6, 2.2, 4.2\), respectively. We estimate that, if enough MCDRAM were available, the ratio for \(m=16\) would be \(R\approx 8\), which would leave the DDR under-utilized. Increasing m reduces the traffic on DDR and reduces the total data moved on both controllers, resulting in an increased total arithmetic intensity. Assuming that both memory systems can be fully utilized simultaneously, the expected performance in GFLOP/s is \(P(m) = \min \bigl \{B_{\tiny {\text {DDR}}}^{\tiny {\text {KNL}}} \cdot AI_{\tiny {\text {DDR}}}(m), B_{\tiny {\text {MCDRAM}}}^{\tiny {\text {KNL}}} \cdot AI_{\tiny {\text {MCDRAM}}}(m)\bigr \}\). The peak performance predicted by this model (the curve) and the measured data points (the dots) for \(m=1,4,8\) are shown in Fig. 3. In this model the data movement on DDR is inversely proportional to m, since the number of times the full matrix must be read is reduced by a factor of m; the data movement on MCDRAM is nearly constant, as the same number of matrix–vector products must be computed. At low values of m the DDR bandwidth is the limiting factor, but as more traffic shifts to MCDRAM, MCDRAM becomes the limiting factor; the crossover point is defined by the ratio of data movement on the two sets of controllers in the kernel.
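As an illustration using the \(m=8\) row of Table 2, the DDR term of the model is \(83 \times 1.57 \approx 130\) GFLOP/s and the MCDRAM term is \(390 \times 0.37 \approx 144\) GFLOP/s, so the predicted peak is DDR-limited at about 130 GFLOP/s, against the measured 109.1 GFLOP/s; for \(m=1\) the DDR term, \(83 \times 0.20 \approx 17\) GFLOP/s, is far below the MCDRAM term and is close to the measured 17.1 GFLOP/s.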
Fig. 3.

Performance model for utilizing data from MCDRAM and DDR simultaneously on KNL. The dots are measurements taken with 256 threads on B0.

On both Haswell and KNL we find \(m=8\) simultaneous vectors to be the most efficient choice given the constraints on memory capacity and bandwidth. Our analysis suggests that further performance improvements for \(m>8\) are unlikely on KNL given the ratio of available DDR and MCDRAM bandwidth. For \(m=8\), a single KNL node achieves a \(1.6\times \) increase in performance over a dual-socket Haswell node. We also find that increasing m from 1 to 8 increases performance by \(6.4\times \) on KNL and \(2.9\times \) on Haswell. For completeness we also include timing data from a KNL configured in quadrant+cache mode in Fig. 2. We omit a detailed analysis, as this mode consistently has lower performance than explicitly managing the memory and reduces the total addressable memory of a node. The SNC modes were not analyzed due to the memory overhead associated with additional MPI ranks, lower efficiency, and the complications of allocating the large vectors across NUMA domains.

4 Optimization of Matrix Construction

In MFDn, there are four steps in the construction phase:
  1. Count nonzero tiles in the Hamiltonian;
  2. Construct the nonzero tile structure (roughly speaking, the block structure);
  3. Count nonzero matrix elements in each tile;
  4. Construct the Hamiltonian in CSB_Coo format.

The first three steps of the matrix construction phase involve only integer arithmetic, such as bit manipulation and integer comparison; the actual construction (step 4) involves both integer arithmetic and floating point operations. The matrix construction phase is naturally parallelizable and is performed without any MPI communication (even in a production run). However, an efficient implementation of the construction phase is challenging due to conditionals and the use of lookup tables with indirect indexing. The matrix construction phase is not bound by memory bandwidth; it is primarily sensitive to integer compute performance and to random indirect accesses to lookup tables. In our tests, cache mode provided no benefit for the construction. In the following we discuss several optimization techniques we have applied to the matrix construction phase.

Typical Loop Structure. The (i, j)th entry of the many-body Hamiltonian with a d-body interaction, \(H(i,j)=\langle \varPhi _i|H|\varPhi _j\rangle \), is nonzero only when the many-body basis states \(\varPhi _i\) and \(\varPhi _j\) differ by at most d single particle states. A typical loop structure in the matrix construction phase (steps 3–4) is shown in Fig. 4.
Fig. 4.

A typical loop structure in the matrix construction phase (steps 3–4).

In MFDn a many-body basis state \(\varPhi _i\) can be represented by a sequence of integers, denoted by \({{\mathrm{BIN}}}(\varPhi _i)\). Each binary bit of \({{\mathrm{BIN}}}(\varPhi _i)\) indicates whether a single particle state is occupied (each single particle state is either occupied or not occupied). Information on all differently-occupied single particle states between \(\varPhi _i\) and \(\varPhi _j\) is encoded in \({{\mathrm{BIN}}}(\varPhi _i){{\mathrm{\oplus }}}{{\mathrm{BIN}}}(\varPhi _j)\), where \({{\mathrm{\oplus }}}\) denotes the bitwise exclusive-or operation. The number of differently-occupied single particle states is then obtained by counting the number of 1's in \({{\mathrm{BIN}}}(\varPhi _i){{\mathrm{\oplus }}}{{\mathrm{BIN}}}(\varPhi _j)\), i.e., its popcount. In the 3rd line of the loop in Fig. 4 we compute the popcount of only the first integer of \({{\mathrm{BIN}}}(\varPhi _i){{\mathrm{\oplus }}}{{\mathrm{BIN}}}(\varPhi _j)\), representing the lowest 32 single particle states, to quickly identify most of the zero entries. The 4th and 5th lines of the loop are more complicated and are accomplished by subroutine calls.
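Since Fig. 4 itself is not reproduced in this text, the following C-style sketch outlines the structure just described; all names, the loop bounds, and the quick-test threshold of 2d set bits (a pair of states differing in at most d single particle states produces at most 2d ones in the exclusive-or) are illustrative, not MFDn source.

    /* Schematic sketch of the construction loop (steps 3-4); not MFDn source. */
    for (long j = 0; j < n_states; ++j) {                  /* column basis states  */
        for (long i = row_start(j); i < row_end(j); ++i) { /* candidate row states */
            /* line 3: quick test on the first word of BIN(i) XOR BIN(j)           */
            if (popcount64(bin[i][0] ^ bin[j][0]) > 2 * d)
                continue;                                  /* certainly zero       */
            /* line 4: detailed comparison of occupied states (subroutine call)    */
            if (!detailed_comparison(i, j))
                continue;
            /* line 5: count (step 3) or evaluate and store (step 4) H(i,j)        */
            process_nonzero(i, j);
        }
    }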

Promoting 32-bit Integers to 64-bit. The first optimization we made was to use 64-bit integers to encode \({{\mathrm{BIN}}}(\varPhi _i)\) instead of 32-bit integers. Consequently, computing the popcount of the first integer of \({{\mathrm{BIN}}}(\varPhi _i){{\mathrm{\oplus }}}{{\mathrm{BIN}}}(\varPhi _j)\) now checks the lowest 64 single particle states. Compared to the original 32-bit version, the 64-bit version quickly identifies more zero entries and reduces the subsequent calls to expensive subroutines (4th line in Fig. 4), while the additional cost of the popcount is negligible. This change leads to about a 15 % improvement in the first three steps of the construction phase; see Fig. 5.
Fig. 5.

Timings on B0 KNL for the four matrix construction phases.

Loop Unrolling. To optimize the loop in Fig. 4 we manually unroll the inner loop, for instance by a step size of 16.\(^{1}\) Lines 3–5 of the loop then become three independent loops of length 16, as shown in Fig. 6.
Fig. 6.

Unrolling the inner loop in Fig. 4 by a step size of 16.

The first innermost loop (lines 3–5) in Fig. 6 can potentially be vectorized. The subroutine calls in lines 7 and 10 can also be adjusted so that the subroutines accept arrays of inputs and outputs. Doing so increases data reuse and allows the compiler to generate vectorized instructions.
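A C-style sketch of the unrolled structure, with the same illustrative names as above and a batch size of 16 (remainder handling omitted), might look as follows; the point is that the quick test becomes a fixed-length, vectorizable loop and the two subroutines now take arrays of inputs and outputs.

    /* Schematic sketch of the unrolled inner loop (cf. Fig. 6); not MFDn source. */
    for (long i0 = row_start(j); i0 + 16 <= row_end(j); i0 += 16) {
        int npop[16], pass[16];
        for (int k = 0; k < 16; ++k)          /* quick test: vectorizable          */
            npop[k] = popcount64(bin[i0 + k][0] ^ bin[j][0]);
        /* detailed comparison over the whole batch (array arguments)              */
        detailed_comparison_batch(i0, j, npop, pass, 16);
        /* count or evaluate the surviving candidates of the batch                 */
        process_nonzero_batch(i0, j, pass, 16);
    }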

Vectorizing Popcount. Unfortunately, the Fortran intrinsic popcnt does not vectorize, which prevents any loop involving popcnt from being vectorized. This is because the hardware instruction operates only on scalar integer registers. To bypass this obstacle, we replace popcnt by a hand-coded popcount implementation. A simple implementation of popcount, shown in Fig. 7, already vectorizes. Another implementation, shown in Fig. 8, also vectorizes and is in general faster than the one in Fig. 7 by a small margin. Table 3 shows timing results of step 3 of the construction phase on KNL.
Fig. 7.

A simple implementation of popcount for 64-bit integers.

Fig. 8.

An optimized implementation of popcount for 64-bit integers.
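The bodies of Figs. 7 and 8 are not reproduced in this text; purely for orientation, the two C sketches below show common formulations of the same idea (the MFDn versions are Fortran): a fixed-trip-count bit loop that compilers can vectorize, and the classic SWAR bit-twiddling reduction that avoids the loop altogether.

    #include <stdint.h>

    /* Simple formulation: fixed trip count, so the compiler can vectorize it. */
    static inline int popcount64_simple(uint64_t x) {
        int c = 0;
        for (int b = 0; b < 64; ++b)
            c += (int)((x >> b) & 1u);
        return c;
    }

    /* SWAR formulation: sums bits in parallel within the 64-bit word. */
    static inline int popcount64_swar(uint64_t x) {
        x = x - ((x >> 1) & 0x5555555555555555ULL);                            /* 2-bit sums */
        x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL); /* 4-bit sums */
        x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                            /* 8-bit sums */
        return (int)((x * 0x0101010101010101ULL) >> 56);                       /* total      */
    }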

Table 3.

Timings (sec) for different versions of popcount.

                                ifort v. 16.0.2    ifort v. 17 beta
  Fortran intrinsic popcnt      34.05              34.01
  popcount in Fig. 7            34.26              103.43
  popcount in Fig. 8            33.90              32.46

State Comparison with SSE 4.2 Intrinsics. In MFDn, in addition to the bit representation, there is also an index-based representation of the occupied single particle states in \(\varPhi _i\). This representation is used in the detailed tests for the states (4th line of Fig. 4). The detailed test counts the number of differently occupied states, i.e., the symmetric difference of the two sets of indices describing which states are occupied. Depending on the number of differently occupied states and the n-body interactions, a quantum selection rule is then applied.
Fig. 9.

An implementation of detailed state comparison using SSE 4.2 intrinsics.

On machines which support the SSE 4.2 instruction set, the __cmpistrm intrinsic function can be used to perform an all-to-all comparison of eight 16-bit integers in a single instruction. Its arguments a and b are of type __m128i and each hold eight 16-bit integers. By appropriately setting the control bits with mode, the result is a bit mask in which a bit is 1 if the element in that position of a does not have a matching value in b. The count of differences is obtained by extracting the relevant part of the resulting mask and computing its popcount. Our implementation is shown in Fig. 9. In our test case the number of occupied single particle states is 8, which perfectly matches the register size. For cases with fewer than 8 occupied states one can pad the integer representation with zeros. Additional logic is required for cases with more than 8 occupied states, but the generalization is straightforward. Unfortunately there are no corresponding AVX-512 instructions, but techniques based on rotations, shuffles, and comparisons are interesting possibilities for future work.
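As a companion to Fig. 9 (which is not reproduced here), the following C sketch shows how such a comparison can be expressed with the standard Intel intrinsic _mm_cmpistrm; the helper name is ours, and we assume all eight packed state indices are nonzero, since the implicit-length string instructions treat a zero element as a terminator.

    /* Schematic C sketch (not the MFDn routine of Fig. 9): count the occupied
     * single particle states of a that are not occupied in b.  a and b each hold
     * eight 16-bit state indices; requires SSE 4.2 (e.g. compile with -msse4.2). */
    #include <nmmintrin.h>

    static inline int count_states_not_in(__m128i a, __m128i b)
    {
        /* For each 16-bit element of the second operand (a), test equality against
         * ANY element of the first operand (b); the masked negative polarity flips
         * the result, so a set bit marks an element of a with no match in b. */
        __m128i mask = _mm_cmpistrm(b, a,
                                    _SIDD_UWORD_OPS | _SIDD_CMP_EQUAL_ANY |
                                    _SIDD_BIT_MASK  | _SIDD_MASKED_NEGATIVE_POLARITY);
        /* The low 8 bits of the mask correspond to the 8 elements of a. */
        return _mm_popcnt_u32((unsigned)_mm_extract_epi32(mask, 0) & 0xFFu);
    }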

The efficiency of the SSE 4.2 approach is shown in Fig. 10. For comparison we show the timings for the 32-bit and 64-bit popcount quick tests (using the machine instructions) followed by the original scalar detailed comparison test. In addition we include the timing for the case in which only the detailed comparisons were done. The SSE 4.2 approach is considerably more efficient than the popcount-based variants, though completely skipping the quick tests does not seem wise. This suggests that a well-tuned implementation on the target architecture should identify a good balance between quick tests and detailed comparisons.
Fig. 10.

Timings for the 3rd matrix construction phase, in which the number of nonzero matrix elements is counted. Timings were taken with 256 threads on B0 KNL with 64 cores in quadrant+flat mode and on 2 Haswell sockets with 16 threads each.

5 Conclusion and Outlook

We found that improved data reuse and vectorization were essential for improving the performance of MFDn on KNL over Haswell, especially in the SpMM kernel, where we achieved a 1.6\(\times \) speedup over Haswell by operating on \(m=8\) vectors simultaneously. For this kernel we also showed that, with directives and the memkind library, both the DDR and MCDRAM memories can be used effectively at the same time, and that balancing the load on the two memory systems is important. Our optimizations for the matrix construction benefit both the Haswell and KNL architectures by reducing branching and enabling the use of vector registers. The OpenMP 4 directives used to implement vectorized popcount functions and the SIMD intrinsic functions used for the detailed comparisons of quantum many-body states are shown to provide excellent performance. Future efforts to apply AVX-512 instructions will be key to obtaining further improvements in the matrix construction phases of MFDn.

Footnotes

  1. The optimal choice of this number is certainly architecture dependent.

Acknowledgments

This work is supported in part by U.S. DOE Grant Number DESC0008485 (SciDAC/NUCLEI). This research used resources of the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

References

  1. Aktulga, H.M., Buluç, A., Williams, S., Yang, C.: Optimizing sparse matrix-multiple vectors multiplication for nuclear configuration interaction calculations. In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp. 1213–1222. IEEE (2014)
  2. Aktulga, H.M., Yang, C., Ng, E.G., Maris, P., Vary, J.P.: Improving the scalability of a symmetric iterative eigensolver for multi-core platforms. Concurr. Comput. Pract. Exper. 26(16), 2631–2651 (2014)
  3. Binder, S., Calci, A., Epelbaum, E., Furnstahl, R.J., Golak, J., Hebeler, K., Kamada, H., Krebs, H., Langhammer, J., Liebig, S., Maris, P., Meißner, U.G., Minossi, D., Nogga, A., Potter, H., Roth, R., Skibiński, R., Topolnicki, K., Vary, J.P., Witała, H.: Few-nucleon systems with state-of-the-art chiral nucleon-nucleon forces. Phys. Rev. C 93(4), 044002 (2016)
  4. Cantalupo, C., Venkatesan, V., Hammond, J.R., Hammond, S.: User extensible heap manager for heterogeneous memory platforms and mixed memory policies (2015)
  5. Knyazev, A.V.: Toward the optimal preconditioned eigensolver: locally optimal block preconditioned conjugate gradient method. SIAM J. Sci. Comput. 23(2), 517–541 (2001)
  6. Lanczos, C.: An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. J. Res. Natl. Bur. Std. B Math. Sci. 45(4), 255–282 (1950)
  7. Maris, P., Caprio, M.A., Vary, J.P.: Emergence of rotational bands in ab initio no-core configuration interaction calculations of the Be isotopes. Phys. Rev. C 91(1), 014310 (2015)
  8. Maris, P., Vary, J.P., Navratil, P., Ormand, W.E., Nam, H., Dean, D.J.: Origin of the anomalous long lifetime of 14C. Phys. Rev. Lett. 106(20), 202502 (2011)
  9. Maris, P., Vary, J.P., Gandolfi, S., Carlson, J., Pieper, S.C.: Properties of trapped neutrons interacting with realistic nuclear Hamiltonians. Phys. Rev. C 87(5), 054318 (2013)

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Brandon Cook (1)
  • Pieter Maris (2)
  • Meiyue Shao (1)
  • Nathan Wichmann (3)
  • Marcus Wagner (3)
  • John O’Neill (4)
  • Thanh Phung (4)
  • Gaurav Bansal (4)

  1. Lawrence Berkeley National Laboratory, Berkeley, USA
  2. Department of Physics and Astronomy, Iowa State University, Ames, USA
  3. Cray Inc., Seattle, USA
  4. Software and Services Group, Intel Corporation, Santa Clara, USA