Runtime optimization of a memory efficient CG solver for FFT-based homogenization: implementation details and scaling results for linear elasticity

The memory efficient CG algorithm of Kabel et al. (Comput Mech 54(6):1497–1514, 2014) reduces the memory requirements of a strain based implementation of the CG algorithm following Zeman et al. (J Comput Phys 229(21):8065–8071, 2010) for solving the equations of linear elasticity by 40%. But since the Fourier wave vectors have to be recalculated at several steps of the memory efficient algorithm, the runtime increases for a straightforward implementation. We explain how to reduce the runtime overhead to a negligible size, and show that the memory efficient algorithm scales better than the standard algorithm with up to 256 MPI processes.


Introduction
The FFT-based algorithm of [3] is a fast and accurate method for obtaining effective properties in linear elasticity and conductivity problems. Even though its memory requirements are low due to the matrix free implementation, memory can still be a limiting factor for the resolution and therefore for the accuracy of the numerical simulation. When solving the discretized system by the CG algorithm as suggested in [2] instead of by a Neumann series expansion as in [3], the memory usage per voxel increases from 144 to 360 bytes, a factor of 2.5. To mitigate this additional burden while still benefiting from the fast convergence of the CG algorithm, [1] suggests an alternative implementation reducing the memory requirement to 216 bytes per voxel, a reduction of 40%.
A direct implementation of the proposed memory efficient CG algorithm increases the runtime by a factor two. We will show how the computational overhead can be considerably reduced. We will also investigate the scaling behaviour of both the standard CG algorithm and the memory efficient implementation.
Correspondence: Hannes Grimm-Strele, hannes.grimm-strele@itwm.fraunhofer.de, Department of Flow and Material Simulation, Fraunhofer ITWM, Kaiserslautern, Germany.

Besides low memory consumption, scalability is a major prerequisite for an efficient FFT solver. [4] has recently compared MPI and OpenMP parallelization strategies. They show that with MPI, nearly perfect scaling can be achieved for a solver which is very similar to ours. [5] emphasizes the dependence of the scaling of their code on the material law. Since applying the material law is a local operation and scales perfectly for linear elasticity, the scaling of the code is better if the material law application dominates the overall runtime. In this sense, linear elasticity is a challenging test for the scalability of a code, since in this case the application of the material law is computationally cheap.
In this work, we focus on the staggered grid discretization from [6]. But the CG algorithm can be combined with many other discretizations, e.g., the basic discretization from [3] and the discretizations from [8,9,10], to which our results partly apply as well.

Memory efficient CG algorithm
In the following, we describe the memory efficient CG algorithm from [1] for small deformations. Our aim is to solve the Lippmann-Schwinger equation

ε + Γ⁰((C − C⁰) : ε) = E,   (1)

where the strain ε is given by the sum of E, the prescribed macroscopic field, and a periodic fluctuation field depending on the displacement u, i.e., ε = E + ∇ˢu. The simulation domain is a cuboid V with periodic boundary conditions for the fluctuation field. Equation (1) can be brought into the form Aε = E by writing A = Id + B and B = Γ⁰(C − C⁰), such that we can apply iterative schemes such as the CG algorithm to solve it. The operator Γ⁰ has the form

Γ⁰ = ∇ˢ G⁰ Div,   (2)

where G⁰ is the Green operator, which is explicitly known in Fourier space. For small deformations, the strain operator ∇ = ∇ˢ is defined by

∇ˢu = (∇u + (∇u)ᵀ) / 2.   (3)

The key idea is to save all additional arrays of the CG algorithm as displacement fluctuation fields instead of strain fields, thereby halving the memory requirements of each array, except for the one array which stores the final solution. The modified algorithm yields the same result as the naive implementation based on strain fields.
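To make the matrix-free structure concrete, the following is a minimal CG sketch in Python that accesses A only through a function, applied to a small dense SPD stand-in for A = Id + B. This is purely illustrative: the actual operator acts on strain or displacement fields via FFTs, and all names here are assumptions of this sketch.

```python
import numpy as np

def cg(apply_A, b, tol=1e-10, max_iter=500):
    """Matrix-free conjugate gradients: A is only available as a function."""
    x = np.zeros_like(b)
    r = b - apply_A(x)
    q = r.copy()                      # search direction
    gamma = r @ r
    for _ in range(max_iter):
        Aq = apply_A(q)
        alpha = gamma / (q @ Aq)
        x += alpha * q
        r -= alpha * Aq
        delta = r @ r
        if np.sqrt(delta) < tol:
            break
        q = r + (delta / gamma) * q
        gamma = delta
    return x

# Small SPD stand-in for A = Id + B.
rng = np.random.default_rng(5)
n = 30
Bm = rng.standard_normal((n, n))
A = np.eye(n) + 0.2 * (Bm @ Bm.T) / np.linalg.norm(Bm, 2)  # symmetric positive definite
b = rng.standard_normal(n)

x = cg(lambda v: A @ v, b)
assert np.allclose(A @ x, b, atol=1e-8)
```

The solver never forms A explicitly; only the callback `apply_A` is needed, which mirrors how the FFT-based operator is applied field by field.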

Implementation details
When implementing the memory efficient CG algorithm from [1], we need to define how to apply the operator from Eq. (4) and the operator ∇ˢ as defined in (3) to a displacement fluctuation field. We also need to specify how to calculate the inner product from the Fourier coefficients of the displacements. As a convergence criterion, we use the simple method given in [1], Eq. (5). In the following, capital letters refer to strain fields, while lower case letters refer to displacement fluctuation fields.
When we calculate the inner product on the displacement fluctuation fields, it must give the same result as when calculated on the strain fields. More specifically, if ∇ˢq = Q and ∇ˢy = Y, we require that ⟨q, y⟩ = ⟨Q, Y⟩. Our simulation domain V is a cuboid which we discretize by a voxel mesh with N₁ × N₂ × N₃ voxels. All voxels have the same constant volume. The inner product of the strain fields is then obtained by summing the Frobenius products over all voxels, with the Frobenius inner product defined by

Q : Y = Σ_{l,m} Q*_{l,m} Y_{l,m} = Σ_{l,m} Q_{l,m} Y_{l,m}

(since Q and Y are real-valued and symmetric), and N = N₁N₂N₃. Due to Parseval's theorem, the last sum can be calculated by summing over the Fourier coefficients of Q_{l,m} and Y_{l,m}. We write ξ = (ξ₁, ξ₂, ξ₃) for the Fourier wave vector. The form of the Fourier wave vector and the constant factors in front of the sums might change depending on the implementation of the FFT library.
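The Parseval identity underlying this step can be checked numerically. The sketch below uses numpy's FFT conventions (unnormalized forward transform, hence the factor 1/N), which may differ from the FFT library used in an actual solver.

```python
import numpy as np

rng = np.random.default_rng(0)
N1, N2, N3 = 4, 6, 8
N = N1 * N2 * N3

# Two real-valued scalar fields standing in for one strain component each.
Q = rng.standard_normal((N1, N2, N3))
Y = rng.standard_normal((N1, N2, N3))

# Real-space inner product over all voxels.
real_space = np.sum(Q * Y)

# Parseval: the same sum from the complex Fourier coefficients.
Qh = np.fft.fftn(Q)
Yh = np.fft.fftn(Y)
fourier_space = np.sum(Qh * np.conj(Yh)).real / N

assert np.allclose(real_space, fourier_space)
```

The `.real` is exact here: for real input fields the imaginary parts cancel in the sum, as noted in the text.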
Continuing from (7), the inner product can be expressed in the Fourier coefficients of the displacements. As described in [6], derivatives transform to multiplications with the Fourier wave vectors in Fourier space. The precise form of k±(ξ) depends on the discretization method. For the staggered grid discretization, they are defined by

k±_j(ξ) = ±(exp(±2πi ξ_j / N_j) − 1) / h_j,  j = 1, 2, 3,   (10)

where h_j is the grid spacing in direction j. The discrete form of the ∇ˢ and Div operators in Fourier space depends on the discretization method, too. We focus here on the staggered grid discretization, but similar formulae can be derived in the same way for any other discretization method.
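The forward-difference form of k⁺ can be verified against a real-space difference in one dimension. The sign and phase conventions below are tied to numpy's FFT and are assumptions of this sketch, not necessarily those of any particular FFT library.

```python
import numpy as np

N, h = 16, 0.5
rng = np.random.default_rng(1)
u = rng.standard_normal(N)

# Periodic forward difference on the staggered grid: (u[i+1] - u[i]) / h.
du = (np.roll(u, -1) - u) / h

# In Fourier space, the same operation is a multiplication with k_plus.
xi = np.fft.fftfreq(N) * N            # integer wave numbers 0, 1, ..., -1
k_plus = (np.exp(2j * np.pi * xi / N) - 1.0) / h

assert np.allclose(np.fft.fft(du), k_plus * np.fft.fft(u))
```

Because exp(2πi ξ/N) is periodic in ξ, the ordering of the frequencies does not matter as long as it matches the FFT output ordering.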
Applying ∇ˢ to a displacement q in Fourier space yields a bilinear expression (11) in the wave vectors k±(ξ) and the Fourier coefficients q̂(ξ) = (q̂₁, q̂₂, q̂₃) [6]. Using (11), when calculating the norm, i.e., q = y, we can simplify the resulting double sum over l, m = 1, …, 3. For the inner product, we cannot simplify the expression in the same way, but have to calculate the gradients first and sum up their products. Since the input arrays q and y contain only real data, the Fourier coefficients possess the Hermitian symmetry q̂(ξ) = (q̂(−ξ))* [7]. Therefore, the imaginary parts cancel in the final summation, and we only need to take the real parts into account. The result is, as expected, a real number.
As can be seen from Eq. (1), the displacement fields contain only information about the periodic part of the strain field. The mean of the displacement fields is always 0. When calculating the norm or the inner product through the gradients of displacements as in (6), we have to add the mean of the strain fields associated with q and y. Therefore, we save the mean of the corresponding strain field in the vectors q(0) and y(0) and add the inner product q(0), y(0) to the result. We note that q(0) and y(0) are always strain tensors even though q and y contain displacements.
The implementation of the inner product of two displacement fluctuation fields q and y is summarized in Algorithms 1 and 2.

Algorithm 1 Function InnerProduct(q, y)
1: Σ ← 0
2: for all ξ(i, j, k) do
3:   Σ ← Σ + VoxelProduct(q(ξ(i, j, k)), y(ξ(i, j, k)))
4: return Σ

Algorithm 2 Function VoxelProduct(q(ξ), y(ξ))
1: for l = 1, 2, 3 do
2:   ∇ˢq_{l,l} ← −k_l q_l,  ∇ˢy_{l,l} ← −k_l y_l
3: for l, m = 1, 2, 3, l < m do
4:   ∇ˢq_{l,m} ← k_l q_m + k_m q_l,  ∇ˢy_{l,m} ← k_l y_m + k_m y_l
5: return Σ_{l,m=1, l≤m}^{3} ∇ˢq_{l,m} ∇ˢy_{l,m}
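A self-contained numpy illustration of the same idea follows: a strain inner product is computed from displacement Fourier coefficients only and checked against a real-space reference. The factor ½ in the symmetrization and the forward-difference wave numbers are assumptions of this sketch and differ from the sign conventions used in Algorithm 2.

```python
import numpy as np

rng = np.random.default_rng(2)
N = (4, 6, 8)
h = (1.0, 0.5, 0.25)
Ntot = np.prod(N)

q = rng.standard_normal((3, *N))      # displacement fluctuation fields
y = rng.standard_normal((3, *N))

def diff(u, axis, hj):
    """Periodic forward difference along one axis (staggered grid)."""
    return (np.roll(u, -1, axis=axis) - u) / hj

def strain(u):
    """Symmetrized gradient of a displacement field, shape (3, 3, *N)."""
    S = np.empty((3, 3, *N))
    for l in range(3):
        for m in range(3):
            S[l, m] = 0.5 * (diff(u[l], m, h[m]) + diff(u[m], l, h[l]))
    return S

# Reference: Frobenius inner product of the two strain fields in real space.
real_space = np.sum(strain(q) * strain(y))

# Fourier route: only the displacement coefficients are stored.
k = [(np.exp(2j * np.pi * np.arange(N[j]) / N[j]) - 1.0) / h[j] for j in range(3)]
K = np.meshgrid(k[0], k[1], k[2], indexing="ij")   # broadcast 1D arrays to 3D

qh = np.fft.fftn(q, axes=(1, 2, 3))
yh = np.fft.fftn(y, axes=(1, 2, 3))

acc = 0.0
for l in range(3):
    for m in range(3):
        Sq = 0.5 * (K[m] * qh[l] + K[l] * qh[m])   # Fourier strain of q
        Sy = 0.5 * (K[m] * yh[l] + K[l] * yh[m])   # Fourier strain of y
        acc += np.sum(Sq * np.conj(Sy)).real
fourier_space = acc / Ntot

assert np.allclose(real_space, fourier_space)
```

Note that only the three displacement components per voxel enter the Fourier route, while the reference needs all strain components; this is exactly the memory saving the algorithm exploits.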

Minimizing runtime overhead
In the memory efficient CG algorithm, the discretized wave vectors k± in Fourier space are needed every time we calculate an inner product or a norm, or apply one of the operators ∇ˢ, G⁰, and Div. If we use formula (10), calculating k± is quite expensive, and recalculating it in each loop introduces considerable overhead compared to the standard CG algorithm.
We can decrease this overhead in two ways. First, we can modify the function applying the operator defined by Eq. (4) such that the parameter α, the norms ‖q‖² and ‖q − w‖², and the inner products ⟨q, q − w⟩, ⟨r, q − w⟩ and ⟨u, q⟩ are all calculated "on the fly" in the same loop where G⁰ Div is applied. Then, we can calculate

δ = ‖r_old‖² − 2α ⟨r, q − w⟩ + α² ‖q − w‖²

and

‖u_new‖² = ‖u_old‖² + 2α ⟨u, q⟩ + α² ‖q‖²,

where ‖r_old‖² and ‖u_old‖² are known from the previous iteration. ‖u_new‖² is needed for our convergence criterion (5).
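These recurrences can be sanity-checked on a small dense system. The SPD matrix below is a stand-in for the homogenization operator, and the identification of q − w with the operator applied to q is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # SPD stand-in for the operator

u = rng.standard_normal(n)           # current iterate
r = rng.standard_normal(n)           # current residual
q = rng.standard_normal(n)           # current search direction

d = A @ q                            # plays the role of q - w
gamma = r @ r
alpha = gamma / (q @ d)

# Direct updates (need the updated vectors themselves):
r_new = r - alpha * d
u_new = u + alpha * q

# "On the fly" recurrences using only inner products gathered in one loop:
delta = gamma - 2.0 * alpha * (r @ d) + alpha**2 * (d @ d)
u_new_norm2 = u @ u + 2.0 * alpha * (u @ q) + alpha**2 * (q @ q)

assert np.allclose(delta, r_new @ r_new)
assert np.allclose(u_new_norm2, u_new @ u_new)
```

The point of the recurrences is that δ and ‖u_new‖² become available before r and u are actually updated, so all sums can be fused into the single loop applying the operator.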
In this way, we can avoid recalculating the wave vectors k ± in each of these tasks. The resulting loop, corresponding to the application of G 0 Div in Eq. (4), is summarized in Algorithm 3. The wave vectors k ± (ξ ) are only calculated once per coefficient. We note that in each application of the InnerProduct function as defined in Algorithm 2, the gradient of the input vector has to be computed. Therefore, we modify the corresponding functions to take the gradient as input, and calculate the gradient itself only once.
The input for Algorithm 3 is the strain field in Fourier space. Furthermore, the arrays q, r and u as well as the parameter γ are used for calculating the parameters α and δ. The array w contains the output displacement field. Additionally, Algorithm 3 returns the norm of the new solution field ‖u_new‖ used for the convergence test.

Algorithm 3 Apply G⁰ Div combined with inner product and norm calculation: for each Fourier coefficient ξ, the wave vectors k±(ξ) are computed once, G⁰ Div is applied, and the running sums Σ₁, …, Σ₅ for the required norms and inner products are accumulated; afterwards, for i = 1, …, 5, the mean field contribution is added to Σᵢ.
Using Algorithm 3, the memory efficient CG algorithm can be simplified to Algorithm 4. The function ApplyReducedOperator corresponds to applying Eq. (4) to a displacement fluctuation field, and FourierGradient to applying Eq. (11).

Algorithm 4 Efficient implementation of the algorithm from [1]
1: W ← G⁰ Div(FFT(E))
2: ApplyReducedOperator(x, W, r)
3: q ← r
4: γ ← InnerProduct(r, r)
5: while not converged do
6:   Algorithm 3(W, q, r, u, w, α, γ, δ)
7:   u ← u + αq
8:   r ← r − α(q − w)
9:   β ← δ/γ
10:  q ← r + βq
11:  γ ← δ

For the staggered grid discretization [6], according to Eq. (10) and for each j = 1, 2, 3, the component k±_j of the discretized wave vector depends only on ξ_j. Therefore, these vectors can be precalculated and stored in a one-dimensional array per direction. In each loop, we only need to access the precomputed values instead of calculating them. This is also possible, e.g., for the basic discretization used in the algorithm of [3], but not with the discretization from [8] or with finite element based discretizations [9,10]. If calculating k±_j is expensive and the vectors cannot be stored in one-dimensional arrays but only in three-dimensional ones, the memory savings of the whole algorithm are lost. Nevertheless, we cannot completely remove the runtime overhead, because norm and inner product calculations are inevitably more expensive in Fourier space than in real space, as can be seen from Eq. (13): there, we need to calculate two vector norms, twelve complex products, and three real parts, compared to one vector norm in the real case.
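The precomputation idea can be sketched in numpy, where broadcasting a 1D array replaces the per-voxel recomputation. The array names and the forward-difference form of k⁺ are assumptions of this sketch.

```python
import numpy as np

N = (8, 12, 16)
h = (1.0, 0.5, 0.25)
rng = np.random.default_rng(4)
uh = np.fft.fftn(rng.standard_normal(N))   # Fourier coefficients of a field

def k_plus(xi, Nj, hj):
    """Staggered-grid forward wave number for integer frequency xi."""
    return (np.exp(2j * np.pi * xi / Nj) - 1.0) / hj

# Precompute one 1D array per direction: O(N1 + N2 + N3) extra memory.
k1, k2, k3 = (k_plus(np.arange(N[j]), N[j], h[j]) for j in range(3))

# Apply the derivative along direction 1 by broadcasting the 1D array,
# instead of recomputing k at every voxel inside the loop.
d1_fast = k1[:, None, None] * uh

# Naive reference: recompute k for every coefficient.
d1_naive = np.empty_like(uh)
for i in range(N[0]):
    for j in range(N[1]):
        for l in range(N[2]):
            d1_naive[i, j, l] = k_plus(i, N[0], h[0]) * uh[i, j, l]

assert np.allclose(d1_fast, d1_naive)
```

The separability in the directions is what makes the 1D storage possible; a discretization whose multiplier couples ξ₁, ξ₂ and ξ₃ would force full 3D arrays, as noted above.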

Numerical results
Verification of Algorithm 4 is straightforward since the results of each iteration must coincide with the results of the classical CG algorithm up to machine precision. It is also obvious that the memory requirements are lower by 40%.
All tests in this section are run on the Beehive cluster at ITWM. The Beehive cluster consists of 166 nodes, each with two eight-core Intel Xeon E5-2670 CPUs. All nodes have 64 GB RAM and are connected through an Infiniband interconnect. The cluster runs on CentOS Linux release 7.5 (Linux kernel 3.10.0). For all tests, we use gcc version 6.2.0, OpenMPI 1.10.7, MPICH 3.2, and FFTW version 3.3.7, with the compilation options -mfpmath=387 -funroll-all-loops -O2 -pipe.

We perform two strong scaling tests with up to 256 tasks where we use the staggered grid discretization. As test geometry, we choose the Berea sandstone from [11]. The image can be downloaded from [12]. We cut out a box of 512³ voxels in the center of the original image as shown in Fig. 1. For the solid material, we set the bulk modulus to 36 GPa, the shear modulus to 45 GPa [13], and assume linear isotropy. Then, the Berea sandstone can be treated as a linear elastic problem. We calculate one load case for which it takes 163 iterations of the CG algorithm until Eq. (5) is fulfilled with a tolerance of 10⁻⁴.

Workstation test
In the first setup, we run this test with the memory efficient CG algorithm on a single node of our cluster, and compare the performance of OpenMP parallelization and both MPI libraries. The resulting performance is shown in Fig. 2 together with the runtime of the naive implementation of the algorithm parallelized with OpenMP.
In this test, the runtime of the naive implementation of the iterative solver, excluding the FFT transformations, is about a factor of two higher than that of the efficient implementation. We excluded the FFT runtime since it is the same for all algorithms, and its share of the total runtime depends on the problem size and the number of parallel processes. The scaling efficiency is generally very good. We observe that with MPI, the parallel efficiency is slightly better than with OpenMP. The choice of the MPI library does not significantly change the parallel performance. In all cases, we observe a particularly strong slowdown when going from 8 to 16 cores. This is partly due to the decreasing frequency of the cores depending on the number of "active cores", as described in [14]. Similar to their work, we could deduce the expected slowdown from the core frequencies. In Fig. 2, the line designated by "achievable scaling" takes the expected slowdown due to this effect into account. These results, in particular the improved performance with fewer active cores per node, are in accordance with the findings of [4]. While the standard CG algorithm consumed 11.62 GB of main memory, the memory consumption of the memory efficient variant was only 7.32 GB. This represents a reduction of about 40%, as predicted. k±(ξ) can be stored in three one-dimensional arrays, and therefore the additional amount of memory needed is negligible.

Cluster test
In the second setup, we repeat the same test, but use more MPI processes and compare the standard CG with the memory efficient CG algorithm. To avoid any dependency on the number of active cores, we always set the number of processes per node to 8. Since the results do not significantly depend on the MPI library, we only show data for OpenMPI 1.10.7. In Fig. 3, we observe that the scaling is still very good for both algorithms. The memory efficient CG algorithm is even slightly faster in most cases. In particular, there is no performance loss compared to the standard CG algorithm.
Due to the problem size, more than 256 MPI processes cannot be used efficiently in this test. Another decisive factor influencing the scaling is the parallel performance of the FFTW library [7]. As can be seen from Tables 2 and 3, where the total runtime and the time spent in the FFT calls are listed for both the standard and the memory efficient CG algorithm, the share of the runtime spent in the FFT calls increases with the number of MPI processes, indicating that the scaling efficiency of the FFT library is worse than that of the rest of the code. In particular, the FFT runtime decreases only slightly when the number of processes is increased from 32 to 64.

Conclusions
The memory efficient CG algorithm from [1] reduces the memory requirements of numerical simulations of linear elasticity by around 40%. At the same time, it introduces a runtime overhead. Depending on the parallelization technique and the problem size, this overhead can be reduced to between 0 and 15% of the runtime of the standard implementation of the CG algorithm. Even though the runtime of our code is dominated by the FFT library, we obtain an impressive parallel efficiency in two strong scaling tests with up to 256 MPI processes.
We remark that for large deformations, where the displacement gradient operator ∇ is used instead of its symmetrized variant ∇ˢ, the norm calculation in Fourier space actually becomes simpler, because the double sum factorizes:

Σ_{l,m=1}^{3} |k_m(ξ) q̂_l(ξ)|² = ‖k(ξ)‖² ‖q̂(ξ)‖².

Therefore, the runtime overhead should be even smaller in this case.