1 Introduction

The FFT-based algorithm of [3] is a fast and accurate method for obtaining effective properties in linear elasticity and conductivity problems. Even though its memory requirements are low due to the matrix-free implementation, memory can still limit the resolution and therefore the accuracy of the numerical simulation. When the discretized system is solved by the CG algorithm as suggested in [2] instead of by a Neumann series expansion as in [3], the memory usage per voxel increases from 144 bytes to 360 bytes, a factor of 2.5. To mitigate this additional burden while still benefiting from the fast convergence of the CG algorithm, [1] suggests an alternative implementation that reduces the memory requirement to 216 bytes per voxel, a reduction of \(40\%\).

A direct implementation of the proposed memory efficient CG algorithm increases the runtime by a factor of two. We will show how this computational overhead can be reduced considerably. We will also investigate the scaling behaviour of both the standard CG algorithm and the memory efficient implementation.

Besides low memory consumption, scalability is a major prerequisite for an efficient FFT solver. The authors of [4] recently compared MPI and OpenMP parallelization strategies. They show that with MPI, nearly perfect scaling can be achieved for a solver which is very similar to ours.

The authors of [5] emphasize that the scaling of their code depends on the material law. Since applying the material law is a local operation and scales perfectly for linear elasticity, the code scales better when the material law application dominates the overall runtime. In this sense, linear elasticity is a challenging test for the scalability of a code, since in this case the application of the material law is computationally cheap.
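To illustrate why the material law is a purely local operation, the following minimal NumPy sketch applies an isotropic linear elastic law voxel by voxel; the Voigt-like array layout, the function name, and the numerical values are assumptions of this sketch, not the data layout of the solver discussed here.

```python
import numpy as np

def apply_isotropic_law(eps, lam, mu):
    """sigma = lam * tr(eps) * I + 2 * mu * eps, evaluated independently per voxel.

    eps: array of shape (6, N1, N2, N3) with the strain components
         (e11, e22, e33, e12, e13, e23) of every voxel.
    lam, mu: Lame constants, scalars or per-voxel arrays.
    """
    sigma = 2.0 * mu * eps
    trace = eps[0] + eps[1] + eps[2]          # tr(eps) per voxel
    sigma[:3] += lam * trace                  # lam * tr(eps) enters the diagonal only
    return sigma

# illustrative use on a small grid
lam, mu = 1.2e9, 0.8e9                        # illustrative Lame constants in Pa
eps = np.zeros((6, 4, 4, 4))
eps[0] = 1.0e-3                               # uniform uniaxial strain e11
sigma = apply_isotropic_law(eps, lam, mu)
```

Because every voxel is processed independently, this step parallelizes trivially; its weight in the total runtime therefore determines how much it can improve the measured scaling.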

In this work, we focus on the staggered grid discretization from [6]. However, the CG algorithm can be combined with many other discretizations, e.g., the basic discretization from [3] and the discretizations from [8,9,10], to which our results partly apply as well.

1.1 Memory efficient CG algorithm

In the following, we describe the memory efficient CG algorithm from [1] for small deformations. Our aim is to solve the Lippmann–Schwinger equation

$$\begin{aligned} \varepsilon + \varGamma ^0 * \left( \left( \mathcal {C} - \mathcal {C}_0 \right) : \varepsilon \right) = E,\ \varepsilon = E + \varepsilon (u), \end{aligned}$$
(1)

where the strain \(\varepsilon \) is given by the sum of E, the prescribed macroscopic strain, and a periodic fluctuation field depending on the displacement u. The simulation domain is a cuboid V with periodic boundary conditions for the fluctuation field.

Equation (1) can be brought into the form \(A \varepsilon = E\) by writing \(A = {\mathrm {Id}} + B\) with \(B = \varGamma ^0 \left( \mathcal {C} - \mathcal {C}_0 \right) \), such that we can apply iterative schemes such as the CG algorithm to solve it. The operator \(\varGamma ^0\) has the form

$$\begin{aligned} \varGamma ^0 = \nabla \,G^0\,{\mathrm {Div}}, \end{aligned}$$
(2)

where \(G^0\) is the Green operator which is explicitly known in Fourier space. For small deformations, the strain operator \(\nabla = \nabla _s\) is defined by

$$\begin{aligned} \nabla _s u = \left( \begin{array}{ccc} \frac{\partial u_1}{\partial x_1} &{}\quad \frac{1}{2} \left( \frac{\partial u_2}{\partial x_1} + \frac{\partial u_1}{\partial x_2} \right) &{}\quad \frac{1}{2} \left( \frac{\partial u_3}{\partial x_1} + \frac{\partial u_1}{\partial x_3} \right) \\ &{} \frac{\partial u_2}{\partial x_2} &{}\quad \frac{1}{2} \left( \frac{\partial u_2}{\partial x_3} + \frac{\partial u_3}{\partial x_2} \right) \\ {\text {sym}} &{}\quad &{}\quad \frac{\partial u_3}{\partial x_3} \end{array} \right) \end{aligned}$$
(3)

for a displacement vector \(u = \left( u_1, u_2, u_3 \right) \).

The key idea is to store all auxiliary arrays of the CG algorithm as displacement fluctuation fields instead of strain fields, thereby halving the memory requirement of each array except for the one array which stores the final solution. The modified algorithm yields the same iterates as the naive implementation based on strain fields.
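For orientation, here is a minimal sketch of the naive, strain-based matrix-free CG iteration that the memory efficient variant reorganizes. The operator `apply_A`, the array shapes, and the simple relative-residual stopping test are assumptions of this sketch rather than the setup of [1] or [2].

```python
import numpy as np

def cg(apply_A, E, tol=1e-4, max_iter=500):
    """Textbook matrix-free CG for A eps = E.

    E: the prescribed macroscopic strain, broadcast to a full strain field of
       shape (6, N1, N2, N3). Every work array (eps, r, p, Ap) is a strain field
       of the same size, which is what drives the memory footprint of the naive
       implementation.
    """
    eps = E.copy()                      # initial guess: the macroscopic strain
    r = E - apply_A(eps)                # residual
    p = r.copy()                        # search direction
    rr = np.vdot(r, r).real
    rhs_norm_sq = np.vdot(E, E).real
    for _ in range(max_iter):
        Ap = apply_A(p)                 # one operator application per iteration
        alpha = rr / np.vdot(p, Ap).real
        eps += alpha * p
        r -= alpha * Ap
        rr_new = np.vdot(r, r).real
        if rr_new < tol ** 2 * rhs_norm_sq:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return eps
```

In this naive form, every work array is a full strain field; the memory efficient variant stores the CG work arrays as displacement fluctuation fields of half the size, keeping only the solution as a full strain field.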

2 Implementation details

When implementing the memory efficient CG algorithm from [1], we need to define how to apply the operator

$$\begin{aligned} {\widehat{G^0}} {\widehat{{\mathrm {Div}}}} \left( {\mathrm {FFT}} \left( - \left( \mathcal {C} - \mathcal {C}_0 \right) : {\mathrm {FFT}}^{-1} \left( {\widehat{\nabla _s}}\right) \right) \right) \end{aligned}$$
(4)

and the operator \(\nabla _s\) as defined in (3) to a displacement fluctuation field. We also need to specify how to calculate the inner product from the Fourier coefficients of the displacements. As a convergence criterion, we use the simple method given in [1],

$$\begin{aligned} \frac{\left| \left\| \varepsilon ^{{\mathrm {new}}}\right\| ^2 - \left\| \varepsilon ^{{\mathrm {old}}}\right\| ^2 \right| }{\left\| \varepsilon ^ {{\mathrm {initial}}}\right\| ^2} < {\mathrm {tolerance}}. \end{aligned}$$
(5)

In the following, capital letters refer to strain fields while lower case letters refer to displacement fluctuation fields.

When we calculate the inner product on the displacement fluctuation fields, it must give the same result as when calculated on the strain fields. More specifically, if \(\nabla _s q = Q\) and \(\nabla _s y = Y\), we require that

$$\begin{aligned} \langle Q, Y \rangle = \langle \nabla _s q, \nabla _s y \rangle . \end{aligned}$$
(6)

Our simulation domain V is a cuboid which we discretize by a voxel mesh with \(N_1 \times N_2 \times N_3\) voxels. All voxels have the same volume. Then,

$$\begin{aligned} \begin{aligned} \langle Q, Y \rangle&= \frac{1}{\left| V\right| } \int _V Q : Y\,dX = \frac{1}{\left| V\right| } \int _V \sum _{l,m=1}^3 Q_{l,m} Y_{l,m} \,dX \\&= \sum _{l,m=1}^3 \frac{1}{\left| V\right| } \int _V Q_{l,m} Y_{l,m} \,dX \\&= \frac{1}{N} \sum _{l,m=1}^3 \sum _{i=1}^{N_1} \sum _{j=1}^{N_2} \sum _{k=1}^{N_3} Q_{l,m}(i,j,k) Y_{l,m}(i,j,k), \end{aligned} \end{aligned}$$
(7)

with the Frobenius inner product defined by \(Q : Y = \sum _{l,m} Q_{l,m}^{*} Y_{l,m} = \sum _{l,m} Q_{l,m} Y_{l,m}\) (since Q and Y are real-valued and symmetric), and \(N = N_1 N_2 N_3\). Due to Parseval’s theorem, the last sum can be calculated by summing the products of the Fourier coefficients of \(Q_{l,m}\) and \(Y_{l,m}\). We write \(\xi = \xi (i,j,k) = \left( \frac{2 \pi i}{N_1}, \frac{2 \pi j}{N_2}, \frac{2 \pi k}{N_3}\right) \) for the Fourier wave vector. The form of the Fourier wave vector and the constant factors in front of the sums might change depending on the implementation of the FFT library.

Continuing from (7),

$$\begin{aligned} \begin{aligned}&\sum _{l,m=1}^3 \sum _{i=1}^{N_1} \sum _{j=1}^{N_2} \sum _{k=1}^{N_3} Q_{l,m}(i,j,k) Y_{l,m}(i,j,k) \\&= \sum _{\xi } \sum _{l,m=1}^3 {\widehat{Q_{l,m}}}(\xi ) {\widehat{Y_{l,m}}}^{*}(\xi ) \\&= \sum _{\xi } \sum _{l,m=1}^3 {\widehat{\nabla _s q_{l,m}}}(\xi ) {\widehat{\nabla _s y_{l,m}}}^{*}(\xi ). \end{aligned} \end{aligned}$$
(8)
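The step from (7) to (8) is Parseval’s theorem. The following minimal NumPy check illustrates it for a single component pair; with NumPy’s unnormalized forward transform, the Fourier-side sum carries an extra factor 1/N, which is precisely the library-dependent constant mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
N1, N2, N3 = 8, 8, 8
Q = rng.standard_normal((N1, N2, N3))     # one component Q_{l,m} on the voxel grid
Y = rng.standard_normal((N1, N2, N3))     # the corresponding component Y_{l,m}

real_space = np.sum(Q * Y)
Q_hat, Y_hat = np.fft.fftn(Q), np.fft.fftn(Y)
fourier_space = np.sum(Q_hat * np.conj(Y_hat)).real / (N1 * N2 * N3)

assert np.allclose(real_space, fourier_space)
```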

As described in [6], derivatives transform to multiplications with the Fourier wave vectors

$$\begin{aligned} k^{\pm }(\xi ) = \left( k_1^{\pm }(\xi ), k_2^{\pm }(\xi ), k_3^{\pm }(\xi ) \right) \end{aligned}$$
(9)

in Fourier space. The precise form of \(k^{\pm }(\xi )\) depends on the discretization method. For the staggered grid discretization, they are defined by

$$\begin{aligned} \mathfrak {R}\left( k_j^{\pm } \right) = \pm \frac{\cos \left( \pm \xi _j \right) - 1}{h_j},\ \mathfrak {I}\left( k_j^{\pm } \right) = \pm \frac{\sin \left( \pm \xi _j \right) }{h_j}, \end{aligned}$$
(10)

where \(h_j\) is the grid spacing in direction j. The discrete form of the \(\nabla _s\) and the \({\mathrm {Div}}\) operators in Fourier space depends on the discretization method, too. We focus here on the staggered grid discretization, but similar formulae can be derived in the same way for any other discretization method.
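As a sketch, the wave vectors of Eq. (10) can be tabulated per direction as follows; the closed forms \(k^+ = (e^{i\xi } - 1)/h\) and \(k^- = (1 - e^{-i\xi })/h\) are equivalent rewritings of (10), and the function name is illustrative.

```python
import numpy as np

def staggered_wave_vectors(N_j, h_j):
    """k_j^+ and k_j^- of Eq. (10) for one direction j, tabulated for all frequencies."""
    xi = 2.0 * np.pi * np.arange(N_j) / N_j
    k_plus = (np.exp(1j * xi) - 1.0) / h_j      # Re = (cos xi - 1)/h, Im = sin(xi)/h
    k_minus = (1.0 - np.exp(-1j * xi)) / h_j    # Re = (1 - cos xi)/h, Im = sin(xi)/h
    return k_plus, k_minus

kp, km = staggered_wave_vectors(8, 0.1)
# the relation k^+ = -(k^-)^* used below holds exactly
assert np.allclose(kp, -np.conj(km))
```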

Applying \({\widehat{\nabla _s}}\) to a displacement q in Fourier space yields

$$\begin{aligned} {\widehat{\nabla _s q}}(\xi ) {=} \left( \begin{array}{ccc} k_1^+ {\widehat{q_1}} &{}\quad \frac{1}{2} \left( k_1^- {\widehat{q_2}} + k_2^- {\widehat{q_1}} \right) &{}\quad \frac{1}{2} \left( k_1^- {\widehat{q_3}} + k_3^- {\widehat{q_1}} \right) \\ &{} k_2^+ {\widehat{q_2}} &{}\quad \frac{1}{2} \left( k_3^- {\widehat{q_2}} + k_2^- {\widehat{q_3}} \right) \\ {\text {sym}} &{}\quad &{}\quad k_3^+ {\widehat{q_3}} \end{array} \right) , \end{aligned}$$
(11)

with the Fourier coefficients \(\widehat{q}(\xi ) = \left( {\widehat{q_1}}, {\widehat{q_2}}, {\widehat{q_3}} \right) \) [6]. Using (11) and since \(k_j^+ = - \left( k_j^- \right) ^{*}\), we obtain

$$\begin{aligned} \begin{aligned}&\sum _{l,m=1}^3 {\widehat{\nabla _s q_{l,m}}}(\xi ) {\widehat{\nabla _s y_{l,m}}}^{*}(\xi ) = \sum _{l=1}^3 \left( k_l^- \right) ^{*} {\widehat{q_l}} k_l^- \left( {\widehat{y_l}}\right) ^{*} \\&\quad +\, \frac{1}{2} \sum \limits _{\begin{array}{c} l,m=1 \\ l < m \end{array}}^3 \left( k_l^- {\widehat{q_m}} + k_m^- {\widehat{q_l}}\right) \left( k_l^- {\widehat{y_m}} + k_m^- {\widehat{y_l}}\right) ^{*}. \end{aligned} \end{aligned}$$
(12)

When calculating the norm, i.e., \(q = y\), we can simplify this expression to

$$\begin{aligned} \begin{aligned}&\sum _{l,m=1}^3 \left| {\widehat{\nabla _s q_{l,m}}}(\xi )\right| ^2 \\&\quad = \frac{1}{2} \left\| k^-\right\| ^2 \left\| \widehat{q}\right\| ^2 +\, \frac{1}{2} \sum _{l=1}^3 \left| k_l^-\right| ^2 \left| {\widehat{q_l}}\right| ^2 \\&\qquad +\, \sum \limits _{\begin{array}{c} l,m=1 \\ l < m \end{array}}^3 \mathfrak {R}\left( k_l^- {\widehat{q_m}} \left( k_m^- {\widehat{q_l}}\right) ^{*} \right) . \end{aligned} \end{aligned}$$
(13)

For the inner product, we cannot simplify the expression in the same way; we have to calculate the gradients first and sum up their products. Since the input arrays q and y contain only real data, the Fourier coefficients possess the Hermitian symmetry \({\widehat{q}}(\xi ) = \left( \widehat{q}(- \xi )\right) ^{*}\) [7]. Therefore, the imaginary parts cancel in the final summation, and we only need to take the real parts into account. The result is, as expected, a real number.

As can be seen from Eq. (1), the displacement fields contain only information about the periodic part of the strain field, and their mean is always 0. When calculating the norm or the inner product through the gradients of the displacements as in (6), we have to add the contribution of the means of the strain fields associated with q and y. Therefore, we save the mean of the corresponding strain field in the zero-frequency coefficients \(\widehat{q}(0)\) and \(\widehat{y}(0)\) and add the inner product \(\langle \widehat{q}(0), \widehat{y}(0) \rangle \) to the result. We note that \(\widehat{q}(0)\) and \(\widehat{y}(0)\) thus hold strain tensors even though q and y contain displacements.

The implementation of the inner product of two displacement fluctuation fields q and y is summarized in Algorithms 1 and 2.

Algorithm 1
Algorithm 2
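The following is a minimal NumPy sketch of the core of such an inner product routine: it evaluates Eq. (12) per Fourier coefficient and checks the result against the real-space inner product (7) of the explicitly built strain fields. The zero-frequency mean-strain correction described above is omitted, and the helper names and array shapes are assumptions of this sketch, not the interface of Algorithms 1 and 2.

```python
import numpy as np

def k_minus_1d(N_j, h_j):
    """k_j^- of Eq. (10) for one direction, tabulated for all frequencies."""
    xi = 2.0 * np.pi * np.arange(N_j) / N_j
    return (1.0 - np.exp(-1j * xi)) / h_j

def strain_hat(q_hat, km):
    """Eq. (11): symmetric gradient of a displacement field in Fourier space."""
    kp = [-np.conj(k) for k in km]                      # k^+ = -(k^-)^*
    S = np.zeros((3, 3) + q_hat.shape[1:], dtype=complex)
    for l in range(3):
        S[l, l] = kp[l] * q_hat[l]
        for m in range(l + 1, 3):
            S[l, m] = S[m, l] = 0.5 * (km[l] * q_hat[m] + km[m] * q_hat[l])
    return S

def inner_product_hat(q_hat, y_hat, km):
    """Summand of Eq. (12) for every Fourier coefficient."""
    s = np.zeros(q_hat.shape[1:], dtype=complex)
    for l in range(3):
        s += np.abs(km[l]) ** 2 * q_hat[l] * np.conj(y_hat[l])
    for l in range(3):
        for m in range(l + 1, 3):
            a = km[l] * q_hat[m] + km[m] * q_hat[l]
            b = km[l] * y_hat[m] + km[m] * y_hat[l]
            s += 0.5 * a * np.conj(b)
    return s

rng = np.random.default_rng(1)
N1 = N2 = N3 = 8
N = N1 * N2 * N3
h = (1.0, 1.0, 1.0)
q = rng.standard_normal((3, N1, N2, N3))                # displacement fluctuation fields
y = rng.standard_normal((3, N1, N2, N3))
q_hat = np.fft.fftn(q, axes=(1, 2, 3))
y_hat = np.fft.fftn(y, axes=(1, 2, 3))

# one 1D array of k^- per direction, reshaped so that it broadcasts over the grid
km = [k_minus_1d(N1, h[0]).reshape(-1, 1, 1),
      k_minus_1d(N2, h[1]).reshape(1, -1, 1),
      k_minus_1d(N3, h[2]).reshape(1, 1, -1)]

# reference: build the strain fields explicitly and use Eq. (7)
Q = np.fft.ifftn(strain_hat(q_hat, km), axes=(2, 3, 4)).real
Y = np.fft.ifftn(strain_hat(y_hat, km), axes=(2, 3, 4)).real
reference = np.sum(Q * Y) / N

# displacement-based evaluation: Eq. (12) summed over all xi
# (the 1/N^2 factor comes from NumPy's FFT normalization)
displacement_based = np.sum(inner_product_hat(q_hat, y_hat, km)).real / N ** 2

assert np.allclose(reference, displacement_based)
```

On top of this core, Algorithms 1 and 2 add the contribution \(\langle \widehat{q}(0), \widehat{y}(0) \rangle \) of the stored mean strains, as described above.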

2.1 Minimizing runtime overhead

In the memory efficient CG algorithm, the discretized wave vectors \(k^{\pm }\) in Fourier space are needed every time we calculate an inner product or a norm, or apply one of the operators \({\widehat{\nabla _s}}\), \({\widehat{G^0}}\), and \({\widehat{{\mathrm {Div}}}}\). If we use formula (10), calculating \(k^{\pm }\) is quite expensive, and recalculating it in every such loop introduces considerable overhead compared to the standard CG algorithm.

We can decrease this overhead in two ways. First, we can modify the function applying the operator defined by Eq. (4) such that the parameter \(\alpha \), the norms \(\left\| q\right\| ^2\) and \(\left\| q-w\right\| ^2\), and the inner products \(\langle q, q-w \rangle \), \(\langle r, q-w \rangle \) and \(\langle u, q \rangle \) are all calculated “on the fly” in the same loop where \({\widehat{G^0}} {\widehat{{\mathrm {Div}}}}\) is applied. Then, we can calculate

$$\begin{aligned} \delta&= \left\| r^{{\mathrm {new}}}\right\| ^2 = \left\| r^{{\mathrm {old}}}\right\| ^2 + \alpha ^2 \left\| q-w\right\| ^2 - 2 \alpha \langle r^{{\mathrm {old}}}, q-w \rangle , \end{aligned}$$
(14)
$$\begin{aligned}&\quad \left\| u^{{\mathrm {new}}}\right\| ^2 = \left\| u^{{\mathrm {old}}}\right\| ^2 + \alpha ^2 \left\| q\right\| ^2 + 2 \alpha \langle u^{{\mathrm {old}}}, q \rangle , \end{aligned}$$
(15)

where \(\left\| r^{{\mathrm {old}}}\right\| ^2\) and \(\left\| u^{{\mathrm {old}}}\right\| ^2\) are known from the previous iteration. \(\left\| u^{{\mathrm {new}}}\right\| ^2\) is needed for our convergence criterion (5). In this way, we can avoid recalculating the wave vectors \(k^{\pm }\) in each of these tasks.
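Equations (14) and (15) are the usual expansions of \(\left\| a \pm \alpha b\right\| ^2\), consistent with update steps of the form \(r^{{\mathrm {new}}} = r^{{\mathrm {old}}} - \alpha (q - w)\) and \(u^{{\mathrm {new}}} = u^{{\mathrm {old}}} + \alpha q\). A short NumPy check with arbitrary field shapes:

```python
import numpy as np

rng = np.random.default_rng(2)
shape = (3, 8, 8, 8)
r_old, u_old, q, w = (rng.standard_normal(shape) for _ in range(4))
alpha = 0.37

r_new = r_old - alpha * (q - w)
u_new = u_old + alpha * q

# Eq. (14): ||r_new||^2 from quantities already available in the fused loop
delta = np.vdot(r_old, r_old) + alpha ** 2 * np.vdot(q - w, q - w) \
        - 2.0 * alpha * np.vdot(r_old, q - w)
# Eq. (15): ||u_new||^2 for the convergence test (5)
u_norm_sq = np.vdot(u_old, u_old) + alpha ** 2 * np.vdot(q, q) \
            + 2.0 * alpha * np.vdot(u_old, q)

assert np.allclose(delta, np.vdot(r_new, r_new))
assert np.allclose(u_norm_sq, np.vdot(u_new, u_new))
```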

The resulting loop, corresponding to the application of \({\widehat{G^0}} {\widehat{\mathrm {Div}}}\) in Eq. (4), is summarized in Algorithm 3. The wave vectors \(k^{\pm }(\xi )\) are calculated only once per Fourier coefficient. We note that each application of the InnerProduct function as defined in Algorithm 2 requires the gradient of its input vectors. Therefore, we modify the corresponding functions to take the gradient as input and calculate each gradient only once.

Algorithm 3

The input for Algorithm 3 is the strain field

$$\begin{aligned} W = {\mathrm {FFT}} \left( - \left( \mathcal {C} - \mathcal {C}_0 \right) : \left( {\mathrm {FFT}}^{-1} \left( {\widehat{\nabla _s}} q\right) \right) \right) \end{aligned}$$
(16)

in Fourier space. Furthermore, the arrays q, r and u as well as the parameter \(\gamma \) are used for calculating the parameters \(\alpha \) and \(\delta \). The array w contains the output displacement field. Additionally, Algorithm 3 returns the norm of the new solution field \(\left\| u^{\mathrm {new}}\right\| \) used for the convergence test.

Using Algorithm 3, the memory efficient CG algorithm can be simplified to Algorithm 4. The function ApplyReducedOperator corresponds to applying Eq. (4) to a displacement fluctuation field, and FourierGradient to applying Eq. (11).

Algorithm 4

Second, for the staggered grid discretization [6], according to Eq. (10), the component \(k_j^{\pm }\) of the discretized wave vector depends only on \(\xi _j\) for each \(j=1,2,3\). Therefore, these vectors can be precalculated and stored in one one-dimensional array per direction. In each loop, we only need to access the precomputed values instead of recalculating them. This is also possible, e.g., for the basic discretization used in the algorithm of [3], but not for the discretization from [8] or for finite element based discretizations [9, 10]. If calculating \(k_j^{\pm }\) is expensive and the vectors have to be stored in three-dimensional arrays instead of one-dimensional ones, the memory savings of the whole algorithm are lost.
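A sketch of this precomputation: one short complex array per direction suffices, and the value \(k^-(\xi (i,j,k))\) needed inside the loops is simply \(\left( k_1^-(i), k_2^-(j), k_3^-(k)\right) \). The array and function names and the grid parameters are illustrative.

```python
import numpy as np

def precompute_k_minus(N, h):
    """One 1D array of k_j^- per direction j (Eq. (10)); k_j^+ = -conj(k_j^-)."""
    k = []
    for j in range(3):
        xi = 2.0 * np.pi * np.arange(N[j]) / N[j]
        k.append((1.0 - np.exp(-1j * xi)) / h[j])
    return k

N = (512, 512, 512)
h = (8.444e-6, 8.444e-6, 8.444e-6)            # grid spacing in m, as in Fig. 1
km = precompute_k_minus(N, h)

# storage: 3 * N_j complex values instead of three full N1 * N2 * N3 arrays
print(sum(k.nbytes for k in km), "bytes for the wave vector tables")
# inside the loops, k^-(xi(i, j, k)) is just (km[0][i], km[1][j], km[2][k])
```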

Nevertheless, we cannot completely remove the runtime overhead because norm and inner product calculations are inevitably more expensive in Fourier space than in real space, as can be seen from Eq. (13). There, we need to calculate two vector norms, twelve complex products, and three real parts, compared to one vector norm in the real case.

3 Numerical results

Verification of Algorithm 4 is straightforward since the results of each iteration must coincide with the results of the classical CG algorithm up to machine precision. It is also obvious that the memory requirements are lower by \(40\%\).

All tests in this section are run on the Beehive cluster at ITWM. The Beehive cluster consists of 166 nodes, each with two eight-core Intel Xeon E5-2670 CPUs. All nodes have 64 GB RAM and are connected through an InfiniBand interconnect. The cluster runs on CentOS Linux release 7.5 (Linux kernel 3.10.0). For all tests, we use gcc version 6.2.0, OpenMPI 1.10.7, MPICH 3.2, and FFTW version 3.3.7. We use the compilation options

$$\begin{aligned} {\texttt {-mfpmath=387}}~{\texttt {-funroll-all-loops}}~{\texttt {-O2}}~{\texttt {-pipe}} \end{aligned}$$

We perform two strong scaling tests with up to 256 tasks, using the staggered grid discretization. As test geometry, we choose the Berea sandstone from [11]. The image can be downloaded from [12]. We cut out a box of \(512^3\) voxels in the center of the original image as shown in Fig. 1. For the solid material, we set the bulk modulus to \(36\,{\mathrm {GPa}}\) and the shear modulus to \(45\,{\mathrm {GPa}}\) [13], and assume linear isotropy. The Berea sandstone can then be treated as a linear elastic problem. We calculate one load case, for which the CG algorithm needs 163 iterations until Eq. (5) is fulfilled with a tolerance of \(10^{-4}\).

Fig. 1 Berea sandstone dataset with a binary segmentation. The image has \(512 \times 512 \times 512\) voxels and the voxel edge length is \(8.444\,\upmu \mathrm {m}\)

3.1 Workstation test

In the first setup, we run this test with the memory efficient CG algorithm on a single node of our cluster and compare the performance of the OpenMP parallelization and of both MPI libraries. The resulting runtimes are shown in Fig. 2, together with the runtime of the naive implementation of the algorithm parallelized with OpenMP.

Fig. 2 Solver runtime with FFT transformations for the workstation test with the memory efficient CG algorithm on one single cluster node. Details can be found in Sect. 3.1

In this test, the runtime of the iterative solver without the FFT transformations is about a factor of two higher with the naive implementation than with the efficient implementation. We excluded the FFT runtime since it is the same for all algorithms, and its share of the total runtime depends on the problem size and the number of parallel processes.

The scaling efficiency is generally very good. We observe that with MPI, the parallel efficiency is slightly better than with OpenMP. The choice of the MPI library does not significantly change the parallel performance. In all cases, we observe a particularly strong drop in parallel efficiency when going from 8 to 16 cores. This is partly due to the decreasing frequency of the cores depending on the number of “active cores”, as described in [14]. Similar to their work, we deduced the core frequencies of the Intel Xeon E5-2670 CPUs of Beehive depending on the number of active cores using multiplication tests. The resulting clock rates are shown in Table 1. In Fig. 2, the line designated by “achievable scaling” takes the expected slowdown due to this effect into account. These results, in particular the improved performance with fewer active cores per node, are in accordance with the findings of [4].

Table 1 Empirical dependence of the core frequency on the number of active cores for the 8-core Intel Xeon E5-2670 CPU

While the standard CG algorithm consumed 11.62 GB of main memory, the memory efficient variant consumed only 7.32 GB. This represents a reduction of about \(40\%\), as predicted. \(k^{\pm }(\xi )\) can be stored in three one-dimensional arrays, and therefore the additional amount of memory needed is negligible.

3.2 Cluster test

In the second setup, we repeat the same test using more MPI processes and compare the standard CG algorithm with the memory efficient CG algorithm. To avoid any dependency on the number of active cores, we always set the number of processes per node to 8. Since the results do not significantly depend on the MPI library, we only show data for OpenMPI 1.10.7.

Fig. 3 Scaling results for the cluster test with up to 256 MPI processes. Details can be found in Sect. 3.2

From Fig. 3, we observe that the scaling is still very good for both algorithms. The memory efficient CG algorithm is even slightly faster in most cases. In particular, there is no performance loss compared to the standard CG algorithm.

Table 2 Total runtime depending on the number of MPI processes, and time spent in the FFT calls for the standard CG algorithm. The second column lists the number of slices per MPI process
Table 3 Total runtime depending on the number of MPI processes, and time spent in the FFT calls for the memory efficient CG algorithm. The second column lists the number of slices per MPI process

Due to the problem size, more than 256 MPI processes cannot be used efficiently in this test. Another decisive factor influencing the scaling is the parallel performance of the FFTW library [7]. As can be seen from Tables 2 and 3, where the total runtime and the time spent in the FFT calls are listed for both the standard and the memory efficient CG algorithm, the share of the runtime spent in the FFT calls increases with the number of MPI processes, indicating that the scaling efficiency of the FFT library is worse than that of the rest of the code. In particular, the FFT runtime decreases only slightly when increasing the number of processes from 32 to 64.

4 Conclusions

The memory efficient CG algorithm from [1] reduces the memory requirements of numerical simulations of linear elasticity by around \(40\%\). At the same time, it introduces a runtime overhead. Depending on the parallelization technique and the problem size, this overhead can be reduced to between 0 and \(15\%\) of the runtime of the standard implementation of the CG algorithm. Even though the runtime of our code is dominated by the FFT library, we obtain an impressive parallel efficiency in two strong scaling tests with up to 256 MPI processes.

We remark that for large deformations where the displacement gradient operator is given by

$$\begin{aligned} \nabla u = \left( \begin{array}{ccc} \frac{\partial u_1}{\partial x_1} &{}\quad \frac{\partial u_1}{\partial x_2} &{}\quad \frac{\partial u_1}{\partial x_3} \\ \frac{\partial u_2}{\partial x_1} &{}\quad \frac{\partial u_2}{\partial x_2} &{}\quad \frac{\partial u_2}{\partial x_3} \\ \frac{\partial u_3}{\partial x_1} &{}\quad \frac{\partial u_3}{\partial x_2} &{}\quad \frac{\partial u_3}{\partial x_3} \end{array} \right) , \end{aligned}$$
(17)

the norm calculation in Fourier space actually becomes simpler because

$$\begin{aligned} \sum _{l,m=1}^3 \left| {\widehat{\nabla q_{l,m}}}(\xi )\right| ^2 = \left\| k^-\right\| ^2 \left\| \widehat{q}\right\| ^2. \end{aligned}$$
(18)

Therefore, the runtime overhead should be even smaller in this case.
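A short NumPy check of identity (18) follows; it assumes that the diagonal entries of the full gradient use \(k^+\) and the off-diagonal entries \(k^-\), analogous to Eq. (11), although the identity only relies on \(|k^+| = |k^-|\). The grid parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
N, h = 8, 1.0
xi = 2.0 * np.pi * np.arange(N) / N
km_1d = (1.0 - np.exp(-1j * xi)) / h                    # k^- per Eq. (10)
km = [km_1d.reshape(-1, 1, 1), km_1d.reshape(1, -1, 1), km_1d.reshape(1, 1, -1)]
kp = [-np.conj(k) for k in km]                          # k^+ = -(k^-)^*

q = rng.standard_normal((3, N, N, N))
q_hat = np.fft.fftn(q, axes=(1, 2, 3))

# full displacement gradient in Fourier space, entry (l, m) ~ k_m * q_hat_l
grad_hat = np.zeros((3, 3, N, N, N), dtype=complex)
for l in range(3):
    for m in range(3):
        grad_hat[l, m] = (kp[m] if l == m else km[m]) * q_hat[l]

lhs = np.sum(np.abs(grad_hat) ** 2, axis=(0, 1))        # left-hand side of (18), per coefficient
k_norm_sq = sum(np.abs(k) ** 2 for k in km)             # ||k^-||^2, broadcast over the grid
rhs = k_norm_sq * np.sum(np.abs(q_hat) ** 2, axis=0)    # ||k^-||^2 * ||q_hat||^2

assert np.allclose(lhs, rhs)
```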