Profiling General-Purpose Fast Multipole Method (FMM) Using Human Head Topology

In this study, we characterize the performance of the fast multipole method (FMM) in solving the Laplace and Helmholtz equations. We use the FMM library developed by the group of Dr. L. Greengard. This version of the FMM algorithm is multilevel, with no a priori limit on the number of levels of the FMM tree, although beyond roughly thirty levels floating-point issues may arise. A collection of high-resolution human head models is used as test objects. We perform a detailed analysis of the runtime and memory consumption of the FMM over a wide range of frequencies, problem sizes, and required precisions. Although we focus on two-manifold test cases, the results generalize to other topologies as well. The tests are conducted on both Windows and Linux platforms. The results obtained in this study can serve as a general benchmark for the performance of the FMM. They can also be employed to pre-estimate the efficiency of numerical modeling methods (e.g., the boundary element method) accelerated by the FMM.


Introduction
Recently, a quasistatic boundary element method (BEM) solution was proposed [1] that combines the adjoint double-layer formulation of the BEM [2][3][4], which utilizes surface charges at the boundaries, zeroth-order (piecewise constant) basis functions with accurate near-field integration, and the FMM accelerator [5][6][7]. This approach does not require explicitly forming the BEM matrix; an iterative solution with M iterations requires O(MN) operations. The fast multipole method speeds up the matrix-vector product at the core of an iterative BEM solution by many orders of magnitude. In the past, it was successfully applied to modeling high-frequency electromagnetic [8,9] and acoustic [10][11][12] scattering problems. It has also been applied to modeling transcranial magnetic stimulation (TMS), demonstrating fast computational speed and superior accuracy for high-resolution head models as compared to both the standard boundary element method and finite element methods of various orders [1,13,14]. The rapid increase in the use of the FMM in such numerical modeling schemes calls for an accurate and thorough study of its performance in a wide range of scenarios.
The goal of this study is to benchmark the performance (both speed and memory consumption) of the fast multipole method, or FMM [5,6]. Here, we use the established head collection and its barycentrically refined versions to profile the FMM library provided by Z. Gimbutas and L. Greengard [7] and employed in [1,12]. Such profiling implies running the FMM for all head geometries at different frequencies, including the static case, and averaging the respective results. One FMM runtime essentially corresponds to one iteration step of an iterative BEM-FMM solution [1,12]. Therefore, the data reported in the present study can be used to estimate the performance of a rather generic BEM-FMM algorithm if the number of iterations is known or can be estimated a priori. For example, a solver that converges in M = 20 iterations at roughly 3.4 s per FMM call would require on the order of 70 s in total.

FMM Library of 2017
The core FMM algorithm is taken from the FMM library provided by Gimbutas and Greengard [7]. The latest version, last updated on November 8, 2017, was downloaded from the GitHub repository for use in this study. We focus specifically on the function fmm3d, which solves the Laplace and Helmholtz equations for a large number of target points. The compiled MEX versions of this function, namely, fmm3d.mexw64 and fmm3d.mexa64 for MATLAB compatibility on Windows and Linux, respectively, are used for all FMM calculations within the MATLAB environment. Depending on whether a solution of the Laplace or Helmholtz equation is desired, a wrapper function, either lfmm3dpart or hfmm3dpart, both available in the FMM library, is employed. A sample MATLAB command that calls hfmm3dpart to solve the Helmholtz equation is given by the following:

[U]=hfmm3dpart(iprec,k,nsource,source,ifcharge,charge,ifdipole,dipstr,dipvec,ifpot,iffld,ntarget,target,ifpottarg,iffldtarg)

In the command above, the input parameters are as follows:
• iprec: the requested precision level
• k: the wave number (Helmholtz case only)
• nsource and source: the number of source points and their locations
• ifcharge and charge: a flag enabling monopole (charge) sources and the corresponding charge strengths
• ifdipole, dipstr, and dipvec: a flag enabling dipole sources, the dipole strengths, and the dipole orientation vectors
• ifpot and iffld: flags requesting the potential and the field at the source locations
• ntarget and target: the number of separate target points and their locations
• ifpottarg and iffldtarg: flags requesting the potential and the field at the target locations

The output parameter is the struct U that contains the following fields:
• U.pot: the computed potential at the source locations
• U.fld: the field at the source locations
• U.pottarg: the potential at the target locations
• U.fldtarg: the field at the target locations

In a similar manner, a sample MATLAB command that calls lfmm3dpart to solve the Laplace equation is given by the following:

[U]=lfmm3dpart(iprec,nsource,source,ifcharge,charge,ifdipole,dipstr,dipvec,ifpot,iffld,ntarget,target,ifpottarg,iffldtarg)

where the input and output variables are similar to those of hfmm3dpart, except that for lfmm3dpart there is no wave number k. A more recent FMM library, developed by the Flatiron Institute [15], is also investigated and compared with the library provided by Gimbutas and Greengard [7].
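For illustration, a minimal MATLAB sketch of a complete Helmholtz call is given below. It follows the calling sequence shown above; the point count, charge values, and wave number are arbitrary placeholders, and the flag settings simply request the potential and field at the source points with no separate targets:

N        = 1e4;                 % number of sources (arbitrary, for illustration)
source   = rand(3, N);          % source locations, one column per point
charge   = rand(1, N) + 0i;     % complex monopole (charge) strengths
k        = 2.5;                 % wave number
iprec    = 1;                   % precision flag: 0, 1, 2 -> two, three, six digits
ifcharge = 1;                   % include monopole sources
ifdipole = 0;                   % no dipole sources in this example
dipstr   = zeros(1, N);         % unused when ifdipole = 0
dipvec   = zeros(3, N);         % unused when ifdipole = 0
ifpot    = 1;  iffld = 1;       % request potential and field at the sources
ntarget  = 0;  target = zeros(3, 1);   % no separate target points
ifpottarg = 0; iffldtarg = 0;

[U] = hfmm3dpart(iprec, k, N, source, ifcharge, charge, ifdipole, ...
                 dipstr, dipvec, ifpot, iffld, ntarget, target, ...
                 ifpottarg, iffldtarg);
% U.pot and U.fld now hold the potential and the field at the source points.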

CAD Human Head Models
Every CAD human head model [8] consists of seven objects: the skin, skull, CSF, gray matter (GM), cerebellum, white matter (WM), and ventricles head compartments. The models have an "onion" topology: the gray matter shell is a container for the white matter, ventricles, and cerebellum objects; the CSF shell contains the gray matter shell; the skull shell contains the CSF shell; and finally, the skin or scalp shell contains the skull shell. The models have an average of 866,000 triangular facets and an average triangle quality of 0.25. The average edge length is 1.48 mm, and the average surface mesh density or resolution is 0.57 points per mm². A sample image of such a head model is shown in Fig. 1. Finer meshes with ~3,464,000 facets, obtained through one iteration of barycentric subdivision of the original CAD models, are also generated for a more intensive examination of the scaling of timings and hardware resources.

Hardware Information
Windows server:

Charge Assembly
For each of the 16 CAD models, a set of monopole charges is distributed over the surface of the triangular mesh so that at each triangle center, a charge of random magnitude q is assigned. The electric potential generated at each triangle centroid, excluding the self-contribution from the local charge, is given by the following:

\varphi(\mathbf{r}) = \frac{1}{4\pi\varepsilon_0} \sum_{\mathbf{r}_i \neq \mathbf{r}} q_i \, \frac{e^{ik|\mathbf{r} - \mathbf{r}_i|}}{|\mathbf{r} - \mathbf{r}_i|} \qquad (1)

where ε₀ is the permittivity of vacuum, qᵢ is the electric charge of the i-th source, k is the wave number, r is the target location at which the potential is sought, and rᵢ is the source location. The resulting electric field is given by the following:

\mathbf{E}(\mathbf{r}) = -\nabla\varphi(\mathbf{r}) = \frac{1}{4\pi\varepsilon_0} \sum_{\mathbf{r}_i \neq \mathbf{r}} q_i \left(1 - ik|\mathbf{r} - \mathbf{r}_i|\right) \frac{e^{ik|\mathbf{r} - \mathbf{r}_i|}}{|\mathbf{r} - \mathbf{r}_i|^3} \,(\mathbf{r} - \mathbf{r}_i) \qquad (2)

As a measure of the FMM's performance, both the potential and the electric field, given by Eqs. (1) and (2), are computed for all models at the same triangle centroids, excluding the self-contribution. Through the function hfmm3dpart (and lfmm3dpart for the case ka = 0), the potential and the field are obtained simultaneously. For each head model, the calculations are done for three levels of accuracy: precision 0 (two digits), precision 1 (three digits), and precision 2 (six digits). The frequencies for which the FMM algorithm is tested span a wide range, corresponding to ka values varying from 0 to 500. Here, a is the maximum of the x, y, and z coordinates of the model; its average value over all models is 107.5754.
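As a spot check of the FMM output, Eqs. (1) and (2) can be evaluated directly at a single centroid by brute force. The MATLAB sketch below assumes ctr is a 3 × M array of triangle centroids, q is a length-M vector of charge magnitudes, and k is the wave number; these names are hypothetical, and the comparison against the FMM output holds only up to the library's normalization convention:

eps0 = 8.854187817e-12;            % vacuum permittivity, F/m
j    = 1;                          % index of the test centroid
r    = ctr(:, j);
phi  = 0;  E = zeros(3, 1);
for i = 1:size(ctr, 2)
    if i == j, continue; end       % exclude the self-contribution
    d   = r - ctr(:, i);
    R   = norm(d);
    g   = exp(1i*k*R) / (4*pi*eps0*R);             % Helmholtz Green's function
    phi = phi + q(i) * g;                          % Eq. (1)
    E   = E + q(i) * g * (1 - 1i*k*R) * d / R^2;   % Eq. (2), E = -grad(phi)
end
% phi and E can now be compared with the FMM results at centroid j.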

Original CAD Models
The relationship between the runtimes of FMM calculations on the Windows server, averaged over all 16 models, and ka is shown in Figs. 2, 3, and 4 for precisions 0 (two digits), 1 (three digits), and 2 (six digits), respectively. The discrete step for values of ka is 50, starting from ka = 0 and ending at ka = 500, with a more refined resolution within the low-frequency domain, from 0 to 50, where the step is 2.5. As can be seen in the insets of these figures, where the plot for the low-frequency domain is magnified, there is always a sharp jump from the runtime at ka = 0 (Laplace case) to the very next value, ka = 2.5. After this abrupt jump, the FMM time increases steadily and linearly within the small-ka domain (low frequencies) before growing exponentially at medium and large ka. The slope of the time-ka dependence at low ka is (6.04 ± 0.19) × 10⁻² with precision 0 (Fig. 2), (7.86 ± 0.13) × 10⁻² with precision 1, and (16.81 ± 1.03) × 10⁻² with precision 2. From these estimates, it can be concluded that the higher the demanded precision, the steeper the time-ka slope becomes, and the faster the runtime increases with ka.

In Fig. 5, the FMM runtimes for all three precision choices are plotted. For the Laplace case (ka = 0), the computations take on average 3.41 s to complete with precision 0. If a higher level of accuracy is requested, the time increases to 9.23 s with precision 1 and 16.70 s with precision 2. This trend, however, is not replicated in the Helmholtz case, particularly in the low-frequency domain. As shown in Figs. 2 and 5, in the small-ka domain (except for ka = 0), the FMM is slowest with precision 0, the lowest level of accuracy of all. More specifically, with precision 0, the FMM runtime increases from an average of 62.88 s at ka = 2.5 to 65.33 s at ka = 50 (Figs. 2 and 5). Precision 2, the highest accuracy level tested, takes only the second-longest amount of time, with 41.92 s at ka = 2.5, rising to 50.77 s at ka = 50. Calculations within the low-frequency domain are fastest with precision 1, which takes only 27.73 s at ka = 2.5 and 31.67 s at ka = 50. This rather unexpected behavior continues as far as ka = 150, where the FMM time with precision 2, due to its rapid exponential rise, surpasses the runtime of precision 0. Toward the high end of the frequency range, the precision 1 runtime, whose exponential rate is also higher than that of precision 0 (but not as high as that of precision 2), approaches and then surpasses the precision 0 runtime at ka = 500. Therefore, at very large values of ka, the more intuitively expected trend is observed: the FMM runtime with precision 0 is lowest and has the slowest exponential rise as ka increases, followed by precision 1, while precision 2 is the most time-consuming and has the quickest exponential rate.
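The slopes quoted above, reported with uncertainties, presumably come from linear fits over the low-frequency window. A minimal MATLAB sketch of such a fit is shown below, assuming hypothetical arrays ka and t that hold the tested ka values and the corresponding averaged runtimes:

idx   = (ka >= 2.5) & (ka <= 50);    % low-frequency window, excluding ka = 0
p     = polyfit(ka(idx), t(idx), 1); % first-order (linear) fit of t vs ka
slope = p(1);                        % e.g., about 6.04e-2 s per unit ka for precision 0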

First-Order Mesh Subdivision
In Figs. 6, 7, and 8, the FMM runtimes on the Windows server with precisions 0, 1, and 2, respectively, averaged over all 16 refined meshes obtained through one iteration of barycentric subdivision of the original CAD models, are presented. The average mesh size quadruples; it is now 3.464 M facets. Similar to the calculations done on the original head models, at ka = 0 (Laplace case), precision 0 takes the least amount of time, 10.41 s, compared to 41.50 s with precision 1 and 70.86 s with precision 2. Also similar to the original head models, there are abrupt jumps in runtime from the Laplace case to the Helmholtz calculations, as shown in Figs. 6, 7, and 8. Within the low-frequency limit, the FMM runtime increases linearly with ka, and the higher the requested accuracy, the steeper the slope: with precision 0, the FMM time increases at a linear rate of (11.02 ± 1.90) × 10⁻² for ka in the low-frequency domain (0-50); with precision 1, it increases at a higher rate of (21.30 ± 0.61) × 10⁻²; and with precision 2, it increases most rapidly, at a rate of (39.21 ± 6.75) × 10⁻². In Fig. 9, the runtimes of the FMM applied to the first-order-refined meshes with all three precision choices are plotted. Again, within the low-frequency domain, precision 0 does not yield the fastest runtime. As shown in Fig. 9, at ka = 2.5, the runtime with precision 0 (172.62 s) is essentially comparable to that with precision 2 (173.26 s), and both are significantly slower than precision 1 (127.76 s). As ka increases, precision 0 has its runtimes increase at the slowest rate of the three options. Therefore, by ka = 350, the FMM runtime of precision 0 is surpassed by that of precision 1; from then on, precision 0 runtimes are the quickest, followed by precision 1, while precision 2 takes the most time.
Regarding the scaling of the FMM runtime from the original CAD models, which have an average of N₀ = 866,000 facets, to the first-order-refined meshes with N₁ = 3,464,000 facets, the theoretical factor, based on the O(N log N) complexity estimate of the FMM, is as follows:

\frac{t_1}{t_0} = \frac{N_1 \log N_1}{N_0 \log N_0} \approx 4.4 \qquad (3)

In Fig. 10, throughout the tested range of ka, the time ratio for precision 0 is always the smallest of the three precision choices. Quite surprisingly, the ratio for precision 2 is, for the most part, smaller than that of precision 1. Also interestingly, all three precision options start at ka = 0 with scaling factors relatively close to the theoretical value (perhaps with the exception of precision 0) and then decrease significantly as ka increases. Therefore, it appears that the higher the frequency, the better the scaling in runtime.
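Under this O(N log N) assumption, the theoretical factor of Eq. (3) can be computed in one line; a pure O(N) estimate would give exactly N₁/N₀ = 4 instead:

N0 = 866e3;  N1 = 3464e3;                  % average facet counts
ratio = (N1 * log(N1)) / (N0 * log(N0));   % about 4.4; compare with Fig. 10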

Original CAD Models
The relationship between the runtimes of FMM calculations on the Linux server, averaged over all 16 models, and ka is shown in Figs. 11, 12, and 13 for precisions 0, 1, and 2, respectively. A summary of runtimes for all three precisions is plotted in Fig. 14. In contrast to the same calculations done on the Windows platform, the runtime on Linux with precision 0, the lowest level of accuracy tested in this study, takes the least amount of time, calculations with precision 1 are second, and precision 2, the highest level of accuracy with six digits, consumes the most time. This order holds consistently throughout the entire frequency range from 0 to 500. It is also noticeable that the runtimes with precisions 0 and 1, which guarantee accuracy within two and three digits, respectively, are comparable to each other, with precision 1 calculations taking slightly longer than precision 0, whereas the FMM runtime with precision 2, which demands six-digit accuracy, is significantly longer. This rather intuitive behavior, however, is not present in the runtime profiling for Windows, which was discussed in Sect. 3.1. As a note on how the two servers compare in computing the Laplace equation, it takes the Linux server 5.14 s to finish the calculation with precision 0, while for Windows, it is only 3.41 s. The Linux server, however, is notably faster with precision 1, taking 5.33 s as compared to 9.23 s on Windows.

Comparing the calculations done on the two platforms, Linux and Windows, we also observe major differences in how the runtime evolves as ka varies. First of all, FMM calculations within the low-frequency domain are generally faster on Linux than on Windows, especially for low-to-medium accuracies (precisions 0 and 1). In particular, with precision 0, it takes on average 19.51 s on Linux for hfmm3dpart to solve the Helmholtz equation on the original CAD models for ka within the range 0-50, whereas it takes on average 64.03 s to complete the same task on the Windows server. Similarly, it takes only 25.59 s on Linux to finish the calculations in the low-frequency domain with precision 1; calculations on Windows, although not drastically slower than Linux as in the case of precision 0, still take 29.71 s. If higher precision is demanded, however, the runtimes on Windows catch up with and eventually surpass those on Linux: for low-frequency calculations demanding precision 2 (six-digit accuracy), it takes the Windows server only 45.35 s for hfmm3dpart to complete, while a similar task takes the Linux server 88.88 s.
Scaling of the runtime as the frequency (or ka) increases is another important metric. For the Linux server, within the range 0-50 for ka, the FMM runtime increases linearly with a slope of (16.06 ± 0.31) × 10⁻² with precision 0. This is a significantly faster rate than the slope of (6.04 ± 0.19) × 10⁻² (already mentioned in Sect. 3.1.1) for the same precision on Windows. The same comparison holds for precisions 1 and 2: on Linux, the linear rates are (26.20 ± 0.07) × 10⁻² with precision 1 and a whopping (50.09 ± 3.81) × 10⁻² with precision 2.

First-Order Mesh Subdivision

In Figs. 15, 16, and 17, the FMM runtimes on the Linux server with precisions 0, 1, and 2, respectively, averaged over all 16 refined meshes, are presented. In terms of how the runtimes of the three accuracy options compare to each other, as can be seen in Fig. 18, precision 0 takes the least amount of time, tightly followed by precision 1, while precision 2 is far more time-consuming. The same behavior was already discussed in Sect. 3.2.1 for calculations done with the original CAD models on Linux. The same conclusion, however, cannot be drawn for calculations done on Windows, as mentioned in Sects. 3.1.1 and 3.1.2.
The rates at which the FMM runtimes increase with ka are significantly higher on Linux than on Windows. Within the low-frequency domain, where the FMM time-ka dependence appears to be linear, the FMM runtime (with refined meshes) for precision 0 on Linux has a linear rate of (28.90 ± 0.45) × 10⁻², by a large margin higher than the rate of (11.02 ± 1.90) × 10⁻² on Windows for the same precision level. Similarly, the precision 1 runtimes increase at a rate of (53.91 ± 0.90) × 10⁻² in the low-frequency range, while on Windows the rate is only (21.30 ± 0.61) × 10⁻². Finally, for precision 2, the rate is (104.5 ± 4.30) × 10⁻² on Linux versus (39.21 ± 6.75) × 10⁻² on Windows. Such steep slopes of the runtime-ka dependence in the low-frequency domain on the Linux platform are followed by rapid exponential growth in the medium- and high-frequency ranges, which leaves the Linux server far behind Windows in computing the Helmholtz equation in the high-frequency domain.
In Fig. 19, the actual ratio between the FMM time (on Linux) for the refined models and that for the original models is plotted. Unlike the time ratio plot for Windows (Fig. 10), here we see a more expected trend: the time ratio for precision 0 is lowest, followed by precision 1, while precision 2 has the largest ratio, and this ordering is maintained over the entire range ka = 0-500. As also shown in Fig. 19, except for the Laplace case with precision 2, all the ratios of the three precisions are below the theoretical scaling factor (see Eq. (3)). As ka increases, a decreasing trend is observed in all three plots; a similar result can be seen in Fig. 10 for Windows.

Original CAD Models
In this section, we discuss the memory consumed by the FMM. Due to limitations of the available tools, as well as Windows' uncompromising memory recording scheme, only memory information for calculations done on Linux is profiled and analyzed here. However, given the same FMM task, approximately the same memory consumption is expected on both platforms; therefore, valuable insights into the memory requirements for running the FMM on Windows can still be drawn. In Figs. 20, 21, and 22, the peak physical memory over the FMM runtime is plotted against ka for precisions 0, 1, and 2, respectively. In Fig. 23, a summary of memory vs. ka is plotted for all three precision choices. As can be seen in the figures, the general shapes of the curves are similar to the runtime plots displayed in the previous sections: there is an abrupt jump from the memory needed for solving the Laplace equation to that needed for the Helmholtz equation, and within the low-frequency domain, the memory-ka dependence is linear before evolving into exponential growth in the higher frequency ranges. Perhaps the most astonishing result is that precision 0, the lowest level of accuracy, requires the most memory, especially in the high-frequency domain. As shown in Fig. 23, for very small values of ka, calculations with precision 0 consume the least memory. However, both its linear rate within the low-frequency range and its exponential rate in the higher-frequency domain exceed those of precisions 1 and 2, causing the memory needed for precision 0 to outgrow that of the supposedly more computationally demanding precision options. The memory plots for precisions 1 and 2, on the other hand, evolve in a more relaxed manner and, over the entire ka range from 0 to 500, tend to stay close to each other.
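The quantity plotted here is peak physical memory. The profiling tool used in this study is not specified above, but one minimal way to record this quantity from within MATLAB on Linux is to read the VmHWM (peak resident set size) field of the process's /proc entry after the FMM call, as sketched below; this is offered only as an illustrative alternative:

pid    = feature('getpid');               % PID of the running MATLAB process
% ... run the FMM call of interest here ...
status = fileread(sprintf('/proc/%d/status', pid));
tok    = regexp(status, 'VmHWM:\s*(\d+) kB', 'tokens', 'once');
peakMB = str2double(tok{1}) / 1024;       % peak resident set size, in MB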

First-Order Mesh Subdivision
In Figs. 24, 25, and 26, the peak physical memory over the FMM runtime (performed on the refined meshes) is plotted against ka for precisions 0, 1, and 2, respectively. In Fig. 27, a summary of memory vs. ka is plotted for all three precision choices. Similar to the memory recorded for the FMM run on the original meshes, we again observe that precision 0 consumes the most memory, particularly at high frequencies (Fig. 28).

Windows Platform
In this section, we study the performance of the FMM in calculations that use CAD models with more refined meshes. These models have an average of 13,800,000 triangles and are obtained by performing two levels of barycentric subdivision on the original head models. In Figs. 29, 30, and 31, the FMM runtimes on the Windows server with precisions 0, 1, and 2, respectively, averaged over all 16 refined meshes, are presented. In Fig. 32, a summary of runtime vs. ka is plotted for all three precision choices. For the studies in this section, due to limited computational resources, we restrict ourselves to ka values from 0 to 50. As seen in Figs. 29 and 32, the runtimes for the dense meshes are rather unpredictable: there is no observable pattern in how the FMM runtime evolves as ka is increased from 0 to 50. This volatility can be seen in the plots for all precisions 0, 1, and 2. In Fig. 33, the ratio between the FMM time for the refined models (second level of mesh refinement) and the FMM time for the original CAD models is plotted with respect to ka. As seen in Fig. 33, the scaling is best for calculations that require precision 0, followed by precision 2; FMM calculations with precision 1 have the largest scaling factor. This unintuitive scaling result was also seen for the first-order mesh refinement (see Fig. 10) and was discussed in Sect. 3.1.2.

Linux Platform
In Figs. 34, 35, and 36, the FMM runtimes on the Linux server with precisions 0, 1, and 2, respectively, averaged over all 16 refined meshes (second order), are presented. In Fig. 37, a summary of runtime vs. ka is plotted for all three choices of precision. Unlike the volatile behavior seen in the results for Windows, the FMM runtimes on Linux for the meshes of second-order refinement increase linearly as ka increases, as expected. In Fig. 38, the ratio between the FMM time for the refined models (second level of mesh refinement) and the FMM time for the original CAD models is plotted with respect to ka. Again, we see that the scaling for precision 0 is the lowest, while precision 1 has the highest scaling factor.


Comparisons with the FMM Library of 2019

In Figs. 39, 40, and 41, the FMM runtimes of the new and old libraries on the Linux server with precisions 0, 1, and 2, respectively, averaged over all 16 refined meshes, are presented. A few comments are in order:

• The newest FMM library of the three tested crashed at high frequencies. These crashes were caused by a memory leak, which has since been fixed.

Discussion and Conclusions
In this paper, we have studied the performance of the fast multipole method in computing the Laplace and Helmholtz source-to-source potentials within human head topology. The FMM software used for this study was developed by Gimbutas and Greengard [7], and we have profiled the method over a wide range of frequency values and mesh densities, on both Linux and Windows platforms, and with all available choices of precision. We showed that for problems of reasonably "small" size (up to 3-4 million facets), the FMM runtime and memory consumption evolve in a predictable manner on both Linux and Windows. In particular, the runtime (and memory usage) grows exponentially as the frequency increases, after a mild linear dependence at small values of frequency. We also universally observe a sharp, discrete increase in both runtime and memory when switching from the Laplace solver to the Helmholtz solver. Upon studying the scaling efficiency of the FMM, we showed that the algorithm deviates slightly from the theoretically expected scaling factors for runtime and memory, with decreasing trends as the problem size increases. We also observed a number of interesting and unexpected results when comparing the runtimes required by different levels of accuracy. In particular, we showed that for calculations run on Windows, the runtimes for the three levels of accuracy tested did not follow any particular order at low frequencies and only formed a pattern (lower accuracy requiring less time than higher accuracy) when the frequency was sufficiently high. Resources needed for calculations performed on Linux were shown to follow much more predictable patterns across the different choices of accuracy and problem sizes, as well as smoother evolution as the frequency changes.
We also compared the performance of this FMM library with a newer package (downloaded June 11, 2019) and its updated version (downloaded June 20, 2019). The results show that although the new library performs better at low frequencies and in Laplace calculations, it scales poorly compared to the old library and is therefore less time-efficient than the old FMM code at high frequencies.
Finally, we investigated the scaling of the new library's runtime with an increasing number of right-hand sides (rhs) solved simultaneously. The overall results show that while the Windows platform has the shorter runtime for a small number of rhs, the FMM code compiled for Linux has the better scaling rate and therefore the better runtime as the number of rhs increases.
With this study, we have benchmarked the performance of the general-purpose FMM, provided new insights into the behavior of the algorithm in various scenarios, and effectively offered a means to pre-estimate the efficiency of FMM-based or FMM-accelerated numerical methods.