A massively parallel Eikonal solver on unstructured meshes
Abstract
Algorithms for the numerical solution of the Eikonal equation discretized with tetrahedra are discussed. Several massively parallel algorithms for GPU computing are developed. These include domain decomposition concepts for tracking the moving wave fronts within subdomains and across subdomain boundaries. Furthermore, a low-memory-footprint implementation of the solver is introduced which reduces the number of arithmetic operations and enables improved memory access schemes. Numerical tests for different meshes originating from the geometry of a human heart document the decreased runtime of the new algorithms.
Keywords
Eikonal equation, GPU, Domain decomposition, Tetrahedral mesh, Parallel algorithm
1 Introduction
Recent work [1, 2, 3] has shown that building an efficient 3D tetrahedral Eikonal solver for multicore and SIMD (single instruction, multiple data) architectures poses many challenges. The memory footprint must be kept low in order to reduce costly memory accesses and to achieve a good computational density on GPUs and other SIMD architectures with limited memory and register capacity. More generally, on any parallel architecture with limited high-bandwidth memory, the memory footprint of the local solver becomes a performance bottleneck. In our previous work [2] we presented a low-memory-footprint Eikonal solver for shared memory (OpenMP) and for streaming architectures (CUDA), which already achieved a very short execution time and good-quality results. We use the fast iterative method (FIM) [1, 4, 5] for solving the Eikonal equation, in particular the version by Fu, Kirby and Whitaker [1], in which the memory footprint is reduced by temporarily storing the inner products needed to compute the 3D solution at one vertex. This step is performed in the wave front computations, and some data have to be transferred from main memory to compute the solution for each tetrahedron. The application background of our paper is cardiovascular simulation, where Eikonal solvers are used to determine the initial excitation pattern in a human heart [6].
We present a new method to reduce the memory footprint of the Eikonal solver, see Sect. 3. We reduce the number of memory transfers as well as the local memory footprint by precomputing all the needed inner products for each tetrahedron in reference orientation, 18 floats in total. We also address the problem of mapping the precomputed data to the orientation of the tetrahedron under investigation. A second step replaces 12 memory accesses per tetrahedron by on-the-fly computations from 6 given values per tetrahedron, which reduces the memory footprint to these 6 numbers in total.
The following sections introduce our algorithms for CPU (sequential C), for shared memory parallelization (OpenMP) and for GPU accelerators (CUDA). Data structures and algorithms for streaming architectures are described in detail, and we show that careful data management allows a higher computational density on the GPU, which in turn yields a satisfactory speedup compared to the OpenMP implementation.
The domain decomposition approach to solving the Eikonal equation is proposed in Sect. 6. We present two different load balancing strategies. The first approach dynamically maps one subdomain to several thread blocks in CUDA, using a sequence of different kernel invocations. It takes better advantage of the GPU shared memory, since the workload of one subdomain is shared between potentially many thread blocks and the total shared memory space of these blocks can be exploited. The second approach uses a single kernel with a single thread block per subdomain. It avoids host synchronization and memory transfers nearly completely. Numerical tests and a discussion of the observed performance gains are presented in Sect. 7, where we investigate GPUs using CUDA [7] and CPUs using OpenMP.
2 Mathematical description of the Eikonal equation
3 Local solver
We use the fast iterative method (FIM) [4, 5] in its description by Fu, Kirby and Whitaker [1] as the baseline Eikonal solver for our improvements in this paper. The original FIM is introduced in Sect. 3.1, and we modify it in Sect. 3.2 by precomputing the needed inner products for each tetrahedron. A further significant reduction in memory footprint is achieved in Sect. 3.3 by expressing the remaining inner products through the edge lengths of the tetrahedron.
3.1 Fast iterative method
3.2 Precomputing the inner products
In order to reduce the non-coalesced memory accesses whenever system (9) has to be solved in a tetrahedron, we precompute all 18 possible inner products, listed in Table 1.
Table 1 M-scalar products stored in the \(6\times 6\) symmetric matrix \(T_M\)

Local Gray-code numbering of edges

The challenge consists in accessing the needed 6 inner products for \(M'\) from the 18 precomputed entries in Table 1 without branched code whenever we face a non-reference configuration, e.g., when \(\phi _i\) is the unknown value with \(i \in \{1,2,3\}\). The needed tools are described in the remainder of this subsection.
3.2.1 Gray code like indexing
We use a local Gray code [10] of 4 bits length to identify uniquely all possible objects in a tetrahedron, see Fig. 1. Each vertex \(k \in \{1,2,3,4\}\) is represented by the 4-bit number \(k^{\text {Gray}} := 2^{k-1}\) with exactly one bit set, and the Gray index of the connecting edge \(e_{k,\ell }^{\text {Gray}} := \textsc {or}(k^{\text {Gray}},\ell ^{\text {Gray}})\) has exactly two bits set.
We are only interested in the edge numbers for accessing the precomputed M-scalar products of one tetrahedron. Table 2 presents the available Gray-code indices for the edges, e.g., the three edges connected to \(x_3\) have the Gray-code indices 5, 6 and 12. This Gray-code numbering will simplify the rotation of the tetrahedron in Sect. 3.2.3. The six edge numbers are spread between 3 and 12 in the Gray code. In order to access the precomputed scalar products in the \(6 \times 6\) matrix from Table 1 we have to transform these Gray-code edge numbers to the edge index set \(\{0,\ldots ,5\}\). This transformation is performed by \(\mathbf{f_{\text {int}}(d)} = \mathbf{d}/2 - 1\) (with integer division), derived from the linear regression function \(\mathbf f(d)\). For example, edge \(e_{1,3}\) from \(x_1\) to \(x_3\) has the Gray-code number \(d=5\) with \(\mathbf{f_{\text {int}}(5)} = 1\), and so we store the scalar products with this edge in row/column 1 of Table 1.
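The index arithmetic described above can be checked in a few lines. The following C++ sketch is illustrative only: the function names `vertex_gray`, `edge_gray` and `edge_index` are assumptions made for this example and are not taken from the solver's source code.

```cpp
#include <cassert>

// Gray-like code of vertex k in {1,2,3,4}: a 4-bit number with only bit k-1 set.
constexpr unsigned vertex_gray(unsigned k) { return 1u << (k - 1u); }

// Gray-like code of the edge joining vertices k and l: bitwise OR of both vertex codes,
// so exactly two bits are set.
constexpr unsigned edge_gray(unsigned k, unsigned l) {
    return vertex_gray(k) | vertex_gray(l);
}

// f_int(d) = d/2 - 1 with integer division maps the six edge codes
// {3, 5, 6, 9, 10, 12} onto the index set {0, ..., 5}.
inline unsigned edge_index(unsigned d) {
    assert(d == 3 || d == 5 || d == 6 || d == 9 || d == 10 || d == 12);
    return d / 2u - 1u;
}
```

With these helpers, `edge_index(edge_gray(1, 3))` selects row/column 1 of the precomputed matrix, and the three edges at \(x_3\) receive the codes 5, 6 and 12 as stated in the text.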
3.2.2 Accessing the precomputed scalar products via Gray indexing
3.2.3 Rotating elements into the reference configuration
Table 3 Edge indices and signs after rotation
| \(\phi _k\) | \(\phi _4\) | \(\phi _3\) | \(\phi _2\) | \(\phi _1\) |
| --- | --- | --- | --- | --- |
| Edges | 5,1,2 | 2,4,0 | 0,1,3 | 3,4,5 |
| Signs | +,+,+ | +,-,+ | +,-,- | -,+,+ |
Thanks to the Gray-code indexing, we can perform the needed edge index transformation from the reference configuration simply by bit shift operations, and the same holds for the sign changes caused by redirected edges, see “Appendix A” for details. The appropriate edge indices after applying \(\mathbf{f_{\text {int}}(d)}\) are presented in row 2 of Table 3 for all configurations.
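As an illustration of how bit operations can produce the edge row of Table 3, the sketch below rotates the 4-bit edge codes cyclically. This is one reading that reproduces the tabulated edge indices; it is not necessarily the implementation given in Appendix A. The names `rotr4` and `edges_for_unknown` are hypothetical, the reference edge codes 12, 5 and 6 are inferred from the \(\phi _4\) column, and the sign changes are not modelled.

```cpp
#include <array>
#include <cassert>

// Cyclic right rotation of a 4-bit code by r positions (0 <= r <= 3).
inline unsigned rotr4(unsigned code, unsigned r) {
    assert(code < 16u && r < 4u);
    return ((code >> r) | (code << (4u - r))) & 0xFu;
}

// Edge index in {0,...,5} from a two-bit edge code, f_int(d) = d/2 - 1.
inline unsigned edge_index(unsigned d) { return d / 2u - 1u; }

// Edge indices needed when phi_k (k in {1,2,3,4}) is the unknown value.
// The reference configuration (k = 4) uses the edge codes 12, 5 and 6;
// every other configuration is obtained by rotating these codes by 4 - k bits.
inline std::array<unsigned, 3> edges_for_unknown(unsigned k) {
    assert(k >= 1u && k <= 4u);
    const unsigned reference[3] = {12u, 5u, 6u};
    std::array<unsigned, 3> result{};
    for (int i = 0; i < 3; ++i)
        result[i] = edge_index(rotr4(reference[i], 4u - k));
    return result;
}
```

The rotation is branch-free, which matches the stated goal of avoiding branched code in non-reference configurations.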
3.3 Further memory footprint reduction
The possibility to reduce the memory footprint from 18 to 6 floats originates from the fact that \(\langle \mathbf {e}_{k,s}, \mathbf {e}_{s,\ell } \rangle _{M^{\tau }}\) (\(k\ne \ell \)) represents an angle of a surface triangle, whereas \(\langle \mathbf {e}_{k,s}, \mathbf {e}_{k,s} \rangle _{M^{\tau }}\) represents the squared length of an edge in the M-metric; the latter are also the products on the main diagonal. Basic geometry and vector arithmetic lead to the conclusion that the angle information can be expressed by a combination of three edge lengths. Therefore we only have to precompute the 6 squared edge lengths of one tetrahedron and compute on-the-fly only the 3 needed angle data instead of the potential 12 angle data. This finally reduces the memory footprint per tetrahedron to 6 numbers.
Table 4 Reduced number of scalar products
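The geometric argument can be made concrete. Since \(\mathbf {e}_{k,\ell } = \mathbf {e}_{k,s} + \mathbf {e}_{s,\ell }\), expanding \(\langle \mathbf {e}_{k,\ell }, \mathbf {e}_{k,\ell } \rangle _{M}\) for a symmetric \(M\) gives \(\langle \mathbf {e}_{k,s}, \mathbf {e}_{s,\ell } \rangle _{M} = \tfrac{1}{2}\left( \Vert \mathbf {e}_{k,\ell }\Vert _M^2 - \Vert \mathbf {e}_{k,s}\Vert _M^2 - \Vert \mathbf {e}_{s,\ell }\Vert _M^2\right) \). The following C++ sketch evaluates both sides; the type aliases and function names are assumptions made for this example.

```cpp
#include <array>
#include <cassert>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<std::array<double, 3>, 3>;

// M-inner product <a, b>_M = a^T M b for a symmetric positive definite M.
inline double inner_M(const Vec3& a, const Mat3& M, const Vec3& b) {
    double s = 0.0;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            s += a[i] * M[i][j] * b[j];
    return s;
}

// Off-diagonal product recovered from three squared edge lengths:
// <e_ks, e_sl>_M = (|e_kl|^2 - |e_ks|^2 - |e_sl|^2) / 2, because e_kl = e_ks + e_sl.
inline double inner_from_lengths(double len2_kl, double len2_ks, double len2_sl) {
    assert(len2_kl >= 0.0 && len2_ks >= 0.0 && len2_sl >= 0.0);
    return 0.5 * (len2_kl - len2_ks - len2_sl);
}
```

Only the three squared lengths have to be loaded, and the product is obtained with two subtractions and one multiplication.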

The memory footprint reduction is obvious. Instead of precomputing and storing all the 18 scalar products in Table 1, only the 6 scalar products of the main diagonal are precomputed and stored as in Table 4.
4 Global solution algorithms
The proposed algorithm uses a modification of the active set update scheme combined with the local solver described above designed for unstructured tetrahedral meshes with inhomogeneous anisotropic speed functions.
Another option for solving the Eikonal equation is the Hopf-Lax update, which is described in mathematical detail in [13] for a 2D finite element discretization and has been extended to 3D in [14].
5 Task based parallel Eikonal solver
There are several strategies to derive a parallel Eikonal solver [3, 15, 16, 17]: The simple Algorithm 2 can execute the “for all” loop in parallel [18]. Each thread or processor i is responsible for a number of tetrahedra, e.g. forming a subdomain \(\Omega _i\) of the computational domain \(\Omega \). However, only a few tetrahedra have to update the solution \(\phi \) at the same time. So this code is not efficient.
5.1 Multithreading
Using many-core processors and many threads, programming paradigms like OpenMP allow this kind of parallelism. The efficiency depends on the size of the active set, which corresponds to the length of the wave front and changes during the computation. Starting with a single excitation point, the Eikonal solver will start a wave with growing wave front size, until the wave reaches the boundary and is cut off, see Fig. 2. Furthermore, the amount of work per set element and the general memory layout play an important role.
Therefore, the multithreaded version of the algorithm is derived by simply partitioning the active list in each iteration and assigning the work of each sublist to one thread. Each thread updates its own active sublist, but the solution is synchronized against the solution vector \(\Phi \) in which the values of all nodes are kept. In practice, we divide the active list arbitrarily into N sublists and assign these sublists to the N threads. We choose N to be the number of virtual CPU cores in multithreading.
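The partitioning step can be sketched as follows. This is a minimal C++ illustration, not the solver's actual code: `sublist_range` and `process_active_list` are hypothetical names, contiguous sublists are just one possible arbitrary division, and the caller is assumed to pass an `update` callable that is safe to run concurrently on different nodes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Half-open index range [first, second) of sublist t when n entries are
// split into num_threads nearly equal contiguous sublists.
inline std::pair<std::size_t, std::size_t>
sublist_range(std::size_t n, std::size_t num_threads, std::size_t t) {
    assert(num_threads > 0 && t < num_threads);
    const std::size_t base = n / num_threads;
    const std::size_t rest = n % num_threads;  // the first `rest` sublists get one extra entry
    const std::size_t begin = t * base + std::min(t, rest);
    const std::size_t end = begin + base + (t < rest ? 1 : 0);
    return {begin, end};
}

// Applies `update` to every active node, one sublist per loop iteration.
// When compiled with OpenMP, the pragma distributes the sublists over the threads;
// without OpenMP the pragma is ignored and the loop runs sequentially.
template <class Update>
void process_active_list(const std::vector<int>& active, std::size_t num_threads,
                         Update update) {
    #pragma omp parallel for
    for (long t = 0; t < static_cast<long>(num_threads); ++t) {
        const auto r = sublist_range(active.size(), num_threads, static_cast<std::size_t>(t));
        for (std::size_t i = r.first; i < r.second; ++i)
            update(active[i]);
    }
}
```

Because the wave front, and hence the active list, changes in every iteration, the ranges are recomputed each time rather than fixed once.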
We have a very short convergence time of the algorithm and good quality results, see Fig. 2 wherein the wave propagation looks very smooth.
5.2 CUDA offloading kernels
When designing an algorithm that will be used in a streaming unit such as a GPU, it is very important to optimize for throughput and not for latency. Together with the concept of coalesced memory access pattern we designed our algorithm based on one critical point: allowing data redundancy in order to achieve good coalesced memory access, throughput and occupancy.
The main idea of the Eikonal update scheme is to find the solution at one node by computing over all tetrahedra of its one-ring. In the straightforward case, one thread would therefore solve for one node by doing the computations for all its neighboring tetrahedra. Unfortunately, this leads to a non-coalesced access pattern, a smaller number of threads (low level of parallelism), low throughput and poor occupancy. A better solution consists in calculating a priori all the neighboring tetrahedra indices for each node in the active list, storing them in a global array and then assigning each element (tetrahedron) to one thread. Of course the information will be redundant, as shown in the last array in Fig. 4, but we trade memory for performance. In this way we achieve better coalesced memory access, more threads to run (increased parallelism) and a better bandwidth, which are the three most important concepts for a fast GPU algorithm.
5.2.1 Using SCAN for compaction
In order to successfully apply this idea we use the parallel SUM SCAN algorithm and its most important applications such as stream compaction and address generation for efficiently gathering the results into memory as presented in [19], see Algorithm 6.
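Stream compaction with a sum scan can be sketched sequentially as follows. This C++ example only illustrates the technique from [19] on the CPU; the names `exclusive_prefix_sum` and `compact` are assumptions, and the flags are assumed to be 0 or 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], out[0] = 0.
inline std::vector<int> exclusive_prefix_sum(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;
        running += in[i];
    }
    return out;
}

// Keeps the items whose flag is 1, preserving their order.
inline std::vector<int> compact(const std::vector<int>& items, const std::vector<int>& flags) {
    assert(items.size() == flags.size());
    const std::vector<int> address = exclusive_prefix_sum(flags);
    const std::size_t total =
        items.empty() ? 0 : static_cast<std::size_t>(address.back() + flags.back());
    std::vector<int> out(total);
    for (std::size_t i = 0; i < items.size(); ++i)  // on a GPU: one thread per i
        if (flags[i]) out[address[i]] = items[i];
    return out;
}
```

The scan result doubles as the write address of every kept item, so the scatter loop has no dependencies between iterations and maps directly to one thread per item.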
5.2.2 Using SCAN for data management
The first array in Fig. 4 denotes the compacted active list generated by the process from Fig. 3. Based on the compacted active list, we compute the total number of neighboring tetrahedra for each node of the active list (Algorithm 8, line 4) and store these counts in a temporary array, as in step 2 in Fig. 4. Then the scan algorithm is applied to this temporary array to generate the addresses which are used to gather the neighboring tetrahedra indices into the last result array (Algorithm 8, lines 5–7). For example, in Fig. 4 the first two elements are copied starting at address 0, the next 3 elements starting at address 2, and so on. The information in the last array is redundant. This scheme allows us to achieve coalesced access and also to increase the parallelism by increasing the number of working threads. One thread per element is used for solving the elements in the last array, as in Algorithm 8, line 8. The algorithm uses the same scheme later on when it solves for all the neighboring nodes of the converged active list nodes.
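The count, scan and gather steps can be sketched as follows. The compressed adjacency layout `NodeToTets` and the function name `gather_one_rings` are assumptions for this C++ illustration; the loops run sequentially here, whereas on the GPU each of them corresponds to a kernel with one thread per entry.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Node-to-tetrahedra adjacency in compressed form (hypothetical layout):
// the tetrahedra around node v are tets[offset[v]] ... tets[offset[v + 1] - 1].
struct NodeToTets {
    std::vector<int> offset;
    std::vector<int> tets;
};

// Builds the work array holding the one-ring tetrahedra of all active nodes.
inline std::vector<int> gather_one_rings(const NodeToTets& adj, const std::vector<int>& active) {
    // Step 1: number of neighbouring tetrahedra per active node.
    std::vector<int> count(active.size());
    for (std::size_t i = 0; i < active.size(); ++i) {
        assert(active[i] >= 0 && active[i] + 1 < static_cast<int>(adj.offset.size()));
        count[i] = adj.offset[active[i] + 1] - adj.offset[active[i]];
    }
    // Step 2: an exclusive scan turns the counts into write addresses.
    std::vector<int> address(active.size());
    int total = 0;
    for (std::size_t i = 0; i < active.size(); ++i) {
        address[i] = total;
        total += count[i];
    }
    // Step 3: copy each one-ring to its address; duplicates are kept on purpose.
    std::vector<int> work(total);
    for (std::size_t i = 0; i < active.size(); ++i)
        for (int j = 0; j < count[i]; ++j)
            work[address[i] + j] = adj.tets[adj.offset[active[i]] + j];
    return work;
}
```

A tetrahedron shared by two active nodes appears twice in `work`, which is the redundancy accepted in exchange for a contiguous, coalesced work array.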
6 Domain decomposition parallel Eikonal solver
The domain \(\Omega \) is statically partitioned into a number of non-overlapping subdomains \(\Omega _i\), see Fig. 5, each of which is assigned to a single processor. Synchronization and communication between the processors are to be reduced to a minimum. In our case, a single processor i can efficiently solve the Eikonal equation on \(\Omega _i\), as long as its boundary data on \(\partial \Omega _i\) is correct. However, this data may belong to the outer boundary \(\partial \Omega \) or to other processors. Hence inter-processor communication is needed. Algorithm 3 can be adapted in several ways.
6.1 Parallel sweep
6.2 Local solves
6.3 CUDA implementation
We use two different strategies to distribute the workload of the kernels to the available blocks in CUDA. Our first approach consists in mapping the work of one subdomain to several blocks, and hence possibly to several streaming multiprocessors (SMs) of the GPU, with one thread processing exactly one tetrahedron as explained in Sect. 5.2.2. The local subdomain solver can thus use the shared memory of all blocks working on that subdomain, so more data fit into shared memory because the blocks share the load. In the second approach one block computes one subdomain. The available shared memory is then limited to that of a single block, but each thread has more work to do, which increases the granularity. Every scan, scatter or other kernel that prepares the data for the solution kernel computes independently for each subdomain. The only difference between the two versions is the way the workload of the main kernel is distributed. The main difficulty of the static decomposition is load balancing: the wave front moves, and many subdomains have empty active sets and remain idle during execution. Distributing the load of one subdomain over many blocks instead of one therefore increases the number of utilized blocks and SMs on the GPU. It also has the benefit that the wave front velocity information can be stored in shared memory, since more shared memory is available than in the one-block-per-subdomain model.
The advantage of the other model, where one block is responsible for one subdomain, is the granularity, i.e., the number of elements processed by one thread. In this model one thread has more work to do, and with the right type of access this can increase the efficiency, especially if the data fit into the registers of the thread. Optimizing this access is future work. One idea is to use the block load primitive from the CUB library [20], which provides collective data movement methods for loading a linear segment of items from memory into a blocked arrangement across a CUDA thread block.
6.3.1 First DD approach

Work load management strategy.
During each iteration we keep track of the active domains. The number of active domains changes with thread synchronization. In this way the data preparation by other kernels and the solution computations are done only for the active domains, which avoids unnecessary work for preparing data or solving on inactive domains.
The workload of one domain is distributed to several blocks, where each block computes a subset of the tetrahedra of that domain, calculated in advance from the local active list as shown in Fig. 6. One thread in a block computes one tetrahedron, and the number of threads per block is chosen such that the occupancy is not decreased. The number of blocks solving for one domain is the total number of its tetrahedra divided by the number of threads used per block.
At the beginning and at the end of the execution, the number of active domains is very small. This means that many blocks or processing units remain idle. This approach increases the utilization because it uses more blocks for one domain and thus reduces the number of inactive processing units (SMs). Another way to overcome this is to use several smaller subdomains, which may increase the chance of a processor having a non-empty active set.
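The block count per domain can be computed as in the following C++ sketch. Rounding the division up, so that a partially filled last block is still launched, is an assumption of this example, as are the names `blocks_for_domain` and `first_block_per_domain`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of blocks needed so that every tetrahedron gets its own thread.
inline int blocks_for_domain(int num_tets, int threads_per_block) {
    assert(num_tets >= 0 && threads_per_block > 0);
    return (num_tets + threads_per_block - 1) / threads_per_block;  // division rounded up
}

// first_block[d] is the first block assigned to active domain d;
// the last entry is the total number of blocks to launch.
inline std::vector<int> first_block_per_domain(const std::vector<int>& tets_per_domain,
                                               int threads_per_block) {
    std::vector<int> first_block(tets_per_domain.size() + 1, 0);
    for (std::size_t d = 0; d < tets_per_domain.size(); ++d)
        first_block[d + 1] =
            first_block[d] + blocks_for_domain(tets_per_domain[d], threads_per_block);
    return first_block;
}
```

A domain with an empty work list receives zero blocks, so idle domains do not occupy any processing units in that iteration.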

Synchronization.
At the end of the loop, local values are exchanged based on a precomputed local-to-local mapping, which reduces the access to global memory. The domain \(\Omega \) is statically decomposed into a number of non-overlapping subdomains \(\Omega _i\), so local solution values have to be exchanged only at the common nodes on the boundary between two subdomains. In our case the solution of the Eikonal equation on \(\Omega _i\) is efficient as long as its boundary data on \(\partial \Omega _i\) is correct. This data may belong to other subdomains processed by other blocks, which requires inter-process communication. Communication between blocks on one GPU is realized by host synchronization. For this reason the synchronization kernel takes place at the end of the loop. In order to achieve a quick mapping, a precomputed local-to-local interface of each subdomain with all its neighboring subdomains is used. It contains the local indices of the boundary nodes and the indices of the two subdomains that form the boundary containing these nodes. In this way we can easily start a kernel which has all the information needed to exchange solution values for all the nodes in the interface. The exchange happens only if the value of a boundary node in one subdomain is smaller than the value of the same node in the other subdomain. In that case the larger value is replaced by the smaller one, the node is added to the local active list of the subdomain whose value changed, and that subdomain becomes active.
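The exchange rule can be sketched sequentially as follows. The data layout (`InterfaceEntry`, `Subdomain`) and the function name `exchange_interface` are assumptions for this C++ illustration and do not reproduce the solver's actual data structures; on the GPU the loop would correspond to one thread per interface entry.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One shared boundary node as seen from two subdomains (hypothetical layout).
struct InterfaceEntry {
    int domain_a, local_a;  // subdomain index and local node index on one side
    int domain_b, local_b;  // the same mesh node on the other side
};

struct Subdomain {
    std::vector<float> phi;         // local solution values
    std::vector<int> active_nodes;  // local active list
    bool active = false;
};

// Copies the smaller value of each shared node to the side holding the larger one
// and reactivates the node and the subdomain that received the update.
inline void exchange_interface(std::vector<Subdomain>& doms,
                               const std::vector<InterfaceEntry>& iface) {
    for (const InterfaceEntry& e : iface) {
        assert(e.domain_a != e.domain_b);
        float& a = doms[e.domain_a].phi[e.local_a];
        float& b = doms[e.domain_b].phi[e.local_b];
        if (a < b) {
            b = a;
            doms[e.domain_b].active_nodes.push_back(e.local_b);
            doms[e.domain_b].active = true;
        } else if (b < a) {
            a = b;
            doms[e.domain_a].active_nodes.push_back(e.local_a);
            doms[e.domain_a].active = true;
        }
    }
}
```

When both sides already agree, nothing is written and neither subdomain is reactivated, so converged interfaces generate no further work.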

The termination condition.
The termination kernel checks whether the total number of nodes in all local active lists is zero, which fulfills the termination condition. This is a separate kernel, since it has to wait until the synchronization kernel, in which nodes are added to the local active lists, has finished in order to obtain a correct result.
6.3.2 Second DD approach

Work load management strategy.
We keep track of the active domains in order to avoid unnecessary computations for inactive domains and to allow for better load balancing of the work across the blocks. Since one block computes a single subdomain, one thread has to do more work. This might become a very efficient approach if done correctly, i.e., by preserving the coalesced memory access and by increasing the granularity as in Fig. 7. Several smaller subdomains increase the chance of a block having a non-empty active set.

Host synchronization reduction.
The domain decomposition in CUDA allows for kernel optimizations and a reduction of host interaction, because the independent computations within a block for each subdomain allow for block synchronization instead of host synchronization. This reduces the number of kernels and improves the code. The memory allocation is done dynamically, which means a total reduction of memory transfers from the device to the host. In summary, the domain decomposition approach improves the performance.

Synchronization.
Fusing all kernels into one kernel is not straightforward because the synchronization kernel and the termination condition kernel have to be incorporated. The safest way to guarantee correct results is a synchronization with the host in order to ensure that all blocks have finished execution, which is mandatory for the termination condition kernel. We finally managed to reduce all computations to one large synchronization kernel and a second termination condition kernel.
A change in the local-to-local interface model explained in Sect. 6.3.1 allowed us to incorporate the synchronization kernel. The first domain decomposition approach used a local-to-local mapping for all boundaries between two subdomains. This can no longer be applied in the new approach, where one block solves for one subdomain, because the information in the interface is not grouped by subdomain: the mapping contains the total information on the boundary for all subdomains, and one block does not know which part of the information belongs to the subdomain it is solving for. Therefore, we changed the way the interface is precomputed such that every subdomain has its own separate interface information with its neighboring subdomains. As a consequence, one block knows exactly which interface maps to which process and updates accordingly.
This change allowed us to incorporate all computations into one main kernel, but it does not solve the synchronization problem. Once the synchronization step is placed within the main kernel, it is no longer thread safe. When the synchronization step tries to update a boundary solution value of another subdomain, the block computing on that other subdomain might update it as well. Of course this update is protected with atomic operations, but this alone is not enough. Fortunately, the order of these updates is not relevant to the overall solution, and so we finally obtain a correct solution.
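An update whose result does not depend on the order of the writers is, for example, an atomic minimum. The following C++ sketch shows a CPU-side analogue built on a compare-and-swap loop with `std::atomic<float>`; it is not the CUDA kernel code, and the function name `atomic_min` is an assumption of this example.

```cpp
#include <atomic>
#include <cassert>

// Lowers `target` to `candidate` if the candidate is smaller; returns true when
// this call changed the stored value. Safe against concurrent callers.
inline bool atomic_min(std::atomic<float>& target, float candidate) {
    assert(candidate == candidate);  // rejects NaN, which has no ordering
    float current = target.load();
    while (candidate < current) {
        // On failure, `current` is refreshed with the value another thread stored.
        if (target.compare_exchange_weak(current, candidate))
            return true;
    }
    return false;
}
```

Whichever writer arrives first, the stored value ends up as the smallest candidate, which is why the interleaving of the boundary updates does not affect the final solution. The return value can be used to decide whether the node has to be put back on an active list.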
7 Numerical tests and performance analysis
We present the results of the numerical tests in single precision performed on a workstation with an Intel Core i7-4700MQ CPU @ 2.40 GHz and a GeForce GTX 1080 GPU (Nvidia Pascal). We use a coarser mesh of a human heart with 3,073,529 tetrahedra and 547,680 vertices and a finer mesh of the same heart which contains 24,400,999 tetrahedra and 4,380,375 vertices. Results and analysis are shown for the Gray-code method and for the domain decomposition method, which includes the Gray-code improvements.
7.1 Gray code
Run times in s on the workstation
| Implementation | # Tets | CUDA | OpenMP, 8 threads |
| --- | --- | --- | --- |
| Without Gray code | 3,073,529 | 1.49 | 5.66 |
| With Gray code | 3,073,529 | 0.73 | 3.65 |
| Without Gray code | 24,400,999 | 11.48 | 56.63 |
| With Gray code | 24,400,999 | 5.16 | 36.43 |
Profiling the CPU performance (Intel’s VTune Amplifier) shows that the number of loads has been reduced to \(70\%\) and the number of stores has dropped to \(40\%\) in the Gray-code version. The last level cache (LLC) miss count has been reduced to \(84\%\). The significant reduction in stores is caused by avoiding many local temporary variables of the original approach. The reduction of non-coalesced memory accesses is documented by the reduced number of loads and the lower LLC miss counts.
These improvements are even more pronounced in the CUDA implementation, where avoiding expensive non-coalesced memory accesses yields a larger performance gain. The reduction in non-coalesced accesses to global memory is the same as on the CPU, but the GPU profits more from it. We compared the profiling results of both CUDA versions (Nvidia’s Visual Profiler nvvp). The old implementation had a high local memory overhead, which accounted for \(54\%\) of the total memory traffic and indicated excessive register spilling. After implementing the Gray-code method that problem disappeared and the number of register spills decreased significantly: the new main kernel has only 60 bytes of spill stores and 84 bytes of spill loads, in contrast to 352 bytes of spill stores and 472 bytes of spill loads in the old kernel. The L2 cache utilization is maximized in the new kernel. The L2 bandwidth is doubled and now reaches 952.08 GB/s, compared to only 456.103 GB/s in the old version. The number of loads and stores is also decreased for the L2 cache as well as for the device memory. For this reason the local memory overhead is also decreased, and the achieved device memory bandwidth has increased to above \(90\%\), whereas it was below \(60\%\) in the old version, indicating latency issues. Lastly, divergent execution also dropped, which improved the warp execution efficiency from 75 to \(80\%\).
7.2 Domain decomposition
Let us compare the numerical results of the first and the second DD approaches with each other and with the non-DD approach. We focus especially on hardware limitations observed on the GTX 1080 GPU and how they could be overcome.
Run times in s on the GTX 1080 for the coarser mesh
| # Subdomains | First DD approach | Second DD approach |
| --- | --- | --- |
| 74 | 0.48 | 0.69 |
| 160 | 0.52 | 0.60 |
| 320 | 0.58 | 0.51 |
The block scan computes a parallel prefix sum/scan of items partitioned across a block. If we decompose the domain in subdomains small enough such that block scan can use the available shared memory and the register pressure does not affect the occupancy then this method fits to our DD approaches. Additionally, the block scan can be called within a CUDA kernel and so we incorporated it into the one big kernel for the second DD approach. A device scan would require a separate kernel.
The drawback is the shared memory limitation. The block scan requires shared memory, and the shared memory size therefore imposes a lower limit on the number of subdomains: a small number of subdomains means larger subdomains for the same mesh, so the data to be scanned for those subdomains no longer fit into the shared memory. Hence we start the testing with 2700 subdomains for the finer mesh and 74 subdomains for the coarser mesh. In any case, the aim is to increase the number of subdomains, as discussed in Sect. 6.3.1, since this improves the load balancing; this limitation is therefore not relevant for our code.
Run times in s on the GTX 1080 for the finer mesh
| # Subdomains | First DD approach | Second DD approach |
| --- | --- | --- |
| 2700 | 5.96 | 6.89 |
| 3000 | 6.40 | 7.55 |
| 4000 | 7.55 | 9.46 |
| 8000 | 14.00 | 14.74 |
In order to check the scalability of our DD approaches on different GPUs, we also tested on an Nvidia Titan X Pascal card with 24 multiprocessors (SMs), 4 more than a GTX 1080 card, and 4 GB more global memory, which is still not enough to preallocate. For the finer mesh with 2700 subdomains, the results are approximately \(20\%\) faster than on the GTX 1080 for the first approach and \(13\%\) faster for the second approach: the convergence time is 4.68 s for the first DD approach and 5.96 s for the second. The second approach scales worse because the scalability issue that both DD approaches have on the larger mesh affects the second approach more, as explained above.
8 Conclusions and future work
The Gray-code numbering in Sect. 3.3 has significantly reduced the overall memory footprint of the Eikonal solver, achieving performance improvements of 35–50%. The analysis showed that this Gray-code version decreased the level of non-coalesced accesses and significantly increased the computational density on the GPU.
The domain decomposition approach solves the Eikonal equation for large-scale problems. We managed to run the domain decomposition approach on one GPU, as explained in Sect. 6.3, by using two different strategies in CUDA. The first strategy makes better use of shared memory. For coarser meshes we get a very good convergence time. However, it does not scale well with an increasing number of subdomains, since its implementation executes many kernels in succession, resulting in many host synchronizations and memory transfers between the device and the host. Assigning one block to one subdomain in the second strategy avoids host synchronization and memory transfers nearly completely. This can be seen from the good scalability, thanks to dynamic preallocation of global memory, for the coarser mesh. We still run into the global memory limitation for large-scale problems.
Rotation of elements into the reference configuration

By testing on different GPUs with Pascal architecture, such as the GTX 1080 and the Titan X, we concluded that our CUDA implementations, i.e., both domain decomposition approaches and the non-domain-decomposition approach, all scale very well on different Nvidia GPUs. Our OpenMP implementation scales almost linearly on physical KNL cores.
Notes
Acknowledgements
Open access funding provided by University of Graz.
References
1. Fu, Z., Kirby, R.M., Whitaker, R.T.: Fast iterative method for solving the Eikonal equation on tetrahedral domains. SIAM J. Sci. Comput. 35(5), C473–C494 (2013)
2. Ganellari, D., Haase, G.: Fast many-core solvers for the Eikonal equations in cardiovascular simulations. In: 2016 International Conference on High Performance Computing Simulation (HPCS), pp. 278–285. IEEE, Peer-reviewed (2016)
3. Noack, M.: A two-scale method using a list of active subdomains for a fully parallelized solution of wave equations. J. Comput. Sci. 11, 91–101 (2015)
4. Fu, Z., Jeong, W.K., Pan, Y., Kirby, R.M., Whitaker, R.T.: A fast iterative method for solving the Eikonal equation on triangulated surfaces. SIAM J. Sci. Comput. 33, 2468–2488 (2011)
5. Jeong, W.K., Whitaker, R.T.: A fast iterative method for Eikonal equations. SIAM J. Sci. Comput. 30, 2512–2534 (2008)
6. Neic, A., Campos, F.O., Prassl, A.J., Niederer, S.A., Bishop, M.J., Vigmond, E.J., Plank, G.: A fast iterative method for Eikonal equations. J. Comput. Phys. 346, 191–211 (2017)
7. NVIDIA, CUDA C programming guide. http://docs.nvidia.com/cuda/cudacprogrammingguide
8. Qian, J., Zhang, Y.T., Zhao, H.K.: Fast sweeping methods for Eikonal equations on triangulated meshes. SIAM J. Numer. Anal. 45, 83–107 (2007)
9. Holm, D.D.: Geometric Mechanics: Part I: Dynamics and Symmetry, 2nd edn. Imperial College London Press, London (2011)
10. Weisstein, E.: Gray code. http://mathworld.wolfram.com/GrayCode.html, from MathWorld–A Wolfram Web Resource
11. Sethian, A.: A fast marching level set method for monotonically advancing fronts. Proc. Natl. Acad. Sci. USA 93(4), 1591–1595 (1996)
12. Sethian, A., Vladimirsky, A.: Fast methods for the Eikonal and related Hamilton–Jacobi equations on unstructured meshes. Proc. Natl. Acad. Sci. USA 97(11), 5699–5703 (2000)
13. Bornemann, F., Rasch, C.: Finite-element discretization of static Hamilton–Jacobi equations based on a local variational principle. Comput. Vis. Sci. 9(2), 57–69 (2006). https://doi.org/10.1007/s007910060016y
14. Stöcker, C., Vey, S., Voigt, A.: AMDiS Adaptive multidimensional simulations: composite finite elements and signed distance functions. WSEAS Trans. Circuits Syst. 4(3), 111–116 (2005)
15. Zhao, H.K.: Parallel implementations of the fast sweeping method. J. Comput. Math. 25, 421–429 (2007)
16. Ganellari, D., Haase, G.: Reducing the memory footprint of an Eikonal solver. In: Limet, S., Smari W., Spalazzi, L. (eds.) 2017 International Conference on High Performance Computing Simulation (HPCS), pp. 325–332. IEEE (2017). https://doi.org/10.1109/HPCS.2017.57
17. Detrixhe, M., Gibou, F., Min, C.: A parallel fast sweeping method for the Eikonal equation. J. Comput. Phys. 237, 46–55 (2013)
18. Mattson, T.G., Sanders, B.A., Massingill, B.L.: Patterns for Parallel Programming. Addison-Wesley, Boston (2004)
19. Harris, M., Sengupta, S., Owens, J.D.: Parallel prefix sum (scan) with CUDA. In: Nguyen, H. (ed.) GPU Gems 3, pp. 851–876. Addison-Wesley, Boston (2007)
20. Merrill, D.: Cub library. https://nvlabs.github.io/cub, NVIDIA Research (2013)
21. Merrill, D., Garland, M.: Single-pass parallel prefix scan with decoupled look-back. NVIDIA Technical Report NVR2016002 (2016)
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.