Fast harmonic tetrahedral mesh optimization

Mesh optimization is essential to enable sufficient element quality for numerical methods such as the finite element method (FEM). Depending on the required accuracy and geometric detail, a mesh with many elements is necessary to resolve small-scale details. Sequential optimization of large meshes often imposes long run times. This is especially an issue for Delaunay-based methods. Recently, the notion of harmonic triangulations [1] was evaluated for tetrahedral meshes, revealing significantly faster run times than competing Delaunay-based methods. A crucial aspect for efficiency and high element quality is boundary treatment. We investigate directional derivatives for boundary treatment and massively parallel GPUs for mesh optimization. Parallel flipping achieves compelling speedups by up to 318×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$318\times $$\end{document}. We accelerate harmonic mesh optimization by 119×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$119\times $$\end{document} for boundary preservation and 78×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$78\times $$\end{document} for moving every boundary vertex, while producing superior mesh quality.


Introduction
Meshing a domain into a set of simplices T is a fundamental task in geometry processing. The resulting mesh can be used to solve differential equations, enabling a wide range of applications including physically based animation using the FEM [19,34] and spectral geometry processing [17,39].
Mesh generation for numerical computation is not only concerned with finding a triangulation T of . Elements of small volume or area, i.e., ill-shape, must be avoided too. In fact, a single ill-shaped element may cause numerical methods to fail [27]. For this reason, current meshing tools [13,30] perform an optimization step after generating an initial mesh. However, it is an open issue for tetrahedral meshes that quality functions are not consistent with the Delaunay triangulation, leading to ill-shaped elements [18]. Recently, B D. Ströter daniel.stroeter@gris.tu-darmstadt.de 1 TU Darmstadt, 64277 Darmstadt, Germany 2 Fraunhofer IGD and TU Darmstadt, Darmstadt, Germany 3 Fraunhofer IGD, TU Darmstadt and TU Graz, Darmstadt, Germany Alexa [1] introduced harmonic triangulations, defining an energy whose minimization significantly improves element quality of Delaunay meshes.
The accuracy of numerical methods improves with the mesh resolution. Thus, meshes with many elements are required for the analysis of complex geometric structures. Sequential mesh optimization on the CPU results in slow run times for large meshes, which is especially an issue in interactive settings [33]. Parallel algorithms that use modern parallel processors, specifically massively parallel GPUs, are necessary to optimize large meshes quickly. As harmonic triangulations outperform established Delaunay-based optimization methods, we use them as a basis to devise a parallel mesh optimization algorithm.
In this paper, we extend harmonic mesh optimization to faster run times, improved convergence and boundary treatment. Our contributions are: -A novel mesh optimization scheme that efficiently improves high-resolution tetrahedral meshes. -A robustly converging mesh optimization scheme.
-Novel massively parallel algorithms for mesh optimization. -Gradient-based boundary vertex optimization , replacing reprojection.

Preliminaries and notation
Although we focus on tetrahedral meshes, our notation covers arbitrary dimensions, because we extend the harmonic triangulations framework. We define a d-dimensional simplicial mesh M = (T, V) as a tuple of a d-simplex sequence T and a sequence of vertices V ⊂ R d . Boundary vertices are included in ∂V, where ∂V ⊆ V. We denote a k-simplex τ as a (k + 1)-tuple (x 0 , . . . , x k ) ∈ V k+1 of vertices, where k ≤ d.
Oriented volumes, face areas, and normals are represented by v τ , a τ , and n τ , respectively. The ith vertex of τ is given by τ i ∈ V. Likewise, the ith element in T or V is denoted as T i or V i , respectively. The matrix of a d-simplex' vertex set can be written as X τ = (τ 0 , . . . , τ d ) ∈ R d×(d+1) , where the ith column is the position of τ i . We use the matrix M d to express the vertices of a d-simplex in relation to its first vertex such that v τ = det(τ 1 − τ 0 , . . . , τ d − τ 0 )/d! = det(X τ M d )/d!: where e i denotes the ith canonical unit vector of R d .
As the subtriangulation forming the one-ring neighborhood of a specific vertex x ∈ V is of interest during optimization, we introduce the following notation: For any k-simplex τ we obtain the (k −1)-sub-simplex opposite to τ i with the set difference τ \ τ i . The goal of harmonic mesh optimization [1] is to minimize the trace of the Laplacian L T consisting of tr(L τ ), where τ ∈ T: tr(L T ) = τ ∈T tr(L τ ).
Dropping the constant factor leads to the harmonic index η: The optimized triangulation shall respect the input boundary B and be free of inversions to satisfy the requirements of numerical methods. Considering these conditions leads to the following nonlinear optimization problem: where ∼ = denotes an approximate congruence between the target boundary B and the discrete boundary ∂T. For full boundary preservation, we can enforce ∂T = B. Additionally, we prohibit inversions, i.e., tetrahedra τ with v τ < 0. We denote boundary vertices surrounded by co-planar faces as V F ⊆ ∂V, boundary vertices on a geometrical edge as V E ⊆ ∂V, and boundary vertices representing a geometrical corner as V C ⊆ ∂V.
To perform harmonic mesh optimization, Alexa [1] combines a flipping algorithm and a gradient descent scheme. The flipping algorithm performs 2-3 and 3-2 bistellar flips (see Fig. 1). If a bistellar flip reduces tr(L T ), it is a harmonic flip. Harmonic flips can be locally ordered by their reduction of the trace. Prioritizing harmonic flips with the largest reduction of the traces produces good element quality. Thus, harmonic flips may be arranged in an ordered queue favoring flips by their reduction of tr(L T ). Additionally, a harmonic flip either coincides with a Delaunay flip or it produces a local triangulation of two tetrahedra, while the Delaunay flip would produce three tetrahedra.
In order to minimize tr(L T ), a gradient descent scheme can be used to relocate vertices. The first step is to assemble a gradient for each vertex of the mesh by calculating a gradient for each tetrahedron τ ∈ T: To avoid inverted elements, Alexa uses binary search to find a single step size λ for one gradient descent step on the entire mesh. Instead of using λ for vertex relocation, Alexa uses Brent's method [23] to find a minimum along the steepest descent located at some α ∈ [0, λ]. Boundary vertices are reprojected onto the surface after global gradient descent.

Related work
The literature comprises a long history in investigating tetrahedral mesh optimization. Freitag et al. [9] improve tetrahedral meshes by swapping common faces or edges and relocating vertices. For fast run times, Freitag et al. [8] relocate batches of non-adjacent vertices in parallel, while preventing element inversions. Many mesh optimization frameworks use computational efficient Laplacian smooth-ing, relocating a vertex in the direction of the arithmetic average of the adjacent vertices [9,26,29,36]. However, Laplacian smoothing does not strictly guarantee to produce a high-quality or even inversion-free mesh. In addition, the gradient of the Laplacian does not vanish in general, which complicates finding appropriate termination criteria. Consequently, previous works devised different quality functions for a mesh element [27]. Knupp [16] confirms the use of the Jacobian as a building block for quality functions of a finite element. Today, many mesh optimization methods rely on distortion methods using the Jacobian. We provide a review of distortion-based methods in Sect. 3.1. As our work contributes massively parallel algorithms for mesh optimization, we discuss related work in this field in Sect. 3.2. We also discuss boundary treatment in our work and highlight previous work in Sect. 3.3.

Distortion energies for mesh optimization
Besides mesh improvement, energies minimizing global distortion are typically used for parameterization tasks such as surface fitting or remeshing. Hormann et al. [12] introduce the most isometric parameterizations (MIPS). Originally, MIPS is intended for mapping a triangulation of data points to a triangulation in the plane. Fu et al. [10] extend MIPS to the advanced MIPS (AMIPS) energy that effectively minimizes distortion in 2D and 3D. For vertex relocation, they perform nonlinear Gauss-Seidel iterations simultaneously on sets of non-adjacent vertices. However, nonlinear optimization methods typically impose slow run times and do not scale well to meshes with many elements. For this reason, Rabinovich et al. [24] present a local/global algorithm that scales to large data sets through replacing the nonlinear energy with a simple proxy energy. The local step calculates weights mapping gradients to the distortion of elements using the proxy energy. With the weighted gradients, a global system can be efficiently assembled and solved. For solving the global system, an initial inversion-free step size is found using the method of Smith et al. [31]. While distortion energies are effective in improving illshaped elements, harmonic triangulations provide a local order of bistellar flips [1]. As flips are locally ordered by energy reduction, we formulate a massively parallel algorithm performing locally most beneficial flips that quickly improves element quality. Additionally, our work focuses on Delaunay-based methods, as harmonic flips are related to Delaunay flips. We achieve scalability by neat parallelization.

Parallel tetrahedral mesh optimization
Lots of recent work address parallel tetrahedral mesh optimization. Benitez et al. [3] perform smoothing and untan-gling in a distributed environment using domain decomposition. Shontz et al. [28] relocate vertices by solving ordinary differential equations on a distributed system using domain decomposition. Zint et al. [40] describe a GPU-parallel method to search for an optimal vertex position on a coarse grid of candidate positions. While this enables optimization of non-differentiable functions, we focus on differentiable energies, as they enable first-order methods that converge more quickly than exhaustive search. In addition, we focus on fine-grained parallelism that leads to fast run times on a single machine and does not require a distributed system.
In contrast to parallel vertex relocation, parallel local reconnection of vertices imposes the additional challenge of preventing concurrent processing of overlapping regions. Nonetheless, vertex relocation and reconnection should be used in concert [15] to achieve an effective optimization. D'Amato et al. [5] designed a CPU-GPU framework that performs local remeshing and vertex relocation in parallel using a decomposition of the mesh into clusters. Shang et al. [26] present a multi-threaded algorithm for parallel local reconnection, which maps reconnection operations to feature points sorted along a space filling curve. They assume geometrical separation of remeshing operations so that regions rarely overlap. Ibanez et al. [14] schedule the application of cavity-based remeshing on shared memory systems. Their method finds independent sets of cavities for processing in batches of these independent sets. Drakopoulos et al. [7] describe a parallel speculative local remeshing approach for high-performance computing. They use atomic operations for synchronization in case of overlapping regions. In contrast to established parallel local reconnection methods, our parallel flipping algorithm does not require a precomputed decomposition of the mesh or atomic operations but relies on the local order of harmonic flips.

Boundary treatment in tetrahedral mesh optimization
Boundary treatment in tetrahedral mesh optimization is a sparsely discussed field. While some methods rely on curved boundaries [6], we only rely on the boundary of the discrete mesh. Many methods either subdivide ill-shaped boundary elements [15] or reproject boundary vertices back on the original surface [1]. Subdivision of boundary elements increases the element count, which is a drawback, as each element costs computationally. The drawbacks of boundary reprojection are that it requires to find the closest point on the boundary and the reprojection step does not respect energy minimization leading to reduced convergence. Yin et al. [38] replace reprojection of boundary vertices with shape functions approximating the surface based on the discrete mesh. They incorporate the shape functions as a penalty term into the to-be-optimized function to enforce boundary conformance. Contrary to our method, the penalization approach requires the choice of a suitable penalty number. Wicke et al. [35] address optimization of the mesh boundary for dynamic domain remeshing. They penalize relocation of boundary vertices by augmenting the optimization function with a quadric error term. Although this allows for efficient relocation of boundary vertices, element quality to surface distance is an apples-to-oranges comparison. Xu et al. [37] propose harmonic guided optimization to further improve the quality of boundary elements despite the usage of a quadric error term. They precompute a harmonic scalar field on a voxelized grid. As the field is maximal at the boundary and minimal for the medial axis of the mesh, it enables the computation of weights tweaking the importance of boundary preservation and element quality. Our method keeps boundary vertices on the surface without using a penalization term and thereby without the need of precomputing additional weights.

Optimization algorithms
In this section, we describe a harmonic mesh optimization algorithm suitable for parallelization on massively parallel GPUs. In order to compute subsimplex-to-simplex relationships or simplex-to-subsimplex relationships, we employ the mesh data structure developed by Mueller-Roemer et al. [20,21], because it provides memory-efficient organization of these relationships in a compactly encoded ternary sparse row format. The following algorithms assume an inversionfree mesh, i.e., every oriented volume v τ must be positive.

Vertex relocation
We focus on gradient descent of interior vertices first and detail the treatment of boundary vertices in Sect. 4.2. We achieve conflict-free parallelization by coloring vertices into independent sets S C . Therefore, Algorithm 1, which outlines our vertex relocation scheme, can process vertices in Gauss-Seidel iteration order.
A drawback of Alexa's gradient descent scheme [1] is that a single line search is performed for all vertices of the mesh. Thus, vertices potentially affect each other leading to a small set of vertices preventing substantial optimization of the majority of vertices. Additionally, the gradient directions of vertices might lead to conflicting updates, reducing the convergence rate. Instead of performing a single line search for the entire mesh, we perform local gradient descent.
Since each pass over an independent set of vertices affects mesh quality, it is beneficial for the optimization to recalculate the gradients for each batch of independent sets. As the gradient of each vertex depends on multiple tetrahedra, parallel gradient assembly using Eq. (7) requires synchronization primitives, such as atomic operations, in order to handle write conflicts. Thus, we propose calculating the harmonic gradient for each vertex using the following equation instead of Eq. (7), which facilitates parallel processing in independent sets: The proof of Eq. (8) can be found in the appendix. It is an interesting observation that the harmonic gradient is a linear combination of the face normals of τ . We leave the geometric interpretation of Eq. (8) for future work.
With one pass over t(x), the gradient of the incident tetrahedra can be calculated by application of the sum rule. For convenience, we introduce a notation for the gradient of tetrahedra incident to x: Unlike Alexa [1], we locally compute an inversion-free interval [0, λ x ] for vertex x given its local gradient. As an inversion occurs when vertex x passes the plane spanned by the opposing triangle τ \ x, the exact step size λ x can be determined by performing a simple plane-ray intersection test. An illustration of this principle appears in Fig. 2. Starting from λ x , an inversion free step size can be found with binary search using the root finding method of Smith et al. [31]. To avoid unnecessary search iterations, we reduce λ x by a factor μ ∈ [0, 1) beforehand. We choose μ = .95 in our work. We set λ x to the resulting step size. As a result, the local gradient descent update formula is as follows: A bracketing scheme is used to determine α. Like Alexa, we use Brent's [23] method in our work; however, we use it locally. for all c ∈ S C do Go through independent sets 3: for all i ∈ c do In parallel 4: end if 14: if g < ε g then continue end if 15: g ← − g g 2 16: if subject to ∂T ∼ = B then 24: search← is x still on boundary primitive? 25: end if 26: if search then λ ← λ/2 endif 27: while search 28: if subject to ∂T ∼ = B then 32: identifyBoundaryState (

Directional derivatives for boundary treatment
Unlike relocation of interior vertices, gradient descent of boundary vertices can deform the surface resulting in a significant loss of geometric detail. We intend to avoid reprojection of vertices onto the boundary in our work, while still keeping boundary vertices on the boundary. For this purpose, we investigate directional derivatives for mesh optimization. We first address full preservation of the mesh surface and detail an algorithm for relocating every boundary vertex in Sect. 4.3. If the primary concern is to fully preserve the input surface, we only allow gradients to be co-planar to the boundary surface. We classify boundary vertices depending on their adjacent surface triangles to obtain rules for full surface preservation, which we summarize in Algorithm 2. For full boundary preservation, the following rules apply: -x ∈ V F : All incident boundary triangles are co-planar.
Thus, there is a unique tangent plane.
x ∈ V E : Two sets of incident boundary triangles are coplanar. Thus, there is a unique tangent line. -x ∈ V C : There is no unique tangent plane or line. A corner vertex cannot be moved without altering the surface.
For full boundary preservation, we apply homogeneous Neumann boundary conditions [2], while alternative boundary conditions are an ongoing research topic [32]: Thus, a boundary vertex is only relocated along a tangent plane or line. Let p be a tangent plane on the surface with linearly independent unit vectors u 1 and u 2 : We now show, how we apply directional derivatives to the surface of a tetrahedral mesh. Let the function g x replace a boundary vertex x with a given vertex x and calculate the trace for all incident tetrahedra: With the use of g x and p we can express the field of tr(L t(x) ) on the tangent plane as: As our goal is to obtain a gradient for x on the tangent plane p, we evaluate the gradient at t = 0 and s = 0. The gradient follows by the chain rule: The gradient of the tangent plane p evaluates to (u 1 , u 2 ) .
Because g x (x) replaces x with itself, we can further simplify the gradient to: The gradient on the plane can be transformed to R 3 , resulting in the directional derivative: In case of a tangent line for x ∈ V E , one can just drop u 2 and perform the analog calculations. The use of directional derivatives provides several benefits for mesh optimization: Algorithm 2 Derivative on a tangent sub-space.
return (u 1 g)u 1 + (u 2 g)u 2 6: else return (u 1 g)u 1 10: end if 11: end procedure 1. Reprojection of vertices after gradient descent is not necessary. Thus, it becomes obsolete to find the closest surface triangle, which can be computationally expensive. 2. Line search on a tangent subspace converges against a local minimum. No special convergence criteria are necessary for the boundary. 3. Projection of relocated vertices to the closest surface triangle can produce inversions or projection of vertices onto opposing faces. We avoid these issues by inversion free intervals for line searches along the boundary.

Moving every boundary vertex
We present an algorithm that relies on directional derivatives to optimize boundary vertices while keeping them on the mesh surface. As a result, the approximation error due to boundary vertex relocation is controlled by the input mesh surface, which is assumed to be of high resolution such that the approximation error for curved surfaces is low enough. The main idea of our algorithm is to relax the homogeneous Neumann boundary condition such that the gradient has to lie only on a single tangent plane (or line) p. At the same time, we relocate boundary vertices only along the boundary: This relaxation enables relocation of a vertex along a single boundary primitive granting deviations from the input surface for better mesh quality. In order to ensure that the updated vertex still lies on the boundary, we need to bound λ x such that x does not leave the boundary. Algorithm 3 exhibits our method of finding a descent direction for a vertex on the boundary. During the optimization, our algorithm maintains the location of the vertex on the boundary, which leads to the following three states: 1. On vertex: The vertex overlaps with a boundary vertex. This is the initial state for each boundary vertex. 2. On edge: The vertex lies on a boundary edge but not on either of its vertices. 3. On triangle: The vertex lies within a boundary triangle but not on any of its edges.
Depending on the state, we calculate the directional derivative for each boundary primitive adjacent to the vertex. Our algorithm checks for each directional derivative, if its descent direction does not relocate the vertex away from B, i.e., conforms to B. Following the rule of steepest descent, we choose the directional derivative with the largest magnitude. Figure 3 shows how our algorithm relocates a vertex on the boundary maintaining the state of the vertex. In order to keep the state consistent after gradient descent, we limit the step size such that the vertex remains on its boundary primitive.
After relocation, we check if a vertex of state on triangle is now on edge or on vertex. Likewise, we check if a vertex of state on edge is now on vertex. This check is outlined by Algorithm 4 and uses the barycentric coordinates of the vertex regarding its current boundary triangle. If one or two barycentric coordinates are close to zero, the vertex is set to the corresponding edge or vertex, respectively.

Flips
Performing flips in parallel requires conflict detection, because otherwise flipping does not guarantee a valid mesh, as can be seen in Fig. 4. In harmonic triangulations, prioritizing flips by their reduction of the trace leads to good Fig. 3 We use directional derivatives, in order to relocate the vertex (red) along the boundary. Initially, the directional derivative on the boundary with the largest magnitude is along an edge. After relocat-ing the vertex, the directional derivative of the triangle to the right of the edge is chosen. Finally, vertex relocation converges after relocating the vertex on the triangle Algorithm 3 Algorithm to find directional derivative for boundary vertex 1: procedure findTangentDerivative(x, λ, g) 2: if x is on boundary vertex x b then 4: for boundary triangle τ b containing x b do 5: λ ← β 0 For β 0 point along ray is on τ b \ x b 10: Set x on τ b 11: end if 12: end for 13: for boundary edge e b containing x b do 14: else if x is on boundary edge e b then 22: for boundary triangle τ b adjacent to e b do 23: Analogous to lines 5 and 6 24: if x is on a boundary triangle τ b then 3:  [1]. We exploit this property and resolve the issue of conflict detection by finding the locally most beneficial harmonic flip, as shown in Fig. 5. As a result, we are Suppose one thread is assigned to each of the two orange triangles adjacent to the white triangle. Concurrently, both threads perform a flip, which leads to an invalid triangulation Fig. 5 In case of conflicting flips, we choose the locally most beneficial flip. As a result, we obtain a valid triangulation able to perform harmonic flips massively parallel without significant differences to sequential computations. We present Algorithm 5 that identifies feasible and locally most beneficial flips. We encode flips by the index of the flipped mesh facet and an identifier for the type of the flip: Our algorithm performs a parallel pass over all τ ∈ T to identify the most beneficial flip for each τ . Each τ can be flipped at either one of its six edges or one of its four faces. Hence, we evaluate feasibility checks and quality improvements regarding η of the potential flips in a predetermined order. Face or edge flips are only feasible on interior faces or edges, respectively. Each flip requires its incident tetrahedra to form a convex subtriangulation. Whenever a flip is feasible, we compare its quality improvement to the currently most beneficial flip. As a result, we obtain the most beneficial harmonic flip for τ . If no harmonic flip has been found, our flipping algorithm terminates. Otherwise, we proceed with another parallel pass over all τ ∈ T.
In order to prepare for building a new Triangulation T , we allocate an array of markers indicating whether τ ∈ T is part of T or not and an array of integers for the number of newly added tetrahedra. For each τ ∈ T, we find locally most beneficial flips in parallel. Using flip type and facet index, we obtain the tetrahedra incident to the flip using the precomputed connectivity relationships. No conflict occurs, if the flip is the most beneficial flip for each incident tetrahedron in the convex region of the flip. In this case, the flip is locally the most beneficial and is selected to be performed. Consequently, the tetrahedron associated with the thread can be marked for removal. We elect a coordinator thread to perform the flip. If the index of τ is the lowest of the incident tetrahedra, the associated thread is declared as the coordinator. The coordinator thread sets its integer value to the number of tetrahedra added by performing the flip. Since only coordinator threads write to the array of integers, thread i is a coordinator thread if this array holds a non-zero entry at position i.
An exclusive prefix sum over the integers for new tetrahedra provides offset positions and the total number of tetrahedra to be added. The marker values of the tetrahedra in sum amount to the number of remaining tetrahedra. We allocate a new buffer for the resulting tetrahedra and copy the remaining tetrahedra through a stream compaction to a newly allocated buffer. In a final parallel pass over the tetrahedra, the coordinator threads perform the flips and append the resulting tetrahedra to the remaining tetrahedra using the offset positions.

Combined vertex relocation and flipping
We perform several alternating passes of vertex relocation and harmonic flipping. Our algorithm terminates if its effect on the mesh becomes insignificant. Gradient descent converges if the gradient approaches zero. Therefore, we terminate if ∇tr(L T ) is sufficiently small. Thus, when ∇tr(L T ) is smaller than some c , gradient descent is not expected to cause significant improvements. In addition, update rates can become vanishingly small. To avoid this situation, we terminate if the difference of the current to the prior gradient is smaller than c . As tr(L T ) is scale dependent, we advise to choose a relative c . We opt for choosing c based on a constant governing the accuracy in finding a minimum: c ← max( , (∇tr(L T )) ) (17) As some vertices converge more quickly than others, we do not further optimize a vertex with a gradient norm smaller than ε g . We choose = 10 −5 = ε g and · 2 2 as the norm in our work.
Connectivity relationships and coloring need to be updated after flipping. Checking for flips is an unnecessary overhead if flips are unlikely to be found. Thus, we apply a heuristic reducing the number of checks. A counter k f holds the number of iterations without flip checking and is initialized as k f = 1. Whenever flip checking fails to find flips, we double k f . Analogously, if flip checking finds flips, k f is halved rounding up. If the counter has reached a predetermined number 2 N , we terminate, as additional flips are unlikely to be found. We opt for choosing N = 3. In summary, we terminate at iteration i, if one of the following conditions is met: ( j, type) ← flips i ; is_coordinator ← true; agree ← true 23: if j = -1 then continue end if 24: for all tetrahedron τ involved in flip( j, type) do 25: k ← index_of(τ ); ( j adj , type adj ) ← flips k 26: agree ← agree ∧ type = type adj ∧ j = j adj 27: is_coordinator ← is_coordinator ∧ i ≤ k 28: end for 29: if agree then tets_marked i ← 0 end if 30: if agree ∧ is_coordinator then 31:

Results
We present experiments to demonstrate the benefits of our algorithms from Sect. 4. To ensure a fair comparison, we implemented the algorithms from scratch using C++ and CUDA [22]. We compiled the code using Visual Studio 2019 and CUDA 11.2 on Windows 10. We ran the experiments on a machine equipped with an NVIDIA RTX 3090 GPU and an Intel i7 3930K CPU. In order to avoid outliers in run time measurements, we have determined the median run time from 10 executions.

Parallel harmonic flips
We compare our GPU parallel harmonic flipping algorithm performing locally most beneficial flips to the sequential CPU algorithm performing flips in an ordered queue. We perform flips on the input mesh, until no further harmonic flips can be found. The results appear in Table 1. While Alexa [1] performed harmonic flips on Delaunay triangulations of point sets, we perform flips on meshes generated with Tetgen [30] a constrained Delaunay mesher, leading to a lesser reduction of the number of tetrahedra. We detail the exact numbers of tetrahedra in the resulting triangulations, in order to show that postponing locally not most beneficial flips to later flipping passes does not lead to significant differences in the resulting triangulation. Our experiments reveal substantial speedups of 106×-318×. As harmonic bistellar flips either coincide with the Delaunay triangulation or reduce a triangulation of three tetrahedra to two tetrahedra, our parallel flipping algorithm is a useful tool for mesh optimization and generation, quickly reducing the tetrahedron count while somewhat preserving the Delaunay criterion. Our experiments confirm that harmonic flipping well preserves the percentage of locally Delaunay tetrahedra.

Robustness
In order to validate the practicability of our parallel algorithms, we have applied our optimization algorithm in Sect. 4.5 to the 10k tetrahedral meshes generated by Hu et al. [13]. Our algorithm did not produce any inversion due to the choice of the inversion free interval for each vertex x. After termination, each triangular face was connected to one or two tetrahedra. In addition, we consistently observed alternating face orientations for triangular faces adjacent to two tetrahedra. The Manhattan distance of distinct vertices was larger than 10 −10 for all except for two meshes meaning that our optimization method does not produce geometrically duplicated vertices. For the two meshes with geometrically close vertices, we observed smaller vertex distances already before optimization.
We calculate the one-sided Hausdorff distance of the boundary vertices of the optimized mesh to the input mesh surface, in order to validate that our vertex relocation algorithm on the boundary (c.f. Sect. 4.3) keeps vertices on the boundary. We divide the resulting distances by the average boundary edge length to put them in relation to the dimensions of the model. For 99.95% of the meshes, the one-sided Hausdorff distance was below 10 −3 , which shows that boundary vertices remain on the input surface considering round off errors. In four out of the five remaining cases, roundoff errors on directional derivative calculation accumulate to a degree that the resulting deviation is roughly 10 −2 . In only one case, a significant deviation of .19 can be observed. As the meshes generated by Hu et al. [13] generally are of high quality and already optimized, the experiments regarding runtime, convergence and mesh quality use unoptimized meshes.

Element quality and convergence
We investigate resulting element quality and convergence of both Alexa's method [1] and our method. Our work covers two methods of using directional derivatives at the boundary. Using directional derivatives only for vertices in V F and V E provides surface preservation (∂T = B). In addition, directional derivatives can be used to move vertices along the input surface (∂T ∼ = B) to optimize all vertices at the cost of altering the model shape. We compare Alexa's reprojectionbased method [1] to both variants of boundary treatment.
Our boundary preserving method is most useful for input meshes with few corner vertices. Thus, we investigate the  Fig. 6 and provide the results in Table 2. Although all of the input meshes include critical minimal dihedral angles, both methods achieve to improve the minimal angle, while our method achieves significantly larger minimal angles with the exception of similar minimal angles for the Part. Likewise, the lower 5% of dihedral angles is significantly larger with the exception of the Part. If we relocate every boundary vertex of the Part model, we achieve a minimal dihedral angle of 8.12 • and a lower 5-percentile φ 5% of 38.63 • , which is a better result than the reprojection-based method. Relocating every boundary vertex does not result in significant differences for the other three meshes. Our method consistently results in lower energy states for η regarding the maximum, 95-percentile and the sum over T. For meshes with many corner vertices forming curved surfaces, optimizing all boundary vertices is important, as boundary preserving optimization typically results in lower minimal angles oftentimes not even half as large. We evaluate our method on the bottom four meshes shown in Fig. 6 and provide the results in Table 3. While our method improves the minimal angles of all inputs, reprojection-based optimization impairs the initial minimal angles on the Block and Pot meshes. As the reprojection step does not respect energy minimization, a degradation of mesh quality may occur. Using directional derivatives along the boundary respects energy minimization leading to lower energy states for η with the exception of the 95-percentile of the Block.
We have investigated the impact on the mesh surfaces and the convergence of the optimization methods on different meshes. We present typical results in Fig. 7. While reprojection of boundary vertices distorts sharp detail, directional derivatives along boundary faces and edges can be used to preserve the mesh surface. Moreover, the reprojection step mitigates convergence, because it does not respect energy minimization. On the contrary, gradient descent of directional derivatives converges to a local minimum on the boundary, as can be seen in the monotonously decreasing curve of the gradient norm for the Barrel. We observe convergence for relocating every vertex along the boundary as well. The reprojection based optimization oftentimes terminates in a premature state. Since Alexa's [1] algorithm potentially chooses small step sizes, reprojection to the closest point on the mesh surface oftentimes does not significantly change vertex positions from the initial state leaving a lot of optimization potential. Directional derivatives along the boundary respect energy minimization even when migrating to different boundary primitives. However, the gradient norm does not reduce as monotonous as for choosing a constant boundary primitive, as gradient norms change, when a vertex is associated with another boundary primitive. Convergence is achieved though, while the input shape is approximately preserved. This is notable, as our use of directional derivatives enables robust improvement in high-resolution meshes and keeps boundary vertices on the boundary while converging.

Run time
We compare run times of Alexa's [1] and our massively parallel algorithm for full and approximate boundary preservation. Figure 8 shows the run time comparisons for full and approximate boundary preservation. For full boundary preservation, we achieve notable speedups of 9.17×-119×. Although the boundary reprojection prevents convergence on these meshes, the competing algorithm still performs a considerable number of iterations until no harmonic flips can be found. This is not the case for the more complex meshes we used for comparison with our vertex relocation along the boundary (c.f. Sect. 4.3). The reduced convergence of reprojecting vertices on the boundary leads to lower iteration numbers of the competing optimization algorithm. Additionally, our method for optimizing vertices along the boundary imposes more branching than the full boundary preservation reducing the impact of massively parallel processing Table 2 We provide dihedral angles φ and energy states using Alexa's [1] and our method (∂T   7 We compare our method (red) to Alexa's [1] reprojection based method (blue). We visualize resulting boundaries and plot the gradient norm throughout optimization and leading to up to 40% slower run times. Thus, we obtain lower but still notable speedups of 3.51×-78×.

Conclusions
In summary, we have devised an efficient and robustly converging mesh optimization method processing millions of tetrahedra in a few seconds. We have introduced massively parallel algorithms and parallelization strategies for optimizing meshes while preventing inverted elements. Our parallel flipping algorithm achieves speedups of 106×-318× without producing significantly different results from sequential flipping. We have evaluated the use of directional derivatives for boundary treatment in mesh optimization. Our method supports both, full preservation of the surface and optimization of boundary vertices along the surface. The results for using directional derivatives are compelling, as we achieve significantly better mesh quality compared to reprojection, while keeping vertices on the boundary without adding any error terms to the optimization function. Our method tends to smooth sharp surface details, which is a limitation. A natural extension to our method is to calculate the directional derivative for curved surfaces such as CAD bodies to improve precision. As a result of improved convergence and neat parallelization, we accelerate harmonic mesh optimization by up to 119× in the boundary preserving case and by up to 78× for relocating all boundary vertices.
As the convergence is governed by gradient descent, the use of a nonlinear conjugate gradient method [11] or a momentum method [25] might improve convergence. A fast adaptive mesh optimization could be obtained through the use of parallel refinement, which potentially leads to better mesh quality and improved surface approximation. An interesting idea is to incorporate parallel harmonic flipping into GPU-accelerated Delaunay meshers such as [4] to diminish element count and improve mesh quality.