1 Introduction

Physics simulation is an active area of research in computer graphics due to the prominence of digital characters and artificially generated natural phenomena in the VFX industry [13, 20, 21, 38, 41, 44]. Numerical simulation of any kind of physical effect requires the governing laws to be discretized on a mesh so that the differential operators can be computed. Subsequently, for temporal evolution, the backward Euler scheme (or its variants) is the generally accepted standard for stability reasons [10, 47]. This makes the linear solve the bottleneck in physics simulation, and thus, substantial effort has been invested by numerous research groups in designing fast methods, either through the development of faster solvers [1, 2, 11, 34, 57, 58] or through the development of novel discretization schemes [35, 37, 49, 50].

In contrast to prior works, in this paper we propose a method that can potentially accelerate any other method for physics simulation. We achieve this by paying close attention to the interaction of the simulation software with the underlying hardware and identifying a key bottleneck: the numerical solution of linear systems is a memory-bound (as opposed to compute-bound) operation. This is because memory bandwidth on modern workstations is roughly \(4-5\times \) lower than the available compute bandwidth. Every linear equation requires a handful of variables to be read from memory, some basic arithmetic operations (e.g., multiplication by a scalar coefficient, addition/subtraction) to be performed on them, and the result to be written back to memory. The arithmetic operations themselves are comparatively fast (most are natively supported in hardware), and they are far outnumbered by the read/write operations involved in each linear equation.
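As a rough illustration (our sketch, not from the paper), consider the work done for one row of a sparse matrix–vector product in CSR format, the basic "linear equation" processed by an iterative solver. Each nonzero entry costs roughly three memory reads but only two floating-point operations, so the loop is limited by memory bandwidth rather than compute:

```python
# One row of a CSR sparse matrix-vector product. For k nonzeros in the row,
# the loop performs ~2k floating-point operations but ~3k memory reads
# (column index, matrix coefficient, and solution entry per nonzero).
def spmv_row(indptr, indices, data, x, i):
    acc = 0.0
    for p in range(indptr[i], indptr[i + 1]):
        j = indices[p]        # read: column index
        a = data[p]           # read: matrix coefficient
        acc += a * x[j]       # read: x[j]; compute: one multiply, one add
    return acc                # write: one result per row
```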

To address this issue, some authors have increased the density of computation performed per memory read by using rather intricate matrix factorizations [34]. However, as observed by the authors themselves [34], such an approach only pays off when the problem size is large enough to not fit in the aggregate memory of all available GPUs. Another approach stems from the observation that variables are read from memory into the system cache. If the variables in a linear equation are all present in the cache (i.e., a cache hit), then the operations can be executed much faster than when some of the required variables are absent from the cache (i.e., a cache miss) and need to be read from memory. Thus, researchers have designed grid-based data structures [36, 42] that place the variables of a linear equation closer together in memory. While this is possible for grid-based data structures, which exhibit regular connectivity patterns, it is generally assumed that such optimizations cannot be performed for triangle/tetrahedral mesh-based data structures, where the connectivity pattern is irregular.

For mesh-based data structures, some authors have investigated reordering optimizations at the level of the compiler for better efficiency [55]. Other researchers have investigated delayed updates to matrix factorizations [23, 24] as a means to faster computation. In contrast to these prior works, which require more involved computations, our proposed approach is much simpler to implement and is based on spectral reordering [16, 32, 53], a technique that is well known in the geometry processing community but, to the best of our knowledge, has not been exploited for accelerating computations in physics simulation. Specifically, this approach requires computing the eigenvector of the graph Laplacian corresponding to its second-smallest eigenvalue, also known as the Fiedler vector, which can be computed very efficiently thanks to “off-the-shelf” Algebraic Multigrid-preconditioned Conjugate Gradient (AMGPCG) solvers. Most notably, this approach is cognizant of the connectivity pattern used for generating the linear system and reorders variables to respect this connectivity pattern, as opposed to the predominant practice of using fixed connectivity patterns for mesh layouts regardless of the linear system.

We observe that our spectral reordering approach can speed up simulations of elastic solids by at least \(10\%\), and up to nearly \(30\%\), as described in our experimental results in Sect. 6. This can lead to considerable savings given that even modest simulations require at least 240 time steps. Moreover, our proposed method is general, as we show its applicability to two different approaches for simulating elastic solids [6, 37].

The rest of this paper is organized as follows: Sect. 2 discusses related prior work, Sect. 3 describes our core technical approach, Sect. 4 provides details behind our numerical implementation, Sect. 5 describes the elasticity simulators that we used, Sect. 6 describes our experimental results, and finally, Sect. 7 describes our conclusion and avenues for future work.

2 Related work

Cache-efficient algorithms fall into two types: cache-aware, which depend on knowing specific parameters of the cache, such as block size, and cache-oblivious, which improve performance by reducing cache misses without using any specific information about the cache. Some of the first cache-oblivious algorithms were introduced by Frigo et al. [19]. We refer the reader to [14] for an overview of cache-oblivious algorithms.

While cache-efficient algorithms are designed on a case-by-case basis, cache-efficient data structures are general-purpose, allowing for improvements across multiple algorithms. The simplest cache-efficient data structure is a reordering of an existing data structure. For example, Yoon et al. [53] reorder the vertices of meshes for cache efficiency using a metric-minimization method. Their metric measures the distance between the indices of vertices that are connected by edges. They use repeated local reorderings to minimize this metric and focus on very large meshes, from millions to hundreds of millions of vertices. They apply their method to speed up three separate applications: large-scale rendering, collision detection, and isocontour extraction.

The mesh itself is not the only data structure that can benefit from reordering. Yoon and Manocha [54] use probabilistic models to cluster and reorder the bounding volume hierarchy. Their method minimizes cache misses in collision detection for rigid body simulations.

Isenburg and Lindstrom [26] introduce a streaming format for storing very large meshes. Similar to our method, they utilize the Fiedler vector to reorder the mesh nodes, minimizing the separation between vertices that belong to the same face. Hoppe [25] used indexed triangle strips to make rendering more efficient; to increase efficiency further, he reordered the triangle faces into cache-efficient layouts using a greedy triangle stripification algorithm and repeated local optimizations. Gopi and Eppstein [22] developed triangle stripping further, using a perfect matching algorithm on a graph that represents the mesh’s faces as nodes. Setaluri et al. [43] speed up fluid dynamics simulations by ordering the cells of 2D and 3D grids in memory using Morton ordering, a type of space-filling curve.

The Fiedler vector is part of a wider array of methods for processing functions on graphs, commonly known as spectral methods. Pioneered by Taubin [51], these methods bring insights from signal processing to 2D and 3D geometric data, including meshes. Spectral methods share several commonalities. First, they treat geometric data as graphs. Second, they focus on the graph Laplacian rather than the adjacency matrix. Third, they make use of the eigenvectors and eigenvalues of the graph Laplacian, in direct analogy to the decomposition of functions in Fourier analysis. See [40, 56] for surveys of spectral methods for geometry processing. There are numerous applications of spectral methods. Levy [32] interprets the eigenvectors of the graph Laplacian (understood as a discrete Laplace-Beltrami operator) as a basis for a function space defined on meshes. He uses this basis for approximating functions on meshes, demonstrating applications to pose transfer, segmentation, and parameterization. Barnard et al. [7] and Alpert et al. [3] use eigenvectors of the graph Laplacian for two different applications: reordering the graph’s sparse adjacency matrix and partitioning the graph. Karni et al. [27] use spectral methods for compressing meshes.

Since its introduction [17, 18], the Fiedler vector has been used extensively in spectral methods, including graph drawing [28, 29], optimizing streaming formats for meshes [26], shape characterization [4, 30, 31], and graph segmentation and partitioning, where the graph can represent meshes [33, 48, 52] or images [45].

3 Technical approach

Our choice of using the Fiedler vector for mesh reordering is inspired by its use for efficient graph drawing [28, 29]. A mesh can be represented as a graph, where vertices are mesh nodes and an edge connects any two vertices if they are part of the same mesh element. For both graph drawing and cache-efficient mesh processing, the goal is to minimize the distance between vertices that are connected by edges. As described in [29], one way to ensure this property is by giving the vertices coordinates defined by the Fiedler vector (see Fig. 1). Here, we provide a brief overview of the argument in [29]. Let \({\varvec{x}}\) denote the vector of coordinates for all vertices, where each vertex coordinate is a real number. Hall’s energy for a graph of n nodes is defined as follows:

$$\begin{aligned} E = \frac{1}{2} \sum _{i,j}^n w_{i,j} (x_i - x_j)^2 \end{aligned}$$
(1)

where \(x_i\) and \(x_j\) are the coordinates of nodes i and j (with \(i \ne j\)), and \(w_{i,j}\) is the weight of the edge connecting them. Intuitively, minimizing Hall’s energy amounts to finding nodal coordinates that minimize the weighted squared distances between connected nodes.

Fig. 1 The Fiedler vector produces a natural ordering of the nodes of a tetrahedral mesh by closely following its shape

Hall’s energy can be restated using the Laplacian \({\varvec{L}}\) of the graph. \({\varvec{L}}\) is a sparse symmetric \(n\times n\) matrix. Given vertices i and j, if there is an edge between them, then the entry of \({\varvec{L}}\) in the \(i^{th}\) row and \(j^{th}\) column is \(-w_{i,j}\). Each diagonal entry (i, i) is set to the sum of the weights of the edges incident to vertex i. Formally,

$$\begin{aligned} L_{i,j} = \begin{cases} \sum _{k} w_{i,k} & \text {if } i = j\\ -w_{i,j} & \text {if there is an edge between } i \text { and } j\\ 0 & \text {otherwise} \end{cases} \end{aligned}$$
(2)

Using the Laplacian matrix in equation (2), Hall’s energy can be written as \(E = \varvec{x^T L x}\). The trivial solution, \({\varvec{x}} = 0\), can be eliminated by requiring the coordinate vector’s norm to be nonzero, i.e., \(\varvec{x^Tx} = c\) for some \(c > 0\). As the coordinates can be arbitrarily scaled, the constant c can be set to 1: \(\varvec{x^Tx} = 1\). This constraint can be accommodated in the minimization problem using a Lagrange multiplier: \(\min _{\varvec{x}} (\varvec{x^TLx} - \lambda (\varvec{x^Tx} - 1))\). The analytical solution can be found by computing the derivative with respect to \({\varvec{x}}\) and setting it to 0.
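Carrying out this step explicitly (a standard matrix-calculus identity that uses the symmetry of \({\varvec{L}}\)) gives

$$\begin{aligned} \frac{\partial }{\partial {\varvec{x}}} \left( \varvec{x^TLx} - \lambda (\varvec{x^Tx} - 1) \right) = 2\varvec{Lx} - 2\lambda {\varvec{x}} = 0, \end{aligned}$$

which, after dividing by 2, is the eigenvalue problem: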

$$\begin{aligned} \varvec{Lx} = \lambda {\varvec{x}} \end{aligned}$$
(3)

Since \(E = \varvec{x^T Lx}\) and \(\varvec{Lx} = \lambda {\varvec{x}}\), it follows that \(E = \varvec{x^T Lx} = \lambda \varvec{x^T x}\). Since we set \(\varvec{x^Tx} = 1\), we have \(\lambda \varvec{x^T x} = \lambda \), so \(E = \lambda \). Thus, the eigenvalue is the Hall energy.

The minimizer of this eigenvalue problem is trivial: the constant vector with \(x_i = 1/\sqrt{n}\) for \(1\le i\le n\) (so that \(\varvec{x^Tx} = 1\)), whose eigenvalue, and thus the Hall energy, is 0. This solution gives the same coordinate to every node, which is not useful. The eigenvector with the second-smallest eigenvalue is known as the Fiedler vector. Among the non-trivial solutions, it has the smallest eigenvalue and, thus, the smallest energy. Per the intuitive interpretation of the Hall energy mentioned above, the Fiedler vector gives the node coordinates for which the distance between connected nodes is minimized.
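As a tiny worked example (ours, not from the paper), consider the path graph with four nodes connected in a chain 0-1-2-3 with unit weights. The Fiedler vector varies monotonically along the chain, so sorting the nodes by their Fiedler values recovers the chain order:

```python
# Fiedler vector of the 4-node path graph 0-1-2-3 with unit edge weights.
import numpy as np

L = np.array([[ 1, -1,  0,  0],
              [-1,  2, -1,  0],
              [ 0, -1,  2, -1],
              [ 0,  0, -1,  1]], dtype=float)

vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
fiedler = vecs[:, 1]             # eigenvector of the second-smallest eigenvalue
print(np.argsort(fiedler))       # [0 1 2 3] or [3 2 1 0], up to eigenvector sign
```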

Fig. 2 A diagram illustrating how reordering helps reduce cache misses. The example mesh is triangular. A single triangular mesh element is shown, along with its location and the locations of its vertices in main memory. The bottom-left vertex of the triangle is loaded into the cache, alongside the vertices that neighbor it in main memory. Top: in the original ordering of the vertices, geometrically neighboring vertices are not loaded into the cache, so accessing them results in expensive cache misses. Bottom: the reordered vertices are sorted such that geometrically neighboring vertices are also neighbors in memory. When the program accesses a vertex’s neighbors, it finds them in the cache and avoids loading them again from main memory

The Fiedler vector was used for graph drawing in [29], where it defines a coordinate along a single axis for each vertex. For drawing graphs in 2D and 3D, the eigenvectors corresponding to the third- and fourth-smallest eigenvalues can be used to obtain the additional vertex coordinates. For our elasticity simulations, the Fiedler vector alone is sufficient, but the values it defines need to be converted to a ranking, i.e., an ordering of the mesh nodes. Rankings are computed by sorting the Fiedler values of the vertices: the node with the smallest value is ranked first, the node with the second-smallest value is ranked second, and so on. The rankings define the order of the vertices in memory. As connected vertices have similar values in the Fiedler vector, they are close together in the ranking and thus close together in main memory. So, when the simulator accesses them, they are likely to be fetched into the cache together, making cache misses less likely. See Fig. 2 for an illustration of how reordering a mesh’s vertices can reduce the number of cache misses.
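As a minimal sketch of this ranking step (assuming `fiedler` holds one value per vertex and `verts`/`tets` store the vertex data and tetrahedra of the mesh; the names are illustrative, not from the paper's code), two argsorts suffice: one to sort the vertices, and one to invert the permutation so that elements can be relabeled:

```python
import numpy as np

order = np.argsort(fiedler)           # order[k] = old index of the k-th ranked vertex
rank = np.empty_like(order)
rank[order] = np.arange(len(order))   # rank[v] = new index of old vertex v

verts_reordered = verts[order]        # vertex data laid out in the new order
tets_relabeled = rank[tets]           # elements now refer to the new vertex indices
```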

4 Numerical implementation

Our code to reorder the meshes was written in Python using the NumPy and SciPy libraries. Given an input mesh, we first compute the Laplacian of the mesh’s graph as a sparse matrix. As there is no special reason to weight node pairs differently, we set \(w_{i,j} = 1\) for every pair of nodes i and j (with \(i \ne j\)) that are part of the same mesh element, and \(w_{i,j} = 0\) otherwise. This form of the Laplacian is typically known as the combinatorial Laplacian. We extract the Fiedler vector from the combinatorial Laplacian using Locally Optimal Block Preconditioned Conjugate Gradients (LOBPCG), a well-known method for finding the largest or smallest eigenvalues of a matrix and their corresponding eigenvectors [15]. We use this solver to find the two smallest eigenvalues and their eigenvectors.

LOBPCG requires an initial guess for the eigenvectors as input. For the first eigenvector, we used its known form: a constant vector, with every entry set to 1/n. For the second eigenvector, we used a random initial guess, with each entry sampled from a uniform distribution between \(-0.5\) and 0.5. For fast convergence of the solve, we used the Algebraic Multigrid smoothed aggregation solver from PyAMG [8] as the preconditioner.
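Putting the pieces together, the sketch below (our reconstruction of the described pipeline, not the paper's code; it assumes the mesh is given as an (m, 4) integer array `tets` of node indices, with n nodes in total) computes the Fiedler vector with SciPy's LOBPCG and a PyAMG smoothed aggregation preconditioner:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lobpcg
import pyamg

def mesh_fiedler_vector(tets, n, seed=0):
    # Combinatorial Laplacian: w_ij = 1 whenever nodes i and j share an element.
    rows, cols = [], []
    for a in range(4):
        for b in range(4):
            if a != b:
                rows.append(tets[:, a])
                cols.append(tets[:, b])
    A = sp.coo_matrix((np.ones(12 * len(tets)),
                       (np.concatenate(rows), np.concatenate(cols))),
                      shape=(n, n)).tocsr()
    A.data[:] = 1.0                           # collapse duplicate entries to weight 1
    L = (sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A).tocsr()

    # Initial guesses: the known constant first eigenvector, and uniform
    # random values in [-0.5, 0.5) for the second.
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.full(n, 1.0 / n),
                         rng.uniform(-0.5, 0.5, n)])

    # AMG smoothed aggregation preconditioner for fast convergence.
    M = pyamg.smoothed_aggregation_solver(L).aspreconditioner()
    vals, vecs = lobpcg(L, X, M=M, largest=False, tol=1e-8, maxiter=500)
    return vecs[:, np.argmax(vals)]           # larger of the two smallest: the Fiedler vector
```

Note that the constant initial guess is already an exact eigenvector of \({\varvec{L}}\), so LOBPCG effectively only needs to resolve the second eigenvector.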

4.1 Reordering mesh elements

In addition to reordering mesh vertices, we also experimented with reordering the mesh elements. For this purpose, we defined a dual graph of the mesh, where each mesh element is a vertex, and any two mesh elements that share at least one node are connected by an edge. Using this definition, we built the combinatorial Laplacian matrix as before (with all edge weights set to 1). We used the same method, LOBPCG preconditioned with a smoothed aggregation Multigrid solver from PyAMG, to find the Fiedler vector of the matrix. The new mesh element ranking was likewise computed in the same way as before.
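A sketch of the dual-graph construction (ours, following the description above; `tets` and n as before) builds the element-to-node incidence matrix and squares it to find elements that share nodes:

```python
import numpy as np
import scipy.sparse as sp

def dual_laplacian(tets, n):
    m = len(tets)
    # Incidence matrix B: B[e, v] = 1 if element e contains node v.
    rows = np.repeat(np.arange(m), 4)
    B = sp.coo_matrix((np.ones(4 * m), (rows, tets.ravel())),
                      shape=(m, n)).tocsr()
    A = (B @ B.T).tocsr()   # A[e, f] > 0 iff elements e and f share a node
    A.setdiag(0)            # no self-loops
    A.eliminate_zeros()
    A.data[:] = 1.0         # unit edge weights, as in the vertex case
    return (sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A).tocsr()
```

The resulting Laplacian can then be handed to the same LOBPCG solve shown above.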

4.2 Mesh orderings

We used meshes of three different objects: a dragon, a bunny, and an armadillo. Their characteristics are described in Sect. 5. For each mesh, we used the above methods to find a Fiedler reordering for its vertices, and another Fiedler reordering for its elements (i.e., its tetrahedra). However, we found that it is possible for many vertices in any given mesh to already be well ordered, and comparing only an original mesh with its Fiedler-reordered version would not properly show the power of reordering for reducing cache misses. Thus, for each shape, we created a scrambled reordering as a point of reference, where the node rankings were determined randomly using NumPy’s shuffle function. All of the orderings for the mesh models are shown in Fig. 3. For each shape, we also created an additional reordering where both the vertices and mesh elements were randomly reordered.

Fig. 3 The various reorderings for the vertices of the three mesh models that we used, shown as contour drawings. Top to bottom: dragon, bunny, armadillo. Left to right: Fiedler-reordered models, original orderings, scrambled orderings

4.3 Edge span

To illustrate the effect of reordering, we utilized the edge span metric. As defined in [53], the span of an edge \((v_i,v_j)\) is the absolute difference of its vertices’ indices, \(|i - j|\). As before, an edge is defined between two vertices if they are part of the same mesh element. The greater the edge span, the higher the chance of cache misses, since the simulator is likely to access both vertices in quick succession. We list the edge spans for the original, Fiedler-reordered, and scrambled orderings of each mesh in Sect. 6.1.
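For reference, a direct (unoptimized) computation of the average edge span under this definition might look like the following sketch (ours, not the paper's code):

```python
import numpy as np

def average_edge_span(tets):
    pairs = set()
    for tet in tets:
        for a in range(4):
            for b in range(a + 1, 4):
                i, j = int(tet[a]), int(tet[b])
                pairs.add((min(i, j), max(i, j)))     # count each edge once
    return float(np.mean([j - i for i, j in pairs]))  # spans are |i - j|
```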

5 Elasticity simulations

For testing the effects of different orderings on the simulation speed, we used three physics libraries for simulating elastic solids: Vega [6], an ADMM solver [37] based on Projective Dynamics [9], and an updated version of the ADMM solver with collision handling [39]. From here on, we will refer to the latter two as ADMM-PD and ADMM-PD-extended, respectively.

We devised six scenarios to test the speedup, two for each solver, as listed in Table 2. Each scenario involves a single object undergoing a motion that results in, or is caused by, a deformation. The full animations can be seen in the supplemental video. We ran each scenario for 240 time steps on an Intel Xeon E5-1620 v4 CPU with 4 cores and a cache size of 10 MB. We timed all of the scenarios using Chrono, a C++ timing library. We did not time the setup for the models. At each time step, we only timed the call to the solver for simulating that time step, and we used the sum over all of the time steps as the computation time for that run.

For each scenario, we used five different orderings as listed in Table 2: original, Fiedler-reordered (vertex-only), VF Fiedler-reordered (both vertices and mesh elements), scrambled (vertex-only), and VF scrambled (both vertices and mesh elements). We ran each scenario three times per reordering and recorded the average and standard deviation of the timings, as described in Sect. 6.2.

Fig. 4 The dragon in the falling scenario. Compared to the baseline scrambled ordering (vertex-only), the original ordering provided a speedup of 272.9 s (14.8%), and the Fiedler reordering (our method) provided a speedup of 492.4 s (26.7%). The Fiedler reordering of both the vertices and mesh elements provided a speedup of 514 s (27.8%)

Fig. 5 The dragon in the constrained falling scenario. Compared to the baseline scrambled ordering (vertex-only), the original ordering provided a speedup of 266.5 s (12.1%), and the Fiedler reordering (our method) provided a speedup of 469.8 s (21.3%). Fiedler reordering of both the vertices and mesh elements provided a speedup of 494.7 s (22.4%)

Fig. 6 The dragon in the stretch scenario. Compared to the baseline scrambled ordering (vertex-only), the original ordering provided a speedup of 19.8 s (4.66%), and the Fiedler reordering (our method) provided a speedup of 51.7 s (12.1%). The Fiedler reordering of both the vertices and mesh elements provided a speedup of 108.2 s (25.5%)

5.1 Vega

Vega [6] is a library that simulates elastic solids using an implicit backward Euler scheme. For the internal solver, we used the Jacobi-preconditioned Conjugate Gradients (PCG) solver that is built into the library. We used Vega for the falling scenarios shown in Figs. 4 and 5. For both scenarios, we used the dragon model that ships with Vega; it has 46,736 vertices and 160,553 tetrahedra. Neither scenario had gravity.

In the falling scenario of Fig. 4, the dragon uses corotational linear elasticity as the constitutive model. It starts at rest in mid-air, and we exert a force of \(-50\) N along each of the three coordinate axes on all vertices during the first time step. No forces are exerted after the first time step. In the constrained falling scenario of Fig. 5, the dragon uses the Saint Venant–Kirchhoff (StVK) constitutive model. We randomly selected a node near the top of the dragon model and constrained it to a fixed location, and we exerted a force of \(-9.8\) N along the Y-axis (i.e., downwards) on all of the mesh vertices.

Fig. 7 The armadillo in the unsquash scenario. Compared to the baseline scrambled ordering (vertex-only), the original ordering provided a speedup of 84.4 s (3.63%), and the Fiedler reordering (our method) provided a speedup of 236.8 s (10.2%). The Fiedler reordering of both the vertices and mesh elements provided a speedup of 273.5 s (11.8%)

Fig. 8 The armadillo falling in a bowl scenario. Compared to the baseline scrambled ordering (vertex-only), the original ordering provided a speedup of 177.6 s (4.1%), and the Fiedler reordering (our method) provided a speedup of 480.7 s (11.1%). The Fiedler reordering of both the vertices and mesh elements provided a speedup of 451.5 s (10.4%)

Fig. 9 The bunny in the bounce scenario. Compared to the baseline scrambled ordering (vertex-only), the original ordering provided a speedup of 262.8 s (8.23%), and the Fiedler reordering (our method) provided a speedup of 426.2 s (13.3%). The Fiedler reordering of both the vertices and mesh elements provided a speedup of 449.2 s (14.1%)

Fig. 10 The log–log histograms of edge spans for the three meshes, showing the number of edges in each bin. Each histogram has a bin size of 50. For the scrambled meshes, there are roughly the same number of edges at all edge spans. The original models have more edges with a smaller edge span. The Fiedler reorderings have the largest number of edges with small edge spans, and no edges with a span greater than 6000. The three models are mostly similar, except that, in comparison with the dragon, the original bunny is closer to its Fiedler-reordered case, and the original armadillo is closer to its scrambled case

5.2 ADMM-PD

ADMM-PD [37] is a generalization of the projective dynamics (PD) framework [9] and supports general constitutive models of elasticity. It uses an implicit scheme based on minimizing an error function closely related to the physical system’s energy. It is computationally efficient because it pre-computes a Cholesky factorization of the system matrix and reuses the Cholesky factors during each time step. We used ADMM-PD for the scenarios shown in Figs. 6 and 7. In both scenarios, the mesh is scaled such that its height is 1 m, and gravity is turned off. The scenario in Fig. 6 uses the same dragon model as the Vega scenarios. The dragon is given a simple spring constitutive model, and the simulation uses 20 ADMM iterations per time step. The dragon’s vertices are split at the midpoint of the mesh along the X-axis: vertices with an x-coordinate greater than or equal to the midpoint are given an initial velocity of 0.1 m/s in the +X direction, and the remaining vertices are given a velocity of 0.1 m/s in the -X direction. The resulting motion stretches the dragon. The scenario in Fig. 7 is directly based on a similar scenario in [37]. We tetrahedralized the armadillo mesh from the Stanford 3D Scanning Repository using TetGen [46]. The model has 45,908 vertices and 151,869 tetrahedra. The armadillo is given a linear tetrahedral strain constitutive model, and the simulation uses 100 ADMM iterations per time step. All of the armadillo’s vertices are initially placed at the same location, so that the entire model is “squashed” into a single point; the model then unsquashes itself into its original shape during the course of the simulation.

5.3 ADMM-PD-extended

ADMM-PD-extended [39] is a revised version of ADMM-PD. It uses the same formulation as ADMM-PD for handling object motion and elastic deformation, but it also handles self-collisions within objects and frictionless object-obstacle collisions. We used the ADMM-PD-extended solver for the two scenarios shown in Figs. 8 and 9, where collisions occur. In both scenarios, we rescale the mesh to a height of 0.07 m. In the scenario shown in Fig. 8, the armadillo falls from a height of 0.4 m onto the rim of a nearly spherical bowl. In the scenario shown in Fig. 9, we use a version of the Stanford bunny mesh tetrahedralized with TetGen [46]. It has 34,833 vertices and 120,001 tetrahedra. The bunny falls from a height of 0.2 m and bounces off the floor.

6 Experimental results

In this section, we provide quantitative details of our method’s performance in our experiments.

6.1 Edge spans

Table 1 shows the average edge span of the connected vertices for the three meshes. As expected, we see a reduction in average edge span between the scrambled and original orderings of the meshes. We see a much larger reduction in the average edge span, of around two orders of magnitude, between the scrambled and the Fiedler orderings.

Table 1 Average edge span for each of the three meshes

Figure 10 shows log–log histograms of the edge spans for the three meshes. The scrambled orderings have roughly the same number of edges for each edge span. The original orderings have somewhat more edges with shorter edge spans, and fewer edges with longer edge spans. In contrast, the Fiedler reordering has almost no edges with a span of more than 8000, and it has the largest number of edges with short edge spans.

Table 2 Average computation time (in seconds) and standard deviation for each of our 6 experimental scenarios to run (taken across 5 runs) for each ordering of the scenario’s shape

6.2 Elasticity simulations

Our timing results for the six scenarios are shown in Table 2. We took the average run-time of three runs for each mesh ordering in each scenario. The standard deviations show that the averages are broadly representative and can be used for comparisons between different mesh orderings; we did not use them for any other purpose. In particular, the results shown in Tables 3 and 4 are based only on the averages.

We chose the scrambled ordering for vertices to be the reference ordering for our comparisons. The number of seconds that the original and Fiedler reorderings save compared to the reference ordering is shown in Table 3. The same speedups are shown in Table 4 as a percentage of the reference ordering. The results show limited speedups as one goes from the scrambled orderings to the original orderings, and larger speedups as one goes from the original orderings to the Fiedler reorderings. The Fiedler orderings provide savings of hundreds of seconds (several minutes) over the original and scrambled orderings in all scenarios except Fig. 6, where the savings are in tens of seconds. Since the reordering process itself takes on the order of seconds to complete, we see that Fiedler reordering pays for itself several times over. Table 4 shows that savings of nearly \(30\%\) over the scrambled version can be achieved with Fiedler reorderings.

We note that when both vertices and mesh elements are reordered, the time savings increase further in most of the scenarios. The amount of savings depends on whether mesh elements are accessed by reference from their vertices, just as the savings from vertex reordering depend on vertices being accessed by reference from their mesh elements. This effect is most pronounced for the stretch scenario shown in Fig. 6, which uses the spring constitutive model; for this constitutive model, edges rather than tetrahedra are the primary elements of computation.

In addition, we noticed that the Fiedler ordering for the bounce scenario in Fig. 9 shows less improvement over the original ordering than in the other scenarios. We believe this is due to the bunny mesh itself: its original ordering, as shown in Figs. 3 and 10, was already partially ordered, leaving less room for improvement from the Fiedler reordering.

Table 3 Average speedups (in seconds) for the original and Fiedler reorderings compared to the scrambled vertices case
Table 4 Speedups for the original and Fiedler reorderings as percentages of the time taken by the scrambled vertices case

6.3 Dependence of simulations on the orderings

When examining the speedups in Tables 3 and 4 obtained by the Fiedler vertex reordering in comparison with the scrambled reordering, we notice that greater speedups are obtained in the Vega simulations than in the ADMM simulations. This is because Vega uses a traditional Newton-based solver for elasticity, where a linear problem is solved using Conjugate Gradients inside each Newton iteration. A number of computations that occur during the linear solve, such as the matrix–vector product, the vector–vector dot product, and vector-norm computation, directly benefit from the Fiedler vertex reordering, leading to speedups. In contrast, the ADMM solver pre-computes the Cholesky factors of the system matrix and subsequently uses those factors during each iteration. Thus, ADMM does not require back-and-forth communication between the matrix/vector data and the geometric data stored on the meshes, leading to smaller speedups overall. However, we notice a nonlinear effect when simultaneously reordering both the vertices and mesh elements. Specifically, we notice a greater speedup in the stretch scenario in Fig. 6 with the VF Fiedler reordering, which was simulated using ADMM, than in the constrained falling scenario in Fig. 5, which was simulated using Vega. Since the difference is relatively small, we defer a more in-depth analysis of this behavior to future work, as we believe such an analysis would require intervention at the systems level, with explicit control over cache-scheduling policies.

7 Conclusion and future work

We proposed to use the Fiedler vector to reorder tetrahedral meshes for greater cache efficiency. We demonstrated the benefits of this reordering for speeding up simulations of elastic solids. Our proposed method is general, as demonstrated by its applicability to Vega [6] and the ADMM solver [37], which use different algorithms for elasticity. Our results show that using cache-efficient mesh orderings can provide significant speedups for physics simulations with negligible overhead.

It would be interesting to apply our method to fluid simulations on tetrahedral meshes [5, 12]. Since fluids exhibit dynamic topology changes, the mesh connectivity changes every time step, requiring the Fiedler reordering to be recomputed each step. However, given the negligible overhead of this computation, we believe our method should still be applicable for speeding up fluid simulations. There may also be opportunities for optimizing the computation of the Fiedler vector, given that the mesh connectivity changes only slightly during each time step.

Another potential avenue is to investigate what makes any given scenario and/or algorithm more or less susceptible to speedups when its meshes are reordered to be cache-efficient. We observed a large variation in speedups across our limited selection of physics engines and scenarios, which indicates that different ways of processing data vary greatly in how they utilize memory caches. Knowing exactly how simulation algorithms utilize the cache would tell us when it is worth reordering the mesh and could provide further insights into increasing simulation efficiency.