# NiftySim: A GPU-based nonlinear finite element package for simulation of soft tissue biomechanics

DOI: 10.1007/s11548-014-1118-5

- Cite this article as:
- Johnsen, S.F., Taylor, Z.A., Clarkson, M.J. et al. Int J CARS (2015) 10: 1077. doi:10.1007/s11548-014-1118-5


## Abstract

### Purpose

*NiftySim*, an open-source finite element toolkit, has been designed to allow incorporation of high-performance soft tissue simulation capabilities into biomedical applications. The toolkit provides the option of execution on fast graphics processing unit (GPU) hardware, numerous constitutive models and solid-element options, membrane and shell elements, and contact modelling facilities, in a simple-to-use library.

### Methods

The toolkit is founded on the total Lagrangian explicit dynamics (TLED) algorithm, which has been shown to be efficient and accurate for simulation of soft tissues. The base code is written in C\(++\), and GPU execution is achieved using the nVidia CUDA framework. In most cases, interaction with the underlying solvers can be achieved through a single Simulator class, which may be embedded directly in third-party applications such as surgical guidance systems. Advanced capabilities such as contact modelling and nonlinear constitutive models are also provided, as are more experimental technologies like reduced order modelling. A consistent description of the underlying solution algorithm, its implementation with a focus on GPU execution, and examples of the toolkit’s usage in biomedical applications are provided.

### Results

Efficient mapping of the TLED algorithm to parallel hardware results in very high computational performance, far exceeding that available in commercial packages.

### Conclusion

The *NiftySim* toolkit provides high-performance soft tissue simulation capabilities using GPU technology for biomechanical simulation research in medical image computing, surgical simulation, and surgical guidance applications.

### Keywords

FEM, Total Lagrangian explicit dynamics, GPU, Software engineering, Soft tissue biomechanics

## Introduction

In this paper, we describe the development and features of the open-source finite element (FE) toolkit, *NiftySim*. The toolkit’s key feature is its use of graphics processing unit (GPU)-based execution, which allows it to outperform equivalent central processing unit (CPU)-based implementations by more than an order of magnitude, and commercial packages by significantly more again [9, 29]. While the solver may be used for the analysis of any solid materials, it has been designed and optimised for simulation of soft tissues. The motivation for its development is the growing need for robust soft tissue modelling capabilities in medical imaging and surgical simulation applications, and in particular, in time-critical applications. The latter include, for example, interactive simulation systems where real-time computation is required [5, 19, 24], and intra-operative image registration and image guidance systems [2, 3, 7] for which rapid, if not real-time, computation is necessary.

*NiftySim* was developed around the total Lagrangian explicit dynamic (TLED) FE algorithm, first identified as a potentially efficient approach for soft tissue simulation by Miller et al. [21] (but, see also [24]). An important feature of the presented algorithm is that it correctly accommodates geometric and constitutive nonlinearities, both of which are essential for this application; soft tissues generally can tolerate large deformations, and their stress–strain response is seldom linear [11]. The efficiency of the algorithm derives from two aspects: (1) the total Lagrangian framework allows shape function derivatives to be precomputed and stored, rather than re-computed at each time step and (2) the low stiffness of biological tissues means the critical time steps for explicit integration, normally a very restrictive constraint, are relatively large. Since explicit methods involve comparatively inexpensive computations in each time step, the latter feature can lead to very low overall computation times.

An additional virtue of explicit methods that is central to *NiftySim* ’s development is their amenability to parallel execution. Whereas the main computational task in implicit methods is solution of a large linear system (several times per time step for nonlinear problems), computations in explicit solution procedures are executed on an element- and node-wise basis. The mapping to parallel hardware is thus direct and efficient. This fact was exploited in our earlier work [25, 26] to produce a GPU-based solver using OpenGL and the Cg graphics language. The introduction of the general-purpose CUDA API [22] allowed a more flexible and efficient implementation to be proposed subsequently, as described in [27, 28]. In separate work, we also described the incorporation of the technology in the SOFA framework [4]. The underlying technology in *NiftySim* builds on the approach described in [28], in particular.

*NiftySim* also includes a number of features that go beyond the solid-element-based TLED algorithm, the most important of which are: (1) membrane and shell formulations compatible with TLED’s explicit time integration (described in [1] and [8], respectively) that can be used on their own or in conjunction with solid-element-based meshes, (2) specialised contact models for the efficient simulation of interactions between deformable geometry and simple, analytically describable surfaces, (3) a general-purpose mesh-based contact model with a collision response formulation derived from the work of Heinstein et al. [10, 15]. The latter can simulate contacts between multiple deformable bodies, self-collisions, and contacts between deformable geometry and rigid surfaces.

With its lightweight, yet consistent and flexible implementation of the TLED algorithm, written in C\(++\) and CUDA, *NiftySim* is primarily aimed at researchers developing algorithms in the area of medical image analysis, surgical image guidance, and surgical simulation, requiring a fast FE backend for the simulation of soft tissue mechanics. It is mainly geared towards an algorithmic generation of simulation descriptions and post-processing of results with custom researcher-written code. Therefore, our goal is not to compete with end-to-end toolkits like SOFA^{1} that provide their own tools for graphical simulation definition and interaction, or general-purpose finite element analysis suites like Abaqus FEA.^{2} Further, unlike the common commercial packages, which must be accessed via the command line, *NiftySim* can be used as a back-end library in C\(++\) applications, thus allowing for the direct exchange of data with client code. To aid the integration of *NiftySim* in such specialised applications, it sports the following features: It has been tested on various versions of Linux, Mac OS and Windows. A command line application capable of executing complete simulations and that can be used in conjunction with scripting languages or for prototyping simulations is included. Various features simplifying its use as a library are also available, such as a wrapper simulator class, which encapsulates all of the simulation technology and allows it to be easily embedded in other libraries and applications, and full support for CMake’s^{3}*config mode*.

In the remainder of the paper, we give a brief introduction to *NiftySim*’s usage (see section “NiftySim usage”). Full details of the continuum formulation and solution algorithms can be found in our earlier publications [26, 28, 30]; however, a summary of the core algorithm is provided (see section “The TLED algorithm”). This is followed by a description of the main classes and their implementation (see section “Implementation using C\(++\)/CUDA”). We then outline some example applications taken from published research that employed *NiftySim* (see section “Research applications of NiftySim”), and conclude with a brief discussion (see section “Discussion and conclusions”). A description of the constitutive models currently available is provided in the “Appendix”.

The toolkit is available for download from SourceForge^{4} and subject only to the terms of a liberal BSD-style licence.

## NiftySim usage

This section gives a brief overview of *NiftySim* ’s usage by means of two simple examples. For a more comprehensive description, the reader is referred to *NiftySim* ’s PDF user manual that ships with the source code.

*NiftySim* can be used as a stand-alone application and as a library. However it is used, the quickest and most flexible way to create a simulation is to describe it in *XML*. Figure 1 contains such a description, a *model*, for a simple *NiftySim* simulation comprising all parts found in a realistic simulation. The figure also introduces concepts such as *system parameters* and *element set* that will reappear later in the text.

Figure 2 shows how such a model is run with *NiftySim*’s stand-alone executable. It also contains an illustration of the constraints of the example model of Fig. 1.

For integration in larger applications, using *NiftySim* as a library in C\(++\) code is the most advantageous. The simple C\(++\) application in Fig. 3, consisting of a single compilation unit, my_example.cpp, containing only a main function, and a CMakeLists.txt for the build configuration, accomplishes the task of running *any* *NiftySim* simulation contained in the file residing at the hardcoded location /path/to/my/sim.xml.

## The TLED algorithm

### The basic TLED algorithm

The TLED algorithm solves the spatially discretised equations of motion, \(\varvec{M}\ddot{\varvec{U}} + \varvec{D}\dot{\varvec{U}} + \varvec{R}^\mathrm{int} = \varvec{R}^\mathrm{ext}\), where \(\varvec{M}\) is a *lumped*, i.e. diagonal, mass matrix and \(\varvec{D}\) is a diagonal damping matrix, introduced for the numerical stability of the time integration. In TLED, the latter is linked to the mass matrix via a damping coefficient \(\alpha _D\): \(\varvec{D} = \alpha _D \varvec{M}\). \(\varvec{R}^\mathrm{ext}\) contains the discretised external loads, i.e. body forces and Neumann BCs, and \(\varvec{R}^\mathrm{int}\) the internal nodal forces.

Use of the total Lagrangian evaluation of stresses means the shape function derivatives \(\partial _{\varvec{X}} \varvec{h}\) only need to be computed once.

TLED employs one-point quadrature on the spatial domain, meaning the numerical approximation of the internal element forces \(\varvec{f}^{(e)}\) is evaluated only at the centre of the corresponding element in the initial configuration. One of the following formulas is used, depending on the element type that is employed in the discretisation of the problem:

*Linear 8-node reduced-integration hexahedron* This element employs trilinear shape functions; its internal forces are obtained by evaluating the general internal-force expression with these shape functions at the element centroid.

*Linear 4-node tetrahedron* This element employs linear shape functions. The formula (5) for element nodal forces then takes a particularly simple form, since the shape-function derivatives are constant over the element.

*Nodal-averaged pressure 4-node tetrahedron* Developed to alleviate the volumetric locking problems that plague the standard tetrahedron, this element employs the same shape functions and nodal forces formula (Eq. 8). The stress \(\check{\varvec{S}}\), however, is computed using a modified deformation gradient whose volumetric component has been averaged over adjacent nodes—see [17]. The performance of this formulation is generally superior to that of the standard tetrahedron.
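As a concrete illustration of the precomputation and one-point quadrature described above, the following minimal C++ sketch (not NiftySim code; all names are illustrative) computes the constant shape-function derivatives of a linear 4-node tetrahedron and the element's deformation gradient at its single quadrature point:

```cpp
#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<double, 9>; // row-major 3x3 matrix

// Invert a 3x3 matrix; assumes a non-singular input.
Mat3 Invert3x3(const Mat3 &m) {
  const double det = m[0] * (m[4] * m[8] - m[5] * m[7])
                   - m[1] * (m[3] * m[8] - m[5] * m[6])
                   + m[2] * (m[3] * m[7] - m[4] * m[6]);
  const double id = 1.0 / det;
  return {id * (m[4] * m[8] - m[5] * m[7]), id * (m[2] * m[7] - m[1] * m[8]),
          id * (m[1] * m[5] - m[2] * m[4]), id * (m[5] * m[6] - m[3] * m[8]),
          id * (m[0] * m[8] - m[2] * m[6]), id * (m[2] * m[3] - m[0] * m[5]),
          id * (m[3] * m[7] - m[4] * m[6]), id * (m[1] * m[6] - m[0] * m[7]),
          id * (m[0] * m[4] - m[1] * m[3])};
}

// Precompute the constant shape-function derivatives dh_a/dX of a linear
// 4-node tetrahedron from its initial node positions X[0..3].
std::array<Vec3, 4> TetShapeFnDerivs(const std::array<Vec3, 4> &X) {
  Mat3 A; // A[i][j] = dX_j/dxi_i, built from the element's edge vectors
  for (int i = 0; i < 3; ++i)
    for (int j = 0; j < 3; ++j) A[3 * i + j] = X[i + 1][j] - X[0][j];
  const Mat3 B = Invert3x3(A);
  std::array<Vec3, 4> d;
  for (int j = 0; j < 3; ++j) {
    d[0][j] = 0.0;
    for (int a = 1; a < 4; ++a) {
      d[a][j] = B[3 * j + (a - 1)]; // dh_a/dX_j for the vertex functions
      d[0][j] -= d[a][j];           // h_0 = 1 - xi1 - xi2 - xi3
    }
  }
  return d;
}

// Deformation gradient F = I + sum_a u_a (dh_a/dX)^T, evaluated once per
// element (one-point quadrature) from the precomputed derivatives.
Mat3 DeformationGradient(const std::array<Vec3, 4> &u,
                         const std::array<Vec3, 4> &dhdX) {
  Mat3 F = {1, 0, 0, 0, 1, 0, 0, 0, 1};
  for (int a = 0; a < 4; ++a)
    for (int i = 0; i < 3; ++i)
      for (int j = 0; j < 3; ++j) F[3 * i + j] += u[a][i] * dhdX[a][j];
  return F;
}
```

Because the derivatives depend only on the initial configuration, they are computed once and reused in every time step, which is the first efficiency gain of the total Lagrangian framework mentioned in the introduction.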

The other major reason for the algorithm’s efficiency is its treatment of the time ordinary differential equation (ODE). Two distinct explicit ODE solvers are implemented in *NiftySim*:

*Explicit Central-Difference Method (CDM)*: With this method, solving for the next time-step displacements, \(\varvec{U}_{n+1}\), at a given time step \(n\), is achieved by substituting the central-difference approximations for the velocity, \(\dot{\varvec{U}}_n \approx (\varvec{U}_{n+1} - \varvec{U}_{n-1})/(2\varDelta t)\), and the acceleration, \(\ddot{\varvec{U}}_n \approx (\varvec{U}_{n+1} - 2\varvec{U}_n + \varvec{U}_{n-1})/\varDelta t^2\), into Eq. (3).

*Explicit Newmark Method (EDM)* This method introduces a numerical acceleration and velocity that are advanced alongside the displacements; the next time-step displacements are then obtained from the current displacements, velocities, and accelerations.

Dirichlet BCs are incorporated at the end of a time step via a simple substitution of fixed values for the components of the displacement vector \(\varvec{U}\) that are subject to such constraints.
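A minimal sketch of the per-node central-difference update, assuming a lumped mass matrix and mass-proportional damping \(D = \alpha_D M\) as above (illustrative C++, not NiftySim's implementation):

```cpp
#include <cmath>
#include <vector>

// One central-difference step for a system with lumped (diagonal) mass and
// mass-proportional damping D = alpha * M. All vectors hold one entry per
// degree of freedom.
std::vector<double> CentralDifferenceStep(
    const std::vector<double> &uPrev,   // U_{n-1}
    const std::vector<double> &uCurr,   // U_n
    const std::vector<double> &mass,    // diagonal of M
    const std::vector<double> &fExt,    // external loads R^ext
    const std::vector<double> &fInt,    // internal forces R^int(U_n)
    double dt, double alpha) {
  std::vector<double> uNext(uCurr.size());
  for (std::size_t i = 0; i < uCurr.size(); ++i) {
    const double a = mass[i] / (dt * dt);        // M / dt^2
    const double b = alpha * mass[i] / (2 * dt); // D / (2 dt)
    uNext[i] = (fExt[i] - fInt[i] + 2 * a * uCurr[i] - (a - b) * uPrev[i]) /
               (a + b);
  }
  return uNext;
}
```

Because \(M\) and \(D\) are diagonal, each degree of freedom is updated independently, which is precisely what makes the scheme amenable to element- and node-wise parallel execution.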

### Acceleration of TLED by means of reduced order modelling

*NiftySim* also provides reduced order modelling (ROM) capabilities, the mathematical underpinnings of which are explained in detail in [29, 30]. The key idea is to project the full displacement field, defined by the usual vector of nodal values \(\mathbf {U} \in \mathbb {R}^{3N_{\text {nodes}}}\), onto a lower dimensional basis \(\varvec{\Phi } \in \mathbb {R}^{3N_{\text {nodes}} \times M}\), \(M \ll 3N_{\text {nodes}}\): \(\mathbf {U} \approx \varvec{\Phi }\mathbf {U}_r\), where \(\mathbf {U}_r \in \mathbb {R}^{M}\) is a vector of reduced coordinates.

The benefit conferred by this process is a substantial enlargement of the critical time step \(\varDelta t_{\text {cr}}\), meaning many fewer time steps are required for a given simulation. In ref. [30], it was shown that speed improvements of around an order of magnitude are feasible, with an error below 5 % compared with full model solutions.
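The projection step can be sketched as follows, assuming an orthonormal reduced basis (illustrative C++; the actual reduced-order solver works with the projected system quantities rather than reconstructing the full field at every step):

```cpp
#include <cmath>
#include <vector>

// Project a full nodal field u (size 3*Nnodes) onto a reduced basis Phi,
// given as M columns phi[k] of size 3*Nnodes (assumed orthonormal), then
// lift the reduced coordinates back to the full space: u ~= Phi Phi^T u.
std::vector<double> ReduceAndReconstruct(
    const std::vector<std::vector<double>> &phi,
    const std::vector<double> &u) {
  std::vector<double> ur(phi.size(), 0.0); // reduced coordinates U_r
  for (std::size_t k = 0; k < phi.size(); ++k)
    for (std::size_t i = 0; i < u.size(); ++i) ur[k] += phi[k][i] * u[i];
  std::vector<double> uFull(u.size(), 0.0); // reconstruction Phi U_r
  for (std::size_t k = 0; k < phi.size(); ++k)
    for (std::size_t i = 0; i < u.size(); ++i) uFull[i] += phi[k][i] * ur[k];
  return uFull;
}
```

Fields lying in the span of the basis are reproduced exactly; components outside the span are discarded, which is the source of the (small) ROM approximation error.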

### Incorporation of membranes and shells in TLED

The membrane element implemented in *NiftySim* is based on ref. [1]. It is an isoparametric triangle element in which the strain is computed via the usual mapping to a reference triangle.

The only membrane constitutive model available as of *NiftySim* version 2.3 is an incompressible neo-Hookean model.

The shell element available in *NiftySim* is the rotation-free EBST1 described in [8]. Computations with this element are based on quadratic shape functions defined on patches consisting of four triangles (Fig. 4), with deformation and curvature being sampled at the midpoints of the edges of the patch’s central triangle and subsequently averaged. With this shell element, the curvature giving rise to its bending stiffness is computed from standard nodal displacements; therefore, there is no need for modifications to the time-ODE solver algorithms employed with TLED.

### Contact modelling

All contact modelling in *NiftySim* is based on prediction–correction, i.e. the basic TLED algorithm is used to compute a prediction for the next time-step displacement, which is then used to search for potential contacts. If contacts are found, corrections must be computed. These can either be displacement corrections, directly applied to the displacement value of offending nodes, or collision response forces which are incorporated in the effective load vector, \(\varvec{R}^\mathrm{eff}\).

In the simpler of the two contact modelling algorithms implemented in *NiftySim*, the penetration of deformable-geometry nodes into the *master* surface is found by evaluating an analytical expression. In this contact modelling context, the deformable geometry surface is referred to as the slave surface.

Penetration is quantified by a *gap function*, denoted with \(g\), whose value represents the signed distance to the closest point on the master surface and, if negative, indicates that the slave node has penetrated the master surface. This also implies that there must be a means of computing the surface normal, \(\varvec{n_m}\), at every point on the master surface. The latter two quantities, \(g\) and \(\varvec{n_m}\), can then be used to compute a displacement correction, \(\varvec{\varDelta u} = -g\,\varvec{n_m}\), for penetrating nodes.
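For a planar master surface, the gap function and the resulting displacement correction take a particularly simple form; the following sketch (illustrative C++, not NiftySim's API) makes the prediction–correction step concrete:

```cpp
#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;

// Gap function for a planar master surface: signed distance of a slave
// node at position x to the plane through point p with unit outward
// normal n. Negative values indicate penetration.
double GapToPlane(const Vec3 &x, const Vec3 &p, const Vec3 &n) {
  double g = 0;
  for (int i = 0; i < 3; ++i) g += n[i] * (x[i] - p[i]);
  return g;
}

// Displacement correction du = -g * n_m, pushing a penetrating node
// (g < 0) back onto the master surface along the surface normal.
Vec3 ContactCorrection(double g, const Vec3 &n) {
  Vec3 du = {0, 0, 0};
  if (g < 0)
    for (int i = 0; i < 3; ++i) du[i] = -g * n[i];
  return du;
}
```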

*NiftySim* uses bounding volume hierarchies (BVHs) to detect collisions of slave-surface nodes with the interior of master-surface facets, as well as intersections of slave-surface and master-surface edges. The contact search algorithm returns a projection of slave nodes onto the master surface, here denoted with \((\xi ,\,\eta )\), as well as the corresponding gap function value, and in the case of edge–edge intersections, the signed shortest distance between the two edges at the end of the time step along with the corresponding edge parameters, labelled \(r,\ q\). The formulas for the forces applied in response to collisions are derived from the explicit Lagrange-multiplier method of Heinstein et al. [10]; in the case of contacts between deformable bodies, node–facet and edge–edge collision response forces are computed from these quantities and distributed between the bodies involved.

### Implementation overview

The processing of a simulation with *NiftySim* consists of three main stages. The first stage deals with the parsing of the simulation XML description and the loading of the simulation geometry. In the precomputation stage, the spatial derivatives of the shape functions, the node masses, and constraint and contact modelling-related data are computed. The final stage is the time-stepping solution loop itself. In typical usage scenarios, the precomputation happens transparently to the user in the *simulator* class’s constructor.

Figure 5 gives an overview of *NiftySim*’s workflow.

In a minimal, sequential TLED implementation, Eq. (4) can be evaluated in one loop over all elements, computing in every element its deformation gradient, strains, stresses and from that internal forces, and accumulating the per-element internal forces in a global internal-force vector. With this done, the effective loads can be computed by subtracting the internal forces from the applied external loads. A second loop is then invoked, iterating over the nodes in the mesh and updating their displacements based on Eq. (10). Thanks to the lumping of the mass matrix, this last step can be done for each node individually. Parallel implementations require a more complex memory layout to efficiently avoid race conditions on the internal-force accumulation buffer. The basic pattern of two main loops, one over all elements and one over all nodes, remains the same, though. A more detailed description of the strategies employed in *NiftySim* ’s parallel solvers is given in section “The solver classes”.
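The two-loop structure can be sketched for a toy 1D chain of two-node linear elements (illustrative C++; a real TLED solver computes element stresses from the deformation gradient rather than a scalar spring force):

```cpp
#include <cmath>
#include <vector>

// Minimal two-loop explicit step for a 1D chain of two-node linear
// "elements" (springs): loop 1 accumulates per-element internal forces
// into a global vector, loop 2 updates each node independently.
struct Chain {
  std::vector<double> uPrev, uCurr, mass, fExt;
  double k = 1.0;   // element stiffness (illustrative)
  double dt = 0.01; // time-step size

  void Step() {
    const std::size_t n = uCurr.size();
    // Loop 1: over elements, accumulating internal forces.
    std::vector<double> fInt(n, 0.0);
    for (std::size_t e = 0; e + 1 < n; ++e) {
      const double f = k * (uCurr[e + 1] - uCurr[e]); // element force
      fInt[e] -= f;
      fInt[e + 1] += f;
    }
    // Loop 2: over nodes, undamped central-difference update; possible
    // node-by-node thanks to the lumped mass matrix.
    std::vector<double> uNext(n);
    for (std::size_t i = 0; i < n; ++i)
      uNext[i] = 2 * uCurr[i] - uPrev[i] +
                 dt * dt / mass[i] * (fExt[i] - fInt[i]);
    uPrev = uCurr;
    uCurr = uNext;
  }
};
```

In a parallel solver, only the accumulation in loop 1 needs special treatment, since two elements sharing a node would otherwise write to the same entry of the internal-force vector.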

## Implementation using C\(++\)/CUDA

This section introduces the most important modules and concepts of *NiftySim* ’s TLED implementation. A more complete list and technical description of *NiftySim* ’s modules can be found in the source code’s *Doxygen*^{5} documentation.

### Coding guidelines and naming conventions

*NiftySim* follows VTK^{6} naming conventions, where class names have a “tled” prefix and are camel-cased, e.g. tledExampleNiftySimClass. Member names are also camel-cased and start with a capital letter. Names of functions normally begin with an appropriate verb.

Function signatures were until recently also based on VTK’s style, with const modifiers on neither function arguments nor member functions. Motivated by the addition of parallel CPU solvers and the potential race conditions this entails, a move towards a style more similar to that of the Insight Segmentation and Registration Toolkit^{7} has been undertaken, in which certain member functions such as getters have const modifiers, as do all read-only function arguments.

The CUDA portion of *NiftySim* was designed to be as far as possible backward compatible; the use of complex classes in CUDA device code is therefore avoided. Instead, namespaces are used extensively to provide modularity and prevent name collisions, so that all functions and variables belonging to a particular module are wrapped in the same namespace, whose name is derived from the name of the corresponding module in the host portion of the code.

### The simulator class

tledSimulator is the normal entry point for anyone wanting to use *NiftySim* as an FEM backend. A major motivation for the introduction of this class was the encapsulation of all simulation components except the model, and thus, the facilitation of the integration of *NiftySim* as an FE backend in C\(++\) code, as was illustrated with the example in Fig. 3. Its most important member function, Simulate, contains the time stepping loop.

### The model class

The tledModel class is the in-memory representation of the simulation description, usable by the other components of *NiftySim*. Internally, it stores the XML description of the simulation as a Document Object Model (DOM) tree whose contents are accessible through member functions of tledModel.

A model can be defined recursively in XML through the notion of *sub-models*. Each sub-model is represented by its own tledModel instance whose management is done by tledSubModelManager.

### The mesh representation

The tledMesh class only provides basic information about the mesh, such as node positions and element connectivity; for more complicated topological queries, tledMeshTopology can be used. There is one instance of tledMesh accessible through the simulation’s model whose purpose is to hold all solid-element geometry in the simulation, even if a simulation contains multiple disjoint bodies, as is the case with many contact problems.

*NiftySim* provides its own mesh file format, which is based on an inline definition of meshes in the simulation XML description, through a block of node positions and a block of element connectivities; it also supports reading of VTK unstructured grid files and the MSH^{8} ASCII file format. Further, it can output simulation results in VTK unstructured grid files (see section “Output”).

*NiftySim* also has some limited mesh manipulation capabilities, allowing it to apply affine transforms to meshes read from files and to assemble larger connected meshes from the meshes contained in sub-models. The sub-model manager performs this mesh merging operation incrementally by searching for nodes whose positions are less than a user-specified distance apart. Therefore, its use is recommended only on conforming meshes.
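The node-merging step can be sketched as follows (illustrative C++; an O(n²) search is shown for clarity, whereas large meshes would call for a spatial index):

```cpp
#include <array>
#include <cmath>
#include <vector>

using Vec3 = std::array<double, 3>;

// Weld nodes of a combined mesh whose positions are less than "eps"
// apart: returns, for each input node, the index of its representative
// node in the merged mesh (first occurrence wins).
std::vector<int> WeldNodes(const std::vector<Vec3> &nodes, double eps) {
  std::vector<int> rep(nodes.size());
  std::vector<int> kept; // indices of unique representative nodes
  for (std::size_t i = 0; i < nodes.size(); ++i) {
    rep[i] = -1;
    for (int j : kept) {
      double d2 = 0;
      for (int c = 0; c < 3; ++c) {
        const double d = nodes[i][c] - nodes[j][c];
        d2 += d * d;
      }
      if (d2 < eps * eps) { rep[i] = j; break; }
    }
    if (rep[i] < 0) { kept.push_back((int)i); rep[i] = (int)i; }
  }
  return rep;
}
```

On non-conforming interfaces no node pairs fall within the tolerance, so the meshes simply remain disconnected, which is why the merge is only recommended for conforming meshes.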

There are dedicated surface-mesh classes for holding membrane and shell elements (see section “tledShellSolverCPU”) and contact modelling (see section “Contact modelling”); all these classes are derived from tledSurface. The geometrical information necessary for shell and membrane computations is contained in a tledShellMesh instance that in turn depends on a solid mesh for the vertex positions. In cases where a solid body is wrapped in a membrane, the 2D mesh’s connectivity information is directly obtained from the solid mesh by extracting its surface facets. tledRigidContactSurface is used for the modelling of contacts with arbitrarily meshed rigid bodies and tledDeformableContactSurface holds the current-configuration surface for contact modelling purposes.

### The solver classes

The purpose of tledSolver and its sub-classes is the coordination of the time step calculations involved in completing the simulation: compilation of internal forces and external loads, imposition of BCs, and update of displacements.

#### tledSolverCPU

tledSolverCPU is the sequential C\(++\) solver implementation of *NiftySim*. Precomputations of \(\varvec{M}\), \(\varvec{\partial h}\), etc., are performed in the class’s constructor. The main computational tasks in each time step are calculation of new internal nodal forces and calculation of new nodal displacements. The latter task is fully delegated to a dedicated CPU time-ODE solver class (described in section “Time integration”). The sequential loop by which the former calculation is carried out is summarised in the pseudo-code loop at the centre of Algorithm 1.

Loads and imposed boundary displacements are provided by the *constraint manager* (described in section “Constraints”), but their accumulation and application are done by the solver. If applicable, a contact manager (tledContactManager) also resolves contacts between bodies in the model (see section “Contact modelling”).

#### tledParallelSolverCPU

tledParallelSolverCPU is a parallel CPU solver based on Boost^{9} threads. It shares most of its code with tledSolverCPU. Its main distinguishing feature is that it splits the element array into blocks of equal size and assigns these sub-arrays to different threads. To avoid race conditions on the internal-forces buffer \(\varvec{R}^\mathrm{int}\), every thread is associated with one intermediate force accumulation buffer, into which the internal forces of the elements in its sub-array are written. These temporary buffers are then summed up and the result is written to the global internal-force array.
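The per-thread accumulation-buffer pattern can be sketched as follows (illustrative C++ using std::thread rather than Boost threads; per-element force computation is reduced to a single (node, force) pair for brevity):

```cpp
#include <algorithm>
#include <cmath>
#include <thread>
#include <vector>

// A stand-in for one element's contribution to the internal forces.
struct ElementContribution { int node; double force; };

// Race-free parallel accumulation: each thread writes into its own
// buffer over its block of the element array; the buffers are then
// reduced into the global internal-force vector.
std::vector<double> AccumulateForces(
    const std::vector<ElementContribution> &elements, int numNodes,
    int numThreads) {
  std::vector<std::vector<double>> perThread(
      numThreads, std::vector<double>(numNodes, 0.0));
  const std::size_t block = (elements.size() + numThreads - 1) / numThreads;
  std::vector<std::thread> workers;
  for (int t = 0; t < numThreads; ++t)
    workers.emplace_back([&, t]() {
      const std::size_t begin = t * block;
      const std::size_t end = std::min(begin + block, elements.size());
      for (std::size_t e = begin; e < end; ++e)
        perThread[t][elements[e].node] += elements[e].force;
    });
  for (auto &w : workers) w.join();
  std::vector<double> fInt(numNodes, 0.0); // final reduction step
  for (const auto &buf : perThread)
    for (int i = 0; i < numNodes; ++i) fInt[i] += buf[i];
  return fInt;
}
```

The extra memory cost is one force vector per thread, traded against the elimination of all synchronisation inside the element loop.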

#### tledSolverGPU

The nVidia CUDA solver implementation is called tledSolverGPU. All its precomputations are performed on the CPU with code resembling that of tledSolverCPU.

The second important solver kernel, the displacement update kernel, is invoked by the solver with one thread for every node. As is the case on the CPU, code associated with the solver is responsible for the computation of the effective loads. The accumulation of the internal forces acting on a thread’s node is performed by querying two texture arrays: one int2 array holding, for every node, an offset and a range, and a second int2 array holding, for every node, the indices of the elements to which it belongs and its vertex index in those elements. Together, these two arrays allow for the retrieval of all internal forces computed per element from the buffer that was filled by the internal-forces kernel. The look-up process is illustrated in Fig. 6. The external loads are computed on the CPU and passed as a global memory array to the kernel. The kernel is templated with respect to the tledTimeStepper sub-class used for displacement evolution; the effective forces are passed to the appropriate tledTimeStepper function via template polymorphism, which in turn returns a predictor displacement value for the thread’s node. It is then checked whether any of the node’s components are subject to constraints, through a binary mask held in texture memory with one entry for every component of every node. If a component is constrained, the corresponding value is retrieved from another texture array.
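The gather performed by the displacement update kernel can be illustrated in plain C++ in place of texture fetches (names are illustrative; the GPU version reads the same tables through texture memory):

```cpp
#include <utility>
#include <vector>

// Node-to-element lookup tables: for every node an (offset, count) pair
// into a flat adjacency list of (element index, local vertex index)
// entries, which address an element-major force buffer.
struct NodeElementLookup {
  std::vector<std::pair<int, int> > offsetCount;   // one entry per node
  std::vector<std::pair<int, int> > elementVertex; // flat adjacency list
};

// Gather per-element force contributions into per-node forces, as one
// thread per node would do on the GPU.
std::vector<double> GatherNodeForces(
    const NodeElementLookup &lut,
    const std::vector<std::vector<double> > &elementForces) { // [elem][vtx]
  std::vector<double> f(lut.offsetCount.size(), 0.0);
  for (std::size_t n = 0; n < lut.offsetCount.size(); ++n) {
    const int off = lut.offsetCount[n].first;
    const int cnt = lut.offsetCount[n].second;
    for (int k = 0; k < cnt; ++k) {
      const std::pair<int, int> &ev = lut.elementVertex[off + k];
      f[n] += elementForces[ev.first][ev.second];
    }
  }
  return f;
}
```

Turning the scatter of the element loop into a gather in the node loop is what removes the race condition on GPU hardware.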

An example of the handling of contact constraints on GPUs is given in section “Contact modelling”.

#### tledSolverGPU_ROM

Reduced Order Modelling is implemented in the tledSolverGPU_ROM class, which follows a similar execution model to the basic GPU-enabled solver described in the previous section. In particular, computation of element nodal force contributions is identical to that in tledSolverGPU. The subsequent displacements update, however, is divided into a sequence of device and host computations: (i) effective nodal loads \(\varvec{R}^\mathrm{eff}\) are assembled using a first kernel, launched over \(N_{\text {nodes}}\) threads, then transferred to the host; (ii) the quantity \(\varvec{\varPhi \hat{M}\varPhi ^{\text {T}}R}^\mathrm{eff}\) is computed and the resulting vector is transferred back to the device; and (iii) the final displacements \(\varvec{U_{n+1}}\) are computed using a second kernel, also launched over \(N_{\text {nodes}}\) threads. It is found to be more effective to perform step (ii) on the host side, as the small sizes of the involved vectors and matrices make GPU execution inefficient.

Matlab code for constructing the reduced basis from training data using proper orthogonal decomposition is also included in the *NiftySim* source code package.

#### tledShellSolverCPU

Similar to how tledSolverCPU is responsible for the spatial discretisation with solid elements on the CPU, the tledShellSolverCPU class performs the tasks of computing the mass of shell and membrane elements and their internal forces.

Element sets are implemented as classes templated with respect to the membrane element type, so as to allow for a mix of membrane/shell element types in the same simulation. These templated classes are derived from a common abstract class, tledShellSolver::ElementSet, whose pure virtual function ComputeForces computes the internal forces in one element set and receives a reference to the same buffer \(\varvec{R}^\mathrm{int}\) used for accumulation of solid-element internal forces by tledSolverCPU. The contents of this function and its method of operation are largely analogous to the loop body of Algorithm 1, i.e. (i) the computation of strain/curvature measures is delegated to element classes derived from tledElementMembrane; (ii) a shell/membrane constitutive model object associated with the element set is used for computation of the stresses arising from the strains/curvatures; (iii) the element class converts the stresses to internal forces. Since the same force accumulation buffer is used as for solid elements, all BC and contact modelling operations can be performed by tledSolverCPU.

A class tledParallelShellSolverCPU exists to provide CPU parallelism. Its element set classes work by splitting their element arrays into equal parts that are assigned to different threads, very similar to how it is performed in tledParallelSolverCPU.

#### tledShellSolverGPU

tledShellSolverGPU is the CUDA implementation of tledShellSolverCPU. Its internal organisation and a large amount of administrative and precomputation code are shared with tledShellSolverCPU. As with its CPU counterpart, one design goal of this class was to reuse solid-element solver code for BCs, contact modelling, etc. The strategy for force accumulation employed by tledShellSolverGPU is largely identical to that of tledSolverGPU, i.e. forces are computed and stored element-wise, to be later retrieved by a dedicated kernel invoked with one thread per node using the same type of lookup tables. The aggregated forces are directly subtracted from the external loads before these are passed to the displacement update kernel of tledSolverGPU.

The internal-forces kernel is templated with respect to the constitutive model and element class, and the appropriate functions for computation of the deformation, stresses, and internal forces are called via template polymorphism.

### Time integration

The base class of all ODE solvers used for the time integration is tledTimeStepper. Two further abstract classes, tledTimeStepperCPU and tledTimeStepperGPU, exist to provide the CPU and GPU specific parts of the ODE solver API, respectively. Mathematically, two types of explicit time integration are supported: the central difference method and explicit Newmark integration (see section “The basic TLED algorithm”).

To minimise duplication between the CPU and GPU code paths, a decorator-style design was employed. The CDM/EDM-specific but platform-independent parts of the implementation, e.g. getters for intermediate results such as velocity, are contained in two templated decorator classes, tledCentralDifferenceTimeStepper and tledNewmarkTimeStepper. These decorators derive from a solver base class that is passed as a template argument, TBaseTimeStepper, which is either tledTimeStepperCPU or tledTimeStepperGPU. These decorated CPU/GPU ODE solver base classes then serve as the parent class for the actual solver implementations, such as tledCentralDifferenceTimeStepperCPU.
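The decorator-with-template-base pattern can be sketched as follows (illustrative C++ with stand-in class names; the real classes carry the tled prefix and substantially more state):

```cpp
#include <cstring>

// Platform-specific base classes standing in for the CPU/GPU time-ODE
// solver bases.
struct TimeStepperCPU {
  const char *Platform() const { return "CPU"; }
};

struct TimeStepperGPU {
  const char *Platform() const { return "GPU"; }
};

// Scheme-specific decorator: derives from whichever platform base class
// it is instantiated with, so the scheme logic is written only once.
template <class TBaseTimeStepper>
class CentralDifferenceTimeStepper : public TBaseTimeStepper {
public:
  // Scheme-specific but platform-independent API lives here.
  const char *Scheme() const { return "CDM"; }
};
```

The concrete solvers then inherit from the instantiated decorator, e.g. from CentralDifferenceTimeStepper&lt;TimeStepperCPU&gt;, so each scheme/platform combination is obtained by composition rather than by duplicated code.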

The displacement evolution code of the GPU ODE solvers is implemented as a device function that is directly called by the displacement update kernel of the GPU solver. Unlike with the internal force computation, no precautions need to be taken to avoid race conditions, since the computation of the next displacement value of a given node only depends on its effective loads, and its current and previous time-step displacements.

### Constraints

Loads and boundary conditions are incorporated under the common heading of constraints. All constraint types are represented by a sub-class of tledConstraint, e.g. tledDispConstraint implements nonzero essential boundary conditions. A class called tledConstraintManager is responsible for their management.

The constraint types accessible through the simulation XML description were originally aimed at an algorithmic generation of boundary condition definitions. Mostly, they are of a very basic type, such as displacement or force constraint, and require an explicit specification of the nodes directly affected by the constraint, thus making it difficult for humans to read and manually specify. More recently, we have added a method of geometric boundary specification that allows the user to specify the surface facets contained in a boundary through a combination of facet normal-orientation criteria and bounding volumes. The processing and conversion to node index lists of these descriptions is done in tledModel with the aid of the classes tledMeshSurface, that can extract surfaces of solid meshes and compute facet normals, and tledNodeRejector and its sub-classes that are used to filter nodes based on “is inside volume”-type criteria.

### Contact modelling

#### Contacts with analytically described surfaces

This feature enables the efficient simulation of contacts between soft tissue and geometries frequently encountered in medical settings. Examples of analytical contact-surface classes are tledContactCylinder and tledContactPlate. There is no common interface for analytical contact surfaces since these are very simple classes holding only a few parameters necessary to describe the surface, such as the radius, the axis and origin of the centre line in the case of the contact cylinder.

#### Mesh-based contact modelling

A wide range of contacts can be modelled with the mesh-based code: contacts between multiple deformable bodies, deformable-body self-collisions, and contacts between deformable bodies and moving or static rigid bodies. A dedicated manager, tledUnstructuredContactManager, exists to manage the surface meshes used in the collision queries, the contact-search bounding volumes, and the *contact solvers* that compute the collision response forces. Similar to how the constraint manager provides loads and boundary displacements to the solver for a given point in time, this manager provides member functions that the solver can call to obtain the forces arising from collisions for a given displacement configuration, without needing any in-depth knowledge of the type of contacts simulated or the number of bodies involved.

tledUnstructuredContactManager encapsulates one object of the class tledDeformableContactSurface holding the surface of the simulation geometry at the current time step. This data structure provides the facilities needed to construct a BVH for the broad-phase contact search, and the connectivity and surface-geometry information needed for the narrow-phase search and response-force computation. The BVH is a data structure that recursively partitions the geometry until every bounding volume (BV) only contains one surface primitive (e.g. a triangle). This partitioning is done such that when a BV is split, its children are only assigned geometric primitives that are connected.
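
A top-down construction of such a hierarchy might look as follows; this sketch splits the primitives at the median along the longest axis, whereas, as noted above, NiftySim additionally ensures that each child receives a topologically connected set of primitives:

```cpp
#include <algorithm>
#include <memory>
#include <vector>

struct AABB { float lo[3], hi[3]; };

struct BVHNode {
  AABB bounds;
  int primitive = -1;                 // valid only for leaves
  std::unique_ptr<BVHNode> left, right;
};

struct Primitive { int id; float centroid[3]; AABB bounds; };

static AABB Merge(const AABB &a, const AABB &b) {
  AABB m;
  for (int c = 0; c < 3; ++c) {
    m.lo[c] = std::min(a.lo[c], b.lo[c]);
    m.hi[c] = std::max(a.hi[c], b.hi[c]);
  }
  return m;
}

// Recursively partition the primitives until every BV bounds exactly
// one of them. Assumes a non-empty input; median split on longest axis.
std::unique_ptr<BVHNode> Build(std::vector<Primitive> prims) {
  auto node = std::make_unique<BVHNode>();
  node->bounds = prims[0].bounds;
  for (std::size_t i = 1; i < prims.size(); ++i)
    node->bounds = Merge(node->bounds, prims[i].bounds);
  if (prims.size() == 1) { node->primitive = prims[0].id; return node; }

  // Split along the axis with the largest spatial extent.
  int axis = 0;
  float ext[3];
  for (int c = 0; c < 3; ++c) ext[c] = node->bounds.hi[c] - node->bounds.lo[c];
  if (ext[1] > ext[axis]) axis = 1;
  if (ext[2] > ext[axis]) axis = 2;
  std::sort(prims.begin(), prims.end(),
            [axis](const Primitive &a, const Primitive &b) {
              return a.centroid[axis] < b.centroid[axis]; });
  std::vector<Primitive> lower(prims.begin(), prims.begin() + prims.size()/2);
  std::vector<Primitive> upper(prims.begin() + prims.size()/2, prims.end());
  node->left = Build(std::move(lower));
  node->right = Build(std::move(upper));
  return node;
}
```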

The contact search is conducted in two phases: the broad phase operates only on the BVH and, in the case of deformable-body contacts, recursively checks against each other those sub-trees of the BVH that bound geometry with no topological connection between them. In this *pair-wise descent*, the geometry bounded by one BVH subtree is considered the master surface, the other the slave.
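
The broad phase can be sketched as a recursive descent over two bounding-volume trees that collects overlapping leaf pairs for the narrow phase; the node layout and function names below are illustrative, not the NiftySim API:

```cpp
#include <utility>
#include <vector>

struct AABB { float lo[3], hi[3]; };

struct BVHNode {
  AABB bounds;
  int primitive = -1;   // >= 0 for leaves
  const BVHNode *left = nullptr, *right = nullptr;
};

static bool Overlap(const AABB &a, const AABB &b) {
  for (int c = 0; c < 3; ++c)
    if (a.hi[c] < b.lo[c] || b.hi[c] < a.lo[c]) return false;
  return true;
}

// Pair-wise descent: prune whole subtrees whose BVs do not overlap;
// when two leaves remain, their primitive pair becomes a candidate for
// the narrow-phase search and response-force computation.
void PairwiseDescent(const BVHNode *master, const BVHNode *slave,
                     std::vector<std::pair<int,int> > &candidatePairs) {
  if (!Overlap(master->bounds, slave->bounds)) return;
  if (master->primitive >= 0 && slave->primitive >= 0) {
    candidatePairs.push_back(std::make_pair(master->primitive, slave->primitive));
    return;
  }
  if (master->primitive < 0) {
    PairwiseDescent(master->left, slave, candidatePairs);
    PairwiseDescent(master->right, slave, candidatePairs);
  } else {
    PairwiseDescent(master, slave->left, candidatePairs);
    PairwiseDescent(master, slave->right, candidatePairs);
  }
}
```

For deformable-rigid contacts the same descent applies, with the rigid body's BVH taking the role of one of the two trees.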

Conceptually, little changes in the case of contact between deformable and rigid bodies. The main difference is that each rigid contact surface is contained in its own data structure and has its own BVH. In the contact search, the entire deformable-body BVH is checked against the entire BVH of the rigid body. Further, contact-response forces are applied to the deformable body only.

In self-collision detection, the subtrees of the deformable-geometry BVH that need to be checked against each other are identified with the surface-cone method of Volino and Magnenat-Thalmann [31]. Otherwise, the algorithm is identical to Algorithm 3.

The template-based decorator design pattern described in section “Time integration” is used extensively to share code between the various mesh-based contact modelling pipelines. As of version 2.3, mesh-based contact modelling is only available in the development branch of the project and is not part of the stable releases.

### Output

#### Visualisation

Some basic visualisation capabilities are included in *NiftySim*; these employ VTK for rendering and window management. A custom render-scene interactor, the *mesh sources*, which handle the conversion of *NiftySim* mesh objects and their attributes to VTK objects, and the source code for the creation of the render scene itself are contained in a separate library called libviz.

#### Mesh output

The same converters that are used in the visualisation module can be used to export the simulation mesh with the final displacements as an attribute, in VTK’s vtkUnstructuredGrid format, or vtkPolyData in the case of membrane meshes. This functionality can be invoked through the *NiftySim* front-end with the -export-mesh, -export-submesh, and -export-membrane switches, for the export of all simulation geometry as one mesh, as individual submeshes, and as surface meshes, respectively.

#### Displacement and internal force history

tledSimulator also encapsulates an instance of tledSolutionWriter, which can record the time-step displacements and internal forces. The displacements/forces are recorded in a MATLAB-parsable ASCII format at a frequency the user specifies through an attribute on the Output XML element used to request the output of a variable (\(\varvec{F}\) or \(\varvec{U}\)).

## Research applications of NiftySim

In this section, we look at a series of applications of *NiftySim* in published research. The majority of these examples illustrate the use of *NiftySim* for soft tissue simulation and exploit the speed of the GPU solver to run a large number of simulations with different parameters within a useful timeframe, e.g. to compute optimal material parameters for an image registration. In some cases, however, *NiftySim* was also chosen for features that go beyond TLED, such as its wide range of constitutive models or its contact modelling.

### Biomechanically guided prone-to-supine image registration of breast MRI using an estimated reference state

Eiben et al. used *NiftySim* to simulate the unloading of the breast. To this end, models comprising three neo-Hookean element sets with distinct parameters, taken from the literature, were constructed, corresponding to the pectoral muscle, the adipose tissue, and the fibro-glandular tissue. The reference state was obtained by applying a gravity constraint to a mesh obtained from the loaded configuration and inverting the direction of gravity. This yields the reference configuration for a subsequent iterative refinement of the zero-gravity configuration. The refinement is carried out by reloading the estimated zero-gravity mesh with the physical gravity direction and computing the difference between the loaded estimate and the configuration seen in the corresponding MR image. This difference is then transformed back into the coordinate system of the reference configuration by means of a nodally averaged deformation gradient and added directly to the vertex positions of the reference-configuration mesh.
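
The update equation itself did not survive the conversion of the original article; based on the surrounding description it plausibly has the following form, in which all symbols are assumed rather than taken from the original: \(\varvec{X}_0^{(k)}\) denotes the current estimate of the reference-configuration vertex positions, \(\varvec{x}^{(k)}\) the positions of the reloaded estimate, \(\varvec{x}_{\mathrm{img}}\) the configuration seen in the MR image, and \(\bar{\varvec{F}}\) the nodally averaged deformation gradient:

```latex
\varvec{X}_0^{(k+1)} = \varvec{X}_0^{(k)}
  + \bar{\varvec{F}}^{-1}\left( \varvec{x}_{\mathrm{img}} - \varvec{x}^{(k)} \right)
```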

In a validation based on landmarks in actual clinical data, in which said landmarks were tracked from both the supine and prone configurations into the simulated reference configuration and their distances measured, Eiben et al. obtained mean target registration errors (TREs) of 5.3–6.8 mm, well below the clinically relevant threshold of 10 mm.

In their experiments, the algorithm required 19 simulations to converge to the zero-gravity reference configuration from both the supine and the prone configuration. The simulations took an average of 80 and 83 s, respectively, on an nVidia GeForce GTX 580, on meshes with 10,455 and 10,741 nodes.

### Development of patient-specific biomechanical models for predicting large breast deformation

Han et al. [9] presented an algorithm for recovering suitable material parameters from MR images for the accurate modelling of breasts undergoing large deformation, such as in the previously discussed prone-to-supine registration. The algorithm was used to estimate material parameters for up to four different types of tissue within a model: fat, fibro-glandular, muscle, and tumour tissue. The inputs were a segmented image of the initial configuration (subsequently denoted by \(A\)), one of the final configuration (called \(B\)), and a set of initial guesses for the material parameters obtained from the literature.

The simulation core was implemented with *NiftySim*. It made heavy use of the element set concept and, where the experimental setup demanded it, of *NiftySim*’s contact modelling features. A pseudo-code description of the algorithm is given in Algorithm 4.

This iterative optimisation process was effectively enabled by the speed advantage of *NiftySim*’s GPU-enabled solver over established commercial packages: individual simulations took 19 s to complete with *NiftySim*, compared with 104 min with ABAQUS/Standard and 312 min with ABAQUS/Explicit, on an Intel dual-core 3.4 GHz CPU with a GeForce GTX 285 GPU. Han et al. also ascertained that *NiftySim*’s solutions are consistent with those obtained with the slower commercial packages.

### Modelling prostate motion for data fusion during image-guided interventions

In the work of Hu et al., *NiftySim*’s tledContactUSProbe class was used to simulate the ultrasound probe’s motion and its interaction with the tissue. The FEM models comprised four element sets corresponding to the prostate inner and outer gland, the rectal wall, and other surrounding tissue. Further, they used a generic pelvis model with random rotation, translation, and scaling parameters to impose a homogeneous displacement constraint on the model (Fig. 8). An outline of the implementation of the SMM-generating algorithm can be found in Algorithm 5.

Using *NiftySim*’s GPU-enabled solver, a full training set of 500 simulations was completed in an average of 140 min and with minimal user intervention, rendering the process amenable to clinical use. By comparison, comparable (individual) simulations using ANSYS take between 10 and 30 min. Using these statistical models, Hu et al. were able to obtain TREs of \(<\)3 mm, below both the clinically relevant threshold of 4.92 mm and the TREs obtained with elastic registration, which they identified as the primary competing method.

### MRI to X-ray mammography intensity-based registration with simultaneous optimisation of pose and biomechanical transformation parameters

Mertzanidou et al. [20] developed a method for registering 3D MR images to 2D X-ray mammograms. The problem is particularly challenging as the X-ray images are acquired with the breast compressed between two plates, whereas the MRIs, which are also used diagnostically and for surgical planning, are acquired with the women lying prone and their breasts pendulous. The algorithm simulates the compression on a mesh generated from an MRI, uses the resulting displacement field to warp the MRI, and generates a simulated X-ray of the compressed volume via ray-casting. The simulated X-ray is repeatedly compared with the actual X-ray mammogram by means of the normalised cross-correlation (NCC) metric, thus providing, at convergence, correspondences between the two images of the breast. Simulations were performed with *NiftySim*, using a transversely isotropic neo-Hookean constitutive model for the breast tissue with a fixed Young’s modulus. The remaining material parameters were optimised as part of the registration procedure, in a manner similar to that proposed by Han et al. [9]. A pseudo-code summary of the algorithm is given in Algorithm 6.

The implementation uses *NiftySim*’s GPU solver as a backend to save the time required to reload the simulation model: the material parameters are substituted using tledSolverGPU’s UpdateMaterialParams function, and the displacement settings of the tledContactPlate contact surfaces are updated in every iteration of the hill-climbing optimisation. However, it could also be implemented with the *niftysim* stand-alone application without making any functional sacrifices. Further, computational costs can be significantly reduced by performing the warping on the fly as part of the ray-casting process. The NCC evaluation function is given in Algorithm 7.

The use of *NiftySim*’s GPU solver allowed Mertzanidou et al. to run approximately 420 simulations in one registration, taking about 2 hours in total.

They obtained TREs of \(11.6\,\pm \,3.8\) and \(11\,\pm \,5.4\) mm for the registration of the MRI to the cranio-caudal and the medio-lateral oblique X-ray, respectively.

This study also exercised some of *NiftySim*’s newer features. In addition to the above algorithm, which uses a frictionless analytical model for the contact plates and a homogeneous solid-element model, the authors performed a sensitivity analysis to assess the impact of a more sophisticated model comprising a membrane representing the patient’s skin and friction between the contact plates and the breast surface. Incorporating friction requires the mesh-based contact model and the creation of a surface mesh for the contact plates. The “skinning” of the mesh with a neo-Hookean membrane, as done by Mertzanidou et al., is achieved with a few lines of XML in the model description, in which a suitable shear modulus G and mass density rho are specified.

## Discussion and conclusions

The *NiftySim* toolkit has been designed to enable efficient integration of simulation technology into applications in medical image computing and computer-assisted interventions. This integration is facilitated by both a command line program capable of executing simulations in a stand-alone fashion, and a library which enables simple embedding of the simulation code in third-party software. High computational performance is achieved by employing a highly data-parallel FE algorithm and executing on massively parallel graphics processing units. The underlying formulation is valid for fully nonlinear problems, making it suitable for simulating materially nonlinear soft tissues undergoing large deformations. Moreover, the codebase is relatively small and minimally dependent on third-party libraries, allowing fast and easy compilation on a range of platforms, and an uncomplicated integration in client code. A series of example applications from recently published work was used to demonstrate the toolkit’s utility.

Abaqus FEA is a product of Dassault Systèmes, http://www.3ds.com/products-services/simulia/portfolio/abaqus/.

Doxygen is a tool for the extraction of inline API documentation, available from http://www.doxygen.org.

## Acknowledgments

The work conducted on the software package and this submission were partially funded by the Intelligent Imaging Programme Grant (EPSRC Reference: EP/H046410/1). The research applications presented in the second part of this submission, and the contributions of their respective authors to the codebase of *NiftySim*, were funded by the following grants and institutions: EPSRC Grant “MIMIC” (EP/K020439/1); European 7th Framework Program “HAMAM” (FP7-ICT-2007.5.3); European 7th Framework Program “VPH-PRISM” (FP7-ICT-2011-9, 601040); Philips Research Hamburg.

### Conflict of interest

Stian F. Johnsen, Zeike A. Taylor, Matthew J. Clarkson, John Hipwell, Marc Modat, Bjoern Eiben, Lianghao Han, Yipeng Hu, Thomy Mertzanidou, David J. Hawkes, and Sebastien Ourselin declare that they have no conflict of interest.

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.