Ultrascale simulations of nonsmooth granular dynamics
Abstract
This article presents new algorithms for massively parallel granular dynamics simulations on distributed memory architectures using a domain partitioning approach. Collisions are modelled with hard contacts in order to hide their microdynamics and thus to extend the time and length scales that can be simulated. The global multicontact problem is solved using a nonlinear block Gauss-Seidel method that conforms to the subdomain structure. The parallel algorithms employ a sophisticated protocol between processors that delegates algorithmic tasks such as contact treatment and position integration uniquely and robustly to the processors. Communication overhead is minimized through aggressive message aggregation, leading to excellent strong and weak scaling. The robustness and scalability are assessed on three clusters including two petascale supercomputers with up to 458,752 processor cores. The simulations can reach unprecedented resolution of up to ten billion (\(10^{10}\)) nonspherical particles and contacts.
Keywords
Granular dynamics, High performance computing, Nonsmooth contact, Parallel computing, Message passing interface
Mathematics Subject Classification
65Y05 70F35 70F40 70E55
1 Introduction
Granular matter exhibits intriguing behaviours akin to solids, liquids or gases. However, in contrast to those fundamental states of matter, granular matter still cannot be described by a unified model equation homogenizing the dynamics of the individual particles [28]. To date, the rich set of phenomena observed in granular matter can only be reproduced with simulations that resolve every individual particle. In this paper, we will consider methods where also the spatial extent and geometric shape of the particles can be modelled. Thus, in addition to position and translational velocity, the orientation and angular velocity of each particle constitute the state variables of the dynamical system. The shapes of the particles can be described for example by geometric primitives, such as spheres or cylinders, with a low-dimensional parameterization. Composite objects can be introduced as a set of primitives that are rigidly glued together. Eventually, even meshes with a higher-dimensional parameterization can be used. In this article the shape of the particles does not change in time, i.e. no agglomeration, fracture or deformation takes place. The rates of change of the state variables are described by the Newton-Euler equations, and the particle interactions are determined by contact models.
Two fundamentally different model types must be distinguished: soft and hard contacts. Soft contacts allow a local compliance in the contact region, whereas hard contacts forbid penetrations. In the former class the contact forces can be discontinuous in time, leading to nondifferentiable but continuous velocities after integration. The differential system can be cast e.g. as an ordinary differential equation with a discontinuous right-hand side or as differential inclusions. However, the resulting differential system is typically extremely stiff if realistic material parameters are employed.
In the latter class, discontinuous forces are not sufficient to accomplish nonpenetration of the particles. Instead, impulses are necessary to instantaneously change velocities on collisions or in self-locking configurations if Coulomb friction is present [35]. Stronger mathematical concepts are required to describe the dynamics. For that purpose, Moreau introduced the notion of measure differential inclusions in [29].
Hard contacts are an idealization of reality. The rigidity of contacts has the advantage that the dynamics of the microcollisions does not have to be resolved in time. However, this also introduces ambiguities: The rigidity has the effect that the force chains along which a particle is supported are no longer unique [31]. If energy is dissipated, this also affects the dynamics. To integrate measure differential inclusions numerically in time, two options exist: In the first approach the integration is performed in subintervals from one impulsive event to the next [11, 27]. At each event an instantaneous impact problem must be solved whose solution serves as initial condition of the subsequent integration subinterval. Impact problems can range from simple binary collisions to self-locking configurations and to complicated instantaneous frictional multicontact problems with simultaneous impacts. The dynamics between events are described by differential inclusions, differential algebraic equations or ordinary differential equations. Predicting the times of the upcoming events correctly is nontrivial in general, and handling them in order in parallel impedes scalability [27]. In the second approach no efforts are made to detect events, but the contact conditions are only required to be satisfied at discrete points in time. This approach is commonly referred to as a time-stepping method.
This article focuses on the treatment of hard contacts in order to avoid the temporal resolution of microcollisions and thus the dependence of the time step length on the stiffness of the contacts. In order to avoid the resolution of events a time-stepping method is employed. This considerably extends the time scales that are accessible to simulations of granular systems with stiff contacts.
To estimate the order of a typical real-life problem size of a granular system, consider an excavator bucket with a capacity of \(1\,\hbox {m}^{3}\). Assuming sand grains with a diameter of \(0.15\,\hbox {mm}\), and assuming that they are packed with a solid volume fraction of 0.6, the excavator bucket contains in the order of \(10^{10}\) particles. In such a dense packing the number of contacts is in the same order as the number of particles. Only large-scale parallel systems with distributed memory can provide enough memory to store the data and provide sufficient computational power to integrate such systems for a relevant simulation time. Consequently, a massive parallelization of the numerical method for architectures with distributed memory is absolutely essential.
In the last half decade several approaches were published suggesting parallelizations of the methods integrating the equations of motion of rigid particles in hard contact [17, 18, 22, 30, 36, 40, 41]. The approach put forward in this article builds conceptually on these previous approaches but exceeds them substantially by consistently parallelizing all parts of the code, consistently distributing all simulation data (including the description of the domain partitioning), systematically minimizing the volume of communication and the number of exchanged messages, and relying exclusively on efficient nearest-neighbor communication. The approach described here additionally spares the expensive assembly of system matrices by employing matrix-free computations. All this is accomplished without sacrificing accuracy. The matrix-free implementation allows the direct and straightforward evaluation of wrenches in parallel and thus reduces the amount of communicated data. Furthermore, an exceptionally robust synchronization protocol is defined, which is not susceptible to numerical errors. The excellent parallel scaling behaviour is then demonstrated for dilute and dense test problems in strong- and weak-scaling experiments on three clusters with fundamentally different interconnect networks. Among the test machines are the petascale supercomputers SuperMUC and Juqueen, which are described in Sect. 7.3. The results show that given a sufficient computational intensity of the granular setup and an adequate processor interconnect, a few hundred particles per process are already enough to obtain satisfactory scaling even on millions of processes.
In Sect. 2 of this paper the underlying differential equations and the time-continuous formulation of the hard contact models are presented. Sect. 3 proposes a discretization scheme and discrete constraints for the hard contact model. The problem of reducing the number of contacts in the system for efficiency reasons is addressed in Sect. 4. Subsequently, an improved numerical method for solving multicontact problems in parallel is introduced in Sect. 5 before turning to the design of the parallelization in Sect. 6. The scalability of the parallelization is then demonstrated in Sect. 7 by means of dilute and dense setups on three different clusters. Finally, the algorithms and results are compared to previous work by other authors in Sect. 8 before summarizing in Sect. 9.
2 Continuous dynamical system
In contrast to soft contact models, the contact reactions in hard contact models cannot be explicitly expressed as a function of the state variables but are defined implicitly, e.g. by implicit nonlinear functions [23], complementarity constraints [1, 3], or inclusions [38]. In any case, the constraints distinguish between reactions in the directions normal to the contact surfaces and reactions in the tangential planes of the contact surfaces. The former are used to formulate the nonpenetration constraints, and the latter are used to formulate the friction constraints. For that reason, each contact j is associated with a contact frame, where the axis \({\mathbf {n}}_{j}(\varvec{x}_{}(t), \varvec{\varphi }_{}(t)) \in \mathbb {R}^3\) points along the direction normal to the contact surface, and the other two axes \({\mathbf {t}}_{j}(\varvec{x}_{}(t), \varvec{\varphi }_{}(t)) \in \mathbb {R}^3\) and \({\mathbf {o}}_j(\varvec{x}_{}(t), \varvec{\varphi }_{}(t)) \in \mathbb {R}^3\) span the tangential plane of the contact.
When considering impacts, a nonpenetration constraint for the reaction impulse in the direction normal to the contact surface must be formulated, and, if the contact is closed, an additional constraint modelling the impact dynamics, such as Newton’s impact law, must be added. The coefficient of restitution in Newton’s impact law can then be used to control the amount of energy that is dissipated in the collisions. This is analogous to damping elements in soft contact models.
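The role of the coefficient of restitution can be illustrated with a minimal sketch for a single frictionless contact. It assumes the common statement of Newton's impact law, in which the post-impact relative normal velocity is the negative pre-impact value scaled by the coefficient of restitution. The function name and the sign convention (negative normal velocity means approaching bodies) are assumptions made for this illustration, not notation from the article.

```python
def newton_impact(v_n_minus, restitution):
    """Post-impact relative normal velocity under Newton's impact law.

    Assumed convention: v_n_minus < 0 means the bodies are approaching.
    """
    if not 0.0 <= restitution <= 1.0:
        raise ValueError("coefficient of restitution must lie in [0, 1]")
    if v_n_minus >= 0.0:
        return v_n_minus              # separating contact: no impact occurs
    return -restitution * v_n_minus   # approaching contact: velocity is reversed and damped


# A restitution of 1 conserves the normal kinetic energy, 0 dissipates all of it.
elastic = newton_impact(-2.0, 1.0)
damped = newton_impact(-2.0, 0.5)
plastic = newton_impact(-2.0, 0.0)
```

In this sketch the fraction of normal kinetic energy retained after the impact is the square of the coefficient of restitution.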
3 Discrete dynamical system
In simulations of granular matter impulsive reactions are abundant. Higher-order integrators for time-stepping schemes are still subject to active research [33]. In particular, discontinuities pose problems for these integrators. Hence, the continuous dynamical system is discretized in the following with an integrator of order one, resembling the semi-implicit Euler method and similar to the one suggested in [2].
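As a minimal sketch of a first-order scheme of this kind, the following snippet advances one translational degree of freedom with the semi-implicit Euler method: the velocity is updated first, and the new velocity is then used to update the position. Contact reactions and rotations are omitted, and the function name and the constant-force setup are assumptions for illustration rather than the discretization used in the article.

```python
def semi_implicit_euler_step(x, v, force, mass, dt):
    """Advance position x and velocity v by one time step of length dt."""
    v_new = v + dt * force / mass   # velocity update uses the current force
    x_new = x + dt * v_new          # position update uses the NEW velocity
    return x_new, v_new


# Free fall from rest: 1000 steps of 100 microseconds, i.e. 0.1 s in total.
x, v = 0.0, 0.0
for _ in range(1000):
    x, v = semi_implicit_euler_step(x, v, force=-9.81, mass=1.0, dt=1e-4)
```

Using the updated velocity in the position update is what distinguishes this scheme from the explicit Euler method.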
A detailed discussion of solution algorithms for one-contact problems is out of the scope of this article. However, splitting methods, where nonpenetration and friction constraints are solved separately, are prone to slow convergence or cycling. In [7] Bonnefon et al. solve the one-contact problem by finding the root of a quartic polynomial. Numerous other approaches exist for modified friction laws, notably those where the friction cone is approximated by a polyhedral cone and solution algorithms for linear complementarity problems can be used [1, 32]. In any case the algorithm of choice should be as robust as possible in order to successfully resolve \(\nu _c\) contacts per iteration and time step. In this article we will demonstrate that \(\nu _c\) can be in the order of \(10^{10}\).
4 Contact detection
Broad-phase contact detection algorithms aim to limit the particle pairs that are possible candidates for contacts to as few as possible. To this end they use e.g. spatial partitioning or they exploit the temporal coherence of the particle positions [9]. The candidate pairs are then checked in detail in the narrow-phase contact detection, where (2) is solved for each pair, leading to the contact location \({\hat{\varvec{x}}}_{j}\), normal \({\mathbf {n}}_{j}\) and signed distance \(\xi _j\) for a contact j.
To solve (2) for nonoverlapping particles, the Gilbert-Johnson-Keerthi (GJK) algorithm can be used [4, 13]. For overlapping particle shapes the expanding polytope algorithm (EPA) computes approximate solutions [5]. For simple geometric primitives like spheres, the optimization problem can be solved analytically. The indices of all contacts found that way form the set of potential contacts \(\mathcal {C}= \left\{ 1 \mathrel {..}\nu _c \right\} \) at time t. Let \({\mathbf {F}}(\varvec{\lambda }_{}) = \mathbf{0}\) from now on denote the contact problem where all contact conditions and contact reactions whose indices are not part of \(\mathcal {C}\) have been filtered out.
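For a pair of spheres the analytic narrow-phase solution can be sketched as follows. The conventions used here are assumptions for illustration and may differ from those of the article: the normal points from the second sphere towards the first, the signed distance is negative for overlapping spheres, and the contact location is taken as the midpoint between the two surface points on the line of centers.

```python
import math


def sphere_sphere_contact(c1, r1, c2, r2):
    """Return (location, normal, signed_distance) for two spheres."""
    d = [a - b for a, b in zip(c1, c2)]
    dist = math.sqrt(sum(x * x for x in d))
    if dist == 0.0:
        raise ValueError("coincident centers: contact normal is undefined")
    normal = [x / dist for x in d]             # unit vector from sphere 2 to sphere 1
    signed_distance = dist - r1 - r2           # negative if the spheres overlap
    p1 = [a - r1 * n for a, n in zip(c1, normal)]  # surface point of sphere 1
    p2 = [b + r2 * n for b, n in zip(c2, normal)]  # surface point of sphere 2
    location = [0.5 * (a + b) for a, b in zip(p1, p2)]
    return location, normal, signed_distance


# Two unit spheres whose centers are 1.5 apart overlap by 0.5.
loc, n, xi = sphere_sphere_contact((0.0, 0.0, 0.0), 1.0, (1.5, 0.0, 0.0), 1.0)
```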
5 Numerical solution algorithms
To solve the multicontact problem, when suitable solution algorithms for the one-contact problems \({\mathbf {F}}_j^{1}\) are given, a nonlinear block Gauss-Seidel (NBGS) method can be used as advocated by the nonsmooth contact dynamics (NSCD) method [20]. Unfortunately, the Gauss-Seidel algorithm cannot be efficiently executed in parallel for irregular data dependencies as they appear in contact problems [22].
The algorithm is of iterative nature and needs an appropriate stopping criterion to terminate. Note that the choice of the stopping criterion for parallel executions of the algorithm is not different from serial executions [22, 36]. In each iteration k a sweep over all contacts is performed, where each contact j is relaxed, given an approximation of all other contact reactions \(\varvec{\tilde{\lambda }}_{}^{(k,j)}\). In the subdomain NBGS, the approximation of contact reaction l is taken from the current iteration if it was already relaxed (\(l < j\)) and if it is associated with the same subdomain as the contact j to be relaxed (\(s_c(l) = s_c(j)\)). In all other cases, the approximation is taken from the previous iteration. The contact reaction \(\varvec{\lambda }_{j}^{(k+1)}\) is then a weighted mean between the previous approximation and the relaxation result. If all contacts are associated with the same subdomain and \(\omega = 1\) then Algorithm 1 corresponds to a classic NBGS. If each contact is associated with a different subdomain then Algorithm 1 corresponds to a nonlinear block Jacobi (NBJ) with relaxation parameter \(\omega \).
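The update rule described in this paragraph can be sketched on a small stand-in problem. As a loudly stated simplification, each nonlinear one-contact problem is replaced here by one scalar row of a linear system with matrix `A` and right-hand side `b`, so that the relaxation of contact j becomes a division by the diagonal entry. The names `A`, `b`, `subdomain` and `subdomain_nbgs` are hypothetical, and a fixed iteration count replaces the stopping criterion.

```python
def subdomain_nbgs(A, b, subdomain, omega=1.0, iterations=50):
    """Subdomain Gauss-Seidel sweep pattern on a linear stand-in problem.

    subdomain[j] plays the role of s_c(j), the subdomain of contact j.
    """
    n = len(b)
    lam = [0.0] * n
    for _ in range(iterations):
        prev = lam[:]   # reactions of the previous iteration k
        new = lam[:]    # reactions of iteration k+1, filled in sweep order
        for j in range(n):
            s = 0.0
            for l in range(n):
                if l == j:
                    continue
                # Current values only for already relaxed contacts of the same subdomain.
                use_current = l < j and subdomain[l] == subdomain[j]
                s += A[j][l] * (new[l] if use_current else prev[l])
            relaxed = (b[j] - s) / A[j][j]          # stand-in for the one-contact solver
            new[j] = (1.0 - omega) * prev[j] + omega * relaxed
        lam = new
    return lam


A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0]
gauss_seidel = subdomain_nbgs(A, b, subdomain=[0, 0, 0], iterations=1)
jacobi = subdomain_nbgs(A, b, subdomain=[0, 1, 2], iterations=1)
mixed = subdomain_nbgs(A, b, subdomain=[0, 0, 1], iterations=1)
```

With a single subdomain the sweep reproduces Gauss-Seidel, with one subdomain per contact it reproduces Jacobi, and the mixed case uses current values only within the first subdomain.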
It is well known that Gauss-Seidel and Jacobi methods for systems of equations do not scale linearly with the number of unknowns for many problems of practical interest. This means that the number of iterations needed to obtain a given error bound increases with the number of unknowns in the system. However, this effect is most severe only when large global systems must be solved. In many cases of interest in granular dynamics, the global system effectively splits into many small systems that correspond to the clusters of objects in contact. Therefore, and in particular when a good initial guess is available from previous time steps, often a moderate number of iterations is sufficient to obtain a satisfactory accuracy. To compensate for effects due to a variable number of iterations in the scaling experiments in Sect. 7, the iterative solver is stopped there after a constant number of iterations.
The subsequent section explains how the subdomain NBGS algorithm presented in this section can be implemented and efficiently executed in parallel on machines with distributed memory. Under the assumption that the algorithm converges, the contact reactions thus obtained in a parallel execution of the NBGS solve the mathematical statement of the problem from Sect. 3 just as if the algorithm were executed serially. However, as mentioned briefly in Sect. 3, the discrete systems have a nonunique solution, which is inherently caused by the hard contact model. Hence, solutions obtained by the NBJ, the classic NBGS and the subdomain NBGS can differ. This is simply an effect of an insufficient regularization of the hard contact problem; it is not caused by the parallelization per se.
6 Parallelization design
Sect. 6.1 introduces the domain partitioning approach. Sect. 6.2 then discusses requirements that must be met in order to be able to treat all contacts exactly once in parallel. Sect. 6.3 explains a special technique to reduce the data dependencies to other processes. To this end, accumulator and correction variables will be introduced. In Sect. 6.4 conditions are discussed under which the set of communication partners can be reduced to the nearest neighbors. Time integration and the subsequent necessity of synchronization are addressed in Sect. 6.5 before summarizing the time-stepping procedure in Sect. 6.6.
6.1 Domain partitioning
Under the assumption that no contacts are present, there exists no coupling between the data of any two particles, and the problem becomes embarrassingly parallel: Each process integrates \(\lfloor \frac{\nu _b}{\nu _p} \rfloor \) or \(\lceil \frac{\nu _b}{\nu _p} \rceil \) particles. Let \(s_b(i) \in \mathcal {P}\) determine the process responsible for the time integration of particle i, from now on referred to as the parent process. All data associated with this particle, that is the state variables (position, orientation, velocities) and constants (mass, body frame inertia matrix, shape parameters), are instantiated only at the parent process in order to distribute the total memory load. However, contacts or short-range potentials introduce data dependencies to particles that in general are instantiated neither on the local process nor on a process close to the local one, rendering a proper scaling impossible. A domain partitioning approach alleviates this problem.
Let \({\varOmega }\) denote the computational domain within which all particles are located and \({\varOmega }_p \subseteq {\varOmega }, p \in \mathcal {P}\), a family of disjoint subdomains into which the domain shall be partitioned. In this context, subdomain boundaries are assigned to exactly one process. One process shall be executed per subdomain. The number of processes can e.g. correspond to the number of compute nodes in a hybrid parallelization or to the total number of cores or even threads in a homogeneous parallelization. In the domain partitioning approach the integration of a particle whose center of mass \(\varvec{x}_{i}\) is located in a subdomain \({\varOmega }_p\) at time t is calculated by process p. That way data dependencies typically pertain to the local or neighboring subdomains since they are considered to be of short range. Let \(s_b(i)\) be adapted accordingly. Special care is required when associating a particle whose center of mass is located on or near subdomain interfaces with a subdomain. In particular, periodic boundary conditions can complicate the association process since the finite precision of floating-point arithmetic does in general not allow a consistent parametric description of subdomains across periodic boundaries. Sect. 6.5 below explains how the synchronization protocol can be used to realize a reliable association.
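The association of a particle with a subdomain can be sketched for the simplest case of a regular grid of box-shaped subdomains. The grid layout, the rank numbering and the function name are assumptions for illustration; the article does not prescribe them. Half-open cells make sure that each interface belongs to exactly one process. This naive parametric test does not address the floating-point issues near interfaces and periodic boundaries mentioned above.

```python
import math


def owner_rank(x, lower, cell, grid):
    """Rank of the subdomain containing the center of mass x.

    lower: lower corner of the domain, cell: edge lengths of one subdomain,
    grid: number of subdomains per dimension. Cells are half-open [lo, hi).
    """
    idx = []
    for xi, lo, h, n in zip(x, lower, cell, grid):
        k = math.floor((xi - lo) / h)
        if not 0 <= k < n:
            raise ValueError("point lies outside the computational domain")
        idx.append(k)
    ix, iy, iz = idx
    return (iz * grid[1] + iy) * grid[0] + ix   # x index varies fastest


# A 2 x 2 x 1 grid of unit cubes.
lower, cell, grid = (0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (2, 2, 1)
rank_inside = owner_rank((0.5, 0.5, 0.5), lower, cell, grid)
rank_on_interface = owner_rank((1.0, 0.5, 0.5), lower, cell, grid)
```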
The domain partitioning should be chosen such that approximately an equal number of particles is located initially in each subdomain and that this is sustained over the course of the simulation in order to balance the computational load, which is directly proportional to the number of particles. Particles now migrate between processes if their positions move into another subdomain. Migration can lead to severe load imbalances that may need to be addressed by dynamically repartitioning the domain. Such load-balancing techniques are beyond the scope of this article.
6.2 Shadow copies
A pure local instantiation of particles has the effect that contacts cannot be detected between particles that are not located on the same process. A process can detect a contact if both particles involved in the contact are instantiated on that process. In order to guarantee that at least one process can detect a contact, the condition that a contact j must be detected by all processes whose subdomains intersect with the hull intersection \(\mathcal {H}_{j_1} \cap \mathcal {H}_{j_2}\) is sufficient if the intersection of the hull intersection and the domain is nonempty. This condition can be fulfilled by the following requirement:
Requirement 1
A particle i must be instantiated not only on the parent process but also on all processes whose subdomains intersect with the particle’s hull.
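Requirement 1 can be sketched with a conservative intersection test. As assumptions for illustration, the particle hull is replaced by a bounding sphere, the subdomains are axis-aligned boxes given as a dictionary from rank to corner points, and all names are hypothetical. The test uses the closed boxes, so a sphere that merely touches a subdomain still counts as intersecting.

```python
def sphere_intersects_box(center, radius, lo, hi):
    """True if the closed ball intersects the closed axis-aligned box [lo, hi]."""
    d2 = 0.0
    for c, a, b in zip(center, lo, hi):
        nearest = min(max(c, a), b)     # closest box coordinate to the center
        d2 += (c - nearest) ** 2
    return d2 <= radius * radius


def instantiating_ranks(center, radius, subdomains):
    """All ranks that must hold the particle (parent process or shadow copy)."""
    return sorted(
        rank
        for rank, (lo, hi) in subdomains.items()
        if sphere_intersects_box(center, radius, lo, hi)
    )


# Three unit cubes lined up along the x axis.
subdomains = {
    0: ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)),
    1: ((1.0, 0.0, 0.0), (2.0, 1.0, 1.0)),
    2: ((2.0, 0.0, 0.0), (3.0, 1.0, 1.0)),
}
near_interface = instantiating_ranks((0.9, 0.5, 0.5), 0.2, subdomains)
interior = instantiating_ranks((1.5, 0.5, 0.5), 0.2, subdomains)
```

A bounding volume may report more holders than the exact hull would, which costs extra shadow copies but does not violate the requirement.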
In order to determine which process will be responsible for treating the contact a rule is needed. Ideally this does not require additional communication. Here, the statement that a process is responsible for treating a contact refers to the responsibility of the process for executing the relaxation of the respective contact in Algorithm 1. The typical choice for this rule requires that the process whose subdomain contains the point of contact is put in charge to treat the contact [36].
However, this seemingly natural rule only works if the process whose subdomain contains the point of contact is able to detect the contact. Unfortunately, this is only guaranteed if the point of contact is located within the hull intersection. Also, if the point of contact is located outside of the domain \({\varOmega }\), then no process would be responsible for treating it.
A more intricate drawback of this approach is that it can fail in case of periodic boundary conditions: If the contact point is located near the periodic boundary, the periodic image of the contact point will be detected at the other end of the simulation box. Due to the shifted position of the contact point image and the limited numerical precision, the processes can no longer consistently determine which process gets to treat the contact.
A more robust rule can be established by fulfilling the following requirement:
Requirement 2
All shadow copy holders of a particle maintain a complete list of all other shadow copy holders and the parent process of that particle.
Then each process detecting a contact can determine the list of all processes detecting that very same contact, which is the list of all processes with an instantiation of both particles involved in the contact. This list is exactly the same on all processes detecting the contact and is not prone to numerical errors. The rule can then e.g. appoint the detecting process with smallest rank to treat the contact. In order to enhance the locality of the contact treatment, the rule should favor the particle parents if they are among the contact witnesses. Any such rule defines a partitioning of the contact set \(\mathcal {C}\). Let \(\mathcal {C}_p\) be the set of all contacts treated by process \(p \in \mathcal {P}\). Then process p instantiates all contacts \(j \in \mathcal {C}_p\).
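A rule of this kind can be sketched as follows. The set of witnesses is the intersection of the instantiation lists of both particles. The rule prefers particle parents among the witnesses and falls back to the witness with the smallest rank. Breaking a tie between two witnessing parents by the smaller rank is an assumption made here, as are the function and parameter names.

```python
def contact_owner(parent1, holders1, parent2, holders2):
    """Rank responsible for treating the contact between two particles.

    parentN: rank of the parent process of particle N,
    holdersN: ranks of the shadow copy holders of particle N.
    Every witness evaluates this on identical lists, so all witnesses agree.
    """
    instances1 = {parent1, *holders1}
    instances2 = {parent2, *holders2}
    witnesses = instances1 & instances2      # processes that can detect the contact
    if not witnesses:
        raise ValueError("no process instantiates both particles")
    parents = witnesses & {parent1, parent2}
    return min(parents) if parents else min(witnesses)


# Both parents witness the contact: the smaller parent rank is chosen.
owner_with_parents = contact_owner(3, {1, 2}, 2, {1, 3})
# Neither parent holds both particles: fall back to the smallest witness rank.
owner_without_parents = contact_owner(5, {1, 4}, 6, {1, 4})
```

The decision depends only on integer rank lists, not on coordinates, which is why it is insensitive to rounding errors.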
6.3 Accumulator and correction variables
The contact relaxations in Algorithm 1 exhibit sums with nonlocal data dependencies. In the following, the redundant evaluation of these sums is prevented by introducing accumulator variables and the nonlocal data dependencies are reduced by introducing correction variables.
At the end of each iteration the wrench corrections for each body must be reduced and added to the accumulated wrench from the last iteration. This can be performed in two message exchanges. In the first message exchange each process sends the wrench correction of each shadow copy to its parent process. Then each process sums up for each original instance all wrench corrections obtained from the shadow copy holders, its own wrench correction, and the original instance's accumulated wrench. Subsequently, the updated accumulated wrench of each original instance is sent to the shadow copy holders in a second message-exchange communication step. The wrench corrections are then reset everywhere.
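The two message exchanges can be sketched in a serial simulation in which dictionaries stand in for the per-process memory and plain assignments stand in for messages. As simplifications for illustration, a wrench is represented by a single float instead of six components, and the data layout and all names are hypothetical.

```python
def reduce_wrench_corrections(parent, accumulated, corrections):
    """Reduce corrections at the parents and redistribute the result.

    parent: body -> rank of its parent process
    accumulated: rank -> {body: accumulated wrench} for each instantiated body
    corrections: rank -> {body: wrench correction} for each instantiated body
    """
    # First exchange: shadow copy holders send their corrections to the parent.
    inbox = {rank: [] for rank in accumulated}
    for rank, local in corrections.items():
        for body, delta in local.items():
            if parent[body] != rank:
                inbox[parent[body]].append((body, delta))
    # Parents add their own and all received corrections to the accumulated wrench.
    for rank, local in accumulated.items():
        for body in local:
            if parent[body] == rank:
                local[body] += corrections[rank][body]
        for body, delta in inbox[rank]:
            local[body] += delta
    # Second exchange: parents send the updated wrench to the shadow copy holders.
    for rank, local in accumulated.items():
        for body in local:
            if parent[body] != rank:
                local[body] = accumulated[parent[body]][body]
    # Finally the corrections are reset everywhere.
    for local in corrections.values():
        for body in local:
            local[body] = 0.0


parent = {"a": 0, "b": 1}
accumulated = {0: {"a": 1.0, "b": 2.0}, 1: {"a": 1.0, "b": 2.0}}
corrections = {0: {"a": 0.5, "b": 0.25}, 1: {"a": -0.25, "b": 1.0}}
reduce_wrench_corrections(parent, accumulated, corrections)
```

After the call every process holds the same accumulated value for each body it instantiates.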
An alternative to storing accumulated wrenches and wrench corrections is to store accumulated velocities and velocity corrections. In that case, a process p instantiates variables \(\varvec{v}_{}^{[p]}\), \(\varvec{\omega }_{}^{[p]}\), \(\delta \varvec{v}_{}^{[p]}\), \(\delta \varvec{\omega }_{}^{[p]} \in \mathbb {R}^{3 \mathcal {B}_p}\). The accumulated velocities are set to \(\varvec{v}_{i}'(\varvec{\lambda }_{}^{(k)})\) and \(\varvec{\omega }_{i}'(\varvec{\lambda }_{}^{(k)})\) for all \(i \in \mathcal {B}_p\) in each iteration. They are initialized and updated accordingly. The velocity corrections are initialized and updated analogously to the wrench corrections. Hereby, the velocity variables can be updated in place. In the classic NBGS, no wrench or velocity correction variables would be necessary, but the corrections could be added to the velocity variables right away which is similar to the approach suggested by Tasora et al. in [39].
6.4 Nearest-neighbor communication
Typically, the size limit stemming from (7) is not a problem for the particles of the granular matter themselves, but it can well be one for boundaries or mechanical parts the granular matter interacts with. However, the number of such large bodies is in many applications of practical interest significantly smaller than the number of small-sized particles, suggesting that they can be treated globally. Let \(\mathcal {B}_{global}\) be the set of all body indices exceeding the size limit. These bodies will be referred to as being global in the following. All associated state variables and constants shall be instantiated on all processes and initialized equally. The time integration of these global bodies then can be performed by all processes equally. If a global body i has infinite inertia (\(m_{i} = \infty \) and \(\mathbf{I}_{ii}^0 = \infty \mathbf{1}\)), such as a stationary wall or a nonstationary vibrating plate, the body velocities are constant, and no wrenches need to be communicated. Global bodies having a finite inertia can be treated by executing an all-reduce communication primitive whenever reducing the wrench or velocity corrections of the small-sized bodies. Instead of only involving neighboring processes, the all-reduce operation sums up the corrections for each global body with finite inertia from all processes and broadcasts the result, not requiring any domain partitioning information.
6.5 Time integration and synchronization protocol
Having solved the contact problem \({\mathbf {F}}(\varvec{\lambda }_{}) = \mathbf{0}\) by Algorithm 1, the time integration defined in (3) needs to be performed. If the NBGS implementation uses velocity accumulators, the integrated velocities are at hand after the final communication of the velocity corrections. If instead the NBGS implementation uses wrench accumulators, the wrenches are at hand, and the velocities of all local bodies can be updated immediately.
Subsequently, the time integration of the positions can take place. Updating a body's position or orientation has the effect that the list of shadow copy holders changes since the intersection hull possibly intersects with different subdomains. Also, the body's center of mass can move out of the parent's subdomain. In order to restore the fulfillment of the requirements 1 and 2, a process must determine the new list of shadow copy holders and the new parent process for each local body after the position update. Shadow copy holders must be informed when such shadow copies become obsolete and must be removed. Analogously, processes must be notified when new shadow copies must be added to their state. In this case copies of the corresponding state variables, constants, list of shadow copy holder indices, and index of the parent process must be transmitted.
All other shadow copy holders must obtain the new state variables, list of shadow copy holder indices, and index of the parent process. Hereby, the condition from (7) guarantees that all communication partners are neighbors. All information can be propagated in a single aggregated nearest-neighbor message exchange. The information should be communicated explicitly and should not be derived implicitly, in order to avoid inconsistencies. This is essential to guarantee that the responsibility of a process to treat a contact can always be determined as well as the responsibility to perform the time integration.
Our implementation of the synchronization protocol makes use of separate container data structures for storing shadow copies and original instances in order to be able to enumerate these different types of bodies with good performance. Both containers support efficient insertion, deletion and lookup operations for handling the fluctuations and updates of the particles efficiently. Furthermore, the determination of the new list of shadow copy holders involves intersection tests between intersection hulls of local bodies and neighboring subdomains as requirement 1 explains in Sect. 6.2. However, determining the minimal set of shadow copy holders is not necessary. Any type of bounding volumes can be used to ease intersection testing. In particular bounding spheres either with tightly fitting bounding radii \(\overline{r}_i + h_i(t)\) or even with an overall bounding radius \(\max _{i \in \mathcal {B}} \overline{r}_i + h_i(t)\) as proposed by Shojaaee et al. in [36] are canonical. Concerning the geometry of the subdomains at least the subdomain closures can be used for intersection testing. In our implementation we chose to determine almost minimal sets of shadow copy holders by testing the intersections of the actual hull geometries of the particles with the closures of the subdomains. This reduces the number of shadow copies and thus the overall communication volume in exchange for more expensive intersection tests.
6.6 Summary
Then, in the narrow-phase contact detection, for all candidates the contact location, associated contact frame, and signed contact distance are determined if the hulls actually intersect. Finally, this set of detected contacts \(\mathcal {C}_{p,np}\) must be filtered according to one of the rules presented above, resulting in \(\mathcal {C}_p\), the set of contacts to be treated by process p. Before entering the iteration of the subdomain NBGS, the accumulator, correction, and contact reaction variables must be initialized. The initialization of the accumulator variables requires an additional reduction step if the external forces or torques cannot be readily evaluated on all processes.
Each iteration of the subdomain NBGS on process p involves a sweep over all contacts to be treated by the process. The contacts are relaxed by a suitable one-contact solver. The \(\overline{j}\) indexing indicates that such a solver typically needs to evaluate the relative contact velocity under the assumption that no reaction acts at the contact j. This can be achieved by subtracting out the corresponding part from the accumulator variables. The weighted relaxation result is then stored in place. The update of the wrench or velocity correction variables is not explicitly listed. After the sweep the wrench or velocity corrections are sent to the respective parent process and summed up per body including the accumulator variables. Then the accumulator variables are redistributed to the respective shadow copy holders in a second message-exchange step.
After a fixed number of iterations or when a prescribed convergence criterion is met, the time step proceeds by executing the time integration for each local body. The changes of the state variables must then be synchronized in a final message-exchange step, after which the preconditions of the next time step are met. Any user intervention taking place between two time steps needs to adhere to these requirements.
7 Experimental validation of scalability
This section aims to assess the scalability of the parallel algorithms and data structures that were presented in Sect. 6. The methods have been implemented in an open-source software framework for massively parallel simulations of rigid bodies [18, 19]. The implementation is based on velocity accumulators and corrections, as introduced in Sect. 6.3. The accumulator initialization performs an additional initial correction reduction step in all experiments.
In Sect. 7.1 the idea behind weak- and strong-scaling experiments is explained before presenting the test problems for which those experiments are executed in Sect. 7.2. The scaling experiments are performed on three clusters whose properties are summarized and compared in Sect. 7.3. Sect. 7.4 points out the fundamental differences in the scalability requirements of the two test problems. Finally, in Sect. 7.5 the weak-scaling and in Sect. 7.6 the strong-scaling results are presented for each test problem and cluster.
7.1 Weak and strong scalability
7.2 Test problems
The scalability of the parallelization algorithm as it is implemented in the pe framework is validated based on two fundamentally different families of test problems. Sect. 7.2.1 describes a family of dilute granular gas setups, whereas Sect. 7.2.2 describes a family of hexagonal close packings of spheres corresponding to structured and dense setups. We chose these setups because their demands on the implementation vary considerably. This is analyzed in detail in Sect. 7.4.
7.2.1 Granular gas
Granular material attains a gaseous state when sufficient energy is brought into the system, for example by vibration. Consequently, granular gases feature a low solid volume fraction and are dominated by binary collisions. When the energy supply ceases, the system cools down due to dissipation in the collisions. Granular gases are not only observed in laboratory experiments, but also appear naturally, for example in planetary rings [37], and in technical applications such as granular dampers [21]. These systems in general exhibit interesting effects like the inelastic collapse [26] or other clustering effects such as those observed in the Maxwell-demon experiment [43].
The distance between the centers of two granular particles along each spatial dimension is \(1.1\,\hbox {cm}\), amounting to a solid volume fraction of 23 % on average. In [18] almost the same family of setups served as a scalability test problem. However, there the granular gases had a solid volume fraction of 3.8 % on average. In order to test a higher collision frequency, a denser granular gas was chosen here. The system is simulated for \(\frac{1}{10}\,\hbox {s}\), and the time step is kept constant at \(100\,\upmu \hbox {s}\), resulting in 1000 time steps in total. Since the contacts are dissipative and no energy is added, the system cools down quickly. The coefficient of friction is 0.1 for every contact, whether it is a contact between a pair of particles or a contact between a particle and a confining wall.
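The quoted figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that each particle occupies one cubic cell of the initial Cartesian grid and that the average particle volume is the same in the dense setup (\(1.1\,\hbox {cm}\) spacing) and in the more dilute setup used on SuperMUC (\(2\,\hbox {cm}\) spacing). The average particle volume is not stated in this section, so it is inferred here rather than taken from the source, and the function names are illustrative.

```python
def implied_particle_volume(volume_fraction, spacing_cm):
    """Average particle volume (cm^3) if each particle occupies one grid cell."""
    return volume_fraction * spacing_cm ** 3


def solid_volume_fraction(particle_volume_cm3, spacing_cm):
    """Solid volume fraction of a Cartesian grid with the given spacing."""
    return particle_volume_cm3 / spacing_cm ** 3


# Dense setup: 23 % at 1.1 cm spacing implies roughly 0.31 cm^3 per particle.
volume = implied_particle_volume(0.23, 1.1)

# The same particles on a 2 cm grid give the dilute setup's volume fraction.
dilute_fraction = solid_volume_fraction(volume, 2.0)

# Simulated time of 0.1 s at a constant time step of 100 microseconds.
time_steps = round(0.1 / 100e-6)
```

Under these assumptions the dilute fraction comes out at about 3.8 %, in agreement with the value reported for the more dilute setup, and the step count is 1000.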
For this test problem, the subdomain NBGS solver requires a slight under-relaxation in order to prevent divergence. Using an under-relaxation parameter of 0.75 produces good results. For binary collisions, a single iteration of the solver would suffice, but because particles cluster due to the inelastic contacts, more iterations are required. The number of iterations could be determined by a dynamic stopping criterion, but in the scenario presented here it was found to be more efficient to perform a fixed number of 10 iterations.
For particle simulations, the work load strongly depends on the number of particles and contacts. For the weak-scaling experiments, each process is responsible for a rectangular subdomain, initially containing a fixed number of particles arranged in a Cartesian grid. For the strong-scaling experiments, the total number of particles in x-, y-, and z-dimension should be divisible by the number of processes in x-, y-, and z-dimension that is used in the experiment. With this arrangement the initial load is perfectly balanced. Statistically, the load, that is, the number of particles and contacts per subdomain, remains balanced if the subdomains are large enough and clustering effects have not yet progressed too far. In the simulations performed here, the duration was chosen such that the load remains well balanced throughout.
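The divisibility requirement for the strong-scaling setups can be sketched as a small helper. The function name `particles_per_process` is hypothetical and not part of pe. It splits a Cartesian particle grid evenly over a Cartesian process grid and rejects partitionings that would leave the initial load unbalanced. The example values are the Juqueen granular gas runs reported in Sect. 7.6.1.

```python
from math import prod


def particles_per_process(particles, processes):
    """Split a Cartesian particle grid evenly over a Cartesian process grid.

    Both arguments are (x, y, z) tuples. Returns the local grid extent and
    the number of particles initially owned by each process.
    """
    if any(n % p != 0 for n, p in zip(particles, processes)):
        raise ValueError("particle grid is not divisible by the process grid")
    local = tuple(n // p for n, p in zip(particles, processes))
    return local, prod(local)


# Juqueen granular gas, strong scaling: 320^3 particles in total.
small = particles_per_process((320, 320, 320), (16, 16, 8))    # 32 nodes
large = particles_per_process((320, 320, 320), (64, 64, 64))   # 4096 nodes
```

Here `small` evaluates to a local grid of 20 x 20 x 40 with 16,000 particles per process, and `large` to 5 x 5 x 5 with 125 particles per process.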
7.2.2 Hexagonal close packing of spheres
7.3 Test machines
The test machines used for performing the weak- and strong-scaling experiments

| Cluster name | Emmy | SuperMUC | Juqueen |
|---|---|---|---|
| Computing centre | Regional computing centre in Erlangen (RRZE), Germany | Leibniz supercomputing centre (LRZ), Germany | Jülich supercomputing centre (JSC), Germany |
| Best TOP 500 ranking | – | 4th (June 2012) | 5th (November 2012) |
| Peak performance in \(\hbox {PFlop}/\hbox {s}\) | 0.23 | 3.2 | 5.9 |
| Number of nodes | 560 | 9216 | 28,672 |
| Number of sockets per node | 2 | 2 | 1 |
| Name of CPU | Intel Xeon E5-2660 v2 | Intel Xeon E5-2680 | IBM PowerPC A2 |
| Clock rate in \(\hbox {GHz}\) | 2.2 | 2.7 | 1.6 |
| Number of cores per CPU | 10 | 8 | 16 |
| Number of threads per core | 2 | 2 | 4 |
| Total RAM in \(\hbox {TiB}\) | 35 | 288 | 448 |
| Interconnection fabric | Infiniband QDR | Infiniband QDR / Infiniband FDR10 | BlueGene/Q |
| Network topology | Nonblocking tree | Nonblocking tree / 4:1 pruned tree | 5D torus |
The second test machine is the SuperMUC supercomputer, which is located at the Leibniz Supercomputing Centre (LRZ) in Germany and whose best ranking was 4th place on the TOP 500 list in June 2012. The cluster is subdivided into multiple islands. The majority of the compute power is contributed by the 18 thin-node islands. Each thin-node island consists of 512 compute nodes (excluding four additional spare nodes) connected to a fully nonblocking 648-port FDR10 Infiniband switch with \(4\times \) link aggregation, resulting in a bandwidth of \(40\,\hbox {Gbit/s}\) per link and direction. Though QDR and FDR10 use the same signaling rate, the effective data rate of FDR10 is more than 20 % higher since it uses a more efficient encoding of the transmitted data. The islands' switches are each connected via 126 links to 126 spine switches. This results in a blocking switch topology. Thus, if for example all nodes within an island send to nodes located in another island, then the 512 nodes have to share 126 links to the spine switches, so that the bandwidth is roughly one quarter of the bandwidth that would be available in an overall nonblocking switch topology. Each (thin) compute node has two sockets, each equipped with an Intel Xeon E5-2680 processor having 8 cores clocked at \(2.7\,\hbox {GHz}\). The processors support 2-way SMT. In the following, as in the case of the Emmy cluster, each core is associated with a single subdomain. The peak performance of the cluster is stated to be \(3.2\,\hbox {PFlop}/\hbox {s}\). Each node offers \(32\,\hbox {GiB}\) of RAM, summing up to \(288\,\hbox {TiB}\) in total. Among the test machines, SuperMUC stands out for its blocking tree network topology and for having the processors with the highest clock rate.
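The two network figures in this description follow from simple ratios, sketched below. The encodings are an assumption not stated in the text: 8b/10b for QDR and 64b/66b for FDR10, as commonly documented for these Infiniband generations. The sketch takes the signaling rate to be identical for both, as stated above, so only the ratio of payload fractions matters. The function names are illustrative.

```python
# Payload bits per transmitted bit (assumed line encodings, see lead-in).
QDR_ENCODING = 8 / 10       # 8b/10b
FDR10_ENCODING = 64 / 66    # 64b/66b


def encoding_gain(new=FDR10_ENCODING, old=QDR_ENCODING):
    """Effective data-rate ratio of two links with equal signaling rate."""
    return new / old


def uplink_share(nodes_per_island=512, uplinks=126):
    """Fraction of the nonblocking bandwidth left per node when all nodes of
    an island communicate with other islands at the same time."""
    return uplinks / nodes_per_island


gain = encoding_gain()      # about 1.21, i.e. more than 20 % higher
share = uplink_share()      # about 0.25, i.e. roughly one quarter
```

The same ratio, \(\frac{5 \cdot 64}{4 \cdot 66} \approx 1.21\), is the factor by which communication over a link degraded from FDR10 to QDR would be expected to slow down.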
The third test machine is the Juqueen supercomputer, which is located at the Jülich Supercomputing Centre (JSC) in Germany and whose best ranking was 5th place on the TOP 500 list in November 2012. Since 2013 the cluster has been a BlueGene/Q system with 28,672 compute nodes [14, 42]. Each node features a single IBM PowerPC A2 processor having 18 cores clocked at \(1.6\,\hbox {GHz}\), of which only 16 cores are available for computing. The processors support 4-way SMT. Juqueen is the only machine where we associate each hardware thread with a subdomain in the scaling experiments. The machine's peak performance is \(5.9\,\hbox {PFlop}/\hbox {s}\). Each node offers \(16\,\hbox {GiB}\) of RAM, summing up to \(448\,\hbox {TiB}\) in total. The interconnect fabric is a 5D torus network featuring a bandwidth of \(16\,\hbox {Gbit}/\hbox {s}\) per link and direction [8]. Among our test machines, Juqueen has the highest peak performance and the largest number of cores and threads, and it is the only one with a torus interconnect.
Summary of the domain partitionings used on all test clusters

7.4 Time-step profiles
The time-step profiles showed that for the dilute granular gas scenario the time spent in the various time-step components is well balanced and the time spent in the communication routines moderately increases as the problem size is increased. For the hexagonal close packings most of the time is spent in the contact sweeps and the reduction of the velocity corrections. Components such as the position integration and the final synchronization play a negligible role due to the higher number of iterations in comparison to the granular gas scenario.
7.5 Weak-scaling results
Summary of the test problem parameters used for the weak-scaling experiments

| | Granular gas: Emmy | Granular gas: Juqueen | Granular gas: SuperMUC | Hexagonal close packing: Emmy | Hexagonal close packing: Juqueen |
|---|---|---|---|---|---|
| Number of particles per process | \(25^3\) | \(10^3\) | \(10^3\) | \(10^3\) | \(10^3\) |
| Number of time steps | 1000 | 1000 | 10,000 | 1000 | 100 |
| Maximum number of particles | \(1.6 \times 10^8\) | \(1.8 \times 10^9\) | \(1.3 \times 10^8\) | \(1.0 \times 10^7\) | \(1.8 \times 10^9\) |
| Initial number of contacts | 0 | 0 | 0 | \(6.0 \times 10^7\) | \(1.1 \times 10^{10}\) |
| Solid volume fraction (%) | 23 | 23 | 3.8 | 74 | 74 |
7.5.1 Granular gas
The figure also distinguishes between weak-scaling graphs with one-, two-, and three-dimensional domain partitionings since their communication volumes differ. Higher-dimensional nonperiodic domain partitionings typically have a higher communication volume than lower-dimensional nonperiodic domain partitionings with the same number of processes, due to the larger area of the interfaces between the subdomains. The plotted timings for the one-dimensional domain partitionings are indeed consistently slightly better than the timings for two-dimensional domain partitionings, which are in turn slightly better than the timings for three-dimensional domain partitionings.
Even though the intra-node weak-scaling results reveal an underperforming parallel efficiency between 30.8 and 32.9 % when computing on all cores of an Emmy node, the correlation with the measured memory bandwidth of a triad suggests that a good intra-node scaling can be expected as long as the available bandwidth scales. With corresponding pinning, this is the case as of the first full socket on the Emmy cluster.
Figure 10b shows the results of the inter-node weak-scaling experiments on the Juqueen supercomputer. The scaling experiments are only performed with the more demanding three-dimensional domain partitionings. In the first series of measurements the average wall-clock time per time step increases as expected up to 2048 nodes. But then the average time-step duration for setups with 4096 nodes and beyond is significantly shorter than the average time-step duration with fewer nodes. The time steps are even computed faster than on a single node, where no inter-node communication takes place at all. Assuming that intra-node communication is faster than inter-node communication, this is a puzzling result. In fact, it turns out that the intra-node communication is responsible for the behaviour: the default mechanism for intra-node communication on the Juqueen is via shared memory. In the second series of measurements we disallow the usage of shared memory for intra-node communication. This results in measurements that are consistently faster than those from the first series, and the parallel efficiency is now essentially monotonically decreasing, with an excellent parallel efficiency of at least 92.9 %.
The reason why the measured times in the first series become shorter for 4096 nodes and more is revealed when considering how the processes get mapped to the hardware. The default mapping on Juqueen is ABCDET, where the letters A to E stand for the five dimensions of the torus network, and T stands for the hardware thread within each node. The six-dimensional coordinates are then mapped to the MPI ranks in a row-major order, that is, the last dimension increases fastest. The T coordinate is limited by the number of processes per node, which is 64 for the above measurements. Upon creation of a three-dimensional communicator, the three dimensions of the domain partitioning are also mapped in row-major order. If the number of processes in z-dimension is less than the number of processes per node, this has the effect that a two-dimensional or even three-dimensional section of the domain partitioning is mapped to a single node. However, if the number of processes in z-dimension is larger than or equal to the number of processes per node, only a one-dimensional section of the domain partitioning is mapped to a single node. A one-dimensional section of the domain partitioning performs considerably less intra-node communication than a two- or three-dimensional section of the domain partitioning. This matches exactly the situation for 2048 and 4096 nodes. For 2048 nodes, a two-dimensional section \(1 \times 2 \times 32\) of the domain partitioning \(64 \times 64 \times 32\) is mapped to each node, and for 4096 nodes a one-dimensional section \(1 \times 1 \times 64\) of the domain partitioning \(64 \times 64 \times 64\) is mapped to each node. To substantiate this claim, we confirmed that the performance jump occurs when the last dimension of the domain partitioning reaches the number of processes per node, also when using 16 and 32 processes per node.
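The mapping argument can be reproduced with a short sketch. It assumes that every node hosts a block of consecutive MPI ranks (which is what a mapping with T as the fastest coordinate implies) and that the Cartesian communicator numbers ranks in row-major order without reordering. The function name `node_section` is hypothetical.

```python
def node_section(dims, procs_per_node, node=0):
    """Extent of the part of a Cartesian process grid hosted by one node.

    dims           -- (nx, ny, nz) process grid, ranks in row-major order
    procs_per_node -- number of consecutive ranks placed on each node
    node           -- index of the node to inspect
    """
    nx, ny, nz = dims
    first = node * procs_per_node
    ranks = range(first, first + procs_per_node)
    # Row-major rank -> (x, y, z): the z coordinate increases fastest.
    coords = [(r // (ny * nz), (r // nz) % ny, r % nz) for r in ranks]
    # Count the distinct coordinates covered along each dimension.
    return tuple(len({c[d] for c in coords}) for d in range(3))


section_2048 = node_section((64, 64, 32), 64)   # 2048 nodes
section_4096 = node_section((64, 64, 64), 64)   # 4096 nodes
```

The first call yields `(1, 2, 32)` and the second `(1, 1, 64)`, matching the two-dimensional and one-dimensional sections described above. Calling the function with 16 or 32 processes per node shows the same transition once the z-dimension reaches the number of processes per node.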
Figure 10c presents the weak-scaling results on the SuperMUC supercomputer. The setup differs from the granular gas scenario presented in Sect. 7.2.1 in that it is more dilute. The distance between the centers of two granular particles along each spatial dimension is \(2\,\hbox {cm}\), amounting to a solid volume fraction of 3.8 % and consequently to fewer collisions. As on the Juqueen supercomputer, only three-dimensional domain partitionings are used. All runs on up to 512 nodes ran within a single island. The run on 1024 nodes also used the minimum number of 2 islands. The run on 4096 nodes used nodes from 9 islands, and the run on 8192 nodes used nodes from 17 islands; that is, both runs used one island more than required. The graph shows that most of the performance is lost in runs on up to 512 nodes. In these runs only the nonblocking intra-island communication is utilised. Thus this part of the setup is very similar to the Emmy cluster since it also has dual-socket nodes with Intel Xeon E5 processors and a nonblocking tree Infiniband network. Nevertheless, the intra-island scaling results are distinctly worse. The reasons for these differences have not yet been fully investigated. However, the scaling behaviour beyond a single island can be considered satisfactory, featuring a parallel efficiency of 73.8 % with respect to a single island. A possible explanation of the underperforming intra-island scaling behaviour could be that during the tests some of the Infiniband links were degraded to QDR, which was a known problem at the time the extreme-scaling workshop took place. The communication routines then need \(\frac{5 \cdot 64}{4 \cdot 66} \approx 1.21\) times longer to complete. This could also explain the high variability of the runs' wall-clock times.
Subsequently, a second series of measurements is performed with \(60^3\) nonspherical particles per process. The scaling behaviour is comparable to the scaling behaviour observed in Fig. 10c. However, the largest weak-scaling run simulated \(28{,}311{,}552{,}000 \approx 2.8 \cdot 10^{10}\) nonspherical particles, possibly a record-breaking number for nonsmooth contact dynamics.
7.5.2 Hexagonal close packings of spheres
The weak-scaling results of the hexagonal close packing scenario on the Juqueen supercomputer are presented in Fig. 11b. The parallel efficiency with respect to a single node stays above 91.4 % for all measurements. This result is almost as good as the 92.9 % parallel efficiency in the scaling experiments of the granular gas. The largest execution exercises \(1024 \times 1792 \times 1 = 1{,}835{,}008\) processes on all 28,672 nodes of the machine, where \(10{,}240 \times 17{,}920 \times 10 = 1{,}835{,}008{,}000\) particles are spawned, in total leading to \(10{,}826{,}547{,}200 \approx 1.1 \cdot 10^{10}\) contacts, again a possibly record-breaking number for nonsmooth contact dynamics.
7.6 Strong-scaling results
Summary of the test problem parameters used for the strong-scaling experiments

| | Granular gas: Emmy | Granular gas: Juqueen | Granular gas: SuperMUC | Hexagonal close packing: Emmy | Hexagonal close packing: Juqueen |
|---|---|---|---|---|---|
| Number of particles | \(320 \times 160 \times 160\) | \(320 \times 320 \times 320\) | \(128 \times 128 \times 128\) | \(1280 \times 640 \times 10\) | \(2048 \times 2048 \times 10\) |
| Number of time steps | 1000 | 1000 | 100 | 50 | 20 |
| Solid volume fraction (%) | 23 | 23 | 3.8 | 74 | 74 |
7.6.1 Granular gas
Figure 12b presents the results of the strong-scaling experiments on the Juqueen supercomputer for the granular gas. The total number of particles was 32,768,000. In the execution on 32 nodes each of the \(16 \times 16 \times 8 = 2048\) processes initially had \(20 \times 20 \times 40 = 16{,}000\) nonspherical particles, and in the execution on 4096 nodes each of the \(64 \times 64 \times 64 = 262{,}144\) processes spawned \(5 \times 5 \times 5 = 125\) particles. The parallel efficiency is plotted with respect to 32 nodes and stays above 80.7 % for up to 1024 nodes and 500 particles per process before rapidly decreasing. On 4096 nodes the efficiency is at 55.4 %. The weak- and strong-scaling results are both better than on the Emmy cluster, owing to the torus network, which shows excellent performance for the nearest-neighbor communication.
The results of the strong-scaling experiments on the SuperMUC supercomputer are shown in Fig. 12c. In total \(128^3\) nonspherical particles are simulated. Hence, in the single-node run each process owns \(32 \times 64 \times 64 = 131{,}072\) particles, and in the run on 1024 nodes, each process owns \(8 \times 4 \times 4 = 128\) particles. The parallel efficiency is at 90.0 % on 256 nodes. Beyond that point it decreases dramatically, indicating that the scaling is good as long as at least about 500 particles are present per process.
7.6.2 Hexagonal close packings of spheres
For the strong-scaling experiment on the Juqueen supercomputer a hexagonal close packing with 41,943,040 particles in total is created. The smallest execution runs \(64 \times 32 \times 1 = 2048\) processes on 32 nodes, where \(32 \times 64 \times 10 = 20{,}480\) spherical particles are generated per process. The largest execution runs \(512 \times 512 \times 1 = 262{,}144\) processes on 4096 nodes, where \(4 \times 4 \times 10 = 160\) particles are generated per process. Figure 13b shows the speedup and, on the second axis, the parallel efficiency, both with respect to 32 nodes. A parallel efficiency of 75.0 % on 4096 nodes is achieved, where only 160 particles were owned per process. This suggests that a reasonably good efficiency can be achieved for a dense setup on the Juqueen supercomputer, as long as several hundred particles are handled per process.
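The relation between the two quantities plotted in the strong-scaling figures can be sketched as follows. The sketch assumes the usual definition of strong-scaling parallel efficiency relative to a baseline run, namely speedup divided by the factor by which the node count was increased. The definition used in Sect. 7.1 is not reproduced here, so this is an assumption, and the function name is illustrative.

```python
def speedup_from_efficiency(efficiency, nodes, ref_nodes):
    """Speedup relative to the baseline run implied by a parallel efficiency.

    Assumes efficiency = speedup / (nodes / ref_nodes).
    """
    return efficiency * nodes / ref_nodes


# Juqueen, baseline of 32 nodes, largest run on 4096 nodes (128x more nodes).
hcp_speedup = speedup_from_efficiency(0.750, 4096, 32)   # hexagonal close packing
gas_speedup = speedup_from_efficiency(0.554, 4096, 32)   # granular gas
```

Under this definition the reported efficiencies correspond to speedups of 96 for the hexagonal close packing and roughly 71 for the granular gas, out of an ideal speedup of 128.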
8 Related work
Other authors have proposed approaches for parallelizing nonsmooth contact dynamics on architectures with distributed memory. All of them are based on domain partitionings. A parallelization strategy termed nonsmooth contact domain decomposition (NSCDD), implemented in the renowned LMGC90 code, was recently presented in [40, 41] by Visseq et al. The approach is inspired by the finite element tearing and interconnect (FETI) method for solving partial differential equations in computational mechanics. The authors suggest decoupling the multi-contact problem such that on each process a multi-contact problem is solved having the same structure as a multi-contact problem that is solved sequentially. Particles with multiple contacts that are associated with different subdomains are duplicated, similar to the shadow copies used in this article. However, the mass and inertia are split among all instantiations. The coupling is recovered by adding linear equations gluing the duplicates back together through additional Lagrange multipliers. In contrast to the contact constraints, the interface equations are linear, and a block-diagonal system of linear equations must be solved after several sweeps over all contacts. In [41], the authors present simulations with up to \(2 \cdot 10^5\) spherical particles and \(2 \cdot 10^6\) contacts, time-integrated on up to 100 processes. The NSCDD allows non-nearest-neighbor communication in order to allow enlarged rigid bodies instead of introducing a concept analogous to global bodies.
Prior to Visseq et al., Koziara et al. presented the parallelization implemented in the solfec code [22]. This approach dispenses with the separation into interface problems and local multi-contact problems. A classic NBGS is parallelized with a non-negligible but inevitable amount of serialization. Bodies are instantiated redundantly on all processes, prohibiting scaling beyond the memory limit. Instead of using accumulator and correction variables, as proposed in this paper, the authors synchronize dummy particles (particles that are in contact with shadow copies or original instances) in addition to shadow copies in order to implement contact shadow copies. As in the NSCDD, the system matrix (Delassus operator) is set up explicitly instead of using matrix-free computations as proposed here. Simulations are presented with up to \(1 \cdot 10^4\) polyhedral particles or \(6 \cdot 10^5\) contacts, time-integrated on up to 64 processes.
At the same time, Shojaaee et al. presented another domain partitioning method in [36]. The presentation is restricted to two-dimensional problems. The solver in the paper corresponds to a subdomain NBGS with relaxation parameter \(\omega = 1\), where the authors argue that divergence typically does not occur. In our experience this choice is not sufficient, at least for three-dimensional simulations. Shadow copies are created not only if the hulls overlap the neighboring subdomain but also if the particles approach the subdomain boundaries, simplifying the intersection testing but introducing excessive shadow copies. Shojaaee et al. also introduce contact shadow copies instead of using accumulator and correction variables as proposed here. Simulations are presented with up to \(1 \cdot 10^6\) circular particles in a dense packing on up to 256 processes.
The approach presented in this paper generally improves the robustness and scalability of previously published parallel algorithms. The matrix-free approach facilitates the evaluation of the particle wrenches in parallel as suggested in Sect. 6.3 and thus reduces the amount of communicated data. The separation of bodies into global and local bodies makes it possible to restrict message-exchange communications to nearest neighbors as detailed in Sect. 6.4 and thus maps well to various interconnect networks. Furthermore, the synchronization protocol defined in Sect. 6.2 and Sect. 6.5 is not susceptible to numerical errors, in contrast to the conventional rules which are based on contact locations. Last but not least, the scaling experiments from Sect. 7 with up to \(2.8 \cdot 10^{10}\) nonspherical particles or \(1.1 \cdot 10^{10}\) contacts on up to \(1.8 \cdot 10^{6}\) processes exceed all previously published numbers by a factor of \(10^3\)–\(10^4\).
9 Summary
This article presents models and algorithms for performing scalable direct numerical simulations of granular matter in hard contact as we implemented them in the pe open-source software framework for massively parallel simulations of rigid bodies. The pe framework has already been used successfully to simulate granular systems with and without surrounding fluid [6, 12]. Excellent scaling has also been achieved in a fluid-structure interaction context [15, 16].
The discretization of the equations of motion underlying the time-stepping scheme uses an integrator of order one. Contacts are modelled as inelastic and hard contacts with Coulomb friction. The hard contact model avoids the necessity to resolve the collision microdynamics, and the time-stepping scheme avoids the necessity to resolve impulsive events in time. The one-step integration can be split into the integration of the velocities and the subsequent integration of the positions and orientations.
The velocity integration requires the solution of a nonlinear system of equations per time step. In order to reduce the size of the system in the first place, conventional broad-phase contact detection algorithms are applied to exclude contacts between intersection hulls. To solve the nonlinear system of equations, the subdomain nonlinear block Gauss-Seidel is used. The numerical solution algorithm is a mixture between a nonlinear block Gauss-Seidel (NBGS) and a nonlinear block Jacobi with under-relaxation. In contrast to a pure nonlinear block Jacobi it only requires a mild under-relaxation, and in contrast to a nonlinear block Gauss-Seidel it accommodates the subdomain structure of the domain partitioning and thus allows an efficient parallelization avoiding irregular data dependencies across subdomains. The implementation of the subdomain NBGS in the pe framework is matrix-free and thus avoids the expensive assembly of the Delassus operator. Furthermore, the use of accumulators and correction variables enables the evaluation of the particle wrenches in parallel, reuses partial results and reduces the number of particles that need to be synchronized.
The integration of the positions and orientations is followed by the execution of a robust synchronization protocol that guarantees correctness while being highly efficient when proceeding to ultra large scale. The key to obtaining this robustness is to add the rank of the parent process and the ranks of the shadow copy holders to the state of each particle and to explicitly communicate the state changes. Only then can processes reliably agree upon responsibilities such as contact treatment and particle integration without being susceptible to numerical errors.
Beyond that, all messages are aggressively aggregated in order to reduce the communication overhead of small messages and all messages are restricted to nearest neighbors. The latter is achieved by splitting bodies into local and global bodies and identifying appropriate requirements. Both measures improve the scalability of the implementation.
Finally, the scalability was demonstrated for dilute and dense setups on three clusters, two of them having been in the top 10 of the world's largest publicly available supercomputers. The parallel efficiency on Juqueen is excellent. The inter-island scaling results on SuperMUC are satisfactory; however, the intra-island scaling results show room for possible improvements. This is not inherently caused by the parallelization approach, as can be shown by inspecting the results of the Emmy cluster, whose architecture is similar to a single island of SuperMUC.
The largest scaling experiments demonstrate that simulations of unprecedented scale with up to \(2.8 \cdot 10^{10}\) nonspherical particles and up to \(1.1 \cdot 10^{10}\) contacts are possible using up to \(1.8 \cdot 10^{6}\) processes. The systematic evaluation also confirms that good parallel efficiency can be expected on millions of processes even if only a few hundred particles are allocated to each process provided that the computation exhibits a sufficiently high computational intensity and the architecture has a good interconnect network.
The favourable scalability results do not account for the fact that the NBGS solver may not scale (algorithmically) in terms of the number of iterations needed to achieve a given error bound when large ensembles of particles are in mutual contact. Possible future developments arise out of this: in such situations, the convergence rate of multigrid methods can still be independent of the number of unknowns and is in that sense optimal. The successful construction of a multigrid method for hard contact problems would be invaluable for simulating ever-increasing system sizes.
Notes
Acknowledgments
The second author gratefully acknowledges the support of the Institute of Mathematical Sciences of the National University of Singapore.
References
 1.Anitescu M, Potra F (1997) Formulating dynamic multirigidbody contact problems with friction as solvable linear complementarity problems. Nonlinear Dynam 14(3):231–247MATHMathSciNetCrossRefGoogle Scholar
 2.Anitescu M, Potra F (2002) A timestepping method for stiff multibody dynamics with contact and friction. In J Numer Methods Eng 55(7):753–784MATHMathSciNetCrossRefGoogle Scholar
 3.Anitescu M, Tasora A (2010) An iterative approach for cone complementarity problems for nonsmooth dynamics. Comput Optim Appl 47(2):207–235MATHMathSciNetCrossRefGoogle Scholar
 4.van den Bergen G (1999) A fast and robust GJK implementation for collision detection of convex objects. J Gr Tools 4(2):7–25MathSciNetCrossRefGoogle Scholar
 5.van den Bergen G (2001) Proximity queries and penetration depth computation on 3D game objects. In: Game developers conference, vol 170Google Scholar
 6.Bogner S, Mohanty S, Rüde U (2015) Drag correlation for dilute and moderately dense fluidparticle systems using the lattice boltzmann method. Int J Multiph Flow 68:71–79MathSciNetCrossRefGoogle Scholar
 7.Bonnefon O, Daviet G (2011) Quartic formulation of Coulomb 3D frictional contact. Technical Report RT0400, INRIAGoogle Scholar
 8.Chen D, Eisley N, Heidelberger P, Senger R, Sugawara Y, Kumar S, Salapura V, Satterfield D, SteinmacherBurow B, Parker J (2012) The IBM Blue Gene/Q interconnection fabric. IEEE Micro 32(1):32–43MATHCrossRefGoogle Scholar
 9.Cohen J, Lin M, Manocha D, Ponamgi M (1995) ICOLLIDE: An interactive and exact collision detection system for largescale environments. In: Proceedings of the 1995 symposium on interactive 3D graphics, ACM, p 189Google Scholar
 10.Diebel J (2006) Representing attitude: Euler angles, unit quaternions, and rotation vectors. Matrix 58:15–16Google Scholar
 11.Esefeld B (2014) Numerische Integration von Mehrkörpersystemen mit mengenwertigen Kraftgesetzen. Herbert Utz Verlag, MünchenGoogle Scholar
 12.Fischermeier E, Bartuschat D, Preclik T, Marechal M, Mecke K (2014) Simulation of a hardspherocylinder liquid crystal with the pe. Comput Phys Commun 185(12):3156–3161CrossRefGoogle Scholar
 13.Gilbert E, Johnson D, Keerthi S (1988) A fast procedure for computing the distance between complex objects in threedimensional space. IEEE J Robot Autom 4(2):193–203CrossRefGoogle Scholar
 14.Gilge M et al (2013) IBM system Blue Gene solution Blue Gene/Q application development. IBM Redbooks, DurhamGoogle Scholar
 15.Götz J, Iglberger K, Feichtinger C, Donath S, Rüde U (2010) Coupling multibody dynamics and computational fluid dynamics on 8192 processor cores. Parallel Comput 36(2–3):142–151MATHMathSciNetCrossRefGoogle Scholar
 16.Götz J, Iglberger K, Stürmer M, Rüde U (2010) Direct numerical simulation of particulate flows on 294912 processor cores. In: Proceedings of the 2010 ACM/IEEE international conference for high performance computing, networking, storage and analysis, pp 1–11Google Scholar
 17.Iglberger K, Rüde U (2009) Massively parallel rigid body dynamics simulations. Comput Sci Res Dev 23(3–4):159–167CrossRefGoogle Scholar
 18.Iglberger K, Rüde U (2010) Massively parallel granular flow simulations with nonspherical particles. Comput Sci Res Dev 25(1–2):105–113CrossRefGoogle Scholar
 19.Iglberger K, Rüde U (2011) Largescale rigid body simulations. Multibody Syst Dynam 25(1):81–95MATHCrossRefGoogle Scholar
 20.Jean M (1999) The nonsmooth contact dynamics method. Comput Methods Appl Mech Eng 177(3–4):235–257MATHMathSciNetCrossRefGoogle Scholar
 21.Kollmer J, Sack A, Heckel M, Pöschel T (2013) Relaxation of a spring with an attached granular damper. New J Phys 15(9):093,023CrossRefGoogle Scholar
 22.Koziara T, Bićanić N (2011) A distributed memory parallel multibody contact dynamics code. Int J Numer Methods Eng 87(1–5):437–456MATHCrossRefGoogle Scholar
 23.Leyffer S (2006) Complementarity constraints as nonlinear equations: theory and numerical experience. In: Optimization with multivalued mappings, Springer, pp 169–208Google Scholar
 24.Liu C, Jain S (2012) A quick tutorial on multibody dynamics. Tech. rep., Georgia Institute of Technology
 25.McCalpin J (1995) Memory bandwidth and machine balance in current high performance computers. In: IEEE computer society technical committee on computer architecture (TCCA) newsletter, pp 19–25
 26.McNamara S, Young W (1994) Inelastic collapse in two dimensions. Phys Rev E 50(1):R28–R31
 27.Miller S, Luding S (2004) Event-driven molecular dynamics in parallel. J Comput Phys 193(1):306–316
 28.Mitarai N, Nakanishi H (2012) Granular flow: dry and wet. Eur Phys J Spec Top 204(1):5–17
 29.Moreau J, Panagiotopoulos P (1988) Nonsmooth mechanics and applications, vol 302. Springer, Wien-New York
 30.Negrut D, Tasora A, Mazhar H, Heyn T, Hahn P (2012) Leveraging parallel computing in multibody dynamics. Multibody Syst Dynam 27(1):95–117
 31.Popa C, Preclik T, Rüde U (2014) Regularized solution of LCP problems with application to rigid body dynamics. Numer Algorithm 67:1–12
 32.Sauer J, Schömer E (1998) A constraint-based approach to rigid body dynamics for virtual reality applications. In: Proceedings of the ACM symposium on virtual reality software and technology, pp 153–162
 33.Schindler T, Acary V (2014) Timestepping schemes for nonsmooth dynamics based on discontinuous Galerkin methods: definition and outlook. Math Comput Simul 95:180–199
 34.Schütte K, van der Waerden B (1952) Das Problem der dreizehn Kugeln. Mathematische Annalen 125(1):325–334
 35.Shen Y, Stronge W (2011) Painlevé paradox during oblique impact with friction. Eur J Mech A/Solids 30(4):457–467
 36.Shojaaee Z, Shaebani M, Brendel L, Török J, Wolf D (2012) An adaptive hierarchical domain decomposition method for parallel contact dynamics simulations of granular materials. J Comput Phys 231(2):612–628
 37.Spahn F, Petzschmann O, Schmidt J, Sremčević M, Hertzsch JM (2001) Granular viscosity, planetary rings and inelastic particle collisions. In: Granular gases, Springer, Berlin, pp 363–385
 38.Studer C (2009) Numerics of unilateral contacts and friction: modeling and numerical time integration in nonsmooth dynamics, Lecture Notes in Applied and Computational Mechanics, vol 47. Springer, Berlin
 39.Tasora A, Anitescu M (2011) A matrix-free cone complementarity approach for solving large-scale, nonsmooth, rigid body dynamics. Comput Methods Appl Mech Eng 200(5):439–453
 40.Visseq V, Martin A, Dureisseix D, Dubois F, Alart P (2012) Distributed nonsmooth contact domain decomposition (NSCDD): algorithmic structure and scalability. In: Proceedings of the international conference on domain decomposition methods
 41.Visseq V, Alart P, Dureisseix D (2013) High performance computing of discrete nonsmooth contact dynamics with domain decomposition. Int J Numer Methods Eng 96(9):584–598
 42.Wautelet P, Boiarciuc M, Dupays J, Giuliani S, Guarrasi M, Muscianisi G, Cytowski M (2014) Best practice guide: Blue Gene/Q. v1.1.1 edn
 43.van der Weele K, van der Meer D, Versluis M, Lohse D (2001) Hysteretic clustering in granular gas. Europhys Lett 53(3):328