Scalable computational kernels for mortar finite element methods

Targeting simulations on parallel hardware architectures, this paper presents computational kernels for efficient computations in mortar finite element methods. Mortar methods enable a variationally consistent imposition of coupling conditions at high accuracy, but come with considerable numerical effort and cost for the evaluation of the mortar integrals to compute the coupling operators. In this paper, we identify bottlenecks in parallel data layout and domain decomposition that hinder an efficient evaluation of the mortar integrals. We then propose a set of computational strategies to restore optimal parallel communication and scalability for the core kernels devoted to the evaluation of mortar terms. We exemplarily study the proposed algorithmic components in the context of three-dimensional large-deformation contact mechanics, both for cases with fixed and dynamically varying interface topology, yet these concepts can naturally and easily be transferred to other mortar applications, e.g. classical meshtying problems. To restore parallel scalability, we employ overlapping domain decompositions of the interface discretization independent from the underlying volumes and then tackle parallel communication for the mortar evaluation by a geometrically motivated reduction of ghosting data. Using three-dimensional contact examples, we demonstrate strong and weak scalability of the proposed algorithms up to 480 parallel processes as well as study and discuss improvements in parallel communication related to mortar finite element methods. For the first time, dynamic load balancing is applied to mortar contact problems with evolving contact zones, such that the computational work is well balanced among all parallel processors independent of the current state of the simulation.


Introduction
Mortar finite element methods (FEM) are nowadays well established in a variety of application areas in computational science and engineering as a discretization technique for the coupling of non-matching meshes. Their general applicability in a vast range of problems as well as their mathematical properties, e.g. variational consistency, make them one of the most popular choices among interface discretization techniques. They are undoubtedly the preferred choice for robust finite element discretization in computational contact mechanics with large deformations [16,60,61,79,81]. However, their numerical effort and computational cost are high and can be considered a bottleneck in many scenarios. This paper discusses several performance challenges of mortar methods in the context of parallel computing and proposes remedies to reduce the overall runtime, obtain optimal scalability, and reduce parallel communication and memory consumption. As a demanding prototype application, several test cases from computational contact mechanics showcase the proposed algorithms and their impact on runtime and parallel scalability.
Originally developed in the context of domain decomposition for the weak imposition of interfacial constraints [5,10], mortar methods soon became popular in meshtying [58,59] and contact mechanics problems [6,34,52,55,56,57,61,62,85,84]. Recently, mortar methods for meshtying problems have regained attention due to the rise of isogeometric analysis and the need for isogeometric patch coupling [20,21,24,23,41,82,88]. A variety of papers discuss mortar methods in the context of isogeometric analysis for contact problems, among them [19,25,17,16,65]. Moreover, mortar methods have spread to other single-field problems, e.g. contact mechanics including wear [30] or fluid dynamics [26], as well as a variety of surface-coupled multi-physics problems, among them fluid-structure interaction [42,46,51] or the simulation of lithium-ion cells in electrochemistry [27]. Lately, volume-coupled problems have also been addressed by mortar methods [29]. Despite their significant computational cost, the popularity of mortar methods over classical node-to-segment, Gauss-point-to-segment, and other collocation-based approaches is based on their mathematical properties such as variational consistency and stability. Compared to two-dimensional problems, an efficient mortar evaluation is much more critical in three-dimensional problems, which are at the same time of great practical relevance in real-world applications. When using a Lagrange multiplier field λ to impose constraints on the subdomain interfaces, mortar methods discretize λ on the so-called slave side of the interface. The numerical effort of mortar methods is usually related to the search for nearest neighbors, local projection of meshes and subsequent clipping and triangulation of intersected meshes, as well as the resulting segment-based numerical integration, cf. Figure 1.
While these operations themselves are already expensive, implicit contact solvers need to perform them in every nonlinear iteration, rendering this a possible feasibility bottleneck or at least a performance impediment, which becomes even more demanding through the necessity of consistent linearizations of all mortar terms. The parallelization of contact search algorithms has been addressed in [36] for example, where standard domain-decomposition-based spatial search is enhanced with thread-level parallelism. To speed up the subsequent evaluation of contact terms, various integration strategies are available, among them element-based and segment-based integration, cf. [12,28,50,73]. Segment-based integration subdivides each slave element into segments having no discontinuities of the integrands within their domain. This yields a highly accurate quadrature, but is computationally expensive. Element-based integration, on the other hand, reduces the effort of clipping and triangulating the intersected meshes by employing higher-order integration schemes to deal with weak discontinuities at element edges, but yields a less accurate evaluation of the mortar integrals. While the segment-based integration strategy is unequivocally preferable due to its accuracy, it comes at significantly higher computational cost. Furthermore, systems of linear equations arising from mortar-based interface discretizations require tailored preconditioning techniques for an efficient iterative solution procedure. Depending on the specific details of the discretization, the resulting linear system might exhibit saddle-point structure. Efficient preconditioners to be used in conjunction with Krylov solvers are available in the literature [2,14,67,69,70,72,71,77] and, thus, are not in the scope of this paper. We rather focus on the cost of evaluating all mortar-related terms.
As outlined previously, many theoretical aspects of mortar methods have already been discussed and solved in the literature, e.g. the choice of discrete basis functions [31,48,49,56,57,76,80], numerical quadrature [12,28,50,73], conservation laws [39,40,85], or contact search algorithms [8,74,75,84,83,86,87]. However, computational aspects of mortar methods for contact problems, especially in the context of parallel computing, have largely been neglected by the scientific community so far. To fill this gap, this work is motivated and guided by the quest for parallel scalability of all algorithmic components of mortar methods for arbitrarily evolving contact zones in three-dimensional problems. Therefore, we analyze the computational kernels of mortar finite element methods and design their interplay to assure parallel scalability. To the best of our knowledge, most contributions in the literature have focused on the serial case (i.e. one processor) only or have embedded mortar methods into existing parallel finite element codes without specific provisions. An exception to this observation is the work of Krause and Zulian [47], where a parallel approach to the variational transfer of discrete fields between unstructured finite element meshes as well as the associated proximity and intersection detections are described in detail and examples for the evaluation of grid projection operators are given for various surface and volume projection problems. Yet, Krause and Zulian [47] do not address dynamic contact problems with evolving contact zones, which are of particular importance in engineering applications. In the present contribution, we analyze several schemes to subdivide mortar interface discretizations into subdomains suitable for parallel computing and discuss their interplay with distributed memory architectures of computing clusters to achieve parallel scalability.
Thereby, we follow a message-passing parallel programming model that utilizes the message passing interface (MPI) for communication between address spaces of different processes [53]. Finally, we develop and showcase a dynamic load balancing strategy to address the particular needs of contact problems with evolving contact configurations and interface topologies for three-dimensional problems.
By starting from an analysis of the computational cost of the evaluation of mortar terms, which is most commonly related to the slave side of the contact interface, we identify three main tasks, which will directly lead to the postulation of two essential requirements for parallel and scalable computational kernels for mortar finite element methods:
• For the geometrical task of identifying close master and slave nodes within the contact search, each slave node needs access to the position of every node of the master side of the interface discretization. While the distribution of the master interface discretization to several compute nodes enables larger problem sizes, it requires advanced ghosting (i.e. sending data between different processors) of interface quantities to reduce the overall communication and memory footprint. We will propose ghosting strategies that take a measure of geometric proximity between master and slave nodes into account to pre-compute and reduce the list of master nodes/elements to be communicated.
• To efficiently parallelize the evaluation of mortar terms, we will start from a baseline approach where interfacial subdomains are aligned with the subdomains of the underlying bulk domain. This method is straightforward to implement, preserves data locality, and reduces communication between parallel processes. However, it does not include all processes in the evaluation of the mortar terms and, thus, is not scalable. We will then devise strategies for redistributing the interface domain decomposition in order to increase parallel efficiency and scalability of the mortar evaluation.
• As the contact configuration and area often change over the course of a simulation, we will propose a dynamic load balancing scheme. To this end, we will monitor characteristic quantities of the parallel evaluation of all mortar terms and trigger an adaptation of the interface domain decomposition if the current state and computational behavior of the simulation indicate a deterioration of parallel performance.
We will discuss these approaches in detail and demonstrate their scaling behavior and applicability to large three-dimensional problems. Although our current work studies scalable computational kernels for mortar methods in the context of classical finite element analysis, all findings are equally valid for isogeometric mortar methods (i.e. NURBS-based interface discretizations). The remainder of this paper is organized as follows: After a brief description of the contact problem, its discretization, and suitable solution techniques in Section 2, the implications of storing mortar discretizations on distributed memory machines will be discussed in Section 3. Domain decomposition approaches for an efficient evaluation of the mortar integrals will then be developed in Section 4. Section 5 presents several numerical studies to assess communication patterns and demonstrate the parallel scalability of the proposed methods in the context of computational contact mechanics, before we conclude with some final remarks in Section 6.

Problem formulation and finite element discretization
While mortar methods are applicable to a broad spectrum of problems and partial differential equations (PDEs), finite deformation contact problems are nowadays certainly one of the most appealing and challenging application areas for mortar methods in computational mechanics. Hence, we focus on contact problems now, but keep the generality of mortar evaluations in mind.

Governing equations
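In general, mortar methods allow for the coupling of several physical domains governed by PDEs through enforcing coupling conditions at various coupling surfaces or interfaces. Without loss of generality, we focus our presentation on the two-body contact problem with bodies $\Omega^{(1)}$ and $\Omega^{(2)}$, which potentially come into frictionless contact along their contact boundaries $\Gamma_*^{(1)}$ and $\Gamma_*^{(2)}$, respectively. Each subdomain $\Omega^{(i)}$, $i \in \{1, 2\}$, is governed by the initial boundary value problem of finite deformation elasto-dynamics, reading

\[
\begin{aligned}
\operatorname{Div} P + \hat{b}_0 &= \rho_0 \ddot{u} && \text{in } \Omega_0 \times (0, T], \\
u &= \hat{u} && \text{on } \Gamma_u \times (0, T], \\
P n_0 &= \hat{h}_0 && \text{on } \Gamma_\sigma \times (0, T], \\
u(X, 0) &= \hat{u}_0(X) && \text{in } \Omega_0, \\
\dot{u}(X, 0) &= \hat{\dot{u}}_0(X) && \text{in } \Omega_0,
\end{aligned}
\]

with the unknown displacement field $u$, the first Piola-Kirchhoff stress tensor $P$, the body force vector $\hat{b}_0$, density $\rho_0$, normal vector $n_0$, and traction vector $\hat{h}_0$ in the initial configuration $\Omega_0$. Here, $\Gamma_u$ and $\Gamma_\sigma$ denote the Dirichlet and Neumann boundaries, and $T$ is the end time. Furthermore, prescribed boundary and initial values are marked with $\hat{(\bullet)}$. First and second time derivatives are given as $\dot{(\bullet)}$ and $\ddot{(\bullet)}$, respectively. For frictionless contact, the contact constraints are typically given by the Hertz-Signorini-Moreau conditions, reading

\[
g_n \geq 0, \quad p_n \leq 0, \quad p_n \, g_n = 0 \quad \text{on } \Gamma_* \times (0, T],
\]

with the contact pressure $p_n$ along the contact interface $\Gamma_*$ and the gap function $g_n$ denoting the normal distance between the two bodies. To later distinguish between the two sides of the contact interface, we follow the traditional naming scheme and refer to $\Gamma_*^{(1)}$, which carries the Lagrange multiplier, as the so-called "slave" side $\Gamma_*^{sl}$, while $\Gamma_*^{(2)}$ denotes the "master" side $\Gamma_*^{ma}$.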
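To make the Hertz-Signorini-Moreau conditions concrete, the following Python sketch checks them pointwise for a given pair of gap and contact pressure. It is a minimal illustration only: the sign convention (non-negative gap, non-positive contact pressure) is an assumption that is common in the mortar contact literature, and the function name `hsm_satisfied` as well as the tolerance are hypothetical.

```python
def hsm_satisfied(g_n: float, p_n: float, tol: float = 1e-12) -> bool:
    """Check the Hertz-Signorini-Moreau conditions at a single point.

    Assumed sign convention (an assumption of this sketch):
      g_n >= 0        no penetration,
      p_n <= 0        no adhesive (tensile) contact pressure,
      p_n * g_n == 0  complementarity.
    """
    no_penetration = g_n >= -tol
    no_adhesion = p_n <= tol
    complementarity = abs(p_n * g_n) <= tol
    return no_penetration and no_adhesion and complementarity


# An open gap without pressure (inactive) and a closed gap with
# compressive pressure (active) are both admissible states.
print(hsm_satisfied(g_n=0.1, p_n=0.0))
print(hsm_satisfied(g_n=0.0, p_n=-5.0))
# An open gap that still transmits pressure violates complementarity.
print(hsm_satisfied(g_n=0.1, p_n=-5.0))
```

The two admissible states correspond to the inactive and active branches that an active set strategy has to distinguish in every nonlinear iteration.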
Since this paper is concerned with the efficient evaluation of the mortar terms on parallel computing clusters, we will detail the discretization of all mortar-related terms in Section 2.2. However, to keep the focus tight and concise, we refer to the extensive literature for any further details on the finite element formulation and discretization [31,48,49,56,57,76,80], the solution of the nonlinear problem via active set strategies [37,43,45,44,55], as well as for details on the structure of the arising linear systems of equations and efficient solvers thereof [2,14,67,69,70,72,71,77].

Discretization
In order to perform the spatial discretization with FEM, we assume the existence of a weak form of the contact mechanics problem summarized in Section 2.1. For the additional terms arising in contact mechanics, a Lagrange multiplier field λ is introduced into the weak form to enforce the contact constraints, leading to a mixed method with a variational inequality, where both the primal field u as well as the dual variable λ need to be discretized in space.
For the sake of a concise presentation, we skip the details of the FEM applied to the three-dimensional solid bodies $\Omega_0^{(i)}$, $i \in \{1, 2\}$. Considering the contact interface, we adopt from the volume discretization the isoparametric concept with the parameter coordinate $\xi = [\xi_1, \xi_2]$ and the shape functions $N_k(\xi)$ defined at node $k$ of all $n^{(1)}$ nodes on the discrete slave surface $\Gamma^{sl}_{*,h}$ and $N_\ell(\xi)$ defined at node $\ell$ of all $n^{(2)}$ nodes on the discrete master surface $\Gamma^{ma}_{*,h}$, respectively. The interpolation of the displacement field on element level is then given as

\[
u^{sl}_h = \sum_{k=1}^{n^{(1)}} N_k(\xi) \, \mathsf{u}_k^{(1)}, \qquad
u^{ma}_h = \sum_{\ell=1}^{n^{(2)}} N_\ell(\xi) \, \mathsf{u}_\ell^{(2)}, \tag{1}
\]

with the discrete nodal displacements $\mathsf{u}_k^{(1)}$ and $\mathsf{u}_\ell^{(2)}$. As usual in mortar methods, the Lagrange multiplier field $\lambda$ is discretized on $m^{(1)}$ nodes of the discrete slave surface $\Gamma^{sl}_{*,h}$, reading

\[
\lambda_h = \sum_{j=1}^{m^{(1)}} \Phi_j(\xi) \, \lambda_j, \tag{2}
\]

where $\Phi_j(\xi)$ denotes the Lagrange multiplier shape function at node $j$. Thereby, either standard or dual shape functions can be used.
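Inserting (1) and (2) into the contact virtual work

\[
\delta W_\lambda = \int_{\Gamma_*} \lambda \cdot \left( \delta u^{sl} - \delta u^{ma} \right) \mathrm{d}\Gamma
\]

yields

\[
\delta W_{\lambda,h} = \sum_{j=1}^{m^{(1)}} \sum_{k=1}^{n^{(1)}} \lambda_j^T \underbrace{\left( \int_{\Gamma^{sl}_{*,h}} \Phi_j \, N_k^{(1)} \, \mathrm{d}\Gamma \right)}_{D[j,k]} \delta \mathsf{u}_k^{(1)}
- \sum_{j=1}^{m^{(1)}} \sum_{\ell=1}^{n^{(2)}} \lambda_j^T \underbrace{\left( \int_{\Gamma^{sl}_{*,h}} \Phi_j \left( N_\ell^{(2)} \circ \chi_h \right) \mathrm{d}\Gamma \right)}_{M[j,\ell]} \delta \mathsf{u}_\ell^{(2)}, \tag{3}
\]

where the superscripts $(1)$ and $(2)$ mark shape functions of the slave and master side and $\mathsf{u}_k^{(1)}$, $\mathsf{u}_\ell^{(2)}$ denote nodal displacements. The mortar matrices $D$ and $M$ associated with the slave and master side of the coupling interface are then assembled from the nodal blocks $D[j,k]$ and $M[j,\ell]$ defined in (3), respectively. In general, both $D$ and $M$ are rectangular matrices. If $m^{(1)} = n^{(1)}$ (which is common practice except for a few cases, e.g. higher-order FEM [48,49,57]), $D$ becomes square. Furthermore, if $\Phi_j$ are chosen as so-called dual shape functions that satisfy a biorthogonality relationship with the standard shape functions $N_j$, then $D$ becomes a diagonal matrix and, thus, easy and computationally cheap to invert [31,48,49,57,64,76,78,80]. We stress that both summands in (3) contain integrals over the slave side $\Gamma^{(1)}_{*,h}$ of the discrete coupling surface, where the discretization is indicated by the additional subscript $(\bullet)_h$. A suitable discrete mapping $\chi_h : \Gamma^{ma}_{*,h} \to \Gamma^{sl}_{*,h}$ from the master side to the slave side of the coupling interface is required, because the discrete coupling surfaces $\Gamma^{ma}_{*,h}$ and $\Gamma^{sl}_{*,h}$ do not coincide anymore in general, especially when considering non-matching meshes on curved interfaces. These projections are usually based on a continuous field of normal vectors defined on the slave side $\Gamma^{sl}_{*,h}$, cf. [55,85]. We note that the mortar matrices $D$ and $M$ also occur in the discrete representation of the Hertz-Signorini-Moreau conditions, cf. [56] for example.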
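The effect of dual shape functions on the structure of D can be checked numerically in the simplest possible setting. The following Python sketch is an illustration under stated assumptions, not the three-dimensional implementation discussed in this paper: it uses a single two-node line element on the parameter interval [-1, 1], the well-known dual shape functions of that element, and a two-point Gauss rule. All function names are hypothetical.

```python
import math

# Two-point Gauss-Legendre rule on [-1, 1]; exact for polynomials up to degree 3
GAUSS_POINTS = (-1.0 / math.sqrt(3.0), 1.0 / math.sqrt(3.0))
GAUSS_WEIGHTS = (1.0, 1.0)


def standard_shape(xi):
    """Linear Lagrange shape functions N_1, N_2 of a two-node line element."""
    return (0.5 * (1.0 - xi), 0.5 * (1.0 + xi))


def dual_shape(xi):
    """Dual shape functions Phi_1, Phi_2, biorthogonal to N_1, N_2."""
    return (0.5 * (1.0 - 3.0 * xi), 0.5 * (1.0 + 3.0 * xi))


def local_d(multiplier_shape, jacobian=1.0):
    """Element-level scalar entries D[j][k] = int Phi_j N_k dGamma."""
    d = [[0.0, 0.0], [0.0, 0.0]]
    for xi, weight in zip(GAUSS_POINTS, GAUSS_WEIGHTS):
        phi = multiplier_shape(xi)
        n = standard_shape(xi)
        for j in range(2):
            for k in range(2):
                d[j][k] += weight * phi[j] * n[k] * jacobian
    return d


d_standard = local_d(standard_shape)  # fully populated 2x2 block
d_dual = local_d(dual_shape)          # off-diagonal entries vanish
print(d_standard)
print(d_dual)
```

With standard multiplier shape functions the off-diagonal entries are non-zero, whereas the dual choice leaves only the diagonal entries, each equal to the integral of the corresponding standard shape function.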

Evaluation of mortar integrals
In general, the evaluation of both $D[j,k]$ and $M[j,\ell]$ in (3) requires information from both the discrete slave interface $\Gamma^{sl}_{*,h}$ and the discrete master interface $\Gamma^{ma}_{*,h}$. Firstly, this inevitably involves the discrete mapping $\chi_h$ to project finite element nodes and quadrature points between slave and master sides. In practice, mortar integration is often performed on a piecewise flat geometrical approximation of the slave surface $\Gamma^{sl}_{*,h}$ as proposed in [58]. For further details and an in-depth mathematical analysis, see [18,60,61]. Secondly, the slave-sided integration domain $\Gamma^{sl}_{*,h}$ has to be split into so-called mortar segments, such that both $\Phi_j^{(1)}$ and $N_\ell^{(2)}$ are $C^1$-continuous on these segments, as kinks in the function to be integrated would deteriorate the achievable accuracy of the numerical quadrature. These mortar segments are arbitrarily shaped polygons, which will then be decomposed into triangles to perform quadrature. While the evaluation of $D[j,k]$ involves quantities solely defined on the slave interface $\Gamma^{sl}_{*,h}$, the evaluation of $M[j,\ell]$ requires integrating the product of master side shape functions $N_\ell^{(2)}$ and slave side shape functions $\Phi_j^{(1)}$ over the discrete slave interface $\Gamma^{sl}_{*,h}$. Algorithm 1 outlines the necessary steps to perform segmentation and numerical quadrature for one pair of slave and master elements. We refer to [58] for a detailed description of all steps outlined in Algorithm 1.

Algorithm 1: Segment-based mortar integration for three-dimensional problems
  for all slave elements do
    for all associated master elements in the vicinity of the current slave element do
      Project master elements onto slave side.
      Find mesh intersection of slave and master elements via a clipping algorithm, see e.g. [32].
      Divide clip polygon into triangular integration cells.
      Perform quadrature to compute entries of D[j, k] and M[j, ℓ] according to (3).
    end
  end
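The inner steps of Algorithm 1 can be sketched for a single slave/master pair as follows. This Python sketch rests on several assumptions that are not taken from the paper: both elements are convex, they have already been projected into a common plane, the clipping is done with the classical Sutherland-Hodgman algorithm (the paper refers to [32] for its clipping algorithm), and the integrand `f` is a stand-in for the product of shape functions in (3). All function names are hypothetical.

```python
def clip_convex(subject, clipper):
    """Sutherland-Hodgman clipping of `subject` against the convex `clipper`.

    Both polygons are lists of (x, y) vertices in counter-clockwise order.
    """
    def inside(p, a, b):
        # p lies to the left of (or on) the directed edge a -> b
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0.0

    def intersect(p, q, a, b):
        # Intersection of segment p-q with the infinite line through a and b
        dx1, dy1 = q[0] - p[0], q[1] - p[1]
        dx2, dy2 = b[0] - a[0], b[1] - a[1]
        t = ((a[0] - p[0]) * dy2 - (a[1] - p[1]) * dx2) / (dx1 * dy2 - dy1 * dx2)
        return (p[0] + t * dx1, p[1] + t * dy1)

    output = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        polygon, output = output, []
        for k in range(len(polygon)):
            p, q = polygon[k - 1], polygon[k]
            if inside(q, a, b):
                if not inside(p, a, b):
                    output.append(intersect(p, q, a, b))
                output.append(q)
            elif inside(p, a, b):
                output.append(intersect(p, q, a, b))
        if not output:
            break  # no overlap between the two elements
    return output


def fan_triangulate(polygon):
    """Split a convex polygon into triangles sharing its first vertex."""
    return [(polygon[0], polygon[i], polygon[i + 1]) for i in range(1, len(polygon) - 1)]


def integrate_over_cells(f, triangles):
    """Integrate f(x, y) per triangle with the edge-midpoint rule (exact for quadratics)."""
    total = 0.0
    for a, b, c in triangles:
        area = 0.5 * abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]))
        mids = (
            (0.5 * (a[0] + b[0]), 0.5 * (a[1] + b[1])),
            (0.5 * (b[0] + c[0]), 0.5 * (b[1] + c[1])),
            (0.5 * (c[0] + a[0]), 0.5 * (c[1] + a[1])),
        )
        total += area / 3.0 * sum(f(x, y) for x, y in mids)
    return total


# One slave element and one (already projected) master element
slave_element = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
projected_master = [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (0.5, 1.5)]

segment = clip_convex(projected_master, slave_element)  # mortar segment (clip polygon)
cells = fan_triangulate(segment)                        # triangular integration cells
area = integrate_over_cells(lambda x, y: 1.0, cells)
moment = integrate_over_cells(lambda x, y: x * y, cells)
print(segment, area, moment)
```

Even in this reduced setting, every slave/master pair triggers a clipping, a triangulation, and a quadrature loop, which indicates why the segment-based approach dominates the cost of the mortar evaluation.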
Although segment-based quadrature as described in Algorithm 1 undoubtedly delivers the highest achievable accuracy for the numerical integration of $D[j,k]$ and $M[j,\ell]$ in three dimensions, it comes at a high computational expense related to mesh projection and intersection, subsequent triangulation, and numerical quadrature. In practice and also in the present work, both mortar operators $D[j,k]$ and $M[j,\ell]$ are usually evaluated using segment-based integration in order to guarantee conservation of linear momentum [58]. More efficient but possibly less accurate integration algorithms have been discussed in [12,28,50,73].
Having today's parallel computing architectures with distributed memory in mind, the evaluation of (3) has two major implications for the software and algorithm design:
1. The evaluation of the integrands in (3) requires information from both the slave and master side. Slave data is readily available locally on each parallel process. The implementation also has to enable access to master side data that might be owned by another process or stored on a different compute node.
2. The computational cost and time is mostly associated with numerical integration over the slave side of the interface. Parallelization can reduce the computational time by distributing the integration domain, i.e. the slave interface, over multiple parallel processes.
Therefore, we deduce the following requirements:
R1: Enable access to all required slave and master data during evaluation of mortar integrals while keeping the memory demand and parallel communication low.
R2: Use parallel resources efficiently for numerical integration over the slave side of the mortar interface, also targeting parallel scalability.
We will elaborate on these implications in Sections 3 and 4 and outline various approaches to satisfy both requirements R1 and R2 in the context of parallel computing.

Storing data of the contact interface on a parallel machine
When executing the FEM solver on a parallel machine, data needs to be distributed among the different MPI ranks or compute nodes. Now, we first summarize the basics of overlapping domain decomposition to distribute chunks of the discretization to individual processes. Then, we discuss the implications on access to the relevant interface data during contact evaluation, before we present and discuss several strategies to ensure access to the necessary data without excessive data redundancy. Overall, this section is devoted to strategies in order to satisfy our basic requirement R1.

Overlapping domain decomposition
We base our considerations on the existence of an FEM solver that can be executed on parallel computers with a multitude of CPUs and/or compute nodes using a distributed memory architecture. In our case, this FEM solver is our in-house code Baci [1]. For optimal parallel treatment, the code base utilizes overlapping domain decomposition (DD) techniques [22,63,66,68]. Using $n^{proc}$ to denote the number of available parallel processes, the computational domain $\Omega$ is divided into $n^{proc}$ subdomains $\Omega_m$, $m \in \{0, 1, \ldots, M - 1\}$. A one-to-one mapping of subdomains to processes is employed, such that $n^{proc} = M$. An exemplary overlapping DD into four subdomains distributed to processes $p \in \{0, 1, 2, 3\}$ is shown in Figure 2. While each node in the finite element discretization is uniquely assigned to a subdomain $\Omega_m$, elements might span subdomain boundaries. We stress that processes can only access data of nodes that they own themselves. This has implications on finite element evaluation and assembly: A process $p$ can only assemble into those entries of the global residual vector and those rows of the global Jacobian matrix that are associated with nodes in $\Omega_p$. Hence, elements that span across subdomain boundaries will be evaluated by all processes that own at least one of this element's nodes such that each process can assemble quantities associated with its own nodes. This requires communication of data prior to the evaluation, i.e. data of off-process nodes needs to be communicated. This is often referred to as ghosting. Ideally, subdomains exhibit a small surface-to-volume ratio to minimize the amount of data subject to ghosting.
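The relation between node ownership, multiply evaluated elements, and ghosting can be sketched in a few lines of Python. The data layout below (a list of element connectivities and a dictionary of node owners) and the function name are hypothetical and serve illustration only; they do not reflect the parallel data structures of an actual FEM code.

```python
def evaluation_and_ghosting(elements, node_owner, n_proc):
    """Determine, per process, the elements to evaluate and the nodes to ghost.

    elements:   list of tuples of node ids (element connectivity)
    node_owner: dict mapping node id -> owning process
    """
    eval_elements = {p: [] for p in range(n_proc)}
    ghost_nodes = {p: set() for p in range(n_proc)}
    for ele_id, nodes in enumerate(elements):
        owners = {node_owner[n] for n in nodes}
        # Every process owning at least one node evaluates the element ...
        for p in owners:
            eval_elements[p].append(ele_id)
            # ... and needs the off-process nodes of that element as ghosts.
            ghost_nodes[p].update(n for n in nodes if node_owner[n] != p)
    return eval_elements, ghost_nodes


# Chain of four line elements; nodes 0-2 live on process 0, nodes 3-4 on process 1
elements = [(0, 1), (1, 2), (2, 3), (3, 4)]
node_owner = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}

eval_elements, ghost_nodes = evaluation_and_ghosting(elements, node_owner, n_proc=2)
print(eval_elements)
print(ghost_nodes)
```

In this example, only the element spanning the subdomain boundary is evaluated twice, and each process ghosts exactly one node, which reflects the benefit of a small surface-to-volume ratio of the subdomains.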
In our code base Baci, we employ the hypergraph partitioning package Zoltan [11] with the ParMETIS backend to decompose the computational domain $\Omega$ into $n^{proc}$ subdomains $\Omega_m$. Parallel data structures and parallel linear algebra are enabled through the Trilinos packages Epetra, Tpetra, and Xpetra. Iterative solvers for sparse systems of linear equations are taken from the Trilinos packages AztecOO [38] and Belos [4] with scalable multi-level preconditioners from ML [33] and MueLu [9].

Figure 2: Exemplary overlapping domain decomposition and parallel assembly involving four subdomains $\Omega_m$, $m \in \{0, 1, 2, 3\}$, assigned to four parallel processes $p \in \{0, 1, 2, 3\}$. Since each process can only assemble into unknowns of owned nodes, elements spanning across the subdomain boundaries need to be evaluated by multiple processes. This requires ghosting of nodes and elements, which entails parallel communication among multiple processes. Legend: subdomain boundaries of $\Omega_m$, subdomain $m$ on process $p = m$, owned nodes, ghosted nodes, owned elements, elements integrated by multiple processes.

Implications of distributed memory on the contact search and evaluation
Without loss of generality and for ease of presentation, we assume that the entire discretization of a two-body contact problem has undergone an overlapping DD and that each subdomain $m \in \{0, \ldots, M - 1\}$ has been assigned to a process $p \in \{0, \ldots, n^{proc} - 1\}$. For the purpose of illustration, we will discuss the case of $n^{proc} = 3$ subdomains and further assume that every process owns a part of the master and of the slave interface as illustrated in Figure 3. Please note that our considerations also hold if some processes only own a part of either the slave or the master side of the interface discretization or even if some processes do not own any portion of the contact interface at all. When process $p$ is performing contact search and evaluation on its share of the slave interface, it needs access to data from the geometrically close master side of the interface. In a parallel computing environment, the required data from the master side of the interface does not necessarily reside on that same process $p$. Still, access has to be enabled in order to
• identify pairs of slave/master elements that potentially are in active contact. This step is usually referred to as "contact search".
• evaluate the second integrand in (3), where the shape functions $N_\ell^{(2)}$ defined on the master side need to be evaluated and projected onto the slave side.
If the required data of the master side resides on a different parallel process $q$ than the current slave-sided process $p$, this data has to be communicated or "ghosted" (cf. Figure 2) from process $q$ to process $p$ in order to be known by process $p$. Therefore, the ghosting of the master interface discretization has to be extended. Since such an extension will impact the inter-process communication demand as well as the on-process memory demand, we will introduce models for communication and memory demands in Section 3.3. More importantly, we will discuss various approaches for extending the ghosting of the master interface discretization in Sections 3.4 and 3.5, where we will also discuss the impact of these ghosting extension strategies on the memory demand.

Figure 3: Without particular measures, DDs of master and slave side of the interface distribute each interface side to some processes. Geometrically close portions of the master and slave interface are not guaranteed to reside on the same process. Without further measures, each process $p$ can only identify possibly contacting pairs of slave/master elements from the subset of master elements owned by process $p$, i.e. master elements that reside in the set $\Omega_p \cap \Gamma^{ma}_*$. This can and needs to be alleviated by extending the ghosting of the master side of the interface. (For simplicity of visualization, coloring of ownership omits ghosted elements stemming from the overlapping interface DD.)

Models for communication and memory demand
Starting from an overlapping DD and distributed storage of both interface discretizations, data needs to be communicated among processes to facilitate the mortar evaluation. We will use $\sigma$ to denote the amount of data to be sent over the interconnect of all compute nodes and processes. Since data related to the slave side of the interface discretization just remains on its process $p$, $\sigma^{sl}_p = 0$. As has already been indicated in Figure 3, process $p$ owning the portion $\Gamma^{sl}_m$ of the slave side of the mortar interface requires the master side's data from those processes owning the geometrically close master elements. Hence, usually $\sigma^{ma}_p > 0$, especially if a situation as depicted in Figure 3 occurs. Although an explicit expression to compute $\sigma^{ma}_p$ cannot be given, as it highly depends on the software implementation at hand, it is certainly related to the number of nodes $n^{nd}$ and elements $n^{el}$ to be communicated. We denote this relation by

\[
\sigma^{ma}_p = \zeta(n^{nd}, n^{el}) \tag{4}
\]

with $\zeta(n^{nd}, n^{el})$ referring to an implementation-specific measure describing the cost of parallel communication. The total amount of data to be communicated to all processes $p$ sums up to

\[
\sigma = \sum_{p=0}^{n^{proc}-1} \left( \sigma^{sl}_p + \sigma^{ma}_p \right) = \sum_{p=0}^{n^{proc}-1} \sigma^{ma}_p. \tag{5}
\]

Obviously, $\sigma$ increases with an increasing number of subdomains. More importantly, however, it is impacted by the individual contributions $\sigma^{ma}_p$. Especially when the number of subdomains required to solve a given problem is fixed, reducing $\sigma^{ma}_p$ is key to reducing the overall cost of communication. Naturally, $\sigma = 0$ if $n^{proc} = 1$.
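To illustrate how the communication model can be evaluated, the following Python sketch sums the per-process master contributions, reading the model as a sum over all processes with vanishing slave contributions. Since the cost measure ζ is implementation-specific, the linear form used here, including its byte counts per node and element, is purely an assumption; all names are hypothetical.

```python
def zeta(n_nd, n_el, bytes_per_node=24, bytes_per_element=40):
    """Hypothetical linear cost measure; the actual zeta is implementation-specific."""
    return bytes_per_node * n_nd + bytes_per_element * n_el


def total_communication(master_data_per_process):
    """Total amount of data sigma as the sum of all master contributions.

    master_data_per_process: one (n_nd, n_el) tuple per process, giving the
    master nodes and elements this process must receive. Slave data stays
    local and therefore does not contribute.
    """
    return sum(zeta(n_nd, n_el) for n_nd, n_el in master_data_per_process)


# Three processes; the last one already owns all master data it needs
received = [(120, 100), (80, 60), (0, 0)]
sigma = total_communication(received)
print(sigma)

# A single process never needs to receive anything
print(total_communication([(0, 0)]))
```

Under such a model, reducing the number of nodes and elements each process has to receive is the only lever to lower the total communication once the number of processes is fixed.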
From the domain decomposition of the underlying bulk field, the memory demand $s^{\Omega}_p$ per process $p$ is given. For the mortar interface discretizations, we use $s^{sl}_p$ to denote the memory demand of the slave interface portion $\Gamma^{sl}_m$ on process $p$. Furthermore, $s^{ma}_p$ refers to the memory demand of the master interface portion $\Gamma^{ma}_m$ on process $p$. Then, the total memory demand $s_p$ on process $p$ is given as

\[
s_p = s^{\Omega}_p + s^{sl}_p + s^{ma}_p. \tag{6}
\]

Figure 4: Fully redundant storage of the master interface discretization: Solid lines indicate data that is owned by a particular process due to the initial DD. Dashed lines indicate data that is available through the extended ghosting. With fully redundant storage of the master discretization on each process, each process $p$ can immediately identify all pairs of slave/master elements that are possibly in active contact.
Note that $s_p$ includes the amount of memory required for owned nodes/elements as well as for ghost nodes/elements originating from the overlapping DD with an element overlap of 1. We stress that $s^{\Omega}_p$ is fully determined by the overlapping DD of the underlying bulk fields and that $s^{sl}_p$ is only governed by the overlapping DD of the slave interface discretization, which might arise from any of the schemes proposed in Section 4 later. At this point, only the master interface's contribution $s^{ma}_p$ can be controlled by choosing a specific ghosting extension strategy.

Redundant storage: the straightforward case
Probably the most straightforward remedy for the issue of undetected master/slave pairs illustrated in Figure 3 is to fully extend the master side's ghosting to all processes, i.e. to store the entire master side of the interface redundantly on every process $p$. This scenario of distributed storage of the slave interface discretization, but redundant storage of the master interface discretization is illustrated in Figure 4 for an exemplary number of three processes. The slave interface $\Gamma^{sl}_*$ is decomposed into three subdomains and distributed to the processes 'proc 0', 'proc 1', and 'proc 2', indicated by coloring. The master interface $\Gamma^{ma}_*$ starts out from its initial DD (colored boxes with solid lines) as already seen in Figure 3. Then, its ghosting is extended over the entire master interface $\Gamma^{ma}_*$ (colored boxes with dashed lines), such that $\Gamma^{ma}_*$ is now stored redundantly on all three processes. The redundant storage of the master side of the interface requires only a one-time setup and communication cost at the beginning of the simulation in order to extend the ghosting of master data to the entire master interface, but then enables access to all master interface data from every process $p \in \{0, 1, \ldots, n^{proc} - 1\}$. After the ghosting has been extended following the idea of fully redundant storage, all algorithmic steps, e.g. the contact search or the evaluation of (3), can be performed immediately without further communication among parallel processes.
In terms of the communication cost σ, however, this approach is rather expensive: since the entire master discretization needs to be communicated to every slave processor p, the total communication cost can be estimated via (4) and (5) as
σ ≈ n proc ζ(n (2) , n el,ma ), (7)
where all nodes and elements of the master side of the interface discretization enter the cost estimate. The model (7) suffers only from a slight over-estimation, since a part of the master surface might already be located on the target process and, thus, does not need to be communicated. Yet, this over-estimation becomes smaller for an increasing number of subdomains.
Since the entire master discretization has to be stored on each process along with a portion of the slave discretization, the memory demand of this approach can grow quite excessively when going to large master interface discretizations. The maximum problem size, for which this strategy still works, cannot be given theoretically. It strongly depends on several key factors, for example the exact specifications of the computing hardware or intricate details of the software implementation. Considering the memory model (6), the per-process master contribution s ma p has to be replaced by the memory consumption s ma of the entire master interface since each process stores the entire master discretization. Since s ma grows with mesh refinement, the total storage demand s p on process p is not bounded. This limits the applicability of redundant storage to small and medium sized interface discretizations, depending on the hardware at hand.
Besides the possibly unbounded memory demand, fully redundant ghosting of the master side also comes with a run-time cost: when process p loops over all of its nodes/elements of the master discretization, then it actually loops over all nodes/elements of the entire master discretization, although most of the nodes/elements are irrelevant on process p as they are not located in the geometric vicinity of process p's slave nodes/elements. Naturally, the code is not aware of any concept of vicinity prior to the contact search, so this cost cannot be avoided with this approach.

Distributed storage: going to large problems
As soon as the memory demand s p exceeds the available memory on a computing node, redundant storage as described in Section 3.4 should not be applied anymore to avoid performance degradation due to memory swapping. Following (6), the total storage demand s p per process can be reduced by reducing the storage demand of the master interface. In particular, when storing also the master interface discretization in a distributed fashion, its storage demand per process can be reduced to s ma p < s ma for p ∈ {0, . . . , n proc − 1} , n proc ≥ 2. Similarly, when the growth in run-time for loops over master nodes/elements becomes prohibitive, reducing the portion of the master interface stored on each process p is expected to speed up simulations. Still, each portion of the slave interface needs to have access to those parts of the master interface that reside in its geometric proximity (cf. Figure 3). In turn, measuring geometric proximity requires access to all pairs of slave and master nodes.
This situation can be remedied by different algorithmic modifications: Within a token-based evaluation strategy, e.g. inspired by Round-Robin (RR) scheduling [13], the parallel decomposition and distribution of the slave interface is fixed. On the master side, just the decomposition into subdomains is fixed, while the subdomain-to-process mapping is shifted by one process per RR iteration until every process has owned each master interface subdomain once. Since an RR loop requires n proc iterations for a complete evaluation of all slave elements, its run-time cost is high and has even proven to be prohibitive in large-scale applications, which we have also observed in our own experiments.
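The token-based idea can be made concrete with a minimal sketch. It assumes one master subdomain per process and a cyclic shift by one process per iteration; the function names and the shift direction are hypothetical and only serve to illustrate why n proc iterations are needed before every process has held every master subdomain.

```python
def round_robin_schedule(n_proc):
    """Yield, for each RR iteration, the master subdomain held by each process.

    Assumption: process p initially holds master subdomain p, and the
    subdomain-to-process mapping is shifted cyclically by one per iteration.
    """
    for iteration in range(n_proc):
        yield [(p + iteration) % n_proc for p in range(n_proc)]


def subdomains_seen(n_proc):
    """Collect which master subdomains each process has held after a full RR loop."""
    seen = [set() for _ in range(n_proc)]
    for mapping in round_robin_schedule(n_proc):
        for p, master_subdomain in enumerate(mapping):
            seen[p].add(master_subdomain)
    return seen
```

After a full loop, every process has seen all master subdomains, but only at the price of n proc sequential iterations, each of which involves communication of master data.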
As an alternative, the incorporation of the notion of proximity already into the extension of the master side's ghosting offers a promising solution. Hence, we resort to pre-computing ghosting data based on a geometrically motivated binning approach, where we exploit the fact that the contact search needs to identify all master elements in the proximity of a given slave element. This idea is inspired by [54], where a similar parallel algorithm is used for the spatial decomposition of atoms in short-range molecular dynamics simulations.
In the context of mortar methods, we will first construct an axis aligned bounding box around the mortar interface, i.e. a cuboid box that is oriented along the Cartesian axes and encloses all nodes of the mortar interface. Then, this bounding box will be covered with a set of Cartesian bins that are independent of the finite element meshes of the contacting bodies (cf. Figure 5). Since the contacting bodies are moving relative to the background bins, slave nodes or elements can migrate between individual bins over time. In order not to lose track of individual nodes or elements due to this motion, the minimal bin size β min is chosen as
$$\beta_{\min} = \max_{n^{\mathrm{el,sl}}} h^{\mathrm{sl}} + 2 \cdot \Delta t \cdot \bar{u}_*$$
with max n el,sl h sl being the largest element edge of the slave discretization, ∆t representing the time step size, u * denoting the vector of nodal interface velocities, and the overbar referring to the mean value. If the interface velocity is not available in static problems, it can be replaced by a finite difference approximation w.r.t. the previous load step. Analogously, the axis aligned bounding box embracing all mortar nodes is expanded by β min in each direction. Then, the actual bin size β and number of bins per direction is computed based on the dimensions of the expanded axis aligned bounding box and the minimal bin size β min . We then apply Algorithm 2 to compute process-specific lists {e gh } ma p of master elements to be ghosted for each process p.
Figure 5: Extended ghosting of the master interface using a binning scheme. We exemplarily show three parallel processes and depict each one in its own sketch for the sake of presentation. Bins are sketched in solid orange lines. On Γ sl * , mesh entities (such as nodes and elements) are owned by the respective process anyway. On Γ ma * , solid lines indicate data that is owned by this process, while dashed lines indicate data that has been ghosted via binning.
Algorithm 2: Geometrically motivated binning to pre-compute ghosting of the master interface
Sort all n el,sl slave elements into bins
Sort all n el,ma master elements into bins
for each process p do

Figure 5 illustrates the binning approach detailed in Algorithm 2 for three processes. For 'proc 0', no further ghosting is required in this example, since all required master elements already reside in the neighboring bins of the set of bins {B sl 0 } enclosing all slave elements of Γ sl * ,0 . In contrast, the master elements owned by 'proc 1' are not contained in {B sl 1 } and do not participate in the evaluation of mortar terms in Γ sl * ,1 . The master elements of interest, i.e. those in the neighboring bins of {B sl 1 }, need to be ghosted, which leaves out the master elements in the leftmost bin covering Γ ma * . Finally, 'proc 2' already owns some of the required elements and only needs to ghost some additional elements.
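The binning idea described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions and not the implementation used in the paper: each element is represented by a single point (e.g. its centroid) and therefore falls into exactly one bin, all data lives in one address space instead of being distributed via MPI, and all function and argument names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def min_bin_size(max_slave_edge, dt, mean_velocity):
    """Minimal bin size: largest slave element edge plus twice the mean travel per step."""
    return max_slave_edge + 2.0 * dt * mean_velocity


def bin_index(point, origin, bin_size):
    """Integer index of the Cartesian bin containing a point."""
    return tuple(int(math.floor((x - o) / bin_size)) for x, o in zip(point, origin))


def neighborhood(b):
    """The bin itself plus all adjacent bins (3^d bins in d dimensions)."""
    return {tuple(i + s for i, s in zip(b, shift))
            for shift in product((-1, 0, 1), repeat=len(b))}


def ghost_lists(slave, master, origin, bin_size):
    """Compute, per process, the master elements that need to be ghosted.

    slave, master: dict mapping element id -> (owning process, representative point).
    Returns a dict mapping process -> sorted list of master element ids that lie
    in or next to a bin holding one of its slave elements and are owned elsewhere.
    """
    master_bins = defaultdict(list)
    for element, (_, point) in master.items():
        master_bins[bin_index(point, origin, bin_size)].append(element)

    slave_bins = defaultdict(set)
    for _, (proc, point) in slave.items():
        slave_bins[proc].add(bin_index(point, origin, bin_size))

    ghosts = {}
    for proc, bins in slave_bins.items():
        candidate_bins = set().union(*(neighborhood(b) for b in bins))
        needed = {e for b in candidate_bins for e in master_bins.get(b, [])}
        ghosts[proc] = sorted(e for e in needed if master[e][0] != proc)
    return ghosts
```

In this sketch, master elements that a process already owns are excluded from its ghost list, mirroring the observation that some processes need little or no additional ghosting.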
The communication cost σ ma p for each processor p now depends on the number of nodes/elements in the current bin b and its neighboring bins. Due to the Cartesian character of bins, each bin has 8 or 26 neighbors in 2D or 3D, respectively. Based on a constant bin size and assuming uniform mesh sizes, the cost measure ζ per subdomain introduced in (4) is now evaluated with the nodes/elements of the neighboring bins only, yielding the estimate (8) for σ ma p with a scalar factor of 8 for 2D and 26 for 3D, which is a significant reduction for large core counts compared to (7). The scalar factors in (8) originate from the number of neighboring bins in 2D and 3D, respectively. Regarding memory demand as estimated via (6), the master side's demand s ma p now comprises all master elements stored on process p plus all master elements in neighboring bins. Assuming bin sizes similar to the size of subdomains Ω m as well as evenly sized master elements, the master side's storage demand is bounded by 5 × s ma p or 9 × s ma p for 2D and 3D problems, respectively. While the number of bins and, thus, the effort to sort master elements into bins increases with a smaller characteristic bin size β, the storage demand for each process p diminishes even more.

Intermediate discussion of ghosting strategies
So far, we have concerned ourselves with strategies to satisfy the requirement R1. Before addressing R2 in Section 4, we briefly discuss some properties of the presented strategies for the ghosting of the master interface.
While the fully redundant ghosting presented in Section 3.4 appears straightforward, easy to implement, and only needs to be done once at the beginning of the simulation, its runtime cost for communication as well as its memory demand can become prohibitive when going to large problems. The RR approach, in turn, alleviates the issue of excessive growth of memory demand. Yet, the number of necessary RR iterations equals the number of processes n proc , rendering this approach impractical for n proc ≫ 1 (especially as it has to be applied in every time/load step). Although the binning approach proposed in Section 3.5 needs to be applied in every time/load step, it appears as the only approach without impractical restrictions when going to large problem sizes: Through the choice of the number and size of the bins, the amount of data to be ghosted can be controlled, such that only those master elements will be ghosted that are likely to be required during contact search and evaluation. In sum, the applicability of the binning approach is not affected by the number of parallel processes, nor does it greatly impact the parallel communication or total memory demand.
We will later supplement our assessment with detailed numerical experiments in Section 5.1.1, but want to anticipate the main finding here: For the largest problems with 25M mesh nodes and 25k interface nodes, the process with the largest ghosting demand asks for the redundant ghosting of 25921 nodes, while binning reduces this number to 1212 nodes, which amounts to a reduction of more than 20×. On average across all MPI ranks, these numbers can be improved through load balancing which will be introduced in Section 4.

Balancing the work load among multiple parallel processes
Now, we discuss strategies for an optimal distribution of the work load to multiple parallel processes. These strategies are intended to satisfy the requirement R2 from Section 2.3. We assume that requirement R1 has already been satisfied by any of the methods described in Section 3 and, thus, all data is accessible whenever needed.
In Sections 4.1-4.3, we first present some general considerations applicable to all types of mortar interface problems, before we move to the specific scenario of dynamically evolving contact problems in Section 4.4.

The concepts of strong and weak scalability
When assessing the performance of a parallel code and/or algorithm, an important question is whether adding more computational resources will actually speed up the algorithm's performance at the proper rate. Two concepts are commonly followed and investigated:
• For a fixed problem size, strong scalability is given, if the computational time diminishes at the same rate as the used hardware resources grow. The strong scaling limit is reached, when increasing the hardware resources does not lead to a further reduction of computational time. See [3].
• Weak scalability expects a constant computational time when increasing the problem size and the parallel resources at the same rate, i.e. when the work load per process is kept constant. See [35].
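Both notions are often condensed into a parallel efficiency, where a value of 1 corresponds to ideal scaling. The following sketch uses these common definitions; the function names and the timings in the usage note are hypothetical and not taken from the measurements in this paper.

```python
def strong_scaling_efficiency(t_ref, p_ref, t, p):
    """Efficiency for a fixed problem size.

    Ideal strong scaling halves the time when doubling the processes,
    i.e. t = t_ref * p_ref / p, which yields an efficiency of 1.0.
    """
    return (t_ref * p_ref) / (t * p)


def weak_scaling_efficiency(t_ref, t):
    """Efficiency for a fixed work load per process.

    Ideal weak scaling keeps the time constant, i.e. t = t_ref.
    """
    return t_ref / t
```

For example, a run that takes 100 s on 1 process and 50 s on 4 processes has a strong scaling efficiency of 0.5, while a weak scaling series whose time grows from 10 s to 12.5 s has an efficiency of 0.8.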
As it is well established in many research and application codes (and also in our code base Baci [1]), weak scalability of the finite element evaluation of the pure bulk field (i.e. volume element evaluation) without the presence of any mortar interface can be achieved under uniform mesh refinement.

Curse of dimensionality
In surface-coupled problems with d spatial dimensions, the coupling surface is always a (d − 1)-dimensional geometric entity. Originally described in [7], this curse of dimensionality between the bulk and the interface discretization becomes problematic under uniform mesh refinement. Denoting the characteristic mesh size with h, the number of unknowns in the bulk discretization grows at O(h^d) while the surface discretization of the coupling interface exhibits a growth rate of O(h^(d−1)) only. This becomes evident in practice when a first and simple DD of the interface discretizations is now obtained by aligning the interface subdomains of the slave and master side with the subdomains of the underlying bulk discretizations. Although this approach is straightforward to implement and also avoids off-process assembly, thus reducing parallel communication, it does not result in an optimal parallel distribution for the evaluation of the mortar coupling terms. Since computing the interface contributions, i.e. the mortar segmentation process, integration and assembly of the mortar matrices D and M to only name the most important tasks, is all done on the slave interface discretization, all numerical tasks might be performed by very few parallel processes only, while others idle.
For simplicity of visualization, this is illustrated using a two-dimensional meshtying problem in Figure 6, where the domain decomposition of the mortar interface's slave and master side is fully aligned with the underlying bulk discretizations. Considering a coarse discretization distributed to four parallel processes as shown in Figure 6(a), the slave interface is divided into two subdomains and the master interface is owned by two processes only as well. Consequently, there are two processes that do not own a share of the slave interface, and two other processes not owning any node of the master interface. In Figure 6(b), the mesh has been refined by a factor of two in each direction, and 16 processes have been used such that the load per process remains constant in the bulk discretization. While the bulk discretization is now split into 16 subdomains, the slave and master interface are shared only among four processes each. In sum, only 4 processes tackle the expensive evaluation of mortar terms on the slave side of the interface, while 12 processes are completely left out. Even in this small and only two-dimensional problem, it becomes evident that the alignment of interface subdomains with bulk subdomains potentially leaves a huge fraction of all processes idle during interface evaluation. While it is true that using more processes improves the parallelization of the bulk discretization, it does not necessarily contribute to a good and scalable parallelization of the interface computations. We stress that this issue is even more pronounced for the three-dimensional case and larger numbers of parallel processes.
Figure 6, panel (b): 20/21 × 20/21 mesh using 16 processes.

Improving the domain decomposition of interface discretizations
To overcome the curse of dimensionality and to satisfy R2, we allow the slave and master side of the interface to be decomposed into subdomains independently from the underlying bulk discretizations in order to achieve optimal parallel scalability of the computational tasks associated with both the integration and assembly in the bulk domains Ω 1 and Ω 2 as well as integration and assembly on the interfaces Γ sl and Γ ma . In a first and straightforward approach, one can divide both interfaces Γ sl and Γ ma into n proc subdomains, such that each parallel process handles a portion of the interface as illustrated in Figure 7. This is particularly important for the slave side which needs to perform all computations related to the integration of the mortar terms in (3). For the coarse and the fine mesh, both Γ sl and Γ ma are distributed to 4 and 16 parallel processes, respectively. A clear advantage of this strategy is that all parallel processes participate in the interface treatment, so idling is mostly avoided. However, the fine mesh already indicates that the interface subdomains may become very small, i.e. they consist only of a few elements. Recalling the curse of dimensionality outlined in Section 4.2, this will become an issue at large scale where the bulk field is divided into n proc subdomains of reasonable size, while the subdomain size of the interface decreases when refining the mesh and adding parallel processes at the same rate. Having many but very small interface subdomains does not leave any process idle, but also yields interface subdomains with a large surface-to-volume ratio which indicates an increasing communication overhead. In sum, this strategy distributes the computational work of the interface evaluation more evenly to all processes than just adopting the interface subdomains from the underlying bulk discretization. The numerical experiments in Section 5 confirm this statement.
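To get a feeling for how small the interface subdomains become, the following sketch splits a given number of interface elements as evenly as possible among all processes. Balancing by element count alone is an assumption made for illustration; the function name is hypothetical and an actual partitioner may use different criteria.

```python
def interface_subdomain_sizes(n_elements, n_proc):
    """Number of interface elements per process for an even split.

    The first (n_elements mod n_proc) processes receive one extra element.
    """
    base, extra = divmod(n_elements, n_proc)
    return [base + 1 if p < extra else base for p in range(n_proc)]
```

For the two-block example studied in Section 5.1, the largest weak scaling case with 25,600 slave elements on 480 MPI ranks leads to only 53 or 54 slave elements per rank in this model, compared to 400 slave elements on the single rank of the smallest case.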
Conceptually, there is still room for further optimizations, in particular related to parallel communication among processes, e.g. by setting a lower bound n el min on the number of elements per interface subdomain to reduce the communication overhead. Such an approach needs to compromise between the amount of parallel communication among processes and the number of idling processes. In this work, we have refrained from exploring this research direction, since the distribution of the interface to all parallel processes already delivers satisfying scaling behavior for many practical applications.

Interface domain decompositions for dynamically evolving interfaces
In many applications, the interface configuration evolves over time, e.g. as in contact problems with large sliding or contact of rolling bodies. In such cases, the interface DD can come out of balance, with some processes doing significantly more work than others, which may idle. Then, a rebalancing can become necessary to distribute the computational work evenly to all participating processes.
In each time step, we track the time spent in the evaluation of all mortar terms for each processor as well as the number of slave elements per processor. We then estimate the imbalance among all processes by
$$\eta_t = \frac{\max_p \left(t_{\mathrm{eval},p}\right)}{\min_p \left(t_{\mathrm{eval},p}\right)}, \qquad \eta_e = \frac{\max_p n^{\mathrm{el,sl}}_p}{\min_p n^{\mathrm{el,sl}}_p}$$
with η t and η e denoting the imbalance in contact evaluation time and number of slave elements per processor, respectively. The theoretical optimum of a perfect balancing of the mortar-related workload is given for η t = 1 and η e = 1, respectively, i.e. when all processes spend exactly the same time in mortar evaluation and when all processes own the exact same number of slave elements. If in any time step these imbalance estimates exceed the user-given thresholds η̂ t ≥ 1 and η̂ e ≥ 1 for contact evaluation time and number of slave elements per processor, respectively, then we re-compute the interface DD to obtain a DD with better load balancing. As this load balancing procedure is triggered dynamically by the current state of the simulation, we refer to it as dynamic load balancing. Naturally, η̂ t = 1 will trigger rebalancing in every time step, such that each time step can rely on the best possible interface DD. In practice, the cost for rebalancing needs to be taken into account, such that practical computations require η̂ t > 1. We will study the impact of the actual choice of η̂ t on the run time in Section 5.2.
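The imbalance measures and the rebalancing trigger can be sketched as follows. How the two criteria are combined is not specified here, so the sketch exposes this choice as a flag; the flag, the function names, and the use of "greater than or equal" for the comparison (so that a threshold of 1 triggers in every step) are assumptions for illustration.

```python
def imbalance(values):
    """Ratio of the largest to the smallest per-process value (1.0 is perfectly balanced)."""
    smallest = min(values)
    if smallest <= 0:
        # A process without any work makes the ratio unbounded.
        return float("inf")
    return max(values) / smallest


def needs_rebalancing(eval_times, n_slave_elements, threshold_t, threshold_e,
                      require_both=False):
    """Decide whether to re-compute the interface domain decomposition.

    eval_times: mortar evaluation time per process in the current time step.
    n_slave_elements: number of slave elements per process.
    require_both: hypothetical switch between combining both criteria
    with a logical "and" (True) or "or" (False).
    """
    eta_t = imbalance(eval_times)
    eta_e = imbalance(n_slave_elements)
    exceeded = (eta_t >= threshold_t, eta_e >= threshold_e)
    return all(exceeded) if require_both else any(exceeded)
```

With evaluation times of 1.0 and 3.0 on two processes that own 100 slave elements each, the time-based measure is 3 and the element-based measure is 1, so thresholds of 1.5 and 1.2 trigger rebalancing only if one exceeded criterion suffices.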
The main difference between the two imbalance measures η t and η e is that η e does not account for the time to evaluate a given master/slave pair, while η t relies on actual wall clock timings. Thus, situations with η e ≫ 1, but η t fairly close to 1 can occur, if the contact search identifies a huge number of pairs of master and slave elements as close to each other, but the subsequent mortar evaluation cannot find a valid projection and, thus, most of the computational work to evaluate (3) is skipped for such pairs of elements. In sum, the time-based trigger η t is expected to be more effective to avoid idling processes in practical simulations.

Implication on finite element assembly and communication patterns
Although the slave side's interface discretization might exhibit its independent DD to improve scalability, all system quantities, e.g. the Jacobian matrix J and the residual vector f , are distributed among parallel processes following the DD of the underlying bulk discretizations. After evaluation of the mortar element matrices defined in (3) within a mortar element in interface subdomain Γ n , n ∈ {0, . . . , M − 1}, on process q ∈ {0, . . . , n proc − 1}, a contribution to J and f associated with node j in Ω m , m ∈ {0, . . . , M − 1}, owned by process p ∈ {0, . . . , n proc − 1} can only be assembled by process p. Hence, if p ≠ q, communication is required to send data from process q to process p in order to assemble into global system quantities. From the perspective of the evaluating process, this is referred to as off-process assembly. Communication can be avoided if and only if p = q.
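The distinction between local and off-process assembly can be illustrated with a small sketch that sorts the contributions computed on the evaluating process q by the owner p of the target node. The data layout and the function name are hypothetical; in an MPI-based code, the per-owner lists would be the payload of the messages sent from q to each p ≠ q.

```python
from collections import defaultdict


def split_contributions(evaluating_proc, contributions, node_owner):
    """Separate locally assembled contributions from those to be communicated.

    contributions: list of (node id, value) pairs computed on evaluating_proc.
    node_owner: dict mapping node id -> process owning the node in the bulk DD.
    Returns (local, send), where send maps each other owner to its contributions.
    """
    local = []
    send = defaultdict(list)
    for node, value in contributions:
        owner = node_owner[node]
        if owner == evaluating_proc:
            local.append((node, value))       # p = q: assemble directly
        else:
            send[owner].append((node, value))  # p != q: off-process assembly
    return local, dict(send)
```

An empty `send` dictionary corresponds to the case where interface and bulk DD are aligned and no communication is needed for assembly.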
It is true that off-process assembly increases the amount of communication and, thus, puts a cost burden onto the entire algorithm. Although this is not desirable, it is usually the much cheaper price to pay than to just stick to the one-to-one matching of interface and underlying bulk DDs. The speed-up of the cost-intensive evaluation of mortar terms through an independent DD of the interface discretizations easily amortizes the additional cost of communication related to off-process assembly. We will study timings of the mortar evaluation and off-process assembly in detail in the numerical experiments in Section 5.

Numerical experiments
We first study parallel redistribution and scalability in a simple two-block contact example in Section 5.1 before moving on to dynamic contact problems in Section 5.2.
All computations are done with our in-house multi-physics research code Baci [1]. All scaling studies have been run on our in-house cluster (20 nodes with 2x Intel Xeon Gold 5118 (Skylake-SP) 12 core CPUs, 196 GB RAM per node, Mellanox Infiniband Interconnect).

Contact of two cubes
For a first assessment of the scalability of the contact evaluation, we consider a simple two-block contact problem with a small block (dimensions 0.8 × 0.8 × 0.8) and a slightly bigger block (dimensions 1.0 × 1.0 × 1.0), where contact will occur between two flat surfaces of the blocks. To reduce the complexity of the contact problem and to exclude nonlinearities due to changes in the contact active set, the faces opposite to the contact interface are fixed with Dirichlet boundary conditions, while the blocks initially penetrate each other at the contact interface by 0.001. The smaller block acts as the slave side and its entire contact area is already initialized as "active". Application of the contact algorithms will then result in a slight compression of both blocks, such that the initial penetration vanishes. This problem setup allows us to isolate the computational effort spent on the redistribution, ghosting, and contact evaluation. In fact, for the parallel scaling studies, we only evaluate all contact terms, but then do not even solve the contact problem to allow for an even more concise focus on the scaling behavior of the contact evaluation.
Both blocks use a Neo-Hooke material with Young's modulus E = 10 and Poisson's ratio ν = 0.3. Denoting the mesh refinement factor with κ, both blocks are discretized with 5κ linear hexahedral elements along their edges.
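The mesh sizes resulting from a given refinement factor κ can be computed directly from this description. The sketch below assumes that both blocks are meshed independently with 5κ linear hexahedral elements per edge, that each node carries three displacement unknowns, and that the entire contact face of the slave block belongs to the slave interface; the function name is hypothetical.

```python
def two_block_sizes(kappa):
    """Problem sizes of the two-block example for mesh refinement factor kappa."""
    n_el_edge = 5 * kappa            # elements along each block edge
    n_node_edge = n_el_edge + 1      # nodes along each block edge (linear elements)
    nodes_per_block = n_node_edge ** 3
    return {
        "displacement_unknowns": 2 * 3 * nodes_per_block,  # two blocks, 3 dofs per node
        "slave_interface_nodes": n_node_edge ** 2,
        "slave_interface_elements": n_el_edge ** 2,
    }
```

With κ = 4 and κ = 32, this reproduces the sizes reported for the smallest and the largest problem of the weak scaling study, respectively.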
As an exemplary visualization, Figure 8 illustrates the assignment of subdomains to MPI ranks for a simulation with 24 MPI ranks. Since the discretization of both blocks uses the same number of elements per block, the volume DD exhibits 12 subdomains for each block. Without load balancing, the interface DD evidently matches the underlying volume DD (cf. the top right picture in Figure 8). In particular, the slave side of the interface is shared by only 6 (out of 24) processes, such that the remaining 18 processes idle during the expensive mortar evaluation. With load balancing, the DD of the solid volume is not affected, while the interface DD now yields 24 subdomains for both sides of the interface (cf. the bottom right picture in Figure 8). This allows the computational work for the mortar evaluation to be shared among all processes.

Weak scaling
We perform a weak scaling study. The smallest problem using 1 MPI rank consists of 55,566 displacement unknowns, while 441/400 nodes/elements reside on the slave side of the contact interface. The largest problem using 480 MPI ranks contains 25,039,686 displacement unknowns, with 25,921/25,600 nodes/elements located on the slave side of the contact interface. We target a load of ≈50k displacement unknowns per MPI rank under weak scaling conditions. Timing results are shown in Figure 9. With load balancing, the pure contact evaluation time remains constant under weak scaling conditions as shown in Figure 9(a) and as expected for finite element evaluations. Manifesting the curse of dimensionality described in Section 4.2 though, the case without load balancing does not equally benefit from adding hardware resources since most of the additional processes do not participate in the mortar evaluation. While the choice of load balancing does not impact the serial case (n proc = 1) of course, the contact evaluation without load balancing requires twice as much time on 2, 4, and 8 MPI ranks as with load balancing, since the processes owning a piece of the master side of the interface do not contribute to the contact evaluation. For an increasing number of MPI ranks, this gap increases.
Regarding the time spent in redistribution and extending the interface ghosting, t LB + t gh , an increase with an increasing number of MPI ranks is expected, as the size of the MPI communicator grows and, thus, mandates increased communication. This time component is rather independent of the parallel distribution, but is largely impacted by the ghosting strategy: Since the redundant ghosting of the master side requires communicating all interface nodes and elements of the master side to all MPI ranks, the timings for redundant ghosting exceed the time for ghosting via the geometrically motivated binning approach, where the amount of data to be communicated among processes is reduced based on geometric information.
It becomes evident from Figure 9(c), that the time for assembling of all contact terms into the global linear system is only slightly impacted by load balancing, while the impact of the ghosting strategy appears to be negligible.
Finally, we assess the total cost of contact evaluation which is the most relevant target quantity for practical applications. It is given by the total time t total = t LB + t gh + t eval + t ass for (possibly) redistributing, ghosting, evaluation, and assembly of the contact interface and is shown in Figure 9(d).
Again, ghosting via binning ("binning") results in a lower total time t total than the redundant ghosting of the master side ("redundant master"). Moreover, load balancing ("LB") allows all MPI ranks to participate in the evaluation of the contact terms, yielding a faster total contact time than without load balancing ("no LB"). Dominated by the contact evaluation time t eval , the case without load balancing does not scale beyond 8 MPI ranks, while load balancing shows good weak scalability up to 200 MPI ranks. Overall, our proposed strategy of load balancing in combination with binning delivers the fastest contact evaluation for all mesh sizes and also features the smallest increase in total contact time when increasing the problem size. Figure 10 illustrates the impact of both the load balancing and the ghosting strategy on the number of owned and ghosted master side elements by reporting the maximum number of elements per MPI rank among all processes. Naturally, load balancing, where all processes hold a portion of the contact interface, leads to a lower number of owned entities per processor than no load balancing, where the interface is stored only by a subset of all processes: Depicted by the dotted lines, the number of owned elements per MPI rank is smaller in case of load balancing than without load balancing, in particular by a factor of 100 for more than 48 MPI ranks in this example. The influence of the ghosting strategy is shown with dashed and solid lines: While binning (solid lines) just adds a small number of nodes or elements to be ghosted from other processes, fully redundant ghosting (dashed lines) drastically increases the number of ghosted elements. For large examples, this increase can exceed two orders of magnitude. We observe that the number of owned elements is consistently smaller than the number of ghosted elements when using load balancing, while this is not the case without load balancing. 
This peculiarity is just an artifact of the visualization, since the MPI rank with the maximum number of owned entities is not necessarily the same as the one with the maximum number of ghosted entities. Using the ratio of ghosted elements to owned elements to indicate the additional overhead in memory and parallel communication due to the distributed memory paradigm, we make the following key observation: ghosting via binning as proposed in Section 3.5 is much more efficient in terms of memory and parallel communication than fully redundant ghosting. Please note that the respective diagram for owned and ghosted nodes of the interface's master side essentially looks the same and, thus, is not shown for the conciseness of the presentation.

Strong scaling
To assess the strong scaling behavior under different load balancing and ghosting strategies, we study three different meshes and problem sizes detailed in Table 1. The strong scaling behavior is reported in Figure 11. While meshes 2M and 5M could be run in serial, mesh 10M did not fit into the memory of a single core. Hence, the graphs for the mesh 10M start at 3 MPI ranks, while 2M and 5M start at 1 MPI rank. Regarding the pure contact evaluation time t eval depicted in Figure 11(a), the curse of dimensionality as described in Section 4.2 leads to insufficient scaling behavior for the case without load balancing. For some cases, the contact evaluation time does not decrease (or even slightly increases) when adding more processes (cf. the mesh '2M' without load balancing executed on 3, 6, and 12 MPI ranks for example). In contrast, the proposed load balancing scheme delivers the expected strong scaling behavior across a wide range of MPI ranks, since all MPI ranks participate in the contact evaluation. Moreover, load balancing results in faster contact evaluation than no load balancing, independent of the mesh size and ghosting strategy. Naturally, strong scaling behavior of the contact evaluation time t eval is not affected by the choice of ghosting strategy.
Figure 11: Two-block contact: strong scaling of contact time
For the combined time t LB + t gh for redistribution and ghosting as shown in Figure 11(b), the timings are now dominated by the choice of ghosting strategy. In particular, fully redundant ghosting of the master interface (dashed lines) requires a consistently larger time across a wide range of MPI ranks. Ghosting via binning (solid lines) can benefit from additional hardware resources, until the strong scaling limit is reached and timings are increasing with an increasing number of MPI ranks. The effect of the load balancing strategy is negligible, but we note that the extra cost of performing a redistribution leads to slightly higher times with load balancing than without load balancing.
Considering the assembly of all contact terms into the global linear system, Figure 11(c) shows just a small difference with and without load balancing. Similar to the weak scaling study from Section 5.1.1, the ghosting strategy does not impact these timings. We observe good strong scalability for all studied cases.
Having in mind the overall goal of a fast time-to-solution, the total time $t_\mathrm{total} = t_\mathrm{LB} + t_\mathrm{gh} + t_\mathrm{eval} + t_\mathrm{ass}$ for (possibly) redistributing, ghosting, evaluation, and assembly of the contact interface is depicted in Figure 11(d). Again, the proposed load balancing strategy results in the best timings and in good, but not perfect scaling behavior. Stemming from the pure contact evaluation time $t_\mathrm{eval}$ (depicted in Figure 11(a)), the total time $t_\mathrm{total}$ without load balancing does not strictly follow the expected strong scaling behavior. Per definition of $t_\mathrm{total}$, this diagram combines all characteristics from Figures 11(a), 11(b), and 11(c), namely the better scaling of $t_\mathrm{eval}$ due to load balancing and the increase in the timing component $t_\mathrm{LB} + t_\mathrm{gh}$ for large numbers of MPI ranks due to increased communication and redistribution effort.
In sum, the best scaling behavior is achieved with the proposed approach of load balancing in combination with ghosting via binning. While both components affect the overall efficiency, the fastest evaluation times and the best weak and strong scaling behavior can only be achieved through the combination of load balancing with ghosting via binning. So far, we have limited our analysis to static contact problems without any changes in the contact zone, where the proposed algorithms demonstrate their beneficial effect on the run time and the weak and strong scaling behavior, but could not unfold their full potential. Therefore, we now move to dynamic contact problems, where the contact zone changes over time and, thus, the load balancing is expected to show an even better effect on the scalability and performance.

Rolling cylinder with dynamic contact
This example studies the behavior of parallel algorithms for dynamic contact problems, i.e. for unilateral contact problems where the contact zone is changing over time. This will exercise the parallel redistribution of the contact interface discretization to its full extent.
The problem is configured as follows: An elastic hollow cylinder is pushed onto a deformable block with initially flat surfaces. After contact has been established, a rotating motion is imposed on the inner surface of the hollow cylinder, somewhat mimicking a rolling tire. Both bodies are modeled with a compressible Neo-Hooke material with Young's modulus $E = 1$, Poisson's ratio $\nu = 0.3$, and density $\rho = 10^{-6}$.
Both bodies are discretized with first-order hexahedral finite elements. The top surface of the block is chosen as the master side of the contact interface, while the outer surface of the hollow cylinder takes the role of the slave surface. For constraint enforcement, a node-based penalty regularization of the mortar approach with a penalty parameter of 5 is chosen. Time integration employs the generalized-α method [15] with spectral radius $\rho_\infty = 1.0$.

Figure 12 exemplarily compares the volume and interface subdomains for the different load balancing strategies for the case of 24 MPI ranks. The initial subdomain layout in step 0 is the same for all load balancing strategies. While the DD of the underlying bodies will not be altered, we apply the interface load balancing scheme proposed in Section 4.3, which results in different interface DDs for the slave side. To unclutter the presentation, we only show the evolution of the slave side's interface DDs, since this is the key ingredient for a scalable mortar evaluation. Interesting features due to load balancing are highlighted with Roman numerals I-IV (see also Figures 12(b)-12(e)) and will be discussed below. In the case of no load balancing (column "no LB"), the interface subdomains match the subdomains of the underlying volume DD throughout the entire simulation. For static LB (column "static LB"), an initial interface DD is performed at the beginning of step 1, but it is not updated during the simulation. Hence, a small strip of slave subdomains is generated during the initial load balancing phase (cf. highlight I or Figure 12(b)) and then rotates with the rolling motion of the cylinder as marked by highlight III (see also Figure 12(d)), such that it quickly leaves the contact area and, thus, does not contribute to an optimal contact evaluation throughout the entire simulation. Based on the threshold criterion (10), the dynamic load balancing (column "dyn. LB") updates the interface DD close to the contact area (cf. highlight II or Figure 12(c)) such that the interface DD is nearly optimal in the vicinity of the contact zone and all processes participate in the evaluation of the contact terms independent of the rolling motion of the cylinder (cf. highlight IV or Figure 12(e)).

Effect of load balancing on wall clock time and memory consumption
We compare the cases of no load balancing, an initial load balancing in the reference configuration, and the dynamic load balancing proposed in Section 4.4 on a mesh with 825,600 hexahedral elements consisting of 913,923 nodes and resulting in 2,741,769 displacement unknowns. We run the simulation on 96 MPI ranks on our in-house cluster. For the case of initial and dynamic load balancing, we limit the relative mismatch in subdomain size of the interface DD by setting the Zoltan parameter IMBALANCE_TOL to 1.03 [11]. For dynamic load balancing, we have tested different thresholds $\eta_t \in \{1.01, 1.2, 1.5, 1.8, 2.5, 5.0, 8.0\}$ to trigger rebalancing, but will only report and discuss selected cases in the following for the sake of presentation, namely $\eta_t \in \{1.01, 1.8, 5.0, 8.0\}$. To extend the ghosting of the master side's interface discretization, we rely on the binning strategy outlined in Section 3.5. A comparison of the different ghosting strategies is presented in Section 5.2.3.
We run the simulation for 200 time steps (20 time steps to close the initial gap, then 180 time steps of the rolling motion) to facilitate a rotation of 180°, such that the contact area on the outer cylinder surface substantially moves along the circumferential direction.
For every time step, Figure 13(a) reports the average time per nonlinear iteration spent in contact evaluation (without considering the cost for load balancing). If no load balancing is performed, the average contact evaluation time is the largest. Since the slave side's interface DD is just adopted from the underlying volume discretization, some processes never participate in contact evaluation. Moreover, the number of processes contributing to the contact evaluation changes over time, so the average contact evaluation time also varies across time steps. In contrast, static load balancing assures that all parallel processes hold their share of the slave side of the interface, such that the average contact evaluation time is roughly constant for all time steps (as soon as full contact is established). Since only a part of all processes contributes to the evaluation of the potentially active part of the slave interface, the average contact evaluation time is still rather large. Ultimately, dynamic load balancing triggers a rebalancing based on the current simulation status to aid a well-balanced distribution of the contact evaluation work to all parallel processes. In Figure 13(a), time steps just after a drop in $t_\mathrm{eval}$ are those in which a rebalancing has occurred. Since the effort of mortar evaluation is now distributed to all processes, the time spent in mortar evaluation drops significantly on average. Of course, individual time steps with an imbalanced work distribution among processes might take longer, which will ultimately lead to rebalancing as soon as the rebalancing criterion (10) is met. In particular, a very low rebalancing threshold (e.g. $\eta_t = 1.01$) requires rebalancing in basically every time step. Although this results in the overall fastest mortar evaluation, the additional effort for rebalancing limits the possible speed-up. On the other hand, a loose threshold (e.g. $\eta_t \in \{5.0, 8.0\}$) triggers the rebalancing only a few times over the course of the simulation; however, for some time steps the time spent in the mortar evaluation can grow by a factor of two or even three compared to the ideal case. In our numerical experiments, we have found the threshold $\eta_t = 1.8$ to deliver a good compromise between imbalance in per-process workload and the frequency of rebalancing. Therefore, we will use this threshold value for all further studies.

[Figure 12: interface subdomains of the slave side for "no LB", "static LB", and "dynamic LB" (columns) at Step 0, Step 1, Step 50, and Step 150 (rows), with highlights I-IV.]

Figure 13(b) reports the time $t_\mathrm{gh}$ spent for ghosting of the master side of the contact interface plus the time $t_\mathrm{LB}$ for rebalancing of the interface DD (if applicable). For the clarity of the presentation, we concentrate on three selected cases. While the time component $t_\mathrm{gh}$ for the master side ghosting is rather constant for all three cases, the time component $t_\mathrm{LB}$ varies: For static load balancing, only the first time step requires rebalancing, while all later time steps do not perform load balancing anymore. Hence, this curve peaks in the first time step and then drops and remains at low values.
For dynamic load balancing with the strict imbalance threshold $\eta_t = 1.01$, rebalancing occurs in every time step, such that this case consistently delivers high values for $t_\mathrm{LB} + t_\mathrm{gh}$. As evident from Figure 13(b), these two cases can be interpreted as a lower and an upper bound. The case of dynamic load balancing with $\eta_t = 1.8$ positions itself in between, since some time steps require rebalancing, while others do not.
While all cases with dynamic load balancing spend additional time on the redistribution of the interface subdomains, these additional timings are easily amortized. To this end, Figure 13(c) plots the time $t_\mathrm{acc} = \sum_{c=1}^{C} (t_\mathrm{eval} + t_\mathrm{LB} + t_\mathrm{gh})_c$, $c \in \{1, \ldots, C\}$, of all time components related to mortar evaluation, accrued over all $C$ contact evaluations of the entire simulation. The end point markers are intended to highlight also small differences between curves. Naturally, a strictly monotonic increase is expected, while one aims for a slope that is as low as possible. Similar to the average contact evaluation time, static load balancing is beneficial compared to no load balancing at all, while dynamic load balancing results in the lowest contact evaluation times. Clearly, the better parallelization due to the dynamic load balancing strategy outweighs the additional cost for occasional rebalancing. The lower the acceptable imbalance $\eta_t$ is, the lower is the accumulated contact evaluation time $t_\mathrm{acc}$. Overall, a reduction of up to 71% in $t_\mathrm{acc}$ can be achieved through proper dynamic load balancing. We note that the difference in $t_\mathrm{acc}$ between $\eta_t = 1.01$ and $\eta_t = 1.8$ is very small, indicating that load balancing in every time step does not bring much additional value.
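As a minimal sketch of how such an accumulated timing curve can be assembled from per-evaluation measurements, the following Python snippet builds the running total of evaluation, load balancing, and ghosting time. The function name `accumulated_contact_time` and the input values are illustrative assumptions; the numbers are synthetic placeholders and not measurements from this study.

```python
from itertools import accumulate


def accumulated_contact_time(t_eval, t_lb, t_gh):
    """Running total of t_eval + t_LB + t_gh over all contact evaluations.

    Each argument holds one timing per contact evaluation c = 1, ..., C.
    The last entry of the result is the accumulated time over all C evaluations.
    """
    if not (len(t_eval) == len(t_lb) == len(t_gh)):
        raise ValueError("all timing lists need one entry per contact evaluation")
    per_evaluation = [e + l + g for e, l, g in zip(t_eval, t_lb, t_gh)]
    return list(accumulate(per_evaluation))


# Synthetic placeholder timings in seconds (assumption, not data from the paper).
t_eval = [4.0, 1.0, 1.0, 2.0]  # evaluation is expensive before rebalancing
t_lb = [0.0, 2.0, 0.0, 0.0]    # rebalancing happens once, in the second evaluation
t_gh = [1.0, 1.0, 1.0, 1.0]    # ghosting cost is roughly constant

t_acc = accumulated_contact_time(t_eval, t_lb, t_gh)
print(t_acc)  # [5.0, 9.0, 11.0, 14.0]
```

Since all contributions are positive, the resulting curve increases monotonically; a strategy is preferable if its curve ends at a lower value.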
To demonstrate the effect of the rebalancing threshold in detail, Figure 14 shows a close-up of the results in Figure 13(a) as well as the evolution of the measured max/min ratio $\eta_t$ in contact evaluation time across all parallel processes. For a clearer visualization, only a subset of the results is plotted. In Figure 14(a), data points after a drop in $t_\mathrm{eval}$ correspond to time steps in which load balancing has occurred, since the max/min ratio $\eta_t$ exceeded the user-given threshold in the previous time step. This is in line with Figure 14(b), where $\eta_t$ is plotted over time along with dashed lines to indicate the different thresholds. We observe that $\eta_t$ drops close to the perfect balance (i.e. $\eta_t = 1.0$) just after it has exceeded the threshold level. In favor of an uncluttered view, Figure 14(b) shows only results obtained with dynamic load balancing.
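The trigger logic described above can be sketched in a few lines of Python. Criterion (10) is not reproduced here, so this sketch assumes the simplest reading of the text: the max/min ratio of the per-process contact evaluation times is compared against a user-given threshold, and a rebalancing is requested for the next time step once the ratio exceeds it. The function names and the treatment of idle processes are assumptions for illustration.

```python
import math


def imbalance_ratio(eval_times):
    """Max/min ratio of per-process contact evaluation times.

    Assumption: a process with zero evaluation time (no share of the
    interface) yields an infinite ratio, unless all processes are idle.
    """
    slowest, fastest = max(eval_times), min(eval_times)
    if fastest < 0.0:
        raise ValueError("evaluation times must be non-negative")
    if fastest == 0.0:
        return math.inf if slowest > 0.0 else 1.0
    return slowest / fastest


def needs_rebalancing(eval_times, threshold):
    """True if the measured imbalance exceeds the user-given threshold."""
    return imbalance_ratio(eval_times) > threshold


# Synthetic per-process timings in seconds (assumption, not measured data).
times = [1.0, 1.2, 2.0, 1.0]
print(imbalance_ratio(times))         # 2.0
print(needs_rebalancing(times, 1.8))  # True: rebalance in the next time step
print(needs_rebalancing(times, 5.0))  # False: a loose threshold tolerates this
```

In a distributed-memory code, the list of per-process timings would typically not be gathered on one process; the maximum and minimum can be obtained with two global reductions instead.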
So far, we have studied the impact of the load balancing strategy and the imbalance threshold $\eta_t$ on the time spent in the computational treatment of all mortar terms. In all cases, dynamic load balancing is worth the effort. Since the present example shows very good behavior for $\eta_t = 1.8$, we continue to use this value throughout this example. We note that the optimal choice of $\eta_t$ is problem-dependent. Yet, we generally recommend to use dynamic load balancing for contact problems with

Strong scaling behavior under dynamic load balancing
Now, we study the strong scaling behavior of the contact evaluation time when dynamic load balancing is active. We therefore study two different problem sizes: 517,185 displacement unknowns referred to as "500k" and 1,005,993 displacement degrees of freedom denoted by "1000k". While we keep the problem sizes fixed, we solve the problem on an increasing number of MPI ranks on our in-house cluster. We will compare different load balancing strategies, namely no load balancing ("no LB"), initial load balancing in the reference configuration ("static LB"), and dynamic load balancing with $\eta_t = 1.8$ as found useful in Section 5.2.1 ("dyn. LB"). Figure 15 shows the strong scaling behavior. Again, we consider the average contact evaluation time $t_\mathrm{eval}$ per time step, the time $t_\mathrm{LB} + t_\mathrm{gh}$ spent in redistribution and ghosting of the interface discretizations, the average total time $t_\mathrm{total} = t_\mathrm{eval} + t_\mathrm{LB} + t_\mathrm{gh}$ per time step, and finally the total contact time $t_\mathrm{acc} = \sum_{c=1}^{C} (t_\mathrm{eval} + t_\mathrm{LB} + t_\mathrm{gh})_c$, $c \in \{1, \ldots, C\}$, accumulated over all $C$ contact evaluations of the entire simulation. For both problem sizes as well as all quantities of interest, we observe good strong scaling behavior when using dynamic load balancing: starting from a small number of MPI ranks, the time spent on a given task (e.g. contact evaluation, redistribution and ghosting, total contact time, accumulated contact time) is reduced when adding more MPI ranks to tackle the computations, with a reduction rate proportional to the increase in MPI ranks, i.e. delivering perfect strong scaling [3]. As expected, both meshes reach their strong scaling limit at some point, such that adding more hardware resources does not reduce, but actually increases the execution time, e.g. due to a deteriorating computation-to-communication ratio. Naturally, the strong scaling limit of the large problem (1000k) is located at twice the number of MPI ranks as for the small, half-sized problem (500k).
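To make the notions of perfect strong scaling and the strong scaling limit concrete, the following Python sketch computes speedup and parallel efficiency relative to the smallest run and locates the rank count with the lowest time. It uses the standard textbook definitions; the function names and the timings are illustrative assumptions and do not reproduce the data of Figure 15. The baseline may use more than one rank, as is necessary when a problem does not fit into the memory of a single process.

```python
def strong_scaling_metrics(ranks, times):
    """Return (ranks, speedup, efficiency) relative to the first entry.

    Perfect strong scaling corresponds to an efficiency of 1.0, i.e. the
    time drops by the same factor by which the number of ranks grows.
    """
    base_ranks, base_time = ranks[0], times[0]
    metrics = []
    for p, t in zip(ranks, times):
        speedup = base_time / t
        efficiency = (base_time * base_ranks) / (t * p)
        metrics.append((p, speedup, efficiency))
    return metrics


def strong_scaling_limit(ranks, times):
    """Rank count with the smallest time; more ranks no longer pay off."""
    best = min(range(len(times)), key=times.__getitem__)
    return ranks[best]


# Synthetic timings in seconds (assumption, not measured data).
ranks = [3, 6, 12, 24]
times = [8.0, 4.0, 2.0, 4.0]

for p, speedup, efficiency in strong_scaling_metrics(ranks, times):
    print(p, speedup, efficiency)
print(strong_scaling_limit(ranks, times))  # 12
```

In this synthetic example, the efficiency stays at 1.0 up to 12 ranks and drops to 0.25 at 24 ranks, where the time increases again.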
The beneficial effect of dynamic load balancing becomes evident in comparison to "no LB" and "static LB": Without any load balancing or with just an initial rebalancing of the interface discretizations, the initial slope in the scaling diagrams is far from optimal. Once again, this originates from the curse of dimensionality, since the additional hardware resources do not necessarily participate in the interface evaluation. For an intermediate number of MPI ranks, strong scaling is recovered; however, absolute timings are much higher than for the same setup with dynamic load balancing. As already observed in Section 5.1.1, static load balancing is consistently a bit faster than not using load balancing at all, yet it is by far slower than dynamic load balancing.
As demonstrated, the proposed dynamic load balancing scheme is the key factor to achieve strong scalability of the evaluation of mortar terms in a nonlinear and time-dependent contact simulation. To the authors' best knowledge, this constitutes the first time that strong scalability in such a complex setting could be demonstrated.

Comparison of strategies to extend the master side's ghosting
While the influence of the load balancing strategy has already been discussed previously, we now aim to assess the impact of the ghosting strategy on the overall performance of the contact evaluation. Therefore, we exemplarily consider the mesh from Section 5.2.1 run on 96 MPI ranks. Now, we compare the fully redundant storage of the master side of the interface (cf. Section 3.4) to the geometrically motivated binning approach (cf. Section 3.5). We study again the cases of no, static, and dynamic load balancing. For the clarity of the presentation, we only show the dynamic load balancing scenario with $\eta_t = 1.8$, but note that other values for $\eta_t$ exhibit similar behavior.

Figure 16 summarizes the wall clock time spent on contact evaluation. For the pure contact evaluation time reported in Figure 16(a), the fully redundant ghosting increases the evaluation time for all cases, since the contact detection needs to account for all master elements, while ghosting via binning pre-sorts the master elements based on their geometric proximity within neighboring bins. Figure 16(b) depicts the time spent in redistribution and ghosting of interface data. For the sake of a clear presentation and to focus on the most relevant case, we show only the curves for dynamic load balancing. Evidently, ghosting via binning is faster than fully redundant ghosting by a factor of approximately 8 to 10. Figure 16(c) shows the accumulated time for contact evaluation, load balancing, and ghosting, i.e. $t_\mathrm{acc} = \sum_{c=1}^{C} (t_\mathrm{eval} + t_\mathrm{LB} + t_\mathrm{gh})_c$, $c \in \{1, \ldots, C\}$, to assess the overall accumulated time spent on all $C$ evaluations of the contact interface over the course of the entire simulation. For the cases with no and static load balancing, the ghosting strategy does not impact the overall performance significantly.
For dynamic load balancing though, the necessity of ghosting after each redistribution makes the difference: the performance difference between fully redundant ghosting and ghosting via binning as observed in Figure 16(b) now accumulates over time, such that the use of binning results in the overall lowest time spent on contact evaluation. So, additional savings of 40% of the contact evaluation time can be achieved. Summing up the study of contact timings, the best case scenario of dynamic load balancing with ghosting via binning is faster than
• static load balancing by a factor of ≈ 2.61,
• no load balancing by a factor of ≈ 3.30,
which strongly emphasizes the benefits of dynamic load balancing and ghosting via binning in dynamic contact problems.

Finally, we briefly summarize the impact of the load balancing scheme and the ghosting strategy on the cost for storage and parallel communication: If no load balancing is performed ("no LB"), the maximum number of owned nodes per process is roughly 10× larger than its average across all processes, since not all processes hold a portion of the interface. This imbalance is alleviated for static or dynamic load balancing. Regarding the impact of the ghosting strategy, ghosting via binning reduces the number of nodes/elements to be ghosted by a factor of 100 compared to the fully redundant case, which ultimately also impacts the global memory footprint of the application.
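The geometric idea behind ghosting via binning can be illustrated with a generic uniform-grid sketch in Python. This is not the implementation from Section 3.5: it assumes, for simplicity, that each master element is represented by a single centroid, that the bin size is at least as large as the element size plus the search radius, and that only the 27 bins around a slave point need to be inspected. All names (`bin_index`, `build_bins`, `nearby_master_elements`) and coordinates are hypothetical.

```python
from collections import defaultdict
from itertools import product


def bin_index(point, origin, bin_size):
    """Integer index triple of the Cartesian bin containing a point."""
    return tuple(int((x - o) // bin_size) for x, o in zip(point, origin))


def build_bins(master_centroids, origin, bin_size):
    """Sort master elements (id -> centroid) into a sparse grid of bins."""
    bins = defaultdict(list)
    for elem_id, centroid in master_centroids.items():
        bins[bin_index(centroid, origin, bin_size)].append(elem_id)
    return bins


def nearby_master_elements(slave_point, bins, origin, bin_size):
    """Master elements in the bin of the slave point and its 26 neighbors."""
    i, j, k = bin_index(slave_point, origin, bin_size)
    found = []
    for di, dj, dk in product((-1, 0, 1), repeat=3):
        found.extend(bins.get((i + di, j + dj, k + dk), []))
    return sorted(found)


# Synthetic geometry (assumption): two master elements close to the slave
# point and one far away that does not need to be ghosted.
master_centroids = {0: (0.5, 0.5, 0.0), 1: (1.5, 0.5, 0.0), 2: (9.5, 9.5, 0.0)}
origin, bin_size = (0.0, 0.0, 0.0), 1.0
bins = build_bins(master_centroids, origin, bin_size)

print(nearby_master_elements((0.9, 0.4, 0.1), bins, origin, bin_size))  # [0, 1]
```

Only the elements returned by such a proximity query would have to be ghosted on the process owning the slave point, in contrast to the fully redundant case, in which every process stores all master elements.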
In sum, dynamic contact problems require a good choice of load balancing strategy as well as a suitable ghosting strategy. In particular, load balancing highly impacts the time spent in contact evaluation. Despite the additional cost of performing the load balancing operation, the overall fastest contact evaluation is achieved with dynamic load balancing based on a user-given imbalance threshold $\eta_t$. While we have found $\eta_t = 1.8$ to deliver very good results in our numerical studies, the optimal choice of $\eta_t$ can depend on details of the computing hardware, the software implementation, and also the example at hand. To reduce the amount of parallel communication as well as the memory demand per compute node, ghosting via binning is by far superior to a fully redundant storage of the master side of the interface discretization. The overall best performance with respect to both aspects (run time and communication/memory demand) is obtained through the combination of dynamic load balancing with ghosting via binning.

Concluding remarks
Recognizing the tremendous computational effort to evaluate mortar integrals in the context of nonmatching interface discretizations as they exemplarily arise in contact mechanics, this paper proposes strategies for efficient storage and parallel computational kernels for mortar interface problems. Starting from a close look at the computational effort to evaluate mortar integrals, we have derived two basic requirements for computations on parallel machines with distributed memory architecture: On the one hand, one needs to enable access to the appropriate interface data to guarantee a correct identification of all master/slave pairs at the mortar interface. On the other hand, the available parallel hardware needs to be used efficiently, such that parallel scalability of the mortar evaluation can be achieved.
We have discussed some techniques to guarantee access to all required master/slave pairs during the contact search and mortar evaluation. While fully redundant ghosting is conceptually easy and straightforward to implement, it suffers from an elevated memory demand and tremendous communication overhead at large scale, which ultimately increases the time-to-solution. A geometrically motivated approach using a background grid of Cartesian bins allows for the efficient identification of nearby master elements, reduces the per-process memory demand, and limits the ghosting data to the nearby master elements. The binning approach has shown the best timings in large weak and strong scaling studies and consistently reduces the amount of data to be communicated between parallel processes as well as to be stored within a process.
We have then discussed the curse of dimensionality in overlapping DDs of interface problems, which requires a special treatment of the interface subdomains. To this end, we have proposed to use an interface DD independent from the underlying volume DD and were able to demonstrate optimal weak and strong scalability of the mortar evaluation time. To account for dynamic changes in the contact zone, we have designed a dynamic load balancing scheme for contact problems, which tracks imbalances among parallel processes and rebalances the computational work as soon as user-given imbalance thresholds are exceeded. We have tested the proposed algorithms on a time-dependent nonlinear contact problem undergoing large deformations. In time measurements on such large-scale examples, dynamic load balancing outperforms the case of no or only initial load balancing by factors of up to 2-4×. Wall clock time is the lowest when only small imbalances are allowed, although even a large imbalance tolerance delivers faster computations than simulations without any load balancing at all. For the first time, strong and weak scalability could be shown for time-dependent nonlinear contact problems undergoing large deformations and dynamically evolving contact zones through the application of the proposed dynamic load balancing scheme.
In our numerical experiments, we have studied representative test cases from computational contact mechanics. We have performed weak and strong scaling studies up to 480 MPI ranks and have assessed the impact of different algorithmic parameters. From our numerical experiments, we extract several findings:
• Ghosting via binning is favorable due to its reduced communication overhead, which also directly reduces the time-to-solution.
• Load balancing is crucial for optimal contact evaluation times. In particular, systems with a static contact zone benefit from an initial redistribution of the interface, while contact problems with dynamically evolving contact zones require the proposed dynamic load balancing scheme for optimal performance.
• For static contact problems, we have found the combination of static load balancing and ghosting via binning to deliver the best results.
• For dynamic contact problems, we have found the combination of dynamic load balancing and ghosting via binning to deliver the best results.
In sum, we recommend applying static load balancing in combination with ghosting via binning for problems with static contact zones, while dynamic load balancing in combination with ghosting via binning is preferable for problems with dynamically evolving contact zones. Following these recommendations, a fast time-to-solution as well as good weak and strong scaling behavior can be achieved.