Abstract
We describe the engineering of the distributed-memory multilevel graph partitioner dKaMinPar. It scales to (at least) 8192 cores while achieving partitioning quality comparable to widely used sequential and shared-memory graph partitioners. In comparison, previous distributed graph partitioners scale only in more restricted scenarios and often induce a considerable quality penalty compared to non-distributed partitioners. When partitioning into a large number of blocks, they even produce infeasible solutions that violate the balancing constraint. dKaMinPar achieves its robustness by a scalable distributed implementation of the deep-multilevel scheme for graph partitioning. Crucially, this includes new algorithms for balancing during refinement and coarsening.
1 Introduction
Graphs are a central concept of computer science used whenever we need to model relations between objects. Consequently, handling large graphs is very important for parallel processing. This often requires partitioning these graphs into blocks of approximately equal weight with most edges inside the blocks (balanced graph partitioning). Applications include scientific computing, handling social networks, route planning, and graph databases [3].
In principle, multilevel graph partitioners (MGP) achieve high-quality partitions for a wide range of input graphs G with a good trade-off between quality and partitioning cost. They first iteratively coarsen G by contracting edges or small clusters. The resulting small graph \(G'\) is still a good representation of the overall input, so an initial partition of \(G'\) already induces a good partition of G. This partition is further improved by uncoarsening the graph and applying refinement algorithms on each level.
However, parallelizing multilevel graph partitioning has proved challenging over several decades. While shared-memory graph partitioners have recently matured to achieve high quality and reasonable scalability [1, 9, 10, 14], current distributed-memory partitioners [13, 19, 25] induce a severe quality deterioration and often are not able to consistently achieve feasible (i.e., balanced) partitions. In particular, high-quality partitioners do not scale to the number of processing elements (PEs) available in large supercomputers. This situation is exacerbated by the fact that the number of blocks k should often increase linearly in the number of PEs. Previous systems are not able to directly handle large k, running into even larger problems with achieving feasibility.
In this paper, we present dKaMinPar, which addresses all these issues. Its basis is a distributed-memory adaptation of the deep-multilevel graph partitioning concept [9] that continues the multilevel approach deep into the initial partitioning phase. This makes the large k case much easier and eliminates a parallelization bottleneck due to initial partitioning. Our coarsening and refinement algorithms are based on the label propagation approach previously used in several partitioners [13, 19, 25]. Label propagation [18, 20] greedily moves vertices to other clusters/blocks when this reduces cuts (and does not violate the balance constraint). This is simple, fast, effective and robust even for complex networks. We develop a distributed-memory version with improved scalability, e.g., by using improved sparse-all-to-all primitives. Perhaps the main algorithmic innovations are new scalable distributed techniques that maintain the balance constraint. During coarsening, a maximum cluster weight is approximated by unwinding contractions that lead to overweight clusters. During uncoarsening, block weight constraints are achieved by finding, selecting and applying globally “best” block moves.
The experiments described in Sect. 6 indicate that our implementation has achieved the main goals. It scales to at least 8 192 cores even for complex networks that did not scale on previous distributed solvers. Feasibility is guaranteed, even for large k, and quality is typically within a few percent of the shared-memory systems. Section 7 summarizes the results and discusses possible future improvements.
Contributions

Scalable distributed implementation of deep multilevel graph partitioning.

Simplicity using label propagation for both contraction and refinement.

New scalable balanced coarsening and uncoarsening algorithm.

Extensive evaluation on both large real world networks and huge synthetic networks from 3 input families.

Quality comparable to shared-memory systems.

Scalability up to (at least) \(2^{13}\) machine cores and \(2^{39}\) graph edges.

Works both for complex networks and large number of blocks where previous systems often fail.
2 Preliminaries
Notation and Definitions. Let \(G = (V, E, c, \omega )\) be an undirected graph with vertex weights \(c: V \rightarrow \mathbb {N}_{> 0}\), edge weights \(\omega : E \rightarrow \mathbb {N}_{> 0}\), \(n {:}{=}|V|\), and \(m {:}{=}|E|\). We extend c and \(\omega \) to sets, i.e., \(c(V') {:}{=}\sum _{v \in V'} c(v)\) and \(\omega (E') {:}{=}\sum _{e \in E'} \omega (e)\). \(N(v) {:}{=}\{ u \mid \{ u, v \} \in E\}\) denotes the neighbors of v. For some \(V' \subseteq V\), \(G[V']\) denotes the subgraph of G induced by \(V'\). We are looking for blocks of nodes \(\varPi {:}{=}\{ V_1, \dots , V_k \}\) that partition V, i.e., \(V_1 \cup \dots \cup V_k = V\) and \(V_i \cap V_j = \emptyset \) for \(i \ne j\). The balance constraint demands that for all \(i \in \{1, \dots , k\}\), \(c(V_i) \le L_\text {max} {:}{=}\max \{ (1 + \varepsilon ) \frac{c(V)}{k}, \frac{c(V)}{k} + \max _v c(v) \}\) for some imbalance parameter \(\varepsilon \)^{Footnote 1}. The objective is to minimize \(\text {cut}(\varPi ) {:}{=}\sum _{i < j} \omega (E_{ij})\) (weight of all cut edges), where \(E_{ij} {:}{=}\{ \{u, v\} \in E \mid u \in V_i \text { and } v \in V_j \}\). We call a vertex \(u \in V_i\) that has a neighbor in \(V_j\), \(i \ne j\), a boundary vertex. A clustering \(\mathcal {C} {:}{=}\{ C_1, \dots , C_\ell \}\) is also a partition of V, where the number of blocks \(\ell \) is not given in advance (there is also no balance constraint).
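The definitions above can be made concrete with a small sequential sketch. The function names and the edge-list representation (triples of endpoint, endpoint, weight) are assumptions chosen for illustration and are not taken from dKaMinPar.

```python
def block_weights(vertex_weight, block_of, k):
    """Sum the vertex weights c(V_i) of each block i in 0..k-1."""
    weights = [0] * k
    for v, c in enumerate(vertex_weight):
        weights[block_of[v]] += c
    return weights

def max_block_weight(vertex_weight, k, eps):
    """L_max = max{(1 + eps) * c(V) / k, c(V) / k + max_v c(v)}."""
    total = sum(vertex_weight)
    return max((1 + eps) * total / k, total / k + max(vertex_weight))

def edge_cut(edges, block_of):
    """Total weight of edges {u, v} whose endpoints lie in different blocks."""
    return sum(w for u, v, w in edges if block_of[u] != block_of[v])

def is_feasible(vertex_weight, block_of, k, eps):
    """Check the balance constraint c(V_i) <= L_max for every block."""
    limit = max_block_weight(vertex_weight, k, eps)
    return all(w <= limit for w in block_weights(vertex_weight, block_of, k))

# Path graph 0-1-2-3 with unit weights, split in the middle.
vertex_weight = [1, 1, 1, 1]
edges = [(0, 1, 1), (1, 2, 1), (2, 3, 1)]
block_of = [0, 0, 1, 1]
print(edge_cut(edges, block_of), is_feasible(vertex_weight, block_of, 2, 0.03))
```

Note that with unit weights the second term of \(L_\text {max}\) dominates for small \(\varepsilon \), which is exactly the relaxation described in Footnote 1.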
Machine Model and Input Format. The distributed-memory model used in this work considers P processing elements (PEs) numbered 1..P, connected by a full-duplex, single-ported communication network. The input graph is given with a (usually balanced) 1D vertex partition. Each PE is given a subgraph of the input graph (i.e., a block of the 1D partition) with consecutive vertices. An undirected edge \(\{u, v\}\) is represented by two directed edges (u, v), (v, u), which are stored on the PEs owning the respective tail vertices. Vertices adjacent to vertices owned by other PEs are called interface vertices and are replicated as ghost vertices (i.e., without outgoing edges) on those PEs.
3 Related Work
There has been a huge amount of research on graph partitioning, so we refer the reader to overview papers [2, 3, 4, 24] for most of the general material. Here, we focus on parallel algorithms for high-quality graph partitioning.
Distributed Graph Partitioning. Virtually all high-quality partitioners are based on the multilevel paradigm, e.g., ParMETIS [12, 13], ParHIP [19, 22] and others [5, 27]. These algorithms partition a graph in three phases. First, they build a hierarchy of successively coarser approximations of the input graph, usually by contracting matchings or clusters. Once the graph has only few vertices left (e.g., \(n \le C k\) for some contraction limit C), the graph is partitioned into k blocks. Finally, this partition is successively projected onto finer levels of the hierarchy and refined using local search algorithms.
The performance of multilevel algorithms is defined by the algorithmic components used for these phases. Partitioners designed for mesh partitioning usually contract matchings to coarsen the graph [5, 13, 27]. However, this technique is not suitable for partitioning complex networks that only admit a small maximum matching. Thus, other partitioners use two-hop matchings [15] or size-constrained label propagation [9, 11, 19]. Due to its simple yet effective nature, the latter is also commonly used as a local search algorithm during refinement [1, 6, 9, 11, 13, 19, 27].
Label propagation has also been used by non-multilevel graph partitioning algorithms such as XtraPuLP [25], which reports scalability up to \(2^{17}\) cores, a level which has not been reached by multilevel algorithms. However, using label propagation without the multilevel paradigm comes with a pronounced decline in quality; Ref. [9] reports edge cuts for PuLP [26] (non-multilevel) that are on average more than twice as large as those of KaMinPar (multilevel). Across a large and diverse benchmark set, this is considered a lot; most multilevel algorithms achieve average edge cuts within a few percentage points of each other. Another class of highly scalable graph partitioners includes geometric partitioners, which work on a geometric embedding of the graph. While these algorithms are orders of magnitude faster than multilevel algorithms [16], they generally compute larger edge cuts and only work on graphs with a meaningful geometric embedding.
Deep Multilevel Graph Partitioning. As plain MGP algorithms usually shrink the graph down to Ck vertices, large values for k break the assumption that the coarsest graph is small. This causes their performance to deteriorate [9]. Instead, recursive bipartitioning can be used to compute partitions with large k, but this induces an additional \(\log k\) factor in running time and makes it more difficult to compute balanced partitions due to the lack of a global view. Deep multilevel graph partitioning (deep MGP) [9] circumvents these problems by continuing coarsening deep into initial partitioning. More precisely, deep MGP coarsens the graph until only 2C vertices are left, independent of k. After bipartitioning the coarsest graphs, it maintains the invariant that a (coarse) graph with n vertices is partitioned into \(\min \{k, n / C\}\) blocks by using recursive bipartitioning on the current level. By using additional balancing techniques, partitioners based on deep MGP can obtain feasible high-quality partitions with a large number of blocks (e.g., \(k \approx 1M\)) while often being an order of magnitude faster than partitioners based on plain MGP. Compared to recursive bipartitioning the entire graph, it reduces the additional \(\log k\) factor to \(\log (kC/n)\). KaMinPar [9] is a scalable shared-memory implementation of deep MGP which uses size-constrained label propagation during coarsening and refinement.
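To illustrate the invariant, the following sketch computes how many blocks each level of a hypothetical graph hierarchy would carry. The hierarchy sizes are made up, and the use of integer division and a lower bound of two blocks are assumptions of this sketch.

```python
def blocks_at_level(n, k, C):
    """Number of blocks for a graph with n vertices: min{k, n / C}, at least 2."""
    return min(k, max(2, n // C))

C, k = 2000, 64
# Hypothetical hierarchy, listed from the coarsest to the finest graph.
hierarchy = [4_000, 16_000, 64_000, 1_000_000]
schedule = [blocks_at_level(n, k, C) for n in hierarchy]
print(schedule)
```

The coarsest graph with 2C vertices is bipartitioned, and the number of blocks grows with the graph size until it reaches k.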
4 Distributed Deep Multilevel Graph Partitioning
In this section, we present dKaMinPar, a distributed graph partitioner that leverages deep MGP. We first describe the distributed deep MGP scheme, which partitions a graph on P PEs into k blocks. For simplicity, we assume that k and P are powers of two. Then, we explain the different algorithmic components for coarsening, initial partitioning, refinement and balancing in more detail.
Distributed Deep Multilevel Partitioning. Recall that deep MGP [9] follows the traditional multilevel graph partitioning scheme, but coarsens the graph down to a small size independent of k. After partitioning the coarsest graph into a small number of blocks, it maintains the invariant that each block of the current partition contains roughly C vertices throughout the uncoarsening phase (until there are k blocks).
To adapt this scheme to the distributed setting, we use distributed algorithms for graph clustering, contraction, and partition refinement (see below). Initial partitioning of the coarsest graph and block-induced subgraphs is done using an in-memory partitioner by gathering full copies of the graphs on individual PEs. Since this process is communication heavy, we generalize the bipartitioning steps of deep MGP to K-way partitioning for some tuning parameter K. The scheme then works as follows.
We repeatedly coarsen the input graph until only \(K \cdot C\) vertices are left, building a hierarchy of successively coarser graphs. During this process, we exploit parallelism and improve scalability on coarse levels of the hierarchy by maintaining the invariant that P PEs work on a graph with at least \(P \cdot C\) vertices [27]. This leads to more diversification on coarse levels due to the randomized nature of the clustering, initial partitioning and refinement algorithms. More precisely, we check on each level whether the current graph \(G_i\) has more than \(P \cdot C\) vertices. If not, we split the P PEs into two subgroups \(1..\frac{P}{2}\), \(\frac{P}{2}+1..P\) and mirror the parts of \(G_i\) between PEs j and \(\frac{P}{2} + j\), \(0 \le j < \frac{P}{2}\), such that each group obtains an identical copy of \(G_i\). The subgroups then continue this procedure recursively. In Fig. 1, we illustrate this process by using bold horizontal lines, duplicating \(G_{\ell - 1}\) and \(G_{\ell }\). The coarsest graph is then copied to each PE and partitioned into \(\min \{k, K\}\) blocks using an in-memory partitioner. The best partition (within each group of PEs) is selected and projected onto \(G_{\ell - 1}\) by assigning fine vertices to the blocks of their corresponding coarse vertices. From here, we maintain two invariants during uncoarsening:

1. The current partition is feasible, which we ensure using the distributed balancing algorithm described below, and

2. each block contains roughly C vertices (until there are k blocks).
To maintain the latter invariant, assume that the current graph with \(|V(G_i)|\) vertices is partitioned into \(k' < k\) blocks. Then, assign \(k' / P\) blocks to each PE and use all-to-all communication to gather full copies of the block-induced subgraphs. These subgraphs are then partitioned into \(\min \{k / k', K\}\) blocks using an in-memory partitioner. Afterwards, we update the partition of the distributed graph using all-to-all communication and subsequently improve it using a refinement algorithm. We repeat this process until we obtain a partition where each block contains roughly C vertices. Note that if \(k > |V(G_1)| / C\), the partition computed on the finest graph does not have enough blocks. In this case, we distribute and partition the block-induced subgraphs once more to compute the missing blocks.
Coarsening. We use a similar parallelization of size-constrained label propagation as ParHIP [19] and KaMinPar [9] to cluster the graphs. The algorithm works by first assigning each vertex to its own cluster. In further iterations over the vertices (we use \(\{3, 5\}\) iterations), they are then moved to adjacent clusters such that the weight of intra-cluster edges is maximized without violating the maximum cluster weight \(W_i {:}{=}\varepsilon \frac{c(V)}{k'_i}\), where \(k'_i {:}{=}\min \{k, |V(G_i)| / C\}\) [9] and i is the current level of the graph hierarchy.
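A purely sequential sketch of size-constrained label propagation for clustering is given below. It omits everything that makes the distributed version hard (batches, ghost vertices, global weight tracking) and uses a plain random vertex order; the adjacency-list format and all names are assumptions of this sketch.

```python
import random

def size_constrained_label_propagation(adj, vertex_weight, max_cluster_weight,
                                       iterations=3, seed=0):
    """adj[u] is a list of (v, edge_weight) pairs; returns a cluster ID per vertex."""
    rng = random.Random(seed)
    n = len(adj)
    cluster = list(range(n))              # every vertex starts in its own cluster
    cluster_weight = list(vertex_weight)  # indexed by cluster ID
    order = list(range(n))
    for _ in range(iterations):
        rng.shuffle(order)
        for u in order:
            # Accumulate the edge weight connecting u to each adjacent cluster.
            rating = {}
            for v, w in adj[u]:
                rating[cluster[v]] = rating.get(cluster[v], 0) + w
            best, best_rating = cluster[u], rating.get(cluster[u], 0)
            for c, r in rating.items():
                fits = cluster_weight[c] + vertex_weight[u] <= max_cluster_weight
                if c != cluster[u] and r > best_rating and fits:
                    best, best_rating = c, r
            if best != cluster[u]:
                cluster_weight[cluster[u]] -= vertex_weight[u]
                cluster_weight[best] += vertex_weight[u]
                cluster[u] = best
    return cluster

# Two triangles joined by the edge {2, 3}, unit weights, clusters of weight <= 3.
adj = [
    [(1, 1), (2, 1)], [(0, 1), (2, 1)], [(0, 1), (1, 1), (3, 1)],
    [(2, 1), (4, 1), (5, 1)], [(3, 1), (5, 1)], [(3, 1), (4, 1)],
]
clusters = size_constrained_label_propagation(adj, [1] * 6, max_cluster_weight=3)
```

A move is only accepted if the target cluster stays within the weight limit, so the limit holds at all times in this sequential setting.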
As noted in Refs. [1, 18], the solution quality of label propagation is improved when iterating over vertices in increasing degree order. Since this is not cache efficient and lacks diversification by randomization, we sort the vertices into exponentially spaced degree buckets, i.e., bucket b contains all vertices with degree \(2^b \le d < 2^{b + 1}\), and rearrange the input graph accordingly. This happens locally on each PE. Then, during label propagation, we split buckets into small chunks and randomize traversal on an inter-chunk and intra-chunk level analogous to the randomization of the matching algorithm used by Metis [12].
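The bucketed, chunk-randomized traversal order might be sketched as follows. The chunk size, the treatment of isolated vertices (placed in bucket 0) and the function names are assumptions made for this example.

```python
import random

def degree_bucket(degree):
    """Bucket b holds degrees with 2^b <= d < 2^(b+1); degree 0 goes to bucket 0."""
    return max(degree, 1).bit_length() - 1

def bucketed_order(degrees, chunk_size=2, seed=0):
    """Visit buckets in increasing order, randomizing within each bucket."""
    rng = random.Random(seed)
    buckets = {}
    for v, d in enumerate(degrees):
        buckets.setdefault(degree_bucket(d), []).append(v)
    order = []
    for b in sorted(buckets):
        vertices = buckets[b]
        chunks = [vertices[i:i + chunk_size]
                  for i in range(0, len(vertices), chunk_size)]
        rng.shuffle(chunks)        # inter-chunk randomization
        for chunk in chunks:
            rng.shuffle(chunk)     # intra-chunk randomization
            order.extend(chunk)
    return order

degrees = [1, 9, 2, 3, 1, 16, 5]
order = bucketed_order(degrees)
```

The resulting order is only approximately sorted by degree, but vertices within a chunk stay close together in memory once the graph is rearranged by bucket.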
To communicate the current cluster assignment of interface vertices, we follow ParHIP and split each iteration into \(\max \{\alpha , \beta / P\}\) (we use \(\alpha = 8\), \(\beta = 128\)) batches. After each batch, we use a sparse all-to-all operation to notify adjacent PEs of interface vertices that were moved to a different cluster. Since clusters can span multiple PEs, enforcing the maximum cluster weight becomes more challenging than in a shared-memory setting. ParHIP relaxes the weight limit and only enforces it locally, consequently allowing clusters with weight up to \(P \cdot W\). This can lead to very heavy coarse vertices, making it more difficult to compute balanced partitions. Instead, we track the global cluster weights by sending the change in cluster weight after each batch to the PE owning the initial vertex of the cluster, which accumulates the changes and replies with the total weight of the cluster. If a cluster becomes heavier than W, each PE reverts moves proportional to its part of the total cluster weight. Those vertices can then be moved to other clusters in subsequent iterations.
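The batch count and the proportional rollback can be illustrated with a few lines of arithmetic. This is a sketch under assumptions: integer division for \(\beta / P\) and rounding each PE's share up are choices made here, not details given in the text.

```python
def batches_per_iteration(num_pes, alpha=8, beta=128):
    """Number of batches per label propagation iteration: max{alpha, beta / P}."""
    return max(alpha, beta // num_pes)

def weight_to_revert(local_weights, max_cluster_weight):
    """local_weights[p] is the weight PE p contributed to one cluster.

    Returns the weight each PE should move back out of the cluster,
    proportional to its contribution and rounded up.
    """
    total = sum(local_weights)
    excess = total - max_cluster_weight
    if excess <= 0:
        return [0] * len(local_weights)
    # -(-a // b) is integer ceiling division.
    return [-(-excess * w // total) for w in local_weights]

print(batches_per_iteration(4), batches_per_iteration(64))
print(weight_to_revert([6, 3, 3], max_cluster_weight=8))
```

In the example, a cluster of total weight 12 exceeds the limit of 8 by 4, and the three PEs give back weight in the ratio 2 : 1 : 1 of their contributions.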
After clustering the graph, we contract the clusters to build the next graph in the hierarchy. We give more details on this operation in Sect. 5.
Balancing. Balance constraint violations during deep MGP can occur after initial partitioning or after projecting a coarse graph partition onto a finer level of the graph hierarchy [9]. Since these balance constraint violations are bounded by the weight of the heaviest vertex, we design the following balancing algorithm based on the assumption that only few vertex moves are necessary to restore balance. Thus, it is feasible to invest a moderate amount of work per vertex move.
For each overloaded block B, each PE maintains a local priority queue \(P_B\) of vertices in block B ordered by their relative gain, which we define as \(g \cdot c(v)\) if \(g \ge 0\) and g/c(v) if \(g < 0\). Here, g is the largest reduction in edge cut when moving v to a non-overloaded block. This rating function prefers moving few heavy vertices over many light vertices, supporting our assumption that few vertex moves are sufficient to balance the partition. To keep the priority queues small, we maintain the invariant that \(P_B\) stores no more vertices than are necessary to remove all excess weight \(o(B) {:}{=}c(B) - L_\text {max}\) from B. To this end, we initialize \(P_B\) by iterating over all vertices v in B and inserting v into \(P_B\) if \(c(P_B) < o(B)\). Otherwise, we only insert v if it can replace another vertex with worse relative gain.
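One possible reading of this initialization is sketched below with a binary heap. The eviction rule (drop the worst entries while the remaining weight still covers the overload) is an interpretation made for this sketch, and the tuple layout is hypothetical.

```python
import heapq

def relative_gain(gain, weight):
    """g * c(v) for non-negative gains, g / c(v) for negative gains."""
    return gain * weight if gain >= 0 else gain / weight

def build_queue(candidates, overload):
    """candidates: iterable of (vertex, gain, weight) for one overloaded block.

    Returns (relative_gain, vertex, weight) entries, best first, keeping no
    more vertices than needed to remove `overload` weight.
    """
    if overload <= 0:
        return []
    heap = []   # min-heap on relative gain, so heap[0] is the worst entry
    total = 0
    for vertex, gain, weight in candidates:
        rel = relative_gain(gain, weight)
        if total < overload:
            heapq.heappush(heap, (rel, vertex, weight))
            total += weight
        elif rel > heap[0][0]:
            heapq.heappush(heap, (rel, vertex, weight))
            total += weight
            # Evict the worst entries while the rest still covers the overload.
            while total - heap[0][2] >= overload:
                total -= heapq.heappop(heap)[2]
    return sorted(heap, reverse=True)

candidates = [("a", -4, 2), ("b", 3, 1), ("c", -1, 4), ("d", 2, 5)]
queue = build_queue(candidates, overload=5)
```

In this example the single heavy vertex d with positive gain replaces three lighter candidates, which matches the stated preference for few heavy moves.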
To choose which vertices to move, we then use a global reduction tree as illustrated in Fig. 2. Using the local PQs, each PE builds a sorted list per overloaded block containing up to \(\tau \) (a tuning parameter) vertices. At each level of the reduction tree, the sorted lists are then merged and truncated to the prefix that is sufficient to remove all excess weight, but no more than \(\tau \) vertices. Finally, the root PE selects a subset of the proposed vertices such that no other block becomes overloaded and broadcasts its decision to all PEs. Using this information, PEs update the current partition state, remove moved vertices from their PQs and update the relative gains of neighboring vertices. We repeat this process until the partition is balanced.
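The merge-and-truncate step of the reduction tree can be simulated sequentially as follows. The sketch handles a single overloaded block, represents each PE by a list, and omits the final feasibility check performed by the root; all names are illustrative.

```python
import heapq

def merge_and_truncate(left, right, overload, tau):
    """Merge two lists of (relative_gain, vertex, weight), sorted best first,
    and keep the shortest prefix that removes `overload`, at most tau entries."""
    merged = heapq.merge(left, right, key=lambda e: e[0], reverse=True)
    prefix, total = [], 0
    for entry in merged:
        if total >= overload or len(prefix) >= tau:
            break
        prefix.append(entry)
        total += entry[2]
    return prefix

def tree_reduce(per_pe_lists, overload, tau):
    """Pairwise reduction over the per-PE candidate lists."""
    level = [lst[:tau] for lst in per_pe_lists]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            right = level[i + 1] if i + 1 < len(level) else []
            nxt.append(merge_and_truncate(level[i], right, overload, tau))
        level = nxt
    return level[0]

per_pe = [[(5, "a", 2)], [(9, "b", 1), (1, "c", 3)], [(7, "d", 2)], []]
selected = tree_reduce(per_pe, overload=4, tau=3)
```

Because every intermediate list is truncated, the message size per tree level stays bounded by \(\tau \) entries per overloaded block.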
Refinement. We also use size-constrained label propagation to improve the current graph partition. In contrast to label propagation for clustering as described above, vertices are initially assigned to clusters representing the blocks of the partition, and the maximum block weight is used as weight constraint. We use the same iteration order and number of batches as during coarsening to move vertices to adjacent blocks such that the weight of intra-block edges is maximized without violating the balance constraint. Ties are broken in favor of the lighter block, or by coin flip if both blocks have the same weight.
Since the number of blocks during refinement is usually much smaller than the number of clusters during coarsening, we track the global block weights using an allreduce operation after each vertex batch. Note that this does not prevent violations of the balance constraint if multiple PEs move vertices to the same block during the same vertex batch. In this case, we use our global balancing algorithm described above afterwards to restore the balance constraint. This is a downside compared to refinement via size-constrained label propagation in shared-memory parallel graph partitioners, where the balance constraint can be strictly enforced using atomic compare-and-swap operations.
5 Implementation Details
Vertex and Edge IDs. To reduce the communication overheads, we distinguish between local and global vertex and edge identifiers. This allows us to use 64-bit data types for global and 32-bit data types for local IDs.
Graph Contraction. Contracting a clustering consisting of \(n_C\) clusters and constructing the corresponding coarse graph works as follows. First, the clustering algorithm described above assigns a cluster ID to each vertex, which corresponds to some vertex ID in the distributed graph. We say that a cluster is owned by the PE owning the corresponding vertex. After contracting the local subgraphs (i.e., deduplicating edges between clusters and accumulating vertex and edge weights), we map clusters to PEs such that each PE gets roughly the same number of coarse vertices while attempting to minimize the required communication amount. We assign \(\le \delta \cdot n_C / P\) clusters owned by each PE to the same PE (in our experiments, \(\delta = 1.1\)). If a PE owns more clusters, we redistribute the remaining clusters to PEs that have the smallest number of clusters assigned to them. Afterwards, each PE sends outgoing edges of coarse vertices to the respective PE using an alltoall operation, then builds the coarse graph by deduplicating edges and accumulating vertex and edge weights.
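The cluster-to-PE mapping can be sketched on cluster counts alone. Rounding the cap \(\delta \cdot n_C / P\) down and breaking ties by PE index are assumptions of this sketch; which concrete clusters are moved is not modeled.

```python
import heapq

def assign_clusters(owned_counts, delta=1.1):
    """owned_counts[p] is the number of clusters owned by PE p.

    Returns (kept, assigned): clusters that stay on their owner, and the
    final number of coarse vertices per PE after redistributing the surplus.
    """
    num_pes = len(owned_counts)
    n_c = sum(owned_counts)
    cap = int(delta * n_c / num_pes)
    kept = [min(c, cap) for c in owned_counts]
    assigned = list(kept)
    heap = [(a, p) for p, a in enumerate(assigned)]
    heapq.heapify(heap)
    # Hand each surplus cluster to the PE with the fewest clusters so far.
    for _ in range(n_c - sum(kept)):
        a, p = heapq.heappop(heap)
        assigned[p] += 1
        heapq.heappush(heap, (a + 1, p))
    return kept, assigned

kept, assigned = assign_clusters([10, 2, 4, 4])
```

Keeping clusters on their owner wherever the cap allows avoids communication, while the surplus is used to fill up underloaded PEs.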
Low-Latency Sparse All-to-All. Many steps of dKaMinPar require communication along the cut edges of the distributed graph, which translates to (often very) sparse and irregular all-to-all communication. Since MPI_Alltoallv has relatively high latency, we instead use a two-level approach that arranges PEs in a grid [21]. Then, messages are first sent to the right row, then to the right column, reducing the total number of messages sent through the network from \(\mathcal {O}(P^2)\) to \(\mathcal {O}(P \sqrt{P})\).
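The two-hop routing can be illustrated without MPI. The sketch assumes that P is a perfect square, that PEs are numbered from 0 in row-major order, and that messages sharing a hop are aggregated into one; the actual implementation may differ in all of these points.

```python
import math

def grid_route(src, dst, num_pes):
    """Route src -> dst via the PE in dst's row and src's column."""
    side = math.isqrt(num_pes)
    assert side * side == num_pes, "sketch assumes a square PE grid"
    hop = (dst // side) * side + (src % side)
    return [src, hop, dst]

def count_messages(num_pes):
    """Count aggregated point-to-point messages for a full all-to-all."""
    first, second = set(), set()
    for s in range(num_pes):
        for t in range(num_pes):
            _, hop, _ = grid_route(s, t, num_pes)
            if hop != s:
                first.add((s, hop))    # first phase: move to the right row
            if hop != t:
                second.add((hop, t))   # second phase: move to the right column
    return len(first) + len(second)

print(grid_route(1, 14, 16), count_messages(16))
```

For 16 PEs, every PE talks to at most 3 partners per phase, giving 96 aggregated messages instead of the 240 needed when every PE sends directly to every other PE.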
6 Experiments
We implemented the proposed algorithm dKaMinPar in C++ and compiled it using g++-12.1 with flags -O3 -march=native. We use OpenMPI 4.0 as parallelization library and growt [17] for hash tables. Raw experimental results are available online^{Footnote 2}.
Setup. We evaluate the solution quality of our algorithm on a shared-memory machine equipped with 1 TB of main memory and one AMD EPYC 7702P processor with 64 cores (Machine A). Additionally, we perform scalability experiments on a high-performance cluster where each compute node is equipped with 256 GB of main memory and two Intel Xeon Platinum 8368 processors (Machine B). The compute nodes are connected by an InfiniBand 4X HDR 200 GBit/s network with approx. 1 \(\mu \)s latency. We only use 64 out of the available 76 cores since some of the graph generators require the number of cores to be a power of two.
We compare dKaMinPar against the distributed versions of the algorithms included in Ref. [9], i.e., ParHIP [19] (v3.14) and ParMETIS [13] (v4.0.3). ParHIP offers two configurations, denoted ParHIP-Fast and ParHIP-Eco, which configure a trade-off between running time and partition quality. We do not include the distributed version of PuLP [26] (XtraPuLP [25]) in our main comparison, since its quality is not competitive with multilevel partitioners. Instead, a comparison against XtraPuLP is available in the full version [23] of the paper. We evaluate two configurations of our algorithm: dKaMinPar-Fast uses \(C = 2000\) as contraction limit (same as in Ref. [9]), KaMinPar [9] for initial partitioning and performs 3 iterations of label propagation during coarsening, whereas dKaMinPar-Strong uses \(C = 5000\) (same as in Ref. [19]), Mt-KaHyPar [11] for initial partitioning and 5 iterations of label propagation during coarsening.
Instances. We evaluate our algorithm on the graphs from Benchmark Set B of Ref. [9] and the graphs used in Ref. [19]. Additionally, we use KaGen [8] to evaluate the scaling capabilities of our algorithm on huge randomly generated 2D and 3D geometric and hyperbolic graphs denoted \(\textsf {rgg}_{\text {2D}}N\text {d}D\), \(\textsf {rgg}_{\text {3D}}N\text {d}D\) and \(\textsf {rhg}_{3.0}N\text {d}D\). These graphs have \(2^N\) vertices per compute node (i.e., per 64 cores), average degree D and power-law exponent 3 (hyperbolic graphs only).
Methodology. We call a combination of a graph and the number of blocks an instance. For each instance, we perform 5 repetitions with different seeds and aggregate the edge cuts and running times using the arithmetic mean. To aggregate over multiple instances, we use the geometric mean.
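The two-stage aggregation can be written down in a few lines. The dictionary layout and the sample numbers are invented for illustration.

```python
import math
from statistics import mean

def aggregate(results):
    """results maps an instance to its per-seed values (edge cuts or times).

    Repetitions are averaged with the arithmetic mean, instances with the
    geometric mean.
    """
    per_instance = [mean(values) for values in results.values()]
    log_sum = sum(math.log(x) for x in per_instance)
    return math.exp(log_sum / len(per_instance))

# Two hypothetical instances with two repetitions each.
results = {("graph_a", 2): [1, 3], ("graph_b", 2): [8, 8]}
print(aggregate(results))
```

The geometric mean keeps instances with very large edge cuts or running times from dominating the aggregate.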
To compare the solution quality of different algorithms, we use performance profiles [7]. Let \(\mathcal {A}\) be the set of algorithms we want to compare, \(\mathcal {I}\) the set of instances, and \(q_A(I)\) the quality of algorithm \(A \in \mathcal {A}\) on instance \(I \in \mathcal {I}\). For each algorithm A, we plot the fraction of instances \(\frac{|\mathcal {I}_A(\tau )|}{|\mathcal {I}|}\) (y-axis) where \(\mathcal {I}_A(\tau ) {:}{=}\{ I \in \mathcal {I} \mid q_A(I) \le \tau \cdot \min _{A' \in \mathcal {A}} q_{A'}(I) \}\) and \(\tau \) is on the x-axis. Achieving higher fractions at lower \(\tau \)-values is considered better. For \(\tau = 1\), the y-value indicates the percentage of instances for which an algorithm performs best.
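A performance profile can be computed directly from this definition. The sketch assumes that every algorithm produced a feasible result on every instance; the handling of infeasible or timed-out runs is not modeled, and the sample edge cuts are invented.

```python
def performance_profile(quality, taus):
    """quality[alg][instance] is an edge cut (lower is better).

    Returns, per algorithm, the fraction of instances whose cut is within
    a factor tau of the best cut, for every tau in taus.
    """
    instances = list(next(iter(quality.values())))
    best = {i: min(q[i] for q in quality.values()) for i in instances}
    return {
        alg: [sum(q[i] <= tau * best[i] for i in instances) / len(instances)
              for tau in taus]
        for alg, q in quality.items()
    }

quality = {
    "A": {"g1": 100, "g2": 220},
    "B": {"g1": 104, "g2": 200},
}
profile = performance_profile(quality, taus=[1.0, 1.05, 1.2])
```

Here both algorithms are best on one of the two instances, but B reaches all instances already at \(\tau = 1.05\), whereas A needs \(\tau = 1.2\).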
Solution Quality and Running Time. To evaluate the quality and running time of dKaMinPar, we partition all graphs of our benchmark set into \(k \in \{2, 4, \dots , 128\}\) blocks with \(\varepsilon = 3\%\) using all 64 cores of Machine A and compare partition qualities and running times against competing distributed MGP algorithms. Additionally, we compare the results against KaMinPar to evaluate the penalties of dKaMinPar due to its distributed nature. Further experiments with larger values for k are available in the full version [23] of the paper. We set the time limit for a single instance to one hour, which is approx. 10 times the running time of dKaMinPar-Fast on most instances^{Footnote 3}.
The results are summarized in Fig. 3a–c. In Fig. 3a, we can see that dKaMinPar-Fast finds the lowest edge cuts on approx. 60% of all benchmark instances, whereas ParMETIS and ParHIP-Fast only find better edge cuts on approx. 30% and 10% of all instances, respectively. Moreover, both competing algorithms frequently fail to compute feasible partitions. In particular, ParMETIS is unable to partition most social networks, violating the balance constraint or crashing on 34% of all instances. When looking at running times (Fig. 3c), we therefore only average over instances for which all partitioners computed a feasible partition or ran into the timeout (145 out of 224 instances). dKaMinPar-Fast (4.93 s geometric mean running time) is 1.4 and 3.4 times faster than ParMETIS (6.98 s) and ParHIP-Fast (16.77 s), respectively. While ParHIP-Eco achieves higher partition quality than dKaMinPar-Fast, Fig. 3b shows that equipping dKaMinPar with a stronger algorithm for initial partitioning is sufficient to achieve similar partition quality, while still being faster than ParHIP-Fast.
We evaluate the weak scalability of dKaMinPar using families of randomly generated graphs, \(k = 16\), and 64–8 192 cores (i.e., 1–128 compute nodes) of Machine B. Throughputs are shown in Fig. 4, where we observe weak scalability for dKaMinPar-Fast all the way to 8 192 cores on all three graph families. ParMETIS achieves similar and in some cases slightly higher throughputs than dKaMinPar, but is unable to efficiently partition hyperbolic graphs. ParHIP-Fast shows a drop in scalability beyond 2 048 cores, which is most likely due to the extensive and inefficient communication that it performs during graph contraction. Moreover, we note that ParHIP-Fast was originally designed to overlap local work and global communication during label propagation through the use of non-blocking MPI operations. This implementation relies on MPI progression threads, which seem to be unavailable in modern OpenMPI versions.
Per-instance edge cut results are available in the full version [23] of the paper. We observe that ParMETIS finds lower edge cuts than dKaMinPar-Fast on the dense \(\textsf {rgg}_{\text {2D}}26\text {d}32\) graph and both \(\textsf {rgg}_{\text {3D}}\) graphs by 5%–13%. However, on the sparser \(\textsf {rgg}_{\text {2D}}26\text {d}8\) graph, dKaMinPar-Fast has 19% smaller cuts than ParMETIS, which is already a considerable improvement. The gap gets much larger for the hyperbolic graph, where ParMETIS only finds approx. 5.5–6.1 times larger cuts. Such solutions will be unsuitable for many applications.
We now evaluate weak scalability in terms of graph size and number of blocks by scaling k with the number of compute nodes used. This implies that the number of blocks is large when using a large number of cores. The throughput of each algorithm in this setting is summarized in Fig. 5. Note that we only use the sparser graphs in this experiment, since ParMETIS and ParHIP are unable to partition the dense versions of the graphs even on few compute nodes.
ParHIP-Fast is unable to obtain a feasible partition on all but 6 instances, none of which uses more than 1 024 cores, and only shows increasing throughputs up to 256 cores. While ParMETIS achieves decent weak scalability and computes feasible solutions on the mesh-type graphs, it is unable to partition any graph on 8 192 cores and often crashes on fewer cores (e.g., it only works on up to 1 024 cores on \(\textsf {rgg}_{\text {2D}}\) with \(2^{12}\) vertices per block). On the random hyperbolic graph, it only computes a feasible solution on 64 cores. Meanwhile, dKaMinPar-Fast shows weak scalability up to 8 192 cores on every graph family, although it should be noted that the throughput increase from 4 096 to 8 192 is rather small.
In terms of number of edges cut, we summarize that dKaMinPar finds on average 19.3% and 2.8% lower edge cuts than ParMETIS and ParHIP-Fast, respectively (only averaging over instances for which the respective partitioner computed a feasible partition), with improvements ranging from 0% on \(\textsf {rgg}_{\text {3D}}26\text {d}8\) to approx. 60% on \(\textsf {rhg}_{3.0}26\text {d}8\) (\(2^{15}\) vertices per block). For detailed per-instance edge cut results, we refer to the full version [23] of the paper.
Strong Scalability of dKaMinPar. We partition three of the largest low- and high-degree graphs from our benchmark set into \(k = 16\) blocks using 64–8 192 cores of Machine B and a time limit of 15 min. The results are summarized in Fig. 6, where we can observe strong scalability for up to 1 024–2 048 cores on high-degree graphs. ParMETIS is unable to partition these graphs regardless of the number of cores used. While ParHIP-Fast scales up to 2 048 cores on uk-2007-05, we observe that its running time is still higher than dKaMinPar on just 64 cores. The twitter graph is difficult to coarsen due to its highly skewed degree distribution; here, we observe that only dKaMinPar can partition the graph within the time limit.
Turning towards graphs with small maximum degree, we observe strong scalability for up to 2 048, 2 048 and 1 024 cores on kmer_V1r, nlpkkt240 and \(\textsf {rgg}_{\text {2D}}27\), respectively. Similar to our weak scaling experiments, ParMETIS shows better scalability and throughput on the mesh-type graph \(\textsf {rgg}_{\text {2D}}\) as well as on nlpkkt240, but fails to partition kmer_V1r on 8 192 cores.
The edge cuts obtained remain relatively constant when scaling to a large number of cores. Surprisingly, the geometric mean edge cut on 8 192 cores is slightly better than on 64 cores (by 2.0%).
7 Conclusion and Future Work
Our distributed-memory graph partitioner dKaMinPar successfully partitions a wide range of input graphs using many thousands of cores, yielding high speed and good quality. Further improvements of the implementation might be possible, for example by making better use of shared memory on each compute node. Beyond that, one can explore the quality versus time trade-off. With distributed implementations of more powerful local improvement algorithms like local search or flow-based techniques, one could achieve better quality at the price of higher execution time. It then also makes sense to look at a portfolio of different partitioner variants that can be run in parallel, achieving good quality for subsets of inputs. For example, matching-based coarsening as in ParMETIS might help for mesh-like networks. On the other hand, more aggressive methods for handling high-degree nodes might help with some social networks.
Notes
1. Traditionally, \(L_k {:}{=}(1 + \varepsilon ) \lceil \frac{c(V)}{k} \rceil \) is used as balance constraint. We relax this constraint since it is otherwise NP-complete to find a feasible partition.
2.
3. Only twitter2010 takes 6 min and 7 min for \(k = 64\) and \(k = 128\), respectively.
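The traditional balance constraint from note 1 can be illustrated with a short feasibility check. This is a minimal sketch under stated assumptions: the function names, the node weights and the block assignments are hypothetical, and the check covers only the traditional bound \(L_k\), not the relaxed constraint.

```python
def traditional_max_block_weight(node_weights, k, epsilon):
    """Traditional bound L_k = (1 + epsilon) * ceil(c(V) / k) for integer node weights."""
    total_weight = sum(node_weights)
    ceil_avg = -(-total_weight // k)  # integer ceiling of c(V) / k
    return (1 + epsilon) * ceil_avg


def is_balanced(node_weights, block_of, k, epsilon):
    """Check whether every block weight stays within the traditional bound L_k."""
    limit = traditional_max_block_weight(node_weights, k, epsilon)
    block_weights = [0] * k
    for weight, block in zip(node_weights, block_of):
        block_weights[block] += weight
    return all(bw <= limit for bw in block_weights)


# Hypothetical instance: 10 unit-weight nodes, k = 4 blocks, epsilon = 3%.
weights = [1] * 10
limit = traditional_max_block_weight(weights, k=4, epsilon=0.03)
feasible = is_balanced(weights, [0, 0, 0, 1, 1, 1, 2, 2, 3, 3], k=4, epsilon=0.03)
infeasible = is_balanced(weights, [0, 0, 0, 0, 1, 1, 2, 2, 3, 3], k=4, epsilon=0.03)
print(limit, feasible, infeasible)
```

With unit weights the check is easy to satisfy; with arbitrary node weights, deciding whether any assignment meets the bound is the hard part that motivates the relaxation.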
References
Akhremtsev, Y., Sanders, P., Schulz, C.: High-quality shared-memory graph partitioning. IEEE Trans. Parallel Distrib. Syst. 31(11), 2710–2722 (2020)
Bichot, C., Siarry, P. (eds.): Graph Partitioning. Wiley, Hoboken (2011)
Buluç, A., Meyerhenke, H., Safro, I., Sanders, P., Schulz, C.: Recent advances in graph partitioning. In: Kliemann, L., Sanders, P. (eds.) Algorithm Engineering. LNCS, vol. 9220, pp. 117–158. Springer, Cham (2016). https://doi.org/10.1007/9783319494876_4
Çatalyürek, U.V., et al.: More recent advances in (hyper)graph partitioning. ACM Comput. Surv. 55, 1–38 (2022)
Chevalier, C., Pellegrini, F.: PT-Scotch: a tool for efficient parallel graph ordering. Parallel Comput. 34(6–8), 318–331 (2008)
Devine, K.D., et al.: Parallel hypergraph partitioning for scientific computing. In: 20th International Parallel and Distributed Processing Symposium (IPDPS 2006) (2006)
Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91(2), 201–213 (2002)
Funke, D., et al.: Communication-free massively distributed graph generation. J. Parallel Distrib. Comput. 131, 200–217 (2019)
Gottesbüren, L., et al.: Deep multilevel graph partitioning. In: 29th European Symposium on Algorithms (ESA). LIPIcs, vol. 204, pp. 48:1–48:17. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2021)
Gottesbüren, L., Heuer, T., Sanders, P.: Parallel flow-based hypergraph partitioning. In: 20th International Symposium on Experimental Algorithms (SEA 2022). LIPIcs, vol. 233, pp. 5:1–5:21 (2022)
Gottesbüren, L., Heuer, T., Sanders, P., Schlag, S.: Scalable shared-memory hypergraph partitioning. In: 23rd Workshop on Algorithm Engineering and Experiments (ALENEX 2021), pp. 16–30. SIAM (2021)
Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20(1), 359–392 (1998)
Karypis, G., Kumar, V.: Multilevel \(k\)-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput. 48(1), 96–129 (1998)
LaSalle, D., Karypis, G.: Multithreaded graph partitioning. In: 27th IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 225–236 (2013)
LaSalle, D., et al.: Improving graph partitioning for modern graphs and architectures. In: 5th Workshop on Irregular Applications - Architectures and Algorithms (IA3), pp. 14:1–14:4. ACM (2015)
von Looz, M., Tzovas, C., Meyerhenke, H.: Balanced k-means for parallel geometric partitioning. In: 47th International Conference on Parallel Processing (ICPP), pp. 52:1–52:10. ACM (2018)
Maier, T., Sanders, P., Dementiev, R.: Concurrent hash tables: fast and general(?)! ACM Trans. Parallel Comput. 5(4), 16:1–16:32 (2019)
Meyerhenke, H., Sanders, P., Schulz, C.: Partitioning complex networks via size-constrained clustering. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 351–363. Springer, Cham (2014). https://doi.org/10.1007/9783319079592_30
Meyerhenke, H., Sanders, P., Schulz, C.: Parallel graph partitioning for complex networks. IEEE Trans. Parallel Distrib. Syst. 28(9), 2625–2638 (2017)
Raghavan, N., Albert, R., Kumara, S.: Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 76, 36–106 (2007)
Sanders, P., Schimek, M.: Engineering massively parallel MST algorithms. In: 37th IEEE International Parallel and Distributed Processing Symposium (IPDPS) (2023)
Sanders, P., Schulz, C.: Think locally, act globally: highly balanced graph partitioning. In: Bonifaci, V., Demetrescu, C., Marchetti-Spaccamela, A. (eds.) SEA 2013. LNCS, vol. 7933, pp. 164–175. Springer, Heidelberg (2013). https://doi.org/10.1007/9783642385278_16
Sanders, P., Seemaier, D.: Distributed deep multilevel graph partitioning (2023). https://arxiv.org/abs/2303.01417
Schulz, C., Strash, D.: Graph partitioning: formulations and applications to big data. In: Sakr, S., Zomaya, A. (eds.) Encyclopedia of Big Data Technologies, pp. 1–7. Springer, Cham (2018). https://doi.org/10.1007/9783319639628_3122
Slota, G.M., et al.: Scalable, multi-constraint, complex-objective graph partitioning. IEEE Trans. Parallel Distrib. Syst. 31(12), 2789–2801 (2020)
Slota, G.M., Madduri, K., Rajamanickam, S.: PuLP: scalable multi-objective multi-constraint partitioning for small-world networks. In: 2014 IEEE International Conference on Big Data (IEEE BigData 2014), pp. 481–490 (2014)
Walshaw, C., Cross, M.: JOSTLE: parallel multilevel graph-partitioning software – an overview. Mesh Partitioning Tech. Domain Decomposition Tech. 10, 27–58 (2007)
Acknowledgments
This work was performed on the HoreKa supercomputer funded by the Ministry of Science, Research and the Arts Baden-Württemberg and by the Federal Ministry of Education and Research. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 882500).
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
About this paper
Cite this paper
Sanders, P., Seemaier, D. (2023). Distributed Deep Multilevel Graph Partitioning. In: Cano, J., Dikaiakos, M.D., Papadopoulos, G.A., Pericàs, M., Sakellariou, R. (eds) Euro-Par 2023: Parallel Processing. Euro-Par 2023. Lecture Notes in Computer Science, vol 14100. Springer, Cham. https://doi.org/10.1007/9783031396984_30
DOI: https://doi.org/10.1007/9783031396984_30
Publisher Name: Springer, Cham
Print ISBN: 9783031396977
Online ISBN: 9783031396984