
1 Introduction

Graphs are a central concept of computer science used whenever we need to model relations between objects. Consequently, handling large graphs is very important for parallel processing. This often requires partitioning these graphs into blocks of approximately equal weight with most edges inside the blocks (balanced graph partitioning). Applications include scientific computing, handling social networks, route planning, and graph databases [3].

In principle, multilevel graph partitioners (MGP) achieve high-quality partitions for a wide range of input graphs G with a good trade-off between quality and partitioning cost. They are based on first iteratively coarsening G by contracting edges or small clusters. The resulting small graph \(G'\) is then still a good representation of the overall input, and an initial partition of \(G'\) already induces a good partition of G. This is further improved by uncoarsening the graph and improving the partition on each level through refinement algorithms.

However, parallelizing multilevel graph partitioning has proved challenging over several decades. While shared-memory graph partitioners have recently matured to achieve high quality and reasonable scalability [1, 9, 10, 14], current distributed-memory partitioners [13, 19, 25] suffer a severe quality deterioration and often are not able to consistently achieve feasible (i.e., balanced) partitions. In particular, high-quality partitioners do not scale to the number of processing elements (PEs) available in large supercomputers. This situation is exacerbated by the fact that the number of blocks k often should increase linearly in the number of PEs. Previous systems cannot directly handle large k and run into even larger problems with achieving feasibility.

In this paper, we present dKaMinPar, which addresses all these issues. Its basis is a distributed-memory adaptation of the deep multilevel graph partitioning concept [9] that continues the multilevel approach deep into the initial partitioning phase. This makes the large-k case much easier and eliminates a parallelization bottleneck due to initial partitioning. Our coarsening and refinement algorithms are based on the label propagation approach previously used in several partitioners [13, 19, 25]. Label propagation [18, 20] greedily moves vertices to other clusters/blocks when this reduces the cut (and does not violate the balance constraint). This is simple, fast, effective, and robust even for complex networks. We develop a distributed-memory version with improved scalability, e.g., by using improved sparse all-to-all primitives. Perhaps the main algorithmic innovation is a set of new scalable distributed techniques for maintaining the balance constraint. During coarsening, a maximum cluster weight is approximately enforced by unwinding contractions that lead to overweight clusters. During uncoarsening, block weight constraints are achieved by finding, selecting and applying globally “best” block moves.

The experiments described in Sect. 6 indicate that our implementation has achieved the main goals. It scales to at least 8 192 cores, even for complex networks on which previous distributed solvers did not scale. Feasibility is guaranteed even for large k, and quality is typically within a few percent of shared-memory systems. Section 7 summarizes the results and discusses possible future improvements.

Contributions

  • Scalable distributed implementation of deep multilevel graph partitioning.

  • Simplicity using label propagation for both contraction and refinement.

  • New scalable balanced coarsening and uncoarsening algorithm.

  • Extensive evaluation on both large real-world networks and huge synthetic networks from three input families.

  • Quality comparable to shared-memory systems.

  • Scalability up to (at least) \(2^{13}\) machine cores and \(2^{39}\) graph edges.

  • Works both for complex networks and for large numbers of blocks, where previous systems often fail.

2 Preliminaries

Notation and Definitions. Let \(G = (V, E, c, \omega )\) be an undirected graph with vertex weights \(c: V \rightarrow \mathbb {N}_{> 0}\), edge weights \(\omega : E \rightarrow \mathbb {N}_{> 0}\), \(n {:}{=}|V |\), and \(m {:}{=}|E |\). We extend c and \(\omega \) to sets, i.e., \(c(V') {:}{=}\sum _{v \in V'} c(v)\) and \(\omega (E') {:}{=}\sum _{e \in E'} \omega (e)\). \(N(v) {:}{=}\{ u \mid \{ u, v \} \in E\}\) denotes the neighbors of v. For some \(V' \subseteq V\), \(G[V']\) denotes the subgraph of G induced by \(V'\). We are looking for blocks of nodes \(\varPi {:}{=}\{ V_1, \dots , V_k \}\) that partition V, i.e., \(V_1 \cup \dots \cup V_k = V\) and \(V_i \cap V_j = \emptyset \) for \(i \ne j\). The balance constraint demands that for all \(i \in \{1, \dots , k\}\), \(c(V_i) \le L_\text {max} {:}{=}\max \{ (1 + \varepsilon ) \frac{c(V)}{k}, \frac{c(V)}{k} + \max _v c(v) \}\) for some imbalance parameter \(\varepsilon \). The objective is to minimize \(\text {cut}(\varPi ) {:}{=}\sum _{i < j} \omega (E_{ij})\) (weight of all cut edges), where \(E_{ij} {:}{=}\{ \{u, v\} \in E \mid u \in V_i \text { and } v \in V_j \}\). We call a vertex \(u \in V_i\) that has a neighbor in \(V_j\), \(i \ne j\), a boundary vertex. A clustering \(\mathcal {C} {:}{=}\{ C_1, \dots , C_\ell \}\) is also a partition of V, where the number of blocks \(\ell \) is not given in advance (there is also no balance constraint).
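To make these definitions concrete, consider a small illustrative example (numbers chosen for exposition, not taken from the paper): for a graph with unit vertex weights, \(n = 100\), \(k = 4\), and \(\varepsilon = 0.03\), the balance constraint evaluates to \(L_\text {max} = \max \{ 1.03 \cdot \frac{100}{4}, \frac{100}{4} + 1 \} = \max \{ 25.75, 26 \} = 26\), so every block may contain at most 26 vertices; a partition with block sizes (26, 26, 26, 22) is feasible, whereas (27, 25, 24, 24) is not.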

Machine Model and Input Format. The distributed-memory model used in this work considers P processing elements (PEs) numbered 1..P, connected by a full-duplex, single-ported communication network. The input graph is given with a (usually balanced) 1D vertex partition. Each PE is given a subgraph of the input graph (i.e., a block of the 1D partition) with consecutive vertices. An undirected edge \(\{u, v\}\) is represented by two directed edges (u, v), (v, u), which are stored on the PEs owning the respective tail vertices. Vertices adjacent to vertices owned by other PEs are called interface vertices and are replicated as ghost vertices (i.e., without outgoing edges) on those PEs.
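To make the input format concrete, the following sketch shows one possible per-PE representation consistent with the description above: the owned subgraph in CSR form with ghost vertices appended after the local vertices. Names and layout are illustrative assumptions and not necessarily dKaMinPar's actual data structures.

    #include <cstdint>
    #include <vector>

    // Illustrative per-PE graph representation (assumed names, not dKaMinPar's code).
    // Local vertices are stored in CSR form; ghost vertices are appended after the
    // local vertices and have no outgoing edges.
    struct LocalGraph {
      std::uint64_t first_global_vertex; // global ID of the first locally owned vertex
      std::uint32_t num_local_vertices;  // vertices owned by this PE
      std::uint32_t num_ghost_vertices;  // replicas of interface vertices of other PEs

      std::vector<std::uint64_t> xadj;   // CSR offsets, size num_local_vertices + 1
      std::vector<std::uint32_t> adjncy; // local IDs of neighbors (owned or ghost)
      std::vector<std::int64_t> vertex_weight; // weights of owned and ghost vertices
      std::vector<std::int64_t> edge_weight;   // weights of the outgoing edges

      std::vector<std::uint64_t> ghost_to_global; // global ID of each ghost vertex
      std::vector<int> ghost_owner;               // owning PE of each ghost vertex
    };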

3 Related Work

There has been a huge amount of research on graph partitioning so that we refer the reader to overview papers [2,3,4, 24] for most of the general material. Here, we focus on parallel algorithms for high-quality graph partitioning.

Distributed Graph Partitioning. Virtually all high-quality partitioners are based on the multilevel paradigm, e.g., ParMETIS [12, 13], ParHIP [19, 22] and others [5, 27]. These algorithms partition a graph in three phases. First, they build a hierarchy of successively coarser approximations of the input graph, usually by contracting matchings or clusters. Once the graph has only a few vertices left (e.g., \(n \le C k\) for some contraction limit C), it is partitioned into k blocks. Finally, this partition is successively projected onto the finer levels of the hierarchy and refined using local search algorithms.

The performance of multilevel algorithms is defined by the algorithmic components used for these phases. Partitioners designed for mesh-partitioning usually contract matchings to coarsen the graph [5, 13, 27]. However, this technique is not suitable for partitioning complex networks that only admit a small maximum matching. Thus, other partitioners use two-hop matchings [15] or size-constrained label propagation [9, 11, 19]. Due to its simple yet effective nature, the latter is also commonly used as a local search algorithm during refinement [1, 6, 9, 11, 13, 19, 27].

Label propagation has also been used by non-multilevel graph partitioning algorithms such as XtraPuLP [25], which reports scalability up to \(2^{17}\) cores, a level which has not been reached by multilevel algorithms. However, using label propagation without the multilevel paradigm comes with a pronounced decline in quality; Ref. [9] reports edge cuts for PuLP [26] (non-multilevel) that are on average more than twice as large as those of KaMinPar (multilevel). Across a large and diverse benchmark set, this is a substantial gap, as most multilevel algorithms achieve average edge cuts within a few percentage points of each other. Another class of highly scalable graph partitioners includes geometric partitioners, which work on a geometric embedding of the graph. While these algorithms are orders of magnitude faster than multilevel algorithms [16], they generally compute larger edge cuts and only work on graphs with a meaningful geometric embedding.

Deep Multilevel Graph Partitioning. As plain MGP algorithms usually shrink the graph down to Ck vertices, large values of k break the assumption that the coarsest graph is small. This causes their performance to deteriorate [9]. Instead, recursive bipartitioning can be used to compute partitions with large k, but this induces an additional \(\log k\) factor in running time and makes it more difficult to compute balanced partitions due to the lack of a global view. Deep multilevel graph partitioning (deep MGP) [9] circumvents these problems by continuing coarsening deep into initial partitioning. More precisely, deep MGP coarsens the graph until only 2C vertices are left, independent of k. After bipartitioning the coarsest graphs, it maintains the invariant that a (coarse) graph with n vertices is partitioned into \(\min \{k, n / C\}\) blocks by using recursive bipartitioning on the current level. By using additional balancing techniques, partitioners based on deep MGP can obtain feasible high-quality partitions with a large number of blocks (e.g., \(k \approx 1M\)) while often being an order of magnitude faster than partitioners based on plain MGP. Compared to recursive bipartitioning of the entire graph, it reduces the additional \(\log k\) factor to \(\log (kC/n)\). KaMinPar [9] is a scalable shared-memory implementation of deep MGP which uses size-constrained label propagation during coarsening and refinement.

4 Distributed Deep Multilevel Graph Partitioning

In this section, we present dKaMinPar, a distributed graph partitioner that leverages deep MGP. We first describe the distributed deep MGP scheme, which partitions a graph on P PEs into k blocks. For simplicity, we assume that k and P are powers of two. Then, we explain the different algorithmic components for coarsening, initial partitioning, refinement and balancing in more detail.

Fig. 1. Distributed deep multilevel graph partitioning on \(P = 4\) PEs to partition a graph G into \(k = 4\) blocks. Unpartitioned graphs are labeled with their number of vertices. During initial partitioning and uncoarsening, blocks are recursively partitioned into \(K = 2\) blocks. Bold horizontal lines illustrate PE groups working independently.

Distributed Deep Multilevel Partitioning. Recall that deep MGP [9] follows the traditional multilevel graph partitioning scheme, but coarsens the graph down to a small size independent of k. After partitioning the coarsest graph into a small number of blocks, it maintains the invariant that each block of the current partition contains roughly C vertices throughout the uncoarsening phase (until there are k blocks).

To adapt this scheme to the distributed setting, we use distributed algorithms for graph clustering, contraction, and partition refinement (see below). Initial partitioning of the coarsest graph and block-induced subgraphs is done using an in-memory partitioner by gathering full copies of the graphs on individual PEs. Since this process is communication heavy, we generalize the bipartitioning steps of deep MGP to K-way partitioning for some tuning parameter K. The scheme then works as follows.

We repeatedly coarsen the input graph until only \(K \cdot C\) vertices are left, building a hierarchy of successively coarser graphs. During this process, we exploit parallelism and improve scalability on coarse levels of the hierarchy by maintaining the invariant that P PEs work on a graph with at least \(P \cdot C\) vertices [27]. More precisely, we check on each level whether the current graph \(G_i\) has fewer than \(P \cdot C\) vertices. If so, we split the P PEs into two subgroups \(1..\frac{P}{2}\) and \(\frac{P}{2}+1..P\) and mirror the parts of \(G_i\) between PEs j and \(\frac{P}{2} + j\), \(1 \le j \le \frac{P}{2}\), such that each subgroup obtains an identical copy of \(G_i\). The subgroups then continue this procedure recursively. Since the clustering, initial partitioning and refinement algorithms are randomized, working on independent copies also adds diversification on coarse levels. In Fig. 1, we illustrate this process by using bold horizontal lines, duplicating \(G_{\ell - 1}\) and \(G_{\ell }\). The coarsest graph is then copied to each PE and partitioned into \(\min \{k, K\}\) blocks using an in-memory partitioner. The best partition (within each group of PEs) is selected and projected onto \(G_{\ell - 1}\) by assigning fine vertices to the blocks of their corresponding coarse vertices. From here, we maintain two invariants during uncoarsening:

  1. The current partition is feasible, which we ensure using the distributed balancing algorithm described below, and

  2. each block contains roughly C vertices (until there are k blocks).

To maintain the latter invariant, assume that the current graph with \(|V(G_i) |\) vertices is partitioned into \(k' < k\) blocks. Then, we assign \(k' / P\) blocks to each PE and use all-to-all communication to gather full copies of the block-induced subgraphs. These subgraphs are then partitioned into \(\min \{k / k', K\}\) blocks using an in-memory partitioner. Afterwards, we update the partition of the distributed graph using all-to-all communication and subsequently improve it using a refinement algorithm. We repeat this process until we obtain a partition where each block contains roughly C vertices. Note that if \(k > |V(G_1) |/ C\), the partition computed on the finest graph does not have enough blocks. In this case, we distribute and partition the block-induced subgraphs once more to compute the missing blocks.
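As an illustration of the block-count invariant (numbers chosen purely for exposition), suppose \(K = 2\), \(C = 2000\), \(k = 8\), \(P = 4\), and the current graph has 16 000 vertices partitioned into \(k' = 4\) blocks of roughly 4000 vertices each. Each PE then gathers \(k' / P = 1\) block-induced subgraph and partitions it into \(\min \{k / k', K\} = 2\) blocks, yielding \(\min \{k, 16\,000 / C\} = 8\) blocks of roughly \(C\) vertices each.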

Coarsening. We use a parallelization of size-constrained label propagation similar to ParHIP [19] and KaMinPar [9] to cluster the graphs. The algorithm works by first assigning each vertex to its own cluster. In further iterations over the vertices (we use 3 or 5 iterations, depending on the configuration), vertices are then moved to adjacent clusters such that the weight of intra-cluster edges is maximized without violating the maximum cluster weight \(W_i {:}{=}\varepsilon \frac{c(V)}{k'_i}\), where \(k'_i {:}{=}\min \{k, |V(G_i) |/ C\}\) [9] and i is the current level of the graph hierarchy.
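The following sketch illustrates the core of a single local pass of size-constrained label propagation: each vertex moves to the adjacent cluster with the strongest connection that still has room under the maximum cluster weight. It ignores ghost vertices, batching and communication, and all names are illustrative assumptions rather than dKaMinPar's actual code.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct CSRGraph {
      std::vector<std::uint64_t> xadj;         // CSR offsets
      std::vector<std::uint32_t> adjncy;       // neighbor IDs
      std::vector<std::int64_t> edge_weight;   // weight of each outgoing edge
      std::vector<std::int64_t> vertex_weight; // weight of each vertex
    };

    // One sequential pass over all vertices (a sketch of the local work only).
    void label_propagation_pass(const CSRGraph &g, std::vector<std::uint32_t> &cluster,
                                std::vector<std::int64_t> &cluster_weight,
                                std::int64_t max_cluster_weight) {
      const std::uint32_t n = static_cast<std::uint32_t>(g.xadj.size() - 1);
      for (std::uint32_t u = 0; u < n; ++u) {
        // Accumulate the weight of edges from u towards each adjacent cluster.
        std::unordered_map<std::uint32_t, std::int64_t> rating;
        for (std::uint64_t e = g.xadj[u]; e < g.xadj[u + 1]; ++e) {
          rating[cluster[g.adjncy[e]]] += g.edge_weight[e];
        }
        // Pick the adjacent cluster with the highest connectivity that can still
        // take u without exceeding the maximum cluster weight.
        std::uint32_t best = cluster[u];
        std::int64_t best_rating = rating[best];
        for (const auto &[c, r] : rating) {
          if (c != cluster[u] && r > best_rating &&
              cluster_weight[c] + g.vertex_weight[u] <= max_cluster_weight) {
            best = c;
            best_rating = r;
          }
        }
        if (best != cluster[u]) {
          cluster_weight[cluster[u]] -= g.vertex_weight[u];
          cluster_weight[best] += g.vertex_weight[u];
          cluster[u] = best;
        }
      }
    }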

As noted in Ref. [1, 18], the solution quality of label propagation improves when iterating over vertices in increasing degree order. Since this is not cache-efficient and lacks diversification by randomization, we sort the vertices into exponentially spaced degree buckets, i.e., bucket b contains all vertices with degree \(2^b \le d < 2^{b + 1}\), and rearrange the input graph accordingly. This happens locally on each PE. Then, during label propagation, we split buckets into small chunks and randomize traversal on an inter-chunk and intra-chunk level, analogously to the randomization of the matching algorithm used by Metis [12].
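For a vertex of degree \(d \ge 1\), the bucket index is simply \(\lfloor \log _2 d \rfloor \); a minimal sketch (the handling of degree-zero vertices is an assumption):

    #include <cstdint>

    // Bucket b contains all vertices of degree d with 2^b <= d < 2^(b+1);
    // degree-zero vertices are mapped to bucket 0 here (an assumption).
    int degree_bucket(std::uint64_t degree) {
      int b = 0;
      while (degree > 1) {
        degree >>= 1;
        ++b;
      }
      return b;
    }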

To communicate the current cluster assignment of interface vertices, we follow ParHIP and split each iteration into \(\max \{\alpha , \beta / P\}\) batches (we use \(\alpha = 8\), \(\beta = 128\)). After each batch, we use a sparse all-to-all operation to notify adjacent PEs of interface vertices that were moved to a different cluster. Since clusters can span multiple PEs, enforcing the maximum cluster weight becomes more challenging than in a shared-memory setting. ParHIP relaxes the weight limit and only enforces it locally, consequently allowing clusters with weight up to \(P \cdot W_i\). This can lead to very heavy coarse vertices, making it more difficult to compute balanced partitions. Instead, we track the global cluster weights by sending the change in cluster weight after each batch to the PE owning the initial vertex of the cluster, which accumulates the changes and replies with the total weight of the cluster. If a cluster becomes heavier than \(W_i\), each PE reverts moves proportionally to its share of the total cluster weight. Those vertices can then be moved to other clusters in subsequent iterations.
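A minimal sketch of the proportional reversal rule as we read it: once the accumulated global weight of a cluster exceeds the limit, each PE removes a share of the excess proportional to the weight it contributed. The exact bookkeeping and rounding in dKaMinPar may differ.

    #include <cstdint>

    // Weight that one PE should revert for an overweight cluster; local_contribution
    // is the cluster weight contributed by this PE, and the contributions of all PEs
    // are assumed to sum to total_cluster_weight (an assumption of this sketch).
    std::int64_t weight_to_revert(std::int64_t total_cluster_weight,
                                  std::int64_t max_cluster_weight,
                                  std::int64_t local_contribution) {
      if (total_cluster_weight <= max_cluster_weight) {
        return 0; // cluster is not overweight
      }
      const std::int64_t excess = total_cluster_weight - max_cluster_weight;
      // Round up so that the shares of all PEs together cover the whole excess.
      return (excess * local_contribution + total_cluster_weight - 1) / total_cluster_weight;
    }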

After clustering the graph, we contract the clusters to build the next graph in the hierarchy. We give more details on this operation in Sect. 5.

Fig. 2. Illustration of the rebalancing algorithm with \(P = 4\) PEs (background color), two overloaded blocks \(V_0\), \(V_2\), and \(\tau = 2\) vertices per overloaded block and round. Proposed moves are indicated by arrows, with their relative gains encoded by vertex size.

Balancing. Balance constraint violations during deep MGP can occur after initial partitioning or after projecting a coarse graph partition onto a finer level of the graph hierarchy [9]. Since these balance constraint violations are bounded by the weight of the heaviest vertex, we design the following balancing algorithm based on the assumption that only few vertex moves are necessary to restore balance. Thus, it is feasible to invest a moderate amount of work per vertex move.

For each overloaded block B, each PE maintains a local priority queue \(P_B\) of vertices in block B, ordered by their relative gain, which we define as \(g \cdot c(v)\) if \(g \ge 0\) and \(g / c(v)\) if \(g < 0\). Here, g is the largest reduction in edge cut when moving v to a non-overloaded block. This rating function prefers moving few heavy vertices over many light vertices, supporting our assumption that few vertex moves are sufficient to balance the partition. To keep the priority queues small, we maintain the invariant that \(P_B\) stores no more vertices than are necessary to remove all excess weight \(o(B) {:}{=}c(B) - L_\text {max}\) from B. To this end, we initialize \(P_B\) by iterating over all vertices v in B and inserting v into \(P_B\) if \(c(P_B) < o(B)\). Otherwise, we only insert v if it can replace another vertex with a worse relative gain.
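The rating function itself is straightforward; a small sketch (using double for the relative gain is an illustrative choice, not necessarily dKaMinPar's representation):

    #include <cstdint>

    // Relative gain as defined above: g * c(v) for non-negative gains and g / c(v)
    // for negative gains, where g is the best cut reduction over all non-overloaded
    // target blocks and c(v) is the weight of the vertex.
    double relative_gain(std::int64_t gain, std::int64_t vertex_weight) {
      if (gain >= 0) {
        return static_cast<double>(gain) * static_cast<double>(vertex_weight);
      }
      return static_cast<double>(gain) / static_cast<double>(vertex_weight);
    }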

To choose which vertices to move, we then use a global reduction tree as illustrated in Fig. 2. Using the local PQs, each PE builds a sorted list per overloaded block containing up to \(\tau \) (a tuning parameter) vertices. At each level of the reduction tree, the sorted lists are then merged and truncated to the prefix that is sufficient to remove all excess weight, but no more than \(\tau \) vertices. Finally, the root PE selects a subset of the proposed vertices such that no other block becomes overloaded and broadcasts its decision to all PEs. Using this information, PEs update the current partition state, remove moved vertices from their PQs and update the relative gains of neighboring vertices. We repeat this process until the partition is balanced.
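The following sketch shows the merge-and-truncate step performed for a single overloaded block at one level of the reduction tree; field names, the sort order (best relative gain first) and tie handling are assumptions.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <iterator>
    #include <vector>

    // Candidate move for one overloaded block, as exchanged along the reduction tree.
    struct Candidate {
      std::uint64_t global_vertex;
      std::int64_t weight;
      double relative_gain;
    };

    // Merge two candidate lists that are sorted by descending relative gain and keep
    // the shortest prefix whose weight covers the excess weight of the block, but no
    // more than tau candidates.
    std::vector<Candidate> merge_and_truncate(const std::vector<Candidate> &a,
                                              const std::vector<Candidate> &b,
                                              std::int64_t excess_weight, std::size_t tau) {
      std::vector<Candidate> merged;
      merged.reserve(a.size() + b.size());
      std::merge(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(merged),
                 [](const Candidate &x, const Candidate &y) {
                   return x.relative_gain > y.relative_gain;
                 });
      std::int64_t covered = 0;
      std::size_t keep = 0;
      while (keep < merged.size() && keep < tau && covered < excess_weight) {
        covered += merged[keep].weight;
        ++keep;
      }
      merged.resize(keep);
      return merged;
    }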

Refinement. We also use size-constrained label propagation to improve the current graph partition. In contrast to label propagation for clustering as described above, vertices are initially assigned to clusters representing the blocks of the partition, and the maximum block weight \(L_\text {max}\) is used as the weight constraint. We use the same iteration order and number of batches as during coarsening to move vertices to adjacent blocks such that the weight of intra-block edges is maximized without violating the balance constraint. Ties are broken in favor of the lighter block, or by coin flip if both blocks have the same weight.

Since the number of blocks during refinement is usually much smaller than the number of clusters during coarsening, we track the global block weights using an allreduce operation after each vertex batch. Note that this does not prevent violations of the balance constraint if multiple PEs move vertices to the same block during the same vertex batch. In this case, we subsequently use the global balancing algorithm described above to restore the balance constraint. This is a downside compared to refinement via size-constrained label propagation in shared-memory parallel graph partitioners, where the balance constraint can be strictly enforced using atomic compare-and-swap operations.

5 Implementation Details

Vertex and Edge IDs. To reduce communication overheads, we distinguish between local and global vertex and edge identifiers. This allows us to use 64-bit data types for global IDs and 32-bit data types for local IDs.
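For illustration, this amounts to something like the following type definitions (names are assumptions):

    #include <cstdint>

    // 32-bit IDs suffice to address vertices and edges within a PE's subgraph,
    // 64-bit IDs are needed to address vertices and edges of the whole graph.
    using LocalNodeID = std::uint32_t;
    using LocalEdgeID = std::uint32_t;
    using GlobalNodeID = std::uint64_t;
    using GlobalEdgeID = std::uint64_t;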

Graph Contraction. Contracting a clustering consisting of \(n_C\) clusters and constructing the corresponding coarse graph works as follows. First, the clustering algorithm described above assigns a cluster ID to each vertex, which corresponds to some vertex ID in the distributed graph. We say that a cluster is owned by the PE owning the corresponding vertex. After contracting the local subgraphs (i.e., deduplicating edges between clusters and accumulating vertex and edge weights), we map clusters to PEs such that each PE gets roughly the same number of coarse vertices while attempting to minimize the required communication volume. Each PE keeps up to \(\delta \cdot n_C / P\) of the clusters it owns (in our experiments, \(\delta = 1.1\)); if a PE owns more clusters, we redistribute the remaining clusters to the PEs that have the smallest number of clusters assigned to them. Afterwards, each PE sends the outgoing edges of its coarse vertices to the respective PEs using an all-to-all operation, then builds the coarse graph by deduplicating edges and accumulating vertex and edge weights.

Low-Latency Sparse All-to-All. Many steps of dKaMinPar require communication along the cut edges of the distributed graph, which translates to (often very) sparse and irregular all-to-all communication. Since MPI_Alltoallv has relatively high latency, we instead use a two-level approach that arranges the PEs in a grid [21]. Messages are first sent to the correct row and then to the correct column, reducing the total number of messages sent through the network from \(\mathcal {O}(P^2)\) to \(\mathcal {O}(P \sqrt{P})\).
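A sketch of the routing idea, assuming \(P = r \cdot r\) PEs arranged row-major in an \(r \times r\) grid; the exact indirection scheme of [21] may differ in details. A message from PE s to PE t is first sent to the PE in s's row and t's column, which then forwards it within that column to t, so each PE only communicates with \(\mathcal {O}(\sqrt{P})\) partners per phase.

    #include <cassert>

    // PE that relays a message from s to t in the grid scheme sketched above:
    // it lies in s's row and t's column of the r x r grid (row-major numbering).
    int intermediate_pe(int s, int t, int r) {
      assert(0 <= s && s < r * r && 0 <= t && t < r * r);
      const int row_of_s = s / r;
      const int column_of_t = t % r;
      return row_of_s * r + column_of_t;
    }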

6 Experiments

We implemented the proposed algorithm dKaMinPar in C++ and compiled it using g++-12.1 with flags -O3 -march=native. We use OpenMPI 4.0 as our parallelization library and growt [17] for hash tables. Raw experimental results are available online.

Setup. We evaluate the solution quality of our algorithm on a shared-memory machine equipped with 1 TB of main memory and one AMD EPYC 7702P processor with 64 cores (Machine A). Additionally, we perform scalability experiments on a high-performance cluster where each compute node is equipped with 256 GB of main memory and two Intel Xeon Platinum 8368 processors (Machine B). The compute nodes are connected by an InfiniBand 4X HDR 200 Gbit/s network with approx. 1 \(\mu \)s latency. We only use 64 of the available 76 cores per node since some of the graph generators require the number of cores to be a power of two.

We compare dKaMinPar against the distributed versions of the algorithms included in Ref. [9], i.e., ParHIP [19] (v3.14) and ParMETIS [13] (v4.0.3). ParHIP offers two configurations, denoted ParHIP-Fast and ParHIP-Eco, which trade off running time against partition quality. We do not include the distributed version of PuLP [26] (XtraPuLP [25]) in our main comparison, since its quality is not competitive with multilevel partitioners; instead, a comparison against XtraPuLP is available in the full version [23] of the paper. We evaluate two configurations of our algorithm: dKaMinPar-Fast uses \(C = 2000\) as the contraction limit (same as in Ref. [9]), KaMinPar [9] for initial partitioning and 3 iterations of label propagation during coarsening, whereas dKaMinPar-Strong uses \(C = 5000\) (same as in Ref. [19]), Mt-KaHyPar [11] for initial partitioning and 5 iterations of label propagation during coarsening.

Instances. We evaluate our algorithm on the graphs from Benchmark Set B of Ref. [9] and the graphs used in Ref. [19]. Additionally, we use KaGen [8] to evaluate the scaling capabilities of our algorithm on huge randomly generated 2D and 3D geometric and hyperbolic graphs denoted \(\textsf {rgg}_{\text {2D}}N\text {d}D\), \(\textsf {rgg}_{\text {3D}}N\text {d}D\) and \(\textsf {rhg}_{3.0}N\text {d}D\). These graphs have \(2^N\) vertices per compute node (i.e., per 64 cores), average degree D and power-law exponent 3 (hyperbolic graphs only).

Methodology. We call a combination of a graph and the number of blocks an instance. For each instance, we perform 5 repetitions with different seeds and aggregate the edge cuts and running times using the arithmetic mean. To aggregate over multiple instances, we use the geometric mean.

To compare the solution quality of different algorithms, we use performance profiles [7]. Let \(\mathcal {A}\) be the set of algorithms we want to compare, \(\mathcal {I}\) the set of instances, and \(q_A(I)\) the quality of algorithm \(A \in \mathcal {A}\) on instance \(I \in \mathcal {I}\). For each algorithm A, we plot the fraction of instances \(\frac{|\mathcal {I}_A(\tau ) |}{|\mathcal {I} |}\) (y-axis) where \(\mathcal {I}_A(\tau ) {:}{=}\{ I \in \mathcal {I} \mid q_A(I) \le \tau \cdot \min _{A' \in \mathcal {A}} q_{A'}(I) \}\) and \(\tau \) is on the x-axis. Achieving higher fractions at lower \(\tau \)-values is considered better. For \(\tau = 1\), the y-value indicates the percentage of instances for which an algorithm performs best.
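For illustration, a single y-value of such a plot can be computed as in the following sketch (not the evaluation scripts used for the paper); quality[A][I] holds \(q_A(I)\) for every algorithm and instance.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Fraction of instances on which algorithm a is within a factor tau of the best
    // algorithm; rows of `quality` are algorithms, columns are instances.
    double profile_point(const std::vector<std::vector<double>> &quality, std::size_t a,
                         double tau) {
      const std::size_t num_instances = quality.at(a).size();
      std::size_t count = 0;
      for (std::size_t i = 0; i < num_instances; ++i) {
        double best = quality[0][i];
        for (const auto &row : quality) {
          best = std::min(best, row[i]);
        }
        if (quality[a][i] <= tau * best) {
          ++count;
        }
      }
      return static_cast<double>(count) / static_cast<double>(num_instances);
    }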

Fig. 3. Results for \(k \in \{2, 4, 8, 16, 32, 64, 128\}\) with \(\varepsilon = 3\%\) on Machine A. From left to right: (a) edge cuts of dKaMinPar-Fast, ParHIP-Fast and ParMETIS, (b) edge cuts of dKaMinPar-Fast, dKaMinPar-Strong and ParHIP-Eco, (c) running times of all algorithms. The numbers above the x-axis are geometric mean running times [s] over all instances for which all algorithms produced a result. Timeouts are marked with a dedicated symbol; failed runs or infeasible results are marked with \(\times \).

Solution Quality and Running Time. To evaluate the quality and running time of dKaMinPar, we partition all graphs of our benchmark set into \(k \in \{2, 4, \dots , 128\}\) blocks with \(\varepsilon = 3\%\) using all 64 cores of Machine A and compare partition qualities and running times against competing distributed MGP algorithms. Additionally, we compare the results against KaMinPar to evaluate the penalties of dKaMinPar due to its distributed nature. Further experiments with larger values for k are available in the full version [23] of the paper. We set the time limit for a single instance to one hour, which is approx. 10 times the running time of dKaMinPar-Fast on most instances.

The results are summarized in Fig. 3a–c. In Fig. 3a, we can see that dKaMinPar-Fast finds the lowest edge cuts on approx. 60% of all benchmark instances, whereas ParMETIS and ParHIP-Fast only find better edge cuts on approx. 30% and 10% of all instances, respectively. Moreover, both competing algorithms frequently fail to compute feasible partitions; in particular, ParMETIS is unable to partition most social networks, violating the balance constraint or crashing on 34% of all instances. When looking at running times (Fig. 3c), we therefore only average over instances for which all partitioners computed a feasible partition or ran into the timeout (145 out of 224 instances). dKaMinPar-Fast (4.93 s geometric mean running time) is 1.4 and 3.4 times faster than ParMETIS (6.98 s) and ParHIP-Fast (16.77 s), respectively. While ParHIP-Eco achieves higher partition quality than dKaMinPar-Fast, Fig. 3b shows that equipping dKaMinPar with a stronger algorithm for initial partitioning is sufficient to achieve similar partition quality, while still being faster than ParHIP-Fast.

Fig. 4. Throughput of \(\textsf {rgg}_{\text {2D}}\), \(\textsf {rgg}_{\text {3D}}\) and rhg graphs with \(2^{26}\) vertices per compute node, average degree \(\in \{8, 32\}\), \(k = 16\) and \(\varepsilon = 3\%\) on 64–8 192 cores of Machine B.

We evaluate the weak scalability of dKaMinPar using families of randomly generated graphs, \(k = 16\), and 64–8 192 cores (i.e., 1–128 compute nodes) of Machine B. Throughputs are shown in Fig. 4, where we observe weak scalability for dKaMinPar-Fast all the way up to 8 192 cores on all three graph families. ParMETIS achieves similar and in some cases slightly higher throughputs than dKaMinPar, but is unable to efficiently partition hyperbolic graphs. ParHIP-Fast shows a drop in scalability beyond 2 048 cores, which is most likely due to the extensive and inefficient communication it performs during graph contraction. Moreover, we note that ParHIP-Fast was originally designed to overlap local work and global communication during label propagation through the use of nonblocking MPI operations. This implementation relies on MPI progression threads, which seem to be unavailable in modern OpenMPI versions.

Per-instance edge cut results are available in the full version [23] of the paper. We observe that ParMETIS finds lower edge cuts than dKaMinPar-Fast on the dense \(\textsf {rgg}_{\text {2D}}26\text {d}32\) graph and both \(\textsf {rgg}_{\text {3D}}\) graphs by 5%–13%. However, on the sparser \(\textsf {rgg}_{\text {2D}}26\text {d}8\) graph, dKaMinPar-Fast has 19% smaller cuts than ParMETIS, which is already a considerable improvement. The gap gets much larger for the hyperbolic graph, where ParMETIS finds approx. 5.5–6.1 times larger cuts. Such solutions will be unsuitable for many applications.

Fig. 5. Throughput of \(\textsf {rgg}_{\text {2D}}\), \(\textsf {rgg}_{\text {3D}}\) and rhg graphs with \(2^{26}\) vertices per compute node, average degree 8, and \(\varepsilon = 3\%\) on 64–8 192 cores of Machine B. The number of blocks is scaled with the size of the graph such that each block contains \(2^{12}\) or \(2^{15}\) vertices.

We now evaluate weak scalability in terms of graph size and number of blocks by scaling k with the number of compute nodes used. This implies that the number of blocks is large when using a large number of cores. The throughput of each algorithm in this setting is summarized in Fig. 5. Note that we only use the sparser graphs in this experiment, since ParMETIS and ParHIP are unable to partition the dense versions of the graphs even on few compute nodes.

ParHIP-Fast is unable to obtain a feasible partition on all but 6 instances, none of which uses more than 1 024 cores, and only shows increasing throughputs up to 256 cores. While ParMETIS achieves decent weak scalability and computes feasible solutions on the mesh-type graphs, it is unable to partition any graph on 8 192 cores and often crashes on fewer cores (e.g., it only works on up to 1 024 cores on \(\textsf {rgg}_{\text {2D}}\) with \(2^{12}\) vertices per block). On the random hyperbolic graph, it only computes a feasible solution on 64 cores. Meanwhile, dKaMinPar-Fast shows weak scalability up to 8 192 cores on every graph family, although it should be noted that the throughput increase from 4 096 to 8 192 cores is rather small.

In terms of the number of edges cut, we summarize that dKaMinPar finds on average 19.3% and 2.8% lower edge cuts than ParMETIS and ParHIP-Fast, respectively (only averaging over instances for which the respective partitioner computed a feasible partition), with improvements ranging from 0% on \(\textsf {rgg}_{\text {3D}}26\text {d}8\) to approx. 60% on \(\textsf {rhg}_{3.0}26\text {d}8\) (\(2^{15}\) vertices per block). For detailed per-instance edge cut results, we refer to the full version [23] of the paper.

Fig. 6. Strong scaling running times for the largest low- and high-degree graphs in our benchmark set, with \(k = 16\) and \(\varepsilon = 3\%\), on 64–8 192 cores of Machine B.

Strong Scalability of dKaMinPar. We partition three of the largest low- and high-degree graphs from our benchmark set into \(k = 16\) blocks using 64–8 192 cores of Machine B and a time limit of 15 min. The results are summarized in Fig. 6, where we can observe strong scalability for up to 1 024–2 048 cores on the high-degree graphs. ParMETIS is unable to partition these graphs regardless of the number of cores used. While ParHIP-Fast scales up to 2 048 cores on uk-2007-05, we observe that its running time is still higher than that of dKaMinPar on just 64 cores. The twitter graph is difficult to coarsen due to its highly skewed degree distribution; here, we observe that only dKaMinPar can partition the graph within the time limit.

Turning towards graphs with small maximum degree, we observe strong scalability for up to 2 048, 2 048 and 1 024 cores on kmer_V1r, nlpkkt240 and \(\textsf {rgg}_{\text {2D}}27\), respectively. Similar to our weak scaling experiments, ParMETIS shows better scalability and throughput on the mesh-type graph \(\textsf {rgg}_{\text {2D}}\) as well as on nlpkkt240, but fails to partition kmer_V1r on 8 192 cores.

The edge cuts obtained remain relatively constant when scaling to a large number of cores. Surprisingly, the geometric mean edge cut on 8 192 cores is slightly better than on 64 cores (by 2.0%).

7 Conclusion and Future Work

Our distributed-memory graph partitioner dKaMinPar successfully partitions a wide range of input graphs using many thousands of cores, yielding high speed and good quality. Further improvements of the implementation might be possible, for example by making better use of shared memory on each compute node. Beyond that, one can explore the quality versus running time trade-off. Distributed implementations of more powerful local improvement algorithms, such as local search or flow-based techniques, could achieve better quality at the price of higher execution time. It then also makes sense to look at a portfolio of different partitioner variants that can be run in parallel, achieving good quality for different subsets of inputs. For example, matching-based coarsening as in ParMETIS might help for mesh-like networks, whereas more aggressive methods for handling high-degree nodes might help with some social networks.
