
1 Introduction

Discovering cohesive subgraphs or communities in a given graph is an important problem arising in diverse domains ranging from social networks to biological processes. Different models have been proposed for this purpose, among which the truss decomposition [1] is a prominent model. Apart from ensuring that the entities (vertices) in the subgraph are strongly connected to one another, the model also focuses on the strength of the connections (edges). The model is based on the intuition that the edge between two vertices can be considered strong if the two vertices share many common neighboring entities, or equivalently, if the edge is included in many triangles. For instance, in a social network, we can say that the more common friends two people have, the stronger their connection is.

Given a graph G and an integer k, the k-truss is defined as the largest subgraph, in which every edge is included in at least \((k-2)\) triangles within the subgraph. The model provides a hierarchical decomposition of the graph. The whole graph G is the 2-truss, and for \(k \ge 3\), the k-truss is contained within the \((k-1)\)-truss. For an edge e, the truss number \(\tau (e)\) is defined as the largest k such that e belongs to the k-truss. The truss decomposition problem is to compute the truss numbers of all the edges.

The truss decomposition model is useful in applications such as community detection [2], visualization of large networks [3], and discovering cohesive structures containing a given set of entities [4]. The truss decomposition model builds on the prior k-core formulation [5] and the recently proposed nucleus decomposition [6] generalizes both the concepts by considering higher order cliques in place of triangles. Given the utility of the model, the truss computation is included as part of the recently proposed Graph Challenge benchmark effort (https://graphchallenge.mit.edu/).

Cohen [1] introduced the truss decomposition model and presented a polynomial time algorithm for constructing the decomposition. Building on the above work, Wang and Cheng [7] described an I/O efficient implementation. Rossi [8], Smith et al. [9], Kabir and Madduri [10, 11] and Voegele et al. [12] proposed algorithms for the shared memory and GPU settings. Green et al. [13] evaluated truss computation in the GPU setting. Zhang and Parthasarathy [14] independently described the model and used it as a preprocessing step for finding cliques and other dense structures. Chen et al. [15], Cohen [16] and Shao et al. [17] studied the problem in the MapReduce setting.

Shared memory systems and GPUs have limitations in terms of the number of cores and/or memory availability, which impedes further performance improvements. For instance, on a popular social network graph named friendster (1.8 billion edges, 4.2 billion triangles), the execution time achieved in the shared memory setting (with 24 cores) is about 25 min [10]. Prior work has also considered the MapReduce framework [16], but the execution times are significantly higher, due to framework overheads. Our aim is to achieve execution times of about one minute on graphs of the above size. Towards that goal, we study the problem on distributed memory systems using MPI.

We build on two prior procedures (which we adapt to the MPI setting) and study the problem from an algorithmic perspective. The first procedure, due to Cohen [1], lies at the heart of most prior implementations. While the algorithm is optimized in terms of the computational load, it takes a large number of iterations to converge. In a distributed setting, the slow convergence leads to high synchronization costs and load imbalance. Working within the MapReduce framework, Chen et al. [15] proposed an algorithm that takes far fewer iterations, at the cost of increased computational load. We denote the two algorithms as \(\mathsf{MinTruss}\) and \(\mathsf{PropTruss}\), respectively. Our main contribution is a new algorithm that offers a tradeoff between the prior algorithms in terms of the two fundamental metrics of number of iterations and load.

Truss computation is performed in two steps: a first phase that enumerates triangles and a second phase that computes the truss numbers of the edges. Triangle enumeration is a well-studied problem and efficient algorithms have been developed (e.g., [18]). We focus on the second phase of truss computation. The second phase is iterative. In each iteration, we need to find the triangles incident on some of the edges. The prior work considers two different implementation settings. The first setting (e.g., [15, 17]) explicitly stores the list of triangles enumerated in the first phase and reuses the list in the second phase, whereas the second (e.g., [8, 10]) does not store the list of triangles and recomputes them as needed. The first setting has higher memory usage due to the large number of triangles, but it facilitates efficient implementation of the second phase. Our implementation is based on the setting of explicitly storing the triangles. In contrast to shared memory systems, we can afford to store the list of triangles, as sufficient memory is available under the distributed memory setting.

Our Contributions

  • We propose a new algorithm, denoted \(\mathsf{Hybrid}\), that offers a tradeoff between the prior algorithms on the two performance metrics: iterations close to \(\mathsf{PropTruss}\) and load close to \(\mathsf{MinTruss}\). We present an efficient distributed memory (MPI) implementation based on the above algorithm.

  • We present an experimental evaluation involving large real-world graphs (having up to 4 billion triangles). The results show that \(\mathsf{PropTruss}\) performs the best in terms of the number of iterations. Relative to \(\mathsf{PropTruss}\), \(\mathsf{Hybrid}\) is higher by a factor of at most 16x, whereas \(\mathsf{MinTruss}\) is higher by as much as 76x. In terms of load, \(\mathsf{MinTruss}\) performs the best. Relative to \(\mathsf{MinTruss}\), \(\mathsf{Hybrid}\) is higher by a factor of at most 2.3x, whereas \(\mathsf{PropTruss}\) is higher by as much as 17x.

  • In terms of the execution time (truss number computation), \(\mathsf{Hybrid}\) achieves better performance on large system sizes. On the largest system size in our study (512 MPI ranks), it outperforms \(\mathsf{MinTruss}\) and \(\mathsf{PropTruss}\) by up to 2x and 3.4x factors, respectively. Over the different benchmark graphs, it outperforms the best of the prior algorithms by a factor of up to 2x. The implementation is able to solve graphs having more than a billion edges and 4 billion triangles in about a minute.

2 Preliminaries

Let \(G=(V, E)\) be an undirected graph. A triple of vertices u, v and w is said to form a triangle, if \(\langle {u},{v}\rangle \), \(\langle {u},{w}\rangle \) and \(\langle {v},{w}\rangle \) are edges in G. We denote the triangle as \(\varDelta (u, v, w)\). The three edges are said to be incident on the triangle and vice versa. Two edges e and \(e'\) are called neighbors, if they are incident on a common triangle. Let \(\gamma (G)\) denote the number of triangles in G and for an edge e, let \(\gamma (e)\) denote the number of triangles incident on e.

By a subgraph, we refer to a graph \(H = (V', E')\) such that \(V'\subseteq V\) and \(E' \subseteq (V' \times V')\cap E\); we denote this as \(H\subseteq G\). The size of a subgraph H is measured by the number of edges in it. For a subgraph H and an edge e found in H, the support of e within H, denoted \(\mathtt{supp}_H(e)\), is defined as the number of triangles in H incident on e. For an integer \(k \ge 2\), the k-truss of G is defined as the largest subgraph \(H\subseteq G\) such that every edge e in H has \(\mathtt{supp}_H(e) \ge k-2\) (the k-truss may not be connected). The k-truss of a graph is unique.

Let \(\kappa \) be the largest value such that the \(\kappa \)-truss is non-empty. The 2-truss is simply the whole graph G. The k-trusses, for \(k\ge 2\), form a hierarchical decomposition: \(G = 2\text{-truss }\supseteq 3\text{-truss } \supseteq 4\text{-truss } \supseteq \cdots \supseteq \kappa \text{-truss }\). For an edge e, the truss number of e, denoted \(\tau (e)\), is defined as the largest value k such that e is found in the k-truss. Given a graph G, the truss decomposition problem is to construct the hierarchical decomposition; equivalently, the goal is to compute the truss number \(\tau (e)\) for all the edges.

Fig. 1. Analysis of prior algorithms

3 Prior Algorithms

In this section, we present an outline of the two prior algorithms \(\mathsf{MinTruss}\) [1] and \(\mathsf{PropTruss}\) [15]. Both the algorithms involve a preprocessing phase, where they compute \(\mathtt{supp}_G(e)\) for all the edges via enumerating the triangles of the input graph G. Triangle enumeration is a well-studied problem and efficient techniques have been developed (e.g., [18]). We describe the algorithms assuming that the supports have already been computed. For clarity of exposition, we present the algorithms at a conceptual level, deferring distributed aspects and other implementation details to Sect. 5. A brief discussion on the preprocessing procedure can also be found in the same section.

Algorithm \(\mathsf{MinTruss}\): For each edge e, the algorithm maintains an upper bound \(\widehat{\tau }(e)\) on the true truss number \(\tau (e)\); it is initialized as \(\widehat{\tau }(e) = \mathtt{supp}_G(e) + 2\). The algorithm marks all edges as not settled and proceeds iteratively. In each iteration, among the edges not yet settled, it selects the edges with the least truss value and declares them settled. It then updates the truss values of their neighbors in the following manner. Let \(e=\langle {u},{v}\rangle \) be a selected edge. For each triangle \(\varDelta (u, v, w)\) incident on e, if neither \(\langle {u},{w}\rangle \) nor \(\langle {v},{w}\rangle \) is settled already, then decrement the truss values \(\widehat{\tau }(u,w)\) and \(\widehat{\tau }(v,w)\) by one. Proceed in the above manner until all the edges are settled.

Intuitively, imagine that the settled edges are deleted from the graph. The deletion of an edge e destroys the triangles incident on it. When a triangle is destroyed, the other two edges lose the support of the triangle. So, we decrement their truss values, provided e is the first edge to be deleted among the three edges. We can show that for each edge e, the truss value \(\widehat{\tau }(e)\) gets decremented monotonically and becomes the true truss number \(\tau (e)\) before termination.
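The peeling procedure above can be sketched in sequential Python as follows; the edge/triangle representation and helper names are our illustrative choices, not the original implementation:

```python
from collections import defaultdict

def min_truss(edges, triangles):
    """Sequential sketch of MinTruss: peel edges level by level.

    edges: iterable of frozenset({u, v}); triangles: iterable of
    frozenset({u, v, w}). Returns {edge: truss number}.
    """
    tris = defaultdict(list)              # triangles incident on each edge
    for t in triangles:
        u, v, w = sorted(t)
        for e in (frozenset((u, v)), frozenset((u, w)), frozenset((v, w))):
            tris[e].append(t)
    tau = {e: len(tris[e]) + 2 for e in edges}   # upper bound: supp(e) + 2
    settled = set()
    while len(settled) < len(tau):
        # Select the unsettled edges with the least truss value.
        k = min(tau[e] for e in tau if e not in settled)
        batch = [e for e in tau if e not in settled and tau[e] == k]
        for e in batch:
            settled.add(e)                # "delete" e from the graph
            for t in tris[e]:
                u, v, w = sorted(t)
                sides = [frozenset((u, v)), frozenset((u, w)), frozenset((v, w))]
                others = [o for o in sides if o != e]
                # Decrement only if e is the first of the three edges deleted.
                if all(o not in settled for o in others):
                    for o in others:
                        tau[o] = max(tau[o] - 1, k)   # never drop below level k
    return tau
```

Running the sketch on a 4-clique with a pendant triangle attached reproduces the expected truss numbers (4 for the clique edges, 3 for the pendant edges).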

Algorithm \(\mathsf{PropTruss}\): In each iteration of the \(\mathsf{MinTruss}\) algorithm, only the neighbors of the edges with the least truss value get updated. As a result, the algorithm incurs a large number of iterations and converges slowly. Chen et al. [15] proposed an algorithm that exhibits better parallelism by taking far fewer iterations. We denote the algorithm as \(\mathsf{PropTruss}\). We rephrase the algorithm and present a sketch of it below.

The core idea is to select every edge e whose truss value changed in the prior iteration and propagate its new truss value to its neighbors. Since edges having various truss values propagate simultaneously, the update operation becomes more intricate, as against the simple decrement operation under the \(\mathsf{MinTruss}\) algorithm. For a triangle \(\varDelta (u,v,w)\), define the truss number of the triangle as \(\tau (u,v,w) = \min \{\tau (u,v), \tau (u,w), \tau (v,w)\}\). The new update operation is based on the following proposition. The truss numbers can be seen as stationary solutions satisfying the condition given by the proposition.

Proposition 1

For any edge \(e=\langle {u},{v}\rangle \), we have that

$$ \tau (e) = \max \{j~:~|\{\varDelta (u,v,x)~:~\tau (u,v,x) \ge j \}| \ge j-2\} $$

For each triangle \(\varDelta (u,v,w)\), the algorithm maintains an upper bound \(\widehat{\tau }(u,v,w)\ge \min \{\widehat{\tau }(u,v),\widehat{\tau }(u,w),\widehat{\tau }(v,w)\}\). These are initialized to \(\infty \). We ensure that for any edge \(e=\langle {u},{v}\rangle \), a condition analogous to the proposition holds throughout the execution of the algorithm:

$$\begin{aligned} \widehat{\tau }(e) = \max \{j~:~|\{\varDelta (u,v,x)~:~\widehat{\tau }(u,v,x) \ge j \}| \ge j-2\} \end{aligned}$$
(1)

The \(\mathsf{PropTruss}\) algorithm can be summarized as follows. In each iteration, consider all the edges \(e=\langle {u},{v}\rangle \) whose truss value changed in the prior iteration. For each triangle \(\varDelta (u,v,w)\) incident on e, if \(\widehat{\tau }(e) < \widehat{\tau }(u,v,w)\), then we update the truss value of the triangle to \(\widehat{\tau }(e)\). As a result, the truss values of the edges \(\langle {u},{w}\rangle \) and \(\langle {v},{w}\rangle \) may no longer satisfy condition (1). So, for both the edges, we recompute the right-hand side of condition (1) and update their truss values accordingly. We proceed in this manner until a stable solution is reached, in which the truss value of no edge changes. In the first iteration, all the edges are selected and perform the above propagate operation.
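A sequential sketch of the above, with the right-hand side of condition (1) recomputed explicitly, is given below; the representation and helper names are ours, not Chen et al.'s original code:

```python
import math
from collections import defaultdict

def sides(t):
    """The three edges of a triangle, as frozensets."""
    u, v, w = sorted(t)
    return [frozenset((u, v)), frozenset((u, w)), frozenset((v, w))]

def prop_truss(edges, triangles):
    """Sequential sketch of PropTruss (our rephrasing)."""
    tris = defaultdict(list)
    for t in triangles:
        for e in sides(t):
            tris[e].append(t)
    tau = {e: len(tris[e]) + 2 for e in edges}   # edge upper bounds
    tau_tri = {t: math.inf for t in triangles}   # triangle truss values

    def rhs(e):
        # Right-hand side of condition (1): the largest j such that at
        # least j - 2 incident triangles have truss value >= j.
        vals = [tau_tri[t] for t in tris[e]]
        for j in range(len(vals) + 2, 2, -1):
            if sum(1 for v in vals if v >= j) >= j - 2:
                return j
        return 2

    active = set(tau)                            # first iteration: all edges
    while active:
        changed = set()
        for e in active:
            for t in tris[e]:
                if tau[e] < tau_tri[t]:
                    tau_tri[t] = tau[e]          # propagate to the triangle
                    for o in sides(t):
                        if o == e:
                            continue
                        new = rhs(o)             # restore condition (1)
                        if new != tau[o]:
                            tau[o] = new
                            changed.add(o)
        active = changed
    return tau
```

On the same 4-clique-plus-pendant-triangle example, the sketch converges to the same truss numbers as the peeling algorithm, but in far fewer iterations.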

Comparison of \(\mathsf{MinTruss}\) and \(\mathsf{PropTruss}\): We compare the algorithms using two fundamental metrics: (i) the number of iterations; (ii) the load, i.e., the total number of updates (one update is counted whenever an edge changes the truss value of a triangle and propagates it to the other two edges of the triangle). In a distributed setting, a higher number of iterations leads to higher synchronization cost and load imbalance. The second metric determines the computational load and the communication volume.

The \(\mathsf{PropTruss}\) algorithm is superior on the first metric, because edges from multiple truss levels propagate their truss value simultaneously leading to faster convergence. On the other hand, the \(\mathsf{MinTruss}\) algorithm is better in terms of load. The reason is that any edge e propagates its truss value only once during the entire execution (when its truss value \(\widehat{\tau }(e)\) settles to the true truss number \(\tau (e)\)), whereas the same edge may propagate multiple times under \(\mathsf{PropTruss}\).

Figure 1(a) illustrates the above tradeoff by providing the two metrics on four sample graphs drawn from our experimental evaluation (properties of the graphs can be found in Sect. 6). We can see that \(\mathsf{PropTruss}\) involves significantly fewer iterations, but \(\mathsf{MinTruss}\) is superior on load.

4 Algorithm \(\mathsf{Hybrid}\)

In this section, we present a new algorithm, denoted \(\mathsf{Hybrid}\), that strikes a tradeoff between the two prior algorithms. It aims at achieving load close to \(\mathsf{MinTruss}\) and the number of iterations close to \(\mathsf{PropTruss}\).

The new algorithm is motivated from an analysis of the prior algorithms in terms of their load profiles, a plot that shows the load incurred in each iteration of the algorithm. As an illustration, Fig. 1(b) and (c) provide the load profiles of the two algorithms on the pokec graph. We can see that \(\mathsf{PropTruss}\) incurs the maximum load in the first iteration and the load monotonically decreases until the algorithm converges. On the other hand, in the case of \(\mathsf{MinTruss}\), the iterations are grouped into many blocks; within each block the load is maximum in the initial iteration and then decreases monotonically. Each block corresponds to a truss value k and all the edges with truss number \(\tau (e) = k\) settle in the successive iterations of the block. While the \(\mathsf{MinTruss}\) algorithm involves a large number of iterations, most of the iterations incur very little load. The core idea behind the \(\mathsf{Hybrid}\) algorithm is to eliminate the low-load iterations, without compromising much on the overall load incurred.

Fig. 2. Algorithm \(\mathsf{Hybrid}\)

Algorithm \(\mathsf{Hybrid}\): Like the prior algorithms, we maintain an upper bound \(\widehat{\tau }(e)\) on the true truss number \(\tau (e)\), for all edges e, and initialize it to \(\mathtt{supp}_G(e) + 2\). Let \(k_{\min }\) and \(k_{\max }\) denote the minimum and the maximum truss value \(\widehat{\tau }(e)\) among all the edges e. We imagine that each truss value is a bucket and that each edge e resides in the bucket corresponding to its truss value \(\widehat{\tau }(e)\). As the algorithm proceeds, whenever \(\widehat{\tau }(e)\) decreases, we visualize the edge as moving from its current bucket to a lower bucket. We maintain a set of edges called the active set, denoted \(\mathtt{Act}\); the edges in this set propagate their truss values in the current iteration. The edges belonging to the active set are drawn from a window of buckets, denoted W. To start with, the window consists of only the bucket \(k_{\min }\), i.e., \(W=[k_{\min }, k_{\min }]\). In each iteration, we construct the active set by including all edges e such that \(\widehat{\tau }(e)\) changed in the prior iteration and e belongs to one of the buckets in the window.

In the next and crucial step, we use an appropriate heuristic to estimate whether the current active set would result in the load being too low. If so, we expand the window by including the next bucket, and add all the edges in that bucket to the active set. We repeat the above process until the heuristic determines that the load would be sufficiently high.

We proceed in the above manner until all the buckets have been added and the window becomes the complete range \([k_{\min }, k_{\max }]\). At this stage, we continue with the iterations until the active set becomes empty; namely, the truss value does not change for any of the edges. A pseudocode for \(\mathsf{Hybrid}\) is given in Fig. 2.
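The overall control flow can be sketched as follows. This sequential sketch uses the PropTruss-style recomputation of condition (1) for the update step and a simple form of the window-expansion test based on static triangle counts; all names are ours:

```python
import math
from collections import defaultdict

def sides(t):
    u, v, w = sorted(t)
    return [frozenset((u, v)), frozenset((u, w)), frozenset((v, w))]

def hybrid(edges, triangles, delta=0.1):
    """Sequential sketch of the Hybrid main loop (our illustration)."""
    tris = defaultdict(list)
    for t in triangles:
        for e in sides(t):
            tris[e].append(t)
    tau = {e: len(tris[e]) + 2 for e in edges}
    tau_tri = {t: math.inf for t in triangles}
    gamma = {e: len(tris[e]) for e in edges}   # static triangle counts

    def rhs(e):
        # Right-hand side of condition (1).
        vals = [tau_tri[t] for t in tris[e]]
        for j in range(len(vals) + 2, 2, -1):
            if sum(1 for v in vals if v >= j) >= j - 2:
                return j
        return 2

    k_min, k_max = min(tau.values()), max(tau.values())
    hi = k_min                                 # window is [k_min, hi]
    changed, gamma_max = set(tau), 0
    while True:
        act = {e for e in changed if tau[e] <= hi}
        # Window expansion: grow while the load estimate is too low.
        while hi < k_max and sum(gamma[e] for e in act) <= delta * gamma_max:
            hi += 1
            act |= {e for e in tau if tau[e] == hi}
        if not act:
            break                              # window is full and nothing changed
        gamma_max = max(gamma_max, sum(gamma[e] for e in act))
        changed = set()
        for e in act:                          # propagate, as in PropTruss
            for t in tris[e]:
                if tau[e] < tau_tri[t]:
                    tau_tri[t] = tau[e]
                    for o in sides(t):
                        if o == e:
                            continue
                        new = rhs(o)
                        if new != tau[o]:
                            tau[o] = new
                            changed.add(o)
    return tau
```

The sketch produces the same truss numbers as the prior two algorithms; only the schedule of propagations differs.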

Window Expansion Heuristic: We develop a heuristic for window expansion by estimating the load to be incurred on the current active set \(\mathtt{Act}\). Let \(e=\langle {u},{v}\rangle \) be an edge in \(\mathtt{Act}\). For each triangle \(\varDelta (u,v, w)\) incident on e, we update the two neighboring edges provided \(\widehat{\tau }(u,v) < \widehat{\tau }(u,v,w)\); let \(\widetilde{\gamma }(e)\) denote the number of such triangles. The exact load under \(\mathtt{Act}\) is the sum of \(\widetilde{\gamma }(e)\) over all edges \(e\in \mathtt{Act}\). Unfortunately, \(\widetilde{\gamma }(e)\) changes dynamically and its computation requires an expensive scan of the triangles incident on e. We avoid the scan by using the upper bound \(\gamma (e)\) (the number of triangles incident on e). In contrast to \(\widetilde{\gamma }(e)\), the quantity \(\gamma (e)\) is static and can be computed as part of the preprocessing stage. Define \(\gamma (\mathtt{Act}) = \sum _{e\in \mathtt{Act}} \gamma (e)\). We take \(\gamma (\mathtt{Act})\) as an estimate of the load incurred by the set \(\mathtt{Act}\).

We determine if the above estimate is high enough by comparing against the maximum number of triangles encountered in the prior iterations. That is, let \(\mathtt{Act}_j\) denote the active set in a prior iteration j and let \(\gamma (\mathtt{Act}_j)\) denote the aggregate number of triangles incident on the edges in \(\mathtt{Act}_j\). We keep track of the quantity \(\gamma _{\max } = \max _j \gamma (\mathtt{Act}_j)\). The heuristic estimates that the load on \(\mathtt{Act}\) would be low, if the ratio of \(\gamma (\mathtt{Act})\) to \(\gamma _{\max }\) is below a threshold \(\delta \). In this case, we expand the window by including the next bucket. The process is repeated until the estimate on the load becomes sufficiently high. In the above procedure, \(\delta \) is a tunable parameter. Pseudocode for the procedure can be found in Fig. 2.
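In code, the test reduces to a one-line comparison; the function below is our sketch, and the names `should_expand`, `gamma` and `gamma_max` are our own:

```python
def should_expand(act, gamma, gamma_max, delta=0.1):
    """Window-expansion test (sketch).

    gamma maps each edge to its static incident-triangle count gamma(e);
    gamma_max is the largest load estimate seen in prior iterations.
    Expand the window when the estimate for the current active set falls
    below a delta fraction of gamma_max.
    """
    return sum(gamma[e] for e in act) < delta * gamma_max
```

The caller invokes the test repeatedly, adding one bucket to the window after each `True` result, until the estimate is large enough.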

Update Operation: As in the case of the \(\mathsf{PropTruss}\) algorithm, our update operation is also based on Proposition 1. Recall that in the \(\mathsf{PropTruss}\) algorithm, whenever an edge \(e=\langle {u},{v}\rangle \) updates the truss value \(\widehat{\tau }(u,v,w)\) for a triangle \(\varDelta (u,v,w)\), the truss values are recomputed for the other two edges \(\langle {u},{w}\rangle \) and \(\langle {v},{w}\rangle \) via evaluating the right hand side of condition (1). We develop a more efficient method that avoids the expensive recomputation by maintaining suitable histograms, as described below.

Consider any edge \(e=\langle {u},{v}\rangle \). We group the triangles incident on e based on their truss values and maintain a histogram consisting of two components, \(h_e(\cdot )\) and \(g_e\). For \(j < \widehat{\tau }(e)\), \(h_e(j)\) stores the number of triangles with truss value exactly j, whereas \(g_e\) keeps track of the number of triangles with the truss values at least \(\widehat{\tau }(e)\). Namely:

$$\begin{aligned} \forall j < \widehat{\tau }(e):\quad h_e(j) = |\{\varDelta (u,v,x) : \widehat{\tau }(u,v,x)=j\}| \end{aligned}$$
(2)
$$\begin{aligned} \text{ and }\quad g_e = |\{\varDelta (u,v,x) : \widehat{\tau }(u,v,x)\ge \widehat{\tau }(e)\}| \end{aligned}$$
(3)

For each triangle \(\varDelta (u,v,w)\), we initialize \(\widehat{\tau }(u,v,w)=\infty \). For each edge e, the histogram is initialized as \(g_e = \widehat{\tau }(e)-2\) and for all \(j < \widehat{\tau }(e)\), \(h_e(j) = 0\).

The iterations are executed as follows. Consider each edge \(e=\langle {u},{v}\rangle \) in the active set. For each triangle \(\varDelta (u,v,w)\) incident on e, if \(\widehat{\tau }(e) < \widehat{\tau }(u,v,w)\), we set \(\widehat{\tau }(u,v,w) = \widehat{\tau }(e)\). Let \(\mathrm{val}_\mathrm{old}\) denote the value of \(\widehat{\tau }(u,v,w)\) before the update was performed and \(\mathrm{val}_\mathrm{new}\) be the new value (\(=\widehat{\tau }(e)\)). We update the histogram and the \(\widehat{\tau }(\cdot )\) value for the other two edges \(\langle {u},{w}\rangle \) and \(\langle {v},{w}\rangle \) in such a manner that the conditions (1), (2) and (3) continue to be satisfied.

Let \(e'\) be one of the other two edges. Before the update, the triangle is counted as part of \(g_{e'}\), if \(\mathrm{val}_\mathrm{old}\ge \widehat{\tau }(e')\), and as part of \(h_{e'}(\mathrm{val}_\mathrm{old})\), if \(\mathrm{val}_\mathrm{old}< \widehat{\tau }(e')\). Similarly, after the update, the triangle is counted as part of \(g_{e'}\), if \(\mathrm{val}_\mathrm{new}\ge \widehat{\tau }(e')\), and as part of \(h_{e'}(\mathrm{val}_\mathrm{new})\), if \(\mathrm{val}_\mathrm{new}< \widehat{\tau }(e')\). Thus, based on the values of \(\mathrm{val}_\mathrm{old}\) and \(\mathrm{val}_\mathrm{new}\), we adjust (increment/decrement) \(g_{e'}\), \(h_{e'}(\mathrm{val}_\mathrm{old})\) and \(h_{e'}(\mathrm{val}_\mathrm{new})\); see Fig. 2. We then decrement \(\widehat{\tau }(e')\), if \(g_{e'} < \widehat{\tau }(e')-2\). Furthermore, in this case, \(h_{e'}(\widehat{\tau }(e'))\) must now be counted as part of \(g_{e'}\), and so we add \(h_{e'}(\widehat{\tau }(e'))\) to \(g_{e'}\). Our implementation of \(\mathsf{PropTruss}\) also uses the above histogram strategy.
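The update of a single neighbor edge \(e'\) can be sketched as follows; the function name, the dictionary representation of \(h_{e'}\) and \(g_{e'}\), and the infinite sentinel for uninitialized triangle values are our illustrative choices:

```python
def apply_update(ep, val_old, val_new, tau, h, g):
    """Histogram update for a neighbor edge ep when an incident triangle's
    truss value drops from val_old to val_new (sketch).

    h[ep]: dict j -> number of incident triangles with value exactly j
           (maintained only for j < tau[ep]);
    g[ep]: number of incident triangles with value >= tau[ep].
    """
    # Move the triangle out of its old bin ...
    if val_old >= tau[ep]:
        g[ep] -= 1
    else:
        h[ep][val_old] -= 1
    # ... and into its new bin.
    if val_new >= tau[ep]:
        g[ep] += 1
    else:
        h[ep][val_new] = h[ep].get(val_new, 0) + 1
    # If support at the current level is lost, decrement tau(ep) and
    # fold the histogram bin at the new level into g.
    while tau[ep] > 2 and g[ep] < tau[ep] - 2:
        tau[ep] -= 1
        g[ep] += h[ep].pop(tau[ep], 0)
```

For example, an edge with three incident triangles starts with \(\widehat{\tau } = 5\), \(g = 3\) and an empty histogram; when one triangle drops to value 3, the edge loses a unit of support at level 5 and its truss value falls to 4.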

Discussion: The two prior algorithms can be realized by modifying the window expansion heuristic: \(\mathsf{PropTruss}\) via initializing the window to include all the buckets; \(\mathsf{MinTruss}\) via expanding the window with the next bucket only when the active set becomes empty. By tuning the parameter \(\delta \), we get a spectrum of algorithms offering a tradeoff between the two extremes. On one hand, restricting the active set to a window of buckets leads to less load than \(\mathsf{PropTruss}\). On the other hand, ensuring that the load is high enough in each iteration leads to faster convergence and fewer iterations than \(\mathsf{MinTruss}\). We can prove the following tradeoff for any value of \(\delta \in [0,1]\):

$$\begin{aligned} \text{ Number } \text{ of } \text{ iterations }&:&\mathsf{PropTruss}\le \mathsf{Hybrid}\le \mathsf{MinTruss}\\ \text{ Load }&:&\mathsf{MinTruss}\le \mathsf{Hybrid}\le \mathsf{PropTruss}\end{aligned}$$

Figure 1(c) shows the load profile for the \(\mathtt{pokec}\) graph with \(\delta =0.1\). We can see that the number of iterations is close to \(\mathsf{PropTruss}\) and the load is close to \(\mathsf{MinTruss}\). The profile also exhibits a blocked behavior, but the load in any iteration is sufficiently high.

At a high level, computing the truss decomposition shares similarities with the single source shortest path problem (SSSP). Similar to truss computation, prior algorithms for SSSP maintain upper bounds on the shortest distances, which are iteratively refined. Here, we can draw a parallel between edges with their truss numbers on one hand, and vertices with their shortest distances on the other. Viewed from this perspective, the \(\mathsf{MinTruss}\) and the \(\mathsf{PropTruss}\) algorithms are analogous to the well-known Dijkstra's and Bellman-Ford algorithms, respectively. The \(\mathsf{Hybrid}\) algorithm is inspired by the \(\varDelta \)-stepping algorithm [19].

5 Distributed Implementation

Graph Distribution: We distribute the input graph \(G=(V, E)\) among the processors (MPI ranks) using a degree-based ordering proposed in prior work in the context of efficient triangle counting (e.g., [18]). For a vertex u, let \(\mathtt{deg}(u)\) denote its degree. Arrange the vertices in increasing order of degree, breaking ties by vertex identifiers. Namely, we say that \(u\prec v\), if either \(\mathtt{deg}(u) < \mathtt{deg}(v)\), or \(\mathtt{deg}(u)=\mathtt{deg}(v)\) and \(\mathtt{id}(u) < \mathtt{id}(v)\). Let \(\mathtt{deg}_{+}(u)\) be the number of neighbors v of u with \(v\succ u\).

We assign each vertex u to a processor chosen uniformly at random, called the owner of u. We also assign ownership for each edge \(e=\langle {u},{v}\rangle \): assign e to the owner of u, if \(u\prec v\), and to the owner of v, if \(v\prec u\). Let V(p) and E(p) denote the set of vertices and edges owned by a processor p.
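A minimal sequential sketch of this assignment (the helper names and the fixed random seed are ours):

```python
import random
from collections import defaultdict

def distribute(vertices, edges, num_procs, seed=0):
    """Sketch of the graph distribution: each vertex gets a uniformly
    random owner; an edge goes to the owner of its endpoint that comes
    first in the (degree, id) order."""
    rng = random.Random(seed)
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    owner = {u: rng.randrange(num_procs) for u in vertices}
    edge_owner = {}
    for u, v in edges:
        lo = u if (deg[u], u) < (deg[v], v) else v   # the u < v comparison
        edge_owner[(u, v)] = owner[lo]
    return owner, edge_owner
```

On a path 0-1-2, for instance, vertex 1 has the highest degree, so each edge is assigned to the owner of its degree-1 endpoint.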

For a processor p, let \(\gamma (p)\) denote the aggregate number of triangles incident on the edges owned by p, i.e., \(\gamma (p) = \sum _{e\in E(p)} \gamma (e)\). The quantity \(\gamma (p)\) is an indicator of the number of updates performed by the processor during the truss computation. We can derive a bound on \(\gamma (p)\) as follows. For each vertex \(u\in V(p)\), the processor owns \(\mathtt{deg}_{+}(u)\) edges incident on u; each of these edges can be incident on at most \(\mathtt{deg}(u)\) triangles. Hence, \(\gamma (p)\) is at most \(\sum _{u\in V(p)} \mathtt{deg}(u) \mathtt{deg}_{+}(u)\). Intuitively, if u is a low-degree vertex, then \(\mathtt{deg}_{+}(u)\) is also low, whereas if u is a high-degree vertex, then it cannot have too many neighbors succeeding it in the ordering and so, \(\mathtt{deg}_{+}(u)\) is again low. As a result, the above distribution helps in achieving good load balance.

Preprocessing - Triangle Enumeration: All the three algorithms involve a preprocessing stage of computing the support of the edges, via triangle enumeration. For this purpose, we adopt an efficient strategy proposed in prior work (e.g., [18]). We say that a pair of edges \(\langle {u},{v}\rangle \) and \(\langle {u},{w}\rangle \) is a monotone wedge, if \(v\succ u\) and \(w\succ u\). The strategy is to enumerate all the monotone wedges \(\langle {u},{v}\rangle \) and \(\langle {u},{w}\rangle \) and test whether \(\langle {v},{w}\rangle \) is also an edge. The advantage with this approach is that the number of wedges considered is only \(\sum _{u\in V} \mathtt{deg}_{+}^2(u)\).
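A serial sketch of this strategy is shown below (the names are ours); the distributed protocol described next partitions the same computation across processors:

```python
from collections import defaultdict

def count_supports(vertices, edges):
    """Wedge-based triangle enumeration (serial sketch): order vertices
    by (degree, id), enumerate monotone wedges (u,v),(u,w) with v and w
    succeeding u, and test whether (v,w) is also an edge."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    def key(u):
        return (deg[u], u)                 # the degree-based total order
    edge_set = {frozenset(e) for e in edges}
    succ = defaultdict(list)               # neighbors succeeding u in the order
    for u, v in edges:
        lo, hi = sorted((u, v), key=key)
        succ[lo].append(hi)
    supp = defaultdict(int)
    for u in vertices:
        nb = succ[u]
        for i in range(len(nb)):
            for j in range(i + 1, len(nb)):
                v, w = nb[i], nb[j]
                if frozenset((v, w)) in edge_set:   # triangle (u, v, w) found
                    for e in (frozenset((u, v)), frozenset((u, w)),
                              frozenset((v, w))):
                        supp[e] += 1
    return supp
```

Each triangle is discovered exactly once, from its smallest vertex in the ordering, so the supports come out correct without deduplication.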

In our distributed implementation, each processor p builds a hash table over the edges E(p) owned by it. For each vertex \(u\in V(p)\), the processor p enumerates all monotone wedges \(\langle {u},{v}\rangle \) and \(\langle {u},{w}\rangle \), and sends the triple (u, v, w) to the processor owning v, say q. Using its hash table, the processor q checks if the pair \(\langle {v},{w}\rangle \) is an edge in G and if so, the triangle \(\varDelta (u, v, w)\) has been discovered. In this case, q increments \(\mathtt{supp}_G(v, w)\) and sends the triple (u, v, w) back to p, upon receiving which p increments both \(\mathtt{supp}_G(u,v)\) and \(\mathtt{supp}_G(u,w)\). In the above process, for each edge e, its owner stores the list of triangles incident on e.

Truss Computation: The algorithms are implemented under the bulk synchronous parallel model. For each edge \(e=\langle {u},{v}\rangle \), the owner of e maintains \(\widehat{\tau }(e)\), the histogram \(h_e(\cdot )\) and \(g_e\). In addition, for each triangle \(\varDelta (u, v, w)\) incident on e, the processor also stores a local copy of \(\widehat{\tau }(u,v,w)\). In each iteration, for each edge \(e=\langle {u},{v}\rangle \in \mathtt{Act}\), the owner p of e propagates the new truss value \(\widehat{\tau }(e)\), as follows. For each triangle \(\varDelta (u,v, w)\) with \(\widehat{\tau }(e) < \widehat{\tau }(u,v,w)\), p sends update messages to the owners of the edges \(\langle {u},{w}\rangle \) and \(\langle {v},{w}\rangle \), wherein the message consists of the identity of the triangle \(\varDelta (u, v, w)\), as well as the new value of \(\widehat{\tau }(u, v, w)\). The messages are exchanged using the \(MPI\_Alltoallv\) primitive. Each processor executes the update procedure on the received messages, updating the edge truss values, the histograms, as well as the local copies of the triangle truss values. The buckets and the active sets are stored in a distributed manner: each processor p maintains the buckets and active sets restricted to the edges owned by it.

Fig. 3. Graph properties: number of vertices (n), edges (m) and triangles (\(\varDelta \)), all in millions. The maximum truss number \(\kappa \) is also shown.

Fig. 4. Basic metrics

6 Experimental Evaluation

Experimental Setup: The experiments were conducted on a cluster of Power-8 nodes (20 physical cores, 512 GB memory, 4 GHz) connected via InfiniBand in a fat-tree topology. We launch 16 MPI ranks per node, each mapped to a core. We use 2 to 32 nodes, leading to a total of 32 to 512 MPI ranks.

The dataset consists of eight representative real-world graphs obtained from the SNAP repository, the Koblenz network collection and the SuiteSparse Matrix Collection; the uk-2002 and hollywood-2009 graphs are based on the prior work [20]. Four of the graphs are medium-sized with more than 100 million triangles, and the other four are large graphs with more than a billion triangles. Figure 3 shows the properties of the graphs, including the small pokec graph used as a case study in the earlier discussion (the graphs are sorted by the number of triangles).

Prior work has presented efficient shared memory implementations for truss computation [10, 13]. These are based on the \(\mathsf{MinTruss}\) algorithm and provide optimizations for the above setting. Our objective is to study the two extremes of \(\mathsf{MinTruss}\) and \(\mathsf{PropTruss}\), and the effect of the tradeoff offered by \(\mathsf{Hybrid}\) under distributed memory setting. Towards the objective, our experimental evaluation focuses on the three algorithms.

Recall that \(\mathsf{Hybrid}\) offers a tradeoff between \(\mathsf{MinTruss}\) and \(\mathsf{PropTruss}\), controlled by \(\delta \). We experimented with different values of the parameter on different graphs and system sizes, and found that setting \(\delta =0.1\) offers the best tradeoff. All the experiments discussed below use the above setting of the parameter.

Basic Metrics: We first evaluate the algorithms on the two basic metrics: number of iterations and load (number of updates). We normalize the load by \(\gamma (G)\), the number of triangles in the graph. An ideal value for normalized load is one unit, which is attained when an algorithm performs only a single update per triangle.

The results, shown in Fig. 4, confirm our earlier analysis (Sect. 3). We can see that \(\mathsf{MinTruss}\) incurs a large number of iterations, whereas \(\mathsf{PropTruss}\) requires far fewer, with the reduction being as high as 76x (on hollywood). This trend is reversed on the metric of load. The \(\mathsf{MinTruss}\) algorithm performs the best with near-ideal load, whereas the quantity is as high as 22 units for \(\mathsf{PropTruss}\). The \(\mathsf{Hybrid}\) algorithm strikes a balance between the two. In terms of the number of iterations, relative to \(\mathsf{PropTruss}\), \(\mathsf{Hybrid}\) is higher by a factor of at most 16x (whereas \(\mathsf{MinTruss}\) is higher by as much as 76x). In terms of load, relative to \(\mathsf{MinTruss}\), \(\mathsf{Hybrid}\) is higher by a factor of at most 2.3x (whereas \(\mathsf{PropTruss}\) is higher by as much as 17x).

Another metric of importance is the max-load, which quantifies the load balance characteristics. We compute the max-load by finding the maximum load among the processors in each iteration and summing these maxima across all the iterations. An ideal value of the metric is \(\gamma (G)/P\), where P is the number of processors; we normalize the max-load by this quantity. Figure 4 presents the normalized max-load at \(P=512\) (the largest system size in our study). In spite of achieving near-ideal load, the \(\mathsf{MinTruss}\) algorithm incurs the highest max-load in most cases. The reason is that the load gets spread over a large number of iterations, leading to load imbalance. The \(\mathsf{PropTruss}\) and \(\mathsf{Hybrid}\) algorithms involve fewer iterations and perform comparatively better.
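The max-load computation described above can be sketched as follows; the function name and the per-processor load matrix are hypothetical conveniences for illustration.

```python
def normalized_max_load(loads, num_triangles, num_procs):
    """loads[i][p] = number of updates by processor p in iteration i
    (hypothetical recording). Max-load sums the per-iteration maximum
    across processors; it is normalized by the ideal gamma(G)/P, so a
    perfectly balanced run yields 1.0."""
    max_load = sum(max(iteration) for iteration in loads)
    return max_load / (num_triangles / num_procs)

# Example: 2 processors, 2 iterations, 8 triangles.
# Per-iteration maxima are 3 and 2 (max-load 5); ideal is 8/2 = 4.
print(normalized_max_load([[3, 1], [2, 2]], 8, 2))
```

Note that an algorithm can have ideal total load yet poor max-load: if the same total work is spread over many iterations, the per-iteration maxima accumulate, which is precisely the effect observed for \(\mathsf{MinTruss}\).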

Fig. 5. Execution time (seconds) on the benchmark graphs on ranks from 32 to 512. The best running times are highlighted.

Fig. 6. Speedup of \(\mathsf{Hybrid}\) over \(\mathsf{MinTruss}\) and \(\mathsf{PropTruss}\). The average speedup on the eight graphs at different ranks is also shown.

Fig. 7. Truss computation time (s) at 512 ranks.

Truss Computation: Execution Time: We next evaluate the truss computation time of the algorithms on different system sizes (32 to 512 ranks). The results are shown in Fig. 5 (the running times are for a single run of the algorithms). The best execution time is highlighted for each configuration. The figure also includes the preprocessing time (triangle enumeration), which is common to all the algorithms.

We can observe that the \(\mathsf{MinTruss}\) algorithm performs the best on small system sizes. However, as the system size increases, the algorithm suffers from synchronization costs and load imbalance arising out of the large number of iterations, resulting in performance degradation. Except on friendster, the \(\mathsf{Hybrid}\) algorithm outperforms both prior algorithms on larger system sizes.

The friendster graph is one of the largest in terms of the number of triangles. However, its maximum truss number \(\kappa \) is comparatively smaller, leading to fewer iterations for \(\mathsf{MinTruss}\). Consequently, the synchronization cost and load imbalance are lower, and so, \(\mathsf{MinTruss}\) outperforms \(\mathsf{Hybrid}\) on all the system sizes in our study. We expect \(\mathsf{Hybrid}\) to outperform \(\mathsf{MinTruss}\) at system sizes larger than 512 ranks.

Figure 6 provides the speedup of \(\mathsf{Hybrid}\) over \(\mathsf{MinTruss}\) and \(\mathsf{PropTruss}\) on the different graphs, as the number of ranks is varied from 32 to 512. The speedup is measured as the ratio of the running time of the competing algorithm (\(\mathsf{MinTruss}\) or \(\mathsf{PropTruss}\)) to that of \(\mathsf{Hybrid}\). The figure also provides the average speedup over the eight benchmark graphs across 32 to 512 ranks. With respect to \(\mathsf{MinTruss}\), the speedup is less than one on small system sizes (since \(\mathsf{MinTruss}\) is superior). On the largest system size of 512, \(\mathsf{Hybrid}\) outperforms \(\mathsf{MinTruss}\), with the speedup ranging up to 2x and the average being 1.6x. With respect to \(\mathsf{PropTruss}\), \(\mathsf{Hybrid}\) achieves better speedup at smaller ranks. As the number of ranks increases, the speedup decreases because of the increase in synchronization cost and load imbalance under \(\mathsf{Hybrid}\). Nevertheless, we see that on the largest system size of 512, the speedup is up to 4.2x with the average being 2.4x.

Figure 7 compares the execution times on the largest system size of 512 ranks. We can see that \(\mathsf{Hybrid}\) outperforms \(\mathsf{MinTruss}\) and \(\mathsf{PropTruss}\) by factors of up to 2x (on stackoverflow) and 4x (on orkut), respectively. Taking the best of the prior algorithms in each case, the performance gain is up to a factor of 2x (on stackoverflow).

7 Conclusions

We presented a new distributed algorithm for truss decomposition that offers a tradeoff between two prior procedures in terms of the number of iterations and the number of updates. Our experimental study shows that the algorithm outperforms the prior procedures on large system sizes by a factor of up to 2x. Improving the scalability of the algorithm and exploring the \(\mathsf{Hybrid}\) algorithm on shared memory systems are useful avenues for future work.