Graph Partitioning for Distributed Graph Processing
Abstract
There is a large demand for distributed engines that efficiently process large-scale graph data, such as social graphs and web graphs. Distributed graph engines execute the analysis process after partitioning the input graph data and assigning the partitions to distributed computers, so the quality of graph partitioning largely affects the communication cost and the load balance among computers during the analysis process. We propose an effective graph partitioning technique that achieves low communication cost and good load balance among computers at the same time. We first generate more clusters than the number of computers by extending modularity-based clustering, and then merge those clusters into balanced-size clusters until the number of clusters equals the number of computers, using techniques designed for the graph packing problem. We implemented our technique on top of the distributed graph engine PowerGraph and conducted intensive experiments. The results show that our partitioning technique reduces the communication cost and thereby improves the response time of graph analysis patterns. In particular, PageRank computation is up to 3.2 times faster than with HDRF, the state of the art in streaming-based partitioning.
Keywords
Graph partitioning · Graph mining · Distributed processing
1 Introduction
Large-scale graph data such as social graphs and web graphs have emerged in various domains. As an example of a social graph, the number of daily active users on Facebook reached 1.13 billion on average for June 2016, an increase of 17% year-over-year, as reported in the Facebook second quarter 2016 results.^{1} In such a graph, vertexes and edges represent users and their relationships, respectively.
To analyze such large-scale graph data efficiently, distributed graph engines have been developed, and they are widely used in the graph analysis field. Some examples are Pregel [1], GraphLab [2], PowerGraph [3], and GraphX [4]. Distributed graph engines commonly (1) partition the input graph data into subgraphs, (2) assign each subgraph to a computer, and (3) perform graph analysis over the distributed graph. Each computer iteratively analyzes its assigned subgraph by updating the parameters assigned to the vertexes/edges. Notice that the subgraph assignment to computers largely affects the communication cost and load balance during graph analysis. The communication cost increases with the number of cross-partition vertexes/edges, because communication between different computers is required when parameters are updated by referring to adjacent vertexes/edges on remote computers. The computation cost of each computer depends on the number of vertexes/edges assigned to it [5], so load imbalance occurs among computers when the numbers of assigned vertexes/edges are imbalanced.
Our goal is to design a graph partitioning technique that achieves low communication cost and good load balance among computers. The state-of-the-art graph partitioning techniques are Oblivious [3] and HDRF [6], which are actually implemented in PowerGraph. These techniques generate balanced-size clusters while attempting to reduce communication overhead. However, the communication overhead tends to be high, and this degrades the performance, in particular when the number of computers is large. In contrast, there are other graph clustering techniques [7, 8, 9] that are designed to reduce the number of cross-cluster edges. They are expected to reduce the communication overhead; however, the size of the obtained clusters is imbalanced, as reported in [8], so we cannot directly apply these techniques to our goal as they are.
We propose an effective graph partitioning technique that achieves low communication cost and good load balance among computers at the same time. To obtain balanced-size clusters, we first generate many more clusters than the number of computers by extending modularity-based clustering, and then merge those clusters into balanced-size clusters by employing techniques designed for the packing problem [10]. Finally, we convert the edge-cut graph into a vertex-cut graph, because modularity clustering is edge-cut-based while most recent distributed graph engines are based on the vertex-cut graph. We implemented our technique on top of PowerGraph and conducted evaluations. The results show that our partitioning technique reduces the communication cost and thereby improves the response time of graph analysis patterns. In particular, it makes PageRank computation up to 3.2 times faster than HDRF. In addition, we also evaluated how the major graph metrics (the replication factor and load balance factor) correlate with the physical performance measures: the response time, the amount of data transfer between computers, and the imbalance runtime ratio among computers.
The remainder of this paper is organized as follows. Section 2 describes the background of this work. Section 3 describes the detailed design of our technique. Section 4 reports the results of experiments. Section 5 addresses related work, and Sect. 6 concludes this paper.
2 Preliminary
2.1 Replication Factor and Load Balance Factor
Recent distributed graph processing frameworks (e.g., GraphLab [2] and PowerGraph [3]) have employed the vertex-cut method [2, 6] for graph partitioning, since it provides better performance in terms of load balancing among distributed computers. The vertex-cut method is a graph partitioning technique for distributed graph processing; it divides a graph into multiple partitions by replicating cross-cluster vertexes, and it assigns each partition to a computer in the distributed computation environment. In order to quantify the effectiveness of graph partitioning, it is a natural choice to use two major metrics called the replication factor [2] and the load balance factor [3].
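Equations (1) and (2) are not reproduced in this excerpt. As a rough sketch under the standard definitions used in the PowerGraph literature (the replication factor is the average number of partitions holding a replica of each vertex; the load balance factor compares the most loaded partition with the average load), the two metrics could be computed as follows; the function names and input encoding are illustrative:

```python
from collections import defaultdict

def replication_factor(edge_assignment):
    """edge_assignment: list of ((u, v), partition_id) pairs.
    Replication factor = average number of distinct partitions
    that hold a replica of each vertex."""
    replicas = defaultdict(set)
    for (u, v), p in edge_assignment:
        replicas[u].add(p)
        replicas[v].add(p)
    return sum(len(ps) for ps in replicas.values()) / len(replicas)

def load_balance_factor(edge_assignment, num_partitions):
    """Ratio of the largest partition's edge count to the average
    edge count per partition; 1.0 means a perfectly even split."""
    counts = defaultdict(int)
    for _, p in edge_assignment:
        counts[p] += 1
    total = sum(counts.values())
    return max(counts.values()) / (total / num_partitions)
```

A replication factor near 1 means few vertexes are replicated across computers, which keeps communication low; a load balance factor near 1 means all computers hold similar workloads.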
2.2 Modularity
Our proposed method merges partition pairs to increase a graph partitioning measure, namely modularity [7], so as to reduce the total number of cross-partition edges. In this section, we formally introduce modularity.
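Equation (5) itself is not reproduced in this excerpt. For reference, the standard modularity of Newman and Girvan [7], which Eq. (5) presumably corresponds to, can be written as follows (a hedged reconstruction; the paper's exact notation may differ):

```latex
% Modularity Q of a clustering C of graph G = (V, E):
% |E_c| = number of edges with both endpoints in cluster c,
% deg(v) = degree of vertex v.
Q = \sum_{c \in C} \left[ \frac{|E_c|}{|E|}
    - \left( \frac{\sum_{v \in c} \deg(v)}{2|E|} \right)^{2} \right]
```

The first term rewards edges kept inside clusters; the second term penalizes clusters that absorb a disproportionate share of edge endpoints, so maximizing Q tends to reduce cross-cluster edges.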
In our proposed method, we modify Eq. (5) to find balanced-size partitions for efficient distributed graph processing; we introduce a new term for balancing the partition size [8] into Eq. (5). We present the details in Sect. 3.1.
3 Balanced-Size Clustering Technique
 Balanced-size modularity clustering phase

We first employ a modified modularity proposed by Wakita and Tsurumi [8] that achieves good modularity and mitigates the imbalance of cluster size.
 Cluster merge phase

Since modularity clustering generates a large number of clusters in general, we need an additional phase to merge clusters further. Moreover, even if we employ the modified modularity that mitigates the imbalanced size of clusters, some imbalance of cluster size remains. So, we generate many more clusters than the number of computers in the 1st phase, and then merge those clusters into balanced-size clusters until the number of clusters equals the number of computers, by employing techniques designed for the graph packing problem.
 Graph conversion phase

Finally, we convert the edge-cut graph into a vertex-cut graph, because modularity clustering is edge-cut-based clustering and most recent distributed graph engines are based on the vertex-cut graph.
3.1 Balanced-Size Modularity Clustering Phase
Definitions of symbols used in Algorithm 1
Symbol  Definition

\(\mathbb {C}\)  Input cluster set
k  Specified number of output clusters
\(\mathbb {R}\)  Output cluster set
\(top\_k\_clusters(\mathbb {C}, k)\)  Top-k clusters \(\in \mathbb {C}\)
\(inner\_edges(c)\)  Inner edges of cluster c
neighbors(c)  Adjacent clusters of cluster c
\(cut\_edges(n, m)\)  Cut edges between clusters n and m
To find fine-grained and well-balanced clusters efficiently, we apply Eq. (6) to the state-of-the-art modularity-based clustering called the incremental aggregation method [9]. The incremental aggregation method is a modularity-based clustering algorithm that is able to process large-scale graphs with more than a few billion edges within a quite short computation time. This is because the method effectively reduces the number of edges to be referenced during the modularity gain computation by incrementally merging cluster pairs. By combining this method with the modified modularity gain shown in Eq. (6), this phase finds fine-grained and well-balanced clusters efficiently.
In addition, this phase attempts to produce a larger number of clusters than the user-specified parameter k. The reasons are twofold: (1) Although Eq. (6) is effective in balancing the cluster size, it is not sufficient for the load balance. To further balance the size of clusters, we additionally perform the first-fit algorithm [10] in the next phase, which is an approximation algorithm for the bin packing problem. (2) If we run modularity-based clustering methods until convergence, they automatically determine the number of clusters depending on the input graph topology. In order to control the number of clusters for the distributed machines, this phase runs until (a) we can find no cluster pairs that increase the modularity score, or (b) the number of clusters produced in this phase reaches \(a \times k\), where \(a \in \mathbb {R}\) is a user-specified parameter such that \(a > 1\).
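The first-fit algorithm [10] referred to above can be sketched as follows; this is a generic illustration of first-fit bin packing, not the paper's exact implementation:

```python
def first_fit(items, capacity):
    """Pack item sizes into bins of a given capacity, placing each
    item into the first existing bin where it still fits, and
    opening a new bin only when no existing bin can hold it."""
    bins = []  # each bin is a list of item sizes
    for size in items:
        for b in bins:
            if sum(b) + size <= capacity:
                b.append(size)
                break
        else:
            bins.append([size])  # no bin fits: open a new one
    return bins
```

First-fit is a classic approximation for bin packing: it never uses more than roughly 1.7 times the optimal number of bins, which is why it is a reasonable building block for balancing cluster sizes.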
3.2 Cluster Merge Phase
The idea for producing balanced-size clusters is to employ the techniques developed for the packing problem [10]. That is, given items of various sizes, we pack them into a fixed number of containers of the same size. Since we generated more clusters than the number of computers in the previous phase, we pack those clusters into balanced-size containers by performing the first-fit algorithm. In addition, we choose an adjacent cluster of a given cluster and pack them into the same container during the first-fit algorithm, so that we can keep the number of cross-cluster edges small.
The details are as follows. Given the many clusters produced in the balanced-size modularity clustering phase, we choose the k (number of computers) largest clusters as seed clusters and put them into different containers. Then, we repeatedly merge the smallest seed cluster with one of its adjacent clusters until no cluster adjacent to a seed cluster remains. After that, there may be clusters that are not connected to any seed cluster, that is, clusters isolated from all seed clusters. We pick a cluster from the isolated ones, merge the clusters reachable from it, and put the merged cluster into the container with the smallest number of inner edges.
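The merge procedure might be sketched as follows, assuming cluster sizes are inner-edge counts and adjacency is given as a map from each cluster to its neighboring clusters; all names are illustrative, and the grouping of reachable isolated clusters is simplified to per-cluster assignment:

```python
def merge_clusters(sizes, adjacency, k):
    """sizes: {cluster_id: inner-edge count};
    adjacency: {cluster_id: set of adjacent cluster ids}.
    Choose the k largest clusters as seeds, then repeatedly merge
    the smallest seed's container with an adjacent unassigned cluster."""
    seeds = sorted(sizes, key=sizes.get, reverse=True)[:k]
    containers = {s: {s} for s in seeds}   # seed -> member clusters
    weight = {s: sizes[s] for s in seeds}
    unassigned = set(sizes) - set(seeds)

    def frontier(seed):
        # unassigned clusters adjacent to any member of the container
        return {n for c in containers[seed] for n in adjacency[c]} & unassigned

    while True:
        candidates = [s for s in seeds if frontier(s)]
        if not candidates:
            break  # no seed has an adjacent unassigned cluster left
        s = min(candidates, key=lambda x: weight[x])
        n = min(frontier(s))  # deterministic pick for illustration
        containers[s].add(n)
        weight[s] += sizes[n]
        unassigned.discard(n)

    # clusters isolated from every seed go into the lightest container
    for c in sorted(unassigned):
        s = min(seeds, key=lambda x: weight[x])
        containers[s].add(c)
        weight[s] += sizes[c]
    return containers
```

In the real method, the container weight would also account for absorbed cut edges (as in the "35 (20 + 5 + 10)" accounting of Example 1); the sketch omits this for brevity.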
Example 1
Figure 1 depicts an example of the cluster merge phase; the initial state is on the left, and the final state is on the right. Each circle represents a cluster, and the number located at the center of the circle shows the number of inner edges in the cluster. The number assigned to an edge shows the number of cross-cluster edges. The dotted shape represents a seed cluster (container). (1) In the initial state, the two largest clusters (cluster 1 and cluster 2) are chosen as seed clusters. (2) The smallest seed cluster (cluster 2) and one of its adjacent clusters (cluster 3) are merged. (3) The merged seed cluster (containing cluster 2 and cluster 3) is still the smallest seed cluster [its size is 35 (20 + 5 + 10)], so we continue to merge it with its adjacent cluster, cluster 5. (4) Now the merged seed cluster's size is 55, and the smallest seed cluster changes to cluster 1. Then, cluster 1 is merged with its adjacent cluster, cluster 4, and its size becomes 65. (5) Now, there is no cluster adjacent to any seed cluster, so we put the isolated cluster, cluster 6, into the smallest seed cluster, cluster 2, as shown in the final state in Fig. 1.
3.3 Graph Conversion Phase
So far, we have obtained k clusters of an edge-cut graph. In this final phase, we convert the edge-cut graph into a vertex-cut graph, since most recent distributed graph engines are based on the vertex-cut graph. This design is based on the fact that a vertex-cut graph is more efficiently balanced than an edge-cut graph [3, 13]. To convert the edge-cut graph to a vertex-cut graph, we have to convert each cross-cluster edge to a cross-cluster vertex by choosing one of the two endpoints of the cross-cluster edge as the cross-cluster vertex. Suppose u is chosen as the cross-cluster vertex and v is not, for cross-cluster edge e(u, v). The cross-cluster edge e(u, v) is then assigned to the cluster to which the non-cross-cluster vertex v belongs. We choose cross-cluster vertexes so that the sizes of the clusters are balanced. This procedure is simple but largely affects the load balance.
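The conversion step might look like the following sketch: for each cross-cluster edge, one endpoint becomes the replicated (cross-cluster) vertex and the edge is assigned to the other endpoint's cluster, picking the side that keeps cluster sizes balanced. This is an illustration of the idea, not the paper's exact heuristic:

```python
def edge_cut_to_vertex_cut(edges, cluster_of):
    """edges: list of (u, v); cluster_of: {vertex: cluster_id}.
    Returns {cluster_id: [edges]} where each cross-cluster edge is
    assigned to the lighter of its two endpoint clusters, so the
    endpoint in the heavier cluster becomes the replicated vertex."""
    load = {c: 0 for c in set(cluster_of.values())}
    assignment = {c: [] for c in load}
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            target = cu  # inner edge stays in its own cluster
        else:
            # assign the edge to the currently lighter cluster
            target = cu if load[cu] <= load[cv] else cv
        assignment[target].append((u, v))
        load[target] += 1
    return assignment
```

Because each edge lands in exactly one cluster, the per-cluster edge counts directly determine the load balance factor of the resulting vertex-cut partitioning.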
4 Experiments
 Partitioned graph quality

We evaluated the effectiveness of the partitioned graphs by using the major metrics, the replication factor [Eq. (1)] and the load balance factor [Eq. (2)].
 Performance for graph analysis

We evaluated the runtime, the amount of data transfer between computers, and the imbalance runtime ratio among computers during graph analysis. In addition, we also evaluated how the major graph metrics, the replication factor and load balance factor, correlate with the physical performance measures: the response time, the amount of data transfer between computers, and the imbalance runtime ratio among computers.
 Scalability

We evaluated the response time of graph analysis, graph partitioning time, and the sum of both by varying the number of computers.
We compared our graph partitioning technique to other techniques: random partitioning, Oblivious [3], and HDRF [6]. Random partitioning is a naive approach that randomly assigns vertexes/edges to distributed computers. Oblivious is a heuristic technique that balances the size of partitions and reduces the replication factor. HDRF is a technique improved from Oblivious that actually provides better graph partitions than Oblivious does for various graphs. We used two variations of our graph partitioning technique in the 1st phase: the original modularity clustering and the balanced-size modularity clustering. They are denoted as modularity and balanced-size in the figures, respectively. For the parameter setting, we chose the number of clusters the 1st phase generates according to the graph size; we generate more clusters as the input graph size increases.
4.1 Benchmark
 1.
PageRank [14]: one of the link-based ranking techniques designed for web pages.
 2.
SSSP (singlesource shortest path): computing the shortest paths to all vertexes from a given vertex.
 3.
CC (connected component): detecting subgraphs (components) connected with edges.
Real-world graph data
Dataset  Short name  |V|  |E|  Modularity

email-EuAll [15]  Eu  265,214  420,045  0.779
web-Stanford [15]  St  281,903  2,312,497  0.914
com-DBLP [15]  DB  317,080  1,049,866  0.806
web-NotreDame [15]  No  325,729  1,497,134  0.931
amazon0505 [15]  am  410,236  3,356,824  0.852
web-BerkStan [15]  Be  685,230  7,600,595  0.930
web-Google [15]  Go  875,713  5,105,039  0.974
soc-Pokec [15]  Po  1,632,803  30,622,564  0.633
roadNet-CA [15]  CA  1,965,206  2,766,607  0.992
wiki-Talk [15]  Ta  2,394,385  5,021,410  0.566
soc-LiveJournal1 [15]  Li  4,847,571  68,993,773  0.721
uk-2002 [16]  uk  18,520,486  298,113,762  0.986
webbase-2001 [16]  ba  118,142,155  1,019,903,190  0.976
4.2 Setting
The experiments were conducted on Amazon EC2 r3.2xlarge Linux instances. Each instance has an Intel(R) Xeon(R) CPU E5-2670 v2, 2.50 GHz (four cores) with 64 GB RAM. The network performance between instances was 1.03 Gbps. The hard disks delivered 103 MB/s for buffered reads. We used g++ 4.8.1 with -O3 optimization for PowerGraph and all partitioning techniques. We chose the synchronous engine of PowerGraph to ensure the preciseness of the analysis results.
4.3 Partitioned Graph Quality
We evaluated the effectiveness of the partitioned graphs by using the major metrics, the replication factor [Eq. (1)] and the load balance factor [Eq. (2)], for the graph data in Table 2.
4.3.1 Relationship Between Modularity and Replication Factor
4.3.2 Replication Factor
Figure 3 shows the results of the experiments for the replication factor by varying the number of computers: 8, 16, 32, 48, 64. The figure includes only the three largest graph data, soc-LiveJournal1, uk-2002, and webbase-2001. We omit the others here because their results are similar to those of the above three graph data. We set the number of clusters the 1st phase generates at 4000, 8000, 160,000 for soc-LiveJournal1, uk-2002, and webbase-2001, respectively.
4.3.3 Load Balance Factor
4.4 Performance for Graph Analysis
We evaluated the runtime, the amount of data transfer between computers, and the imbalance runtime ratio among computers during graph analysis executed on PowerGraph. We fixed the number of computers at 64.
4.4.1 Runtime
4.4.2 Amount of Data Transfer
4.4.3 Imbalance Runtime Ratio
4.5 Scalability
5 Related Work
In the line of work on efficient distributed graph processing, the problem of finding better graph partitions has been studied in recent decades. A recent survey paper on vertex-centric frameworks summarizes various types of graph partitioning techniques [17]. The major approaches are twofold: the edge-cut method and the vertex-cut method.
Edge-cut method The edge-cut method is a graph partitioning approach that divides a graph into sets of subgraphs by cutting edges so as to reduce the number of cross-partition edges. In distributed graph processing, the edge-cut method assigns each subgraph to a computer. METIS, proposed by Karypis and Kumar in 1998 [18], is one of the representative partitioning algorithms that focuses on reducing the number of cross-partition edges via the edge-cut method. The problem of the edge-cut method is that it cannot avoid load imbalance for typical graphs that follow the power-law distribution [19]. We explain this in more detail in the vertex-cut method part.
Vertex-cut method The vertex-cut method is another type of partitioning technique that attempts to reduce the number of cross-partition vertexes. As we described above, the edge-cut method splits a graph into sets of subgraphs by cutting edges. In contrast, the vertex-cut method divides a graph by splitting vertexes. Most recent distributed graph engines use vertex-cut methods, because a vertex-cut graph is more efficiently balanced than an edge-cut graph [3, 13]. Graphs typically follow the power-law distribution, so they tend to include super-vertexes, that is, vertexes whose number of connected edges is tremendously large. Those super-vertexes largely affect load imbalance, so the idea of the vertex-cut method is to reduce the load imbalance by splitting the super-vertexes. In the family of vertex-cut methods, Oblivious [3] and HDRF (High-Degree (are) Replicated First) [6] are the state-of-the-art algorithms. These algorithms are stream-based: every edge is read from the input file and immediately assigned to a computer; thus, they are scalable to large-scale graphs and achieve better load balance performance. Specifically, Oblivious assigns an incoming edge to a computer so that it can reduce the number of cross-vertexes spanning computers. HDRF divides edges into partitions by splitting high-degree vertexes in order to reduce the total number of cross-vertexes.
6 Conclusion
We proposed a graph partitioning technique that efficiently partitions graphs with good quality so that it achieves high performance for graph analysis by reducing the communication cost and keeping good load balance among computers. We extend modularity-based clustering and integrate it with techniques for the graph packing problem. We implemented our technique on top of the distributed graph engine PowerGraph and conducted intensive experiments. The results show that our partitioning technique reduces the communication cost and thereby improves the response time of graph analysis patterns. In particular, PageRank computation is up to 3.2 times faster than with HDRF, the state of the art in streaming-based partitioning. In addition, we observed that the replication factor and load balance factor correlate with the amount of data transfer and the imbalance runtime ratio, respectively, and that the response time is correlated with the replication factor but not so much with the load balance factor.
Possible future work is as follows. (1) There is a trade-off between the communication cost and load balance depending on the number of computers. We optimize this trade-off by fixing the number of computers in this paper, but one direction of future work is to optimize the number of computers depending on the input graph and analysis patterns. (2) There is still room for further improving the replication factor and load imbalance and for achieving more efficient graph clustering.
References
 1. Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of SIGMOD
 2. Low Y, Bickson D, Gonzalez J, Guestrin C, Kyrola A, Hellerstein JM (2012) Distributed GraphLab: a framework for machine learning and data mining in the cloud. PVLDB 5(8):716–727
 3. Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C (2012) PowerGraph: distributed graph-parallel computation on natural graphs. In: Proceedings of OSDI
 4. Xin RS, Gonzalez JE, Franklin MJ, Stoica I (2013) GraphX: a resilient distributed graph system on Spark. In: Proceedings of GRADES
 5. Suri S, Vassilvitskii S (2011) Counting triangles and the curse of the last reducer. In: Proceedings of WWW
 6. Petroni F, Querzoni L, Daudjee K, Kamali S, Iacoboni G (2015) HDRF: stream-based partitioning for power-law graphs. In: Proceedings of CIKM
 7. Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69:026113
 8. Wakita K, Tsurumi T (2007) Finding community structure in mega-scale social networks. In: Proceedings of WWW
 9. Shiokawa H, Fujiwara Y, Onizuka M (2013) Fast algorithm for modularity-based graph clustering. In: Proceedings of AAAI
 10. Dósa G, Sgall J (2013) First fit bin packing: a tight analysis. In: Proceedings of STACS
 11. Clauset A, Newman MEJ, Moore C (2004) Finding community structure in very large networks. Phys Rev E 70:066111
 12. Blondel VD, Guillaume J, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp. doi:10.1088/1742-5468/2008/10/P10008
 13. Bourse F, Lelarge M, Vojnovic M (2014) Balanced graph edge partition. In: Proceedings of KDD
 14. Page L, Brin S, Motwani R, Winograd T (1999) The PageRank citation ranking: bringing order to the web. Technical report
 15. Stanford Large Network Dataset Collection (2014) http://snap.stanford.edu/data/. Accessed 31 Jan 2017
 16. Laboratory for Web Algorithmics (2002) http://law.di.unimi.it. Accessed 31 Jan 2017
 17. McCune RR, Weninger T, Madey G (2015) Thinking like a vertex: a survey of vertex-centric frameworks for large-scale distributed graph processing. ACM Comput Surv 48(2):25
 18. Karypis G, Kumar V (1999) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput 20(1):359–392
 19. Faloutsos M, Faloutsos P, Faloutsos C (1999) On power-law relationships of the internet topology. In: Proceedings of SIGCOMM
Copyright information
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.