Cascade-aware partitioning of large graph databases

Graph partitioning is an essential task for scalable data management and analysis. Current partitioning methods utilize the structure of the graph and, if available, the query log. Some queries performed on the database may trigger further operations. For example, the query workload of a social network application may contain re-sharing operations in the form of cascades. It is therefore beneficial to include potential cascades in the graph partitioning objectives. In this paper, we introduce the problem of cascade-aware graph partitioning, which aims to minimize the overall cost of communication among parts/servers during cascade processes. We develop a randomized solution that estimates the underlying cascades and uses these estimates as input for the partitioning of large-scale graphs. Experiments on 17 real social networks demonstrate the effectiveness of the proposed solution in terms of the partitioning objectives.


Introduction
Distributed graph databases employ partitioning methods to provide data locality for queries and to keep the load balanced among servers [1][2][3][4][5]. Online social networks (OSNs) are common applications of graph databases where users are represented by vertices and their connections are represented by edges/hyperedges. Partitioning tools (e.g., Metis [6], Patoh [7]) and community detection algorithms (e.g., [8]) are used for assigning users to servers. The contents generated by a user are typically stored on the server to which the user is assigned.
Graph partitioning methods are designed using the graph structure, and the query workload (i.e., logs of queries executed on the database), if available [9][10][11][12][13][14]. Some queries performed on the database may trigger further operations. For example, users in OSNs frequently share contents generated by others, which leads to a propagation/cascade of re-shares (cascades occur when users are influenced by others and then perform the same acts) [15][16][17]. The database needs to copy the re-shared contents to the servers that contain the users who will eventually need to access this content (i.e., at least a record id of the original content needs to be transferred).
Many users in a cascade process are not necessarily neighbors of the originator. Hence, the graph structure, even with the influence probabilities, does not directly capture the underlying cascading behavior if link activities are considered independently. We first aim to estimate the cascade traffic on the edges. For this purpose, we present the concept of random propagation trees/forests, which encodes the information of propagation traces through users. We then develop a cascade-aware partitioning that aims to optimize the load balance and reduce the amount of propagation traffic between servers. We discuss the relationship between cascade-aware partitioning and other graph partitioning objectives.
To get insight into the cascade traffic, we analyzed a query workload from Digg, a news-sharing-based social network [18]. The data include cascades with a depth of up to six links, i.e., the maximum path length from the initiator of the content to the users who eventually receive it. When we partitioned the graph by just minimizing the number of links straddling 32 balanced partitions (using Metis [6]), the majority of the traffic remained between servers rather than staying local. This traffic goes over a relatively small fraction of the links: only 0.01% of the links occur in 20% of the cascades, and these links carry 80% of the traffic observed in these cascades. It is important to identify these highly active edges and avoid placing them across partitions.
We draw an equivalence between minimizing the expected number of cut edges in a random propagation tree/forest and minimizing communication during a random propagation process starting from any subset of users. A probability distribution is defined over the edges of a graph, which corresponds to the frequency of these edges being involved in a random propagation process. #P-Hardness of the computation of this distribution is discussed, and a sampling-based method, which enables estimation of this distribution within a desired level of accuracy and confidence interval, is proposed along with its theoretical analysis.
Experimentation has been performed both with theoretical cascade models and with real logs of user interactions. The experimental results show that the proposed solution performs significantly better than the alternatives in reducing the amount of communication between servers during a cascade process. While the propagation of content was studied in the literature from the perspective of data modeling, to the best of our knowledge, these models have not been integrated into database partitioning for efficiency and scalability.
The rest of the paper is organized as follows. Table 1 displays the notation used throughout the paper. Section 2 provides the background material and summarizes the related work. Section 3 presents a formal definition for the proposed problem. Section 4 describes the proposed solution for the problem, gives a theoretical analysis and explains how it achieves its objectives. Section 5 presents a discussion for some of the limitations and extensions of the cascade-aware graph partitioning algorithm. Section 6 presents the results of experiments on real-world datasets and demonstrates the effectiveness of the proposed solution. Section 7 concludes the paper.

Graph partitioning
Let G = (V, E) be an undirected graph such that each vertex v_i ∈ V has a weight w_i and each undirected edge e_ij ∈ E connecting vertices v_i and v_j has a cost c_ij. A K-way partition Π = {V_1, V_2, ..., V_K} of G is defined as follows: each part V_k ∈ Π is a non-empty subset of V, all parts are mutually disjoint (i.e., V_k ∩ V_m = ∅ for k ≠ m), and the union of all parts is V (i.e., ∪_{V_k ∈ Π} V_k = V). Given a partition Π, the weight W_k of a part V_k is defined as the sum of the weights of the vertices belonging to that part (i.e., W_k = Σ_{v_i ∈ V_k} w_i). The partition Π is said to be balanced if all parts V_k ∈ Π satisfy the balancing constraint

W_k ≤ W_avg (1 + ε).    (1)

Here, W_avg is the average part weight (i.e., W_avg = Σ_{v_i ∈ V} w_i / K) and ε is the maximum allowed imbalance ratio of a partition.
An edge is called cut if its endpoints belong to different parts and uncut otherwise. Cut and uncut edges are also referred to as external and internal edges, respectively. The cut size χ(Π) of a partition Π is defined as

χ(Π) = Σ_{e_ij ∈ E^Π_cut} c_ij,    (2)

where E^Π_cut denotes the set of cut edges. In the multi-constraint extension of the graph partitioning problem, each vertex v_i is associated with multiple weights w^c_i for c = 1, ..., C. For a given partition Π, W^c_k denotes the weight of part V_k on constraint c (i.e., W^c_k = Σ_{v_i ∈ V_k} w^c_i). Then, Π is said to be balanced if each part V_k satisfies W^c_k ≤ W^c_avg (1 + ε), where W^c_avg denotes the average part weight on constraint c.
The graph partitioning problem, which is an NP-Hard problem [19], seeks to compute a partition Π * of G that minimizes the cut size χ(·) in Eq. (2) while satisfying the balancing constraint in Eq. (1) defined on part weights.
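To make these definitions concrete, the cut size in Eq. (2) and the balancing constraint in Eq. (1) can be checked in a few lines of Python (a minimal sketch; the data layout and function names are ours, not from the paper):

```python
from collections import defaultdict

def cut_size(edges, part_of):
    # chi(Pi): total cost of edges whose endpoints lie in different parts
    return sum(c for u, v, c in edges if part_of[u] != part_of[v])

def is_balanced(part_of, weights, K, eps):
    # Check W_k <= W_avg * (1 + eps) for every part
    W = defaultdict(float)
    for v, k in part_of.items():
        W[k] += weights[v]
    W_avg = sum(weights.values()) / K
    return all(W[k] <= W_avg * (1 + eps) for k in range(K))
```

For example, a triangle with edge costs 1, 2, 1 split as {v_0, v_1} and {v_2} has cut size 3, and with unit vertex weights it is balanced for ε = 0.5 but not for ε = 0.1.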

Graph partitioning and replication
Graph partitioning has been studied to improve the scalability and query processing performance of distributed data management systems, and it has been widely used in the context of social networks. Pujol et al. [10] propose a social network partitioning solution that reduces the number of edges crossing different parts and provides a balanced distribution of vertices, aiming to reduce the number of communication operations between servers. This work is later extended in [9] by considering the replication of some users across different parts. SPAR [11] is developed as a social network partitioning and replication middleware.
Yuan et al. [13] propose a partitioning scheme to process time-dependent social network queries more efficiently. The proposed scheme considers not only the spatial network of social relations but also the time dimension, in such a way that users who have communicated within a time window are grouped together. Additionally, the social graph is partitioned by considering two-hop neighborhoods of users instead of just directly connected users. Turk et al. [14] propose a hypergraph model built from logs of temporal user interactions. The proposed hypergraph model correctly encapsulates multi-user queries and is partitioned under load balance and replication constraints. Partitions obtained by this approach effectively reduce the number of communication operations needed during the execution of multicast and gather types of queries.
Sedge [3] is a distributed graph management environment based on Pregel [20], designed to minimize communication among servers during graph query processing. Sedge adopts a two-level partition management system: in the first level, complementary graph partitions are computed via the graph partitioning tool Metis [6]; in the second level, on-demand partitioning and replication strategies are employed. To determine cross-partition hotspots in the second level, the ratio of the number of cut edges to the number of uncut edges of each part is computed. This ratio approximates the probability of observing a cross-partition query and is compared against the ratio of the number of cross-partition queries to internal queries in a workload. This estimation technique differs from our approach: we estimate the probability of an edge being included in a cascade process, whereas this approach estimates the probability of observing a cross-partition query in a part and does not consider propagation processes.
Leopard is a graph partitioning and replication algorithm for managing large-scale dynamic graphs [1]. This algorithm incrementally maintains the quality of an initial partition by dynamically replicating and reassigning vertices. Nicoara et al. [21] propose Hermes, a lightweight repartitioning algorithm for dynamic social network graphs. In this approach, the initial partitioning is obtained via Metis, and as the graph structure changes over time, an incremental algorithm is executed to maintain the quality of the partitions.
For efficient processing of distributed transactions, Curino et al. [4] propose SCHISM, a workload-aware graph model that makes use of past query patterns. In this model, data items are represented by vertices, and if two items are accessed by the same transaction, an edge is placed between the respective pair of vertices. In order to reduce the number of distributed transactions, the proposed model is split into balanced partitions using a replication strategy in such a way that the number of cut edges is minimized.
Hash-based graph partitioning and selective replication schemes are also employed for managing large-scale dynamic graphs [2]. Instead of utilizing graph partitioning techniques, a replication strategy is used to perform cross-partition graph queries locally on servers. This method makes use of past query workloads in order to decide which vertices should be replicated among servers.

Multi-query optimization
Le et al. [22] propose a multi-query optimization algorithm which partitions a set of graph queries into groups such that queries in the same group have similar query patterns. Their partitioning algorithm is based on the k-means clustering algorithm, and the queries assigned to each cluster are rewritten into their cost-efficient versions. Our work diverges from this approach: we use propagation traces to estimate a probability distribution over the edges of a graph and partition that graph, whereas this approach clusters queries based on their similarities.

Influence propagation
Propagation of influence [15] is commonly modeled using a probabilistic model [23,24] learnt over user interactions [25,26]. The influence maximization problem was first studied by Domingos and Richardson [27]. Kempe et al. [28] proved that the influence maximization problem is NP-Hard under the Independent Cascade (IC) and Linear Threshold (LT) propagation models. The influence spread function defined in [28] has an important property called submodularity, which enables a greedy algorithm to achieve a (1 − 1/e) approximation guarantee for the influence maximization problem. However, computing this influence spread function is proven to be #P-Hard [17], which makes the greedy approximation algorithm proposed in [28] infeasible for larger social networks. Therefore, more efficient heuristic algorithms have been targeted in the literature [17,29,30,31,32].
More recently, algorithms that run in near-optimal linear time and provide a (1 − 1/e) approximation guarantee for the influence maximization problem have been proposed in [33,34,35].
The notion of influence and its propagation processes have also been used to detect communities in social networks. Zhou et al. [36] find the community structure of a social network by grouping users that have high influence-based similarity scores. Similarly, Lu et al. [37] and Ghosh et al. [38] consider finding a community partition of a social network that maximizes different influence-based metrics within communities. Barbieri et al. [39] propose a network-oblivious algorithm that makes use of the influence propagation traces available in their datasets to detect community structures.

Problem definition
In this section, we present the graph partitioning problem within the context of content propagation in a social network where the link structure and the propagation probability values associated with the links are given. Let an edge-weighted directed graph G = (V, E, w) represent a social network where each user is represented by a vertex v_i ∈ V, each directed edge e_ij ∈ E represents the direction of content propagation from user v_i to user v_j, and each edge e_ij is associated with a content propagation probability w_ij ∈ [0, 1]. We assume that the w_ij probabilities associated with the edges are known beforehand, as is common in the influence maximization domain [28,29,34]. Methods for learning the influence/content propagation probabilities between users in a social network have been studied previously in the literature [25,26]. In this setting, a partition Π of G refers to a user-to-server assignment in such a way that a vertex v_i assigned to a part V_k ∈ Π denotes that the user represented by v_i is stored on the server represented by part V_k.
We adopt a widely used propagation model, the IC model, with propagation processes starting from a single user for ease of exposition. As we discuss later, this can be extended to other popular models, such as the LT model, and to propagation processes starting from multiple users. Under the IC model, a content propagation process proceeds in discrete time steps as follows. Let a subset S ⊆ V consist of the initially active users who share a specific content for the first time in the social network (we assume |S| = 1 for ease of exposition). For each discrete time step t ≥ 0, let the set S_t consist of the users that are activated in time step t, so that S_0 = S (a user becomes activated when it has just received the content). Once activated in time step t, each user v_i ∈ S_t is given a single chance of activating each of its inactive neighbors v_j with probability w_ij (i.e., user v_i activating user v_j means that the content propagates from v_i to v_j). If an inactive neighbor v_j is activated in time step t (i.e., v_j has received the content), then it becomes active in the next time step t + 1 and is added to the set S_{t+1}. The same process continues until there are no new activations (i.e., until S_t = ∅).
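The discrete-step process above can be sketched as a short simulation (a minimal illustration, assuming an adjacency-list input adj[u] = [(v, w_uv), ...]; all names are our own):

```python
import random

def ic_cascade(adj, seeds, rng=random.random):
    """Simulate one IC-model propagation: each newly activated user
    gets a single chance to activate each inactive neighbor."""
    active = set(seeds)        # S_0 = S
    frontier = list(seeds)     # users activated in the current step
    while frontier:            # stop when S_t is empty
        nxt = []
        for u in frontier:
            for v, w in adj.get(u, ()):
                if v not in active and rng() < w:
                    active.add(v)   # v receives the content
                    nxt.append(v)   # v becomes active at step t + 1
        frontier = nxt
    return active
```

With all edge probabilities set to 1 or 0 the simulation is deterministic, which is convenient for sanity checks.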
Kempe et al. [28] define an equivalent process for the IC model by generating an unweighted directed graph g from G by independently realizing each edge e i j ∈ E with probability w i j . In the realized graph g, vertices reachable by a directed path from the vertices in S can be considered as active at the end of an execution of the IC model propagation process starting with the initially active users in S. As a result of the equivalent process of the IC model, the original graph G induces a distribution over unweighted directed graphs. Therefore, we use the notation g ∼ G to indicate that we draw an unweighted directed graph g from the distribution induced by G by using the equivalent process of IC model. That is, we generate a directed graph g via realizing each edge e i j ∈ G with probability w i j .

Propagation trees
Given a vertex v, we define the propagation tree I_g(v) as a directed tree rooted at vertex v in graph g. The tree I_g(v) corresponds to an IC model propagation process with v as the initially active vertex, in such a way that each edge e_ij ∈ I_g(v) encodes the information that the content propagated from v_i to v_j during this process. There can be more than one possible propagation tree for v on g, since g may not be a tree itself. One of the possible trees can be computed by performing a breadth-first search (BFS) on g starting from vertex v, since the IC model does not prescribe an order for activating the inactive neighbors of newly activated vertices. Note that generating a graph g and performing a BFS from a vertex v is equivalent to performing a randomized BFS starting from the vertex v. The difference between the randomized BFS algorithm and the usual BFS algorithm is that each edge e_ij ∈ E is searched with probability w_ij in the randomized case. That is, during an iteration of the BFS, if a vertex v_i is extracted from the queue, each of its outgoing edges e_ij to an unvisited vertex v_j is examined, and v_j is added to the queue with probability w_ij.
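The randomized BFS just described can be sketched as follows (a minimal version under the same adjacency-list assumption as before; each edge out of a dequeued vertex toward an unvisited vertex is taken with probability w_ij):

```python
import random
from collections import deque

def random_propagation_tree(adj, root, rng=random.random):
    """Return the edges of one propagation tree I_g(root) via
    randomized BFS: edge (u, v) is traversed with probability w_uv."""
    visited = {root}
    tree = []
    q = deque([root])
    while q:
        u = q.popleft()
        for v, w in adj.get(u, ()):
            if v not in visited and rng() < w:
                visited.add(v)       # v is activated for the first time
                tree.append((u, v))  # (u, v) joins the propagation tree
                q.append(v)
    return tree
```

Because every vertex is visited at most once, the returned edge set always forms a directed tree rooted at the chosen vertex, even when g contains cycles.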
Here, we also define a fundamental concept, the random propagation tree, which is used throughout the text. A random propagation tree is a propagation tree generated by two levels of randomness: first, a graph g is drawn from the distribution induced by G; then, a vertex v ∈ V is chosen randomly and its propagation tree I_g(v) on g is computed. It is important to note that a random propagation tree is equivalent to an IC model propagation process starting from a randomly chosen vertex. The concept of random propagation trees has similarities to the reverse-reachable sets previously proposed in [33,34]. Reverse-reachable sets are built on the transpose G^T of the directed graph G by performing a randomized BFS starting from a vertex v and including all BFS edges. Hence, reverse-reachable sets differ from propagation trees in that they do not constitute directed trees and they are built on the structure of G^T instead of G.
From a systems perspective, if a content propagation occurs between two users located on different servers, we assume this causes a communication operation. This is depicted in Fig. 1, which displays a graph whose edges denote the directions of content propagation between users. In this figure, two different partitionings of the same social network are given in Fig. 1a, b. In Fig. 1a, users are partitioned among three servers as S_1 = {u_0, u_1, u_2}, S_2 = {u_6, u_7, u_8, u_9} and S_3 = {u_3, u_4, u_5}. In Fig. 1b, user u_6 is moved from S_2 to S_1, while S_3 remains the same. In the figure, a content shared by user u_7 propagates through four users u_6, u_1, u_2 and u_3 under the IC model. Here, the straight lines denote the edges along which propagation events occurred, and these lines constitute the propagation tree formed by this propagation process (the probability values associated with the edges are discussed in the next section). The dotted lines denote the edges that are not involved in this propagation process. Therefore, in accordance with our assumption, straight lines crossing different parts necessitate communication operations. For instance, in Fig. 1a, the propagation of the content from u_7 to u_6 does not incur any communication operation, whereas the propagation of the same content from u_6 to u_1 and u_2 incurs two communication operations. For the whole propagation process initiated by user u_7, the total number of communication operations is equal to 3 and 2 under the partitions in Fig. 1a, b, respectively.
Given a partition Π of G and a propagation tree I_g(v) of a vertex v on a directed graph g ∼ G, we define the number of communication operations performed during the corresponding propagation process as

λ^Π_g(v) = |{e_ij ∈ I_g(v) : e_ij ∈ E^Π_cut}|.

That is, the number of communication operations performed is equal to the number of edges in I_g(v) that cross different parts in Π. It can be observed that each different partition Π of G induces a different communication pattern between servers for the same propagation process.

Cascade-aware graph partitioning
In the cascade-aware graph partitioning problem, we seek to compute a partition Π * of G that achieves the following two objectives: (i) Under the IC model, the expected number of communication operations to be performed between servers during a propagation process starting from a randomly chosen user should be minimized. (ii) The partition should distribute the users to servers as evenly as possible in order to ensure a balance of workload among them.
The first objective reflects the fact that many different content propagations, starting from different users or subsets of users, may occur simultaneously during any time interval in a social network; to minimize the total communication between servers, the expected number of communication operations in a random propagation process can be minimized. It is worth mentioning that, due to the equivalence between random propagation trees and the randomized BFS algorithm, the first objective is also equivalent to minimizing the expected number of cross-partition edges traversed during a randomized BFS execution starting from a randomly chosen vertex.
To give a formal definition for the proposed problem, we redefine the first objective in terms of the equivalent process of the IC model. For a given partition Π of G, we write the expected number of communication operations performed during a propagation process starting from a randomly chosen user as

E_{v, g∼G}[λ^Π_g(v)].

Here, the subscripts v and g ∼ G of the expectation denote the two levels of randomness in the process of generating a random propagation tree. As mentioned above, a random propagation tree I_g(v) is equivalent to a propagation process that starts from a randomly chosen user in the network. Therefore, the expected value of λ^Π_g(v), which is the expected number of cut edges included in a random propagation tree, is equal to the expected number of communication operations performed. Due to this correspondence, computing a partition Π* that minimizes the expectation E_{v, g∼G}[λ^Π_g(v)] achieves the first objective (i) of the proposed problem. Consequently, the proposed problem can be defined as a special type of graph partitioning in which the objective is to compute a K-way partition Π* of G that minimizes this expectation subject to the balancing constraint in Eq. (1). That is,

Π* = argmin_Π E_{v, g∼G}[λ^Π_g(v)],    (4)

subject to W_k ≤ W_avg (1 + ε) for each V_k ∈ Π. Here, we assign each vertex v_i ∈ V a weight w_i = 1 and define the weight W_k of a part V_k ∈ Π as the number of vertices assigned to that part (i.e., W_k = |V_k|). Therefore, the balancing constraint ensures that objective (ii) is achieved by the partition Π*.

Solution
The proposed approach is to first estimate a probability distribution modeling the propagation and then use it as input to map the problem into a graph partitioning problem. Given an edge-weighted directed graph G = (V, E, w) representing an underlying social network, the first stage of the proposed solution consists of estimating a probability distribution defined over all edges of G. For that purpose, we define a probability value p_ij for each edge e_ij ∈ E, distinct from its content propagation probability w_ij. The value p_ij of an edge e_ij is defined to be the probability that the edge e_ij is involved in a propagation process that starts from a randomly selected user. Equivalently, when a random propagation tree I_g(v) is generated by the process described in Sect. 3, the probability that the edge e_ij is included in the propagation tree I_g(v) is equal to p_ij. It is important to note that the value w_ij of an edge e_ij corresponds to the probability that e_ij is included in a graph g ∼ G, whereas the value p_ij is the probability that e_ij is included in a random propagation tree I_g(v) rooted at a randomly selected vertex v in g. For now, we delay the discussion on the computation of the p_ij values for ease of exposition and assume that we are provided with them. Later in this section, we provide an efficient method that estimates these values.
The expectation E_{v, g∼G}[λ^Π_g(v)] corresponds to the expected number of cut edges in a random propagation tree I_g(v) under a partition Π. In other words, if we draw a graph g from the distribution induced by G, randomly choose a vertex v, and compute its propagation tree I_g(v), then the expected number of cut edges included in I_g(v) equals this expectation. On the other hand, the value p_ij of an edge e_ij is defined to be the probability that the edge e_ij is included in a random propagation tree I_g(v). Therefore, given a partition Π, the expectation can be written in terms of the p_ij values of the cut edges in E^Π_cut as follows:

E_{v, g∼G}[λ^Π_g(v)] = Σ_{e_ij ∈ E^Π_cut} p_ij.    (5)

In Eq. (5), the expected number of cut edges in a random propagation tree is computed by summing the p_ij value of each edge e_ij ∈ E^Π_cut. Hence, the main objective becomes to compute a partition Π* that minimizes the total p_ij value of the edges crossing different parts and satisfies the balancing constraint defined over the part weights. That is,

Π* = argmin_Π Σ_{e_ij ∈ E^Π_cut} p_ij,    (6)

subject to the balancing constraint defined in the original problem. As mentioned earlier, each vertex v_i is associated with a weight w_i = 1 and the weight W_k of a part V_k is defined to be the number of vertices assigned to V_k (i.e., W_k = |V_k|). As a result of Eq. (6), the problem can be formulated as a graph partitioning problem for which successful tools exist [6,40]. However, the graph partitioning problem is usually defined for undirected graphs, whereas G is a directed graph and the p_ij values are associated with its directed edges.
To that end, we build an undirected graph G′ = (V, E′) by symmetrizing the directed graph G, computing the cost of each undirected edge e_ij ∈ E′ as c_ij = p_ij + p_ji.
Let Π be a partition of G′. Since G and G′ consist of the same vertex set V, Π induces a set E^Π_cut of cut edges for the original graph G. Due to the cost definitions of the edges in E′, the cut size χ(Π) of G′ under partition Π is equal to the sum of the p_ij values of the cut edges in E^Π_cut, which is shown to be equal to the value of the main objective function in Eq. (4). That is,

χ(Π) = Σ_{e_ij ∈ E^Π_cut} p_ij = E_{v, g∼G}[λ^Π_g(v)].

Hence, a partition Π* that minimizes the cut size χ(·) of G′ also minimizes the expectation in the original social network partitioning problem. In other words, if the partition Π* is an optimal solution for the partitioning of G′, it is also an optimal solution for Eq. (4) in the original problem. Additionally, the equivalence drawn between the graph partitioning problem and the cascade-aware graph partitioning problem proves that the proposed problem is NP-Hard even if the p_ij values are given beforehand.
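The symmetrization step itself is straightforward; a minimal sketch, assuming the p_ij values are stored in a dictionary keyed by directed edges (our own representation):

```python
from collections import defaultdict

def symmetrize(p):
    """Fold directed p_ij values into undirected costs c_ij = p_ij + p_ji."""
    cost = defaultdict(float)
    for (i, j), pij in p.items():
        # Canonical undirected key so (i, j) and (j, i) accumulate together
        cost[(min(i, j), max(i, j))] += pij
    return dict(cost)
```

For example, p = {(0, 1): 0.3, (1, 0): 0.2} yields the single undirected edge (0, 1) with cost 0.5.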
In Fig. 1, the main objective of cascade-aware graph partitioning is depicted as follows: each edge in the figure is associated with a content propagation probability along with its computed p_ij value (i.e., each edge e_ij is labeled "w_ij | p_ij"). The partitioning in Fig. 1a provides a better cut size in terms of both the number of cut edges and the total propagation probability of the edges crossing different parts. However, the partitioning in Fig. 1b provides a better partition for the objective function in Eq. (4), at the expense of a worse cut size in terms of the other cut-size metrics (i.e., the sum of the p_ij values of the cut edges is smaller in the second partition).

Computation of the p ij values
We now return to the discussion on the computation of the p_ij values defined over all edges of G and start with the following theorem, which indicates the hardness of this computation:

Theorem 1 Computation of the p i j value for an edge e i j of G is a #P-Hard problem.
Proof Let the function σ(v_k, v_i) denote the probability that there is a directed path from vertex v_k to vertex v_i in a directed graph g drawn from the distribution induced by G. Assume that, in every possible g, the only path from v_k to v_j goes through v_i; that is, v_j is connected only to v_i in G. (This simplifying assumption does not affect the conclusion we draw for the theorem.) Let p^k_ij denote the probability that the edge e_ij is included in a propagation tree rooted at v_k. Since the inclusion of e_ij in g and the formation of a directed path from v_k to v_i in g are independent events, their respective probabilities w_ij and σ(v_k, v_i) can be multiplied, giving p^k_ij = σ(v_k, v_i) · w_ij. As mentioned earlier, the value p_ij of an edge e_ij is defined to be the probability that e_ij is included in a random propagation tree. Therefore, we can compute the value p_ij of an edge e_ij as

p_ij = Σ_{v_k ∈ V} (1/|V|) · p^k_ij.

Due to the definition of random propagation trees, the selections of v_k in a graph g ∼ G are mutually exclusive events, each with probability 1/|V|. Therefore, we can sum the terms (1/|V|) · p^k_ij over all v_k ∈ V to compute the total probability p_ij.
In order to prove the theorem, we present an equivalence between the computation of the function σ(·, ·) and the s,t-connectedness problem [41], since the p_ij value of an edge e_ij depends on the computation of σ(v_k, v_i) for every v_k ∈ V. The s,t-connectedness problem is defined on a directed graph in which each edge may fail randomly and independently of the others, and it asks for the total probability that there is an operational path from a specified source vertex s to a target vertex t. Computing this probability is proven to be #P-Hard [41]. On the other hand, the function σ(v_k, v_i) denotes the probability that there is a directed path from v_k to v_i in g ∼ G, where each edge e_ij in g is realized with probability w_ij randomly and independently of the other edges. The computation of σ(v_k, v_i) is therefore equivalent to the computation of the s,t-connectedness probability. (We refer the reader to [17] for a more formal description of the reduction of σ(v_k, v_i) to the s,t-connectedness problem.) This equivalence implies that the computation of σ(v_k, v_i) is #P-Hard even for a single vertex v_k, and therefore that the computation of the p_ij value of any edge e_ij is also #P-Hard.
Theorem 1 states that it is unlikely that a polynomial-time algorithm exists that exactly computes the p_ij values of all edges in G. Therefore, we employ an efficient method that estimates the p_ij values of all edges in G at once. These estimations can be made within a desired level of accuracy and confidence, but there is a trade-off between the runtime and the estimation accuracy of the proposed approach. On the other hand, the quality of the results produced by the overall solution is expected to increase with increasing accuracy of the p_ij values.
The proposed estimation technique employs a sampling approach that starts with generating a certain number of random propagation trees. Recall that a random propagation tree is generated by first drawing a directed graph g ∼ G and then computing a propagation tree I_g(v) on g for a randomly selected vertex v ∈ V. Let I be the set of all random propagation trees generated for estimation and let N be the size of this set (i.e., N = |I|). After forming the set I, the value p_ij of an edge e_ij can be estimated by the frequency of that edge's appearance in the random propagation trees in I as follows. Let the function F_I(e_ij) denote the number of random propagation trees in I that contain the edge e_ij. That is,

F_I(e_ij) = |{I_g(v) ∈ I : e_ij ∈ I_g(v)}|.    (10)

Due to the definition of p_ij, the appearance of edge e_ij in a random propagation tree I_g(v) ∈ I can be considered a Bernoulli trial with success probability p_ij. Hence, the function F_I(e_ij) can be considered the number of successes in N Bernoulli trials with success probability p_ij, which implies that F_I(e_ij) is binomially distributed with parameters N and p_ij (i.e., F_I(e_ij) ∼ Binomial(N, p_ij)). Therefore, the expected value of F_I(e_ij) is equal to p_ij · N, which implies that

E[F_I(e_ij)/N] = p_ij.    (11)

As a result of Eq. (11), if an adequate number of random propagation trees are generated to form the set I, the value F_I(e_ij)/N is an unbiased estimate of p_ij. Therefore, the estimation method consists of generating the N random propagation trees that form the set I, computing the function F_I(e_ij) according to Eq. (10) for each edge e_ij ∈ E, and using F_I(e_ij)/N as the estimate of p_ij.

Implementation of the estimation method
We seek an efficient implementation of the proposed estimation method. Its main computation consists of generating N random propagation trees. A random propagation tree can be generated efficiently by performing a randomized BFS in G, starting from a randomly chosen vertex. It is important to note that the randomized BFS algorithm starting from a vertex v is equivalent to drawing a graph g ∼ G and performing a BFS starting from vertex v on g. That is, the randomized BFS algorithm is equivalent to the method introduced in Sect. 3 for generating a propagation tree I_g(v) rooted at v. Therefore, forming the set I can be accomplished by performing N randomized BFS executions on G starting from randomly chosen vertices. Moreover, the function F_I(·) can be computed for all edges in E while forming the set I, with a slight modification to the randomized BFS algorithm: a counter is kept for each edge e_ij ∈ E, and its value is incremented each time the corresponding edge is traversed during a randomized BFS execution. This counter denotes the number of times an edge is traversed over all randomized BFS executions. Therefore, after N randomized BFS executions, F_I(e_ij) for an edge e_ij is equal to the value of the counter maintained for that edge.
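The randomized BFS with edge counters can be sketched as follows (a minimal illustration, not the authors' code: the graph is assumed to be stored as a dict mapping each vertex to a list of (neighbor, w_uv) pairs, and the function names are ours):

```python
import random
from collections import defaultdict, deque

def randomized_bfs(adj, root, rng):
    """One IC-model propagation tree: traverse from `root`, realizing each
    directed edge (u, v) with its propagation probability w_uv when it is
    first examined. Returns the set of tree edges."""
    visited = {root}
    queue = deque([root])
    tree_edges = set()
    while queue:
        u = queue.popleft()
        for v, w_uv in adj.get(u, []):
            if v not in visited and rng.random() < w_uv:
                visited.add(v)
                tree_edges.add((u, v))
                queue.append(v)
    return tree_edges

def estimate_edge_probabilities(adj, vertices, n_samples, seed=0):
    """Estimate p_ij = F_I(e_ij) / N by counting edge appearances over
    N randomized BFS runs from uniformly chosen roots. Edges that never
    appear are omitted (their estimate is 0)."""
    rng = random.Random(seed)
    counts = defaultdict(int)
    for _ in range(n_samples):
        root = rng.choice(vertices)
        for edge in randomized_bfs(adj, root, rng):
            counts[edge] += 1
    return {e: c / n_samples for e, c in counts.items()}
```

On a toy graph with a single deterministic edge, e.g. `estimate_edge_probabilities({0: [(1, 1.0)]}, [0], 100)`, the estimator recovers the exact probability.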

Algorithm
The overall cascade-aware graph partitioning algorithm is described in Algorithm 1. In line 1, the set I is formed by performing N randomized BFS executions, during which the function F_I(e_ij) is computed for each edge e_ij ∈ E. In lines 2 and 3, an undirected graph G′ = (V, E′) is built by composing a new set E′ of undirected edges, where each undirected edge e_ij ∈ E′ is associated with a cost c_ij computed from the estimations obtained in the first step. In line 4, each vertex v_i ∈ V is assigned a weight w_i = 1 in order to ensure that the weight of a part equals the number of vertices assigned to that part. Lastly, a K-way partition Π of the undirected graph G′ is obtained using an existing graph partitioning algorithm and returned as a solution for the original problem. Here, the graph partitioning algorithm is executed with the same imbalance ratio as the original problem.
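The construction of the weighted undirected graph G′ (lines 2-4 of Algorithm 1) can be sketched as below. This is a hypothetical illustration: we assume the cost c_ij simply combines the two directed estimates p̂_ij and p̂_ji by summation (Algorithm 1's exact combination rule is given in the paper), and the resulting weighted graph would then be handed to a partitioner such as Metis.

```python
def build_undirected_costs(edge_probs):
    """Fold directed estimates p_ij into undirected edge costs.
    Assumption: c_ij = p_ij + p_ji (a plausible choice for illustration).
    `edge_probs` maps directed edges (i, j) to estimated probabilities."""
    costs = {}
    for (i, j), p in edge_probs.items():
        key = (min(i, j), max(i, j))        # canonical undirected edge
        costs[key] = costs.get(key, 0.0) + p
    return costs

def unit_vertex_weights(vertices):
    """Line 4 of Algorithm 1: every vertex gets weight 1, so part weight
    equals the number of vertices assigned to the part."""
    return {v: 1 for v in vertices}
```

The dictionaries returned here correspond to the edge costs and vertex weights that a K-way partitioner would consume.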

Determining the size of set I
As mentioned earlier, accurate estimation of the p_ij values is a crucial step in computing "good" solutions for the proposed problem, since the graph partitioning algorithm used in the second step makes use of these p_ij values to compute the costs of the edges in G′. The total cost of the cut edges in G′ represents the value of the objective function in Eq. (4). Therefore, the p_ij values need to be estimated accurately so that the graph partitioning algorithm correctly optimizes the objective function.
Estimation accuracy of the p_ij values depends on the number of random propagation trees forming the set I. As the size of the set I increases, more accurate estimations can be obtained. However, we want to compute the minimum value of N that attains a specific accuracy within a specific confidence interval. More formally, let p̂_ij be the estimate computed for the p_ij value of an edge e_ij ∈ E (i.e., p̂_ij = F_I(e_ij)/N); we want to compute the minimum value of N that achieves the following inequality:

    Pr[ |p̂_ij − p_ij| ≤ θ for all e_ij ∈ E ] ≥ 1 − δ    (14)

That is, with a probability of at least 1 − δ, we want the estimate p̂_ij to be within θ of p_ij for each edge e_ij ∈ E. For that purpose, we make use of the well-known Chernoff [42] and Union bounds from probability theory. The Chernoff bound gives an upper bound on the probability that a sum of many independent random variables deviates by a certain amount from its expected mean. In this regard, since F_I(·) is Binomial, the Chernoff bound guarantees the inequality

    Pr[ |F_I(e_ij) − p_ij·N| ≥ ξ·p_ij·N ] ≤ 2·exp(−(ξ²/(2+ξ))·p_ij·N)    (15)

for each edge e_ij ∈ E. Here, ξ denotes the relative deviation from the expected mean in the context of the Chernoff bound. In Eq. (15), dividing both sides of the inequality |F_I(e_ij) − p_ij·N| ≥ ξ·p_ij·N inside Pr[·] by N and taking ξ = θ/p_ij yields

    Pr[ |p̂_ij − p_ij| ≥ θ ] ≤ 2·exp(−(θ²/(2·p_ij+θ))·N) ≤ 2·exp(−(θ²/(2+θ))·N)    (16)

which is an upper bound on the probability that the accuracy θ is not achieved for a single edge e_ij (the last inequality in Eq. (16) follows since p_ij ≤ 1). Moreover, the RHS of Eq. (16) is independent of the value of p_ij and is the same for all edges in E, which enables us to apply the same bound to all of them. However, our objective is to find the minimum value of N that achieves accuracy θ for all edges simultaneously with a probability of at least 1 − δ.
For that purpose, we need an upper bound on the probability that there exists at least one edge in E for which the accuracy θ is not achieved. We can compute this upper bound using the Union bound as follows:

    Pr[ ∃ e_ij ∈ E : |p̂_ij − p_ij| ≥ θ ] ≤ 2·|E|·exp(−(θ²/(2+θ))·N)    (17)

Here, we simply multiply the RHS of Eq. (16) by |E|, since for each edge in E, the accuracy θ is not achieved with a probability of at most 2·exp(−(θ²/(2+θ))·N). In order to achieve Eq. (14), the RHS of Eq. (17) needs to be at most δ. That is,

    2·|E|·exp(−(θ²/(2+θ))·N) ≤ δ    (18)

Solving this inequality for N yields

    N ≥ ((2+θ)/θ²)·ln(2·|E|/δ)    (19)

which indicates the minimum value of N that achieves θ accuracy for all edges in E with a probability of at least 1 − δ.
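Eq. (19) translates directly into code; the small helper below (illustrative, with a name of our choosing) computes the minimum number of random propagation trees for given θ, δ and |E|:

```python
import math

def min_sample_size(theta, delta, num_edges):
    """Minimum N from Eq. (19): N >= (2 + theta) / theta^2 * ln(2|E| / delta).
    Rounded up, since N must be an integer number of sampled trees."""
    return math.ceil((2 + theta) / theta**2 * math.log(2 * num_edges / delta))
```

For example, tightening θ from 0.1 to 0.01 at fixed δ and |E| increases the required N by roughly two orders of magnitude, reflecting the 1/θ² factor.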
The accuracy θ determines how much error the graph partitioning algorithm makes while performing the optimization. As shown in Eq. (7), for a partition Π of G′ obtained by the graph partitioning algorithm, the cut size χ(Π) equals the value of the main objective function (4). However, the cost values associated with the edges of G′ are estimations of their exact values, and therefore the partition cost χ(Π) might differ from the exact value of the objective function. In this regard, the difference between the objective function and the partition cost can be bounded as follows:

    | E_{v,g∼G}[λ^Π_g(v)] − χ(Π) | ≤ θ·|E^Π_cut|    (20)

Here, the error bound is computed by multiplying the accuracy θ by the number of cut edges of G′ under the partition Π, since for each edge in E^Π_cut, at most θ error can be made with a probability of at least 1 − δ. Therefore, even if it were possible to solve the graph partitioning problem optimally, the solution returned by Algorithm 1 would be within θ·|E^Π_cut| of the optimal solution of the original problem with a probability of at least 1 − δ. Consequently, as the value of θ decreases, the partition obtained by Algorithm 1 incurs less error with respect to the main objective function, which enables the graph partitioning algorithm to perform a better optimization for the original problem.

Complexity analysis
The proposed algorithm consists of two main computational phases. In the first phase, for an accuracy θ with confidence δ, the set I is generated by performing at least N = ((2+θ)/θ²)·ln(2·|E|/δ) randomized BFS executions, each of which takes O(V + E) time. The second phase performs the partitioning of the undirected graph G′, which is constructed from the directed graph G using the F_I(e_ij) values computed in the first phase. The construction of G′ can be performed in Θ(V + E) time. The partitioning complexity of G′, however, depends on the partitioning tool used. In our implementation, we preferred Metis, which has a complexity of Θ(V + E + K·log K), where K is the number of parts. Therefore, if θ and δ are assumed to be constants, the overall complexity of Algorithm 1 to obtain a K-way partition can be formulated as

    O(N·(V + E) + K·log K), where N = ((2+θ)/θ²)·ln(2·|E|/δ)    (21)

Equation (21) denotes the serial execution complexity of Algorithm 1. The scalability of the proposed algorithm can be improved even further via parallel processing, since the estimation technique is embarrassingly parallel. Given P parallel processors, the N propagation trees in I can be computed without any communication or synchronization (i.e., each processor can generate N/P trees by separate BFS executions). The only synchronization point needed is the reduction of the F_I(e_ij) values computed by these processors. This reduction, however, can be performed efficiently in log P synchronization phases. Additionally, there exist parallel graph partitioning tools (e.g., ParMetis [43]) which can improve the scalability of the graph partitioning phase.

Extension to the LT model
Even though we have illustrated the problem and the solution for the IC model, both our problem definition and the proposed solution can easily be extended to other models such as the LT (linear threshold) model. It is worth mentioning that the proposed solution does not depend on the IC model or on the probability distribution defined over the edges (i.e., the w_ij probabilities). As long as random propagation trees can be generated, the proposed solution requires no modification to be used with a different cascade model or a different edge probability distribution.
We skip the description of the LT model and only provide the equivalent process of the LT model proposed in [28]. In this equivalent process, an unweighted directed graph g is generated from G by realizing at most one incoming edge of each vertex in V. That is, for each vertex v_i ∈ V, each incoming edge e_ji of v_i is selected with probability w_ji, and only the selected edge is realized in g. Given a directed graph g generated by this equivalent process, a propagation tree I_g(v) rooted at vertex v can again be computed by performing a BFS starting from v on g. Differently from the equivalent process of the IC model, there can be only one propagation tree rooted at each vertex, since every vertex has at most one incoming edge in g. However, a propagation tree I_g(v) under the LT model still encodes the same information as in the IC model; that is, each edge e_ij ∈ I_g(v) encodes the information that a content propagates from v_i to v_j.
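The LT equivalent process can be sketched as follows (a minimal illustration with names of our choosing, assuming incoming edges are stored as a dict mapping each vertex to a list of (in-neighbor, w_ji) pairs whose weights sum to at most one):

```python
import random
from collections import deque

def draw_lt_live_edge_graph(in_adj, rng):
    """LT equivalent process: each vertex realizes at most one incoming edge,
    edge e_ji being chosen with probability w_ji (with probability
    1 - sum(w), no incoming edge is realized). Returns a child -> parent map."""
    realized = {}
    for v, in_edges in in_adj.items():
        r = rng.random()
        acc = 0.0
        for u, w in in_edges:
            acc += w
            if r < acc:
                realized[v] = u
                break
    return realized

def lt_propagation_tree(realized, root):
    """BFS on the realized live-edge graph from `root`; since every vertex
    has at most one incoming edge, the reachable edges form a single tree."""
    out = {}
    for v, u in realized.items():
        out.setdefault(u, []).append(v)
    visited = {root}
    queue = deque([root])
    tree = set()
    while queue:
        u = queue.popleft()
        for v in out.get(u, []):
            if v not in visited:
                visited.add(v)
                tree.add((u, v))
                queue.append(v)
    return tree
```

With deterministic weights (w = 1.0 on a chain), the realized graph and the resulting tree are fixed regardless of the random seed.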
In the problem definition, we make use of propagation trees in such a way that the edges of a propagation tree crossing different parts are assumed to necessitate communication operations between servers. This assumption also holds for the LT model, since propagation trees generated by the equivalent processes of the IC and LT models encode the same information. Therefore, minimizing the expected number of communication operations during an LT propagation process starting from a randomly chosen user still corresponds to minimizing the expected number of cut edges in a random propagation tree. In this regard, no modification is needed for the objective function (4): we still want to compute a partition Π* that minimizes the expected number of cut edges in a random propagation tree. (The only difference is in the process of computing a random propagation tree under the LT model.)

In the solution part, we generate a certain number of random propagation trees in order to estimate a probability distribution defined over all edges in E. The estimated distribution associates each edge with a probability value denoting how likely the edge is to be included in a random propagation tree, and these probability values are later used as costs in the graph partitioning phase. However, neither the estimation method nor the overall solution depends on anything specific to the IC model; they only require a method for generating random propagation trees, which is described above for the LT model. Moreover, the concentration bounds attained for the estimation of the probability distribution still hold under the LT model, and the number of random propagation trees forming the set I in Algorithm 1 should again satisfy Eq. (19).

Processes starting from multiple users
The method proposed for propagation processes starting from a single user can be generalized to propagation processes starting from multiple users as follows. Instead of random propagation trees, we define a random propagation forest I_g(S) for a randomly selected subset of users S ⊆ V. The only difference between the two definitions is that a random propagation forest consists of multiple propagation trees rooted at the vertices in S. These propagation trees must be edge-disjoint, and if a vertex is reachable from two different vertices in S, it can be arbitrarily included in one of the propagation trees rooted at those vertices. As noted earlier, the IC model does not prescribe an order for activating inactive neighbors; therefore, a random propagation forest over the set S can be computed by first drawing a graph g ∼ G and then performing a multi-source BFS on g starting from the vertices in S. The order of execution of the multi-source BFS determines the form of the propagation trees in the propagation forest I_g(S).
In a partition Π, each cut edge of a propagation forest I_g(S) incurs one communication operation, so the total number of communication operations induced by Π is defined to be the number of cut edges, which we denote as λ^Π_g(S). These new definitions do not require any major modification of the optimization problem introduced in Eq. (4); we simply replace the expectation with E_{S,g∼G}[λ^Π_g(S)]. That is, our objective becomes computing a partition that minimizes the expected number of cut edges in a random propagation forest.
To generalize the proposed solution, we redefine the p_ij value of an edge e_ij as the probability that edge e_ij is included in a random propagation forest instead of a random propagation tree. With this new definition of the p_ij values, Eqs. (5) and (6) are still satisfied; hence, a partition Π* that minimizes the sum of the p_ij values of the edges crossing different parts also minimizes the expectation E_{S,g∼G}[λ^Π_g(S)]. The new definition of the p_ij values necessitates some modifications to the estimation method proposed earlier. Recall that, under the previous definition, we generate a set I of random propagation trees and compute the function F_I(·) for each edge e_ij. Under the new definition, the estimates can be obtained with a similar approach; however, the set I must now consist of random propagation forests, and F_I(·) must denote the frequency with which edges appear in these random propagation forests. Therefore, the only modification required in Algorithm 1 is in the step where the set I is generated by performing N randomized BFS executions: instead of randomized single-source BFS executions, we perform randomized multi-source BFS executions. The two BFS algorithms are essentially the same, except that the multi-source BFS starts its execution with its queue containing a randomly selected subset of vertices instead of a single vertex. These new definitions, together with the modifications to the overall solution, do not affect the concentration bounds obtained in Eq. (19).
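A randomized multi-source BFS differs from the single-source version only in its initial queue. A minimal sketch (illustrative, assuming the graph is a dict mapping each vertex to a list of (neighbor, w_uv) pairs):

```python
import random
from collections import deque

def propagation_forest(adj, sources, rng):
    """Multi-source randomized BFS under the IC model: all seed vertices
    start in the queue, so the realized edges form edge-disjoint trees
    rooted at the seeds, i.e. a propagation forest I_g(S)."""
    visited = set(sources)
    queue = deque(sources)
    forest = set()
    while queue:
        u = queue.popleft()
        for v, w_uv in adj.get(u, []):
            # A vertex reachable from two seeds is claimed by whichever
            # tree reaches it first, matching the arbitrary tie-breaking
            # allowed by the IC model.
            if v not in visited and rng.random() < w_uv:
                visited.add(v)
                forest.add((u, v))
                queue.append(v)
    return forest
```

Replacing the single-source BFS of Algorithm 1 with this routine (and drawing a random seed set per sample) yields the forest-based estimates of p_ij.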

Extensions and limitations
Here, we show how the proposed cascade-aware graph partitioning algorithm (CAP) can be incorporated into other graph partitioning objectives.

Non-cascading queries
Queries such as "reading-friend's-posts" and "read-all-posts-from-friends" can be observed more frequently than cascading (i.e., re-share) operations in a typical OSN application. Reducing the number of communication operations for such non-cascading queries may require minimizing the number of cut edges if the query workload changes frequently or is not available, or minimizing the total traffic crossing different parts if the workload can be estimated. The cascade-aware graph partitioning aims at removing from the cut the edges that have a high probability of being involved in a random propagation process under a specific cascade model. Assigning unit costs to all edges (i.e., c_ij = 1 for each edge e_ij) makes the objective the same as minimizing the number of cut edges. A combination of the objectives can be achieved by assigning each edge the cost c_ij = 1 + α(p_ij + p_ji), where α determines the relative weight of traffic/cascade-awareness.

Intra-propagation balancing among servers
This paper considers the number of nodes/users as the only balancing criterion for the proposed cascade-aware partitioning. On the other hand, the proposed formulation can be enhanced to handle balance on multiple workload metrics via multi-constraint graph partitioning. For example, a balanced distribution of the number of content propagation operations within servers can be attained via the following two-constraint formulation, in which we assign two weights to each vertex v_i:

    w⁽¹⁾_i = 1,  w⁽²⁾_i = Σ_{e_ji ∈ E} p_ji    (22)

Here, the summation in the second weight represents the sum of the p probabilities of the incoming edges of vertex v_i. Under this vertex weight definition, the two-constraint partitioning maintains balance on both the number of users assigned to servers and the number of intra-propagation operations performed within servers. The latter balancing holds because the expected number of propagations within a part V_k is

    Σ_{e_ji ∈ E(V_k)} p_ji = Σ_{v_i ∈ V_k} w⁽²⁾_i    (23)

where E(V_k) denotes the set of edges pointing toward the vertices in V_k.
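The two-constraint vertex weights can be computed directly from the estimated p values; a small illustrative helper (names are ours), returning for each vertex the pair (unit weight, sum of incoming p values):

```python
def two_constraint_weights(vertices, edge_probs):
    """Per-vertex weight vector for multi-constraint partitioning:
    first weight balances user counts, second weight (the sum of p_ji
    over incoming edges e_ji) balances intra-part propagation load."""
    in_sum = {v: 0.0 for v in vertices}
    for (_, j), p in edge_probs.items():
        in_sum[j] = in_sum.get(j, 0.0) + p
    return {v: (1.0, in_sum.get(v, 0.0)) for v in vertices}
```

A multi-constraint partitioner would then be asked to balance both components of this weight vector simultaneously.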

Repartitioning
As graph databases are usually dynamic (i.e., new vertices and edges are added or removed), repartitioning is necessary [1][2][3][21]. Repartitioning methods aim to maintain the quality of an initial partition by reassigning vertices to parts as the graph structure changes. However, the costs of new edges must be computed for repartitioning. That is, if a new directed edge is established in G, its p value needs to be computed before repartitioning. The p_ij value of a new edge e_ij can be computed using the p_ki value of each incoming edge e_ki of vertex v_i as follows:

    p_ij = w_ij·(1 − ∏_{e_ki ∈ E} (1 − p_ki))    (24)

That is, the content propagation probability w_ij is multiplied by the probability that at least one edge e_ki incoming to vertex v_i is activated during a random propagation process. It is important to note that establishing the new edge e_ij also affects the p_jk value of each outgoing edge e_jk of vertex v_j. If these values also need to be updated during repartitioning, Eq. (24) can be applied to each edge e_jk in succession, after updating the value of p_ij. In short, while moving vertices between parts during repartitioning, the p_ij value of any edge e_ij can be updated by applying Eq. (24) in the correct order. By updating the p_ij values on demand, existing repartitioning approaches can be adapted to the cascade-aware graph partitioning problem.
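The update of Eq. (24) can be sketched as follows (illustrative; as the text implies, the activations of the incoming edges are treated as independent, so the "at least one" probability is the complement of a product):

```python
def update_edge_probability(w_ij, incoming_p):
    """Eq. (24): p_ij = w_ij * (1 - prod_k (1 - p_ki)), i.e. the propagation
    probability w_ij times the probability that at least one incoming edge
    of v_i is activated (incoming activations assumed independent).
    `incoming_p` lists the p_ki values of v_i's incoming edges."""
    prob_none_active = 1.0
    for p in incoming_p:
        prob_none_active *= 1.0 - p
    return w_ij * (1.0 - prob_none_active)
```

For example, with w_ij = 0.5 and two incoming edges of probability 0.5 each, the updated value is 0.5 · (1 − 0.25) = 0.375.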

Replication
Replication strategies require some modifications in order to be used with cascade-aware graph partitioning. It should be noted that, even though the cut size of graph G′ can be reduced by replicating some vertices in multiple parts, this approach also incurs additional communication operations: when a replicated vertex becomes active during a content propagation process, the content needs to be transferred to every server on which the vertex is replicated.

Experimental evaluation
In this section, we experimentally evaluate the performance of the proposed solution on social network datasets. We develop an alternative solution, which produces competitive results, as a baseline algorithm in our experiments. The baseline algorithm directly makes use of the propagation probabilities between users (i.e., the w_ij values) in the partitioning phase. Additionally, we test various algorithms previously studied in the literature [10,13] and compare them with the proposed solution. Table 2 displays the properties of the real-world social networks used in our experiments. Many of these datasets have been used in the context of influence maximization research [34]. The first 13 datasets (Facebook through LiveJournal) are collected from the Stanford Large Network Dataset Collection [45], and they contain friendship, communication or citation relationships between users of various real-world social network applications. Twitter (large) is collected from [46], uk-2002 and webbase-2001 are collected from the Laboratory for Web Algorithmics [47], and sinaweibo is collected from the Network Repository [48]. Additionally, we make use of a synthetic graph, named random-social-network, which we generate using the graph500 [49] power-law random graph generator. The graph500 tool is initialized with two parameters, edge-factor and scale, to produce graphs with 2^scale vertices and edge-factor × 2^scale directed edges. We set both scale and edge-factor to 16 to produce the random-social-network dataset.

Datasets
All datasets are provided in the form of a graph, where users are represented by vertices and relationships by directed or undirected edges. To infer the direction of content propagation between users, we interpret these social networks as follows: for directed graphs, we assume that a propagation may occur only in the direction of a directed edge, whereas for undirected graphs, we assume that a propagation may occur in both directions along an undirected edge. Therefore, we do not modify the directed graphs, whereas we modify the undirected graphs by replacing each undirected edge with two oppositely directed edges.
The datasets in Table 2 do not provide information about the content propagation probabilities between users. Therefore, for each dataset, we draw values uniformly at random from the interval [0, 1] and associate these values, as propagation probabilities, with the edges connecting pairs of users. We repeat this process five times for each dataset, obtaining five versions of the same social network that differ in the propagation probabilities associated with their edges. We perform the same set of experiments on each version and report the averages of the results obtained for that dataset.
Given an underlying social network with its associated propagation probabilities, our aim is to find a user partition that minimizes the expected number of communication operations during a random propagation process under a specific cascade model. There have been effective approaches in the literature for learning the propagation probabilities between users in a social network [25,26]. Inferring these probability values from logs of user interactions is out of the scope of this paper. However, we also work on a real-world dataset, from which real propagation traces can be deduced, to test the proposed solution.

Baseline partitioning (BLP) algorithm
One can partition the input graph in such a way that the edges with high propagation probabilities are removed from the cut as much as possible. To achieve this, the sum of the propagation probabilities of the cut edges can be taken as the objective function to be minimized in the graph partitioning problem. The baseline algorithm also builds an undirected graph from a given social network and makes use of a graph partitioning tool. However, instead of computing a new probability distribution over all edges (i.e., the p_ij values), the baseline algorithm directly uses the propagation probabilities associated with the edges (i.e., the w_ij values). That is, the cost c_ij of an undirected edge e_ij of G′ is determined using the w_ij and w_ji values instead of the p_ij and p_ji values of edges e_ij and e_ji, respectively. In this way, the graph partitioner minimizes the sum of the propagation probabilities associated with the edges crossing different parts. Hence, the only difference between the baseline algorithm and the proposed solution is the cost values associated with the edges of the undirected graph provided to the graph partitioner.

Other tested algorithms
In our experiments, we also test three previously studied social network partitioning algorithms for comparison purposes. The first of these algorithms (CUT), given in [10], aims to minimize the number of links crossing different parts (i.e., it minimizes the number of cut edges). The second algorithm (MO+) [10] makes use of a community detection algorithm and performs partitioning based on the community structures inherent in social networks.
As the third algorithm, we consider the social network partitioning algorithm provided in [13], which partitions the social graph in such a way that the two-hop neighborhood of a user, rather than the one-hop neighborhood, is kept in one partition. For that purpose, an activity prediction graph (APG) is built, and its edges are associated with weights computed from the number of messages exchanged between users in a time period. Since the w_ij values cannot be directly interpreted as numbers of exchanged messages, we make use of the F_I(e_ij) values computed by the CAP algorithm; that is, we designate the number of messages exchanged between users in a time period as F_I(e_ij). Additionally, to compute the edge weights, the algorithm uses two parameters: the total number of past periods considered and a scaling constant (referred to as K and C in [13]). We set both parameters to one, since we cannot partition the F_I(e_ij) values into time periods. Using these values, we construct the same APG graph and partition it. We refer to this algorithm as 2Hop in our experiments.

Content propagations
To evaluate the quality of the partitions obtained by the tested algorithms, we performed a large number of experiments based on both real and IC-based propagation traces on real-world social networks. We generated the IC-based propagation data as follows: first, we select a random subset of users and then execute an IC model propagation process starting from the users in this set. The size of the set is chosen uniformly at random from the interval [1, 50]. During this propagation process, we count the total number of propagation events that occur between users located in different parts. As mentioned earlier, such propagation events cause communication operations between servers according to our problem definition. For each dataset, we perform 10^5 such experiments and compute the average of the total number of communication operations performed under a given partition. This average is an estimate of the expected number of communication operations during a random propagation process.
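One such simulation experiment can be sketched as follows (illustrative, with names of our choosing; the graph is a dict of (neighbor, w_uv) adjacency lists and part assignments are given as a dict from vertex to part id):

```python
import random
from collections import deque

def count_cross_part_propagations(adj, part, seeds, rng):
    """Run one IC cascade from `seeds` and count the propagation events
    that cross parts, i.e. realized edges (u, v) with part[u] != part[v].
    Each such event corresponds to one inter-server communication."""
    visited = set(seeds)
    queue = deque(seeds)
    crossings = 0
    while queue:
        u = queue.popleft()
        for v, w_uv in adj.get(u, []):
            if v not in visited and rng.random() < w_uv:
                visited.add(v)
                if part[u] != part[v]:
                    crossings += 1
                queue.append(v)
    return crossings
```

Averaging this count over many runs with randomly drawn seed sets estimates the expected number of communication operations under a given partition.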

Partitioning framework
The graphs generated by all algorithms except MO+ are partitioned using the state-of-the-art multi-level graph partitioning tool Metis [6] with the following set of parameters: the partitioning method is multi-level K-way partitioning, the objective is edge-cut minimization, and the maximum allowed imbalance ratio is 0.10. All other parameters are set to their default values. We implemented the MO+ algorithm using the community detection algorithm provided in [50] with its default parameters.
In order to observe the variation of the relative performance of the algorithms, each graph instance is partitioned K-way for K = 32, 64, 128, 256, 512 and 1024. In order to observe the performance gain achieved by intelligent partitioning algorithms, all graph instances are also partitioned randomly, which we refer to as the random partitioning (RP) algorithm.

Fig. 2 The geometric means of the communication operation counts incurred by the partitions obtained by BLP, CUT [10], CAP, MO+ [10] and 2Hop [13], normalized with respect to those by RP

Figure 2 compares the performance of the proposed CAP algorithm against the existing algorithms 2Hop, MO+ and CUT, as well as BLP. The figure displays the geometric means of the ratios of the communication operation counts incurred by the partitions obtained by CAP, BLP, CUT, MO+ and 2Hop to those by RP, for each K value. We run the CAP algorithm with accuracy θ = 0.01 and confidence δ = 0.05. As seen in the figure, BLP performs much better than both 2Hop and MO+, and slightly better (6%-9% on average) than CUT. These results justify the use of BLP as a baseline algorithm for testing the validity of the proposed CAP algorithm. As also seen in the figure, the proposed CAP algorithm performs significantly better than all other algorithms.

Table 3 compares the performance of the proposed CAP algorithm against BLP and RP on each graph for each K value, in terms of the average number of communication operations during IC model propagation simulations. Here, the partitioning of a graph for each different K value constitutes a partitioning instance. For each K value, the last column, entitled "%imp", displays the percent improvement of CAP over BLP for each dataset in terms of the number of communication operations.
For each K value, the last row entitled "norm avgs wrto RP" displays the geometric means of the ratios of the communication operation counts incurred by the partitions obtained by BLP and CAP to those by RP. The table also contains a "cut" column which displays the ratio of the number of cut edges to the total number of edges for each partitioning instance.

Experimental results
As seen in Table 3, BLP performs significantly better than RP in all partitioning instances. This is because BLP successfully reduces the sum of the propagation probabilities of the cut edges and thereby reduces the chances of propagation events occurring between different parts. On average, the partitions obtained by BLP incur 4.76×, 3.84×, 3.57×, 3.22×, 2.77× and 2.63× fewer communication operations than RP for K = 32, 64, 128, 256, 512 and 1024 servers, respectively. The decrease in the performance gap between BLP and RP with increasing K can be attributed to the performance degradation of the graph partitioning tool for high K values. In particular, whenever the average number of vertices assigned to a part (i.e., |V|/K) decreases below a certain threshold (e.g., for K = 1024 and K = 512 on the Facebook and wiki-Vote datasets), the improvement achieved by Metis significantly degrades, as can be seen from Table 3. However, for the web graphs (e.g., uk-2002 and webbase-2001), Metis produces significantly better partitions, with cut ratios below 0.1 (i.e., the structure of a graph also affects the quality of the partitions produced by Metis). As a result, the number of inter-partition communication operations is significantly lower for these graphs than for the others.
As seen in Table 3, CAP performs significantly better than BLP in all partitioning instances. A close inspection of the cut ratio values shows that the partitions obtained by CAP leave more edges in the cut (i.e., higher cut ratios), yet these partitions incur fewer communication operations. On average, CAP achieves 25.16%, 31.82%, 32.04%, 29.97% and 27.36% fewer communication operations than BLP for K = 32, 64, 128, 256, 512 and 1024 servers, respectively. In particular, the best improvement is obtained on the email-EuAll social network for K = 64, where the partitions obtained by CAP incur 88% fewer communication operations than those obtained by BLP. In this partitioning instance, CAP also achieves a cut ratio of 0.35, which is significantly lower than BLP's 0.75. However, as K increases, the improvement of CAP over BLP decreases for some social networks, especially for wiki-Talk and wiki-Vote, where the 19.11% and 19.70% improvements of CAP over BLP for K = 32, respectively, decrease to 1.11% and 1.27% for K = 1024. This can be attributed to Eq. (20): as K increases, the number of cut edges is also expected to increase, and as shown in Eq. (20), the upper bound on the error made by the CAP algorithm is proportional to the number of cut edges. Indeed, the performance improvement of CAP over BLP is observed to be lowest on the partitioning instances for which CAP incurs the highest cut ratios. For instance, on the Facebook, wiki-Talk and wiki-Vote datasets for K = 1024, the partitions generated by CAP have cut ratios of 0.97, 0.97 and 0.92, respectively.
The performance decrease of CAP can be alleviated by making more accurate estimations of the p_ij values, that is, by decreasing the value of θ. The cut ratio, however, depends on the performance of the graph partitioning tool, the dataset characteristics and the imbalance ratios used during partitioning. To obtain better cut ratios, the imbalance constraint can be relaxed to higher values (e.g., we used an imbalance ratio of 0.1 in our experiments). To observe how the improvement of the CAP algorithm changes with respect to the cut ratio, we perform the same set of experiments on random-social-network. As seen in Table 4, the partitions obtained by the CAP algorithm cause 43% fewer communication operations for K = 32, even though the fraction of cut edges is 6% higher than that of BLP. As noted in the previous experimental results, the improvement of CAP over BLP decreases as K and the cut ratio increase: the percent improvement of CAP over BLP drops from 43% to 30% as the fraction of cut edges increases from 0.91 (K = 32) to 0.95 (K = 1024).

Sensitivity analysis
We performed experiments to see how the accuracy parameter θ affects the performance of the CAP algorithm. For different values of θ and K, we compare the performance of CAP against RP on random-social-network. In Fig. 3, we set the size of the sample set I to |I| = 10, 10^2, 10^3, 10^4 and 10^5. Experiments are performed with K-way partitions for K = 32, 64, 128 and 256. We plot the percent improvement of CAP over RP on the y-axis. The accuracy values, computed for confidence δ = 0.05, are displayed on the right side of the figure. As seen in the figure, with increasing size of the set I, the value of θ decreases exponentially and the improvement of CAP increases logarithmically. Additionally, as also observed earlier, the relative performance of CAP decreases with increasing K. The best performance improvement is obtained for K = 32, where CAP performs 2x better than RP. These results can be attributed to Eq. (20): for higher values of K, both the cut ratio and the error made by the CAP algorithm increase. As the accuracy improves, however, the error made by CAP decreases and the overall optimization quality improves.
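The inverse relationship between the sample count |I| and the accuracy θ can be sketched with a standard Hoeffding-style concentration bound. The exact constants of the bound used in the paper are not reproduced here; the formula below is an illustrative assumption, and the function name is ours.

```python
import math

def hoeffding_accuracy(num_samples, delta=0.05):
    """Accuracy theta such that each estimated probability lies within
    theta of its true value with confidence 1 - delta, under a
    Hoeffding-style bound (illustrative; constants may differ)."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * num_samples))

# theta shrinks roughly as 1/sqrt(|I|) as the sample set grows:
for size in (10, 10**2, 10**3, 10**4, 10**5):
    print(size, round(hoeffding_accuracy(size), 4))
```

Under such a bound, increasing |I| by a factor of 100 improves θ only by a factor of 10, which is consistent with the logarithmic-looking improvement curve observed in Fig. 3.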

Relationship with minimizing cut edges
We also performed experiments to observe how the cascade-based estimation of traffic relates to the objective of minimizing the number of cut edges. Previously, we asserted that the two objectives can be encoded in the same cut-size definition by assigning each edge a cost that combines both objectives (i.e., c_ij = 1 + α(p_ij + p_ji)). The parameter α controls how strongly the cascade-based traffic estimate is taken into account. Figure 4 displays the average number of communication operations and the ratio of cut edges obtained by CAP on the HepPh dataset for K = 32 parts/servers while varying α (i.e., the F_I(e_ij) value of each edge is multiplied by α). As seen in the figure, with increasing α, the average number of communication operations decreases, whereas the ratio of cut edges increases. On the other hand, the increase in the cut size slows down after α = 10^-2 and the cut ratio reaches at most 0.44, since the cut size itself also affects the average number of communication operations. Note that for the smallest value of α, CAP becomes almost equivalent to CUT, producing partitions that are equally good in terms of the number of cut edges (the black dashed curve denotes the cut value obtained by the CUT algorithm). If the query workload is dominated by non-cascading queries and contains comparably few cascades, then α can be set to smaller values, and vice versa.
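The combined edge cost c_ij = 1 + α(p_ij + p_ji) above can be computed as follows; the function and variable names are illustrative, not taken from the paper's implementation.

```python
def combined_edge_costs(edges, prob, alpha):
    """Blend the two objectives: a unit cost per edge (the min-cut term)
    plus alpha times the estimated propagation probabilities in both
    directions, i.e., c_ij = 1 + alpha * (p_ij + p_ji).

    edges: iterable of (i, j) pairs of the undirected graph
    prob:  dict mapping directed pairs (i, j) -> estimated p_ij
    alpha: weight of the cascade-based traffic estimate
    """
    return {
        (i, j): 1.0 + alpha * (prob.get((i, j), 0.0) + prob.get((j, i), 0.0))
        for (i, j) in edges
    }

# With alpha close to 0 the costs reduce to unit weights (the CUT
# objective); larger alpha emphasizes the cascade-based estimate.
costs = combined_edge_costs([(0, 1)], {(0, 1): 0.3, (1, 0): 0.1}, alpha=10.0)
```

These costs would then be passed as edge weights to the graph partitioning tool in place of unit weights.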

Running times
We

Experiments on Digg social network with real propagation traces
In this section, we use actual propagation traces collected from the Digg social news site [18]. In Digg, users share news stories and vote for them, and they follow the activities of their friends via their news feeds, where the stories their friends shared or voted for are displayed. With these properties of the Digg social network, a story can propagate through users once it is shared or voted for. The propagation of news stories can thus be considered as the propagation of contents in our problem definition.
The Digg dataset contains a directed graph G = (V, E) representing the underlying social network, which consists of 71,367 users and 1,731,658 friendship links. Since friendships are formed as one-way relationships, they are represented by directed edges. Each directed edge e_ij ∈ E means that user v_j is following the activities of user v_i; therefore, content propagation can occur in the direction of v_i to v_j. Additionally, the dataset contains a log L of past activities of users over a set N of news stories. Each entry (v_i, n_k, t_i) ∈ L means that user v_i ∈ V has voted for news story n_k ∈ N at time t_i. The dataset contains 3,018,197 votes made on 3,553 news stories (i.e., |L| = 3,018,197 and |N| = 3,553).
In order to deduce the content propagation traces from the log L, we follow the approach proposed in [39]. In this approach, if user v_i votes for the news story n_k, then v_i is assumed to have been influenced by one of its friends that voted for the same story before. However, for v_i to be influenced by a friend, the difference between their voting times should be within a time window t_Δ. Let P_i^k denote the set of users that potentially influence user v_i in voting for news story n_k:

P_i^k = { v_j | e_ji ∈ E, (v_j, n_k, t_j) ∈ L, 0 ≤ t_i − t_j ≤ t_Δ }

In our experiments, we set the time window t_Δ to one month, following the approach in [39]. The sets P_i^k induce a subgraph g_k = (V, E_k) of G, where the potential influencers of each user are denoted by the directed edges in E_k:

E_k = { e_ji ∈ E | v_j ∈ P_i^k }

The subgraph g_k is reminiscent of a directed graph g ∼ G, where each directed edge e_ij is associated with a propagation probability w_ij and g is generated by the equivalent process of the IC model described in Sect. 3. Note that g_k is a subgraph in which each user may have multiple potential influencers, one of which can be arbitrarily selected to generate a propagation tree/forest. Therefore, we generate a propagation forest for the news story n_k on g_k as follows. Let I_k(S) denote a propagation forest on g_k whose propagation trees are rooted at the vertices in S, where S is the set of vertices with no incoming edges (i.e., users that have no potential influencers). The propagation forest I_k(S) can be computed by performing a multi-source BFS starting from the vertices in S on g_k, as if a random propagation tree were built from a g ∼ G.
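The construction of the edge set E_k from the vote log of a single story can be sketched as follows. This is a minimal illustration with names of our own choosing (not the paper's code); for clarity it compares each voter against all earlier voters, which is quadratic in the number of votes per story.

```python
def influencer_graph(followers, votes, t_delta):
    """Build the directed edge set E_k of the subgraph g_k for one story.

    followers: dict v_i -> set of users following v_i (an edge e_ij means
               v_j follows v_i, so content can flow from v_i to v_j)
    votes:     list of (user, time) pairs for story n_k, in any order
    t_delta:   time window within which influence is assumed possible
    Returns E_k = {(v_j, v_i) : v_j is a potential influencer of v_i}.
    """
    votes = sorted(votes, key=lambda x: x[1])  # increasing timestamps
    vote_time = {}                             # earlier voters seen so far
    e_k = set()
    for v_i, t_i in votes:
        # v_j potentially influenced v_i if v_i follows v_j, v_j voted
        # earlier, and the gap lies within the time window t_delta
        for v_j, t_j in vote_time.items():
            if v_i in followers.get(v_j, set()) and 0 <= t_i - t_j <= t_delta:
                e_k.add((v_j, v_i))
        vote_time[v_i] = t_i
    return e_k
```

For example, if b follows a, a votes at time 0 and b at time 5, then with t_delta = 10 the edge (a, b) is added, while with t_delta = 3 it is not.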
It is important to note that multiple propagation forests can be generated depending on the execution of the multi-source BFS on g_k. The edges in the propagation forest I_k(S) still encode the propagation traces through users. We generate propagation trees/forests for all news stories in the log L and use them instead of performing IC model propagation simulations. Algorithm 2 presents the computations we performed on the log L to deduce the content propagation traces.

Algorithm 2 Generating propagation trees/forests from logs of past propagation traces
Input: G = (V, E), L, t_Δ
Output: I
 1: Partition the log L based on news stories and obtain L_k for each story n_k ∈ N
 2: Initialize an empty set I of propagation forests
 3: for each L_k do
 4:   Sort the entries (v_i, n_k, t_i) ∈ L_k according to their timestamps t_i in increasing order
 5:   Initialize a directed graph g_k = (V, E_k), where E_k = ∅
 6:   for each entry (v_i, n_k, t_i) ∈ L_k do
 7:     Mark v_i as activated
 8:     for each (v_j, v_i) ∈ E do
 9:       if v_j is activated and t_i − t_j ≤ t_Δ then
10:         E_k = E_k ∪ {(v_j, v_i)}
11:   Initialize the set S = {v_i | in-degree of v_i = 0 in g_k}
12:   Perform a multi-source BFS on g_k starting from the vertices in S and generate a propagation forest I_k
13:   I = I ∪ {I_k}
14: return I
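The multi-source BFS step of Algorithm 2 (lines 11 and 12) can be sketched as follows, assuming E_k has already been built; the function name and representation are illustrative.

```python
from collections import deque, defaultdict

def propagation_forest(e_k):
    """Generate one propagation forest from the directed edge set E_k by
    a multi-source BFS rooted at the vertices with no incoming edges
    (i.e., users with no potential influencers for this story).

    e_k: set of directed edges (v_j, v_i), meaning v_j may have
         influenced v_i. Returns the forest as a set of tree edges.
    """
    out_edges = defaultdict(list)
    in_degree = defaultdict(int)
    vertices = set()
    for v_j, v_i in e_k:
        out_edges[v_j].append(v_i)
        in_degree[v_i] += 1
        vertices.update((v_j, v_i))

    roots = [v for v in vertices if in_degree[v] == 0]  # the set S
    visited = set(roots)
    queue = deque(roots)
    forest = set()
    while queue:
        v = queue.popleft()
        for w in out_edges[v]:
            if w not in visited:  # keep exactly one influencer per user
                visited.add(w)
                forest.add((v, w))
                queue.append(w)
    return forest
```

Because each user keeps only the first influencer reached by the BFS, different traversal orders yield different (but equally valid) forests, matching the observation above that multiple propagation forests can be generated from the same g_k.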
After generating the propagation trees/forests for all news stories in the log L, we sample 90% of them to use in Algorithm 1. That is, instead of randomly generating propagation forests, we use the real propagations in the log L to compute the function F_I(·) and estimate the p_ij values of the edges in G. We use the remaining 10% of the propagation forests to test the quality of the partitions returned by Algorithm 1: if an edge of a propagation forest crosses different parts, we count that edge as one communication operation.
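The evaluation metric above, namely the average number of inter-part communication operations over the held-out forests, can be sketched as follows (the function name is ours):

```python
def avg_communication(forests, part):
    """Average number of inter-part communication operations per forest.

    Each forest edge whose endpoints are assigned to different parts
    counts as one communication operation, as in the test procedure.

    forests: list of propagation forests, each a set of edges (u, v)
    part:    dict mapping each user to its part/server id
    """
    total = sum(
        1
        for forest in forests
        for u, v in forest
        if part[u] != part[v]
    )
    return total / len(forests)
```

For instance, with part = {'a': 0, 'b': 0, 'c': 1}, the forest edge ('a', 'b') is free while ('b', 'c') incurs one communication operation.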
We compare the quality of the partitions produced by the CAP algorithm against those of a slightly modified version of the BLP algorithm presented previously. In the modified version of BLP, we associate a unit cost with each edge of the undirected graph produced from the input social graph. This modification causes BLP to consider only the friendship structure of the Digg social network and to produce partitions that minimize the number of friendship links crossing different parts. In this way, the BLP and CUT algorithms become equivalent.
In Table 5, we present the results of the experiments on the Digg social network. In addition to CAP and the modified version of BLP, we also include the results for the partitions generated by RP. For each of these partitions, we compute the average number of communication operations induced on the propagation trees sampled from the 10% test portion of the log L. As seen in Table 5, BLP performs much better than the 2Hop, CUT, MO+ and RP algorithms. For K = 32, the partition generated by BLP incurs approximately 2x fewer communication operations than RP. The performance improvement of BLP is smaller for higher values of K; for example, BLP performs 2 times better than RP for K = 1024. The CAP algorithm, on the other hand, consistently performs better than BLP for all values of K. In particular, for K = 32, the CAP algorithm incurs 60% fewer communication operations. However, as K increases from 32 to 1024, the overall improvement of CAP over BLP decreases to 13%. This is because the accuracy obtained from the 90% of the propagation trees/forests sampled from the log L remains constant as K increases, and therefore the error made by the CAP algorithm grows, as shown in Eq. (20). Additionally, the performance of the graph partitioning tool is expected to decrease for higher values of K, where the average number of vertices per part drops below 100 for K = 1024.
The results displayed in Table 5 illustrate the effectiveness of the CAP algorithm when actual propagation traces are used instead of IC model simulations.

Conclusion
We studied the problem of cascade-aware graph partitioning, where we seek a user-to-server assignment that minimizes the communication between servers/parts induced by content propagation processes.
We employed a sampling-based method to estimate a probability distribution that associates each edge of the graph with the probability of its being involved in a random propagation process. We use these estimates as part of the input to graph partitioning. The proposed solution works under various cascade models, provided that the parameters of these models are given beforehand. We also derived theoretical results that show how our solution achieves the stated objectives. To the best of our knowledge, this is the first work that incorporates models of graph cascades into a systems performance goal.
We performed experiments under the widely used IC model and evaluated the effectiveness of the proposed solution in terms of the partitioning objectives. We also applied the solution to real logs of propagation traces among users, in addition to using their social network structure. Experiments demonstrate the effectiveness of the proposed solution both in the presence and in the absence of actual propagation traces.