Accelerated butterfly counting with vertex priority on bipartite graphs

Bipartite graphs are of great importance in many real-world applications. The butterfly, a complete 2 × 2 biclique, plays a key role in bipartite graphs. In this paper, we investigate the problem of efficiently counting the number of butterflies. The most advanced techniques are based on enumerating wedges, which is the dominant cost of counting butterflies. Nevertheless, the existing algorithms cannot efficiently handle large-scale bipartite graphs; this becomes a bottleneck in large-scale applications. In this paper, instead of the existing layer-priority-based techniques, we propose a vertex-priority-based paradigm BFC-VP to enumerate much fewer wedges; this leads to a significant improvement over the time complexity of the state-of-the-art algorithms.
In addition, we present cache-aware strategies to further improve the time efficiency while theoretically retaining the time complexity of BFC-VP. We also show that our proposed techniques can work efficiently in external and parallel contexts. Moreover, we study the butterfly counting problem on batch-dynamic graphs. Specifically, given a bipartite graph G and a batch-update of edges B, we aim to maintain the number of butterflies in G. To tackle this problem, fast vertex-priority-based algorithms are proposed with optimizations for reducing the computation over existing wedges in G. Our extensive empirical studies demonstrate that the proposed techniques significantly outperform the baseline solutions on real datasets.

The bipartite clustering coefficient is a cohesiveness measurement of bipartite graphs. Given a bipartite graph G, its bipartite clustering coefficient equals 4 × ⧖_G / c_G, where ⧖_G denotes the total number of butterflies in G and c_G denotes the total number of caterpillars (i.e., three-paths) in G. For example, (u_0, v_0, u_1, v_1) in Fig. 1 is a three-path. A high bipartite clustering coefficient indicates localized closeness and redundancy in bipartite graphs [4,53]; for instance, in user-product networks, bipartite clustering coefficients are frequently used to analyze the sale status of products in different categories. These statistics can also be used in the Twitter network for internet advertising, where the Twitter network is a bipartite graph consisting of Twitter users and the URLs they mentioned in their postings. Since c_G can be easily computed in O(m) time, where m denotes the total number of edges in G [4], computing ⧖_G becomes the bottleneck in computing the clustering coefficient.
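The O(m) caterpillar count mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a dict-of-neighbor-lists graph representation and uses the standard identity that an edge (u, v) is the middle edge of exactly (deg(u) − 1)(deg(v) − 1) three-paths; the function names are our own.

```python
def caterpillar_count(adj):
    """Count caterpillars (three-paths) in O(m) time: each edge (u, v) is the
    middle edge of (deg(u) - 1) * (deg(v) - 1) three-paths.
    `adj` maps every vertex to a list of its neighbors."""
    deg = {u: len(adj[u]) for u in adj}
    total = 0
    for u in adj:
        for v in adj[u]:
            total += (deg[u] - 1) * (deg[v] - 1)
    return total // 2  # every edge is visited once from each endpoint


def bipartite_clustering_coefficient(adj, butterflies):
    """4 x (butterfly count) / (caterpillar count), per the formula above."""
    return 4 * butterflies / caterpillar_count(adj)
```

For example, the complete bipartite graph K_{2,2} has four caterpillars and one butterfly, so its coefficient is 1.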
Summarizing inter-corporate relations. In a director-board network, two directors sitting on the same two boards form a butterfly. These butterflies can reflect inter-corporate relations [48-50]. The number of butterflies indicates the extent to which directors re-meet one another on two or more boards. A large butterfly count indicates a large number of inter-corporate relations and formal alliances between companies [53].
Computing k-wing in bipartite graphs. Counting the number of butterflies for each edge also has applications. For example, it is the first step in computing a k-wing [56] (or k-bitruss [67,68,73]) for a given k, where a k-wing is the maximal subgraph of a bipartite graph with each edge contained in at least k butterflies. Discovering such dense subgraphs has proved useful in many applications, e.g., community detection [25,26,69,72], word-document clustering [21], and viral marketing [24,42,65,71]. Given a bipartite graph G, the proposed algorithms [56,67,68,73] for k-wing computation first count the number of butterflies on each edge in G.
After that, the edge with the lowest number of butterflies is iteratively removed from G until all the remaining edges appear in at least k butterflies.
Note that in real applications, butterfly counting may happen more than once in a graph; we may need to conduct such a computation against an arbitrarily specified subgraph. Indeed, there can be a high demand for butterfly counting in large networks. However, the existing solutions cannot efficiently handle large-scale bipartite graphs. As shown in [54], on the Tracker network with 10^8 edges, their algorithm needs about 9,000 s to compute the total butterfly count. Therefore, the study of efficient butterfly counting is imperative to support online large-scale data analysis. Moreover, some applications demand exact butterfly counting in bipartite graphs. For example, in k-wing computation, approximate counting does not make sense since the k-wing decomposition algorithm in [56] needs to iteratively remove the edges with the lowest number of butterflies; the number has to be exact.
Notably, dynamic graphs have attracted significant interest in recent research studies [11,15,28,44,46,61] since there can be a large number of constant updates on graphs in real applications. To enable computational sharing and increase system throughput (i.e., the number of updates processed per unit time) in practice, many existing studies consider processing the updates in a batch-dynamic way, which handles updates as a set of batches [1,8,23,27,43]. However, there is no systematic study of the butterfly counting problem on batch-dynamic graphs in the literature. To fill this research gap, we investigate how to design efficient parallel butterfly counting algorithms for batch-dynamic settings in this paper. Specifically, given a bipartite graph G and a batch-update of edges B, we aim to maintain the number of butterflies in G.
State-of-the-art. Note that there can be O(m^2) butterflies in the worst case. Wang et al. [64] propose an algorithm that avoids enumerating all the butterflies. It has two steps. In the first step, a layer is randomly selected. Then, the algorithm iteratively starts from every vertex u in the selected layer, computes the 2-hop reachable vertices from u, and for each 2-hop reachable vertex w, counts the number n_uw of times w is reached from u. In the second step, for each 2-hop reachable vertex w from u, the number of butterflies containing both u and w is counted as n_uw(n_uw − 1)/2. For example, regarding Fig. 1, if the lower layer is selected, starting from the vertex v_0, vertices v_1, v_2, and v_3 are 2-hop reached 3 times, 1 time, and 1 time, respectively. Thus, there are C(3, 2) = 3 butterflies containing v_0 and v_1 and no butterfly containing v_0 and v_2 (or v_0 and v_3). Iteratively, the algorithm first uses v_0 as the start-vertex, then v_1, and so on. Then, all the counts are added together; the sum divided by two is the total number of butterflies.
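The two steps above can be sketched as follows. This is an illustrative reimplementation of the idea in [64], not the authors' code: it assumes a dict-of-neighbor-lists representation, and `count_butterflies_layer` is our own name.

```python
from collections import defaultdict


def count_butterflies_layer(adj, layer):
    """Layer-based butterfly counting (sketch of [64]).
    adj: dict mapping every vertex to a list of its neighbors.
    layer: the vertices of the selected layer, used as start-vertices.
    Each butterfly is found from both of its same-layer vertices,
    so the accumulated count is halved at the end."""
    total = 0
    for u in layer:
        n_uw = defaultdict(int)          # 2-hop reach counts from u
        for v in adj[u]:                 # middle-vertex
            for w in adj[v]:             # end-vertex
                if w != u:
                    n_uw[w] += 1
        for c in n_uw.values():
            total += c * (c - 1) // 2    # butterflies containing u and w
    return total // 2                    # each butterfly counted twice
```

On K_{2,2} this returns 1, and on K_{2,3} it returns 3, matching C(3, 2).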
Observe that the time complexity of the algorithm in [64] is O(Σ_{u∈U(G)} deg_G(u)^2) if the lower layer L(G) is chosen to contain the start-vertices, where U(G) is the upper layer. Sanei-Mehri et al. [54] choose the layer S for which O(Σ_{v∈S} deg_G(v)^2) is minimized between the two layers.
1. Using high-degree vertices as middle-vertices may generate numerous wedges to be scanned. The existing techniques [54,64], including the layer-priority-based techniques [54], cannot avoid using unnecessary high-degree vertices as middle-vertices, as illustrated earlier. Therefore, it is a challenge to effectively handle high-degree vertices.
2. Effectively utilizing the CPU cache can often reduce the computation dramatically. Therefore, it is also a challenge to utilize the CPU cache to speed up the counting of butterflies.
3. On batch-dynamic graphs, we need to handle a batch of updates on the original graph, which can be very large. Thus, it is a challenge to explore possible sharing opportunities and reduce the computation that does not lead to any new butterflies.
Our approaches. To address Challenge 1, instead of the existing layer-priority-based algorithms, we propose a vertex-priority-based algorithm BFC-VP that can effectively handle hub vertices (i.e., high-degree vertices). To avoid over-counting or miss-counting, we propose that for each edge (u, v), the new algorithm BFC-VP uses the vertex with a higher degree as the start-vertex so that the vertex with a lower degree will be used as the middle-vertex. Specifically, the BFC-VP algorithm chooses one end vertex of an edge (u, v) as the start-vertex, say u, according to its priority. Note that the vertex priority is a total ordering of vertices, and we use degree-based priority in this paper: a higher degree indicates a higher priority, and ties are broken by vertex IDs. For example, regarding Fig. 2a, the BFC-VP algorithm chooses u_0 and v_1000 as start-vertices; consequently, only 2,000 wedges in total will be scanned by our algorithm, compared with 500,500 different wedges generated by the existing algorithms as illustrated earlier. Once all wedges from the start-vertex u are processed, BFC-VP moves on to the next start-vertex. This is the main idea of our BFC-VP algorithm.
As a result, the time complexity of our BFC-VP algorithm is O(Σ_{(u,v)∈E(G)} min{deg_G(u), deg_G(v)}), which is in general significantly lower than the time complexity of the state-of-the-art algorithm in [54] (i.e., O(min{Σ_{v∈U(G)} deg_G(v)^2, Σ_{v∈L(G)} deg_G(v)^2})). Note that the time complexity of BFC-VP is also bounded by O(α · m), where α is the arboricity of G [17].
In the BFC-VP algorithm, there are O(n) accesses of start-vertices because we explore every vertex as a start-vertex only once, O(m) accesses of middle-vertices, and O(Σ_{(u,v)∈E(G)} min{deg_G(u), deg_G(v)}) accesses of end-vertices in the processed wedges. Thus, the number of accesses to end-vertices is dominant. Given that the cache miss latency takes a big part of the memory access time [3], improving the CPU cache performance when accessing the end-vertices becomes a key issue. Our second algorithm, the cache-aware algorithm BFC-VP++, aims to improve the CPU cache performance of BFC-VP by using high-degree vertices as end-vertices to enhance locality while retaining the total number of accesses of end-vertices (and thus the time complexity of the BFC-VP algorithm). Consequently, BFC-VP++ requires the end-vertices to be prioritized in the same way as the start-vertices in the BFC-VP algorithm.
For example, considering the graph in Fig. 2b, priorities are assigned according to the vertex degrees, where p(v) denotes the priority of a vertex v. In this example, starting from v_0 to v_3 and going through u_0, BFC-VP needs to process 5 wedges using u_0 as the middle-vertex (e.g., (v_3, u_0, v_2)), and 3 vertices, v_1, v_2, and v_3, need to serve as end-vertices. Note that these are the only 5 wedges using u_0 as the middle-vertex since p(u_0) > p(v_2) > p(v_1). Regarding the same example, BFC-VP++ also needs to process exactly 5 wedges with u_0 as the middle-vertex; however, only 2 vertices, v_0 and v_3, serve as end-vertices.
We also propose a cache-aware reordering strategy that improves the cache performance by storing high-priority (more frequently accessed) end-vertices together to reduce cache misses [70]. Considering the example in Fig. 2b, BFC-VP++ will store v_0 and v_3 together after reordering.
To handle batch-dynamic graphs, we propose efficient algorithms that identify the affected scope (subgraph) of the batch-update. Then, we can also utilize our vertex-priority-based techniques to accelerate the computation process. In addition, we categorize the new butterflies into different cases and propose effective pruning techniques to further reduce the computation (of old wedges on the original graph) that does not lead to any new butterflies.

Contribution.
We summarize the principal contributions of this paper as follows.
- We propose a novel vertex-priority-based algorithm BFC-VP to count the butterflies that reduces the time complexities of the state-of-the-art algorithms significantly in both theory and practice.
- We propose a novel cache-aware butterfly counting algorithm BFC-VP++ by adopting cache-aware strategies in BFC-VP. Compared with BFC-VP, the BFC-VP++ algorithm achieves a better CPU cache performance.
- We can replace the exact counting algorithm in the approximate algorithm [54] with our exact counting algorithm for a speedup.
- We also present an external-memory algorithm and a parallel algorithm for butterfly counting.
- This is also the first work to study butterfly counting on batch-dynamic graphs. We propose a work-efficient parallel algorithm to solve the problem.
- We conduct comprehensive experimental evaluations on real datasets. The results show that our proposed algorithms BFC-VP and BFC-VP++ outperform the existing algorithms by up to two orders of magnitude. For instance, on the Tracker dataset, the BFC-VP++ algorithm can count 10^12 butterflies in 50 s, while the state-of-the-art butterfly counting algorithm [54] takes about 9,000 s. In addition, our advanced algorithm BFCB-IG+ for batch-dynamic butterfly counting is up to two orders of magnitude faster than the baseline algorithm.
Organization We organize the rest of the paper as follows. In Sect. 2, we present the preliminaries. Section 3 discusses the existing algorithms BFC-BS and BFC-IBS. We introduce the BFC-VP algorithm in Sect. 4. Section 5 explores cache-awareness. Section 6 extends our algorithms to count butterflies for each edge, to the parallel execution of our proposed algorithms, and to the external-memory setting. Section 7 presents our algorithms for batch-dynamic butterfly counting. Section 8 reports experimental results. Section 9 reviews the related work. Section 10 concludes the paper.

Preliminaries
We define our problem over a bipartite graph G(V = (U, L), E). Here, U(G) contains all the upper-layer vertices, L(G) contains all the lower-layer vertices, and E(G) is the edge set. Note that U(G) ∩ L(G) = ∅. An edge connecting two vertices u and v is represented as (u, v) (or (v, u)). N_G(u) is the neighbor set of a vertex u in G, and u's degree is represented as deg_G(u) = |N_G(u)|. The 2-hop neighbor set of u (i.e., the set of vertices exactly two edges away from u) is denoted as 2hop_G(u). Note that we assume each vertex in G has an ID, and the IDs of the vertices in U(G) are always higher than those of the vertices in L(G). m and n denote the number of edges and vertices in G, respectively. We use |G| to denote the size of G, where |G| = m + n.
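The 2-hop neighbor set 2hop_G(u) can be computed directly from the adjacency lists; a minimal sketch, assuming the same dict-of-neighbor-lists representation used throughout our examples (the function name is ours):

```python
def two_hop(adj, u):
    """2hop_G(u): vertices exactly two edges away from u. In a bipartite
    graph these are the same-layer vertices sharing a neighbor with u."""
    return {w for v in adj[u] for w in adj[v]} - {u}
```

For instance, in K_{2,2} the only 2-hop neighbor of v_0 is v_1.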
Definition 1 (Wedge) Consider a bipartite graph G, and three vertices u, v, w ∈ V (G). A wedge (u, v, w) is a path starting from u, going through v and ending at w. u, v, and w are called the start-, the middle-, and the end-vertex in the wedge (u, v, w), respectively.
Definition 2 (Butterfly) Consider a bipartite graph G and four vertices u, w ∈ U(G) and v, x ∈ L(G). A butterfly [u, v, w, x] is a complete bipartite subgraph (i.e., 2 × 2 biclique) induced by u, v, w, and x.
The total numbers of butterflies containing a vertex u and containing an edge e are denoted as ⧖_u and ⧖_e, respectively. In addition, the total number of butterflies in G is denoted as ⧖_G.
Problem statement Given a bipartite graph G, the butterfly counting problem is to compute ⧖_G, the total number of butterflies in G.

Existing algorithms
Here, we discuss two existing algorithms: the baseline butterfly counting algorithm BFC-BS [64] and the improved baseline butterfly counting algorithm BFC-IBS [54]. As discussed earlier, both algorithms are based on enumerating wedges. Lemma 1 [64] is key to both algorithms.

Lemma 1
In a bipartite graph G, the following equations hold:

⧖_u = Σ_{w ∈ 2hop_G(u)} C(|N_G(u) ∩ N_G(w)|, 2)   (1)
⧖_G = (1/2) Σ_{u ∈ U(G)} ⧖_u   (2)

In fact, BFC-IBS has the same framework as BFC-BS and improves BFC-BS in two aspects: (1) pre-choosing the layer of start-vertices to achieve a lower time complexity; (2) using a hash map to speed up the implementation. We show the details of BFC-IBS in Algorithm 1; its counting step is:

11: if wedge_cnt(w) > 1 then
12:     ⧖_G ← ⧖_G + C(wedge_cnt(w), 2)
13: return ⧖_G

Note that to avoid counting a butterfly twice, for each middle-vertex v ∈ N_G(u) and the corresponding end-vertex w ∈ N_G(v), BFC-IBS processes the wedge (u, v, w) only if w.id > u.id; consequently, in Algorithm 1 we do not need the factor 1/2 in Equation (2) of Lemma 1. Note that the BFC-BS algorithm has the time complexity O(Σ_{u∈U(G)} deg_G(u)^2) when the start-vertices are chosen from L(G), while BFC-IBS runs in O(min{Σ_{v∈U(G)} deg_G(v)^2, Σ_{v∈L(G)} deg_G(v)^2}) time.
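The BFC-IBS framework described above can be sketched as follows. This is a simplified illustration rather than Algorithm 1 itself: it assumes a dict-of-neighbor-lists graph whose vertex keys are directly comparable as IDs, and uses a plain dict in place of the paper's hash map.

```python
def bfc_ibs(adj, start_layer):
    """Sketch of BFC-IBS [54]: start-vertices come from the pre-chosen layer,
    wedge counts are kept in a hash map, and a wedge (u, v, w) is processed
    only when w's ID exceeds u's ID so each butterfly is counted once."""
    total = 0
    for u in start_layer:
        wedge_cnt = {}
        for v in adj[u]:                       # middle-vertex
            for w in adj[v]:                   # end-vertex
                if w > u:                      # process only if w.id > u.id
                    wedge_cnt[w] = wedge_cnt.get(w, 0) + 1
        for c in wedge_cnt.values():
            if c > 1:
                total += c * (c - 1) // 2      # C(wedge_cnt(w), 2)
    return total
```

On K_{2,2} with the lower layer chosen, this returns 1 without the factor 1/2, as the w.id > u.id rule already prevents double counting.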

Algorithm by vertex priority
In BFC-BS and BFC-IBS, the time complexity is related to the total number of 2-hop neighbors visited (i.e., the total number of wedges processed). When starting from one vertex layer, the number of processed wedges is determined by the sum of squared degrees of the middle-vertices in the other layer. If all the low-degree vertices lie in one vertex layer, BFC-IBS can simply start from the vertices in the other layer (using the low-degree vertices as middle-vertices) and obtain a much lower computation cost. However, when vertices with high degrees (i.e., hub vertices) exist in both layers, which is not uncommon in real datasets (e.g., the Tracker dataset), neither choice of start layer achieves a better performance. For example, consider the graph G with 2,002 vertices and 4,000 edges in Fig. 3, where u_0 and u_1 are connected with 1,000 vertices (v_0 to v_999), and v_1000 and v_1001 are also connected with 1,000 vertices (u_2 to u_1001). In this example, choosing either of the two layers still requires going through hub vertices as middle-vertices. However, a butterfly in Fig. 3 can be constructed in two ways: (1) by two wedges whose middle-vertices are hub vertices, or (2) by two wedges whose middle-vertices are low-degree vertices. Consequently, a hub vertex (e.g., u_0 in Fig. 3) does not always need to serve as a middle-vertex in a wedge for the construction of a butterfly. Thus, it is possible to design an algorithm that avoids using hub vertices unnecessarily as middle-vertices. To achieve this objective, we introduce the vertex-priority-based butterfly counting algorithm BFC-VP, which operates at the vertex level (i.e., choosing which vertex to process as the start-vertex) rather than at the layer level (i.e., choosing which vertex layer to process as the start layer). Given a graph G, BFC-VP first assigns a priority to each vertex u ∈ V(G).

Definition 3 (Priority) Consider a bipartite graph G.
The vertex priority p(·) is a total ordering of the vertices in G; that is, each vertex u ∈ V(G) is assigned a distinct priority p(u). According to the concept of priority, we can always construct a butterfly from two wedges (u, v, w) and (u, x, w) where the start-vertex u has a higher priority than the middle-vertices v and x. This is because in any butterfly we can always find the vertex with the highest priority, which connects to two vertices with lower priorities.
Based on the above observation, the BFC-VP algorithm can find all the butterflies by only processing the wedges where the priorities of start-vertices are higher than the priorities of middle-vertices. In this way, the algorithm BFC-VP avoids processing the wedges where middle-vertices have higher priorities than start-vertices (e.g., (v_0, u_0, v_1) in Fig. 3 if we consider that a higher vertex degree indicates a higher priority). In addition, in order to avoid duplicate counting, another constraint should also be satisfied: BFC-VP only processes the wedges where start-vertices have higher priorities than end-vertices. To avoid processing unnecessary wedges in the implementation, we sort the neighbors of each vertex in ascending order of their priorities. Then we can terminate the processing early once we meet an end-vertex that has a higher priority than the start-vertex (or a middle-vertex that has a higher priority than the start-vertex). We show the details of the BFC-VP algorithm in Algorithm 2.

Algorithm 2: BFC-VP
Input: G
Output: ⧖_G
1: calculate the priority for each vertex in V(G)
2: for each vertex u, sort N(u) in ascending order of priority

Firstly, BFC-VP assigns a priority to each vertex u ∈ V(G) according to Definition 3 and sorts the neighbors of u. After that, BFC-VP processes wedges from each start-vertex u (Line 4). For each middle-vertex v ∈ N_G(u), it processes v only if p(v) < p(u). Then, it only processes the end-vertices w ∈ N_G(v) that satisfy p(w) < p(u) to avoid duplicate counting. After that, the value |N_G(u) ∩ N_G(w)|, which equals wedge_cnt(w), is obtained for u and w. Then, according to Lemma 1, BFC-VP computes ⧖_G. Finally, we return ⧖_G, the total number of butterflies in G.
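The vertex-priority rule can be sketched as follows. This is an illustrative version of the BFC-VP idea, not the paper's Algorithm 2: it assumes a dict-of-neighbor-lists graph and encodes the degree-based priority as a (degree, ID) tuple so that ties are broken by ID, and it omits the early-termination optimization for brevity.

```python
def bfc_vp(adj):
    """Sketch of vertex-priority-based counting: a wedge (u, v, w) is
    processed only when p(u) > p(v) and p(u) > p(w), so each butterfly is
    counted exactly once from its highest-priority vertex."""
    # priority: higher degree -> higher priority; ties broken by vertex ID
    p = {u: (len(adj[u]), u) for u in adj}
    total = 0
    for u in adj:                                 # start-vertex
        wedge_cnt = {}
        for v in adj[u]:                          # middle-vertex
            if p[v] < p[u]:
                for w in adj[v]:                  # end-vertex
                    if p[w] < p[u]:
                        wedge_cnt[w] = wedge_cnt.get(w, 0) + 1
        for c in wedge_cnt.values():
            total += c * (c - 1) // 2             # C(wedge_cnt(w), 2)
    return total
```

In the hub example of Fig. 3, only the hub-to-hub wedges survive the priority checks, which is exactly how the quadratic blow-up of the layer-based algorithms is avoided.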
Correctness and complexity analysis of the BFC-VP algorithm. Below we show the theoretical analysis of the BFC-VP algorithm.

Theorem 1
The BFC-VP algorithm correctly solves the butterfly counting problem.
Proof We prove that BFC-VP correctly computes ⧖_G, the total number of butterflies, for a bipartite graph G. A butterfly can always be constructed from two different wedges with the same start-vertex and the same end-vertex. Thus, we only need to prove that each butterfly in G is counted exactly once by BFC-VP. Given a butterfly [x, u, v, w], we assume x has the highest priority. The vertex priority distribution must fall into one of the three situations shown in Fig. 4 (the other situations can be transformed into these by symmetry), where p_i is the corresponding vertex priority. In each of the cases in Fig. 4a, b, and c, BFC-VP counts the butterfly [x, u, v, w] exactly once, from the wedges (x, u, v) and (x, w, v). Thus, BFC-VP correctly solves the butterfly counting problem.
Theorem 2 Given a bipartite graph G, the time complexity of the BFC-VP algorithm is O(Σ_{(u,v)∈E(G), p(u)>p(v)} deg_G(v)).

Proof Algorithm 2 has two phases: initialization in the first phase and computing ⧖_G in the second phase. The time complexity of the first phase is O(n + m). Firstly, we need O(m) time to get the degrees of vertices and O(n) time to get the priorities by sorting the vertices using bin sort [37]. Secondly, we need O(m) time to sort the neighbors of vertices in ascending order of their priorities. To achieve this, we generate a new empty neighbor list T(u) for each vertex u. Then, we process the vertices in increasing order of priority, and for each vertex u and each of its neighbors v, we put u into T(v). Finally, the neighbor lists in T are ordered as required. The time cost of the second phase is bounded by the number of wedge counting operations executed in Line 8 of Algorithm 2 (i.e., the number of wedges traversed in BFC-VP). According to the processing rule of BFC-VP, the wedge counting operations consume O(deg_G(v)) time for each edge (u, v) ∈ E(G) with p(u) > p(v). This is because only the wedges where the priorities of middle-vertices are lower than the priorities of start-vertices are processed in BFC-VP. Thus, the second phase takes O(Σ_{(u,v)∈E(G), p(u)>p(v)} deg_G(v)) time in total, and this theorem holds.
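The linear-time neighbor-sorting step described in the proof can be sketched as follows. This is an illustration under our usual dict-of-neighbor-lists assumption; the paper obtains the priority ordering itself in O(n) via bin sort, which this sketch replaces with a plain sort for brevity.

```python
def sort_neighbors_by_priority(adj):
    """Build neighbor lists ordered by ascending priority with one O(m) pass:
    visit vertices from lowest to highest priority and append each vertex to
    its neighbors' fresh lists T(v), so every T(v) ends up sorted."""
    priority = {u: (len(adj[u]), u) for u in adj}  # degree, tie-broken by ID
    T = {u: [] for u in adj}
    for u in sorted(adj, key=priority.get):        # lowest priority first
        for v in adj[u]:
            T[v].append(u)                         # u precedes any later,
    return T                                       # higher-priority neighbor
```

The appended order is exactly the visiting order, so no per-list comparison sort is needed.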
According to the above analysis, how the vertices are ordered can affect the complexity of the algorithm. In this paper, we propose using the degree-based priority to achieve both good practical and theoretical results.

Definition 4 (Degree-based priority) For two vertices u, v ∈ V(G), p(u) > p(v) if and only if (i) deg_G(u) > deg_G(v), or (ii) deg_G(u) = deg_G(v) and u's ID is greater than v's ID.

Based on Definition 4, we can have the following theorem.

Theorem 3 The time complexity of BFC-VP with the degree-based priority is O(Σ_{(u,v)∈E(G)} min{deg_G(u), deg_G(v)}).

Proof Since the degree-based priority is applied, for each edge (u, v) ∈ E(G) with p(u) > p(v) we have deg_G(v) = min{deg_G(u), deg_G(v)}; thus, the bound in Theorem 2 becomes O(Σ_{(u,v)∈E(G)} min{deg_G(u), deg_G(v)}). According to [17], the time complexity of BFC-VP can be further simplified to O(α · m), where α is the arboricity of G.
In the following parts, we simply call the degree priority as the priority and use p(u) to denote the degree priority when the context is clear.

Theorem 4 The space complexity of BFC-VP is O(m).
Proof In Algorithm 2, we need O(m) space to store the graph structure and O(n) space to store the arrays for the priority of vertices and counting the number of wedges. Thus, the space cost of the BFC-VP algorithm is bounded by O(m).

Lemma 2
In a bipartite graph G, the following equation holds:

Σ_{(u,v)∈E(G)} min{deg_G(u), deg_G(v)} ≤ min{Σ_{u∈U(G)} deg_G(u)^2, Σ_{v∈L(G)} deg_G(v)^2}   (3)

The equality holds if and only if one of the following two conditions is satisfied: (1) deg_G(u) ≤ deg_G(v) for every edge (u, v) with u ∈ U(G), and Σ_{u∈U(G)} deg_G(u)^2 ≤ Σ_{v∈L(G)} deg_G(v)^2; (2) the symmetric condition with the two layers exchanged.

Proof Given a bipartite graph G, since there are deg_G(u) edges attached to a vertex u ∈ U(G), we can get that Σ_{(u,v)∈E(G)} min{deg_G(u), deg_G(v)} ≤ Σ_{(u,v)∈E(G)} deg_G(u) = Σ_{u∈U(G)} deg_G(u)^2, and symmetrically for L(G). Thus, we can prove that Eq. (3) holds. The condition of equality can be easily proved by contradiction, which is omitted here.
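The inequality of Lemma 2 is easy to check numerically; a minimal sanity sketch under our dict-of-neighbor-lists assumption (the function name is ours):

```python
def lemma2_sides(adj, upper):
    """Return (lhs, rhs) of Eq. (3): the edge-wise min-degree sum versus the
    smaller layer-wise sum of squared degrees. `upper` is the set U(G)."""
    deg = {u: len(adj[u]) for u in adj}
    lhs = sum(min(deg[u], deg[v]) for u in upper for v in adj[u])
    rhs = min(sum(deg[u] ** 2 for u in upper),
              sum(deg[v] ** 2 for v in adj if v not in upper))
    return lhs, rhs
```

For instance, with U = {u_0, u_1}, edges (u_0, v_0), (u_0, v_1), (u_1, v_0), the left-hand side is 4 while the right-hand side is 5, so the inequality is strict.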
From Lemma 2, we can see that BFC-VP improves the time complexity of BFC-IBS. Now we illustrate how BFC-VP efficiently handles the hub vertices compared with BFC-IBS using the following example.

Example 1 Consider the example in Fig. 3. BFC-VP first assigns a priority to each vertex in G, where the hub vertices u_0, u_1, v_1000, and v_1001 receive the highest priorities according to their degrees. Starting from u_1, BFC-VP needs to process 1,000 wedges ending at u_0. Similarly, starting from v_1001, BFC-VP needs to process 1,000 wedges ending at v_1000. No other wedges need to be processed by BFC-VP. In total, BFC-VP needs to process 2,000 wedges.
BFC-IBS processes each vertex u ∈ U (G) as start-vertex. Starting from u 0 , BFC-IBS needs to process 1000 wedges ending at u 1 . Starting from u 1 , no wedges need to be processed. In addition, starting from the vertices in {u 2 , u 3 , . . . , u 1001 }, BFC-IBS needs to process 999,000 wedges. In total, BFC-IBS needs to process 1,000,000 wedges.

Cache-aware techniques
As discussed in Sect. 1, when processing the wedges in the BFC-VP algorithm, there are O(n) accesses of start-vertices, O(m) accesses of middle-vertices, and O(Σ_{(u,v)∈E(G)} min{deg_G(u), deg_G(v)}) accesses of end-vertices. Thus, the total number of accesses of end-vertices is dominant. For example, running the BFC-VP algorithm on the Tracker dataset incurs about 6 × 10^9 accesses of end-vertices, while the accesses of start-vertices and middle-vertices are only 4 × 10^7 and 2 × 10^8, respectively. Since the cache miss latency takes a big part of the memory access time [3], we try to improve the CPU cache performance when accessing the end-vertices.
Because the CPU cache is hard to control directly in algorithms, a general approach to improving the CPU cache performance is storing frequently accessed vertices together. Suppose there is a buffer BF that is partitioned into a low-frequency area LFA and a high-frequency area HFA, as shown in Fig. 5. The vertices are stored in BF, and only a limited number of vertices are stored in HFA. For an access of an end-vertex w, we compute miss(w) as follows: miss(w) = 0 if w is stored in HFA, and miss(w) = 1 otherwise. We want to minimize F, which is computed by F = Σ_{(u,v,w)∈W} miss(w).

Fig. 5 The buffer BF
Here, W is the set of processed wedges of an algorithm.
Since F can only be derived after the algorithm finishes, the minimum value of F cannot be pre-computed. We present two strategies that aim to decrease F:
- Cache-aware wedge processing, which uses more high-priority vertices as end-vertices while retaining the total number of accesses of end-vertices (and thus the time complexity of BFC-VP). Doing this enhances the access locality.
- Cache-aware graph reordering, which stores high-priority vertices together in HFA.

Cache-aware wedge processing
Issues in wedge processing of BFC-VP. In BFC-VP, the processing rule requires that the priorities of end-vertices be lower than the priorities of start-vertices in the processed wedges. Because of that, the accesses of end-vertices exhibit bad locality (i.e., they are not clustered in memory). For example, counting the accesses of end-vertices over the Tracker dataset, as shown in Fig. 6a, 79% of the total accesses are to low-degree vertices (i.e., degree < 500), while the percentage of accesses to high-degree vertices (i.e., degree > 2000) is only 9% in BFC-VP. Since the locality of accesses is a key aspect of improving the CPU cache performance, we explore whether the locality of end-vertex accesses can be improved.
With the total number of accesses of end-vertices remaining unchanged, we hope the algorithm can access more high-degree vertices as end-vertices. In that manner, the algorithm has more chances to request the same memory location repeatedly, and the accesses of HFA are more likely to increase (i.e., F is more likely to decrease).
New wedge processing strategy. Based on the above observation, we present a new wedge processing strategy: processing the wedges where the priorities of end-vertices are higher than the priorities of both middle-vertices and start-vertices. We name the algorithm using this new strategy BFC-VP+. BFC-VP+ uses more high-priority vertices as end-vertices than BFC-VP because of the restriction on the priorities of end-vertices. For example, considering the graph in Fig. 2b, priorities are assigned according to the vertex degrees. We analyze the processed wedges starting from v_0 to v_3 and going through u_0. BFC-VP needs to process 5 wedges, and 3 vertices (i.e., v_1, v_2, and v_3) serve as end-vertices. Utilizing the new wedge processing strategy, as shown in Fig. 2b, the number of processed wedges of BFC-VP+ is still 5 (e.g., (v_3, u_0, v_0)), but only 2 high-priority vertices (i.e., v_0 and v_3) serve as end-vertices. Thus, the number of distinct end-vertices accessed decreases from 3 to 2 (i.e., the accesses exhibit better locality). Also, as shown in Fig. 6b, after applying the new wedge processing strategy, the percentage of accesses to high-degree vertices (i.e., degree > 2000) increases from 9% to 81% on the Tracker dataset.
Analyzing the new wedge processing strategy. Although the new wedge processing strategy can improve the CPU cache performance of BFC-VP, two questions arise naturally: (1) whether the number of processed wedges is still the same as in BFC-VP; (2) whether the time complexity is still the same as BFC-VP after utilizing the new wedge processing strategy. We denote the set of processed wedges of BFC-VP as W_vp and the set of processed wedges of BFC-VP+ as W_vp+, and the following lemma holds.

Lemma 3 |W_vp| = |W_vp+|.

Proof For a wedge (u, v, w) ∈ W_vp, it always satisfies p(u) > p(v) and p(u) > p(w) according to Algorithm 2. For a wedge (u, v, w) ∈ W_vp+, it always satisfies p(w) > p(v) and p(w) > p(u) according to the new wedge processing strategy. In addition, p(u) is unique for each vertex u, and the new wedge processing strategy does not change p(u). Thus, for each wedge (u, v, w) ∈ W_vp, we can always find the wedge (w, v, u) ∈ W_vp+; similarly, for each wedge (u, v, w) ∈ W_vp+, we can always find the wedge (w, v, u) ∈ W_vp. Therefore, |W_vp| = |W_vp+|.
Since no duplicate wedges are processed, based on the above lemma, BFC-VP+ processes the same number of wedges as BFC-VP. However, if we only apply this strategy, then when going through a middle-vertex we need to check all of its neighbors to find the end-vertices with higher priorities than the middle-vertex and the start-vertex; the time complexity could thus exceed that of BFC-VP. In order to retain the time complexity, for each vertex we sort the neighbors in descending order of their priorities. After that, when dealing with a middle-vertex, we can terminate the priority checking early once we meet a neighbor with a lower priority than the middle-vertex or the start-vertex.
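The mirrored wedge rule can be sketched as follows. This is an illustrative version of the BFC-VP+ idea under our dict-of-neighbor-lists assumption; as in the earlier sketches, the (degree, ID) tuple encodes the degree-based priority, and the early-termination over descending-sorted neighbors is omitted for brevity.

```python
def bfc_vp_plus(adj):
    """Sketch of the BFC-VP+ wedge rule: process (u, v, w) only when
    p(w) > p(v) and p(w) > p(u), so high-priority vertices serve as
    end-vertices (better locality); by the mirroring argument, the set of
    processed wedges has the same size as in BFC-VP."""
    p = {u: (len(adj[u]), u) for u in adj}
    total = 0
    for u in adj:                                  # start-vertex
        wedge_cnt = {}
        for v in adj[u]:                           # middle-vertex
            for w in adj[v]:                       # end-vertex
                if p[w] > p[v] and p[w] > p[u]:
                    wedge_cnt[w] = wedge_cnt.get(w, 0) + 1
        for c in wedge_cnt.values():
            total += c * (c - 1) // 2
    return total
```

Each butterfly is still counted exactly once, now from the wedge pair whose shared end-vertex is the butterfly's highest-priority vertex.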

Cache-aware graph reordering
Motivation. After utilizing the cache-aware wedge processing strategy, end-vertices are mainly high-priority vertices. Generally, vertices are sorted by their IDs when stored in the buffer. Figure 7 shows the accesses of the buffer when processing the end-vertices (i.e., v_0 and v_3) starting from v_0 to v_3 and going through u_0 in Fig. 2b by BFC-VP+. We can see that although the end-vertices are mostly high-priority vertices, the distance between two end-vertices (e.g., v_0 and v_3) can be very long. This is because many low-priority vertices are stored between the high-priority vertices. In addition, real graphs usually follow power-law degree distributions, which do not contain many vertices with high priorities (degrees). For example, in the Tracker dataset with about 40,000,000 vertices, there are only 10,338 vertices with degree ≥ 1000, and only 1% of the vertices (400,000) have degree ≥ 37. Motivated by the above observations, we propose the graph reordering strategy, which can further improve the cache performance.
Graph reordering strategy. The main idea of the graph reordering strategy is to reorder the given bipartite graph G into a reordering graph G* using a bijective function f. The reordering graph G* is defined as follows.
Definition 5 (Reordering graph) For a bipartite graph G and a bijective function f that relabels the vertex-IDs in non-increasing order of vertex priority, the reordering graph G* is defined as V(G*) = {f(u) | u ∈ V(G)} and E(G*) = {(f(u), f(v)) | (u, v) ∈ E(G)}. Note that our linear graph reordering uses a bijective function to relabel the vertex-IDs, which does not change the graph structure. Thus, the numbers of vertices and edges are both unchanged after reordering. After reordering the original graph G into the reordering graph G*, the vertices with high priorities are stored together; in this manner, we can store more high-priority vertices consecutively in HFA. Figure 7 illustrates the idea of graph reordering using the example in Fig. 2b. After obtaining the reordering graph G*, the distance between two high-priority end-vertices becomes much shorter, e.g., the distance between v*_1 and v*_2 is 1, while the distance between v_0 and v_3 before reordering is 3. In the experiments, we show that the algorithms applying the graph reordering strategy achieve a much lower cache miss ratio than BFC-VP.
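The reordering itself is a simple relabeling. A minimal sketch, assuming degree-based priorities with ID tie-breaking (function and variable names are our own):

```python
def reorder_graph(adj):
    """Relabel vertex-IDs with a bijection f so that vertices with higher
    priority (degree, ties broken by old ID) receive smaller new IDs and
    are therefore stored consecutively. The graph structure is unchanged."""
    order = sorted(adj, key=lambda x: (len(adj[x]), x), reverse=True)
    f = {x: i for i, x in enumerate(order)}     # bijection: old ID -> new ID
    new_adj = {f[x]: {f[y] for y in adj[x]} for x in adj}
    return new_adj, f
```

Since f is a bijection, the number of vertices and edges is preserved; only the storage layout of IDs changes.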

Putting cache-aware strategies together
The BFC-VP++ algorithm. Putting the above strategies together, we show the details of the BFC-VP++ algorithm in Algorithm 3. BFC-VP++ first generates a reordering graph G* according to Definition 5 and, for each vertex u* ∈ V(G*), sorts its neighbors. Then, it processes the wedges following the cache-aware strategies. Finally, we compute ⋈_G (Lines 13-14).

Proof We prove that BFC-VP++ correctly computes ⋈_G for a bipartite graph G. Since the graph reordering strategy just renumbers the vertices, it does not affect the structure of G. Given a butterfly [x, u, v, w], we assume x has the highest priority. We only need to prove that BFC-VP++ counts each butterfly in Fig. 4 exactly once. Regarding the case in Fig. 4a, b, or c, BFC-VP++ only counts the butterfly [x, u, v, w] once, from the wedges (v, u, x) and (v, w, x). Thus, the BFC-VP++ algorithm correctly solves the butterfly counting problem.
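The counting core of the vertex-priority paradigm can be sketched as follows. This is a simplified in-memory version under our own assumptions (adjacency-set dict; degree-based priorities with ID tie-breaking stand in for the paper's priority assignment), not the full Algorithm 3:

```python
from collections import defaultdict

def count_butterflies_vp(adj):
    """Vertex-priority butterfly counting: a wedge (u, v, w) is processed
    only when the start-vertex u has a strictly higher priority than both
    the middle-vertex v and the end-vertex w, so every butterfly is
    counted exactly once from its highest-priority vertex."""
    p = {x: (len(adj[x]), x) for x in adj}  # unique priorities
    total = 0
    for u in adj:
        wedge_cnt = defaultdict(int)
        for v in adj[u]:
            if p[v] < p[u]:                      # middle has lower priority
                for w in adj[v]:
                    if w != u and p[w] < p[u]:   # end has lower priority
                        wedge_cnt[w] += 1
        # k wedges between u and w yield k*(k-1)/2 butterflies
        total += sum(k * (k - 1) // 2 for k in wedge_cnt.values())
    return total
```

For a butterfly whose highest-priority vertex is x, only the two wedges starting at x pass the priority test, so the butterfly contributes exactly one pair to `wedge_cnt`.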
Proof Algorithm 3 has two phases: the initialization phase and the ⋈_G computation phase. In the first phase, similar to BFC-VP, the algorithm needs O(n + m) time to compute the priority numbers, sort the neighbors of the vertices, and compute the reordering graph. Secondly, we analyze how many wedge counting operations are executed by BFC-VP++ in Line 10 of Algorithm 3 (i.e., the number of wedges processed). In BFC-VP++, we only need to process the wedges where the degree of the end-vertex is higher than or equal to that of the middle-vertex. Thus, the wedge counting takes O(Σ_{(u,v)∈E(G)} min{deg_G(u), deg_G(v)}) time in total, and this theorem holds.
Proof In Algorithm 3, we need O(m) space to store the graph structure and O(n) space to store the arrays for the vertex priorities and the wedge counts. The graph reordering process also needs O(m) space for the reordering graph. Thus, the space cost of the BFC-VP++ algorithm is bounded by O(m).
Remark Note that our algorithms (i.e., BFC-VP and BFC-VP ++ ) are able to output all the butterflies in a com-

Handling other cases
In this section, we first extend our algorithms to compute ⋈_e for each edge e in G. Second, we extend our algorithms to parallel algorithms. Third, we introduce the external memory butterfly counting algorithm to handle large graphs with limited memory size.

Counting the butterflies for each edge
For an edge e = (u, v) in G, the following equation holds [64]:

⋈_e = Σ_{w ∈ N_G(v)\{u}} (|N_G(u) ∩ N_G(w)| − 1).    (8)

Based on the above equation, our BFC-VP++ algorithm can be extended to compute the butterfly count for each edge. Here, we present the BFC-EVP++ algorithm to compute ⋈_e; the details are shown in Algorithm 4. In the initialization process, we initialize ⋈_e for each edge e in G. After that, for each start-vertex u*, we run Lines 6-12 of Algorithm 3 to compute |N_G(u*) ∩ N_G(w*)| for each valid end-vertex w*. Then, we run another round of wedge processing and update ⋈_(u,v) and ⋈_(v,w) according to Eq. (8) (Lines 5-14). Finally, we return the result.
In Algorithm 4, we only need an extra array to store ⋈_e for each edge e. In addition, because it just runs the wedge processing procedure twice, the time complexity of the BFC-EVP++ algorithm is the same as that of BFC-VP++.
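Equation (8) can be checked directly on a small graph. The following one-liner is our own illustrative helper (not Algorithm 4 itself), assuming an adjacency-set dict; the `- 1` discounts v, which always lies in N(u) ∩ N(w) when the edge (u, v) exists:

```python
def edge_butterflies(adj, u, v):
    """Butterfly count of the existing edge (u, v), via
    sum over w in N(v)\{u} of (|N(u) & N(w)| - 1)."""
    return sum(len(adj[u] & adj[w]) - 1 for w in adj[v] if w != u)
```

A useful sanity check: each butterfly has four edges, so summing over all ordered edge endpoints counts every butterfly eight times.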

Parallelization
Shared-memory parallelization. In Algorithm 3, only read operations occur on the graph structure, which motivates us to consider shared-memory parallelization. Assume we have multiple threads that can handle different start-vertices simultaneously. No conflict occurs when these threads read the graph structure simultaneously. However, conflicts may occur when they update wedge_cnt and ⋈_G simultaneously in Algorithm 3. Thus, we divide the data space into a global data space and a local data space.
In the global data space, the threads can access the graph structure simultaneously. In the local data space, we use local_wedge_cnt and local_⋈_G for each thread to avoid conflicts. Thus, we can use O(n · t + m) space to extend BFC-VP++ into a parallel version, where t denotes the number of threads.

The BFC-VP++ algorithm in parallel. We show the details of the parallel BFC-VP++ algorithm in Algorithm 5. Note that we use a priority-based dynamic scheduling strategy to balance the workload [66]. Similar to BFC-VP++, we first generate a reordering graph G*. Then, the algorithm processes the start-vertices in non-ascending order of their priorities. A vertex u* ∈ V(G*) is dynamically allocated to an idle thread i. Note that, for each thread i, we generate an independent space for local_wedge_cnt[i] and local_⋈_G[i]. After all threads finish their computation, we compute ⋈_G on the master thread.
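The global/local data-space split can be sketched as follows. This is a simplified illustration under our own assumptions: the graph dict is shared read-only, each task keeps a local wedge counter and a local count, and a static round-robin split of start-vertices stands in for the paper's priority-based dynamic scheduling:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def parallel_butterfly_count(adj, t=4):
    """Shared-memory parallel counting: each task owns local_wedge_cnt and
    a local butterfly count, so no synchronization is needed beyond the
    final aggregation on the master thread."""
    p = {x: (len(adj[x]), x) for x in adj}
    starts = sorted(adj, key=lambda x: p[x], reverse=True)  # priority order

    def count_chunk(chunk):
        local_count = 0
        for u in chunk:
            local_wedge_cnt = defaultdict(int)  # task-local: no conflicts
            for v in adj[u]:
                if p[v] < p[u]:
                    for w in adj[v]:
                        if w != u and p[w] < p[u]:
                            local_wedge_cnt[w] += 1
            local_count += sum(k * (k - 1) // 2
                               for k in local_wedge_cnt.values())
        return local_count

    chunks = [starts[i::t] for i in range(t)]   # simple static round-robin
    with ThreadPoolExecutor(max_workers=t) as ex:
        return sum(ex.map(count_chunk, chunks))  # aggregate on the master
```

In CPython the GIL limits actual speedup for this pure-Python sketch; the point is the data-space pattern, which carries over directly to the C++ implementation described in the paper.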
Note that the work-span model is widely used to analyze parallel algorithms: the work of an algorithm measures the total number of operations, and the span is the longest dependency path [20]. Based on this model, Algorithm 5 takes O(α · m) work and O(log(m)) span with high probability, which can be analyzed similarly as done in [59].

External memory butterfly counting
To handle large graphs with limited memory size, we introduce the external memory algorithm BFC-EM in Algorithm 6, which is also based on the vertex priority. We first run an external sort on the edges to group the edges with the same vertex-IDs together. Then, we compute the priorities of the vertices by sequentially scanning these edges once. Next, for each vertex v ∈ V(G), we sequentially scan its neighbors from disk and generate the wedges (u, v, w) with p(w) > p(v) and p(w) > p(u), where w ∈ N_G(v) and u ∈ N_G(v) (Lines 4-6). For each wedge (u, v, w), we only store the vertex-pair (u, w) on disk. After that, we rearrange the vertex-pairs on disk such that all (u, w) pairs with the same u and w values are stored contiguously (Line 7); this can be achieved by simply running an external sort on these (u, w) pairs. Finally, we sequentially scan these vertex-pairs, count the occurrences of each pair, and compute ⋈_G based on Lemma 1 (Lines 8-10).
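The pipeline above can be mimicked in memory. In this sketch (our own helper names; an in-memory sort stands in for the external sort), c identical (u, w) pairs contribute C(c, 2) butterflies, as in Lemma 1:

```python
from itertools import groupby

def em_style_count(edges):
    """BFC-EM-style counting: emit one (u, w) pair per wedge whose
    end-vertex w has the highest priority, sort the pairs, then scan runs
    of equal pairs and add c*(c-1)/2 per run."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    p = {x: (len(adj[x]), x) for x in adj}
    pairs = []
    for v in adj:                       # v plays the middle-vertex role
        for u in adj[v]:
            for w in adj[v]:
                if p[w] > p[v] and p[w] > p[u]:
                    pairs.append((u, w))
    pairs.sort()                        # external sort in the real algorithm
    total = 0
    for _, run in groupby(pairs):       # sequential scan over sorted pairs
        c = sum(1 for _ in run)
        total += c * (c - 1) // 2
    return total
```

Only the pair stream needs to be sorted and scanned, which is why the I/O cost is dominated by scan and sort over the O(α · m) pairs.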

Algorithm 6: BFC-EM
Input: G; Output: ⋈_G
1 sort all the edges e ∈ G on disk
2 calculate the priority for each vertex in V(G) on disk

In BFC-EM, the dominant cost is to scan and sort the vertex-pairs. Since O(α · m) vertex-pairs are generated by the BFC-EM algorithm, the I/O complexity of BFC-EM is O(scan(α · m) + sort(α · m)).

Batch-dynamic butterfly counting
In this section, we discuss the problem of batch-dynamic butterfly counting. Given a bipartite graph G and a batch-update B, we aim to compute the number of butterflies resulting from the batch-update B (denoted as ⋈_B). Here, a batch-update B is a batch of edge insertion and edge deletion operations. In other words, supposing we already know how many butterflies are in G, we want to obtain the number of butterflies after applying B to G. Since B contains edge insertion and deletion operations, we use B+ to represent the set of inserted edges and B− to represent the set of deleted edges. We suppose B+ and B− are disjoint (i.e., B+ ∩ B− = ∅) since we can safely remove all the common edges in these two sets without affecting the final butterfly count. Then, we can first count the number of affected butterflies of B− and then count the number of affected butterflies of B+. Note that the counting procedures of the deletion batch and the insertion batch are inherently the same, since we can consider deleting B− from G as inserting B− into G\B−; all we need is the number of affected butterflies resulting from a batch of updates. It is worth noticing that, to transform a deletion case into an insertion case, we need to adjust the initial data structures: specifically, we remove B− from G and consider B− as B+. As evaluated in our experiments, such overhead is small. For ease of presentation, we suppose all the updates are edge insertions in the following parts (i.e., B = B+).
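As a small illustration of this normalization, the following sketch (our own helper; edges are represented as frozensets so that orientation does not matter) cancels the common edges and rebases the graph so that only insertions remain:

```python
def normalize_batch(graph_edges, b_plus, b_minus):
    """Reduce a mixed batch-update to pure insertions: cancel the edges
    common to B+ and B-, then remove the remaining B- from G, so that
    deleting B- from G becomes inserting B- into G without those edges."""
    bp = {frozenset(e) for e in b_plus}
    bm = {frozenset(e) for e in b_minus}
    common = bp & bm
    bp, bm = bp - common, bm - common           # now B+ and B- are disjoint
    base = {frozenset(e) for e in graph_edges} - bm
    return base, bp, bm
```

The counting routine can then be run twice, once with bm and once with bp as the insertion batch, against the appropriate base graph.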

Computing B from each new edge
It is apparent that each new butterfly contains at least one new edge from B. As a result, a straightforward algorithm BFCB-BS can be designed by sequentially processing each new edge (u, v) ∈ B: we count the number of new butterflies containing (u, v) against the current graph, and then insert (u, v) into G. If deg_G(u) < deg_G(v), we exchange u and v (i.e., process the neighbor set of u instead of that of v). In this way, we can reduce the number of wedges processed in the algorithm. One may also consider precomputing a hash map of the neighbor set for each vertex in G; then, when computing |N_G(u) ∩ N_G(w)|, we can scan the smaller set between N_G(u) and N_G(w) in O(min{deg_G(u), deg_G(w)}) time. However, computing the hash map for each vertex in G incurs additional time cost, which can be a large overhead in the batch-dynamic context. After that, we insert (u_3, v_2) into G, and the number of new butterflies containing (u_3, v_2) is 5. In total, the number of new butterflies resulting from B is 7.
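A minimal sketch of this baseline (our own function names, adjacency-set representation). Because each edge is counted before being inserted, a butterfly containing several new edges is counted exactly once, by the last of its new edges to be processed:

```python
def bfcb_bs(adj, batch):
    """Baseline batch counting: for each new edge (u, v), count the
    butterflies it completes against the current graph, then insert it."""
    new_total = 0
    for u, v in batch:
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if len(adj[v]) > len(adj[u]):
            u, v = v, u              # process the smaller neighbor set
        # (u, v) is not inserted yet, so v never appears in N(u) & N(w)
        new_total += sum(len(adj[u] & adj[w]) for w in adj[v] if w != u)
        adj[u].add(v)
        adj[v].add(u)
    return new_total
```

Note that, unlike Eq. (8) for existing edges, no `- 1` correction is needed here, because v is not yet a neighbor of u when the intersection is taken.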

Theorem 8 The time complexity of BFCB-BS is O(Σ_{(u,v)∈B} Σ_{w∈N_G(v)\{u}} (deg_G(u) + deg_G(w))) in the worst case.

Nevertheless, BFCB-BS has two drawbacks. (1) Its time complexity is high; the in-depth reason is that if we want to count the number of butterflies containing exactly one new edge, we cannot use the vertex-priority-based techniques to amortize the time cost as in Algorithm 4. (2) It is not easy to parallelize BFCB-BS, since it needs to process each edge e ∈ B sequentially and insert the edge e into G after processing it. Specifically, if we directly parallelize Line 2 in Algorithm 7, the parallel algorithm will likely overlook the new butterflies with more than one new edge that are processed by different threads simultaneously. There is no simple parallel implementation of BFCB-BS that can address this issue efficiently.

Computing B from the affected subgraph
To address the above-mentioned issues, we propose an affected-subgraph-based algorithm BFCB-IG. The main intuition behind BFCB-IG is that, before computing the new butterflies resulting from B, we first apply B to G and identify the affected scope (or subgraph) of B. In this way, we can not only use the vertex-priority-based algorithms (in a subgraph with limited size) but also easily derive a parallel implementation. We define the affected subgraph as follows.

Definition 6 (Affected subgraph) Given a bipartite graph G and a batch-update B, the affected subgraph G_A due to B is the induced subgraph of all the vertices in V_B^1 ∪ V_B^2. Here, V_B^1 is the set of vertices that are incident to at least one new edge in B, and V_B^2 is the set of vertices that are neighbors of at least one vertex in V_B^1. For instance, the affected subgraph G_A of G is shown in Fig. 8.

Lemma 4 Consider a bipartite graph G and a batch-update B. All the new butterflies resulting from B are contained in the affected subgraph G_A.

Proof Each new butterfly must contain at least one new edge e = (u, v). Since V_B^1 includes all the vertices incident to the new edges, u and v must be in V_B^1. Due to the butterfly structure, the other two vertices of the new butterfly must be connected to u and v, so they must be contained in V_B^2 (Definition 6). Therefore, all the new butterflies resulting from B are contained in the affected subgraph G_A.
Based on the above lemma, we can guarantee that all the new butterflies are contained in G_A. Since G_A may also contain many butterflies that already exist in G (i.e., old butterflies), we need to subtract the count of these old butterflies when applying a butterfly counting algorithm (e.g., BFC-VP++) on G_A. Reviewing the BFC-VP++ algorithm (i.e., Algorithm 3), we can see that it starts from each vertex and processes valid wedges to count the number of butterflies. In the batch-dynamic context, we call a wedge a new wedge if it contains at least one new edge from B; otherwise, we call it an old wedge. For example, (v_0, u_3, v_1) is a new wedge. It is easy to observe that any new butterfly in G_A is composed of (1) one new wedge and one old wedge, or (2) two new wedges. Thus, we have the following lemma.

Lemma 5 Given an affected subgraph G_A and a vertex u ∈ G_A, the number of new butterflies containing u, denoted as ⋈_u^+, can be computed by the following equation:

⋈_u^+ = Σ_w [ C(C_total(w), 2) − C(C_old(w), 2) ].

Here, C_total(w) is the number of wedges containing u and w in G_A, and C_old(w) is the number of old wedges containing u and w in G_A.
Proof For any two vertices u and w, any two distinct wedges between them form a butterfly. Hence, there are C(C_total(w), 2) butterflies containing u and w in G_A, among which C(C_old(w), 2) are composed of two old wedges and thus already exist in G. Since a new butterfly contains at least one new wedge, the number of new butterflies containing u and w is C(C_total(w), 2) − C(C_old(w), 2); summing over all w completes the proof.

The BFCB-IG algorithm. Based on the above lemma, we design the algorithm BFCB-IG to count the number of new butterflies in G_A, as shown in Algorithm 8. We first insert each edge in B into G. Then, we construct the affected subgraph G_A according to Definition 6. After that, we re-organize G_A using the cache-aware graph reordering technique and compute the vertex priorities on G_A. Then, starting from each vertex in V(G_A), we use the same strategy as BFC-VP++ to process the wedges. Unlike BFC-VP++, BFCB-IG needs two hash maps C_total and C_old to record the number of wedges and the number of old wedges containing an end-vertex w, respectively. This is because we need the number of new butterflies, which is computed according to Lemma 5 (Line 17).

Example 3 Consider the example in Fig. 8a and the batch-update B. We first construct G_A as shown in Fig. 8b. Then, we start from each vertex in G_A to count the number of new butterflies by traversing valid wedges. Starting from v_0, the number of new butterflies obtained is C(3, 2) − C(2, 2) + C(3, 2) − C(2, 2) = 4. Then, when starting from v_2, we get 3 new butterflies similarly. In total, we have 7 new butterflies.

Regarding the correctness of BFCB-IG: for each two-hop neighbor w of u, C_total(w) counts the total number of wedges containing u and w, and C_old(w) counts the old wedges (Lines 6, 7, 11, and 13). Note that only the wedges in which w has the highest priority are visited (Lines 9-10), which avoids counting duplicate butterflies. By Lemma 5, the number of new butterflies containing u and w is computed by subtracting the number of old butterflies containing u and w from the total count (Line 17). Since ⋈_B keeps an accumulating sum of the number of new butterflies containing the visited vertices, ⋈_B equals the number of new butterflies when the for-loop terminates in Line 18, which completes the proof of correctness.
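Lemma 5 can be sketched directly in Python. For brevity this illustrative version (our own names) runs on the whole updated graph rather than first extracting G_A, and uses degree-based priorities with ID tie-breaking:

```python
from collections import defaultdict

def count_new_butterflies(adj, new_edges):
    """Count new butterflies per Lemma 5: for a start-vertex u and each
    end-vertex w of highest priority, if C_total(w) wedges connect u and
    w and C_old(w) of them are old, the new butterflies on (u, w) number
    C(C_total(w), 2) - C(C_old(w), 2)."""
    new = {frozenset(e) for e in new_edges}
    p = {x: (len(adj[x]), x) for x in adj}

    def comb2(k):
        return k * (k - 1) // 2

    total = 0
    for u in adj:
        c_total, c_old = defaultdict(int), defaultdict(int)
        for v in adj[u]:
            for w in adj[v]:
                # only wedges whose end-vertex w has the highest priority
                if w != u and p[w] > p[u] and p[w] > p[v]:
                    c_total[w] += 1
                    if (frozenset((u, v)) not in new
                            and frozenset((v, w)) not in new):
                        c_old[w] += 1       # wedge contains no new edge
        total += sum(comb2(c_total[w]) - comb2(c_old[w]) for w in c_total)
    return total
```

The subtraction removes exactly the butterflies formed by two old wedges, i.e., those that already existed before the batch-update.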

Theorem 10 The worst-case time complexity of BFCB-IG is O(TC_{G_A} + Σ_{(u,v)∈E(G_A)} min{deg_{G_A}(u), deg_{G_A}(v)}). Here, TC_{G_A} is the worst-case time complexity of constructing G_A, bounded by O(|B| + Σ_{u∈V(G_A)} deg_G(u)).
Proof Algorithm 8 first constructs the affected subgraph G_A from B and then counts the new butterflies on G_A. When constructing G_A, the algorithm inserts the edges in B into G, which takes O(|B|) time. Then, it needs to visit the vertices in V(G_A) and scan through each vertex's neighbors in G to construct the affected subgraph. Therefore, the affected subgraph construction takes O(|B| + Σ_{u∈V(G_A)} deg_G(u)) time. Also, Algorithm 8 needs to enumerate all the valid wedges similarly to Algorithm 3; by Theorem 3, this process takes O(Σ_{(u,v)∈E(G_A)} min{deg_{G_A}(u), deg_{G_A}(v)}) time. Adding it to the time complexity of G_A construction completes the proof.
In addition, BFCB-IG has a space complexity of O(|G| + |B|) since, apart from the graph structure, the data structures used in BFCB-IG (e.g., C_total and C_old) are all bounded by O(|G| + |B|).

Reducing the computation of old wedges
Motivation. Although the algorithm BFCB-IG can effectively utilize the vertex-priority-based techniques, it can be further sped up by reducing the computation of old wedges. In BFCB-IG, the number of new butterflies is obtained based on Lemma 5, which needs the total number of butterflies and the number of old butterflies on G_A. In this manner, we traverse many old wedges, some of which do not lead to any new butterfly. For instance, in the affected subgraph G_A in Fig. 10, we need to traverse many wedges that do not form any new butterfly, such as (u_1, v_1, u_2) and (u_1, v_2, u_2). To explore how to reduce the computation of old wedges (and old butterflies), we first present all the cases of new butterflies, as shown in Fig. 9; these cases are based on the fact that each new butterfly must contain at least one new edge.

Handling new butterflies of different cases. Reviewing the vertex-priority-based algorithm BFCB-IG, a butterfly is counted by composing two wedges that both use a vertex w with the highest priority as the end-vertex. The butterflies in Cases 4-9 are all composed of two new wedges (i.e., wedges containing at least one new edge). Thus, to avoid touching old wedges when identifying these butterflies, we can split N_{G_A}(u) into N_old(u) and N_new(u) for each u ∈ V(G_A), which contain the old neighbors and the newly inserted neighbors of u, respectively. Then, when starting from a vertex u, we can enumerate only the new wedges from these neighbor lists. When handling Cases 1-3, we cannot avoid the traversal of old wedges, since these butterflies are composed of one new wedge and one old wedge. For instance, considering Case 1 in Fig. 10, the butterfly is composed of the new wedge (u, x, w) and the old wedge (u, v, w). To reduce the computation of old wedges that do not lead to new butterflies, we adopt the following strategies. (1) We do not process a start-vertex u if no new wedge is generated from it.
To achieve this, we record the set of end-vertices of the new wedges in an array arr_2hop when handling Cases 4-9 and terminate the finding process if arr_2hop is empty. For instance, consider the example in Fig. 10. When starting from u_1, no new wedge exists from it, and we skip the process of finding butterflies of Cases 4-9 from u_1. (2) When enumerating old wedges from a start-vertex u, we need to scan the neighbor list N_old(v) for each v ∈ N_old(u) to find whether each w ∈ arr_2hop exists. In many cases, the number of end-vertices of new wedges can be much smaller than the size of the neighbor set of v (i.e., |N_old(v)|). In such cases, we can perform a binary search to check whether each w ∈ arr_2hop exists in N_old(v). For instance, when starting from u_0, the only end-vertex of a new wedge is u_1. Thus, when finding old wedges starting from u_0, we can use binary search to check whether u_1 exists in N_old(v_1) rather than scanning N_old(v_1).
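The scan-versus-binary-search choice can be sketched as follows (our own helper; the cost comparison is a simple k·log n versus n heuristic, assuming the old neighbor list is kept sorted):

```python
from bisect import bisect_left

def common_targets(sorted_neighbors, arr_2hop):
    """Return the end-vertices of arr_2hop that appear in the sorted old
    neighbor list, choosing between repeated binary searches and a single
    linear scan based on an estimated cost comparison."""
    n, k = len(sorted_neighbors), len(arr_2hop)
    if k * max(1, n.bit_length()) < n:   # ~ k*log(n) cheaper than n
        found = []
        for w in arr_2hop:
            i = bisect_left(sorted_neighbors, w)
            if i < n and sorted_neighbors[i] == w:
                found.append(w)
        return found
    targets = set(arr_2hop)              # fall back to one linear scan
    return [x for x in sorted_neighbors if x in targets]
```

When arr_2hop is small, as is typical for sparse batch-updates, the binary-search branch avoids touching most of the old neighbor list.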
The BFCB-IG+ algorithm. With the above optimization strategies, we propose the BFCB-IG+ algorithm in Algorithm 9. We construct the affected subgraph G_A similarly to BFCB-IG. After that, we split N_{G_A}(u) into N_old(u) and N_new(u) for each u ∈ V(G_A) (Line 5). Then, starting from each vertex in V(G_A), we use the same strategy as BFC-VP++ to process the wedges. Unlike BFCB-IG, we first count the number of butterflies of Cases 4-9 (i.e., the butterflies composed of two new wedges) as shown in Fig. 9. To achieve this, we need to process all the new wedges: we choose a vertex v from N_old(u) (or N_new(u)) and a vertex w from N_new(v) (or N_{G_A}(v)) (Lines 10-25). Note that all the end-vertices of the new wedges containing u are recorded in the array arr_2hop. Then, we count the number of butterflies for Cases 1-3 (Lines 31-41); we skip this stage if there is no new wedge in G_A, to reduce the time cost (Line 28). Inherently, in this stage, we aim to find the number of common neighbors of u and each w ∈ arr_2hop on the original graph. We can scan the neighbor list of each v ∈ N_old(u) to find whether each w ∈ arr_2hop exists; alternatively, we can use a binary search to check whether each w ∈ arr_2hop appears in N_old(v). For each v ∈ N_old(u), we compare |N_old(v)| with |arr_2hop| · log(|N_old(v)|) to choose the more efficient method (Line 32).

Example 4 Consider the affected subgraph G_A in Fig. 10. To count the new butterfly using BFCB-IG+, we first start from u_0 and get the new wedge (u_0, v_0, u_1). Then, we use binary search to find that u_1 is a neighbor of v_1 (i.e., there exists an old wedge (u_0, v_1, u_1)). Compared with BFCB-IG, BFCB-IG+ does not need to traverse any other old wedges. For instance, the old wedge (u_1, v_1, u_2) is not touched, since no new wedge is generated starting from u_1.

Theorem 11
The BFCB-IG + algorithm correctly solves the batch-dynamic butterfly counting problem.
Proof By Lemma 5, Algorithm 9 only counts the new butterflies in G_A. For each vertex u, it first counts the new butterflies with two new wedges (Lines 9-27) and then counts the ones with only one new wedge (Lines 31-41). Specifically, Lines 10-17 process the new wedges in which (u, v) is old and (v, w) is new, and Lines 18-25 process those in which (u, v) is new. Since the new butterflies of Cases 4-9 are formed by two new wedges, Lines 26 and 27 correctly compute the number of such butterflies containing u. In Lines 31-41, the algorithm finds an old wedge (u, v, w) and checks how many new wedges containing u and w exist. Since the butterflies of Cases 1-3 need one old wedge and one new wedge, there are C_total(w) such butterflies corresponding to each old wedge (u, v, w). Thus, the algorithm handles all possible cases of new butterflies, which completes the proof.

Complexity analysis. BFCB-IG+ has the same worst-case time complexity as BFCB-IG. However, it avoids much of the computation of old wedges and can be more efficient, as studied in our experimental evaluations. In addition, BFCB-IG+ has the same space complexity as BFCB-IG (i.e., O(|G| + |B|)), since the additional data structures used in BFCB-IG+ (i.e., arr_2hop, N_old, and N_new) are all bounded by O(|G| + |B|).

Remark
The algorithms BFCB-IG and BFCB-IG+ can both be parallelized similarly to Algorithm 5. Note that it is straightforward to devise a work-efficient parallel algorithm based on BFCB-IG+. In the affected subgraph construction phase, we can scan each new edge in B in parallel to get the vertex set V_B^1; then, we scan the vertices in V_B^1 in parallel to get the vertex set V_B^2. In the butterfly counting phase, we can follow the same paradigm as Algorithm 5 to process the start-vertices in parallel.

Experiments
In this section, we present the results of our empirical studies. In particular, we evaluate the proposed butterfly counting algorithms and, in Sect. 8.3, the algorithms for batch-dynamic butterfly counting, including the baseline algorithm BFCB-BS, the affected-subgraph-based algorithm BFCB-IG, and the advanced affected-subgraph-based algorithm BFCB-IG+.
The algorithms are implemented in C++. We run the empirical studies on a computer with 512 GB main memory and 2 × Intel Xeon E5-2698 processors (2.20 GHz, 640 KB L1I cache, 640 KB L1D cache, 5 MB L2 cache, 50 MB L3 cache, and 40 cores in total). Although most empirical studies are conducted on a single core, we want all our empirical studies to be conducted on the same computer as the evaluation of the parallelization techniques.

Datasets
Twelve datasets are used in our experiments, including all the 9 real datasets in [54]; we add 3 more datasets to evaluate the scalability of our techniques. All these datasets can be downloaded from KONECT (http://konect.cc/). The 9 real-world datasets are DBPedia, Twitter, Amazon, Wiki-fr, Wiki-en, Live-journal, Delicious, Tracker, and Orkut.
To further test the scalability, we also evaluate three bipartite networks (i.e., Bi-twitter, Bi-sk and Bi-uk), which are sub-networks obtained from billion-scale real datasets (i.e., twitter, sk-2005 and uk-2006-05). To obtain bipartite subgraphs from these datasets, we put the vertices with odd IDs in one group and the vertices with even IDs in the other group, and remove the edges formed by two vertices whose IDs are both odd or both even.
In Table 1, we show the statistics of the datasets. Note that Σ_{u∈L} deg(u)^2 and Σ_{v∈R} deg(v)^2 represent the sums of degree squares for L and R, respectively. TC_ibs = min{Σ_{u∈L} deg(u)^2, Σ_{v∈R} deg(v)^2}, which is the time complexity bound of BFC-IBS. TC_new is the time complexity bound of BFC-VP and BFC-VP++.

Evaluation of butterfly counting algorithms
In this section, the performance of the proposed algorithms for butterfly counting is evaluated.
Evaluating the performance on all datasets. Figure 11 shows the performance of BFC-IBS, BFC-VP, and BFC-VP++ on different datasets. We can observe that BFC-VP++ is the most efficient algorithm, while BFC-VP also outperforms BFC-IBS. This is because BFC-VP++ utilizes both the vertex-priority-based optimization and the cache-aware strategies, which significantly reduce the computation cost. On Tracker, BFC-VP and BFC-VP++ are faster than BFC-IBS by at least two orders of magnitude. On Bi-twitter, Bi-sk and Bi-uk, BFC-IBS cannot finish within 10 h. This is because the degree distributions of these datasets are skewed and high-degree vertices exist in both layers; for instance, TC_ibs is more than 100× larger than TC_new on Tracker. This property leads to a large number of wedges processed by BFC-IBS, while our BFC-VP and BFC-VP++ algorithms handle such situations efficiently. In Fig. 12, we also show the performance for computing ⋈_e for each edge e; the performance differences of these algorithms follow trends similar to those in Fig. 11.
Evaluating the number of processed wedges. Figure 13 shows the number of processed wedges of the proposed algorithms. We can observe that on the Tracker, Bi-twitter, Bi-sk and Bi-uk datasets, BFC-IBS needs to process 100× more wedges than BFC-VP and BFC-VP++. This is because TC_ibs is much larger than TC_new and there exist many hub-vertices in both L and R in these datasets. Thus, BFC-VP and BFC-VP++ only need to process a limited number of wedges, while BFC-IBS must process numerous wedges no matter which layer it starts from. We also observe that BFC-VP and BFC-VP++ process the same number of wedges, since BFC-VP++ improves BFC-VP on cache performance only, which does not change the number of processed wedges.
Scalability. Figure 14 studies the scalability of the algorithms by varying the number of vertices n. Specifically, we randomly sample 20-100% of the dataset's vertices and retrieve the induced subgraphs of the selected vertices. It is observed that BFC-VP and BFC-VP++ are scalable, and the computational cost of all the algorithms increases as the percentage of vertices increases. On Bi-twitter, BFC-IBS can only complete when n = 20%. As discussed before, BFC-VP++ is the most efficient algorithm.

Figure 15 studies the effect of graph density (or sparsity) by varying the number of edges m. Specifically, we fix all the vertices and randomly sample 20-100% of the edges in the datasets. We can observe that when the graph is sparse, the performance gap between the vertex-priority-based algorithms and the baseline algorithm is relatively small. This is because the number of high-degree vertices and the total number of butterflies are relatively low in a sparse graph. When the graph becomes denser, the advantage of the vertex-priority-based algorithms becomes more apparent.
Evaluating the parallelization techniques. Figure 16 studies the performance of the BFC-IBS, BFC-VP, and BFC-VP++ algorithms in parallel by varying the thread number t from 1 to 32 on four datasets. Note that we are actually increasing the number of physical cores when increasing the number of threads. The BFC-IBS algorithm is not parallel-friendly: for example, on Tracker, the parallel BFC-IBS algorithm performs worse when t increases from 16 to 32, and on Bi-twitter it cannot get a result within the timeout threshold when t = 1 and t = 8. We can also observe that, on all these datasets, the parallel BFC-VP and BFC-VP++ algorithms scale well as t increases.

Evaluating the cache-aware strategies. In Tables 2, 3, 4 and 5, we evaluate the cache-aware strategies on Wiki-en, Delicious, Tracker and Bi-twitter, respectively. Here, Cache-ref denotes the total number of cache accesses, Cache-m denotes the total number of cache misses, Cache-mr denotes the percentage of cache references missed over the total number of cache accesses, and Time denotes the computation time of the different algorithms. BFC-VP+ is the BFC-VP algorithm deployed with only the cache-aware wedge processing strategy, and BFC-VPC is the BFC-VP algorithm deployed with only the graph reordering strategy. BFC-VP has the largest number of cache misses on all the datasets. By utilizing the cache-aware wedge processing, BFC-VP+ reduces the number of cache misses by over 50% on all the datasets compared with BFC-VP. By utilizing the cache-aware reordering, BFC-VPC also reduces the cache misses by over 50% on all the datasets compared with BFC-VP. BFC-VP++ achieves the smallest number of cache misses and reduces the cache miss ratio significantly on all these datasets, since it combines the two cache-aware strategies. Compared with BFC-VP, BFC-VP++ reduces the cache misses by more than 70% on all the testing datasets.
Speeding up the approximate butterfly counting algorithm.
In the approximate algorithm BFC-ESap [54], the exact butterfly counting algorithm BFC-IBS serves as a building block to count the butterflies exactly in a sampled graph. Since BFC-VP++ and BFC-IBS both count the number of butterflies exactly, the approximate algorithm BFC-ESap_vp++ can be obtained by applying BFC-VP++ in BFC-ESap without changing the theoretical guarantee. In Fig. 17, we first evaluate the average running time per iteration of BFC-ESap and BFC-ESap_vp++ by varying the sampling probability p. Comparing the two approximate algorithms, BFC-ESap_vp++ outperforms BFC-ESap under all the settings of p on the Tracker and Bi-twitter datasets. In particular, on these two datasets, BFC-ESap_vp++ is more than one order of magnitude faster than BFC-ESap when p ≥ 0.062.
In the second experiment, we run the algorithms to yield the theoretical guarantee Pr[|⋈̂_G − ⋈_G| > ε · ⋈_G] ≤ δ as shown in [54]. We vary ε and fix δ = 0.1 and p as the optimal p suggested in [54]. We can see in Fig. 18 that, for these two approximate algorithms, the time costs increase on these two datasets in order to obtain better accuracy.

Evaluating the external memory algorithm. In Fig. 19, we evaluate the scalability of the external memory algorithm BFC-EM on two large datasets, Bi-sk and Bi-uk, by varying the graph size n. We limit the memory size to 1 GB in our evaluation. On Bi-sk and Bi-uk, we can see that the time cost and the I/O cost both increase as the percentage of vertices increases.
Cache-aware graph reordering versus Gorder. In [70], the authors propose Gorder to reduce cache misses in graph algorithms. Here, we replace the graph reordering strategy with Gorder in BFC-VP++ and evaluate the performance. Table 6 shows the time cost. We can observe that the renumbering time of our graph reordering is much less than that of Gorder on all datasets. This is because the graph reordering can be obtained simply according to the priority numbers of the vertices, while Gorder needs complex renumbering computation. Regarding the computation time, the algorithm with graph reordering outperforms the algorithm with Gorder on 9 datasets, while the algorithm with Gorder is better on 3 datasets. Overall, the total time cost with graph reordering is lower than with Gorder. Table 7 shows the cache statistics. Firstly, the two strategies have similar numbers of cache references, since the renumbering process does not change the algorithm itself. Secondly, graph reordering achieves better CPU performance than Gorder on almost all datasets (i.e., fewer cache misses and lower cache miss ratios on 9 datasets) when handling the butterfly counting problem with BFC-VP++. In summary, our graph reordering strategy is more suitable for butterfly counting.

Evaluation of batch-dynamic butterfly counting
In this subsection, we evaluate the performance of BFCB-BS, BFCB-IG, and BFCB-IG+ for solving the batch-dynamic butterfly counting problem. For each test, we randomly extract a proportion b of edges from the graph as the newly inserted edges of the batch-update, and the graph consisting of the remaining edges serves as the original graph. By default, we set the batch size b = 0.5% and the number of threads t = 1.
Evaluating the performance on all datasets. Figure 20 shows the performance of the BFCB-BS, BFCB-IG, and BFCB-IG+ algorithms on all datasets with b and t set to their default values. We can observe that BFCB-IG and BFCB-IG+ outperform BFCB-BS on almost all the large datasets. The BFCB-IG and BFCB-IG+ algorithms are faster than BFCB-BS by at least one order of magnitude on Wiki-fr, Live-journal, Wiki-en, Orkut, Bi-twitter, Bi-sk, and Bi-uk. In particular, on Tracker, the BFCB-IG+ algorithm is faster than the BFCB-BS algorithm by at least two orders of magnitude. In addition, BFCB-IG+ is faster than BFCB-IG on all the datasets, since it is designed to avoid more of the computation on existing wedges than BFCB-IG. Note that both BFCB-IG and BFCB-IG+ are slightly slower than BFCB-BS on small datasets (i.e., DBPedia, Twitter, and Amazon). This is because, when the datasets are small, the number of updated edges (determined by b) is also very small. Thus, the updating process of BFCB-BS finishes efficiently without counting a large number of butterflies, while BFCB-IG and BFCB-IG+ must first construct the affected graph. In the following experiments, we omit the BFCB-BS algorithm, since BFCB-IG and BFCB-IG+ significantly outperform it, as evaluated here.
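To see why restricting computation to the affected part of the graph pays off, consider the per-edge view: the butterflies touched by a single update edge (u, v) can be counted from the wedges around it. The following sketch is our own simplified formulation, not the BFCB pseudocode; the adjacency-map representation is an assumption of the sketch.

```python
def butterflies_through_edge(adjL, adjR, u, v):
    """Count the butterflies that contain edge (u, v), assuming (u, v) is
    already present in both adjacency maps.
    adjL: left vertex -> set of right neighbors; adjR: the reverse map."""
    total = 0
    for u2 in adjR[v]:
        if u2 == u:
            continue
        # common right neighbors of u and u2, excluding v itself;
        # each such neighbor v2 closes a butterfly (u, u2, v, v2)
        total += len(adjL[u] & adjL[u2]) - 1
    return total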
Evaluating the effect of t. Figure 21 shows the results of the BFCB-IG and BFCB-IG+ algorithms in parallel, varying the thread number t from 1 to 32 on 4 large datasets (with b = 0.5%). We also evaluate the breakdown of execution time (i.e., the execution time of the different stages), including the affected subgraph construction stage (-GC) and the butterfly counting stage (-Count). Note that both stages run in parallel. We can see that the parallelization techniques improve the efficiency of both BFCB-IG and BFCB-IG+ significantly. In addition, the speedup of the butterfly counting stage is higher than that of the affected subgraph construction stage for both algorithms. This is because there are few write conflicts in the butterfly counting stage, whereas the updating/writing operations needed to construct the affected subgraph may incur extra latency when the same memory location is updated simultaneously by multiple threads.
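The gap between the two stages reflects a general pattern: a stage that only reads the graph and accumulates into per-thread local sums has no write conflicts, while shared writes serialize on contended memory locations. A toy illustration of the conflict-free accumulation pattern (purely illustrative; the actual implementation uses lower-level parallel primitives):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(items, work, t=4):
    """Each worker processes a private chunk and accumulates into its own
    local sum (no shared writes); the partial sums are reduced at the end."""
    chunks = [items[i::t] for i in range(t)]
    with ThreadPoolExecutor(max_workers=t) as ex:
        return sum(ex.map(lambda chunk: sum(work(x) for x in chunk), chunks))
```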
Evaluating the effect of b. Here, we evaluate the effect of the batch size b by varying b from 0.01 to 2% of the original number of edges (m) in the dataset. We first show the number of affected butterflies in Fig. 22. We observe that the number of affected butterflies can be very large even when the batch size is small. For instance, on Bi-uk, the number of affected butterflies reaches 3.80 × 10^13 with b = 2% (the original butterfly count on Bi-uk is 4.89 × 10^14, as shown in Table 1). In Fig. 23, we show the processing time of the BFCB-IG and BFCB-IG+ algorithms when varying b. We can see that BFCB-IG+ consistently outperforms BFCB-IG for all values of b. In particular, as b becomes smaller, the margin between BFCB-IG+ and BFCB-IG becomes larger. This is because BFCB-IG+ reduces the traversal of wedges in the original graph and takes advantage of the small number of updated edges.
Evaluating the deletion cases. In this part, we evaluate the performance of the algorithms for both insertion and deletion cases.
(1) Firstly, we evaluate the performance of the algorithms on Delicious, Tracker, Bi-sk, and Bi-uk by setting b = 2% and b = 10% in Fig. 24. In deletion cases, we randomly extract a proportion b of edges from the input graph and consider these edges as the deleted edges. When using BFCB-IG and BFCB-IG+ to handle deletion cases, we denote the two algorithms as Del-IG and Del-IG+, respectively. We can observe that the overhead for deletion cases is relatively small.
(2) Secondly, we run the algorithms on a dynamic graph, MovieLens, which contains 162,000 users (U), 62,000 movies (L), 25,000,095 ratings (E), and the timestamps of the ratings. In real scenarios, edge insertion/deletion operations usually run in a sliding-window manner. Thus, in insertion cases, we consider a proportion b of edges with the latest timestamps as the inserted edges.
In deletion cases, we consider a proportion b of edges with the oldest timestamps as the deleted edges. The performance of the algorithms is shown in Fig. 25. We can see that the time cost of the algorithms increases as b increases in both insertion and deletion cases. In addition, the overhead for deletion cases also grows when b becomes large. Note that, compared to insertion cases, the total time cost for handling deletion cases is only slightly higher, since the dominant cost in the algorithms is maintaining the number of butterflies.

Related work
Motif counting in unipartite graphs. The triangle is the smallest non-trivial cohesive structure, and there are extensive studies on counting triangles in the literature [5,9,16,19,31,32,38,57,58,60-62]. However, butterfly counting is inherently different from triangle counting for two reasons: (1) the number of butterflies may be significantly larger than the number of triangles (O(m^2) versus O(m^1.5) in the worst case), and (2) the structures themselves are different. Thus, the existing triangle counting techniques are not applicable to efficient butterfly counting: the existing techniques for counting triangles (e.g., [19,60]) are based on enumerating all triangles, and such enumeration is not affordable for counting butterflies due to their quadratic number O(m^2). There are also studies [33,34,52] focusing on other cohesive structures such as 4-vertex and 5-vertex motifs. In [6], the authors propose a generic matrix-multiplication-based algorithm for counting cycles of length k (3 ≤ k ≤ 7) in O(n^2.376) time and O(n^2) space. While the algorithm in [6] can be used to solve our problem, it cannot process large graphs due to its space and time complexity. As shown in [64], the algorithm in [64] significantly improves over [6], while our algorithm significantly improves over [64].
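The improvement over [64] comes from the vertex-priority paradigm: every wedge is charged to its highest-priority endpoint, so each butterfly is counted exactly once, at its highest-priority vertex, and wedge expansion only passes through lower-priority midpoints. A simplified in-memory sketch (dictionary-based, with comparable vertex IDs spanning both layers; an illustration of the idea rather than the paper's pseudocode):

```python
def bfc_vp(adj):
    """Vertex-priority butterfly counting sketch. For each vertex u, wedges
    u - w - v are expanded only through midpoints w and endpoints v of lower
    priority than u (priority = degree, ties broken by vertex id), so each
    butterfly is counted once, at its highest-priority vertex."""
    def pri(x):
        return (len(adj[x]), x)

    total = 0
    for u in adj:
        cnt = {}  # other endpoint v -> number of wedges u - w - v seen
        for w in adj[u]:
            if pri(w) < pri(u):                       # low-priority midpoint
                for v in adj[w]:
                    if v != u and pri(v) < pri(u):    # low-priority endpoint
                        cnt[v] = cnt.get(v, 0) + 1
        # c wedges between u and v close into C(c, 2) butterflies
        total += sum(c * (c - 1) // 2 for c in cnt.values())
    return total
```

Because high-degree vertices never serve as midpoints for wedges charged to lower-priority starts, the number of expanded wedges drops sharply on skewed degree distributions.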
Bipartite graphs. Some studies have been conducted on motifs such as the 3 × 3 biclique [14] and the 4-path [47]. These structures are different from the butterfly, and thus these works cannot be used to solve the butterfly counting problem either. Algorithms for accurate estimation of the number of butterflies in graph streams are studied in [55]. The authors of [40] propose an approach that applies sampling and sketching techniques for approximate butterfly counting in graph streams, which outperforms the algorithms in [55]. Independently from our work, the authors of [59] study the parallel implementation of butterfly counting algorithms. [59] mainly focuses on parallel algorithms for butterfly computations, including butterfly counting and wing/tip decomposition, while we propose efficient butterfly counting algorithms under different settings, including in-memory, parallel, I/O-efficient, and batch-dynamic algorithms. When implementing the parallel butterfly counting algorithms, [59] uses different data structures from ours. [59] also provides a more detailed theoretical analysis of the work/span of the parallel algorithms. Based on the butterfly structure, the k-bitruss (or k-wing) and k-tip decomposition problems are studied in the literature [56,59,67,68,73]. Recently, the authors of [36] propose a new coarse-to-fine framework for parallel tip decomposition.

Dynamic graphs.
There are many existing studies on triangle counting [15,28,61], butterfly counting [44], and other motif counting [11,46] on dynamic graphs. Stefani et al. [61] propose a streaming algorithm for triangle counting on dynamic graphs, considering both edge insertions and deletions. Bulteau et al. [15] introduce a one-pass algorithm for estimating triangle counts. In [28], an edge-sampling-based framework is proposed to augment traditional models for triangle counting and to improve computational and memory costs. In [44], the authors present a distributed implementation of butterfly counting in dynamic environments. In [46], the authors propose a suite of approaches to count motifs by calculating the number of embeddings of each motif. In addition, counting multi-layer temporal motifs on dynamic graphs is studied in [11]. In the literature, there are also various works that consider batch-dynamic settings [1,8,23,27,43]. Ediger et al. [23] present an algorithm for computing clustering coefficients in batch-dynamic graphs. A batch algorithm for the single-source dynamic shortest path problem is proposed in [8]. In [1], the authors present a parallel algorithm for batch-dynamic connectivity problems. To improve computational efficiency, the authors of [27] propose GPU-based algorithms under the batch-dynamic setting. Makkar et al. [43] study exact triangle counting in batch-dynamic graphs.
Graph ordering. There are some studies on accelerating specific graph algorithms using graph ordering. Then et al. [63] optimize BFS algorithms. Park et al. [51] improve the CPU cache performance of many classic graph algorithms such as Bellman-Ford and Prim. The authors of [29] present a suite of approaches to accelerate set intersections in graph algorithms. Since these techniques are specific to the problems studied, they are not applicable to butterfly counting.
In the literature, there are also works studying general graph ordering methods to speed up graph algorithms [7,10,12,13,18,22,35,70]. In the experiments, we show that our cache-aware techniques outperform the state-of-the-art technique [70]; that is, our cache-aware strategy is more suitable for counting butterflies.

Conclusion
We study the butterfly counting problem in this paper. We propose a vertex-priority-based butterfly counting algorithm BFC-VP, which can effectively handle high-degree vertices. We also propose the cache-aware butterfly counting algorithm BFC-VP++, which improves the CPU cache performance of BFC-VP with two cache-aware strategies. Efficient butterfly counting algorithms for batch-dynamic graphs are also investigated. We conduct comprehensive experimental evaluations, and the results show that our vertex-priority-based solutions significantly outperform the baseline algorithms.