1 Introduction

Centrality is one of the most important concepts in the analysis of social networks. Among centrality measures, one of the most popular is betweenness centrality [1, 6]. The betweenness of a node is a measure of the control this node has on the communication paths in the network. Therefore, it can be used to rank nodes according to their relative importance in a graph. Betweenness has been used effectively in a variety of applications, such as: design and control of communications networks [15], traffic monitoring [13], identifying key actors in terrorist networks [11], finding essential proteins [8], and many others.

Computing the betweenness of all nodes in a network has a high computational cost, so efficiency is the target of much related research. Nowadays, most graphs are inherently dynamic. When a graph undergoes small changes, recomputing betweenness from scratch would be very inefficient. Therefore, dynamic algorithms capable of computing betweenness faster by reusing previous computations have been proposed [10, 12]. None of these is better than Brandes [3] (brandes) in the worst case, and there is evidence that this is likely very hard to overcome [16]. Despite that, good speedups on typical instances have been achieved [2, 7].

In this work, we focus on the exact computation of betweenness centrality in incremental graphs. While they do not allow edges to be deleted, incremental graphs cover some important applications, as several authors have pointed out before [2, 9, 12]. Two recently proposed algorithms deal with the same problem, obtaining better performance than previous work, so we compare against them:

  1. icentral [7] works on undirected connected graphs and allows edges to be both deleted and inserted. It stores only the betweenness of all nodes of the graph, so its memory requirement is linear. It first decomposes the graph into biconnected components and then updates the betweenness of the nodes in the component affected by the update. The article proves that, for undirected graphs, betweenness can change only for nodes in the affected component. Its time complexity depends heavily on the size of the affected biconnected component.

  2. ibet [2] works on directed graphs and allows edges to be inserted. It stores all pairwise distances, so its memory requirement is quadratic. It first identifies efficiently all pairs of nodes whose distance or number of shortest paths is affected by the update, and then applies an optimized procedure to compute the betweenness changes for the affected nodes. Experiments showed it outperforms previous approaches that require quadratic memory.

In this paper we present a space-efficient algorithm to compute the betweenness centrality of all nodes in a directed incremental network. Its space complexity is linear in the size of the input graph and its time complexity is similar to that of icentral. In the worst case, it is equivalent to recalculating betweenness in the biconnected component where the added edge resides, plus some linear overhead. To the best of our knowledge, it is the first algorithm that computes betweenness centrality in incremental directed graphs, performs better than recomputation, and at the same time has less than quadratic space complexity. Moreover, it works on disconnected graphs, a detail usually left out by previous approaches but important in real-world applications.

In the next section we define betweenness, biconnected components, and incremental algorithms. In Sect. 3 we present the proposed algorithm, prove its correctness, and analyze its space and time complexity. In Sect. 4 we report the experimental validation of our algorithm. We close with the conclusions and references.

2 Preliminaries

For simplicity, we consider directed, simple, unweighted graphs. In the following, \(G = (V, E)\) denotes a graph with n nodes and m edges.

2.1 Betweenness

Betweenness centrality of a node is formally defined by the following formula:

$$\begin{aligned} C_B(v) = \sum _{\begin{array}{c} s \ne v, t \ne v\\ s,t \in V \end{array}} {\frac{\sigma _{st}(v)}{\sigma _{st}}} \end{aligned}$$
(1)

where \(\sigma _{st}(v)\) is the number of shortest paths from s to t passing through v and \(\sigma _{st}\) is the number of shortest paths from s to t. A naive algorithm using this formula has \(\mathcal {O}(n^3)\) complexity.
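For concreteness, formula (1) can be evaluated directly by counting shortest paths with one BFS per node and using the identity \(\sigma _{st}(v) = \sigma _{sv} \cdot \sigma _{vt}\) whenever \(d(s,v) + d(v,t) = d(s,t)\). The following pure-Python sketch (function names are ours) is meant only to illustrate the definition, not to be efficient:

```python
from collections import deque

def bfs_counts(adj, s):
    """Distances and shortest-path counts from s in an unweighted digraph."""
    dist = {s: 0}
    sigma = {s: 1}
    q = deque([s])
    while q:
        u = q.popleft()
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                q.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma

def betweenness_naive(adj):
    """Direct evaluation of formula (1): sigma_st(v) = sigma_sv * sigma_vt
    whenever d(s,v) + d(v,t) = d(s,t); unreachable pairs contribute 0."""
    nodes = set(adj) | {w for vs in adj.values() for w in vs}
    info = {s: bfs_counts(adj, s) for s in nodes}
    cb = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        dist_s, sig_s = info[s]
        for t in dist_s:
            if t == s:
                continue
            for v in nodes:
                if v in (s, t):
                    continue
                dist_v, sig_v = info[v]
                if v in dist_s and t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    cb[v] += sig_s[v] * sig_v[t] / sig_s[t]
    return cb
```

After the n BFS runs, the triple loop dominates, giving the cubic behavior mentioned above.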

In [3] Brandes showed a more efficient way to calculate betweenness values:

$$\begin{aligned} C_B(v) = \displaystyle \sum _{\begin{array}{c} s \ne v, s \in V \end{array}} \delta _{s\cdot }(v) \end{aligned}$$
(2)

where \(\delta _{s\cdot }(v) = \sum _{\begin{array}{c} t \ne v, t \in V \end{array}} \frac{\sigma _{st}(v)}{\sigma _{st}}\). Using this formula, betweenness values can be computed in time \(\mathcal {O}(n \cdot m)\), by running a BFS (Breadth-First Search [4]) from each node and computing the required values (distances, \(\sigma \), \(\delta \)). For a complete explanation see [3].
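The two phases of Brandes' algorithm (one BFS per source, then a reverse accumulation of the dependencies \(\delta _{s\cdot }\)) can be sketched in pure Python as follows. This is a didactic sketch with names of our choosing, not the implementation used in our experiments:

```python
from collections import deque

def brandes(adj):
    """Brandes' algorithm for a directed, unweighted graph given as an
    adjacency dict: one BFS per source, then dependencies delta_s.(v)
    from formula (2) are accumulated in reverse BFS order."""
    nodes = set(adj) | {w for vs in adj.values() for w in vs}
    cb = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        # BFS phase: distances, path counts, predecessors, visit order.
        dist = {s: 0}
        sigma = dict.fromkeys(nodes, 0)
        sigma[s] = 1
        pred = {v: [] for v in nodes}
        order = []
        q = deque([s])
        while q:
            u = q.popleft()
            order.append(u)
            for w in adj.get(u, ()):
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    pred[w].append(u)
        # Accumulation phase: process nodes farthest from s first.
        delta = dict.fromkeys(nodes, 0.0)
        for w in reversed(order):
            for u in pred[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                cb[w] += delta[w]
    return cb
```

Each source costs one BFS plus a linear accumulation, which is where the \(\mathcal {O}(n \cdot m)\) bound comes from.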

2.2 Biconnected Components

Biconnected components were first proposed as a good heuristic for speeding up betweenness computations in [14], and more recently in the context of dynamic graphs in [7]. We will make use of the following definitions:

Definition 1

Let G be an undirected graph. A biconnected component of G is a maximal connected induced subgraph A such that the removal of any single node does not disconnect A.

Definition 2

Any node belonging to more than one biconnected component is called an articulation point.
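Both notions can be computed together with the classic Tarjan DFS. The following pure-Python sketch (names are ours; recursive, so suitable for small graphs only) returns the components as sets of edges along with the articulation points:

```python
def biconnected_components(adj):
    """Tarjan-style DFS over an undirected graph given as an adjacency
    dict. Returns the biconnected components (as sets of frozenset edges)
    and the set of articulation points."""
    disc, low = {}, {}
    comps, cuts = [], set()
    edge_stack = []
    timer = [0]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        children = 0
        for w in adj[u]:
            if w == parent:
                continue
            if w not in disc:            # tree edge
                edge_stack.append((u, w))
                children += 1
                dfs(w, u)
                low[u] = min(low[u], low[w])
                if low[w] >= disc[u]:
                    # u separates w's subtree: close one component.
                    if parent is not None or children > 1:
                        cuts.add(u)
                    comp = set()
                    while True:
                        e = edge_stack.pop()
                        comp.add(frozenset(e))
                        if e == (u, w):
                            break
                    comps.append(comp)
            elif disc[w] < disc[u]:      # back edge
                edge_stack.append((u, w))
                low[u] = min(low[u], disc[w])

    for v in adj:
        if v not in disc:
            dfs(v, None)
    return comps, cuts
```

Returning the components as edge sets makes it easy to later locate the component that contains a given edge.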

2.3 Incremental Graphs

We call a dynamic graph incremental if edges can be inserted but not deleted. As previously mentioned, in this work we focus on incremental graphs. Computing betweenness in this setting is usually done in two steps. In the first step some preprocessing is done and the initial betweenness is computed. Then, after each edge insertion, the betweenness is updated. The two steps can have different time complexities, and both characterize the time complexity of an incremental algorithm. All algorithms mentioned here have the same complexity in the first step (the same as brandes), so in comparisons we take only the update step into account.

3 Algorithm

The proposed algorithm is a generalization of icentral to deal with directed graphs.

Definition 3

Let G be a graph, and let \(G^*\) be the graph G after inserting a new edge (u, v). We define the affected component as the biconnected component of the undirected version of \(G^*\) to which the newly inserted edge (u, v) belongs.

The main obstacle in generalizing icentral is that, in directed graphs, when an edge is inserted, the betweenness values of nodes outside the affected component can change as well. In the next theorem we prove a formula that allows those changes to be computed efficiently.

Theorem 1

Let x be a node outside the affected component A, and let s be the articulation point of A whose removal disconnects x from A. Then, after the update, the betweenness of x changes by

$$\begin{aligned} \delta _s(x) \cdot (\text {reach}^*(s) - \text {reach}(s)) + \delta ^r_s(x) \cdot (\text {reach}^{*r}(s) - \text {reach}^r(s)) \end{aligned}$$
(3)

where reach(s) is the number of nodes z such that some shortest path from z to x passes through A (and hence through s), the superscript r means the function is applied to the reversed graph, and the superscript \(*\) means the function is applied to the updated graph.

Proof

For the sake of clarity, let us rename the variables in the definition of betweenness (1):

$$\begin{aligned} C_B(x) = \sum _{\begin{array}{c} a \ne x, b \ne x\\ a,b \in V \end{array}} {\frac{\sigma _{ab}(x)}{\sigma _{ab}}} \end{aligned}$$
(4)

In the sum on the right, the only terms that can change after an update are those in which a and b lie in different biconnected components and all shortest paths from a to b pass through A. Therefore, all these paths must pass through s. Two cases may then occur, according to the relative order of s and x on the paths from a to b that go through x:

  1. \(a, s, x, b \implies \frac{\sigma _{ab}(x)}{\sigma _{ab}} = \frac{\sigma _{as}\sigma _{sx}\sigma _{xb}}{\sigma _{as}\sigma _{sb}} = \frac{\sigma _{sx}\sigma _{xb}}{\sigma _{sb}} = \frac{\sigma _{sb}(x)}{\sigma _{sb}} \)

  2. \(a, x, s, b \implies \frac{\sigma _{ab}(x)}{\sigma _{ab}} = \frac{\sigma _{ax}\sigma _{xs}\sigma _{sb}}{\sigma _{as}\sigma _{sb}} = \frac{\sigma _{ax}\sigma _{xs}}{\sigma _{as}} = \frac{\sigma _{as}(x)}{\sigma _{as}}\)

Therefore, the terms that can change equal:

$$\begin{aligned} \sum _{a, b \text { in case 1}} {\frac{\sigma _{sb}(x)}{\sigma _{sb}}} + \sum _{a, b \text { in case 2}} {\frac{\sigma _{as}(x)}{\sigma _{as}}} = \text {reach}(s)\cdot \delta _s(x) + \text {reach}^r(s)\cdot \delta ^r_s(x) \end{aligned}$$
(5)

Since x lies outside A and the inserted edge lies inside A, the dependencies \(\delta _s(x)\) and \(\delta ^r_s(x)\) depend only on shortest paths lying outside A (apart from their endpoint s), so they are unchanged by the update. Subtracting (5) evaluated before the update from its value after the update yields (3), and the theorem follows.

Following Theorem 1, pseudocode for the function that updates betweenness outside A is shown in Algorithm 3. It then remains to update betweenness inside the component; this can be done as in icentral, and is shown in Algorithm 2. The Brandes-like function in lines 8 and 9 computes the delta values in the affected component, as in icentral, using the reach\(_o^r\) values to add the contribution of nodes outside the affected component. r and \(*\) have the same meaning as in Theorem 1. The pseudocode of the complete proposed algorithm is shown in Algorithm 1.

[figures a, b: algorithm pseudocode]
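To make the role of formula (3) concrete, the outside-of-A update can be sketched in Python as follows. This is a sketch only: every name is ours, not taken from the pseudocode, and all inputs (the map from each outside node to its articulation point, the \(\delta \) dependencies, and the reach counts before and after the insertion) are assumed to have been computed by the earlier steps of the algorithm.

```python
def apply_theorem1(cb, art_of, delta, delta_r,
                   reach_old, reach_new, reach_r_old, reach_r_new):
    """Add to cb[x], for every node x outside the affected component,
    the change given by formula (3):
        delta[s][x] * (reach*(s) - reach(s))
          + delta_r[s][x] * (reach*r(s) - reach_r(s))
    where s = art_of[x] is the articulation point separating x from A."""
    for x, s in art_of.items():
        cb[x] += (delta[s][x] * (reach_new[s] - reach_old[s])
                  + delta_r[s][x] * (reach_r_new[s] - reach_r_old[s]))
    return cb
```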

3.1 Complexity

The overall space complexity is linear (in the size of the graph) as the algorithm only uses a constant number of arrays with linear size (\(C_B\), the different variants of reach, A, \(A^*\), and the different variants of \(\delta \)). Only \(C_B\) and the graph itself persist across updates.

The time complexity of the proposed algorithm (Algorithm 1) equals the complexity of finding biconnected components (linear), plus the complexity of Algorithm 2, plus that of Algorithm 3. Let \(n_A\) and \(m_A\) be the number of nodes and edges, respectively, in the affected component. Algorithm 2 has exactly the same complexity as icentral, namely \(\mathcal {O}(n + m + |Sr| \cdot (n_A + m_A))\), where Sr is the set of affected sources (as defined in [7]).

In Algorithm 3, for a given s, all variants of reach (lines 3 and 7) can be computed with a BFS in time \(\mathcal {O}(n_A + m_A)\), as no computation outside A is needed at this point. Since any node outside A has at most one corresponding articulation point s, in lines 4, 5, 6, 8, 9, and 10 each node and edge of the graph appears at most once, so the total cost of these lines is \(\mathcal {O}(n + m)\). Summing up, the complexity of Algorithm 3 is \(\mathcal {O}(n + m + |\text {articulation-points}(A)| \cdot (n_A + m_A))\).

Overall, using that A has at most \(n_A\) articulation points and at most \(n_A\) affected sources, the complexity of the proposed algorithm is \(\mathcal {O}(n + m + n_A \cdot (n_A + m_A))\), matching that of icentral.

[figure c: algorithm pseudocode]

3.2 Notes

It is possible to modify the proposed algorithm slightly to work on graphs with arbitrary positive weights, by using Dijkstra's algorithm [5] instead of BFS. In graphs with multiple edges, each set of parallel edges can be replaced by the one with the smallest weight, yielding a simple graph with the same betweenness.
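The parallel-edge reduction can be sketched as follows (a minimal sketch; the function name and edge-list representation are ours):

```python
def collapse_parallel_edges(edges):
    """For a weighted multigraph given as (u, v, w) triples, keep only the
    lightest edge between each ordered pair of nodes. Shortest-path
    distances, and hence betweenness, are unchanged."""
    best = {}
    for u, v, w in edges:
        if (u, v) not in best or w < best[(u, v)]:
            best[(u, v)] = w
    return [(u, v, w) for (u, v), w in best.items()]
```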

On the other hand, it is straightforward to parallelize the most time-consuming part of the algorithm: the computation of the betweenness changes inside the affected component. Since the \(\delta \) values with respect to the affected sources are computed independently, these computations could be carried out by different nodes in a parallel environment, where good speedups, similar to those in [7], are expected.

4 Experiments

We experimentally evaluate the proposed algorithm by measuring time and memory, and comparing it with icentral, ibet and brandes. All algorithms were implemented in pure Python, and graphs were stored and manipulated using the Python library NetworkX. The algorithms were run on a 64-bit GNU/Linux machine with an Intel(R) Core(TM) i3-4160 CPU @ 3.60 GHz and 5 GB of main memory.

The datasets used for experimentation were taken from online sources, some of them already referenced in [2] or [7]; p2p-Gnutella08, Wiki-Vote, and CollegeMsg were taken from the SNAP graph collection. The description of the data is shown in Table 1.

Table 1. Statistics of graph datasets (lbc refers to largest biconnected component)

For each graph, we randomly selected 100 edges not already contained in the graph, and measured the average time and maximum memory used by each algorithm to update the betweenness of all nodes as each edge is inserted. For algorithms that work on directed graphs, when testing on an undirected one, each edge was transformed into two edges, one in each direction. Results are shown in Table 2. Note that icentral cannot be tested on directed graphs.
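The edge-sampling step of this setup can be sketched as follows (the function name and seed handling are ours; the adjacency dict is assumed to have a key for every node):

```python
import random

def sample_absent_edges(adj, k, seed=0):
    """Sample k distinct directed pairs (u, v), u != v, that are not
    already edges of the graph, mirroring the experimental setup of
    inserting edges absent from the original graph."""
    rng = random.Random(seed)
    nodes = list(adj)
    present = {(u, v) for u in adj for v in adj[u]}
    out = set()
    while len(out) < k:
        u, v = rng.choice(nodes), rng.choice(nodes)
        if u != v and (u, v) not in present:
            out.add((u, v))
    return list(out)
```

Rejection sampling is adequate here because the test graphs are sparse, so almost every random pair is absent.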

Table 2. Results, time given in seconds and memory in MBytes.

As expected, our algorithm and icentral perform very similarly, in both time and memory, and are consistently faster than brandes. The speedup depends heavily on the size of the affected component. The best performance relative to brandes was obtained on the Eva dataset, where the number of nodes in the largest biconnected component is relatively small. On average, our algorithm is between 2 and 3 times faster than brandes.

On the other hand, ibet is the fastest on most datasets, but its memory usage is very high, making it very expensive for graphs with tens of thousands of nodes. Also note that, on datasets like Eva, ibet is outperformed by icentral and our proposal, stressing the relevance of algorithms that exploit the biconnected component decomposition.

5 Conclusions

In this work, an algorithm for computing betweenness in incremental directed graphs has been proposed. Its memory usage is linear, allowing it to scale to large graphs. Its time complexity is similar to that of the algorithm proposed in [7], despite handling the more general case of directed graphs. Experiments show it can be a practical replacement for brandes on directed and undirected graphs, especially when quadratic memory usage is not feasible due to large inputs.

As future work we plan to conduct experiments with a distributed and parallel implementation of the proposed algorithm. Also, we will extend the proposed algorithm to work with edge deletions. Moreover, it seems possible to apply some of the optimizations proposed in [2] to update betweenness values inside the affected biconnected component.