# Encyclopedia of Social Network Analysis and Mining

Living Edition
| Editors: Reda Alhajj, Jon Rokne

# Path-Based and Whole-Network Measures

Living reference work entry
DOI: https://doi.org/10.1007/978-1-4614-7163-9_241-1

## Keywords

Undirected Graph; Social Network Analysis; Cluster Coefficient; Betweenness Centrality; Geodesic Distance

## Glossary

Betweenness centrality

A measure of the proportion of shortest paths in a network passing through a specific node or edge.

Closeness centrality

A measure of how close a node is to all the other nodes of a network.

Clustering coefficient

A measure of how much nodes tend to form groups in a network.

Diameter

The maximum distance between two nodes.

Direct connection

An edge between two nodes, usually indicating the existence of a specific relationship, e.g., a friendship between two individuals.

Dyad

A group of two people.

Geodesic distance (or distance)

Length of one of the shortest paths between two nodes.

Indirect connection

A path between two nodes that are not directly connected through an edge.

Node

An entity in a network, usually representing an individual.

Path

A sequence of edges in which each pair of consecutive edges shares an endpoint, e.g., an edge between n i and n j followed by an edge between n j and n k.

Triangle

Three nodes with an edge between every pair of them.

## Definition

Path-based measures associate a value to every node in a network according to its direct and indirect connections to other nodes. For example, given a node we can compute the maximum distance to all other nodes: this measure is called node eccentricity. Whole-network measures associate a value to an entire network, providing a summary of its structure. For example, the diameter of a network is the maximum eccentricity of its nodes and represents a global measure of the efficiency of information dissemination in that network. In this essay we cover the most popular path-based and whole-network measures.

## Introduction

Graphs are a widely used abstract representation of the structure of social networks, where nodes represent individuals and edges indicate relationships between them, e.g., communication acts or friendship ties. While a graph representation hides many details of the original social network – the content of communication relationships, personal data about the individuals, and so on – the structure of the graph may highlight many relevant features. How fast does the information produced by a node reach other nodes? How important is a specific node in facilitating or slowing down information diffusion? Which one of two given networks is more active than the other? To answer these and similar questions, we need quantitative measures describing the network structure and efficient algorithms capable of computing these measures. In this essay we describe the main graph measures related to paths between nodes (eccentricity, closeness, betweenness, clustering coefficient) and those used to summarize whole networks (diameter, network closeness, network clustering coefficient, and density).

To introduce the concepts discussed in this essay, we consider two simple social groups thoroughly studied in sociology: dyads (pairs of individuals) and triads (groups of three individuals). When two individuals are connected to each other, we have the simplest possible social group, and flows of information between these two individuals can be fast and easy to achieve. As an example consider Fig. 1a, where nodes (ellipses) indicate individuals and an edge (line) between two nodes indicates mutual knowledge of e-mail addresses: if Rodrigo wants to send an e-mail to Lucia, he can do that directly. Now consider Renzo: he is not connected to Lucia but he knows Rodrigo, so he can still send a message to Lucia through the common friend. We may state that there is a potential communication channel between Renzo and Lucia, but delivering a message may take longer and has a lower probability of success because Rodrigo might decide not to pass the message along. If we add another node in the chain, e.g., Agnese knows Renzo who knows Rodrigo who knows Lucia, information exchange between Agnese and Lucia is possible but may take even longer and be less likely to succeed.

Fig. 1 Simple social groups at the basis of many single-node and whole-network measures

From this simple example, we can see how a graph induces a notion of distance between nodes: Rodrigo and Lucia are closer to each other than Renzo and Lucia. Distance is considered a good indicator of how easy or fast it is to send information from one node to another and is used to define several well-known single-node measures: eccentricity (the maximum distance from one node to all other nodes in the network), closeness (the inverse average distance from one node to the others), and betweenness (how much information flows through a node).

While the distance between pairs of nodes and the role of nodes in between are useful to characterize information propagation, triads (sets of three nodes) are important to study the tendency of members of a network to form groups. In real social networks, when a node is connected to two other nodes, we observe that those two nodes are often connected to each other as well. In our example, we may expect that Renzo and Lucia also exchange their e-mail addresses after some communication mediated by Rodrigo, therefore allowing direct communication and forming a triangle (Fig. 1b). A specific measure called clustering coefficient indicates how frequently pairs of neighbors are neighbors themselves.

Single-node metrics can often be used to compute whole-network measures by taking into consideration the contribution of all nodes. Network closeness and network clustering coefficient can be defined as averages of the corresponding single-node measures. The diameter of a network corresponds to the maximum eccentricity and thus represents the maximum distance between nodes. Finally, the network density indicates how many of all possible connections actually exist.

## Key Points

• Graphs are a widely used abstraction of the structure of social networks; for this reason, properties of a social network are inferred from those of its graph representation.

• Path-based and single-node metrics provide a quantitative measure of attributes or properties of individual nodes in a network, such as the level of “importance” of a node, whether a node is “well connected” to other nodes, and so on.

• Whole-network measures summarize attributes of the whole graph.

• In some cases, whole-network measures can be obtained by combining the values of single-node metrics over all nodes.

• Analysis of large graphs can be computationally challenging. New algorithms for social network analysis need to be developed to take advantage of modern high-performance computing architectures.

## Historical Background

The foundations of graph theory date back to 1735 (Alexanderson 2006), when the Swiss mathematician Leonhard Euler solved a problem known as “the seven bridges of Königsberg.” The Prussian city of Königsberg (now Kaliningrad) was located on a bifurcation of the Pregel river that included an island. Seven bridges connected the river banks and the island. The problem was stated as follows: does there exist a path that crosses every bridge exactly once? Euler proved that the problem has no solution.

The application of graphs to the study of social interactions is generally attributed to the Romanian-born American psychiatrist Jacob Moreno (Newman 2010). In his 1934 book Who Shall Survive? (Moreno 1934), Moreno presents and discusses diagrams of human interactions, which he calls sociograms, drawn as directed or undirected graphs. Since then, social network analysis (SNA) has been applied to diverse areas such as friendship patterns in communities; romantic and sexual relationships; collaboration of scientists, actors, or musicians; networks of terrorists; and food chains in ecological systems. We will briefly discuss some of them in section “Key Applications”; the interested reader is referred to Newman (2010) for a more complete list of references.

## Measures

### Preliminaries: Networks

In this essay, we will mostly consider the representation of a social network as an undirected, unweighted graph.

### Definition 1 (Social Network)

A social network is an undirected graph G = (V, E) where V is a set of nodes (individuals) and E ⊆ V × V is a symmetric relation indicating social connections.

Figure 2 shows a small example with nine individuals (nodes) and nine connections (edges).

Fig. 2 Working example: a social network with nine nodes and nine edges

In the following, we denote with n the number of nodes and with m the number of edges of G. Given a graph G, a path p = 〈v 0 ,..., v k 〉 of length k from node u to node v is a sequence of nodes v 0 ,..., v k such that v 0 = u, v k = v, and every consecutive pair of nodes is connected by an edge: (v i , v i+1) ∈ E, for all i = 0,..., k − 1. Two nodes u and v are connected if there exists at least one path between them. Note that, since edges in undirected graphs can be traversed in both directions, a path from u to v can always be reversed to obtain a path from v to u.

Representing a social network as an undirected, unweighted graph is quite common; however, it is important to keep in mind that such representation is an abstraction based on a set of assumptions, which may or may not hold.

The first assumption is to consider undirected graphs, where edges can be traversed in both directions. Undirected graphs are an appropriate representation of social networks where relationships between individuals are symmetric. For example, “friendship” on Facebook is symmetric. However, the following/follower relation on the Twitter network is not symmetric; therefore, a better representation would be based on directed graphs: if Lucia is following Renzo on Twitter, Renzo is not necessarily following Lucia. Many concepts defined for undirected graphs still hold for the directed case, such as the definition of geodesic path used as a basis for most of the measures presented in this essay. However, in directed graphs there can be a path from node u to node v without any path from v to u. In this case the distance from v to u is infinite, which may require adjustments to those measures that are based on maximizing or averaging distances. Extensions for directed graphs are presented in White and Borgatti (1994) and can be found in SNA textbooks (Wasserman and Faust 1994; Newman 2010).

The second assumption is to consider unweighted graphs, where edges do not carry any “cost” or “weight” associated with them. Adding weights to edges could be useful to convey additional details on the underlying social network. For example, the weight of an edge may represent the strength of the social relationship it represents. This kind of weight is available in online social networks such as Google+ and can be computed for other networks by looking at the communication acts, e.g., by counting the number of messages exchanged between two nodes. Extensions of path-based measures to weighted graphs are discussed in (Peay 1980; Opsahl et al. 2010). It should be observed that these extensions are not yet widely adopted. One reason is that giving accurate estimates of edge weights requires a deep understanding of social relationships; this kind of information can be extremely difficult to infer and is often domain specific, i.e., it depends on the type of social network. On the other hand, simple structural properties (who is connected to whom) are much easier to identify.

The third assumption is that the graph is connected, i.e., each pair of nodes is joined by at least one path. In practice, social networks tend to have one large connected component that includes most of the nodes, and these are typically the nodes of interest. Therefore, analysis of disconnected networks is usually carried out by first splitting the graph into its connected components and then analyzing each component separately.

We now introduce the definition of single-node measures, including for each one a brief description and, where appropriate, some usage hints or limitations. We then illustrate the corresponding whole-network measures, which are typically derived by aggregating the single-node values. It is important to observe that some network measures have been defined by different authors in different, incompatible ways, e.g., with and without normalization factors. Therefore, different network analysis tools may provide different results when computing the same measure, since they may be using different definitions.

### Single-Node and Path-Based Metrics

In this section we describe the most commonly used path-based and single-node measures in SNA.

#### Shortest Paths and Geodesic Distance

Many important graph metrics are based on the concepts of shortest path and geodesic distance. A shortest path between two nodes u and v on an unweighted graph is a path from u to v with the minimum number of edges; the geodesic distance d(u, v) is the number of edges of any such shortest path. For undirected graphs, d(u, v) = d(v, u) since all paths can be reversed. By definition, the distance of a node from itself is zero (d(u, u) = 0); if u and v are adjacent, meaning that there exists an edge between them, then d(u, v) = 1. If there is no path connecting u and v, we set d(u, v) = +∞. Since we focus on connected graphs, we assume that no pair of nodes has infinite distance, i.e., all pairs of nodes are connected. We also assume that no edges connecting a node to itself exist. Table 1 shows the geodesic distance between all pairs of nodes in the graph of Fig. 2.
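
Table 1 Geodesic distance d(u, v) between all pairs of nodes in the working example

| | n 0 | n 1 | n 2 | n 3 | n 4 | n 5 | n 6 | n 7 | n 8 |
|---|---|---|---|---|---|---|---|---|---|
| n 0 | 0 | 1 | 2 | 1 | 1 | 1 | 3 | 3 | 4 |
| n 1 | 1 | 0 | 1 | 2 | 2 | 2 | 2 | 2 | 3 |
| n 2 | 2 | 1 | 0 | 3 | 3 | 3 | 1 | 1 | 2 |
| n 3 | 1 | 2 | 3 | 0 | 2 | 2 | 4 | 4 | 5 |
| n 4 | 1 | 2 | 3 | 2 | 0 | 1 | 4 | 4 | 5 |
| n 5 | 1 | 2 | 3 | 2 | 1 | 0 | 4 | 4 | 5 |
| n 6 | 3 | 2 | 1 | 4 | 4 | 4 | 0 | 2 | 3 |
| n 7 | 3 | 2 | 1 | 4 | 4 | 4 | 2 | 0 | 1 |
| n 8 | 4 | 3 | 2 | 5 | 5 | 5 | 3 | 1 | 0 |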
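
On an unweighted graph, geodesic distances such as those in Table 1 can be obtained with one breadth-first search (BFS) per node. The following Python sketch illustrates the idea. It assumes an edge list inferred from the distance-1 entries of Table 1 (Fig. 2 itself is not reproduced here) and encodes node n i as the integer i; the names `EDGES`, `build_adjacency`, and `bfs_distances` are illustrative only.

```python
from collections import deque

# Edge list inferred from the distance-1 entries of Table 1 (assumption);
# node n_i is encoded as the integer i.
EDGES = [(0, 1), (0, 3), (0, 4), (0, 5), (1, 2),
         (2, 6), (2, 7), (4, 5), (7, 8)]
N = 9


def build_adjacency(n, edges):
    """Return adjacency lists for an undirected graph."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def bfs_distances(adj, source):
    """Geodesic distance from source to every node (inf if unreachable)."""
    dist = [float("inf")] * len(adj)
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] == float("inf"):  # first visit yields the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


adj = build_adjacency(N, EDGES)
distance = [bfs_distances(adj, s) for s in range(N)]
print(distance[0])  # row n0 of Table 1: [0, 1, 2, 1, 1, 1, 3, 3, 4]
```

Each search visits every node and edge at most once, so computing all rows this way takes O(n(n + m)) time on a graph stored as adjacency lists.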

#### Eccentricity

The eccentricity E(u) of a node u is defined as the maximum distance between u and all other nodes (Harary and Norman 1953; Harary 1969). Formally:
$$E(u)=\underset{v\in V}{ \max } d\left( u, v\right)$$
(1)
The eccentricity is a measure of how efficiently a node can disseminate information, with lower values denoting better dissemination efficiency. In our example, if we consider the geodesic distances shown in Table 1, we observe that node n 8 has eccentricity 5 (maximum value on the last row or column of the table), which means that the furthest nodes from n 8 are located 5 hops away. We can actually see that nodes n 3, n 4, and n 5 have geodesic distance 5 from n 8; this means that, for example, information produced by n 8 can pass through n 7, n 2, n 1, and n 0 to reach n 4 in the smallest number of steps. On the other hand, nodes n 1 and n 2 both have eccentricity 3, which means that information originating from them must traverse three edges to reach the nodes furthest from them. However, notice that information does not necessarily travel along the shortest paths of the network. For example, a message from n 0 may reach n 4 indirectly through n 5 even if there is an edge between n 0 and n 4. The second column of Table 2 shows the eccentricity of all nodes in our working example.

Table 2 Single-node measures for the working example (values approximated to the second decimal)

| Node | Eccentricity | Closeness | Betweenness | Clustering coefficient |
|---|---|---|---|---|
| n 0 | 4 | 0.50 | 17.0 | 1/6 |
| n 1 | 3 | 0.53 | 16.0 | 0.0 |
| n 2 | 3 | 0.50 | 17.0 | 0.0 |
| n 3 | 5 | 0.35 | 0.0 | 0.0 |
| n 4 | 5 | 0.36 | 0.0 | 1.0 |
| n 5 | 5 | 0.36 | 0.0 | 1.0 |
| n 6 | 4 | 0.35 | 0.0 | 0.0 |
| n 7 | 4 | 0.38 | 7.0 | 0.0 |
| n 8 | 5 | 0.29 | 0.0 | 0.0 |
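
Once all geodesic distances are known, Eq. (1) reduces to a row-wise maximum. The short Python sketch below applies it to the distances copied from Table 1; the names `TABLE_1` and `eccentricity` are illustrative, and node n i corresponds to row and column i.

```python
# Geodesic distances copied from Table 1; row i and column j refer to n_i and n_j.
TABLE_1 = [
    [0, 1, 2, 1, 1, 1, 3, 3, 4],
    [1, 0, 1, 2, 2, 2, 2, 2, 3],
    [2, 1, 0, 3, 3, 3, 1, 1, 2],
    [1, 2, 3, 0, 2, 2, 4, 4, 5],
    [1, 2, 3, 2, 0, 1, 4, 4, 5],
    [1, 2, 3, 2, 1, 0, 4, 4, 5],
    [3, 2, 1, 4, 4, 4, 0, 2, 3],
    [3, 2, 1, 4, 4, 4, 2, 0, 1],
    [4, 3, 2, 5, 5, 5, 3, 1, 0],
]


def eccentricity(dist, u):
    """Eq. (1): the largest distance from u to any node."""
    return max(dist[u])


ecc = [eccentricity(TABLE_1, u) for u in range(len(TABLE_1))]
print(ecc)  # [4, 3, 3, 5, 5, 5, 4, 4, 5]

# Nodes that realize the eccentricity of n8
farthest_from_n8 = [v for v, d in enumerate(TABLE_1[8]) if d == ecc[8]]
print(farthest_from_n8)  # [3, 4, 5]
```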

#### Closeness

While eccentricity represents the maximum distance between nodes, closeness measures average distances. The closeness C(u) of a node u has been defined in different ways, all providing the same information. The main idea is to use the inverse average distance between u and all other nodes as a measure of how quickly we expect information produced by u to reach the rest of the network (Sabidussi 1966). Formally:
$$C(u)=\frac{n-1}{\sum_{v\in V, v\ne u} d\left( u, v\right)}$$
(2)

where n is the number of nodes of the graph G. n is sometimes used as a normalization factor instead of n − 1. Moreover, 1/C(u) is used in SNA tools such as Gephi as a measure of closeness, although it would be more appropriate to consider it as a measure of farness since it increases when the average distance increases. However, all these alternative definitions return the same ranking of nodes (in inverse order, if the inverse of Eq. (2) is used), so they can be typically used interchangeably.

Note that the definition of closeness relies on the assumption that the graph is connected, so that there are n − 1 other nodes connected with u. If we consider Fig. 2, the closeness of node n 0 can be computed by averaging the distances of all other nodes from n 0, as reported in Table 1, and inverting the result. Therefore, we get C(n 0) = 8/(1 + 2 + 1 + 1 + 1 + 3 + 3 + 4) = 0.5. The third column of Table 2 shows the closeness of each node of our sample graph.
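
The computation of C(n 0) generalizes directly to all nodes. The following Python sketch evaluates Eq. (2) on the distances copied from Table 1, assuming a connected graph so that every distance is finite; `TABLE_1` and `closeness` are illustrative names.

```python
# Geodesic distances copied from Table 1; row i and column j refer to n_i and n_j.
TABLE_1 = [
    [0, 1, 2, 1, 1, 1, 3, 3, 4],
    [1, 0, 1, 2, 2, 2, 2, 2, 3],
    [2, 1, 0, 3, 3, 3, 1, 1, 2],
    [1, 2, 3, 0, 2, 2, 4, 4, 5],
    [1, 2, 3, 2, 0, 1, 4, 4, 5],
    [1, 2, 3, 2, 1, 0, 4, 4, 5],
    [3, 2, 1, 4, 4, 4, 0, 2, 3],
    [3, 2, 1, 4, 4, 4, 2, 0, 1],
    [4, 3, 2, 5, 5, 5, 3, 1, 0],
]


def closeness(dist, u):
    """Eq. (2): inverse of the average distance from u to the other nodes."""
    n = len(dist)
    total = sum(dist[u][v] for v in range(n) if v != u)
    return (n - 1) / total


values = [closeness(TABLE_1, u) for u in range(len(TABLE_1))]
print([round(c, 2) for c in values])
# [0.5, 0.53, 0.5, 0.35, 0.36, 0.36, 0.35, 0.38, 0.29]
```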

When closeness is used to compare different nodes or networks, it is important to consider that this measure usually spans a small range of values. In real social networks, the distance between nodes tends to be small, while the number of nodes can be very large. In addition, the definition of closeness according to (2) fails to take into consideration the fact that distant edges are less likely to spread information, since in (2) all edges in a shortest path contribute equally. For example, n 0 may rely more on its neighbors n 1, n 3, n 4, and n 5 to forward its messages than n 7 that is not directly connected to it. As a consequence, alternative definitions of closeness may be considered, assigning different weights to specific edges depending on their distance from the node under examination.

#### Betweenness

Betweenness measures how frequently a node lies on shortest paths between other nodes (Freeman 1977; Anthonisse 1971). Let σ vw (u) be the number of geodesic paths (shortest paths) from v to w passing through u and σ vw the total number of geodesic paths between nodes v and w. Then, the ratio σ vw (u)/σ vw can be interpreted as the probability that node u lies on a randomly selected geodesic path from v to w. The betweenness B(u) of node u is defined as
$$B(u)=\sum_{\begin{array}{c} v, w\in V\\ {} v\ne w\ne u\end{array}}\frac{\sigma_{vw}(u)}{\sigma_{vw}}$$
(3)

Betweenness gives a measure of the load placed on a given node. Intuitively, if a node u has a large value of betweenness, then it tends to appear on many shortest paths. Since shortest paths are the most efficient way to route information, a node with high betweenness is more likely to play an important role in the dissemination of information (White and Borgatti 1994).

Considering our example, peripheral nodes like n 8 are never intermediate nodes on a geodesic path between two other nodes; therefore, their betweenness is 0. On the other hand, let us consider node n 2. If one of the nodes {n 8, n 7} wants to send a message to any of {n 0, n 1, n 3, n 4, n 5, n 6} through a geodesic path, then the message must pass through n 2. Fig. 2 clearly shows that n 2 plays an important role in allowing information to pass from one side of the network to the other. The fourth column of Table 2 shows the betweenness of all nodes in the graph.

It is important to observe that the definition of betweenness centrality assumes that information always flows through geodesic paths. While this may not always be the case in real social graphs, we may consider shortest paths as the most likely information channels and thus use this definition of betweenness as an estimate of the real number of messages passing through a node. However, more complex versions of betweenness have been proposed, taking non-shortest paths into consideration (Newman 2005).
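
Eq. (3) can be evaluated directly by counting shortest paths. The Python sketch below is a brute-force illustration, not an optimized method: it runs one BFS per node to obtain distances and path counts and then tests every candidate intermediate node, which costs O(n³) after the searches. It assumes an edge list inferred from the distance-1 entries of Table 1 and counts each unordered pair {v, w} once, which is the convention that reproduces the values of Table 2. All function names are illustrative.

```python
from collections import deque
from itertools import combinations

# Edge list inferred from the distance-1 entries of Table 1 (assumption).
EDGES = [(0, 1), (0, 3), (0, 4), (0, 5), (1, 2),
         (2, 6), (2, 7), (4, 5), (7, 8)]
N = 9


def bfs_counts(adj, source):
    """Return (dist, sigma): distances and numbers of shortest paths from source."""
    dist = [-1] * len(adj)   # -1 marks nodes not reached yet
    sigma = [0] * len(adj)
    dist[source], sigma[source] = 0, 1
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
            if dist[v] == dist[u] + 1:  # u precedes v on some shortest path
                sigma[v] += sigma[u]
    return dist, sigma


def betweenness(n, edges):
    """Eq. (3), summing over unordered pairs {v, w} (assumed convention)."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    results = [bfs_counts(adj, s) for s in range(n)]
    dist = [r[0] for r in results]
    sigma = [r[1] for r in results]
    b = [0.0] * n
    for v, w in combinations(range(n), 2):
        if dist[v][w] < 0:
            continue  # v and w are not connected
        for u in range(n):
            # u lies on a geodesic v-w path iff the two legs add up exactly
            if u not in (v, w) and dist[v][u] + dist[u][w] == dist[v][w]:
                b[u] += sigma[v][u] * sigma[u][w] / sigma[v][w]
    return b


print(betweenness(N, EDGES))
# [17.0, 16.0, 17.0, 0.0, 0.0, 0.0, 0.0, 7.0, 0.0]
```

Summing over ordered pairs instead would double every value without changing the ranking of nodes.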

#### Clustering Coefficient

The clustering coefficient (Watts and Strogatz 1998) measures the tendency of the neighbors of a node to be connected to each other forming a fully connected subgraph (clique). The relevance of this measure comes from the fact that real social networks present a higher clustering coefficient than corresponding random networks, indicating a tendency to create triangles. Therefore, this measure discriminates between random and social networks and highlights nodes whose neighbors are well connected to each other.

Given an undirected graph G = (V, E), for each node u ∈ V we define the neighborhood N (u) as the set of nodes directly connected with u:
$$N(u)=\left\{ v\in V|\left( u, v\right)\in E\right\}$$
Let EN (u) be the set of edges of G whose endpoints are both in the neighborhood N (u):
$$E N(u)=\left\{\left( v, w\right)\in E| v\in N(u), w\in N(u)\right\}$$

Let n(u) = |N (u)| be the number of neighbors of node u. We observe that the maximum number of edges that an undirected graph with n(u) nodes can have is n(u) × (n(u) − 1)/2; this number corresponds to the number of edges of the complete graph with n(u) nodes (in a complete graph, there is an edge connecting each pair of nodes).

The clustering coefficient CC (u) of u is defined as the ratio between the number of edges in EN (u) and the number of edges that the neighborhood could have if all its nodes were fully connected. Formally:
$$CC(u)=\frac{2\times \left| EN(u)\right|}{n(u)\times \left( n(u)-1\right)}$$
(4)
CC (u) can take values in the range [0, 1]. If CC (u) = 1, then node u together with its neighbors forms a clique, as in Fig. 3a. If CC (u) = 0, then the neighbors of u are not connected to each other, as in Fig. 3c.

Fig. 3 Clustering coefficient for node u with different neighborhoods; edges incident to u are shown in gray, since they do not contribute to the computation of the coefficient
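
Eq. (4) can be sketched in a few lines of Python by counting the edges among the neighbors of each node. The example assumes an edge list inferred from the distance-1 entries of Table 1. Since Eq. (4) is undefined for nodes with fewer than two neighbors, the sketch returns 0 in that case, a convention chosen here because it matches the values reported in Table 2. The function names are illustrative.

```python
from itertools import combinations

# Edge list inferred from the distance-1 entries of Table 1 (assumption).
EDGES = [(0, 1), (0, 3), (0, 4), (0, 5), (1, 2),
         (2, 6), (2, 7), (4, 5), (7, 8)]
N = 9


def neighborhoods(n, edges):
    """Return the neighbor set N(u) of every node u."""
    nbrs = [set() for _ in range(n)]
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs


def clustering_coefficient(nbrs, u):
    """Eq. (4): fraction of pairs of neighbors of u that are themselves adjacent."""
    k = len(nbrs[u])
    if k < 2:
        return 0.0  # assumed convention: Eq. (4) is undefined for k < 2
    links = sum(1 for v, w in combinations(nbrs[u], 2) if w in nbrs[v])
    return 2 * links / (k * (k - 1))


nbrs = neighborhoods(N, EDGES)
cc = [clustering_coefficient(nbrs, u) for u in range(N)]
print([round(value, 2) for value in cc])
# [0.17, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
```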

### Whole-Network Measures

In this section we describe how the measures introduced in the previous section can be extended to whole graphs. Table 3 summarizes these measures for the graph in Fig. 2.

Table 3 Whole-network measures for the graph in Fig. 2

| Measure | Value |
|---|---|
| Diameter | 5 |
| Closeness | 0.32 |
| Betweenness | 0.71 |
| Clustering coefficient | 0.24 |
| Density | 0.25 |

#### Diameter

The graph diameter is a global measure of eccentricity. The diameter D(G) of a graph G = (V, E) is the maximum eccentricity of its nodes:
$$D(G)=\underset{u\in V}{ \max } E(u)$$
(5)
which, applying (1), can be rewritten as:
$$D(G)=\underset{u, v\in V}{ \max } d\left( u, v\right)$$
(6)

Therefore, the diameter of G is the maximum distance between any pair of nodes. For the graph of Fig. 2, we observe from the data shown in the second column of Table 2 that the maximum eccentricity is 5; therefore, the graph diameter is 5.

The diameter of a graph is an indication of the efficiency of information transmission: information produced by a node may need to traverse D(G) edges to reach other specific nodes in the network. Note that information may also traverse more than D(G) edges if it does not follow shortest paths. In addition, for graphs with a dense core and a few distant nodes, the diameter is determined by those nodes and may assume a high value, whereas the majority of nodes are close to each other.

#### Closeness

According to Freeman (1978), the network closeness C(G) of a graph G is defined as:
$$C(G)=\frac{\sum_{v\in V}\left({C}_{\max }- C(v)\right)}{\left({n}^2-3 n+2\right)/\left(2 n-3\right)}$$
(7)

where C(v) is the closeness of node v and C max = max v∈V C(v) is the maximum value of closeness over the whole graph. In our example we have C(G) = 0.32, as can be obtained from the values in the closeness column of Table 2.
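
As a check on the value C(G) = 0.32, Eq. (7) can be evaluated on the working example. The Python sketch below recomputes the unrounded node closeness from the distances copied from Table 1 and then applies Eq. (7); the function names are illustrative.

```python
# Geodesic distances copied from Table 1; row i and column j refer to n_i and n_j.
TABLE_1 = [
    [0, 1, 2, 1, 1, 1, 3, 3, 4],
    [1, 0, 1, 2, 2, 2, 2, 2, 3],
    [2, 1, 0, 3, 3, 3, 1, 1, 2],
    [1, 2, 3, 0, 2, 2, 4, 4, 5],
    [1, 2, 3, 2, 0, 1, 4, 4, 5],
    [1, 2, 3, 2, 1, 0, 4, 4, 5],
    [3, 2, 1, 4, 4, 4, 0, 2, 3],
    [3, 2, 1, 4, 4, 4, 2, 0, 1],
    [4, 3, 2, 5, 5, 5, 3, 1, 0],
]


def closeness(dist, u):
    """Eq. (2): node closeness."""
    n = len(dist)
    return (n - 1) / sum(dist[u][v] for v in range(n) if v != u)


def network_closeness(dist):
    """Eq. (7): total gap from the most central node, divided by its normalization."""
    n = len(dist)
    c = [closeness(dist, u) for u in range(n)]
    c_max = max(c)
    numerator = sum(c_max - value for value in c)
    denominator = (n * n - 3 * n + 2) / (2 * n - 3)
    return numerator / denominator


print(round(network_closeness(TABLE_1), 2))  # 0.32
```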

The feature emphasized by closeness is similar to those emphasized by eccentricity and diameter. However, a single node far away from the others would affect closeness to a more limited extent than eccentricity and diameter. Also in this case, a higher value of closeness corresponds to a node or network where information may tend to reach other nodes more quickly.

Other measures that are related to network closeness are the average geodesic distance, i.e., the mean value of the distance between all pairs of nodes, and the so-called global efficiency (Latora and Marchiori 2001) E(G) defined as:
$$E(G)=\frac{1}{n\left( n-1\right)}\sum_{\begin{array}{c} u, v\in V\\ {} u\ne v\end{array}}\frac{1}{d\left( u, v\right)}$$
(8)
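
Both the average geodesic distance and the global efficiency of Eq. (8) are plain averages over ordered pairs of distinct nodes. The Python sketch below computes them from the distances copied from Table 1, assuming a connected graph so that no distance is zero or infinite for u ≠ v; the function names are illustrative.

```python
# Geodesic distances copied from Table 1; row i and column j refer to n_i and n_j.
TABLE_1 = [
    [0, 1, 2, 1, 1, 1, 3, 3, 4],
    [1, 0, 1, 2, 2, 2, 2, 2, 3],
    [2, 1, 0, 3, 3, 3, 1, 1, 2],
    [1, 2, 3, 0, 2, 2, 4, 4, 5],
    [1, 2, 3, 2, 0, 1, 4, 4, 5],
    [1, 2, 3, 2, 1, 0, 4, 4, 5],
    [3, 2, 1, 4, 4, 4, 0, 2, 3],
    [3, 2, 1, 4, 4, 4, 2, 0, 1],
    [4, 3, 2, 5, 5, 5, 3, 1, 0],
]


def average_distance(dist):
    """Mean geodesic distance over all ordered pairs of distinct nodes."""
    n = len(dist)
    total = sum(dist[u][v] for u in range(n) for v in range(n) if u != v)
    return total / (n * (n - 1))


def global_efficiency(dist):
    """Eq. (8): mean inverse geodesic distance over ordered pairs of distinct nodes."""
    n = len(dist)
    total = sum(1 / dist[u][v] for u in range(n) for v in range(n) if u != v)
    return total / (n * (n - 1))


print(round(average_distance(TABLE_1), 2))   # 2.58
print(round(global_efficiency(TABLE_1), 2))  # 0.52
```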

#### Betweenness

Betweenness has been extended in Freeman (1977) to whole graphs to represent the dominance of the “most central” node. Formally, given a graph G, let B max = max v∈V B(v) be the maximum value of the betweenness centrality of any node in G. The graph centrality B(G) can then be defined as:
$$B(G)=\frac{\sum_{v\in V}\left({B}_{\max }- B(v)\right)}{\left( n-1\right){B}_{\max }}$$
(9)
The value of B(G) is a real number in the range [0, 1]. B(G) = 0 if all nodes in G have the same centrality; this happens, e.g., for a graph where nodes are connected as in a ring, as shown in Fig. 4a. The value B(G) = 1 can be obtained on a graph with a single central node which is connected to all other nodes in a star topology, as in Fig. 4b.

Fig. 4 Extreme values of betweenness centrality
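
The two extreme cases and the value reported in Table 3 can be checked numerically. The Python sketch below computes node betweenness by brute force (unordered pairs, as in Table 2) and then applies Eq. (9). The working-example edge list is inferred from the distance-1 entries of Table 1, the five-node ring and star are hypothetical test graphs, and returning 0 when B max = 0 is a convention assumed here to avoid a division by zero.

```python
from collections import deque
from itertools import combinations


def bfs_counts(adj, source):
    """Distances and numbers of shortest paths from source (-1 = unreachable)."""
    dist, sigma = [-1] * len(adj), [0] * len(adj)
    dist[source], sigma[source] = 0, 1
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma


def betweenness(n, edges):
    """Eq. (3) over unordered pairs, by brute force."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    results = [bfs_counts(adj, s) for s in range(n)]
    b = [0.0] * n
    for v, w in combinations(range(n), 2):
        (dv, sv), (dw, sw) = results[v], results[w]
        if dv[w] < 0:
            continue  # v and w are not connected
        for u in range(n):
            if u not in (v, w) and dv[u] + dw[u] == dv[w]:
                b[u] += sv[u] * sw[u] / sv[w]
    return b


def graph_betweenness(n, edges):
    """Eq. (9): dominance of the most central node."""
    b = betweenness(n, edges)
    b_max = max(b)
    if b_max == 0:
        return 0.0  # assumed convention when no node lies between others
    return sum(b_max - value for value in b) / ((n - 1) * b_max)


working = [(0, 1), (0, 3), (0, 4), (0, 5), (1, 2),
           (2, 6), (2, 7), (4, 5), (7, 8)]      # inferred from Table 1
ring = [(i, (i + 1) % 5) for i in range(5)]     # hypothetical 5-node ring
star = [(0, i) for i in range(1, 5)]            # hypothetical 5-node star

print(round(graph_betweenness(9, working), 2))  # 0.71
print(graph_betweenness(5, ring))               # 0.0
print(graph_betweenness(5, star))               # 1.0
```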

#### Clustering Coefficient

The clustering coefficient CC (G) of a graph G can be computed as the average of the clustering coefficient of all nodes:
$$CC(G)=\frac{\sum_{u\in V} CC(u)}{n}$$
(10)
A different definition was proposed by Luce and Perry (1949). Given an undirected graph G, we define a triplet to be a path over a set of three nodes, e.g., 〈n 3, n 0, n 5〉. The three nodes of a triplet may be connected by either two edges (open triplet) or three edges (closed triplet). In the graph of Fig. 2, 〈n 3, n 0, n 5〉 is an open triplet, since n 3 and n 5 are not adjacent, while 〈n 4, n 0, n 5, n 4〉 is a closed triplet. The clustering coefficient CC (G) can therefore also be defined as the fraction of triplets that are closed:
$$CC(G)=\frac{\mathrm{number}\ \mathrm{of}\ \mathrm{closed}\ \mathrm{triplets}}{\mathrm{total}\ \mathrm{number}\ \mathrm{of}\ \mathrm{triplets}}$$
(11)

This measure has also been extended to weighted networks by Opsahl and Panzarasa (2009).

As said, when computed on a social network, the clustering coefficient measures the tendency of individuals to form triangles. In fact, it is often described as a measure of the so-called transitivity of a graph (the term transitivity is sometimes used as a synonym of whole-network clustering coefficient). More in general, the clustering coefficient of a network indicates a tendency to form dense subgraphs, also called communities or clusters depending on their interpretation.
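
The two definitions of the whole-network clustering coefficient generally give different values. The Python sketch below compares Eq. (10) and Eq. (11) on the working example, using an edge list inferred from the distance-1 entries of Table 1. It assumes one common counting convention for Eq. (11): a triplet is identified by its center node and an unordered pair of its neighbors, so each triangle contributes three closed triplets. Nodes with fewer than two neighbors are given a local coefficient of 0, as in Table 2.

```python
from itertools import combinations

# Edge list inferred from the distance-1 entries of Table 1 (assumption).
EDGES = [(0, 1), (0, 3), (0, 4), (0, 5), (1, 2),
         (2, 6), (2, 7), (4, 5), (7, 8)]
N = 9

nbrs = [set() for _ in range(N)]
for u, v in EDGES:
    nbrs[u].add(v)
    nbrs[v].add(u)


def triplet_counts(nbrs):
    """Return (closed, total) triplets, one per center and pair of its neighbors."""
    closed = total = 0
    for center in range(len(nbrs)):
        for v, w in combinations(nbrs[center], 2):
            total += 1
            if w in nbrs[v]:
                closed += 1
    return closed, total


def average_local_clustering(nbrs):
    """Eq. (10): mean of the node clustering coefficients of Eq. (4)."""
    values = []
    for u in range(len(nbrs)):
        k = len(nbrs[u])
        links = sum(1 for v, w in combinations(nbrs[u], 2) if w in nbrs[v])
        values.append(2 * links / (k * (k - 1)) if k >= 2 else 0.0)
    return sum(values) / len(values)


closed, total = triplet_counts(nbrs)
print(closed, total)                              # 3 13
print(round(closed / total, 2))                   # Eq. (11): 0.23
print(round(average_local_clustering(nbrs), 2))   # Eq. (10): 0.24
```

The single triangle {n 0, n 4, n 5} accounts for all three closed triplets; the remaining ten triplets are open.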

Other related measures worth mentioning are the rich-club coefficient, indicating the tendency of nodes of high degree to be well connected to each other, and the degree correlation, i.e., the probability that an arbitrary edge connects two nodes of specific degrees. The more general concept of assortativity (also known as assortative mixing) is used to indicate the tendency of nodes to connect to similar nodes, and one well-known assortativity measure is modularity, used by several methods of community detection (Fortunato 2010). Additional details on these measures can be found in Costa et al. (2007).

#### Density

The density ρ(G) of a graph G is the fraction of edges in G with respect to the maximum number of possible edges. For an undirected graph with n nodes, there can be at most n(n − 1)/2 edges, therefore:
$$\rho (G)=\frac{2 m}{n\left( n-1\right)}$$
(12)

where m = |E| is the number of edges of G.

Density is always in the range [0, 1]. If ρ(G) = 0, then G has no edges and all nodes are isolated; if ρ(G) = 1 then G is a complete graph, where every pair of nodes is connected by an edge. Note that the minimum density for a connected graph with n nodes is 2/n, since any connected graph must have at least n − 1 edges. The converse is not necessarily true: a graph with n − 1 edges may be disconnected.

Figure 5 shows the density of a graph with n = 6 nodes and an increasing number of edges. The graph in Fig. 5a has m = 5 edges, resulting in a density ρ(G) = (2 × 5)/(6 × 5) = 1/3. The graph in Fig. 5b has m = 10 edges, resulting in a density ρ(G) = (2 × 10)/(6 × 5) = 2/3. Finally, the graph in Fig. 5c is complete (m = 15 edges) and therefore has density ρ(G) = 1.

Fig. 5 Graph density

The density of social graphs is typically low, meaning that on average every individual is connected to a small number of other individuals. This can be observed on any online social network site, e.g., Facebook or Twitter, where most users are connected to tens or hundreds of other users out of about a billion total users. This feature can be explained by the existence of cognitive limits that prevent a person from managing more than a given number of stable relationships (Goncalves et al. 2011) and is fundamental for the design of efficient algorithms. For our graph in Fig. 2, we have n = 9 nodes and m = 9 edges, resulting in a density of ρ(G) = (2 × 9)/(9 × 8) = 1/4.
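
The density values quoted in this section follow directly from Eq. (12). The Python sketch below uses exact fractions so that the results can be compared with the text; the function name `density` is illustrative.

```python
from fractions import Fraction


def density(n, m):
    """Eq. (12): fraction of the n(n - 1)/2 possible edges that are present."""
    return Fraction(2 * m, n * (n - 1))


print(density(9, 9))                                   # working example: 1/4
print(density(6, 5), density(6, 10), density(6, 15))   # Fig. 5: 1/3 2/3 1

# A connected graph needs at least n - 1 edges, hence density at least 2/n.
n = 9
print(density(n, n - 1) == Fraction(2, n))             # True
```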

### Summary

We summarize the notation used in this essay in Table 4. Table 5 lists all single-node and whole-network measures introduced so far.

Table 4 Notation summary

| Symbol | Meaning |
|---|---|
| V | Set of graph nodes |
| E | Set of graph edges, E ⊆ V × V |
| n | Number of nodes |
| m | Number of edges |
| d(u, v) | Geodesic distance from node u to node v |
| σ vw | Number of shortest paths from v to w |
| σ vw (u) | Number of shortest paths from v to w passing through u |
| N (u) | Neighbors of node u |
| n(u) | Number of neighbors of node u, n(u) = \|N (u)\| |
| EN (u) | Set of edges in the neighborhood of u |

Table 5 Single-node and whole-graph metrics

| Single-node measure | Definition | Whole-graph measure | Definition |
|---|---|---|---|
| Eccentricity | $$E(u)=\underset{v\in V}{ \max } d\left( u, v\right)$$ | Diameter | $$D(G)=\underset{u\in V}{ \max } E(u)$$ |
| Closeness | $$C(u)=\frac{n-1}{\sum_{v\in V, v\ne u} d\left( u, v\right)}$$ | Closeness | $$C(G)=\frac{\sum_{v\in V}\left({C}_{\max }- C(v)\right)}{\left({n}^2-3 n+2\right)/\left(2 n-3\right)}$$ |
| Betweenness | $$B(u)=\sum_{\begin{array}{c} v, w\in V\\ {} v\ne w\ne u\end{array}}\frac{\sigma_{vw}(u)}{\sigma_{vw}}$$ | Betweenness | $$B(G)=\frac{\sum_{v\in V}\left({B}_{\max }- B(v)\right)}{\left( n-1\right){B}_{\max }}$$ |
| Clustering coefficient | $$CC(u)=\frac{2\times \lvert EN(u)\rvert}{n(u)\times \left( n(u)-1\right)}$$ | Clustering coefficient | $$CC(G)=\frac{\sum_{u\in V} CC(u)}{n}$$ |
| | | Density | $$\rho (G)=\frac{2 m}{n\left( n-1\right)}$$ |

## Computational Aspects

In this section we describe the algorithms that can be used to compute some of the metrics described in the previous sections. We assume that the reader is familiar with the basic concepts of algorithm design and analysis, as can be found in introductory textbooks such as Cormen et al. (2009). However, in order to make this essay self-contained, we summarize the main points below.

An algorithm is a finite sequence of steps describing an effective method for calculating a function. A fundamental attribute of an algorithm is its efficiency, representing the amount of resources (e.g., CPU time or storage space) it needs to compute the result. The amount of resources depends on the input size: it is reasonable to expect that for larger inputs, the algorithm will require more CPU time and/or storage space to compute the result. The input size of most graph algorithms is the size of the input graph, which in turn is proportional to the number of nodes n and edges m.

The goal of algorithm analysis is to define a function mapping the input size to the number of elementary steps (time complexity) or storage locations (space complexity) required by the algorithm to compute the result. Since it is in general not possible to give a precise definition of the complexity function, it is common to estimate its asymptotic behavior, that is, the growth rate of the complexity function for arbitrarily large inputs.

The big O notation is used to concisely express the asymptotic growth rate of the cost function. Assume that an algorithm requires T(n) steps (storage locations) to compute the result for an input of size n. We say that the asymptotic cost of the algorithm is O(f(n)) if there exist constants c > 0 and n₀ > 0 such that, for all n ≥ n₀, T(n) ≤ c f(n). For example, we say that an algorithm requires time O(n²) to denote that its cost function grows no faster than the square of the input size, for sufficiently large inputs.
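As a brief worked illustration of this definition, consider a hypothetical cost function (chosen here only for the arithmetic, not taken from any algorithm in this essay) T(n) = 3n² + 5n. Since n ≤ n² for every n ≥ 1, we can bound the linear term by a quadratic one:

$$T(n)=3{n}^2+5 n\le 3{n}^2+5{n}^2=8{n}^2\kern1em \mathrm{for}\ \mathrm{all}\ n\ge 1$$

The constants c = 8 and n₀ = 1 therefore satisfy the definition, and T(n) = O(n²). Other choices of constants work equally well; the definition only requires that one suitable pair exists.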

### Graph Representation

We first introduce the data structures which can be used to represent graphs. A simple way to represent a (directed or undirected) unweighted graph G = (V, E) with n nodes is using an n × n adjacency matrix M, where M_ij = 1 if and only if (i, j) ∈ E.

If G is undirected, as in the case of Fig. 2, the adjacency matrix is symmetric since edges (u, v) and (v, u) are the same. Adjacency matrices support some graph operations efficiently: for example, it is possible to test for the existence of an edge (u, v), add a new edge, or delete an edge in constant time by accessing the appropriate element of the matrix. Unfortunately, the storage space required to encode a graph with n nodes is O(n²), since the matrix has n² elements. Therefore, the adjacency matrix is mostly used with small graphs or with dense graphs where ρ(G) ≈ 1.

A more space-efficient representation is based on adjacency lists. Here, the graph is represented as an array of n lists, where list u contains the neighbors of node u.

Figure 6 shows the adjacency list representation of the graph in Fig. 2. Since the graph has n = 9 nodes, we have nine lists, each one associated with a single node. The list for node n₀ contains four elements {n₁, n₃, n₄, n₅}, which are precisely the four neighbors of n₀. Note that, for undirected graphs, each edge appears twice in the lists. For example, edge (n₀, n₅) appears as element n₅ in the list for n₀ and as element n₀ in the list for node n₅. Therefore, the total space requirement of an adjacency list is n + 2m = O(n + m). While the space requirement is lower than that of the adjacency matrix, some operations are less efficient on adjacency lists. For example, to test for the existence of an edge (u, v), it is necessary to scan the adjacency list for u (or v) until either the other node is found or the end of the list is reached. The cost in this case is O(max_v n(v)), that is, in the worst case proportional to the maximum number of neighbors a node can have.

Fig. 6 Adjacency list representation of the graph in Fig. 2

Table 6 shows the space complexity and time complexity of operations on adjacency matrix and adjacency list graph representations.

Table 6

| | Adjacency matrix | Adjacency list |
| --- | --- | --- |
| Space | O(n²) | O(n + m) |
| Testing for an edge | O(1) | O(max_v n(v)) |
| Enumerating the neighbors of a node | O(n) | O(max_v n(v)) |
| Adding an edge | O(1) | O(1) |
| Deleting an edge | O(1) | O(max_v n(v)) |
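The two representations can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names and the five-node example graph are hypothetical (the graph is not the one in Fig. 2), and nodes are assumed to be numbered 0 to n − 1.

```python
def to_adjacency_matrix(n, edges):
    """Build an n x n 0/1 matrix for an undirected graph: O(n^2) space."""
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[u][v] = 1
        matrix[v][u] = 1  # symmetric because the graph is undirected
    return matrix


def to_adjacency_lists(n, edges):
    """Build an array of n neighbor lists: O(n + m) space."""
    lists = [[] for _ in range(n)]
    for u, v in edges:
        lists[u].append(v)
        lists[v].append(u)  # each undirected edge is stored twice
    return lists


def has_edge_matrix(matrix, u, v):
    """Edge test on the matrix: a single lookup, O(1)."""
    return matrix[u][v] == 1


def has_edge_lists(lists, u, v):
    """Edge test on the lists: linear scan of the list of u."""
    return v in lists[u]


# Hypothetical example graph with n = 5 nodes and m = 5 edges.
edges = [(0, 1), (0, 3), (0, 4), (1, 2), (3, 4)]
matrix = to_adjacency_matrix(5, edges)
lists = to_adjacency_lists(5, edges)
```

The matrix always holds n² entries regardless of m, whereas the lists hold exactly 2m neighbor entries in total, which is why the list form is usually preferred for sparse graphs.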

### Graph Algorithms

We now give an overview of some classic algorithms used to compute the measures considered in this essay. The algorithms considered here are not necessarily the most efficient ones available, but nevertheless are those which are most frequently implemented in graph analysis packages due to their simplicity.

Eccentricity, Closeness, and Diameter. To compute the eccentricity and closeness of a node u, we need to compute the geodesic distances from u to all other nodes. This is the well-known single-source shortest path (SSSP) problem on graphs (Festa 2006). For unweighted graphs, this problem reduces to performing a breadth-first visit of the graph starting from u, which requires time O(n + m) using adjacency lists. Dijkstra's algorithm can solve the SSSP problem for directed graphs with nonnegative edge weights in time O((n + m) log n) using a priority queue implemented with a binary heap (Cormen et al. 2009); using more efficient priority queue implementations, such as Fibonacci heaps, the running time can be reduced to O(m + n log n). For directed graphs with arbitrary edge weights (and no negative-weight cycles), the Bellman-Ford algorithm can be used to compute all shortest paths from a single source in time O(nm) (Cormen et al. 2009).

Computing the graph diameter requires solving the all-pairs shortest path problem, that is, computing the geodesic distances between all pairs of nodes. For unweighted graphs this can be achieved in time O(n² + nm) by simply executing n breadth-first visits, one starting from each node. The Floyd-Warshall algorithm (Floyd 1962) can compute all geodesic distances for directed graphs with arbitrary edge weights (and no negative-weight cycles) in time O(n³). Johnson's algorithm (Johnson 1977) achieves a running time of O(nm + n² log n), which is more efficient on graphs with low density.
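For the unweighted case, the breadth-first approach can be sketched as follows. This is a minimal illustration under stated assumptions: the graph is undirected and connected, it is stored as a dictionary of adjacency lists, and all function names and the example graph are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Geodesic distances from source to every reachable node, O(n + m)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:  # first visit gives the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def eccentricity(adj, u):
    """Maximum distance from u to any other node (connected graph assumed)."""
    return max(bfs_distances(adj, u).values())


def closeness(adj, u):
    """(n - 1) divided by the sum of distances from u (connected graph assumed)."""
    dist = bfs_distances(adj, u)
    return (len(adj) - 1) / sum(dist.values())


def diameter(adj):
    """Maximum eccentricity, obtained with n breadth-first visits."""
    return max(eccentricity(adj, u) for u in adj)


# Hypothetical example: a path 0 - 1 - 2 - 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(eccentricity(adj, 0), closeness(adj, 1), diameter(adj))
```

On a disconnected graph some distances are undefined, so a practical implementation would need to decide how to treat unreachable nodes; that choice is deliberately left out of this sketch.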

Betweenness. The most efficient sequential algorithm for computing the betweenness centrality of graphs is due to Brandes (2001). Brandes' algorithm requires O(n + m) space and runs in time O(nm) on unweighted graphs and time O(nm + n² log n) on weighted graphs.
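The following sketch follows the general scheme of Brandes' algorithm for unweighted graphs: one breadth-first visit per source that counts shortest paths, followed by a backward accumulation of pair dependencies. It is an illustrative reading of the method, not the author's code; the function name and example graph are hypothetical. As written, the sum runs over ordered pairs of endpoints, so on an undirected graph each value is twice the value obtained when pairs are counted once; halve the result if that convention is preferred.

```python
from collections import deque


def betweenness(adj):
    """Node betweenness on an unweighted graph, summed over ordered pairs."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Forward phase: BFS from s, counting shortest paths (sigma).
        order = []                       # nodes in order of non-decreasing distance
        preds = {v: [] for v in adj}     # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Backward phase: accumulate dependencies from the farthest nodes.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc


# Hypothetical example: a cycle 0 - 1 - 2 - 3 - 0.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(betweenness(cycle))
```

Each source costs one O(n + m) visit plus an accumulation pass over the recorded predecessors, which is where the overall O(nm) bound for unweighted graphs comes from.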

Clustering Coefficient. Computation of the node clustering coefficient CC(u) can be done in time O(n²) in the worst case, by counting the edges that connect pairs of neighbors of u. The network clustering coefficient CC(G) from Eq. (11) can be computed by counting all (closed) triplets. A brute-force approach is to examine each combination of three nodes, which requires time O(n³). A better algorithm has been proposed by Latapy (2008), who demonstrated that it is possible to solve triplet finding, counting and node counting in O(n^2.376) time and O(n²) space using fast matrix multiplication on the adjacency matrix representation of G.
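The direct neighbor-pair counting described above can be sketched as follows, using the node-level definition CC(u) = 2|EN(u)| / (n(u)(n(u) − 1)) and its average over all nodes. The function names and the example graph are hypothetical, and returning 0 for nodes with fewer than two neighbors is a convention assumed here, since the ratio is undefined in that case.

```python
def local_clustering(adj, u):
    """Fraction of pairs of neighbors of u that are themselves connected."""
    neighbors = adj[u]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: undefined ratio treated as 0
    # Count each edge among the neighbors once (a < b avoids double counting).
    links = sum(1 for a in neighbors for b in neighbors
                if a < b and b in adj[a])
    return 2 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the node clustering coefficients."""
    return sum(local_clustering(adj, u) for u in adj) / len(adj)


# Hypothetical example: a triangle 0 - 1 - 2 with a pendant node 3 attached to 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(local_clustering(adj, 0), average_clustering(adj))
```

Storing each neighborhood as a set makes the membership test `b in adj[a]` take expected constant time, so the cost for node u is quadratic in its number of neighbors, in line with the O(n²) worst case stated above.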

### Software

We conclude this part by mentioning some existing software packages which can be used to compute the measures described above and many others.

Gephi (Bastian et al. 2009) and NodeXL (2012) are interactive network visualization tools supporting visual network analysis. Gephi is a cross-platform, extensible environment with a plugin mechanism for implementing additional algorithms, while NodeXL is an extension of MS Excel that works only on Windows systems and is more focused on ease of use for people without strong computer skills. Both tools provide algorithms to compute common SNA measures, including those addressed in this essay (NodeXL uses SNAP (Leskovec and Sosič 2016) as the underlying computation library). The igraph package (Csardi and Nepusz 2006), written in C, supports creating and manipulating large graphs and can be used for statistical SNA, thanks to its version for the R statistical environment (R Core Team 2012).

## Key Applications

The applications of social network analysis span the most diverse topics. Newman (2010) cites several examples, including the estimation of the number of people the average person knows (McCormick et al. 2010), the study of the collaboration pattern of scientists (Newman 2001) and movie actors (Watts and Strogatz 1998), the analysis of dating patterns among high school students (Bearman et al. 2004), and the analysis of networks of terrorists (Latora and Marchiori 2004). We briefly discuss some of these results.

A scientific collaboration network is an undirected graph G = (V, E) where nodes represent scientists, and there exists an edge (u, v) if and only if u and v wrote a paper together. Newman (2001) observed that the collaboration networks of several disciplines exhibit a small-world structure: two scientists picked at random are likely to be connected by a short path in G. Specifically, the average degree of separation in the analyzed dataset is about six, meaning that any scientist can be reached from any other scientist in the collaboration graph by following a path of average length of 6.

The small-world property of collaboration networks was part of the folklore long before being formally observed. Mathematicians defined the Erdős number as a tribute to Paul Erdős (1913–1996), probably the most prolific mathematician of all time. The Erdős number E(v) is the distance between Erdős and researcher v in the collaboration graph. Therefore, those who have written a paper with Erdős have E(v) = 1, their coauthors (that are not coauthors of Erdős) have E(v) = 2, and so on. Most mathematicians have a finite Erdős number: an estimate of the median Erdős number among mathematicians is 5, the mean is 4.65, and the standard deviation is 1.21 (Erdős Number Project 2006). Many nonmathematicians have a finite Erdős number as well, due to interdisciplinary research activities that resulted in joint publications at some point in time.

The idea of the Erdős number has been ported to the movie industry: the Bacon number is defined as the distance to Kevin Bacon in the graph G whose nodes are actors, and an edge (u, v) represents the fact that actors u and v appeared in the same movie. The average Bacon number for a randomly chosen actor with finite distance to Kevin Bacon is 3.02 (Bacon Oracle 2016), revealing that the movie industry is a small-world network. There are notable scientists that have both a finite Erdős number and a finite Bacon number: for example, physicist and Nobel laureate Richard Feynman has Erdős number 3 and Bacon number 3, the latter due to his appearance in the film Anti-Clock.

Humans are not the only species whose collaboration patterns take the form of a small-world network. Lusseau (2003) analyzed the social interaction graph within a community of 64 bottlenose dolphins (Tursiops truncatus). Nodes of the interaction graph correspond to individual dolphins, and an undirected edge (u, v) denotes that u and v were observed together more often than expected by random encounters alone. The resulting network is scale-free and highly clustered.

We conclude by describing a study where graph metrics have been used to probe the structure of a terrorist organization. Latora and Marchiori (2004) consider the connections among the hijackers involved in the September 2001 attacks, with the goal of identifying the terrorists to target if one wants to disrupt the organization. As a quantitative measure of such disruption, the authors use the global efficiency of the terrorist graph defined in Eq. (8); the most important individual in the criminal organization is the one whose removal produces the largest decrease in efficiency.
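The ranking idea can be sketched as follows. Several assumptions are made here and should be checked against Eq. (8) and the cited study: global efficiency is taken in its usual form, the average of 1/d(u, v) over all ordered pairs of distinct nodes with unreachable pairs contributing 0; removing an individual is modeled by deleting the edges incident to that node while keeping n fixed; and all function names and the example graph are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Geodesic distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def global_efficiency(adj):
    """Average of 1/d(u, v) over ordered pairs; unreachable pairs add 0."""
    n = len(adj)
    total = 0.0
    for u in adj:
        for v, d in bfs_distances(adj, u).items():
            if v != u:
                total += 1.0 / d
    return total / (n * (n - 1))


def isolate(adj, node):
    """Copy of the graph with every edge incident to node removed."""
    return {u: ([] if u == node else [v for v in nbrs if v != node])
            for u, nbrs in adj.items()}


def efficiency_drop(adj):
    """Decrease in global efficiency caused by isolating each node in turn."""
    base = global_efficiency(adj)
    return {u: base - global_efficiency(isolate(adj, u)) for u in adj}


# Hypothetical example: a star with center 0 and leaves 1, 2, 3.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
drops = efficiency_drop(adj)
most_critical = max(drops, key=drops.get)
print(most_critical, drops)
```

In the star example the center is the node whose isolation disconnects everything, so it produces the largest drop; on real data the ranking need not coincide with degree.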

## Future Directions

The success of online social network sites as a mass phenomenon has dramatically increased the size of these networks. Analyzing them is computationally challenging, due to the prohibitive time and storage space requirements of traditional, sequential algorithms. In response to these challenges, new algorithms have been proposed, based on emerging computer architectures and different computing paradigms like approximate algorithms (Brandes and Pich 2007) and streaming computation (Becchetti et al. 2008; Guha and McGregor 2012). On-chip parallelism, such as that provided by modern multicore processors or by general purpose graphics processing units (GPUs), is fostering interest in the development of efficient parallel graph algorithms (Lambertini et al. 2014; Lumsdaine et al. 2007; Wang et al. 2016), where multiple independent execution units concurrently and cooperatively build the solution to specific graph problems.

However, it should be observed that more powerful algorithms and computing infrastructures are not always the correct solution to address the increase in network sizes. For example, while it may be interesting to compute centrality measures on the whole Facebook network, practical SNA tasks would often focus on a specific community, e.g., people subscribed to a product or company page or fans of a public figure. In this context community detection methods may become particularly relevant to filter portions of the network to analyze (Fortunato 2010).

Another aspect that has not been taken into full consideration yet is the understanding of the semantics of edges. Some decades ago the reconstruction of the social graph was a difficult task, usually to be performed by asking people to list their friends or observing specific environments, e.g., a workplace. The advent of online social networks has made it much easier to construct social network graphs. However, the graph edges that can be easily inferred from online data may not be the best way to identify the social relations of interest, and this could lead to the construction of a wrong social graph. For example, many social network sites provide information about user contacts through their application programming interfaces (API), e.g., we can easily retrieve all Twitter followers of a specific user through the Twitter API. However, all the path-based measures presented in this essay are based on the assumption that the network represents communication paths, while a large percentage of messages is in fact exchanged between users that are not directly connected, and therefore they are exchanged on a different network that we may call communication network (Rossi and Magnani 2012).

Things may get even more complicated when we consider multiple social networks. Individuals often use different services to communicate with different audiences, e.g., Facebook, Twitter, or LinkedIn. Of course, the social graphs built using Facebook, Twitter, or LinkedIn contacts may differ strongly, even if these graphs are built considering the same set of individuals. The computation of centrality measures on each of these graphs in isolation may provide insights on the usage of that social network by a specific user, but may fail to draw a realistic picture of the information flow and of the role each user plays in cross disseminating information over all networks.

## References

1. Alexanderson GL (2006) About the cover: Euler and Königsberg’s bridges: a historical view. Bull Am Math Soc 43:567–573. doi:10.1090/S0273-0979-06-01130-X
2. Anthonisse JM (1971) The rush in a directed graph. Technical report BN 9/71, Stichting Mathematisch Centrum, Amsterdam
3. Bacon Oracle (2016) The Oracle of Bacon. https://oracleofbacon.org/. Accessed 11 Nov 2016
4. Bastian M, Heymann S, Jacomy M (2009) Gephi: an open source software for exploring and manipulating networks. http://www.aaai.org/ocs/index.php/ICWSM/09/paper/view/154
5. Bearman PS, Moody J, Stovel K (2004) Chains of affection: the structure of adolescent romantic and sexual networks. Am J Sociol 110(1):44–91. doi:10.1086/386272
6. Becchetti L, Boldi P, Castillo C, Gionis A (2008) Efficient semi-streaming algorithms for local triangle counting in massive graphs. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, KDD '08. ACM, New York, pp 16–24. doi:10.1145/1401890.1401898
7. Brandes U (2001) A faster algorithm for betweenness centrality. J Math Sociol 25(2):163–177. doi:10.1080/0022250X.2001.9990249
8. Brandes U, Pich C (2007) Centrality estimation in large networks. Int J Bifurcation Chaos 17(07):2303–2318. doi:10.1142/S0218127407018403
9. Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms, 3rd edn. MIT Press, Cambridge, MA
10. Costa LF, Rodrigues FA, Travieso G, Villas Boas PR (2007) Characterization of complex networks: a survey of measurements. Adv Phys 56(1):167–242. doi:10.1080/00018730601170527
11. Csardi G, Nepusz T (2006) The igraph software package for complex network research. Inter J Complex Syst 1695. http://igraph.org/
12. Erdős Number Project (2006) The Erdős number project at Oakland University. https://oakland.edu/enp/. Accessed 26 Nov 2016
13. Festa P (2006) Shortest path algorithms. In: Resende MGC, Pardalos PM (eds) Handbook of optimization in telecommunications. Springer, New York, pp 185–210. doi:10.1007/978-0-387-30165-5_8
14. Floyd RW (1962) Algorithm 97: shortest path. Commun ACM 5(6):345. doi:10.1145/367766.368168
15. Fortunato S (2010) Community detection in graphs. Phys Rep 486(3–5):75–174. doi:10.1016/j.physrep.2009.11.002
16. Freeman LC (1977) A set of measures of centrality based on betweenness. Sociometry 40(1):35–41. doi:10.2307/3033543
17. Freeman LC (1978) Centrality in social networks conceptual clarification. Soc Networks 1(3):215–239. doi:10.1016/0378-8733(78)90021-7
18. Goncalves B, Perra N, Vespignani A (2011) Modeling users’ activity on twitter networks: validation of Dunbar’s number. PLoS ONE 6(8):e22656. doi:10.1371/journal.pone.0022656
19. Guha S, McGregor A (2012) Graph synopses, sketches, and streams: a survey. Proc VLDB Endow 5(12):2030–2031. doi:10.14778/2367502.2367570
21. Harary F, Norman RZ (1953) Graph theory as a mathematical model in the social sciences. Institute for Social Research, University of Michigan, Ann Arbor
22. Johnson DB (1977) Efficient algorithms for shortest paths in sparse networks. J ACM 24(1):1–13. doi:10.1145/321992.321993
23. Lambertini M, Magnani M, Marzolla M, Montesi D, Paolino C (2014) Large-scale social network analysis. In: Gkoulalas-Divanis A, Labbi A (eds) Large-scale data analytics. Springer, New York, pp 155–187. doi:10.1007/978-1-4614-9242-9_6
24. Latapy M (2008) Main-memory triangle computations for very large (sparse (power-law)) graphs. Theor Comput Sci 407(1):458–473. doi:10.1016/j.tcs.2008.07.017
25. Latora V, Marchiori M (2001) Efficient behavior of small-world networks. Phys Rev Lett 87:198701. doi:10.1103/PhysRevLett.87.198701
26. Latora V, Marchiori M (2004) How the science of complex networks can help developing strategies against terrorism. Chaos, Solitons Fractals 20(1):69–75. doi:10.1016/S0960-0779(03)00429-6
27. Leskovec J, Sosič R (2016) Snap: a general-purpose network analysis and graph-mining library. ACM Trans Intell Syst Technol 8(1):20. doi:10.1145/2898361
28. Luce R, Perry A (1949) A method of matrix analysis of group structure. Psychometrika 14:95–116. doi:10.1007/BF02289146
29. Lumsdaine A, Gregor D, Hendrickson B, Berry JW (2007) Challenges in parallel graph processing. Parallel Process Lett 17(1):5–20. doi:10.1142/S0129626407002843
30. Lusseau D (2003) The emergent properties of a dolphin social network. Proc R Soc Lond B Biol Sci 270(Suppl 2):S186–S188. doi:10.1098/rsbl.2003.0057
31. McCormick TH, Salganik MJ, Zheng T (2010) How many people do you know?: efficiently estimating personal network size. J Am Stat Assoc 105(489):59–70. doi:10.1198/jasa.2009.ap08518
32. Moreno JL (1934) Who shall survive? A new approach to the problem of human Interrelations. Nervous and Mental Disease Publishing Co., Washington, DC
33. Newman MEJ (2001) The structure of scientific collaboration networks. Proc Natl Acad Sci U S A 98(2):404–409. doi:10.1073/pnas.98.2.404
34. Newman MEJ (2005) A measure of betweenness centrality based on random walks. Soc Networks 27(1):39–54. doi:10.1016/j.socnet.2004.11.009
35. Newman MEJ (2010) Networks: an introduction. Oxford University Press, Oxford
36. NodeXL (2012) Nodexl, a graph visualization and manipulation software. http://nodexl.codeplex.com. Accessed 6 Dec 2016
37. Opsahl T, Panzarasa P (2009) Clustering in weighted networks. Soc Networks 31(2):155–163. doi:10.1016/j.socnet.2009.02.002
38. Opsahl T, Agneessens F, Skvoretz J (2010) Node centrality in weighted networks: generalizing degree and shortest paths. Soc Networks 32(3):245–251. doi:10.1016/j.socnet.2010.03.006
39. Peay ER (1980) Connectedness in a general model for valued networks. Soc Networks 2(4):385–410. doi:10.1016/0378-8733(80)90005-2
40. R Core Team (2012) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. http://www.R-project.org. ISBN:3-900051-07-0
41. Rossi L, Magnani M (2012) Conversation practices and network structure in twitter. https://www.aaai.org/ocs/index.php/ICWSM/ICWSM12/paper/view/4634
42. Sabidussi G (1966) The centrality index of a graph. Psychometrika 31(4):581–603. doi:10.1007/BF02289527
43. Wang Y, Davidson A, Pan Y, Wu Y, Riffel A, Owens JD (2016) Gunrock: a high-performance graph processing library on the GPU. In: Proceedings of 21st ACM SIGPLAN symposium on principles and practice of parallel programming, PPoPP '16. ACM, New York, pp 11:1–11:12. doi:10.1145/2851141.2851145
44. Wasserman S, Faust K (1994) Social network analysis. Cambridge University Press, New York
45. Watts DJ, Strogatz SH (1998) Collective dynamics of “small-world” networks. Nature 393:440–442. doi:10.1038/30918
46. White DR, Borgatti SP (1994) Betweenness centrality measures for directed graphs. Soc Networks 16(4):335–346. doi:10.1016/0378-8733(94)90015-9

## Authors and Affiliations

1. Department of Information Technology, Uppsala University, Uppsala, Sweden
2. Department of Computer Science and Engineering, University of Bologna, Bologna, Italy

## Section editors and affiliations

• Przemysław Kazienko (1)
• Jaroslaw Jankowski (2)

1. Department of Computer Science and Management, Institute of Informatics, Wrocław University of Technology, Wrocław, Poland
2. Faculty of Computer Science and Information Technology, West Pomeranian University of Technology, Szczecin, Poland