A generalised significance test for individual communities in networks

Kojaku, Sadamori; Masuda, Naoki

doi:10.1038/s41598-018-25560-z

Published: 09 May 2018

Volume 8, article number 7351, (2018)

Scientific Reports

Abstract

Many empirical networks have community structure, in which nodes are densely interconnected within each community (i.e., a group of nodes) and sparsely across different communities. Like other local and meso-scale structure of networks, communities are generally heterogeneous in various aspects such as the size, density of edges, connectivity to other communities and significance. In the present study, we propose a method to statistically test the significance of individual communities in a given network. Compared to the previous methods, the present algorithm is unique in that it accepts different community-detection algorithms and the corresponding quality function for single communities. The present method requires that a quality of each community can be quantified and that community detection is performed as optimisation of such a quality function summed over the communities. Various community detection algorithms including modularity maximisation and graph partitioning meet this criterion. Our method estimates a distribution of the quality function for randomised networks to calculate a likelihood of each community in the given network. We illustrate our algorithm by synthetic and empirical networks.

Introduction

Many biological, physical and social systems can be expressed as networks, with nodes representing individual entities within the network and edges representing pairwise relationships between nodes^1,2. Among various structural properties of networks, many empirical networks have community structure such that a network is composed of communities, which are groups of nodes that are densely interconnected with each other while sparsely interconnected with those in other groups^3,4. A community may correspond to the role of nodes. For example, communities may correspond to functional modules of proteins⁵, groups of airports serving the same geographical region⁶ and herds of people sharing an interest⁷.

Many algorithms have been proposed for finding communities in networks^3,4. These algorithms are often equipped with a quality function with which to judge whether or not the detected community structure is significant overall. A much less asked fundamental question is the significance of individual communities. In fact, a network may be composed of a part where community structure is pronounced and another part where community structure is vague or absent. To discuss community structure in such a “chimera” network, one needs methods to assess statistical significance of single communities.

In the present study, we consider the significance of single communities that have been detected by a non-overlapping community-detection algorithm. An algorithm for testing significance of individual communities was previously proposed⁸. In that algorithm, one uses a quality function for individual communities to compare the quality of a community in question, detected in the given network, and that detected in randomised networks. The distribution of the quality function in randomised networks is analytically known. The authors then used the same significance test in OSLOM, which is an algorithm for finding various types of communities⁹. However, OSLOM does not optimise the same quality function as that used in the aforementioned statistical test or its aggregate over the different communities. The same discrepancy exists in a different significance test for single communities¹⁰. In an extreme case, let us suppose one detects communities by optimising a quality function that is very different from the quality function used in the statistical test. Then, the detected communities may have small values of the quality function used in the statistical test and will be judged to be insignificant. However, in terms of the quality function used in the community detection, the detected communities may be sufficiently strong.

This pitfall may be overcome if one uses the same quality function for the community detection and the statistical test. There exist such significance tests for individual communities^11,12. However, these significance tests^11,12 do not consider the possible dependence of the quality function value on the size of community^10,13,14. This practice is problematic for the following reason. Suppose that two communities in the given network have different sizes and bear the same value of the quality function. Then, the significance level (i.e., p-value) in these statistical tests is the same for the two communities. In general, however, the quality function value may be positively correlated with the community size, which is in fact often the case (Methods section). In this case, it is easier for the larger community to attain the observed quality function value than for the smaller community under the null model. Then, the smaller community should be judged to be more significant than the larger community if they yield the same quality function value. An aforementioned statistical test does consider the dependence of the quality function value on the community size¹⁰. However, that method does not use a common quality function between community detection and statistical testing, as discussed already.

Based on these considerations, it will be useful to develop methods to test the significance of individual communities that (i) use a quality function that is consistent with the one used in community detection, and (ii) take into account the dependence of the quality function value on the community size. We will develop a new statistical test for individual communities that meets these criteria. An additional feature of our method is that it allows for general quality functions. Python code for the present significance test is available at https://github.com/skojaku/qstest/.

Methods

Correlation between quality and community size

We consider unweighted networks composed of N nodes. Denote their N × N adjacency matrix by A = (A_ij), where A_ij = 1 if nodes i and j are adjacent and A_ij = 0 otherwise. We assume that the network is undirected (i.e., A_ij = A_ji for all i ≠ j) and does not contain self-loops (i.e., A_ii = 0). Let M be the number of edges in the network. We denote by ${d}_{i}\equiv {\sum }_{j=1}^{N}\,{A}_{ij}$ the degree of node i.

One may regard a community as significant if its quality value is significantly larger than that expected for randomised networks. This intuitive approach has a problem. To see this, let us consider a benchmark network generated by the Lancichinetti-Fortunato-Radicchi (LFR) model¹⁵ (Fig. 1(a)). The network has N = 10³ nodes and consists of C non-overlapping communities. Each node i belongs to one of the C = 31 communities. To generate the network, we set the average node’s degree to 10, the maximum node’s degree to 100, the range of the number of nodes in a community c (denoted by n_c) to [10,100] and the power-law exponent for the distributions of d_i and n_c to 2. Let us consider a quality function ${q}_{c}^{{\rm{mod}}}$ given by^13,14

$${q}_{c}^{{\rm{mod}}}\equiv \frac{1}{2M}\sum _{\begin{array}{c}1\le i,j\le N\\ i,j\in \,{\rm{community}}\,c\end{array}}({A}_{ij}-\frac{{d}_{i}{d}_{j}}{2M}).$$

(1)

Note that the modularity is the sum of ${q}_{c}^{{\rm{mod}}}$ over the communities⁷. We find a strong positive correlation between ${q}_{c}^{{\rm{mod}}}$ and n_c (circles in Fig. 1(b)). This is also true for communities in randomised networks that are generated by the configuration model, i.e., random networks that preserve the expected degree of each node (crosses in Fig. 1(b)). Crucially, large communities detected in the randomised networks have larger ${q}_{c}^{{\rm{mod}}}$ values than small communities in the original network do. Therefore, we cannot judge the significance of communities solely by the value of ${q}_{c}^{{\rm{mod}}}$. The results are qualitatively the same for the other quality functions for individual communities introduced in the following section (Fig. 1(c)–(e)).
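As an illustration of Eq. (1), the following minimal sketch computes ${q}_{c}^{{\rm{mod}}}$ for every community of a small undirected network. The function name `community_modularity` and its edge-list and label-dictionary inputs are illustrative assumptions and are not taken from the authors' code. The sketch uses the fact that the double sum over ordered pairs in Eq. (1) counts each intra-community edge twice and that the sum of d_i d_j over pairs in community c equals the squared sum of degrees in c.

```python
def community_modularity(edges, labels):
    """Return {community: q_c^mod} following Eq. (1).

    edges  : list of (i, j) pairs, each undirected edge listed once, no self-loops
    labels : dict mapping every node to its community label
    """
    edges = list(edges)
    M = len(edges)
    degree = {node: 0 for node in labels}
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1

    intra = {}  # number of intra-community edges per community
    vol = {}    # sum of node degrees per community
    for node, c in labels.items():
        vol[c] = vol.get(c, 0) + degree[node]
        intra.setdefault(c, 0)
    for i, j in edges:
        if labels[i] == labels[j]:
            intra[labels[i]] += 1

    # sum_{i,j in c} A_ij = 2 * intra[c];  sum_{i,j in c} d_i d_j = vol[c]**2
    return {c: (2 * intra[c] - vol[c] ** 2 / (2 * M)) / (2 * M) for c in intra}


# Toy example (hypothetical data): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q_mod = community_modularity(edges, labels)
modularity = sum(q_mod.values())
```

Summing the returned values over the communities gives the modularity of the partition, in line with the remark above.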

Our statistical test

On the basis of the observations made in the previous section, we construct a statistical test for individual communities as follows. Note that we do not specify the quality function q_c, which may be ${q}_{c}^{{\rm{mod}}}$ or a different one. Moreover, we do not specify how one measures the size s_c of community c. We refer to the present statistical test based on a quality function q and community size s as the (q, s)–test.

Suppose that we have a community c with quality q_c and size s_c. We judge community c to be significant if its q_c value is larger than those for communities of the same size s_c detected in randomised networks. We compute $P(\tilde{q}\ge {q}_{c}|{s}_{c})$, which is the probability that a community of size s_c detected in randomised networks generated by the configuration model has a quality value $\tilde{q}$ no smaller than q_c. We numerically estimate $P(\tilde{q}\ge {q}_{c}|{s}_{c})$ as follows. First, we generate 500 randomised networks using the configuration model. Then, we detect communities in each randomised network by the algorithm that has been used to detect communities in the original network. Let $\overline{C}$ be the sum of the number of communities detected in the 500 randomised networks. For each community $\overline{c}$ $(1\le \overline{c}\le \overline{C})$ in the randomised networks, we compute the quality ${\tilde{q}}_{\overline{c}}$ and size ${\tilde{s}}_{\overline{c}}$. Then, we compute the average values, i.e., $\langle \tilde{q}\rangle \equiv {\sum }_{\overline{c}=1}^{\overline{C}}\,{\tilde{q}}_{\overline{c}}/\overline{C}$ and $\langle \tilde{s}\rangle \equiv {\sum }_{\overline{c}=1}^{\overline{C}}\,{\tilde{s}}_{\overline{c}}/\overline{C}$, and the standard deviations computed from the unbiased variance estimates, i.e., ${\sigma }_{\tilde{q}}\equiv \sqrt{{\sum }_{\overline{c}=1}^{\overline{C}}{({\tilde{q}}_{\overline{c}}-\langle \tilde{q}\rangle )}^{2}/(\overline{C}-1)}$ and ${\sigma }_{\tilde{s}}\equiv \sqrt{{\sum }_{\overline{c}=1}^{\overline{C}}{({\tilde{s}}_{\overline{c}}-\langle \tilde{s}\rangle )}^{2}/(\overline{C}-1)}$. We estimate the joint probability distribution $P(\tilde{q},\tilde{s})$ using the kernel density estimator¹⁶ as follows:

$$P(\tilde{q},\tilde{s})=\frac{1}{\overline{C}}\sum _{\overline{c}=1}^{\overline{C}}\,f\left(\frac{\tilde{q}-{\tilde{q}}_{\overline{c}}}{h{\sigma }_{\tilde{q}}},\frac{\tilde{s}-{\tilde{s}}_{\overline{c}}}{h{\sigma }_{\tilde{s}}}\right),$$

(2)

where h is the width of the kernel. The function f (·, ·) is the bivariate Gaussian kernel (i.e., bivariate standard normal distribution) given by

$$f({x}_{1},{x}_{2})\equiv \frac{1}{2\pi \sqrt{1-{\gamma }^{2}}}\exp (\,-\,\frac{{x}_{1}^{2}-2\gamma {x}_{1}{x}_{2}+{x}_{2}^{2}}{2(1-{\gamma }^{2})}),$$

(3)

where

$$\gamma \equiv \frac{\sum _{\bar{c}=1}^{\bar{C}}({\tilde{q}}_{\bar{c}}-\langle \tilde{q}\rangle )({\tilde{s}}_{\bar{c}}-\langle \tilde{s}\rangle )}{\sqrt{\sum _{\bar{c}=1}^{\bar{C}}{({\tilde{q}}_{\bar{c}}-\langle \tilde{q}\rangle )}^{2}}\sqrt{\sum _{\bar{c}=1}^{\bar{C}}{({\tilde{s}}_{\bar{c}}-\langle \tilde{s}\rangle )}^{2}}},$$

(4)

is the Pearson correlation coefficient between ${\{{\tilde{q}}_{\overline{c}}\}}_{\overline{c}=1}^{\overline{C}}$ and ${\{{\tilde{s}}_{\overline{c}}\}}_{\overline{c}=1}^{\overline{C}}$. The probability distribution estimated by the Gaussian kernels converges to the true probability distribution, whatever its form, as the number of samples increases¹⁷. Although there are also non-Gaussian kernels that share this property¹⁷, we used Gaussian kernels, which are a state-of-the-art choice. The width h is a free parameter that affects the speed of the convergence to the true probability distribution. Optimising the value of h requires assumptions about the true probability distribution and intensive computations^18,19. Therefore, we set $h={\overline{C}}^{-1/6}$ according to Scott’s rule-of-thumb²⁰, which often provides a reasonable estimate in practice^18,19,20.

The conditional probability, $P(\tilde{q}\ge {q}_{c}|{s}_{c})$, is given by

$$P(\tilde{q}\ge {q}_{c}|{s}_{c})=\frac{{\int }_{{q}_{c}}^{\infty }P(\tilde{q},{s}_{c})\,{\rm{d}}\tilde{q}}{{\int }_{-\infty }^{\infty }P(\tilde{q},{s}_{c})\,{\rm{d}}\tilde{q}}=\frac{\sum _{\overline{c}=1}^{\overline{C}}{\int }_{{q}_{c}}^{\infty }\,f\left(\frac{\tilde{q}-{\tilde{q}}_{\overline{c}}}{h{\sigma }_{\tilde{q}}},\frac{{s}_{c}-{\tilde{s}}_{\overline{c}}}{h{\sigma }_{\tilde{s}}}\right){\rm{d}}\tilde{q}}{\sum _{\overline{c}=1}^{\overline{C}}{\int }_{-\infty }^{\infty }\,f\left(\frac{\tilde{q}-{\tilde{q}}_{\overline{c}}}{h{\sigma }_{\tilde{q}}},\frac{{s}_{c}-{\tilde{s}}_{\overline{c}}}{h{\sigma }_{\tilde{s}}}\right){\rm{d}}\tilde{q}}.$$

(5)

The integration of f (x₁, x₂) over x₁ yields

$${\int }_{y}^{{\rm{\infty }}}f({x}_{1},{x}_{2})\,{\rm{d}}{x}_{1}=\frac{1}{\sqrt{2\pi }}\exp (\,-\,\frac{{x}_{2}^{2}}{2})[1-{\rm{\Phi }}(\frac{y-\gamma {x}_{2}}{\sqrt{1-{\gamma }^{2}}})],$$

(6)

where Φ (·) is the cumulative distribution function of the standard normal distribution. By substituting Eq. (6) into Eq. (5), we have

$$P(\tilde{q}\ge {q}_{c}|{s}_{c})=1-\frac{\sum _{\overline{c}=1}^{\overline{C}}\exp \left[-{\left(\frac{{s}_{c}-{\tilde{s}}_{\overline{c}}}{\sqrt{2}h{\sigma }_{\tilde{s}}}\right)}^{2}\right]\,\Phi \left(\frac{1}{\sqrt{1-{\gamma }^{2}}}\left(\frac{{q}_{c}-{\tilde{q}}_{\overline{c}}}{h{\sigma }_{\tilde{q}}}-\gamma \frac{{s}_{c}-{\tilde{s}}_{\overline{c}}}{h{\sigma }_{\tilde{s}}}\right)\right)}{\sum _{\overline{c}=1}^{\overline{C}}\exp \left[-{\left(\frac{{s}_{c}-{\tilde{s}}_{\overline{c}}}{\sqrt{2}h{\sigma }_{\tilde{s}}}\right)}^{2}\right]}.$$

(7)

Finally, we regard community c as significant if $P(\tilde{q}\ge {q}_{c}|{s}_{c})\le \alpha $, where α ∈ [0, 1] is the significance level. The conditional probability $P(\tilde{q}\ge {q}_{c}|{s}_{c})$ obeys a uniform probability distribution over [0, 1] for a community detected in a randomised network (see Supplementary Information 1). One can estimate more accurate p-values (i.e., $P(\tilde{q}\ge {q}_{c}|{s}_{c})$) using a larger number of randomised networks, which, however, requires additional computational time. We opt to use 500 randomised networks to obtain sufficiently accurate p-values in a reasonable time. In fact, the p-value does not change much if one increases the number of randomised networks beyond 500 or if one uses networks with different numbers of nodes and communities (Supplementary Information 2).
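The closed-form expression in Eq. (7) can be evaluated directly from the pooled samples of quality and size obtained from the randomised networks. The following is a minimal sketch using only the Python standard library; the name `qs_test_pvalue` and its arguments are illustrative assumptions, not the interface of the authors' package. It assumes that the samples are not degenerate (non-zero standard deviations and |γ| < 1) and that at least one sample has a size close enough to s_c for the denominator not to underflow.

```python
import math
import random


def qs_test_pvalue(q_c, s_c, q_null, s_null):
    """Estimate P(q~ >= q_c | s_c) with Eq. (7) from null samples (q~, s~)."""
    n = len(q_null)
    mean_q = sum(q_null) / n
    mean_s = sum(s_null) / n
    ss_q = sum((q - mean_q) ** 2 for q in q_null)
    ss_s = sum((s - mean_s) ** 2 for s in s_null)
    sigma_q = math.sqrt(ss_q / (n - 1))
    sigma_s = math.sqrt(ss_s / (n - 1))
    cov = sum((q - mean_q) * (s - mean_s) for q, s in zip(q_null, s_null))
    gamma = cov / math.sqrt(ss_q * ss_s)  # Pearson correlation, Eq. (4)
    h = n ** (-1.0 / 6.0)                 # kernel width, Scott's rule-of-thumb

    def Phi(x):
        # Cumulative distribution function of the standard normal distribution
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    num = 0.0
    den = 0.0
    for q, s in zip(q_null, s_null):
        x1 = (q_c - q) / (h * sigma_q)
        x2 = (s_c - s) / (h * sigma_s)
        w = math.exp(-0.5 * x2 ** 2)      # equals exp[-((s_c - s) / (sqrt(2) h sigma_s))^2]
        num += w * Phi((x1 - gamma * x2) / math.sqrt(1.0 - gamma ** 2))
        den += w
    return 1.0 - num / den


# Synthetic null samples (hypothetical data): quality grows with size, plus noise.
rng = random.Random(0)
s_null = [rng.randint(10, 100) for _ in range(500)]
q_null = [0.001 * s + rng.gauss(0.0, 0.005) for s in s_null]
p_strong = qs_test_pvalue(0.2, 50, q_null, s_null)   # far above the null quality at size 50
p_weak = qs_test_pvalue(-0.1, 50, q_null, s_null)    # far below the null quality at size 50
```

In this toy setting, a community whose quality lies far above the null values at the same size receives a p-value close to zero, and one far below receives a p-value close to one.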

As the number of communities, C, increases, some insignificant communities would be judged significant owing to the multiple comparison problem. To avoid this, we use the Šidák correction²¹, i.e., α = 1 − (1 − α′)^(1/C), where α′ ∈ [0, 1] is the targeted significance level. We set α′ = 0.05.
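The Šidák correction is a one-line computation. The sketch below shows how the per-community significance level follows from the targeted level and the number of communities; the function name `sidak_alpha` is an illustrative assumption.

```python
def sidak_alpha(alpha_target, n_communities):
    """Per-community significance level under the Sidak correction."""
    return 1.0 - (1.0 - alpha_target) ** (1.0 / n_communities)


# For the targeted level 0.05 and C = 31 communities (as in the LFR example above).
alpha_31 = sidak_alpha(0.05, 31)
# Probability of at least one false positive among 31 independent tests at this level.
familywise = 1.0 - (1.0 - alpha_31) ** 31
```

With α′ = 0.05 and C = 31, the per-community level is about 0.00165, and the family-wise level recovered from it is 0.05 by construction.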

Time complexity

The time complexity of the proposed statistical test is evaluated as follows. Generating one randomised network from the configuration model consumes ${\mathscr{O}}(N+M)$ time using an efficient algorithm²², which is implemented in some network analysis software^23,24. For each generated randomised network, we detect communities. Any community-detection algorithm qualified for the present statistical test computes the quality and size of the individual communities and maximises the quality function for the entire network. We use the quality and size of the optimised communities in the statistical test. We carry out these procedures for each of the R randomised networks, consuming ${\mathscr{O}}((N+M+Z)R)$ time in total, where Z is the time complexity of the community-detection algorithm. We compute the p-value for each of the C communities in the original network using Eq. (7) with RC^conf samples on average, ${\{{\tilde{q}}_{c}\}}_{c=1}^{R{C}^{{\rm{conf}}}}$ and ${\{{\tilde{s}}_{c}\}}_{c=1}^{R{C}^{{\rm{conf}}}}$, where C^conf is the average number of communities detected in a randomised network. This incurs a time complexity of ${\mathscr{O}}(C\times R{C}^{{\rm{conf}}})$. In total, the proposed statistical test requires ${\mathscr{O}}((N+M+Z+C{C}^{{\rm{conf}}})R)$ time.
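To make the randomisation step concrete, the sketch below generates a random network in which each node's expected degree approximately matches its degree in the original network, by connecting nodes i and j independently with probability min(1, d_i d_j / 2M). This is a naive quadratic-time version written for readability, not the linear-time algorithm cited above, and taking this particular connection probability as the null model is an assumption of the sketch. The function name `expected_degree_graph_naive` is likewise an illustrative choice.

```python
import random


def expected_degree_graph_naive(degrees, rng):
    """Random simple graph preserving degrees in expectation (O(N^2) sketch).

    degrees : list of target degrees d_i, one per node
    rng     : random.Random instance
    Returns a list of undirected edges (i, j) with i < j.
    """
    two_m = sum(degrees)
    n = len(degrees)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):  # i < j: no self-loops, no duplicate edges
            p = min(1.0, degrees[i] * degrees[j] / two_m)
            if rng.random() < p:
                edges.append((i, j))
    return edges


# Example (hypothetical data): 200 nodes, each with target degree 4.
rng = random.Random(42)
random_edges = expected_degree_graph_naive([4] * 200, rng)
```

For large networks one would replace this loop with an efficient generator of the kind referenced in the text; the quadratic loop is only meant to show what is being sampled.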

The time complexity can be mitigated using parallel computing. In other words, one runs multiple threads, each of which generates independent samples of $({\tilde{q}}_{c},{\tilde{s}}_{c})$. Once the sampling is completed in all the threads, one computes the p-value using Eq. (7). We used 16 threads on a computer with the Intel 2.6 GHz Sandy Bridge processors and 4GB of memory. For the largest network we analysed (i.e., Internet²⁵; N = 34,761 nodes), our statistical test needed 403 seconds using the Louvain community-detection algorithm, which has a time complexity of ${\mathscr{O}}(M)$²⁶. With the Kernighan-Lin community-detection algorithm having a time complexity of ${\mathscr{O}}({N}^{2})$²⁷, it took 17,763 seconds (i.e. approximately 5 hours).
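The parallel sampling scheme described above can be sketched with the standard library as follows. The helper `collect_null_samples` and the callback `sample_once` are hypothetical names: `sample_once(rng)` stands for "generate one randomised network, detect communities in it, and return the list of (quality, size) pairs". Threads are used here only to keep the sketch short; for CPU-bound pure-Python code, a process pool would likely be needed to obtain an actual speed-up.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def collect_null_samples(sample_once, n_networks=500, n_workers=16, seed=0):
    """Pool (q~, s~) pairs from n_networks randomised networks."""

    def run(k):
        rng = random.Random(seed + k)  # separately seeded generator per network
        return sample_once(rng)

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        per_network = list(pool.map(run, range(n_networks)))
    # Flatten: the p-value in Eq. (7) uses all samples together.
    return [pair for samples in per_network for pair in samples]


# Stand-in for the real sampling step (hypothetical): three fake communities per network.
def toy_sample_once(rng):
    return [(rng.random(), rng.randint(10, 100)) for _ in range(3)]


samples = collect_null_samples(toy_sample_once, n_networks=10, n_workers=4)
samples_again = collect_null_samples(toy_sample_once, n_networks=10, n_workers=4)
```

Seeding each randomised network separately keeps the pooled samples reproducible regardless of how the work is scheduled across threads.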

Community detection with different quality functions

Among various quality functions for individual communities apart from ${q}_{c}^{{\rm{mod}}}$^4,13,14, we consider the following three quality functions. The internal average degree¹⁴ (i.e., normalised number of intra-community edges), denoted by ${q}_{c}^{{\rm{int}}}$, is defined by

$${q}_{c}^{{\rm{int}}}\equiv \frac{1}{{n}_{c}}\sum _{\begin{array}{c}1\le i,j\le N\\ i,j\in \,{\rm{community}}\,c\end{array}}{A}_{ij}.$$

(8)

The maximisation of ${q}_{c}^{{\rm{int}}}$ yields a community having dense intra-community connectivity. The expansion¹⁴, denoted by ${q}_{c}^{\exp }$, is defined by

$${q}_{c}^{\exp }\equiv -\,\frac{1}{{n}_{c}}\sum _{\begin{array}{c}1\le i,j\le N\\ i\in \,{\rm{community}}\,c\\ j\notin \,{\rm{community}}\,c\end{array}}{A}_{ij}.$$

(9)

The maximisation of ${q}_{c}^{\exp }$ yields a community having sparse inter-community connectivity. Finally, the conductance¹⁴, denoted by ${q}_{c}^{{\rm{cnd}}}$, is defined by

$${q}_{c}^{{\rm{cnd}}}\equiv -\,\frac{1}{{{\rm{vol}}}_{c}}\sum _{\begin{array}{c}1\le i,j\le N\\ i\in \,{\rm{community}}\,c\\ j\notin \,{\rm{community}}\,c\end{array}}{A}_{ij},$$

(10)

where vol_c is the sum of degrees of nodes (i.e., volume) in a community c. Similar to the case of ${q}_{c}^{\exp }$, the maximisation of ${q}_{c}^{{\rm{cnd}}}$ yields a community having sparse inter-community connectivity. One can also interpret the maximisation of ${q}_{c}^{{\rm{cnd}}}$ as the maximisation of the number of intra-community edges²⁸.
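The three quality functions in Eqs (8)–(10) depend only on the number of intra-community edges, the number of edges leaving the community, the number of nodes and the volume. The sketch below computes all three for each community; the function name `community_qualities` and its inputs are illustrative assumptions, and it presumes that every community has non-zero volume.

```python
def community_qualities(edges, labels):
    """Return {community: {"int": ..., "exp": ..., "cnd": ...}} per Eqs (8)-(10).

    edges  : list of (i, j) pairs, each undirected edge listed once, no self-loops
    labels : dict mapping every node to its community label
    """
    n, vol, intra, cut = {}, {}, {}, {}
    for node, c in labels.items():
        n[c] = n.get(c, 0) + 1
        vol.setdefault(c, 0)
        intra.setdefault(c, 0)
        cut.setdefault(c, 0)
    for i, j in edges:
        ci, cj = labels[i], labels[j]
        vol[ci] += 1
        vol[cj] += 1
        if ci == cj:
            intra[ci] += 1
        else:
            cut[ci] += 1
            cut[cj] += 1
    return {
        c: {
            "int": 2 * intra[c] / n[c],  # ordered pairs count each intra edge twice
            "exp": -cut[c] / n[c],       # negative so that larger is better
            "cnd": -cut[c] / vol[c],     # assumes vol[c] > 0
        }
        for c in n
    }


# Toy example (hypothetical data): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
qualities = community_qualities(edges, labels)
```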

For ${q}_{c}^{{\rm{mod}}}$, we adopt the Louvain algorithm to maximise the modularity (i.e., sum of ${q}_{c}^{{\rm{mod}}}$ over the communities, ${\sum }_{c=1}^{C}\,{q}_{c}^{{\rm{mod}}}$) to find communities in the original and randomised networks. However, the Louvain algorithm is not applicable to $Q={\sum }_{c=1}^{C}\,{q}_{c}$, where ${q}_{c}={q}_{c}^{{\rm{int}}}$, ${q}_{c}^{\exp }$ or ${q}_{c}^{{\rm{cnd}}}$. Therefore, we adopt a variant of the Kernighan–Lin algorithm²⁹ used in a previous study²⁷. The algorithm seeks a partitioning of the network into communities that maximises Q. Suppose that each node i has a tentative label ${\ell }_{i}$ $(1\le {\ell }_{i}\le C)$ indicating the index of the community to which node i belongs. First, we assign each node to one of the C communities selected uniformly at random. Second, for each node i, we tentatively relabel it to each different label and measure the increment in Q. Third, we select the node i and its new label c that maximise the increment in Q among all nodes i (1 ≤ i ≤ N) and all possible new labels. Regardless of whether Q increases or not, we accept the proposed relabelling of node i (i.e., set ${\ell }_{i}=c$). Fourth, we determine the pair of another node j (j ≠ i) and its tentative new label c′ that maximises the increment in Q, and change the label of j to c′ (i.e., ${\ell }_{j}=c^{\prime} $). In this manner, we relabel nodes one by one, without relabelling the nodes that have already been relabelled. After sequentially relabelling the N nodes, we select the labelling that yields the largest value of Q among the N + 1 labellings that have appeared in the course of relabelling the N nodes. If the initial labelling (before relabelling any node) yields the largest value of Q, we terminate the algorithm. Otherwise, we use the labelling that has yielded the largest Q value among the N + 1 labellings as the initial labelling in the next round, in which we again sequentially relabel the N nodes and select the best labelling. We repeat rounds of updating until the initial labelling is the best labelling in the round in terms of the Q value.
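The relabelling procedure described above can be sketched as follows. This is a deliberately simple version that recomputes Q from scratch for every tentative move, so it is far slower than an implementation that updates Q incrementally; the names `kl_round` and `kl_partition` are illustrative assumptions, and the sketch requires at least two community labels. Any function Q that maps a labelling to a number can be plugged in; the example uses the modularity of a toy network.

```python
import random


def kl_round(labels, communities, Q):
    """Relabel every node exactly once, greedily; return the best labelling seen."""
    current = dict(labels)
    best, best_Q = dict(current), Q(current)
    unlocked = set(current)
    while unlocked:
        move, move_Q = None, None
        for i in sorted(unlocked):
            old = current[i]
            for c in communities:
                if c == old:
                    continue
                current[i] = c  # tentative relabelling
                val = Q(current)
                if move_Q is None or val > move_Q:
                    move, move_Q = (i, c), val
            current[i] = old
        i, c = move
        current[i] = c          # accepted even if Q decreases
        unlocked.remove(i)      # a node is relabelled only once per round
        if move_Q > best_Q:
            best, best_Q = dict(current), move_Q
    return best, best_Q


def kl_partition(nodes, communities, Q, rng):
    """Repeat rounds until the initial labelling of a round is the best one."""
    labels = {i: rng.choice(communities) for i in nodes}
    best_Q = Q(labels)
    while True:
        new_labels, new_Q = kl_round(labels, communities, Q)
        if new_Q <= best_Q:
            return labels, best_Q
        labels, best_Q = new_labels, new_Q


# Toy example (hypothetical data): two triangles joined by one edge, C = 2.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
nodes = list(range(6))
deg = {i: sum(1 for e in edges if i in e) for i in nodes}


def modularity(lab):
    M = len(edges)
    total = 0.0
    for c in set(lab.values()):
        e_in = sum(1 for i, j in edges if lab[i] == c and lab[j] == c)
        v = sum(deg[i] for i in nodes if lab[i] == c)
        total += (2 * e_in - v * v / (2 * M)) / (2 * M)
    return total


found_labels, found_Q = kl_partition(nodes, [0, 1], modularity, random.Random(1))
```

Because each round either strictly improves the best Q value or terminates the search, the loop in `kl_partition` always stops after a finite number of rounds.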

To find communities in networks using ${q}_{c}^{{\rm{int}}}$, ${q}_{c}^{\exp }$ or ${q}_{c}^{{\rm{cnd}}}$, we need to specify the number of communities, C. Otherwise, the maximisation of the quality functions may yield trivial communities. For example, ${q}_{c}^{\exp }$ is always the largest when each connected component constitutes a community because there is no inter-community edge. In the analysis of synthetic networks, we set C to the number of planted communities. For empirical networks, we set C to the number of communities identified by the Louvain algorithm.

Other statistical tests

We compare the (q, s)–test with two statistical tests, i.e., the test proposed by Spirin and Mirny¹⁰ and the test proposed by Lancichinetti, Radicchi and Ramasco⁸, which we refer to as the S–test and L–test, respectively. As is the case with the (q, s)–test, both the S–test and the L–test adopt the configuration model as the null model. For both statistical tests, we set the significance level for a single community to α = 1 − (1 − α′)^(1/C), where α′ = 0.05.

The S–test regards a community as significant if it has more intra-community edges than a community composed of the same number of nodes detected in randomised networks does. Their original algorithm¹⁰ is slow for large networks. Therefore, we adopt the Kernighan–Lin algorithm²⁹ to optimise the quality function for a community adopted in the S–test. In our numerical experiments, our implementation is faster and also finds better community structure than their original algorithm does in terms of their quality function.

The L–test regards a community as significant if every node in the community has more neighbours within the community than that expected for the configuration model. In the original paper⁸, the authors defined two significance measures, i.e., the ${\mathscr{C}}$–score and the $ {\mathcal B} $–score. We adopt the $ {\mathcal B} $–score, which is less conservative than the ${\mathscr{C}}$–score. In the original article⁸, the $ {\mathcal B} $–score is claimed to be more trustworthy than the ${\mathscr{C}}$–score because the ${\mathscr{C}}$–score, but not the $ {\mathcal B} $–score, relies on extreme value statistics.

Data

We apply the statistical test to the 12 empirical networks listed in Table 1. We ignore the directions and weights of edges in the empirical networks.

Table 1 Properties of 12 empirical networks.

The karate club network represents the relationships among the members of a university’s karate club³⁰. Each node represents a member of the karate club. Two members are defined to be adjacent if they are friends outside of the club activities.

The dolphin social network represents the relationships of the dolphins living near Doubtful Sound in New Zealand³¹. Each node represents a dolphin. Two dolphins are defined to be adjacent if they are frequently observed in the same school.

The network of Les Misérables represents the relationships between the characters of the novel Les Misérables³². Each node represents a character of the book. Each edge indicates that the two characters appear in the same chapter of the book.

The Enron email network represents the email interactions among the staff of Enron Inc³³. Each node represents an email account. Each edge indicates that an email is sent from one account to the other account.

The jazz network represents the collaborations among jazz musicians³⁴. Each node represents a jazz musician. Each edge indicates that two musicians belong to the same band.

The network of network scientists represents the collaborations between researchers in network science⁷. Each node represents a researcher. Two researchers are defined to be adjacent if they have published a co-authored paper cited by one of two popular review papers on network science. Then, some nodes and edges were added manually by the author of the article⁷. We only consider the largest connected component of the network.

The political blog network is the network of blogs on the United States presidential election in 2004³⁵. Each node represents a blog. Two blogs are defined to be adjacent if there is at least one hyperlink between the two blogs on their front page.

The airport network consists of nodes representing airports in the world^36,37. Two airports are defined to be adjacent if there is a direct commercial flight between the two airports.

The protein network represents the physical interactions among human proteins^38,39. Each node represents a protein. Two proteins are defined to be adjacent if they physically interact.

The Chess network represents the chess matches between players²⁵. Each node represents a chess player. Each edge indicates that the two players have played against each other at least once.

The Astro-ph network represents the collaborations among the researchers who published a joint paper in the arXiv’s astro-ph section⁴⁰. Each node represents a researcher. Two researchers are defined to be adjacent if they have published a joint paper.

The Internet network represents the network of autonomous systems²⁵. A node represents an autonomous system, which is a group of routers maintained by a network operator. Two autonomous systems are defined to be adjacent if they have a logical peering relation.

Results

We measure the size of a community in two ways: the number of nodes in a community c, n_c, and the sum of degrees of nodes in a community c, vol_c. In the next two subsections, we consider the $({q}_{c}^{{\rm{mod}}},{n}_{c})$–test and the $({q}_{c}^{{\rm{mod}}},{{\rm{vol}}}_{c})$–test. We show the results for other quality functions in the third subsection.

Synthetic networks

In this section, we examine synthetic networks with planted communities. We generate networks using the LFR model¹⁵, which places edges such that the node’s degree, (i.e., d_i), and the number of nodes in a community c, (i.e., n_c), follow power-law distributions. We set the power-law exponent for the distributions of d_i and n_c to 2, the average node’s degree to 10, the maximum degree to 100 and the range of n_c to [20,200]. The networks are composed of N = 10³ nodes. Each node i has an average fraction 1 − μ of neighbours belonging to the same community, where μ ∈ {0, 0.025, 0.05, …, 1} is a mixing parameter controlling the “strength” of community structure. With μ = 0, all edges are placed within communities, and the community structure is the strongest. With μ = 1, all edges are between different communities. We set the extent of overlaps between different communities to zero.

We generate 30 networks using the LFR model at each μ value. For each generated network, we classify the planted communities into significant and insignificant communities by each statistical test. Then, we compute the true positive rate (i.e., the fraction of significant communities in the network). Finally, we average the true positive rate over the 30 generated networks.

Figure 2 shows the true positive rate as a function of μ. The true positive rate for the S–test is smallest for the entire range of μ, indicating that the S–test is the most conservative. The S–test does not regard all the planted communities as significant even at μ = 0 for the following reason. In the S–test, one detects the strongest community in each randomised network, where the strength of a community is measured by the number of intra-community edges. Then, a focal community in the original network is regarded as significant if it is stronger than the majority of the strongest communities detected in the randomised networks. The strongest communities in the randomised networks often contain almost the largest possible number of intra-community edges, whereas the planted communities do not always even at μ = 0. Therefore, the S–test concludes that some planted communities are insignificant. The true positive rate for the L–test is 1 when μ = 0 and ranges between 0.55 and 0.95 for 0 < μ ≤ 0.5. The true positive rate for the $({q}_{c}^{{\rm{mod}}},{n}_{c})$–test and that for the $({q}_{c}^{{\rm{mod}}},{{\rm{vol}}}_{c})$–test are comparable and close to 1 for 0 ≤ μ ≤ 0.3. In contrast, there is a visible difference between the results for the $({q}_{c}^{{\rm{mod}}},{n}_{c})$– and the $({q}_{c}^{{\rm{mod}}},{{\rm{vol}}}_{c})$– tests for 0.3 < μ ≤ 0.5. This result suggests that the definition of the size of a community may affect the significance of weak communities but not of strong communities.

Empirical networks

We apply the statistical tests to the 12 empirical networks listed in Table 1 (see the Data section for details). In this section, we detect communities by modularity maximisation using the Louvain algorithm²⁶. Then, we apply the statistical tests to each detected community.

The fraction of significant communities for each statistical test is shown in Table 2. The $({q}_{c}^{{\rm{mod}}},{n}_{c})$– and the $({q}_{c}^{{\rm{mod}}},{{\rm{vol}}}_{c})$–tests identify more significant communities than the S–test and the L–test do in a majority of the 12 empirical networks. This result indicates that the $({q}_{c}^{{\rm{mod}}},{n}_{c})$– and the $({q}_{c}^{{\rm{mod}}},{{\rm{vol}}}_{c})$–tests are more generous than the S–test and the L–test, which is consistent with the results for the LFR model. This is probably because the $({q}_{c}^{{\rm{mod}}},{n}_{c})$– and the $({q}_{c}^{{\rm{mod}}},{{\rm{vol}}}_{c})$–tests use ${q}_{c}^{{\rm{mod}}}$ to evaluate the quality of individual communities, which is consistent with the objective function of modularity maximisation, ${\sum }_{c=1}^{C}\,{q}_{c}^{{\rm{mod}}}$.

Table 2 Fraction of significant communities identified by the S–test, the L–test, the (q^mod, s)–test, the (q^int, s)–test, the (q^exp, s)–test and the (q^cnd, s)–test in the 12 empirical networks.

To quantify the agreement between the $({q}_{c}^{{\rm{mod}}},{n}_{c})$– and the $({q}_{c}^{{\rm{mod}}},{{\rm{vol}}}_{c})$– tests, we compute the level of agreement defined by τ = (C₁₁ + C₀₀)/C, where C₀₀ is the number of communities classified as insignificant by both statistical tests and C₁₁ is the number of communities classified as significant by both tests. Note that 0 ≤ τ ≤ 1, τ = 1 if the two tests regard the same set of communities as significant, and τ = 0 if the two tests completely disagree. We compute τ between each pair of statistical tests for each empirical network and then average τ over the 12 empirical networks. The averaged τ values are shown in Table 3. We find τ = 0.42 between the S–test and the L–test, indicating that the two statistical tests disagree for a majority of communities. The L–test weakly agrees with the $({q}_{c}^{{\rm{mod}}},{{\rm{vol}}}_{c})$–test (i.e., τ = 0.58) but disagrees with the other tests for a majority of communities (i.e., τ < 0.5). The τ between the $({q}_{c}^{{\rm{mod}}},{n}_{c})$– and the $({q}_{c}^{{\rm{mod}}},{{\rm{vol}}}_{c})$–tests is large (τ = 0.84), suggesting that the significance of a majority of communities is not strongly affected by the definition of the community size.
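The level of agreement τ is simply the fraction of communities on which two tests reach the same verdict. A minimal sketch, with the hypothetical function name `agreement` and boolean verdict lists as assumed inputs:

```python
def agreement(sig_a, sig_b):
    """tau = (C11 + C00) / C for two lists of per-community verdicts."""
    if len(sig_a) != len(sig_b) or not sig_a:
        raise ValueError("verdict lists must be non-empty and of equal length")
    same = sum(1 for a, b in zip(sig_a, sig_b) if bool(a) == bool(b))
    return same / len(sig_a)


# Hypothetical verdicts of two tests on four communities (True = significant).
tau = agreement([True, True, False, False], [True, False, False, True])
```

Here the two tests agree on the first and third communities only, so τ = 0.5.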

Table 3 Agreement between pairs of statistical tests.

Other quality functions

In this section, we examine the $({q}_{c}^{{\rm{int}}},{s}_{c})$–, the $({q}_{c}^{\exp },{s}_{c})$– and the $({q}_{c}^{{\rm{cnd}}},{s}_{c})$–tests, where s_c is either n_c or vol_c. For the synthetic networks, the true positive rate for the $({q}_{c}^{{\rm{int}}},{n}_{c})$– and the $({q}_{c}^{{\rm{int}}},{{\rm{vol}}}_{c})$–tests is small in the entire range of μ (Fig. 3). As is the case for the S–test, the quality function ${q}_{c}^{{\rm{int}}}$ uses the number of intra-community edges. Some planted communities are regarded as insignificant because randomised networks often contain a community having almost the largest possible number of intra-community edges (Fig. 1(c)). The quality function ${q}_{c}^{\exp }$ is the largest when the community c is disconnected from the other nodes. Randomised networks often contain many disconnected components, yielding a large value of ${q}_{c}^{\exp }$ (Fig. 1(d)). Therefore, the true positive rate for the $({q}_{c}^{\exp },{n}_{c})$– and the $({q}_{c}^{\exp },{{\rm{vol}}}_{c})$–tests is also close to zero in the entire range of μ. In contrast to the $({q}_{c}^{{\rm{int}}},{s}_{c})$– and the $({q}_{c}^{\exp },{s}_{c})$–tests, the $({q}_{c}^{{\rm{cnd}}},{n}_{c})$– and the $({q}_{c}^{{\rm{cnd}}},{{\rm{vol}}}_{c})$–tests yield a true positive rate close to one when μ ≤ 0.3. These findings suggest that the outcome of the test depends considerably on the quality function. For all the (q, s)–tests, the definition of community size (i.e., n_c or vol_c) does not strongly influence the true positive rate.

For the empirical networks, we first detect communities by maximising q, where q is either ${q}_{c}^{{\rm{int}}}$, ${q}_{c}^{\exp }$ or ${q}_{c}^{{\rm{cnd}}}$, using the variant of the Kernighan–Lin algorithm (see the Other statistical test sections). Then, we apply the (q, s)–test to each detected community. The results for the $({q}_{c}^{{\rm{int}}},{s}_{c})$–, the $({q}_{c}^{\exp },{s}_{c})$– and the $({q}_{c}^{{\rm{cnd}}},{s}_{c})$–tests applied to the 12 empirical networks are shown in Table 2. For all the networks, the $({q}_{c}^{{\rm{cnd}}},{s}_{c})$–test regards more communities as significant than the $({q}_{c}^{{\rm{int}}},{s}_{c})$– and the $({q}_{c}^{\exp },{s}_{c})$–tests, where s_c is either n_c or vol_c. This result is consistent with those obtained for the synthetic networks (Fig. 3). For each quality function q, the level of agreement (i.e., τ) between the different definitions of the community size (i.e., n_c or vol_c) is shown in Table 4. For most empirical networks, the agreement τ is larger than 0.8, indicating that the results of the statistical test do not strongly depend on the definition of community size in most cases.

Table 4 Agreement between the (q_c, n_c)–test and the (q_c, vol_c)–test.

Discussion

We proposed a non-parametric statistical test, called the (q, s)–test, for the significance of individual communities, which accounts for the correlation between the quality and the size of single communities. We demonstrated our test with several quality functions q including the one defined as the contribution of a single community to the modularity. In fact, the (q, s)–test accepts different quality functions for individual communities such as those described in the previous literature^{13,14,41,42,43}. In addition, the (q, s)–test does not prescribe how communities should be detected in a given network. We note that one should use a q that is consistent with the objective function for community detection, because the former is maximised in the (q, s)–test and the latter is maximised in community detection.

We have used two definitions of the size of a community, i.e., the number of nodes in a community (i.e., n_c), and the sum of degrees of nodes in a community (i.e., vol_c). For degree-homogeneous networks, the choice does not matter because n_c ∝ vol_c. However, for degree-heterogeneous networks, which communities are judged significant may depend considerably on whether we use n_c or vol_c. If q explicitly uses its own measure of the size of a community, we should probably adopt the corresponding definition of the community size in the (q, s)–test. If a measure of community size is not explicit, we suggest that one selects the measure of community size that is most strongly correlated with q. If q is correlated with multiple quantities (e.g., both n_c and vol_c) that are not perfectly correlated with each other, one can extend the (q, s)–test by adopting multivariate Gaussian kernels with three or more variables instead of bivariate Gaussian kernels. A downside of this approach is that we would need more data to reliably estimate the distribution of (q, s), where s is at least two-dimensional.
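To make the multivariate extension concrete, the following minimal sketch estimates the probability that a community in a randomised network has quality at least q given its size s, where s may have one or more components. It is an illustration under stated assumptions rather than the exact estimator used in the test: it assumes a product Gaussian kernel (one independent bandwidth per variable, which ignores correlation inside the kernel) and bandwidths set by Scott's rule. The function name `conditional_p_value` and the sample values are hypothetical.

```python
import math
from statistics import stdev


def _norm_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def conditional_p_value(q_obs, s_obs, q_rand, s_rand):
    """Kernel estimate of P(q >= q_obs | s = s_obs).

    q_obs  : quality of the community under test
    s_obs  : tuple of its size measure(s), e.g. (n_c,) or (n_c, vol_c)
    q_rand : qualities of communities found in randomised networks
    s_rand : matching list of size tuples, same length as q_rand
    Assumption: product Gaussian kernel with Scott's-rule bandwidths.
    """
    n = len(q_rand)
    k = len(s_obs)
    d = 1 + k                        # q plus k size variables
    scale = n ** (-1.0 / (d + 4))    # Scott's rule factor
    h_q = scale * stdev(q_rand)
    h_s = [scale * stdev([s[j] for s in s_rand]) for j in range(k)]
    if h_q == 0.0 or any(h == 0.0 for h in h_s):
        raise ValueError("samples have zero variance in some variable")

    num = 0.0
    den = 0.0
    for q_i, s_i in zip(q_rand, s_rand):
        # Kernel weight of sample i at the observed size(s).
        z2 = sum(((s_obs[j] - s_i[j]) / h_s[j]) ** 2 for j in range(k))
        w = math.exp(-0.5 * z2)
        # Mass of the q-kernel centred at q_i that lies at or above q_obs.
        num += w * (1.0 - _norm_cdf((q_obs - q_i) / h_q))
        den += w
    if den == 0.0:
        raise ValueError("no randomised sample is close to s_obs")
    return num / den


# Hypothetical samples from randomised networks.
q_rand = [0.1, 0.2, 0.3, 0.4]
s_one = [(10,), (20,), (30,), (40,)]                  # s = n_c only
s_two = [(10, 30), (20, 55), (30, 100), (40, 120)]    # s = (n_c, vol_c)
p_one = conditional_p_value(0.25, (25,), q_rand, s_one)
p_two = conditional_p_value(0.25, (25, 80), q_rand, s_two)
```

Each additional size variable lengthens the tuple and changes the Scott's-rule exponent, so a fixed number of randomised samples is spread over a higher-dimensional space and the estimate becomes noisier.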

We can adopt the (q, s)–test to assess the significance of other structures of networks, such as bipartite communities^{44} and core-periphery structure^{45,46,47}, provided that the quality function for the individual structure (e.g., a single bipartite community) is explicitly defined. In fact, we applied a variant of the (q, s)–test to core-periphery structure in our previous study^{47}.

Robustness of community structure against random perturbations (e.g., addition, removal and rewiring of edges) is an alternative measure of the significance of communities^{14,48,49}. With this approach, if small perturbations do not considerably change communities, then the communities are regarded as significant. Statistical tests based on quality functions, including the (q, s)–test, and those based on robustness may provide different results^{49}. As is the case for quality functions, the robustness of an individual community may be correlated with the size of the community. For example, removal of a small number of intra-community edges may destroy small communities, whereas large communities may survive the removal of more intra-community edges. If this is the case, it may be worthwhile to inform a robustness–based test of individual communities by the dependence of the robustness measure on the size of a community.

References

Newman, M. E. J. Networks: An Introduction (Oxford University Press, Oxford, 2010).
Barabási, A. L. Network Science (Cambridge University Press, Cambridge, 2016).
Fortunato, S. Community detection in graphs. Phys. Rep. 486, 75–174 (2010).
Fortunato, S. & Hric, D. Community detection in networks: A user guide. Phys. Rep. 659, 1–44 (2016).
Jonsson, P. F., Cavanna, T., Zicha, D. & Bates, P. A. Cluster analysis of networks generated through homology: automatic identification of important protein communities involved in cancer metastasis. BMC Bioinf. 7, 2 (2006).
Guimerà, R., Mossa, S., Turtschi, A. & Amaral, L. A. N. The worldwide air transportation network: anomalous centrality, community structure, and cities’ global roles. Proc. Natl. Acad. Sci. USA 102, 7794–7799 (2005).
Newman, M. E. J. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E 74, 036104 (2006).
Lancichinetti, A., Radicchi, F. & Ramasco, J. J. Statistical significance of communities in networks. Phys. Rev. E 81, 046110 (2010).
Lancichinetti, A., Radicchi, F., Ramasco, J. J. & Fortunato, S. Finding statistically significant communities in networks. PLOS ONE 6, e18961 (2011).
Spirin, V. & Mirny, L. A. Protein complexes and functional modules in molecular networks. Proc. Natl. Acad. Sci. USA 100, 12123–12128 (2003).
Wang, B. et al. Spatial scan statistics for graph clustering. In Proc. 2008 SIAM Int. Conf. Data Mining, 727–738 (SIAM, Philadelphia, 2008).
Zhao, Y., Levina, E. & Zhu, J. Community extraction for social networks. Proc. Natl. Acad. Sci. USA 108, 7321–7326 (2011).
Leskovec, J., Lang, K. J. & Mahoney, M. W. Empirical comparison of algorithms for network community detection. In Proc. 19th Int. Conf. World Wide Web, 631–640 (ACM, New York, 2010).
Yang, J. & Leskovec, J. Defining and evaluating network communities based on ground-truth. Know. Inf. Syst. 42, 181–213 (2015).
Lancichinetti, A., Fortunato, S. & Radicchi, F. Benchmark graphs for testing community detection algorithms. Phys. Rev. E 78, 046110 (2008).
Wand, M. P. & Jones, M. C. Comparison of smoothing parameterizations in bivariate kernel density estimation. J. Am. Stat. Assoc. 88, 520–528 (1993).
Parzen, E. On estimation of a probability density function and mode. Annal. Math. Stat. 33, 1065–1076 (1962).
Park, B. U. & Marron, J. S. Comparison of data-driven bandwidth selectors. J. Am. Stat. Assoc. 85, 66–72 (1990).
Jones, M. C., Marron, J. S. & Sheather, S. J. A brief survey of bandwidth selection for density estimation. J. Am. Stat. Assoc. 91, 401–407 (1996).
Scott, D. W. Multivariate density estimation and visualization (Springer, Berlin, 2012).
Šidák, Z. Rectangular confidence regions for the means of multivariate normal distributions. J. Am. Stat. Assoc. 62, 626–633 (1967).
Miller, J. C. & Hagberg, A. Efficient generation of networks with given expected degrees. In Frieze, A., Horn, P. & Prałat, P. (eds) Algorithms and Models for the Web Graph, vol. 6732 LNCS, 115–126 (Springer Berlin Heidelberg, Berlin, Heidelberg, 2011).
Staudt, C. L., Sazonovs, A. & Meyerhenke, H. Networkit: A tool suite for large-scale complex network analysis. Network Science 4, 508–530 (2016).
Hagberg, A. A., Schult, D. A. & Swart, P. J. Exploring network structure, dynamics, and function using NetworkX. In Varoquaux, G., Vaught, T. & Millman, J. (eds) Proc. 7th Python in Sci. Conf., 11–15 (Pasadena, CA USA, 2008).
Kunegis, J. Available at, http://konect.uni-koblenz.de [Accessed: 2 Sep 2017].
Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. 2008, P10008 (2008).
Karrer, B. & Newman, M. E. J. Stochastic blockmodels and community structure in networks. Phys. Rev. E 83, 016107 (2011).
von Luxburg, U. A tutorial on spectral clustering. Stat. Comput. 17, 395–416 (2007).
Kernighan, B. W. & Lin, S. An efficient heuristic procedure for partitioning graphs. Bell Syst. Tech. J. 49, 291–307 (1970).
Zachary, W. W. An information flow model for conflict and fission in small groups. J. Anthropol. Res. 33, 452–473 (1977).
Lusseau, D. et al. The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behav. Ecol. Sociobiol. 54, 396–405 (2003).
Knuth, D. E. The Stanford GraphBase: A Platform for Combinatorial Computing (ACM Press, New York, 1993).
Klimt, B. & Yang, Y. The Enron corpus: A new dataset for email classification research. In Proc. 15th European Conf. Machine Learning, 217–226 (Springer, Berlin, 2004).
Gleiser, P. M. & Danon, L. Community structure in jazz. Adv. Comp. Syst. 6, 565–573 (2003).
Adamic, L. A. & Glance, N. The political blogosphere and the 2004 U.S. election: divided they blog. In Proc. 3rd Int. Workshop on Link Discovery, 36–43 (ACM, New York, 2005).
Patokallio, J. Available at, http://openflights.org [Accessed: 24 Sep 2016].
Opsahl, T. Available at, https://toreopsahl.com/2011/08/12/why-anchorage-is-not-that-important-binary-ties-and-sample-selection [Accessed: 24 Sep 2016].
Rual, J. et al. Towards a proteome-scale map of the human protein-protein interaction network. Nature 437, 1173–1178 (2005).
Ma’ayan, A. Available at, http://research.mssm.edu/maayan/datasets/qualitative_networks.shtml [Accessed: 2 Sep 2017].
Leskovec, J., Kleinberg, J. & Faloutsos, C. Graph evolution: densification and shrinking diameters. ACM Trans. Knowl. Discov. Data 1, 2 (2007).
Chen, M., Kuzmin, K. & Szymanski, B. K. Community detection via maximization of modularity and its variants. IEEE Trans. Comput. Soc. Syst. 1, 46–65 (2014).
Lambiotte, R., Delvenne, J. C. & Barahona, M. Random walks, Markov processes and the multiscale modular organization of complex networks. IEEE Trans. Netw. Sci. Eng. 1, 76–90 (2014).
Zhang, P. & Moore, C. Scalable detection of statistically significant communities and hierarchies, using message passing for modularity. Proc. Natl. Acad. Sci. USA 111, 18144–18149 (2014).
Newman, M. E. J. & Leicht, E. A. Mixture models and exploratory analysis in networks. Proc. Natl. Acad. Sci. USA 104, 9564–9569 (2007).
Borgatti, S. P. & Everett, M. G. Models of core/periphery structures. Soc. Netw. 21, 375–395 (2000).
Rombach, M. P., Porter, M. A., Fowler, J. H. & Mucha, P. J. Core-periphery structure in networks (revisited). SIAM Rev. 59, 619–646 (2017).
Kojaku, S. & Masuda, N. Core-periphery structure requires something else in the network. New J. Phys. 20, 043012 (2018).
Gfeller, D., Chappelier, J. C. & De Los Rios, P. Finding instabilities in the community structure of complex networks. Phys. Rev. E 72, 056135 (2005).
Karrer, B., Levina, E. & Newman, M. E. J. Robustness of community structure in networks. Phys. Rev. E 77, 046119 (2008).

Acknowledgements

N.M. acknowledges the support provided through JST, CREST, and JST, ERATO, Kawarabayashi Large Graph Project.

Author information

Authors and Affiliations

CREST, JST, Kawaguchi Center Building, 4-1-8, Honcho, Kawaguchi-shi, Saitama, 332-0012, Japan
Sadamori Kojaku
Department of Engineering Mathematics, Merchant Venturers Building, University of Bristol, Woodland Road, Clifton, Bristol, BS8 1UB, United Kingdom
Sadamori Kojaku & Naoki Masuda

Authors

Sadamori Kojaku
Naoki Masuda

Contributions

N.M. conceived and designed the research; S.K. performed the computational experiments; N.M. and S.K. wrote the paper.

Corresponding author

Correspondence to Naoki Masuda.

Ethics declarations

Competing Interests

The authors declare no competing interests.

Additional information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Kojaku, S., Masuda, N. A generalised significance test for individual communities in networks. Sci Rep 8, 7351 (2018). https://doi.org/10.1038/s41598-018-25560-z

Received: 11 December 2017
Accepted: 24 April 2018
Published: 09 May 2018
DOI: https://doi.org/10.1038/s41598-018-25560-z
Springer Nature Limited
