Introduction

Many biological, physical and social systems can be expressed as networks, with nodes representing individual entities within the network and edges representing pairwise relationships between nodes1,2. Among various structural properties of networks, many empirical networks have community structure such that a network is composed of communities, which are groups of nodes that are densely interconnected with each other while sparsely interconnected with those in other groups3,4. A community may correspond to the role of nodes. For example, communities may correspond to functional modules of proteins5, groups of airports serving the same geographical region6 and herds of people sharing an interest7.

Many algorithms have been proposed for finding communities in networks3,4. These algorithms are often equipped with a quality function with which to judge whether or not the detected community structure is significant overall. A fundamental question that is asked much less often is the significance of individual communities. In fact, a network may be composed of a part where community structure is pronounced and another part where community structure is vague or absent. To discuss community structure in such a “chimera” network, one needs methods to assess the statistical significance of single communities.

In the present study, we consider the significance of single communities that have been detected by a non-overlapping community-detection algorithm. An algorithm for testing significance of individual communities was previously proposed8. In that algorithm, one uses a quality function for individual communities to compare the quality of a community in question, detected in the given network, and that detected in randomised networks. The distribution of the quality function in randomised networks is analytically known. The authors then used the same significance test in OSLOM, which is an algorithm for finding various types of communities9. However, OSLOM does not optimise the same quality function as that used in the aforementioned statistical test or its aggregate over the different communities. The same discrepancy exists in a different significance test for single communities10. In an extreme case, let us suppose one detects communities by optimising a quality function that is very different from the quality function used in the statistical test. Then, the detected communities may have small values of the quality function used in the statistical test and will be judged to be insignificant. However, in terms of the quality function used in the community detection, the detected communities may be sufficiently strong.

This pitfall may be overcome if one uses the same quality function for the community detection and the statistical test. There exist such significance tests for individual communities11,12. However, these significance tests11,12 do not consider the possible dependence of the quality function value on the size of community10,13,14. This practice is problematic for the following reason. Suppose that two communities in the given network have different sizes and bear the same value of the quality function. Then, the significance level (i.e., p-value) in these statistical tests is the same for the two communities. In general, however, the quality function value may be positively correlated with the community size, which is in fact often the case (Methods section). In this case, it is easier for the larger community to attain the observed quality function value than for the smaller community under the null model. Then, the smaller community should be judged to be more significant than the larger community if they yield the same quality function value. An aforementioned statistical test does consider the dependence of the quality function value on the community size10. However, that method does not use a common quality function between community detection and statistical testing, as discussed already.

Based on these considerations, it will be useful to develop methods to test the significance of individual communities that (i) use a quality function that is consistent with the one used in community detection, and (ii) take into account the dependence of the quality function value on the community size. We will develop a new statistical test for individual communities that meets these criteria. An additional feature of our method is that it allows for general quality functions. Python code for the present significance test is available at https://github.com/skojaku/qstest/.

Methods

Correlation between quality and community size

We consider unweighted networks composed of N nodes. Denote their N × N adjacency matrix by A = (A ij ), where A ij  = 1 if nodes i and j are adjacent and A ij  = 0 otherwise. We assume that the network is undirected (i.e., A ij  = A ji for all i ≠ j) and does not contain self-loops (i.e., A ii  = 0). Let M be the number of edges in the network. We denote by \({d}_{i}\equiv {\sum }_{j=1}^{N}\,{A}_{ij}\) the degree of node i.

One may regard a community as significant if its quality value is significantly larger than that expected for randomised networks. This intuitive approach has a problem. To see this, let us consider a benchmark network generated by the Lancichinetti-Fortunato-Radicchi (LFR) model15 (Fig. 1(a)). The network has \(N={10}^{3}\) nodes and consists of C non-overlapping communities. Each node i belongs to one of the C = 31 communities. To generate the network, we set the average node’s degree to 10, the maximum node’s degree to 100, the range of the number of nodes in a community c (denoted by n c ) to [10,100] and the power-law exponent for the distributions of d i and n c to 2. Let us consider a quality function \({q}_{c}^{{\rm{mod}}}\) given by13,14

$${q}_{c}^{{\rm{mod}}}\equiv \frac{1}{2M}\sum _{\begin{array}{c}1\le i,j\le N\\ i,j\in \,{\rm{community}}\,c\end{array}}({A}_{ij}-\frac{{d}_{i}{d}_{j}}{2M}).$$
(1)
Figure 1

(a) A network with 31 non-overlapping communities generated by the LFR model. The circles represent nodes. The lines between the nodes represent edges. The colour of each node indicates the planted community to which the node belongs. (b–e) Quality of a community (i.e., \({q}_{c}^{{\rm{mod}}}\), \({q}_{c}^{{\rm{int}}}\), \({q}_{c}^{\exp }\) and \({q}_{c}^{{\rm{cnd}}}\)) plotted against its number of nodes, n c . The circles indicate the planted communities shown in panel (a). The crosses indicate the communities detected in 500 randomised networks generated by the configuration model. To find communities in the randomised networks, we use the Louvain algorithm26 for \({q}_{c}^{{\rm{mod}}}\) (panel (b)) and a variant of the Kernighan–Lin algorithm27 for \({q}_{c}^{{\rm{int}}}\), \({q}_{c}^{\exp }\) and \({q}_{c}^{{\rm{cnd}}}\) (panels (c–e)).

Note that the modularity is the sum of \({q}_{c}^{{\rm{mod}}}\) over the communities7. We find a strong positive correlation between \({q}_{c}^{{\rm{mod}}}\) and n c (circles in Fig. 1(b)). This is also true for communities in randomised networks that are generated by the configuration model, i.e., random networks that preserve the expected degree of each node (crosses in Fig. 1(b)). Crucially, large communities detected in the randomised networks have larger \({q}_{c}^{{\rm{mod}}}\) values than small communities in the original network do. Therefore, we cannot judge the significance of communities solely by the value of \({q}_{c}^{{\rm{mod}}}\). The results are qualitatively the same for the other quality functions for individual communities introduced in the following section (Fig. 1(c–e)).
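As an illustration of Eq. (1), the following minimal sketch evaluates \({q}_{c}^{{\rm{mod}}}\) directly from an adjacency matrix. The function name `q_mod` and the toy network (two triangles joined by a single edge) are assumptions made for this example and are not taken from the qstest package.

```python
def q_mod(adj, members):
    """Per-community modularity contribution q_c^mod of Eq. (1).

    adj: symmetric 0/1 adjacency matrix as a list of lists (no self-loops).
    members: iterable of node indices belonging to community c.
    """
    degrees = [sum(row) for row in adj]
    two_m = sum(degrees)  # 2M: every edge is counted twice
    members = list(members)
    total = 0.0
    for i in members:
        for j in members:  # ordered pairs, including i == j, as in Eq. (1)
            total += adj[i][j] - degrees[i] * degrees[j] / two_m
    return total / two_m


# Hypothetical toy network: two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
adj = [[0] * n for _ in range(n)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1

q_left = q_mod(adj, [0, 1, 2])
q_right = q_mod(adj, [3, 4, 5])
modularity = q_left + q_right  # modularity is the sum over communities
print(q_left, q_right, modularity)
```

For this toy network, each triangle contributes (6 − 49/14)/14 = 2.5/14, so the modularity of the two-community partition is 5/14.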

Our statistical test

On the basis of the observations made in the previous section, we construct a statistical test for individual communities as follows. Note that we do not specify the quality function q c , which may be \({q}_{c}^{{\rm{mod}}}\) or a different one. Moreover, we do not specify how one measures the size s c of community c. We refer to the present statistical test based on a quality function q and community size s as the (q, s)–test.

Suppose that we have a community c with quality q c and size s c . We judge community c to be significant if its q c value is larger than those for communities of the same size s c detected in randomised networks. We compute \(P(\tilde{q}\ge {q}_{c}|{s}_{c})\), which is the probability that a community of size s c detected in randomised networks generated by the configuration model has a quality value \(\tilde{q}\) larger than or equal to q c . We numerically estimate \(P(\tilde{q}\ge {q}_{c}|{s}_{c})\) as follows. First, we generate 500 randomised networks using the configuration model. Then, we detect communities in each randomised network by the algorithm that has been used to detect communities in the original network. Let \(\overline{C}\) be the sum of the number of communities detected in the 500 randomised networks. For each community \(\overline{c}\) \((1\le \overline{c}\le \overline{C})\) in the randomised networks, we compute the quality \({\tilde{q}}_{\overline{c}}\) and size \({\tilde{s}}_{\overline{c}}\). Then, we compute the average values, i.e., \(\langle \tilde{q}\rangle \equiv {\sum }_{\overline{c}=1}^{\overline{C}}\,{\tilde{q}}_{\overline{c}}/\overline{C}\) and \(\langle \tilde{s}\rangle \equiv {\sum }_{\overline{c}=1}^{\overline{C}}\,{\tilde{s}}_{\overline{c}}/\overline{C}\), and the unbiased estimates of the standard deviation, i.e., \({\sigma }_{\tilde{q}}\equiv \sqrt{{\sum }_{\overline{c}=1}^{\overline{C}}{({\tilde{q}}_{\overline{c}}-\langle \tilde{q}\rangle )}^{2}/(\overline{C}-1)}\) and \({\sigma }_{\tilde{s}}\equiv \sqrt{{\sum }_{\overline{c}=1}^{\overline{C}}{({\tilde{s}}_{\overline{c}}-\langle \tilde{s}\rangle )}^{2}/(\overline{C}-1)}\). We estimate the joint probability distribution \(P(\tilde{q},\tilde{s})\) using the kernel density estimator16 as follows:

$$P(\tilde{q},\tilde{s})=\frac{1}{\overline{C}}\sum _{\overline{c}=1}^{\overline{C}}\,f\left(\frac{\tilde{q}-{\tilde{q}}_{\overline{c}}}{h{\sigma }_{\tilde{q}}},\frac{\tilde{s}-{\tilde{s}}_{\overline{c}}}{h{\sigma }_{\tilde{s}}}\right),$$
(2)

where h is the width of the kernel. The function f (·, ·) is the bivariate Gaussian kernel (i.e., bivariate standard normal distribution) given by

$$f({x}_{1},{x}_{2})\equiv \frac{1}{2\pi \sqrt{1-{\gamma }^{2}}}\exp (\,-\,\frac{{x}_{1}^{2}-2\gamma {x}_{1}{x}_{2}+{x}_{2}^{2}}{2(1-{\gamma }^{2})}),$$
(3)

where

$$\gamma \equiv \frac{\sum _{\bar{c}=1}^{\bar{C}}({\tilde{q}}_{\bar{c}}-\langle \tilde{q}\rangle )({\tilde{s}}_{\bar{c}}-\langle \tilde{s}\rangle )}{\sqrt{\sum _{\bar{c}=1}^{\bar{C}}{({\tilde{q}}_{\bar{c}}-\langle \tilde{q}\rangle )}^{2}}\sqrt{\sum _{\bar{c}=1}^{\bar{C}}{({\tilde{s}}_{\bar{c}}-\langle \tilde{s}\rangle )}^{2}}},$$
(4)

is the Pearson correlation coefficient between \({\{{\tilde{q}}_{\overline{c}}\}}_{\overline{c}=1}^{\overline{C}}\) and \({\{{\tilde{s}}_{\overline{c}}\}}_{\overline{c}=1}^{\overline{C}}\). The probability distribution estimated by the Gaussian kernels approaches the true probability distribution, whatever its form, as the number of samples increases17. Although there are also non-Gaussian kernels that share this property17, we use Gaussian kernels, which are a state-of-the-art choice. The width h is a free parameter that affects the speed of the convergence to the true probability distribution. Optimising the value of h requires assumptions about the true probability distribution and intensive computations18,19. Therefore, we set \(h={\overline{C}}^{-1/6}\) according to Scott’s rule-of-thumb20, which often provides a reasonable estimate in practice18,19,20.

The conditional probability, \(P(\tilde{q}\ge {q}_{c}|{s}_{c})\), is given by

$$P(\tilde{q}\ge {q}_{c}|{s}_{c})=\frac{{\int }_{{q}_{c}}^{\infty }P(\tilde{q},{s}_{c})\,{\rm{d}}\tilde{q}}{{\int }_{-\infty }^{\infty }P(\tilde{q},{s}_{c})\,{\rm{d}}\tilde{q}}=\frac{\sum _{\overline{c}=1}^{\overline{C}}{\int }_{{q}_{c}}^{\infty }\,f\left(\frac{\tilde{q}-{\tilde{q}}_{\overline{c}}}{h{\sigma }_{\tilde{q}}},\frac{{s}_{c}-{\tilde{s}}_{\overline{c}}}{h{\sigma }_{\tilde{s}}}\right){\rm{d}}\tilde{q}}{\sum _{\overline{c}=1}^{\overline{C}}{\int }_{-\infty }^{\infty }\,f\left(\frac{\tilde{q}-{\tilde{q}}_{\overline{c}}}{h{\sigma }_{\tilde{q}}},\frac{{s}_{c}-{\tilde{s}}_{\overline{c}}}{h{\sigma }_{\tilde{s}}}\right){\rm{d}}\tilde{q}}.$$
(5)

The integration of f (x1, x2) over x1 yields

$${\int }_{y}^{{\rm{\infty }}}f({x}_{1},{x}_{2})\,{\rm{d}}{x}_{1}=\frac{1}{\sqrt{2\pi }}\exp (\,-\,\frac{{x}_{2}^{2}}{2})[1-{\rm{\Phi }}(\frac{y-\gamma {x}_{2}}{\sqrt{1-{\gamma }^{2}}})],$$
(6)

where Φ (·) is the cumulative distribution function of the standard normal distribution. By substituting Eq. (6) into Eq. (5), we have

$$P(\tilde{q}\ge {q}_{c}|{s}_{c})=1-\frac{\sum _{\overline{c}=1}^{\overline{C}}\exp \left[-{\left(\frac{{s}_{c}-{\tilde{s}}_{\overline{c}}}{\sqrt{2}h{\sigma }_{\tilde{s}}}\right)}^{2}\right]{\rm{\Phi }}\left(\frac{1}{\sqrt{1-{\gamma }^{2}}}\left(\frac{{q}_{c}-{\tilde{q}}_{\overline{c}}}{h{\sigma }_{\tilde{q}}}-\gamma \frac{{s}_{c}-{\tilde{s}}_{\overline{c}}}{h{\sigma }_{\tilde{s}}}\right)\right)}{\sum _{\overline{c}=1}^{\overline{C}}\exp \left[-{\left(\frac{{s}_{c}-{\tilde{s}}_{\overline{c}}}{\sqrt{2}h{\sigma }_{\tilde{s}}}\right)}^{2}\right]}.$$
(7)
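The closed form in Eq. (7) can be evaluated with a few lines of code. The following is a minimal sketch using only the Python standard library; the function names and the synthetic null sample are assumptions made for illustration and do not reproduce the qstest implementation. The sketch assumes that the null sample contains at least two communities, that |γ| < 1, and that s c is not so far from every sampled size that all kernel weights underflow to zero.

```python
import math
import random


def std_normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def qs_test_p_value(q_c, s_c, q_null, s_null):
    """Estimate P(q~ >= q_c | s_c) following Eq. (7).

    q_null, s_null: quality and size of every community detected in the
    randomised networks (two lists of equal length).
    """
    n = len(q_null)
    mean_q = sum(q_null) / n
    mean_s = sum(s_null) / n
    dq = [q - mean_q for q in q_null]
    ds = [s - mean_s for s in s_null]
    ss_q = sum(x * x for x in dq)
    ss_s = sum(x * x for x in ds)
    sigma_q = math.sqrt(ss_q / (n - 1))  # unbiased sample standard deviation
    sigma_s = math.sqrt(ss_s / (n - 1))
    gamma = sum(a * b for a, b in zip(dq, ds)) / math.sqrt(ss_q * ss_s)  # Eq. (4)
    h = n ** (-1.0 / 6.0)  # Scott's rule of thumb
    scale = math.sqrt(1.0 - gamma * gamma)

    numerator = 0.0
    denominator = 0.0
    for q_i, s_i in zip(q_null, s_null):
        x_q = (q_c - q_i) / (h * sigma_q)
        x_s = (s_c - s_i) / (h * sigma_s)
        weight = math.exp(-0.5 * x_s * x_s)
        numerator += weight * std_normal_cdf((x_q - gamma * x_s) / scale)
        denominator += weight
    return 1.0 - numerator / denominator


# Synthetic null sample (an assumption for this example): quality grows
# linearly with size, plus Gaussian noise.
rng = random.Random(0)
s_null = [rng.randint(10, 100) for _ in range(2000)]
q_null = [0.001 * s + rng.gauss(0.0, 0.005) for s in s_null]

p_small = qs_test_p_value(0.08, 20, q_null, s_null)  # high quality for its size
p_large = qs_test_p_value(0.08, 80, q_null, s_null)  # typical quality for its size
print(p_small, p_large)
```

In this synthetic setting, the same quality value 0.08 is exceptional for a community of size 20 but ordinary for a community of size 80, which is the size dependence that the conditional probability is designed to capture.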

Finally, we regard community c as significant if \(P(\tilde{q}\ge {q}_{c}|{s}_{c})\le \alpha \), where α ∈ [0, 1] is the significance level. The conditional probability \(P(\tilde{q}\ge {q}_{c}|{s}_{c})\) obeys a uniform probability distribution over [0, 1] for a community detected in a randomised network (see Supplementary Information 1). One can estimate more accurate p-values (i.e., \(P(\tilde{q}\ge {q}_{c}|{s}_{c})\)) using a larger number of randomised networks, which, however, requires additional computational time. We opt to use 500 randomised networks to obtain sufficiently accurate p-values in a reasonable time. In fact, the p-value does not change much if one increases the number of randomised networks beyond 500 or if one uses networks with different numbers of nodes and communities (Supplementary Information 2).

As the number of communities, C, increases, some insignificant communities would be judged to be significant owing to the multiple comparison problem. To avoid this, we use the Šidák correction21, i.e., \(\alpha =1-{(1-\alpha ^{\prime} )}^{1/C}\), where α′ ∈ [0, 1] is the targeted significance level. We set α′ = 0.05.
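The Šidák correction is straightforward to compute. The short sketch below evaluates the per-community significance level for C = 31, the number of communities in the LFR example; the function name `sidak_alpha` is chosen here for illustration.

```python
def sidak_alpha(alpha_target, n_communities):
    """Per-community significance level alpha = 1 - (1 - alpha')^(1/C)."""
    return 1.0 - (1.0 - alpha_target) ** (1.0 / n_communities)


alpha_31 = sidak_alpha(0.05, 31)
print(alpha_31)
```

By construction, the probability that none of C independent tests at level α rejects is (1 − α)^C = 1 − α′, so the family-wise error rate equals the target α′.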

Time complexity

The time complexity of the proposed statistical test is evaluated as follows. Generating one randomised network from the configuration model consumes \({\mathscr{O}}(N+M)\) time using an efficient algorithm22, which is implemented in some network analysis software23,24. For each generated randomised network, we detect communities. Any community-detection algorithm qualified for the present statistical test computes the quality and size of the individual communities and maximises the quality function for the entire network. We use the quality and size of the optimised communities in the statistical test. We carry out these procedures for each of the R randomised networks, consuming \({\mathscr{O}}((N+M+Z)R)\) time in total, where Z is the time complexity of the community-detection algorithm. We compute the p-value for each of the C communities in the original network using Eq. (7) with \(R{C}^{{\rm{conf}}}\) samples on average, \({\{{\tilde{q}}_{c}\}}_{c=1}^{R{C}^{{\rm{conf}}}}\) and \({\{{\tilde{s}}_{c}\}}_{c=1}^{R{C}^{{\rm{conf}}}}\), where \({C}^{{\rm{conf}}}\) is the average number of communities detected in a randomised network. This incurs a time complexity of \({\mathscr{O}}(C\times R{C}^{{\rm{conf}}})\). In total, the proposed statistical test requires \({\mathscr{O}}((N+M+Z+C{C}^{{\rm{conf}}})R)\) time.

The time complexity can be mitigated using parallel computing. In other words, one runs multiple threads, each of which generates independent samples of \(({\tilde{q}}_{c},{\tilde{s}}_{c})\). Once the sampling is completed in all the threads, one computes the p-value using Eq. (7). We used 16 threads on a computer with the Intel 2.6 GHz Sandy Bridge processors and 4GB of memory. For the largest network we analysed (i.e., Internet25; N = 34,761 nodes), our statistical test needed 403 seconds using the Louvain community-detection algorithm, which has a time complexity of \({\mathscr{O}}(M)\)26. With the Kernighan-Lin community-detection algorithm having a time complexity of \({\mathscr{O}}({N}^{2})\)27, it took 17,763 seconds (i.e. approximately 5 hours).

Community detection with different quality functions

Among various quality functions for individual communities apart from \({q}_{c}^{{\rm{mod}}}\)4,13,14, we consider the following three quality functions. The internal average degree14 (i.e., normalised number of intra-community edges), denoted by \({q}_{c}^{{\rm{int}}}\), is defined by

$${q}_{c}^{{\rm{int}}}\equiv \frac{1}{{n}_{c}}\sum _{\begin{array}{c}1\le i,j\le N\\ i,j\in \,{\rm{community}}\,c\end{array}}{A}_{ij}.$$
(8)

The maximisation of \({q}_{c}^{{\rm{int}}}\) yields a community having dense intra-community connectivity. The expansion14, denoted by \({q}_{c}^{\exp }\), is defined by

$${q}_{c}^{\exp }\equiv -\,\frac{1}{{n}_{c}}\sum _{\begin{array}{c}1\le i,j\le N\\ i\in \,{\rm{community}}\,c\\ j\notin \,{\rm{community}}\,c\end{array}}{A}_{ij}.$$
(9)

The maximisation of \({q}_{c}^{\exp }\) yields a community having sparse inter-community connectivity. Finally, the conductance14, denoted by \({q}_{c}^{{\rm{cnd}}}\), is defined by

$${q}_{c}^{{\rm{cnd}}}\equiv -\,\frac{1}{{{\rm{vol}}}_{c}}\sum _{\begin{array}{c}1\le i,j\le N\\ i\in \,{\rm{community}}\,c\\ j\notin \,{\rm{community}}\,c\end{array}}{A}_{ij},$$
(10)

where vol c is the sum of degrees of nodes (i.e., volume) in a community c. Similar to the case of \({q}_{c}^{\exp }\), the maximisation of \({q}_{c}^{{\rm{cnd}}}\) yields a community having sparse inter-community connectivity. One can also interpret the maximisation of \({q}_{c}^{{\rm{cnd}}}\) as the maximisation of the number of intra-community edges28.
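The three quality functions in Eqs. (8)–(10) can be computed in a single pass over the edges incident to a community. The following sketch assumes a 0/1 adjacency matrix stored as a list of lists and a community with at least one edge-bearing node (so that vol c > 0); the function name and the toy network are illustrative assumptions.

```python
def community_qualities(adj, members):
    """Return (q_int, q_exp, q_cnd) of Eqs. (8)-(10) for one community."""
    inside = set(members)
    n_c = len(inside)
    internal = 0  # sum of A_ij over ordered pairs inside c (2 x intra edges)
    boundary = 0  # number of edges between c and the rest of the network
    for i in inside:
        for j, a_ij in enumerate(adj[i]):
            if a_ij:
                if j in inside:
                    internal += 1
                else:
                    boundary += 1
    vol_c = internal + boundary  # sum of the degrees of the nodes in c
    q_int = internal / n_c
    q_exp = -boundary / n_c
    q_cnd = -boundary / vol_c
    return q_int, q_exp, q_cnd


# Hypothetical toy network: two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1

q_int, q_exp, q_cnd = community_qualities(adj, [0, 1, 2])
print(q_int, q_exp, q_cnd)
```

For the left triangle, the ordered-pair sum of A ij is 6, one edge leaves the community and vol c = 7, giving q int = 2, q exp = −1/3 and q cnd = −1/7.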

For \({q}_{c}^{{\rm{mod}}}\), we adopt the Louvain algorithm to maximise the modularity (i.e., sum of \({q}_{c}^{{\rm{mod}}}\) over the communities, \({\sum }_{c=1}^{C}\,{q}_{c}^{{\rm{mod}}}\)) to find communities in the original and randomised networks. However, the Louvain algorithm is not applicable to the maximisation of \(Q={\sum }_{c=1}^{C}\,{q}_{c}\), where \({q}_{c}={q}_{c}^{{\rm{int}}}\), \({q}_{c}^{\exp }\) or \({q}_{c}^{{\rm{cnd}}}\). Therefore, we adopt a variant of the Kernighan–Lin algorithm29 used in a previous study27. The algorithm seeks a partitioning of the network into communities that maximises Q. Suppose that each node i has a tentative label \({\ell }_{i}\) \((1\le {\ell }_{i}\le C)\) indicating the index of the community to which node i belongs. First, we assign each node to one of the C communities selected uniformly at random. Second, for each node i, we tentatively relabel it to each different label and measure the increment in Q. Third, we select the node i and its new label c that maximise the increment in Q among all nodes i (1 ≤ i ≤ N) and all possible new labels. Regardless of whether Q increases or not, we accept the proposed relabelling of node i (i.e., set \({\ell }_{i}=c\)). Fourth, we determine the pair of another node j (j ≠ i) and its tentative new label c′ that maximises the increment in Q, and change the label of j to c′ (i.e., \({\ell }_{j}=c^{\prime} \)). In this manner, we relabel nodes one by one, without relabelling nodes that have already been relabelled. After sequentially relabelling the N nodes, we select the labelling that yields the largest value of Q among the N + 1 labellings that have appeared in the course of relabelling the N nodes. If the initial labelling (before relabelling any node) yields the largest value of Q, we terminate the algorithm. Otherwise, we use the labelling that has yielded the largest Q value among the N + 1 labellings as the initial labelling in the next round, in which we again sequentially relabel the N nodes and select the best labelling. We repeat rounds of updating until the initial labelling is the best labelling in the round in terms of the Q value.
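The relabelling rounds described above can be sketched as follows. This is a didactic sketch rather than the implementation used in the study: it recomputes Q from scratch for every tentative move, sums Q over non-empty communities only, and, purely to keep the example short, is demonstrated with the modularity contribution of Eq. (1) as the per-community quality. All function names and the toy network are assumptions.

```python
import random


def q_mod(adj, members):
    """Per-community modularity contribution, Eq. (1)."""
    degrees = [sum(row) for row in adj]
    two_m = sum(degrees)
    return sum(adj[i][j] - degrees[i] * degrees[j] / two_m
               for i in members for j in members) / two_m


def total_quality(adj, labels, quality):
    """Q: sum of the per-community quality over the non-empty communities."""
    groups = {}
    for node, label in enumerate(labels):
        groups.setdefault(label, []).append(node)
    return sum(quality(adj, group) for group in groups.values())


def kernighan_lin_variant(adj, n_communities, quality, seed=0):
    """Repeat relabelling rounds until a round fails to improve Q."""
    rng = random.Random(seed)
    n = len(adj)
    labels = [rng.randrange(n_communities) for _ in range(n)]
    while True:
        start_q = total_quality(adj, labels, quality)
        best_q, best_labels = start_q, list(labels)
        current = list(labels)
        unmoved = set(range(n))
        while unmoved:
            # Find the (node, new label) pair giving the largest Q.
            move, move_q = None, None
            for node in sorted(unmoved):
                old = current[node]
                for label in range(n_communities):
                    if label == old:
                        continue
                    current[node] = label
                    q = total_quality(adj, current, quality)
                    if move_q is None or q > move_q:
                        move, move_q = (node, label), q
                current[node] = old
            if move is None:  # happens only if n_communities == 1
                break
            node, label = move
            current[node] = label  # accepted even if Q decreases
            unmoved.discard(node)  # each node is relabelled once per round
            if move_q > best_q:
                best_q, best_labels = move_q, list(current)
        if best_q <= start_q:  # the initial labelling of this round is best
            return labels, start_q
        labels = best_labels


# Hypothetical toy network: two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1

labels, q = kernighan_lin_variant(adj, 2, q_mod, seed=1)
print(labels, q)
```

On termination, no single-node relabelling increases Q, because the first move of the final round is the best such relabelling and did not improve on the initial labelling. A practical implementation would update Q incrementally instead of recomputing it for every tentative move.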

To find communities in networks using \({q}_{c}^{{\rm{int}}}\), \({q}_{c}^{\exp }\) or \({q}_{c}^{{\rm{cnd}}}\), we need to specify the number of communities, C. Otherwise, the maximisation of the quality functions may yield trivial communities. For example, \({q}_{c}^{\exp }\) is always the largest when each connected component constitutes a community because there is no inter-community edge. In the analysis of synthetic networks, we set C to the number of planted communities. For empirical networks, we set C to the number of communities identified by the Louvain algorithm.

Other statistical tests

We compare the (q, s)–test with two statistical tests, i.e., the test proposed by Spirin and Mirny10 and the test proposed by Lancichinetti, Radicchi and Ramasco8, which we refer to as the S–test and L–test, respectively. As is the case with the (q, s)–test, both the S–test and the L–test adopt the configuration model as the null model. For both statistical tests, we set the significance level for a single community to \(\alpha =1-{(1-\alpha ^{\prime} )}^{1/C}\), where α′ = 0.05.

The S–test regards a community as significant if it has more intra-community edges than a community composed of the same number of nodes detected in randomised networks does. Their original algorithm10 is slow for large networks. Therefore, we adopt the Kernighan–Lin algorithm29 to optimise the quality function for a community adopted in the S–test. As far as our numerical experiments indicate, our implementation is faster and also finds better community structure than their original algorithm does in terms of their quality function.

The L–test regards a community as significant if every node in the community has more neighbours within the community than that expected for the configuration model. In the original paper8, the authors defined two significance measures, i.e., \({\mathscr{C}}\)–score and \( {\mathcal B} \)–score. We adopt the \( {\mathcal B} \)–score, which is less conservative than the \({\mathscr{C}}\)–score. In the original article8, the \( {\mathcal B} \)–score is claimed to be more trustworthy than the \({\mathscr{C}}\)–score because the \({\mathscr{C}}\)–score, but not the \( {\mathcal B} \)–score, relies on extreme value statistics.

Data

We apply the statistical test to the 12 empirical networks listed in Table 1. We ignore the directions and weights of edges in the empirical networks.

Table 1 Properties of 12 empirical networks.

The karate club network represents the relationships among the members of a university’s karate club30. Each node represents a member of the karate club. Two members are defined to be adjacent if they are friends outside of the club activities.

The dolphin social network represents the relationships of the dolphins living near Doubtful Sound in New Zealand31. Each node represents a dolphin. Two dolphins are defined to be adjacent if they are frequently observed in the same school.

The network of Les Misérables represents the relationships between the characters of a novel, Les Misérables32. Each node represents a character of the book. Each edge indicates that two characters appear in the same chapter of the book.

The Enron email network represents the email interactions among the staff of Enron Inc33. Each node represents an email account. Each edge indicates that an email is sent from one account to the other account.

The jazz network represents the collaborations among jazz musicians34. Each node represents a jazz musician. Each edge indicates that two musicians belong to the same band.

The network of network scientists represents the collaborations between researchers in network science7. Each node represents a researcher. Two researchers are defined to be adjacent if they have published a co-authored paper cited by one of two popular review papers on network science. Then, some nodes and edges were added manually by the author of the article7. We only consider the largest connected component of the network.

The political blog network is the network of blogs on the United States presidential election in 200435. Each node represents a blog. Two blogs are defined to be adjacent if there is at least one hyperlink between the two blogs on their front page.

The airport network consists of nodes representing airports in the world36,37. Two airports are defined to be adjacent if there is a direct commercial flight between the two airports.

The protein network represents the physical interactions among human proteins38,39. Each node represents a protein. Two proteins are defined to be adjacent if they physically interact.

The Chess network represents the chess matches between players25. Each node represents a chess player. Each edge indicates that two players have played each other at least once.

The Astro-ph network represents the collaborations among the researchers who published a joint paper in the arXiv’s astro-ph section40. Each node represents a researcher. Two researchers are defined to be adjacent if they have published a joint paper.

The Internet network represents the network of autonomous systems25. A node represents an autonomous system, which is a group of routers maintained by a network operator. Two autonomous systems are defined to be adjacent if they have a logical peering relation.

Results

We measure the size of a community in two ways: the number of nodes in a community c, n c , and the sum of degrees of nodes in a community c, vol c . In the next two subsections, we consider the \(({q}_{c}^{{\rm{mod}}},{n}_{c})\)–test and the \(({q}_{c}^{{\rm{mod}}},{{\rm{vol}}}_{c})\)–test. We show the results for other quality functions in the third subsection.

Synthetic networks

In this section, we examine synthetic networks with planted communities. We generate networks using the LFR model15, which places edges such that the node’s degree (i.e., d i ) and the number of nodes in a community c (i.e., n c ) follow power-law distributions. We set the power-law exponent for the distributions of d i and n c to 2, the average node’s degree to 10, the maximum degree to 100 and the range of n c to [20,200]. The networks are composed of \(N={10}^{3}\) nodes. Each node i has an average fraction 1 − μ of neighbours belonging to the same community, where μ ∈ {0, 0.025, 0.05, …, 1} is a mixing parameter controlling the “strength” of community structure. With μ = 0, all edges are placed within communities, and the community structure is the strongest. With μ = 1, all edges are between different communities. We set the extent of overlaps between different communities to zero.

We generate 30 networks using the LFR model at each μ value. For each generated network, we classify the planted communities into significant and insignificant communities by each statistical test. Then, we compute the true positive rate (i.e., the fraction of significant communities in the network). Finally, we average the true positive rate over the 30 generated networks.

Figure 2 shows the true positive rate as a function of μ. The true positive rate for the S–test is smallest for the entire range of μ, indicating that the S–test is the most conservative. The S–test does not regard all the planted communities as significant even at μ = 0 for the following reason. In the S–test, one detects the strongest community in each randomised network, where the strength of a community is measured by the number of intra-community edges. Then, a focal community in the original network is regarded as significant if it is stronger than the majority of the strongest communities detected in the randomised networks. The strongest communities in the randomised networks often contain almost the largest possible number of intra-community edges, whereas the planted communities do not always even at μ = 0. Therefore, the S–test concludes that some planted communities are insignificant. The true positive rate for the L–test is 1 when μ = 0 and ranges between 0.55 and 0.95 for 0 < μ ≤ 0.5. The true positive rate for the \(({q}_{c}^{{\rm{mod}}},{n}_{c})\)–test and that for the \(({q}_{c}^{{\rm{mod}}},{{\rm{vol}}}_{c})\)–test are comparable and close to 1 for 0 ≤ μ ≤ 0.3. In contrast, there is a visible difference between the results for the \(({q}_{c}^{{\rm{mod}}},{n}_{c})\)– and the \(({q}_{c}^{{\rm{mod}}},{{\rm{vol}}}_{c})\)– tests for 0.3 < μ ≤ 0.5. This result suggests that the definition of the size of a community may affect the significance of weak communities but not of strong communities.

Figure 2

True positive rate for the statistical tests applied to the networks generated by the LFR model. Legends S, L, \(({q}_{c}^{{\rm{mod}}},{n}_{c})\) and \(({q}_{c}^{{\rm{mod}}},{{\rm{vol}}}_{c})\) indicate the S–test, the L–test, the \(({q}_{c}^{{\rm{mod}}},{n}_{c})\)–test and the \(({q}_{c}^{{\rm{mod}}},{{\rm{vol}}}_{c})\)–test, respectively. The error bars indicate the ±1 standard deviation.

Empirical networks

We apply the statistical tests to the 12 empirical networks listed in Table 1 (see the Data section for details). In this section, we detect communities by modularity maximisation using the Louvain algorithm26. Then, we apply the statistical tests to each detected community.

The fraction of significant communities for each statistical test is shown in Table 2. The \(({q}_{c}^{{\rm{mod}}},{n}_{c})\)– and the \(({q}_{c}^{{\rm{mod}}},{{\rm{vol}}}_{c})\)–tests identify more significant communities than the S–test and the L–test do in a majority of the 12 empirical networks. This result indicates that the \(({q}_{c}^{{\rm{mod}}},{n}_{c})\)– and the \(({q}_{c}^{{\rm{mod}}},{{\rm{vol}}}_{c})\)–tests are more generous than the S–test and the L–test, which is consistent with the results for the LFR model. This is probably because the \(({q}_{c}^{{\rm{mod}}},{n}_{c})\)– and the \(({q}_{c}^{{\rm{mod}}},{{\rm{vol}}}_{c})\)–tests use \({q}_{c}^{{\rm{mod}}}\) to evaluate the quality of individual communities, which is consistent with the objective function of modularity maximisation, \({\sum }_{c=1}^{C}\,{q}_{c}^{{\rm{mod}}}\).

Table 2 Fraction of significant communities identified by the S–test, the L–test, the \(({q}^{{\rm{mod}}},s)\)–test, the \(({q}^{{\rm{int}}},s)\)–test, the \(({q}^{\exp },s)\)–test and the \(({q}^{{\rm{cnd}}},s)\)–test in the 12 empirical networks.

To quantify the agreement between the \(({q}_{c}^{{\rm{mod}}},{n}_{c})\)– and the \(({q}_{c}^{{\rm{mod}}},{{\rm{vol}}}_{c})\)– tests, we compute the level of agreement defined by τ = (C11 + C00)/C, where C00 is the number of communities classified as insignificant by both statistical tests and C11 is the number of communities classified as significant by both tests. Note that 0 ≤ τ ≤ 1, τ = 1 if the two tests regard the same set of communities as significant, and τ = 0 if the two tests completely disagree. We compute τ between each pair of statistical tests for each empirical network and then average τ over the 12 empirical networks. The averaged τ values are shown in Table 3. We find τ = 0.42 between the S–test and the L–test, indicating that the two statistical tests disagree for a majority of communities. The L–test weakly agrees with the \(({q}_{c}^{{\rm{mod}}},{{\rm{vol}}}_{c})\)–test (i.e., τ = 0.58) but disagrees with the other tests for a majority of communities (i.e., τ < 0.5). The τ between the \(({q}_{c}^{{\rm{mod}}},{n}_{c})\)– and the \(({q}_{c}^{{\rm{mod}}},{{\rm{vol}}}_{c})\)–tests is large (τ = 0.84), suggesting that the significance of a majority of communities is not strongly affected by the definition of the community size.
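The level of agreement τ can be computed from two lists of significance decisions, one entry per community. The sketch below uses hypothetical decisions for five communities; the function name and the data are assumptions made for illustration.

```python
def agreement(sig_a, sig_b):
    """Level of agreement tau = (C11 + C00) / C between two tests."""
    if len(sig_a) != len(sig_b):
        raise ValueError("both tests must classify the same communities")
    c11 = sum(1 for a, b in zip(sig_a, sig_b) if a and b)
    c00 = sum(1 for a, b in zip(sig_a, sig_b) if not a and not b)
    return (c11 + c00) / len(sig_a)


# Hypothetical decisions of two tests on the same five communities.
test_a = [True, True, False, False, True]
test_b = [True, False, False, True, True]
tau = agreement(test_a, test_b)
print(tau)
```

Here the two tests agree on three of the five communities (two judged significant by both, one judged insignificant by both), so τ = 3/5.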

Table 3 Agreement between pairs of statistical tests.

Other quality functions

In this section, we examine the \(({q}_{c}^{{\rm{int}}},{s}_{c})\)–, the \(({q}_{c}^{\exp },{s}_{c})\)– and the \(({q}_{c}^{{\rm{cnd}}},{s}_{c})\)–tests, where \({s}_{c}\) is either \({n}_{c}\) or \({{\rm{vol}}}_{c}\). For the synthetic networks, the true positive rate for the \(({q}_{c}^{{\rm{int}}},{n}_{c})\)– and the \(({q}_{c}^{{\rm{int}}},{{\rm{vol}}}_{c})\)–tests is small over the entire range of μ (Fig. 3). As is the case for the S–test, quality function \({q}_{c}^{{\rm{int}}}\) uses the number of intra-community edges. Some planted communities are regarded as insignificant because randomised networks often contain a community having almost the largest possible number of intra-community edges (Fig. 1(c)). The quality function \({q}_{c}^{\exp }\) is the largest when the community c is disconnected from the other nodes. Randomised networks often contain many disconnected components, yielding a large value of \({q}_{c}^{\exp }\) (Fig. 1(d)). Therefore, the true positive rate for the \(({q}_{c}^{\exp },{n}_{c})\)– and the \(({q}_{c}^{\exp },{{\rm{vol}}}_{c})\)–tests is also close to zero over the entire range of μ. In contrast to the \(({q}_{c}^{{\rm{int}}},{s}_{c})\)– and the \(({q}_{c}^{\exp },{s}_{c})\)–tests, the \(({q}_{c}^{{\rm{cnd}}},{n}_{c})\)– and the \(({q}_{c}^{{\rm{cnd}}},{{\rm{vol}}}_{c})\)–tests yield a true positive rate close to one when μ ≤ 0.3. These results suggest that the outcome of the test depends considerably on the quality function. For all the (q, s)–tests, the definition of community size (i.e., \({n}_{c}\) or \({{\rm{vol}}}_{c}\)) does not strongly influence the true positive rate.

Figure 3

True positive rate as a function of mixing parameter, μ, for the six (q, s)–tests.

For the empirical networks, we first detect communities by maximising q, where q is either \({q}_{c}^{{\rm{int}}}\), \({q}_{c}^{\exp }\) or \({q}_{c}^{{\rm{cnd}}}\), using the variant of the Kernighan–Lin algorithm (see the Other statistical test sections). Then, we apply the (q, s)–test to each detected community. The results for the \(({q}_{c}^{{\rm{int}}},{s}_{c})\)–, the \(({q}_{c}^{\exp },{s}_{c})\)– and the \(({q}_{c}^{{\rm{cnd}}},{s}_{c})\)–tests applied to the 12 empirical networks are shown in Table 2. For all the networks, the \(({q}_{c}^{{\rm{cnd}}},{s}_{c})\)–test regards more communities as significant than the \(({q}_{c}^{{\rm{int}}},{s}_{c})\)– and the \(({q}_{c}^{\exp },{s}_{c})\)–tests, where \({s}_{c}\) is either \({n}_{c}\) or \({{\rm{vol}}}_{c}\). This result is consistent with those obtained for the synthetic networks (Fig. 3). For each quality function q, the level of agreement (i.e., τ) between the different definitions of the community size (i.e., \({n}_{c}\) or \({{\rm{vol}}}_{c}\)) is shown in Table 4. For most empirical networks, the agreement τ is larger than 0.8, indicating that the results of the statistical test do not strongly depend on the definition of community size in most cases.

Table 4 Agreement between the \(({q}_{c},{n}_{c})\)–test and the \(({q}_{c},{{\rm{vol}}}_{c})\)–test.

Discussion

We proposed a non-parametric statistical test, called the (q, s)–test, for the significance of individual communities, which accounts for the correlation between the quality and the size of single communities. We demonstrated our test with several quality functions q including the one defined as the contribution of a single community to the modularity. In fact, the (q, s)–test accepts different quality functions for individual communities such as those described in the previous literature13,14,41,42,43. In addition, the (q, s)–test does not prescribe how communities should be detected in a given network. We note, however, that one should use a q that is consistent with the objective function for community detection, because the former is the quantity evaluated in the (q, s)–test and the latter is the quantity maximised in community detection.

We have used two definitions of the size of a community, i.e., the number of nodes in a community (i.e., \({n}_{c}\)), and the sum of degrees of nodes in a community (i.e., \({{\rm{vol}}}_{c}\)). For degree-homogeneous networks, the choice does not matter because \({n}_{c}\propto {{\rm{vol}}}_{c}\). However, for degree-heterogeneous networks, which communities are judged significant may depend considerably on whether we use \({n}_{c}\) or \({{\rm{vol}}}_{c}\). If q explicitly uses its own measure of the size of a community, we should probably adopt the corresponding definition of the community size in the (q, s)–test. If a measure of community size is not explicit, we suggest that one selects a measure of community size that is more strongly correlated with q than others. If q is correlated with multiple quantities (e.g., both \({n}_{c}\) and \({{\rm{vol}}}_{c}\)) that are not perfectly correlated with each other, one can extend the (q, s)–test by adopting multivariate Gaussian kernels with three or more variables instead of bivariate Gaussian kernels. A downside of this approach is that we would need more data to reliably estimate the distribution of (q, s), where s is at least two-dimensional.
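
The idea of conditioning the quality of a community on its size with Gaussian kernels can be sketched as follows. This is a simplified illustration, not the exact estimator of the (q, s)–test: it assumes a product (diagonal-bandwidth) Gaussian kernel, bandwidths `h_q` and `h_s` supplied by the user, and no correction for multiple comparisons. The function name and the null samples are hypothetical. Under these assumptions, the p-value is the kernel estimate of the probability that a community of size s in the randomised networks has quality at least q.

```python
import math


def conditional_p_value(q_obs, s_obs, null_samples, h_q, h_s):
    """Kernel estimate of P(q >= q_obs | s = s_obs) from null samples.

    null_samples : iterable of (q_i, s_i) pairs measured on communities
                   detected in randomised networks
    h_q, h_s     : kernel bandwidths for quality and size (assumed given)
    Assumption: product Gaussian kernel, i.e. no off-diagonal bandwidth term.
    """
    num = 0.0
    den = 0.0
    for q_i, s_i in null_samples:
        # Weight of sample i: Gaussian kernel in the size direction.
        w = math.exp(-0.5 * ((s_obs - s_i) / h_s) ** 2)
        # Upper tail of the Gaussian kernel centred at q_i, evaluated at q_obs.
        tail = 0.5 * math.erfc((q_obs - q_i) / (h_q * math.sqrt(2.0)))
        num += w * tail
        den += w
    if den == 0.0:
        raise ValueError("no null sample has a size close enough to s_obs")
    return num / den


# Hypothetical null samples (quality, size) of equal size.
null = [(0.1, 10.0), (0.2, 10.0), (0.3, 10.0)]
p_mid = conditional_p_value(0.2, 10.0, null, h_q=0.05, h_s=2.0)
p_high = conditional_p_value(0.5, 10.0, null, h_q=0.05, h_s=2.0)
```

Because null samples whose size is far from `s_obs` receive negligible weight, a small community is compared mainly with small communities in the randomised networks. Extending the sketch to a two-dimensional size would add one more Gaussian factor to the weight `w`, at the cost of needing more null samples, as noted above.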

We can adopt the (q, s)–test to assess the significance of other structures of networks, such as bipartite communities44 and core-periphery structure45,46,47, provided that the quality function for the individual structure (e.g., a single bipartite community) is explicitly defined. In fact, we applied a variant of the (q, s)–test to core-periphery structure in our previous study47.

Robustness of community structure against random perturbations (e.g., addition, removal and rewiring of edges) is an alternative measure of the significance of communities14,48,49. With this approach, if small perturbations do not considerably change communities, then the communities are regarded as significant. Statistical tests based on quality functions, including the (q, s)–test, and those based on robustness may provide different results49. As is the case for quality functions, the robustness of an individual community may be correlated with the size of the community. For example, removal of a small number of intra-community edges may destroy small communities, whereas large communities may survive the removal of more intra-community edges. If this is the case, it may be worthwhile for a robustness–based test of individual communities to take into account the dependence of the robustness measure on the size of a community.