1 Introduction

Community detection is a core area of social network analysis that identifies hidden communities within a network. Starting from the early 2000s, many community detection algorithms have been introduced and studied across disciplines. The main applications of community detection in networks are recommendation systems [1, 2], link prediction [3, 4], and anomaly detection in online social networks [5, 6]. However, different algorithms can generate significantly different communities for the same network. In the literature, the most common way to evaluate a community detection algorithm is to compare its result with the ground truth communities of the network [7,8,9]. One can also evaluate community detection algorithms using dynamic graphs with a planted evolving community structure as a benchmark [10]. In another study [11], the performance of the Community Density Rank (CDR) algorithm was compared against other community detection algorithms using synthetic data from the Lancichinetti, Fortunato and Radicchi (LFR) algorithm [12]. Dao et al. [13] also compared community detection methods based on computation time and community size distribution. However, these studies did not consider how community detection performance changes with the features of networks. The similarity metrics generally used to assess such agreement are the Adjusted Rand Index (ARI) [14], Normalized Mutual Information (NMI) [15] and Adjusted Mutual Information (AMI) [16].

In this paper, we focus on identifying the impact of topological features on community detection results. We conducted a comparative analysis using sparse and dense networks: dense networks have an edge count close to the maximum possible, while sparse networks have far fewer edges. We used the Gaussian random partition graph [24] to generate dense networks and our newly developed method, which draws community sizes from a mixture of Gaussians, to generate sparse networks. We then used those networks as benchmarks to evaluate the community detection methods. The following section discusses both network generation methods in detail and the main difference between them.

2 Material and methods

In social network analysis, a community is defined as a subset of nodes within the graph that have a higher probability of being connected to each other than to the rest of the network. In our previous study [17], communities were identified using community detection algorithms in order to understand the nature of the spread of COVID-19. Here, we consider directed networks for the simulations and use the Louvain [19], Infomap [20], Label propagation [22], and Spinglass [23] algorithms for community detection.

Most of these algorithms are built on modularity, so understanding what modularity means is essential. Modularity measures the strength of the division of a network into groups and is defined as,

$$\begin{aligned} Q = \frac{1}{2m}\sum _{i,j \;in \;same\; module}\left( A_{ij} - \frac{k_{i}k_{j}}{2m}\right) . \end{aligned}$$
(1)

Here m is the number of edges of the graph, \(k_i\) is the degree of node i, and \(A_{ij}\) is the adjacency matrix. The above equation gives the modularity of an undirected graph. The authors of [18] developed a modularity for directed graphs, given by,

$$\begin{aligned} Q_{d}= \frac{1}{m}\sum _{i,j \;in \;same\; module}\left( A_{ij} - \frac{k_{i}^{in}k_{j}^{out}}{m}\right) . \end{aligned}$$
(2)

Directed modularity takes both in-degrees and out-degrees into account, whereas undirected modularity uses only the degree centrality. In the following paragraphs we discuss the different methods we used for community detection.
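To make Eq. (2) concrete, the following is a minimal sketch of computing directed modularity for a given partition; it assumes a NetworkX DiGraph and a partition supplied as a node-to-label dictionary (the function name is ours, not from the paper).

```python
import networkx as nx

def directed_modularity(G: nx.DiGraph, communities: dict) -> float:
    """Directed modularity Q_d of Eq. (2); `communities` maps node -> label."""
    m = G.number_of_edges()
    # First term: edges whose endpoints lie in the same community.
    edge_term = sum(1 for u, v in G.edges() if communities[u] == communities[v])
    # Second term: expected number of within-community edges under the
    # directed null model, k_i^in * k_j^out / m summed over same-module pairs.
    null_term = sum(
        G.in_degree(i) * G.out_degree(j) / m
        for i in G for j in G
        if communities[i] == communities[j]
    )
    return (edge_term - null_term) / m
```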

The first algorithm that we focus on is the Louvain algorithm [19], an unsupervised algorithm for detecting communities in networks. The method has two phases, modularity optimization and community aggregation, which are executed until the communities no longer change and the maximum modularity is achieved. In the modularity optimization phase, each node is first assigned to its own community. Then each node i is removed from its own community and moved into the community of a neighbor j. The change in modularity is calculated using,

$$\begin{aligned} \Delta Q_{d}=\frac{d_{i}^c}{m}-\left[ \frac{d_{i}^{out}\cdot \sum _{tot}^{in}+d_{i}^{in}\cdot \sum _{tot}^{out}}{m^2} \right] . \end{aligned}$$
(3)

Here \(d_{i}^c\) is the degree of node i within community C, \(\sum _{tot}^{in}\) is the number of incoming edges of community C, and \(\sum _{tot}^{out}\) is the number of outgoing edges of community C. If no positive increase in modularity is possible, node i remains in its current community. This sequence is performed repeatedly for all nodes until no further increase in modularity occurs; the first phase stops when a local maximum has been found.

In the community aggregation step, each community is collapsed into a single node, and a new network is built from these community nodes. Once the new network is created, the second phase has ended, and the first phase can be re-applied to the new network. These phases are repeated until the communities no longer change and a maximum of modularity is achieved.
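As an illustration of this two-phase procedure, the following is a hedged sketch using the Louvain implementation shipped with NetworkX (louvain_communities, available from NetworkX 2.8; the paper does not state which implementation was used).

```python
import networkx as nx

# Generate a directed Gaussian random partition graph (dense benchmark),
# then run Louvain; the parameter values here are illustrative only.
G = nx.gaussian_random_partition_graph(200, 20, 4, 0.4, 0.05,
                                       directed=True, seed=1)
communities = nx.community.louvain_communities(G, seed=1)  # list of node sets
labels = {node: idx for idx, comm in enumerate(communities) for node in comm}
print(len(communities), "communities detected")
```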

The Infomap algorithm repeats the two phases described for Louvain until an objective function is optimized [20]. Instead of modularity, this algorithm optimizes the 'map equation': it finds community structure by minimizing the description length of a random walker's movements on a network. The walker moves between the nodes of the network and tends to follow highly weighted edges; hence, the weights of the connections within a community are greater than the weights of the connections between nodes of different communities.

The map equation is based on Shannon's source coding theorem from information theory [21]. Each module has a 'module codebook' whose 'codewords' for the nodes within that module are derived from the random walker's node visit/exit frequencies. The 'index codebook' has codewords for the modules, derived from the walker's module switch rates. The map equation states that the average per-step code length (describing one step of the random walker) equals the average length of codewords from the index codebook and the module codebooks, weighted by their rates of use.

$$\begin{aligned} L(M) = q_{\curvearrowleft }H(Q)+\sum _{i=1}^{m}p_{i}^{\circlearrowright }H(\rho _i) \end{aligned}$$
(4)

Here M denotes a partition of a network with n nodes into m modules, and L(M) is the per-step description length for that module partition. Note that \(q_{\curvearrowleft }\) and \(p_{i}^{\circlearrowright }\) are the rates of use of the index codebook and of module codebook i, respectively, while H(Q) and \(H(\rho _i)\) are the frequency-weighted average codeword lengths in the index codebook and in module codebook i, respectively.
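For completeness, the following is a sketch of running Infomap from Python via the infomap PyPI package; the constructor flags and method names are assumptions based on that package's documented API and may differ across versions.

```python
import networkx as nx
from infomap import Infomap  # `infomap` PyPI package (API assumed)

G = nx.gaussian_random_partition_graph(200, 20, 4, 0.4, 0.05,
                                       directed=True, seed=1)
im = Infomap("--directed --two-level --silent")
for u, v in G.edges():
    im.add_link(u, v)        # node ids must be integers
im.run()                     # minimizes the map equation, Eq. (4)
labels = im.get_modules()    # dict: node id -> module id
```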

The next algorithm, Label Propagation (LPA) [22], is one of the fastest semi-supervised algorithms for finding communities in a graph. It works as follows: every node is initialized with a unique community label (an identifier), and these labels propagate through the network. At each propagation step, every node updates its label based on the labels of its neighbors; ties are broken uniformly at random. The algorithm converges when every node carries the majority label of its neighbors, and it stops when either convergence or the user-defined maximum number of iterations is reached.

In LPA, the influence of a node's label on other nodes is determined by their closeness, which is measured by (5). Here \(d_{ij}\) is the Euclidean distance between nodes i and j, and \(\sigma ^2\) is a parameter that scales the proximity. The \(w_{ij}\) form a weight matrix which, to perform label propagation, is converted into a transition matrix T with entries \(t_{ij}\) given by (6).

$$\begin{aligned} w_{ij}=\exp \left( -\frac{d_{ij}^2}{\sigma ^2} \right) \end{aligned}$$
(5)
$$\begin{aligned} t_{ij} = \frac{w_{ij}}{\sum _{k}w_{kj}}. \end{aligned}$$
(6)
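A minimal sketch of Eqs. (5) and (6) follows; the node feature matrix X and the scale parameter sigma are illustrative assumptions, since the paper does not specify how node distances are obtained.

```python
import numpy as np

def transition_matrix(X: np.ndarray, sigma: float) -> np.ndarray:
    """Build T from Eqs. (5)-(6); X holds one feature vector per node."""
    # Pairwise squared Euclidean distances d_ij^2.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / sigma**2)               # Eq. (5)
    return W / W.sum(axis=0, keepdims=True)  # Eq. (6): t_ij = w_ij / sum_k w_kj
```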

Spinglass is the last algorithm that we discuss in this study. This method minimizes the Hamiltonian of the network [23]. Spinglass imposes the following four requirements, which together shape the communities:

1. reward internal edges between nodes of the same group (in the same spin state);

2. penalize missing edges between nodes in the same group;

3. penalize existing edges between different groups (nodes in different spin states);

4. reward non-links between different groups.

The Hamiltonian of the spin glass is given by the following equation. Here \(A_{ij}\) is the adjacency matrix of the graph, and \(\delta (\sigma _{i},\sigma _{j})\) is the Kronecker delta: \(\delta (\sigma _{i},\sigma _{j}) = 1\) if \(\sigma _{i}=\sigma _{j}\), and 0 otherwise. The spin state (or group index) of node i is denoted by \(\sigma _i \in \left\{ 1,2,\ldots ,q\right\}\), and \(a_{ij},b_{ij},c_{ij},d_{ij}\) are the weights of the individual contributions of the four requirements above, respectively.

$$\begin{aligned} H({\sigma })&=-\sum _{i\ne j}a_{ij}A_{ij}\delta (\sigma _i,\sigma _j) +\sum _{i\ne j}b_{ij}(1-A_{ij})\delta (\sigma _i,\sigma _j)\nonumber \\ &\quad +\sum _{i\ne j}c_{ij}A_{ij}(1-\delta (\sigma _i,\sigma _j))-\sum _{i\ne j}d_{ij}(1-A_{ij})(1-\delta (\sigma _i,\sigma _j)) \end{aligned}$$
(7)

This algorithm tries to minimize the energy of the spin glass, with the spin states being the community indices. Modularity can be rewritten in terms of the Hamiltonian as (8),

$$\begin{aligned} Q = - \frac{1}{M}H(\left\{ \sigma \right\} ). \end{aligned}$$
(8)

Simulated annealing is applied to this model to optimize the modularity.
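A hedged sketch of running Spinglass follows, using python-igraph's community_spinglass (an implementation of the model in [23]); the example graph is illustrative, and note that Spinglass requires a connected graph.

```python
import igraph as ig

# Illustrative graph; Spinglass fails on graphs with isolated components.
g = ig.Graph.Erdos_Renyi(n=100, p=0.08)
clustering = g.community_spinglass(spins=25, gamma=1.0)
print(clustering.membership)   # community index (spin state) per node
```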

2.1 Random graph generation

We now discuss two methods for generating random directed graphs with planted partitions: the Gaussian random partition graph [24] and our newly developed method. The main difference between them is that the Gaussian random partition graph draws community sizes from a single normal distribution, whereas our method draws them from a mixture of Gaussian distributions. Because the existing method uses one normal distribution, most of the communities have a similar number of nodes, while in real life we usually see communities of different sizes. Hence we propose this new method, which generates communities of different sizes.

A Gaussian random partition graph [24] is created by generating k clusters, each with a size drawn from a normal distribution with mean s and standard deviation s/v, where v is a shape parameter; the size of the last cluster may be significantly smaller than the others. Nodes are connected within clusters with probability \(P_{in}\) and between clusters with probability \(P_{out}\). To induce a more general social network structure, we updated the algorithm of the Gaussian random partition graph to use a mixture of Gaussian distributions, here two normal distributions, one with a larger and one with a smaller mean. As a result, this method generates network graphs with both larger and smaller communities.

In this novel method, for a given number of nodes (n), cluster sizes are drawn from a mixture of two Gaussian distributions with means s1 and s2 and standard deviations (s1/v1) and (s2/v2), respectively. Let \(f_i(x)\) be the probability density function (PDF) of the \(i^{th}\) Gaussian distribution. In the mixture, the probability of drawing from the first distribution is p and from the second distribution is \((1-p)\), where p is known as the mixing parameter. The probability density function of the mixture of Gaussians can then be written as:

$$\begin{aligned} \begin{aligned} f_{M}(x)&= p\, N\left( s_{1},\frac{s_{1}}{v_1}\right) + (1-p)\, N \left( s_{2},\frac{s_{2}}{v_2}\right) \\&= p\,f_1(x) + (1-p)\,f_2(x). \end{aligned} \end{aligned}$$
(9)

The mean of the Gaussian mixture is obtained by integrating:

$$\begin{aligned} \begin{aligned} \mu _M&=\int _{-\infty }^{\infty }x f_{M}(x)dx\\&=\int _{-\infty }^{\infty }x\, p f_1(x)dx + \int _{-\infty }^{\infty }x (1-p) f_2(x)dx\\&=p\int _{-\infty }^{\infty }x f_1(x)dx + (1-p)\int _{-\infty }^{\infty }x f_2(x)dx\\&= p\mu _1 + (1-p)\mu _2. \end{aligned} \end{aligned}$$
(10)

To find the variance, we first need the second moment; the variance of the Gaussian mixture then follows from it.

$$\begin{aligned} \begin{aligned} E[M^2]&=\int _{-\infty }^{\infty }x^2 f_M(x)dx\\&=\int _{-\infty }^{\infty }x^2 p f_1(x)dx + \int _{-\infty }^{\infty }x^2 (1-p) f_2(x)dx\\&=p\int _{-\infty }^{\infty }x^2 f_1(x)dx + (1-p)\int _{-\infty }^{\infty }x^2 f_2(x)dx\\&= p E(x_1^2)+(1-p) E(x_2^2) \end{aligned} \end{aligned}$$
(11)
$$\begin{aligned} \sigma _M^2=E[M^2]-\mu _M^2 \end{aligned}$$
(12)

Hence, using the above equations, we can calculate the mean and standard deviation of the mixture of the two distributions and generate networks with partitions.
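One possible implementation of this generator is sketched below, assuming NetworkX's random_partition_graph as the underlying planted-partition model; the function name and the rule for trimming the last community are our illustrative choices.

```python
import numpy as np
import networkx as nx

def mixture_partition_graph(n, s1, s2, v1, v2, p, p_in, p_out, seed=None):
    """Directed planted-partition graph with sizes from a two-Gaussian mixture."""
    rng = np.random.default_rng(seed)
    sizes = []
    while sum(sizes) < n:
        # With probability p draw from N(s1, s1/v1), else from N(s2, s2/v2).
        mean, sd = (s1, s1 / v1) if rng.random() < p else (s2, s2 / v2)
        sizes.append(max(1, int(round(rng.normal(mean, sd)))))
    sizes[-1] -= sum(sizes) - n          # trim the last community to hit n nodes
    sizes = [s for s in sizes if s > 0]
    return nx.random_partition_graph(sizes, p_in, p_out,
                                     directed=True, seed=seed)
```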

2.2 Simulation study

In this section, we describe our simulation process, where we consider various parameter values. The flow chart in Fig. 1 shows the process step by step; it illustrates the simulation when the number of nodes changes while all other parameters are fixed. Following this flow chart, we first generate a network with the number of nodes equal to 10. This network comes with community labels for each node, and we consider those labels the actual labels. We then use the same generated network to find communities by applying our four community detection algorithms (Louvain, Label propagation, Spinglass, and Infomap). After that, we evaluate the detected communities against the actual labels using the similarity metrics (ARI, NMI, and AMI). In this simulation study, we generated all the synthetic networks using the 'NetworkX' package in Python [25], and we used the scikit-learn package in Python [26] to calculate the similarity scores.

Fig. 1: General simulation process
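A hedged sketch of one evaluation step from this flow chart is given below; it assumes that the planted partition can be read from the graph attribute 'partition' stored by NetworkX's partition generators, and the parameter values are illustrative.

```python
import networkx as nx
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             adjusted_mutual_info_score)

G = nx.gaussian_random_partition_graph(100, 10, 2, 0.4, 0.05,
                                       directed=True, seed=1)
# Planted ("actual") labels vs. labels found by one detection algorithm.
true = {n: k for k, block in enumerate(G.graph["partition"]) for n in block}
found = {n: k for k, comm in
         enumerate(nx.community.louvain_communities(G, seed=1)) for n in comm}
y_true = [true[n] for n in G]
y_pred = [found[n] for n in G]
print(adjusted_rand_score(y_true, y_pred),
      normalized_mutual_info_score(y_true, y_pred),
      adjusted_mutual_info_score(y_true, y_pred))
```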

We developed another procedure that varies two parameters at once, to better understand the behavior of the different community detection algorithms. This process produces, for each algorithm, a heat map showing how the similarity score varies with \(P_{in}\) and \(P_{out}\); rows and columns correspond to \(P_{in}\) and \(P_{out}\), respectively.

Fig. 2: The process of creating a similarity score value matrix

Figure 2 explains the process step by step. As the first step, we fixed the number of nodes (n), the two mean values (s1, s2), and the two standard deviation values (\(\sigma _1\), \(\sigma _2\)), and varied \(P_{in}\) and \(P_{out}\) from 0.01 to 0.5 in 0.01 increments. For each setting, we generated a network using a random partition graph with a mixture of two Gaussians and treated those partitions as the true communities. Next, we applied the community detection algorithm to the network, identified the communities, and stored the similarity scores between the true and identified communities in a vector. We repeated this procedure 100 times and took the mean of those similarity values to obtain the average similarity score for the given \(P_{in}\) and \(P_{out}\), which was stored in the corresponding cell of the matrix. Repeating the same procedure for the remaining \(P_{in}\) and \(P_{out}\) values fills the matrix.

After creating the similarity score matrix, we can easily convert it to a color heat map. Heat maps are visually appealing, and such a visualization is generally easier to interpret than reading raw values; through the heat maps, we can readily assess the capacity of a given algorithm to identify the correct communities.
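The following hedged sketch implements the whole procedure for one algorithm and one score; the parameter values are illustrative, and mixture_partition_graph is the generator sketched in Sect. 2.1.

```python
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
from sklearn.metrics import adjusted_rand_score

probs = np.arange(0.01, 0.51, 0.01)
scores = np.zeros((len(probs), len(probs)))
for r, p_in in enumerate(probs):           # rows: P_in
    for c, p_out in enumerate(probs):      # columns: P_out
        runs = []
        for _ in range(100):
            G = mixture_partition_graph(100, 20, 5, 4, 4, 0.3, p_in, p_out)
            true = {n: k for k, block in enumerate(G.graph["partition"])
                    for n in block}
            found = {n: k for k, comm in
                     enumerate(nx.community.louvain_communities(G))
                     for n in comm}
            runs.append(adjusted_rand_score([true[n] for n in G],
                                            [found[n] for n in G]))
        scores[r, c] = np.mean(runs)       # average score for this (P_in, P_out)

plt.imshow(scores, cmap="viridis")         # dark blue (low) to yellow (high)
plt.xlabel("$P_{out}$")
plt.ylabel("$P_{in}$")
plt.colorbar()
plt.show()
```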

3 Results

As discussed in the introduction, we considered two main random network graph generators for this study. Figure 3 shows random network graphs generated from the Gaussian random partition graph. Most of these networks have communities of similar size; in reality, however, we would not always find such communities. For example, some communities may have a large number of nodes while others have only a few. Hence, to generate more realistic networks, we developed our new network graph generator using a mixture of Gaussians.

Fig. 3: Plots a–c: Some randomly generated graphs from the Gaussian random partition graph. All graphs were generated using the same parameters

Figure 4 shows sample network graphs generated from our new method. In these networks, we can observe some communities with a large number of nodes and others with a small number of nodes.

Fig. 4: Plots a–c: Some randomly generated graphs from the newly developed method using a mixture of two Gaussians. All graphs were generated using the same parameters

3.1 Results of dense networks

In this simulation, when creating networks, we fixed all parameters except the one of interest. We considered as parameters the number of nodes and the probabilities of connecting within communities (intra probability) and between communities (inter probability). Figure 5 shows how the similarity scores vary as the number of nodes increases. Based on these plots, the ARI, NMI, and AMI scores of the communities found by the Louvain and Spinglass methods decrease as the number of nodes increases. In contrast, the ARI, NMI, and AMI scores for the Infomap and Label propagation methods remain constant at zero, so the number of nodes does not affect the similarity scores of those two methods.

Fig. 5: Variation of similarity scores with the increase in the number of nodes. Plots: a ARI score, b NMI score, c AMI score

Figure 6 shows how the similarity scores vary as the intra probability increases; we varied this probability from 0.1 to 0.9 in 0.01 increments. Based on these figures, it is clear that the communities found by the Louvain and Spinglass methods agree better with the actual labels for networks created with an intra probability greater than 0.3.

Fig. 6: Variation of similarity scores with the increase of intra probability. Plots: a ARI score, b NMI score, c AMI score

Fig. 7: Variation of similarity scores with the increase of inter probability. Plots: a ARI score, b NMI score, c AMI score

The variation of the similarity scores with increasing inter probability is plotted in Fig. 7. As for the intra-cluster probability, we used probabilities from 0.1 to 0.9 in 0.01 increments. All of these plots show that the Louvain and Spinglass algorithms achieve higher similarity with the true clusters when the networks have a lower inter-connection probability. The highest similarity score is achieved when the inter probability is 0.1; beyond that, the score drops drastically and remains around zero. The Infomap and Label propagation algorithms again show zero values for all three similarity scores.

3.2 Results of sparse networks

We now consider sparse networks generated with our new network generation method, for which we conducted the same simulation process. The behavior of the similarity scores for each community detection algorithm as the number of nodes increases is illustrated in Fig. 8. When generating random sparse networks, some networks end up with isolated nodes, and the Spinglass algorithm does not work on networks with isolated nodes; hence we removed the Spinglass method from this simulation study.

Fig. 8: Variation of similarity scores with the increase in the number of nodes in sparse networks. Plots: a ARI score, b NMI score, c AMI score

Based on the results shown in Fig. 8, it is clear that in sparse networks all three algorithms show a higher similarity with the true labels when the number of nodes is small. Label propagation shows a rapid decline, while Infomap and Louvain decline gradually as the number of nodes increases.

We then varied the intra probability from 0.1 to 0.9 in 0.01 increments, fixing the inter probability at 0.02 and the number of nodes at 50. Figure 9 shows the results of this simulation. All three algorithms show a gradual increase in similarity score in these sparse networks as the intra probability rises. Here Louvain and Infomap perform similarly to each other and better than the Label propagation algorithm.

Fig. 9: Variation of similarity scores with the increase of intra probability in sparse networks. Plots: a ARI score, b NMI score, c AMI score

Finally, we varied the inter probability, fixing the intra probability at 0.3 and the number of nodes at 50. The variation of the similarity scores (ARI, NMI, and AMI) for each community detection algorithm is plotted in Fig. 10. When the inter probability is lower than 0.1, all the community detection algorithms show higher similarity scores.

Fig. 10: Variation of similarity scores with the increase of inter probability in sparse networks. Plots: a ARI score, b NMI score, c AMI score

3.3 Results of heat maps

Following the steps described in Sect. 2.2, we generated heat maps for each community detection algorithm. Since we consider three similarity measures (ARI, NMI, and AMI), we obtained three heat maps per algorithm.

Fig. 11: ARI similarity score heat maps for a Infomap, b Label propagation, c Louvain. Dark blue to yellow indicates low to high similarity score

Figure 11 shows the ARI similarity score heat maps for the Infomap, Label propagation, and Louvain algorithms. The color range from dark blue to yellow spans the lowest to the highest similarity score. In these heat maps, the bottom-left corners correspond to higher intra probabilities and lower inter probabilities; these corners are yellow, indicating high similarity scores. Hence, the results of all the community detection algorithms are acceptable when the networks have high intra probabilities and low inter probabilities. Comparing the three heat maps, plot (b) has the smallest yellow area and plot (c) the largest, showing that, across different network structures, Louvain has the highest capacity to detect acceptable communities while Label propagation has the lowest.

Fig. 12: NMI similarity score heat maps for a Infomap, b Label propagation, c Louvain. Dark blue to yellow indicates low to high similarity score

Fig. 13: AMI similarity score heat maps for a Infomap, b Label propagation, c Louvain. Dark blue to yellow indicates low to high similarity score

Figures 12 and 13 show the NMI and AMI similarity score heat maps, respectively. In these heat maps too, yellow areas (high similarity scores) appear for higher intra probabilities and lower inter probabilities. These heat maps likewise show that the Louvain algorithm has the highest capacity to detect communities while Label propagation has the lowest.

4 Conclusion and future direction

Since community detection was developed specifically for social network analysis, which depends on edges, the results should depend on the network structure. Our simulation study showed that most community detection algorithms gave better results for sparse networks than for dense networks; even algorithms that struggled to detect communities in dense networks could detect communities to some extent in sparse networks. In both sparse and dense networks, we observed that the algorithms' ability to detect communities increased with the probability of edges between nodes in the same community and with the average community size. Conversely, the detection ability decreased with the probability of edges between nodes in different communities and with the number of nodes.

Using the heat maps we generated, we could compare the community detection capacities of the different algorithms. Overall, the results of all the community detection methods are acceptable when the network has higher intra probabilities and lower inter probabilities; still, the Louvain algorithm has the highest capacity, while Label propagation has the lowest. Throughout the study, we observed that the Louvain algorithm detected communities better than the other algorithms.

The Gaussian random partition graph generator is one of the standard methods for generating random networks with partitions. Because it draws community sizes from a single Gaussian distribution, the output networks contain communities with similar numbers of nodes, whereas real networks contain communities of different sizes, both large and small. Hence we developed this novel method for generating networks using a mixture of Gaussians and observed that it can generate networks similar in structure to real-world disease transmission networks.

This study can be further expanded by updating the network generator to use a Dirichlet process with a Gaussian base distribution. We could then compare the structures of networks generated by the existing Gaussian random partition graph with those generated by the updated method with the Dirichlet process. Furthermore, we can extend our approach to evaluate temporal community detection in real-world applications.