A motif-based probabilistic approach for community detection in complex networks

Hajibabaei, Hossein; Seydi, Vahid; Koochari, Abbas

doi:10.1007/s10844-024-00850-3

Research
Open access
Published: 16 March 2024


Journal of Intelligent Information Systems


Abstract

Community detection in complex networks is an important task for discovering hidden information in network analysis. Neighborhood density between nodes is one of the fundamental indicators of community presence in the network. A community with a high edge density will have correlations between nodes that extend beyond their immediate neighbors, denoted by motifs. Motifs are repetitive patterns of edges observed with high frequency in the network. We propose the PCDMS method (Probabilistic Community Detection with Motif Structure), which detects communities by estimating the triangular motif in the network. This study employs structural density between nodes, a key concept in graph analysis. The proposed model has the advantage of using a probabilistic generative model that calculates the latent parameters of the probabilistic model and determines the community based on the likelihood of triangular motifs. Observing two pairs of nodes together in multiple communities increases the estimated likelihood that a motif structure exists between them. The output of the proposed model is the intensity of each node’s membership in the communities. The efficiency and validity of the proposed method are evaluated through experiments on both synthetic and real-world networks; the findings show that the communities identified by the proposed method are more accurate and dense than those of other algorithms under the modularity, NMI, and F1Score evaluation metrics.

1 Introduction

One of the fundamental characteristics of a complex network is the concept of community, which can be described as a collection of nodes that are more densely connected to one another than they are to the rest of the network. Community detection is a class of pattern recognition methods that assign network nodes to groups, or communities, based on the network’s structural organization (Singhal et al., 2020). Communities can reflect a style of thinking, a category, an interest, or a topic orientation, among other things. Communities can be disjoint or shared; the latter are referred to as overlapping communities (Hajibabaei et al., 2023). Community detection belongs to the clustering area of machine learning and has the potential to be used in a wide range of data science areas, including text categorization, traffic network optimization, and social network analysis. The purpose of community detection is to divide a network’s nodes into multiple communities so that nodes in the same community are densely coupled or have comparable node properties (Wu et al., 2018). The detection of communities, which aims to uncover the hidden structures of a complex network (often groups of densely linked nodes), is a crucial issue in dynamical network analysis (Fortunato & Hric, 2016).

Community detection algorithms can be divided into categories such as weighted (Hajibabaei et al., 2023; Wang et al., 2017) and unweighted (Chen & Li, 2019; Zarandi & Rafsanjani, 2018), directed (Le et al., 2019) and undirected (Lyu et al., 2019), global (Zhou et al., 2019a) and local (Lyu et al., 2019), overlapping (Yang & Leskovec, 2013; Ma et al., 2016) and nonoverlapping (Liu et al., 2018).

Based on these categories, different algorithms have been developed for community detection. Examples include modularity-based methods (Zhou et al., 2019b; Newman et al., 2004; Girvan & Newman, 2002), label propagation methods (Le et al., 2019; Raghavan et al., 2007; Gregory, 2010; Berahmand & Bouyer, 2019), model-based methods (Hajibabaei et al., 2023; Yang & Leskovec, 2013, 2012), clique percolation methods (Palla et al., 2005), and network embedding methods (Kumar et al., 2021). A review of these approaches and of the research in this field shows that a simple analysis of node properties does not provide the necessary accuracy for community detection in complex networks; rather, closer consideration of the networks’ specifics and the use of the graphs’ inherent traits, such as motif structure, yields superior results.

The proposed method uses a probabilistic model to detect communities in networks. We expand probabilistic model-based approaches from edge generation to motif generation. Small, linked sub-networks known as “motifs” are frequently found in complex networks. We demonstrate that by increasing the observation of three nodes in shared communities, the probability of the existence of a triangular motif between them increases. In other words, we use the triangular motif to find the probabilistic model’s hidden parameter and detect the community. For the probabilistic motif generator’s function, we define the triangular motif estimator function as a softmax loss function over one node and two of that node’s neighbors. Also, we study the influence of community overlapping on the generation of motifs.

2 Related works

The issue of community detection in complex networks has received a lot of attention due to the development of these networks, particularly social networks. Numerous studies have been conducted on various facets of community detection over the years. Traditional methods are the first approach to community detection and refer to clustering-based algorithms. These methods introduced key ideas in community detection and paved the way for future improvements. Traditional algorithms include hierarchical clustering, spectral clustering, partitional clustering, and graph partitioning (Javed et al., 2018). Girvan and Newman discussed the subject of community detection in 2002. They argued that community structure is characterized by the property that “network nodes are tied together in tightly knit groupings, between which there are only looser connections” (Girvan & Newman, 2002). Since then, other community detection techniques have been proposed by researchers from various fields. Nowadays, the traditional approach is less used for large-scale social networks with millions of nodes. The main contributions to community detection algorithms are based on the overlapping and connectivity structures of the networks. In the majority of real-life complex networks, communities arise from the network’s structure. Numerous community detection algorithms have been created to date because identifying community structure contributes significantly to our understanding of how complex networks are organized. In this section, we review a number of these studies that are related to the concepts of the proposed methodology.

2.1 Label propagation approaches

The label propagation algorithm (LPA) is a fast and convenient community detection algorithm first presented in (Raghavan et al., 2007). It is highly regarded for its straightforward structure and low time complexity, but it has some drawbacks, including the randomness of node selection and label updating. In the LPA method, a node is randomly chosen and its label is updated to the most common label among its neighbors through an iterative process (Li et al., 2022). The Speaker-listener Label Propagation Algorithm (SLPA) (Xie et al., 2011) and COPRA (Gregory, 2010) were developed to address the shortcomings of the label propagation process. SLPA consists of three stages: 1) initialization, 2) evolution, and 3) post-processing. In both algorithms, each node has just one label, which is repeatedly updated using the maximum label in its neighborhood. The method converges, allowing for the identification of distinct communities. The current state-of-the-art label propagation-based algorithm (LPAm+) detects communities using a two-stage iterative procedure: the first stage assigns labels, or community memberships, to nodes using label propagation to maximize modularity, a well-known quality function for evaluating the goodness of a community division; the second stage merges smaller communities to further improve modularity (Le et al., 2019).

2.2 Modularity optimization approaches

Modularity-based community detection algorithms are widely studied and applied because of their concise strategies and prominent effects. However, they also face challenges, such as sensitivity to seed node selection and unstable communities (Guo et al., 2022). Among these approaches, the Louvain algorithm (Blondel et al., 2008) is well known and frequently applied to modularity-based network analysis. This method offers a straightforward and quick way of finding distinct communities in complex weighted and unweighted networks, and it maximizes modularity by clustering graph nodes with a greedy approach. The Leiden algorithm (Traag et al., 2019) tries to correct some problems in the Louvain algorithm. In Leiden, the goal is to refine the communities established during the iteration cycle while simultaneously speeding up local movement and transferring nodes to arbitrary neighbors. The distance between the grains and the calculated density in Euclidean space are two requirements for the leading tree, an effective granular computing (GrC) model for hierarchical clustering (Fu et al., 2022). In (Chen et al., 2022), an enhanced density peak-based community detection algorithm called DPCD is proposed. In DPCD, first, a novel local density suitable for complex networks is defined to consider the node distribution and network structure jointly. Second, based on the node density and network structure, a density-connected tree is constructed to measure the density-following distance of each node. Finally, an improved density peak model is constructed to cluster complex networks quickly and accurately.

2.3 The probabilistic model estimation approach

Probabilistic methods estimate a generative model to detect communities, as opposed to the methods stated above, which employ traditional models to detect communities in networks. This approach creates an abstract generative model from the network graph and estimates the model parameters. The AGM (Affiliation Graph Model) (Yang & Leskovec, 2013, 2012) can be read as a nonnegative affiliation matrix or interpreted as a bipartite network connecting vertices and communities. It is based on the idea that shared social affiliations give rise to communities. The methods in (Yang & Leskovec, 2013, 2012; Yang et al., 2013) use a matrix factorization-based model to estimate a probability distribution function, with the intensity of node affiliation to communities as a parameter of the probabilistic model. A model based on matrix factorization that allows edges to be simply removed or added is presented by (Yang et al., 2018). The community detection problem has also been formulated as a non-negative matrix factorization model (Yu et al., 2019), in which the dynamics of the network structure are managed by a transfer matrix.

2.4 Clique percolation and motif-based approaches

Cliques are one of the fundamental concepts in graph theory and are used to detect communities in graphs. In an undirected graph G, a clique is a group of vertices in which every two distinct vertices are connected to one another, so that the induced subgraph is complete (Palla et al., 2005). The idea behind the clique percolation method (CPM) is that a community is made up of overlapping collections of fully connected subgraphs (Attal et al., 2021). The process of community detection thus involves searching for adjacent cliques. The first step is to identify every clique of size k. Then, a new graph is built with each node standing in for one of these k-cliques; two such nodes are linked when their cliques share k-1 vertices, and each connected component of this graph corresponds to a community. CPM works well in networks with many closely connected components. Based on the clique percolation method’s search for local patterns, Palla et al. (2005) proposed one of the first overlapping community detection algorithms. The CFinder algorithm (Adamcsek et al., 2006) is an implementation of CPM. The algorithm’s complexity is polynomial. For large networks, the algorithm does not, however, reach its limit.

One of the concepts close to clique is the motif. Motifs are small interconnected sub-networks observed in complex networks with high frequency, interpreted as simple and fundamental components of the network (Bloem & de Rooij, 2020). Motif search has been widely studied in various fields of network analysis. The main problem of recognizing motifs in large networks is the exponential growth of the number of possible subgraphs in the network with increasing motif size (Milo et al., 2002).

Motifs are usable for understanding structure and detecting community in complex networks. A community with a high edge density will have correlations between nodes that extend beyond their immediate neighbors, as shown by the presence of motifs. Empirical studies show that similar nodes in a community have similar motifs (Arenas et al., 2008). Therefore, using motifs with high-density connections can be an important strategy to help discover the communities and analyze the network more precisely (Li et al., 2022). Though a few motif-based community detection algorithms have been put forth (Arenas et al., 2008; Tsourakakis et al., 2017; Huang et al., 2020), they typically struggle with high computational complexity when applied to large-scale networks. Integrating lower-order and higher-order structural data effectively and efficiently into a single framework for community detection is still a challenge.

3 Proposed model frameworks

In this paper, we present PCDMS (Probabilistic Community Detection with Motif Structure), a probabilistic community detection method for complex networks that uses the affiliation graph model and the triangular motif to identify community structures. The fundamental tenet of the proposed approach is that identifying a strong community requires consideration of the structural model and the types of relationships of each node. In their study on the relationship between 2-clique (edge) probability and community overlapping, Yang and Leskovec (2013) found that the more communities two nodes share, the more likely they are to be connected. In this article, we investigate how community overlapping affects the formation of 3-cliques, i.e., triangle motifs. We demonstrate that the more often three nodes are observed in shared communities, the higher the probability that a triangular motif exists between them. Such a finding is consistent with the fundamental tenet that vertices located in the overlaps of communities are more densely connected than vertices within a single community. Thus, we extend the AGM (Yang & Leskovec, 2013, 2012) to generate triangle motifs by utilizing the optimized softmax loss function for probabilistic estimation. In Fig. 1, two types of three-node motifs are presented. The triangle motifs discovered in different types of networks can be given different interpretations according to the characteristics of each network. For example, in complex networks (Milo et al., 2002), the most widely studied motifs are the 3-node motif, i.e., the (3,e)-motif, and the 4-node motif, i.e., the (4,e)-motif. Due to the complexity of probabilistic calculations for the (4,e)-motif, the presented probabilistic model is based on a (3,e)-motif. However, the proposed method does not seek to discover or interpret the content of the triangular motif but rather uses the motif to construct the hidden parameter of the probabilistic model and to detect the community.

In contrast to other community detection methods, the PCDMS model takes into account properties that were less used in previous methods, such as:

Using a probabilistic method to estimate the triangle motif
Conceptually connecting community detection with the probability of the presence or absence of a triangular motif
Using evolutionary methods and maximum likelihood estimation in calculations

The components of the proposed model are explained in more detail below. The PCDMS model is defined on a network G(V, E), where V and E denote the sets of nodes and edges. We represent the affiliation strength of node u to community c as a nonnegative value, $ M_{uc} $ ($ M_{uc} = 0 $ indicates that u is not a member of c). The matrix M thus gives the degree of membership of each node in each community.

In PCDMS, the value of M determines the likelihood that a triangular motif among three nodes u, $ v_1 $, and $ v_2 $ occurs in a community c. Motif generation occurs independently within each community c. In particular, we assume that three nodes, u, $ v_1 $, and $ v_2 $, form a triangular motif with the following probability. We define the triangular motif estimator function $ P_c(u, v_1, v_2) $, which serves as the probabilistic motif generator, by using the softmax loss function over one node and two neighbors of that node, that is,

$$\begin{aligned} P_c(u,v_1,v_2) =\underset{(u,v_1) \in E}{P_c(u,v_1)}\cdot \underset{(u,v_2) \in E}{P_c(u,v_2)}= \nonumber \\ \frac{\exp (-M_{uc}.M^T_{v_1c})}{\sum _{v_i \in N(u)}{\exp (-M_{uc}.M^T_{v_ic})}} \cdot \frac{\exp (-M_{uc}.M^T_{v_2c})}{\sum _{v_i \in N(u)}{\exp (-M_{uc}.M^T_{v_ic})}} \end{aligned}$$

(1)

In (2), N(u) is a set of neighbors of node u. According to the generative probabilistic procedure between two pairs of nodes in a triangular motif, each pair of nodes is independently distributed by the Bernoulli distribution. Therefore, using the conditional independent probability, the following relationship is established for the probability of the existence of a triangular motif:

$$\begin{aligned} P_c(u,v_1,v_2) =\underset{v_1 \in N(u)}{P_c(u,v_1)}\cdot \underset{v_2 \in N(u)}{P_c(u,v_2)}\ \end{aligned}$$

(2)
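As an illustration of (1) and (2), the following minimal Python sketch evaluates the softmax factor for one pair of nodes and the resulting motif probability within a single community c. The function names `pair_prob` and `motif_prob`, the dictionary layout of M and N(u), and the toy values are assumptions made for this example; they are not taken from the authors' implementation.

```python
import math

def pair_prob(M, neighbors, u, v, c):
    """Softmax factor P_c(u, v) from Eq. (1), normalized over N(u)."""
    # exp(-M_uc * M_wc) for every neighbor w of u
    scores = {w: math.exp(-M[u][c] * M[w][c]) for w in neighbors[u]}
    return scores[v] / sum(scores.values())

def motif_prob(M, neighbors, u, v1, v2, c):
    """P_c(u, v1, v2): product of the two pair factors, as in Eq. (2)."""
    return pair_prob(M, neighbors, u, v1, c) * pair_prob(M, neighbors, u, v2, c)

# Hypothetical toy data: node 0 with neighbors 1, 2, 3 and one community (c = 0).
M = {0: [1.0], 1: [0.5], 2: [0.5], 3: [0.5]}
neighbors = {0: [1, 2, 3]}
p = motif_prob(M, neighbors, 0, 1, 2, 0)
```

Because the three neighbors have equal affiliation strengths in this toy example, each pair factor is 1/3 and the motif probability is 1/9.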

The proposed framework, which is a probabilistic generative model, is predicated on the following premises:

In a community, a triangle motif can exist between two pairs of nodes (one node and two neighbors of that node).
The probability of the existence of a triangle motif increases when the two pairs of nodes are observed in multiple communities.
Communities can overlap; communities that overlap have a higher density of triangle motifs.

4 Community detection by PCDMS model

We describe the components of the PCDMS model before demonstrating how to use it for community detection in networks. In (3), l(M) is the logarithm of the likelihood of the existence of a motif in graph G. Also in (4), N(u) is a set of neighbors of node u. By minimizing the negative likelihood, we can estimate the ideal M as follows:

$$\begin{aligned} l(M) = \log {P(G \mid {M})} \end{aligned}$$

(3)

$$\begin{aligned} M = \underset{M \ge 0}{\operatorname{argmin}}\ L(M) = \nonumber \\ \underset{M}{\operatorname{argmin}} -\prod _{(v_1,v_2) \in N(u)} P(u,v_1,v_2) \prod _{(v_1,v_2) \notin N(u)} (1-P(u,v_1,v_2))= \nonumber \\ \underset{M}{\operatorname{argmin}} -[\prod _{(u,v_1) \in E} P(u,v_1) \prod _{(u,v_1) \notin E} (1-P(u,v_1))]. \nonumber \\ [\prod _{(u,v_2) \in E} P(u,v_2) \prod _{(u,v_2) \notin E} (1-P(u,v_2))] \end{aligned}$$

(4)

Equation (5) gives the objective from which the degree of membership of a node in a community ($ M_{uc} $) is estimated. Taking the natural logarithm of both sides after inserting (2) into (4) turns the products into sums and simplifies further calculation.

$$\begin{aligned} L(M) = -[\ln ({\exp (-M_{uc}.M^T_{v_1c})})-\ln {\sum _{v_i \in N(u)}{\exp (-M_{uc}.M^T_{v_ic})}}] - \nonumber \\ [\ln ({\exp (-M_{uc}.M^T_{v_2c})})-\ln {\sum _{v_i \in N(u)}{\exp (-M_{uc}.M^T_{v_ic})}}] = \nonumber \\ M_{uc}.M^T_{v_1c} + M_{uc}.M^T_{v_2c} + 2\ln \sum _{v_i \in N(u)}{\exp (-M_{uc}.M^T_{v_ic})} \end{aligned}$$

(5)

4.1 Updating the parameter

The negative likelihood function in (4) is non-linear and contains the latent variable M, so it cannot be minimized directly with the well-known optimization methods. To overcome the difficulty of solving optimization problems with latent variables in machine learning, we use the Block Coordinate Descent technique (Xu & Yin, 2013) to solve the objective function in (4). We update $ M_u $ for each node u while keeping the values of its neighbors ($ M_v $) fixed, which amounts to solving the following subproblem:

$$\begin{aligned} L(M_u) = M_{u}.M^T_{v_1} + M_{u}.M^T_{v_2} + 2\ln \sum _{v_i \in N(u)}{\exp (-M_{u}.M^T_{v_i})} \end{aligned}$$

(6)

To find the minimum of the negative log-likelihood, we look for the point on its curve where the slope is 0. Therefore, it is necessary to derive the partial derivative of the subproblem objective (6) with respect to $ M_u $.

$$\begin{aligned} \frac{\partial L(M_u)}{\partial M_u} = M_{v_1} + M_{v_2} + 2\,\frac{\sum _{v_i \in N(u)}{-M_{v_i}\exp (-M_{u}.M^T_{v_i})}}{\sum _{v_i \in N(u)}{\exp (-M_{u}.M^T_{v_i})}} \end{aligned}$$

(7)

Finally, the $ M_u $ values are updated by the gradient descent method (Hsieh & Dhillon, 2011; Lin, 2007). Since a node’s membership strength in a community cannot be negative, any negative value produced by the update is replaced with 0.

$$\begin{aligned} M_u(t+1) = \max (0,M_u(t)-\eta (\frac{\partial L(M_u)}{\partial M_u})) \end{aligned}$$

(8)

where $ \eta $ is a learning rate parameter. The updating process is repeated until the difference between the value from the previous step and the current value is smaller than the desired threshold.
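The update step can be sketched in plain Python as follows. This is a minimal illustration, assuming the objective for one node u and one pair of neighbors is the negative logarithm of the two softmax factors in (1), evaluated with full affiliation vectors. The function names, the dictionary layout of M, the toy values, and the learning rate 0.1 are all hypothetical choices for this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neg_log_motif(M, neighbors, u, v1, v2):
    """Negative log of the two softmax factors for the triple (u, v1, v2)."""
    log_z = math.log(sum(math.exp(-dot(M[u], M[w])) for w in neighbors[u]))
    return dot(M[u], M[v1]) + dot(M[u], M[v2]) + 2.0 * log_z

def grad_u(M, neighbors, u, v1, v2):
    """Gradient of neg_log_motif with respect to the vector M[u]."""
    weights = [math.exp(-dot(M[u], M[w])) for w in neighbors[u]]
    z = sum(weights)
    k = len(M[u])
    grad = [M[v1][c] + M[v2][c] for c in range(k)]
    for w, wt in zip(neighbors[u], weights):
        for c in range(k):
            # derivative of 2 * ln(sum of exponentials)
            grad[c] -= 2.0 * (wt / z) * M[w][c]
    return grad

def projected_step(M, neighbors, u, v1, v2, eta=0.1):
    """One gradient step as in Eq. (8): negative entries are clipped to 0."""
    g = grad_u(M, neighbors, u, v1, v2)
    return [max(0.0, m - eta * gc) for m, gc in zip(M[u], g)]

# Hypothetical toy data: node 0 with three neighbors and two communities.
M = {0: [0.5, 0.2], 1: [1.0, 0.0], 2: [0.3, 0.7], 3: [0.0, 0.9]}
neighbors = {0: [1, 2, 3]}
new_Mu = projected_step(M, neighbors, 0, 1, 2)
```

Only `M[0]` changes in this step; the neighbor rows stay fixed, which mirrors the block coordinate idea of updating one node at a time.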

4.2 PCDMS algorithm

The proposed PCDMS model (Probabilistic Community Detection with Motif Structure) is shown in Algorithm 1. The inputs to the method are a graph (G) and the number of communities (k). The output is a matrix that shows how strongly each node belongs to each community ($ M_{uc} $). The probability that a motif structure exists between two pairs of nodes grows as they are observed together in more communities. After the latent variable of the model (M) is initialized (initialization is covered in Section 4.4), the algorithm enters an iterative loop. The iterations end when the difference between $ M_u(t+1) $ and $ M_u(t) $ is less than a predetermined threshold (here, the stop threshold is 0.005). In each iteration, the method evaluates the likelihood function of the probabilistic generative model ($ L(M_u) $) to estimate the model’s unknown parameter. For each node u, the derivative of the negative log-likelihood, $ D(L(M_u)) $, is used to move $ L(M_u) $ as close as possible to its minimum (where the slope is 0). Due to the complexity of the calculations, we chose the gradient descent method (Xu & Yin, 2013; Hsieh & Dhillon, 2011) to minimize the negative likelihood; it is applied at each iteration to update the latent variable of the model ($ M_u $). Once the values of M have stabilized, they give the membership strength of each node in each community. Each value is then classified as belonging or not belonging to a community by comparing it to an experimental threshold (e.g., the median of the M values), which yields the model’s output.
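The stopping rule and the final thresholding step described above can be sketched as follows. This is only an outline under stated assumptions: `run_until_stable`, `assign_communities`, and the `update_node` callback are hypothetical names, the per-node update is passed in as a function rather than fixed, and the example matrix is invented for illustration.

```python
import statistics

def run_until_stable(M, update_node, tol=0.005, max_iter=1000):
    """Repeat node-wise updates until no entry of M changes by tol or more."""
    for _ in range(max_iter):
        largest_change = 0.0
        for u in M:
            new_row = update_node(M, u)  # e.g. a projected gradient step
            change = max(abs(a - b) for a, b in zip(new_row, M[u]))
            largest_change = max(largest_change, change)
            M[u] = new_row
        if largest_change < tol:
            break
    return M

def assign_communities(M, threshold=None):
    """Node u joins community c when M[u][c] exceeds the threshold."""
    if threshold is None:
        # Median of all entries of M, the example threshold named in the text.
        threshold = statistics.median(x for row in M.values() for x in row)
    return {u: [c for c, x in enumerate(row) if x > threshold]
            for u, row in M.items()}

# Hypothetical stabilized matrix for four nodes and two communities.
M_final = {0: [0.9, 0.0], 1: [0.8, 0.1], 2: [0.7, 0.6], 3: [0.0, 0.9]}
members = assign_communities(M_final)
overlapping = assign_communities(M_final, threshold=0.5)
```

With the median threshold every node here lands in exactly one community, while the lower threshold of 0.5 places node 2 in both, which shows how the choice of threshold controls the amount of overlap.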

4.3 Computational complexity

The computational complexity of the PCDMS algorithm depends on the number of communities and the density of motifs. Equations (6) and (7) are used to update the degree of membership in each community, which is at the core of Algorithm 1, as seen in its iteration steps. In this scenario, the presence or absence of a motif for two nodes depends on whether or not their neighbors are members of one or more communities. Therefore, the computational complexity depends on the number of neighbors of each node (N(u)) and on how many communities are present; in the worst case, this complexity is $ O(2k\cdot E) $.

4.4 Initialization

There are multiple options for initializing the matrix of membership intensities of the nodes in each community. The first and simplest option is to fill in the values at random. Its biggest limitation, however, is that the algorithm needs more iterations to reach a stable state, which adds to the computational cost. The second option is the local minimum neighborhood method (Gleich & Seshadhri, 2012), which experiments have shown to be a good starting point for community detection algorithms. In addition to reducing the number of iterations and starting the algorithm in a stable state, this approach has the additional benefit of providing an initial estimate of the number of communities for the proposed model’s community detection phase.
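One way to read the second option is sketched below: a node is kept as a seed when the conductance of its ego network (the node plus its neighbors) is no higher than that of any neighbor's ego network. This reading, the tie handling with `<=`, the function names, and the toy graph are assumptions for illustration and may differ from the procedure the authors used.

```python
def conductance(adj, S):
    """Edges leaving S divided by the smaller volume of S and the rest."""
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(adj[u]) for u in adj) - vol_S
    denom = min(vol_S, vol_rest)
    return cut / denom if denom > 0 else 1.0

def locally_minimal_seeds(adj):
    """Nodes whose ego network has conductance no higher than any neighbor's."""
    phi = {u: conductance(adj, {u} | set(adj[u])) for u in adj}
    return [u for u in adj if all(phi[u] <= phi[v] for v in adj[u])]

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
seeds = locally_minimal_seeds(adj)
```

Each seed's ego network could then initialize one column of M, and the number of distinct ego networks gives a first estimate of the number of communities.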

5 Experiments

The proposed PCDMS approach was implemented in Python in the Spyder environment. To assess the outcomes, we employed seven real-world datasets (Table 2) and sixteen synthetic networks (Table 5). The datasets also contain “ground-truth” community memberships of the nodes where available. On these datasets, the proposed method is compared with fundamental algorithms such as Louvain (Blondel et al., 2008), Leiden (Traag et al., 2019), Bigclam (Yang & Leskovec, 2013, 2012), CPM (Palla et al., 2005), Label propagation (Gregory, 2010), and SLPA (Xie et al., 2011). Table 1 lists these algorithms in brief.

Table 1 Summary of the employed methods


5.1 Evaluation metrics

The efficiency and accuracy of the community detection algorithms are assessed using three widely used measures. The F1Score and NMI are external measures that evaluate community accuracy by comparison with ground-truth communities (Fortunato & Hric, 2016), whereas modularity (Clauset et al., 2004) is an internal metric for evaluating community quality. The modularity metric originates from the Girvan-Newman method (Clauset et al., 2004) and is a well-known standard for assessing the density of edges within communities. The modularity value compares the fraction of edges that fall within communities with the fraction expected if edges were placed at random while preserving node degrees. The more edges fall inside communities and the closer the modularity score is to 1, the better the discovered communities. The F1Score is a well-known assessment metric in community detection that measures how often nodes are correctly assigned to each community relative to the available ground truth. NMI (normalized mutual information), the second external metric, measures the mutual information between the identified communities and the ground truth.
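For concreteness, the sketch below computes modularity and NMI for a small undirected, unweighted graph with non-overlapping communities. It is an illustration only: the function names and toy data are hypothetical, and the NMI normalization used here, 2I(A;B)/(H(A)+H(B)), is one common choice that may differ from the variant used in the experiments.

```python
import math
from collections import Counter

def modularity(edges, community):
    """Q = sum over communities of (internal edge fraction - (degree fraction)^2)."""
    m = len(edges)
    internal = Counter()  # edges with both endpoints in the same community
    degree = Counter()    # total degree of each community
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

def nmi(labels_a, labels_b):
    """Normalized mutual information, 2*I(A;B) / (H(A) + H(B))."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum(k / n * math.log(k / n) for k in counts.values())

    mutual = sum(k / n * math.log(k * n / (count_a[a] * count_b[b]))
                 for (a, b), k in joint.items())
    h_sum = entropy(count_a) + entropy(count_b)
    return 1.0 if h_sum == 0 else 2 * mutual / h_sum

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
q = modularity(edges, community)
score = nmi([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0])
```

For this graph the modularity is 5/14, and the NMI of the two labelings is 1 because they describe the same partition with swapped label names.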

5.2 Real-world datasets

Seven real-world datasets are used in the experiments. Zachary’s karate club network (Zachary, 1977) is the first dataset, containing 34 nodes, 78 edges, and 2 ground-truth communities. This dataset contains social ties among the members of a university karate club, collected by Wayne Zachary in 1977. The dolphin social network (Lusseau et al., 2003) is the second dataset, which contains 62 nodes, 159 edges, and two ground-truth communities; a link represents frequent association between two dolphins. The third dataset (Kunegis, 2013), with 105 nodes, 441 edges, and 3 ground-truth communities, is the network of books about US politics published around the time of the 2004 presidential election. Edges between books represent frequent co-purchasing of books by the same buyers. The fourth dataset is the American football network (Girvan & Newman, 2002), with 116 nodes, 613 edges, and 12 ground-truth communities. This network represents American football games between Division IA colleges during the fall of 2000. The fifth dataset is a large network generated using email data from a large European research institution (Leskovec et al., 2007; Yin et al., 2017). This network contains 1005 members of the institution as nodes and 25571 edges representing emails exchanged between members of the institution. The departments of the research institute serve as the nodes’ ground-truth community memberships; each individual belongs to exactly one of the 42 departments. The sixth dataset, known as Wiki-Vote, has 879 nodes and 2914 edges (Rossi & Ahmed, 2015). It contains voter data from a poll. The network’s nodes represent network users, and an edge from node i to node j indicates that user i voted for user j. The seventh dataset, Twitter, has 1536 nodes and 30596 edges. It is based on data from the social network Twitter (Kumar et al., 2014). The nodes in this graph represent social network users, and the edge that connects node i to node j represents tweets that node j retweeted. The Wiki-Vote and Twitter datasets do not contain any ground truth. The real-world datasets analyzed during the current study are shown in Table 2, where N is the number of nodes, E is the number of edges, and K is the number of ground-truth communities. These datasets are available in the network repository (Rossi & Ahmed, 2015), the KONECT project (Kunegis, 2013), and the Stanford Network Analysis Project (SNAP) (Leskovec et al., 2007).

Table 2 Details of the real-world datasets used


5.2.1 Experimental results on real-world datasets

To evaluate the effectiveness and accuracy of PCDMS in community detection, we compare the proposed model with four categories of community detection methods: modularity optimization, label propagation, probabilistic estimation, and clique percolation. Some of these techniques were briefly covered in the previous sections. Six algorithms are compared with the proposed method using internal metrics (modularity and number of communities) and external metrics (NMI and F1Score). For the internal metrics (maximum modularity and accuracy of the number of communities), the results in Table 3 show that our method is more accurate than the other methods.

Table 3 Experimental results on real-world networks by the modularity metric ( Q ) and community number ( CN )


Additionally, for the external evaluation criteria (NMI and F1Score), Figs. 2 and 3 show that, on the datasets that contain ground truth, our proposed method is moderately better than the modularity optimization and label propagation methods and clearly better than the probabilistic estimation and clique percolation methods.

5.3 Synthetic datasets

Synthetic networks are an appropriate way to assess community detection methods. There are different ways to create synthetic networks. The LFR benchmark (Lancichinetti et al., 2008) is one of the most well-known and commonly applied techniques. The LFR benchmark generates undirected and unweighted synthetic networks with ground-truth communities using the degree and community size distributions. We can configure network and community settings before simulating networks using LFR. The mixing parameter ($ \mu $) is one of the important parameters in LFR. It sets the fraction of each node’s edges that connect to nodes in other communities and thus regulates how strongly different communities interact. As shown in Table 5, a high mixing parameter value ($ \mu $) lowers the network’s degree of modularity ($ Q_{GT} $). Accordingly, based on the modularity metric and the mixing parameter, the LFR-generated datasets are divided into two categories: sparse and dense communities. The average degree is another significant parameter; increasing it leads to more community interaction. Table 4 displays the main features of the LFR synthetic datasets.
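To make the role of the mixing parameter concrete, the sketch below measures the empirical mixing of a given graph and partition as the average, over nodes, of the share of a node's edges that lead to other communities. The function name and toy graph are hypothetical, and this is a measurement on an existing network rather than an LFR generator.

```python
def mixing_parameter(edges, community):
    """Average over nodes of (edges to other communities) / (node degree)."""
    degree, external = {}, {}
    for u, v in edges:
        # Count the edge once from each endpoint's point of view.
        for a, b in ((u, v), (v, u)):
            degree[a] = degree.get(a, 0) + 1
            if community[a] != community[b]:
                external[a] = external.get(a, 0) + 1
    return sum(external.get(n, 0) / degree[n] for n in degree) / len(degree)

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
mu = mixing_parameter(edges, community)
```

Here only nodes 2 and 3 have an edge to the other community (one of their three edges each), so the empirical mixing is 1/9, which corresponds to well-separated communities.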

Table 5 reports the datasets created using the LFR approach that we employed.

Table 4 Parameters of LFR synthetic datasets (Lancichinetti et al., 2008)

Table 5 The details of the LFR synthetic network generated

5.3.1 Experimental results on synthetic datasets

We conducted experiments on LFR synthetic networks in addition to real-world networks. To demonstrate the impact of the optimized softmax loss function for probabilistic estimation on the community detection process, we compare the PCDMS technique with the well-known community detection methods in Table 1 using the modularity, F1Score, and NMI criteria. For this purpose, sixteen LFR synthetic networks are created according to the properties given in Table 4, with mixing parameters ($ \mu $) varying from 0.05 to 0.8, as shown in Table 5. The experimental results in Table 6 show that the communities are dense for small values of the mixing parameter (e.g. $ 0.05\le \mu \le 0.4$) and that the compared algorithms are nearly accurate in this case. As the value of the mixing parameter ($ \mu $) increases (e.g. $ 0.4 \le \mu \le 0.8$), the differences between the algorithms become clearer: the communities are sparse, and detecting them is challenging because the number of edges between communities rises.

Table 6 Experimental results on sixteen LFR synthetic networks by the modularity metric
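The link between inter-community edges and modularity can be reproduced on a toy graph. The sketch below uses the standard Newman-Girvan modularity for an undirected, unweighted graph with non-overlapping communities; the function name and the two toy graphs are hypothetical and are not the networks of Table 5.

```python
def modularity(edges, community):
    """Newman-Girvan modularity Q of a flat partition.

    Q = sum over communities c of  L_c / m - (d_c / (2 m))^2,
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the sum of the degrees of the nodes in c.
    """
    m = len(edges)
    internal = {}
    degree_sum = {}
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

# Two triangles joined by one bridge edge (few inter-community edges).
few_bridges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
# The same graph with two additional inter-community edges.
more_bridges = few_bridges + [(0, 4), (1, 5)]

q_few = modularity(few_bridges, community)
q_more = modularity(more_bridges, community)
print(q_few, q_more)
```

On this toy example the modularity of the same partition falls from 5/14 to 1/6 once two more edges cross the community boundary, which mirrors the trend reported for increasing $ \mu $.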

As seen in Figs. 4 and 5, the NMI and F1Score values of some algorithms drop to zero as the mixing parameter increases. The proposed method surpasses the majority of the widely used methods in the range $ 0.5 \le \mu \le 0.8 $.
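For readers unfamiliar with F1 scores on sets of communities, the sketch below shows one common definition: each ground-truth community is matched to its best-scoring detected community and vice versa, and the two averages are combined. This is an assumption for illustration; the exact F1Score definition used in the experiments is the one given in Section 5.1 and may differ. All names and the toy communities are hypothetical.

```python
def f1(reference, candidate):
    """F1 score of one candidate node set against one reference node set."""
    overlap = len(reference & candidate)
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def average_f1(truth, detected):
    """Symmetric best-match F1 between two lists of node sets.

    Communities are plain sets, so overlapping communities are allowed.
    """
    if not truth or not detected:
        return 0.0
    truth_side = sum(max(f1(t, d) for d in detected) for t in truth) / len(truth)
    detected_side = sum(max(f1(d, t) for t in truth) for d in detected) / len(detected)
    return 0.5 * (truth_side + detected_side)


# Hypothetical example: node 3 is assigned to the wrong community.
truth = [{0, 1, 2}, {3, 4, 5}]
detected = [{0, 1, 2, 3}, {4, 5}]

score = average_f1(truth, detected)
print(score)
```

A perfect detection gives 1.0, and a detection that shares no node with any ground-truth community gives 0.0.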

6 Conclusion

We present a motif-based probabilistic approach for community detection in complex networks. Recent community detection methods have paid little attention to the latent variables of probabilistic models because combining probabilistic methods with motif structure is difficult. In contrast, the proposed approach uses the relationship of at least two connected edges between three nodes (the triangular motif structure) and the intensity of each node’s membership in the communities to estimate the latent variable of the probabilistic model. We applied the well-known Block Coordinate Descent algorithm to minimize the negative likelihood function and extract the latent parameters of the model. Another factor that aids the investigation of newly discovered communities is the relationship between node membership in communities and edge density: three nodes are more likely to form a motif structure when they are observed in multiple communities. A further advantage of PCDMS is that it detects overlapping communities; according to the results, communities that overlap have a higher density of triangle motifs. To evaluate the effectiveness of the proposed method, we used seven real-world networks and sixteen synthetic networks. On real-world networks, PCDMS achieved satisfactory results and outperformed the other six approaches in terms of internal and external evaluation metrics. On synthetic networks, the proposed method outperforms the other methods on sparse datasets. Additionally, a study of execution time complexity shows that the proposed method performs better than the other methods. Future research can extend PCDMS in two directions. First, edge weights can be estimated using a probabilistic generative model with an additional latent parameter. Second, to provide a more accurate interpretation of the detected communities, the proposed method can be expanded by using node attributes in the network.
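The triangular motif that the model builds on can be enumerated directly from an edge list. The following sketch lists all triangles of a simple undirected graph and counts how many lie entirely inside each community, allowing nodes to belong to several communities. It only illustrates the motif itself and is not the PCDMS estimation procedure; the function names, the toy graph, and the membership dictionary are hypothetical.

```python
def triangles(edges):
    """Return the set of triangles (as frozensets of three nodes)."""
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops; a simple graph is assumed
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    found = set()
    for u, v in edges:
        if u == v:
            continue
        # Every common neighbor of u and v closes a triangle.
        for w in neighbors[u] & neighbors[v]:
            found.add(frozenset((u, v, w)))
    return found


def triangles_per_community(edges, memberships):
    """Count triangles whose three nodes all share a community.

    memberships: dict mapping each node to a set of community labels.
    """
    counts = {}
    for tri in triangles(edges):
        shared = set.intersection(*(memberships[node] for node in tri))
        for label in shared:
            counts[label] = counts.get(label, 0) + 1
    return counts


# Hypothetical toy graph; node 3 belongs to both communities.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3), (1, 3)]
toy_memberships = {0: {"A"}, 1: {"A"}, 2: {"A"}, 3: {"A", "B"}, 4: {"B"}, 5: {"B"}}

all_triangles = triangles(toy_edges)
per_community = triangles_per_community(toy_edges, toy_memberships)
print(len(all_triangles), per_community)
```

In this toy graph the overlapping node 3 takes part in triangles of both communities, which is the kind of structure the triangle-density argument above refers to.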

Availability of supporting data

The real-world datasets used in the current study are available from the Network Repository (https://networkrepository.com/), SNAP (https://snap.stanford.edu/), and the KONECT project (http://konect.cc/). The synthetic networks were generated with the LFR benchmark (https://github.com/eXascaleInfolab/LFR-Benchmark_UndirWeightOvp/).

Notes

https://networkrepository.com/
http://konect.cc/
https://snap.stanford.edu/

References

Adamcsek, B., Palla, G., Farkas, I. J., et al. (2006). CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics, 22(8), 1021–1023. https://doi.org/10.1093/bioinformatics/btl039
Arenas, A., Fernandez, A., Fortunato, S., et al. (2008). Motif-based communities in complex networks. Journal of Physics A: Mathematical and Theoretical, 41(22), 224001. https://doi.org/10.1088/1751-8113/41/22/224001
Attal, J.-P., Malek, M., & Zolghadri, M. (2021). Overlapping community detection using core label propagation algorithm and belonging functions. Applied Intelligence, 51(11), 8067–8087. https://doi.org/10.1007/s10489-021-02250-4
Berahmand, K., & Bouyer, A. (2019). A link-based similarity for improving community detection based on label propagation algorithm. Journal of Systems Science and Complexity, 32(3), 737–758. https://doi.org/10.1007/s11424-018-7270-1
Bloem, P., & de Rooij, S. (2020). Large-scale network motif analysis using compression. Data Mining and Knowledge Discovery, 34, 1421–1453. https://doi.org/10.1007/s10618-020-00691-y
Blondel, V. D., Guillaume, J.-L., Lambiotte, R., et al. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008. https://doi.org/10.1088/1742-5468/2008/10/p10008
Chen, X., & Li, J. (2019). Community detection in complex networks using edge-deleting with restrictions. Physica A: Statistical Mechanics and its Applications, 519, 181–194. https://doi.org/10.1016/j.physa.2018.12.023
Chen, L., Zheng, H., Li, Y., et al. (2022). Enhanced density peak-based community detection algorithm. Journal of Intelligent Information Systems, 59, 263–284. https://doi.org/10.1007/s10844-022-00702-y
Clauset, A., Newman, M. E., & Moore, C. (2004). Finding community structure in very large networks. Physical Review E, 70(6), 066111. https://doi.org/10.1103/physreve.70.066111
Fortunato, S., & Hric, D. (2016). Community detection in networks: A user guide. Physics Reports, 659, 1–44. https://doi.org/10.1016/j.physrep.2016.09.002
Fu, S., Wang, G., Xu, J., et al. (2022). IbLT: An effective granular computing framework for hierarchical community detection. Journal of Intelligent Information Systems, 58, 175–196. https://doi.org/10.1007/s10844-021-00668-3
Girvan, M., & Newman, M. E. (2002). Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12), 7821–7826. https://doi.org/10.1073/pnas.122653799
Gleich, D. F., & Seshadhri, C. (2012). Vertex neighborhoods, low conductance cuts, and good seeds for local community methods. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 597–605). https://doi.org/10.1145/2339530.2339628
Gregory, S. (2010). Finding overlapping communities in networks by label propagation. New Journal of Physics, 12(10), 103018. https://doi.org/10.1088/1367-2630/12/10/103018
Guo, K., Huang, X., Wu, L., et al. (2022). Local community detection algorithm based on local modularity density. Applied Intelligence, 52(2), 1238–1253. https://doi.org/10.1007/s10489-020-02052-0
Hajibabaei, H., Seydi, V., & Koochari, A. (2023). Community detection in weighted networks using probabilistic generative model. Journal of Intelligent Information Systems, 60, 119–136. https://doi.org/10.1007/s10844-022-00740-6
Hsieh, C.-J., & Dhillon, I. S. (2011). Fast coordinate descent methods with variable selection for non-negative matrix factorization. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1064–1072). https://doi.org/10.1145/2020408.2020577
Huang, L., Chao, H.-Y., & Xie, Q. (2020). MuMod: A micro-unit connection approach for hybrid-order community detection. In Proceedings of the AAAI Conference on Artificial Intelligence (pp. 107–114). https://doi.org/10.1609/aaai.v34i01.5340
Javed, M. A., Younis, M. S., Latif, S., et al. (2018). Community detection in networks: A multidisciplinary review. Journal of Network and Computer Applications, 108, 87–111. https://doi.org/10.1016/j.jnca.2018.02.011
Kumar, S., Morstatter, F., & Liu, H. (2014). Twitter data analytics. New York: Springer. https://doi.org/10.1007/978-1-4614-9372-3_4
Kumar, S., Panda, B., & Aggarwal, D. (2021). Community detection in complex networks using network embedding and gravitational search algorithm. Journal of Intelligent Information Systems, 57, 51–72. https://doi.org/10.1007/s10844-020-00625-6
Kunegis, J. (2013). KONECT: The Koblenz network collection. In Proceedings of the 22nd International Conference on World Wide Web (pp. 1343–1350). https://doi.org/10.1145/2487788.2488173
Lancichinetti, A., Fortunato, S., & Radicchi, F. (2008). Benchmark graphs for testing community detection algorithms. Physical Review E, 78(4), 046110. https://doi.org/10.1103/physreve.78.046110
Le, B.-D., Shen, H., Nguyen, H., et al. (2019). Improved network community detection using meta-heuristic based label propagation. Applied Intelligence, 49(4), 1451–1466. https://doi.org/10.1007/s10489-018-1321-0
Leskovec, J., Kleinberg, J., & Faloutsos, C. (2007). Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1), 2-es. https://doi.org/10.1145/1217299.1217301
Li, C., Chen, H., Li, T., et al. (2022). A stable community detection approach for complex network based on density peak clustering and label propagation. Applied Intelligence, 52(2), 1188–1208. https://doi.org/10.1007/s10489-021-02287-5
Lin, C.-J. (2007). Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10), 2756–2779. https://doi.org/10.1162/neco.2007.19.10.2756
Li, C., Tang, Y., Tang, Z., et al. (2022). Motif-based embedding label propagation algorithm for community detection. International Journal of Intelligent Systems, 37(3), 1880–1902. https://doi.org/10.1002/int.22759
Liu, F., Choi, D., Xie, L., et al. (2018). Global spectral clustering in dynamic networks. Proceedings of the National Academy of Sciences, 115(5), 927–932. https://doi.org/10.1073/pnas.1718449115
Lusseau, D., Schneider, K., Boisseau, O. J., et al. (2003). The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behavioral Ecology and Sociobiology, 54(4), 396–405. https://doi.org/10.1007/s00265-003-0651-y
Lyu, C., Shi, Y., & Sun, L. (2019). A novel local community detection method using evolutionary computation. IEEE Transactions on Cybernetics, 51(6), 3348–3360. https://doi.org/10.1109/tcyb.2019.2933041
Ma, T., Wang, Y., Tang, M., et al. (2016). LED: A fast overlapping communities detection algorithm based on structural clustering. Neurocomputing, 207, 488–500. https://doi.org/10.1016/j.neucom.2016.05.020
Milo, R., Shen-Orr, S., Itzkovitz, S., et al. (2002). Network motifs: simple building blocks of complex networks. Science, 298(5594), 824–827. https://doi.org/10.1126/science.298.5594.824
Newman, M. E., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical review E, 69(2), 026113. https://doi.org/10.1103/physreve.69.026113
Palla, G., Derényi, I., Farkas, I., et al. (2005). Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435(7043), 814–818. https://doi.org/10.1038/nature03607
Raghavan, U. N., Albert, R., & Kumara, S. (2007). Near linear time algorithm to detect community structures in large-scale networks. Physical review E, 76(3), 036106. https://doi.org/10.1103/physreve.76.036106
Rossi, R., & Ahmed, N. (2015). The network data repository with interactive graph analytics and visualization. Proceedings of the AAAI Conference on Artificial Intelligence, 29, 152–196. https://doi.org/10.1609/aaai.v29i1.9277
Singhal, A., Cao, S., Churas, C., et al. (2020). Multiscale community detection in Cytoscape. PLOS Computational Biology, 16(10), e1008239. https://doi.org/10.1371/journal.pcbi.1008239
Traag, V. A., Waltman, L., & Van Eck, N. J. (2019). From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9(1), 1–12. https://doi.org/10.1038/s41598-019-41695-z
Tsourakakis, C. E., Pachocki, J., & Mitzenmacher, M. (2017). Scalable motif-aware graph clustering. In Proceedings of the 26th International Conference on World Wide Web (pp. 1451–1460). https://doi.org/10.1145/3038912.3052653
Wang, T.-S., Lin, H.-T., & Wang, P. (2017). Weighted-spectral clustering algorithm for detecting community structures in complex networks. Artificial Intelligence Review, 47(4), 463–483. https://doi.org/10.1007/s10462-016-9488-4
Wu, W., Kwong, S., Zhou, Y., et al. (2018). Nonnegative matrix factorization with mixed hypergraph regularization for community detection. Information Sciences, 435, 263–281. https://doi.org/10.1016/j.ins.2018.01.008
Xie, J., Szymanski, B. K., & Liu, X. (2011). SLPA: Uncovering overlapping communities in social networks via a speaker-listener interaction dynamic process. In 2011 IEEE 11th International Conference on Data Mining Workshops (pp. 344–349). https://doi.org/10.1109/icdmw.2011.154
Xu, Y., & Yin, W. (2013). A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3), 1758–1789. https://doi.org/10.1137/120887795
Yang, J., & Leskovec, J. (2012). Community-affiliation graph model for overlapping network community detection. In 2012 IEEE 12th International Conference on Data Mining (pp. 1170–1175). https://doi.org/10.1109/icdm.2012.139
Yang, J., & Leskovec, J. (2013). Overlapping community detection at scale: A nonnegative matrix factorization approach. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (pp. 587–596). https://doi.org/10.1145/2433396.2433471
Yang, J., McAuley, J., & Leskovec, J. (2013). Community detection in networks with node attributes. In 2013 IEEE 13th International Conference on Data Mining (pp. 1151–1156). https://doi.org/10.1109/icdm.2013.167
Yang, K., Guo, Q., & Liu, J.-G. (2018). Community detection via measuring the strength between nodes for dynamic networks. Physica A: Statistical Mechanics and Its Applications, 509, 256–264. https://doi.org/10.1016/j.physa.2018.06.038
Yin, H., Benson, A. R., Leskovec, J., et al. (2017). Local higher-order graph clustering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 555–564). https://doi.org/10.1145/3097983.3098069
Yu, W., Wang, W., Jiao, P., et al. (2019). Evolutionary clustering via graph regularized nonnegative matrix factorization for exploring temporal networks. Knowledge-Based Systems, 167, 1–10. https://doi.org/10.1016/j.knosys.2019.01.024
Zachary, W. W. (1977). An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33(4), 452–473. https://doi.org/10.1086/jar.33.4.3629752
Zarandi, F. D., & Rafsanjani, M. K. (2018). Community detection in complex networks using structural similarity. Physica A: Statistical Mechanics and its Applications, 503, 882–891. https://doi.org/10.1016/j.physa.2018.02.212
Zhou, W., Wang, X., Zhang, C., et al. (2019). Community detection by enhancing community structure in bipartite networks. Modern Physics Letters B, 33(07), 1950076. https://doi.org/10.1142/s0217984919500763
Zhou, X., Yang, K., Xie, Y., et al. (2019). A novel modularity-based discrete state transition algorithm for community detection in networks. Neurocomputing, 334, 89–99. https://doi.org/10.1016/j.neucom.2019.01.009

Acknowledgements

We reiterate that this article has not been published anywhere and is not sponsored by any particular organization.

Funding

This article is not financially supported (Not Applicable).

Author information

Authors and Affiliations

Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
Hossein Hajibabaei, Vahid Seydi & Abbas Koochari
Centre for Applied Marine Sciences, School of Ocean Sciences, Bangor University, Menai Bridge, UK
Vahid Seydi

Contributions

Hossein Hajibabaei: methodology, writing (original draft), software. Vahid Seydi: methodology, conceptualization. Abbas Koochari: investigation, supervision.

Corresponding author

Correspondence to Vahid Seydi.

Ethics declarations

Ethics approval

Our study did not include any human and/or animal studies. All datasets used in the paper are publicly available for research purposes.

Competing interests

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Hajibabaei, H., Seydi, V. & Koochari, A. A motif-based probabilistic approach for community detection in complex networks. J Intell Inf Syst (2024). https://doi.org/10.1007/s10844-024-00850-3

Received: 10 September 2023
Revised: 16 February 2024
Accepted: 16 February 2024
Published: 16 March 2024
DOI: https://doi.org/10.1007/s10844-024-00850-3
