1 Introduction

One of the fundamental characteristics of a complex network is the concept of community, which can be described as a collection of nodes that are more densely connected to one another than to the rest of the network. Community detection is a class of pattern recognition methods that assign network nodes to groups, or communities, based on the network’s structural organization (Singhal et al., 2020). Communities can reflect a style of thinking, a category, an interest, or a topic orientation, among other things. Communities can be disjoint or can share nodes, in which case they are referred to as overlapping communities (Hajibabaei et al., 2023). Community detection belongs to the clustering family of machine learning methods and can be applied in a wide range of data science areas, including text categorization, traffic network optimization, and social network analysis. Its purpose is to divide a network’s nodes into multiple communities so that nodes in the same community are densely connected or have comparable node properties (Wu et al., 2018). Detecting communities in order to uncover the hidden structures of a complex network, which are often groups of densely linked nodes, is a crucial issue in dynamical network analysis (Fortunato & Hric, 2016).

Community detection algorithms can be divided into categories such as weighted (Hajibabaei et al., 2023; Wang et al., 2017) and unweighted (Chen & Li, 2019; Zarandi & Rafsanjani, 2018), directed (Le et al., 2019) and undirected (Lyu et al., 2019), global (Zhou et al., 2019a) and local (Lyu et al., 2019), overlapping (Yang & Leskovec, 2013; Ma et al., 2016) and nonoverlapping (Liu et al., 2018).

Based on these categories, different algorithms have been developed for community detection. Examples include modularity-based methods (Zhou et al., 2019b; Newman et al., 2004; Girvan & Newman, 2002), label propagation methods (Le et al., 2019; Raghavan et al., 2007; Gregory, 2010; Berahmand & Bouyer, 2019), model-based methods (Hajibabaei et al., 2023; Yang & Leskovec, 2013, 2012), clique percolation methods (Palla et al., 2005), and network embedding methods (Kumar et al., 2021). A review of these approaches and of the research in this field suggests that a simple analysis of node properties does not provide the necessary accuracy for community detection in complex networks; rather, closer consideration of the networks’ specifics and the use of the graphs’ intrinsic traits, such as motif structure, yields better results.

The proposed method uses a probabilistic model to detect communities in networks. We extend probabilistic model-based approaches from edge generation to motif generation. Small, linked sub-networks known as “motifs” are frequently found in complex networks. We demonstrate that the more communities three nodes share, the higher the probability that a triangular motif exists between them. In other words, we use the triangular motif to find the probabilistic model’s hidden parameter and detect the community. For the probabilistic motif generator’s function, we define the triangular motif estimator function as a softmax loss function over one node and two of that node’s neighbors. We also study the influence of community overlapping on the generation of motifs.

2 Related works

The issue of community detection in complex networks has received a lot of attention due to the development of these networks, particularly social networks. Numerous studies have been conducted on various facets of community detection over the years. Traditional methods are the first approach for community detection and refer to some clustering-based algorithms. These methods introduced key ideas in community detection and paved the way for future improvements. Some traditional algorithms are as follows: hierarchical clustering, spectral clustering, partitional clustering, and graph partitioning (Javed et al., 2018). Girvan and Newman discussed the subject of community detection in 2002. They characterized community structure as the property that “network nodes are tied together in tightly knit groupings, between which there are only looser connections” (Girvan & Newman, 2002). Since then, other community detection techniques have been proposed by researchers from various fields. Nowadays, this approach is less used for large-scale social networks with millions of nodes. The main contributions to community detection algorithms are based on the overlapping and connectivity structures of the networks. In the majority of real-life complex networks, there is a community based on the network’s structures. Numerous community detection algorithms have been created to date because identifying community structure significantly contributes to our understanding of how complex networks are structured. In this section, we review a number of these studies that are related to the concepts of the proposed methodology.

2.1 Label propagation approaches

The label propagation algorithm (LPA) is a fast and convenient community detection algorithm that was first presented in (Raghavan et al., 2007). It is highly regarded for its straightforward structure and low time complexity, but it has some drawbacks, including the randomness of node selection and label updating. In the LPA method, a node is randomly chosen and its label is updated with the most common label among its neighbors through an iterative process (Li et al., 2022). The Speaker-listener Label Propagation Algorithm (SLPA) (Xie et al., 2011) and COPRA (Gregory, 2010) were developed to address the shortcomings of the label propagation process. SLPA consists of three stages: (1) initialization, (2) evolution, and (3) post-processing. In both algorithms, each node has just one label, which is repeatedly updated using the most frequent label in its neighborhood. The method converges, allowing for the identification of distinct communities. The current state-of-the-art label propagation-based algorithm (LPAm+) detects communities using a two-stage iterative procedure. The first stage assigns labels, or community memberships, to nodes using label propagation to maximize modularity, a well-known quality function for evaluating the goodness of a community division. The second stage merges smaller communities to further improve modularity (Le et al., 2019).
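The basic LPA update described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of Raghavan et al.; the function name, the adjacency-dictionary representation, and the tie-breaking rule (a node keeps its label when that label is already among the most frequent ones) are assumptions made for the example.

```python
import random
from collections import Counter


def label_propagation(adj, max_iter=100, seed=0):
    """Asynchronous label propagation on an undirected graph.

    adj maps each node to the set of its neighbors (assumed representation).
    Returns a dict node -> community label.
    """
    rng = random.Random(seed)
    labels = {u: u for u in adj}      # every node starts with a unique label
    nodes = list(adj)
    for _ in range(max_iter):
        rng.shuffle(nodes)            # random visiting order in each sweep
        changed = False
        for u in nodes:
            if not adj[u]:
                continue              # isolated nodes keep their own label
            counts = Counter(labels[v] for v in adj[u])
            best = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == best)
            if labels[u] in candidates:
                continue              # assumed tie rule: keep the current label
            labels[u] = rng.choice(candidates)
            changed = True
        if not changed:               # no label changed in a full sweep
            break
    return labels
```

On a graph made of two disconnected triangles, the sketch assigns one label per triangle, because labels can only spread along edges.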

2.2 Modularity optimization approaches

Modularity-based community detection algorithms are widely studied and applied because of their concise strategies and prominent effects. However, they also face challenges, such as sensitivity to seed node selection and unstable communities (Guo et al., 2022). Among these approaches, the Louvain algorithm (Blondel et al., 2008) is well known and frequently applied to modularity-based network analysis. This method offers a straightforward and quick way of finding distinct communities in complex weighted and unweighted networks and maximizes modularity by clustering graph nodes greedily. The Leiden algorithm (Traag et al., 2019) tries to correct some problems in the Louvain algorithm. In Leiden, the goal is to change the community established throughout the iteration cycle while simultaneously speeding up local movement and transferring nodes to arbitrary neighbors. The distance between the grains and the calculated density in Euclidean space are two requirements for the leading tree, an effective granular computing (GrC) model for hierarchical clustering (Fu et al., 2022). In (Chen et al., 2022), an enhanced density peak-based community detection algorithm called DPCD is proposed. In DPCD, a novel local density suitable for complex networks is first defined to consider the node distribution and network structure jointly. Second, based on the node density and network structure, a density-connected tree is constructed to measure the density following the distance of each node. Finally, an improved density peak model is constructed to quickly and accurately cluster complex networks.

2.3 The probabilistic model estimation approach

In contrast to the methods described above, which employ traditional models, probabilistic methods detect communities by estimating a generative model. This approach creates an abstract generative model from the network graph and estimates the model parameters. The AGM (Affiliation Graph Model) (Yang & Leskovec, 2013, 2012) can be read as a nonnegative affiliation matrix or interpreted as a bipartite network connecting vertices and communities. It is based on the idea that shared social affiliations give rise to communities. The methods in (Yang & Leskovec, 2013, 2012; Yang et al., 2013) use a matrix factorization-based model to estimate a probability distribution function, with the intensity of node affiliation to communities as a parameter of the probabilistic model. A model based on matrix factorization is presented by (Yang et al., 2018), allowing edges to be removed or added easily. Also, the community detection problem is presented as a non-negative matrix factorization model in (Yu et al., 2019), where the dynamics of the network structure are managed by a transfer matrix.
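To make the affiliation idea concrete, the sketch below evaluates the edge probability used in the BigCLAM formulation of the AGM (Yang & Leskovec, 2013), P(u, v) = 1 - exp(-F_u · F_v), where F_u is the nonnegative affiliation vector of node u. The affiliation values and node names are hypothetical and chosen only for illustration.

```python
import math


def edge_probability(F_u, F_v):
    """P(u, v) = 1 - exp(-F_u . F_v) for nonnegative affiliation vectors."""
    dot = sum(a * b for a, b in zip(F_u, F_v))
    return 1.0 - math.exp(-dot)


# Hypothetical affiliation strengths of four nodes to two communities.
F = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [1.0, 1.0]}

p_none = edge_probability(F["a"], F["b"])  # the nodes share no community
p_one = edge_probability(F["a"], F["c"])   # the nodes share one community
p_two = edge_probability(F["c"], F["d"])   # the nodes share two communities
```

In this toy setting the probability grows with the number of shared communities (p_none < p_one < p_two), which is the edge-level behavior that the proposed model carries over to motifs.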

2.4 Clique percolation and motif-based approaches

Cliques are one of the fundamental concepts in graph theory and are used to detect communities in graphs. In an undirected graph G, a clique is a subset of vertices in which every two distinct vertices are connected to one another, so that the induced subgraph is complete (Palla et al., 2005). The idea behind the clique percolation method (CPM) is that a community is made up of overlapping collections of fully connected subgraphs (Attal et al., 2021). The process of community detection thus involves searching for adjacent cliques. The first step is to identify every clique of size k. Then, a new graph is built with each node standing in for one of these k-cliques. Two such nodes are adjacent if their k-cliques share k-1 vertices, and each connected component of this graph yields one community. CPM works well in networks with many closely connected components. Based on the clique percolation method’s search for local patterns, Palla et al. (2005) proposed one of the first overlapping community detection algorithms. CFinder (Adamcsek et al., 2006) is an implementation of CPM. The algorithm’s complexity is polynomial. For large networks, the algorithm does not, however, reach its limit.
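The percolation steps described above can be sketched as follows. This is a brute-force illustration intended for toy graphs only, not the CFinder implementation; the helper names are hypothetical, and the adjacency rule (two k-cliques are adjacent when they share k-1 nodes) follows the standard definition of the method.

```python
from itertools import combinations


def build_adj(edges):
    """Build an undirected adjacency dict {node: set(neighbors)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def k_clique_communities(adj, k=3):
    """Return overlapping communities as unions of adjacent k-cliques."""
    # Step 1: enumerate every k-clique by brute force (exponential, toy graphs only).
    cliques = [frozenset(c) for c in combinations(sorted(adj), k)
               if all(v in adj[u] for u, v in combinations(c, 2))]

    # Step 2: union-find over cliques; two cliques are adjacent if they share k-1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    # Step 3: each connected component of the clique graph is one community.
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(groups.values(), key=lambda s: sorted(s))
```

On a graph where a 4-clique on nodes 0 to 3 touches a chain of triangles on nodes 3 to 6, the sketch returns two communities that overlap in node 3.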

One of the concepts close to clique is the motif. Motifs are small interconnected sub-networks observed in complex networks with high frequency, interpreted as simple and fundamental components of the network (Bloem & de Rooij, 2020). Motif search has been widely studied in various fields of network analysis. The main problem of recognizing motifs in large networks is the exponential growth of the number of possible subgraphs in the network with increasing motif size (Milo et al., 2002).

Motifs are useful for understanding structure and detecting communities in complex networks. A community with a high edge density will have correlations between nodes that extend beyond their immediate neighbors, as shown by the presence of motifs. Empirical studies show that similar nodes in a community have similar motifs (Arenas et al., 2008). Therefore, using motifs with high-density connections can be an important strategy to help discover the communities and analyze the network more precisely (Li et al., 2022). Though a few motif-based community detection algorithms have been put forth (Arenas et al., 2008; Tsourakakis et al., 2017; Huang et al., 2020), they typically struggle with high computational complexity when applied to large-scale networks. Integrating lower-order and higher-order structural data effectively and efficiently into a single framework for community detection is still a challenge.

Fig. 1 Demonstration of two types of three-node motifs that we use in the proposed model: (a) the 3-clique or closed triangle motif (denoted as M(3,3)-motif) with 3 nodes and 3 edges; (b) the opened triangle motif (denoted as M(3,2)-motif) with 3 nodes and 2 edges
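As an illustration of the two motif types, the following sketch counts closed triangles, M(3,3), and opened triangles, M(3,2), in a small undirected graph. It assumes that an opened triangle is a node with two neighbors that are not connected to each other; the function name and the adjacency-dictionary representation are hypothetical.

```python
from itertools import combinations


def count_three_node_motifs(adj):
    """Return (closed, opened): the numbers of M(3,3) and M(3,2) motifs.

    adj maps each node to the set of its neighbors (undirected graph).
    """
    closed_corners = 0   # every closed triangle is seen once from each of its 3 corners
    opened = 0           # every opened triangle is seen once, from its center node
    for u in adj:
        for v1, v2 in combinations(sorted(adj[u]), 2):
            if v2 in adj[v1]:
                closed_corners += 1
            else:
                opened += 1
    return closed_corners // 3, opened
```

For a triangle on nodes 0, 1, 2 with one extra node 3 attached to node 2, the sketch reports one closed triangle and two opened triangles, both centered at node 2.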

3 Proposed model frameworks

In this paper, we present PCDMS (Probabilistic Community Detection with Motif Structure), a probabilistic community detection method for complex networks that makes use of the affiliation graph model and the triangular motif to identify community structures. The fundamental tenet of the proposed approach is that detecting a strong community requires consideration of the structural model and the relationship types of the nodes. In their study on the relationship between 2-clique (edge) probability and community overlapping, Yang and Leskovec (Yang & Leskovec, 2013) found that the more communities two nodes share, the more likely they are to be connected. In this article, we investigate how community overlapping affects the formation of the 3-clique and triangle motifs. We demonstrate that the more communities three nodes share, the higher the probability that a triangular motif exists between them. Such a finding is consistent with the fundamental tenet that vertices located in the overlaps of communities are more densely connected than vertices within a single community. Thus, we extend the capability of the AGM (Yang & Leskovec, 2013, 2012) to generate triangle motifs by using an optimized softmax loss function for probabilistic estimation. Fig. 1 presents the two types of three-node motifs. Triangle motifs can be given different interpretations in different networks according to the networks’ characteristics. In complex networks (Milo et al., 2002), the most widely studied motifs are the 3-node motif, i.e., the (3,e)-motif, and the 4-node motif, i.e., the (4,e)-motif. Due to the complexity of probabilistic calculations for the (4,e)-motif, the presented probabilistic model is based on the (3,e)-motif. However, the proposed method does not seek to discover or interpret the content of the triangular motif but rather uses the motif to construct the hidden parameter of the probabilistic model and to detect the community.

In contrast to other community detection methods, the PCDMS model takes into account properties that were less used in previous methods, such as:

  • Using a probabilistic method to estimate the triangle motif

  • Conceptually connecting community detection with the probability of the presence or absence of a triangular motif

  • Using evolutionary methods and maximum likelihood estimation in calculations

The components of the proposed model are explained in more detail below. The PCDMS model is defined on a network G(V, E), where V and E denote the sets of nodes and edges. We represent the affiliation strength of node u to community c as a nonnegative value \( M_{uc} \), where \( M_{uc} = 0 \) indicates that u is not a member of c. The matrix M thus records the degree of dependence between each node and each community.

In PCDMS, the value of M determines the likelihood that a triangular motif between three nodes u, \( v_1 \), and \( v_2 \) occurs in a community c. Motifs are generated independently within each community c. In particular, we assume that three nodes u, \( v_1 \), and \( v_2 \) form a triangular motif with the probability given below. We define the triangular motif estimator function \( P_c(u, v_1, v_2) \) by using the softmax loss function over one node and two neighbors of that node as the probabilistic motif generator’s function, that is,

$$\begin{aligned} P_c(u,v_1,v_2) =\underset{(u,v_1) \in E}{P_c(u,v_1)}\cdot \underset{(u,v_2) \in E}{P_c(u,v_2)}= \nonumber \\ \frac{\exp (-M_{uc}.M^T_{v_1c})}{\sum _{v_i \in N(u)}{\exp (-M_{uc}.M^T_{v_ic})}} \cdot \frac{\exp (-M_{uc}.M^T_{v_2c})}{\sum _{v_i \in N(u)}{\exp (-M_{uc}.M^T_{v_ic})}} \end{aligned}$$
(1)
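A direct numerical reading of Eq. (1) is sketched below: each factor is a softmax over the neighbors of u with scores exp(-M_uc M_vc), and the motif probability is the product of the two factors. The data layout (a dictionary of affiliation lists and an adjacency dictionary) and the function names are assumptions made for the example.

```python
import math


def pair_probability(M, adj, u, v, c):
    """One softmax factor of Eq. (1) for community c; v must be a neighbor of u."""
    scores = {w: math.exp(-M[u][c] * M[w][c]) for w in adj[u]}
    return scores[v] / sum(scores.values())


def motif_probability(M, adj, u, v1, v2, c):
    """P_c(u, v1, v2) as the product of the two pair factors in Eq. (1)."""
    return pair_probability(M, adj, u, v1, c) * pair_probability(M, adj, u, v2, c)
```

Because each factor is normalized over N(u), the pair probabilities of all neighbors of u sum to 1; when all affiliation strengths are zero, every neighbor receives probability 1/|N(u)|.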

In (2), N(u) is the set of neighbors of node u. In the generative probabilistic procedure for the two pairs of nodes in a triangular motif, each pair of nodes is generated independently according to a Bernoulli distribution. Therefore, using conditional independence, the following relationship is established for the probability of the existence of a triangular motif:

$$\begin{aligned} P_c(u,v_1,v_2) =\underset{v_1 \in N(u)}{P_c(u,v_1)}\cdot \underset{v_2 \in N(u)}{P_c(u,v_2)}\ \end{aligned}$$
(2)

The proposed framework, which is a probabilistic generative model, is predicated on the following premises:

  • In a community, a triangle motif can exist between two pairs of nodes (one node and two neighbors of that node).

  • The probability of the existence of a triangle motif increases when the two pairs of nodes are observed in multiple communities.

  • Communities can overlap; communities that overlap have a higher density of triangle motifs.

4 Community detection by PCDMS model

We describe the components of the PCDMS model before demonstrating how to use it for community detection in networks. In (3), l(M) is the logarithm of the likelihood of the existence of a motif in graph G. Also in (4), N(u) is a set of neighbors of node u. By minimizing the negative likelihood, we can estimate the ideal M as follows:

$$\begin{aligned} l(M) = \log {P(G \mid {M})} \end{aligned}$$
(3)
$$\begin{aligned} M = \underset{M \ge 0}{\arg \min }\ L(M) = \nonumber \\ \underset{M}{\arg \min }\ -\prod _{(v_1,v_2) \in N(u)} P(u,v_1,v_2) \prod _{(v_1,v_2) \notin N(u)} (1-P(u,v_1,v_2))= \nonumber \\ \underset{M}{\arg \min }\ -\Big [\prod _{(u,v_1) \in E} P(u,v_1) \prod _{(u,v_1) \notin E} (1-P(u,v_1))\Big ] \cdot \nonumber \\ \Big [\prod _{(u,v_2) \in E} P(u,v_2) \prod _{(u,v_2) \notin E} (1-P(u,v_2))\Big ] \end{aligned}$$
(4)

In (5), the degree of belonging of a node to a community (\( M_{uc} \)) is estimated. After inserting (2) into (4), we take the natural logarithm of both sides to turn the products into sums and to simplify further calculations.

$$\begin{aligned} L(M) = -[\ln ({\exp (-M_{uc}.M^T_{v_1c})})-{\sum _{v_i \in N(u)}{\exp (-M_{uc}.M^T_{v_ic})}}] - \nonumber \\ [\ln ({\exp (-M_{uc}.M^T_{v_2c})})-{\sum _{v_i \in N(u)}{\exp (-M_{uc}.M^T_{v_ic})}}] = \nonumber \\ M_{uc}.M^T_{v_1c} + M_{uc}.M^T_{v_2c} + 2\sum _{v_i \in N(u)}{\exp (-M_{uc}.M^T_{v_ic})} \end{aligned}$$
(5)

4.1 Updating the parameter

The latent variable M is contained in the negative non-linear likelihood function of (4), which cannot be minimized directly using well-known optimization methods. To overcome the difficulty of solving optimization problems with latent variables in machine learning, we use the Block Coordinate Descent technique (Xu & Yin, 2013) to solve the objective function in (4). We update \( M_u \) for each node u while keeping the affiliation vectors of its neighbors (\( M_v \)) fixed, which amounts to solving the following subproblem:

$$\begin{aligned} L(M_u) = M_{u}.M^T_{v_1} + M_{u}.M^T_{v_2} + 2\sum _{v_i \in N(u)}{\exp (-M_{u}.M^T_{v_i})} \end{aligned}$$
(6)

To estimate the minimum negative likelihood (i.e., the minimum point of the curve), we must look for a point where the slope is 0. Therefore, it is necessary to compute the partial derivative of the log-likelihood function (5) with respect to \( M_u \).

$$\begin{aligned} \frac{\partial l(M_u)}{\partial M_u} = M_{v_1} + M_{v_2} + \frac{\sum _{v_i \in N(u)}{-M_{v_i}\exp (-M_{u}.M_{v_i})}}{\sum _{v_i \in N(u)}{\exp (-M_{u}.M_{v_i})}} \end{aligned}$$
(7)

Eventually, the \( M_u \) values are updated by the gradient descent method (Hsieh & Dhillon, 2011; Lin, 2007). Since a node’s belonging strength to a community cannot be negative, any negative value produced by the update is replaced with 0.

$$\begin{aligned} M_u(t+1) = \max (0,M_u(t)-\eta (\frac{\partial l(M_u)}{\partial M_u})) \end{aligned}$$
(8)
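The update in (7) and (8) can be sketched as a single projected gradient step for one node u and one pair of neighbors. The code evaluates the gradient exactly as it is printed in (7); the learning rate value, the data layout, and the function names are assumptions made for the example, and how neighbor pairs are selected in the full method is not specified here.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gradient_eq7(M, adj, u, v1, v2):
    """Gradient of Eq. (7): M_v1 + M_v2 + sum_i(-M_vi * w_i) / sum_i(w_i),
    with w_i = exp(-M_u . M_vi) taken over the neighbors v_i of u."""
    weights = {w: math.exp(-dot(M[u], M[w])) for w in adj[u]}
    total = sum(weights.values())
    grad = []
    for c in range(len(M[u])):
        softmax_term = sum(-M[w][c] * weights[w] for w in adj[u]) / total
        grad.append(M[v1][c] + M[v2][c] + softmax_term)
    return grad


def projected_update_eq8(M, adj, u, v1, v2, eta=0.1):
    """Eq. (8): gradient step followed by clipping negative entries to 0.
    eta = 0.1 is an assumed learning rate."""
    grad = gradient_eq7(M, adj, u, v1, v2)
    return [max(0.0, m - eta * g) for m, g in zip(M[u], grad)]
```

The max(0, ·) projection keeps every entry of \( M_u \) nonnegative, so a small positive strength that the gradient step would push below zero is set to exactly 0.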

where \( \eta \) is a learning rate parameter. The updating process is repeated until the difference between the value from the previous step and the current value is smaller than the desired threshold.

4.2 PCDMS algorithm

The proposed PCDMS model (Probabilistic Community Detection with Motif Structure) is shown in Algorithm 1. The inputs to the method are a graph (G) and the number of communities (k). The output is a matrix that shows how strongly each node belongs to each community (\( M_{uc} \)). The probability that a motif structure exists between two pairs of nodes grows when they are observed together in several communities. After the latent variable of the model (M) is initialized (initialization is covered in Section 4.4), the algorithm enters an iterative loop. The iterations end when the difference between \( M_u(t+1) \) and \( M_u(t) \) is less than a predetermined threshold (here, the stop threshold is 0.005). This iterative method computes the likelihood function of the probabilistic generative model (\( L(M_u) \)) to estimate the model’s unknown parameter in the graph. For each node u, the logarithm of the likelihood function is differentiated, \( D(L(M_u)) \), to bring it as close as possible to its minimal value (where the slope is 0). Due to the complexity of the calculations, we chose the gradient descent method (Xu & Yin, 2013; Hsieh & Dhillon, 2011) to minimize the negative likelihood. At each iteration of the algorithm, this approach is used to update the latent variable of the model (\( M_u \)). Once the M value has converged, the contribution strength of each node to each community is determined. Comparing this value to an experimental threshold (e.g., the median of M values) classifies each node as belonging or not belonging to each community, which yields the model’s output.

Algorithm 1 Probabilistic Community Detection with Motif Structure (PCDMS)
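The control flow described for Algorithm 1 (row-wise updates, the 0.005 stopping threshold, and the final median threshold) can be sketched as below. This is not a transcription of Algorithm 1: the gradient is passed in as a callable `grad_fn` so that the sketch does not commit to how Eq. (7) is aggregated over neighbor pairs, and the function names, the learning rate, and the iteration cap are assumptions.

```python
from statistics import median


def block_coordinate_descent(M, grad_fn, eta=0.1, tol=0.005, max_iter=1000):
    """Update one row M_u at a time while the other rows stay fixed.

    M: dict node -> list of k nonnegative floats.
    grad_fn(M, u): returns the gradient of the objective with respect to M_u.
    Stops when the largest entry-wise change in a sweep is below tol.
    """
    M = {u: list(row) for u, row in M.items()}   # work on a copy
    for _ in range(max_iter):
        largest_change = 0.0
        for u in M:
            grad = grad_fn(M, u)
            new_row = [max(0.0, m - eta * g) for m, g in zip(M[u], grad)]
            change = max(abs(a - b) for a, b in zip(new_row, M[u]))
            largest_change = max(largest_change, change)
            M[u] = new_row
        if largest_change < tol:
            break
    return M


def assign_communities(M):
    """Node u joins community c when M[u][c] exceeds the median of all entries."""
    threshold = median(x for row in M.values() for x in row)
    return {u: [c for c, x in enumerate(row) if x > threshold]
            for u, row in M.items()}
```

With the median as the threshold, roughly half of the entries of M count as memberships, so a node may end up in several communities or in none.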

4.3 Computational complexity

The computational complexity of the PCDMS algorithm depends on the number of communities and the density of motifs. Equations (6) and (7) are used to update the degree of belonging to the community, which is at the core of Algorithm 1, as seen in its iteration steps. In this scenario, the presence or absence of a motif for two nodes depends on whether or not their neighbors are members of one or more communities. Therefore, the computational complexity depends on the number of neighbors of each node (\( |N(u)| \)) and on how many communities are present; in the worst case, this complexity is \( O(2k\cdot |E|) \).

4.4 Initialization

There are multiple options to initialize the matrix of belonging intensity for the nodes in each community. Filling in the values at random is the first option, which also seems to be the simplest. Its biggest limitation, however, is that the algorithm needs more iterations to reach a stable state, which adds to the computational cost. The locally minimal neighborhood method (Gleich & Seshadhri, 2012), which has been demonstrated through experiments to be a good starting point for community detection algorithms, is the second option. In addition to reducing the number of iterations and starting the algorithm in a stable state, this approach has the additional benefit of being able to predict the initial number of communities with which to start the proposed model’s community detection phase.
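One way to read the second option is sketched below: the neighborhood of a node (the node plus its neighbors) is treated as a seed community when its conductance is not higher than that of the neighborhoods of all its neighbors. This is an illustrative reading of the idea, not the procedure used in the experiments; the tie rule, the value returned for a degenerate cut, the 0/1 initial strengths, and all function names are assumptions.

```python
def conductance(adj, S):
    """cut(S) / min(vol(S), vol(rest)); returns 1.0 for a degenerate cut (assumed)."""
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(nbrs) for nbrs in adj.values()) - vol_S
    smaller = min(vol_S, vol_rest)
    return cut / smaller if smaller > 0 else 1.0


def locally_minimal_neighborhoods(adj):
    """Distinct neighborhoods whose conductance is minimal among their neighbors'."""
    ego = {u: frozenset({u} | adj[u]) for u in adj}
    phi = {u: conductance(adj, ego[u]) for u in adj}
    seeds = [u for u in adj if all(phi[u] <= phi[v] for v in adj[u])]
    distinct = {ego[u]: phi[u] for u in seeds}      # drop duplicate neighborhoods
    return sorted(distinct, key=lambda S: (distinct[S], sorted(S)))


def initialize_M(adj, k):
    """Set M[u][c] = 1.0 if u lies in the c-th seed neighborhood, else 0.0."""
    seeds = locally_minimal_neighborhoods(adj)[:k]
    return {u: [1.0 if c < len(seeds) and u in seeds[c] else 0.0 for c in range(k)]
            for u in adj}
```

The number of distinct seed neighborhoods found this way can also serve as a first guess for the number of communities k.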

5 Experiments

The proposed PCDMS approach was implemented in Python using the Spyder environment. To assess the outcomes, we employed seven real-world datasets (Table 2) and sixteen synthetic networks (Table 5). Where available, the datasets also contain “ground-truth” community memberships of the nodes. On these datasets, the proposed method is compared with fundamental algorithms such as Louvain (Blondel et al., 2008), Leiden (Traag et al., 2019), Bigclam (Yang & Leskovec, 2013, 2012), CPM (Palla et al., 2005), Label propagation (Gregory, 2010), and SLPA (Xie et al., 2011). Table 1 lists these algorithms in brief.

Table 1 Summary of the employed methods

5.1 Evaluation metrics

The efficiency and accuracy of the community detection algorithms are assessed using three widely used evaluation measures. The F1Score and NMI are external measures that evaluate community accuracy by comparison with ground-truth communities (Fortunato & Hric, 2016), whereas modularity (Clauset et al., 2004) is an internal metric for evaluating community quality. Modularity, which originates from the Girvan-Newman method (Clauset et al., 2004), is a well-known standard for assessing the density of edges within communities. It compares the fraction of edges that fall inside communities with the fraction expected if edges were placed at random while preserving node degrees. The more edges fall inside communities and the closer the modularity score is to 1, the better the discovered communities are. The F1Score is a well-known assessment metric in community detection that measures how often nodes are correctly assigned to each community based on available ground-truth data. Normalized mutual information (NMI), the second external metric, measures the mutual information between the identified communities and the ground truth.
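Two of these measures can be sketched for non-overlapping partitions as follows. Modularity is computed with the standard formula Q = Σ_c [L_c/m − (d_c/2m)²], where L_c is the number of edges inside community c, d_c is the total degree of its nodes, and m is the number of edges. NMI is normalized here by the arithmetic mean of the two entropies, which is one common convention and an assumption of this sketch; the exact variants used in the experiments are not specified in the text.

```python
import math
from collections import Counter


def modularity(edges, community):
    """Q = sum_c [L_c / m - (d_c / 2m)^2] for an undirected, unweighted edge list."""
    m = len(edges)
    intra = Counter()    # L_c: edges with both endpoints in community c
    degree = Counter()   # d_c: sum of the degrees of the nodes in community c
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


def nmi(labels_a, labels_b):
    """Normalized mutual information 2 I(A;B) / (H(A) + H(B)) of two partitions."""
    n = len(labels_a)
    count_a = Counter(labels_a.values())
    count_b = Counter(labels_b.values())
    joint = Counter((labels_a[u], labels_b[u]) for u in labels_a)
    mutual = sum(c / n * math.log(c * n / (count_a[a] * count_b[b]))
                 for (a, b), c in joint.items())
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    return 2 * mutual / (h_a + h_b) if h_a + h_b > 0 else 1.0
```

For two triangles joined by a single bridge edge and split into the two triangles, the sketch gives Q = 5/14; NMI equals 1 for identical partitions (regardless of label names) and 0 for statistically independent ones.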

5.2 Real-world datasets

Seven real-world datasets are used in the experiments. Zachary’s karate club network (Zachary, 1977) is the first dataset, containing 34 nodes, 78 edges, and 2 ground-truth communities. This dataset contains social ties among university karate club members collected by Wayne Zachary in 1977. The dolphins’ social network (Lusseau et al., 2003) is the second dataset, which contains 62 nodes, 159 edges, and two ground-truth communities; a link represents frequent association between two dolphins. The third dataset (Kunegis, 2013), with 105 nodes, 441 edges, and 3 ground-truth communities, is based on data from the network of books about US politics published around the time of the 2004 presidential election. Edges between books represent frequent co-purchasing of books by the same buyers. The fourth dataset is the American football network (Girvan & Newman, 2002), with 116 nodes, 613 edges, and 12 ground-truth communities. This network contains American football games between Division IA colleges during the fall of 2000. The fifth dataset is a large network generated using email data from a large European research institution (Leskovec et al., 2007; Yin et al., 2017). This network contains 1005 members of the institution as nodes, and its 25571 edges represent emails sent between members of the institution and people outside of the institution. The dataset also takes the departments of the research institute as the nodes’ ground-truth community memberships. Each individual belongs to exactly one of the 42 departments at the research institute. The sixth dataset, known as Wiki-Vote, has 879 nodes and 2914 edges (Rossi & Ahmed, 2015). It contains voter data from a poll. The network’s nodes represent network users, and an edge from node i to node j indicates that user i voted for user j. The seventh dataset, Twitter, has 1536 nodes and 30596 edges. It is based on data from the social network Twitter (Kumar et al., 2014). The nodes in this graph represent social network users, and the edge that connects node i to node j represents tweets that node j retweeted. The Wiki-Vote and Twitter datasets do not contain any ground truth. The real-world datasets analyzed during the current study are shown in Table 2, where N is the number of nodes, E is the number of edges, and K is the number of ground-truth communities. These datasets are available in the Network Repository (Rossi & Ahmed, 2015), the KONECT project (Kunegis, 2013), and the Stanford Network Analysis Project (SNAP) (Leskovec et al., 2007).

Table 2 The specifics of the real-world datasets used

5.2.1 Experimental results on real-world datasets

To evaluate the effectiveness and accuracy of PCDMS in community detection, we compare the proposed model with four categories of community detection methods, namely modularity optimization, label propagation, probabilistic estimation, and clique percolation. Some of these techniques were briefly covered in the previous sections. Six algorithms are compared against the proposed method using internal (modularity and community number) and external evaluation metrics (NMI and F1Score). For the internal metrics (maximum modularity and accuracy of the number of communities), the results in Table 3 show that our method is more accurate than the other methods.

Table 3 Experimental results on real-world networks by the modularity metric (Q) and community number (CN)

Fig. 2 NMI evaluation diagram, comparing PCDMS with six community detection methods on five real-world datasets

Fig. 3 F1Score evaluation diagram, comparing PCDMS with six community detection methods on five real-world datasets

Additionally, for the external evaluation criteria (NMI and F1Score), Figs. 2 and 3 show that, on the datasets that contain ground truth, our proposed method performs somewhat better than the modularity optimization and label propagation methods and clearly better than the probabilistic estimation and clique percolation methods.

5.3 Synthetic datasets

Synthetic networks are an appropriate means of assessing community detection methods. There are different ways to create synthetic networks. The LFR benchmark (Lancichinetti et al., 2008) is one of the most well-known and commonly applied techniques. The LFR benchmark generates undirected and unweighted synthetic networks with ground-truth communities using the degree and community size distributions. We can configure network and community settings before simulating networks using LFR. The mixing parameter (\( \mu \)) is one of the important parameters in LFR. This parameter regulates how strongly different communities interact. As shown in Table 5, a high mixing parameter value (\( \mu \)) lowers the network’s degree of modularity (\( Q_{GT} \)). Accordingly, based on the modularity metric and the mixing parameter, the LFR-generated datasets are divided into two categories: sparse and dense communities. The average degree is another significant parameter; increasing it leads to more interaction between communities. Table 4 displays the main features of the LFR synthetic datasets.
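The role of the mixing parameter can be illustrated by measuring it on a given graph and partition. In the LFR benchmark, each node shares a fraction \( \mu \) of its links with nodes of other communities; the sketch below computes the empirical average of that fraction. It does not generate LFR networks, and the function name and data layout are assumptions made for the example.

```python
def mixing_parameter(adj, community):
    """Average, over nodes with at least one edge, of the fraction of a node's
    edges that lead to a different community."""
    ratios = []
    for u, nbrs in adj.items():
        if nbrs:
            external = sum(1 for v in nbrs if community[v] != community[u])
            ratios.append(external / len(nbrs))
    return sum(ratios) / len(ratios)
```

For two triangles joined by one bridge edge, only the two bridge endpoints have an external link (one of their three edges each), so the empirical value is (1/3 + 1/3) / 6 = 1/9; values approaching 1 mean that most links leave their community, which corresponds to the hard cases in the experiments.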

Table 5 reports the datasets created using the LFR approach that we employed.

Table 4 Parameters of LFR synthetic datasets (Lancichinetti et al., 2008)
Table 5 The details of the LFR synthetic network generated

5.3.1 Experimental results on synthetic datasets

We have conducted experiments on LFR synthetic networks in addition to real-world networks. To demonstrate the impact of the optimized softmax loss function for probabilistic estimation on the community detection process, we compare the PCDMS technique with the well-known community detection methods in Table 1 using the modularity, F1Score, and NMI criteria. For this purpose, according to the properties of synthetic networks given in Table 4, sixteen LFR synthetic networks are created with mixing parameters (\( \mu \)) varying from 0.05 to 0.8, as shown in Table 5. In Table 6, the experimental results show that the communities are dense for small values of the mixing parameter (e.g. \( 0.05\le \mu \le 0.4\)) and that the compared algorithms are nearly accurate in this case. As the mixing parameter (\( \mu \)) increases (e.g. \( 0.4 \le \mu \le 0.8\)), the differences between the algorithms become clearer: the communities become sparse, and detecting them is challenging because the number of edges between communities rises.

Table 6 Experimental results on sixteen LFR synthetic networks by the modularity metric

As seen in Figs. 4 and 5, some algorithms have NMI and F1Score values equal to zero as the mixing parameter value increases. The proposed method surpasses the majority of the widely used methods in the range of \( 0.5 \le \mu \le 0.8 \).

Fig. 4 NMI evaluation diagram, comparing PCDMS with six community detection methods on sixteen LFR datasets

Fig. 5 F1Score evaluation diagram, comparing PCDMS with six community detection methods on sixteen LFR datasets

6 Conclusion

We present a motif-based probabilistic approach for community detection in complex networks. Recent community detection methods have given less thought to the latent variable of the probabilistic model because of the difficulty of combining probabilistic methods with motif structure. The proposed approach uses the relationship of at least two connected edges between three nodes (the triangular motif structure) and the intensity of each node’s membership in the community to estimate the latent variable of the probabilistic model. We applied the well-known Block Coordinate Descent algorithm to minimize the negative likelihood function and extract the latent parameters of the model. Another factor that aids in the investigation of newly discovered communities is the relationship between node membership in communities and edge density: three nodes are more likely to form a motif structure when they are observed together in several communities. Another advantage of PCDMS is that it detects overlapping communities; according to the results, communities that overlap have a higher density of triangle motifs. To evaluate the effectiveness of the proposed method, we used seven real-world networks and sixteen synthetic networks. On real-world networks, PCDMS achieved satisfactory results and outperformed the other six approaches in terms of internal and external evaluation metrics. Also, according to the evaluations on synthetic networks, the proposed method outperforms other methods on sparse datasets. Additionally, a study of execution time complexity shows that the proposed methodology performs better than other methods. Future research can extend PCDMS in two directions. Edge weights can be estimated using a probabilistic generative model and considering a latent parameter. Also, to provide a more accurate interpretation of the detected communities, the proposed method can be expanded by using node attributes in the network.