Introduction

Community networks are ubiquitous in real life; they consist of highly interconnected entities from the natural world and society [1]. These networks typically share a common characteristic: closely related or similar nodes often belong to the same category, while nodes with weaker connections or opposing characteristics tend to belong to different categories. Based on this characteristic, we can extract valuable feature information from community networks to cluster their nodes, and then apply the resulting partitions in relevant fields. For instance, in bioinformatics, we can partition biological molecules to discover those with similar structures and functions, or identify protein complexes within protein–protein interaction (PPI) networks. In social media, we can perform opinion analysis, recommend products to users, and find potential friends [2]. Analyzing these community networks is therefore of significant importance. Community detection is an effective means of analyzing the characteristics and structure of community networks: by optimizing the community detection method, important explicit and implicit features of the network are mined so that the community structure can be divided effectively. The study of community detection is thus of great significance for understanding the deeper characteristics and functions of networks.

In recent years, community detection has attracted the attention of many researchers, and many community detection methods have been proposed [1, 3,4,5], such as modularization and minimal cut [6]. In addition, non-negative matrix factorization (NMF) is also an important method for community detection. It aims to decompose a high-dimensional matrix into two or more low-dimensional non-negative matrices whose product approximates the original matrix. Compared with other methods, NMF has the following main advantages in community detection [7]: (1) high interpretability: by representing the community network with an adjacency matrix and using this matrix as the feature matrix in NMF, each value in the community result matrix obtained through factorization can be understood as the probability that a node belongs to the corresponding community. For example, in the community result matrix, \(Z_{ij}\) represents the probability or strength of node \(V_i\) belonging to community \(C_j\). This makes the results easier to interpret and explain; (2) high adaptability: real-world networks come in various forms, including overlapping and non-overlapping networks, directed and undirected networks, attribute networks, dynamic networks, and more, and NMF and its variants can be applied to any of them. For example, in the case of overlapping networks, it only requires setting a probability threshold to detect nodes’ membership in multiple communities; (3) high integration: existing information within the community network can be incorporated into the NMF learning process to improve the accuracy of community detection. For instance, existing attributes or node labels in the network can be integrated into the objective function to iteratively learn more refined results.
Building upon these advantages, researchers have conducted in-depth studies on the application of NMF in community detection. For topological networks [8, 9], which contain only structural information, such as directed or undirected networks, NMF can be directly applied to detect communities. Many researchers have further improved this by modeling communities [3] or by incorporating additional information to enhance performance [10]. In signed networks [11, 12], the relationship between nodes can be expressed as a positive or negative correlation, where a positive correlation means that the nodes are friends and a negative correlation means that they are enemies; the adjacency matrix is therefore a signed matrix. Compared to traditional networks, signed networks not only consider the closeness between nodes but also require positively correlated nodes to be in the same community and negatively correlated nodes to be in different communities. In attribute networks [13, 14], nodes possess labels or attribute information in addition to link structure; these attribute details often better represent unique node characteristics and complement topological information for achieving high-quality community detection. NMF and its variants can thus address community detection problems in various types of networks, playing a crucial role in community mining.

Although current research has achieved a certain degree of effectiveness, the mining of potential information in community networks remains insufficient, mainly due to the following shortcomings: (1) Lack of consideration of homogeneity among nodes within a community: some nodes may not have a direct relationship, but potential relationships between them can be discovered through common neighbor nodes. Typically, nodes with a large number of common neighbors are more closely related than those with few or no common neighbors, making them more likely to be assigned to the same community. For example, if both Paper 1 and Paper 2 cite Paper 3, while Paper 4 has no citation relationship with any of these three papers, Papers 1 and 2 are more likely to be classified into the same topic category, indicating a potential common theme, whereas it is uncertain whether Paper 4 shares a theme with them. Therefore, considering common neighbors when calculating similarity is particularly important; (2) Lack of consideration of heterogeneity between communities: community detection aims not only for nodes within communities to have more similar features but also for differences between communities to be more distinct. The greater the differences between communities, and the more focused the features within each community, the clearer the partition and the better the results; (3) Lack of optimization of the community membership matrix: for the probability matrix obtained from community detection, we usually assign each node to the community with the highest probability. However, due to constraints from initialization methods and the number of iterations, the values in the probability matrix are often not clearly distinguishable.
In summary, considering the three elements mentioned above simultaneously during the model learning process is crucial for community detection. Optimization from the internal, external, and inherent characteristics of community networks can enhance the effectiveness of community detection. However, addressing these aspects simultaneously is often challenging and represents a research problem that urgently needs solutions.

Motivated by existing community detection methods based on NMF, this study aims to address the following issues: (1) To address the homogeneity issue, we consider the similarity between nodes and measure it in several ways to uncover hidden information in the network; (2) To address the disparities between communities, we introduce orthogonal constraints to ensure diversity between communities. We guide the learning of the objective function through the mutual constraint between inter-community diversity and node homogeneity, aiming for improved detection results; (3) To optimize the result matrix, we add constraints to the community membership matrix. This leads to better results in each iteration, producing many very small values and a few larger ones, which makes the probability of nodes belonging to communities more distinct. In summary, we propose a new multi-constraint non-negative matrix factorization community detection method, named orthogonal regular L1-norm sparsity constrained non-negative matrix factorization (ORSNMF). Building on traditional network topology modeling, we simultaneously model the diversity between communities, the similarity among nodes, and the sparsity of the community membership matrix. This comprehensive modeling aims to better characterize the features of community structures. Finally, we incorporate these three aspects into the objective function for joint-constrained learning, resulting in improved community detection. Our main contributions can be summarized as follows:

  • A new community detection model based on NMF with orthogonal regularization and an L1-norm sparsity constraint is proposed. The proposed scheme simultaneously models the differences between communities, the similarity between nodes and the sparsity of the community member matrix in directed networks, in order to capture the attributes of the community structure as fully as possible.

  • An algorithm with convergence guarantee is proposed to optimize the model.

  • Extensive experiments on synthetic and real data sets show that our proposed model performs better on three metrics: Jaccard similarity, normalized mutual information (NMI) and accuracy [15].

The rest of this article is organized as follows. We introduce related works in Sect. “Related works”, elaborate on the related issues of community detection in Sect. “Problem description”, and detail our proposed orthogonal regular L1-norm sparsity constraint non-negative matrix factorization model in Sect. “Orthogonal regular and L1-norm sparse constrained NMF”. Comprehensive experiments are performed to validate the effectiveness of the proposed scheme in Sect. “Experiment and analysis”, followed by conclusions in Sect. “Conclusion”.

Related works

For community networks in real life, a basic problem of network analysis is how to mine effective information and then identify communities in order to support practical applications such as movie recommendation and advertisement pushing; this process is known as community detection. In recent years, various community detection methods have emerged, with significant attention given to those based on NMF [3, 5, 16,17,18,19,20,21,22]. These methods have gradually become a new direction in the field of community detection.

NMF is a classic low-rank matrix decomposition model proposed by Lee et al. [23, 24]. The process involves decomposing a non-negative matrix into the product of two or more non-negative matrices. The goal is to find a non-negative basis matrix and its corresponding non-negative coefficient matrix which, when multiplied, approximate the original data matrix (i.e., the matrix before decomposition). NMF has an inherent clustering capability. He et al. [25] demonstrated that NMF and its related improvements have effects similar to those of some classical clustering algorithms [26,27,28]. Community detection is, fundamentally, a clustering problem on complex networks. In addition to its clustering capability, NMF also has advantages such as interpretability. When NMF is used for community detection, the adjacency matrix of the community network can be used as the feature matrix of NMF, and the factors obtained from the decomposition represent the community member matrix and the community feature matrix, respectively. Each entry of the community member matrix can be viewed as the probability that a node belongs to a community, which captures the relationship between nodes and communities and makes the results easier to understand and more convincing. Based on these advantages, NMF is well suited to community detection.

Most existing community detection methods based on NMF focus on enhancing the performance of the NMF algorithm to achieve better results in community detection. For example, Wang et al. [3] proposed symmetric non-negative matrix factorization (SNMF), asymmetric non-negative matrix factorization (ANMF) and joint non-negative matrix factorization (JNMF) for undirected networks, directed networks and composite networks, respectively, to solve the problem of community discovery. The pairwise constrained symmetric non-negative matrix factorization method (PCSNMF) proposed by Shi et al. [29] considers not only the symmetric community structure of undirected networks but also some pairwise constraints generated from basic information. Ye et al. [9] proposed homophilic positive non-negative matrix factorization (HPNMF), which models not only the topology of links but also the homogeneity of nodes in the network, providing a better reflection of the inherent structural properties of communities. Ye et al. [10] proposed to learn an affinity matrix adaptively, which can accurately capture the intrinsic similarity between nodes and therefore benefits the community detection results. Shi et al. [21] proposed a Bayesian NMF method for adaptive community detection. In the decomposition process, the use of Bayesian methods allows not only for capturing the most appropriate number of communities in large networks through shrinkage but also for finding optimal thresholds for assigning nodes to communities in ambiguous situations. Tosyali et al. [30] proposed regularized asymmetric non-negative matrix factorization (RANMF) for directed network clustering based on the prior information of the network and the pairwise similarity of nodes. Zhang et al. [31] proposed homophilic non-negative matrix factorization (HNMF) to model bidirectional relationships between links and communities.
From the community-to-link perspective, the method assumes that nodes with common communities have a higher probability of establishing links than nodes without common communities, applying a preference-based pairwise function. From the link-to-community perspective, the method assumes that linked nodes have similar community representations, introducing a novel network embedding-based community representation learning approach. Liu et al. [32] introduced a symmetric and graph-regularized non-negative matrix factorization (SGNMF) method. This approach incorporates multiple latent factors to enhance its representation learning capabilities and introduces regularization terms to account for the symmetry of undirected networks, ultimately improving community detection performance. Luo et al. [33] proposed a novel constrained fusion-induced symmetric non-negative matrix factorization (CFS) model. This model, designed for undirected networks, introduces a graph regularization factor that preserves the intrinsic geometry of the network’s local invariance. This incorporation allows the proposed detector to effectively understand the community structure within the target network.

In summary, most existing community detection methods can achieve good results under certain conditions, especially when considering node attributes or labels as prior knowledge, which can effectively enhance detection accuracy. However, many models do not comprehensively consider the characteristics of the community’s internal, external, and inherent properties. They only derive limited inherent properties of community structures from the community structure itself, without maximizing the extraction of network information. As a result, this can impact the effectiveness of community detection.

Problem description

A community network can be represented as a graph \(G=(V,E)\), where the node set is \(V=\{V_1,V_2,\ldots ,V_n\}\), \(V_i\) represents a node, and \(n=|V |\) is the number of nodes in the community network. The edge set is \(E=\{e_{ij}|V_i \in V \wedge V_j \in V\}\), where \(e_{ij}\) represents the edge between nodes \(V_i\) and \(V_j\), and \(m=|E|\) is the number of edges in the network. Networks are usually divided into undirected networks and directed networks. There are many clustering methods for undirected networks, while there are relatively few studies on directed networks. Therefore, this article focuses on directed unweighted network clustering. In general, a directed network G can be described by an adjacency matrix \(A=[A_{ij}]^{n \times n}\), where \(A_{ij}\) represents the relationship between node \(V_i\) and node \(V_j\): when there is a connection from \(V_i\) to \(V_j\) (i.e. \(e_{ij} \in E\)), \(A_{ij}=1\); otherwise, \(A_{ij}=0\). Suppose the network G consists of k communities, and C denotes the community set of G, that is, \(C=\{C_i|C_i \ne \emptyset , 1 \le i \le k\}\), where \(C_i\) represents the ith community and it is not empty. The purpose of community detection is to divide the nodes into k different groups according to the network topology, so that the number of edges within each group is maximized while the number of edges across different groups is minimized. In this study, we focus on non-overlapping community detection, that is, the community set C should satisfy the condition \(C_i \cap C_j = \emptyset \) if \(i \ne j\), which means that different communities \(C_i\) and \(C_j\) have no common nodes.
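
As a small illustration of the adjacency matrix defined above, the following sketch builds \(A\) from a list of directed edges. The function name `adjacency_from_edges`, the 0-based node indexing and the toy edge list are assumptions made for this example only; they are not part of the original method.

```python
import numpy as np

def adjacency_from_edges(n, edges):
    """Build the n x n adjacency matrix of a directed, unweighted network.

    `edges` is an iterable of (i, j) pairs with 0-based node indices,
    each meaning that there is an edge from node i to node j.
    """
    A = np.zeros((n, n), dtype=float)
    for i, j in edges:
        A[i, j] = 1.0  # A_ij = 1 when e_ij is in E, otherwise it stays 0
    return A

# Hypothetical toy network: two directed triangles joined by one edge.
toy_edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
A_toy = adjacency_from_edges(6, toy_edges)
```

Because the network is directed, `A_toy` is in general not symmetric: the edge from node 2 to node 3 sets `A_toy[2, 3]` but not `A_toy[3, 2]`.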

Recently, NMF has become an important method of community detection [5, 9, 16, 18,19,20, 34], which mainly has the following advantages:

  • Better interpretability: given a network, after the non-negative matrix decomposition, a community member matrix will be obtained. Each element in the matrix can be understood as the probability or intensity that the node belongs to the corresponding community, which makes the results of community detection more interpretable.

  • Integration of node-related information: NMF can integrate node-related information (such as node similarity information) as regularization constraints into the objective function, and jointly guide the iterative optimization of the objective function to improve the clustering performance.

In view of this, we adopt NMF for community detection. Specifically, the problem is defined as follows:

Given a directed and unweighted network \(G=(V,E)\), using A to represent the adjacency matrix of this network, the individual nodes in the network can be divided into disjoint clusters by optimizing the following objective function:

$$\begin{aligned} \min \mathcal {L}(Z,H)=\Vert X-ZH \Vert _F^2,\quad \text {s.t} \hspace{5.0pt}Z \ge 0,H\ge 0, \end{aligned}$$
(1)

where \(X \in R_+^{n\times m}\) is the original non-negative matrix, \(Z \in R_+^{n\times k}\) is the basis matrix, \(H \in R_+^{k \times m}\) is the coefficient matrix, \(k<\min \{n,m\}\), and \(\Vert \bullet \Vert _F^2\) is the squared Frobenius norm. The purpose is to find the optimal low-rank non-negative matrices Z and H that make ZH as close as possible to X.

When NMF is leveraged for community detection, the adjacency matrix A of the network is used as the feature matrix to be decomposed, that is, \(X=A\) and \(A \approx ZH\), where Z and H represent the community member matrix and the community characteristic matrix, respectively. Furthermore, k represents the number of communities (clusters), and \(Z_{ij}\) represents the probability (strength) that the node \(V_i\) belongs to the community \(C_j\), where \(1\le j\le k\).
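
To make the factorization in Eq. (1) concrete, the following is a minimal sketch of plain NMF using the standard multiplicative update rules for the Frobenius loss. It illustrates only the basic model, not the method proposed in this article; the function name `nmf_multiplicative`, the random initialization, the iteration count and the small constant `eps` (added to avoid division by zero) are assumptions of this example.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-10, seed=0):
    """Approximate a non-negative X (n x m) by Z (n x k) @ H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    Z = rng.random((n, k))  # random non-negative initialization
    H = rng.random((k, m))
    losses = []
    for _ in range(n_iter):
        # Element-wise multiplicative updates keep Z and H non-negative.
        H *= (Z.T @ X) / (Z.T @ Z @ H + eps)
        Z *= (X @ H.T) / (Z @ H @ H.T + eps)
        # Track the squared Frobenius reconstruction error of Eq. (1).
        losses.append(np.linalg.norm(X - Z @ H, "fro") ** 2)
    return Z, H, losses

# Hypothetical demo: factorize a small directed adjacency matrix with k = 2.
A_demo = np.array([[0, 1, 1, 0, 0, 0],
                   [0, 0, 1, 0, 0, 0],
                   [1, 0, 0, 1, 0, 0],
                   [0, 0, 0, 0, 1, 1],
                   [0, 0, 0, 1, 0, 1],
                   [0, 0, 0, 1, 1, 0]], dtype=float)
Z_demo, H_demo, losses_demo = nmf_multiplicative(A_demo, k=2)
```

The result depends on the random initialization, which is one reason why the entries of `Z_demo` are not always clearly separated in practice.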

The discussion above is the traditional NMF model for community detection. However, in the above model, the connectivity between communities is not considered, so Wang et al. [3] propose to integrate the information between communities into the objective function, and set \(H=CZ^\top \) so that the original optimization problem is converted to the following optimization problem:

$$\begin{aligned} \min \mathcal {L}(Z,C)=\Vert A-ZCZ^\top \Vert _F^2, \quad s.t\hspace{5.0pt}Z \ge 0,C\ge 0, \end{aligned}$$
(2)

where \(A \in R_+^{n \times n}\) is the adjacency matrix of the network with n nodes; \(Z \in R_+^{n \times k}\) is the community member matrix, in which \(Z_{ij}\) stores the probability of node \(V_i\) belonging to community \(C_j\); and \(C \in R_+^{k \times k}\) is the cluster matrix representing the connectivity between pairs of communities. For example, in a directed network, if the ith community points to the jth community, then \(C_{ij}\) is a non-zero value; Z and C are non-negative asymmetric matrices. In addition, researchers have also studied a variety of non-negative matrix factorization variants in this area, for example, the Relative Pairwise Relationship constrained non-negative matrix factorisation (RPR-NMF) proposed by Jiang et al. [35].

A good clustering method should result in more similar nodes within communities and fewer community-to-community associations, i.e., more pronounced differences between communities. Nevertheless, the studies above only consider the network connections/edges while ignoring the similarity between nodes, that is, the fact that nodes with similar features are often more tightly related than nodes with different features. Therefore, in this study, we add similarity information to the objective function in the form of node similarity constraints. Because we study non-overlapping community detection, we determine the community to which the ith node belongs by taking the index of the maximum value in the ith row of the community member matrix Z. In order to obtain a better community member matrix Z, we add a constraint to Z so that each row contains only a few large values, with most other values very small.

Orthogonal regular and L1-norm sparse constrained NMF

We develop a new orthogonal regularized L1-norm sparse constrained non-negative matrix factorization model (ORSNMF), which considers the differences between communities, node similarity and how to obtain a better community membership matrix. In this study, we first model the above three aspects separately and then combine them into a unified model.

Community difference modeling

The topology of a network contains rich information and can therefore serve as an essential starting point for community analysis. Orthogonality constraints are known to ensure interpretability and to maintain sparsity, which avoids some trivial solutions [36]. In practice, we hope that the vectors of the cluster matrix C are different from each other: the more orthogonal the vectors are, the more significant the differences between communities, leading to better clustering results. Therefore, we add orthogonality constraints to the cluster matrix C and integrate them into the objective function as follows:

$$\begin{aligned}&\min \mathcal {L}_O=\Vert A-ZCZ^\top \Vert _F^2+\gamma \Vert C^\top C-I\Vert _F^2,&\nonumber \\ {}&s.t\quad \gamma \ge 0, Z \ge 0,C\ge 0&\end{aligned}$$
(3)

where \(A \in R_+^{n \times n}\) represents the adjacency matrix of the community network and n represents the total number of nodes; \(Z \in R_+^{n\times k}\) represents the community membership matrix obtained after learning, k represents the number of communities, and \(Z_{ij}\) represents the probability of node \(V_i\) belonging to community \(C_j\); \(C \in R_+^{k \times k}\) represents the community matrix, and \(C_{ij}\) represents the strength of the relationship between community \(C_i\) and community \(C_j\); \(Z^\top \) represents the transpose of matrix Z; \(\Vert \bullet \Vert _F^2\) represents the squared Frobenius norm; \(\gamma \) is an orthogonalization parameter used to balance the reconstruction error in the first term against the orthogonality penalty in the second term; I represents the identity matrix; and matrices A, Z, and C are all non-negative.
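
The value of the objective in Eq. (3) can be evaluated directly from its definition. The sketch below is only meant to make the two terms tangible; the function name `community_difference_loss` and the toy matrices are hypothetical.

```python
import numpy as np

def community_difference_loss(A, Z, C, gamma):
    """Evaluate ||A - Z C Z^T||_F^2 + gamma * ||C^T C - I||_F^2 as in Eq. (3)."""
    k = C.shape[0]
    reconstruction = np.linalg.norm(A - Z @ C @ Z.T, "fro") ** 2
    orthogonality = np.linalg.norm(C.T @ C - np.eye(k), "fro") ** 2
    return reconstruction + gamma * orthogonality

# Hypothetical toy case: 4 nodes split into 2 communities.
Z_toy = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

# With C = I the columns of C are orthonormal, so the penalty vanishes.
C_orth = np.eye(2)
loss_orth = community_difference_loss(Z_toy @ C_orth @ Z_toy.T, Z_toy, C_orth, gamma=0.5)

# With an all-ones C the two communities are indistinguishable and are penalized.
C_ones = np.ones((2, 2))
loss_ones = community_difference_loss(Z_toy @ C_ones @ Z_toy.T, Z_toy, C_ones, gamma=0.5)
```

In both cases the reconstruction error is zero by construction, so the difference between `loss_orth` and `loss_ones` comes entirely from the orthogonality term.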

Node similarity modeling

In practice, we can observe that the relationship between a pair of nodes with similar characteristics is much stronger than that between a pair of nodes with different characteristics. Therefore, we consider adding a regularization term to the objective function to include the node similarity. The regularization is specified as follows:

$$\begin{aligned} \min \mathcal {L}_R=\frac{\lambda }{2} \sum _{i=1}^n \sum _{j=1}^n d(z_i,z_j)S_{ij}, \quad s.t.\quad Z\ge 0, \end{aligned}$$
(4)

where \( \lambda \) is the regularization parameter; \(S\in R^{n \times n}\) is the similarity matrix, which is a symmetric matrix; \(S _{ij}\) represents the similarity between node i and node j; \(z_i\) denotes the ith row of Z, i.e., the community representation of node i; and \(d(z_i,z_j)\) represents the distance between the representations of two nodes. In particular, since nodes in the same community are closer to each other, their distance should be smaller, while the distance between nodes from different communities should be larger. A commonly used measure of the distance between two nodes is the squared Euclidean distance, which is calculated as follows:

$$\begin{aligned} d(z_i,z_j)=\Vert z_i-z_j \Vert ^2. \end{aligned}$$
(5)

Next, we will introduce three methods to calculate the similarity between nodes.

Adjacency similarity

The simplest approach to calculate node similarity is the adjacency similarity, represented by \(S_{ij}\), which is calculated as the number of associations between nodes \(V_i\) and \(V_j\):

$$\begin{aligned} S=A+A^\top +I. \end{aligned}$$
(6)

Since this study focuses on directed networks, the relationship between node \(V_i\) and node \(V_j\) needs to consider direction. Similarly, when calculating similarity, all relationships between two nodes need to be taken into account. Therefore, when using the adjacency matrix to calculate similarity, \(A + A^\top \) should be used. In addition, each node is strongly related to itself, so the identity matrix I is also included in the consideration.
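
Eq. (6) translates directly into code. The following minimal sketch uses a hypothetical function name `adjacency_similarity` and a made-up three-node network.

```python
import numpy as np

def adjacency_similarity(A):
    """Adjacency similarity S = A + A^T + I from Eq. (6)."""
    A = np.asarray(A, dtype=float)
    return A + A.T + np.eye(A.shape[0])

# Hypothetical directed network: 0 <-> 1 (reciprocal) and 1 -> 2 (one-way).
A_small = np.array([[0, 1, 0],
                    [1, 0, 1],
                    [0, 0, 0]], dtype=float)
S_adj = adjacency_similarity(A_small)
```

A reciprocal pair of edges contributes 2 to the similarity, a one-way edge contributes 1, and the result is symmetric even though `A_small` is not.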

Katz centrality

In a network, the centrality of a node is used to measure its importance in the network. Katz centrality calculates the relative influence of a node in a network by measuring its direct neighbors (first-level neighbors) and the number of connections to all other nodes through these direct neighbors in the network. Katz centrality considers not only the contribution of the node’s neighbors to it, but also the size of the contributing neighbors. In addition, a constant is added to represent the node itself. Therefore, Katz centrality is also often used as a similarity measure. Katz centrality is defined as

$$\begin{aligned} S=(I-\delta \cdot A)^{-1}\eta , \end{aligned}$$
(7)

where \( \delta \) is the weight parameter, usually \(0\le \delta \le 1\) [30], and \(\eta \) represents the constant of the node itself. In this article, the node’s own constants are all set to 1 [30].
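
A minimal sketch of Eq. (7) is given below. It treats \(\eta \) as a scalar set to 1, which is an interpretation made for this example, and the function name `katz_similarity` is hypothetical. The sketch also checks that \(\delta \) is smaller than the reciprocal of the spectral radius of A, since this is the condition under which the inverse coincides with the convergent series \(I+\delta A+\delta ^2A^2+\cdots \). Note that for a directed A the resulting matrix is generally not symmetric.

```python
import numpy as np

def katz_similarity(A, delta, eta=1.0):
    """Katz-based similarity S = (I - delta * A)^{-1} * eta from Eq. (7).

    Assumes eta is a scalar and requires delta * rho(A) < 1, where rho(A)
    is the spectral radius of A.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    rho = np.max(np.abs(np.linalg.eigvals(A)))
    if delta * rho >= 1.0:
        raise ValueError("delta must be smaller than 1 / spectral radius of A")
    return eta * np.linalg.inv(np.eye(n) - delta * A)

# Hypothetical two-node network with a single edge 0 -> 1.
A_pair = np.array([[0.0, 1.0],
                   [0.0, 0.0]])
S_katz = katz_similarity(A_pair, delta=0.5)
```

For this toy network \(A^2=0\), so the series stops after the first-order term and `S_katz` equals \(I+0.5A\).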

Cosine similarity

Cosine similarity measures the similarity between two nodes by calculating the number of common neighbors between them. Specifically, it is expressed by dividing the number of common neighbors of node i and node j by the geometric mean of their degrees [37], which is calculated as follows:

$$\begin{aligned} S_{ij}=\frac{v_i^\top v_j}{\Vert v_i\Vert \cdot \Vert v_j \Vert }, \end{aligned}$$
(8)

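where \(v_i\), \(v_j\) are the vectors corresponding to nodes \(V_i\) and \(V_j\) in the adjacency matrix A, \(v_i^\top v_j\) represents the number of common neighbors between node \(V_i\) and node \(V_j\), and \(\Vert v_i\Vert \) is the Euclidean norm of \(v_i\), which for a binary adjacency vector equals the square root of the degree of node \(V_i\); the denominator \(\Vert v_i\Vert \cdot \Vert v_j \Vert \) is therefore the geometric mean of the degrees of \(V_i\) and \(V_j\).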
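
Eq. (8) can be computed for all node pairs at once. In the sketch below, the rows of A (out-neighbors) are taken as the vectors \(v_i\); this choice, the function name `cosine_similarity` and the guard `eps` for nodes without neighbors are assumptions of the example.

```python
import numpy as np

def cosine_similarity(A, eps=1e-12):
    """Cosine similarity between the rows of A, following Eq. (8)."""
    A = np.asarray(A, dtype=float)
    norms = np.linalg.norm(A, axis=1)   # ||v_i||, the square root of the degree for binary A
    common = A @ A.T                    # v_i^T v_j, the number of common neighbors
    denom = np.outer(norms, norms)      # ||v_i|| * ||v_j||
    # Rows with no neighbors have a zero numerator, so their similarity is 0.
    return common / np.maximum(denom, eps)

# Hypothetical network: node 0 -> {1, 2}, node 1 -> {2}, node 2 has no out-edges.
A_cos = np.array([[0, 1, 1],
                  [0, 0, 1],
                  [0, 0, 0]], dtype=float)
S_cos = cosine_similarity(A_cos)
```

Nodes 0 and 1 share one common neighbor (node 2) and have degrees 2 and 1, so their similarity is \(1/\sqrt{2}\).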

Community membership matrix sparsity modeling

Given the community member matrix Z in non-overlapping community detection, \(Z_{ij}\) represents the probability (intensity) of node \(V_i\) belonging to community \(C_j\). The ith node is assigned to a community by taking the index of the maximum value in the ith row of Z. To make this assignment clearer, we add L1-norm sparsity constraints to each row of the community membership matrix Z, so that only a few larger values are generated while all other values are very small, and each node captures the community to which it belongs.
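
The assignment rule described above amounts to a row-wise argmax. The following sketch uses a hypothetical function name and a made-up membership matrix.

```python
import numpy as np

def assign_communities(Z):
    """Return, for every node, the 0-based index of the largest entry in its row of Z."""
    return np.argmax(Z, axis=1)

# Hypothetical membership matrix for 3 nodes and 3 communities.
Z_demo = np.array([[0.90, 0.05, 0.05],   # clearly community 0
                   [0.40, 0.35, 0.25],   # community 0, but only by a narrow margin
                   [0.10, 0.10, 0.80]])  # clearly community 2
labels_demo = assign_communities(Z_demo)
```

The second row shows why sparsity matters: the assignment is still well defined, but the margin between the largest and second-largest entries is small, which is the situation the L1-norm constraint is meant to reduce.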

The L1-norm of a vector is the sum of the absolute values of its elements. Used as a sparsity constraint, it makes the algorithm tend to push some weights to zero during optimization, leaving only a small number of larger values and thereby achieving sparsity. This is because optimization algorithms such as gradient descent apply a gradient to each weight while minimizing the objective function. The L1-norm is non-differentiable at zero, but its gradient has constant magnitude at non-zero points. This implies that a weight whose initial value is non-zero is pushed towards zero at a constant rate during optimization, leading to many small or zero values in the result and highlighting a few larger ones. This makes the results more pronounced, interpretable, and robust. It is particularly useful for feature selection, as it retains only the few features most relevant to the target in the final model and pushes the weights of the other features towards zero. In contrast, the L2-norm is the square root of the sum of squares of all elements in the weight vector; as a constraint, it encourages weights to be distributed across all dimensions, so that each feature contributes somewhat to the predicted value, and prevents excessively large weights, which helps to prevent overfitting. For the result matrix in this article, the goal is to push unimportant weights towards zero, making it easier for nodes to capture the community to which they belong and improving interpretability. Therefore, we apply the L1-norm sparse constraint to the result matrix Z. Specifically, it is expressed as follows:

$$\begin{aligned} \min \mathcal {L}_S=\alpha \sum _{i=1}^n\Vert z_{i\cdot }\Vert _1^2, \quad s.t.\quad Z\ge 0, \end{aligned}$$
(9)

where \(z_{i\cdot }\) represents the ith row of Z, \(\Vert z_{i\cdot }\Vert _1\) represents the L1-norm of \(z_{i\cdot }\), which is the sum of the absolute values of the elements of \(z_{i\cdot }\), and \(\alpha \) is the sparsity parameter, which is used to balance the sparse term and the error between A and \(ZCZ^\top \).

Orthogonal regular L1-norm sparse constrained non-negative matrix factorization model (ORSNMF)

In summary, we combine the community difference model in Eq. (3), the node similarity model in Eq. (4) and the community member matrix sparsity model in Eq. (9) into a single objective function to establish a unified model. The overall objective function of our proposed ORSNMF model is as follows:

$$\begin{aligned} \mathop {\min }\limits _{Z\ge 0,C\ge 0}{} \mathcal {L}= & {} \mathcal {L}_O+\mathcal {L}_R+\mathcal {L}_S\nonumber \\= & {} \Vert A-ZCZ^\top \Vert _F^2+\gamma \Vert C^\top C-I\Vert _F^2\nonumber \\{} & {} +\frac{\lambda }{2} \sum _{i=1}^n \sum _{j=1}^n d(z_i,z_j)S_{ij}+\alpha \sum _{i=1}^n\Vert z_{i\cdot }\Vert _1^2,\nonumber \\ \end{aligned}$$
(10)

By introducing the orthogonality constraint term \(\mathcal {L}_O\), the regularization constraint term \(\mathcal {L}_R\), and the L1-norm sparsity constraint term \(\mathcal {L}_S\), we aim to fully capture the potential relationships among nodes in the network and the inherent properties of communities. These constraints are jointly iteratively learned to obtain improved result matrices.
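
To show how the four terms of Eq. (10) fit together, the sketch below evaluates the objective directly from its definition. It only computes the objective value and does not implement the optimization; the function name `orsnmf_objective` and the toy inputs are hypothetical.

```python
import numpy as np

def orsnmf_objective(A, Z, C, S, gamma, lam, alpha):
    """Evaluate the objective of Eq. (10) for given A, Z, C and similarity matrix S."""
    k = C.shape[0]
    # Reconstruction error ||A - Z C Z^T||_F^2.
    reconstruction = np.linalg.norm(A - Z @ C @ Z.T, "fro") ** 2
    # Orthogonality penalty gamma * ||C^T C - I||_F^2.
    orthogonality = gamma * np.linalg.norm(C.T @ C - np.eye(k), "fro") ** 2
    # Pairwise squared Euclidean distances between rows of Z, as in Eq. (5).
    sq = np.sum(Z ** 2, axis=1)
    dist = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T), 0.0)
    similarity = 0.5 * lam * np.sum(dist * S)
    # Row-wise squared L1-norm sparsity term, as in Eq. (9).
    sparsity = alpha * np.sum(np.sum(np.abs(Z), axis=1) ** 2)
    return reconstruction + orthogonality + similarity + sparsity

# Hypothetical toy case: a perfect two-community partition of 4 nodes.
Z_toy = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
C_toy = np.eye(2)
A_toy = Z_toy @ C_toy @ Z_toy.T   # block structure, reconstructed exactly
S_toy = A_toy.copy()              # similarity only inside communities
value = orsnmf_objective(A_toy, Z_toy, C_toy, S_toy, gamma=1.0, lam=1.0, alpha=0.5)
```

In this idealized case the reconstruction, orthogonality and similarity terms are all zero, and only the sparsity term contributes: each row of `Z_toy` sums to 1, so the value is \(\alpha \cdot n = 0.5 \cdot 4 = 2\).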

Update rules

To optimize the objective function in Eq. (10), we first consider the term \(\mathcal {L}_S\). By definition, \(\Vert z_{i\cdot }\Vert _1^2\) is obtained by summing the absolute values of the elements in the ith row of Z and then squaring the sum. Because the non-negativity constraint of NMF makes every element of Z non-negative, \(\Vert z_{i\cdot }\Vert _1^2\) is simply the square of the sum of the elements in the ith row of Z. Consequently, \(\sum _{i=1}^n \Vert z_{i\cdot }\Vert _1^2\) is calculated by summing the elements of each row of Z, squaring each row sum, and adding up the results. Therefore, we obtain \(\sum _{i=1}^n\Vert z_{i\cdot }\Vert _1^2=tr(ZHZ^\top )\), where \(H \in 1^{k\times k}\) is the \({k\times k}\) matrix whose entries are all 1 (not to be confused with the coefficient matrix H in Eq. (1)).
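
The trace identity above can be checked numerically. The sketch below uses a random non-negative Z as a stand-in; the variable names are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
Z_rand = rng.random((5, 3))    # random non-negative 5 x 3 membership matrix
H_ones = np.ones((3, 3))       # the k x k all-ones matrix H

# Left-hand side: sum over rows of the squared L1-norm.
lhs = np.sum(np.sum(np.abs(Z_rand), axis=1) ** 2)

# Right-hand side: tr(Z H Z^T); the ith diagonal entry is (sum_a Z_ia)^2.
rhs = np.trace(Z_rand @ H_ones @ Z_rand.T)
```

The two quantities agree up to floating-point rounding, because \((ZHZ^\top )_{ii}=\sum _{a}\sum _{b}Z_{ia}Z_{ib}=(\sum _{a}Z_{ia})^2\).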

The \(\mathcal {L}_R\) term in the objective function can be rewritten as

$$\begin{aligned} \mathop {\min }\limits _{Z\ge 0}{} \mathcal {L}_R= & {} \frac{\lambda }{2} \sum _{i=1}^n \sum _{j=1}^n d(z_i,z_j)S_{ij}\nonumber \\= & {} \frac{\lambda }{2} \sum _{i=1}^n \sum _{j=1}^n \Vert z_i-z_j \Vert ^2S_{ij}\nonumber \\= & {} \frac{\lambda }{2} \sum _{i=1}^n \sum _{j=1}^n(z_i-z_j)^\top (z_i-z_j)S_{ij}\nonumber \\= & {} \frac{\lambda }{2} \sum _{i=1}^n \sum _{j=1}^n(z_i^\top z_i+z_j^\top z_j)S_{ij}-\lambda \sum _{i=1}^n \sum _{j=1}^nz_i^\top z_jS_{ij},\nonumber \\ \end{aligned}$$
(11)

Since S is a symmetric matrix, Eq. (11) can be simplified as follows:

$$\begin{aligned} \mathop {\min }\limits _{Z\ge 0}{} \mathcal {L}_R= & {} \lambda \left( \sum _{i=1}^nz_i^\top z_iD_{ii}-\sum _{i=1}^n \sum _{j=1}^nz_i^\top z_jS_{ij}\right) \nonumber \\ {}= & {} \lambda tr(Z^\top DZ)-\lambda tr(Z^\top SZ)\nonumber \\= & {} \lambda tr(Z^\top L_SZ), \end{aligned}$$
(12)

where \(tr(\cdot )\) denotes the trace of a matrix, i.e., the sum of the elements on its main diagonal, \(L_S=D-S\) is the Laplacian matrix, and D is the diagonal matrix whose entry \(D_{ii}\) is the sum of the values in the ith row of S, i.e. \(D_{ii}= \sum _{j=1}^{n}S_{ij}\). In this way, the regularization term built from the similarity matrix S is integrated into the objective function and jointly guides the optimization.
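
The step from Eq. (11) to Eq. (12) can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the matrix sizes, the random data, and the variable names are illustrative only, and \(\lambda \) is left out because it multiplies both sides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3                      # illustrative sizes, not taken from the paper
Z = rng.random((n, k))           # row i plays the role of z_i
S = rng.random((n, n))
S = (S + S.T) / 2                # the derivation assumes S is symmetric

D = np.diag(S.sum(axis=1))       # D_ii = sum_j S_ij
L_S = D - S                      # Laplacian matrix L_S = D - S

# Pairwise form: (1/2) * sum_ij ||z_i - z_j||^2 * S_ij
diff = Z[:, None, :] - Z[None, :, :]
pairwise = 0.5 * np.sum(np.sum(diff ** 2, axis=2) * S)

# Trace form: tr(Z^T L_S Z)
trace_form = np.trace(Z.T @ L_S @ Z)

identity_gap = abs(pairwise - trace_form)
print(identity_gap)              # expected to be at floating-point round-off level
```

If S were not symmetric, the two quantities would generally differ, which is why the symmetry of S is stated before Eq. (12).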

In summary, our objective function can be rewritten as

$$\begin{aligned} \mathop {\min }\limits _{Z\ge 0,C\ge 0}{} \mathcal {L}= & {} \mathcal {L}_O+\mathcal {L}_R+\mathcal {L}_S\nonumber \\= & {} \Vert A-ZCZ^\top \Vert _F^2+\gamma \Vert C^\top C-I\Vert _F^2\nonumber \\{} & {} +\lambda tr(Z^\top DZ)-\lambda tr(Z^\top SZ)+\alpha tr(ZHZ^\top ),\nonumber \\ \end{aligned}$$
(13)

The optimization problem in Eq. (13) is not jointly convex in the variables Z and C, so finding the global minimum is difficult. We therefore use multiplicative update rules to obtain a locally optimal solution. To minimize the objective function in Eq. (13) by gradient descent, we introduce \(\beta \) and \(\theta \) as the Lagrange multipliers for the constraints \(Z\ge 0\) and \(C\ge 0\), and the Lagrangian \(\mathcal {L}\) is defined as

$$\begin{aligned} \mathcal {L}= & {} \Vert A-ZCZ^\top \Vert _F^2+\gamma \Vert C^\top C-I\Vert _F^2+\lambda tr(Z^\top DZ)\nonumber \\{} & {} -\lambda tr(Z^\top SZ)+\alpha tr(ZHZ^\top )+tr(\beta Z^\top )+tr(\theta C^\top )\nonumber \\= & {} tr(A^\top A)-2tr(A^\top ZCZ^\top )+tr(ZC^\top Z^\top ZCZ^\top )\nonumber \\{} & {} +\gamma tr(C^4-2C^2+I)+\lambda tr(Z^\top DZ)-\lambda tr(Z^\top SZ)\nonumber \\{} & {} +\alpha tr(ZHZ^\top )+tr(\beta Z^\top )+tr(\theta C^\top ),\nonumber \\ \end{aligned}$$
(14)

Taking the partial derivatives of \(\mathcal {L}\) with respect to C and Z, respectively, gives

$$\begin{aligned} \frac{\partial \mathcal {L}}{\partial C}&=-2Z^\top AZ+2Z^\top ZCZ^\top Z+4\gamma C^3-4\gamma C+\theta ,&\nonumber \\ \end{aligned}$$
(15)
$$\begin{aligned} \frac{\partial \mathcal {L}}{\partial Z}&=2(ZCZ^\top ZC^\top +ZC^\top Z^\top ZC)+2\alpha ZH+\beta&\nonumber \\&\quad -2(A^\top ZC+AZC^\top )+2\lambda (D^\top Z-S^\top Z),&\end{aligned}$$
(16)

According to the KKT conditions, \(\beta _{ir}z_{ir}=0\) and \(\theta _{rj}c_{rj}=0\), which gives

$$\begin{aligned}{} & {} {[}Z^\top ZCZ^\top Z]_{rj}c_{rj}+2\gamma (C^3)_{rj}c_{rj}-2\gamma (C)_{rj}c_{rj} \nonumber \\{} & {} \quad -(Z^\top AZ)_{rj}c_{rj}=0, \end{aligned}$$
(17)
$$\begin{aligned}{} & {} (ZCZ^\top ZC^\top +ZC^\top Z^\top ZC)_{ir}z_{ir}+\lambda (D^\top Z-S^\top Z)_{ir}z_{ir}\nonumber \\{} & {} \quad +\alpha (ZH)_{ir}z_{ir} -(A^\top ZC+AZC^\top )_{ir}z_{ir}=0, \end{aligned}$$
(18)

Similar to the basic NMF, the multiplicative update rule of the objective function can be obtained:

$$\begin{aligned}{} & {} c_{rj}\leftarrow c_{rj} \frac{[Z^\top AZ+2\gamma C]_{rj}}{[Z^\top ZCZ^\top Z+2\gamma C^3]_{rj}}, \end{aligned}$$
(19)
$$\begin{aligned}{} & {} z_{ir}\leftarrow z_{ir} \frac{[A^\top ZC+AZC^\top +\lambda S^\top Z]_{ir}}{[ZCZ^\top ZC^\top +ZC^\top Z^\top ZC+\lambda D^\top Z+\alpha ZH]_{ir}},\nonumber \\ \end{aligned}$$
(20)

The proof of convergence is given in Appendix A.
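
One pass of Eqs. (19) and (20) can be sketched as follows. This is an illustrative NumPy version, not the authors' Matlab code. The function name `orsnmf_step`, the small constant `eps` added to the denominators to avoid division by zero, and the choice to update C before Z within a pass are assumptions of this sketch. \(C^3\) is computed as the matrix product `C @ C @ C`, following Eq. (19) as printed.

```python
import numpy as np

def orsnmf_step(A, S, Z, C, gamma, lam, alpha, eps=1e-12):
    """Apply the multiplicative updates of Eqs. (19) and (20) once."""
    k = Z.shape[1]
    H = np.ones((k, k))                 # all-ones matrix from the sparsity term
    D = np.diag(S.sum(axis=1))          # degree matrix of the similarity matrix S

    # Eq. (19): update C with Z held fixed
    ZtZ = Z.T @ Z
    num_C = Z.T @ A @ Z + 2 * gamma * C
    den_C = ZtZ @ C @ ZtZ + 2 * gamma * (C @ C @ C)
    C = C * num_C / (den_C + eps)

    # Eq. (20): update Z with the new C held fixed (ordering is an assumption)
    num_Z = A.T @ Z @ C + A @ Z @ C.T + lam * (S.T @ Z)
    den_Z = (Z @ C @ ZtZ @ C.T + Z @ C.T @ ZtZ @ C
             + lam * (D.T @ Z) + alpha * (Z @ H))
    Z = Z * num_Z / (den_Z + eps)
    return Z, C
```

Because every factor in the numerators and denominators is non-negative, the updates keep Z and C non-negative as long as the initial matrices are non-negative.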

Overall procedure of ORSNMF algorithm

Given the adjacency matrix A of a directed network, the similarity matrix S, the factorization rank (number of communities) k and the stop criterion, we first use the modified non-negative double singular value decomposition (MNNDSVD) [30] initialization to obtain the initial factors \(Z_0\) and \(C_0\). We then apply the multiplicative update rules in Eqs. (19) and (20) to update Z and C until the stop criterion is met, and finally return Z and C. The procedure is summarized in Algorithm 1.

Algorithm 1 ORSNMF algorithm
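
The outer loop of the procedure can be sketched as below. Several parts are assumptions of this sketch rather than details taken from the paper: a random non-negative start is used in place of MNNDSVD, the stop criterion is a small relative change in the objective of Eq. (13) or a maximum number of iterations, hard community labels are read off as the largest entry in each row of Z, and `update` is a hypothetical callable that applies Eqs. (19) and (20) once and returns the new Z and C.

```python
import numpy as np

def orsnmf_objective(A, S, Z, C, gamma, lam, alpha):
    """Value of the objective in Eq. (13) for the current factors."""
    k = C.shape[0]
    L_S = np.diag(S.sum(axis=1)) - S
    fit = np.linalg.norm(A - Z @ C @ Z.T, "fro") ** 2
    orth = gamma * np.linalg.norm(C.T @ C - np.eye(k), "fro") ** 2
    smooth = lam * np.trace(Z.T @ L_S @ Z)
    sparse = alpha * np.sum(Z.sum(axis=1) ** 2)   # equals alpha * tr(Z H Z^T)
    return fit + orth + smooth + sparse

def run_orsnmf(A, S, k, update, gamma, lam, alpha,
               max_iter=500, tol=1e-6, seed=0):
    """Iterate `update` until the objective stops changing (assumed criterion)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Z = rng.random((n, k))     # random start; the paper uses MNNDSVD instead
    C = rng.random((k, k))
    prev = orsnmf_objective(A, S, Z, C, gamma, lam, alpha)
    for _ in range(max_iter):
        Z, C = update(A, S, Z, C, gamma, lam, alpha)
        cur = orsnmf_objective(A, S, Z, C, gamma, lam, alpha)
        if abs(prev - cur) <= tol * max(prev, 1.0):
            break
        prev = cur
    labels = Z.argmax(axis=1)  # non-overlapping assignment (assumption)
    return Z, C, labels
```

Tracking the objective value gives a simple way to monitor the run; the tolerance and iteration cap are placeholders that would need tuning for a given network.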

Experiment and analysis

In this section, we conduct experiments on both synthetic and real data sets to demonstrate the effectiveness of the proposed algorithm for community detection in directed networks. We compare the proposed ORSNMF algorithm with SGNMF [32], CFS [33], SNCMF [38], HPNMF [9], RANMF [30], ANMF [3], Spectral clustering [39] and NCut [40] for community detection. All experiments were performed in Matlab.

Comparative algorithms

  • ORSNMF: ORSNMF is the model proposed in this study. It is abbreviated as ORSNMF in Table 1 and as ORSNMFR and ORSNMFM in Table 2, with the final letters R and M representing random initialization and MNNDSVD initialization, respectively.

  • SGNMF: Liu et al. [32] proposed a symmetry and graph-regularized non-negative matrix factorization (SGNMF) method, leveraging multiple latent factor matrices to represent a large-scale undirected network, thereby enhancing its representation learning ability. In Tables 1 and 2, it is abbreviated as SGNMF.

  • CFS: Luo et al. [33] proposed a constraints fusion-induced symmetric non-negative matrix factorization (CFS) model, incorporating a symmetry-regularizer that preserves the symmetry of the learnt low-rank approximation to the adjacency matrix into the loss function, thus making the resultant detector well-aware of the target network’s symmetry. In Tables 1 and 2, it is abbreviated as CFS.

  • SNCMF: Yuan et al. [38] proposed a symmetric and non-negative constrained matrix factorization (SNCMF) community detection model based on undirected networks. This model introduces a graph regularization term to preserve the intrinsic geometric local invariance of the network, allowing the implemented detector to gain a comprehensive understanding of the community structure within the target network. In Tables 1 and 2, it is abbreviated as SNCMF.

  • HPNMF: Ye et al. [9] proposed a homophily preserving NMF (HPNMF). This method models the network’s link topology while also capturing the homogeneity of network nodes to better reflect the community structure. In Tables 1 and 2, it is abbreviated as HPNMF.

  • RANMF: Tosyali et al. [30] proposed a regularized asymmetric non-negative matrix factorization (RANMF) algorithm. In a given directed network, RANMF utilizes the pairwise similarity of nodes, guided by network prior information, to assign similar nodes to the same cluster. In Table 1, it is abbreviated as RANMF, and in Table 2, it is abbreviated as RANMFR and RANMFM, with the final letters R and M representing random initialization and MNNDSVD initialization, respectively.

  • ANMF: Wang et al. [3] proposed an asymmetric non-negative matrix factorization (ANMF) method for detecting communities in directed networks. Due to the asymmetry of the adjacency matrix and the weight matrix, the resulting matrix is not forcefully constrained. Instead, normalization of the result matrix is achieved by passing a diagonal matrix between the result matrix and the weight matrix. This approach aims to enhance the effectiveness of community detection. In Table 1, it is abbreviated as ANMF, and in Table 2, it is abbreviated as ANMFR and ANMFM, with the final letters R and M representing random initialization and MNNDSVD initialization, respectively.

  • Spect: Hespanha et al. [39] proposed a spectral decomposition-based graph partitioning algorithm, closely related to the Markov chain state aggregation algorithm introduced by Phillips and Kokotović [41]. This algorithm can be applied to the field of community detection to assess its effectiveness. In Tables 1 and 2, it is abbreviated as Spect.

  • NCut: Shi et al. [40] proposed a method based on perceptual grouping. This method focuses not on local features of the problem but extracts information globally, subsequently introducing a normalized cut criterion. The aim is to measure the overall dissimilarity between different groups and the overall similarity within groups. This method can be applied to community detection and, to some extent, enhances the detection effectiveness. In Tables 1 and 2, it is abbreviated as Ncut.

Data sets

We first compare the various clustering algorithms mentioned above on the LFR synthetic graphs. In the LFR synthetic graphs, the complexity of the network topology is controlled by the mixing parameter \(\mu \), which controls the connections between communities. The larger the mixing parameter, the more densely the communities are connected to each other, the more complex the network topology, and the more difficult community detection becomes.

In addition, we selected the World Wide Knowledge Base (WebKB) data set, a real-world data set, to test our proposed algorithm. This data set has more connections between communities, which increases the complexity of community detection and provides a more demanding test of our proposed algorithm. It contains web page hyperlink information collected from four universities: Cornell, Wisconsin, Texas, and Washington. In the network, nodes represent web pages and directed edges represent hyperlinks between web pages. The web pages are divided into 5 categories: students, courses, staff, projects, and teachers.

Evaluation indicators

There are various methods for evaluating and comparing differences between algorithms [42]. In this study, to accurately assess the effectiveness of clustering algorithms, we adopted three evaluation metrics, namely Jaccard similarity, normalized mutual information (NMI) and accuracy [15].

The Jaccard similarity is used here to compare the predicted results with the true results. It is the ratio of the number of common elements in the predicted and true results to the total number of elements in their union, as defined by Eq. (21):

$$\begin{aligned} j(PL,TL)=\frac{\vert PL \cap TL\vert }{\vert PL \cup TL\vert }=\frac{\vert PL \cap TL\vert }{\vert PL \vert +\vert TL\vert -\vert PL \cap TL\vert }. \end{aligned}$$
(21)

where PL represents the predicted results and TL represents the true results. Both PL and TL are \(1 \times n\) vectors, where n is the number of samples, and they store the predicted categories and true categories of each sample, respectively. \(\vert PL \cap TL\vert \) is the number of samples whose predicted category matches the true category, and \(\vert PL \cup TL\vert \) is the size of the union of the two results, computed as \(\vert PL \vert +\vert TL\vert -\vert PL \cap TL\vert \).
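
Read literally, Eq. (21) can be computed from two label vectors as in the sketch below. The function name `jaccard_similarity` is hypothetical, and the sketch assumes that the predicted labels have already been aligned with the true labels, so that equal entries at the same position count as a match.

```python
def jaccard_similarity(pl, tl):
    """Eq. (21) for two label vectors of equal length n."""
    if len(pl) != len(tl) or len(pl) == 0:
        raise ValueError("PL and TL must be non-empty and of equal length")
    # |PL ∩ TL|: samples whose predicted label equals the true label
    matches = sum(1 for p, t in zip(pl, tl) if p == t)
    # |PL ∪ TL| = |PL| + |TL| - |PL ∩ TL|
    return matches / (len(pl) + len(tl) - matches)

# Four of five samples agree, so the value is 4 / (5 + 5 - 4) = 2/3.
print(jaccard_similarity([0, 0, 1, 1, 2], [0, 0, 1, 2, 2]))
```

With this reading, a perfect prediction gives n / (2n - n) = 1 and a prediction with no matches gives 0.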

Fig. 1 On the LFR data set, with \(\alpha \) set to 0, the impact of the parameter \(\gamma \) in the ORSNMF model was analyzed by varying it from \(10^{-3}\) to \(10^3\). Panels a, b, and c show the results obtained using the adjacency matrix, Katz similarity, and cosine similarity, respectively, as the similarity matrix

Fig. 2 With \(\gamma \) fixed at the optimal value obtained in Fig. 1, the impact of the parameter \(\alpha \) in the ORSNMF model was analyzed by varying it from \(10^{-3}\) to \(10^{-1}\). Panels a, b, and c show the results obtained using the adjacency matrix, Katz similarity, and cosine similarity, respectively, as the similarity matrix

Table 1 Performance comparison on the LFR synthetic networks (bold numbers represent best results)

Fig. 3 Visual comparison results of the conducted experiments on the LFR data set, corresponding to Table 1

NMI is an external measure used to judge the quality of clustering by measuring the similarity of two clustering results. It is calculated as in Eq. (22):

$$\begin{aligned} \text {nmi}(C,C_T)=\frac{I(C,C_T)}{\sqrt{H(C)H(C_T)}}, \end{aligned}$$
(22)

where \(C=\{C_1,C_2,\ldots ,C_k\}\) is the set of k clusters obtained after clustering, each containing the samples assigned to it. \(C_T=\{C_{T1},C_{T2},\ldots ,C_{Tk}\}\) represents the true classes, with each \(C_{Ti}\) containing all the samples belonging to that class. \(I(C,C_T)\) represents the mutual information between C and \(C_T\), and H(C) and \(H(C_T)\) represent the entropy of C and \(C_T\), respectively.
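
Eq. (22) can be evaluated directly from two label vectors, as in the following standard-library sketch. The function names are hypothetical, natural logarithms are used (the base cancels in the ratio), and returning 0.0 when either labeling has zero entropy is a convention chosen for this sketch only.

```python
from collections import Counter
from math import log, sqrt

def entropy(labels):
    """Shannon entropy H of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(pred, true):
    """Mutual information I between two labelings of the same samples."""
    n = len(pred)
    count_pred, count_true = Counter(pred), Counter(true)
    joint = Counter(zip(pred, true))
    return sum((nij / n) * log(n * nij / (count_pred[p] * count_true[t]))
               for (p, t), nij in joint.items())

def nmi(pred, true):
    """Eq. (22): I(C, C_T) / sqrt(H(C) * H(C_T))."""
    if len(pred) != len(true) or len(pred) == 0:
        raise ValueError("labelings must be non-empty and of equal length")
    denom = sqrt(entropy(pred) * entropy(true))
    return mutual_information(pred, true) / denom if denom > 0 else 0.0

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition with labels swapped
print(nmi([0, 1, 0, 1], [0, 0, 1, 1]))  # statistically independent partitions
```

Unlike accuracy, NMI does not depend on how the cluster labels are named, so no label matching step is needed before computing it.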

The accuracy is used to compare the obtained labels with the true labels provided by the original data, and is specifically defined as Eq. (23):

$$\begin{aligned} \text {acc}(\text {TL, PL})=\sum _{i=1}^n \frac{\delta (\text {TL}_i,\text {map}(\text {PL}_i))}{n}, \end{aligned}$$
(23)

where \(\text {PL}_i\) is the label assigned to sample i by clustering; \(\text {TL}_i\) is its ground truth label; n is the total number of data samples; and \(\text {map}(\cdot )\) is the optimal one-to-one mapping from cluster labels to class labels, which makes the two labelings comparable. This mapping can be found by the Hungarian algorithm, which solves the label assignment problem in polynomial time. \(\delta \) is the indicator function, which is defined as Eq. (24):

$$\begin{aligned} \delta (x,y)=\left\{ \begin{array}{ll} 1 &{}\quad \text {if } \quad x=y\\ 0 &{}\quad \text {otherwise} \\ \end{array} \right. . \end{aligned}$$
(24)
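
Eqs. (23) and (24) can be illustrated with the sketch below. To stay within the standard library, the best mapping is found by exhaustive search over permutations rather than by the Hungarian algorithm, so it is only practical for a handful of clusters. The function name `clustering_accuracy` is hypothetical.

```python
from collections import Counter
from itertools import permutations

def clustering_accuracy(tl, pl):
    """Eq. (23): best fraction of matches over one-to-one label mappings."""
    if len(tl) != len(pl) or len(tl) == 0:
        raise ValueError("TL and PL must be non-empty and of equal length")
    pred_ids = sorted(set(pl))
    true_ids = sorted(set(tl))
    # Pad with None so every predicted cluster can receive a distinct target
    targets = true_ids + [None] * max(0, len(pred_ids) - len(true_ids))
    overlap = Counter(zip(pl, tl))      # overlap[(p, t)] = samples with PL=p, TL=t
    best = 0
    for perm in permutations(targets, len(pred_ids)):
        # Number of samples with delta(TL_i, map(PL_i)) = 1 under this mapping
        hits = sum(overlap[(p, t)] for p, t in zip(pred_ids, perm))
        best = max(best, hits)
    return best / len(tl)

print(clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0]))  # relabeled copy
print(clustering_accuracy([0, 0, 1, 1], [0, 1, 1, 1]))              # one sample misplaced
```

For larger numbers of communities, the same assignment problem can be solved in polynomial time, for example with `scipy.optimize.linear_sum_assignment` applied to the negated overlap counts.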

The larger the value of the above three evaluation indicators, the better the clustering performance.

Experimental results

We conduct multiple experiments on the LFR networks data set and the World Wide Knowledge Base (WebKB) data set (Cornell, Wisconsin, Texas, and Washington). In addition to our proposed ORSNMF algorithm, eight algorithms are included in the comparative experiments: SGNMF, CFS, SNCMF, HPNMF, RANMF, ANMF, Spectral clustering and NCut.

LFR networks

For the LFR networks data set, we use \(\mu =0.5\) [43], \(\vert V\vert =1000\), \(\vert E\vert =15{,}249\), \(k=33\) to create a community network structure. We use adjacency similarity, Katz centrality, and cosine similarity in turn as the similarity matrix of the LFR networks data set to test their performance. Using \(\lambda =0.1\) [30], we set different values for the parameters \(\gamma \) and \(\alpha \) to test the algorithm. Following the parameter setting method proposed by Ye et al. [9], we test the parameters over the range \(\{10^{-3},10^{-2},10^{-1},10^0,10^1,10^2,10^3\}\) and keep the values that give the better results. We first evaluate the effect of \(\gamma \) on the model by fixing \(\alpha \) to 0, as shown in Fig. 1.

As can be seen from Fig. 1, no matter which method is used to calculate the similarity matrix, the three evaluation indexes of the model gradually increase with \(\gamma \) when \(\gamma \le 10\), and then tend to be stable. These results show that the algorithm is sensitive to the parameter \(\gamma \) to a certain extent, so community differences need to be considered to obtain more effective results. We choose the better value among them as the final value of the parameter \(\gamma \): \(\gamma =0.1\) when using adjacency similarity as the similarity matrix, \(\gamma =0.5\) when using Katz centrality, and \(\gamma =0.1\) when using cosine similarity.

We then evaluate the effect of \(\alpha \) on clustering by fixing \(\gamma \) at the optimal value obtained above. The results are shown in Fig. 2.

Clustering performance drops sharply after \(\alpha =0.1\), so the result graphs only show values of \(\alpha \) up to 0.1. It can be seen from these results that no matter which method is selected to compute the similarity matrix, the effect is the best when \(\alpha =0.1\). For this data set, we therefore take \(\alpha =0.1\).
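
The two-stage parameter selection described above can be sketched as follows. Here `evaluate` is a hypothetical callable that runs the model for a given pair of \(\gamma \) and \(\alpha \) and returns a single score to maximize; collapsing the three evaluation metrics into one score, and restricting the search strictly to the grid, are assumptions of this sketch.

```python
# Candidate values 10^-3, 10^-2, ..., 10^3
GRID = [10.0 ** e for e in range(-3, 4)]

def two_stage_search(evaluate, grid=GRID):
    """Pick gamma with alpha fixed to 0, then pick alpha with gamma fixed."""
    # Stage 1: vary gamma while alpha = 0
    best_gamma = max(grid, key=lambda g: evaluate(g, 0.0))
    # Stage 2: vary alpha while gamma stays at the value found in stage 1
    best_alpha = max(grid, key=lambda a: evaluate(best_gamma, a))
    return best_gamma, best_alpha
```

This needs only 14 model runs per similarity matrix instead of the 49 required by a full two-dimensional grid, at the cost of ignoring interactions between the two parameters.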

In addition to using the above-obtained parameter values to carry out this algorithm experiment, a comparison experiment with SGNMF, CFS, SNCMF, HPNMF, RANMF, ANMF, Spectral clustering and NCut algorithms was also carried out. The specific results are shown in Table 1, and the corresponding visualization results are shown in Fig. 3.

It can be seen from Table 1 and Fig. 3 that no matter which similarity metric is used, our proposed algorithm achieves better results than the other algorithms. Bold values in Table 1 mark the best results.

WebKB data set

In addition to comparative experiments on the LFR networks data set, we also conduct comparative experiments on the World Wide Knowledge Base (WebKB) data set (Cornell, Wisconsin, Texas, and Washington). Cornell contains 195 nodes and 304 directed edges, Texas contains 187 nodes and 328 directed edges, Wisconsin contains 265 nodes and 530 directed edges, and Washington contains 230 nodes and 446 directed edges. Cosine similarity was used as the similarity matrix for the Cornell and Texas data sets, Katz centrality for the Washington data set, and the adjacency matrix for the Wisconsin data set. The parameters \(\lambda \), \(\gamma \) and \(\alpha \) are selected in the same way as above: we set \(\lambda =0.1\) and \(\alpha =0\) to evaluate the influence of \(\gamma \), and the results are shown in Fig. 4.

Fig. 4 Panels a, b, c, and d depict the effectiveness of the ORSNMF model on the Cornell, Texas, Washington, and Wisconsin data sets, respectively, where \(\alpha \) is set to 0 and the parameter \(\gamma \) varies from \(10^{-3}\) to \(10^{3}\)

It can be concluded from Fig. 4 that suitable values are \(\gamma =0.001\) for the Cornell data set, \(\gamma =0.001\) for the Texas data set, \(\gamma =0.1\) for the Washington data set, and \(\gamma =0.1\) for the Wisconsin data set. We then fix \(\gamma \) at these values to test the effect of \(\alpha \), and the results are shown in Fig. 5.

Fig. 5 Panels a, b, c, and d depict the effectiveness of the ORSNMF model on the Cornell, Texas, Washington, and Wisconsin data sets, respectively, with \(\gamma \) fixed at the optimal value obtained in Fig. 4 and the parameter \(\alpha \) varying from \(10^{-3}\) to \(10^{3}\)

It can be seen from Fig. 5 that performance on the Cornell data set reaches its best value at \(\alpha =10\) and then starts to fall, so we take \(\alpha =10\). On the Texas data set, performance increases slowly up to \(\alpha =0.1\) and declines afterwards, so we take \(\alpha =0.1\). On the Washington data set, the model remains effective up to \(\alpha =0.1\) and starts to decline after that, so we take \(\alpha =0.1\). On the Wisconsin data set, where the adjacency matrix is used as the similarity matrix, performance starts to drop after \(\alpha =0.1\), so we take \(\alpha =0.1\).

After selecting all parameters, we compare ORSNMF with the eight algorithms SGNMF, CFS, SNCMF, HPNMF, RANMF, ANMF, Spectral clustering and NCut. The results are shown in Table 2, and the corresponding visualization results are shown in Fig. 6. The bold values in Table 2 indicate the best performance. As can be seen from the table, the performance of our proposed algorithm is better than that of the other algorithms.

Table 2 Performance comparison on the WebKB networks with ground-truth communities (bold numbers represent best results)

Fig. 6 Panels a, b, c, and d depict the visualized comparative experimental results on the Cornell, Texas, Washington, and Wisconsin data sets, respectively, corresponding to Table 2

Conclusion

This study addresses the community detection problem by proposing a new method, ORSNMF, within the fundamental framework of NMF. This method models the directed network topology, community distinctiveness, node homophily, and sparsity in the community membership matrix. We transform the objective of this model into an optimization problem, develop an efficient learning algorithm, and obtain a multiplicative update method to solve it. We conduct extensive experiments on both synthetic and real networks to demonstrate the superiority of the proposed model.

While the model proposed in this study demonstrates good performance, its primary focus is on directed, unweighted, and non-overlapping networks. In reality, overlapping and dynamic networks are prevalent. In community networks, samples or individuals often exhibit multiplicity, so that a sample can be assigned to multiple community categories. Such networks are known as overlapping networks. For example, an individual can simultaneously enjoy watching movies and playing basketball, and mechanically classifying such an individual solely into the movie-watching community would be overly simplistic. Detecting communities in overlapping networks is therefore crucial. Due to the diversity of community networks, it is challenging to set a uniform number of attribution categories and uniform probability threshold values for all of them. In addition, the structural attributes of a network change constantly over time: the number of citations of a paper increases, and relationships between users are established and dissolved. The dynamic network is therefore more general and more widely applicable than the static network, and community detection in dynamic networks can not only effectively delineate the community membership of the nodes, but also predict the development trend of the network. In future research, the emphasis will be on studying overlapping networks and dynamic networks. Furthermore, a variety of validation methods will be employed to highlight significant statistical differences among different algorithms [42], demonstrating the superiority