Abstract
We consider the problem of community detection in overlapping weighted networks, where nodes can belong to multiple communities and edge weights can be any finite real numbers. To model such networks, we propose a general framework, the mixed membership distribution-free (MMDF) model. MMDF places no distributional constraint on edge weights and can be viewed as a generalization of several previous models, including the well-known mixed membership stochastic blockmodel. In particular, overlapping signed networks with latent community structure can also be generated from our model. We use an efficient spectral algorithm with a theoretical guarantee on its convergence rate to estimate community memberships under the model. We also propose the fuzzy weighted modularity to evaluate the quality of community detection for overlapping weighted networks with positive and negative edge weights. We then provide a method to determine the number of communities in weighted networks by taking advantage of the fuzzy weighted modularity. Numerical simulations and real data applications demonstrate the usefulness of the MMDF model and the fuzzy weighted modularity.
References
Fortunato S (2010) Community detection in graphs. Phys Rep 486(3–5):75–174
Fortunato S, Hric D (2016) Community detection in networks: a user guide. Phys Rep 659:1–44
Papadopoulos S, Kompatsiaris Y, Vakali A, Spyridonos P (2012) Community detection in social media. Data Min Knowl Disc 24(3):515–554
Newman MEJ (2004) Analysis of weighted networks. Phys Rev E 70(5):056131
Goldenberg A, Zheng AX, Fienberg SE, Airoldi EM (2010) A survey of statistical network models. Found Trends® Mach Learn 2(2):129–233
Holland PW, Laskey KB, Leinhardt S (1983) Stochastic blockmodels: first steps. Soc Netw 5(2):109–137
Abbe E (2017) Community detection and stochastic block models: recent developments. J Mach Learn Res 18(1):6446–6531
Xie J, Kelley S, Szymanski BK (2013) Overlapping community detection in networks: the state-of-the-art and comparative study. ACM Comput Surv (CSUR) 45(4):1–35
Airoldi EM, Blei DM, Fienberg SE, Xing EP (2008) Mixed membership stochastic blockmodels. J Mach Learn Res 9:1981–2014
Karrer B, Newman MEJ (2011) Stochastic blockmodels and community structure in networks. Phys Rev E 83(1):016107
Zhang Y, Levina E, Zhu J (2020) Detecting overlapping communities in networks using spectral methods. SIAM J Math Data Sci 2(2):265–283
Jin J, Ke ZT, Luo S (2023) Mixed membership estimation for social networks. J Econom. https://doi.org/10.1016/j.jeconom.2022.12.003
Rohe K, Chatterjee S, Yu B (2011) Spectral clustering and the high-dimensional stochastic blockmodel. Ann Stat 39(4):1878–1915
Choi DS, Wolfe PJ, Airoldi EM (2011) Stochastic blockmodels with a growing number of classes. Biometrika 99(2):273–284
Lei J, Rinaldo A (2015) Consistency of spectral clustering in stochastic block models. Ann Stat 43(1):215–237
Abbe E, Sandon C (2015) Community detection in general stochastic block models: fundamental limits and efficient algorithms for recovery. In: 2015 IEEE 56th annual symposium on foundations of computer science, pp 670–688
Jin J (2015) Fast community detection by SCORE. Ann Stat 43(1):57–89
Joseph A, Yu B (2016) Impact of regularization on spectral clustering. Ann Stat 44(4):1765–1791
Abbe E, Bandeira AS, Hall G (2016) Exact recovery in the stochastic block model. IEEE Trans Inf Theory 62(1):471–487
Chen Y, Li X, Xu J (2018) Convexified modularity maximization for degree-corrected stochastic block models. Ann Stat 46(4):1573–1602
Mao X, Sarkar P, Chakrabarti D (2020) Estimating mixed memberships with sharp eigenvector deviations. J Am Stat Assoc 116(536):1928–1940
Qing H, Wang J (2023) Regularized spectral clustering under the mixed membership stochastic block model. Neurocomputing 550:126490
Watts DJ, Strogatz SH (1998) Collective dynamics of ‘small-world’ networks. Nature 393(6684):440–442
Opsahl T, Panzarasa P (2009) Clustering in weighted networks. Soc Netw 31(2):155–163
Colizza V, Pastor-Satorras R, Vespignani A (2007) Reaction-diffusion processes and metapopulation models in heterogeneous networks. Nat Phys 3(4):276–282
Opsahl T, Colizza V, Panzarasa P, Ramasco JJ (2008) Prominence and control: the weighted rich-club effect. Phys Rev Lett 101(16):168702
Liu X, Bollen J, Nelson ML, Sompel H (2005) Co-authorship networks in the digital library research community. Inf Process Manag 41(6):1462–1480
Read KE (1954) Cultures of the central highlands, New Guinea. Southwest J Anthropol 10(1):1–43
Yang B, Cheung W, Liu J (2007) Community mining from signed social networks. IEEE Trans Knowl Data Eng 19(10):1333–1348
Kunegis J, Lommatzsch A, Bauckhage C (2009) The Slashdot zoo: mining a social network with negative edges. In: Proceedings of the 18th international conference on World Wide Web, pp 741–750
Tang J, Chang Y, Aggarwal C, Liu H (2016) A survey of signed network mining in social media. ACM Comput Surv (CSUR) 49(3):1–37
Brandes U, Kenis P, Lerner J, Van Raaij D (2009) Network analysis of collaboration structure in Wikipedia. In: Proceedings of the 18th international conference on World Wide Web, pp 731–740
Kunegis J (2013) Konect: the Koblenz network collection. In: Proceedings of the 22nd international conference on World Wide Web, pp 1343–1350
Aicher C, Jacobs AZ, Clauset A (2015) Learning latent block structure in weighted networks. J Complex Netw 3(2):221–248
Palowitch J, Bhamidi S, Nobel AB (2018) Significance-based community detection in weighted networks. J Mach Learn Res 18(188):1–48
Xu M, Jog V, Loh P-L (2020) Optimal rates for community estimation in the weighted stochastic block model. Ann Stat 48(1):183–204
Ng TLJ, Murphy TB (2021) Weighted stochastic block model. Statist Methods Appl 30:1365–1398
Qing H (2023) Distribution-free model for community detection. Prog Theor Exp Phys 2023(3):033A01
Qing H, Wang J (2023) Community detection for weighted bipartite networks. Knowl-Based Syst 274:110643
Airoldi EM, Wang X, Lin X (2013) Multi-way blockmodels for analyzing coordinated high-dimensional responses. Ann Appl Stat 7(4):2431–2457
Mao X, Sarkar P, Chakrabarti D (2018) Overlapping clustering models, and one (class) SVM to bind them all. Adv Neural Inf Process Syst 31:2126–2136
Dulac A, Gaussier E, Largeron C (2020) Mixed-membership stochastic block models for weighted networks. In: Conference on uncertainty in artificial intelligence (UAI), vol. 124, pp 679–688
Erdős P, Rényi A (1960) On the evolution of random graphs. Publ Math Inst Hung Acad Sci 5(1):17–60
Lancichinetti A, Fortunato S, Radicchi F (2008) Benchmark graphs for testing community detection algorithms. Phys Rev E 78(4):046110
Lancichinetti A, Fortunato S (2009) Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities. Phys Rev E 80(1):016118
Gillis N, Vavasis SA (2015) Semidefinite programming based preconditioning for more robust near-separable nonnegative matrix factorization. SIAM J Optim 25(1):677–698
Newman ME, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69(2):026113
Newman MEJ (2006) Modularity and community structure in networks. Proc Natl Acad Sci USA 103(23):8577–8582
Gómez S, Jensen P, Arenas A (2009) Analysis of community structure in networks of correlated data. Phys Rev E 80(1):016114
Nepusz T, Petróczi A, Négyessy L, Bazsó F (2008) Fuzzy communities and the concept of Bridgeness in complex networks. Phys Rev E 77(1):016107
Strehl A, Ghosh J (2002) Cluster ensembles: a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3(Dec):583–617
Danon L, Diaz-Guilera A, Duch J, Arenas A (2005) Comparing community structure identification. J Stat Mech: Theory Exp 2005(09):P09008
Bagrow JP (2008) Evaluating local community methods in networks. J Stat Mech: Theory Exp 2008(05):P05001
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218
Vinh NX, Epps J, Bailey J (2009) Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: Proceedings of the 26th annual international conference on machine learning, pp 1073–1080
Mao X, Sarkar P, Chakrabarti D (2017) On mixed memberships and symmetric nonnegative matrix factorizations, pp 2324–2333
Le CM, Levina E (2022) Estimating the number of communities by spectral methods. Electron J Stat 16(1):3315–3342
Lancichinetti A, Fortunato S, Kertész J (2009) Detecting the overlapping and hierarchical community structure in complex networks. New J Phys 11(3):033015
Zachary WW (1977) An information flow model for conflict and fission in small groups. J Anthropol Res 33(4):452–473
Ferligoj A, Kramberger A (1996) An analysis of the Slovene parliamentary parties network. Dev Stati Method 12:209–216
Hayes B (2006) Connecting the dots. Am Sci 94(5):400–404
Knuth DE (1993) The Stanford GraphBase: a platform for combinatorial computing, vol 37. Addison-Wesley Reading, New York
Adamic LA, Glance N (2005) The political blogosphere and the 2004 US election: divided they blog, pp 36–43
Opsahl T (2011) Why Anchorage is not (that) important: binary ties and sample selection. [online] http://toreopsahl.com
Newman ME (2001) The structure of scientific collaboration networks. Proc Natl Acad Sci 98(2):404–409
Zhang H, Guo X, Chang X (2022) Randomized spectral clustering in large-scale stochastic block models. J Comput Graph Stat 31(3):887–906
Tropp JA (2012) User-friendly tail bounds for sums of random matrices. Found Comput Math 12(4):389–434
Cape J, Tang M, Priebe CE (2019) The two-to-infinity norm and singular subspace geometry with applications to high-dimensional statistics. Ann Stat 47(5):2405–2439
Chen Y, Chi Y, Fan J, Ma C (2021) Spectral methods for data science: a statistical perspective. Found Trends® Mach Learn 14(5):566–806
Acknowledgements
Wang’s work was supported by the Fundamental Research Funds for the Central Universities, Nankai University, 63231186, and the National Natural Science Foundation of China (Grants 12001295 and 12271272).
Author information
Contributions
HQ was involved in conceptualization, methodology, investigation, software, formal analysis, data curation, writing—original draft, writing—reviewing and editing. JW helped in writing—reviewing and editing, funding acquisition.
Ethics declarations
Conflicts of interest
The authors declare no conflict of interest.
Appendices
Appendix A Vertex hunting algorithm
Algorithm 2 is the successive projection (SP) algorithm.
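Since the algorithm box itself is not reproduced here, the following is a minimal Python sketch of the standard successive projection procedure for vertex hunting, assuming that this is what SP denotes. The function name `successive_projection` and the toy data are illustrative and are not taken from the paper. At each step the row with the largest Euclidean norm is selected, and all rows are then projected onto the orthogonal complement of that row.

```python
import numpy as np

def successive_projection(Y, K):
    """Return the indices of K rows of Y picked by successive projection.

    Y is an (n, m) array whose rows are assumed to lie in the convex hull
    of K linearly independent "corner" rows (illustrative assumption).
    """
    R = np.array(Y, dtype=float)          # residual matrix, one point per row
    index_set = []
    for _ in range(K):
        norms = np.linalg.norm(R, axis=1)
        k = int(np.argmax(norms))         # row with the largest residual norm
        index_set.append(k)
        u = R[k] / norms[k]               # unit vector along the chosen row
        R = R - np.outer(R @ u, u)        # remove the component along u from every row
    return index_set

# Toy example: rows 0..2 are the corners, the remaining rows are mixtures.
rng = np.random.default_rng(0)
corners = np.array([[1.0, 0.2, 0.1],
                    [0.3, 1.5, 0.2],
                    [0.1, 0.4, 2.0]])
weights = np.vstack([np.eye(3), rng.dirichlet(np.ones(3), size=20)])
points = weights @ corners
found = successive_projection(points, 3)
print(sorted(found))
```

In a noiseless setting like this toy example, the selected indices are the corner rows, because the Euclidean norm is maximized over a convex hull at one of its vertices.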
Appendix B Proofs under MMDF
B.1 Proof of Proposition 1
Proof
This proposition follows immediately from the first statement of Theorem 2.1 of [21]: P is assumed to be a full-rank matrix, and Theorem 2.1 of [21] is a distribution-free result, so it holds without any constraint on the distribution of A. \(\square \)
B.2 Proof of Lemma 1
Proof
Since \(\Omega =\Pi \rho P\Pi '=U\Lambda U'\) and \(U'U=I_{K}\), we have \(U=\Pi \rho P\Pi 'U\Lambda ^{-1}\), i.e., \(B=\rho P\Pi ' U\Lambda ^{-1}\). So B is unique. Since \(U=\Pi B\), we have \(U(\mathcal {I},:)=\Pi (\mathcal {I},:)B=B\) and the lemma follows. \(\square \)
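The identities used in this proof can be checked numerically. The sketch below is only an illustration under stated assumptions: the matrix `P`, the sizes `n` and `K`, and the value of `rho` are hypothetical, and the first K nodes are taken to be pure so that the index set \(\mathcal {I}\) is `pure_idx`. It builds \(\Omega =\rho \Pi P\Pi '\), takes the K leading eigenpairs by magnitude, forms \(B=\rho P\Pi 'U\Lambda ^{-1}\), and compares \(U\) with \(\Pi B\) and \(U(\mathcal {I},:)\) with \(B\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, rho = 60, 3, 0.8                      # hypothetical sizes and scaling

# Membership matrix: the first K nodes are pure, the rest are mixed.
Pi = np.vstack([np.eye(K), rng.dirichlet(np.ones(K), size=n - K)])
pure_idx = np.arange(K)                     # plays the role of the index set I

# Hypothetical symmetric full-rank P with entries of both signs.
P = np.array([[1.0, -0.3, 0.2],
              [-0.3, 0.9, 0.1],
              [0.2, 0.1, -0.8]])

Omega = rho * Pi @ P @ Pi.T                 # rank-K population matrix

# K leading eigenpairs of Omega, ordered by absolute eigenvalue.
vals, vecs = np.linalg.eigh(Omega)
top = np.argsort(np.abs(vals))[-K:]
U, Lam = vecs[:, top], np.diag(vals[top])

B = rho * P @ Pi.T @ U @ np.linalg.inv(Lam)

print(np.abs(U - Pi @ B).max())             # deviation of U from Pi B
print(np.abs(U[pure_idx] - B).max())        # deviation of U(I,:) from B
```

Both printed deviations should be at the level of floating-point round-off, which is what the lemma predicts for a noiseless \(\Omega \).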
B.3 Proof of Theorem 1
Proof
First, we prove the following lemma, which provides an upper bound on the row-wise eigenspace error \(\Vert \hat{U}\hat{U}'-UU'\Vert _{2\rightarrow \infty }\).
Lemma 2
(Row-wise eigenspace error) Under \(MMDF_{n}(K,P,\Pi ,\rho ,\mathcal {F})\), when Assumption 1 holds and \(\sigma _{K}(\Omega )\ge C\sqrt{\gamma \rho n\textrm{log}(n)}\) for some \(C>0\), with probability at least \(1-o(n^{-3})\), we have
Proof
First, we use Theorem 1.4 (the matrix Bernstein inequality) of [67] to build an upper bound on \(\Vert A-\Omega \Vert _{\infty }\). This theorem is stated below.
Theorem 2
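Consider a finite sequence \(\{X_{k}\}\) of independent, random, self-adjoint matrices with dimension d. Assume that each random matrix satisfies
\[\mathbb {E}[X_{k}]=0\quad \text {and}\quad \lambda _{\max }(X_{k})\le R\ \text {almost surely}.\]
Then, for all \(t\ge 0\),
\[\mathbb {P}\Big (\lambda _{\max }\Big (\sum _{k}X_{k}\Big )\ge t\Big )\le d\cdot \textrm{exp}\Big (\frac{-t^{2}/2}{\sigma ^{2}+Rt/3}\Big ),\]
where \(\sigma ^{2}:=\Vert \sum _{k}\mathbb {E}(X^{2}_{k})\Vert \).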
Let \(x=(x_{1},x_{2},\ldots , x_{n})'\) be any \(n\times 1\) vector. For any \(i,j\in [n]\), we have \(\mathbb {E}[(A(i,j)-\Omega (i,j))x(j)]=0\) and \(\Vert (A(i,j)-\Omega (i,j))x(j)\Vert \le \tau \Vert x\Vert _{\infty }\). Set \(R=\tau \Vert x\Vert _{\infty }\). Since \(\Vert \sum _{j=1}^{n}\mathbb {E}[(A(i,j)-\Omega (i,j))^{2}x^{2}(j)]\Vert =\Vert \sum _{j=1}^{n}x^{2}(j)\mathbb {E}[(A(i,j)-\Omega (i,j))^{2}]\Vert =\Vert \sum _{j=1}^{n}x^{2}(j)\textrm{Var}(A(i,j))\Vert \le \gamma \rho \sum _{j=1}^{n}x^{2}(j)\), by Theorem 2, for any \(t\ge 0\) and \(i\in [n]\), we have
Set x(j) as 1 or \(-1\) such that \((A(i,j)-\Omega (i,j))x(j)=|A(i,j)-\Omega (i,j)|\), we have
Set \(t=\frac{\alpha +1+\sqrt{(\alpha +1)(\alpha +19)}}{3}\sqrt{\gamma \rho n\textrm{log}(n)}\) for any \(\alpha >0\). By Assumption 1, we have
By Theorem 4.2 of [68], when \(\sigma _{K}(\Omega )\ge 4\Vert A-\Omega \Vert _{\infty }\), we have
where \(\mathcal {O}\) is a \(K\times K\) orthogonal matrix. With probability at least \(1-o(n^{-\alpha })\), we have
Since \(\hat{U}'\hat{U}=I_{K},U'U=I_{K}\), by basic algebra, we have \(\Vert \hat{U}\hat{U}'-UU'\Vert _{2\rightarrow \infty }\le 2\Vert \hat{U}-U\mathcal {O}\Vert _{2\rightarrow \infty }\), which gives
Since \(\sigma _{K}(\Omega )\ge \sigma _{K}(P)\rho \lambda _{K}(\Pi '\Pi )\) by Lemma II.4 of [21] and \(\Vert U\Vert ^{2}_{2\rightarrow \infty }\le \frac{1}{\lambda _{K}(\Pi '\Pi )}\) by Lemma 3.1 of [21], where these two lemmas are distribution-free and always hold as long as Eqs. (2), (4), and (5) hold, we have
Setting \(\alpha =3\) gives the claim of Lemma 2. \(\square \)
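The particular constant in the choice of t above can be motivated by a short calculation. The sketch below assumes that Assumption 1 has the form \(\gamma \rho n\ge \tau ^{2}\textrm{log}(n)\), which is not restated in this appendix, and it uses \(R=\tau \) and \(\sigma ^{2}\le \gamma \rho n\), which follow from taking \(x(j)=\pm 1\) in the bounds above. The symbol c is shorthand introduced here for the constant in front of \(\sqrt{\gamma \rho n\textrm{log}(n)}\).

```latex
% Shorthand: t = c\sqrt{\gamma\rho n\log(n)}, with
% c = \frac{\alpha+1+\sqrt{(\alpha+1)(\alpha+19)}}{3}.
% Assumed form of Assumption 1: \gamma\rho n \ge \tau^{2}\log(n).
\begin{align*}
\frac{t^{2}/2}{\gamma\rho n+\tau t/3}
  &= \frac{c^{2}\gamma\rho n\log(n)/2}
          {\gamma\rho n+\frac{c}{3}\,\tau\sqrt{\gamma\rho n\log(n)}} \\
  &\ge \frac{c^{2}\gamma\rho n\log(n)/2}{\gamma\rho n\,(1+c/3)}
     && \text{since } \tau\sqrt{\log(n)}\le\sqrt{\gamma\rho n} \\
  &= \frac{3c^{2}}{6+2c}\,\log(n) \\
  &= (\alpha+1)\log(n)
     && \text{since } 3c^{2}-2(\alpha+1)c-6(\alpha+1)=0 .
\end{align*}
```

Under these assumptions, the Bernstein tail for a single row is of order \(n^{-(\alpha +1)}\), and a union bound over the n rows leaves a failure probability of order \(n^{-\alpha }\), in line with the probability stated in the proof. The quadratic in the last step is exactly the equation whose positive root is c, which explains the terms \(\alpha +1\) and \(\alpha +19\) in the choice of t.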
Remark 6
Alternatively, Theorem 4.2 of [69] can also be applied to obtain an upper bound on \(\Vert \hat{U}\hat{U}'-UU'\Vert _{2\rightarrow \infty }\), and this bound is similar to the one in Lemma 2.
For convenience, set \(\varpi =\Vert \hat{U}\hat{U}'-UU'\Vert _{2\rightarrow \infty }\). Since DFSP is the SPACL algorithm of [21] without the prune step, the proof of DFSP’s consistency is the same as that of SPACL except for the row-wise eigenspace error step, where we need to consider \(\gamma \), which is directly related to the distribution \(\mathcal {F}\). By Lemma 2 and Equation (3) in Theorem 3.2 of [21], whose proof is distribution-free, there exists a \(K\times K\) permutation matrix \(\mathcal {P}\) such that
B.4 Proof of Corollary 1
Proof
When \(\lambda _{K}(\Pi '\Pi )=O(\frac{n}{K})\) and \(K=O(1)\), we have \(\kappa (\Pi '\Pi )=O(1)\) and \(\lambda _{K}(\Pi '\Pi )=O(n/K)=O(n)\). Then, the corollary follows immediately by Theorem 1. \(\square \)
Appendix C Extra simulation results
In this part, we consider two extra simulations: imbalanced networks and running time. For imbalanced networks, we study the stability of DFSP and its competitors when there are small communities. For running time, we compare the running time of each method as the network size n increases. For simplicity, we only consider the case when \(\mathcal {F}\) is the Normal distribution here. When \(A(i,j)\sim \textrm{Normal}(\Omega (i,j),\sigma ^{2}_{A})\), let all nodes be pure, \(K=2, \rho =1\), and \(\sigma ^{2}_{A}=1\). Set P as
Let the first community have \(\delta n\) nodes, so that the second community has \((1-\delta )n\) nodes. Based on the above settings, we consider the following two simulations.
Changing \(\delta \): Let \(n=200\) or \(n=1000\). Let \(\delta \) range over \(\{0.025, 0.05, 0.075, \ldots , 0.5\}\). For this case, the two evaluation metrics Hamming error and Relative error are not suitable for imbalanced networks. To prioritize the ability of DFSP and its competitors to detect the minority communities, we consider the following two metrics, for which smaller values are better.
Unlike Hamming error which measures the \(l_{1}\) difference between \(\Pi \) and \({\hat{\Pi }}\) up to a permutation of community labels, Clustering \(l_{1}\) error measures the maximum \(l_{1}\) difference between the size of the true k-th community and the size of the estimated k-th community up to a permutation of community labels among all K communities. Therefore, Clustering \(l_{1}\) error can evaluate the ability of a community detection method to detect the minority communities. Similar arguments hold for the Clustering \(l_{2}\) error.
Panels (a–f) of Fig. 5 display the numerical results for changing \(\delta \). When \(n=200\), DFSP and its competitors perform similarly, and all of them successfully detect the minority community when \(\delta \in [0.125,0.5]\), i.e., when the ratio of the largest community size to the smallest community size lies in [1, 7]. KDFSP successfully estimates the number of communities K when \(\delta \in [0.15,0.5]\), while NB and BHac fail to infer K in all cases. When \(n=1000\), DFSP and its competitors successfully detect all communities when \(\delta \in [0.1, 0.5]\), i.e., when this ratio lies in [1, 9]. KDFSP correctly determines K when \(\delta \in [0.05,0.5]\), while its competitors fail to find K.
Changing n: Let \(\delta =0.075\) or \(\delta =0.1\), i.e., let the ratio of the largest community size to the smallest community size be \(\frac{37}{3}\) or 9. Let n range over \(\{2000,4000,6000,\ldots ,12000\}\). For simplicity, we only report the averaged Clustering \(l_{1}\) error, averaged Clustering \(l_{2}\) error, and averaged running time over 100 repetitions for DFSP and its competitors. Figure 6 displays the numerical results. We see that DFSP outperforms GeoNMF, SVM-cD, and OCCAM in both estimation accuracy and running time. In particular, DFSP runs much faster than OCCAM. Meanwhile, the clustering errors of DFSP are small in both cases \(\delta =0.075\) and \(\delta =0.1\). By comparing panel (a) with panel (e) (and panel (b) with panel (f)), we see that all methods perform worse on a more imbalanced network, which is consistent with the results for changing \(\delta \). By comparing panel (c) with panel (g) (and panel (d) with panel (h)), we see that each method takes more time on a more imbalanced network.
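The data-generating setup of this appendix can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the matrix `P_example` is hypothetical because the P used in the simulations is not reproduced above, the function name `simulate_imbalanced_normal` is made up for this sketch, and the choices to keep A symmetric and to add noise on the diagonal are assumptions.

```python
import numpy as np

def simulate_imbalanced_normal(n, delta, P, rho=1.0, sigma2=1.0, seed=0):
    """Draw a symmetric A with A(i,j) ~ Normal(Omega(i,j), sigma2).

    All nodes are pure, K = 2, and the first community has round(delta * n) nodes.
    """
    rng = np.random.default_rng(seed)
    n1 = int(round(delta * n))                       # size of the first community
    labels = np.r_[np.zeros(n1, dtype=int), np.ones(n - n1, dtype=int)]
    Pi = np.eye(2)[labels]                           # pure memberships, shape (n, 2)
    Omega = rho * Pi @ P @ Pi.T                      # expectation of A

    noise = rng.normal(0.0, np.sqrt(sigma2), size=(n, n))
    upper = np.triu(noise)                           # upper triangle incl. diagonal
    E = upper + np.triu(noise, 1).T                  # mirror to get symmetric noise
    return Omega + E, Pi, Omega

# Hypothetical connectivity matrix (the paper's P is not shown in this appendix).
P_example = np.array([[1.0, 0.2],
                      [0.2, 1.0]])
A, Pi, Omega = simulate_imbalanced_normal(n=200, delta=0.025, P=P_example)
sizes = Pi.sum(axis=0)
print(sizes, sizes.max() / sizes.min())              # community sizes and their ratio
```

Varying `delta` over the grid above for fixed `n`, or varying `n` for fixed `delta`, reproduces the two designs described in this appendix up to the unknown choice of P.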
About this article
Cite this article
Qing, H., Wang, J. Mixed membership distribution-free model. Knowl Inf Syst 66, 879–904 (2024). https://doi.org/10.1007/s10115-023-02021-2
DOI: https://doi.org/10.1007/s10115-023-02021-2