Abstract
We consider the problem of estimating overlapping community memberships in a network, where each node can belong to multiple communities. Memberships in more than a few communities per node are difficult to both estimate and interpret, so we focus on sparse node membership vectors. Our algorithm is based on sparse principal subspace estimation with iterative thresholding. The method is computationally efficient, with cost equivalent to estimating the leading eigenvectors of the adjacency matrix, and does not require an additional clustering step, unlike spectral clustering methods. We show that a fixed point of the algorithm corresponds to correct node memberships under a version of the stochastic block model. The methods are evaluated empirically on simulated and real-world networks, showing good statistical performance and computational efficiency.
Introduction
Networks have become a popular representation of complex data that appear in different fields such as biology, physics, and the social sciences. A network represents units of a system as nodes, and the interactions between them as edges. A network can encode relationships between people in a social environment (Wasserman and Faust, 1994), connectivity between areas of the brain (Bullmore and Sporns, 2009), or interactions between proteins (Schlitt and Brazma, 2007). Constant technological advances have increased our ability to collect network data on a large scale, with potentially millions of nodes in a network. Parsimonious models, as well as computationally efficient methods, are needed to obtain meaningful interpretations of such data.
Communities are a structure of interest in the analysis of networks, observed in many real-world systems (Girvan and Newman, 2002). Usually, communities are defined as clusters of nodes that have stronger connections to each other than to the rest of the network. Finding these communities allows for a more parsimonious representation of the data, which is often meaningful in the system of interest. For example, communities can represent functional areas of the brain (Schwarz et al., 2008; Power et al., 2011), political affinity in social networks (Adamic and Glance, 2005; Conover et al., 2011; Latouche et al., 2011), research areas in citation networks (Ji and Jin, 2016), and many others.
The stochastic block model (SBM) (Holland et al., 1983) is a simple statistical model for a network with communities that is by now well understood; see Abbe (2017) for a review. Under the SBM, a pair of nodes is connected with a probability that depends only on the community memberships of these nodes. The SBM can represent any type of connectivity structure between the communities in the network, such as affinity, disassortativity, or core-periphery (see for example Cape et al., 2019). While the SBM itself is too simple to capture some aspects of real-world networks, many extensions have been proposed to incorporate more complex structures such as hubs (Ball et al., 2011), or nodes that belong to more than one community (Airoldi et al., 2009; Latouche et al., 2011; Zhang et al., 2020), which lead to models with overlapping communities.
Overlapping community models characterize each node by a membership vector, indicating its degree of belonging to different communities. While in principle all entries of a membership vector can be positive (Airoldi et al., 2009), a sparse membership vector is more likely to have a meaningful interpretation. At the same time, allowing for a varying degree of belonging to a community adds both flexibility and interpretability relative to binary membership overlapping community models such as Latouche et al. (2011). In this paper, we focus on estimating sparse overlapping community membership vectors with continuous entries, so that most nodes belong to only one or a few communities, and the degree to which they belong to a community can vary. The sparsest case, where each node belongs to exactly one community, corresponds to the classic community detection setting, and its success in modeling and analyzing real-world networks in many different fields (Porter et al., 2009) supports the sparse nature of community memberships.
Existing statistical models for overlapping community detection include both binary membership models, where each node either belongs to a community or does not (e.g., Latouche et al., 2011), and continuous membership models, which allow each node to have a different level of association with each community (Airoldi et al., 2009; Ball et al., 2011; Psorakis et al., 2011; Zhang et al., 2020). Binary memberships are a natural way to induce sparsity, but the binary models are less flexible, and fitting them can be computationally intensive since it involves solving a discrete optimization problem. On the other hand, continuous memberships are not able to explicitly model sparsity, and the resulting estimates often assign most of the nodes to many or even all communities. To obtain sparse memberships, an ad hoc post-processing step can be applied (Gregory, 2010; Lancichinetti et al., 2011), but this is likely to lead to a less accurate fit to the data. Another approach to inducing sparse memberships is to incorporate sparsity-inducing priors into the model, for example via the discrete hierarchical Dirichlet process (Wang and Blei, 2009) or the Indian buffet process (Williamson et al., 2010), both introduced in the topic modeling literature.
The problem of estimating overlapping community memberships has been approached from different perspectives; see for example Xie et al. (2013), da Fonseca Vieira et al. (2020), and references in Section 4.3. In particular, spectral methods for community detection are popular due to their computational scalability and theoretical guarantees (Newman, 2006; Rohe et al., 2011; Lyzinski et al., 2014; Le et al., 2017). Many statistical network models make a low-rank assumption on the matrix P = 𝔼[A] that characterizes the edge probabilities, and in most models with communities the principal subspace of P contains the information needed to identify the communities. Spectral methods for community detection exploit this fact by computing an eigendecomposition of the network adjacency matrix A, defined by A_{ij} = 1 if there is an edge from i to j and 0 otherwise, followed by a post-processing step applied to the leading eigenvectors to recover memberships. Several approaches of this type have been developed recently, with different ways of clustering the rows of the leading eigenvectors (Zhang et al., 2020; Rubin-Delanchy et al., 2017; Jin et al., 2017; Mao et al., 2017, 2018, 2020).
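As a concrete point of reference, the classical spectral pipeline just described (eigendecomposition of A followed by clustering the rows of the leading eigenvectors) can be sketched in a few lines. This is a minimal illustration of the non-overlapping baseline, not the method proposed in this paper; the tiny deterministic k-means routine is included only to keep the sketch self-contained.

```python
import numpy as np

def spectral_communities(A, K):
    # Embed nodes with the K leading eigenvectors (by absolute eigenvalue).
    vals, vecs = np.linalg.eigh(A)
    X = vecs[:, np.argsort(-np.abs(vals))[:K]]
    # Post-processing step: cluster the rows of the embedding with a small,
    # deterministic Lloyd's k-means (farthest-point initialization).
    centers = [X[0]]
    for _ in range(K - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(100):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(K)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

# Two disjoint 5-cliques: the embedding separates the blocks perfectly.
A = np.kron(np.eye(2), np.ones((5, 5))) - np.eye(10)
labels = spectral_communities(A, 2)
assert labels[0] != labels[5]
```

Note that the final clustering step forces each node into exactly one community, which is precisely the limitation the method developed below avoids.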
In contrast to other spectral methods, here we present a new approach for detecting overlapping communities based on estimating a sparse basis for the principal subspace of the network adjacency matrix, in which the pattern of nonzero values contains the information about community memberships. Our approach can be seen as an analogue of finding sparse principal components of a matrix (Jolliffe et al., 2003; Zou et al., 2006; Ma, 2013), with the important difference that we consider a non-orthogonal sparse basis of the principal subspace to allow for overlaps between communities. Our method thus has the potential to estimate overlapping community memberships more accurately than traditional spectral methods, at the same low computational cost as computing the leading eigenvectors of a matrix. We demonstrate this both on simulated networks with overlapping and non-overlapping communities, and on real-world networks.
A Sparse Non-Orthogonal Eigenbasis Decomposition
As mentioned in the introduction, we consider binary symmetric adjacency matrices A ∈ {0,1}^{n×n} with no self-loops, i.e., A_{ii} = 0. We model the network as an inhomogeneous Erdős-Rényi random graph (Bollobás et al., 2007), meaning that the upper triangular entries of A are independent Bernoulli random variables with potentially different edge probabilities P_{ij} = ℙ(A_{ij} = 1) for i, j ∈ [n], i < j, contained in a symmetric probability matrix P ∈ ℝ^{n×n}.
Our goal is to recover an overlapping community structure in A by estimating an appropriate sparse basis of the invariant subspace of P. The rationale is that when P is even approximately low rank, most relevant information about communities is contained in the leading eigenvectors of P, and can be retrieved by looking at a particular basis of its invariant subspace. We will assume that the rank of P is K < n. The principal subspace of P can be described with a full-rank matrix V ∈ ℝ^{n×K}, with the columns of V forming a basis of this space. Most commonly, V is defined as the K leading eigenvectors of P, but for the purposes of recovering community memberships, we focus on finding a sparse nonnegative eigenbasis of P, that is, a matrix V for which V_{ik} ≥ 0 for all i ∈ [n], k ∈ [K] and P = VU^⊤ for some full-rank matrix U ∈ ℝ^{n×K}. Note that this is different from the popular nonnegative matrix factorization problem (Lee and Seung, 1999), as we do not assume that U is a nonnegative matrix, nor do we try to estimate it.
If P has a sparse nonnegative basis of its principal subspace \(\text {\textbf {V}}\in \mathbb {R}^{n\times K}\), this basis is not unique, as any column scaling or permutation of V will give another nonnegative basis. Among these, we are interested in finding a sparse nonnegative eigenbasis V, since we will relate the nonzeros of V to community memberships. The following proposition provides a sufficient condition for identifiability of the nonzero pattern in V up to a permutation of its columns. The proof is included in the Appendix.
Proposition 1
Let P ∈ ℝ^{n×n} be a symmetric matrix of rank K. Suppose that there exists a matrix V ∈ ℝ^{n×K} that satisfies the following conditions:

Eigenbasis: V is a basis of the column space of P, that is, P = VU^{⊤}, for some \(\textup {\textbf {U}}\in \mathbb {R}^{n\times K}\).

Nonnegativity: The entries of V satisfy V_{ik} ≥ 0 for all i ∈ [n], k ∈ [K].

Pure rows: For each k = 1,…,K there exists at least one row i_{k} of V such that \(\textup {\textbf {V}}_{i_{k}k}>0\) and \(\textup {\textbf {V}}_{i_{k}j}=0\) for j≠k.
If another matrix Ṽ ∈ ℝ^{n×K} satisfies these conditions, then there exists a permutation matrix Q ∈ {0,1}^{K×K}, Q^⊤Q = I_K, such that

supp(Ṽ) = supp(VQ),

where supp(V) = {(i, j) : V_{ij} ≠ 0} is the set of nonzero entries of V.
We connect a nonnegative non-orthogonal basis to community memberships through the overlapping continuous community assignment model (OCCAM) of Zhang et al. (2020), a general model for overlapping communities that encompasses, as special cases, multiple other popular overlapping models (Latouche et al., 2011; Ball et al., 2011; Jin et al., 2017; Mao et al., 2018). Under OCCAM, each node is associated with a vector z_i = [z_{i1}, …, z_{iK}]^⊤ ∈ ℝ^K, i = 1, …, n, where K is the number of communities in the network. Given Z = [z_1 ⋯ z_n]^⊤ ∈ ℝ^{n×K} and parameters α > 0, Θ ∈ ℝ^{n×n} and B ∈ ℝ^{K×K} to be explained below, the probability matrix P = 𝔼[A] of OCCAM can be expressed as

P = αΘZBZ^⊤Θ.     (2.1)
For identifiability, OCCAM assumes that α and all entries of Θ, B and Z are nonnegative, Θ is a diagonal matrix with diag(Θ) = 𝜃 ∈ ℝ^n and Σ_{i=1}^n 𝜃_i = n, ∥z_i∥₂ = (Σ_{j=1}^K z_{ij}²)^{1/2} = 1 for all i ∈ [n], and B_{kk} = 1 for all k ∈ [K]. In this representation, the row z_i of Z is viewed as the community membership vector of node i. A positive value of z_{ik} indicates that node i belongs to community k, and the magnitude of z_{ik} determines how strongly. The parameter 𝜃_i represents the degree correction for node i, as in the degree-corrected SBM (Karrer and Newman, 2011), allowing for degree heterogeneity and in particular “hub” nodes, common in real-world networks. The scalar parameter α > 0 controls the edge density of the entire graph.
One can obtain the classical SBM as a special case of OCCAM by further requiring each z_i to have only one nonzero value and setting 𝜃_i = 1 for all i ∈ [n]. Keeping only one nonzero value in each row of Z but allowing the entries of 𝜃 to take arbitrary positive values recovers the degree-corrected SBM (Karrer and Newman, 2011). More generally, under OCCAM nodes can belong to multiple communities at the same time: each row of Z can have multiple or even all entries different from zero, indicating the communities to which the node belongs.
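The special cases above are easy to verify numerically. The sketch below writes Eq. 2.1 as P = αΘZBZ^⊤Θ (an assumption consistent with the statement below that V = ΘZ is a nonnegative eigenbasis of P) and checks that one-hot memberships with 𝜃_i = 1 reduce OCCAM to SBM edge probabilities.

```python
import numpy as np

# Assumed form of Eq. 2.1: P = alpha * Theta Z B Z^T Theta.
n, K = 6, 2
Z = np.zeros((n, K))
Z[:3, 0] = 1.0                # nodes 1-3 purely in community 1
Z[3:, 1] = 1.0                # nodes 4-6 purely in community 2
Theta = np.diag(np.ones(n))   # theta_i = 1: no degree correction
B = np.array([[1.0, 0.3],
              [0.3, 1.0]])    # B_kk = 1, per the identifiability constraints
alpha = 0.5

P = alpha * Theta @ Z @ B @ Z.T @ Theta
# Edge probabilities depend only on the pair of communities, as in the SBM:
assert np.isclose(P[0, 1], alpha * B[0, 0])   # within community 1
assert np.isclose(P[0, 4], alpha * B[0, 1])   # between communities
# V = Theta Z is a nonnegative sparse matrix spanning the column space of P:
V = Theta @ Z
assert np.linalg.matrix_rank(np.hstack([P, V])) == np.linalg.matrix_rank(V)
```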
Equation 2.1 implies that under OCCAM the probability matrix P has a nonnegative eigenbasis given by V = ΘZ. The following proposition shows the converse result, namely, that any matrix P that admits a nonnegative eigenbasis can be represented as in Eq. 2.1, which motivates the interpretation of the nonzero entries of a nonnegative eigenbasis as indicators of community memberships.
Proposition 2.
Let P ∈ ℝ^{n×n} be a symmetric real matrix with rank(P) = K. Suppose that there exists a full-rank nonnegative matrix V ∈ ℝ^{n×K} and a matrix U such that P = VU^⊤. Then, there exists a nonnegative diagonal matrix Θ ∈ ℝ^{n×n}, a nonnegative matrix Z ∈ ℝ^{n×K} with Σ_{k=1}^K Z_{ik}² = 1 for each i ∈ [n], and a symmetric matrix B ∈ ℝ^{K×K} such that

P = ΘZBZ^⊤Θ.
Moreover, if V satisfies the conditions of Proposition 1, then supp(V) = supp(ZQ) for some permutation \(\textup {\textbf {Q}}\in \mathbb {R}^{K\times K}, \textup {\textbf {Q}}^{\top }\textup {\textbf {Q}}=\textup {\textbf {I}}\).
In short, Proposition 2 states that a nonnegative basis of the probability matrix P can be mapped to overlapping communities as in Eq. 2.1. Moreover, under the conditions on this eigenbasis stated in Proposition 1, the community memberships can be uniquely identified. These conditions are weaker than the ones in Zhang et al. (2020), since we are only interested in community memberships and not in identifiability of the other parameters; note that we do not aim to fit the OCCAM model, which is computationally much more intensive than our approach here. Other conditions for identifiability of overlapping community memberships have been presented in the literature (Huang and Fu, 2019), but the pure row assumption in Proposition 1 is enough for our purpose of estimating sparse memberships.
Community Detection via Sparse Iterative Thresholding
Our goal is to compute an appropriate sparse basis of the principal subspace of A which contains information about the overlapping community memberships. Spectral clustering has been popular for community detection, typically clustering the rows of the leading eigenvectors of A, or a function of them, to assign nodes to communities. Spectral clustering with overlapping communities typically gives a continuous membership matrix, which can then be thresholded to obtain sparse membership vectors; however, this two-stage approach is unlikely to be optimal in any sense, and some of the overlapping clustering procedures can be computationally expensive (Zhang et al., 2020; Jin et al., 2017). In contrast, our approach of directly computing a sparse basis of the principal subspace of A avoids the two-stage procedure and thus can lead to improvements in both accuracy and computational efficiency.
Sparse principal component analysis (SPCA) (Jolliffe et al., 2003; Zou et al., 2006) seeks to estimate the principal subspace of a matrix while incorporating sparsity constraints or regularization on the basis vectors. In high dimensions, enforcing sparsity can improve estimation when the sample size is relatively small, and/or simplify the interpretation of the solutions. Many SPCA algorithms have been proposed to estimate eigenvectors of a matrix under sparsity assumptions (see for example Amini and Wainwright, 2008; Johnstone and Lu, 2009; Vu and Lei, 2013; Ma, 2013).
Our goal is clearly related to SPCA, since we are interested in estimating a sparse basis of the principal subspace of P, but an important difference is that our vectors of interest are not necessarily orthogonal; in fact, orthogonality is achieved only when estimated communities do not overlap, and is thus not compatible with meaningful overlapping community estimation. For non-overlapping community detection, however, there is a close connection between a convex relaxation of the maximum likelihood estimator of communities and a convex formulation of SPCA (Amini and Levina, 2018).
Orthogonal iteration is a classical method for estimating the eigenvectors of a matrix; see for example Golub and Van Loan (2012). Ma (2013) extended this method to estimate sparse eigenvectors by an iterative thresholding algorithm. Starting from an initial matrix \(\textup {\textbf {V}}^{(0)}\in \mathbb {R}^{n\times K}\), the general form of their algorithm iterates the following steps until convergence:

1.
Multiplication step:
$$ \textup{\textbf{T}}^{(t+1)} = \textup{\textbf{A}} \textup{\textbf{V}}^{(t)}. $$(3.1) 
2.
Regularization step:
$$ \textup{\textbf{U}}^{(t+1)} = \mathcal{R}(\textup{\textbf{T}}^{(t+1)}, \boldsymbol{\Lambda}), $$ (3.2) where \(\mathcal {R}:\mathbb {R}^{n\times K}\rightarrow \mathbb {R}^{n\times K}\) is a regularization function and \(\boldsymbol {\Lambda }\in \mathbb {R}^{n\times K}\) is a matrix of regularization parameters.

3.
Identifiability step:
$$ \textup{\textbf{V}}^{(t+1)} = \textup{\textbf{U}}^{(t+1)}\textup{\textbf{W}}^{(t+1)}, $$ (3.3) where W^{(t+1)} is a K × K matrix.
An example of a convergence criterion is stopping when the distance between the subspaces generated by V^{(t)} and V^{(t+1)} is small. For two full-rank matrices U, Ũ ∈ ℝ^{n×K}, the distance between the subspaces generated by the columns of U and Ũ is defined through their orthogonal projection matrices R = U(U^⊤U)^{−1}U^⊤ and R̃ = Ũ(Ũ^⊤Ũ)^{−1}Ũ^⊤ as

D(U, Ũ) = ∥R − R̃∥,

where ∥⋅∥ is the matrix spectral norm (see Golub and Van Loan, 2012, Section 2.5.3).
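This subspace distance transcribes directly into code; the checks below confirm that it depends only on the spanned subspace, not on the particular basis.

```python
import numpy as np

def subspace_dist(U1, U2):
    """D(U1, U2) = ||R1 - R2|| in spectral norm, with R the orthogonal
    projection onto the column space, exactly as defined above."""
    R1 = U1 @ np.linalg.solve(U1.T @ U1, U1.T)
    R2 = U2 @ np.linalg.solve(U2.T @ U2, U2.T)
    return np.linalg.norm(R1 - R2, 2)

# Any invertible recombination of the columns leaves the distance at zero.
U = np.random.default_rng(1).normal(size=(8, 2))
assert subspace_dist(U, U @ np.array([[2.0, 1.0], [0.0, 1.0]])) < 1e-8
# Orthogonal one-dimensional subspaces are at distance one.
assert np.isclose(subspace_dist(np.eye(8)[:, :1], np.eye(8)[:, 1:2]), 1.0)
```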
Let V̂ be the value of V^{(t)} at convergence, and let Ṽ be the n × K matrix of the K leading eigenvectors of A. The algorithm provides a generic framework for obtaining a basis V̂ that is close to Ṽ, and the regularization step can be customized to enforce some structure in V̂. In each iteration, the multiplication step (3.1) reduces the distance between the subspaces generated by V^{(t)} and Ṽ (Theorem 7.3.1 of Golub and Van Loan (2012)), and then the regularization step (3.2) forces some structure in V^{(t)}. Ma (2013) focused on sparsity and regularized with a thresholding function satisfying |[𝓡(T, Λ)]_{ik} − T_{ik}| ≤ Λ_{ik} and [𝓡(T, Λ)]_{ik}𝟙(|T_{ik}| ≤ Λ_{ik}) = 0 for all Λ_{ik} > 0 and i ∈ [n], k ∈ [K], which includes both hard and soft thresholding. If the distance between U^{(t)} and V^{(t)} is small, then the distance between V^{(t)} and Ṽ keeps decreasing until a certain tolerance is reached (Proposition 6.1 of Ma (2013)). Finally, the last step in Eq. 3.3 ensures identifiability. For example, the orthogonal iteration algorithm uses the QR decomposition Q^{(t)}R^{(t)} = U^{(t)} and sets V^{(t)} = U^{(t)}(R^{(t)})^{−1} = Q^{(t)}, which has orthonormal columns.
We will use the general form of the algorithm presented in Eqs. 3.1–3.3 to develop methods for estimating a sparse eigenbasis of A, by designing regularization and identifiability steps appropriate for overlapping community detection.
Sparse Eigenbasis Estimation
We propose an iterative thresholding algorithm for sparse eigenbasis estimation when the basis vectors are not necessarily orthogonal. Let V^{(t)} be the estimated basis at iteration t. For identifiability, we assume that this matrix has normalized columns, that is, ∥V^{(t)}_{·,k}∥₂ = 1 for each k ∈ [K], where V_{·,k} denotes the k-th column of V. Our algorithm is based on the following heuristic. Suppose that at some iteration t, V^{(t)} is close to the basis of interest. The multiplication step in Eq. 3.1 moves V^{(t)} closer to Ṽ, the K-dimensional leading eigenspace of A, but the entries of T^{(t+1)} = AV^{(t)} and V^{(t)} are not necessarily close. Hence, before applying the regularization step, we introduce a linear transformation step that returns T^{(t+1)} to a value that is close to V^{(t)} entrywise. This transformation is given by the solution of the optimization problem

Γ^{(t+1)} = argmin_{Γ ∈ ℝ^{K×K}} ∥T^{(t+1)} − V^{(t)}Γ∥_F²,
which has a closed-form solution, Γ^{(t+1)} = [(V^{(t)})^⊤V^{(t)}]^{−1}(V^{(t)})^⊤T^{(t+1)}. Define

T̃^{(t+1)} = T^{(t+1)}(Γ^{(t+1)})^{−1}.
After this linear transformation, we apply a sparse regularization to T̃^{(t+1)}, defined by a thresholding function 𝒮 with parameter λ ∈ [0,1),

[𝒮(T̃, λ)]_{ik} = T̃_{ik} 𝟙(|T̃_{ik}| ≥ λ∥T̃_{i·}∥_∞).
The function 𝒮 applies hard thresholding to each entry of the matrix T̃, with a different threshold for each row to adjust for possible differences in the expected degrees of the nodes. The parameter λ controls the level of sparsity, with larger values of λ giving more zeros in the solution. Finally, the new value of V is obtained by normalizing the columns, setting U^{(t+1)} = 𝒮(T̃^{(t+1)}, λ) and

V^{(t+1)}_{ik} = U^{(t+1)}_{ik} / ∥U^{(t+1)}_{·,k}∥₂

for each i ∈ [n] and k ∈ [K].
We stop the algorithm once the relative difference in spectral norm between V^{(t)} and V^{(t+1)} is smaller than some tolerance 𝜖 > 0, that is,

∥V^{(t+1)} − V^{(t)}∥ ≤ 𝜖 ∥V^{(t)}∥.
These steps are summarized in Algorithm 1.
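The steps above can be sketched compactly as follows. The row-wise hard-thresholding rule used here (zeroing entries smaller than λ times the row's largest absolute entry) is an assumption consistent with Proposition 3 below, not a verbatim transcription of Algorithm 1, and initialization is left to the caller.

```python
import numpy as np

def iterate_once(A, V, lam):
    """One pass of the iteration: multiply, transform, threshold, normalize."""
    T = A @ V                                     # multiplication step (3.1)
    G = np.linalg.solve(V.T @ V, V.T @ T)         # least-squares transform Gamma
    Tt = T @ np.linalg.inv(G)                     # pull T back entrywise close to V
    row_max = np.abs(Tt).max(axis=1, keepdims=True)
    U = np.where(np.abs(Tt) >= lam * row_max, Tt, 0.0)  # row-wise hard threshold
    norms = np.linalg.norm(U, axis=0, keepdims=True)    # column normalization
    return U / np.where(norms == 0, 1.0, norms)

def sparse_eigenbasis(A, V0, lam=0.2, eps=1e-8, max_iter=500):
    """Iterate until the relative spectral-norm change falls below eps."""
    V = V0
    for _ in range(max_iter):
        Vn = iterate_once(A, V, lam)
        if np.linalg.norm(Vn - V, 2) <= eps * np.linalg.norm(V, 2):
            return Vn
        V = Vn
    return V

# Fixed-point check in the spirit of Proposition 3: for an expected matrix P
# with a sparse nonnegative eigenbasis (one overlap node), the column-normalized
# basis is left unchanged by the iteration.
Z = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [2 ** -0.5, 2 ** -0.5]])
P = 0.5 * Z @ np.array([[1.0, 0.3], [0.3, 1.0]]) @ Z.T
Vt = Z / np.linalg.norm(Z, axis=0, keepdims=True)
assert np.allclose(iterate_once(P, Vt, lam=0.3), Vt)
```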
The following proposition shows that when Algorithm 1 is applied to an expected probability matrix P that has a sparse basis, there exists a fixed point with the correct support. In particular, this implies that for the expected probability matrix of an OCCAM graph defined in Eq. 2.1, the support of this fixed point coincides with the support of the overlapping memberships of the model. The proof is given in the Appendix.
Proposition 3.
Let P ∈ ℝ^{n×n} be a symmetric matrix with rank(P) = K < n, and suppose that there exists a nonnegative sparse basis V of the principal subspace of P. Let Ṽ be the matrix with columns Ṽ_{·,k} = V_{·,k}/∥V_{·,k}∥₂ for each k ∈ [K], and let v* = min{V_{ik}/∥V_{i·}∥_∞ : (i,k) ∈ supp(V)}. Then, for any λ ∈ [0, v*), the matrix Ṽ is a fixed point of Algorithm 1 applied to P.
When the algorithm is applied to an adjacency matrix A, the matrix V is not exactly a fixed point, but the norm of the difference between V and V^{(1)} will be a function of the distance between the principal subspaces of A and P. Concentration results (Le et al., 2017) ensure that A is close to its expected value P; specifically, ∥A − P∥ = O(√d) (where ∥⋅∥ is the spectral norm of a matrix) with high probability, as long as the largest expected degree d = max_{i∈[n]} Σ_{j=1}^n P_{ij} satisfies d = Ω(log n). If the K leading eigenvalues of P are sufficiently large, then the principal subspaces of A and P are close to each other (Yu et al., 2015).
Community Detection in Networks with Homogeneous Degrees
Here, we present a second algorithm for the estimation of sparse community memberships in graphs with homogeneous expected node degrees within each community. Specifically, we focus on graphs for which the expected adjacency matrix P = 𝔼[A] has the form

P = ZBZ^⊤,
where Z ∈ ℝ^{n×K} is a membership matrix such that ∥Z_{i,·}∥₁ = Σ_{k=1}^K Z_{ik} = 1, and B ∈ ℝ^{K×K} is a full-rank matrix. This model is a special case of OCCAM in which the degree heterogeneity parameter Θ in Eq. 2.1 is constant across vertices. In particular, this case includes the classic SBM (Holland et al., 1983) when the memberships do not overlap.
To enforce degree homogeneity, we add an additional normalization step, so that the matrix Ẑ has rows with constant norm, ∥Ẑ_{i,·}∥₁ = 1, as in Eq. 3.5. In practice, we observed that this normalization gives very accurate results in terms of community detection. After the multiplication step T^{(t)} = AV^{(t−1)}, the columns of T^{(t)} are proportional to the norms of the columns of V^{(t−1)}, which are in turn proportional to the estimated community sizes. To remove the effect of this scaling with community size, which is not meaningful for community detection, we normalize the columns of V^{(t−1)}, and then perform the thresholding and the row normalization step as before. These steps are summarized in Algorithm 2.
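Under one plausible reading of the steps just described (column normalization after the multiplication step, then thresholding, then row normalization to unit ℓ₁ norm), a single iteration of the degree-homogeneous variant might look as follows; the planted partition check mirrors the fixed-point statement of Theorem 1 below, using the expected adjacency matrix for simplicity.

```python
import numpy as np

def iterate_homogeneous(A, Z, lam):
    """One pass of the degree-homogeneous variant (assumed ordering of steps):
    column-normalize after multiplying, hard-threshold row-wise, then rescale
    each row to unit l1 norm."""
    T = A @ Z
    T = T / np.linalg.norm(T, axis=0, keepdims=True)  # undo community-size scaling
    row_max = np.abs(T).max(axis=1, keepdims=True)
    U = np.where(np.abs(T) >= lam * row_max, T, 0.0)
    return U / np.abs(U).sum(axis=1, keepdims=True)   # rows sum to one

# Planted partition check, using the expected adjacency matrix and ignoring
# the zero diagonal for simplicity: the one-hot membership matrix is unchanged.
Z = np.kron(np.eye(2), np.ones((5, 1)))           # two communities of size 5
P = Z @ np.array([[0.8, 0.2], [0.2, 0.8]]) @ Z.T  # within p = 0.8, between q = 0.2
assert np.allclose(iterate_homogeneous(P, Z, lam=0.5), Z)
```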
The next theorem shows that in the case of the planted partition SBM, a matrix with the correct sparsity pattern is a fixed point of Algorithm 2. Note that since the algorithm does not assume that each node belongs to a single community, this result not only guarantees that there exists a fixed point that correctly clusters the nodes into communities, the typical goal of community detection, but also that the algorithm is able to distinguish whether or not a node belongs to more than one community. The proof is given in the Appendix.
Theorem 1.
Let A be a network generated from an SBM with K communities of sizes n₁, …, n_K, membership matrix Z, and connectivity matrix B ∈ [0,1]^{K×K} of the form

B = (p − q)I_K + q 1_K 1_K^⊤
for some p, q ∈ [0,1], p > q. Suppose that for some λ^{∗}∈ (0,1) and some c_{1} > 2,
Then, for any λ ∈ (λ^{∗},1), Z is a fixed point of Algorithm 2 with probability at least \(1n^{c_{1}1}\).
Selecting the Thresholding Parameter
Our algorithms require two user-supplied parameters: the number of communities K and the threshold level λ. The parameter λ controls the sparsity of the estimated basis V̂. In practice, looking at the full path of solutions for different values of λ may be informative, as controlling the number of overlapping memberships can result in different community assignments. On the other hand, it is practically useful to select a value of λ that provides a good fit to the data. We discuss two possible techniques for choosing this parameter: the Bayesian Information Criterion (BIC) and edge cross-validation (ECV) (Li et al., 2020). Here we assume that the number of communities K is given; choosing the number of communities is also an important problem, with multiple methods available for solving it (Wang and Bickel, 2017; Le and Levina, 2015; Li et al., 2020). If computational resources allow, K can be chosen by cross-validation along with λ.
The goodness of fit can be measured via the likelihood of the model for the graph A, which depends on the probability matrix P = 𝔼[A]. Given V̂, a natural estimator of P is the projection of A onto the subspace spanned by V̂, which can be formulated as

P̂ = argmin {∥A − P∥_F² : P ∈ ℝ^{n×n} symmetric, with principal subspace spanned by V̂}.     (3.7)
This optimization problem finds the least squares estimator of a matrix constrained to the set of symmetric matrices with principal subspace defined by V̂, and has a closed-form solution, stated in the following proposition.
Proposition 4.
Let P̂ be the solution of the optimization problem (3.7), and suppose that Q̂ ∈ ℝ^{n×K} is a matrix with orthonormal columns such that V̂ = Q̂R̂ for some matrix R̂ ∈ ℝ^{K×K}. Then,

P̂ = Q̂Q̂^⊤AQ̂Q̂^⊤.
The Bayesian Information Criterion (BIC) (Schwarz, 1978) provides a general way of choosing a tuning parameter by balancing the fit of the model, measured by the log-likelihood of A, against a penalty for the complexity of the model that is proportional to the number of parameters. The number of nonzeros in V, given by ∥V∥₀, can be used as a proxy for the degrees of freedom, and the sample size is taken to be the number of independent edges in A. Then the BIC for a given λ can be written as

BIC(λ) = −2 Σ_{i<j} [A_{ij} log(P̂_λ)_{ij} + (1 − A_{ij}) log(1 − (P̂_λ)_{ij})] + ∥V̂_λ∥₀ log(n(n−1)/2),     (3.8)
where \(\widehat {\textup {\textbf {P}}}_{\lambda }\) is the estimate for P defined in Proposition 4 for \(\widehat {\textup {\textbf {V}}}_{\lambda }\).
The BIC criterion (3.8) has the advantage of being simple to calculate, but it has some issues. First, the BIC is derived for a maximum likelihood estimator, while \(\widehat {\textup {\textbf {P}}}\) is not obtained in this way, and this is only a heuristic. Further, the least squares estimator \(\widehat {\textup {\textbf {P}}}\) is not guaranteed to result in a valid estimated edge probability (between 0 and 1). A possible practical solution is to modify the estimate by defining \(\widetilde {\textup {\textbf {P}}}\in [0,1]^{n\times n}\) as \(\widetilde {\textup {\textbf {P}}}_{ij} = \min \limits (\max \limits (\widehat {\textup {\textbf {P}}}_{ij}, \epsilon ), 1\epsilon )\) for some small value of 𝜖 ∈ (0,1).
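A sketch of the BIC computation under the assumptions spelled out in the comments; in particular, the closed form P̂ = Q̂Q̂^⊤AQ̂Q̂^⊤ is our reading of Proposition 4, and the clipping of probabilities follows the practical fix described above.

```python
import numpy as np

def bic_score(A, V, eps=1e-6):
    """BIC in the spirit of Eq. 3.8 for a thresholded basis V. Assumptions:
    P-hat = Q Q^T A Q Q^T with QR = V, probabilities clipped to
    [eps, 1 - eps], sample size n(n-1)/2, ||V||_0 as the parameter count."""
    n = A.shape[0]
    Q, _ = np.linalg.qr(V)
    P = Q @ Q.T @ A @ Q @ Q.T        # projection estimate of the probability matrix
    P = np.clip(P, eps, 1 - eps)     # keep estimated probabilities valid
    i, j = np.triu_indices(n, k=1)   # independent dyads only
    a, p = A[i, j], P[i, j]
    loglik = np.sum(a * np.log(p) + (1 - a) * np.log(1 - p))
    return -2 * loglik + np.count_nonzero(V) * np.log(n * (n - 1) / 2)

# Smaller is better: compare bic_score(A, V_lambda) over a grid of lambdas.
A = np.kron(np.eye(2), np.ones((5, 5))) - np.eye(10)   # two 5-cliques
V = np.kron(np.eye(2), np.ones((5, 1)))                # block-indicator basis
assert np.isfinite(bic_score(A, V))
```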
Another alternative for choosing the tuning parameter is edge cross-validation (ECV). Li et al. (2020) introduced a CV method for network data based on splitting the set of node pairs 𝒩 = {(i,j) : i, j ∈ {1,…,n}} into L folds. For each fold l = 1, …, L, the corresponding set of node pairs Ω_l ⊂ 𝒩 is excluded, and the rest are used to fit the basis V. Li et al. (2020) propose to use a matrix completion algorithm based on the rank-K truncated SVD to fill in the entries missing after excluding Ω_l, resulting in a matrix M̂^{(l)} ∈ ℝ^{n×n}. Then, for a given λ, we estimate V̂_λ and use Proposition 4 to obtain an estimate P̂_λ(M̂^{(l)}) of P. The error on the held-out edge set is measured by
and the tuning parameter λ is selected to minimize the average cross-validation error
The edge CV method does not rely on a specific model for the graph, which can be convenient in the settings mentioned above, but its computational cost is higher. In practice, we observe that edge CV tends to select more complex models, in which nodes are assigned to more communities than in the solution selected by BIC (see Section 4.2).
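The edge CV loop might be sketched as follows. Here `fit_fn` is a hypothetical placeholder standing in for estimating V̂_λ on the completed matrix and applying Proposition 4, and squared error on the held-out pairs is used as the loss purely for illustration.

```python
import numpy as np

def ecv_errors(A, K, lam_grid, fit_fn, L=3, seed=0):
    """Edge cross-validation sketch. `fit_fn(M, K, lam)` is a hypothetical
    placeholder that should return an estimate of P from the completed
    matrix M for threshold lam."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    pairs = np.column_stack(np.triu_indices(n, k=1))
    folds = rng.integers(0, L, size=len(pairs))
    errs = np.zeros(len(lam_grid))
    for l in range(L):
        held = pairs[folds == l]
        mask = np.ones((n, n), bool)
        mask[held[:, 0], held[:, 1]] = False
        mask[held[:, 1], held[:, 0]] = False
        # Rank-K truncated-SVD completion: zero the held-out entries and
        # rescale the rest by the observed fraction.
        M = np.where(mask, A, 0.0) / mask.mean()
        u, s, vt = np.linalg.svd(M)
        M = (u[:, :K] * s[:K]) @ vt[:K]
        for j, lam in enumerate(lam_grid):
            Phat = np.clip(fit_fn(M, K, lam), 0.0, 1.0)
            errs[j] += np.sum((A - Phat)[held[:, 0], held[:, 1]] ** 2)
    return errs / L
```

Picking λ then amounts to `lam_grid[np.argmin(ecv_errors(...))]`.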
Numerical Evaluation on Synthetic Networks
We start by evaluating our methods and comparing them to benchmarks on simulated networks. In all scenarios, we generate networks from OCCAM, so the edges of A are independent Bernoulli random variables with expectations given by Eq. 2.1. We assume that each row vector \(\textup {\textbf {z}}_{i}\in \mathbb {R}^{K}\) of Z = [z_{1},…,z_{n}]^{⊤} satisfies ∥z_{i}∥_{1} = 1, so that each node has the same expected degree. To better understand what affects performance, we vary one parameter at a time from the following list; all of them affect the difficulty of detecting overlapping communities.

a)
Fraction of nodes belonging to more than one community, \(\widetilde {p}\) (the higher \(\widetilde p\), the more difficult the problem). For a given \(\widetilde p\in [0,1)\), we select \(\widetilde {p}n\) nodes for the overlaps and assign the rest to a single community each, distributed equally among the communities. For most of the experiments we use K = 3 communities; 1/4 of the overlapping nodes are assigned to all three communities with z_{i} = [1/3,1/3,1/3]^{⊤}, while the rest are assigned to two communities j, k, with z_{ij} = z_{ik} = 1/2, distributing these nodes equally over all pairs (j, k). When K > 3, we still assign the overlapping nodes to two communities following the same process, but we do not include overlaps of three or more communities.

b)
Connectivity between communities ρ (the higher ρ, the more difficult the problem). We parameterize B as
$$\textup{\textbf{B}} = (1-\rho) \boldsymbol{I}_{K} + \rho \boldsymbol{1}_{K}\boldsymbol{1}_{K}^{\top},$$and vary ρ over a range of values between 0 and 1.

c)
Average degree of the network d (the higher d, the easier the problem). For a given average degree d, we set α in Eq. 2.1 so that the expected average degree \(\frac {1}{n}\boldsymbol {1}^{\top }_{n}\mathbb {E}[\textbf {A}] \boldsymbol {1}_{n}\) is equal to d.

d)
Node degree heterogeneity (the more heterogeneous the degrees, the harder the problem). This is controlled by the parameter 𝜃 = diag(Θ); in most simulations we set 𝜃_{i} = 1 for all i ∈ [n], so all nodes have the same expected degree, but in some scenarios we introduce hub nodes by setting 𝜃_{i} = 5 with probability 0.1 and 𝜃_{i} = 1 with probability 0.9.

e)
Number of communities K (the larger K, the harder the problem). For all values of K, we maintain communities of equal size.
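A minimal generator for this simulation design can be sketched as follows (our own Python rendition; function and argument names are ours, and the overlap layout follows the description above for K = 3, with uniform 𝜃):

```python
import numpy as np

def simulate_occam(n=500, K=3, rho=0.1, d=50, p_tilde=0.1, seed=0):
    """Generate an adjacency matrix following the simulation design above.

    Membership rows have unit L1 norm; a fraction p_tilde of nodes
    overlap (for K = 3, one quarter of them in all three communities,
    the rest in pairs); B = (1 - rho) I + rho 11'; alpha is scaled so
    the expected average degree is d.
    """
    rng = np.random.default_rng(seed)
    Z = np.zeros((n, K))
    n_over = int(p_tilde * n)
    # pure nodes, split evenly across the K communities
    for i in range(n_over, n):
        Z[i, (i - n_over) % K] = 1.0
    # overlapping nodes: first quarter in all three (K = 3), rest in pairs
    pairs = [(j, k) for j in range(K) for k in range(j + 1, K)]
    for i in range(n_over):
        if K == 3 and i < n_over // 4:
            Z[i] = 1.0 / K
        else:
            j, k = pairs[i % len(pairs)]
            Z[i, j] = Z[i, k] = 0.5
    B = (1 - rho) * np.eye(K) + rho * np.ones((K, K))
    P = Z @ B @ Z.T
    alpha = d * n / P.sum()            # expected average degree equals d
    P = np.clip(alpha * P, 0, 1)
    A = (rng.random((n, n)) < P).astype(int)
    A = np.triu(A, 1)                  # symmetric, no self-loops
    return A + A.T, Z
```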
In most scenarios, we fix n = 500 and K = 3. All simulation settings are run 50 times, and the average result together with its 95% confidence band is reported. An implementation of the method in R can be found at https://github.com/jesusdaniel/spcaCD.
Our main goal is to find the set of nonzero elements of the membership matrix. Many measures can be adopted to evaluate a solution; here we use the normalized variation of information (NVI) introduced by Lancichinetti et al. (2009), which is specifically designed for problems with overlapping clusters. Given a pair of binary random vectors X, Y of length K, the normalized conditional entropy of X with respect to Y can be defined as
where H(X_{k}) is the entropy of X_{k} and H(X_{k}|Y_{k}) is the conditional entropy of X_{k} given Y_{k}, defined as
and the normalized variation of information between X and Y is defined as
where σ is a permutation of the indexes to account for the fact that the binary assignments can be equivalent up to a permutation. The NVI is always a number between 0 and 1; it is equal to 0 when X and Y are independent, and to 1 if X = Y.
For a given pair of membership matrices Z and \(\widetilde {\textup {\textbf {Z}}}\) with binary entries indicating community memberships, we replace the probabilities in Eqs. 4.1 and 4.2 with their sample versions computed from the rows of Z and \(\widetilde {\textup {\textbf {Z}}}\), that is
for a, b ∈{0,1}.
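A small sketch of this sample NVI computation (our own Python rendition of the quantities above; for large K the exhaustive search over permutations would be replaced by a matching algorithm):

```python
import numpy as np
from itertools import permutations

def entropy(p):
    """Entropy (base 2) of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cond_entropy_norm(Zx, Zy):
    """Average of H(X_k | Y_k) / H(X_k) over communities, with the
    columns of Zy already aligned to those of Zx."""
    K = Zx.shape[1]
    terms = []
    for k in range(K):
        x, y = Zx[:, k], Zy[:, k]
        Hx = entropy([x.mean(), 1 - x.mean()])
        if Hx == 0:
            continue                      # degenerate community, skip
        Hxy = 0.0
        for b in (0, 1):
            mask = y == b
            if mask.sum() == 0:
                continue
            px = x[mask].mean()
            Hxy += mask.mean() * entropy([px, 1 - px])
        terms.append(Hxy / Hx)
    return np.mean(terms) if terms else 0.0

def nvi(Zx, Zy):
    """Sample NVI between binary membership matrices, maximized over
    column permutations (sketch after Lancichinetti et al., 2009)."""
    K = Zx.shape[1]
    best = 0.0
    for perm in permutations(range(K)):
        Zp = Zy[:, list(perm)]
        score = 1 - 0.5 * (cond_entropy_norm(Zx, Zp)
                           + cond_entropy_norm(Zp, Zx))
        best = max(best, score)
    return best
```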
Choice of Initial Value
We start by comparing several initialization strategies:

An overlapping community assignment, from the method for fitting OCCAM.

A non-overlapping community assignment, from SCORE (Jin, 2015), a spectral clustering method designed for networks with heterogeneous degrees.

Multiple random non-overlapping community assignments, with each node randomly assigned to only one community. We use five different random assignments and take the solution with the smallest error as measured by Eq. 3.7.
We compare these initialization schemes with fixed n = 500, K = 3, d = 50, varying the between-community connectivity ρ and the fraction of overlapping nodes \(\widetilde p\). For both our methods (SPCA-eig and SPCA-CD), we fit solution paths over the range of values λ ∈ {0.05,0.1,…,0.95}, and report the solution with the highest NVI for each method (we do not select λ in a data-driven way here, in order to remove variation unrelated to the choice of initialization).
Figure 1 shows the results for the initialization strategies. In general, all methods perform worse as the problem becomes harder, and the non-random initializations perform better overall; the multiple random initializations are also sufficient in the easier case of few nodes in overlaps. For the rest of the paper, unless explicitly stated otherwise, we use the non-overlapping community detection solution (SCORE) to initialize the algorithm, given its good performance and low computational cost.
Choosing the Threshold
The tuning parameter λ controls the sparsity of the solution, and hence the fraction of pure nodes. Since community detection is an unsupervised problem, it may be useful in practice to look at the entire path over λ and consider multiple solutions with different levels of sparsity (see Section 5.1). However, we may also want to choose a single value of λ that balances a good fit and a parsimonious solution. Here, we evaluate the performance of the two strategies for choosing λ proposed in Section 3.2, BIC and CV, using the same simulation setting as in the previous section.
Figure 2 shows the average performance, measured by NVI, of the two tuning methods. BIC tends to select sparser solutions than CV, and hence when the true membership matrix is sparse (few overlaps), BIC outperforms CV; with more overlap between communities, CV usually performs better, especially for SPCA-CD. Since there is no clear winner overall, we use BIC in subsequent analyses, because it is computationally cheaper.
Comparison with Existing Methods
We compare our proposal to several state-of-the-art methods for overlapping community detection. We use the same simulation settings as in the previous section (n = 500 and K = 3), including sparser scenarios with d = 20, and networks with heterogeneous degrees (d = 50, with 10% of the nodes being hubs).
We select competitors based on good performance reported in previous studies. As representative examples of spectral methods, we include OCCAM fitted by the algorithm of Zhang et al. (2020) and MixedSCORE (Jin et al. 2017). We also include the EM algorithm for the BKN model (Ball et al. 2011) and the overlapping SBM of Latouche et al. (2011) (OSBM), and Bayesian nonnegative matrix factorization (BNMF) by Psorakis et al. (2011). For methods that return a continuous membership assignment (OCCAM, BKN and MixedSCORE), we follow the approach of Zhang et al. (2020) and set to zero the values of the membership matrix \(\hat {\textup {\textbf {Z}}}\) that are smaller than 1/K.
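The thresholding rule used for the continuous-membership methods can be expressed in a couple of lines (a sketch; the function name is ours):

```python
import numpy as np

def binarize_memberships(Z_hat):
    """Zero out continuous membership weights smaller than 1/K,
    following Zhang et al. (2020). If the rows of Z_hat sum to one,
    the largest entry in each row is always kept."""
    K = Z_hat.shape[1]
    return (Z_hat >= 1.0 / K).astype(int)
```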
Figure 3 shows the average NVI of these methods as a function of ρ under different scenarios. Most methods show excellent performance when ρ = 0, but as the between-community connectivity increases, the performance of all methods deteriorates. Our methods (SPCA-CD and SPCA-eig) generally achieve the best performance when the fraction of nodes in overlaps is either 0 or 10%, and are highly competitive with 40% of nodes in overlaps as well. OCCAM performs well, which is to be expected since the networks were generated from this model, but in most cases our methods fit it more accurately. MixedSCORE performs well with no overlaps, but deteriorates more quickly than other methods once overlaps are introduced. We should keep in mind that OCCAM and MixedSCORE are designed for estimating continuous memberships, and the threshold of 1/K used to obtain binary memberships might not be optimal. While non-overlapping community detection methods could alternatively be used in the scenario where every node has a single membership, our methods accurately assign the nodes to a single community without knowing the number of memberships in advance.
Computational Efficiency
Scalability to large networks is an important issue for real data applications. Spectral methods for overlapping and non-overlapping community detection are very popular, partly due to their scalability to large networks. The accuracy of these methods usually depends on the clustering algorithm, which in practice might require multiple initial values to obtain an accurate result. In contrast, our methods based on sparse principal component analysis directly estimate the membership matrix without having to estimate the eigenvectors or perform a clustering step. Although the accuracy of our methods does depend on the tuning parameter λ, the algorithms are robust to the choice of this parameter and provide good solutions over a reasonably wide range.
To compare computational efficiency empirically, we simulated networks with different numbers of communities (K = 3, 6 and 10) and increased the number of nodes while keeping the average degree fixed at d = 50, with 10% overlapping nodes. For simplicity, we used a single fixed value λ = 0.6 for our methods. We initialized SPCA-CD with a random membership matrix, and SPCA-eig with the SPCA-CD solution as starting point, and therefore report the running time of SPCA-eig as the sum of the two. We compare the performance of our methods with OCCAM, which uses k-medians clustering to find the centroids of the overlapping communities. Since k-medians is computationally expensive and cannot handle large networks, we also report the performance of OCCAM with the clustering step performed with k-means instead. Additionally, we report the running time of calculating the K leading eigenvectors of the adjacency matrix, which is a starting step required by spectral methods. All simulations are run using Matlab R2015a. The leading eigenvectors of the adjacency matrix are computed using the standard Matlab function eigs(⋅,K).
The performance in terms of running time and accuracy of the different methods is shown in Fig. 4. Our methods based on SPCA incur a computational cost similar to that of calculating the K leading eigenvectors of the adjacency matrix, and when the number of communities is not large, our methods are even faster. The original version of OCCAM based on k-medians clustering is limited in the size of networks it can handle, and even with k-means its computational cost is larger than that of SPCA. Our methods produce very accurate solutions in all the scenarios considered, while OCCAM deteriorates as the number of communities increases. Note that in general the performance of all methods can be improved by using multiple random starting values, either for clustering in OCCAM or for initializing our methods, but this increases the computational cost; so does choosing tuning parameters in a data-driven way, if the generic robust choice is not considered sufficient.
Evaluation on RealWorld Networks
In this section, we evaluate the performance of our methods on several real-world networks. Zachary’s karate club network (Zachary, 1977) and the political blogs network (Adamic and Glance, 2005) are two classic examples with community structure, and we start with them as an illustration. We then compare our method with other state-of-the-art overlapping community detection algorithms on a popular benchmark dataset focused specifically on overlapping communities (McAuley and Leskovec, 2012), which contains many social ego-networks from Facebook, Twitter and Google Plus. Finally, we use our methodology to identify communities in a novel dataset consisting of Twitter following relationships between national representatives in the Mexican Chamber of Deputies.
Zachary’s Karate Club Network
Zachary (1977) recorded the real-life interactions of 34 members of a karate club over a period of two years. During this period, the club split into two factions due to a conflict between the leaders, and these factions are taken to be the ground truth communities.
We fit our methods to the karate club network, and with either BIC or CV used to choose the optimal threshold parameter, the solution consists of two communities with only pure nodes and matches the ground truth. This serves as reassurance that our method will not force overlaps on the communities when they are not actually there. In contrast, OCCAM assigns 17 nodes (50%) to both communities, and MixedSCORE assigns 26 (76%).
If we look at the entire path over the threshold parameter λ, we can also see which nodes are potential overlaps. Both our methods can identify community memberships, but SPCA-eig also provides information on the degree-correction parameter. In Fig. 5, we examine the effect of the threshold parameter λ on the SPCA-eig solutions. The plots show the paths of the node membership vectors as a function of λ. Each panel corresponds to one of the columns of the membership matrix, the colors indicate the true factions, and the paths of the faction leaders are drawn with dashed lines. The y-axis shows the association of a node with the corresponding community, with membership weighted by the degree-correction parameter. In each community, the nodes with the largest values on the y-axis are the faction leaders, who are connected to most of the nodes in their faction. For larger values of λ, all nodes are assigned as pure nodes to the communities corresponding to the true factions, but as λ decreases the membership matrix contains more nonzero values.
The Political Blogs Network
The political blogs network (Adamic and Glance, 2005) represents the hyperlinks between 1490 political blogs around the time of the 2004 US presidential election. The blogs were manually labeled as liberal or conservative, which is taken as the ground truth, again without any overlaps. This dataset is more challenging for community detection than the karate club network, due to its high degree heterogeneity (Karrer and Newman, 2011). Following the literature, we focus on the largest connected component of the network, which contains 1222 nodes, and convert the edges to undirected, so that A_{ij} = 1 if either blog i has a hyperlink to blog j or vice versa.
Figure 6 shows the membership paths for the political blogs network obtained with Algorithm 2, as a function of λ and colored by the ground truth labels. Using the tuning parameter selected by BIC, the algorithm assigns only 29 nodes to both communities. Other overlapping community detection methods assign many more nodes to both communities: 229 (19%) for OCCAM, and 195 (16%) for MixedSCORE. To convert the estimated solution into non-overlapping memberships for comparison with the ground truth, each node is assigned to the community corresponding to the largest entry in its row of the membership matrix, resulting in 52 misclustered nodes, a result comparable to other community detection methods that can handle networks with heterogeneous node degrees (Jin, 2015). The membership paths corresponding to these misclustered nodes are drawn with dashed lines. The fact that most of the overlapping nodes discovered by the algorithm were incorrectly clustered supports the idea that these are indeed overlapping nodes: the disagreement between the unsupervised clustering result and the labels given by the authors may indicate that these nodes have no clear membership.
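The conversion to non-overlapping memberships used for this comparison is simply a row-wise argmax (a sketch; the function name is ours):

```python
import numpy as np

def to_hard_assignment(Z_hat):
    """Collapse an (overlapping) membership matrix to a single community
    per node by taking the largest entry in each row."""
    return np.argmax(Z_hat, axis=1)
```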
The SNAP Social Networks
Social media platforms provide a rich source of data for the study of social interactions. McAuley and Leskovec (2012) presented a large collection of ego-networks from Facebook, Google Plus and Twitter. An ego-network represents the virtual friendships or follower-following relationships within a group of people who are connected to a central user. These platforms allow users to manually label or classify their friends into groups or social circles, and this information can be used as a ground truth to compare the performance of methods for detecting communities. In Zhang et al. (2020), several state-of-the-art overlapping community detection methods were compared on these data, showing competitive performance of OCCAM. We again include OCCAM and MixedSCORE as examples of spectral methods for overlapping community detection. We obtained a preprocessed version of the data directly from the first author of Zhang et al. (2020); for details on the preprocessing steps, see Section 6 of Zhang et al. (2020).
Table 1 shows the average performance, measured by NVI, of the community detection methods we compared. For our methods, the value of λ was chosen by BIC, as in the simulations. For OCCAM and MixedSCORE, we thresholded the continuous membership assignments at 1/K. Our methods (SPCA-eig and SPCA-CD) show slightly better performance than the rest of the methods on the Facebook networks. SPCA-CD performs better than the other methods on the Twitter networks, but SPCA-eig does not outperform OCCAM there. On the Google Plus networks, OCCAM and MixedSCORE have a clear advantage. Figure 7 presents a visualization of the distribution of several network summary statistics for each social media platform. It suggests that Google Plus networks might be harder because they tend to have larger overlaps between communities, although they also tend to have more nodes. Facebook networks, in contrast, have higher modularity values and smaller overlaps, and thus should be easier to cluster. In general, all methods perform reasonably, with SPCA-CD giving the best overall performance on the Facebook and Twitter networks, and OCCAM being best overall on Google Plus. This is consistent with what we observed in simulations and what we would expect by design: our methods are more likely to outperform others when membership vectors are sparse.
Twitter Network of Mexican Representatives
We consider the Twitter network between members of the Mexican Chamber of Deputies (the lower house of the Mexican parliament) from the LXIII Legislature, covering the period 2015–2018. The network captures a snapshot of Twitter data from December 31st, 2017, and has 409 nodes corresponding to the representatives with a valid Twitter handle. Two nodes are connected by an edge if at least one of them follows the other on Twitter; we ignore the direction. Each member belongs to one of eight political parties or is an independent, resulting in K = 9 true communities; see Fig. 8. The data can be downloaded from https://github.com/jesusdaniel/spcaCD/data.
We apply Algorithm 1 to this network, using 20-fold edge cross-validation to estimate the number of communities and choose the thresholding parameter. Figure 9 shows the average MSE across all the folds, which is minimized at K = 10. However, the solutions corresponding to all K from 8 to 11 are qualitatively very similar; the only difference is that the largest parties (PRI and PAN) get split into smaller communities, with clusters containing the most popular members of each party, and/or factions within a party that are more connected to some of the other parties.
A comparison between the estimated membership vectors and party affiliations reveals that our algorithm discovers meaningful overlapping communities. Table 2 compares the estimated overlapping memberships with the party labels, counting the number of nodes that are assigned to a given community (recall that each node can be assigned to more than one community) and belong to a specific party. Some of the communities contain representatives from two or more different parties, reflecting coalitions and factions. For example, the majority of nodes in community 3 belong to either PRI or PVEM, which formed a coalition during the preceding election in 2015. On the other hand, the representatives from MORENA in community 4 were members of PRD before MORENA was formed in 2014. The plot on the right in Fig. 8 also shows a significant overlap between these parties.
Exploring individual memberships reveals that the number of communities a node is assigned to appears to be associated with its overall popularity in the network. For example, the node with the largest number of community memberships (7 in total) is the representative with the largest degree in the network, while the node with the second largest number of memberships (5 in total) corresponds to the president of the Chamber of Deputies in 2016.
Discussion
We presented an approach to estimating a regularized basis of the principal subspace of the network adjacency matrix, and showed that its sparsity pattern encodes the community membership information. Varying the amount of regularization controls the sparsity of the node memberships and allows one to obtain a family of solutions of increasing complexity. These methods show good accuracy in estimating the memberships and are computationally very efficient, allowing them to scale well to large networks. Our present theoretical results are limited to fixed points of the algorithms; establishing theoretical guarantees in more general settings, as well as analyzing conditions for convergence to the fixed point, are left for future work.
Spectral inference has been used for multiple tasks on networks: community detection (Lei and Rinaldo, 2015; Le et al. 2017), hypothesis testing (Tang et al. 2017), multiple network dimensionality reduction (Levin et al. 2017) and network classification (Arroyo and Levina, 2020). While the principal eigenspace of the adjacency matrix can provide the information needed for these problems, our results suggest that regularizing the eigenvectors can lead to improved estimation and computation in community detection; exploring the effects of this type of regularization in other network tasks is a promising direction for future work.
References
Abbe, E (2017). Community detection and stochastic block models: recent developments. Journal of Machine Learning Research 18, 1–86.
Adamic, L A and Glance, N (2005). The political blogosphere and the 2004 US election: divided they blog. ACM, p. 36–43.
Airoldi, E M, Blei, D M, Fienberg, S E and Xing, E P (2009). Mixed membership stochastic blockmodels, p. 33–40.
Amini, A A and Wainwright, M J (2008). Highdimensional analysis of semidefinite relaxations for sparse principal components. IEEE, p. 2454–2458.
Amini, A A and Levina, E (2018). On semidefinite relaxations for the block model. The Annals of Statistics 46, 1, 149–179.
Arroyo, J and Levina, E (2020). Simultaneous prediction and community detection for networks with application to neuroimaging. arXiv:2002.01645.
Ball, B, Karrer, B and Newman, M E J (2011). Efficient and principled method for detecting communities in networks. Physical Review E 84, 3, 036103.
Bollobás, B, Janson, S and Riordan, O (2007). The phase transition in inhomogeneous random graphs. Random Structures and Algorithms 31, 1, 3–122.
Bullmore, E and Sporns, O (2009). Complex brain networks: graph theoretical analysis of structural and functional systems. Nature Reviews Neuroscience 10, 3, 186–198.
Cape, J, Tang, M and Priebe, C E (2019). On spectral embedding performance and elucidating network structure in stochastic blockmodel graphs. Network Science 7, 3, 269–291.
Conover, M, Ratkiewicz, J, Francisco, M R, Gonçalves, B, Menczer, F and Flammini, A (2011). Political polarization on Twitter. ICWSM 133, 89–96.
da Fonseca Vieira, V, Xavier, C R and Evsukoff, A G (2020). A comparative study of overlapping community detection methods from the perspective of the structural properties. Applied Network Science 5, 1, 1–42.
Girvan, M and Newman, M E J (2002). Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99, 12, 7821–7826.
Golub, G H and Van Loan, C F (2012). Matrix computations, 3. Johns Hopkins University Press, USA.
Gregory, S (2010). Finding overlapping communities in networks by label propagation. New Journal of Physics 12, 10, 103018.
Holland, P W, Laskey, K B and Leinhardt, S (1983). Stochastic blockmodels: First steps. Social Networks 5, 2, 109–137.
Huang, K and Fu, X (2019). Detecting overlapping and correlated communities without pure nodes: Identifiability and algorithm, p. 2859–2868.
Ji, P and Jin, J (2016). Coauthorship and citation networks for statisticians. The Annals of Applied Statistics 10, 4, 1779–1812.
Jin, J (2015). Fast community detection by SCORE. The Annals of Statistics 43, 1, 57–89.
Jin, J, Ke, Z T and Luo, S (2017). Estimating network memberships by simplex vertex hunting. arXiv:1708.07852.
Johnstone, I M and Lu, A Y (2009). On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association 104, 486, 682–693.
Jolliffe, I T, Trendafilov, N T and Uddin, M (2003). A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics 12, 3, 531–547.
Karrer, B and Newman, M E J (2011). Stochastic blockmodels and community structure in networks. Physical Review E 83, 1, 016107.
Lancichinetti, A, Fortunato, S and Kertész, J (2009). Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics 11, 3, 033015.
Lancichinetti, A, Radicchi, F, Ramasco, J J and Fortunato, S (2011). Finding statistically significant communities in networks. PLoS ONE 6, 4.
Latouche, P, Birmelé, E and Ambroise, C (2011). Overlapping stochastic block models with application to the French political blogosphere. The Annals of Applied Statistics, 309–336.
Le, C M and Levina, E (2015). Estimating the number of communities in networks by spectral methods. arXiv:1507.00827.
Le, C M, Levina, E and Vershynin, R (2017). Concentration and regularization of random graphs. Random Structures & Algorithms 51, 3, 538–561.
Lee, D D and Seung, H S (1999). Learning the parts of objects by nonnegative matrix factorization. Nature 401, 6755, 788–791.
Lei, J and Rinaldo, A (2015). Consistency of spectral clustering in stochastic block models. The Annals of Statistics 43, 1, 215–237.
Levin, K, Athreya, A, Tang, M, Lyzinski, V and Priebe, C E (2017). A central limit theorem for an omnibus embedding of random dot product graphs. arXiv:1705.09355.
Li, T, Levina, E and Zhu, J (2020). Network crossvalidation by edge sampling. Biometrika 107, 2, 257–276.
Lyzinski, V, Sussman, D L, Tang, M, Athreya, A and Priebe, C E (2014). Perfect clustering for stochastic blockmodel graphs via adjacency spectral embedding. Electronic Journal of Statistics 8, 2, 2905–2922.
Ma, Z (2013). Sparse principal component analysis and iterative thresholding. The Annals of Statistics, 41, 2, 772–801.
Mao, X, Sarkar, P and Chakrabarti, D (2017). On mixed memberships and symmetric nonnegative matrix factorizations. PMLR, p. 2324–2333.
Mao, X, Sarkar, P and Chakrabarti, D (2018). Overlapping clustering models, and one (class) svm to bind them all, p. 2126–2136.
Mao, X, Sarkar, P and Chakrabarti, D (2020). Estimating mixed memberships with sharp eigenvector deviations. Journal of the American Statistical Association. (justaccepted), 1–24.
McAuley, J J and Leskovec, J (2012). Learning to discover social circles in ego networks., 2012, p. 548–56.
Newman, M E J (2006). Finding community structure in networks using the eigenvectors of matrices. Physical Review E 74, 3, 036104.
Porter, M A, Onnela, J P and Mucha, P J (2009). Communities in networks. Notices of the AMS 56, 9, 1082–1097.
Power, J D, Cohen, A L, Nelson, S M, Wig, G S, Barnes, K A, Church, J A, Vogel, A C, Laumann, T O, Miezin, F M and Schlaggar, B L (2011). Functional network organization of the human brain. Neuron 72, 4, 665–678.
Psorakis, I, Roberts, S, Ebden, M and Sheldon, B (2011). Overlapping community detection using Bayesian nonnegative matrix factorization. Physical Review E 83, 6, 066114.
Rohe, K, Chatterjee, S and Yu, B (2011). Spectral clustering and the highdimensional stochastic blockmodel. Ann. Statist. 39, 4, 1878–1915.
RubinDelanchy, P, Priebe, C E and Tang, M (2017). Consistency of adjacency spectral embedding for the mixed membership stochastic blockmodel. arXiv:1705.04518.
Schlitt, T and Brazma, A (2007). Current approaches to gene regulatory network modelling. BMC Bioinformatics 8, Suppl 6, S9.
Schwarz, G (1978). Estimating the dimension of a model. The Annals of Statistics 6, 2, 461–464.
Schwarz, A J, Gozzi, A and Bifone, A (2008). Community structure and modularity in networks of correlated brain activity. Magnetic Resonance Imaging 26, 7, 914–920.
Tang, M, Athreya, A, Sussman, D L, Lyzinski, V, Park, Y and Priebe, C E (2017). A semiparametric twosample hypothesis testing problem for random graphs. Journal of Computational and Graphical Statistics 26, 2, 344–354.
Vu, V Q and Lei, J (2013). Minimax sparse principal subspace estimation in high dimensions. The Annals of Statistics 41, 6, 2905–2947.
Wang, C and Blei, D (2009). Decoupling sparsity and smoothness in the discrete hierarchical Dirichlet process. Advances in Neural Information Processing Systems 22, 1982–1989.
Wang, YX R and Bickel, P J (2017). Likelihoodbased model selection for stochastic block models. The Annals of Statistics 45, 2, 500–528.
Wasserman, S and Faust, K (1994). Social network analysis: Methods and applications, 8. Cambridge University Press, Cambridge.
Williamson, S, Wang, C, Heller, K A and Blei, D M (2010). The IBP compound Dirichlet process and its application to focused topic modeling. Omnipress, Madison, p. 1151–1158.
Xie, J, Kelley, S and Szymanski, B K (2013). Overlapping community detection in networks: The stateoftheart and comparative study. ACM Computing Surveys 45, 4, 1–35.
Yu, Y, Wang, T and Samworth, R J (2015). A useful variant of the Davis–Kahan theorem for statisticians. Biometrika 102, 2, 315–323.
Zachary, W W (1977). An information flow model for conflict and fission in small groups. Journal of Anthropological Research 33, 4, 452–473.
Zhang, Y, Levina, E and Zhu, J (2020). Detecting Overlapping Communities in Networks Using Spectral Methods. SIAM Journal on Mathematics of Data Science 2, 2, 265–283.
Zou, H, Hastie, T and Tibshirani, R (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics 15, 2, 265–286.
Acknowledgements
This research was supported in part by NSF grants DMS1521551 and DMS1916222. The authors would like to thank Yuan Zhang for helpful discussions, and Advanced Research Computing at the University of Michigan for computational resources and services.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Appendix
Proof of Proposition 1.
Because V and \(\widetilde {\textup {\textbf {V}}}\) are two bases of the column space of P, and rank(P) = K, then \(\textbf {P}=\textbf {V}\textbf {U}^{\top }=\widetilde {\textbf {V}}\widetilde {\textbf {U}}^{\top }\) for some full rank matrices \(\textbf {U},\widetilde {\textbf {U}}\in \mathbb {R}^{n\times K}\) and therefore
Let \((\widetilde {\textup {\textbf {U}}}^{\top }\textup {\textbf {U}})({\textup {\textbf {U}}}^{\top }\textup {\textbf {U}})^{1}=\boldsymbol {\Lambda }\). We will show that Λ = QD for a permutation matrix Q ∈{0,1}^{K×K} and a diagonal matrix \(\textup {\textbf {D}}\in \mathbb {R}^{K\times K}\), or in other words, this is a generalized permutation matrix.
Let \(\boldsymbol {\theta },\widetilde {\boldsymbol {\theta }}\in \mathbb {R}^{n}\) and \(\textup {\textbf {Z}},\widetilde {\textup {\textbf {Z}}}\in \mathbb {R}^{n\times K}\) be such that \(\boldsymbol {\theta }_{i} = \left ({\sum }_{k=1}^{K}\textup {\textbf {V}}_{ik}^{2}\right )^{1/2}\), \(\widetilde {\boldsymbol {\theta }}_{i} = \left ({\sum }_{k=1}^{K}\widetilde {\textup {\textbf {V}}}_{ik}^{2}\right )^{1/2}\), and Z_{ik} = V_{ik}/𝜃_{i} if 𝜃_{i} > 0, and Z_{ik} = 0 otherwise (similarly for \(\widetilde {\textup {\textbf {Z}}}\)). Denote by \(\mathcal {S}_{1}=(i_{1},\ldots ,i_{K})\) the vector of row indices that satisfy \(\textup {\textbf {V}}_{i_{j}j}> 0\) and \(\textup {\textbf {V}}_{i_{j}j'}=0\) for j^{′}≠j, for j = 1,…,K (these indices exist by assumption). In the same way, define \(\mathcal {S}_{2}=(i'_{1},\ldots ,i'_{K})\) such that \(\widetilde {\textup {\textbf {V}}}_{i^{\prime }_{j}j}> 0\) and \(\widetilde {\textup {\textbf {V}}}_{i'_{j}j'}=0\) for j^{′}≠j, for j = 1,…,K. Denote by \(\textup {\textbf {Z}}_{\mathcal {S}}\) the K × K matrix formed by the rows indexed by \(\mathcal {S}\). Therefore
Write \(\boldsymbol {\Theta } = \text {diag}(\boldsymbol {\theta })\in \mathbb {R}^{n\times n}\) and \(\widetilde {\boldsymbol {\Theta }} = \text {diag}(\widetilde {\boldsymbol {\theta }})\in \mathbb {R}^{n\times n}\). From Eq. A.1 we have
where \({\Theta }_{\mathcal {S}, \mathcal {S}}\) is the submatrix of Θ formed by the rows and columns indexed by \(\mathcal {S}\). Thus,
which implies that Λ is a nonnegative matrix. Applying the same argument to the equation \((\boldsymbol {\Theta }\textbf {Z})_{\mathcal {S}_{1}}\boldsymbol {\Lambda }^{-1}= (\widetilde {\boldsymbol {\Theta }}\widetilde {\textbf {Z}})_{\mathcal {S}_{1}}\), we have
Hence, both Λ and Λ^{− 1} are nonnegative matrices, which implies that Λ is a positive generalized permutation matrix, so Λ = QD for some permutation matrix Q and a diagonal matrix D with diag(D) > 0. □
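The last step rests on the algebraic fact that a square matrix which is nonnegative together with its inverse must be a positive generalized permutation matrix. A quick numeric sanity check of this fact (not part of the proof; the matrices below are arbitrary illustrative examples):

```python
import numpy as np

# A positive generalized permutation matrix Lam = Q D and its inverse
# are both nonnegative, while a generic nonnegative matrix typically
# has an inverse with negative entries.
Q = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)   # permutation matrix
D = np.diag([2.0, 0.5, 3.0])             # positive diagonal matrix
Lam = Q @ D

assert np.all(Lam >= 0)
# inverse is D^{-1} Q^T, nonnegative (tolerance guards floating-point error)
assert np.all(np.linalg.inv(Lam) >= -1e-12)

M = np.array([[1.0, 1.0],
              [0.0, 1.0]])               # nonnegative, but not of the form QD
assert np.any(np.linalg.inv(M) < 0)      # its inverse has a negative entry
```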
Proof of Proposition 2.
Let \(\boldsymbol {\theta }\in \mathbb {R}^{n}\) be a vector such that \({\boldsymbol {\theta }_{i}^{2}}={\sum }_{k=1}^{K}\boldsymbol {V}_{ik}^{2}\), and define \(\textbf {Z}\in \mathbb {R}^{n\times K}\) such that \(\textbf {Z}_{ik}=\frac {1}{\theta _{i}}\textbf {V}_{ik}\) for each i ∈ [n],k ∈ [K]. Let B = (V^{⊤}V)^{− 1}V^{⊤}U. To show that B is symmetric, observe that VU^{⊤} = P = P^{⊤} = UV^{⊤}. Multiplying both sides by V and V^{⊤},
and observing that (V^{⊤}V)^{− 1} exists since V is full rank, we have
which implies that B^{⊤} = B. To obtain the equivalent representation for P, form a diagonal matrix Θ = diag(𝜃). Then ΘZ = V, and
Finally, under the conditions of Proposition 1, V uniquely determines the pattern of zeros of any nonnegative eigenbasis of P, and therefore supp(V) = supp(ΘZQ) = supp(ZQ) for some permutation Q. □
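The symmetry claim can be checked numerically. The sketch below builds an arbitrary symmetric P = VU^⊤ with a full-rank nonnegative V and verifies that B = (V^⊤V)^{−1}V^⊤U comes out symmetric; all matrices here are illustrative examples, not objects from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

V = np.abs(rng.random((6, 2)))        # full-column-rank nonnegative basis
B0 = np.array([[1.0, 0.3],
               [0.3, 2.0]])           # symmetric middle factor
U = V @ B0                            # chosen so that P = V B0 V^T is symmetric
P = V @ U.T

# B = (V^T V)^{-1} V^T U, computed via a linear solve for stability
B = np.linalg.solve(V.T @ V, V.T @ U)
assert np.allclose(P, P.T)            # P is symmetric by construction
assert np.allclose(B, B.T)            # and B is recovered symmetric (B == B0)
```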
Proof of Proposition 3.
Suppose that P = VU^{⊤} for some nonnegative matrix V that satisfies the assumptions of Proposition 1. Let \(\textbf {D}\in \mathbb {R}^{K\times K}\) be the diagonal matrix with D_{kk} = ∥V_{⋅k}∥_{2}. Then \(\textbf {P} = \widetilde {\textbf {V}}\textbf {D}\textbf {U}^{\top }\). Let \(\textbf {V}^{(0)} = \widetilde {\textbf {V}}\) be the initial value of Algorithm 1. Then, observe that
Suppose that λ ∈ [0,v^{∗}). Then \(\lambda \max \limits _{j\in [K]}\widetilde {\textbf {V}}_{ij} <\widetilde {\textbf {V}}_{ik}\) for all i ∈ [n],k ∈ [K] such that V_{ik} > 0, and hence \(\textbf {U}^{(1)}=\mathcal {S}(\widetilde {\textbf {V}}, \lambda ) = \widetilde {\textbf {V}}\). Finally, since \(\|\widetilde {\textbf {V}}_{\cdot ,k}\|_{2}=1\) for all k ∈ [K], we have \(\textbf {V}^{(1)}=\widetilde {\textbf {V}}\). □
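The fixed-point claim can be illustrated with a small sketch. The threshold-and-normalize step below (zeroing entries smaller than λ times their row maximum, then rescaling columns to unit norm) is an assumption modeled on the description in this proof, not necessarily the exact operator \(\mathcal{S}\) of Algorithm 1:

```python
import numpy as np

def threshold_rows(V, lam):
    """Zero out V[i, k] whenever V[i, k] <= lam * max_j V[i, j]."""
    row_max = V.max(axis=1, keepdims=True)
    return np.where(V > lam * row_max, V, 0.0)

def normalize_columns(U):
    """Scale each nonzero column of U to unit Euclidean norm."""
    norms = np.linalg.norm(U, axis=0, keepdims=True)
    norms[norms == 0] = 1.0
    return U / norms

# A nonnegative matrix with unit-norm columns is unchanged by one
# threshold-and-normalize step whenever lam is below the ratio of the
# smallest nonzero entry to the largest entry in every row.
V = np.array([[0.8, 0.0],
              [0.6, 0.0],
              [0.0, 1.0]])
lam = 0.5
V1 = normalize_columns(threshold_rows(V, lam))
assert np.allclose(V1, V)   # V is a fixed point of the step
```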
Proof of Theorem 1.
The proof consists of a one-step fixed point analysis of Algorithm 2. We will show that if Z^{(t)} = Z, then Z^{(t+ 1)} = Z with high probability. Let T = T^{(t+ 1)} = AZ be the value after the multiplication step. Define \(\textbf {C}\in \mathbb {R}^{K\times K}\) to be the diagonal matrix with community sizes on the diagonal, C_{kk} = n_{k} = ∥Z_{⋅,k}∥_{1}. Then \(\widetilde {\textbf {T}}=\widetilde {\textbf {T}}^{(t+1)}= \textbf {T}\textbf {C}^{-1}\). In order for the threshold to set the correct set of entries to zero, a sufficient condition is that in each row i the largest element of \(\widetilde {\boldsymbol {T}}_{i,\cdot }\) corresponds to the correct community. Define \(\mathcal {C}_{k}\subset [n]\) as the subset of nodes in community k. Then,
Therefore \(\widetilde {\textbf {T}}_{ik}\) is a rescaled sum of independent and identically distributed Bernoulli random variables. Moreover, for each k_{1} and k_{2} in [K], \(\widetilde {\textbf {T}}_{ik_{1}}\) and \(\widetilde {\textbf {T}}_{ik_{2}}\) are independent of each other.
Given a value of λ ∈ (0,1), let
be the event that the largest entry of \(\widetilde {\textbf {T}}_{i\cdot }\) corresponds to k_{i}, that is, the entry corresponding to the community of node i, and all the other entries in that row are smaller in magnitude than \(\lambda \widetilde {\textbf {T}}_{ik_{i}}\). Let \(\textbf {U} = \textbf {U}^{(t+1)}=\mathcal {S}(\widetilde {\textbf {T}}^{(t+1)}, \lambda )\) be the matrix obtained after the thresholding step. Under the event \(\mathcal {E}(\lambda )=\bigcap _{i=1}^{n} \mathcal {E}_{i}(\lambda )\), we have that \(\|\textbf {U}_{i,\cdot }\|_{\infty } = \textbf {U}_{ik_{i}}\) for each i ∈ [n], and hence
Therefore, under the event \(\mathcal {E}(\lambda )\), the thresholding step recovers the correct support, so Z^{(t+ 1)} = Z.
Now we verify that under the conditions of Theorem 3.6, the event \(\mathcal {E}(\lambda )\) happens with high probability. By a union bound,
For j≠k_{i}, \(\widetilde {\textbf {T}}_{ij}-\lambda \widetilde {\textbf {T}}_{ik_{i}}\) is a sum of independent random variables with expectation
By Hoeffding’s inequality, we have that for any \(\tau \in \mathbb {R}\),
where \(n_{\min \limits } = \min \limits _{k\in [K]}n_{k}\). Setting
and using Eq. A.3 and 3.6, we obtain that for n sufficiently large,
Combining with the bound (A.2), the probability of event \(\mathcal {E}(\lambda )\) (which implies that Z^{(t+ 1)} = Z) is bounded from below as
Therefore, with high probability Z is a fixed point of Algorithm 2 for any λ ∈ (λ^{∗},1). □
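The one-step argument can be simulated. The sketch below draws an adjacency matrix from a stochastic block model, applies one multiplication-and-thresholding step starting from the true membership matrix Z, and checks that the support is recovered exactly; the parameters p, q, λ and the row-max threshold rule are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

n, K = 300, 3
p, q = 0.5, 0.1                       # within/between community edge probabilities
labels = rng.integers(0, K, size=n)
Z = np.eye(K)[labels]                 # true membership matrix (one-hot rows)

# Symmetric SBM adjacency matrix without self-loops
P = np.where(labels[:, None] == labels[None, :], p, q)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T

C_inv = np.diag(1.0 / Z.sum(axis=0))  # inverse community sizes, C^{-1}
T = A @ Z @ C_inv                     # T-tilde: row i holds average edge
                                      # densities from node i to each community

# Threshold: keep entries within a factor lam of the row maximum
lam = 0.6
row_max = T.max(axis=1, keepdims=True)
Z_next = (T >= lam * row_max).astype(float)

# One step returns the true memberships: Z is a fixed point (w.h.p.)
assert np.array_equal(Z_next, Z)
```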
Proof of Proposition 4.
Observe that
where C is a constant that does not depend on B. Therefore \(\widehat {\textup {\textbf {B}}}\)
Suppose that \(\widehat {\boldsymbol {V}} = \widehat {\boldsymbol {Q}}\widehat {\boldsymbol {R}}\) for some matrix \(\widehat {\boldsymbol {Q}}\) with orthonormal columns of size n × K. Then \(\widehat {\boldsymbol {R}}\) is a full rank matrix, and therefore
Using this equation, we obtain the desired result. □
Arroyo, J., Levina, E. Overlapping Community Detection in Networks via Sparse Spectral Decomposition. Sankhya A 84, 1–35 (2022). https://doi.org/10.1007/s13171-021-00245-4
Keywords
 Sparse principal component analysis
 Stochastic blockmodel
 Mixed memberships
AMS (2000) subject classification
 Primary: 62H30
 Secondary: 91C20
 68T10