Abstract
A network may have weak signals and severe degree heterogeneity, and may be very sparse in one occurrence but very dense in another. SCORE (Ann. Statist. 43, 57–89, 2015) is a recent approach to network community detection. It accommodates severe degree heterogeneity and is adaptive to different levels of sparsity, but its performance for networks with weak signals is unclear. In this paper, we show that in a broad class of network settings where we allow for weak signals, severe degree heterogeneity, and a wide range of network sparsity, SCORE achieves perfect clustering and has the so-called “exponential rate” in Hamming clustering errors. The proof uses the most recent advancement on entrywise bounds for the leading eigenvectors of the network adjacency matrix. The theoretical analysis assures us that SCORE continues to work well in the weak signal settings, but it does not rule out the possibility that SCORE may be further improved to have better performance in real applications, especially for networks with weak signals. As a second contribution of the paper, we propose SCORE+ as an improved version of SCORE. We investigate SCORE+ with 8 network data sets and find that it outperforms several representative approaches. In particular, for the 6 data sets with relatively strong signals, SCORE+ has similar performance as SCORE, but for the 2 data sets (Simmons, Caltech) with possibly weak signals, SCORE+ has much lower error rates. SCORE+ proposes several changes to SCORE. We carefully explain the rationale underlying each of these changes, using a mixture of theoretical and numerical study.
Introduction
Community detection is a problem that has received considerable attention (Karrer and Newman, 2011; Zhang et al. 2020; Bickel and Chen, 2009). Consider an undirected network \({\mathcal {N}}\) and let A be its adjacency matrix:
Since the network is undirected, A is symmetric. Also, as a convention, we do not consider self-edges, so all diagonal entries of A are 0. We assume the network is connected, consisting of K perceivable non-overlapping communities
Following the convention in many recent works on community detection (e.g., Bickel and Chen 2009; Zhang et al. 2020), we assume K is known and that the nodes do not have mixed memberships, so each node belongs to exactly one of the K communities. The community labels are unknown, and the goal is to use (A,K) to predict them. In statistics, this is known as the clustering problem.
See Jin et al. (2020) and reference therein for discussions on how to estimate K, and Jin et al. (2017) for the generalization of SCORE for network analysis in the presence of mixed memberships.
Similar to “cluster”, “community” is a concept that is scientifically meaningful but mathematically hard to define. Intuitively, communities are clusters of nodes that have more edges “within” than “across” (Jin, 2015; Zhao et al. 2012). Note that “communities” and “components” are different concepts: two communities may be connected, while two components are always disconnected.
Table 1 presents the 8 network data sets which we analyze in this paper. Data sets 2–3 are from Traud et al. (2011, 2012) (see also Chen et al. 2018; Ma et al. 2020), and the other 6 data sets are downloaded from http://www-personal.umich.edu/~mejn/netdata/. For all these data sets, the true labels are suggested by the original authors or data curators, and we use the labels as the “ground truth.”
Conceivably, for some of the data sets, some nodes may have mixed memberships (Airoldi et al. 2008; Jin et al. 2017; Zhang et al. 2020). To alleviate the effect, we did some data preprocessing as follows. For the Polbooks data set, we removed all books labeled as “neutral.” For the Football data set, we removed the 5 “independent” teams. For the UKfaculty data set, we removed the smallest group, which contains only 2 nodes. After the preprocessing, our assumption of “no mixed memberships” is reasonable.
Natural networks have some noteworthy features.

Node sparsity and severe degree heterogeneity. Take Table 1 for example: even for networks with only 1222 nodes, the degrees of some nodes can be as much as 351 times higher than those of others. If we measure the sparsity of a node by its degree, then the sparsity level may range significantly from one node to another.

Overall network sparsity. Some networks are much sparser than others, and the overall network sparsity may range significantly from one network to another.

Weak signal. In many cases, the community structures are subtle and masked by strong noise, so the signal-to-noise ratio (SNR) is relatively low.
It is desirable to have a model that is flexible enough to capture all these features. This is where the DCBM comes in.
The Degree-Corrected Block Model (DCBM)
DCBM is one of the most popular models in network analysis (see for example Karrer and Newman 2011). For each node i, we encode the community label by a K-dimensional vector π_{i}, such that for all 1 ≤ i ≤ n and 1 ≤ k ≤ K,
In DCBM, for a matrix \(P \in \mathbb {R}^{K,K}\) and parameters 𝜃_{1},𝜃_{2},…,𝜃_{n}, we assume the upper triangle of the adjacency matrix A contains independent Bernoulli random variables satisfying
Here, P is a symmetric and (entrywise) nonnegative matrix that models the community structure, and 𝜃_{1},𝜃_{2},…,𝜃_{n} are positive parameters that model the degree heterogeneity. For identifiability, we assume
Writing \(\theta = (\theta _{1}, \theta _{2}, \ldots , \theta _{n})^{\prime }\), Θ = diag(𝜃_{1},𝜃_{2},…,𝜃_{n}), and π = [π_{1},π_{2},…, \(\pi _{n}]^{\prime }\), we define
Note that when i≠j, Ω(i,j) denotes the probability \(\mathbb {P}(A(i,j)=1)\). Let \(W \in \mathbb {R}^{n,n}\) be the centered Bernoulli noise matrix such that W(i,j) = A(i,j) −Ω(i,j) when i≠j and W(i,j) = 0 if i = j. We have^{Footnote 1}
where diag(Ω) stands for the diagonal matrix diag(Ω(1,1),Ω(2,2),…,Ω (n,n)). Note that the rank of Ω is K, so Eq. 1.5 is a lowrank matrix model.
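To make the low-rank structure concrete, here is a small numerical sketch (with made-up parameter values) that builds Ω = ΘΠPΠ′Θ for a toy DCBM and checks that its rank equals K:

```python
import numpy as np

# Toy DCBM "signal" matrix; all parameter values are made up for illustration.
n, K = 12, 3
rng = np.random.default_rng(0)

theta = rng.uniform(0.2, 1.0, size=n)       # degree heterogeneity parameters
Theta = np.diag(theta)

labels = np.repeat(np.arange(K), n // K)    # three balanced communities
Pi = np.eye(K)[labels]                      # n x K membership matrix

P = np.array([[1.0, 0.3, 0.3],
              [0.3, 1.0, 0.3],
              [0.3, 0.3, 1.0]])             # symmetric nonnegative community matrix

Omega = Theta @ Pi @ P @ Pi.T @ Theta       # the rank of Omega is K
```

Since Π has rank K, P is nonsingular, and Θ is positive diagonal, Ω has rank K, which is what makes Eq. 1.5 a low-rank matrix model.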
In the special case of^{Footnote 2}
DCBM reduces to the stochastic block model (SBM). Note that SBM does not model severe degree heterogeneity. The DCBM is also similar to the model in Chaudhuri et al. (2012).
In DCBM, we allow (𝜃,π,P) to depend on n, so the model is flexible enough to capture all three aforementioned features of natural networks.

A reasonable metric for degree heterogeneity is 𝜃_{max}/𝜃_{min}, which is allowed to be large in DCBM. See Table 1.

A reasonable metric for overall network sparsity is ∥𝜃∥, and in DCBM, ∥𝜃∥ depends on n and is allowed to range freely between 1 and \(\sqrt {n}\) (up to some multi-\(\log (n)\) terms),^{Footnote 3} corresponding to the sparsest networks and the densest networks, respectively.

A reasonable metric for SNR is \(\lambda _{K}/ \sqrt {\lambda _{1}}\) (see Jin et al. 2019 for an explanation), where λ_{k} is the k th largest eigenvalue (in magnitude) of Ω. If we allow 𝜃, P, and π to depend on n, then DCBM is adequate for modeling the weak signal cases where λ_{k} may be much smaller than λ_{1}, 1 < k ≤ K.
In many recent works on community detection, it was assumed that the first K eigenvalues are of the same order of magnitude. For example, some of these works considered a DCBM model where in Eq. 1.4 we take P = α_{n}P_{0}. Here, α_{n} is a scaling parameter that may vary with n and P_{0} is a fixed matrix that does not vary with n. In this special case, by similar calculations as in Jin (2015), all eigenvalues of Ω are of the same order under mild regularity conditions on (Θ,π) (e.g., the K communities are balanced; see Jin (2015) for details). Such models do not allow for weak signals, and so are relatively restrictive.
Motivated by these observations, it is desirable to have community detection algorithms that

accommodate severe degree heterogeneity,

are adaptive to different levels of overall network sparsity,

are effective not only for strong signals but also for weak signals.
The Orthodox SCORE
SCORE, or Spectral Clustering On Ratios-of-Eigenvectors, is a recent approach to community detection proposed by Jin (2015). SCORE consists of three steps.
Orthodox SCORE
Input: adjacency matrix A and the number of communities K. Output: community labels of all nodes.

(PCA). Obtain the first K leading eigenvectors \(\hat {\xi }_{1},\hat {\xi }_{2}, \ldots ,\hat {\xi }_{K}\) of A (we call \(\hat {\xi }_{k}\) the k th leading eigenvector if the corresponding eigenvalue is the k th largest in absolute value).

(Post-PCA normalization). Obtain the n × (K − 1) matrix of entrywise eigen-ratios by
$$ \hat{R} = \left[\frac{\hat{\xi}_{2}}{\hat{\xi}_{1}}, \frac{\hat{\xi}_{3}}{\hat{\xi}_{1}}, \ldots, \frac{\hat{\xi}_{K}}{\hat{\xi}_{1}}\right], $$(1.7)where the ratio of two vectors should be understood as the vector of entrywise ratios.^{Footnote 4}

(Clustering). Cluster by applying k-means to the rows of \(\hat {R}\), assuming there are ≤ K clusters.
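The three steps can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions (numpy's `eigh` for the eigendecomposition, scipy's `kmeans2` for the clustering step), not the authors' reference implementation:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def score(A, K, seed=0):
    """Illustrative sketch of orthodox SCORE: PCA, entrywise eigen-ratios,
    then k-means on the rows of the ratio matrix."""
    # Step 1 (PCA): first K eigenvectors, ordered by |eigenvalue|.
    eigvals, eigvecs = np.linalg.eigh(A)
    order = np.argsort(-np.abs(eigvals))[:K]
    xi = eigvecs[:, order]                  # n x K

    # Step 2 (post-PCA normalization): entrywise ratios against the
    # first leading eigenvector.
    R = xi[:, 1:] / xi[:, [0]]              # n x (K - 1)

    # Step 3 (clustering): k-means on the rows of R.
    np.random.seed(seed)                    # kmeans2 uses the global RNG
    _, labels = kmeans2(R, K, minit='++')
    return labels

# Toy network: two 5-node cliques joined by a single edge.
A = np.zeros((10, 10))
A[:5, :5] = 1
A[5:, 5:] = 1
np.fill_diagonal(A, 0)
A[0, 5] = A[5, 0] = 1
labels = score(A, 2)    # nodes 0-4 and nodes 5-9 land in different clusters
```

On this toy graph the entries of the eigen-ratio vector have opposite signs on the two cliques, so k-means recovers the planted split.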
Compared to classical spectral clustering, the main innovation of SCORE is the post-PCA normalization. The goal of this step is to mitigate the effect of degree heterogeneity. The degrees contain very little information about the community structure and are merely a nuisance, but severe degree heterogeneity makes different entries of the leading eigenvectors badly scaled. As a result, without this step, SCORE tends to cluster nodes according to their degrees instead of the community structure, and thus has unsatisfactory clustering results. Take the Weblog data for example: with and without this step, the error rates of SCORE are 58/1222 and 437/1222, respectively. See Jin (2015) for more discussions.
SCORE is conceptually simple, easy to use, and does not need tuning. In Jin (2015) and Ji and Jin (2016), SCORE was shown to be competitive in clustering accuracy. For computational time, note that in the k-means clustering step of SCORE, one usually uses Lloyd's algorithm (Hastie et al. 2009); as a result, SCORE is computationally fast and is able to work efficiently for large networks. See Jin (2015) and also Table 4 of the current paper for more discussions.
SCORE is a flexible idea, and can be conveniently extended to many different settings such as network mixed membership estimation (Jin et al. 2017), topic estimation in text mining (Ke and Wang, 2017), state aggregation in control theory and reinforcement learning (Duan et al. 2018), analysis of hypergraphs (Ke et al. 2020), and matrix factorization in image processing.
Contribution of this Paper
For the three features aforementioned, SCORE accommodates severe degree heterogeneity and is adaptive to different levels of overall network sparsity. However, when it comes to weak signals, at least two questions remain unanswered.

What is the theoretical behavior of SCORE in the presence of weak signals?

In challenging application problems where the SNR is small, is it possible to improve SCORE to have better real data performance, without sacrificing its good properties above?
Note that in the literature, the theoretical analysis on SCORE has been largely focused on the case where the signals are relatively strong; see for example (Jin, 2015).
In this paper, we analyze SCORE theoretically, especially for weak signal settings. We show that for a broad class of settings where we allow for weak signals, severe degree heterogeneity, and a wide range of overall network sparsity, SCORE attains an exponential rate of convergence for the Hamming error. We also show that, when the SNR is appropriately large, SCORE fully recovers the community labels with high probability. The proof uses the most recent advancement on entrywise bounds (a kind of large-deviation bounds) for the leading eigenvectors of the adjacency matrix (Abbe et al. 2019; Jin et al. 2017).
The theoretical analysis here assures us that SCORE continues to work well in weak signal settings. This of course does not rule out the possibility that a further improved version may perform better in real data analysis.
As a second contribution of the paper, we propose SCORE+ as an improved version of SCORE, especially for networks with weak signals. We compare SCORE+ with SCORE and several other recent algorithms using the 8 data sets in Table 1. For the 6 data sets where the signals are relatively strong (the clustering errors of all methods considered are relatively low), SCORE+ and SCORE have comparable performance. For the 2 data sets (Simmons and Caltech) where the signals are relatively weak (the clustering errors of all methods considered are relatively high), SCORE+ improves SCORE significantly, and has the lowest error rates among all methods considered in the paper.
SCORE+ proposes several changes to SCORE. We carefully explain the rationale underlying each of these changes, using a mixture of theoretical and numerical study. A much deeper understanding requires advanced tools in random matrix theory that have not yet been developed, so we leave the study along this line to the future.
Content and Notations
In Section 2, we analyze the orthodox SCORE with some new theoretical results. We show that SCORE attains exponential rates in Hamming clustering errors and achieves perfect clustering provided that the SNR is reasonably large. In Section 3, we propose SCORE+ as an improved version of SCORE. We compare the performance of SCORE+ with SCORE and several recent approaches on community detection using the 8 data sets in Table 1, and show that SCORE+ has the best overall performance. SCORE+ proposes several changes to SCORE. We explain the rationale underlying each of these changes, and especially why SCORE+ is expected to have better performance than SCORE for networks with weak signals. Section 4 proves the main results in Section 2.
In this paper, for any numbers 𝜃_{1},𝜃_{2},…,𝜃_{n}, \(\theta _{max} = \max \limits \{\theta _{1}, \theta _{2}, \ldots , \theta _{n}\}\) and \(\theta _{min} = \min \limits \{\theta _{1}, \theta _{2}, \ldots , \theta _{n}\}\). Also, diag(𝜃_{1},𝜃_{2},…,𝜃_{n}) denotes the n × n diagonal matrix with 𝜃_{i} being the i th diagonal entry, 1 ≤ i ≤ n. For any vector \(a \in \mathbb {R}^{n}\), ∥a∥_{q} denotes the ℓ^{q}-norm, and we write ∥a∥ for simplicity when q = 2. For any matrix \(P \in \mathbb {R}^{n,n}\), ∥P∥ denotes the matrix spectral norm, and \(\|P\|_{\max }\) denotes the maximum ℓ^{2}-norm of the rows of P. For two positive sequences {a_{n}} and {b_{n}}, we say a_{n} ≍ b_{n} if there are constants c_{2} > c_{1} > 0 such that c_{1}a_{n} ≤ b_{n} ≤ c_{2}a_{n} for sufficiently large n.
SCORE: Exponential Rate and Perfect Clustering
We provide new theoretical results for the orthodox SCORE, which significantly improve those in Jin (2015). For the “weak signal” case, the theory in Jin (2015) is not applicable but our theory applies. For the “strong signal” case, compared with Jin (2015), our theory provides a faster rate of convergence for the clustering error and weaker conditions for perfect clustering.
Consider a sequence of DCBM indexed by n, where (K,𝜃,π,P) are all allowed to depend on n. Suppose, for a constant c_{1} > 0,
Recall that \({\mathcal {C}}_{1}, \ldots , {\mathcal {C}}_{K}\) denote the K true communities. For 1 ≤ k ≤ K, let \(n_{k} = |{\mathcal {C}}_{k}|\) be the size of community k, and let \(\theta ^{(k)} \in \mathbb {R}^{n}\) be the vector such that \(\theta ^{(k)}_{i} = \theta _{i}\) if \(i \in {\mathcal {C}}_{k}\) and \(\theta ^{(k)}_{i} = 0\) otherwise. We assume, for a constant c_{2} > 0,
Introduce a diagonal matrix \(G \in \mathbb {R}^{K,K}\) by
Let μ_{k} denote the k th largest eigenvalue (in magnitude) of the K × K matrix G^{1/2}PG^{1/2}, and let η_{k} denote the corresponding eigenvector. We assume, for a constant c_{3} ∈ (0,1) and c_{4} > 0,
These conditions are mild. For Eq. 2.8, recall that ∥𝜃∥ measures the overall network sparsity, and the interesting range of ∥𝜃∥ is between 1 and \(\sqrt {n}\), up to some multi-\(\log (n)\) terms (see footnote 3). Therefore, it is mild to assume \(\|\theta \|\to \infty \). Condition (2.9) requires that the communities are balanced in size and in total squared degrees, which is mild.
Condition (2.10) is also mild. The most challenging case for network analysis is when the matrix P gets very close to the matrix of all ones, where it is hard to distinguish one community from another. The condition only rules out the less relevant cases, such as when the network is disconnected or approximately so; these are strong signal cases where it is relatively easy to distinguish one community from another. Note that the condition is satisfied if all entries of P are lower bounded by a constant, or if K is fixed and P converges to a fixed irreducible matrix (see Section A.2 of Jin et al. 2017 for a discussion of this condition).
In a hypothesis testing framework, Jin et al. (2019) has pointed out that a reasonable metric for SNR in a DCBM is \(\lambda _{K} / \sqrt {\lambda _{1}}\), where we recall that λ_{k} denotes the k th largest eigenvalue (in magnitude) of \({\Omega }={\Theta }{\Pi } P{\Pi }^{\prime }{\Theta }\) and that λ_{1} is always positive (Jin, 2015; Jin et al. 2017). We also introduce a quantity to measure the severity of degree heterogeneity:
The smaller α(𝜃), the more severe the degree heterogeneity. When \(\theta _{\max \limits }\asymp \theta _{\min \limits }\), α(𝜃) is bounded away from 0 by a constant. In the presence of severe degree heterogeneity, α(𝜃) gets close to 0. We shall see that the clustering power of SCORE depends on
which is a combination of the SNR and the severity of degree heterogeneity.
Let \(\widehat {\Pi }=[\hat {\pi }_{1},\hat {\pi }_{2},\ldots ,\hat {\pi }_{n}]^{\prime }\) denote the matrix of estimated community labels by the orthodox SCORE. Define the Hamming error rate (per node) for clustering by
The next theorem is proved in Section 4.
Theorem 2.1.
Consider a sequence of DCBM indexed by n, where Eqs. 2.8–2.10 hold. Let s_{n} be as in Eq. 2.11. There exist appropriately small constants a_{1},a_{2} > 0, which depend on the constants c_{1}–c_{4} in the regularity conditions, such that, as long as \(s_{n}\geq a_{1}^{-1}K^{4}\sqrt {\log (n)}\), for sufficiently large n:
Theorem 2.1 implies that the Hamming clustering error of SCORE depends on (𝜃_{1},…,𝜃_{n}) in an exponential form. This significantly improves the bound in Jin (2015), which depends on (𝜃_{1},…,𝜃_{n}) in a polynomial form. Additionally, Theorem 2.1 suggests that the nodes with smaller 𝜃_{i} make larger contributions to the Hamming clustering error, i.e., the algorithm is more likely to make errors on low-degree nodes.
The clustering error has an easytodigest form in special examples.
COROLLARY 2.1
Consider a special DCBM, where
Suppose K is fixed and 𝜃 satisfies \(\theta _{\max \limits }\leq C\theta _{\min \limits }\). There exist appropriately small constants \(\tilde {a}_{1},\tilde {a}_{2}>0\) such that, as long as \((1-b)\|\theta \|\geq \tilde {a}_{1}^{-1}K^{4}\sqrt {\log (n)}\), for sufficiently large n,
In this special example, ∥𝜃∥^{2} characterizes the average node degrees, and (1 − b) captures the “similarity” across communities. The clustering power of SCORE is governed by s_{n} ≍ (1 − b)∥𝜃∥. The bound in Corollary 2.1 matches the minimax bound in Gao et al. (2018), except for the constant 2K in front and the constant a_{2} in the exponent.^{Footnote 5} It was shown in Gao et al. (2018) that the exponential error rate can be attained by first applying spectral clustering and then conducting a refinement, where the refinement step was motivated by technical convenience. In fact, numerical studies suggest that spectral clustering alone can attain exponential error rates; Theorem 2.1 and Corollary 2.1 provide a rigorous theoretical justification.
The next theorem states that SCORE can exactly recover the community labels with high probability, provided that the SNR is appropriately large.
Theorem 2.2.
Consider a sequence of DCBM indexed by n, where Eqs. 2.8 –2.10 hold. Let s_{n} be as in Eq. 2.11. If \(s_{n} \geq CK^{4}\sqrt {\log (n)}\) for a sufficiently large constant C > 0, then we have that, up to a permutation on columns of \(\widehat {\Pi }\),
Furthermore, if K is finite and \(\theta _{\max \limits }\leq C\theta _{\min \limits }\), then the above is true as long as \(\lambda _{K}/\sqrt {\lambda _{1}}\) exceeds a sufficiently large constant.
The condition on s_{n} cannot be significantly improved. Take the case of fixed K with moderate degree heterogeneity (\(\theta _{\max \limits }\leq C\theta _{\min \limits }\)) for example. In this case, the condition becomes \(\lambda _{K} / \sqrt {\lambda _{1}} \geq C\) for a large enough constant C > 0. It was shown in Jin et al. (2019) that, if we allow \(\lambda _{K} / \sqrt {\lambda _{1}} \rightarrow 0\), then we end up with a class of models that is too broad: one can find two sequences of DCBM models with different (fixed) K that are indistinguishable from each other. In such settings, successful clustering is impossible.
In the literature, the exponential error rate and the perfect clustering property were mainly obtained for non-spectral methods (e.g., Gao et al. 2018; Chen et al. 2018). While spectral methods are practically popular, their theoretical analysis is challenging, since it requires sharp entrywise bounds for eigenvectors. A few existing works either focus on SBM, which does not allow for degree heterogeneity (e.g., Abbe et al. 2019; Su et al. 2019), or restrict to the “strong signal” case and dense networks (e.g., Liu et al. 2019). Our results are new, as we provide the first exponential rate result and perfect clustering result for spectral methods that accommodate severe degree heterogeneity, sparse networks, and weak signals.
Our analysis uses some results on the spectral analysis of adjacency matrices from our earlier work (Jin et al. 2017), especially the entrywise large-deviation bounds for empirical eigenvectors. We refer interested readers to the detailed description in Jin et al. (2017). It is understood that the main technical difficulty of analyzing spectral methods lies in the entrywise analysis of eigenvectors. Recent progress includes (but is not limited to) Abbe et al. (2019), Fan et al. (2019), Jin et al. (2017), Liu et al. (2019), Mao et al. (2020), and Su et al. (2019).
SCORE+, a Refinement Especially for Weak Signals
We propose SCORE+ as a refinement of SCORE for community detection. SCORE+ inherits the appealing features of SCORE. It improves the performance of SCORE in real applications, especially for networks with weak signals.
SCORE+
Recall that under DCBM,
where the “main signal” matrix Ω equals \({\Theta } {\Pi } P {\Pi }^{\prime } {\Theta }\) and has rank K. SCORE+ is motivated by several observations about SCORE.

Due to severe degree heterogeneity, different rows of the “signal” matrix and the “noise” matrix are on very different scales. We need two normalizations: a pre-PCA normalization to mitigate the effects of degree heterogeneity on the “noise” matrix, and a post-PCA normalization (as in SCORE) on the “signal” matrix; we find that an appropriate pre-PCA normalization is Laplacian regularization.^{Footnote 6} See Section 3.3.2 for more explanations.

The idea of PCA is dimension reduction: we project the rows of A onto the K-dimensional space spanned by the first K eigenvectors of A, \(\hat {\xi }_{1}, \hat {\xi }_{2}, \ldots , \hat {\xi }_{K}\), and reduce A to the n × K matrix of projection coefficients:
$$ [\hat{\eta}_{1}, \hat{\eta}_{2}, \ldots, \hat{\eta}_{K}] \equiv [\hat{\xi}_{1}, \hat{\xi}_{2}, \ldots, \hat{\xi}_{K}] \cdot \text{diag}(\hat{\lambda}_{1}, \hat{\lambda}_{2}, \ldots, \hat{\lambda}_{K}). $$Therefore, in SCORE, it is better to apply the post-PCA normalization to \([\hat {\eta }_{1}, \hat {\eta }_{2}, \ldots , \hat {\eta }_{K}]\) instead of \([\hat {\xi }_{1}, \hat {\xi }_{2}, \ldots , \hat {\xi }_{K}]\); the two post-PCA normalization matrices (old and new) satisfy
$$ \left[\frac{\hat{\eta}_{2}}{\hat{\eta}_{1}}, \frac{\hat{\eta}_{3}}{\hat{\eta}_{1}}, \ldots, \frac{\hat{\eta}_{K}}{\hat{\eta}_{1}} \right] = \left[\frac{\hat{\xi}_{2}}{\hat{\xi}_{1}}, \frac{\hat{\xi}_{3}}{\hat{\xi}_{1}}, \ldots, \frac{\hat{\xi}_{K}}{\hat{\xi}_{1}} \right] \cdot \text{diag}\left( \frac{\hat{\lambda}_{2}}{\hat{\lambda}_{1}}, \frac{\hat{\lambda}_{3}}{\hat{\lambda}_{1}}, \ldots, \frac{\hat{\lambda}_{K}}{\hat{\lambda}_{1}}\right). $$In effect, the new change is using eigenvalues to reweight the columns of \(\left [\frac {\hat {\xi }_{2}}{\hat {\xi }_{1}}, \frac {\hat {\xi }_{3}}{\hat {\xi }_{1}}, \ldots , \frac {\hat {\xi }_{K}}{\hat {\xi }_{1}}\right ]\). See Section 3.3.3 for more explanations.

In SCORE, we only use the first K eigenvectors for clustering, which is reasonable in the “strong signal” case, where all the nonzero eigenvalues of the “signal” matrix are much larger (in absolute value) than the spectral norm of the “noise” matrix. In the “weak signal” case, some nonzero eigenvalues of the “signal” can be smaller than the spectral norm of the “noise”, and we may need one or more additional eigenvectors of A for clustering. Section 3.3.4 contains an in-depth study of the weak signal case; see details therein.
SCORE +
Input: A, K, a ridge regularization parameter δ > 0 and a threshold t > 0. Output: class labels for all n nodes.

(Pre-PCA normalization with Laplacian). Let D = diag(d_{1},d_{2},…,d_{n}), where d_{i} is the degree of node i. Obtain the graph Laplacian with ridge regularization by
$$ L_{\delta} = (D + \delta \cdot d_{max} \cdot I_{n})^{-1/2} A (D + \delta \cdot d_{max} \cdot I_{n})^{-1/2}, \qquad \text{where}~d_{max} = \max\limits_{1 \leq i \leq n} \{d_{i}\}. $$Note that the ratio between the largest diagonal entry of D + δd_{max}I_{n} and the smallest one is smaller than (1 + δ)/δ. Conventional choices of δ are 0.05 and 0.10.

(PCA, where we possibly retain an additional eigenvector). We assess the aforementioned “signal weakness” by \(1 - [\hat {\lambda }_{K+1}/\hat {\lambda }_{K}]\), and include an additional eigenvector for clustering if and only if
$$ (1 - [\hat{\lambda}_{K+1} /\hat{\lambda}_{K}]) \leq t, \qquad (\text{a conventional choice of}~t~\text{is}~0.10). $$
(Post-PCA normalization). Let M be the number of eigenvectors we decided on in the last step (therefore, either M = K or M = K + 1). Obtain the matrix of entrywise eigen-ratios by
$$ \hat{R} = \left[\frac{\hat{\eta}_{2}}{\hat{\eta}_{1}}, \frac{\hat{\eta}_{3}}{\hat{\eta}_{1}}, \ldots, \frac{\hat{\eta}_{M}}{\hat{\eta}_{1}}\right], \qquad \text{where}~\hat{\eta}_{k} = \hat{\lambda}_{k} \hat{\xi}_{k}, 1 \leq k \leq M. $$(3.12) 
(Clustering). Apply classical k-means to the rows of \(\hat {R}\), assuming ≤ K clusters.
The code is available at http://zke.fas.harvard.edu/software.html.
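For concreteness, the four steps above can be sketched as follows. This is a minimal illustration under simplifying assumptions (eigenvalues compared by magnitude when checking the eigen-gap, scipy's `kmeans2` for clustering), not the released implementation:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def score_plus(A, K, delta=0.1, t=0.1, seed=0):
    """Illustrative sketch of SCORE+ with the conventional (delta, t) = (0.1, 0.1)."""
    d = A.sum(axis=1)

    # Step 1 (pre-PCA normalization): Laplacian with ridge regularization,
    # L_delta = (D + delta*d_max*I)^(-1/2) A (D + delta*d_max*I)^(-1/2).
    s = 1.0 / np.sqrt(d + delta * d.max())
    L = s[:, None] * A * s[None, :]

    # Step 2 (PCA): eigenvectors ordered by |eigenvalue|; keep one extra
    # eigenvector when the relative eigen-gap at K is below the threshold t.
    eigvals, eigvecs = np.linalg.eigh(L)
    order = np.argsort(-np.abs(eigvals))
    lam, xi = np.abs(eigvals[order]), eigvecs[:, order]
    M = K + 1 if (K < len(lam) and 1 - lam[K] / lam[K - 1] <= t) else K

    # Step 3 (post-PCA normalization): reweight by eigenvalues, then take
    # entrywise ratios against the first column.
    eta = xi[:, :M] * eigvals[order][:M]    # eta_k = lambda_k * xi_k
    R = eta[:, 1:] / eta[:, [0]]

    # Step 4 (clustering): k-means with K clusters on the rows of R.
    np.random.seed(seed)                    # kmeans2 uses the global RNG
    _, clusters = kmeans2(R, K, minit='++')
    return clusters

# Toy network: two 5-node cliques joined by a single edge.
A = np.zeros((10, 10))
A[:5, :5] = 1
A[5:, 5:] = 1
np.fill_diagonal(A, 0)
A[0, 5] = A[5, 0] = 1
clusters = score_plus(A, 2)
```

On such a strongly clustered toy graph, the eigen-gap at K is large, so the sketch keeps M = K eigenvectors and behaves like SCORE applied to the regularized Laplacian with eigenvalue reweighting.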
Compared to SCORE, SCORE+ (a) adds a pre-PCA normalization step, (b) may select one more eigenvector for later use if necessary, and (c) uses eigenvalues to reweight the columns of \(\left [\frac {\hat {\xi }_{2}}{\hat {\xi }_{1}}, \frac {\hat {\xi }_{3}} {\hat {\xi }_{1}}, \ldots , \frac {\hat {\xi }_{K}}{\hat {\xi }_{1}} \right ]\). In Section 3.3, we further explain the rationale underlying these refinements.
Numerical Comparisons
We compare SCORE+ with a few recent methods: the orthodox SCORE, the convexified modularity maximization (CMM) method by Chen et al. (2018), the latent-space-model-based (LSCD) method by Ma et al. (2020), the normalized spectral clustering (OCCAM) method for potentially overlapping communities by Zhang et al. (2020), and the regularized spectral clustering (RSC) method by Qin and Rohe (2013). For each method, we measure the clustering error rate by
where ℓ_{i} and \(\hat {\ell _{i}}\) are the true and estimated labels of node i.
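Since community labels are only identifiable up to a relabeling, the minimum in this error rate runs over all permutations of the K estimated labels. A brute-force computation (fine for the small K in Table 1, though it scales as K!) might look like:

```python
import numpy as np
from itertools import permutations

def clustering_error_rate(true_labels, est_labels, K):
    """Fraction of misclustered nodes, minimized over all permutations
    of the K estimated community labels."""
    true_labels = np.asarray(true_labels)
    est_labels = np.asarray(est_labels)
    n = len(true_labels)
    best = n
    for perm in permutations(range(K)):
        relabeled = np.array([perm[c] for c in est_labels])
        best = min(best, int(np.sum(relabeled != true_labels)))
    return best / n

# Estimated labels agree with the truth up to a label swap,
# except for one genuinely misclustered node: error rate 1/6.
truth = [0, 0, 0, 1, 1, 1]
est = [1, 1, 1, 0, 0, 2]
rate = clustering_error_rate(truth, est, K=3)
```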
The error rates are in Table 2, where for SCORE+, we take (t,δ) = (0.1,0.1). For the three relatively large networks (Weblog, Simmons, Caltech), the error rates of SCORE+ are the best among all methods, and for the other networks, the error rates are close to the best. In particular, SCORE+ provides a commendable improvement for the Simmons and Caltech data sets. In Section 3.3.4, we show that the Simmons and Caltech data sets are “weak signal” networks, and all other networks are “strong signal” networks.
RSC (Qin and Rohe, 2013) is an interesting method that applies the idea of SCORE to the graph Laplacian. It can be viewed as adding a pre-PCA normalization step to SCORE (but it does not include the other refinements in SCORE+). For three of the data sets (Simmons, Caltech, UKfaculty), the modification provides a small improvement, and for three of the data sets (Weblogs, Dolphins, Polbooks), the modification hurts slightly. The performance of OCCAM is more or less similar to that of SCORE and RSC, which is not surprising, because OCCAM is also a normalized spectral method.
The error rates of CMM and LSCD are comparable with that of SCORE+ in most data sets, except that CMM and LSCD have unsatisfactory results for UKfaculty and Football, respectively. For the three small data sets (Karate, Dolphins, Polbooks), the three methods have similar error rates, with CMM being slightly better. For the three large data sets (Weblogs, Simmons, Caltech), SCORE+ is better than LSCD, and LSCD is better than CMM.
LSCD is an iterative algorithm which solves a nonconvex optimization problem with a rank constraint. Since the algorithm only provides a local optimum, the difference between this local optimum and the global optimum may be large, especially for large K. This partially explains why LSCD performs unsatisfactorily on Football, for which K = 11. CMM first solves a convexified modularity maximization problem to get an n × n matrix \(\hat {Y}\) and then applies k-median to the rows of \(\hat {Y}\). The matrix \(\hat {Y}\) aims to approximate a rank-K matrix, but for UKfaculty, the output \(\hat {Y}\) has a large (K + 1)-th eigenvalue. This partially explains why CMM performs unsatisfactorily on this data set.
SCORE+ has two tuning parameters (t,δ), but each is easy to set, guided by common sense. Moreover, SCORE+ is relatively insensitive to the ridge regularization parameter δ: in Table 3, we investigate SCORE+ by setting t = 0.10 and letting δ range from 0.025 to 0.2 with an increment of 0.025, and the results confirm the insensitivity. In Section 3.3.4, we discuss further how to set the tuning parameter t.
Computationally, SCORE and OCCAM are the fastest, SCORE+ and RSC are slightly slower (the extra computing time is mostly due to the pre-PCA step), and CMM and LSCD are significantly slower, especially for large networks. For comparison of computing time, it makes more sense to use networks larger than those in Table 1, so we simulate networks from the DCBM model in Section 1.1. In a DCBM with n nodes and K communities, the upper triangle of A contains independent Bernoulli variables, with
where P is a K × K symmetric nonnegative matrix, Θ = diag(𝜃_{1},𝜃_{2},…,𝜃_{n}) with 𝜃_{i} > 0 being the degree parameters, and π is the n × K label matrix. For simulations, we let n range in {1000,2000,4000,7000,10000}, and for each fixed n,

for \(c_{n} = 3 \log (n)/n\) and (α,β) = (5,4/5), generate 𝜃_{i} such that (𝜃_{i}/c_{n}) are iid from Pareto(α,β);

fix K = 4 and let π be the matrix where the first, second, third, and last quarter of rows equal to e_{1},e_{2},e_{3},e_{4}, respectively;

consider two experiments, where respectively, the P matrix is
$$ \left[ \begin{array}{cccc} 1 & 1/3 & 1/3 & 1/3 \\ 1/3 & 1 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1 & 1/3 \\ 1/3 & 1/3 & 1/3 & 1 \end{array} \right] \qquad \text{and} \qquad \left[ \begin{array}{cccc} 1 & 2/3 & .1 & .1 \\ 2/3 & 1 & .5 & .5 \\ .1 & .5 & 1 & .5 \\ .1 & .5 & .5 & 1 \end{array} \right];$$the value of λ_{K}(P)/λ_{1}(P) is 0.333 for the left and 0.083 for the right, so that they represent the “strong signal” case and “weak signal” case, respectively.
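The simulation above can be reproduced along the following lines. Note that the exact Pareto(α, β) parameterization is an assumption here (we treat α as the shape and β as the scale, one common convention), so this is a sketch rather than the authors' exact script:

```python
import numpy as np

def simulate_dcbm(n, K, P, alpha=5.0, beta=4 / 5, seed=0):
    """Simulate an adjacency matrix from the DCBM of the timing experiment.
    Assumes Pareto(alpha, beta) = shape alpha, scale beta; the paper's exact
    parameterization may differ."""
    rng = np.random.default_rng(seed)
    c_n = 3 * np.log(n) / n
    # theta_i / c_n are iid Pareto; numpy's pareto() is the Lomax variant,
    # so shift by 1 and rescale to get classical Pareto with scale beta.
    theta = c_n * beta * (1 + rng.pareto(alpha, size=n))

    # Balanced membership: consecutive quarters in communities 0..K-1.
    labels = np.repeat(np.arange(K), n // K)
    Pi = np.eye(K)[labels]

    # Omega = Theta Pi P Pi' Theta; Bernoulli upper triangle, zero diagonal.
    Omega = (theta[:, None] * theta[None, :]) * (Pi @ P @ Pi.T)
    probs = np.clip(Omega, 0, 1)
    U = rng.random((n, n))
    A = np.triu((U < probs).astype(int), k=1)
    return A + A.T, labels

# The "strong signal" P matrix: 1 on the diagonal, 1/3 off the diagonal.
P_strong = np.full((4, 4), 1 / 3) + (2 / 3) * np.eye(4)
A, labels = simulate_dcbm(1000, 4, P_strong)
```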
The error rates and computing time are reported in Table 4 (both error rates and computing time are the average of 10 independent repetitions).
In summary, SCORE+ compares favorably with the other methods in both error rate and computing time, for networks with either “strong signals” or “weak signals”.
Rationale Underlying the Key Components of SCORE+
SCORE+ contains 4 components: the post-PCA normalization that was originally proposed in SCORE, and 3 proposed refinements (pre-PCA normalization using the Laplacian regularization, reweighting the leading eigenvectors by eigenvalues, and recruiting one more eigenvector for use when the eigengap is small). We now explain the rationale of each of these components.
Recall that under DCBM,
where the “main signal” matrix Ω equals \({\Theta } {\Pi } P {\Pi }^{\prime } {\Theta }\) and has rank K. Let ξ_{1},ξ_{2},…,ξ_{K} be the eigenvectors of Ω associated with the K largest eigenvalues in magnitude. Write Ξ = [ξ_{1},ξ_{2},…,ξ_{K}] = [Ξ_{1},Ξ_{2},…,Ξ_{n}]^{′}.
Rationale Underlying the Post-PCA Normalization
The rationale underlying the post-PCA normalization was carefully explained in Jin (2015), so we keep the discussion brief here. Under DCBM, Jin (2015) observed that
Without the 𝜃_{i}’s, we could directly apply k-means to the rows of Ξ. Now, with the degree parameters, Jin (2015) considered the family of scaling invariant mappings (SIM), \(\mathbb {M}:\mathbb {R}^{K}\to \mathbb {R}^{K}\), such that \(\mathbb {M}(ax)=\mathbb {M}(x)\) for any a > 0 and \(x\in \mathbb {R}^{K}\), and proposed the post-PCA normalization
The scaling-invariance property of \(\mathbb {M}\) ensures that \(\{\mathbb {M}({\Xi }_{1}),\ldots ,\mathbb {M}({\Xi }_{n})\}\) take only K distinct values, so that we can apply k-means. Two examples of SIM include:

\(\mathbb {M}(x)=(x_{2}/x_{1},x_{3}/x_{1},\ldots ,x_{K}/x_{1})'\), i.e., normalizing Ξ_{i} by its first entry;

\(\mathbb {M}(x)=\|x\|_{p}^{-1}x\), i.e., normalizing Ξ_{i} by its L_{p}-norm.
The first one was recommended by Jin (2015) and is commonly referred to as SCORE. The second is a variant of SCORE and was proposed in the supplement of Jin (2015).
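As a minimal numpy sketch (function name ours), the first mapping — SCORE itself — extracts the leading eigenvectors of a symmetric input and forms the entrywise ratios; k-means is then applied to the rows of the returned matrix:

```python
import numpy as np

def score_embedding(A, K):
    """Sketch of the SCORE normalization (Jin, 2015): take the K eigenvectors
    of A with the largest |eigenvalue|, then divide eigenvectors 2..K
    entrywise by the first one.  k-means is applied to the rows of R."""
    vals, vecs = np.linalg.eigh(np.asarray(A, dtype=float))
    order = np.argsort(-np.abs(vals))[:K]        # K leading eigenvalues in magnitude
    vecs = vecs[:, order]
    R = vecs[:, 1:] / vecs[:, [0]]               # n x (K-1) matrix of entrywise ratios
    n = A.shape[0]
    return np.clip(R, -np.log(n), np.log(n))     # optional +/- log(n) thresholding
```

Fed the rank-K signal matrix Ω = ΘΠPΠ′Θ itself, the rows of `R` take exactly K distinct values regardless of the 𝜃_{i}, which is the point of the normalization.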
In the more general DCMM model with mixed memberships, Jin et al. (2017) discovered that the post-SCORE matrix is associated with a low-dimensional simplex geometry and developed SCORE into a simplex-vertex-hunting method for mixed-membership estimation. Interestingly, although every normalization in the scaling invariant family proposed by Jin (2015) works for DCBM, only the SCORE normalization produces the desired simplex geometry under DCMM.
Why the Laplacian is the Right Pre-PCA Normalization
The target of SCORE is to remove the effect of degree heterogeneity in the “main signal” matrix Ω. However, the “noise” matrix \(W=A-\mathbb {E}[A]\) is also affected by degree heterogeneity and requires a proper normalization. We note that, since PCA only retains a few leading eigenvectors, which are driven by the “signal,” a normalization applied after PCA can no longer correct the “noise” matrix. Therefore, one has to normalize the “noise” matrix with a pre-PCA operation.
Our idea is to reweight the rows and columns of A by node degrees. Let D be the diagonal matrix where D(i,i) is the degree of node i. There are many ways for prePCA normalization, and simple choices include

A↦D^{− 1/2}AD^{− 1/2}.

A↦D^{− 1}AD^{− 1}.
Which one is the right choice?
Given an arbitrary positive diagonal matrix H, write
The best pre-PCA normalization is such that, despite severe degree heterogeneity, the variances of all entries of the “noise” matrix are at the same order (Jin and Ke, 2018). Under DCBM, by direct calculations,
At the same time, the node degrees satisfy
Therefore, the right choice is \(h_{i}\propto \sqrt {d_{i}}\), i.e., we should use the pre-PCA normalization A↦D^{− 1/2}AD^{− 1/2}. See Mihail and Papadimitriou (2002) for a similar finding. For better practical performance, we add a ridge regularization.
Besides normalizing the “noise” matrix, the pre-PCA normalization also changes the “signal” matrix from Ω to D^{− 1/2}ΩD^{− 1/2}. Fortunately, the new “signal” matrix has a similar form to \({\Omega }={\Theta }{\Pi } P{\Pi }^{\prime }{\Theta }\), except that Θ is replaced by D^{− 1/2}Θ, so the post-PCA normalization of SCORE is still valid.
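The resulting pre-PCA step can be sketched as below. The exact placement of the ridge term (adding δ times the maximum degree to every degree before rescaling) is our reading of SCORE+, so treat it as an assumption rather than the definitive formula:

```python
import numpy as np

def regularized_laplacian(A, delta=0.1):
    """Sketch of the pre-PCA normalization: D_delta^{-1/2} A D_delta^{-1/2},
    where D_delta regularizes each degree by a ridge term delta * d_max.
    (The form of the ridge term is our assumption; delta = 0 recovers the
    plain normalization D^{-1/2} A D^{-1/2}.)"""
    d = A.sum(axis=1).astype(float)
    d_reg = d + delta * d.max()          # ridge-regularized degrees
    s = 1.0 / np.sqrt(d_reg)
    return s[:, None] * A * s[None, :]
```

Setting `delta = 0` gives the unregularized map A↦D^{−1/2}AD^{−1/2} discussed above; a positive `delta` damps the rows and columns of low-degree nodes.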
Why \(\hat {\eta }_{k}\) is the Appropriate Choice in the Post-PCA Normalization
In the post-PCA normalization, SCORE+ constructs the matrix of entrywise eigen-ratios using \(\hat {\eta }_{1},\ldots ,\hat {\eta }_{K}\), where each \(\hat {\eta }_{k}\) is \(\hat {\xi }_{k}\) weighted by the corresponding eigenvalue. There are many ways of weighting the eigenvectors, and simple choices include

\([\hat {\xi }_{1}, \hat {\xi }_{2}, \ldots , \hat {\xi }_{K}] \cdot \text {diag}(\hat {\lambda }_{1}, \hat {\lambda }_{2}, \ldots , \hat {\lambda }_{K})\).

\([\hat {\xi }_{1}, \hat {\xi }_{2}, \ldots , \hat {\xi }_{K}] \cdot \text {diag}\left (\sqrt {\hat {\lambda }_{1}}, \sqrt {\hat {\lambda }_{2}}, \ldots , \sqrt {\hat {\lambda }_{K}}\right )\).
Why do we choose the first one?
We briefly explained it in Section 3.1 from the perspective of projecting the rows of the data matrix onto the span of \(\hat {\xi }_{1},\ldots ,\hat {\xi }_{K}\). We now take a different perspective. Recall that L_{δ} is the regularized graph Laplacian. By Abbe et al. (2019), the first-order approximations of the eigenvectors are
Intuitively speaking, since each ξ_{k} has unit norm, the “noise” vector \((L_{\delta }-\mathbb {E}[L_{\delta }]) \xi _{k}\) is at the same scale for different k; this implies that the noise level in different eigenvectors is proportional to 1/λ_{k}. This means \(\hat {\xi }_{1}\) is less noisy than \(\hat {\xi }_{2}\), \(\hat {\xi }_{2}\) is less noisy than \(\hat {\xi }_{3}\), and so on. By weighting the eigenvectors by \(\hat {\lambda }_{k}\), the noise levels in \(\hat {\eta }_{1},\ldots ,\hat {\eta }_{K}\) are approximately at the same order.
In most theoretical studies, λ_{1},…,λ_{K} are assumed to be at the same order, so whether or not to reweight the eigenvectors does not affect the rate of convergence. However, in many real data sets, the magnitudes of the first few eigenvalues can be considerably different, so such a weighting scheme does improve the numerical performance.
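The weighting step itself is one line of linear algebra; the sketch below (function name ours) forms η_k = λ̂_k ξ̂_k for the M leading eigenpairs by magnitude:

```python
import numpy as np

def weighted_eigenvectors(L, M):
    """Sketch of the re-weighting step: eta_k = lambda_k * xi_k for the M
    leading eigenpairs of a symmetric matrix L (by |eigenvalue|).  Since the
    noise in xi_k scales like 1/|lambda_k|, this weighting roughly equalizes
    the noise level across the retained eigenvectors."""
    vals, vecs = np.linalg.eigh(L)
    order = np.argsort(-np.abs(vals))[:M]        # M leading eigenvalues in magnitude
    return vecs[:, order] * vals[order][None, :] # column k is lambda_k * xi_k
```

Each returned column has Euclidean norm |λ̂_k|, so later eigenvectors, which carry more noise, contribute proportionally less to the eigen-ratio matrix.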
When We Should Choose One More Eigenvector for Inference
In SCORE+, we retain M eigenvectors in the PCA step for later use, where
For the 8 data sets in Table 1, if we choose t = 0.1 as suggested, then M = K + 1 for the Simmons and Caltech data sets, and M = K for all others. The insight is that, if a data set fits with the “strong signal” profile, then we use exactly K eigenvectors for clustering, but if it fits with the “weak signal” profile, we may need to use more than K eigenvectors. Our analysis below shows that the Simmons and Caltech data sets fit with the “weak signal” profile, while all other data sets fit with the “strong signal” profile.
We illustrate our points with the scree plot and the Rayleigh quotient. Let \(\ell \in \mathbb {R}^{n}\) be the true community label vector, and let
For any vector \(x \in \mathbb {R}^{n}\), define normalized Rayleigh quotient (Fisher, 1936):
where the Total Variance, Within-Class Variance, and Between-Class Variance are \({\sum }_{i = 1}^{n} (x_{i} - \bar {x})^{2}\), \({\sum }_{k = 1}^{K} {\sum }_{i \in S_{k}} (x_{i} - \bar {x}_{k})^{2}\), and \({\sum }_{k = 1}^{K} |S_{k}| (\bar {x}_{k} - \bar {x})^{2}\), respectively (\(\bar {x}\) is the overall mean of the x_{i} and \(\bar {x}_{k}\) is the mean of x_{i} over all i ∈ S_{k}). The Rayleigh quotient is a well-known measure for the clustering utility of x. Note that 0 ≤ Q(x) ≤ 1 for all x, Q(x) = 1 when x = ℓ, and Q(x) ≈ 0 when x is a randomly generated vector.
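The displayed formula for Q was lost in extraction; the properties stated here (0 ≤ Q ≤ 1, Q(ℓ) = 1, Q ≈ 0 for random x) are consistent with reading Q as the ratio of Between-Class Variance to Total Variance, which is the version implemented in this sketch (function name ours):

```python
import numpy as np

def rayleigh_quotient(x, labels):
    """Normalized Rayleigh quotient, read as Between-Class Variance divided by
    Total Variance (an assumption consistent with Q(ell) = 1 and 0 <= Q <= 1).
    Close to 1 when x separates the classes, close to 0 for a random x."""
    x = np.asarray(x, dtype=float)
    xbar = x.mean()
    total = np.sum((x - xbar) ** 2)              # total variance
    between = 0.0
    for k in np.unique(labels):
        idx = (labels == k)
        between += idx.sum() * (x[idx].mean() - xbar) ** 2   # |S_k| (xbar_k - xbar)^2
    return between / total
```

By the usual variance decomposition, this equals 1 minus the within-to-total variance ratio.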
Fix δ = 0.1. Let \(\hat {\lambda }_{1}, \hat {\lambda }_{2}, \ldots , \hat {\lambda }_{K+1}\) be the (K + 1) eigenvalues of L_{δ} with largest magnitude and let \(\hat {\xi }_{1}, \hat {\xi }_{2}, \ldots , \hat {\xi }_{K+1}\) be the corresponding eigenvectors. Below are some features that help differentiate a “strong signal” setting from a “weak signal” setting.

In the scree plot, we expect to see a relatively large gap between \(\hat {\lambda }_{K}\) and \(\hat {\lambda }_{K+1}\) when the “signal” is strong, and a relatively small gap if the “signal” is relatively weak.

In a “strong signal” setting, we expect the Rayleigh quotient \(Q(\hat {\xi }_{k})\) to be relatively large for k = K but relatively small for k = K + 1,K + 2, etc. In a “weak signal” setting, we may observe a relatively large Rayleigh quotient \(Q(\hat {\xi }_{k})\) for k = K + 1,K + 2, etc., while \(Q(\hat {\xi }_{K})\) can be relatively small.
These points are illustrated in Fig. 1 with the Weblog data and the Simmons data, which are believed to be a typical “strong signal” data set and a typical “weak signal” data set, respectively. We note that the first eigenvector carries global information about L_{δ} and alone does not have much utility for clustering. Therefore, the corresponding Rayleigh quotient \(Q(\hat {\xi }_{1})\) is usually small. In SCORE (e.g., Eq. 3.12), we use \(\hat {\xi }_{1}\) for normalization, but not directly for clustering.
Table 5 shows the Rayleigh quotients for all 8 data sets. We find that the (K + 1)th eigenvector contains almost no information about the community labels, except for Caltech and Simmons. This agrees with our finding that Caltech and Simmons fit the “weak signal” profile.
How do we choose between M = K and M = K + 1? The scree plot is potentially a good way to estimate how much information is contained in each eigenvector. If the K th and (K + 1)th eigenvalues are close, it is likely that the (K + 1)th eigenvector also contains information. To measure “closeness”, we propose to use the quantity \(1 - [\hat {\lambda }_{K+1}/\hat {\lambda }_{K}]\) with a scale-free tuning parameter t = 0.1. This seems to work well on all 8 data sets. See Table 6.
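Under our reading of this rule — retain one extra eigenvector precisely when the relative gap 1 − λ̂_{K+1}/λ̂_{K} (in magnitude) falls below t — the selection of M can be sketched as (function name ours):

```python
import numpy as np

def choose_M(eigvals, K, t=0.1):
    """Sketch of the SCORE+ rule for the number of retained eigenvectors:
    keep K + 1 eigenvectors when 1 - |lambda_{K+1}| / |lambda_K| < t,
    i.e., when the K-th eigengap is relatively small, and K otherwise."""
    lam = np.sort(np.abs(np.asarray(eigvals, dtype=float)))[::-1]  # magnitudes, descending
    gap = 1.0 - lam[K] / lam[K - 1]       # lam[K-1] is the K-th, lam[K] the (K+1)-th
    return K + 1 if gap < t else K
```

The quantity is scale-free: multiplying all eigenvalues by a constant leaves the decision unchanged, which is why a single threshold t = 0.1 can serve all data sets.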
Proofs
We now prove Theorem 2.1, Corollary 2.1, and Theorem 2.2.
Analysis of Empirical Eigenvectors
Recall that λ_{k} and \(\hat {\lambda }_{k}\) denote the k th largest eigenvalue (in magnitude) of Ω and A, respectively, and ξ_{k} and \(\hat {\xi }_{k}\) denote the respective eigenvectors. Define
Let \(\|M\|_{2\to \infty }\) denote the maximum row-wise ℓ^{2}-norm of a matrix M. The key technical tool we need in the proof is the following result:
Theorem 4.1.
Under conditions of Theorem 2.1, write \(\hat {\Xi }_{0}=[\hat {\xi }_{2},\hat {\xi }_{3},\ldots ,\) \(\hat {\xi }_{K}]\), Ξ_{0} = [ξ_{2},ξ_{3},…,ξ_{K}], and Λ_{0} = diag(λ_{2},λ_{3},…,λ_{K}). With probability 1 − o(n^{− 3}), there exists an orthogonal matrix \(O\in \mathbb {R}^{K-1,K-1}\) (which depends on A and is stochastic) such that
Theorem 4.1 is an extension of equation (C.71) in Jin et al. (2017) (this equation appears in the proof of Lemma 2.1 of Jin et al. (2017)). Lemma 2.1 of Jin et al. (2017) assumes that λ_{2},λ_{3},…,λ_{K} are at the same order, but here we allow them to be at different orders. The proof also needs some modification.
Remark
In the bound in Theorem 4.1, the power of K can be further reduced by adding mild regularity conditions on λ_{2},λ_{3},…,λ_{K}. For example, if we assume λ_{2},λ_{3},…,λ_{K} can be grouped into s = O(1) groups such that κ(I) ≤ C for each group (see the statement of Lemma 4.1 for the definition of κ(I)), then the factor K^{5} can be reduced to \(K\sqrt {K}\). In fact, the setting in Jin et al. (2017) corresponds to the special case of s = 2.
Proof of Theorem 4.1.
Rearrange the (K − 1) eigenvalues in the descending order, i.e., λ_{(2)} ≥ λ_{(3)} ≥… ≥ λ_{(K)}. We first assume that all these eigenvalues are positive and use the following procedure to divide them into groups:

Initialize: k = 1 and m = 2.

Compute the eigengaps g_{s} = λ_{(s)} − λ_{(s+1)} for m ≤ s ≤ K − 1, and g_{K} = λ_{(K)}. Let \(s^{*}=\min \limits \{m\leq s\leq K: g_{s}\geq K^{-1}\lambda _{(m)}\}\). Since \(\lambda _{(m)}={\sum }_{s=m}^{K}g_{s}\), such an s^{∗} must exist.

Group \(\lambda _{(m)},\lambda _{(m+1)},\ldots ,\lambda _{(s^{*}1)}, \lambda _{(s^{*})}\) together as the k th group.

If s^{∗} = K, terminate; otherwise, increase k by 1, reset m = s^{∗} + 1, and repeat the above steps to obtain the next group.
For each group k, let I_{k} be the corresponding index set in the original order, i.e., group k consists of eigenvalues λ_{j} for all j ∈ I_{k}. Define the eigengap associated with group k as
The above grouping procedure, as well as the first inequality of condition Eq. 2.10, guarantees that
When some of the (K − 1) eigenvalues are negative, we first partition the eigenvalues into two subsets, one consisting of the positive eigenvalues and the other of the negative ones. We apply the grouping procedure directly to the first subset. In the second subset, we take absolute values, sort them in descending order, and then apply the same grouping procedure. Finally, we combine the two collections of groups. The resulting groups still satisfy Eq. 4.14.
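The grouping procedure for positive eigenvalues can be sketched directly from the steps above (function name ours; the input is the list of eigenvalues λ_{2},…,λ_{K} and the criterion is g_{s} ≥ λ_{(m)}/K):

```python
import numpy as np

def group_eigenvalues(lams, K):
    """Sketch of the grouping procedure for positive eigenvalues: sort in
    descending order, then repeatedly close the current group at the first
    eigengap g_s >= lambda_(m) / K.  Since lambda_(m) is the sum of the
    remaining gaps, such a gap always exists and the loop terminates."""
    lam = np.sort(np.asarray(lams, dtype=float))[::-1]   # descending order
    groups, m = [], 0                                    # m: 0-based start of current group
    while m < len(lam):
        for s in range(m, len(lam)):
            g = lam[s] - (lam[s + 1] if s + 1 < len(lam) else 0.0)  # g_K = lambda_(K)
            if g >= lam[m] / K:                          # eigengap criterion
                groups.append([float(v) for v in lam[m:s + 1]])
                m = s + 1
                break
    return groups
```

For eigenvalues at different orders of magnitude this isolates each scale in its own group, which is exactly what the refined bound in the Remark exploits.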
We then prove the following technical lemma, which extends Theorem 2.1 of Abbe et al. (2019) and Lemma C.3 of Jin et al. (2017).
Lemma 4.1.
Let \(M\in \mathbb {R}^{n,n}\) be a symmetric random matrix, where \(\mathbb {E}M=M^{*}\) for a rank K_{0} matrix M^{∗}. Let \(d^{*}_{k}\) and d_{k} be the k th largest nonzero eigenvalue of M^{∗} and M, respectively, and let \(\eta ^{*}_{k}\) and η_{k} be the corresponding eigenvector, respectively, 1 ≤ k ≤ K_{0}. Consider a partition
with s and r being two integers such that 1 ≤ r ≤ K_{0} and 0 ≤ s ≤ K_{0} − r, and where each of I_{1},I_{2},…,I_{N} contains consecutive indices. For any index subset B ⊂{1,2,…,K_{0}}, define
Let D = diag(d_{s+ 1},…,d_{s+r}), \(D^{*}=\text {diag}(d^{*}_{s+1},\ldots ,d^{*}_{s+r})\),
Let \(M^{*}_{m,\cdot }\) denote the mth row of M^{∗}, for 1 ≤ m ≤ n. Suppose for a number γ > 0, the following assumptions are satisfied:

A1 (Incoherence): \(\max \limits _{1\leq m\leq n}\|M^{*}_{m,\cdot }\|\leq \gamma {\Delta }^{*}\), where Δ^{∗} = δ(I).

A2 (Independence): For any 1 ≤ m ≤ n, the entries of the mth row and column of M are independent of the other entries.

A3 (Spectral norm concentration): For a number δ_{0} ∈ (0,1), \(\mathbb {P}(\|M-M^{*}\|\leq \gamma {\Delta }^{*})\geq 1-\delta _{0}\).

A4 (Row concentration): There is a number δ_{1} ∈ (0,1) and a continuous nondecreasing function φ(⋅) with φ(0) = 0 and φ(x)/x being nonincreasing in \(\mathbb {R}^{+}\) such that, for any 1 ≤ m ≤ n and nonstochastic matrix \(Y\in \mathbb {R}^{n,r}\),
$$ \mathbb{P}\left( \|(M-M^{*})_{m,\cdot}Y\|_{2} \leq {\Delta}^{*}\|Y\|_{2\to\infty}\varphi\left( \frac{\|Y\|_{F}}{\sqrt{n}\|Y\|_{2\to\infty}}\right) \right)\geq 1-\delta_{1}/n. $$
With probability 1 − δ_{0} − 2δ_{1}, for an orthogonal matrix \(O\in \mathbb {R}^{r,r}\),
where \(\widetilde {U}^{*}=[\eta ^{*}_{1},\eta ^{*}_{2},\ldots ,\eta ^{*}_{K_{0}}]\) and \(\widetilde {\kappa }={\sum }_{1\leq k\leq N}\kappa (I_{k})\).
We now prove Lemma 4.1. The proof is a light modification of the proof of Lemma C.3 of Jin et al. (2017). Fix 1 ≤ m ≤ n. Let M^{(m)} be the matrix by setting the m th row and the m th column of M to be zero. Let \(\eta _{1}^{(m)},\eta _{2}^{(m)},\ldots ,\eta _{n}^{(m)}\) be the eigenvectors of M^{(m)}. Write \(U^{(m)}=[\eta _{s+1}^{(m)},\ldots ,\eta _{s+r}^{(m)}]\). Let \(H=U^{\prime }U^{*}\), H^{(m)} = (U^{(m)})^{′}U^{∗} and V^{(m)} = U^{(m)}H^{(m)} − U^{∗}. We aim to prove
Once Eq. 4.16 is obtained, the proof is almost identical to the proof of (B.26) in Abbe et al. (2019), except that we plug in Eq. 4.16 instead of (B.32) in Abbe et al. (2019). This is straightforward, so we omit it. What remains is to prove Eq. 4.16. In the proof of (Abbe et al. 2019, Lemma 5), it is shown that
Combining them gives
We further bound the first term in Eq. 4.17. Define
In other words, I_{0} is the union of groups of eigenvalues such that the largest absolute eigenvalue in that group is larger than ∥D^{∗}∥. Let \(\widetilde {M}^{*}={\sum }_{j\in I_{0}}d_{j}^{*}\eta _{j}^{*}(\eta _{j}^{*})'\).
where the last line uses ∥V^{(m)}∥≤ 6γ, by (B.12) of Abbe et al. (2019). Note that \(M^{*}-\widetilde {M}^{*}={\sum }_{j\notin I_{0}}d_{j}^{*}\eta _{j}^{*}(\eta _{j}^{*})'\). By the definition of I_{0}, for any j∉I_{0}, \(d_{j}^{*}\leq \max \limits _{i\in I}d_{i}^{*}\leq \kappa {\Delta }^{*}\). It follows that
Combining the above gives
Without loss of generality, we assume all groups except for I are contained in I_{0}, i.e., \(I_{0}=\cup _{k=1}^{N}I_{k}\). Let \(D_{k}^{*}=\text {diag}(d_{j}^{*})_{j\in I_{k}}\), \(U_{k}^{*}=[\eta ^{*}_{j}]_{j\in I_{k}}\), \(U_{k}=[\eta _{j}]_{j\in I_{k}}\), \(U_{k}^{(m)}=[\eta ^{(m)}_{j}]_{j\in I_{k}}\), and \(H_{k}^{(m)}=(U_{k}^{(m)})'U_{k}^{*}\). Then,
Similar to (B.12) of Abbe et al. (2019), we have \(\|U_{k}^{(m)}H_{k}^{(m)}-U^{*}_{k}\|\leq 6\gamma _{k}\), where γ_{k} is defined in the same way as γ but with respect to the eigengap of group k, which is \({\Delta }^{*}_{k}\equiv \delta (I_{k})\). It is not hard to see that \(\gamma _{k}=\gamma {\Delta }^{*}/{\Delta }^{*}_{k}\). Therefore,
By the mutual orthogonality of eigenvectors, \((U_{k}^{(m)})'U^{(m)}=0\) and \((U^{*}_{k})'U^{*}=0\). Additionally, we have \(\|U_{k}^{(m)}\|=1\) and \(\|H_{k}^{(m)}\|\leq 1\). It follows that
We plug it into Eq. 4.18 and use the definition of \(\widetilde {\kappa }\). It gives
Combining Eq. 4.20 with Eq. 4.17 gives Eq. 4.16. Then, the proof of Lemma 4.1 is complete.
We now apply Lemma 4.1 to prove the claim. The groups in Eq. 4.14 satisfy
We fix I to be one of the groups, and let I_{1},…,I_{N} be the remaining groups. We apply Lemma 4.1 with M = A and \(M^{*}={\Omega }\), so that \(M-M^{*}=-\text {diag}({\Omega })+(A-\mathbb {E}A)\), and
We construct φ(γ) in the same way as in Lemma C.3 of Jin et al. (2017). It satisfies \(\varphi (\gamma )\leq C\gamma \sqrt {\log (n)}\). Similarly to the proof of Lemma C.3, we can show that conditions A1–A4 are satisfied. Write
It follows from Eq. 4.15 that there exists an orthogonal matrix \(O\in \mathbb {R}^{|I|\times |I|}\) such that
By Lemma B.2 of Jin et al. (2017), \(\|{\Xi }\|_{2\to \infty }=O(\sqrt {K}\|\theta \|^{-1}\theta _{\max \limits })\). Plugging it into the above inequality, we find that
The above inequality holds for each group. Note that \(\hat {\Xi }_{0}\) is obtained by putting such \(\hat {\Xi }_{01}\) together. When B = [B_{1},B_{2},…,B_{N}], it holds that \(\|B\|_{2\to \infty }\leq \sqrt {{\sum }_{k}\|B_{k}\|^{2}_{2\to \infty }}\leq \sqrt {K}\max \limits _{k}\|B_{k}\|_{2\to \infty }\). Combining it with Eq. 4.21 gives the claim.
Proof of Theorem 2.1
The rationale of SCORE guarantees that the rows of R take only K distinct values \(v_{1},v_{2},\ldots ,v_{K}\in \mathbb {R}^{K-1}\). Below, we first derive a crude high-probability bound for \(\text {Hamm}(\widehat {\Pi },{\Pi })\) without using Theorem 4.1. This bound implies that each k-means center is close to one of the true v_{k}. Next, we use Theorem 4.1 to derive a sharper bound for \(\mathbb {E}[\text {Hamm}(\widehat {\Pi },{\Pi })]\).
We start by deriving a crude bound for \(\text {Hamm}(\widehat {\Pi },{\Pi })\). Let β_{n} be the same as in Section 4.1. By Lemma B.1 of Jin et al. (2017), C^{− 1}K^{− 1}∥𝜃∥^{2} ≤ λ_{1} ≤∥𝜃∥^{2}, and λ_{K}≍ K^{− 1}β_{n}∥𝜃∥^{2}. It follows that
Therefore, the assumption \(s_{n}\geq a_{1}^{-1} K^{4}\sqrt {\log (n)}\) guarantees that
Let O be the orthogonal matrix in Theorem 4.1. By Lemma 2.1 of Jin et al. (2017), with probability 1 − o(n^{− 3}), there exists ω ∈{± 1} such that
By Lemma B.2 of Jin et al. (2017), \(\xi _{1}(i)\geq C^{-1}\theta _{i}/\|\theta \|\geq C^{-1}\theta _{\min \limits }/\|\theta \|\). By choosing a_{1} appropriately small, the condition on s_{n} guarantees that \(\|\hat {\xi }_{1}-\xi _{1}\|_{\infty }\leq \xi _{1}(i)/3\), for any 1 ≤ i ≤ n. Then, we can use a proof similar to that of Lemma C.5 of Jin et al. (2017) to show that, with probability 1 − o(n^{− 3}), there exists an orthogonal matrix H such that
Since \(\theta _{\max \limits }^{2}\geq \|\theta \|^{2}/n\), we can further write that
where the last inequality is from Eq. 4.22. Recall that the rows of R take only K distinct values v_{1},…,v_{K}. By Lemma B.3 of Jin et al. (2017), there exists a constant c_{0} > 0 such that, for all 1 ≤ k≠ℓ ≤ K,
Furthermore, in the proof of Theorem 2.2 of Jin (2015), it was shown that the k-means solution satisfies
where δ is the minimum distance between two distinct rows of R. Combining the above gives
where C is a constant that does not depend on a_{1}.
This crude bound Eq. 4.26 is enough for studying the k-means centers. By Eq. 4.26, the total number of misclustered nodes is \(O(n/[K^{7}\log (n)])\). Also, Condition Eq. 2.9 implies that each true cluster has at least \(c^{-1}_{2}K^{-1}n\) nodes. This means that each cluster has only a negligible fraction of misclustered nodes. Particularly, each true cluster \({\mathcal {C}}_{k}\) is associated with one and only one k-means cluster, which we denote by \(\hat {\mathcal {C}}_{k}\); furthermore, we have \(|\hat {\mathcal {C}}_{k}\backslash {\mathcal {C}}_{k}|= O(n/[K^{7}\log (n)])\) and \(|{\mathcal {C}}_{k}\backslash \hat {\mathcal {C}}_{k}| =O(n/[K^{7}\log (n)])\). The cluster center \(\hat {v}_{k}\) of the cluster \(\hat {\mathcal {C}}_{k}\) satisfies
Note that r_{i} = v_{k} for \(i\in {\mathcal {C}}_{k}\). It follows that
By Eq. 4.25, \(\|r_{i}-v_{k}\|=O(\sqrt {K})\). Furthermore, \(|\hat {\mathcal {C}}_{k}|\gtrsim c^{-1}_{2}K^{-1}n\) and \(|\hat {\mathcal {C}}_{k}\backslash {\mathcal {C}}_{k}|=O(n/[K^{7}\log (n)])\). Combining them with the Cauchy–Schwarz inequality, we find that
The right hand side is \(o(\sqrt {K})\). Let c_{0} be the same as in Eq. 4.25. Then, for sufficiently large n,
Next, we use Theorem 4.1 to get the desired bound for \(\mathbb {E}[\text {Hamm}(\widehat {\Pi },{\Pi })]\). Let D be the event that Eq. 4.27 holds. We have shown that \(\mathbb {P}(D^{c})=o(n^{3})\). It follows that
It remains to bound the probability of making a clustering error on node i when the event D holds. Suppose \(i\in {\mathcal {C}}_{k}\). On the event D, if \(\|H\hat {r}_{i}-r_{i}\|\leq c_{0}\sqrt {K}/4\), then
while for any ℓ≠k,
Then, node i must be clustered into \(\hat {\mathcal {C}}_{k}\), i.e., there is no error on i. This implies that
We further study the right hand side of Eq. 4.29. Let \((\hat {\xi }_{1},\hat {\Xi }_{0}, \omega , O)\) be the same as in Eq. 4.23. Fix i. Let \(\hat {\Xi }_{0,i}^{\prime }\in \mathbb {R}^{K1}\) and \({\Xi }_{0,i}^{\prime }\in \mathbb {R}^{K1}\) denote the i th row of \(\hat {\Xi }_{0}\) and Ξ_{0}, respectively. Then,
It is seen that
By Lemma B.2 of Jin et al. (2017), ξ_{1}(i) ≥ C^{− 1}𝜃_{i}/∥𝜃∥. Combining it with Eq. 4.23 gives \(\|\omega \hat {\xi }_{1}-\xi _{1}\|_{\infty }=o(\xi _{1}(i))\). It follows that \(\omega \hat {\xi }_{1}(i)\geq \xi _{1}(i)/2\geq C^{-1}\theta _{i}/\|\theta \|\). Additionally, \(\|r_{i}\|\leq C\sqrt {K}\), by Eq. 4.25. Therefore, with probability 1 − o(n^{− 3}),
We plug in Theorem 4.1 and the first inequality of Eq. 4.23. It yields
where the second inequality is from Eq. 4.22 and the constant C_{1} does not depend on a_{1}. By choosing an appropriately small a_{1}, we can make the first term \(\leq c_{0}\sqrt {K}/8\). It follows that
Note that A = Ω + W −diag(Ω), where \(W=A-\mathbb {E}A\) and \({\Omega }={\Theta }{\Pi } P{\Pi }^{\prime }{\Theta }={\Xi }{\Lambda }{\Xi }^{\prime }\). In particular,
It follows that
Note that \({\Omega }(i,i)\leq {\theta _{i}^{2}}\). Additionally, \(\|{\Lambda }_{0}^{-1}\|= \lambda _{K}^{-1}\asymp K\beta ^{-1}_{n}\|\theta \|^{-2}\) and ∥Ξ_{0,i}∥≤∥Ξ_{0}∥≤ 1. It follows that
where C_{2} > 0 is a constant that does not depend on a_{1}. The second term is \(O(K\beta _{n}^{-1}\|\theta \|^{-1})\). At the same time, the left hand side of Eq. 4.22 is lower bounded by \(K^{4}\sqrt {K\log (n)}/(\beta _{n}\|\theta \|)\). Therefore, Eq. 4.22 implies that the second term is \(O(1/[K^{3}\sqrt {K\log (n)}])=o(\sqrt {K})\). Particularly, for sufficiently large n, the second term is \(\leq c_{0}\sqrt {K}/16\), i.e.,
We plug it into Eq. 4.30 to get
where the last inequality is because of the probability union bound.
It remains to get a large deviation inequality for \(e_{i}^{\prime }W\xi _{k}\). Note that
The summands are independent, and \(|\xi _{k}(j)W(i,j)|\leq |\xi _{k}(j)| \leq C\sqrt {K}\theta _{j}/\|\theta \|\leq C\sqrt {K}\theta _{\max \limits }/\|\theta \|\) (the bound on ξ_{k}(j) is from Lemma B.2 of Jin et al. (2017)). We shall apply Bernstein’s inequality. Note that \({\sum }_{j}{\xi _{k}^{2}}(j)\text {Var}(W(i,j))\leq {\sum }_{j}{\xi _{k}^{2}}(j)\|P\|_{\max \limits }\theta _{i}\theta _{j}\leq C{\sum }_{j} (K{\theta ^{2}_{j}}/\|\theta \|^{2})\theta _{i}\theta _{j}\leq \theta _{i}\cdot CK\|\theta \|_{3}^{3}/\|\theta \|^{2}\). It follows from Bernstein’s inequality that
We plug in t = (c_{0}/16C_{1}) ⋅ K^{− 1}𝜃_{i}β_{n}∥𝜃∥. It follows that
where C_{3},C_{4} are constants that depend on (c_{0},C_{2},C), \(a_{2}=\min \limits \{C_{3},C_{4}\}\), and the last inequality is due to \(\beta _{n}\geq \sqrt {K}(\lambda _{K}/\sqrt {\lambda _{1}})\). We plug it into Eq. 4.31 to get
Combining it with Eq. 4.28 gives the desired claim.
Proof of Corollary 2.1
We use an intermediate result in the proof of Theorem 2.1, namely the second-to-last line of Eq. 4.32. We plug it into Eq. 4.31 and Eq. 4.28 to get
where β_{n} = λ_{K}(G^{1/2}PG^{1/2}), with G = K∥𝜃∥^{− 2}diag(∥𝜃^{(1)}∥^{2},…,∥𝜃^{(K)}∥^{2}). In this example, from the way π_{i} is generated, by elementary probability, \(\|G-I_{K}\|=O(\sqrt {\log (K)/n})\); moreover, the first eigenvalue of P is (1 − b) + Kb, and the other eigenvalues all equal (1 − b). It follows that
Additionally, since \(\theta _{\max \limits }\leq C\theta _{\min \limits }\), we have \(\|\theta \|_{3}^{3}\asymp \|\theta \|^{2}\bar {\theta }\). It follows that
The first term is dominating. We plug it into Eq. 4.33. The claim follows immediately.
Proof of Theorem 2.2
We have shown in Eq. 4.29 that there is an event D such that \(\mathbb {P}(D^{c})=o(n^{3})\) and that on the event D,
Therefore, it suffices to show that, with probability 1 − o(n^{− 3}),
In the equation above Eq. 4.30 and the equation above Eq. 4.31, we have shown that, as long as a_{1} in Theorem 2.1 is properly small,
We then apply Eq. 4.32. In order for the exponent of the right hand side of Eq. 4.32 to be at the order of \(\log (n)\), we need
for a large enough constant C > 0. Note that the condition on s_{n} implies
for a large constant C > 0. It is straightforward that this condition guarantees Eq. 4.36. Then, the right hand side of Eq. 4.32 can be o(n^{− 3}). In other words, with probability 1 − o(n^{− 3}),
where the constant c_{1} > 0 can be arbitrarily small by setting the constant C in the assumption of s_{n} to be sufficiently large. We plug it into Eq. 4.35 to get, with probability 1 − o(n^{− 3}),
Since c_{1} can be made arbitrarily small by increasing C in the assumption on s_{n}, we choose a large enough C such that \(C_{2}c_{1}<(c_{0}/16)\sqrt {K}\). Then, Eq. 4.34 is satisfied. The claim follows directly.
Notes
We model \(\mathbb {E}[A]\) by Ω −diag(Ω) instead of Ω because the diagonals of \(\mathbb {E}[A]\) are all 0. Here, “main signal”, “secondary signal”, and “noise” refer to Ω, −diag(Ω), and W, respectively.
For SBM, the diagonal entries of P can be unequal. DCBM has more free parameters, so we have to assume that P has unit diagonal entries to maintain identifiability.
A multi-\(\log (n)\) term is a term L_{n} > 0 that satisfies \(L_{n}n^{-\delta }\to 0\) and \(L_n n^{\delta }\to \infty \) for any fixed constant δ > 0.
For example, \(\frac {\hat {\xi }_{2}}{\hat {\xi }_{1}}\) is the ndimensional vector \((\frac {\hat {\xi }_{2}(1)}{\hat {\xi }_{1}(1)}, \frac {\hat {\xi }_{2}(2)}{\hat {\xi }_{1}(2)}, \ldots , \frac {\hat {\xi }_{2}(n)}{\hat {\xi }_{1}(n)})^{\prime }\). Note that we may choose to threshold all entries of the n × (K − 1) matrix at \(\pm \log (n)\) from above and below (Jin, 2015), but this is not always necessary. For all data sets in this paper, thresholding makes only a negligible difference.
When translating the bound in Gao et al. (2018), we notice that 𝜃_{i} there have been normalized, so that their 𝜃_{i} corresponds to our \((\theta _{i}/\bar {\theta })\).
This is analogous to Student’s t-test, where, for n samples from an unknown distribution, the t-test uses a normalization for the mean and a normalization for the variance.
References
Abbe, E., Fan, J., Wang, K. and Zhong, Y. (2019). Entrywise eigenvector analysis of random matrices with low expected rank. Ann. Statist. (to appear).
Adamic, L. A. and Glance, N. (2005). The political blogosphere and the 2004 US election: divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery, pp. 36–43.
Airoldi, E., Blei, D., Fienberg, S. and Xing, E. (2008). Mixed membership stochastic blockmodels. J. Mach. Learn. Res. 9, 1981–2014.
Bickel, P. J. and Chen, A. (2009). A nonparametric view of network models and Newman–Girvan and other modularities. Proc. Natl. Acad. Sci. 106, 21068–21073.
Chaudhuri, K., Chung, F. and Tsiatas, A. (2012). Spectral clustering of graphs with general degrees in the extended planted partition model. In Proceedings of the 25th annual conference on learning theory, JMLR workshop and conference proceedings, vol. 23, pp. 1–35.
Chen, Y., Li, X. and Xu, J. (2018). Convexified modularity maximization for degreecorrected stochastic block models. Ann. Statist. 46, 1573–1602.
Duan, Y., Ke, Z. T. and Wang, M. (2018). State aggregation learning from Markov transition data. In NIPS workshop on probabilistic reinforcement learning and structured control.
Fan, J., Fan, Y., Han, X. and Lv, J. (2019). SIMPLE: statistical inference on membership profiles in large networks. arXiv:1910.01734.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Ann. Eugen. 7, 179–188.
Gao, C., Ma, Z., Zhang, A.Y. and Zhou, H.H. (2018). Community detection in degreecorrected block models. Ann. Statist. 46, 2153–2185.
Girvan, M. and Newman, M. E. J. (2002). Community structure in social and biological networks. Proc. Natl. Acad. Sci. 99, 7821–7826.
Hastie, T., Tibshirani, R. and Friedman, J. (2009). The elements of statistical learning, 2nd edn. Springer, Berlin.
Ji, P. and Jin, J. (2016). Coauthorship and citation networks for statisticians (with discussion). Ann. Appl. Statist. 10, 1779–1812.
Jin, J. (2015). Fast community detection by SCORE. Ann. Statist.43, 57–89.
Jin, J. and Ke, Z. T. (2018). Optimal membership estimation, especially for networks with severe degree heterogeneity. Manuscript.
Jin, J., Ke, Z. T. and Luo, S. (2017). Estimating network memberships by simplex vertex hunting. arXiv:1708.07852.
Jin, J., Tracy Ke, Z. and Luo, S. (2019). Optimal adaptivity of signedpolygon statistics for network testing. arXiv:1904.09532.
Jin, J., Ke, Z. T., Luo, S. and Wang, M. (2020). Optimal approach to estimating K in social networks. Manuscript.
Karrer, B. and Newman, M. (2011). Stochastic blockmodels and community structure in networks. Phys. Rev. E 83, 016107.
Ke, Z. T. and Wang, M. (2017). A new SVD approach to optimal topic estimation. arXiv:1704.07016.
Ke, Z. T., Shi, F. and Xia, D. (2020). Community detection for hypergraph networks via regularized tensor power iteration. arXiv:1909.06503.
Lusseau, D., Schneider, K., Boisseau, O. J., Haase, P., Slooten, E. and Dawson, S. M. (2003). The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behav. Ecol. Sociobiol. 54, 396–405.
Liu, Y., Hou, Z., Yao, Z., Bai, Z., Hu, J. and Zheng, S. (2019). Community detection based on the \(\ell _{\infty }\) convergence of eigenvectors in DCBM. arXiv:1906.06713.
Ma, Z., Ma, Z. and Yuan, H. (2020). Universal latent space model fitting for large networks with edge covariates. J. Mach. Learn. Res. 21, 1–67.
Mao, X., Sarkar, P. and Chakrabarti, D. (2020). Estimating mixed memberships with sharp eigenvector deviations. J. Amer. Statist. Assoc. (to appear).
Mihail, M. and Papadimitriou, C. H. (2002). On the eigenvalue power law. In International workshop on randomization and approximation techniques in computer science, pp. 254–262. Springer, Berlin.
Nepusz, T., Petróczi, A., Négyessy, L. and Bazsó, F. (2008). Fuzzy communities and the concept of bridgeness in complex networks. Phys. Rev. E 77, 016107.
Qin, T. and Rohe, K. (2013). Regularized spectral clustering under the degreecorrected stochastic blockmodel. Adv. Neural Inf. Process. Syst. 3120–3128.
Su, L., Wang, W. and Zhang, Y. (2019). Strong consistency of spectral clustering for stochastic block models. IEEE Trans. Inform. Theory 66, 324–338.
Traud, A. L., Kelsic, E. D., Mucha, P. J. and Porter, M. A. (2011). Comparing community structure to characteristics in online collegiate social networks. SIAM Rev. 53, 526–543.
Traud, A. L., Mucha, P. J. and Porter, M. A. (2012). Social structure of facebook networks. Physica A 391, 4165–4180.
Zachary, W. W. (1977). An information flow model for conflict and fission in small groups. J. Anthropol. Res. 33, 452–473.
Zhang, Y., Levina, E. and Zhu, J. (2020). Detecting overlapping communities in networks using spectral methods. SIAM J. Math. Anal. 2, 265–283.
Zhao, Y., Levina, E. and Zhu, J. (2012). Consistency of community detection in networks under degree-corrected stochastic block models. Ann. Statist. 40, 2266–2292.
Jin, J., Ke, Z.T. & Luo, S. Improvements on SCORE, Especially for Weak Signals. Sankhya A 84, 127–162 (2022). https://doi.org/10.1007/s13171-020-00240-1
AMS (2000) subject classification
 Primary: 62H30
 91C20
 Secondary: 62P25