
1 Introduction

Considerable attention has been paid to the graph clustering or community detection problem, and a number of formulations and algorithms have been proposed in the literature [1–5]. Although the notion of a module may not be equivalent across detection methods, we naturally wish to know how the methods typically perform and the point at which a method fails, in principle, to detect a certain structure [6–8]. Otherwise, we would need to test all the existing methods, which is clearly costly and redundant. Although most studies of the expected performance have been experimental, using benchmark testing [9–11], theoretical analysis is expected to give us deeper insight.

As frequently done in benchmarks, we consider random graphs having a planted block structure. The most common model is the so-called stochastic block model (or the planted partition model) [12]. Although many variants of the stochastic block model have been proposed in the literature [13–16], in the simplest case, the vertices within the same module are connected with a high probability \(p_{\mathrm{in}}\), while the vertices in different modules are connected with a low probability \(p_{\mathrm{out}}\). When the difference between the probabilities is sufficiently large, \(p_{\mathrm{in}} \gg p_{\mathrm{out}}\), the graph has a strong block structure and the spectral method detects almost or exactly the same partition as the planted partition. As we increase the probability between the modules \(p_{\mathrm{out}}\), the partition obtained by the spectral method tends to be very different from the planted one, and finally, they become completely uncorrelated. The point of this transition is called the detectability threshold [17–20]. Since we know that the graph is generated by the stochastic block model, the ultimate limit of this threshold is given by Bayesian inference, and it is known that, in the case of the two-block model,

$$\displaystyle\begin{array}{rcl} c_{\mathrm{in}} - c_{\mathrm{out}} = 2\sqrt{\overline{c}},& &{}\end{array}$$
(12.1)

where \(c_{\mathrm{in}} = p_{\mathrm{in}}N\), \(c_{\mathrm{out}} = p_{\mathrm{out}}N\), \(\overline{c}\) is the average degree, and N is the total number of vertices in the graph. Equation (12.1) indicates that, even when the vertices are more densely connected within a module than between modules, unless the difference is sufficiently large, it is statistically impossible to infer the embedded structure.

It was predicted by Nadakuditi and Newman in [20] that the spectral method with modularity also has the same detectability threshold as Eq. (12.1). However, it was numerically shown in [21] that this applies only to the case where the graph is not sparse. Despite its significance, a precise estimate of the detectability threshold of the spectral method in the sparse case seems to remain missing.

In this article, we derive an estimate of the detectability threshold of the spectral method for the two-block regular random graph. It should be noted that the simplest stochastic block model, which we explained above, has a Poisson degree distribution, while we impose a constraint such that the degree does not fluctuate. Therefore, our results do not directly provide an answer to the missing part of the problem. They do, however, provide fruitful insight into the performance of the spectral method. Moreover, in the present situation, we do not face the second difficulty of the spectral method: the localization of the eigenvectors. Although eigenvector localization is another important factor in the detectability problem, it is outside the scope of this article.

This article is organized as follows. In Sect. 12.2, we briefly introduce spectral partitioning of two-block regular random graphs and mention that the eigenvector corresponding to the second-smallest eigenvalue contains the information of the modules. In Sect. 12.3, we show the average behavior of the second-smallest eigenvalue and the corresponding eigenvector as a function of the parameters in the model. Finally, Sect. 12.4 is devoted to the conclusion.

2 Spectral Partitioning of Regular Random Graphs With Two-Block Structure

The model parameters in the two-block regular random graph are the total number of vertices N, the degree of each vertex c, and the fraction of edges between modules \(\gamma = l_{\mathrm{int}}/N\). The graph is constructed as follows. We first assign module indices to the vertices, each of which has c half edges, or stubs. We then randomly connect the vertices in different modules with \(l_{\mathrm{int}}\) edges and connect the remaining stubs at random within each module, so that every edge joins a pair of vertices. This random graph is sparse when c = O(1), because the number of edges is of the same order as the number of vertices N. We calculate the degree of correlation between the partition obtained by the spectral method and the planted partition as γ varies.
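The stub-matching construction above can be sketched in code. The following is a minimal sketch, not the authors' implementation; for simplicity, self-loops and duplicate edges produced by the matching are simply dropped, so degrees are only approximately c for small graphs (the function name is ours).

```python
import numpy as np

def two_block_regular_graph(N, c, gamma, p1=0.5, seed=None):
    """Sketch of the two-block c-regular random graph.

    N: number of vertices, c: degree of each vertex,
    gamma = l_int / N: fraction of inter-module edges.
    Self-loops and duplicate edges from the stub matching are simply
    dropped, so degrees are only approximately c for small N.
    """
    rng = np.random.default_rng(seed)
    N1 = int(p1 * N)
    l_int = int(gamma * N)
    labels = np.array([0] * N1 + [1] * (N - N1))
    # c stubs per vertex, kept per module and shuffled
    stubs = [list(np.repeat(np.arange(0, N1), c)),
             list(np.repeat(np.arange(N1, N), c))]
    for s in stubs:
        rng.shuffle(s)
    edges = set()
    for _ in range(l_int):                 # l_int inter-module edges
        edges.add((stubs[0].pop(), stubs[1].pop()))
    for s in stubs:                        # pair remaining stubs within modules
        while len(s) > 1:
            u, v = s.pop(), s.pop()
            if u != v:                     # drop self-loops
                edges.add((min(u, v), max(u, v)))
    A = np.zeros((N, N), dtype=int)
    for u, v in edges:
        A[u, v] = A[v, u] = 1
    return A, labels
```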

A wide choice of matrices can be used in the spectral method. Popular ones are the unnormalized Laplacian L, the normalized Laplacian \(\mathcal{L}\), and the modularity matrix B. For the bisection of regular random graphs, however, the partitions they yield have all been shown to be the same [22]. Thus, we analyze the unnormalized Laplacian L, since it is the simplest. The basic procedure of spectral bisection with the unnormalized Laplacian L is quite simple. We solve for the eigenvector corresponding to the second-smallest eigenvalue of L and classify each vertex according to the sign of the corresponding component of the eigenvector; the vertices with the same sign belong to the same module. Therefore, our goal is to calculate the behavior of the sign of the eigenvector as a function of γ.
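This procedure takes only a few lines; the following is a hedged sketch using NumPy's dense symmetric eigensolver (the function name `spectral_bisection` is ours, not from the text):

```python
import numpy as np

def spectral_bisection(A):
    """Bisection by the sign of the eigenvector for the
    second-smallest eigenvalue of the unnormalized Laplacian L = D - A.

    A: symmetric adjacency matrix (NumPy array).
    Returns +/-1 module labels, one per vertex.
    """
    L = np.diag(A.sum(axis=1)) - A
    _, vecs = np.linalg.eigh(L)   # eigenvalues returned in ascending order
    fiedler = vecs[:, 1]          # eigenvector of the second-smallest eigenvalue
    return np.where(fiedler >= 0, 1, -1)
```

For a graph with a clear block structure, e.g. two triangles joined by a single bridge edge, the sign pattern recovers the two modules exactly.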

3 Detectability Threshold

We use the so-called replica method, which is often used in the field of spin-glass theory in statistical physics. The basic methodology here parallels that in [23]. Although the final goal is to solve for the eigenvector corresponding to the second-smallest eigenvalue, or the statistics of its components, let us first consider estimating the second-smallest eigenvalue averaged over the realizations of the random graphs. We denote by \([\ldots ]_{L}\) the random average over the unnormalized Laplacians of the possible graphs. For this purpose, we introduce the following “Hamiltonian” \(H(\boldsymbol{x}\vert L)\), “partition function” Z(β | L), and “free energy density” f(β | L):

$$\displaystyle\begin{array}{rcl} H(\boldsymbol{x}\vert L)& =& \frac{1} {2}\boldsymbol{x}^{\mathrm{T}}L\boldsymbol{x},{}\end{array}$$
(12.2)
$$\displaystyle\begin{array}{rcl} Z(\beta \vert L)& =& \int d\boldsymbol{x}\,\mathrm{e}^{-\beta H(\boldsymbol{x}\vert L)}\delta (\vert \boldsymbol{x}\vert ^{2} - N)\delta (\boldsymbol{1}^{\mathrm{T}}\boldsymbol{x}),{}\end{array}$$
(12.3)
$$\displaystyle\begin{array}{rcl} f(\beta \vert L)& =& -\frac{1} {N\beta }\ln Z(\beta \vert L),{}\end{array}$$
(12.4)

where \(\boldsymbol{x}\) is an N-dimensional vector, \(\boldsymbol{1}\) is the vector whose elements all equal one, and T represents the transpose. The delta function \(\delta (\vert \boldsymbol{x}\vert ^{2} - N)\) in (12.3) imposes the norm constraint. It should be noted that the eigenvector corresponding to the smallest eigenvalue is proportional to \(\boldsymbol{1}\), and this choice is excluded by the constraint \(\delta (\boldsymbol{1}^{\mathrm{T}}\boldsymbol{x})\). In the limit \(\beta \rightarrow \infty\), the integral of the “partition function” Z(β | L) is dominated by the vector that minimizes the “Hamiltonian” \(H(\boldsymbol{x}\vert L)\) under the constraint of being orthogonal to the eigenvector \(\boldsymbol{1}\) of the smallest eigenvalue. Therefore, the “partition function” is dominated by the eigenvector of the second-smallest eigenvalue, and the “free energy density” f(β | L) extracts it, i.e.,

$$\displaystyle\begin{array}{rcl} \lambda _{2} = 2\lim _{\beta \rightarrow \infty }f(\beta \vert L).& &{}\end{array}$$
(12.5)

The quantity we need is \(\left [\lambda _{2}\right ]_{L}\), the second-smallest eigenvalue averaged over the unnormalized Laplacians. However, because the average of the logarithm of the “partition function” is difficult to calculate, we recast \(\left [\lambda _{2}\right ]_{L}\) as

$$\displaystyle\begin{array}{rcl} \left [\lambda _{2}\right ]_{L}& =& -2\lim _{\beta \rightarrow \infty } \frac{1} {N\beta }\left [\ln Z(\beta \vert L)\right ]_{L} \\ & =& -2\lim _{\beta \rightarrow \infty }\lim _{n\rightarrow 0} \frac{1} {N\beta } \frac{\partial } {\partial n}\ln \left [Z^{n}(\beta \vert L)\right ]_{ L}.{}\end{array}$$
(12.6)

The assessment of \([Z^{n}(\beta \vert L)]_{L}\) is also difficult for a general real number n. However, when n takes positive integer values, \([Z^{n}(\beta \vert L)]_{L}\) can be evaluated as follows. For a positive integer n, \([Z^{n}(\beta \vert L)]_{L}\) is expressed as

$$\displaystyle\begin{array}{rcl} [Z^{n}(\beta \vert L)]_{ L}& =& \int \left (\prod _{a=1}^{n}d\boldsymbol{x}_{ a}\delta (\vert \boldsymbol{x}_{a}\vert ^{2} - N)\delta (\boldsymbol{1}^{\mathrm{T}}\boldsymbol{x}_{ a})\right )\left [\exp \left (-\frac{\beta } {2}\sum _{a}\boldsymbol{x}_{a}^{\mathrm{T}}L\boldsymbol{x}_{ a}\right )\right ]_{L} \\ & \equiv & \int \left (\prod _{a=1}^{n}d\boldsymbol{x}_{ a}\delta (\vert \boldsymbol{x}_{a}\vert ^{2} - N)\delta (\boldsymbol{1}^{\mathrm{T}}\boldsymbol{x}_{ a})\right )\exp \left (H_{\mathrm{eff}}(\beta,\boldsymbol{x}_{1},\boldsymbol{x}_{2},\ldots,\boldsymbol{x}_{n})\right ).{}\end{array}$$
(12.7)

This means that \([Z^{n}(\beta \vert L)]_{L}\) can be interpreted as the partition function of a system of n replicated variables \(\boldsymbol{x}_{1},\boldsymbol{x}_{2},\ldots,\boldsymbol{x}_{n}\) that is subject to no quenched randomness. In addition, the assumption of the graph generation guarantees that the effective Hamiltonian \(H_{\mathrm{eff}}(\beta,\boldsymbol{x}_{1},\boldsymbol{x}_{2},\ldots,\boldsymbol{x}_{n})\) is of the mean-field type. These indicate that \(N^{-1}\ln [Z^{n}(\beta \vert L)]_{L}\) for \(n = 1, 2, \ldots\) can be evaluated exactly by the saddle-point method with respect to certain macroscopic variables (order parameters) as \(N \rightarrow \infty\). After some calculations, we indeed reach an expression with the saddle-point evaluation as

$$\displaystyle\begin{array}{rcl} \frac{1} {N}\ln [Z^{n}(\beta \vert L)]_{ L}& =& \mathop{\mathrm{extr}}_{\{Q_{r}\},\{\hat{Q}_{r}\},\{\phi _{a}\},\{\psi _{a}\},\eta }\Biggl \{NK_{\mathrm{I}}(Q_{r},\hat{Q}_{r}) + \frac{\beta } {2}\sum _{a}\phi _{a} -\sum _{r=1,2}K_{\mathrm{II}r}(Q_{r},\hat{Q}_{r}) \\ & +& \frac{1} {N}\sum _{r=1,2}\ln K_{\mathrm{III},r}(\hat{Q}_{r},\{\phi _{a}\},\{\psi _{a}\}) +\eta \gamma -\frac{1} {N}\ln \mathcal{N}_{G} -\ln c!\Biggr \}, {}\end{array}$$
(12.8)

where \(\mathcal{N}_{G}\) is the total number of graph configurations and

$$\displaystyle\begin{array}{rcl} K_{\mathrm{I}}(Q_{r},\hat{Q}_{r})& =& \sum _{r,s=1,2}\frac{p_{r}p_{s}} {2} \int d\boldsymbol{\mu }^{(r)}d\boldsymbol{\nu }^{(s)}\,Q_{ r}(\boldsymbol{\mu }^{(r)})Q_{ s}(\boldsymbol{\nu }^{(s)}) \\ & & \times \,\mathrm{e}^{-(1-\delta _{rs})\eta -\frac{\beta }{2} \sum _{a}(\mu _{a}^{(r)}-\nu _{ a}^{(s)})^{2} }, \\ K_{\mathrm{II}r}(Q_{r},\hat{Q}_{r})& =& p_{r}\int d\boldsymbol{\mu }^{(r)}\,\hat{Q}_{ r}(\boldsymbol{\mu }^{(r)})Q_{ r}(\boldsymbol{\mu }^{(r)}), \\ K_{\mathrm{III}r}(\hat{Q}_{r},\{\phi _{a}\},\{\psi _{a}\})& =& \int \prod _{i\in V _{r}}\prod _{a=1}^{n}dx_{ ia} \\ & & \times \prod _{i\in V _{r}}\left (\hat{Q}_{r}^{c}(\boldsymbol{x}_{ i})\exp \left [-\frac{\beta } {2}\sum _{a}\left (\phi _{a}x_{ia}^{2} +\psi _{ a}x_{ia}\right )\right ]\right ).\qquad {}\end{array}$$
(12.9)

In the above equations, four functions \(Q_{r}(\mu _{1}^{(r)},\ldots,\mu _{n}^{(r)})\) and \(\hat{Q}_{r}(\mu _{1}^{(r)},\ldots,\mu _{n}^{(r)})\) (r = 1, 2) play the roles of order parameters.

Unfortunately, this expression cannot be employed directly for the computation of (12.6), as \(Q_{r}(\mu _{1}^{(r)},\ldots,\mu _{n}^{(r)})\) and \(\hat{Q}_{r}(\mu _{1}^{(r)},\ldots,\mu _{n}^{(r)})\) are defined only for \(n = 1, 2, \ldots\). To overcome this inconvenience, we introduce the following assumption at the dominant saddle point.

[Replica symmetric assumption] The right-hand side of (12.7) is invariant under any permutation of the replica indices \(a = 1, 2, \ldots, n\). We assume that this property, which is termed the replica symmetry, also holds at the dominant saddle point of (12.8).

In the current system, this restricts the functional forms of \(Q_{r}(\mu _{1}^{(r)},\ldots,\mu _{n}^{(r)})\) and \(\hat{Q}_{r}(\mu _{1}^{(r)},\ldots,\mu _{n}^{(r)})\) as

$$\displaystyle\begin{array}{rcl} Q_{r}(\mu _{1},\ldots,\mu _{n})& =& \left ( \frac{cp_{r}-\gamma } {Np_{r}^{2}}\right )^{1/2}\int dAdH\,q_{ r}(A,H)\left (\frac{\beta A} {2\pi } \right )^{\frac{n} {2} } \\ & & \times \exp \left [-\frac{\beta A} {2} \sum _{a=1}^{n}\left (\mu _{ a} -\frac{H} {A} \right )^{2}\right ], \\ \hat{Q}_{r}(\mu _{1},\ldots,\mu _{n})& =& c\left ( \frac{cp_{r}-\gamma } {Np_{r}^{2}}\right )^{-1/2}\int d\hat{A}d\hat{H}\,\hat{q}_{ r}(\hat{A},\hat{H}) \\ & & \times \exp \left [ \frac{\beta } {2}\sum _{a=1}^{n}\left (\hat{A}\mu _{ a}^{2} + 2\hat{H}\mu _{ a}\right )\right ], {}\end{array}$$
(12.10)

which yields an expression of \(N^{-1}\ln [Z^{n}(\beta \vert L)]_{L}\) that can be analytically continued to real n. We then substitute that expression into (12.6), which finally provides

$$\displaystyle\begin{array}{rcl} \left [\lambda _{2}\right ]_{L}& =& -\mathop{\mathrm{extr}}_{\{q_{r}\},\{\hat{q}_{r}\},\phi,\psi }\Biggl \{\int dAdH\int dA^{{\prime}}dH^{{\prime}}\,\varXi (A,H,A^{{\prime}},H^{{\prime}}) \\ & \times & \frac{cp_{1}p_{2}} {2} \biggl (\left (\frac{p_{1}} {p_{2}}+\varGamma \right )q_{1}(A,H)q_{1}(A^{{\prime}},H^{{\prime}}) \\ & +& \left (\frac{p_{2}} {p_{1}}+\varGamma \right )q_{2}(A,H)q_{2}(A^{{\prime}},H^{{\prime}}) \\ & +& 2\left (1-\varGamma \right )q_{1}(A,H)q_{2}(A^{{\prime}},H^{{\prime}})\biggr ) \\ & +& \phi \\ & -& c\sum _{r=1,2}p_{r}\int dAdH\int d\hat{A}d\hat{H}\,q_{r}(A,H)\hat{q}_{r}(\hat{A},\hat{H})\left (\frac{(H +\hat{ H})^{2}} {A -\hat{ A}} -\frac{H^{2}} {A} \right ) \\ & +& \sum _{r=1,2}p_{r}\int \prod _{g=1}^{c}\left (d\hat{A}_{ g}d\hat{H}_{g}\hat{q}_{r}(\hat{A}_{g},\hat{H}_{g})\right )\frac{\left (\psi /2 -\sum _{g}\hat{H}_{g}\right )^{2}} {\phi -\sum _{g}\hat{A}_{g}} \Biggr \}, {}\end{array}$$
(12.11)

where we set

$$\displaystyle\begin{array}{rcl} \varGamma & =& 1 - \frac{\gamma } {cp_{1}p_{2}},{}\end{array}$$
(12.12)
$$\displaystyle\begin{array}{rcl} \varXi (A,H,A^{{\prime}},H^{{\prime}})& =& \frac{(1 + A^{{\prime}})H^{2} + (1 + A)H^{{\prime}2} + 2HH^{{\prime}}} {(1 + A)(1 + A^{{\prime}}) - 1} -\frac{H^{2}} {A} -\frac{H^{{\prime}2}} {A^{{\prime}}}.\qquad {}\end{array}$$
(12.13)

The above procedure is often termed the replica method. Although the mathematical validity of the replica method has not yet been proved, we will see that our assessment, based on the simplest permutation symmetry of the replica indices, offers a fairly accurate prediction of the experimental results below.

Owing to space limitations, we hereafter show only the results, omitting the details of the calculation (see [24] for the complete calculation, including a detailed derivation of (12.11)). In the limit of large size \(N \rightarrow \infty\), the saddle-point analysis of (12.11) yields the solution

$$\displaystyle\begin{array}{rcl} \left [\lambda _{2}\right ]_{L} = \left \{\begin{array}{ll} (1-\varGamma )\left (c - 1 -\frac{1} {\varGamma } \right )&(1/\sqrt{c - 1} \leq \varGamma ), \\ c - 2\sqrt{c - 1} &\mathrm{otherwise}, \end{array} \right.& &{}\end{array}$$
(12.14)

where \(\varGamma = 1 -\gamma /(cp_{1}p_{2})\); we set the size of each module as \(N_{1} = p_{1}N\) and \(N_{2} = p_{2}N\). The region of constant eigenvalue in (12.14) indicates that the second-smallest eigenvalue is inside the spectral band, i.e., the information of the modules is lost there and an undetectable region exists. Therefore, the boundary in (12.14) is the critical point where the phase transition occurs. The second-smallest eigenvalue \(\left [\lambda _{2}\right ]_{L}\) is plotted in Fig. 12.1. Although the dots represent the results of a numerical experiment on a single realization, they agree with (12.14) quite well.

Fig. 12.1

Second-smallest eigenvalue as a function of γ. The solid line represents the estimate of the average over the realization of the graphs \(\left [\lambda _{2}\right ]_{L}\) and the dots represent the results of the numerical experiment of a single realization with N = 1000 and c = 4. The module sizes are set to be equal, \(p_{1} = p_{2} = 0.5\).

In terms of γ, the boundary of Eq. (12.14) can be recast as

$$\displaystyle\begin{array}{rcl} & \gamma = cf(c)p_{1}p_{2},&{}\end{array}$$
(12.15)

where

$$\displaystyle\begin{array}{rcl} f(c) = 1 - \frac{1} {\sqrt{c - 1}}.& &{}\end{array}$$
(12.16)

Since \(cp_{1}p_{2}\) is the value of γ in a uniform (i.e., one-block) regular random graph, the factor f(c) quantifies how much lower the threshold is compared with the uniform random case.
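Equations (12.14)–(12.16) are straightforward to evaluate numerically. The following sketch (our own helper names, assuming \(p_{2} = 1 - p_{1}\)) computes the estimated second-smallest eigenvalue and the threshold value of γ:

```python
import numpy as np

def lambda2_estimate(c, gamma, p1=0.5):
    """Average second-smallest eigenvalue from Eq. (12.14)."""
    p2 = 1 - p1
    Gamma = 1 - gamma / (c * p1 * p2)           # Eq. (12.12)
    if Gamma >= 1 / np.sqrt(c - 1):             # detectable branch
        return (1 - Gamma) * (c - 1 - 1 / Gamma)
    return c - 2 * np.sqrt(c - 1)               # inside the spectral band

def gamma_threshold(c, p1=0.5):
    """Detectability threshold gamma* = c f(c) p1 p2, Eqs. (12.15)-(12.16)."""
    p2 = 1 - p1
    return c * (1 - 1 / np.sqrt(c - 1)) * p1 * p2
```

The two branches of (12.14) meet continuously at \(\varGamma = 1/\sqrt{c-1}\), i.e., at \(\gamma = \gamma^{*}\).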

The distribution of the components of the corresponding eigenvector can also be obtained through this calculation. Although it cannot be written analytically, we can solve for it by iterating a set of integral equations that result from the saddle-point evaluation of the right-hand side of (12.6). As shown in Fig. 12.2, the results of our analysis agree excellently with the corresponding numerical experiment. In Fig. 12.2, the dots represent the average over 100 realizations of the random graphs. The fraction of misclassified vertices is shown in Fig. 12.3. It increases polynomially with respect to γ and saturates at the detectability threshold.

Fig. 12.2

Distributions of the elements of the eigenvector corresponding to the second-smallest eigenvalue. Each plot shows the distribution of elements in each module: the distribution on the left corresponds to the module that is supposed to have negative-sign elements, and the distribution on the right corresponds to the module that is supposed to have positive-sign elements. The dots represent the average results of the numerical experiments, taken over 100 samples. The module sizes are set to \(p_{1} = 0.6\) and \(p_{2} = 0.4\)

Fig. 12.3

Fraction of misclassified vertices in each module. As the parameter γ increases, the number of misclassified vertices increases polynomially

It should be noted that, even when the number of vertices goes to infinity, the fraction of misclassified vertices remains finite. The misclassification occurs because the planted partition is not the optimum in the sense of spectral bisection. The spectral method with the unnormalized Laplacian L constitutes a continuous relaxation of the discrete minimization problem of the so-called RatioCut. The RatioCut is lower for a partition with a sparse cut, while it penalizes partitions that are unbalanced in terms of the number of vertices in each module; when γ is large, there may always exist a cut in the graph that is better, in the sense of the RatioCut, than the planted partition.

Finally, let us compare our estimate with results in the literature. In the following, we focus on the case of equal-size modules, i.e., \(p_{1} = p_{2} = 0.5\). Let the total degree within a module be \(K_{\mathrm{in}}\) and the total degree from one module to the other be \(K_{\mathrm{out}}\). Since we have \(K = cN = 2(K_{\mathrm{in}} + K_{\mathrm{out}})\) and \(K_{\mathrm{out}} =\gamma N\), Eq. (12.15) reads

$$\displaystyle\begin{array}{rcl} K_{\mathrm{in}} - K_{\mathrm{out}}& =& \frac{N} {2} \frac{c} {\sqrt{c - 1}}.{}\end{array}$$
(12.17)

In addition, in the limit \(N \rightarrow \infty\), we have

$$\displaystyle\begin{array}{rcl} K_{\mathrm{in}}& =& \frac{N^{2}} {4} p_{\mathrm{in}} = \frac{N} {4} c_{\mathrm{in}},{}\end{array}$$
(12.18)
$$\displaystyle\begin{array}{rcl} K_{\mathrm{out}}& =& \frac{N^{2}} {4} p_{\mathrm{out}} = \frac{N} {4} c_{\mathrm{out}}.{}\end{array}$$
(12.19)

Therefore, (12.17) can be recast as

$$\displaystyle\begin{array}{rcl} c_{\mathrm{in}} - c_{\mathrm{out}}& = 2 \frac{c} {\sqrt{c-1}}.&{}\end{array}$$
(12.20)

This condition converges to the ultimate detectability threshold (12.1) in the dense limit \(c \rightarrow \infty\). There exists, however, a considerable gap between (12.1) and (12.20) when the degree c is small; considering the fact that the upper bound of \(c_{\mathrm{in}} - c_{\mathrm{out}}\) is 2c, this gap is not negligible at all. Thus, the implication of our results is that we cannot expect the spectral method to detect modules all the way down to the ultimate detectability threshold, even in regular random graphs, where the localization of the eigenvectors is absent.
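The size of this gap is easy to tabulate. The short sketch below (our own illustration) compares the right-hand sides of (12.1) and (12.20) for several degrees:

```python
import numpy as np

# Right-hand sides of the two thresholds for c_in - c_out:
#   (12.1)  Bayesian (ultimate) threshold: 2 * sqrt(c)
#   (12.20) spectral (Laplacian) threshold: 2 * c / sqrt(c - 1)
for c in [3, 4, 8, 16, 64]:
    bayes = 2 * np.sqrt(c)
    spectral = 2 * c / np.sqrt(c - 1)
    print(f"c = {c:3d}: Bayes {bayes:6.3f}, spectral {spectral:6.3f}, "
          f"ratio {spectral / bayes:.3f}")
```

The ratio of the two thresholds is \(\sqrt{c/(c-1)}\): about 22% at c = 3, and it approaches one only as the degree grows.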

4 Conclusion

In summary, we derived an estimate of the detectability threshold (12.20) of the spectral method for two-block regular random graphs. The threshold we obtained agrees excellently with the results of the numerical experiment and is expected to be asymptotically exact in the limit \(N \rightarrow \infty\). Our results indicate that the spectral method cannot detect modules all the way down to the ultimate detectability threshold (12.1), even when the degree is fixed to a constant. Since the threshold (12.20) converges to (12.1) as the degree c increases, this gap becomes negligible when the degree is sufficiently large; this supports the results obtained by Nadakuditi and Newman [20].

A method for achieving the ultimate detectability threshold with a spectral method has already been proposed by Krzakala et al. [25]. They proposed using a matrix called the non-backtracking matrix, which prevents the elements of the eigenvectors from localizing on a few vertices. A question about this formalism is: to what extent is the gap in detectability actually closed by the non-backtracking matrix as compared with the Laplacians? Our estimate gives a clue to the answer to this question. In order to gain further insight, we need to analyze the case of graphs with degree fluctuations. In that case, the methods using the unnormalized Laplacian and the normalized Laplacian will no longer be equivalent. Moreover, it is important to verify the effect of eigenvector localization on detectability. These problems remain as future work.