Abstract
Consider an Erdős–Rényi random graph in which each edge is present independently with probability \(1/2\), except for a subset \(\mathsf{C}_N\) of the vertices that form a clique (a completely connected subgraph). We consider the problem of identifying the clique, given a realization of such a random graph. The algorithm of Dekel et al. (ANALCO. SIAM, pp 67–75, 2011) provably identifies the clique \(\mathsf{C}_N\) in linear time, provided \(|\mathsf{C}_N|\ge 1.261\sqrt{N}\). Spectral methods can be shown to fail on cliques smaller than \(\sqrt{N}\). In this paper we describe a nearly linear-time algorithm that succeeds with high probability for \(|\mathsf{C}_N|\ge (1+{\varepsilon })\sqrt{N/e}\) for any \({\varepsilon }>0\). This is the first algorithm that provably improves over spectral methods. We further generalize the hidden clique problem to other background graphs (the standard case corresponding to the complete graph on \(N\) vertices). For large-girth regular graphs of degree \((\varDelta +1)\) we prove that so-called local algorithms succeed if \(|\mathsf{C}_N|\ge (1+{\varepsilon })N/\sqrt{e\varDelta }\) and fail if \(|\mathsf{C}_N|\le (1-{\varepsilon })N/\sqrt{e\varDelta }\).
Notes
The problem is somewhat more subtle because \(|\mathsf{C}_N|\ll N\); see next section.
If \(Q_1\) is singular with respect to \(Q_0\), the problem is simpler but requires a bit more care.
References
Louigi Addario-Berry, Nicolas Broutin, Luc Devroye, and Gábor Lugosi. On combinatorial testing problems. The Annals of Statistics, 38(5):3063–3092, 2010.
Noga Alon, Michael Krivelevich, and Benny Sudakov. Finding a large hidden clique in a random graph. In Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms, pages 594–598. Society for Industrial and Applied Mathematics, 1998.
Noga Alon, Michael Krivelevich, and Van H Vu. On the concentration of eigenvalues of random symmetric matrices. Israel Journal of Mathematics, 131(1):259–267, 2002.
Brendan PW Ames and Stephen A Vavasis. Nuclear norm minimization for the planted clique and biclique problems. Mathematical programming, 129(1):69–89, 2011.
Dana Angluin. Local and global properties in networks of processors. In Proceedings of the twelfth annual ACM symposium on Theory of computing, pages 82–93. ACM, 1980.
Ery Arias-Castro, Emmanuel J Candès, and Arnaud Durand. Detection of an anomalous cluster in a network. The Annals of Statistics, 39(1):278–304, 2011.
Ery Arias-Castro, David L Donoho, and Xiaoming Huo. Near-optimal detection of geometric objects by fast multiscale methods. IEEE Transactions on Information Theory, 51(7):2402–2425, 2005.
M. Bayati and A. Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. on Inform. Theory, 57:764–785, 2011.
Mohsen Bayati, Marc Lelarge, and Andrea Montanari. Universality in polytope phase transitions and message passing algorithms. arXiv preprint arXiv:1207.7321, 2012.
Quentin Berthet and Philippe Rigollet. Computational lower bounds for sparse PCA. arXiv preprint arXiv:1304.0828, 2013.
Shankar Bhamidi, Partha S Dey, and Andrew B Nobel. Energy landscape for large average submatrix detection problems in Gaussian random matrices. arXiv preprint arXiv:1211.2284, 2012.
Patrick Billingsley. Probability and measure. John Wiley & Sons, 2008.
Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.
Alexandre d’Aspremont, Francis Bach, and Laurent El Ghaoui. Optimal solutions for sparse principal component analysis. The Journal of Machine Learning Research, 9:1269–1294, 2008.
Alexandre d’Aspremont, Laurent El Ghaoui, Michael I Jordan, and Gert RG Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.
Chandler Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.
Yael Dekel, Ori Gurel-Gurevich, and Yuval Peres. Finding hidden cliques in linear time with high probability. In ANALCO, pages 67–75. SIAM, 2011.
Amir Dembo. Probability Theory. http://www.stanford.edu/~montanar/TEACHING/Stat310A/lnotes.pdf, 2013.
David L Donoho, Arian Maleki, and Andrea Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919, 2009.
Uriel Feige and Dorit Ron. Finding hidden cliques in linear time. DMTCS Proceedings, (01):189–204, 2010.
Vitaly Feldman, Elena Grigorescu, Lev Reyzin, Santosh Vempala, and Ying Xiao. Statistical algorithms and a lower bound for planted clique. arXiv preprint arXiv:1201.1214, 2012.
Zoltán Füredi and János Komlós. The eigenvalues of random symmetric matrices. Combinatorica, 1(3):233–241, 1981.
Geoffrey R Grimmett and Colin JH McDiarmid. On colouring random graphs. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 77, pages 313–324. Cambridge Univ Press, 1975.
Dongning Guo and Chih-Chun Wang. Asymptotic mean-square optimality of belief propagation for sparse linear systems. In IEEE Information Theory Workshop (ITW'06 Chengdu), pages 194–198. IEEE, 2006.
Mark Jerrum. Large cliques elude the metropolis process. Random Structures & Algorithms, 3(4):347–359, 1992.
Iain M Johnstone and Arthur Yu Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486), 2009.
Antti Knowles and Jun Yin. The isotropic semicircle law and deformation of Wigner matrices. arXiv preprint arXiv:1110.6449, 2011.
Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
Nathan Linial. Locality in distributed graph algorithms. SIAM Journal on Computing, 21(1):193–201, 1992.
Marc Mezard and Andrea Montanari. Information, physics, and computation. Oxford University Press, 2009.
Andrea Montanari. Graphical Models Concepts in Compressed Sensing. In Y.C. Eldar and G. Kutyniok, editors, Compressed Sensing: Theory and Applications. Cambridge University Press, 2012.
Andrea Montanari and David Tse. Analysis of belief propagation for non-linear problems: The example of CDMA (or: How to prove Tanaka's formula). In IEEE Information Theory Workshop (ITW'06 Punta del Este), pages 160–164. IEEE, 2006.
Moni Naor and Larry Stockmeyer. What can be computed locally? SIAM Journal on Computing, 24(6):1259–1277, 1995.
Sundeep Rangan and Alyson K Fletcher. Iterative estimation of constrained rank-one matrices in noise. In 2012 IEEE International Symposium on Information Theory (ISIT), pages 1246–1250. IEEE, 2012.
Tom Richardson and Rüdiger Leo Urbanke. Modern coding theory. Cambridge University Press, 2008.
Andrey A Shabalin, Victor J Weigman, Charles M Perou, and Andrew B Nobel. Finding large average submatrices in high dimensional data. The Annals of Applied Statistics, pages 985–1012, 2009.
Xing Sun and Andrew B Nobel. On the size and recovery of submatrices of ones in a random binary matrix. J. Mach. Learn. Res, 9:2431–2453, 2008.
Jukka Suomela. Survey of local algorithms. ACM Computing Surveys (CSUR), 45(2):24, 2013.
Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y.C. Eldar and G. Kutyniok, editors, Compressed Sensing: Theory and Applications, pages 210–268. Cambridge University Press, 2012.
Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of computational and graphical statistics, 15(2):265–286, 2006.
Acknowledgments
This work was partially supported by the NSF CAREER award CCF-0743978, NSF Grant DMS-0806211, and Grants AFOSR/DARPA FA9550-12-1-0411 and FA9550-13-1-0036.
Additional information
Communicated by Andrew Odlyzko.
Appendices
Appendix 1: Some Tools in Probability Theory
This appendix contains some useful facts in probability theory.
Lemma 7.1
Let \(h:\mathbb {R}\rightarrow \mathbb {R}\) be a bounded function with the first three derivatives uniformly bounded. Let \(X_{n, k}\) be mutually independent random variables for \(1\le k\le n\) with zero mean and variance \(v_{n, k}\). Define
Also, let \({\mathcal G}_n = \mathsf{N}(0, v_n)\). Then, for every \(n\) and \({\varepsilon }>0\)
Proof
The lemma is proved using a standard swapping trick. The proof can be found in Amir Dembo’s lecture notes [18]. \(\square \)
Lemma 7.2
Given a random variable \(X\) such that \({\mathbb E}(X) = \mu \), suppose \(X\) satisfies
for all \(\lambda >0\) and some constant \(\rho >0\). Then we have for all \(s > 0\)
Further, if \(\mu = 0\), then we have for \(0 < t < 1/(e\rho )\)
Proof
By an application of the Markov inequality and the given condition on \(X\),
for all \(\lambda > 0\). By a symmetric argument,
By the standard integration formula, we have
Optimizing over \(\lambda \) yields the desired result.
If \(\mu = 0\), the optimization yields \(\lambda = \sqrt{s/\rho }\). Using this, the Taylor expansion of \(g(x) = e^{x^2}\), and monotone convergence we obtain
Notice that here we remove the factor of \(2\) in the inequality since this is not required for even moments of \(X\). \(\square \)
The following lemma is standard; see, for instance, [3, 39].
Lemma 7.3
Let \(M \in \mathbb {R}^{N\times N}\) be a symmetric matrix with entries \(M_{ij}\) (for \(i\ge j\)) that are centered subgaussian random variables of scale factor \(\rho \). Then, uniformly in \(N\),
where \(\lambda = t^2/(16N\rho e)\) and \(||{M}||_2\) denotes the spectral norm (equivalently, the largest singular value) of \(M\).
Proof
Divide \(M\) into its upper and lower triangular portions \(M^u\) and \(M^l\) so that \(M = M^u+M^l\). We deal with each separately. Let \(m_i\) denote the \(i{\text {th}}\) row of \(M^l\). For a unit vector \(x\), since the entries \(M_{ij}\) are independent and subgaussian with scale \(\rho \), the inner products \(\langle m_i, x\rangle \) are also subgaussian with the same scale. We now bound the square exponential moment of \(||{M^lx}||\) as follows. For small enough \(c\ge 0\),
Using this, we obtain for any unit vector \(x\)
where we used the Markov inequality and Eq. (7.1) with an appropriate \(c\). Let \(\Upsilon \) be a maximal \(1/2\)-net of the unit sphere. From a volume-packing argument we have that \(|\Upsilon | \le 5^N\). Then from the fact that \(g(x) = M^lx\) is \(||{M^l}||\)-Lipschitz in \(x\) we obtain
The same inequality holds for \(M^u\). Now, using the fact that \(||{\cdot }||_2\) is a convex function and that \(M^u\) and \(M^l\) are independent we obtain
Substituting for \(\lambda \) yields the result. \(\square \)
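The \(\sqrt{N\rho }\) scaling in Lemma 7.3 is easy to check numerically. The sketch below (an illustration of ours, not part of the proof; matrix size and seed are arbitrary) samples a symmetric matrix with i.i.d. centered Gaussian entries of variance \(\rho = 1\), which are subgaussian with scale proportional to \(\rho \), and confirms that \(||{M}||_2/\sqrt{N}\) concentrates near the semicircle-law edge \(2\sqrt{\rho }\).

```python
import numpy as np

rng = np.random.default_rng(0)
N, rho = 1000, 1.0

# Symmetric matrix with centered Gaussian entries of variance rho.
G = rng.normal(0.0, np.sqrt(rho), size=(N, N))
M = np.triu(G) + np.triu(G, 1).T   # symmetrize: keep upper triangle, mirror it

# Spectral norm = largest singular value = max |eigenvalue| for symmetric M.
spectral_norm = np.max(np.abs(np.linalg.eigvalsh(M)))
print(spectral_norm / np.sqrt(N))  # close to 2*sqrt(rho) for large N
```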
Appendix 2: Additional Proofs
In this section we provide, for the sake of completeness, some additional proofs that are known results. We begin with Proposition 1.1.
1.1 Proof of Proposition 1.1
We assume that the set \(\mathsf{C}_N\) is generated as follows: let \(X_i\in \{0, 1\}\) be the label of index \(i\in [N]\). The \(X_i\) are i.i.d. Bernoulli with parameter \(\kappa /\sqrt{N}\), and \(\mathsf{C}_N = \{i : X_i = 1\}\). The model in which \(\mathsf{C}_N\) is chosen uniformly at random with size \(\kappa \sqrt{N}\) is similar, and asymptotically in \(N\) there is no difference between the two. Notice that since \(e_{\mathsf{C}_N} = u_{\mathsf{C}_N}/N^{1/4}\), we have that \(\Vert e_{\mathsf{C}_N}\Vert ^2\) concentrates sharply around \(\kappa \), and we are interested in the regime \(\kappa =\varTheta (1)\).
We begin with the first part of the proposition, where \(\kappa = 1+{\varepsilon }\). Let \(W_N = W/\sqrt{N},\,Z_N = Z/\sqrt{N}\), and \(e_{\mathsf{C}_N} = u_{\mathsf{C}_N}/N^{1/4}\). Since this normalization does not make a difference to the eigenvectors of \(W\) and \(Z\), we obtain from the eigenvalue equation \(W_N v_1 = \lambda _1 v_1\) that
Multiplying by \(v_1\) on either side yields
The fact that \(Z_N = Z/\sqrt{N}\) is a standard Wigner matrix with subgaussian entries [3] yields that \(||{Z_N}||_2 \le 2 + \delta \), with probability at least \(1 - C_1e^{-c_1N}\) for some constants \(C_1(\delta ), c_1(\delta )>0\). Further, by Theorem 2.7 of [27], we have that \(\lambda _1 \ge 2 + \min ({\varepsilon },{\varepsilon }^2)\), with probability at least \(1 - N^{-c_2\log \log N}\) for some constant \(c_2\) and every \(N\) sufficiently large. It follows from this and the union bound that for \(N\) large enough, we have
with a probability of at least \(1 - N^{-c_4}\) for some constant \(c_4>0\). The first claim then follows.
For the second claim, we start with the same eigenvalue Eq. (8.1). Let \(\varphi _1\) be the eigenvector corresponding to the largest eigenvalue of \(Z_N\). Multiplying Eq. (8.1) by \(\varphi _1\) on either side we obtain
where \(\theta _1\) is the eigenvalue of \(Z_N\) corresponding to \(\varphi _1\). With this and Cauchy–Schwarz we obtain
Let \(\phi = (\log N)^{\log \log N}\). Then, using Theorem 2.7 of [27], for any \(\delta > 0\), there exists a constant \(C_1\) such that \(|\lambda _1 - \theta _1| \le N^{-1 + \delta }\), with a probability of at least \(1 - N^{-c_3\log \log N}\).
Since \(\varphi _1\) is independent of \(e_{\mathsf{C}_N}\), we observe that
where \(\varphi _1^i\) (\(e_{\mathsf{C}_N}^i\)) denotes the \(i{\mathrm{th}}\) entry of \(\varphi _1\) (resp. \(e_{\mathsf{C}_N}\)) and \({\mathbb E}_e(\cdot )\) is the expectation with respect to \(e_{\mathsf{C}_N}\) holding \(Z_N\) (hence \(\varphi _1\)) constant. Using Theorem 2.5 of [27], it follows that there exist constants \(c_4, c_5, c_6, c_7\) such that the following two events happen with a probability of at least \(1- N^{-c_4\log \log N}\). First, the first expectation mentioned previously is at most \((1-{\varepsilon })\phi ^{c_5}N^{-7/4}\). Second,
Now, using the Berry–Esseen central limit theorem for \(\langle \varphi _1, e_{\mathsf{C}_N}\rangle \):
for an appropriate constant \(c = c({\varepsilon })\) and \(\delta \in (0,1/4)\). Using this and the earlier bound for \(|\lambda _1-\theta _1|\) we obtain that
with a probability of at least \(1 - c'N^{-\delta }\), for some \(c'\) and sufficiently large \(N\). The claim then follows using the union bound and the same argument for the first \(\ell \) eigenvectors.
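The eigenvalue separation used in the two parts of the proof can be illustrated numerically. The following sketch (ours, with arbitrary parameters) forms \(W_N = Z_N + e_{\mathsf{C}_N}e_{\mathsf{C}_N}^{\mathsf{T}}\) with \(\Vert e_{\mathsf{C}_N}\Vert ^2\approx \kappa \), using the uniform set model for \(\mathsf{C}_N\), and shows the BBP-type behavior behind Proposition 1.1: for \(\kappa > 1\) the top eigenvalue escapes the bulk edge at \(2\), while for \(\kappa < 1\) it sticks to the edge.

```python
import numpy as np

def top_eigenvalue(N, kappa, seed):
    """Largest eigenvalue of W_N = Z_N + e e^T with ||e||^2 ~= kappa."""
    rng = np.random.default_rng(seed)
    G = rng.normal(size=(N, N)) / np.sqrt(N)
    Z_N = np.triu(G, 1) + np.triu(G, 1).T    # Wigner matrix, zero diagonal
    # Uniform set model: |C_N| = round(kappa * sqrt(N)) vertices, and
    # e = u_C / N^{1/4}, so that ||e||^2 = |C_N| / sqrt(N) ~= kappa.
    k = int(round(kappa * np.sqrt(N)))
    e = np.zeros(N)
    e[rng.choice(N, size=k, replace=False)] = N ** (-0.25)
    W_N = Z_N + np.outer(e, e)
    return np.linalg.eigvalsh(W_N)[-1]       # eigvalsh sorts ascending

lam_above = top_eigenvalue(N=1500, kappa=2.0, seed=1)  # expect ~ kappa + 1/kappa
lam_below = top_eigenvalue(N=1500, kappa=0.5, seed=1)  # expect ~ 2 (bulk edge)
print(lam_above, lam_below)
```

Above the threshold the top eigenvalue lands near \(\kappa + 1/\kappa \); below it, the spectrum is indistinguishable from pure noise at the edge, which is the failure mode of spectral methods for small hidden sets.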
1.2 Proof of Proposition 4.2
For any fixed \(t\), let \({\mathcal {E}}_N^t\) denote the set of vertices in \(G_N\) such that their \(t\)-neighborhoods are not a tree, i.e.,
For notational simplicity, we will omit the subscript \(G_N\) in the neighborhood of \(i\). The relative size \({\varepsilon }^t_N = |{\mathcal {E}}^t_N|/N\) vanishes asymptotically in \(N\) since the sequence \(\{G_N\}_{N\ge 1}\) is locally treelike. We let \(\mathsf{F}_{BP}(W_{\mathsf{Ball}(i;t)})\) denote the decision according to belief propagation at the \(i{\mathrm{th}}\) vertex.
From Proposition 4.1, Eqs. (4.1), (4.2), (4.5), and (5.1), and induction we observe that for any \(i\in [N]\backslash {\mathcal {E}}^t_N\)
We also have that
Using both of these identities, the fact that \({\varepsilon }^t_N\rightarrow 0\), and the linearity of expectation, we have the first claim:
For any other decision rule \(\mathsf{F}(W_{\mathsf{Ball}(i;t)})\), we have that
since BP computes the correct posterior marginal on the root of the tree \(\tilde{\mathsf{T}}\mathsf{ree}(t)\) and maximizing the posterior marginal minimizes the misclassification error. The second claim follows by taking the limits.
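The exactness of BP on trees invoked here can be checked on a toy instance. The sketch below (a generic pairwise model on a four-vertex tree with arbitrary random potentials, not the specific posterior of Sect. 4) computes the root marginal by leaf-to-root message passing and verifies it against brute-force enumeration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
# A small tree: root 0 with children 1 and 2; node 1 has child 3.
edges = [(0, 1), (0, 2), (1, 3)]
children = {0: [1, 2], 1: [3], 2: [], 3: []}
phi = {i: rng.uniform(0.5, 1.5, size=2) for i in range(4)}    # node potentials
psi = {e: rng.uniform(0.5, 1.5, size=(2, 2)) for e in edges}  # edge potentials

def message(j, i):
    """BP message from child j to parent i, summing out x_j."""
    inc = np.ones(2)
    for k in children[j]:
        inc *= message(k, j)
    return psi[(i, j)] @ (phi[j] * inc)   # psi[(i, j)][x_i, x_j]

belief = phi[0].copy()
for c in children[0]:
    belief *= message(c, 0)
bp_marginal = belief / belief.sum()

# Brute-force root marginal over all 2^4 configurations.
weights = np.zeros(2)
for x in itertools.product([0, 1], repeat=4):
    w = np.prod([phi[i][x[i]] for i in range(4)])
    w *= np.prod([psi[(i, j)][x[i], x[j]] for (i, j) in edges])
    weights[x[0]] += w
print(bp_marginal, weights / weights.sum())  # agree exactly on a tree
```

On a tree the two computations agree up to floating-point error; thresholding the BP marginal then gives the Bayes-optimal (minimum misclassification) decision at the root, which is the optimality property used in the proof.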
1.3 Equivalence of i.i.d. and Uniform Set Model
In Sect. 2 the hidden set \(\mathsf{C}_N\) was assumed to be uniformly random given its size. However, in Sect. 4 we considered a slightly different model to choose \(\mathsf{C}_N\), wherein \(X_i\) are i.i.d. Bernoulli random variables with parameter \(\tilde{\kappa }/\sqrt{\varDelta }\). This leads to a set, \(\mathsf{C}_N = \{i: X_i = 1\}\), that has a random size, sharply concentrated around \(N\tilde{\kappa }/\sqrt{\varDelta }\). The uniform set model can be obtained from the i.i.d. model by simply conditioning on the size \(|\mathsf{C}_N|\). In the limit of large \(N\) it is well known that these two models are “equivalent.” However, for completeness, we provide a proof that the results of Proposition 4.2 do not change when conditioned on the size \(|\mathsf{C}_N| = \sum _{i = 1}^{N}X_i\):
Let \(S\) be the event \(\{\sum _{i=1}^N X_i=N\tilde{\kappa }/\sqrt{\varDelta }\}\). Notice that \(\mathsf{F}(W_{\mathsf{Ball}(i;t)})\) is a function of \(\{X_j, j\in \mathsf{Ball}(i;t)\}\), which is a discrete vector of dimension \(K_t \le (\varDelta +1)^{t+1}\). A straightforward direct calculation yields that \((X_j, j\in \mathsf{Ball}(i;t))|S \) converges in distribution to \((X_j, j\in \mathsf{Ball}(i;t))\) asymptotically in \(N\). This implies that \(W_{\mathsf{Ball}(i;t)} |S\) converges in distribution to \(W_{\mathsf{Ball}(i;t)}\). Further, using the locally treelike property of \(G_N\) one obtains
as required.
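The concentration underlying this equivalence is elementary binomial concentration. The quick Monte Carlo check below (illustrative only; the values \(\tilde{\kappa } = 0.5\) and \(\varDelta = 4\) are arbitrary choices of ours) shows the relative fluctuation of \(|\mathsf{C}_N|\) around \(N\tilde{\kappa }/\sqrt{\varDelta }\) shrinking like \(N^{-1/2}\).

```python
import numpy as np

rng = np.random.default_rng(0)
kappa, Delta, trials = 0.5, 4, 400
p = kappa / np.sqrt(Delta)       # each X_i is Bernoulli(p), so |C_N| ~ Bin(N, p)

rel = {}
for N in (10**4, 10**6):
    sizes = rng.binomial(N, p, size=trials)   # samples of |C_N|
    rel[N] = np.std(sizes) / (N * p)          # relative fluctuation around mean
    print(N, rel[N])   # shrinks like 1/sqrt(N)
```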
1.4 Importance of Message-Passing Modification in Algorithm 1
In this section we provide a simple counterexample that demonstrates the importance of the message-passing or so-called cavity modification we employ, i.e., to remove the contribution of the incoming message \(\theta ^t_{j\rightarrow i}\) in the computation of \(\theta ^{t+1}_{i\rightarrow j}\), cf. Eq. (2.2). This modification is crucial for Lemma 2.2 to hold. Indeed, our counterexample will demonstrate that state evolution does not hold without this modification.
For the sake of simplicity, we consider a pure noise data matrix \(W\): \(W_{ij} \sim \mathsf{N}(0, 1)\) i.i.d. for \(i<j\), \(W_{ii} = 0\), and \(W=W^{\mathsf{T}}\). In our notation, we have \(\lambda =0, \kappa =0\). In other words, our observations contain no signal. Further, we consider the initial conditions \(\theta ^0_i = \theta ^0_{i\rightarrow j} = 1\) for all \(i, j\) distinct and the simple function \(f(z; t) = z^3\) for each iteration \(t\). We stress that we make these choices to simplify calculations as much as possible: it should be clear from our subsequent argument that the same phenomenon takes place generically.
State evolution reads, for \(t\ge 0\),
where \(Z\sim \mathsf{N}(0, 1)\). The initial condition is \(\mu _0 = 1, \tau _0 = 0\). Lemma 2.2 establishes that, for \(\theta ^t_{i}\) given by the message-passing iteration (2.2), (2.3), we have, for any bounded Lipschitz function \(\psi \),
in probability. In other words, \(\theta ^t_i\) is approximately Gaussian with mean zero and variance \(\tau _t^2\).
In particular, taking \(\psi _M(x) = x\) for \(|x|\le M\), \(\psi _M(x) = M\) for \(x>M\), and \(\psi _M(x) = -M\) for \(x<-M\), and using dominated convergence we obtain
Consider now the iteration without cavity modification (denoted subsequently by \(\vartheta ^t_i\)):
Without loss of generality, we will focus on coordinate 1 of the iterate \(\vartheta ^t\):
We can explicitly compute the expectation of \(\vartheta ^{2}_1\) as follows. Since \(A_{1\ell }\sim \mathsf{N}(0, 1/N)\), an application of Stein’s lemma gives
In particular, in the limit of large \(N\),
This appears to contradict the state evolution prediction, which suggests that \(\vartheta ^t_i\) is approximately \(\mathsf{N}(0,\tau _t^2)\) and, hence, that \({\mathbb E}\vartheta ^t_i\rightarrow 0\), as would follow by formally setting \(M=\infty \) in Eq. (8.3). The reader will notice that a more careful argument is needed to reach a contradiction, since Lemma 2.2 only applies to bounded Lipschitz functions and we therefore cannot set \(M=\infty \) in Eq. (8.3). We proceed to present the required additional steps.
We first show that
for a constant \(C<\infty \). Note that
where the sum is over labeled trees \(T_1, T_2\) with two generations and four vertices: a root labeled \(1\) with a single child, which in turn has three children. Let \(\mathbf{G}\) denote the graph formed from the pair \(T_1, T_2\) by identifying vertices of the same label. Then \({\mathbb E}\{A(T_1)A(T_2)\}\) is nonzero only if every edge in \(\mathbf{G}\) is covered at least twice. We prove that the total contribution of such terms is bounded. Let \(e(\mathbf{G})\) indicate the number of edges (with multiplicity) in \(\mathbf{G}\) and \(n(\mathbf{G})\) its number of vertices. Since \(\mathbf{G}\) is connected through the root and every edge has a multiplicity of at least two, we have that \(n(\mathbf{G}) - 1 \le e(\mathbf{G})/2\). Using Lemma 7.2 we have \({\mathbb E}\{A(T_1)A(T_2)\} \le {\mathbb E}\{|A(T_1)A(T_2)|\} = O(N^{-e(\mathbf{G})/2})\). Further, the number of contributing terms is \(O(N^{n(\mathbf{G})-1})\). It follows that the total nonzero contribution is bounded by a constant, and hence \(\lim _{N\rightarrow \infty }{\mathbb E}\{(\vartheta ^{2}_1)^2\} \le C\) for an appropriate constant \(C\).
Note that \(|\psi _M(x)-x| =(|x|-M)_+\le x^2/M\). We then have
Hence,
By choosing \(M\ge C\) we conclude that the state evolution prediction does not hold for the naive iteration (8.4).
It is a useful exercise to repeat the preceding calculation for the message-passing sequence \(\theta ^t_i\) (with the cavity modification). The final result confirms the state evolution prediction as in Lemma 2.2. For higher iterations, the effect of the cavity modification is analogous but somewhat more subtle, as indicated by our proof via the moment method.
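The counterexample is easy to reproduce numerically. The sketch below (our illustration, with \(W\) Gaussian as above, \(f(z)=z^3\), and all-ones initialization) runs two steps of both versions and compares the empirical average of the second iterate across coordinates: without the cavity modification it concentrates near the nonzero Stein's-lemma limit computed above, while with the modification it stays near \(0\), as state evolution predicts.

```python
import numpy as np

def second_iterates(N, seed):
    """Two steps of the iteration with f(z) = z^3, theta^0 = all-ones."""
    rng = np.random.default_rng(seed)
    G = rng.normal(size=(N, N))
    W = np.triu(G, 1) + np.triu(G, 1).T   # symmetric, N(0,1) entries, zero diag
    A = W / np.sqrt(N)

    # Naive iteration (no cavity modification): vartheta^{t+1} = A f(vartheta^t).
    v1 = A @ np.ones(N)                   # vartheta^1
    v2 = A @ v1**3                        # vartheta^2

    # Message passing: theta^1_{l->i} = S_l - A_{li} (drop the message from i),
    # then theta^2_i = sum_l A_{il} (theta^1_{l->i})^3.
    S = A.sum(axis=1)
    T = S[:, None] - A                    # T[l, i] = theta^1_{l->i}
    t2 = (A * (T**3).T).sum(axis=1)       # theta^2
    return v2.mean(), t2.mean()

naive, cavity = zip(*(second_iterates(1000, s) for s in range(10)))
print(np.mean(naive), np.mean(cavity))   # clearly nonzero vs. near zero
```

In these runs the naive average lands near \(3\), matching a direct Stein's-lemma computation for this choice of \(f\) and initialization; the term \(A_{li}\) subtracted in `T` is exactly the cavity modification that removes this bias.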
Cite this article
Deshpande, Y., Montanari, A. Finding Hidden Cliques of Size \(\sqrt{N/e}\) in Nearly Linear Time. Found Comput Math 15, 1069–1128 (2015). https://doi.org/10.1007/s10208-014-9215-y
Keywords
- Random graphs
- Average case complexity
- Approximate message passing
- Belief propagation
- Local algorithms
- Sparse recovery