Abstract
The problem of finding large average submatrices of a real-valued matrix arises in the exploratory analysis of data from a variety of disciplines, ranging from genomics to social sciences. In this paper we provide a detailed asymptotic analysis of large average submatrices of an \(n \times n\) Gaussian random matrix. The first part of the paper addresses global maxima. For fixed k we identify the average and the joint distribution of the \(k \times k\) submatrix having largest average value. As a dual result, we establish that the size of the largest square submatrix with average greater than a fixed positive constant is, with high probability, equal to one of two consecutive integers that depend on the threshold and the matrix dimension n. The second part of the paper addresses local maxima. Specifically, we consider submatrices with dominant row and column sums that arise as the local optima of iterative search procedures for large average submatrices. For fixed k, we identify the limiting average value and joint distribution of a \(k \times k\) submatrix conditioned to be a local maximum. In order to understand the density of such local optima and explain the quick convergence of such iterative procedures, we analyze the number \(L_n(k)\) of local maxima, beginning with exact asymptotic expressions for the mean and fluctuation behavior of \(L_n(k)\). For fixed k, the mean of \(L_{n}(k)\) is \(\Theta (n^{k}/(\log {n})^{(k-1)/2})\) while the standard deviation is \(\Theta (n^{2k^2/(k+1)}/(\log {n})^{k^2/(k+1)})\). Our principal result is a Gaussian central limit theorem for \(L_n(k)\) that is based on a new variant of Stein’s method.
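The notion of a local maximum described above can be illustrated with a minimal sketch. It assumes, since the abstract does not spell out the procedure, an alternating search that repeatedly picks the k rows with the largest sums over the current columns and then the k columns with the largest sums over the current rows; a fixed point of this update is a \(k \times k\) submatrix whose rows and columns are both dominant. All function names (`top_k`, `iterative_search`, `is_local_max`) are hypothetical and chosen only for illustration.

```python
import random


def top_k(scores, k):
    """Return the indices of the k largest scores, in increasing index order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def row_sums(matrix, cols):
    """Sum of each row restricted to the given columns."""
    return [sum(row[j] for j in cols) for row in matrix]


def col_sums(matrix, rows):
    """Sum of each column restricted to the given rows."""
    return [sum(matrix[i][j] for i in rows) for j in range(len(matrix[0]))]


def is_local_max(matrix, rows, cols):
    """Check that the rows and the columns are both dominant for each other."""
    k = len(rows)
    return (top_k(row_sums(matrix, cols), k) == sorted(rows)
            and top_k(col_sums(matrix, rows), k) == sorted(cols))


def iterative_search(matrix, k, init_cols, max_iter=1000):
    """Alternate row and column updates until the submatrix stops changing.

    Assumed procedure (not taken from the abstract): start from init_cols,
    choose the best k rows, then the best k columns, and repeat.
    """
    cols = sorted(init_cols)
    rows = top_k(row_sums(matrix, cols), k)
    for _ in range(max_iter):
        new_cols = top_k(col_sums(matrix, rows), k)
        new_rows = top_k(row_sums(matrix, new_cols), k)
        if new_rows == rows and new_cols == cols:
            break  # fixed point: rows and columns are mutually dominant
        rows, cols = new_rows, new_cols
    return rows, cols


if __name__ == "__main__":
    random.seed(1)
    n, k = 50, 4
    # n x n matrix of independent standard Gaussian entries
    w = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    rows, cols = iterative_search(w, k, init_cols=random.sample(range(n), k))
    avg = sum(w[i][j] for i in rows for j in cols) / (k * k)
    print("rows:", rows, "cols:", cols, "average:", avg)
    print("local maximum:", is_local_max(w, rows, cols))
```

Each update cannot decrease the submatrix sum, and there are finitely many \(k \times k\) submatrices, so with continuous entries (no ties) the search stops at a local maximum; different starting columns generally lead to different local maxima, which is what the count \(L_n(k)\) tracks.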
Acknowledgements
PD is grateful for the hospitality of the Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, where much of the research was done. SB and ABN were partially supported by NSF Grants DMS-1310002 and DMS-1613072. SB was partially supported by NSF Grants DMS-1105581 and DMS-1606839, SES grant 1357622, and ARO grant W911NF-17-1-0010. PD was supported by a Simons Postdoctoral Fellowship at New York University. AN was partially supported by NSF grant DMS-0907177. We thank the referee for going through the many technical arguments closely and pointing out a number of issues whose rectification significantly improved the presentation and readability of the paper.
Cite this article
Bhamidi, S., Dey, P.S. & Nobel, A.B. Energy landscape for large average submatrix detection problems in Gaussian random matrices. Probab. Theory Relat. Fields 168, 919–983 (2017). https://doi.org/10.1007/s00440-017-0766-0
DOI: https://doi.org/10.1007/s00440-017-0766-0
Keywords
- Energy landscape
- Extreme value theory
- Central limit theorem
- Stein’s method
Mathematics Subject Classification
- Primary 62G32
- 60F05
- 60G70