Energy landscape for large average submatrix detection problems in Gaussian random matrices

Abstract

The problem of finding large average submatrices of a real-valued matrix arises in the exploratory analysis of data from a variety of disciplines, ranging from genomics to social sciences. In this paper we provide a detailed asymptotic analysis of large average submatrices of an \(n \times n\) Gaussian random matrix. The first part of the paper addresses global maxima. For fixed \(k\), we identify the average and the joint distribution of the \(k \times k\) submatrix having the largest average value. As a dual result, we establish that the size of the largest square submatrix with average bigger than a fixed positive constant is, with high probability, equal to one of two consecutive integers that depend on the threshold and the matrix dimension \(n\). The second part of the paper addresses local maxima. Specifically, we consider submatrices with dominant row and column sums, which arise as the local optima of iterative search procedures for large average submatrices. For fixed \(k\), we identify the limiting average value and joint distribution of a \(k \times k\) submatrix conditioned to be a local maximum. In order to understand the density of such local optima and explain the quick convergence of such iterative procedures, we analyze the number \(L_n(k)\) of local maxima, beginning with exact asymptotic expressions for the mean and fluctuation behavior of \(L_n(k)\). For fixed \(k\), the mean of \(L_n(k)\) is \(\Theta (n^{k}/(\log {n})^{(k-1)/2})\) while the standard deviation is \(\Theta (n^{2k^2/(k+1)}/(\log {n})^{k^2/(k+1)})\). Our principal result is a Gaussian central limit theorem for \(L_n(k)\) that is based on a new variant of Stein’s method.
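The iterative search procedures and local maxima mentioned in the abstract can be illustrated with a minimal sketch. It assumes an alternating scheme: fix the current \(k\) columns and keep the \(k\) rows with the largest sums over them, then fix those rows and keep the \(k\) columns with the largest sums, repeating until nothing changes. The function names `iterative_search` and `is_local_max`, the random starting columns, and the iteration cap are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def top_k(values, k):
    """Sorted indices of the k largest entries of a 1-D array."""
    return np.sort(np.argsort(values)[-k:])


def iterative_search(W, k, rng, max_iter=1000):
    """Alternate row and column updates until the k x k submatrix stops changing.

    Hypothetical sketch: starts from k random columns (an assumption).
    """
    cols = np.sort(rng.choice(W.shape[1], size=k, replace=False))
    rows = None
    for _ in range(max_iter):
        # Given the current columns, keep the k rows with the largest sums.
        new_rows = top_k(W[:, cols].sum(axis=1), k)
        # Given those rows, keep the k columns with the largest sums.
        new_cols = top_k(W[new_rows, :].sum(axis=0), k)
        if rows is not None and np.array_equal(new_rows, rows) \
                and np.array_equal(new_cols, cols):
            break  # fixed point reached
        rows, cols = new_rows, new_cols
    return rows, cols


def is_local_max(W, rows, cols):
    """Check that the rows and columns have dominant sums relative to each other."""
    k = len(rows)
    rows_ok = np.array_equal(top_k(W[:, cols].sum(axis=1), k), np.sort(rows))
    cols_ok = np.array_equal(top_k(W[rows, :].sum(axis=0), k), np.sort(cols))
    return rows_ok and cols_ok


# Usage: one run on a 200 x 200 matrix of independent standard Gaussians.
rng = np.random.default_rng(0)
W = rng.standard_normal((200, 200))
rows, cols = iterative_search(W, 3, rng)
print(rows, cols, W[np.ix_(rows, cols)].mean())
```

Each half-step can only increase the sum of the selected entries, and ties occur with probability zero for continuous entries, so the loop stops at a submatrix whose rows and columns are mutually dominant. Counting the fixed points reachable this way is one way to picture the quantity \(L_n(k)\).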

Acknowledgements

PD is grateful for the hospitality of the Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, where much of the research was done. SB and ABN were partially supported by NSF Grants DMS-1310002 and DMS-1613072. SB was partially supported by NSF Grants DMS-1105581, DMS-1606839, SES grant 1357622 and ARO grant W911NF-17-1-0010. PD was supported by a Simons Postdoctoral Fellowship at New York University. AN was partially supported by NSF grant DMS-0907177. We thank the referee for going through the many technical arguments closely and pointing out a number of issues whose rectification significantly improved the presentation and readability of the paper.

Author information

Corresponding author

Correspondence to Shankar Bhamidi.

About this article

Cite this article

Bhamidi, S., Dey, P.S. & Nobel, A.B. Energy landscape for large average submatrix detection problems in Gaussian random matrices. Probab. Theory Relat. Fields 168, 919–983 (2017). https://doi.org/10.1007/s00440-017-0766-0

  • DOI: https://doi.org/10.1007/s00440-017-0766-0

Keywords

  • Energy landscape
  • Extreme value theory
  • Central limit theorem
  • Stein’s method

Mathematics Subject Classification

  • Primary 62G32
  • 60F05
  • 60G70