
Notes on Computational Hardness of Hypothesis Testing: Predictions Using the Low-Degree Likelihood Ratio

Conference paper in: Mathematical Analysis, its Applications and Computation (ISAAC 2019)

Abstract

These notes survey and explore an emerging method, which we call the low-degree method, for understanding statistical-versus-computational tradeoffs in high-dimensional inference problems. In short, the method posits that a certain quantity—the second moment of the low-degree likelihood ratio—gives insight into how much computational time is required to solve a given hypothesis testing problem, which can in turn be used to predict the computational hardness of a variety of statistical inference tasks. While this method originated in the study of the sum-of-squares (SoS) hierarchy of convex programs, we present a self-contained introduction that does not require knowledge of SoS. In addition to showing how to carry out predictions using the method, we include a discussion investigating both rigorous and conjectural consequences of these predictions. These notes include some new results, simplified proofs, and refined conjectures. For instance, we point out a formal connection between spectral methods and the low-degree likelihood ratio, and we give a sharp low-degree lower bound against subexponential-time algorithms for tensor PCA.


Notes

  1.

    We will only consider this so-called strong version of distinguishability, where the probability of success must tend to 1 as n → ∞, as opposed to the weak version where this probability need only be bounded above \(\frac {1}{2}\). For high-dimensional problems, the strong version typically coincides with important notions of estimating the planted signal (see Sect. 4.2.6), whereas the weak version is often trivial.

  2.

    For instance (and this is what will be relevant in the examples we consider later), any pair of non-degenerate multivariate Gaussian distributions satisfies this assumption.

  3.

    It is important to note that, from the point of view of statistics, we are restricting our attention to the special case of deciding between two “simple” hypotheses, where each hypothesis consists of the dataset being drawn from a specific distribution. Optimal testing is more subtle for “composite” hypotheses in parametric families of probability distributions, a more typical setting in practice. The mathematical difficulties of this extended setting are discussed thoroughly in [75].

  4.

    For readers not familiar with the Radon–Nikodym derivative: if \(\mathbb {P}\), \(\mathbb {Q}\) are discrete distributions then \(L(\boldsymbol Y) = \mathbb {P}(\boldsymbol Y)/\mathbb {Q}(\boldsymbol Y)\); if \(\mathbb {P}\), \(\mathbb {Q}\) are continuous distributions with density functions p, q (respectively) then L(Y ) = p(Y )∕q(Y ).

  5.

    For a more precise definition of \(L^2(\mathbb {Q}_n)\) (in particular including issues around functions differing on sets of measure zero) see a standard reference on real analysis such as [100].

  6.

    To clarify, orthogonal projection is with respect to the inner product induced by \(\mathbb {Q}_n\) (see Definition 7).

  7.

    Two techniques from this calculation are elements of the “replica method” from statistical physics: (1) writing a power of an expectation as an expectation over independent “replicas” and (2) changing the order of expectations and evaluating the moment-generating function. The interested reader may see [82] for an early reference, or [21, 79] for two recent presentations.

  8.

    We will not actually use the definition of the univariate Hermite polynomials (although we will use certain properties that they satisfy as needed), but the definition is included for completeness in Appendix “Hermite Polynomials”.

  9.

    This model is equivalent to the more standard model in which the noise is symmetric with respect to permutations of the indices; see Appendix “Equivalence of Symmetric and Asymmetric Noise Models”.

  10.

    Concretely, one may take \(A_p = \frac {1}{\sqrt {2}} p^{-p/4-1/2}\) and \(B_p = \sqrt {2} e^{p/2} p^{-p/4}\).

  11.

    Some of these results only apply to minor variants of the spiked tensor problem, but we do not expect this difference to be important.

  12.

    Gaussian Orthogonal Ensemble (GOE): W is a symmetric n × n matrix with entries \(W_{ii} \sim \mathscr {N}(0,2/n)\) and \(W_{ij} = W_{ji} \sim \mathscr {N}(0,1/n)\), independently.

  13.

    In the sparse Rademacher prior, each entry of x is nonzero with probability ρ (independently), and the nonzero entries are drawn uniformly from \(\{\pm 1/\sqrt {\rho }\}\).

  14.

    More specifically, \((\|L_n^{\le D}\|{ }^2 - 1)\) is the variance of a certain pseudo-expectation value generated by pseudo-calibration, whose actual value in a valid pseudo-expectation must be exactly 1. It appears to be impossible to “correct” this part of the pseudo-expectation if the variance is diverging with n.

  15.

    Here, “best” is in the sense of strongly distinguishing \(\mathbb {P}_n\) and \(\mathbb {Q}_n\) throughout the largest possible regime of model parameters.

  16.

    In [47], it is shown that for a fairly general class of average-case hypothesis testing problems, if SoS succeeds in some range of parameters then there is a low-degree spectral method whose maximum positive eigenvalue succeeds (in a somewhat weaker range of parameters). However, the resulting matrix could a priori have an arbitrarily large (in magnitude) negative eigenvalue, which would prevent the spectral method from running in polynomial time. For this same reason, it seems difficult to establish a formal connection between SoS and the LDLR via spectral methods.

  17.

    Indeed, coordinate degree need not be phrased in terms of polynomials: one may equivalently consider the linear subspace of \(L^2(\mathbb {Q}_n)\) spanned by functions of at most D variables at a time.

  18.

    Non-trivial estimation of a signal \(\boldsymbol x \in \mathbb {R}^n\) means having an estimator \(\hat {\boldsymbol x}\) achieving \(|\langle \hat {\boldsymbol x}, \boldsymbol x \rangle |/(\|\hat {\boldsymbol x}\| \cdot \|\boldsymbol x\|) \ge \varepsilon \) with high probability, for some constant ε > 0.

References

1. A. Auffinger, G. Ben Arous, J. Černỳ, Random matrices and complexity of spin glasses. Commun. Pure Appl. Math. 66(2), 165–201 (2013)

2. D. Achlioptas, A. Coja-Oghlan, Algorithmic barriers from phase transitions, in 2008 49th Annual IEEE Symposium on Foundations of Computer Science (IEEE, 2008), pp. 793–802

3. A. Anandkumar, Y. Deng, R. Ge, H. Mobahi, Homotopy analysis for tensor PCA (2016). arXiv preprint arXiv:1610.09322

4. N. Alon, M. Krivelevich, B. Sudakov, Finding a large hidden clique in a random graph. Random Struct. Algorithms 13(3–4), 457–466 (1998)

5. A.A. Amini, M.J. Wainwright, High-dimensional analysis of semidefinite relaxations for sparse principal components, in 2008 IEEE International Symposium on Information Theory (IEEE, Piscataway, 2008), pp. 2454–2458

6. N. Alon, R. Yuster, U. Zwick, Color-coding. J. ACM 42(4), 844–856 (1995)

7. M. Brennan, G. Bresler, Optimal average-case reductions to sparse PCA: from weak assumptions to strong hardness (2019). arXiv preprint arXiv:1902.07380

8. M. Brennan, G. Bresler, W. Huleihel, Reducibility and computational lower bounds for problems with planted sparse structure (2018). arXiv preprint arXiv:1806.07508

9. J. Baik, G. Ben Arous, S. Péché, Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Ann. Probab. 33(5), 1643–1697 (2005)

10. J. Barbier, M. Dia, N. Macris, F. Krzakala, T. Lesieur, L. Zdeborová, Mutual information for symmetric rank-one matrix estimation: a proof of the replica formula, in Proceedings of the 30th International Conference on Neural Information Processing Systems (Curran Associates, 2016), pp. 424–432

11. V.V.S.P. Bhattiprolu, M. Ghosh, V. Guruswami, E. Lee, M. Tulsiani, Multiplicative approximations for polynomial optimization over the unit sphere. Electron. Colloq. Comput. Complexity 23, 185 (2016)

12. G. Ben Arous, R. Gheissari, A. Jagannath, Algorithmic thresholds for tensor PCA (2018). arXiv preprint arXiv:1808.00921

13. V. Bhattiprolu, V. Guruswami, E. Lee, Sum-of-squares certificates for maxima of random tensors on the sphere (2016). arXiv preprint arXiv:1605.00903

14. F. Benaych-Georges, R. Rao Nadakuditi, The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Adv. Math. 227(1), 494–521 (2011)

15. B. Barak, S. Hopkins, J. Kelner, P.K. Kothari, A. Moitra, A. Potechin, A nearly tight sum-of-squares lower bound for the planted clique problem. SIAM J. Comput. 48(2), 687–735 (2019)

16. A. Blum, A. Kalai, H. Wasserman, Noise-tolerant learning, the parity problem, and the statistical query model. J. ACM 50(4), 506–519 (2003)

17. A.S. Bandeira, D. Kunisky, A.S. Wein, Computational hardness of certifying bounds on constrained PCA problems (2019). arXiv preprint arXiv:1902.07324

18. C. Bordenave, M. Lelarge, L. Massoulié, Non-backtracking spectrum of random graphs: community detection and non-regular Ramanujan graphs, in 2015 IEEE 56th Annual Symposium on Foundations of Computer Science (IEEE, Piscataway, 2015), pp. 1347–1357

19. J. Banks, C. Moore, J. Neeman, P. Netrapalli, Information-theoretic thresholds for community detection in sparse networks, in Conference on Learning Theory (2016), pp. 383–416

20. J. Banks, C. Moore, R. Vershynin, N. Verzelen, J. Xu, Information-theoretic bounds and phase transitions in clustering, sparse PCA, and submatrix localization. IEEE Trans. Inform. Theory 64(7), 4872–4894 (2018)

21. A.S. Bandeira, A. Perry, A.S. Wein, Notes on computational-to-statistical gaps: predictions using statistical physics (2018). arXiv preprint arXiv:1803.11132

22. Q. Berthet, P. Rigollet, Computational lower bounds for sparse PCA (2013). arXiv preprint arXiv:1304.0828

23. B. Barak, D. Steurer, Proofs, beliefs, and algorithms through the lens of sum-of-squares. Course Notes (2016). http://www.sumofsquares.org/public/index.html

24. W.-K. Chen, D. Gamarnik, D. Panchenko, M. Rahman, Suboptimality of local algorithms for a class of max-cut problems. Ann. Probab. 47(3), 1587–1618 (2019)

25. Y. Deshpande, E. Abbe, A. Montanari, Asymptotic mutual information for the two-groups stochastic block model (2015). arXiv preprint arXiv:1507.08685

26. M. Dyer, A. Frieze, M. Jerrum, On counting independent sets in sparse graphs. SIAM J. Comput. 31(5), 1527–1541 (2002)

27. I. Diakonikolas, G. Kamath, D. Kane, J. Li, A. Moitra, A. Stewart, Robust estimators in high-dimensions without the computational intractability. SIAM J. Comput. 48(2), 742–864 (2019)

28. A. Decelle, F. Krzakala, C. Moore, L. Zdeborová, Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E 84(6), 066106 (2011)

29. A. Decelle, F. Krzakala, C. Moore, L. Zdeborová, Inference and phase transitions in the detection of modules in sparse networks. Phys. Rev. Lett. 107(6), 065701 (2011)

30. I. Diakonikolas, D.M. Kane, A. Stewart, Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures, in 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS) (IEEE, Piscataway, 2017), pp. 73–84

31. Y. Ding, D. Kunisky, A.S. Wein, A.S. Bandeira, Subexponential-time algorithms for sparse PCA (2019). arXiv preprint

32. Y. Deshpande, A. Montanari, Sparse PCA via covariance thresholding, in Advances in Neural Information Processing Systems (2014), pp. 334–342

33. Y. Deshpande, A. Montanari, Finding hidden cliques of size \(\sqrt{N/e}\) in nearly linear time. Found. Comput. Math. 15(4), 1069–1128 (2015)

34. Y. Deshpande, A. Montanari, Improved sum-of-squares lower bounds for hidden clique and hidden submatrix problems, in Conference on Learning Theory (2015), pp. 523–562

35. D.L. Donoho, A. Maleki, A. Montanari, Message-passing algorithms for compressed sensing. Proc. Nat. Acad. Sci. 106(45), 18914–18919 (2009)

36. A. El Alaoui, F. Krzakala, Estimation in the spiked Wigner model: a short proof of the replica formula, in 2018 IEEE International Symposium on Information Theory (ISIT) (IEEE, 2018), pp. 1874–1878

37. A. El Alaoui, F. Krzakala, M.I. Jordan, Finite size corrections and likelihood ratio fluctuations in the spiked Wigner model (2017). arXiv preprint arXiv:1710.02903

38. A. El Alaoui, F. Krzakala, M.I. Jordan, Fundamental limits of detection in the spiked Wigner model (2018). arXiv preprint arXiv:1806.09588

39. V. Feldman, E. Grigorescu, L. Reyzin, S.S. Vempala, Y. Xiao, Statistical algorithms and a lower bound for detecting planted cliques. J. ACM 64(2), 8 (2017)

40. U. Feige, J. Kilian, Heuristics for semirandom graph problems. J. Comput. Syst. Sci. 63(4), 639–671 (2001)

41. D. Féral, S. Péché, The largest eigenvalue of rank one deformation of large Wigner matrices. Commun. Math. Phys. 272(1), 185–228 (2007)

42. V. Feldman, W. Perkins, S. Vempala, On the complexity of random satisfiability problems with planted solutions. SIAM J. Comput. 47(4), 1294–1338 (2018)

43. D. Grigoriev, Linear lower bound on degrees of Positivstellensatz calculus proofs for the parity. Theor. Comput. Sci. 259(1–2), 613–622 (2001)

44. D. Gamarnik, M. Sudan, Limits of local algorithms over sparse random graphs, in Proceedings of the 5th Conference on Innovations in Theoretical Computer Science (ACM, New York, 2014), pp. 369–376

45. D. Gamarnik, I. Zadik, Sparse high-dimensional linear regression: algorithmic barriers and a local search algorithm (2017). arXiv preprint arXiv:1711.04952

46. D. Gamarnik, I. Zadik, The landscape of the planted clique problem: dense subgraphs and the overlap gap property (2019). arXiv preprint arXiv:1904.07174

47. S.B. Hopkins, P.K. Kothari, A. Potechin, P. Raghavendra, T. Schramm, D. Steurer, The power of sum-of-squares for detecting hidden structures, in 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS) (IEEE, Piscataway, 2017), pp. 720–731

48. S. Hopkins, Statistical Inference and the Sum of Squares Method. PhD thesis, Cornell University, August 2018

49. S.B. Hopkins, D. Steurer, Bayesian estimation from few samples: community detection and related problems (2017). arXiv preprint arXiv:1710.00264

50. S.B. Hopkins, J. Shi, D. Steurer, Tensor principal component analysis via sum-of-square proofs, in Conference on Learning Theory (2015), pp. 956–1006

51. S.B. Hopkins, T. Schramm, J. Shi, D. Steurer, Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors, in Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing (ACM, New York, 2016), pp. 178–191

52. B. Hajek, Y. Wu, J. Xu, Computational lower bounds for community detection on random graphs, in Conference on Learning Theory (2015), pp. 899–928

53. S. Janson, Gaussian Hilbert Spaces, vol. 129 (Cambridge University Press, Cambridge, 1997)

54. M. Jerrum, Large cliques elude the Metropolis process. Random Struct. Algorithms 3(4), 347–359 (1992)

55. I.M. Johnstone, A.Y. Lu, Sparse principal components analysis. Unpublished Manuscript (2004)

56. I.M. Johnstone, A.Y. Lu, On consistency and sparsity for principal components analysis in high dimensions. J. Am. Stat. Assoc. 104(486), 682–693 (2009)

57. A. Jagannath, P. Lopatto, L. Miolane, Statistical thresholds for tensor PCA (2018). arXiv preprint arXiv:1812.03403

58. M. Kearns, Efficient noise-tolerant learning from statistical queries. J. ACM 45(6), 983–1006 (1998)

59. F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborová, P. Zhang, Spectral redemption in clustering sparse networks. Proc. Nat. Acad. Sci. 110(52), 20935–20940 (2013)

60. P.K. Kothari, R. Mori, R. O'Donnell, D. Witmer, Sum of squares lower bounds for refuting any CSP, in Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (ACM, New York, 2017), pp. 132–145

61. F. Krzakała, A. Montanari, F. Ricci-Tersenghi, G. Semerjian, L. Zdeborová, Gibbs states and the set of solutions of random constraint satisfaction problems. Proc. Nat. Acad. Sci. 104(25), 10318–10323 (2007)

62. R. Krauthgamer, B. Nadler, D. Vilenchik, Do semidefinite relaxations solve sparse PCA up to the information limit? Ann. Stat. 43(3), 1300–1322 (2015)

63. A.R. Klivans, A.A. Sherstov, Unconditional lower bounds for learning intersections of halfspaces. Mach. Learn. 69(2–3), 97–114 (2007)

64. L. Kučera, Expected complexity of graph partitioning problems. Discrete Appl. Math. 57(2–3), 193–212 (1995)

65. R. Kannan, S. Vempala, Beyond spectral: tight bounds for planted Gaussians (2016). arXiv preprint arXiv:1608.03643

66. F. Krzakala, J. Xu, L. Zdeborová, Mutual information in rank-one matrix estimation, in 2016 IEEE Information Theory Workshop (ITW) (IEEE, Piscataway, 2016), pp. 71–75

67. J.B. Lasserre, Global optimization with polynomials and the problem of moments. SIAM J. Optim. 11(3), 796–817 (2001)

68. L. Le Cam, Asymptotic Methods in Statistical Decision Theory (Springer, Berlin, 2012)

69. L. Le Cam, Locally asymptotically normal families of distributions. Univ. California Publ. Stat. 3, 37–98 (1960)

70. T. Lesieur, F. Krzakala, L. Zdeborová, MMSE of probabilistic low-rank matrix estimation: universality with respect to the output channel, in 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton) (IEEE, 2015), pp. 680–687

71. T. Lesieur, F. Krzakala, L. Zdeborová, Phase transitions in sparse PCA, in 2015 IEEE International Symposium on Information Theory (ISIT) (IEEE, Piscataway, 2015), pp. 1635–1639

72. A.K. Lenstra, H.W. Lenstra, L. Lovász, Factoring polynomials with rational coefficients. Math. Ann. 261(4), 515–534 (1982)

73. M. Lelarge, L. Miolane, Fundamental limits of symmetric low-rank matrix estimation. Probab. Theory Related Fields 173(3–4), 859–929 (2019)

74. T. Lesieur, L. Miolane, M. Lelarge, F. Krzakala, L. Zdeborová, Statistical and computational phase transitions in spiked tensor estimation, in 2017 IEEE International Symposium on Information Theory (ISIT) (IEEE, Piscataway, 2017), pp. 511–515

75. E.L. Lehmann, J.P. Romano, Testing Statistical Hypotheses (Springer, Berlin, 2006)

76. L. Massoulié, Community detection thresholds and the weak Ramanujan property, in Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing (ACM, New York, 2014), pp. 694–703

77. L. Miolane, Phase transitions in spiked matrix estimation: information-theoretic analysis (2018). arXiv preprint arXiv:1806.04343

78. S.S. Mannelli, F. Krzakala, P. Urbani, L. Zdeborova, Passed & spurious: descent algorithms and local minima in spiked matrix-tensor models, in International Conference on Machine Learning (2019), pp. 4333–4342

79. M. Mezard, A. Montanari, Information, Physics, and Computation (Oxford University Press, Oxford, 2009)

80. E. Mossel, J. Neeman, A. Sly, Reconstruction and estimation in the planted partition model. Probab. Theory Related Fields 162(3–4), 431–461 (2015)

81. E. Mossel, J. Neeman, A. Sly, A proof of the block model threshold conjecture. Combinatorica 38(3), 665–708 (2018)

82. M. Mézard, G. Parisi, M. Virasoro, Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications, vol. 9 (World Scientific Publishing Company, Singapore, 1987)

83. R. Meka, A. Potechin, A. Wigderson, Sum-of-squares lower bounds for planted clique, in Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing (ACM, New York, 2015), pp. 87–96

84. A. Montanari, D. Reichman, O. Zeitouni, On the limitation of spectral methods: from the Gaussian hidden clique problem to rank-one perturbations of Gaussian tensors, in Advances in Neural Information Processing Systems (2015), pp. 217–225

85. L. Massoulié, L. Stephan, D. Towsley, Planting trees in graphs, and finding them back (2018). arXiv preprint arXiv:1811.01800

86. T. Ma, A. Wigderson, Sum-of-squares lower bounds for sparse PCA, in Advances in Neural Information Processing Systems (2015), pp. 1612–1620

87. J. Neyman, E.S. Pearson, IX. On the problem of the most efficient tests of statistical hypotheses. Philos. Trans. R. Soc. Lond. Ser. A Containing Papers Math. Phys. Charact. 231(694–706), 289–337 (1933)

88. R. O'Donnell, Analysis of Boolean Functions (Cambridge University Press, Cambridge, 2014)

89. P.A. Parrilo, Structured Semidefinite Programs and Semialgebraic Geometry Methods in Robustness and Optimization. PhD thesis, California Institute of Technology, 2000

90. A. Perry, A.S. Wein, A.S. Bandeira, Statistical limits of spiked tensor models (2016). arXiv preprint arXiv:1612.07728

91. A. Perry, A.S. Wein, A.S. Bandeira, A. Moitra, Optimality and sub-optimality of PCA I: spiked random matrix models. Ann. Stat. 46(5), 2416–2451 (2018)

92. P. Rigollet, J.-C. Hütter, High-dimensional statistics. Lecture Notes, 2018

93. E. Richard, A. Montanari, A statistical model for tensor PCA, in Advances in Neural Information Processing Systems (2014), pp. 2897–2905

94. P. Raghavendra, S. Rao, T. Schramm, Strongly refuting random CSPs below the spectral threshold, in Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (ACM, New York, 2017), pp. 121–131

95. P. Raghavendra, T. Schramm, D. Steurer, High-dimensional estimation via sum-of-squares proofs (2018). arXiv preprint arXiv:1807.11419

96. R.W. Robinson, N.C. Wormald, Almost all cubic graphs are Hamiltonian. Random Struct. Algorithms 3(2), 117–125 (1992)

97. R.W. Robinson, N.C. Wormald, Almost all regular graphs are Hamiltonian. Random Struct. Algorithms 5(2), 363–374 (1994)

98. G. Schoenebeck, Linear level Lasserre lower bounds for certain k-CSPs, in 2008 49th Annual IEEE Symposium on Foundations of Computer Science (IEEE, Piscataway, 2008), pp. 593–602

99. A. Saade, F. Krzakala, L. Zdeborová, Spectral clustering of graphs with the Bethe Hessian, in Advances in Neural Information Processing Systems (2014), pp. 406–414

100. E.M. Stein, R. Shakarchi, Real Analysis: Measure Theory, Integration, and Hilbert Spaces (Princeton University Press, Princeton, 2009)

101. G. Szegö, Orthogonal Polynomials, vol. 23 (American Mathematical Society, 1939)

102. T. Wang, Q. Berthet, Y. Plan, Average-case hardness of RIP certification, in Advances in Neural Information Processing Systems (2016), pp. 3819–3827

103. T. Wang, Q. Berthet, R.J. Samworth, Statistical and computational trade-offs in estimation of sparse principal components. Ann. Stat. 44(5), 1896–1930 (2016)

104. A.S. Wein, A. El Alaoui, C. Moore, The Kikuchi hierarchy and tensor PCA (2019). arXiv preprint arXiv:1904.03858

105. I. Zadik, D. Gamarnik, High dimensional linear regression using lattice basis reduction, in Advances in Neural Information Processing Systems (2018), pp. 1842–1852

106. L. Zdeborová, F. Krzakala, Statistical physics of inference: thresholds and algorithms. Adv. Phys. 65(5), 453–552 (2016)


Acknowledgements

We thank the participants of a working group on the subject of these notes, organized by the authors at the Courant Institute of Mathematical Sciences during the spring of 2019. We also thank Samuel B. Hopkins, Philippe Rigollet, and David Steurer for helpful discussions.

DK was partially supported by NSF grants DMS-1712730 and DMS-1719545. ASW was partially supported by NSF grant DMS-1712730 and by the Simons Collaboration on Algorithms and Geometry. ASB was partially supported by NSF grants DMS-1712730 and DMS-1719545, and by a grant from the Sloan Foundation.

Author information

Corresponding author

Correspondence to Afonso S. Bandeira.


Appendices

Appendix 1: Omitted Proofs

Neyman-Pearson Lemma

We include here, for completeness, a proof of the classical Neyman–Pearson lemma [87].

Proof of Lemma 1

Note first that a test f is completely determined by its rejection region, \(R_f = \{\boldsymbol Y: f(\boldsymbol Y) = \mathbb {P}\}\). We may rewrite the power of f as

$$\displaystyle \begin{aligned} 1 - \beta(f) = \mathbb{P}[f(\boldsymbol Y) = \mathbb{P}] = \int_{R_f}d\mathbb{P}(\boldsymbol Y) = \int_{R_f}L(\boldsymbol Y)d\mathbb{Q}(\boldsymbol Y). \end{aligned}$$

On the other hand, our assumption on α(f) is equivalent to

$$\displaystyle \begin{aligned} \mathbb{Q}[R_f] \leq \mathbb{Q}[L(\boldsymbol Y) > \eta]. \end{aligned}$$

Thus, we are interested in solving the optimization

$$\displaystyle \begin{aligned} \begin{array}{ll} \text{maximize} & \int_{R_f}L(\boldsymbol Y)d\mathbb{Q}(\boldsymbol Y) \\ \text{subject to} & R_f \in \mathscr{F}, \\ & \mathbb{Q}[R_f] \leq \mathbb{Q}[L(\boldsymbol Y) > \eta]. \end{array} \end{aligned}$$

From this form, let us write \(R_\star = \{\boldsymbol Y : L(\boldsymbol Y) > \eta \}\) for the rejection region of the likelihood ratio test \(L_\eta \); then the difference of powers is

$$\displaystyle \begin{aligned} (1 - \beta(L_\eta)) - (1 - \beta(f)) &= \int_{R_\star}L(\boldsymbol Y)d\mathbb{Q}(\boldsymbol Y) - \int_{R_f}L(\boldsymbol Y)d\mathbb{Q}(\boldsymbol Y) \\ &= \int_{R_\star \setminus R_f}L(\boldsymbol Y)d\mathbb{Q}(\boldsymbol Y) - \int_{R_f \setminus R_\star}L(\boldsymbol Y)d\mathbb{Q}(\boldsymbol Y) \\ &\geq \eta\left(\mathbb{Q}[R_\star \setminus R_f] - \mathbb{Q}[R_f \setminus R_\star]\right) \\ &= \eta\left(\mathbb{Q}[R_\star] - \mathbb{Q}[R_f]\right) \\ &\geq 0, \end{aligned} $$

completing the proof.
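To make Lemma 1 concrete, here is a minimal numerical sketch (ours, not part of the original notes) for the simple-vs-simple test \(\mathbb{Q} = \mathscr{N}(0,1)\) versus \(\mathbb{P} = \mathscr{N}(\mu,1)\): the likelihood ratio \(L(y) = \exp(\mu y - \mu^2/2)\) is increasing in y, so the likelihood ratio test simply thresholds y, and Monte Carlo confirms that it dominates an arbitrary competing test calibrated to the same false-alarm probability. All parameter choices below are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, alpha, n_trials = 1.5, 0.05, 200_000

# Likelihood ratio test at level alpha: since L(y) = exp(mu*y - mu^2/2) is
# increasing in y, rejecting when L(y) > eta is the same as y > z_{1-alpha}.
thresh = norm.ppf(1 - alpha)

y_null = rng.normal(0.0, 1.0, n_trials)   # samples from Q
y_alt = rng.normal(mu, 1.0, n_trials)     # samples from P

lr_level = np.mean(y_null > thresh)
lr_power = np.mean(y_alt > thresh)

# A competing test with the same level: reject when y falls in a window
# [a, a + w], with w chosen so that Q assigns the window probability alpha.
a = 0.5
w = norm.ppf(norm.cdf(a) + alpha) - a
win_level = np.mean((y_null > a) & (y_null < a + w))
win_power = np.mean((y_alt > a) & (y_alt < a + w))

print(f"LR test:     level ~ {lr_level:.3f}, power ~ {lr_power:.3f}")
print(f"window test: level ~ {win_level:.3f}, power ~ {win_power:.3f}")
# The LR test should have (weakly) larger power, as Lemma 1 guarantees.
```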

Equivalence of Symmetric and Asymmetric Noise Models

For technical convenience, in the main text we worked with an asymmetric version of the spiked Wigner model (see Sect. 3.2), \(\boldsymbol Y = \lambda \boldsymbol x \boldsymbol x^\top + \boldsymbol Z\) where Z has i.i.d. \(\mathscr {N}(0,1)\) entries. A more standard model is to instead observe \(\widetilde {\boldsymbol Y} = \frac {1}{2}(\boldsymbol Y + \boldsymbol Y^\top ) = \lambda \boldsymbol x \boldsymbol x^\top + \boldsymbol W\), where W is symmetric with \(\mathscr {N}(0,1)\) diagonal entries and \(\mathscr {N}(0,1/2)\) off-diagonal entries, all independent. These two models are equivalent, in the sense that if we are given a sample from one then we can produce a sample from the other. Clearly, if we are given Y, we can symmetrize it to form \(\widetilde {\boldsymbol Y}\). Conversely, if we are given \(\widetilde {\boldsymbol Y}\), we can draw an independent matrix G with i.i.d. \(\mathscr {N}(0,1)\) entries, and compute \(\widetilde {\boldsymbol Y} + \frac {1}{2}(\boldsymbol G - \boldsymbol G^\top )\); one can check that the resulting matrix has the same distribution as Y (we are adding back the “skew-symmetric part” that is present in Y but not \(\widetilde {\boldsymbol Y}\)).
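Both directions of this equivalence are easy to check by simulation. The following sketch (ours; n and λ are arbitrary illustrative values) symmetrizes an asymmetric observation, adds back an independent skew-symmetric Gaussian part, and verifies the claimed noise variances of the symmetrized model at λ = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 4, 0.7
x = rng.choice([-1.0, 1.0], size=n) / np.sqrt(n)   # an arbitrary unit-norm spike

# Asymmetric model: Y = lam * x x^T + Z with Z having i.i.d. N(0,1) entries.
Z = rng.normal(size=(n, n))
Y = lam * np.outer(x, x) + Z

# Direction 1: symmetrize to obtain the standard symmetric observation.
Y_tilde = 0.5 * (Y + Y.T)

# Direction 2: given only Y_tilde, add an independent skew-symmetric part
# to produce a matrix distributed as the asymmetric Y.
G = rng.normal(size=(n, n))
Y_resampled = Y_tilde + 0.5 * (G - G.T)

# Check the noise law of the symmetrized model (take lam = 0): diagonal
# entries should have variance ~1 and off-diagonal entries variance ~1/2.
noise = np.array([0.5 * (W + W.T) for W in rng.normal(size=(50_000, n, n))])
print("diagonal variance     ~", round(noise[:, 0, 0].var(), 3))
print("off-diagonal variance ~", round(noise[:, 0, 1].var(), 3))
```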

In the spiked tensor model (see Sect. 3.1), our asymmetric noise model is similarly equivalent to the standard symmetric model defined in [93] (in which the noise tensor Z is averaged over all permutations of indices). Since we can treat each entry of the symmetric tensor separately, it is sufficient to show the following one-dimensional fact: for unknown \(x \in \mathbb {R}\), k samples of the form \(y_i = x + \mathscr {N}(0,1)\) are equivalent to one sample of the form \(\tilde y = x + \mathscr {N}(0,1/k)\). Given \(\{y_i\}\), we can sample \(\tilde y\) by averaging: \(\frac {1}{k}\sum _{i=1}^k y_i\). For the converse, fix unit vectors \(\boldsymbol a_1, \dots , \boldsymbol a_k\) at the corners of a simplex in \(\mathbb {R}^{k-1}\); these satisfy \(\langle \boldsymbol a_i,\boldsymbol a_j \rangle = -\frac {1}{k-1}\) for all i ≠ j. Given \(\tilde y\), draw \(\boldsymbol u \sim \mathscr {N}(0,{\boldsymbol I}_{k-1})\) and let \(y_i = \tilde y + \sqrt {1-1/k} \,\langle \boldsymbol a_i,\boldsymbol u \rangle \); one can check that these have the correct distribution.
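Here is a small simulation (ours) of the converse direction: starting from one sample \(\tilde y = x + \mathscr{N}(0,1/k)\), the simplex construction produces \(y_1, \dots, y_k\) jointly distributed as k independent \(x + \mathscr{N}(0,1)\) samples. For convenience the simplex directions are embedded in \(\mathbb{R}^k\) (isometric to the \(\mathbb{R}^{k-1}\) construction in the text); k, x, and the number of repetitions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
k, x, reps = 5, 2.0, 50_000

# Unit vectors a_1, ..., a_k with <a_i, a_j> = -1/(k-1) for i != j: project the
# standard basis of R^k onto the hyperplane orthogonal to the all-ones vector
# and rescale. These span a (k-1)-dimensional subspace, so this is the same
# construction as in the text, just embedded in R^k.
E = np.eye(k) - np.ones((k, k)) / k
A = E / np.linalg.norm(E, axis=1, keepdims=True)
print("off-diagonal Gram entries (target -1/(k-1) = %.3f):" % (-1 / (k - 1)))
print(np.round(A @ A.T, 3))

def expand(y_tilde):
    """Map one sample y_tilde = x + N(0,1/k) to k samples distributed as x + N(0,1)."""
    u = rng.normal(size=k)
    return y_tilde + np.sqrt(1 - 1 / k) * (A @ u)

samples = np.array([expand(x + rng.normal(0, np.sqrt(1 / k))) for _ in range(reps)])
print("entrywise mean (target x = 2):", samples.mean(axis=0).round(3))
print("entrywise variance (target 1):", samples.var(axis=0).round(3))
print("cov(y_1, y_2) (target 0):     ", round(float(np.cov(samples[:, 0], samples[:, 1])[0, 1]), 3))
```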

Low-Degree Analysis of Spiked Wigner Above the PCA Threshold

Proof of Theorem 6

We follow the proof of Theorem 2(ii) in Sect. 3.1.2. For any choice of d ≤ D, using the standard bound \(\binom {2d}{d} \ge 4^d/(2\sqrt {d})\),

$$\displaystyle \begin{aligned} \|L_n^{\le D}\|{}^2 &\ge \frac{\lambda^{2d}}{d!} \operatorname*{\mathbb{E}}_{\boldsymbol x^1,\boldsymbol x^2}[\langle \boldsymbol x^1,\boldsymbol x^2 \rangle^{2d}] \\ &\ge \frac{\lambda^{2d}}{d!} \binom{n}{d} \frac{(2d)!}{2^{d}} \\ &= \frac{\lambda^{2d}}{d!} \frac{n!}{d!(n-d)!} \frac{(2d)!}{2^{d}} \\ &= \lambda^{2d} \binom{2d}{d} \frac{n!}{(n-d)! 2^d} \\ &\ge \lambda^{2d} \frac{4^d}{2\sqrt{d}} \frac{(n-d)^d}{2^d} \\ &= \frac{1}{2\sqrt{d}} \left(2\lambda^2 (n-d)\right)^d \\ &= \frac{1}{2\sqrt{d}} \left(\hat\lambda^2 \left(1 - \frac{d}{n}\right)\right)^d.\end{aligned} $$
(using the moment bound from Sect. 3.1.2 in the second step)

Since \(\hat \lambda > 1\), this diverges as n → ∞ provided we choose d ≤ D with ω(1) ≤ d ≤ o(n).
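As a quick sanity check (ours, with arbitrary illustrative parameters), one can evaluate the final lower bound \(\frac{1}{2\sqrt{d}}\left(\hat\lambda^2(1-d/n)\right)^d\) for a fixed \(\hat\lambda > 1\) and, say, \(d = \sqrt{n}\), and watch it diverge:

```python
import numpy as np

lam_hat = 1.05                      # any fixed value exceeding 1
for n in (10**3, 10**4, 10**5, 10**6):
    d = int(np.sqrt(n))             # any d with omega(1) <= d <= o(n) works
    log_bound = -np.log(2 * np.sqrt(d)) + d * np.log(lam_hat**2 * (1 - d / n))
    print(f"n = {n:>7}, d = {d:>4}, log of the lower bound ~ {log_bound:8.1f}")
# The logarithm grows roughly like 2 * d * log(lam_hat), so the bound diverges.
```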

Appendix 2: Omitted Probability Theory Background

Hermite Polynomials

Here we give definitions and basic facts regarding the Hermite polynomials (see, e.g., [101] for further details), which are orthogonal polynomials with respect to the standard Gaussian measure.

Definition 15

The univariate Hermite polynomials are the sequence of polynomials \(h_k(x) \in \mathbb {R}[x]\) for k ≥ 0 defined by the recursion

$$\displaystyle \begin{aligned} h_0(x) &= 1, \\ h_{k + 1}(x) &= xh_k(x) - h_k^\prime(x).\end{aligned} $$

The normalized univariate Hermite polynomials are \(\widehat {h}_k(x) = h_k(x) / \sqrt {k!}\).
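The recursion is straightforward to implement; the following sketch (ours, not from the original notes) builds \(h_0, \dots, h_4\) from Definition 15 and checks orthonormality of the normalized polynomials under \(\mathscr{N}(0,1)\) by Monte Carlo.

```python
import math
import numpy as np
from numpy.polynomial import polynomial as P

def hermite(k_max):
    """Coefficient arrays of h_0, ..., h_{k_max} via h_{k+1}(x) = x*h_k(x) - h_k'(x)."""
    hs = [np.array([1.0])]                                  # h_0 = 1
    for _ in range(k_max):
        h = hs[-1]
        hs.append(P.polysub(P.polymulx(h), P.polyder(h)))   # x*h_k(x) - h_k'(x)
    return hs

hs = hermite(4)
print([h.tolist() for h in hs])   # coefficients of 1, x, x^2 - 1, x^3 - 3x, x^4 - 6x^2 + 3

# Monte Carlo check that the h_k / sqrt(k!) are orthonormal in L^2(N(0,1)).
rng = np.random.default_rng(0)
y = rng.normal(size=2_000_000)
vals = np.array([P.polyval(y, h) / math.sqrt(math.factorial(k)) for k, h in enumerate(hs)])
gram = vals @ vals.T / y.size
print(np.round(gram, 2))   # approximately the identity (noisier for higher degrees)
```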

The following is the key property of the Hermite polynomials, which allows functions in \(L^2(\mathscr {N}(0, 1))\) to be expanded in terms of them.

Proposition 10

The normalized univariate Hermite polynomials form a complete orthonormal system of polynomials for \(L^2(\mathscr {N}(0, 1))\).

The following are the multivariate generalizations of the above definition that we used throughout the main text.

Definition 16

The N-variate Hermite polynomials are the polynomials \(H_{\boldsymbol \alpha }(\boldsymbol y) = \prod _{i=1}^{N} h_{\alpha _i}(y_i)\) for \(\boldsymbol \alpha \in \mathbb {N}^N\). The normalized N-variate Hermite polynomials are the polynomials \(\widehat {H}_{\boldsymbol \alpha }(\boldsymbol y) = \prod _{i=1}^{N} \widehat {h}_{\alpha _i}(y_i)\) for \(\boldsymbol \alpha \in \mathbb {N}^N\).

Again, the following is the key property justifying expansions in terms of these polynomials.

Proposition 11

The normalized N-variate Hermite polynomials form a complete orthonormal system of (multivariate) polynomials for \(L^2(\mathscr {N}(\boldsymbol 0, \boldsymbol I_N))\).

For the sake of completeness, we also provide proofs below of the three identities concerning univariate Hermite polynomials that we used in Sect. 2.3 to derive the norm of the LDLR under the additive Gaussian noise model. It is more convenient to prove these in a different order than they were presented in Sect. 2.3, since one identity is especially useful for proving the others.

Proof of Proposition 8 , Integration by Parts

Recall that we are assuming a function \(f: \mathbb {R} \to \mathbb {R}\) is k times continuously differentiable and f and its derivatives are \(O(\exp (|x|{ }^\alpha ))\) for α ∈ (0, 2), and we want to show the identity

$$\displaystyle \begin{aligned} \operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(0, 1)}[h_k(y) f(y)] = \operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(0, 1)}\left[ \frac{d^k f}{dy^k}(y)\right]. \end{aligned}$$

We proceed by induction. Since \(h_0(y) = 1\), the case k = 0 follows immediately. We also verify by hand the case k = 1, with \(h_1(y) = y\):

$$\displaystyle \begin{aligned} \operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(0, 1)}\left[yf(y) \right] &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty f(y) \cdot ye^{-y^2 / 2}dy \\ &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty f^\prime(y) e^{-y^2 / 2}dy \\ &= \operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(0, 1)}\left[f^\prime(y) \right], \end{aligned} $$

where we have used ordinary integration by parts.

Now, suppose the identity holds for all degrees smaller than some k ≥ 2, and expand the degree k case according to the recursion:

$$\displaystyle \begin{aligned} \operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(0, 1)}[h_k(y) f(y)] &= \operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(0, 1)}[y h_{k - 1}(y) f(y)] - \operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(0, 1)}[h_{k - 1}^\prime(y) f(y)] \\ &= \operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(0, 1)}[h_{k - 1}^\prime(y)f(y)] + \operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(0, 1)}[h_{k - 1}(y)f^\prime(y)] \\ &\hspace{3.5cm} - \operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(0, 1)}[h_{k - 1}^\prime(y) f(y)] \\ &= \operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(0, 1)}[h_{k - 1}(y)f^\prime(y)] \\ &= \operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(0, 1)}\left[\frac{d^k f}{dy^k}(y)\right], \end{aligned} $$

where we have used the degree 1 and then the degree k − 1 hypotheses.

Proof of Proposition 7 , Translation Identity

Recall that we want to show, for all k ≥ 0 and \(\mu \in \mathbb {R}\), that

$$\displaystyle \begin{aligned} \operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(\mu, 1)}[h_k(y)] = \mu^k. \end{aligned}$$

We proceed by induction on k. Since \(h_0(y) = 1\), the case k = 0 is immediate. Now, suppose the identity holds for degree k − 1, and expand the degree k case according to the recursion:

$$\displaystyle \begin{aligned} \operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(\mu, 1)}[h_k(y)] &= \operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(0, 1)}[h_k(\mu + y)] \\ &= \mu \operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(0, 1)}[h_{k - 1}(\mu + y)] + \operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(0, 1)}[y h_{k - 1}(\mu + y)] \\ &\hspace{3.75cm} - \operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(0, 1)}[h_{k - 1}^\prime(\mu + y)] \end{aligned} $$

which may be simplified using Gaussian integration by parts to

$$\displaystyle \begin{aligned} &= \mu \operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(0, 1)}[h_{k - 1}(\mu + y)] + \operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(0, 1)}[h_{k - 1}^\prime(\mu + y)] \\ &\hspace{3.75cm} - \operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(0, 1)}[h_{k - 1}^\prime(\mu + y)] \\ &= \mu \operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(0, 1)}[h_{k - 1}(\mu + y)], \end{aligned} $$

and the result follows by the inductive hypothesis.
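A quick Monte Carlo sanity check of Proposition 7 (ours; NumPy's hermite_e module implements exactly these probabilists' Hermite polynomials, and μ and the sample size are arbitrary):

```python
import numpy as np
from numpy.polynomial import hermite_e as He   # He_k coincides with h_k above

rng = np.random.default_rng(0)
mu, n_samples = 1.3, 4_000_000
y = rng.normal(mu, 1.0, n_samples)

for k in range(6):
    coeffs = np.zeros(k + 1)
    coeffs[k] = 1.0                                   # selects the single polynomial h_k
    estimate = He.hermeval(y, coeffs).mean()
    print(f"k = {k}: E[h_k(y)] ~ {estimate:8.3f}    mu^k = {mu**k:8.3f}")
```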

Proof of Proposition 9 , Generating Function

Recall that we want to show the series identity for any \(x, y \in \mathbb {R}\),

$$\displaystyle \begin{aligned} \exp\left(xy - \frac{1}{2}x^2\right) = \sum_{k = 0}^\infty \frac{1}{k!}x^k h_k(y). \end{aligned}$$

For any fixed x, the left-hand side belongs to \(L^2(\mathscr {N}(0, 1))\) in the variable y. Thus this is merely a claim about the Hermite coefficients of this function, which may be computed by taking inner products. Namely, let us write \(f_x(y) = \exp \left (xy - \frac {1}{2}x^2\right )\); then, using Gaussian integration by parts,

$$\displaystyle \begin{aligned} \langle f_x, \widehat{h}_k \rangle &= \frac{1}{\sqrt{k!}}\operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(0, 1)}\left[f_x(y) h_k(y)\right] \\ &= \frac{1}{\sqrt{k!}}\operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(0, 1)}\left[\frac{d^k f_x}{dy^k}(y) \right] \\ &= \frac{1}{\sqrt{k!}}x^k \operatorname*{\mathbb{E}}_{y \sim \mathscr{N}(0, 1)}\left[f_x(y) \right]. \end{aligned} $$

A simple calculation shows that \(\mathbb {E}_{y \sim \mathscr {N}(0, 1)}[f_x(y)] = 1\) (this is an evaluation of the Gaussian moment-generating function that we have mentioned in the main text), and then by the Hermite expansion

$$\displaystyle \begin{aligned} f_x(y) = \sum_{k = 0}^\infty \langle f_x, \widehat{h}_k \rangle \widehat{h}_k(y) = \sum_{k = 0}^\infty \frac{1}{k!}x^k h_k(y), \end{aligned}$$

giving the result.
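The same tools give a direct numerical check of Proposition 9 (ours; the values of x and y are arbitrary): the partial sums of \(\sum_k \frac{1}{k!} x^k h_k(y)\) converge rapidly to \(\exp(xy - \frac{1}{2}x^2)\).

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He   # He_k is the probabilists' h_k

x, y = 0.8, -1.1
target = np.exp(x * y - 0.5 * x**2)

partial = 0.0
for k in range(15):
    coeffs = np.zeros(k + 1)
    coeffs[k] = 1.0                                   # selects h_k
    partial += x**k * He.hermeval(y, coeffs) / math.factorial(k)
    if k in (2, 5, 10, 14):
        print(f"partial sum through k = {k:2d}: {partial:.8f}   (target {target:.8f})")
```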

Subgaussian Random Variables

Many of our rigorous arguments rely on the concept of subgaussianity, which we now define. See, e.g., [92] for more details.

Definition 17

For \(\sigma ^2 > 0\), we say that a real-valued random variable π is \(\sigma ^2\)-subgaussian if \(\mathbb {E}[\pi ] = 0\) and for all \(t \in \mathbb {R}\), the moment-generating function \(M(t) = \mathbb {E}[\exp (t \pi )]\) of π exists and is bounded by \(M(t) \le \exp (\sigma ^2 t^2 / 2)\).

Here \(\sigma ^2\) is called the variance proxy, which is not necessarily equal to the variance of π (although it can be shown that \(\sigma ^2 \ge \operatorname {Var}[\pi ]\)). The name subgaussian refers to the fact that \(\exp (\sigma ^2 t^2 / 2)\) is the moment-generating function of \(\mathscr {N}(0,\sigma ^2)\).

The following are some examples of (laws of) subgaussian random variables. Clearly, \(\mathscr {N}(0,\sigma ^2)\) is \(\sigma ^2\)-subgaussian. By Hoeffding’s lemma, any distribution supported on an interval [a, b] is \((b-a)^2/4\)-subgaussian. In particular, the Rademacher distribution Unif({±1}) is 1-subgaussian. Note also that the sum of n independent \(\sigma ^2\)-subgaussian random variables is \(\sigma ^2 n\)-subgaussian.
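These examples can be verified directly from Definition 17 by comparing moment-generating functions; here is a small numerical sketch (ours, on an arbitrary grid of t values and an arbitrary interval [a, b]):

```python
import numpy as np

t = np.linspace(-5, 5, 201)
t = t[t != 0]   # avoid dividing by zero below

# Rademacher: MGF is cosh(t), and 1-subgaussianity says cosh(t) <= exp(t^2/2).
print("Rademacher bound holds:", bool(np.all(np.cosh(t) <= np.exp(t**2 / 2))))

# Uniform on [a, b], centered at its mean m: Hoeffding's lemma gives variance
# proxy (b - a)^2 / 4. The centered MGF has the closed form below.
a, b = -2.0, 1.0
m, sigma2 = (a + b) / 2, (b - a) ** 2 / 4
mgf = (np.exp(t * (b - m)) - np.exp(t * (a - m))) / (t * (b - a))
print("Hoeffding bound holds:  ", bool(np.all(mgf <= np.exp(sigma2 * t**2 / 2))))
```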

Subgaussian random variables admit the following bound on their absolute moments; see Lemmas 1.3 and 1.4 of [92].

Proposition 12

If π is \(\sigma ^2\)-subgaussian then

$$\displaystyle \begin{aligned}\mathbb{E}[|\pi|{}^k] \le (2\sigma^2)^{k/2} k \varGamma(k/2)\end{aligned}$$

for every integer k ≥ 1.

Here Γ(⋅) denotes the gamma function which, recall, is defined for all positive real numbers and satisfies Γ(k) = (k − 1)! when k is a positive integer. We will need the following property of the gamma function.

Proposition 13

For all x > 0 and a > 0,

$$\displaystyle \begin{aligned}\frac{\varGamma(x+a)}{\varGamma(x)} \le (x+a)^a.\end{aligned}$$

Proof

This follows from two standard properties of the gamma function. The first is that (similarly to the factorial) Γ(x + 1)∕Γ(x) = x for all x > 0. The second is Gautschi’s inequality, which states that \(\varGamma (x+s)/\varGamma (x) < (x+s)^s\) for all x > 0 and s ∈ (0, 1).
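Both of the last two propositions are easy to spot-check numerically; the sketch below (ours, using arbitrary grids and the Gaussian case σ² = 1 for Proposition 12) relies on SciPy's gamma functions.

```python
import numpy as np
from scipy.special import gamma, gammaln

# Proposition 13: log Gamma(x + a) - log Gamma(x) <= a * log(x + a), on a grid.
xs = np.linspace(0.1, 50, 500)
for a in (0.3, 1.0, 2.5, 7.0):
    ok = np.all(gammaln(xs + a) - gammaln(xs) <= a * np.log(xs + a))
    print(f"Proposition 13, a = {a}: holds on grid -> {bool(ok)}")

# Proposition 12 with pi ~ N(0,1) (so sigma^2 = 1): E|pi|^k <= 2^{k/2} k Gamma(k/2).
rng = np.random.default_rng(0)
pi = rng.normal(size=2_000_000)
for k in range(1, 9):
    lhs = np.mean(np.abs(pi) ** k)
    rhs = 2 ** (k / 2) * k * gamma(k / 2)
    print(f"Proposition 12, k = {k}: E|pi|^k ~ {lhs:9.2f} <= {rhs:9.2f}")
```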

In the context of the spiked Wigner model (Sect. 3.2), we now prove that subgaussian spike priors admit a local Chernoff bound (Definition 14).

Proposition 14

Suppose π is \(\sigma ^2\)-subgaussian (for some constant \(\sigma ^2 > 0\)) with \(\mathbb {E}[\pi ] = 0\) and \(\mathbb {E}[\pi ^2] = 1\). Let \((\mathscr {X}_n)\) be the spike prior that draws each entry of x i.i.d. from π (where π does not depend on n). Then \((\mathscr {X}_n)\) admits a local Chernoff bound.

Proof

Since π is subgaussian, \(\pi ^2\) is subexponential, which implies \(\mathbb {E}[\exp (t \pi ^2)] < \infty \) for all |t| ≤ s, for some s > 0 (see, e.g., Lemma 1.12 of [92]).

Let \(\pi , \pi '\) be independent copies of π, and set \(\varPi = \pi \pi '\). The moment-generating function of Π is

$$\displaystyle \begin{aligned} M(t) = \mathbb{E}[\exp(t \varPi)] = \mathbb{E}_\pi \mathbb{E}_{\pi'}[\exp(t \pi \pi')] \le \mathbb{E}_\pi\left[\exp\left(\sigma^2 t^2 \pi^2/2\right)\right] < \infty \end{aligned}$$

provided \(\frac {1}{2}\sigma ^2 t^2 < s\), i.e. \(|t| < \sqrt {2s/\sigma ^2}\). Thus M(t) exists in an open interval containing t = 0, which implies \(M'(0) = \mathbb {E}[\varPi ] = 0\) and \(M''(0) = \mathbb {E}[\varPi ^2] = 1\) (this is the defining property of the moment-generating function: its derivatives at zero are the moments).

Let η > 0 and define \(f(t) = \exp \left (\frac {t^2}{2(1-\eta )}\right )\). Since M(0) = 1, M′(0) = 0, M″(0) = 1 and, as one may check, \(f(0) = 1, f'(0) = 0, f''(0) = \frac {1}{1-\eta } > 1\), there exists δ > 0 such that, for all t ∈ [−δ, δ], M(t) exists and M(t) ≤ f(t).

We then apply the standard Chernoff bound argument to \(\langle \boldsymbol x^1,\boldsymbol x^2 \rangle = \sum _{i=1}^n \varPi _i\) where \(\varPi _1, \dots , \varPi _n\) are i.i.d. copies of Π. For any α > 0,

$$\displaystyle \begin{aligned} \Pr\left\{\langle \boldsymbol x^1,\boldsymbol x^2 \rangle \ge t\right\} &= \Pr\left\{\exp(\alpha \langle \boldsymbol x^1,\boldsymbol x^2 \rangle) \ge \exp(\alpha t)\right\}\\ &\le \exp(-\alpha t) \mathbb{E}[\exp(\alpha \langle \boldsymbol x^1,\boldsymbol x^2 \rangle)] \end{aligned} $$
(byMarkov’s inequality)
$$\displaystyle \begin{aligned} &= \exp(-\alpha t) \mathbb{E}\left[\exp\left(\alpha \sum_{i=1}^n \varPi_i\right)\right]\\ &= \exp(-\alpha t) [M(\alpha)]^n\\ &\le \exp(-\alpha t) [f(\alpha)]^n \\ &= \exp(-\alpha t) \exp\left(\frac{\alpha^2 n}{2(1-\eta)}\right). \end{aligned} $$
(provided α ≤ δ)

Taking α = (1 − η)t∕n,

$$\displaystyle \begin{aligned}\Pr\left\{\langle \boldsymbol x^1,\boldsymbol x^2 \rangle \ge t\right\} \le \exp\left(-\frac{1}{n}(1-\eta)t^2 + \frac{1}{2n}(1-\eta)t^2\right) = \exp\left(-\frac{1}{2n}(1-\eta)t^2\right)\end{aligned}$$

as desired. This holds provided α ≤ δ, i.e. t ≤ δn∕(1 − η). A symmetric argument with − Π in place of Π holds for the other tail, \(\Pr \left \{\langle \boldsymbol x^1,\boldsymbol x^2 \rangle \le -t\right \}\).
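For intuition, here is a Monte Carlo illustration (ours) of the resulting tail bound for the i.i.d. Rademacher prior, which is 1-subgaussian with mean 0 and variance 1; n, η, and the thresholds t are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, eta = 100, 100_000, 0.1

x1 = rng.choice([-1.0, 1.0], size=(reps, n))   # i.i.d. Rademacher spikes x^1
x2 = rng.choice([-1.0, 1.0], size=(reps, n))   # independent copies x^2
overlap = np.sum(x1 * x2, axis=1)              # <x^1, x^2> = sum of n i.i.d. Pi_i

for t in (8, 14, 20, 26):
    empirical = np.mean(overlap >= t)
    bound = np.exp(-(1 - eta) * t**2 / (2 * n))
    print(f"t = {t:2d}: Pr[<x1,x2> >= t] ~ {empirical:.4f}   "
          f"bound exp(-(1-eta)t^2/(2n)) = {bound:.4f}")
```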

Hypercontractivity

The following hypercontractivity result states that the moments of low-degree polynomials of i.i.d. random variables must behave somewhat reasonably. The Rademacher version is the Bonami lemma from [88], and the Gaussian version appears in [53] (see Theorem 5.10 and Remark 5.11 of [53]). We refer the reader to [88] for a general discussion of hypercontractivity.

Proposition 15 (Bonami Lemma)

Let \(x = (x_1, \dots , x_n)\) have either i.i.d. \(\mathscr {N}(0,1)\) or i.i.d. Rademacher (uniform ± 1) entries, and let \(f: \mathbb {R}^n \to \mathbb {R}\) be a polynomial of degree k. Then

$$\displaystyle \begin{aligned}\mathbb{E}[f(x)^4] \le 3^{2k} \,\mathbb{E}[f(x)^2]^2.\end{aligned}$$

We will combine this with the following standard second moment method.

Proposition 16 (Paley-Zygmund Inequality)

If Z ≥ 0 is a random variable with finite variance, and 0 ≤ θ ≤ 1, then

$$\displaystyle \begin{aligned}\mathrm{Pr}\left\{Z > \theta\, \mathbb{E}[Z]\right\} \ge (1-\theta)^2 \frac{\mathbb{E}[Z]^2}{\mathbb{E}[Z^2]}.\end{aligned}$$

By combining Propositions 16 and 15, we immediately have the following.

Corollary 2

Let \(x = (x_1, \dots , x_n)\) have either i.i.d. \(\mathscr {N}(0,1)\) or i.i.d. Rademacher (uniform ± 1) entries, and let \(f: \mathbb {R}^n \to \mathbb {R}\) be a polynomial of degree k. Then, for 0 ≤ θ ≤ 1,

$$\displaystyle \begin{aligned}\Pr\left\{f(x)^2 > \theta\, \mathbb{E}[f(x)^2]\right\} \ge (1-\theta)^2 \frac{\mathbb{E}[f(x)^2]^2}{\mathbb{E}[f(x)^4]} \ge \frac{(1-\theta)^2}{3^{2k}}.\end{aligned}$$

Remark 5

One rough interpretation of Corollary 2 is that if f is degree k, then \(\mathbb {E}[f(x)^2]\) cannot be dominated by an event of probability smaller than roughly \(3^{-2k}\).
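To see Corollary 2 in action, the sketch below (ours) evaluates a fixed random multilinear degree-k polynomial of Rademacher inputs and compares its fourth moment and small-ball probability against the bounds above; n, k, and the coefficients are arbitrary.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, k, reps = 10, 3, 200_000

# A fixed multilinear degree-k polynomial f(x) = sum_S c_S prod_{i in S} x_i.
monomials = list(combinations(range(n), k))
coeffs = rng.normal(size=len(monomials))

x = rng.choice([-1.0, 1.0], size=(reps, n))    # i.i.d. Rademacher inputs
f = np.zeros(reps)
for c, S in zip(coeffs, monomials):
    f += c * np.prod(x[:, list(S)], axis=1)

m2, m4 = np.mean(f**2), np.mean(f**4)
print(f"E[f^4] / E[f^2]^2 ~ {m4 / m2**2:.2f}   (Bonami bound 3^(2k) = {3**(2*k)})")

theta = 0.5   # Paley-Zygmund applied to Z = f(x)^2
lhs = np.mean(f**2 > theta * m2)
rhs = (1 - theta) ** 2 * m2**2 / m4
print(f"Pr[f^2 > E[f^2]/2] ~ {lhs:.3f}  >=  {rhs:.3f}")
```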


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Kunisky, D., Wein, A.S., Bandeira, A.S. (2022). Notes on Computational Hardness of Hypothesis Testing: Predictions Using the Low-Degree Likelihood Ratio. In: Cerejeiras, P., Reissig, M. (eds) Mathematical Analysis, its Applications and Computation. ISAAC 2019. Springer Proceedings in Mathematics & Statistics, vol 385. Springer, Cham. https://doi.org/10.1007/978-3-030-97127-4_1
