Abstract
These notes survey and explore an emerging method, which we call the low-degree method, for understanding statistical-versus-computational tradeoffs in high-dimensional inference problems. In short, the method posits that a certain quantity—the second moment of the low-degree likelihood ratio—gives insight into how much computational time is required to solve a given hypothesis testing problem, which can in turn be used to predict the computational hardness of a variety of statistical inference tasks. While this method originated in the study of the sum-of-squares (SoS) hierarchy of convex programs, we present a self-contained introduction that does not require knowledge of SoS. In addition to showing how to carry out predictions using the method, we include a discussion investigating both rigorous and conjectural consequences of these predictions. These notes include some new results, simplified proofs, and refined conjectures. For instance, we point out a formal connection between spectral methods and the low-degree likelihood ratio, and we give a sharp low-degree lower bound against subexponential-time algorithms for tensor PCA.
Notes
- 1.
We will only consider this so-called strong version of distinguishability, where the probability of success must tend to 1 as n →∞, as opposed to the weak version where this probability need only exceed \(\frac {1}{2}\) by a constant. For high-dimensional problems, the strong version typically coincides with important notions of estimating the planted signal (see Sect. 4.2.6), whereas the weak version is often trivial.
- 2.
For instance, and this is what will be relevant in the examples we consider later, any pair of non-degenerate multivariate Gaussian distributions satisfies this assumption.
- 3.
It is important to note that, from the point of view of statistics, we are restricting our attention to the special case of deciding between two “simple” hypotheses, where each hypothesis consists of the dataset being drawn from a specific distribution. Optimal testing is more subtle for “composite” hypotheses in parametric families of probability distributions, a more typical setting in practice. The mathematical difficulties of this extended setting are discussed thoroughly in [75].
- 4.
For readers not familiar with the Radon–Nikodym derivative: if \(\mathbb {P}\), \(\mathbb {Q}\) are discrete distributions then \(L(\boldsymbol Y) = \mathbb {P}(\boldsymbol Y)/\mathbb {Q}(\boldsymbol Y)\); if \(\mathbb {P}\), \(\mathbb {Q}\) are continuous distributions with density functions p, q (respectively) then \(L(\boldsymbol Y) = p(\boldsymbol Y)/q(\boldsymbol Y)\).
- 5.
For a more precise definition of \(L^2(\mathbb {Q}_n)\) (in particular including issues around functions differing on sets of measure zero) see a standard reference on real analysis such as [100].
- 6.
To clarify, orthogonal projection is with respect to the inner product induced by \(\mathbb {Q}_n\) (see Definition 7).
- 7.
Two techniques from this calculation are elements of the “replica method” from statistical physics: (1) writing a power of an expectation as an expectation over independent “replicas” and (2) changing the order of expectations and evaluating the moment-generating function. The interested reader may see [82] for an early reference, or [21, 79] for two recent presentations.
- 8.
We will not actually use the definition of the univariate Hermite polynomials (although we will use certain properties that they satisfy as needed), but the definition is included for completeness in Appendix “Hermite Polynomials”.
- 9.
This model is equivalent to the more standard model in which the noise is symmetric with respect to permutations of the indices; see Appendix “Equivalence of Symmetric and Asymmetric Noise Models”.
- 10.
Concretely, one may take \(A_p = \frac {1}{\sqrt {2}} p^{-p/4-1/2}\) and \(B_p = \sqrt {2} e^{p/2} p^{-p/4}\).
- 11.
Some of these results only apply to minor variants of the spiked tensor problem, but we do not expect this difference to be important.
- 12.
Gaussian Orthogonal Ensemble (GOE): W is a symmetric n × n matrix with entries \(W_{ii} \sim \mathscr {N}(0,2/n)\) and \(W_{ij} = W_{ji} \sim \mathscr {N}(0,1/n)\), independently.
- 13.
In the sparse Rademacher prior, each entry of x is nonzero with probability ρ (independently), and the nonzero entries are drawn uniformly from \(\{\pm 1/\sqrt {\rho }\}\).
- 14.
More specifically, \(\|L_n^{\le D}\|^2 - 1\) is the variance of a certain pseudo-expectation value generated by pseudo-calibration, whose actual value in a valid pseudo-expectation must be exactly 1. It appears to be impossible to “correct” this part of the pseudo-expectation if the variance is diverging with n.
- 15.
Here, “best” is in the sense of strongly distinguishing \(\mathbb {P}_n\) and \(\mathbb {Q}_n\) throughout the largest possible regime of model parameters.
- 16.
In [47], it is shown that for a fairly general class of average-case hypothesis testing problems, if SoS succeeds in some range of parameters then there is a low-degree spectral method whose maximum positive eigenvalue succeeds (in a somewhat weaker range of parameters). However, the resulting matrix could a priori have an arbitrarily large (in magnitude) negative eigenvalue, which would prevent the spectral method from running in polynomial time. For this same reason, it seems difficult to establish a formal connection between SoS and the LDLR via spectral methods.
- 17.
Indeed, coordinate degree need not be phrased in terms of polynomials: one may equivalently consider the linear subspace of \(L^2(\mathbb {Q}_n)\) spanned by all functions depending on at most D coordinates at a time.
- 18.
Non-trivial estimation of a signal \(\boldsymbol x \in \mathbb {R}^n\) means having an estimator \(\hat {\boldsymbol x}\) achieving \(|\langle \hat {\boldsymbol x}, \boldsymbol x \rangle |/(\|\hat {\boldsymbol x}\| \cdot \|\boldsymbol x\|) \ge \varepsilon \) with high probability, for some constant ε > 0.
References
A. Auffinger, G. Ben Arous, J. Černý, Random matrices and complexity of spin glasses. Commun. Pure Appl. Math. 66(2), 165–201 (2013)
D. Achlioptas, A. Coja-Oghlan, Algorithmic barriers from phase transitions, in 2008 49th Annual IEEE Symposium on Foundations of Computer Science (IEEE, Piscataway, 2008), pp. 793–802
A. Anandkumar, Y. Deng, R. Ge, H. Mobahi, Homotopy analysis for tensor PCA (2016). arXiv preprint arXiv:1610.09322
N. Alon, M. Krivelevich, B. Sudakov, Finding a large hidden clique in a random graph. Random Struct. Algorithms 13(3–4), 457–466 (1998)
A.A. Amini, M.J. Wainwright, High-dimensional analysis of semidefinite relaxations for sparse principal components, in 2008 IEEE International Symposium on Information Theory (IEEE, Piscataway, 2008), pp. 2454–2458
N. Alon, R. Yuster, U. Zwick, Color-coding. J. ACM 42(4), 844–856 (1995)
M. Brennan, G. Bresler, Optimal average-case reductions to sparse PCA: from weak assumptions to strong hardness (2019). arXiv preprint arXiv:1902.07380
M. Brennan, G. Bresler, W. Huleihel, Reducibility and computational lower bounds for problems with planted sparse structure (2018). arXiv preprint arXiv:1806.07508
J. Baik, G. Ben Arous, S. Péché, Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Ann. Probab. 33(5), 1643–1697 (2005)
J. Barbier, M. Dia, N. Macris, F. Krzakala, T. Lesieur, L. Zdeborová, Mutual information for symmetric rank-one matrix estimation: a proof of the replica formula, in Proceedings of the 30th International Conference on Neural Information Processing Systems (Curran Associates, 2016), pp. 424–432
V.V.S.P. Bhattiprolu, M. Ghosh, V. Guruswami, E. Lee, M. Tulsiani, Multiplicative approximations for polynomial optimization over the unit sphere. Electron. Colloq. Comput. Complexity 23, 185 (2016)
G. Ben Arous, R. Gheissari, A. Jagannath, Algorithmic thresholds for tensor PCA (2018). arXiv preprint arXiv:1808.00921
V. Bhattiprolu, V. Guruswami, E. Lee, Sum-of-squares certificates for maxima of random tensors on the sphere (2016). arXiv preprint arXiv:1605.00903
F. Benaych-Georges, R. Rao Nadakuditi, The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Adv. Math. 227(1), 494–521 (2011)
B. Barak, S. Hopkins, J. Kelner, P.K. Kothari, A. Moitra, A. Potechin, A nearly tight sum-of-squares lower bound for the planted clique problem. SIAM J. Comput. 48(2), 687–735 (2019)
A. Blum, A. Kalai, H. Wasserman, Noise-tolerant learning, the parity problem, and the statistical query model. J. ACM 50(4), 506–519 (2003)
A.S. Bandeira, D. Kunisky, A.S. Wein, Computational hardness of certifying bounds on constrained PCA problems (2019). arXiv preprint arXiv:1902.07324
C. Bordenave, M. Lelarge, L. Massoulié, Non-backtracking spectrum of random graphs: community detection and non-regular Ramanujan graphs, in 2015 IEEE 56th Annual Symposium on Foundations of Computer Science (IEEE, Piscataway, 2015), pp. 1347–1357
J. Banks, C. Moore, J. Neeman, P. Netrapalli, Information-theoretic thresholds for community detection in sparse networks, in Conference on Learning Theory (2016), pp. 383–416
J. Banks, C. Moore, R. Vershynin, N. Verzelen, J. Xu, Information-theoretic bounds and phase transitions in clustering, sparse PCA, and submatrix localization. IEEE Trans. Inform. Theory 64(7), 4872–4894 (2018)
A.S. Bandeira, A. Perry, A.S. Wein, Notes on computational-to-statistical gaps: predictions using statistical physics (2018). arXiv preprint arXiv:1803.11132
Q. Berthet, P. Rigollet, Computational lower bounds for sparse PCA (2013). arXiv preprint arXiv:1304.0828
B. Barak, D. Steurer, Proofs, beliefs, and algorithms through the lens of sum-of-squares. Course Notes (2016). http://www.sumofsquares.org/public/index.html
W.-K. Chen, D. Gamarnik, D. Panchenko, M. Rahman, Suboptimality of local algorithms for a class of max-cut problems. Ann. Probab. 47(3), 1587–1618 (2019)
Y. Deshpande, E. Abbe, A. Montanari, Asymptotic mutual information for the two-groups stochastic block model (2015). arXiv preprint arXiv:1507.08685
M. Dyer, A. Frieze, M. Jerrum, On counting independent sets in sparse graphs. SIAM J. Comput. 31(5), 1527–1541 (2002)
I. Diakonikolas, G. Kamath, D. Kane, J. Li, A. Moitra, A. Stewart, Robust estimators in high-dimensions without the computational intractability. SIAM J. Comput. 48(2), 742–864 (2019)
A. Decelle, F. Krzakala, C. Moore, L. Zdeborová, Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E 84(6), 066106 (2011)
A. Decelle, F. Krzakala, C. Moore, L. Zdeborová, Inference and phase transitions in the detection of modules in sparse networks. Phys. Rev. Lett. 107(6), 065701 (2011)
I. Diakonikolas, D.M. Kane, A. Stewart, Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures, in 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS) (IEEE, Piscataway, 2017), pp. 73–84
Y. Ding, D. Kunisky, A.S. Wein, A.S. Bandeira, Subexponential-time algorithms for sparse PCA (2019). arXiv preprint
Y. Deshpande, A. Montanari, Sparse PCA via covariance thresholding, in Advances in Neural Information Processing Systems (2014), pp. 334–342
Y. Deshpande, A. Montanari, Finding hidden cliques of size \(\sqrt{N/e}\) in nearly linear time. Found. Comput. Math. 15(4), 1069–1128 (2015)
Y. Deshpande, A. Montanari, Improved sum-of-squares lower bounds for hidden clique and hidden submatrix problems, in Conference on Learning Theory (2015), pp. 523–562
D.L. Donoho, A. Maleki, A. Montanari, Message-passing algorithms for compressed sensing. Proc. Nat. Acad. Sci. 106(45), 18914–18919 (2009)
A. El Alaoui, F. Krzakala, Estimation in the spiked Wigner model: a short proof of the replica formula, in 2018 IEEE International Symposium on Information Theory (ISIT) (IEEE, Piscataway, 2018), pp. 1874–1878
A. El Alaoui, F. Krzakala, M.I. Jordan, Finite size corrections and likelihood ratio fluctuations in the spiked Wigner model (2017). arXiv preprint arXiv:1710.02903
A. El Alaoui, F. Krzakala, M.I. Jordan, Fundamental limits of detection in the spiked Wigner model (2018). arXiv preprint arXiv:1806.09588
V. Feldman, E. Grigorescu, L. Reyzin, S.S. Vempala, Y. Xiao, Statistical algorithms and a lower bound for detecting planted cliques. J. ACM 64(2), 8 (2017)
U. Feige, J. Kilian, Heuristics for semirandom graph problems. J. Comput. Syst. Sci. 63(4), 639–671 (2001)
D. Féral, S. Péché, The largest eigenvalue of rank one deformation of large Wigner matrices. Commun. Math. Phys. 272(1), 185–228 (2007)
V. Feldman, W. Perkins, S. Vempala, On the complexity of random satisfiability problems with planted solutions. SIAM J. Comput. 47(4), 1294–1338 (2018)
D. Grigoriev, Linear lower bound on degrees of Positivstellensatz calculus proofs for the parity. Theor. Comput. Sci. 259(1–2), 613–622 (2001)
D. Gamarnik, M. Sudan, Limits of local algorithms over sparse random graphs, in Proceedings of the 5th Conference on Innovations in Theoretical Computer Science (ACM, New York, 2014), pp. 369–376
D. Gamarnik, I. Zadik, Sparse high-dimensional linear regression: algorithmic barriers and a local search algorithm (2017). arXiv preprint arXiv:1711.04952
D. Gamarnik, I. Zadik, The landscape of the planted clique problem: dense subgraphs and the overlap gap property (2019). arXiv preprint arXiv:1904.07174
S.B. Hopkins, P.K. Kothari, A. Potechin, P. Raghavendra, T. Schramm, D. Steurer, The power of sum-of-squares for detecting hidden structures, in 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS) (IEEE, Piscataway, 2017), pp. 720–731
S. Hopkins, Statistical Inference and the Sum of Squares Method. PhD thesis, Cornell University, August 2018
S.B. Hopkins, D. Steurer, Bayesian estimation from few samples: community detection and related problems (2017). arXiv preprint arXiv:1710.00264
S.B. Hopkins, J. Shi, D. Steurer, Tensor principal component analysis via sum-of-square proofs, in Conference on Learning Theory (2015), pp. 956–1006
S.B. Hopkins, T. Schramm, J. Shi, D. Steurer, Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors, in Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing (ACM, New York, 2016), pp. 178–191
B. Hajek, Y. Wu, J. Xu, Computational lower bounds for community detection on random graphs, in Conference on Learning Theory (2015), pp. 899–928
S. Janson, Gaussian Hilbert Spaces, vol. 129 (Cambridge University Press, Cambridge, 1997)
M. Jerrum, Large cliques elude the Metropolis process. Random Struct. Algorithms 3(4), 347–359 (1992)
I.M. Johnstone, A.Y. Lu, Sparse principal components analysis. Unpublished Manuscript (2004)
I.M. Johnstone, A.Y. Lu, On consistency and sparsity for principal components analysis in high dimensions. J. Am. Stat. Assoc. 104(486), 682–693 (2009)
A. Jagannath, P. Lopatto, L. Miolane, Statistical thresholds for tensor PCA (2018). arXiv preprint arXiv:1812.03403
M. Kearns, Efficient noise-tolerant learning from statistical queries. J. ACM 45(6), 983–1006 (1998)
F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborová, P. Zhang, Spectral redemption in clustering sparse networks. Proc. Nat. Acad. Sci. 110(52), 20935–20940 (2013)
P.K. Kothari, R. Mori, R. O’Donnell, D. Witmer, Sum of squares lower bounds for refuting any CSP, in Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (ACM, New York, 2017), pp. 132–145
F. Krzakała, A. Montanari, F. Ricci-Tersenghi, G. Semerjian, L. Zdeborová, Gibbs states and the set of solutions of random constraint satisfaction problems. Proc. Nat. Acad. Sci. 104(25), 10318–10323 (2007)
R. Krauthgamer, B. Nadler, D. Vilenchik, Do semidefinite relaxations solve sparse PCA up to the information limit? Ann. Stat. 43(3), 1300–1322 (2015)
A.R. Klivans, A.A. Sherstov, Unconditional lower bounds for learning intersections of halfspaces. Mach. Learn. 69(2–3), 97–114 (2007)
L. Kučera, Expected complexity of graph partitioning problems. Discrete Appl. Math. 57(2–3), 193–212 (1995)
R. Kannan, S. Vempala, Beyond spectral: Tight bounds for planted Gaussians (2016). arXiv preprint arXiv:1608.03643
F. Krzakala, J. Xu, L. Zdeborová, Mutual information in rank-one matrix estimation, in 2016 IEEE Information Theory Workshop (ITW) (IEEE, Piscataway, 2016), pp. 71–75
J.B. Lasserre, Global optimization with polynomials and the problem of moments. SIAM J. Optim. 11(3), 796–817 (2001)
L. Le Cam, Asymptotic Methods in Statistical Decision Theory (Springer, Berlin, 2012)
L. Le Cam, Locally asymptotically normal families of distributions. Univ. California Publ. Stat. 3, 37–98 (1960)
T. Lesieur, F. Krzakala, L. Zdeborová, MMSE of probabilistic low-rank matrix estimation: universality with respect to the output channel, in 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton) (IEEE, Piscataway, 2015), pp. 680–687
T. Lesieur, F. Krzakala, L. Zdeborová, Phase transitions in sparse PCA, in 2015 IEEE International Symposium on Information Theory (ISIT) (IEEE, Piscataway, 2015), pp. 1635–1639
A.K. Lenstra, H.W. Lenstra, L. Lovász, Factoring polynomials with rational coefficients. Math. Ann. 261(4), 515–534 (1982)
M. Lelarge, L. Miolane, Fundamental limits of symmetric low-rank matrix estimation. Probab. Theory Related Fields 173(3–4), 859–929 (2019)
T. Lesieur, L. Miolane, M. Lelarge, F. Krzakala, L. Zdeborová, Statistical and computational phase transitions in spiked tensor estimation, in 2017 IEEE International Symposium on Information Theory (ISIT) (IEEE, Piscataway, 2017), pp. 511–515
E.L. Lehmann, J.P. Romano, Testing Statistical Hypotheses (Springer, Berlin, 2006)
L. Massoulié, Community detection thresholds and the weak Ramanujan property, in Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing (ACM, New York, 2014), pp. 694–703
L. Miolane, Phase transitions in spiked matrix estimation: information-theoretic analysis (2018). arXiv preprint arXiv:1806.04343
S.S. Mannelli, F. Krzakala, P. Urbani, L. Zdeborová, Passed & spurious: descent algorithms and local minima in spiked matrix-tensor models, in International Conference on Machine Learning (2019), pp. 4333–4342
M. Mezard, A. Montanari, Information, Physics, and Computation (Oxford University Press, Oxford, 2009)
E. Mossel, J. Neeman, A. Sly, Reconstruction and estimation in the planted partition model. Probab. Theory Related Fields 162(3–4), 431–461 (2015)
E. Mossel, J. Neeman, A. Sly, A proof of the block model threshold conjecture. Combinatorica 38(3), 665–708 (2018)
M. Mézard, G. Parisi, M. Virasoro, Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications, vol. 9 (World Scientific Publishing Company, Singapore, 1987)
R. Meka, A. Potechin, A. Wigderson, Sum-of-squares lower bounds for planted clique, in Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing (ACM, New York, 2015), pp. 87–96
A. Montanari, D. Reichman, O. Zeitouni, On the limitation of spectral methods: from the Gaussian hidden clique problem to rank-one perturbations of Gaussian tensors, in Advances in Neural Information Processing Systems (2015), pp. 217–225
L. Massoulié, L. Stephan, D. Towsley, Planting trees in graphs, and finding them back (2018). arXiv preprint arXiv:1811.01800
T. Ma, A. Wigderson, Sum-of-squares lower bounds for sparse PCA, in Advances in Neural Information Processing Systems (2015), pp. 1612–1620
J. Neyman, E.S. Pearson, IX. On the problem of the most efficient tests of statistical hypotheses. Philos. Trans. R. Soc. Lond. Ser. A, Containing Papers of a Mathematical or Physical Character 231(694–706), 289–337 (1933)
R. O’Donnell, Analysis of Boolean Functions (Cambridge University Press, Cambridge, 2014)
P.A. Parrilo, Structured Semidefinite Programs and Semialgebraic Geometry Methods in Robustness and Optimization. PhD thesis, California Institute of Technology, 2000
A. Perry, A.S. Wein, A.S. Bandeira, Statistical limits of spiked tensor models (2016). arXiv preprint arXiv:1612.07728
A. Perry, A.S. Wein, A.S. Bandeira, A. Moitra, Optimality and sub-optimality of PCA I: spiked random matrix models. Ann. Stat. 46(5), 2416–2451 (2018)
P. Rigollet, J.-C. Hütter, High-dimensional statistics. Lecture Notes, 2018
E. Richard, A. Montanari, A statistical model for tensor PCA, in Advances in Neural Information Processing Systems (2014), pp. 2897–2905
P. Raghavendra, S. Rao, T. Schramm, Strongly refuting random CSPs below the spectral threshold, in Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (ACM, New York, 2017), pp. 121–131
P. Raghavendra, T. Schramm, D. Steurer, High-dimensional estimation via sum-of-squares proofs (2018). arXiv preprint arXiv:1807.11419
R.W. Robinson, N.C. Wormald, Almost all cubic graphs are Hamiltonian. Random Struct. Algorithms 3(2), 117–125 (1992)
R.W. Robinson, N.C. Wormald, Almost all regular graphs are Hamiltonian. Random Struct. Algorithms 5(2), 363–374 (1994)
G. Schoenebeck, Linear level Lasserre lower bounds for certain k-CSPs, in 2008 49th Annual IEEE Symposium on Foundations of Computer Science (IEEE, Piscataway, 2008), pp. 593–602
A. Saade, F. Krzakala, L. Zdeborová, Spectral clustering of graphs with the Bethe Hessian, in Advances in Neural Information Processing Systems (2014), pp. 406–414
E.M. Stein, R. Shakarchi, Real Analysis: Measure Theory, Integration, and Hilbert Spaces (Princeton University Press, Princeton, 2009)
G. Szegö, Orthogonal Polynomials, vol. 23 (American Mathematical Society, 1939)
T. Wang, Q. Berthet, Y. Plan, Average-case hardness of RIP certification, in Advances in Neural Information Processing Systems (2016), pp. 3819–3827
T. Wang, Q. Berthet, R.J. Samworth, Statistical and computational trade-offs in estimation of sparse principal components. Ann. Stat. 44(5), 1896–1930 (2016)
A.S. Wein, A. El Alaoui, C. Moore, The Kikuchi hierarchy and tensor PCA (2019). arXiv preprint arXiv:1904.03858
I. Zadik, D. Gamarnik, High dimensional linear regression using lattice basis reduction, in Advances in Neural Information Processing Systems (2018), pp. 1842–1852
L. Zdeborová, F. Krzakala, Statistical physics of inference: thresholds and algorithms. Adv. Phys. 65(5), 453–552 (2016)
Acknowledgements
We thank the participants of a working group on the subject of these notes, organized by the authors at the Courant Institute of Mathematical Sciences during the spring of 2019. We also thank Samuel B. Hopkins, Philippe Rigollet, and David Steurer for helpful discussions.
DK was partially supported by NSF grants DMS-1712730 and DMS-1719545. ASW was partially supported by NSF grant DMS-1712730 and by the Simons Collaboration on Algorithms and Geometry. ASB was partially supported by NSF grants DMS-1712730 and DMS-1719545, and by a grant from the Sloan Foundation.
Appendices
Appendix 1: Omitted Proofs
Neyman–Pearson Lemma
We include here, for completeness, a proof of the classical Neyman–Pearson lemma [87].
Proof of Lemma 1
Note first that a test f is completely determined by its rejection region, \(R_f = \{\boldsymbol Y: f(\boldsymbol Y) = \mathbb {P}\}\). We may rewrite the power of f as

$$\beta(f) = \mathbb{P}(R_f) = \mathop{\mathbb{E}}_{\boldsymbol Y \sim \mathbb{Q}}\big[L(\boldsymbol Y)\, \mathbb{1}\{\boldsymbol Y \in R_f\}\big].$$

On the other hand, our assumption on α(f) is equivalent to

$$\mathbb{Q}(R_f) \le \mathbb{Q}(\{L > \eta\}).$$

Thus, we are interested in solving the optimization

$$\text{maximize}\ \ \mathop{\mathbb{E}}_{\boldsymbol Y \sim \mathbb{Q}}\big[L(\boldsymbol Y)\, \mathbb{1}\{\boldsymbol Y \in R\}\big] \quad \text{subject to} \quad \mathbb{Q}(R) \le \mathbb{Q}(\{L > \eta\}).$$

From this form, let us write \(R^\star = \{L > \eta\}\) for the rejection region of the likelihood ratio test; then the difference of powers is

$$\mathop{\mathbb{E}}_{\boldsymbol Y \sim \mathbb{Q}}\big[L\,\big(\mathbb{1}\{\boldsymbol Y \in R^\star\} - \mathbb{1}\{\boldsymbol Y \in R_f\}\big)\big] \ge \eta\, \big(\mathbb{Q}(R^\star) - \mathbb{Q}(R_f)\big) \ge 0,$$

where the first inequality holds pointwise (on \(R^\star \setminus R_f\) we have \(L > \eta\), while on \(R_f \setminus R^\star\) we have \(L \le \eta\)), completing the proof.
Equivalence of Symmetric and Asymmetric Noise Models
For technical convenience, in the main text we worked with an asymmetric version of the spiked Wigner model (see Sect. 3.2), \(\boldsymbol Y = \lambda \boldsymbol x \boldsymbol x^\top + \boldsymbol Z\) where Z has i.i.d. \(\mathscr {N}(0,1)\) entries. A more standard model is to instead observe \(\widetilde {\boldsymbol Y} = \frac {1}{2}(\boldsymbol Y + \boldsymbol Y^\top ) = \lambda \boldsymbol x \boldsymbol x^\top + \boldsymbol W\), where W is symmetric with \(\mathscr {N}(0,1)\) diagonal entries and \(\mathscr {N}(0,1/2)\) off-diagonal entries, all independent. These two models are equivalent, in the sense that if we are given a sample from one then we can produce a sample from the other. Clearly, if we are given Y, we can symmetrize it to form \(\widetilde {\boldsymbol Y}\). Conversely, if we are given \(\widetilde {\boldsymbol Y}\), we can draw an independent matrix G with i.i.d. \(\mathscr {N}(0,1)\) entries, and compute \(\widetilde {\boldsymbol Y} + \frac {1}{2}(\boldsymbol G - \boldsymbol G^\top )\); one can check that the resulting matrix has the same distribution as Y (we are adding back the “skew-symmetric part” that is present in Y but not in \(\widetilde {\boldsymbol Y}\)).
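To make the reduction concrete, here is a minimal NumPy sketch of both directions (our illustration, not part of the original text; the Rademacher spike and the values of n and λ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 500, 1.5
x = rng.choice([-1.0, 1.0], size=n) / np.sqrt(n)  # an arbitrary unit-norm spike

# Asymmetric model: Y = lam * x x^T + Z with Z i.i.d. N(0,1).
Z = rng.standard_normal((n, n))
Y = lam * np.outer(x, x) + Z

# Direction 1: symmetrize to get the standard symmetric observation.
Y_tilde = (Y + Y.T) / 2  # noise: N(0,1) on the diagonal, N(0,1/2) off-diagonal

# Direction 2: re-add an independent skew-symmetric part to recover
# a sample distributed as Y from the asymmetric model.
G = rng.standard_normal((n, n))
Y_recovered = Y_tilde + (G - G.T) / 2

# Empirical check of the noise variances in the symmetric model.
W = Y_tilde - lam * np.outer(x, x)
print(np.var(np.diag(W)), np.var(W[np.triu_indices(n, k=1)]))  # ~1.0, ~0.5
```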
In the spiked tensor model (see Sect. 3.1), our asymmetric noise model is similarly equivalent to the standard symmetric model defined in [93] (in which the noise tensor Z is averaged over all permutations of indices). Since we can treat each entry of the symmetric tensor separately, it is sufficient to show the following one-dimensional fact: for unknown \(x \in \mathbb {R}\), k samples of the form \(y_i = x + \mathscr {N}(0,1)\) are equivalent to one sample of the form \(\tilde y = x + \mathscr {N}(0,1/k)\). Given \(\{y_i\}\), we can sample \(\tilde y\) by averaging: \(\tilde y = \frac {1}{k}\sum _{i=1}^k y_i\). For the converse, fix unit vectors \(\boldsymbol a_1, \dots , \boldsymbol a_k\) at the corners of a simplex in \(\mathbb {R}^{k-1}\); these satisfy \(\langle \boldsymbol a_i,\boldsymbol a_j \rangle = -\frac {1}{k-1}\) for all \(i \ne j\). Given \(\tilde y\), draw \(\boldsymbol u \sim \mathscr {N}(0,{\boldsymbol I}_{k-1})\) and let \(y_i = \tilde y + \sqrt {1-1/k} \,\langle \boldsymbol a_i,\boldsymbol u \rangle \); one can check that these have the correct distribution.
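The covariance structure behind this equivalence is easy to verify; the following sketch (our addition; realizing the simplex by centering the standard basis is one convenient choice) checks it:

```python
import numpy as np

k = 5

# Unit vectors a_1, ..., a_k at the corners of a simplex, realized inside R^k
# by centering the standard basis vectors and normalizing; they span a
# (k-1)-dimensional subspace and satisfy <a_i, a_j> = -1/(k-1) for i != j.
A = np.eye(k) - np.ones((k, k)) / k
A /= np.linalg.norm(A, axis=1, keepdims=True)
print(np.round(A @ A.T, 6))  # 1 on the diagonal, -1/(k-1) off it

# Given y_tilde = x + N(0,1/k), the values y_i = y_tilde + sqrt(1-1/k) <a_i, u>
# have covariance matrix (1/k) + (1-1/k) <a_i, a_j>, i.e. variance 1 and
# covariance 0: they are distributed as k independent samples x + N(0,1).
cov = np.full((k, k), 1 / k) + (1 - 1 / k) * (A @ A.T)
print(np.round(cov, 6))  # the identity matrix
```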
Low-Degree Analysis of Spiked Wigner Above the PCA Threshold
Proof of Theorem 6
We follow the proof of Theorem 2(ii) in Sect. 3.1.2. For any choice of d ≤ D, using the standard bound \(\binom {2d}{d} \ge 4^d/(2\sqrt {d})\),
Since \(\hat \lambda > 1\), this diverges as n →∞ provided we choose d ≤ D with d = ω(1) and d = o(n).
Appendix 2: Omitted Probability Theory Background
Hermite Polynomials
Here we give definitions and basic facts regarding the Hermite polynomials (see, e.g., [101] for further details), which are orthogonal polynomials with respect to the standard Gaussian measure.
Definition 15
The univariate Hermite polynomials are the sequence of polynomials \(h_k(x) \in \mathbb {R}[x]\) for k ≥ 0 defined by the recursion

$$h_0(x) = 1, \qquad h_1(x) = x, \qquad h_{k+1}(x) = x\, h_k(x) - k\, h_{k-1}(x) \quad (k \ge 1).$$
The normalized univariate Hermite polynomials are \(\widehat {h}_k(x) = h_k(x) / \sqrt {k!}\).
The following is the key property of the Hermite polynomials, which allows functions in \(L^2(\mathscr {N}(0, 1))\) to be expanded in terms of them.
Proposition 10
The normalized univariate Hermite polynomials form a complete orthonormal system of polynomials for \(L^2(\mathscr {N}(0, 1))\).
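As a numerical sanity check of this orthonormality claim (the sketch below is our addition; it relies only on NumPy's probabilists' Hermite utilities, and the choice of 50 quadrature nodes is arbitrary):

```python
import numpy as np
from math import factorial, sqrt, pi
from numpy.polynomial import hermite_e as He  # probabilists' Hermite basis

def h(k, y):
    """Evaluate the (unnormalized) Hermite polynomial h_k = He_k at y."""
    return He.hermeval(y, [0.0] * k + [1.0])

# Gauss quadrature against the weight exp(-x^2/2); dividing by sqrt(2*pi)
# turns quadrature sums into expectations under N(0,1). Exact for degree < 100.
nodes, weights = He.hermegauss(50)
for j in range(6):
    for k in range(6):
        inner = (weights @ (h(j, nodes) * h(k, nodes))) / sqrt(2 * pi)
        inner /= sqrt(factorial(j) * factorial(k))  # normalize by sqrt(j! k!)
        assert abs(inner - (1.0 if j == k else 0.0)) < 1e-8
print("normalized Hermite polynomials are orthonormal in L^2(N(0,1))")
```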
The following are the multivariate generalizations of the above definition that we used throughout the main text.
Definition 16
The N-variate Hermite polynomials are the polynomials \(H_{\boldsymbol \alpha }(\boldsymbol y) = \prod _{i=1}^{N} h_{\alpha _i}(y_i)\) for \(\boldsymbol \alpha \in \mathbb {N}^N\). The normalized N-variate Hermite polynomials are the polynomials \(\widehat {H}_{\boldsymbol \alpha }(\boldsymbol y) = \prod _{i=1}^{N} \widehat {h}_{\alpha _i}(y_i)\) for \(\boldsymbol \alpha \in \mathbb {N}^N\).
Again, the following is the key property justifying expansions in terms of these polynomials.
Proposition 11
The normalized N-variate Hermite polynomials form a complete orthonormal system of (multivariate) polynomials for \(L^2(\mathscr {N}(\boldsymbol 0, \boldsymbol I_N))\).
For the sake of completeness, we also provide proofs below of the three identities concerning univariate Hermite polynomials that we used in Sect. 2.3 to derive the norm of the LDLR under the additive Gaussian noise model. It is more convenient to prove these in a different order than they were presented in Sect. 2.3, since one identity is especially useful for proving the others.
Proof of Proposition 8, Integration by Parts
Recall that we are assuming a function \(f: \mathbb {R} \to \mathbb {R}\) is k times continuously differentiable and f and its derivatives are \(O(\exp(|x|^\alpha))\) for α ∈ (0, 2), and we want to show the identity

$$\mathop{\mathbb{E}}_{y \sim \mathscr{N}(0,1)}\left[f(y)\, h_k(y)\right] = \mathop{\mathbb{E}}_{y \sim \mathscr{N}(0,1)}\left[f^{(k)}(y)\right].$$
We proceed by induction. Since \(h_0(y) = 1\), the case k = 0 follows immediately. We also verify by hand the case k = 1, with \(h_1(y) = y\):

$$\mathop{\mathbb{E}}_{y \sim \mathscr{N}(0,1)}\left[f(y)\, y\right] = \frac{1}{\sqrt{2\pi}} \int f(y)\, y\, e^{-y^2/2}\, dy = \frac{1}{\sqrt{2\pi}} \int f'(y)\, e^{-y^2/2}\, dy = \mathop{\mathbb{E}}_{y \sim \mathscr{N}(0,1)}\left[f'(y)\right],$$

where we have used ordinary integration by parts together with \(y\, e^{-y^2/2} = -\frac{d}{dy} e^{-y^2/2}\) (the boundary terms vanish by the growth assumption on f).
Now, suppose the identity holds for all degrees smaller than some k ≥ 2, and expand the degree k case according to the recursion:

$$\mathop{\mathbb{E}}\left[f(y)\, h_k(y)\right] = \mathop{\mathbb{E}}\left[f(y)\, y\, h_{k-1}(y)\right] - (k-1) \mathop{\mathbb{E}}\left[f(y)\, h_{k-2}(y)\right] = \mathop{\mathbb{E}}\left[\frac{d}{dy}\big(f(y)\, h_{k-1}(y)\big)\right] - (k-1) \mathop{\mathbb{E}}\left[f(y)\, h_{k-2}(y)\right].$$

Since \(h_{k-1}'(y) = (k-1)\, h_{k-2}(y)\) (a further consequence of the recursion), expanding the derivative by the product rule cancels the last term, leaving

$$\mathop{\mathbb{E}}\left[f(y)\, h_k(y)\right] = \mathop{\mathbb{E}}\left[f'(y)\, h_{k-1}(y)\right] = \mathop{\mathbb{E}}\left[f^{(k)}(y)\right],$$

where we have used the degree 1 and then the degree k − 1 hypotheses.
Proof of Proposition 7, Translation Identity
Recall that we want to show, for all k ≥ 0 and \(\mu \in \mathbb {R}\), that

$$\mathop{\mathbb{E}}_{y \sim \mathscr{N}(0,1)}\left[h_k(y + \mu)\right] = \mu^k.$$
We proceed by induction on k. Since \(h_0(y) = 1\), the case k = 0 is immediate. Now, suppose the identity holds for degree k − 1, and expand the degree k case according to the recursion:

$$\mathop{\mathbb{E}}\left[h_k(y+\mu)\right] = \mathop{\mathbb{E}}\left[(y+\mu)\, h_{k-1}(y+\mu)\right] - (k-1) \mathop{\mathbb{E}}\left[h_{k-2}(y+\mu)\right],$$

which may be simplified by the Gaussian integration by parts (using \(\mathbb{E}[y\, g(y)] = \mathbb{E}[g'(y)]\) and \(h_{k-1}' = (k-1)\, h_{k-2}\)) to

$$\mathop{\mathbb{E}}\left[h_k(y+\mu)\right] = \mu \mathop{\mathbb{E}}\left[h_{k-1}(y+\mu)\right] + (k-1) \mathop{\mathbb{E}}\left[h_{k-2}(y+\mu)\right] - (k-1) \mathop{\mathbb{E}}\left[h_{k-2}(y+\mu)\right] = \mu \mathop{\mathbb{E}}\left[h_{k-1}(y+\mu)\right],$$

and the result follows by the inductive hypothesis.
Proof of Proposition 9, Generating Function
Recall that we want to show the series identity, for any \(x, y \in \mathbb {R}\),

$$\sum_{k=0}^{\infty} \frac{x^k}{k!}\, h_k(y) = \exp\left(xy - \frac{x^2}{2}\right).$$

For any fixed x, the left-hand side belongs to \(L^2(\mathscr {N}(0, 1))\) in the variable y. Thus this is merely a claim about the Hermite coefficients of this function, which may be computed by taking inner products. Namely, let us write \(f_x(y) = \exp(xy - x^2/2)\); then, using Gaussian integration by parts and the fact that \(\frac{d}{dy} f_x(y) = x\, f_x(y)\),

$$\langle f_x, \widehat{h}_k \rangle = \mathop{\mathbb{E}}_{y \sim \mathscr{N}(0,1)}\left[f_x(y)\, \widehat{h}_k(y)\right] = \frac{1}{\sqrt{k!}} \mathop{\mathbb{E}}_{y \sim \mathscr{N}(0,1)}\left[f_x^{(k)}(y)\right] = \frac{x^k}{\sqrt{k!}} \mathop{\mathbb{E}}_{y \sim \mathscr{N}(0,1)}\left[f_x(y)\right].$$

A simple calculation shows that \(\mathbb {E}_{y \sim \mathscr {N}(0, 1)}[f_x(y)] = 1\) (this is an evaluation of the Gaussian moment-generating function that we have mentioned in the main text), and then by the Hermite expansion

$$f_x(y) = \sum_{k=0}^{\infty} \langle f_x, \widehat{h}_k \rangle\, \widehat{h}_k(y) = \sum_{k=0}^{\infty} \frac{x^k}{\sqrt{k!}}\, \widehat{h}_k(y) = \sum_{k=0}^{\infty} \frac{x^k}{k!}\, h_k(y),$$

giving the result.
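These three identities can also be confirmed numerically. The following sketch is our addition, with arbitrary test values μ = 0.8 and x = 0.3:

```python
import numpy as np
from math import factorial, sqrt, pi
from numpy.polynomial import hermite_e as He

h = lambda k, y: He.hermeval(y, [0.0] * k + [1.0])  # h_k = He_k
nodes, weights = He.hermegauss(60)
E = lambda vals: (weights @ vals) / sqrt(2 * pi)  # expectation under N(0,1)

mu, x = 0.8, 0.3
for k in range(6):
    # Translation identity (Proposition 7): E[h_k(y + mu)] = mu^k.
    assert abs(E(h(k, nodes + mu)) - mu**k) < 1e-8
    # Integration by parts (Proposition 8) with f(y) = exp(x*y), for which
    # f^{(k)} = x^k f and E[f] = exp(x^2/2).
    assert abs(E(np.exp(x * nodes) * h(k, nodes)) - x**k * np.exp(x**2 / 2)) < 1e-8

# Generating function (Proposition 9): sum_k (x^k / k!) h_k(y) = exp(xy - x^2/2).
y = np.linspace(-2.0, 2.0, 7)
series = sum(x**k / factorial(k) * h(k, y) for k in range(40))
assert np.allclose(series, np.exp(x * y - x**2 / 2))
print("Propositions 7-9 verified numerically")
```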
Subgaussian Random Variables
Many of our rigorous arguments rely on the concept of subgaussianity, which we now define. See, e.g., [92] for more details.
Definition 17
For \(\sigma^2 > 0\), we say that a real-valued random variable π is \(\sigma^2\)-subgaussian if \(\mathbb {E}[\pi ] = 0\) and, for all \(t \in \mathbb {R}\), the moment-generating function \(M(t) = \mathbb {E}[\exp (t \pi )]\) of π exists and is bounded by \(M(t) \le \exp (\sigma ^2 t^2 / 2)\).
Here \(\sigma^2\) is called the variance proxy, which is not necessarily equal to the variance of π (although it can be shown that \(\sigma^2 \ge \operatorname{Var}[\pi]\)). The name subgaussian refers to the fact that \(\exp (\sigma ^2 t^2 / 2)\) is the moment-generating function of \(\mathscr {N}(0,\sigma ^2)\).
The following are some examples of (laws of) subgaussian random variables. Clearly, \(\mathscr {N}(0,\sigma ^2)\) is \(\sigma^2\)-subgaussian. By Hoeffding’s lemma, any mean-zero distribution supported on an interval [a, b] is \((b-a)^2/4\)-subgaussian. In particular, the Rademacher distribution Unif({±1}) is 1-subgaussian. Note also that the sum of n independent \(\sigma^2\)-subgaussian random variables is \(\sigma^2 n\)-subgaussian.
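As a concrete check (our addition), the two moment-generating function bounds just mentioned can be verified deterministically on a grid of t values:

```python
import numpy as np

t = np.linspace(-5, 5, 1001)

# Rademacher Unif({+1, -1}): M(t) = cosh(t); 1-subgaussianity is the
# classical inequality cosh(t) <= exp(t^2 / 2).
assert np.all(np.cosh(t) <= np.exp(t**2 / 2))

# Centered Unif[-2, 2] (so b - a = 4): M(t) = sinh(2t) / (2t), and
# Hoeffding's lemma promises the variance proxy (b - a)^2 / 4 = 4.
with np.errstate(divide="ignore", invalid="ignore"):
    mgf = np.where(t == 0, 1.0, np.sinh(2 * t) / (2 * t))
assert np.all(mgf <= np.exp(4 * t**2 / 2))
print("both moment-generating function bounds hold")
```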
Subgaussian random variables admit the following bound on their absolute moments; see Lemmas 1.3 and 1.4 of [92].
Proposition 12
If π is \(\sigma^2\)-subgaussian then

$$\mathbb{E}\left[|\pi|^k\right] \le \left(2\sigma^2\right)^{k/2} k\, \Gamma(k/2)$$
for every integer k ≥ 1.
Here Γ(⋅) denotes the gamma function which, recall, is defined for all positive real numbers and satisfies Γ(k) = (k − 1)! when k is a positive integer. We will need the following property of the gamma function.
Proposition 13
For all x > 0 and a > 0,

$$\Gamma(x + a) \le \Gamma(x)\, (x + a)^a.$$
Proof
This follows from two standard properties of the gamma function. The first is that (similarly to the factorial) Γ(x + 1)∕Γ(x) = x for all x > 0. The second is Gautschi’s inequality, which states that Γ(x + s)∕Γ(x) < (x + s)s for all x > 0 and s ∈ (0, 1).
In the context of the spiked Wigner model (Sect. 3.2), we now prove that subgaussian spike priors admit a local Chernoff bound (Definition 14).
Proposition 14
Suppose π is \(\sigma^2\)-subgaussian (for some constant \(\sigma^2 > 0\)) with \(\mathbb {E}[\pi ] = 0\) and \(\mathbb {E}[\pi ^2] = 1\). Let \((\mathscr {X}_n)\) be the spike prior that draws each entry of x i.i.d. from π (where π does not depend on n). Then \((\mathscr {X}_n)\) admits a local Chernoff bound.
Proof
Since π is subgaussian, π 2 is subexponential, which implies \(\mathbb {E}[\exp (t \pi ^2)] < \infty \) for all |t|≤ s for some s > 0 (see e.g., Lemma 1.12 of [92]).
Let \(\pi, \pi'\) be independent copies of π, and set \(\varPi = \pi \pi'\). The moment-generating function of Π is

$$M(t) = \mathbb{E}\left[\exp(t\, \pi \pi')\right] = \mathop{\mathbb{E}}_{\pi'}\left[\mathop{\mathbb{E}}_{\pi}\left[\exp\big((t \pi')\, \pi\big)\right]\right] \le \mathop{\mathbb{E}}_{\pi'}\left[\exp\left(\tfrac{1}{2} \sigma^2 t^2 (\pi')^2\right)\right] < \infty$$
provided \(\frac {1}{2}\sigma ^2 t^2 < s\), i.e. \(|t| < \sqrt {2s/\sigma ^2}\). Thus M(t) exists in an open interval containing t = 0, which implies \(M'(0) = \mathbb {E}[\varPi ] = 0\) and \(M''(0) = \mathbb {E}[\varPi ^2] = 1\) (this is the defining property of the moment-generating function: its derivatives at zero are the moments).
Let η > 0 and \(f(t) = \exp\left(\frac{t^2}{2(1-\eta)}\right)\). Since M(0) = 1, M′(0) = 0, M″(0) = 1 and, as one may check, \(f(0) = 1, f'(0) = 0, f''(0) = \frac {1}{1-\eta } > 1\), there exists δ > 0 such that, for all t ∈ [−δ, δ], M(t) exists and M(t) ≤ f(t).
We then apply the standard Chernoff bound argument to \(\langle \boldsymbol x^1,\boldsymbol x^2 \rangle = \sum _{i=1}^n \varPi _i\) where \(\varPi_1, \dots, \varPi_n\) are i.i.d. copies of Π. For any α > 0,

$$\Pr\left\{\langle \boldsymbol x^1,\boldsymbol x^2 \rangle \ge t\right\} \le e^{-\alpha t}\, \mathbb{E}\left[\exp\left(\alpha \sum_{i=1}^n \varPi_i\right)\right] = e^{-\alpha t}\, M(\alpha)^n \le e^{-\alpha t}\, f(\alpha)^n = \exp\left(-\alpha t + \frac{n \alpha^2}{2(1-\eta)}\right).$$

Taking α = (1 − η)t∕n,

$$\Pr\left\{\langle \boldsymbol x^1,\boldsymbol x^2 \rangle \ge t\right\} \le \exp\left(-\frac{(1-\eta)t^2}{n} + \frac{(1-\eta)t^2}{2n}\right) = \exp\left(-\frac{(1-\eta)t^2}{2n}\right),$$

as desired. This holds provided α ≤ δ, i.e. t ≤ δn∕(1 − η). A symmetric argument with − Π in place of Π holds for the other tail, \(\Pr \left \{\langle \boldsymbol x^1,\boldsymbol x^2 \rangle \le -t\right \}\).
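For intuition, here is a small simulation (our addition) for the Rademacher prior, where Hoeffding's inequality gives the tail bound \(\exp(-t^2/(2n))\), matching the η → 0 form of the bound above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 100_000
x1 = rng.integers(0, 2, size=(trials, n), dtype=np.int8) * 2 - 1  # Rademacher
x2 = rng.integers(0, 2, size=(trials, n), dtype=np.int8) * 2 - 1
overlap = (x1 * x2).sum(axis=1, dtype=np.int64)  # <x^1, x^2>, a sum of n products

for t in (10, 20, 30):
    emp = np.mean(overlap >= t)
    print(f"t={t}: empirical {emp:.2e} <= exp(-t^2/(2n)) = {np.exp(-t**2/(2*n)):.2e}")
```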
Hypercontractivity
The following hypercontractivity result states that the moments of low-degree polynomials of i.i.d. random variables must behave somewhat reasonably. The Rademacher version is the Bonami lemma from [88], and the Gaussian version appears in [53] (see Theorem 5.10 and Remark 5.11 of [53]). We refer the reader to [88] for a general discussion of hypercontractivity.
Proposition 15 (Bonami Lemma)
Let \(\boldsymbol x = (x_1, \dots, x_n)\) have either i.i.d. \(\mathscr {N}(0,1)\) or i.i.d. Rademacher (uniform ± 1) entries, and let \(f: \mathbb {R}^n \to \mathbb {R}\) be a polynomial of degree k. Then

$$\mathbb{E}\left[f(\boldsymbol x)^4\right] \le 9^k\, \mathbb{E}\left[f(\boldsymbol x)^2\right]^2.$$
We will combine this with the following standard second moment method.
Proposition 16 (Paley-Zygmund Inequality)
If Z ≥ 0 is a random variable with finite variance, and 0 ≤ θ ≤ 1, then

$$\Pr\left\{Z > \theta\, \mathbb{E}[Z]\right\} \ge (1-\theta)^2\, \frac{\mathbb{E}[Z]^2}{\mathbb{E}[Z^2]}.$$
By combining Propositions 16 and 15, we immediately have the following.
Corollary 2
Let \(\boldsymbol x = (x_1, \dots, x_n)\) have either i.i.d. \(\mathscr {N}(0,1)\) or i.i.d. Rademacher (uniform ± 1) entries, and let \(f: \mathbb {R}^n \to \mathbb {R}\) be a polynomial of degree k. Then, for 0 ≤ θ ≤ 1,

$$\Pr\left\{f(\boldsymbol x)^2 > \theta\, \mathbb{E}\left[f(\boldsymbol x)^2\right]\right\} \ge (1-\theta)^2\, 9^{-k}.$$
Remark 5
One rough interpretation of Corollary 2 is that if f is degree k, then \(\mathbb {E}[f(x)^2]\) cannot be dominated by an event of probability smaller than roughly \(3^{-2k}\).
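To illustrate (our addition; the degree-2 Gaussian polynomial below is an arbitrary example), one can check both bounds by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, theta = 20, 2, 0.5
x = rng.standard_normal((200_000, n))  # rows are i.i.d. N(0, I_n) samples

# An arbitrary degree-2 polynomial f(x) = sum_{i<j} c_ij x_i x_j.
C = np.triu(rng.standard_normal((n, n)), k=1)
f = ((x @ C) * x).sum(axis=1)

m2 = np.mean(f**2)
print("E[f^4]/E[f^2]^2 =", np.mean(f**4) / m2**2, " (Bonami bound: 9^k =", 9**k, ")")
print("Pr[f^2 > theta E f^2] =", np.mean(f**2 > theta * m2),
      " (Corollary 2 lower bound:", (1 - theta) ** 2 / 9**k, ")")
```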