1 Introduction

Matrix completion is one of the cornerstone problems in machine learning and has a diverse range of applications. One of the original motivations for it comes from the Netflix Problem where the goal is to predict user-movie ratings based on all the ratings we have observed so far, from across many different users. We can organize this data into a large, partially observed matrix M where each row represents a user and each column represents a movie. The goal is to fill in the missing entries. The usual assumptions are that the ratings depend on only a few hidden characteristics of each user and movie and that the underlying matrix M is approximately low rank. Another standard assumption is that it is incoherent, which we elaborate on later. How many entries of M do we need to observe in order to fill in its missing entries? And are there efficient algorithms for this task?

There have been thousands of papers on this topic and by now we have a relatively complete set of answers. A representative result (building on earlier works by Fazel [21], Recht, Fazel and Parrilo [57], Srebro and Shraibman [61], Candes and Recht [12], Candes and Tao [13]) due to Keshavan, Montanari and Oh [41] can be phrased as follows: Suppose M is an unknown \(n_1 \times n_2\) matrix that has rank r but each of its entries has been corrupted by independent Gaussian noise with standard deviation \(\delta \). Then if we observe roughly

$$\begin{aligned} m = (n_1 + n_2) r \log (n_1 + n_2) \end{aligned}$$

of its entries, the locations of which are chosen uniformly at random, there is an algorithm that outputs a matrix X that with high probability satisfies

$$\begin{aligned} \text{ err }(X) = \frac{1}{n_1 n_2} \sum _{i,j} \Big | X_{i,j} - M_{i,j} \Big | \le O(\delta ) \;. \end{aligned}$$

There are extensions to non-uniform sampling models [16, 45], as well as various efficiency improvements [33, 39]. What is particularly remarkable about these guarantees is that the number of observations needed is within a logarithmic factor of the number of parameters — \((n_1 +n_2)r\) — that define the model.

In fact, there are benefits to working with even higher-order structure but so far there has been little progress on natural extensions to the tensor setting. To motivate this problem, consider the Groupon Problem (which we introduce here to illustrate this point) where the goal is to predict user-activity ratings. The challenge is that which activities we should recommend (and how much a user liked a given activity) depends on time as well — weekday/weekend, day/night, summer/fall/winter/spring, etc. or even some combination of these. As above, we can cast this problem as a large, partially observed tensor where the first index represents a user, the second index represents an activity and the third index represents the time period. It is again natural to model it as being close to low rank, under the assumption that a much smaller number of (latent) factors about the interests of the user, the type of activity and the time period should contribute to the rating. How many entries of the tensor do we need to observe in order to fill in its missing entries? This problem is emblematic of a larger issue: Can we always solve linear inverse problems when the number of observations is comparable to the number of parameters in the model, or is computational intractability an obstacle?

In fact, one of the advantages of working with tensors is that their decompositions are unique in important ways that matrix decompositions are not. There has been a groundswell of recent work that uses tensor decompositions for exactly this reason, for parameter learning in phylogenetic trees [51], HMMs [51], mixture models [38] and topic models [2], and for community detection [3]. In these applications, one assumes access to the entire tensor (up to some sampling noise). But given that the underlying tensors are low-rank, can we observe fewer of their entries and still utilize tensor methods?

A wide range of approaches to solving tensor completion have been proposed [11, 28, 40, 43, 47, 52, 60, 62, 63]. However, in terms of provable guarantees none of them improves upon the following naïve algorithm. If the unknown tensor T is \(n_1 \times n_2 \times n_3\) we can treat it as a collection of \(n_1\) matrices each of size \(n_2 \times n_3\). It is easy to see that if T has rank at most r then each of these slices also has rank at most r (and they inherit incoherence properties as well). By treating a third-order tensor as nothing more than an unrelated collection of \(n_1\) low-rank matrices, we can complete each slice separately using roughly \(m = n_1 (n_2 + n_3) r \log (n_2 + n_3)\) observations in total. When the rank is constant, this is a quadratic number of observations even though the number of parameters in the model is linear.
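To make this baseline concrete, here is a minimal sketch (in Python/NumPy) of the slice-by-slice approach. It uses a simple soft-impute style heuristic as a stand-in for the matrix completion algorithms cited above; the routine and its parameters are illustrative choices and do not come from those papers.

```python
import numpy as np

def soft_impute(M_obs, mask, tau=1.0, iters=200):
    """Heuristic completion of one matrix slice.

    M_obs: matrix with arbitrary values at unobserved entries.
    mask:  boolean matrix, True where the entry was observed.
    Repeatedly fills the missing entries with a soft-thresholded SVD.
    """
    X = np.where(mask, M_obs, 0.0)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        low_rank = (U * np.maximum(s - tau, 0.0)) @ Vt
        X = np.where(mask, M_obs, low_rank)   # keep observed entries fixed
    return X

def naive_tensor_completion(T_obs, mask):
    """Treat an n1 x n2 x n3 tensor as n1 unrelated matrix slices."""
    n1 = T_obs.shape[0]
    return np.stack([soft_impute(T_obs[i], mask[i]) for i in range(n1)])

# toy example: a rank-one sign tensor with about 60% of each slice observed
rng = np.random.default_rng(0)
n = 30
a, b, c = (rng.choice([-1.0, 1.0], size=n) for _ in range(3))
T = np.einsum('i,j,k->ijk', a, b, c)
mask = rng.random(T.shape) < 0.6
X = naive_tensor_completion(np.where(mask, T, 0.0), mask)
print("mean absolute error:", np.abs(X - T).mean())
```

Because each slice is completed in isolation, each one needs on the order of \((n_2 + n_3) r \log (n_2 + n_3)\) observations of its own, which is exactly the quadratic bottleneck described above.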

Here we show how to solve the (noisy) tensor completion problem with many fewer observations. Let \(n_1 \le n_2 \le n_3\). We give an algorithm based on the sixth level of the sum-of-squares hierarchy that can accurately fill in the missing entries of an unknown, incoherent \(n_1 \times n_2 \times n_3\) tensor T that is entry-wise close to being rank r with roughly

$$\begin{aligned} m = (n_1)^{1/2} (n_2 + n_3) r \log ^4 (n_1 + n_2 + n_3) \end{aligned}$$

observations. Moreover, our algorithm works even when the observations are corrupted by noise. When \(n = n_1 = n_2 = n_3\), this amounts to about \(n^{1/2} r \) observations per slice which is much smaller than what we would need to apply matrix completion on each slice separately. Our algorithm needs to leverage the structure between the various slices.

1.1 Our results

We give an algorithm for noisy tensor completion that works for third-order tensors. Let T be a third-order \(n_1 \times n_2 \times n_3\) tensor that is entry-wise close to being low rank. In particular let

$$\begin{aligned} T = \sum _{\ell = 1}^r \sigma _\ell a_\ell \otimes b_\ell \otimes c_\ell + \Delta \end{aligned}$$
(1)

where \(\sigma _\ell \) is a scalar and \(a_\ell , b_\ell \) and \(c_\ell \) are vectors of dimension \(n_1\), \(n_2\) and \(n_3\) respectively. Here \(\Delta \) is a tensor that represents noise. Its entries can be thought of as representing model misspecification (because T is not exactly low rank), noise in our observations, or both. We will only make assumptions about the average and maximum absolute value of entries in \(\Delta \). The vectors \(a_\ell , b_\ell \) and \(c_\ell \) are called factors, and we will assume that their norms are roughly \(\sqrt{n_i}\) for reasons that will become clear later. Moreover we will assume that the magnitude of each of their entries is bounded by C in which case we call the vectors C-incoherent. (Note that a random vector of dimension n and norm \(\sqrt{n}\) will be \(O(\sqrt{\log n})\)-incoherent with high probability.) The advantage of these conventions is that a typical entry in T does not become vanishingly small as we increase the dimensions of the tensor. This will make it easier to state and interpret the error bounds of our algorithm.
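As a concrete illustration of the model in (1) and of the scaling conventions just described, the following sketch (Python/NumPy) builds a noisy low-rank tensor; random sign vectors are used only because they are a convenient way to produce C-incoherent factors (here with C = 1).

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, n3, r = 20, 25, 30, 4
C = 1.0   # sign vectors of norm sqrt(n_i) are 1-incoherent

# factors a_l, b_l, c_l with ||a_l|| = sqrt(n_i) and entries bounded by C
A = rng.choice([-1.0, 1.0], size=(r, n1))
B = rng.choice([-1.0, 1.0], size=(r, n2))
Cfac = rng.choice([-1.0, 1.0], size=(r, n3))
sigma = rng.normal(size=r)

# T = sum_l sigma_l a_l (x) b_l (x) c_l + Delta
low_rank = np.einsum('l,li,lj,lk->ijk', sigma, A, B, Cfac)
Delta = 0.1 * rng.normal(size=(n1, n2, n3))   # noise / model misspecification
T = low_rank + Delta

# sanity checks on the conventions used in the text
assert np.allclose(np.linalg.norm(A, axis=1), np.sqrt(n1))
assert np.max(np.abs(A)) <= C
delta = np.abs(Delta).mean()    # average |Delta_{i,j,k}|, the delta of Theorem 1.1
r_star = np.abs(sigma).sum()    # r^* = sum_l |sigma_l|
print(delta, r_star)
```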

Let \(\Omega \) represent the locations of the entries that we observe, which (as is standard) are chosen uniformly at random and without replacement. Set \(|\Omega | = m\). Our goal is to output a hypothesis X that has small entry-wise error, defined as:

$$\begin{aligned} \text{ err }(X) = \frac{1}{n_1 n_2 n_3} \sum _{i,j,k} \Big | X_{i,j,k} - T_{i,j,k} \Big | \end{aligned}$$

This measures the error on both the observed and unobserved entries of T. Here and throughout let \(n_1 \le n_2 \le n_3\) and \(n = n_3\). Our goal is to give algorithms that achieve vanishing error, as the size n of the problem increases. Moreover we will want algorithms that need as few observations as possible. Our main result is:

Theorem 1.1

(Main theorem) Suppose we are given m observations whose locations are chosen uniformly at random (and without replacement) from a tensor T of the form (1) where each of the factors \(a_\ell , b_\ell \) and \(c_\ell \) is C-incoherent. Let \(\delta = \frac{1}{n_1 n_2 n_3} \sum _{i,j,k} | \Delta _{i,j,k}|\). And let \(r^* = \sum _{\ell = 1}^r |\sigma _\ell |\). Then there is a polynomial time algorithm that outputs a hypothesis X that with probability \(1- \epsilon \) satisfies

$$\begin{aligned} \text{ err }(X) \le 4 C^3 r^* \sqrt{\frac{ (n_1)^{1/2} (n_2 + n_3) \log ^4 n + \log 2/\epsilon }{ m } }\ + 2 \delta \end{aligned}$$

provided that \(\max _{i,j,k} |\Delta _{i,j,k}| \le \sqrt{\frac{m}{\log 2/\epsilon }} \delta \).

Since the error bound above is quite involved, let us dissect the terms in it. First, having an additive \(\delta \) in the error bound is unavoidable. We have not assumed anything about \(\Delta \) in (1) except a bound on the average and maximum magnitude of its entries. If \(\Delta \) were a random tensor whose entries are \(+\delta \) or \(-\delta \) uniformly at random then no matter how many entries of T we observe, we cannot hope to obtain error less than \(\delta \) on the unobserved entries. The crucial point is that the remaining term in the error bound becomes o(1) when \(m = {\widetilde{\Omega }}((r^*)^2 n^{3/2})\), which for polylogarithmic \(r^*\) improves over the naïve algorithm for tensor completion by a polynomial factor in terms of the number of observations. Moreover our algorithm works without any constraint that the factors \(a_\ell , b_\ell \) and \(c_\ell \) be orthogonal or even have low inner-product.
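To make the last quantitative claim explicit, consider the symmetric setting \(n_1 = n_2 = n_3 = n\) and ignore the \(\log 2/\epsilon \) term, which is lower order for any constant \(\epsilon \). Theorem 1.1 then reads

$$\begin{aligned} \text{ err }(X) \le 4 C^3 r^* \sqrt{\frac{2 n^{3/2} \log ^4 n}{m}} + 2 \delta \end{aligned}$$

so the first term is o(1) as soon as \(m = \omega \big ( C^6 (r^*)^2 n^{3/2} \log ^4 n \big )\), which is the \(m = {\widetilde{\Omega }}((r^*)^2 n^{3/2})\) threshold quoted above.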

Furthermore we show that in certain non-degenerate cases we can even remove another factor of \(r^*\) from the number of observations we need: Suppose that T is a tensor as in (1), but let \(\sigma _\ell \) be Gaussian random variables with mean zero and variance one. The factors \(a_\ell , b_\ell \) and \(c_\ell \) are still fixed, but because of the randomness in the coefficients \(\sigma _\ell \), the entries of T are now random variables.

Corollary 1.2

Suppose we are given m observations whose locations are chosen uniformly at random (and without replacement) from a tensor T of the form (1), where each coefficient \(\sigma _\ell \) is a Gaussian random variable with mean zero and variance one, and each of the factors \(a_\ell , b_\ell \) and \(c_\ell \) is C-incoherent.

Further, suppose that for a \(1-o(1)\) fraction of the entries of T, we have \({\text {var}}(T_{i,j,k}) \ge r/{\text {polylog}}(n) = V\) and that \(\Delta \) is a tensor where each entry is a Gaussian with mean zero and variance o(V). Then there is a polynomial time algorithm that outputs a hypothesis X that satisfies

$$\begin{aligned} X_{i,j,k} = \Big (1\pm o(1) \Big )T_{i,j,k} \end{aligned}$$

for a \(1 - o(1)\) fraction of the entries. The algorithm succeeds with probability at least \(1 - o(1)\) over the randomness of the locations of the observations, and the realizations of the random variables \(\sigma _\ell \) and the entries of \(\Delta \). Moreover the algorithm uses \(m = C^6 n^{3/2} r {\text {polylog}}(n)\) observations.

In the setting above, it is enough that the coefficients \(\sigma _\ell \) are random and that the non-zero entries in the factors are spread out to ensure that the typical entry in T has variance about r. Consequently, the typical entry in T is about \(\sqrt{r}\). This fact combined with the error bounds in Theorem 1.1 immediately yield the above corollary. Remarkably, the guarantee is interesting even when \(r = n^{3/2 - \epsilon }\). The setting where \(r > n\) is called the overcomplete case. In this setting, if we observe a subpolynomial fraction of the entries of T we are able to recover almost all of the remaining entries almost entirely. For context, there are no known algorithms for decomposing an overcomplete, third-order tensor even if we are given all of its entries, at least without imposing much stronger conditions that the factors be nearly orthogonal [29].

We believe that this work is a natural first step in designing practically efficient algorithms for tensor completion. Our algorithms manage to leverage the structure across the slices through the tensor, instead of treating each slice as an independent matrix completion problem. Now that we know this is possible, a natural follow-up question is to get more efficient algorithms. Our algorithms are based on the sixth level of the sum-of-squares hierarchy and run in polynomial time, but are quite far from being practically efficient as stated. Recent work of Hopkins et al. [37] shows how to speed up sum-of-squares and obtain nearly linear time algorithms for a number of problems where the only previously known algorithms ran in a prohibitively large degree polynomial running time. Another approach would be to obtain similar guarantees for alternating minimization. Currently, the only known approaches [40] require that the factors are orthonormal and only work in the undercomplete case. Finally, it would be interesting to get algorithms for low rank tensor completion that find T exactly when there is no noise.

1.2 Our approach

All of our algorithms are based on solving the following optimization problem:

$$\begin{aligned} \qquad \min _X \Vert X\Vert _{\mathcal {K}} \text{ s.t. } \frac{1}{m} \sum _{(i,j,k) \in \Omega } | X_{i,j,k} - T_{i,j,k}| \le 2 \delta \end{aligned}$$
(2)

and outputting the minimizer X, where \(\Vert \cdot \Vert _{\mathcal {K}}\) is some norm that can be computed in polynomial time. It will be clear from the way we define the norm that the low rank part of T will itself be a good candidate solution. But this is not necessarily the solution that the convex program finds. How do we know that whatever it finds not only has low entry-wise error on the observed entries of T, but also on the unobserved entries?
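For intuition, here is a minimal sketch of a program of the form (2) (in Python, assuming the cvxpy package is available). The norm \(\Vert \cdot \Vert _{\mathcal {K}}\) constructed in this paper is a sum-of-squares relaxation that is more involved to write down, so the sketch uses the nuclear norm of the mode-1 unfolding of X as a simple polynomial time computable stand-in; it illustrates only the shape of the optimization problem, not the guarantees of Theorem 1.1.

```python
import numpy as np
import cvxpy as cp

def complete_by_unfolding(T_obs, omega, delta, n1, n2, n3):
    """Minimize a stand-in norm subject to small empirical error on Omega.

    T_obs: dict mapping an observed triple (i, j, k) to its value.
    omega: list of observed triples.
    Stand-in norm: nuclear norm of the mode-1 unfolding (an n1 x n2*n3 matrix).
    """
    X = cp.Variable((n1, n2 * n3))
    # empirical error constraint, exactly as in program (2)
    residuals = [cp.abs(X[i, j * n3 + k] - T_obs[(i, j, k)]) for (i, j, k) in omega]
    constraints = [cp.sum(cp.hstack(residuals)) / len(omega) <= 2 * delta]
    prob = cp.Problem(cp.Minimize(cp.normNuc(X)), constraints)
    prob.solve()
    return X.value.reshape(n1, n2, n3)
```

This stand-in exploits only matrix structure (row i of the unfolding is the i-th slice), which is essentially the naive approach from the introduction; the sum-of-squares norm defined in Sect. 3 is what allows us to do better. Whether the minimizer of such a program also has small error on the unobserved entries is precisely the question raised above.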

This is a well-studied topic in statistical learning theory, and as is standard we can use the notion of Rademacher complexity as a tool to bound the error. The Rademacher complexity is a property of the norm we choose, and our main innovation is to use the sum-of-squares hierarchy to suggest a suitable norm. The high-level idea is to establish a parallel between noisy tensor completion and refuting random constraint satisfaction problems. This connection is bidirectional: We show that any polynomial time computable norm with good Rademacher complexity immediately yields a polynomial time algorithm for refuting random constraint satisfaction problems. Moreover we embed known algorithms for refutation into the sum-of-squares hierarchy to suggest a suitable polynomial time computable norm in order to give generalization bounds for our algorithms for tensor completion.

A natural question to ask is: Are there other norms that have even better Rademacher complexity than the ones we use here, and that are still computable in polynomial time? It turns out that any such norm would immediately lead to much better algorithms for refuting random constraint satisfaction problems than we currently know. We have not yet introduced Rademacher complexity, so we state our lower bounds informally:

Theorem 1.3

(informal) For any \(\epsilon > 0\), if there is a polynomial time algorithm that achieves error

$$\begin{aligned} \text{ err }(X) \le r^* \sqrt{\frac{n^{3/2-\epsilon }}{ m } } \end{aligned}$$

through the framework of Rademacher complexity then there is an efficient algorithm for refuting a random 3-SAT formula on n variables with \(m = n^{3/2 - \epsilon }\) clauses. Moreover the natural sum-of-squares relaxation requires at least \(n^{2\epsilon }\)-levels in order to achieve the above error (again through the framework of Rademacher complexity).

These results follow directly from the works of Grigoriev [31], Schoenebeck [58] and Feige [22]. There are similar connections between our upper bounds and the work of Coja-Oghlan, Goerdt and Lanka [17] who give an algorithm for strongly refuting random 3-SAT. In Sect. 2 we explain some preliminary connections between these fields, at which point we will be in a better position to explain how we can borrow tools from one area to address open questions in another. We state this theorem more precisely in Corollary 2.13 and Corollary 5.6, which provide both conditional and unconditional lower bounds that match our upper bounds. An important caveat is that an algorithm for tensor completion, particularly in the exact case when there is no noise, need not be based on a norm with good Rademacher complexity. In fact an algorithm for tensor completion might not even be based on convex programming in the first place!

1.3 Computational versus sample complexity tradeoffs

It is interesting to compare the story of matrix completion and tensor completion. In matrix completion, we have the best of both worlds: There are efficient algorithms which work when the number of observations is close to the information theoretic minimum. In tensor completion, we gave algorithms that improve upon the number of observations needed by a polynomial factor but still require a polynomial factor more observations than can be achieved if we ignore computational considerations [63]. We believe that for many other linear inverse problems (e.g. sparse phase retrieval), there may well be gaps between what can be achieved information theoretically and what can be achieved with computationally efficient estimators. Moreover, proving lower bounds against the sum-of-squares hierarchy offers a new type of evidence that problems are hard, that does not rely on reductions from other average-case hard problems which seem (in general) to be brittle and difficult to execute while preserving the naturalness of the input distribution. In fact, even when there are such reductions [10], the sum-of-squares hierarchy offers a methodology to make sharper predictions for questions like: Is there a quasi-polynomial time algorithm for sparse PCA, or does it require exponential time? Are there algorithms that run in time \(n^{o(\log n)}\) that can find \(n^{1/2 - \epsilon }\)-sized planted cliques for some \(\epsilon > 0\)?

After our work, there has been substantial progress on these questions. Barak et al. [5] gave a nearly optimal lower bound for the planted clique problem. For sparse PCA, Hopkins et al. [36] gave subexponential lower bounds and showed that, in many natural settings, proving lower bounds against the sum-of-squares hierarchy is equivalent to proving lower bounds against a family of matrix polynomial methods. Ding et al. [20] recently gave a subexponential time algorithm for sparse PCA. There still remain many important problems which appear to exhibit fundamental computational vs. statistical tradeoffs for which we do not yet have strong sum-of-squares lower bounds. Perhaps the most notable example is the problem of community detection in the stochastic block model where it is conjectured that no polynomial time algorithm can solve the problem beneath the Kesten-Stigum bound [1, 19].

1.4 Subsequent work on tensor completion

One of the main questions left open by our work is whether there is a polynomial time algorithm that can exactly recover the unknown tensor T from \(m = n^{3/2} r\) random observations. Our approach has the advantage that it works when we get noisy observations of T or if T is merely close to being low-rank, but in the case when there is no noise and T is exactly low-rank, we are still only able to achieve low prediction error for the typical missing entry in T. Potechin and Steurer [55] studied the problem of exact tensor completion, but with the restriction that the unknown factors are orthogonal. They constructed a dual certificate that shows that the true low rank tensor T is exactly optimal to the primal with high probability with just \(m = n^{3/2} r\) random observations. However, when the factors are not orthogonal the dual certificate for showing that no other tensor has lower norm (where the norm is defined by the sum-of-squares relaxation to the tensor nuclear norm) is fundamentally much more complicated. In the orthogonal case, Foster and Risteski [25] gave algorithms whose generalization error scales as 1/m rather than \(1/\sqrt{m}\) as in our paper. These are called fast rates in the context of agnostic learning.

Another outstanding open question is to give faster algorithms for tensor completion. In particular algorithms based on the sum-of-squares hierarchy run in polynomial time but are impractical. Are there fast and practical algorithms for tensor completion that still succeed with \(m = n^{3/2} r\) random observations? Montanari and Sun [50] gave a spectral algorithm that works with \(m = n^{3/2} r\), but only when \(r \le n^{3/4}\). In contrast, our algorithms work even when \(r = n^{3/2}\). They gave another algorithm that runs in time \(n^6\) that works in the overcomplete setting. Building on their work, Liu and Moitra [46] analyzed a variant of alternating minimization and showed that it succeeds with \(m = n^{3/2} r^{O(1)}\). Their algorithm turns out to be practical and runs quickly even with n on the order of a thousand. Finally, on a technical level, Moitra and Wein [49] gave a general framework for designing spectral algorithms from tensor networks, and applied it to the continuous multireference alignment problem. Tensor networks can also be used to graphically visualize the trace method calculations in Sect. 4 that are at the heart of this paper.

1.5 Organization

In Sect. 2 we introduce Rademacher complexity, the tensor nuclear norm and strong refutation. We connect these concepts by showing that any norm that can be computed in polynomial time and has good Rademacher complexity yields an algorithm for strongly refuting random 3-SAT. In order to understand the proof of the upper bound, it is not strictly necessary to read Sects. 2.1 or  2.3. However they provide important context that might be useful for many readers. For example, before our work there was no discussion in the literature about the possibility of there being computational vs. statistical tradeoffs for tensor completion. Section 2.1 gives a simple hypothesis testing problem, where the goal is to distinguish between an approximately rank one tensor and random noise, that is statistically easy but believed to be computationally hard. Moreover many works suggested using the tensor nuclear norm. While it is known that computing the tensor nuclear norm is computationally hard in general, could it be that it is easy for the specific types of problems that arise in tensor completion? In Sect. 2.3 we bound the Rademacher complexity of the tensor nuclear norm, which implies that we cannot compute the tensor nuclear norm (or even approximate it) on random instances under natural complexity assumptions. An important reason for understanding the connections between tensor completion and random CSPs, rather than sweeping them under the rug, is that it helps demystify where the particular relaxation we are working with comes from.

In Sect. 3 we show how a particular algorithm for strong refutation can be embedded into the sum-of-squares hierarchy and directly leads to a norm that can be computed in polynomial time and has good Rademacher complexity. In Sect. 4 we establish certain spectral bounds that we need, and prove our main upper bounds. In Sect. 5 we prove lower bounds on the Rademacher complexity of the sequence of norms arising from the sum-of-squares hierarchy by a direct reduction to lower bounds for refuting random 3-XOR. In Appendix A we give an extension from the case where \(n_1 = n_2 = n_3\) to the general case. This is also what allows us to extend our analysis to arbitrary order d tensors; the proofs are essentially identical to those in the \(d = 3\) case, only more notationally involved, so we omit them.

2 Noisy tensor completion and refutation

Here we make the connection between noisy tensor completion and strong refutation explicit. Our first step is to formulate a problem that is a special case of both, and studying it will help us clarify how notions from one problem translate to the other.

2.1 The distinguishing problem

Here we introduce a problem that we call the distinguishing problem. We are given random observations from a tensor and promised that the underlying tensor fits into one of the two following categories. We want an algorithm that can tell which case the samples came from, and succeeds using as few observations as possible. The two cases are:

  1. 1.

    Each observation is chosen uniformly at random (and without replacement) from a tensor T where independently for each entry we set

    $$\begin{aligned} T_{i,j,k}= {\left\{ \begin{array}{ll} a_i a_j a_k &{} \text{ with } \text{ probability } 7/8\\ 1 &{} \text{ with } \text{ probability } 1/16\\ -1 &{} \text{ else } \end{array}\right. } \end{aligned}$$

    where a is a vector whose entries are \(\pm 1\).

  2. 2.

    Alternatively, each observation is chosen uniformly at random (and without replacement) from a tensor T each of whose entries is independently set to either \(+1\) or \(-1\) with equal probability.

In the first case, the entries of the underlying tensor T are predictable. It is possible to guess a 15/16 fraction of them correctly, once we have observed enough of its entries to be able to deduce a. And in the second case, the entries of T are completely unpredictable because no matter how many entries we have observed, the remaining entries are still random. Thus we cannot predict any of the unobserved entries better than random guessing.
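The following sketch (Python/NumPy) generates observations from the two cases and confirms the point just made: in the first case the planted vector a predicts roughly a 15/16 fraction of the observed entries, while in the second case nothing beats random guessing. For simplicity the locations are sampled with replacement and each observation is drawn afresh, which for \(m \ll n^3\) rarely revisits an entry.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 60, 20000
a = rng.choice([-1, 1], size=n)   # the hidden vector in the first case

def sample_case1():
    i, j, k = rng.integers(0, n, size=3)
    u = rng.random()
    if u < 7 / 8:
        val = a[i] * a[j] * a[k]
    elif u < 7 / 8 + 1 / 16:
        val = 1
    else:
        val = -1
    return (i, j, k), val

def sample_case2():
    i, j, k = rng.integers(0, n, size=3)
    return (i, j, k), rng.choice([-1, 1])

obs1 = [sample_case1() for _ in range(m)]
obs2 = [sample_case2() for _ in range(m)]

agree1 = np.mean([val == a[i] * a[j] * a[k] for (i, j, k), val in obs1])
agree2 = np.mean([val == a[i] * a[j] * a[k] for (i, j, k), val in obs2])
print(agree1, agree2)   # about 15/16 = 0.9375 versus about 1/2
```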

Now we will explain how the distinguishing problem can be equivalently reformulated in the language of refutation. We give a formal definition for strong refutation later (Definition 2.10), but for the time being we can think of it as the task of (given an instance of a constraint satisfaction problem) certifying that there is no assignment that satisfies many of the clauses. We will be interested in 3-XOR formulas, where there are n variables \(v_1, v_2, ..., v_n\) that are constrained to take on values \(+1\) or \(-1\). Each clause takes the form

$$\begin{aligned} v_i \cdot v_j \cdot v_k = T_{i,j,k} \end{aligned}$$

where the right hand side is either \(+1\) or \(-1\). The clause represents a parity constraint but over the domain \(\{+1, -1\}\) instead of over the usual domain \({\mathbb {F}}_2\). We have chosen the notation suggestively so that it hints at the mapping between the two views of the problem. Each observation \(T_{i,j,k}\) maps to a clause \(v_i \cdot v_j \cdot v_k = T_{i,j,k}\) and vice-versa. Thus an equivalent way to formulate the distinguishing problem is that we are given a 3-XOR formula which was generated in one of the following two ways:

  1. 1.

    Each clause in the formula is generated by choosing an ordered triple of variables \((v_i, v_j, v_k)\) uniformly at random (and without replacement) and we set

    $$\begin{aligned} v_i \cdot v_j \cdot v_k= {\left\{ \begin{array}{ll} a_i a_j a_k &{} \text{ with } \text{ probability } 7/8\\ 1 &{} \text{ with } \text{ probability } 1/16\\ -1 &{} \text{ else } \end{array}\right. } \end{aligned}$$

    where a is a vector whose entries are \(\pm 1\). Now a represents a planted solution and by design our sampling procedure guarantees that many of the clauses that are generated are consistent with it.

  2. 2.

    Alternatively, each clause in the formula is generated by choosing an ordered triple of variables \((v_i, v_j, v_k)\) uniformly at random (and without replacement) and we set \(v_i \cdot v_j \cdot v_k = z_{i,j,k}\) where \(z_{i,j,k}\) is a random variable that takes on values \(+1\) and \(-1\).

In the first case, the 3-XOR formula has an assignment that satisfies a 15/16 fraction of the clauses in expectation by setting \(v_i = a_i\). In the second case, any fixed assignment satisfies at most half of the clauses in expectation. Moreover if we are given \(\Omega (n \log n)\) clauses, it is easy to see by applying the Chernoff bound and taking a union bound over all possible assignments that with high probability there is no assignment that satisfies more than a \(1/2 + o(1)\) fraction of the clauses.
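For completeness, here is the calculation behind the last claim. Fix any assignment \(v \in \{\pm 1\}^n\). In the second case each clause is satisfied by v independently with probability 1/2, since the right hand sides are independent random signs, so by the Chernoff bound the fraction of clauses that v satisfies exceeds \(1/2 + \epsilon \) with probability at most \(e^{-2 \epsilon ^2 m}\). Taking a union bound over all \(2^n\) assignments,

$$\begin{aligned} \Pr \Big [ \exists \, v \text{ satisfying more than a } 1/2 + \epsilon \text{ fraction of the clauses} \Big ] \le 2^n e^{-2 \epsilon ^2 m} \end{aligned}$$

which is o(1) whenever \(m = \omega (n/\epsilon ^2)\). In particular, with \(m = \Omega (n \log n)\) clauses we can take \(\epsilon = (\log n)^{-1/4} = o(1)\).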

This will be the starting point for the connections we establish between noisy tensor completion and refutation. Even in the matrix case these connections seem to have gone unnoticed: the same spectral bounds that are used to analyze the Rademacher complexity of the nuclear norm [61] are also used to refute random 2-SAT formulas [30], and this is no accident.

2.2 Rademacher complexity

Ultimately our goal is to show that the hypothesis X that our convex program finds is entry-wise close to the unknown tensor T. By virtue of the fact that X is a feasible solution to (2) we know that it is entry-wise close to T on the observed entries. This is often called the empirical error:

Definition 2.1

For a hypothesis X, the empirical error is

$$\begin{aligned} \text{ emp-err }(X) = \frac{1}{m} \sum _{(i,j,k) \in \Omega } | X_{i,j,k} - T_{i,j,k}| \end{aligned}$$

Recall that \(\text{ err }(X)\) is the average entry-wise error between X and T, over all (observed and unobserved) entries. Also recall that among the candidate X’s that have low empirical error, the convex program finds the one that minimizes \(\Vert X\Vert _{\mathcal {K}}\) for some polynomial time computable norm. The way we will choose the norm \(\Vert \cdot \Vert _{\mathcal {K}}\) and our bound on the maximum magnitude of an entry of \(\Delta \) will guarantee that the low rank part of T will with high probability be a feasible solution. Thus we are guaranteed that there is a good solution to the optimization problem in the sense that we could always choose X to be the low rank part of T, which simultaneously satisfies the constraints imposed by \(\Omega \) and has bounded \(\Vert \cdot \Vert _{\mathcal {K}}\) norm. One way to bound \(\text{ err }(X)\) is to show that no hypothesis in the unit norm ball can have too large a gap between its error and its empirical error (and then dilate the unit norm ball so that it contains X). With this in mind, we define:

Definition 2.2

For a norm \(\Vert \cdot \Vert _{{\mathcal {K}}}\) and a set \(\Omega \) of observations, the generalization error is

$$\begin{aligned} \sup _{\Vert X\Vert _{\mathcal {K}}\le 1} \Big | \text{ err }(X) - \text{ emp-err }(X) \Big | \end{aligned}$$

As we will discuss, a way to control the generalization error is through the Rademacher complexity.

Definition 2.3

Let \(\Omega = \{(i_1, j_1, k_1), (i_2, j_2, k_2), ..., (i_m, j_m, k_m)\}\) be a set of m locations chosen uniformly at random (and without replacement) from \([n_1] \times [n_2] \times [n_3]\). And let \(\sigma _1, \sigma _2, ..., \sigma _m\) be random \(\pm 1\) variables. The Rademacher complexity of (the unit ball of) the norm \(\Vert \cdot \Vert _{\mathcal {K}}\) is defined as

$$\begin{aligned} R^m(\Vert \cdot \Vert _{\mathcal {K}}) = \mathop {\mathbf{E}}_{\Omega , \sigma } \Big [ \sup _{\Vert X\Vert _{\mathcal {K}}\le 1} \Big | \sum _{\ell =1}^m \sigma _\ell X_{i_\ell , j_\ell , k_\ell } \Big | \Big ] \end{aligned}$$

It follows from a standard symmetrization argument from empirical process theory [9, 42] that the Rademacher complexity does indeed bound the generalization error.

Theorem 2.4

Let \(\epsilon \in (0,1)\) and suppose each X with \(\Vert X\Vert _{{\mathcal {K}}} \le 1\) has bounded loss — i.e. \(|X_{i,j,k} - T_{i,j,k}| \le a\) and that the locations (i, j, k) are chosen uniformly at random and without replacement. Then with probability at least \(1- \epsilon \), for every X with \(\Vert X\Vert _{{\mathcal {K}}} \le 1\), we have

$$\begin{aligned} \text{ err }(X) \le \text{ emp-err }(X) + 2 R^m(\Vert \cdot \Vert _{\mathcal {K}}) + 2 a \sqrt{\frac{\ln (1/\epsilon )}{m}} \end{aligned}$$

We repeat the proof here following [9] for the sake of completeness but readers familiar with Rademacher complexity can feel free to skip ahead to Definition 2.5. The main idea is to let \(\Omega '\) be an independent set of m samples from the same distribution, again without replacement. The expected generalization error is:

$$\begin{aligned} (*) := \mathop {\mathbf{E}}_{\Omega } \Big [ \sup _{\Vert X\Vert _{\mathcal {K}}\le 1} \Big | \text{ err }(X) - \text{ emp-err }(X) \Big | \Big ] \end{aligned}$$

Then we can write

$$\begin{aligned} (*)= & {} \mathop {\mathbf{E}}_{\Omega } \Big [ \sup _{\Vert X\Vert _{\mathcal {K}}\le 1} \Big | \frac{1}{m} \sum _{\ell =1}^m | X_{i_\ell , j_\ell , k_\ell } - T_{i_\ell , j_\ell , k_\ell } | - \frac{1}{m} \mathop {\mathbf{E}}_{\Omega '} [\sum _{\ell =1}^m | X_{i'_\ell , j'_\ell , k'_\ell } - T_{i'_\ell , j'_\ell , k'_\ell } | ] \Big | \Big ] \\\le & {} \mathop {\mathbf{E}}_{\Omega , \Omega '} \Big [ \sup _{\Vert X\Vert _{\mathcal {K}}\le 1} \Big | \frac{1}{m} \Big ( \sum _{\ell =1}^m | X_{i_\ell , j_\ell , k_\ell } - T_{i_\ell , j_\ell , k_\ell } | - | X_{i'_\ell , j'_\ell , k'_\ell } - T_{i'_\ell , j'_\ell , k'_\ell } | \Big ) \Big | \Big ] \end{aligned}$$

where the last line follows by the concavity of \(\sup (\cdot )\). Now we can use the Rademacher (random \(\pm 1\)) variables \(\{\sigma _\ell \}_\ell \) and rewrite the right hand side of the above expression as follows:

$$\begin{aligned} (*)\le & {} \mathop {\mathbf{E}}_{\Omega , \Omega ', \sigma } \Big [ \sup _{\Vert X\Vert _{\mathcal {K}}\le 1} \Big | \frac{1}{m} \sum _{\ell =1}^m \sigma _\ell \Big ( | X_{i_\ell , j_\ell , k_\ell } - T_{i_\ell , j_\ell , k_\ell } | - | X_{i'_\ell , j'_\ell , k'_\ell } - T_{i'_\ell , j'_\ell , k'_\ell } | \Big ) \Big | \Big ] \\\le & {} \mathop {\mathbf{E}}_{\Omega , \Omega ', \sigma } \Big [ \sup _{\Vert X\Vert _{\mathcal {K}}\le 1} \Big | \frac{1}{m} \sum _{\ell =1}^m \sigma _\ell | X_{i_\ell , j_\ell , k_\ell } - T_{i_\ell , j_\ell , k_\ell } | \Big |\\&+ \Big | \frac{1}{m} \sum _{\ell =1}^m \sigma _\ell | X_{i'_\ell , j'_\ell , k'_\ell } - T_{i'_\ell , j'_\ell , k'_\ell } | \Big | \Big ] \\\le & {} 2 \mathop {\mathbf{E}}_{\Omega , \sigma } \Big [ \sup _{\Vert X\Vert _{\mathcal {K}}\le 1} \Big | \frac{1}{m} \Big ( \sum _{\ell =1}^m \sigma _\ell | X_{i_\ell , j_\ell , k_\ell } - T_{i_\ell , j_\ell , k_\ell } | \Big ) \Big | \Big ] \\\le & {} 2 \mathop {\mathbf{E}}_{\Omega , \sigma } \Big [ \sup _{\Vert X\Vert _{\mathcal {K}}\le 1} \Big | \frac{1}{m} \Big ( \sum _{\ell =1}^m \sigma _\ell \Big ( | X_{i_\ell , j_\ell , k_\ell }| + |T_{i_\ell , j_\ell , k_\ell } | \Big ) \Big ) \Big | \Big ] \\\le & {} 2 \mathop {\mathbf{E}}_{\Omega , \sigma } \Big [ \Big | \frac{1}{m} \sum _{\ell =1}^m \sigma _\ell |T_{i_\ell , j_\ell , k_\ell }| \Big | \Big ] + 2 \mathop {\mathbf{E}}_{\Omega , \sigma } \Big [ \sup _{\Vert X\Vert _{\mathcal {K}}\le 1} \Big | \frac{1}{m} \sum _{\ell =1}^m \sigma _\ell |X_{i_\ell , j_\ell , k_\ell }| \Big | \Big ]\\= & {} 2 \mathop {\mathbf{E}}_{\Omega , \sigma } \Big [ \Big | \frac{1}{m} \sum _{\ell =1}^m \sigma _\ell T_{i_\ell , j_\ell , k_\ell } \Big | \Big ] + 2 \mathop {\mathbf{E}}_{\Omega , \sigma } \Big [ \sup _{\Vert X\Vert _{\mathcal {K}}\le 1} \Big | \frac{1}{m} \sum _{\ell =1}^m \sigma _\ell X_{i_\ell , j_\ell , k_\ell } \Big | \Big ] \end{aligned}$$

where the second, fourth and fifth inequalities use the triangle inequality. The equality uses the fact that the \(\sigma _\ell \)’s are random signs and hence can absorb the absolute value around the terms that they multiply. The second term above in the last expression is exactly the Rademacher complexity that we defined earlier. This argument only shows that the Rademacher complexity bounds the expected generalization error. However it turns out that we can also use the Rademacher complexity to bound the generalization error with high probability by applying McDiarmid’s inequality. We also remark that generalization bounds are often stated in the setting where samples are drawn i.i.d., but here the locations of our observations are sampled without replacement. Nevertheless for the settings of m we are interested in, the fraction of our observations that are repeats is o(1) — in fact it is subpolynomial — and we can move back and forth between both sampling models at negligible loss in our bounds.

In much of what follows it will be convenient to think of \(\Omega = \{(i_1, j_1, k_1), (i_2, j_2, k_2), ..., (i_m, j_m, k_m)\}\) and \(\{\sigma _\ell \}_\ell \) as being represented by a sparse tensor Z, defined below.

Definition 2.5

Let Z be an \(n_1 \times n_2 \times n_3\) tensor such that

$$\begin{aligned} Z_{i,j,k} = {\left\{ \begin{array}{ll} 0, \text{ if } (i,j,k) \notin \Omega \\ \sum _{\ell \text{ s.t. } (i, j, k) = (i_\ell , j_\ell , k_\ell )} \sigma _\ell \text{ else } \end{array}\right. } \end{aligned}$$

This definition greatly simplifies our notation. In particular we have

$$\begin{aligned} \sum _{\ell = 1}^m \sigma _\ell X_{i_\ell , j_\ell , k_\ell } =\sum _{i,j,k} Z_{i,j,k} X_{i,j,k} = \langle Z, X \rangle \end{aligned}$$

where we have introduced the notation \(\langle \cdot , \cdot \rangle \) to denote the natural inner-product between tensors. Our main technical goal in this paper will be to analyze the Rademacher complexity of a sequence of successively tighter norms that we get from the sum-of-squares hierarchy, and to derive implications for noisy tensor completion and for refutation from these bounds.
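In code, the tensor Z of Definition 2.5 is just a sparse sign tensor supported on \(\Omega \); a minimal sketch (Python/NumPy) that also checks the identity above:

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2, n3, m = 10, 12, 14, 200

# m random locations (possible repeats are handled by the sum in Definition 2.5)
locs = [tuple(rng.integers(0, d) for d in (n1, n2, n3)) for _ in range(m)]
signs = rng.choice([-1.0, 1.0], size=m)

Z = np.zeros((n1, n2, n3))
for (i, j, k), s in zip(locs, signs):
    Z[i, j, k] += s               # repeated locations accumulate their signs

X = rng.normal(size=(n1, n2, n3))
lhs = sum(s * X[loc] for loc, s in zip(locs, signs))   # sum_l sigma_l X_{i_l, j_l, k_l}
rhs = np.sum(Z * X)                                    # <Z, X>
assert np.isclose(lhs, rhs)
```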

2.3 The tensor nuclear norm

Here we introduce the tensor nuclear norm and analyze its Rademacher complexity. Many works have suggested using it to solve tensor completion problems [47, 60, 63]. This suggestion is quite natural given that it is based on a similar guiding principle as that which led to \(\ell _1\)-minimization in compressed sensing and the nuclear norm in matrix completion [21]. More generally, one can define the atomic norm for a wide range of linear inverse problems [15], and the \(\ell _1\)-norm, the nuclear norm and the tensor nuclear norm are all special cases of this paradigm. Before we proceed, let us first formally define the notion of incoherence that we gave in the introduction.

Definition 2.6

A length \(n_i\) vector a is C-incoherent if \(\Vert a\Vert = \sqrt{n_i}\) and \(\Vert a\Vert _\infty \le C\).

Recall that we chose to work with vectors whose typical entry is a constant so that the entries in T do not become vanishingly small as the dimensions of the tensor increase. We can now define the tensor nuclear norm:

Definition 2.7

(tensor nuclear norm) Let \({{\mathcal {A}}}\subseteq {\mathbb {R}}^{ n_1 \times n_2 \times n_3 }\) be defined as

$$\begin{aligned} {{\mathcal {A}}}= & {} \Big \{ X \text{ s.t. } \exists \text{ distribution } \mu \text{ on } \text{ triples } \text{ of } \text{ C-incoherent } \text{ vectors } \text{ with } \\&\qquad X_{i,j,k} = \mathop {\mathbf{E}}_{(a,b,c) \leftarrow \mu }[a_i b_j c_k]\Big \} \end{aligned}$$

The tensor nuclear norm of X which is denoted by \(\Vert X\Vert _{{{\mathcal {A}}}}\) is the infimum over \(\alpha \) such that \(X/\alpha \in {{\mathcal {A}}}\).

Recall that T is the low rank tensor and \(\Delta \) represents the noise. Since the factors are C-incoherent, we have that \(\Vert T - \Delta \Vert _{{{\mathcal {A}}}} \le r^*\).
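To see this, write

$$\begin{aligned} \frac{T - \Delta }{r^*} = \sum _{\ell = 1}^r \frac{|\sigma _\ell |}{r^*} \Big ( \text{ sign }(\sigma _\ell ) a_\ell \Big ) \otimes b_\ell \otimes c_\ell = \mathop {\mathbf{E}}_{(a,b,c) \leftarrow \mu }\big [ a \otimes b \otimes c \big ] \end{aligned}$$

where \(\mu \) is the distribution that places probability \(|\sigma _\ell |/r^*\) on the triple \((\text{ sign }(\sigma _\ell ) a_\ell , b_\ell , c_\ell )\). Each such triple consists of C-incoherent vectors, so \((T-\Delta )/r^* \in {{\mathcal {A}}}\) and hence \(\Vert T - \Delta \Vert _{{{\mathcal {A}}}} \le r^*\). With this in hand, we can give an elementary bound on the Rademacher complexity of the tensor nuclear norm. Recall that \(n = n_3\).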

Lemma 2.8

\(R^m(\Vert \cdot \Vert _{{\mathcal {A}}}) = O(C^3\sqrt{\frac{n}{m}})\)

Proof

Recall the definition of Z given in Definition 2.5. With this we can write

$$\begin{aligned} \mathop {\mathbf{E}}_{\Omega , \sigma } \Big [ \sup _{\Vert X\Vert _{{\mathcal {A}}}\le 1} \Big | \sum _{\ell =1}^m \sigma _\ell X_{i_\ell , j_\ell , k_\ell } \Big | \Big ] = \mathop {\mathbf{E}}_{\Omega , \sigma } \Big [\sup _{{\small C}\text{-incoherent }\ a, b, c} | \langle Z, a\otimes b \otimes c \rangle | \Big ] \end{aligned}$$

We can now adapt the discretization approach in [27], although our task is considerably simpler because we are constrained to C-incoherent a’s. In particular, let

$$\begin{aligned} S = \left\{ a \Big | a \text{ is }\ C\text{-incoherent } \text{ and } a \in \Big (\epsilon {\mathbb {Z}} \Big )^n \right\} \end{aligned}$$

By standard bounds on the size of an \(\epsilon \)-net [48], we get that \(|S| \le O(C/\epsilon )^n\). Suppose that \(|\langle Z, a\otimes b \otimes c \rangle | \le M\) for all \(a, b, c \in S\). Then for an arbitrary, but C-incoherent a we can expand it as \(a = \sum _i \epsilon ^i a^i\) where each \(a^i \in S\) and similarly for b and c. And now

$$\begin{aligned} |\langle Z, a\otimes b \otimes c \rangle | \le \sum _i \sum _j \sum _k \epsilon ^i \epsilon ^j \epsilon ^k |\langle Z, a^i \otimes b^j \otimes c^k \rangle | \le (1-\epsilon )^{-3} M \end{aligned}$$

Moreover since each entry in \(a \otimes b \otimes c\) has magnitude at most \(C^3\) we can apply a Chernoff bound to conclude that for any particular \(a, b, c \in S\) we have

$$\begin{aligned} |\langle Z, a\otimes b \otimes c \rangle | \le O\Big (C^3\sqrt{m \log 1/\gamma }\Big ) \end{aligned}$$

with probability at least \(1-\gamma \). Finally, if we set \(\gamma = (\frac{\epsilon }{C})^{4n}\) (small enough that we can union bound over all triples \(a, b, c \in S\)) and we set \(\epsilon = 1/2\) we get that

$$\begin{aligned} R^m(\Vert \cdot \Vert _{{{\mathcal {A}}}}) \le \frac{(1-\epsilon )^{-3}}{m} \max _{a, b, c \in S}|\langle Z, a\otimes b \otimes c \rangle | = O\Big (C^3\sqrt{\frac{n}{m}}\Big ) \end{aligned}$$

and this completes the proof. \(\square \)

The important point is that the Rademacher complexity of the tensor nuclear norm is o(1) whenever \(m = \omega (n)\). In the next subsection we will connect this to refutation in a way that allows us to strengthen known hardness results for computing the tensor nuclear norm [32, 34] and show that it is even hard to compute in an average-case sense based on some standard conjectures about the difficulty of refuting random 3-SAT.

2.4 From Rademacher complexity to refutation

Here we show the first implication of the connection we have established. Any norm that can be computed in polynomial time and has good Rademacher complexity immediately yields an algorithm for strongly refuting random 3-SAT and 3-XOR formulas. Recall that a 3-SAT clause takes in three literals (each a variable or its negation) and outputs TRUE if at least one of them is TRUE, and FALSE otherwise. A 3-SAT formula is a collection of clauses and the formula is satisfied if and only if all of its clauses are. Finally a 3-XOR formula is the same, but composed of 3-XOR clauses, each of which takes the XOR of three literals. Now let us finally define strong refutation.

Definition 2.9

For a formula \(\phi \), let \({\text {opt}}(\phi )\) be the largest fraction of clauses that can be satisfied by any assignment.

In what follows, we will use the term random 3-XOR formula to refer to a formula where each clause is generated by choosing an ordered triple of variables \((v_i, v_j, v_k)\) uniformly at random (and without replacement) and setting \(v_i \cdot v_j \cdot v_k = z\) where z is a random variable that takes on values \(+1\) and \(-1\).

Definition 2.10

An algorithm for strongly refuting random 3-XOR takes as input a 3-XOR formula \(\phi \) and outputs a quantity \(\text{ alg }(\phi )\) that satisfies

  1. 1.

    For any 3-XOR formula \(\phi \), \({\text {opt}}(\phi ) \le \text{ alg }(\phi )\)

  2. 2.

    If \(\phi \) is a random 3-XOR formula with m clauses, then with high probability \(\text{ alg }(\phi ) = 1/2 + o(1)\)

Standard concentration bounds imply that when \(m = \omega (n)\) we have \(\text{ opt }(\phi ) = 1/2 + o(1)\) with high probability. With fewer clauses, \(\text{ opt }(\phi )\) will be bounded away from 1/2 and the above problem is not well-defined. The goal is to design algorithms that use as few clauses as possible, and are able to certify that a random formula is indeed far from satisfiable (without underestimating the fraction of clauses that can be satisfied) and to do so as close as possible to the information theoretic threshold. If we are not concerned about running time, we could compute \(\text{ opt }(\phi )\) by exhaustive search over the \(2^n\) possible assignments. But what if we are restricted to polynomial time algorithms? In a celebrated work, Håstad [35] showed that deciding whether \({\text {opt}}(\phi ) \le 1/2 + \epsilon \) for any \(\epsilon > 0\) is NP-hard for 3-XOR formulas. However this is a worst-case hardness result and we are interested in random 3-XOR formulas.

Now we can make a deeper connection to tensor completion: Any polynomial time computable norm \(\Vert \cdot \Vert _{\mathcal {K}}\) that has good Rademacher complexity immediately yields an algorithm for strongly refuting random 3-XOR. We can follow the blueprint in Sect. 2.1 where given a formula \(\phi \) we map its m clauses to a collection of m observations according to the following rule: If there are n variables, we construct an \(n \times n \times n\) tensor Z where for each clause of the form \(v_i \cdot v_j \cdot v_k = z_{i,j,k}\) we put the entry \(z_{i,j,k}\) at location (ijk). All the rest of the entries in Z – i.e. all the triples of indices that do not show up as a clause in the 3-XOR formula – are set to zero. Now, since by assumption the norm \(\Vert \cdot \Vert _{\mathcal {K}}\) is polynomial time computable, we can solve the following optimization problem:

$$\begin{aligned} \max \eta \text{ s.t. } \exists X \text{ with } \Vert X\Vert _{{\mathcal {K}}} \le 1 \text{ and } \frac{1}{m} \langle Z, X \rangle \ge 2 \eta \end{aligned}$$
(3)

Let \(\eta ^*\) be the optimum value. We set \(\text{ alg }(\phi ) = 1/2 + \eta ^*\). What remains is to prove that the output of this algorithm solves the strong refutation problem for 3-XOR.

Theorem 2.11

Suppose that \(\Vert \cdot \Vert _{{\mathcal {K}}}\) is computable in polynomial time and satisfies \(\Vert X\Vert _{\mathcal {K}}\le 1\) whenever \(X = a \otimes a \otimes a\) and a is a vector with \(\pm 1\) entries. Further suppose that for any X with \(\Vert X\Vert _{{\mathcal {K}}} \le 1\) its entries are bounded by \(C^3\) in absolute value. Then (3) can be solved in polynomial time and if \(R^m(\Vert \cdot \Vert _{\mathcal {K}}) = o(1)\) then setting \(\text{ alg }(\phi ) = 1/2 + \eta ^*\) solves strong refutation for 3-XOR with \(O(C^6 m \log n)\) clauses.

Proof

The key observation is the following inequality which relates (3) to \({\text {opt}}(\phi )\).

$$\begin{aligned} 2{\text {opt}}(\phi ) -1 \le \frac{1}{m} \sup _{\Vert X\Vert _{\mathcal {K}}\le 1} \langle Z, X \rangle \end{aligned}$$

To establish this inequality, let \(v_1, v_2, ... , v_n\) be the assignment that maximizes the fraction of clauses satisfied. If we set \(a_i = v_i\) and \(X = a \otimes a \otimes a\) we have that \(\Vert X\Vert _{\mathcal {K}}\le 1\) by assumption. Thus X is a feasible solution. Now with this choice of X for the right hand side, every term in the sum that corresponds to a satisfied clause contributes \(+1\) and every term that corresponds to an unsatisfied clause contributes \(-1\). We get \( 2{\text {opt}}(\phi ) -1\) for this choice of X, and this completes the proof of the inequality above.

The crucial point is that the expectation of the right hand side over \(\Omega \) and \(\sigma \) is exactly the Rademacher complexity. However we want a bound that holds with high probability instead of just in expectation. It follows from McDiarmid’s inequality and the fact that the entries of Z and of X are bounded by 1 and by \(C^3\) in absolute value respectively that if we take \(O(C^6 m \log n)\) observations the right hand side will be o(1) with high probability. In this case, rearranging the inequality we have

$$\begin{aligned} {\text {opt}}(\phi ) \le 1/2 + \frac{1}{2m} \sup _{\Vert X\Vert _{\mathcal {K}}\le 1} \langle Z, X \rangle \end{aligned}$$

The right hand side is exactly \(\text{ alg }(\phi )\) and is \(1/2 + o(1)\) with high probability, which implies that both conditions in the definition for strong refutation hold and this completes the proof. \(\square \)

We can now combine Theorem 2.11 with the bound on the Rademacher complexity of the tensor nuclear norm given in Lemma 2.8 to conclude that if we could compute the tensor nuclear norm we would also obtain an algorithm for strongly refuting random 3-XOR with only \(m = \Omega (n \log n)\) clauses. It is not obvious but it turns out that any algorithm for strongly refuting random 3-XOR implies one for 3-SAT. Let us define strong refutation for 3-SAT. We will refer to any variable \(v_i\) or its negation \({\bar{v}}_i\) as a literal. We will use the term random 3-SAT formula to refer to a formula where each clause is generated by choosing an ordered triple of literals \((y_i, y_j, y_k)\) uniformly at random (and without replacement) and setting \(y_i \vee y_j \vee y_k = 1\).

Definition 2.12

An algorithm for strongly refuting random 3-SAT takes as input a 3-SAT formula \(\phi \) and outputs a quantity \(\text{ alg }(\phi )\) that satisfies

  1. 1.

    For any 3-SAT formula \(\phi \), \({\text {opt}}(\phi ) \le \text{ alg }(\phi )\)

  2. 2.

    If \(\phi \) is a random 3-SAT formula with m clauses, then with high probability \(\text{ alg }(\phi ) = 7/8 + o(1)\)

The only change from Definition 2.10 comes from the fact that for 3-SAT a random assignment satisfies a 7/8 fraction of the clauses in expectation. Our goal here is to certify that the largest fraction of clauses that can be satisfied is \(7/8 + o(1)\). The connection between refuting random 3-XOR and 3-SAT is often called “Feige’s XOR Trick” [22]. The first version of it was used to show that an algorithm for \(\epsilon \)-refuting 3-XOR can be turned into an algorithm for \(\epsilon \)-refuting 3-SAT. However we will not use this notion of refutation, so for further details we refer the reader to [22]. The reduction was later extended by Coja-Oghlan, Goerdt and Lanka [17] to strong refutation, which for us yields the following corollary:

Corollary 2.13

Suppose that \(\Vert \cdot \Vert _{{\mathcal {K}}}\) is computable in polynomial time and satisfies \(\Vert X\Vert _{\mathcal {K}}\le 1\) whenever \(X = a \otimes a \otimes a\) and a is a vector with \(\pm 1\) entries. Suppose further that for any X with \(\Vert X\Vert _{{\mathcal {K}}} \le 1\) its entries are bounded by \(C^3\) in absolute value and that \(R^m(\Vert \cdot \Vert _{\mathcal {K}}) = o(1)\). Then there is a polynomial time algorithm for strongly refuting a random 3-SAT formula with \(O(C^6 m \log n)\) clauses.

Now we can get a better understanding of the obstacles to noisy tensor completion by connecting it to the literature on refuting random 3-SAT. Despite a long line of work on refuting random 3-SAT [17, 23, 24, 26, 30], there is no known polynomial time algorithm that works with \(m = n^{3/2 - \epsilon }\) clauses for any \(\epsilon > 0\). Feige [22] conjectured that for any constant C, there is no polynomial time algorithm for refuting random 3-SAT with \(m = Cn\) clauses. Daniely et al. [18] conjectured that there is no polynomial time algorithm for \(m = n^{3/2 - \epsilon }\) for any \(\epsilon > 0\). What we have shown above is that any norm that is a relaxation to the tensor nuclear norm, can be computed in polynomial time and has Rademacher complexity \(R^m(\Vert \cdot \Vert _{\mathcal {K}}) = o(1)\) for \(m = n^{3/2 - \epsilon }\) would disprove the conjecture of Daniely et al. [18] and would yield much better algorithms for refuting random 3-SAT than we currently know, despite fifteen years of work on the subject.

This leaves open an important question. While there are no known polynomial time algorithms for strongly refuting random 3-SAT with \(m = n^{3/2 - \epsilon }\) clauses, there are algorithms that work with roughly \(m = n^{3/2}\) clauses [17]. Do these algorithms have any implications for noisy tensor completion? We will adapt the algorithm of Coja-Oghlan, Goerdt and Lanka [17] and embed it within the sum-of-squares hierarchy. In turn, this will give us a norm that we can use to solve noisy tensor completion which uses a polynomial factor fewer observations than known algorithms. After our work, Raghavendra et al. [56] gave subexponential time algorithms for refuting random 3-SAT with \(m = n^{3/2 - \epsilon }\) clauses. These bounds could likely be used to give subexponential time algorithms for noisy tensor completion, using our framework for embedding refutation algorithms into the sum-of-squares hierarchy and using their analysis to bound the Rademacher complexity. However such algorithms would now require nearly \(n^{2 \epsilon }\) levels rather than six levels, which is what we work with here.

3 Bounding the Rademacher complexity

3.1 Pseudo-expectation

Here we introduce the sum-of-squares hierarchy and will use it (at level six) to give a relaxation to the tensor nuclear norm. This will be the norm that we will use in proving our main upper bounds. First we introduce the notion of a pseudo-expectation operator from [6,7,8]:

Definition 3.1

(Pseudo-expectation [6]) Let k be even and let \(P_k^{n'}\) denote the linear subspace of all polynomials of degree at most k on \(n'\) variables. A linear operator \({\widetilde{\mathop {\mathbf{E}}}}:P_k^{n'}\rightarrow {\mathbb {R}}\) is called a degree k pseudo-expectation operator if it satisfies the following conditions:

  1. (1)

    \({\widetilde{\mathop {\mathbf{E}}}}[1] = 1\) (normalization)

  2. (2)

    \({\widetilde{\mathop {\mathbf{E}}}}[P^2] \ge 0\), for any degree at most k/2 polynomial P (nonnegativity)

Moreover suppose that \(p \in P_k^{n'}\) with \(\text{ deg }(p) = k'\). We say that \({\widetilde{\mathop {\mathbf{E}}}}\) satisfies the constraint \(\{p = 0\}\) if \({\widetilde{\mathop {\mathbf{E}}}}[pq] = 0\) for every \(q \in P_{k - k'}^{n'}\). And we say that \({\widetilde{\mathop {\mathbf{E}}}}\) satisfies the constraint \(\{ p \ge 0\}\) if \({\widetilde{\mathop {\mathbf{E}}}}[p q^2] \ge 0\) for every \(q \in P_{\lfloor (k-k')/2 \rfloor }^{n'}\).

The rationale behind this definition is that if \(\mu \) is a distribution on vectors in \({\mathbb {R}}^{n'}\) then the operator \({\widetilde{\mathop {\mathbf{E}}}}[p] = \mathop {\mathbf{E}}_{Y \leftarrow \mu }[p(Y)]\) is a degree k pseudo-expectation operator for every k — i.e. it meets the conditions of Definition 3.1. However the converse is in general not true.
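To make Definition 3.1 concrete, here is a minimal sketch (in Python, assuming the cvxpy package is available) for the simplest case \(k = 2\): a degree two pseudo-expectation is determined by its first and second moments, the nonnegativity condition \({\widetilde{\mathop {\mathbf{E}}}}[P^2] \ge 0\) for linear P is exactly positive semidefiniteness of the corresponding moment matrix, and at degree two a constraint such as \(\{\sum _i x_i^2 = n\}\) or \(\{x_i^2 \le C^2\}\) reduces to a linear condition on that matrix. The degree six operators used in this paper are the same idea with a larger moment matrix, so this is only an illustration of the formalism, not of the actual relaxation we analyze.

```python
import cvxpy as cp

n, C = 5, 1.5

# moment matrix of a degree-2 pseudo-expectation on variables x_1, ..., x_n:
# M[0, 0] = E~[1], M[0, i] = E~[x_i], M[i, j] = E~[x_i x_j]  (variables are 1-indexed)
M = cp.Variable((n + 1, n + 1), symmetric=True)

constraints = [
    M >> 0,                      # E~[P^2] >= 0 for every polynomial P of degree <= 1
    M[0, 0] == 1,                # normalization E~[1] = 1
    cp.trace(M[1:, 1:]) == n,    # satisfies { sum_i x_i^2 = n }
]
constraints += [M[i, i] <= C**2 for i in range(1, n + 1)]   # { x_i^2 <= C^2 }

prob = cp.Problem(cp.Minimize(0), constraints)
prob.solve()
print(prob.status)   # 'optimal' means such a degree-2 pseudo-expectation exists
```

With pseudo-expectations in hand, we are now ready to define the norm that will be used in our upper bounds: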

Definition 3.2

(\(SOS_k\) norm) We let \({\mathcal {K}}_k\) be the set of all \(X \in {\mathbb {R}}^{ n_1 \times n_2 \times n_3}\) such that there exists a degree k pseudo-expectation operator on \(P_k^{n_1 + n_2 + n_3}\) satisfying the following polynomial constraints (where the variables are the \(Y^{(a)}_i\)’s)

  1. (a)

    \(\{ \sum _{i=1}^{n_1} (Y^{(1)}_i)^2 = n_1 \}\), \(\{ \sum _{i=1}^{n_2} (Y^{(2)}_i)^2 = n_2 \}\) and \(\{ \sum _{i=1}^{n_3} (Y^{(3)}_i)^2 = n_3 \}\)

  2. (b)

    \(\{ (Y^{(1)}_i)^2 \le C^2 \}\), \(\{ (Y^{(2)}_i)^2 \le C^2 \}\) and \(\{ (Y^{(3)}_i)^2 \le C^2 \}\) for all i and

  3. (c)

    \(X_{i,j,k} = {\widetilde{\mathop {\mathbf{E}}}}[ Y^{(1)}_i Y^{(2)}_j Y^{(3)}_k ]\) for all ij and k.

The \(SOS_k\) norm of \(X \in {\mathbb {R}}^{ n_1 \times n_2 \times n_3}\) which is denoted by \(\Vert X\Vert _{{\mathcal {K}}_k}\) is the infimum over \(\alpha \) such that \(X/\alpha \in {\mathcal {K}}_k\).

The constraints in Definition 3.1 can be expressed as an \(O(n^k)\)-sized semidefinite program. This implies that given any set of polynomial constraints of the form \(\{p = 0\}\), \(\{p \ge 0\}\), one can efficiently find a degree k pseudo-expectation satisfying those constraints if one exists. This is often called the degree k Sum-of-Squares algorithm [44, 53, 54, 59]. Hence we can compute the norm \( \Vert X\Vert _{{\mathcal {K}}_k}\) of any tensor X to within arbitrary accuracy in polynomial time. And because it is a relaxation to the tensor nuclear norm which is defined analogously but over a distribution on C-incoherent vectors instead of a pseudo-expectation over them, we have that \(\Vert X\Vert _{{\mathcal {K}}_k} \le \Vert X\Vert _{{{\mathcal {A}}}} \) for every tensor X. Throughout most of this paper, we will be interested in the case \(k = 6\). Our main technical result is an upper bound on its Rademacher complexity:

Theorem 3.3

\(R^m(\Vert \cdot \Vert _{{\mathcal {K}}_6}) \le O\Big (\sqrt{\frac{ (n_1)^{1/2} (n_2 + n_3) \log ^4 n}{ m} }\Big )\)

In Corollary 5.6 we establish a matching lower bound.

3.2 Resolution in \({\mathcal {K}}_6\)

Recall that any polynomial time computable norm with good Rademacher complexity at m observations yields an algorithm for strong refutation with roughly m clauses. Here we will use an algorithm for strongly refuting random 3-SAT to guide our search for an appropriate norm. We will adapt an algorithm due to Coja-Oghlan, Goerdt and Lanka [17] that strongly refutes random 3-SAT, and will instead give an algorithm that strongly refutes random 3-XOR. Moreover each of the steps in the algorithm embeds into the sixth level of the sum-of-squares hierarchy by mapping resolution operations to applications of Cauchy-Schwarz, which ultimately shows how the inequalities that define the norm (Definition 3.2) can be manipulated to give bounds on its Rademacher complexity.

Let’s return to the task of bounding the Rademacher complexity of \(\Vert \cdot \Vert _{{\mathcal {K}}_6}\). Let X be arbitrary but satisfy \(\Vert X\Vert _{{\mathcal {K}}_6} \le 1\). Then there is a degree six pseudo-expectation operator meeting the conditions of Definition 3.2. Using Cauchy-Schwarz we have:

$$\begin{aligned} \Big ( \langle Z, X \rangle \Big )^2= & {} \Big ( \sum _{i} \sum _{j,k} Z_{i,j,k} {\widetilde{\mathop {\mathbf{E}}}}[Y^{(1)}_i Y^{(2)}_j Y^{(3)}_k] \Big )^2\nonumber \\\le & {} n_1 \Big ( \sum _i \Big ( \sum _{j,k} Z_{i,j,k} {\widetilde{\mathop {\mathbf{E}}}}[Y^{(1)}_i Y^{(2)}_j Y^{(3)}_k] \Big )^2 \Big ) \end{aligned}$$
(4)

To simplify our notation, we will define the following polynomial

$$\begin{aligned} Q_{i,Z}(Y^{(2)}, Y^{(3)}) = \sum _{j,k} Z_{i,j,k} Y^{(2)}_j Y^{(3)}_k \end{aligned}$$

which we will use repeatedly. If d is even then any degree d pseudo-expectation operator satisfies the constraint \(({\widetilde{\mathop {\mathbf{E}}}}[p])^2 \le {\widetilde{\mathop {\mathbf{E}}}}[p^2]\) for every polynomial p of degree at most d/2 (e.g., see Lemma A.4 in [4]). Hence the right hand side of (4) can be bounded as:

$$\begin{aligned} n_1 \Big ( \sum _i \Big ( {\widetilde{\mathop {\mathbf{E}}}}[Y^{(1)}_i Q_{i,Z}(Y^{(2)}, Y^{(3)})] \Big )^2 \Big ) \le n_1 \sum _i {\widetilde{\mathop {\mathbf{E}}}}\Big [\Big (Y^{(1)}_i Q_{i,Z}(Y^{(2)}, Y^{(3)}) \Big )^2 \Big ] \end{aligned}$$
(5)

It turns out that bounding the right-hand side of (5) boils down to bounding the spectral norm of the following matrix.

Definition 3.4

Let A be the \(n_2 n_3 \times n_2 n_3\) matrix whose rows and columns are indexed over ordered pairs \((j,k')\) and \((j',k)\) respectively, defined as

$$\begin{aligned} A_{j,k',j',k} = \sum _{i} Z_{i,j,k} Z_{i,j',k'} \end{aligned}$$

We can now make the connection to resolution more explicit: We can think of a pair of observations \(Z_{i,j,k}, Z_{i,j',k'}\) as a pair of 3-XOR constraints, as usual. Resolving them (i.e. multiplying them) we obtain a 4-XOR constraint

$$\begin{aligned} x_j \cdot x_k \cdot x_{j'} \cdot x_{k'} = Z_{i,j,k}Z_{i,j',k'} \end{aligned}$$

A captures the effect of resolving certain pairs of 3-XOR constraints into 4-XOR constraints. The challenge is that the entries in A are not independent, so bounding its maximum singular value will require some care. It is important that the rows of A are indexed by \((j, k')\) and the columns are indexed by \((j', k)\), so that j and \(j'\) come from different 3-XOR clauses, as do k and \(k'\); otherwise the spectral bounds that we want to prove about A would simply not be true! This is perhaps the key insight in [17].

It will be more convenient to decompose A and reason about its two types of contributions separately. To that end, we let R be the \(n_2 n_3 \times n_2 n_3\) matrix whose non-zero entries are of the form

$$\begin{aligned} R_{j,k,j,k} = \sum _{i} Z_{i,j,k} Z_{i,j,k} \end{aligned}$$

and all of its other entries are set to zero. In particular R is diagonal. Then let B be the \(n_2 n_3 \times n_2 n_3\) matrix whose entries are of the form

$$\begin{aligned} B_{j,k',j',k} = {\left\{ \begin{array}{ll} 0, \text{ if } j = j' \text{ and } k = k' \\ \sum _{i} Z_{i,j,k} Z_{i,j',k'} \text{ else } \end{array}\right. } \end{aligned}$$

By construction we have \(A = B + R\).
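
For concreteness, the following small numpy sketch (with hypothetical toy dimensions, and with Z taken to be \(\pm 1\) signs at the m observed positions and zero elsewhere, as in this section) builds A directly from Z and splits off its diagonal part R and off-diagonal part B.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions; Z holds +-1 signs at m observed positions, 0 elsewhere.
n1, n2, n3, m = 5, 4, 3, 20
Z = np.zeros((n1, n2, n3))
obs = rng.choice(n1 * n2 * n3, size=m, replace=False)
Z.flat[obs] = rng.choice([-1.0, 1.0], size=m)

# A_{(j,k'),(j',k)} = sum_i Z_{i,j,k} Z_{i,j',k'} (Definition 3.4): the einsum output
# axes are ordered (j, k', j', k) and then flattened into an (n2*n3) x (n2*n3) matrix.
A = np.einsum('ijk,ipq->jqpk', Z, Z).reshape(n2 * n3, n2 * n3)

# R keeps the diagonal ("same pair") contributions sum_i Z_{i,j,k}^2, and B is the
# off-diagonal remainder, so A = B + R.
R = np.diag(np.diag(A))
B = A - R

assert np.allclose(A, B + R)
assert np.isclose(np.trace(R), m)   # each observed entry contributes Z_{i,j,k}^2 = 1
```

With this decomposition in hand, the key estimate is the following: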

Lemma 3.5

$$\begin{aligned} \sum _i {\widetilde{\mathop {\mathbf{E}}}} \Big [\Big (Y^{(1)}_i Q_{i,Z}(Y^{(2)}, Y^{(3)}) \Big )^2 \Big ] \le C^2 n_2 n_3 \Vert B\Vert + C^6 m \end{aligned}$$

Proof

The pseudo-expectation operator satisfies \(\{ (Y^{(1)}_i)^2 \le C^2 \}\) for all i, and hence we have

$$\begin{aligned}&\sum _i {\widetilde{\mathop {\mathbf{E}}}}\Big [\Big (Y^{(1)}_i Q_{i,Z}(Y^{(2)}, Y^{(3)}) \Big )^2 \Big ] \le C^2 \sum _i {\widetilde{\mathop {\mathbf{E}}}}\Big [\Big ( Q_{i,Z}(Y^{(2)}, Y^{(3)}) \Big )^2 \Big ]\\&\quad = C^2 \sum _i \sum _{j,k,j',k'} {\widetilde{\mathop {\mathbf{E}}}}\Big [ Z_{i,j,k} Z_{i,j',k'} Y^{(2)}_j Y^{(3)}_k Y^{(2)}_{j'} Y^{(3)}_{k'}\Big ] \end{aligned}$$

Now let \(Y^{(2)} \in {\mathbb {R}}^{n_2}\) be a vector of variables where the ith entry is \(Y^{(2)}_i\) and similarly for \(Y^{(3)}\). Then we can re-write the right hand side as a matrix inner-product:

$$\begin{aligned}&C^2\sum _i \sum _{j,k,j',k'} Z_{i,j,k} Z_{i,j',k'} {\widetilde{\mathop {\mathbf{E}}}}[ Y^{(2)}_j Y^{(3)}_k Y^{(2)}_{j'} Y^{(3)}_{k'}] \\&\qquad = C^2 \langle A, {\widetilde{\mathop {\mathbf{E}}}}[(Y^{(2)} \otimes Y^{(3)}) (Y^{(2)} \otimes Y^{(3)})^T] \rangle \end{aligned}$$

We will now bound the contribution of B and R separately.

Claim 3.6

\({\widetilde{\mathop {\mathbf{E}}}}[(Y^{(2)} \otimes Y^{(3)}) (Y^{(2)} \otimes Y^{(3)})^T]\) is positive semidefinite and has trace at most \( n_2 n_3\)

Proof

It is easy to see that a quadratic form on \({\widetilde{\mathop {\mathbf{E}}}}[(Y^{(2)} \otimes Y^{(3)}) (Y^{(2)} \otimes Y^{(3)})^T]\) corresponds to \({\widetilde{\mathop {\mathbf{E}}}}[p^2]\) for some \(p \in P_2^{n_2 + n_3}\) and this implies the first part of the claim. Finally

$$\begin{aligned}&\text{ Tr }({\widetilde{\mathop {\mathbf{E}}}}[(Y^{(2)} \otimes Y^{(3)}) (Y^{(2)} \otimes Y^{(3)})^T])= \sum _{j,k} {\widetilde{\mathop {\mathbf{E}}}}[(Y^{(2)}_j)^2 (Y^{(3)}_k)^2] \le n_2 n_3 \end{aligned}$$

where the last step follows because the pseudo-expectation operator satisfies the constraints \(\{ \sum _{i=1}^{n_2} (Y^{(2)}_i)^2 = n_2 \}\) and \(\{ \sum _{i=1}^{n_3} (Y^{(3)}_i)^2 = n_3 \}\). \(\square \)

Hence we can bound the contribution of the first term as \(C^2 \langle B, {\widetilde{\mathop {\mathbf{E}}}}[(Y^{(2)} \otimes Y^{(3)}) (Y^{(2)} \otimes Y^{(3)})^T] \rangle \le C^2 n_2 n_3 \Vert B\Vert \), since for any positive semidefinite matrix M with \(\text{ Tr }(M) \le n_2 n_3\) we have \(\langle B, M \rangle \le \Vert B\Vert \text{ Tr }(M)\). Now we proceed to bound the contribution of the second term:

Claim 3.7

\({\widetilde{\mathop {\mathbf{E}}}}[(Y^{(2)}_j)^2 (Y^{(3)}_k)^2] \le C^4\)

Proof

It is easy to verify by direct computation that the following equality holds:

$$\begin{aligned} C^4 - (Y^{(2)}_j)^2 (Y^{(3)}_k)^2&= \Big (C^2 - (Y^{(2)}_j)^2 \Big ) \Big (C^2 - (Y^{(3)}_k)^2 \Big )\\&\quad + \Big (C^2 - (Y^{(3)}_k)^2\Big ) (Y^{(2)}_j)^2 + \Big (C^2 - (Y^{(2)}_j)^2 \Big )(Y^{(3)}_k)^2 \end{aligned}$$

Moreover the pseudo-expectation of each of the three terms above is nonnegative, by construction. This implies the claim. \(\square \)
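
As a sanity check, the identity used above is easy to confirm symbolically; a minimal sympy sketch, with a and b standing in for \(Y^{(2)}_j\) and \(Y^{(3)}_k\):

```python
import sympy as sp

C, a, b = sp.symbols('C a b')
lhs = C**4 - a**2 * b**2
rhs = ((C**2 - a**2) * (C**2 - b**2)
       + (C**2 - b**2) * a**2
       + (C**2 - a**2) * b**2)
assert sp.expand(lhs - rhs) == 0   # the three-term decomposition from Claim 3.7
```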

Moreover each entry in Z is in the set \(\{-1, 0, +1\}\) and there are precisely m non-zeros. Thus the sum of the absolute values of all entries in R is at most m. Now we have:

$$\begin{aligned} C^2 \langle R, {\widetilde{\mathop {\mathbf{E}}}}[(Y^{(2)} \otimes Y^{(3)}) (Y^{(2)} \otimes Y^{(3)})^T] \rangle \le C^2 \sum _{j,k} R_{j, k, j, k} {\widetilde{\mathop {\mathbf{E}}}}[(Y^{(2)}_j)^2 (Y^{(3)}_k)^2] \le C^6 m \end{aligned}$$

And this completes the proof of the lemma. \(\square \)

4 Spectral bounds

The main remaining step is to bound \(\Vert B\Vert \), where B was defined in the previous section. In fact, for our spectral bounds it will be more convenient to relabel the variables (but keeping the definition intact):

$$\begin{aligned} B_{j,k,j',k'} = {\left\{ \begin{array}{ll} 0, \text{ if } j = j' \text{ and } k = k' \\ \sum _{i} Z_{i,j,k'} Z_{i,j',k} \text{ else } \end{array}\right. } \end{aligned}$$

Directly working with B would still be challenging because of the complex dependencies among its entries. Instead, to make the analysis simpler we will randomly group its entries according to the following rule: For \(r = 1, 2, ..., O(\log n)\), partition the set of all ordered triples \((i,j,k)\) into two sets \(S_r\) and \(T_r\). We will use this ensemble of partitions to define an ensemble of matrices \(\{{\mathsf {B}}^r\}_{r =1}^{O(\log n)}\): Set \(U_{i,j,k'}^r\) equal to \(Z_{i,j,k'}\) if \((i,j,k') \in S_r\) and zero otherwise. Similarly set \(V_{i,j',k}^r\) equal to \(Z_{i,j',k}\) if \((i,j',k) \in T_r\) and zero otherwise. Also let \(E_{i,j,j',k,k',r}\) be the event that there is no \(r' < r\) where \((i,j,k') \in S_{r'}\) and \((i,j',k) \in T_{r'}\), nor is there an \(r' < r\) where \((i,j',k) \in S_{r'}\) and \((i,j,k') \in T_{r'}\). Now let

$$\begin{aligned} {\mathsf {B}}^r_{j,k,j',k'} = \sum _{i} U_{i,j,k'}^r V_{i,j',k}^r \mathbb {1}_{E} \end{aligned}$$

where \(\mathbb {1}_E\) is short-hand for the indicator function of the event \(E_{i,j,j',k,k',r}\). The idea behind this construction is that each pair of triples \((i,j,k')\) and \((i,j',k)\) that contributes to B will contribute to some \({\mathsf {B}}^r\) with high probability. (This follows by standard concentration bounds, because the chance that any particular pair of triples is split between the two sides of a random partition is a constant.) Moreover it will not contribute to any later matrix in the ensemble. Hence with high probability

$$\begin{aligned} B = \sum _{r = 1}^{O(\log n)} {\mathsf {B}}^r \end{aligned}$$
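
The covering claim is easy to simulate. A minimal sketch, assuming each triple is placed in \(S_r\) or \(T_r\) by an independent fair coin flip (one natural way to choose the random partitions), which checks that after \(O(\log n)\) rounds essentially every pair of triples has been split between the two sides at least once:

```python
import numpy as np

rng = np.random.default_rng(3)

num_triples = 50                   # stands in for the observed triples
rounds = 40                        # stands in for the O(log n) random partitions

# side[r, t] = 0 if triple t is placed in S_r, and 1 if it is placed in T_r.
side = rng.integers(0, 2, size=(rounds, num_triples))

pairs = [(a, b) for a in range(num_triples) for b in range(a + 1, num_triples)]
split_at_least_once = [any(side[r, a] != side[r, b] for r in range(rounds))
                       for a, b in pairs]
print(sum(split_at_least_once) / len(pairs))   # ~1.0: each pair is covered w.h.p.
```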

Throughout the rest of this section, we will suppress the superscript r and work with a particular matrix in the ensemble, \({\mathsf {B}}\). Now let \(\ell \) be even and consider

$$\begin{aligned} {{\,\mathrm{Tr}\,}}(\underbrace{{\mathsf {B}} {\mathsf {B}}^T {\mathsf {B}} {\mathsf {B}}^T ... {\mathsf {B}} {\mathsf {B}}^T}_{\ell \text{ times } }) \end{aligned}$$
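
The reason such traces control the spectral norm is that \(\mathrm{Tr}\big(({\mathsf {B}}{\mathsf {B}}^T)^{\ell /2}\big) = \sum _i \sigma _i({\mathsf {B}})^\ell \ge \Vert {\mathsf {B}}\Vert ^\ell \), so the \(\ell \)-th root of the trace upper bounds \(\Vert {\mathsf {B}}\Vert \) and tightens as \(\ell \) grows. A minimal numpy sketch, on an assumed random toy matrix rather than \({\mathsf {B}}\) itself:

```python
import numpy as np

rng = np.random.default_rng(4)

M = rng.standard_normal((30, 30))           # stand-in for the ensemble matrix B
spectral = np.linalg.norm(M, 2)             # largest singular value

for ell in (2, 4, 8, 16, 32):
    gram_power = np.linalg.matrix_power(M @ M.T, ell // 2)
    trace_root = np.trace(gram_power) ** (1.0 / ell)
    print(ell, round(spectral, 3), round(trace_root, 3))   # trace_root >= spectral
```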

As is standard, we are interested in bounding \(\mathop {\mathbf{E}}[{{\,\mathrm{Tr}\,}}({\mathsf {B}} {\mathsf {B}}^T {\mathsf {B}}{\mathsf {B}}^T ... {\mathsf {B}} {\mathsf {B}}^T)]\) in order to bound \(\Vert {\mathsf {B}}\Vert \). But note that \({\mathsf {B}}\) is not symmetric. Also note that the random variables U and V are not independent; however, whether or not they are non-zero is non-positively correlated and their signs are mutually independent. Expanding the trace above we have

$$\begin{aligned}&{{\,\mathrm{Tr}\,}}({\mathsf {B}} {\mathsf {B}}^T {\mathsf {B}}{\mathsf {B}}^T ... {\mathsf {B}} {\mathsf {B}}^T) \\&\quad = \sum _{j_1, k_1} \sum _{j_2, k_2} ... \sum _{j_{\ell }, k_{\ell }} {\mathsf {B}}_{j_1,k_1,j_2, k_2} {\mathsf {B}}_{j_3, k_3, j_2, k_2} ... {\mathsf {B}}_{j_1,k_1, j_{\ell }, k_{\ell } }\\&\quad = \sum _{j_1, k_1} \sum _{i_1} \sum _{j_2, k_2} \sum _{i_2} ... \sum _{j_{\ell }, k_{\ell }} \sum _{i_\ell } U_{i_1,j_1,k_2} V_{i_1,j_2,k_1} \mathbb {1}_{E_1} \\&\qquad U_{i_2,j_3,k_2} V_{i_2,j_2,k_3} \mathbb {1}_{E_2}... U_{i_\ell ,j_{1},k_\ell } V_{i_\ell ,j_\ell ,k_1}\mathbb {1}_{E_\ell } \end{aligned}$$

where \(\mathbb {1}_{E_1}\) is the indicator for the event that the entry \({\mathsf {B}}_{j_1,k_1,j_2,k_2}\) is not covered by an earlier matrix in the ensemble, and similarly for \(\mathbb {1}_{E_2}, ..., \mathbb {1}_{E_\ell }\).

Notice that there are \(2 \ell \) random variables in the above sum (ignoring the indicator variables). Moreover if any U or V random variable appears an odd number of times, then the contribution of the term to \(\mathop {\mathbf{E}}[{{\,\mathrm{Tr}\,}}({\mathsf {B}} {\mathsf {B}}^T {\mathsf {B}}{\mathsf {B}}^T ... {\mathsf {B}} {\mathsf {B}}^T)]\) is zero. We will give an encoding for each term that has a non-zero contribution, and we will prove that it is injective.

Fix a particular term in the above sum where each random variable appears an even number of times. Let s be the number of distinct values for i. Moreover let \(i_1, i_2, ..., i_s\) be the order that these indices first appear. Now let \(r^j_1\) denote the number of distinct values for j that appear with \(i_1\) in U terms — i.e. \(r^j_1\) is the number of distinct j’s that appear as \(U_{i_1, j, *}\). Let \(r^k_1\) denote the number of distinct values for k that appear with \(i_1\) in U terms — i.e. \(r^k_1\) is the number of distinct k’s that appear as \(U_{i_1, *, k}\). Similarly let \(q^j_1\) denote the number of distinct values for j that appear with \(i_1\) in V terms — i.e. \(q^j_1\) is the number of distinct j’s that appear as \(V_{i_1, j, *}\). And finally let \(q^k_1\) denote the number of distinct values for k that appear with \(i_1\) in V terms — i.e. \(q^k_1\) is the number of distinct k’s that appear as \(V_{i_1, *, k}\).

We give our encoding below. It is more convenient to think of the encoding as any way to answer the following questions about the term.

  1. (a)

    What is the order \(i_1, i_2, ..., i_s\) of the first appearance of each distinct value of i?

  2. (b)

    For each i that appears, what is the order of each of the distinct values of j’s and k’s that appear along with it in U? Similarly, what is the order of each of the distinct values of j’s and k’s that appear along with it in V?

  3. (c)

    For each step (i.e. a new variable in the term when reading from left to right), has the value of i been visited already? Also, has the value for j or k that appears along with U been visited? Has the value for j or k that appears along with V been visited? Note that whether or not j or k has been visited (together in U) depends on what the value of i is, and if i is a new value then the j or k value must be new too, by definition. Finally, if any value has already been visited, which earlier value is it?

Let \(r_j = r^j_1 + r^j_2 + ... + r^j_s\) and \(r_k = r^k_1 + r^k_2 + ... + r^k_s\). Similarly let \(q_j = q^j_1 + q^j_2 + ... + q^j_s\) and \(q_k = q^k_1 + q^k_2 + ... + q^k_s\). Then the number of possible answers to (a) and (b) is at most \(n_1^s\) and \(n_2^{r_j} n_3^{r_k} n_2^{q_j} n_3^{q_k}\) respectively. It is also easy to see that the number of answers to (c) that arise over the sequence of \(\ell \) steps is at most \(8^\ell (s( r_j + r_k)( q_j + q_k))^\ell \). We remark that much of the work on bounding the maximum eigenvalue of a random matrix is in removing any \(\ell ^\ell \) type terms, and so one needs to encode re-visiting indices more compactly. However such terms will only cost us polylogarithmic factors in our bound on \(\Vert B\Vert \).

It is easy to see that this encoding is injective, since given the answers to the above questions one can simulate each step and recover the sequence of random variables. Next we establish some easy facts that allow us to bound \(\mathop {\mathbf{E}}[{{\,\mathrm{Tr}\,}}({\mathsf {B}} {\mathsf {B}}^T {\mathsf {B}}{\mathsf {B}}^T ... {\mathsf {B}} {\mathsf {B}}^T)]\).

Claim 4.1

For any term that has a non-zero contribution to \(\mathop {\mathbf{E}}[{{\,\mathrm{Tr}\,}}({\mathsf {B}} {\mathsf {B}}^T {\mathsf {B}}{\mathsf {B}}^T ... {\mathsf {B}} {\mathsf {B}}^T)]\), we must have \(s \le \ell /2\) and \(r_j + q_j + r_k + q_k \le \ell \)

Proof

Recall that there are \(2 \ell \) random variables in the product and precisely \(\ell \) of them correspond to U variables and \(\ell \) of them to V variables. Suppose that \(s > \ell /2\). Then there must be at least one U variable and at least one V variable that occur exactly once, which implies that the expectation of the term is zero because the signs of the non-zero entries are mutually independent. Similarly suppose \( r_j + q_j + r_k + q_k > \ell \). Then there must be at least one U or V variable that occurs exactly once, which also implies that the expectation of the term is zero. \(\square \)

Claim 4.2

For any valid encoding, \(s \le r_j + q_j\) and \(s \le r_k+ q_k\).

Proof

This holds because in each step where the i variable is new and has not been visited before, by definition the j variable is new too (for the current i) and similarly for the k variable. \(\square \)

Finally, if \(s, r_j, q_j, r_k\) and \(q_k\) are defined as above then for any contributing term

$$\begin{aligned} U_{i_1,j_1,k_2} V_{i_1,j_2,k_1} U_{i_2,j_3,k_2} V_{i_2,j_2,k_3} ... U_{i_\ell ,j_{1},k_\ell } V_{i_\ell ,j_\ell ,k_1} \end{aligned}$$

its expectation is at most \(p^{r_j + r_k} p^{q_j + q_k}\), where \(p = m/(n_1 n_2 n_3)\) is the probability that any particular entry of T is observed. This is because there are exactly \(r_j + r_k\) distinct U variables and \(q_j + q_k\) distinct V variables, each of whose values is in the set \(\{-1, 0, +1\}\); whether or not these variables are non-zero is non-positively correlated and their signs are mutually independent.

This now implies the main lemma:

Lemma 4.3

\(\mathop {\mathbf{E}}[{{\,\mathrm{Tr}\,}}({\mathsf {B}} {\mathsf {B}}^T {\mathsf {B}}{\mathsf {B}}^T ... {\mathsf {B}} {\mathsf {B}}^T)] \le n_1^{\ell /2} (\max (n_2, n_3))^{\ell } p^\ell (\ell )^{3 \ell + 3}\)

Proof

Note that the indicator variables only have the effect of zeroing out some terms that could otherwise contribute to \(\mathop {\mathbf{E}}[{{\,\mathrm{Tr}\,}}({\mathsf {B}} {\mathsf {B}}^T {\mathsf {B}}{\mathsf {B}}^T ... {\mathsf {B}} {\mathsf {B}}^T)]\). Returning to the task at hand, we have

$$\begin{aligned}&\mathop {\mathbf{E}}[{{\,\mathrm{Tr}\,}}({\mathsf {B}} {\mathsf {B}}^T {\mathsf {B}}{\mathsf {B}}^T ... {\mathsf {B}} {\mathsf {B}}^T)] \\&\quad \le \sum _{s, r_j, r_k, q_j, q_k} n_1^s n_2^{r_j} n_3^{r_k} n_2^{q_j} n_3^{q_k} p^{r_j + r_k} p^{q_j + q_k} 8^\ell (s( r_j + r_k)( q_j + q_k))^\ell \end{aligned}$$

where the sum is over all valid tuples \(s, r_j, r_k, q_j, q_k\), and hence \(s \le \ell /2\), \(r_j + q_j + r_k + q_k \le \ell \), \(s \le r_j + q_j\) and \(s \le r_k + q_k\) using Claim 4.1 and Claim 4.2. We can upper bound the above as

$$\begin{aligned} \mathop {\mathbf{E}}[{{\,\mathrm{Tr}\,}}({\mathsf {B}} {\mathsf {B}}^T {\mathsf {B}}{\mathsf {B}}^T ... {\mathsf {B}} {\mathsf {B}}^T)]\le & {} \sum _{s, r_j, r_k, q_j, q_k} n_1^s (pn_2)^{r_j + q_j} (pn_3)^{r_k + q_k} (\ell )^{3 \ell + 3} \\\le & {} \sum _{s, r_j, r_k, q_j, q_k} n_1^s (p \max (n_2, n_3))^{r_j + q_j + r_k + q_k} (\ell )^{3 \ell + 3} \end{aligned}$$

Now if \(p \max (n_2, n_3) \le 1\) then using Claim 4.2 followed by the first half of Claim 4.1 we have:

$$\begin{aligned} \mathop {\mathbf{E}}[{{\,\mathrm{Tr}\,}}({\mathsf {B}} {\mathsf {B}}^T {\mathsf {B}}{\mathsf {B}}^T ... {\mathsf {B}} {\mathsf {B}}^T)] \le n_1^s (p \max (n_2, n_3))^{2s} (\ell )^{3 \ell + 3} \le n_1^{\ell /2} (p \max (n_2, n_3))^{\ell } (\ell )^{3 \ell + 3} \end{aligned}$$

where the last inequality follows because \(p n_1^{1/2} \max (n_2, n_3) > 1\). Alternatively if \(p \max (n_2, n_3) > 1\) then we can directly invoke the second half of Claim 4.1 and get:

$$\begin{aligned} \mathop {\mathbf{E}}[{{\,\mathrm{Tr}\,}}({\mathsf {B}} {\mathsf {B}}^T {\mathsf {B}}{\mathsf {B}}^T ... {\mathsf {B}} {\mathsf {B}}^T)] \le n_1^s (p \max (n_2, n_3))^\ell (\ell )^{3 \ell + 3} \le n_1^{\ell /2} (p \max (n_2, n_3))^{\ell } (\ell )^{3 \ell + 3} \end{aligned}$$

Hence \(\mathop {\mathbf{E}}[{{\,\mathrm{Tr}\,}}({\mathsf {B}} {\mathsf {B}}^T {\mathsf {B}}{\mathsf {B}}^T ... {\mathsf {B}} {\mathsf {B}}^T)] \le n_1^{\ell /2}\max (n_2, n_3)^\ell p^\ell (\ell )^{3 \ell + 3}\) and this completes the proof. \(\square \)

As before, let \(n = n_3\). Then the last piece we need to bound the Rademacher complexity is the following spectral bound:

Theorem 4.4

With high probability, \(\Vert B\Vert \le O\Big ( \frac{m \log ^4 n}{n_1^{1/2}\min (n_2, n_3)} \Big )\)

Proof

We proceed by using Markov’s inequality:

$$\begin{aligned} \mathop {\mathbf{Pr}}[\Vert {\mathsf {B}}\Vert\ge & {} n_1^{1/2} \max (n_2, n_3) p (2 \ell )^3 ] = \mathop {\mathbf{Pr}}\Big [\Vert {\mathsf {B}}\Vert ^\ell \ge \Big (n_1^{1/2}\max (n_2, n_3) p (2 \ell )^3 \Big )^\ell \Big ] \\\le & {} \frac{\mathop {\mathbf{E}}[{{\,\mathrm{Tr}\,}}({\mathsf {B}} {\mathsf {B}}^T {\mathsf {B}}{\mathsf {B}}^T ... {\mathsf {B}} {\mathsf {B}}^T)]}{n_1^{\ell /2}\max (n_2, n_3)^\ell p^{\ell } (2 \ell )^{3\ell }} \le \frac{\ell ^3}{2^{3 \ell }} \end{aligned}$$

and hence setting \(\ell = \Theta (\log n)\) we conclude that \(\Vert {\mathsf {B}}\Vert \le 8 n_1^{1/2} \max (n_2, n_3) p \log ^3 n\) holds with high probability. Moreover \(B = \sum _{r =1}^{O(\log n)} {\mathsf {B}}^r\) also holds with high probability. If this equality holds and each \({\mathsf {B}}^r\) satisfies \(\Vert {\mathsf {B}}^r\Vert \le 8 n_1^{1/2}\max (n_2, n_3) p \log ^3 n\), we have

$$\begin{aligned} \Vert B\Vert \le \max _r O(\Vert {\mathsf {B}}^r\Vert \log n) = O\Big ( \frac{m \log ^4 n}{n_1^{1/2}\min (n_2, n_3)} \Big ) \end{aligned}$$

where we have used the fact that \(p = m/n_1 n_2 n_3\). This completes the proof of the theorem. \(\square \)

4.1 Proofs of Theorem 1.1 and Corollary 1.2

We are now ready to prove Theorem 3.3:

Proof

Consider any X with \(\Vert X\Vert _{{\mathcal {K}}_6} \le 1\). Then using Lemma 3.5 and Theorem 4.4 we have

$$\begin{aligned} \Big ( \langle Z, X \rangle \Big )^2\le & {} n_1 \Big ( \sum _i \Big ( \sum _{j,k} Z_{i,j,k} X_{i,j,k} \Big )^2 \Big ) \le C^2 n_1 n_2 n_3 \Vert B\Vert + C^6 m n_1 \\= & {} O\Big ( m n_1^{1/2} \max (n_2, n_3) \log ^4 n + m n_1 \Big ) \end{aligned}$$

Recall that Z was defined in Definition 2.5. Dividing by m and taking square roots, the Rademacher complexity can now be bounded as

$$\begin{aligned} \frac{1}{m} (\langle Z, X \rangle ) \le O\left( \sqrt{\frac{ (n_1)^{1/2} (n_2 + n_3) \log ^4 n}{ m } }\right) \end{aligned}$$

which completes the proof of the theorem. \(\square \)

Recall that bounds on the Rademacher complexity readily imply bounds on the generalization error (see Theorem 2.4). We can now prove Theorem 1.1:

Proof

We solve (2) using the norm \(\Vert \cdot \Vert _{{\mathcal {K}}_6}\). Since this norm comes from the sixth level of the sum-of-squares hierarchy, it follows that (2) is an \(n^6\)-sized semidefinite program and there is an efficient algorithm to solve it to arbitrary accuracy. Moreover we can always plug in \(X = T - \Delta \): the bounds on the maximum magnitude of an entry in \(\Delta \) together with the Chernoff bound imply that with high probability \(X = T - \Delta \) is a feasible solution, and furthermore \(\Vert T - \Delta \Vert _{{\mathcal {K}}_6} \le r^*\). Hence with high probability the minimizer X satisfies \(\Vert X\Vert _{{\mathcal {K}}_6} \le r^*\). Now take any such X returned by the convex program; because it is feasible, its empirical error is at most \(2 \delta \). And since \(\Vert X\Vert _{{\mathcal {K}}_6} \le r^*\), the bounds on the Rademacher complexity (Theorem 3.3) together with Theorem 2.4 give the desired bounds on \(\text{ err }(X)\) and complete the proof of our main theorem. \(\square \)

In Appendix A we treat the general case where the \(n_i\)’s can be different, which allows us to extend our results to higher order tensors as well.

Finally we prove Corollary 1.2:

Proof

Our goal is to lower bound the absolute value of a typical entry in T. To be concrete, suppose that \({\text {var}}(T_{i,j,k}) \ge f(r, n)\) for a \(1-o(1)\) fraction of the entries where \(f(r, n) = r^{1/2}/ \log ^D n\). Consider \(T_{i,j,k}\), which we will view as a degree three polynomial in Gaussian random variables. Then the anti-concentration bounds of Carbery and Wright [14] now imply that \(|T_{i,j,k}| \ge f(r, n)/\log n \) with probability \(1 - o(1)\). With this in mind, we define

$$\begin{aligned} {\mathcal {R}}= \{ (i, j, k) \text{ s.t. } |T_{i,j,k}| \ge f(r, n)/\log n\} \end{aligned}$$

and it follows from Markov’s bound that \(|{\mathcal {R}}| \ge (1 - o(1)) n_1 n_2 n_3\). Now consider just those entries in \({\mathcal {R}}\) which we get substantially wrong:

$$\begin{aligned} {\mathcal {R}}' = \{ (i, j, k) \text{ s.t. } (i, j, k) \in {\mathcal {R}} \text{ and } |X_{i,j,k} - T_{i,j,k}| \ge 1/\log n\} \end{aligned}$$

We can now invoke Theorem 1.1 which guarantees that the hypothesis X that results from solving (2) satisfies \(\text{ err }(X) = o(1/\log n)\) with probability \(1-o(1)\) provided that \(m = {\widetilde{\Omega }}( n^{3/2} r )\). This bound on the error immediately implies that \(|{\mathcal {R}}'| = o(n_1 n_2 n_3)\) and so \(|{\mathcal {R}}\setminus {\mathcal {R}}'| = (1-o(1))n_1 n_2 n_3\). This completes the proof of the corollary. \(\square \)

5 Sum-of-squares lower bounds

Here we will show strong lower bounds on the Rademacher complexity of the sequence of relaxations to the tensor nuclear norm that we get from the sum-of-squares hierarchy. Our lower bounds follow as a corollary from known lower bounds for refuting random instances of 3-XOR [31, 58]. First we need to introduce the formulation of the sum-of-squares hierarchy used in [58], which is a natural semidefinite programming relaxation for the problem of deciding whether a given formula is satisfiable or not. We will call a Boolean function f a k-junta if there is a set \(S \subseteq [n]\) of at most k variables so that f is determined by the values in S.

Definition 5.1

The k-round Lasserre hierarchy is the following relaxation:

  1. (a)

    \(\Vert v_0\Vert ^2 = 1\), \(\Vert v_C\Vert ^2 = 1\) for all \(C \in {{\mathcal {C}}}\)

  2. (b)

    \(\langle v_f, v_g \rangle = \langle v_{f'}, v_{g'} \rangle \) for all \(f, g, f', g'\) that are k-juntas and \(f \cdot g \equiv f' \cdot g'\)

  3. (c)

    \(v_f + v_g = v_{f + g}\) for all f, g that are k-juntas and satisfy \(f \cdot g \equiv 0\)

Here we define a vector \(v_f\) for each k-junta, and \({{\mathcal {C}}}\) is a class of constraints that must be satisfied by any Boolean solution (and are necessarily k-juntas themselves). See [58] for more background, but it is easy to construct a feasible solution to the above convex program given a distribution on feasible solutions for some constraint satisfaction problem. In the above relaxation, we think of functions f as being \(\{0, 1\}\)-valued. Over the \(\{0, 1\}\)-alphabet, we use the notation \((\oplus _S, Z_S)\) to represent the XOR clause that takes the variables in the set S, takes their XOR and asks that the value be equal to \(Z_S\). It will be more convenient to work with an intermediate relaxation where functions are \(\{-1, 1\}\)-valued and the intuition is that \(u_S\) for some set \(S \subseteq [n]\) should correspond to the vector for the character \(\chi _S\).
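
To make the remark about distributions concrete, here is a minimal sketch, on a hypothetical two-clause 3-XOR instance, of the standard construction: given a distribution \(\mu \) supported on satisfying assignments, setting \((v_f)_x = \sqrt{\mu (x)}\, f(x)\) gives \(\langle v_f, v_g \rangle = \mathop {\mathbf{E}}_{\mu }[f g]\), from which the constraints of Definition 5.1 are easy to check.

```python
import numpy as np
from itertools import product

# Hypothetical tiny instance on three variables: x1 xor x2 = 1 and x2 xor x3 = 0,
# written as {0,1}-valued constraint functions.
clauses = [lambda x: int((x[0] ^ x[1]) == 1),
           lambda x: int((x[1] ^ x[2]) == 0)]

support = [x for x in product((0, 1), repeat=3) if all(c(x) for c in clauses)]
mu = np.full(len(support), 1.0 / len(support))   # uniform over satisfying assignments

def vec(f):
    """Vector v_f of a {0,1}-valued junta f, with (v_f)_x = sqrt(mu(x)) * f(x)."""
    return np.array([np.sqrt(p) * f(x) for p, x in zip(mu, support)])

v0 = vec(lambda x: 1)                            # constant function -> v_0

assert np.isclose(v0 @ v0, 1.0)                                  # (a): ||v_0||^2 = 1
assert all(np.isclose(vec(c) @ vec(c), 1.0) for c in clauses)    # (a): ||v_C||^2 = 1

# (b): the inner product only depends on the product f*g, since it equals E_mu[f g].
assert np.isclose(vec(lambda x: x[1]) @ vec(lambda x: x[1] * x[2]),
                  v0 @ vec(lambda x: x[1] * x[2]))

# (c): if f*g = 0 then v_f + v_g = v_{f+g}; e.g. a junta and its complement sum to v_0.
assert np.allclose(vec(lambda x: x[0]) + vec(lambda x: 1 - x[0]), v0)
```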

Definition 5.2

Alternatively, the k-round Lasserre hierarchy is the following relaxation:

  1. (a)

    \(\Vert u_\emptyset \Vert ^2 = 1\), \(\langle u_{\emptyset }, u_S \rangle = (-1)^{Z_S}\) for all \((\oplus _S, Z_S) \in {{\mathcal {C}}}\)

  2. (b)

    \(\langle u_S, u_T \rangle = \langle u_{S'}, u_{T'} \rangle \) for sets \(S, T, S', T'\) of size at most k that satisfy \(S \Delta T = S' \Delta T'\), where \(\Delta \) is the symmetric difference.

Here we have explicitly made the switch to XOR-constraints: namely, \((\oplus _S, Z_S)\) has \(Z_S \in \{0, 1\}\) and corresponds to the constraint that the parity on the set S is equal to \(Z_S\). Now if we have a feasible solution to the constraints in Definition 5.1 where all the clauses are XOR-constraints, we can construct a feasible solution to the constraints in Definition 5.2 as follows. If S is a set of size at most k, we define

$$\begin{aligned} u_S \equiv v_g - v_f \end{aligned}$$

where f is the parity function on S and \(g = 1 - f\) is its complement. Moreover let \(u_\emptyset = v_0\).

Claim 5.3

\(\{u_S\}\) is a feasible solution to the constraints in Definition 5.2

Proof

Consider Constraint (b) in Definition 5.2, and let \(S, T, S', T'\) be sets of size at most k that satisfy \(S \oplus T = S' \oplus T'\). Then our goal is to show that

$$\begin{aligned} \langle v_{g_S} - v_{f_S}, v_{g_T} - v_{f_T} \rangle = \langle v_{g_{S'}} - v_{f_{S'}}, v_{g_{T'}} - v_{f_{T'}} \rangle \end{aligned}$$

where \(f_S\) is the parity function on S, and similarly for the other functions. Then we have \(f_S \cdot f_T \equiv f_{S'} \cdot f_{T'}\) because \(S \oplus T = S' \oplus T'\), and this implies that \(\langle v_{f_S}, v_{f_T} \rangle = \langle v_{f_{S'}}, v_{f_{T'}} \rangle \). An identical argument holds for the other terms. This implies that all the Constraints (b) hold. Similarly suppose \((\oplus _S, Z_S) \in {{\mathcal {C}}}\). Since \(f_S \cdot g_S \equiv 0\) and \(f_S + g_S \equiv 1\) it is well-known that (1) \( v_{f_S}\) and \(v_{g_S}\) are orthogonal (2) \(v_{f_S} + v_{g_S} = v_0\) and (3) since \(f_S \in {{\mathcal {C}}}\) in Definition 5.1, we have \(v_{g_S} = 0\) (see [58]). Thus

$$\begin{aligned} \langle u_\emptyset , u_S \rangle = \langle v_0, v_{g_S} \rangle - \langle v_0, v_{f_S} \rangle = -1 \end{aligned}$$

and this completes the proof. \(\square \)

Now following Barak et al. [4] we can use the constraints in Definition 5.2 to define the operator \({\widetilde{\mathop {\mathbf{E}}}}[\cdot ]\). In particular, given \(p \in P_k^n\) where \(p \equiv \sum _S c_S \prod _{i \in S} Y_i\) and p is multilinear, we set

$$\begin{aligned} {\widetilde{\mathop {\mathbf{E}}}}[p] = \sum _S c_S \langle u_\emptyset , u_S \rangle \end{aligned}$$

Here we will also need to define \({\widetilde{\mathop {\mathbf{E}}}}[p]\) when p is not multilinear. In that case, within each monomial we replace each power \(Y_i^t\) with 1 if t is even and with \(Y_i\) if t is odd, obtaining a multilinear polynomial q, and then set \({\widetilde{\mathop {\mathbf{E}}}}[p] = {\widetilde{\mathop {\mathbf{E}}}}[q]\).
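
A minimal sketch of this multilinearization rule and of the resulting operator, with a hypothetical table of inner products \(\langle u_\emptyset , u_S \rangle \) standing in for an actual Lasserre solution:

```python
def multilinearize(mono):
    """Reduce a monomial, given as a tuple of variable indices with repetition,
    to the set of indices that appear an odd number of times (so Y_i^2 -> 1)."""
    odd = set()
    for i in mono:
        odd ^= {i}                 # toggle membership: even occurrences cancel
    return frozenset(odd)

def pseudo_expectation(poly, inner):
    """poly maps monomials (tuples of variable indices) to coefficients c_S;
    inner maps frozensets S to <u_emptyset, u_S>, with the empty set mapped to 1.
    Returns E~[p] = sum_S c_S <u_emptyset, u_S> after multilinearizing p."""
    return sum(c * inner[multilinearize(mono)] for mono, c in poly.items())

# Hypothetical toy data: p(Y) = 2*Y_1*Y_2 + 3*Y_1^2 - Y_1*Y_2^2.
p = {(1, 2): 2.0, (1, 1): 3.0, (1, 2, 2): -1.0}
inner = {frozenset(): 1.0, frozenset({1}): 0.0, frozenset({1, 2}): -1.0}
print(pseudo_expectation(p, inner))   # 2*(-1) + 3*1 + (-1)*0 = 1.0
```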

Claim 5.4

\({\widetilde{\mathop {\mathbf{E}}}}[\cdot ]\) is a feasible solution to the constraints in Definition 3.2, and for any \((\oplus _S, Z_S) \in {{\mathcal {C}}}\) we have \({\widetilde{\mathop {\mathbf{E}}}}[\prod _{i \in S} Y_i] = (-1)^{Z_S} \).

Proof

By construction \({\widetilde{\mathop {\mathbf{E}}}}[1] = 1\), and the proof that \({\widetilde{\mathop {\mathbf{E}}}}[p^2] \ge 0\) is given in [4], but we repeat it here for completeness. Let \(p = \sum _S c_S \prod _{i \in S} Y_i\) be multilinear, where we follow the above recipe and replace terms of the form \(Y_i^2\) with 1 as needed. Then \(p^2 = \sum _{S,T} c_S c_T \prod _{i \in S} Y_i \prod _{i \in T} Y_i\) and moreover

$$\begin{aligned} {\widetilde{\mathop {\mathbf{E}}}}[p^2]= & {} \sum _{S, T} c_S c_T \langle u_\emptyset , u_{S \Delta T} \rangle = \sum _{S, T} c_S c_T \langle u_S, u_{T} \rangle = \Big \Vert \sum _S c_S u_S \Big \Vert ^2 \ge 0 \end{aligned}$$

as desired. Next we must verify that \({\widetilde{\mathop {\mathbf{E}}}}[\cdot ]\) satisfies the constraints \(\{\sum _{i = 1}^n Y_i^2 = n\}\) and \(\{Y_i^2 \le C^2\}\) for all \(i \in \{1, 2, ..., n\}\), in accordance with Definition 3.1. To that end, observe that

$$\begin{aligned} {\widetilde{\mathop {\mathbf{E}}}}\Big [\Big (\sum _{i=1}^n Y_i^2 - n \Big ) q\Big ] = 0 \end{aligned}$$

which holds for any polynomial \(q \in P_{k - 2}^n\), because multilinearization replaces each \(Y_i^2\) by 1 and hence \(\sum _{i=1}^n Y_i^2 q\) and nq have the same multilinearization. Finally note that

$$\begin{aligned} {\widetilde{\mathop {\mathbf{E}}}}\Big [ \Big (C^2 - Y_i^2\Big ) q^2\Big ] = {\widetilde{\mathop {\mathbf{E}}}}\Big [ \Big (C^2 - 1 \Big ) q^2\Big ] \ge 0 \end{aligned}$$

which holds for any polynomial \(q \in P_{\lfloor (k-2)/2 \rfloor }^n\): the equality follows because multiplying by \(Y_i^2\) does not change the multilinearization of any monomial, and the inequality follows because \(C^2 \ge 1\). This completes the proof. \(\square \)

Theorem 5.5

[31, 58] For any \(\epsilon > 0\) and any \(c < 2\), let \(\phi \) be a random 3-XOR formula on n variables with \(m = n^{3/2 - \epsilon }\) clauses. Then with probability \(1 - o(1)\), the \(k = \Omega (n^{c \epsilon })\) round Lasserre hierarchy given in Definition 5.1 admits a feasible solution.

Note that the constant in the \(\Omega (\cdot )\) depends on \(\epsilon \) and c. Then using the above reductions, we have the following as an immediate corollary:

Corollary 5.6

For any \(\epsilon > 0\), any \(c < 2\) and \(k = \Omega (n^{c \epsilon })\), if \(m = n^{3/2 - \epsilon }\) then the Rademacher complexity satisfies \(R^m(\Vert \cdot \Vert _{{\mathcal {K}}_k}) = 1 - o(1)\).

Thus there is a sharp phase transition (as a function of the number of observations) in the Rademacher complexity of the norms derived from the sum-of-squares hierarchy. At level six, \(R^m(\Vert \cdot \Vert _{{\mathcal {K}}_6}) = o(1)\) whenever \(m = \omega (n^{3/2} \log ^4 n)\). In contrast, \(R^m(\Vert \cdot \Vert _{{\mathcal {K}}_k}) = 1- o(1)\) when \(m = n^{3/2 - \epsilon }\) even for very strong relaxations derived from \(n^{2 \epsilon }\) rounds of the sum-of-squares hierarchy. These norms require time \(2^{n^{2 \epsilon }}\) to compute but still achieve essentially no better bounds on their Rademacher complexity.