Abstract
Recent years have seen a flurry of activities in designing provably efficient nonconvex procedures for solving statistical estimation problems. Due to the highly nonconvex nature of the empirical loss, state-of-the-art procedures often require proper regularization (e.g., trimming, regularized cost, projection) in order to guarantee fast convergence. For vanilla procedures such as gradient descent, however, prior theory either recommends highly conservative learning rates to avoid overshooting, or completely lacks performance guarantees. This paper uncovers a striking phenomenon in nonconvex optimization: even in the absence of explicit regularization, gradient descent enforces proper regularization implicitly under various statistical models. In fact, gradient descent follows a trajectory staying within a basin that enjoys nice geometry, consisting of points incoherent with the sampling mechanism. This “implicit regularization” feature allows gradient descent to proceed in a far more aggressive fashion without overshooting, which in turn results in substantial computational savings. Focusing on three fundamental statistical estimation problems, i.e., phase retrieval, low-rank matrix completion, and blind deconvolution, we establish that gradient descent achieves near-optimal statistical and computational guarantees without explicit regularization. In particular, by marrying statistical modeling with generic optimization theory, we develop a general recipe for analyzing the trajectories of iterative algorithms via a leave-one-out perturbation argument. As a by-product, for noisy matrix completion, we demonstrate that gradient descent achieves near-optimal error control—measured entrywise and by the spectral norm—which might be of independent interest.
1 Introduction
1.1 Nonlinear Systems and Empirical Loss Minimization
A wide spectrum of science and engineering applications calls for solutions to a nonlinear system of equations. Imagine we have collected a set of data points \({\varvec{y}}=\{y_{j}\}_{1\le j \le m}\), generated by a nonlinear sensing system,
$$\begin{aligned} y_{j}\approx \mathcal {A}_{j}\big ({\varvec{x}}^{\star }\big ),\qquad 1\le j\le m, \end{aligned}$$
where \({\varvec{x}}^{\star }\) is the unknown object of interest and the \(\mathcal {A}_j\)’s are certain nonlinear maps known a priori. Can we reconstruct the underlying object \({\varvec{x}}^{\star }\) in a faithful yet efficient manner? Problems of this kind abound in information and statistical science, prominent examples including low-rank matrix recovery [19, 64], robust principal component analysis [17, 21], phase retrieval [20, 59], neural networks [103, 132], to name just a few.
In principle, it is possible to attempt reconstruction by searching for a solution that minimizes the empirical loss, namely
$$\begin{aligned} \text {minimize}_{{\varvec{x}}}\quad f({\varvec{x}})=\sum _{j=1}^{m}\ell \big ({\varvec{x}};y_{j}\big ), \end{aligned}$$(1)
where \(\ell \big ({\varvec{x}};y_{j}\big )\) is a loss function evaluating how well a candidate \({\varvec{x}}\) fits the jth measurement.
Unfortunately, this empirical loss minimization problem is, in many cases, nonconvex, making it NP-hard in general. This issue of nonconvexity comes up in, for example, several representative problems that epitomize the structures of nonlinear systems encountered in practice.Footnote 1
Phase retrieval/solving quadratic systems of equations Imagine we are asked to recover an unknown object \({\varvec{x}}^{\star }\in \mathbb {R}^{n}\), but are only given the square modulus of certain linear measurements about the object, with all sign/phase information of the measurements missing. This arises, for example, in X-ray crystallography [15], and in latent-variable models where the hidden variables are captured by the missing signs [33]. To fix ideas, assume we would like to solve for \({\varvec{x}}^{\star }\in \mathbb {R}^n \) in the following quadratic system of m equations
$$\begin{aligned} y_{j}=\big ({\varvec{a}}_{j}^{\top }{\varvec{x}}^{\star }\big )^{2},\qquad 1\le j\le m, \end{aligned}$$where \(\{{\varvec{a}}_{j}\}_{1\le j \le m}\) are the known design vectors. One strategy is thus to solve the following problem
$$\begin{aligned} \text {minimize}_{{\varvec{x}}\in \mathbb {R}^{n}}\quad f({\varvec{x}})=\frac{1}{4m}\sum _{j=1}^{m}\Big [y_{j}-\big ({\varvec{a}}_{j}^{\top }{\varvec{x}}\big )^{2}\Big ]^{2}. \end{aligned}$$(2)

Low-rank matrix completion In many scenarios such as collaborative filtering, we wish to make predictions about all entries of an (approximately) low-rank matrix \({\varvec{M}}^{\star }\in \mathbb {R}^{n\times n}\) (e.g., a matrix consisting of users’ ratings about many movies), yet only a highly incomplete subset of the entries are revealed to us [19]. For clarity of presentation, assume \({\varvec{M}}^{\star }\) to be rank-r (\(r\ll n\)) and positive semidefinite (PSD), i.e., \({\varvec{M}}^{\star }={\varvec{X}}^{\star }{\varvec{X}}^{\star \top }\) with \({\varvec{X}}^{\star }\in \mathbb {R}^{n\times r}\), and suppose we have only seen the entries
$$\begin{aligned} Y_{j,k} = M^{\star }_{j,k} = ( {\varvec{X}}^{\star } {\varvec{X}}^{\star \top } )_{j,k} ,\qquad (j,k)\in \Omega \end{aligned}$$
within some index subset \(\Omega \) of cardinality m. These entries can be viewed as nonlinear measurements about the low-rank factor \({\varvec{X}}^{\star }\). The task of completing the true matrix \({\varvec{M}}^{\star }\) can then be cast as solving
$$\begin{aligned} \text {minimize}_{{\varvec{X}}\in \mathbb {R}^{n\times r}}\quad f({\varvec{X}}) = \frac{n^2}{4m} \sum _{(j,k)\in \Omega }\left( Y_{j,k}-{\varvec{e}}_{j}^{\top }{\varvec{X}}{\varvec{X}}^{\top }{\varvec{e}}_{k}\right) ^{2}, \end{aligned}$$(3)
where the \({\varvec{e}}_{j}\)’s stand for the canonical basis vectors in \(\mathbb {R}^n\).
Blind deconvolution/solving bilinear systems of equations Imagine we are interested in estimating two signals \({\varvec{h}}^{\star },{\varvec{x}}^{\star }\in \mathbb {C}^{K}\), but only get to collect a few bilinear measurements about them. This problem comes from the mathematical modeling of blind deconvolution [3, 76], which arises frequently in astronomy, imaging, and communications. The goal is to recover two signals from their convolution. Put more formally, suppose we have acquired m bilinear measurements taking the following form
$$\begin{aligned} y_{j}={\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}^{\star }{\varvec{x}}^{\star \textsf {H} }{\varvec{a}}_{j},\qquad 1\le j\le m, \end{aligned}$$
where \({\varvec{a}}_{j},{\varvec{b}}_{j}\in \mathbb {C}^{K}\) are distinct design vectors (e.g., Fourier and/or random design vectors) known a priori and \({\varvec{b}}_{j}^{\textsf {H} }\) denotes the conjugate transpose of \({\varvec{b}}_{j}\). In order to reconstruct the underlying signals, one asks for solutions to the following problem
$$\begin{aligned} \text {minimize}_{{\varvec{h}},{\varvec{x}}\in \mathbb {C}^{K}}\quad f({\varvec{h}},{\varvec{x}})= \sum _{j=1}^{m}\big |y_{j}-{\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}{\varvec{x}}^{\textsf {H} }{\varvec{a}}_{j}\big |^{2}. \end{aligned}$$
1.2 Nonconvex Optimization via Regularized Gradient Descent
First-order methods have been a popular heuristic in practice for solving nonconvex problems including (1). For instance, a widely adopted procedure is gradient descent, which follows the update rule
$$\begin{aligned} {\varvec{x}}^{t+1}={\varvec{x}}^{t}-\eta _{t}\nabla f\big ({\varvec{x}}^{t}\big ),\qquad t=0,1,\cdots , \end{aligned}$$(4)
where \(\eta _{t}\) is the learning rate (or step size) and \({\varvec{x}}^{0}\) is some proper initial guess. Given that it only performs a single gradient calculation \(\nabla f(\cdot )\) per iteration (which typically can be completed within near-linear time), this paradigm emerges as a candidate for solving large-scale problems. The concern is whether \({\varvec{x}}^{t}\) converges to the global solution and, if so, how long convergence takes, especially since (1) is highly nonconvex.
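To make the procedure concrete, here is a minimal sketch of the update rule (4) in Python; the function name and interface are ours, not the paper's, and the initial point is assumed to be supplied, e.g., by a spectral method.

```python
import numpy as np

def vanilla_gd(grad_f, x0, eta, num_iters):
    """Vanilla gradient descent (4): x^{t+1} = x^t - eta * grad f(x^t).

    grad_f    : callable returning the gradient of the empirical loss at a point
    x0        : initial guess (e.g., produced by a spectral method)
    eta       : learning rate; a constant step size is used throughout
    num_iters : number of iterations
    """
    x = np.array(x0, dtype=float)
    for _ in range(num_iters):
        x = x - eta * grad_f(x)
    return x
```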
Fortunately, despite the worst-case hardness, appealing convergence properties have been discovered in various statistical estimation problems, the blessing being that the statistical models help rule out ill-behaved instances. For the average case, the empirical loss often enjoys benign geometry, in a local region (or at least along certain directions) surrounding the global optimum. In light of this, an effective nonconvex iterative method typically consists of two stages:
1. a carefully designed initialization scheme (e.g., spectral method);

2. an iterative refinement procedure (e.g., gradient descent).
This strategy has recently spurred a great deal of interest, owing to its promise of achieving computational efficiency and statistical accuracy at once for a growing list of problems (e.g., [18, 25, 32, 61, 64, 76, 78, 107]). However, rather than directly applying gradient descent (4), existing theory often suggests enforcing proper regularization. Such explicit regularization enables improved computational convergence by properly “stabilizing” the search directions. The following regularization schemes, among others, have been suggested to obtain or improve computational guarantees. We refer to these algorithms collectively as Regularized Gradient Descent.
Trimming/truncation, which discards/truncates a subset of the gradient components when forming the descent direction. For instance, when solving quadratic systems of equations, one can modify the gradient descent update rule as
$$\begin{aligned} {\varvec{x}}^{t+1}={\varvec{x}}^{t}-\eta _{t}\mathcal {T}\left( \nabla f\big ({\varvec{x}}^{t}\big )\right) , \end{aligned}$$(5)
where \(\mathcal {T}\) is an operator that effectively drops samples bearing too much influence on the search direction. This strategy [25, 118, 126] has been shown to enable exact recovery with linear-time computational complexity and optimal sample complexity.
Regularized loss, which attempts to optimize a regularized empirical risk
$$\begin{aligned} {\varvec{x}}^{t+1}= {\varvec{x}}^{t}-\eta _{t} \left( \nabla f\big ({\varvec{x}}^{t}\big )+\nabla R\big ({\varvec{x}}^{t}\big )\right) , \end{aligned}$$(6)
where \(R({\varvec{x}})\) stands for an additional penalty term in the empirical loss. For example, in low-rank matrix completion, \(R(\cdot )\) imposes a penalty based on the \(\ell _{2}\) row norm [64, 107] as well as the Frobenius norm [107] of the decision matrix, while in blind deconvolution, it penalizes the \(\ell _{2}\) norm as well as a certain component-wise incoherence measure of the decision vectors [58, 76, 82].
Projection, which projects the iterates onto certain sets based on prior knowledge, that is,
$$\begin{aligned} {\varvec{x}}^{t+1}=\mathcal {P}\left( {\varvec{x}}^{t}-\eta _{t}\nabla f\big ({\varvec{x}}^{t}\big )\right) , \end{aligned}$$(7)
where \(\mathcal {P}\) is a certain projection operator used to enforce, for example, incoherence properties. This strategy has been employed in both low-rank matrix completion [32, 131] and blind deconvolution [76].
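To contrast the three schemes at a glance, the following schematic is ours (not from the paper); the operators \(\mathcal {T}\), \(R\), and \(\mathcal {P}\) are problem-specific placeholders, and the trimming rule below is a hypothetical instance of (5).

```python
import numpy as np

def trimmed_step(x, sample_grads, eta, tau):
    """Trimming/truncation (5): drop per-sample gradients whose magnitude
    exceeds a threshold tau (a hypothetical rule) before averaging."""
    kept = [g for g in sample_grads if np.linalg.norm(g) <= tau]
    return x - eta * np.mean(kept, axis=0)

def regularized_step(x, grad_f, grad_R, eta):
    """Regularized loss (6): descend on the penalized objective f + R."""
    return x - eta * (grad_f(x) + grad_R(x))

def projected_step(x, grad_f, project, eta):
    """Projection (7): pull the gradient update back onto a constraint set,
    e.g., one encoding incoherence."""
    return project(x - eta * grad_f(x))
```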
Equipped with such regularization procedures, existing works uncover appealing computational and statistical properties under various statistical models. Table 1 summarizes the performance guarantees derived in the prior literature; for simplicity, only orderwise results are provided.
Remark 1
There is another role of regularization commonly studied in the literature, which exploits prior knowledge about the structure of the unknown object (e.g., sparsity) in order to prevent over-fitting and improve statistical generalization. This is, however, not the focal point of this paper, since we are primarily pursuing solutions to (1) without imposing additional structures.
1.3 Regularization-Free Procedures?
The regularized gradient descent algorithms, while exhibiting appealing performance, usually introduce more algorithmic parameters that need to be carefully tuned based on the assumed statistical models. In contrast, vanilla gradient descent (cf. (4))—which is perhaps the very first method that comes to mind and requires minimal parameter tuning—is far less understood (cf. Table 1). Take matrix completion and blind deconvolution as examples: to the best of our knowledge, there is currently no theoretical guarantee derived for vanilla gradient descent.
The situation is better for phase retrieval: the local convergence of vanilla gradient descent, also known as Wirtinger flow (WF), has been investigated in [18, 96]. Under i.i.d. Gaussian design and with near-optimal sample complexity, WF (combined with spectral initialization) provably achieves \(\epsilon \)-accuracy (in a relative sense) within \(O\big (n\log \left( {1} / {\epsilon }\right) \big )\) iterations. Nevertheless, this computational guarantee is significantly outperformed by the regularized version (called truncated Wirtinger flow [25]), which requires only \(O\big (\log \left( {1} / {\epsilon }\right) \big )\) iterations to converge with similar per-iteration cost. On closer inspection, the high computational cost of WF is largely due to the vanishingly small step size \(\eta _{t}=O\big ({1} / ({n\Vert {\varvec{x}}^{\star }\Vert _{2}^{2}})\big )\)—and hence slow movement—suggested by the theory [18]. While this is already the largest step size allowed by the theory in [18], it is considerably more conservative than the choice \(\eta _{t}=O\big ({1} / {\Vert {\varvec{x}}^{\star }\Vert _{2}^{2}}\big )\) theoretically justified for the regularized version [25, 126].
The lack of understanding and suboptimal results about vanilla gradient descent raise a very natural question: Are regularization-free iterative algorithms inherently suboptimal when solving nonconvex statistical estimation problems of this kind?
1.4 Numerical Surprise of Unregularized Gradient Descent
To answer the preceding question, it is perhaps best to first collect some numerical evidence. In what follows, we test the performance of vanilla gradient descent for phase retrieval, matrix completion, and blind deconvolution, using a constant step size. For all of these experiments, the initial guess is obtained by means of the standard spectral method. Our numerical findings are as follows:
Phase retrieval For each n, set \(m=10n\), take \({\varvec{x}}^{\star }\in \mathbb {R}^{n}\) to be a random vector with unit norm, and generate the design vectors \({\varvec{a}}_{j}\overset{\text {i.i.d.}}{\sim }\mathcal {N}({\varvec{0}},{\varvec{I}}_{n})\), \(1\le j\le m\). Figure 1a illustrates the relative \(\ell _{2}\) error \(\min \{\Vert {\varvec{x}}^{t}-{\varvec{x}}^{\star }\Vert _{2},\Vert {\varvec{x}}^{t}+{\varvec{x}}^{\star }\Vert _{2}\}/\Vert {\varvec{x}}^{\star }\Vert _{2}\) (modulo the unrecoverable global phase) versus the iteration count. The results are shown for \(n=20,100,200,1000\), with the step size taken to be \(\eta _{t}=0.1 \) in all settings.
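This experiment is straightforward to replicate; the sketch below is our own code (not from the paper), combining the spectral initialization of [18] with vanilla gradient descent on the loss (2).

```python
import numpy as np

def wirtinger_flow(A, y, eta=0.1, num_iters=200):
    """Vanilla GD for phase retrieval; rows of A are the design vectors a_j,
    and y holds the squared measurements. Step size 0.1 as in Fig. 1a."""
    m, n = A.shape
    # Spectral initialization: scaled leading eigenvector of (1/m) sum_j y_j a_j a_j^T.
    _, V = np.linalg.eigh((A.T * y) @ A / m)
    x = np.sqrt(y.mean()) * V[:, -1]
    for _ in range(num_iters):
        Ax = A @ x
        x = x - eta * (A.T @ ((Ax ** 2 - y) * Ax) / m)  # gradient of (2)
    return x

# Setup mirroring the experiment: n = 100, m = 10n, unit-norm signal.
rng = np.random.default_rng(0)
n, m = 100, 1000
x_star = rng.standard_normal(n)
x_star /= np.linalg.norm(x_star)
A = rng.standard_normal((m, n))
y = (A @ x_star) ** 2
x_hat = wirtinger_flow(A, y)
# Relative l2 error modulo the unrecoverable global sign:
print(min(np.linalg.norm(x_hat - x_star), np.linalg.norm(x_hat + x_star)))
```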
Matrix completion Generate a random PSD matrix \({\varvec{M}}^{\star }\in \mathbb {R}^{n \times n}\) with dimension \(n=1000\), rank \(r=10\), and all nonzero eigenvalues equal to one. Each entry of \({\varvec{M}}^{\star }\) is observed independently with probability \(p=0.1\). Figure 1b plots the relative error \(\Vert {\varvec{X}}^{t}{\varvec{X}}^{t\top }-{\varvec{M}}^{\star }\Vert /\Vert {\varvec{M}}^{\star }\Vert \) versus the iteration count, where \(\Vert \cdot \Vert \) can either be the Frobenius norm \(\left\| \cdot \right\| _{\mathrm {F}}\), the spectral norm \(\Vert \cdot \Vert \), or the entrywise \(\ell _{\infty }\) norm \(\Vert \cdot \Vert _{\infty }\). Here, we pick the step size as \(\eta _{t}=0.2\).
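The matrix completion run admits an equally short sketch (our code; parameter choices follow the experiment above). For a symmetric sampling set, the gradient of (3) is \((n^{2}/m)\,\mathcal {P}_{\Omega }({\varvec{X}}{\varvec{X}}^{\top }-{\varvec{Y}}){\varvec{X}}\), and we use \(n^{2}/m\approx 1/p\).

```python
import numpy as np

def mc_vanilla_gd(Y_obs, mask, r, p, eta=0.2, num_iters=400):
    """Vanilla GD on the matrix-completion loss (3).
    Y_obs : observed matrix, zero off the sampling set; mask : boolean Omega."""
    # Spectral initialization: top-r eigenpairs of the rescaled observed matrix.
    w, V = np.linalg.eigh(Y_obs / p)
    top = np.argsort(w)[::-1][:r]
    X = V[:, top] * np.sqrt(np.maximum(w[top], 0.0))
    for _ in range(num_iters):
        R = mask * (X @ X.T - Y_obs)   # residual supported on Omega
        X = X - eta * (R @ X) / p      # gradient of (3), with n^2/m ~ 1/p
    return X
```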
Blind deconvolution For each \(K\in \left\{ 20,100,200,1000\right\} \) and \(m=10K\), generate the design vectors \({\varvec{a}}_{j}\overset{\text {i.i.d.}}{\sim }\mathcal {N}({\varvec{0}}, \frac{1}{2}{\varvec{I}}_{K})+i\mathcal {N}({\varvec{0}},\frac{1}{2}{\varvec{I}}_{K})\) for \(1\le j\le m\) independently,Footnote 2 and the \({\varvec{b}}_{j}\)’s are drawn from a partial discrete Fourier transform (DFT) matrix (to be described in Sect. 3.3). The underlying signals \({\varvec{h}}^{\star },{\varvec{x}}^{\star }\in \mathbb {C}^{K}\) are produced as random vectors with unit norm. Figure 1c plots the relative error \(\Vert {\varvec{h}}^{t}{\varvec{x}}^{t\textsf {H} }-{\varvec{h}}^{\star }{\varvec{x}}^{\star \textsf {H} }\Vert _{\mathrm {F}}/ \Vert {\varvec{h}}^{\star } {\varvec{x}}^{\star \textsf {H} }\Vert _{\mathrm {F}}\) versus the iteration count, with the step size taken to be \(\eta _{t} = 0.5\) in all settings.
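For completeness, here is a sketch of (Wirtinger) gradient descent for blind deconvolution, again our own code. The gradients below follow from differentiating the modulus-squared loss with respect to \(\overline{{\varvec{h}}}\) and \(\overline{{\varvec{x}}}\); the per-step normalization by the other factor's squared norm reflects our reading of Algorithm 3 and should be treated as indicative rather than definitive.

```python
import numpy as np

def blind_deconv_gd(A, B, y, h0, x0, eta=0.5, num_iters=200):
    """Regularization-free Wirtinger GD for blind deconvolution.
    Rows of B are the b_j^H (partial DFT); rows of A are the a_j.
    h0, x0 : initial guesses, e.g., from a spectral method."""
    h, x = np.array(h0, dtype=complex), np.array(x0, dtype=complex)
    for _ in range(num_iters):
        Bh = B @ h                     # entries b_j^H h
        xHa = A @ x.conj()             # entries x^H a_j
        e = Bh * xHa - y               # residuals
        grad_h = B.conj().T @ (e * xHa.conj())
        grad_x = A.T @ (e.conj() * Bh)
        # both updates evaluated at the current (h, x)
        h, x = (h - eta / np.linalg.norm(x) ** 2 * grad_h,
                x - eta / np.linalg.norm(h) ** 2 * grad_x)
    return h, x
```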
In all of these numerical experiments, vanilla gradient descent enjoys remarkable linear convergence, always yielding an accuracy of \(10^{-5}\) (in a relative sense) within around 200 iterations. In particular, for the phase retrieval problem, the step size is taken to be \(\eta _{t}=0.1 \) even as the problem size varies from \(n=20\) to \(n=1000\). The consequence is that the convergence rate experiences little change as the problem size varies. In comparison, the theory published in [18] seems overly pessimistic, as it suggests a diminishing step size inversely proportional to n and, as a result, an iteration complexity that worsens as the problem size grows.
In addition, it has been empirically observed in prior literature [25, 76, 127] that vanilla gradient descent performs comparably with its regularized counterpart for phase retrieval and blind deconvolution. To complete the picture, we further conduct experiments on matrix completion. In particular, we follow the experimental setup for matrix completion used above. We vary p from 0.01 to 0.1 with 51 logarithmically spaced points. For each p, we apply vanilla gradient descent, projected gradient descent [32], and gradient descent with additional regularization terms [107] with step size \(\eta = 0.2\) to 50 randomly generated instances. Successful recovery is declared if \(\Vert {\varvec{X}}^{t}{\varvec{X}}^{t\top } - {\varvec{M}}^{\star }\Vert _{\mathrm {F}} / \Vert {\varvec{M}}^{\star }\Vert _{\mathrm {F}} \le 10^{-5}\) within \(10^{4}\) iterations. Figure 2 reports the success rate versus the sampling rate. As can be seen, the phase transitions of vanilla GD and of GD with regularized cost are almost identical, whereas projected GD performs slightly better than the other two.
In short, the above empirical results are surprisingly positive yet puzzling. Why was the computational efficiency of vanilla gradient descent unexplained or substantially underestimated in prior theory?
1.5 This Paper
The main contribution of this paper is toward demystifying the “unreasonable” effectiveness of regularization-free nonconvex iterative methods. As asserted in previous work, regularized gradient descent succeeds by properly enforcing/promoting certain incoherence conditions throughout the execution of the algorithm. In contrast, we discover that
Vanilla gradient descent automatically forces the iterates to stay incoherent with the measurement mechanism, thus implicitly regularizing the search directions.
This “implicit regularization” phenomenon is of fundamental importance, suggesting that vanilla gradient descent proceeds as if it were properly regularized. This explains the remarkably favorable performance of unregularized gradient descent in practice. Focusing on the three representative problems mentioned in Sect. 1.1, our theory guarantees both statistical and computational efficiency of vanilla gradient descent under random designs and spectral initialization. With near-optimal sample complexity, to attain \(\epsilon \)-accuracy,
Phase retrieval (informal) vanilla gradient descent converges in \(O\big (\log n\log \frac{1}{\epsilon }\big )\) iterations;
Matrix completion (informal) vanilla gradient descent converges in \(O\big (\log \frac{1}{\epsilon }\big )\) iterations;
Blind deconvolution (informal) vanilla gradient descent converges in \(O\big (\log \frac{1}{\epsilon }\big )\) iterations.
In other words, gradient descent provably achieves (nearly) linear convergence in all of these examples. Throughout this paper, an algorithm is said to converge (nearly) linearly to \({\varvec{x}}^{\star }\) in the noiseless case if the iterates \(\{{\varvec{x}}^t\}\) obey
$$\begin{aligned} \mathrm {dist}\big ({\varvec{x}}^{t+1},{\varvec{x}}^{\star }\big )\le (1-c)\,\mathrm {dist}\big ({\varvec{x}}^{t},{\varvec{x}}^{\star }\big ),\qquad t=0,1,\cdots \end{aligned}$$
for some \(0<c\le 1\) that is (almost) independent of the problem size. Here, \(\mathrm {dist}(\cdot ,\cdot )\) can be any appropriate discrepancy measure.
As a by-product of our theory, gradient descent also provably controls the entrywise empirical risk uniformly across all iterations; for instance, this implies that vanilla gradient descent controls entrywise estimation error for the matrix completion task. Precise statements of these results are deferred to Sect. 3 and are briefly summarized in Table 2.
Notably, our study of implicit regularization suggests that the behavior of nonconvex optimization algorithms for statistical estimation needs to be examined in the context of statistical models, which induces an objective function as a finite sum. Our proof is accomplished via a leave-one-out perturbation argument, which is inherently tied to statistical models and leverages homogeneity across samples. Altogether, this allows us to localize benign landscapes for optimization and characterize finer dynamics not accounted for in generic gradient descent theory.
1.6 Notations
Before continuing, we introduce several notations used throughout the paper. First of all, boldfaced symbols are reserved for vectors and matrices. For any vector \({\varvec{v}}\), we use \(\Vert {\varvec{v}}\Vert _2\) to denote its Euclidean norm. For any matrix \({\varvec{A}}\), we use \(\sigma _{j}({\varvec{A}})\) and \(\lambda _{j}({\varvec{A}})\) to denote its jth largest singular value and eigenvalue, respectively, and let \({\varvec{A}}_{j,\cdot }\) and \({\varvec{A}}_{\cdot ,j}\) denote its jth row and jth column, respectively. In addition, \(\Vert {\varvec{A}}\Vert \), \(\Vert {\varvec{A}}\Vert _{\mathrm {F}}\), \(\Vert {\varvec{A}}\Vert _{2,\infty }\), and \(\Vert {\varvec{A}}\Vert _{\infty }\) stand for the spectral norm (i.e., the largest singular value), the Frobenius norm, the \(\ell _2/\ell _{\infty }\) norm (i.e., the largest \(\ell _2\) norm of the rows), and the entrywise \(\ell _{\infty }\) norm (the largest magnitude of all entries) of a matrix \({\varvec{A}}\). Also, \({\varvec{A}}^{\top }\), \({\varvec{A}}^\textsf {H} \), and \(\overline{{\varvec{A}}}\) denote the transpose, the conjugate transpose, and the entrywise conjugate of \({\varvec{A}}\), respectively. \({\varvec{I}}_{n}\) denotes the identity matrix with dimension \(n\times n\). The notation \(\mathcal {O}^{n\times r}\) represents the set of all \(n\times r\) orthonormal matrices. The notation [n] refers to the set \(\{1,\cdots , n\}\). Also, we use \(\text {Re}(x)\) to denote the real part of a complex number x. Throughout the paper, we use the terms “samples” and “measurements” interchangeably.
Additionally, the standard notation \(f(n)=O\left( g(n)\right) \) or \(f(n)\lesssim g(n)\) means that there exists a constant \(c>0\) such that \(\left| f(n)\right| \le c|g(n)|\), \(f(n)\gtrsim g(n)\) means that there exists a constant \(c>0\) such that \(|f(n)|\ge c\left| g(n)\right| \), and \(f(n)\asymp g(n)\) means that there exist constants \(c_{1},c_{2}>0\) such that \(c_{1}|g(n)|\le |f(n)|\le c_{2}|g(n)|\). Also, \(f(n)\gg g(n)\) means that there exists some large enough constant \(c>0\) such that \(|f(n)|\ge c\left| g(n)\right| \). Similarly, \(f(n)\ll g(n)\) means that there exists some sufficiently small constant \(c>0\) such that \(|f(n)|\le c\left| g(n)\right| \).
2 Implicit Regularization: A Case Study
To reveal reasons behind the effectiveness of vanilla gradient descent, we first examine existing theory of gradient descent and identify the geometric properties that enable linear convergence. We then develop an understanding as to why prior theory is conservative, and describe the phenomenon of implicit regularization that helps explain the effectiveness of vanilla gradient descent. To facilitate discussion, we will use the problem of solving random quadratic systems (phase retrieval) and Wirtinger flow as a case study, but our diagnosis applies more generally, as will be seen in later sections.
2.1 Gradient Descent Theory Revisited
In the convex optimization literature, there are two standard conditions about the objective function—strong convexity and smoothness—that allow for linear convergence of gradient descent.
Definition 1
(Strong convexity) A twice continuously differentiable function \(f:\mathbb {R}^{n}\mapsto \mathbb {R}\) is said to be \(\alpha \)-strongly convex for \(\alpha > 0\) if
$$\begin{aligned} \nabla ^{2}f({\varvec{x}})\,\succeq \,\alpha {\varvec{I}}_{n},\qquad \forall {\varvec{x}}\in \mathbb {R}^{n}. \end{aligned}$$
Definition 2
(Smoothness) A twice continuously differentiable function \(f:\mathbb {R}^{n}\mapsto \mathbb {R}\) is said to be \(\beta \)-smooth for \(\beta > 0\) if
$$\begin{aligned} \big \Vert \nabla ^{2}f({\varvec{x}})\big \Vert \le \beta ,\qquad \forall {\varvec{x}}\in \mathbb {R}^{n}. \end{aligned}$$
It is well known that for an unconstrained optimization problem, if the objective function f is both \(\alpha \)-strongly convex and \(\beta \)-smooth, then vanilla gradient descent (4) enjoys \(\ell _{2}\) error contraction [9, Theorem 3.12], namely
$$\begin{aligned} \big \Vert {\varvec{x}}^{t+1}-{\varvec{x}}^{\star }\big \Vert _{2}\le \left( 1-\frac{2}{\beta /\alpha +1}\right) \big \Vert {\varvec{x}}^{t}-{\varvec{x}}^{\star }\big \Vert _{2},\qquad t\ge 0, \end{aligned}$$(8)
as long as the step size is chosen as \(\eta _{t}=2/(\alpha +\beta )\). Here, \({\varvec{x}}^{\star }\) denotes the global minimum. This immediately reveals the iteration complexity for gradient descent: the number of iterations taken to attain \(\epsilon \)-accuracy (in a relative sense) is bounded by
$$\begin{aligned} O\left( \frac{\beta }{\alpha }\log \frac{1}{\epsilon }\right) . \end{aligned}$$
In other words, the iteration complexity is dictated by and scales linearly with the condition number—the ratio \(\beta /\alpha \) of smoothness to strong convexity parameters.
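As a quick numerical illustration of this scaling (our example, not from the paper), run gradient descent with \(\eta =2/(\alpha +\beta )\) on a quadratic whose Hessian spectrum fills \([\alpha ,\beta ]\): the iteration count grows linearly in \(\kappa =\beta /\alpha \).

```python
import numpy as np

def iters_to_accuracy(kappa, eps=1e-5, n=100):
    """GD on f(x) = sum_i d_i x_i^2 / 2 with d_i in [1, kappa]; counts the
    iterations until ||x^t|| <= eps * ||x^0|| (the minimizer is 0)."""
    d = np.linspace(1.0, kappa, n)       # Hessian eigenvalues: alpha=1, beta=kappa
    eta = 2.0 / (1.0 + kappa)            # the step size used in (8)
    x = np.ones(n)
    target, t = eps * np.linalg.norm(x), 0
    while np.linalg.norm(x) > target:
        x = x - eta * d * x              # gradient of f is D x
        t += 1
    return t

for kappa in [10.0, 100.0, 1000.0]:
    print(kappa, iters_to_accuracy(kappa))  # roughly proportional to kappa
```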
Moving beyond convex optimization, one can easily extend the above theory to nonconvex problems with local strong convexity and smoothness. More precisely, suppose the objective function f satisfies
$$\begin{aligned} \nabla ^{2}f({\varvec{x}})\,\succeq \,\alpha {\varvec{I}}_{n}\qquad \text {and}\qquad \big \Vert \nabla ^{2}f({\varvec{x}})\big \Vert \le \beta \end{aligned}$$
over a local \(\ell _{2}\) ball surrounding the global minimum \({\varvec{x}}^{\star }\):
$$\begin{aligned} \mathcal {B}_{\delta }\big ({\varvec{x}}^{\star }\big ):=\big \{ {\varvec{x}}\,:\,\big \Vert {\varvec{x}}-{\varvec{x}}^{\star }\big \Vert _{2}\le \delta \big \Vert {\varvec{x}}^{\star }\big \Vert _{2}\big \} . \end{aligned}$$(9)
Then the contraction result (8) continues to hold, as long as the algorithm is seeded with an initial point that falls inside \(\mathcal {B}_{\delta }\big ({\varvec{x}}^{\star }\big )\).
2.2 Local Geometry for Solving Random Quadratic Systems
To invoke generic gradient descent theory, it is critical to characterize the local strong convexity and smoothness properties of the loss function. Take the problem of solving random quadratic systems (phase retrieval) as an example. Consider the i.i.d. Gaussian design in which \({\varvec{a}}_{j}\overset{\mathrm {i.i.d.}}{\sim }\mathcal {N}({\varvec{0}},{\varvec{I}}_{n})\), \(1\le j\le m\), and suppose without loss of generality that the underlying signal obeys \(\Vert {\varvec{x}}^{\star }\Vert _{2}=1\). It is well known that \({\varvec{x}}^{\star }\) is the unique minimizer—up to global phase—of (2) under this statistical model, provided that the ratio m/n of equations to unknowns is sufficiently large. The Hessian of the loss function \(f({\varvec{x}})\) is given by
$$\begin{aligned} \nabla ^{2}f({\varvec{x}})=\frac{1}{m}\sum _{j=1}^{m}\Big [3\big ({\varvec{a}}_{j}^{\top }{\varvec{x}}\big )^{2}-y_{j}\Big ]{\varvec{a}}_{j}{\varvec{a}}_{j}^{\top }. \end{aligned}$$
Population-level analysis Consider the case with an infinite number of equations or samples, i.e., \(m\rightarrow \infty \), where \(\nabla ^{2}f({\varvec{x}})\) converges to its expectation. Simple calculation yields that
$$\begin{aligned} \mathbb {E}\big [\nabla ^{2}f({\varvec{x}})\big ]=3\left( \Vert {\varvec{x}}\Vert _{2}^{2} {\varvec{I}}_{n}+2{\varvec{x}}{\varvec{x}}^{\top }\right) -\left( {\varvec{I}}_{n}+2{\varvec{x}}^{\star } {\varvec{x}}^{\star \top }\right) . \end{aligned}$$
It is straightforward to verify that for any sufficiently small constant \(\delta >0\), one has the crude bound
$$\begin{aligned} {\varvec{I}}_{n} \,\preceq \, \mathbb {E}\big [\nabla ^{2}f({\varvec{x}})\big ] \,\preceq \, 10{\varvec{I}}_{n},\qquad \forall {\varvec{x}}\in \mathcal {B}_{\delta }\big ({\varvec{x}}^{\star }\big ), \end{aligned}$$
meaning that f is 1-strongly convex and 10-smooth within a local ball around \({\varvec{x}}^{\star }\). As a consequence, when we have infinite samples and an initial guess \({\varvec{x}}^{0}\) such that \(\Vert {\varvec{x}}^{0}-{\varvec{x}}^{\star }\Vert _{2}\le \delta \big \Vert {\varvec{x}}^{\star }\big \Vert _{2}\), vanilla gradient descent with a constant step size converges to the global minimum within a logarithmic number of iterations.
Finite-sample regime with \(m\asymp n\log n\) Now that f exhibits favorable landscape at the population level, one thus hopes that the fluctuation can be well controlled so that the nice geometry carries over to the finite-sample regime. In the regime where \(m\asymp n\log n\) (which is the regime considered in [18]), local strong convexity is still preserved, in the sense that
$$\begin{aligned} \nabla ^{2}f({\varvec{x}}) \,\succeq \, \left( {1} / {2}\right) \cdot {\varvec{I}}_{n},\qquad \forall {\varvec{x}}:\big \Vert {\varvec{x}}-{\varvec{x}}^{\star }\big \Vert _{2}\le \delta \big \Vert {\varvec{x}}^{\star }\big \Vert _{2} \end{aligned}$$
occurs with high probability, provided that \(\delta >0\) is sufficiently small (see [96, 101] and Lemma 1). The smoothness parameter, however, is not well controlled. In fact, it can be as large as (up to logarithmic factors)Footnote 3
$$\begin{aligned} \big \Vert \nabla ^{2}f({\varvec{x}})\big \Vert \,\lesssim \,n \end{aligned}$$
even when we restrict attention to the local \(\ell _{2}\) ball (9) with \(\delta >0\) being a fixed small constant. This means that the condition number \(\beta /\alpha \) (defined in Sect. 2.1) may scale as O(n), leading to the step size recommendation
$$\begin{aligned} \eta _t\,\asymp \,1/n, \end{aligned}$$
and, as a consequence, a high iteration complexity \(O\big (n\log (1 / {\epsilon } )\big )\). This underpins the analysis in [18].
In summary, the geometric properties of the loss function—even in the local \(\ell _{2}\) ball centering around the global minimum—are not as favorable as one anticipates, in particular in view of its population counterpart. A direct application of generic gradient descent theory leads to an overly conservative step size and a pessimistic convergence rate, unless the number of samples is enormously larger than the number of unknowns.
Remark 2
Notably, due to Gaussian designs, the phase retrieval problem enjoys more favorable geometry compared to other nonconvex problems. In matrix completion and blind deconvolution, the Hessian matrices are rank-deficient even at the population level. In such cases, the above discussions need to be adjusted, e.g., strong convexity is only possible when we restrict attention to certain directions.
2.3 Which Region Enjoys Nicer Geometry?
Interestingly, our theory identifies a local region surrounding \({\varvec{x}}^{\star }\) with a large diameter that enjoys much nicer geometry. This region does not mimic an \(\ell _{2}\) ball, but rather, the intersection of an \(\ell _{2}\) ball and a polytope. We term it the region of incoherence and contraction (RIC). For phase retrieval, the RIC includes all points \({\varvec{x}}\in \mathbb {R}^{n}\) obeying
$$\begin{aligned} \big \Vert {\varvec{x}}-{\varvec{x}}^{\star }\big \Vert _{2}\le \delta \big \Vert {\varvec{x}}^{\star }\big \Vert _{2} \end{aligned}$$(11a)
$$\begin{aligned} \text {and}\qquad \max _{1\le j\le m}\big |{\varvec{a}}_{j}^{\top }\big ({\varvec{x}}-{\varvec{x}}^{\star }\big )\big |\,\lesssim \,\sqrt{\log n}\,\big \Vert {\varvec{x}}^{\star }\big \Vert _{2}, \end{aligned}$$(11b)
where \(\delta >0\) is some small numerical constant. As will be formalized in Lemma 1, with high probability the Hessian matrix satisfies
$$\begin{aligned} \left( {1} / {2}\right) \cdot {\varvec{I}}_{n} \,\preceq \, \nabla ^{2}f({\varvec{x}}) \,\preceq \, O\left( \log n\right) \cdot {\varvec{I}}_{n} \end{aligned}$$
simultaneously for \({\varvec{x}}\) in the RIC. In words, the Hessian matrix is nearly well conditioned (with the condition number bounded by \(O(\log n)\)), as long as (i) the iterate is not very far from the global minimizer (cf. (11a)) and (ii) the iterate remains incoherentFootnote 4 with respect to the sensing vectors (cf. (11b)). Another way to interpret the incoherence condition (11b) is that the empirical risk needs to be well controlled uniformly across all samples. See Fig. 3a for an illustration of the above region.
The following observation is thus immediate: one can safely adopt a far more aggressive step size (as large as \(\eta _{t}=O(1/\log n)\)) to achieve acceleration, as long as the iterates stay within the RIC. This, however, fails to be guaranteed by generic gradient descent theory. To be more precise, if the current iterate \({\varvec{x}}^{t}\) falls within the desired region, then in view of (8), we can ensure \(\ell _{2}\) error contraction after one iteration, namely
$$\begin{aligned} \big \Vert {\varvec{x}}^{t+1}-{\varvec{x}}^{\star }\big \Vert _{2}\le \big \Vert {\varvec{x}}^{t}-{\varvec{x}}^{\star }\big \Vert _{2}, \end{aligned}$$
and hence \({\varvec{x}}^{t+1}\) stays within the local \(\ell _{2}\) ball and hence satisfies (11a). However, it is not immediately obvious that \({\varvec{x}}^{t+1}\) would still stay incoherent with the sensing vectors and satisfy (11b). If \({\varvec{x}}^{t+1}\) leaves the RIC, it no longer enjoys the benign local geometry of the loss function, and the algorithm has to slow down in order to avoid overshooting. See Fig. 3b for a visual illustration. In fact, in almost all regularized gradient descent algorithms mentioned in Sect. 1.2, one of the main purposes of the proposed regularization procedures is to enforce such incoherence constraints.
2.4 Implicit Regularization
However, is regularization really necessary for the iterates to stay within the RIC? To answer this question, we plot in Fig. 4a (resp. Fig. 4b) the incoherence measure \(\frac{\max _{j}\left| {\varvec{a}}_{j}^{\top }{\varvec{x}}^{t}\right| }{\sqrt{\log n}\left\| {\varvec{x}}^{\star }\right\| _{2}}\) (resp. \(\frac{\max _{j}\left| {\varvec{a}}_{j}^{\top }({\varvec{x}}^{t}-{\varvec{x}}^\star )\right| }{\sqrt{\log n}\left\| {\varvec{x}}^{\star }\right\| _{2}}\)) versus the iteration count in a typical Monte Carlo trial, generated in the same way as for Fig. 1a. Interestingly, the incoherence measure remains bounded by 2 for all iterations \(t>1\). This important observation suggests that one may adopt a substantially more aggressive step size throughout the whole algorithm.
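This diagnostic is easy to reproduce: instrument the Wirtinger flow loop from Sect. 1.4 (our code below) to record \(\max _{j}|{\varvec{a}}_{j}^{\top }{\varvec{x}}^{t}|/(\sqrt{\log n}\,\Vert {\varvec{x}}^{\star }\Vert _{2})\) at every iteration; in our runs the trace indeed stays bounded by a small constant.

```python
import numpy as np

def incoherence_trace(A, y, x_star, eta=0.1, num_iters=50):
    """Track the incoherence measure of Fig. 4a along the WF trajectory."""
    m, n = A.shape
    _, V = np.linalg.eigh((A.T * y) @ A / m)   # spectral initialization
    x = np.sqrt(y.mean()) * V[:, -1]
    scale = np.sqrt(np.log(n)) * np.linalg.norm(x_star)
    trace = []
    for _ in range(num_iters):
        trace.append(np.max(np.abs(A @ x)) / scale)
        Ax = A @ x
        x = x - eta * (A.T @ ((Ax ** 2 - y) * Ax) / m)
    return trace   # empirically bounded by a small constant for t > 1
```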
The main objective of this paper is thus to provide a theoretical validation of the above empirical observation. As we will demonstrate shortly, with high probability all iterates along the execution of the algorithm (as well as the spectral initialization) are provably constrained within the RIC, implying fast convergence of vanilla gradient descent (cf. Fig. 3c). The fact that the iterates stay incoherent with the measurement mechanism automatically, without explicit enforcement, is termed “implicit regularization.”
2.5 A Glimpse of the Analysis: A Leave-One-Out Trick
In order to rigorously establish (11b) for all iterates, the current paper develops a powerful mechanism based on the leave-one-out perturbation argument, a trick rooted in and widely used throughout probability and random matrix theory. Note that the iterate \({\varvec{x}}^t\) is statistically dependent on the design vectors \(\{{\varvec{a}}_j\}\). Under such circumstances, one often resorts to generic bounds like the Cauchy–Schwarz inequality, which would not yield a desirable estimate. To address this issue, we introduce a sequence of auxiliary iterates \(\{{\varvec{x}}^{t,(l)}\}\) for each \(1\le l\le m\) (for analytical purposes only), obtained by running vanilla gradient descent using all but the lth sample. As one can expect, such auxiliary trajectories serve as extremely good surrogates of \(\{{\varvec{x}}^t\}\) in the sense that
$$\begin{aligned} {\varvec{x}}^{t}\approx {\varvec{x}}^{t,(l)},\qquad 1\le l\le m,\quad t\ge 0, \end{aligned}$$(12)
since their constructions only differ by a single sample. Most importantly, since \({\varvec{x}}^{t,(l)}\) is independent of the lth design vector, it is much easier to control its incoherence w.r.t. \({\varvec{a}}_l\) to the desired level:
$$\begin{aligned} \big |{\varvec{a}}_{l}^{\top }\big ({\varvec{x}}^{t,(l)}-{\varvec{x}}^{\star }\big )\big |\,\lesssim \,\sqrt{\log n}\,\big \Vert {\varvec{x}}^{\star }\big \Vert _{2}. \end{aligned}$$(13)
Combining (12) and (13) then leads to (11b). See Fig. 5 for a graphical illustration of this argument. Notably, this technique is very general and applicable to many other problems. We invite the readers to Sect. 5 for more details.
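Although the auxiliary sequences exist only on paper, they are easy to simulate for a small phase-retrieval instance; the sketch below is our own code (for simplicity it shares a single starting point across runs, whereas the analysis also constructs leave-one-out initializations). The maximal gap numerically exhibits the proximity (12).

```python
import numpy as np

def run_gd(A, y, x0, eta, T):
    """Vanilla GD on the phase-retrieval loss (2) for T iterations."""
    x = x0.copy()
    for _ in range(T):
        Ax = A @ x
        x = x - eta * (A.T @ ((Ax ** 2 - y) * Ax) / len(y))
    return x

def leave_one_out_gap(A, y, x0, eta=0.1, T=50):
    """Max over l of the distance between the full trajectory and its
    l-th leave-one-out copy (small instances only: m extra GD runs)."""
    x_full = run_gd(A, y, x0, eta, T)
    gaps = []
    for l in range(A.shape[0]):
        keep = np.arange(A.shape[0]) != l
        gaps.append(np.linalg.norm(x_full - run_gd(A[keep], y[keep], x0, eta, T)))
    return max(gaps)   # small: dropping one sample barely moves the trajectory
```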
3 Main Results
This section formalizes the implicit regularization phenomenon underlying unregularized gradient descent and presents its consequences, namely near-optimal statistical and computational guarantees for phase retrieval, matrix completion, and blind deconvolution. Note that the discrepancy measure \(\text {dist}\left( \cdot , \cdot \right) \) may vary from problem to problem.
3.1 Phase Retrieval
Suppose the m quadratic equations
$$\begin{aligned} y_{j}=\big ({\varvec{a}}_{j}^{\top }{\varvec{x}}^{\star }\big )^{2},\qquad 1\le j\le m, \end{aligned}$$(14)
are collected using random design vectors, namely \({\varvec{a}}_{j}\overset{\mathrm {i.i.d.}}{\sim }\mathcal {N}({\varvec{0}},{\varvec{I}}_{n})\), and the nonconvex problem to solve is
$$\begin{aligned} \text {minimize}_{{\varvec{x}}\in \mathbb {R}^{n}}\quad f({\varvec{x}})=\frac{1}{4m}\sum _{j=1}^{m}\Big [y_{j}-\big ({\varvec{a}}_{j}^{\top }{\varvec{x}}\big )^{2}\Big ]^{2}. \end{aligned}$$
The Wirtinger flow (WF) algorithm, first introduced in [18], is a combination of spectral initialization and vanilla gradient descent; see Algorithm 1.
Recognizing that the global phase/sign is unrecoverable from quadratic measurements, we introduce the \(\ell _{2}\) distance modulo the global phase as follows:
$$\begin{aligned} \mathrm {dist}\big ({\varvec{x}},{\varvec{x}}^{\star }\big ):=\min \big \{ \big \Vert {\varvec{x}}-{\varvec{x}}^{\star }\big \Vert _{2},\big \Vert {\varvec{x}}+{\varvec{x}}^{\star }\big \Vert _{2}\big \} . \end{aligned}$$
Our finding is summarized in the following theorem.
Theorem 1
Let \({\varvec{x}}^{\star }\in \mathbb {R}^{n}\) be a fixed vector. Suppose \({\varvec{a}}_{j}\overset{\mathrm {i.i.d.}}{\sim }\mathcal {N}\left( {\varvec{0}},{\varvec{I}}_{n}\right) \) for each \(1\le j\le m\) and \(m\ge c_{0}n\log n\) for some sufficiently large constant \(c_{0}>0\). Assume the step size obeys \(\eta _{t}\equiv \eta ={c_{1}} / \left( {\log n}\cdot \Vert {\varvec{x}}^{0}\Vert _{2}^{2}\right) \) for any sufficiently small constant \(c_{1}>0\). Then there exist some absolute constants \(0<\varepsilon <1\) and \(c_{2}>0\) such that with probability at least \(1-O\left( mn^{-5}\right) \), Algorithm 1 satisfies that for all \(t\ge 0\),
$$\begin{aligned} \mathrm {dist}\big ({\varvec{x}}^{t},{\varvec{x}}^{\star }\big )\le \varepsilon \big (1-{\eta \Vert {\varvec{x}}^{\star }\Vert _{2}^{2}}/{2}\big )^{t}\big \Vert {\varvec{x}}^{\star }\big \Vert _{2}, \end{aligned}$$(19a)
$$\begin{aligned} \max _{1\le j\le m}\big |{\varvec{a}}_{j}^{\top }\big (\widetilde{{\varvec{x}}}^{t}-{\varvec{x}}^{\star }\big )\big |\le c_{2}\sqrt{\log n}\,\big \Vert {\varvec{x}}^{\star }\big \Vert _{2}, \end{aligned}$$(19b)
where \(\widetilde{{\varvec{x}}}^{t}\) denotes the iterate \({\varvec{x}}^{t}\) aligned with \({\varvec{x}}^{\star }\) modulo the global sign.
Theorem 1 reveals a few intriguing properties of Algorithm 1.
Implicit regularization Theorem 1 asserts that the incoherence properties are satisfied throughout the execution of the algorithm (see (19b)), which formally justifies the implicit regularization feature we hypothesized.
Near-constant step size Consider the case where \(\Vert {\varvec{x}}^{\star } \Vert _2 =1\). Theorem 1 establishes near-linear convergence of WF with a substantially more aggressive step size \(\eta \asymp 1/\log n\). Compared with the choice \(\eta \lesssim 1/n\) admissible in [18, Theorem 3.3], Theorem 1 allows WF/GD to attain \(\epsilon \)-accuracy within \(O(\log n\log (1/\epsilon ))\) iterations. The resulting computational complexity of the algorithm is
$$\begin{aligned} O\left( mn\log n\log \frac{1}{\epsilon }\right) , \end{aligned}$$
which significantly improves upon the result \(O\big (mn^{2}\log \left( {1} / {\epsilon }\right) \big )\) derived in [18]. As a side note, if the sample size further increases to \(m\asymp n\log ^2 n\), then a constant step size \(\eta \asymp 1\) is also feasible, resulting in an iteration complexity of \(O\big (\log (1/\epsilon )\big )\). This follows since with high probability, the entire trajectory resides within a more refined incoherence region \(\max _{j}\big |{\varvec{a}}_{j}^{\top }\big ({\varvec{x}}^{t}-{\varvec{x}}^{\star }\big )\big | \lesssim \Vert {\varvec{x}}^{\star }\Vert _{2}\). We omit the details here.
Incoherence of spectral initialization We have also demonstrated in Theorem 1 that the initial guess \({\varvec{x}}^{0}\) falls within the RIC and is hence nearly orthogonal to all design vectors. This provides a finer characterization of spectral initialization, in comparison with prior theory that focuses primarily on the \(\ell _2\) accuracy [18, 90]. We expect our leave-one-out analysis to accommodate other variants of spectral initialization studied in the literature [12, 25, 83, 88, 118].
Remark 3
As it turns out, a carefully designed initialization is not pivotal in enabling fast convergence. In fact, randomly initialized gradient descent provably attains \(\varepsilon \)-accuracy in \(O(\log n + \log \tfrac{1}{\varepsilon })\) iterations; see [27] for details.
3.2 Low-Rank Matrix Completion
Let \({\varvec{M}}^{\star }\in \mathbb {R}^{n\times n}\) be a positive semidefinite matrixFootnote 5 with rank r, and suppose its eigendecomposition is
$$\begin{aligned} {\varvec{M}}^{\star }={\varvec{U}}^{\star }{\varvec{\Sigma }}^{\star }{\varvec{U}}^{\star \top }, \end{aligned}$$
where \({\varvec{U}}^{\star }\in \mathbb {R}^{n\times r}\) consists of orthonormal columns and \({\varvec{\Sigma }}^{\star }\) is an \(r\times r\) diagonal matrix with eigenvalues arranged in descending order, i.e., \(\sigma _{\max }=\sigma _{1}\ge \cdots \ge \sigma _{r}=\sigma _{\min }>0\). Throughout this paper, we assume the condition number \(\kappa :=\sigma _{\max }/\sigma _{\min }\) is bounded by a fixed constant, independent of the problem size (i.e., n and r). Denoting \({\varvec{X}}^{\star }={\varvec{U}}^{\star }({\varvec{\Sigma }}^{\star })^{1/2}\) allows us to factorize \({\varvec{M}}^{\star }\) as
$$\begin{aligned} {\varvec{M}}^{\star }={\varvec{X}}^{\star }{\varvec{X}}^{\star \top }. \end{aligned}$$
Consider a random sampling model such that each entry of \({\varvec{M}}^{\star }\) is observed independently with probability \(0<p\le 1\), i.e., for \(1\le j\le k\le n\),
where the entries of \({\varvec{E}}=[ E_{j,k} ] _{1\le j\le k\le n}\) are independent sub-Gaussian noise with sub-Gaussian norm \(\sigma \) (see [116, Definition 5.7]). We denote by \(\Omega \) the set of locations being sampled, and \(\mathcal {P}_\Omega ({\varvec{Y}})\) represents the projection of \({\varvec{Y}}\) onto the set of matrices supported in \(\Omega \). We note here that the sampling rate p, if not known, can be faithfully estimated by the sample proportion \(| \Omega |/n^2\).
To fix ideas, we consider the following nonconvex optimization problem
$$\begin{aligned} \text {minimize}_{{\varvec{X}}\in \mathbb {R}^{n\times r}}\quad f({\varvec{X}})=\frac{1}{4p}\sum _{(j,k)\in \Omega }\left( Y_{j,k}-{\varvec{e}}_{j}^{\top }{\varvec{X}}{\varvec{X}}^{\top }{\varvec{e}}_{k}\right) ^{2}. \end{aligned}$$
The vanilla gradient descent algorithm (with spectral initialization) is summarized in Algorithm 2.
Before proceeding to the main theorem, we first introduce a standard incoherence parameter required for matrix completion [19].
Definition 3
(Incoherence for matrix completion) A rank-r matrix \({\varvec{M}}^{\star }\) with eigendecomposition \({\varvec{M}}^{\star }={\varvec{U}}^{\star }{\varvec{\Sigma }}^{\star }{\varvec{U}}^{\star \top }\) is said to be \(\mu \)-incoherent if
$$\begin{aligned} \big \Vert {\varvec{U}}^{\star }\big \Vert _{2,\infty }\le \sqrt{\frac{\mu }{n}}\big \Vert {\varvec{U}}^{\star }\big \Vert _{\mathrm {F}}=\sqrt{\frac{\mu r}{n}}. \end{aligned}$$(25)
In addition, recognizing that \({\varvec{X}}^{\star }\) is identifiable only up to orthogonal transformation, we define the optimal transform from the tth iterate \({\varvec{X}}^t\) to \({\varvec{X}}^{\star }\) as
$$\begin{aligned} \widehat{{\varvec{H}}}^{t}:=\mathop {\mathrm {arg\,min}}_{{\varvec{R}}\in \mathcal {O}^{r\times r}}\big \Vert {\varvec{X}}^{t}{\varvec{R}}-{\varvec{X}}^{\star }\big \Vert _{\mathrm {F}}, \end{aligned}$$
where \(\mathcal {O}^{r\times r}\) is the set of \(r\times r\) orthonormal matrices.
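The minimization defining \(\widehat{{\varvec{H}}}^{t}\) is the classical orthogonal Procrustes problem, whose closed-form solution via the SVD (a standard fact, not spelled out above) makes the aligned distances in the sequel easy to evaluate; the short sketch below is our own code.

```python
import numpy as np

def optimal_rotation(X, X_star):
    """Orthogonal Procrustes: argmin_{R orthonormal} ||X R - X_star||_F
    equals U V^T, where U S V^T is the SVD of X^T X_star."""
    U, _, Vt = np.linalg.svd(X.T @ X_star)
    return U @ Vt

def aligned_dist_F(X, X_star):
    """Frobenius distance after optimal alignment, as used in Theorem 2."""
    return np.linalg.norm(X @ optimal_rotation(X, X_star) - X_star)
```

With these definitions in place, we have the following theorem.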
Theorem 2
Let \({\varvec{M}}^{\star }\) be a rank-r, \(\mu \)-incoherent PSD matrix whose condition number \(\kappa \) is a fixed constant. Suppose the sample size satisfies \(n^{2}p\ge C\mu ^{3}r^{3}n\log ^{3}n\) for some sufficiently large constant \(C>0\), and the noise satisfies
$$\begin{aligned} \sigma \sqrt{\frac{n}{p}}\,\lesssim \,\frac{\sigma _{\min }}{\sqrt{\kappa ^{3}\mu r\log ^{3}n}}. \end{aligned}$$
With probability at least \(1-O\left( n^{-3}\right) \), the iterates of Algorithm 2 satisfy
$$\begin{aligned} \big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big \Vert _{\mathrm {F}}\le \Big (C_{1}\rho ^{t}\mu r\frac{1}{\sqrt{np}}+C_{4}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\Big )\big \Vert {\varvec{X}}^{\star }\big \Vert _{\mathrm {F}}, \end{aligned}$$(28a)
$$\begin{aligned} \big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big \Vert _{2,\infty }\le \Big (C_{5}\rho ^{t}\mu r\sqrt{\frac{\log n}{np}}+C_{8}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n\log n}{p}}\Big )\big \Vert {\varvec{X}}^{\star }\big \Vert _{2,\infty }, \end{aligned}$$(28b)
$$\begin{aligned} \big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big \Vert \le \Big (C_{9}\rho ^{t}\mu r\frac{1}{\sqrt{np}}+C_{10}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\Big )\big \Vert {\varvec{X}}^{\star }\big \Vert , \end{aligned}$$(28c)
for all \(0\le t\le T=O(n^{5})\), where \(C_{1}\), \(C_{4}\), \(C_{5}\), \(C_{8}\), \(C_{9}\), and \(C_{10}\) are some absolute positive constants and \(1-\left( {\sigma _{\min }} / {5}\right) \cdot \eta \le \rho <1\), provided that \(0<\eta _{t}\equiv \eta \le {2} / \left( {25\kappa \sigma _{\max }}\right) \).
Theorem 2 provides the first theoretical guarantee of unregularized gradient descent for matrix completion, demonstrating near-optimal statistical accuracy and computational complexity.
Implicit regularization In Theorem 2, we bound the \(\ell _{2}/\ell _{\infty }\) error of the iterates in a uniform manner via (28b). Note that \(\big \Vert {\varvec{X}}-{\varvec{X}}^{\star }\big \Vert _{2,\infty }=\max _{j}\big \Vert {\varvec{e}}_{j}^{\top }\big ({\varvec{X}}-{\varvec{X}}^{\star }\big )\big \Vert _{2}\), which implies the iterates remain incoherent with the sensing vectors throughout and have small incoherence parameters (cf. (25)). In comparison, prior works either include a penalty term on \(\{\Vert {\varvec{e}}_{j}^{\top }{\varvec{X}}\Vert _2\}_{1\le j\le n}\) [64, 107] and/or \(\Vert {\varvec{X}}\Vert _{\mathrm {F}}\) [107] to encourage an incoherent and/or low-norm solution, or add an extra projection operation to enforce incoherence [32, 131]. Our results demonstrate that such explicit regularization is unnecessary.
Constant step size Without loss of generality, we may assume that \(\sigma _{\max } = \Vert {\varvec{M}}^{\star } \Vert = O( 1 )\), which can be achieved by properly rescaling \({\varvec{M}}^{\star }\). Hence, we have a constant step size \(\eta _t \asymp 1\). It is in fact more convenient to consider the scale-invariant parameter \(\rho \): Theorem 2 guarantees linear convergence of vanilla gradient descent at a constant rate \(\rho \). Remarkably, the convergence occurs with respect to three different norms: the Frobenius norm \(\Vert \cdot \Vert _{\mathrm {F}}\), the \(\ell _{2}/\ell _{\infty }\) norm \(\Vert \cdot \Vert _{2,\infty }\), and the spectral norm \(\Vert \cdot \Vert \). As far as we know, the latter two are established for the first time. Note that our result even improves upon that for regularized gradient descent; see Table 1.
Near-optimal sample complexity When the rank \(r=O(1)\), vanilla gradient descent succeeds under a near-optimal sample complexity \(n^{2}p\gtrsim n\mathrm {poly}\log n\), which is statistically optimal up to some logarithmic factor.
Near-minimal Euclidean error In view of (28a), as t increases, the Euclidean error of vanilla GD converges to
$$\begin{aligned} \big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big \Vert _{\mathrm {F}} \lesssim \frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\big \Vert {\varvec{X}}^{\star }\big \Vert _{\mathrm {F}}, \end{aligned}$$(29)
which coincides with the theoretical guarantee in [32, Corollary 1] and matches the minimax lower bound established in [67, 89].
Near-optimal entrywise error The \(\ell _{2}/\ell _{\infty }\) error bound (28b) immediately yields entrywise control of the empirical risk. Specifically, as soon as t is sufficiently large (so that the first term in (28b) is negligible), we have
$$\begin{aligned} \big \Vert {\varvec{X}}^{t}{\varvec{X}}^{t\top }-{\varvec{M}}^{\star }\big \Vert _{\infty }&\le \big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\big ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big )^{\top }\big \Vert _{\infty }+\big \Vert \big ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big ){\varvec{X}}^{\star \top }\big \Vert _{\infty }\\&\le \big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\big \Vert _{2,\infty }\big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big \Vert _{2,\infty }+\big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big \Vert _{2,\infty }\big \Vert {\varvec{X}}^{\star }\big \Vert _{2,\infty } \\&\lesssim \frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n\log n}{p}}\left\| {\varvec{M}}^{\star }\right\| _{\infty }, \end{aligned}$$
where the last line follows from (28b) as well as the facts that \(\Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\Vert _{2,\infty }\le \Vert {\varvec{X}}^{\star }\Vert _{2,\infty }\) and \(\Vert {\varvec{M}}^{\star }\Vert _{\infty }=\Vert {\varvec{X}}^{\star }\Vert _{2,\infty }^{2}\). Compared with the Euclidean loss (29), this implies that when \(r=O(1)\), the entrywise error of \({\varvec{X}}^{t}{\varvec{X}}^{t\top }\) is uniformly spread out across all entries. As far as we know, this is the first result that reveals near-optimal entrywise error control for noisy matrix completion using nonconvex optimization, without resorting to sample splitting.
Remark 4
Theorem 2 remains valid if the total number T of iterations obeys \(T=n^{O(1)}\). In the noiseless case where \(\sigma = 0\), the theory allows arbitrarily large T.
Finally, we report the empirical statistical accuracy of vanilla gradient descent in the presence of noise. Figure 6 displays the squared relative error of vanilla gradient descent as a function of the signal-to-noise ratio (SNR), where the SNR is defined to be
$$\begin{aligned} \mathrm {SNR}:=\frac{\big \Vert {\varvec{M}}^{\star }\big \Vert _{\mathrm {F}}^{2}}{n^{2}\sigma ^{2}}, \end{aligned}$$
and the relative error is measured in terms of the square of the metrics as in (28) as well as the squared entrywise prediction error. Both the relative error and the SNR are shown on a dB scale (i.e., \(10\log _{10}(\text {SNR})\) and \(10\log _{10}(\text {squared relative error})\) are plotted). The results are averaged over 20 independent trials. As one can see from the plot, the squared relative error scales inversely proportional to the SNR, which is consistent with our theory.Footnote 6
3.3 Blind Deconvolution
Suppose we have collected m bilinear measurements
$$\begin{aligned} y_{j}={\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}^{\star }{\varvec{x}}^{\star \textsf {H} }{\varvec{a}}_{j},\qquad 1\le j\le m, \end{aligned}$$(31)
where \({\varvec{a}}_{j}\) follows a complex Gaussian distribution, i.e., \({\varvec{a}}_{j}\overset{\text {i.i.d.}}{\sim }\mathcal {N}\left( {\varvec{0}},\frac{1}{2}{\varvec{I}}_{K}\right) +i\mathcal {N}\left( {\varvec{0}},\frac{1}{2}{\varvec{I}}_{K}\right) \) for \(1\le j \le m\), and \({\varvec{B}} :=\left[ {\varvec{b}}_1,\cdots ,{\varvec{b}}_m\right] ^\textsf {H} \in \mathbb {C}^{m\times K}\) is formed by the first K columns of a unitary discrete Fourier transform (DFT) matrix \({\varvec{F}}\in \mathbb {C}^{m\times m}\) obeying \({\varvec{F}}{\varvec{F}}^{\textsf {H} }={\varvec{I}}_m\) (see Appendix D.3.2 for a brief introduction to DFT matrices). This setup models blind deconvolution, where the two signals under convolution belong to known low-dimensional subspaces of dimension K [3].Footnote 7 In particular, the partial DFT matrix \({\varvec{B}}\) plays an important role in image blind deblurring. In this subsection, we consider solving the following nonconvex optimization problem
$$\begin{aligned} \text {minimize}_{{\varvec{h}},{\varvec{x}}\in \mathbb {C}^{K}}\quad f({\varvec{h}},{\varvec{x}})= \sum _{j=1}^{m}\big |y_{j}-{\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}{\varvec{x}}^{\textsf {H} }{\varvec{a}}_{j}\big |^{2}. \end{aligned}$$(32)
The (Wirtinger) gradient descent algorithm (with spectral initialization) is summarized in Algorithm 3; here, \(\nabla _{{\varvec{h}}} f({\varvec{h}},{\varvec{x}})\) and \(\nabla _{{\varvec{x}}} f({\varvec{h}},{\varvec{x}})\) stand for the Wirtinger gradient and are given in (77) and (78), respectively; see [18, Section 6] for a brief introduction to Wirtinger calculus.
It is self-evident that \({\varvec{h}}^{\star }\) and \({\varvec{x}}^{\star }\) are only identifiable up to global scaling, that is, for any nonzero \(\alpha \in \mathbb {C}\),
$$\begin{aligned} {\varvec{h}}^{\star }{\varvec{x}}^{\star \textsf {H} }=\big (\alpha {\varvec{h}}^{\star }\big )\Big (\frac{1}{\overline{\alpha }}{\varvec{x}}^{\star }\Big )^{\textsf {H} }. \end{aligned}$$
In light of this, we will measure the discrepancy between
$$\begin{aligned} {\varvec{z}}:=\begin{bmatrix}{\varvec{h}}\\ {\varvec{x}} \end{bmatrix}\in \mathbb {C}^{2K}\qquad \text {and}\qquad {\varvec{z}}^{\star }:=\begin{bmatrix}{\varvec{h}}^{\star }\\ {\varvec{x}}^{\star } \end{bmatrix}\in \mathbb {C}^{2K} \end{aligned}$$
via the following function
$$\begin{aligned} \mathrm {dist}\big ({\varvec{z}},{\varvec{z}}^{\star }\big ):=\min _{\alpha \in \mathbb {C}}\sqrt{\Big \Vert \frac{1}{\overline{\alpha }}{\varvec{h}}-{\varvec{h}}^{\star }\Big \Vert _{2}^{2}+\big \Vert \alpha {\varvec{x}}-{\varvec{x}}^{\star }\big \Vert _{2}^{2}}. \end{aligned}$$
Before proceeding, we need to introduce the incoherence parameter [3, 76], which is crucial for blind deconvolution and whose role is similar to that of the incoherence parameter (cf. Definition 3) in matrix completion.
Definition 4
(Incoherence for blind deconvolution) Let the incoherence parameter \(\mu \) of \({\varvec{h}}^{\star }\) be the smallest number such that
$$\begin{aligned} \max _{1\le j\le m}\big |{\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}^{\star }\big |\le \frac{\mu }{\sqrt{m}}\big \Vert {\varvec{h}}^{\star }\big \Vert _{2}. \end{aligned}$$
The incoherence parameter describes the spectral flatness of the signal \({\varvec{h}}^{\star }\). With this definition in place, we have the following theorem, where for identifiability we assume that \(\left\| {\varvec{h}}^{\star }\right\| _{2}=\left\| {\varvec{x}}^{\star }\right\| _{2}\).
Theorem 3
Suppose the number of measurements obeys \(m\ge C\mu ^{2}K\log ^{9}m\) for some sufficiently large constant \(C>0\), and suppose the step size \(\eta >0\) is taken to be some sufficiently small constant. Then there exist constants \(c_{1},c_{2},C_{1},C_{3},C_{4}>0\) such that with probability exceeding \(1-c_{1}m^{-5}-c_{1}me^{-c_{2}K}\), the iterates in Algorithm 3 satisfy
$$\begin{aligned} \mathrm {dist}\big ({\varvec{z}}^{t},{\varvec{z}}^{\star }\big )\le C_{1}\Big (1-\frac{\eta }{16}\Big )^{t}\frac{1}{\log ^{2}m}\,\big \Vert {\varvec{z}}^{\star }\big \Vert _{2}, \end{aligned}$$(37a)
$$\begin{aligned} \max _{1\le j\le m}\big |{\varvec{a}}_{j}^{\textsf {H} }\big (\alpha ^{t}{\varvec{x}}^{t}-{\varvec{x}}^{\star }\big )\big |\le C_{3}\frac{1}{\log ^{1.5}m}\,\big \Vert {\varvec{x}}^{\star }\big \Vert _{2}, \end{aligned}$$(37b)
$$\begin{aligned} \max _{1\le j\le m}\big |{\varvec{b}}_{j}^{\textsf {H} }\big ({\varvec{h}}^{t}/\overline{\alpha ^{t}}\big )\big |\le C_{4}\frac{\mu }{\sqrt{m}}\log ^{2}m\,\big \Vert {\varvec{h}}^{\star }\big \Vert _{2} \end{aligned}$$(37c)
for all \(t\ge 0\). Here, \(\alpha ^t\) denotes the alignment parameter,
$$\begin{aligned} \alpha ^{t}:=\mathop {\mathrm {arg\,min}}_{\alpha \in \mathbb {C}}\Big \Vert \frac{1}{\overline{\alpha }}{\varvec{h}}^{t}-{\varvec{h}}^{\star }\Big \Vert _{2}^{2}+\big \Vert \alpha {\varvec{x}}^{t}-{\varvec{x}}^{\star }\big \Vert _{2}^{2}. \end{aligned}$$
Theorem 3 provides the first theoretical guarantee of unregularized gradient descent for blind deconvolution at a near-optimal statistical and computational complexity. A few remarks are in order.
Implicit regularization Theorem 3 reveals that the unregularized gradient descent iterates remain incoherent with the sampling mechanism (see (37b) and (37c)). Recall that prior works operate upon a regularized cost function with an additional penalty term that regularizes the global scaling \(\{\Vert {\varvec{h}}\Vert _2,\Vert {\varvec{x}}\Vert _2\}\) and the incoherence \(\{|{\varvec{b}}_j^\textsf {H} {\varvec{h}}|\}_{1\le j\le m}\) [58, 76, 82]. In comparison, our theorem implies that it is unnecessary to regularize either the incoherence or the scaling ambiguity, which is somewhat surprising. This justifies the use of regularization-free (Wirtinger) gradient descent for blind deconvolution.
Constant step size Compared to the step size \(\eta _t \lesssim 1/m\) suggested in [76] for regularized gradient descent, our theory admits a substantially more aggressive step size (i.e., \(\eta _t\asymp 1\)) even without regularization. Similar to phase retrieval, the computational efficiency is boosted by a factor of m, attaining \(\epsilon \)-accuracy within \(O\left( \log (1/\epsilon )\right) \) iterations (vs. \(O\left( m\log (1/\epsilon )\right) \) iterations in prior theory).
Near-optimal sample complexity It is demonstrated that vanilla gradient descent succeeds at a near-optimal sample complexity up to logarithmic factors, although our requirement is slightly worse than [76] which uses explicit regularization. Notably, even under the sample complexity herein, the iteration complexity given in [76] is still \(O\left( m/\mathrm {poly}\log (m)\right) \).
Incoherence of spectral initialization As in phase retrieval, Theorem 3 demonstrates that the estimates returned by the spectral method are incoherent with respect to both \(\{{\varvec{a}}_j\}\) and \(\{{\varvec{b}}_j\}\). In contrast, [76] recommends a projection operation (via a linear program) to enforce incoherence of the initial estimates, which is dispensable according to our theory.
Contraction in \(\left\| \cdot \right\| _{\mathrm {F}}\) It is easy to check that the Frobenius norm error satisfies \(\left\| {\varvec{h}}^{t}{\varvec{x}}^{t\textsf {H} }-{\varvec{h}}^{\star }{\varvec{x}}^{\star \textsf {H} }\right\| _{\mathrm {F}}\lesssim \mathrm {dist}\left( {\varvec{z}}^{t},{\varvec{z}}^{\star }\right) \), and therefore, Theorem 3 corroborates the empirical results shown in Fig. 1c.
4 Related Work
Solving nonlinear systems of equations has received much attention in the past decade. Rather than directly attacking the nonconvex formulation, convex relaxation lifts the object of interest into a higher-dimensional space and then attempts recovery via semidefinite programming (e.g., [3, 19, 20, 94]). This has enjoyed great success in both theory and practice. Despite appealing statistical guarantees, semidefinite programming is in general prohibitively expensive when processing large-scale datasets.
Nonconvex approaches, on the other end, have been under extensive study in the last few years, due to their computational advantages. There is a growing list of statistical estimation problems for which nonconvex approaches are guaranteed to find global optimal solutions, including but not limited to phase retrieval [18, 25, 90], low-rank matrix sensing and completion [7, 32, 48, 115, 130], blind deconvolution and self-calibration [72, 76, 78, 82], dictionary learning [106], tensor decomposition [49], joint alignment [24], learning shallow neural networks [103, 132], robust subspace learning [34, 74, 86, 91]. In several problems [40, 48, 49, 75, 77, 86, 87, 105, 106], it is further suggested that the optimization landscape is benign under sufficiently large sample complexity, in the sense that all local minima are globally optimal, and hence nonconvex iterative algorithms become promising in solving such problems. See [37] for a recent overview. Below we review the three problems studied in this paper in more detail. Some state-of-the-art results are summarized in Table 1.
Phase retrieval Candès et al. proposed PhaseLift [20] to solve the quadratic systems of equations based on convex programming. Specifically, it lifts the decision variable \({\varvec{x}}^{\star }\) into a rank-one matrix \({\varvec{X}}^{\star }={\varvec{x}}^{\star }{\varvec{x}}^{\star \top }\) and translates the quadratic constraints of \({\varvec{x}}^{\star }\) in (14) into linear constraints of \({\varvec{X}}^{\star }\). By dropping the rank constraint, the problem becomes convex [11, 16, 20, 29, 113]. Another convex program PhaseMax [5, 41, 50, 53] operates in the natural parameter space via linear programming, provided that an anchor vector is available. On the other hand, alternating minimization [90] with sample splitting has been shown to enjoy much better computational guarantee. In contrast, Wirtinger flow [18] provides the first global convergence result for nonconvex methods without sample splitting, whose statistical and computational guarantees are later improved by [25] via an adaptive truncation strategy. Several other variants of WF are also proposed [12, 68, 102], among which an amplitude-based loss function has been investigated [117,118,119, 127]. In particular, [127] demonstrates that the amplitude-based loss function has a better curvature, and vanilla gradient descent can indeed converge with a constant step size at the orderwise optimal sample complexity. A small sample of other nonconvex phase retrieval methods include [6, 10, 22, 36, 43, 47, 92, 98, 100, 109, 122], which are beyond the scope of this paper.
Matrix completion Nuclear norm minimization was studied in [19] as a convex relaxation paradigm to solve the matrix completion problem. Under certain incoherence conditions imposed upon the ground truth matrix, exact recovery is guaranteed under near-optimal sample complexity [14, 23, 38, 51, 93]. Concurrently, several works [54, 55, 60, 61, 63, 64, 65, 71, 110, 123, 129] tackled the matrix completion problem via nonconvex approaches. In particular, the seminal work by Keshavan et al. [64, 65] pioneered the two-stage approach that is widely adopted by later works. Sun and Luo [107] demonstrated the convergence of gradient descent type methods for noiseless matrix completion with a regularized nonconvex loss function. Instead of penalizing the loss function, [32, 131] employed projection to enforce the incoherence condition throughout the execution of the algorithm. To the best of our knowledge, no rigorous guarantees have been established for matrix completion without explicit regularization. A notable exception is [63], which uses unregularized stochastic gradient descent for matrix completion in the online setting. However, the analysis is performed with fresh samples in each iteration. Our work closes the gap and makes the first contribution toward understanding implicit regularization in gradient descent without sample splitting. In addition, entrywise eigenvector perturbation has been studied by [1, 26, 60] in order to analyze the spectral algorithms for matrix completion, which helps us establish theoretical guarantees for the spectral initialization step. Finally, it has recently been shown that the analysis of nonconvex gradient descent in turn yields near-optimal statistical guarantees for convex relaxation in the context of noisy matrix completion; see [28, 31].
Blind deconvolution In [3], Ahmed et al. first proposed to invoke similar lifting ideas for blind deconvolution, which translates the bilinear measurements (31) into a system of linear measurements of a rank-one matrix \({\varvec{X}}^{\star }={\varvec{h}}^{\star }{\varvec{x}}^{\star \textsf {H} }\). Near-optimal performance guarantees have been established for convex relaxation [3]. Under the same model, Li et al. [76] proposed a regularized gradient descent algorithm that directly optimizes the nonconvex loss function (32) with a few regularization terms that account for scaling ambiguity and incoherence. In [58], a Riemannian steepest descent method is developed that removes the regularization for the scaling ambiguity, although incoherence regularization is still required. In [2], a linear program is proposed, but it requires exact knowledge of the signs of the signals. Blind deconvolution has also been studied under other models—interested readers are referred to [35, 72, 73, 81, 82, 120, 128].
On the other hand, our analysis framework is based on a leave-one-out perturbation argument. This technique has been widely used to analyze high-dimensional problems with random designs, including but not limited to robust M-estimation [44, 45], statistical inference for sparse regression [62], likelihood ratio test in logistic regression [108], phase synchronization [1, 133], ranking from pairwise comparisons [30], community recovery [1], and covariance sketching [79]. In particular, this technique results in tight performance guarantees for the generalized power method [133], the spectral method [1, 30], and convex programming approaches [30, 44, 108, 133]; however, it has not been applied to analyze nonconvex optimization algorithms.
Finally, we note that the notion of implicit regularization—broadly defined—arises in settings far beyond the models and algorithms considered herein. For instance, it has been conjectured that in matrix factorization, over-parameterized stochastic gradient descent effectively enforces certain norm constraints, allowing it to converge to a minimal-norm solution as long as it starts from the origin [52]. Stochastic gradient methods have also been shown to implicitly enforce Tikhonov regularization in several statistical learning settings [80]. More broadly, this phenomenon seems crucial in enabling efficient training of deep neural networks [104, 125].
5 A General Recipe for Trajectory Analysis
In this section, we sketch a general recipe for establishing performance guarantees of gradient descent, which conveys the key idea behind the proofs of the main results of this paper. The main challenge is to demonstrate that appropriate incoherence conditions are preserved throughout the trajectory of the algorithm. This requires exploiting the statistical independence of the samples in a careful manner, in conjunction with generic optimization theory. Central to our approach is a leave-one-out perturbation argument, which allows us to decouple the statistical dependency while controlling the component-wise incoherence measures.
5.1 General Model
Consider the following problem where the samples are collected in a bilinear/quadratic form as
$$\begin{aligned} y_{j}={\varvec{\psi }}_{j}^{\textsf {H} }{\varvec{H}}^{\star }{\varvec{X}}^{\star \textsf {H} }{\varvec{\phi }}_{j},\qquad 1\le j\le m, \end{aligned}$$(39)
where the objects of interest \({\varvec{H}}^{\star }, {\varvec{X}}^{\star }\in \mathbb {C}^{n\times r}\) or \(\mathbb {R}^{n\times r}\) might be vectors or tall matrices taking either real or complex values. The design vectors \(\left\{ {\varvec{\psi }}_{j}\right\} \) and \(\{{\varvec{\phi }}_{j}\}\) are in either \(\mathbb {C}^n\) or \(\mathbb {R}^n\), and can be either random or deterministic. This model is quite general and entails all three examples in this paper as special cases:
Phase retrieval: \({\varvec{H}}^{\star }={\varvec{X}}^{\star }={\varvec{x}}^{\star }\in \mathbb {R}^{n}\), and \({\varvec{\psi }}_{j}={\varvec{\phi }}_{j}={\varvec{a}}_{j}\);
Matrix completion: \({\varvec{H}}^{\star }={\varvec{X}}^{\star }\in \mathbb {R}^{n\times r}\) and \({\varvec{\psi }}_{j},{\varvec{\phi }}_{j}\in \{{\varvec{e}}_{1},\cdots ,{\varvec{e}}_{n}\}\);
Blind deconvolution: \({\varvec{H}}^{\star }={\varvec{h}}^{\star }\in \mathbb {C}^{K}\), \({\varvec{X}}^{\star }={\varvec{x}}^{\star }\in \mathbb {C}^{K}\), \({\varvec{\phi }}_{j}={\varvec{a}}_{j},\) and \({\varvec{\psi }}_{j}={\varvec{b}}_{j}\).
For this setting, the empirical loss function is given by
$$\begin{aligned} f({\varvec{Z}})=\frac{1}{m}\sum _{j=1}^{m}\Big | {\varvec{\psi }}_{j}^{\textsf {H} }{\varvec{H}}{\varvec{X}}^{\textsf {H} }{\varvec{\phi }}_{j}-y_{j}\Big |^{2},\end{aligned}$$
where we denote \({\varvec{Z}}=({\varvec{H}},{\varvec{X}})\). To minimize \(f({\varvec{Z}})\), we proceed with vanilla gradient descent
$$\begin{aligned} {\varvec{Z}}^{t+1}={\varvec{Z}}^{t}-\eta \nabla f\big ({\varvec{Z}}^{t}\big ),\qquad t=0,1,\cdots \end{aligned}$$
following a standard spectral initialization, where \(\eta \) is the step size. As a remark, for complex-valued problems, the gradient (resp. Hessian) should be understood as the Wirtinger gradient (resp. Hessian).
It is clear from (39) that \({\varvec{Z}}^{\star } = ({\varvec{H}}^{\star },{\varvec{X}}^{\star })\) can only be recovered up to certain global ambiguity. For clarity of presentation, we assume in this section that such ambiguity has already been taken care of via proper global transformation.
5.2 Outline of the Recipe
We are now positioned to outline the general recipe, which entails the following steps.
Step 1: characterizing local geometry in the RIC Our first step is to characterize a region \(\mathcal {R}\)—which we term as the region of incoherence and contraction (RIC)—such that the Hessian matrix \(\nabla ^{2}f({\varvec{Z}})\) obeys strong convexity and smoothness,
$$\begin{aligned} {\varvec{0}} \,\prec \, \alpha {\varvec{I}} \,\preceq \, \nabla ^{2}f({\varvec{Z}}) \,\preceq \, \beta {\varvec{I}},\qquad \forall {\varvec{Z}}\in \mathcal {R}, \end{aligned}$$(40) or at least along certain directions (i.e., restricted strong convexity and smoothness), where \(\beta /\alpha \) scales slowly (or even remains bounded) with the problem size. As revealed by optimization theory, this geometric property (40) immediately implies linear convergence with the contraction rate \(1- O(\alpha /\beta )\) for a properly chosen step size \(\eta \), as long as all iterates stay within the RIC.
A natural question then arises: What does the RIC \(\mathcal {R}\) look like? As it turns out, the RIC typically contains all points such that the \(\ell _2\) error \(\Vert {\varvec{Z}} - {\varvec{Z}}^{\star }\Vert _{\mathrm {F}}\) is not too large and
$$\begin{aligned}&(\mathbf{incoherence }) \qquad \max _{j} {\big \Vert {\varvec{\phi }}_{j}^{\textsf {H} }({\varvec{X}}-{\varvec{X}}^{\star })\big \Vert _2}~~\text {and}~~ \max _{j} {\big \Vert {\varvec{\psi }}_{j}^{\textsf {H} }({\varvec{H}}-{\varvec{H}}^{\star })\big \Vert _2}\nonumber \\ {}&\quad \text { are well controlled}. \end{aligned}$$(41) In the three examples, the above incoherence condition translates to the following (a small code sketch evaluating these metrics is given right after the list):
Phase retrieval: \(\max _{j} {\big |{\varvec{a}}_{j}^{\top }({\varvec{x}}-{\varvec{x}}^{\star })\big |}\) is well controlled;
Matrix completion: \(\big \Vert {\varvec{X}}- {\varvec{X}}^{\star }\big \Vert _{2,\infty }\) is well controlled;
Blind deconvolution: \(\max _{j} {\big |{\varvec{a}}_{j}^{\top }({\varvec{x}}-{\varvec{x}}^{\star })\big |}\) and \(\max _{j} {\big |{\varvec{b}}_{j}^{\top }({\varvec{h}}-{\varvec{h}}^{\star })\big |}\) are well controlled.
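To make these metrics concrete, here is a minimal numpy sketch (our own illustration; the helper names are ours, not the paper's) that evaluates each of the three incoherence measures for candidate iterates:

```python
import numpy as np

def pr_incoherence(A, x, x_star):
    """Phase retrieval: max_j |a_j^T (x - x_star)|, where the rows of A are the a_j's."""
    return np.max(np.abs(A @ (x - x_star)))

def mc_incoherence(X, X_star):
    """Matrix completion: ||X - X_star||_{2,infty}, the largest row-wise l2 norm."""
    return np.max(np.linalg.norm(X - X_star, axis=1))

def bd_incoherence(A, B, h, x, h_star, x_star):
    """Blind deconvolution: incoherence w.r.t. both families of design vectors."""
    return (np.max(np.abs(A.conj() @ (x - x_star))),   # max_j |a_j^H (x - x_star)|
            np.max(np.abs(B.conj() @ (h - h_star))))   # max_j |b_j^H (h - h_star)|
```

For a Gaussian design, the first quantity is typically on the order of \(\sqrt{\log n}\,\Vert {\varvec{x}}-{\varvec{x}}^{\star }\Vert _2\) whenever \({\varvec{x}}-{\varvec{x}}^{\star }\) is statistically independent of the \({\varvec{a}}_{j}\)'s, which is precisely the scaling the RIC asks for.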
Step 2: introducing the leave-one-out sequences To justify that no iterates leave the RIC, we rely on the construction of auxiliary sequences. Specifically, for each l, produce an auxiliary sequence \(\{{\varvec{Z}}^{t,(l)}=({\varvec{X}}^{t,(l)},{\varvec{H}}^{t,(l)})\}\) such that \({\varvec{X}}^{t,(l)}\) (resp. \({\varvec{H}}^{t,(l)}\)) is independent of any sample involving \({\varvec{\phi }}_l\) (resp. \({\varvec{\psi }}_l\)). As an example, suppose that the \({\varvec{\phi }}_l\)’s and the \({\varvec{\psi }}_l\)’s are independently and randomly generated. Then for each l, one can consider a leave-one-out loss function
$$\begin{aligned} f^{(l)}({\varvec{Z}}) : = \frac{1}{m}\sum _{j: j\ne l}\Big | {\varvec{\psi }}_{j}^{\textsf {H} } {\varvec{H}}{\varvec{X}}^{\textsf {H} } {\varvec{\phi }}_{j}-y_{j}\Big |^{2} \end{aligned}$$ that discards the \(l\)th sample. One further generates \(\{{\varvec{Z}}^{t,(l)}\}\) by running vanilla gradient descent w.r.t. this auxiliary loss function, with a spectral initialization that similarly discards the \(l\)th sample. Note that this procedure is only introduced to facilitate analysis and is never implemented in practice.
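As a toy illustration (ours, and not part of the formal analysis), the following sketch instantiates this construction for a real-valued, rank-one instance of the model above: it runs vanilla gradient descent on the full loss and on \(f^{(l)}\), starting both runs from a common point near the truth (rather than from two separate spectral initializations, to keep the sketch short), and reports the gap between the two trajectories.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, eta, l = 50, 2000, 0.1, 0

h_star = rng.standard_normal(n); h_star /= np.linalg.norm(h_star)
x_star = rng.standard_normal(n); x_star /= np.linalg.norm(x_star)
Psi, Phi = rng.standard_normal((m, n)), rng.standard_normal((m, n))
y = (Psi @ h_star) * (Phi @ x_star)          # y_j = (psi_j^T h*)(phi_j^T x*)

def grad(h, x, mask):
    """Gradient of (1/m) sum_{j: mask_j = 1} ((psi_j^T h)(phi_j^T x) - y_j)^2."""
    r = mask * ((Psi @ h) * (Phi @ x) - y)   # residuals, zeroed on dropped samples
    return (2 / m) * Psi.T @ (r * (Phi @ x)), (2 / m) * Phi.T @ (r * (Psi @ h))

keep_all = np.ones(m)
drop_l = np.ones(m); drop_l[l] = 0.0         # defines the leave-one-out loss f^(l)

h = h_star + 0.1 * rng.standard_normal(n) / np.sqrt(n)
x = x_star + 0.1 * rng.standard_normal(n) / np.sqrt(n)
hl, xl = h.copy(), x.copy()
for _ in range(200):
    gh, gx = grad(h, x, keep_all);  h, x = h - eta * gh, x - eta * gx
    gh, gx = grad(hl, xl, drop_l);  hl, xl = hl - eta * gh, xl - eta * gx

gap = np.hypot(np.linalg.norm(h - hl), np.linalg.norm(x - xl))
print(f"gap between the two trajectories after 200 iterations: {gap:.2e}")  # stays small
```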
Step 3: establishing the incoherence condition We are now ready to establish the incoherence condition with the assistance of the auxiliary sequences. Usually, the proof proceeds by induction, where our goal is to show that the next iterate remains within the RIC, given that the current one does.
Step 3(a): proximity between the original and the leave-one-out iterates As one can anticipate, \(\{{\varvec{Z}}^{t}\}\) and \(\{{\varvec{Z}}^{t,(l)}\}\) remain “glued” to each other along the whole trajectory, since their constructions differ by only a single sample. In fact, as long as the initial estimates stay sufficiently close, their gaps will never explode. To intuitively see why, use the fact \(\nabla f({\varvec{Z}}^{t})\approx \nabla f^{(l)}({\varvec{Z}}^{t})\) to discover that
$$\begin{aligned} {\varvec{Z}}^{t+1}-{\varvec{Z}}^{t+1,(l)}&={\varvec{Z}}^{t}-\eta \nabla f({\varvec{Z}}^{t})-\big ({\varvec{Z}}^{t,(l)}-\eta \nabla f^{(l)}\big ({\varvec{Z}}^{t,(l)}\big )\big )\\&\approx {\varvec{Z}}^{t}-{\varvec{Z}}^{t,(l)}-\eta \nabla ^{2}f({\varvec{Z}}^{t})\big ({\varvec{Z}}^{t}-{\varvec{Z}}^{t,(l)}\big ), \end{aligned}$$which together with the strong convexity condition implies \(\ell _{2}\) contraction
$$\begin{aligned} \big \Vert {\varvec{Z}}^{t+1}-{\varvec{Z}}^{t+1,(l)} \big \Vert _{\mathrm {F}} \approx \Big \Vert \big ({\varvec{I}}-\eta \nabla ^{2}f({\varvec{Z}}^{t})\big )\big ({\varvec{Z}}^{t}-{\varvec{Z}}^{t,(l)}\big )\Big \Vert _{\mathrm {F}}\le \big \Vert {\varvec{Z}}^{t}-{\varvec{Z}}^{t,(l)} \big \Vert _{\mathrm {F}}, \end{aligned}$$ where the last inequality holds as long as \({\varvec{0}}\preceq \eta \nabla ^{2}f({\varvec{Z}}^{t})\preceq {\varvec{I}}\). Indeed, (restricted) strong convexity is crucial in controlling the size of leave-one-out perturbations.
Step 3(b): incoherence condition of the leave-one-out iterates The fact that \({\varvec{Z}}^{t+1}\) and \({\varvec{Z}}^{t+1,(l)}\) are exceedingly close motivates us to control the incoherence of \({\varvec{Z}}^{t+1,(l)} - {\varvec{Z}}^{\star }\) instead, for \(1\le l\le m\). By construction, \({\varvec{X}}^{t+1,(l)}\) (resp. \({\varvec{H}}^{t+1,(l)}\)) is statistically independent of any sample involving the design vector \({\varvec{\phi }}_{l}\) (resp. \({\varvec{\psi }}_ l\)), a fact that typically leads to a more friendly analysis for controlling \(\left\| {\varvec{\phi }}_{l}^{\textsf {H} }\big ({\varvec{X}}^{t+1,(l)}-{\varvec{X}}^{\star }\big ) \right\| _2\) and \(\left\| {\varvec{\psi }}_{l}^{\textsf {H} }\big ({\varvec{H}}^{t+1,(l)}-{\varvec{H}}^{\star }\big ) \right\| _2\).
Step 3(c): combining the bounds With these results in place, apply the triangle inequality to obtain
$$\begin{aligned} \big \Vert {\varvec{\phi }}_{l}^{\textsf {H} }\big ({\varvec{X}}^{t+1}-{\varvec{X}}^{\star }\big )\big \Vert _2&\le \big \Vert {\varvec{\phi }}_{l}\big \Vert _2 \big \Vert {\varvec{X}}^{t+1}-{\varvec{X}}^{t+1,(l)} \big \Vert _{\mathrm {F}} + \big \Vert {\varvec{\phi }}_{l}^{\textsf {H} }\big ({\varvec{X}}^{t+1,(l)}-{\varvec{X}}^{\star }\big )\big \Vert _2, \end{aligned}$$ where the first term is controlled in Step 3(a) and the second term is controlled in Step 3(b). The term \(\big \Vert {\varvec{\psi }}_{l}^{\textsf {H} }\big ({\varvec{H}}^{t+1}-{\varvec{H}}^{\star }\big )\big \Vert _2\) can be bounded similarly. Choosing the bounds properly then establishes the incoherence condition for all \(1\le l\le m\), as desired.
6 Analysis for Phase Retrieval
In this section, we instantiate the general recipe presented in Sect. 5 to phase retrieval and prove Theorem 1. Similar to Section 7.1 of [18], in the analysis we use \(\eta _t = c_1/( \log n \cdot \Vert {\varvec{x}}^{\star } \Vert _2^2 )\) instead of \(c_1/( \log n \cdot \Vert {\varvec{x}}^{0} \Vert _2^2 )\) as the step size. This is because, with high probability, \(\Vert {\varvec{x}}^{0} \Vert _2\) and \(\Vert {\varvec{x}}^{\star } \Vert _2\) are rather close in the relative sense. Without loss of generality, we assume throughout this section that \(\big \Vert {\varvec{x}}^{\star }\big \Vert _{2}=1\) and
$$\begin{aligned} \big \Vert {\varvec{x}}^{0}-{\varvec{x}}^{\star }\big \Vert _{2}\le \big \Vert {\varvec{x}}^{0}+{\varvec{x}}^{\star }\big \Vert _{2}, \end{aligned}$$(42) i.e., the global sign is chosen so that \({\varvec{x}}^{0}\) is closer to \({\varvec{x}}^{\star }\) than to \(-{\varvec{x}}^{\star }\).
In addition, the gradient and the Hessian of \(f(\cdot )\) for this problem (see (15)) are given, respectively, by
$$\begin{aligned} \nabla f({\varvec{x}})&=\frac{1}{m}\sum _{j=1}^{m}\Big [\big ({\varvec{a}}_{j}^{\top }{\varvec{x}}\big )^{2}-y_{j}\Big ]\big ({\varvec{a}}_{j}^{\top }{\varvec{x}}\big ){\varvec{a}}_{j},\\ \nabla ^{2}f({\varvec{x}})&=\frac{1}{m}\sum _{j=1}^{m}\Big [3\big ({\varvec{a}}_{j}^{\top }{\varvec{x}}\big )^{2}-y_{j}\Big ]{\varvec{a}}_{j}{\varvec{a}}_{j}^{\top }, \end{aligned}$$
which are useful throughout the proof.
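For concreteness, here is a short numpy sketch of these two expressions, together with a finite-difference sanity check of the gradient (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 30, 600
A = rng.standard_normal((m, n))                  # rows are the a_j's
x_star = rng.standard_normal(n); x_star /= np.linalg.norm(x_star)
y = (A @ x_star) ** 2

def loss(x):
    return np.sum(((A @ x) ** 2 - y) ** 2) / (4 * m)

def grad(x):
    Ax = A @ x
    return A.T @ ((Ax ** 2 - y) * Ax) / m        # (1/m) sum_j [(a_j^T x)^2 - y_j](a_j^T x) a_j

def hessian(x):
    Ax = A @ x
    return (A.T * (3 * Ax ** 2 - y)) @ A / m     # (1/m) sum_j [3(a_j^T x)^2 - y_j] a_j a_j^T

# central-difference check of the gradient along a random direction
x, v, eps = rng.standard_normal(n), rng.standard_normal(n), 1e-6
fd = (loss(x + eps * v) - loss(x - eps * v)) / (2 * eps)
print(abs(fd - grad(x) @ v))                     # tiny, confirming the formulas
```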
6.1 Step 1: Characterizing Local Geometry in the RIC
6.1.1 Local Geometry
We start by characterizing the region that enjoys both strong convexity and the desired level of smoothness. This is supplied in the following lemma, which plays a crucial role in the subsequent analysis.
Lemma 1
(Restricted strong convexity and smoothness for phase retrieval) Fix any sufficiently small constant \(C_{1}>0\) and any sufficiently large constant \(C_{2}>0\), and suppose the sample complexity obeys \(m\ge c_{0}n\log n\) for some sufficiently large constant \(c_{0}>0\). With probability at least \(1-O(mn^{-10})\),
holds simultaneously for all \({\varvec{x}}\in \mathbb {R}^{n}\) satisfying \(\left\| {\varvec{x}}-{\varvec{x}}^{\star }\right\| _{2}\le 2C_{1}\), and
holds simultaneously for all \({\varvec{x}}\in \mathbb {R}^{n}\) obeying
Proof
See Appendix A.1. \(\square \)
In words, Lemma 1 reveals that the Hessian matrix is positive definite and (almost) well conditioned, if one restricts attention to the set of points that are (i) not far away from the truth (cf. (45a)) and (ii) incoherent with respect to the measurement vectors \(\left\{ {\varvec{a}}_{j}\right\} _{1\le j\le m}\) (cf. (45b)).
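As a quick numerical illustration of this picture (not a substitute for the proof), one can sample a point that is close to, and with high probability incoherent with, the truth, and inspect the spectrum of the Hessian:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 100, 2000                                 # m on the order of n log n
A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n); x_star /= np.linalg.norm(x_star)
y = (A @ x_star) ** 2

def hessian(x):
    Ax = A @ x
    return (A.T * (3 * Ax ** 2 - y)) @ A / m

x = x_star + 0.05 * rng.standard_normal(n) / np.sqrt(n)   # close to, and w.h.p. incoherent with, x_star
eigs = np.linalg.eigvalsh(hessian(x))
print(f"lambda_min = {eigs.min():.2f}, lambda_max = {eigs.max():.2f}")
# lambda_min stays bounded away from 0 while lambda_max remains modest,
# consistent with the (almost) well-conditioned behavior asserted by Lemma 1
```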
6.1.2 Error Contraction
As pointed out before, the nice local geometry enables \(\ell _{2}\) contraction, which we formalize below.
Lemma 2
There exists an event that does not depend on t and has probability \(1-O(mn^{-10})\), such that when it happens and \({\varvec{x}}^t\) obeys conditions (45), one has
provided that the step size satisfies \(0<\eta \le 1/ \left[ 5C_{2}\left( 10+C_{2}\right) \log n\right] \).
Proof
This proof applies the standard argument when establishing the \(\ell _{2}\) error contraction of gradient descent for strongly convex and smooth functions. See Appendix A.2. \(\square \)
With the help of Lemma 2, the proof of Theorem 1 reduces to ensuring that the trajectory \(\left\{ {\varvec{x}}^t\right\} _{0\le t \le n}\) lies in the RIC specified by (47).Footnote 8 This is formally stated in the next lemma.
Lemma 3
Suppose for all \(0\le t\le T_{0}:=n\), the trajectory \(\left\{ {\varvec{x}}^t\right\} \) falls within the region of incoherence and contraction (termed the RIC), namely
then the claims in Theorem 1 hold true. Here and throughout this section, \(C_{1},C_{2}>0\) are two absolute constants as specified in Lemma 1.
Proof
See Appendix A.3. \(\square \)
6.2 Step 2: Introducing the Leave-One-Out Sequences
In comparison with the \(\ell _{2}\) error bound (47a) that captures the overall loss, the incoherence hypothesis (47b)—which concerns sample-wise control of the empirical risk—is more complicated to establish. This is partly due to the statistical dependence between \({\varvec{x}}^{t}\) and the sampling vectors \(\{{\varvec{a}}_{l}\}\). As described in the general recipe, the key idea is the introduction of a leave-one-out version of the WF iterates, which removes a single measurement from consideration.
To be precise, for each \(1\le l\le m\), we define the leave-one-out empirical loss function as
$$\begin{aligned} f^{(l)}({\varvec{x}}):=\frac{1}{4m}\sum _{j:j\ne l}\Big [\big ({\varvec{a}}_{j}^{\top }{\varvec{x}}\big )^{2}-y_{j}\Big ]^{2},\end{aligned}$$
and the auxiliary trajectory \(\left\{ {\varvec{x}}^{t,(l)}\right\} _{t\ge 0}\) is constructed by running WF w.r.t. \(f^{(l)}({\varvec{x}})\). In addition, the spectral initialization \({\varvec{x}}^{0,\left( l\right) }\) is computed based on the rescaled leading eigenvector of the leave-one-out data matrix
$$\begin{aligned} {\varvec{Y}}^{(l)}:=\frac{1}{m}\sum _{j:j\ne l}y_{j}{\varvec{a}}_{j}{\varvec{a}}_{j}^{\top }. \end{aligned}$$
Clearly, the entire sequence \(\left\{ {\varvec{x}}^{t,(l)}\right\} _{t\ge 0}\) is independent of the lth sampling vector \({\varvec{a}}_{l}\). This auxiliary procedure is formally described in Algorithm 4.
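In code, the leave-one-out data matrix and its spectral initialization might look as follows (a minimal sketch: the scaling \(\sqrt{\lambda _{1}({\varvec{Y}})/3}\) is one natural choice given that \(\mathbb {E}[{\varvec{Y}}]=\Vert {\varvec{x}}^{\star }\Vert _{2}^{2}{\varvec{I}}+2{\varvec{x}}^{\star }{\varvec{x}}^{\star \top }\) under the Gaussian design; inessential details may differ from Algorithm 4):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 100, 2000
A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n); x_star /= np.linalg.norm(x_star)
y = (A @ x_star) ** 2

def spectral_init(keep):
    """Scaled leading eigenvector of Y = (1/m) sum_{j in keep} y_j a_j a_j^T."""
    Y = (A[keep].T * y[keep]) @ A[keep] / m
    vals, vecs = np.linalg.eigh(Y)
    return np.sqrt(vals[-1] / 3) * vecs[:, -1]   # E[Y] = I + 2 x* x*^T here, so lambda_1 ~ 3

x0 = spectral_init(np.arange(m))                 # initialization from all samples
l = 7
x0_l = spectral_init(np.delete(np.arange(m), l)) # leave-one-out initialization
x0_l *= np.sign(x0 @ x0_l)                       # resolve the global sign
print(np.linalg.norm(x0 - x0_l))                 # the two initializations are exceedingly close
```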
6.3 Step 3: Establishing the Incoherence Condition by Induction
As revealed by Lemma 3, it suffices to prove that the iterates \(\{{\varvec{x}}^t\}_{0\le t \le T_{0}}\) satisfy (47) with high probability. Our proof will be inductive in nature. For the sake of clarity, we list all the induction hypotheses:
Here \(C_{3} >0\) is some universal constant. For any \(t\ge 0\), define \(\mathcal {E}_{t}\) to be the event where the conditions in (51) hold for the tth iteration. According to Lemma 2, there exists some event \(\mathcal {E}\) with probability \(1-O(mn^{-10})\) such that on \(\mathcal {E}_t \cap \mathcal {E}\) one has
This subsection is devoted to establishing (51b) and (51c) for the \((t+1)\)th iteration, assuming that (51) holds true up to the tth iteration. We defer the justification of the base case (i.e., initialization at \(t=0\)) to Sect. 6.4.
Step 3(a): proximity between the original and the leave-one-out iterates The leave-one-out sequence \(\{{\varvec{x}}^{t,(l)}\}\) behaves similarly to the true WF iterates \(\{{\varvec{x}}^{t}\}\) while maintaining statistical independence from \({\varvec{a}}_{l}\), a key fact that allows us to control the incoherence of the \(l\)th leave-one-out sequence w.r.t. \({\varvec{a}}_{l}\). We will formally quantify the gap between \({\varvec{x}}^{t+1}\) and \({\varvec{x}}^{t+1,(l)}\) in the following lemma, which establishes the induction in (51b).
Lemma 4
Suppose that the sample size obeys \(m\ge Cn\log n\) for some sufficiently large constant \(C>0\) and that the step size obeys \(0<\eta <1/[5C_{2}(10+C_{2})\log n]\). Then on some event \(\mathcal {E}_{t+1,1}\subseteq \mathcal {E}_{t}\) obeying \(\mathbb {P}(\mathcal {E}_{t}\cap \mathcal {E}_{t+1,1}^{c} ) = O( mn^{-10})\), one has
Proof
The proof relies heavily on the restricted strong convexity (see Lemma 1) and is deferred to Appendix A.4. \(\square \)
Step 3(b): incoherence of the leave-one-out iterates By construction, \({\varvec{x}}^{t+1,(l)}\) is statistically independent of the sampling vector \({\varvec{a}}_{l}\). One can thus invoke the standard Gaussian concentration results and the union bound to derive that on an event \(\mathcal {E}_{t+1,2} \subseteq \mathcal {E}_t\) obeying \(\mathbb {P}(\mathcal {E}_{t}\cap \mathcal {E}_{t+1,2}^{c} ) = O( mn^{-10})\),
$$\begin{aligned} \max _{1\le l\le m}\left| {\varvec{a}}_{l}^{\top }\big ({\varvec{x}}^{t+1,\left( l\right) }-{\varvec{x}}^{\star }\big )\right|&\le 5\sqrt{\log n}\big \Vert {\varvec{x}}^{t+1,\left( l\right) }-{\varvec{x}}^{\star }\big \Vert _{2}\nonumber \\&\overset{(\text {i})}{\le }5\sqrt{\log n}\left( \big \Vert {\varvec{x}}^{t+1,\left( l\right) }-{\varvec{x}}^{t+1}\big \Vert _{2}+\left\| {\varvec{x}}^{t+1}-{\varvec{x}}^{\star }\right\| _{2}\right) \nonumber \\&\overset{(\mathrm {ii})}{\le }5\sqrt{\log n}\left( C_{3}\sqrt{\frac{\log n}{n}}+C_{1}\right) \nonumber \\&\le C_{4}\sqrt{\log n} \end{aligned}$$(54) holds for some constant \(C_{4}\ge 6 C_{1}>0\) and n sufficiently large. Here, (i) comes from the triangle inequality and (ii) arises from the proximity bound (53) and the condition (52).
Step 3(c): combining the bounds We are now prepared to establish (51c) for the (\(t+1\))th iteration. Specifically,
$$\begin{aligned} \max _{1\le l\le m}\left| {\varvec{a}}_{l}^{\top }\left( {\varvec{x}}^{t+1}-{\varvec{x}}^{\star }\right) \right|&\le \max _{1\le l\le m}\left| {\varvec{a}}_{l}^{\top }\big ({\varvec{x}}^{t+1}-{\varvec{x}}^{t+1,\left( l\right) }\big ) \right| +\max _{1\le l\le m}\left| {\varvec{a}}_{l}^{\top }\big ({\varvec{x}}^{t+1,\left( l\right) }-{\varvec{x}}^{\star }\big ) \right| \nonumber \\&\overset{\left( \text {i}\right) }{\le }\max _{1\le l\le m}\Vert {\varvec{a}}_{l}\Vert _{2}\big \Vert {\varvec{x}}^{t+1}-{\varvec{x}}^{t+1,\left( l\right) } \big \Vert _{2}+C_{4}\sqrt{\log n}\nonumber \\&\overset{\left( \text {ii}\right) }{\le }\sqrt{6n}\cdot C_{3}\sqrt{\frac{\log n}{n}}+C_{4}\sqrt{\log n}\le C_{2}\sqrt{\log n}, \end{aligned}$$(55) where (i) follows from the Cauchy–Schwarz inequality and (54), the inequality (ii) is a consequence of (53) and (98), and the last inequality holds as long as \(C_{2}/(C_{3}+C_{4})\) is sufficiently large. From the deduction above we easily get \(\mathbb {P}(\mathcal {E}_{t}\cap \mathcal {E}_{t+1}^{c} ) = O( mn^{-10}) \).
Using mathematical induction and the union bound, we establish (51) for all \(t\le T_{0}=n\) with high probability. This in turn concludes the proof of Theorem 1, as long as the hypotheses are valid for the base case.
6.4 The Base Case: Spectral Initialization
In the end, we return to verify the induction hypotheses for the base case (\(t=0\)), i.e., the spectral initialization obeys (51). The following lemma justifies (51a) by choosing \(\delta \) sufficiently small.
Lemma 5
Fix any small constant \(\delta >0\), and suppose \(m>c_{0}n\log n\) for some large constant \(c_{0}>0\). Consider the two vectors \({\varvec{x}}^{0}\) and \(\widetilde{{\varvec{x}}}^{0}\) as defined in Algorithm 1, and suppose without loss of generality that (42) holds. Then with probability exceeding \(1-O(n^{-10})\), one has
Proof
This result follows directly from the Davis–Kahan sin\(\Theta \) theorem. See Appendix A.5. \(\square \)
We then move on to justifying (51b), the proximity between the original and leave-one-out iterates for \(t=0\).
Lemma 6
Suppose \(m>c_{0}n\log n\) for some large constant \(c_{0}>0\). Then with probability at least \({1-O(mn^{-10})}\), one has
Proof
This is also a consequence of the Davis–Kahan sin\(\Theta \) theorem. See Appendix A.6. \(\square \)
The final claim (51c) can be proved using the same argument as in deriving (55) and hence is omitted.
7 Analysis for Matrix Completion
In this section, we instantiate the general recipe presented in Sect. 5 to matrix completion and prove Theorem 2. Before continuing, we first gather a few useful facts regarding the loss function in (23), whose gradient is given by
$$\begin{aligned} \nabla f({\varvec{X}})=\frac{1}{p}\mathcal {P}_{\Omega }\big ({\varvec{X}}{\varvec{X}}^{\top }-{\varvec{M}}\big ){\varvec{X}}. \end{aligned}$$
We define the expected gradient (with respect to the sampling set \(\Omega \)) to be
$$\begin{aligned} \nabla F({\varvec{X}}):=\big ({\varvec{X}}{\varvec{X}}^{\top }-{\varvec{M}}\big ){\varvec{X}},\end{aligned}$$
and also the (expected) gradient without noise to be
$$\begin{aligned} \nabla f_{\mathrm {clean}}({\varvec{X}}):=\big ({\varvec{X}}{\varvec{X}}^{\top }-{\varvec{M}}^{\star }\big ){\varvec{X}}. \end{aligned}$$
In addition, we need the Hessian \(\nabla ^{2}f_{\mathrm {clean}}({\varvec{X}})\), which is represented by an \(nr\times nr\) matrix. Simple calculations reveal that for any \({\varvec{V}}\in \mathbb {R}^{n\times r}\),
$$\begin{aligned} \mathrm {vec}({\varvec{V}})^{\top }\,\nabla ^{2}f_{\mathrm {clean}}({\varvec{X}})\,\mathrm {vec}({\varvec{V}})=\frac{1}{2}\big \Vert {\varvec{V}}{\varvec{X}}^{\top }+{\varvec{X}}{\varvec{V}}^{\top }\big \Vert _{\mathrm {F}}^{2}+\big \langle {\varvec{X}}{\varvec{X}}^{\top }-{\varvec{M}}^{\star },{\varvec{V}}{\varvec{V}}^{\top }\big \rangle ,\end{aligned}$$
where \(\mathrm {vec}({\varvec{V}})\in \mathbb {R}^{nr}\) denotes the vectorization of \({\varvec{V}}\).
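To make these expressions concrete, here is a small numpy sketch (our own notation, with noiseless observations for simplicity) that evaluates \(\nabla f({\varvec{X}})\) and the quadratic form above, checking the latter against a finite-difference second derivative of \(f_{\mathrm {clean}}({\varvec{X}})=\frac{1}{4}\Vert {\varvec{X}}{\varvec{X}}^{\top }-{\varvec{M}}^{\star }\Vert _{\mathrm {F}}^{2}\):

```python
import numpy as np

rng = np.random.default_rng(5)
n, r, p = 60, 3, 0.5
X_star = rng.standard_normal((n, r))
M_star = X_star @ X_star.T                       # PSD rank-r ground truth
Omega = rng.random((n, n)) < p
Omega = np.triu(Omega) | np.triu(Omega).T        # symmetric sampling pattern

def grad_f(X):
    """(1/p) P_Omega(X X^T - M) X, with M = M_star on Omega (noiseless)."""
    return np.where(Omega, X @ X.T - M_star, 0.0) @ X / p

def f_clean(X):
    return 0.25 * np.linalg.norm(X @ X.T - M_star, "fro") ** 2

def hess_quad(X, V):
    """vec(V)^T  Hess f_clean(X)  vec(V), per the displayed identity."""
    return (0.5 * np.linalg.norm(V @ X.T + X @ V.T, "fro") ** 2
            + np.sum((X @ X.T - M_star) * (V @ V.T)))

X, V, eps = rng.standard_normal((n, r)), rng.standard_normal((n, r)), 1e-4
fd = (f_clean(X + eps * V) - 2 * f_clean(X) + f_clean(X - eps * V)) / eps ** 2
print(abs(fd - hess_quad(X, V)) / abs(hess_quad(X, V)))   # tiny relative error
```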
7.1 Step 1: Characterizing Local Geometry in the RIC
7.1.1 Local Geometry
The first step is to characterize the region where the empirical loss function enjoys restricted strong convexity and smoothness in an appropriate sense. This is formally stated in the following lemma.
Lemma 7
(Restricted strong convexity and smoothness for matrix completion) Suppose that the sample size obeys \(n^{2}p\ge C\kappa ^{2}\mu rn\log n\) for some sufficiently large constant \(C>0\). Then with probability at least \(1-O\left( n^{-10}\right) \), the Hessian \(\nabla ^{2}f_{\mathrm {clean}}({\varvec{X}})\) as defined in (61) obeys
for all \({\varvec{X}}\) and \({\varvec{V}} = {\varvec{Y}}{\varvec{H}}_{Y}-{\varvec{Z}}\), with \({\varvec{H}}_{Y}:=\arg \min _{{\varvec{R}}\in \mathcal {O}^{r\times r}}\left\| {\varvec{Y}}{\varvec{R}}-{\varvec{Z}}\right\| _{\mathrm {F}}\), satisfying:
where \(\epsilon \ll 1/\sqrt{\kappa ^{3}\mu r\log ^{2}n}\) and \(\delta \ll 1 / \kappa \).
Proof
See Appendix B.1. \(\square \)
Lemma 7 reveals that the Hessian matrix is well conditioned in a neighborhood of \({\varvec{X}}^{\star }\) consisting of points that remain incoherent as measured in the \(\ell _2/\ell _\infty \) norm (cf. (63a)), along directions pointing toward points that are not far from the truth in the spectral norm (cf. (63b)).
Remark 5
The second condition (63b) is characterized using the spectral norm \(\Vert \cdot \Vert \), while in previous works this is typically presented in the Frobenius norm \(\Vert \cdot \Vert _{\mathrm {F}}\). It is also worth noting that the Hessian matrix—even in the infinite-sample and noiseless case—is rank-deficient and cannot be positive definite. As a result, we resort to the form of strong convexity by restricting attention to certain directions (see the conditions on \({\varvec{V}}\)).
7.1.2 Error Contraction
Our goal is to demonstrate the error bounds (28) measured in three different norms. Notably, as long as the iterates satisfy (28) at the tth iteration, \(\Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\Vert _{2,\infty }\) is sufficiently small. Under our sample complexity assumption, \({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\) then satisfies the \(\ell _{2}/\ell _{\infty }\) condition (63a) required in Lemma 7. Consequently, we can invoke Lemma 7 to arrive at the following error contraction result.
Lemma 8
(Contraction w.r.t. the Frobenius norm) Suppose that \(n^{2}p\ge C\kappa ^{3}\mu ^{3}r^{3}n\log ^{3}n\) for some sufficiently large constant \(C>0\), and the noise satisfies (27). There exists an event that does not depend on t and has probability \(1-O(n^{-10})\), such that when it happens and (28a), (28b) hold for the tth iteration, one has
provided that \(0<\eta \le {2} / ({25\kappa \sigma _{\max }})\), \(1-\left( {\sigma _{\min }} / {4}\right) \cdot \eta \le \rho <1\), and \(C_{1}\) is sufficiently large.
Proof
The proof is built upon Lemma 7. See Appendix B.2. \(\square \)
Further, if the current iterate satisfies all three conditions in (28), then we can derive a stronger sense of error contraction, namely contraction in terms of the spectral norm.
Lemma 9
(Contraction w.r.t. the spectral norm) Suppose \(n^{2}p\ge C\kappa ^{3}\mu ^{3}r^{3}n\log ^{3}n\) for some sufficiently large constant \(C>0\), and the noise satisfies (27). There exists an event that does not depend on t and has probability \(1-O(n^{-10})\), such that when it happens and (28) holds for the tth iteration, one has
provided that \(0<\eta \le {1} / \left( {2\sigma _{\max }}\right) \) and \(1-\left( {\sigma _{\min }} / {3}\right) \cdot \eta \le \rho <1\).
Proof
The key observation is this: the iterate that proceeds according to the population-level gradient reduces the error w.r.t. \(\Vert \cdot \Vert \), namely
as long as \({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\) is sufficiently close to the truth. Notably, the orthonormal matrix \(\widehat{{\varvec{H}}}^{t}\) is still chosen to be the one that minimizes the \(\Vert \cdot \Vert _{\mathrm {F}}\) distance (as opposed to \(\Vert \cdot \Vert \)), which yields a symmetry property \({\varvec{X}}^{\star \top }{\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}=\big ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\big )^{\top }{\varvec{X}}^{\star }\), crucial for our analysis. See Appendix B.3 for details. \(\square \)
7.2 Step 2: Introducing the Leave-One-Out Sequences
In order to establish the incoherence properties (28b) for the entire trajectory, which is difficult to deal with directly due to the complicated statistical dependence, we introduce a collection of leave-one-out versions of \(\left\{ {\varvec{X}}^{t}\right\} _{t\ge 0}\), denoted by \(\left\{ {\varvec{X}}^{t,(l)}\right\} _{t\ge 0}\) for each \(1\le l\le n\). Specifically, \(\left\{ {\varvec{X}}^{t,(l)}\right\} _{t\ge 0}\) is the iterates of gradient descent operating on the auxiliary loss function
$$\begin{aligned} f^{(l)}({\varvec{X}}):=\frac{1}{4p}\Big \Vert \mathcal {P}_{\Omega ^{-l}}\big ({\varvec{X}}{\varvec{X}}^{\top }-{\varvec{M}}\big )\Big \Vert _{\mathrm {F}}^{2}+\frac{1}{4}\Big \Vert \mathcal {P}_{l}\big ({\varvec{X}}{\varvec{X}}^{\top }-{\varvec{M}}^{\star }\big )\Big \Vert _{\mathrm {F}}^{2}. \end{aligned}$$(65)
Here, \(\mathcal {P}_{\Omega _{l}}\) (resp. \(\mathcal {P}_{\Omega ^{-l}}\) and \(\mathcal {P}_{l}\)) represents the orthogonal projection onto the subspace of matrices which vanish outside of the index set \(\Omega _{l}:=\left\{ \left( i,j\right) \in \Omega \mid i=l\text { or }j=l\right\} \) (resp. \(\Omega ^{-l}:=\left\{ \left( i,j\right) \in \Omega \mid i\ne l,j\ne l\right\} \) and \(\left\{ \left( i,j\right) \mid i=l\text { or }j=l\right\} \)); that is, for any matrix \({\varvec{M}}\),
$$\begin{aligned} \big [\mathcal {P}_{\Omega ^{-l}}({\varvec{M}})\big ]_{i,j}={\left\{ \begin{array}{ll} M_{i,j}, &{} \text {if }(i,j)\in \Omega \text { and }i\ne l,\,j\ne l,\\ 0, &{} \text {else}, \end{array}\right. }\qquad \big [\mathcal {P}_{l}({\varvec{M}})\big ]_{i,j}={\left\{ \begin{array}{ll} M_{i,j}, &{} \text {if }i=l\text { or }j=l,\\ 0, &{} \text {else}, \end{array}\right. } \end{aligned}$$with \(\mathcal {P}_{\Omega _{l}}\) defined analogously with respect to \(\Omega _{l}\).
The gradient of the leave-one-out loss function (65) is given by
$$\begin{aligned} \nabla f^{(l)}({\varvec{X}})=\frac{1}{p}\mathcal {P}_{\Omega ^{-l}}\big ({\varvec{X}}{\varvec{X}}^{\top }-{\varvec{M}}\big ){\varvec{X}}+\mathcal {P}_{l}\big ({\varvec{X}}{\varvec{X}}^{\top }-{\varvec{M}}^{\star }\big ){\varvec{X}}. \end{aligned}$$
The full algorithm to obtain the leave-one-out sequence \(\{{\varvec{X}}^{t,(l)}\}_{t\ge 0}\) (including spectral initialization) is summarized in Algorithm 5.
Remark 6
Rather than simply dropping all samples in the lth row/column, we replace the lth row/column with their respective population means. In other words, the leave-one-out gradient forms an unbiased surrogate for the true gradient, which is particularly important in ensuring high estimation accuracy.
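A minimal numpy sketch of these projections and of the resulting leave-one-out gradient (our own notation, noiseless for simplicity):

```python
import numpy as np

rng = np.random.default_rng(7)
n, r, p, l = 50, 2, 0.4, 3
X_star = rng.standard_normal((n, r)); M_star = X_star @ X_star.T
Omega = rng.random((n, n)) < p
Omega = np.triu(Omega) | np.triu(Omega).T        # symmetric sampling pattern

on_l = np.zeros((n, n), dtype=bool)
on_l[l, :] = True; on_l[:, l] = True             # the mask behind P_l
mask_Omega_minus_l = Omega & ~on_l               # mask behind P_{Omega^{-l}}
mask_Omega_l = Omega & on_l                      # mask behind P_{Omega_l}

def grad_loo(X):
    """Leave-one-out gradient: sampled off the l-th row/column, population mean on it."""
    R = X @ X.T - M_star
    return (np.where(mask_Omega_minus_l, R, 0.0) / p + np.where(on_l, R, 0.0)) @ X

# averaging over the randomness of Omega gives E[grad_loo(X)] = (X X^T - M_star) X,
# the clean gradient: the unbiasedness highlighted in Remark 6
```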
7.3 Step 3: Establishing the Incoherence Condition by Induction
We will continue the proof of Theorem 2 in an inductive manner. As seen in Sect. 7.1.2, the induction hypotheses (28a) and (28c) hold for the \((t+1)\)th iteration as long as (28) holds at the tth iteration. Therefore, we are left with proving the incoherence hypothesis (28b) for all \(0\le t\le T=O(n^{5})\). For clarity of analysis, it is crucial to maintain a list of induction hypotheses, which includes a few more hypotheses that complement (28), and is given below.
hold for some absolute constants \(0<\rho <1\) and \(C_{1},\cdots ,C_{10}>0\). Here, \(\widehat{{\varvec{H}}}^{t,\left( l\right) }\) and \({\varvec{R}}^{t,(l)}\) are orthonormal matrices defined by
Clearly, the first three hypotheses (70a)–(70c) constitute the conclusion of Theorem 2, i.e., (28). The last two hypotheses (70d) and (70e) are auxiliary properties connecting the true iterates and the auxiliary leave-one-out sequences. Moreover, we summarize below several immediate consequences of (70), which will be useful throughout.
Lemma 10
Suppose \(n^{2}p \ge C\kappa ^{3}\mu ^{2}r^{2}n\log n\) for some sufficiently large constant \(C>0\), and the noise satisfies (27). Under hypotheses (70), one has
In particular, (73a) follows from hypotheses (70c) and (70d).
Proof
See Appendix B.4. \(\square \)
In the sequel, we follow the general recipe outlined in Sect. 5 to establish the induction hypotheses. We only need to establish (70b), (70d), and (70e) for the \((t+1)\)th iteration, since (70a) and (70c) are established in Sect. 7.1.2. Specifically, we resort to the leave-one-out iterates by showing that: first, the true and the auxiliary iterates remain exceedingly close throughout; second, the lth leave-one-out sequence stays incoherent with \({\varvec{e}}_{l}\) due to statistical independence.
Step 3(a): proximity between the original and the leave-one-out iterates We demonstrate that \({\varvec{X}}^{t+1}\) is well approximated by \({\varvec{X}}^{t+1,(l)}\), up to proper orthonormal transforms. This is precisely the induction hypothesis (70d) for the \((t+1)\)th iteration.
Lemma 11
Suppose the sample complexity satisfies \(n^{2}p\ge C \kappa ^{4}\mu ^{3}r^{3}n\log ^{3}n\) for some sufficiently large constant \(C>0\), and the noise satisfies (27). Let \(\mathcal {E}_t\) be the event where the hypotheses in (70) hold for the tth iteration. Then on some event \(\mathcal {E}_{t+1,1} \subseteq \mathcal {E}_t\) obeying \(\mathbb {P}( \mathcal {E}_t \cap \mathcal {E}_{t+1,1}^c ) = O(n^{-10})\), we have
provided that \(0<\eta \le {2} / {\left( 25\kappa \sigma _{\max }\right) }\), \(1-\left( {\sigma _{\min }} / {5}\right) \cdot \eta \le \rho <1\), and \(C_{7} > 0\) is sufficiently large.
Proof
The fact that this difference is well controlled relies heavily on the benign geometric property of the Hessian revealed by Lemma 7. Two important remarks are in order: (1) both points \({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\) and \({\varvec{X}}^{t,(l)}{\varvec{R}}^{t,\left( l\right) }\) satisfy (63a); (2) the difference \({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{t,(l)}{\varvec{R}}^{t,\left( l\right) }\) forms a valid direction for restricted strong convexity. These two properties together allow us to invoke Lemma 7. See Appendix B.5. \(\square \)
Step 3(b): incoherence of the leave-one-out iterates Given that \({\varvec{X}}^{t+1,(l)}\) is sufficiently close to \({\varvec{X}}^{t+1}\), we turn our attention to establishing the incoherence of this surrogate \({\varvec{X}}^{t+1,(l)}\) w.r.t. \({\varvec{e}}_{l}\). This amounts to proving the induction hypothesis (70e) for the \((t+1)\)th iteration.
Lemma 12
Suppose the sample complexity meets \(n^{2}p\ge C \kappa ^{3}\mu ^{3}r^{3}n\log ^{3}n\) for some sufficiently large constant \(C>0\), and the noise satisfies (27). Let \(\mathcal {E}_t\) be the event where the hypotheses in (70) hold for the tth iteration. Then on some event \(\mathcal {E}_{t+1,2} \subseteq \mathcal {E}_t\) obeying \(\mathbb {P}( \mathcal {E}_t \cap \mathcal {E}_{t+1,2}^c ) = O(n^{-10})\), we have
so long as \(0<\eta \le 1/{\sigma _{\max }}\), \(1-\left( {\sigma _{\min }} / {3}\right) \cdot \eta \le \rho <1\), \( C_{2} \gg \kappa C_{9} \), and \( C_{6} \gg \kappa C_{10} / \sqrt{\log n} \).
Proof
The key observation is that \({\varvec{X}}^{t+1,(l)}\) is statistically independent from any sample in the lth row/column of the matrix. Since there are an order of np samples in each row/column, we obtain enough information that helps establish the desired incoherence property. See Appendix B.6. \(\square \)
Step 3(c): combining the bounds Inequalities (70d) and (70e) taken collectively allow us to establish the induction hypothesis (70b). Specifically, for every \(1\le l\le n\), write
$$\begin{aligned} \big ({\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t+1}-{\varvec{X}}^{\star }\big )_{l,\cdot } =&\,\big ({\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t+1}-{\varvec{X}}^{t+1,(l)}\widehat{{\varvec{H}}}^{t+1, \left( l\right) }\big )_{l,\cdot }\\&+\big ({\varvec{X}}^{t+1,\left( l\right) }\widehat{{\varvec{H}}}^{t+1, \left( l\right) }-{\varvec{X}}^{\star }\big )_{l,\cdot }, \end{aligned}$$and the triangle inequality gives
$$\begin{aligned} \big \Vert \big ({\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t+1}-{\varvec{X}}^{\star }\big )_{l,\cdot }\big \Vert _{2}&\le \big \Vert {\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t+1}-{\varvec{X}}^{t+1,(l)} \widehat{{\varvec{H}}}^{t+1,\left( l\right) }\big \Vert _{\mathrm {F}}\nonumber \\&\quad +\big \Vert \big ({\varvec{X}}^{t+1, \left( l\right) }\widehat{{\varvec{H}}}^{t+1,\left( l\right) }-{\varvec{X}}^{\star }\big )_{l, \cdot }\big \Vert _{2}. \end{aligned}$$(76) The second term has already been bounded by (75). Since we have established the induction hypotheses (70c) and (70d) for the \((t+1)\)th iteration, the first term can be bounded by (73a) for the \((t+1)\)th iteration, i.e.,
$$\begin{aligned} \left\| {\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t+1}-{\varvec{X}}^{t+1,(l)}\widehat{{\varvec{H}}}^{t+1,\left( l\right) } \right\| _{\mathrm {F}} \le 5\kappa \left\| {\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t+1}-{\varvec{X}}^{t+1,(l)}{\varvec{R}}^{t+1,\left( l\right) } \right\| _{\mathrm {F}}. \end{aligned}$$Plugging the above inequality, (74), and (75) into (76), we have
$$\begin{aligned}&\left\| {\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t+1}-{\varvec{X}}^{\star }\right\| _{2,\infty }\\&\quad \le 5\kappa \left( C_{3}\rho ^{t+1}\mu r\sqrt{\frac{\log n}{np}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }+\frac{C_{7}}{\sigma _{\min }}\sigma \sqrt{\frac{n\log n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\right) \\&\qquad +C_{2}\rho ^{t+1}\mu r\frac{1}{\sqrt{np}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }+\frac{C_{6}}{\sigma _{\min }}\sigma \sqrt{\frac{n\log n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\\&\quad \le C_{5}\rho ^{t+1}\mu r\sqrt{\frac{\log n}{np}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }+\frac{C_{8}}{\sigma _{\min }}\sigma \sqrt{\frac{n\log n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty } \end{aligned}$$ as long as \(C_{5}/(\kappa C_{3}+C_{2})\) and \(C_{8}/(\kappa C_{7}+C_{6})\) are sufficiently large. This establishes the induction hypothesis (70b). From the deduction above, we see \(\mathbb {P}\big (\mathcal {E}_t \cap \mathcal {E}_{t+1}^c\big ) = O(n^{-10})\) and thus finish the proof.
7.4 The Base Case: Spectral Initialization
Finally, we return to check the base case; namely, we aim to show that the spectral initialization satisfies the induction hypotheses (70a)–(70e) for \(t=0\). This is accomplished via the following lemma.
Lemma 13
Suppose the sample size obeys \(n^{2}p\ge C \mu ^{2}r^{2}n\log n\) for some sufficiently large constant \(C>0\), the noise satisfies (27), and \(\kappa =\sigma _{\max }/\sigma _{\min }\asymp 1\). Then with probability at least \(1-O\left( n^{-10}\right) \), the claims in (70a)–(70e) hold simultaneously for \(t=0\).
Proof
This follows by invoking the Davis–Kahan sin\(\Theta \) theorem [39] as well as the entrywise eigenvector perturbation analysis in [1]. We defer the proof to Appendix B.7. \(\square \)
8 Analysis for Blind Deconvolution
In this section, we instantiate the general recipe presented in Sect. 5 to blind deconvolution and prove Theorem 3. Without loss of generality, we assume throughout that \(\left\| {\varvec{h}}^{\star }\right\| _{2}=\left\| {\varvec{x}}^{\star }\right\| _{2}=1\).
Before presenting the analysis, we first gather some simple facts about the empirical loss function in (32). Recall the definition of \({\varvec{z}}\) in (33), and for notational simplicity, we write \(f\left( {\varvec{z}}\right) =f({\varvec{h}},{\varvec{x}})\). Since \({\varvec{z}}\) is complex-valued, we need to resort to Wirtinger calculus; see [18, Section 6] for a brief introduction. The Wirtinger gradients of (32) with respect to \({\varvec{h}}\) and \({\varvec{x}}\) are given, respectively, by
$$\begin{aligned} \nabla _{{\varvec{h}}}f\left( {\varvec{h}},{\varvec{x}}\right)&=\sum _{j=1}^{m}\big ({\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}{\varvec{x}}^{\textsf {H} }{\varvec{a}}_{j}-y_{j}\big ){\varvec{b}}_{j}{\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}},\\ \nabla _{{\varvec{x}}}f\left( {\varvec{h}},{\varvec{x}}\right)&=\sum _{j=1}^{m}\overline{\big ({\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}{\varvec{x}}^{\textsf {H} }{\varvec{a}}_{j}-y_{j}\big )}\,{\varvec{a}}_{j}{\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}. \end{aligned}$$
It is worth noting that the formal Wirtinger gradient contains \(\nabla _{\overline{{\varvec{h}}}}f\left( {\varvec{h}},{\varvec{x}}\right) \) and \(\nabla _{\overline{{\varvec{x}}}}f\left( {\varvec{h}},{\varvec{x}}\right) \) as well. Nevertheless, since \(f\left( {\varvec{h}},{\varvec{x}}\right) \) is a real-valued function, the following identities always hold
$$\begin{aligned} \nabla _{\overline{{\varvec{h}}}}f\left( {\varvec{h}},{\varvec{x}}\right) =\overline{\nabla _{{\varvec{h}}}f\left( {\varvec{h}},{\varvec{x}}\right) }\qquad \text {and}\qquad \nabla _{\overline{{\varvec{x}}}}f\left( {\varvec{h}},{\varvec{x}}\right) =\overline{\nabla _{{\varvec{x}}}f\left( {\varvec{h}},{\varvec{x}}\right) }. \end{aligned}$$
In light of these observations, one often omits the gradient with respect to the conjugates; correspondingly, the gradient update rule (35) can be written as
$$\begin{aligned} {\varvec{h}}^{t+1}={\varvec{h}}^{t}-\frac{\eta }{\big \Vert {\varvec{x}}^{t}\big \Vert _{2}^{2}}\nabla _{{\varvec{h}}}f\big ({\varvec{h}}^{t},{\varvec{x}}^{t}\big )\qquad \text {and}\qquad {\varvec{x}}^{t+1}={\varvec{x}}^{t}-\frac{\eta }{\big \Vert {\varvec{h}}^{t}\big \Vert _{2}^{2}}\nabla _{{\varvec{x}}}f\big ({\varvec{h}}^{t},{\varvec{x}}^{t}\big ). \end{aligned}$$
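In code, the Wirtinger gradients and the update rule above take the following shape (a sketch under our own conventions; in particular, we draw the \({\varvec{b}}_{j}\)'s as Gaussian surrogates scaled so that \(\Vert {\varvec{b}}_{j}\Vert _{2}\approx \sqrt{K/m}\), whereas this paper takes them from a partial Fourier matrix):

```python
import numpy as np

rng = np.random.default_rng(8)
K, m, eta = 20, 800, 0.1

def crandn(*shape):                                    # i.i.d. CN(0, 1) entries
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

h_star = crandn(K); h_star /= np.linalg.norm(h_star)
x_star = crandn(K); x_star /= np.linalg.norm(x_star)
A = crandn(m, K)                                       # rows a_j: i.i.d. CN(0, I_K)
B = crandn(m, K) / np.sqrt(m)                          # rows b_j with ||b_j||_2 ~ sqrt(K/m)
y = (B.conj() @ h_star) * np.conj(A.conj() @ x_star)   # y_j = b_j^H h* x*^H a_j

def wirtinger_grads(h, x):
    z = (B.conj() @ h) * np.conj(A.conj() @ x) - y     # residuals b_j^H h x^H a_j - y_j
    grad_h = B.T @ (z * (A.conj() @ x))                # sum_j z_j (a_j^H x) b_j
    grad_x = A.T @ (np.conj(z) * (B.conj() @ h))       # sum_j conj(z_j) (b_j^H h) a_j
    return grad_h, grad_x

h, x = h_star + 0.02 * crandn(K), x_star + 0.02 * crandn(K)
for _ in range(300):
    gh, gx = wirtinger_grads(h, x)
    h, x = h - eta / np.linalg.norm(x) ** 2 * gh, x - eta / np.linalg.norm(h) ** 2 * gx

print(np.linalg.norm((B.conj() @ h) * np.conj(A.conj() @ x) - y))   # residual ~ 0
```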
We can also compute the Wirtinger Hessian of \(f({\varvec{z}})\) as follows,
where
Last but not least, we say \(\left( {\varvec{h}}_{1},{\varvec{x}}_{1}\right) \) is aligned with \(\left( {\varvec{h}}_{2},{\varvec{x}}_{2}\right) \), if the following holds,
To simplify notations, define \(\widetilde{{\varvec{z}}}^t\) as
$$\begin{aligned} \widetilde{{\varvec{z}}}^{t}=\left[ \begin{array}{c} \widetilde{{\varvec{h}}}^{t}\\ \widetilde{{\varvec{x}}}^{t} \end{array}\right] :=\left[ \begin{array}{c} \frac{1}{\overline{\alpha ^{t}}}{\varvec{h}}^{t}\\ \alpha ^{t}{\varvec{x}}^{t} \end{array}\right] \end{aligned}$$
with the alignment parameter \(\alpha ^{t}\) given in (38). Then we can see that \(\widetilde{{\varvec{z}}}^t\) is aligned with \({\varvec{z}}^{\star }\) and
$$\begin{aligned} \mathrm {dist}\big ({\varvec{z}}^{t},{\varvec{z}}^{\star }\big )=\sqrt{\big \Vert \widetilde{{\varvec{h}}}^{t}-{\varvec{h}}^{\star }\big \Vert _{2}^{2}+\big \Vert \widetilde{{\varvec{x}}}^{t}-{\varvec{x}}^{\star }\big \Vert _{2}^{2}}=\big \Vert \widetilde{{\varvec{z}}}^{t}-{\varvec{z}}^{\star }\big \Vert _{2}. \end{aligned}$$
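Assuming \(\mathrm {dist}\) takes the variational form \(\mathrm {dist}^{2}({\varvec{z}},{\varvec{z}}^{\star })=\min _{\alpha \in \mathbb {C}}\big \Vert \frac{1}{\overline{\alpha }}{\varvec{h}}-{\varvec{h}}^{\star }\big \Vert _{2}^{2}+\big \Vert \alpha {\varvec{x}}-{\varvec{x}}^{\star }\big \Vert _{2}^{2}\) (cf. (34)), the alignment parameter can be computed numerically with a generic two-dimensional optimizer over the real and imaginary parts of \(\alpha \). A minimal sketch:

```python
import numpy as np
from scipy.optimize import minimize

def align(h, x, h_star, x_star):
    """argmin over complex alpha of ||h/conj(alpha) - h_star||^2 + ||alpha x - x_star||^2."""
    def obj(u):
        a = u[0] + 1j * u[1]
        return (np.linalg.norm(h / np.conj(a) - h_star) ** 2
                + np.linalg.norm(a * x - x_star) ** 2)
    res = minimize(obj, x0=np.array([1.0, 0.0]), method="Nelder-Mead")
    return complex(res.x[0], res.x[1])

rng = np.random.default_rng(9)
h_star = rng.standard_normal(4) + 1j * rng.standard_normal(4)
x_star = rng.standard_normal(4) + 1j * rng.standard_normal(4)
alpha = 0.8 * np.exp(0.3j)                 # plant a known scaling ambiguity
h, x = np.conj(alpha) * h_star, x_star / alpha
print(align(h, x, h_star, x_star))         # recovers ~ 0.8 * exp(0.3j), up to tolerance
```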
8.1 Step 1: Characterizing Local Geometry in the RIC
8.1.1 Local Geometry
The first step is to characterize the region of incoherence and contraction (RIC), where the empirical loss function enjoys restricted strong convexity and smoothness properties. To this end, we have the following lemma.
Lemma 14
(Restricted strong convexity and smoothness for blind deconvolution) Let \(c>0\) be a sufficiently small constant and
Suppose the sample size satisfies \(m\ge c_0 \mu ^{2}K\log ^{9}m\) for some sufficiently large constant \(c_0 >0\). Then with probability \(1-O\left( m^{-10} + e^{-K} \log m \right) \), the Wirtinger Hessian \(\nabla ^{2}f\left( {\varvec{z}}\right) \) obeys
simultaneously for all
where \({\varvec{z}}\) satisfies
\(\left( {\varvec{h}}_{1},{\varvec{x}}_{1}\right) \) is aligned with \(\left( {\varvec{h}}_{2},{\varvec{x}}_{2}\right) \), and they satisfy
and finally, \({\varvec{D}}\) satisfies for \(\gamma _{1},\gamma _{2}\in \mathbb {R}\),
Here, \(C_{3},C_{4}>0\) are numerical constants.
Proof
See Appendix C.1. \(\square \)
Lemma 14 characterizes the restricted strong convexity and smoothness of the loss function used in blind deconvolution. To the best of our knowledge, this provides the first characterization regarding geometric properties of the Hessian matrix for blind deconvolution. A few interpretations are in order.
Conditions (82) specify the region of incoherence and contraction (RIC). In particular, (82a) specifies a neighborhood that is close to the ground truth in \(\ell _2\) norm, and (82b) and (82c) specify the incoherence region with respect to the sensing vectors \(\{{\varvec{a}}_j\}\) and \(\{{\varvec{b}}_j\}\), respectively.
Similar to matrix completion, the Hessian matrix is rank-deficient even at the population level. Consequently, we resort to a restricted form of strong convexity by focusing on certain directions. More specifically, these directions can be viewed as the difference between two pre-aligned points that are not far from the truth, which is characterized by (83).
Finally, the diagonal matrix \({\varvec{D}}\) accounts for scaling factors that are not too far from 1 (see (84)), which allows us to accommodate the different step sizes employed for \({\varvec{h}}\) and \({\varvec{x}}\).
8.1.2 Error Contraction
The restricted strong convexity and smoothness allow us to establish the contraction of the error measured in terms of \(\text {dist}(\cdot , {\varvec{z}}^{\star })\) as defined in (34) as long as the iterates stay in the RIC.
Lemma 15
Suppose the number of measurements satisfies \(m\ge C \mu ^{2}K\log ^{9}m\) for some sufficiently large constant \(C>0\), and the step size \(\eta >0\) is some sufficiently small constant. There exists an event that does not depend on t and has probability \(1-O\left( m^{-10} + e^{-K} \log m \right) \), such that when it happens and
hold for some constants \(C_{3},C_{4}>0\), one has
Here, \(\widetilde{{\varvec{h}}}^{t}\) and \(\widetilde{{\varvec{x}}}^{t}\) are defined in (81), and \(\xi \ll 1/\log ^2 m\).
Proof
See Appendix C.2. \(\square \)
As a result, if \({\varvec{z}}^{t}\) satisfies condition (85) for all \(0\le t\le T\), then
where \(\rho := 1- \eta /16\). Furthermore, similar to the case of phase retrieval (i.e., Lemma 3), as soon as we demonstrate that conditions (85) hold for all \(0\le t\le m\), then Theorem 3 holds true. The proof of this claim is exactly the same as for Lemma 3 and is thus omitted for conciseness. In what follows, we focus on establishing (85) for all \(0\le t\le m\).
Before concluding this subsection, we make note of another important result that concerns the alignment parameter \(\alpha ^{t}\), which will be useful in the subsequent analysis. Specifically, the alignment parameter sequence \(\{\alpha ^{t}\}\) converges linearly to a constant whose magnitude is fairly close to 1, as long as the two initial vectors \({\varvec{h}}^0\) and \({\varvec{x}}^0\) have similar \(\ell _2\) norms and are close to the truth. Given that \(\alpha ^t\) determines the global scaling of the iterates, this reveals rapid convergence of both \(\Vert {\varvec{h}}^t\Vert _2\) and \(\Vert {\varvec{x}}^t\Vert _2\), which explains why there is no need to impose extra terms to regularize the \(\ell _2\) norm as employed in [58, 76].
Lemma 16
When \(m > 1\) is sufficiently large, the following two claims hold true.
If \(\big ||\alpha ^{t}|-1\big |\le 1/2\) and \(\mathrm {dist}({\varvec{z}}^{t},{\varvec{z}}^{\star })\le C_{1}/\log ^{2}m\), then
$$\begin{aligned} \left| \frac{\alpha ^{t+1}}{\alpha ^{t}}-1\right| \le c\,\mathrm {dist}({\varvec{z}}^{t},{\varvec{z}}^{\star })\le \frac{cC_{1}}{\log ^{2}m} \end{aligned}$$for some absolute constant \(c>0\);
If \(\big ||\alpha ^{0}|-1\big |\le 1/4\) and \(\mathrm {dist}({\varvec{z}}^{s},{\varvec{z}}^{\star })\le C_{1}(1-\eta /16)^{s}/\log ^{2}m\) for all \(0\le s\le t\), then one has
$$\begin{aligned} \big ||\alpha ^{s+1}|-1\big |\le 1/2,\qquad 0\le s\le t. \end{aligned}$$
Proof
See Appendix C.2. \(\square \)
The initial condition \(\big ||\alpha ^{0}|-1\big |<1/4\) will be guaranteed to hold with high probability by Lemma 19.
8.2 Step 2: Introducing the Leave-One-Out Sequences
As demonstrated by the assumptions in Lemma 15, the key is to show that the whole trajectory lies in the region specified by (85a)–(85c). Once again, the difficulty lies in the statistical dependency between the iterates \(\left\{ {\varvec{z}}^{t}\right\} \) and the measurement vectors \(\left\{ {\varvec{a}}_{j}\right\} \). We follow the general recipe and introduce the leave-one-out sequences, denoted by \(\left\{ {\varvec{h}}^{t,\left( l\right) },{\varvec{x}}^{t,(l)}\right\} _{t\ge 0}\) for each \(1\le l\le m\). Specifically, \(\left\{ {\varvec{h}}^{t,\left( l\right) },{\varvec{x}}^{t,(l)}\right\} _{t\ge 0}\) is the gradient sequence operating on the loss function
$$\begin{aligned} f^{(l)}({\varvec{h}},{\varvec{x}}):=\sum _{j:j\ne l}\Big |{\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}{\varvec{x}}^{\textsf {H} }{\varvec{a}}_{j}-y_{j}\Big |^{2}. \end{aligned}$$(86)
The whole sequence is constructed by running gradient descent with spectral initialization on the leave-one-out loss (86). The precise description is supplied in Algorithm 6.
For notational simplicity, we denote \({\varvec{z}}^{t,(l)}=\left[ \begin{array}{c} {\varvec{h}}^{t,(l)}\\ {\varvec{x}}^{t,(l)} \end{array}\right] \) and use \(f({\varvec{z}}^{t,(l)})=f({\varvec{h}}^{t,(l)},{\varvec{x}}^{t,(l)})\) interchangeably. Define similarly the alignment parameters
$$\begin{aligned} \alpha ^{t,(l)}:=\arg \min _{\alpha \in \mathbb {C}}\ \Big \Vert \frac{1}{\overline{\alpha }}{\varvec{h}}^{t,(l)}-{\varvec{h}}^{\star }\Big \Vert _{2}^{2}+\big \Vert \alpha {\varvec{x}}^{t,(l)}-{\varvec{x}}^{\star }\big \Vert _{2}^{2},\end{aligned}$$
and denote \(\widetilde{{\varvec{z}}}^{t,(l)}=\left[ \begin{array}{c} \widetilde{{\varvec{h}}}^{t,\left( l\right) }\\ \widetilde{{\varvec{x}}}^{t,\left( l\right) } \end{array}\right] \) where
$$\begin{aligned} \widetilde{{\varvec{h}}}^{t,(l)}:=\frac{1}{\overline{\alpha ^{t,(l)}}}{\varvec{h}}^{t,(l)}\qquad \text {and}\qquad \widetilde{{\varvec{x}}}^{t,(l)}:=\alpha ^{t,(l)}{\varvec{x}}^{t,(l)}. \end{aligned}$$
8.3 Step 3: Establishing the Incoherence Condition by Induction
As usual, we continue the proof in an inductive manner. For clarity of presentation, we list below the set of induction hypotheses underlying our analysis:
where \(\widetilde{{\varvec{h}}}^{t},\; \widetilde{{\varvec{x}}}^{t}\), and \(\widetilde{{\varvec{z}}}^{t}\) are defined in (81). Here, \(C_{1},C_{3}>0\) are some sufficiently small constants, while \(C_{2},C_{4}>0\) are some sufficiently large constants. We aim to show that if these hypotheses (90) hold up to the tth iteration, then the same would hold for the (\(t+1\))th iteration with exceedingly high probability (e.g., \(1-O(m^{-10})\)). The first hypothesis (90a) has already been established in Lemma 15, and hence, the rest of this section focuses on establishing the remaining three. To justify the incoherence hypotheses (90c) and (90d) for the (\(t+1\))th iteration, we need to leverage the nice properties of the leave-one-out sequences and establish (90b) first. In the sequel, we follow the steps suggested in the general recipe.
Step 3(a): proximity between the original and the leave-one-out iterates We first justify hypothesis (90b) for the (\(t+1\))th iteration via the following lemma.
Lemma 17
Suppose the sample complexity obeys \(m\ge C \mu ^{2}K\log ^{9}m\) for some sufficiently large constant \(C>0\). Let \(\mathcal {E}_t\) be the event where hypotheses (90a)–(90d) hold for the tth iteration. Then on an event \(\mathcal {E}_{t+1,1} \subseteq \mathcal {E}_t\) obeying \(\mathbb {P}(\mathcal {E}_t \cap \mathcal {E}_{t+1,1}^c ) = O(m^{-10} + me^{-cK} )\) for some constant \(c>0\), one has
provided that the step size \(\eta >0\) is some sufficiently small constant.
Proof
As usual, this result follows from the restricted strong convexity, which forces the distance between the two sequences of interest to be contractive. See Appendix C.3.\(\square \)
Step 3(b): incoherence of the leave-one-out iterate \({\varvec{x}}^{t+1,(l)}\) w.r.t. \({\varvec{a}}_{l}\) Next, we show that the leave-one-out iterate \(\widetilde{{\varvec{x}}}^{t+1,\left( l\right) }\)—which is independent of \({\varvec{a}}_{l}\)—is incoherent w.r.t. \({\varvec{a}}_{l}\) in the sense that
$$\begin{aligned} \left| {\varvec{a}}_{l}^{\textsf {H} }\big (\widetilde{{\varvec{x}}}^{t+1,\left( l\right) } -{\varvec{x}}^{\star }\big )\right| \le 10C_{1}\frac{1}{\log ^{3/2}m} \end{aligned}$$(91) with probability exceeding \(1-O\left( m^{-10} + e^{-K} \log m \right) \). To see why, use the statistical independence and the standard Gaussian concentration inequality to show that
$$\begin{aligned} \max _{1\le l\le m}\left| {\varvec{a}}_{l}^{\textsf {H} }\big (\widetilde{{\varvec{x}}}^{t+1,(l)}-{\varvec{x}}^{\star }\big )\right| \le 5\sqrt{\log m}\max _{1\le l\le m}\big \Vert \widetilde{{\varvec{x}}}^{t+1,(l)}-{\varvec{x}}^{\star }\big \Vert _{2} \end{aligned}$$with probability exceeding \(1-O(m^{-10})\). It then follows from the triangle inequality that
$$\begin{aligned} \big \Vert \widetilde{{\varvec{x}}}^{t+1,(l)}-{\varvec{x}}^{\star }\big \Vert _{2}&\le \big \Vert \widetilde{{\varvec{x}}}^{t+1,(l)}-\widetilde{{\varvec{x}}}^{t+1}\big \Vert _{2}+\left\| \widetilde{{\varvec{x}}}^{t+1}-{\varvec{x}}^{\star }\right\| _{2}\\&\overset{\left( \text {i}\right) }{\le }CC_{2}\frac{\mu }{\sqrt{m}}\sqrt{\frac{\mu ^{2}K\log ^{9}m}{m}}+C_{1}\frac{1}{\log ^{2}m}\\&\overset{\left( \text {ii}\right) }{\le }2C_{1}\frac{1}{\log ^{2}m}, \end{aligned}$$where (i) follows from Lemmas 15 and 17 and (ii) holds as soon as \(m/( \mu ^{2}\sqrt{K}\log ^{13/2}m )\) is sufficiently large. Combining the preceding two bounds establishes (91).
Step 3(c): combining the bounds to show incoherence of \({\varvec{x}}^{t+1}\) w.r.t. \(\left\{ {\varvec{a}}_{l}\right\} \) The above bounds immediately allow us to conclude that
$$\begin{aligned} \max _{1\le l\le m}\left| {\varvec{a}}_{l}^{\textsf {H} }\big (\widetilde{{\varvec{x}}}^{t+1}-{\varvec{x}}^{\star }\big )\right| \le C_{3}\frac{1}{\log ^{3/2}m} \end{aligned}$$with probability at least \(1-O\left( m^{-10} + e^{-K}\log m \right) \), which is exactly hypothesis (90c) for the (\(t+1\))th iteration. Specifically, for each \(1\le l\le m\), the triangle inequality yields
$$\begin{aligned} \left| {\varvec{a}}_{l}^{\textsf {H} }\big (\widetilde{{\varvec{x}}}^{t+1}-{\varvec{x}}^{\star }\big )\right|&\le \left| {\varvec{a}}_{l}^{\textsf {H} }\big (\widetilde{{\varvec{x}}}^{t+1} -\widetilde{{\varvec{x}}}^{t+1,(l)}\big )\right| +\left| {\varvec{a}}_{l}^{\textsf {H} } \big (\widetilde{{\varvec{x}}}^{t+1,(l)}-{\varvec{x}}^{\star }\big )\right| \\&\overset{\left( \text {i}\right) }{\le }\left\| {\varvec{a}}_{l}\right\| _{2}\left\| \widetilde{{\varvec{x}}}^{t+1}-\widetilde{{\varvec{x}}}^{t+1,(l)} \right\| _{2}+\left| {\varvec{a}}_{l}^{\textsf {H} }\big (\widetilde{{\varvec{x}}}^{t+1,(l)}-{\varvec{x}}^{\star } \big )\right| \\&\overset{\left( \text {ii}\right) }{\le }3\sqrt{K}\cdot CC_{2}\frac{\mu }{\sqrt{m}}\sqrt{\frac{\mu ^{2}K\log ^{9}m}{m}}+10C_{1}\frac{1}{\log ^{3/2}m}\\&\overset{\left( \text {iii}\right) }{\le }C_{3}\frac{1}{\log ^{3/2}m}. \end{aligned}$$Here (i) follows from Cauchy–Schwarz; (ii) is a consequence of (190), Lemma 17, and bound (91); and the last inequality holds as long as \(m / ( \mu ^{2}K\log ^{6}m )\) is sufficiently large and \(C_{3}\ge 11C_{1}\).
Step 3(d): incoherence of \({\varvec{h}}^{t+1}\) w.r.t. \(\{{\varvec{b}}_{l}\}\) It remains to justify that \({\varvec{h}}^{t+1}\) is also incoherent w.r.t. its associated design vectors \(\{{\varvec{b}}_{l}\}\). The proof of this step, however, is much more involved and challenging, due to the deterministic nature of the \({\varvec{b}}_{l}\)’s. As a result, we would need to “propagate” the randomness brought about by \(\{{\varvec{a}}_{l}\}\) to \({\varvec{h}}^{t+1}\) in order to facilitate the analysis. The result is summarized as follows.
Lemma 18
Suppose that the sample complexity obeys \(m\ge C \mu ^{2}K\log ^{9}m\) for some sufficiently large constant \(C>0\). Let \(\mathcal {E}_t\) be the event where hypotheses (90a)–(90d) hold for the tth iteration. Then on an event \(\mathcal {E}_{t+1,2} \subseteq \mathcal {E}_t\) obeying \(\mathbb {P}(\mathcal {E}_t \cap \mathcal {E}_{t+1,2}^c ) = O(m^{-10})\), one has
as long as \(C_{4}\) is sufficiently large and \(\eta >0\) is taken to be some sufficiently small constant.
Proof
The key idea is to divide \(\{1,\cdots ,m\}\) into consecutive bins each of size \(\mathrm {poly}\log (m)\), and to exploit the randomness (namely, the randomness from \({\varvec{a}}_{l}\)) within each bin. This binning idea is crucial in ensuring that the incoherence measure of interest does not blow up as t increases. See Appendix C.4. \(\square \)
With these steps in place, we conclude the proof of Theorem 3 via induction and the union bound.
8.4 The Base Case: Spectral Initialization
In order to finish the induction steps, we still need to justify the induction hypotheses for the base cases; namely, we need to show that the spectral initializations \({\varvec{z}}^{0}\) and \(\left\{ {\varvec{z}}^{0,\left( l\right) }\right\} _{1\le l \le m} \) satisfy the induction hypotheses (90) at \(t=0\).
To start with, the initializations are sufficiently close to the truth when measured by the \(\ell _{2}\) norm, as summarized by the following lemma.
Lemma 19
Fix any small constant \(\xi >0\). Suppose the sample size obeys \(m\ge {C\mu ^{2}K\log ^{2}m}/{\xi ^{2}}\) for some sufficiently large constant \(C>0\). Then with probability at least \(1-O(m^{-10})\), we have
and \(\big ||\alpha ^{0}|-1\big | \le 1/4\).
Proof
This follows from Wedin’s sin\(\Theta \) theorem [121] and [76, Lemma 5.20]. See Appendix C.5. \(\square \)
From the definition of \(\mathrm {dist}(\cdot ,\cdot )\) (cf. (34)), we immediately have
as long as \( m\ge C\mu ^{2}K\log ^{6}m\) for some sufficiently large constant \(C>0\). Here (i) follows from the elementary inequality that \(a^{2}+b^{2}\le \left( a+b\right) ^{2}\) for positive a and b, (ii) holds since the latter minimization is taken over a strictly smaller feasible set, and (iii) follows directly from Lemma 19. This finishes the proof of (90a) for \(t=0\). Similarly, with high probability we have
Next, when properly aligned, the true initial estimate \({\varvec{z}}^0\) and the leave-one-out estimate \({\varvec{z}}^{0,(l)}\) are expected to be sufficiently close, as claimed by the following lemma. Along the way, we show that \({\varvec{h}}^{0}\) is incoherent w.r.t. the sampling vectors \(\left\{ {\varvec{b}}_{l}\right\} \). This establishes (90b) and (90d) for \(t=0\).
Lemma 20
Suppose that \(m\ge C \mu ^{2}K\log ^{3}m\) for some sufficiently large constant \(C>0\). Then with probability at least \(1-O(m^{-10})\), one has
and
Proof
The key is to establish that \(\mathrm {dist}\big ({\varvec{z}}^{0,\left( l\right) },\widetilde{{\varvec{z}}}^{0}\big )\) can be upper bounded by some linear scaling of \(\big |{\varvec{b}}_{l}^{\textsf {H} }\widetilde{{\varvec{h}}}^{0}\big |\), and vice versa. This allows us to derive bounds simultaneously for both quantities. See Appendix C.6.\(\square \)
Finally, we establish (90c) regarding the incoherence of \({\varvec{x}}^{0}\) with respect to the design vectors \(\left\{ {\varvec{a}}_{l}\right\} \).
Lemma 21
Suppose that \(m\ge C \mu ^{2}K\log ^{6}m\) for some sufficiently large constant \(C>0\). Then with probability exceeding \(1-O(m^{-10})\), we have
Proof
See Appendix C.7. \(\square \)
9 Discussion
This paper showcases an important phenomenon in nonconvex optimization: even without explicit enforcement of regularization, the vanilla form of gradient descent effectively achieves implicit regularization for a large family of statistical estimation problems. We believe this phenomenon arises in problems far beyond the three cases studied herein, and our results are initial steps toward understanding this fundamental phenomenon. There are numerous avenues open for future investigation, and we point out a few of them.
Improving sample complexity In the current paper, the required sample complexity \(O\left( \mu ^{3}r^{3}n\log ^{3}n\right) \) for matrix completion is suboptimal when the rank r of the underlying matrix is large. While this allows us to achieve a dimension-free iteration complexity, it is slightly higher than the sample complexity derived for regularized gradient descent in [32]. We expect our results to continue to hold under the lower sample complexity \(O\left( \mu ^{2}r^{2}n\log n\right) \), but this calls for a more refined analysis (e.g., a generic chaining argument).
Leave-one-out tricks for more general designs So far, our focus is on independent designs, including the i.i.d. Gaussian design adopted in phase retrieval and partially in blind deconvolution, as well as the independent sampling mechanism in matrix completion. Such independence property creates some sort of “statistical homogeneity,” for which the leave-one-out argument works beautifully. It remains unclear how to generalize such leave-one-out tricks for more general designs (e.g., more general sampling patterns in matrix completion and more structured Fourier designs in phase retrieval and blind deconvolution). In fact, the readers can already get a flavor of this issue in the analysis of blind deconvolution, where the Fourier design vectors require much more delicate treatments than purely Gaussian designs.
Uniform stability The leave-one-out perturbation argument is established upon a basic fact: when we exclude one sample from consideration, the resulting estimates/predictions do not deviate much from the original ones. This leave-one-out stability bears similarity to the notion of uniform stability studied in statistical learning theory [8]. We expect our analysis framework to be helpful for analyzing other learning algorithms that are uniformly stable.
Other iterative methods and other loss functions The focus of the current paper has been the analysis of vanilla GD tailored to the natural squared loss. This is by no means to advocate GD as the top-performing algorithm in practice; rather, we are using this simple algorithm to isolate some seemingly pervasive phenomena (i.e., implicit regularization) that generic optimization theory fails to account for. The simplicity of vanilla GD makes it an ideal object to initiate such discussions. That being said, practitioners should definitely explore as many algorithmic alternatives as possible before settling on a particular algorithm. Take phase retrieval for example: iterative methods other than GD and/or algorithms tailored to other loss functions have been proposed in the nonconvex optimization literature, including but not limited to alternating minimization, block coordinate descent, and sub-gradient and prox-linear methods tailored to nonsmooth losses. It would be interesting to develop a full theoretical understanding of a broader class of iterative algorithms, and to conduct a careful comparison regarding which loss functions lead to the most desirable practical performance.
Connections to deep learning? We have focused on nonlinear systems that are bilinear or quadratic in this paper. Deep learning formulations/architectures, which are highly nonlinear, are notorious for their daunting nonconvex geometry. However, iterative methods including stochastic gradient descent have enjoyed enormous practical success in learning neural networks (e.g., [46, 103, 132]), even when the architecture is significantly over-parameterized and no explicit regularization is imposed. We hope the message conveyed in this paper for several simple statistical models sheds light on why simple forms of gradient descent and its variants work so well in learning complicated neural networks.
Finally, while the present paper provides a general recipe for problem-specific analyses of nonconvex algorithms, we acknowledge that a unified theory of this kind has yet to be developed. As a consequence, each problem requires delicate and somewhat lengthy analyses of its own. It would certainly be helpful if one could single out a few stylized structural properties/elements (like sparsity and incoherence in compressed sensing [13]) that enable near-optimal performance guarantees through an overarching method of analysis; with this in place, one would not need to start each problem from scratch. Having said that, we believe that our current theory elucidates a few ingredients (e.g., the region of incoherence and leave-one-out stability) that might serve as crucial building blocks for such a general theory. We invite interested readers to contribute toward this path forward.
Notes
Here, we choose different pre-constants in front of the empirical loss in order to be consistent with the literature on the respective problems. In addition, we only introduce the problem in the noiseless case for simplicity of presentation.
Here and throughout, i represents the imaginary unit.
To demonstrate this, taking \({\varvec{x}}={\varvec{x}}^{\star }+\left( \delta / {\Vert {\varvec{a}}_{1}\Vert _{2}}\right) \cdot {{\varvec{a}}_{1}}\) in (10), one can easily verify that, with high probability, \(\big \Vert \nabla ^{2}f({\varvec{x}})\big \Vert \ge \left| 3({\varvec{a}}_{1}^{\top }{\varvec{x}})^{2}-y_{1}\right| \big \Vert {\varvec{a}}_{1}{\varvec{a}}_{1}^{\top }\big \Vert / m-O(1)\gtrsim {\delta ^{2}n^{2}} / {m}\asymp \delta ^{2}{n} / {\log n}\).
If \({\varvec{x}}\) is aligned with (and hence very coherent with) one vector \({\varvec{a}}_{j}\), then with high probability one has \(\big |{\varvec{a}}_{j}^{\top }\big ({\varvec{x}}-{\varvec{x}}^{\star }\big )|\gtrsim \big |{\varvec{a}}_{j}^{\top }{\varvec{x}}| \asymp \sqrt{n}\Vert {\varvec{x}}\Vert _{2}\), which is significantly larger than \(\sqrt{\log n}\Vert {\varvec{x}}\Vert _{2}\).
Here, we assume \({\varvec{M}}^{\star }\) to be positive semidefinite to simplify the presentation, but note that our analysis easily extends to asymmetric low-rank matrices.
Note that when \({\varvec{M}}^{\star }\) is well conditioned and when \(r=O(1)\), one can easily check that \(\textsf {SNR} \approx \left( {\Vert {\varvec{M}}^{\star }\Vert _{\mathrm {F}}^{2}}\right) / \left( {n^{2}\sigma ^{2}}\right) \asymp {\sigma _{\min }^{2}} / ({n^{2}\sigma ^{2}})\), and our theory says that the squared relative error bound is proportional to \(\sigma ^2 / \sigma _{\min }^2\).
For simplicity, we have set the dimensions of the two subspaces equal, and it is straightforward to extend our results to the case of unequal subspace dimensions.
For simplicity, we assume \(\frac{1}{2}\log \left( \kappa \mu r\right) \) is an integer. The argument here can be easily adapted to the case when \(\frac{1}{2}\log \left( \kappa \mu r\right) \) is not an integer.
References
Abbe, E., Fan, J., Wang, K., Zhong, Y.: Entrywise eigenvector analysis of random matrices with low expected rank. arXiv preprint arXiv:1709.09565 (2017)
Aghasi, A., Ahmed, A., Hand, P., Joshi, B.: Branchhull: Convex bilinear inversion from the entrywise product of signals with known signs. Applied and Computational Harmonic Analysis (2019)
Ahmed, A., Recht, B., Romberg, J.: Blind deconvolution using convex programming. IEEE Transactions on Information Theory 60(3), 1711–1732 (2014)
Alon, N., Spencer, J.H.: The Probabilistic Method (3rd Edition). Wiley (2008)
Bahmani, S., Romberg, J.: Phase retrieval meets statistical learning theory: A flexible convex relaxation. In: Artificial Intelligence and Statistics, pp. 252–260 (2017)
Bendory, T., Eldar, Y.C., Boumal, N.: Non-convex phase retrieval from STFT measurements. IEEE Transactions on Information Theory (2017)
Bhojanapalli, S., Neyshabur, B., Srebro, N.: Global optimality of local search for low rank matrix recovery. In: Advances in Neural Information Processing Systems, pp. 3873–3881 (2016)
Bousquet, O., Elisseeff, A.: Stability and generalization. Journal of Machine Learning Research 2(Mar), 499–526 (2002)
Bubeck, S.: Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning 8(3-4), 231–357 (2015)
Cai, J.F., Liu, H., Wang, Y.: Fast rank-one alternating minimization algorithm for phase retrieval. Journal of Scientific Computing 79(1), 128–147 (2019)
Cai, T., Zhang, A.: ROP: Matrix recovery via rank-one projections. The Annals of Statistics 43(1), 102–138 (2015)
Cai, T.T., Li, X., Ma, Z.: Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow. The Annals of Statistics 44(5), 2221–2251 (2016)
Candès, E., Plan, Y.: A probabilistic and RIPless theory of compressed sensing. IEEE Transactions on Information Theory 57(11), 7235–7254 (2011). https://doi.org/10.1109/TIT.2011.2161794
Candès, E., Tao, T.: The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory 56(5), 2053 –2080 (2010)
Candès, E.J., Eldar, Y.C., Strohmer, T., Voroninski, V.: Phase retrieval via matrix completion. SIAM Journal on Imaging Sciences 6(1), 199–225 (2013)
Candès, E.J., Li, X.: Solving quadratic equations via PhaseLift when there are about as many equations as unknowns. Foundations of Computational Mathematics 14(5), 1017–1026 (2014)
Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? Journal of ACM 58(3), 11:1–11:37 (2011)
Candès, E.J., Li, X., Soltanolkotabi, M.: Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory 61(4), 1985–2007 (2015)
Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9(6), 717–772 (2009)
Candès, E.J., Strohmer, T., Voroninski, V.: Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics 66(8), 1017–1026 (2013)
Chandrasekaran, V., Sanghavi, S., Parrilo, P.A., Willsky, A.S.: Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization 21(2), 572–596 (2011)
Chen, P., Fannjiang, A., Liu, G.R.: Phase retrieval with one or two diffraction patterns by alternating projections with the null initialization. Journal of Fourier Analysis and Applications, pp. 1–40 (2015)
Chen, Y.: Incoherence-optimal matrix completion. IEEE Transactions on Information Theory 61(5), 2909–2923 (2015)
Chen, Y., Candès, E.: The projected power method: An efficient algorithm for joint alignment from pairwise differences. Communications on Pure and Applied Mathematics 71(8), 1648–1714 (2018)
Chen, Y., Candès, E.J.: Solving random quadratic systems of equations is nearly as easy as solving linear systems. Communications on Pure and Applied Mathematics 70(5), 822–883 (2017). https://doi.org/10.1002/cpa.21638.
Chen, Y., Cheng, C., Fan, J.: Asymmetry helps: Eigenvalue and eigenvector analyses of asymmetrically perturbed low-rank matrices. arXiv preprint arXiv:1811.12804 (2018)
Chen, Y., Chi, Y., Fan, J., Ma, C.: Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval. Mathematical Programming 176(1-2), 5–37 (2019)
Chen, Y., Chi, Y., Fan, J., Ma, C., Yan, Y.: Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization. arXiv preprint arXiv:1902.07698 (2019)
Chen, Y., Chi, Y., Goldsmith, A.J.: Exact and stable covariance estimation from quadratic sampling via convex programming. IEEE Transactions on Information Theory 61(7), 4034–4059 (2015)
Chen, Y., Fan, J., Ma, C., Wang, K.: Spectral method and regularized MLE are both optimal for top-\(K\) ranking. Annals of Statistics 47(4), 2204–2235 (2019)
Chen, Y., Fan, J., Ma, C., Yan, Y.: Inference and uncertainty quantification for noisy matrix completion. arXiv preprint arXiv:1906.04159 (2019)
Chen, Y., Wainwright, M.J.: Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025 (2015)
Chen, Y., Yi, X., Caramanis, C.: A convex formulation for mixed regression with two components: Minimax optimal rates. In: Conference on Learning Theory, pp. 560–604 (2014)
Cherapanamjeri, Y., Jain, P., Netrapalli, P.: Thresholding based outlier robust PCA. In: Conference on Learning Theory, pp. 593–628 (2017)
Chi, Y.: Guaranteed blind sparse spikes deconvolution via lifting and convex optimization. IEEE Journal of Selected Topics in Signal Processing 10(4), 782–794 (2016)
Chi, Y., Lu, Y.M.: Kaczmarz method for solving quadratic equations. IEEE Signal Processing Letters 23(9), 1183–1187 (2016)
Chi, Y., Lu, Y.M., Chen, Y.: Nonconvex optimization meets low-rank matrix factorization: An overview. arXiv preprint arXiv:1809.09573 (2018)
Davenport, M.A., Romberg, J.: An overview of low-rank matrix recovery from incomplete observations. IEEE Journal of Selected Topics in Signal Processing 10(4), 608–622 (2016)
Davis, C., Kahan, W.M.: The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis 7(1), 1–46 (1970)
Davis, D., Drusvyatskiy, D., Paquette, C.: The nonsmooth landscape of phase retrieval. arXiv preprint arXiv:1711.03247 (2017)
Dhifallah, O., Thrampoulidis, C., Lu, Y.M.: Phase retrieval via linear programming: Fundamental limits and algorithmic improvements. arXiv preprint arXiv:1710.05234 (2017)
Dopico, F.M.: A note on \(\sin \Theta \) theorems for singular subspace variations. BIT 40(2), 395–403 (2000). https://doi.org/10.1023/A:1022303426500.
Duchi, J.C., Ruan, F.: Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval. Information and Inference (2018)
El Karoui, N.: On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probability Theory and Related Fields pp. 1–81 (2015)
El Karoui, N., Bean, D., Bickel, P.J., Lim, C., Yu, B.: On robust regression with high-dimensional predictors. Proceedings of the National Academy of Sciences 110(36), 14557–14562 (2013)
Fan, J., Ma, C., Zhong, Y.: A selective overview of deep learning. arXiv preprint arXiv:1904.05526 (2019)
Gao, B., Xu, Z.: Phase retrieval using Gauss-Newton method. arXiv preprint arXiv:1606.08135 (2016)
Ge, R., Lee, J.D., Ma, T.: Matrix completion has no spurious local minimum. In: Advances in Neural Information Processing Systems, pp. 2973–2981 (2016)
Ge, R., Ma, T.: On the optimization landscape of tensor decompositions. In: Advances in Neural Information Processing Systems, pp. 3653–3663 (2017)
Goldstein, T., Studer, C.: Phasemax: Convex phase retrieval via basis pursuit. IEEE Transactions on Information Theory 64(4), 2675–2689 (2018)
Gross, D.: Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory 57(3), 1548–1566 (2011)
Gunasekar, S., Woodworth, B.E., Bhojanapalli, S., Neyshabur, B., Srebro, N.: Implicit regularization in matrix factorization. In: Advances in Neural Information Processing Systems, pp. 6151–6159 (2017)
Hand, P., Voroninski, V.: An elementary proof of convex phase retrieval in the natural parameter space via the linear program phasemax. Communications in Mathematical Sciences 16(7), 2047–2051 (2018)
Hardt, M., Wootters, M.: Fast matrix completion without the condition number. Conference on Learning Theory, pp. 638–678 (2014)
Hastie, T., Mazumder, R., Lee, J.D., Zadeh, R.: Matrix completion and low-rank SVD via fast alternating least squares. Journal of Machine Learning Research 16, 3367–3402 (2015)
Higham, N.J.: Estimating the matrix \(p\)-norm. Numerische Mathematik 62(1), 539–555 (1992)
Hsu, D., Kakade, S.M., Zhang, T.: A tail inequality for quadratic forms of subgaussian random vectors. Electron. Commun. Probab. 17, no. 52, 6 (2012). https://doi.org/10.1214/ECP.v17-2079.
Huang, W., Hand, P.: Blind deconvolution by a steepest descent algorithm on a quotient manifold. SIAM Journal on Imaging Sciences 11(4), 2757–2785 (2018)
Jaganathan, K., Eldar, Y.C., Hassibi, B.: Phase retrieval: An overview of recent developments. arXiv preprint arXiv:1510.07713 (2015)
Jain, P., Netrapalli, P.: Fast exact matrix completion with finite samples. In: Conference on Learning Theory, pp. 1007–1034 (2015)
Jain, P., Netrapalli, P., Sanghavi, S.: Low-rank matrix completion using alternating minimization. In: ACM symposium on Theory of computing, pp. 665–674 (2013)
Javanmard, A., Montanari, A.: Debiasing the lasso: Optimal sample size for Gaussian designs. The Annals of Statistics 46(6A), 2593–2622 (2018)
Jin, C., Kakade, S.M., Netrapalli, P.: Provable efficient online matrix completion via non-convex stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 4520–4528 (2016)
Keshavan, R.H., Montanari, A., Oh, S.: Matrix completion from a few entries. IEEE Transactions on Information Theory 56(6), 2980–2998 (2010)
Keshavan, R.H., Montanari, A., Oh, S.: Matrix completion from noisy entries. J. Mach. Learn. Res. 11, 2057–2078 (2010)
Koltchinskii, V.: Oracle inequalities in empirical risk minimization and sparse recovery problems, Lecture Notes in Mathematics, vol. 2033. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22147-7
Koltchinskii, V., Lounici, K., Tsybakov, A.B.: Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Statist. 39(5), 2302–2329 (2011). https://doi.org/10.1214/11-AOS894.
Kolte, R., Özgür, A.: Phase retrieval via incremental truncated Wirtinger flow. arXiv preprint arXiv:1606.03196 (2016)
Kreutz-Delgado, K.: The complex gradient operator and the CR-calculus. arXiv preprint arXiv:0906.4835 (2009)
Lang, S.: Real and functional analysis. Springer-Verlag, New York, 10, 11–13 (1993)
Lee, K., Bresler, Y.: Admira: Atomic decomposition for minimum rank approximation. IEEE Transactions on Information Theory 56(9), 4402–4416 (2010)
Lee, K., Li, Y., Junge, M., Bresler, Y.: Blind recovery of sparse signals from subsampled convolution. IEEE Transactions on Information Theory 63(2), 802–821 (2017)
Lee, K., Tian, N., Romberg, J.: Fast and guaranteed blind multichannel deconvolution under a bilinear system model. IEEE Transactions on Information Theory 64(7), 4792–4818 (2018)
Lerman, G., Maunu, T.: Fast, robust and non-convex subspace recovery. Information and Inference: A Journal of the IMA 7(2), 277–336 (2017)
Li, Q., Tang, G.: The nonconvex geometry of low-rank matrix optimizations with general objective functions. In: 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pp. 1235–1239. IEEE (2017)
Li, X., Ling, S., Strohmer, T., Wei, K.: Rapid, robust, and reliable blind deconvolution via nonconvex optimization. Applied and computational harmonic analysis (2018)
Li, X., Wang, Z., Lu, J., Arora, R., Haupt, J., Liu, H., Zhao, T.: Symmetry, saddle points, and global geometry of nonconvex matrix factorization. arXiv preprint arXiv:1612.09296 (2016)
Li, Y., Lee, K., Bresler, Y.: Blind gain and phase calibration for low-dimensional or sparse signal sensing via power iteration. In: Sampling Theory and Applications (SampTA), 2017 International Conference on, pp. 119–123. IEEE (2017)
Li, Y., Ma, C., Chen, Y., Chi, Y.: Nonconvex matrix factorization from rank-one measurements. arXiv preprint arXiv:1802.06286 (2018)
Lin, J., Camoriano, R., Rosasco, L.: Generalization properties and implicit regularization for multiple passes SGM. In: International Conference on Machine Learning, pp. 2340–2348 (2016)
Ling, S., Strohmer, T.: Self-calibration and biconvex compressive sensing. Inverse Problems 31(11), 115002 (2015)
Ling, S., Strohmer, T.: Regularized gradient descent: a non-convex recipe for fast joint blind deconvolution and demixing. Information and Inference: A Journal of the IMA 8(1), 1–49 (2018)
Lu, Y.M., Li, G.: Phase transitions of spectral initialization for high-dimensional nonconvex estimation. arXiv preprint arXiv:1702.06435 (2017)
Mathias, R.: The spectral norm of a nonnegative matrix. Linear Algebra Appl. 139, 269–284 (1990). https://doi.org/10.1016/0024-3795(90)90403-Y.
Mathias, R.: Perturbation bounds for the polar decomposition. SIAM Journal on Matrix Analysis and Applications 14(2), 588–597 (1993)
Maunu, T., Zhang, T., Lerman, G.: A well-tempered landscape for non-convex robust subspace recovery. Journal of Machine Learning Research 20(37), 1–59 (2019)
Mei, S., Bai, Y., Montanari, A.: The landscape of empirical risk for nonconvex losses. The Annals of Statistics 46(6A), 2747–2774 (2018)
Mondelli, M., Montanari, A.: Fundamental limits of weak recovery with applications to phase retrieval. Foundations of Computational Mathematics, pp. 1–71 (2017)
Negahban, S., Wainwright, M.J.: Restricted strong convexity and weighted matrix completion: optimal bounds with noise. J. Mach. Learn. Res. 13, 1665–1697 (2012)
Netrapalli, P., Jain, P., Sanghavi, S.: Phase retrieval using alternating minimization. Advances in Neural Information Processing Systems (NIPS) (2013)
Netrapalli, P., Niranjan, U., Sanghavi, S., Anandkumar, A., Jain, P.: Non-convex robust PCA. In: Advances in Neural Information Processing Systems, pp. 1107–1115 (2014)
Qu, Q., Zhang, Y., Eldar, Y., Wright, J.: Convolutional phase retrieval via gradient descent. Neural Information Processing Systems (2017)
Recht, B.: A simpler approach to matrix completion. Journal of Machine Learning Research 12(Dec), 3413–3430 (2011)
Recht, B., Fazel, M., Parrilo, P.A.: Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review 52(3), 471–501 (2010)
Rudelson, M., Vershynin, R.: Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability 18 (2013)
Sanghavi, S., Ward, R., White, C.D.: The local convexity of solving systems of quadratic equations. Results in Mathematics 71(3-4), 569–608 (2017)
Schmitt, B.A.: Perturbation bounds for matrix square roots and Pythagorean sums. Linear Algebra Appl. 174, 215–227 (1992). https://doi.org/10.1016/0024-3795(92)90052-C.
Schniter, P., Rangan, S.: Compressive phase retrieval via generalized approximate message passing. IEEE Transactions on Signal Processing 63(4), 1043–1055 (2015)
Schudy, W., Sviridenko, M.: Concentration and moment inequalities for polynomials of independent random variables. In: Symposium on Discrete Algorithms, pp. 437–446. ACM, New York (2012)
Shechtman, Y., Beck, A., Eldar, Y.C.: GESPAR: Efficient phase retrieval of sparse signals. IEEE Transactions on Signal Processing 62(4), 928–938 (2014)
Soltanolkotabi, M.: Algorithms and theory for clustering and nonconvex quadratic programming. Ph.D. thesis, Stanford University (2014)
Soltanolkotabi, M.: Structured signal recovery from quadratic measurements: Breaking sample complexity barriers via nonconvex optimization. IEEE Transactions on Information Theory 65(4), 2374–2400 (2019)
Soltanolkotabi, M., Javanmard, A., Lee, J.D.: Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory 65(2), 742–769 (2019)
Soudry, D., Hoffer, E., Nacson, M.S., Gunasekar, S., Srebro, N.: The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research 19(1), 2822–2878 (2018)
Sun, J., Qu, Q., Wright, J.: A geometric analysis of phase retrieval. In: Information Theory (ISIT), 2016 IEEE International Symposium on, pp. 2379–2383. IEEE (2016)
Sun, J., Qu, Q., Wright, J.: Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Transactions on Information Theory 63(2), 853–884 (2017)
Sun, R., Luo, Z.Q.: Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory 62(11), 6535–6579 (2016)
Sur, P., Chen, Y., Candès, E.J.: The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square. arXiv preprint arXiv:1706.01191, accepted to Probability Theory and Related Fields (2017)
Tan, Y.S., Vershynin, R.: Phase retrieval via randomized Kaczmarz: Theoretical guarantees. Information and Inference: A Journal of the IMA 8(1), 97–123 (2018)
Tanner, J., Wei, K.: Low rank matrix completion by alternating steepest descent methods. Applied and Computational Harmonic Analysis 40(2), 417–429 (2016)
Tao, T.: Topics in Random Matrix Theory. Graduate Studies in Mathematics. American Mathematical Society, Providence, Rhode Island (2012)
Ten Berge, J.M.: Orthogonal procrustes rotation for two or more matrices. Psychometrika 42(2), 267–276 (1977)
Tropp, J.A.: Convex recovery of a structured signal from independent random linear measurements. In: Sampling Theory, a Renaissance, pp. 67–101. Springer (2015)
Tropp, J.A.: An introduction to matrix concentration inequalities. Found. Trends Mach. Learn. 8(1-2), 1–230 (2015). https://doi.org/10.1561/2200000048.
Tu, S., Boczar, R., Simchowitz, M., Soltanolkotabi, M., Recht, B.: Low-rank solutions of linear matrix equations via procrustes flow. In: International Conference on Machine Learning, pp. 964–973. JMLR.org (2016)
Vershynin, R.: Introduction to the non-asymptotic analysis of random matrices. Compressed Sensing, Theory and Applications, pp. 210–268 (2012)
Wang, G., Giannakis, G., Saad, Y., Chen, J.: Solving most systems of random quadratic equations. In: Advances in Neural Information Processing Systems, pp. 1867–1877 (2017)
Wang, G., Giannakis, G.B., Eldar, Y.C.: Solving systems of random quadratic equations via truncated amplitude flow. IEEE Transactions on Information Theory (2017)
Wang, G., Zhang, L., Giannakis, G.B., Akçakaya, M., Chen, J.: Sparse phase retrieval via truncated amplitude flow. IEEE Transactions on Signal Processing 66(2), 479–491 (2018)
Wang, L., Chi, Y.: Blind deconvolution from multiple sparse inputs. IEEE Signal Processing Letters 23(10), 1384–1388 (2016)
Wedin, P.Å.: Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics 12(1), 99–111 (1972)
Wei, K.: Solving systems of phaseless equations via Kaczmarz methods: A proof of concept study. Inverse Problems 31(12), 125008 (2015)
Wei, K., Cai, J.F., Chan, T.F., Leung, S.: Guarantees of riemannian optimization for low rank matrix recovery. SIAM Journal on Matrix Analysis and Applications 37(3), 1198–1222 (2016)
Yu, Y., Wang, T., Samworth, R.J.: A useful variant of the Davis-Kahan theorem for statisticians. Biometrika 102(2), 315–323 (2015). https://doi.org/10.1093/biomet/asv008.
Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. International Conference on Learning Representations (2017)
Zhang, H., Chi, Y., Liang, Y.: Provable non-convex phase retrieval with outliers: Median truncated Wirtinger flow. In: International conference on machine learning, pp. 1022–1031 (2016)
Zhang, H., Zhou, Y., Liang, Y., Chi, Y.: A nonconvex approach for phase retrieval: Reshaped Wirtinger flow and incremental algorithms. Journal of Machine Learning Research (2017)
Zhang, Y., Lau, Y., Kuo, H.w., Cheung, S., Pasupathy, A., Wright, J.: On the global geometry of sphere-constrained sparse blind deconvolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4894–4902 (2017)
Zhao, T., Wang, Z., Liu, H.: A nonconvex optimization framework for low rank matrix estimation. In: Advances in Neural Information Processing Systems, pp. 559–567 (2015)
Zheng, Q., Lafferty, J.: A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. In: Advances in Neural Information Processing Systems, pp. 109–117 (2015)
Zheng, Q., Lafferty, J.: Convergence analysis for rectangular matrix completion using Burer-Monteiro factorization and gradient descent. arXiv preprint arXiv:1605.07051 (2016)
Zhong, K., Song, Z., Jain, P., Bartlett, P.L., Dhillon, I.S.: Recovery guarantees for one-hidden-layer neural networks. In: International Conference on Machine Learning, pp. 4140–4149. JMLR.org (2017)
Zhong, Y., Boumal, N.: Near-optimal bounds for phase synchronization. SIAM Journal on Optimization 28(2), 989–1016 (2018)
Acknowledgements
Y. Chen is supported in part by the AFOSR YIP Award FA9550-19-1-0030, ONR Grant N00014-19-1-2120, ARO Grant W911NF-18-1-0303, NSF Grant CCF-1907661, and the Princeton SEAS innovation award. Y. Chi is supported in part by the Grants AFOSR FA9550-15-1-0205, ONR N00014-18-1-2142 and N00014-19-1-2404, ARO W911NF-18-1-0303, NSF CCF-1826519, ECCS-1818571, CCF-1806154. Y. Chen thanks Yudong Chen for inspiring discussions about matrix completion.
Additional information
Communicated by Emmanuel J. Candès.
Appendices
Proofs for Phase Retrieval
Before proceeding, we gather a few simple facts. The standard concentration inequality for \(\chi ^{2}\) random variables together with the union bound reveals that the sampling vectors \(\{{\varvec{a}}_{j}\}\) obey
$$\begin{aligned} \max _{1\le j\le m}\left\| {\varvec{a}}_{j}\right\| _{2}\le 6\sqrt{n} \end{aligned}$$(98)
with probability at least \(1-O(me^{-1.5n})\). In addition, standard Gaussian concentration inequalities give
$$\begin{aligned} \max _{1\le j\le m}\left| {\varvec{a}}_{j}^{\top }{\varvec{x}}^{\star }\right| \le 5\sqrt{\log n}\left\| {\varvec{x}}^{\star }\right\| _{2} \end{aligned}$$(99)
with probability exceeding \(1-O(mn^{-10})\).
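These two facts are easy to probe numerically. The following sketch (illustrative only; the constants 6 and 5 are simply those used in the proofs) draws an i.i.d. Gaussian design and compares the empirical maxima with the stated bounds for a unit-norm \({\varvec{x}}^{\star }\).

```python
# Illustrative Monte Carlo check of the two high-probability facts above:
#   max_j ||a_j||_2 <= 6 * sqrt(n)   and   max_j |a_j^T x*| <= 5 * sqrt(log n)
# for a unit-norm x* and i.i.d. Gaussian design vectors a_j.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
m = 10 * n
A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)
x_star /= np.linalg.norm(x_star)

max_norm = np.linalg.norm(A, axis=1).max()     # typically ~ sqrt(n)
max_corr = np.abs(A @ x_star).max()            # typically ~ sqrt(2 log m)
print(f"max_j ||a_j||_2  = {max_norm:7.2f}  vs bound {6 * np.sqrt(n):7.2f}")
print(f"max_j |a_j^T x*| = {max_corr:7.2f}  vs bound {5 * np.sqrt(np.log(n)):7.2f}")
```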
1.1 Proof of Lemma 1
We start with the smoothness bound, namely \(\nabla ^{2}f({\varvec{x}})\preceq O(\log n)\cdot {\varvec{I}}_{n}\). It suffices to prove the upper bound \({\left\| \nabla ^{2}f\left( {\varvec{x}}\right) \right\| \lesssim \log n}\). To this end, we first decompose the Hessian (cf. (44)) into three components as follows:
$$\begin{aligned} \nabla ^{2}f({\varvec{x}})&=\underbrace{\frac{3}{m}\sum _{j=1}^{m}\left[ \big ({\varvec{a}}_{j}^{\top }{\varvec{x}}\big )^{2}-\big ({\varvec{a}}_{j}^{\top }{\varvec{x}}^{\star }\big )^{2}\right] {\varvec{a}}_{j}{\varvec{a}}_{j}^{\top }}_{:={\varvec{\Lambda }}_{1}}+\underbrace{\frac{2}{m}\sum _{j=1}^{m}\big ({\varvec{a}}_{j}^{\top }{\varvec{x}}^{\star }\big )^{2}{\varvec{a}}_{j}{\varvec{a}}_{j}^{\top }-2\left( {\varvec{I}}_{n}+2{\varvec{x}}^{\star }{\varvec{x}}^{\star \top }\right) }_{:={\varvec{\Lambda }}_{2}}\\&\quad +\underbrace{2\left( {\varvec{I}}_{n}+2{\varvec{x}}^{\star }{\varvec{x}}^{\star \top }\right) }_{:={\varvec{\Lambda }}_{3}}, \end{aligned}$$
where we have used \(y_j=({\varvec{a}}_j^{\top }{\varvec{x}}^{\star })^2\). In the sequel, we control the three terms \({\varvec{\Lambda }}_{1}\), \({\varvec{\Lambda }}_{2}\), and \({\varvec{\Lambda }}_{3}\) in reverse order.
The third term \({\varvec{\Lambda }}_{3}\) can be easily bounded by
$$\begin{aligned} \left\| {\varvec{\Lambda }}_{3}\right\| \le 2\left( \left\| {\varvec{I}}_{n}\right\| +2\left\| {\varvec{x}}^{\star }{\varvec{x}}^{\star \top }\right\| \right) =6. \end{aligned}$$
The second term \({\varvec{\Lambda }}_{2}\) can be controlled by means of Lemma 32:
$$\begin{aligned} \left\| {\varvec{\Lambda }}_{2}\right\| \le 2\delta \end{aligned}$$for an arbitrarily small constant \(\delta >0\), as long as \(m\ge c_{0}n\log n\) for \(c_{0}\) sufficiently large.
It thus remains to control \({\varvec{\Lambda }}_{1}\). To this end, observe that
$$\begin{aligned} \left\| {\varvec{\Lambda }}_{1}\right\|&\le \left\| \frac{3}{m}\sum _{j=1}^{m}\left| {\varvec{a}}_{j}^{\top }\left( {\varvec{x}}-{\varvec{x}}^{\star }\right) \right| \left| {\varvec{a}}_{j}^{\top }\left( {\varvec{x}}+{\varvec{x}}^{\star }\right) \right| {\varvec{a}}_{j}{\varvec{a}}_{j}^{\top }\right\| . \end{aligned}$$(100)
Under the assumption \(\max _{1\le j\le m}\left| {\varvec{a}}_{j}^{\top }\left( {\varvec{x}}-{\varvec{x}}^{\star }\right) \right| \le C_{2}\sqrt{\log n}\) and fact (99), we can also obtain
$$\begin{aligned} \max _{1\le j\le m}\left| {\varvec{a}}_{j}^{\top }\left( {\varvec{x}}+{\varvec{x}}^{\star }\right) \right| \le&2\max _{1\le j\le m}\left| {\varvec{a}}_{j}^{\top }{\varvec{x}}^{\star }\right| +\max _{1\le j\le m}\left| {\varvec{a}}_{j}^{\top }\left( {\varvec{x}}-{\varvec{x}}^{\star }\right) \right| \nonumber \\ {}&\le \left( 10+C_{2}\right) \sqrt{\log n}. \end{aligned}$$Substitution into (100) leads to
$$\begin{aligned} \left\| {\varvec{\Lambda }}_{1}\right\|&\le 3C_{2}\left( 10+C_{2}\right) \log n\cdot \Bigg \Vert \frac{1}{m}\sum _{j=1}^{m}{\varvec{a}}_{j}{\varvec{a}}_{j}^{\top }\Bigg \Vert \le 4C_{2}\left( 10+C_{2}\right) \log n, \end{aligned}$$where the last inequality is a direct consequence of Lemma 31.
Combining the above bounds on \({\varvec{\Lambda }}_{1}\), \({\varvec{\Lambda }}_{2}\), and \({\varvec{\Lambda }}_{3}\) yields
$$\begin{aligned} \big \Vert \nabla ^{2}f\left( {\varvec{x}}\right) \big \Vert \le \left\| {\varvec{\Lambda }}_{1}\right\| +\left\| {\varvec{\Lambda }}_{2}\right\| +\left\| {\varvec{\Lambda }}_{3}\right\| \le 4C_{2}\left( 10+C_{2}\right) \log n+2\delta +6\le 5C_{2}\left( 10+C_{2}\right) \log n \end{aligned}$$
as long as n is sufficiently large. This establishes the claimed smoothness property.
Next, we move on to the strong convexity lower bound. Picking a constant \(C>0\) and enforcing proper truncation, we get
We begin with the simpler term \({\varvec{\Lambda }}_{5}\). Lemma 32 implies that with probability at least \(1-O(n^{-10})\),
holds for any small constant \(\delta >0\), as long as \(m/(n\log n)\) is sufficiently large. This reveals that
To bound \({\varvec{\Lambda }}_{4}\), invoke Lemma 33 to conclude that with probability at least \(1-c_{3}e^{-c_{2}m}\) (for some constants \(c_{2},c_{3}>0\)),
for any small constant \(\delta >0\), provided that m / n is sufficiently large. Here,
where the expectation is taken with respect to \(\xi \sim \mathcal {N}(0,1)\). By the assumption \(\left\| {\varvec{x}}-{\varvec{x}}^{\star }\right\| _{2}\le 2C_{1}\), one has
which leads to
This further implies
Recognizing that \(\beta _{1}\) (resp. \(\beta _{2}\)) approaches 2 (resp. 1) as C grows, we can thus take \(C_{1}\) small enough and C large enough to guarantee that
Putting the preceding two bounds on \({\varvec{\Lambda }}_{4}\) and \({\varvec{\Lambda }}_{5}\) together yields
as claimed.
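Numerically, the conclusion of Lemma 1 is easy to visualize: at points within the region of incoherence, the empirical Hessian is positive definite with smallest eigenvalue bounded away from zero and largest eigenvalue growing only logarithmically. A small illustrative sketch (constants arbitrary):

```python
# Illustrative check of Lemma 1: at an incoherent perturbation of x*, the
# Hessian (1/m) sum_j [3 (a_j^T x)^2 - y_j] a_j a_j^T is well conditioned.
import numpy as np

rng = np.random.default_rng(2)
n, m = 60, 6000
x_star = rng.standard_normal(n)
x_star /= np.linalg.norm(x_star)
A = rng.standard_normal((m, n))
y = (A @ x_star) ** 2

def hessian(x):
    w = 3 * (A @ x) ** 2 - y            # per-sample Hessian weights
    return (A.T * w) @ A / m

# A random direction is nearly orthogonal to every a_j, so this perturbation
# keeps max_j |a_j^T (x - x*)| on the order of sqrt(log n) (incoherent).
d = rng.standard_normal(n)
d /= np.linalg.norm(d)
eigs = np.linalg.eigvalsh(hessian(x_star + 0.1 * d))
print("smallest / largest Hessian eigenvalue:", eigs[0], eigs[-1])
```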
1.2 Proof of Lemma 2
Using the update rule (cf. (17)) as well as the fundamental theorem of calculus [70, Chapter XIII, Theorem 4.2], we get
where we denote \({\varvec{x}}\left( \tau \right) ={\varvec{x}}^{\star }+\tau ({\varvec{x}}^{t}-{\varvec{x}}^{\star })\), \(0\le \tau \le 1\). Here, the first equality makes use of the fact that \(\nabla f({\varvec{x}}^{\star })={\varvec{0}}\). Under condition (45), it is self-evident that for all \(0\le \tau \le 1\),
This means that for all \(0\le \tau \le 1\),
in view of Lemma 1. Picking \(\eta \le {1} / \left[ 5C_{2}\left( 10+C_{2}\right) \log n\right] \) (and hence \(\Vert \eta \nabla ^{2}f({\varvec{x}}(\tau ))\Vert \le 1\)), one sees that
which immediately yields
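In schematic form, and assuming the two-sided Hessian bound \(\frac{1}{2}{\varvec{I}}_{n}\preceq \nabla ^{2}f({\varvec{x}}(\tau ))\preceq \frac{1}{\eta }{\varvec{I}}_{n}\) supplied by Lemma 1 along the segment (this is a paraphrased summary of the estimates above), the contraction reads
$$\begin{aligned} \big \Vert {\varvec{x}}^{t+1}-{\varvec{x}}^{\star }\big \Vert _{2}&=\left\| \left[ {\varvec{I}}_{n}-\eta \int _{0}^{1}\nabla ^{2}f\left( {\varvec{x}}\left( \tau \right) \right) \mathrm {d}\tau \right] \left( {\varvec{x}}^{t}-{\varvec{x}}^{\star }\right) \right\| _{2}\\&\le \sup _{0\le \tau \le 1}\left\| {\varvec{I}}_{n}-\eta \nabla ^{2}f\left( {\varvec{x}}\left( \tau \right) \right) \right\| \big \Vert {\varvec{x}}^{t}-{\varvec{x}}^{\star }\big \Vert _{2}\le \left( 1-\eta /2\right) \big \Vert {\varvec{x}}^{t}-{\varvec{x}}^{\star }\big \Vert _{2}.\end{aligned}$$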
1.3 Proof of Lemma 3
We start with proving (19a). For all \(0\le t\le T_{0}\), invoke Lemma 2 recursively with conditions (47) to reach
This finishes the proof of (19a) for \(0\le t\le T_{0}\) and also reveals that
provided that \(\eta \asymp 1/\log n\). Applying the Cauchy–Schwarz inequality together with fact (98) indicates that
which guarantees that condition (45) is satisfied. Therefore, invoking Lemma 2 yields
One can then repeat this argument to arrive at, for all \(t > T_{0}\),
We are left with (19b). It is self-evident that the iterates for \(0\le t\le T_{0}\) satisfy (19b) by assumption. For \(t>T_{0}\), we can use the Cauchy–Schwarz inequality to obtain
where the penultimate relation uses conditions (98) and (103).
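The induction just established can also be watched empirically: along the vanilla GD trajectory, the incoherence measure \(\max _{1\le j\le m}|{\varvec{a}}_{j}^{\top }({\varvec{x}}^{t}-{\varvec{x}}^{\star })|\) stays of order \(\sqrt{\log n}\) at every iteration even though nothing enforces it. An illustrative sketch (parameters arbitrary):

```python
# Illustrative experiment: track the incoherence measure
#   max_j |a_j^T (x^t - x*)| / sqrt(log n)
# along the vanilla GD trajectory; it stays O(1) with no regularization.
import numpy as np

rng = np.random.default_rng(3)
n, m = 200, 2000
x_star = rng.standard_normal(n)
x_star /= np.linalg.norm(x_star)
A = rng.standard_normal((m, n))
y = (A @ x_star) ** 2

Y = (A.T * y) @ A / m                  # spectral initialization
w, V = np.linalg.eigh(Y)
x = np.sqrt(max(w[-1], 0.0) / 3.0) * V[:, -1]
if np.linalg.norm(x - x_star) > np.linalg.norm(x + x_star):
    x = -x                             # resolve the global sign ambiguity

eta = 0.1
for t in range(101):
    if t % 25 == 0:
        dist = np.linalg.norm(x - x_star)
        incoh = np.abs(A @ (x - x_star)).max() / np.sqrt(np.log(n))
        print(f"t={t:3d}  ||x^t - x*||_2 = {dist:.2e}  "
              f"incoherence ratio = {incoh:.2f}")
    Ax = A @ x
    x -= eta * (A.T @ ((Ax ** 2 - y) * Ax) / m)
```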
1.4 Proof of Lemma 4
First, going through the same derivation as in (54) and (55) results in
for some \(C_{4}<C_{2}\), which will be helpful for our analysis.
We use the gradient update rules once again to decompose
where the last line comes from the definitions of \(\nabla f\left( \cdot \right) \) and \(\nabla f^{(l)}\left( \cdot \right) \).
- 1.
We first control the term \({\varvec{\nu }}_{2}^{\left( l\right) }\), which is easier to deal with. Specifically,
$$\begin{aligned} \Vert {\varvec{\nu }}_{2}^{\left( l\right) }\Vert _{2}&\le \eta \frac{\Vert {\varvec{a}}_{l}\Vert _{2}}{m}\left| \big ({\varvec{a}}_{l}^{\top } {\varvec{x}}^{t,\left( l\right) }\big )^{2}-\big ({\varvec{a}}_{l}^{\top }{\varvec{x}}^{\star }\big )^{2}\right| \left| {\varvec{a}}_{l}^{\top }{\varvec{x}}^{t,\left( l\right) }\right| \\&\overset{(\text {i})}{\lesssim }C_{4}(C_{4}+5)(C_{4}+10)\eta \frac{n\log n}{m}\sqrt{\frac{\log n}{n}}\overset{(\text {ii})}{\le }c\eta \sqrt{\frac{\log n}{n}}, \end{aligned}$$for any small constant \(c>0\). Here, (i) follows from (98) together with the following two bounds, which hold in view of (99) and (104):
$$\begin{aligned} \left| \big ({\varvec{a}}_{l}^{\top }{\varvec{x}}^{t,\left( l\right) }\big )^{2}-\big ({\varvec{a}}_{l}^{\top } {\varvec{x}}^{\star }\big )^{2}\right|&\le \left| {\varvec{a}}_{l}^{\top }\big ({\varvec{x}}^{t,\left( l\right) } -{\varvec{x}}^{\star }\big )\right| \left( \left| {\varvec{a}}_{l}^{\top }\big ({\varvec{x}}^{t,\left( l\right) } -{\varvec{x}}^{\star }\big )\right| +2\left| {\varvec{a}}_{l}^{\top }{\varvec{x}}^{\star }\right| \right) \\&\le C_{4} (C_{4}+10)\log n, \\ \text {and}\qquad \left| {\varvec{a}}_{l}^{\top }{\varvec{x}}^{t,\left( l\right) }\right|&\le \left| {\varvec{a}}_{l}^{\top }\big ({\varvec{x}}^{t,\left( l\right) }-{\varvec{x}}^{\star }\big )\right| + \left| {\varvec{a}}_{l}^{\top }{\varvec{x}}^{\star }\right| \\&\le (C_{4}+5)\sqrt{\log n}. \end{aligned}$$
Inequality (ii) holds as long as \(m\gg n\log n\).
- 2.
For the term \({\varvec{\nu }}_{1}^{\left( l\right) }\), the fundamental theorem of calculus [70, Chapter XIII, Theorem 4.2] tells us that
$$\begin{aligned} {\varvec{\nu }}_{1}^{\left( l\right) }=\left[ {\varvec{I}}_{n}-\eta \int _{0}^{1} \nabla ^{2}f\left( {\varvec{x}}\left( \tau \right) \right) \mathrm {d}\tau \right] \big ({\varvec{x}}^{t}-{\varvec{x}}^{t,\left( l\right) }\big ), \end{aligned}$$where we abuse the notation and denote \({\varvec{x}}\left( \tau \right) ={\varvec{x}}^{t,\left( l\right) }+\tau ({\varvec{x}}^{t}-{\varvec{x}}^{t,\left( l\right) })\). By the induction hypotheses (51) and condition (104), one can verify that
$$\begin{aligned}&\big \Vert {\varvec{x}}\left( \tau \right) -{\varvec{x}}^{\star }\big \Vert _{2}\le \tau \big \Vert {\varvec{x}}^{t}-{\varvec{x}}^{\star }\big \Vert _{2}+(1-\tau )\big \Vert {\varvec{x}}^{t,(l)} -{\varvec{x}}^{\star }\big \Vert _{2}\le 2C_{1}\qquad \text {and}\nonumber \\&\quad \max _{1\le l\le m}\left| {\varvec{a}}_{l}^{\top }\big ({\varvec{x}}\left( \tau \right) -{\varvec{x}}^{\star }\big )\right| \le \tau \max _{1\le l\le m} \left| {\varvec{a}}_{l}^{\top }\big ({\varvec{x}}^{t}-{\varvec{x}}^{\star }\big )\right| +(1-\tau )\nonumber \\&\quad \max _{1\le l\le m}\left| {\varvec{a}}_{l}^{\top }\big ({\varvec{x}}^{t,(l)}-{\varvec{x}}^{\star }\big )\right| \le C_{2}\sqrt{\log n} \end{aligned}$$(105)for all \(0\le \tau \le 1\), as long as \(C_{4}\le C_{2}\). The second line follows directly from (104). To see why (105) holds, we note that
$$\begin{aligned} \big \Vert {\varvec{x}}^{t,(l)}-{\varvec{x}}^{\star }\big \Vert _{2}\le \big \Vert {\varvec{x}}^{t,(l)}-{\varvec{x}}^{t}\big \Vert _{2}+\big \Vert {\varvec{x}}^{t}-{\varvec{x}}^{\star }\big \Vert _{2}\le C_{3}\sqrt{\frac{\log n}{n}}+C_{1}, \end{aligned}$$where the second inequality follows from the induction hypotheses (51b) and (51a). This combined with (51a) gives
$$\begin{aligned} \left\| {\varvec{x}}\left( \tau \right) -{\varvec{x}}^{\star }\right\| _{2} \le \tau C_{1}+\left( 1-\tau \right) \left( C_{3}\sqrt{\frac{\log n}{n}}+C_{1}\right) \le 2C_{1} \end{aligned}$$as long as n is large enough, thus justifying (105). Hence, by Lemma 1, \(\nabla ^{2}f\left( {\varvec{x}}\left( \tau \right) \right) \) is positive definite and almost well conditioned. By choosing \(0<\eta \le {1} / \left[ {5C_{2}\left( 10+C_{2}\right) \log n}\right] \), we get
$$\begin{aligned} \big \Vert {\varvec{\nu }}_{1}^{\left( l\right) }\big \Vert _{2}&\le \left( 1-\eta /2\right) \big \Vert {\varvec{x}}^{t}-{\varvec{x}}^{t,\left( l\right) }\big \Vert _{2}. \end{aligned}$$
- 3.
Combine the preceding bounds on \({\varvec{\nu }}_{1}^{\left( l\right) }\) and \({\varvec{\nu }}_{2}^{\left( l\right) }\) as well as the induction bound (51b) to arrive at
$$\begin{aligned} \big \Vert {\varvec{x}}^{t+1}-{\varvec{x}}^{t+1,\left( l\right) }\big \Vert _{2}&\le \left( 1-\eta /2\right) \big \Vert {\varvec{x}}^{t}-{\varvec{x}}^{t,\left( l\right) }\big \Vert _{2}+c\eta \sqrt{\frac{\log n}{n}}\le C_{3}\sqrt{\frac{\log n}{n}}. \end{aligned}$$(106)
This establishes (53) for the \((t+1)\)th iteration.
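To illustrate the mechanism behind Lemma 4, one can explicitly construct the leave-one-out sequences in simulation: the \(l\)-th auxiliary run is vanilla GD on the loss with the \(l\)-th sample removed, and it remains within \(O(\sqrt{\log n / n})\) of the true iterates. A simplified sketch follows (for brevity it reuses a common initialization rather than the leave-one-out spectral initializations used in the proofs):

```python
# Simplified illustration of the leave-one-out sequences x^{t,(l)}: run
# vanilla GD with the l-th sample removed and measure its distance to the
# full-data iterate x^t. (The proofs use leave-one-out initializations as
# well; here a common initialization is used purely for brevity.)
import numpy as np

rng = np.random.default_rng(4)
n, m = 50, 1000
x_star = rng.standard_normal(n)
x_star /= np.linalg.norm(x_star)
A = rng.standard_normal((m, n))
y = (A @ x_star) ** 2

def run_gd(mask, x0, eta=0.1, iters=50):
    """Vanilla GD on the squared loss over the samples selected by mask."""
    Am, ym = A[mask], y[mask]
    x = x0.copy()
    for _ in range(iters):
        Ax = Am @ x
        x -= eta * (Am.T @ ((Ax ** 2 - ym) * Ax) / m)   # keep the 1/m scaling
    return x

x0 = x_star + 0.05 * rng.standard_normal(n)   # a crude common initialization
x_full = run_gd(np.ones(m, dtype=bool), x0)
gaps = []
for l in range(10):                           # a few leave-one-out runs
    mask = np.ones(m, dtype=bool)
    mask[l] = False
    gaps.append(np.linalg.norm(x_full - run_gd(mask, x0)))
print("max_l ||x^t - x^{t,(l)}||_2 =", max(gaps),
      "  vs  sqrt(log n / n) =", np.sqrt(np.log(n) / n))
```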
1.5 Proof of Lemma 5
In view of the assumption (42) that \(\left\| {\varvec{x}}^{0}-{\varvec{x}}^{\star }\right\| _{2}\le \left\| {\varvec{x}}^{0}+{\varvec{x}}^{\star }\right\| _{2}\) and the fact that \({\varvec{x}}^{0} =\sqrt{\lambda _{1}\left( {\varvec{Y}}\right) /3}\;\widetilde{{\varvec{x}}}^{0}\) for some \(\lambda _{1}\left( {\varvec{Y}}\right) >0\) (which we will verify below), it is straightforward to see that
One can then invoke the Davis–Kahan sin\(\Theta \) theorem [124, Corollary 1] to obtain
Note that (56)—\(\Vert {\varvec{Y}}-\mathbb {E}[{\varvec{Y}}]\Vert \le \delta \)—is a direct consequence of Lemma 32. Additionally, the fact that \(\mathbb {E}\left[ {\varvec{Y}}\right] ={\varvec{I}}+2{\varvec{x}}^{\star }{\varvec{x}}^{\star \top }\) gives \(\lambda _{1}\left( \mathbb {E}\left[ {\varvec{Y}}\right] \right) =3\), \(\lambda _{2}\left( \mathbb {E}\left[ {\varvec{Y}}\right] \right) =1\), and \(\lambda _{1}\left( \mathbb {E}\left[ {\varvec{Y}}\right] \right) -\lambda _{2}\left( \mathbb {E}\left[ {\varvec{Y}}\right] \right) =2\). Combining this spectral gap and the inequality \(\Vert {\varvec{Y}}-\mathbb {E}[{\varvec{Y}}]\Vert \le \delta \), we arrive at
To connect this bound with \({\varvec{x}}^{0}\), we need to take into account the scaling factor \(\sqrt{\lambda _{1}\left( {\varvec{Y}}\right) /3}\). To this end, it follows from Weyl’s inequality and (56) that
and, as a consequence, \(\lambda _{1}\left( {\varvec{Y}}\right) \ge 3-\delta >0\) when \(\delta \le 1\). This further implies that
where we have used the elementary identity \(\sqrt{a}-\sqrt{b}=\left( a-b\right) /(\sqrt{a}+\sqrt{b})\). With these bounds in place, we can use the triangle inequality to get
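The quantities appearing in this proof are also easy to examine numerically. The following illustrative sketch forms \({\varvec{Y}}\), checks that \(\Vert {\varvec{Y}}-\mathbb {E}[{\varvec{Y}}]\Vert \) is small once \(m\) is large, and verifies that the rescaled top eigenvector lands close to \({\varvec{x}}^{\star }\) (all sizes arbitrary):

```python
# Illustrative companion to Lemma 5: spectral initialization via the top
# eigenvector of Y = (1/m) sum_j y_j a_j a_j^T, rescaled by sqrt(lambda_1/3),
# together with the deviation ||Y - E[Y]|| where E[Y] = I + 2 x* x*^T.
import numpy as np

rng = np.random.default_rng(5)
n, m = 100, 8000
x_star = rng.standard_normal(n)
x_star /= np.linalg.norm(x_star)
A = rng.standard_normal((m, n))
y = (A @ x_star) ** 2

Y = (A.T * y) @ A / m
EY = np.eye(n) + 2 * np.outer(x_star, x_star)
w, V = np.linalg.eigh(Y)
x0 = np.sqrt(w[-1] / 3.0) * V[:, -1]

print("||Y - E[Y]|| =", np.linalg.norm(Y - EY, 2))
print("initialization error (up to sign):",
      min(np.linalg.norm(x0 - x_star), np.linalg.norm(x0 + x_star)))
```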
1.6 Proof of Lemma 6
To begin with, repeating the same argument as in Lemma 5 (which we omit here for conciseness), we see that for any fixed constant \(\delta >0\),
holds with probability at least \(1-O(mn^{-10})\) as long as \(m\gg n \log n\). The \(\ell _{2}\) bound on \(\Vert {\varvec{x}}^{0}-{\varvec{x}}^{0,\left( l\right) }\Vert _{2}\) is derived as follows.
- 1.
We start by controlling \(\big \Vert \widetilde{{\varvec{x}}}^{0}-\widetilde{{\varvec{x}}}^{0,\left( l\right) }\big \Vert _{2}\). Combining (57) and (108) yields
$$\begin{aligned} \big \Vert \widetilde{{\varvec{x}}}^{0}-\widetilde{{\varvec{x}}}^{0,\left( l\right) } \big \Vert _{2}\le \big \Vert \widetilde{{\varvec{x}}}^{0}-{\varvec{x}}^{\star }\big \Vert _{2}+\big \Vert \widetilde{{\varvec{x}}}^{0,\left( l\right) }-{\varvec{x}}^{\star }\big \Vert _{2}\le 2\sqrt{2}\delta . \end{aligned}$$For \(\delta \) sufficiently small, this implies that \(\big \Vert \widetilde{{\varvec{x}}}^{0}-\widetilde{{\varvec{x}}}^{0,\left( l\right) }\big \Vert _{2}\le \big \Vert \widetilde{{\varvec{x}}}^{0}+\widetilde{{\varvec{x}}}^{0,\left( l\right) }\big \Vert _{2}\), and hence, the Davis–Kahan sin\(\Theta \) theorem [39] gives
$$\begin{aligned} \big \Vert \widetilde{{\varvec{x}}}^{0}-\widetilde{{\varvec{x}}}^{0,\left( l\right) }\big \Vert _{2}&\le \frac{\left\| \big ({\varvec{Y}}-{\varvec{Y}}^{(l)}\big )\widetilde{{\varvec{x}}}^{0,\left( l\right) }\right\| _{2}}{\lambda _{1}\left( {\varvec{Y}}\right) -\lambda _{2}\left( {\varvec{Y}}^{\left( l\right) }\right) }\le \big \Vert \big ({\varvec{Y}}-{\varvec{Y}}^{(l)}\big )\widetilde{{\varvec{x}}}^{0,\left( l\right) }\big \Vert _{2}. \end{aligned}$$(109)
Here, the second inequality uses Weyl’s inequality:
$$\begin{aligned} \lambda _{1}\big ({\varvec{Y}}\big )-\lambda _{2}\big ({\varvec{Y}}^{\left( l\right) }\big )&\ge \lambda _{1}(\mathbb {E}[{\varvec{Y}}])-\big \Vert {\varvec{Y}}-\mathbb {E}[{\varvec{Y}}]\big \Vert -\lambda _{2}(\mathbb {E}[{\varvec{Y}}^{(l)}])-\big \Vert {\varvec{Y}}^{\left( l\right) }-\mathbb {E}[{\varvec{Y}}^{(l)}]\big \Vert \\&\ge 3-\delta -1-\delta \ge 1, \end{aligned}$$with the proviso that \(\delta \le 1/2\).
- 2.
We now connect \(\Vert {\varvec{x}}^{0}-{\varvec{x}}^{0,(l)}\Vert _{2}\) with \(\Vert \widetilde{{\varvec{x}}}^{0}-\widetilde{{\varvec{x}}}^{0,(l)}\Vert _{2}\). Applying Weyl’s inequality and (56) yields
$$\begin{aligned} \left| \lambda _{1}\left( {\varvec{Y}}\right) -3\right| \le \Vert {\varvec{Y}}- \mathbb {E}[{\varvec{Y}}]\Vert \le \delta \qquad \Longrightarrow \qquad \lambda _{1} ({\varvec{Y}})\in \left[ 3-\delta ,3+\delta \right] \subseteq [2,4] \end{aligned}$$(110)and, similarly, \(\lambda _{1}({\varvec{Y}}^{(l)}),\Vert {\varvec{Y}}\Vert ,\Vert {\varvec{Y}}^{(l)}\Vert \in \left[ 2,4\right] \). Invoke Lemma 34 to arrive at
$$\begin{aligned} \frac{1}{\sqrt{3}}\big \Vert {\varvec{x}}^{0}-{\varvec{x}}^{0,(l)}\big \Vert _{2}&\le \frac{\big \Vert \big ({\varvec{Y}}-{\varvec{Y}}^{(l)}\big ) \widetilde{{\varvec{x}}}^{0,(l)}\big \Vert _{2}}{2\sqrt{2}}+\left( 2+\frac{4}{\sqrt{2}}\right) \big \Vert \widetilde{{\varvec{x}}}^{0}-\widetilde{{\varvec{x}}}^{0,(l)}\big \Vert _{2} \nonumber \\&\le 6\big \Vert \big ({\varvec{Y}}-{\varvec{Y}}^{(l)}\big ) \widetilde{{\varvec{x}}}^{0,(l)}\big \Vert _{2}, \end{aligned}$$(111)where the last inequality comes from (109).
- 3.
Everything then boils down to controlling \(\left\| \left( {\varvec{Y}}-{\varvec{Y}}^{(l)}\right) \widetilde{{\varvec{x}}}^{0,\left( l\right) }\right\| _{2}\). To this end, we observe that
$$\begin{aligned} \max _{1\le l\le m}\big \Vert \big ({\varvec{Y}}-{\varvec{Y}}^{(l)}\big )\widetilde{{\varvec{x}}}^{0,\left( l\right) }\big \Vert _{2}&=\max _{1\le l\le m}\frac{1}{m}\left\| \left( {\varvec{a}}_{l}^{\top }{\varvec{x}}^{\star }\right) ^{2}{\varvec{a}}_{l}{\varvec{a}}_{l}^{\top }\widetilde{{\varvec{x}}}^{0,\left( l\right) }\right\| _{2}\nonumber \\&\le \max _{1\le l\le m}\frac{\left( {\varvec{a}}_{l}^{\top }{\varvec{x}}^{\star }\right) ^{2}\big |{\varvec{a}}_{l}^{\top } \widetilde{{\varvec{x}}}^{0,\left( l\right) }\big | \big \Vert {\varvec{a}}_{l}\big \Vert _{2}}{m}\nonumber \\&\overset{\left( \text {i}\right) }{\lesssim }\frac{\log n\cdot \sqrt{\log n}\cdot \sqrt{n}}{m} \nonumber \\&\asymp \sqrt{\frac{\log n}{n}}\cdot \frac{n\log n}{m}. \end{aligned}$$(112)
Inequality (i) makes use of the fact \(\max _{l}\left| {\varvec{a}}_{l}^{\top }{\varvec{x}}^{\star }\right| \le 5\sqrt{\log n}\) (cf. (99)), the bound \(\max _{l}\Vert {\varvec{a}}_{l}\Vert _{2}\le 6\sqrt{n}\) (cf. (98)), and \(\max _{l}\left| {\varvec{a}}_{l}^{\top }\widetilde{{\varvec{x}}}^{0,\left( l\right) }\right| \le 5\sqrt{\log n}\) (due to statistical independence and standard Gaussian concentration). As long as \(m/(n\log n)\) is sufficiently large, substituting the above bound (112) into (111) leads us to conclude that
$$\begin{aligned} \max _{1\le l\le m}\big \Vert {\varvec{x}}^{0}-{\varvec{x}}^{0,\left( l\right) }\big \Vert _{2}\le C_{3}\sqrt{\frac{\log n}{n}} \end{aligned}$$(113)for any constant \(C_{3}>0\).
Proofs for Matrix Completion
Before proceeding to the proofs, let us record an immediate consequence of the incoherence property (25):
$$\begin{aligned} \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\le \sqrt{\frac{\mu r}{n}}\left\| {\varvec{X}}^{\star }\right\| =\sqrt{\frac{\mu r\sigma _{\max }}{n}}=\sqrt{\frac{\kappa \mu r}{n}}\,\sigma _{\min }^{1/2}, \end{aligned}$$(114)
where \(\kappa =\sigma _{\max }/\sigma _{\min }\) is the condition number of \({\varvec{M}}^{\star }\). This follows since
$$\begin{aligned} \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }=\big \Vert {\varvec{U}}^{\star }\big ({\varvec{\Sigma }}^{\star }\big )^{1/2}\big \Vert _{2,\infty }\le \big \Vert {\varvec{U}}^{\star }\big \Vert _{2,\infty }\big \Vert \big ({\varvec{\Sigma }}^{\star }\big )^{1/2}\big \Vert \le \sqrt{\frac{\mu r}{n}}\,\sigma _{\max }^{1/2}=\sqrt{\frac{\mu r}{n}}\left\| {\varvec{X}}^{\star }\right\| . \end{aligned}$$
Unless otherwise specified, we use the indicator variable \(\delta _{j,k}\) to denote whether the entry in location (j, k) is included in \(\Omega \). Under our model, \(\delta _{j,k}\) is a Bernoulli random variable with mean p.
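For concreteness, the sampling model and the gradient step under study can be sketched as follows (illustrative only; the step size below is heuristic rather than the one prescribed by the theory):

```python
# Illustrative sketch of Bernoulli-sampled matrix completion and vanilla GD:
# each entry (j, k) is observed independently with probability p, and GD is
# run on f(X) = (1/(4p)) || P_Omega(X X^T - M*) ||_F^2 with gradient
# (1/p) P_Omega(X X^T - M*) X. All parameters here are heuristic.
import numpy as np

rng = np.random.default_rng(6)
n, r, p = 200, 3, 0.2
X_star = rng.standard_normal((n, r))            # low-rank factor of M*
M_star = X_star @ X_star.T                      # M* is PSD with rank r
delta = rng.random((n, n)) < p                  # Bernoulli indicators delta_{j,k}
delta = np.triu(delta) | np.triu(delta, 1).T    # keep the sample set symmetric

def grad_f(X):
    residual = np.where(delta, X @ X.T - M_star, 0.0)   # P_Omega(X X^T - M*)
    return residual @ X / p

X = X_star + 0.1 * rng.standard_normal((n, r))  # a nearby initialization
eta = 0.2 / np.linalg.norm(X_star, 2) ** 2      # heuristic step ~ c / sigma_max
for _ in range(300):
    X -= eta * grad_f(X)

rel_err = np.linalg.norm(X @ X.T - M_star) / np.linalg.norm(M_star)
print(f"relative Frobenius error: {rel_err:.2e}")
```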
1.1 Proof of Lemma 7
By the expression of the Hessian in (61), one can decompose
The basic idea is to demonstrate that (1) \(\alpha _{4}\) is bounded both from above and from below, and (2) the first three terms are sufficiently small compared with \(\alpha _{4}\).
- 1.
We start by controlling \(\alpha _{4}\). It is immediate to derive the following upper bound
$$\begin{aligned} \alpha _{4}&\le \left\| {\varvec{V}}{\varvec{X}}^{\star \top }\right\| _{\mathrm {F}}^{2}+\left\| {\varvec{X}}^{\star }{\varvec{V}}^{\top }\right\| _{\mathrm {F}}^{2}\le 2\Vert {\varvec{X}}^{\star }\Vert ^{2}\left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}=2\sigma _{\max }\left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}. \end{aligned}$$When it comes to the lower bound, one discovers that
$$\begin{aligned} \alpha _{4}&=\frac{1}{2}\left\{ \left\| {\varvec{V}}{\varvec{X}}^{\star \top }\right\| _{\mathrm {F}}^{2}+\left\| {\varvec{X}}^{\star }{\varvec{V}}^{\top }\right\| _{\mathrm {F}}^{2}+2\mathrm {Tr}\left( {\varvec{X}}^{\star \top }{\varvec{V}}{\varvec{X}}^{\star \top }{\varvec{V}}\right) \right\} \nonumber \\&\ge \sigma _{\min }\left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}+\mathrm {Tr}\left[ \left( {\varvec{Z}}+{\varvec{X}}^{\star }-{\varvec{Z}}\right) ^{\top }{\varvec{V}}\left( {\varvec{Z}}+{\varvec{X}}^{\star }-{\varvec{Z}}\right) ^{\top }{\varvec{V}}\right] \nonumber \\&\ge \sigma _{\min }\left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}+\mathrm {Tr}\left( {\varvec{Z}}^{\top }{\varvec{V}}{\varvec{Z}}^{\top }{\varvec{V}}\right) {-}2\left\| {\varvec{Z}}{-}{\varvec{X}}^{\star }\right\| \left\| {\varvec{Z}}\right\| \left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}-\left\| {\varvec{Z}}{-}{\varvec{X}}^{\star }\right\| ^{2}\left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}\nonumber \\&\ge \left( \sigma _{\min }-5\delta \sigma _{\max }\right) \left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}+\mathrm {Tr}\left( {\varvec{Z}}^{\top }{\varvec{V}}{\varvec{Z}}^{\top }{\varvec{V}}\right) , \end{aligned}$$(115)where the last line comes from the assumptions that
$$\begin{aligned} \left\| {\varvec{Z}}-{\varvec{X}}^{\star }\right\| \le \delta \left\| {\varvec{X}}^{\star }\right\| \le \left\| {\varvec{X}}^{\star }\right\| \qquad \text {and}\qquad \left\| {\varvec{Z}}\right\| \le \left\| {\varvec{Z}}-{\varvec{X}}^{\star }\right\| +\left\| {\varvec{X}}^{\star }\right\| \le 2\left\| {\varvec{X}}^{\star }\right\| . \end{aligned}$$With our assumption \({\varvec{V}}={\varvec{Y}}{\varvec{H}}_{Y}-{\varvec{Z}}\) in mind, it comes down to controlling
$$\begin{aligned} \mathrm {Tr}\left( {\varvec{Z}}^{\top }{\varvec{V}}{\varvec{Z}}^{\top }{\varvec{V}}\right) = \mathrm {Tr}\left[ {\varvec{Z}}^{\top }\left( {\varvec{Y}}{\varvec{H}}_{Y}-{\varvec{Z}}\right) {\varvec{Z}}^{\top }\left( {\varvec{Y}}{\varvec{H}}_{Y}-{\varvec{Z}}\right) \right] . \end{aligned}$$From the definition of \({\varvec{H}}_{Y}\), we see from Lemma 35 that \({\varvec{Z}}^{\top }{\varvec{Y}}{\varvec{H}}_{Y}\) (and hence \({\varvec{Z}}^{\top }\left( {\varvec{Y}}{\varvec{H}}_{Y}-{\varvec{Z}}\right) \)) is a symmetric matrix, which implies that
$$\begin{aligned} \text {Tr}\left[ {\varvec{Z}}^{\top }\left( {\varvec{Y}}{\varvec{H}}_{Y}-{\varvec{Z}}\right) {\varvec{Z}}^{\top }\left( {\varvec{Y}}{\varvec{H}}_{Y}-{\varvec{Z}}\right) \right] \ge 0. \end{aligned}$$Substitution into (115) gives
$$\begin{aligned} \alpha _{4}\ge \left( \sigma _{\min }-5\delta \sigma _{\max }\right) \left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}\ge \frac{9}{10}\sigma _{\min }\left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}, \end{aligned}$$provided that \(\kappa \delta \le 1/50.\)
- 2.
For \(\alpha _{1}\), we consider the following quantity
$$\begin{aligned}&\big \Vert \mathcal {P}_{\Omega }\left( {\varvec{V}}{\varvec{X}}^{\top }+{\varvec{X}}{\varvec{V}}^{\top } \right) \big \Vert _{\mathrm {F}}^{2}\\&\quad =\left\langle \mathcal {P}_{\Omega }\left( {\varvec{V}}{\varvec{X}}^{\top }\right) ,\mathcal {P}_{\Omega }\left( {\varvec{V}} {\varvec{X}}^{\top }\right) \right\rangle +\left\langle \mathcal {P}_{\Omega }\left( {\varvec{V}}{\varvec{X}}^{\top }\right) ,\mathcal {P}_{\Omega }\left( {\varvec{X}} {\varvec{V}}^{\top }\right) \right\rangle \\&\qquad +\left\langle \mathcal {P}_{\Omega }\left( {\varvec{X}}{\varvec{V}}^{\top }\right) ,\mathcal {P}_{\Omega }\left( {\varvec{V}} {\varvec{X}}^{\top }\right) \right\rangle +\left\langle \mathcal {P}_{\Omega }\left( {\varvec{X}}{\varvec{V}}^{\top }\right) ,\mathcal {P}_{\Omega }\left( {\varvec{X}} {\varvec{V}}^{\top }\right) \right\rangle \\&\quad =2\left\langle \mathcal {P}_{\Omega }\left( {\varvec{V}}{\varvec{X}}^{\top }\right) ,\mathcal {P}_{\Omega }\left( {\varvec{V}} {\varvec{X}}^{\top }\right) \right\rangle +2\left\langle \mathcal {P}_{\Omega }\left( {\varvec{V}}{\varvec{X}}^{\top }\right) ,\mathcal {P}_{\Omega }\left( {\varvec{X}} {\varvec{V}}^{\top }\right) \right\rangle . \end{aligned}$$Similar decomposition can be performed on \(\big \Vert \mathcal {P}_{\Omega }\left( {\varvec{V}}{\varvec{X}}^{\star \top }+{\varvec{X}}^{\star } {\varvec{V}}^{\top }\right) \big \Vert _{\mathrm {F}}^{2}\) as well. These identities yield
$$\begin{aligned} \alpha _{1}&=\underbrace{\frac{1}{p}\left[ \left\langle \mathcal {P}_{\Omega } \big ({\varvec{V}}{\varvec{X}}^{\top }\big ),\mathcal {P}_{\Omega }\left( {\varvec{V}}{\varvec{X}}^{\top }\right) \right\rangle -\left\langle \mathcal {P}_{\Omega }\left( {\varvec{V}}{\varvec{X}}^{\star \top }\right) , \mathcal {P}_{\Omega }\left( {\varvec{V}}{\varvec{X}}^{\star \top }\right) \right\rangle \right] }_{:=\beta _{1}}\\&\quad +\underbrace{\frac{1}{p}\left[ \left\langle \mathcal {P}_{\Omega }\big ({\varvec{V}} {\varvec{X}}^{\top }\big ),\mathcal {P}_{\Omega }\left( {\varvec{X}}{\varvec{V}}^{\top }\right) \right\rangle -\left\langle \mathcal {P}_{\Omega }\left( {\varvec{V}}{\varvec{X}}^{\star \top }\right) ,\mathcal {P}_{\Omega } \left( {\varvec{X}}^{\star }{\varvec{V}}^{\top }\right) \right\rangle \right] }_{:=\beta _{2}}. \end{aligned}$$For \(\beta _{2}\), one has
$$\begin{aligned} \beta _{2}&=\frac{1}{p}\left\langle \mathcal {P}_{\Omega }\left( {\varvec{V}}\left( {\varvec{X}} -{\varvec{X}}^{\star }\right) ^{\top }\right) ,\mathcal {P}_{\Omega }\left( \left( {\varvec{X}} -{\varvec{X}}^{\star }\right) {\varvec{V}}^{\top }\right) \right\rangle \\&\quad +\frac{1}{p}\left\langle \mathcal {P}_{\Omega }\left( {\varvec{V}}\left( {\varvec{X}} -{\varvec{X}}^{\star }\right) ^{\top }\right) ,\mathcal {P}_{\Omega }\left( {\varvec{X}}^{\star } {\varvec{V}}^{\top }\right) \right\rangle \\&\quad +\frac{1}{p}\left\langle \mathcal {P}_{\Omega } \left( {\varvec{V}}{\varvec{X}}^{\star \top }\right) ,\mathcal {P}_{\Omega }\left( \left( {\varvec{X}} -{\varvec{X}}^{\star }\right) {\varvec{V}}^{\top }\right) \right\rangle \end{aligned}$$which together with the inequality \(\left| \langle {\varvec{A}},{\varvec{B}}\rangle \right| \le \Vert {\varvec{A}}\Vert _{\mathrm {F}}\Vert {\varvec{B}}\Vert _{\mathrm {F}}\) gives
$$\begin{aligned} \left| \beta _{2}\right| \le \frac{1}{p}\left\| \mathcal {P}_{\Omega } \left( {\varvec{V}}\left( {\varvec{X}}-{\varvec{X}}^{\star }\right) ^{\top }\right) \right\| _{\mathrm {F}}^{2}+\frac{2}{p}\left\| \mathcal {P}_{\Omega } \left( {\varvec{V}}\left( {\varvec{X}}-{\varvec{X}}^{\star }\right) ^{\top }\right) \right\| _{\mathrm {F}}\left\| \mathcal {P}_{\Omega }\left( {\varvec{X}}^{\star } {\varvec{V}}^{\top }\right) \right\| _{\mathrm {F}}. \end{aligned}$$(116)
This then calls for upper bounds on the following two terms
$$\begin{aligned} \frac{1}{\sqrt{p}}\left\| \mathcal {P}_{\Omega }\left( {\varvec{V}}\left( {\varvec{X}} -{\varvec{X}}^{\star }\right) ^{\top }\right) \right\| _{\mathrm {F}} \qquad \text {and}\qquad \frac{1}{\sqrt{p}}\left\| \mathcal {P}_{\Omega } \left( {\varvec{X}}^{\star }{\varvec{V}}^{\top }\right) \right\| _{\mathrm {F}}. \end{aligned}$$The injectivity of \(\mathcal {P}_{\Omega }\) (cf. [19, Section 4.2] or Lemma 38)—when restricted to the tangent space of \({\varvec{M}}^{\star }\)—gives: for any fixed constant \(\gamma >0\),
$$\begin{aligned} \frac{1}{\sqrt{p}}\left\| \mathcal {P}_{\Omega }\left( {\varvec{X}}^{\star }{\varvec{V}}^{\top }\right) \right\| _{\mathrm {F}}\le \left( 1+\gamma \right) \left\| {\varvec{X}}^{\star }{\varvec{V}}^{\top }\right\| _{\mathrm {F}}\le \left( 1+\gamma \right) \left\| {\varvec{X}}^{\star }\right\| \left\| {\varvec{V}}\right\| _{\mathrm {F}} \end{aligned}$$with probability at least \(1-O\left( n^{-10}\right) \), provided that \(n^{2}p / (\mu nr\log n)\) is sufficiently large. In addition,
$$\begin{aligned}&\frac{1}{p}\left\| \mathcal {P}_{\Omega }\left( {\varvec{V}}\left( {\varvec{X}}-{\varvec{X}}^{\star }\right) ^{\top }\right) \right\| _{\mathrm {F}}^{2} \\&\quad =\frac{1}{p}\sum _{1\le j,k\le n}\delta _{j,k}\left[ {\varvec{V}}_{j,\cdot }\left( {\varvec{X}}_{k,\cdot } -{\varvec{X}}_{k,\cdot }^{\star }\right) ^{\top }\right] ^{2}\\&\quad =\sum _{1\le j\le n}{\varvec{V}}_{j,\cdot }\left[ \frac{1}{p}\sum _{1\le k\le n}\delta _{j,k}\left( {\varvec{X}}_{k,\cdot }-{\varvec{X}}_{k,\cdot }^{\star }\right) ^{\top } \left( {\varvec{X}}_{k,\cdot }-{\varvec{X}}_{k,\cdot }^{\star }\right) \right] {\varvec{V}}_{j,\cdot }^{\top }\\&\quad \le \max _{1\le j\le n}\left\| \frac{1}{p}\sum _{1\le k\le n} \delta _{j,k}\left( {\varvec{X}}_{k,\cdot }-{\varvec{X}}_{k,\cdot }^{\star }\right) ^{\top } \left( {\varvec{X}}_{k,\cdot }-{\varvec{X}}_{k,\cdot }^{\star }\right) \right\| \left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}\\&\quad \le \left\{ \frac{1}{p}\max _{1\le j\le n}\sum _{1\le k\le n} \delta _{j,k}\right\} \left\{ \max _{1\le k\le n}\left\| {\varvec{X}}_{k,\cdot } -{\varvec{X}}_{k,\cdot }^{\star }\right\| _{2}^{2}\right\} \left\| {\varvec{V}} \right\| _{\mathrm {F}}^{2}\\&\quad \le \left( 1+\gamma \right) n\left\| {\varvec{X}}-{\varvec{X}}^{\star } \right\| _{2,\infty }^{2}\left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}, \end{aligned}$$with probability exceeding \(1-O\left( n^{-10}\right) \), which holds as long as \(np/\log n\) is sufficiently large. Taken collectively, the above bounds yield that for any small constant \(\gamma >0\),
$$\begin{aligned} \left| \beta _{2}\right|&\le \left( 1+\gamma \right) n\left\| {\varvec{X}}-{\varvec{X}}^{\star }\right\| _{2,\infty }^{2}\left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}\\&\quad +2\sqrt{\left( 1+\gamma \right) n\left\| {\varvec{X}}-{\varvec{X}}^{\star }\right\| _{2,\infty }^{2}\left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}\cdot \left( 1+\gamma \right) ^{2}\left\| {\varvec{X}}^{\star }\right\| ^{2}\left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}}\\&\lesssim \left( \epsilon ^{2}n\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }^{2}+\epsilon \sqrt{n}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\left\| {\varvec{X}}^{\star }\right\| \right) \left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}, \end{aligned}$$where the last inequality makes use of the assumption \(\Vert {\varvec{X}}-{\varvec{X}}^{\star }\Vert _{2,\infty }\le \epsilon \Vert {\varvec{X}}^{\star }\Vert _{2,\infty }\). The same analysis can be repeated to control \(\beta _{1}\). Altogether, we obtain
$$\begin{aligned} \left| \alpha _{1}\right| \le \left| \beta _{1}\right| +\left| \beta _{2}\right|&\lesssim \left( n\epsilon ^{2}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }^{2}+\sqrt{n}\epsilon \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\left\| {\varvec{X}}^{\star }\right\| \right) \left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}\\&\overset{\left( \text {i}\right) }{\le }\left( n\epsilon ^{2}\frac{\kappa \mu r}{n}+\sqrt{n}\epsilon \sqrt{\frac{\kappa \mu r}{n}}\right) \sigma _{\max }\left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}\overset{\left( \text {ii}\right) }{\le }\frac{1}{10}\sigma _{\min }\left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}, \end{aligned}$$where (i) utilizes the incoherence condition (114) and (ii) holds with the proviso that \(\epsilon \sqrt{\kappa ^{3}\mu r}\ll 1\).
- 3.
To bound \(\alpha _{2}\), apply the Cauchy–Schwarz inequality to get
$$\begin{aligned} \left| \alpha _{2}\right| =\left| \left\langle {\varvec{V}},\text { }\frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{X}}{\varvec{X}}^{\top } -{\varvec{M}}^{\star }\right) {\varvec{V}}\right\rangle \right| \le \left\| \frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{X}}{\varvec{X}}^{\top } -{\varvec{M}}^{\star }\right) \right\| \left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}. \end{aligned}$$In view of Lemma 43, with probability at least \(1-O\left( n^{-10}\right) \),
$$\begin{aligned} \left\| \frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{X}}{\varvec{X}}^{\top } -{\varvec{M}}^{\star }\right) \right\|&\le 2n\epsilon ^{2}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }^{2}+4\epsilon \sqrt{n}\log n\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\left\| {\varvec{X}}^{\star }\right\| \\&\le \left( 2n\epsilon ^{2}\frac{\kappa \mu r}{n}+4\epsilon \sqrt{n}\log n\sqrt{\frac{\kappa \mu r}{n}}\right) \sigma _{\max } \le \frac{1}{10}\sigma _{\min } \end{aligned}$$as soon as \(\epsilon \sqrt{\kappa ^{3}\mu r}\log n\ll 1\), where we utilize the incoherence condition (114). This in turn implies that
$$\begin{aligned} \left| \alpha _{2}\right| \le \frac{1}{10}\sigma _{\min }\left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}. \end{aligned}$$Notably, this bound holds uniformly over all \({\varvec{X}}\) satisfying the condition in Lemma 7, regardless of the statistical dependence between \({\varvec{X}}\) and the sampling set \(\Omega \).
- 4.
The last term \(\alpha _{3}\) can also be controlled using the injectivity of \(\mathcal {P}_{\Omega }\) when restricted to the tangent space of \({\varvec{M}}^{\star }\). Specifically, it follows from the bounds in [19, Section 4.2] or Lemma 38 that
$$\begin{aligned} \left| \alpha _{3}\right| \le \gamma \left\| {\varvec{V}}{\varvec{X}}^{\star \top } +{\varvec{X}}^{\star }{\varvec{V}}^{\top }\right\| _{\mathrm {F}}^{2}\le 4\gamma \sigma _{\max }\left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}\le \frac{1}{10}\sigma _{\min }\left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2} \end{aligned}$$for any \(\gamma >0\) such that \(\kappa \gamma \) is a small constant, as soon as \(n^{2}p\gg \kappa ^2\mu rn\log n\).
- 5.
Taking all the preceding bounds collectively yields
$$\begin{aligned} \mathrm {vec}\left( {\varvec{V}}\right) ^{\top }\nabla ^{2}f_{\mathrm {clean}} \left( {\varvec{X}}\right) \mathrm {vec}\left( {\varvec{V}}\right)&\ge \alpha _{4} -\left| \alpha _{1}\right| -\left| \alpha _{2}\right| -\left| \alpha _{3}\right| \\&\ge \left( \frac{9}{10}-\frac{3}{10}\right) \sigma _{\min }\left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}\ge \frac{1}{2}\sigma _{\min }\left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2} \end{aligned}$$for all \({\varvec{V}}\) satisfying our assumptions, and
$$\begin{aligned} \left| \mathrm {vec}\left( {\varvec{V}}\right) ^{\top }\nabla ^{2}f_{\mathrm {clean}} \left( {\varvec{X}}\right) \mathrm {vec}\left( {\varvec{V}}\right) \right|&\le \alpha _{4} +\left| \alpha _{1}\right| +\left| \alpha _{2}\right| +\left| \alpha _{3}\right| \\&\le \left( 2\sigma _{\max }+\frac{3}{10}\sigma _{\min }\right) \left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2}\le \frac{5}{2}\sigma _{\max }\left\| {\varvec{V}}\right\| _{\mathrm {F}}^{2} \end{aligned}$$for all \({\varvec{V}}\). Since this upper bound holds uniformly over all \({\varvec{V}}\), we conclude that
$$\begin{aligned} \left\| \nabla ^{2}f_{\mathrm {clean}}\left( {\varvec{X}}\right) \right\| \le \frac{5}{2}\sigma _{\max } \end{aligned}$$as claimed.
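As a quick numerical sanity check of this two-sided bound (purely illustrative, not part of the proof), the following Python sketch evaluates the Hessian quadratic form of \(f_{\mathrm {clean}}\left( {\varvec{X}}\right) =\left( 4p\right) ^{-1}\left\| \mathcal {P}_{\Omega }\left( {\varvec{X}}{\varvec{X}}^{\top }-{\varvec{M}}^{\star }\right) \right\| _{\mathrm {F}}^{2}\) along an aligned error direction; the instance sizes, sampling rate, and the particular choice of direction \({\varvec{V}}\) are our own assumptions rather than the exact constraint set of Lemma 7.

```python
import numpy as np

# Sanity check (illustrative only): for f_clean(X) = (4p)^{-1} ||P_Omega(X X^T
# - M_star)||_F^2, the Hessian quadratic form along an aligned error direction
# V should land in [0.5*sigma_min, 2.5*sigma_max] * ||V||_F^2, per Lemma 7.
rng = np.random.default_rng(0)
n, r, p = 400, 3, 0.4

X_star = rng.normal(size=(n, r)) / np.sqrt(n)   # well-conditioned, incoherent-ish
M_star = X_star @ X_star.T
eigs = np.linalg.eigvalsh(X_star.T @ X_star)
sigma_min, sigma_max = eigs[0], eigs[-1]

mask = np.triu(rng.random((n, n)) < p, 1)
Omega = mask | mask.T                           # symmetric sampling pattern

def hess_quad(X, V):
    """Exact quadratic form vec(V)^T nabla^2 f_clean(X) vec(V)."""
    S = (V @ X.T + X @ V.T) * Omega
    R = (X @ X.T - M_star) * Omega
    return np.sum(S ** 2) / (2 * p) + np.sum(R * (V @ V.T)) / p

def procrustes(Y, Z):
    """Optimal rotation H = argmin over orthogonal R of ||Y R - Z||_F."""
    U, _, Wt = np.linalg.svd(Y.T @ Z)
    return U @ Wt

X = X_star + 0.02 * rng.normal(size=(n, r)) / np.sqrt(n)   # point near X_star
Y = X_star + 0.02 * rng.normal(size=(n, r)) / np.sqrt(n)
V = Y @ procrustes(Y, X_star) - X_star                     # aligned direction
V /= np.linalg.norm(V)

q = hess_quad(X, V)
print(f"0.5*sigma_min = {0.5*sigma_min:.3f}  <=  {q:.3f}  <=  "
      f"2.5*sigma_max = {2.5*sigma_max:.3f}")
```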
1.2 Proof of Lemma 8
Given that \(\widehat{{\varvec{H}}}^{t+1}\) is chosen to minimize the error in terms of the Frobenius norm (cf. (26)), we have
$$\begin{aligned} \left\| {\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t+1}-{\varvec{X}}^{\star }\right\| _{\mathrm {F}}&\le \left\| {\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\right\| _{\mathrm {F}}=\left\| \left( {\varvec{X}}^{t}-\eta \nabla f\left( {\varvec{X}}^{t}\right) \right) \widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\right\| _{\mathrm {F}}\\&\overset{\left( \text {i}\right) }{=}\left\| {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-\eta \nabla f\big ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\big )-{\varvec{X}}^{\star }\right\| _{\mathrm {F}}\\&\overset{\left( \text {ii}\right) }{=}\left\| {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-\eta \nabla f_{\mathrm {clean}}\big ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\big )+\eta \frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{E}}\right) {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\right\| _{\mathrm {F}}\\&\le \underbrace{\left\| {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-\eta \nabla f_{\mathrm {clean}}\big ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\big )-\left( {\varvec{X}}^{\star }-\eta \nabla f_{\mathrm {clean}}\left( {\varvec{X}}^{\star }\right) \right) \right\| _{\mathrm {F}}}_{:=\alpha _{1}}+\underbrace{\eta \left\| \frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{E}}\right) {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\right\| _{\mathrm {F}}}_{:=\alpha _{2}}, \end{aligned}$$(117)
where (i) follows from the identity \(\nabla f({\varvec{X}}^{t}{\varvec{R}})=\nabla f\left( {\varvec{X}}^{t}\right) {\varvec{R}}\) for any orthonormal matrix \({\varvec{R}}\in \mathcal {O}^{r\times r}\), (ii) arises from the definitions of \(\nabla f\left( {\varvec{X}}\right) \) and \(\nabla f_{\mathrm {clean}}\left( {\varvec{X}}\right) \) (see (59) and (60), respectively), and the last inequality (117) utilizes the triangle inequality and the fact that \(\nabla f_{\mathrm {clean}}({\varvec{X}}^{\star })={\varvec{0}}\). It thus suffices to control \(\alpha _{1}\) and \(\alpha _{2}\).
- 1.
For the second term \(\alpha _{2}\) in (117), it is easy to see that with probability at least \(1-O\left( n^{-10}\right) \),
$$\begin{aligned} \alpha _{2}\le \eta \left\| \frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{E}}\right) \right\| \left\| {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\right\| _{\mathrm {F}}\le 2\eta \left\| \frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{E}}\right) \right\| \left\| {\varvec{X}}^{\star }\right\| _{\mathrm {F}}\le 2\eta C\sigma \sqrt{\frac{n}{p}}\Vert {\varvec{X}}^{\star }\Vert _{\mathrm {F}} \end{aligned}$$for some absolute constant \(C > 0\). Here, the second inequality holds because \(\big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\big \Vert _{\mathrm {F}} \le \big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big \Vert _{\mathrm {F}}+\left\| {\varvec{X}}^{\star }\right\| _{\mathrm {F}} \le 2\left\| {\varvec{X}}^{\star }\right\| _{\mathrm {F}}\), following hypothesis (28a) together with our assumptions on the noise and the sample complexity. The last inequality makes use of Lemma 40.
- 2.
For the first term \(\alpha _{1}\) in (117), the fundamental theorem of calculus [70, Chapter XIII, Theorem 4.2] reveals
$$\begin{aligned}&\mathrm {vec}\left[ {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-\eta \nabla f_{\mathrm {clean}}\big ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\big )-\left( {\varvec{X}}^{\star }-\eta \nabla f_{\mathrm {clean}}\big ({\varvec{X}}^{\star }\big )\right) \right] \nonumber \\&\quad =\mathrm {vec}\left[ {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\right] -\eta \cdot \mathrm {vec}\left[ \nabla f_{\mathrm {clean}}\big ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\big )-\nabla f_{\mathrm {clean}}\left( {\varvec{X}}^{\star }\right) \right] \nonumber \\&\quad =\Bigg ({\varvec{I}}_{nr}-\eta \underset{:={\varvec{A}}}{\underbrace{\int _{0}^{1}\nabla ^{2}f_{\mathrm {clean}}\left( {\varvec{X}}(\tau )\right) \mathrm {d}\tau }}\Bigg )\mathrm {vec}\left( {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\right) , \end{aligned}$$(118)where we denote \({\varvec{X}}(\tau ):={\varvec{X}}^{\star }+\tau ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star })\). Taking the squared Euclidean norm of both sides of equality (118) leads to
$$\begin{aligned} \left( \alpha _{1}\right) ^{2}&=\mathrm {vec}\big ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big )^{\top } \left( {\varvec{I}}_{nr}-\eta {\varvec{A}}\right) ^{2}\mathrm {vec}\big ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t} -{\varvec{X}}^{\star }\big )\nonumber \\&=\mathrm {vec}\big ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big )^{\top } \left( {\varvec{I}}_{nr}-2\eta {\varvec{A}}+\eta ^{2}{\varvec{A}}^{2}\right) \mathrm {vec} \big ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big )\nonumber \\&\le \left\| {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star } \right\| _{\mathrm {F}}^{2}+\eta ^{2}\left\| {\varvec{A}}\right\| ^{2}\left\| {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\right\| _{\mathrm {F}}^{2} \nonumber \\&\quad -2\eta \; \mathrm {vec}\big ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big )^{\top }{\varvec{A}}\text { } \mathrm {vec}\big ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big ), \end{aligned}$$(119)where in (119) we have used the fact that
$$\begin{aligned} \mathrm {vec}\big ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big )^{\top }{\varvec{A}}^{2} \mathrm {vec} \big ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big )\le & {} \left\| {\varvec{A}}\right\| ^{2}\left\| \mathrm {vec}\big ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big )\right\| _{2}^{2}\\= & {} \left\| {\varvec{A}}\right\| ^{2}\left\| {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\right\| _{\mathrm {F}}^{2}. \end{aligned}$$Based on condition (28b), it is easily seen that \(\forall \tau \in [0,1]\),
$$\begin{aligned} \left\| {\varvec{X}}\left( \tau \right) -{\varvec{X}}^{\star }\right\| _{2,\infty }&\le \left( C_{5}\mu r\sqrt{\frac{\log n}{np}}+\frac{C_{8}}{\sigma _{\min }}\sigma \sqrt{\frac{n\log n}{p}}\right) \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }. \end{aligned}$$Taking \({\varvec{X}}={\varvec{X}}\left( \tau \right) ,{\varvec{Y}}={\varvec{X}}^{t}\), and \({\varvec{Z}}={\varvec{X}}^{\star }\) in Lemma 7, one can easily verify the assumptions therein given our sample size condition \(n^2 p \gg \kappa ^{3} \mu ^{3} r^{3} n \log ^{3} n\) and the noise condition (27). As a result,
$$\begin{aligned}&\mathrm {vec}\big ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big )^{\top }{\varvec{A}}\;\mathrm {vec}\big ({\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big )\ge \frac{\sigma _{\min }}{2}\big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big \Vert _{\mathrm {F}}^{2}\\&\qquad \text {and}\qquad \Vert {\varvec{A}}\Vert \le \frac{5}{2}\sigma _{\max }. \end{aligned}$$Substituting these two inequalities into (119) yields
$$\begin{aligned} \left( \alpha _{1}\right) ^{2} \le&\left( 1+\frac{25}{4}\eta ^{2}\sigma _{\max }^{2}-\sigma _{\min }\eta \right) \big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big \Vert _{\mathrm {F}}^{2}\nonumber \\\le&\left( 1-\frac{\sigma _{\min }}{2}\eta \right) \big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big \Vert _{\mathrm {F}}^{2} \end{aligned}$$as long as \(0<\eta \le ({2\sigma _{\min }}) / ({25\sigma _{\max }^{2}})\), which further implies that
$$\begin{aligned} \alpha _{1}&\le \left( 1-\frac{\sigma _{\min }}{4}\eta \right) \big \Vert {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\big \Vert _{\mathrm {F}}. \end{aligned}$$ - 3.
Combining the preceding bounds on both \(\alpha _{1}\) and \(\alpha _{2}\) and making use of hypothesis (28a), we have
$$\begin{aligned}&\left\| {\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t+1}-{\varvec{X}}^{\star }\right\| _{\mathrm {F}} \le \left( 1-\frac{\sigma _{\min }}{4}\eta \right) \left\| {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t} -{\varvec{X}}^{\star }\right\| _{\mathrm {F}}+2\eta C\sigma \sqrt{\frac{n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{\mathrm {F}}\\&\quad \le \left( 1-\frac{\sigma _{\min }}{4}\eta \right) \left( C_{4}\rho ^{t}\mu r\frac{1}{\sqrt{np}}\left\| {\varvec{X}}^{\star }\right\| _{\mathrm {F}}+C_{1} \frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\left\| {\varvec{X}}^{\star } \right\| _{\mathrm {F}}\right) +2\eta C\sigma \sqrt{\frac{n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{\mathrm {F}}\\&\quad \le \left( 1-\frac{\sigma _{\min }}{4}\eta \right) C_{4}\rho ^{t}\mu r\frac{1}{\sqrt{np}}\left\| {\varvec{X}}^{\star }\right\| _{\mathrm {F}} +\left[ \left( 1-\frac{\sigma _{\min }}{4}\eta \right) \frac{C_{1}}{\sigma _{\min }} +2\eta C\right] \sigma \sqrt{\frac{n}{p}}\left\| {\varvec{X}}^{\star } \right\| _{\mathrm {F}}\\&\quad \le C_{4}\rho ^{t+1}\mu r\frac{1}{\sqrt{np}}\left\| {\varvec{X}}^{\star }\right\| _{\mathrm {F}}+C_{1}\frac{\sigma }{\sigma _{\min }} \sqrt{\frac{n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{\mathrm {F}} \end{aligned}$$as long as \(0<\eta \le ({2\sigma _{\min }}) / ({25\sigma _{\max }^{2}})\), \(1-\left( {\sigma _{\min }} / {4}\right) \cdot \eta \le \rho <1\), and \(C_{1}\) is sufficiently large. This completes the proof of the contraction with respect to the Frobenius norm.
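To see this contraction in action, here is a small, purely illustrative simulation on a noiseless instance (\({\varvec{E}}={\varvec{0}}\)); the instance sizes, sampling rate, step size, and initialization scale below are assumptions made for the demo, not the theorem's precise conditions. Vanilla gradient descent on the sampled loss exhibits the geometric decay of the aligned Frobenius error predicted by Lemma 8.

```python
import numpy as np

# Toy illustration of Lemma 8: on a noiseless instance (E = 0), gradient
# descent with a constant step size in the proven range contracts the aligned
# error ||X^t H^t - X_star||_F geometrically. All constants are assumptions.
rng = np.random.default_rng(1)
n, r, p = 300, 2, 0.3

X_star = rng.normal(size=(n, r)) / np.sqrt(n)
M_star = X_star @ X_star.T
mask = np.triu(rng.random((n, n)) < p, 1)
Omega = mask | mask.T

def procrustes(Y, Z):
    U, _, Wt = np.linalg.svd(Y.T @ Z)
    return U @ Wt

eigs = np.linalg.eigvalsh(X_star.T @ X_star)
sigma_min, sigma_max = eigs[0], eigs[-1]
eta = 2 * sigma_min / (25 * sigma_max ** 2)      # step size from the lemma
print(f"predicted per-step factor <= {1 - eta * sigma_min / 4:.4f}")

X = X_star + 0.05 * rng.normal(size=(n, r)) / np.sqrt(n)
for t in range(201):
    if t % 40 == 0:
        err = np.linalg.norm(X @ procrustes(X, X_star) - X_star)
        print(f"t = {t:3d}   ||X^t H^t - X_star||_F = {err:.3e}")
    X = X - eta * ((X @ X.T - M_star) * Omega) @ X / p   # gradient step on f_clean
```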
1.3 Proof of Lemma 9
To facilitate the analysis, we construct the following auxiliary matrix
$$\begin{aligned} \widetilde{{\varvec{X}}}^{t+1}:={\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-\eta \frac{1}{p}\mathcal {P}_{\Omega }\left[ {\varvec{X}}^{t}{\varvec{X}}^{t\top }-\left( {\varvec{M}}^{\star }+{\varvec{E}}\right) \right] {\varvec{X}}^{\star }. \end{aligned}$$(120)
With this auxiliary matrix in place, we invoke the triangle inequality to bound
$$\begin{aligned} \big \Vert {\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t+1}-{\varvec{X}}^{\star }\big \Vert \le \underbrace{\big \Vert {\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t+1}-\widetilde{{\varvec{X}}}^{t+1}\big \Vert }_{:=\alpha _{1}}+\underbrace{\big \Vert \widetilde{{\varvec{X}}}^{t+1}-{\varvec{X}}^{\star }\big \Vert }_{:=\alpha _{2}}. \end{aligned}$$(121)
- 1.
We start with the second term \(\alpha _{2}\) and show that the auxiliary matrix \(\widetilde{{\varvec{X}}}^{t+1}\) is also not far from the truth. The definition of \(\widetilde{{\varvec{X}}}^{t+1}\) allows one to express
$$\begin{aligned} \alpha _{2}&=\left\| {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-\eta \frac{1}{p} \mathcal {P}_{\Omega }\left[ {\varvec{X}}^{t}{\varvec{X}}^{t\top }-\left( {\varvec{M}}^{\star }+{\varvec{E}}\right) \right] {\varvec{X}}^{\star }-{\varvec{X}}^{\star }\right\| \nonumber \\&\le \eta \left\| \frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{E}}\right) \right\| \left\| {\varvec{X}}^{\star }\right\| +\left\| {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t} -\eta \frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{X}}^{t}{\varvec{X}}^{t\top }-{\varvec{X}}^{\star } {\varvec{X}}^{\star \top }\right) {\varvec{X}}^{\star }-{\varvec{X}}^{\star }\right\| \end{aligned}$$(122)$$\begin{aligned}&\le \eta \left\| \frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{E}}\right) \right\| \left\| {\varvec{X}}^{\star }\right\| +\underbrace{\left\| {\varvec{X}}^{t} \widehat{{\varvec{H}}}^{t}-\eta \left( {\varvec{X}}^{t}{\varvec{X}}^{t\top }-{\varvec{X}}^{\star } {\varvec{X}}^{\star \top }\right) {\varvec{X}}^{\star }-{\varvec{X}}^{\star }\right\| }_{:=\beta _{1}}\nonumber \\&\quad +\underbrace{\eta \left\| \frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{X}}^{t} {\varvec{X}}^{t\top }-{\varvec{X}}^{\star }{\varvec{X}}^{\star \top }\right) {\varvec{X}}^{\star } -\left( {\varvec{X}}^{t}{\varvec{X}}^{t\top }-{\varvec{X}}^{\star }{\varvec{X}}^{\star \top }\right) {\varvec{X}}^{\star }\right\| }_{:=\beta _{2}}, \end{aligned}$$(123)where we have used the triangle inequality to separate the population-level component (i.e., \(\beta _{1}\)), the perturbation (i.e., \(\beta _{2}\)), and the noise component. In what follows, we will denote
$$\begin{aligned} {\varvec{\Delta }}^{t}:={\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star } \end{aligned}$$which, by Lemma 35, satisfies the following symmetry property (a short numerical illustration of this Procrustes fact is given right after this proof)
$$\begin{aligned} \widehat{{\varvec{H}}}^{t\top }{\varvec{X}}^{t\top }{\varvec{X}}^{\star }={\varvec{X}}^{\star \top }{\varvec{X}}^{t} \widehat{{\varvec{H}}}^{t}\qquad \Longrightarrow \qquad {\varvec{\Delta }}^{t\top } {\varvec{X}}^{\star }={\varvec{X}}^{\star \top }{\varvec{\Delta }}^{t}. \end{aligned}$$(124)- (a)
The population-level component \(\beta _1\) is easier to control. Specifically, we first simplify its expression as
$$\begin{aligned} \beta _{1}&=\left\| {\varvec{\Delta }}^{t}-\eta \left( {\varvec{\Delta }}^{t}{\varvec{\Delta }}^{t\top }+{\varvec{\Delta }}^{t} {\varvec{X}}^{\star \top }+{\varvec{X}}^{\star }{\varvec{\Delta }}^{t\top }\right) {\varvec{X}}^{\star }\right\| \\&\le \underbrace{\left\| {\varvec{\Delta }}^{t}-\eta \left( {\varvec{\Delta }}^{t}{\varvec{X}}^{\star \top }+{\varvec{X}}^{\star } {\varvec{\Delta }}^{t\top }\right) {\varvec{X}}^{\star }\right\| }_{:=\gamma _{1}}+\underbrace{\eta \left\| {\varvec{\Delta }}^{t}{\varvec{\Delta }}^{t\top }{\varvec{X}}^{\star }\right\| }_{:=\gamma _{2}}. \end{aligned}$$The leading term \(\gamma _{1}\) can be upper bounded by
$$\begin{aligned} \gamma _{1}&=\left\| {\varvec{\Delta }}^{t}-\eta {\varvec{\Delta }}^{t}{\varvec{\Sigma }}^{\star }-\eta {\varvec{X}}^{\star } {\varvec{\Delta }}^{t\top }{\varvec{X}}^{\star }\right\| =\left\| {\varvec{\Delta }}^{t}-\eta {\varvec{\Delta }}^{t}{\varvec{\Sigma }}^{\star }-\eta {\varvec{X}}^{\star } {\varvec{X}}^{\star \top }{\varvec{\Delta }}^{t}\right\| \\&=\left\| \frac{1}{2}{\varvec{\Delta }}^{t}\left( {\varvec{I}}_{r}-2\eta {\varvec{\Sigma }}^{\star }\right) +\frac{1}{2}\left( {\varvec{I}}_{n}-2\eta {\varvec{M}}^{\star }\right) {\varvec{\Delta }}^{t}\right\| \\&\le \frac{1}{2} \left( \left\| {\varvec{I}}_{r}-2\eta {\varvec{\Sigma }}^{\star }\right\| + \left\| {\varvec{I}}_{n}-2\eta {\varvec{M}}^{\star }\right\| \right) \left\| {\varvec{\Delta }}^{t}\right\| \end{aligned}$$where the second identity follows from the symmetry property (124). By choosing \(\eta \le {1} / ({2\sigma _{\max }})\), one has \({\varvec{0}} \preceq {\varvec{I}}_{r}-2\eta {\varvec{\Sigma }}^{\star } \preceq \left( 1-2\eta \sigma _{\min }\right) {\varvec{I}}_{r}\) and \({\varvec{0}} \preceq {\varvec{I}}_{n}-2\eta {\varvec{M}}^{\star } \preceq {\varvec{I}}_{n}\), and further one can ensure
$$\begin{aligned} \gamma _{1}&\le \frac{1}{2}\left[ \left( 1-2\eta \sigma _{\min }\right) + 1\right] \left\| {\varvec{\Delta }}^{t}\right\| = \left( 1-\eta \sigma _{\min }\right) \left\| {\varvec{\Delta }}^{t}\right\| . \end{aligned}$$(125)Next, regarding the higher-order term \(\gamma _{2}\), we can easily obtain
$$\begin{aligned} \gamma _{2}&\le \eta \left\| {\varvec{\Delta }}^{t}\right\| ^{2}\left\| {\varvec{X}}^{\star }\right\| . \end{aligned}$$(126)Bounds (125) and (126) taken collectively give
$$\begin{aligned} \beta _{1}\le \left( 1-\eta \sigma _{\min }\right) \left\| {\varvec{\Delta }}^{t}\right\| +\eta \left\| {\varvec{\Delta }}^{t}\right\| ^{2}\left\| {\varvec{X}}^{\star }\right\| . \end{aligned}$$(127) - (b)
We now turn to the perturbation part \(\beta _{2}\) by showing that
$$\begin{aligned} \frac{1}{\eta }\beta _{2}&=\left\| \frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{\Delta }}^{t}{\varvec{\Delta }}^{t\top }+{\varvec{\Delta }}^{t} {\varvec{X}}^{\star \top }+{\varvec{X}}^{\star }{\varvec{\Delta }}^{t\top }\right) {\varvec{X}}^{\star }\right. \nonumber \\&\qquad \left. -\left[ {\varvec{\Delta }}^{t}{\varvec{\Delta }}^{t\top }+{\varvec{\Delta }}^{t}{\varvec{X}}^{\star \top } +{\varvec{X}}^{\star }{\varvec{\Delta }}^{t\top }\right] {\varvec{X}}^{\star }\right\| \nonumber \\&\le \underbrace{\left\| \frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{\Delta }}^{t} {\varvec{X}}^{\star \top }\right) {\varvec{X}}^{\star }-\left( {\varvec{\Delta }}^{t}{\varvec{X}}^{\star \top }\right) {\varvec{X}}^{\star }\right\| _{\mathrm {F}}}_{:=\theta _{1}}\nonumber \\&\quad +\underbrace{\left\| \frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{X}}^{\star }{\varvec{\Delta }}^{t\top }\right) {\varvec{X}}^{\star }-\left( {\varvec{X}}^{\star }{\varvec{\Delta }}^{t\top }\right) {\varvec{X}}^{\star }\right\| _{\mathrm {F}}}_{:=\theta _{2}}\nonumber \\&\quad +\underbrace{\left\| \frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{\Delta }}^{t}{\varvec{\Delta }}^{t\top }\right) {\varvec{X}}^{\star }-\left( {\varvec{\Delta }}^{t}{\varvec{\Delta }}^{t\top }\right) {\varvec{X}}^{\star }\right\| _{\mathrm {F}}}_{:=\theta _{3}}, \end{aligned}$$(128)where the last inequality holds due to the triangle inequality as well as the fact that \(\Vert {\varvec{A}}\Vert \le \Vert {\varvec{A}}\Vert _{\mathrm {F}}\). In the sequel, we shall bound the three terms separately.
For the first term \(\theta _{1}\) in (128), the lth row of \(\frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{\Delta }}^{t}{\varvec{X}}^{\star \top }\right) {\varvec{X}}^{\star }-\left( {\varvec{\Delta }}^{t}{\varvec{X}}^{\star \top }\right) {\varvec{X}}^{\star }\) is given by
$$\begin{aligned} \frac{1}{p}\sum _{j=1}^{n}\left( \delta _{l,j}-p\right) {\varvec{\Delta }}_{l,\cdot }^{t}{\varvec{X}}_{j,\cdot }^{\star \top }{\varvec{X}}_{j,\cdot }^{\star }&={\varvec{\Delta }}_{l,\cdot }^{t}\left[ \frac{1}{p}\sum _{j=1}^{n}\left( \delta _{l,j}-p\right) {\varvec{X}}_{j,\cdot }^{\star \top }{\varvec{X}}_{j,\cdot }^{\star }\right] \end{aligned}$$where, as usual, \(\delta _{l,j}={{\,\mathrm{\mathbb {1}}\,}}_{\left\{ (l,j)\in \Omega \right\} }\). Lemma 41 together with the union bound reveals that
$$\begin{aligned}&\left\| \frac{1}{p}\sum _{j=1}^{n}\left( \delta _{l,j}-p\right) {\varvec{X}}_{j,\cdot }^{\star \top }{\varvec{X}}_{j,\cdot }^{\star }\right\| \\&\quad \lesssim \frac{1}{p}\left( \sqrt{p\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }^{2}\left\| {\varvec{X}}^{\star }\right\| ^{2}\log n}+\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }^{2}\log n\right) \\&\quad \asymp \sqrt{\frac{\Vert {\varvec{X}}^{\star }\Vert _{2,\infty }^{2}\sigma _{\max }\log n}{p}}+\frac{\Vert {\varvec{X}}^{\star }\Vert _{2,\infty }^{2}\log n}{p} \end{aligned}$$for all \(1\le l\le n\) with high probability. This gives
$$\begin{aligned}&\left\| {\varvec{\Delta }}_{l,\cdot }^{t}\left[ \frac{1}{p}\sum _{j=1}^{n}\left( \delta _{l,j}-p\right) {\varvec{X}}_{j,\cdot }^{\star \top }{\varvec{X}}_{j,\cdot }^{\star }\right] \right\| _{2} \\&\quad \le \left\| {\varvec{\Delta }}_{l,\cdot }^{t}\right\| _{2}\left\| \frac{1}{p}\sum _{j}\left( \delta _{l,j}-p\right) {\varvec{X}}_{j,\cdot }^{\star \top }{\varvec{X}}_{j,\cdot }^{\star }\right\| \\&\quad \lesssim \left\| {\varvec{\Delta }}_{l,\cdot }^{t}\right\| _{2}\left\{ \sqrt{\frac{\Vert {\varvec{X}}^{\star }\Vert _{2,\infty }^{2}\sigma _{\max }\log n}{p}}+\frac{\Vert {\varvec{X}}^{\star }\Vert _{2,\infty }^{2}\log n}{p}\right\} , \end{aligned}$$which further reveals that
$$\begin{aligned} \theta _{1}&=\sqrt{\sum _{l=1}^{n}\left\| \frac{1}{p}\sum _{j}\left( \delta _{l,j}-p\right) {\varvec{\Delta }}_{l,\cdot }^{t}{\varvec{X}}_{j,\cdot }^{\star \top }{\varvec{X}}_{j,\cdot }^{\star }\right\| _{2}^{2}} \\&\lesssim \left\| {\varvec{\Delta }}^{t}\right\| _{\mathrm {F}}\left\{ \sqrt{\frac{\Vert {\varvec{X}}^{\star }\Vert _{2,\infty }^{2}\sigma _{\max }\log n}{p}}+\frac{\Vert {\varvec{X}}^{\star }\Vert _{2,\infty }^{2}\log n}{p}\right\} \\&\overset{\left( \text {i}\right) }{\lesssim }\left\| {\varvec{\Delta }}^{t}\right\| \left\{ \sqrt{\frac{\Vert {\varvec{X}}^{\star }\Vert _{2,\infty }^{2}r\sigma _{\max }\log n}{p}}+\frac{\sqrt{r}\Vert {\varvec{X}}^{\star }\Vert _{2,\infty }^{2}\log n}{p}\right\} \\&\overset{\left( \text {ii}\right) }{\lesssim }\left\| {\varvec{\Delta }}^{t}\right\| \left\{ \sqrt{\frac{\kappa \mu r^{2}\log n}{np}}+\frac{\kappa \mu r^{3/2}\log n}{np}\right\} \sigma _{\max }\\&\overset{\left( \text {iii}\right) }{\le }\gamma \sigma _{\min }\left\| {\varvec{\Delta }}^{t}\right\| , \end{aligned}$$for arbitrarily small \(\gamma >0\). Here, (i) follows from \(\left\| {\varvec{\Delta }}^{t}\right\| _{\mathrm {F}}\le \sqrt{r}\left\| {\varvec{\Delta }}^{t}\right\| \), (ii) holds owing to the incoherence condition (114), and (iii) follows as long as \(n^{2}p\gg \kappa ^{3}\mu r^{2} n \log n\).
For the second term \(\theta _{2}\) in (128), denote
$$\begin{aligned} {\varvec{A}}=\mathcal {P}_{\Omega }\left( {\varvec{X}}^{\star }{\varvec{\Delta }}^{t\top }\right) {\varvec{X}}^{\star }-p\left( {\varvec{X}}^{\star }{\varvec{\Delta }}^{t\top }\right) {\varvec{X}}^{\star }, \end{aligned}$$whose lth row is given by
$$\begin{aligned} {\varvec{A}}_{l,\cdot }={\varvec{X}}_{l,\cdot }^{\star }\sum _{j=1}^{n}\left( \delta _{l,j}-p\right) {\varvec{\Delta }}_{j,\cdot }^{t\top }{\varvec{X}}_{j,\cdot }^{\star }. \end{aligned}$$(129)Recalling the induction hypotheses (28b) and (28c), we define
$$\begin{aligned} \left\| {\varvec{\Delta }}^{t}\right\| _{2,\infty }&\le C_{5}\rho ^{t}\mu r\sqrt{\frac{\log n}{np}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }+C_{8}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n\log n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }:=\xi \end{aligned}$$(130)$$\begin{aligned} \left\| {\varvec{\Delta }}^{t}\right\|&\le C_{9}\rho ^{t}\mu r\frac{1}{\sqrt{np}}\left\| {\varvec{X}}^{\star }\right\| +C_{10}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\left\| {\varvec{X}}^{\star }\right\| :=\psi . \end{aligned}$$(131)With these two definitions in place, we now introduce a “truncation level”
$$\begin{aligned} \omega :=2p\xi \sigma _{\max } \end{aligned}$$(132)that allows us to bound \(\theta _{2}\) in terms of the following two terms
$$\begin{aligned} \theta _{2}= & {} \frac{1}{p}\sqrt{\sum _{l=1}^{n}\left\| {\varvec{A}}_{l,\cdot }\right\| _{2}^{2}}\le \frac{1}{p}\underbrace{\sqrt{\sum _{l=1}^{n}\left\| {\varvec{A}}_{l,\cdot }\right\| _{2}^{2}{{\,\mathrm{\mathbb {1}}\,}}_{\left\{ \left\| {\varvec{A}}_{l,\cdot }\right\| _{2}\le \omega \right\} }}}_{:=\phi _{1}}\\&+\frac{1}{p}\underbrace{\sqrt{\sum _{l=1}^{n}\left\| {\varvec{A}}_{l,\cdot }\right\| _{2}^{2}{{\,\mathrm{\mathbb {1}}\,}}_{\left\{ \left\| {\varvec{A}}_{l,\cdot }\right\| _{2}\ge \omega \right\} }}}_{:=\phi _{2}}. \end{aligned}$$We will apply different strategies when upper bounding the terms \(\phi _{1}\) and \(\phi _{2}\), with their bounds given in the following two lemmas under the induction hypotheses (28b) and (28c).
Lemma 22
Under the conditions in Lemma 9, there exist some constants \(c,C>0\) such that with probability exceeding \(1-c\exp (-Cnr\log n)\),
holds simultaneously for all \({\varvec{\Delta }}^{t}\) obeying (130) and (131). Here, \(\xi \) is defined in (130).
Lemma 23
Under the conditions in Lemma 9, with probability at least \(1-O\left( n^{-10}\right) \),
holds simultaneously for all \({\varvec{\Delta }}^{t}\) obeying (130) and (131). Here, \(\xi \) is defined in (130).
Bounds (133) and (134) together with the incoherence condition (114) yield
Next, we assert that the third term \(\theta _{3}\) in (128) has the same upper bound as \(\theta _{2}\). The proof follows by repeating the same argument used in bounding \(\theta _{2}\) and is hence omitted.
Take the previous three bounds on \(\theta _{1}\), \(\theta _{2}\), and \(\theta _{3}\) together to arrive at
for some constant \(\widetilde{C}>0\).
- (c)
Substituting the preceding bounds on \(\beta _{1}\) and \(\beta _{2}\) into (123), we reach
$$\begin{aligned} \alpha _{2}&\overset{\left( \text {i}\right) }{\le }\left( 1-\eta \sigma _{\min } +\eta \gamma \sigma _{\min }+\eta \left\| {\varvec{\Delta }}^{t}\right\| \left\| {\varvec{X}}^{\star }\right\| \right) \left\| {\varvec{\Delta }}^{t}\right\| +\eta \left\| \frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{E}}\right) \right\| \left\| {\varvec{X}}^{\star }\right\| \nonumber \\&\quad +\widetilde{C}\eta \sqrt{\frac{\kappa \mu r^{2}\log ^{2}n}{p}}\sigma _{\max } \left( C_{5}\rho ^{t}\mu r\sqrt{\frac{\log n}{np}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }+C_{8}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n\log n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\right) \nonumber \\&\overset{\left( \text {ii}\right) }{\le }\left( 1-\frac{\sigma _{\min }}{2}\eta \right) \left\| {\varvec{\Delta }}^{t}\right\| +\eta \left\| \frac{1}{p}\mathcal {P}_{\Omega } \left( {\varvec{E}}\right) \right\| \left\| {\varvec{X}}^{\star }\right\| \nonumber \\&\quad +\widetilde{C}\eta \sqrt{\frac{\kappa \mu r^{2}\log ^{2}n}{p}}\sigma _{\max } \left( C_{5}\rho ^{t}\mu r\sqrt{\frac{\log n}{np}}\left\| {\varvec{X}}^{\star } \right\| _{2,\infty }+C_{8}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n\log n}{p}} \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\right) \nonumber \\&\overset{\left( \text {iii}\right) }{\le }\left( 1-\frac{\sigma _{\min }}{2}\eta \right) \left\| {\varvec{\Delta }}^{t}\right\| +C\eta \sigma \sqrt{\frac{n}{p}}\left\| {\varvec{X}}^{\star }\right\| \nonumber \\&\quad +\widetilde{C}\eta \sqrt{\frac{\kappa ^{2}\mu ^{2}r^{3}\log ^{3}n}{np}} \sigma _{\max }\left( C_{5}\rho ^{t}\mu r\sqrt{\frac{1}{np}}+C_{8}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\right) \left\| {\varvec{X}}^{\star } \right\| \end{aligned}$$(135)for some constant \(C>0\). Here, (i) uses the definition of \(\xi \) (cf. (130)), (ii) holds if \(\gamma \) is small enough and \(\left\| {\varvec{\Delta }}^{t}\right\| \left\| {\varvec{X}}^{\star }\right\| \ll \sigma _{\min }\), and (iii) follows from Lemma 40 as well as the incoherence condition (114). An immediate consequence of (135) is that under the sample size condition and the noise condition of this lemma, one has
$$\begin{aligned} \big \Vert \widetilde{{\varvec{X}}}^{t+1}-{\varvec{X}}^{\star } \big \Vert \left\| {\varvec{X}}^{\star }\right\| \le \sigma _{\min } /2 \end{aligned}$$(136)if \(0<\eta \le 1/\sigma _{\max }\).
- 2.
We then move on to the first term \(\alpha _{1}\) in (121), which can be rewritten as
$$\begin{aligned} \alpha _{1}=\big \Vert {\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t}{\varvec{R}}_{1}-\widetilde{{\varvec{X}}}^{t+1} \big \Vert , \end{aligned}$$with
$$\begin{aligned} {\varvec{R}}_{1}&=\big (\widehat{{\varvec{H}}}^{t}\big )^{-1}\widehat{{\varvec{H}}}^{t+1}:=\arg \min _{{\varvec{R}}\in \mathcal {O}^{r\times r}}\big \Vert {\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t}{\varvec{R}}-{\varvec{X}}^{\star }\big \Vert _{\mathrm {F}}. \end{aligned}$$(137)- (a)
First, we claim that \(\widetilde{{\varvec{X}}}^{t+1}\) satisfies
$$\begin{aligned} {\varvec{I}}_{r}=\arg \min _{{\varvec{R}}\in \mathcal {O}^{r\times r}}\big \Vert \widetilde{{\varvec{X}}}^{t+1}{\varvec{R}}-{\varvec{X}}^{\star }\big \Vert _{\mathrm {F}}, \end{aligned}$$(138)meaning that \(\widetilde{{\varvec{X}}}^{t+1}\) is already rotated to the direction that is most “aligned” with \({\varvec{X}}^{\star }\). This important property eases the analysis. In fact, in view of Lemma 35, (138) follows if one can show that \({\varvec{X}}^{\star \top }\widetilde{{\varvec{X}}}^{t+1}\) is symmetric and positive semidefinite. First of all, it follows from Lemma 35 that \({\varvec{X}}^{\star \top }{\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\) is symmetric and, hence, by definition,
$$\begin{aligned} {\varvec{X}}^{\star \top }\widetilde{{\varvec{X}}}^{t+1}={\varvec{X}}^{\star \top }{\varvec{X}}^{t} \widehat{{\varvec{H}}}^{t}-\frac{\eta }{p}{\varvec{X}}^{\star \top }\mathcal {P}_{\Omega } \left[ {\varvec{X}}^{t}{\varvec{X}}^{t\top }-\left( {\varvec{M}}^{\star }+{\varvec{E}}\right) \right] {\varvec{X}}^{\star } \end{aligned}$$is also symmetric. Additionally,
$$\begin{aligned} \big \Vert {\varvec{X}}^{\star \top }\widetilde{{\varvec{X}}}^{t+1}-{\varvec{\Sigma }}^{\star }\big \Vert \le \big \Vert \widetilde{{\varvec{X}}}^{t+1}-{\varvec{X}}^{\star }\big \Vert \left\| {\varvec{X}}^{\star }\right\| \le \sigma _{\min } / 2, \end{aligned}$$where the second inequality holds according to (136). Weyl's inequality guarantees that
$$\begin{aligned} {\varvec{X}}^{\star \top }\widetilde{{\varvec{X}}}^{t+1}\succeq \frac{1}{2}\sigma _{\min } {\varvec{I}}_r\succ {\varvec{0}}, \end{aligned}$$which confirms that \({\varvec{X}}^{\star \top }\widetilde{{\varvec{X}}}^{t+1}\) is also positive semidefinite and hence establishes the claim (138). - (b)
With (137) and (138) in place, we resort to Lemma 37 to establish the bound. Specifically, take \({\varvec{X}}_{1}=\widetilde{{\varvec{X}}}^{t+1}\) and \({\varvec{X}}_{2}={\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t}\), and it comes from (136) that
$$\begin{aligned} \left\| {\varvec{X}}_{1}-{\varvec{X}}^{\star }\right\| \left\| {\varvec{X}}^{\star }\right\| \le \sigma _{\min }/2. \end{aligned}$$Moreover, we have
$$\begin{aligned} \left\| {\varvec{X}}_{1}-{\varvec{X}}_{2}\right\| \left\| {\varvec{X}}^{\star }\right\| =\big \Vert {\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t}-\widetilde{{\varvec{X}}}^{t+1}\big \Vert \big \Vert {\varvec{X}}^{\star }\big \Vert , \end{aligned}$$in which
$$\begin{aligned} {\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t}-\widetilde{{\varvec{X}}}^{t+1}&=\left( {\varvec{X}}^{t}-\eta \frac{1}{p}\mathcal {P}_{\Omega }\left[ {\varvec{X}}^{t}{\varvec{X}}^{t\top } -\left( {\varvec{M}}^{\star }+{\varvec{E}}\right) \right] {\varvec{X}}^{t}\right) \widehat{{\varvec{H}}}^{t}\\&\qquad -\left[ {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-\eta \frac{1}{p}\mathcal {P}_{\Omega } \left[ {\varvec{X}}^{t}{\varvec{X}}^{t\top }-\left( {\varvec{M}}^{\star }+{\varvec{E}}\right) \right] {\varvec{X}}^{\star }\right] \\&=-\eta \frac{1}{p}\mathcal {P}_{\Omega }\left[ {\varvec{X}}^{t}{\varvec{X}}^{t\top }-\left( {\varvec{M}}^{\star } +{\varvec{E}}\right) \right] \left( {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\right) . \end{aligned}$$This allows one to derive
$$\begin{aligned}&\big \Vert {\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t}-\widetilde{{\varvec{X}}}^{t+1}\big \Vert \nonumber \\&\quad \le \eta \left\| \frac{1}{p}\mathcal {P}_{\Omega }\left[ {\varvec{X}}^{t}{\varvec{X}}^{t\top }-{\varvec{M}}^{\star }\right] \left( {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\right) \right\| \nonumber \\&\qquad +\eta \left\| \frac{1}{p}\mathcal {P}_{\Omega }({\varvec{E}})\left( {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\right) \right\| \nonumber \\&\quad \le \eta \left( 2n\left\| {\varvec{\Delta }}^{t}\right\| _{2,\infty }^{2}+4\sqrt{n}\log n\left\| {\varvec{\Delta }}^{t}\right\| _{2,\infty }\left\| {\varvec{X}}^{\star }\right\| +C\sigma \sqrt{\frac{n}{p}}\right) \left\| {\varvec{\Delta }}^{t}\right\| \end{aligned}$$(139)for some absolute constant \(C > 0\). Here the last inequality follows from Lemmas 40 and 43. As a consequence,
$$\begin{aligned}&\left\| {\varvec{X}}_{1}-{\varvec{X}}_{2}\right\| \left\| {\varvec{X}}^{\star }\right\| \\&\quad \le \eta \left( 2n\left\| {\varvec{\Delta }}^{t}\right\| _{2,\infty }^{2}+4\sqrt{n}\log n\left\| {\varvec{\Delta }}^{t}\right\| _{2,\infty }\left\| {\varvec{X}}^{\star }\right\| +C\sigma \sqrt{\frac{n}{p}}\right) \left\| {\varvec{\Delta }}^{t}\right\| \left\| {\varvec{X}}^{\star }\right\| . \end{aligned}$$Under our sample size condition and the noise condition (27) and the induction hypotheses (28), one can show
$$\begin{aligned} \left\| {\varvec{X}}_{1}-{\varvec{X}}_{2}\right\| \left\| {\varvec{X}}^{\star }\right\| \le \sigma _{\min } / 4. \end{aligned}$$Apply Lemma 37 and (139) to reach
$$\begin{aligned} \alpha _{1}&\le 5\kappa \big \Vert {\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t}-\widetilde{{\varvec{X}}}^{t+1}\big \Vert \\&\le 5\kappa \eta \left( 2n\left\| {\varvec{\Delta }}^{t}\right\| _{2,\infty }^{2}+4\sqrt{n}\log n\left\| {\varvec{\Delta }}^{t}\right\| _{2,\infty }\left\| {\varvec{X}}^{\star }\right\| +C\sigma \sqrt{\frac{n}{p}}\right) \left\| {\varvec{\Delta }}^{t}\right\| . \end{aligned}$$
- 3.
Combining the above bounds on \(\alpha _{1}\) and \(\alpha _{2}\), we arrive at
$$\begin{aligned}&\big \Vert {\varvec{X}}^{t+1}\widehat{{\varvec{H}}}^{t+1}-{\varvec{X}}^{\star }\big \Vert \\&\quad \le \left( 1-\frac{\sigma _{\min }}{2}\eta \right) \left\| {\varvec{\Delta }}^{t}\right\| +\eta C\sigma \sqrt{\frac{n}{p}}\left\| {\varvec{X}}^{\star }\right\| \\&\qquad +\widetilde{C}\eta \sqrt{\frac{\kappa ^{2}\mu ^{2}r^{3}\log ^{3}n}{np}}\sigma _{\max } \left( C_{5}\rho ^{t}\mu r\sqrt{\frac{1}{np}}+\frac{C_{8}}{\sigma _{\min }}\sigma \sqrt{\frac{n}{p}}\right) \left\| {\varvec{X}}^{\star }\right\| \\&\qquad +5\eta \kappa \left( 2n\left\| {\varvec{\Delta }}^{t}\right\| _{2,\infty }^{2}+4\sqrt{n}\log n\left\| {\varvec{\Delta }}^{t}\right\| _{2,\infty }\left\| {\varvec{X}}^{\star }\right\| +C\sigma \sqrt{\frac{n}{p}}\right) \left\| {\varvec{\Delta }}^{t}\right\| \\&\quad \le C_{9}\rho ^{t+1}\mu r\frac{1}{\sqrt{np}}\left\| {\varvec{X}}^{\star }\right\| +C_{10}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\left\| {\varvec{X}}^{\star }\right\| , \end{aligned}$$with the proviso that \(\rho \ge 1-({\sigma _{\min }} / {3}) \cdot \eta \), \(\kappa \) is a constant, and \(n^{2}p\gg \mu ^{3}r^{3}n\log ^{3}n\).
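The Procrustes facts used repeatedly in this proof, namely the symmetry and positive semidefiniteness behind (124) and (138), admit a one-screen numerical check. The sketch below uses arbitrary small sizes of our own choosing and is an illustration of Lemma 35, not a proof.

```python
import numpy as np

# Check of the alignment facts behind (124) and (138): for the orthogonal
# Procrustes solution H = argmin_R ||X R - X_star||_F, the matrix
# X_star^T (X H) is symmetric positive semidefinite, and consequently
# Delta^T X_star = X_star^T Delta for Delta = X H - X_star.
rng = np.random.default_rng(2)
n, r = 60, 4

X_star = rng.normal(size=(n, r))
Q, _ = np.linalg.qr(rng.normal(size=(r, r)))       # random rotation of the factor
X = (X_star + 0.1 * rng.normal(size=(n, r))) @ Q   # perturbed, misaligned copy

U, _, Wt = np.linalg.svd(X.T @ X_star)
H = U @ Wt                                         # optimal aligning rotation

S = X_star.T @ (X @ H)
Delta = X @ H - X_star
print("asymmetry of X_star^T X H  :", np.linalg.norm(S - S.T))          # ~ 1e-13
print("smallest eigenvalue of S   :", np.linalg.eigvalsh((S + S.T) / 2).min())
print("residual in property (124) :",
      np.linalg.norm(Delta.T @ X_star - X_star.T @ Delta))              # ~ 1e-13
```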
1.3.1 Proof of Lemma 22
In what follows, we first assume that the \(\delta _{j,k}\)'s are independent, and then use the standard decoupling trick to extend the result to the symmetric sampling case (i.e., \(\delta _{j,k} = \delta _{k,j}\)).
To begin with, we justify the concentration bound for any \({\varvec{\Delta }}^{t}\) independent of \(\Omega \), followed by the standard covering argument that extends the bound to all \({\varvec{\Delta }}^{t}\). For any \({\varvec{\Delta }}^{t}\) independent of \(\Omega \), one has
where \(\xi \) and \(\psi \) are defined, respectively, in (130) and (131). Here, the last line makes use of the fact that
as long as n is sufficiently large. Apply the matrix Bernstein inequality [114, Theorem 6.1.1] to get
for some constant \(c>0\), provided that
This upper bound on t is exactly the truncation level \(\omega \) we introduce in (132). With this in mind, we can easily verify that
is a sub-Gaussian random variable with variance proxy not exceeding \(O\left( p\xi ^{2}\sigma _{\max }\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }^{2}\log r\right) \). Therefore, invoking the concentration bounds for quadratic functions [57, Theorem 2.1] yields that for some constants \(C_{0}, C>0\), with probability at least \(1-C_{0}e^{-Cnr\log n}\),
Now that we have established an upper bound for any fixed matrix \({\varvec{\Delta }}^{t}\) (which holds with exponentially high probability), we can proceed to invoke the standard epsilon-net argument to establish a uniform bound over all feasible \({\varvec{\Delta }}^{t}\). This argument is fairly standard and is thus omitted; see [111, Section 2.3.1] or the proof of Lemma 42. In conclusion, we have that with probability exceeding \(1-C_{0}e^{-\frac{1}{2}Cnr\log n}\),
holds simultaneously for all \({\varvec{\Delta }}^{t}\in \mathbb {R}^{n\times r}\) obeying the conditions of the lemma.
In the end, we comment on how to extend the bound to the symmetric sampling pattern where \(\delta _{j,k} = \delta _{k,j}\). Recall from (129) that the diagonal element \(\delta _{l,l}\) cannot change the \(\ell _{2}\) norm of \({\varvec{A}}_{l,\cdot }\) by more than \(\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }^2 \xi \). As a result, changing all the diagonals \(\{\delta _{l,l}\}\) cannot change the quantity of interest (i.e., \(\phi _{1}\)) by more than \(\sqrt{n}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }^2 \xi \). This is smaller than the right-hand side of (141) under our incoherence and sample size conditions. Hence, from now on, we ignore the effect of \(\{\delta _{l,l}\}\) and focus on off-diagonal terms. The proof then follows from the same argument as in [48, Theorem D.2]. More specifically, we can employ the construction of Bernoulli random variables introduced therein to demonstrate that the upper bound in (141) still holds if the indicator \(\delta _{i,j}\) is replaced by \((\tau _{i,j} + \tau _{i,j}') / 2\), where \(\tau _{i,j}\) and \(\tau _{i,j}'\) are independent copies of the symmetric Bernoulli random variables. Recognizing that \(\sup _{{\varvec{\Delta }}^{t}}\phi _{1}\) is a norm of the Bernoulli random variables \(\tau _{i,j}\), one can repeat the decoupling argument in [48, Claim D.3] to finish the proof. We omit the details here for brevity.
1.3.2 Proof of Lemma 23
Observe from (129) that
where \(\psi \) is as defined in (131) and \({\varvec{G}}_{l}\left( \cdot \right) \) is as defined in Lemma 41. Here, the last inequality follows from Lemma 41; namely, for some constant \(C>0\), the following holds with probability at least \(1-O(n^{-10})\)
where we also use the incoherence condition (114) and the sample complexity condition \(n^2 p\gg \kappa \mu r n \log n\). Hence, the event
together with (142) and (143) necessarily implies that
where the last inequality follows from bound (140). As a result, with probability at least \(1-O(n^{-10})\) (i.e., when (144) holds for all l’s) we can upper bound \(\phi _{2}\) by
where the indicator functions are now specified with respect to \(\left\| {\varvec{G}}_{l}\left( {\varvec{\Delta }}^{t}\right) \right\| \).
Next, we divide the indices into multiple cases based on the size of \(\left\| {\varvec{G}}_{l}\left( {\varvec{\Delta }}^{t}\right) \right\| \). By Lemma 42, for some constants \(c_{1}, c_{2} > 0\), with probability at least \(1-c_{1}\exp \left( -c_{2}nr\log n\right) \),
for any \(k\ge 0\) and any \(\alpha \gtrsim \log n\). We claim that it suffices to consider the set of sufficiently large k obeying
otherwise, we can use (140) to obtain
which contradicts the event \(\left\| {\varvec{A}}_{l,\cdot }\right\| _{2}\ge \omega \). Consequently, we divide all indices into the following sets
defined for each integer k obeying (146). Under condition (146), it follows from (145) that
meaning that the cardinality of \(S_{k}\) satisfies
which decays exponentially fast as k increases. Therefore, when restricting attention to the set of indices within \(S_{k}\), we can obtain
where (i) follows from bound (143) and constraint (147) in \(S_{k}\), (ii) is a consequence of (146), and (iii) uses the incoherence condition (114).
Now that we have developed an upper bound with respect to each \(S_{k}\), we can add them up to yield the final upper bound. Note that there are in total no more than \(O\left( \log n\right) \) different sets, i.e., \(S_{k}=\emptyset \) if \(k\ge c_{1}\log n\) for \(c_{1}\) sufficiently large. This arises since
and hence
if \(k/ \log n\) is sufficiently large. One can thus conclude that
leading to \(\phi _{2} \lesssim \xi \sqrt{\alpha \kappa \mu r^{2}p\log n}\left\| {\varvec{X}}^{\star }\right\| ^{2}\). The proof is finished by taking \(\alpha =c\log n\) for some sufficiently large constant \(c>0\).
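The peeling argument above can be distilled into the following toy computation; all quantities are synthetic stand-ins of our own making (not the actual \({\varvec{A}}_{l,\cdot }\) or \({\varvec{G}}_{l}\) from the proof). When the number of indices exceeding level \(2^{k}\omega \) decays geometrically in k, summing the per-shell bounds over the \(O(\log n)\) nonempty shells recovers the direct sum up to a constant factor.

```python
import numpy as np

# Toy version of the peeling argument: scores a_l with geometrically decaying
# tails are bucketed into dyadic shells S_k = {l : a_l in [w*2^k, w*2^{k+1})};
# each shell contributes at most |S_k| * (w*2^{k+1})^2, and only O(log n)
# shells are nonempty. Everything here is synthetic, for illustration only.
rng = np.random.default_rng(3)
n, w = 10_000, 1.0

a = w * 2.0 ** rng.geometric(p=0.75, size=n)   # tail counts decay ~ 4^{-k}
direct = np.sum(a ** 2)                        # quantity to be bounded

kmax = int(np.log2(a.max() / w)) + 1
shell_bound, shells_used = 0.0, 0
for k in range(kmax + 1):
    S_k = (a >= w * 2.0 ** k) & (a < w * 2.0 ** (k + 1))
    if S_k.any():
        shells_used += 1
        shell_bound += S_k.sum() * (w * 2.0 ** (k + 1)) ** 2

print(f"nonempty shells: {shells_used} (vs. log2(n) ~ {np.log2(n):.0f})")
print(f"direct sum: {direct:.3e}   shell-wise bound: {shell_bound:.3e}")
```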
1.4 Proof of Lemma 10
- 1.
To obtain (73a), we invoke Lemma 37. Setting \({\varvec{X}}_{1}={\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\) and \({\varvec{X}}_{2}={\varvec{X}}^{t,(l)}{\varvec{R}}^{t,\left( l\right) }\), we get
$$\begin{aligned} \left\| {\varvec{X}}_{1}-{\varvec{X}}^{\star }\right\| \left\| {\varvec{X}}^{\star }\right\| \overset{\left( \text {i}\right) }{\le }C_{9}\rho ^{t}\mu r\frac{1}{\sqrt{np}}\sigma _{\max }+\frac{C_{10}}{\sigma _{\min }}\sigma \sqrt{\frac{n\log n}{p}}\sigma _{\max }\overset{\left( \text {ii}\right) }{\le }\frac{1}{2}\sigma _{\min }, \end{aligned}$$where (i) follows from (70c) and (ii) holds as long as \(n^{2}p\gg \kappa ^{2}\mu ^{2}r^{2}n\) and the noise satisfies (27). In addition,
$$\begin{aligned} \left\| {\varvec{X}}_{1}-{\varvec{X}}_{2}\right\| \left\| {\varvec{X}}^{\star }\right\|&\le \left\| {\varvec{X}}_{1}-{\varvec{X}}_{2}\right\| _{\mathrm {F}}\left\| {\varvec{X}}^{\star }\right\| \\&\overset{\left( \text {i}\right) }{\le }\left( C_{3}\rho ^{t}\mu r\sqrt{\frac{\log n}{np}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }+\frac{C_{7}}{\sigma _{\min }}\sigma \sqrt{\frac{n\log n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\right) \left\| {\varvec{X}}^{\star }\right\| \\&\overset{\left( \text {ii}\right) }{\le }C_{3}\rho ^{t}\mu r\sqrt{\frac{\log n}{np}}\sigma _{\max }+\frac{C_{7}}{\sigma _{\min }}\sigma \sqrt{\frac{n\log n}{p}}\sigma _{\max }\\&\overset{\left( \text {iii}\right) }{\le }\frac{1}{2}\sigma _{\min }, \end{aligned}$$where (i) utilizes (70d), (ii) follows since \(\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\le \left\| {\varvec{X}}^{\star }\right\| \), and (iii) holds if \(n^{2}p\gg \kappa ^{2}\mu ^{2}r^{2}n\log n\) and the noise satisfies (27). With these in place, Lemma 37 immediately yields (73a).
- 2.
The first inequality in (73b) follows directly from the definition of \(\widehat{{\varvec{H}}}^{t,\left( l\right) }\). The second inequality is concerned with the estimation error of \({\varvec{X}}^{t,\left( l\right) }{\varvec{R}}^{t,\left( l\right) }\) with respect to the Frobenius norm. Combining (70a), (70d), and the triangle inequality yields
$$\begin{aligned}&\left\| {\varvec{X}}^{t,\left( l\right) }{\varvec{R}}^{t,\left( l\right) }-{\varvec{X}}^{\star }\right\| _{\mathrm {F}}\le \left\| {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\right\| _{\mathrm {F}}+\left\| {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{t,\left( l\right) }{\varvec{R}}^{t,\left( l\right) }\right\| _{\mathrm {F}}\nonumber \\&\quad \le C_{4}\rho ^{t}\mu r\frac{1}{\sqrt{np}}\left\| {\varvec{X}}^{\star }\right\| _{\mathrm {F}}+\frac{C_{1}\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{\mathrm {F}}+C_{3}\rho ^{t}\mu r\sqrt{\frac{\log n}{np}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\nonumber \\&\qquad +\frac{C_{7}\sigma }{\sigma _{\min }}\sqrt{\frac{n\log n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\nonumber \\&\quad \le C_{4}\rho ^{t}\mu r\frac{1}{\sqrt{np}}\left\| {\varvec{X}}^{\star }\right\| _{\mathrm {F}}+\frac{C_{1}\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{\mathrm {F}}+C_{3}\rho ^{t}\mu r\sqrt{\frac{\log n}{np}}\sqrt{\frac{\kappa \mu }{n}}\left\| {\varvec{X}}^{\star }\right\| _{\mathrm {F}}\nonumber \\&\qquad +\frac{C_{7}\sigma }{\sigma _{\min }}\sqrt{\frac{n\log n}{p}}\sqrt{\frac{\kappa \mu }{n}}\left\| {\varvec{X}}^{\star }\right\| _{\mathrm {F}}\nonumber \\&\quad \le 2C_{4}\rho ^{t}\mu r\frac{1}{\sqrt{np}}\left\| {\varvec{X}}^{\star }\right\| _{\mathrm {F}}+\frac{2C_{1}\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{\mathrm {F}}, \end{aligned}$$(148)where the last step holds true as long as \(n\gg \kappa \mu \log n\).
- 3.
To obtain (73c), we use (70d) and (70b) to get
$$\begin{aligned}&\left\| {\varvec{X}}^{t,\left( l\right) }{\varvec{R}}^{t,\left( l\right) }-{\varvec{X}}^{\star }\right\| _{2,\infty }\le \left\| {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\right\| _{2,\infty }+\left\| {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{t,\left( l\right) }{\varvec{R}}^{t,\left( l\right) }\right\| _{\mathrm {F}}\\&\quad \le C_{5}\rho ^{t}\mu r\sqrt{\frac{\log n}{np}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }+\frac{C_{8}\sigma }{\sigma _{\min }}\sqrt{\frac{n\log n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\\&\qquad +C_{3}\rho ^{t}\mu r\sqrt{\frac{\log n}{np}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }+\frac{C_{7}\sigma }{\sigma _{\min }}\sqrt{\frac{n\log n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\\&\quad \le \left( C_{3}+C_{5}\right) \rho ^{t}\mu r\sqrt{\frac{\log n}{np}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }+\frac{C_{8}+C_{7}}{\sigma _{\min }}\sigma \sqrt{\frac{n\log n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }. \end{aligned}$$
- 4.
Finally, to obtain (73d), one can apply the triangle inequality
$$\begin{aligned} \left\| {\varvec{X}}^{t,\left( l\right) }\widehat{{\varvec{H}}}^{t,\left( l\right) }-{\varvec{X}}^{\star }\right\|&\le \left\| {\varvec{X}}^{t,\left( l\right) }\widehat{{\varvec{H}}}^{t,\left( l\right) }-{\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}\right\| _{\mathrm {F}}+\left\| {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\right\| \\&\le 5\kappa \left\| {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{t,\left( l\right) }{\varvec{R}}^{t,\left( l\right) }\right\| _{\mathrm {F}}+\left\| {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{\star }\right\| , \end{aligned}$$where the second line follows from (73a). Combine (70d) and (70c) to yield
$$\begin{aligned}&\left\| {\varvec{X}}^{t,\left( l\right) }\widehat{{\varvec{H}}}^{t,\left( l\right) }-{\varvec{X}}^{\star }\right\| \\&\quad \le 5\kappa \left( C_{3}\rho ^{t}\mu r\sqrt{\frac{\log n}{np}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }+\frac{C_{7}}{\sigma _{\min }}\sigma \sqrt{\frac{n\log n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\right) \\&\qquad +C_{9}\rho ^{t}\mu r\frac{1}{\sqrt{np}}\left\| {\varvec{X}}^{\star }\right\| +\frac{C_{10}}{\sigma _{\min }}\sigma \sqrt{\frac{n}{p}}\left\| {\varvec{X}}^{\star }\right\| \\&\quad \le 5\kappa \sqrt{\frac{\kappa \mu r}{n}}\left\| {\varvec{X}}^{\star }\right\| \left( C_{3}\rho ^{t}\mu r\sqrt{\frac{\log n}{np}}+\frac{C_{7}}{\sigma _{\min }}\sigma \sqrt{\frac{n\log n}{p}}\right) \\&\qquad +C_{9}\rho ^{t}\mu r\frac{1}{\sqrt{np}}\left\| {\varvec{X}}^{\star }\right\| +\frac{C_{10}\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\left\| {\varvec{X}}^{\star }\right\| \\&\quad \le 2C_{9}\rho ^{t}\mu r\frac{1}{\sqrt{np}}\left\| {\varvec{X}}^{\star }\right\| +\frac{2C_{10}\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\left\| {\varvec{X}}^{\star }\right\| , \end{aligned}$$where the second inequality uses the incoherence of \({\varvec{X}}^{\star }\) (cf. (114)) and the last inequality holds as long as \(n\gg \kappa ^{3}\mu r\log n\).
1.5 Proof of Lemma 11
From the definition of \({\varvec{R}}^{t+1,\left( l\right) }\) (see (72)), we must have
The gradient update rules in (24) and (69) allow one to express
where we have again used the fact that \(\nabla f\left( {\varvec{X}}^{t}\right) {\varvec{R}}=\nabla f({\varvec{X}}^{t}{\varvec{R}})\) for any orthonormal matrix \({\varvec{R}}\in \mathcal {O}^{r\times r}\) (similarly for \(\nabla f^{(l)}\big ({\varvec{X}}^{t,(l)}\big )\)). Relate the right-hand side of the above equation with \(\nabla f_{\mathrm {clean}}\left( {\varvec{X}}\right) \) to reach
where we have used the following relationship between \(\nabla f^{(l)}\left( {\varvec{X}}\right) \) and \(\nabla f\left( {\varvec{X}}\right) \):
for all \({\varvec{X}}\in \mathbb {R}^{n\times r}\) with \(\mathcal {P}_{\Omega _{l}}\) and \(\mathcal {P}_{l}\) defined, respectively, in (66) and (67). In the sequel, we control the four terms in reverse order.
- 1.
The last term \({\varvec{B}}_{4}^{\left( l\right) }\) is controlled via the following lemma.
Lemma 24
Suppose that the sample size obeys \(n^2 p>C\mu ^2 r^2 n \log ^2 n\) for some sufficiently large constant \(C>0\). Then with probability at least \(1-O\left( n^{-10}\right) \), the matrix \({\varvec{B}}_{4}^{(l)}\) as defined in (149) satisfies
- 2.
The third term \({\varvec{B}}_{3}^{\left( l\right) }\) can be bounded as follows
$$\begin{aligned} \left\| {\varvec{B}}_{3}^{(l)}\right\| _{\mathrm {F}}\le \eta \left\| \frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{E}}\right) \right\| \left\| {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{t,\left( l\right) }{\varvec{R}}^{t,\left( l\right) }\right\| _{\mathrm {F}}\lesssim \eta \sigma \sqrt{\frac{n}{p}}\left\| {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{t,\left( l\right) }{\varvec{R}}^{t,\left( l\right) }\right\| _{\mathrm {F}}, \end{aligned}$$where the second inequality comes from Lemma 40.
- 3.
For the second term \({\varvec{B}}_{2}^{(l)}\), we have the following lemma.
Lemma 25
Suppose that the sample size obeys \(n^{2}p\gg \mu ^{2}r^{2}n\log n\). Then with probability exceeding \(1-O\left( n^{-10}\right) \), the matrix \({\varvec{B}}_{2}^{(l)}\) as defined in (149) satisfies
- 4.
Regarding the first term \({\varvec{B}}_{1}^{(l)}\), apply the fundamental theorem of calculus [70, Chapter XIII, Theorem 4.2] to get
$$\begin{aligned} \mathrm {vec}\big ({\varvec{B}}_{1}^{(l)}\big )=\left( {\varvec{I}}_{nr}-\eta \int _{0}^{1} \nabla ^{2}f_{\mathrm {clean}}\left( {\varvec{X}}(\tau )\right) \mathrm {d}\tau \right) \mathrm {vec}\left( {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{t,\left( l\right) } {\varvec{R}}^{t,\left( l\right) }\right) , \end{aligned}$$(152)where we abuse the notation and denote \({\varvec{X}}(\tau ):={\varvec{X}}^{t,\left( l\right) }{\varvec{R}}^{t,\left( l\right) }+\tau \left( {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{t,\left( l\right) }{\varvec{R}}^{t,\left( l\right) }\right) \). Going through the same derivations as in the proof of Lemma 8 (see Appendix B.2), we get
$$\begin{aligned} \big \Vert {\varvec{B}}_{1}^{(l)}\big \Vert _{\mathrm {F}}\le \left( 1-\frac{\sigma _{\min }}{4}\eta \right) \left\| {\varvec{X}}^{t}\widehat{{\varvec{H}}}^{t}-{\varvec{X}}^{t,\left( l\right) }{\varvec{R}}^{t,\left( l\right) }\right\| _{\mathrm {F}} \end{aligned}$$(153)with the proviso that \(0<\eta \le ({2\sigma _{\min }}) / ({25\sigma _{\max }^{2}})\).
Applying the triangle inequality to (149) and invoking the preceding four bounds, we arrive at
for some absolute constant \(\widetilde{C}>0\). Here the last inequality holds as long as \(\sigma \sqrt{{n} / {p}}\ll \sigma _{\min }\), which is satisfied under our noise condition (27). This taken collectively with hypotheses (70d) and (73c) leads to
as long as \(C_{7}>0\) is sufficiently large, where we have used the sample complexity assumption \(n^{2}p\gg \kappa ^{4}\mu ^{2}r^{2} n \log n\) and the step size \(0<\eta \le {1} / ({2\sigma _{\max }}) \le {1} / ({2\sigma _{\min }})\). This finishes the proof.
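For concreteness, the leave-one-out objects manipulated in this proof can be instantiated in a few lines. The sketch below encodes one standard reading of the construction, in which \(f^{(l)}\) replaces the sampled entries of the lth row/column by their population counterparts; since definitions (66)–(69) are not reproduced in this appendix, the operator definitions in the code are our assumptions, and the script merely verifies the resulting algebraic relationship between \(\nabla f\) and \(\nabla f^{(l)}\) of the kind invoked around (150).

```python
import numpy as np

# A concrete (assumed) reading of the leave-one-out construction: f^{(l)}
# replaces the sampled entries in the l-th row/column by population values.
# We check numerically that
#   grad f(X) - grad f^{(l)}(X)
#     = [ (1/p) P_{Omega_l}(X X^T - M_star - E) - P_l(X X^T - M_star) ] X,
# under these assumed definitions of the projection operators.
rng = np.random.default_rng(4)
n, r, p, l, sigma = 80, 3, 0.5, 7, 0.01

X_star = rng.normal(size=(n, r)) / np.sqrt(n)
M_star = X_star @ X_star.T
E = sigma * rng.normal(size=(n, n))
E = (E + E.T) / 2                                  # symmetric noise
mask = np.triu(rng.random((n, n)) < p, 1)
Omega = mask | mask.T

row_col = np.zeros((n, n), dtype=bool)
row_col[l, :] = row_col[:, l] = True               # l-th row/column support
P_Omega_l = Omega & row_col                        # sampled, inside row/col l
P_Omega_minus_l = Omega & ~row_col                 # sampled, outside row/col l

def grad_f(X):
    return ((X @ X.T - M_star - E) * Omega) @ X / p

def grad_f_loo(X):
    G = ((X @ X.T - M_star - E) * P_Omega_minus_l) / p
    G += (X @ X.T - M_star) * row_col              # population terms on row/col l
    return G @ X

X = X_star + 0.05 * rng.normal(size=(n, r)) / np.sqrt(n)
lhs = grad_f(X) - grad_f_loo(X)
rhs = (((X @ X.T - M_star - E) * P_Omega_l) / p
       - (X @ X.T - M_star) * row_col) @ X
print("decomposition residual:", np.linalg.norm(lhs - rhs))  # ~ machine precision
```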
1.5.1 Proof of Lemma 24
By the unitary invariance of the Frobenius norm, one has
where all nonzero entries of the matrix \(\mathcal {P}_{\Omega _{l}}\left( {\varvec{E}}\right) \) reside in the lth row/column. Decouple the effects of the lth row and the lth column of \(\mathcal {P}_{\Omega _{l}}\left( {\varvec{E}}\right) \) to reach
where \(\delta _{l,j}:={{\,\mathrm{\mathbb {1}}\,}}_{\left\{ (l,j)\in \Omega \right\} }\) indicates whether the (l, j)th entry is observed. Since \({\varvec{X}}^{t,\left( l\right) }\) is independent of \(\{\delta _{l,j}\}_{1\le j \le n}\) and \(\{{E}_{l,j}\}_{1\le j \le n}\), we can treat the first term as a sum of independent vectors \(\{{\varvec{u}}_{j}\}\). It is easy to verify that
where \(\Vert \cdot \Vert _{\psi _{1}}\) denotes the sub-exponential norm [66, Section A.1]. Further, one can calculate
Invoke the matrix Bernstein inequality [66, Theorem 2.7] to discover that with probability at least \(1-O\left( n^{-10}\right) \),
where the third inequality follows from \(\left\| {\varvec{X}}^{t,\left( l\right) }\right\| _{\mathrm {F}}^{2}\le n\left\| {\varvec{X}}^{t,\left( l\right) }\right\| _{2,\infty }^{2}\) and the last inequality holds as long as \(np\gg \log ^2 n\).
Additionally, the remaining term \(\alpha \) in (154) can be controlled using the same argument, giving rise to
We then complete the proof by observing that
where the last inequality follows by combining (73c), the sample complexity condition \(n^{2}p\gg \mu ^{2}r^{2}n\log n\), and the noise condition (27).
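The \(\sigma \sqrt{n/p}\)-type scaling at the heart of this lemma is easy to reproduce in simulation. The following Monte Carlo sketch (sizes and distributions are our own choices) estimates the norm of the rescaled noisy row sum and compares it with \(\sigma \left\| {\varvec{X}}\right\| _{\mathrm {F}}/\sqrt{p}\le \sigma \sqrt{n/p}\left\| {\varvec{X}}\right\| _{2,\infty }\).

```python
import numpy as np

# Monte Carlo check of the scaling used above: the row sum
#   s = (1/p) * sum_j delta_{l,j} E_{l,j} X_{j,.}
# has Euclidean norm on the order of sigma * ||X||_F / sqrt(p)
# (<= sigma * sqrt(n/p) * ||X||_{2,inf}). The instance below is synthetic.
rng = np.random.default_rng(5)
n, r, p, sigma, trials = 2000, 3, 0.2, 1.0, 200

X = rng.normal(size=(n, r)) / np.sqrt(n)
norms = []
for _ in range(trials):
    delta = rng.random(n) < p
    e = sigma * rng.normal(size=n)
    norms.append(np.linalg.norm((delta * e) @ X / p))

pred = sigma * np.linalg.norm(X) / np.sqrt(p)
print(f"empirical mean norm: {np.mean(norms):.4f}   "
      f"sigma*||X||_F/sqrt(p): {pred:.4f}")
```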
1.5.2 Proof of Lemma 25
For notational simplicity, we denote
Since the Frobenius norm is unitarily invariant, we have
Again, all nonzero entries of the matrix \({\varvec{W}}\) reside in its lth row/column. We can deal with the lth row and the lth column of \({\varvec{W}}\) separately as follows
where \(\delta _{l,j}:={{\,\mathrm{\mathbb {1}}\,}}_{\left\{ (l,j)\in \Omega \right\} }\) and the second line relies on the fact that \(\sum _{j:j\ne l}\left( \delta _{l,j}-p\right) ^{2}\asymp np\). It follows that
Here, (i) is a consequence of (155). In addition, (ii) follows from
where the last inequality comes from (73b), the sample complexity condition \(n^{2}p\gg \mu ^{2}r^{2}n\log n\), and the noise condition (27). The matrix Bernstein inequality [114, Theorem 6.1.1] reveals that
with probability exceeding \(1-O\left( n^{-10}\right) \), and as a result,
as soon as \(np\gg \log n\).
To finish up, we make the observation that
where the last line arises from (155). This combined with (157) gives
where (i) comes from (158) and (ii) makes use of the incoherence condition (114).
1.6 Proof of Lemma 12
We first introduce an auxiliary matrix
$$\begin{aligned} \widetilde{{\varvec{X}}}^{t+1,\left( l\right) }:={\varvec{X}}^{t,\left( l\right) }\widehat{{\varvec{H}}}^{t,\left( l\right) }-\eta \left[ \frac{1}{p}\mathcal {P}_{\Omega ^{-l}}\left[ {\varvec{X}}^{t,\left( l\right) }{\varvec{X}}^{t,\left( l\right) \top }-\left( {\varvec{M}}^{\star }+{\varvec{E}}\right) \right] +\mathcal {P}_{l}\left( {\varvec{X}}^{t,\left( l\right) }{\varvec{X}}^{t,\left( l\right) \top }-{\varvec{M}}^{\star }\right) \right] {\varvec{X}}^{\star }. \end{aligned}$$(159)
With this in place, we can use the triangle inequality to obtain
$$\begin{aligned} \left\| \big ({\varvec{X}}^{t+1,\left( l\right) }\widehat{{\varvec{H}}}^{t+1,\left( l\right) }-{\varvec{X}}^{\star }\big )_{l,\cdot }\right\| _{2}\le \underbrace{\left\| \big ({\varvec{X}}^{t+1,\left( l\right) }\widehat{{\varvec{H}}}^{t+1,\left( l\right) }-\widetilde{{\varvec{X}}}^{t+1,\left( l\right) }\big )_{l,\cdot }\right\| _{2}}_{:=\alpha _{1}}+\underbrace{\left\| \big (\widetilde{{\varvec{X}}}^{t+1,\left( l\right) }-{\varvec{X}}^{\star }\big )_{l,\cdot }\right\| _{2}}_{:=\alpha _{2}}. \end{aligned}$$(160)
In what follows, we bound the two terms \(\alpha _{1}\) and \(\alpha _{2}\) separately.
- 1.
Regarding the second term \(\alpha _{2}\) of (160), we see from the definition of \(\widetilde{{\varvec{X}}}^{t+1,\left( l\right) }\) (see (159)) that
$$\begin{aligned} \big (\widetilde{{\varvec{X}}}^{t+1,\left( l\right) }-{\varvec{X}}^{\star }\big )_{l,\cdot }=\left[ {\varvec{X}}^{t,\left( l\right) }\widehat{{\varvec{H}}}^{t,\left( l\right) }-\eta \big ({\varvec{X}}^{t,\left( l\right) }{\varvec{X}}^{t,\left( l\right) \top }-{\varvec{X}}^{\star }{\varvec{X}}^{\star \top }\big ){\varvec{X}}^{\star }-{\varvec{X}}^{\star }\right] _{l,\cdot }, \end{aligned}$$(161)where we also utilize the definitions of \(\mathcal {P}_{\Omega ^{-l}}\) and \(\mathcal {P}_{l}\) in (67). For notational convenience, we denote
$$\begin{aligned} {\varvec{\Delta }}^{t,\left( l\right) }:={\varvec{X}}^{t,\left( l\right) } \widehat{{\varvec{H}}}^{t,\left( l\right) }-{\varvec{X}}^{\star }. \end{aligned}$$(162)This allows us to rewrite (161) as
$$\begin{aligned} \left( \widetilde{{\varvec{X}}}^{t+1,\left( l\right) }-{\varvec{X}}^{\star }\right) _{l,\cdot }&={\varvec{\Delta }}_{l,\cdot }^{t,\left( l\right) }-\eta \left[ \left( {\varvec{\Delta }}^{t, \left( l\right) }{\varvec{X}}^{\star \top }+{\varvec{X}}^{\star }{\varvec{\Delta }}^{t, \left( l\right) \top }\right) {\varvec{X}}^{\star }\right] _{l,\cdot }\\&\quad -\eta \left[ {\varvec{\Delta }}^{t,\left( l\right) }{\varvec{\Delta }}^{t,\left( l\right) \top } {\varvec{X}}^{\star }\right] _{l,\cdot }\\&={\varvec{\Delta }}_{l,\cdot }^{t,\left( l\right) }-\eta {\varvec{\Delta }}_{l,\cdot }^{t, \left( l\right) }{\varvec{\Sigma }}^{\star }-\eta {\varvec{X}}_{l,\cdot }^{\star }{\varvec{\Delta }}^{t, \left( l\right) \top }{\varvec{X}}^{\star }-\eta {\varvec{\Delta }}_{l,\cdot }^{t,\left( l\right) } {\varvec{\Delta }}^{t,\left( l\right) \top }{\varvec{X}}^{\star }, \end{aligned}$$which further implies that
$$\begin{aligned} \alpha _{2}&\le \left\| {\varvec{\Delta }}_{l,\cdot }^{t,\left( l\right) }-\eta {\varvec{\Delta }}_{l,\cdot }^{t,\left( l\right) }{\varvec{\Sigma }}^{\star }\right\| _{2}+\eta \left\| {\varvec{X}}_{l,\cdot }^{\star }{\varvec{\Delta }}^{t,\left( l\right) \top }{\varvec{X}}^{\star }\right\| _{2}+\eta \left\| {\varvec{\Delta }}_{l,\cdot }^{t,\left( l\right) }{\varvec{\Delta }}^{t,\left( l\right) \top }{\varvec{X}}^{\star }\right\| _{2}\\&\le \left\| {\varvec{\Delta }}_{l,\cdot }^{t,\left( l\right) }\right\| _{2}\left\| {\varvec{I}}_{r}-\eta {\varvec{\Sigma }}^{\star }\right\| +\eta \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| \left\| {\varvec{X}}^{\star }\right\| +\eta \left\| {\varvec{\Delta }}_{l,\cdot }^{t,\left( l\right) }\right\| _{2}\left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| \left\| {\varvec{X}}^{\star }\right\| \\&\le \left\| {\varvec{\Delta }}_{l,\cdot }^{t,\left( l\right) }\right\| _{2}\left\| {\varvec{I}}_{r}-\eta {\varvec{\Sigma }}^{\star }\right\| +2\eta \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| \left\| {\varvec{X}}^{\star }\right\| . \end{aligned}$$Here, the last line follows from the fact that \(\left\| {\varvec{\Delta }}_{l,\cdot }^{t,\left( l\right) }\right\| _{2} \le \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\). To see this, one can use the induction hypothesis (70e) to get
$$\begin{aligned} \left\| {\varvec{\Delta }}_{l,\cdot }^{t,\left( l\right) }\right\| _{2}\le C_{2}\rho ^{t}\mu r\frac{1}{\sqrt{np}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }+C_{6}\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n\log n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\ll \left\| {\varvec{X}}^{\star }\right\| _{2,\infty } \end{aligned}$$(163)as long as \(np\gg \mu ^{2}r^{2}\) and \(\sigma \sqrt{\left( n\log n \right) / p}\ll \sigma _{\min }\). By taking \(0<\eta \le 1 / {\sigma _{\max }}\), we have \({\varvec{0}}\preceq {\varvec{I}}_{r}-\eta {\varvec{\Sigma }}^{\star } \preceq \left( 1-\eta \sigma _{\min } \right) {\varvec{I}}_{r}\) and hence can obtain
$$\begin{aligned} \alpha _{2}&\le \left( 1-\eta \sigma _{\min }\right) \left\| {\varvec{\Delta }}_{l,\cdot }^{t,\left( l\right) }\right\| _{2}+2\eta \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| \left\| {\varvec{X}}^{\star }\right\| . \end{aligned}$$(164)An immediate consequence of the above two inequalities and (73d) is
$$\begin{aligned} \alpha _{2}\le \Vert {\varvec{X}}^{\star }\Vert _{2,\infty }. \end{aligned}$$(165)
- 2.
The first term \(\alpha _{1}\) of (160) can be equivalently written as
$$\begin{aligned} \alpha _{1}=\left\| \big ({\varvec{X}}^{t+1,\left( l\right) }\widehat{{\varvec{H}}}^{t,\left( l\right) }{\varvec{R}}_{1}-\widetilde{{\varvec{X}}}^{t+1,\left( l\right) } \big )_{l,\cdot }\right\| _{2}, \end{aligned}$$where
$$\begin{aligned} {\varvec{R}}_{1}&:=\big (\widehat{{\varvec{H}}}^{t,\left( l\right) }\big )^{-1}\widehat{{\varvec{H}}}^{t+1,\left( l\right) }=\arg \min _{{\varvec{R}}\in \mathcal {O}^{r\times r}}\left\| {\varvec{X}}^{t+1,\left( l\right) }\widehat{{\varvec{H}}}^{t,\left( l\right) }{\varvec{R}}-{\varvec{X}}^{\star }\right\| _{\mathrm {F}}. \end{aligned}$$Simple algebra yields
$$\begin{aligned} \alpha _{1}&\le \left\| \left( {\varvec{X}}^{t+1,\left( l\right) }\widehat{{\varvec{H}}}^{t,\left( l\right) }-\widetilde{{\varvec{X}}}^{t+1,\left( l\right) }\right) _{l,\cdot }{\varvec{R}}_{1}\right\| _{2}+\left\| \widetilde{{\varvec{X}}}_{l,\cdot }^{t+1,\left( l\right) }\right\| _{2}\left\| {\varvec{R}}_{1}- {\varvec{I}}_{r} \right\| \\&\le \underbrace{\left\| \left( {\varvec{X}}^{t+1,\left( l\right) }\widehat{{\varvec{H}}}^{t,\left( l\right) }-\widetilde{{\varvec{X}}}^{t+1,\left( l\right) }\right) _{l,\cdot }\right\| _{2}}_{:=\beta _{1}}+2\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\underbrace{\left\| {\varvec{R}}_{1}-{\varvec{I}}_{r} \right\| }_{:=\beta _{2}}. \end{aligned}$$Here, to bound the second term, we have used
$$\begin{aligned} \left\| \widetilde{{\varvec{X}}}_{l,\cdot }^{t+1,\left( l\right) }\right\| _{2}\le \left\| \widetilde{{\varvec{X}}}_{l,\cdot }^{t+1,\left( l\right) }-{\varvec{X}}_{l,\cdot }^{\star }\right\| _{2}+\left\| {\varvec{X}}_{l,\cdot }^{\star }\right\| _{2}=\alpha _{2}+\left\| {\varvec{X}}_{l,\cdot }^{\star }\right\| _{2}\le 2\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }, \end{aligned}$$where the last inequality follows from (165). It remains to upper bound \(\beta _{1}\) and \(\beta _{2}\). For both \(\beta _1\) and \(\beta _2\), a central quantity to control is \({\varvec{X}}^{t+1,\left( l\right) }\widehat{{\varvec{H}}}^{t,\left( l\right) }-\widetilde{{\varvec{X}}}^{t+1, \left( l\right) }\). By the definition of \(\widetilde{{\varvec{X}}}^{t+1,\left( l\right) }\) in (159) and the gradient update rule for \({\varvec{X}}^{t+1,\left( l\right) }\) (see (69)), one has
$$\begin{aligned}&{\varvec{X}}^{t+1,\left( l\right) }\widehat{{\varvec{H}}}^{t,\left( l\right) } -\widetilde{{\varvec{X}}}^{t+1,\left( l\right) }\nonumber \\&\quad =\left\{ {\varvec{X}}^{t,\left( l\right) }\widehat{{\varvec{H}}}^{t,\left( l\right) }-\eta \left[ \frac{1}{p} \mathcal {P}_{\Omega ^{-l}}\left[ {\varvec{X}}^{t,\left( l\right) }{\varvec{X}}^{t,\left( l\right) \top } -\left( {\varvec{M}}^{\star }+{\varvec{E}}\right) \right] \right. \right. \nonumber \\&\qquad \left. \left. +\,\mathcal {P}_{l}\left( {\varvec{X}}^{t,\left( l\right) } {\varvec{X}}^{t,\left( l\right) \top }-{\varvec{M}}^{\star }\right) \right] {\varvec{X}}^{t,\left( l\right) } \widehat{{\varvec{H}}}^{t,\left( l\right) }\right\} \nonumber \\&\qquad -\left\{ {\varvec{X}}^{t,\left( l\right) }\widehat{{\varvec{H}}}^{t,\left( l\right) }-\eta \left[ \frac{1}{p} \mathcal {P}_{\Omega ^{-l}}\left[ {\varvec{X}}^{t,\left( l\right) }{\varvec{X}}^{t,\left( l\right) \top } -\left( {\varvec{M}}^{\star }+{\varvec{E}}\right) \right] +\mathcal {P}_{l}\left( {\varvec{X}}^{t,\left( l\right) } {\varvec{X}}^{t,\left( l\right) \top }-{\varvec{M}}^{\star }\right) \right] {\varvec{X}}^{\star }\right\} \nonumber \\&\quad =-\eta \left[ \frac{1}{p}\mathcal {P}_{\Omega ^{-l}}\left( {\varvec{X}}^{t,\left( l\right) }{\varvec{X}}^{t, \left( l\right) \top }-{\varvec{X}}^{\star }{\varvec{X}}^{\star \top }\right) +\mathcal {P}_{l}\left( {\varvec{X}}^{t, \left( l\right) }{\varvec{X}}^{t,\left( l\right) \top }-{\varvec{X}}^{\star }{\varvec{X}}^{\star \top }\right) \right] {\varvec{\Delta }}^{t,\left( l\right) }\nonumber \\&\qquad +\frac{\eta }{p}\mathcal {P}_{\Omega ^{-l}}\left( {\varvec{E}}\right) {\varvec{\Delta }}^{t,\left( l\right) }. \end{aligned}$$(166)It is easy to verify that
$$\begin{aligned} \frac{1}{p} \left\| \mathcal {P}_{\Omega ^{-l}}\left( {\varvec{E}}\right) \right\| \overset{\text {(i)}}{\le } \frac{1}{p} \left\| \mathcal {P}_{\Omega }\left( {\varvec{E}}\right) \right\| \overset{\text {(ii)}}{\lesssim }\sigma \sqrt{\frac{n}{p}} \overset{\text {(iii)}}{\le }\frac{\delta }{2}\sigma _{\min } \end{aligned}$$for \(\delta >0\) sufficiently small. Here, (i) uses the elementary fact that the spectral norm of a submatrix is no more than that of the matrix itself, (ii) arises from Lemma 40, and (iii) is a consequence of the noise condition (27). Therefore, in order to control (166), we need to upper bound the following quantity
$$\begin{aligned} \gamma :=\left\| \frac{1}{p}\mathcal {P}_{\Omega ^{-l}}\left( {\varvec{X}}^{t,\left( l\right) }{\varvec{X}}^{t,\left( l\right) \top }-{\varvec{X}}^{\star }{\varvec{X}}^{\star \top }\right) +\mathcal {P}_{l}\left( {\varvec{X}}^{t,\left( l\right) }{\varvec{X}}^{t,\left( l\right) \top }-{\varvec{X}}^{\star }{\varvec{X}}^{\star \top }\right) \right\| . \end{aligned}$$(167)To this end, we make the observation that
$$\begin{aligned} \gamma&\le \underbrace{\left\| \frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{X}}^{t,\left( l\right) }{\varvec{X}}^{t,\left( l\right) \top }-{\varvec{X}}^{\star }{\varvec{X}}^{\star \top }\right) \right\| }_{:=\gamma _{1}}\nonumber \\&\quad +\underbrace{\left\| \frac{1}{p}\mathcal {P}_{\Omega _{l}}\left( {\varvec{X}}^{t,\left( l\right) }{\varvec{X}}^{t,\left( l\right) \top }-{\varvec{X}}^{\star }{\varvec{X}}^{\star \top }\right) -\mathcal {P}_{l}\left( {\varvec{X}}^{t,\left( l\right) }{\varvec{X}}^{t,\left( l\right) \top }-{\varvec{X}}^{\star }{\varvec{X}}^{\star \top }\right) \right\| }_{:=\gamma _{2}}, \end{aligned}$$(168)where \(\mathcal {P}_{\Omega _{l}}\) is defined in (66). An application of Lemma 43 reveals that
$$\begin{aligned} \gamma _{1}&\le 2n\left\| {\varvec{X}}^{t,\left( l\right) }{\varvec{R}}^{t,\left( l\right) }-{\varvec{X}}^{\star }\right\| _{2,\infty }^{2}+4\sqrt{n}\log n\left\| {\varvec{X}}^{t,\left( l\right) }{\varvec{R}}^{t,\left( l\right) }-{\varvec{X}}^{\star }\right\| _{2,\infty }\left\| {\varvec{X}}^{\star }\right\| , \end{aligned}$$where \({\varvec{R}}^{t,\left( l\right) } \in \mathcal {O}^{r \times r}\) is defined in (72). Letting \({\varvec{C}}={\varvec{X}}^{t,\left( l\right) }{\varvec{X}}^{t,\left( l\right) \top }-{\varvec{X}}^{\star }{\varvec{X}}^{\star \top }\) be as in (156), one can bound the other term \(\gamma _{2}\) by taking advantage of the triangle inequality and the symmetry property:
$$\begin{aligned} \gamma _{2}&\le \frac{2}{p}\sqrt{\sum _{j=1}^{n}\left( \delta _{l,j}-p\right) ^{2}C_{l,j}^{2}}\overset{\left( \text {i}\right) }{\lesssim }\sqrt{\frac{n}{p}}\left\| {\varvec{C}}\right\| _{\infty } \overset{\left( \text {ii}\right) }{\lesssim }\sqrt{\frac{n}{p}}\left\| {\varvec{X}}^{t,\left( l\right) }{\varvec{R}}^{t,\left( l\right) }-{\varvec{X}}^{\star }\right\| _{2,\infty }\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }, \end{aligned}$$where (i) comes from the standard Chernoff bound \(\sum _{j=1}^{n}\left( \delta _{l,j}-p\right) ^{2}\asymp np\) (see also the numerical sanity check following this proof), and in (ii) we utilize the bound established in (158). The previous two bounds taken collectively give
$$\begin{aligned} \gamma&\le 2n\left\| {\varvec{X}}^{t,\left( l\right) }{\varvec{R}}^{t,\left( l\right) }-{\varvec{X}}^{\star }\right\| _{2,\infty }^{2}+4\sqrt{n}\log n\left\| {\varvec{X}}^{t,\left( l\right) }{\varvec{R}}^{t,\left( l\right) }-{\varvec{X}}^{\star }\right\| _{2,\infty }\left\| {\varvec{X}}^{\star }\right\| \nonumber \\&\quad +\widetilde{C}\sqrt{\frac{n}{p}}\left\| {\varvec{X}}^{t,\left( l\right) }{\varvec{R}}^{t,\left( l\right) }-{\varvec{X}}^{\star }\right\| _{2,\infty }\left\| {\varvec{X}}^{\star }\right\| _{2,\infty } \le \frac{\delta }{2}\sigma _{\min } \end{aligned}$$(169)for some constant \(\widetilde{C}>0\) and \(\delta >0\) sufficiently small. The last inequality follows from (73c), the incoherence condition (114), and our sample size condition. In summary, we obtain
$$\begin{aligned} \left\| {\varvec{X}}^{t+1,\left( l\right) }\widehat{{\varvec{H}}}^{t,\left( l\right) }-\widetilde{{\varvec{X}}}^{t+1,\left( l\right) }\right\|&\le \eta \left( \gamma +\left\| \frac{1}{p}\mathcal {P}_{\Omega ^{-l}}\left( {\varvec{E}}\right) \right\| \right) \left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| \le \eta \delta \sigma _{\min }\left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| , \end{aligned}$$(170)for \(\delta >0\) sufficiently small. With the estimate (170) in place, we can continue our derivation on \(\beta _{1}\) and \(\beta _{2}\).
- (a)
With regard to \(\beta _{1}\), in view of (166) we can obtain
$$\begin{aligned} \beta _{1}&\overset{(\text {i})}{=}\eta \left\| \left( {\varvec{X}}^{t,\left( l\right) }{\varvec{X}}^{t,\left( l\right) \top }-{\varvec{X}}^{\star }{\varvec{X}}^{\star \top }\right) _{l,\cdot }{\varvec{\Delta }}^{t,\left( l\right) }\right\| _{2}\nonumber \\&\le \eta \left\| \left( {\varvec{X}}^{t,\left( l\right) }{\varvec{X}}^{t,\left( l\right) \top }-{\varvec{X}}^{\star }{\varvec{X}}^{\star \top }\right) _{l,\cdot }\right\| _{2}\left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| \nonumber \\&\overset{(\text {ii})}{=}\eta \left\| \left[ {\varvec{\Delta }}^{t,\left( l\right) }\left( {\varvec{X}}^{t,\left( l\right) }\widehat{{\varvec{H}}}^{t,\left( l\right) }\right) ^{\top }+{\varvec{X}}^{\star }{\varvec{\Delta }}^{t,\left( l\right) \top }\right] _{l,\cdot }\right\| _{2}\left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| \nonumber \\&\le \eta \left( \left\| {\varvec{\Delta }}_{l,\cdot }^{t,\left( l\right) }\right\| _{2}\left\| {\varvec{X}}^{t,\left( l\right) }\right\| +\left\| {\varvec{X}}_{l,\cdot }^{\star }\right\| _{2}\left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| \right) \left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| \nonumber \\&\le \eta \left\| {\varvec{\Delta }}_{l,\cdot }^{t,\left( l\right) }\right\| _{2}\left\| {\varvec{X}}^{t,\left( l\right) }\right\| \left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| +\eta \left\| {\varvec{X}}_{l,\cdot }^{\star }\right\| _{2}\left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| ^{2}, \end{aligned}$$(171)where (i) follows from the definitions of \(\mathcal {P}_{\Omega ^{-l}}\) and \(\mathcal {P}_{l}\) (see (67) and note that all entries in the lth row of \(\mathcal {P}_{\Omega ^{-l}}(\cdot )\) are identically zero) and identity (ii) is due to the definition of \({\varvec{\Delta }}^{t,\left( l\right) }\) in (162).
- (b)
For \(\beta _{2}\), we first claim that
$$\begin{aligned} {\varvec{I}}_{r}=\arg \min _{{\varvec{R}}\in \mathcal {O}^{r\times r}}\left\| \widetilde{{\varvec{X}}}^{t+1,\left( l\right) }{\varvec{R}}-{\varvec{X}}^{\star }\right\| _{\mathrm {F}}, \end{aligned}$$(172)whose justification follows reasoning similar to that of (138) and is therefore omitted. In particular, it gives rise to the facts that \({\varvec{X}}^{\star \top }\widetilde{{\varvec{X}}}^{t+1,(l)}\) is symmetric and
$$\begin{aligned} \big (\widetilde{{\varvec{X}}}^{t+1,\left( l\right) }\big )^{\top }{\varvec{X}}^{\star }\succeq \frac{1}{2} \sigma _{\min }{\varvec{I}}_{r}. \end{aligned}$$(173)We are now ready to invoke Lemma 36 to bound \(\beta _{2}\). Abusing notation, we denote \({\varvec{C}}:=\big (\widetilde{{\varvec{X}}}^{t+1,\left( l\right) }\big )^{\top }{\varvec{X}}^{\star }\) and \({\varvec{E}}:=\big ({\varvec{X}}^{t+1,\left( l\right) }\widehat{{\varvec{H}}}^{t,\left( l\right) }-\widetilde{{\varvec{X}}}^{t+1,\left( l\right) }\big )^{\top }{\varvec{X}}^{\star }\). We have
$$\begin{aligned} \left\| {\varvec{E}}\right\| \le \frac{1}{2}\sigma _{\min }\le \sigma _{r}\left( {\varvec{C}}\right) . \end{aligned}$$The first inequality arises from (170), namely
$$\begin{aligned} \left\| {\varvec{E}}\right\|&\le \left\| {\varvec{X}}^{t+1,\left( l\right) }\widehat{{\varvec{H}}}^{t,\left( l\right) }-\widetilde{{\varvec{X}}}^{t+1,\left( l\right) }\right\| \left\| {\varvec{X}}^{\star }\right\| \le \eta \delta \sigma _{\min }\left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| \left\| {\varvec{X}}^{\star }\right\| \\&\overset{\left( \text {i}\right) }{\le }\eta \delta \sigma _{\min }\left\| {\varvec{X}}^{\star }\right\| ^{2}\overset{\left( \text {ii}\right) }{\le }\frac{1}{2}\sigma _{\min }, \end{aligned}$$where (i) holds since \(\left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| \le \left\| {\varvec{X}}^{\star }\right\| \) and (ii) holds true for \(\delta \) sufficiently small and \(\eta \le 1/{\sigma _{\max }}\). Invoke Lemma 36 to obtain
$$\begin{aligned} \beta _{2}=\left\| {\varvec{R}}_{1}-{\varvec{I}}_r \right\|&\le \frac{2}{\sigma _{r-1}\left( {\varvec{C}}\right) +\sigma _{r}\left( {\varvec{C}}\right) }\left\| {\varvec{E}}\right\| \nonumber \\&\le \frac{2}{\sigma _{\min }}\left\| {\varvec{X}}^{t+1,\left( l\right) }\widehat{{\varvec{H}}}^{t,\left( l\right) }-\widetilde{{\varvec{X}}}^{t+1,\left( l\right) }\right\| \left\| {\varvec{X}}^{\star }\right\| \end{aligned}$$(174)$$\begin{aligned}&\le 2\delta \eta \left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| \left\| {\varvec{X}}^{\star }\right\| , \end{aligned}$$(175)where (174) follows since \(\sigma _{r-1}\left( {\varvec{C}}\right) \ge \sigma _{r}\left( {\varvec{C}}\right) \ge \sigma _{\min }/2\) from (173), and the last line comes from (170).
- (c)
Putting the previous bounds (171) and (175) together yields
$$\begin{aligned} \alpha _{1}&\le \eta \left\| {\varvec{\Delta }}_{l,\cdot }^{t,\left( l\right) }\right\| _{2}\left\| {\varvec{X}}^{t,\left( l\right) }\right\| \left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| +\eta \left\| {\varvec{X}}_{l,\cdot }^{\star }\right\| _{2}\left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| ^{2}\nonumber \\&\quad +4\delta \eta \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| \left\| {\varvec{X}}^{\star }\right\| . \end{aligned}$$(176)
- 3.
Combine (160), (164), and (176) to reach
$$\begin{aligned}&\left\| \left( {\varvec{X}}^{t+1,\left( l\right) }\widehat{{\varvec{H}}}^{t+1,\left( l\right) }-{\varvec{X}}^{\star }\right) _{l,\cdot }\right\| _{2} \le \left( 1-\eta \sigma _{\min }\right) \left\| {\varvec{\Delta }}_{l,\cdot }^{t,\left( l\right) }\right\| _2 +2\eta \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| \left\| {\varvec{X}}^{\star }\right\| \\&\quad \quad \qquad +\eta \left\| {\varvec{\Delta }}_{l,\cdot }^{t,\left( l\right) }\right\| _{2}\left\| {\varvec{X}}^{t,\left( l\right) }\right\| \left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| +\eta \left\| {\varvec{X}}_{l,\cdot }^{\star }\right\| _{2}\left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| ^{2}+4\delta \eta \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| \left\| {\varvec{X}}^{\star }\right\| \\&\quad \overset{\left( \text {i}\right) }{\le }\left( 1-\eta \sigma _{\min }+\eta \left\| {\varvec{X}}^{t,\left( l\right) }\right\| \left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| \right) \left\| {\varvec{\Delta }}_{l,\cdot }^{t,\left( l\right) }\right\| _2 +4\eta \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| \left\| {\varvec{X}}^{\star }\right\| \\&\quad \overset{\left( \text {ii}\right) }{\le }\left( 1-\frac{\sigma _{\min }}{2}\eta \right) \left( C_{2}\rho ^{t}\mu r\frac{1}{\sqrt{np}}+\frac{C_{6}}{\sigma _{\min }}\sigma \sqrt{\frac{n\log n}{p}}\right) \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\\&\quad \quad +4\eta \left\| {\varvec{X}}^{\star }\right\| \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\left( 2C_{9}\rho ^{t}\mu r\frac{1}{\sqrt{np}}\left\| {\varvec{X}}^{\star }\right\| +\frac{2C_{10}}{\sigma _{\min }}\sigma \sqrt{\frac{n}{p}}\left\| {\varvec{X}}^{\star }\right\| \right) \\&\quad \overset{\left( \text {iii}\right) }{\le }C_{2}\rho ^{t+1}\mu r\frac{1}{\sqrt{np}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }+\frac{C_{6}}{\sigma _{\min }}\sigma \sqrt{\frac{n\log n}{p}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }. \end{aligned}$$Here, (i) follows since \(\left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| \le \left\| {\varvec{X}}^{\star }\right\| \) and \(\delta \) is sufficiently small, (ii) invokes hypotheses (70e) and (73d) and recognizes that
$$\begin{aligned} \left\| {\varvec{X}}^{t,\left( l\right) }\right\| \left\| {\varvec{\Delta }}^{t,\left( l\right) }\right\| \le 2\left\| {\varvec{X}}^{\star }\right\| \left( 2C_{9}\mu r\frac{1}{\sqrt{np}}\left\| {\varvec{X}}^{\star }\right\| +\frac{2C_{10}}{\sigma _{\min }}\sigma \sqrt{\frac{n}{p}}\left\| {\varvec{X}}^{\star }\right\| \right) \le \frac{\sigma _{\min }}{2} \end{aligned}$$holds under the sample size and noise condition, while \(\left( \text {iii}\right) \) is valid as long as \(1- \left( {\sigma _{\min }} / {3}\right) \cdot \eta \le \rho < 1\), \(C_{2}\gg \kappa C_{9}\), and \(C_{6}\gg \kappa C_{10} / \sqrt{\log n}\).
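As an aside, the Chernoff-type concentration \(\sum _{j=1}^{n}\left( \delta _{l,j}-p\right) ^{2}\asymp np\) invoked in step (i) of the bound on \(\gamma _{2}\) is straightforward to check numerically. The following minimal sketch (with hypothetical values of n and p; illustrative only, not part of the formal argument) draws i.i.d. Bernoulli indicators and compares the sum against np:

```python
# Minimal numerical sanity check (hypothetical n and p) of the concentration
# sum_j (delta_{l,j} - p)^2 ≍ np used in step (i) of the gamma_2 bound.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100_000, 0.01                       # hypothetical dimension and sampling rate

delta = (rng.random(n) < p).astype(float)  # delta_{l,j} ~ Bernoulli(p), independent
s = np.sum((delta - p) ** 2)

# E[(delta - p)^2] = p(1 - p), so s concentrates around n*p*(1 - p) ≈ n*p.
print(f"sum = {s:.1f}   n*p*(1-p) = {n * p * (1 - p):.1f}   ratio to n*p = {s / (n * p):.3f}")
```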
1.7 Proof of Lemma 13
For notational convenience, we define the following two orthonormal matrices
$$\begin{aligned} {\varvec{Q}}:=\arg \min _{{\varvec{R}}\in \mathcal {O}^{r\times r}}\left\| {\varvec{U}}^{0}{\varvec{R}}-{\varvec{U}}^{\star }\right\| _{\mathrm {F}}\qquad \text {and}\qquad {\varvec{Q}}^{(l)}:=\arg \min _{{\varvec{R}}\in \mathcal {O}^{r\times r}}\left\| {\varvec{U}}^{0,\left( l\right) }{\varvec{R}}-{\varvec{U}}^{\star }\right\| _{\mathrm {F}}. \end{aligned}$$
The problem of finding \(\widehat{{\varvec{H}}}^{t}\) (see (26)) is called the orthogonal Procrustes problem [112]. It is well known that the minimizer \(\widehat{{\varvec{H}}}^{t}\) always exists and is given by
$$\begin{aligned} \widehat{{\varvec{H}}}^{t}=\mathrm {sgn}\big ({\varvec{X}}^{t\top }{\varvec{X}}^{\star }\big ). \end{aligned}$$
Here, the sign matrix \(\mathrm {sgn}({\varvec{B}})\) is defined as
$$\begin{aligned} \mathrm {sgn}\left( {\varvec{B}}\right) :={\varvec{U}}{\varvec{V}}^{\top } \end{aligned}$$
for any matrix \({\varvec{B}}\) with singular value decomposition \({\varvec{B}}={\varvec{U}}{\varvec{\Sigma }}{\varvec{V}}^{\top }\), where the columns of \({\varvec{U}}\) and \({\varvec{V}}\) are left and right singular vectors, respectively.
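As a quick illustration of this closed-form solution, the following minimal sketch (hypothetical dimensions; not the paper's code) computes \(\mathrm {sgn}({\varvec{X}}^{\top }{\varvec{X}}^{\star })\) from the SVD and verifies that no randomly sampled rotation attains a smaller fit:

```python
# A small numerical illustration of the closed-form Procrustes solution: with
# B = X^T X_star and SVD B = U Sigma V^T, the rotation sgn(B) = U V^T minimizes
# ||X R - X_star||_F over all orthogonal R.
import numpy as np

rng = np.random.default_rng(1)
n, r = 50, 3
X = rng.standard_normal((n, r))
X_star = rng.standard_normal((n, r))

U, _, Vt = np.linalg.svd(X.T @ X_star)
H_hat = U @ Vt                          # sgn(X^T X_star); orthogonal by construction

best = np.linalg.norm(X @ H_hat - X_star)
for _ in range(1000):                   # compare against randomly drawn rotations
    Q, _ = np.linalg.qr(rng.standard_normal((r, r)))
    assert np.linalg.norm(X @ Q - X_star) >= best - 1e-9
print(f"Procrustes fit {best:.4f} is the smallest among all sampled rotations")
```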
Before proceeding, we make note of the following perturbation bounds on \({\varvec{M}}^{0}\) and \({\varvec{M}}^{(l)}\) (as defined in Algorithms 2 and 5, respectively):
$$\begin{aligned} \left\| {\varvec{M}}^{0}-{\varvec{M}}^{\star }\right\| \overset{\text {(i)}}{\le }\left\| \frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{M}}^{\star }\right) -{\varvec{M}}^{\star }\right\| +\frac{1}{p}\left\| \mathcal {P}_{\Omega }\left( {\varvec{E}}\right) \right\| \overset{\text {(ii)}}{\lesssim }\sqrt{\frac{n}{p}}\left\| {\varvec{M}}^{\star }\right\| _{\infty }+\sigma \sqrt{\frac{n}{p}}\overset{\text {(iii)}}{\le }C\left( \mu r\sqrt{\frac{1}{np}}\,\sigma _{\max }+\sigma \sqrt{\frac{n}{p}}\right) \overset{\text {(iv)}}{\ll }\sigma _{\min } \end{aligned}$$(178)
for some universal constant \(C >0\). Here, (i) arises from the triangle inequality, (ii) utilizes Lemmas 39 and 40, (iii) follows from the incoherence condition (114), and (iv) holds under our sample complexity assumption that \(n^2 p \gg \mu ^2 r^2 n\) and the noise condition (27). Similarly, we have
$$\begin{aligned} \big \Vert {\varvec{M}}^{\left( l\right) }-{\varvec{M}}^{\star }\big \Vert \le C\left( \mu r\sqrt{\frac{1}{np}}\,\sigma _{\max }+\sigma \sqrt{\frac{n}{p}}\right) \ll \sigma _{\min }. \end{aligned}$$(179)
Combine Weyl’s inequality, (178), and (179) to obtain
$$\begin{aligned} \left\| {\varvec{\Sigma }}^{0}-{\varvec{\Sigma }}^{\star }\right\| \le \left\| {\varvec{M}}^{0}-{\varvec{M}}^{\star }\right\| \ll \sigma _{\min }\qquad \text {and}\qquad \big \Vert {\varvec{\Sigma }}^{\left( l\right) }-{\varvec{\Sigma }}^{\star }\big \Vert \le \big \Vert {\varvec{M}}^{\left( l\right) }-{\varvec{M}}^{\star }\big \Vert \ll \sigma _{\min }, \end{aligned}$$(180)
which further implies
$$\begin{aligned} \frac{1}{2}\sigma _{\min }\le \sigma _{r}\left( {\varvec{\Sigma }}^{0}\right) \le \left\| {\varvec{\Sigma }}^{0}\right\| \le 2\sigma _{\max }\qquad \text {and}\qquad \frac{1}{2}\sigma _{\min }\le \sigma _{r}\big ({\varvec{\Sigma }}^{\left( l\right) }\big )\le \big \Vert {\varvec{\Sigma }}^{\left( l\right) }\big \Vert \le 2\sigma _{\max }. \end{aligned}$$(181)
We start by proving (70a), (70b), and (70c). The key decomposition we need is the following
$$\begin{aligned} {\varvec{X}}^{0}\widehat{{\varvec{H}}}^{0}-{\varvec{X}}^{\star }={\varvec{U}}^{0}\left( {\varvec{\Sigma }}^{0}\right) ^{1/2}\left( \widehat{{\varvec{H}}}^{0}-{\varvec{Q}}\right) +{\varvec{U}}^{0}\left[ \left( {\varvec{\Sigma }}^{0}\right) ^{1/2}{\varvec{Q}}-{\varvec{Q}}\left( {\varvec{\Sigma }}^{\star }\right) ^{1/2}\right] +\left( {\varvec{U}}^{0}{\varvec{Q}}-{\varvec{U}}^{\star }\right) \left( {\varvec{\Sigma }}^{\star }\right) ^{1/2}. \end{aligned}$$(182)
- 1.
For the spectral norm error bound in (70c), the triangle inequality together with (182) yields
$$\begin{aligned} \left\| {\varvec{X}}^{0}\widehat{{\varvec{H}}}^{0}-{\varvec{X}}^{\star }\right\|\le & {} \left\| \left( {\varvec{\Sigma }}^{0}\right) ^{1/2}\right\| \left\| \widehat{{\varvec{H}}}^{0}-{\varvec{Q}}\right\| +\left\| \left( {\varvec{\Sigma }}^{0}\right) ^{1/2}{\varvec{Q}}-{\varvec{Q}}\left( {\varvec{\Sigma }}^{\star }\right) ^{1/2}\right\| \\&+\sqrt{\sigma _{\max }}\left\| {\varvec{U}}^{0}{\varvec{Q}}-{\varvec{U}}^{\star }\right\| , \end{aligned}$$where we have also used the fact that \(\Vert {\varvec{U}}^{0}\Vert =1\). Recognizing that \(\left\| {\varvec{M}}^{0}-{\varvec{M}}^{\star }\right\| \ll \sigma _{\min }\) (see (178)) and the assumption \(\sigma _{\max }/\sigma _{\min }\lesssim 1\), we can apply Lemmas 47, 46, and 45 to obtain
$$\begin{aligned}&\big \Vert \widehat{{\varvec{H}}}^{0}-{\varvec{Q}}\big \Vert \lesssim \frac{1}{\sigma _{\min }}\left\| {\varvec{M}}^{0}-{\varvec{M}}^{\star }\right\| , \end{aligned}$$(183a)$$\begin{aligned}&\left\| \left( {\varvec{\Sigma }}^{0}\right) ^{1/2}{\varvec{Q}}-{\varvec{Q}}\left( {\varvec{\Sigma }}^{\star }\right) ^{1/2}\right\| \lesssim \frac{1}{\sqrt{\sigma _{\min }}}\left\| {\varvec{M}}^{0}-{\varvec{M}}^{\star }\right\| , \end{aligned}$$(183b)$$\begin{aligned}&\left\| {\varvec{U}}^{0}{\varvec{Q}}-{\varvec{U}}^{\star }\right\| \lesssim \frac{1}{\sigma _{\min }}\left\| {\varvec{M}}^{0}-{\varvec{M}}^{\star }\right\| . \end{aligned}$$(183c)These taken collectively imply the advertised upper bound
$$\begin{aligned} \big \Vert {\varvec{X}}^{0}\widehat{{\varvec{H}}}^{0}-{\varvec{X}}^{\star }\big \Vert&\lesssim \sqrt{\sigma _{\max }}\frac{1}{\sigma _{\min }}\left\| {\varvec{M}}^{0}-{\varvec{M}}^{\star }\right\| +\frac{1}{\sqrt{\sigma _{\min }}}\left\| {\varvec{M}}^{0}-{\varvec{M}}^{\star }\right\| \\&\lesssim \frac{1}{\sqrt{\sigma _{\min }}}\left\| {\varvec{M}}^{0}-{\varvec{M}}^{\star }\right\| \\&\lesssim \left\{ \mu r\sqrt{\frac{1}{np}}\sqrt{\frac{\sigma _{\max }}{\sigma _{\min }}}+\frac{\sigma }{{\sigma _{\min }}}\sqrt{\frac{n}{p}}\right\} \left\| {\varvec{X}}^{\star }\right\| , \end{aligned}$$where we also utilize the fact that \(\big \Vert \left( {\varvec{\Sigma }}^{0}\right) ^{1/2}\big \Vert \le \sqrt{2\sigma _{\max }}\) (see (181)) and the bounded condition number assumption, i.e., \(\sigma _{\max }/\sigma _{\min }\lesssim 1\). This finishes the proof of (70c).
- 2.
With regard to the Frobenius norm bound in (70a), one has
$$\begin{aligned} \left\| {\varvec{X}}^{0}\widehat{{\varvec{H}}}^{0}-{\varvec{X}}^{\star }\right\| _{\mathrm {F}}&\le \sqrt{r}\big \Vert {\varvec{X}}^{0}\widehat{{\varvec{H}}}^{0}-{\varvec{X}}^{\star }\big \Vert \\&\overset{\text {(i)}}{\lesssim }\left\{ \mu r\sqrt{\frac{1}{np}}+\frac{\sigma }{{\sigma _{\min }}}\sqrt{\frac{n}{p}}\right\} \sqrt{r} \left\| {\varvec{X}}^{\star }\right\| \\&=\left\{ \mu r\sqrt{\frac{1}{np}}+\frac{\sigma }{{\sigma _{\min }}}\sqrt{\frac{n}{p}}\right\} \sqrt{r} \frac{\sqrt{\sigma _{\max }}}{\sqrt{\sigma _{\min }}}\sqrt{\sigma _{\min }} \\&\overset{\text {(ii)}}{\lesssim } \left\{ \mu r\sqrt{\frac{1}{np}}+\frac{\sigma }{{\sigma _{\min }}}\sqrt{\frac{n}{p}}\right\} \sqrt{r} \left\| {\varvec{X}}^{\star }\right\| _{\mathrm {F}}. \end{aligned}$$Here (i) arises from (70c) and (ii) holds true since \(\sigma _{\max }/\sigma _{\min }\asymp 1\) and \(\sqrt{r}\sqrt{\sigma _{\min }}\le \left\| {\varvec{X}}^{\star }\right\| _{\mathrm {F}}\), thus completing the proof of (70a).
- 3.
The proof of (70b) follows from arguments similar to those used in proving (70c). Combine (182) and the triangle inequality to reach
$$\begin{aligned}&\left\| {\varvec{X}}^{0}\widehat{{\varvec{H}}}^{0}-{\varvec{X}}^{\star }\right\| _{2,\infty }\\&\quad \le \left\| {\varvec{U}}^{0}\right\| _{2,\infty }\left\{ \left\| \left( {\varvec{\Sigma }}^{0}\right) ^{1/2}\right\| \left\| \widehat{{\varvec{H}}}^{0}-{\varvec{Q}}\right\| +\left\| \left( {\varvec{\Sigma }}^{0}\right) ^{1/2}{\varvec{Q}}-{\varvec{Q}}\left( {\varvec{\Sigma }}^{\star }\right) ^{1/2}\right\| \right\} \\&\qquad +\sqrt{\sigma _{\max }}\left\| {\varvec{U}}^{0}{\varvec{Q}}-{\varvec{U}}^{\star }\right\| _{2,\infty }. \end{aligned}$$Plugging in the estimates (178), (181), (183a), and (183b) results in
$$\begin{aligned} \left\| {\varvec{X}}^{0}\widehat{{\varvec{H}}}^{0}-{\varvec{X}}^{\star }\right\| _{2,\infty }\lesssim & {} \left\{ \mu r\sqrt{\frac{1}{np}}+\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\right\} \left\| {\varvec{X}}^{\star }\right\| \left\| {\varvec{U}}^{0}\right\| _{2,\infty }\\&+\sqrt{\sigma _{\max }}\left\| {\varvec{U}}^{0}{\varvec{Q}}-{\varvec{U}}^{\star }\right\| _{2,\infty }. \end{aligned}$$It remains to study the component-wise error of \({\varvec{U}}^{0}\). To this end, it has already been shown in [1, Lemma 14] that
$$\begin{aligned} \left\| {\varvec{U}}^{0}{\varvec{Q}}-{\varvec{U}}^{\star }\right\| _{2,\infty }\lesssim \left( \mu r\sqrt{\frac{1}{np}}+\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\right) \left\| {\varvec{U}}^{\star }\right\| _{2,\infty }\quad \text {and}\quad \left\| {\varvec{U}}^{0}\right\| _{2,\infty }\lesssim \left\| {\varvec{U}}^{\star }\right\| _{2,\infty } \end{aligned}$$(184)under our assumptions. These combined with the previous inequality give
$$\begin{aligned} \left\| {\varvec{X}}^{0}\widehat{{\varvec{H}}}^{0}-{\varvec{X}}^{\star }\right\| _{2,\infty }\lesssim & {} \left\{ \mu r\sqrt{\frac{1}{np}}+\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\right\} \sqrt{\sigma _{\max }}\left\| {\varvec{U}}^{\star }\right\| _{2,\infty }\\\lesssim & {} \left\{ \mu r\sqrt{\frac{1}{np}}+\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\right\} \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }, \end{aligned}$$where the last relation is due to the observation that
$$\begin{aligned} \sqrt{\sigma _{\max }}\left\| {\varvec{U}}^{\star }\right\| _{2,\infty } \lesssim \sqrt{\sigma _{\min }}\left\| {\varvec{U}}^{\star }\right\| _{2,\infty } \le \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }. \end{aligned}$$
- 4.
We now move on to proving (70e). Recall that \({\varvec{Q}}^{(l)}=\arg \min _{{\varvec{R}}\in \mathcal {O}^{r\times r}}\left\| {\varvec{U}}^{0,\left( l\right) }{\varvec{R}}-{\varvec{U}}^{\star }\right\| _{\mathrm {F}}\). By the triangle inequality,
$$\begin{aligned} \left\| \big ({\varvec{X}}^{0,\left( l\right) }\widehat{{\varvec{H}}}^{0,\left( l\right) }-{\varvec{X}}^{\star }\big )_{l,\cdot }\right\| _{2}&\le \left\| {\varvec{X}}_{l,\cdot }^{0,\left( l\right) }\big (\widehat{{\varvec{H}}}^{0,\left( l\right) }{-}{\varvec{Q}}^{(l)}\big )\right\| _{2}{+}\left\| \big ({\varvec{X}}^{0,\left( l\right) }{\varvec{Q}}^{(l)}{-}{\varvec{X}}^{\star }\big )_{l,\cdot }\right\| _{2}\nonumber \\&\le \left\| {\varvec{X}}_{l,\cdot }^{0,\left( l\right) }\right\| _{2}\big \Vert \widehat{{\varvec{H}}}^{0,\left( l\right) }-{\varvec{Q}}^{(l)}\big \Vert +\left\| \big ({\varvec{X}}^{0,\left( l\right) }{\varvec{Q}}^{(l)}-{\varvec{X}}^{\star }\big )_{l,\cdot }\right\| _{2}. \end{aligned}$$(185)Note that \({\varvec{X}}_{l,\cdot }^{\star }={\varvec{M}}_{l,\cdot }^{\star }{\varvec{U}}^{\star }\left( {\varvec{\Sigma }}^{\star }\right) ^{-1/2}\) and, by construction of \({\varvec{M}}^{(l)}\),
$$\begin{aligned} {\varvec{X}}_{l,\cdot }^{0,\left( l\right) }={\varvec{M}}_{l,\cdot }^{\left( l\right) }{\varvec{U}}^{0,\left( l\right) }\big ({\varvec{\Sigma }}^{\left( l\right) }\big )^{-1/2}={\varvec{M}}_{l,\cdot }^{\star }{\varvec{U}}^{0,\left( l\right) }\big ({\varvec{\Sigma }}^{\left( l\right) }\big )^{-1/2}. \end{aligned}$$We can thus decompose
$$\begin{aligned} \left( {\varvec{X}}^{0,\left( l\right) }{\varvec{Q}}^{(l)}-{\varvec{X}}^{\star }\right) _{l,\cdot }&={\varvec{M}}_{l,\cdot }^{\star }\left\{ {\varvec{U}}^{0,\left( l\right) }\left[ \big ({\varvec{\Sigma }}^{\left( l\right) }\big )^{-1/2}{\varvec{Q}}^{(l)}-{\varvec{Q}}^{(l)}\left( {\varvec{\Sigma }}^{\star }\right) ^{-1/2}\right] \right. \\&\quad \left. +\left( {\varvec{U}}^{0,\left( l\right) }{\varvec{Q}}^{(l)}-{\varvec{U}}^{\star }\right) \left( {\varvec{\Sigma }}^{\star }\right) ^{-1/2}\right\} , \end{aligned}$$which further implies that
$$\begin{aligned} \left\| \big ({\varvec{X}}^{0,\left( l\right) }{\varvec{Q}}^{(l)}-{\varvec{X}}^{\star }\big )_{l,\cdot }\right\| _{2}\le & {} \left\| {\varvec{M}}^{\star }\right\| _{2,\infty }\left\{ \left\| \big ({\varvec{\Sigma }}^{\left( l\right) }\big )^{-1/2}{\varvec{Q}}^{(l)}-{\varvec{Q}}^{(l)}\left( {\varvec{\Sigma }}^{\star }\right) ^{-1/2}\right\| \right. \nonumber \\&\quad \left. +\frac{1}{\sqrt{\sigma _{\min }}}\left\| {\varvec{U}}^{0,\left( l\right) }{\varvec{Q}}^{(l)}-{\varvec{U}}^{\star }\right\| \right\} . \end{aligned}$$(186)In order to control this, we first see that
$$\begin{aligned}&\left\| \big ({\varvec{\Sigma }}^{\left( l\right) }\big )^{-1/2}{\varvec{Q}}^{(l)}-{\varvec{Q}}^{(l)}\left( {\varvec{\Sigma }}^{\star }\right) ^{-1/2}\right\| \\&\quad =\left\| \big ({\varvec{\Sigma }}^{\left( l\right) }\big )^{-1/2}\left[ {\varvec{Q}}^{(l)}\left( {\varvec{\Sigma }}^{\star }\right) ^{1/2}-\big ({\varvec{\Sigma }}^{\left( l\right) }\big )^{1/2}{\varvec{Q}}^{(l)}\right] \left( {\varvec{\Sigma }}^{\star }\right) ^{-1/2}\right\| \\&\quad \lesssim \frac{1}{\sigma _{\min }}\left\| {\varvec{Q}}^{(l)}\left( {\varvec{\Sigma }}^{\star }\right) ^{1/2}-\big ({\varvec{\Sigma }}^{\left( l\right) }\big )^{1/2}{\varvec{Q}}^{(l)}\right\| \\&\quad \lesssim \frac{1}{\sigma _{\min }^{3/2}}\left\| {\varvec{M}}^{\left( l\right) }-{\varvec{M}}^{\star }\right\| , \end{aligned}$$where the penultimate inequality uses (181) and the last inequality arises from Lemma 46. Additionally, Lemma 45 gives
$$\begin{aligned} \left\| {\varvec{U}}^{0,\left( l\right) }{\varvec{Q}}^{(l)}-{\varvec{U}}^{\star }\right\| \lesssim \frac{1}{\sigma _{\min }}\left\| {\varvec{M}}^{\left( l\right) }-{\varvec{M}}^{\star }\right\| . \end{aligned}$$Plugging the previous two bounds into (186), we reach
$$\begin{aligned} \left\| \big ({\varvec{X}}^{0,\left( l\right) }{\varvec{Q}}^{(l)}-{\varvec{X}}^{\star }\big )_{l,\cdot }\right\| _{2}&\lesssim \frac{1}{\sigma _{\min }^{3/2}}\left\| {\varvec{M}}^{\left( l\right) }-{\varvec{M}}^{\star }\right\| \left\| {\varvec{M}}^{\star }\right\| _{2,\infty } \\&\lesssim \left\{ \mu r\sqrt{\frac{1}{np}}+\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\right\} \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }, \end{aligned}$$where the last relation follows from \(\left\| {\varvec{M}}^{\star }\right\| _{2,\infty }=\left\| {\varvec{X}}^{\star }{\varvec{X}}^{\star \top }\right\| _{2,\infty }\le \sqrt{\sigma _{\max }}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\) and estimate (179). Note that this also implies that \(\left\| {\varvec{X}}_{l,\cdot }^{0,\left( l\right) }\right\| _{2}\le 2\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\). To see this, one has by the unitary invariance of \(\left\| \left( \cdot \right) _{l,\cdot }\right\| _{2}\),
$$\begin{aligned} \left\| {\varvec{X}}_{l,\cdot }^{0,\left( l\right) }\right\| _{2} = \left\| {\varvec{X}}_{l,\cdot }^{0,\left( l\right) }{\varvec{Q}}^{(l)}\right\| _{2} \le \left\| \big ({\varvec{X}}^{0,\left( l\right) }{\varvec{Q}}^{(l)}-{\varvec{X}}^{\star }\big )_{l,\cdot }\right\| _{2} + \left\| {\varvec{X}}^{\star }_{l,\cdot }\right\| _{2} \le 2\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }. \end{aligned}$$Substituting the above bounds back into (185) yields
$$\begin{aligned} \left\| \big ({\varvec{X}}^{0,\left( l\right) }\widehat{{\varvec{H}}}^{0,\left( l\right) }-{\varvec{X}}^{\star }\big )_{l,\cdot }\right\| _{2}&\lesssim \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\left\| \widehat{{\varvec{H}}}^{0,\left( l\right) }-{\varvec{Q}}^{(l)}\right\| \\&\quad +\left\{ \mu r\sqrt{\frac{1}{np}}+\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\right\} \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\\&\lesssim \left\{ \mu r\sqrt{\frac{1}{np}}+\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n}{p}}\right\} \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }, \end{aligned}$$where the second line relies on Lemma 47, bound (179), and condition \(\sigma _{\max }/\sigma _{\min }\asymp 1\). This establishes (70e).
- 5.
Our final step is to justify (70d). Define \({\varvec{B}}:=\arg \min _{{\varvec{R}}\in \mathcal {O}^{r\times r}}\left\| {\varvec{U}}^{0,\left( l\right) }{\varvec{R}}-{\varvec{U}}^{0}\right\| _{\mathrm {F}}\). From the definition of \({\varvec{R}}^{0,\left( l\right) }\) (cf. (72)), one has
$$\begin{aligned} \left\| {\varvec{X}}^{0}\widehat{{\varvec{H}}}^{0}-{\varvec{X}}^{0,\left( l\right) }{\varvec{R}}^{0,\left( l\right) } \right\| _{\mathrm {F}}\le \left\| {\varvec{X}}^{0,\left( l\right) }{\varvec{B}}-{\varvec{X}}^{0}\right\| _{\mathrm {F}}. \end{aligned}$$Recognizing that
$$\begin{aligned} {\varvec{X}}^{0,\left( l\right) }{\varvec{B}}-{\varvec{X}}^{0}={\varvec{U}}^{0,\left( l\right) } \left[ \big ({\varvec{\Sigma }}^{\left( l\right) }\big )^{1/2}{\varvec{B}}-{\varvec{B}}\left( {\varvec{\Sigma }}^{0} \right) ^{1/2}\right] +\left( {\varvec{U}}^{0,\left( l\right) }{\varvec{B}}-{\varvec{U}}^{0}\right) \left( {\varvec{\Sigma }}^{0}\right) ^{1/2}, \end{aligned}$$we can use the triangle inequality to bound
$$\begin{aligned} \left\| {\varvec{X}}^{0,\left( l\right) }{\varvec{B}}-{\varvec{X}}^{0}\right\| _{\mathrm {F}}\le \left\| \big ({\varvec{\Sigma }}^{\left( l\right) }\big )^{1/2}{\varvec{B}}-{\varvec{B}}\left( {\varvec{\Sigma }}^{0}\right) ^{1/2}\right\| _{\mathrm {F}}+\left\| {\varvec{U}}^{0,\left( l\right) }{\varvec{B}}-{\varvec{U}}^{0}\right\| _{\mathrm {F}}\left\| \left( {\varvec{\Sigma }}^{0}\right) ^{1/2}\right\| . \end{aligned}$$In view of Lemma 46 and bounds (178) and (179), one has
$$\begin{aligned} \left\| \big ({\varvec{\Sigma }}^{\left( l\right) }\big )^{-1/2}{\varvec{B}}-{\varvec{B}}{\varvec{\Sigma }}^{1/2}\right\| _{\mathrm {F}}\lesssim \frac{1}{\sqrt{\sigma _{\min }}}\left\| \big ({\varvec{M}}^{0}-{\varvec{M}}^{\left( l\right) }\big ){\varvec{U}}^{0,\left( l\right) }\right\| _{\mathrm {F}}. \end{aligned}$$From Davis–Kahan’s sin\(\Theta \) theorem [39] we see that
$$\begin{aligned} \left\| {\varvec{U}}^{0,\left( l\right) }{\varvec{B}}-{\varvec{U}}^{0}\right\| _{\mathrm {F}}\lesssim \frac{1}{\sigma _{\min }}\left\| \big ({\varvec{M}}^{0}-{\varvec{M}}^{\left( l\right) }\big ){\varvec{U}}^{0,\left( l\right) }\right\| _{\mathrm {F}}. \end{aligned}$$These estimates taken together with (181) give
$$\begin{aligned} \left\| {\varvec{X}}^{0,\left( l\right) }{\varvec{B}}-{\varvec{X}}^{0}\right\| _{\mathrm {F}}\lesssim \frac{1}{\sqrt{\sigma _{\min }}}\left\| \big ({\varvec{M}}^{0}-{\varvec{M}}^{\left( l\right) }\big ){\varvec{U}}^{0,\left( l\right) }\right\| _{\mathrm {F}}. \end{aligned}$$It then boils down to controlling \(\left\| \left( {\varvec{M}}^{0}-{\varvec{M}}^{\left( l\right) }\right) {\varvec{U}}^{0,\left( l\right) }\right\| _{\mathrm {F}}\). Quantities of this type have shown up multiple times already, and hence, we omit the proof details for conciseness (see Appendix B.5). With probability at least \(1-O\left( n^{-10}\right) \),
$$\begin{aligned} \left\| \big ({\varvec{M}}^{0}-{\varvec{M}}^{\left( l\right) }\big ){\varvec{U}}^{0,\left( l\right) }\right\| _{\mathrm {F}}\lesssim \left\{ \mu r\sqrt{\frac{\log n}{np}}\sigma _{\max }+\sigma \sqrt{\frac{n\log n}{p}}\right\} \left\| {\varvec{U}}^{0,\left( l\right) }\right\| _{2,\infty }. \end{aligned}$$If one further has
$$\begin{aligned} \left\| {\varvec{U}}^{0,\left( l\right) }\right\| _{2,\infty }\lesssim \left\| {\varvec{U}}^{\star }\right\| _{2,\infty }\lesssim \frac{1}{\sqrt{\sigma _{\min }}}\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }, \end{aligned}$$(187)then taking the previous bounds collectively establishes the desired bound
$$\begin{aligned} \left\| {\varvec{X}}^{0}\widehat{{\varvec{H}}}^{0}-{\varvec{X}}^{0,\left( l\right) }{\varvec{R}}^{0,\left( l\right) }\right\| _{\mathrm {F}}\lesssim \left\{ \mu r\sqrt{\frac{\log n}{np}}+\frac{\sigma }{\sigma _{\min }}\sqrt{\frac{n\log n}{p}}\right\} \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }. \end{aligned}$$
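The initialization bounds just established can be sanity-checked on a small synthetic instance. The sketch below (hypothetical sizes and noise level; illustrative only, not the paper's experiments) forms \({\varvec{M}}^{0}\) from rescaled partial observations of \({\varvec{M}}^{\star }+{\varvec{E}}\), sets \({\varvec{X}}^{0}={\varvec{U}}^{0}({\varvec{\Sigma }}^{0})^{1/2}\) via the rank-r eigendecomposition, and measures the error after the best rotational alignment:

```python
# A minimal synthetic sketch of the spectral initialization analyzed above:
# M^0 = p^{-1} P_Omega(M* + E), rank-r eigendecomposition, X^0 = U^0 (Sigma^0)^{1/2}.
import numpy as np

rng = np.random.default_rng(2)
n, r, p, sigma = 400, 2, 0.2, 1e-3

X_star = rng.standard_normal((n, r)) / np.sqrt(n)     # ground-truth low-rank factor
M_star = X_star @ X_star.T
E = sigma * rng.standard_normal((n, n))
E = (E + E.T) / np.sqrt(2)                            # symmetric noise
mask = rng.random((n, n)) < p
mask = np.triu(mask) | np.triu(mask, 1).T             # symmetric sampling pattern

M0 = np.where(mask, M_star + E, 0.0) / p              # rescaled partial observations
vals, vecs = np.linalg.eigh(M0)                       # eigenvalues in ascending order
U0, S0 = vecs[:, -r:], vals[-r:]                      # leading r eigenpairs
X0 = U0 * np.sqrt(np.maximum(S0, 0.0))

U, _, Vt = np.linalg.svd(X0.T @ X_star)               # best rotational alignment
H0 = U @ Vt
err = np.linalg.norm(X0 @ H0 - X_star) / np.linalg.norm(X_star)
print(f"relative initialization error: {err:.3f}")
```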
Proof of Claim (187)
Denote by \({\varvec{M}}^{(l),\text {zero}}\) the matrix derived by zeroing out the lth row/column of \({\varvec{M}}^{\left( l\right) }\), and by \({\varvec{U}}^{(l),\text {zero}} \in \mathbb {R}^{n\times r}\) the matrix containing the leading r eigenvectors of \({\varvec{M}}^{(l),\text {zero}}\). On the one hand, [1, Lemma 4 and Lemma 14] demonstrate that
On the other hand, by the Davis–Kahan \(\sin \Theta \) theorem [39] we obtain
where \(\mathrm {sgn}({\varvec{A}})\) denotes the sign matrix of \({\varvec{A}}\). For any \(j\ne l\), one has
since the lth row \({\varvec{U}}_{l,\cdot }^{\left( l\right) ,\text {zero}}\) of \({\varvec{U}}^{\left( l\right) ,\text {zero}}\) is identically zero by construction (see also the numerical illustration following this proof). In addition,
As a consequence, one has
which combined with (188) and the assumption \(\sigma _{\max } / \sigma _{\min } \asymp 1\) yields
Claim (187) then follows by combining the above estimates:
where we have utilized the unitary invariance of \(\left\| \cdot \right\| _{2,\infty }\). \(\square \)
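The structural fact exploited in the argument above, namely that the eigenvectors of \({\varvec{M}}^{(l),\text {zero}}\) attached to its nonzero eigenvalues have an identically zero lth row, admits a quick numerical confirmation. A minimal sketch with hypothetical dimensions:

```python
# Numerical confirmation (hypothetical sizes) of the fact used above: zeroing out
# the lth row and column of a symmetric rank-r matrix forces the eigenvectors
# attached to its nonzero eigenvalues to have a vanishing lth row.
import numpy as np

rng = np.random.default_rng(3)
n, r, l = 100, 3, 7
X = rng.standard_normal((n, r))
M = X @ X.T                    # symmetric PSD matrix of rank r

M[l, :] = 0.0                  # M^{(l),zero}: lth row ...
M[:, l] = 0.0                  # ... and lth column zeroed out

vals, vecs = np.linalg.eigh(M)
U_zero = vecs[:, -r:]          # leading r eigenvectors (nonzero eigenvalues, generically)
print("max |(U_zero)_{l,.}| =", np.abs(U_zero[l, :]).max())   # numerically zero
```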
Proofs for Blind Deconvolution
Before proceeding to the proofs, we make note of the following concentration results. The standard Gaussian concentration inequality and the union bound give
$$\begin{aligned} \max _{1\le j\le m}\left| {\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{\star }\right| \le 5\sqrt{\log m} \end{aligned}$$(189)
with probability at least \(1-O(m^{-10})\). In addition, with probability exceeding \(1-Cm\exp (-cK)\) for some constants \(c,C>0\),
In addition, the population/expected Wirtinger Hessian at the truth \({\varvec{z}}^{\star }\) is given by
1.1 Proof of Lemma 14
First, we find it convenient to decompose the Wirtinger Hessian (cf. (80)) into the expected Wirtinger Hessian at the truth (cf. (191)) and the perturbation part as follows:
The proof then proceeds by showing that (i) the population Hessian \(\nabla ^{2}F\big ({\varvec{z}}^{\star }\big )\) satisfies the restricted strong convexity and smoothness properties as advertised and (ii) the perturbation \(\nabla ^{2}f\left( {\varvec{z}}\right) -\nabla ^{2}F\big ({\varvec{z}}^{\star }\big )\) is well controlled under our assumptions. We start by controlling the population Hessian in the following lemma.
Lemma 26
Instate the notation and the conditions of Lemma 14. We have
The next step is to bound the perturbation. To this end, we define the set
and derive the following lemma.
Lemma 27
Suppose that the sample complexity satisfies \(m\gg \mu ^{2}K\log ^{9}m\), that \(c>0\) is a sufficiently small constant, and that \(\delta =c/\log ^2 m\). Then with probability at least \(1-O\left( m^{-10} + e^{-K} \log m \right) \), one has
Combining the two lemmas, we can easily see that for \({\varvec{z}}\in \mathcal {S}\),
which verifies the smoothness upper bound. In addition,
where (i) uses the triangle inequality, (ii) holds because of Lemma 27 and the fact that \(\Vert {\varvec{D}}\Vert \le 1+\delta \), and (iii) follows if \(\delta \le 1/2\). This establishes the claim on the restricted strong convexity.
1.1.1 Proof of Lemma 26
We start by proving the identity \(\left\| \nabla ^{2}F\left( {\varvec{z}}^{\star }\right) \right\| =2\). Let
Recalling that \(\Vert {\varvec{h}}^{\star }\Vert _{2}=\Vert {\varvec{x}}^{\star }\Vert _{2}=1\), we can easily check that these four vectors form an orthonormal set. A little algebra reveals that
which immediately implies
We now turn attention to the restricted strong convexity. Since \({\varvec{u}}^{\textsf {H} }{\varvec{D}}\nabla ^{2}F\left( {\varvec{z}}^{\star }\right) {\varvec{u}}\) is the complex conjugate of \({\varvec{u}}^{\textsf {H} }\nabla ^{2}F\left( {\varvec{z}}^{\star }\right) {\varvec{D}}{\varvec{u}}\) as both \(\nabla ^{2}F({\varvec{z}}^{\star })\) and \({\varvec{D}}\) are Hermitian, we will focus on the first term \({\varvec{u}}^{\textsf {H} }{\varvec{D}}\nabla ^{2}F\left( {\varvec{z}}^{\star }\right) {\varvec{u}}\). This term can be rewritten as
where (i) uses the definitions of \({\varvec{u}}\) and \(\nabla ^{2}F\left( {\varvec{z}}^{\star }\right) \), and (ii) follows from the definition of \({\varvec{D}}\). In view of the assumption (84), we can obtain
where the last inequality utilizes the identity
It then boils down to controlling \(\beta \). Toward this goal, we decompose \(\beta \) into the following four terms
Since \(\left\| {\varvec{h}}_{2}-{\varvec{h}}^{\star }\right\| _2 \) and \(\left\| {\varvec{x}}_{2}-{\varvec{x}}^{\star }\right\| _{2}\) are both small by (83), \(\beta _{2},\beta _{3}\), and \(\beta _{4}\) are well bounded. Specifically, regarding \(\beta _{2}\), we discover that
where the second inequality is due to (83) and the last one holds since \(\delta < 1\). Similarly, we can obtain
where both lines make use of the facts that
Combine the previous three bounds to reach
where we utilize the elementary inequality \(ab\le (a^2+b^2)/2\) and identity (194).
The only remaining term is thus \(\beta _{1}\). Recalling that \(({\varvec{h}}_{1},{\varvec{x}}_{1})\) and \(({\varvec{h}}_{2},{\varvec{x}}_{2})\) are aligned by our assumption, we can invoke Lemma 56 to obtain
which allows one to rewrite \(\beta _{1}\) as
Consequently,
Here, (i) arises from the triangle inequality that
and (ii) occurs since \(\Vert {\varvec{x}}_{1}-{\varvec{x}}_{2}\Vert _{2}\le \Vert {\varvec{x}}_{1}-{\varvec{x}}^{\star }\Vert _{2}+\Vert {\varvec{x}}_{2}-{\varvec{x}}^{\star }\Vert _{2}\le 2\delta \) and \(\Vert {\varvec{x}}_{2}\Vert _{2}\le 2\) (see (195)).
To finish up, note that \(\gamma _{1}+\gamma _{2}\le 2(1+\delta )\le 3\) for \(\delta <1/2\). Substitute these bounds into (193) to obtain
as long as \(\delta \) is small enough.
1.1.2 Proof of Lemma 27
In view of the expressions of \(\nabla ^{2}f\left( {\varvec{z}}\right) \) and \(\nabla ^{2}F\left( {\varvec{z}}^{\star }\right) \) (cf. (80) and (191)) and the triangle inequality, we get
where the four terms on the right-hand side are defined as follows
In what follows, we shall control \(\sup _{{\varvec{z}}\in \mathcal {S}}\alpha _{j}\) for \(j=1,2,3,4\) separately.
- 1.
Regarding the first term \(\alpha _{1}\), the triangle inequality gives
$$\begin{aligned} \alpha _{1}&\le \underbrace{\left\| \sum _{j=1}^{m}\left| {\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}\right| ^{2}{\varvec{b}}_{j}{\varvec{b}}_{j}^{\textsf {H} }-\sum _{j=1}^{m}\left| {\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{\star }\right| ^{2}{\varvec{b}}_{j}{\varvec{b}}_{j}^{\textsf {H} }\right\| }_{:=\beta _{1}}+\underbrace{\left\| \sum _{j=1}^{m}\left| {\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{\star }\right| ^{2}{\varvec{b}}_{j}{\varvec{b}}_{j}^{\textsf {H} }-{\varvec{I}}_{K}\right\| }_{:=\beta _{2}}. \end{aligned}$$To control \(\beta _{1}\), the key observation is that \({\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}\) and \({\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{\star }\) are extremely close. We can rewrite \(\beta _{1}\) as
$$\begin{aligned} \beta _{1}&=\left\| \sum _{j=1}^{m}\left( \left| {\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}\right| ^{2}-\left| {\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{\star }\right| ^{2}\right) {\varvec{b}}_{j}{\varvec{b}}_{j}^{\textsf {H} }\right\| \le \left\| \sum _{j=1}^{m}\left| \left| {\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}\right| ^{2}-\left| {\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{\star }\right| ^{2}\right| {\varvec{b}}_{j}{\varvec{b}}_{j}^{\textsf {H} }\right\| , \end{aligned}$$(197)where
$$\begin{aligned}&\left| \left| {\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}\right| ^{2}-\left| {\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{\star }\right| ^{2}\right| \\&\quad \overset{\text {(i)}}{=}\left| \left[ {\varvec{a}}_{j}^{\textsf {H} }\left( {\varvec{x}}-{\varvec{x}}^{\star }\right) \right] ^{\textsf {H} }{\varvec{a}}_{j}^{\textsf {H} }\left( {\varvec{x}}-{\varvec{x}}^{\star }\right) +\left[ {\varvec{a}}_{j}^{\textsf {H} }\left( {\varvec{x}}-{\varvec{x}}^{\star }\right) \right] ^{\textsf {H} }{\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{\star }+\left( {\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{\star }\right) ^{\textsf {H} }{\varvec{a}}_{j}^{\textsf {H} }\left( {\varvec{x}}-{\varvec{x}}^{\star }\right) \right| \\&\quad \overset{\text {(ii)}}{\le }\left| {\varvec{a}}_{j}^{\textsf {H} }\left( {\varvec{x}}-{\varvec{x}}^{\star }\right) \right| ^{2}+2\left| {\varvec{a}}_{j}^{\textsf {H} }\left( {\varvec{x}}-{\varvec{x}}^{\star }\right) \right| \left| {\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{\star }\right| \\&\quad \overset{\text {(iii)}}{\le }4C_{3}^{2}\frac{1}{\log ^{3}m}+4C_{3}\frac{1}{\log ^{3/2}m}\cdot 5\sqrt{\log m}\\&\quad \lesssim C_{3}\frac{1}{\log m}. \end{aligned}$$Here, the first line (i) uses the identity for \(u,v\in \mathbb {C}\),
$$\begin{aligned} |u|^2 - |v|^2 = u^{\textsf {H} }u-v^{\textsf {H} }v=(u-v)^{\textsf {H} }(u-v)+(u-v)^{\textsf {H} }v+v^{\textsf {H} }(u-v), \end{aligned}$$the second relation (ii) comes from the triangle inequality, and the third line (iii) follows from (189) and assumption (82b). Substitution into (197) gives
$$\begin{aligned} \beta _{1}&\le \max _{1\le j\le m}\left| \left| {\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}\right| ^{2}-\left| {\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{\star }\right| ^{2}\right| \left\| \sum _{j=1}^{m}{\varvec{b}}_{j}{\varvec{b}}_{j}^{\textsf {H} }\right\| \lesssim C_{3}\frac{1}{\log m}, \end{aligned}$$where the last inequality comes from the fact that \(\sum _{j=1}^{m}{\varvec{b}}_{j}{\varvec{b}}_{j}^{\textsf {H} }={\varvec{I}}_{K}\).
The other term \(\beta _{2}\) can be bounded through Lemma 59, which reveals that with probability \(1-O\left( m^{-10}\right) \),
$$\begin{aligned} \beta _{2}\lesssim \sqrt{\frac{K}{m}\log m}. \end{aligned}$$
Taken collectively, the preceding two bounds give
$$\begin{aligned} \sup _{{\varvec{z}}\in \mathcal {S}}\alpha _{1}\lesssim \sqrt{\frac{K}{m}\log {m}}+C_{3}\frac{1}{\log m}. \end{aligned}$$Hence, \(\mathbb {P}( \sup _{{\varvec{z}}\in \mathcal {S}}\alpha _{1} \le 1/32 ) = 1 - O(m^{-10})\).
- 2.
We are going to prove that \(\mathbb {P}( \sup _{{\varvec{z}} \in \mathcal {S} } \alpha _2 \le 1/32 ) = 1 - O(m^{-10})\). The triangle inequality allows us to bound \(\alpha _{2}\) as
$$\begin{aligned} \alpha _{2}&\le \underbrace{\left\| \sum _{j=1}^{m}\left| {\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}\right| ^{2}{\varvec{a}}_{j}{\varvec{a}}_{j}^{\textsf {H} }-\left\| {\varvec{h}}\right\| _{2}^{2}{\varvec{I}}_{K}\right\| }_{:=\theta _{1}({\varvec{h}})}+\underbrace{\left\| \left\| {\varvec{h}}\right\| _{2}^{2}{\varvec{I}}_{K}-{\varvec{I}}_{K}\right\| \phantom {\left\| \sum _{j=1}^{m}\left| {\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}\right| ^{2}{\varvec{a}}_{j}{\varvec{a}}_{j}^{\textsf {H} }-\left\| {\varvec{h}}\right\| _{2}^{2}{\varvec{I}}_{K}\right\| }}_{:=\theta _{2}({\varvec{h}})}. \end{aligned}$$The second term \(\theta _{2}({\varvec{h}})\) is easy to control. To see this, we have
$$\begin{aligned} \theta _{2}({\varvec{h}}) =\left| \left\| {\varvec{h}}\right\| _{2}^{2}-1\right| =\big |\left\| {\varvec{h}}\right\| _{2}-1\big |\left( \left\| {\varvec{h}}\right\| _{2}+1\right) \le 3\delta < 1/64, \end{aligned}$$where the penultimate relation uses the assumption that \(\left\| {\varvec{h}}-{\varvec{h}}^{\star }\right\| _{2}\le \delta \) and hence
$$\begin{aligned} \big |\left\| {\varvec{h}}\right\| _{2}-1\big | \le \delta ,\qquad \left\| {\varvec{h}}\right\| _{2}\le 1+ \delta \le 2. \end{aligned}$$For the first term \(\theta _{1}({\varvec{h}})\), we define a new set
$$\begin{aligned} \mathcal {H}:=&\left\{ {\varvec{h}}\in \mathbb {C}^{K} :\text { } \Vert {\varvec{h}}-{\varvec{h}}^{\star }\Vert _{2} \le \delta \quad \text {and}\quad \max _{1\le j\le m}\left| {\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}\right| \le \frac{2C_{4}\mu \log ^{2}m}{\sqrt{m}}\right\} . \end{aligned}$$It is easily seen that \(\sup _{{\varvec{z}}\in {\mathcal {S}}}\theta _{1}\le \sup _{{\varvec{h}}\in \mathcal {H}}\theta _{1}\). We plan to use the standard covering argument to show that
$$\begin{aligned} \mathbb {P}\left( \sup _{{\varvec{h}}\in \mathcal {H}}\theta _{1}({\varvec{h}}) \le 1/64 \right) = 1 - O(m^{-10}) . \end{aligned}$$(198)To this end, we define \(c_j({\varvec{h}}) = | {\varvec{b}}_j^\textsf {H} {\varvec{h}} |^2\) for every \(1\le j \le m\). It is straightforward to check that
$$\begin{aligned} \theta _1({\varvec{h}})&= \left\| \sum _{j=1}^{m} c_j({\varvec{h}}) \left( {\varvec{a}}_j {\varvec{a}}_j^\textsf {H} - {\varvec{I}}_K \right) \right\| , \qquad \max _{1\le j\le m} |c_j| \le \left( \frac{ 2C_{4}\mu \log ^{2}m }{ \sqrt{m} } \right) ^{2} , \qquad \qquad \end{aligned}$$(199)$$\begin{aligned} \sum _{j=1}^{m}c_j^2&=\sum _{j=1}^{m}|{\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}|^{4}\le \left\{ \max _{1\le j\le m}|{\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}|^{2}\right\} \sum _{j=1}^{m}|{\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}|^{2} = \left\{ \max _{1\le j\le m}|{\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}|^{2}\right\} \Vert {\varvec{h}}\Vert _{2}^{2}\nonumber \\&\le 4\left( \frac{2C_{4}\mu \log ^{2}m}{\sqrt{m}}\right) ^{2} \end{aligned}$$(200)for \({\varvec{h}}\in \mathcal {H}\). In the above argument, we have used the facts that \(\sum _{j=1}^{m}{\varvec{b}}_{j}{\varvec{b}}_{j}^{\textsf {H} }={\varvec{I}}_{K}\) and
$$\begin{aligned} \sum _{j=1}^{m}|{\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}|^{2}={\varvec{h}}^{\textsf {H} }\left( \sum _{j=1}^{m}{\varvec{b}}_{j}{\varvec{b}}_{j}^{\textsf {H} }\right) {\varvec{h}}=\Vert {\varvec{h}}\Vert _{2}^{2} \le (1+\delta )^2\le 4, \end{aligned}$$together with the definition of \(\mathcal {H}\). Lemma 57 combined with (199) and (200) readily yields that for any fixed \({\varvec{h}} \in \mathcal {H}\) and any \(t \ge 0\),
$$\begin{aligned} \mathbb {P}( \theta _1({\varvec{h}}) \ge t )&\le 2 \exp \left( \widetilde{C}_1 K - \widetilde{C}_2 \min \left\{ \frac{t}{\max _{ 1\le j\le m} |c_j| }, \frac{t^2}{\sum _{j=1}^{m}c_j^2} \right\} \right) \nonumber \\&\le 2 \exp \left( \widetilde{C}_1 K - \widetilde{C}_2 \frac{m t \min \left\{ 1, t/4 \right\} }{4 C_4^2 \mu ^2 \log ^4 m} \right) , \end{aligned}$$(201)where \(\widetilde{C}_1, \widetilde{C}_2 >0\) are some universal constants.
Now we are in a position to strengthen this bound to obtain uniform control of \(\theta _1\) over \(\mathcal {H}\). Note that for any \({\varvec{h}}_1,{\varvec{h}}_2\in \mathcal {H}\),
$$\begin{aligned} |\theta _1({\varvec{h}}_1) - \theta _1({\varvec{h}}_2)|&\le \left\| \sum _{j=1}^{m} \left( |{\varvec{b}}_j^\textsf {H} {\varvec{h}}_1|^2 - |{\varvec{b}}_j^\textsf {H} {\varvec{h}}_2|^2 \right) {\varvec{a}}_j {\varvec{a}}_j^\textsf {H} \right\| + \left| \Vert {\varvec{h}}_1\Vert _2^2 -\Vert {\varvec{h}}_2\Vert _2^2 \right| \\&\le \max _{1\le j\le m} \left| |{\varvec{b}}_j^\textsf {H} {\varvec{h}}_1|^2 - |{\varvec{b}}_j^\textsf {H} {\varvec{h}}_2|^2 \right| \left\| \sum _{j=1}^{m} {\varvec{a}}_j {\varvec{a}}_j^\textsf {H} \right\| + \left| \Vert {\varvec{h}}_1\Vert _2^2 -\Vert {\varvec{h}}_2\Vert _2^2 \right| , \end{aligned}$$where
$$\begin{aligned} \left| |{\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}_{2}|^{2}-|{\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}_{1}|^{2}\right|&=\left| ({\varvec{h}}_{2}-{\varvec{h}}_{1})^{\textsf {H} }{\varvec{b}}_{j}{\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}_{2}+{\varvec{h}}_{1}^{\textsf {H} }{\varvec{b}}_{j}{\varvec{b}}_{j}^{\textsf {H} }({\varvec{h}}_{2}-{\varvec{h}}_{1})\right| \\&\le 2\max \{\Vert {\varvec{h}}_{1}\Vert _{2},\Vert {\varvec{h}}_{2}\Vert _{2}\}\Vert {\varvec{h}}_{2}-{\varvec{h}}_{1}\Vert _{2}\Vert {\varvec{b}}_{j}\Vert _{2}^{2}\\&\le 4\Vert {\varvec{h}}_{2}-{\varvec{h}}_{1}\Vert _{2}\Vert {\varvec{b}}_{j}\Vert _{2}^{2} \le \frac{4K}{m}\Vert {\varvec{h}}_{2}-{\varvec{h}}_{1}\Vert _{2} \end{aligned}$$and
$$\begin{aligned} \left| \Vert {\varvec{h}}_1\Vert _2^2 -\Vert {\varvec{h}}_2\Vert _2^2 \right|&=\left| {\varvec{h}}_1^\textsf {H} ( {\varvec{h}}_1 - {\varvec{h}}_2 ) - ( {\varvec{h}}_1 - {\varvec{h}}_2 )^\textsf {H} {\varvec{h}}_2 \right| \\&\le 2\max \{\Vert {\varvec{h}}_{1}\Vert _{2},\Vert {\varvec{h}}_{2}\Vert _{2}\}\Vert {\varvec{h}}_{2}-{\varvec{h}}_{1}\Vert _{2} \le 4 \Vert {\varvec{h}}_1 - {\varvec{h}}_2\Vert _2. \end{aligned}$$Define an event \(\mathcal {E}_0 = \left\{ \left\| \sum _{j=1}^{m} {\varvec{a}}_j {\varvec{a}}_j^\textsf {H} \right\| \le 2m \right\} \). When \(\mathcal {E}_0\) happens, the previous estimates give
$$\begin{aligned} |\theta _1({\varvec{h}}_1) - \theta _1({\varvec{h}}_2)| \le (8K+4) \Vert {\varvec{h}}_1 - {\varvec{h}}_2 \Vert _2 \le 10 K \Vert {\varvec{h}}_1 - {\varvec{h}}_2 \Vert _2, \qquad \forall {\varvec{h}}_1,{\varvec{h}}_2 \in \mathcal {H}. \end{aligned}$$Let \(\varepsilon ={1} / ({1280 K})\), and \(\widetilde{\mathcal {H}}\) be an \(\varepsilon \)-net covering \(\mathcal {H}\) (see [116, Definition 5.1]). We have
$$\begin{aligned} \left( \left\{ \sup _{{\varvec{h}} \in \widetilde{\mathcal {H}} } \theta _1({\varvec{h}}) \le \frac{1}{128} \right\} \cap \mathcal {E}_0 \right) \subseteq \left\{ \sup _{{\varvec{h}} \in \mathcal {H}} \theta _1 \le \frac{1}{64} \right\} \end{aligned}$$and, as a result,
$$\begin{aligned} \mathbb {P}\left( \sup _{{\varvec{h}} \in \mathcal {H}} \theta _1({\varvec{h}}) \ge \frac{1}{64} \right)&\le \mathbb {P}\left( \sup _{{\varvec{h}} \in \widetilde{\mathcal {H}} } \theta _1({\varvec{h}}) \ge \frac{1}{128} \right) + \mathbb {P}( \mathcal {E}_0^c) \\&\le |\widetilde{\mathcal {H}}|\cdot \max _{{\varvec{h}} \in \widetilde{\mathcal {H}}} \mathbb {P}\left( \theta _1({\varvec{h}}) \ge \frac{1}{128} \right) +\mathbb {P}( \mathcal {E}_0^c) . \end{aligned}$$Lemma 57 forces that \(\mathbb {P}( \mathcal {E}_0^c )=O(m^{-10})\). Additionally, we have \(\log |{\widetilde{\mathcal {H}}}| \le \widetilde{C}_3 K \log K\) for some absolute constant \(\widetilde{C}_3>0\) according to [116, Lemma 5.2]. Hence, (201) leads to
$$\begin{aligned}&|\widetilde{\mathcal {H}}|\cdot \max _{{\varvec{h}} \in \widetilde{\mathcal {H}}} \mathbb {P}\left( \theta _1({\varvec{h}}) \ge \frac{1}{128} \right) \\&\quad \le 2 \exp \left( \widetilde{C}_3 K \log K + \widetilde{C}_1 K - \widetilde{C}_2 \frac{m (1/128) \min \left\{ 1, (1/128)/4 \right\} }{4 C_4^2 \mu ^2 \log ^4 m} \right) \\&\quad \le 2 \exp \left( 2\widetilde{C}_3 K \log m - \frac{\widetilde{C}_4 m}{\mu ^2 \log ^4 m} \right) \end{aligned}$$for some constant \(\widetilde{C}_4 > 0 \). Under the sample complexity \(m\gg \mu ^2 K \log ^5 m\), the right-hand side of the above display is at most \(O\left( m^{-10}\right) \). Combine the estimates above to establish the desired high-probability bound for \(\sup _{{\varvec{z}}\in {\mathcal {S}}} \alpha _2 \).
- 3.
Next, we will demonstrate that
$$\begin{aligned} \mathbb {P}( \sup _{{\varvec{z}}\,\in \,{\mathcal {S}}}\alpha _{3}\le 1/96 ) = 1-O\left( m^{-10} + e^{-K} \log m \right) . \end{aligned}$$To this end, we let
$$\begin{aligned} {\varvec{A}}= & {} \left[ \begin{array}{c} {\varvec{a}}^\textsf {H} _1\\ \vdots \\ {\varvec{a}}^\textsf {H} _m \end{array}\right] \in \mathbb {C}^{m\times K} ,\quad {\varvec{B}}= \left[ \begin{array}{c} {\varvec{b}}^\textsf {H} _1\\ \vdots \\ {\varvec{b}}^\textsf {H} _m \end{array}\right] \in \mathbb {C}^{m \times K},\\ {\varvec{C}}= & {} \left[ \begin{array}{cccc} c_{1}\left( {\varvec{z}}\right) \\ &{} c_{2}\left( {\varvec{z}}\right) \\ &{} &{} \ddots \\ &{} &{} &{} c_{m}\left( {\varvec{z}}\right) \end{array}\right] \in \mathbb {C}^{m \times m}, \end{aligned}$$where for each \(1\le j \le m\),
$$\begin{aligned} c_j({\varvec{z}}) := {\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}{\varvec{x}}^{\textsf {H} }{\varvec{a}}_{j}-y_{j} = {\varvec{b}}_{j}^{\textsf {H} } ( {\varvec{h}}{\varvec{x}}^{\textsf {H} }-{\varvec{h}}^{\star } {\varvec{x}}^{\star \textsf {H} } ){\varvec{a}}_{j}. \end{aligned}$$As a consequence, we can write \(\alpha _3 = \Vert {\varvec{B}}^\textsf {H} {\varvec{C}} {\varvec{A}} \Vert \).
The key observation is that both the \(\ell _{\infty }\) norm and the Frobenius norm of \({\varvec{C}}\) are well controlled. Specifically, we claim for the moment that with probability at least \(1-O\left( m^{-10}\right) \),
$$\begin{aligned} \left\| {\varvec{C}}\right\| _{\infty }=\max _{1\le j \le m }\left| c_{j}\right|&\le C \frac{\mu \log ^{5/2} m }{\sqrt{m}}; \end{aligned}$$(202a)$$\begin{aligned} \left\| {\varvec{C}}\right\| _{\mathrm {F}}^{2} = \sum _{j=1}^{m} \left| c_{j}\right| ^2&\le 12\delta ^{2}, \end{aligned}$$(202b)where \(C > 0\) is some absolute constant. This motivates us to divide the entries in \({\varvec{C}}\) into multiple groups based on their magnitudes.
To be precise, introduce \(R :=1+ \lceil \log _2 ( C \mu \log ^{7/2} m ) \rceil \) sets \(\{\mathcal {I}_{r}\}_{1\le r \le R}\), where
$$\begin{aligned} \mathcal {I}_r = \left\{ j\in [m]:\text { } \frac{C \mu \log ^{5/2} m }{2^r \sqrt{m} } < |c_j| \le \frac{C \mu \log ^{5/2} m }{2^{r-1} \sqrt{m} } \right\} , \qquad 1\le r \le R-1 \end{aligned}$$and \(\mathcal {I}_R = \{1,\cdots , m\} \,\backslash \, \big ( \bigcup _{r=1}^{R-1} \mathcal {I}_r \big )\). An immediate consequence of the definition of \(\mathcal {I}_{r}\) and the norm constraints in (202) is the following cardinality bound
$$\begin{aligned} |\mathcal {I}_r| \le \frac{ \left\| {\varvec{C}} \right\| _{\mathrm {F}}^2 }{ \min _{j \in \mathcal {I}_{r}}\left| c_{j}\right| ^2} \le \frac{ 12\delta ^2 }{\left( \frac{C \mu \log ^{5/2} m }{2^{r} \sqrt{m} }\right) ^2} = \underbrace{ \frac{ 12 \delta ^2 4^r }{ C^2 \mu ^2 \log ^5 m } }_{\delta _r} m \end{aligned}$$(203)for \(1\le r \le R-1\). Since \(\{\mathcal {I}_{r}\}_{1\le r \le R}\) form a partition of the index set \(\{1,\cdots ,m\}\), it is easy to see that
$$\begin{aligned} {\varvec{B}}^\textsf {H} {\varvec{C}} {\varvec{A}}= \sum _{r=1}^{R} ({\varvec{B}}_{\mathcal {I}_r,\cdot })^\textsf {H} {\varvec{C}}_{\mathcal {I}_r,\mathcal {I}_r} {\varvec{A}}_{\mathcal {I}_r,\cdot }, \end{aligned}$$where \({\varvec{D}}_{\mathcal {I},\mathcal {J}}\) denotes the submatrix of \({\varvec{D}}\) induced by the rows and columns of \({\varvec{D}}\) having indices from \(\mathcal {I}\) and \(\mathcal {J}\), respectively, and \({\varvec{D}}_{\mathcal {I},\cdot }\) refers to the submatrix formed by the rows from the index set \(\mathcal {I}\). As a result, one can invoke the triangle inequality to derive
$$\begin{aligned} \alpha _3 \le \sum _{r=1}^{R-1} \left\| {\varvec{B}}_{\mathcal {I}_r,\cdot } \right\| \cdot \left\| {\varvec{C}}_{\mathcal {I}_r,\mathcal {I}_r} \right\| \cdot \left\| {\varvec{A}}_{\mathcal {I}_r,\cdot }\right\| +\left\| {\varvec{B}}_{\mathcal {I}_R,\cdot } \right\| \cdot \left\| {\varvec{C}}_{\mathcal {I}_R,\mathcal {I}_R} \right\| \cdot \left\| {\varvec{A}}_{\mathcal {I}_R,\cdot }\right\| . \end{aligned}$$(204)Recognizing that \({\varvec{B}}^\textsf {H} {\varvec{B}} = {\varvec{I}}_K\), we obtain
$$\begin{aligned} \left\| {\varvec{B}}_{\mathcal {I}_r,\cdot } \right\| \le \left\| {\varvec{B}} \right\| =1 \end{aligned}$$for every \(1\le r \le R\). In addition, by construction of \(\mathcal {I}_{r}\), we have
$$\begin{aligned} \left\| {\varvec{C}}_{\mathcal {I}_r,\mathcal {I}_r} \right\| = \max _{j\in \mathcal {I}_r} |c_j| \le \frac{C \mu \log ^{5/2} m }{2^{r-1} \sqrt{m}} \end{aligned}$$for every \(1\le r \le R\). In particular, for \(r=R\), one has
$$\begin{aligned} \left\| {\varvec{C}}_{\mathcal {I}_R,\mathcal {I}_R} \right\| = \max _{j\in \mathcal {I}_R} |c_j| \le \frac{C \mu \log ^{5/2} m }{2^{R-1} \sqrt{m} } \le \frac{1}{\sqrt{m} \log m}, \end{aligned}$$which follows from the definition of R, i.e., \(R=1+ \lceil \log _2 ( C \mu \log ^{7/2} m ) \rceil \). Regarding \(\left\| {\varvec{A}}_{\mathcal {I}_r,\cdot } \right\| \), we discover that \(\left\| {\varvec{A}}_{\mathcal {I}_R,\cdot } \right\| \le \left\| {\varvec{A}} \right\| \) and, in view of (203),
$$\begin{aligned} \left\| {\varvec{A}}_{\mathcal {I}_r,\cdot } \right\| \le \sup _{\mathcal {I}:|\mathcal {I}|\le \delta _{r}m}\left\| {\varvec{A}}_{\mathcal {I},\cdot } \right\| , \qquad 1\le r \le R-1. \end{aligned}$$Substitute the above estimates into (204) to get
$$\begin{aligned} \alpha _{3} \le \sum _{r=1}^{R-1} \frac{C \mu \log ^{5/2} m }{2^{r-1} \sqrt{m} } \sup _{\mathcal {I}:|\mathcal {I}|\le \delta _{r}m}\left\| {\varvec{A}}_{\mathcal {I},\cdot } \right\| + \frac{\left\| {\varvec{A}} \right\| }{\sqrt{m} \log m}. \end{aligned}$$(205)It remains to upper bound \(\left\| {\varvec{A}} \right\| \) and \(\sup _{\mathcal {I}: |\mathcal {I}|\le \delta _{r}m}\left\| {\varvec{A}}_{\mathcal {I},\cdot } \right\| \). Lemma 57 tells us that \(\left\| {\varvec{A}} \right\| \le 2\sqrt{m}\) with probability at least \(1-O\left( m^{-10}\right) \). Furthermore, we can invoke Lemma 58 to bound \(\sup _{\mathcal {I}: |\mathcal {I}|\le \delta _{r}m}\left\| {\varvec{A}}_{\mathcal {I},\cdot } \right\| \) for each \(1\le r \le R-1\). It is easily seen from our assumptions \(m \gg \mu ^2 K \log ^9 m\) and \(\delta = c/\log ^2 m\) that \(\delta _r \gg K/m\). In addition,
$$\begin{aligned} \delta _r \le \frac{ 12 \delta ^2 4^{R-1} }{C^2 \mu ^2 \log ^5 m } \le \frac{ 12 \delta ^2 4^{1+ \log _2 ( C \mu \log ^{7/2} m )} }{C^2 \mu ^2 \log ^5 m } = \frac{ 48 \delta ^2 C^2 \mu ^2 \log ^7 m }{C^2 \mu ^2 \log ^5 m } =48\delta ^2\log ^2 m = \frac{48c^2}{\log ^2 m}\ll 1. \end{aligned}$$By Lemma 58, we obtain that for some constants \(\widetilde{C}_{2}, \widetilde{C}_{3}>0\)
$$\begin{aligned} \mathbb {P}\left( \sup _{\mathcal {I}: |\mathcal {I}|\le \delta _{r}m}\left\| {\varvec{A}}_{\mathcal {I},\cdot } \right\| \ge \sqrt{4 \widetilde{C}_3 \delta _r m \log (e/\delta _r) }\right)&\le 2\exp \left( - \frac{\widetilde{C}_2 \widetilde{C}_3}{3} \delta _r m \log (e/ \delta _r ) \right) \\&\le 2\exp \left( - \frac{\widetilde{C}_2 \widetilde{C}_3}{3} \delta _r m\right) \le 2 e^{-K}. \end{aligned}$$Taking the union bound and substituting the estimates above into (205), we see that with probability at least \(1-O\left( m^{-10}\right) - O\left( (R-1)e^{-K}\right) \),
$$\begin{aligned} \alpha _{3}&\le \sum _{r=1}^{R-1} \frac{C \mu \log ^{5/2} m }{2^{r-1} \sqrt{m} } \cdot \sqrt{4 \widetilde{C}_3 \delta _r m \log (e/\delta _r) }+ \frac{2\sqrt{m}}{\sqrt{m} \log m} \\&\le \sum _{r=1}^{R-1} 4\delta \sqrt{12 \widetilde{C}_3 \log (e/\delta _r) } + \frac{2}{ \log m} \\&\lesssim (R-1) \delta \sqrt{\log (e/\delta _1)} + \frac{1}{\log m}. \end{aligned}$$Note that \(\mu \le \sqrt{m}\), \(R-1=\lceil \log _2 ( C \mu \log ^{7/2} m ) \rceil \lesssim \log m\), and
$$\begin{aligned} \sqrt{\log \frac{e}{\delta _1}} = \sqrt{\log \left( \frac{ eC^2 \mu ^2 \log ^5 m }{ 48 \delta ^2 } \right) } \lesssim \log m. \end{aligned}$$Therefore, with probability exceeding \(1-O\left( m^{-10}\right) - O\left( e^{-K}\log m\right) \),
$$\begin{aligned} \sup _{{\varvec{z}} \in {\mathcal {S}}} \alpha _3 \lesssim \delta \log ^2 m + \frac{1}{\log m}. \end{aligned}$$By taking c to be small enough in \(\delta = c/ \log ^2 m\), we get
$$\begin{aligned} \mathbb {P}\left( \sup _{{\varvec{z}} \in {\mathcal {S}}} \alpha _3 \ge 1/96 \right) \le O\left( m^{-10}\right) + O\left( e^{-K}\log m\right) \end{aligned}$$as claimed.
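As an illustration of the dyadic peeling device used above, the following Python sketch partitions a vector obeying the two norm constraints (202) into magnitude bins and verifies the cardinality bound (203); the sizes \(m\), \(\delta \) and the cap B_cap (standing in for \(C\mu \log ^{5/2}m/\sqrt{m}\)) are illustrative assumptions rather than the constants of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
m, delta = 10_000, 0.02
B_cap = 50 * delta / np.sqrt(m)       # stands in for C * mu * log^{5/2}(m) / sqrt(m)

# Build a vector satisfying the ell_inf bound, cf. (202a), and Frobenius bound, cf. (202b)
c = rng.standard_normal(m) * delta / np.sqrt(m)
c = np.clip(c, -B_cap, B_cap)
c *= min(1.0, np.sqrt(12) * delta / np.linalg.norm(c))

for r in range(1, 25):                # dyadic magnitude bins I_r
    lo, hi = B_cap / 2**r, B_cap / 2**(r - 1)
    bin_r = np.flatnonzero((np.abs(c) > lo) & (np.abs(c) <= hi))
    # (203): every entry of bin r exceeds lo in magnitude, so |I_r| <= ||c||_F^2 / lo^2
    assert len(bin_r) <= np.linalg.norm(c) ** 2 / lo ** 2
```

The point of the peeling is visible here: bins containing large entries are necessarily small, so their row-submatrices of \({\varvec{A}}\) obey the restricted spectral norm bound of Lemma 58, while bins containing many entries hold only tiny entries of \({\varvec{C}}\).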
Finally, it remains to justify (202). For all \({\varvec{z}}\in {\mathcal {S}}\), the triangle inequality tells us that
$$\begin{aligned} |c_j|&\le \left| {\varvec{b}}_j^\textsf {H} {\varvec{h}} ( {\varvec{x}} - {\varvec{x}}^{\star } )^{\textsf {H} } {\varvec{a}}_j \right| + \left| {\varvec{b}}_j^\textsf {H} ( {\varvec{h}}- {\varvec{h}}^{\star } ) {\varvec{x}}^{\star \textsf {H} } {\varvec{a}}_j \right| \\&\le \left| {\varvec{b}}_j^\textsf {H} {\varvec{h}}\right| \cdot \left| {\varvec{a}}_j^\textsf {H} ( {\varvec{x}} - {\varvec{x}}^{\star } )\right| + \left( \left| {\varvec{b}}_j^\textsf {H} {\varvec{h}}\right| + \left| {\varvec{b}}_j^\textsf {H} {\varvec{h}}^{\star } \right| \right) \cdot \left| {\varvec{a}}_j^\textsf {H} {\varvec{x}}^{\star } \right| \\&\le \frac{2C_4 \mu \log ^2 m}{\sqrt{m}} \cdot \frac{2C_3}{\log ^{3/2} m } + \left( \frac{2C_4 \mu \log ^2 m}{\sqrt{m}} + \frac{\mu }{\sqrt{m}} \right) 5 \sqrt{\log m} \\&\le C \frac{\mu \log ^{5/2} m }{\sqrt{m}}, \end{aligned}$$for some large constant \(C>0\), where we have used the definition of \({\mathcal {S}}\) and fact (189). This establishes claim (202a). Claim (202b) follows directly from [76, Lemma 5.14]. To avoid confusion, we use \(\mu _1\) to refer to the parameter \(\mu \) therein. Let \(L=m\), \(N=K\), \(d_0=1\), \(\mu _1=C_4 \mu \log ^2 m / 2\), and \(\varepsilon =1/15\). Then
$$\begin{aligned} {\mathcal {S}}\subseteq \mathcal {N}_{d_0} \cap \mathcal {N}_{\mu _1} \cap \mathcal {N}_{\varepsilon }, \end{aligned}$$and the sample complexity condition \(L \gg \mu _1^2 (K+N) \log ^2 L \) is satisfied because we have assumed \(m\gg \mu ^2 K \log ^6 m\). Therefore, with probability exceeding \(1-O\left( m^{-10}+e^{-K}\right) \), we obtain that for all \({\varvec{z}}\in {\mathcal {S}}\),
$$\begin{aligned} \left\| {\varvec{C}}\right\| _{\mathrm {F}}^{2} \le \frac{5}{4} \left\| {\varvec{h}}{\varvec{x}}^{\textsf {H} }-{\varvec{h}}^{\star }{\varvec{x}}^{\star \textsf {H} }\right\| _{\mathrm {F}}^2. \end{aligned}$$Claim (202b) can then be justified by observing that
$$\begin{aligned} \left\| {\varvec{h}}{\varvec{x}}^{\textsf {H} }-{\varvec{h}}^{\star }{\varvec{x}}^{\star \textsf {H} }\right\| _{\mathrm {F}}&=\left\| {\varvec{h}}\left( {\varvec{x}}-{\varvec{x}}^{\star }\right) ^{\textsf {H} }+\left( {\varvec{h}}-{\varvec{h}}^{\star }\right) {\varvec{x}}^{\star \textsf {H} }\right\| _{\mathrm {F}} \\&\le \left\| {\varvec{h}}\right\| _{2}\left\| {\varvec{x}}-{\varvec{x}}^{\star }\right\| _{2}+\left\| {\varvec{h}}-{\varvec{h}}^{\star }\right\| _{2}\left\| {\varvec{x}}^{\star }\right\| _{2}\le 3\delta . \end{aligned}$$
- 4.
It remains to control \(\alpha _{4}\), for which we make note of the following inequality
$$\begin{aligned} \alpha _{4}\le \underbrace{ \left\| \sum _{j=1}^{m}{\varvec{b}}_{j}{\varvec{b}}_{j}^{\textsf {H} }({\varvec{h}}{\varvec{x}}^{\top }-{\varvec{h}}^{\star } {\varvec{x}}^{\star \top })\overline{{\varvec{a}}_{j}}\,\overline{{\varvec{a}}_{j}}^{\textsf {H} }\right\| }_{\theta _3} + \underbrace{ \left\| \sum _{j=1}^{m}{\varvec{b}}_{j}{\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}^{\star }{\varvec{x}}^{\star \top } (\overline{{\varvec{a}}_{j}}\,\overline{{\varvec{a}}_{j}}^{\textsf {H} }-{\varvec{I}}_{K})\right\| }_{\theta _4} \end{aligned}$$with \(\overline{{\varvec{a}}_{j}}\) denoting the entrywise conjugate of \({\varvec{a}}_{j}\). Since \(\{\overline{{\varvec{a}}_{j}}\}\) has the same joint distribution as \(\{{\varvec{a}}_{j}\}\), by the same argument used for bounding \(\alpha _{3}\) we obtain control of the first term, namely
$$\begin{aligned} \mathbb {P}\left( \sup _{{\varvec{z}}\in {\mathcal {S}}} \theta _3 \ge 1/96 \right) = O(m^{-10}+e^{-K}\log m). \end{aligned}$$Note that \(m \gg \mu ^2 K \log m / \delta ^2\) and \(\delta \ll 1\). According to [76, Lemma 5.20],
$$\begin{aligned} \mathbb {P}\left( \sup _{{\varvec{z}}\in {\mathcal {S}}} \theta _4 \ge 1/96 \right) \le \mathbb {P}\left( \sup _{{\varvec{z}}\in {\mathcal {S}}} \theta _4 \ge \delta \right) = O(m^{-10}). \end{aligned}$$Putting together the above bounds, we reach \(\mathbb {P}( \sup _{{\varvec{z}} \in {\mathcal {S}}} \alpha _{4}\le 1/48) = 1-O(m^{-10}+e^{-K}\log m)\).
- 5.
Combining all the previous bounds for \(\sup _{{\varvec{z}}\in {\mathcal {S}}} \alpha _j\) and (196), we deduce that with probability \(1-O(m^{-10}+e^{-K}\log m)\),
$$\begin{aligned} \left\| \nabla ^{2}f\left( {\varvec{z}}\right) -\nabla ^{2}F\left( {\varvec{z}}^{\star }\right) \right\| \le 2\cdot \frac{1}{32}+2\cdot \frac{1}{32}+4\cdot \frac{1}{96}+4\cdot \frac{1}{48}= \frac{1}{4}. \end{aligned}$$
1.2 Proofs of Lemmas 15 and 16
Proof of Lemma 15
In view of the definition of \(\alpha ^{t+1}\) (see (38)), one has
The gradient update rules (79) imply that
where we denote \(\widetilde{{\varvec{h}}}^{t}=\frac{1}{\overline{\alpha ^{t}}}{\varvec{h}}^{t}\) and \(\widetilde{{\varvec{x}}}^{t}=\alpha ^{t}{\varvec{x}}^{t}\) as in (81). Let \(\widehat{{\varvec{h}}}^{t+1} = \frac{1}{\overline{\alpha ^{t}}}{\varvec{h}}^{t+1}\) and \(\widehat{{\varvec{x}}}^{t+1} = \alpha ^{t}{\varvec{x}}^{t+1}\). We further get
The fundamental theorem of calculus (see Appendix D.3.1) together with the fact that \(\nabla f\left( {\varvec{z}}^{\star }\right) ={\varvec{0}}\) tells us
where we denote \({{\varvec{z}}}\left( \tau \right) :={\varvec{z}}^{\star }+\tau \left( \widetilde{{\varvec{z}}}^{t}-{\varvec{z}}^{\star }\right) \) and \(\nabla ^{2}f\) is the Wirtinger Hessian. To further simplify notation, denote \(\widehat{{\varvec{z}}}^{t+1}=\begin{bmatrix} \widehat{{\varvec{h}}}^{t+1}\\ \widehat{{\varvec{x}}}^{t+1} \end{bmatrix}\). Identity (207) allows us to rewrite (206) as
Take the squared Euclidean norm of both sides of (208) to reach
Since \({\varvec{z}}\left( \tau \right) \) lies between \(\widetilde{{\varvec{z}}}^{t}\) and \({\varvec{z}}^{\star }\), we conclude from assumptions (85) that for all \(0\le \tau \le 1\),
for \(\xi >0\) sufficiently small. Moreover, it is straightforward to see that
satisfy
as long as \(\xi >0\) is sufficiently small. We can now readily invoke Lemma 14 to arrive at
Substitution into (209) indicates that
When \(0<\eta \le {1}/{128}\), this implies that
and hence
This completes the proof of Lemma 15. \(\square \)
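The contraction just established can also be observed numerically. Below is a minimal, self-contained Python sketch of the regularization-free scaled gradient iteration of the form (79) for the loss \(f({\varvec{z}})=\sum _{j}|{\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}{\varvec{x}}^{\textsf {H} }{\varvec{a}}_{j}-y_{j}|^{2}\); the dimensions, step size, and the initialization (a small perturbation of the truth rather than the spectral estimate) are illustrative assumptions, not the paper's regime.

```python
import numpy as np

rng = np.random.default_rng(1)
K, m, eta, T = 8, 600, 0.1, 200       # illustrative sizes and step size

# Rows of B are b_j^H (orthonormal columns); rows of A are a_j^H (complex Gaussian)
B = np.linalg.qr(rng.standard_normal((m, K)) + 1j * rng.standard_normal((m, K)))[0]
A = (rng.standard_normal((m, K)) + 1j * rng.standard_normal((m, K))) / np.sqrt(2)
h_st = rng.standard_normal(K) + 1j * rng.standard_normal(K); h_st /= np.linalg.norm(h_st)
x_st = rng.standard_normal(K) + 1j * rng.standard_normal(K); x_st /= np.linalg.norm(x_st)
y = (B @ h_st) * np.conj(A @ x_st)    # y_j = b_j^H h* x*^H a_j

h = h_st + 0.05 * rng.standard_normal(K)   # start inside the local basin
x = x_st + 0.05 * rng.standard_normal(K)
for _ in range(T):
    c = (B @ h) * np.conj(A @ x) - y                 # residuals c_j(z)
    grad_h = B.conj().T @ (c * (A @ x))              # sum_j c_j (a_j^H x) b_j
    grad_x = A.conj().T @ (np.conj(c) * (B @ h))     # sum_j conj(c_j) (b_j^H h) a_j
    h, x = (h - eta / np.linalg.norm(x) ** 2 * grad_h,
            x - eta / np.linalg.norm(h) ** 2 * grad_x)

# hx^H is invariant under the scaling ambiguity, so no alignment is needed
print(np.linalg.norm(np.outer(h, x.conj()) - np.outer(h_st, x_st.conj())))
```

The printed Frobenius error shrinks linearly in \(T\) at a constant step size, mirroring the \((1-\eta /16)\)-type contraction of the lemma.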
Proof of Lemma 16
Reuse the notation in this subsection, namely \(\widehat{{\varvec{z}}}^{t+1}=\left[ \begin{array}{c} \widehat{{\varvec{h}}}^{t+1}\\ \widehat{{\varvec{x}}}^{t+1} \end{array}\right] \) with \(\widehat{{\varvec{h}}}^{t+1}=\frac{1}{\overline{\alpha ^{t}}}{\varvec{h}}^{t+1}\) and \(\widehat{{\varvec{x}}}^{t+1}=\alpha ^{t}{\varvec{x}}^{t+1}\). From (210), one can tell that
Invoke Lemma 52 with \(\beta =\alpha ^{t}\) to get
This, combined with the assumption \(||\alpha ^{t}|-1|\le 1/2\), implies that
This finishes the proof of the first claim.
The second claim can be proved by induction. Suppose that \(\big ||\alpha ^{s}|-1\big |\le 1/2\) and \(\mathrm {dist}({\varvec{z}}^{s},{\varvec{z}}^{\star })\le C_{1}(1-\eta /16)^{s}/\log ^{2}m\) hold for all \(0\le s\le \tau \le t\); then using our result in the first part gives
for m sufficiently large. The proof is then complete by induction. \(\square \)
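For concreteness, the induction step invoked above can be summarized as follows; this chain is a reconstruction from the contraction established in the proof of Lemma 15 together with the hypothesis displayed above, not a verbatim excerpt:

$$\begin{aligned} \mathrm {dist}\big ({\varvec{z}}^{\tau +1},{\varvec{z}}^{\star }\big )\le \left( 1-\frac{\eta }{16}\right) \mathrm {dist}\big ({\varvec{z}}^{\tau },{\varvec{z}}^{\star }\big )\le C_{1}\frac{\left( 1-\eta /16\right) ^{\tau +1}}{\log ^{2}m}, \end{aligned}$$

after which the first claim propagates \(\big ||\alpha ^{\tau +1}|-1\big |\le 1/2\) to the next iterate.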
1.3 Proof of Lemma 17
Define the alignment parameter between \({\varvec{z}}^{t,\left( l\right) }\) and \(\widetilde{{\varvec{z}}}^{t}\) as
Further denote, for simplicity of presentation, \({\widehat{{\varvec{z}}}^{t,(l)}=\begin{bmatrix} \widehat{{\varvec{h}}}^{t,\left( l\right) }\\ \widehat{{\varvec{x}}}^{t,\left( l\right) } \end{bmatrix}}\) with
Clearly, \(\widehat{{\varvec{z}}}^{t,(l)}\) is aligned with \(\widetilde{{\varvec{z}}}^{t}\).
Armed with the above notation, we have
where (211) follows by taking \(\alpha =\frac{\alpha ^{t+1}}{\alpha ^{t}}\alpha _{\text {mutual}}^{t,\left( l\right) }\). The latter bound is more convenient to work with when controlling the gap between \({\varvec{z}}^{t,(l)}\) and \({\varvec{z}}^{t}\).
We can then apply the gradient update rules (79) and (89) to get
By construction, we can write the leave-one-out gradients as
which allow us to continue the derivation and obtain
This further gives
In what follows, we bound the three terms \({\varvec{\nu }}_1\), \({\varvec{\nu }}_2\), and \({\varvec{\nu }}_3\) separately.
- 1.
Regarding the first term \({\varvec{\nu }}_{1}\), one can adopt the same strategy as in Appendix C.2. Specifically, write
$$\begin{aligned}&\left[ \begin{array}{c} \widehat{{\varvec{h}}}^{t,\left( l\right) }-\frac{\eta }{\left\| \widehat{{\varvec{x}}}^{t,\left( l\right) }\right\| _{2}^{2}}\nabla _{{\varvec{h}}}f\left( \widehat{{\varvec{z}}}^{t,\left( l\right) }\right) -\Big (\widetilde{{\varvec{h}}}^{t}-\frac{\eta }{\left\| \widehat{{\varvec{x}}}^{t,\left( l\right) }\right\| _{2}^{2}}\nabla _{{\varvec{h}}}f\left( \widetilde{{\varvec{z}}}^{t}\right) \Big )\\ \widehat{{\varvec{x}}}^{t,\left( l\right) }-\frac{\eta }{\left\| \widehat{{\varvec{h}}}^{t,\left( l\right) }\right\| _{2}^{2}}\nabla _{{\varvec{x}}}f\left( \widehat{{\varvec{z}}}^{t,\left( l\right) }\right) -\Big (\widetilde{{\varvec{x}}}^{t}-\frac{\eta }{\left\| \widehat{{\varvec{h}}}^{t,\left( l\right) }\right\| _{2}^{2}}\nabla _{{\varvec{x}}}f\left( \widetilde{{\varvec{z}}}^{t}\right) \Big )\\ \overline{\widehat{{\varvec{h}}}^{t,\left( l\right) }-\frac{\eta }{\Vert \widehat{{\varvec{x}}}^{t,\left( l\right) } \Vert _{2}^{2}}\nabla _{{\varvec{h}}}f\big (\widehat{{\varvec{z}}}^{t,\left( l\right) }\big )-\Big (\widetilde{{\varvec{h}}}^{t}-\frac{\eta }{\Vert \widehat{{\varvec{x}}}^{t,\left( l\right) }\Vert _{2}^{2}}\nabla _{{\varvec{h}}}f\left( \widetilde{{\varvec{z}}}^{t}\right) \Big )}\\ \overline{\widehat{{\varvec{x}}}^{t,\left( l\right) }-\frac{\eta }{\Vert \widehat{{\varvec{h}}}^{t,\left( l\right) }\Vert _{2}^{2}}\nabla _{{\varvec{x}}}f\left( \widehat{{\varvec{z}}}^{t,\left( l\right) }\right) -\Big (\widetilde{{\varvec{x}}}^{t}-\frac{\eta }{\Vert \widehat{{\varvec{h}}}^{t,\left( l\right) } \Vert _{2}^{2}}\nabla _{{\varvec{x}}}f\left( \widetilde{{\varvec{z}}}^{t}\right) \Big )} \end{array}\right] =\left[ \begin{array}{c} \widehat{{\varvec{h}}}^{t,\left( l\right) }-\widetilde{{\varvec{h}}}^{t}\\ \widehat{{\varvec{x}}}^{t,\left( l\right) }-\widetilde{{\varvec{x}}}^{t}\\ \overline{\widehat{{\varvec{h}}}^{t,\left( l\right) }-\widetilde{{\varvec{h}}}^{t}}\\ \overline{\widehat{{\varvec{x}}}^{t,\left( l\right) }-\widetilde{{\varvec{x}}}^{t}} \end{array}\right] \\&\qquad -\eta \underbrace{\left[ \begin{array}{cccc} \left\| \widehat{{\varvec{x}}}^{t,\left( l\right) }\right\| _{2}^{-2}{\varvec{I}}_{K}\\ &{} \left\| \widehat{{\varvec{h}}}^{t,\left( l\right) }\right\| _{2}^{-2}{\varvec{I}}_{K}\\ &{} &{} \left\| \widehat{{\varvec{x}}}^{t,\left( l\right) }\right\| _{2}^{-2}{\varvec{I}}_{K}\\ &{} &{} &{} \left\| \widehat{{\varvec{h}}}^{t,\left( l\right) }\right\| _{2}^{-2}{\varvec{I}}_{K} \end{array}\right] }_{:={\varvec{D}}}\\&\qquad \left[ \begin{array}{c} \nabla _{{\varvec{h}}}f\left( \widehat{{\varvec{z}}}^{t,\left( l\right) }\right) -\nabla _{{\varvec{h}}}f\left( \widetilde{{\varvec{z}}}^{t}\right) \\ \nabla _{{\varvec{x}}}f\left( \widehat{{\varvec{z}}}^{t,\left( l\right) }\right) -\nabla _{{\varvec{x}}}f\left( \widetilde{{\varvec{z}}}^{t}\right) \\ \overline{\nabla _{{\varvec{h}}}f\left( \widehat{{\varvec{z}}}^{t,\left( l\right) }\right) -\nabla _{{\varvec{h}}}f\left( \widetilde{{\varvec{z}}}^{t}\right) }\\ \overline{\nabla _{{\varvec{x}}}f\left( \widehat{{\varvec{z}}}^{t,\left( l\right) }\right) -\nabla _{{\varvec{x}}}f\left( \widetilde{{\varvec{z}}}^{t}\right) } \end{array}\right] . \end{aligned}$$The fundamental theorem of calculus (see Appendix D.3.1) reveals that
$$\begin{aligned} \left[ \begin{array}{c} \nabla _{{\varvec{h}}}f\left( \widehat{{\varvec{z}}}^{t,\left( l\right) }\right) -\nabla _{{\varvec{h}}}f\left( \widetilde{{\varvec{z}}}^{t}\right) \\ \nabla _{{\varvec{x}}}f\left( \widehat{{\varvec{z}}}^{t,\left( l\right) }\right) -\nabla _{{\varvec{x}}}f\left( \widetilde{{\varvec{z}}}^{t}\right) \\ \overline{\nabla _{{\varvec{h}}}f\left( \widehat{{\varvec{z}}}^{t,\left( l\right) }\right) -\nabla _{{\varvec{h}}}f\left( \widetilde{{\varvec{z}}}^{t}\right) }\\ \overline{\nabla _{{\varvec{x}}}f\left( \widehat{{\varvec{z}}}^{t,\left( l\right) }\right) -\nabla _{{\varvec{x}}}f\left( \widetilde{{\varvec{z}}}^{t}\right) } \end{array}\right] =\underbrace{\int _{0}^{1}\nabla ^{2}f\left( {\varvec{z}}\left( \tau \right) \right) \mathrm {d}\tau }_{:={\varvec{A}}}\left[ \begin{array}{c} \widehat{{\varvec{h}}}^{t,\left( l\right) }-\widetilde{{\varvec{h}}}^{t}\\ \widehat{{\varvec{x}}}^{t,\left( l\right) }-\widetilde{{\varvec{x}}}^{t}\\ \overline{\widehat{{\varvec{h}}}^{t,\left( l\right) }-\widetilde{{\varvec{h}}}^{t}}\\ \overline{\widehat{{\varvec{x}}}^{t,\left( l\right) }-\widetilde{{\varvec{x}}}^{t}} \end{array}\right] , \end{aligned}$$where we abuse the notation and denote \({\varvec{z}}\left( \tau \right) =\widetilde{{\varvec{z}}}^{t}+\tau \left( \widehat{{\varvec{z}}}^{t,\left( l\right) }-\widetilde{{\varvec{z}}}^{t}\right) \). In order to invoke Lemma 14, we need to verify the conditions required therein. Recall the induction hypothesis (90b) that
$$\begin{aligned} \text {dist}\big ({\varvec{z}}^{t,\left( l\right) },\widetilde{{\varvec{z}}}^{t}\big )=\big \Vert \widehat{{\varvec{z}}}^{t,\left( l\right) }-\widetilde{{\varvec{z}}}^{t}\big \Vert _{2}\le C_{2}\frac{\mu }{\sqrt{m}}\sqrt{\frac{\mu ^{2}K\log ^{9}m}{m}}, \end{aligned}$$and the fact that \({\varvec{z}}\left( \tau \right) \) lies between \(\widehat{{\varvec{z}}}^{t,\left( l\right) }\) and \(\widetilde{{\varvec{z}}}^{t}\). For all \(0\le \tau \le 1\):
- (a)
If \(m\gg \mu ^{2}\sqrt{K}\log ^{13/2}m\), then
$$\begin{aligned} \left\| {\varvec{z}}\left( \tau \right) -{\varvec{z}}^{\star }\right\| _{2}&\le \max \left\{ \big \Vert \widehat{{\varvec{z}}}^{t,\left( l\right) }-{\varvec{z}}^{\star }\big \Vert _{2},\left\| \widetilde{{\varvec{z}}}^{t}-{\varvec{z}}^{\star }\right\| _{2}\right\} \le \left\| \widetilde{{\varvec{z}}}^{t}-{\varvec{z}}^{\star }\right\| _{2}+\big \Vert \widehat{{\varvec{z}}}^{t,\left( l\right) }-\widetilde{{\varvec{z}}}^{t}\big \Vert _{2}\\&\le C_{1}\frac{1}{\log ^{2}m}+C_{2}\frac{\mu }{\sqrt{m}}\sqrt{\frac{\mu ^{2}K\log ^{9}m}{m}}\le 2C_{1}\frac{1}{\log ^{2}m}, \end{aligned}$$where we have used the induction hypotheses (90a) and (90b);
- (b)
If \(m\gg \mu ^{2}K\log ^{6}m\), then
$$\begin{aligned}&\max _{1\le j\le m}\left| {\varvec{a}}_{j}^{\textsf {H} }\left( {\varvec{x}}\left( \tau \right) -{\varvec{x}}^{\star }\right) \right| \nonumber \\&\quad = \max _{1\le j\le m}\left| \tau {\varvec{a}}_{j}^{\textsf {H} }\big (\widehat{{\varvec{x}}}^{t,\left( l\right) }-\widetilde{{\varvec{x}}}^{t}\big ) + {\varvec{a}}_{j}^{\textsf {H} }\left( \widetilde{{\varvec{x}}}^{t}-{\varvec{x}}^{\star }\right) \right| \nonumber \\&\quad \le \max _{1\le j\le m}\left| {\varvec{a}}_{j}^{\textsf {H} }\big (\widehat{{\varvec{x}}}^{t,\left( l\right) }-\widetilde{{\varvec{x}}}^{t}\big )\right| +\max _{1\le j\le m}\left| {\varvec{a}}_{j}^{\textsf {H} }\left( \widetilde{{\varvec{x}}}^{t}-{\varvec{x}}^{\star }\right) \right| \nonumber \\&\quad \le \max _{1\le j\le m}\left\| {\varvec{a}}_{j}\right\| _{2}\big \Vert \widehat{{\varvec{z}}}^{t,\left( l\right) }-\widetilde{{\varvec{z}}}^{t}\big \Vert _{2}+C_{3}\frac{1}{\log ^{3/2}m}\nonumber \\&\quad \le 3\sqrt{K}\cdot C_{2}\frac{\mu }{\sqrt{m}}\sqrt{\frac{\mu ^{2}K\log ^{9}m}{m}}+C_{3}\frac{1}{\log ^{3/2}m}\le 2C_{3}\frac{1}{\log ^{3/2}m}, \end{aligned}$$(214)which follows from bound (190) and the induction hypotheses (90b) and (90c);
- (c)
If \(m\gg \mu K\log ^{5/2}m\), then
$$\begin{aligned} \max _{1\le j\le m}\left| {\varvec{b}}_{j}^{\textsf {H} }{\varvec{h}}\left( \tau \right) \right|&= \max _{1\le j\le m}\big | \tau {\varvec{b}}_{j}^{\textsf {H} }\big (\widehat{{\varvec{h}}}^{t,\left( l\right) }-\widetilde{{\varvec{h}}}^{t}\big ) + {\varvec{b}}_{j}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\big |\nonumber \\&\le \max _{1\le j\le m}\left| {\varvec{b}}_{j}^{\textsf {H} }\big (\widehat{{\varvec{h}}}^{t,\left( l\right) }-\widetilde{{\varvec{h}}}^{t}\big )\right| +\max _{1\le j\le m}\big |{\varvec{b}}_{j}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\big |\nonumber \\&\le \max _{1\le j\le m}\Vert {\varvec{b}}_{j}\Vert _{2}\big \Vert \widehat{{\varvec{h}}}^{t,\left( l\right) }-\widetilde{{\varvec{h}}}^{t}\big \Vert _{2}+\max _{1\le j\le m}\big |{\varvec{b}}_{j}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\big |\nonumber \\&\le \sqrt{\frac{K}{m}}\cdot C_{2}\frac{\mu }{\sqrt{m}}\sqrt{\frac{\mu ^{2}K\log ^{9}m}{m}}+C_{4}\frac{\mu }{\sqrt{m}}\log ^{2}m\le 2C_{4}\frac{\mu }{\sqrt{m}}\log ^{2}m, \end{aligned}$$(215)which makes use of the fact \(\Vert {\varvec{b}}_{j}\Vert _{2}=\sqrt{K/m}\) as well as the induction hypotheses (90b) and (90d).
These properties satisfy condition (82) required in Lemma 14. The other two conditions (83) and (84) are also straightforward to check, and hence we omit the details. Thus, we can repeat the argument used in Appendix C.2 to obtain
$$\begin{aligned} \left\| {\varvec{\nu }}_{1}\right\| _{2}\le \left( 1-{\eta }/{16}\right) \cdot \big \Vert \widehat{{\varvec{z}}}^{t,\left( l\right) }-\widetilde{{\varvec{z}}}^{t}\big \Vert _{2}. \end{aligned}$$
- 2.
In terms of the second term \({\varvec{\nu }}_{2}\), it is easily seen that
$$\begin{aligned} \left\| {\varvec{\nu }}_{2}\right\| _{2}&\le \max \left\{ \left| \frac{1}{\big \Vert \widetilde{{\varvec{x}}}^{t}\big \Vert _{2}^{2}}-\frac{1}{\big \Vert \widehat{{\varvec{x}}}^{t,\left( l\right) }\big \Vert _{2}^{2}}\right| ,\left| \frac{1}{\big \Vert \widetilde{{\varvec{h}}}^{t}\big \Vert _{2}^{2}}-\frac{1}{\big \Vert \widehat{{\varvec{h}}}^{t,\left( l\right) }\big \Vert _{2}^{2}}\right| \right\} \left\| \left[ \begin{array}{c} \nabla _{{\varvec{h}}}f\left( \widetilde{{\varvec{z}}}^{t}\right) \\ \nabla _{{\varvec{x}}}f\left( \widetilde{{\varvec{z}}}^{t}\right) \end{array}\right] \right\| _{2}. \end{aligned}$$We first note that the upper bound on \(\Vert \nabla ^{2}f\left( \cdot \right) \Vert \) (which essentially provides a Lipschitz constant on the gradient) in Lemma 14 forces
$$\begin{aligned} \left\| \left[ \begin{array}{c} \nabla _{{\varvec{h}}}f\left( \widetilde{{\varvec{z}}}^{t}\right) \\ \nabla _{{\varvec{x}}}f\left( \widetilde{{\varvec{z}}}^{t}\right) \end{array}\right] \right\| _{2}=\left\| \left[ \begin{array}{c} \nabla _{{\varvec{h}}}f\left( \widetilde{{\varvec{z}}}^{t}\right) -\nabla _{{\varvec{h}}}f\left( {\varvec{z}}^{\star }\right) \\ \nabla _{{\varvec{x}}}f\left( \widetilde{{\varvec{z}}}^{t}\right) -\nabla _{{\varvec{x}}}f\left( {\varvec{z}}^{\star }\right) \end{array}\right] \right\| _{2}\lesssim \left\| \widetilde{{\varvec{z}}}^{t}-{\varvec{z}}^{\star }\right\| _{2}\lesssim C_{1}\frac{1}{\log ^{2}m}, \end{aligned}$$where the first identity follows since \(\nabla f\left( {\varvec{z}}^{\star }\right) ={\varvec{0}}\), and the last inequality comes from the induction hypothesis (90a). Additionally, recognizing that \(\left\| \widetilde{{\varvec{x}}}^{t}\right\| _{2} \asymp \left\| \widehat{{\varvec{x}}}^{t,\left( l\right) }\right\| _{2} \asymp 1\), one can easily verify that
$$\begin{aligned} \left| \frac{1}{\left\| \widetilde{{\varvec{x}}}^{t}\right\| _{2}^{2}}-\frac{1}{\left\| \widehat{{\varvec{x}}}^{t,\left( l\right) }\right\| _{2}^{2}}\right| =\left| \frac{\left\| \widehat{{\varvec{x}}}^{t,\left( l\right) }\right\| _{2}^{2}-\left\| \widetilde{{\varvec{x}}}^{t}\right\| _{2}^{2}}{\left\| \widetilde{{\varvec{x}}}^{t}\right\| _{2}^{2}\cdot \left\| \widehat{{\varvec{x}}}^{t,\left( l\right) }\right\| _{2}^{2}}\right| \lesssim \Big | \big \Vert \widehat{{\varvec{x}}}^{t,\left( l\right) } \big \Vert _2 - \big \Vert \widetilde{{\varvec{x}}}^{t}\big \Vert _{2} \Big | \lesssim \big \Vert \widehat{{\varvec{x}}}^{t,\left( l\right) }-\widetilde{{\varvec{x}}}^{t}\big \Vert _{2}. \end{aligned}$$A similar bound holds for the other term involving \({\varvec{h}}\). Combining the estimates above thus yields
$$\begin{aligned} \left\| {\varvec{\nu }}_{2}\right\| _{2}\lesssim C_{1}\frac{1}{\log ^{2}m}\big \Vert \widehat{{\varvec{z}}}^{t,(l)}-\widetilde{{\varvec{z}}}^{t}\big \Vert _{2}. \end{aligned}$$
- 3.
When it comes to the last term \({\varvec{\nu }}_{3}\), one first sees that
$$\begin{aligned} \left\| \left( {\varvec{b}}_{l}^{\textsf {H} }\widehat{{\varvec{h}}}^{t,\left( l\right) }\widehat{{\varvec{x}}}^{t,\left( l\right) \textsf {H} }{\varvec{a}}_{l}-y_{l}\right) {\varvec{b}}_{l}{\varvec{a}}_{l}^{\textsf {H} }\widehat{{\varvec{x}}}^{t,\left( l\right) }\right\| _{2}\le \left| {\varvec{b}}_{l}^{\textsf {H} }\widehat{{\varvec{h}}}^{t,\left( l\right) }\widehat{{\varvec{x}}}^{t,\left( l\right) \textsf {H} }{\varvec{a}}_{l}-y_{l}\right| \left\| {\varvec{b}}_{l}\right\| _{2}\big |{\varvec{a}}_{l}^{\textsf {H} }\widehat{{\varvec{x}}}^{t,\left( l\right) }\big |. \end{aligned}$$(216)Bounds (189) and (214) taken collectively yield
$$\begin{aligned} \left| {\varvec{a}}_{l}^{\textsf {H} }\widehat{{\varvec{x}}}^{t,\left( l\right) }\right| \le \left| {\varvec{a}}_{l}^{\textsf {H} }{\varvec{x}}^{\star }\right| +\left| {\varvec{a}}_{l}^{\textsf {H} }\big (\widehat{{\varvec{x}}}^{t,\left( l\right) }-{\varvec{x}}^{\star }\big )\right| \lesssim \sqrt{\log m}+C_{3}\frac{1}{\log ^{3/2}m}\asymp \sqrt{\log m}. \end{aligned}$$In addition, the same argument as in obtaining (215) tells us that
$$\begin{aligned} \big |{\varvec{b}}_{l}^{\textsf {H} }(\widehat{{\varvec{h}}}^{t,\left( l\right) }-{\varvec{h}}^{\star })\big | \lesssim C_{4}\frac{\mu }{\sqrt{m}}\log ^{2}m. \end{aligned}$$Combine the previous two bounds to obtain
$$\begin{aligned}&\left| {\varvec{b}}_{l}^{\textsf {H} }\widehat{{\varvec{h}}}^{t,\left( l\right) }\widehat{{\varvec{x}}}^{t,\left( l\right) \textsf {H} }{\varvec{a}}_{l}-y_{l}\right| \le \big |{\varvec{b}}_{l}^{\textsf {H} }\widehat{{\varvec{h}}}^{t,\left( l\right) }(\widehat{{\varvec{x}}}^{t,(l)}-{\varvec{x}}^{\star })^{\textsf {H} }{\varvec{a}}_{l}\big |+\big |{\varvec{b}}_{l}^{\textsf {H} }(\widehat{{\varvec{h}}}^{t,\left( l\right) }-{\varvec{h}}^{\star }){\varvec{x}}^{\star \textsf {H} }{\varvec{a}}_{l}\big |\\&\quad \le \big |{\varvec{b}}_{l}^{\textsf {H} }\widehat{{\varvec{h}}}^{t,\left( l\right) }\big |\cdot \big |{\varvec{a}}_{l}^{\textsf {H} }(\widehat{{\varvec{x}}}^{t,(l)}-{\varvec{x}}^{\star })\big |+\big |{\varvec{b}}_{l}^{\textsf {H} }(\widehat{{\varvec{h}}}^{t,\left( l\right) }-{\varvec{h}}^{\star })\big |\cdot \big |{\varvec{a}}_{l}^{\textsf {H} }{\varvec{x}}^{\star }\big |\\&\quad \le \left( \big |{\varvec{b}}_{l}^{\textsf {H} }(\widehat{{\varvec{h}}}^{t,\left( l\right) }-{\varvec{h}}^{\star })\big |+\big |{\varvec{b}}_{l}^{\textsf {H} }{\varvec{h}}^{\star }\big |\right) \cdot \big |{\varvec{a}}_{l}^{\textsf {H} }(\widehat{{\varvec{x}}}^{t,(l)}-{\varvec{x}}^{\star })\big |+\big |{\varvec{b}}_{l}^{\textsf {H} }(\widehat{{\varvec{h}}}^{t,\left( l\right) }-{\varvec{h}}^{\star })\big |\cdot \big |{\varvec{a}}_{l}^{\textsf {H} }{\varvec{x}}^{\star }\big |\\&\quad \lesssim \left( C_{4}\mu \frac{\log ^{2}m}{\sqrt{m}}+\frac{\mu }{\sqrt{m}}\right) \cdot C_{3}\frac{1}{\log ^{3/2}m}+C_{4}\mu \frac{\log ^{2}m}{\sqrt{m}}\cdot \sqrt{\log m}\lesssim C_{4}\mu \frac{\log ^{5/2}m}{\sqrt{m}}. \end{aligned}$$Substitution into (216) gives
$$\begin{aligned} \left\| \left( {\varvec{b}}_{l}^{\textsf {H} }\widehat{{\varvec{h}}}^{t,\left( l\right) }\widehat{{\varvec{x}}}^{t,\left( l\right) \textsf {H} }{\varvec{a}}_{l}-y_{l}\right) {\varvec{b}}_{l}{\varvec{a}}_{l}^{\textsf {H} }\widehat{{\varvec{x}}}^{t,\left( l\right) }\right\| _{2}&\lesssim C_{4}\mu \frac{\log ^{5/2}m}{\sqrt{m}}\cdot \sqrt{\frac{K}{m}}\cdot \sqrt{\log m}. \end{aligned}$$(217)Similarly, we can also derive
$$\begin{aligned} \left\| \overline{\left( {\varvec{b}}_{l}^{\textsf {H} }\widehat{{\varvec{h}}}^{t,\left( l\right) }\widehat{{\varvec{x}}}^{t,\left( l\right) \textsf {H} }{\varvec{a}}_{l}-y_{l}\right) }{\varvec{a}}_{l}{\varvec{b}}_{l}^{\textsf {H} }\widehat{{\varvec{h}}}^{t,\left( l\right) }\right\| _{2}&\le \left| {\varvec{b}}_{l}^{\textsf {H} }\widehat{{\varvec{h}}}^{t,\left( l\right) }\widehat{{\varvec{x}}}^{t,\left( l\right) \textsf {H} }{\varvec{a}}_{l}-y_{l}\right| \left\| {\varvec{a}}_{l}\right\| _{2}\left| {\varvec{b}}_{l}^{\textsf {H} }\widehat{{\varvec{h}}}^{t,\left( l\right) }\right| \\&\lesssim C_{4}\mu \frac{\log ^{5/2}m}{\sqrt{m}}\cdot \sqrt{K}\cdot C_{4}\frac{\mu }{\sqrt{m}}\log ^{2}m. \end{aligned}$$Putting these bounds together indicates that
$$\begin{aligned} \left\| {\varvec{\nu }}_{3}\right\| _{2}\lesssim \left( C_{4}\right) ^{2}\frac{\mu }{\sqrt{m}}\sqrt{\frac{\mu ^{2}K\log ^{9}m}{m}}. \end{aligned}$$
The above bounds taken together with (212) and (213) ensure the existence of a constant \(C>0\) such that
Here, (i) holds as long as m is sufficiently large such that \(CC_{1}/\log ^{2}m\ll 1\) and
which is guaranteed by Lemma 16. Inequality (ii) arises from the induction hypothesis (90b) and by taking \(C_{2}>0\) sufficiently large.
Finally, we establish the second inequality claimed in the lemma. Take \(({\varvec{h}}_{1},{\varvec{x}}_{1})=(\widetilde{{\varvec{h}}}^{t+1},\widetilde{{\varvec{x}}}^{t+1})\) and \(({\varvec{h}}_{2},{\varvec{x}}_{2})=(\widehat{{\varvec{h}}}^{t+1,(l)},\widehat{{\varvec{x}}}^{t+1,(l)})\) in Lemma 55. Since both \(({\varvec{h}}_{1},{\varvec{x}}_{1})\) and \(({\varvec{h}}_{2},{\varvec{x}}_{2})\) are close enough to \(({\varvec{h}}^{\star },{\varvec{x}}^{\star })\), we deduce that
as claimed.
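The proximity of the leave-one-out trajectory quantified by this lemma can likewise be probed numerically. In the minimal sketch below, two runs of the scaled gradient iteration share one (illustrative) initialization near the truth, with the second run dropping the \(l\)-th measurement; in the paper each auxiliary sequence instead starts from its own leave-one-out spectral estimate, so this is only a caricature of the construction.

```python
import numpy as np

rng = np.random.default_rng(2)
K, m, eta, T, l = 8, 600, 0.1, 100, 0      # illustrative sizes; l is the dropped index

B = np.linalg.qr(rng.standard_normal((m, K)) + 1j * rng.standard_normal((m, K)))[0]
A = (rng.standard_normal((m, K)) + 1j * rng.standard_normal((m, K))) / np.sqrt(2)
h_st = rng.standard_normal(K) + 1j * rng.standard_normal(K); h_st /= np.linalg.norm(h_st)
x_st = rng.standard_normal(K) + 1j * rng.standard_normal(K); x_st /= np.linalg.norm(x_st)
y = (B @ h_st) * np.conj(A @ x_st)

h0 = h_st + 0.05 * rng.standard_normal(K)  # shared starting point (an assumption)
x0 = x_st + 0.05 * rng.standard_normal(K)

def run(keep):
    """Scaled gradient descent on the loss restricted to the rows in `keep`."""
    h, x = h0.copy(), x0.copy()
    Bk, Ak, yk = B[keep], A[keep], y[keep]
    for _ in range(T):
        c = (Bk @ h) * np.conj(Ak @ x) - yk
        h, x = (h - eta / np.linalg.norm(x) ** 2 * (Bk.conj().T @ (c * (Ak @ x))),
                x - eta / np.linalg.norm(h) ** 2 * (Ak.conj().T @ (np.conj(c) * (Bk @ h))))
    return h, x

keep_all = np.ones(m, dtype=bool)
keep_loo = keep_all.copy(); keep_loo[l] = False
h_f, x_f = run(keep_all)
h_l, x_l = run(keep_loo)
# compare the (ambiguity-free) outer products of the two endpoints
print(np.linalg.norm(np.outer(h_f, x_f.conj()) - np.outer(h_l, x_l.conj())))
```

The printed gap is tiny relative to the initial perturbation, which is the mechanism the lemma exploits: \({\varvec{z}}^{t,(l)}\) is essentially independent of \(({\varvec{a}}_{l},y_{l})\) yet stays within the \(C_{2}\frac{\mu }{\sqrt{m}}\sqrt{\mu ^{2}K\log ^{9}m/m}\) neighborhood of \(\widetilde{{\varvec{z}}}^{t}\), cf. (90b).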
1.4 Proof of Lemma 18
Before going forward, we make note of the following inequality
for some small \(\delta \asymp {\log ^{-2}m}\), where the last relation follows from Lemma 16 that
for m sufficiently large. In view of the above inequality, the focus of our subsequent analysis will be to control \(\max _{l}\left| {\varvec{b}}_{l}^{\textsf {H} }\frac{1}{\overline{\alpha ^{t}}}{\varvec{h}}^{t+1}\right| \).
The gradient update rule for \({\varvec{h}}^{t+1}\) (cf. (79a)) gives
where \(\widetilde{{\varvec{h}}}^{t}=\frac{1}{\overline{\alpha ^t}}{{\varvec{h}}}^{t}\) and \(\widetilde{{\varvec{x}}}^{t}={{\alpha ^t}}{{\varvec{x}}}^{t}\). Here and below, we denote \( \xi ={1} / {\Vert \widetilde{{\varvec{x}}}^{t} \Vert _{2}^{2}} \) for notational convenience. The above formula can be further decomposed into the following terms
where we use the fact that \(\sum _{j=1}^{m}{\varvec{b}}_{j}{\varvec{b}}_{j}^{\textsf {H} }={\varvec{I}}_{K}\). In the sequel, we shall control each term separately.
- 1.
We start with \(|{\varvec{b}}_{l}^{\textsf {H} }{\varvec{v}}_{1}|\) by making the observation that
$$\begin{aligned} \frac{1}{\eta \xi }\left| {\varvec{b}}_{l}^{\textsf {H} }{\varvec{v}}_{1}\right|&=\left| \sum _{j=1}^{m}{\varvec{b}}_{l}^{\textsf {H} }{\varvec{b}}_{j}{\varvec{b}}_{j}^{\textsf {H} } \widetilde{{\varvec{h}}}^{t}\left[ {\varvec{a}}_{j}^{\textsf {H} }\left( \widetilde{{\varvec{x}}}^{t}-{\varvec{x}}^{\star }\right) \left( {\varvec{a}}_{j}^{\textsf {H} }\widetilde{{\varvec{x}}}^{t}\right) ^{\textsf {H} }+{\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{\star }\left( {\varvec{a}}_{j}^{\textsf {H} }\left( \widetilde{{\varvec{x}}}^{t}-{\varvec{x}}^{\star }\right) \right) ^{\textsf {H} }\right] \right| \nonumber \\&\le \sum _{j=1}^{m}\left| {\varvec{b}}_{l}^{\textsf {H} }{\varvec{b}}_{j}\right| \left\{ \max _{1\le j\le m}\big |{\varvec{b}}_{j}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\big |\right\} \left\{ \max _{1\le j\le m}\left| {\varvec{a}}_{j}^{\textsf {H} }\left( \widetilde{{\varvec{x}}}^{t}-{\varvec{x}}^{\star }\right) \right| \left( \left| {\varvec{a}}_{j}^{\textsf {H} }\widetilde{{\varvec{x}}}^{t}\right| +\left| {\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{\star }\right| \right) \right\} . \end{aligned}$$(219)Combining the induction hypothesis (90c) and condition (189) yields
$$\begin{aligned}&\max _{1\le j\le m}\left| {\varvec{a}}_{j}^{\textsf {H} }\widetilde{{\varvec{x}}}^{t}\right| \le \max _{1\le j\le m}\left| {\varvec{a}}_{j}^{\textsf {H} }\left( \widetilde{{\varvec{x}}}^{t}-{\varvec{x}}^{\star }\right) \right| +\max _{1\le j\le m}\left| {\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{\star }\right| \\&\quad \le C_{3}\frac{1}{\log ^{3/2}m}+5\sqrt{\log m}\le 6\sqrt{\log m} \end{aligned}$$as long as m is sufficiently large. This further implies
$$\begin{aligned} \max _{1\le j\le m}\left| {\varvec{a}}_{j}^{\textsf {H} }\left( \widetilde{{\varvec{x}}}^{t}-{\varvec{x}}^{\star }\right) \right| \left( \left| {\varvec{a}}_{j}^{\textsf {H} }\widetilde{{\varvec{x}}}^{t}\right| +\left| {\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{\star }\right| \right) \le C_{3}\frac{1}{\log ^{3/2}m}\cdot 11\sqrt{\log m}\le 11C_{3}\frac{1}{\log m}. \end{aligned}$$Substituting it into (219) and invoking Lemma 48, we arrive at
$$\begin{aligned} \frac{1}{\eta \xi }\left| {\varvec{b}}_{l}^{\textsf {H} }{\varvec{v}}_{1}\right| \lesssim \log m\cdot \left\{ \max _{1\le j\le m}\big |{\varvec{b}}_{j}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\big |\right\} \cdot C_{3}\frac{1}{\log m}\lesssim C_{3}\max _{1\le j\le m}\big |{\varvec{b}}_{j}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\big |\le 0.1\max _{1\le j\le m}\big |{\varvec{b}}_{j}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\big |, \end{aligned}$$with the proviso that \(C_{3}\) is sufficiently small.
- 2.
We then move on to \(|{\varvec{b}}_{l}^{\textsf {H} }{\varvec{v}}_{3}|\), which obeys
$$\begin{aligned} \frac{1}{\eta \xi }\left| {\varvec{b}}_{l}^{\textsf {H} }{\varvec{v}}_{3}\right|&\le \left| \sum _{j=1}^{m}{\varvec{b}}_{l}^{\textsf {H} }{\varvec{b}}_{j}{\varvec{b}}_{j}^{\textsf {H} } {\varvec{h}}^{\star }{\varvec{x}}^{\star \textsf {H} }{\varvec{a}}_{j}{\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{\star } \right| +\left| \sum _{j=1}^{m}{\varvec{b}}_{l}^{\textsf {H} }{\varvec{b}}_{j}{\varvec{b}}_{j}^{\textsf {H} } {\varvec{h}}^{\star }{\varvec{x}}^{\star \textsf {H} }{\varvec{a}}_{j}{\varvec{a}}_{j}^{\textsf {H} }\left( \widetilde{{\varvec{x}}}^{t} -{\varvec{x}}^{\star }\right) \right| . \end{aligned}$$(220)Regarding the first term, we have the following lemma, whose proof is given in Appendix C.4.1.
Lemma 28
Suppose \(m\ge C K\log ^2 m\) for some sufficiently large constant \(C>0\). Then with probability at least \(1-O\left( m^{-10}\right) \), one has
For the remaining term, we apply the same strategy as in bounding \(|{\varvec{b}}_{l}^{\textsf {H} }{\varvec{v}}_{1}|\) to get
where the second line follows from the incoherence condition (36), the induction hypothesis (90c), condition (189), and Lemma 48. Combining the above three inequalities with the incoherence condition (36) yields
- 3.
Finally, we need to control \(\left| {\varvec{b}}_{l}^{\textsf {H} }{\varvec{v}}_{2}\right| \). For convenience of presentation, we will only bound \(\left| {\varvec{b}}_{1}^{\textsf {H} }{\varvec{v}}_{2}\right| \) in the sequel, but the argument easily extends to all other \({\varvec{b}}_{l}\)’s. The idea is to group \(\left\{ {\varvec{b}}_{j}\right\} _{1\le j\le m} \) into bins each containing \(\tau \) adjacent vectors, and to look at each bin separately. Here, \(\tau \asymp \mathrm {poly}\log (m)\) is some integer to be specified later. For notational simplicity, we assume \(m/\tau \) to be an integer, although all arguments continue to hold when \(m/\tau \) is not an integer. For each \(0\le l\le m-\tau \), the following summation over \(\tau \) adjacent data obeys
$$\begin{aligned}&{\varvec{b}}_{1}^{\textsf {H} }\sum _{j=1}^{\tau }{\varvec{b}}_{l+j}{\varvec{b}}_{l+j}^{\textsf {H} } \widetilde{{\varvec{h}}}^{t}\left( \left| {\varvec{a}}_{l+j}^{\textsf {H} }{\varvec{x}}^{\star }\right| ^{2} -\left\| {\varvec{x}}^{\star }\right\| _{2}^{2}\right) \nonumber \\&\quad ={\varvec{b}}_{1}^{\textsf {H} }\sum _{j=1}^{\tau }{\varvec{b}}_{l+1}{\varvec{b}}_{l+1}^{\textsf {H} } \widetilde{{\varvec{h}}}^{t}\left( \left| {\varvec{a}}_{l+j}^{\textsf {H} }{\varvec{x}}^{\star }\right| ^{2}-\left\| {\varvec{x}}^{\star }\right\| _{2}^{2}\right) \nonumber \\&\qquad +{\varvec{b}}_{1}^{\textsf {H} }\sum _{j=1}^{\tau }\left( {\varvec{b}}_{l+j} {\varvec{b}}_{l+j}^{\textsf {H} }-{\varvec{b}}_{l+1}{\varvec{b}}_{l+1}^{\textsf {H} }\right) \widetilde{{\varvec{h}}}^{t}\left( \left| {\varvec{a}}_{l+j}^{\textsf {H} }{\varvec{x}}^{\star }\right| ^{2}-\left\| {\varvec{x}}^{\star }\right\| _{2}^{2}\right) \nonumber \\&\quad =\left\{ \sum _{j=1}^{\tau }\left( \left| {\varvec{a}}_{l+j}^{\textsf {H} }{\varvec{x}}^{\star }\right| ^{2}-\left\| {\varvec{x}}^{\star }\right\| _{2}^{2}\right) \right\} {\varvec{b}}_{1}^{\textsf {H} }{\varvec{b}}_{l+1}{\varvec{b}}_{l+1}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\nonumber \\&\qquad +{\varvec{b}}_{1}^{\textsf {H} }\sum _{j=1}^{\tau }\left( {\varvec{b}}_{l+j}-{\varvec{b}}_{l+1} \right) {\varvec{b}}_{l+j}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\left( \left| {\varvec{a}}_{l+j}^{\textsf {H} } {\varvec{x}}^{\star }\right| ^{2}-\left\| {\varvec{x}}^{\star }\right\| _{2}^{2}\right) \nonumber \\&\quad \quad +{\varvec{b}}_{1}^{\textsf {H} }\sum _{j=1}^{\tau }{\varvec{b}}_{l+1} \left( {\varvec{b}}_{l+j}-{\varvec{b}}_{l+1}\right) ^{\textsf {H} }\widetilde{{\varvec{h}}}^{t} \left( \left| {\varvec{a}}_{l+j}^{\textsf {H} }{\varvec{x}}^{\star }\right| ^{2}-\left\| {\varvec{x}}^{\star }\right\| _{2}^{2}\right) . \end{aligned}$$(221)We will now bound each term in (221) separately.
Before bounding the first term in (221), we first bound the pre-factor \(\left| \sum _{j=1}^{\tau }\big (|{\varvec{a}}_{l+j}^{\textsf {H} }{\varvec{x}}^{\star }|^{2}-\Vert {\varvec{x}}^{\star }\Vert _{2}^{2}\big )\right| \). Notably, the fluctuation of this quantity does not grow fast as it is the sum of i.i.d. random variables over a group of relatively large size, i.e., \(\tau \). Since \(2\left| {\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{\star }\right| ^{2}\) follows the \(\chi _{2}^{2}\) distribution, by standard concentration results (e.g., [95, Theorem 1.1]), with probability exceeding \(1-O\left( m^{-10}\right) \),
$$\begin{aligned} \left| \sum _{j=1}^{\tau }\big (\big |{\varvec{a}}_{l+j}^{\textsf {H} }{\varvec{x}}^{\star }\big |^{2}-\Vert {\varvec{x}}^{\star }\Vert _{2}^{2}\big )\right| \lesssim \sqrt{\tau \log m}. \end{aligned}$$With this result in place, we can bound the first term in (221) as
$$\begin{aligned} \left| \left\{ \sum _{j=1}^{\tau }\big (\big |{\varvec{a}}_{l+j}^{\textsf {H} }{\varvec{x}}^{\star }\big |^{2}-\Vert {\varvec{x}}^{\star }\Vert _{2}^{2}\big )\right\} {\varvec{b}}_{1}^{\textsf {H} }{\varvec{b}}_{l+1}{\varvec{b}}_{l+1}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\right|&\lesssim \sqrt{\tau \log m}\left| {\varvec{b}}_{1}^{\textsf {H} }{\varvec{b}}_{l+1}\right| \max _{1\le l\le m}\left| {\varvec{b}}_{l}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\right| . \end{aligned}$$Taking the summation over all bins gives
$$\begin{aligned}&\sum _{k=0}^{ \frac{m}{\tau }-1 }\left| \left\{ \sum _{j=1}^{\tau }\big (\big |{\varvec{a}}_{k\tau +j}^{\textsf {H} }{\varvec{x}}^{\star }\big |^{2}-\Vert {\varvec{x}}^{\star }\Vert _{2}^{2}\big )\right\} {\varvec{b}}_{1}^{\textsf {H} }{\varvec{b}}_{k\tau +1}{\varvec{b}}_{k\tau +1}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\right| \nonumber \\&\quad \lesssim \sqrt{\tau \log m}\sum _{k=0}^{ \frac{m}{\tau }-1 }\left| {\varvec{b}}_{1}^{\textsf {H} }{\varvec{b}}_{k\tau +1}\right| \max _{1\le l\le m}\left| {\varvec{b}}_{l}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\right| . \end{aligned}$$(222)It is straightforward to see from the proof of Lemma 48 that
$$\begin{aligned} \sum _{k=0}^{ \frac{m}{\tau }-1 }\left| {\varvec{b}}_{1}^{\textsf {H} }{\varvec{b}}_{k\tau +1}\right| =\Vert {\varvec{b}}_{1}\Vert _{2}^{2}+\sum _{k=1}^{ \frac{m}{\tau }-1 }\left| {\varvec{b}}_{1}^{\textsf {H} }{\varvec{b}}_{k\tau +1}\right| \le \frac{K}{m}+O\left( \frac{\log m}{\tau }\right) . \end{aligned}$$(223)Substituting (223) into the previous inequality (222) gives
$$\begin{aligned}&\sum _{k=0}^{ \frac{m}{\tau }-1 }\left| \left\{ \sum _{j=1}^{\tau }\big (\big |{\varvec{a}}_{k\tau +j}^{\textsf {H} }{\varvec{x}}^{\star }\big |^{2}-\Vert {\varvec{x}}^{\star }\Vert _{2}^{2}\big )\right\} {\varvec{b}}_{1}^{\textsf {H} }{\varvec{b}}_{k\tau +1}{\varvec{b}}_{k\tau +1}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\right| \\&\quad \lesssim \left( \frac{K\sqrt{\tau \log m}}{m}+\sqrt{\frac{\log ^{3}m}{\tau }}\right) \max _{1\le l\le m}\left| {\varvec{b}}_{l}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\right| \\&\quad \le 0.1\max _{1\le l\le m}\left| {\varvec{b}}_{l}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\right| , \end{aligned}$$as long as \(m\gg K\sqrt{\tau \log m}\) and \(\tau \gg \log ^{3}m\).
The second term of (221) obeys
$$\begin{aligned}&\left| {\varvec{b}}_{1}^{\textsf {H} }\sum _{j=1}^{\tau }\left( {\varvec{b}}_{l+j}-{\varvec{b}}_{l+1}\right) {\varvec{b}}_{l+j}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\left( \left| {\varvec{a}}_{l+j}^{\textsf {H} }{\varvec{x}}^{\star }\right| ^{2}-\left\| {\varvec{x}}^{\star }\right\| _{2}^{2}\right) \right| \\&\quad \le \max _{1\le l\le m}\left| {\varvec{b}}_{l}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\right| \sqrt{\sum _{j=1}^{\tau }\left| {\varvec{b}}_{1}^{\textsf {H} }\left( {\varvec{b}}_{l+j}-{\varvec{b}}_{l+1}\right) \right| ^{2}}\sqrt{\sum _{j=1}^{\tau }\left( \big |{\varvec{a}}_{l+j}^{\textsf {H} }{\varvec{x}}^{\star }\big |^{2}-\left\| {\varvec{x}}^{\star }\right\| _{2}^{2}\right) ^{2}}\\&\quad \lesssim \sqrt{\tau }\max _{1\le l\le m}\left| {\varvec{b}}_{l}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\right| \sqrt{\sum _{j=1}^{\tau }\left| {\varvec{b}}_{1}^{\textsf {H} }\left( {\varvec{b}}_{l+j}-{\varvec{b}}_{l+1}\right) \right| ^{2}}, \end{aligned}$$where the first inequality is due to Cauchy–Schwarz, and the second one holds because of the following lemma, whose proof can be found in Appendix C.4.2.
Lemma 29
Suppose \(\tau \ge C\log ^{4}m\) for some sufficiently large constant \(C>0\). Then with probability exceeding \(1-O\left( m^{-10}\right) \),
With the above bound in mind, we can sum over all bins of size \(\tau \) to obtain
Here, the last line arises from Lemma 51, which says that for any small constant \(c>0\), as long as \(m\gg \tau K\log m\)
The third term of (221) obeys
$$\begin{aligned}&\left| {\varvec{b}}_{1}^{\textsf {H} }\sum _{j=1}^{\tau }{\varvec{b}}_{l+1}\left( {\varvec{b}}_{l+j}-{\varvec{b}}_{l+1}\right) ^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\left\{ \left| {\varvec{a}}_{l+j}^{\textsf {H} }{\varvec{x}}^{\star }\right| ^{2}-\left\| {\varvec{x}}^{\star }\right\| _{2}^{2}\right\} \right| \\&\quad \le \left| {\varvec{b}}_{1}^{\textsf {H} }{\varvec{b}}_{l+1}\right| \left\{ \sum _{j=1}^{\tau }\left| \left| {\varvec{a}}_{l+j}^{\textsf {H} }{\varvec{x}}^{\star }\right| ^{2}-\left\| {\varvec{x}}^{\star }\right\| _{2}^{2}\right| \right\} \max _{0\le l\le m-\tau ,\,1\le j\le \tau }\left| \left( {\varvec{b}}_{l+j}-{\varvec{b}}_{l+1}\right) ^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\right| \\&\quad \lesssim \tau \left| {\varvec{b}}_{1}^{\textsf {H} }{\varvec{b}}_{l+1}\right| \max _{0\le l\le m-\tau ,\,1\le j\le \tau }\left| \left( {\varvec{b}}_{l+j}-{\varvec{b}}_{l+1}\right) ^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\right| , \end{aligned}$$where the last line relies on the inequality
$$\begin{aligned} \sum _{j=1}^{\tau }\left| \left| {\varvec{a}}_{l+j}^{\textsf {H} }{\varvec{x}}^{\star }\right| ^{2}-\left\| {\varvec{x}}^{\star }\right\| _{2}^{2}\right| \le \sqrt{\tau }\sqrt{\sum _{j=1}^{\tau }\left( \left| {\varvec{a}}_{l+j}^{\textsf {H} }{\varvec{x}}^{\star }\right| ^{2}-\left\| {\varvec{x}}^{\star }\right\| _{2}^{2}\right) ^{2}}\lesssim \tau \end{aligned}$$owing to Lemma 29 and the Cauchy–Schwarz inequality. Summing over all bins gives
$$\begin{aligned}&\sum _{k=0}^{ \frac{m}{\tau } -1 }\left| {\varvec{b}}_{1}^{\textsf {H} }\sum _{j=1}^{\tau }{\varvec{b}}_{k\tau +1}\left( {\varvec{b}}_{k\tau +j}-{\varvec{b}}_{k\tau +1}\right) ^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\left\{ \left| {\varvec{a}}_{k\tau +j}^{\textsf {H} }{\varvec{x}}^{\star }\right| ^{2}-\left\| {\varvec{x}}^{\star }\right\| _{2}^{2}\right\} \right| \\&\quad \lesssim \tau \sum _{k=0}^{ \frac{m}{\tau } -1 }\left| {\varvec{b}}_{1}^{\textsf {H} }{\varvec{b}}_{k\tau +1}\right| \max _{0\le l\le m-\tau ,\,1\le j\le \tau }\left| \left( {\varvec{b}}_{l+j}-{\varvec{b}}_{l+1}\right) ^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\right| \\&\quad \lesssim \log m\max _{0\le l\le m-\tau ,\,1\le j\le \tau }\left| \left( {\varvec{b}}_{l+j}-{\varvec{b}}_{l+1}\right) ^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\right| , \end{aligned}$$where the last relation makes use of (223) with the proviso that \(m \gg K\tau \). It then boils down to bounding \(\max _{0\le l\le m-\tau ,\,1\le j\le \tau }\big |\left( {\varvec{b}}_{l+j}-{\varvec{b}}_{l+1}\right) ^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\big |\). Without loss of generality, it suffices to look at \(\big |({\varvec{b}}_{j}-{\varvec{b}}_{1})^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\big |\) for all \(1\le j\le \tau \). Specifically, we claim for the moment that
$$\begin{aligned} \max _{1\le j \le \tau }\left| \left( {\varvec{b}}_{j}-{\varvec{b}}_{1}\right) ^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\right| \le cC_{4}\frac{\mu }{\sqrt{m}}\log m \end{aligned}$$(224)for some sufficiently small constant \(c>0\), provided that \(m\gg \tau K\log ^{4}m\). As a result,
$$\begin{aligned}&\sum _{k=0}^{ \frac{m}{\tau } -1 }\left| {\varvec{b}}_{1}^{\textsf {H} }\sum _{j=1}^{\tau }{\varvec{b}}_{k\tau +1}\left( {\varvec{b}}_{k\tau +j}-{\varvec{b}}_{k\tau +1}\right) ^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\left\{ \left| {\varvec{a}}_{k\tau +j}^{\textsf {H} }{\varvec{x}}^{\star }\right| ^{2}-\left\| {\varvec{x}}^{\star }\right\| _{2}^{2}\right\} \right| \lesssim cC_{4}\frac{\mu }{\sqrt{m}}\log ^{2}m. \end{aligned}$$Putting the above results together, we get
$$\begin{aligned} \frac{1}{\eta \xi }\left| {\varvec{b}}_{1}^{\textsf {H} }{\varvec{v}}_{2}\right|\le & {} \sum _{k=0}^{ \frac{m}{\tau } -1 }\left| {\varvec{b}}_{1}^{\textsf {H} }\sum _{j=1}^{\tau }{\varvec{b}}_{k\tau +j}{\varvec{b}}_{k\tau +j}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\left\{ \left| {\varvec{a}}_{k\tau +j}^{\textsf {H} }{\varvec{x}}^{\star }\right| ^{2}-\left\| {\varvec{x}}^{\star }\right\| _{2}^{2}\right\} \right| \\\le & {} 0.2\max _{1\le l\le m}\left| {\varvec{b}}_{l}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\right| +O\left( cC_{4}\frac{\mu }{\sqrt{m}}\log ^{2}m\right) . \end{aligned}$$
- 4.
Combining the preceding bounds guarantees the existence of some constant \(C_{8}>0\) such that
$$\begin{aligned} \left| {\varvec{b}}_{l}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t+1}\right|&\le \left( 1+\delta \right) \left\{ \left( 1-\eta \xi \right) \left| {\varvec{b}}_{l}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\right| +0.3\eta \xi \max _{1\le l\le m}\left| {\varvec{b}}_{l}^{\textsf {H} }\widetilde{{\varvec{h}}}^{t}\right| \right. \\&\left. \quad +\,C_{8}(1+C_{3})\eta \xi \frac{\mu }{\sqrt{m}}+C_{8}\eta \xi cC_{4}\frac{\mu }{\sqrt{m}}\log ^{2}m\right\} \\&\overset{(\text {i})}{\le }\left( 1+O\left( \frac{1}{\log ^{2}m}\right) \right) \left\{ \left( 1-0.7\eta \xi \right) C_{4}\frac{\mu }{\sqrt{m}}\log ^{2}m\right. \\&\qquad \left. +\,C_{8}(1+C_{3})\eta \xi \frac{\mu }{\sqrt{m}}+C_{8}\eta \xi cC_{4}\frac{\mu }{\sqrt{m}}\log ^{2}m\right\} \\&\overset{(\text {ii})}{\le }C_{4}\frac{\mu }{\sqrt{m}}\log ^{2}m. \end{aligned}$$Here, (i) uses the induction hypothesis (90d), and (ii) holds as long as \(c>0\) is sufficiently small (so that \((1+\delta )C_{8}\eta \xi c\ll 1\)) and \(\eta >0\) is some sufficiently small constant. In order for the proof to go through, it suffices to pick
$$\begin{aligned} \tau =c_{10}\log ^{4}m \end{aligned}$$for some sufficiently large constant \(c_{10}>0\). Accordingly, we need the sample size to exceed
$$\begin{aligned} m\gg \mu ^{2}\tau K\log ^{4}m\asymp \mu ^{2}K\log ^{8}m. \end{aligned}$$
Finally, it remains to verify claim (224), which we accomplish in Appendix C.4.3.
1.4.1 Proof of Lemma 28
Denote
Recognizing that \(\mathbb {E}[{\varvec{a}}_{j}{\varvec{a}}_{j}^{\textsf {H} }]={\varvec{I}}_{K}\) and \(\sum _{j=1}^{m}{\varvec{b}}_{j}{\varvec{b}}_{j}^{\textsf {H} }={\varvec{I}}_{K}\), we can write the quantity of interest as the sum of independent random variables, namely
Further, the sub-exponential norm (see definition in [116]) of \(w_{j}-\mathbb {E}\left[ w_{j}\right] \) obeys
where (i) arises from the centering property of the sub-exponential norm (see [116, Remark 5.18]), (ii) utilizes the relationship between the sub-exponential norm and the sub-Gaussian norm [116, Lemma 5.14], (iii) is a consequence of the incoherence condition (36) and the fact that \(\left\| {\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{\star }\right\| _{\psi _{2}}\lesssim 1\), and (iv) follows from \(\Vert {\varvec{b}}_j\Vert _2 = \sqrt{K/m} \). Let \(M = \max _{j\in [m]} \left\| w_{j}-\mathbb {E}\left[ w_{j}\right] \right\| _{\psi _{1}} \) and
which follows since \(\sum _{j=1}^{m} \left| {\varvec{b}}_{l}^{\textsf {H} }{\varvec{b}}_{j}\right| ^2 = {\varvec{b}}_l^\textsf {H} \left( \sum _{j=1}^{m} {\varvec{b}}_j {\varvec{b}}_j^\textsf {H} \right) {\varvec{b}}_l = \Vert {\varvec{b}}_l\Vert _2^2 = K/m\). Let \(a_j = \left\| w_{j}-\mathbb {E}\left[ w_{j}\right] \right\| _{\psi _{1}} \) and \(X_j =(w_j - \mathbb {E}[w_j]) / a_j\). Since \(\Vert X_j \Vert _{\psi _1} =1\), \(\sum _{j=1}^{m}a_j^2=V^2\) and \(\max _{j\in [m]} |a_j| = M\), we can invoke [116, Proposition 5.16] to obtain that
where \(c>0\) is some universal constant. By taking \(t = \mu /\sqrt{m}\), we see there exists some constant \(c'\) such that
We conclude the proof by observing that \(m\gg K\log ^2 m\) as stated in the assumption.
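The tool just invoked, [116, Proposition 5.16], is a Bernstein-type inequality for weighted sums of centered sub-exponential variables, with a tail governed by \(\min (t^{2}/V^{2},\, t/M)\). The generic Monte Carlo sketch below checks that such a bound indeed dominates the empirical tail; the weight profile and the constant \(c=1/4\) are assumed placeholders (the universal constant is unspecified), not quantities from this proof.

```python
import numpy as np

rng = np.random.default_rng(4)
m, trials = 500, 10_000
a = 1.0 / np.arange(1, m + 1)            # an assumed, illustrative weight profile
V2, M = np.sum(a ** 2), a.max()          # V^2 = sum_j a_j^2,  M = max_j |a_j|

X = rng.standard_exponential((trials, m)) - 1.0   # centered sub-exponential entries
S = X @ a                                # weighted sums, one per trial
t = 3.0 * np.sqrt(V2)
emp = np.mean(np.abs(S) > t)
bound = 2.0 * np.exp(-0.25 * min(t ** 2 / V2, t / M))   # c = 1/4 assumed
print(f"empirical tail {emp:.5f}  <=  Bernstein-type bound {bound:.5f}")
```

In the proof, the weights are \(a_{j}=\Vert w_{j}-\mathbb {E}[w_{j}]\Vert _{\psi _{1}}\), and taking \(t=\mu /\sqrt{m}\) together with \(m\gg K\log ^{2}m\) yields the stated failure probability.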
1.4.2 Proof of Lemma 29
From the elementary inequality \(\left( a-b\right) ^{2}\le 2\left( a^{2}+b^{2}\right) \), we see that
where the last identity holds true since \(\left\| {\varvec{x}}^{\star }\right\| _{2}=1\). It thus suffices to control \(\sum _{j=1}^{\tau }\left| {\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{\star }\right| ^{4}\). Let \(\xi _{j}={\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{\star }\), which is a standard complex Gaussian random variable. Since the \(\xi _{j}\)’s are statistically independent, one has
for some constant \(C_{4}>0\). It then follows from the hypercontractivity concentration result for Gaussian polynomials [99, Theorem 1.9] that
for some constants \(c,c_{2},C>0\), with the proviso that \(\tau \gg \log ^{4}m\). As a consequence, with probability at least \(1-O(m^{-10})\),
which together with (225) concludes the proof.
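Both concentration facts used in the binning argument, the \(\sqrt{\tau \log m}\) fluctuation of a centered bin sum of \(|{\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{\star }|^{2}\) and the \(O(\tau )\) size of the corresponding sum of squares, are easy to reproduce by simulation. The sketch below uses the fact that \(|{\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{\star }|^{2}\sim \mathrm {Exp}(1)\) when \({\varvec{a}}_{j}\sim \mathcal {CN}({\varvec{0}},{\varvec{I}}_{K})\) and \(\Vert {\varvec{x}}^{\star }\Vert _{2}=1\); the sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
m, tau = 2 ** 16, 1024                  # illustrative; think tau ~ poly-log(m)
xi2 = rng.standard_exponential((m // tau, tau))  # |a_j^H x*|^2, one row per bin

lin = np.abs((xi2 - 1.0).sum(axis=1)).max()      # worst centered bin sum
quad = ((xi2 - 1.0) ** 2).sum(axis=1).max()      # worst bin sum of squares
print("linear:", lin, "vs sqrt(tau log m) =", np.sqrt(tau * np.log(m)))
print("quadratic / tau:", quad / tau)            # ~ E(|xi|^2 - 1)^2 = 1, i.e. O(tau)
```

Since \(\mathrm {Var}(|\xi _{j}|^{2})=1\), each centered bin sum has standard deviation \(\sqrt{\tau }\), and the extra \(\sqrt{\log m}\) factor absorbs the maximum over the \(m/\tau \) bins, consistent with Lemma 29 and the estimate used ahead of (222).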
1.4.3 Proof of Claim (224)
We will prove the claim by induction. Again, observe that
for some \(\delta \asymp {\log ^{-2}m}\), which allows us to look at \(\left( {\varvec{b}}_{j}-{\varvec{b}}_{1}\right) ^{\textsf {H} }\frac{1}{\overline{\alpha ^{t-1}}}{\varvec{h}}^{t}\) instead.
Use the gradient update rule for \({\varvec{h}}^{t}\) (cf. (79a)) once again to get
where we denote \(\theta :={1}/{\left\| \widetilde{{\varvec{x}}}^{t-1}\right\| _{2}^{2}}.\) This further gives rise to
where the last identity makes use of the fact that \(\sum _{l=1}^{m}{\varvec{b}}_{l}{\varvec{b}}_{l}^{\textsf {H} } = {\varvec{I}}_{K}\). For \(\beta _{1}\), one can get
where we utilize the incoherence condition (36) and the fact that \(\widetilde{{\varvec{x}}}^{t-1}\) and \({\varvec{x}}^{\star }\) are extremely close, i.e.,
Regarding the second term \(\beta _{2}\), we have
The term \(\psi \) can be bounded as follows
Here, we have used the incoherence condition (36) and the facts that
which are immediate consequences of (90c) and (189). Combining this with Lemma 50, we see that for any small constant \(c>0\)
holds as long as \(m\gg \tau K\log ^{4}m\).
To summarize, we arrive at
Making use of the induction hypothesis (85c) and the fact that \(\left\| \widetilde{{\varvec{x}}}^{t-1}\right\| _{2}^{2}\ge 0.9\), we reach
Recall that \(\delta \asymp 1/\log ^{2}m\). As a result, if \(\eta >0\) is some sufficiently small constant and if
holds, then one has
Therefore, this concludes the proof of claim (224) by induction, provided that the base case is true, i.e., for some \(c>0\) sufficiently small
Claim (226) is proved in Appendix C.6 (see Lemma 30).
1.5 Proof of Lemma 19
Recall that \(\check{{\varvec{h}}}^{0}\) and \(\check{{\varvec{x}}}^{0}\) are the leading left and right singular vectors of \({\varvec{M}}\), respectively. Applying a variant of Wedin’s sin\(\Theta \) theorem [42, Theorem 2.1], we derive that
for some universal constant \(c_{1}>0\). Regarding the numerator of (227), it has been shown in [76, Lemma 5.20] that for any \(\xi >0\),
with probability exceeding \(1-O(m^{-10})\), provided that
for some universal constant \(c_{2}>0\). For the denominator of (227), we can take (228) together with Weyl’s inequality to demonstrate that
where the last inequality utilizes the facts that \(\sigma _{1}\left( \mathbb {E}\left[ {\varvec{M}}\right] \right) =1\) and \(\sigma _{2}\left( \mathbb {E}[{\varvec{M}}]\right) =0\). These together with (227) reveal that
as long as \(\xi \le 1/2\).
Now we connect the preceding bound (229) with the scaled singular vectors \({\varvec{h}}^{0}=\sqrt{\sigma _{1}\left( {\varvec{M}}\right) }\;\check{{\varvec{h}}}^{0}\) and \({\varvec{x}}^{0}=\sqrt{\sigma _{1}\left( {\varvec{M}}\right) }\,\check{{\varvec{x}}}^{0}\). For any \(\alpha \in \mathbb {C}\) with \(|\alpha |=1\), from the definition of \({\varvec{h}}^{0}\) and \({\varvec{x}}^{0}\) we have
Since \(\alpha \check{{\varvec{h}}}^{0},\alpha \check{{\varvec{x}}}^{0}\) are also the leading left and right singular vectors of \({\varvec{M}}\), we can invoke Lemma 60 to get
In addition, we can apply Weyl’s inequality once again to deduce that
where the last inequality comes from (228). Substitute (231) into (230) to obtain
Taking the minimum over \(\alpha \), one can thus conclude that
where the last inequality comes from (229). Since \(\xi \) is arbitrary, by taking \(m/(\mu ^2 K \log ^2 m )\) to be large enough, we finish the proof of (92). Carrying out similar arguments (which we omit here), we can also establish (93).
The last claim in Lemma 19 that \(\left| |\alpha _0|-1\right| \le 1/4\) is a direct corollary of (92) and Lemma 52.
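To see the spectral initialization analyzed here in action, the following self-contained Python sketch forms \({\varvec{M}}=\sum _{j}y_{j}{\varvec{b}}_{j}{\varvec{a}}_{j}^{\textsf {H} }\) (whose expectation is \({\varvec{h}}^{\star }{\varvec{x}}^{\star \textsf {H} }\)), extracts the scaled leading singular vectors, and reports the spectral gap and the initialization error; the sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
K, m = 8, 2000                          # illustrative sizes

B = np.linalg.qr(rng.standard_normal((m, K)) + 1j * rng.standard_normal((m, K)))[0]
A = (rng.standard_normal((m, K)) + 1j * rng.standard_normal((m, K))) / np.sqrt(2)
h_st = rng.standard_normal(K) + 1j * rng.standard_normal(K); h_st /= np.linalg.norm(h_st)
x_st = rng.standard_normal(K) + 1j * rng.standard_normal(K); x_st /= np.linalg.norm(x_st)
y = (B @ h_st) * np.conj(A @ x_st)      # y_j = b_j^H h* x*^H a_j

M = B.conj().T @ (y[:, None] * A)       # M = sum_j y_j b_j a_j^H
U, s, Vh = np.linalg.svd(M)
h0 = np.sqrt(s[0]) * U[:, 0]            # scaled leading singular vectors
x0 = np.sqrt(s[0]) * Vh[0].conj()
print("sigma_1(M), sigma_2(M):", s[0], s[1])
print("init error:", np.linalg.norm(np.outer(h0, x0.conj()) - np.outer(h_st, x_st.conj())))
```

As the proof predicts, \(\sigma _{1}({\varvec{M}})\) hovers near \(\sigma _{1}(\mathbb {E}[{\varvec{M}}])=1\) while \(\sigma _{2}({\varvec{M}})\) stays small, and the scaled singular-vector pair lands close to \(({\varvec{h}}^{\star },{\varvec{x}}^{\star })\) up to the usual phase/scale ambiguity.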
1.6 Proof of Lemma 20
The proof is composed of three steps:
In the first step, we show that the normalized singular vectors of \({\varvec{M}}\) and \({\varvec{M}}^{\left( l\right) }\) are close enough; see (240).
We then proceed by passing this proximity result to the scaled singular vectors; see (243).
Finally, we translate the usual \(\ell _{2}\) distance metric to the distance function we defined in (34); see (245). Along the way, we also prove the incoherence of \({\varvec{h}}^{0}\) with respect to \(\left\{ {\varvec{b}}_{l}\right\} \).
Here comes the formal proof. Recall that \(\check{{\varvec{h}}}^{0}\) and \(\check{{\varvec{x}}}^{0}\) are, respectively, the leading left and right singular vectors of \({\varvec{M}}\), and \(\check{{\varvec{h}}}^{0,\left( l\right) }\) and \(\check{{\varvec{x}}}^{0,\left( l\right) }\) are, respectively, the leading left and right singular vectors of \({\varvec{M}}^{(l)}\). Invoke Wedin’s sin\(\Theta \) theorem [42, Theorem 2.1] to obtain
for some universal constant \(c_{1}>0\). Using the Weyl’s inequality we get
where the penultimate inequality follows from
for m sufficiently large, and the last inequality comes from [76, Lemma 5.20], provided that \(m\ge c_{2}\mu ^{2}K\log ^{2}m\) for some sufficiently large constant \(c_{2}>0\). As a result, denoting
allows us to obtain
It then boils down to controlling the two terms on the right-hand side of (234). By construction,
To bound the first term, observe that
$$\begin{aligned} \left\| \big ({\varvec{M}}-{\varvec{M}}^{(l)}\big )\check{{\varvec{x}}}^{0,\left( l\right) }\right\| _{2}&=\left\| {\varvec{b}}_{l}{\varvec{b}}_{l}^{\textsf {H} }{\varvec{h}}^{\star }{\varvec{x}}^{\star \textsf {H} }{\varvec{a}}_{l}{\varvec{a}}_{l}^{\textsf {H} }\check{{\varvec{x}}}^{0,\left( l\right) }\right\| _{2}=\left\| {\varvec{b}}_{l}\right\| _{2}\left| {\varvec{b}}_{l}^{\textsf {H} }{\varvec{h}}^{\star }\right| \left| {\varvec{a}}_{l}^{\textsf {H} }{\varvec{x}}^{\star }\right| \cdot \big | {\varvec{a}}_{l}^{\textsf {H} }\check{{\varvec{x}}}^{0,\left( l\right) }\big |\nonumber \\&\le 30\frac{\mu }{\sqrt{m}}\cdot \sqrt{\frac{K\log ^{2}m}{m}}, \end{aligned}$$(235)where we use the fact that \(\Vert {\varvec{b}}_{l}\Vert _{2}=\sqrt{K/m}\), the incoherence condition (36), bound (189), and the fact that with probability exceeding \(1-O\left( m^{-10}\right) \),
$$\begin{aligned} \max _{1\le l\le m}\big |{\varvec{a}}_{l}^{\textsf {H} }\check{{\varvec{x}}}^{0,\left( l\right) }\big |\le 5\sqrt{\log m}, \end{aligned}$$due to the independence between \(\check{{\varvec{x}}}^{0,\left( l\right) }\) and \({\varvec{a}}_{l}\).
To bound the second term, for any \(\widetilde{\alpha }\) obeying \(|\widetilde{\alpha }|=1\), one has
$$\begin{aligned}&\left\| \check{{\varvec{h}}}^{0,\left( l\right) \textsf {H} }\big ({\varvec{M}}-{\varvec{M}}^{(l)}\big )\right\| _{2}\\&\quad =\left\| \check{{\varvec{h}}}^{0,\left( l\right) \textsf {H} }{\varvec{b}}_{l}{\varvec{b}}_{l}^{\textsf {H} }{\varvec{h}}^{\star }{\varvec{x}}^{\star \textsf {H} }{\varvec{a}}_{l}{\varvec{a}}_{l}^{\textsf {H} }\right\| _{2}=\left\| {\varvec{a}}_{l}\right\| _{2}\left| {\varvec{b}}_{l}^{\textsf {H} }{\varvec{h}}^{\star }\right| \left| {\varvec{a}}_{l}^{\textsf {H} }{\varvec{x}}^{\star }\right| \cdot \big |{\varvec{b}}_{l}^{\textsf {H} }\check{{\varvec{h}}}^{0,\left( l\right) }\big |\\&\quad \overset{\left( \text {i}\right) }{\le }3\sqrt{K}\cdot \frac{\mu }{\sqrt{m}}\cdot 5\sqrt{\log m}\cdot \big |{\varvec{b}}_{l}^{\textsf {H} }\check{{\varvec{h}}}^{0,\left( l\right) }\big |\\&\quad \overset{\left( \text {ii}\right) }{\le }15\sqrt{\frac{\mu ^{2}K\log m}{m}}\big |\widetilde{\alpha }{\varvec{b}}_{l}^{\textsf {H} }\check{{\varvec{h}}}^{0}\big |+15\sqrt{\frac{\mu ^{2}K\log m}{m}}\left| {\varvec{b}}_{l}^{\textsf {H} }\big (\widetilde{\alpha }\check{{\varvec{h}}}^{0}-\check{{\varvec{h}}}^{0,\left( l\right) }\big )\right| \\&\quad \overset{\left( \text {iii}\right) }{\le }15\sqrt{\frac{\mu ^{2}K\log m}{m}}\big |{\varvec{b}}_{l}^{\textsf {H} }\check{{\varvec{h}}}^{0}\big |+15\sqrt{\frac{\mu ^{2}K\log m}{m}}\cdot \sqrt{\frac{K}{m}}\left\| \widetilde{\alpha }\check{{\varvec{h}}}^{0}-\check{{\varvec{h}}}^{0,\left( l\right) }\right\| _{2}. \end{aligned}$$
Here, (i) arises from the incoherence condition (36) together with bounds (189) and (190), inequality (ii) comes from the triangle inequality, and the last line (iii) holds since \(\Vert {\varvec{b}}_{l}\Vert _{2}=\sqrt{K/m}\) and \(|\widetilde{\alpha }|=1\).
Substitution of the above bounds into (234) yields
Since the previous inequality holds for all \(\left| \widetilde{\alpha }\right| =1\), we can choose \(\widetilde{\alpha }=\beta ^{0,\left( l\right) }\) and rearrange terms to get
Under the condition that \(m\gg \mu K\log ^{1/2}m\), one has \(1-30c_{1}\sqrt{{\mu ^{2}K\log m}/{m}}\cdot \sqrt{{K}/{m}}\ge \frac{1}{2}\), and therefore,
which immediately implies that
We then move on to \(\left| {\varvec{b}}_{l}^{\textsf {H} }\check{{\varvec{h}}}^{0}\right| \). The aim is to show that \(\max _{1\le l\le m}\left| {\varvec{b}}_{l}^{\textsf {H} }\check{{\varvec{h}}}^{0}\right| \) can also be upper bounded by the left-hand side of (236). By construction, we have \({\varvec{M}}\check{{\varvec{x}}}^{0}=\sigma _{1}\left( {\varvec{M}}\right) \check{{\varvec{h}}}^{0}\), which further leads to
where \(\beta ^{0,\left( j\right) }\) is as defined in (233). Here, (i) comes from the lower bound \(\sigma _{1}\left( {\varvec{M}}\right) \ge {1}/{2}\). Bound (ii) follows by combining the incoherence condition (36), bound (189), the triangle inequality, as well as the estimate \(\sum _{j=1}^{m}\left| {\varvec{b}}_{l}^{\textsf {H} }{\varvec{b}}_{j}\right| \le 4\log m\) from Lemma 48. The last line uses the upper estimate \(\max _{1\le j\le m}\left| {\varvec{a}}_{j}^{\textsf {H} }\check{{\varvec{x}}}^{0,\left( j\right) }\right| \le 5\sqrt{\log m}\) and (190). Our bound (237) further implies
The above bound (238) taken together with (236) gives
As long as \(m\gg \mu ^{2}K\log ^{2}m\), we have \(60c_{1}\sqrt{{\mu ^{2}K\log m}/{m}}\cdot 120\sqrt{{\mu ^{2}K\log ^{3}m}/{m}}\le 1/2\). Rearranging terms, we are left with
for some constant \(c_{3}>0\). Further, this bound combined with (238) yields
for some constant \(c_{2}>0\), with the proviso that \(m\gg \mu ^{2}K\log ^{2}m\).
We now translate the preceding bounds to the scaled version. Recall from bound (231) that
as long as \(\xi \le 1/2\). For any \(\alpha \in \mathbb {C}\) with \(\left| \alpha \right| =1\), \(\alpha \check{{\varvec{h}}}^{0},\alpha \check{{\varvec{x}}}^{0}\) are still the leading left and right singular vectors of \({\varvec{M}}\). Hence, we can use Lemma 60 to derive that
and
Taking the previous two bounds collectively yields
which together with (235) and (240) implies
for some constant \(c_{5}>0\), as long as \(\xi \) is sufficiently small. Moreover, we have
for any \(|\alpha |=1\), where \(\alpha ^{0}\) is defined in (38) and, according to Lemma 19, satisfies
Therefore,
Furthermore, we have
where the second line follows since the latter minimization is over a smaller feasible set. This completes the proof of claim (96).
Regarding \(\big |{\varvec{b}}_{l}^{\textsf {H} }\widetilde{{\varvec{h}}}^{0}\big |\), one first sees that
where the last relation holds due to (241) and (242). Hence, using the property (244), we have
which finishes the proof of claim (97).
Before concluding this section, we note a by-product of the proof. Specifically, we can establish the claim required in (226) using many results derived in this section. This is formally stated in the following lemma.
Lemma 30
Fix any small constant \(c>0\). Suppose the number of samples obeys \(m\gg \tau K\log ^{4}m\). Then with probability at least \(1-O\left( m^{-10}\right) \), we have
Proof
Instate the notation and hypotheses in Appendix C.6. Recognize that
where the last inequality comes from (242) and (244). It thus suffices to prove that \(\left| \left( {\varvec{b}}_{j}-{\varvec{b}}_{1}\right) ^{\textsf {H} }\check{{\varvec{h}}}^{0}\right| \le c{\mu }\log m / {\sqrt{m}}\) for some \(c>0\) small enough. To this end, it can be seen that
where (i) comes from Lemma 50, the incoherence condition (36), and estimate (189). The last line (ii) holds since we have already established (see (237) and (240))
The proof is then complete.\(\square \)
1.7 Proof of Lemma 21
Recall that \(\alpha ^{0}\) and \(\alpha ^{0,\left( l\right) }\) are the alignment parameters between \({\varvec{z}}^{0}\) and \({\varvec{z}}^{\star }\), and between \({\varvec{z}}^{0,\left( l\right) }\) and \({\varvec{z}}^{\star }\), respectively, that is,
Also, we let
The triangle inequality together with (94) and (245) then tells us that
where the last relation holds as long as \(m\gg \mu ^{2}\sqrt{K}\log ^{9/2}m\).
Let
It is easy to see that \({\varvec{x}}_{1},{\varvec{h}}_{1},{\varvec{x}}_{2},{\varvec{h}}_{2}\) satisfy the assumptions in Lemma 55, which implies
where the last line comes from (245). With this upper estimate at hand, we are now ready to show that with high probability,
where (i) follows from the triangle inequality, (ii) uses Cauchy–Schwarz and the independence between \({\varvec{x}}^{0,\left( l\right) }\) and \({\varvec{a}}_{l}\), (iii) holds because of (95) and (247) under the condition \(m\gg \mu ^{2}K\log ^{6}m\), and (iv) holds true as long as \(m\gg \mu ^{2}K\log ^{4}m\).
Technical Lemmas
1.1 Technical Lemmas for Phase Retrieval
1.1.1 Matrix Concentration Inequalities
Lemma 31
Suppose that \({\varvec{a}}_{j}\overset{\mathrm {i.i.d.}}{\sim }\mathcal {N}\left( {\varvec{0}},{\varvec{I}}_{n}\right) \) for every \(1\le j\le m\). Fix any small constant \(\delta >0\). With probability at least \(1-C_{2}e^{-c_{2}m}\), one has
as long as \(m\ge c_{0}n\) for some sufficiently large constant \(c_{0}>0\). Here, \(C_{2},c_{2}>0\) are some universal constants.
Proof
This is an immediate consequence of [116, Corollary 5.35]. \(\square \)
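To get a concrete feel for the scale of this spectral deviation, the following Python snippet (a minimal sanity check with sizes of our own choosing, not those dictated by the lemma) simulates the sample covariance \(\frac{1}{m}\sum _{j=1}^{m}{\varvec{a}}_{j}{\varvec{a}}_{j}^{\top }\) and measures its spectral distance to \({\varvec{I}}_{n}\):

```python
import numpy as np

# Illustrative sizes: n = 100 and m = 50 * n (i.e., c_0 = 50 here).
rng = np.random.default_rng(0)
n, m = 100, 50 * 100
A = rng.standard_normal((m, n))           # row j is a_j^T, a_j ~ N(0, I_n)
deviation = np.linalg.norm(A.T @ A / m - np.eye(n), 2)
print(f"|| (1/m) sum_j a_j a_j^T - I_n || = {deviation:.4f}")  # O(sqrt(n/m))
```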
Lemma 32
Suppose that \({\varvec{a}}_{j}\overset{\mathrm {i.i.d.}}{\sim }\mathcal {N}\left( {\varvec{0}},{\varvec{I}}_{n}\right) \), for every \(1\le j\le m\). Fix any small constant \(\delta >0\). With probability at least \(1-O(n^{-10})\), we have
provided that \(m\ge c_{0}n\log n\) for some sufficiently large constant \(c_{0}>0\).
Proof
This is adapted from [18, Lemma 7.4]. \(\square \)
Lemma 33
Suppose that \({\varvec{a}}_{j}\overset{\mathrm {i.i.d.}}{\sim }\mathcal {N}\left( {\varvec{0}},{\varvec{I}}_{n}\right) \), for every \(1\le j\le m\). Fix any small constant \(\delta >0\) and any constant \(C>0\). Suppose \(m\ge c_{0}n\) for some sufficiently large constant \(c_{0}>0\). Then with probability at least \(1-C_{2}e^{-c_{2}m}\),
holds for some absolute constants \(c_{2},C_{2}>0\), where
with \(\xi \) being a standard Gaussian random variable.
Proof
This is supplied in [25, supplementary material]. \(\square \)
1.1.2 Matrix Perturbation Bounds
Lemma 34
Let \(\lambda _{1}({\varvec{A}})\), \({\varvec{u}}\) be the leading eigenvalue and eigenvector of a symmetric matrix \({\varvec{A}}\), respectively, and \(\lambda _{1}(\widetilde{{\varvec{A}}})\), \(\widetilde{{\varvec{u}}}\) be the leading eigenvalue and eigenvector of a symmetric matrix \(\widetilde{{\varvec{A}}}\), respectively. Suppose that \(\lambda _{1}({\varvec{A}}),\lambda _{1}(\widetilde{{\varvec{A}}}),\Vert {\varvec{A}}\Vert ,\Vert \widetilde{{\varvec{A}}}\Vert \in [C_{1},C_{2}]\) for some \(C_{1},C_{2}>0\). Then,
Proof
Observe that
where the last inequality follows since \(\left\| {\varvec{u}}\right\| _{2}=1\). Using the identity \(\sqrt{a}-\sqrt{b}=({a-b}) /({\sqrt{a}+\sqrt{b}})\), we have
where the last inequality comes from our assumptions on \(\lambda _{1}({\varvec{A}})\) and \(\lambda _{1}(\widetilde{{\varvec{A}}})\). This combined with (248) yields
To control \(\left| \lambda _{1}\big ({\varvec{A}}\big )-\lambda _{1}(\widetilde{{\varvec{A}}})\right| \), use the relationship between the eigenvalue and the eigenvector to obtain
which together with (249) gives
as claimed. \(\square \)
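As an informal companion to Lemma 34, the sketch below (matrix ensemble, spectral gap, and perturbation level are all illustrative assumptions of ours) checks that the scaled leading eigenvectors \(\sqrt{\lambda _{1}}\,{\varvec{u}}\) of two nearby symmetric matrices differ, up to the sign ambiguity, by roughly the size of the perturbation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
B = rng.standard_normal((n, n))
A = (B + B.T) / np.sqrt(n) + 3.0 * np.eye(n)   # well-separated top eigenvalue
E = rng.standard_normal((n, n)); E = E + E.T
E *= 1e-3 / np.linalg.norm(E, 2)               # spectral norm exactly 1e-3
At = A + E

def scaled_top_eigvec(M):
    """Return sqrt(lambda_1(M)) * u for the leading eigenpair of symmetric M."""
    w, V = np.linalg.eigh(M)                   # eigenvalues in ascending order
    return np.sqrt(w[-1]) * V[:, -1]

v, vt = scaled_top_eigvec(A), scaled_top_eigvec(At)
diff = min(np.linalg.norm(v - vt), np.linalg.norm(v + vt))  # mod out the sign
print(f"||A - A~|| = {np.linalg.norm(E, 2):.2e}, "
      f"scaled-eigvec difference = {diff:.2e}")  # same order of magnitude
```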
1.2 Technical Lemmas for Matrix Completion
1.2.1 Orthogonal Procrustes Problem
The orthogonal Procrustes problem is a matrix approximation problem which seeks an orthogonal matrix \({\varvec{R}}\) to best “align” two matrices \({\varvec{A}}\) and \({\varvec{B}}\). Specifically, for \({\varvec{A}},{\varvec{B}}\in \mathbb {R}^{n\times r}\), define \(\widehat{{\varvec{R}}}\) to be the minimizer of
The first lemma is concerned with the characterization of the minimizer \(\widehat{{\varvec{R}}}\) of (250).
Lemma 35
For \({\varvec{A}},{\varvec{B}}\in \mathbb {R}^{n\times r}\), \(\widehat{{\varvec{R}}}\) is the minimizer of (250) if and only if \(\widehat{{\varvec{R}}}^{\top }{\varvec{A}}^{\top }{\varvec{B}}\) is symmetric and positive semidefinite.
Proof
This is an immediate consequence of [112, Theorem 2]. \(\square \)
Let \({\varvec{A}}^\top {\varvec{B}} = {\varvec{U}}{\varvec{\Sigma }}{\varvec{V}}^\top \) be the singular value decomposition of \({\varvec{A}}^\top {\varvec{B}} \in \mathbb {R}^{r \times r}\). It is easy to check that \(\widehat{{\varvec{R}}} := {\varvec{U}} {\varvec{V}}^\top \) satisfies the conditions that \(\widehat{{\varvec{R}}}^{\top }{\varvec{A}}^{\top }{\varvec{B}}\) is both symmetric and positive semidefinite. In view of Lemma 35, \(\widehat{{\varvec{R}}}={\varvec{U}} {\varvec{V}}^\top \) is the minimizer of (250). In the special case when \({\varvec{C}}:= {\varvec{A}}^\top {\varvec{B}}\) is invertible, \(\widehat{{\varvec{R}}}\) enjoys the following equivalent form:
where \(\widehat{{\varvec{H}}}\left( \cdot \right) \) is an \(\mathbb {R}^{r \times r}\)-valued function on \(\mathbb {R}^{r \times r}\). This motivates us to look at the perturbation bounds for the matrix-valued function \(\widehat{{\varvec{H}}}\left( \cdot \right) \), which is formulated in the following lemma.
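To illustrate the two equivalent descriptions of the minimizer, the snippet below (dimensions arbitrary; SciPy's `sqrtm` plays the role of the matrix square root) computes \(\widehat{{\varvec{R}}}\) both as \({\varvec{U}}{\varvec{V}}^{\top }\) and as \({\varvec{C}}({\varvec{C}}^{\top }{\varvec{C}})^{-1/2}\), and verifies the certificate of Lemma 35:

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(2)
n, r = 30, 5
A, B = rng.standard_normal((n, r)), rng.standard_normal((n, r))
C = A.T @ B                                   # r x r, invertible almost surely
U, _, Vt = np.linalg.svd(C)
R_svd = U @ Vt                                # Procrustes minimizer via SVD
R_inv = C @ np.linalg.inv(np.real(sqrtm(C.T @ C)))  # C (C^T C)^{-1/2}
print(np.allclose(R_svd, R_inv))              # True: the two forms agree
S = R_svd.T @ C                               # Lemma 35's certificate:
print(np.allclose(S, S.T),                    # ... symmetric
      np.linalg.eigvalsh((S + S.T) / 2).min() >= -1e-10)  # ... and PSD
```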
Lemma 36
Let \({\varvec{C}}\in \mathbb {R}^{r\times r}\) be a nonsingular matrix. Then for any matrix \({\varvec{E}}\in \mathbb {R}^{r\times r}\) with \(\left\| {\varvec{E}}\right\| \le \sigma _{\min }\left( {\varvec{C}}\right) \) and any unitarily invariant norm, one has
where \(\widehat{{\varvec{H}}}\left( \cdot \right) \) is defined above.
Proof
This is an immediate consequence of [85, Theorem 2.3].\(\square \)
With Lemma 36 in place, we are ready to present the following bounds on two matrices after “aligning” them with \({\varvec{X}}^{\star }\).
Lemma 37
Instate the notation in Sect. 3.2. Suppose \({\varvec{X}}_{1},{\varvec{X}}_{2}\in \mathbb {R}^{n\times r}\) are two matrices such that
Denote
Then the following two inequalities hold true:
Proof
Before proving the claims, we first gather some immediate consequences of assumptions (252). Denote \({\varvec{C}}={\varvec{X}}_{1}^{\top }{\varvec{X}}^{\star }\) and \({\varvec{E}}=\left( {\varvec{X}}_{2}-{\varvec{X}}_{1}\right) ^{\top }{\varvec{X}}^{\star }\). It is easily seen that \({\varvec{C}}\) is invertible since
where (i) follows from assumption (252a) and (ii) is a direct application of Weyl’s inequality. In addition, \({\varvec{C}}+{\varvec{E}}={\varvec{X}}_{2}^{\top }{\varvec{X}}^{\star }\) is also invertible since
where (i) arises from assumption (252b) and (ii) holds because of (253). When both \({\varvec{C}}\) and \({\varvec{C}}+{\varvec{E}}\) are invertible, the orthonormal matrices \({\varvec{R}}_{1}\) and \({\varvec{R}}_{2}\) admit closed-form expressions as follows
Moreover, we have the following bound on \(\left\| {\varvec{X}}_{1}\right\| \):
where (i) is the triangle inequality, (ii) uses assumption (252a), and (iii) arises from the fact that \(\left\| {\varvec{X}}^{\star }\right\| = \sqrt{\sigma _{\max }}\).
With these in place, we turn to establishing the claimed bounds. We will focus on the upper bound on \(\left\| {\varvec{X}}_{1}{\varvec{R}}_{1}-{\varvec{X}}_{2}{\varvec{R}}_{2}\right\| _{\mathrm {F}}\), as the bound on \(\left\| {\varvec{X}}_{1}{\varvec{R}}_{1}-{\varvec{X}}_{2}{\varvec{R}}_{2}\right\| \) can be easily obtained using the same argument. Simple algebra reveals that
where the first inequality uses the fact that \(\left\| {\varvec{R}}_{2}\right\| = 1\) and the last inequality comes from (254). An application of Lemma 36 leads us to conclude that
where (256) utilizes (253). Combine (255) and (257) to reach
which finishes the proof by noting that \(\kappa \ge 1\). \(\square \)
1.2.2 Matrix Concentration Inequalities
This section collects various measure concentration results regarding the Bernoulli random variables \(\{\delta _{j,k}\}_{1\le j,k \le n}\), which are ubiquitous in the analysis of matrix completion.
Lemma 38
Fix any small constant \(\delta >0\), and suppose that \(m\gg \delta ^{-2}\mu nr\log n\). Then with probability exceeding \(1-O\left( n^{-10}\right) \), one has
which holds simultaneously for all \({\varvec{B}}\in \mathbb {R}^{n\times n}\) lying within the tangent space of \({\varvec{M}}^{\star }\).
Proof
This result has been established in [19, Section 4.2] for asymmetric sampling patterns (where each (i, j), \(i\ne j\), is included in \(\Omega \) independently). It is straightforward to extend the proof and the result to symmetric sampling patterns (where each (i, j), \(i\ge j\), is included in \(\Omega \) independently). We omit the proof for conciseness. \(\square \)
Lemma 39
Fix a matrix \({\varvec{M}}\in \mathbb {R}^{n\times n}\). Suppose \(n^{2}p\ge c_{0}n\log n\) for some sufficiently large constant \(c_{0}>0\). With probability at least \(1-O\left( n^{-10}\right) \), one has
where \(C>0\) is some absolute constant.
Proof
See [64, Lemma 3.2]. Similar to Lemma 38, the result therein was provided for the asymmetric sampling patterns but can be easily extended to the symmetric case.\(\square \)
Lemma 40
Recall from Sect. 3.2 that \({\varvec{E}}\in \mathbb {R}^{n \times n}\) is the symmetric noise matrix. Suppose the sample size obeys \(n^{2}p\ge c_{0}n\log ^{2}n\) for some sufficiently large constant \(c_{0}>0\). With probability at least \(1-O\left( n^{-10}\right) \), one has
where \(C>0\) is some universal constant.
Proof
See [32, Lemma 11]. \(\square \)
Lemma 41
Fix some matrix \({\varvec{A}}\in \mathbb {R}^{n\times r}\) with \(n\ge 2r\) and some \(1\le l \le n\). Suppose \(\left\{ \delta _{l,j}\right\} _{1\le j\le n}\) are independent Bernoulli random variables with means \(\left\{ p_{j}\right\} _{1\le j \le n}\) no more than p. Define
Then one has
and for any constant \(C \ge 3\), with probability exceeding \(1-n^{-\left( 1.5C-1\right) }\)
and
Proof
By the definition of \({\varvec{G}}_{l}\left( {\varvec{A}}\right) \) and the triangle inequality, one has
Therefore, it suffices to control the first term. It can be seen that \(\left\{ \left( \delta _{l,j}-p_{j}\right) {\varvec{A}}_{j,\cdot }^{\top }{\varvec{A}}_{j,\cdot }\right\} _{1\le j \le n}\) are i.i.d. zero-mean random matrices. Letting
and invoking matrix Bernstein’s inequality [114, Theorem 6.1.1], one has for all \(t\ge 0\),
We can thus find an upper bound on \(\textsf {Median} \left[ \left\| \sum _{j=1}^{n}\left( \delta _{l,j}-p_{j}\right) {\varvec{A}}_{j,\cdot }^{\top }{\varvec{A}}_{j,\cdot }\right\| \right] \) by finding a value t that ensures the right-hand side of (258) is smaller than \(1/2\). Using this strategy and some simple calculations, we get
and for any \(C\ge 3\),
holds with probability at least \(1-n^{-\left( 1.5C-1\right) }\). As a consequence, we have
and with probability exceeding \(1-n^{-\left( 1.5C-1\right) }\),
This completes the proof. \(\square \)
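The next snippet (a hedged Monte Carlo illustration with arbitrary parameters) probes the object controlled by Lemma 41, namely the Bernoulli-weighted Gram matrix \(\sum _{j}\delta _{l,j}{\varvec{A}}_{j,\cdot }^{\top }{\varvec{A}}_{j,\cdot }\), and compares its typical deviation from the mean with the variance-driven scale suggested by matrix Bernstein:

```python
import numpy as np

rng = np.random.default_rng(3)
n, r, p = 4000, 5, 0.05
A = rng.standard_normal((n, r)) / np.sqrt(n)      # rows A_{j,.}
mean_gram = p * (A.T @ A)                         # E[ sum_j delta_j A_j^T A_j ]
devs = []
for _ in range(200):                              # 200 independent masks
    delta = rng.random(n) < p                     # one row of the sampling mask
    G = A[delta].T @ A[delta]                     # sum_j delta_j A_j^T A_j
    devs.append(np.linalg.norm(G - mean_gram, 2))
row_max = np.linalg.norm(A, axis=1).max()         # ||A||_{2,infty}
bernstein = np.sqrt(p * np.log(n)) * row_max * np.linalg.norm(A, 2)
print(f"median deviation = {np.median(devs):.5f}; "
      f"Bernstein scale sqrt(p log n) * ||A||_2inf * ||A|| = {bernstein:.5f}")
```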
Lemma 42
Let \(\left\{ \delta _{l,j}\right\} _{1\le l\le j\le n}\) be i.i.d. Bernoulli random variables with mean p and \(\delta _{l,j}=\delta _{j,l}\). For any \({\varvec{\Delta }}\in \mathbb {R}^{n\times r}\), define
Suppose the sample size obeys \(n^2 p\gg \kappa \mu r n \log ^{2} n\). Then for any \(k>0\) and \(\alpha >0\) large enough, with probability at least \(1-c_{1}e^{-{\alpha C}nr\log n / 2}\),
holds simultaneously for all \({\varvec{\Delta }}\in \mathbb {R}^{n\times r}\) obeying
where \(c_{1},C_{5},C_{8}, C_{9}, C_{10}>0\) are some absolute constants.
Proof
For simplicity of presentation, we will prove the claim for the asymmetric case where \(\left\{ \delta _{l,j}\right\} _{1\le l,j\le n}\) are independent. The results immediately carry over to the symmetric case as claimed in this lemma. To see this, note that we can always divide \({\varvec{G}}_{l}({\varvec{\Delta }})\) into
where all nonzero components of \({\varvec{G}}_{l}^{\mathrm {upper}}({\varvec{\Delta }})\) come from the upper triangular part (those blocks with \(l\le j\)), while all nonzero components of \({\varvec{G}}_{l}^{\mathrm {lower}}({\varvec{\Delta }})\) are from the lower triangular part (those blocks with \(l>j\)). We can then look at \(\left\{ {\varvec{G}}_{l}^{\mathrm {upper}}({\varvec{\Delta }})\mid 1\le l\le n\right\} \) and \(\left\{ {\varvec{G}}_{l}^{\mathrm {lower}}({\varvec{\Delta }})\mid 1\le l\le n\right\} \) separately using the argument we develop for the asymmetric case. From now on, we assume that \(\left\{ \delta _{l,j}\right\} _{1\le l,j\le n}\) are independent.
Suppose for the moment that \({\varvec{\Delta }}\) is statistically independent of \(\left\{ \delta _{l,j}\right\} \). Clearly, for any \({\varvec{\Delta }},\widetilde{{\varvec{\Delta }}}\in \mathbb {R}^{n\times r}\),
which implies that \(\left\| {\varvec{G}}_{l}\left( {\varvec{\Delta }}\right) \right\| \) is 1-Lipschitz with respect to the metric \(d\left( \cdot ,\cdot \right) \). Moreover,
according to our assumption. Hence, Talagrand’s inequality [24, Proposition 1] reveals the existence of some absolute constants \(C,c>0\) such that for all \(\lambda >0\)
We then proceed to control \(\mathrm {Median}\left[ \left\| {\varvec{G}}_{l}\left( {\varvec{\Delta }}\right) \right\| \right] \). A direct application of Lemma 41 yields
where the last relation holds since \(p\psi ^{2}\gg \xi ^{2}\log r\), which follows by combining the definitions of \(\psi \) and \(\xi \), the sample size condition \(np\gg \kappa \mu r\log ^{2}n\), and the incoherence condition (114). Thus, substitution into (259) and taking \(\lambda =\sqrt{kr}\) give
for any \(k\ge 0\). Furthermore, invoking [4, Corollary A.1.14] and using bound (260), one has
for any \(t\ge 6\). Choose \(t={\alpha \log n} / \left[ {kC\exp \left( -ckr\right) }\right] \ge 6\) to obtain
So far, we have demonstrated that for any fixed \({\varvec{\Delta }}\) obeying our assumptions, \(\sum _{l=1}^{n}{{\,\mathrm{\mathbb {1}}\,}}_{\left\{ \left\| {\varvec{G}}_{l}\left( {\varvec{\Delta }}\right) \right\| \ge 2\sqrt{p}\psi +\sqrt{kr}\xi \right\} }\) is well controlled with exponentially high probability. In order to extend the results to all feasible \({\varvec{\Delta }}\), we resort to the standard \(\epsilon \)-net argument. Clearly, due to the homogeneity property of \(\left\| {\varvec{G}}_{l}\left( {\varvec{\Delta }}\right) \right\| \), it suffices to restrict attention to the following set:
where \(\psi /\xi \lesssim \Vert {\varvec{X}}^{\star }\Vert / \Vert {\varvec{X}}^{\star }\Vert _{2,\infty } \lesssim \sqrt{n}\). We then proceed with the following steps.
- 1.
Introduce the auxiliary function
$$\begin{aligned} \chi _{l}({\varvec{\Delta }})={\left\{ \begin{array}{ll} 1,\qquad &{} \text {if }\left\| {\varvec{G}}_{l}\left( {\varvec{\Delta }}\right) \right\| \ge 4\sqrt{p}\psi +2\sqrt{kr}\xi ,\\ \frac{\left\| {\varvec{G}}_{l}\left( {\varvec{\Delta }}\right) \right\| -2\sqrt{p}\psi -\sqrt{kr}\xi }{2\sqrt{p}\psi + \sqrt{kr}\xi }, &{} \text {if }\left\| {\varvec{G}}_{l}\left( {\varvec{\Delta }}\right) \right\| \in [2\sqrt{p}\psi +\sqrt{kr}\xi ,\text { }4\sqrt{p}\psi +2\sqrt{kr}\xi ],\\ 0, &{} \text {else}. \end{array}\right. } \end{aligned}$$
Clearly, this function is sandwiched between two indicator functions
$$\begin{aligned} {{\,\mathrm{\mathbb {1}}\,}}_{\left\{ \left\| {\varvec{G}}_{l}\left( {\varvec{\Delta }}\right) \right\| \ge 4\sqrt{p}\psi +2\sqrt{kr}\xi \right\} }\le \chi _{l}({\varvec{\Delta }})\le {{\,\mathrm{\mathbb {1}}\,}}_{\left\{ \left\| {\varvec{G}}_{l}\left( {\varvec{\Delta }}\right) \right\| \ge 2\sqrt{p}\psi +\sqrt{kr}\xi \right\} }. \end{aligned}$$
Note that \(\chi _{l}\) is more convenient to work with due to continuity; a toy implementation of this sandwich device is sketched right after the proof.
- 2.
Consider an \(\epsilon \)-net \(\mathcal {N}_{\epsilon }\) [111, Section 2.3.1] of the set \(\mathcal {S}\) as defined in (262). For any \(\epsilon =1/n^{O(1)}\), one can find such a net with cardinality \(\log |\mathcal {N}_{\epsilon }|\lesssim nr\log n\). Apply the union bound and (261) to yield
$$\begin{aligned}&\mathbb {P}\left( \sum _{l=1}^{n}\chi _{l}({\varvec{\Delta }})\ge \frac{\alpha n\log n}{k},\text { }\forall {\varvec{\Delta }}\in \mathcal {N}_{\epsilon }\right) \\&\quad \le \mathbb {P}\left( \sum _{l=1}^{n}{{\,\mathrm{\mathbb {1}}\,}}_{\left\{ \left\| {\varvec{G}}_{l}\left( {\varvec{\Delta }}\right) \right\| \ge 2\sqrt{p}\psi +\sqrt{kr}\xi \right\} }\ge \frac{\alpha n\log n}{k},\text { }\forall {\varvec{\Delta }}\in \mathcal {N}_{\epsilon }\right) \\&\quad \le 2|\mathcal {N}_{\epsilon }|\exp \left( -\frac{\alpha C}{2}nr\log n\right) \le 2\exp \left( -\frac{\alpha C}{4}nr\log n\right) , \end{aligned}$$
as long as \(\alpha \) is chosen to be sufficiently large.
- 3.
One can then use the continuity argument to extend the bound to all \({\varvec{\Delta }}\) outside the \(\epsilon \)-net, i.e., with exponentially high probability,
$$\begin{aligned}&\sum _{l=1}^{n}\chi _{l}({\varvec{\Delta }})\le \frac{2\alpha n\log n}{k},\quad \forall {\varvec{\Delta }}\in \mathcal {S}\\&\Longrightarrow \qquad \sum _{l=1}^{n}{{\,\mathrm{\mathbb {1}}\,}}_{\left\{ \left\| {\varvec{G}}_{l}\left( {\varvec{\Delta }}\right) \right\| \ge 4\sqrt{p}\psi +2\sqrt{kr}\xi \right\} }\le \sum _{l=1}^{n}\chi _{l}({\varvec{\Delta }})\le \frac{2\alpha n\log n}{k},\quad \forall {\varvec{\Delta }}\in \mathcal {S}. \end{aligned}$$
This is fairly standard (see, e.g., [111, Section 2.3.1]) and is thus omitted here.
We have thus concluded the proof. \(\square \)
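As promised in step 1, here is a toy implementation (with hypothetical thresholds a and b standing in for \(2\sqrt{p}\psi +\sqrt{kr}\xi \) and \(4\sqrt{p}\psi +2\sqrt{kr}\xi \)) of the continuous surrogate sandwiched between the two indicator functions:

```python
import numpy as np

def chi(t, a, b):
    """Continuous surrogate: 0 below a, linear on [a, b], 1 above b."""
    return np.clip((t - a) / (b - a), 0.0, 1.0)

a, b = 1.0, 2.0                   # hypothetical thresholds, b = 2a as in the proof
t = np.linspace(0.0, 3.0, 301)
assert np.all((t >= b).astype(float) <= chi(t, a, b))   # lower indicator
assert np.all(chi(t, a, b) <= (t >= a).astype(float))   # upper indicator
```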
Lemma 43
Suppose the sample size obeys \(n^{2}p\ge C\kappa \mu rn\log n\) for some sufficiently large constant \(C>0\). Then with probability at least \(1-O\left( n^{-10}\right) \),
holds simultaneously for all \({\varvec{X}}\in \mathbb {R}^{n\times r}\) satisfying
where \(\epsilon >0\) is any fixed constant.
Proof
To simplify the notations hereafter, we denote \({\varvec{\Delta }}:={\varvec{X}}-{\varvec{X}}^{\star }\). With this notation in place, one can decompose
which together with the triangle inequality implies that
In the sequel, we bound \(\alpha _{1}\) and \(\alpha _{2}\) separately.
- 1.
Recall from [84, Theorem 2.5] the elementary inequality that
$$\begin{aligned} \left\| {\varvec{C}}\right\| \le \big \Vert |{\varvec{C}}|\big \Vert , \end{aligned}$$ (265)
where \(|{\varvec{C}}|:=[|c_{i,j}|]_{1\le i,j\le n}\) for any matrix \({\varvec{C}}=[c_{i,j}]_{1\le i,j\le n}\). In addition, for any matrix \({\varvec{D}}:=[d_{i,j}]_{1\le i,j\le n}\) such that \(|d_{i,j}|\ge |c_{i,j}|\) for all i and j, one has \(\big \Vert |{\varvec{C}}|\big \Vert \le \big \Vert |{\varvec{D}}|\big \Vert \). Therefore,
$$\begin{aligned} \alpha _{1}\le \left\| \frac{1}{p}\mathcal {P}_{\Omega }\left( \left| {\varvec{\Delta }}{\varvec{\Delta }}^{\top }\right| \right) \right\| \le \left\| {\varvec{\Delta }}\right\| _{2,\infty }^{2}\left\| \frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{1}}{\varvec{1}}^{\top }\right) \right\| . \end{aligned}$$
Lemma 39 then tells us that with probability at least \(1-O(n^{-10})\),
$$\begin{aligned} \left\| \frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{1}}{\varvec{1}}^{\top }\right) -{\varvec{1}}{\varvec{1}}^{\top }\right\| \le C\sqrt{\frac{n}{p}} \end{aligned}$$ (266)
for some universal constant \(C>0\), as long as \(p\gg \log n/n\). This together with the triangle inequality yields
$$\begin{aligned} \left\| \frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{1}}{\varvec{1}}^{\top }\right) \right\| \le \left\| \frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{1}}{\varvec{1}}^{\top }\right) -{\varvec{1}}{\varvec{1}}^{\top }\right\| +\left\| {\varvec{1}}{\varvec{1}}^{\top }\right\| \le C\sqrt{\frac{n}{p}}+n\le 2n, \end{aligned}$$ (267)
provided that \(p\gg 1/n\). Putting together the previous bounds, we arrive at
$$\begin{aligned} \alpha _{1}\le 2n\left\| {\varvec{\Delta }}\right\| _{2,\infty }^{2}. \end{aligned}$$ (268)
- 2.
Regarding the second term \(\alpha _{2}\), apply the elementary inequality (265) once again to get
$$\begin{aligned} \left\| \mathcal {P}_{\Omega }\left( {\varvec{\Delta }}{\varvec{X}}^{\star \top }\right) \right\| \le \left\| \mathcal {P}_{\Omega }\left( \left| {\varvec{\Delta }}{\varvec{X}}^{\star \top }\right| \right) \right\| , \end{aligned}$$
which motivates us to look at \(\left\| \mathcal {P}_{\Omega }\left( \left| {\varvec{\Delta }}{\varvec{X}}^{\star \top }\right| \right) \right\| \) instead. A key step of this part is to take advantage of the \(\ell _{2,\infty }\) norm constraint of \(\mathcal {P}_{\Omega }\left( \left| {\varvec{\Delta }}{\varvec{X}}^{\star \top }\right| \right) \). Specifically, we claim for the moment that with probability exceeding \(1-O(n^{-10})\),
$$\begin{aligned} \left\| \mathcal {P}_{\Omega }\left( \left| {\varvec{\Delta }}{\varvec{X}}^{\star \top }\right| \right) \right\| _{2,\infty }^{2}\le 2p\sigma _{\max }\left\| {\varvec{\Delta }}\right\| _{2,\infty }^{2}:=\theta \end{aligned}$$ (269)
holds under our sample size condition. In addition, we also have the following trivial \(\ell _{\infty }\) norm bound
$$\begin{aligned} \left\| \mathcal {P}_{\Omega }\left( \left| {\varvec{\Delta }}{\varvec{X}}^{\star \top }\right| \right) \right\| _{\infty }\le \left\| {\varvec{\Delta }}\right\| _{2,\infty }\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }:=\gamma . \end{aligned}$$ (270)
In what follows, for simplicity of presentation, we will denote
$$\begin{aligned} {\varvec{A}}:=\mathcal {P}_{\Omega }\left( \left| {\varvec{\Delta }}{\varvec{X}}^{\star \top }\right| \right) . \end{aligned}$$ (271)
- (a)
To facilitate the analysis of \(\Vert {\varvec{A}}\Vert \), we first introduce \(k_{0}+1=\frac{1}{2}\log \left( \kappa \mu r\right) \) auxiliary matrices \({\varvec{B}}_{s}\in \mathbb {R}^{n\times n}\) that satisfy
$$\begin{aligned} \left\| {\varvec{A}}\right\| \le \left\| {\varvec{B}}_{k_{0}}\right\| +\sum _{s=0}^{k_{0}-1}\left\| {\varvec{B}}_{s}\right\| . \end{aligned}$$ (272)
To be precise, each \({\varvec{B}}_{s}\) is defined such that
$$\begin{aligned} \left[ {\varvec{B}}_{s}\right] _{j,k}&={\left\{ \begin{array}{ll} \frac{1}{2^{s}}\gamma , &{} \text {if}\quad A_{j,k}\in (\frac{1}{2^{s+1}}\gamma ,\frac{1}{2^{s}}\gamma ],\\ 0, &{} \text {else}, \end{array}\right. }\quad \text {for } 0\le s\le k_{0}-1\qquad \text {and}\\ \left[ {\varvec{B}}_{k_{0}}\right] _{j,k}&={\left\{ \begin{array}{ll} \frac{1}{2^{k_{0}}}\gamma , &{} \text {if}\quad A_{j,k}\le \frac{1}{2^{k_{0}}}\gamma ,\\ 0, &{} \text {else}, \end{array}\right. } \end{aligned}$$
which clearly satisfy (272); in words, \({\varvec{B}}_{s}\) is constructed by rounding up those entries of \({\varvec{A}}\) within a prescribed magnitude interval (a toy re-implementation of this peeling device is sketched after the proof). Thus, it suffices to bound \(\Vert {\varvec{B}}_{s}\Vert \) for every s. To this end, we start with \(s=k_{0}\) and use the definition of \({\varvec{B}}_{k_{0}}\) to get
$$\begin{aligned}&\left\| {\varvec{B}}_{k_{0}}\right\| \overset{\text {(i)}}{\le }\left\| {\varvec{B}}_{k_{0}}\right\| _{\infty }\sqrt{\left( 2np\right) ^{2}}\overset{\text {(ii)}}{\le }4np\frac{1}{\sqrt{\kappa \mu r}}\left\| {\varvec{\Delta }}\right\| _{2,\infty }\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }\\&\quad \overset{\text {(iii)}}{\le }4\sqrt{n}p\left\| {\varvec{\Delta }}\right\| _{2,\infty }\left\| {\varvec{X}}^{\star }\right\| , \end{aligned}$$
where (i) arises from Lemma 44, with 2np being a crude upper bound on the number of nonzero entries in each row and each column. This can be derived by applying the standard Chernoff bound on \(\Omega \). The second inequality (ii) relies on the definitions of \(\gamma \) and \(k_0\). The last one (iii) follows from the incoherence condition (114). Besides, for any \(0\le s\le k_{0}-1\), by construction one has
$$\begin{aligned} \left\| {\varvec{B}}_{s}\right\| _{2,\infty }^{2}\le 4\theta =8p\sigma _{\max }\left\| {\varvec{\Delta }}\right\| _{2,\infty }^{2}\qquad \text {and}\qquad \left\| {\varvec{B}}_{s}\right\| _{\infty }=\frac{1}{2^{s}}\gamma , \end{aligned}$$
where \(\theta \) is as defined in (269). Here, we have used the fact that the magnitude of each entry of \({\varvec{B}}_{s}\) is at most two times that of \({\varvec{A}}\). An immediate implication is that there are at most
$$\begin{aligned} \frac{\left\| {\varvec{B}}_{s}\right\| _{2,\infty }^{2}}{\left\| {\varvec{B}}_{s}\right\| _{\infty }^{2}}\le \frac{8p\sigma _{\max }\left\| {\varvec{\Delta }}\right\| _{2,\infty }^{2}}{\left( \frac{1}{2^{s}}\gamma \right) ^{2}}:=k_{\mathrm {r}} \end{aligned}$$
nonzero entries in each row of \({\varvec{B}}_{s}\) and at most
$$\begin{aligned} k_{\mathrm {c}}=2np \end{aligned}$$
nonzero entries in each column of \({\varvec{B}}_{s}\), where \(k_{\mathrm {c}}\) is derived from the standard Chernoff bound on \(\Omega \). Utilizing Lemma 44 once more, we discover that
$$\begin{aligned} \left\| {\varvec{B}}_{s}\right\|\le & {} \left\| {\varvec{B}}_{s}\right\| _{\infty }\sqrt{k_{\mathrm {r}}k_{\mathrm {c}}}=\frac{1}{2^{s}}\gamma \sqrt{k_{\mathrm {r}}k_{\mathrm {c}}}=\sqrt{16np^{2}\sigma _{\max }\left\| {\varvec{\Delta }}\right\| _{2,\infty }^{2}}\\&=4\sqrt{n}p\left\| {\varvec{\Delta }}\right\| _{2,\infty }\left\| {\varvec{X}}^{\star }\right\| \end{aligned}$$
for each \(0\le s \le k_{0}-1\). Combining all, we arrive at
$$\begin{aligned} \Vert {\varvec{A}}\Vert&\le \sum _{s=0}^{k_{0}-1}\left\| {\varvec{B}}_{s}\right\| +\left\| {\varvec{B}}_{k_{0}}\right\| \le \left( k_{0}+1\right) 4\sqrt{n}p\left\| {\varvec{\Delta }}\right\| _{2,\infty }\left\| {\varvec{X}}^{\star }\right\| \\&\le 2\sqrt{n}p\log \left( \kappa \mu r\right) \left\| {\varvec{\Delta }}\right\| _{2,\infty }\left\| {\varvec{X}}^{\star }\right\| \\&\le 2\sqrt{n}p\log n\left\| {\varvec{\Delta }}\right\| _{2,\infty }\left\| {\varvec{X}}^{\star }\right\| , \end{aligned}$$
where the last relation holds under the condition \(n\ge \kappa \mu r\). This further gives
$$\begin{aligned} \alpha _{2}&\le \frac{1}{p}\left\| {\varvec{A}}\right\| \le 2\sqrt{n}\log n\left\| {\varvec{\Delta }}\right\| _{2,\infty }\left\| {\varvec{X}}^{\star }\right\| . \end{aligned}$$ (273)
- (b)
In order to finish the proof of this part, we need to justify claim (269). Observe that
$$\begin{aligned} \left\| \left[ \mathcal {P}_{\Omega }\left( \left| {\varvec{\Delta }}{\varvec{X}}^{\star \top }\right| \right) \right] _{l,\cdot }\right\| _{2}^{2}&=\sum \nolimits _{j=1}^{n}\left( {\varvec{\Delta }}_{l,\cdot }{\varvec{X}}_{j,\cdot }^{\star \top }\delta _{l,j}\right) ^{2}\nonumber \\&={\varvec{\Delta }}_{l,\cdot }\left( \sum \nolimits _{j=1}^{n}\delta _{l,j}{\varvec{X}}_{j,\cdot }^{\star \top }{\varvec{X}}_{j,\cdot }^{\star }\right) {\varvec{\Delta }}_{l,\cdot }^{\top }\nonumber \\&\le \left\| {\varvec{\Delta }}\right\| _{2,\infty }^{2}\left\| \sum \nolimits _{j=1}^{n}\delta _{l,j}{\varvec{X}}_{j,\cdot }^{\star \top }{\varvec{X}}_{j,\cdot }^{\star }\right\| \end{aligned}$$ (274)
for every \(1\le l\le n\), where \(\delta _{l,j}\) indicates whether the entry with the index (l, j) is observed or not. Invoke Lemma 41 to yield
$$\begin{aligned} \left\| \sum \nolimits _{j=1}^{n}\delta _{l,j}{\varvec{X}}_{j,\cdot }^{\star \top }{\varvec{X}}_{j,\cdot }^{\star }\right\|&=\left\| \left[ \delta _{l,1}{\varvec{X}}_{1,\cdot }^{\star \top },\delta _{l,2}{\varvec{X}}_{2,\cdot }^{\star \top },\cdots ,\delta _{l,n}{\varvec{X}}_{n,\cdot }^{\star \top }\right] \right\| ^{2}\nonumber \\&\le p\sigma _{\max }+C\left( \sqrt{p\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }^{2}\left\| {\varvec{X}}^{\star }\right\| ^{2}\log n}+\left\| {\varvec{X}}^{\star }\right\| _{2,\infty }^{2}\log n\right) \nonumber \\&\le \left( p+C\sqrt{\frac{p\kappa \mu r\log n}{n}}+C\frac{\kappa \mu r\log n}{n}\right) \sigma _{\max }\nonumber \\&\le 2p\sigma _{\max }, \end{aligned}$$ (275)
with high probability, as soon as \(np\gg \kappa \mu r\log n\). Combining (274) and (275) yields
$$\begin{aligned} \left\| \left[ \mathcal {P}_{\Omega }\left( \left| {\varvec{\Delta }}{\varvec{X}}^{\star \top }\right| \right) \right] _{l,\cdot }\right\| _{2}^{2}&\le 2p\sigma _{\max }\left\| {\varvec{\Delta }}\right\| _{2,\infty }^{2},\qquad 1\le l\le n, \end{aligned}$$
as claimed in (269).
- 3.
Taken together, the preceding bounds (264), (268), and (273) yield
$$\begin{aligned} \left\| \frac{1}{p}\mathcal {P}_{\Omega }\left( {\varvec{X}}{\varvec{X}}^{\top }-{\varvec{X}}^{\star }{\varvec{X}}^{\star \top }\right) \right\|&\le \alpha _1 + 2\alpha _2 \le 2n\left\| {\varvec{\Delta }}\right\| _{2,\infty }^{2}+4\sqrt{n}\log n\left\| {\varvec{\Delta }}\right\| _{2,\infty }\left\| {\varvec{X}}^{\star }\right\| . \end{aligned}$$
The proof is completed by substituting the assumption \(\left\| {\varvec{\Delta }}\right\| _{2,\infty } \le \epsilon \left\| {\varvec{X}}^{\star }\right\| _{2,\infty }.\)\(\square \)
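As flagged in step 2(a), the following toy re-implementation (with made-up matrix sizes and sparsity) mirrors the dyadic peeling device: entries of a nonnegative matrix are rounded up within magnitude bands, so that the spectral norm is dominated by the sum of the band norms:

```python
import numpy as np

rng = np.random.default_rng(4)
# A sparse nonnegative matrix standing in for P_Omega(|Delta X*^T|).
A = np.abs(rng.standard_normal((50, 50))) * (rng.random((50, 50)) < 0.2)
gamma = A.max()                                # entrywise upper bound
k0 = 6                                         # number of dyadic bands (illustrative)
Bs = []
for s in range(k0):                            # band (gamma/2^{s+1}, gamma/2^s]
    mask = (A > gamma / 2 ** (s + 1)) & (A <= gamma / 2 ** s)
    Bs.append(mask * (gamma / 2 ** s))         # round entries up to the band top
Bs.append(((A > 0) & (A <= gamma / 2 ** k0)) * (gamma / 2 ** k0))  # tail band
# Entrywise domination of A implies ||A|| <= sum_s ||B_s|| for nonneg. matrices.
norm_sum = sum(np.linalg.norm(B, 2) for B in Bs)
print(np.linalg.norm(A, 2) <= norm_sum + 1e-12)   # True
```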
To conclude this subsection, we record a useful lemma that bounds the spectral norm of a sparse Bernoulli matrix.
Lemma 44
Let \({\varvec{A}}\in \left\{ 0,1\right\} ^{n_{1}\times n_{2}}\) be a binary matrix, and suppose that there are at most \(k_{\mathrm {r}}\) and \(k_{\mathrm {c}}\) nonzero entries in each row and column of \({\varvec{A}}\), respectively. Then one has \(\left\| {\varvec{A}}\right\| \le \sqrt{k_{\mathrm {c}}k_{\mathrm {r}}}\).
Proof
This immediately follows from the elementary inequality \(\Vert {\varvec{A}} \Vert ^2 \le \Vert {\varvec{A}} \Vert _{1\rightarrow 1} \Vert {\varvec{A}} \Vert _{\infty \rightarrow \infty }\) (see [56, equation (1.11)]), where \(\Vert {\varvec{A}} \Vert _{1\rightarrow 1} \) and \(\Vert {\varvec{A}} \Vert _{\infty \rightarrow \infty } \) are the induced 1-norm (or maximum absolute column sum norm) and the induced \(\infty \)-norm (or maximum absolute row sum norm), respectively.\(\square \)
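A one-line check of Lemma 44 on a random sparse binary matrix (sizes and density are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(5)
A = (rng.random((200, 300)) < 0.05).astype(float)   # sparse 0/1 matrix
k_r = int(A.sum(axis=1).max())                      # max nonzeros per row
k_c = int(A.sum(axis=0).max())                      # max nonzeros per column
print(np.linalg.norm(A, 2) <= np.sqrt(k_r * k_c))   # True, as Lemma 44 guarantees
```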
1.2.3 Matrix Perturbation Bounds
Lemma 45
Let \({\varvec{M}}\in \mathbb {R}^{n\times n}\) be a symmetric matrix with the top-r eigendecomposition \({\varvec{U}}{\varvec{\Sigma }}{\varvec{U}}^{\top }\). Assume \(\left\| {\varvec{M}}-{\varvec{M}}^{\star }\right\| \le \sigma _{\min }/2\), and denote
Then there is some numerical constant \(c_{3}>0\) such that
Proof
Define \({\varvec{Q}}={\varvec{U}}^{\top }{\varvec{U}}^{\star }\). The triangle inequality gives
[1, Lemma 3] asserts that
as long as \(\left\| {\varvec{M}}-{\varvec{M}}^{\star }\right\| \le \sigma _{\min }/2\). For the remaining term in (276), one can use \({\varvec{U}}^{\star \top }{\varvec{U}}^{\star }={\varvec{I}}_{r}\) to obtain
which together with the Davis–Kahan sin\(\Theta \) theorem [39] reveals that
for some constant \(c_{2}>0\). Combine the estimates on \(\big \Vert \widehat{{\varvec{Q}}}-{\varvec{Q}}\big \Vert \), \(\left\| {\varvec{U}}{\varvec{U}}^{\top }{\varvec{U}}^{\star }-{\varvec{U}}^{\star }\right\| \) and (276) to reach
for some numerical constant \(c_{3}>0\), where we have utilized the fact that \(\left\| {\varvec{M}}-{\varvec{M}}^{\star }\right\| /\sigma _{\min }\le 1/2\). \(\square \)
Lemma 46
Let \({\varvec{M}},\widetilde{{\varvec{M}}}\in \mathbb {R}^{n\times n}\) be two symmetric matrices with top-r eigendecompositions \({\varvec{U}}{\varvec{\Sigma }}{\varvec{U}}^{\top }\) and \(\widetilde{{\varvec{U}}}\widetilde{{\varvec{\Sigma }}}\widetilde{{\varvec{U}}}^{\top }\), respectively. Assume \(\left\| {\varvec{M}}-{\varvec{M}}^{\star }\right\| \le \sigma _{\min }/4\) and \(\big \Vert \widetilde{{\varvec{M}}}-{\varvec{M}}^{\star }\big \Vert \le \sigma _{\min }/4\), and suppose \(\sigma _{\max }/\sigma _{\min }\) is bounded by some constant \(c_{1}>0\), with \(\sigma _{\max }\) and \(\sigma _{\min }\) the largest and the smallest singular values of \({\varvec{M}}^{\star }\), respectively. If we denote
then there exists some numerical constant \(c_{3}>0\) such that
Proof
Here, we focus on the Frobenius norm; the bound on the operator norm follows from the same argument, and hence we omit the proof. Since \(\left\| \cdot \right\| _{\mathrm {F}}\) is unitarily invariant, we have
where \({\varvec{Q}}^{\top }{\varvec{\Sigma }}^{1/2}{\varvec{Q}}\) and \(\widetilde{{\varvec{\Sigma }}}^{1/2}\) are the matrix square roots of \({\varvec{Q}}^{\top }{\varvec{\Sigma }}{\varvec{Q}}\) and \(\widetilde{{\varvec{\Sigma }}}\), respectively. In view of the matrix square root perturbation bound [97, Lemma 2.1],
where the last inequality follows from the lower estimates
and, similarly, \(\sigma _{\min }(\widetilde{{\varvec{\Sigma }}})\ge \sigma _{\min } /4\). Recognizing that \({\varvec{\Sigma }}={\varvec{U}}^{\top }{\varvec{M}}{\varvec{U}}\) and \(\widetilde{{\varvec{\Sigma }}}=\widetilde{{\varvec{U}}}^{\top }\widetilde{{\varvec{M}}}\widetilde{{\varvec{U}}}\), one gets
where the last relation holds due to the upper estimate
Invoke the Davis–Kahan sin\(\Theta \) theorem [39] to obtain
for some constant \(c_{2}>0\), where the last inequality follows from the bounds
Combine (277), (278), (279), and the fact \(\sigma _{\max }/\sigma _{\min }\le c_{1}\) to reach
for some constant \(c_{3}>0\). \(\square \)
Lemma 47
Let \({\varvec{M}}\in \mathbb {R}^{n\times n}\) be a symmetric matrix with the top-r eigendecomposition \({\varvec{U}}{\varvec{\Sigma }}{\varvec{U}}^{\top }\). Denote \({\varvec{X}}={\varvec{U}}{\varvec{\Sigma }}^{1/2}\) and \({\varvec{X}}^{\star }={\varvec{U}}^{\star }({\varvec{\Sigma }}^{\star })^{1/2}\), and define
Assume \(\left\| {\varvec{M}}-{\varvec{M}}^{\star }\right\| \le \sigma _{\min }/2\), and suppose \(\sigma _{\max }/\sigma _{\min }\) is bounded by some constant \(c_{1}>0\). Then there exists a numerical constant \(c_{3}>0\) such that
Proof
We first collect several useful facts about the spectrum of \({\varvec{\Sigma }}\). Weyl’s inequality tells us that \(\left\| {\varvec{\Sigma }}-{\varvec{\Sigma }}^{\star }\right\| \le \left\| {\varvec{M}}-{\varvec{M}}^{\star }\right\| \le \sigma _{\min }/2\), which further implies that
Denote
Simple algebra yields
It can be easily seen that \(\sigma _{r-1}\left( {\varvec{A}}\right) \ge \sigma _{r}\left( {\varvec{A}}\right) \ge \sigma _{\min }/2\), and
which can be controlled as follows.
Regarding \(\alpha \), use [1, Lemma 3] to reach
$$\begin{aligned} \alpha =\big \Vert {\varvec{Q}}-\widehat{{\varvec{Q}}}\big \Vert \le 4\left\| {\varvec{M}}-{\varvec{M}}^{\star }\right\| ^{2}/\sigma _{\min }^{2}. \end{aligned}$$
For \(\beta \), one has
$$\begin{aligned} \beta&\overset{\left( \text {i}\right) }{=}\left\| \widehat{{\varvec{Q}}}^{\top }{\varvec{\Sigma }}^{1/2}\widehat{{\varvec{Q}}}-{\varvec{\Sigma }}^{1/2}\right\| \overset{\left( \text {ii}\right) }{\le }\frac{1}{2\sigma _{r}\left( {\varvec{\Sigma }}^{1/2}\right) }\left\| \widehat{{\varvec{Q}}}^{\top }{\varvec{\Sigma }}\widehat{{\varvec{Q}}}-{\varvec{\Sigma }}\right\| \overset{\left( \text {iii}\right) }{=}\frac{1}{2\sigma _{r}\left( {\varvec{\Sigma }}^{1/2}\right) }\left\| {\varvec{\Sigma }}\widehat{{\varvec{Q}}}-\widehat{{\varvec{Q}}}{\varvec{\Sigma }}\right\| , \end{aligned}$$
where (i) and (iii) come from the unitary invariance of \(\left\| \cdot \right\| \) and (ii) follows from the matrix square root perturbation bound [97, Lemma 2.1]. We can further take the triangle inequality to obtain
$$\begin{aligned} \left\| {\varvec{\Sigma }}\widehat{{\varvec{Q}}}-\widehat{{\varvec{Q}}}{\varvec{\Sigma }}\right\|&= \left\| {\varvec{\Sigma }}{\varvec{Q}}-{\varvec{Q}}{\varvec{\Sigma }} + {\varvec{\Sigma }}(\widehat{{\varvec{Q}}} - {\varvec{Q}}) - (\widehat{{\varvec{Q}}}-{\varvec{Q}}){\varvec{\Sigma }} \right\| \\&\le \left\| {\varvec{\Sigma }}{\varvec{Q}}-{\varvec{Q}}{\varvec{\Sigma }}\right\| +2\left\| {\varvec{\Sigma }}\right\| \big \Vert {\varvec{Q}}-\widehat{{\varvec{Q}}}\big \Vert \\&=\left\| {\varvec{U}}\left( {\varvec{M}}-{\varvec{M}}^{\star }\right) {\varvec{U}}^{\star \top }+{\varvec{Q}}\left( {\varvec{\Sigma }}^{\star }-{\varvec{\Sigma }}\right) \right\| +2\left\| {\varvec{\Sigma }}\right\| \big \Vert {\varvec{Q}}-\widehat{{\varvec{Q}}}\big \Vert \\&\le \left\| {\varvec{U}}\left( {\varvec{M}}-{\varvec{M}}^{\star }\right) {\varvec{U}}^{\star \top }\big \Vert +\big \Vert {\varvec{Q}}\left( {\varvec{\Sigma }}^{\star }-{\varvec{\Sigma }}\right) \right\| +2\left\| {\varvec{\Sigma }}\right\| \big \Vert {\varvec{Q}}-\widehat{{\varvec{Q}}}\big \Vert \\&\le 2\left\| {\varvec{M}}-{\varvec{M}}^{\star }\right\| +4\sigma _{\max }\alpha , \end{aligned}$$
where the last inequality uses Weyl’s inequality \(\Vert {\varvec{\Sigma }}^{\star }-{\varvec{\Sigma }}\Vert \le \Vert {\varvec{M}}-{\varvec{M}}^{\star }\Vert \) and the fact that \(\left\| {\varvec{\Sigma }}\right\| \le 2\sigma _{\max }\).
Rearrange the previous bounds to arrive at
$$\begin{aligned} \left\| {\varvec{E}}\right\| \le 2\sigma _{\max }\alpha +\sqrt{\sigma _{\max }}\frac{1}{\sqrt{\sigma _{\min }}}\left( 2\left\| {\varvec{M}}-{\varvec{M}}^{\star }\right\| +4\sigma _{\max }\alpha \right) \le c_{2}\left\| {\varvec{M}}-{\varvec{M}}^{\star }\right\| \end{aligned}$$
for some numerical constant \(c_{2}>0\), where we have used the assumption that \(\sigma _{\max } / \sigma _{\min }\) is bounded.
Recognizing that \(\widehat{{\varvec{Q}}}=\text {sgn}\left( {\varvec{A}}\right) \) (see definition in (177)), we are ready to invoke Lemma 36 to deduce that
for some constant \(c_{3}>0\). \(\square \)
1.3 Technical Lemmas for Blind Deconvolution
1.3.1 Wirtinger Calculus
In this section, we formally prove the fundamental theorem of calculus and the mean value form of Taylor’s theorem under the Wirtinger calculus; see (283) and (284), respectively.
Let \(f:\mathbb {C}^{n}\rightarrow \mathbb {R}\) be a real-valued function. Denote \({\varvec{z}}={\varvec{x}}+i{\varvec{y}}\in \mathbb {C}^{n}\), and then \(f\left( \cdot \right) \) can alternatively be viewed as a function \(\mathbb {R}^{2n}\rightarrow \mathbb {R}\). There is a one-to-one mapping connecting the Wirtinger derivatives and the conventional derivatives [69]:
where the subscripts \(\mathbb {R}\) and \(\mathbb {C}\) represent calculus in the real (conventional) sense and in the complex (Wirtinger) sense, respectively, and
With these relationships in place, we are ready to verify the fundamental theorem of calculus using the Wirtinger derivatives. Recall from [70, Chapter XIII, Theorem 4.2] that
where
Substitute identities (280) into (281) to arrive at
where \({\varvec{z}}_{1}={\varvec{x}}_{1}+i{\varvec{y}}_{1}\), \({\varvec{z}}_{2}={\varvec{x}}_{2}+i{\varvec{y}}_{2}\) and
Simplification of (282) gives
Repeating the above arguments, one can also show that
where \(\widetilde{{\varvec{z}}}\) is some point lying on the vector connecting \({\varvec{z}}_{1}\) and \({\varvec{z}}_{2}\). This is the mean value form of Taylor’s theorem under the Wirtinger calculus.
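For readers who want a quick numerical confirmation of the identities (280), the snippet below applies them to the toy non-holomorphic function \(f({\varvec{z}})=|{\varvec{a}}^{\textsf {H} }{\varvec{z}}|^{2}\) (our choice, not one appearing in the paper) and compares the closed-form conjugate Wirtinger gradient against real finite differences:

```python
import numpy as np

rng = np.random.default_rng(6)
K = 4
a = rng.standard_normal(K) + 1j * rng.standard_normal(K)
z = rng.standard_normal(K) + 1j * rng.standard_normal(K)

f = lambda z: np.abs(np.vdot(a, z)) ** 2      # real-valued, non-holomorphic
grad = np.vdot(a, z) * a                      # closed form: df/d(z-bar) = (a^H z) a

eps = 1e-6
num = np.zeros(K, dtype=complex)
for k in range(K):
    e = np.zeros(K); e[k] = eps
    dfdx = (f(z + e) - f(z - e)) / (2 * eps)           # d f / d x_k
    dfdy = (f(z + 1j * e) - f(z - 1j * e)) / (2 * eps) # d f / d y_k
    num[k] = (dfdx + 1j * dfdy) / 2   # conjugate Wirtinger derivative relation
print(np.allclose(num, grad, atol=1e-5))      # True
```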
1.3.2 Discrete Fourier Transform Matrices
Let \({\varvec{B}}\in \mathbb {C}^{m\times K}\) be the first K columns of a discrete Fourier transform (DFT) matrix \({\varvec{F}}\in \mathbb {C}^{m\times m}\), and denote by \({\varvec{b}}_{l}\) the lth column of the matrix \({\varvec{B}}^{\textsf {H} }\). By definition,
where \(\omega :=e^{-i\frac{2\pi }{m}}\) with i representing the imaginary unit. It is seen that for any \(j\ne l\),
Here, (i) uses \(\overline{\omega ^{\alpha }}=\omega ^{-\alpha }\) for all \(\alpha \in \mathbb {R}\), while the last identity (ii) follows from the formula for the sum of a finite geometric series when \(\omega ^{l-j} \ne 1\). This leads to the following lemma.
Lemma 48
For any \(m\ge 3\) and any \(1\le l\le m\), we have
Proof
We first make use of identity (285) to obtain
where the last identity follows since \(\left\| {\varvec{b}}_{l}\right\| _{2}^{2}={K} / {m}\) and, for all \(\alpha \in \mathbb {R}\),
Without loss of generality, we focus on the case when \(l=1\) in the sequel. Recall that for \(c>0\), we denote by \(\left\lfloor c\right\rfloor \) the largest integer that does not exceed c. We can continue the derivation to get
where (i) follows from \(\left| \sin \left( K\left( 1-j\right) \frac{\pi }{m}\right) \right| \le 1\) and \(\left| \sin \left( x\right) \right| =\left| \sin \left( -x\right) \right| \), and (ii) relies on the fact that \(\sin \left( x\right) =\sin \left( \pi -x\right) \). The property that \(\sin \left( x\right) \ge x/2\) for any \(x\in \left[ 0, {\pi }/2\right] \) allows one to further derive
where in (i) we extend the range of the summation, (ii) uses the elementary inequality \(\sum _{k=1}^{m}k^{-1}\le 1+\log m\), and (iii) holds true as long as \(m\ge 3\). \(\square \)
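A short numerical illustration of Lemma 48 (the values of m and K below are our own choices): with \({\varvec{B}}\) the first K columns of an \(m\times m\) DFT matrix normalized so that \(\Vert {\varvec{b}}_{l}\Vert _{2}^{2}=K/m\), the row sums \(\sum _{j}|{\varvec{b}}_{l}^{\textsf {H} }{\varvec{b}}_{j}|\) indeed stay well below \(4\log m\):

```python
import numpy as np

m, K = 1024, 16
F = np.exp(-2j * np.pi * np.outer(np.arange(m), np.arange(m)) / m) / np.sqrt(m)
B = F[:, :K]                        # first K columns; b_l^H is the l-th row of B
G = np.abs(B @ B.conj().T)          # |b_l^H b_j| for all pairs (l, j)
row_sums = G.sum(axis=1)
print(f"max_l sum_j |b_l^H b_j| = {row_sums.max():.3f}, "
      f"4 log m = {4 * np.log(m):.3f}")
```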
The next lemma considers the difference of two inner products, namely \(\left( {\varvec{b}}_{l}-{\varvec{b}}_{1}\right) ^{\textsf {H} }{\varvec{b}}_{j}\).
Lemma 49
For all \(0\le l-1\le \tau \le \left\lfloor \frac{m}{10}\right\rfloor \), we have
In addition, for any j and l, the following uniform upper bound holds
Proof
Given (285), we can obtain for \(j\ne l\) and \(j\ne 1\),
where the last line is due to the triangle inequality and \(\left| \omega ^{\alpha }\right| =1\) for all \(\alpha \in \mathbb {R}\). Identity (286) allows us to rewrite this bound as
Combined with the fact that \(\left| \sin x\right| \le 2\left| x\right| \) for all \(x \in \mathbb {R}\), we can upper bound (287) as
where we also utilize the assumption \(0\le l-1\le \tau \). Then for \(l+\tau \le j\le \left\lfloor {m}/{2}\right\rfloor +1\), one has
Therefore, utilizing the property \(\sin \left( x\right) \ge x/2\) for any \(x\in \left[ 0, \pi /2\right] \), we arrive at
where the last inequality holds since \(j-1 > j-l\). Similarly, we can obtain the upper bound for \(\left\lfloor {m}/{2}\right\rfloor +l\le j\le m-\tau \) using a nearly identical argument (which is omitted for brevity).
The uniform upper bound can be justified as follows
The last relation holds since \(\left\| {\varvec{b}}_{l}\right\| _{2}^{2}= K/m\) for all \(1\le l\le m\). \(\square \)
Next, we list two consequences of the above estimates in Lemmas 50 and 51.
Lemma 50
Fix any constant \(c>0\) that is independent of m and K. Suppose \(m\ge C\tau K\log ^{4}m\) for some sufficiently large constant \(C>0\), which solely depends on c. If \(0\le l-1\le \tau \), then one has
Proof
For some constant \(c_{0}>0\), we can split the index set \(\left[ m\right] \) into the following three disjoint sets
With this decomposition in place, we can write
We first look at \(\mathcal {A}_{1}\). By Lemma 49, one has for any \(j\in \mathcal {A}_{1}\),
and hence
where the last inequality arises from \(\sum _{k=1}^{m} {k}^{-1} \le 1+\log m \le 2\log m\) and \(\sum _{k=c}^{m} {k^{-2}}\le 2/{c}\).
Similarly, for \(j\in \mathcal {A}_{2}\), we have
which in turn implies
Regarding \(j\in \mathcal {A}_{3}\), we observe that
This together with the simple bound \(\left| \left( {\varvec{b}}_{l}-{\varvec{b}}_{1}\right) ^{\textsf {H} }{\varvec{b}}_{j}\right| \le 2{K}/{m}\) gives
The previous three estimates taken collectively yield
as long as \(c_{0}\ge ({32}/ {\pi })\cdot ({1}/ {c})\) and \(m\ge {8 c_0} \tau K\log ^{4}m/c\). \(\square \)
Lemma 51
Fix any constant \(c>0\) that is independent of m and K. Consider an integer \(\tau >0\), and suppose that \(m\ge C\tau K\log m\) for some large constant \(C>0\), which depends solely on c. Then we have
Proof
The proof strategy is similar to the one used in Lemma 50. First, notice that
As before, for some \(c_{1}>0\), we can split the index set \(\left\{ 1,\cdots ,\left\lfloor {m}/{\tau }\right\rfloor \right\} \) into three disjoint sets
where \(1\le j\le \tau \).
By Lemma 49, one has
Hence, for any \(k\in \mathcal {B}_{1}\),
which further implies that
where the last inequality follows since \(\sum _{k=1}^m k^{-1}\le 2\log m\) and \(\sum _{k=c_1}^{m}k^{-2}\le 2/c_1\). A similar bound can be obtained for \(k\in \mathcal {B}_{2}\).
For the remaining set \(\mathcal {B}_{3}\), observe that
This together with the crude upper bound \(\left| \left( {\varvec{b}}_{l}-{\varvec{b}}_{1}\right) ^{\textsf {H} }{\varvec{b}}_{j}\right| \le 2{K}/{m}\) gives
The previous estimates taken collectively yield
as long as \(c_{1} \gg {1}/{c}\) and \(m/(c_{1}\tau K\log m) \gg {1}/{c}\). \(\square \)
1.3.3 Complex-Valued Alignment
Let \(g_{{\varvec{h}},{\varvec{x}}}\left( \cdot \right) :\mathbb {C}\rightarrow \mathbb {R}\) be a real-valued function defined as
which is the key function in definition (34). Therefore, the alignment parameter of \(({\varvec{h}},{\varvec{x}})\) to \(({\varvec{h}}^{\star },{\varvec{x}}^{\star })\) is the minimizer of \(g_{{\varvec{h}},{\varvec{x}}}\left( \alpha \right) \). This section is devoted to studying various properties of \(g_{{\varvec{h}},{\varvec{x}}}\left( \cdot \right) \). To begin with, the Wirtinger gradient and Hessian of \(g_{{\varvec{h}},{\varvec{x}}}\left( \cdot \right) \) can be calculated as
The first lemma reveals that, as long as \(\left( \frac{1}{\overline{\beta }}{\varvec{h}},\beta {\varvec{x}}\right) \) is sufficiently close to \(({\varvec{h}}^{\star },{\varvec{x}}^{\star })\), the minimizer of \(g_{{\varvec{h}},{\varvec{x}}}\left( \alpha \right) \) cannot be far away from \(\beta \).
Lemma 52
Assume there exists \(\beta \in \mathbb {C}\) with \({1}/ {2}\le \left| \beta \right| \le {3}/{2}\) such that \(\max \left\{ \left\| \frac{1}{\overline{\beta }}{\varvec{h}}-{\varvec{h}}^{\star }\right\| _{2},\left\| \beta {\varvec{x}}-{\varvec{x}}^{\star }\right\| _{2}\right\} \le \delta \le {1}/{4}\). Denote by \(\widehat{\alpha }\) the minimizer of \(g_{{\varvec{h}},{\varvec{x}}}\left( \alpha \right) \), and then we necessarily have
Proof
The first inequality is a direct consequence of the triangle inequality. Hence, we concentrate on the second one. Notice that by assumption,
which immediately implies that \(g_{{\varvec{h}},{\varvec{x}}}\left( \widehat{\alpha }\right) \le 2\delta ^{2}\). It thus suffices to show that for any \(\alpha \) obeying \(\left| \alpha -\beta \right| >18\delta \), one has \(g_{{\varvec{h}},{\varvec{x}}}\left( \alpha \right) >2\delta ^{2}\), and hence, it cannot be the minimizer. To this end, we lower bound \(g_{{\varvec{h}},{\varvec{x}}}\left( \alpha \right) \) as follows:
Given that \(\left\| \beta {\varvec{x}}-{\varvec{x}}^{\star }\right\| _{2}\le \delta \le {1}/{4}\) and \(\left\| {\varvec{x}}^{\star }\right\| _{2}=1\), we have
which together with the fact that \({1}/{2}\le \left| \beta \right| \le {3}/{2}\) implies
and
Taking the previous estimates collectively yields
It is self-evident that once \(\left| \alpha -\beta \right| >18\delta ,\) one gets \( g_{{\varvec{h}},{\varvec{x}}}\left( \alpha \right) >2\delta ^{2}, \) and hence, \(\alpha \) cannot be the minimizer as \(g_{{\varvec{h}},{\varvec{x}}}\left( \alpha \right) >g_{{\varvec{h}},{\varvec{x}}}\left( \beta \right) \) according to (290). This concludes the proof. \(\square \)
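To accompany Lemma 52, the sketch below (dimensions, \(\delta \), and \(\beta \) are illustrative assumptions) plants a pair \(({\varvec{h}},{\varvec{x}})\) at distance exactly \(\delta \) from \(({\varvec{h}}^{\star },{\varvec{x}}^{\star })\) after alignment by \(\beta \), minimizes \(g_{{\varvec{h}},{\varvec{x}}}(\alpha )\) by a dense grid search over the annulus \(0.5\le |\alpha |\le 1.5\), and checks that the minimizer lands within \(18\delta \) of \(\beta \):

```python
import numpy as np

rng = np.random.default_rng(7)
K, delta = 8, 0.05
beta = np.exp(1j * 0.3)                            # |beta| = 1
unit = lambda v: v / np.linalg.norm(v)
h_star = unit(rng.standard_normal(K) + 1j * rng.standard_normal(K))
x_star = unit(rng.standard_normal(K) + 1j * rng.standard_normal(K))
e_h = unit(rng.standard_normal(K) + 1j * rng.standard_normal(K))
e_x = unit(rng.standard_normal(K) + 1j * rng.standard_normal(K))
h = np.conj(beta) * (h_star + delta * e_h)         # ||h/conj(beta) - h*|| = delta
x = (x_star + delta * e_x) / beta                  # ||beta x - x*|| = delta

def g(alpha):                                      # the objective g_{h,x}(alpha)
    return (np.linalg.norm(h / np.conj(alpha) - h_star) ** 2
            + np.linalg.norm(alpha * x - x_star) ** 2)

grid = [r * np.exp(1j * t)
        for r in np.linspace(0.5, 1.5, 201)
        for t in np.linspace(-np.pi, np.pi, 629)]
alpha_hat = min(grid, key=g)
print(f"|alpha_hat - beta| = {abs(alpha_hat - beta):.3f} "
      f"vs 18 * delta = {18 * delta}")
```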
The next lemma reveals the local strong convexity of \(g_{{\varvec{h}},{\varvec{x}}}\left( \alpha \right) \) when \(\alpha \) is close to one.
Lemma 53
Assume that \(\max \left\{ \left\| {\varvec{h}}-{\varvec{h}}^{\star }\right\| _{2},\left\| {\varvec{x}}-{\varvec{x}}^{\star }\right\| _{2}\right\} \le \delta \) for some sufficiently small constant \(\delta >0\). Then, for any \(\alpha \) satisfying \(\left| \alpha -1\right| \le 18\delta \) and any \(u,v\in \mathbb {C}\), one has
where \(\nabla ^{2}g_{{\varvec{h}},{\varvec{x}}}\left( \cdot \right) \) stands for the Wirtinger Hessian of \(g_{{\varvec{h}},{\varvec{x}}}(\cdot )\).
Proof
For simplicity of presentation, we use \(g_{{\varvec{h}},{\varvec{x}}}\left( \alpha ,\overline{\alpha }\right) \) and \(g_{{\varvec{h}},{\varvec{x}}}\left( \alpha \right) \) interchangeably. By (289), for any \(u,v\in \mathbb {C}\), one has
We would like to demonstrate that this is at least on the order of \(\left| u\right| ^{2}+\left| v\right| ^{2}\). We first develop a lower bound on \(\beta _{1}\). Given the assumption that \(\max \left\{ \left\| {\varvec{h}}-{\varvec{h}}^{\star }\right\| _{2},\left\| {\varvec{x}}-{\varvec{x}}^{\star }\right\| _{2}\right\} \le \delta \), one necessarily has
Thus, for any \(\alpha \) obeying \(|\alpha -1|\le 18\delta \), one has
as long as \(\delta >0\) is sufficiently small. Regarding the second term \(\beta _{2}\), we utilize the conditions \(\left| \alpha -1\right| \le 18\delta \), \(\Vert {\varvec{x}}\Vert _{2}\le 1+\delta \) and \(\Vert {\varvec{h}}\Vert _{2}\le 1+\delta \) to get
where the last relation holds since \(2\left| u\right| \left| v\right| \le \left| u\right| ^{2}+\left| v\right| ^{2}\) and \(\delta > 0\) is sufficiently small. Combining the previous bounds on \(\beta _{1}\) and \(\beta _{2}\), we arrive at
as long as \(\delta \) is sufficiently small. This completes the proof. \(\square \)
Additionally, in a local region surrounding the optimizer, the alignment parameter is Lipschitz continuous; namely, the difference of the alignment parameters associated with two distinct vector pairs is at most proportional to the \(\ell _2\) distance between the two vector pairs involved, as demonstrated below.
Lemma 54
Suppose that the vectors \({\varvec{x}}_{1},{\varvec{x}}_{2},{\varvec{h}}_{1},{\varvec{h}}_{2}\in \mathbb {C}^{K}\) satisfy
for some sufficiently small constant \(\delta >0\). Denote by \(\alpha _{1}\) and \(\alpha _{2}\) the minimizers of \(g_{{\varvec{h}}_{1},{\varvec{x}}_{1}}\left( \alpha \right) \) and \(g_{{\varvec{h}}_{2},{\varvec{x}}_{2}}\left( \alpha \right) \), respectively. Then we have
Proof
Since \(\alpha _{1}\) minimizes \(g_{{\varvec{h}}_{1},{\varvec{x}}_{1}}\left( \alpha \right) \), the mean value form of Taylor’s theorem (see Appendix D.3.1) gives
where \(\widetilde{\alpha }\) is some complex number lying between \(\alpha _{1}\) and \(\alpha _{2}\), and \(\nabla g_{{\varvec{h}}_{1},{\varvec{x}}_{1}}\) and \(\nabla ^{2}g_{{\varvec{h}}_{1},{\varvec{x}}_{1}}\) are the Wirtinger gradient and Hessian of \(g_{{\varvec{h}}_{1},{\varvec{x}}_{1}}\left( \cdot \right) \), respectively. Rearrange the previous inequality to obtain
as long as \(\lambda _{\min }\left( \nabla ^{2}g_{{\varvec{h}}_{1},{\varvec{x}}_{1}}\left( \widetilde{\alpha }\right) \right) >0\). This calls for evaluation of the Wirtinger gradient and Hessian of \(g_{{\varvec{h}}_{1},{\varvec{x}}_{1}}\left( \cdot \right) \).
Regarding the Wirtinger Hessian, by assumption (291), we can invoke Lemma 52 with \(\beta =1\) to reach \(\max \left\{ \left| \alpha _{1}-1\right| ,\left| \alpha _{2}-1\right| \right\} \le 18\delta \). This together with Lemma 53 implies
since \(\widetilde{\alpha }\) lies between \(\alpha _{1}\) and \(\alpha _{2}\).
For the Wirtinger gradient, since \(\alpha _{2}\) is the minimizer of \(g_{{\varvec{h}}_{2},{\varvec{x}}_{2}}\left( \alpha \right) \), the first-order optimality condition [69, equation (38)] requires \(\nabla g_{{\varvec{h}}_{2},{\varvec{x}}_{2}}\left( \alpha _{2}\right) ={\varvec{0}}\), which gives
Plug in the gradient expression (288) to reach
where the last line follows from the triangle inequality. It is straightforward to see that
under condition (291) and assumption \(\Vert {\varvec{x}}^{\star }\Vert _2=\Vert {\varvec{h}}^{\star }\Vert _2=1\), where the first inequality follows from Lemma 52. Taking these estimates together reveals that
The proof is accomplished by substituting the two bounds on the gradient and the Hessian into (292). \(\square \)
Further, if two vector pairs are both close to the optimizer, then their distance after alignment (w.r.t. the optimizer) cannot be much larger than their distance without alignment, as revealed by the following lemma.
Lemma 55
Suppose that the vectors \({\varvec{x}}_{1},{\varvec{x}}_{2},{\varvec{h}}_{1},{\varvec{h}}_{2}\in \mathbb {C}^{K}\) satisfy
for some sufficiently small constant \(\delta >0\). Denote by \(\alpha _{1}\) and \(\alpha _{2}\) the minimizers of \(g_{{\varvec{h}}_{1},{\varvec{x}}_{1}}\left( \alpha \right) \) and \(g_{{\varvec{h}}_{2},{\varvec{x}}_{2}}\left( \alpha \right) \), respectively. Then we have
Proof
To start with, we control the magnitudes of \(\alpha _{1}\) and \(\alpha _{2}\). Lemma 52 together with assumption (293) guarantees that
Equipped with these bounds, we can prove the lemma. The triangle inequality gives
where (i) holds since \(\left| \alpha _{1}\right| \le 2\) and \(\Vert {\varvec{x}}_{2}\Vert _{2}\le 1+\delta \le 2\), and (ii) arises from Lemma 54 that \(\left| \alpha _{1}-\alpha _{2}\right| \lesssim \left\| {\varvec{x}}_{1}-{\varvec{x}}_{2}\right\| _{2}+\left\| {\varvec{h}}_{1}-{\varvec{h}}_{2}\right\| _{2}\). Similarly,
where the last inequality comes from Lemma 54 as well as the facts that \(|\alpha _{1}|\ge 1/2\) and \(|\alpha _{2}|\ge 1/2\) as shown above. Combining all of the above bounds and recognizing that \(\left\| {\varvec{x}}_{1}-{\varvec{x}}_{2}\right\| _{2}+\left\| {\varvec{h}}_{1}-{\varvec{h}}_{2}\right\| _{2}\le \sqrt{2\left\| {\varvec{x}}_{1}-{\varvec{x}}_{2}\right\| _{2}^{2}+2\left\| {\varvec{h}}_{1}-{\varvec{h}}_{2}\right\| _{2}^{2}}\), we conclude the proof. \(\square \)
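For concreteness, the first chain of inequalities above plausibly takes the following form, a sketch consistent with the stated justifications (i) and (ii):
\[
\left\| \alpha _{1}{\varvec{x}}_{1}-\alpha _{2}{\varvec{x}}_{2}\right\| _{2}\le \left| \alpha _{1}\right| \left\| {\varvec{x}}_{1}-{\varvec{x}}_{2}\right\| _{2}+\left| \alpha _{1}-\alpha _{2}\right| \left\| {\varvec{x}}_{2}\right\| _{2}\overset{(\text {i})}{\le }2\left\| {\varvec{x}}_{1}-{\varvec{x}}_{2}\right\| _{2}+2\left| \alpha _{1}-\alpha _{2}\right| \overset{(\text {ii})}{\lesssim }\left\| {\varvec{x}}_{1}-{\varvec{x}}_{2}\right\| _{2}+\left\| {\varvec{h}}_{1}-{\varvec{h}}_{2}\right\| _{2},
\]
with the analogous chain for \(\frac{1}{\overline{\alpha _{1}}}{\varvec{h}}_{1}-\frac{1}{\overline{\alpha _{2}}}{\varvec{h}}_{2}\) relying instead on the lower bounds \(\left| \alpha _{1}\right| ,\left| \alpha _{2}\right| \ge 1/2\).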
Finally, there is a useful identity associated with the minimizer of \(\widetilde{g}(\alpha )\) as defined below.
Lemma 56
For any \({\varvec{h}}_{1},{\varvec{h}}_{2},{\varvec{x}}_{1},{\varvec{x}}_{2}\in \mathbb {C}^{K}\), denote
Let \(\widetilde{{\varvec{x}}}_{1}=\alpha ^{\sharp }{\varvec{x}}_{1}\) and \(\widetilde{{\varvec{h}}}_{1}=\frac{1}{\overline{\alpha ^{\sharp }}}{\varvec{h}}_{1}\). Then we have
Proof
We can rewrite the function \(\widetilde{g}\left( \alpha \right) \) as
The first-order optimality condition [69, equation (38)] requires
which further simplifies to
since \(\widetilde{{\varvec{x}}}_{1}=\alpha ^{\sharp }{\varvec{x}}_{1}\), \(\widetilde{{\varvec{h}}}_{1}=\frac{1}{\overline{\alpha ^{\sharp }}}{\varvec{h}}_{1}\), and \(\alpha ^{\sharp }\ne 0\) (otherwise \(\widetilde{g}(\alpha ^{\sharp })=\infty \) and cannot be the minimizer). Furthermore, this condition is equivalent to
Recognizing that
we arrive at the desired identity. \(\square \)
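Since the displayed equations are not reproduced above, the following sketch indicates how such an identity can arise, under the natural reading \(\widetilde{g}\left( \alpha \right) =\left\| \alpha {\varvec{x}}_{1}-{\varvec{x}}_{2}\right\| _{2}^{2}+\big \| \frac{1}{\overline{\alpha }}{\varvec{h}}_{1}-{\varvec{h}}_{2}\big \| _{2}^{2}\). Setting the Wirtinger derivative \(\partial \widetilde{g}/\partial \overline{\alpha }\) to zero at \(\alpha ^{\sharp }\) gives
\[
\alpha ^{\sharp }\left\| {\varvec{x}}_{1}\right\| _{2}^{2}-{\varvec{x}}_{1}^{\textsf {H} }{\varvec{x}}_{2}=\frac{\left\| {\varvec{h}}_{1}\right\| _{2}^{2}}{\alpha ^{\sharp }\big (\overline{\alpha ^{\sharp }}\big )^{2}}-\frac{{\varvec{h}}_{2}^{\textsf {H} }{\varvec{h}}_{1}}{\big (\overline{\alpha ^{\sharp }}\big )^{2}},
\]
and multiplying both sides by \(\overline{\alpha ^{\sharp }}\) yields
\[
\left\| \widetilde{{\varvec{x}}}_{1}\right\| _{2}^{2}-\widetilde{{\varvec{x}}}_{1}^{\textsf {H} }{\varvec{x}}_{2}=\big \| \widetilde{{\varvec{h}}}_{1}\big \| _{2}^{2}-{\varvec{h}}_{2}^{\textsf {H} }\widetilde{{\varvec{h}}}_{1},\qquad \text {i.e.}\qquad \widetilde{{\varvec{x}}}_{1}^{\textsf {H} }\big (\widetilde{{\varvec{x}}}_{1}-{\varvec{x}}_{2}\big )=\big (\widetilde{{\varvec{h}}}_{1}-{\varvec{h}}_{2}\big )^{\textsf {H} }\widetilde{{\varvec{h}}}_{1}.
\]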
1.3.4 Matrix Concentration Inequalities
The proof for blind deconvolution is largely built upon the concentration of random matrices that are functions of \(\left\{ {\varvec{a}}_j{\varvec{a}}_j^\textsf {H} \right\} \). In this subsection, we collect the concentration results for the various forms of random matrices encountered in the analysis.
Lemma 57
Suppose \({\varvec{a}}_{j}\overset{\text {i.i.d.}}{\sim }\mathcal {N}\left( {\varvec{0}},\frac{1}{2}{\varvec{I}}_{K}\right) +i\mathcal {N}\left( {\varvec{0}},\frac{1}{2}{\varvec{I}}_{K}\right) \) for every \(1\le j \le m\), and let \(\{c_j\}_{1 \le j \le m}\) be a set of fixed numbers. Then there exist universal constants \(\widetilde{C}_1,\widetilde{C}_2 >0 \) such that for all \(t\ge 0\)
Proof
This is a simple variant of [116, Theorem 5.39], obtained via the Bernstein inequality and a standard covering argument; we therefore omit the proof. \(\square \)
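As a sanity check (separate from the formal argument), the scaling in Lemma 57 can be probed numerically. In the Python sketch below, the dimensions and the reference scale \(\Vert {\varvec{c}}\Vert _{2}\sqrt{K+\log m}\) are illustrative assumptions rather than quantities taken from the lemma.

import numpy as np

# Monte Carlo probe of the concentration of sum_j c_j (a_j a_j^H - I_K).
# All dimensions and the reference scale are illustrative assumptions.
rng = np.random.default_rng(0)
m, K, trials = 2000, 20, 20
c = rng.standard_normal(m)  # a fixed set of coefficients {c_j}

norms = []
for _ in range(trials):
    # rows of A are the vectors a_j ~ N(0, I_K/2) + i N(0, I_K/2)
    A = (rng.standard_normal((m, K)) + 1j * rng.standard_normal((m, K))) / np.sqrt(2)
    # sum_j c_j a_j a_j^H as a weighted Gram matrix, recentered by its mean
    S = (A.T * c) @ A.conj() - c.sum() * np.eye(K)
    norms.append(np.linalg.norm(S, 2))  # spectral norm

print("median spectral norm:", np.median(norms))
print("reference scale:", np.linalg.norm(c) * np.sqrt(K + np.log(m)))

One expects the empirical spectral norm to remain within a small constant factor of this reference scale, consistent with sub-exponential concentration at moderate \(t\).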
Lemma 58
Suppose \({\varvec{a}}_{j}\overset{\text {i.i.d.}}{\sim }\mathcal {N}\left( {\varvec{0}},\frac{1}{2}{\varvec{I}}_{K}\right) +i\mathcal {N}\left( {\varvec{0}},\frac{1}{2}{\varvec{I}}_{K}\right) \) for every \(1\le j \le m\). Then there exist some absolute constants \(\widetilde{C}_1,\widetilde{C}_2, \widetilde{C}_{3}> 0 \) such that for all \(\max \{ 1,3 \widetilde{C}_1 K/\widetilde{C}_2 \} / m \le \varepsilon \le 1\), one has
where \(J\subseteq [m]\) and |J| denotes its cardinality.
Proof
The proof relies on Lemma 57 and the union bound. First, invoke Lemma 57 to see that for any fixed \(J\subseteq [m]\) and for all \(t\ge 0\), we have
for some constants \(\widetilde{C}_1,\widetilde{C}_2 > 0\), and as a result,
where \(\lceil c \rceil \) denotes the smallest integer no smaller than c. Here, (i) holds since we take the supremum over a larger set, and (ii) results from (294) and the union bound. Apply the elementary inequality \(\binom{n}{k} \le (en/k)^k\), valid for any \(1\le k \le n\), to obtain
where the second inequality uses \(\varepsilon m \le \lceil \varepsilon m\rceil \le 2 \varepsilon m\) whenever \(1/m\le \varepsilon \le 1\).
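Concretely, the combinatorial factor admits the bound (a sketch consistent with the two inequalities just invoked):
\[
\binom{m}{\lceil \varepsilon m\rceil }\le \left( \frac{em}{\lceil \varepsilon m\rceil }\right) ^{\lceil \varepsilon m\rceil }\le \left( \frac{e}{\varepsilon }\right) ^{2\varepsilon m}=\exp \big (2\varepsilon m\log (e/\varepsilon )\big ),
\]
where the middle step uses \(em/\lceil \varepsilon m\rceil \le e/\varepsilon \) and \(\lceil \varepsilon m\rceil \le 2\varepsilon m\), together with \(e/\varepsilon >1\).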
The proof is then completed by taking \(\widetilde{C}_3 \ge \max \{1,6/\widetilde{C}_2\}\) and \(t = \widetilde{C}_3 \log (e/\varepsilon )\). To see this, it is easy to check that \(\min \{ t,t^2\} = t\) since \(t\ge 1\). In addition, one has \(\widetilde{C}_1 K \le \widetilde{C}_2 \varepsilon m / 3 \le \widetilde{C}_2 \varepsilon m t / 3\), and \(2\log (e/\varepsilon ) \le \widetilde{C}_2 t / 3\). Combine the estimates above with (295) to arrive at
as claimed. Here, (i) holds due to the facts that \(\lceil \varepsilon m\rceil \le 2 \varepsilon m\) and \(1+t\le 2t \le 2\widetilde{C}_3 \log (e/\varepsilon )\). Inequality (ii) arises from the estimates listed above. \(\square \)
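For completeness, the three estimates invoked in the last step can be verified directly from the stated assumptions: since \(\varepsilon \le 1\) and \(\widetilde{C}_3\ge 1\),
\[
t=\widetilde{C}_3\log (e/\varepsilon )\ge \log (e/\varepsilon )\ge 1,\qquad 2\log (e/\varepsilon )=\frac{6}{\widetilde{C}_2\widetilde{C}_3}\cdot \frac{\widetilde{C}_2 t}{3}\le \frac{\widetilde{C}_2 t}{3},
\]
where the last inequality uses \(\widetilde{C}_3\ge 6/\widetilde{C}_2\); moreover, \(\varepsilon \ge 3\widetilde{C}_1 K/(\widetilde{C}_2 m)\) gives \(\widetilde{C}_1 K\le \widetilde{C}_2\varepsilon m/3\le \widetilde{C}_2\varepsilon m t/3\) because \(t\ge 1\).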
Lemma 59
Suppose \(m\gg K\log ^{3}m\). With probability exceeding \(1-O\left( m^{-10}\right) \), we have
Proof
The identity \(\sum _{j=1}^{m}{\varvec{b}}_{j}{\varvec{b}}_{j}^{\textsf {H} }={\varvec{I}}_{K}\) allows us to rewrite the quantity on the left-hand side as
where the \({\varvec{Z}}_{j}\)’s are independent zero-mean random matrices. To control the above spectral norm, we resort to the matrix Bernstein inequality [66, Theorem 2.7]. To this end, we first need to upper bound the sub-exponential norm \(\Vert \cdot \Vert _{\psi _{1}}\) (see definition in [116]) of each summand \({\varvec{Z}}_{j}\), i.e.,
where we make use of the facts that
We further need to bound the variance parameter, that is,
where the second line arises since \(\mathbb {E}\big [ \big (|{\varvec{a}}_{j}^{\textsf {H} }{\varvec{x}}^{\star }|^{2}-1\big )^2\big ]\asymp 1\), \(\Vert {\varvec{b}}_{j}\Vert _{2}^{2}=K/m\), and \(\sum _{j=1}^{m}{\varvec{b}}_{j}{\varvec{b}}_{j}^{\textsf {H} }={\varvec{I}}_{K}\). A direct application of the matrix Bernstein inequality [66, Theorem 2.7] leads us to conclude that with probability exceeding \(1-O\left( m^{-10}\right) \),
where the last relation holds under the assumption that \(m\gg K\log ^{3}m\). \(\square \)
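Although the displayed bounds are not reproduced above, the bookkeeping plausibly proceeds as follows: with \(\max _j\Vert {\varvec{Z}}_{j}\Vert _{\psi _{1}}\lesssim K/m\) and variance parameter \(\sigma ^{2}\lesssim K/m\) as computed above, a matrix Bernstein inequality of the type in [66, Theorem 2.7] yields, with probability at least \(1-O\left( m^{-10}\right) \),
\[
\Big \| \sum _{j=1}^{m}{\varvec{Z}}_{j}\Big \| \lesssim \sqrt{\frac{K}{m}\log m}+\frac{K}{m}\log ^{2}m\lesssim \sqrt{\frac{K\log m}{m}},
\]
and under \(m\gg K\log ^{3}m\) the right-hand side is \(o\left( 1/\log m\right) \).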
1.3.5 Matrix Perturbation Bounds
We also need a perturbation bound on the leading singular value and singular vectors of a given matrix; the lemma below is parallel to Lemma 34.
Lemma 60
Let \(\sigma _{1}({\varvec{A}})\), \({\varvec{u}}\), and \({\varvec{v}}\) denote the leading singular value, the leading left singular vector, and the leading right singular vector of \({\varvec{A}}\), respectively, and define \(\sigma _{1}(\widetilde{{\varvec{A}}})\), \(\widetilde{{\varvec{u}}}\), and \(\widetilde{{\varvec{v}}}\) analogously for \(\widetilde{{\varvec{A}}}\). Suppose that \(\sigma _{1}({\varvec{A}})\) and \(\sigma _{1}(\widetilde{{\varvec{A}}})\) are both nonzero. Then one has
Proof
The first claim follows since
With regard to the second claim, we see that
Similarly, one can obtain
Add these two inequalities to complete the proof. \(\square \)
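While the displayed inequalities are not reproduced above, the first claim is of Weyl type; a plausible reading of its one-line proof, using that the leading singular value equals the spectral norm, is
\[
\left| \sigma _{1}({\varvec{A}})-\sigma _{1}(\widetilde{{\varvec{A}}})\right| =\big | \Vert {\varvec{A}}\Vert -\Vert \widetilde{{\varvec{A}}}\Vert \big | \le \Vert {\varvec{A}}-\widetilde{{\varvec{A}}}\Vert ,
\]
by the triangle inequality for the spectral norm.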