Implicit Regularization in Nonconvex Statistical Estimation: Gradient Descent Converges Linearly for Phase Retrieval, Matrix Completion and Blind Deconvolution

Recent years have seen a flurry of activities in designing provably efficient nonconvex procedures for solving statistical estimation problems. Due to the highly nonconvex nature of the empirical loss, state-of-the-art procedures often require proper regularization (e.g. trimming, regularized cost, projection) in order to guarantee fast convergence. For vanilla procedures such as gradient descent, however, prior theory either recommends highly conservative learning rates to avoid overshooting, or completely lacks performance guarantees. This paper uncovers a striking phenomenon in nonconvex optimization: even in the absence of explicit regularization, gradient descent enforces proper regularization implicitly under various statistical models. In fact, gradient descent follows a trajectory staying within a basin that enjoys nice geometry, consisting of points incoherent with the sampling mechanism. This"implicit regularization"feature allows gradient descent to proceed in a far more aggressive fashion without overshooting, which in turn results in substantial computational savings. Focusing on three fundamental statistical estimation problems, i.e. phase retrieval, low-rank matrix completion, and blind deconvolution, we establish that gradient descent achieves near-optimal statistical and computational guarantees without explicit regularization. In particular, by marrying statistical modeling with generic optimization theory, we develop a general recipe for analyzing the trajectories of iterative algorithms via a leave-one-out perturbation argument. As a byproduct, for noisy matrix completion, we demonstrate that gradient descent achieves near-optimal error control --- measured entrywise and by the spectral norm --- which might be of independent interest.


Nonlinear systems and empirical loss minimization
A wide spectrum of science and engineering applications calls for solutions to a nonlinear system of equations. Imagine we have collected a set of data points y = {y j } 1≤j≤m , generated by a nonlinear sensing system, where x is the unknown object of interest, and the A j 's are certain nonlinear maps known a priori. Can we reconstruct the underlying object x in a faithful yet efficient manner? Problems of this kind abound in information and statistical science, prominent examples including low-rank matrix recovery [KMO10a, CR09], robust principal component analysis [CSPW11,CLMW11], phase retrieval [CSV13,JEH15], neural networks [SJL17, ZSJ + 17], to name just a few. In principle, it is possible to attempt reconstruction by searching for a solution that minimizes the empirical loss, namely, Unfortunately, this empirical loss minimization problem is, in many cases, nonconvex, making it NP-hard in general. This issue of non-convexity comes up in, for example, several representative problems that epitomize the structures of nonlinear systems encountered in practice. 1 • Phase retrieval / solving quadratic systems of equations. Imagine we are asked to recover an unknown object x ∈ R n , but are only given the square modulus of certain linear measurements about the object, with all sign/phase information of the measurements missing. This arises, for example, in X-ray crystallography [CESV13], and in latent-variable models where the hidden variables are captured by the missing signs [CYC14]. To fix ideas, assume we would like to solve for x ∈ R n in the following quadratic system of m equations where {a j } 1≤j≤m are the known design vectors. One strategy is thus to solve the following problem (2) • Low-rank matrix completion. In many scenarios such as collaborative filtering, we wish to make predictions about all entries of an (approximately) low-rank matrix M ∈ R n×n (e.g. a matrix consisting of users' ratings about many movies), yet only a highly incomplete subset of the entries are revealed to us [CR09]. For clarity of presentation, assume M to be rank-r (r n) and positive semidefinite (PSD), i.e. M = X X with X ∈ R n×r , and suppose we have only seen the entries Y j,k = M j,k = (X X ) j,k , (j, k) ∈ Ω within some index subset Ω of cardinality m. These entries can be viewed as nonlinear measurements about the low-rank factor X . The task of completing the true matrix M can then be cast as solving minimize X∈R n×r f (X) = n 2 4m where the e j 's stand for the canonical basis vectors in R n . 1 Here, we choose different pre-constants in front of the empirical loss in order to be consistent with the literature of the respective problems. In addition, we only introduce the problem in the noiseless case for simplicity of presentation.
• Blind deconvolution / solving bilinear systems of equations. Imagine we are interested in estimating two signals of interest h , x ∈ C K , but only get to collect a few bilinear measurements about them. This problem arises from mathematical modeling of blind deconvolution [ARR14, LLSW16], which frequently arises in astronomy, imaging, communications, etc. The goal is to recover two signals from their convolution. Put more formally, suppose we have acquired m bilinear measurements taking the following form where a j , b j ∈ C K are distinct design vectors (e.g. Fourier and/or random design vectors) known a priori.
In order to reconstruct the underlying signals, one asks for solutions to the following problem

Nonconvex optimization via regularized gradient descent
First-order methods have been a popular heuristic in practice for solving nonconvex problems including (1). For instance, a widely adopted procedure is gradient descent, which follows the update rule where η t is the learning rate (or step size) and x 0 is some proper initial guess. Given that it only performs a single gradient calculation ∇f (·) per iteration (which typically can be completed within near-linear time), this paradigm emerges as a candidate for solving large-scale problems. The concern is: whether x t converges to the global solution and, if so, how long it takes for convergence, especially since (1) is highly nonconvex. Fortunately, despite the worst-case hardness, appealing convergence properties have been discovered in various statistical estimation problems; the blessing being that the statistical models help rule out ill-behaved instances. For the average case, the empirical loss often enjoys benign geometry, in a local region (or at least along certain directions) surrounding the global optimum. In light of this, an effective nonconvex iterative method typically consists of two stages: 1. a carefully-designed initialization scheme (e.g. spectral method); 2. an iterative refinement procedure (e.g. gradient descent). This strategy has recently spurred a great deal of interest, owing to its promise of achieving computational efficiency and statistical accuracy at once for a growing list of problems (e.g. [KMO10a, JNS13, CW15, SL16, CLS15,CC17,LLSW16,LLB17]). However, rather than directly applying gradient descent (4), existing theory often suggests enforcing proper regularization. Such explicit regularization enables improved computational convergence by properly "stabilizing" the search directions. The following regularization schemes, among others, have been suggested to obtain or improve computational guarantees. We refer to these algorithms collectively as Regularized Gradient Descent.
• Trimming/truncation, which discards/truncates a subset of the gradient components when forming the descent direction. For instance, when solving quadratic systems of equations, one can modify the gradient descent update rule as where T is an operator that effectively drops samples bearing too much influence on the search direction. This strategy [CC17, ZCL16, WGE17] has been shown to enable exact recovery with linear-time computational complexity and optimal sample complexity.
• Regularized loss, which attempts to optimize a regularized empirical risk where R(x) stands for an additional penalty term in the empirical loss. For example, in low-rank matrix completion R(·) imposes penalty based on the 2 row norm [KMO10a, SL16] as well as the Frobenius norm [SL16] of the decision matrix, while in blind deconvolution, it penalizes the 2 norm as well as certain component-wise incoherence measure of the decision vectors [LLSW16, HH17, LS17]. n/a n/a n/a nr 7 n r log 1 regularized loss Matrix [SL16] completion nr 2 r 2 log 1 projection [CW15,ZL16] Blind n/a n/a n/a Kpoly log m m log 1 regularized loss & deconvolution projection [LLSW16] • Projection, which projects the iterates onto certain sets based on prior knowledge, that is, where P is a certain projection operator used to enforce, for example, incoherence properties. This strategy has been employed in both low-rank matrix completion [CW15,ZL16] and blind deconvolution [LLSW16].
Equipped with such regularization procedures, existing works uncover appealing computational and statistical properties under various statistical models. Table 1 summarizes the performance guarantees derived in the prior literature; for simplicity, only orderwise results are provided.
Remark 1. There is another role of regularization commonly studied in the literature, which exploits prior knowledge about the structure of the unknown object, such as sparsity to prevent overfitting and improve statistical generalization ability. This is, however, not the focal point of this paper, since we are primarily pursuing solutions to (1) without imposing additional structures.

Regularization-free procedures?
The regularized gradient descent algorithms, while exhibiting appealing performance, usually introduce more algorithmic parameters that need to be carefully tuned based on the assumed statistical models. In contrast, vanilla gradient descent (cf. (4)) -which is perhaps the very first method that comes into mind and requires minimal tuning parameters -is far less understood (cf. Table 1). Take matrix completion and blind deconvolution as examples: to the best of our knowledge, there is currently no theoretical guarantee derived for vanilla gradient descent.
The situation is better for phase retrieval: the local convergence of vanilla gradient descent, also known as Wirtinger flow (WF), has been investigated in [CLS15,WWS15]. Under i.i.d. Gaussian design and with near-optimal sample complexity, WF (combined with spectral initialization) provably achieves -accuracy (in a relative sense) within O n log (1/ε) iterations. Nevertheless, the computational guarantee is significantly outperformed by the regularized version (called truncated Wirtinger flow [CC17]), which only requires O log (1/ε) iterations to converge with similar per-iteration cost. On closer inspection, the high computational cost of WF is largely due to the vanishingly small step size η t = O 1/(n x 2 2 ) -and hence slow movement -suggested by the theory [CLS15]. While this is already the largest possible step size allowed in the theory published in [CLS15], it is considerably more conservative than the choice η t = O 1/ x 2 2 theoretically justified for the regularized version [CC17, ZCL16].
The lack of understanding and suboptimal results about vanilla gradient descent raise a very natural question: are regularization-free iterative algorithms inherently suboptimal when solving nonconvex statistical estimation problems of this kind?

Numerical surprise of unregularized gradient descent
To answer the preceding question, it is perhaps best to first collect some numerical evidence. In what follows, we test the performance of vanilla gradient descent for phase retrieval, matrix completion, and blind Relative error of X t X t (measured by · F , · , · ∞ ) vs. iteration count for matrix completion, where n = 1000, r = 10, p = 0.1, and η t = 0.2. (c) Relative error of h t x t * (measured by · F ) vs. iteration count for blind deconvolution, where m = 10K and η t = 0.5. deconvolution, using a constant step size. For all of these experiments, the initial guess is obtained by means of the standard spectral method. Our numerical findings are as follows: • Phase retrieval. For each n, set m = 10n, take x ∈ R n to be a random vector with unit norm, and generate the design vectors a j i.i.d.
∼ N (0, I n ), 1 ≤ j ≤ m. Figure 1(a) illustrates the relative 2 error min{ x t − x 2 , x t + x 2 }/ x 2 (modulo the unrecoverable global phase) vs. the iteration count. The results are shown for n = 20, 100, 200, 1000, with the step size taken to be η t = 0.1 in all settings.
• Matrix completion. Generate a random PSD matrix M ∈ R n×n with dimension n = 1000, rank r = 10, and all nonzero eigenvalues equal to one. Each entry of M is observed independently with probability p = 0.1. Figure 1(b) plots the relative error X t X t − M / M vs. the iteration count, where |||·||| can either be the Frobenius norm · F , the spectral norm · , or the entrywise ∞ norm · ∞ . Here, we pick the step size as η t = 0.2.
• Blind deconvolution. For each K ∈ {20, 100, 200, 1000} and m = 10K, generate the design vectors a j i.i.d. ∼ N (0, 1 2 I K ) + iN (0, 1 2 I K ) for 1 ≤ j ≤ m independently, 2 and the b j 's are drawn from a partial Discrete Fourier Transform (DFT) matrix (to be described in Section 3.3). The underlying signals h , x ∈ C K are produced as random vectors with unit norm. Figure 1(c) plots the relative error h t x t * −h x * F / h x * F vs. the iteration count, with the step size taken to be η t = 0.5 in all settings.
In all of these numerical experiments, vanilla gradient descent enjoys remarkable linear convergence, always yielding an accuracy of 10 −5 (in a relative sense) within around 200 iterations. In particular, for the phase retrieval problem, the step size is taken to be η t = 0.1 although we vary the problem size from n = 20 to n = 1000. The consequence is that the convergence rates experience little changes when the problem sizes vary. In comparison, the theory published in [CLS15] seems overly pessimistic, as it suggests a diminishing step size inversely proportional to n and, as a result, an iteration complexity that worsens as the problem size grows.
In short, the above empirical results are surprisingly positive yet puzzling. Why was the computational efficiency of vanilla gradient descent unexplained or substantially underestimated in prior theory?

This paper
The main contribution of this paper is towards demystifying the "unreasonable" effectiveness of regularizationfree nonconvex iterative methods. As asserted in previous work, regularized gradient descent succeeds by properly enforcing/promoting certain incoherence conditions throughout the execution of the algorithm. In contrast, we discover that Prior theory Our theory sample iteration step sample iteration step complexity complexity size complexity complexity size Phase retrieval n log n n log (1/ε) 1/n n log n log n log (1/ε) 1/ log n Matrix completion n/a n/a n/a nr 3 poly log n log (1/ε) 1 Blind deconvolution n/a n/a n/a Kpoly log m log (1/ε) 1 Vanilla gradient descent automatically forces the iterates to stay incoherent with the measurement mechanism, thus implicitly regularizing the search directions.
This "implicit regularization" phenomenon is of fundamental importance, suggesting that vanilla gradient descent proceeds as if it were properly regularized. This explains the remarkably favorable performance of unregularized gradient descent in practice. Focusing on the three representative problems mentioned in Section 1.1, our theory guarantees both statistical and computational efficiency of vanilla gradient descent under random designs and spectral initialization. With near-optimal sample complexity, to attain -accuracy, • Phase retrieval (informal): vanilla gradient descent converges in O log n log 1 iterations; • Matrix completion (informal): vanilla gradient descent converges in O log 1 iterations; • Blind deconvolution (informal): vanilla gradient descent converges in O log 1 iterations.
In words, gradient descent provably achieves (nearly) linear convergence in all of these examples. Throughout this paper, an algorithm is said to converge (nearly) linearly to x in the noiseless case if the iterates {x t } obey for some 0 < c ≤ 1 that is (almost) independent of the problem size. Here, dist(·, ·) can be any appropriate discrepancy measure. As a byproduct of our theory, gradient descent also provably controls the entrywise empirical risk uniformly across all iterations; for instance, this implies that vanilla gradient descent controls entrywise estimation error for the matrix completion task. Precise statements of these results are deferred to Section 3 and are briefly summarized in Table 2.
Notably, our study of implicit regularization suggests that the behavior of nonconvex optimization algorithms for statistical estimation needs to be examined in the context of statistical models, which induces an objective function as a finite sum. Our proof is accomplished via a leave-one-out perturbation argument, which is inherently tied to statistical models and leverages homogeneity across samples. Altogether, this allows us to localize benign landscapes for optimization and characterize finer dynamics not accounted for in generic gradient descent theory.

Notations
Before continuing, we introduce several notations used throughout the paper. First of all, boldfaced symbols are reserved for vectors and matrices. For any vector v, we use v 2 to denote its Euclidean norm. For any matrix A, we use σ j (A) and λ j (A) to denote its jth largest singular value and eigenvalue, respectively, and let A j,· and A ·,j denote its jth row and jth column, respectively. In addition, A , A F , A 2,∞ , and A ∞ stand for the spectral norm (i.e. the largest singular value), the Frobenius norm, the 2 / ∞ norm (i.e. the largest 2 norm of the rows), and the entrywise ∞ norm (the largest magnitude of all entries) of a matrix A. Also, A , A * and A denote the transpose, the conjugate transpose, and the entrywise conjugate of A, respectively. I n denotes the identity matrix with dimension n × n. The notation O n×r represents the set of all n × r orthonormal matrices. The notation [n] refers to the set {1, · · · , n}. Also, we use Re(x) to denote the real part of a complex number x. Throughout the paper, we use the terms "samples" and "measurements" interchangeably. Additionally, the standard notation f (n) = O (g(n)) or f (n) g(n) means that there exists a constant c > 0 such that |f (n)| ≤ c|g(n)|, f (n) g(n) means that there exists a constant c > 0 such that |f (n)| ≥ c |g(n)|, and f (n) g(n) means that there exist constants c 1 , c 2 > 0 such that c 1 |g(n)| ≤ |f (n)| ≤ c 2 |g(n)|. Besides, f (n) g(n) means that there exists some large enough constant c > 0 such that |f (n)| ≥ c |g(n)|. Similarly, f (n) g(n) means that there exists some sufficiently small constant c > 0 such that |f (n)| ≤ c |g(n)|.
2 Implicit regularization -a case study To reveal reasons behind the effectiveness of vanilla gradient descent, we first examine existing theory of gradient descent and identify the geometric properties that enable linear convergence. We then develop an understanding as to why prior theory is conservative, and describe the phenomenon of implicit regularization that helps explain the effectiveness of vanilla gradient descent. To facilitate discussion, we will use the problem of solving random quadratic systems (phase retrieval) and Wirtinger flow as a case study, but our diagnosis applies more generally, as will be seen in later sections.

Gradient descent theory revisited
In the convex optimization literature, there are two standard conditions about the objective functionstrong convexity and smoothness -that allow for linear convergence of gradient descent.
Definition 1 (Strong convexity). A twice continuously differentiable function f : It is well known that for an unconstrained optimization problem, if the objective function f is both αstrongly convex and β-smooth, then vanilla gradient descent (4) enjoys 2 error contraction [Bub15, Theorem 3.12], namely, as long as the step size is chosen as η t = 2/(α + β). Here, x denotes the global minimum. This immediately reveals the iteration complexity for gradient descent: the number of iterations taken to attain -accuracy (in a relative sense) is bounded by In other words, the iteration complexity is dictated by and scales linearly with the condition number -the ratio β/α of smoothness to strong convexity parameters. Moving beyond convex optimization, one can easily extend the above theory to nonconvex problems with local strong convexity and smoothness. More precisely, suppose the objective function f satisfies and ∇ 2 f (x) ≤ β over a local 2 ball surrounding the global minimum x : Then the contraction result (8) continues to hold, as long as the algorithm is seeded with an initial point that falls inside B δ (x).

Local geometry for solving random quadratic systems
To invoke generic gradient descent theory, it is critical to characterize the local strong convexity and smoothness properties of the loss function. Take the problem of solving random quadratic systems (phase retrieval) as an example. Consider the i.i.d. Gaussian design in which a j i.i.d.
∼ N (0, I n ), 1 ≤ j ≤ m, and suppose without loss of generality that the underlying signal obeys x 2 = 1. It is well known that x is the unique minimizer -up to global phase -of (2) under this statistical model, provided that the ratio m/n of equations to unknowns is sufficiently large. The Hessian of the loss function f (x) is given by • Population-level analysis. Consider the case with an infinite number of equations or samples, i.e. m → ∞, where ∇ 2 f (x) converges to its expectation. Simple calculation yields that It it straightforward to verify that for any sufficiently small constant δ > 0, one has the crude bound meaning that f is 1-strongly convex and 10-smooth within a local ball around x . As a consequence, when we have infinite samples and an initial guess x 0 such that x 0 − x 2 ≤ δ x 2 , vanilla gradient descent with a constant step size converges to the global minimum within logarithmic iterations.
• Finite-sample regime with m n log n. Now that f exhibits favorable landscape in the population level, one thus hopes that the fluctuation can be well-controlled so that the nice geometry carries over to the finite-sample regime. In the regime where m n log n (which is the regime considered in [CLS15]), the local strong convexity is still preserved, in the sense that occurs with high probability, provided that δ > 0 is sufficiently small (see [Sol14,WWS15] and Lemma 1). The smoothness parameter, however, is not well-controlled. In fact, it can be as large as (up to logarithmic factors) 3 ∇ 2 f (x) n even when we restrict attention to the local 2 ball (9) with δ > 0 being a fixed small constant. This means that the condition number β/α (defined in Section 2.1) may scale as O(n), leading to the step size recommendation η t 1/n, and, as a consequence, a high iteration complexity O n log(1/ ) . This underpins the analysis in [CLS15].
In summary, the geometric properties of the loss function -even in the local 2 ball centering around the global minimum -is not as favorable as one anticipates, in particular in view of its population counterpart. A direct application of generic gradient descent theory leads to an overly conservative step size and a pessimistic convergence rate, unless the number of samples is enormously larger than the number of unknowns.
Remark 2. Notably, due to Gaussian designs, the phase retrieval problem enjoys more favorable geometry compared to other nonconvex problems. In matrix completion and blind deconvolution, the Hessian matrices are rank-deficient even at the population level. In such cases, the above discussions need to be adjusted, e.g. strong convexity is only possible when we restrict attention to certain directions.

Which region enjoys nicer geometry?
Interestingly, our theory identifies a local region surrounding x with a large diameter that enjoys much nicer geometry. This region does not mimic an 2 ball, but rather, the intersection of an 2 ball and a polytope. We term it the region of incoherence and contraction (RIC). For phase retrieval, the RIC includes all points x ∈ R n obeying where δ > 0 is some small numerical constant. As will be formalized in Lemma 1, with high probability the Hessian matrix satisfies simultaneously for x in the RIC. In words, the Hessian matrix is nearly well-conditioned (with the condition number bounded by O(log n)), as long as (i) the iterate is not very far from the global minimizer (cf. (11a)), and (ii) the iterate remains incoherent 4 with respect to the sensing vectors (cf. (11b)). Another way to interpret the incoherence condition (11b) is that the empirical risk needs to be well-controlled uniformly across all samples. See Figure 2(a) for an illustration of the above region. Figure 2: (a) The shaded region is an illustration of the incoherence region, which satisfies a j (x − x ) √ log n for all points x in the region. (b) When x 0 resides in the desired region, we know that x 1 remains within the 2 ball but might fall out of the incoherence region (the shaded region). Once x 1 leaves the incoherence region, we lose control and may overshoot. (c) Our theory reveals that with high probability, all iterates will stay within the incoherence region, enabling fast convergence.
The following observation is thus immediate: one can safely adopt a far more aggressive step size (as large as η t = O(1/ log n)) to achieve acceleration, as long as the iterates stay within the RIC. This, however, fails to be guaranteed by generic gradient descent theory. To be more precise, if the current iterate x t falls within the desired region, then in view of (8), we can ensure 2 error contraction after one iteration, namely, and hence x t+1 stays within the local 2 ball and hence satisfies (11a). However, it is not immediately obvious that x t+1 would still stay incoherent with the sensing vectors and satisfy (11b). If x t+1 leaves the RIC, it no longer enjoys the benign local geometry of the loss function, and the algorithm has to slow down in order to avoid overshooting. See Figure 2(b) for a visual illustration. In fact, in almost all regularized gradient descent algorithms mentioned in Section 1.2, one of the main purposes of the proposed regularization procedures is to enforce such incoherence constraints.
4 If x is aligned with (and hence very coherent with) one vector a j , then with high probability one has a j x − x | a j x| √ n x 2 , which is significantly larger than √ log n x 2 .

Implicit regularization
However, is regularization really necessary for the iterates to stay within the RIC? To answer this question, we plot in Figure 3(a) (resp. Figure 3 ) vs. the iteration count in a typical Monte Carlo trial, generated in the same way as for Figure 1(a). Interestingly, the incoherence measure remains bounded by 2 for all iterations t > 1. This important observation suggests that one may adopt a substantially more aggressive step size throughout the whole algorithm.
) of the gradient iterates vs. iteration count for the phase retrieval problem. The results are shown for n ∈ {20, 100, 200, 1000} and m = 10n, with the step size taken to be η t = 0.1. The problem instances are generated in the same way as in Figure 1(a).
The main objective of this paper is thus to provide a theoretical validation of the above empirical observation. As we will demonstrate shortly, with high probability all iterates along the execution of the algorithm (as well as the spectral initialization) are provably constrained within the RIC, implying fast convergence of vanilla gradient descent (cf. Figure 2(c)). The fact that the iterates stay incoherent with the measurement mechanism automatically, without explicit enforcement, is termed "implicit regularization".

A glimpse of the analysis: a leave-one-out trick
In order to rigorously establish (11b) for all iterates, the current paper develops a powerful mechanism based on the leave-one-out perturbation argument, a trick rooted and widely used in probability and random matrix theory. Note that the iterate x t is statistically dependent with the design vectors {a j }. Under such circumstances, one often resorts to generic bounds like the Cauchy-Schwarz inequality, which would not yield a desirable estimate. To address this issue, we introduce a sequence of auxiliary iterates {x t,(l) } for each 1 ≤ l ≤ m (for analytical purposes only), obtained by running vanilla gradient descent using all but the lth sample. As one can expect, such auxiliary trajectories serve as extremely good surrogates of {x t } in the sense that since their constructions only differ by a single sample. Most importantly, since x t,(l) is independent with the lth design vector, it is much easier to control its incoherence w.r.t. a l to the desired level: Combining (12) and (13) then leads to (11b). See Figure 4 for a graphical illustration of this argument. Notably, this technique is very general and applicable to many other problems. We invite the readers to Section 5 for more details. Ax y = |Ax| 2 Figure 4: Illustration of the leave-one-out sequence w.r.t. a l . (a) The sequence {x t,(l) } t≥0 is constructed without using the lth sample. (b) Since the auxiliary sequence {x t,(l) } is constructed without using a l , the leave-one-out iterates stay within the incoherence region w.r.t. a l with high probability. Meanwhile, {x t } and {x t,(l) } are expected to remain close as their construction differ only in a single sample.

Main results
This section formalizes the implicit regularization phenomenon underlying unregularized gradient descent, and presents its consequences, namely near-optimal statistical and computational guarantees for phase retrieval, matrix completion, and blind deconvolution. Note that the discrepancy measure dist (·, ·) may vary from problem to problem.

Phase retrieval
Suppose the m quadratic equations are collected using random design vectors, namely, a j i.i.d.
∼ N (0, I n ), and the nonconvex problem to solve is The Wirtinger flow (WF) algorithm, first introduced in [CLS15], is a combination of spectral initialization and vanilla gradient descent; see Algorithm 1.
Algorithm 1 Wirtinger flow for phase retrieval Input: {a j } 1≤j≤m and {y j } 1≤j≤m . Spectral initialization: Let λ 1 (Y ) and x 0 be the leading eigenvalue and eigenvector of respectively, and set x 0 = λ 1 (Y ) /3 x 0 . Gradient updates: for t = 0, 1, 2, . . . , T − 1 do Recognizing that the global phase/sign is unrecoverable from quadratic measurements, we introduce the 2 distance modulo the global phase as follows Our finding is summarized in the following theorem.
Theorem 1. Let x ∈ R n be a fixed vector. Suppose a j i.i.d.
∼ N (0, I n ) for each 1 ≤ j ≤ m and m ≥ c 0 n log n for some sufficiently large constant c 0 > 0. Assume the step size obeys η t ≡ η = c 1 / log n · x 0 2 2 for any sufficiently small constant c 1 > 0. Then there exist some absolute constants 0 < ε < 1 and c 2 > 0 such that with probability at least 1 − O mn −5 , the Wirtinger flow iterates (Algorithm 1) satisfy that for all t ≥ 0, Theorem 1 reveals a few intriguing properties of WF.
• Implicit regularization: Theorem 1 asserts that the incoherence properties are satisfied throughout the execution of the algorithm (see (19b)), which formally justifies the implicit regularization feature we hypothesized.
• Near-constant step size: Consider the case where x 2 = 1. Theorem 1 establishes near-linear convergence of WF with a substantially more aggressive step size η 1/ log n. Compared with the choice η 1/n admissible in [CLS15, Theorem 3.3], Theorem 1 allows WF to attain -accuracy within O(log n log(1/ )) iterations. The resulting computational complexity of WF is O mn log n log 1 , which significantly improves upon the result O mn 2 log (1/ ) derived in [CLS15]. As a side note, if the sample size further increases to m n log 2 n, then a constant step size η 1 is also feasible, resulting in an iteration complexity log(1/ ). This follows since with high probability, the entire trajectory resides within a more refined incoherence region max j a j x t − x x 2 . We omit the details here.
• Incoherence of spectral initialization: We have also demonstrated in Theorem 1 that the initial guess x 0 falls within the RIC and is hence nearly orthogonal to all design vectors. This provides a finer characterization of spectral initialization, in comparison to prior theory that focuses primarily on the 2 accuracy [NJS13, CLS15]. We expect our leave-one-out analysis to accommodate other variants of spectral initialization studied in the literature [CC17, CLM + 16, WGE17, LL17, MM17].

Low-rank matrix completion
Let M ∈ R n×n be a positive semidefinite matrix 5 with rank r, and suppose its eigendecomposition is where U ∈ R n×r consists of orthonormal columns, and Σ is an r × r diagonal matrix with eigenvalues in a descending order, i.e. σ max = σ 1 ≥ · · · ≥ σ r = σ min > 0. Throughout this paper, we assume the condition number κ := σ max /σ min is bounded by a fixed constant, independent of the problem size (i.e. n and r). Denoting X = U (Σ ) 1/2 allows us to factorize M as Consider a random sampling model such that each entry of M is observed independently with probability 0 < p ≤ 1, i.e. for 1 ≤ j ≤ k ≤ n, where the entries of E = [E j,k ] 1≤j≤k≤n are independent sub-Gaussian noise with sub-Gaussian norm σ (see [Ver12,Definition 5.7]). We denote by Ω the set of locations being sampled, and P Ω (Y ) represents the projection of Y onto the set of matrices supported in Ω. We note here that the sampling rate p, if not known, can be faithfully estimated by the sample proportion |Ω|/n 2 . To fix ideas, we consider the following nonconvex optimization problem The vanilla gradient descent algorithm (with spectral initialization) is summarized in Algorithm 2.
In addition, recognizing that X is identifiable only up to orthogonal transformation, we define the optimal transform from the tth iterate X t to X as where O r×r is the set of r × r orthonormal matrices. With these definitions in place, we have the following theorem.
Theorem 2. Let M be a rank r, µ-incoherent PSD matrix, and its condition number κ is a fixed constant. Suppose the sample size satisfies n 2 p ≥ Cµ 3 r 3 n log 3 n for some sufficiently large constant C > 0, and the noise satisfies σ n p σ min κ 3 µr log 3 n .
Theorem 2 provides the first theoretical guarantee of unregularized gradient descent for matrix completion, demonstrating near-optimal statistical accuracy and computational complexity.
• Implicit regularization: In Theorem 2, we bound the 2 / ∞ error of the iterates in a uniform manner via (28b). Note that X − X 2,∞ = max j e j X − X 2 , which implies the iterates remain incoherent with the sensing vectors throughout and have small incoherence parameters (cf. (25)). In comparison, prior works either include a penalty term on { e j X 2 } 1≤j≤n [KMO10a,SL16] and/or X F [SL16] to encourage an incoherent and/or low-norm solution, or add an extra projection operation to enforce incoherence [CW15,ZL16]. Our results demonstrate that such explicit regularization is unnecessary.
• Constant step size: Without loss of generality we may assume that σ max = M = O(1), which can be done by choosing proper scaling of M . Hence we have a constant step size η t 1. Actually it is more convenient to consider the scale invariant parameter ρ: Theorem 2 guarantees linear convergence of the vanilla gradient descent at a constant rate ρ. Remarkably, the convergence occurs with respect to three different unitarily invariant norms: the Frobenius norm · F , the 2 / ∞ norm · 2,∞ , and the spectral norm · . As far as we know, the latter two are established for the first time. Note that our result even improves upon that for regularized gradient descent; see Table 1.
• Near-optimal sample complexity: When the rank r = O(1), vanilla gradient descent succeeds under a near-optimal sample complexity n 2 p npoly log n, which is statistically optimal up to some logarithmic factor.
• Near-minimal Euclidean error: In view of (28a), as t increases, the Euclidean error of vanilla GD converges to which coincides with the theoretical guarantee in [CW15, Corollary 1] and matches the minimax lower bound established in [NW12, KLT11].
• Near-optimal entrywise error: The 2 / ∞ error bound (28b) immediately yields entrywise control of the empirical risk. Specifically, as soon as t is sufficiently large (so that the first term in (28b) is negligible), we have where the last line follows from (28b) as well as the facts that X t H t −X 2,∞ ≤ X 2,∞ and M ∞ = X 2 2,∞ . Compared with the Euclidean loss (29), this implies that when r = O(1), the entrywise error of X t X t is uniformly spread out across all entries. As far as we know, this is the first result that reveals near-optimal entrywise error control for noisy matrix completion using nonconvex optimization, without resorting to sample splitting.
Remark 3. Theorem 2 remains valid if the total number T of iterations obeys T = n O(1) . In the noiseless case where σ = 0, the theory allows arbitrarily large T .
Finally, we report the empirical statistical accuracy of vanilla gradient descent in the presence of noise. Figure 5 displays the squared relative error of vanilla gradient descent as a function of the signal-to-noise ratio (SNR), where the SNR is defined to be : Squared relative error of the estimate X (measured by · F , · , · 2,∞ modulo global transformation) and M = X X (measured by · ∞ ) vs. SNR for noisy matrix completion, where n = 500, r = 10, p = 0.1, and η t = 0.2. Here X denotes the estimate returned by Algorithm 2 after convergence. and the relative error is measured in terms of the square of the metrics as in (28) as well as the squared entrywise prediction error. Both the relative error and the SNR are shown on a dB scale (i.e. 10 log 10 (SNR) and 10 log 10 (squared relative error) are plotted). As one can see from the plot, the squared relative error scales inversely proportional to the SNR, which is consistent with our theory. 6

Blind deconvolution
Suppose we have collected m bilinear measurements where a j follows a complex Gaussian distribution, i.e. a j i.i.d.
∼ N 0, 1 2 I K + iN 0, 1 2 I K for 1 ≤ j ≤ m, and B := [b 1 , · · · , b m ] * ∈ C m×K is formed by the first K columns of a unitary discrete Fourier transform (DFT) matrix F ∈ C m×m obeying F F * = I m (see Appendix D.3.2 for a brief introduction to DFT matrices). This setup models blind deconvolution, where the two signals under convolution belong to known low-dimensional subspaces of dimension K [ARR14] 7 . In particular, the partial DFT matrix B plays an important role in image blind deblurring. In this subsection, we consider solving the following nonconvex optimization problem The (Wirtinger) gradient descent algorithm (with spectral initialization) is summarized in Algorithm 3; here, and ∇ x f (h, x) stand for the Wirtinger gradient and are given in (77) and (78), respectively; see [CLS15, Section 6] for a brief introduction to Wirtinger calculus. It is self-evident that h and x are only identifiable up to global scaling, that is, for any nonzero α ∈ C, In light of this, we will measure the discrepancy between 6 Note that when M is well-conditioned and when r = O(1), one can easily check that SNR ≈ M 2 F / n 2 σ 2 σ 2 min /(n 2 σ 2 ), and our theory says that the squared relative error bound is proportional to σ 2 /σ 2 min . 7 For simplicity, we have set the dimensions of the two subspaces equal, and it is straightforward to extend our results to the case of unequal subspace dimensions.
Before proceeding, we need to introduce the incoherence parameter [ARR14, LLSW16], which is crucial for blind deconvolution, whose role is similar to the incoherence parameter (cf. Definition 3) in matrix completion.
Definition 4 (Incoherence for blind deconvolution). Let the incoherence parameter µ of h be the smallest number such that max The incoherence parameter describes the spectral flatness of the signal h . With this definition in place, we have the following theorem, where for identifiability we assume that h 2 = x 2 .
Theorem 3. Suppose the number of measurements obeys m ≥ Cµ 2 K log 9 m for some sufficiently large constant C > 0, and suppose the step size η > 0 is taken to be some sufficiently small constant. Then there exist constants c 1 , c 2 , C 1 , C 3 , C 4 > 0 such that with probability exceeding 1 − c 1 m −5 − c 1 me −c2K , the iterates in Algorithm 3 satisfy for all t ≥ 0. Here, we denote α t as the alignment parameter, Theorem 3 provides the first theoretical guarantee of unregularized gradient descent for blind deconvolution at a near-optimal statistical and computational complexity. A few remarks are in order.
• Implicit regularization: Theorem 3 reveals that the unregularized gradient descent iterates remain incoherent with the sampling mechanism (see (37b) and (37c)). Recall that prior works operate upon a regularized cost function with an additional penalty term that regularizes the global scaling { h 2 , x 2 } and the incoherence {|b * j h|} 1≤j≤m [LLSW16, HH17, LS17]. In comparison, our theorem implies that it is unnecessary to regularize either the incoherence or the scaling ambiguity, which is somewhat surprising. This justifies the use of regularization-free (Wirtinger) gradient descent for blind deconvolution.
• Constant step size: Compared to the step size η t 1/m suggested in [LLSW16] for regularized gradient descent, our theory admits a substantially more aggressive step size (i.e. η t 1) even without regularization. Similar to phase retrieval, the computational efficiency is boosted by a factor of m, attaining -accuracy within O (log(1/ )) iterations (vs. O (m log(1/ )) iterations in prior theory).
• Near-optimal sample complexity: It is demonstrated that vanilla gradient descent succeeds at a near-optimal sample complexity up to logarithmic factors, although our requirement is slightly worse than [LLSW16] which uses explicit regularization. Notably, even under the sample complexity herein, the iteration complexity given in [LLSW16] is still O (m/poly log(m)).
• Incoherence of spectral initialization: As in phase retrieval, Theorem 3 demonstrates that the estimates returned by the spectral method are incoherent with respect to both {a j } and {b j }. In contrast, [LLSW16] recommends a projection operation (via a linear program) to enforce incoherence of the initial estimates, which is dispensable according to our theory.
• Contraction in · F : It is easy to check that the Frobenius norm error satisfies h t x t * − h x * F dist z t , z , and therefore Theorem 3 corroborates the empirical results shown in Figure 1(c).

Related work
Solving nonlinear systems of equations has received much attention in the past decade. Rather than directly attacking the nonconvex formulation, convex relaxation lifts the object of interest into a higher dimensional space and then attempts recovery via semidefinite programming (e.g. [RFP10, CSV13, CR09, ARR14]). This has enjoyed great success in both theory and practice. Despite appealing statistical guarantees, semidefinite programming is in general prohibitively expensive when processing large-scale datasets. Nonconvex approaches, on the other end, have been under extensive study in the last few years, due to their computational advantages. There is a growing list of statistical estimation problems for which nonconvex approaches are guaranteed to find global optimal solutions, including but not limited to phase retrieval [NJS13, CLS15 , it is further suggested that the optimization landscape is benign under sufficiently large sample complexity, in the sense that all local minima are globally optimal, and hence nonconvex iterative algorithms become promising in solving such problems. Below we review the three problems studied in this paper in more details. Some state-of-the-art results are summarized in Table 1 Finally, we note that the notion of implicit regularization -broadly defined -arises in settings far beyond the models and algorithms considered herein. For instance, it has been conjectured that in matrix factorization, over-parameterized stochastic gradient descent effectively enforces certain norm constraints, allowing it to converge to a minimal-norm solution as long as it starts from the origin [GWB + 17]. The stochastic gradient methods have also been shown to implicitly enforce Tikhonov regularization in several statistical learning settings [LCR16]. More broadly, this phenomenon seems crucial in enabling efficient training of deep neural networks [NTS14, NTSS17, ZBH + 16, SHS17, KMN + 16].

A general recipe for trajectory analysis
In this section, we sketch a general recipe for establishing performance guarantees of gradient descent, which conveys the key idea for proving the main results of this paper. The main challenge is to demonstrate that appropriate incoherence conditions are preserved throughout the trajectory of the algorithm. This requires exploiting statistical independence of the samples in a careful manner, in conjunction with generic optimization theory. Central to our approach is a leave-one-out perturbation argument, which allows to decouple the statistical dependency while controlling the component-wise incoherence measures.
General Recipe (a leave-one-out analysis) Step 1: characterize restricted strong convexity and smoothness of f , and identify the region of incoherence and contraction (RIC).
Step 2: introduce leave-one-out sequences {X t,(l) } and {H t,(l) } for each l, where {X t,(l) } (resp. {H t,(l) }) is independent of any sample involving φ l (resp. ψ l ); Step 3: establish the incoherence condition for {X t } and {H t } via induction. Suppose the iterates satisfy the claimed conditions in the tth iteration: (a) show, via restricted strong convexity, that the true iterates (X t+1 , H t+1 ) and the leave-one-out version (X t+1,(l) , H t+1,(l) ) are exceedingly close; (c) combine the bounds to establish the desired incoherence condition concerning max

General model
Consider the following problem where the samples are collected in a bilinear/quadratic form as where the objects of interest H , X ∈ C n×r or R n×r might be vectors or tall matrices taking either real or complex values. The design vectors {ψ j } and {φ j } are in either C n or R n , and can be either random or deterministic. This model is quite general and entails all three examples in this paper as special cases: • Phase retrieval : H = X = x ∈ R n , and ψ j = φ j = a j ; • Matrix completion: H = X ∈ R n×r and ψ j , φ j ∈ {e 1 , · · · , e n }; • Blind deconvolution: H = h ∈ C K , X = x ∈ C K , φ j = a j , and ψ j = b j .
For this setting, the empirical loss function is given by where we denote Z = (H, X). To minimize f (Z), we proceed with vanilla gradient descent following a standard spectral initialization, where η is the step size. As a remark, for complex-valued problems, the gradient (resp. Hessian) should be understood as the Wirtinger gradient (resp. Hessian). It is clear from (39) that Z = (H , X ) can only be recovered up to certain global ambiguity. For clarity of presentation, we assume in this section that such ambiguity has already been taken care of via proper global transformation.

Outline of the recipe
We are now positioned to outline the general recipe, which entails the following steps.
• Step 1: characterizing local geometry in the RIC. Our first step is to characterize a region Rwhich we term as the region of incoherence and contraction (RIC) -such that the Hessian matrix ∇ 2 f (Z) obeys strong convexity and smoothness, or at least along certain directions (i.e. restricted strong convexity and smoothness), where β/α scales slowly (or even remains bounded) with the problem size. As revealed by optimization theory, this geometric property (40) immediately implies linear convergence with the contraction rate 1 − O(α/β) for a properly chosen step size η, as long as all iterates stay within the RIC.
A natural question then arises: what does the RIC R look like? As it turns out, the RIC typically contains all points such that the 2 error Z − Z F is not too large and In the three examples, the above incoherence condition translates to: • Step 2: introducing the leave-one-out sequences. To justify that no iterates leave the RIC, we rely on the construction of auxiliary sequences. Specifically, for each l, produce an auxiliary sequence {Z t,(l) = (X t,(l) , H t,(l) )} such that X t,(l) (resp. H t,(l) ) is independent of any sample involving φ l (resp. ψ l ). As an example, suppose that the φ l 's and the ψ l 's are independently and randomly generated. Then for each l, one can consider a leave-one-out loss function that discards the lth sample. One further generates {Z t,(l) } by running vanilla gradient descent w.r.t. this auxiliary loss function, with a spectral initialization that similarly discards the lth sample. Note that this procedure is only introduced to facilitate analysis and is never implemented in practice. • Step 3: establishing the incoherence condition. We are now ready to establish the incoherence condition with the assistance of the auxiliary sequences. Usually the proof proceeds by induction, where our goal is to show that the next iterate remains within the RIC, given that the current one does.
-Step 3(a): proximity between the original and the leave-one-out iterates. As one can anticipate, {Z t } and {Z t,(l) } remain "glued" to each other along the whole trajectory, since their constructions differ by only a single sample. In fact, as long as the initial estimates stay sufficiently close, their gaps will never explode. To intuitively see why, use the fact ∇f which together with the strong convexity condition implies 2 contraction Indeed, (restricted) strong convexity is crucial in controlling the size of leave-one-out perturbations.
-Step 3(b): incoherence condition of the leave-one-out iterates. The fact that Z t+1 and Z t+1,(l) are exceedingly close motivates us to control the incoherence of Z t+1,(l) − Z instead, for 1 ≤ l ≤ m. By construction, X t+1,(l) (resp. H t+1,(l) ) is statistically independent of any sample involving the design vector φ l (resp. ψ l ), a fact that typically leads to a more friendly analysis for controlling φ * l X t+1,(l) − X 2 and ψ * l H t+1,(l) − H 2 . -Step 3(c): combining the bounds. With these results in place, apply the triangle inequality to obtain where the first term is controlled in Step 3(a) and the second term is controlled in Step 3(b). The term ψ * l H t+1 − H 2 can be bounded similarly. By choosing the bounds properly, this establishes the incoherence condition for all 1 ≤ l ≤ m as desired.
In this section, we instantiate the general recipe presented in Section 5 to phase retrieval and prove Theorem 1. Similar to the Section 7.1 in [CLS15], we are going to use η t = c 1 /(log n · x 2 2 ) instead of c 1 /(log n · x 0 2 2 ) as the step size for analysis. This is because with high probability, x 0 2 and x 2 are rather close in the relative sense. Without loss of generality, we assume throughout this section that x 2 = 1 and In addition, the gradient and the Hessian of f (·) for this problem (see (15)) are given respectively by which are useful throughout the proof. 6.1 Step 1: characterizing local geometry in the RIC

Local geometry
We start by characterizing the region that enjoys both strong convexity and the desired level of smoothness. This is supplied in the following lemma, which plays a crucial role in the subsequent analysis.
Lemma 1 (Restricted strong convexity and smoothness for phase retrieval). Fix any sufficiently small constant C 1 > 0 and any sufficiently large constant C 2 > 0, and suppose the sample complexity obeys m ≥ c 0 n log n for some sufficiently large constant c 0 > 0. With probability at least 1 − O(mn −10 ), holds simultaneously for all x ∈ R n satisfying x − x 2 ≤ 2C 1 ; and In words, Lemma 1 reveals that the Hessian matrix is positive definite and (almost) well-conditioned, if one restricts attention to the set of points that are (i) not far away from the truth (cf. (45a)) and (ii) incoherent with respect to the measurement vectors {a j } 1≤j≤m (cf. (45b)).

Error contraction
As we point out before, the nice local geometry enables 2 contraction, which we formalize below.
Proof. This proof applies the standard argument when establishing the 2 error contraction of gradient descent for strongly convex and smooth functions. See Appendix A.2.
With the help of Lemma 2, we can turn the proof of Theorem 1 into ensuring that the trajectory {x t } 0≤t≤n lies in the RIC specified by (47). 8 This is formally stated in the next lemma.
Lemma 3. Suppose for all 0 ≤ t ≤ T 0 := n, the trajectory {x t } falls within the region of incoherence and contraction (termed the RIC), namely, then the claims in Theorem 1 hold true. Here and throughout this section, C 1 , C 2 > 0 are two absolute constants as specified in Lemma 1.

6.2
Step 2: introducing the leave-one-out sequences In comparison to the 2 error bound (47a) that captures the overall loss, the incoherence hypothesis (47b)which concerns sample-wise control of the empirical risk -is more complicated to establish. This is partly due to the statistical dependence between x t and the sampling vectors {a l }. As described in the general recipe, the key idea is the introduction of a leave-one-out version of the WF iterates, which removes a single measurement from consideration. To be precise, for each 1 ≤ l ≤ m, we define the leave-one-out empirical loss function as and the auxiliary trajectory x t,(l) t≥0 is constructed by running WF w.r.t. f (l) (x). In addition, the spectral initialization x 0,(l) is computed based on the rescaled leading eigenvector of the leave-one-out data matrix Clearly, the entire sequence x t,(l) t≥0 is independent of the lth sampling vector a l . This auxiliary procedure is formally described in Algorithm 4.

Step 3: establishing the incoherence condition by induction
As revealed by Lemma 3, it suffices to prove that the iterates {x t } 0≤t≤T0 satisfies (47) with high probability. Our proof will be inductive in nature. For the sake of clarity, we list all the induction hypotheses: Here C 3 > 0 is some universal constant. The induction on (51a), that is, has already been established in Lemma 2. This subsection is devoted to establishing (51b) and (51c) for the (t + 1)th iteration, assuming that (51) holds true up to the tth iteration. We defer the justification of the base case (i.e. initialization at t = 0) to Section 6.4.

Algorithm 4
The lth leave-one-out sequence for phase retrieval Input: {a j } 1≤j≤m,j =l and {y j } 1≤j≤m,j =l . Spectral initialization: let λ 1 Y (l) and x 0,(l) be the leading eigenvalue and eigenvector of respectively, and set Gradient updates: for t = 0, 1, 2, . . . , T − 1 do • Step 3(a): proximity between the original and the leave-one-out iterates. The leave-one-out sequence {x t,(l) } behaves similarly to the true WF iterates {x t } while maintaining statistical independence with a l , a key fact that allows us to control the incoherence of lth leave-one-out sequence w.r.t. a l . We will formally quantify the gap between x t+1 and x t+1,(l) in the following lemma, which establishes the induction in (51b).

Lemma 4. Under the hypotheses (51), with probability at least
as long as the sample size obeys m n log n and the stepsize 0 < η ≤ 1/ [5C 2 (10 + C 2 ) log n].
Proof. The proof relies heavily on the restricted strong convexity (see Lemma 1) and is deferred to Appendix A.4. • Step 3(b): incoherence of the leave-one-out iterates. By construction, x t+1,(l) is statistically independent of the sampling vector a l . One can thus invoke the standard Gaussian concentration results and the union bound to derive that with probability at least 1 − O mn −10 , holds for some constant C 4 ≥ 6C 1 > 0 and n sufficiently large. Here, (i) comes from the triangle inequality, and (ii) arises from the proximity bound (53) and the condition (52). • Step 3(c): combining the bounds. We are now prepared to establish (51c) for the (t + 1)th iteration. Specifically, where (i) follows from the Cauchy-Schwarz inequality and (54), the inequality (ii) is a consequence of (53) and (98), and the last inequality holds as long as C 2 /(C 3 + C 4 ) is sufficiently large.
Using mathematical induction and the union bound, we establish (51) for all t ≤ T 0 = n with high probability. This in turn concludes the proof of Theorem 1, as long as the hypotheses are valid for the base case.

The base case: spectral initialization
In the end, we return to verify the induction hypotheses for the base case (t = 0), i.e. the spectral initialization obeys (51). The following lemma justifies (51a) by choosing δ sufficiently small.
Lemma 5. Fix any small constant δ > 0, and suppose m > c 0 n log n for some large constant c 0 > 0. Consider the two vectors x 0 and x 0 as defined in Algorithm 1, and suppose without loss of generality that (42) holds. Then with probability exceeding 1 − O(n −10 ), one has Proof. This result follows directly from the Davis-Kahan sinΘ theorem. See Appendix A.5.
We then move on to justifying (51b), the proximity between the original and leave-one-out iterates for t = 0.
Lemma 6. Suppose m > c 0 n log n for some large constant c 0 > 0. Then with probability at least 1 − O(mn −10 ), one has Proof. This is also a consequence of the Davis-Kahan sinΘ theorem. See Appendix A.6.
The final claim (51c) can be proved using the same argument as in deriving (55), and hence is omitted.

Analysis for matrix completion
In this section, we instantiate the general recipe presented in Section 5 to matrix completion and prove Theorem 2. Before continuing, we first gather a few useful facts regarding the loss function in (23). The gradient of it is given by We define the expected gradient (with respect to the sampling set Ω) to be and also the (expected) gradient without noise to be In addition, we need the Hessian ∇ 2 f clean (X), which is represented by an nr×nr matrix. Simple calculations reveal that for any V ∈ R n×r , where vec(V ) ∈ R nr denotes the vectorization of V .

7.1
Step 1: characterizing local geometry in the RIC

Local geometry
The first step is to characterize the region where the empirical loss function enjoys restricted strong convexity and smoothness in an appropriate sense. This is formally stated in the following lemma.
Lemma 7 (Restricted strong convexity and smoothness for matrix completion). Suppose that the sample size obeys n 2 p ≥ Cκ 2 µrn log n for some sufficiently large constant C > 0. Then with probability at least 1 − O n −10 , the Hessian ∇ 2 f clean (X) as defined in (61) obeys where 1/ κ 3 µr log 2 n and δ 1/κ.
Lemma 7 reveals that the Hessian matrix is well-conditioned in a neighborhood close to X that remains incoherent measured in the 2 / ∞ norm (cf. (63a)), and along directions that point towards points which are not far away from the truth in the spectral norm (cf. (63b)).
Remark 4. The second condition (63b) is characterized using the spectral norm · , while in previous works this is typically presented in the Frobenius norm · F . It is also worth noting that the Hessian matrixeven in the infinite-sample and noiseless case -is rank-deficient and cannot be positive definite. As a result, we resort to the form of strong convexity by restricting attention to certain directions (see the conditions on V ).

Error contraction
Our goal is to demonstrate the error bounds (28) measured in three different norms. Notably, as long as the iterates satisfy (28) at the tth iteration, then X t H t − X 2,∞ is sufficiently small. Under our sample complexity assumption, X t H t satisfies the 2 / ∞ condition (63a) required in Lemma 7. Consequently, we can invoke Lemma 7 to arrive at the following error contraction result.
Lemma 8 (Contraction w.r.t. the Frobenius norm). Suppose n 2 p ≥ Cκ 3 µ 3 r 3 n log 3 n and the noise satisfies (27). If the iterates satisfy (28a) and (28b) at the tth iteration, then with probability at least 1 − O(n −10 ), Further, if the current iterate satisfies all three conditions in (28), then we can derive a stronger sense of error contraction, namely, contraction in terms of the spectral norm.

Proof.
The key observation is this: the iterate that proceeds according to the population-level gradient reduces the error w.r.t. · , namely, as long as X t H t is sufficiently close to the truth. Notably, the orthonormal matrix H t is still chosen to be the one that minimizes the · F distance (as opposed to · ), which yields a symmetry property

7.2
Step 2: introducing the leave-one-out sequences In order to establish the incoherence properties (28b) for the entire trajectory, which is difficult to deal with directly due to the complicated statistical dependence, we introduce a collection of leave-one-out versions of {X t } t≥0 , denoted by X t,(l) t≥0 for each 1 ≤ l ≤ n. Specifically, X t,(l) t≥0 is the iterates of gradient descent operating on the auxiliary loss function Here, P Ω l (resp. P Ω −l and P l ) represents the orthogonal projection onto the subspace of matrices which vanish outside of the index set Ω l : (67) The gradient of the leave-one-out loss function (65) is given by The full algorithm to obtain the leave-one-out sequence {X t,(l) } t≥0 (including spectral initialization) is summarized in Algorithm 5.
Algorithm 5 The lth leave-one-out sequence for matrix completion with P Ω −l and P l defined in (67), and set X 0,(l) = U 0,(l) Σ (l) 1/2 . Gradient updates: for t = 0, 1, 2, . . . , T − 1 do Remark 5. Rather than simply dropping all samples in the lth row/column, we replace the lth row/column with their respective population means. In other words, the leave-one-out gradient forms an unbiased surrogate for the true gradient, which is particularly important in ensuring high estimation accuracy.

Step 3: establishing the incoherence condition by induction
We will continue the proof of Theorem 2 in an inductive manner. As seen in Section 7.1.2, the induction hypotheses (28a) and (28c) hold for the (t+1)th iteration as long as (28) holds at the tth iteration. Therefore, we are left with proving the incoherence hypothesis (28b) for all 0 ≤ t ≤ T = O(n 5 ). For clarity of analysis, it is crucial to maintain a list of induction hypotheses, which includes a few more hypotheses that complement (28), and is given below.
In the sequel, we follow the general recipe outlined in Section 5 to establish the induction hypotheses. We only need to establish (70b), (70d) and (70e) for the (t + 1)th iteration, since (70a) and (70c) have been established in Section 7.1.2. Specifically, we resort to the leave-one-out iterates by showing that: first, the true and the auxiliary iterates remain exceedingly close throughout; second, the lth leave-one-out sequence stays incoherent with e l due to statistical independence.
• Step 3(a): proximity between the original and the leave-one-out iterates. We demonstrate that X t+1 is well approximated by X t+1,(l) , up to proper orthonormal transforms. This is precisely the induction hypothesis (70d) for the (t + 1)th iteration.
Proof. The fact that this difference is well-controlled relies heavily on the benign geometric property of the Hessian revealed by Lemma 7. Two important remarks are in order: (1) both points X t H t and X t,(l) R t,(l) satisfy (63a); (2) the difference X t H t − X t,(l) R t,(l) forms a valid direction for restricted strong convexity. These two properties together allow us to invoke Lemma 7. See Appendix B.5. • Step 3(b): incoherence of the leave-one-out iterates. Given that X t+1,(l) is sufficiently close to X t+1 , we turn our attention to establishing the incoherence of this surrogate X t+1,(l) w.r.t. e l . This amounts to proving the induction hypothesis (70e) for the (t + 1)th iteration.

Proof.
The key observation is that X t+1,(l) is statistically independent from any sample in the lth row/column of the matrix. Since there are an order of np samples in each row/column, we obtain enough information that helps establish the desired incoherence property. See Appendix B.6. • Step 3(c): combining the bounds. The inequalities (70d) and (70e) taken collectively allow us to establish the induction hypothesis (70b). Specifically, for every 1 ≤ l ≤ n, write and the triangle inequality gives The second term has already been bounded by (75). Since we have established the induction hypotheses (70c) and (70d) for the (t+1)th iteration, the first term can be bounded by (73a) for the (t+1)th iteration, i.e.

The base case: spectral initialization
Finally, we return to check the base case, namely, we aim to show that the spectral initialization satisfies the induction hypotheses (70a)-(70e) for t = 0. This is accomplished via the following lemma.
Proof. This follows by invoking the Davis-Kahan sinΘ theorem [DK70] as well as the entrywise eigenvector perturbation analysis in [AFWZ17]. We defer the proof to Appendix B.7.

Analysis for blind deconvolution
In this section, we instantiate the general recipe presented in Section 5 to blind deconvolution and prove Theorem 3. Without loss of generality, we assume throughout that h 2 = x 2 = 1.
Before presenting the analysis, we first gather some simple facts about the empirical loss function in (32). Recall the definition of z in (33), and for notational simplicity, we write f (z) = f (h, x). Since z is complex-valued, we need to resort to Wirtinger calculus; see [CLS15, Section 6] for a brief introduction. The Wirtinger gradient of (32) with respect to h and x are given respectively by It is worth noting that the formal Wirtinger gradient contains ∇ h f (h, x) and ∇ x f (h, x) as well. Nevertheless, since f (h, x) is a real-valued function, the following identities always hold and In light of these observations, one often omits the gradient with respect to the conjugates; correspondingly, the gradient update rule (35) can be written as We can also compute the Wirtinger Hessian of f (z) as follows, where Last but not least, we say (h 1 , x 1 ) is aligned with (h 2 , x 2 ), if the following holds, To simplify notations, define z t as with the alignment parameter α t given in (38). Then we can see that z t is aligned with z and 8.1 Step 1: characterizing local geometry in the RIC

Local geometry
The first step is to characterize the region of incoherence and contraction (RIC), where the empirical loss function enjoys restricted strong convexity and smoothness properties. To this end, we have the following lemma.
Lemma 14 (Restricted strong convexity and smoothness for blind deconvolution). Let c > 0 be a sufficiently small constant and δ = c/ log 2 m.
Suppose the sample size satisfies m ≥ c 0 µ 2 K log 9 m for some sufficiently large constant c 0 > 0. Then with probability at least 1 − O m −10 + e −K log m , the Wirtinger Hessian ∇ 2 f (z) obeys and x 2 ), and they satisfy and finally, D satisfies for γ 1 , γ 2 ∈ R, Here, C 3 , C 4 > 0 are numerical constants.
Lemma 14 characterizes the restricted strong convexity and smoothness of the loss function used in blind deconvolution. To the best of our knowledge, this provides the first characterization regarding geometric properties of the Hessian matrix for blind deconvolution. A few interpretations are in order.
• The conditions (82) specify the region of incoherence and contraction (RIC). In particular, (82a) specifies a neighborhood that is close to the ground truth in 2 norm, and (82b) and (82c) specify the incoherence region with respect to the sensing vectors {a j } and {b j }, respectively.
• Similar to matrix completion, the Hessian matrix is rank-deficient even at the population level. Consequently, we resort to a restricted form of strong convexity by focusing on certain directions. More specifically, these directions can be viewed as the difference between two pre-aligned points that are not far from the truth, which is characterized by (83).
• Finally, the diagonal matrix D accounts for scaling factors that are not too far from 1 (see (84)), which allows us to account for different step sizes employed for h and x.

Error contraction
The restricted strong convexity and smoothness allow us to establish the contraction of the error measured in terms of dist(·, z ) as defined in (34) as long as the iterates stay in the RIC.
Lemma 15. Suppose the number of measurements satisfies m µ 2 K log 9 m, and the step size η > 0 is some sufficiently small constant. Then for some constants C 3 , C 4 > 0. Here, h t and x t are defined in (81), and ξ 1/ log 2 m.
As a result, if z t satisfies the condition (85) for all 0 ≤ t ≤ T , then the union bound gives where ρ := 1 − η/16. Furthermore, similar to the case of phase retrieval (i.e. Lemma 3), as soon as we demonstrate that the conditions (85) hold for all 0 ≤ t ≤ m, then Theorem 3 holds true. The proof of this claim is exactly the same as for Lemma 3, and is thus omitted for conciseness. In what follows, we focus on establishing (85) for all 0 ≤ t ≤ m. Before concluding this subsection, we make note of another important result that concerns the alignment parameter α t , which will be useful in the subsequent analysis. Specifically, the alignment parameter sequence {α t } converges linearly to a constant whose magnitude is fairly close to 1, as long as the two initial vectors h 0 and x 0 have similar 2 norms and are close to the truth. Given that α t determines the global scaling of the iterates, this reveals rapid convergence of both h t 2 and x t 2 , which explains why there is no need to impose extra terms to regularize the 2 norm as employed in [LLSW16, HH17].
Lemma 16. Suppose that m 1. The following two claims hold true.
The initial condition |α 0 | − 1 < 1/4 will be guaranteed to hold with high probability by Lemma 19.

8.2
Step 2: introducing the leave-one-out sequences As demonstrated by the assumptions in Lemma 15, the key is to show that the whole trajectory lies in the region specified by (85a)-(85c). Once again, the difficulty lies in the statistical dependency between the iterates {z t } and the measurement vectors {a j }. We follow the general recipe and introduce the leave-oneout sequences, denoted by h t,(l) , x t,(l) t≥0 for each 1 ≤ l ≤ m. Specifically, h t,(l) , x t,(l) t≥0 is the gradient sequence operating on the loss function The whole sequence is constructed by running gradient descent with spectral initialization on the leave-oneout loss (86). The precise description is supplied in Algorithm 6.

Step 3: establishing the incoherence condition by induction
As usual, we continue the proof in an inductive manner. For clarity of presentation, we list below the set of induction hypotheses underlying our analysis: where h t , x t and z t are defined in (81). Here, C 1 , C 3 > 0 are some sufficiently small constants, while C 2 , C 4 > 0 are some sufficiently large constants. We aim to show that if these hypotheses (90) hold up to the tth iteration, then the same would hold for the (t + 1)th iteration with exceedingly high probability (e.g. 1 − O(m −10 )). The first hypothesis (90a) has already been established in Lemma 15, and hence the rest of this section focuses on establishing the remaining three. To justify the incoherence hypotheses (90c) and (90d) for the (t + 1)th iteration, we need to leverage the nice properties of the leave-one-out sequences, and establish (90b) first. In the sequel, we follow the steps suggested in the general recipe. • Step 3(a): proximity between the original and the leave-one-out iterates. We first justify the hypothesis (90b) for the (t + 1)th iteration via the following lemma.
Lemma 17. Suppose the sample complexity obeys m µ 2 K log 9 m. There exists a constant c > 0 such that under the hypotheses (90a)-(90d) for the tth iteration, with probability at least provided that the step size η > 0 is some sufficiently small constant.
Proof. As usual, this result follows from the restricted strong convexity, which forces the distance between the two sequences of interest to be contractive. See Appendix C.3.
with probability exceeding 1 − O m −10 + e −K log m . To see why, use the statistical independence and the standard Gaussian concentration inequality to show that with probability exceeding 1 − O(m −10 ). It then follows from the triangle inequality that where (i) follows from Lemmas 15 and 17, and (ii) holds as soon as m µ 2 √ K log 13/2 m. Combining the preceding two bounds establishes (91).
• Step 3(c): combining the bounds to show incoherence of x t+1 w.r.t. {a l }. The above bounds immediately allow us to conclude that with probability at least 1 − O m −10 + e −K log m , which is exactly the hypothesis (90c) for the (t + 1)th iteration. Specifically, for each 1 ≤ l ≤ m, the triangle inequality yields Here (i) follows from Cauchy-Schwarz, (ii) is a consequence of (190), Lemma 17 and the bound (91), and the last inequality holds as long as m µ 2 K log 6 m and C 3 ≥ 11C 1 . • Step 3(d): incoherence of h t+1 w.r.t. {b l }. It remains to justify that h t+1 is also incoherent w.r.t. its associated design vectors {b l }. This proof of this step, however, is much more involved and challenging, due to the deterministic nature of the b l 's. As a result, we would need to "propagate" the randomness brought about by {a l } to h t+1 in order to facilitate the analysis. The result is summarized as follows.
Lemma 18. Suppose that the sample complexity obeys m µ 2 K log 9 m. Under the inductive hypotheses (90a)-(90d) for the tth iteration, with probability exceeding 1 − O(m −10 ) we have as long as C 4 is sufficiently large, and η > 0 is taken to be some sufficiently small constant.
Proof. The key idea is to divide {1, · · · , m} into consecutive bins each of size poly log(m), and to exploit the randomness (namely, the randomness from a l ) within each bin. This binning idea is crucial in ensuring that the incoherence measure of interest does not blow up as t increases. See Appendix C.4.
With these steps in place, we conclude the proof of Theorem 3 via induction and the union bound.

The base case: spectral initialization
In order to finish the induction steps, we still need to justify the induction hypotheses for the base cases, namely, we need to show that the spectral initializations z 0 and z 0,(l) 1≤l≤m satisfy the induction hypotheses (90) at t = 0.
To start with, the initializations are sufficiently close to the truth when measured by the 2 norm, as summarized by the following lemma.
From the definition of dist(·, ·) (cf. (34)), we immediately have as long as m ≥ Cµ 2 K log 6 m for some sufficiently large constant C > 0. Here (i) follows from the elementary inequality that a 2 + b 2 ≤ (a + b) 2 for positive a and b, (ii) holds since the feasible set of the latter one is strictly smaller, and (iii) follows directly from Lemma 19. This finishes the proof of (90a) for t = 0. Similarly, with high probability we have Next, when properly aligned, the true initial estimate z 0 and the leave-one-out estimate z 0,(l) are expected to be sufficiently close, as claimed by the following lemma. Along the way, we show that h 0 is incoherent w.r.t. the sampling vectors {b l }. This establishes (90b) and (90d) for t = 0.
Lemma 20. Suppose that m µ 2 K log 3 m. Then with probability at least 1 − O(m −10 ), one has Proof.
The key is to establish that dist z 0,(l) , z 0 can be upper bounded by some linear scaling of b * l h 0 , and vice versa. This allows us to derive bounds simultaneously for both quantities. See Appendix C.6. Proof. See Appendix C.7.

Discussions
This paper showcases an important phenomenon in nonconvex optimization: even without explicit enforcement of regularization, the vanilla form of gradient descent effectively achieves implicit regularization for a large family of statistical estimation problems. We believe this phenomenon arises in problems far beyond the three cases studied herein, and our results are initial steps towards understanding this fundamental phenomenon. That being said, there are numerous avenues open for future investigation, and we conclude the paper with a few of them.
• Improving sample complexity. In the current paper, the required sample complexity O µ 3 r 3 n log 3 n for matrix completion is sub-optimal when the rank r of the underlying matrix is large. While this allows us to achieve a dimension-free iteration complexity, it is slightly higher than the sample complexity derived for regularized gradient descent in [CW15]. We expect our results continue to hold under lower sample complexity O µ 2 r 2 n log n , but it calls for a more refined analysis (e.g. a generic chaining argument).
• Leave-one-out tricks for more general designs. So far our focus is on independent designs, including the i.i.d. Gaussian design adopted in phase retrieval and partially in blind deconvolution, as well as the independent sampling mechanism in matrix completion. Such independence property creates some sort of "statistical homogeneity", for which the leave-one-out argument works beautifully. It remains unclear how to generalize such leave-one-out tricks for more general designs (e.g. more general sampling patterns in matrix completion and more structured Fourier designs in phase retrieval and blind deconvolution). In fact, the readers can already get a flavor of this issue in the analysis of blind deconvolution, where the Fourier design vectors require much more delicate treatments than purely Gaussian designs.
• Uniform stability. The leave-one-out perturbation argument is established upon a basic fact: when we exclude one sample from consideration, the resulting estimates/predictions do not deviate much from the original ones. This leave-one-out stability bears similarity to the notion of uniform stability studied in statistical learning theory [BE02, LLNT17]. We expect our analysis framework to be helpful for analyzing other learning algorithms that are uniformly stable.
• Constrained optimization. We restrict ourselves to study empirical risk minimization problems in an unconstrained setting. It will be interesting to explore if such implicit regularization still holds for constrained nonconvex problems.
• Other iterative methods. Iterative methods other than gradient descent have been extensively studied in the nonconvex optimization literature, including alternating minimization, proximal methods, etc. Identifying the implicit regularization feature for a broader class of iterative algorithms is another direction worth exploring.
• Connections to deep learning? We have focused on nonlinear systems that are bilinear or quadratic in this paper. Deep learning formulations/architectures, highly nonlinear, are notorious for its daunting nonconvex geometry. However, iterative methods including stochastic gradient descent have enjoyed enormous practical success in learning neural networks (e.g. [ZSJ + 17, SJL17]), even when the architecture is significantly over-parameterized without explicit regularization. We hope the message conveyed in this paper for several simple statistical models can shed light on why simple forms of gradient descent and variants work so well in learning complicated neural networks.

A.1 Proof of Lemma 1
We start with the smoothness bound, namely, ∇ 2 f (x) O(log n) · I n . It suffices to prove the upper bound ∇ 2 f (x) log n. To this end, we first decompose the Hessian (cf. (44)) into three components as follows: a j x 2 a j a j − 2 I n + 2x x :=Λ2 where we have used y j = (a j x ) 2 . In the sequel, we control the three terms Λ 1 , Λ 2 and Λ 3 in reverse order.
• The second term Λ 2 can be controlled by means of Lemma 32: for an arbitrarily small constant δ > 0, as long as m ≥ c 0 n log n for c 0 sufficiently large.
• It thus remains to control Λ 1 . Towards this we discover that Under the assumption max 1≤j≤m a j x − x ≤ C 2 √ log n and the fact (99), we can also obtain Substitution into (100) leads to Λ 1 ≤ 3C 2 (10 + C 2 ) log n · 1 m m j=1 a j a j ≤ 4C 2 (10 + C 2 ) log n, where the last inequality is a direct consequence of Lemma 31.
Next we move on to the strong convexity lower bound. Picking a constant C > 0 and enforcing proper truncation, we get .
Recognizing that β 1 (resp. β 2 ) approaches 2 (resp. 1) as C grows, we can thus take C 1 small enough and C large enough to guarantee that Λ 4 5x x + 2I n .
Putting the preceding two bounds on Λ 4 and Λ 5 together yields (1/2) · I n as claimed.

A.3 Proof of Lemma 3
We start with proving (19a). For all 0 ≤ t ≤ T 0 , invoke Lemma 2 recursively with the conditions (47) to reach This finishes the proof of (19a) for 0 ≤ t ≤ T 0 and also reveals that provided that η 1/ log n. Applying the Cauchy-Schwarz inequality and the fact (98) indicate that leading to the satisfaction of (45). Therefore, invoking Lemma 2 yields One can then repeat this argument to arrive at for all t > T 0 We are left with (19b). It is self-evident that the iterates from 0 ≤ t ≤ T 0 satisfy (19b) by assumptions. For t > T 0 , we can use the Cauchy-Schhwarz inequality to obtain where the penultimate relation uses the conditions (98) and (103).

A.4 Proof of Lemma 4
First, going through the same derivation as in (54) and (55) will result in max 1≤l≤m a l x t,(l) − x ≤ C 4 log n (104) for some C 4 < C 2 , which will be helpful for our analysis. We use the gradient update rules once again to decompose where the last line comes from the definition of ∇f (·) and ∇f (l) (·).
1. We first control the term ν (l) 2 , which is easier to deal with. Specifically, C 4 (C 4 + 5)(C 4 + 10)η n log n m log n n (ii) ≤ cη log n n , for any small constant c > 0. Here (i) follows since (98) and, in view of (99) and (104), and a l x t,(l) ≤ a l x t,(l) − x + a l x ≤ (C 4 + 5) log n.
And (ii) holds as long as m n log n.

For the term ν
where we abuse the notation and denote x (τ ) = x t,(l) + τ (x t − x t,(l) ). By the induction hypotheses (51) and the condition (104), one can verify that a l x t,(l) − x ≤ C 2 log n for all 0 ≤ τ ≤ 1, as long as C 4 ≤ C 2 . The second line follows directly from (104). To see why (105) holds, we note that where the second inequality follows from the induction hypotheses (51b) and (51a). This combined with (51a) gives as long as n is large enough, thus justifying (105). Hence by Lemma 1, ∇ 2 f (x (τ )) is positive definite and almost well-conditioned. By choosing 0 < η ≤ 1/ [5C 2 (10 + C 2 ) log n], we get 3. Combine the preceding bounds on ν (l) 1 and ν (l) 2 as well as the induction bound (51b) to arrive at This establishes (53) for the (t + 1)th iteration.

A.5 Proof of Lemma 5
In view of the assumption (42) that x 0 − x 2 ≤ x 0 + x 2 and the fact that x 0 = λ 1 (Y ) /3 x 0 for some λ 1 (Y ) > 0 (which we will verify below), it is straightforward to see that One can then invoke the Davis-Kahan sinΘ theorem [YWS15, Corollary 1] to obtain .
Combining this spectral gap and the inequality Y − E[Y ] ≤ δ, we arrive at To connect this bound with x 0 , we need to take into account the scaling factor λ 1 (Y ) /3. To this end, it follows from Weyl's inequality and (56) that and, as a consequence, λ 1 (Y ) ≥ 3 − δ > 0 when δ ≤ 1. This further implies that where we have used the elementary identity . With these bounds in place, we can use the triangle inequality to get

A.6 Proof of Lemma 6
To begin with, repeating the same argument as in Lemma 5 (which we omit here for conciseness), we see that for any fixed constant δ > 0, holds with probability at least 1 − O(mn −10 ) as long as m n log n. The 2 bound on x 0 − x 0,(l) 2 is derived as follows. 1. We start by controlling x 0 − x 0,(l) 2 . Combining (57) and (108) yields For δ sufficiently small, this implies that x 0 − x 0,(l) 2 ≤ x 0 + x 0,(l) 2 , and hence the Davis-Kahan sinΘ theorem [DK70] gives Here, the second inequality uses Weyl's inequality: with the proviso that δ ≤ 1/2.

Everything then boils down to controlling
The inequality (i) makes use of the fact max l a l x ≤ 5 √ log n (cf. (99)), the bound max l a l 2 ≤ 6 √ n (cf. (98)), and max l a l x 0,(l) ≤ 5 √ log n (due to statistical independence and standard Gaussian concentration). As long as m/(n log n) is sufficiently large, substituting the above bound (112) into (111) leads us to conclude that for any constant C 3 > 0.

B Proofs for matrix completion
Before proceeding to the proofs, let us record an immediate consequence of the incoherence property (25): where κ = σ max /σ min is the condition number of M . This follows since Unless otherwise specified, we use the indicator variable δ j,k to denote whether the entry in the location (j, k) is included in Ω. Under our model, δ j,k is a Bernoulli random variable with mean p.

B.1 Proof of Lemma 7
By the expression of the Hessian in (61), one can decompose The basic idea is to demonstrate that: (1) α 4 is bounded both from above and from below, and (2) the first three terms are sufficiently small in size compared to α 4 . 1. We start by controlling α 4 . It is immediate to derive the following upper bound When it comes to the lower bound, one discovers that where the last line comes from the assumptions that With our assumption V = Y H Y − Z in mind, it comes down to controlling From the definition of H Y , we see from Lemma 35 that Z Y H Y (and hence Z (Y H Y − Z)) is a symmetric matrix, which implies that Substitution into (115) gives provided that κδ ≤ 1/50.

2.
For α 1 , we consider the following quantity Similar decomposition can be performed on P Ω V X + X V 2 F as well. These identities yield .
For β 2 , one has This then calls for upper bounds on the following two terms The injectivity of P Ω (cf. [CR09, Section 4.2] or Lemma 38)-when restricted to the tangent space of M -gives: for any fixed constant γ > 0, with probability at least 1 − O n −10 , provided that n 2 p/(µnr log n) is sufficiently large. In addition, with probability exceeding 1 − O n −10 , which holds as long as np/ log n is sufficiently large. Taken collectively, the above bounds yield that for any small constant γ > 0, where the last inequality makes use of the assumption X − X 2,∞ ≤ X 2,∞ . The same analysis can be repeated to control β 1 . Altogether, we obtain where (i) utilizes the incoherence condition (114) and (ii) holds with the proviso that κ 3 µr 1.
3. To bound α 2 , apply the Cauchy-Schwarz inequality to get In view of Lemma 43, with probability at least 1 − O n −10 , ≤ 2n 2 κµr n + 4 √ n log n κµr n σ max ≤ 1 10 σ min as soon as κ 3 µr log n 1, where we utilize the incoherence condition (114). This in turn implies that |α 2 | ≤ 1 10 σ min V 2 F . Notably, this bound holds uniformly over all X satisfying the condition in Lemma 7, regardless of the statistical dependence between X and the sampling set Ω.
4. The last term α 3 can also be controlled using the injectivity of P Ω when restricted to the tangent space of M . Specifically, it follows from the bounds in [CR09, Section 4.2] or Lemma 38 that for any γ > 0 such that κγ is a small constant, as soon as n 2 p κ 2 µrn log n.

Taking all the preceding bounds collectively yields
for all V satisfying our assumptions, and for all V . Since this upper bound holds uniformly over all V , we conclude that ∇ 2 f clean (X) ≤ 5 2 σ max as claimed.

B.2 Proof of Lemma 8
Given that H t+1 is chosen to minimize the error in terms of the Frobenius norm (cf. (26)), we have where (i) follows from the identity ∇f (X t R) = ∇f (X t ) R for any orthonormal matrix R ∈ O r×r , (ii) arises from the definitions of ∇f (X) and ∇f clean (X) (see (59) and (60), respectively), and the last inequality (117) utilizes the triangle inequality and the fact that ∇f clean (X ) = 0. It thus suffices to control α 1 and α 2 .
1. For the second term α 2 in (117), it is easy to see that for some absolute constant C > 0. Here, the second inequality holds because X t H t F ≤ X t H t − X F + X F ≤ 2 X F , following the hypothesis (28a) together with our assumptions on the noise and the sample complexity. The last inequality makes use of Lemma 40.

For the first term α 1 in (117), the fundamental theorem of calculus [Lan93, Chapter XIII, Theorem 4.2] reveals
where we denote X(τ ) := X + τ (X t H t − X ). Taking the squared Euclidean norm of both sides of the equality (118) leads to where in (119) we have used the fact that Based on the condition (28b), it is easily seen that ∀τ ∈ [0, 1], X (τ ) − X 2,∞ ≤ C 5 µr log n np + C 8 σ min σ n log n p X 2,∞ .
Taking X = X (τ ) , Y = X t and Z = X in Lemma 7, one can easily verify the assumptions therein given our sample size condition n 2 p κ 3 µ 3 r 3 n log 3 n and the noise condition (27). As a result, Substituting these two inequalities into (119) yields as long as 0 < η ≤ (2σ min )/(25σ 2 max ), which further implies that 3. Combining the preceding bounds on both α 1 and α 2 and making use of the hypothesis (28a), we have as long as 0 < η ≤ (2σ min )/(25σ 2 max ), 1 − (σ min /4) · η ≤ ρ < 1 and C 1 is sufficiently large. This completes the proof of the contraction with respect to the Frobenius norm.

B.3 Proof of Lemma 9
To facilitate analysis, we construct an auxiliary matrix defined as follows With this auxiliary matrix in place, we invoke the triangle inequality to bound 1. We start with the second term α 2 and show that the auxiliary matrix X t+1 is also not far from the truth. The definition of X t+1 allows one to express where we have used the triangle inequality to separate the population-level component (i.e. β 1 ), the perturbation (i.e. β 2 ), and the noise component. In what follows, we will denote which, by Lemma 35, satisfies the following symmetry property (a) The population-level component β 1 is easier to control. Specifically, we first simplify its expression as The leading term γ 1 can be upper bounded by where the second identity follows from the symmetry property (124). By choosing η ≤ 1/(2σ max ), one has 0 I r − 2ηΣ (1 − 2ησ min ) I r and 0 I r − 2ηM I r , and further one can ensure Next, regarding the higher order term γ 2 , we can easily obtain The bounds (125) and (126) taken collectively give (b) We now turn to the perturbation part β 2 by showing that where the last inequality holds due to the triangle inequality as well as the fact that A ≤ A F . In the sequel, we shall bound the three terms separately.
• For the first term θ 1 in (128), the lth row of 1 p P Ω ∆ t X X − ∆ t X X is given by where, as usual, δ l,j = 1 {(l,j)∈Ω} . Lemma 41 together with the union bound reveals that 1 p n j=1 (δ l,j − p) X j,· X j,· 1 p p X 2 2,∞ X 2 log n + X 2 2,∞ log n X 2 2,∞ σ max log n p + X 2 2,∞ log n p for all 1 ≤ l ≤ n with high probability. This gives which further reveals that for arbitrarily small γ > 0. Here, (i) follows from ∆ t F ≤ √ r ∆ t , (ii) holds owing to the incoherence condition (114), and (iii) follows as long as n 2 p κ 3 µr 2 n log n. • For the second term θ 2 in (128), denote whose lth row is given by Recalling the induction hypotheses (28b) and (28c), we define With these two definitions in place, we now introduce a "truncation level" that allows us to bound θ 2 in terms of the following two terms .
We will apply different strategies when upper bounding the terms φ 1 and φ 2 , with their bounds given in the following two lemmas under the induction hypotheses (28b) and (28c). holds simultaneously for all ∆ t obeying (130) and (131). Here, ξ is defined in (130).
• Next, we assert that the third term θ 3 in (128) has the same upper bound as θ 2 . The proof follows by repeating the same argument used in bounding θ 2 , and is hence omitted.
(c) Substituting the preceding bounds on β 1 and β 2 into (123), we reach for some constant C > 0. Here, (i) uses the definition of ξ (cf. (130)), (ii) holds if γ is small enough and ∆ t X σ min , and (iii) follows from Lemma 40 as well as the incoherence condition (114). An immediate consequence of (135) is that under the sample size condition and the noise condition of this lemma, one has X t+1 − X X ≤ σ min /2 (136) if 0 < η ≤ 1/σ max .
2. We then move on to the first term α 1 in (121), which can be rewritten as (a) First, we claim that X t+1 satisfies meaning that X t+1 is already rotated to the direction that is most "aligned" with X . This important property eases the analysis. In fact, in view of Lemma 35, (138) follows if one can show that X X t+1 is symmetric and positive semidefinite. First of all, it follows from Lemma 35 that X X t H t is symmetric and, hence, by definition, is also symmetric. Additionally, where the second inequality holds according to (136). Weyl's inequality guarantees that X X t+1 1 2 σ min I r , thus justifying (138) via Lemma 35.
(b) With (137) and (138) in place, we resort to Lemma 37 to establish the bound. Specifically, take X 1 = X t+1 and X 2 = X t+1 H t , and it comes from (136) that This allows one to derive for some absolute constant C > 0. Here the last inequality follows from Lemma 40 and Lemma 43. As a consequence, Under our sample size condition and the noise condition (27) and the induction hypotheses (28), one can show X 1 − X 2 X ≤ σ min /4.

B.3.1 Proof of Lemma 22
In what follows, we first assume that the δ j,k 's are independent, and then use the standard decoupling trick to extend the result to symmetric sampling case (i.e. δ j,k = δ k,j ).
To begin with, we justify the concentration bound for any ∆ t independent of Ω, followed by the standard covering argument that extends the bound to all ∆ t . For any ∆ t independent of Ω, one has where ξ and ψ are defined respectively in (130) and (131). Here, the last line makes use of the fact that as long as n is sufficiently large. Apply the matrix Bernstein inequality [Tro15b, Theorem 6.1.1] to get This upper bound on t is exactly the truncation level ω we introduce in (132). With this in mind, we can easily verify that A l,· 2 1 { A l,· 2 ≤ω} is a sub-Gaussian random variable with variance proxy not exceeding O pξ 2 σ max X 2 2,∞ log r . Therefore, invoking the concentration bounds for quadratic functions [HKZ12, Theorem 2.1] yields that for some constants C 0 , C > 0, with probability at least 1 − C 0 e −Cnr log n , Now that we have established an upper bound on any fixed matrix ∆ t (which holds with exponentially high probability), we can proceed to invoke the standard epsilon-net argument to establish a uniform bound over all feasible ∆ t . This argument is fairly standard, and is thus omitted; see [Tao12, Section 2.3.1] or the proof of Lemma 42. In conclusion, we have that with probability exceeding 1 − C 0 e − 1 2 Cnr log n , holds simultaneously for all ∆ t ∈ R n×r obeying the conditions of the lemma.
In the end, we comment on how to extend the bound to the symmetric sampling pattern where δ j,k = δ k,j . Recall from (129) that the diagonal element δ l,l cannot change the 2 norm of A l,· by more than X 2 2,∞ ξ. As a result, changing all the diagonals {δ l,l } cannot change the quantity of interest (i.e. φ 1 ) by more than √ n X 2 2,∞ ξ. This is smaller than the right hand side of (141) under our incoherence and sample size conditions. Hence from now on we ignore the effect of {δ l,l } and focus on off-diagonal terms. The proof then follows from the same argument as in [GLM16, Theorem D.2]. More specifically, we can employ the construction of Bernoulli random variables introduced therein to demonstrate that the upper bound in (141) still holds if the indicator δ i,j is replaced by (τ i,j + τ i,j )/2, where τ i,j and τ i,j are independent copies of the symmetric Bernoulli random variables. Recognizing that sup ∆ t φ 1 is a norm of the Bernoulli random variables τ i,j , one can repeat the decoupling argument in [GLM16, Claim D.3] to finish the proof. We omit the details here for brevity.

B.3.2 Proof of Lemma 23
Observe from (129) that where ψ is as defined in (131) and G l (·) is as defined in Lemma 41. Here, the last inequality follows from Lemma 41, namely, for some constant C > 0, the following holds with probability at least 1 − O(n −10 ) ≤ p + C p κµr n log n + C κµr log n n 1 2 where we also use the incoherence condition (114) and the sample complexity condition n 2 p κµrn log n. Hence, the event A l,· 2 ≥ ω = 2pσ max ξ together with (142) and (143) necessarily implies that where the last inequality follows from the bound (140). As a result, with probability at least 1 − O(n −10 ) (i.e. when (144) holds for all l's) we can upper bound φ 2 by where the indicator functions are now specified with respect to G l (∆ t ) . Next, we divide into multiple cases based on the size of G l (∆ t ) . By Lemma 42, for some constants c 1 , c 2 > 0, with probability at least 1 − c 1 exp (−c 2 nr log n), for any k ≥ 0 and any α log n. We claim that it suffices to consider the set of sufficiently large k obeying otherwise we can use (140) to obtain which contradicts the event A l,· 2 ≥ ω. Consequently, we divide all indices into the following sets meaning that the cardinality of S k satisfies which decays exponentially fast as k increases. Therefore, when restricting attention to the set of indices within S k , we can obtain where (i) follows from the bound (143) and the constraint (147) in S k , (ii) is a consequence of (146) and (iii) uses the incoherence condition (114). Now that we have developed an upper bound with respect to each S k , we can add them up to yield the final upper bound. Note that there are in total no more than O (log n) different sets, i.e. S k = ∅ if k ≥ c 1 log n for c 1 sufficiently large. This arises since · log n, leading to φ 2 ξ ακµr 2 p log n X 2 . The proof is finished by taking α = c log n for some sufficiently large constant c > 0.
2. The first inequality in (73b) follows directly from the definition of H t,(l) . The second inequality is concerned with the estimation error of X t,(l) R t,(l) with respect to the Frobenius norm. Combining (70a), (70d) and the triangle inequality yields where the last step holds true as long as n κµ log n.
4. Finally, to obtain (73d), one can take the triangle inequality where the second line follows from (73a). Combine (70d) and (70c) to yield where the second inequality uses the incoherence of X (cf. (114)) and the last inequality holds as long as n κ 3 µr log n.

B.5 Proof of Lemma 11
From the definition of R t+1,(l) (see (72)), we must have The gradient update rules in (24) and (69) allow one to express where we have again used the fact that ∇f (X t ) R = ∇f (X t R) for any orthonormal matrix R ∈ O r×r (similarly for ∇f (l) X t,(l) ). Relate the right-hand side of the above equation with ∇f clean (X) to reach where we have used the following relationship between ∇f (l) (X) and ∇f (X): for all X ∈ R n×r with P Ω l and P l defined respectively in (66) and (67). In the sequel, we control the four terms in reverse order.

The last term B
(l) 4 is controlled via the following lemma.
Lemma 24. Suppose that the sample size obeys n 2 p > Cµ 2 r 2 n log 2 n for some sufficiently large constant C > 0. Then with probability at least 1 − O n −10 , the matrix B

The third term B
(l) 3 can be bounded as follows where the second inequality comes from Lemma 40.

For the second term B
(l) 2 , we have the following lemma.
Lemma 25. Suppose that the sample size obeys n 2 p µ 2 r 2 n log n. Then with probability exceeding 1 − O n −10 , the matrix B (l) 2 as defined in (149) satisfies where we abuse the notation and denote X(τ ) := X t,(l) R t,(l) + τ X t H t − X t,(l) R t,(l) . Going through the same derivations as in the proof of Lemma 8 (see Appendix B.2), we get with the proviso that 0 < η ≤ (2σ min )/(25σ 2 max ). Applying the triangle inequality to (149) and invoking the preceding four bounds, we arrive at for some absolute constant C > 0. Here the last inequality holds as long as σ n/p σ min , which is satisfied under our noise condition (27). This taken collectively with the hypotheses (70d) and (73c) leads to 2σ min 9 η C 3 ρ t µr log n np X 2,∞ + C 7 σ σ min n log n p X 2,∞ + Cη κ 2 µ 2 r 2 log n np (C 3 + C 5 ) ρ t µr log n np + (C 8 + C 7 ) σ σ min n log n p X 2,∞ σ max + Cησ n log n p X 2,∞ ≤ 1 − σ min 5 η C 3 ρ t µr log n np X 2,∞ + C 7 σ σ min n log n p X 2,∞ as long as C 7 > 0 is sufficiently large, where we have used the sample complexity assumption n 2 p κ 4 µ 2 r 2 n log n and the step size 0 < η ≤ 1/(2σ max ) ≤ 1/(2σ min ). This finishes the proof.

B.5.1 Proof of Lemma 24
By the unitary invariance of the Frobenius norm, one has where all nonzero entries of the matrix P Ω l (E) reside in the lth row/column. Decouple the effects of the lth row and the lth column of P Ω l (E) to reach where δ l,j := 1 {(l,j)∈Ω} indicates whether the (l, j)-th entry is observed. Since X t,(l) is independent of {δ l,j } 1≤j≤n and {E l,j } 1≤j≤n , we can treat the first term as a sum of independent vectors {u j }. It is easy to verify that where · ψ1 denotes the sub-exponential norm [KLT11, Section 6]. Further, one can calculate Invoke the matrix Bernstein inequality [KLT11, Proposition 2] to discover that with probability at least 1 − O n −10 , n j=1 u j 2 V log n + u j ψ1 log 2 n pσ 2 X t,(l) 2 F log n + σ X t,(l) 2,∞ log 2 n σ np log n X t,(l) where the third inequality follows from X t,(l) 2 F ≤ n X t,(l) 2 2,∞ , and the last inequality holds as long as np log 2 n. Additionally, the remaining term α in (154) can be controlled using the same argument, giving rise to α σ np log n X t,(l) 2,∞ . We then complete the proof by observing that where the last inequality follows by combining (73c), the sample complexity condition n 2 p µ 2 r 2 n log n, and the noise condition (27).

B.5.2 Proof of Lemma 25
For notational simplicity, we denote Since the Frobenius norm is unitarily invariant, we have Again, all nonzero entries of the matrix W reside in its lth row/column. We can deal with the lth row and the lth column of W separately as follows where δ l,j := 1 {(l,j)∈Ω} and the second line relies on the fact that j:j =l (δ l,j − p) 2 np. It follows that Here, (i) is a consequence of (155). In addition, (ii) follows from where the last inequality comes from (73b), the sample complexity condition n 2 p µ 2 r 2 n log n, and the noise condition (27). The matrix Bernstein inequality [Tro15b, Theorem 6.1.1] reveals that n j=1 (δ l,j − p) C l,j X t,(l) j,· 2 V log n + L log n p C 2 ∞ X 2 F log n + C ∞ X 2,∞ log n with probability exceeding 1 − O n −10 , and as a result, as soon as np log n. To finish up, we make the observation that where the last line arises from (155). This combined with (157) gives where (i) comes from (158), and (ii) makes use of the incoherence condition (114).
1. Regarding the second term α 2 of (160), we see from the definition of X t+1,(l) (see (159)) that where we also utilize the definitions of P Ω −l and P l in (67). For notational convenience, we denote This allows us to rewrite (161) as l,· ∆ t,(l) X , which further implies that Here, the last line follows from the fact that ∆ t,(l) l,· 2 ≤ X 2,∞ . To see this, one can use the induction hypothesis (70e) to get as long as np µ 2 r 2 and σ (n log n) /p σ min . By taking 0 < η ≤ 1/σ max , we have 0 I r − ηΣ (1 − ησ min ) I r , and hence can obtain An immediate consequence of the above two inequalities and (73d) is 2. The first term α 1 of (160) can be equivalently written as

Simple algebra yields
.
Here, to bound the the second term we have used X t+1,(l) l,· 2 ≤ X t+1,(l) l,· − X l,· 2 + X l,· 2 = α 2 + X l,· 2 ≤ 2 X 2,∞ , where the last inequality follows from (165). It remains to upper bound β 1 and β 2 . For both β 1 and β 2 , a central quantity to control is X t+1,(l) H t,(l) − X t+1,(l) . By the definition of X t+1,(l) in (159) and the gradient update rule for X t+1,(l) (see (69)), one has It is easy to verify that 1 p P Ω −l (E) for δ > 0 sufficiently small. Here, (i) uses the elementary fact that the spectral norm of a submatrix is no more than that of the matrix itself, (ii) arises from Lemma 40 and (iii) is a consequence of the noise condition (27). Therefore, in order to control (166), we need to upper bound the following quantity To this end, we make the observation that where P Ω l is defined in (66). An application of Lemma 43 reveals that where R t,(l) ∈ O r×r is defined in (72). Let C = X t,(l) X t,(l) − X X as in (156), and one can bound the other term γ 2 by taking advantage of the triangle inequality and the symmetry property: where (i) comes from the standard Chernoff bound n j=1 (δ l,j − p) 2 np, and in (ii) we utilize the bound established in (158). The previous two bounds taken collectively give for some constant C > 0 and δ > 0 sufficiently small. The last inequality follows from (73c), the incoherence condition (114) and our sample size condition. In summary, we obtain for δ > 0 sufficiently small. With the estimate (170) in place, we can continue our derivation on β 1 and β 2 .
(a) With regard to β 1 , in view of (166) we can obtain where (i) follows from the definitions of P Ω −l and P l (see (67) and note that all entries in the lth row of P Ω −l (·) are identically zero), and the identity (ii) is due to the definition of ∆ t,(l) in (162).
(b) For β 2 , we first claim that whose justification follows similar reasonings as that of (138), and is therefore omitted. In particular, it gives rise to the facts that X X t+1,(l) is symmetric and We are now ready to invoke Lemma 36 to bound β 2 . We abuse the notation and denote C := X t+1,(l) X and E := X t+1,(l) H t,(l) − X t+1,(l) X . We have The first inequality arises from (170), namely, where (i) holds since ∆ t,(l) ≤ X and (ii) holds true for δ sufficiently small and η ≤ 1/σ max . Invoke Lemma 36 to obtain where (174) follows since σ r−1 (C) ≥ σ r (C) ≥ σ min /2 from (173), and the last line comes from (170).

B.7 Proof of Lemma 13
For notational convenience, we define the following two orthonormal matrices The problem of finding H t (see (26)) is called the orthogonal Procrustes problem [tB77]. It is well-known that the minimizer H t always exists and is given by Here, the sign matrix sgn(B) is defined as for any matrix B with singular value decomposition B = U ΣV , where the columns of U and V are left and right singular vectors, respectively. Before proceeding, we make note of the following perturbation bounds on M 0 and M (l) (as defined in Algorithm 2 and Algorithm 5, respectively): for some universal constant C > 0. Here, (i) arises from the triangle inequality, (ii) utilizes Lemma 39 and Lemma 40, (iii) follows from the incoherence condition (114) and (iv) holds under our sample complexity assumption that n 2 p µ 2 r 2 n and the noise condition (27). Similarly, we have Combine Weyl's inequality, (178) and (179) to obtain which further implies We start by proving (70a), (70b) and (70c). The key decomposition we need is the following 1. For the spectral norm error bound in (70c), the triangle inequality together with (182) yields where we have also used the fact that U 0 = 1. Recognizing that M 0 − M σ min (see (178)) and the assumption σ max /σ min 1, we can apply Lemma 47, Lemma 46 and Lemma 45 to obtain These taken collectively imply the advertised upper bound where we also utilize the fact that Σ 0 1/2 ≤ √ 2σ max (see (181)) and the bounded condition number assumption, i.e. σ max /σ min 1. This finishes the proof of (70c).

With regard to the Frobenius norm bound in (70a), one has
Here (i) arises from (70c) and (ii) holds true since σ max /σ min 1 and √ r √ σ min ≤ X F , thus completing the proof of (70a).
3. The proof of (70b) follows from similar arguments as used in proving (70c). Combine (182) and the triangle inequality to reach Plugging in the estimates (178), (181), (183a) and (183b) results in It remains to study the component-wise error of U 0 . To this end, it has already been shown in [AFWZ17, Lemma 14] that under our assumptions. These combined with the previous inequality give where the last relation is due to the observation that 4. We now move on to proving (70e). Recall that Q (l) = arg min R∈O r×r U 0,(l) R − U F . By the triangle inequality, Note that X l,· = M l,· U Σ −1/2 and, by construction of M (l) , which further implies that (186) In order to control this, we first see that where the penultimate inequality uses (181) and the last inequality arises from Lemma 46. Additionally, Lemma 45 gives Plugging the previous two bounds into (186), we reach where the last relation follows from M 2,∞ = X X 2,∞ ≤ √ σ max X 2,∞ and the estimate (179).
Substituting the above bounds back to (185) yields in where the second line relies on Lemma 47, the bound (179), and the condition σ max /σ min 1. This establishes (70e). 5. Our final step is to justify (70d). Define B := arg min R∈O r×r U 0,(l) R − U 0 F . From the definition of R 0,(l) (cf. (72)), one has

Recognizing that
we can use the triangle inequality to bound In view of Lemma 46 and the bounds (178) and (179), one has From Davis-Kahan's sinΘ theorem [DK70] we see that These estimates taken together with (181) give It then boils down to controlling M 0 − M (l) U 0,(l) F . Quantities of this type have showed up multiple times already, and hence we omit the proof details for conciseness (see Appendix B.5). With probability at least 1 − O n −10 , If one further has then taking the previous bounds collectively establishes the desired bound is identically zero by construction. In addition, As a consequence, one has which combined with (188) and the assumption σ max /σ min 1 yields The claim (187) then follows by combining the above estimates: where we have utilized the unitary invariance of · 2,∞ .

C Proofs for blind deconvolution
Before proceeding to the proofs, we make note of the following concentration results. The standard Gaussian concentration inequality and the union bound give max 1≤l≤m a * l x ≤ 5 log m with probability at least 1 − O(m −10 ). In addition, with probability exceeding 1 − Cm exp(−cK) for some constants c, C > 0, max In addition, the population/expected Wirtinger Hessian at the truth z is given by

C.1 Proof of Lemma 14
First, we find it convenient to decompose the Wirtinger Hessian (cf. (80)) into the expected Wirtinger Hessian at the truth (cf. (191)) and the perturbation part as follows: The proof then proceeds by showing that (i) the population Hessian ∇ 2 F z satisfies the restricted strong convexity and smoothness properties as advertised, and (ii) the perturbation ∇ 2 f (z) − ∇ 2 F z is wellcontrolled under our assumptions. We start by controlling the population Hessian in the following lemma.
Lemma 26. Instate the notation and the conditions of Lemma 14. We have . The next step is to bound the perturbation. To this end, we define the set S := {z : z satisfies (82)} , and derive the following lemma.
Lemma 27. Suppose the sample complexity satisfies m µ 2 K log 9 m, c > 0 is a sufficiently small constant, and δ = c/ log 2 m. Then with probability at least 1 − O m −10 + e −K log m , one has Combining the two lemmas, we can easily see that for z ∈ S, which verifies the smoothness upper bound. In addition, where (i) uses the triangle inequality, (ii) holds because of Lemma 27 and the fact that D ≤ 1 + δ, and (iii) follows if δ ≤ 1/2. This establishes the claim on the restricted strong convexity.

C.1.1 Proof of Lemma 26
We start by proving the identity ∇ 2 F z = 2. Let Recalling that h 2 = x 2 = 1, we can easily check that these four vectors form an orthonormal set. A little algebra reveals that We now turn attention to the restricted strong convexity. Since u * D∇ 2 F z u is the complex conjugate of u * ∇ 2 F z Du as both ∇ 2 F (z ) and D are Hermitian, we will focus on the first term u * D∇ 2 F z u. This term can be rewritten as where (i) uses the definitions of u and ∇ 2 F z , and (ii) follows from the definition of D. In view of the assumption (84), we can obtain where the last inequality utilizes the identity It then boils down to controlling β. Toward this goal, we decompose β into the following four terms .
Since h 2 − h 2 and x 2 − x 2 are both small by (83), β 2 , β 3 and β 4 are well-bounded. Specifically, regarding β 2 , we discover that where the second inequality is due to (83) and the last one holds since δ < 1. Similarly, we can obtain where both lines make use of the facts that Combine the previous three bounds to reach where we utilize the elementary inequality ab ≤ (a 2 + b 2 )/2 and the identity (194).

C.1.2 Proof of Lemma 27
In view of the expressions of ∇ 2 f (z) and ∇ 2 F z (cf. (80) and (191)) and the triangle inequality, we get where the four terms on the right-hand side are defined as follows In what follows, we shall control sup z∈S α j for j = 1, 2, 3, 4 separately.
1. Regarding the first term α 1 , the triangle inequality gives .
• To control β 1 , the key observation is that a * j x and a * j x are extremely close. We can rewrite β 1 as where Here, the first line (i) uses the identity for u, v ∈ C, the second relation (ii) comes from the triangle inequality, and the third line (iii) follows from (189) and the assumption (82b). Substitution into (197) gives where the last inequality comes from the fact that m j=1 b j b * j = I K . • The other term β 2 can be bounded through Lemma 59, which reveals that with probability 1−O m −10 , Taken collectively, the preceding two bounds give Hence P(sup z∈S α 1 ≤ 1/32) = 1 − O(m −10 ).

2.
We are going to prove that P(sup z∈S α 2 ≤ 1/32) = 1 − O(m −10 ). The triangle inequality allows us to bound α 2 as The second term θ 2 (h) is easy to control. To see this, we have where the penultimate relation uses the assumption that h − h 2 ≤ δ and hence For the first term θ 1 (h), we define a new set It is easily seen that sup z∈S θ 1 ≤ sup h∈H θ 1 . We plan to use the standard covering argument to show that To this end, we define c j (h) = |b * j h| 2 for every 1 ≤ j ≤ m. It is straightforward to check that for h ∈ H. In the above argument, we have used the facts that together with the definition of H. Lemma 57 combined with (199) and (200) readily yields that for any fixed h ∈ H and any t ≥ 0, where C 1 , C 2 > 0 are some universal constants.
Now we are in a position to strengthen this bound to obtain uniform control of θ 1 over H. Note that for any h 1 , h 2 ∈ H, Define an event E 0 = m j=1 a j a * j ≤ 2m . When E 0 happens, the previous estimates give Let ε = 1/(1280K), and H be an ε-net covering H (see [Ver12,Definition 5.1]). We have and as a result, Lemma 57 forces that P(E c 0 ) = O(m −10 ). Additionally, we have log | H| ≤ C 3 K log K for some absolute constant C 3 > 0 according to [Ver12,Lemma 5.2]. Hence (201) leads to for some constant C 4 > 0. Under the sample complexity m µ 2 K log 5 m, the right-hand side of the above display is at most O m −10 . Combine the estimates above to establish the desired high-probability bound for sup z∈S α 2 .
To this end, we let As a consequence, we can write α 3 = B * CA .
The key observation is that both the ∞ norm and the Frobenius norm of C are well-controlled. Specifically, we claim for the moment that with probability at least 1 − O m −10 , where C > 0 is some absolute constant. This motivates us to divide the entries in C into multiple groups based on their magnitudes.
To be precise, introduce R := 1 + log 2 (Cµ log 7/2 m) sets {I r } 1≤r≤R , where An immediate consequence of the definition of I r and the norm constraints in (202) is the following cardinality bound for 1 ≤ r ≤ R − 1. Since {I r } 1≤r≤R form a partition of the index set {1, · · · , m}, it is easy to see that where D I,J denotes the submatrix of D induced by the rows and columns of D having indices from I and J , respectively, and D I,· refers to the submatrix formed by the rows from the index set I. As a result, one can invoke the triangle inequality to derive B Ir,· · C Ir,Ir · A Ir,· + B I R ,· · C I R ,I R · A I R ,· .
By Lemma 58 we obtain that for some constants C 2 , C 3 > 0 P sup Taking the union bound and substituting the estimates above into (205), we see that with probability at Note that µ ≤ √ m, R − 1 = log 2 (Cµ log 7/2 m) log m, and log e δ 1 = log eC 2 µ 2 log 5 m 48δ 2 log m. Finally, it remains to justify (202). For all z ∈ S, the triangle inequality tells us that for some large constant C > 0, where we have used the definition of S and the fact (189). The claim (202b) follows directly from [LLSW16, Lemma 5.14]. To avoid confusion, we use µ 1 to refer to the parameter µ therein. Let L = m, N = K, d 0 = 1, µ 1 = C 4 µ log 2 m/2, and ε = 1/15. Then and the sample complexity condition L µ 2 1 (K + N ) log 2 L is satisfied because we have assumed m µ 2 K log 6 m. Therefore with probability exceeding 1 − O m −10 + e −K , we obtain that for all z ∈ S, The claim (202b) can then be justified by observing that 4. It remains to control α 4 , for which we make note of the following inequality with a j denoting the entrywise conjugate of a j . Since {a j } has the same joint distribution as {a j }, by the same argument used for bounding α 3 we obtain control of the first term, namely, Putting together the above bounds, we reach P(sup z∈S α 4 ≤ 1/48) = 1 − O(m −10 + e −K log m).
5. Combining all the previous bounds for sup z∈S α j and (196), we deduce that with probability 1−O(m −10 + e −K log m),

C.2 Proofs of Lemma 15 and Lemma 16
Proof of Lemma 15. In view of the definition of α t+1 (see (38)), one has The gradient update rules (79) imply that where we denote h t = 1 α t h t and x t = α t x t as in (81). Let h t+1 = 1 α t h t+1 and x t+1 = α t x t+1 . We further get The fundamental theorem of calculus (see Appendix D.3.1) together with the fact that ∇f z = 0 tells us Substitution into (209) indicates that z t+1 − z 2 2 ≤ 1 + 16η 2 − η/4 z t − z 2 2 . When 0 < η ≤ 1/128, this implies that This completes the proof of Lemma 15.
Proof of Lemma 16. Reuse the notation in this subsection, namely, z t+1 = h t+1 x t+1 with h t+1 = 1 α t h t+1 and x t+1 = α t x t+1 . From (210), one can tell that Invoke Lemma 52 with β = α t to get This combined with the assumption ||α t | − 1| ≤ 1/2 implies that This finishes the proof of the first claim. The second claim can be proved by induction. Suppose that |α s | − 1 ≤ 1/2 and dist(z s , z ) ≤ C 1 (1 − η/16) s / log 2 m hold for all 0 ≤ s ≤ τ ≤ t , then using our result in the first part gives for m sufficiently large. The proof is then complete by induction.

C.3 Proof of Lemma 17
Define the alignment parameter between z t,(l) and z t as α t,(l) Further denote, for simplicity of presentation, z t,(l) = h t,(l) x t,(l) with 1. Regarding the first term ν 1 , one can adopt the same strategy as in Appendix C.2. Specifically, write The fundamental theorem of calculus (see Appendix D.3.1) reveals that where we abuse the notation and denote z (τ ) = z t + τ z t,(l) − z t . In order to invoke Lemma 14, we need to verify the conditions required therein. Recall the induction hypothesis (90b) that dist z t,(l) , z t = z t,(l) − z t 2 ≤ C 2 µ √ m µ 2 K log 9 m m , and the fact that z (τ ) lies between z t,(l) and z t . For all 0 ≤ τ ≤ 1: (a) If m µ 2 √ K log 13/2 m, then where we have used the induction hypotheses (90a) and (90b); which follows from the bound (190) and the induction hypotheses (90b) and (90c); (c) If m µK log 5/2 m, then which makes use of the fact b j 2 = K/m as well as the induction hypotheses (90b) and (90d).
These properties satisfy the condition (82) required in Lemma 14. The other two conditions (83) and (84) are also straightforward to check and hence we omit it. Thus, we can repeat the argument used in Appendix C.2 to obtain ν 1 2 ≤ (1 − η/16) · z t,(l) − z t 2 . 2. In terms of the second term ν 2 , it is easily seen that Similarly, we can also derive Putting these bounds together indicates that The above bounds taken together with (212) and (213) ensure the existence of a constant C > 0 such that Here, (i) holds as long as m is sufficiently large such that CC 1 1/log 2 m 1 and which is guaranteed by Lemma 16. The inequality (ii) arises from the induction hypothesis (90b) and taking C 2 > 0 is sufficiently large. Finally we establish the second inequality claimed in the lemma. Take (h 1 , x 1 ) = ( h t+1 , x t+1 ) and (h 2 , x 2 ) = ( h t+1,(l) , x t+1,(l) ) in Lemma 55. Since both (h 1 , x 1 ) and (h 2 , x 2 ) are close enough to (h , x ), we deduce that z t+1,(l) − z t+1 2 z t+1,(l) − z t+1 2 C 2 µ √ m µ 2 K log 9 m m as claimed.

C.4 Proof of Lemma 18
Before going forward, we make note of the following inequality for some small δ log −2 m, where the last relation follows from Lemma 16 that for m sufficiently large. In view of the above inequality, the focus of our subsequent analysis will be to control max l b * l 1 α t h t+1 . The gradient update rule for h t+1 (cf. (79a)) gives where h t = 1 α t h t and x t = α t x t . Here and below, we denote ξ = 1/ x t 2 2 for notational convenience. The above formula can be further decomposed into the following terms , where we use the fact that m j=1 b j b * j = I K . In the sequel, we shall control each term separately. 1. We start with |b * l v 1 | by making the observation that Combining the induction hypothesis (90c) and the condition (189) yields log 3/2 m + 5 log m ≤ 6 log m as long as m is sufficiently large. This further implies Substituting it into (219) and taking Lemma 48, we arrive at with the proviso that C 3 is sufficiently small.
2. We then move on to |b * l v 3 |, which obeys Regarding the first term, we have the following lemma, whose proof is given in Appendix C.4.1.
Lemma 28. Suppose m ≥ CK log 2 m for some sufficiently large constant C > 0. Then with probability at least 1 − O m −10 , one has For the remaining term, we apply the same strategy as in bounding |b * l v 1 | to get where the second line follows from the incoherence (36), the induction hypothesis (90c), the condition (189) and Lemma 48. Combining the above three inequalities and the incoherence (36) yields 3. Finally, we need to control |b * l v 2 |. For convenience of presentation, we will only bound |b * 1 v 2 | in the sequel, but the argument easily extends to all other b l 's. The idea is to group {b j } 1≤j≤m into bins each containing τ adjacent vectors, and to look at each bin separately. Here, τ poly log(m) is some integer to be specified later. For notational simplicity, we assume m/τ to be an integer, although all arguments continue to hold when m/τ is not an integer. For each 0 ≤ l ≤ m − τ , the following summation over τ adjacent data obeys We will now bound each term in (221) separately.
• Before bounding the first term in (221), we first bound the pre-factor τ j=1 |a * l+j x | 2 − x 2 2 . Notably, the fluctuation of this quantity does not grow fast as it is the sum of i.i.d. random variables over a group of relatively large size, i.e. τ . Since 2 a * j x 2 follows the χ 2 2 distribution, by standard concentration results (e.g. [RV + 13, Theorem 1.1]), with probability exceeding 1 − O m −10 , τ j=1 a * l+j x 2 − x 2 2 τ log m.
With this result in place, we can bound the first term in (221) as Taking the summation over all bins gives It is straightforward to see from the proof of Lemma 48 that Substituting (223) into the previous inequality (222) gives as long as m K √ τ log m and τ log 3 m. • The second term of (221) obeys where the first inequality is due to Cauchy-Schwarz, and the second one holds because of the following lemma, whose proof can be found in Appendix C.4.2.
Lemma 29. Suppose τ ≥ C log 4 m for some sufficiently large constant C > 0. Then with probability With the above bound in mind, we can sum over all bins of size τ to obtain Here, the last line arises from Lemma 51, which says that for any small constant c > 0, as long as • The third term of (221) obeys where the last line relies on the inequality where the last relation makes use of (223) with the proviso that m Kτ . It then boils down to bounding max 0≤l≤m−τ, 1≤j≤τ (b l+j − b l+1 ) * h t . Without loss of generality, it suffices to look at (b j − b 1 ) * h t for all 1 ≤ j ≤ τ . Specifically, we claim for the moment that for some sufficiently small constant c > 0, provided that m τ K log 4 m. As a result, • Putting the above results together, we get 4. Combining the preceding bounds guarantees the existence of some constant C 8 > 0 such that Here, (i) uses the induction hypothesis (90d), and (ii) holds as long as c > 0 is sufficiently small (so that (1 + δ)C 8 ηξc 1) and η > 0 is some sufficiently small constant. In order for the proof to go through, it suffices to pick τ = c 10 log 4 m for some sufficiently large constant c 10 > 0. Accordingly, we need the sample size to exceed m µ 2 τ K log 4 m µ 2 K log 8 m.
Finally, it remains to verify the claim (224), which we accomplish in Appendix C.4.3.

C.4.1 Proof of Lemma 28
Denote we can write the quantity of interest as the sum of independent random variables, namely, Further, the sub-exponential norm (see definition in [Ver12] where (i) arises from the centering property of the sub-exponential norm (see [Ver12,Remark 5.18]), (ii) utilizes the relationship between the sub-exponential norm and the sub-Gaussian norm [Ver12, Lemma 5.14] and (iii) is a consequence of the incoherence condition (36) and the fact that a * j x ψ2 1, and (iv) follows where c > 0 is some universal constant. By taking t = µ/ √ m, we see there exists some constant c such that We conclude the proof by observing that m K log 2 m as stated in the assumption.

C.4.2 Proof of Lemma 29
From the elementary inequality (a − b) 2 ≤ 2 a 2 + b 2 , we see that where the last identity holds true since x 2 = 1. It thus suffices to control τ j=1 a * j x 4 . Let ξ i = a * j x , which is a standard complex Gaussian random variable. Since the ξ i 's are statistically independent, one has for some constant C 4 > 0. It then follows from the hypercontractivity concentration result for Gaussian polynomials that [SS12, Theorem 1.9] for some constants c, c 2 , C > 0, with the proviso that τ log 4 m. As a consequence, with probability at which together with (225) concludes the proof.

C.4.3 Proof of Claim (224)
We will prove the claim by induction. Again, observe that for some δ log −2 m, which allows us to look at (b j − b 1 ) * 1 α t−1 h t instead. Use the gradient update rule for h t (cf. (79a)) once again to get where we denote θ := 1/ x t−1 2 2 . This further gives rise to Since αȟ 0 , αx 0 are also the leading left and right singular vectors of M , we can invoke Lemma 60 to get In addition, we can apply Weyl's inequality once again to deduce that where the last inequality comes from (228). Substitute (231) into (230) to obtain Taking the minimum over α, one can thus conclude that min α∈C,|α|=1 where the last inequality comes from (229). Since ξ is arbitrary, by taking m/(µ 2 K log 2 m) to be large enough, we finish the proof for (92). Carrying out similar arguments (which we omit here), we can also establish (93). The last claim in Lemma 19 that ||α 0 | − 1| ≤ 1/4 is a direct corollary of (92) and Lemma 52. a j x 2 a j a j − x 2 2 I n + 2x x ≤ δ x 2 2 , provided that m ≥ c 0 n log n for some sufficiently large constant c 0 > 0.
Lemma 33. Suppose that a j i.i.d.
Proof. This is supplied in [CC17, supplementary material].

D.1.2 Matrix perturbation bounds
Lemma 34. Let λ 1 (A), u be the leading eigenvalue and eigenvector of a symmetric matrix A, respectively, and λ 1 ( A), u be the leading eigenvalue and eigenvector of a symmetric matrix A, respectively. Suppose that λ 1 (A), λ 1 ( A), A , A ∈ [C 1 , C 2 ] for some C 1 , C 2 > 0. Then, Proof. Observe that where the last inequality follows since u 2 = 1. Using the identity √ a − √ b = (a − b)/( √ a + √ b), we have where the last inequality comes from our assumptions on λ 1 (A) and λ 1 ( A). This combined with (248) yields To control λ 1 A − λ 1 ( A) , use the relationship between the eigenvalue and the eigenvector to obtain which together with (249) gives as claimed.

D.2.1 Orthogonal Procrustes problem
The orthogonal Procrustes problem is a matrix approximation problem which seeks an orthogonal matrix R to best "align" two matrices A and B. Specifically, for A, B ∈ R n×r , define R to be the minimizer of The first lemma is concerned with the characterization of the minimizer R of (250).
Lemma 35. For A, B ∈ R n×r , R is the minimizer of (250) if and only if R A B is symmetric and positive semidefinite.
Proof. This is an immediate consequence of [tB77, Theorem 2].
Let A B = U ΣV be the singular value decomposition of A B ∈ R r×r . It is easy to check that R := U V satisfies the conditions that R A B is both symmetric and positive semidefinite. In view of Lemma 35, R = U V is the minimizer of (250). In the special case when C := A B is invertible, R enjoys the following equivalent form: where H (·) is an R r×r -valued function on R r×r . This motivates us to look at the perturbation bounds for the matrix-valued function H (·), which is formulated in the following lemma.
Proof. This is an immediate consequence of [Mat93, Theorem 2.3].
With Lemma 36 in place, we are ready to present the following bounds on two matrices after "aligning" them with X .
Lemma 37. Instate the notation in Section 3.2. Suppose X 1 , X 2 ∈ R n×r are two matrices such that Then the following two inequalities hold true: Proof. Before proving the claims, we first gather some immediate consequences of the assumptions (252). Denote C = X 1 X and E = (X 2 − X 1 ) X . It is easily seen that C is invertible since where (i) follows from the assumption (252a) and (ii) is a direct application of Weyl's inequality. In addition, C + E = X 2 X is also invertible since where (i) arises from the assumption (252b) and (ii) holds because of (253). When both C and C + E are invertible, the orthonormal matrices R 1 and R 2 admit closed-form expressions as follows R 1 = C C C −1/2 and R 2 = (C + E) (C + E) (C + E) −1/2 . Moreover, we have the following bound on X 1 : where (i) is the triangle inequality, (ii) uses the assumption (252a) and (iii) arises from the fact that X = √ σ max .
With these in place, we turn to establishing the claimed bounds. We will focus on the upper bound on X 1 R 1 − X 2 R 2 F , as the bound on X 1 R 1 − X 2 R 2 can be easily obtained using the same argument. Simple algebra reveals that where the first inequality uses the fact that R 2 = 1 and the last inequality comes from (254). An application of Lemma 36 leads us to conclude that where (256) utilizes (253). Combine (255) and (257) to reach which finishes the proof by noting that κ ≥ 1.

D.2.2 Matrix concentration inequalities
This section collects various measure concentration results regarding the Bernoulli random variables {δ j,k } 1≤j,k≤n , which is ubiquitous in the analysis for matrix completion.
Lemma 38. Fix any small constant δ > 0, and suppose that m δ −2 µnr log n. Then with probability exceeding 1 − O n −10 , one has holds simultaneously for all B ∈ R n×n lying within the tangent space of M .
Proof. This result has been established in [CR09, Section 4.2] for asymmetric sampling patterns (where each (i, j), i = j is included in Ω independently). It is straightforward to extend the proof and the result to symmetric sampling patterns (where each (i, j), i ≥ j is included in Ω independently). We omit the proof for conciseness.
Lemma 39. Fix a matrix M ∈ R n×n . Suppose n 2 p ≥ c 0 n log n for some sufficiently large constant c 0 > 0. With probability at least 1 − O n −10 , one has where C > 0 is some absolute constant.
Proof. See [KMO10a, Lemma 3.2]. Similar to Lemma 38, the result therein was provided for the asymmetric sampling patterns but can be easily extended to the symmetric case.
Lemma 40. Recall from Section 3.2 that E ∈ R n×n is the symmetric noise matrix. Suppose the sample size obeys n 2 p ≥ c 0 n log 2 n for some sufficiently large constant c 0 > 0. With probability at least 1 − O n −10 , one has 1 p P Ω (E) ≤ Cσ n p , where C > 0 is some universal constant.

Then one has
Median [ G l (A) ] ≤ p A 2 + 2p A where the last relation holds since pψ 2 ξ 2 log r, which follows by combining the definitions of ψ and ξ, the sample size condition np κµr log 2 n, and the incoherence condition (114). Thus, substitution into (259) and taking λ = √ kr give P G l (∆) ≥ 2 √ pψ + √ krξ ≤ C exp (−ckr) for any k ≥ 0. Furthermore, invoking [AS08, Corollary A.1.14] and using the bound (260), one has So far we have demonstrated that for any fixed ∆ obeying our assumptions, is well controlled with exponentially high probability. In order to extend the results to all feasible ∆, we resort to the standard -net argument. Clearly, due to the homogeneity property of G l (∆) , it suffices to restrict attention to the following set: We have thus concluded the proof.
Lemma 43. Suppose the sample size obeys n 2 p ≥ Cκµrn log n for some sufficiently large constant C > 0. Then with probability at least 1 − O n −10 , 1 p P Ω XX − X X ≤ 2n 2 X 2 2,∞ + 4 √ n log n X 2,∞ X holds simultaneously for all X ∈ R n×r satisfying where > 0 is any fixed constant.
Proof. To simplify the notations hereafter, we denote ∆ := X − X . With this notation in place, one can decompose XX − X X = ∆X + X ∆ + ∆∆ , which together with the triangle inequality implies that In the sequel, we bound α 1 and α 2 separately.
1. Recall from [Mat90, Theorem 2.5] the elementary inequality that where |C| := [|c i,j |] 1≤i,j≤n for any matrix C = [c i,j ] 1≤i,j≤n . In addition, for any matrix D := [d i,j ] 1≤i,j≤n such that |d i,j | ≥ |c i,j | for all i and j, one has |C| ≤ |D| . Therefore Lemma 39 then tells us that with probability at least 1 − O(n −10 ), 1 p P Ω 11 − 11 ≤ C n p for some universal constant C > 0, as long as p log n/n. This together with the triangle inequality yields 1 p P Ω 11 ≤ 1 p P Ω 11 − 11 + 11 ≤ C n p + n ≤ 2n, provided that p 1/n. Putting together the previous bounds, we arrive at 2. Regarding the second term α 2 , apply the elementary inequality (265) once again to get P Ω ∆X ≤ P Ω ∆X , which motivates us to look at P Ω ∆X instead. A key step of this part is to take advantage of the 2,∞ norm constraint of P Ω ∆X . Specifically, we claim for the moment that with probability exceeding 1 − O(n −10 ), P Ω ∆X 2 2,∞ ≤ 2pσ max ∆ 2 2,∞ := θ holds under our sample size condition. In addition, we also have the following trivial ∞ norm bound P Ω ∆X ∞ ≤ ∆ 2,∞ X 2,∞ := γ.
In what follows, for simplicity of presentation, we will denote A := P Ω ∆X .
In the end of this subsection, we record a useful lemma to bound the spectral norm of a sparse Bernoulli matrix.
Lemma 44. Let A ∈ {0, 1} n1×n2 be a binary matrix, and suppose that there are at most k r and k c nonzero entries in each row and column of A, respectively. Then one has A ≤ √ k c k r .
Proof. This immediately follows from the elementary inequality A 2 ≤ A 1→1 A ∞→∞ (see [Hig92, equation (1.11)]), where A 1→1 and A ∞→∞ are the induced 1-norm (or maximum absolute column sum norm) and the induced ∞-norm (or maximum absolute row sum norm), respectively. Proof. Here, we focus on the Frobenius norm; the bound on the operator norm follows from the same argument, and hence we omit the proof. Since · F is unitarily invariant, we have where Q Σ 1/2 Q and Σ 1/2 are the matrix square roots of Q ΣQ and Σ, respectively. In view of the matrix square root perturbation bound [Sch92, Lemma 2.1], where the last inequality follows from the lower estimates Invoke the Davis-Kahan sinΘ theorem [DK70] to obtain for some constant c 2 > 0, where the last inequality follows from the bounds Proof. We first collect several useful facts about the spectrum of Σ. Weyl's inequality tells us that Σ − Σ ≤ M − M ≤ σ min /2, which further implies that σ r (Σ) ≥ σ r Σ − Σ − Σ ≥ σ min /2 and Σ ≤ Σ + Σ − Σ ≤ 2σ max .
Denote Q = U U and H = X X .
∼ N 0, 1 2 I K + iN 0, 1 2 I K for every 1 ≤ j ≤ m, and {c j } 1≤j≤m are a set of fixed numbers. Then there exist some universal constants C 1 , C 2 > 0 such that for all t ≥ 0 Proof. This is a simple variant of [Ver12, Theorem 5.39], which uses the Bernstein inequality and the standard covering argument. Hence we omit its proof.
Proof. The proof relies on Lemma 57 and the union bound. First, invoke Lemma 57 to see that for any fixed J ⊆ [m] and for all t ≥ 0, we have for some constants C 1 , C 2 > 0, and as a result, where c denotes the smallest integer that is no smaller than c. Here, (i) holds since we take the supremum over a larger set and (ii) results from (294) and the union bound. Apply the elementary inequality n k ≤ (en/k) k for any 0 ≤ k ≤ n to obtain P   sup |J|≤εm j∈J a j a * j ≥ εm (1 + t)   ≤ 2 em εm εm exp C 1 K − C 2 εm min t, t 2 ≤ 2 e ε 2εm exp C 1 K − C 2 εm min t, t 2 = 2 exp C 1 K − εm C 2 min t, t 2 − 2 log(e/ε) .
The proof is then completed by taking C 3 ≥ max{1, 6/ C 2 } and t = C 3 log(e/ε). To see this, it is easy to check that min{t, t 2 } = t since t ≥ 1. In addition, one has C 1 K ≤ C 2 εm/3 ≤ C 2 εmt/3, and 2 log(e/ε) ≤ C 2 t/3. Combine the estimates above with (295) to arrive at .
Proof. The first claim follows since With regards to the second claim, we see that .
Similarly, one can obtain .
Add these two inequalities to complete the proof.