On connections between Amplitude Flow and Error Reduction for phase retrieval and ptychography

In this paper, we consider two iterative algorithms for the phase retrieval problem: the well-known Error Reduction method and the Amplitude Flow algorithm, which minimizes the amplitude-based squared loss via gradient descent. We show that Error Reduction can be interpreted as a scaled gradient method applied to minimize the same amplitude-based squared loss, which allows us to establish its convergence properties. Moreover, we show that for a class of measurement scenarios, such as ptychography, both methods have the same computational complexity and sometimes even coincide.


Introduction
The phase retrieval problem considers the reconstruction of an unknown x ∈ C^d from m ∈ N amplitude measurements of the form

y = |Ax| + n,  (1)

with A ∈ C^{m×d} denoting the measurement matrix and n ∈ R^m being noise. It has many applications such as crystallography [1], noncrystalline materials [2][3][4] and optical imaging [5], where the goal is to recover the specimen from its diffraction patterns obtained by illumination with penetrating light, e.g., x-rays or an electron beam.
One such application is ptychography [6,7], where the inference on the object of interest is based on a collection of far-field diffraction patterns, each obtained by an illumination of a small region of the specimen. As the regions overlap, the overlap produces surplus information, which allows for the unique identification of the object up to a global phase factor from the ptychographic measurements.
One of the longstanding favored algorithms is Error Reduction (ER), which was introduced in 1972 by Gerchberg and Saxton [8]. Later contributions [10], [32] and [12] classified ER as an alternating projections technique, supplemented it with a detailed convergence analysis and provided an interpretation of the algorithm as a projected gradient method. A version of the ER algorithm was also studied as a gradient flow method in the continuous setting [33].
Another algorithm, which has become popular in recent years, is Amplitude Flow (AF) [17,19]. It performs first-order optimization applied to the amplitude-based squared loss

L_2(z) = 1/2 ‖ |Az| − y ‖_2^2.  (2)

AF is well understood for randomized measurement scenarios [17], where the matrix A is random. It also possesses convergence guarantees for arbitrary measurement scenarios [19]. In this paper we connect these two methods by representing ER as a scaled gradient method for the minimization of the amplitude-based squared loss L_2. This allows us to establish a convergence rate for the ER algorithm, which to our knowledge has not been observed in the literature before. Furthermore, the scaled gradient representation establishes the equality of the sets of fixed points of the two methods. Lastly, we consider ER and AF in application to ptychographic measurements and show that both methods exhibit the same computational complexity and in special cases even coincide.
The paper is structured in the following way. In Section 2 we provide the reader with the necessary notation and a detailed overview of the ER and AF algorithms. Our contribution is then presented in Section 3 and proved in Section 4. Finally, the paper is summarized by a short conclusion.

Definitions
Throughout the paper, we will use the short notation [a] = {1, 2, . . . , a} for index sets. The complex unit is denoted by i. The complex conjugate of α ∈ C is given by ᾱ. The transpose and complex conjugate transpose of a vector v or a matrix B are denoted by v^T, v^* and B^T, B^*, respectively. The Euclidean norm of a vector v ∈ C^a is given by ‖v‖_2 = (Σ_{k ∈ [a]} |v_k|^2)^{1/2}. We say that a matrix B ∈ C^{b×a}, b ≥ a, is injective if for all pairs of vectors u, v ∈ C^a with u ≠ v it holds that Bu ≠ Bv. The injectivity of B is equivalent to the condition rank(B) = a. We will also denote the image of B as im(B) = {Bu : u ∈ C^a}. For a square full rank matrix B its inverse is given by B^{−1}. The projection of u ∈ C^a onto a set S ⊆ C^a is an element ũ ∈ S such that ‖u − ũ‖_2 ≤ ‖u − v‖_2 for all v ∈ S. An operator which maps u to ũ is called the projection operator onto S. In general, ũ is not unique; however, in the case when S is a non-empty closed convex set, ũ can be uniquely identified [34].
For a matrix B ∈ C^{b×a} of rank r, its singular value decomposition is given by B = UΣV^*, where U ∈ C^{b×r}, V ∈ C^{a×r} have orthonormal columns and Σ ∈ R^{r×r} is an invertible diagonal matrix with diagonal entries σ_j(B) > 0, j ∈ [r], sorted in decreasing order. The values σ_j(B) are also referred to as the singular values of B. The largest singular value σ_1(B) equals the spectral norm of B defined as ‖B‖ = max_{‖v‖_2 = 1} ‖Bv‖_2. Using the singular value decomposition, the Moore–Penrose pseudoinverse of B is defined as B^† = VΣ^{−1}U^*. For an injective matrix B ∈ C^{b×a}, b ≥ a, its pseudoinverse B^† can be expressed as

B^† = (B^*B)^{−1}B^*.  (3)

It satisfies

B^†B = I,  (4)

and BB^† is the projection operator onto the set im(B).
For a vector v ∈ C^a, the diagonal matrix diag(v) ∈ C^{a×a} is formed by placing the entries of the vector v onto the main diagonal, so that for k, j ∈ [a] it holds that (diag(v))_{k,j} = v_k if k = j and 0 otherwise. The discrete Fourier transform is given by the matrix F ∈ C^{d×d} with the entries

F_{k,j} = d^{−1/2} e^{−2πi(k−1)(j−1)/d}, k, j ∈ [d],  (5)

and satisfies the equality

F^*F = FF^* = I.  (6)

The family of the circular shift matrices S_s ∈ C^{d×d}, s ∈ Z, is defined by its action for all vectors v ∈ C^d as

(S_s v)_k = v_{k+s}, k ∈ [d],  (7)

where the index k + s is understood modulo d. For the description of the computational complexity of algorithms, we use the notation O(n) for the order of operations, meaning that at most cn operations are required for some constant c > 0.
For a function f : C → C and a vector v ∈ C^a, the notation f(v) will denote the entrywise application of the function f. For a vector v ∈ C^a and a number α ∈ C, by v + α we will denote the vector in C^a with entries v_k + α, k ∈ [a]. For instance, using this notation we can rewrite the measurements (1) as y = |Ax| + n.
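As an illustration of the measurement model (1) and the global phase ambiguity, the following NumPy sketch (with arbitrary dimensions and random data, not taken from the paper) generates amplitude measurements and verifies that a global phase factor leaves them unchanged.

```python
import numpy as np

# Sketch of the measurement model (1): y = |A x| + n.
# The dimensions d, m and the Gaussian A are arbitrary choices for this example.
rng = np.random.default_rng(0)
d, m = 8, 32
A = (rng.standard_normal((m, d)) + 1j * rng.standard_normal((m, d))) / np.sqrt(2)
x = rng.standard_normal(d) + 1j * rng.standard_normal(d)
noise = 0.01 * rng.standard_normal(m)

y = np.abs(A @ x) + noise          # amplitude measurements: the phases are lost

# A global phase factor alpha with |alpha| = 1 leaves the measurements unchanged.
alpha = np.exp(1j * 0.7)
assert np.allclose(np.abs(A @ (alpha * x)), np.abs(A @ x))
```

This is why x can only ever be identified up to the set {αx : |α| = 1}.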

Phase retrieval
In the context of the phase retrieval problem, it is convenient to refer to the spaces C^d and C^m as the object and measurement spaces, respectively. If the phases of the measurements Ax were known, the problem would be the classical recovery from linear measurements, which in general is only possible if the dimension of the measurement space m is at least as large as the dimension of the object space d. Since the phases are lost, the number of required measurements is even higher and, hence, we will assume that m ≥ d. It is known that m ≥ 4d − 4 measurements are sufficient for the unique reconstruction of x when A is generic [35] and m ≥ cd with a constant c ≥ 1 when A is random [23,24]. By the unique reconstruction of x, we understand the identification of x up to a global phase factor αx with α ∈ C, |α| = 1, since it holds that |Ax| = |A(αx)|.
The unique reconstruction of x up to a global phase is equivalent to the unique identification of the set {αx : |α| = 1} or, in other words, to the injectivity of the map {αx : |α| = 1} ↦ |Ax|. One of the necessary conditions for unique recovery is the injectivity of the matrix A. If A is not injective, then there exist two vectors u, v ∈ C^d such that u ≠ v and Au = Av. Consequently, |A(u − v)| = 0 and it is not possible to distinguish u − v from the zero vector based on the measurements. The injectivity of the matrix A will be the main assumption for our results in Section 3. The injectivity of A, however, is not sufficient for unique recovery. A counterexample is A = F, which is injective by (6), but it is well-known that there are multiple objects satisfying the same measurements |Fx| [36].
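The Fourier counterexample can be checked numerically: the classical "conjugate flip" of x is a different object (in general not a global phase multiple of x) with identical Fourier magnitudes. A small sketch, with arbitrary dimension and random data:

```python
import numpy as np

# F is unitary (hence injective), yet |F x| does not determine x uniquely:
# the conjugate flip x_flip[n] = conj(x[-n mod d]) has the same magnitudes.
d = 16
rng = np.random.default_rng(1)
x = rng.standard_normal(d) + 1j * rng.standard_normal(d)

F = np.fft.fft(np.eye(d)) / np.sqrt(d)        # unitary DFT matrix

x_flip = np.conj(x[(-np.arange(d)) % d])      # conjugate time reversal

# DFT maps the conjugate flip to the entrywise conjugate of F x,
# so the amplitude measurements coincide.
assert np.allclose(np.abs(F @ x), np.abs(F @ x_flip))
```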

Error Reduction
Error Reduction (ER) is an iterative algorithm for the phase retrieval problem. It starts with an initial guess z_0 ∈ C^d in the object space and is given by the iterations

z_{t+1} = A^† diag(y / |Az_t|) Az_t, t ≥ 0.

The iterations are repeated until a fixed point is reached, so that z_{t+1} = z_t. For T ∈ N iterations of ER, O(md^2 + Tmd) operations are required, where O(md^2) operations are needed to compute the pseudoinverse and O(md) operations are performed per iteration. Let us consider the iterates in the measurement space, u_t := Az_t, t ≥ 0, for which the update of ER reads as

u_{t+1} = AA^† diag(y / |u_t|) u_t.

In this form, diag(y / |u_t|) u_t is the projection of u_t onto the set

M := {u ∈ C^m : |u| = y}.

Moreover, the set M can be viewed as a product of the one-dimensional sets M_k := {u ∈ C : |u| = y_k}, k ∈ [m], and, thus, the projection onto M is performed by projecting each coordinate (u_t)_k onto the corresponding set M_k. The second step is to apply AA^†, which is, by (3), the projection operator onto im(A). Therefore, ER first projects onto M, where the measurements are satisfied. Then, the resulting point is projected onto im(A). The sequential projections onto M and im(A) allow for an interpretation of ER as an alternating projection scheme. If M were a convex set, ER would converge to a point in the intersection of the two sets [37]. However, due to the non-convexity of M, convergence of u_t to the intersection of the sets is not guaranteed, which is a known problem of the ER algorithm. We note that, when A allows for unique recovery and noise is absent, the intersection of M and im(A) is given by {αAx : |α| = 1} [12].
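A minimal NumPy sketch of the ER iteration described above (the function names and random test data are illustrative assumptions, not the authors' implementation); it also checks that the objective (2) does not increase along the trajectory, in line with the discussion below.

```python
import numpy as np

def amplitude_loss(A, y, z):
    # Amplitude-based objective (2): L2(z) = 0.5 * || |Az| - y ||_2^2
    return 0.5 * np.linalg.norm(np.abs(A @ z) - y) ** 2

def error_reduction(A, y, z0, T):
    """Sketch of ER: z_{t+1} = A^dagger diag(y/|Az_t|) Az_t, with 0/0 := 0."""
    A_pinv = np.linalg.pinv(A)              # O(m d^2), computed once
    z = z0.astype(complex)
    for _ in range(T):
        u = A @ z
        # phase of u, mapping 0 to 0 as in the text
        phase = np.divide(u, np.abs(u), out=np.zeros_like(u), where=np.abs(u) > 0)
        z = A_pinv @ (y * phase)            # project onto M, then onto im(A)
    return z

# Small randomized demo (dimensions and matrix are arbitrary choices).
rng = np.random.default_rng(2)
m, d = 64, 8
A = rng.standard_normal((m, d)) + 1j * rng.standard_normal((m, d))
x = rng.standard_normal(d) + 1j * rng.standard_normal(d)
y = np.abs(A @ x)                           # noiseless measurements
z0 = rng.standard_normal(d) + 1j * rng.standard_normal(d)

losses = [amplitude_loss(A, y, error_reduction(A, y, z0, T)) for T in (0, 1, 5, 20)]
assert all(a >= b - 1e-6 for a, b in zip(losses, losses[1:]))   # non-increasing
```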
Another complication arising from the non-convexity of M is the non-uniqueness of the projection onto M. Let y_k ≠ 0 and consider the projection of α ∈ C onto M_k. If α is non-zero, the closest point in M_k is given by y_k · α/|α| [12, Lemma 3.15a]. If α = 0, all points in M_k have the same distance to 0 and any of them can be used as a projection. In the literature, this ambiguity is resolved by setting the projection either to y_k or to y_k e^{iϕ} for a randomly selected angle ϕ ∈ [0, 2π). In this paper, we will instead map 0 to 0, which is not precisely a projection, but can be interpreted as an average of all possible projections.

The ER algorithm can also be interpreted as a projected gradient method [12, Section 3.8] applied to solve the minimization problem

min_{u ∈ im(A)} 1/2 ‖ |u| − y ‖_2^2.

We note that substituting Az for u, z ∈ C^d, leads to the unconstrained minimization of the amplitude-based objective (2), which suggests that ER can be interpreted as a gradient method applied to the function L_2.
It is known that, in the absence of noise, if an initial guess z_0 is chosen sufficiently close to the set {αx : |α| = 1}, the ER algorithm will converge to a point in this set [12, Theorem 3.16]. In general, ER does not converge globally to {αx : |α| = 1} [12, p. 830]. If the loss L_2 is differentiable at z_t, the ER iteration will not increase the value of L_2, i.e., L_2(z_{t+1}) ≤ L_2(z_t) [12,38].

Amplitude Flow
The Amplitude Flow (AF) algorithm performs gradient-based optimization of the amplitude-based objective (2). It is based on Wirtinger derivatives, which are discussed in greater detail in Section 4.2; in this section we only superficially define the gradient in order to avoid lengthy derivations.
Given an initial guess z_0 ∈ C^d, AF is based on the iterations

z_{t+1} = z_t − μ_t ∇L_2(z_t), t ≥ 0,

where μ_t > 0 denotes the so-called learning rate and ∇L_2 is the generalized Wirtinger gradient of L_2 given by

∇L_2(z) = A^*(Az − diag(y / |Az|) Az).
Similarly to ER, we treat the case (Az)_k = 0 by setting (Az)_k/|(Az)_k| = 0. The iteration process is continued until the gradient ∇L_2(z_t) vanishes, which is equivalent to reaching a fixed point z_{t+1} = z_t. Originally, AF was derived and analyzed for random Gaussian measurements without noise [17]. For such A, it is possible to construct a good starting point z_0 via spectral initialization [15] or null initialization [45], such that AF admits a linear convergence rate to the set of true solutions {αx : |α| = 1}. In general, for an arbitrary choice of the measurement matrix A, the following convergence result has been established in [19].

Theorem 1 [19] Consider the phase retrieval measurements y of the form (1). Let z_0 ∈ C^d be arbitrary and let μ_t = ‖A‖^{−2} for all t ≥ 0. Then, for the iterates {z_t}_{t≥0} given by AF, the objective values L_2(z_t) are non-increasing, ‖z_{t+1} − z_t‖_2 → 0 as t → ∞, and for every T ∈ N,

min_{0 ≤ t < T} ‖z_{t+1} − z_t‖_2^2 ≤ 2 L_2(z_0) / (σ_1^2(A) T).
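A corresponding NumPy sketch of the AF iteration with the generalized gradient and the learning rate μ_t = ‖A‖^{−2} (again with illustrative names and random data, not the authors' implementation):

```python
import numpy as np

def af_gradient(A, y, z):
    # Generalized Wirtinger gradient: A^*(Az - diag(y/|Az|) Az), with 0/0 := 0
    u = A @ z
    phase = np.divide(u, np.abs(u), out=np.zeros_like(u), where=np.abs(u) > 0)
    return A.conj().T @ (u - y * phase)

def amplitude_flow(A, y, z0, T, mu):
    """Sketch of AF: z_{t+1} = z_t - mu * grad L2(z_t) with a constant step."""
    z = z0.astype(complex)
    for _ in range(T):
        z = z - mu * af_gradient(A, y, z)
    return z

# Small randomized demo (dimensions and matrix are arbitrary choices).
rng = np.random.default_rng(3)
m, d = 64, 8
A = rng.standard_normal((m, d)) + 1j * rng.standard_normal((m, d))
x = rng.standard_normal(d) + 1j * rng.standard_normal(d)
y = np.abs(A @ x)
z0 = rng.standard_normal(d) + 1j * rng.standard_normal(d)

mu = 1.0 / np.linalg.norm(A, 2) ** 2        # learning rate mu_t = ||A||^{-2}
loss = lambda z: 0.5 * np.linalg.norm(np.abs(A @ z) - y) ** 2
assert loss(amplitude_flow(A, y, z0, 50, mu)) <= loss(z0) + 1e-9
```

Note that, unlike ER, no pseudoinverse of A is required, only matrix-vector products with A and A^*.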
Unlike in the randomized scenario, in the general case only convergence to a fixed point with a sublinear rate is guaranteed. Therefore, the initialization z_0 is crucial for the convergence to a global minimum. For a non-random A, e.g., in the case of ptychography, an outcome of the direct (non-iterative) method [30] is a good starting point. Furthermore, with a sufficiently good initialization, AF can achieve a linear convergence rate [46] even for non-random measurements.
As the proof of Theorem 1 resembles the proof of Theorem 3 below, we provide a proof sketch for Theorem 1 in Remark 10 in Section 4.2.
The computational complexity of AF for T ∈ N iterations is given by O(Tmd) operations. If the learning rate is chosen to be μ_t = ‖A‖^{−2}, the computation of the spectral norm can be done with additional O(md) operations by performing a fixed number of power method iterations. More precisely, for K ∈ N and a random starting vector v_0 ∈ C^d, the iterates

v_{k+1} = A^*Av_k / ‖A^*Av_k‖_2, k = 0, . . . , K − 1,

are computed and ‖Av_K‖_2 is used as an estimate of ‖A‖.
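The power method estimate of ‖A‖ described above can be sketched as follows (the number of iterations K and the random initialization are arbitrary choices for this illustration):

```python
import numpy as np

def spectral_norm_estimate(A, K=100, seed=0):
    """Estimate ||A|| by K power-method iterations v <- A^*Av / ||A^*Av||,
    returning ||A v_K|| as in the text. K and the seed are arbitrary choices."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1]) + 1j * rng.standard_normal(A.shape[1])
    for _ in range(K):
        v = A.conj().T @ (A @ v)            # one application of A^*A: O(md)
        v /= np.linalg.norm(v)
    return np.linalg.norm(A @ v)

rng = np.random.default_rng(6)
A = rng.standard_normal((64, 8)) + 1j * rng.standard_normal((64, 8))
est = spectral_norm_estimate(A)
# The estimate should be close to the true spectral norm for a random A.
assert abs(est - np.linalg.norm(A, 2)) <= 1e-3 * np.linalg.norm(A, 2)
```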

Results
As briefly mentioned in Section 2.3, ER can be linked to the minimization of the amplitude-based objective (2). We formalize this intuition in the next lemma.

Lemma 2 Let A be injective. Then, ER is a scaled gradient method with iterations given by

z_{t+1} = z_t − (A^*A)^{−1} ∇L_2(z_t), t ≥ 0.
We emphasize that the result of Lemma 2 is only true for all z ∈ C d , if the ambiguity 0/0 in the iteration of ER is defined as 0.
The reinterpretation of ER as a scaled gradient method allows us to analyze the convergence of the algorithm similarly to AF, which leads to an analogue of Theorem 1.
Theorem 3 Consider the phase retrieval measurements y of the form (1) with an injective matrix A. Let z_0 ∈ C^d be arbitrary. Then, for the iterates {z_t}_{t≥0} given by ER, the objective values L_2(z_t) are non-increasing, ‖z_{t+1} − z_t‖_2 → 0 as t → ∞, and for every T ∈ N,

min_{0 ≤ t < T} ‖z_{t+1} − z_t‖_2^2 ≤ 2 L_2(z_0) / (σ_d^2(A) T),

where σ_d(A) denotes the smallest singular value of the matrix A.
Theorem 3 guarantees that no matter how noisy the measurements are, ER will always converge to a fixed point, and the convergence rate is sublinear. However, even in the absence of noise, it does not guarantee global convergence to a point in the set {αx : |α| = 1}. We note that for the cases A = F and A corresponding to ptychography (see (10) below), the convergence of ER to a fixed point was shown in [10] and [47], respectively. However, the convergence rate was not derived. Comparing Theorem 3 to Theorem 1, we observe that the constant in the convergence rate of ER is worse by a factor of σ_1^2(A)/σ_d^2(A) compared to AF. A further consequence of Lemma 2 is the equality of the fixed-point sets of both algorithms.

Corollary 4 Let A be injective. Then, z ∈ C^d is a fixed point of ER if and only if z is a fixed point of AF.
We note that Corollary 4 does not imply that given the same initial guess z 0 , both algorithms will necessarily converge to the same fixed point.
By Theorem 1 and Theorem 3, both algorithms are comparable in terms of the convergence rate and, by Corollary 4, in terms of critical points. However, for T ∈ N iterations of ER, O(md^2 + Tmd) operations are required, while AF only needs O(Tmd) operations; thus, in general, ER is considerably slower in terms of computational complexity. The next corollary shows that this difference is less significant in cases where the columns of A are orthogonal.

Corollary 5 Let A be injective and satisfy

A^*A = diag(v)  (9)

for some vector v ∈ R^d with positive entries. Then, for T ∈ N iterations, both algorithms ER and AF require O(Tmd) operations. Furthermore, if A^*A = cI for some c > 0, then the iteration of ER coincides with the iteration of AF for the learning rate μ_t = ‖A‖^{−2}.
While condition (9) may seem restrictive, it in fact holds in many practical applications. For instance, the equivalence of both algorithms was observed for the recovery from Fourier magnitudes (A = F) in [10]. Another application of interest is ptychography, for which the measurement matrix A is given by

A = [ F diag(S_{s_1} w) ; F diag(S_{s_2} w) ; . . . ; F diag(S_{s_r} w) ] ∈ C^{rd×d},  (10)

where the vector w ∈ C^d denotes the distribution of the light in the illuminated region and s_1, . . . , s_r ∈ [d], r ≤ d, are the unique positions of the regions. The matrices F and S_{s_j} are given by (5) and (7), respectively. When r = d and s_j = j, j ∈ [d], the matrix A is also known as the discrete short-time Fourier transform (STFT) matrix with window w. The next corollary shows that condition (9) and, consequently, the results of Corollary 5 also hold for ptychographic measurements. In order to illustrate the result of Corollary 6, we perform numerical reconstructions of a randomly generated x ∈ C^d, d = 256, with both AF and ER. In the first case, A is chosen to be the STFT matrix with window w.

Corollary 6 Consider measurements of the form (1) with the ptychographic measurement matrix A as in (10). Then,

A^*A = diag(v) with v = Σ_{j=1}^r |S_{s_j} w|^2,

so that condition (9) is satisfied whenever all entries of v are positive. In particular, if A is the STFT matrix, then A^*A = ‖w‖_2^2 I.
In the second case, A is given by (10), for which the ratio σ_1(A)/σ_d(A) is approximately 45. Figure 1a shows the values L_2(z_t) for 500 iterations of the algorithms starting from a random initialization z_0. Note that for the STFT matrix A, AF and ER coincide, as predicted by Corollary 6, while this is no longer true for A as in (10). Despite producing different reconstructions, the runtimes of the algorithms in Figure 1b are almost the same, which is in line with Corollary 5. Finally, we consider the scenario where the object is supported on J ⊆ [d]. Then, we can rewrite the measurement model as

y = |AE_J x_J| + n,

where x_J is the vector containing the entries of x in J and E_J is a linear embedding operator which maps x_J to x. In this case, the results above apply for the new measurement matrix Ã = AE_J.
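The claims of Corollary 5 and Corollary 6 in the STFT case (r = d, s_j = j) can be checked numerically. The sketch below (with an arbitrary short window w and a small dimension d) builds the matrix (10), verifies A^*A = ‖w‖_2^2 I, and confirms that one ER step equals one AF step with μ = ‖A‖^{−2}; the direction convention of np.roll for the shift S_s is an implementation assumption.

```python
import numpy as np

d = 16
rng = np.random.default_rng(4)
F = np.fft.fft(np.eye(d)) / np.sqrt(d)                  # unitary DFT, cf. (5)
w = np.zeros(d, dtype=complex)
w[:4] = rng.standard_normal(4) + 1j * rng.standard_normal(4)  # arbitrary window

# Stack the blocks F diag(S_s w) for all d shifts, cf. (10) with r = d.
A = np.vstack([F @ np.diag(np.roll(w, s)) for s in range(d)])

c = np.linalg.norm(w) ** 2
assert np.allclose(A.conj().T @ A, c * np.eye(d))       # A^*A = ||w||^2 I

# One ER step coincides with one AF step for mu = ||A||^{-2} = 1/c.
x = rng.standard_normal(d) + 1j * rng.standard_normal(d)
y = np.abs(A @ x)
z = rng.standard_normal(d) + 1j * rng.standard_normal(d)
u = A @ z
phase = np.divide(u, np.abs(u), out=np.zeros_like(u), where=np.abs(u) > 0)
er_step = np.linalg.pinv(A) @ (y * phase)
af_step = z - (1.0 / c) * (A.conj().T @ (u - y * phase))
assert np.allclose(er_step, af_step)
```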

Proofs of Lemma 2 and corollaries
We will start with the proof of Lemma 2.

Proof of Lemma 2 By the assumption, A is injective and, thus, identities (3) and (4) hold true. Therefore, the iteration of ER can be rewritten as

z_{t+1} = A^† diag(y / |Az_t|) Az_t
= z_t − A^†(Az_t − diag(y / |Az_t|) Az_t)
= z_t − (A^*A)^{−1} A^*(Az_t − diag(y / |Az_t|) Az_t)
= z_t − (A^*A)^{−1} ∇L_2(z_t),

where the second equality uses A^†Az_t = z_t and the last equality is the formula for the generalized gradient from Lemma 7 below.
Using the result of Lemma 2, we deduce Corollary 4 and Corollary 5.

Proof of Corollary 4
Let z ∈ C^d be a fixed point of ER. By Lemma 2, we have that

z = z − (A^*A)^{−1} ∇L_2(z),

which is equivalent to (A^*A)^{−1} ∇L_2(z) = 0. Since A is injective and (A^*A)^{−1} exists, the obtained equality holds if and only if ∇L_2(z) = 0, so that z is a fixed point of AF.

Proof of Corollary 5
Using condition (9), we obtain (A^*A)^{−1} = diag(1/v). Consequently, by Lemma 2, the iteration of ER is given by

z_{t+1} = z_t − diag(1/v) ∇L_2(z_t),

so that, in contrast to the general case, no pseudoinverse has to be computed and each iteration only requires O(md) operations, the same as for AF. Furthermore, if A^*A = cI for some c > 0, then all singular values of A equal √c and, in particular, c = ‖A‖^2. Hence, using Lemma 2, for the iteration of ER we have

z_{t+1} = z_t − ‖A‖^{−2} ∇L_2(z_t),

which is precisely the iteration of AF with the learning rate μ_t = ‖A‖^{−2}.
The last corollary is the result of direct computations similar to equation (12) in [19].

Proof of Corollary 6 We compute the product A^*A by using the representation (10),

A^*A = Σ_{j=1}^r diag(S_{s_j} w)^* F^*F diag(S_{s_j} w).

Next, we use (6) and diag(S_{s_j} w)^* = diag(conj(S_{s_j} w)) to obtain

A^*A = Σ_{j=1}^r diag(conj(S_{s_j} w)) diag(S_{s_j} w) = diag(Σ_{j=1}^r |S_{s_j} w|^2) = diag(v).

The matrix A is injective if and only if A^*A is invertible, and the diagonal matrix diag(v) is invertible when all its entries are non-zero.

If A is the STFT matrix, then s_j = j for all j ∈ [d] and the entries of the vector v further simplify to

v_ℓ = Σ_{j=1}^d |(S_j w)_ℓ|^2 = Σ_{j=1}^d |w_{ℓ+j}|^2, ℓ ∈ [d],

with the index ℓ + j understood modulo d. Changing the order of summation yields v_ℓ = ‖w‖_2^2 for all ℓ ∈ [d], which concludes the proof.

Proof of Theorem 3
The proof of Theorem 3 is based on Wirtinger derivatives [48]. Let us recall some basic facts about Wirtinger derivatives based on [49,50]. A function f : C → C can be viewed as a function of two real variables, the real and imaginary parts of the argument z = α + iβ. The function f is said to be differentiable in the real sense if the derivatives with respect to α and β exist. Then, the Wirtinger derivatives are defined as

∂f/∂z = 1/2 (∂f/∂α − i ∂f/∂β), ∂f/∂z̄ = 1/2 (∂f/∂α + i ∂f/∂β),

which is nothing but a change of the coordinate system to conjugate coordinates. In this sense, we treat the function f as a function of z and z̄ instead of α and β.
As an example, consider f(z) = z = α + iβ. Its Wirtinger derivatives are

∂f/∂z = 1/2 (1 − i·i) = 1, ∂f/∂z̄ = 1/2 (1 + i·i) = 0,

which implies that z̄ can be treated as a constant when the derivative with respect to z is computed and vice versa. Similarly to the real analysis of multivariate functions, Wirtinger derivatives are extended to f : C^d → C, that is, for z ∈ C^d they are given by

∂f/∂z = (∂f/∂z_1, . . . , ∂f/∂z_d)^T, ∂f/∂z̄ = (∂f/∂z̄_1, . . . , ∂f/∂z̄_d)^T.

The computation of Wirtinger derivatives is analogous to standard real analysis, as the arithmetic operations and the chain rule extend to the complex case. For Wirtinger derivatives it also holds that

conj(∂f/∂z) = ∂f̄/∂z̄  (11)

for any differentiable function f. The Wirtinger derivatives are particularly useful for the optimization of real-valued functions of complex variables. Let f : C^d → R be a differentiable real-valued function. Its differential can be presented in the form of Wirtinger derivatives as

df = (∂f/∂z)^T dz + (∂f/∂z̄)^T dz̄.

Since f is real-valued, by (11), it holds that conj(∂f/∂z) = ∂f/∂z̄ and the differential simplifies to

df = 2 Re[(∂f/∂z̄)^* dz].

It is maximal when dz is a scaled version of ∂f/∂z̄ and, thus, ∂f/∂z̄ gives the direction of the steepest ascent. Moreover, the critical points of f are those where the derivative with respect to z̄ vanishes. For this reason, the gradient of f is defined as

∇f(z) := 2 ∂f/∂z̄ (z).

In our analysis, we will also need the Wirtinger version of the second-order Taylor approximation theorem in integral form. That is, for all twice continuously differentiable functions f : C^d → R and z, v ∈ C^d,

f(z + v) = f(z) + Re[∇f(z)^* v] + ∫_0^1 (1 − s) [v; v̄]^* ∇²f(z + sv) [v; v̄] ds,  (12)

where ∇²f denotes the Hessian matrix and its components are given by the second-order Wirtinger derivatives ∂²f/(∂z_j ∂z_k), ∂²f/(∂z_j ∂z̄_k), ∂²f/(∂z̄_j ∂z_k) and ∂²f/(∂z̄_j ∂z̄_k), j, k ∈ [d]. For further information on Wirtinger calculus, we refer the reader to [49,50].

Let us go back to the amplitude-based objective (2). We rewrite it as

L_2(z) = 1/2 Σ_{k=1}^m ( √( (Az)_k conj((Az)_k) ) − y_k )².  (13)

Since √· is not differentiable at 0, L_2 is not differentiable on all of C^d. Hence, the gradient of L_2 is not properly defined for points z with (Az)_k = 0 for some k ∈ [m]. In order to overcome this issue, we consider the following smoothed version of (13),

L_{2,ε}(z) = 1/2 Σ_{k=1}^m ( √( (Az)_k conj((Az)_k) + ε² ) − y_k )²,  (14)

where ε > 0. The function L_{2,ε} possesses some useful properties.
Firstly, L_{2,ε} is continuous in ε and we have

lim_{ε→0+} L_{2,ε}(z) = L_2(z) for all z ∈ C^d.

Secondly, we can compute the gradient of L_{2,ε} everywhere and properly define the generalized gradient of L_2 as the limit of the gradients as the parameter ε vanishes.

Lemma 7
The function L_{2,ε} is continuously differentiable with the gradient given by

∇L_{2,ε}(z) = A^*(Az − diag(y / √(|Az|² + ε²)) Az).

Furthermore, the generalized gradient of L_2 is given by the pointwise limit

∇L_2(z) = lim_{ε→0+} ∇L_{2,ε}(z) = A^*(Az − diag(y / |Az|) Az),

with the ambiguity 0/0 resolved as 0.

Proof Denote by a_k the conjugate of the k-th row of the matrix A, so that (Az)_k = a_k^* z. Then, a single summand of L_{2,ε} is given by

f_k(z) = 1/2 ( √( (a_k^* z) conj(a_k^* z) + ε² ) − y_k )².

The gradient of f_k can be evaluated by the chain rule. We get

∇f_k(z) = 2 ∂f_k/∂z̄ (z) = a_k ( (a_k^* z) − y_k (a_k^* z) / √( |a_k^* z|² + ε² ) ).

Then, by the linearity of derivatives,

∇L_{2,ε}(z) = Σ_{k=1}^m ∇f_k(z) = A^*(Az − diag(y / √(|Az|² + ε²)) Az).

For the generalized gradient of L_2 we consider two cases. If (Az)_k ≠ 0 for all k ∈ [m], the limit ε → 0+ can be taken in each entry, which yields ∇L_2(z) = A^*(Az − diag(y/|Az|) Az). Note that in this case L_2 is differentiable at z and its gradient coincides with the limit of ∇L_{2,ε}(z) as ε → 0+. On the other hand, if (Az)_k = 0 for some k ∈ [m], the k-th summand satisfies y_k (a_k^* z)/√(|a_k^* z|² + ε²) = 0 for every ε > 0, so that the limit agrees with the stated formula with the ambiguity 0/0 resolved as 0.
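The closed-form gradient from Lemma 7 can be validated against finite differences of L_2 in the real coordinates (α, β), using the identification ∇f = ∂f/∂α + i ∂f/∂β = 2 ∂f/∂z̄ applied entrywise. The sketch below (arbitrary random data) performs this check at a point where (Az)_k ≠ 0 for all k.

```python
import numpy as np

rng = np.random.default_rng(5)
m, d = 12, 4
A = rng.standard_normal((m, d)) + 1j * rng.standard_normal((m, d))
x = rng.standard_normal(d) + 1j * rng.standard_normal(d)
y = np.abs(A @ x)
z = rng.standard_normal(d) + 1j * rng.standard_normal(d)  # (Az)_k != 0 a.s.

L2 = lambda z: 0.5 * np.linalg.norm(np.abs(A @ z) - y) ** 2
u = A @ z
grad = A.conj().T @ (u - y * u / np.abs(u))      # closed-form gradient of L2

# Central finite differences in the real and imaginary parts of each entry.
h = 1e-6
fd = np.zeros(d, dtype=complex)
for k in range(d):
    e = np.zeros(d)
    e[k] = 1.0
    fd[k] = (L2(z + h * e) - L2(z - h * e)) / (2 * h) \
          + 1j * (L2(z + 1j * h * e) - L2(z - 1j * h * e)) / (2 * h)

assert np.allclose(grad, fd, atol=1e-4)
```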
The last property concerns the Hessian matrix of L 2,ε .
Lemma 8 [19] The function L_{2,ε} is twice continuously differentiable and its Hessian matrix satisfies

[v; v̄]^* ∇²L_{2,ε}(z) [v; v̄] ≤ ‖Av‖_2² for all z, v ∈ C^d.

Proof See the computations on pages 27–28 of [19].

Remark 9
The convergence of gradient descent to a critical point is often studied [51] under the assumption that the function f is L-smooth with L ≥ 0. For twice continuously differentiable functions, L-smoothness is equivalent to the two-sided bound

−L ‖v‖_2² ≤ [v; v̄]^* ∇²f(z) [v; v̄] ≤ L ‖v‖_2² for all z, v.

In fact, only the upper bound is sufficient to establish the convergence of gradient descent. In our case, a stronger version of the upper bound is given by Lemma 8. Now we are equipped for the proof of Theorem 3.

Proof of Theorem 3
In view of Lemma 2, let us consider the smoothed step of the ER algorithm

z_{t+1} = z_t − (A^*A)^{−1} ∇L_{2,ε}(z_t).

Note that (A^*A)^{−1} exists due to the injectivity of A.
We first show that a single step of the smoothed Error Reduction works for the minimization of L_{2,ε} and then we take pointwise limits to obtain the desired result for L_2. In order to derive that in each iteration step the objective does not increase, we apply Taylor's theorem (12) with arbitrary z ∈ C^d and v = −(A^*A)^{−1} ∇L_{2,ε}(z). We note that, by Lemma 8, the integral in (12) is bounded as

∫_0^1 (1 − s) [v; v̄]^* ∇²L_{2,ε}(z + sv) [v; v̄] ds ≤ 1/2 ‖Av‖_2² = 1/2 ∇L_{2,ε}(z)^* (A^*A)^{−1} ∇L_{2,ε}(z).

Hence, by (12), we have

L_{2,ε}(z + v) ≤ L_{2,ε}(z) − ∇L_{2,ε}(z)^* (A^*A)^{−1} ∇L_{2,ε}(z) + 1/2 ∇L_{2,ε}(z)^* (A^*A)^{−1} ∇L_{2,ε}(z) = L_{2,ε}(z) − 1/2 ∇L_{2,ε}(z)^* (A^*A)^{−1} ∇L_{2,ε}(z),

and, thus, taking the limit ε → 0+ yields

L_2(z − (A^*A)^{−1} ∇L_2(z)) ≤ L_2(z) − 1/2 ∇L_2(z)^* (A^*A)^{−1} ∇L_2(z).

Selecting z as the iterates z_t of ER, we obtain

L_2(z_{t+1}) ≤ L_2(z_t) − 1/2 ∇L_2(z_t)^* (A^*A)^{−1} ∇L_2(z_t).  (15)

Since A is injective, its singular value decomposition is given by A = UΣV^* with U ∈ C^{m×d} having orthonormal columns, unitary V ∈ C^{d×d} and an invertible diagonal matrix Σ ∈ R^{d×d}. Then,

(A^*A)^{−1} = VΣ^{−2}V^*

is the singular value decomposition of (A^*A)^{−1}. From this representation we deduce that

∇L_2(z_t)^* (A^*A)^{−1} ∇L_2(z_t) = ‖Σ^{−1}V^* ∇L_2(z_t)‖_2² ≥ 0.

Thus, by (15), L_2(z_{t+1}) ≤ L_2(z_t), which shows the first statement of Theorem 3.
In order to prove the remaining statements of Theorem 3, we need to link the decay of the objective to the iterates. By Lemma 2, we have that

z_{t+1} − z_t = −(A^*A)^{−1} ∇L_2(z_t) = −VΣ^{−2}V^* ∇L_2(z_t).  (16)

Using (16) and the definition of the spectral norm, the squared distance between the iterates can be bounded as

‖z_{t+1} − z_t‖_2² = (Σ^{−1}V^* ∇L_2(z_t))^* Σ^{−2} (Σ^{−1}V^* ∇L_2(z_t)) = ‖Σ^{−1}(Σ^{−1}V^* ∇L_2(z_t))‖_2² ≤ ‖Σ^{−1}‖² ‖Σ^{−1}V^* ∇L_2(z_t)‖_2² = σ_d^{−2}(A) ∇L_2(z_t)^* (A^*A)^{−1} ∇L_2(z_t).

Next, we sum up the norms for T ∈ N iterations of ER and apply (15) to obtain

Σ_{t=0}^{T−1} ‖z_{t+1} − z_t‖_2² ≤ σ_d^{−2}(A) Σ_{t=0}^{T−1} ∇L_2(z_t)^* (A^*A)^{−1} ∇L_2(z_t) ≤ 2 σ_d^{−2}(A) Σ_{t=0}^{T−1} [L_2(z_t) − L_2(z_{t+1})] = 2 σ_d^{−2}(A) [L_2(z_0) − L_2(z_T)] ≤ 2 σ_d^{−2}(A) L_2(z_0),

where in the last line we used that L_2(z) ≥ 0 for all z ∈ C^d. This implies that the partial sums of the series Σ_{t=0}^∞ ‖z_{t+1} − z_t‖_2² are bounded and, thus, the series converges. Consequently, the summands converge to zero, that is, ‖z_{t+1} − z_t‖_2² → 0 as t → ∞. Furthermore, we have

min_{t=0,...,T−1} ‖z_{t+1} − z_t‖_2² ≤ 1/T Σ_{t=0}^{T−1} ‖z_{t+1} − z_t‖_2² ≤ 2 L_2(z_0) / (σ_d²(A) T),

which concludes the proof.

Remark 10
The proof of Theorem 1 follows the same logic. By Taylor's theorem (12) with v = −μ_t ∇L_2(z_t), the analogue of inequality (15) is established. Since the learning rate is a positive constant, it further implies that the norm of the gradient converges to zero with the desired speed, similarly to the proof of Theorem 3.

Conclusion
In this paper we established an interpretation of the Error Reduction algorithm as a scaled gradient method and derived its convergence rate. Furthermore, it was shown that in practical scenarios Error Reduction has the same computational complexity as the Amplitude Flow method, and that the two algorithms coincide in some cases.
In the future, we plan to extend our analysis to the Hybrid Input-Output method [9] and the extended Ptychographic Iterative Engine [52] used for the problem of blind ptychography.