Abstract
In this paper, we consider two iterative algorithms for the phase retrieval problem: the well-known Error Reduction method and the Amplitude Flow algorithm, which performs minimization of the amplitude-based squared loss via gradient descent. We show that Error Reduction can be interpreted as a scaled gradient method applied to minimize the same amplitude-based squared loss, which allows us to establish its convergence properties. Moreover, we show that for a class of measurement scenarios, such as ptychography, both methods have the same computational complexity and sometimes even coincide.
1 Introduction
The phase retrieval problem considers the reconstruction of an unknown \(x \in \mathbb {C}^{d}\) from \(m \in \mathbb {N}\) amplitude measurements of the form
\[ y_k = \vert (Ax)_k \vert + n_k, \quad k \in [m], \qquad (1) \]
with \(A \in \mathbb {C}^{m \times d}\) denoting the measurement matrix and \(n \in \mathbb {R}^m\) being noise.
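For illustration, the measurement model (1) can be simulated in a few lines; this is a minimal sketch in which the dimensions, the random matrix A, and the noise level are placeholder choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 256                       # object and measurement dimensions, m >= d
A = rng.standard_normal((m, d)) + 1j * rng.standard_normal((m, d))  # placeholder A
x = rng.standard_normal(d) + 1j * rng.standard_normal(d)            # unknown object
n = 0.01 * rng.standard_normal(m)    # real-valued noise as in (1)
y = np.abs(A @ x) + n                # amplitude measurements
```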
It has many applications, such as crystallography [1], imaging of noncrystalline materials [2,3,4] and optical imaging [5], where the goal is to recover the specimen from its diffraction patterns obtained by illumination with penetrating light, e.g., X-rays or an electron beam.
One such application is ptychography [6, 7], where the inference on the object of interest is based on a collection of far-field diffraction patterns, each obtained by an illumination of a small region of the specimen. As the regions overlap, the measurements carry surplus information, which allows for the unique identification of the object from ptychographic measurements up to a global phase factor.
Since the introduction of the phase retrieval problem to the mathematical community, many approaches have been developed in order to reconstruct the specimen. The spectrum of methods includes alternating projections methods [8,9,10,11,12,13,14], gradient-based minimization [15,16,17,18,19], semidefinite [20,21,22,23,24] and linear programming [25, 26], direct methods [27,28,29,30,31], and many more.
One of the longstanding favored algorithms is Error Reduction (ER), which was introduced in 1972 by Gerchberg and Saxton [8]. Later contributions [10, 32] and [12] classified ER as an alternating projections technique, supplemented it with a detailed convergence analysis, and provided an interpretation of the algorithm as a projected gradient method. A version of the ER algorithm was also studied as a gradient flow method in the continuous setting [33].
Another algorithm, which became popular in recent years, is Amplitude Flow (AF) [17, 19]. It performs first-order optimization of the amplitude-based squared loss
\[ \mathcal{L}_2(z) := \sum_{k=1}^m \left( \vert (Az)_k \vert - y_k \right)^2, \quad z \in \mathbb{C}^d. \qquad (2) \]
AF is well-understood for randomized measurement scenarios [17], where the matrix A is random. It also possesses convergence guarantees for arbitrary measurement scenarios [19].
In this paper, we connect these two methods by representing ER as a scaled gradient method for the minimization of the amplitude-based squared loss \(\mathcal {L}_2\). This allows us to establish the convergence rate of the ER algorithm, which, to our knowledge, has not been observed in the literature before. Furthermore, the scaled gradient representation provides the equivalence between the sets of fixed points of the two methods. Lastly, we consider ER and AF in application to ptychographic measurements and show that both methods exhibit the same computational complexity and in special cases even coincide.
The paper is structured in the following way. In Section 2 we provide the reader with the necessary notation and a detailed overview of the ER and AF algorithms. Our contribution is then presented in Section 3 and proved in Section 4. Finally, the paper is summarized by a short conclusion.
2 Notation and Preliminaries
2.1 Definitions
Throughout the paper, we will use the short notation \([a] = \{1,2,\ldots , a\}\) for index sets. The complex unit is denoted by i. The complex conjugate of \(\alpha \in \mathbb {C}\) is given by \(\bar{\alpha }\). The transpose and the complex conjugate transpose of a vector v or a matrix B are denoted by \(v^T, v^*\) and \(B^T, B^*\), respectively. The Euclidean norm of a vector \(v \in \mathbb {C}^a\) is given by \(\Vert {v}\Vert _2 := \left[ \sum _{j = 1}^a \vert {v_j}\vert ^2 \right] ^{1/2}\). We say that a matrix \(B \in \mathbb {C}^{b \times a}, b \ge a\), is injective if for all pairs of vectors \(u,v \in \mathbb {C}^a\) with \(u \ne v\) it holds that \(Bu \ne Bv\). The injectivity of B is equivalent to the condition \(\mathrm{rank}(B) = a\). We will also denote the image of B as
\[ \operatorname{im}(B) := \{ Bv : v \in \mathbb{C}^a \}. \]
For a square full rank matrix B its inverse is given by \(B^{-1}\). A matrix \(B \in \mathbb {C}^{b \times a}, b \ge a\) is called orthogonal if it satisfies \(B^* B = I\), where I denotes the identity matrix. A square orthogonal matrix \(B \in \mathbb {C}^{a \times a}\) is a unitary matrix and its inverse is \(B^{-1} =B^*\).
The projection of \(u \in \mathbb {C}^{a}\) onto a set \(\mathcal {S} \subseteq \mathbb {C}^{a}\) is an element \(\tilde{u} \in \mathcal {S}\) such that \(\Vert {u - \tilde{u}}\Vert _2 \le \Vert {u - v}\Vert _2\) for all \(v \in \mathcal {S}\). An operator that maps u to \(\tilde{u}\) is called the projection operator onto \(\mathcal {S}\). In general, \(\tilde{u}\) is not unique; however, in the case when \(\mathcal {S}\) is a non-empty closed convex set, \(\tilde{u}\) is uniquely determined [34].
For a matrix \(B \in \mathbb {C}^{b \times a}\) of rank r, its singular value decomposition is given by
\[ B = U \Sigma V^*, \]
where \(U\in \mathbb {C}^{b \times r}, V \in \mathbb {C}^{a \times r}\) are orthogonal matrices and \(\Sigma \in \mathbb {R}^{r \times r}\) is an invertible diagonal matrix with diagonal entries \(\sigma _j(B)>0\), \(j \in [r]\), sorted in decreasing order. The values \(\sigma _j(B)\) are also referred to as the singular values of B. The largest singular value \(\sigma _1(B)\) equals the spectral norm of B defined as
\[ \Vert B \Vert := \max_{v \in \mathbb{C}^a \setminus \{0\}} \frac{ \Vert B v \Vert_2 }{ \Vert v \Vert_2 }. \]
Using the singular value decomposition, the Moore-Penrose pseudoinverse of B is defined as
\[ B^\dagger := V \Sigma^{-1} U^*. \]
For an injective matrix \(B \in \mathbb {C}^{b \times a}, b \ge a\), its pseudoinverse \(B^\dagger \) can be expressed as
\[ B^\dagger = (B^* B)^{-1} B^*. \]
It satisfies
\[ B B^\dagger u = \mathop{\mathrm{arg\,min}}_{v \in \operatorname{im}(B)} \Vert u - v \Vert_2 \quad \text{for all } u \in \mathbb{C}^b, \qquad (3) \]
that is, \(B B^\dagger\) is the projection operator onto \(\operatorname{im}(B)\), and
\[ B^\dagger B = I. \qquad (4) \]
For a vector \(v \in \mathbb {C}^a\), the diagonal matrix \({\text {diag}}(v) \in \mathbb {C}^{a \times a}\) is formed by placing the entries of the vector v onto the main diagonal, so that for \(k,j \in [a]\) it holds that
\[ ({\text{diag}}(v))_{k,j} = \begin{cases} v_k, & k = j, \\ 0, & k \ne j. \end{cases} \]
The discrete Fourier transform is given by a matrix \(F \in \mathbb {C}^{d \times d}\) with the entries
\[ F_{k,j} = e^{-2\pi i (k-1)(j-1)/d}, \quad k,j \in [d], \qquad (5) \]
and satisfies the equality
\[ F^* F = F F^* = d\, I. \qquad (6) \]
The family of the circular shift matrices \(S_s \in \mathbb {C}^{d \times d}, s \in \mathbb {Z}\), is defined by its action for all vectors \(v \in \mathbb {C}^d\) as
\[ (S_s v)_k = v_{((k + s - 1) \bmod d) + 1}, \quad k \in [d]. \qquad (7) \]
For the description of the computational complexity of algorithms, we use the notation \(\mathcal {O}(n)\) for the order of operations, meaning that at most cn operations are required for some constant \(c>0\).
For a function \(f: \mathbb {C} \rightarrow \mathbb {C}\) and a vector \(v \in \mathbb {C}^a\), the notation f(v) will denote the entrywise application of the function f. For a vector \(v \in \mathbb {C}^a\) and a number \(\alpha \in \mathbb {C}\), by \(v + \alpha \) we denote the vector in \(\mathbb {C}^a\) with entries \(v_k + \alpha \), \(k \in [a]\). For instance, using this notation, we can rewrite the measurements (1) as
\[ y = \vert Ax \vert + n. \]
2.2 Phase retrieval
In the context of the phase retrieval problem, it is convenient to refer to the spaces \(\mathbb {C}^d\) and \(\mathbb {C}^m\) as object and measurement spaces, respectively. If the phases of the measurements Ax were known, the problem would be the classical recovery from linear measurements, which in general is only possible if the dimension of the measurement space m is at least as large as the dimension of the object space d. Since the phases are lost, the number of required measurements is even higher and, hence, we will assume that \(m \ge d\). It is known that \(m \ge 4 d - 4\) measurements are sufficient for the unique reconstruction of x when A is generic [35] and \(m \ge c d\) with constant \(c\ge 1\) when A is random [23, 24]. By the unique reconstruction of x, we understand the identification of x up to a global phase factor \(\alpha x\) for any \(\alpha \in \mathbb {C}, \vert {\alpha }\vert = 1\), since it holds that
\[ \vert A (\alpha x) \vert = \vert \alpha \vert \, \vert A x \vert = \vert A x \vert. \]
The unique reconstruction of x up to a global phase is equivalent to the unique identification of the set \(\{\alpha x: \vert {\alpha }\vert = 1 \}\) or, in other words, to the injectivity of the map \(\{\alpha x : \vert {\alpha }\vert = 1\} \mapsto \vert {Ax}\vert \). One of the necessary conditions for the unique recovery is the injectivity of the matrix A. If A is not injective, then there exist two vectors \(u,v \in \mathbb {C}^d\) such that \(u \ne v\) and \(Au = Av\). Consequently, \(\vert {A(u-v)}\vert = 0\) and it is not possible to distinguish \(u-v\) and the zero vector from the measurements. The injectivity of the matrix A will be the main assumption for our results in Section 3. The injectivity of A, however, is not sufficient for unique recovery. A counterexample is \(A=F\), which is injective by (6), but it is well-known that there are multiple objects satisfying the same measurements \(\vert {F x}\vert \) [36].
2.3 Error Reduction
The Error Reduction (ER) algorithm is an iterative algorithm for the phase retrieval problem. It starts from an initial guess \(z^{0} \in \mathbb {C}^{d}\) in the object space and is given by the iterations
\[ z^{t+1} = A^\dagger \, {\text{diag}}\left( \frac{y}{\vert A z^t \vert} \right) A z^t, \quad t \ge 0. \]
The iterations are repeated until a fixed point is reached, so that \(z^{t+1} = z^t\). For \(T \in \mathbb {N}\) iterations of ER, \(\mathcal {O}(m d^2 + T m d)\) operations are required, where \(\mathcal {O}(m d^2)\) operations are needed to compute the pseudoinverse and \(\mathcal {O}(m d)\) operations are performed per iteration.
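A minimal NumPy sketch of the ER iteration may look as follows; the helper safe_sign implements the convention 0/0 := 0 discussed below, and the tolerance and iteration count are illustrative choices rather than part of the algorithm's definition.

```python
import numpy as np

def safe_sign(u):
    """Entrywise u / |u| with the convention 0/0 := 0."""
    a = np.abs(u)
    return np.divide(u, a, out=np.zeros_like(u), where=a > 0)

def error_reduction(A, y, z0, T=500, tol=1e-12):
    """ER iteration z^{t+1} = A^dagger diag(y / |A z^t|) A z^t."""
    A_pinv = np.linalg.pinv(A)       # O(m d^2) operations, computed once
    z = z0
    for _ in range(T):               # O(m d) operations per iteration
        z_new = A_pinv @ (y * safe_sign(A @ z))
        if np.linalg.norm(z_new - z) < tol:   # fixed point reached
            return z_new
        z = z_new
    return z
```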
Let us consider the iterates in the measurement space \(u^t := A z^t, t \ge 0\), for which the update of ER reads as
\[ u^{t+1} = A A^\dagger \, {\text{diag}}\left( \frac{y}{\vert u^t \vert} \right) u^t. \]
In this form, \({\text {diag}}\left( \frac{y}{\vert {u^t}\vert }\right) u^t\) is the projection of \(u^t\) onto the set
\[ \mathcal{M} := \{ u \in \mathbb{C}^m : \vert u \vert = y \}. \]
Moreover, the set \(\mathcal {M}\) can be viewed as a product of one-dimensional sets
\[ \mathcal{M} = \mathcal{M}_1 \times \cdots \times \mathcal{M}_m, \qquad \mathcal{M}_k := \{ \beta \in \mathbb{C} : \vert \beta \vert = y_k \}, \quad k \in [m], \]
and, thus, the projection onto \(\mathcal {M}\) is performed by projecting each coordinate \(u^t_k\) onto the corresponding set \(\mathcal {M}_k\). The second step is to apply \(A A^\dagger \), which, by (3), is the projection operator onto \({\text {im}}(A)\). Therefore, ER first projects onto \(\mathcal {M}\), where the measurements are satisfied. Then, the resulting point is projected onto \({\text {im}}(A)\). The sequential projections onto \(\mathcal {M}\) and \({\text {im}}(A)\) allow for an interpretation of ER as an alternating projection scheme. If \(\mathcal {M}\) were a convex set, then ER would converge to the intersection of the two sets [37]. However, due to the non-convexity of \(\mathcal {M}\), the convergence of \(u^t\) to the intersection of the sets is not guaranteed, which is a known problem of the ER algorithm. We note that, when A allows for unique recovery and noise is absent, the intersection of \(\mathcal {M}\) and \({\text {im}}(A)\) is given by \(\{ \alpha x: \vert {\alpha }\vert = 1\}\) [12].
Another complication arising from the non-convexity of \(\mathcal {M}\) is the non-uniqueness of the projection onto \(\mathcal {M}\). Let \(y_k \ne 0\) and consider the projection of \(\alpha \in \mathbb {C}\) onto \(\mathcal {M}_k\). If \(\alpha \) is non-zero, the closest point in \(\mathcal {M}_k\) is given by \(y_k \cdot \alpha / \vert {\alpha }\vert \) [12, Lemma 3.15a]. If \(\alpha = 0\), all points in \(\mathcal {M}_k\) have the same distance to 0 and any of them can be used as a projection. In the literature, this is resolved by setting the projection either to \(y_k\) or to \(y_k e^{i \varphi }\) for a randomly selected angle \(\varphi \in [0,2 \pi )\). In this paper, we will instead map 0 to 0, which is not precisely the projection, but can be interpreted as an average of all possible projections,
\[ \frac{1}{2\pi} \int_0^{2\pi} y_k e^{i \varphi} \, d\varphi = 0. \]
Therefore, whenever \((Az^t)_k = 0 \) we set \((Az^t)_k / \vert {(Az^t)_k}\vert = 0\).
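In the measurement space, one ER step thus amounts to the two projections described above. A short sketch under the same 0 ↦ 0 convention, reusing safe_sign from the ER code earlier:

```python
def proj_M(u, y):
    """Project u onto M = {v : |v| = y}, mapping zero entries of u to 0."""
    return y * safe_sign(u)

def er_step_measurement_space(A, A_pinv, y, u):
    """One ER step u^{t+1} = A A^dagger proj_M(u^t): first onto M, then onto im(A)."""
    return A @ (A_pinv @ proj_M(u, y))
```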
The ER algorithm can also be interpreted as a projected gradient method [12, Section 3.8] applied to solve the minimization problem
\[ \min_{u \in \operatorname{im}(A)} \Vert \vert u \vert - y \Vert_2^2. \]
We note that substituting u = Az, \(z \in \mathbb {C}^d\), leads to the unconstrained minimization of the amplitude-based objective (2), which suggests that ER can be interpreted as a gradient method applied to the function \(\mathcal {L}_2\).
It is known that, in the absence of noise, if an initial guess \(z^0\) is chosen sufficiently close to the set \(\{\alpha x: \vert {\alpha }\vert = 1 \}\), the ER algorithm will converge to a point in this set [12, Theorem 3.16]. In general, ER does not converge globally to \(\{\alpha x: \vert {\alpha }\vert = 1 \}\) [12, p.830]. If the loss \(\mathcal {L}_2\) is differentiable at \(z^t\), the ER iteration will not increase the value of \(\mathcal {L}_2\), i.e., \(\mathcal {L}_2(z^{t+1}) \le \mathcal {L}_2(z^t)\) [12, 38].
For the initialization \(z^0\) of ER, the polarization method can be used [12, 39, 40]. It constructs a matrix containing the estimates of \({\text {sgn}}((Ax)_k) \overline{{\text {sgn}}((Ax)_\ell )}\), \(k,\ell \in [m]\), from the measurements and recovers \({\text {sgn}}((Ax)_k)\) by solving the phase synchronization problem [41,42,43,44].
2.4 Amplitude Flow
The Amplitude Flow algorithm (AF) considers the gradient-based optimization of the amplitude-based objective (2). The algorithm is based on Wirtinger derivatives, which are discussed in greater detail in Section 4.2; in this section we only state the gradient in order to avoid lengthy derivations.
Given an initial guess \(z^0 \in \mathbb {C}^d\), AF is based on the iterations
\[ z^{t+1} = z^t - \mu_t \nabla \mathcal{L}_2(z^t), \quad t \ge 0, \]
where \(\mu _t > 0\) denotes the so-called learning rate and \(\nabla \mathcal {L}_2\) is the generalized Wirtinger gradient of \(\mathcal {L}_2\) given by
\[ \nabla \mathcal{L}_2(z) = A^* \left( Az - {\text{diag}}\left( \frac{y}{\vert Az \vert} \right) Az \right). \]
Similarly to ER, we treat the case \((Az)_k = 0\) by setting \((Az)_k / \vert {(Az)_k}\vert = 0\). The iteration process is continued until the gradient \(\nabla \mathcal {L}_2(z^t)\) vanishes, which is equivalent to reaching a fixed point \(z^{t+1} = z^t\). Originally, AF was derived and analyzed for random Gaussian measurements without noise [17]. For such A, it is possible to construct a good starting point \(z^0\) via spectral initialization [15] or null initialization [45], such that AF admits a linear convergence rate to the set of true solutions \(\{\alpha x : \vert {\alpha }\vert = 1 \}\). In general, for any choice of the measurement matrix A, the following convergence result has been established in [19].
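A corresponding sketch of AF, reusing safe_sign from the ER code above; the step size follows the choice \(\mu_t = \Vert A \Vert^{-2}\) permitted by Theorem 1 below, while the iteration count is again an illustrative placeholder.

```python
def amplitude_flow(A, y, z0, T=500):
    """AF iteration z^{t+1} = z^t - mu * A^*(A z^t - y * sgn(A z^t))."""
    mu = 1.0 / np.linalg.norm(A, 2) ** 2    # learning rate mu_t = ||A||^{-2}
    z = z0
    for _ in range(T):
        Az = A @ z
        grad = A.conj().T @ (Az - y * safe_sign(Az))  # generalized Wirtinger gradient
        z = z - mu * grad
    return z
```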
Theorem 1
([19, Theorem 1]) Consider measurements y of the form (1). Let \(0<\mu _t \le \Vert {A}\Vert ^{-2}\) and \(z^0 \in \mathbb {C}^d\) be arbitrary. Then, for the iterates \(\{ z^t \}_{t \ge 0}\) defined by AF we have
\[ \mathcal{L}_2(z^{t+1}) \le \mathcal{L}_2(z^t) - \mu_t \Vert \nabla \mathcal{L}_2(z^t) \Vert_2^2, \quad t \ge 0, \]
and
\[ \min_{0 \le t \le T-1} \Vert z^{t+1} - z^t \Vert_2^2 \le \frac{\mathcal{L}_2(z^0)}{\Vert A \Vert^2 \, T}, \quad T \in \mathbb{N}. \]
Unlike in the randomized scenario, in the general case only convergence to a fixed point at a sublinear rate is guaranteed. Therefore, the initialization \(z^{0}\) is crucial for the convergence to a global minimum. For a non-random A, e.g., in the case of ptychography, an outcome of the direct (non-iterative) method [30] is a good starting point. Furthermore, with a sufficiently good initialization, AF can achieve a linear convergence rate [46] even for non-random measurements.
As the proof of Theorem 1 resembles the proof of Theorem 3 below, we provide a sketch of the proof of Theorem 1 in Remark 10 in Section 4.2.
The computational complexity of AF for \(T \in \mathbb {N}\) iterations is given by \(\mathcal {O}(T m d)\) operations. If the learning rate is chosen to be \(\mu _t = \Vert {A}\Vert ^{-2}\), the computation of the spectral norm can be done with additional \(\mathcal {O}(m d)\) operations by performing a fixed number of power method iterations. More precisely, for \(K \in \mathbb {N}\) and a random initialization \(v^0\), the iterates \(v^{k} = A^* A v^{k-1} / \Vert {A^* A v^{k-1} }\Vert _2,\) \(k \in [K]\), are computed and \(\Vert {A v^K}\Vert _2\) is used as an estimate of \(\Vert {A}\Vert \).
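The power-method estimate of \(\Vert A \Vert\) described above can be sketched as follows (the value K = 50 is an arbitrary illustrative choice):

```python
def spectral_norm_estimate(A, K=50, seed=0):
    """Estimate ||A|| by K power iterations for A^*A."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1]) + 1j * rng.standard_normal(A.shape[1])
    for _ in range(K):
        w = A.conj().T @ (A @ v)     # one application of A^*A, O(m d) operations
        v = w / np.linalg.norm(w)
    return np.linalg.norm(A @ v)     # ||A v^K||_2 approximates ||A||
```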
3 Results
As briefly mentioned in Section 2.3, ER can be linked to the minimization of the amplitude-based objective (2). We formalize this intuition in the next lemma.
Lemma 2
Let A be injective. Then, ER is a scaled gradient method with iterations given by
\[ z^{t+1} = z^t - (A^* A)^{-1} \nabla \mathcal{L}_2(z^t), \quad t \ge 0. \]
We emphasize that the result of Lemma 2 only holds for all \(z \in \mathbb {C}^d\) if the ambiguity 0/0 in the iteration of ER is resolved as 0.
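Lemma 2 is straightforward to check numerically: for a random (almost surely injective) A, the ER update and the scaled gradient update agree up to floating-point error. A sketch, with safe_sign as in the code of Section 2:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 128, 32
A = rng.standard_normal((m, d)) + 1j * rng.standard_normal((m, d))  # injective a.s.
x = rng.standard_normal(d) + 1j * rng.standard_normal(d)
y = np.abs(A @ x)
z = rng.standard_normal(d) + 1j * rng.standard_normal(d)            # arbitrary iterate

Az = A @ z
er_step = np.linalg.pinv(A) @ (y * safe_sign(Az))                   # ER update
grad = A.conj().T @ (Az - y * safe_sign(Az))                        # generalized gradient
scaled_step = z - np.linalg.solve(A.conj().T @ A, grad)             # Lemma 2 update
print(np.linalg.norm(er_step - scaled_step))                        # ~1e-13
```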
The reinterpretation of ER as a scaled gradient method allows us to analyze the convergence of the algorithm similarly to AF, which leads to an analogue of Theorem 1.
Theorem 3
Consider the phase retrieval measurements y of the form (1) with an injective matrix A. Let \(z^0 \in \mathbb {C}^d\) be arbitrary. Then, for the iterates \(\{z^t\}_{t \ge 0}\) given by ER we have
\[ \mathcal{L}_2(z^{t+1}) \le \mathcal{L}_2(z^t) - \frac{ \Vert \nabla \mathcal{L}_2(z^t) \Vert_2^2 }{ \sigma_1^2(A) }, \quad t \ge 0, \]
and
\[ \min_{0 \le t \le T-1} \Vert z^{t+1} - z^t \Vert_2^2 \le \frac{ \mathcal{L}_2(z^0) }{ \sigma_d^2(A) \, T }, \quad T \in \mathbb{N}, \]
where \(\sigma _d(A)\) denotes the smallest singular value of the matrix A.
Theorem 3 guarantees that no matter how noisy the measurements are, ER will always converge to a fixed point and the convergence rate is sublinear. However, even in the absence of noise, it does not guarantee the global convergence to a point in the set \(\{\alpha x : \vert {\alpha }\vert = 1 \}\). We note that for cases \(A = F\) and A corresponding to ptychography (see (10) below), the convergence of ER to a fixed point was shown in [10] and [47], respectively. However, the convergence rate was not derived. Comparing Theorem 3 to Theorem 1, we observe that the constant in the convergence rate of ER is worse by \(\sigma _1^2(A)/\sigma _d^2(A)\) compared to AF.
A further consequence of Lemma 2 is the equality of the fixed-point sets of both algorithms.
Corollary 4
Let A be injective. Then, \(z \in \mathbb {C}^d\) is a fixed point of ER if and only if z is a fixed point of AF.
We note that Corollary 4 does not imply that given the same initial guess \(z^0\), both algorithms will necessarily converge to the same fixed point.
By Theorem 1 and Theorem 3, both algorithms appear comparable in terms of the convergence rate and, by Corollary 4, in terms of critical points. However, for \(T \in \mathbb {N}\) iterations of ER, \(\mathcal {O}(m d^2 + T m d)\) operations are required, while AF only needs \(\mathcal {O}(T m d)\) operations and, thus, in general ER is considerably slower in terms of computational complexity. The next corollary shows that this difference is less significant in cases where the columns of A are orthogonal.
Corollary 5
Let
\[ A^* A = {\text{diag}}(v) \quad \text{for some } v \in \mathbb{R}^d \text{ with } v_\ell > 0, \ \ell \in [d]. \qquad (9) \]
Then, for \(T \in \mathbb {N}\) iterations both algorithms ER and AF require \(\mathcal {O}(T m d)\) operations.
Furthermore, if \(A^* A = c I\), for some \(c>0\), then the iteration of ER coincides with the iteration of AF for the learning rate \(\mu _t = \Vert {A}\Vert ^{-2}\).
While condition (9) may seem restrictive, it, in fact, holds in many practical applications. For instance, the equivalence of both algorithms was observed for the recovery from Fourier magnitudes (\(A = F\)) in [10]. Another application of interest is ptychography, for which the measurement matrix A is given by
\[ A = \begin{bmatrix} F \, {\text{diag}}(S_{s_1} w) \\ \vdots \\ F \, {\text{diag}}(S_{s_r} w) \end{bmatrix} \in \mathbb{C}^{rd \times d}, \qquad (10) \]
where the vector \(w \in \mathbb {C}^{d}\) denotes the distribution of the light in the illuminated region and \(s_1, \ldots , s_r \in [d]\), \(r \le d\), are unique positions of the regions. Matrices F and \(S_{s_j}\) are given by (5) and (7), respectively. When \(r = d\) and \(s_j = j, j \in [d]\), the matrix A is also known as the discrete Short-Time Fourier transform (STFT) with window w.
The next corollary shows that condition (9) and, consequently, the results of Corollary 5 also hold for ptychographic measurements.
Corollary 6
Consider measurements of the form (1) with the ptychographic measurement matrix A as in (10). Then, \(A^* A = {\text {diag}}(v) \), where the vector v has entries
\[ v_\ell = d \sum_{j=1}^r \vert (S_{s_j} w)_\ell \vert^2 \]
for all \(\ell \in [d]\). The matrix A is injective if and only if \(v_\ell >0\) for all \(\ell \in [d]\). Furthermore, if A is the STFT matrix, the vector v has entries \(v_\ell = d \Vert {w}\Vert _2^2\) for all \(\ell \in [d]\). Consequently, the results of Corollary 5 apply for ptychographic measurements.
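The diagonal structure asserted in Corollary 6 is easy to verify numerically. Below, the ptychographic matrix (10) is assembled explicitly; the window and the shift positions are placeholder choices for illustration only:

```python
import numpy as np

d, delta = 64, 8
F = np.exp(-2j * np.pi * np.outer(np.arange(d), np.arange(d)) / d)  # unnormalized DFT (5)
w = np.zeros(d, dtype=complex)
w[:delta] = 1.0                          # placeholder window supported on delta entries
shifts = np.arange(0, d, delta)          # placeholder positions s_1, ..., s_r

A = np.vstack([F @ np.diag(np.roll(w, s)) for s in shifts])  # blocks F diag(S_{s_j} w)

v = d * sum(np.abs(np.roll(w, s)) ** 2 for s in shifts)      # entries from Corollary 6
print(np.linalg.norm(A.conj().T @ A - np.diag(v)))           # ~1e-10, i.e. A^*A = diag(v)
```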
In order to illustrate the result of Corollary 6, we perform numerical reconstructions of a randomly generated \(x \in \mathbb {C}^{d}\), \(d= 256\), with both AF and ER. In the first case, A is chosen to be the STFT matrix with a fixed window \(w \in \mathbb{C}^{d}\).
In the second case, A is given by (10) with the same window and positions \(s_j = 16 j\), \(j \in [d/16]\). The measurements are additionally corrupted by Poisson noise such that the signal-to-noise ratio \(10 \log _{10} \left( \tfrac{ \Vert {\vert {Ax}\vert ^2}\Vert _2^2}{ \Vert {y^2 - \vert {Ax}\vert ^2}\Vert _2^2} \right) \) is approximately 45. Figure 1a shows the values \(\mathcal {L}_2(z^t)\) for 500 iterations of the algorithms starting from a random initialization \(z^0\). Note that for the STFT matrix A, AF and ER coincide, as predicted by Corollary 6, while this is no longer true for A as in (10). Despite producing different reconstructions, the runtimes of the algorithms in Figure 1b are almost the same, which is in line with Corollary 5.
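A rough sketch of this experiment is given below with reduced dimensions; the window, the Poisson noise scaling, and the initialization are assumptions for illustration and do not reproduce the exact setup of Figure 1 (error_reduction and amplitude_flow as in the sketches of Section 2):

```python
import numpy as np

rng = np.random.default_rng(2)
d, delta = 64, 8                                   # reduced dimensions for the sketch
F = np.exp(-2j * np.pi * np.outer(np.arange(d), np.arange(d)) / d)
w = np.zeros(d, dtype=complex)
w[:delta] = rng.standard_normal(delta) + 1j * rng.standard_normal(delta)  # assumed window
A = np.vstack([F @ np.diag(np.roll(w, s)) for s in range(d)])  # STFT case: s_j = j

x = rng.standard_normal(d) + 1j * rng.standard_normal(d)
intensities = np.abs(A @ x) ** 2
scale = 10.0                                       # assumed Poisson noise level
y2 = rng.poisson(scale * intensities) / scale      # noisy squared measurements
y = np.sqrt(y2)
snr = 10 * np.log10(np.linalg.norm(intensities) ** 2
                    / np.linalg.norm(y2 - intensities) ** 2)

z0 = rng.standard_normal(d) + 1j * rng.standard_normal(d)
z_er = error_reduction(A, y, z0, T=500)
z_af = amplitude_flow(A, y, z0, T=500)
print(snr, np.linalg.norm(z_er - z_af))            # iterates coincide in the STFT case
```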
Finally, we consider a scenario where the object is supported on \(J \subseteq [d]\). Then, we can rewrite the measurement model as
\[ y = \vert A E_J x_J \vert + n, \]
where \(x_{J}\) is the vector containing the entries of x in J and \(E_{J}\) is a linear embedding operator which maps \(x_J\) to x. In this case, the results above apply to the new measurement matrix \(\tilde{A} = A E_{J}\).
4 Proofs
4.1 Proofs of Lemma 2 and corollaries
We will start with the proof of Lemma 2.
Proof of Lemma 2
By the assumption, A is injective and, thus, the identities (3) and (4) hold true. Therefore, using \(A^\dagger = (A^* A)^{-1} A^*\), the iteration of ER can be rewritten as
\[ z^{t+1} = A^\dagger \, {\text{diag}}\left( \frac{y}{\vert A z^t \vert} \right) A z^t = z^t - (A^* A)^{-1} A^* \left( A z^t - {\text{diag}}\left( \frac{y}{\vert A z^t \vert} \right) A z^t \right) = z^t - (A^* A)^{-1} \nabla \mathcal{L}_2(z^t).
\]
\(\square \)
Using the result of Lemma 2, we deduce Corollary 4 and Corollary 5.
Proof of Corollary 4
Let \(z \in \mathbb {C}^{d}\) be a fixed point of ER. By Lemma 2, we have that
\[ z = z - (A^* A)^{-1} \nabla \mathcal{L}_2(z), \]
which is equivalent to
\[ (A^* A)^{-1} \nabla \mathcal{L}_2(z) = 0. \]
Since A is injective and \((A^* A)^{-1}\) exists, the obtained equality holds if and only if \(\nabla \mathcal {L}_2 (z) = 0\), so that z is a fixed point of AF. \(\square \)
Proof of Corollary 5
Using the condition (9), we obtain \((A^* A)^{-1} = {\text {diag}}(1/v)\). Consequently, by Lemma 2, the iteration of ER is given by
\[ z^{t+1} = z^t - {\text{diag}}(1/v) \, \nabla \mathcal{L}_2(z^t). \]
The computation of the gradient requires \(\mathcal {O}(md)\) operations. Both the multiplication with \({\text {diag}}(1/v)\) and the difference can be done in \(\mathcal {O}(d)\) operations. Therefore, the total number of operations for a single iteration of ER is given by \(\mathcal {O}(md + d) = \mathcal {O}(md)\), which is the same order of operations as for a single iteration of AF. Furthermore, evaluation of v requires additional \(\mathcal {O}(md)\) operations. We also note that \(\Vert {A}\Vert ^2 = \max _{\ell \in [d]} \vert {v_\ell }\vert \) and the computation of the learning rate is done in \(\mathcal {O}(d)\) operations. Therefore, both algorithms have total complexity of \(\mathcal {O}(Tmd)\) for T iterations.
If \(A^* A = c I\), then
\[ (A^* A)^{-1} = \frac{1}{c} I = \Vert A \Vert^{-2} I, \]
since \(\Vert A \Vert^2 = \Vert A^* A \Vert = c\). Hence, using Lemma 2, for the iteration of ER we have
\[ z^{t+1} = z^t - \Vert A \Vert^{-2} \nabla \mathcal{L}_2(z^t), \]
which is precisely the iteration of AF with \(\mu _t = \Vert {A}\Vert ^{-2}\). \(\square \)
The last corollary is the result of direct computations similar to equation (12) in [19].
Proof of Corollary 6
We compute the product \(A^*A\) by using the representation (10),
\[ A^* A = \sum_{j=1}^r {\text{diag}}^*(S_{s_j} w) \, F^* F \, {\text{diag}}(S_{s_j} w). \]
Next, we use (6) and \({\text {diag}}^*(S_{s_j} w) = {\text {diag}}(\overline{S_{s_j} w})\) to obtain
\[ A^* A = d \sum_{j=1}^r {\text{diag}}\left( \vert S_{s_j} w \vert^2 \right) = {\text{diag}}(v), \qquad v_\ell = d \sum_{j=1}^r \vert (S_{s_j} w)_\ell \vert^2, \quad \ell \in [d]. \]
The matrix A is injective if and only if \(A^* A\) is invertible, and the diagonal matrix is invertible when all of its diagonal entries are non-zero. Since \(v_\ell = d \sum _{j=1}^r \vert (S_{s_j} w)_\ell \vert ^2 \ge 0\), \(\ell \in [d]\), the injectivity of A is equivalent to \(v_\ell > 0\) for all \(\ell \in [d]\).
If A is the STFT matrix, then \(s_j = j\) for all \(j \in [d]\) and the entries of the vector v further simplify to
\[ v_\ell = d \sum_{j=1}^d \vert (S_j w)_\ell \vert^2 = d \sum_{j=1}^d \vert w_{((\ell + j - 1) \bmod d) + 1} \vert^2. \]
Changing the order of summation yields
\[ v_\ell = d \sum_{k=1}^d \vert w_k \vert^2 = d \Vert w \Vert_2^2 \]
for all \(\ell \in [d]\), which concludes the proof. \(\square \)
4.2 Proof of Theorem 3
The proof of Theorem 3 is based on Wirtinger derivatives [48]. Let us recall some basic facts about Wirtinger derivatives based on [49, 50]. A function \(f: \mathbb {C} \rightarrow \mathbb {C}\) can be viewed as a function of two real variables, the real and imaginary parts of the argument \(z = \alpha + i \beta \). The function f is said to be differentiable in the real sense if the derivatives with respect to \(\alpha \) and \(\beta \) exist.
Then, the Wirtinger derivatives are defined as
\[ \frac{\partial f}{\partial z} := \frac{1}{2} \left( \frac{\partial f}{\partial \alpha} - i \frac{\partial f}{\partial \beta} \right), \qquad \frac{\partial f}{\partial \bar z} := \frac{1}{2} \left( \frac{\partial f}{\partial \alpha} + i \frac{\partial f}{\partial \beta} \right), \]
which is nothing but a change of the coordinate system to the conjugate coordinates. In this sense, we treat the function f as a function of z and \(\bar{z}\) instead of \(\alpha \) and \(\beta \).
As an example, consider \(f(z) = z = \alpha + i \beta \). Its Wirtinger derivatives are
\[ \frac{\partial f}{\partial z} = \frac{1}{2} (1 - i \cdot i) = 1, \qquad \frac{\partial f}{\partial \bar z} = \frac{1}{2} (1 + i \cdot i) = 0, \]
which implies that \(\bar{z}\) can be treated as a constant when the derivative with respect to z is computed and vice versa.
Similarly to the real analysis of multivariate functions, Wirtinger derivatives are extended to \(f: \mathbb {C}^{d} \rightarrow \mathbb {C}\); that is, for \(z \in \mathbb {C}^d\) they are given by
\[ \frac{\partial f}{\partial z} := \left( \frac{\partial f}{\partial z_1}, \ldots, \frac{\partial f}{\partial z_d} \right)^T, \qquad \frac{\partial f}{\partial \bar z} := \left( \frac{\partial f}{\partial \bar z_1}, \ldots, \frac{\partial f}{\partial \bar z_d} \right)^T. \]
The computation of Wirtinger derivatives is analogous to standard real analysis, as the arithmetic rules and the chain rule extend to the complex case. For Wirtinger derivatives it also holds that
\[ \overline{ \left( \frac{\partial f}{\partial z} \right) } = \frac{\partial \bar{f}}{\partial \bar z} \qquad (11) \]
for any differentiable function f.
The Wirtinger derivatives are particularly useful for the optimization of real-valued functions of complex variables. Let \(f: \mathbb {C}^d \rightarrow \mathbb {R}\) be a differentiable real-valued function. Its differential can be presented in the form of Wirtinger derivatives as
\[ df = \left( \frac{\partial f}{\partial z} \right)^T dz + \left( \frac{\partial f}{\partial \bar z} \right)^T d\bar{z}. \]
Since f is real-valued, by (11), it holds that
\[ \overline{ \left( \frac{\partial f}{\partial z} \right) } = \frac{\partial f}{\partial \bar z}, \]
and the differential simplifies to
\[ df = 2 \operatorname{Re} \left[ \left( \frac{\partial f}{\partial z} \right)^T dz \right] = 2 \operatorname{Re} \left[ (dz)^* \frac{\partial f}{\partial \bar z} \right]. \]
It is maximal when dz is a scaled version of \(\overline{\frac{\partial f}{\partial z}} = \frac{\partial f}{\partial \bar{z}}\) and, thus, \(\frac{\partial f}{\partial \bar{z}}\) gives the direction of the steepest ascent. Moreover, the critical points of f are those where the derivative with respect to \(\bar{z}\) vanishes. For this reason, the gradient of f is defined as
\[ \nabla f(z) := \frac{\partial f}{\partial \bar z}(z). \]
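The role of \(\frac{\partial f}{\partial \bar z}\) as the gradient can be sanity-checked numerically via the first-order expansion \(f(z + hv) \approx f(z) + 2h \operatorname{Re}(v^* \nabla f(z))\). A sketch for the smooth function \(f(z) = \Vert Az - b \Vert_2^2\), whose Wirtinger gradient is \(A^*(Az - b)\); this test function is chosen for illustration only:

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 20, 5
A = rng.standard_normal((m, d)) + 1j * rng.standard_normal((m, d))
b = rng.standard_normal(m) + 1j * rng.standard_normal(m)
z = rng.standard_normal(d) + 1j * rng.standard_normal(d)
v = rng.standard_normal(d) + 1j * rng.standard_normal(d)   # direction of perturbation

f = lambda u: np.linalg.norm(A @ u - b) ** 2
grad = A.conj().T @ (A @ z - b)                 # df/d(conj z) for this f

h = 1e-6
finite_difference = (f(z + h * v) - f(z)) / h   # directional derivative
first_order = 2 * np.real(v.conj() @ grad)      # 2 Re(v^* grad f(z))
print(finite_difference, first_order)           # agree up to O(h)
```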
In our analysis, we will also need the Wirtinger version of the second-order Taylor approximation theorem in integral form. That is, for all twice continuously differentiable functions \(f: \mathbb {C}^d \rightarrow \mathbb {R}\) and all \(z, v \in \mathbb {C}^d\) it holds that
\[ f(z + v) = f(z) + 2 \operatorname{Re} \left[ v^* \nabla f(z) \right] + \int_0^1 (1-s) \begin{bmatrix} v \\ \bar v \end{bmatrix}^* \nabla^2 f(z + sv) \begin{bmatrix} v \\ \bar v \end{bmatrix} ds, \qquad (12) \]
where \(\nabla ^2 f\) denotes the Hessian matrix
\[ \nabla^2 f := \begin{bmatrix} \dfrac{\partial}{\partial z} \dfrac{\partial f}{\partial \bar z} & \dfrac{\partial}{\partial \bar z} \dfrac{\partial f}{\partial \bar z} \\[6pt] \dfrac{\partial}{\partial z} \dfrac{\partial f}{\partial z} & \dfrac{\partial}{\partial \bar z} \dfrac{\partial f}{\partial z} \end{bmatrix} \in \mathbb{C}^{2d \times 2d}, \]
and its components are given by
\[ \left( \frac{\partial}{\partial z} \frac{\partial f}{\partial \bar z} \right)_{k,j} = \frac{\partial^2 f}{\partial z_j \, \partial \bar z_k}, \quad k,j \in [d], \]
and analogously for the remaining blocks.
For further information on Wirtinger calculus, we refer the reader to [49, 50].
Let us go back to the amplitude-based objective (2). We rewrite it as
\[ \mathcal{L}_2(z) = \sum_{k=1}^m \left( \sqrt{ \vert (Az)_k \vert^2 } - y_k \right)^2. \qquad (13) \]
Since \(\sqrt{\cdot }\) is not differentiable at 0, \(\mathcal {L}_2\) is not differentiable on \(\mathbb {C}^{d}\). Hence, the gradient of \(\mathcal {L}_2\) is not properly defined for points z with \((Az)_k =0\) for some \(k \in [m]\). In order to overcome this issue, we consider the following smoothed version of (13),
\[ \mathcal{L}_{2,\varepsilon}(z) := \sum_{k=1}^m \left( \sqrt{ \vert (Az)_k \vert^2 + \varepsilon } - \sqrt{ y_k^2 + \varepsilon } \right)^2, \]
where \(\varepsilon > 0\). The function \(\mathcal {L}_{2,\varepsilon }\) possesses some useful properties. Firstly, \(\mathcal {L}_{2,\varepsilon }\) is continuous in \(\varepsilon \) and we have
\[ \lim_{\varepsilon \rightarrow 0+} \mathcal{L}_{2,\varepsilon}(z) = \mathcal{L}_2(z) \quad \text{for all } z \in \mathbb{C}^d. \]
Secondly, we can compute the gradient of \(\mathcal {L}_{2,\varepsilon }\) everywhere and properly define the generalized gradient of \(\mathcal {L}_2\) as the limit of gradients as parameter \(\varepsilon \) vanishes.
Lemma 7
The function \(\mathcal {L}_{2,\varepsilon }\) is continuously differentiable with the gradient given by
\[ \nabla \mathcal{L}_{2,\varepsilon}(z) = A^* \left( Az - {\text{diag}}\left( \frac{ \sqrt{ y^2 + \varepsilon } }{ \sqrt{ \vert Az \vert^2 + \varepsilon } } \right) Az \right). \]
Furthermore, the generalized gradient of \(\mathcal {L}_2\) is given by the pointwise limit
\[ \nabla \mathcal{L}_2(z) = \lim_{\varepsilon \rightarrow 0+} \nabla \mathcal{L}_{2,\varepsilon}(z) = A^* \left( Az - {\text{diag}}\left( \frac{y}{\vert Az \vert} \right) Az \right), \]
with the ambiguity 0/0 resolved as 0.
Proof
Denote by \(a_k \in \mathbb{C}^d\) the conjugate transpose of the k-th row of the matrix A, so that \((A z)_k = a_k^* z\). Then, a single summand of \(\mathcal {L}_{2,\varepsilon }\) is given by
\[ f_k(z) := \left( \sqrt{ \vert a_k^* z \vert^2 + \varepsilon } - \sqrt{ y_k^2 + \varepsilon } \right)^2. \]
The gradient of \(f_k\) can be evaluated by the chain rule, using that \(\partial \vert a_k^* z \vert^2 / \partial \bar z = a_k a_k^* z\). We get
\[ \nabla f_k(z) = \left( 1 - \frac{ \sqrt{ y_k^2 + \varepsilon } }{ \sqrt{ \vert a_k^* z \vert^2 + \varepsilon } } \right) a_k a_k^* z. \]
Then, by the linearity of derivatives,
\[ \nabla \mathcal{L}_{2,\varepsilon}(z) = \sum_{k=1}^m \nabla f_k(z) = A^* \left( Az - {\text{diag}}\left( \frac{ \sqrt{ y^2 + \varepsilon } }{ \sqrt{ \vert Az \vert^2 + \varepsilon } } \right) Az \right). \]
For the generalized gradient of \(\mathcal {L}_2\) we consider two cases. If \((Az)_k \ne 0\) for all \(k \in [m]\), then \(\sqrt{y^2_k + \varepsilon }/\sqrt{ \vert {(A z)_k}\vert ^2 + \varepsilon } \rightarrow y_k /\vert {(A z)_k}\vert \), \(\varepsilon \rightarrow 0+\), for all \(k \in [m]\). Note that in this case, \(\mathcal {L}_2\) is differentiable at z and its gradient coincides with the limit of \(\nabla \mathcal {L}_{2,\varepsilon }(z)\) as \(\varepsilon \rightarrow 0+\). On the other hand, if \((Az)_k = 0\) for some \(k \in [m]\), it holds that
\[ \frac{ \sqrt{ y_k^2 + \varepsilon } }{ \sqrt{ \vert (Az)_k \vert^2 + \varepsilon } } (Az)_k = 0 = \frac{ y_k }{ \vert (Az)_k \vert } (Az)_k \]
with ambiguity 0/0 resolved as 0. \(\square \)
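The pointwise limit in Lemma 7 can also be observed numerically, including at points where \((Az)_k = 0\); in the sketch below a zero row of A is planted to force this case, and all sizes are illustrative:

```python
import numpy as np

def grad_smoothed(A, y, z, eps):
    """Gradient of the smoothed loss L_{2,eps}."""
    Az = A @ z
    factor = np.sqrt(y ** 2 + eps) / np.sqrt(np.abs(Az) ** 2 + eps)
    return A.conj().T @ (Az - factor * Az)

def grad_generalized(A, y, z):
    """Generalized gradient of L_2 with the convention 0/0 := 0."""
    Az = A @ z
    a = np.abs(Az)
    sgn = np.divide(Az, a, out=np.zeros_like(Az), where=a > 0)
    return A.conj().T @ (Az - y * sgn)

rng = np.random.default_rng(4)
m, d = 6, 3
A = rng.standard_normal((m, d)) + 1j * rng.standard_normal((m, d))
A[0] = 0.0                                  # forces (Az)_0 = 0
z = rng.standard_normal(d) + 1j * rng.standard_normal(d)
y = np.abs(A @ z) + 0.1

for eps in [1e-2, 1e-4, 1e-8]:
    print(eps, np.linalg.norm(grad_smoothed(A, y, z, eps) - grad_generalized(A, y, z)))
# the difference decreases as eps -> 0+
```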
The last property concerns the Hessian matrix of \(\mathcal {L}_{2,\varepsilon }\).
Lemma 8
([19]) The function \(\mathcal {L}_{2,\varepsilon }\) is twice continuously differentiable and its Hessian matrix satisfies
\[ \begin{bmatrix} v \\ \bar v \end{bmatrix}^* \nabla^2 \mathcal{L}_{2,\varepsilon}(z) \begin{bmatrix} v \\ \bar v \end{bmatrix} \le 2 \Vert A v \Vert_2^2 \quad \text{for all } z, v \in \mathbb{C}^d. \]
Proof
See computations on pages 27-28 of [19]. \(\square \)
Remark 9
The convergence of gradient descent to a critical point is often studied [51] under the assumption that the function f is L-smooth with \(L \ge 0\). For twice continuously differentiable functions, L-smoothness is equivalent to the inequality
\[ - L I \preceq \nabla^2 f(z) \preceq L I \quad \text{for all } z. \]
In fact, only the upper bound is sufficient to establish the convergence of gradient descent. In our case, its stronger version is given by Lemma 8.
Now, we are equipped for the proof of Theorem 3.
Proof of Theorem 3
In view of Lemma 2, let us consider the smoothed step of the ER algorithm,
\[ z^{t+1} = z^t - (A^* A)^{-1} \nabla \mathcal{L}_{2,\varepsilon}(z^t). \]
Note that \((A^* A)^{-1}\) exists due to the injectivity of A.
We first show that a single step of the smoothed Error Reduction does not increase \(\mathcal {L}_{2,\varepsilon }\) and then we take the pointwise limits to obtain the desired result for \(\mathcal {L}_2\). In order to derive that in each iteration step the objective does not increase, we apply Taylor's theorem (12) with arbitrary \(z \in \mathbb {C}^d\) and \(v = - (A^* A)^{-1}\nabla \mathcal {L}_{2,\varepsilon }(z)\). We note that by Lemma 8, the integral in (12) is bounded as
\[ \int_0^1 (1-s) \begin{bmatrix} v \\ \bar v \end{bmatrix}^* \nabla^2 \mathcal{L}_{2,\varepsilon}(z + sv) \begin{bmatrix} v \\ \bar v \end{bmatrix} ds \le \Vert A v \Vert_2^2 = \nabla \mathcal{L}_{2,\varepsilon}(z)^* (A^* A)^{-1} \nabla \mathcal{L}_{2,\varepsilon}(z). \]
Hence, by (12), we have
\[ \mathcal{L}_{2,\varepsilon}(z + v) \le \mathcal{L}_{2,\varepsilon}(z) - 2 \nabla \mathcal{L}_{2,\varepsilon}(z)^* (A^* A)^{-1} \nabla \mathcal{L}_{2,\varepsilon}(z) + \Vert A v \Vert_2^2 = \mathcal{L}_{2,\varepsilon}(z) - \nabla \mathcal{L}_{2,\varepsilon}(z)^* (A^* A)^{-1} \nabla \mathcal{L}_{2,\varepsilon}(z), \]
where we used that \(( (A^* A)^{-1} )^* = ( (A^* A)^* )^{-1} = (A^* A)^{-1}\).
where we used that \(( (A^* A)^{-1} )^* = ( (A^* A)^* )^{-1} = (A^* A)^{-1}\). Lemma 7 gives
and, thus, taking the limit \(\varepsilon \rightarrow 0+\) yields
Selecting z as the iterates \(z^t\) of ER, we obtain
\[ \mathcal{L}_2(z^{t+1}) \le \mathcal{L}_2(z^t) - \nabla \mathcal{L}_2(z^t)^* (A^* A)^{-1} \nabla \mathcal{L}_2(z^t), \quad t \ge 0. \qquad (15) \]
Since A is injective, its singular value decomposition is given by \(A = U \Sigma V^*\) with orthogonal \(U \in \mathbb {C}^{m \times d}\), unitary \(V \in \mathbb {C}^{d \times d}\) and invertible diagonal matrix \(\Sigma \in \mathbb {R}^{d \times d}\). Then,
\[ (A^* A)^{-1} = V \Sigma^{-2} V^* \]
is the singular value decomposition of \((A^* A)^{-1}\). From this representation we deduce that
\[ \Vert (A^* A)^{-1} \Vert = \sigma_d^{-2}(A) \quad \text{and} \quad \sigma_d \left( (A^* A)^{-1} \right) = \sigma_1^{-2}(A). \]
Thus, by (15),
\[ \mathcal{L}_2(z^{t+1}) \le \mathcal{L}_2(z^t) - \sigma_d \left( (A^* A)^{-1} \right) \Vert \nabla \mathcal{L}_2(z^t) \Vert_2^2 = \mathcal{L}_2(z^t) - \frac{ \Vert \nabla \mathcal{L}_2(z^t) \Vert_2^2 }{ \sigma_1^2(A) }, \]
which shows the first statement of Theorem 3.
In order to prove the remaining statements of Theorem 3, we need to link the decay of the objective to the iterates. By Lemma 2, we have that
\[ z^{t+1} - z^t = - (A^* A)^{-1} \nabla \mathcal{L}_2(z^t). \qquad (16) \]
Using (16) and the definition of the spectral norm, the squared distance between the iterates can be bounded as
\[ \Vert z^{t+1} - z^t \Vert_2^2 = \Vert (A^* A)^{-1} \nabla \mathcal{L}_2(z^t) \Vert_2^2 \le \Vert (A^* A)^{-1} \Vert \, \nabla \mathcal{L}_2(z^t)^* (A^* A)^{-1} \nabla \mathcal{L}_2(z^t) = \frac{ \nabla \mathcal{L}_2(z^t)^* (A^* A)^{-1} \nabla \mathcal{L}_2(z^t) }{ \sigma_d^2(A) }. \]
Next, we sum up the norms for \(T \in \mathbb {N}\) iterations of ER and apply (15) to obtain
\[ \sum_{t=0}^{T-1} \Vert z^{t+1} - z^t \Vert_2^2 \le \frac{1}{\sigma_d^2(A)} \sum_{t=0}^{T-1} \left[ \mathcal{L}_2(z^t) - \mathcal{L}_2(z^{t+1}) \right] = \frac{ \mathcal{L}_2(z^0) - \mathcal{L}_2(z^T) }{ \sigma_d^2(A) } \le \frac{ \mathcal{L}_2(z^0) }{ \sigma_d^2(A) }, \]
where in the last line we used that \(\mathcal {L}_2(z) \ge 0\) for all \(z \in \mathbb {C}^d\). This implies that the partial sums of the series \(\sum _{t=0}^{\infty } \Vert {z^{t+1} - z^t}\Vert _2^2\) are bounded and, thus, the series is convergent. Consequently, the summands converge to zero, that is, \(\Vert {z^{t+1} - z^t}\Vert _2^2 \rightarrow 0\) as \(t \rightarrow \infty \). Furthermore, we have
\[ \min_{0 \le t \le T-1} \Vert z^{t+1} - z^t \Vert_2^2 \le \frac{1}{T} \sum_{t=0}^{T-1} \Vert z^{t+1} - z^t \Vert_2^2 \le \frac{ \mathcal{L}_2(z^0) }{ \sigma_d^2(A) \, T }, \]
which concludes the proof. \(\square \)
Remark 10
The proof of Theorem 1 follows the same logic. By Taylor's theorem (12) with \(v = -\mu _t \nabla \mathcal {L}_{2,\varepsilon }(z)\), the analogue of inequality (15) is established. Since the learning rate is a positive constant, it further implies that the norm of the gradient converges to zero with the desired speed, similarly to the proof of Theorem 3.
5 Conclusion
In this paper, we established an interpretation of the Error Reduction algorithm as a scaled gradient method and derived its convergence rate. Furthermore, it was shown that in practical scenarios, Error Reduction has the same computational complexity as the Amplitude Flow method, and that the two algorithms coincide in some cases.
In the future, we plan to extend our analysis to the Hybrid Input-Output method [9] and the extended Ptychographic Iterative Engine [52] used for the problem of blind ptychography.
References
Liu, Z.C., Xu, R., Dong, Y.H.: Phase retrieval in protein crystallography. Acta Crystallogr. Sect. A: Found. Crystallogr. 68(Pt 2), 256–265 (2012). https://doi.org/10.1107/S0108767311053815
Shapiro, D., Thibault, P., Beetz, T., Elser, V., Howells, M., Jacobsen, C., Kirz, J., Lima, E., Miao, H., Neiman, A.M., Sayre, D.: Biological imaging by soft X-ray diffraction microscopy. Proc. Natl. Acad. Sci. USA 102(43), 15343–15346 (2005). https://doi.org/10.1073/pnas.0503305102
Thibault, P., Elser, V., Jacobsen, C., Shapiro, D., Sayre, D.: Reconstruction of a yeast cell from X-ray diffraction data. Acta Crystallogr. Sect. A: Found. Crystallogr. 62(Pt 4), 248–261 (2006). https://doi.org/10.1107/S0108767306016515
Miao, J., Ishikawa, T., Shen, Q., Earnest, T.: Extending X-ray crystallography to allow the imaging of noncrystalline materials, cells, and single protein complexes. Ann. Rev. Phys. Chem. 59, 387–410 (2008). https://doi.org/10.1146/annurev.physchem.59.032607.093642
Shechtman, Y., Eldar, Y.C., Cohen, O., Chapman, H.N., Miao, J., Segev, M.: Phase retrieval with application to optical imaging: A contemporary overview. IEEE Signal Proc. Mag. 32(3), 87–109 (2015). https://doi.org/10.1109/MSP.2014.2352673
Pfeiffer, F.: X-ray ptychography. Nat. Photonics 12(1), 9–17 (2018). https://doi.org/10.1038/s41566-017-0072-5
Chen, Z., Jiang, Y., Shao, Y.T., Holtz, M.E., Odstrčil, M., Guizar-Sicairos, M., Hanke, I., Ganschow, S., Schlom, D.G., Muller, D.A.: Electron ptychography achieves atomic-resolution limits set by lattice vibrations. Science 372(6544), 826–831 (2021). https://doi.org/10.1126/science.abg2533
Gerchberg, R.W., Saxton, W.O.: A practical algorithm for the determination of phase from image and diffraction plane pictures. Optik 35, 237 (1972)
Fienup, J.R.: Reconstruction of an object from the modulus of its Fourier transform. Optics Lett. 3(1), 27–29 (1978). https://doi.org/10.1364/ol.3.000027
Fienup, J.R.: Phase retrieval algorithms: A comparison. Appl. Opt. 21(15), 2758–2769 (1982). https://doi.org/10.1364/AO.21.002758
Wen, Z., Yang, C., Liu, X., Marchesini, S.: Alternating direction methods for classical and ptychographic phase retrieval. Inverse Problems 28(11), 115,010 (2012). https://doi.org/10.1088/0266-5611/28/11/115010
Marchesini, S., Tu, Y.C., Wu, H.T.: Alternating projection, ptychographic imaging and phase synchronization. Appl. Comput. Harmon. Anal. 41(3), 815–851 (2016). https://doi.org/10.1016/j.acha.2015.06.005
Chang, H., Lou, Y., Duan, Y., Marchesini, S.: Total variation-based phase retrieval for Poisson noise removal. SIAM J. Imag. Sci. 11(1), 24–55 (2018). https://doi.org/10.1137/16M1103270
Fannjiang, A., Zhang, Z.: Fixed point analysis of Douglas-Rachford splitting for ptychography and phase retrieval. SIAM J. Imag. Sci. 13(2), 609–650 (2020). https://doi.org/10.1137/19M128781X
Candes, E.J., Li, X., Soltanolkotabi, M.: Phase retrieval via Wirtinger Flow: Theory and algorithms. IEEE Trans. Inform. Theory 61(4), 1985–2007 (2015). https://doi.org/10.1109/TIT.2015.2399924
Chen, Y., Candes, E.J.: Solving random quadratic systems of equations is nearly as easy as solving linear systems. NIPS (2015)
Wang, G., Giannakis, G.B., Eldar, Y.C.: Solving systems of random quadratic equations via Truncated Amplitude Flow. IEEE Trans Inform Theory 64(2), 773–794 (2018). https://doi.org/10.1109/TIT.2017.2756858
Wang, G., Giannakis, G.B., Saad, Y., Chen, J.: Phase retrieval via Reweighted Amplitude Flow. IEEE Trans. Signal Proc. p. 1 (2018). https://doi.org/10.1109/TSP.2018.2818077
Xu, R., Soltanolkotabi, M., Haldar, J.P., Unglaub, W., Zusman, J., Levi, A.F.J., Leahy, R.M.: Accelerated Wirtinger Flow: A fast algorithm for ptychography. arXiv:1806.05546
Candès, E.J., Eldar, Y.C., Strohmer, T., Voroninski, V.: Phase retrieval via matrix completion. SIAM J. Imag. Sci. 6(1), 199–225 (2013). https://doi.org/10.1137/110848074
Candès, E.J., Strohmer, T., Voroninski, V.: Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Commun. Pure Appl. Math. 66(8), 1241–1274 (2013). https://doi.org/10.1002/cpa.21432
Kueng, R., Rauhut, H., Terstiege, U.: Low rank matrix recovery from rank one measurements. Appl. Comput. Harm. Anal. 42(1), 88–116 (2017). https://doi.org/10.1016/j.acha.2015.07.007
Kabanava, M., Kueng, R., Rauhut, H., Terstiege, U.: Stable low-rank matrix recovery via null space properties. Inform. Infer. J. IMA 5(4), 405–441 (2016). https://doi.org/10.1093/imaiai/iaw014
Krahmer, F., Kümmerle, C., Melnyk, O.: On the robustness of noise-blind low-rank recovery from rank-one measurements. Linear Algebra Appl. 652, 37–81 (2022) https://doi.org/10.1016/j.laa.2022.07.002. https://www.sciencedirect.com/science/article/pii/S0024379522002609
Goldstein, T., Studer, C.: Phasemax: Convex phase retrieval via Basis Pursuit. IEEE Trans. Inform. Theory 64(4), 2675–2689 (2018). https://doi.org/10.1109/TIT.2018.2800768
Ghods, R., Lan, A.S., Goldstein, T., Studer, C.: in 52nd Annual Conference on Information Sciences and Systems (CISS) (IEEE, Princeton, NJ, 2018), pp. 1–6. https://doi.org/10.1109/CISS.2018.8362270
Chapman, H.N.: Phase-retrieval X-ray microscopy by Wigner-distribution deconvolution. Ultramicroscopy 66(3–4), 153–172 (1996). https://doi.org/10.1016/S0304-3991(96)00084-8
Iwen, M.A., Viswanathan, A., Wang, Y.: Fast phase retrieval from local correlation measurements. SIAM J. Imag. Sci. 9(4), 1655–1688 (2016). https://doi.org/10.1137/15M1053761
Iwen, M.A., Preskitt, B., Saab, R., Viswanathan, A.: Phase retrieval from local measurements: Improved robustness via eigenvector-based angular synchronization. Appl. Comput. Harm. Anal. 48(1), 415–444 (2020). https://doi.org/10.1016/j.acha.2018.06.004
Forstner, A., Krahmer, F., Melnyk, O., Sissouno, N.: Well-conditioned ptychographic imaging via lost subspace completion. Inverse Problems 36(10), 105,009 (2020). https://doi.org/10.1088/1361-6420/abaf3a
Perlmutter, M., Merhi, S., Viswanathan, A., Iwen, M.: Inverting spectrogram measurements via aliased Wigner distribution deconvolution and angular synchronization. Inform. Infer. J. IMA (2020). https://doi.org/10.1093/imaiai/iaaa023
Levi, A., Stark, H.: in ICASSP ’84. IEEE International Conference on Acoustics, Speech, and Signal Processing (Institute of Electrical and Electronics Engineers, San Diego, CA, USA, 1984), pp. 88–91. https://doi.org/10.1109/ICASSP.1984.1172785
Tsipenyuk, A.: Variational approach to Fourier phase retrieval. Doctoral Thesis
Aubin, J.P.: Applied Functional Analysis, 2nd edn. Pure and applied mathematics (John Wiley & Sons, Inc, Hoboken, NJ, USA, 2000). https://doi.org/10.1002/9781118032725. https://onlinelibrary.wiley.com/doi/book/10.1002/9781118032725
Conca, A., Edidin, D., Hering, M., Vinzant, C.: An algebraic characterization of injectivity in phase retrieval. Appl. Comput. Harmon. Anal. 38(2), 346–356 (2015). https://doi.org/10.1016/j.acha.2014.06.005
Beinert, R., Plonka, G.: Ambiguities in one-dimensional discrete phase retrieval from Fourier magnitudes. J. Fourier Anal. Appl. 21(6), 1169–1198 (2015). https://doi.org/10.1007/s00041-015-9405-2
Bauschke, H.H., Borwein, J.M.: On projection algorithms for solving convex feasibility problems. SIAM Rev. 38(3), 367–426 (1996). https://doi.org/10.1137/S0036144593251710
Qian, J., Yang, C., Schirotzek, A., Maia, F., Marchesini, S.: Efficient Algorithms for Ptychographic Phase Retrieval. Inverse Problems and Applications. Contemp. Math 615, 261–280 (2014)
Alexeev, B., Bandeira, A.S., Fickus, M., Mixon, D.G.: Phase retrieval with polarization. SIAM J. Imag. Sci. 7(1), 35–66 (2014). https://doi.org/10.1137/12089939X
Pfander, G.E., Salanevich, P.: Robust phase retrieval algorithm for time-frequency structured measurements. SIAM J. Imag. Sci. 12(2), 736–761 (2019). https://doi.org/10.1137/18M1205522
Singer, A.: Angular synchronization by eigenvectors and semidefinite programming. Appl. Comput. Harmon. Anal. 30(1), 20–36 (2011). https://doi.org/10.1016/j.acha.2010.02.001
Boumal, N.: Nonconvex phase synchronization. SIAM J. Optim. 26(4), 2355–2377 (2016). https://doi.org/10.1137/16M105808X
Bandeira, A.S., Boumal, N., Singer, A.: Tightness of the maximum likelihood semidefinite relaxation for angular synchronization. Math. Programm. 163(1–2), 145–167 (2017). https://doi.org/10.1007/s10107-016-1059-6
Filbir, F., Krahmer, F., Melnyk, O.: On recovery guarantees for angular synchronization. J. Fourier Anal. Appl. 27(2) (2021). https://doi.org/10.1007/s00041-021-09834-1
Chen, P., Fannjiang, A., Liu, G.R.: Phase retrieval by linear algebra. SIAM J. Matrix Anal. Appl. 38(3), 854–868 (2017). https://doi.org/10.1137/16M1107747
Bendory, T., Eldar, Y.C., Boumal, N.: Non-convex phase retrieval from STFT measurements. IEEE Trans. Inform. Theory 64(1), 467–484 (2018). https://doi.org/10.1109/TIT.2017.2745623
Griffin, D., Lim, J.: Signal estimation from modified Short-Time Fourier transform. IEEE Trans. Acoustics Speech Signal Proc. 32(2), 236–243 (1984). https://doi.org/10.1109/TASSP.1984.1164317
Wirtinger, W.: Zur formalen Theorie der Funktionen von mehr komplexen Veränderlichen. Math. Annalen 97(1), 357–375 (1927). https://doi.org/10.1007/BF01447872
Hunger, R.: An introduction to complex differentials and complex differentiability (2008). https://mediatum.ub.tum.de/doc/631019/631019.pdf
Bouboulis, P.: Wirtinger’s calculus in general Hilbert spaces. arXiv:1005.5170
Beck, A.: First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, Philadelphia (2017). https://doi.org/10.1137/1.9781611974997
Maiden, A.M., Rodenburg, J.M.: An improved ptychographical phase retrieval algorithm for diffractive imaging. Ultramicroscopy 109(10), 1256–1262 (2009). https://doi.org/10.1016/j.ultramic.2009.05.012
Acknowledgements
The author thanks Frank Filbir and Felix Krahmer for helpful discussions.
Funding
Open Access funding enabled and organized by Projekt DEAL. This work was partially supported by the Helmholtz Association within the projects Ptychography 4.0 and EDARTI.
Ethics declarations
Conflict of interest
The author has no conflict of interest and no relevant financial or non-financial interests to disclose.
This article is part of the topical collection “Recent advances in computational harmonic analysis” edited by Dae Gwan Lee, Ron Levie, Johannes Maly and Hanna Veselovska.
Melnyk, O. On connections between Amplitude Flow and Error Reduction for phase retrieval and ptychography. Sampl. Theory Signal Process. Data Anal. 20, 16 (2022). https://doi.org/10.1007/s43670-022-00035-5