1 Introduction

The phase retrieval problem considers the reconstruction of an unknown \(x \in \mathbb {C}^{d}\) from \(m \in \mathbb {N}\) amplitude measurements of the form

$$\begin{aligned} y_k = \vert { (A x)_k }\vert + n_k, \quad k = 1, \ldots , m, \end{aligned}$$
(1)

with \(A \in \mathbb {C}^{m \times d}\) denoting the measurement matrix and \(n \in \mathbb {R}^m\) being noise.

It has many applications, such as crystallography [1], the imaging of noncrystalline materials [2,3,4] and optical imaging [5], where the goal is to recover the specimen from its diffraction patterns obtained by illumination with penetrating light, e.g., X-rays or an electron beam.

One such application is ptychography [6, 7], where the inference on the object of interest is based on a collection of far-field diffraction patterns, each obtained by illuminating a small region of the specimen. As the regions overlap, the measurements carry redundant information, which allows for the unique identification of the object from ptychographic measurements up to a global phase factor.

Since the introduction of the phase retrieval problem to the mathematical community, many approaches have been developed in order to reconstruct the specimen. The spectrum of methods includes alternating projections methods [8,9,10,11,12,13,14], gradient-based minimization [15,16,17,18,19], semidefinite [20,21,22,23,24] and linear programming [25, 26], direct methods [27,28,29,30,31], and many more.

One of the longstanding favored algorithms is Error Reduction (ER), introduced in 1972 by Gerchberg and Saxton [8]. Later contributions [10, 32] and [12] classified ER as an alternating projections technique, supplemented it with a detailed convergence analysis, and provided an interpretation of the algorithm as a projected gradient method. A version of the ER algorithm was also studied as a gradient flow method in the continuous setting [33].

Another algorithm, which has become popular in recent years, is Amplitude Flow (AF) [17, 19]. It performs first-order optimization of the amplitude-based squared loss

$$\begin{aligned} \mathcal {L}_2(z) = \sum _{k=1}^{m} \vert {\vert { (A z)_k }\vert - y_k }\vert ^2, \quad z \in \mathbb {C}^{d}. \end{aligned}$$
(2)

AF is well understood for randomized measurement scenarios [17], where the matrix A is random. It also possesses convergence guarantees for arbitrary measurement scenarios [19].

In this paper we connect these two methods by representing ER as a scaled gradient method for the minimization of the amplitude-based squared loss \(\mathcal {L}_2\). This allows us to establish a convergence rate for the ER algorithm, which to our knowledge has not been observed in the literature before. Furthermore, the scaled gradient representation provides the equivalence of the sets of fixed points of the two methods. Lastly, we consider ER and AF applied to ptychographic measurements and show that both methods exhibit the same computational complexity and in special cases even coincide.

The paper is structured in the following way. In Section 2 we provide the reader with the necessary notation and a detailed overview of the ER and AF algorithms. Our contribution is then presented in Section 3 and proved in Section 4. Finally, the paper is summarized by a short conclusion.

2 Notation and Preliminaries

2.1 Definitions

Throughout the paper, we will use the short notation \([a] = \{1,2,\ldots , a\}\) for index sets. The complex unit is denoted by i. The complex conjugate of \(\alpha \in \mathbb {C}\) is given by \(\bar{\alpha }\). The transpose and complex conjugate transpose of a vector v or a matrix B are denoted by \(v^T, v^*\) and \(B^T, B^*\), respectively. The Euclidean norm of a vector \(v \in \mathbb {C}^a\) is given by \(\Vert {v}\Vert _2 := \left[ \sum _{j = 1}^a \vert {v_j}\vert ^2 \right] ^{1/2}\). We say that a matrix \(B \in \mathbb {C}^{b \times a}\), \(b \ge a\), is injective if for all pairs of vectors \(u,v \in \mathbb {C}^a\) with \(u \ne v\) it holds that \(Bu \ne Bv\). The injectivity of B is equivalent to the condition \(\mathrm{rank}(B) = a\). We will also denote the image of B as

$$\begin{aligned} {\text {im}}(B) := \{ B v \in \mathbb {C}^{b} : v \in \mathbb {C}^a \}. \end{aligned}$$

For a square full rank matrix B its inverse is given by \(B^{-1}\). A matrix \(B \in \mathbb {C}^{b \times a}, b \ge a\) is called orthogonal if it satisfies \(B^* B = I\), where I denotes the identity matrix. A square orthogonal matrix \(B \in \mathbb {C}^{a \times a}\) is a unitary matrix and its inverse is \(B^{-1} =B^*\).

The projection of \(u \in \mathbb {C}^{a}\) onto a set \(\mathcal {S} \subseteq \mathbb {C}^{a}\) is an element \(\tilde{u} \in \mathcal {S}\) such that \(\Vert {u - \tilde{u}}\Vert _2 \le \Vert {u - v}\Vert _2\) for all \(v \in \mathcal {S}\). The operator which maps u to \(\tilde{u}\) is called the projection operator onto \(\mathcal {S}\). In general, \(\tilde{u}\) is not unique; however, when \(\mathcal {S}\) is a non-empty closed convex set, \(\tilde{u}\) is uniquely determined [34].

For a matrix \(B \in \mathbb {C}^{b \times a}\) of rank r its singular value decomposition is given by

$$\begin{aligned} B = U \Sigma V^*, \end{aligned}$$

where \(U\in \mathbb {C}^{b \times r}\), \(V \in \mathbb {C}^{a \times r}\) are orthogonal matrices and \(\Sigma \in \mathbb {R}^{r \times r}\) is an invertible diagonal matrix with diagonal entries \(\sigma _j(B)>0\), \(j \in [r]\), sorted in decreasing order. The values \(\sigma _j(B)\) are also referred to as the singular values of B. The largest singular value \(\sigma _1(B)\) equals the spectral norm of B defined as

$$\begin{aligned} \Vert {B}\Vert := \max _{v \in \mathbb {C}^{a}, \Vert {v}\Vert _2 =1} \Vert {B v}\Vert _2. \end{aligned}$$

Using the singular value decomposition, the Moore-Penrose pseudoinverse of B is defined as

$$\begin{aligned} B^\dagger := V \Sigma ^{-1} U^*. \end{aligned}$$

For an injective matrix \(B \in \mathbb {C}^{b \times a}, b \ge a\), its pseudoinverse \(B^\dagger \) can be expressed as

$$\begin{aligned} B^\dagger = (B^* B)^{-1} B^*. \end{aligned}$$
(3)

It satisfies

$$\begin{aligned} B^\dagger B = I \text { and } B B^\dagger \text { is a projection operator onto the set } {\text {im}}(B). \end{aligned}$$
(4)

For a vector \(v \in \mathbb {C}^a\), the diagonal matrix \({\text {diag}}(v) \in \mathbb {C}^{a \times a}\) is formed by placing the entries of the vector v onto the main diagonal, so that for \(k,j \in [a]\) it holds that

$$\begin{aligned} {\text {diag}}(v)_{k,j} := {\left\{ \begin{array}{ll} v_k &{} k = j, \\ 0 &{} k \ne j. \end{array}\right. } \end{aligned}$$

The discrete Fourier transform is given by a matrix \(F \in \mathbb {C}^{d \times d}\) with the entries

$$\begin{aligned} F_{k,j} =e^{2 \pi i (k-1)(j-1)/d}, \quad k,j \in [d], \end{aligned}$$
(5)

and it satisfies the equality

$$\begin{aligned} F^* F = d I. \end{aligned}$$
(6)

The family of the circular shift matrices \(S_s \in \mathbb {C}^{d \times d}, s \in \mathbb {Z}\) is defined by its action for all vectors \(v \in \mathbb {C}^d\) as

$$\begin{aligned} (S_s v)_j := v_{((j - 1 - s) { \text{ mod } } d) + 1}, \quad j \in [d]. \end{aligned}$$
(7)
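
As a small sanity check of the conventions (5)-(7), the following sketch (our own, assuming NumPy) constructs F and the circular shifts and verifies (6); the function names are illustrative.

```python
import numpy as np

def dft_matrix(d: int) -> np.ndarray:
    """DFT matrix with entries exp(2*pi*i*(k-1)*(j-1)/d) as in (5)."""
    k = np.arange(d)
    return np.exp(2j * np.pi * np.outer(k, k) / d)

def circular_shift(v: np.ndarray, s: int) -> np.ndarray:
    """(S_s v)_j = v_{((j-1-s) mod d)+1} from (7); in 0-based indexing this is np.roll."""
    return np.roll(v, s)

d = 8
F = dft_matrix(d)
assert np.allclose(F.conj().T @ F, d * np.eye(d))   # F^* F = d I, cf. (6)

v = np.arange(d, dtype=complex)
print(circular_shift(v, 2))                         # entries of v shifted forward by two positions
```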

For the description of the computational complexity of algorithms, we use the notation \(\mathcal {O}(n)\) for the order of operations, meaning that at most cn operations are required for some constant \(c>0\).

For a function \(f: \mathbb {C} \rightarrow \mathbb {C}\) and a vector \(v \in \mathbb {C}^a\), the notation f(v) denotes the entrywise application of the function f. For a vector \(v \in \mathbb {C}^a\) and a number \(\alpha \in \mathbb {C}\), by \(v + \alpha \) we denote the vector in \(\mathbb {C}^a\) with entries \(v_k + \alpha \), \(k \in [a]\). For instance, using this notation we can rewrite the measurements (1) as

$$\begin{aligned} y = \vert {Ax}\vert + n. \end{aligned}$$

2.2 Phase retrieval

In the context of the phase retrieval problem, it is convenient to refer to the spaces \(\mathbb {C}^d\) and \(\mathbb {C}^m\) as the object and measurement spaces, respectively. If the phases of the measurements Ax were known, the problem would be the classical recovery from linear measurements, which in general is only possible if the dimension of the measurement space m is at least as large as the dimension of the object space d. Since the phases are lost, the number of required measurements is even higher and, hence, we will assume that \(m \ge d\). It is known that \(m \ge 4 d - 4\) measurements are sufficient for the unique reconstruction of x when A is generic [35], and \(m \ge c d\) with a constant \(c\ge 1\) suffices when A is random [23, 24]. By the unique reconstruction of x we understand the identification of x up to a global phase factor, i.e., up to multiplication by some \(\alpha \in \mathbb {C}\) with \(\vert {\alpha }\vert = 1\), since it holds that

$$\begin{aligned} \vert {A x }\vert = \vert { A \alpha x}\vert . \end{aligned}$$

The unique reconstruction of x up to a global phase is equivalent to the unique identification of the set \(\{\alpha x: \vert {\alpha }\vert = 1 \}\) or, in other words, to the injectivity of the map \(\{\alpha x : \vert {\alpha }\vert = 1\} \mapsto \vert {Ax}\vert \). One of the necessary conditions for unique recovery is the injectivity of the matrix A. If A is not injective, then there exist two vectors \(u,v \in \mathbb {C}^d\) such that \(u \ne v\) and \(Au = Av\). Consequently, \(\vert {A(u-v)}\vert = 0\) and it is not possible to distinguish \(u-v\) from the zero vector based on the measurements. The injectivity of the matrix A will be the main assumption for our results in Section 3. The injectivity of A, however, is not sufficient for unique recovery. A counterexample is \(A=F\), which is injective by (6), but it is well known that there are multiple objects producing the same measurements \(\vert {F x}\vert \) [36].

2.3 Error Reduction

Error Reduction (ER) is an iterative algorithm for the phase retrieval problem. It starts from an initial guess \(z^{0} \in \mathbb {C}^{d}\) in the object space and is given by the iterations

$$\begin{aligned} z^{t+1} = A^\dagger {\text {diag}}\left( \frac{y}{\vert {A z^t}\vert }\right) Az^t ,\quad t \ge 0. \end{aligned}$$
(ER)

The iterations are repeated until a fixed point is reached, that is, \(z^{t+1} = z^t\). For \(T \in \mathbb {N}\) iterations of ER, \(\mathcal {O}(m d^2 + T m d)\) operations are required, where \(\mathcal {O}(m d^2)\) operations are needed to compute the pseudoinverse and \(\mathcal {O}(m d)\) operations are performed per iteration.
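
To make the procedure concrete, the following is a minimal sketch of the ER iteration, assuming NumPy; the function name `error_reduction` and the stopping rule (a fixed number of iterations rather than an exact fixed-point check) are our own choices.

```python
import numpy as np

def error_reduction(A: np.ndarray, y: np.ndarray, z0: np.ndarray, T: int) -> np.ndarray:
    """Run T iterations of (ER) starting from z0."""
    A_pinv = np.linalg.pinv(A)              # O(m d^2), computed once
    z = z0.copy()
    for _ in range(T):
        u = A @ z                           # current point A z^t in the measurement space
        phase = np.zeros_like(u)
        nz = np.abs(u) > 0
        phase[nz] = u[nz] / np.abs(u[nz])   # entries of A z^t / |A z^t|, with 0/0 set to 0
        z = A_pinv @ (y * phase)            # A^dagger diag(y / |A z^t|) A z^t
    return z
```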

Let us consider the iterates in the measurement space \(u^t := A z^t\), \(t \ge 0\), for which the update of ER reads as

$$\begin{aligned} u^{t+1} = A z^{t+1} = A A^\dagger {\text {diag}}\left( \frac{y}{\vert {A z^t}\vert }\right) A z^t = A A^\dagger {\text {diag}}\left( \frac{y}{\vert {u^t}\vert }\right) u^t. \end{aligned}$$

In this form, \({\text {diag}}\left( \frac{y}{\vert {u^t}\vert }\right) u^t\) is the projection of \(u^t\) onto the set

$$\begin{aligned} \mathcal {M} := \{ u \in \mathbb {C}^{m} : \vert {u}\vert = y \}. \end{aligned}$$

Moreover, the set \(\mathcal {M}\) can be viewed as a product of one-dimensional sets

$$\begin{aligned} \mathcal {M}_k := \{\alpha \in \mathbb {C}: \vert {\alpha }\vert = y_k \}, \quad k \in [m], \end{aligned}$$

and, thus, the projection onto \(\mathcal {M}\) is performed by projecting each coordinate \(u^t_k\) onto the corresponding set \(\mathcal {M}_k\). The second step is to apply \(A A^\dagger \), which, by (4), is the projection operator onto \({\text {im}}(A)\). Therefore, ER first projects onto \(\mathcal {M}\), where the measurements are satisfied. Then, the resulting point is projected onto \({\text {im}}(A)\). These sequential projections onto \(\mathcal {M}\) and \({\text {im}}(A)\) allow for an interpretation of ER as an alternating projection scheme. If \(\mathcal {M}\) were a convex set, ER would converge to the intersection of the two sets [37]. However, due to the non-convexity of \(\mathcal {M}\), the convergence of \(u^t\) to the intersection of the sets is not guaranteed, which is a known problem of the ER algorithm. We note that, when A allows for unique recovery and noise is absent, the intersection of \(\mathcal {M}\) and \({\text {im}}(A)\) is given by \(\{ \alpha A x: \vert {\alpha }\vert = 1\}\) [12].

Another complication arising from the non-convexity of \(\mathcal {M}\) is the non-uniqueness of the projection onto \(\mathcal {M}\). Let \(y_k \ne 0\) and consider the projection of \(\alpha \in \mathbb {C}\) onto \(\mathcal {M}_k\). If \(\alpha \) is non-zero, the closest point in \(\mathcal {M}_k\) is given by \(y_k \cdot \alpha / \vert {\alpha }\vert \) [12, Lemma 3.15a]. If \(\alpha = 0\), all points in \(\mathcal {M}_k\) have the same distance to 0 and any of them can be used as a projection. In the literature, this is resolved by setting the projection either to \(y_k\) or to \(y_k e^{i \varphi }\) for a randomly selected angle \(\varphi \in [0,2 \pi )\). In this paper, we will instead map 0 to 0, which is not precisely a projection, but can be interpreted as the average of all possible projections

$$\begin{aligned} 0 = \frac{1}{2 \pi }\int _{0}^{2 \pi } y_k e^{i \varphi } d \varphi . \end{aligned}$$

Therefore, whenever \((Az^t)_k = 0 \) we set \((Az^t)_k / \vert {(Az^t)_k}\vert = 0\).

The ER algorithm can also be interpreted as a projected gradient method [12, Section 3.8] applied to solve the minimization problem

$$\begin{aligned} \min _{u \in {\text {im}}(A)} \Vert { \vert {u}\vert - y}\Vert _2^2. \end{aligned}$$
(8)

We note that substituting \(u = Az\), \(z \in \mathbb {C}^d\), leads to the unconstrained minimization of the amplitude-based objective (2), which suggests that ER can be interpreted as a gradient method applied to the function \(\mathcal {L}_2\).

It is known that, in the absence of noise, if the initial guess \(z^0\) is chosen sufficiently close to the set \(\{\alpha x: \vert {\alpha }\vert = 1 \}\), the ER algorithm converges to a point in this set [12, Theorem 3.16]. In general, ER does not converge globally to \(\{\alpha x: \vert {\alpha }\vert = 1 \}\) [12, p. 830]. If the loss \(\mathcal {L}_2\) is differentiable at \(z^t\), the ER iteration does not increase the value of \(\mathcal {L}_2\), i.e., \(\mathcal {L}_2(z^{t+1}) \le \mathcal {L}_2(z^t)\) [12, 38].

For the initialization \(z^0\) of ER, the polarization method can be used [12, 39, 40]. It constructs a matrix containing estimates of \({\text {sgn}}((Ax)_k) \overline{{\text {sgn}}((Ax)_\ell )}\), \(k,\ell \in [m]\), from the measurements and recovers \({\text {sgn}}((Ax)_k)\) by solving a phase synchronization problem [41,42,43,44].

2.4 Amplitude Flow

The Amplitude Flow (AF) algorithm performs gradient-based optimization of the amplitude-based objective (2). It is based on Wirtinger derivatives, which are discussed in greater detail in Section 4.2; in this section we only state the gradient in order to avoid lengthy derivations.

Given an initial guess \(z^0 \in \mathbb {C}^d\), AF is based on the iterations

$$\begin{aligned} z^{t+1} = z^{t} - \mu _t \nabla \mathcal {L}_2(z^t), \quad t \ge 0, \end{aligned}$$
(AF)

where \(\mu _t > 0\) denotes the so-called learning rate and \(\nabla \mathcal {L}_2\) is the generalized Wirtinger gradient of \(\mathcal {L}_2\) given by

$$\begin{aligned} \nabla \mathcal {L}_2(z) = A^* \left[ I - {\text {diag}}\left( \frac{y}{\vert {A z}\vert }\right) \right] A z. \end{aligned}$$

Similarly to ER, we treat the case \((Az)_k = 0\) by setting \((Az)_k / \vert {(Az)_k}\vert = 0\). The iteration process is continued until the gradient \(\nabla \mathcal {L}_2(z^t)\) vanishes, which is equivalent to reaching a fixed point \(z^{t+1} = z^t\). Originally, AF was derived and analyzed for random Gaussian measurements without noise [17]. For such A, it is possible to construct a good starting point \(z^0\) via spectral initialization [15] or null initialization [45], such that AF admits a linear convergence rate to the set of true solutions \(\{\alpha x : \vert {\alpha }\vert = 1 \}\). In general, for any choice of the measurement matrix A, the following convergence results have been established in [19].

Theorem 1

([19, Theorem 1]) Consider measurements y of the form (1). Let \(0<\mu _t \le \Vert {A}\Vert ^{-2}\) and \(z^0 \in \mathbb {C}^d\) be arbitrary. Then, for iterates \(\{ z^t \}_{t \ge 0}\) defined by AF we have

$$\begin{aligned}&\mathcal {L}_2(z^t) \ge \mathcal {L}_2(z^{t+1}) \text { for all } t \ge 0, \\&\Vert {z^{t+1} - z^t}\Vert _2 \rightarrow 0,\ t \rightarrow \infty , \end{aligned}$$

and

$$\begin{aligned} \min _{t= 0,\ldots , T-1} \Vert {z^{t+1} - z^t}\Vert _2^2 \le \frac{ \mathcal {L}_2(z^0) }{ \Vert {A}\Vert ^2 T } \ \text { for all } T \ge 1. \end{aligned}$$

Unlike the randomized scenario, the general case only guarantees convergence to a fixed point at a sublinear rate. Therefore, the initialization \(z^{0}\) is crucial for convergence to a global minimum. For a non-random A, e.g., in the case of ptychography, the outcome of a direct (non-iterative) method [30] is a good starting point. Furthermore, with a sufficiently good initialization, AF can achieve a linear convergence rate [46] even for non-random measurements.

As the proof of Theorem 1 resembles the proof of Theorem 3 below, we provide a sketch of the proof of Theorem 1 in Remark 10 in Section 4.2.

The computational complexity of AF for \(T \in \mathbb {N}\) iterations is \(\mathcal {O}(T m d)\) operations. If the learning rate is chosen as \(\mu _t = \Vert {A}\Vert ^{-2}\), the spectral norm can be computed with additional \(\mathcal {O}(m d)\) operations by performing a fixed number of power method iterations. More precisely, for \(K \in \mathbb {N}\) and a random initialization \(v^0\), the iterates \(v^{k} = A^* A v^{k-1} / \Vert {A^* A v^{k-1} }\Vert _2\), \(k \in [K]\), are computed and \(\Vert {A v^K}\Vert _2\) is used as an estimate of \(\Vert {A}\Vert \).
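
The following is a minimal sketch of AF with the power-method estimate of \(\Vert A\Vert \) described above, assuming NumPy; the names `amplitude_flow` and `spectral_norm_estimate`, the fixed number K of power iterations, and the use of a constant learning rate are our own choices.

```python
import numpy as np

def spectral_norm_estimate(A: np.ndarray, K: int = 30, seed: int = 0) -> float:
    """Estimate ||A|| = sigma_1(A) with K power method iterations on A^* A."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1]) + 1j * rng.standard_normal(A.shape[1])
    for _ in range(K):
        w = A.conj().T @ (A @ v)            # one application of A^* A, O(m d) operations
        v = w / np.linalg.norm(w)
    return np.linalg.norm(A @ v)

def amplitude_flow(A: np.ndarray, y: np.ndarray, z0: np.ndarray, T: int) -> np.ndarray:
    """Run T iterations of (AF) with the constant learning rate mu = ||A||^{-2}."""
    mu = spectral_norm_estimate(A) ** (-2)
    z = z0.copy()
    for _ in range(T):
        u = A @ z
        phase = np.zeros_like(u)
        nz = np.abs(u) > 0
        phase[nz] = u[nz] / np.abs(u[nz])   # 0/0 resolved as 0, as above
        z = z - mu * (A.conj().T @ (u - y * phase))   # generalized Wirtinger gradient step
    return z
```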

3 Results

As briefly mentioned in Section 2.3, ER can be linked to the minimization of the amplitude-based objective (2). We formalize this intuition in the next lemma.

Lemma 2

Let A be injective. Then, ER is a scaled gradient method with iterations given by

$$\begin{aligned} z^{t+1} = z^t - (A^* A)^{-1} \nabla \mathcal {L}_2(z^t), \quad t \ge 0. \end{aligned}$$

We emphasize that the result of Lemma 2 only holds for all \(z \in \mathbb {C}^d\) if the ambiguity 0/0 in the iteration of ER is resolved as 0.

The reinterpretation of ER as a scaled gradient method allows us to analyze the convergence of the algorithm similarly to AF, which leads to the following analogue of Theorem 1.

Theorem 3

Consider phase retrieval measurements y of the form (1) with an injective matrix A. Let \(z^0 \in \mathbb {C}^d\) be arbitrary. Then, for the iterates \(\{z^t\}_{t \ge 0}\) given by ER we have

$$\begin{aligned}&\mathcal {L}_2(z^t) \ge \mathcal {L}_2(z^{t+1}) \text { for all } t \ge 0, \\&\Vert {z^{t+1} - z^t}\Vert _2 \rightarrow 0,\ t \rightarrow \infty , \end{aligned}$$

and

$$\begin{aligned} \min _{t = 0, \ldots , T-1} \Vert {z^{t+1} - z^{t}}\Vert _2^2 \le \frac{\mathcal {L}_2(z^0)}{T \sigma _d^2(A)} \text { for all } T \ge 1, \end{aligned}$$

where \(\sigma _d(A)\) denotes the smallest singular value of the matrix A.

Theorem 3 guarantees that, no matter how noisy the measurements are, ER always converges to a fixed point, and the convergence rate is sublinear. However, even in the absence of noise, it does not guarantee global convergence to a point in the set \(\{\alpha x : \vert {\alpha }\vert = 1 \}\). We note that for \(A = F\) and for A corresponding to ptychography (see (10) below), the convergence of ER to a fixed point was shown in [10] and [47], respectively. However, the convergence rate was not derived. Comparing Theorem 3 to Theorem 1, we observe that the constant in the convergence rate of ER is worse by a factor of \(\sigma _1^2(A)/\sigma _d^2(A)\) compared to AF.

A further consequence of Lemma 2 is the equality of the fixed-point sets of both algorithms.

Corollary 4

Let A be injective. Then, \(z \in \mathbb {C}^d\) is a fixed point of ER if and only if z is a fixed point of AF.

We note that Corollary 4 does not imply that, given the same initial guess \(z^0\), both algorithms necessarily converge to the same fixed point.

By Theorem 1 and Theorem 3, both algorithms are comparable in terms of convergence rate, and by Corollary 4 in terms of critical points. However, for \(T \in \mathbb {N}\) iterations ER requires \(\mathcal {O}(m d^2 + T m d)\) operations, while AF only needs \(\mathcal {O}(T m d)\) operations; thus, in general, ER is considerably slower in terms of computational complexity. The next corollary shows that this difference is less significant in cases where the columns of A are orthogonal.

Corollary 5

Let

$$\begin{aligned} A^* A = {\text {diag}}(v) \text { for some } v \in \mathbb {R}^{d} \text { with } v_\ell >0,\ \ell \in [d]. \end{aligned}$$
(9)

Then, for \(T \in \mathbb {N}\) iterations both algorithms ER and AF require \(\mathcal {O}(T m d)\) operations.

Furthermore, if \(A^* A = c I\) for some \(c>0\), then the iteration of ER coincides with the iteration of AF with the learning rate \(\mu _t = \Vert {A}\Vert ^{-2}\).
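
As a quick numerical illustration of the last claim, the following sketch (our own, assuming NumPy) checks that for \(A = F\), where \(A^* A = d I\) by (6), one step of ER equals one step of AF with \(\mu _t = \Vert {A}\Vert ^{-2} = 1/d\).

```python
import numpy as np

d = 16
k = np.arange(d)
A = np.exp(2j * np.pi * np.outer(k, k) / d)      # DFT matrix F from (5); A^* A = d I
rng = np.random.default_rng(1)
x = rng.standard_normal(d) + 1j * rng.standard_normal(d)
y = np.abs(A @ x)                                # noiseless measurements
z = rng.standard_normal(d) + 1j * rng.standard_normal(d)

u = A @ z
phase = u / np.abs(u)                            # |u| > 0 almost surely for a random z
z_er = np.linalg.pinv(A) @ (y * phase)                   # one ER step
z_af = z - (1.0 / d) * (A.conj().T @ (u - y * phase))    # one AF step with mu = 1/d
assert np.allclose(z_er, z_af)
```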

While condition (9) may seem restrictive, it in fact holds in many practical applications. For instance, the equivalence of the two algorithms was observed for the recovery from Fourier magnitudes (\(A = F\)) in [10]. Another application of interest is ptychography, for which the measurement matrix A is given by

$$\begin{aligned} A = \begin{bmatrix} F {\text {diag}}(S_{s_1} w) \\ \vdots \\ F {\text {diag}}(S_{s_r} w) \\ \end{bmatrix}, \end{aligned}$$
(10)

where the vector \(w \in \mathbb {C}^{d}\) denotes the distribution of the light in the illuminated region and \(s_1, \ldots , s_r \in [d]\), \(r \le d\), are the distinct positions of the regions. The matrices F and \(S_{s_j}\) are given by (5) and (7), respectively. When \(r = d\) and \(s_j = j\), \(j \in [d]\), the matrix A is also known as the discrete Short-Time Fourier transform (STFT) matrix with window w.
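
A minimal sketch of the measurement matrix (10), assuming NumPy, is given below; the helper `ptycho_matrix` and the window are our own illustrative choices. It also checks that, with the convention (5), each block \(F {\text {diag}}(S_{s_j} w)\) can be applied via an inverse FFT (since \(F u = d \cdot \mathrm{ifft}(u)\)), which allows applying A without forming it explicitly.

```python
import numpy as np

def ptycho_matrix(w: np.ndarray, shifts) -> np.ndarray:
    """Stack the blocks F diag(S_{s_j} w) of (10) into a single matrix."""
    d = w.shape[0]
    k = np.arange(d)
    F = np.exp(2j * np.pi * np.outer(k, k) / d)          # DFT matrix (5)
    return np.vstack([F @ np.diag(np.roll(w, s)) for s in shifts])

d = 64
rng = np.random.default_rng(0)
w = np.zeros(d, dtype=complex)
w[:16] = rng.standard_normal(16) + 1j * rng.standard_normal(16)   # localized window
shifts = list(range(1, d + 1))                                    # r = d, s_j = j: STFT matrix

A = ptycho_matrix(w, shifts)
x = rng.standard_normal(d) + 1j * rng.standard_normal(d)
fast = np.concatenate([d * np.fft.ifft(np.roll(w, s) * x) for s in shifts])
assert np.allclose(A @ x, fast)                                   # matrix-free application of A
```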

The next corollary shows that condition (9) and, consequently, the results of Corollary 5 also hold for ptychographic measurements.

Corollary 6

Consider measurements of the form (1) with ptychographic measurement matrix A as in (10). Then, \(A^* A = {\text {diag}}(v) \), where the vector v has entries

$$\begin{aligned} v_{\ell } = d \sum _{j \in [r]} \vert {(S_{s_j} w)_{\ell }}\vert ^2, \end{aligned}$$

for all \(\ell \in [d]\). The matrix A is injective if and only if \(v_\ell >0\) for all \(\ell \in [d]\). Furthermore, if A is the STFT matrix, the vector v has entries \(v_\ell = d \Vert {w}\Vert _2^2\) for all \(\ell \in [d]\). Consequently, the results of Corollary 5 apply for ptychographic measurements.

In order to illustrate the result of Corollary 6, we perform numerical reconstructions of a randomly generated \(x \in \mathbb {C}^{d}\), \(d= 256\), with both AF and ER. In the first case, A is chosen to be the STFT matrix with the window

$$\begin{aligned} w_j = {\left\{ \begin{array}{ll} \exp \left( -\frac{(j - 8.5)^2}{12.8} + i \frac{\pi (j - 8.5)^2}{12.8} \right) , &{} j \in [32], \\ 0, &{} j \notin {[32]}. \end{array}\right. } \end{aligned}$$

In the second case, A is given by (10) with the same window and positions \(s_j = 16 j\), \(j \in [d/16]\). The measurements are additionally corrupted by Poisson noise such that the signal-to-noise ratio \(10 \log _{10} \left( \tfrac{ \Vert {\vert {Ax}\vert ^2}\Vert _2^2}{ \Vert {y^2 - \vert {Ax}\vert ^2}\Vert _2^2} \right) \) is approximately 45. Figure 1a shows the values \(\mathcal {L}_2(z^t)\) for 500 iterations of the algorithms starting from a random initialization \(z^0\). Note that for the STFT matrix A, AF and ER coincide as predicted by Corollary 6, while this is no longer true for A as in (10). Despite producing different reconstructions, the runtimes of the algorithms in Figure 1b are almost the same, which is in line with Corollary 5.
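
A rough sketch of the second experiment (positions \(s_j = 16j\)) is given below; it is our reconstruction of the setup, not the authors' code. In particular, the Poisson noise is generated through a scaling parameter `eta` that only approximately matches the reported SNR, and the ER step uses \((A^*A)^{-1} = {\text {diag}}(1/v)\) as in Corollary 5 and Corollary 6.

```python
import numpy as np

d = 256
j = np.arange(1, 33)
w = np.zeros(d, dtype=complex)
w[:32] = np.exp(-(j - 8.5) ** 2 / 12.8 + 1j * np.pi * (j - 8.5) ** 2 / 12.8)

k = np.arange(d)
F = np.exp(2j * np.pi * np.outer(k, k) / d)
shifts = [16 * jj for jj in range(1, d // 16 + 1)]                # positions s_j = 16 j
A = np.vstack([F @ np.diag(np.roll(w, s)) for s in shifts])

rng = np.random.default_rng(0)
x = rng.standard_normal(d) + 1j * rng.standard_normal(d)
intensity = np.abs(A @ x) ** 2
eta = 20.0                                                        # tunes the noise level
y = np.sqrt(rng.poisson(eta * intensity) / eta)                   # Poisson-corrupted amplitudes
snr = 10 * np.log10(np.linalg.norm(intensity) ** 2
                    / np.linalg.norm(y ** 2 - intensity) ** 2)
print(f"SNR approx {snr:.1f} dB")

def safe_sign(u):
    out = np.zeros_like(u)
    nz = np.abs(u) > 0
    out[nz] = u[nz] / np.abs(u[nz])
    return out

v = d * sum(np.abs(np.roll(w, s)) ** 2 for s in shifts)           # A^* A = diag(v), Corollary 6
mu = 1.0 / np.max(v)                                              # ||A||^{-2}
z0 = rng.standard_normal(d) + 1j * rng.standard_normal(d)
z_er, z_af = z0.copy(), z0.copy()
for t in range(500):
    u = A @ z_er
    z_er = z_er - (A.conj().T @ (u - y * safe_sign(u))) / v       # ER: scaled gradient step
    u = A @ z_af
    z_af = z_af - mu * (A.conj().T @ (u - y * safe_sign(u)))      # AF: plain gradient step
print("L2(ER):", np.sum((np.abs(A @ z_er) - y) ** 2))
print("L2(AF):", np.sum((np.abs(A @ z_af) - y) ** 2))
```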

Fig. 1: Numerical visualization of Corollary 5 and Corollary 6

Finally, we consider the scenario where the object is supported on a set \(J \subseteq [d]\). Then, we can rewrite the measurement model as

$$\begin{aligned} y = \vert {A E_{J} x_{J} }\vert + n, \end{aligned}$$

where \(x_{J}\) is the vector containing the entries of x indexed by J and \(E_{J}\) is the linear embedding operator which maps \(x_J\) to x. In this case, the results above apply to the new measurement matrix \(\tilde{A} = A E_{J}\).

4 Proofs

4.1 Proofs of Lemma 2 and corollaries

We will start with the proof of Lemma 2.

Proof of Lemma 2

By assumption, A is injective and, thus, the identities (3) and (4) hold true. Therefore, the iteration of ER can be rewritten as

$$\begin{aligned} z^{t+1}&= A^\dagger {\text {diag}}\left( \frac{y}{\vert {A z^t}\vert } \right) A z^t = A^\dagger A z^t - A^\dagger A z^t + A^\dagger {\text {diag}}\left( \frac{y}{\vert {A z^t}\vert } \right) A z^t \\&= z^t - A^\dagger \left[ I - {\text {diag}}\left( \frac{y}{\vert {A z^t}\vert }\right) \right] A z^t = z^t - (A^* A)^{-1} A^* \left[ I - {\text {diag}}\left( \frac{y}{\vert {A z^t}\vert }\right) \right] A z^t \\&= z^t - (A^* A)^{-1} \nabla \mathcal {L}_2 (z^t). \end{aligned}$$

\(\square \)

Using the result of Lemma 2, we deduce Corollary 4 and Corollary 5.

Proof of Corollary 4

Let \(z \in \mathbb {C}^{d}\) be a fixed point of ER. By Lemma 2, we have that

$$\begin{aligned} z = z - (A^* A)^{-1} \nabla \mathcal {L}_2 (z), \end{aligned}$$

which is equivalent to

$$\begin{aligned} (A^* A)^{-1} \nabla \mathcal {L}_2 (z) = 0. \end{aligned}$$

Since A is injective and \((A^* A)^{-1}\) exists, the obtained equality holds if and only if \(\nabla \mathcal {L}_2 (z) = 0\), that is, z is a fixed point of AF. \(\square \)

Proof of Corollary 5

Using the condition (9), we obtain \((A^* A)^{-1} = {\text {diag}}(1/v)\). Consequently, by Lemma 2, the iteration of ER is given by

$$\begin{aligned} z^{t+1} = z^{t} - (A^* A)^{-1} \nabla \mathcal {L}_2 (z^t) = z^{t} - {\text {diag}}(1/v) \nabla \mathcal {L}_2 (z^t). \end{aligned}$$

The computation of the gradient requires \(\mathcal {O}(md)\) operations. Both the multiplication with \({\text {diag}}(1/v)\) and the subtraction can be done in \(\mathcal {O}(d)\) operations. Therefore, the total number of operations for a single iteration of ER is \(\mathcal {O}(md + d) = \mathcal {O}(md)\), which is the same order of operations as for a single iteration of AF. Furthermore, the evaluation of v requires additional \(\mathcal {O}(md)\) operations. We also note that \(\Vert {A}\Vert ^2 = \max _{\ell \in [d]} \vert {v_\ell }\vert \), so that the computation of the learning rate requires \(\mathcal {O}(d)\) operations. Therefore, both algorithms have a total complexity of \(\mathcal {O}(Tmd)\) for T iterations.

If \(A^* A = c I\), then

$$\begin{aligned} \Vert {A}\Vert ^2 = \Vert {A^* A}\Vert = \Vert {c I}\Vert = c \text { and } (A^* A)^{-1} = c^{-1} I = \Vert {A}\Vert ^{-2} I . \end{aligned}$$

Hence, using Lemma 2 for the iteration of ER we have

$$\begin{aligned} z^{t+1} = z^{t} - (A^* A)^{-1} \nabla \mathcal {L}_2 (z^t) = z^{t} - \Vert {A}\Vert ^{-2} \nabla \mathcal {L}_2 (z^t), \end{aligned}$$

which is precisely the iteration of AF with \(\mu _t = \Vert {A}\Vert ^{-2}\). \(\square \)

The last corollary follows from direct computations similar to those for equation (12) in [19].

Proof of Corollary 6

We compute the product \(A^*A\) by using the representation (10),

$$\begin{aligned} A^* A = \sum _{j=1}^r ( F {\text {diag}}(S_{s_j} w) )^* (F {\text {diag}}(S_{s_j} w)) = \sum _{j=1}^r {\text {diag}}^*(S_{s_j} w) F^* F {\text {diag}}(S_{s_j} w). \end{aligned}$$

Next, we use (6) and \({\text {diag}}^*(S_{s_j} w) = {\text {diag}}(\overline{S_{s_j} w})\) to obtain

$$\begin{aligned} A^* A&= \sum _{j=1}^r d {\text {diag}}(\overline{S_{s_j} w}) {\text {diag}}(S_{s_j} w) = \sum _{j=1}^r {\text {diag}}( d \vert { S_{s_j} w }\vert ^2) \\&= {\text {diag}}\left( d \sum _{j=1}^r \vert { S_{s_j} w }\vert ^2 \right) = {\text {diag}}(v). \end{aligned}$$

The matrix A is injective if and only if \(A^* A\) is invertible, and a diagonal matrix is invertible precisely when all of its diagonal entries are non-zero. Since \(v_\ell = d \sum _{j=1}^r \vert { (S_{s_j} w)_\ell }\vert ^2 \ge 0\), \(\ell \in [d]\), the injectivity of A is equivalent to \(v_\ell > 0\) for all \(\ell \in [d]\).

If A is the STFT matrix, then \(s_j = j\) for all \(j \in [d]\) and the entries of the vector v further simplify to

$$\begin{aligned} v_\ell = d \sum _{j=1}^d \vert { (S_{s_j} w)_\ell }\vert ^2 = d \sum _{j=1}^d \vert { w_{((\ell -1-j) { \text{ mod } } d) + 1} }\vert ^2, \quad \ell \in [d]. \end{aligned}$$

Changing the order of summation yields

$$\begin{aligned} v_\ell = d \sum _{j=1}^d \vert { w_{j} }\vert ^2 = d \Vert {w}\Vert _2^2, \end{aligned}$$

for all \(\ell \in [d]\), which concludes the proof. \(\square \)

4.2 Proof of Theorem 3

The proof of Theorem 3 is based on Wirtinger derivatives [48]. Let us recall some basic facts about Wirtinger derivatives following [49, 50]. A function \(f: \mathbb {C} \rightarrow \mathbb {C}\) can be viewed as a function of two real variables, the real and imaginary parts of the argument \(z = \alpha + i \beta \). The function f is said to be differentiable in the real sense if the partial derivatives with respect to \(\alpha \) and \(\beta \) exist.

Then, the Wirtinger derivatives are defined as

$$\begin{aligned} \frac{\partial f}{\partial z} := \frac{1}{2}\frac{\partial f}{\partial \alpha } - \frac{i}{2}\frac{\partial f}{\partial \beta }, \quad \frac{\partial f}{\partial \bar{z} } := \frac{1}{2}\frac{\partial f}{\partial \alpha } + \frac{i}{2}\frac{\partial f}{\partial \beta }, \end{aligned}$$

which is nothing but a change of the coordinate system to conjugate coordinates. In this sense, we treat the function f as a function of z and \(\bar{z}\) instead of \(\alpha \) and \(\beta \).

As an example consider \(f(z) = z = \alpha + i \beta \). Its Wirtinger derivatives are

$$\begin{aligned} \frac{\partial z}{\partial z} = \frac{1}{2}\frac{\partial (\alpha + i \beta )}{\partial \alpha } - \frac{i}{2}\frac{\partial (\alpha + i \beta )}{\partial \beta } = \frac{1}{2} - \frac{i^2}{2} = 1 \text { and } \frac{\partial z}{\partial \bar{z}} = 0, \end{aligned}$$

which implies that \(\bar{z}\) can be treated as a constant when the derivative with respect to z is computed and vice versa.
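
These definitions can be checked numerically; the small sketch below (our own, assuming NumPy) approximates the Wirtinger derivatives of \(f(z) = \vert z\vert ^2 = z \bar{z}\) by finite differences in \(\alpha \) and \(\beta \) and compares them with the exact values \(\partial f / \partial z = \bar{z}\) and \(\partial f / \partial \bar{z} = z\).

```python
import numpy as np

def wirtinger(f, z, h=1e-6):
    """Approximate (df/dz, df/dzbar) at z from the real partial derivatives."""
    df_da = (f(z + h) - f(z - h)) / (2 * h)            # derivative w.r.t. the real part alpha
    df_db = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)  # derivative w.r.t. the imaginary part beta
    return 0.5 * (df_da - 1j * df_db), 0.5 * (df_da + 1j * df_db)

z = 1.3 - 0.7j
dz, dzbar = wirtinger(lambda u: np.abs(u) ** 2, z)
assert np.allclose([dz, dzbar], [np.conj(z), z], atol=1e-5)
```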

Similarly to the real analysis of multivariate functions, Wirtinger derivatives extend to functions \(f: \mathbb {C}^{d} \rightarrow \mathbb {C}\); that is, for \(z \in \mathbb {C}^d\) they are given by

$$\begin{aligned} \frac{\partial f}{\partial z} = \left( \frac{\partial f}{\partial z_1}, \ldots , \frac{\partial f}{\partial z_d} \right) \quad \text {and} \quad \frac{\partial f}{\partial \bar{z} } = \left( \frac{\partial f}{\partial \bar{z}_1}, \ldots , \frac{\partial f}{\partial \bar{z}_d} \right) . \end{aligned}$$

The computation of Wirtinger derivatives is analogous to standard real analysis, as the arithmetic rules and the chain rule extend to the complex case. For Wirtinger derivatives it also holds that

$$\begin{aligned} \overline{\frac{\partial f}{\partial z} } = \frac{\partial \bar{f}}{\partial \bar{z}} \quad \text {and} \quad \overline{\frac{\partial f}{\partial \bar{z}} } = \frac{\partial \bar{f}}{\partial z}, \end{aligned}$$
(11)

for any differentiable function f.

Wirtinger derivatives are particularly useful for the optimization of real-valued functions of complex variables. Let \(f: \mathbb {C}^d \rightarrow \mathbb {R}\) be a differentiable real-valued function. Its differential can be expressed in terms of Wirtinger derivatives as

$$\begin{aligned} d f = \frac{\partial f}{\partial z} d z + \frac{\partial f}{\partial \bar{z}} d \bar{z}. \end{aligned}$$

Since f is real-valued, by (11), it holds that

$$\begin{aligned} \overline{\frac{\partial f}{\partial z} } = \frac{\partial f}{\partial \bar{z}}, \end{aligned}$$

and the differential simplifies to

$$\begin{aligned} df = 2 {\text {Re}}\left( \frac{\partial f}{\partial z} d z \right) . \end{aligned}$$

It is maximal when dz is a scaled version of \(\overline{\frac{\partial f}{\partial z}} = \frac{\partial f}{\partial \bar{z}}\) and, thus, \(\frac{\partial f}{\partial \bar{z}}\) gives the direction of the steepest ascent. Moreover, the critical points of f are those where the derivative with respect to \(\bar{z}\) vanishes. For this reason, the gradient of f is defined as

$$\begin{aligned} \nabla f := \left( \frac{\partial f}{\partial \bar{z}} \right) ^T = \left( \frac{\partial f}{\partial z} \right) ^*. \end{aligned}$$

In our analysis, we will also need the Wirtinger version of the second-order Taylor approximation theorem in integral form. That is, for all twice continuously differentiable functions \(f: \mathbb {C}^d \rightarrow \mathbb {R}\) and all \(z, v \in \mathbb {C}^d\) it holds that

$$\begin{aligned} f(z+v) = f(z) + \begin{bmatrix} \nabla f \\ \overline{\nabla f} \end{bmatrix}^* \begin{bmatrix} v \\ \bar{v} \end{bmatrix} + \begin{bmatrix} v \\ \bar{v} \end{bmatrix}^* \int _0^1 (1 - s) \nabla ^2 f(z + s v) d s \begin{bmatrix} v \\ \bar{v} \end{bmatrix}, \end{aligned}$$
(12)

where \(\nabla ^2 f\) denotes the Hessian matrix

$$\begin{aligned} \nabla ^2 f = \begin{bmatrix} \nabla ^2_{z,z} f &{} \nabla ^2_{\bar{z}, z} f \\ &{} \\ \overline{\nabla ^2_{\bar{z}, z} f } &{} \overline{ \nabla ^2_{z, z} f} \end{bmatrix}, \end{aligned}$$

and its components are given by

$$\begin{aligned} \nabla ^2_{z,z} f = \frac{\partial }{\partial z} \nabla f = \frac{\partial }{\partial z} \left( \frac{\partial f}{\partial z} \right) ^* \quad \text {and} \quad \nabla ^2_{\bar{z},z} f = \frac{\partial }{\partial \bar{z}} \nabla f = \frac{\partial }{\partial \bar{z}} \left( \frac{\partial f}{\partial z} \right) ^*. \end{aligned}$$

For further information on Wirtinger calculus, we refer the reader to [49, 50].

Let us go back to the amplitude-based objective (2). We rewrite it as

$$\begin{aligned} \mathcal {L}_2(z) = \sum _{k=1}^{m} \vert { \sqrt{ \vert { (A z)_k }\vert ^2 } - \sqrt{y_k^2} }\vert ^2. \end{aligned}$$
(13)

Since \(\sqrt{\cdot }\) is not differentiable at 0, \(\mathcal {L}_2\) is not differentiable on all of \(\mathbb {C}^{d}\). Hence, the gradient of \(\mathcal {L}_2\) is not properly defined at points z with \((Az)_k =0\) for some \(k \in [m]\). In order to overcome this issue, we consider the following smoothed version of (13),

$$\begin{aligned} \mathcal {L}_{2,\varepsilon }(z) := \sum _{k=1}^{m} \vert { \sqrt{ \vert { (A z)_k }\vert ^2 + \varepsilon } - \sqrt{y_k^2 + \varepsilon } }\vert ^2, \end{aligned}$$
(14)

where \(\varepsilon > 0\). The function \(\mathcal {L}_{2,\varepsilon }\) possesses some useful properties. Firstly, \(\mathcal {L}_{2,\varepsilon }\) is continuous in \(\varepsilon \) and we have

$$\begin{aligned} \mathcal {L}_2(z) = \lim _{\varepsilon \rightarrow 0+} \mathcal {L}_{2,\varepsilon }(z). \end{aligned}$$

Secondly, we can compute the gradient of \(\mathcal {L}_{2,\varepsilon }\) everywhere and properly define the generalized gradient of \(\mathcal {L}_2\) as the limit of gradients as parameter \(\varepsilon \) vanishes.

Lemma 7

The function \(\mathcal {L}_{2,\varepsilon }\) is continuously differentiable with the gradient given by

$$\begin{aligned} \nabla \mathcal {L}_{2,\varepsilon }(z) = A^* \left[ I - {\text {diag}}\left( \frac{\sqrt{y^2 + \varepsilon } }{\sqrt{ \vert {A z}\vert ^2 + \varepsilon } }\right) \right] A z, \quad z \in \mathbb {C}^d. \end{aligned}$$

Furthermore, the generalized gradient of \(\mathcal {L}_2\) is given by the pointwise limit

$$\begin{aligned} \nabla \mathcal {L}_2(z) := \lim _{\varepsilon \rightarrow 0+}\nabla \mathcal {L}_{2,\varepsilon }(z). \end{aligned}$$

Proof

Denote by \(a_k\) the conjugate of the k-th row of the matrix A, so that \((A z)_k = a_k^* z\). Then, a single summand of \(\mathcal {L}_{2,\varepsilon }\) is given by

$$\begin{aligned} f_k(z) := \vert { \sqrt{ z^T \bar{a}_k a_k^T \bar{z} + \varepsilon } - \sqrt{y_k^2 + \varepsilon } }\vert ^2, \quad k \in [m]. \end{aligned}$$

The gradient of \(f_k\) can be evaluated by the chain rule. We get

$$\begin{aligned} \nabla f_k(z)&= \left[ \frac{\partial f_k}{\partial \bar{z}} (z) \right] ^T = \left[ \frac{\partial \vert { \sqrt{ z^T \bar{a}_k a_k^T \bar{z} + \varepsilon } - \sqrt{y_k^2 + \varepsilon } }\vert ^2 }{\partial \sqrt{ z^T \bar{a}_k a_k^T \bar{z} + \varepsilon } - \sqrt{y_k^2 + \varepsilon } } \right. \\&\quad \quad \quad \quad \quad \quad \quad \quad \quad \cdot \left. \frac{\partial \sqrt{ z^T \bar{a}_k a_k^T \bar{z} + \varepsilon } - \sqrt{y_k^2 + \varepsilon } }{\partial z^T \bar{a}_k a_k^T \bar{z} + \varepsilon } \cdot \frac{\partial z^T \bar{a}_k a_k^T \bar{z} + \varepsilon }{\partial \bar{z}} \right] ^T \\&= 2\left( \sqrt{ z^T \bar{a}_k a_k^T \bar{z} + \varepsilon } - \sqrt{y_k^2 + \varepsilon }\right) \frac{1}{2\sqrt{ z^T \bar{a}_k a_k^T \bar{z} + \varepsilon }} \left[ z^T \bar{a}_k a_k^T \right] ^T \\&= \left( 1 - \frac{ \sqrt{y_k^2 + \varepsilon } }{\sqrt{ z^T \bar{a}_k a_k^T \bar{z} + \varepsilon }} \right) a_k a_k^* z = \left( 1 - \frac{ \sqrt{y_k^2 + \varepsilon } }{\sqrt{ \vert { (A z)_k}\vert ^2 + \varepsilon }} \right) (A z)_k a_k. \end{aligned}$$

Then, by the linearity of derivatives,

$$\begin{aligned} \nabla \mathcal {L}_{2,\varepsilon }(z)&= \sum _{k=1}^{m} \nabla f_k(z) = \sum _{k=1}^{m} \left( 1 - \frac{ \sqrt{y_k^2 + \varepsilon } }{\sqrt{ \vert { (A z)_k}\vert ^2 + \varepsilon }} \right) (A z)_k a_k \\&= A^* \left[ I - {\text {diag}}\left( \frac{\sqrt{y^2 + \varepsilon } }{\sqrt{ \vert {A z}\vert ^2 + \varepsilon } }\right) \right] A z. \end{aligned}$$

For the generalized gradient of \(\mathcal {L}_2\) we consider two cases. If \((Az)_k \ne 0\) for all \(k \in [m]\), then \(\sqrt{y^2_k + \varepsilon }/\sqrt{ \vert {(A z)_k}\vert ^2 + \varepsilon } \rightarrow y_k /\vert {(A z)_k}\vert \), \(\varepsilon \rightarrow 0+\), for all \(k \in [m]\). Note that in this case, \(\mathcal {L}_2\) is differentiable at z and its gradient coincides with the limit of \(\nabla \mathcal {L}_{2,\varepsilon }(z)\) as \(\varepsilon \rightarrow 0+\). On the other hand, if \((Az)_k = 0\) for some \(k \in [m]\), it holds that

$$\begin{aligned} \frac{\sqrt{y^2_k + \varepsilon } }{\sqrt{ \vert {(A z)_k}\vert ^2 + \varepsilon } } (A z)_k = \frac{\sqrt{y^2_k + \varepsilon } }{\sqrt{ 0 + \varepsilon } } \cdot 0 = 0 \rightarrow 0 = \frac{y_k }{\vert {(A z)_k}\vert } (A z)_k, \quad \varepsilon \rightarrow 0+, \end{aligned}$$

with ambiguity 0/0 resolved as 0. \(\square \)
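
The formula of Lemma 7 can be verified numerically; the sketch below (our own, assuming NumPy) compares the directional derivative of \(\mathcal {L}_{2,\varepsilon }\) along a direction v with \(2 {\text {Re}} \langle \nabla \mathcal {L}_{2,\varepsilon }(z), v \rangle \), in line with the expression for the differential given above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, eps = 20, 5, 1e-3
A = rng.standard_normal((m, d)) + 1j * rng.standard_normal((m, d))
x = rng.standard_normal(d) + 1j * rng.standard_normal(d)
y = np.abs(A @ x)

def loss(z):
    """Smoothed objective L_{2,eps} from (14)."""
    return np.sum((np.sqrt(np.abs(A @ z) ** 2 + eps) - np.sqrt(y ** 2 + eps)) ** 2)

def grad(z):
    """Gradient of L_{2,eps} as stated in Lemma 7."""
    u = A @ z
    return A.conj().T @ ((1 - np.sqrt(y ** 2 + eps) / np.sqrt(np.abs(u) ** 2 + eps)) * u)

z = rng.standard_normal(d) + 1j * rng.standard_normal(d)
v = rng.standard_normal(d) + 1j * rng.standard_normal(d)
t = 1e-6
numeric = (loss(z + t * v) - loss(z - t * v)) / (2 * t)   # d/dt L_{2,eps}(z + t v) at t = 0
analytic = 2 * np.real(np.vdot(grad(z), v))               # 2 Re <grad L_{2,eps}(z), v>
assert np.isclose(numeric, analytic, rtol=1e-4)
```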

The last property concerns the Hessian matrix of \(\mathcal {L}_{2,\varepsilon }\).

Lemma 8

[19] The function \(\mathcal {L}_{2,\varepsilon }\) is twice continuously differentiable and its Hessian matrix satisfies

$$\begin{aligned} \begin{bmatrix} v \\ \bar{v} \end{bmatrix}^* \nabla ^2 \mathcal {L}_{2, \varepsilon }(z) \begin{bmatrix} v \\ \bar{v} \end{bmatrix} \le 2 v^* A^* A v, \quad \text {for all } z,v \in \mathbb {C}^d, \varepsilon >0. \end{aligned}$$

Proof

See computations on pages 27-28 of [19]. \(\square \)

Remark 9

The convergence of gradient descent to a critical point is often studied [51] under the assumption that the function f is L-smooth with \(L \ge 0\). For twice continuously differentiable functions L-smoothness is equivalent to the inequality

$$\begin{aligned} - 2 L \Vert {v}\Vert _2^2 \le \begin{bmatrix} v \\ \bar{v} \end{bmatrix}^* \nabla ^2 f(z) \begin{bmatrix} v \\ \bar{v} \end{bmatrix} \le 2 L \Vert {v}\Vert _2^2, \quad \text {for all } z,v \in \mathbb {C}^d. \end{aligned}$$

In fact, the upper bound alone is sufficient to establish the convergence of gradient descent. In our case, a stronger version of it is given by Lemma 8.

Now, we are equipped for the proof of Theorem 3.

Proof of Theorem 3

In view of Lemma 2, let us consider the smoothed step of the ER algorithm

$$\begin{aligned} z^{+}_{\varepsilon } := z - (A^* A)^{-1}\nabla \mathcal {L}_{2,\varepsilon }(z). \end{aligned}$$

Note that \((A^* A)^{-1}\) exists due to injectivity of A.

We first show that a single step of the smoothed Error Reduction does not increase \(\mathcal {L}_{2,\varepsilon }\) and then take pointwise limits to obtain the desired result for \(\mathcal {L}_2\). To derive that the objective does not increase in each iteration step, we apply Taylor's theorem (12) with an arbitrary \(z \in \mathbb {C}^d\) and \(v = - (A^* A)^{-1}\nabla \mathcal {L}_{2,\varepsilon }(z)\). We note that, by Lemma 8, the integral in (12) is bounded as

$$\begin{aligned} \int _0^1 (1 - s) \begin{bmatrix} v \\ \bar{v} \end{bmatrix}^* \nabla ^2 \mathcal {L}_{2,\varepsilon } (z + s v) \begin{bmatrix} v \\ \bar{v} \end{bmatrix} d s \le 2 v^* A^* A v \int _0^1 (1 - s) d s = v^* A^* A v. \end{aligned}$$

Hence, by (12), we have

$$\begin{aligned} \mathcal {L}_{2,\varepsilon }(z^{+}_{\varepsilon })&\le \mathcal {L}_{2,\varepsilon }(z) - 2[ \nabla \mathcal {L}_{2,\varepsilon }(z)]^* (A^* A)^{-1}\nabla \mathcal {L}_{2,\varepsilon }(z) \\&\quad + [\nabla \mathcal {L}_{2,\varepsilon }(z)]^* ( (A^* A)^{-1} )^* (A^* A) (A^* A)^{-1} \nabla \mathcal {L}_{2,\varepsilon }(z) \\&= \mathcal {L}_{2,\varepsilon }(z) - [\nabla \mathcal {L}_{2,\varepsilon }(z)]^* (A^* A)^{-1}\nabla \mathcal {L}_{2,\varepsilon }(z), \end{aligned}$$

where we used that \(( (A^* A)^{-1} )^* = ( (A^* A)^* )^{-1} = (A^* A)^{-1}\). Lemma 7 gives

$$\begin{aligned} z^{+}_{\varepsilon } \rightarrow z^+ := z - (A^* A)^{-1}\nabla \mathcal {L}_{2}(z), \quad \varepsilon \rightarrow 0+, \end{aligned}$$

and, thus, taking the limit \(\varepsilon \rightarrow 0+\) yields

$$\begin{aligned} \mathcal {L}_{2}(z^{+}) \le \mathcal {L}_{2}(z) - [\nabla \mathcal {L}_{2}(z)]^* (A^* A)^{-1}\nabla \mathcal {L}_{2}(z). \end{aligned}$$

Selecting z as iterates \(z^t\) of ER, we obtain

$$\begin{aligned} \mathcal {L}_{2}(z^{t+1}) \le \mathcal {L}_{2}(z^t) - [\nabla \mathcal {L}_{2}(z^t)]^* (A^* A)^{-1}\nabla \mathcal {L}_{2}(z^t). \end{aligned}$$
(15)

Since A is injective, its singular value decomposition is given by \(A = U \Sigma V^*\) with an orthogonal \(U \in \mathbb {C}^{m \times d}\), a unitary \(V \in \mathbb {C}^{d \times d}\) and an invertible diagonal matrix \(\Sigma \in \mathbb {R}^{d \times d}\). Then,

$$\begin{aligned} (A^* A)^{-1} = (V \Sigma ^2 V^*)^{-1} = (V^*)^{-1} \Sigma ^{-2} V^{-1} = V \Sigma ^{-2} V^* = (V \Sigma ^{-1}) (V \Sigma ^{-1})^* \end{aligned}$$
(16)

is the singular value decomposition of \((A^* A)^{-1}\). From this representation we deduce that

$$\begin{aligned} [\nabla \mathcal {L}_{2}(z^t)]^* (A^* A)^{-1}\nabla \mathcal {L}_{2}(z^t) = \Vert {(V \Sigma ^{-1})^* \nabla \mathcal {L}_{2}(z^t) }\Vert _2^2 \ge 0. \end{aligned}$$

Thus, by (15),

$$\begin{aligned} \mathcal {L}_{2}(z^{t+1}) \le \mathcal {L}_{2}(z^t) - [\nabla \mathcal {L}_{2}(z^t)]^* (A^* A)^{-1}\nabla \mathcal {L}_{2}(z^t) \le \mathcal {L}_{2}(z^t), \end{aligned}$$

which shows the first statement of Theorem 3.

In order to prove the remaining statements of Theorem 3, we need to link the decay of the objective to the iterates. By Lemma 2, we have that

$$\begin{aligned} \Vert {z^{t+1} - z^t}\Vert _2^2 = \Vert { (A^* A)^{-1}\nabla \mathcal {L}_{2}(z^t) }\Vert _2^2 = [\nabla \mathcal {L}_{2}(z^t)]^* (A^* A)^{-1} (A^* A)^{-1}\nabla \mathcal {L}_{2}(z^t). \end{aligned}$$

Using (16) and the definition of the spectral norm, the squared distance between the iterates can be bounded as

$$\begin{aligned} \Vert {z^{t+1} - z^t}\Vert _2^2&= (\Sigma ^{-1} V^* \nabla \mathcal {L}_{2}(z^t))^* \Sigma ^{-2} (\Sigma ^{-1} V^* \nabla \mathcal {L}_{2}(z^t) ) \\&= \Vert {\Sigma ^{-1} (\Sigma ^{-1} V^* \nabla \mathcal {L}_{2}(z^t) )}\Vert _2^2 \le \Vert {\Sigma ^{-1}}\Vert ^2 \Vert {\Sigma ^{-1} V^* \nabla \mathcal {L}_{2}(z^t)}\Vert _2^2 \\&= \sigma _1^2(\Sigma ^{-1}) [\nabla \mathcal {L}_{2}(z^t)]^* V \Sigma ^{-1} \Sigma ^{-1} V^* \nabla \mathcal {L}_{2}(z^t) \\&= \sigma _d^{-2}(A) [\nabla \mathcal {L}_{2}(z^t)]^* (A^* A)^{-1} \nabla \mathcal {L}_{2}(z^t). \end{aligned}$$

Next, we sum up the norms for \(T \in \mathbb {N}\) iterations of ER and apply (15) to obtain

$$\begin{aligned} \sum _{t=0}^{T-1} \Vert {z^{t+1} - z^t}\Vert _2^2&\le \sigma _d^{-2}(A) \sum _{t=0}^{T-1} [\nabla \mathcal {L}_{2}(z^t)]^* (A^* A)^{-1} \nabla \mathcal {L}_{2}(z^t) \\&\le \sigma _d^{-2}(A) \sum _{t=0}^{T-1} \left[ \mathcal {L}_{2}(z^t) - \mathcal {L}_{2}(z^{t+1}) \right] \\&= \sigma _d^{-2}(A) \left[ \mathcal {L}_{2}(z^0) - \mathcal {L}_{2}(z^{T}) \right] \le \sigma _d^{-2}(A) \mathcal {L}_{2}(z^0), \end{aligned}$$

where in the last line we used that \(\mathcal {L}_2(z) \ge 0\) for all \(z \in \mathbb {C}^d\). This implies that the partial sums of the series \(\sum _{t=0}^{\infty } \Vert {z^{t+1} - z^t}\Vert _2^2\) are bounded and, thus, the series converges. Consequently, its summands converge to zero, that is, \(\Vert {z^{t+1} - z^t}\Vert _2^2 \rightarrow 0\) as \(t \rightarrow \infty \). Furthermore, we have

$$\begin{aligned} \min _{t = 0, \ldots , T-1} \Vert {z^{t+1} - z^{t}}\Vert _2^2 \le \frac{1}{T} \sum _{t=0}^{T-1} \Vert {z^{t+1} - z^t}\Vert _2^2 \le \frac{\mathcal {L}_{2}(z^0)}{\sigma _d^{2}(A) T}, \end{aligned}$$

which concludes the proof. \(\square \)

Remark 10

The proof of Theorem 1 follows the same logic. By Taylor's theorem (12) with \(v = -\mu _t \nabla \mathcal {L}_{2,\varepsilon }(z)\), an analogue of inequality (15) is established. Since the learning rate is a positive constant, it further implies that the norm of the gradient converges to zero at the desired speed, similarly to the proof of Theorem 3.

5 Conclusion

In this paper we established an interpretation of the Error Reduction algorithm as a scaled gradient method and derived its convergence rate. Furthermore, it was shown that in practical scenarios Error Reduction has the same computational complexity as the Amplitude Flow method, and that the two algorithms coincide in some cases.

In the future, we plan to extend our analysis to the Hybrid Input-Output method [9] and the extended Ptychographic Iterative Engine [52] used for the problem of blind ptychography.