On the convergence rate of the Halpern-iteration

In this work, we give a tight estimate of the rate of convergence of the Halpern-iteration for approximating a fixed point of a nonexpansive mapping in a Hilbert space. Specifically, using semidefinite programming and duality, we prove that the norm of the residuals is bounded above by the distance of the initial iterate to the closest fixed point, divided by the number of iterations plus one.


Introduction
Let H be a Hilbert space equipped with a symmetric inner product ⟨·, ·⟩ : H × H → ℝ, and let ‖x‖ := √⟨x, x⟩ denote the induced norm. Let T : H → H be a nonexpansive mapping and consider, for fixed x_0 ∈ H, the Halpern-Iteration from [5],

x_{k+1} := λ_k x_0 + (1 − λ_k) T(x_k), k ∈ ℕ_0, (1)

with λ_k := 1/(k + 2), for approximating a fixed point of T. Let Fix(T) := {x ∈ H : x = T(x)} denote the set of fixed points of T. It is well known that, if the set Fix(T) is nonempty, then the sequence {x_k}_{k∈ℕ_0} converges to the point x* ∈ Fix(T) minimizing the distance to x_0; see [17, Theorem 2], and [18] for generalizations of this remarkable property. As a consequence, the norm of the residuals x_k − T(x_k) tends to zero, i.e. lim_{k→∞} ‖x_k − T(x_k)‖ = 0. Our goal here is to quantify the rate of this convergence. A first result of this type was generated via proof mining in [10] in normed spaces (see also [8] and [9] for further details on results in more general spaces). Here, we improve the result for the setting of Hilbert spaces. Our proof technique is not based on proof mining but on semidefinite programming, and is strongly motivated by the recent work of Taylor et al. [15] on the worst-case performance of first-order minimization methods. Our methodology and focus here are, however, slightly different. We present two new proofs below. The first one is short and uses a parameter choice derived from the technique used in [15]. The second proof, based on semidefinite programming, is self-contained and adapts the framework of [15] to fixed point problems. This second approach can also be applied to other choices of the parameters λ_k and to other fixed point methods; the resulting rates are, however, in general not obvious. After the manuscript of this work was made public in late 2017, the author (see Section 4.1 and following of [12]) and, independently, other authors (see e.g. [2-4, 6, 16]) have studied similar setups, which may all be categorized as part of the performance estimation framework. Let us briefly sketch how our setup here and in [12] may be applied in the context of proximal point algorithms (e.g. the setup of [2, 6]) by defining the nonexpansive operator T := 2J − I, where I is the identity and J is a firmly nonexpansive operator (e.g. the resolvent operator of a maximal monotone operator). Because the fixed points of J and T are the same, one may now apply the Halpern-Iteration (1) to find a fixed point of J. The (tight) convergence rate is then implied by Theorem 2.1 below.

Felix Lieder, lieder@opt.uni-duesseldorf.de, Mathematisches Institut, Heinrich-Heine-Universität, 40225 Düsseldorf, Germany
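To make the proximal-point connection concrete, the following numerical sketch runs the Halpern-Iteration on the reflected resolvent T := 2J − I. The update formula x_{k+1} = λ_k x_0 + (1 − λ_k) T(x_k) with λ_k = 1/(k+2) is the form of (1) assumed here, and J is chosen, purely for illustration, as the Euclidean projection onto the unit ball (a firmly nonexpansive operator). The residual is then compared against the bound of Theorem 2.1.

```python
import numpy as np

def J(x):
    # Euclidean projection onto the closed unit ball:
    # a firmly nonexpansive operator (an illustrative choice of resolvent).
    n = np.linalg.norm(x)
    return x / n if n > 1 else x

def T(x):
    # Reflected resolvent T := 2J - I; nonexpansive, Fix(T) = Fix(J) = unit ball.
    return 2 * J(x) - x

def halpern(T, x0, iters):
    # Assumed form of iteration (1): x_{k+1} = lam_k * x0 + (1 - lam_k) * T(x_k),
    # with lam_k = 1/(k+2).
    x = np.array(x0, dtype=float)
    anchor = x.copy()
    for k in range(iters):
        lam = 1.0 / (k + 2)
        x = lam * anchor + (1 - lam) * T(x)
    return x

x0 = np.array([3.0, 4.0])               # ||x0|| = 5
x_star = x0 / np.linalg.norm(x0)        # closest fixed point of T to x0
k = 50
xk = halpern(T, x0, k)
residual = np.linalg.norm(xk - T(xk))
bound = 2 * np.linalg.norm(x0 - x_star) / (k + 1)   # Theorem 2.1
```

For this particular instance the residual not only respects the bound but essentially attains it at even k, in line with the tightness discussed below.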

Main result
Theorem 2.1 Let x_0 ∈ H be arbitrary but fixed. If T has fixed points, i.e. Fix(T) ≠ ∅, then the iterates defined in (1) satisfy

‖x_k − T(x_k)‖ ≤ (2 / (k + 1)) ‖x_0 − x*‖ for all k ∈ ℕ_0 and every x* ∈ Fix(T). (2)

This bound is tight.

Remark 2.1
A generalization of the Halpern-Iteration, the sequential averaging method (SAM), was analyzed in the recent paper [14], where for the first time a rate of convergence of order O(1/k) could be established for SAM. The rate of convergence in (2) is even slightly faster than the one established for the more general framework in [14] (by a factor of 4). More importantly, however, as shown by Example 3.1 below, the estimate (2) is actually tight, in the sense that for every k ∈ ℕ_0 there exist a Hilbert space H and a nonexpansive operator T with some fixed point x* such that the inequality (2) holds with equality. Estimate (2) refers to the step length λ_k := 1/(k + 2). The restriction to this choice is motivated by problem (17) below in the proof based on semidefinite programming; in numerical tests for small dimensions k, these coefficients provided a better worst-case complexity than any other choice of coefficients.
Next, an elementary direct proof of Theorem 2.1 is given.

Direct proof based on a weighted sum
The iteration (1) with λ_k = 1/(k + 2) implies for 1 ≤ j ≤ k the two relations (3). By nonexpansiveness the inequalities (4) and (5) hold. Below we reformulate the weighted sum (6) of the inequalities (5). Using the second relation in (3), the first terms in the summation (6) take the form (7), and using the first relation in (3) it follows that the second terms in (6) take the form (8). Observe, using again the second relation in (3), that the first term in (8) cancels the third term in (7). Summing up the second terms in (8) for j = 1, …, k, we shift the summation index, so that summing up the second terms in (7) and in (8) for j = 1, …, k results in (9). Shifting again the index in the summation of the third terms in (8), and summing up the first terms in (7) and the third terms in (8) for j = 1, …, k, gives (10) and (11), where the sum in the middle of (11) cancels the sum in the middle of (10), and the terms 2‖x_0 − T(x_0)‖² cancel as well. The only remaining terms are the first terms in (10) and (11). Thus, inserting (9), (10), and (11) into (6) leads to (12). Applying the Cauchy-Schwarz inequality to the second term in (12) leads to an estimate which may be interesting in its own right. To prove the theorem, (12) is divided by k + 1 and then (4) is added, which yields (13). To see the last equation in (13), the last two terms can be combined, after which a straightforward but tedious multiplication of the terms verifies the identity. Omitting the last term in (13), one obtains the desired estimate, which proves the theorem upon taking square roots on both sides.

The above proof is somewhat unintuitive, as the choice of the weights with which the inequalities (4) and (5) are added in (6) and (13) is far from obvious. In fact, we owe the suggestion of these weights to an extremely helpful anonymous referee, who extracted it from a more complex construction in [15], which was also the basis for the initial proof of this paper based on semidefinite programming.
We state this proof next, since it offers a generalizable approach for analyzing fixed point iterations; it can be modified, for example, to the KM iteration as in the recent thesis [12], though this modification is quite technical. The proof based on semidefinite programming also led to Example 3.1 below, showing that the rate of convergence is tight.

Proof based on semidefinite programming Let x* ∈ Fix(T). The Halpern-Iteration was stated in the form (1) to comply with the existing literature. For our proof, however, it is more convenient to consider the shifted sequence x̃_1 := x_0 and x̃_{k+1} := x_k for all k ∈ ℕ_{≠0} := {1, 2, 3, …}, and to show the shifted statement (14). Let us define g := (I − T)/2. It is well known that g is firmly nonexpansive; for the sake of completeness the argument is repeated here. Nonexpansiveness and the Cauchy-Schwarz inequality also imply ‖g(x) − g(y)‖ ≤ ‖x − y‖ for all x, y ∈ H. For k = 1 the statement (14) follows immediately, since g(x*) = 0, and this inductively leads to the claim for the shifted iterates.

Let us shorten the notation slightly and define g_j := g(x̃_j). Let b^T denote the transpose of b. Note that the Gramian matrix formed from x̃_1 − x*, g_1, …, g_k ∈ H is symmetric and positive semidefinite. We proceed by expressing the inequalities from firm nonexpansiveness in terms of this Gram matrix; since L often is of much lower dimension than H, this is sometimes referred to as the "kernel trick". Keeping in mind that we can rewrite the differences x̃_j − x̃_1 = −2 Σ_{l=1}^{j−1} (l/j) g_l for j ∈ {1, …, k}, we arrive at (15). Let e ∈ ℝ^k denote the vector of all ones. Then (16), where diag(·) denotes the diagonal of its matrix argument, holds true. The firm nonexpansiveness inequalities ‖g_i − g_j‖² ≤ ⟨g_i − g_j, x̃_i − x̃_j⟩ are equivalent to a component-wise matrix inequality, of which only (k² − k)/2 components are nonredundant. From g* := g(x*) = 0 we obtain another k inequalities, namely ‖g_i‖² ≤ ⟨g_i, x̃_i − x*⟩ for i ∈ {1, …, k}.

Defining U := I − L, relations (15) and (16) can be shortened slightly. Let e_k ∈ ℝ^k denote the k-th unit vector, let S^n := {X ∈ ℝ^{n×n} | X = X^T} denote the space of symmetric matrices, and let S^n_+ := {X ∈ S^n | x^T X x ≥ 0 for all x ∈ ℝ^n} denote the convex cone of positive semidefinite matrices. Consider the chain of inequalities (17) and (18), where Diag(·) denotes the square diagonal matrix with its vector argument on the diagonal. The first equality follows by construction, the first inequality from relaxing, and the second inequality from weak conic duality, as detailed in Sect. 5. We conclude the proof by showing feasibility of ξ̄ := 1/k² > 0 and of X̄ for the last optimization problem (18). First note that X̄ = X̄^T is symmetric and nonnegative. A short computation reveals the equality (22). The matrices U, X̄ and F(X̄) can now be expressed explicitly, and using (22) we compute U F(X̄) + F(X̄) U^T = 2 e_k e_k^T, as claimed above. Consequently the associated matrix is positive semidefinite and, as a result, ξ̄ and X̄ are feasible for (18). Hence the optimal value of (18) is bounded as required, which yields the desired result after taking the square root.
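The two elementary ingredients of the proof, firm nonexpansiveness of g and positive semidefiniteness of a Gramian, are easy to check numerically. The sketch below assumes g := (I − T)/2, consistent with the factor 2 in the difference formula above, and uses a random linear T rescaled to spectral norm one as an illustrative nonexpansive operator.

```python
import numpy as np

rng = np.random.default_rng(0)

# A nonexpansive linear map: a random matrix rescaled to spectral norm 1.
A = rng.standard_normal((4, 4))
A /= np.linalg.norm(A, 2)

def T(x):
    return A @ x

def g(x):
    # g := (I - T)/2, as in the proof above.
    return (x - T(x)) / 2

# Firm nonexpansiveness: ||g(x) - g(y)||^2 <= <g(x) - g(y), x - y>.
firm_ok = True
for _ in range(200):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    d = g(x) - g(y)
    firm_ok = firm_ok and (d @ d <= d @ (x - y) + 1e-12)

# Any Gramian V^T V (columns of V standing in for x~_1 - x*, g_1, ..., g_k)
# is symmetric positive semidefinite by construction.
V = rng.standard_normal((4, 5))
G = V.T @ V
gram_psd = bool(np.all(np.linalg.eigvalsh(G) >= -1e-10))
```

The first check reduces, for linear T, to ‖Au‖ ≤ ‖u‖, which is exactly the spectral-norm normalization; the second is the "kernel trick" observation that the Gram matrix inherits positive semidefiniteness regardless of the dimension of H.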

Remark 2.2
The matrix X̄ in the above proof, carrying the weights j(j + 1) used in (6), was obtained by solving (18) with YALMIP [11] in combination with the SDP solver SeDuMi [13] for small values of k. In order to provide a theoretical proof that the points ξ̄ and X̄ above are not only feasible but actually optimal for (18), and to prove tightness of the derived bound, we refer to Example 3.1 below, which was derived from a numerically obtained low-rank optimal solution of (17). More precisely, after numerically determining the optimal value of (18), a linear equation was added to (18) requiring that (Y_2)_{kk} equal this value, and then the trace of Y_{2,2} was minimized with the intention of finding the optimal solution of minimum rank. This optimal solution was then used to derive Example 3.1 below, proving the tightness of (2). In fact, for any optimal solution of the SDP relaxation (17), there exists at least one Lipschitz continuous operator T̃_k : ℝ^d → ℝ^d, for appropriately chosen d, with some fixed point x* such that the inequality in (18) is tight. This follows from appropriately labeling the columns of the symmetric square root of such an optimal solution and a Lipschitz extension argument; specifically, the Kirszbraun Theorem [7] allows a Lipschitz extension of an operator that is Lipschitz on a discrete set to the entire space. For further details we refer to Section 4.2 of [12].
Example 3.1
An operator proving tightness of (2) can then be set up as follows: Let T : ℝ → ℝ be defined via the piecewise linear formula given there, for some fixed k ∈ ℕ and R := ‖x_0 − x*‖, with x_0 ∈ ℝ and x* := 0. Note that T satisfies T(x*) = 0 = x* and is 1-Lipschitz continuous, i.e. nonexpansive, because it is piecewise linear, continuous, and its derivative is bounded in norm by one (|T'| ≤ 1) wherever it exists. We will now show that applying the Halpern-Iteration results in equality in (2) for the k-th iterate. This means that the bound (2) cannot be improved without making further assumptions, as the operator above would otherwise constitute a counterexample. For the first k iterates of the Halpern-Iteration (1) we obtain, for j ∈ {0, …, k}, inductively: for x_0 = 0 = x* there is nothing to prove. The case 0 < x_0 = ‖x_0 − x*‖ = R follows by using the definition of T and considering the iterates for j ∈ {0, …, k − 1}, which imply that equality holds. The case x_0 < 0 follows from the operator's point symmetry, i.e. T(−x) = −T(x). This completes the proof of tightness.
While Example 3.1 shows that the bound (2) is best possible for the Halpern-Iteration with λ_k = 1/(k + 2), the rate of convergence could be improved for this particular example if the values λ_k were chosen appropriately smaller than 1/(k + 2). Let us illustrate next that this is not true in general, i.e. that choosing smaller values of λ_k does not always provide faster convergence. Let H be a Hilbert space with a countable orthonormal Schauder basis {e_j}_{j∈ℕ} and let T be the linear operator defined by T(e_j) = e_{j+1} for j ∈ ℕ; hence the unique fixed point is x* = 0. When choosing x_0 = e_1, then for any choice of step lengths λ_j ∈ [0, 1] the k-th iterate always lies in the convex hull of e_1, …, e_{k+1}, and the choice of λ_j minimizing the error ‖x_k − x*‖ is precisely the step length λ_j = 1/(j + 2) for all 1 ≤ j ≤ k. This step length leads to a slightly faster rate of convergence than (2) for this example. While it does not minimize the residual (1/2)‖x_k − T(x_k)‖, it shows that smaller values of λ_j, such as λ_j := ρ/(j + 2) for all j with ρ ∈ [0, 1), lead to larger residuals. On the other hand, larger values λ_j > 1/(j + 2) for all j lead to larger residuals for Example 3.1.
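The shift-operator example is easy to reproduce numerically in a truncated space of dimension k + 2, which the first k iterates never leave. The sketch below assumes the update form x_{j+1} = λ_j x_0 + (1 − λ_j) T(x_j) and compares the residual for λ_j = 1/(j + 2) with the damped choice λ_j = ρ/(j + 2), ρ = 1/2.

```python
import numpy as np

def shift(x):
    # T(e_j) = e_{j+1}: a nonexpansive (isometric) linear map, Fix(T) = {0}.
    y = np.zeros_like(x)
    y[1:] = x[:-1]
    return y

def halpern_residual(rho, k):
    # Step lengths lam_j = rho/(j+2); rho = 1 recovers the choice in (1).
    x0 = np.zeros(k + 2)
    x0[0] = 1.0                        # x0 = e_1, x* = 0
    x = x0.copy()
    for j in range(k):
        lam = rho / (j + 2)
        x = lam * x0 + (1 - lam) * shift(x)
    return np.linalg.norm(x - shift(x))

k = 10
res_full = halpern_residual(1.0, k)    # lam_j = 1/(j+2)
res_damped = halpern_residual(0.5, k)  # lam_j = 1/(2(j+2))
```

Consistent with the discussion above, the damped step lengths produce a larger residual, while the undamped choice respects the bound of Theorem 2.1.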

Conclusions
We have derived a new and tight complexity bound for the Halpern-Iteration with coefficients chosen as λ_k = 1/(k + 2). The proof based on semidefinite programming can in principle be adapted to other choices of parameters and to other fixed point iterations, again leading to tight complexity bounds. For example, for the Krasnoselski-Mann (KM) iteration (see e.g. [1]) with some constant stepsize t ∈ [1/2, 1], a proof can be found in [12, Theorem 4.9], where a suitably defined operator is used to define inequalities of the form (15). However, while in practice the KM-Iteration with constant stepsize may often perform much better than the Halpern-Iteration, its worst-case complexity is an order of magnitude worse, and the convergence analysis based on semidefinite programming is considerably longer.
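The gap in worst-case behavior can be illustrated on the shift operator from Sect. 3 (an illustrative test instance here, not the formal worst case for KM): the Halpern residual obeys the O(1/k) bound of Theorem 2.1, while the KM residual with constant stepsize t = 1/2 decays markedly more slowly on this instance. The Halpern update form x_{j+1} = λ_j x_0 + (1 − λ_j) T(x_j) is assumed as before.

```python
import numpy as np

def shift(x):
    # Nonexpansive shift with unique fixed point 0 (cf. the example in Sect. 3).
    y = np.zeros_like(x)
    y[1:] = x[:-1]
    return y

k = 200
x0 = np.zeros(k + 2)
x0[0] = 1.0                            # x0 = e_1, x* = 0

# Halpern-Iteration with lam_j = 1/(j+2) (assumed form of (1)).
xh = x0.copy()
for j in range(k):
    lam = 1.0 / (j + 2)
    xh = lam * x0 + (1 - lam) * shift(xh)

# KM iteration with constant stepsize t = 1/2:
# x_{j+1} = (1 - t) x_j + t T(x_j).
xkm = x0.copy()
t = 0.5
for j in range(k):
    xkm = (1 - t) * xkm + t * shift(xkm)

res_halpern = np.linalg.norm(xh - shift(xh))
res_km = np.linalg.norm(xkm - shift(xkm))
```

On this instance the Halpern residual is below 2/(k + 1), whereas the KM residual is noticeably larger at the same iteration count.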

Appendix: a technical duality result
While the duality used in (17) and (18) is the well-known weak conic duality, the format of the problems (17) and (18) is quite intricate. Here, the dual problem is therefore derived explicitly in detail. Define the Euclidean space E := ℝ × S^k × S^{k+1} and the closed convex cone K := ℝ_+ × (S^k ∩ ℝ^{k×k}_+) × S^{k+1}_+. We denote by A · B := trace(AB) the trace inner product of symmetric matrices A, B. Then K is self-dual with respect to the canonical inner product ⟨X, Y⟩_E := x_1 y_1 + X_2 · Y_2 + X_3 · Y_3, which we define for any X = (x_1, X_2, X_3) ∈ E and Y = (y_1, Y_2, Y_3) ∈ E. We proceed by restating problems (17) and (18) in standard form and exploiting (weak) conic duality. In fact, by adding slack variables S ∈ K, we can write (17) in dual standard form (24). The constraint A*(Y) + S = C, S ∈ K, is a direct translation of the constraints in (17) except for the diagonal entries. Here it is used that the term diag(Y_2 U)e^T + e diag(U^T Y_2)^T − (Y_2 U + U^T Y_2) has an all-zero diagonal, which is why we can "place" the constraints diag(Y_2 U) − y_1 ≤ 0 on the diagonal. Recall that the optimal value of the conic optimization problem (24) in dual standard form is upper bounded by the optimal value of the corresponding conic optimization problem in primal standard form (27), where the linear operators A* : S^{n+1} → E and A : E → S^{n+1} are adjoint, i.e. A is the unique operator that satisfies ⟨A*(Y), X⟩_E = A(X) · Y for all Y ∈ S^{n+1} and X ∈ E. The "Y_2-part" of the adjoint of the second component of A* can be obtained from

(diag(Y_2 U)e^T + e diag(U^T Y_2)^T − (Y_2 U + U^T Y_2) + Diag(diag(Y_2 U))) · X_1 = Y_2 · (U F(X_1) + F(X_1) U^T),

where, as before, F(X) = Diag(Xe) + (1/2) Diag(diag(X)) − X, leading to the full description of A. This shows that we can rewrite (27) in the claimed form (18) by eliminating the (semidefinite) variable X_2, and it establishes the inequality between (17) and (18).