On the rate of convergence of alternating minimization for non-smooth non-strongly convex optimization in Banach spaces

In this paper, the convergence of alternating minimization is established for non-smooth convex optimization in Banach spaces, and novel rates of convergence are provided. The objective function is composed of a smooth and a non-smooth part, with the latter being block-separable, e.g., corresponding to convex constraints or regularization. For the smooth part, three different relaxations of strong convexity are considered: (i) quasi-strong convexity; (ii) quadratic functional growth; and (iii) plain convexity. Linear convergence is established for the first two cases, generalizing and improving previous results for strongly convex problems; sublinear convergence is established for the third case, also improving previous results from the literature. All convergence results have in common that, in contrast to previous corresponding results for the general block coordinate descent, the performance of the alternating minimization is governed by properties of the single blocks instead of global properties. Not only does the better conditioned block determine the performance, as has been similarly observed in the literature; the worse conditioned block enhances the performance additionally, resulting in potentially significantly improved convergence rates. Furthermore, since only the convexity and smoothness properties of the problem are used, the results immediately apply in general Banach spaces.


Introduction
The (cyclic) block coordinate descent, also referred to in the literature as non-linear block Gauss-Seidel or successive subspace correction method, is a classical and fundamental optimization algorithm [11,4]. Given a minimization problem with a block structure, it consists of the successive (exact) minimization with respect to the single blocks. Since numerous applications naturally exhibit a block structure, block coordinate descent algorithms have been of great interest for decades, including variations such as random coordinate descent and successive inexact minimization based, e.g., on projected gradient descent or proximal point minimization. For an overview, we refer to the review paper [14].
The convergence of the block coordinate descent has been extensively studied under various convexity and smoothness properties of the objective function, typically in Euclidean spaces. For instance, convergence of the algorithm has been established for non-smooth strongly convex optimization by Auslender [1], and under various sets of convexity assumptions, e.g., quasiconvexity with respect to each block, by Grippo and Sciandrone [7,6]. Furthermore, Bertsekas [4] showed that any accumulation point of the sequence generated by the method is a stationary point if the successive minimization with respect to each block is well-defined. In the context of domain decomposition methods, Tai and Espedal [12] established a linear rate of convergence for the multiplicative Schwarz method applied to smooth strongly convex problems, which can ultimately be identified as the block coordinate descent method. Luo and Tseng [8] showed a linear rate of convergence for feasible descent methods under the existence of a local error bound of the objective function (generalizing strong convexity) and a proper separation of isocost surfaces; the block coordinate descent method satisfies the feasible descent property, e.g., for block coordinatewise strongly convex functions. Lately, Necoara et al. [10] identified the class of smooth convex functions satisfying a quadratic functional growth property as the largest class of functions for which feasible descent methods converge linearly, and provided linear convergence rates.
In this paper, we focus on the application of the block coordinate descent to the two-block structured model problem

$$\min_{(x_1,x_2) \in B_1 \times B_2} H(x_1,x_2) := f(x_1,x_2) + g_1(x_1) + g_2(x_2), \tag{1.1}$$

where $B_1$, $B_2$ are Banach spaces, $f$ is convex and smooth, and $g_1$, $g_2$ are non-smooth but convex, allowing, e.g., for block-separable constraints or non-smooth regularization (detailed properties are specified in section 2). In the case of just two blocks, the block coordinate descent is also often referred to as alternating minimization (cf. section 3). Since problems of type (1.1) are present in various applications, alternating minimization is widely used as a decoupling technique, in particular if it is much more convenient or feasible to solve the corresponding subproblems instead of the globally coupled problem. This has been exploited, for instance, in the context of the iteratively reweighted least squares method [2] or the block partitioned solution of coupled partial differential equations, cf., e.g., [5]. The presence of just two blocks allows for an improved convergence analysis of the alternating minimization compared to the general block coordinate descent. For (unconstrained) smooth strongly convex optimization, i.e., smooth and strongly convex $f$ and $g_1 \equiv g_2 \equiv 0$, linear convergence has been established by Beck and Tetruashvili [3], with the rate depending only on the convexity modulus and the minimum of the Lipschitz constants of the partial derivatives instead of a global one. Also, for a smooth (simply) convex $f$ and convex $g_1$, $g_2$, sublinear convergence has been proved by Beck [2]. Again, the multiplicative constant has been shown to depend only on the minimum of the Lipschitz constants of the partial derivatives instead of a global one. The aforementioned results are restricted to (finite-dimensional) Euclidean spaces equipped with the $l_2$ norm. The proofs essentially utilize knowledge on first-order gradient descent methods such as the (proximal) block coordinate descent.
To the best of our knowledge, these are the sharpest theoretical convergence results in the literature for alternating minimization.
In practice, strong convexity is a rather strong requirement, which is often not met. It is natural to ask whether linear convergence can also be achieved under more relaxed conditions. This is in particular motivated by the work of Necoara et al. [10], in which linear convergence has been established for the general block gradient descent under various relaxations of strong convexity, including quasi-strong convexity and quadratic functional growth. To the best of the author's knowledge, an improved analysis of the specific alternating minimization under such conditions has not yet been provided in the literature.
Considering Banach spaces in the convergence analysis of the alternating minimization may have several benefits. We mention two. First, by allowing smoothness and convexity properties of the problem to be described with respect to problem-specific norms, sharper theoretical convergence rates may be derived, even for problems in finite-dimensional Euclidean spaces. Second, as already mentioned, alternating minimization can be applied, for instance, to develop robust splitting schemes for coupled partial differential equations. Abstract convergence results for the alternating minimization holding in infinite-dimensional Banach spaces may then allow for analyzing the convergence of the resulting scheme independently of the discretization, cf., e.g., [5] in the context of the Biot equations describing flow in deformable porous media.
In this work, we aim at complementing previous results on the convergence of the fundamental alternating minimization and generalize those for non-strongly convex problems of type (1.1).
In this regard, our main contributions are convergence results for three different settings with a decreasing demand on the convexity properties of the problem:
• Linear convergence assuming a quasi-strongly convex f, generalizing and improving the results in [3] for unconstrained smooth strongly convex optimization in Euclidean spaces, cf. section 4. The final result is assessed numerically and thereby demonstrated to be sharp.
• Linear convergence for convex H with quadratic functional growth without the explicit need of feasible descent as, e.g., required by [10], cf. section 5.
• Sublinear convergence for convex f , generalizing and improving the results in [2] for convex optimization in Euclidean spaces, cf. section 6.
All results have in common that, in contrast to corresponding results for the general block coordinate descent, the performance of the alternating minimization benefits from properties of the separate single blocks instead of global properties. Furthermore, since the proofs solely utilize the basic definitions of convexity and smoothness properties as well as the definition of the alternating minimization, all results hold in Banach spaces. To the best of the author's knowledge, all results are novel; in particular, the case of quadratic functional growth without feasible descent has not yet been studied in the literature.

The two-block structured model problem
We consider the problem

$$\min_{(x_1,x_2) \in B_1 \times B_2} H(x_1,x_2) := f(x_1,x_2) + g_1(x_1) + g_2(x_2), \tag{2.1}$$

where $B_1$, $B_2$, $f$, $g_1$, $g_2$ satisfy the following properties:
(P1) The feasible sets $B_i$ are Banach spaces equipped with norms $\|\cdot\|_i$, $i = 1, 2$. Let $B_i^\star$ denote the dual space of $B_i$ equipped with the canonical dual norm $\|\cdot\|_{i,\star}$, and let $\langle\cdot,\cdot\rangle_i$ denote the duality product on $B_i^\star \times B_i$. If clear from the context, we omit specifying the index $i = 1, 2$ in duality pairings.
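(P2) The product space $B_1 \times B_2$ is equipped with a norm $\|\cdot\|$, and there exist $\beta_1, \beta_2 \ge 0$ satisfying for all $(x_1, x_2) \in B_1 \times B_2$

$$\|(x_1,x_2)\|^2 \ge \beta_1 \|x_1\|_1^2 \quad \text{and} \quad \|(x_1,x_2)\|^2 \ge \beta_2 \|x_2\|_2^2, \tag{2.2}$$

and the duality pairing $\langle\cdot,\cdot\rangle$ satisfying for all $d_i \in B_i^\star$ and [...] which are (Fréchet) subdifferentiable on their domains, denoted by $\mathrm{dom}\, g_1$ and $\mathrm{dom}\, g_2$, respectively. Let their (Fréchet) subdifferentials be denoted by $\partial g_1$ and $\partial g_2$.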
(P6) The optimal set of the problem (2.1), denoted by X ⊂ B 1 × B 2 is non-empty, and the corresponding optimal value is denoted by H ⋆ .
The model problem (2.1) satisfying the properties (P1)-(P6) covers a wide range of problems including, for instance, smooth convex optimization under block-separable convex constraints and non-smooth block-separable regularization. For examples, we refer to [2].
We remark that the use of an arbitrary norm $\|\cdot\|$ (instead of, e.g., the canonical norm on product spaces, $\|\cdot\|^2 = \|\cdot\|_1^2 + \|\cdot\|_2^2$) and the associated introduction of $\beta_1$ and $\beta_2$, cf. (P2), contribute significantly to the development of improved convergence rates of the alternating minimization in sections 4 to 6.

Alternating minimization
In the following, we focus on the iterative solution of problem (2.1) by the classical alternating minimization, as described in algorithm 1. In order to make the alternating minimization well-defined, we further assume on model problem (2.1):
(P7) For any $(\tilde{x}_1, \tilde{x}_2) \in \mathrm{dom}\, g_1 \times \mathrm{dom}\, g_2$, the following problems have minimizers:

$$\min_{x_1 \in B_1} H(x_1, \tilde{x}_2), \qquad \min_{x_2 \in B_2} H(\tilde{x}_1, x_2).$$

Algorithm 1 Alternating minimization
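We remark the partial optimality condition (3.1) on the initial guess, which corresponds to one half step of the alternating minimization. As in [2], this is chosen for convenience and for the sake of simpler notation in the subsequent analysis, in particular in lemma 3.1. For an overview on subdifferentials of convex functions in Banach spaces and optimality conditions, as used in the following lemma, we refer to [9].
Lemma 3.1. Let $\{(x_1^k, x_2^k)\}_{k \ge 0}$ denote the sequence generated by the alternating minimization, cf. algorithm 1. Then for all $k \ge 0$, it holds $x_1^{k+1} \in \mathrm{dom}\, g_1$, $x_2^k \in \mathrm{dom}\, g_2$, and

$$\langle \nabla_1 f(x_1^{k+1}, x_2^k), x_1 - x_1^{k+1} \rangle + g_1(x_1) - g_1(x_1^{k+1}) \ge 0 \quad \text{for all } x_1 \in B_1, \tag{3.4}$$

$$\langle \nabla_2 f(x_1^k, x_2^k), x_2 - x_2^k \rangle + g_2(x_2) - g_2(x_2^k) \ge 0 \quad \text{for all } x_2 \in B_2. \tag{3.5}$$

Proof. By construction, the optimality condition corresponding to the first step of the alternating minimization, defining $\{x_1^k\}_{k \ge 1}$, cf. eq. (3.2), reads $-\nabla_1 f(x_1^{k+1}, x_2^k) \in \partial g_1(x_1^{k+1})$, which by definition of a subdifferential is equivalent to eq. (3.4). Analogously, eq. (3.5) follows from the optimality of $x_2^k$.
Moreover, being identical with successive minimization, $H^k := H(x_1^k, x_2^k)$ satisfies the monotonicity principle

$$H^{k+1} \le H^{k+1/2} \le H^k \quad \text{for all } k \ge 0, \qquad \text{where } H^{k+1/2} := H(x_1^{k+1}, x_2^k).$$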
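As a concrete illustration of the iteration analyzed in this section, the following Python sketch applies alternating minimization to a smooth quadratic instance of problem (2.1) with $g_1 \equiv g_2 \equiv 0$. It is a minimal sketch under stated assumptions, not a verbatim transcription of algorithm 1: the quadratic objective, the function name `alternating_minimization`, and the test matrices are hypothetical choices made for illustration. The initial half step mirrors the partial optimality condition on the initial guess, and each iteration performs the two exact block minimizations.

```python
import numpy as np


def alternating_minimization(A, B, C, b1, b2, x1_init, num_iter=30):
    """Exact two-block minimization of the (assumed) quadratic objective
    H(x1, x2) = 0.5*x1'A x1 + x2'B x1 + 0.5*x2'C x2 - b1'x1 - b2'x2,
    with A (n x n) and C (m x m) symmetric positive definite, B (m x n)."""

    def H(x1, x2):
        return (0.5 * x1 @ A @ x1 + x2 @ B @ x1 + 0.5 * x2 @ C @ x2
                - b1 @ x1 - b2 @ x2)

    x1 = np.asarray(x1_init, dtype=float)
    # Half step on the initial guess: x2 minimizes H(x1, .).
    x2 = np.linalg.solve(C, b2 - B @ x1)
    values = [H(x1, x2)]
    for _ in range(num_iter):
        # First step: minimize over x1 with x2 fixed (A x1 = b1 - B'x2).
        x1 = np.linalg.solve(A, b1 - B.T @ x2)
        # Second step: minimize over x2 with x1 fixed (C x2 = b2 - B x1).
        x2 = np.linalg.solve(C, b2 - B @ x1)
        values.append(H(x1, x2))
    return x1, x2, np.array(values)


# Hypothetical data; the block matrix [[A, B'], [B, C]] is positive definite.
A = 2.0 * np.eye(3)
C = 2.0 * np.eye(2)
B = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
b1, b2 = np.ones(3), np.ones(2)

x1, x2, values = alternating_minimization(A, B, C, b1, b2, np.zeros(3))
print(values[:4])
```

The recorded objective values are non-increasing, in line with the monotonicity principle, and the iterates approach the solution of the coupled linear system.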

Quasi-strongly convex case
In this section, linear convergence is established for the alternating minimization applied to model problem (2.1) under the additional assumption of quasi-strong convexity for the smooth part of $H$:
(P8a) The function $f : B_1 \times B_2 \to \mathbb{R}$ is quasi-strongly convex with modulus $\sigma > 0$, i.e., for all $x = (x_1, x_2) \in \mathrm{dom}\, g_1 \times \mathrm{dom}\, g_2$ and $\bar{x} := \arg\min \{ \|x - y\| : y \in X \}$ being the orthogonal projection of $x$ onto $X$, it holds that

$$f(\bar{x}) \ge f(x) + \langle \nabla f(x), \bar{x} - x \rangle + \frac{\sigma}{2} \|x - \bar{x}\|^2.$$

By the convexity of $g_1$ and $g_2$, $H$ inherits quasi-strong convexity from $f$. It is interesting to mention that a quasi-strongly convex function need not even be convex; however, any strongly convex function is clearly quasi-strongly convex.
The following result generalizes and improves a convergence result in [3].
In the case of $\max\{L_1/\beta_1, L_2/\beta_2\} = \infty$, this has to be understood as [...]
Proof. We consider the first half-step of the alternating minimization and show [...]. Without loss of generality, we assume that $L_1/\beta_1 < \infty$; note that eq. (4.2) holds immediately for $L_1/\beta_1 = \infty$. In order to prove eq. (4.2), we first utilize the quasi-strong convexity of $f$, the definition of $\beta_1$, cf. eq. (2.2), a simple rescaling, the fact that $\sigma\beta_1/L_1 \in (0, 1]$, and the Lipschitz continuity of $\nabla_1 f$ [...]. Furthermore, eq. (2.2), eq. (2.5), and (P8a) imply that $\sigma\beta_1/L_1 \in (0, 1]$. Thus, by convexity of $g_1$, it holds that [...]. By reordering terms, we obtain [...]. By lemma 3.1, it holds that [...]. By combining eqs. (4.3) to (4.5), we obtain for the objective function [...]. By employing the optimality property of $x_1^{k+1}$, cf. eq. (3.2), we obtain [...]. Reordering terms finally yields eq. (4.2). By symmetry, it analogously follows that [...]. Hence, by combining eqs. (4.2) and (4.6), we obtain [...], and the assertion follows by induction.

Comparison to the literature
Linear convergence of the general block coordinate descent (for an arbitrary number of blocks) has been previously established for strongly and non-strongly convex optimization. In the following, we recall some results from the literature for a comparison with the convergence result in theorem 4.1, which is specific to alternating minimization. All results have been derived for problems stated in Euclidean spaces equipped with $l_2$ norms, such that $\beta_1 = \beta_2 = 1$ in eq. (2.2). For smooth strongly convex optimization subject to block-separable convex constraints, the general block coordinate descent for $N \ge 2$ blocks has previously been shown to converge q-linearly [8,13]. In particular, let $L$ denote the global Lipschitz constant of $\nabla f$. Then for all $k \ge 0$ it holds that [...]. Recently, linear convergence has also been established for smooth objective functions with quadratic functional growth (see also section 5). This includes strongly and quasi-strongly convex functions [10]. In particular, for all $k \ge 0$ it holds that [...]. In the context of domain decomposition methods for PDEs, linear convergence of a general multiplicative Schwarz method has been established [12], which includes alternating minimization. Considering unconstrained smooth strongly convex optimization with the objective function being three times differentiable, asymptotic linear convergence was proved with asymptotic rate $1 - \frac{\sigma^2}{\sigma^2 + 8L^2}$. In contrast to the results for the general block coordinate descent, in the special case of just two blocks, linear convergence with further improved convergence rates is guaranteed. For instance, in [3], linear convergence of the alternating minimization has been established for unconstrained smooth strongly convex optimization, i.e., the model problem (2.1) under the simplification $g_1 \equiv g_2 \equiv 0$. The final convergence result is identical with eq. (4.1). Thereby, convergence is ensured already if just one partial derivative is Lipschitz continuous.
In summary, the convergence result in theorem 4.1 provides three novel improvements compared to all mentioned results:
(i) The theoretical convergence rate has a multiplicative (i.e., squared) character if $\max\{L_1, L_2\} < \infty$.
(ii) Convergence is guaranteed for the general non-smooth quasi-strongly convex case with same rate as for the smooth strongly convex case.
(iii) The result holds in general Banach spaces.

Numerical example
In order to assess the sharpness of the convergence result in theorem 4.1, we present a simple numerical example. As a practical lower bound for the 'worst-case' theoretical convergence rate, we consider a problem within the particularly favorable class of unconstrained smooth strongly convex quadratic optimization in Euclidean spaces: [...] being symmetric positive definite. This problem clearly satisfies the assumptions of theorem 4.1, as smoothness and strong convexity are satisfied with respect to standard $l_2$ norms for the Euclidean spaces. In particular, it holds $\beta_1 = \beta_2 = 1$, $L_1 = \lambda_{\max}(A)$, $L_2 = \lambda_{\max}(C)$, and $\sigma = \lambda_{\min}(M)$, where $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$ denote the maximal and minimal eigenvalues, respectively. Ultimately, q-linear convergence is guaranteed such that for all $k \ge 0$ it holds [...].
By exploiting the generality of theorem 4.1 and choosing problem-dependent norms $\|\cdot\|_1$, $\|\cdot\|_2$, and $\|\cdot\|$, the theoretical convergence rate can be significantly improved. For this, let $\|\cdot\|_1 := \|\cdot\|_A$, $\|\cdot\|_2 := \|\cdot\|_C$, and $\|\cdot\| := \|\cdot\|_M$, where for instance $\|x_1\|_A^2 := x_1^\top A x_1$. Then $H$ is strongly convex with respect to $\|\cdot\|$ with modulus $\sigma = 1$, and the partial block derivatives $\nabla_1 H$ and $\nabla_2 H$ are Lipschitz continuous with Lipschitz constants $L_1 = L_2 = 1$. Furthermore, since $M$ is symmetric positive definite, so are $A$ and $C$, as well as the corresponding Schur complements $S_A := A - B^\top C^{-1} B$ and $S_C := C - B A^{-1} B^\top$, and it holds [...]. Thus, $\beta_1 = \lambda_{\min}(A^{-1} S_A)$ and $\beta_2 = \lambda_{\min}(C^{-1} S_C)$. Thereby, theorem 4.1 predicts q-linear convergence such that for all $k \ge 0$ it holds [...].
In the following, we verify the sharpness of the theoretical rate $\eta$, as predicted above, for a small example. Let $n = 3$, $m = 2$, and define [...]. For this choice, the theoretical rate as defined in eq. (4.7) is given by $\eta \approx 0.7222$. The performance of the alternating minimization, cf. algorithm 1, for the initial guess $x_1^0 := [0, 0, 0]^\top$ is visualized in fig. 1. In addition, the theoretically predicted convergence in eq. (4.7) is displayed as well. Ultimately, we observe a good agreement between the theoretical bound and the asymptotic practical convergence rate.
We conclude that theorem 4.1 allows for theoretically predicting sharp bounds of the practical convergence rate of the alternating minimization.
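The norm construction of this example can be reproduced numerically. The following Python sketch uses hypothetical matrices `A`, `B`, `C` (the concrete data of the example above is not reproduced here), computes $\beta_1$ and $\beta_2$ from the Schur complements, and compares the observed contraction of $H^k - H^\star$ with the product $(1-\beta_1)(1-\beta_2)$. That product form is an assumption: it is how we read the multiplicative rate of theorem 4.1 for $\sigma = L_1 = L_2 = 1$.

```python
import numpy as np

# Hypothetical symmetric positive definite blocks and coupling (n = 3, m = 2).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
C = np.array([[3.0, 1.0],
              [1.0, 2.0]])
B = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
M = np.block([[A, B.T], [B, C]])
b = np.arange(1.0, 6.0)  # arbitrary right-hand side
b1, b2 = b[:3], b[3:]

# Schur complements and the constants beta_i for the norms ||.||_A, ||.||_C, ||.||_M.
S_A = A - B.T @ np.linalg.solve(C, B)
S_C = C - B @ np.linalg.solve(A, B.T)
beta1 = np.linalg.eigvals(np.linalg.solve(A, S_A)).real.min()
beta2 = np.linalg.eigvals(np.linalg.solve(C, S_C)).real.min()
eta = (1.0 - beta1) * (1.0 - beta2)  # assumed form of the predicted rate


def H(x):
    return 0.5 * x @ M @ x - b @ x


H_star = H(np.linalg.solve(M, b))

# Alternating minimization with x1^0 = 0 and the initial half step for x2^0.
x1 = np.zeros(3)
x2 = np.linalg.solve(C, b2 - B @ x1)
gaps = [H(np.concatenate([x1, x2])) - H_star]
for _ in range(15):
    x1 = np.linalg.solve(A, b1 - B.T @ x2)
    x2 = np.linalg.solve(C, b2 - B @ x1)
    gaps.append(H(np.concatenate([x1, x2])) - H_star)
gaps = np.array(gaps)

# Observed per-iteration contraction factors (only while the gap is well resolved).
mask = gaps[:-1] > 1e-6
ratios = gaps[1:][mask] / gaps[:-1][mask]
print("beta1, beta2, eta:", beta1, beta2, eta)
print("largest observed ratio:", ratios.max())
```

Comparing the largest observed ratio with `eta` gives a quick check of how tight the predicted rate is for a given set of matrices.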

Convex case with quadratic functional growth
In this section, linear convergence is established for the alternating minimization applied to the model problem (2.1) with the objective function satisfying a quadratic functional growth property. We stress that, in contrast to the analysis of feasible descent methods, cf., e.g., [10], a feasible descent property is not explicitly required. Such a property would be ensured, e.g., for block coordinatewise strongly convex objective functions.
The property of quadratic functional growth reads:
(P8b) The objective function $H : B_1 \times B_2 \to \mathbb{R}$ has quadratic functional growth with modulus $\kappa > 0$ with respect to the optimal set $X$, i.e., for all $x = (x_1, x_2) \in \mathrm{dom}\, g_1 \times \mathrm{dom}\, g_2$ and $\bar{x} := \arg\min \{ \|x - y\| : y \in X \}$ being the orthogonal projection of $x$ onto $X$, it holds that

$$H(x) - H^\star \ge \frac{\kappa}{2} \|x - \bar{x}\|^2.$$

Quasi-strong convexity implies quadratic functional growth [10], but not vice versa; functions satisfying (P8b) need not even be convex [15]. The proof of the next result follows a similar strategy as the proof of theorem 4.1.
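To make the growth property tangible, the following Python sketch checks it numerically for a convex but not strongly convex function. It is an illustration under stated assumptions: the inequality is taken in the form $H(x) - H^\star \ge \frac{\kappa}{2}\|x - \bar{x}\|^2$ as in [10], the rank-deficient least-squares objective and all names (`K`, `d`, `project_onto_optimal_set`) are hypothetical, and $\kappa$ is chosen as the smallest nonzero squared singular value of `K`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rank-deficient least-squares objective H(x) = 0.5*||K x - d||^2.
K = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
d = np.array([1.0, 2.0])
K_pinv = np.linalg.pinv(K)


def H(x):
    r = K @ x - d
    return 0.5 * r @ r


def project_onto_optimal_set(x):
    # Euclidean projection onto X = {x : K'K x = K'd}; x - x_bar lies in range(K').
    return x - K_pinv @ (K @ x - d)


H_star = H(K_pinv @ d)
singular_values = np.linalg.svd(K, compute_uv=False)
kappa = singular_values[singular_values > 1e-12].min() ** 2

# Smallest slack of H(x) - H_star - kappa/2 * ||x - x_bar||^2 over random points.
slacks = []
for _ in range(1000):
    x = 5.0 * rng.normal(size=3)
    x_bar = project_onto_optimal_set(x)
    slacks.append(H(x) - H_star - 0.5 * kappa * np.sum((x - x_bar) ** 2))
min_slack = min(slacks)

# A zero Hessian eigenvalue confirms that H is not strongly convex.
hessian_min_eig = np.linalg.eigvalsh(K.T @ K).min()
print("kappa:", kappa, "min slack:", min_slack, "min Hessian eigenvalue:", hessian_min_eig)
```

A non-negative slack at all sampled points is consistent with quadratic functional growth, while the vanishing Hessian eigenvalue shows that strong convexity fails.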
In the case of $\max\{L_1/\beta_1, L_2/\beta_2\} = \infty$, this has to be understood as [...]
Proof. We consider the first half-step of the alternating minimization and show [...]. Without loss of generality, we assume that $L_1/\beta_1 < \infty$; note that eq. (5.1) holds immediately for $L_1/\beta_1 = \infty$. In order to prove eq. (5.1), we first utilize the convexity of $f$, yielding [...]. For $\gamma \in (0, 1]$ to be specified later, using the Lipschitz continuity of $\nabla_1 f$, cf. eq. (2.3), the convexity of $f$, and the definition of $\beta_1$, cf. eq. (2.2), we obtain [...]. By convexity of $g_1$, it holds that [...]. By lemma 3.1, it holds that [...]. By combining eqs. (5.2) to (5.5), we obtain for the objective function [...]. Thus, by utilizing the quadratic growth of $H$ and the optimality property of $x_1^{k+1}$ based on the first step of the alternating minimization, cf. eq. (3.2), it follows for all $\gamma \in (0, 1]$ that [...]. By (optimally) choosing $\gamma = \frac{\kappa\beta_1}{4L_1}$, we obtain [...], which finally yields eq. (5.1) after reordering terms. By symmetry, it analogously follows that [...]. Hence, by combining eq. (5.1) and eq. (5.7), we obtain [...], and the assertion follows by induction.

Plain convex case
In this section, sublinear convergence is established for the alternating minimization applied to model problem (2.1) under no additional convexity or growth assumptions besides plain convexity. A similar setting has been considered by Beck [2]. Here, we extend the result in the aforementioned work to Banach spaces, without the use of proximal mappings. As in [2], we however assume a compact level set with respect to the initial value:
(P8c) The functions $g_1 : B_1 \to \mathbb{R} \cup \{\infty\}$, $g_2 : B_2 \to \mathbb{R} \cup \{\infty\}$ are closed convex (and thereby $H$ is also closed convex). Furthermore, the level set of $H$ with respect to the initial guess is assumed to be compact, and we denote by $R$ the following diameter: [...]. By monotonicity of $\{H(x^k)\}_{k = 0, \frac{1}{2}, 1, \ldots}$, it in particular holds [...].
The following convergence result predicts a two-stage behavior: first, the error decreases q-linearly until it is sufficiently small; after that, sublinear convergence sets in. The transition essentially depends on the smoothness properties of the objective function.
where $\lceil\cdot\rceil$ denotes the ceiling function, and $[\cdot]_+$ denotes the positive part of values in $\mathbb{R}$, i.e., $[x]_+ := \max\{x, 0\}$. Then it holds for all $k \ge 0$ [...]. In particular, for all $k \ge m^\star$, it holds that [...].
The proof of theorem 6.1 utilizes two auxiliary results: general descent properties for each subiteration of the alternating minimization, and a criterion for concluding sublinear convergence. Those are summarized in the following two lemmas.
Lemma 6.2. Under the same assumptions as in theorem 6.1, it holds for all $k \ge 0$ [...] and [...].
Proof. We consider the first half step of the alternating minimization, assuming, without loss of generality, that $L_1/\beta_1 < \infty$; otherwise eq. (6.1) follows immediately. Following the proof of theorem 5.1, eq. (5.6) also holds under the stated assumptions of this lemma. We recall eq. (5.6): it holds for all $\gamma \in (0, 1]$ that [...]. Thus, by the definition of $R$ and $x^{k+1/2}$, cf. eq. (3.2), it follows [...]. In the following, we distinguish the two cases: [...]. In the first case, we choose $\gamma = 1$; in the second case, we choose $\gamma = \frac{\beta_1}{2 L_1 R^2} (H^k - H^\star)$. This results in eq. (6.1). The result eq. (6.2) analogously follows by symmetry.
The following auxiliary lemma is inspired by [3]. In contrast to the aforementioned work, the subsequent result allows for effectively making use of the energy descent of both substeps of the alternating minimization instead of just one.
Finally, we are able to prove theorem 6.1.
Proof of theorem 6.1. As long as $H^k - H^\star > 2 \min\{L_1/\beta_1, L_2/\beta_2\} R^2$ for some $k \in \mathbb{N}_0$, by lemma 6.2 and the monotonicity of $\{H^k\}_{k = 0, 1, \ldots}$, it holds that [...]. Thereby, there exists a minimal $m \ge 0$ such that $H^k - H^\star \le 2 \min\{L_1/\beta_1, L_2/\beta_2\} R^2$ for all $k \ge m$. Assuming $m \ge 1$, eq. (6.6) holds for all $k \le m - 1$. In particular, it holds [...]. Thus, it holds that [...]. Consequently, in general (including the case $m = 0$), it holds [...]. By the monotonicity of [...]. Hence, by lemma 6.2 it holds for all $k \ge m$ that [...]. We define the sequence $\{A_n\}_{n = 0, \frac{1}{2}, 1, \ldots}$ with $A_n := H^{n+m} - H^\star$. Then the assumptions of lemma 6.3 are fulfilled with [...]. Thus, lemma 6.3 yields for all $n \ge 0$ [...] and equivalently for all $k \ge m$ [...]. Combining eqs. (6.6) and (6.8) proves the assertion.
Remark 6.4. In the case that $\max\{L_1/\beta_1, L_2/\beta_2\} < \infty$ and the initial error satisfies $H^0 - H^\star > 2 \max\{L_1/\beta_1, L_2/\beta_2\} R^2$, the result of theorem 6.1 is in fact slightly pessimistic. The value of $m^\star$ can then be chosen significantly lower. By an analogous line of argumentation as in the above proof, one can conclude that $H^k - H^\star$ first contracts with a rate of $\frac{1}{4}$ for the first $k_1$ iterations, until $H^{k_1} - H^\star \le 2 \max\{L_1/\beta_1, L_2/\beta_2\} R^2$ for some $k_1 \in \mathbb{N}_0$. Afterwards, the convergence behavior can be qualitatively predicted as in theorem 6.1. Ultimately, $m^\star$ takes a lower value of the order [...]. However, for the sake of a cleaner presentation, we have refrained from an accurate, but rather theoretical, description of the convergence for the general case.
By only employing convexity and smoothness properties of the problem, our result also holds for general Banach spaces. Furthermore, focussing on the sublinear convergence, our result gains from the use of both L 1 and L 2 , resulting in a potentially slightly lower multiplicative constant.
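The sublinear regime of the plain convex case can be observed on a simple example. The following Python sketch is an illustration of our own choosing, not taken from the analysis above: it applies alternating minimization to $f(x_1, x_2) = \frac{1}{2}\|x_1 - x_2\|_2^2$ with $g_1$ and $g_2$ being the indicator functions of a disc and of a line tangent to it, so that each block minimization reduces to a Euclidean projection. All names (`CENTER`, `RADIUS`, `project_onto_disc`, `project_onto_line`) are hypothetical. Here $H^\star = 0$, and the ratio of consecutive objective values approaches one, which rules out a linear rate.

```python
import numpy as np

CENTER = np.array([0.0, 1.0])  # disc of radius 1 touching the line y = 0 at the origin
RADIUS = 1.0


def project_onto_disc(p):
    v = p - CENTER
    dist = np.linalg.norm(v)
    return p.copy() if dist <= RADIUS else CENTER + RADIUS * v / dist


def project_onto_line(p):
    return np.array([p[0], 0.0])


def H(x1, x2):
    # Smooth part only; the iterates stay feasible, so the indicators vanish.
    return 0.5 * np.sum((x1 - x2) ** 2)


x1 = np.array([1.0, 1.0])          # initial guess in the disc
x2 = project_onto_line(x1)         # initial half step
values, t = [H(x1, x2)], [x2[0]]
for _ in range(200):
    x1 = project_onto_disc(x2)     # first step: minimize over x1
    x2 = project_onto_line(x1)     # second step: minimize over x2
    values.append(H(x1, x2))
    t.append(x2[0])
values, t = np.array(values), np.array(t)

# H_star = 0 here; ratios close to 1 indicate the absence of a linear rate.
ratios = values[1:] / values[:-1]
print("H^k at k = 1, 10, 100:", values[[1, 10, 100]])
print("ratio H^{k+1}/H^k at k = 100:", ratios[100])
```

For this particular geometry, the first coordinate of $x_2^k$ follows the closed form $1/\sqrt{1+k}$, which makes the slow, non-geometric decay explicit.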

Smooth case in Euclidean spaces
The tools established in the previous section allow for the improvement of a convergence result by Beck and Tetruashvili [3] on the alternating minimization applied to the smooth model problem (2.1) in a Euclidean setting. In the following, we consider Euclidean spaces for $B_1$ and $B_2$, equipped with $l_2$ norms, i.e., it holds $\beta_1 = \beta_2 = 1$. Furthermore, let $g_1 \equiv g_2 \equiv 0$, ensuring smoothness of the model problem (2.1). For this setting, sublinear convergence has been established in [3], stating that it holds for all $k \ge 2$ [...]. In the following, we show that in fact for all $k \ge 0$ the improved result [...] holds. Thus, not only does the subproblem with the lower Lipschitz constant govern the performance of the alternating minimization, but the performance also benefits separately from both Lipschitz constants. For deriving eq. (6.9), we combine descent properties of the alternating minimization derived in [3] with the auxiliary lemma 6.3. The following descent properties are a byproduct of the proof of Theorem 5.2 in [3].
Lemma 6.6 (Descent properties of the alternating minimization for the smooth case [3]). Assume (P1)-(P7) and (P8c) are satisfied with $B_1$ and $B_2$ being Euclidean spaces, equipped with $l_2$ norms, and $g_1 \equiv g_2 \equiv 0$. Furthermore, let $\{x^k\}_{k \ge 0}$ be the sequence generated by the alternating minimization, cf. algorithm 1, and $H^k := H(x^k)$. Then it holds for all $k \ge 0$ [...].
Finally, by lemma 6.6, the assumptions of lemma 6.3 are satisfied for $A_k := H^k - H^\star$ and [...]. Thus, the sublinear convergence result (6.9) directly follows from lemma 6.3.

Conclusions
In this paper, we established convergence of the alternating minimization applied to a two-block structured model problem within the class of non-smooth non-strongly convex optimization in Banach spaces. We considered three cases of relaxed strong convexity: quasi-strong convexity, quadratic functional growth, and plain convexity. Convergence rates have been provided, of linear type for the first two cases and of sublinear type for the third case. In contrast to previous works on the convergence analysis of the alternating minimization, we have considered a fairly general setup. Ultimately, by allowing smoothness and convexity properties to be described with respect to different norms, improved convergence rates have been derived in comparison to corresponding results in the literature [3,2]. In particular, the linear convergence in the case of quadratic functional growth also holds without any feasible descent property as commonly required in the analysis of the general block coordinate descent [8,10]. Our results have several implications. For instance, applications of the results in [3,2] can be immediately improved, e.g., for the iteratively reweighted least squares method [2]. Also, the tools provided in this paper allow for a sharp problem-specific convergence analysis of iterative splitting schemes for coupled partial differential equations; this has been realized with similar but slightly simpler abstract convergence results in [5].
Within the proofs of this work, it was never used that $\|\cdot\|_1$, $\|\cdot\|_2$, and $\|\cdot\|$ are actually norms. The results could therefore be directly relaxed to measuring convexity and smoothness with respect to semi-norms or similar functionals. This may allow our results to be generalized even further. The same applies to the general block coordinate descent method.