Asymptotic linear convergence of fully-corrective generalized conditional gradient methods

We propose a fully-corrective generalized conditional gradient method (FC-GCG) for the minimization of the sum of a smooth, convex loss function and a convex one-homogeneous regularizer over a Banach space. The algorithm relies on the mutual update of a finite set $\mathcal{A}_k$ of extremal points of the unit ball of the regularizer and of an iterate $u_k \in \operatorname{cone}(\mathcal{A}_k)$. Each iteration requires the solution of one linear problem to update $\mathcal{A}_k$ and of one finite dimensional convex minimization problem to update the iterate. Under standard hypotheses on the minimization problem we show that the algorithm converges sublinearly to a solution. Subsequently, imposing additional assumptions on the associated dual variables, this is improved to a linear rate of convergence. The proof of both results relies on two key observations: First, we prove the equivalence of the considered problem to the minimization of a lifted functional over a particular space of Radon measures using Choquet's theorem. Second, the FC-GCG algorithm is connected to a Primal-Dual-Active-point Method (PDAP) on the lifted problem for which we finally derive the desired convergence rates.


Introduction
This paper is concerned with the analysis of an efficient solution algorithm for minimization problems in composite form

inf_{u ∈ M} J(u), J(u) := F(Ku) + G(u), (P_M)

over a Banach space M. Here, the forward operator K maps continuously from M into a Hilbert space Y of observations, not necessarily finite dimensional, and F denotes a smooth convex loss function. The second part of the objective functional is constituted by a convex but possibly nonsmooth functional G which promotes desired structural properties. We refer, e.g., to the sparsifying property of total variation regularization or the staircasing effect of bounded variation penalties. The observation that certain structural features of minimizers can be brought forth by a suitable choice of the functional G and of the space M has made the analysis of problems of the form (P_M) a flourishing topic in the context of optimal control, inverse problems, compressed sensing and machine learning. As a consequence, the interest in such models has also sparked the demand for efficient solution algorithms. Since many of the structural features of (P_M) are tightly linked to properties of the underlying, possibly infinite dimensional Banach space, a particular focus in this context lies on function space methods, i.e., algorithms solving (P_M) without discretizing M. This is a challenging task for a variety of reasons: On the one hand, it requires algorithms that can handle the non-smoothness of the objective functional J. On the other hand, it forces us to work directly on the Banach space M, which usually lacks "nice" properties such as reflexivity or uniform convexity. Efficient algorithms have been developed, for instance, for inverse problems regularized with ℓ^p penalties [18,31], inverse problems in the space of measures regularized with the total variation [11,21,33] and dynamic inverse problems with optimal transport regularizers [16,17,14,15].
1.1. Contribution & related work. In many interesting applications it is meaningful to assume that M is given as the topological dual of a separable Banach space C, and K is the adjoint of a "predual" operator K* : Y → C, (K*)* = K. We refer to Section 4 for a few examples. In this case a simple approach to computing a minimizer to (P_M) is constituted by generalized conditional gradient (GCG) algorithms. Assuming that the solution set of (P_M) is bounded by some M > 0, this method updates the iterate u_k by computing the dual variable p_k = −K*∇F(Ku_k) and setting

u_{k+1} = u_k + s_k (v_k − u_k), where v_k maximizes ⟨p_k, ·⟩ over the sublevel set { G ≤ M },

and s_k ∈ (0, 1) is chosen according to some stepsize rule. If G(u) = I_A(u) is the indicator function of a compact convex set A ⊂ M, then the iteration scheme reduces to the Frank-Wolfe (FW) algorithm for constrained minimization [40,32,37].
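For illustration, this update can be sketched on a small finite-dimensional instance with F(y) = ½‖y − y_d‖² and G = ‖·‖₁, whose "unit ball" has the signed coordinate vectors as extremal points. All data below are invented for the example, and the open-loop stepsize s_k = 2/(k+2) is one common choice, not the paper's prescription:

```python
import numpy as np

# Toy instance of (P_M): F(y) = 0.5*||y - y_d||^2, G(u) = ||u||_1, K a matrix.
# Extremal points of the "unit ball" {G <= 1} are the signed coordinate vectors.
rng = np.random.default_rng(0)
K = rng.standard_normal((20, 50))
u_true = np.zeros(50)
u_true[[3, 17, 40]] = [1.5, -2.0, 1.0]
y_d = K @ u_true
M = 10.0                                  # a priori bound on the solution set

def J(u):
    return 0.5*np.linalg.norm(K @ u - y_d)**2 + np.linalg.norm(u, 1)

u = np.zeros(50)
for k in range(200):
    p = -K.T @ (K @ u - y_d)              # dual variable p_k = -K* grad F(K u_k)
    i = int(np.argmax(np.abs(p)))         # linear subproblem over Ext(B)
    v = np.zeros(50)
    if abs(p[i]) > 1.0:                   # otherwise v_k = 0 is already optimal
        v[i] = M * np.sign(p[i])
    s = 2.0/(k + 2.0)                     # open-loop stepsize rule
    u = u + s*(v - u)                     # GCG update u_{k+1} = u_k + s_k(v_k - u_k)
```

Each iteration touches exactly one extremal point, so after k steps the iterate is supported on at most k atoms, illustrating the k-sparsity discussed below.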
In the present paper we focus on convex positively homogeneous functionals G with compact sublevel sets, where compactness is intended with respect to the weak* topology induced by C. The compactness assumption is equivalent to G having weak* closed and norm bounded sublevels. In particular, these assumptions on G encompass the important case of norm regularization G(u) = ‖u‖_M, and also allow for seminorm penalties if the problem is posed in a suitable quotient space [13]. More generally, our setting covers the case of all gauge functions of the form

G(u) = inf { λ > 0 | u ∈ λA },

where A ⊂ M is a weak* compact and convex set. This type of functional is of particular relevance in machine learning applications [28,62]. In this general setting, GCG methods benefit from several desirable properties. For example, they only rely on the repeated solution of (partially) linearized problems that, in some interesting cases, can be solved analytically. Additionally, the descent direction can be chosen to satisfy v_k = M_k v̄_k, where M_k ≥ 0 is a scaling factor and v̄_k is an extremal point of the "unit ball" B = { u | G(u) ≤ 1 } of the regularizer. We denote by Ext(B) the set of such points. Thus, the above mentioned partially linearized problems can be solved, equivalently, in the set Ext(B), reducing considerably the complexity of each iteration of the algorithm. As a further consequence, initializing the algorithm by u_0 = 0 and selecting descent directions v_k that are extremal points of B, the iterate u_k exhibits k-sparsity, i.e., it is contained in the conic hull of at most k points in Ext(B). The connection between the structure enhancing properties of a functional G and the set of extremal points of B has been recently studied in [13] and [12]. In this context, a central result is given by convex representer theorems. Loosely speaking, these state that problems of the form (P_M) admit solutions ū which are contained in the conic hull of at most dim Y extremal points.
While the FW method [35,36,41,47], and its many variants such as away-step FW [43,49] or fully-corrective FW [44,58,59], have received a lot of attention, GCG algorithms for general non-smooth functionals G are less frequently studied. We refer, e.g., to [16,20,57,62] as well as [60, Chapter 6], which all prove a global sublinear O(1/k) rate of convergence of J(u_k) towards the minimum value. Note that this rate is known to be optimal [24]. Moreover, the absence of steps that remove extremal points from u_k often leads to clustering phenomena in practice. Thus, despite its various advantages, these shortcomings of GCG methods limit their practical utility.
The main contribution of the present work is the analysis of a fully-corrective generalized conditional gradient method (FC-GCG) for (P_M). This relies on the mutual update of a sequence of sparse iterates u_k and of a sequence of finite, ordered active sets of extremal points, which constitute the atoms representing the sparse iterate u_k. Given the current iterate u_k and active set A_k, the proposed method first enlarges A_k by setting N_k^+ = N_k + 1 and

A_k^{u,+} = A_k^u ∪ {v_k}, v_k ∈ argmax_{v ∈ Ext(B)} ⟨p_k, v⟩. (1.1)

Subsequently, the new iterate u_{k+1} is found by solving the subproblem

min_{u ∈ cone(A_k^{u,+})} F(Ku) + κ_{A_k^{u,+}}(u), (1.2)

where the minimization occurs over the cone spanned by A_k^{u,+}, and G is replaced by the gauge function associated with A_k^{u,+}, which in this case [10,28] simplifies to

κ_{A_k^{u,+}}(u) = min { Σ_{i=1}^{N_k^+} λ_i | λ_i ≥ 0, u = Σ_{i=1}^{N_k^+} λ_i v_i }.

Finally, A_k^{u,+} is pruned by removing the extremal points for which the weight is set to zero by (1.2), obtaining the next active set A_{k+1}^u. As we will see, the proposed method combines the advantages of GCG methods, e.g., its global convergence, with an improved convergence behavior and sparser iterates. In more detail, we first prove a global sublinear convergence rate O(1/k) for J(u_k) towards the minimum of (P_M), see Theorem 3.3. This is achieved under mild assumptions, see (A1)-(A3) discussed in Section 2. These are quite standard and are, in particular, sufficient for the well-posedness of (P_M), as shown in Proposition 2.3. Subsequently, in Theorem 3.8, we show that J(u_k) converges asymptotically at a linear rate of O(ζ^k), ζ ∈ (0, 1), provided that the optimal dual variable and the loss in (P_M) meet certain structural requirements, see Assumptions (B1)-(B5) in Section 3.3. Specifically, these assumptions imply that the minimizer to (P_M) is unique and sparse, i.e., of the form ū = Σ_{i=1}^N λ̄_i ū_i for a finite number of extremal points ū_i ∈ Ext(B) and coefficients λ̄_i > 0.
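On the same toy ℓ¹-regularized instance as above, the two steps (1.1)-(1.2) can be sketched as follows. The inner solver (a projected proximal gradient loop), the pruning tolerance, and all data are stand-ins chosen for brevity, not the paper's prescription:

```python
import numpy as np

rng = np.random.default_rng(0)
K = rng.standard_normal((20, 50))
u_true = np.zeros(50)
u_true[[3, 17, 40]] = [1.5, -2.0, 1.0]
y_d = K @ u_true

def J(u):
    return 0.5*np.linalg.norm(K @ u - y_d)**2 + np.linalg.norm(u, 1)

def assemble(A, lam):
    """Rebuild u = sum_i lam_i v_i from signed-coordinate atoms (index, sign)."""
    u = np.zeros(50)
    for (i, s), l in zip(A, lam):
        u[i] += s*l
    return u

A, lam = [], np.zeros(0)                  # active set and nonnegative weights
for k in range(30):
    u = assemble(A, lam)
    p = -K.T @ (K @ u - y_d)              # dual variable p_k
    i = int(np.argmax(np.abs(p)))
    if abs(p[i]) <= 1.0 and k >= 1:       # stopping: <p_k, v> <= 1 on Ext(B)
        break
    A.append((i, float(np.sign(p[i]))))   # step (1.1): insert new extremal point
    lam = np.append(lam, 0.0)
    # Step (1.2): minimize F(K sum lam_i v_i) + sum lam_i over lam >= 0,
    # solved here by projected proximal gradient, warm-started at previous lam.
    V = np.zeros((50, len(A)))
    for j, (idx, s) in enumerate(A):
        V[idx, j] = s
    KV = K @ V
    step = 1.0/np.linalg.norm(KV, 2)**2
    for _ in range(500):
        g = KV.T @ (KV @ lam - y_d)
        lam = np.maximum(lam - step*(g + 1.0), 0.0)
    keep = lam > 1e-12                    # prune atoms with zero weight
    A = [a for a, flag in zip(A, keep) if flag]
    lam = lam[keep]

u = assemble(A, lam)
```

In contrast to the plain GCG sketch, old atoms are re-weighted (and possibly dropped) at every iteration, which is what "fully corrective" refers to.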
More crucially, we also require the existence of a "distance function" g defined on Ext(B) such that the forward operator K and the optimal dual variable p̄ = −K*∇F(Kū) associated to ū satisfy, respectively, Lipschitz and quadratic growth conditions of the form

‖Ku − Kū_i‖_Y ≲ g(u, ū_i) and 1 − ⟨p̄, u⟩ ≳ g(u, ū_i)²

for all u ∈ Ext(B) in a vicinity of ū_i, see (1.3). As a first key ingredient of the proof, the problem (P_M) is lifted to a minimization problem (P_M+) posed over the space M^+(B) of positive Radon measures on B, where the forward operator K is replaced by a lifted model K : M^+(B) → Y satisfying Kμ = KI(μ) for all μ ∈ M^+(B). As a second ingredient, it turns out, see Theorem 5.4, that the two minimization problems are equivalent, in the sense that ū = I(μ̄) solves (P_M) if and only if μ̄ minimizes (P_M+). As a third key ingredient, this equivalence allows us to interpret the iteration scheme introduced in (1.1)-(1.2) as one step of an exchange algorithm or a Primal-Dual-Active-Point method (PDAP), cf. Algorithm 2, applied to (P_M+). These methods are fully-corrective variants of the generalized conditional gradient method for minimization problems over spaces of Borel measures introduced in [21], and their linear convergence has been recently proven in the finite-dimensional case, i.e., for measures supported on a compact subset of Euclidean space, see [39,52]. Combining these three key observations, and carefully extending the techniques and results of [52] to the present setting, we are able to conclude the asymptotic linear convergence of our algorithm. If we interpret (P_M) as a sparse dictionary learning problem, in which the dictionary is given by Ext(B), our FC-GCG method can also be linked to a fully-corrective greedy selection method [56], or an accelerated gradient boosting algorithm [61]. In this context, a similar lifting approach was implicitly used in [63] to derive a fully-corrective GCG method for matrix-regularization problems which eventually converges at a global linear rate, albeit under very restrictive conditions. In more detail, expressed in the notation of the present paper, the authors require the strong convexity of F ∘ K, as well as that Ext(B) is a finite set.
For a related result in settings with finitely many atoms we also point out [56]. In contrast, the present manuscript is geared towards potentially compact forward operators K, i.e., F ∘ K is not strongly convex, as well as regularizers whose sublevel sets admit infinitely many extremal points. These additional difficulties will be circumvented by a careful localization of A_k^{u,+} around the optimal extremal points, as well as by exploiting the assumed quadratic growth behavior in (1.3). While the main contribution of the present work is clearly given by the FC-GCG method, the lifting strategy of Section 5 may be of independent interest, and could also be applicable beyond the analysis of efficient solution algorithms. In this paper, we mainly focus on the theoretical aspects of the FC-GCG method and, in particular, its convergence. Naturally, from the practical standpoint, the applicability of our method relies on the knowledge of the set Ext(B), as well as on the efficient solution of the constrained linear problem in (1.1). Especially the computational cost of the latter varies greatly between different instances of Problem (P_M). To illustrate the workings of our method and the role of Assumptions (B1)-(B5), we present in Section 4 several examples of applications where our algorithm can be implemented. For the first three examples we provide a natural and easy-to-verify set of assumptions that imply (B1)-(B5), and for all of them we discuss the computational burden of computing a solution for the constrained linear problem in (1.1). We first consider the problem of identifying the initial source of a heat equation from given temperature measurements, see Section 4.1. For this example we also demonstrate numerically the expected linear convergence of Algorithm 1, discussing the stopping criterion and the advantages of our algorithm compared to classical GCG methods.
Then, in Section 4.2, we consider the trace regularization of linear operators from a Hilbert space into itself, which favours the reconstruction of rank-one operators. In Section 4.3, we discuss minimum effort problems, where the regularization enforced by the supremum norm favours binary solutions. Finally, in Section 4.4, we briefly deal with the optimal transport regularization of dynamic inverse problems [17], showing that the algorithm introduced in [16] is, in some instances, a particular case of Algorithm 1. For this example the verification of the hypotheses (B1)-(B5), necessary to ensure fast convergence, is non-trivial and is left to future work.

1.2. Outline. The paper is structured as follows. In Section 2 we formulate the minimization problem (P_M) we are interested in solving. We further make a basic set of Assumptions (A1)-(A3), under which we prove its well-posedness, see Proposition 2.3. In Section 3 we introduce the FC-GCG algorithm, discussing the well-posedness of its steps and providing a stopping criterion. In addition we introduce sufficient conditions (B1)-(B5) for fast convergence, and state the sublinear global and linear local convergence results in Theorems 3.3 and 3.8, respectively. These results, together with the assumptions, are discussed for specific examples in Section 4. We then pass to the convergence proofs for the FC-GCG algorithm. These are broken into three parts: First, in Section 5, we introduce a "lifting" of the minimization problem (P_M) to the space of Radon measures M^+(B) by virtue of Choquet's theorem. We then prove the equivalence of (P_M) to the auxiliary problem (P_M+), see Theorem 5.4. Second, we propose an extension of the PDAP algorithm to compute solutions to (P_M+) in Section 5.5. It turns out that PDAP and FC-GCG are equivalent given the correct interpretation, see Theorem 5.10. This equivalence is used extensively in Section 6 to carry out the proofs of the main convergence statements. Finally, the Appendix contains some auxiliary results, as well as proofs omitted in the main body of the paper.

The minimization problem
In this section we introduce the minimization problem we are concerned with solving, and prove its well-posedness under suitable assumptions. We start by establishing some notation. Throughout the paper C denotes a separable Banach space with norm ‖·‖_C and topological dual space M ≃ C*. We denote the duality pairing between p ∈ C and u ∈ M by ⟨p, u⟩. The space M is equipped with the canonical dual norm

‖u‖_M = sup { ⟨p, u⟩ | p ∈ C, ‖p‖_C ≤ 1 }.

Let G : M → [0, ∞] be a convex, weak* lower semi-continuous and positively one-homogeneous functional, that is G(λu) = λG(u) for all λ ≥ 0. Let K : M → Y be a linear weak*-to-weak continuous operator, mapping to a given Hilbert space Y, and F : Y → R be a convex mapping. The inner product and induced norm on Y will be denoted by (·, ·)_Y and ‖·‖_Y, respectively. Our interest lies in efficient solution algorithms for problems of the form (P_M), which we remind the reader is of the form inf_{u∈M} [F(Ku) + G(u)], under the following assumptions:

(A1) The map F : Y → R is strictly convex, continuously differentiable and bounded from below.
(A2) The sublevel sets S−(G, α) = { u ∈ M | G(u) ≤ α } are weak* compact for every α ≥ 0.
(A3) The operator K : M → Y is sequentially weak*-to-strong continuous.

Note that Assumption (A2) is equivalent to asking that the sublevel set S−(G, α) is closed and norm bounded for every α ≥ 0. The above conditions guarantee the existence of minimizers to (P_M).
Proof. The existence of a minimizer follows by the direct method of calculus of variations. Indeed, the sublevel sets of J are weak* compact thanks to Assumption (A1) and Assumption (A2). Moreover J is weak* lower semicontinuous, since G is weak* lower semicontinuous, K is weak*-to-weak continuous and F is convex and continuous (Assumption (A1)). In order to show (2.1), notice that the function Hence, see e.g. [60, Proposition 6.3],ū ∈ M is a solution to (P M ) if and only if This variational inequality holds if and only if which is equivalent to (2.1), thanks to the one-homogeneity of G. Last, the identity Kū 1 = Kū 2 for two solutionsū 1 ,ū 2 of (P M ) follows from the strict convexity of F .
For the remainder of the paper we refer to ȳ := Kū and p̄ := −K*∇F(Kū) as the (unique) optimal observation and dual variable for (P_M), respectively, where ū is any minimizer to (P_M). Set B = S−(G, 1). In the following we further require the notion of an extremal point of B.

A numerical minimization algorithm
This section concerns the development of an implementable and efficient solution algorithm for the minimization problem (P_M), namely, the fully-corrective generalized conditional gradient method (FC-GCG). As anticipated in the introduction, FC-GCG comprises the two basic steps at (1.1) and (1.2), which we now describe in more detail. For this purpose, recall that every u ∈ dom(G) can be approximated, in the weak* topology of M, by a sequence in cone(Ext(B)) thanks to the Krein-Milman theorem. The considered method exploits this observation by alternating between the update of a finite set A_k^u ⊂ Ext(B), the so-called active set, and of an iterate u_k ∈ cone(A_k^u). The insertion step (3.1) requires the minimization of a linear functional over the not necessarily compact set of extremal points. Remarkably, this problem is well-posed, as shown in Lemma A.1 in Appendix A.
Note that the gauge function

κ_{A_k^{u,+}}(u) = min { Σ_i λ_i | λ_i ≥ 0, u = Σ_i λ_i v_i, v_i ∈ A_k^{u,+} }

is well-defined on cone(A_k^{u,+}), as the minimum is achieved due to the lower semicontinuity of the ℓ¹ norm and the closedness of R_+. Subsequently, the next iterate u_{k+1} is found by solving the subproblem (3.2), where the search for the minimizer is restricted to the cone spanned by A_k^{u,+} and the possibly complicated regularizer G is replaced by the easier-to-handle gauge function of the finite set A_k^{u,+}. We point out that the objective functional in (3.2) constitutes an upper bound on J, i.e., there holds F(Ku) + κ_{A_k^{u,+}}(u) ≥ J(u) on cone(A_k^{u,+}), due to the convexity and one-homogeneity of G. Apart from that, Problems (P_M) and (3.2) share the same basic structure of minimizing the sum of a smooth fidelity term and a gauge-like regularizer.
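The upper-bound property follows from the sublinearity of G; a one-line derivation, written here for a generic finite set A = {v_1, …, v_N} ⊂ Ext(B):

```latex
% Sketch: the gauge of a finite set A dominates G on cone(A).
% For u = \sum_i \lambda_i v_i with \lambda_i \ge 0 and v_i \in A \subset B:
G(u) = G\Big(\sum_{i=1}^N \lambda_i v_i\Big)
     \le \sum_{i=1}^N \lambda_i\, G(v_i)   % convexity + one-homogeneity
     \le \sum_{i=1}^N \lambda_i,           % since G(v_i) \le 1 on B
\quad\text{hence}\quad
G(u) \le \kappa_A(u) := \min\Big\{ \textstyle\sum_i \lambda_i \;\Big|\;
      \lambda_i \ge 0,\; u = \textstyle\sum_i \lambda_i v_i \Big\}.
```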
In particular, the existence result and the necessary first-order conditions in Proposition 2.3 also apply to (3.2). However, as shown in Proposition A.2 in Appendix A, the big advantage of (3.2) is that a solution can be computed as

u_{k+1} = Σ_{i=1}^{N_k^+} λ_{k+1,i} v_i, (3.3)

where λ_{k+1} ∈ R_+^{N_k^+} solves the following finite dimensional optimization problem:

min_{λ ∈ R_+^{N_k^+}} F( K Σ_{i=1}^{N_k^+} λ_i v_i ) + Σ_{i=1}^{N_k^+} λ_i. (3.4)

Thus, in practice, we first determine a minimizer λ_{k+1} ∈ R_+^{N_k^+} to (3.4). We remark that (3.4) constitutes a finite dimensional non-smooth convex minimization problem, which can be efficiently solved by proximal methods or generalized Newton algorithms provided that F is sufficiently smooth. Once this is accomplished, the new iterate u_{k+1} is defined according to (3.3). As a final step, we truncate the active set A_k^{u,+} by removing all extremal points that were assigned a zero weight by the optimization procedure (3.4), i.e., we set A_{k+1}^u = { v_i ∈ A_k^{u,+} | λ_{k+1,i} > 0, i = 1, . . . , N_k^+ } and increment k by one. The method is summarized in Algorithm 1 below.
if ⟨p_k, v_k^u⟩ ≤ 1 and k ≥ 1 then
3. Terminate with ū = u_k, a minimizer to (P_M).
This is justified by the fact that, in this case, the current iterate u k is a minimizer to (P M ), as shown in Proposition 3.1 below. In particular, in this situation, the algorithm converges in a finite number of iterations.
The second condition in (3.6) is satisfied by assumption, so we are only left to check the first. If u_k = 0 this is trivial, since G(0) = 0. Therefore assume u_k ≠ 0. This implies G(u_k) ≠ 0. Indeed, if by contradiction G(u_k) = 0, then by the one-homogeneity of G we would obtain λu_k ∈ S−(G, 0) for all λ ≥ 0. But then S−(G, 0) would not be norm bounded, contradicting (A2). Thus G(u_k) ≠ 0. Since u_k/G(u_k) ∈ B, from the second condition in (3.6) we get ⟨p_k, u_k⟩ ≤ G(u_k). Note that since λ_k is a minimizer to the above problem, we also obtain the converse inequality, ending the proof. Remark 3.2. Note that Algorithm 1 terminates also in case v_k^u ∈ A_k^u for some k ≥ 1, i.e., if the algorithm cannot find a new point to insert. Indeed, in this situation we have ⟨p_k, v_k^u⟩ ≤ 1 by the optimality conditions at (3.8). Therefore the stopping condition (3.5) is satisfied.

3.2.
Worst-case convergence rates. The main contribution of the present manuscript is the derivation of convergence results for the sequence of residuals r_J(u_k) := J(u_k) − min_{u∈M} J(u), see (3.9), associated to the iterates generated by Algorithm 1. This is a challenging task for a variety of reasons. For example, the space M is, in general, not reflexive. Moreover, the functional J lacks useful properties such as smoothness or strict convexity. Classical approaches [36,35,5] provide sublinear convergence rates for GCG methods defined in Banach spaces; however, very few linear convergence results are available in the literature, and these results are usually limited to specific cases. In this section, we state the convergence results anticipated in the introduction, and we detail the additional assumptions which are needed to prove them. Their proofs, which are rather technical, are postponed to Sections 6.1 and 6.2. We start with the following sublinear convergence result, which holds under the basic Assumptions (A1)-(A3).

where r_J is defined at (3.9). Moreover, in this case, the sequence {u_k}_k admits at least one weak* accumulation point, and each such point is a solution to (P_M). If the solution ū to (P_M) is unique, then u_k ⇀* ū in M for the whole sequence.

3.3. Fast convergence. We next detail the additional structural assumptions under which linear convergence can be shown. First, we require that the set of extremal points at which the dual constraint is active, { v ∈ Ext(B) | ⟨p̄, v⟩ = 1 }, consists of a finite number of extremal points ū_1, . . . , ū_N ∈ Ext(B), where p̄ denotes the unique dual variable of (P_M), see (2.2). Moreover, we ask that the restriction of the operator K to the span of Ā := {ū_i}_{i=1}^N is injective. Such assumptions ensure the uniqueness of the minimizer to (P_M), denoted by ū, see Proposition 3.5 below. Additionally, we ask that F is strongly convex around the unique optimal observation ȳ, see (2.2). This set of assumptions is summarized below.

Assumption 3.4. (Uniqueness and strong convexity)
(B1) The map F : Y → R is strictly convex and strongly convex around the unique optimal observation ȳ, i.e., there exist a neighborhood N(ȳ) of ȳ and θ > 0 such that

(∇F(y_1) − ∇F(y_2), y_1 − y_2)_Y ≥ θ ‖y_1 − y_2‖²_Y for all y_1, y_2 ∈ N(ȳ).

(B2) There are N ∈ N and a finite collection of extremal points Ā := {ū_i}_{i=1}^N ⊂ Ext(B) such that { v ∈ Ext(B) | ⟨p̄, v⟩ = 1 } = Ā and the set {Kū_i}_{i=1}^N ⊂ Y is linearly independent.

We now check that the above assumptions imply uniqueness of solutions to (P_M).
In the next set of assumptions we suppose strict complementarity for the minimizer ū, i.e., ū ∉ cone(Ā \ {ū_i}) for every i = 1, . . . , N, or, equivalently, λ̄_i > 0 for all i = 1, . . . , N. The final assumption concerns the existence of a "distance function" g such that K is Lipschitz continuous and the linear functional u ↦ ⟨p̄, u⟩ grows quadratically, both with respect to g and in the vicinity of ū_i ∈ Ā. Of course, the particular form of g depends on the space M and the functional G, and thus it has to be constructed on a case-by-case basis. We give an example in Section 4.1. This set of assumptions is summarized below; the continuity statement in (3.13) holds due to Assumption (B2) and the d_B-continuity of u ↦ ⟨p̄, u⟩.
We can finally state the main convergence result of the paper. For its proof we refer to Section 6.2. Some remarks are in order. First, the injectivity requirement in (B2) cannot be dropped: if ū = Σ_{i=1}^N λ̄_i ū_i with λ̄_i > 0 and the vectors Kū_i are linearly dependent, then there exists a second solution ũ ≠ ū with ũ = Σ_{i=1}^N λ̃_i ū_i, λ̃_i ≥ 0, and equality holds for at least one index, cf. also [52, Section 3.2]. Second, it is worthwhile to further discuss the quadratic growth condition in Assumption (B5) and relate it to more well-known concepts in the literature. For this purpose, recall that the necessary and sufficient optimality conditions for (P_M) are given by the variational subgradient inequality ⟨p̄, u − ū⟩ + G(ū) ≤ G(u) for all u ∈ M. Due to the one-homogeneity of G, this can be equivalently reformulated as ⟨p̄, ū⟩ = G(ū) together with ⟨p̄, u⟩ ≤ G(u) for all u ∈ M, see Proposition 2.3. By applying Lemma A.1, this is equivalent to ⟨p̄, ū⟩ = G(ū) together with sup_{v ∈ Ext(B)} ⟨p̄, v⟩ ≤ 1. Finally, due to Assumption (B2), we arrive at the observation that the maximum of ⟨p̄, ·⟩ over Ext(B) equals one and is attained exactly on Ā. In particular, this implies that p̄ ∈ ∂I_{Ext(B)}(ū_i), where ∂I_{Ext(B)}(ū_i) denotes the subdifferential of the nonconvex indicator function of Ext(B) at ū_i. In this context, Assumption (B5) implies the locally strengthened condition 1 − ⟨p̄, u⟩ ≥ θ g(u, ū_i)² for u ∈ Ext(B) near ū_i. This is very reminiscent of the concept of strongly metric subregular subdifferentials in convex optimization, which plays a vital role in the derivation of fast convergence rates for proximal point methods [4] and "vanilla" generalized conditional gradient methods [48]. We point out that, in the general case, we are not aware of possibly stronger, but more intuitive, structural assumptions on p̄ as well as Ext(B) which eventually ensure Assumptions (B2) and (B5). However, for particular instances such a characterization is indeed possible, see Section 4 and the examples and references therein.
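The chain of reformulations of the optimality conditions discussed above can be summarized compactly; the following is a summary sketch in the paper's notation, with the intermediate display numbers omitted:

```latex
% Optimality conditions for (P_M), written as a chain of equivalences.
% Here \bar p = -K^* \nabla F(K\bar u) and B = \{ u \mid G(u) \le 1 \}.
\bar u \in \operatorname*{argmin}_{u \in \mathcal{M}} F(Ku) + G(u)
\;\iff\; \langle \bar p, u - \bar u\rangle + G(\bar u) \le G(u)
         \quad \forall u \in \mathcal{M}
\;\iff\; \Big[\, \langle \bar p, \bar u\rangle = G(\bar u)
         \ \text{and}\ \langle \bar p, u\rangle \le G(u)\ \forall u \,\Big]
\;\iff\; \Big[\, \langle \bar p, \bar u\rangle = G(\bar u)
         \ \text{and}\ \sup_{v \in \mathrm{Ext}(B)}
         \langle \bar p, v\rangle \le 1 \,\Big].
```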

Examples
Summarizing the previous sections, we see that the practical application and the convergence analysis of Algorithm 1 rest on three pillars: first, a characterization of the set of extremal points Ext(B); second, an efficient method for Step 2 in Algorithm 1; and, third, the derivation of sufficient structural assumptions to ensure Assumptions (B1)-(B5). In this section, we outline this program for three examples, namely, sparse initial value identification, trace regularization, and minimum effort problems, see Sections 4.1, 4.2, 4.3, respectively. In all these cases, a particular focus is put on the computation of v_k^u and the verification of Assumption (B5). For the sake of brevity, we strike a balance between practically interesting settings and problems for which the characterization of Ext(B) and the derivation of Assumption (B5) can be done in a concise manner. As a main take-away message, these examples suggest that there is no general "recipe" for the resolution of Step 2 in Algorithm 1. Quite the reverse, the method of choice for computing v_k^u, as well as the associated computational burden, strongly depends on the example at hand. We also stress that Algorithm 1 is applicable to far more complex problems, in which characterizing extremal points and deriving quadratic growth conditions can become much more involved. One such problem, namely the optimal transport regularization of dynamic inverse problems, is briefly discussed in Section 4.4, and will be the subject of a follow-up work. Other examples are given by the works [25,45], in which the authors apply the program outlined above to regularizers given by certain infimal convolutions.

Sparse source identification.
Let us consider the inverse problem of identifying the initial source of a heat equation on a convex polygonal spatial domain Ω ⊂ R² from distributed temperature measurements y_d at a given final time T > 0. Our particular interest lies in the recovery of sparse sources given as a linear combination of finitely many point measures, where the coefficients λ_i^† ∈ R, the positions x_i^† ∈ Ω, and the number N ∈ N of points are all assumed to be unknown. Taking the ill-posedness of the described inverse problem into account, we follow [21,50] and consider the convex Tikhonov-regularized problem

min_{u ∈ M(Ω), y} (1/2) ‖y(T) − y_d‖²_{L²(Ω)} + β ‖u‖_{M(Ω)}. (4.1)

Here M(Ω) denotes the space of Borel measures on the open set Ω, y is a scalar function defined on [0, T] × Ω with y(t) := y(t)(·), y_d ∈ L²(Ω) is a given desired state, and the pair (y, u) satisfies, in the sense of distributions, the heat equation

∂_t y − Δy = 0 in (0, T) × Ω, y = 0 on (0, T) × ∂Ω, y(0) = u. (4.2)

The a priori assumption on the sparsity of the unknown source is encoded in the choice of the regularizer, defined as the total variation norm of u, G(u) = β ‖u‖_{M(Ω)}, with β > 0. To fit (4.1) into the setting of (P_M) we set C = C_0(Ω), the space of continuous functions vanishing on ∂Ω, and we equip it with the canonical supremum norm. This makes C a Banach space. According to the Riesz-Markov-Kakutani theorem we have M(Ω) ≃ C_0(Ω)*. Of course, these functionals satisfy (A1) and (A2) in Assumption 2.2. Finally, we replace the PDE constraint by introducing a source-to-observation operator K : M(Ω) → L²(Ω) mapping a measure u ∈ M(Ω) to y(T), where y solves (4.2). It is readily verified that K is injective, thanks to a priori estimates for weak solutions to (4.2) [26, Lemma 2.2], as well as weak*-to-strong continuous [26, Lemma 2.3]. Hence, (4.1) admits a unique solution and (A3) in Assumption 2.2 is satisfied. Moreover, K is the adjoint of the operator K* : L²(Ω) → C_0(Ω) defined by K*ϕ := z(0), where the pair (z, ϕ) satisfies, in the sense of distributions, the backwards heat equation

∂_t z + Δz = 0 in (0, T) × Ω, z = 0 on (0, T) × ∂Ω, z(T) = ϕ. (4.3)

For more details we refer to [50,26].
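To get a feel for K and its adjoint, here is a minimal one-dimensional spectral sketch. The paper works on a two-dimensional Ω; the unit interval, the final time, and the use of the continuous Dirichlet eigenvalues in place of their discrete counterparts are simplifying assumptions of this illustration:

```python
import numpy as np

# 1-D sketch of the source-to-observation map K u0 = y(T) for the heat
# equation with homogeneous Dirichlet conditions, applied spectrally via
# the sine eigenbasis of the Dirichlet Laplacian on (0, 1).
n, T = 99, 0.01
h = 1.0/(n + 1)
x = np.linspace(h, 1 - h, n)                    # interior grid points
j = np.arange(1, n + 1)
lam = -(j*np.pi)**2                             # Dirichlet Laplacian eigenvalues
S = np.sqrt(2*h)*np.sin(np.pi*np.outer(j, x))   # orthonormal sine modes

def K_apply(u0):
    """Heat semigroup at time T: y(T) = exp(T * Laplacian) u0."""
    return S.T @ (np.exp(T*lam) * (S @ u0))

def Kstar_apply(phi):
    """Adjoint K*: runs the backward heat equation; spectrally identical."""
    return S.T @ (np.exp(T*lam) * (S @ phi))
```

Since the semigroup is self-adjoint here, K and K* coincide on the grid, which mirrors the fact that the adjoint is realized by the backward heat equation (4.3).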
Note that K* is well-defined, as z(0) ∈ C_0(Ω) due to parabolic regularity estimates. For the sake of completeness, this is justified in Lemma B.1 in the appendix. The next lemma characterizes the set of extremal points of the unit ball of the regularizer G, that is, of the set B = S−(β‖·‖_{M(Ω)}, 1).
Proof. The characterization of Ext(B) is well-known [13, Proposition 4.1]. As for the second claim, note that 0 belongs to the weak* closure of Ext(B): indeed, one can take {x_k}_k in Ω such that x_k → x with x ∈ ∂Ω, so that ±β^{-1}δ_{x_k} ⇀* 0. For the opposite inclusion, assume given a weak* convergent sequence in Ext(B); its limit is either again an element of Ext(B) or the zero measure. Moreover, applying Proposition 2.3 and the characterization of K*, we immediately deduce the optimality conditions for (4.1).

Proposition 4.2. Let ū ∈ M(Ω) be given and denote by z̄ the solution to (4.3) with ϕ = y_d − Kū.
Then ū is a minimizer of (4.1) if and only if |z̄(0)(x)| ≤ β for all x ∈ Ω and ∫_Ω z̄(0) dū = β ‖ū‖_{M(Ω)}. Thanks to the characterization of extremal points presented in Lemma 4.1, the FC-GCG method presented in Algorithm 1 for solving (4.1) generates a sequence of iterates given by linear combinations of Dirac deltas, as well as an associated sequence of active sets of signed, rescaled Dirac deltas. We now claim that the new candidate extremal point in iteration k of Algorithm 1 can be chosen as

v_k = β^{-1} sign(z_k(0)(x_k)) δ_{x_k}, x_k ∈ argmax_{x ∈ Ω̄} |z_k(0)(x)|,

i.e., Step 2 in Algorithm 1 is equivalent to computing a global extremum of a continuous function. This is verified in the following proposition.
The remaining statement follows directly from the characterization of K* together with Lemma 4.1. Hence, solving (3.1) amounts to computing an extremum of the continuous function z_k(0) = −K*∇F(Ku_k), where u_k is the iterate generated by Algorithm 1. Since |z_k(0)| is, in general, non-concave, this optimization task could be non-trivial. However, since the spatial domain Ω is low dimensional in this example, it is possible to resort to heuristic strategies to approximate the extremum of z_k(0). In particular, a standard, widely accepted strategy [21,11] consists in discretizing the domain Ω using a uniform grid {x_h}_{h=1,...,N²} and performing local searches around x_h using gradient descent methods. Then the extremum of z_k(0) can be estimated as

argmax_{h=1,...,N²} |z_k(0)(y_h)|,

where y_h is the outcome of the local search around the point of the grid x_h. Moreover, a practical implementation of Algorithm 1 for this particular problem also entails a discretization of the heat equation, e.g. by piecewise polynomial and continuous finite elements. In this case, the computation of the new point x_k becomes trivial, see Section 4.1.1. We now discuss the non-degeneracy conditions. Denoting by ū ∈ M(Ω) the minimizer of (4.1), we propose a natural and easy-to-verify set of assumptions for z̄(0) that implies our general non-degeneracy assumptions from Section 3.3 for a suitable choice of g. More precisely, this new set of assumptions on z̄(0) will imply Assumptions (B2), (B3) and (B5). We remark that Assumption (B4) still needs to be assumed to ensure the fast convergence; however, we decide not to state it in the following set of assumptions as, for this specific example, it would be formulated exactly as in (B4). Moreover, its verification can be done straightforwardly by looking at the structure of the unique minimizer of (4.1). We finally remind the reader that z̄ is the solution to (4.3) for ϕ = y_d − ȳ, ȳ = Kū, and therefore by the characterization of K* it holds that z̄(0) = −K*∇F(Kū).
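The grid-then-local-search heuristic can be sketched as follows. The surrogate function z below simply stands in for z_k(0) (in the actual algorithm it comes from the backward heat solve), and the grid size, step length, and iteration counts are illustrative choices:

```python
import numpy as np

# Stand-in for the dual variable z_k(0) on Omega = (0, 1)^2; the true z_k(0)
# would come from solving the backward heat equation (4.3).
def z(x):
    return np.sin(3*np.pi*x[0])*np.sin(2*np.pi*x[1])

def grad_z(x, h=1e-6):
    # Central finite differences; in practice one may differentiate analytically.
    e = np.eye(2)
    return np.array([(z(x + h*e[i]) - z(x - h*e[i]))/(2*h) for i in range(2)])

# Coarse uniform grid, then fixed-step gradient ascent on sign(z)*z from
# each node, clipped to the closed domain.
N = 16
nodes = [np.array([a, b])
         for a in np.linspace(0.05, 0.95, N)
         for b in np.linspace(0.05, 0.95, N)]
candidates = []
for x0 in nodes:
    x = x0.copy()
    sgn = 1.0 if z(x) >= 0 else -1.0      # ascend |z| by ascending sgn*z
    for _ in range(100):
        x = np.clip(x + 0.01*sgn*grad_z(x), 0.0, 1.0)
    candidates.append(x)
x_best = max(candidates, key=lambda x: abs(z(x)))
```

The grid guards against missing the global extremum among several local ones, while the local searches refine each candidate beyond grid accuracy.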
The following structural assumptions are made, see also [50,52].
Regarding (C2), recall that z̄(0) is at least twice continuously differentiable in the vicinity of x̄_i, i = 1, . . . , N, see Lemma B.1. Loosely speaking, the additional requirements in Assumptions (C1)-(C2) state that z̄(0) only admits a finite number of global minima/maxima and its curvature around them does not degenerate. The latter corresponds to a second order sufficient optimality condition for the global extrema of z̄(0). We now prove that (C1) and (C2) imply (B2), (B3) and (B5). First we show that (C1) guarantees Assumptions (B2)-(B3). For this purpose, set Ā := { β^{-1} sign(z̄(0)(x̄_i)) δ_{x̄_i} }_{i=1}^N. Hence, the claimed statement follows from (4.7) and Lemma 4.1. Finally, the linear independence follows from the injectivity of K. Next we address Assumption (B5). For every subdomain Ω_0 with Ω̄_0 ⊂ Ω define the quantities appearing in the statement of Lemma 4.5 below. The proof can be found in Section B.1.1 in the appendix.
Now, given arbitrary extremal points Such g will be the one verifying Assumption (B5). In the next lemma we show that the weak* convergence in M (Ω) of a sequence of extremal points to ū i is equivalent to convergence with respect to g. The proof can be found in Section B.1.2 in the appendix.
Finally we combine the previous observations to conclude Assumption (B5).

Proposition 4.7.
Let R > 0 be chosen according to Lemma 4.5. Then, there is a d B -neighbourhood U i of ū i which satisfies as well as for every u ∈ U i and for every i = 1, . . . , N .
Proof. The statement on the existence of a d B -neighbourhood Ū i with the stated properties follows immediately from Lemma 4.6. In fact, if such a neighbourhood does not exist, then there exists a sequence The remaining statements are also readily verified using Lemma 4.5, noting that for every

The regularization parameter is chosen as β = 0.001. The heat equation is discretized using a dg(0)cg(1) scheme on a temporal grid with stepsize δ = 0.001 and a uniform triangulation of Ω with grid size h = 1/128. For the adjoint equation, a conforming discretization scheme is considered. All computations were carried out in Matlab 2019 on a notebook with 32 GB RAM and an Intel® Core™ i7-10870H CPU @ 2.20 GHz. In Figure 1a, we report on the convergence history of the residuals r J (u k ) = J(u k ) − min u∈M (Ω) J(u) associated with a sequence {u k } k generated by Algorithm 1 starting from u 0 = 0 and A 0 = ∅. Due to the dg(0)cg(1) scheme used in the discretization of the state equation, the dual variable z k (0) is now a piecewise linear and continuous function on the spatial grid. As a consequence, |z k (0)| achieves its global maximum in a gridpoint and the new candidate x k can be cheaply computed as In each iteration, the finite dimensional subproblem (3.2) is solved using a semismooth Newton method. Moreover, we plot the size of the support of u k in dependence of k in Figure 1c. By construction, this corresponds to the number of Dirac deltas in the active set A u k . In order to highlight the practical efficiency of Algorithm 1, we also include a comparison to the iterates generated by a generalized conditional gradient method (GCG) given by Here v u k is chosen as before, M 0 = J(0)/β and s k ∈ (0, 1) is an explicitly given stepsize as described in [21]. Both methods were run for a maximum of 100 iterations or until r J (u k ) ≤ 10 −12 . As expected, the GCG update exhibits the typical sublinear convergence behaviour of conditional gradient methods.
In particular, after 200 iterations the residual is still of magnitude r J (u k ) ≈ 5 × 10 −2 . In contrast, we observe a vastly improved rate of convergence for Algorithm 1: the stopping criterion is met after 7 iterations. Moreover, while the support size of u k in GCG strictly increases over the first 13 iterations, Algorithm 1 removes Dirac deltas which are assigned a zero coefficient. This leads to smaller active sets and thus sparser iterates. Both observations are a testament to the practical efficiency of Algorithm 1. Finally, for a fair comparison, we also plot the residual as a function of the computational time (in seconds) for k = 1, . . . , 20. This is done to acknowledge the vastly different computational cost of the update steps in both methods, i.e., forming a convex combination in GCG versus the full resolution of a finite-dimensional minimization problem in Algorithm 1. As we can see, the additional computational effort of fully resolving (3.2) is outweighed by its practical utility. More precisely, Algorithm 1 converges after around 30 s while GCG fails to decrease r J (u k ) below 10 −1 in the considered time frame.

Rank-one matrix reconstruction by trace regularization.
Let H be a separable Hilbert space with norm ‖ · ‖ H induced by an inner product (·, ·) H . In the following, to simplify notation, we focus on infinite dimensional Hilbert spaces; the finite dimensional case, i.e. H ≃ R n , follows by the same arguments. Denote by K(H) the space of bounded, linear, compact and self-adjoint operators from H into itself, which we equip with the standard operator norm. An operator Moreover, let {h i } i∈N be an orthonormal basis (ONB) of H. For an operator U ∈ K(H) we formally define its trace as where |U | is the unique positive square root of U 2 , and Hilbert-Schmidt (HS) if Tr(U 2 ) < ∞. The set of all Hilbert-Schmidt operators together with the norm ‖U ‖ HS = (U, U ) HS 1/2 forms a Hilbert space HS(H). Moreover, the space of all trace-class operators, denoted by T (H), forms a Banach space when equipped with the nuclear norm ‖U ‖ T = Tr(|U |). In this case, we also have K(H) * ≃ T (H), where the duality pairing is realized by With these prerequisites, consider where β > 0 and K : HS(H) → Y is weak-to-strong continuous. Problems of this type appear, e.g., as convex relaxations of quadratic inverse problems, see [23,7], since they are known to favour solutions of rank one. As before, we start by computing the extremal points of For this purpose, given h ∈ H, we introduce the associated rank one operator We arrive at the following characterization. In particular, if σ P̄ 1 = β and N̄ ≥ 1 is the smallest index with σ P̄ N̄+1 < σ P̄ 1 , this implies that a minimizer Ū of (4.13) is of the form Now, denote by U k the k-th iterate in Algorithm 2 and let P k be the corresponding dual variable. Then we immediately obtain arg max In particular, selecting the new candidate point V U k in Step 2 of Algorithm 1 can be realized by computing one eigenfunction for the leading eigenvalue of P k , and U k has at most rank k. Regarding the fast convergence of Algorithm 2, we recall that, in the best case, problem (4.13) produces rank one solutions.
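In finite dimensions the insertion step just described reduces to a single symmetric eigenvalue computation. The following sketch uses a random symmetric matrix as a stand-in for the dual variable P k ; the construction of the rank-one candidate is the point being illustrated.

```python
import numpy as np

# Step 2 in the matrix setting: the new candidate is V = h h^T, where h is a
# unit eigenvector for the leading eigenvalue of the symmetric dual variable.
# P_k below is synthetic data standing in for -K* grad F(K U_k).
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
P_k = (A + A.T) / 2                 # symmetric stand-in for the dual variable

# np.linalg.eigh returns eigenvalues in ascending order for symmetric input
eigvals, eigvecs = np.linalg.eigh(P_k)
h = eigvecs[:, -1]                  # eigenvector of the largest eigenvalue
V = np.outer(h, h)                  # rank-one candidate with Tr(V) = 1

# <P_k, V> = h^T P_k h equals the leading eigenvalue, i.e. V maximizes the
# pairing over unit-trace rank-one positive semidefinite matrices.
```

In practice one would replace the dense eigendecomposition by a Lanczos-type method computing only the leading eigenpair, since only one eigenfunction is needed per iteration.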
In the following, we make a slightly stronger assumption. More precisely, denoting by P̄ the unique optimal dual variable for (4.13), i.e., there holds P̄ = −K * ∇F (KŪ ) for every minimizer Ū of (4.13), we assume that: (D1) The first eigenvalue of P̄ is simple, i.e., β = σ P̄ 1 > σ P̄ 2 . On the one hand, invoking (4.17), this implies that the solution to (4.13) has at most rank one. On the other hand, it also ensures the unique solvability of the linear problem induced by P̄ as well as a quadratic growth behavior with respect to the Hilbert-Schmidt norm, as the next theorem shows. The proof can be found in the appendix, cf. Section B.2.3.
In particular, this implies that the solution of (4.13) is unique and of the form Ū for all U ∈ B with Tr(U ) = β −1 .
Hence, Assumption (D1) implies Assumptions (B2) and (B5) for g(·, ·) = ‖ · − · ‖ HS and U 1 = Ext(B) ∩ B. As a consequence, if F is strongly convex in the sense of Assumption (B1) as well as KŪ 1 = 0 and λ̄ > 0, then Algorithm 1 converges with an asymptotically linear rate.

We consider minimum effort problems, that is, where α > 0 is a fixed parameter, see [29]. This type of regularizer favors binary solutions, i.e., functions ū with ū(x) ∈ {−λ̄, λ̄} for a.e. x ∈ Ω and some λ̄ ≥ 0. Problems of this form appear, e.g., in the optimal maneuvering of spacecrafts [8]. Due to the nonsmoothness of the L ∞ -norm, previous solution approaches [46,29] were, e.g., based on a regularized semismooth Newton method for a bilinear reformulation of (4.21). We now show that FC-GCG yields a simple algorithm which eventually solves (4.21) without regularizing and/or reformulating the problem. For this purpose, we first note that L 1 (Ω) * ≃ M, where L 1 (Ω) is equipped with the canonical norm ‖ · ‖ 1 and where the duality pairing is realized by If K : L ∞ (Ω) → Y is a linear weak*-to-weak continuous operator and F : Y → R is a convex fidelity satisfying (A1), one can easily verify that the minimization problem (4.21) satisfies Assumptions In particular, if ū = 0 is a solution to (4.21), then we have where the maximum is realized for v u k = sign(p k ). Hence, realizing Step 2 in Algorithm 1 only requires the computation of the sign of p k . Next, assume that there is C > 0 with: We remark that Assumption (E2) has already been considered in the context of bang-bang optimal control, i.e. the L ∞ -norm constrained setting, see, e.g., [27].
This assumption in particular implies that L({|p| = 0}) = 0, and thus ū Moreover, setting ū = α −1 sign(p), there holds

In [17] it has been proposed to reconstruct the dynamic data ρ by solving a variational inverse problem regularized with a coercive version of the Benamou-Brenier energy J α,β , allowing for a correlation between the measurements at different time instants (see also [22]). Such variational models have been applied, for instance, to dynamic cell imaging in PET [55], to 4d image reconstruction in nanoscopy [22] and to particle image velocimetry methods [54]. A general variant of the problem considered in [17] can be formulated in our setting by considering where M := M (X) × M (X; R d ), K : M (X) → Y is weak*-to-weak continuous and such that (A3) holds, while F : Y → R is a convex fidelity term satisfying (A1). The Benamou-Brenier energy [9] can be regarded as a dynamic version of optimal transport and is defined as follows. Setting and Φ := ∞ otherwise, the Benamou-Brenier energy is defined for (ρ, m) ∈ M by the formula where σ ∈ M + (X) is an arbitrary measure such that ρ, |m| ≪ σ. We refer the reader to [3,53] for more details. Following [17], the regularizer J α,β is then defined by where α, β > 0, and D is the set of pairs (ρ, m) ∈ M satisfying the continuity equation ∂ t ρ + div m = 0 in X in the weak sense. The functional J α,β is convex, weak* lower semicontinuous, positively one-homogeneous and satisfies (A2), see [15, Lemma 4]. Denote by B the unit ball of J α,β . In [15, Theorem 6] it has been shown that where C α,β is the set of measures concentrated on absolutely continuous curves in Ω, i.e., pairs (ρ γ , m γ ) satisfying for some γ : [0, 1] → Ω an absolutely continuous curve with weak derivative in L 2 . Therefore, it is possible to apply the FC-GCG method of Algorithm 1 for computing solutions to (4.27). In this setting, the iterates are of the form u k = (ρ k , m k ) with λ k i ≥ 0 and γ k i : [0, 1] → Ω absolutely continuous.
Moreover, it can be shown, see [16], that Step 2 in Algorithm 1 is equivalent to solving where p k = −K * ∇F (Kρ k ) ∈ C(X) is the dual variable at the k-th iteration. The new point inserted is then v k = (ρ γ̂ , m γ̂ ). Thanks to the validity of Assumptions (A1)-(A3), Theorem 3.3 guarantees the sublinear convergence of Algorithm 1. On the other hand, the verification of hypotheses (B1)-(B5), necessary for ensuring fast convergence of Algorithm 1, is non-trivial and is left to future work. We remark that an implementable version of Algorithm 1 for solving (4.27) under specific choices of the fidelity term F and the operator K has been recently proposed in [16] (see also [38]). We refer to these papers for more details about the practical implementation and the modifications needed to deal with time dependent measurement operators. Similarly to [16], solving (3.1) amounts to computing an absolutely continuous curve γ̂ by solving (4.32) at every iteration of the algorithm. This is a challenging non-concave variational problem in the space of curves. In [16], the authors proposed to solve (4.32) by a multistart gradient descent approach in the space of curves. The initialization curves are chosen as linear interpolations of randomly generated points {x h } in Ω. Moreover, several heuristic rules are employed to optimally select {x h } and to reduce the computational time of the multistart gradient descent routine. Together with further acceleration strategies, this method is shown to be computationally feasible and very accurate for the task of tracking several dynamic sources in the presence of high noise and severe spatial undersampling. In [38], the authors proposed to speed up the algorithm in [16] by considering inexact subproblems (3.1) and solving them using known algorithms for computing shortest paths on directed acyclic graphs.
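The multistart gradient descent over curves can be sketched in a one-dimensional toy setting. Everything below is an illustrative assumption: the dual variable `p` is synthetic, the curve is discretized as an array of positions, and the objective is a discretization of the ratio appearing in (4.32) (data gain over α plus kinetic energy); the actual implementation in [16] uses far more refined initializations and accelerations.

```python
import numpy as np

T = 21                        # time discretization of [0, 1]
t = np.linspace(0.0, 1.0, T)
dt = 1.0 / (T - 1)
alpha, beta = 1.0, 0.1

def p(ti, x):
    """Synthetic dual variable rewarding curves near x(t) = 0.2 + 0.6 t."""
    return np.exp(-10.0 * (x - (0.2 + 0.6 * ti)) ** 2)

def objective(gamma):
    """Discretized insertion objective: data gain over (alpha + kinetic energy)."""
    gain = np.sum(p(t, gamma)) * dt
    kinetic = 0.5 * beta * np.sum(np.diff(gamma) ** 2) / dt
    return gain / (alpha + kinetic)

def ascend(gamma, step=0.1, iters=100, h=1e-6):
    """Gradient ascent with backtracking; gradients by central differences."""
    gamma = gamma.copy()
    for _ in range(iters):
        grad = np.zeros_like(gamma)
        for i in range(T):
            e = np.zeros(T); e[i] = h
            grad[i] = (objective(gamma + e) - objective(gamma - e)) / (2 * h)
        trial = gamma + step * grad
        if objective(trial) > objective(gamma):
            gamma = trial            # accept only improving steps
        else:
            step *= 0.5
    return gamma

# Multistart: linear interpolations between randomly generated endpoints.
rng = np.random.default_rng(1)
starts = [np.linspace(a, b, T) for a, b in rng.uniform(0.0, 1.0, size=(5, 2))]
curves = [ascend(g) for g in starts]
best = max(curves, key=objective)
```

The multistart is essential here: a single gradient ascent started from an unlucky interpolation can stall in a poor local maximum of the non-concave objective.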

A lifting to a space of measures
Our approach to proving Theorem 3.3 and Theorem 3.8 relies on an observation based on Choquet's theorem. This classical result allows us to prove that for every u ∈ dom(G) there exists a positive measure µ concentrated on Ext(B) such that G(u) = ‖µ‖ M (B) and where the infimum is taken over M + (B), the cone of positive measures on B. Note that the forward operator is replaced by a "lifted" mapping K : M (B) → Y which satisfies Kµ = Ku whenever (5.1) holds for µ ∈ M + (B) and u ∈ M, see Proposition 5.3. Moreover, the role of the non-smooth regularizer G is now played by the total variation norm. It turns out (see Section 5.2) that problems (P M ) and (P M + ) are equivalent in the sense that every minimizer of (P M ) can be converted to a solution of (P M + ) (and vice versa). Subsequently, in Section 5.5, we propose an extension of the Primal-Dual-Active-Point method from [52] to compute a solution of (P M + ). This is described in Algorithm 2. Finally, using the results of Section 5.2, we show that Algorithm 2 and Algorithm 1 are equivalent, setting the ground to prove Theorem 3.3 and Theorem 3.8 by means of convergence results for Algorithm 2.
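For intuition, consider the finite-dimensional special case G = ‖·‖ 1 on R n : the extremal points of the unit ball are ±e i , and the lifting represents u as a weighted sum of Diracs at those points. The toy computation below (with a random matrix standing in for the forward operator K) checks the two defining identities of the lifting, ‖µ‖ = G(u) and Kµ = Ku.

```python
import numpy as np

# Toy lifting for G = ||.||_1 on R^n: Ext(B) = {±e_i}, and u is the barycenter
# of mu = sum_i |u_i| * delta_{sign(u_i) e_i}. The lifted operator acts atom-wise.
rng = np.random.default_rng(2)
n, m = 4, 3
K = rng.standard_normal((m, n))          # stand-in for the forward operator
u = np.array([1.5, -2.0, 0.0, 0.5])

# the lifted measure: a list of (weight, extreme point) pairs
mu = [(abs(ui), np.sign(ui) * np.eye(n)[i]) for i, ui in enumerate(u) if ui != 0]

barycenter = sum(w * v for w, v in mu)   # should recover u
K_mu = sum(w * (K @ v) for w, v in mu)   # lifted forward map: sum of K v atoms
mass = sum(w for w, _ in mu)             # total variation norm of mu
```

Here the total mass of µ equals G(u) and the lifted observation K µ coincides with Ku, mirroring (5.1) and Proposition 5.3 in this elementary setting.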
Finally, the support of a measure µ ∈ M (B), denoted by supp µ, is the closure of the set of all points v ∈ B such that |µ|(U ) > 0 for all neighbourhoods U of v.

An equivalent problem.
We require the following definition. An element u ∈ M such that (5.2) holds is also called the (weak) barycenter of µ in B.
It turns out that each u ∈ dom(G) is the barycenter of at least one measure µ ∈ M + (B). iii) I is weak*-to-weak* continuous.
For a proof of the above statement we refer the reader to Appendix C.1. Here we only mention that the existence of the map I is easily obtained from the theory of weak* integration, while surjectivity is a consequence of the classical Choquet theorem, see Theorem C.1. Thus, instead of solving (P M ) directly, we can equivalently determine a measure μ̄ ∈ M + (B) which represents any of its minimizers. Again, this can be done by solving a suitable minimization problem.
To make this idea more rigorous, we establish the existence of a linear continuous operator K : M (B) → Y which agrees with K on measures representing points of M. Again, we postpone the proof of this statement to Appendix C.1.

Proposition 5.3. There exists a linear continuous operator
Moreover K satisfies the following properties: i) The norm of K is such that If in addition Y is separable, then (5.4) holds in the strong sense, that is,

for all µ ∈ M (B), where the right-hand side integral is understood in the Bochner sense.
We are now in position to investigate the announced equivalence of (P M ) and the sparse minimization problem (P M + ). To this end, define Proof. First recall that K is weak*-to-strong continuous thanks to Proposition 5.3. Since F is continuous, we then infer weak* lower semicontinuity of j. As F is bounded from below, we immediately have that j is bounded from below and its sublevels are weak* compact, yielding the existence of minimizers of (P M + ) by the direct method. We pass to the proof of i). Assume that ū is a minimizer of (P M ), so that ū ∈ dom(G). According to ii) in Proposition 5.2, where the first inequality follows from the optimality of ū. This proves i). Second, let μ̄ ∈ M + (B) be a solution to (P M + ) and ū := I(μ̄). Thus, by Proposition 5.2, we have that ū ∈ dom(G), μ̄ represents ū and G(ū) ≤ ‖μ̄‖ M (B) . Moreover, let u ∈ dom(G). By point ii) in Proposition 5.2, we get µ ∈ M + (B) representing u and such that ‖µ‖ M (B) = G(u). Again, Kµ = Ku, Kμ̄ = Kū, and we conclude the proof of ii) noting that The final part of the statement follows from (5.7)-(5.8).

Optimality conditions.
In this section we establish the relationship between the dual variables of the problems (P M ) and (P M + ). Moreover we characterize optimality conditions for (P M + ).
Proof. By Proposition 3.5, (P M ) admits a unique solution ū, which is of the form ū = Σ N i=1 λ̄ i ū i for some λ̄ i ≥ 0. Let μ̄ ∈ M + (B) be a solution to (P M + ), which exists by Theorem 5.4. From the optimality conditions (5.11) in Theorem 5.6, one can easily verify that Hence, by Assumption (B2) and Proposition 5.5, there exist σ̄ i ≥ 0 such that μ̄ = Σ N i=1 σ̄ i δ ū i . By (5.3) we have I(δ u ) = u for all u ∈ B. As I is linear, we then conclude I(μ̄) = Σ N i=1 σ̄ i ū i . On the other hand, by Theorem 5.4 ii), we know that I(μ̄) minimizes (P M ). Thus I(μ̄) = ū, given that ū is the unique solution of (P M ). We have then shown Σ N i=1 (λ̄ i − σ̄ i ) ū i = 0. Applying the linear operator K to this identity and invoking (B3), we infer λ̄ i = σ̄ i . Therefore μ̄ = Σ N i=1 λ̄ i δ ū i . As μ̄ is an arbitrary minimizer of (P M + ), the claim follows.

5.5. A Primal-Dual-Active-Point method for (P M + ). In the following we describe a variant of the Primal-Dual-Active-Point strategy (PDAP) from [52] for the solution of (P M + ). The latter is a fully-corrective version of a generalized conditional gradient method (also known as the Frank-Wolfe algorithm) for solving convex minimization problems over spaces of measures supported on subsets of Euclidean space. In this section we generalize this procedure to (P M + ) and discuss its connection to Algorithm 1. Similarly to [52], our proposed PDAP method alternates between the update of an for some λ k i ≥ 0. We now provide a short description of the individual steps of this method and summarize them in Algorithm 2 below. Given the current iterate µ k of the form (5.12), we first compute the corresponding dual variable P k = −K * ∇F (Kµ k ) ∈ C(B) and enrich the active set A µ k by adding to it a global maximizer v̂ µ k of P k over Ext(B), i.e., we set Using Proposition 5.5, we note that this update step is equivalent to maximizing a linear functional over Ext(B). This is the content of the next lemma, whose proof is an immediate consequence of Proposition 5.5 and is hence omitted.
Step 5 of Algorithm 2. The following lemma compares the update obtained by solving (5.13) to the finite dimensional minimization problem (3.4) in Step 5 of Algorithm 1.

Lemma 5.9. A measure µ ∈ M + (A µ,+ k ) is a solution to (5.13) if and only if
+ is a minimizer of the finite dimensional minimization problem (3.4).
Proof. Let µ ∈ M + (A µ,+ k ), so that there exists at least one λ µ ∈ R from which the characterization of minimizers µ to (5.13) readily follows.
Finally, see Step 6 of Algorithm 2, the active set is truncated by choosing A µ k+1 as the support of µ k+1 , that is, The method is summarized in Algorithm 2.

Update A µ k+1 = supp µ k+1 and set N k+1 = #A µ k+1 . end for

As for Algorithm 1, we define the residuals associated with the iterates µ k of Algorithm 2 by (5.14) Note that, due to Theorem 5.4, such residuals can be written as Summarizing the previous observations, we conclude the equivalence between Algorithm 1 and Algorithm 2 as stated in the next theorem.
and µ k ∈ M + (A µ k ) be given. Set A u k := A µ k and u k := I(µ k ). Then, the update Steps from 2 to 6 in Algorithm 1 and Algorithm 2 can be realized such that In particular, if {u k } k and {A u k } k are sequences of iterates and active sets generated by Algorithm 1 and we have and it holds where the residuals r J (u k ) and r j (µ k ) are defined in (3.9) and (5.14), respectively.

Proof. By Lemma 5.8 we can choose
it holds that u k+1 = I(µ k+1 ) and µ k+1 is a solution to (5.13) by Lemma 5.9. Moreover, we note that Concerning the claim in (5.15), given u k = and A µ k = A u k . Therefore (5.15) follows by the first part of the statement and an induction argument. Finally, as I(µ k ) = u k , by Proposition 5.2 i) and Proposition 5.3 ii) we have G(u k ) ≤ µ k M (B) and Kµ k = Ku k . Therefore J(u k ) ≤ j(µ k ), and (5.16) follows by (5.6).
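To make the interplay of the individual steps concrete, here is a self-contained toy implementation of the PDAP/FC-GCG loop for a nonnegative sparse-recovery problem on a one-dimensional grid. The data, the Gaussian forward operator, and the projected-gradient solver for the finite-dimensional subproblem are all illustrative stand-ins (in particular for the semismooth Newton solver used in the experiments).

```python
import numpy as np

# Toy PDAP sketch for j(mu) = 0.5*||K mu - y||^2 + beta*||mu||, with
# nonnegative measures on a 1d grid (a finite-dimensional stand-in for M+(B)).
rng = np.random.default_rng(3)
grid = np.linspace(0.0, 1.0, 200)
centers = np.linspace(0.0, 1.0, 40)
K = np.exp(-((grid[None, :] - centers[:, None]) ** 2) / 0.01)   # (40, 200)
y = K[:, 30] * 1.0 + K[:, 140] * 0.7        # two-spike synthetic ground truth
beta = 0.1

def solve_coeffs(cols, iters=2000):
    """Finite-dimensional subproblem: min_{lam >= 0} 0.5|A lam - y|^2 + beta sum(lam),
    solved here by projected gradient descent."""
    A = K[:, cols]
    lam = np.zeros(len(cols))
    step = 1.0 / (np.linalg.norm(A, 2) ** 2 + 1e-12)
    for _ in range(iters):
        lam = np.maximum(0.0, lam - step * (A.T @ (A @ lam - y) + beta))
    return lam

active, lam = [], np.zeros(0)
for k in range(15):
    P = -K.T @ (K[:, active] @ lam - y)     # dual variable on the grid
    v = int(np.argmax(P))                   # Step 2: insert a global maximizer
    if v not in active:
        active.append(v)
    lam = solve_coeffs(active)              # Step 5: fully corrective update
    keep = lam > 1e-10                      # Step 6: prune zero-weight atoms
    active = [a for a, kp in zip(active, keep) if kp]
    lam = lam[keep]

obj = 0.5 * np.linalg.norm(K[:, active] @ lam - y) ** 2 + beta * lam.sum()
```

The pruning in the last step is what keeps the active set small and the iterates sparse, mirroring the removal of zero-coefficient Dirac deltas observed for Algorithm 1 in the numerical experiments.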

Convergence analysis
We are now prepared to prove Theorem 3.3 and Theorem 3.8. For this purpose, we rely on Theorem 5.10, which states that the FC-GCG method from Algorithm 1 converges at least as fast as the PDAP method in Algorithm 2, thanks to the estimate (5.16). In particular, by Step 5 in Algorithm 2, we have that the iterates of FC-GCG and PDAP satisfy In the following we use (6.1), as well as specific choices of the measure µ in the upper bound, to prove Theorems 3.3 and 3.8. Specifically, the proof of Theorem 3.3 is carried out in Section 6.1. The proof of Theorem 3.8, which is more technical, is conducted in Section 6.2.10, after establishing some preliminary results in Section 6.2.
For the remainder of the paper we tacitly assume that Algorithm 2 does not stop after a finite number of iterations and generates a sequence {µ k } k in M + (B). Dropping superscripts, we denote the k-th active set, iterate, dual variable, candidate point computed in Step 2, and enlarged active set by, respectively, where we recall that λ k i > 0 and u k i , v̂ k ∈ Ext(B).
6.1. Worst-case convergence rate. We first argue that Algorithm 2 converges at least sublinearly.
To start, define the sublevel set By Theorem 5.4, we have that E µ 0 is weak* compact. Let M 0 > 0 be an arbitrary but fixed upper bound on the norm of elements in E µ 0 and consider the norm constrained problem Clearly, by definition of M 0 , the additional norm constraint does not change the set of global minimizers. The following proposition relates v k to a particular conditional gradient descent direction η k for ( P M + ).
Then, η k is a minimizer of the partially linearized problem (6.4)

Moreover, we have
Proof. Since we are testing against positive measures, we can estimate The proof is finished by noting that (1 − s)µ k + s η k belongs to M + (A + k ) and that µ k+1 solves (5.13). In particular, Proposition 6.1 shows that, in each iteration, Algorithm 2 achieves at least as much descent as a conditional gradient update. This observation allows us to prove sublinear convergence for Algorithm 2 using known convergence results for conditional gradient methods in general Banach spaces (see Theorem 6.2 below). Finally, the combination of Theorem 5.10 with Theorem 6.2 yields the convergence of Algorithm 1. Theorem 6.2. Let (A1)-(A3) in Assumption 2.2 hold. Then, the sequence {j(µ k )} k is monotonically decreasing, µ k ∈ E µ 0 , and there exists a constant c > 0 such that where r j is defined in (5.14). The sequence {µ k } k admits at least one weak* accumulation point and each such point is a solution to (P M + ). If the solution μ̄ to (P M + ) is unique, we have µ k ⇀* μ̄ for the whole sequence.
Proof. Since µ k+1 is a solution to (5.13), we clearly have j(µ k+1 ) ≤ j(µ k ) ≤ j(µ 0 ). Thus {j(µ k )} k is monotonically decreasing and µ k ∈ E µ 0 . Now we show that ∇(F ◦ K) is Lipschitz continuous on E µ 0 . Indeed, since E µ 0 is weak* compact and K is weak*-to-strong continuous, see Proposition 5.3 iii), the image set Using Assumption (A1), we have that ∇F is Lipschitz continuous on KE µ 0 with some constant L µ 0 > 0. Hence where C is the constant from Proposition 5.3 i). The claimed convergence statement now follows from [60, Theorem 6.14].
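For the reader's convenience, we sketch the standard argument behind the sublinear worst-case rate of Theorem 6.2; here C > 0 denotes a generic curvature constant, an assumption of this sketch playing the role of the Lipschitz and boundedness constants appearing in the proof.

```latex
% Standard sublinear-rate argument for conditional gradient methods:
% the descent comparison of Proposition 6.1 yields, for every s \in [0,1],
r_j(\mu_{k+1}) \;\le\; (1-s)\, r_j(\mu_k) + \frac{C}{2}\, s^2,
% and minimizing the right-hand side over s, combined with an induction
% over k, gives the worst-case rate
r_j(\mu_k) \;\le\; \frac{2C}{k+2}\,.
```

This is the classical O(1/k) bound for Frank-Wolfe-type methods; the constant c in Theorem 6.2 absorbs C and the initialization.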
Proof of Theorem 3.3. Assume that Algorithm 1 does not converge after finitely many steps and generates a sequence {u k } k . According to Theorem 5.10, there exists a sequence {µ k } k generated by Algorithm 2 with u k = I(µ k ). Invoking Theorem 5.10 as well as Theorem 6.2 yields In particular, Theorem 6.2 implies that, up to subsequences, µ k ⇀* μ̄ with μ̄ a solution of (P M + ). Since I is weak*-to-weak* continuous by Proposition 5.2 iii), and since u k = I(µ k ), we infer u k ⇀* ū with ū := I(μ̄). As μ̄ minimizes (P M + ), by Theorem 5.4 ii) we infer that ū minimizes (P M ). The rest of the statement follows by a similar argument. Remark 6.3. In the next section, devoted to the proof of the linear convergence of Algorithm 1, the global sublinear convergence provided by Theorem 3.3 plays an important role. Indeed, since the claimed linear convergence in Theorem 3.8 is only asymptotic, i.e., it relies on the iterates u k being sufficiently close, in the weak* sense, to a solution ū of (P M ), the application of Theorem 3.3 is a necessary starting point. Our proof of Theorem 3.3 relies on the lifting strategy of Section 5, the interpretation of PDAP as a monotone, fully-corrective GCG method for (P M + ), as well as known convergence results for this algorithm. A similar identification for Algorithm 1 is not directly possible. In fact, since in general we only have G(u k ) ≤ κ A k (u k ), the residuals {r J (u k )} k in Algorithm 1 are not necessarily monotone. While we believe that a proof of Theorem 3.3 which does not rely on (P M + ) is possible, this is, to the best of our knowledge, non-standard and would require further work. In particular, we point out that the known results of [62] for GCG methods with gauge-like regularizers seem not to be applicable since B is, in general, not strongly compact. As such, the lifting strategy provides an elegant way to circumvent these additional arguments.
6.2. Fast convergence and proof of Theorem 3.8. In this section we further investigate the convergence behaviour of the iterates {µ k } k of Algorithm 2, now under the premise of Assumptions (B1)-(B5). The goal is to show an improved, locally linear convergence rate for Algorithm 2, see Theorem 6.10 below. Thanks to this result, and to Theorem 5.10, we will then be able to prove linear convergence for the FC-GCG method of Algorithm 1, as stated in Theorem 3.8. As the proofs are quite technical, after establishing some notation we give a detailed summary of the results.

6.2.1. Notation. We employ the notation at (6.2) for µ k and denote by ū and μ̄ the unique solutions to (P M ) and (P M + ), respectively. Existence and uniqueness are guaranteed by Proposition 5.7, which holds since we are assuming (B2)-(B3), while λ̄ i > 0 by the non-degeneracy Assumption (B4). Since (P M + ) has a unique solution and the prerequisites of Theorem 6.2 are fulfilled, we have the convergences as k → ∞, along the whole sequence. Also, we denote by Ā and P̄ the set of optimal extremal points and the optimal dual variable associated to ū and μ̄, respectively. Moreover, let Ū i be as in Assumption (B5), and set U i := Ū i ∩ Ext(B). Finally, the observations are denoted by where the equality Kū = Kμ̄ follows from Proposition 5.3 ii).

Summary of results.
Our aim is to show the existence of ζ ∈ [3/4, 1) and c > 0 such that for all k sufficiently large, see Theorem 6.10 below. To obtain (6.7) it will be sufficient, for fixed k, to construct a measure µ ∈ M + (A + k ) such that r j (µ k+1 ) ≤ r j (µ) ≤ ζ r j (µ k ).
We will choose µ := µ s k as the convex combination µ s k := µ k + s( µ̃ k − µ k ), (6.8) for some s ∈ [0, 1], where the surrogate sequence µ̃ k is obtained by suitably modifying µ k in a neighbourhood of the inserted point v̂ k . More precisely, we proceed as follows: i) We start with some preparatory results in Sections 6.2.3, 6.2.4, 6.2.5. Afterwards, in Section 6.2.6, we show that the active sets A k cluster around Ā, in the sense that A k ⊂ ∪ N i=1 U i for k sufficiently large. Moreover, every point in Ā is approximated by at least one point in A k . This is a consequence of the uniform convergence of the dual variables P k to the optimal dual variable P̄ , see Section 6.2.5, as well as of the isolation of the maximizers of P̄ , see Assumption (B2). ii) In Section 6.2.7, we quantify the distance between A + k and Ā in terms of the function g in Assumption (B5), see Proposition 6.8. This is a crucial part of our analysis, and also the first point where the growth estimates in (B5) come into play. Precisely, we show that A k ∩ U i approaches ū i in the sense that for all i = 1, . . . , N . (6.9) In addition, we prove that the point v̂ k inserted in Step 2 is close to Ā, in the sense that there exists an index î k ∈ {1, . . . , N } such that v̂ k ∈ U î k and g( v̂ k , ū î k ) ≲ r j (µ k ) 1/2 . (6.10) iii) In Section 6.2.8 we introduce the surrogate sequence obtained by modifying µ k in the neighbourhood Ū î k of the inserted point v̂ k . In Lemma 6.9 we prove that µ̃ k ⇀* μ̄, as well as the following key estimate This estimate relies on (6.9)-(6.10), and thus inherently requires Assumption (B5). iv) In Section 6.2.9 we employ (6.11) to prove the estimate for some ζ ∈ [3/4, 1) and a suitable step size s ∈ (0, 1), where µ s k is defined in (6.8). From (6.12) we then obtain the local linear convergence rate for PDAP, see Theorem 6.10. v) In Section 6.2.10 we finally prove Theorem 3.8.
Thanks to the above analysis, the proof is a simple consequence of the linear convergence rate for PDAP established in Theorem 6.10, and of the link between PDAP and FC-GCG granted by the lifting strategy, see Theorem 5.10.
Remark 6.4. The proof strategy described above is inspired by [52], where the authors propose a method in the spirit of Algorithm 2 for solving a minimization problem in the space of measures supported on a compact set in R n . However, the proofs in our setting often require novel techniques compared to the ones in [52]. Loosely speaking, this can be attributed to the fact that the metric space (B, d B ) does not possess an obvious geometric structure. For example, in the Euclidean setting of [52], the estimates for the distance between Ā and A k , or v̂ k , respectively, rely on the convexity of Euclidean balls, as well as on the higher order differentiability of the dual variable. Such a strategy does not extend to our setting, where d B -neighbourhoods are generally non-convex and no apparent differentiable structure is available. The difference between both approaches shows most prominently in the key estimate for K( µ̃ k − µ k ) anticipated in (6.11). In contrast, the fast convergence result in [52] relies on see [52, Lemma 5.15]. With the notation of the current paper, estimate (6.13) is obtained from a perturbed quadratic growth condition of the form in the vicinity of v̂ k , the derivation of which relies on higher order differentiability of the dual variable. In general, though, the quadratic growth condition in Assumption (B5) is not stable with respect to perturbations of the dual variable. Thus, such arguments cannot be applied in our setting.

6.2.3.
Properties of iterates. Since μ̄ ≠ 0 and ‖ · ‖ M (B) is weak* lower semicontinuous, from (6.6) we conclude the existence of M ∈ N such that Also, as a consequence of Theorem 6.2, where E µ 0 is the weak* compact set in (6.3). Similarly, we can reformulate Assumption (B5) and Remark 3.7 in terms of P̄ to obtain where we remind the reader that g : Ext(B) × Ext(B) → [0, ∞), and κ, σ > 0 are constants. The dual variables P k satisfy the following optimality conditions. For a proof, see Appendix C.2.1.
Proposition 6.5. Let M be as in (6.14). Then, for all k ≥ M we have

6.2.5. Convergence of dual variables and observations. Due to the strong convexity of F around ȳ, see Assumption (B1), the worst-case convergence guarantee of Theorem 6.2 also carries over to the observations y k and the dual variables P k , as stated in the following proposition. For a proof, we refer the reader to Appendix C.2.2.
Proposition 6.6. There exist M ∈ N and c > 0 such that, for all k ≥ M , there holds In particular, we have y k → ȳ in Y and P k → P̄ in C(B).
6.2.6. Asymptotic behavior of A k . In the following we show that the active sets A k cluster around Ā. Specifically, for k sufficiently large, we have A k ⊂ ∪ N i=1 U i as well as A k ∩ U i ≠ ∅ for every i = 1, . . . , N . Proposition 6.7. There exists M ∈ N such that for all k ≥ M we have where σ > 0 is the constant in (6.17). Moreover, for all k ≥ M and i = 1, . . . , N , Proof. As P k → P̄ uniformly by Proposition 6.6, we deduce the existence of M ∈ N such that, for all k ≥ M , it holds ‖P k − P̄ ‖ C(B) ≤ σ/2. As a consequence, for all v ∈ B \ ∪ N i=1 Ū i and k sufficiently large we have where we used (6.17). This proves (6.20). Now, recall that P k = 1 on A k by Proposition 6.5. Therefore A k ⊂ ∪ N i=1 Ū i , since otherwise (6.20) would yield a contradiction. Since by construction A k ⊂ Ext(B) and U i := Ū i ∩ Ext(B), we conclude (6.21). Consider now an arbitrary but fixed index l ∈ {1, . . . , N }. Recall that the sets Ū i are pairwise disjoint and d B -closed. Therefore we can apply Urysohn's lemma to obtain a d B -continuous function ϕ l : B → [0, 1] such that ϕ l = 1 in Ū l and ϕ l = 0 in Ū i , i ≠ l. Recall that µ k ⇀* μ̄ along the whole sequence by (6.6). Therefore, since μ̄ = Σ N i=1 λ̄ i δ ū i , we get λ̄ where in the last equality we used (6.21). Finally, since λ̄ l > 0, from the above convergence we get µ k (Ū l ) > 0 for k sufficiently large. Recalling that A k ⊂ Ext(B) and that U l := Ū l ∩ Ext(B), we then infer µ k (U l ) > 0 for k sufficiently large, concluding A k ∩ U l ≠ ∅.

6.2.7. Distance between A + k and Ā. In the following proposition we use the previous results to quantify the distance between the active set A k and the set of optimal extremal points Ā, see (6.23) below. We also provide an estimate for the distance of v̂ k to the closest element in Ā, see (6.24). We remark that this is the first point at which the estimates of Assumption (B5) are employed.
Proof. By minimality of $\bar\mu$ in $(P_{M^+})$ and convexity of $F$ we obtain the estimate (6.25), where we used the optimality condition (5.11) in the last line. Fix $i_0 \in \{1, \dots, N\}$ and an arbitrary $k \in \mathbb{N}$ large enough such that all previous results in this section hold. We will show (6.23) for the index $i_0$. By (6.21) and the definition of $\mu_k$ we obtain (6.26)-(6.27). Putting together (6.25)-(6.27), and using (6.18), that is, (B5), we arrive at the estimate (6.28). By the convexity of $(\cdot)^2$, we conclude (6.15). Estimate (6.23) now follows from (6.28) with $c := M_0/\kappa$.

We now show (6.24). Note that by Proposition 6.5 we have $P_k(\hat v_k) = \max_{v \in B} P_k(v) \ge 1$ for all $k \in \mathbb{N}$ large enough. Therefore $\hat v_k \in \bigcup_{i=1}^N \bar U_i$ by (6.20). Recalling that $\hat v_k \in \operatorname{Ext}(B)$ and that the sets $U_i := \bar U_i \cap \operatorname{Ext}(B)$ are pairwise disjoint, we deduce that $\hat v_k \in U_{\hat i_k}$ for some unique index $\hat i_k$ in $\{1, \dots, N\}$. Utilizing (6.18), i.e. (B5), and the fact that $\bar P(\bar u_{\hat i_k}) = 1$ by (6.16), we obtain the estimate (6.29), where in the last line we used that $P_k(\hat v_k) = \max_{v \in B} P_k(v)$. Using Assumption (B5) as well as Proposition 5.5, the right-hand side of (6.29) is further bounded by (6.30), where $\tau > 0$ is the constant from (3.12), which does not depend on $k$. Finally, using (6.29) and (6.30) together with Proposition 6.6, we conclude (6.24).
6.2.8. Surrogate sequence. Let $M \in \mathbb{N}$ be sufficiently large so that all of the above results hold. For $k \ge M$ we denote by $\hat i_k \in \{1, \dots, N\}$ the index from Proposition 6.8. Starting from the sequence $\{\mu_k\}_k$ generated by Algorithm 2, we define the surrogate sequence $\{\tilde\mu_k\}_k$ in $M^+(B)$ by (6.31), where the restriction $\mu_k \llcorner \bar U_i$ is defined as in Section 5.1. Notice that $\tilde\mu_k$ is just a local modification of $\mu_k$ around $\hat v_k$. In the following lemma we investigate the properties of $\tilde\mu_k$. Most importantly, we establish the weak* convergence of $\tilde\mu_k$ towards $\bar\mu$, as well as the crucial estimate (6.11).
Lemma 6.9. For all $k \in \mathbb{N}$ sufficiently large, the estimates (6.32)-(6.34) hold, where $c > 0$ does not depend on $k$. Moreover, as $k \to \infty$, we have $\tilde\mu_k \overset{*}{\rightharpoonup} \bar\mu$ and $j(\tilde\mu_k) \to j(\bar\mu)$. In particular, $\tilde\mu_k \in E_{\mu_0}$ for all $k \in \mathbb{N}$ large enough, where $E_{\mu_0}$ is the set in (6.3).
Proof. Recall that $\hat v_k \in U_{\hat i_k}$ by Proposition 6.8. Using (6.21), the definition of $\tilde\mu_k$, and the fact that the sets $U_i$ are pairwise disjoint, it is straightforward to check that (6.32) holds. Noting (6.35), we also obtain (6.33), where in the last equality we used that $P_k = 1$ on $A_k = \operatorname{supp} \mu_k$ by Proposition 6.5. We now show (6.34). By (6.21) and (6.35), the linearity of $K$, Proposition 5.3 ii), and the triangle inequality yield a first bound. Recalling that $\hat v_k \in U_{\hat i_k}$, by (B5) and (6.23)-(6.24) we estimate further, where in the last inequality we used (6.15). This establishes (6.34).
As for the remaining part of the statement, first recall that $r_j(\mu_k) \to 0$ and $\mu_k \overset{*}{\rightharpoonup} \bar\mu$ along the whole sequence by (6.6). Since $K$ is weak*-to-strong continuous, see Proposition 5.3 iii), we conclude that $K\mu_k \to K\bar\mu$ strongly in $Y$. Thus, $K\tilde\mu_k \to K\bar\mu$ by (6.34). We now show that $j(\tilde\mu_k) \to j(\bar\mu)$. Recalling that $\|\tilde\mu_k\|_{M(B)} = \|\mu_k\|_{M(B)}$ by (6.32), we obtain (6.36). We have $r_j(\mu_k) \to 0$ by Theorem 6.2, while $|F(K\tilde\mu_k) - F(K\mu_k)| \to 0$ since $K\tilde\mu_k, K\mu_k \to K\bar\mu$ and $F$ is continuous. Therefore the right-hand side of (6.36) converges to zero, which implies $r_j(\tilde\mu_k) \to 0$, that is, $j(\tilde\mu_k) \to j(\bar\mu)$. We are left to show that $\tilde\mu_k \overset{*}{\rightharpoonup} \bar\mu$. Indeed, we have shown that $\{\tilde\mu_k\}_k$ is a minimizing sequence for $(P_{M^+})$. Since $j$ is weak* lower semicontinuous and has weak* compact sublevels (Theorem 5.4), we infer the existence of a subsequence $\{\tilde\mu_{n_k}\}_k$ and of $\hat\mu \in M^+(B)$ such that $\tilde\mu_{n_k} \overset{*}{\rightharpoonup} \hat\mu$, with $\hat\mu$ a minimizer of $j$. By uniqueness of the minimizer, see Proposition 5.7, we conclude $\hat\mu = \bar\mu$, hence $\tilde\mu_{n_k} \overset{*}{\rightharpoonup} \bar\mu$. Moreover, since the weak* limit does not depend on the chosen subsequence, we also infer $\tilde\mu_k \overset{*}{\rightharpoonup} \bar\mu$. Finally, since $\mu_0$ is not a minimizer of $j$, we have $\tilde\mu_k \in E_{\mu_0}$ for $k$ sufficiently large.
Theorem 6.10. Suppose that Assumptions (A1)-(A3) and (B1)-(B5) hold. Then $\mu_k \overset{*}{\rightharpoonup} \bar\mu$ in $M(B)$ along the whole sequence, and there exist $\bar k \in \mathbb{N}$ and $\zeta \in [3/4, 1)$ with
$$r_j(\mu_{k+1}) \le \zeta\, r_j(\mu_k) \tag{6.37}$$
for all $k \ge \bar k$. In particular, there is $c > 0$ with
$$r_j(\mu_k) \le c\, \zeta^k \tag{6.38}$$
for all $k \in \mathbb{N}$ sufficiently large.
Proof. The fact that $\mu_k \overset{*}{\rightharpoonup} \bar\mu$ in $M(B)$ along the whole sequence is already established in (6.6). It remains to show the improved convergence rate (6.37). To this end, for a fixed $s \in [0, 1]$ define the competitor $\mu_k^s$, with $\tilde\mu_k$ as in Definition (6.31). We will obtain (6.37) by estimating the residual of $\mu_k^s$ and choosing an optimal value of $s$. We start by noting the identity (6.39). Since $\hat v_k \in \bar U_{\hat i_k}$, from the above we deduce $\operatorname{supp} \mu_k^s \subset A_k^+$. Recall that $\mu_{k+1}$ is optimal in (5.13) by the definition of Algorithm 2. As $\operatorname{supp} \mu_k^s \subset A_k^+$, we infer
$$j(\mu_{k+1}) \le j(\mu_k^s) \quad \text{for all } s \in [0, 1]. \tag{6.40}$$
Next, we estimate the residual of $\mu_k^s$. By (6.39) and the regularity of $F$ we infer the expansion (6.41), where $R_s(\mu_k)$ is the remainder defined in (6.42). In order to estimate $R_s(\mu_k)$, first recall that $\{\mu_k\}_k \subset E_{\mu_0}$ by (6.15). Moreover, by Lemma 6.9, we have $\tilde\mu_k \in E_{\mu_0}$ for $k$ sufficiently large. Therefore, as $E_{\mu_0}$ is convex, we also have $\mu_k^s \in E_{\mu_0}$ for all $s \in [0, 1]$ and $k$ sufficiently large. Note that the set $K E_{\mu_0} := \{ K\mu \mid \mu \in E_{\mu_0} \}$ is compact, given that $K$ is weak*-to-strong continuous, see Proposition 5.3 iii). Denote by $L_{\mu_0}$ the Lipschitz constant of $\nabla F$ on the set $K E_{\mu_0}$, which exists by Assumption (A1). By the Cauchy-Schwarz inequality we then estimate $R_s(\mu_k)$. By minimality of $\bar\mu$ and (6.41)-(6.42), we infer (6.43). Recall that by construction $P_k(\hat v_k) = \max_{v \in B} P_k(v)$. Moreover, $\max_{v \in B} P_k(v) \ge 1$ by Proposition 6.5. Thus $P_k(\hat v_k) \ge 1$ and we can apply Proposition 6.1 to infer that $M_0 \delta_{\hat v_k}$ minimizes in (6.4), that is, (6.44) holds. We can now use the convexity of $F$ to derive an estimate in which the last equality uses Proposition 6.5. Since $\|\bar\mu\|_{M(B)} \le M_0$, see Section 6.2.3, we can apply (6.44) with $\eta = \bar\mu$. By (6.33) we then infer (6.45). Using (6.45) and (6.34) in (6.43) yields (6.46), where $c_2 > 0$ is the square of the constant in (6.34). Next, define the constant $c_1$. Notice that $c_1 > 0$, since (B4) holds. Moreover, $c_1 \le 1/2$, given that $\|\bar\mu\|_{M(B)} \le M_0$. Invoking (6.22), we conclude an estimate which, together with (6.46), yields (6.47) for all $k$ sufficiently large and all $s \in [0, 1]$.
Subtracting $j(\bar\mu)$ from both sides of (6.40) yields $r_j(\mu_{k+1}) \le r_j(\mu_k^s)$ for all $s \in [0, 1]$. Notice that $\min_{s \in [0,1]} \varphi(s) \le \zeta$, where $\zeta \in [3/4, 1)$. Thus, from (6.47) we obtain an integer $\bar k \in \mathbb{N}$ such that $r_j(\mu_{k+1}) \le \zeta\, r_j(\mu_k)$ for all $k \ge \bar k$, establishing (6.37). In particular, $r_j(\mu_k) \le r_j(\mu_{\bar k})\, \zeta^{k - \bar k}$ for all $k \ge \bar k$, and the final claim (6.38) follows by setting $c := r_j(\mu_{\bar k})\, \zeta^{-\bar k} > 0$.
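The choice of $\zeta$ follows the standard quadratic-majorant argument. As an illustration only (the precise constants are those derived in the preceding proof; the explicit form of $\varphi$ below is an assumption made for this sketch), suppose $\varphi(s) = 1 - c_1 s + c_2' s^2$ with $0 < c_1 \le 1/2$ and $c_2' > 0$. Minimizing over $[0,1]$ then gives a contraction factor strictly below one:

```latex
% Schematic minimization of the quadratic majorant
% \varphi(s) = 1 - c_1 s + c_2' s^2 over s \in [0,1],
% assuming 0 < c_1 \le 1/2 and c_2' > 0 (illustrative constants).
\varphi'(s) = -c_1 + 2 c_2' s = 0
  \quad\Longrightarrow\quad
  s^* = \frac{c_1}{2 c_2'} .
% Case 1: s^* \le 1, i.e. c_2' \ge c_1/2. Then
\min_{s \in [0,1]} \varphi(s)
  = 1 - \frac{c_1^2}{4 c_2'}
  \in \Bigl[\, 1 - \frac{c_1}{2},\, 1 \Bigr) \subset [3/4, 1).
% Case 2: s^* > 1, i.e. c_2' < c_1/2. Then \varphi decreases on [0,1], so
\min_{s \in [0,1]} \varphi(s)
  = \varphi(1) = 1 - c_1 + c_2'
  < 1 - \frac{c_1}{2} .
% In either case one may take
\zeta := \max\Bigl\{\, 1 - \frac{c_1^2}{4 c_2'},\; 1 - \frac{c_1}{2} \,\Bigr\}
  \in [3/4, 1),
% where the lower bound 3/4 uses c_1 \le 1/2.
```

In both cases the resulting factor lies in $[3/4, 1)$, consistent with the range of $\zeta$ used above.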
6.2.10. Proof of Theorem 3.8. Assume that Algorithm 1 does not converge after finitely many steps and generates a sequence $\{u_k\}_k$ in $M$. By Proposition 5.7 both problems $(P_M)$ and $(P_{M^+})$ admit a unique solution, given by $\bar u = \sum_{i=1}^N \bar\lambda_i \bar u_i$ and $\bar\mu = \sum_{i=1}^N \bar\lambda_i \delta_{\bar u_i}$, respectively. Thanks to Theorem 5.10, there exists a sequence $\{\mu_k\}_k$ generated by Algorithm 2 with $u_k = I(\mu_k)$. Invoking Theorem 5.10 as well as Theorem 6.10 yields $r_J(u_k) \le r_j(\mu_k) \le c\, \zeta^k$ for all $k$ sufficiently large, where $c > 0$ and $\zeta \in [3/4, 1)$. This shows (3.14). In addition, Theorem 6.10 ensures that $\mu_k \overset{*}{\rightharpoonup} \bar\mu$ along the whole sequence. Recalling that $I$ is weak*-to-weak* continuous by Proposition 5.2 iii), we infer $u_k \overset{*}{\rightharpoonup} I(\bar\mu)$. Since $\bar\mu$ minimizes in $(P_{M^+})$, by Theorem 5.4 ii) we infer that $I(\bar\mu)$ minimizes in $(P_M)$. As the minimizer is unique, we conclude $\bar u = I(\bar\mu)$.

Conclusions
We have introduced a fully-corrective generalized conditional gradient method (FC-GCG) for a class of non-smooth minimization problems in Banach spaces. For this algorithm we provided a global sublinear rate of convergence under the mild Assumptions (A1)-(A3), as well as a local linear convergence rate under Assumptions (B1)-(B5). Several example applications were considered, showing that it is possible to formulate a natural, problem-dependent set of assumptions which is easy to verify and implies Assumptions (B1)-(B5), thus ensuring linear convergence of our method. We demonstrated numerically the fast convergence for our first example, showing that, compared to standard generalized conditional gradient methods (GCG), our algorithm exhibits a vastly improved rate of convergence. Additionally, we discussed in detail the computational burden of each presented example, and we described viable strategies for solving the (partially) linearized problem (1.1). As in the Euclidean case of [52], we believe that all results in the present paper are transferable to non-smooth regularizers of the form $\Phi(G(\cdot))$, where $G$ is as in Assumptions (A1)-(A3) and $\Phi$ is a suitable convex, monotonically increasing function, e.g., $\Phi(G(u)) = (1/2)G(u)^2$. This generalization was omitted in the present paper for the sake of readability.
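To make the abstract scheme concrete, the following is a minimal, self-contained sketch of an FC-GCG iteration on the finite-dimensional toy instance $\min_u \tfrac12\|Au - b\|^2 + \beta\|u\|_1$, where the extremal points of the unit ball of $G = \beta\|\cdot\|_1$ are the signed spikes $\pm e_i/\beta$. All identifiers (`fcgcg`, `corrective_step`) and the use of plain projected gradient descent for the corrective step are illustrative choices made for this sketch, not taken from the paper.

```python
# Hedged sketch of fully-corrective generalized conditional gradient (FC-GCG)
# on a toy instance of (P_M):
#   min_u  0.5*||A u - b||^2 + beta*||u||_1   over  u in R^n.
# Atoms (extremal points of the unit ball of G) are the signed spikes
# v = s * e_i / beta. Names and solver choices are illustrative.
import numpy as np

def corrective_step(M, b, lam0, iters=2000):
    """Solve min_{lam >= 0} 0.5*||M lam - b||^2 + sum(lam), i.e. the
    finite-dimensional convex subproblem over cone(A_k), by projected
    gradient descent with step 1/L."""
    L = np.linalg.norm(M, 2) ** 2 + 1e-12   # Lipschitz constant of the gradient
    lam = lam0.copy()
    for _ in range(iters):
        grad = M.T @ (M @ lam - b) + 1.0
        lam = np.maximum(lam - grad / L, 0.0)
    return lam

def fcgcg(A, b, beta, n_iter=50):
    m, n = A.shape
    atoms = []                 # active set A_k: list of (index, sign)
    lam = np.zeros(0)
    u = np.zeros(n)
    for _ in range(n_iter):
        p = -A.T @ (A @ u - b)                 # dual variable: -K* grad F
        i = int(np.argmax(np.abs(p)))          # linear oracle over Ext(B)
        s = np.sign(p[i]) if p[i] != 0 else 1.0
        if np.abs(p[i]) / beta <= 1.0 + 1e-12 and atoms:
            break                              # max_v P_k(v) <= 1: optimal
        if (i, s) not in atoms:
            atoms.append((i, s))
            lam = np.append(lam, 0.0)
        # columns K v_j for the active atoms v_j = s * e_j / beta
        M = np.column_stack([sg * A[:, j] / beta for (j, sg) in atoms])
        lam = corrective_step(M, b, lam)
        keep = lam > 1e-12                     # drop atoms with zero weight
        atoms = [a for a, k in zip(atoms, keep) if k]
        lam = lam[keep]
        u = np.zeros(n)
        for (j, sg), l in zip(atoms, lam):
            u[j] += sg * l / beta
    return u

# tiny usage example on a random well-conditioned instance
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 8))
b = rng.standard_normal(20)
beta = 2.0
u = fcgcg(A, b, beta)
obj = 0.5 * np.linalg.norm(A @ u - b) ** 2 + beta * np.abs(u).sum()
```

On this instance the dual certificate $\max_{v} P_k(v) \le 1$, i.e. $\|A^\top(b - Au)\|_\infty \le \beta$, serves as the stopping criterion, mirroring the role of Proposition 6.5.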
Appendix A. Complements to Section 3

In this section we state and prove two results. The first concerns well-posedness of the minimization problem (3.1), while the second shows the equivalence, in a suitable sense, of (3.2) and (3.4).
showing that $u$ is a solution to (3.2). Conversely, let $u$ be a minimizer of (3.2). Since $u \in \operatorname{cone}(A_k^{u,+})$, there exists $\lambda \in \mathbb{R}^{N_k^+}_+$ representing $u$. Now let $\tilde\lambda \in \mathbb{R}^{N_k^+}_+$ be given. From the optimality of $u$ and the definition of the gauge function we obtain the required inequality, and thus $\lambda$ is a minimizer of (3.4).

Appendix B. Complements to Section 4
Here we collect the technical statements and proofs from Section 4. For the sake of clarity, we place the proofs for each example in a separate subsection.
B.1. Sparse source identification (Section 4.1). The next lemma justifies the well-posedness of the definition of $K^*$.

B.2. Rank-one matrix reconstruction by trace regularization (Section 4.2).
B.2.1. Proof of Lemma 4.8. Recall that every extremal point $U$ of $B$ necessarily satisfies $\operatorname{Tr}(U) = \beta^{-1}$ or $\operatorname{Tr}(U) = 0$. We first focus on those extremal points with $\operatorname{Tr}(U) = \beta^{-1}$. According to Lidskii's theorem, for every $U \in B$ there holds $\operatorname{Tr}(U) = \sum_{i \in \mathbb{N}} \sigma_i^U$. From these observations, we already see that every extremal point of $B$ has at most rank one. Indeed, if $U \in B$ with $\operatorname{Tr}(U) = \beta^{-1}$ is at least of rank two, i.e. $\sigma_1^U \ge \sigma_2^U > 0$, then we can decompose $U$ as a nontrivial convex combination $U = t U_1 + (1-t) U_2$ with $t \in (0, 1)$, where the second equality follows from $\beta^{-1} = \operatorname{Tr}(U) = \sum_{i \in \mathbb{N}} \sigma_i^U$.
Noting that $U_1, U_2 \in B$, as well as $U_1 \neq U_2$, we conclude that $U$ is not extremal. Hence, if $U \in B$ with $G(U) = 1$ is extremal, then $U$ has rank one, as in (4.18). The structure of $\bar U$ and its uniqueness, as well as the Lipschitz result on $K$, follow immediately from Assumption (D1), the assumptions on $K$, and the strict convexity of $F$; see Assumption 2.2. Let $U \in B$ with $\operatorname{Tr}(U) = \beta^{-1}$ and $U \ge 0$ be arbitrary but fixed. For the proof of the quadratic growth behavior we follow similar steps as in the finite-dimensional setting, see [42, Lemma 4]. Setting $\delta = \sigma_1^{\bar P} - \sigma_2^{\bar P} > 0$, we estimate.

Define $\tilde Q_{n,j} := Q_{n,j} \cap \Omega$ and let $J_n$ be the collection of indices $j \in \mathbb{N}$ such that $\tilde Q_{n,j} \neq \emptyset$. Let $P_n$ be the set of maps $u \in B$ such that $u = c_{n,j}$ in $\tilde Q_{n,j}$ for some $c_{n,j} \in [-1, 1]$ and for all $j \in J_n$. Fix $u \in P_n$ and $n \in \mathbb{N}$. For every $j \in J_n$, $k \in \mathbb{N}$, define $u_k(x) := h_k^{\lambda_{n,j}}(x)$ for all $x \in \tilde Q_{n,j}$, where $\lambda_{n,j} := (1 - c_{n,j})/2$. Therefore $\{u_k\}_k \subset \operatorname{Ext}(B)$. By (B.4) we have that $u_k \overset{*}{\rightharpoonup} c_{n,j}$ weakly* in $L^\infty(Q_{n,j})$ as $k \to \infty$, for all $j \in J_n$. Thus $u_k \overset{*}{\rightharpoonup} u$ in $L^\infty(\Omega)$. In particular, we have shown that $P_n \subset \overline{\operatorname{Ext}(B)}^{\,*}$. Therefore $P \subset \overline{\operatorname{Ext}(B)}^{\,*}$, where $P := \bigcup_{n \in \mathbb{N}} P_n$. As clearly $\overline{P}^{\,*} = B$, the proof is concluded.

This also implies the claimed identity. Finally, $I$ is weak*-to-weak* continuous: indeed, if $\mu_k \overset{*}{\rightharpoonup} \mu$ weakly* in $M(B)$, then by definition of weak* convergence we have $\langle\!\langle P, \mu_k \rangle\!\rangle \to \langle\!\langle P, \mu \rangle\!\rangle$ for all $P \in C(B)$. Note that the map $P(v) = \langle p, v \rangle$, $v \in B$, $p \in C$, belongs to $C(B)$. Thus, for all $p \in C$, we obtain the corresponding convergence, which thanks to (5.3) reads $I(\mu_k) \overset{*}{\rightharpoonup} I(\mu)$ weakly* in $M$, concluding the proof.

Define the functional $T_\mu$ by $T_\mu(y) := \int_B (Kv, y)_Y \,\mathrm{d}\mu(v)$ for all $y \in Y$. Notice that $T_\mu$ is well defined, since the map $v \mapsto (Kv, y)_Y = \langle K^*y, v \rangle$ is weak* continuous over $M$, and hence $\mu$-measurable. It is clear that $T_\mu$ is linear. Moreover, $T_\mu$ is continuous. Indeed, it holds $|T_\mu(y)| \le C\,\|\mu\|_{M(B)}\,\|y\|_Y$, where we recalled that the constant $C$ is defined in i) and is finite, since $B$ is norm bounded.
Therefore, by Riesz's theorem, there exists a unique element of $Y$ representing $T_\mu$, which defines $K\mu$.

Note that the function $(v, w) \mapsto (Kv, Kw)_Y$ is an element of $C(B \times B)$. Indeed, given a sequence $(v_k, w_k) \in B \times B$ such that $(v_k, w_k) \overset{*}{\rightharpoonup} (v, w)$ for $(v, w) \in B \times B$, we estimate the difference, where $C$ is defined in i). As $v_k, w_k, v, w \in B$, using the weak*-to-strong continuity of $K$ on $\operatorname{dom}(G)$, cf. Assumptions (A1)-(A3), we conclude that $(Kv_k, Kw_k)_Y \to (Kv, Kw)_Y$, proving that $(v, w) \mapsto (Kv, Kw)_Y$ is an element of $C(B \times B)$. Moreover, we have $\mu_k \otimes \mu_k \overset{*}{\rightharpoonup} \mu \otimes \mu$ in $M(B \times B)$, which, combined with (C.2) and the fact that $(v, w) \mapsto (Kv, Kw)_Y \in C(B \times B)$, implies the convergence $\|K\mu_k\|_Y \to \|K\mu\|_Y$. Combined with the weak convergence $K\mu_k \rightharpoonup K\mu$, we finally conclude $K\mu_k \to K\mu$ in $Y$, so that $K$ is weak*-to-strong continuous. Last, note that the operator $K^*$ is well-defined, linear, and continuous. For any $y \in Y$ and $\mu \in M(B)$ we obtain the desired identity, where again we used (5.4). This concludes the proof of iv).

Assume now that $Y$ is separable and fix $\mu \in M(B)$. Notice that $f \colon B \to Y$ defined by $f(v) := Kv$ is weakly $\mu$-measurable, since the map $v \mapsto (Kv, y)_Y = \langle K^*y, v \rangle$ is weak* continuous, and hence $\mu$-measurable, for each fixed $y \in Y$. As $Y$ is separable, we also have that $f$ is essentially separably valued. Therefore, Pettis' theorem ([34, Sec. II.1, Thm. 2]) implies that $f$ is strongly measurable with respect to $\mu$.

C.2.1. Proof of Proposition 6.5. Let $k \ge 1$. By construction, $\lambda^k \in \mathbb{R}^{N_k}$ solves the finite-dimensional coefficient problem corresponding to (5.13). Deriving the first-order necessary optimality conditions for this problem, we obtain that $P_k(u_i^k) \le 1$, with equality whenever $\lambda_i^k > 0$. Since $\lambda_i^k > 0$ by construction, we deduce that $P_k(u_i^k) = 1$ for $i = 1, \dots, N_k$. In particular, $P_k = 1$ on $A_k$. Finally, $\max_{v \in B} P_k(v) \ge \max_{v \in A_k} P_k(v) = 1$, concluding the proof.
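The optimality conditions derived above can be observed numerically. The following sketch (illustrative names and synthetic data; we assume $F(y) = \tfrac12\|y - b\|_Y^2$ and identify the atoms $Kv_i$ with the columns of a matrix `M`, which is not notation from the paper) solves the fully-corrective subproblem by projected gradient descent and checks that the dual variable equals one on every active atom and stays at most one elsewhere:

```python
# Numerical illustration (not from the paper) of the KKT conditions
# in the proof of Proposition 6.5: at the solution of
#   min_{lam >= 0}  0.5*||M lam - b||^2 + sum(lam),
# the dual values P_k(u_i) = (M^T (b - M lam))_i satisfy
# P_k(u_i) = 1 whenever lam_i > 0, and P_k(u_i) <= 1 otherwise.
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((30, 5))
# ground-truth coefficients guarantee that some atoms become active
b = M @ np.array([1.0, 2.0, 0.0, 0.0, 3.0])

lam = np.zeros(5)
L = np.linalg.norm(M, 2) ** 2          # Lipschitz constant of the gradient
for _ in range(20000):                  # projected gradient descent
    grad = M.T @ (M @ lam - b) + 1.0
    lam = np.maximum(lam - grad / L, 0.0)

P = M.T @ (b - M @ lam)                 # dual variable at the atoms
active = lam > 1e-8                     # support of the optimal lam
```

At the computed solution, `P[active]` is numerically equal to one, reproducing the identity $P_k = 1$ on $A_k$, while all other components of `P` remain below one.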
C.2.2. Proof of Proposition 6.6. Note that $\{\mu_k\}_k \subset E_{\mu_0}$ by (6.15). Arguing as in the proof of Theorem 6.2, we get $\|P_k - \bar P\|_{C(B)} \le C\, L_{\mu_0}\, \|y_k - \bar y\|_Y$, where we recall that $L_{\mu_0} > 0$ is the Lipschitz constant of $\nabla F$ on $K E_{\mu_0}$ and $C > 0$ is the constant in Proposition 5.3 i). Hence, it suffices to prove (6.19) for $\|y_k - \bar y\|_Y$. Let $N(\bar y) \subset Y$ and $\theta > 0$ denote the neighbourhood and constant from Assumption (B1), respectively. Recall that $\mu_k \overset{*}{\rightharpoonup} \bar\mu$ along the whole sequence by (6.6). By the weak*-to-strong continuity of $K$, see Proposition 5.3 iii), we then conclude $y_k \to \bar y$ in $Y$. Thus, there exists $M \in \mathbb{N}$ such that $y_k \in N(\bar y)$ for all $k \ge M$. Using the strong convexity of $F$ in $N(\bar y)$, we estimate $r_j(\mu_k) \ge \theta\, \|y_k - \bar y\|_Y^2 / 2$, where we used (5.11) in the final inequality. Thus (6.19) follows by rearranging the terms in the above estimate, recalling that $r_j(\mu_k) = j(\mu_k) - j(\bar\mu)$. The final part of the statement holds since $r_j(\mu_k) \to 0$ by Theorem 6.2.
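The strong-convexity step in the proof above is the standard estimate; schematically (assuming, for illustration, that $j(\mu) = F(K\mu) + \|\mu\|_{M(B)}$ and that (5.11) is the first-order optimality condition for $\bar\mu$):

```latex
% Strong convexity of F on N(\bar y) with parameter \theta:
F(y_k) \;\ge\; F(\bar y) + (\nabla F(\bar y),\, y_k - \bar y)_Y
        + \frac{\theta}{2}\,\| y_k - \bar y \|_Y^2 .
% Adding \|\mu_k\|_{M(B)} - \|\bar\mu\|_{M(B)} to both sides and using the
% first-order optimality condition (5.11) for \bar\mu to absorb the
% linear terms, we arrive at
r_j(\mu_k) \;=\; j(\mu_k) - j(\bar\mu)
        \;\ge\; \frac{\theta}{2}\,\| y_k - \bar y \|_Y^2 ,
% which rearranges to the square-root estimate in (6.19):
\| y_k - \bar y \|_Y \;\le\; \sqrt{\tfrac{2}{\theta}\, r_j(\mu_k)} .
```

The same square-root dependence on the residual then transfers to $\|P_k - \bar P\|_{C(B)}$ via the Lipschitz bound at the start of the proof.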