On the linear convergence rates of exchange and continuous methods for total variation minimization

We analyze an exchange algorithm for the numerical solution total-variation regularized inverse problems over the space M($\Omega$) of Radon measures on a subset $\Omega$ of R d. Our main result states that under some regularity conditions, the method eventually converges linearly. Additionally, we prove that continuously optimizing the amplitudes of positions of the target measure will succeed at a linear rate with a good initialization. Finally, we propose to combine the two approaches into an alternating method and discuss the comparative advantages of this approach.


The problem
The main objective of this paper is to develop and analyze iterative algorithms to solve the following infinite dimensional problem: inf µ∈M(Ω)
= µ M + f (Aµ), (P (Ω)) where Ω is a bounded open domain of R d , M(Ω) is the set of Radon measures on Ω, µ M is the total variation (or mass) of the measure µ, f : R m → R ∪ {+∞} is a convex lower semi-continuous function with non-empty domain and A : M(Ω) → R m is a linear measurement operator. An important property of Problem (P(Ω)) is that at least one of its solutions µ has a support restricted to s distinct points with s ≤ m (see e.g. [30,15,4]), i.e. is of the form with ξ i ∈ Ω and α i ∈ R. This property motivates us to study a class of exchange algorithms. They were introduced as early as 1934 [24] and then extended in various manners [23]. They consist in discretizing the domain Ω coarsely and then refining it adaptively based on the analysis of so-called dual certificates. If the refinement process takes place around the locations (ξ i ) only, these methods considerably reduce the computational burden compared to a finely discretized mesh. Our main results consist in a set of convergence rates for this algorithm that depend on the regularity of f and on the non-degeneracy of a dual certificate at the solution. We also show the linear convergence rate for first order algorithms that continuously vary the coefficients α i and x i of a discrete measure. Finally, we show that algorithms alternating between an exchange step and a continuous method share the best of both worlds: the global convergence guarantees of exchange algorithms together with the efficiency of first order methods. This yields a fast adaptive method with strong convergence guarantees for total variation minimization and related problems.

Applications
Our initial motivation to study the problem (P(Ω)) stems from signal processing applications. We recover an infinite dimensional version of the basis pursuit problem [7] by setting Similarly, the choice f (x) = τ 2 x − y 2 2 , leads to an extension of the LASSO [27] called Beurling LASSO [9]. Both problems proved to be extremely useful in engineering applications. They got a significant attention recently thanks to theoretical progresses in the field of super-resolution [9,26,6,13]. Our results are particularly strong for the quadratic fidelity term.

Numerical approaches in signal processing
The progresses on super-resolution [9,26,6,13] motivated researchers from this field to develop numerical algorithms for the resolution of Problem (P(Ω)). By far the most widespread approach is to use a fine uniform discretization and solve a finite dimensional problem. The complexity of this approach is however too large if one wishes high precision solutions. This approach was analyzed from a theoretical point of view in [25,13] for instance. The first papers investigating the use of (P(Ω)) for super-resolution purposes advocated the use of semi-definite relaxations [26,6], which are limited to specific measurement functions and domains, such as trigonometric polynomials on the 1D torus T. The limitations were significantly reduced in [10], where the authors suggested the use of Lasserre hierarchies. These methods are however currently unable to deal with large scale problems. Another approach suggested in [5], and referred to as a Frank-Wolfe algorithm, consists in adding one point to a discretization set iteratively, where a so-called dual certificate is maximal. More recently, [28] began investigating the use of methods that continuously vary the positions (x i ) and amplitudes (α i ) of discrete measures parameterized as µ = s i=1 α i δ xi . The authors gave sufficient conditions for a simple gradient descent on the product-space (α, x) to converge. In [3] and [11], this method was used alternatively with a Frank-Wolfe algorithm, the idea being to first add Dirac masses roughly at the right locations and then to optimize their locations and position continuously, leading to promising numerical results. Surprisingly enough, it seems that the connection with the mature field of semi-infinite programming has been ignored (or not explicitly stated) in all the mentioned references.

Some numerical approaches in semi-infinite programming
A semi-infinite program [23,16] is traditionally defined as a problem of the form where Q and Ω are subsets of R n and R m respectively, u : Q → R and c : Ω × Q → R are functions. The term semi-infinite stems from the fact that the variable q is finite-dimensional, but it is subject to infinitely many constraints c(x, q) ≤ 0 for x ∈ Ω. In order to see the connection between the semi-infinite program (SIP [Ω]) and our problem (P(Ω)), we can formulate its dual, which reads as This dual will play a critical role in all the paper and it is easy to relate it to a SIP by setting Q = R m , u = f * and c = |(A * q)(x)| − 1. Many numerical methods have been and are still being developed for semi-infinite programs and we refer the interested reader to the excellent chapter 7 of the survey book [23] for more insight. We sketch below two classes of methods that are of interest for our concerns.

Exchange algorithms
A canonical way of discretizing a semi-infinite program is to simply control finitely many of the constraints, say c(x, q) ≤ 0 for x ∈ Ω 0 ⊆ Ω, where Ω 0 is finite. The discretized problem SIP[Ω 0 ] can then be solved by standard proximal methods or interior point methods. In order to obtain convergence towards an exact solution of the problem, it is possible to choose a sequence (Ω k ) of nested sets such that k Ω k is dense in Ω. Solving the problems SIP[Ω k ] for large k however leads to a high numerical complexity due to the high number of discretization points. The idea of exchange algorithms is to iteratively update the discretization sets Ω k in a more clever manner than simply making them denser. A generic description is given by Algorithm 1. Set q k ∈ argmin q∈Q c(x,q)≤0,x∈Ω k u(q) 4: 5: Set Ω k+1 = Update Rule(Ω k , q k , k). 6: end while 7: Output: The last iterate q ∞ .

Algorithm 1 A Generic Exchange Algorithm
In this paper, we consider Update Rules of the form where the points x i k are local maximizers of c(·, q k ). At each iteration, the set of discretization points can therefore be updated by adding and dropping a few prescribed points, explaining the name 'exchange'. The simplest rule consists of adding the single most violating point, i.e.
It seems to be the first exchange algorithm and is nearly equivalent to the Remez algorithm from the 30's [24]. It can be shown to be equivalent to a Frank-Wolfe (a.k.a. conditional gradient) method up to an epigraphical lift [11]. These methods were introduced in the field of signal processing in [5] and the connection with exchange algorithms was proposed in [14]. The update rule (2) is sufficient to guarantee convergence in the generic case and to ensure a decay of the cost function in O 1 k , see [19]. Although 'exchange' suggests that points are both added and subtracted, methods for which Ω k ⊆ Ω k+1 are also coined exchange algorithms. The use of such rules often leads to easier convergence analyses, since we get monotonicity of the objective values u(q k ) for free [16]. Other examples [17] include only adding points if they exceed a certain margin, i.e. c(x, y) ≥ k , or all local maxima of c(q k , ·). In the case of convex functions f , algorithms that both add and remove points can be derived and analyzed with the use of cutting plane methods. All these instances have their pros and cons and perform differently on different types of problems. Since a semi-infinite program basically allows to minimize arbitrary continuous and finite dimensional problems, a theoretical comparison should depend on additional properties of the problem.

Continuous methods
Every iteration of an exchange algorithm can be costly: it requires solving a convex program with a number of constraints that increases if no discretization point is dropped. In addition, the problems tend to get more and more degenerate as the discretization points cluster, leading to numerical inaccuracies. In practice it is therefore tempting to use the following two-step strategy: i) find an approximate solution µ k = p k i=1 α i k δ x i k of the primal problem (P(Ω)) using k iterations of an exchange algorithm and ii) continuously move the positions X = (x i ) and amplitudes α = (α i ) starting from (α k , X k ) to minimize (P(Ω)) using a nonlinear programming approach such as a gradient descent, a conjugate gradient algorithm or a Newton approach.
This procedure supposes that the output µ k of the exchange algorithm has the right number p k = s of Dirac masses, that their amplitudes satisfy sign(α i ) = sign(α i ) and that µ k lies in the basin of attraction of the optimization algorithm around the global minimum µ . To the best of our knowledge, knowing a priori when those conditions are met is still an open problem and deciding when to switch from an exchange algorithm to a continuous method therefore relies on heuristics such as detecting when the number of masses p k stagnates for a few iterations. The cost of continuous methods is however much smaller than that of exchange algorithms since they amount to work over a small number s(d + 1) of variables. In addition, the instabilities mentioned earlier are significantly reduced for these methods. This observation was already made in [3,11] and proved in [28] for specific problems.

Contribution
Many recent results in the field of super-resolution provide sufficient conditions for a non degenerate source condition to hold [6,26,12,1,21]. The non degeneracy means that the solution q of (D(Ω)) is unique and that the dual certificate |A * q | reaches 1 at exactly s points, where it is strictly concave. The main purpose of this paper is to study the implications of this non degeneracy for the convergence of a class of exchange algorithms and for continuous methods based on gradient descents. Our main results are as follows: 1. We show an eventual linear convergence rate of a class of exchange algorithms for convex functions f with Lipschitz continuous gradient. More precisely, we prove that after a finite number of iterations N the algorithm outputs vectors q k such that the set contains exactly s-points (x 1 k , . . . , x s k ).
, we also show the linear convergence rate of the cost function J(µ k ) to J(µ ) and of the support in the following sense: after a number N of initial iterations, it will take no more that k τ = C log(τ −1 ) iterations to ensure that dist(X kτ +N , ξ) ≤ τ . A similar statement holds for the coefficient vectors α k .

Notation
In all the paper, Ω designs an open bounded domain of R d . The boundedness assumptions plays an important role to control the number of elements in discretization procedures. A grid Ω k is a finite set of points in Ω. Its cardinality is denoted by |Ω k |. The distance between two sets Ω 1 and Ω 2 is defined by Note that this definition of distance is not symmetric: in general dist(Ω 1 , Ω 2 ) = dist(Ω 2 , Ω 1 ). We let C 0 (Ω) denote the set of continuous functions on Ω vanishing on the boundary. The set of Radon measures M(Ω) can be identified as the dual of C 0 (Ω), i.e. the set of continuous linear forms on C 0 (Ω). For any sub-domain Ω k ⊂ Ω, we let M(Ω k ) denote the set of Radon measures supported on Ω k . For p ∈ [1, +∞], the L p -norm of a function u ∈ C 0 (Ω) is denoted by u p . The total variation of a measure µ ∈ M(Ω) is denoted µ M . It can be defined by duality as The p -norm of a vector x ∈ R m is also denoted x p . The Frobenius norm of a matrix M is denoted by M F . Let f : R m → R ∪ {+∞} denote a convex lower semi-continuous function with non-empty domain dom(f ) = {x ∈ R m , f (x) < +∞}. Its subdifferential is denoted ∂f . Its Fenchel transform f * is defined by If f is differentiable, we let f ∈ R m denote its gradient and if it is twice differentiable, we let f ∈ R m×m denote its Hessian matrix. We let for all (x 1 , x 2 ) ∈ R m × R m and all η ∈ ∂f (x 1 ). A differentiable function f is said to have an L-Lipschitz gradient if it satisfies f ( This implies that We recall the following equivalence [18]: 1. Let f : R m → R ∪ {+∞} denote a convex and closed function with non empty domain. Then the following two statements are equivalent: • f has an L-Lipschitz gradient. • f * is 1 L -strongly convex. The linear measurement operators A considered in this paper can be viewed as a collection of m continuous functions (a i ) 1≤i≤m . For x ∈ Ω, the notation A(x) corresponds to the vector [a 1 (x), . . . , a m (x)] ∈ R m .

Existence results and duality
In order to obtain existence and duality results, we will now make further assumptions.
Assumption 1. f : R m → R ∪ {∞} is convex and lower bounded. In addition, we assume that either dom(f ) = R m or that f is polyhedral (that is, its epigraph is a finite intersection of closed halfspaces).

Assumption 2.
The operator A is weak- * -continuous. Equivalently, the measurement functionals a * i defined by a * i , µ = (A(µ)) i are given by for functions a i ∈ C 0 (Ω). In addition, we assume that A is surjective on R m .
The following results relate the primal and the dual.
Proposition 2.2 (Existence and strong duality). Under Assumptions 1 and 2, the following statements are true: • The primal problem (P(Ω)) and its dual (D(Ω)) both admit a solution.
• The following strong duality result holds • Let (µ , q ) denote a primal-dual pair. They are related as follows Proof. The stated assumptions ensure the existence of a feasible measure µ. In addition, the primal function is coercive since f is bounded below. This yields existence of a primal solution. The existence of a dual solution stems from the compactness of the set {q ∈ R m , A * q ∞ ≤ 1} (which itself follows from the surjectivity of A) and the continuity of f * on its domain. The strong duality result follows from [2,Thm 4.2]. The primal-dual relationship directly derives from the first order optimality conditions.

The algorithm
We assume that an initial grid Ω 0 ⊆ Ω is given (e.g. a coarse Euclidean grid). Given a discretization Ω k , we can define a discretized primal problem (P(Ω k )) inf µ∈M(Ω k ) µ M + f (Aµ), (P(Ω k )) and its associated dual (D(Ω k )) In this paper, we will investigate the exchange rule below: The implementation of this rule requires finding X k , the set of all the local maximizers of |A * q k | exceeding 1.

A generic convergence result
The exchange algorithm above converges under quite weak assumptions. For instance, it is enough to assume that the function f is differentiable.
The data fitting function f : R m → R is differentiable with L-Lipschitz continuous gradient.
Alternatively, we may assume that the initial set Ω 0 is fine enough, which in particular implies that |Ω 0 | ≥ m.
Assumption 4. The initial set Ω 0 is such that A restricted to Ω 0 is surjective.
We may now present and prove our first result.
Theorem 3.1 (Generic convergence). Under assumptions 1, 2 and 3 or 4, a subsequence of (µ k , q k ) will converge in the weak- * -topology towards a solution pair (µ , q ) of (P(Ω)) and (D(Ω)), as well as in objective function value. If the solution of (P(Ω)) and/or (D(Ω)) is unique, the entire sequence will converge.
Proof. First remark that the sequence ( µ k M +f (Aµ k )) k∈N is non-increasing since the spaces M(Ω k ) are nested. Due to the boundedness below of f , the same must be true for ( µ k M ). Hence there exists a subsequence (µ k ), which we do not relabel, that weak- * converges towards a measure µ ∞ . Now, we will prove that the sequence of dual variables (q k ) k∈N is bounded. If Assumption 3 is satisfied, then f * is strongly convex and since 0 is a feasible point, we must have Since A 0 is surjective, the previous inequality implies that ( q k 2 ) k∈N is bounded. Hence, in both cases, the sequence (q k ) k∈N converges up to a subsequence to a point q ∞ .
The key is now to prove that A * q ∞ ∞ ≤ 1. To this end, let us first argue that the family (A * q k ) k∈N is equicontiuous. For this, let > 0 be arbitrary. Since the functions a i ∈ C 0 (Ω) all are uniformly continuous, there exists a δ > 0 with the property Consequently, Due to the convergence of (q k ) k∈N , the sequence (A * q k ) k∈N is converging strongly to A * q ∞ . We will now prove that A * q ∞ ∞ ≤ 1. If for some k, A * q k ∞ ≤ 1, we will have A * q = A * q k for all ≥ k, and in particular q ∞ = q k and thus A * q ∞ ≤ 1. Hence, we may assume that A * q k ∞ > 1 for each k, i.e. that we add at least one point to Ω k in each iteration. Now, towards a contradiction, assume that A * q ∞ ∞ = 1 + 2 for an > 0. Set δ as in (11). For each k ∈ N, let x k be the element in argmax x |(A * q k )(x)| which has the largest distance to Ω k . Due to a ∈ C 0 (Ω) for each k, there needs to exist a compact subset C ⊆ Ω such that ( Consequently, a subsequence (which we do not rename) of (x k ) must converge. Thus, for some k 0 and every k > k 0 , we have x k − x k0 2 < δ. We then have In the last estimate, we used the constraint of (D(Ω k )) and the fact that x k0 ∈ Ω k . Since the last inequality holds for every k ≥ k 0 , we obtain where we used the fact that (A * q k ) k converges strongly towards A * q ∞ . This is a contradiction, and hence, we do have A * q ∞ ∞ ≤ 1.
Overall, we proved that the primal-dual pair (µ ∞ , q ∞ ) is feasible. It remains to prove that it is actually a solution. To do this, let us first remark that µ ∞ M + f (Aµ ∞ ) ≥ −f * (q ∞ ) by weak duality. To prove the second inequality, first notice that the weak- * -continuity of A implies that Aµ k → Aµ ∞ . Assumption 1 furthermore implies that f is lower semi-continuous. As a supremum of linear functions, so is f * . Since also q k → q ∞ , we conclude Assumptions 1, 2 together with Proposition 2.2 imply exact duality of the discretized problems. This means

Reshuffling these inequalities yields
, the reverse inequality. Thus, µ ∞ and q ∞ fulfill the duality conditions, and are solutions. The final claim follows from a standard subsequence argument. Remark 1. Let us mention that the convergence result in Theorem 3.1 and its proof, is not new, see e.g. [22]. The proof technique can be applied to prove similar statements for other refinement rules. For instance, the result still holds if we add the single most violating point: The result that we have just shown is very generally applicable. It however does not give us any knowledge of the convergence rate. The next section will be devoted to proving a linear convergence rate in a significant special case.

Non degenerate source condition
The idea behind adding points to the grid adaptively is to avoid a uniform refinement, which results in computationally expensive problems (D(Ω k )). However, there is a priori no reason for the exchange rule not to refine in a uniform manner. In this section, we prove that additional assumptions improve the situation. First, we will from now on work under Assumption (3). It implies that the dual solutions q k are unique for every k, since Proposition (2.1) ensures the strong convexity of the Fenchel conjugate f * . We furthermore assume that the a j are smooth.
Assumption 5 (Assumption on the measurement functionals ). The measurement functions a j all belong to = C 0 (Ω) ∩ C 2 (Ω) and their first and second order derivatives are uniformly bounded on Ω. We hence may define We also assume the following regularity condition on the solution q of (D(Ω)), and its corresponding primal solution µ .
Assumption 6 (Assumption on the primal-dual pair). We assume that (P(Ω)) admits a unique s-sparse solution Let q denote the associated dual pair. We assume that the only points x for which |A * q (x)| = 1 are the points in ξ, and that the second derivative of |A * q | is negative definite in each point ξ i . It follows that there exists τ 0 > 0 and γ > 0 such that We note that if Equations (14) and (15) are valid for some (γ, τ 0 ), they are also valid for any (γ,τ 0 ) withγ ≤ γ andτ 0 ≤ τ 0 .
Assumption (6) may look very strong and hard to verify in advance. Recent advances in signal processing actually show that it is verified under clear geometrical conditions. First, there will always exists at most m-sparse solutions to problem (P(Ω)), [30,15,4]. Therefore, the main difficulty comes from the uniqueness of the primal solution and from the two regularity conditions (14) and (15). These assumptions are called non-degenerate source condition of the dual certificate A * q [13]. Many results in this direction have been shown for The papers [6,26,12] deal with different Fourier-type operators, [1] about a few other special cases whereas [21] provides an analysis for arbitrary integral operators sampled at random.

Auxiliary results
In this and the following sections, we always work under Assumptions 1, 2, 3 without further notice. We derive several lemmata that are direct consequences of the above assumptions. The first two rely strongly on the Lipschitz regularity of the gradient of f .
By strong convexity of f * and optimality ofq and q k , we get: Therefore ) and the conclusion follows from a triangle inequality.
Then for any q, we have We first claim that M and D only have the point q in common. Indeed µ solves the problem P(ξ) and by strong duality of the problem restricted to M(ξ), q solves D(ξ). By strong convexity of f , q is the unique solution D(ξ), this exactly means M ∩ D = {q }.
The fact that M ∩D = {q } implies that there exists a separating hyperplane there. Since the hyperplane must be tangent to M , it can be written as Rearranging this, we obtain which is the claim.
Before moving on, let us record the following proposition: Proof. The proof of the first inequality of (18) is a standard Taylor expansion : The proof of the second part of (18) follows the same lines as the first part and is left to the reader.
The next two lemmata aim at transferring bounds from the geometric distances of the sets X k , Ω k and ξ to bounds on |A * q k (ξ)|. Using Proposition 3.3, we may then transfer these bounds to bounds on the errors of the dual solutions and the dual (or primal) objective values.
Lemma 3.5. The following inequalities hold Proof of Lemma 3.5. To show (19), first notice that Indeed, by definition, the global maximum z of |A * q k | lies in X k and satisfies (A * q k ) (z) = 0. Furthermore, by construction, all points x in Ω k satisfy |A * q k (x)| ≤ 1. Using a Taylor expansion, we get for all x ∈ Ω Taking x as the point in Ω k minimizing the distance to z leads to (20). In addition, we have (A * q k ) ∞ ≤ Rκ hess by Lemma 3.2, so that A * q k ∞ ≤ 1 + with = Rκ hess dist(Ω k ,X k ) 2 2 . Now, letting C = {q | A * q ∞ ≤ 1}, we have just proven that q k ∈ (1 + )C. Furthermore, due to the optimality of q k for the discretized problem and to the fact that q is feasible for that problem, we will have f * (q k ) ≤ f * (q ), i.e., q k is included in the f * (q )-sub-level set of f * : M = {q ∈ R m |f * (q) ≤ f * (q )}. An application of Proposition 3.3 now yields the result.
Lemma 3.6. Suppose that dist(X k , ξ) ≤ δ and dist(Ω k , ξ) ≤ δ. Then Proof. Let y i k (resp. x i k ) be the point closest to ξ i in Ω k (resp. X k ). By assumption, we have Then, for all z ∈ [y i k , ξ i ], using the fact that (A * q k ) (x i k ) = 0, we get Hence, we have |A * q k (ξ i )| ≤ 1+2δRκ hess ξ i −y i k 2 ≤ 1+2δRκ hess dist(Ω k , ξ). To conclude, we use Proposition 3.3 again.
The last assertion takes full advantage of Assumption 6 and the fact that the function |A * q | is uniformly concave around its maximizers. It allows to transfer bounds from q k − q 2 to bounds on the distance from X k to ξ. and assume that q k − q 2 < c q , then Moreover, for each i, if B i is the ball or radius τ 0 around ξ i , then X k contains at most one point in B i and A * q k has the same sign as Proof. Define τ = κ ∇ γ q k − q and note that τ < τ 0 . By Proposition 3.4, we have for each x ∈ Ω The above inequalities together with Assumption 6 imply the following for all 1 ≤ i ≤ s: The estimate (A * q ) (x) 2 > γτ deserves a slightly more detailed justification than the others. Define w = x − ξ i and g(θ) = (A * q) (ξ i + θw), w for θ ∈ (0, 1). We may apply the mean value theorem to conclude that since w 2 = x − ξ i 2 > τ by assumption. The last estimate was the claim (iv). This implies a number of things. First, any local maximum of |A * q k | with |A * q k | ≥ 1 must lie within a distance of τ from the set ξ (since for all other points, we have |A * q k | < 1 -via (iii) -or (Aq k ) = 0 -via (iv)). Since |A * q k | is locally concave on the τ 0 -neighborhoods of the ξ i -this follows from (ii) -at most one local extremum furthermore exists in each such neighborhood. This is the claim.

Fixed grids estimates
In this section, we consider a fixed grid Ω 0 and ask what we need to assume about it in order to guarantee that the set of local maxima of |A * q 0 (x)| is close to true support ξ. We express our result in terms of a geometrical property that we can control, the width of the grid dist(Ω 0 , Ω).
Proof. It is trivial that dist(Ω 0 , X 0 ) ≤ dist(Ω 0 , Ω). Applying Lemma 3.5, we immediately obtain the bound on q 0 − q 2 . By the same lemma, In order to obtain the first bound, remark that q 0 − q 2 ≤ c q and use Proposition 3.7.
Remark 2. Note that Theorem 3.8 allows to control dist(ξ, X 0 ) but not dist(X 0 , ξ). Indeed each x ∈ X 0 is guaranteed to be close to a ξ i , but not every ξ i needs to have a point in X 0 closeby. Note however that the bounds on the optimal value indicates that in this case the missed ξ i is not crucial to produce a good candidate for solving the primal problem. We will provide more insight on this, in the case of f being strongly convex, in Section 4.

Eventual linear convergence rate
In this section, we provide a convergence rate for the iterative algorithm. As a follow-up to Remark 2, the proof of convergence relies on the fact that the distances dist(X k , ξ) and dist(ξ, X k ) are equal. In order to ensure this fact, one has to wait for a finite number of iterations, this is exactly the purpose of the next proposition.
There exists a finite number of iterations N , such that for all k ≥ N , X k has exactly s points, one in each B i . It follows that dist(X k , ξ) = dist(ξ, X k ). Moreover if S k is the set of active point of D(Ω k ), that is Proof. We first prove that By assumption (6), J + > J . Since (J(µ k )) k∈N converges to J(µ ), there exists k 2 ∈ N such that ∀k ≥ k 2 , J(µ k ) < J + . Hence µ k must for each 1 ≤ i ≤ s have points z i k ∈ Ω k such that µ k has non-zero mass at z i k . Consequently, |A * q k (z i k )| = 1, hence, each B i contains at least one point in Ω k such that |A * q k (z i k )| = 1. Notice that q k converges to q by Theorem 3.1. Hence there a finite number of iterations k 1 such that q k − q < c q for all k ≥ k 1 . By item (iii) of the proof of Proposition 3.7, |A * q k | < 1 outside ∪ i B i , and by item (ii), |A * q k | is strictly concave in each B i . Hence each B i contains exactly one maximizer of |A * q k | exceeding one.
We now move on to analyzing our exchange approach. Before formulating the main result, let us introduce a term: δ-regimes. Definition 1. We say that the algorithm enters a δ-regime at iteration k δ if for all k ≥ k δ , we have dist(ξ, X k ) ≤ δ. In particular it means that only points with a distance at most δ from ξ are added to the grid. . Let N be as in Proposition 3.9.
1. For any τ , the algorithm enters a τ -regime after a finite number of iterations.
2. Assume that N iterations have passed and that the algorithm is in a τ -regime with τ ≤τ 0 . Then for every α ∈ (0, 1) it takes no more than A α 2d + 1 iterations to enter an ατ -regime. Proof. Note that for any δ ≤τ 0 , if there exists p ∈ N such that we will enter an δ-regime after iteration p by applying Proposition 3.7.
To prove (1), note that we without loss of generality can assume that τ ≤τ 0 (since entering a τ -regime means in particular entering a τ -regime for any τ ≥ τ .) Then , since q k − q 2 tends to zero as k goes to infinity, (22) with δ = τ is true after a finite number of iterations.
To prove (2), we proceed as follows : Proposition 3.9 ensures that in each iteration, exactly one point is added in each ball {x ∈ Ω, x − ξ i 2 ≤ τ }. Let k 0 be the actual iteration, a covering number argument [29] ensures, for any ∆ that after δ 0 = 2d d/2 τ ∆ d iterations, each point in X k needs to lie at a distance at most ∆ from Ω k , i.e., dist(Ω k , X k ) ≤ ∆.
The main result will tell us how many iterations we need to enter a τ -regime. = κ ∇ Rγ c q and k 0 be the iteration on which the algorithm enters aτ 0 -regime. Then k 0 < ∞, and the algorithm will enter a τ -regime after no more than k 0 + k τ iterations, where Additionally, we will have for k ≥ k 0 + k τ + 1. In other words, the algorithm will eventually converge linearly.
Proof. The fact that k 0 < ∞ is the first assertion of Lemma 3.10. As for the other part, we argue as follows: Fix α ∈ (0, 1). Since we have entered aτ 0 -regime at iteration k 0 , Lemma 3.10 implies that it will take no more than A α 2d + 1 additional iterations to enter a ατ 0 . Repeating this argument, we see that after no more than n · A α 2d + 1 iterations, we will have entered a α nτ 0 regime. Choosing α = e −1/2d and n = 2d log (τ 0 /τ ) , we obtain the first statement.
The second statement immediately follows from Lemma 3.6 (as in the proof of Theorem 3.8) and the fact that entering a τ -regime exactly amounts to that dist(X k , ξ) ≤ τ for all future k, and therefore in particular dist(Ω k+1 , ξ) ≤ τ .
The inequality (23) upper-bounds the cost function for the problem (P(Ω k )). In practice, the numerical resolution of this problem is hard since Ω k contains clusters of points and in practice it is beneficial to solve the simpler discrete problem For this measure, we also obtain an a posteriori estimate of the convergence rate.
Proposition 3.12. Define µ k as the solution of (P(X k )), if dist(X k , ξ) ≤ τ , we have Proof. For any i, denote x i k a point in X k closest to ξ i and defineμ k = s i=1 α i δ x i k . We have J( µ k ) ≤ J(μ k ) and μ k M ≤ µ M . Furthermore, we have The last term in the inequality is dealt with the following estimate: As for the penultimate term, remember that q = −∇f (Aµ ). This implies By making a Taylor expansion of A * q in each ξ i , utilizing that the derivative vanishes there, and that (A * q ) (x) ≤ κ hess q 2 for each x ∈ Ω, we see that ( Overall, we obtain

Convergence of continuous methods
In this section, we study an alternative algorithm that consists of using nonlinear programming approaches to minimize the following finite dimensional problem: where X = (x 1 , . . . , x p ). This principle is similar to continuous methods in semi-infinite programming [23] and was proposed specifically for total variation minimization in [3,11,28,8]. By Proposition 3.9, we know that after a finite number of iterations, X k will contain exactly s points located in a neighborhood of ξ. This motivates the following hybrid algorithm: • Launch the proposed exchange method until some criterion is met. This yields a grid X (0) = X k and we let p = |X k |.
• Find the solution of the finite convex program • Use the following gradient descent: where τ is a suitably defined step-size (e.g. defined using Wolfe conditions).
We tackle the following question: does the gradient descent algorithm converge to the solution if initialized well enough?

Existence of a basin of attraction
This section is devoted to proving the existence of a basin of attraction of a descent method in G. Under two additional assumptions, we state our result in Proposition 4.1.
Assumption 7. The function f is twice differentiable and Λ-strongly convex.
The twice differentiability assumption is mostly due to convenience, but the strong convexity is crucial. The second assumption is related to the structure of the support ξ of the solution µ .

Assumption 8.
For any x, y ∈ Ω denote K(x, y) = a (x)a (y). The transition matrix is assumed to be positive definite, with a smallest eigenvalue larger than Γ > 0.
It is again possible to prove for many important operators A that this assumption is satisfied if the set ξ is separated. See the references listed in the discussion about Assumption 6. The following proposition describes the links between minimizing G and solving (P(Ω)). is global minimum of G. Additionally, G is differentiable with a Lipschitz gradient and strongly convex in a neighborhood of (α , ξ).
Hence, there exists a basin of attraction around (α , ξ) such that performing a gradient descent on G will yield the solution of (P(Ω)) at a linear rate.
The rest of this section is devoted to the proof of Proposition 4.1. Let us begin by stating a simple auxiliary result. Proof. If B * CB − λB * B is positive semidefinite, we claim holds. Since for v ∈ U arbitrary the latter is the case. Let us introduce some notation that will be used in this section: for an X = (x 1 , . . . , x p ) ∈ Ω p for some p, A(X) denotes the matrix [a i (x j )]. Analogously, A (X) and A (X) denote the operators respectively. Note that for q ∈ R m and X ∈ Ω p , We will also use the shorthands µ = i α i δ xi , G f (α, X) = f (Aµ), and, for α ∈ R p , D(α) denotes the operator We have so that in points (α, X) with α i = 0 for all i, and in particular in a neighborhood of (α , ξ), G is differentiable and its gradient is given by : As for the second derivatives, we have We may now prove our claims.
Proof 4.1. First, let us note that due to the optimality conditions of P(Ω), we know that Letting q = −∇f (Aµ), it is furthermore fruitful to decompose the Hessian of G into two parts: Now, |A * q | has local maxima in the points ξ i , so that (A * q ) (ξ) = 0. In these points, we furthermore have that sign(α i ) = A * q (ξ i ), so that the gradient of G given in (27) vanishes.
To prove the rest, it is enough to show that the Hessian of G f is positive definite in a neighborhood around (α , ξ). Let (α, X) be arbitrary. H 1 is an operator of the form M * 1 M 2 (X) * LM 2 (X)M 1 , with L = f (Aµ) : R m → R m and Due to the Λ-strong convexity of f , L Λ id. We furthermore have Let us now turn to M 2 (X) * M 2 (X). If we define M 2 (ξ) = A(ξ) A (ξ) , we have by Assumption (8). We however have Now, by definition of κ and κ ∇ , and similarly for M 2 (X) 2 = M 2 (X) * 2 . Also, we have, by (18): so that all in all We may now apply Lemma 4.2 twice to conclude where we defined

It remains to analyze H 2 . Define the bilinear form
Then, if we define w = ∇f (Aµ) − ∇f (Aµ ) = q − q, we have This makes it evident that The L-Lipschitz gradient of f proves that w 2 ≤ L Aµ − Aµ * 2 . Using (18) yields directly: We still need to bound H 2 . First remember that Assumption 6 asserts that for each i, sign α i (A * q ) −γ id and sign(α i ) = sign(α i ) in the ball of radius τ 0 around ξ i . Consequently, if (α, X) is chosen so that for each i, By definition of κ hess , we can further estimate Using (A * q ) (ξ i ) = 0, we obtain If we now define we obtain Further utilizing the definition of G 1 and (28), we arrive at Since G 1 (X), G 2 (α, X) → 0 for α → α and X → ξ, we obtain the claim.

Eventually entering the basin of attraction
The following proposition shows that (α, X k ) defined as the amplitudes and positions of the Dirac-components of the solution µ of (P(X k )), (α, X k ) will lie in the basin described by Proposition 4.1. This result is stated in Corollary 4.4, the rest of this section is dedicated to proving it.
To bound the latter, recall that Λ-strong convexity of f means that The optimality conditions for (P(Ω)) tell us that q = −∇f (Aµ ), and hence where we in the last step used that A * q ∞ ≤ 1. Plugging the above inequality in (29) yields The claim follows.
Corollary 4.4. By Proposition 3.9, if k is large enough then X k contains exactly s points. In this case, let µ k = s i=1 α i δxk i be the solution of (P(X k )). Applying Proposition 4.3, recalling that max i ξ i −x k i 2 ≤ dist(X k , ξ) and using the bound (24), we obtain : Since dist(X k , ξ) is guaranteed to eventually converge to zero by Theorem 3.11 and µ k M are bounded ( e.g. by lower boundedness of f and upper boundedness of J( µ k )) , ( α, X k ) will eventually lie in the basin of attraction of G.

Description of the hybrid approach
To conclude this paper, we propose a method alternating between an exchange step and a continuous gradient descent. It is detailed in Algorithm 2. The idea is, after each iteration of an exchange algorithm, to start a gradient descent of G initialized at the solution µ k of (P(X k )). If this gradient descent converges to a measurē µ k , we can subsequently test if it is an optimal point by checking ifq k = −∇f (Aμ k ) fulfills the stopping criterion A * q k ∞ ≤ 1 + , where is a user defined stopping criterion (the latter is justified by Proposition 3.3). If so, we may outputμ k , and if not, we may instead continue our exchange algorithm, possibly after adding also the support points ofμ k . Its behavior is described in the following theorem.
Overall, this method has many desirable properties: the continuous method should be used whenever the exchange method reaches its basin of attraction since its per iteration cost is much cheaper. However, it is unclear in general that this basin even exists. In that case, the exchange method should be preferred since it eventually converges linearly under quite mild assumptions. The proposed algorithmic scheme somehow captures the best of all methods. Let us notice that it is very similar in spirit to the sliding Frank-Wolfe algorithm proposed in [11], apart from the fact that we suggest adding all the points X k violating the constraints, while the single most violating point is added in [11]. We believe that the proposed analysis sheds some light on the good numerical performance of this method.
Arguably the most complicated step in this algorithm is to evaluate X k , the set of local maximizers of A * q k exceeding 1. This is an impossible task for an arbitrary function A * q k . However, a simple heuristic described in the next section provided rather satisfactory results for the measurement functions considered in this paper (trigonometric polynomials and Gaussian convolution).
Apart from this, let us outline that the subproblems in this algorithm are well suited for numerical resolution. In the exchange algorithm, we only solve the dual problems D(Ω k ) which are strongly convex. Hence firstorder methods for instance come with guarantees of convergence to q k in 2 -norm. Recovering the massesα k , solutions of P(X k ) is also stable since X k (the local maximizers of A * q k ) is typically a well separated set of low cardinality. The gradient descent (or alternative nonlinear programming approach) on G(α, X) is performed over a low dimensional set. If the convergence is not satisfactory (e.g. the norm of ∇G doesn't decay fast enough), it can be stopped, and we can switch back to the exchange algorithm. Set Ω k = Ω k−1 ∪ X k

7:
Solve D(Ω k ) to retrieve q k Convex -Stable

Numerical Experiments
To test our theory, we have implemented our algorithm in MATLAB. Before displaying the results of the experiments, let us discuss a few key steps in the implementation. In the entire section, we assume that Ω = [0, 1] d for d = 1 or 2 for simplicity. Note that this is no true restriction: we can always by scaling and translation ensure that Ω ⊆ [0, 1] d , and trivially extend the measurement functions by 0 to the entirety of [0, 1] d .
Evaluating X k Each iteration of the exchange algorithm requires the exact calculation of the local maximizers of A * q k exceeding 1. This is, in general, an impossible task. We resort to the following heuristic method: Given a q k , we first evaluate |A * q k | on a fixed rectangular grid G = ((n) −1 [0, . . . , n]) d , and determine all of the discrete peaks, i.e. points in which {A * q k } is larger than all of its neighbors in the grid, and where A * q k exceeds 1 − 1 for a threshold 1 > 0. Next, we start a gradient descent in each of these points, stopping them once (A * q k ) 2 is lower than another threshold. Since it is possible that several of these gradient descents land in the same point x, we subsequently check if the set contains sets of points which are too close to each other -if this is the case, we discard all but one of them in such a group. We finally remove any point in which |A * q k | is not larger than 1 − 2 , for a small 2 > 0. Solving the Discrete Problems We have chosen to solve the problems (D(Ω k )) and (P(X k )) using an accelerated proximal gradient descent [20].
We start by testing our algorithm on a popular instance of problem (P(Ω)): super-resolution of a measure µ ∈ M(0, 1) from finitely many of its Fourier moments We use a quadratic data fidelity term f (z) = L 2 z − y 2 2 . This example is well studied by the signal processing community [26,6,13,21].
We chose m to be equal to 30, and a vector y generated as Aµ 0 , where µ 0 is chosen at random as a 5-sparse atomic measure with amplitudes close to 1 or −1. The positions of the Dirac masses were chosen as a small random perturbation from a uniform grid. The initial grid Ω 0 was chosen as a uniform grid with 8 points, i.e. [0, 1 8 , . . . , 7 8 ]. We made 100 experiments, with 20 iterations of the exchange algorithm. The evolution of µ k and q k for the first iterations for a typical iteration is displayed in Figure 1. We see that after already 8 iterations, A * q k appears to be very close to A * q . Before this iteration, the algorithm 'chooses' to add points relatively uniformly to the grid, but after that, new points are only added close to ξ. This is further emphasized by Figure 2, in which X k is plotted for each iteration, along with size of Ω k .
To track the success of the algorithm a bit more systematically, we chose to track the evolution of dist(ξ, X k ), dist(Ω k , X k ) and dist(Ω k , ξ). The median over the 100 iterations, along with confidence intervals covering all experiments but the top and bottom 5% are plotted in Figures 3. We see that all of the quality measures seem to converge linearly to 0.
Finally, we performed the same analysis for the optimum gap min (P(Ω k )) − min (P(Ω)), the error q k − q 2 and the sizes of the grids Ω k . (min (P(Ω)) was in each case chosen as the lowest value of min (P(Ω k )) over all iterations k, and q as the corresponding dual solution). We see that the optimum gap seems to converge   exponentially to 0 right from the first iteration, wheras the error q k − q 2 initially does not. The 'two-phase'effect is also easy to spot: After about 5 − 6 iterations, the algorithm switches from adding many points to adding only few points close to ξ. Interestingly, the plateau of the q-errors seems to be simultaneuos with the 'phase-transition'.

Example 2: Super-resolution from Gaussian measurements in 2D
Next, we perform a study in a two-dimensional setting. We consider Ω = [−1, 1] 2 and measurement functions of the form where the points x i live on a Euclidean grid of size 64×64, restricted to the domain [−0.5, 0.5] 2 . We then add white Gaussian noise to the measurements, leading to pictures of the type shown in Fig. 5. Here, the true underlying measure contains 11 Dirac masses with random positive amplitudes and random locations on [−0.4, 0.4] 2 .

Exchange algorithm
The evolution of the grids Ω k and of the dual certificates |A * q k | is shown in Figure 6. As can be seen, points are initially added anywhere in the domain, but after a few iterations, they all cluster around the true locations, as expected from the theory. To further stress this phenomenon and illustrate our theorems and lemmata, we display many quantities of interest appearing in our main results in Fig. 7. the distance from X k to ξ (where ξ is estimated as X 40 ) on Fig. 7c, the distance from Ω k to ξ on Fig. 7b, the evolution of J( µ k ) − J( µ 40 ) on Fig. 7a, A * q k ∞ − 1 on Fig. 7e. Finally, the number of maxima of |A * q k | is shown on Fig. 7f. As can be seen, the number of maxima quickly stabilizes, suggesting that we reached a τ 0 -regime. Then all the quantities (cost function, distance from ξ, violation of the constraints) seem to converge to 0 linearly. This is not true after iteration 15, and we suspect that this is solely due to numerical inaccuracies when computing the solution of the discretized problems. Notice however that the accuracy of the Dirac locations drops below 10 −3 after 14 iterations, and that this accuracy is more than enough for the particular super-resolution application. Notice that if we wished to reach this accuracy with a fixed grid, we would need a Euclidean discretization containing 10 6 points, while we here needed only 152 (|Ω 14 | = 152). In addition, the 1 resolution is stable since it is accomplished on a grid X 14 containing only 11 points.

Continuous method
In this experiment, we evaluate the behavior of the gradient descent (26) depending on the initialization (α (0) , X (0) ) and on the number of iterations. We use the same setting as in the previous section. The left graph of Fig. 8 illustrates that the gradient descent typically converges linearly when initialized close enough to the true minimizer (α , ξ). This was predicted by Theorem 4.1. In this case (and actually all the others related to this experiment), it converges to machine precision in less than 1000 iterations. This is remarkable since the gradient descent is a simple algorithm that can be easily improved by using e.g. Nesterov acceleration (we proved that the function is locally convex) or other optimization schemes such as L-BFGS. In order to evaluate the size of the basin of attraction around the global minimizer, we start from random points of the form (α (0) , X (0) ) = (α , ξ) + (∆ α , ∆ X ), where ∆ α and ∆ X are random perturbations with an amplitude set as (∆ α , ∆ X ) 2 = γ (α , ξ) 2 , with γ in [0, 1]. We then run 50 gradient descents with different realizations of (α (0) , X (0) ) and record the success rate (i.e. the number of times the gradient descent converges to (α , ξ) with an accuracy of at least 10 −6 ). We plot this success rate with respect to γ in Fig. 8b. As can be seen, the success rate is always 1 when the relative error γ is less than 5%, showing that for this particular problem, a rather rough initialization suffices for the gradient descent to converge to the global minimizer.

Alternating method
The alternating method suggested in Algorithm 2 turns out to converge in a single iteration when applied to the setting described above. We therefore apply it to a more challenging scenario with 30 Dirac masses instead of 11 and more noise. The measurements y are shown in Fig. 9. We compare three implementations: a pure exchange method, an alternating method as in Algorithm 2 without line 14 and an alternating method as in in Algorithm 2 with line 14. The conclusions are as follows: • All methods rapidly conclude that the underlying measure contains 30 Dirac masses. (The pure exchange algorithm after 10 iterations, the alternating method with line 14 already after the first).
• The pure exchange algorithm quickly gets to a point close to the optimum. The positions then slowly converge to the tue locations. It does however eventually find the basin of attraction of G (in this example, it needed 10 iterations).
• Line 14 in the alternating method improves the convergence significantly. In fact, omitting it, we need 10 iterations to find the basin of attraction, whereas the version with the line finds it directly. Investigating this effect more closely is an interesting line of future research.       The red bars with crosses indicated the recovered measures. Apart from a slight bias in amplitude due to the 1 -norm, the ground truth is near perfectly recovered.