
On the linear convergence rates of exchange and continuous methods for total variation minimization

Abstract

We analyze an exchange algorithm for the numerical solution of total-variation regularized inverse problems over the space \(\mathcal {M}(\varOmega )\) of Radon measures on a subset \(\varOmega \) of \(\mathbb {R}^d\). Our main result states that under some regularity conditions, the method eventually converges linearly. Additionally, we prove that continuously optimizing the amplitudes and positions of the target measure will succeed at a linear rate with a good initialization. Finally, we propose to combine the two approaches into an alternating method and discuss the comparative advantages of this approach.

Introduction

The problem

The main objective of this paper is to develop and analyze iterative algorithms to solve the following infinite dimensional problem:

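$$\begin{aligned} \inf _{\mu \in \mathcal {M}(\varOmega )} J(\mu ) {\mathop {=}\limits ^{\mathrm{def.}}}\Vert \mu \Vert _{\mathcal {M}} + f(A\mu ), \end{aligned}$$
(\(\mathcal {P}(\varOmega )\))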

where \(\varOmega \) is a bounded open domain of \(\mathbb {R}^d\), \(\mathcal {M}(\varOmega )\) is the set of Radon measures on \(\varOmega \), \(\Vert \mu \Vert _{\mathcal {M}}\) is the total variation (or mass) of the measure \(\mu \), \(f:\mathbb {R}^m\rightarrow \mathbb {R}\cup \{+\infty \}\) is a convex lower semi-continuous function with non-empty domain and \(A: \mathcal {M}(\varOmega ) \rightarrow \mathbb {R}^m\) is a linear measurement operator.

An important property of problem (\(\mathcal {P}(\varOmega )\)) is that at least one of its solutions \(\mu ^\star \) has a support restricted to s distinct points with \(s\le m\) (see e.g. [3, 14, 33]), i.e. is of the form

$$\begin{aligned} \mu ^\star = \sum _{i=1}^s \alpha _i^\star \delta _{\xi _i}, \end{aligned}$$
(1)

with \(\xi _i\in \varOmega \) and \(\alpha _i^\star \in \mathbb {R}\). This property motivates us to study a class of exchange algorithms. They were introduced as early as 1934 [26] and then extended in various manners [25]. They consist in discretizing the domain \(\varOmega \) coarsely and then refining it adaptively based on the analysis of so-called dual certificates. If the refinement process takes place around the locations \((\xi _i)\) only, these methods considerably reduce the computational burden compared to a finely discretized mesh.

Our main results consist in a set of convergence rates for this algorithm that depend on the regularity of f and on the non-degeneracy of a dual certificate at the solution. We also show the linear convergence rate for first order algorithms that continuously vary the coefficients \(\alpha _i\) and \(x_i\) of a discrete measure. Finally, we show that algorithms alternating between an exchange step and a continuous method share the best of both worlds: the global convergence guarantees of exchange algorithms together with the efficiency of first order methods. This yields a fast adaptive method with strong convergence guarantees for total variation minimization and related problems.

Applications

Our initial motivation to study the problem (\(\mathcal {P}(\varOmega )\)) stems from signal processing applications. We recover an infinite dimensional version of the basis pursuit problem [6] by setting

$$\begin{aligned} f(x)=\iota _{\{y\}}(x) = {\left\{ \begin{array}{ll} 0 &{} \quad \text {if } x=y \\ +\infty &{} \quad \text {otherwise.} \end{array}\right. } \end{aligned}$$

Similarly, the choice \(f(x) = \frac{\tau }{2}\Vert x-y\Vert _2^2\) leads to an extension of the LASSO [29] called Beurling LASSO [8]. Both problems have proved extremely useful in engineering applications. They have received significant attention recently thanks to theoretical progress in the field of super-resolution [5, 8, 12, 28]. Our results are particularly strong for the quadratic fidelity term.

Another less popular application in approximation theory [14], which was revived recently [31], is “generalized” total variation minimization. Given a surjective Fredholm operator \(L:B(\varOmega )\rightarrow \mathcal {M}(\varOmega )\), where \(B(\varOmega )\) is a suitably defined Banach space, we consider the following problem

$$\begin{aligned} \inf _{u \in B(\varOmega )} \Vert Lu \Vert _{\mathcal {M}} + f(Au). \end{aligned}$$
(2)

The solutions of this problem can be proved to be (generalized) splines with free knots [31]. Following [15] and letting \(L^+\) denote a pseudo-inverse of L, solving this problem can be rephrased as

$$\begin{aligned} \inf _{\mu \in \mathcal {M}(\varOmega ), u_K\in \mathrm {ker}(L)} \Vert \mu \Vert _{\mathcal {M}} + f(A (L^+\mu +u_K)), \end{aligned}$$
(3)

which is a variant of (\(\mathcal {P}(\varOmega )\)) that can also be solved with the proposed algorithms.

Numerical approaches in signal processing

Progress in super-resolution [5, 8, 12, 28] motivated researchers from this field to develop numerical algorithms for solving problem (\(\mathcal {P}(\varOmega )\)). By far the most widespread approach is to use a fine uniform discretization and solve a finite dimensional problem. The complexity of this approach is however too large if one seeks high-precision solutions. This approach was analyzed from a theoretical point of view in [12, 27] for instance. The first papers investigating the use of (\(\mathcal {P}(\varOmega )\)) for super-resolution purposes advocated the use of semi-definite relaxations [5, 28], which are limited to specific measurement functions and domains, such as trigonometric polynomials on the 1D torus \(\mathbb {T}\). These limitations were significantly reduced in [9], where the authors suggested the use of Lasserre hierarchies. These methods are however currently unable to deal with large scale problems. Another approach, suggested in [4], consists in iteratively adding one point to a discretization set, at a location where a so-called dual certificate is maximal. The weights of a measure supported on the set of added points are then updated using an ad-hoc rule. The authors refer to this algorithm as a mix between a Frank–Wolfe (or conditional gradient) algorithm and a LASSO type method. More recently, [30] began investigating the use of methods that continuously vary the positions \((x_i)\) and amplitudes \((\alpha _i)\) of discrete measures parameterized as \(\mu =\sum _{i=1}^s \alpha _i \delta _{x_i}\). The authors gave sufficient conditions for a simple gradient descent on the product-space \((\alpha ,x)\) to converge. In [2] and [10], this method was alternated with a Frank-Wolfe algorithm, the idea being to first add Dirac masses roughly at the right locations and then to optimize their amplitudes and positions continuously, leading to promising numerical results.
Surprisingly enough, it seems that the connection with the mature field of semi-infinite programming has been ignored (or not explicitly stated) in all the mentioned references.

Some numerical approaches in semi-infinite programming

A semi-infinite program [17, 25] is traditionally defined as a problem of the form

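$$\begin{aligned} \min _{q\in Q} u(q) \quad \text {subject to} \quad c(x,q)\le 0 \text { for all } x\in \varOmega , \end{aligned}$$
(SIP\([\varOmega ]\))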

where Q and \(\varOmega \) are subsets of \(\mathbb {R}^m\) and \(\mathbb {R}^n\) respectively, \(u: Q \rightarrow \mathbb {R}\) and \(c : \varOmega \times Q\rightarrow \mathbb {R}\) are functions. The term semi-infinite stems from the fact that the variable q is finite-dimensional, but it is subject to infinitely many constraints \(c(x,q) \le 0\) for \(x\in \varOmega \). In order to see the connection between the semi-infinite program (SIP\([\varOmega ]\)) and our problem (\(\mathcal {P}(\varOmega )\)), we can formulate the dual of (\(\mathcal {P}(\varOmega )\)), which reads as

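$$\begin{aligned} \sup _{q\in \mathbb {R}^m, \Vert A^*q\Vert _\infty \le 1} -f^*(q). \end{aligned}$$
(\(\mathcal {D}(\varOmega )\))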

This dual will play a critical role throughout the paper, and it is easy to relate it to a SIP by setting \(Q=\mathbb {R}^m\), \(u= f^*\) and \(c(x,q)= \left| (A^*q)(x) \right| -1\).

Many numerical methods have been and are still being developed for semi-infinite programs and we refer the interested reader to the excellent chapter 7 of the survey book [25] for more insight. We sketch below two classes of methods that are of interest for our concerns.

Exchange algorithms

A canonical way of discretizing a semi-infinite program is to simply control finitely many of the constraints, say \(c(x,q) \le 0\) for \(x \in \varOmega _0 \subseteq \varOmega \), where \(\varOmega _0\) is finite. The discretized problem SIP\([\varOmega _0]\) can then be solved by standard proximal methods or interior point methods. In order to obtain convergence towards an exact solution of the problem, it is possible to choose a sequence \((\varOmega _k)\) of nested sets such that \(\bigcup _{k} \varOmega _k\) is dense in \(\varOmega \). Solving the problems SIP\([\varOmega _k]\) for large k however leads to a high numerical complexity due to the high number of discretization points. The idea of exchange algorithms is to iteratively update the discretization sets \(\varOmega _k\) in a more clever manner than simply making them denser. A generic description is given by Algorithm 1.

[Algorithm 1: generic exchange algorithm]

In this paper, we consider \(\mathrm {Update\_Rule}\)s of the form

$$\begin{aligned} \varOmega _{k+1} \subset \varOmega _k \cup \{x_k^1,\ldots , x_k^{p_k}\}, \end{aligned}$$

where the points \(x_k^{i}\) are local maximizers of \(c(\cdot ,q_k)\). At each iteration, the set of discretization points can therefore be updated by adding and dropping a few prescribed points, explaining the name ’exchange’. The simplest rule consists of adding the single most violating point, i.e.

$$\begin{aligned} \varOmega _{k+1}= \varOmega _k \cup \mathop {\mathrm {argmax}}_{x\in \varOmega } c(x, q_k). \end{aligned}$$
(4)

It seems to be the first exchange algorithm and it first appeared under a less general form as the Remez algorithm in the 30’s [26]. It also shares similarities with the Frank-Wolfe (a.k.a. conditional gradient) method [16], which iteratively adds a point at a location where the constraint is most violated. It however differs in the way the solution \(q_k\) is updated. The connection was discussed recently in [13] for problems where the total variation term is used as a constraint. The use of the Frank-Wolfe algorithm for penalized total variation problems was also clarified recently in [10] using an epigraphical lift.

The update rule (4) is sufficient to guarantee convergence in the generic case and to ensure a decay of the cost function in \(O\left( \frac{1}{k}\right) \), see [20]. Although ’exchange’ suggests that points are both added and subtracted, methods for which \(\varOmega _k \subseteq \varOmega _{k+1}\) are also coined exchange algorithms. The use of such rules often leads to easier convergence analyses, since we get monotonicity of the objective values \(u(q_k)\) for free [17]. Other examples [18] include only adding points if they exceed a certain margin, i.e. \(c(x,q_k) \ge \epsilon _k\), or adding all local maxima of \(c(\cdot ,q_k)\). In the case of convex functions f, algorithms that both add and remove points can be derived and analyzed with the use of cutting plane methods. All these instances have their pros and cons and perform differently on different types of problems. Since a semi-infinite program basically allows one to minimize arbitrary continuous and finite dimensional problems, a theoretical comparison should depend on additional properties of the problem.
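To make the structure of an exchange iteration with the rule (4) concrete, here is a minimal Python sketch on a toy semi-infinite program. Everything in it is an assumption made for illustration and does not come from the paper: the one-dimensional variable q, the objective \(u(q)=-q\), the constraint \(c(x,q)=q\,a(x)-1\) with \(a(x)=1-(x-0.3)^2\) on \([0,1]\), the closed-form discretized solver, and the grid search that stands in for the exact argmax.

```python
def exchange(omega0, solve_discretized, most_violating_point, n_iter, tol=1e-9):
    """Generic exchange loop: solve on the current grid, then add the most violated constraint."""
    omega = list(omega0)
    q = solve_discretized(omega)
    for _ in range(n_iter):
        x, violation = most_violating_point(q)
        if violation <= tol:          # q satisfies all constraints up to tol: stop
            break
        omega.append(x)               # rule (4): add the single most violating point
        q = solve_discretized(omega)
    return q, omega

# Toy problem (hypothetical): minimize u(q) = -q subject to q * a(x) - 1 <= 0 on [0, 1].
def a(x):
    return 1.0 - (x - 0.3) ** 2

def solve_discretized(omega):
    # With a > 0, the constraints on omega read q <= 1 / a(x), so the largest feasible q is:
    return 1.0 / max(a(x) for x in omega)

CANDIDATES = [i / 1000 for i in range(1001)]   # fine grid used only to locate the argmax

def most_violating_point(q):
    x = max(CANDIDATES, key=lambda z: q * a(z) - 1.0)
    return x, q * a(x) - 1.0

q_final, omega_final = exchange([0.0, 1.0], solve_discretized, most_violating_point, n_iter=10)
```

In this toy case a single added point (near 0.3) already makes the discretized solution feasible for the full problem; for the problems considered in the paper, the discretized solve is a convex program rather than a closed-form expression.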

Continuous methods

Every iteration of an exchange algorithm can be costly: it requires solving a convex program with a number of constraints that increases if no discretization point is dropped. In addition, the problems tend to get more and more degenerate as the discretization points cluster, leading to numerical inaccuracies. In practice it is therefore tempting to use the following two-step strategy: i) find an approximate solution \(\mu _k=\sum _{i=1}^{p_k} \alpha _k^i \delta _{x_k^i}\) of the primal problem (\(\mathcal {P}(\varOmega )\)) using k iterations of an exchange algorithm and ii) continuously move the positions \(X=(x_i)\) and amplitudes \(\alpha =(\alpha _i)\) starting from \((\alpha _k,X_k)\) to minimize (\(\mathcal {P}(\varOmega )\)) using a nonlinear programming approach such as a gradient descent, a conjugate gradient algorithm or a Newton approach.

This procedure supposes that the output \(\mu _k\) of the exchange algorithm has the right number \(p_k=s\) of Dirac masses, that their amplitudes satisfy \({{\,\mathrm{sign}\,}}(\alpha _i)={{\,\mathrm{sign}\,}}(\alpha _i^\star )\) and that \(\mu _k\) lies in the basin of attraction of the optimization algorithm around the global minimum \(\mu ^\star \). To the best of our knowledge, knowing a priori when those conditions are met is still an open problem and deciding when to switch from an exchange algorithm to a continuous method therefore relies on heuristics such as detecting when the number of masses \(p_k\) stagnates for a few iterations. The cost of continuous methods is however much smaller than that of exchange algorithms since they amount to working with a small number \(s(d+1)\) of variables. In addition, the instabilities mentioned earlier are significantly reduced for these methods. This observation was already made in [2, 10] and proved in [30] for specific problems.
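The continuous step ii) can be sketched as a plain gradient descent on the pair \((\alpha ,X)\) for the quadratic fidelity \(f(x)=\frac{\tau }{2}\Vert x-y\Vert _2^2\). The sketch below is an illustration under stated assumptions, not the implementation analyzed in the paper: the measurement functions \(a_j(x)=\sin (j\pi x)\) on \(\varOmega =(0,1)\), the values of `M`, `TAU`, the step size and the iteration count are all hypothetical, and the amplitudes are assumed to keep their sign so that \(|\alpha _i|\) is differentiable along the iterations.

```python
import math

M = 3        # number of measurements (hypothetical)
TAU = 1.0    # weight of the quadratic fidelity term (hypothetical)

def A_point(x):
    """A(x) for the assumed measurement functions a_j(x) = sin(j*pi*x)."""
    return [math.sin(j * math.pi * x) for j in range(1, M + 1)]

def dA_point(x):
    """Derivative of A(x) with respect to the position x."""
    return [j * math.pi * math.cos(j * math.pi * x) for j in range(1, M + 1)]

def residual(alphas, xs, y):
    """A mu - y for mu = sum_i alpha_i delta_{x_i}."""
    r = [-v for v in y]
    for alpha, x in zip(alphas, xs):
        for j, v in enumerate(A_point(x)):
            r[j] += alpha * v
    return r

def cost(alphas, xs, y):
    """sum_i |alpha_i| + (TAU / 2) * ||A mu - y||^2."""
    r = residual(alphas, xs, y)
    return sum(abs(al) for al in alphas) + 0.5 * TAU * sum(v * v for v in r)

def gradient_descent(alphas, xs, y, step=1e-3, n_iter=2000):
    alphas, xs = list(alphas), list(xs)
    for _ in range(n_iter):
        r = residual(alphas, xs, y)
        # Partial derivatives with respect to amplitudes and positions.
        g_alpha = [math.copysign(1.0, al) + TAU * sum(u * v for u, v in zip(A_point(x), r))
                   for al, x in zip(alphas, xs)]
        g_x = [TAU * al * sum(u * v for u, v in zip(dA_point(x), r))
               for al, x in zip(alphas, xs)]
        alphas = [al - step * g for al, g in zip(alphas, g_alpha)]
        xs = [x - step * g for x, g in zip(xs, g_x)]
    return alphas, xs

y = A_point(0.4)                 # data generated by a single unit Dirac mass at 0.4
alpha0, x0 = [0.8], [0.45]       # initialization with the right sign and a nearby position
alpha1, x1 = gradient_descent(alpha0, x0, y)
```

The number of optimization variables is \(s(d+1)\) (here \(s=1\), \(d=1\)), which is what makes each iteration cheap compared with solving a discretized convex program.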

Contribution

Many recent results in the field of super-resolution provide sufficient conditions for a non degenerate source condition to hold [5, 11, 23, 28]. The non degeneracy means that the solution \(q^\star \) of (\(\mathcal {D}(\varOmega )\)) is unique and that the dual certificate \(|A^*q^\star |\) reaches 1 at exactly s points, where it is strictly concave. The main purpose of this paper is to study the implications of this non degeneracy for the convergence of a class of exchange algorithms and for continuous methods based on gradient descents. Our main results are as follows:

  1.

    We show an eventual linear convergence rate of a class of exchange algorithms for convex functions f with Lipschitz continuous gradient. More precisely, we prove that after a finite number of iterations N the algorithm outputs vectors \(q_k\) such that the set

    $$\begin{aligned} X_k {\mathop {=}\limits ^{\mathrm{def.}}}\{x\in \varOmega \, \vert \, x \text { local maximizer of } \left| A^*q_k \right| , \ |A^*q_k|(x)\ge 1\} \end{aligned}$$
    (5)

    contains exactly s points \((x_k^1, \ldots , x_k^s)\).

    Letting \(\widehat{\mu }_k=\sum _{i=1}^s \alpha _i^{k} \delta _{x_i^k}\) denote the solution of the finite dimensional problem \(\inf _{\mu \in \mathcal {M}(X_k)} \Vert \mu \Vert _{\mathcal {M}} + f(A\mu )\), we also show the linear convergence rate of the cost function \(J(\widehat{\mu }_k)\) to \(J(\mu ^\star )\) and of the support in the following sense: after a number N of initial iterations, it will take no more than \(k_\tau = C\log (\tau ^{-1})\) iterations to ensure that the Hausdorff distance between the sets \(X_{k_\tau +N}\) and \(\xi \) is smaller than \(\tau \). A similar statement holds for the coefficient vectors \(\alpha ^{k}\). Importantly, similar results were derived under slightly different conditions by Pieper and Walter in [22]. The two works were carried out independently at the same time.

  2.

    We also show that a well-initialized gradient descent algorithm on the pair \((\alpha ,x)\) converges linearly to the true solution \(\mu ^\star \) and make explicit the width of the basin of attraction.

  3.

    We then show how the proposed guarantees may explain the success of methods alternating between exchange methods and continuous methods at each step, in a spirit similar to the sliding Frank-Wolfe algorithm [10].

  4.

    We finally illustrate the above results on total variation based problems in 1D and 2D.

Preliminaries

Notation

Throughout the paper, \(\varOmega \) denotes an open bounded domain of \(\mathbb {R}^d\). The boundedness assumption plays an important role to control the number of elements in the discretization procedures. A grid \(\varOmega _k\) is a finite set of points in \(\varOmega \). Its cardinality is denoted by \(|\varOmega _k|\). The distance from a set \(\varOmega _2\) to a set \(\varOmega _1\) is defined by

$$\begin{aligned} {{\,\mathrm{dist}\,}}(\varOmega _1 |\varOmega _2)=\sup _{x_2\in \varOmega _2} \inf _{x_1\in \varOmega _1} \Vert x_1-x_2\Vert _2. \end{aligned}$$
(6)

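Note that this definition of distance is not symmetric: in general \({{\,\mathrm{dist}\,}}(\varOmega _1 |\varOmega _2) \ne {{\,\mathrm{dist}\,}}(\varOmega _2 |\varOmega _1)\).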
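The asymmetry of the one-sided distance (6) is easy to see on finite grids. The following Python sketch evaluates the definition directly; the function name `dist` and the two example grids are chosen here for illustration only.

```python
import math

def dist(omega1, omega2):
    """One-sided distance dist(omega1 | omega2) of (6):
    the largest distance from a point of omega2 to the set omega1."""
    return max(min(math.dist(x1, x2) for x1 in omega1) for x2 in omega2)

# Hypothetical grids in R^1 (points are stored as tuples).
coarse = [(0.0,), (1.0,)]
fine = [(0.0,), (0.5,), (1.0,)]

d_coarse_given_fine = dist(coarse, fine)   # the point 0.5 is at distance 0.5 from the coarse grid
d_fine_given_coarse = dist(fine, coarse)   # every coarse point belongs to the fine grid
```

With these grids the first value is 0.5 and the second is 0, so swapping the arguments changes the result.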

We let \(\mathcal {C}_0(\varOmega )\) denote the set of continuous functions on \(\varOmega \) vanishing on the boundary. The set of Radon measures \(\mathcal {M}(\varOmega )\) can be identified as the dual of \(\mathcal {C}_0(\varOmega )\), i.e. the set of continuous linear forms on \(\mathcal {C}_0(\varOmega )\). For any sub-domain \(\varOmega _k\subset \varOmega \), we let \(\mathcal {M}(\varOmega _k)\) denote the set of Radon measures supported on \(\varOmega _k\). For \(p\in [1,+\infty ]\), the \(L^p\)-norm of a function \(u\in \mathcal {C}_0(\varOmega )\) is denoted by \(\Vert u\Vert _p\). The total variation of a measure \(\mu \in \mathcal {M}(\varOmega )\) is denoted \(\Vert \mu \Vert _{\mathcal {M}}\). It can be defined through duality as

$$\begin{aligned} \Vert \mu \Vert _{\mathcal {M}} = \sup _{\begin{array}{c} u\in \mathcal {C}_0(\varOmega ) \\ \Vert u\Vert _\infty \le 1 \end{array}} \mu (u). \end{aligned}$$
(7)

The \(\ell ^p\)-norm of a vector \(x\in \mathbb {R}^m\) is also denoted \(\Vert x\Vert _p\). The Frobenius norm of a matrix M is denoted by \(\Vert M\Vert _F\).

Let \(f:\mathbb {R}^m\rightarrow \mathbb {R}\cup \{+\infty \}\) denote a convex lower semi-continuous function with non-empty domain \({{\,\mathrm{dom}\,}}(f) = \{x\in \mathbb {R}^m, f(x)<+\infty \}\). Its subdifferential is denoted \(\partial f\). Its Fenchel transform \(f^*\) is defined by

$$\begin{aligned} f^*(y)=\sup _{x\in \mathbb {R}^m} \langle x,y \rangle - f(x). \end{aligned}$$

If f is differentiable, we let \(f'\in \mathbb {R}^m\) denote its gradient and if it is twice differentiable, we let \(f''\in \mathbb {R}^{m\times m}\) denote its Hessian matrix. We let \(\Vert f'\Vert _\infty =\sup _{x\in \varOmega } \Vert f'(x)\Vert _2\) and \(\Vert f''\Vert _\infty =\sup _{x\in \varOmega } \Vert f''(x)\Vert \), where \(\Vert f''(x)\Vert \) is the largest singular value of \(f''(x)\). A convex function f is said to be l-strongly convex if

$$\begin{aligned} f(x_2)\ge f(x_1) + \langle \eta , x_2-x_1 \rangle + \frac{l}{2}\Vert x_2-x_1\Vert _2^2 \end{aligned}$$
(8)

for all \((x_1,x_2)\in \mathbb {R}^m\times \mathbb {R}^m\) and all \(\eta \in \partial f(x_1)\). A differentiable function f is said to have an L-Lipschitz gradient if it satisfies \(\Vert f'(x_1)-f'(x_2)\Vert _2\le L \Vert x_1-x_2\Vert _2\). This implies that

$$\begin{aligned} f(x_2)\le f(x_1) + \langle f'(x_1) , x_2-x_1 \rangle +\frac{L}{2}\Vert x_2-x_1\Vert _2^2 \text{ for } \text{ all } (x_1,x_2)\in \mathbb {R}^m\times \mathbb {R}^m. \end{aligned}$$
(9)

We recall the following equivalence [19]:

Proposition 1

Let \(f:\mathbb {R}^m\rightarrow \mathbb {R}\cup \{+\infty \}\) denote a convex and closed function with non empty domain. Then the following two statements are equivalent:

  • f has an L-Lipschitz gradient.

  • \(f^*\) is \(\frac{1}{L}\)-strongly convex.
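As a sanity check of Proposition 1, consider the scalar quadratic \(f(x)=\frac{\tau }{2}(x-y)^2\), whose gradient is \(\tau \)-Lipschitz. Its Fenchel transform is \(f^*(q)=qy+\frac{q^2}{2\tau }\), which is \(\frac{1}{\tau }\)-strongly convex. The Python sketch below compares this closed form with a brute-force evaluation of the supremum on a grid and evaluates the gap in the strong convexity inequality (8); the values of `tau` and `y`, the search interval and the grid size are arbitrary choices made for the example.

```python
tau, y = 4.0, 1.5   # hypothetical values

def f(x):
    return 0.5 * tau * (x - y) ** 2

def f_conj_closed(q):
    """Closed-form Fenchel transform of f."""
    return q * y + q * q / (2.0 * tau)

def f_conj_numeric(q, lo=-10.0, hi=10.0, n=200001):
    """Brute-force sup_x (x * q - f(x)) over a uniform grid of [lo, hi]."""
    h = (hi - lo) / (n - 1)
    return max((lo + i * h) * q - f(lo + i * h) for i in range(n))

def strong_convexity_gap(q1, q2):
    """f*(q2) - f*(q1) - <(f*)'(q1), q2 - q1> - (1 / (2 tau)) (q2 - q1)^2; nonnegative by (8)."""
    grad = y + q1 / tau
    return f_conj_closed(q2) - f_conj_closed(q1) - grad * (q2 - q1) - (q2 - q1) ** 2 / (2.0 * tau)
```

For a quadratic the gap is zero up to rounding, i.e. inequality (8) holds with equality for \(l=\frac{1}{\tau }\).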

The linear measurement operators A considered in this paper can be viewed as a collection of m continuous functions \((a_i)_{1\le i\le m}\). For \(x\in \varOmega \), the notation A(x) corresponds to the vector \([a_1(x),\ldots , a_m(x)] \in \mathbb {R}^m\).
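For a discrete measure \(\mu =\sum _i \alpha _i \delta _{x_i}\) with distinct points \(x_i\), linearity gives \(A\mu =\sum _i \alpha _i A(x_i)\) and the total variation is \(\sum _i |\alpha _i|\). A minimal Python sketch of these two computations follows; the measurement functions \(a_j(x)=\sin (j\pi x)\) on \(\varOmega =(0,1)\) and the value of `M` are assumptions made for the example, not taken from the paper.

```python
import math

M = 3   # number of measurement functions (hypothetical)

def A_point(x):
    """Vector A(x) = [a_1(x), ..., a_m(x)] for the assumed a_j(x) = sin(j*pi*x)."""
    return [math.sin(j * math.pi * x) for j in range(1, M + 1)]

def A_measure(alphas, xs):
    """A mu for mu = sum_i alpha_i delta_{x_i}, computed as sum_i alpha_i A(x_i)."""
    out = [0.0] * M
    for alpha, x in zip(alphas, xs):
        for j, v in enumerate(A_point(x)):
            out[j] += alpha * v
    return out

def tv_norm(alphas):
    """Total variation of a discrete measure with distinct support points."""
    return sum(abs(alpha) for alpha in alphas)

measurements = A_measure([2.0, -1.0], [0.25, 0.75])
mass = tv_norm([2.0, -1.0])
```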

Existence results and duality

In order to obtain existence and duality results, we will now make further assumptions.

Assumption 1

\(f: \mathbb {R}^m \rightarrow \mathbb {R}\cup \left\{ \infty \right\} \) is convex and lower bounded. In addition, we assume that either \({{\,\mathrm{dom}\,}}(f)=\mathbb {R}^m\) or that f is polyhedral (that is, its epigraph is a finite intersection of closed halfspaces).

Assumption 2

The operator A is weak-\(*\)-continuous. Equivalently, the measurement functionals \(a_i^*\) defined by \(\left\langle a_i^*, \mu \right\rangle = (A(\mu ))_i\) are given by

$$\begin{aligned} \left\langle a_i^*, \mu \right\rangle = \int _{\varOmega } a_i d\mu , \end{aligned}$$

for functions \(a_i \in \mathcal {C}_0(\varOmega )\). In addition, we assume that A is surjective on \(\mathbb {R}^m\).

The following results relate the primal and the dual.

Proposition 2

(Existence and strong duality) Under Assumptions 1 and 2, the following statements are true:

  • The primal problem (\(\mathcal {P}(\varOmega )\)) and its dual (\(\mathcal {D}(\varOmega )\)) both admit a solution.

  • The following strong duality result holds

    $$\begin{aligned} \min _{\mu \in \mathcal {M}(\varOmega )} \Vert \mu \Vert _{\mathcal {M}(\varOmega )} + f(A\mu ) = \max _{q \in \mathbb {R}^m, \Vert A^*q\Vert _\infty \le 1} - f^*(q). \end{aligned}$$
    (10)
  • Let \((\mu ^\star ,q^\star )\) denote a primal-dual pair. They are related as follows

    $$\begin{aligned} A^*q^\star \in \partial _{\Vert \cdot \Vert _\mathcal {M}}(\mu ^\star ) \text{ and } -q^\star \in \partial f(A\mu ^\star ). \end{aligned}$$
    (11)

Proof

The stated assumptions ensure the existence of a feasible measure \(\mu \). In addition, the primal function is coercive since f is bounded below. Since \(\mathcal {M}(\varOmega )\) can be viewed as the dual of the Banach space \(\mathcal {C}_0(\varOmega )\), we further have that bounded sets in \(\mathcal {M}(\varOmega )\) are compact in the weak-\(*\)-topology (this is the Banach-Alaoglu theorem). Using these three facts, a standard argument now allows one to deduce the existence of a primal solution. The existence of a dual solution stems from the compactness of the set \(\{q \in \mathbb {R}^m, \Vert A^*q\Vert _\infty \le 1\}\) (which itself follows from the surjectivity of A) and the continuity of \(f^*\) on its domain. The strong duality result follows from [1, Thm 4.2]. The primal-dual relationship directly derives from the first order optimality conditions. \(\square \)

The left inclusion in equation (11) plays an important role, which is well detailed in [12]. It implies that the support of \(\mu ^\star \) satisfies: \({{\,\mathrm{supp}\,}}(\mu ^\star )\subseteq \{x\in \varOmega , |A^*q^\star (x)|=1\}\).

An exchange algorithm and its convergence

The algorithm

We assume that an initial grid \(\varOmega _0 \subseteq \varOmega \) is given (e.g. a coarse Euclidean grid). Given a discretization \(\varOmega _k\), we can define a discretized primal problem (\(\mathcal {P}(\varOmega _k)\))

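$$\begin{aligned} \inf _{\mu \in \mathcal {M}(\varOmega _k)} \Vert \mu \Vert _{\mathcal {M}} + f(A\mu ) \end{aligned}$$
(\(\mathcal {P}(\varOmega _k)\))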

and its associated dual (\(\mathcal {D}(\varOmega _k)\))

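$$\begin{aligned} \sup _{q\in \mathbb {R}^m, \ |(A^*q)(x)|\le 1 \text { for all } x\in \varOmega _k} -f^*(q). \end{aligned}$$
(\(\mathcal {D}(\varOmega _k)\))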

In this paper, we will investigate the exchange rule below:

$$\begin{aligned} \varOmega _{k+1}=\varOmega _{k} \cup X_{k} \hbox { where } X_k \hbox { is defined in } (5). \end{aligned}$$
(12)

The implementation of this rule requires finding \(X_k\), the set of all the local maximizers of \(\left| A^*q_k \right| \) exceeding 1.
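One simple way to approximate \(X_k\) in one dimension is to evaluate \(|A^*q_k|\) on a fine candidate grid and keep the grid points that are local maxima with value at least 1. The Python sketch below does this; it is a stand-in for a true continuous local maximization, and the measurement functions \(a_j(x)=\sin (j\pi x)\) on \(\varOmega =(0,1)\), the grid size `n` and the example vector `q` are hypothetical.

```python
import math

def certificate(q, x):
    """|A^* q|(x) for the assumed measurement functions a_j(x) = sin(j*pi*x)."""
    return abs(sum(qj * math.sin((j + 1) * math.pi * x) for j, qj in enumerate(q)))

def violating_local_maximizers(q, n=1000):
    """Grid points of (0, 1) that are local maximizers of |A^* q| with value >= 1."""
    xs = [i / n for i in range(1, n)]
    v = [certificate(q, x) for x in xs]
    found = []
    for i in range(1, len(xs) - 1):
        if v[i] >= 1.0 and v[i] > v[i - 1] and v[i] >= v[i + 1]:
            found.append(xs[i])
    return found

# Example: |A^* q|(x) = 1.2 * |sin(2*pi*x)| exceeds 1 around x = 0.25 and x = 0.75.
X_k = violating_local_maximizers([0.0, 1.2, 0.0])
```

In practice the grid maximizers would typically be refined by a local ascent method, and in higher dimensions a grid search becomes expensive.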

A generic convergence result

The exchange algorithm above converges under quite weak assumptions. For instance, it is enough to assume that the function f is differentiable.

Assumption 3

The data fitting function \(f:\mathbb {R}^m \rightarrow \mathbb {R}\) is differentiable with L-Lipschitz continuous gradient.

Alternatively, we may assume that the initial set \(\varOmega _0\) is fine enough, which in particular implies that \(|\varOmega _0|\ge m\).

Assumption 4

The initial set \(\varOmega _0\) is such that A restricted to \(\varOmega _0\) is surjective.

We may now present and prove our first result.

Theorem 1

(Generic convergence) Under Assumptions 1 and 2, and either Assumption 3 or Assumption 4, a subsequence of \((\mu _k,q_k)\) will converge in the weak-\(*\)-topology towards a solution pair \((\mu ^\star ,q^\star )\) of (\(\mathcal {P}(\varOmega )\)) and (\(\mathcal {D}(\varOmega )\)), as well as in objective function value. If the solution of (\(\mathcal {P}(\varOmega )\)) and/or (\(\mathcal {D}(\varOmega )\)) is unique, the entire sequence will converge.

Proof

First remark that the sequence \((\Vert \mu _k\Vert _\mathcal {M}+f(A\mu _k))_{k \in \mathbb {N}}\) is non-increasing since the spaces \(\mathcal {M}(\varOmega _k)\) are nested; in particular, it is bounded. Since f is bounded below, the sequence \((\Vert \mu _k \Vert _\mathcal {M})\) is bounded as well. Hence there exists a subsequence \((\mu _k)\), which we do not relabel, that weak-\(*\) converges towards a measure \(\mu _\infty \).

Now, we will prove that the sequence of dual variables \((q_k)_{k\in \mathbb {N}}\) is bounded. If Assumption 3 is satisfied, then \(f^*\) is strongly convex and since 0 is a feasible point, we must have \(q_k\in \{q\in \mathbb {R}^m, f^*(q)\le f^*(0)\}\), which is bounded. Alternatively, if Assumption 4 is satisfied, notice that \(1\ge \Vert A_k^* q_k\Vert _\infty \ge \Vert A_0^* q_k\Vert _\infty \), where \(A_k\) denotes the restriction of A to \(\mathcal {M}(\varOmega _k)\). Since \(A_0\) is surjective, the previous inequality implies that \((\Vert q_k\Vert _2)_{k \in \mathbb {N}}\) is bounded. Hence, in both cases, the sequence \((q_k)_{k\in \mathbb {N}}\) converges up to a subsequence to a point \(q_\infty \).

The key is now to prove that \(\Vert A^*q_\infty \Vert _\infty \le 1\). To this end, let us first argue that the family \((A^*q_k)_{k\in \mathbb {N}}\) is equicontinuous. For this, let \(\epsilon >0\) be arbitrary. Since the functions \(a_i \in \mathcal {C}_0(\varOmega )\) all are uniformly continuous, there exists a \(\delta >0\) with the property

$$\begin{aligned} \Vert x-y \Vert _2< \delta \, \Rightarrow \, \left| a_i(x)-a_i(y) \right| < \frac{\epsilon }{\sup _{k} \Vert q_k \Vert _1} \text { for all } i. \end{aligned}$$

Consequently,

$$\begin{aligned}&\Vert x-y \Vert _2< \delta \, \Rightarrow \, \left| (A^*q_k)(x)- (A^*q_k)(y) \right| \nonumber \\&\quad = \left| \sum _{i=1}^m (a_i(x)-a_i(y))q_k(i) \right| \le \sum _{i=1}^m \left| a_i(x)-a_i(y) \right| \left| q_k(i) \right| \nonumber \\&\quad < \frac{\epsilon }{\sup _{k} \Vert q_k \Vert _1} \sum _{i=1}^m \left| q_k(i) \right| \le \epsilon . \end{aligned}$$
(13)

Due to the convergence of \((q_k)_{k \in \mathbb {N}}\), the sequence \((A^*q_k)_{k \in \mathbb {N}}\) is converging strongly to \(A^*q_\infty \). We will now prove that \(\Vert A^*q_\infty \Vert _\infty \le 1\). If for some k, \(\Vert A^*q_k \Vert _\infty \le 1\), we will have \(A^*q_\ell = A^*q_k\) for all \(\ell \ge k\), and in particular \(q_\infty = q_k\) and thus \(\Vert A^*q_\infty \Vert _\infty \le 1\). Hence, we may assume that \(\Vert A^*q_k \Vert _\infty >1\) for each k, i.e. that we add at least one point to \(\varOmega _k\) in each iteration.

Now, towards a contradiction, assume that \(\Vert A^*q_\infty \Vert _\infty =1 + 2\epsilon \) for an \(\epsilon >0\). Set \(\delta \) as in (13). For each \(k \in \mathbb {N}\), let \(x_k^\star \) be the element in \(\mathop {\mathrm {argmax}}_x \left| (A^*q_k)(x) \right| \) which has the largest distance to \(\varOmega _k\). Due to \(a_\ell \in \mathcal {C}_0(\varOmega )\) for each \(\ell \), there needs to exist a compact subset \(C \subseteq \varOmega \) such that \((x_k^\star )_k \subseteq C\). Indeed, there exists for each \(\ell =1, \ldots , m\) a compact set \(C_\ell \subseteq \varOmega \) such that \(\left| a_\ell (x) \right| < (\sup _{k} \Vert q_k \Vert _1)^{-1}\) for all \(x \notin C_\ell \). Now, if \(x \notin C{\mathop {=}\limits ^{\mathrm{def.}}}\bigcup _{\ell =1}^m C_\ell \), we get

$$\begin{aligned} \left| A^*q_k(x) \right| =\left| \sum _{i=1}^m a_i(x)q_k(i) \right| \le \sum _{i=1}^m \left| a_i(x) \right| \left| q_k(i) \right| < \frac{1}{\sup _{k} \Vert q_k \Vert _1} \sum _{i=1}^m \left| q_k(i) \right| \le 1 \end{aligned}$$

for every k. Since \(\left| A^*q_k(x_k^\star ) \right| >1\), we conclude \((x_k^\star )_k \subseteq C\). Consequently, a subsequence (which we do not rename) of \((x_k^\star )\) must converge. Thus, for some \(k_0\) and every \(k > k_0\), we have \(\Vert x_k^\star - x_{k_0}^\star \Vert _2 < \delta \). We then have

$$\begin{aligned} \Vert A^*q_k \Vert _\infty = \left| (A^*q_k)(x_k^\star ) \right| < \left| (A^*q_k)(x_{k_0}^\star ) \right| + \epsilon \le 1+ \epsilon . \end{aligned}$$

In the last estimate, we used the constraint of (\(\mathcal {D}(\varOmega _k)\)) and the fact that \(x_{k_0}^\star \in \varOmega _k\). Since the last inequality holds for every \(k\ge k_0\), we obtain

$$\begin{aligned} \Vert A^*q_\infty \Vert _\infty = \lim _{k \rightarrow \infty } \Vert A^*q_k \Vert _\infty \le 1+ \epsilon , \end{aligned}$$

where we used the fact that \((A^*q_k)_k\) converges strongly towards \(A^*q_\infty \). This is a contradiction, and hence, we do have \(\Vert A^*q_\infty \Vert _\infty \le 1\).

Overall, we proved that the primal-dual pair \((\mu _\infty ,q_\infty )\) is feasible. It remains to prove that it is actually a solution. To do this, let us first remark that \(\Vert \mu _\infty \Vert _\mathcal {M}+ f(A\mu _\infty ) \ge - f^*(q_\infty )\) by weak duality. To prove the second inequality, first notice that the weak-\(*\)-continuity of A implies that \(A\mu _k \rightarrow A\mu _\infty \). Assumption 1 furthermore implies that f is lower semi-continuous. As a supremum of linear functions, so is \(f^*\). Since also \(q_k \rightarrow q_\infty \), we conclude

$$\begin{aligned} f^*(q_\infty ) + f(A\mu _\infty ) \le \liminf _{k \rightarrow \infty } f^*(q_k) + f(A\mu _k). \end{aligned}$$

Assumptions 1 and 2 together with Proposition 2 imply exact duality of the discretized problems. This means \(f^*(q_k) + f(A\mu _k) = -\Vert \mu _k \Vert _\mathcal {M}\). Since the norm is weak-\(*\)-l.s.c., we thus obtain

$$\begin{aligned} \liminf _{k \rightarrow \infty } f^*(q_k) + f(A\mu _k) = \liminf _{k \rightarrow \infty } - \Vert \mu _k \Vert _\mathcal {M}\le - \liminf _{k \rightarrow \infty } \Vert \mu _k \Vert _\mathcal {M}\le -\Vert \mu _\infty \Vert _\mathcal {M}. \end{aligned}$$

Reshuffling these inequalities yields \(\Vert \mu _\infty \Vert _\mathcal {M}+ f(A\mu _\infty ) \le - f^*(q_\infty )\), i.e., the reverse inequality. Thus, \(\mu _\infty \) and \(q_\infty \) fulfill the duality conditions, and are solutions. The final claim follows from a standard subsequence argument. \(\square \)

Remark 1

Let us mention that the convergence result in Theorem 1 and its proof are not new, see e.g. [24]. The proof technique can be applied to prove similar statements for other refinement rules. For instance, the result still holds if we add the single most violating point:

$$\begin{aligned} \varOmega _{k+1} \supseteq \varOmega _k \cup \{x_k\} \text{ with } x_k \in \mathop {\mathrm {argmax}}_{x\in \varOmega } |A^*q_k|. \end{aligned}$$
(14)

The result that we have just shown is very generally applicable. It however does not give us any knowledge of the convergence rate. The next section will be devoted to proving a linear convergence rate in a significant special case.

Non degenerate source condition

The idea behind adding points to the grid adaptively is to avoid a uniform refinement, which results in computationally expensive problems (\(\mathcal {D}(\varOmega _k)\)). However, there is a priori no reason for the exchange rule not to refine in a uniform manner. In this section, we prove that additional assumptions improve the situation. First, we will from now on work under Assumption 3. It implies that the dual solutions \(q_k\) are unique for every k, since Proposition 1 ensures the strong convexity of the Fenchel conjugate \(f^*\). We furthermore assume that the functions \(a_j\) are smooth.

Assumption 5

(Assumption on the measurement functionals) The measurement functions \(a_j\) all belong to \(\mathcal {C}_0^2(\varOmega ) {\mathop {=}\limits ^{\mathrm{def.}}}\mathcal {C}_0(\varOmega ) \cap \mathcal {C}^2(\varOmega )\) and their first and second order derivatives are uniformly bounded on \(\varOmega \). We hence may define

$$\begin{aligned}&\kappa {\mathop {=}\limits ^{\mathrm{def.}}}\sup _{\Vert q\Vert _2\le 1} \Vert A^*q \Vert _\infty = \sup _{x\in \varOmega } \Vert A(x) \Vert _2, \quad \kappa _\nabla {\mathop {=}\limits ^{\mathrm{def.}}}\sup _{\Vert q\Vert _2\le 1} \Vert (A^*q)' \Vert _\infty ,\\&\kappa _{{{\,\mathrm{hess}\,}}} {\mathop {=}\limits ^{\mathrm{def.}}}\sup _{\Vert q\Vert _2\le 1} \Vert (A^*q)'' \Vert _\infty . \end{aligned}$$

We also assume the following regularity condition on the solution \(q^\star \) of (\(\mathcal {D}(\varOmega )\)), and its corresponding primal solution \(\mu ^\star \).

Assumption 6

(Assumption on the primal-dual pair) We assume that (\(\mathcal {P}(\varOmega )\)) admits a unique s-sparse solution \(\mu ^\star \) supported on \(\xi = (\xi _i)_{i=1}^s\in \varOmega ^s\):

$$\begin{aligned} \mu ^\star = \sum _{i=1}^s \alpha _i^\star \delta _{\xi _i}. \end{aligned}$$
(15)

Let \(q^\star \) denote the associated dual solution. We assume that the only points x for which \(\left| A^*q^\star (x) \right| =1\) are the points in \(\xi \), and that the second derivative of \(\left| A^*q^\star \right| \) is negative definite at each point \(\xi _i\). It follows that there exist \(\tau _0>0\) and \(\gamma >0\) such that

$$\begin{aligned}&|A^*q^\star |''(x) \preccurlyeq - \gamma {{\,\mathrm{Id}\,}}\text { and } |A^*q^\star |(x) \ge \frac{\gamma \tau _0^2}{2} \quad \text { for }x\text { with } \min _{1\le i\le s} \Vert x-\xi _i \Vert _2 \le \tau _0, \end{aligned}$$
(16)
$$\begin{aligned}&\left| (A^*q^\star )(x) \right| \le 1- \frac{\gamma \tau _0^2}{2} \ \quad \text { for }x\text { with } \min _{1\le i\le s} \Vert x-\xi _i \Vert _2 \ge \tau _0. \end{aligned}$$
(17)

We note that if Equations (16) and (17) are valid for some \((\gamma ,\tau _0)\), they are also valid for any \((\tilde{\gamma },\tilde{\tau }_0)\) with \(\tilde{\gamma } \le \gamma \) and \(\tilde{\tau }_0 \le \tau _0\).

Assumption 6 may look very strong and hard to verify in advance. Recent advances in signal processing actually show that it holds under clear geometrical conditions. First, problem (\(\mathcal {P}(\varOmega )\)) always admits solutions that are at most m-sparse [3, 14, 33]. Therefore, the main difficulty comes from the uniqueness of the primal solution and from the two regularity conditions (16) and (17). Together, these assumptions are called the non-degenerate source condition on the dual certificate \(A^*q^\star \) [12]. Many results in this direction have been shown for \(f= \xi _{\left\{ b\right\} }\) or \(f(\cdot )=\frac{L}{2}\Vert \cdot - b\Vert _2^2\), where \(b=A\mu _0\) with \(\mu _0\) a finitely supported measure. The papers [5, 11, 28] deal with different Fourier-type operators, whereas [23] provides an analysis for arbitrary integral operators sampled at random.

Auxiliary results

In this and the following sections, we always work under Assumptions 1, 2 and 3 without further notice. We derive several lemmata that are direct consequences of the above assumptions. The first two rely strongly on the Lipschitz regularity of the gradient of f.

Lemma 1

(Boundedness of the dual variables) Let \(\bar{q}=\mathop {\mathrm {argmin}}_{q\in \mathbb {R}^m} f^*(q)\) denote the prox-center of \(f^*\). For all \(k\in \mathbb {N}\), we have

$$\begin{aligned} \Vert q_k\Vert _2\le \sqrt{2L (f^*(0)-f^*(\bar{q}))} + \Vert \bar{q}\Vert _2 {\mathop {=}\limits ^{\mathrm{def.}}}R. \end{aligned}$$
(18)

Proof of Lemma 1

For all \(k\in \mathbb {N}\), we have \(0\in \{q\in \mathbb {R}^m, \Vert A^*_k q\Vert _\infty \le 1\}\), hence \(f^*(q_k)\le f^*(0)\). By strong convexity of \(f^*\) and optimality of \(\bar{q}\) and \(q_k\), we get:

$$\begin{aligned} f^*(0)\ge f^*(q_k) \ge f^*(\bar{q})+\frac{1}{2L}\Vert q_k-\bar{q}\Vert _2^2. \end{aligned}$$
(19)

Therefore \(\Vert q_k-\bar{q} \Vert _2\le \sqrt{2L (f^*(0)-f^*(\bar{q}))}\) and the conclusion follows from a triangle inequality. \(\square \)
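As an illustration, consider the quadratic data term \(f(\cdot )=\frac{L}{2}\Vert \cdot - b\Vert _2^2\) mentioned above, whose gradient is L-Lipschitz. The following computation is a sketch under this specific choice of f; it shows that the radius R in (18) is then explicit:

$$\begin{aligned} f^*(q)&= \sup _{z\in \mathbb {R}^m} \left\langle q,z \right\rangle - \frac{L}{2}\Vert z-b\Vert _2^2 = \left\langle q,b \right\rangle + \frac{1}{2L}\Vert q\Vert _2^2, \\ \bar{q}&= -Lb, \qquad f^*(0)=0, \qquad f^*(\bar{q}) = -\frac{L}{2}\Vert b\Vert _2^2, \\ R&= \sqrt{2L\cdot \frac{L}{2}\Vert b\Vert _2^2} + L\Vert b\Vert _2 = 2L\Vert b\Vert _2. \end{aligned}$$

In this case, all dual iterates \(q_k\) therefore stay in the ball of radius \(2L\Vert b\Vert _2\), whatever the grids \(\varOmega _k\) are.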

Proposition 3

Let \(q^\star \) be the solution of (\(\mathcal {D}(\varOmega )\)). Let

$$\begin{aligned} \rho {\mathop {=}\limits ^{\mathrm{def.}}}\sqrt{\sup _{w\in \partial f^*(q^\star )} -L \left\langle w,q^\star \right\rangle }. \end{aligned}$$

Then for any q, we have

$$\begin{aligned} f^*(q^\star )-f^*(q) + \frac{1}{2L} \Vert q-q^\star \Vert _2^2 \le \rho ^2 L^{-1}(\sup _{x\in \xi } |A^*q|(x) -1). \end{aligned}$$

Proof

Let \(M=\{q\in \mathbb {R}^m, f^*(q)\le f^*(q^\star )\}\) denote the sub-level set of \(f^*\) and \( D = \left\{ q\in \mathbb {R}^m \, \vert \, \sup _{x\in \xi } |A^*q|(x) \le 1\right\} \). We first claim that M and D only have the point \(q^\star \) in common. Indeed, \(\mu ^\star \) solves the problem \(\mathcal P(\xi )\) and, by strong duality of the problem restricted to \(\mathcal {M}(\xi )\), \(q^\star \) solves \(\mathcal D(\xi )\). By strong convexity of \(f^*\), \(q^\star \) is the unique solution of \(\mathcal D(\xi )\), which exactly means that \(M\cap D= \{q^\star \}\).

The fact that \(M\cap D= \{q^\star \}\) implies that there exists a separating hyperplane there. Since the hyperplane must be tangent to M, it can be written as \(\left\{ q \, \vert \, \left\langle w, q \right\rangle = \left\langle w,q^\star \right\rangle \right\} \) for a \(w \in \partial f^*(q^\star )\), with \(D \subset \left\{ q \, \vert \, \left\langle w, q \right\rangle \ge \left\langle w,q^\star \right\rangle \right\} \). Consequently, letting \(\epsilon =\sup _{x\in \xi } |A^*q(x)|-1\), we have

$$\begin{aligned}(1+\epsilon )D \subset \left\{ q \, \vert \, \left\langle w, q \right\rangle \ge (1+\epsilon )\left\langle w,q^\star \right\rangle \right\} = \left\{ q \, \vert \, \left\langle w, q- q^\star \right\rangle \ge \epsilon \left\langle w,q^\star \right\rangle \right\} . \end{aligned}$$

Now, the strong convexity of \(f^*\) implies for every \(q \in (1+\epsilon )D \cap M\),

$$\begin{aligned} f^*(q) {\ge } f^*(q^\star ) {+} \left\langle w, q-q^\star \right\rangle {+} \frac{1}{2L} \Vert q-q^\star \Vert _2^2 \ge f^*(q^\star ) {+} \epsilon \left\langle w, q^\star \right\rangle {+} \frac{1}{2L} \Vert q-q^\star \Vert _2^2. \end{aligned}$$

Rearranging this, we obtain

$$\begin{aligned} -\epsilon \left\langle w, q^\star \right\rangle \ge f^*(q^\star ) -f^*(q) + \frac{1}{2L} \Vert q-q^\star \Vert _2^2. \end{aligned}$$

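Bounding \(-\left\langle w, q^\star \right\rangle \) by \(\rho ^2 L^{-1}\), which is possible by the definition of \(\rho \), yields the claim. \(\square \)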

Before moving on, let us record the following proposition:

Proposition 4

We have

$$\begin{aligned} \Vert A(x)-A(y) \Vert _2\le \kappa _\nabla \Vert x-y\Vert _2 \quad \text {and}\quad \Vert A'(x)-A'(y) \Vert _F\le \kappa _{{{\,\mathrm{hess}\,}}} \Vert x-y\Vert _2. \end{aligned}$$
(20)

Proof

The proof of the first inequality of (20) is a standard Taylor expansion:

$$\begin{aligned} \Vert A(x) - A(y) \Vert _2&= \sup _{\begin{array}{c} q \in \mathbb {R}^m \\ \Vert q \Vert _2=1 \end{array}} \left\langle q, A(x) - A(y) \right\rangle = \sup _{\begin{array}{c} q \in \mathbb {R}^m \\ \Vert q \Vert _2=1 \end{array}} \left| (A^*q)(x) - (A^*q)(y) \right| \\&\le \sup _{\begin{array}{c} q \in \mathbb {R}^m \\ \Vert q \Vert _2=1 \end{array}} \sup _{z \in [x,y]} \left\langle (A^*q)'(z),x-y \right\rangle \\&\le \sup _{\begin{array}{c} q \in \mathbb {R}^m \\ \Vert q \Vert _2=1 \end{array}} \Vert (A^*q)' \Vert _\infty \Vert x-y \Vert _2 \le \kappa _\nabla \Vert x-y\Vert _2. \end{aligned}$$

The proof of the second part of (20) follows the same lines as the first part and is left to the reader. \(\square \)

The next two lemmata aim at transferring bounds from the geometric distances of the sets \(X_k\), \(\varOmega _k\) and \(\xi \) to bounds on \(|A^* q_k(\xi )|\). Using Proposition 3, we may then transfer these bounds to bounds on the errors of the dual solutions and the dual (or primal) objective values.

Lemma 2

The following inequalities hold

$$\begin{aligned} \Vert A^*q_k \Vert _\infty&\le 1 + \frac{R\kappa _{{{\,\mathrm{hess}\,}}}}{2} {{\,\mathrm{dist}\,}}(\varOmega _k| X_k)^2, \nonumber \\ f^*(q^\star ) -f^*(q_k)&\le \frac{R\kappa _{{{\,\mathrm{hess}\,}}}\rho ^2}{2L} {{\,\mathrm{dist}\,}}(\varOmega _k| X_k)^2, \nonumber \\ \Vert q_k - q^\star \Vert _2&\le {{\,\mathrm{dist}\,}}(\varOmega _k| X_k) \sqrt{R\kappa _{{{\,\mathrm{hess}\,}}}} \rho . \end{aligned}$$
(21)

Proof of Lemma 2

To show (21), first notice that

$$\begin{aligned} \Vert A^*q_k \Vert _\infty \le 1 + \Vert (A^*q_k)'' \Vert _\infty \frac{{{\,\mathrm{dist}\,}}(\varOmega _k| X_k)^2}{2}. \end{aligned}$$
(22)

Indeed, by definition, the global maximum z of \(|A^*q_k|\) lies in \(X_k\) and satisfies \((A^*q_k)'(z)=0\). Furthermore, by construction, all points x in \(\varOmega _k\) satisfy \(|A^*q_k(x)|\le 1\). Using a Taylor expansion, we get for all \(x\in \varOmega \)

$$\begin{aligned} \left| A^*q_k(x)-A^*q_k(z) \right| \le \Vert (A^*q_k)'' \Vert _\infty \frac{\Vert x-z \Vert _2^2}{2}. \end{aligned}$$

Taking x as the point in \(\varOmega _k\) minimizing the distance to z leads to (22). In addition, we have \(\Vert (A^*q_k)'' \Vert _\infty \le R\kappa _{{{\,\mathrm{hess}\,}}}\) by Lemma 1, so that \(\Vert A^*q_k \Vert _\infty \le 1+\epsilon \) with \(\epsilon =R\kappa _{{{\,\mathrm{hess}\,}}} \frac{{{\,\mathrm{dist}\,}}(\varOmega _k| X_k)^2}{2}\).

Now, letting \(C= \left\{ q \, \vert \, \Vert A^*q \Vert _\infty \le 1\right\} \), we have just proven that \(q_k \in (1+\epsilon )C\). Furthermore, due to the optimality of \(q_k\) for the discretized problem and to the fact that \(q^\star \) is feasible for that problem, we will have \(f^*(q_k) \le f^*(q^\star )\), i.e., \(q_k\) is included in the \(f^*(q^\star )\)-sub-level set of \(f^*\): \(M=\{q\in \mathbb {R}^m | f^*(q)\le f^*(q^\star ) \}\). An application of Proposition 3 now yields the result. \(\square \)

Lemma 3

Suppose that \({{\,\mathrm{dist}\,}}(X_k|\xi )\le \delta \) and \({{\,\mathrm{dist}\,}}(\varOmega _k|\xi )\le \delta \). Then

$$\begin{aligned} f^*(q^\star ) - f^*(q_k)&\le \frac{2R\kappa _{{{\,\mathrm{hess}\,}}}\rho ^2}{L} \cdot \delta {{\,\mathrm{dist}\,}}(\varOmega _k|\xi ) \\ \Vert q_k- q^\star \Vert _2&\le \rho \sqrt{2R\kappa _{{{\,\mathrm{hess}\,}}}}\sqrt{ \delta \cdot {{\,\mathrm{dist}\,}}(\varOmega _k| \xi )}. \end{aligned}$$

Proof

Let \(y_k^i\) (resp. \(x_k^i\)) be the point closest to \(\xi _i\) in \(\varOmega _k\) (resp. \(X_k\)). By assumption, we have \(\Vert x_k^i-y_k^i\Vert _2\le 2\delta \). For all i, we have

$$\begin{aligned}&|A^*q_k(\xi _i)|\le |A^*q_k(y_k^i)| + \sup _{z\in [y_k^i, \xi _i]} \Vert (A^*q_k)'(z)\Vert _2 \Vert \xi _i -y_k^i\Vert _2 \nonumber \\&\quad \le 1 + \sup _{z\in [y_k^i, \xi _i]} \Vert (A^*q_k)'(z)\Vert _2 \Vert \xi _i -y_k^i\Vert _2. \end{aligned}$$
(23)

Then, for all \(z\in [y_k^i, \xi _i]\), using the fact that \((A^*q_k)'(x_k^i)=0\), we get

$$\begin{aligned} \Vert (A^*q_k)'(z)\Vert _2 \le R\kappa _{{{\,\mathrm{hess}\,}}} \Vert z-x_k^i\Vert _2 \le 2\delta R\kappa _{{{\,\mathrm{hess}\,}}}. \end{aligned}$$

Hence, we have \(|A^*q_k(\xi _i)| \le 1 + 2\delta R\kappa _{{{\,\mathrm{hess}\,}}} \Vert \xi _i -y_k^i\Vert _2 \le 1 + 2\delta R\kappa _{{{\,\mathrm{hess}\,}}} {{\,\mathrm{dist}\,}}(\varOmega _k|\xi )\). To conclude, we use Proposition 3 again. \(\square \)

The last assertion takes full advantage of Assumption 6 and the fact that the function \(|A^*q^\star |\) is uniformly concave around its maximizers. It allows us to transfer bounds on \(\Vert q_k-q^\star \Vert _2\) to bounds on the distance from \(X_k\) to \(\xi \).

Proposition 5

Define \(c_q=\gamma \min \left( \frac{\tau _0^2}{2\kappa },\frac{\tau _0}{\kappa _{\nabla }},\frac{1}{\kappa _{{{\,\mathrm{hess}\,}}}}\right) \) and assume that \(\Vert q_k-q^\star \Vert _2 < c_q\), then

$$\begin{aligned} {{\,\mathrm{dist}\,}}(\xi |X_k) \le \frac{\kappa _\nabla }{\gamma } \Vert q_k-q^\star \Vert _2. \end{aligned}$$

Moreover, for each i, if \(B_i\) is the ball of radius \(\tau _0\) around \(\xi _i\), then \(X_k\) contains at most one point in \(B_i\) and \(A^*q_k\) has the same sign as \(A^*q^\star (\xi _i)\) in \(B_i\).

Proof

Define \(\tau =\frac{\kappa _\nabla }{\gamma }\Vert q_k-q^\star \Vert \) and note that \(\tau < \tau _0\). By Proposition 4, we have for each \(x\in \varOmega \)

$$\begin{aligned} \left| (A^*q_k)(x)-(A^*q^\star )(x) \right|&\le \Vert A^*(q_k-q^\star ) \Vert _\infty \le \kappa \Vert q_k-q^\star \Vert _2< \frac{\gamma \tau _0^2}{2} \\ \Vert (A^*q_k)'(x)-(A^*q^\star )'(x) \Vert _2&\le \Vert (A^*(q_k-q^\star ))' \Vert _\infty \le \kappa _\nabla \Vert q_k-q^\star \Vert _2 = \gamma \tau \\ \Vert (A^*q_k)''(x)-(A^*q^\star )''(x) \Vert _2&\le \Vert (A^*(q_k-q^\star ))'' \Vert _\infty \le \kappa _{{{\,\mathrm{hess}\,}}} \Vert q_k-q^\star \Vert _2 <\gamma . \end{aligned}$$

The above inequalities together with Assumption 6 imply the following for all \(1\le i \le s\):

  1. (i)

    For x with \(\Vert x-\xi _i \Vert _2 \le \tau _0\), we have \( {{\,\mathrm{sign}\,}}(A^*q_k)(x) = {{\,\mathrm{sign}\,}}(A^*q^\star )(x) ={{\,\mathrm{sign}\,}}(A^*q^\star )(\xi _i). \)

  2. (ii)

    For x with \(\Vert x-\xi _i \Vert _2 \le \tau _0\), we have \( (\left| A^*q_k \right| )''(x) \prec (\left| A^*q^\star \right| )''(x) + \gamma {{\,\mathrm{id}\,}}\prec 0. \)

  3. (iii)

    For x with \(\Vert x-\xi _i \Vert _2 \ge \tau _0\), we have \(\left| (A^*q_k)(x) \right| <\left| (A^*q^\star )(x) \right| + \frac{\gamma \tau _0^2}{2} \le 1 - \frac{\gamma \tau _0^2}{2} + \frac{\gamma \tau _0^2}{2} =1. \)

  4. (iv)

    For x with \(\tau < \Vert x-\xi _i \Vert _2 \le \tau _0\), we have \(\Vert (A^*q_k)'(x) \Vert _2\ge \Vert (A^*q^\star )'(x) \Vert _2 - \gamma \tau > 0.\)

The estimate \(\Vert (A^*q^\star )'(x) \Vert _2 > \gamma \tau \) deserves a slightly more detailed justification than the others. Define \(w = x-\xi _i\) and \(g(\theta ) = \left\langle (A^*q^\star )'(\xi _i+\theta w),w \right\rangle \) for \(\theta \in [0,1]\). We may apply the mean value theorem to conclude that

$$\begin{aligned} g(1)-g(0) = g'(\hat{\theta }) = \left\langle (A^*q^\star )''(\xi _i+\hat{\theta }w) w,w \right\rangle \end{aligned}$$

for some \(\hat{\theta }\in (0,1)\). Since \(g(0)=\left\langle (A^*q^\star )'(\xi _i),w \right\rangle =\left\langle 0,w \right\rangle =0\), and \(\left\langle (A^*q^\star )''(\xi _i+\hat{\theta }w) w,w \right\rangle \le -\gamma \Vert w \Vert _2^2\), due to \((\left| A^*q^\star \right| )'' \preccurlyeq - \gamma {{\,\mathrm{id}\,}}\) in \(\{x\in \varOmega , \Vert x-\xi _i\Vert _2\le \tau _0\}\), we obtain

$$\begin{aligned} \Vert (A^*q^\star )'(x) \Vert _2 \ge \frac{1}{\Vert w \Vert _2} \left| \left\langle (A^*q^\star )'(x),w \right\rangle \right| = \frac{\left| g(1) \right| }{\Vert w \Vert _2} \ge \frac{\gamma \Vert w \Vert _2^2}{\Vert w \Vert _2} > \gamma \tau , \end{aligned}$$

since \(\Vert w \Vert _2=\Vert x-\xi _i \Vert _2 > \tau \) by assumption. The last estimate was the claim (iv).

This implies a number of things. First, any local maximum of \(\left| A^*q_k \right| \) with \(\left| A^*q_k \right| \ge 1\) must lie within a distance of \(\tau \) from the set \(\xi \), since for all other points we have either \(\left| A^*q_k \right| < 1\) by (iii) or \((A^*q_k)'\ne 0\) by (iv). Second, since \(\left| A^*q_k \right| \) is strictly concave on the \(\tau _0\)-neighborhoods of the \(\xi _i\) by (ii), at most one local extremum exists in each such neighborhood. This is the claim. \(\square \)

Fixed grids estimates

In this section, we consider a fixed grid \(\varOmega _0\) and ask what we need to assume about it in order to guarantee that the set of local maxima of \( \left| A^*q_0(x) \right| \) is close to the true support \(\xi \). We express our result in terms of a geometrical property that we can control, the width of the grid \({{\,\mathrm{dist}\,}}(\varOmega _0|\varOmega )\).

Theorem 2

Assume that \({{{\,\mathrm{dist}\,}}(\varOmega _0|\varOmega )} \le \frac{c_q}{\rho \sqrt{\kappa _{{{\,\mathrm{hess}\,}}}}}\), then

$$\begin{aligned} {{{\,\mathrm{dist}\,}}(\xi | X_0)}&\le \frac{\kappa _\nabla \sqrt{R\kappa _{{{\,\mathrm{hess}\,}}}}\rho }{2\gamma } {{\,\mathrm{dist}\,}}(\varOmega _0|\varOmega )\\ \Vert q_0 - q^\star \Vert _2&\le \rho \sqrt{R\kappa _{{{\,\mathrm{hess}\,}}}}{{{\,\mathrm{dist}\,}}(\varOmega _0|\varOmega )} \\ \inf (\mathcal P(\varOmega _0))&\le \inf (\mathcal {P}(\varOmega )) + \frac{R\kappa _{{{\,\mathrm{hess}\,}}}\rho ^2}{2L} {{\,\mathrm{dist}\,}}(\varOmega _0|\varOmega )^2 \end{aligned}$$

Proof

It is trivial that \({{\,\mathrm{dist}\,}}(\varOmega _0|X_0) \le {{\,\mathrm{dist}\,}}(\varOmega _0|\varOmega )\). Applying Lemma 2, we immediately obtain the bound on \(\Vert q_0-q^\star \Vert _2\). By the same lemma,

$$\begin{aligned} \inf (\mathcal P(\varOmega _0))&= \sup (\mathcal D(\varOmega _0)) = -f^*(q_0) \le -f^*(q^\star ) + \frac{R\kappa _{{{\,\mathrm{hess}\,}}}\rho ^2}{2L} {{\,\mathrm{dist}\,}}(\varOmega _0|X_0)^2 \\&\le \sup (\mathcal {D}(\varOmega )) + \frac{R\kappa _{{{\,\mathrm{hess}\,}}}\rho ^2}{2L} {{\,\mathrm{dist}\,}}(\varOmega _0|\varOmega )^2 \\&= \inf (\mathcal {P}(\varOmega )) + \frac{R\kappa _{{{\,\mathrm{hess}\,}}}\rho ^2}{2L} {{\,\mathrm{dist}\,}}(\varOmega _0|\varOmega )^2. \end{aligned}$$

In order to obtain the first bound, note that \(\Vert q_0-q^\star \Vert _2 \le c_q\) and use Proposition 5. \(\square \)

Remark 2

Note that Theorem 2 allows us to control \({{\,\mathrm{dist}\,}}(\xi | X_0)\) but not \({{\,\mathrm{dist}\,}}(X_0| \xi )\). Indeed, each \(x \in X_0\) is guaranteed to be close to a \(\xi _i\), but not every \(\xi _i\) needs to have a point of \(X_0\) close by. Note however that the bound on the optimal value indicates that, in this case, the missed \(\xi _i\) is not crucial to produce a good candidate for solving the primal problem. We will provide more insight on this, in the case of f being strongly convex, in Sect. 4.
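The asymmetry discussed in this remark can be checked on a toy example. The sketch below assumes that \({{\,\mathrm{dist}\,}}(A|B)\) denotes the directed distance \(\sup _{b\in B}\inf _{a\in A}\Vert a-b\Vert _2\); this reading is inferred from the remark itself and should be compared with the definition given earlier in the paper. The function name and the point sets are hypothetical.

```python
import numpy as np

def dist(A, B):
    """Directed distance dist(A|B) = max over b in B of min over a in A of ||a - b||_2.

    A and B are finite point sets given as sequences of points (scalars in 1D).
    The interpretation of the notation is an assumption, see the text above.
    """
    A = np.asarray(A, dtype=float).reshape(len(A), -1)
    B = np.asarray(B, dtype=float).reshape(len(B), -1)
    pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # shape (|A|, |B|)
    return pairwise.min(axis=0).max()

xi = [0.2, 0.8]   # true support (hypothetical)
X0 = [0.21]       # local maximizers found on the grid: the spike at 0.8 is missed

print(dist(xi, X0))  # small: every point of X0 is close to some xi_i
print(dist(X0, xi))  # large: xi_2 = 0.8 has no point of X0 nearby
```

Here the first quantity is about 0.01 while the second is about 0.59, which is exactly the situation where \({{\,\mathrm{dist}\,}}(\xi |X_0)\) is controlled but \({{\,\mathrm{dist}\,}}(X_0|\xi )\) is not.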

Eventual linear convergence rate

In this section, we provide an asymptotic convergence rate for the iterative algorithm. As a follow-up to Remark 2, the proof of convergence relies on the fact that the distances \({{\,\mathrm{dist}\,}}(X_k| \xi )\) and \({{\,\mathrm{dist}\,}}( \xi |X_k)\) eventually become equal. Proving this is exactly the purpose of the next proposition.

Proposition 6

Let \(B_i=\{x\in \varOmega , \Vert x - \xi _i \Vert _2<\tau _0 \}\). There exists a finite number of iterations N, such that for all \(k\ge N\), \(X_k\) has exactly s points, one in each \(B_i\). It follows that \({{\,\mathrm{dist}\,}}(X_k|\xi ) = {{\,\mathrm{dist}\,}}(\xi |X_k)\). Moreover, if \(S_k\) is the set of active points of \(\mathcal D(\varOmega _k)\), that is

$$\begin{aligned} S_k=\{z \in \varOmega _k \text { s.t. } |A^* q_k(z)|=1\},\end{aligned}$$

then \(S_k\subset \cup _i B_i\) and for each i, \(B_i \cap S_k \ne \emptyset \).

Proof

We first prove that \(B_i\) contains a point in \(S_k\). To this end, define the set of measures \(\mathcal {M}_-=\{\mu \in \mathcal {M}(\varOmega ), \exists i\in \{1,\ldots , s\}, {{\,\mathrm{supp}\,}}(\mu )\cap B_i = \emptyset \}\) and

$$\begin{aligned} J_{+} = \min _{\mu \in \mathcal {M}_-} \Vert \mu \Vert _{\mathcal {M}} + f(A\mu ). \end{aligned}$$

By Assumption 6, \(J_{+}>J^\star \). Since \((J(\mu _k))_{k\in \mathbb {N}}\) converges to \(J(\mu ^\star )\), there exists \(k_2\in \mathbb {N}\) such that \(\forall k\ge k_2\), \(J(\mu _k)<J_{+}\). Hence, for each \(1\le i \le s\), there must exist a point \(z_{k}^i\in \varOmega _k \cap B_i\) at which \(\mu _k\) has non-zero mass. Consequently, \(|A^*q_k(z_{k}^i)|=1\), so each \(B_i\) contains at least one point of \(\varOmega _k\) such that \(|A^*q_k(z_{k}^i)|=1\).

Notice that \(q_k\) converges to \(q^\star \) by Theorem 1. Hence there is a finite number of iterations \(k_1\) such that \(\Vert q_k-q^\star \Vert < c_q\) for all \(k\ge k_1\). By item (iii) of the proof of Proposition 5, \(|A^*q_k| <1\) outside \(\cup _i B_i\), and by item (ii), \(|A^*q_k|\) is strictly concave in each \(B_i\). Hence each \(B_i\) contains exactly one maximizer of \(|A^*q_k|\) exceeding one. \(\square \)

We now move on to analyzing our exchange approach. Before formulating the main result, let us introduce a term: \(\delta \)-regimes.

Definition 1

We say that the algorithm enters a \(\delta \)-regime at iteration \(k_\delta \) if for all \(k \ge k_\delta \), we have \({{\,\mathrm{dist}\,}}(\xi |X_k) \le \delta \). In particular it means that only points with a distance at most \(\delta \) from \(\xi \) are added to the grid.

Lemma 4

Let \(\bar{\tau }_0 = \frac{\kappa _\nabla }{\gamma }c_q\) and \(A=2^{d+1}d^{d/2}\left( \frac{\rho \sqrt{R\kappa _{\mathrm {{{\,\mathrm{hess}\,}}}}}\kappa _{\nabla }}{\gamma }\right) ^{3d}\). Let N be as in Proposition 6.

  1. 1.

    For any \(\tau \), the algorithm enters a \(\tau \)-regime after a finite number of iterations.

  2. 2.

    Assume that N iterations have passed and that the algorithm is in a \(\tau \)-regime with \(\tau \le \bar{\tau }_0\). Then for every \(\alpha \in (0,1)\) it takes no more than \(\left\lceil \frac{A}{\alpha ^{2d}} \right\rceil +1\) iterations to enter an \(\alpha \tau \)-regime.

Proof

Note that for any \(\delta \le \bar{\tau }_0\), if there exists \(p\in \mathbb {N}\) such that

$$\begin{aligned} \Vert q_k-q^\star \Vert _2 \le \frac{\gamma }{\kappa _{\nabla }}\delta \quad \text {for all }k\ge p, \end{aligned}$$
(24)

we will enter a \(\delta \)-regime after iteration p by applying Proposition 5.

To prove (1), note that we can assume without loss of generality that \(\tau \le \bar{\tau }_0\) (since entering a \(\tau \)-regime means in particular entering a \(\tau '\)-regime for any \(\tau '\ge \tau \)). Then, since \(\Vert q_k-q^\star \Vert _2\) tends to zero as k goes to infinity, (24) with \(\delta =\tau \) is true after a finite number of iterations.

To prove (2), we proceed as follows. Proposition 6 ensures that in each iteration, exactly one point is added in each ball \(\{x\in \varOmega , \Vert x-\xi _i\Vert _2\le \tau \}\). Let \(k_0\) be the current iteration. A covering number argument [32] ensures that, for any \(\Delta \), after \(\delta _0 =\left\lceil 2d^{d/2}\left( \frac{\tau }{\Delta } \right) ^d\right\rceil \) iterations, each point in \(X_k\) lies at a distance at most \(\Delta \) from \(\varOmega _k\), i.e., \({{\,\mathrm{dist}\,}}(\varOmega _{k}|X_k)\le \Delta \).

Now, if we choose \(\Delta =\left( \frac{\gamma }{\kappa _\nabla \rho \sqrt{R\kappa _{{{\,\mathrm{hess}\,}}}}}\right) ^3\frac{\alpha ^2\tau }{2}\), Lemma 2 together with Proposition 5 imply

$$\begin{aligned} {{\,\mathrm{dist}\,}}(\varOmega _{k_0+\delta _0+1}|\xi )\le&{{\,\mathrm{dist}\,}}(X_{k_0+\delta _0}|\xi )\\ \le&\frac{\kappa _\nabla }{\gamma } \rho \sqrt{R\kappa _{{{\,\mathrm{hess}\,}}}} {{\,\mathrm{dist}\,}}(\varOmega _{k_0+\delta _0}|X_{k_0+\delta _0}) \le \left( \frac{\gamma \alpha }{\kappa _{\nabla }\rho } \right) ^2\frac{\tau }{2R\kappa _{{{\,\mathrm{hess}\,}}}} \end{aligned}$$

Since \(\varOmega _{k+1}\supset \varOmega _k\) for all k, the distance \({{\,\mathrm{dist}\,}}(\varOmega _k|\xi )\) is non-increasing. As a result \({{\,\mathrm{dist}\,}}(\varOmega _{k}|\xi )\le \left( \frac{\gamma \alpha }{\kappa _{\nabla }\rho } \right) ^2\frac{\tau }{2R\kappa _{{{\,\mathrm{hess}\,}}}}\) for all \(k\ge k_0+\delta _0+1\). Since we are in a \(\tau \)-regime, we know that \({{\,\mathrm{dist}\,}}(X_k|\xi )\le \tau \) and \({{\,\mathrm{dist}\,}}(\varOmega _k|\xi )\le \tau \). Hence we can apply Lemma 3 to obtain that

$$\begin{aligned} \Vert q_k-q^\star \Vert _2\le \sqrt{2R\kappa _{{{\,\mathrm{hess}\,}}} \tau \cdot {{\,\mathrm{dist}\,}}(\varOmega _k|\xi )} \rho \le \frac{\gamma }{\kappa _\nabla }\alpha \tau . \end{aligned}$$

Then inequality (24) is satisfied with \(\delta =\alpha \tau \) and the algorithm enters an \(\alpha \tau \)-regime. \(\square \)

The main result will tell us how many iterations we need to enter a \(\tau \)-regime.

Theorem 3

Let \(\tau \le \bar{\tau }_0{\mathop {=}\limits ^{\mathrm{def.}}}\frac{\kappa _\nabla }{\gamma }c_q\) and \(k_0\) be the iteration on which the algorithm enters a \(\bar{\tau }_0\)-regime. Then \(k_0 < \infty \), and the algorithm will enter a \(\tau \)-regime after no more than \(k_0 + k_\tau \) iterations, where

$$\begin{aligned} k_\tau := \left\lceil e 2^{d+1}d^{d/2}\left( \frac{\rho \sqrt{R\kappa _{{{\,\mathrm{hess}\,}}}}\kappa _{\nabla }}{\gamma }\right) ^{3d}+1\right\rceil \left\lceil 2d \log \left( \frac{\bar{\tau }_0}{\tau }\right) \right\rceil . \end{aligned}$$

Additionally, we will have

$$\begin{aligned} \Vert q_k - q^\star \Vert _2&\le \tau \sqrt{2R\kappa _{\mathrm {{{\,\mathrm{hess}\,}}}}}\rho \nonumber \\ \inf (\mathcal {P}(\varOmega _k))&\le \inf (\mathcal {P}(\varOmega )) + \frac{2R\kappa _{{{\,\mathrm{hess}\,}}}\rho ^2}{L} \cdot \tau ^2 \end{aligned}$$
(25)

for \(k \ge k_0+k_\tau +1\). In other words, the algorithm will eventually converge linearly.

Proof

The fact that \(k_0< \infty \) is the first assertion of Lemma 4. As for the other part, we argue as follows: Fix \(\alpha \in (0,1)\). Since we have entered a \(\bar{\tau }_0\)-regime at iteration \(k_0\), Lemma 4 implies that it will take no more than \(\left\lceil \frac{A}{\alpha ^{2d}} \right\rceil +1\) additional iterations to enter an \(\alpha \bar{\tau }_0\)-regime. Repeating this argument, we see that after no more than

$$\begin{aligned} n \cdot \left( \left\lceil \frac{A}{\alpha ^{2d}} \right\rceil +1\right) \end{aligned}$$

iterations, we will have entered an \(\alpha ^n \bar{\tau }_0\)-regime. Choosing \(\alpha = e^{-1/(2d)}\), so that \(\alpha ^{-2d}=e\), and \(n = \lceil 2d \log \left( \bar{\tau }_0/ \tau \right) \rceil \), so that \(\alpha ^n\bar{\tau }_0 \le \tau \), we obtain the first statement.

The second statement immediately follows from Lemma 3 (as in the proof of Theorem 2) and the fact that entering a \(\tau \)-regime amounts exactly to having \({{\,\mathrm{dist}\,}}(X_k|\xi )\le \tau \) for all future k, and therefore in particular \({{\,\mathrm{dist}\,}}(\varOmega _{k+1}|\xi )\le \tau \). \(\square \)
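The iteration bound \(k_\tau \) of Theorem 3 can be evaluated directly. The following sketch transcribes the formula; the function name, argument names and numerical values are hypothetical and only serve to show the logarithmic dependence on the target accuracy \(\tau \) and the exponential dependence on d.

```python
import math

def regime_iterations(d, rho, R, kappa_hess, kappa_grad, gamma, tau_bar0, tau):
    """Bound k_tau on the number of iterations needed to go from a
    tau_bar0-regime to a tau-regime, transcribed from Theorem 3."""
    if not 0 < tau <= tau_bar0:
        raise ValueError("expected 0 < tau <= tau_bar0")
    ratio = rho * math.sqrt(R * kappa_hess) * kappa_grad / gamma
    A_const = 2 ** (d + 1) * d ** (d / 2) * ratio ** (3 * d)   # constant A of Lemma 4
    per_stage = math.ceil(math.e * A_const + 1)                # iterations per contraction by alpha
    stages = math.ceil(2 * d * math.log(tau_bar0 / tau))       # number of contractions n
    return per_stage * stages

# Toy constants (all equal to one) in dimension d = 1.
print(regime_iterations(1, 1, 1, 1, 1, 1, tau_bar0=1.0, tau=0.1))   # 60
print(regime_iterations(1, 1, 1, 1, 1, 1, tau_bar0=1.0, tau=0.01))  # 120
```

With these toy constants, each additional factor of 10 in accuracy costs the same number of iterations, which is the meaning of the eventual linear rate.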

Remark 3

Let us give some insights on Theorem 3.

  1. 1.

    Notice that the value \(k_\tau \) depends exponentially on the ambient dimension d. This property cannot be improved with the current proof based on a covering number argument. We are unsure whether the exponential growth is merely an artefact of the proof and can be removed.

  2. 2.

    A popular variant of the algorithm consists in adding the single most violating maximizer, which can then be regarded as a variant of the conditional gradient descent. It is yet unclear whether the current proof can be adapted to this setting since our proof relies on systematically adding one point around every Dirac mass of the solution. We however believe that adding all the violating maximizers arguably makes more sense from a computational point of view. Indeed, all violating maximizers have to be explored to select the global maximizer. Hence some information is lost by adding only one point. For instance, in the context of super-resolution imaging, we will see that a variant of the proposed algorithm converges in a single iteration, while a similar variant of the conditional gradient would require s iterations.

  3. 3.

    An alternative proof covering the case of adding a single point and removing some was proposed in a work produced independently and roughly at the same time by Pieper and Walter [22]. There, the authors consider a similar but more general framework allowing for vector valued total variation regularizers. Under an additional assumption of strong convexity of f, the authors also prove an eventual linear convergence rate. The proofs share a few similarities, but also some differences reflected by the additional assumption. In particular, the covering number argument does not appear. It is currently unclear to the authors which proof leads to the better rate.

On a practical level, the algorithm contains two main difficulties: (i) computing the dual solution \(q_k\) and (ii) finding the local maximizers of \(|A^*q_k|\). As for (i), the Lipschitz continuity assumption on \(\nabla f\) makes the dual problem strongly convex. This is a helpful feature that allows one to certify the precision of iterative algorithms: we can generate points \(\tilde{q}_k\) within a prescribed distance to the actual solution \(q_k\). With some additional work, this could most probably lead to certified algorithms with an inexact resolution of the duals \(\mathcal {D}(\varOmega _k)\). Point (ii) is arguably more problematic: unless the measurement functions \(a_i\) have a specific structure such as polynomials, certifying that the maximizers \(X_k\) are well evaluated is out of reach. Unfortunately, forgetting points in \(X_k\) can break the convergence to the actual solution. In practice, this evaluation proved to require some attention, but well designed particle flow algorithms initialized with a sufficiently large number of particles seemed to solve any instance of the super-resolution experiments provided later.

The inequality (25) is an upper-bound on the cost function for the problem (\(\mathcal {P}(\varOmega _k)\)). Unfortunately, the numerical resolution of this problem is hard since \(\varOmega _k\) contains clusters of points and in practice it is beneficial to solve the simpler discrete problem

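$$\begin{aligned} \min _{\mu \in \mathcal {M}(X_k)} \Vert \mu \Vert _{\mathcal {M}} + f(A\mu ). \end{aligned}$$
(\(\mathcal {P}(X_k)\))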
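For the quadratic data term \(f(\cdot )=\frac{L}{2}\Vert \cdot - b\Vert _2^2\), the problem restricted to measures supported on \(X_k\) is a finite-dimensional \(\ell ^1\)-regularized least squares problem in the amplitudes. The sketch below solves it with a proximal gradient iteration (ISTA); this solver is our illustrative choice and not necessarily the one used by the authors. The matrix `M`, whose columns stand for the vectors \(A(x)\), \(x\in X_k\), is a hypothetical random example.

```python
import numpy as np

def objective(M, b, alpha, L=1.0):
    """||alpha||_1 + (L/2) ||M alpha - b||_2^2."""
    return np.abs(alpha).sum() + 0.5 * L * np.sum((M @ alpha - b) ** 2)

def solve_discrete(M, b, L=1.0, n_iter=500):
    """ISTA for min_alpha ||alpha||_1 + (L/2) ||M alpha - b||_2^2.

    The columns of M are assumed to be the measurement vectors A(x), x in X_k.
    """
    # Step size 1 / (Lipschitz constant of the gradient of the smooth part).
    step = 1.0 / (L * np.linalg.norm(M, 2) ** 2)
    alpha = np.zeros(M.shape[1])
    for _ in range(n_iter):
        grad = L * M.T @ (M @ alpha - b)
        z = alpha - step * grad
        # Soft-thresholding: proximal operator of step * ||.||_1.
        alpha = np.sign(z) * np.maximum(np.abs(z) - step, 0.0)
    return alpha

# Hypothetical data: 3 candidate positions, 10 measurements.
rng = np.random.default_rng(0)
M = rng.standard_normal((10, 3))
b = M @ np.array([2.0, 0.0, -1.5])
alpha_hat = solve_discrete(M, b, L=10.0)
print(alpha_hat, objective(M, b, alpha_hat, L=10.0))
```

The returned amplitudes define a discrete measure supported on \(X_k\). Because \(X_k\) contains few, well separated points, this problem is much better conditioned than the one posed on the clustered grid \(\varOmega _k\).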

For this measure, we also obtain an a posteriori estimate of the convergence rate.

Proposition 7

Let \(\widehat{\mu }_k\) be a solution of (\(\mathcal {P}(X_k)\)). If \({{\,\mathrm{dist}\,}}(X_k|\xi )\le \tau \), then

$$\begin{aligned} J(\widehat{\mu }_k)\le J(\mu ^\star ) + \left( \Vert \alpha ^\star \Vert _1 \frac{\kappa _{{{\,\mathrm{hess}\,}}} \Vert q^\star \Vert _2}{2} + \frac{L}{2} \Vert \alpha ^\star \Vert _1^2 \kappa _\nabla ^2\right) \tau ^2. \end{aligned}$$
(26)

Proof

For any i, denote \(x^i_k\) a point in \(X_k\) closest to \(\xi _i\) and define \(\tilde{\mu }_k = \sum _{i=1}^s \alpha _i^\star \delta _{x_k^i}\). We have \(J(\widehat{\mu }_k)\le J(\tilde{\mu }_k)\) and \(\Vert \tilde{\mu }_k\Vert _\mathcal {M}\le \Vert \mu ^\star \Vert _\mathcal {M}\). Furthermore, we have

$$\begin{aligned} f(A\tilde{\mu }_k) \le f(A\mu ^\star ) + \langle \nabla f(A\mu ^\star ), A\tilde{\mu }_k - A\mu ^\star \rangle + \frac{L}{2} \Vert A\tilde{\mu }_k - A\mu ^\star \Vert _2^2. \end{aligned}$$

The last term in the inequality is handled by the following estimate:

$$\begin{aligned} \Vert A\tilde{\mu }_k - A\mu ^\star \Vert _2\le \sum _{i=1}^s |\alpha _i^\star | \Vert A(x_k^i)-A(\xi _i)\Vert _2 \le \sum _{i=1}^s |\alpha _i^\star | \kappa _\nabla \Vert x_k^i-\xi _i\Vert _2 \le \Vert \alpha ^\star \Vert _1 \kappa _\nabla \tau . \end{aligned}$$

As for the penultimate term, remember that \(q^\star = - \nabla f(A\mu ^\star )\). This implies

$$\begin{aligned} \left\langle \nabla f(A\mu ^\star ), A \tilde{\mu }_k - A \mu ^\star \right\rangle = \left\langle A^*q^\star , \mu ^\star - \tilde{\mu }_k \right\rangle = \sum _{i=1}^s \alpha _i^\star \left( (A^*q^\star )(\xi _i)-A^*q^\star (x_k^i) \right) \end{aligned}$$

By making a Taylor expansion of \(A^*q^\star \) in each \(\xi _i\), utilizing that the derivative vanishes there, and that \(\Vert (A^*q^\star )''(x) \Vert \le \kappa _{{{\,\mathrm{hess}\,}}} \Vert q^\star \Vert _2\) for each \(x \in \varOmega \), we see that \(\left| (A^*q^\star )(x_k^i)-(A^*q^\star )(\xi _i) \right| \le \frac{\kappa _{{{\,\mathrm{hess}\,}}} \Vert q^\star \Vert _2}{2}\Vert x_k^i-\xi _i \Vert _2^2\) for each i. This yields

$$\begin{aligned} \left\langle \nabla f(A\mu ^\star ), A \tilde{\mu }_k - A \mu ^\star \right\rangle \le \Vert \alpha ^\star \Vert _1 \frac{\kappa _{{{\,\mathrm{hess}\,}}} \Vert q^\star \Vert _2 \tau ^2}{2}. \end{aligned}$$

Overall, we obtain

$$\begin{aligned} J(\widehat{\mu }_k)&\le J(\tilde{\mu }_k) = \Vert \tilde{\mu }_k\Vert _\mathcal {M}+ f(A\tilde{\mu }_k) \le J(\mu ^\star ) + \Vert \alpha ^\star \Vert _1 \frac{\kappa _{{{\,\mathrm{hess}\,}}} \Vert q^\star \Vert _2 \tau ^2}{2} \\&\quad +\, \frac{L}{2} \Vert \alpha ^\star \Vert _1^2 \kappa _\nabla ^2\tau ^2. \end{aligned}$$

\(\square \)

Convergence of continuous methods

In this section, we study an alternative algorithm that consists of using nonlinear programming approaches to minimize the following finite dimensional problem:

$$\begin{aligned} G(\alpha ,X) {\mathop {=}\limits ^{\mathrm{def.}}}J\left( \sum _{i=1}^p\alpha _i\delta _{x_i}\right) = \Vert \alpha \Vert _1 +f\left( A\left( \sum _i \alpha _i \delta _{x_i} \right) \right) , \end{aligned}$$
(27)

where \(X=(x_1,\ldots , x_p)\). This principle is similar to continuous methods in semi-infinite programming [25] and was proposed specifically for total variation minimization in [2, 7, 10, 30]. By Proposition 6, we know that after a finite number of iterations, \(X_k\) will contain exactly s points located in a neighborhood of \(\xi \). This motivates the following hybrid algorithm:

  • Launch the proposed exchange method until some criterion is met. This yields a grid \(X^{(0)}=X_k\) and we let \(p=|X_k|\).

  • Find the solution of the finite convex program

    $$\begin{aligned} \alpha ^{(0)} \in \mathop {\mathrm{argmin}}_{\alpha \in \mathbb {R}^p} G(\alpha , X^{(0)}). \end{aligned}$$
  • Use the following gradient descent:

    $$\begin{aligned} (\alpha ^{(t+1)}, X^{(t+1)})= (\alpha ^{(t)}, X^{(t)}) - \tau \nabla G( \alpha ^{(t)}, X^{(t)}), \end{aligned}$$
    (28)

    where \(\tau \) is a suitably defined step-size (e.g. defined using Wolfe conditions).

We tackle the following question: does the gradient descent algorithm converge to the solution if initialized well enough?
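As an illustration of iteration (28), the following minimal Python sketch runs a gradient descent on G in a one-dimensional toy setting. Everything specific in it is an assumption made for the example and is not taken from the paper: the Gaussian measurement functions with width `SIGMA` and centers `CENTERS`, the quadratic fidelity \(f(z)=\frac{1}{2}\Vert z-y\Vert _2^2\), and the backtracking (Armijo) step-size rule, which is swapped in for the Wolfe conditions mentioned above. The gradient used is the one given in (29).

```python
import math

SIGMA = 0.1                                # hypothetical kernel width
CENTERS = [j / 19 for j in range(20)]      # hypothetical sensor positions in [0, 1]

def a(x):
    """Measurement vector (a_j(x))_j for Gaussian a_j (an assumption of this sketch)."""
    return [math.exp(-(x - c) ** 2 / (2 * SIGMA ** 2)) for c in CENTERS]

def da(x):
    """Derivative of each a_j with respect to x."""
    return [-(x - c) / SIGMA ** 2 * math.exp(-(x - c) ** 2 / (2 * SIGMA ** 2))
            for c in CENTERS]

def forward(alpha, X):
    """A mu for mu = sum_i alpha_i delta_{x_i}."""
    z = [0.0] * len(CENTERS)
    for ai, xi in zip(alpha, X):
        for j, v in enumerate(a(xi)):
            z[j] += ai * v
    return z

def G(alpha, X, y):
    """G(alpha, X) = ||alpha||_1 + f(A mu) with f(z) = 0.5 * ||z - y||^2."""
    z = forward(alpha, X)
    return sum(abs(t) for t in alpha) + 0.5 * sum((zj - yj) ** 2 for zj, yj in zip(z, y))

def grad_G(alpha, X, y):
    """Gradient (29): (sign(alpha) - (A^*q)(X), -D(alpha)(A^*q)'(X)), q = -grad f(A mu)."""
    q = [yj - zj for zj, yj in zip(forward(alpha, X), y)]
    g_alpha, g_X = [], []
    for ai, xi in zip(alpha, X):
        Aq = sum(qj * v for qj, v in zip(q, a(xi)))
        dAq = sum(qj * v for qj, v in zip(q, da(xi)))
        g_alpha.append(math.copysign(1.0, ai) - Aq)   # valid where alpha_i != 0
        g_X.append(-ai * dAq)
    return g_alpha, g_X

def descend(alpha, X, y, iters=200):
    """Gradient descent with Armijo backtracking; returns iterate and objective history."""
    values = [G(alpha, X, y)]
    for _ in range(iters):
        ga, gx = grad_G(alpha, X, y)
        sq = sum(v * v for v in ga + gx)
        tau = 1.0
        for _ in range(50):                 # halve the step until sufficient decrease
            na = [p - tau * g for p, g in zip(alpha, ga)]
            nx = [p - tau * g for p, g in zip(X, gx)]
            if G(na, nx, y) <= values[-1] - 1e-4 * tau * sq:
                alpha, X = na, nx
                break
            tau *= 0.5
        values.append(G(alpha, X, y))
    return alpha, X, values
```

A possible use is to generate `y = forward([1.0, -0.8], [0.3, 0.7])` and call `descend([0.9, -0.7], [0.32, 0.68], y)`. Because of the backtracking rule, the recorded objective values never increase; the sketch says nothing about the size of the basin of attraction.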

Existence of a basin of attraction

This section is devoted to proving the existence of a basin of attraction of a descent method in G. Under two additional assumptions, we state our result in Proposition 8.

Assumption 7

The function f is twice differentiable and \(\Lambda \)-strongly convex.

The twice differentiability assumption is mostly due to convenience, but the strong convexity is crucial. The second assumption is related to the structure of the support \(\xi \) of the solution \(\mu ^\star \).

Assumption 8

For any \(x,y \in \varOmega \) denote \(K(x,y)=\sum _{\ell } a_\ell (x)a_\ell (y)\). The transition matrix

$$\begin{aligned} T(\xi )=\begin{bmatrix} [K(\xi _i,\xi _j)]_{i,j=1}^s &{}[\nabla _x K(\xi _i,\xi _j)^*]_{i,j=1}^s \\ [\nabla _x K(\xi _i, \xi _j)]_{i,j=1}^s &{}[\nabla _x \nabla _y K(\xi _i,\xi _j)^*]_{i,j=1}^s \end{bmatrix} \in \mathbb {R}^{(s+sd) \times (s+sd)} \end{aligned}$$

is assumed to be positive definite, with a smallest eigenvalue larger than \(\Gamma >0\).

It is again possible to prove for many important operators A that this assumption is satisfied if the set \(\xi \) is separated. See the references listed in the discussion about Assumption 6. The following proposition describes the links between minimizing G and solving (\(\mathcal {P}(\varOmega )\)).
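To make Assumption 8 more tangible, the following Python sketch assembles \(T(\xi )\) as the Gram matrix of the columns of \([A(\xi )\ A'(\xi )]\) and tests positive definiteness with a Cholesky factorization. It is only a sketch under stated assumptions: the dimension is \(d=1\), the measurement functions are hypothetical Gaussians (`SIGMA`, `CENTERS`), and the tolerance `tol` is an arbitrary choice. It does not compute the constant \(\Gamma \) itself.

```python
import math

SIGMA = 0.1                                # hypothetical kernel width
CENTERS = [j / 19 for j in range(20)]      # hypothetical sensor positions in [0, 1]

def a(x):
    return [math.exp(-(x - c) ** 2 / (2 * SIGMA ** 2)) for c in CENTERS]

def da(x):
    return [-(x - c) / SIGMA ** 2 * math.exp(-(x - c) ** 2 / (2 * SIGMA ** 2))
            for c in CENTERS]

def transition_matrix(xi):
    """T(xi) = M^T M, where the columns of M are a(xi_i) followed by a'(xi_i)."""
    cols = [a(x) for x in xi] + [da(x) for x in xi]
    n = len(cols)
    return [[sum(u * v for u, v in zip(cols[i], cols[j])) for j in range(n)]
            for i in range(n)]

def is_positive_definite(T, tol=1e-10):
    """Cholesky test: fails as soon as a pivot is not larger than tol."""
    n = len(T)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = T[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= tol:
                    return False
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True
```

In this toy setting the test passes for the well-separated support `[0.3, 0.7]` and fails when two support points coincide, e.g. `[0.5, 0.5]`, which is consistent with the role played by the separation of \(\xi \).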

Proposition 8

Let \(\mu ^\star =\sum _{i=1}^s \alpha ^\star _i \delta _{\xi _i} \ne 0\) be the solution of (\(\mathcal {P}(\varOmega )\)). Under Assumptions 7 and 8, \((\alpha ^\star ,\xi )\) is a global minimizer of G. Additionally, G is differentiable with a Lipschitz gradient and strongly convex in a neighborhood of \((\alpha ^\star ,\xi )\).

Hence, there exists a basin of attraction around \((\alpha ^\star ,\xi )\) such that performing a gradient descent on G will yield the solution of (\(\mathcal {P}(\varOmega )\)) at a linear rate.

The rest of this section is devoted to the proof of Proposition 8. Let us begin by stating a simple auxiliary result.

Lemma 5

Let U and V be vector spaces and \(C: V \rightarrow V\) be a linear operator with \(C \succcurlyeq \lambda {{\,\mathrm{id}\,}}_V\) for a \(\lambda \ge 0\). Then, for any \(B:U \rightarrow V\)

$$\begin{aligned} B^*CB \succcurlyeq \lambda B^*B. \end{aligned}$$

Proof

If \(B^*CB-\lambda B^*B\) is positive semidefinite, the claim holds. Since for \(v \in U\) arbitrary

$$\begin{aligned} \left\langle (B^*CB-\lambda B^*B)v,v \right\rangle = \left\langle C(Bv),Bv \right\rangle - \lambda \left\langle Bv,Bv \right\rangle \ge \lambda \Vert Bv \Vert _V^2 - \lambda \Vert Bv \Vert _V^2 =0, \end{aligned}$$

the former is the case. \(\square \)

Let us introduce some notation that will be used in this section: for an \(X=(x_1,\ldots ,x_p) \in \varOmega ^p\) for some p, A(X) denotes the matrix \([a_i(x_j)]\). Analogously, \(A'(X)\) and \(A''(X)\) denote the operators

$$\begin{aligned}&A'(X) : (\mathbb {R}^d)^p \rightarrow \mathbb {R}^m, (v_i)_{i=1}^p \mapsto \left( \sum _{i=1}^p \partial _x a_j(x_i)v_i\right) _j,\\ {}&\quad A''(X): (\mathbb {R}^d \times \mathbb {R}^d)^p \rightarrow \mathbb {R}^m, (v_i,w_i)_{i=1}^p \mapsto \sum _{i=1}^p A''(x_i)[v_i,w_i] \end{aligned}$$

respectively. Note that for \(q \in \mathbb {R}^m\) and \(X \in \varOmega ^p\),

$$\begin{aligned} A(X)^*q&= ((A^*q)(x_i))_{i=1}^p {\mathop {=}\limits ^{\mathrm{def.}}}(A^*q)(X) \in \mathbb {R}^p \\ A'(X)^*q&= (\nabla (A^*q)(x_1), \dots , \nabla (A^*q)(x_p)) \in (\mathbb {R}^d)^p \\ A''(X)^*q&= ((A^*q)''(x_1), \dots , (A^*q)''(x_p)) \in (\mathbb {R}^d \times \mathbb {R}^d)^p \end{aligned}$$

We will also use the shorthands \(\mu = \sum _{i} \alpha _i \delta _{x_i}\), \(G_f(\alpha ,X) = f(A\mu )\), and, for \(\alpha \in \mathbb {R}^p\), \(D(\alpha )\) denotes the operator

$$\begin{aligned} D(\alpha ): (\mathbb {R}^d)^p \rightarrow (\mathbb {R}^d)^p, (v_i)_{i=1}^p \mapsto (\alpha _i v_i)_{i=1}^p. \end{aligned}$$

We have

$$\begin{aligned} \frac{\partial G_f}{\partial \alpha }(\alpha ,X) \beta&= \left\langle \nabla f(A\mu ),A(X)\beta \right\rangle \\ \frac{\partial G_f}{\partial X} \delta&= \left\langle \nabla f(A\mu ), A'(X)D(\alpha )\delta \right\rangle , \end{aligned}$$

so that at points \((\alpha ,X)\) with \(\alpha _i \ne 0\) for all i, and in particular in a neighborhood of \((\alpha ^\star ,\xi )\), G is differentiable and its gradient is given by

$$\begin{aligned}&\mathbb {R}^p \times (\mathbb {R}^d)^p \ni \nabla G(\alpha ,X) = \left( {{\,\mathrm{sign}\,}}(\alpha ) - (A^*q)(X), -D(\alpha )(A^*q)'(X) \right) ,\nonumber \\ {}&\quad \text { with } q=-\nabla f(A\mu ). \end{aligned}$$
(29)
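As a hedged illustration, which is not part of the original derivation, consider the quadratic fidelity \(f(z)=\frac{L}{2}\Vert z-y\Vert _2^2\) used later in the numerical experiments. Then \(\nabla f(z)=L(z-y)\), so that \(q=L(y-A\mu )\). Writing \((A^*q)(x)=\sum _{j=1}^m q_j a_j(x)\), the components of (29) read

$$\begin{aligned} \frac{\partial G}{\partial \alpha _i}(\alpha ,X)&= {{\,\mathrm{sign}\,}}(\alpha _i) - L \sum _{j=1}^m (y-A\mu )_j \, a_j(x_i), \\ \nabla _{x_i} G(\alpha ,X)&= -\alpha _i L \sum _{j=1}^m (y-A\mu )_j \, \nabla a_j(x_i), \end{aligned}$$

for \(i=1,\ldots ,p\). Each gradient evaluation therefore only requires the residual \(y-A\mu \) and the values and gradients of the m measurement functions at the p current positions.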

As for the second derivatives, we have

$$\begin{aligned} \frac{\partial ^2 G_f}{\partial ^2 \alpha }(\alpha ,X) [\beta , \gamma ]&= f''(A\mu )(A(X) \beta ,A(X)\gamma ) \\ \frac{\partial ^2 G_f}{\partial \alpha \partial X}(\alpha , X)[ \beta ,\delta ]&= f''(A\mu ) (A(X) \beta , A'(X)D(\alpha )\delta ) + \left\langle \nabla f(A\mu ), A'(X)D(\beta ) \delta \right\rangle \\ \frac{\partial ^2 G_f}{\partial ^2 X} (\alpha ,X)[\delta , \epsilon ]&= f''(A\mu )(A'(X)D(\alpha )\delta ,A'(X)D(\alpha )\epsilon ) + \left\langle \nabla f(A\mu ), A''(X)(D(\alpha ) \delta , \epsilon ) \right\rangle . \end{aligned}$$

We may now prove our claims.

Proof of Proposition 8

First, let us note that due to the optimality conditions of (\(\mathcal {P}(\varOmega )\)), we know that

$$\begin{aligned} q^\star = - \nabla f(A\mu ^\star ). \end{aligned}$$

Now, \(\left| A^*q^\star \right| \) has local maxima in the points \(\xi _i\), so that \((A^*q^\star )'(\xi )=0\). In these points, we furthermore have that \({{\,\mathrm{sign}\,}}(\alpha _i^\star )= A^*q^\star (\xi _i)\), so that the gradient of G given in (29) vanishes.

To prove the rest, it is enough to show that the Hessian of \(G_f\) is positive definite in a neighborhood around \((\alpha ^\star ,\xi )\). For this, it is fruitful to decompose it into two parts. Letting \(q=-\nabla f(A\mu )\), we have \(G_f''=H_1+H_2\), with

$$\begin{aligned} H_1(\alpha ,X)&= \begin{bmatrix} A(X)^*f''(A\mu ) A(X) &{} A(X)^* f''(A\mu ) A'(X)D(\alpha ) \\ D(\alpha )^* A'(X)^*f''(A\mu )A(X) &{} D(\alpha )^*A'(X)^*f''(A\mu )A'(X)D(\alpha ) \end{bmatrix} \\ H_2(\alpha ,X) [(\beta ,\delta ),(\gamma ,\epsilon )]&= -\sum _{i=1}^s \left( \beta _i (A^*q)'(x_i)\epsilon _i + \gamma _i (A^*q)'(x_i)\delta _i +\alpha _i(A^*q)''(x_i)[\delta _i,\epsilon _i] \right) . \end{aligned}$$

Let \((\alpha ,X)\) be arbitrary. \(H_1\) is an operator of the form \(M_1^*M_2(X)^*{\mathcal L} M_2(X)M_1\), with \({\mathcal L}= f''(A\mu ): \mathbb {R}^m \rightarrow \mathbb {R}^m\) and

$$\begin{aligned}&M_1= \begin{bmatrix}{{\,\mathrm{id}\,}}&{} 0 \\ 0 &{} D(\alpha ) \end{bmatrix} : \mathbb {R}^s \times (\mathbb {R}^d)^s \rightarrow \mathbb {R}^s \times (\mathbb {R}^d)^s,\\ {}&M_2(X)= \begin{bmatrix}A(X)&A'(X) \end{bmatrix}: \mathbb {R}^s \times (\mathbb {R}^d)^s \rightarrow \mathbb {R}^m. \end{aligned}$$

Due to the \(\Lambda \)-strong convexity of f, \(\mathcal {L} \succcurlyeq \Lambda {{\,\mathrm{id}\,}}\). We furthermore have

$$\begin{aligned} M_1^*M_1 = \begin{bmatrix} {{\,\mathrm{id}\,}}&{} 0 \\ 0 &{} D(\alpha )^*D(\alpha ) \end{bmatrix} {\succcurlyeq \min _{1 \le i \le s } \left| \alpha _i \right| ^2 \cdot {{\,\mathrm{id}\,}}\succcurlyeq \frac{\min _{1 \le i \le s } \left| \alpha _i^\star \right| ^2}{2} \cdot {{\,\mathrm{id}\,}}} \end{aligned}$$

in some neighborhood U of \(\alpha ^\star \ne 0\).

Let us now turn to \(M_2(X)^*M_2(X)\). If we define \(M_2(\xi ) = \begin{bmatrix}A(\xi )&A'(\xi )\end{bmatrix}\), we have

$$\begin{aligned} M_2(\xi )^*M_2(\xi ) = \begin{bmatrix} A(\xi )^*A(\xi )&{} A(\xi )^* A'(\xi )\\ A'(\xi )^*A(\xi ) &{} A'(\xi )^*A'(\xi ) \end{bmatrix} = T(\xi ) \succcurlyeq \Gamma {{\,\mathrm{id}\,}}\end{aligned}$$

by Assumption 8. Since, by Assumption 5, both A(X) and \(A'(X)\) are continuously dependent on X, we even have

$$\begin{aligned} M_2(X)^*M_2(X) \succcurlyeq \frac{\Gamma }{2} {{\,\mathrm{id}\,}}\end{aligned}$$

for X in some neighborhood V of \(\xi \). We may now apply Lemma 5 twice to conclude

$$\begin{aligned} H_1(\alpha ,X)&\succcurlyeq { \frac{\Lambda \Gamma \min _{1 \le i \le s} \left| \alpha ^\star _i \right| ^2}{4} {{\,\mathrm{id}\,}}} \end{aligned}$$
(30)

for \((\alpha , X)\in U \times V\).

It remains to analyze \(H_2\). We again begin by evaluating the expression at \((\alpha ^\star , \xi )\), where \(q=q^\star \). Assumption 6 implies that

$$\begin{aligned} (A^*q^\star )'(\xi _i)&=0 \\ \alpha _i^\star (A^*q^\star )''(\xi _i)&\preccurlyeq 0 \end{aligned}$$

for each i. We therefore obtain

$$\begin{aligned} H_2(\alpha ^\star ,\xi ) [(\beta ,\delta ),(\beta ,\delta )]&= -\sum _{i=1}^s \left( \beta _i (A^*q^\star )'(\xi _i)\delta _i + \beta _i (A^*q^\star )'(\xi _i)\delta _i +\alpha _i^\star (A^*q^\star )''(\xi _i)[\delta _i,\delta _i] \right) \\&= -\sum _{i=1}^s \alpha _i^\star (A^*q^\star )''(\xi _i)[\delta _i,\delta _i] \ge 0. \end{aligned}$$

Hence, the bilinear form \(H_2(\alpha ^\star , \xi )\) is positive semidefinite. Due to the assumptions that the measurement functions \(a_i\) are members of \(\mathcal {C}^2_0\), and that \(\nabla f\) is Lipschitz continuous, \(H_2\) depends continuously on \(\alpha \) and X. Consequently,

$$\begin{aligned} H_2(\alpha , X) \succcurlyeq -\frac{\Lambda \Gamma \min _{1 \le i \le s} \left| \alpha _i^\star \right| ^2}{8} {{\,\mathrm{id}\,}}\end{aligned}$$
(31)

for \((\alpha ,X)\) in some neighborhood W of \((\alpha ^\star , \xi )\).

Combining (30) and (31), we obtain

$$\begin{aligned} H_1(\alpha ,X) + H_2(\alpha ,X) \succcurlyeq \frac{\Lambda \Gamma \min _{1 \le i \le s} \left| \alpha _i^\star \right| ^2}{8}{{\,\mathrm{id}\,}}\end{aligned}$$

for all \((\alpha ,X) \in (U\times V) \cap W\), which was to be proven. \(\square \)

Eventually entering the basin of attraction

The following proposition is the key to showing that the pair \((\tilde{\alpha },X_k)\), defined as the amplitudes and positions of the Dirac components of the solution \(\widehat{\mu }\) of (\(\mathcal {P}(X_k)\)), will eventually lie in the basin of attraction described by Proposition 8. This result is stated in Corollary 1; the rest of this section is dedicated to proving it.

Proposition 9

Assume that Assumptions 7 and 8 are true. Consider an s-sparse measure

$$\begin{aligned} \tilde{\mu } = \sum _{\ell =1}^s \tilde{\alpha }_\ell \delta _{\tilde{x}_\ell } \end{aligned}$$

for some \(\tilde{\alpha } \in \mathbb {R}^s\) and \((\tilde{x}_\ell )_{\ell =1\dots s}\) pairwise different points of \(\varOmega \). We then have

$$\begin{aligned} \Vert \tilde{\alpha }-\alpha ^\star \Vert _2 \le \frac{1}{\sqrt{\Gamma }} \left( \kappa _\nabla \Vert \tilde{\mu } \Vert _\mathcal {M}\sup _{\begin{array}{c} 1 \le \ell \le s \end{array}} \Vert \xi _\ell -\tilde{x}_\ell \Vert _2 + \sqrt{\frac{2}{\Lambda }\left( J(\tilde{\mu })-J(\mu ^\star )\right) }\right) . \end{aligned}$$

Proof

Let \(A(\xi )^\dagger \) be the Moore-Penrose inverse of \(A(\xi ) =[A(\xi _1), \dots , A(\xi _s)]\). Due to Assumption 8, \(A(\xi )^\dagger \) has full rank and has an operator norm no larger than \(\Gamma ^{-1/2}\). Since

$$\begin{aligned} \tilde{\alpha } = \alpha ^\star + A(\xi )^\dagger ( A(\xi ) \tilde{\alpha } - A \tilde{\mu }) + A(\xi )^\dagger (A\tilde{\mu } - A(\xi )\alpha ^\star ), \end{aligned}$$

bounds on \(A(\xi ) \tilde{\alpha } - A \tilde{\mu }\) and \(A\tilde{\mu }- A(\xi )\alpha ^\star \) will therefore transform to a bound on \(\tilde{\alpha }-\alpha ^\star \).

Let us begin with the former. We have

$$\begin{aligned}&\Vert A(\xi ) \tilde{\alpha }-A \tilde{\mu } \Vert _2 \le \sum _{\ell =1}^s \left| \tilde{\alpha }_\ell \right| \Vert A(\xi _\ell ) - A(\tilde{x}_\ell ) \Vert _2 \le \sum _{\ell =1}^s \kappa _\nabla \left| \tilde{\alpha }_\ell \right| \Vert \xi _\ell -\tilde{x}_\ell \Vert _2 \\ {}&\quad \le \kappa _\nabla \Vert \tilde{\alpha } \Vert _1 \sup _{\begin{array}{c} 1 \le \ell \le s \\ \tilde{\alpha }_\ell \ne 0 \end{array}} \Vert \xi _\ell -\tilde{x}_\ell \Vert _2, \end{aligned}$$

where we used the Hölder inequality in the last step.

To bound the latter, recall that \(\Lambda \)-strong convexity of f means that

$$\begin{aligned} f(A\tilde{\mu }) \ge f(A\mu ^\star ) + \left\langle \nabla f(A\mu ^\star ), A\tilde{\mu }-A \mu ^\star \right\rangle + \frac{\Lambda }{2} \Vert A\tilde{\mu }-A\mu ^\star \Vert _2^2. \end{aligned}$$
(32)

The optimality conditions for (\(\mathcal {P}(\varOmega )\)) tell us that \(q^\star = - \nabla f(A\mu ^\star )\), and hence

$$\begin{aligned}&\left\langle \nabla f(A\mu ^\star ), A\tilde{\mu }-A \mu ^\star \right\rangle = \left\langle A^*q^\star , \mu ^\star -\tilde{\mu } \right\rangle \\ {}&\quad = \sum _{\ell =1}^s \left( \alpha _\ell ^\star (A^* q^\star )(\xi _\ell ) - \tilde{\alpha }_\ell (A^*q^\star )(\tilde{x}_\ell ) \right) \ge \Vert \alpha ^\star \Vert _1 - \Vert \tilde{\alpha } \Vert _1, \end{aligned}$$

where in the last step we used that \(\alpha _\ell ^\star (A^*q^\star )(\xi _\ell ) = \left| \alpha _\ell ^\star \right| \) and that \(\Vert A^*q^\star \Vert _\infty \le 1 \). Plugging the above inequality in (32) yields

$$\begin{aligned} \frac{\Lambda }{2} \Vert A\tilde{\mu }-A\mu ^\star \Vert _2^2 \le J(\tilde{\mu })- J(\mu ^\star ). \end{aligned}$$

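Since \(A\mu ^\star = A(\xi )\alpha ^\star \) and \(\Vert \tilde{\alpha } \Vert _1 = \Vert \tilde{\mu } \Vert _\mathcal {M}\), combining the two bounds with \(\Vert A(\xi )^\dagger \Vert \le \Gamma ^{-1/2}\) yields the claim. \(\square \)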

Corollary 1

By Proposition 6, if k is large enough then \(X_k\) contains exactly s points. In this case, let \(\widehat{\mu }_k=\sum _{i=1}^s \widehat{ \alpha }_i \delta _{\hat{x}_i^k}\) be the solution of (\(\mathcal {P}(X_k)\)). Applying Proposition 9, recalling that \(\max _{i} \Vert \xi _i-\hat{x}_i^k \Vert _2 \le {{\,\mathrm{dist}\,}}(X_k|\xi )\) and using the bound (26), we obtain:

$$\begin{aligned} \Vert \widehat{\alpha }-\alpha ^\star \Vert _2 \le \frac{{{\,\mathrm{dist}\,}}(X_k|\xi )}{\sqrt{\Gamma }} \left( \kappa _\nabla \Vert \widehat{\mu }_k \Vert _\mathcal {M}+ \sqrt{\frac{2}{\Lambda }\left( \Vert \alpha ^\star \Vert _1 \frac{\kappa _{{{\,\mathrm{hess}\,}}} \Vert q^\star \Vert _2}{2} + \frac{L}{2} \Vert \alpha ^\star \Vert _1^2 \kappa _\nabla ^2\right) }\right) . \end{aligned}$$

Since \({{\,\mathrm{dist}\,}}(X_k|\xi )\) is guaranteed to converge to zero by Theorem 3 and \(\Vert \widehat{\mu }_k \Vert _\mathcal {M}\) is bounded (e.g. by lower boundedness of f and upper boundedness of \(J(\widehat{\mu }_k)\)), \((\widehat{\alpha },X_k)\) will eventually lie in the basin of attraction of G.

Description of the hybrid approach

To conclude this paper, we propose a method alternating between an exchange step and a continuous gradient descent. It is detailed in Algorithm 2. The idea is, after each iteration of the exchange algorithm, to start a gradient descent on G initialized at the solution \(\widehat{\mu }_k\) of (\(\mathcal {P}(X_k)\)). If this gradient descent converges to a measure \(\bar{\mu }_k\), we can subsequently test whether it is an optimal point by checking whether \(\bar{q}_k= -\nabla f(A\bar{\mu }_k)\) fulfills the stopping criterion \(\Vert A^*\bar{q}_k \Vert _\infty \le 1+\epsilon \), where \(\epsilon \) is a user-defined tolerance (this test is justified by Proposition 3). If so, we may output \(\bar{\mu }_k\); if not, we continue the exchange algorithm, possibly after also adding the support points of \(\bar{\mu }_k\) to the grid. The behavior of this method is described in the following theorem.

Theorem 4

(Convergence guarantees for the alternating method) Algorithm 2 comes with the following guarantees:

  1. (Theorem 1) Under Assumptions 1, 2 and 3, it is guaranteed to stop after a finite number of iterations for any stopping criterion \(\epsilon >0\).

  2. (Theorem 3) If in addition Assumptions 5 and 6 are satisfied, then the algorithm eventually converges linearly: for \(k \ge N+k_\tau \) with \(k_\tau \lesssim \log (\tau ^{-1})\), we have \({{\,\mathrm{dist}\,}}(\varOmega _{k}|\xi )\le \tau \).

  3. (Proposition 8, Theorem 3 and Proposition 9) If in addition Assumptions 7 and 8 are satisfied, then, for large enough k, the low-complexity gradient descent method (28) converges linearly: \(\Vert (\alpha ^{(t)},X^{(t)})-(\alpha ^\star ,\xi )\Vert _2\le c^t\Vert (\alpha ^{(0)},X^{(0)})-(\alpha ^\star ,\xi )\Vert _2\) for some \(0\le c<1\).

Overall, this method has many desirable properties: the continuous method should be used whenever the exchange method reaches its basin of attraction, since its per-iteration cost is much lower. However, it is unclear in general whether this basin even exists. If it does not, the exchange method should be preferred, since it eventually converges linearly under quite mild assumptions. The proposed algorithmic scheme captures the best of both methods. Let us notice that it is very similar in spirit to the sliding Frank-Wolfe algorithm proposed in [10], apart from the fact that we suggest adding all the points \(X_k\) violating the constraints, while the single most violating point is added in [10]. We believe that the proposed analysis sheds some light on the good numerical performance of this method.

Arguably the most complicated step in this algorithm is to evaluate \(X_k\), the set of local maximizers of \(A^*q_k\) exceeding 1. This is an impossible task for an arbitrary function \(A^*q_k\). However, a simple heuristic described in the next section provided rather satisfactory results for the measurement functions considered in this paper (trigonometric polynomials and Gaussian convolution).

Apart from this, let us outline that the subproblems in this algorithm are well suited for numerical resolution. In the exchange algorithm, we only solve the dual problems \(\mathcal {D}(\varOmega _k)\) which are strongly convex. Hence first-order methods for instance come with guarantees of convergence to \(q_k\) in \(\ell ^2\)-norm. Recovering the masses \(\hat{\alpha }_k\), solutions of \(\mathcal {P}(X_k)\) is also stable since \(X_k\) (the local maximizers of \(A^*q_k\)) is typically a well separated set of low cardinality. The gradient descent (or alternative nonlinear programming approach) on \(G(\alpha ,X)\) is performed over a low dimensional set. If the convergence is not satisfactory (e.g. the norm of \(\nabla G\) doesn’t decay fast enough), it can be stopped, and we can switch back to the exchange algorithm.
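The control flow described in this section can be sketched as follows in Python. This is a schematic reading of the text above and not a transcription of Algorithm 2: the three callables `exchange_step`, `local_descent` and `certificate_sup` are hypothetical placeholders for one exchange iteration, the gradient descent on G, and the evaluation of \(\Vert A^*\bar{q}_k \Vert _\infty \), and the optional step of adding the support of \(\bar{\mu }_k\) to the grid is omitted.

```python
def alternating_method(exchange_step, local_descent, certificate_sup,
                       grid, eps, max_iter=50):
    """Alternate between exchange steps and a local descent on G.

    exchange_step(grid)  -> (mu_hat, refined_grid)   # hypothetical: solve on grid, refine it
    local_descent(mu)    -> (mu_bar, converged)      # hypothetical: descent on G from mu
    certificate_sup(mu)  -> sup-norm of A^* q for q = -grad f(A mu)   # hypothetical
    """
    mu_hat = None
    for k in range(max_iter):
        mu_hat, grid = exchange_step(grid)
        mu_bar, converged = local_descent(mu_hat)
        # Accept the continuous iterate only if it passes the dual stopping test.
        if converged and certificate_sup(mu_bar) <= 1 + eps:
            return mu_bar, k + 1
    # Budget exhausted: fall back to the last exchange iterate.
    return mu_hat, max_iter
```

The design choice illustrated here is that the descent is always allowed to fail: if it does not converge, or converges to a point that violates the dual test, the loop simply continues with the next exchange step.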

Algorithm 2: the alternating method (exchange steps combined with gradient descent on G)

Numerical experiments

To test our theory, we have implemented our algorithm in MATLAB. Before displaying the results of the experiments, let us discuss a few key steps in the implementation. In the entire section, we assume that \(\varOmega = [0,1]^d\) for \(d=1\) or 2 for simplicity. Note that this is no true restriction: we can always by scaling and translation ensure that \(\varOmega \subseteq [0,1]^d\), and trivially extend the measurement functions by 0 to the entirety of \([0,1]^d\).

Evaluating \(X_k\) Each iteration of the exchange algorithm requires the exact calculation of the local maximizers of \(A^*q_k\) exceeding 1. This is, in general, an impossible task. We resort to the following heuristic method: Given a \(q_k\), we first evaluate \(|A^*q_k|\) on a fixed rectangular grid \(G=(n^{-1}[0, \dots , n])^d\), and determine all of the discrete peaks, i.e. points at which \(\left| A^*q_k\right| \) is larger than at all of its neighbors in the grid, and where \(\left| A^*q_k\right| \) exceeds \(1-\epsilon _1\) for a threshold \(\epsilon _1>0\). Next, we start a gradient descent from each of these points, stopping once \(\Vert (A^*q_k)' \Vert _2\) is lower than another threshold. Since it is possible that several of these gradient descents land at the same point x, we subsequently check whether the resulting set contains groups of points which are too close to each other; if this is the case, we discard all but one of the points in each such group. We finally remove any point at which \(\left| A^*q_k \right| \) is not larger than \(1-\epsilon _2\), for a small \(\epsilon _2>0\).
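The four steps of this heuristic can be sketched in Python for \(d=1\) as follows. The sketch is not the authors' MATLAB code: the function name, all default parameter values and the fixed step size are assumptions, and the local search is written as a gradient ascent on \(|g|\) (equivalently, a descent on \(-|g|\)), where `g` stands for \(A^*q_k\) and `dg` for its derivative.

```python
import math

def local_maximizers(g, dg, n=100, eps1=0.1, eps2=1e-3, step=1e-3,
                     grad_tol=1e-8, merge_tol=1e-3, max_steps=10000):
    """Heuristic search on [0, 1] for local maximizers of |g| with |g| > 1 - eps2."""
    grid = [i / n for i in range(n + 1)]
    vals = [abs(g(x)) for x in grid]

    # Step 1: discrete peaks of |g| on the grid that exceed 1 - eps1.
    peaks = []
    for i, v in enumerate(vals):
        left = vals[i - 1] if i > 0 else -math.inf
        right = vals[i + 1] if i < n else -math.inf
        if v > left and v > right and v > 1 - eps1:
            peaks.append(grid[i])

    # Step 2: refine each peak by a local search until the derivative is small.
    refined = []
    for x in peaks:
        for _ in range(max_steps):
            slope = math.copysign(1.0, g(x)) * dg(x)   # derivative of |g| where g != 0
            if abs(slope) < grad_tol:
                break
            x = min(1.0, max(0.0, x + step * slope))   # stay inside [0, 1]
        refined.append(x)

    # Step 3: keep a single representative of each group of nearby points.
    refined.sort()
    merged = []
    for x in refined:
        if not merged or x - merged[-1] > merge_tol:
            merged.append(x)

    # Step 4: discard points where |g| is not larger than 1 - eps2.
    return [x for x in merged if abs(g(x)) > 1 - eps2]
```

The fixed step only works when it is small relative to the curvature of `g` at its maxima; a line search would be a natural replacement in a more careful implementation.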

Solving the Discrete Problems We have chosen to solve the problems (\(\mathcal {D}(\varOmega _k)\)) and \((\mathcal {P}(X_k))\) using an accelerated proximal gradient descent [21].
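For a quadratic data fidelity, the discrete primal problem on a fixed grid is an \(\ell ^1\)-regularized least squares problem in the amplitudes. The following Python sketch shows one standard accelerated proximal gradient scheme (of FISTA type) for \(\min _\alpha \Vert \alpha \Vert _1 + \frac{1}{2}\Vert M\alpha -y\Vert _2^2\). It is an assumption that this matches the variant of [21] used by the authors; the matrix `M` (playing the role of \(A(X_k)\)), the step size and the iteration count are hypothetical inputs.

```python
import math

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, applied entrywise."""
    return [math.copysign(max(abs(x) - t, 0.0), x) for x in v]

def matvec(M, v):
    return [sum(r * x for r, x in zip(row, v)) for row in M]

def rmatvec(M, w):
    """Product M^T w for a real matrix M stored as a list of rows."""
    return [sum(M[i][j] * w[i] for i in range(len(M))) for j in range(len(M[0]))]

def fista(M, y, step, iters=500):
    """Accelerated proximal gradient for ||alpha||_1 + 0.5 * ||M alpha - y||^2.

    `step` should not exceed 1 / ||M||^2 for the usual convergence guarantee.
    """
    p = len(M[0])
    alpha = [0.0] * p
    z = alpha[:]          # extrapolated point
    t = 1.0
    for _ in range(iters):
        residual = [u - v for u, v in zip(matvec(M, z), y)]
        grad = rmatvec(M, residual)
        new = soft_threshold([zi - step * gi for zi, gi in zip(z, grad)], step)
        t_new = (1 + math.sqrt(1 + 4 * t * t)) / 2
        z = [n + (t - 1) / t_new * (n - a) for n, a in zip(new, alpha)]
        alpha, t = new, t_new
    return alpha
```

For the identity matrix and `y = [3.0, 0.5]`, the minimizer is the soft-thresholded data `[2.0, 0.0]`, which the sketch reproduces with `step=1.0`.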

Example 1: super-resolution from Fourier measurements in 1D

We start by testing our algorithm on a popular instance of problem (\(\mathcal {P}(\varOmega )\)): super-resolution of a measure \(\mu \in \mathcal {M}(0,1)\) from finitely many of its Fourier moments

$$\begin{aligned} y_k= \left\langle a_k, \mu \right\rangle = \int _{0}^1 \exp (-ikx)d\mu , -m/2\le k \le m/2-1. \end{aligned}$$

We use a quadratic data fidelity term \(f(z) = \frac{L}{2} \Vert z-y \Vert _2^2\). This example is well studied by the signal processing community [5, 12, 23, 28].

We chose m to be equal to 30, and a vector y generated as \(A\mu _0\), where \(\mu _0\) is chosen at random as a 5-sparse atomic measure with amplitudes close to 1 or \(-1\). The positions of the Dirac masses were chosen as small random perturbations of a uniform grid. The initial grid \(\varOmega _0\) was chosen as a uniform grid with 8 points, i.e. \([0, \frac{1}{8}, \dots , \frac{7}{8}]\). We ran 100 experiments, each with 20 iterations of the exchange algorithm. The evolution of \(\mu _k\) and \(q_k\) over the first iterations of a typical run is displayed in Fig. 1. We see that after only 8 iterations, \(A^*q_k\) appears to be very close to \(A^*q^\star \). Before this iteration, the algorithm ’chooses’ to add points relatively uniformly to the grid, but after that, new points are only added close to \(\xi \). This is further emphasized by Fig. 2, in which \(X_k\) is plotted for each iteration, along with the size of \(\varOmega _k\).
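The data generation for this example can be sketched in Python as follows. The measurement formula is taken as printed above, \(y_k=\sum _i \alpha _i \exp (-ikx_i)\) for \(-m/2\le k \le m/2-1\), with m assumed even. The sampling of \(\mu _0\) is only described qualitatively in the text, so the jitter size, the amplitude spread and the seed below are hypothetical choices.

```python
import cmath
import random

def fourier_moments(alpha, X, m):
    """y_k = sum_i alpha_i * exp(-1j * k * x_i) for k = -m/2, ..., m/2 - 1 (m even)."""
    return [sum(a * cmath.exp(-1j * k * x) for a, x in zip(alpha, X))
            for k in range(-(m // 2), m // 2)]

def random_sparse_measure(s=5, jitter=0.02, spread=0.1, seed=0):
    """Amplitudes close to +1 or -1, positions near a uniform grid in (0, 1).

    jitter, spread and seed are illustrative values, not taken from the paper.
    """
    rng = random.Random(seed)
    alpha = [rng.choice([-1.0, 1.0]) * (1.0 + rng.uniform(-spread, spread))
             for _ in range(s)]
    X = [(i + 0.5) / s + rng.uniform(-jitter, jitter) for i in range(s)]
    return alpha, X
```

With `alpha, X = random_sparse_measure()` and `y = fourier_moments(alpha, X, 30)`, the entry of `y` corresponding to \(k=0\) equals the sum of the amplitudes, which gives a quick sanity check of the indexing.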

Fig. 1

Above: \(\mu _k\) for \(k=0,2,4,6,8,20\) along one run of the algorithm. Below: \(A^*q_k\) for \(k=0,2,4,6,8,20\) along the same run. Note that the range of the first plot is different from the others

Fig. 2

Left: The set \(X_k\) of added points for each iteration along a run of the algorithm. Right: The total number of points in \(\varOmega _k\) along the same run

To assess the success of the algorithm more systematically, we tracked the evolution of \({{\,\mathrm{dist}\,}}(\xi |X_k)\), \({{\,\mathrm{dist}\,}}(\varOmega _k|X_k)\) and \({{\,\mathrm{dist}\,}}(\varOmega _k|\xi )\). The median over the 100 experiments, along with confidence intervals covering all experiments but the top and bottom \(5\%\), is plotted in Fig. 3. We see that all of the quality measures seem to converge linearly to 0.
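The three quality measures can be computed with a one-sided distance between finite point sets. The definition of \({{\,\mathrm{dist}\,}}(\cdot |\cdot )\) is not restated in this section, so the Python sketch below assumes \({{\,\mathrm{dist}\,}}(X_1|X_2)=\sup _{x_2\in X_2}\inf _{x_1\in X_1}\Vert x_1-x_2\Vert _2\), i.e. the largest distance from a point of the second set to the first set. Points are represented as tuples of coordinates.

```python
import math

def dist(X1, X2):
    """Assumed one-sided distance: sup over x2 in X2 of inf over x1 in X1 of ||x1 - x2||_2."""
    return max(min(math.dist(x1, x2) for x1 in X1) for x2 in X2)
```

Under this assumed convention, `dist(omega_k, xi)` is small as soon as every point of `xi` has a grid point nearby, even if the grid still contains points far from `xi`.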

Fig. 3

Logarithmic plot of \({{\,\mathrm{dist}\,}}(\xi |X_k)\), \({{\,\mathrm{dist}\,}}(\varOmega _k|X_k)\) and \({{\,\mathrm{dist}\,}}(\varOmega _k|\xi )\). Shown is the median value (oblique line) along with confidence intervals (dashed) covering all but the top and lower \(5\%\) values

Finally, we performed the same analysis for the optimum gap \(\min (\mathcal {P}(\varOmega _k)) - \min (\mathcal {P}(\varOmega ))\), the error \(\Vert q_k-q^\star \Vert _2\) and the sizes of the grids \(\varOmega _k\). (\(\min (\mathcal {P}(\varOmega ))\) was in each case chosen as the lowest value of \(\min (\mathcal {P}(\varOmega _k))\) over all iterations k, and \(q^\star \) as the corresponding dual solution.) We see that the optimum gap seems to converge exponentially to 0 right from the first iteration, whereas the error \(\Vert q_k-q^\star \Vert _2\) initially does not. The ’two-phase’ effect is also easy to spot: after about 5–6 iterations, the algorithm switches from adding many points to adding only a few points close to \(\xi \). Interestingly, the plateau of the q-errors seems to be simultaneous with the ’phase transition’ (Fig. 4).

Fig. 4

Plot of the evolution of the optimum gap, q-error and grid sizes. The top two plots are logarithmic, while the bottom one is not. The oblique lines represent the median values, the dashed ones are confidence intervals covering all but the top and bottom \(5\%\) values

Example 2: super-resolution from Gaussian measurements in 2D

Next, we perform a study in a two-dimensional setting. We consider \(\varOmega =[-1,1]^2\) and measurement functions of the form

$$\begin{aligned} a_i(x) = \exp \left( - \frac{\Vert x-x_i \Vert ^2}{2\sigma ^2} \right) , \end{aligned}$$

where the points \(x_i\) live on a Euclidean grid of size \(64\times 64\), restricted to the domain \([-0.5,0.5]^2\). We then add white Gaussian noise to the measurements, leading to pictures of the type shown in Fig. 5. Here, the true underlying measure contains 11 Dirac masses with random positive amplitudes and random locations on \([-0.4,0.4]^2\).
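A minimal Python sketch of this measurement model is given below. The kernel width `sigma`, the noise level `noise_std` and the seed are not specified in this section and are therefore hypothetical parameters; the grid of sensor positions covers \([-0.5,0.5]^2\) with `n` points per axis (`n=64` in the setting described above).

```python
import math
import random

def gaussian_measurements(alpha, X, sigma, n=64, noise_std=0.0, seed=0):
    """Noisy samples y_i = sum_l alpha_l * exp(-||x_l - c_i||^2 / (2 sigma^2)) + noise.

    alpha: amplitudes; X: list of 2D positions (tuples); the centers c_i form
    an n-by-n grid on [-0.5, 0.5]^2. sigma, noise_std and seed are illustrative.
    """
    rng = random.Random(seed)
    centers = [(-0.5 + i / (n - 1), -0.5 + j / (n - 1))
               for i in range(n) for j in range(n)]
    y = []
    for c in centers:
        value = sum(a * math.exp(-((x[0] - c[0]) ** 2 + (x[1] - c[1]) ** 2)
                                 / (2 * sigma ** 2))
                    for a, x in zip(alpha, X))
        y.append(value + rng.gauss(0.0, noise_std))   # additive white Gaussian noise
    return centers, y
```

For a single Dirac mass and no noise, the largest measurement is attained at the sensor closest to the mass, which is a convenient check before adding noise.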

Fig. 5

Measurements y associated to a super-resolution experiment. A sparse measure is convolved with a Gaussian kernel and Gaussian white noise is added

Exchange algorithm

The evolution of the grids \(\varOmega _k\) and of the dual certificates \(|A^*q_k|\) is shown in Fig. 6. As can be seen, points are initially added anywhere in the domain, but after a few iterations, they all cluster around the true locations, as expected from the theory. To further stress this phenomenon and illustrate our theorems and lemmata, we display many quantities of interest appearing in our main results in Fig. 7: the distance from \(X_k\) to \(\xi \) (where \(\xi \) is estimated as \(X_{40}\)) in Fig. 7c, the distance from \(\varOmega _k\) to \(\xi \) in Fig. 7b, the evolution of \(J(\widehat{\mu }_k)-J(\widehat{\mu }_{40})\) in Fig. 7a, and \(\Vert A^*q_k\Vert _\infty -1\) in Fig. 7e. Finally, the number of maxima of \(|A^*q_k|\) is shown in Fig. 7f. As can be seen, the number of maxima quickly stabilizes, suggesting that we reached a \(\tau _0\)-regime. Then all the quantities (cost function, distance from \(\xi \), violation of the constraints) seem to converge to 0 linearly. This no longer holds after iteration 15, and we suspect that this is solely due to numerical inaccuracies when computing the solution of the discretized problems. Notice however that the error in the Dirac locations drops below \(10^{-3}\) after 14 iterations, and that this accuracy is more than enough for the particular super-resolution application. Notice that if we wished to reach this accuracy with a fixed grid, we would need a Euclidean discretization containing \(10^6\) points, while we here needed only 152 (\(|\varOmega _{14}|=152\)). In addition, the \(\ell ^1\) resolution is stable since it is accomplished on a grid \(X_{14}\) containing only 11 points.

Fig. 6

Evolution of the dual certificate and of the grid through the 12 first iterations. This is a contour plot with the levels from 1 to the maximum of \(|A^*q_i|\) indicated

Fig. 7

Plot of several quantities of interest along the exchange algorithm’s iterates

Continuous method

In this experiment, we evaluate the behavior of the gradient descent (28) depending on the initialization \((\alpha ^{(0)},X^{(0)})\) and on the number of iterations. We use the same setting as in the previous section. The left graph of Fig. 8 illustrates that the gradient descent typically converges linearly when initialized close enough to the true minimizer \((\alpha ^{\star },\xi )\). This was predicted by Proposition 8. In this case (and actually all the others related to this experiment), it converges to machine precision in fewer than 1000 iterations. This is remarkable since gradient descent is a simple algorithm that can easily be improved by using e.g. Nesterov acceleration (we proved that the function is locally convex) or other optimization schemes such as L-BFGS.

In order to evaluate the size of the basin of attraction around the global minimizer, we start from random points of the form \((\alpha ^{(0)},X^{(0)}) = (\alpha ^{\star },\xi ) + (\Delta _\alpha , \Delta _X)\), where \(\Delta _\alpha \) and \(\Delta _X\) are random perturbations with an amplitude set as \(\Vert (\Delta _\alpha , \Delta _X)\Vert _2 = \gamma \Vert (\alpha ^{\star },\xi )\Vert _2\), with \(\gamma \) in [0, 1]. We then run 50 gradient descents with different realizations of \((\alpha ^{(0)},X^{(0)})\) and record the success rate (i.e. the number of times the gradient descent converges to \((\alpha ^{\star },\xi )\) with an accuracy of at least \(10^{-6}\)). We plot this success rate with respect to \(\gamma \) in Fig. 8b. As can be seen, the success rate is always 1 when the relative error \(\gamma \) is less than \(5\%\), showing that for this particular problem, a rather rough initialization suffices for the gradient descent to converge to the global minimizer.
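The random initializations used in this experiment can be generated as in the following Python sketch. The distribution of the perturbation direction is not specified in the text; a Gaussian direction is assumed here, and the function name is hypothetical. The positions are passed as a flat list of coordinates.

```python
import math
import random

def perturb(alpha_star, xi, gamma, rng):
    """Return (alpha0, X0) with ||(alpha0, X0) - (alpha*, xi)||_2 = gamma * ||(alpha*, xi)||_2.

    The direction is drawn from a standard Gaussian (an assumption of this sketch).
    """
    v = list(alpha_star) + list(xi)
    d = [rng.gauss(0.0, 1.0) for _ in v]
    scale = gamma * math.sqrt(sum(t * t for t in v)) / math.sqrt(sum(t * t for t in d))
    w = [a + scale * b for a, b in zip(v, d)]
    return w[:len(alpha_star)], w[len(alpha_star):]
```

A success rate for a given \(\gamma \) is then the fraction of such draws from which the descent returns to \((\alpha ^{\star },\xi )\) within the chosen accuracy.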

Fig. 8

Left: Typical convergence curve in logarithmic scale when the initial guess \((\alpha ^{(0)},X^{(0)})\) is good enough. Right: Success rate of the continuous descent method over 50 runs of the algorithm, depending on the relative amplitude of the perturbation

Alternating method

The alternating method suggested in Algorithm 2 turns out to converge in a single iteration when applied to the setting described above. We therefore apply it to a more challenging scenario with 30 Dirac masses instead of 11 and more noise. The measurements y are shown in Fig. 9. We compare three implementations: a pure exchange method, an alternating method as in Algorithm 2 without line 14 and an alternating method as in Algorithm 2 with line 14. The conclusions are as follows:

  • All methods rapidly conclude that the underlying measure contains 30 Dirac masses. (The pure exchange algorithm after 10 iterations, the alternating method with line 14 already after the first).

  • The pure exchange algorithm quickly gets to a point close to the optimum. The positions then slowly converge to the true locations. It does however eventually find the basin of attraction of G (in this example, it needed 10 iterations).

  • Line 14 in the alternating method improves the convergence significantly. In fact, omitting it, we need 10 iterations to find the basin of attraction, whereas the version with the line finds it directly. Investigating this effect more closely is an interesting line of future research.

Fig. 9

Left: measurements associated to a denser measure with more noise. Right: 3D illustration of the recovery results. The blue vertical bars with circles indicate the locations and amplitudes of the ground truth. The red bars with crosses indicate the recovered measure. Apart from a slight bias in amplitude due to the \(\ell ^1\)-norm, the ground truth is near perfectly recovered (color figure online)

References

  1. Borwein, J.M., Lewis, A.S.: Partially finite convex programming, part I: quasi relative interiors and duality theory. Math. Program. 57(1), 15–48 (1992)

  2. Boyd, N., Schiebinger, G., Recht, B.: The alternating descent conditional gradient method for sparse inverse problems. SIAM J. Optim. 27(2), 616–639 (2017)

  3. Boyer, C., Chambolle, A., De Castro, Y., Duval, V., De Gournay, F., Weiss, P.: On representer theorems and convex regularization. SIAM J. Optim. 29(2), 1260–1281 (2019)

  4. Bredies, K., Pikkarainen, H.K.: Inverse problems in spaces of measures. ESAIM Control Optim. Calculus Var. 19(1), 190–218 (2013)

  5. Candès, E.J., Fernandez-Granda, C.: Towards a mathematical theory of super-resolution. Commun. Pure Appl. Math. 67(6), 906–956 (2014)

  6. Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM Rev. 43(1), 129–159 (2001)

  7. Chizat, L., Bach, F.: On the global convergence of gradient descent for over-parameterized models using optimal transport. In: Advances in neural information processing systems, pp. 3036–3046 (2018)

  8. De Castro, Y., Gamboa, F.: Exact reconstruction using Beurling minimal extrapolation. J. Math. Anal. Appl. 395(1), 336–354 (2012)

  9. De Castro, Y., Gamboa, F., Henrion, D., Lasserre, J.-B.: Exact solutions to Super Resolution on semi-algebraic domains in higher dimensions. IEEE Trans. Inf. Theory 63(1), 621–630 (2017)

  10. Denoyelle, Q., Duval, V., Peyré, G., Soubies, E.: The sliding Frank–Wolfe algorithm and its application to super-resolution microscopy. Inverse Prob. 36, 014001 (2019)

  11. Dossal, C., Duval, V., Poon, C.: Sampling the Fourier transform along radial lines. SIAM J. Numer. Anal. 55(6), 2540–2564 (2017)

  12. Duval, V., Peyré, G.: Exact support recovery for sparse spikes deconvolution. Found. Comput. Math. 15(5), 1315–1355 (2015)

  13. Eftekhari, A., Thompson, A.: Sparse inverse problems over measures: equivalence of the conditional gradient and exchange methods. SIAM J. Optim. 29, 1329–1349 (2019)

  14. Fisher, S.D., Jerome, J.W.: Spline solutions to L1 extremal problems in one and several variables. J. Approx. Theory 13(1), 73–83 (1975)

  15. Flinth, A., Weiss, P.: Exact solutions of infinite dimensional total-variation regularized problems. Inf. Inference 8, 407–443 (2017)

  16. Frank, M., Wolfe, P.: An algorithm for quadratic programming. Naval Res. Logist. Q. 3(1–2), 95–110 (1956)

  17. 17.

    Hettich, R., Kortanek, K.O.: Semi-infinite programming: theory, methods, and applications. SIAM Rev. 35(3), 380–429 (1993)

    MathSciNet  Article  Google Scholar 

  18. 18.

    Hettich, R., Zencke, P.: Numerische Methoden der Approximation und semi-infiniten Optimierung. Vieweg+Teubner, Berlin (1982)

    Book  Google Scholar 

  19. 19.

    Hiriart-Urruty, J.-B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms I: Fundamentals, vol. 305. Springer, Berlin (2013)

    MATH  Google Scholar 

  20. 20.

    Evgenii Solomonovich Levitin and Boris Teodorovich Polyak: Constrained minimization methods. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki 6(5), 787–823 (1966)

    Google Scholar 

  21. 21.

    Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)

    MathSciNet  Article  Google Scholar 

  22. 22.

    Pieper, K., Walter, D.: Linear convergence of accelerated conditional gradient algorithms in spaces of measures. arXiv:1904.09218 (2019)

  23. 23.

    Poon, C., Keriven, N., Peyré, G.: Support localization and the fisher metric for off-the-grid sparse regularization. Proc. Mach. Learn. Res. 89, 1341–1350 (2019)

    Google Scholar 

  24. 24.

    Reemtsen, R.: Modifications of the first Remez algorithm. SIAM J. Numer. Anal. 27(2), 507–518 (1990)

    MathSciNet  Article  Google Scholar 

  25. 25.

    Reemtsen, R., Görner, S.: Numerical methods for semi-infinite programming: a survey. pp. 195–262 (1998)

  26. 26.

    Remes, E.: Sur un procédé convergent d’approximations successives pour déterminer les polynômes d’approximation. CR Acad. Sci. Paris 198, 2063–2065 (1934)

    MATH  Google Scholar 

  27. 27.

    Tang, G., Bhaskar, B.N., Recht, B.: Sparse recovery over continuous dictionaries-just discretize. In: 2013 Asilomar Conference on Signals, Systems and Computers, pp. 1043–1047. IEEE (2013)

  28. 28.

    Tang, G., Bhaskar, B.N., Shah, P., Recht, B.: Compressed sensing off the grid. IEEE Trans. Inf. Theory 59(11), 7465–7490 (2013)

    MathSciNet  Article  Google Scholar 

  29. 29.

    Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodological), pp. 267–288 (1996)

  30. 30.

    Traonmilin, Y., Aujol, J.-F.: The basins of attraction of the global minimizers of the non-convex sparse spikes estimation problem. Inverse Prob. 36, 045003 (2020)

    MathSciNet  Article  Google Scholar 

  31. 31.

    Unser, M., Fageot, J., Ward, J.P.: Splines are universal solutions of linear inverse problems with generalized tv regularization. SIAM Rev. 59(4), 769–793 (2017)

    MathSciNet  Article  Google Scholar 

  32. 32.

    Vapnik, V.: The Nature of Statistical Learning Theory. Springer, Berlin (2013)

    MATH  Google Scholar 

  33. 33.

    Zuhovickiĭ, S.I.: Remarks on problems in approximation theory. Mat. Zbirnik KDU, pp. 169–183 (1948). (Ukrainian)

Download references

Acknowledgements

Open access funding provided by University of Gothenburg. The authors acknowledge support from ANR JCJC Optimization on Measures Spaces ANR-17-CE23-0013-01 and from ANR-3IA Artificial and Natural Intelligence Toulouse Institute.

Author information

Correspondence to Axel Flinth.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Flinth, A., de Gournay, F. & Weiss, P. On the linear convergence rates of exchange and continuous methods for total variation minimization. Math. Program. 190, 221–257 (2021). https://doi.org/10.1007/s10107-020-01530-0

Keywords

  • Total variation minimization
  • Inverse problems
  • Superresolution
  • Semi-infinite programming

Mathematics Subject Classification

  • 49M25
  • 49M29
  • 90C34
  • 65K05