A generalized conditional gradient method for dynamic inverse problems with optimal transport regularization

We develop a dynamic generalized conditional gradient method (DGCG) for dynamic inverse problems with optimal transport regularization. We consider the framework introduced in (Bredies and Fanzon, ESAIM: M2AN, 54:2351-2382, 2020), where the objective functional comprises a fidelity term, penalizing the pointwise-in-time discrepancy between the observation and the unknown in time-varying Hilbert spaces, and a regularizer keeping track of the dynamics, given by the Benamou-Brenier energy constrained via the homogeneous continuity equation. Employing the characterization of the extremal points of the Benamou-Brenier energy (Bredies et al., arXiv:1907.11589, 2019), we define the atoms of the problem as measures concentrated on absolutely continuous curves in the domain. We propose a dynamic generalization of the conditional gradient method that consists in iteratively adding suitably chosen atoms to the current sparse iterate, and subsequently optimizing the coefficients in the resulting linear combination. We prove that the method converges with a sublinear rate to a minimizer of the objective functional. Additionally, we propose heuristic strategies and acceleration steps that allow for an efficient implementation of the algorithm. Finally, we provide numerical examples that demonstrate the effectiveness of our algorithm and model at reconstructing heavily undersampled dynamic data, also in the presence of noise.


Introduction
The aim of this paper is to develop a dynamic generalized conditional gradient method (DGCG) to numerically compute solutions of ill-posed dynamic inverse problems regularized with optimal transport energies. The code is openly available on GitHub. Lately, several approaches have been proposed to tackle dynamic inverse problems [42,57,58,61,67], all of which take advantage of redundancies in the data, allowing one to stabilize reconstructions in the presence of noise or undersampling. A common challenge faced in such time-dependent approaches is understanding how to properly connect, or relate, the time-neighbouring datapoints, in a way that the reconstructed object follows a presumed dynamic. In this paper we address this issue by means of dynamic optimal transport. A wide range of applications can benefit from motion-aware approaches. In particular, the employment of dynamic reconstruction methods represents one of the latest key mathematical advances in medical imaging. For instance, magnetic resonance imaging (MRI) [48,50,56] and computed tomography (CT) [9,21,32] methods allow dynamic modalities in which the time-dependent data is further undersampled to reach high temporal sampling rates; these are required to resolve organ motion, such as the beating heart or the breathing lung. A more accurate reconstruction of the image, and of the underlying dynamics, would yield valuable diagnostic information.
1.1. Setting and existing approaches. Recently, it has been proposed to regularize dynamic inverse problems using dynamic optimal transport energies, both in a balanced and an unbalanced context [16,59,60], with the goal of efficiently reconstructing time-dependent Radon measures. Such a regularization choice is natural: optimal transport energies incorporate information about time correlations present in the data, thus favoring a more stable reconstruction. Optimal transport theory was originally developed to find the most efficient way to move mass from a probability measure ρ_0 to another one ρ_1, with respect to a given cost [44,54]. More recently, Benamou and Brenier [8] showed that the optimal transport map can be computed by solving

(1.1)   min_{(ρ,v)}  (1/2) ∫_0^1 ∫_Ω |v_t(x)|² dρ_t(x) dt   subject to  ∂_t ρ_t + div(v_t ρ_t) = 0 ,

where t ↦ ρ_t is a curve of probability measures on the closure of a bounded domain Ω ⊂ R^d, v_t is a time-dependent vector field advecting the mass, and the continuity equation is intended in the sense of distributions with initial data ρ_0 and final data ρ_1. Notably, the quantity at (1.1) becomes convex when expressed in the variables ρ := dt ⊗ ρ_t and m := v ρ: defining

B(ρ, m) := (1/2) ∫ |dm/dρ|² dρ   if ρ ≥ 0 and m ≪ ρ ,

and B(ρ, m) := +∞ otherwise, problem (1.1) is equivalent to minimizing B under the linear constraint ∂_t ρ + div m = 0. Such a convex reformulation can be employed as a regularizer for dynamic inverse problems where, instead of fixing initial and final data, a fidelity term is added to measure the discrepancy between the unknown and the observation at each time instant, as proposed in [16]. There the authors consider the dynamic inverse problem of finding a curve of measures t ↦ ρ_t, with ρ_t ∈ M(Ω), such that

(1.2)   K*_t ρ_t = f_t   for a.e. t ∈ (0, 1) ,

where f_t ∈ H_t is some given data, {H_t} is a family of Hilbert spaces, and K*_t : M(Ω) → H_t are linear continuous measurement operators. The regularized problem then reads

(1.3)   min_{(ρ,m)}  (1/2) ∫_0^1 ‖K*_t ρ_t − f_t‖²_{H_t} dt + β B(ρ, m) + α ‖ρ‖_{M(X)}   subject to  ∂_t ρ + div m = 0 ,

where ‖ρ‖_{M(X)} denotes the total variation norm of the measure ρ := dt ⊗ ρ_t, and α, β > 0 are regularization parameters. Notice that any curve t ↦ ρ_t having finite Benamou-Brenier energy and satisfying the continuity equation constraint must have constant mass (see Lemma A.1).
As a consequence, the regularization in (1.3) is especially suited to reconstruct motions where preservation of mass is expected. We point out that such formulation is remarkably flexible, as the measurement spaces and measurement operators are allowed to be very general. In this way one could model, for example, undersampled acquisition strategies in medical imaging, particularly MRI [16]. The aim of this paper is to design a numerical algorithm to solve (1.3). The main difficulties arise due to the non-reflexivity of measure spaces. Even in the static case, solving the classical LASSO problem [62] in the space of bounded Borel measures (known as BLASSO [29]), i.e.,

(1.4)   inf_{ρ ∈ M(Ω)}  (1/2) ‖Kρ − f‖²_Y + α ‖ρ‖_{M(Ω)} ,

for given data f in a Hilbert space Y and a linear continuous operator K : M(Ω) → Y, has proven to be challenging. Usual strategies to tackle (1.4) numerically often rely on the discretization of the domain [26,28,63]; however, grid-based methods are known to be affected by theoretical and practical flaws, such as high computational costs and the presence of mesh-dependent artifacts in the reconstruction. These drawbacks have motivated algorithms that do not rely on domain discretization, but optimize directly on the space of measures. One class of such algorithms, first introduced in [20] and subsequently developed in different directions [10,30,39,51], are named generalized conditional gradient methods (GCG) or Frank-Wolfe type algorithms. They can be regarded as the infinite dimensional generalization of the classical Frank-Wolfe optimization algorithm [41] and of GCG in Banach spaces [7,18,19,25,33,43]. The basic idea behind such algorithms consists in exploiting the structure of sparse solutions to (1.4), which are given by finite linear combinations of Dirac deltas supported on Ω. In this case Dirac deltas represent the extremal points of the unit ball of the Radon norm regularizer.
With this knowledge at hand, the GCG method iteratively minimizes a linearized version of (1.4); a minimizer of the linearization can be found among the extremal points. The iterate is then constructed by adding delta peaks at each iteration, and by subsequently optimizing the coefficients of the linear combination. GCG methods have proven to be successful at solving (1.4), and have been adapted to related problems in the context of, e.g., super-resolution [1,49,51,60].
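To fix ideas, the insert-then-reoptimize structure of GCG can be sketched for a grid-discretized BLASSO with nonnegative measures. This toy version is emphatically not the grid-free method developed in this paper (it is precisely what the paper avoids); the operator K, data f and parameter alpha are placeholder inputs, and the weight solver is a simple projected gradient stand-in.

```python
import numpy as np

def gcg_blasso(K, f, alpha, iters=30):
    """Conditional gradient sketch for min_{rho >= 0} 0.5*||K rho - f||^2 + alpha*sum(rho)
    on a fixed grid: insertion picks the grid point maximizing the dual variable,
    then the weights on the active set are re-optimized by projected gradient."""
    m, n = K.shape
    support, weights = [], []
    for _ in range(iters):
        residual = K @ _assemble(n, support, weights) - f
        w = -K.T @ residual                      # dual variable of the current iterate
        i = int(np.argmax(w))
        if w[i] <= alpha and support:            # optimality: no descent direction left
            break
        if i not in support:
            support.append(i)
            weights.append(0.0)
        weights = _optimize_weights(K[:, support], f, alpha, np.array(weights))
        support = [s for s, c in zip(support, weights) if c > 0]  # drop zero weights
        weights = [c for c in weights if c > 0]
    return _assemble(n, support, weights)

def _assemble(n, support, weights):
    rho = np.zeros(n)
    for s, c in zip(support, weights):
        rho[s] = c
    return rho

def _optimize_weights(Ks, f, alpha, c, steps=500):
    L = np.linalg.norm(Ks.T @ Ks, 2) + 1e-12     # Lipschitz constant of the gradient
    for _ in range(steps):
        grad = Ks.T @ (Ks @ c - f) + alpha
        c = np.maximum(c - grad / L, 0.0)        # projected gradient step onto c >= 0
    return c
```

On a toy problem with K the identity, the scheme inserts the correct grid point in the first iteration and the stopping test fires in the second, returning the soft-thresholded datum.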
1.2. Outline of the main contributions. Inspired by GCG methods, the goal of this paper is to develop a dynamic generalized conditional gradient method (DGCG) aimed at solving the dynamic minimization problem at (1.3). Similarly to classical GCG approaches, our DGCG algorithm is based on the structure of sparse solutions to (1.3), and it is Lagrangian in essence, since it does not require a discretization of the space domain. Lagrangian approaches have proven useful for many different dynamic applications [55], often outperforming Eulerian approaches, where the discretization in space is necessary. Indeed, since Eulerian approaches are based on the optimization of challenging discrete assignment problems in space, Lagrangian approaches allow one to lower the computational costs and are more suitable to reconstruct coalescence phenomena in the dynamics. Motivated by similar considerations, our approach aims to reduce the reconstruction artifacts and lower the computational cost when compared to grid-based methods designed to solve inverse problems similar to (1.3) (see [59]). The fundamentals of our approach rest on recent results concerning sparsity for variational inverse problems: it has been empirically observed that the presence of a regularizer promotes the existence of sparse solutions, that is, minimizers that can be represented as a finite linear combination of simpler atoms. This effect is evident in reconstruction problems [34,40,64,65,66], as well as in variational problems in other applications, such as materials science [37,38,46,53]. Existence of sparse solutions has been recently proven for a class of general functionals comprised of a fidelity term, mapping to a finite dimensional space, and a regularizer: in this case atoms correspond to the extremal points of the unit ball of the regularizer [11,14].
In the context of (1.3), the extremal points of the Benamou-Brenier energy have been recently characterized in [15]; this provides an operative notion of atoms that will be used throughout the paper. More precisely, for every absolutely continuous curve γ : [0,1] → Ω we name as atom of the Benamou-Brenier energy the respective pair of measures µ_γ := (ρ_γ, m_γ) ∈ M defined by

(1.5)   ρ_γ := a_γ dt ⊗ δ_{γ(t)} ,   m_γ := γ̇(t) ρ_γ ,   a_γ := ( (β/2) ∫_0^1 |γ̇(t)|² dt + α )^{−1} .

The notion of atom described above can be regarded as the dynamic counterpart of the Dirac deltas for the Radon norm regularizer. Curves of measures of the form (1.5) constitute the building blocks used in our DGCG method to generate at iteration step n the sparse iterate µ_n = (ρ_n, m_n),

(1.6)   µ_n = Σ_{j=1}^{N_n} c_j µ_{γ_j} ,

converging to a solution of (1.3), where c_j > 0. The basic DGCG method proposed comprises two steps. The first one, called the insertion step, operates as follows. Given a sparse iterate µ_n of the form (1.6), we obtain a descent direction for the energy at (1.3) by minimizing a version of (1.3) around µ_n in which the fidelity term is linearized. We show that, in order to find such a descent direction, it is sufficient to solve

(1.7)   min_{(ρ,m) ∈ Ext(C_{α,β})}  − ∫_0^1 ⟨ρ_t, w^n_t⟩ dt ,

where w^n_t := −K_t(K*_t ρ^n_t − f_t) ∈ C(Ω) is the dual variable of the problem at the iterate µ_n, and Ext(C_{α,β}) denotes the set of extremal points of the unit ball of the regularizer in (1.3), namely of the set

C_{α,β} := {(ρ, m) ∈ M : ∂_t ρ + div m = 0 ,  β B(ρ, m) + α ‖ρ‖_{M(X)} ≤ 1} .

Formula (1.7) clarifies the connection between atoms and extremal points of C_{α,β}, showing the fundamental role that the latter play in sparse optimization and GCG methods. In view of the characterization Theorem 2.1, proven in [15], the minimization problem (1.7) can be equivalently written in terms of atoms as

(1.8)   min_{γ ∈ AC²}  − a_γ ∫_0^1 w^n_t(γ(t)) dt ,

where AC² denotes the set of absolutely continuous curves with values in Ω and weak derivative in L²([0,1]; R^d).
The insertion step then consists in finding a curve γ* solving (1.8) and considering the respective atom µ_{γ*}. Afterwards, setting γ_{N_n+1} := γ*, the coefficients optimization step proceeds by optimizing the conic combination µ_n + c_{N_n+1} µ_{γ_{N_n+1}} with respect to (1.3), among all nonnegative coefficients c_j. Denoting by c*_1, ..., c*_{N_n+1} a solution to such problem, the new iterate is defined by µ_{n+1} := Σ_j c*_j µ_{γ_j}. The two steps of inserting a new atom in the linear combination and optimizing the coefficients are the building blocks of our core algorithm, summarized in Algorithm 1. In Theorem 4.6 we prove that this algorithm has a sublinear convergence rate, similarly to the GCG method for the BLASSO problem [20], and that the produced iterates µ_n converge in the weak* sense of measures to a solution of (1.3). The core algorithm and its analysis are the subject of Section 4. From the computational point of view, we observe that the coefficients optimization step can be solved efficiently, as it is equivalent to a finite dimensional quadratic program. Concerning the insertion step, however, even if the complexity of searching for a descent direction for (1.3) is reduced by only minimizing in the set of atoms, (1.8) remains a challenging nonlinear and nonlocal problem. For this reason, we shift our attention to computing stationary points for (1.8), relying on gradient descent strategies. Specifically, we prove that, under additional assumptions on H_t and K*_t, problem (1.8) can be cast in the Hilbert space H¹([0,1]; R^d), and that the gradient descent algorithm, with appropriate stepsize, outputs stationary points of (1.8) (see Theorem A.6). With this theoretical result at hand, in Section 5.1 we formulate a solution strategy for (1.8) based on multistart gradient descent methods, whose initializations are chosen according to heuristic principles.
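A minimal sketch of such a multistart gradient descent on a time-discretized version of (1.8), in dimension d = 1 and with a synthetic stand-in for the dual variable w^n_t (the real one depends on the operators K_t and the data), could look as follows; the surrogate w, the step size, and the iteration counts are illustrative choices only.

```python
import numpy as np

# Hypothetical dual variable w(t, x): in the actual algorithm w_t^n = -K_t(K_t^* rho_t^n - f_t);
# here a smooth surrogate, peaked along the path x = 0.5 + 0.3 t, stands in for it.
def w(t, x):
    return np.exp(-40.0 * (x - (0.5 + 0.3 * t)) ** 2)

def objective(gamma, alpha, beta, ts):
    """Discretized insertion-step energy -a_gamma * int_0^1 w(t, gamma(t)) dt, cf. (1.8)."""
    dt = ts[1] - ts[0]
    kinetic = 0.5 * beta * np.sum(np.diff(gamma) ** 2) / dt   # (beta/2) int |gamma'|^2 dt
    a_gamma = 1.0 / (kinetic + alpha)
    return -a_gamma * np.sum(w(ts, gamma)) * dt

def descend(gamma, alpha, beta, ts, lr=0.02, steps=300, h=1e-5):
    """Plain gradient descent on the curve nodes, gradient by central differences."""
    for _ in range(steps):
        grad = np.empty_like(gamma)
        for k in range(gamma.size):
            e = np.zeros_like(gamma)
            e[k] = h
            grad[k] = (objective(gamma + e, alpha, beta, ts)
                       - objective(gamma - e, alpha, beta, ts)) / (2 * h)
        gamma = np.clip(gamma - lr * grad, 0.0, 1.0)   # keep the curve inside Omega = (0, 1)
    return gamma

def multistart_insertion(alpha, beta, n_starts=5, n_times=21, seed=0):
    """Run descents from random constant curves and keep the best stationary candidate."""
    rng = np.random.default_rng(seed)
    ts = np.linspace(0.0, 1.0, n_times)
    best, best_val = None, np.inf
    for _ in range(n_starts):
        gamma0 = np.full(n_times, rng.uniform(0.0, 1.0))
        gamma = descend(gamma0, alpha, beta, ts)
        val = objective(gamma, alpha, beta, ts)
        if val < best_val:
            best, best_val = gamma, val
    return best, best_val
```

The crossover initializations described next would, in the same spirit, splice node sequences of two previously found curves before calling `descend`.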
The initial curves are chosen randomly in the regions where the dual variable w^n_t attains larger values, and new starting curves are produced by combining pieces of stationary curves for (1.8) by means of a procedure named crossover. We complement the core algorithm with acceleration strategies. First, we add multiple atoms in the insertion step (multiple insertion step). Such new atoms can be easily obtained as a byproduct of the multistart gradient descent in the insertion step. Moreover, after optimizing the coefficients of the linear combination, we perform an additional gradient descent step with respect to (1.3), varying the curves in the iterate µ_{n+1} while keeping the weights fixed. Such a procedure, named the sliding step, is then alternated with the coefficients optimization step for a fixed number of iterations, before searching for a new atom in the insertion step. These additional steps are described in Section 5.1. We mention that similar strategies were already employed for the BLASSO problem [20,52]. They are then included in the basic core algorithm to obtain the complete DGCG method in Algorithm 2. In Section 6 we provide numerical examples. As observation operators K*_t we use time-dependent undersampled Fourier measurements, popular in imaging and medical imaging [17,35], as well as in compressed sensing and super-resolution [1,22,23]. Such examples show the effectiveness of our DGCG method in reconstructing spatially sparse data in the simultaneous presence of strong noise and severe temporal undersampling. Indeed, satisfactory results are obtained for ground truths with 20% and 60% added Gaussian noise, and heavy temporal undersampling, in the sense that, at each time t, the observation operator K*_t is not able to distinguish sources along lines. With such ill-posed measurements, static reconstruction methods would not be able to accurately recover any ground truth.
In contrast, the time regularization chosen in (1.3) allows us to resolve the dynamics by correlating the information of neighbouring data points. As shown by the experiments presented, our DGCG algorithm produces accurate reconstructions of the ground truth. In the cases of 20% and 60% added noise we note a surge of low-intensity artifacts; nonetheless, the obtained reconstruction is close, in the sense of measures, to the original ground truth. Moreover, in all the examples we tested, a linear convergence rate was observed; this shows that the algorithm is, in practice, faster than the theoretical guarantees (Theorem 4.6), and a linear rate is to be expected in most cases. We conclude the paper with Section 7, in which we discuss future perspectives and open questions, such as the possibility of improving the theoretical convergence rate for Algorithm 1, and alternative modelling choices.
1.3. Organization of the paper. The paper is organized as follows. In Section 2 we summarize the relevant notation and preliminary results regarding the Benamou-Brenier energy that are needed in the paper. In particular, we recall the characterization of the extremal points of the unit ball of the Benamou-Brenier energy obtained in [15]. In Section 3 we introduce the dynamic inverse problem under consideration and its regularization (1.3), following the approach of [16]. We further establish the basic theory needed to set up the DGCG method. In Section 4 we provide the definition of atoms and give a high-level description of the DGCG method we propose in this paper, see Algorithm 1, proving its sublinear convergence. In Section 5 we describe the strategy employed to solve the insertion step problem (1.8), based on a multistart gradient descent method. Moreover, we outline the mentioned acceleration steps. Incorporating these procedures into the core algorithm, we obtain the complete DGCG method, see Algorithm 2. In Section 6 we show numerical results supporting the effectiveness of Algorithm 2. Finally, we present some open questions in Section 7.

Preliminaries and notation
In this section we introduce the mathematical concepts and results needed to formulate our minimization problem and the consequent algorithms. Throughout the paper Ω ⊂ R^d denotes an open bounded domain with d ∈ N, d ≥ 1. We define the time-space cylinder X := (0,1) × Ω. Following [3], given a metric space Y, we denote by M(Y), M(Y; R^d) and M⁺(Y) the spaces of bounded Borel measures, bounded R^d-valued Borel measures, and positive bounded Borel measures on Y, respectively. For a scalar measure ρ we denote by ‖ρ‖_{M(Y)} its total variation norm. In addition, we employ the notations R_+ and R_{++} to refer to the non-negative and positive real numbers, respectively.
2.1. Time dependent measures. We say that {ρ_t}_{t∈[0,1]} is a Borel family of measures in M(Ω) if ρ_t ∈ M(Ω) for every t ∈ [0,1] and the map t ↦ ∫_Ω ϕ(x) dρ_t(x) is Borel measurable for every function ϕ ∈ C(Ω). Given a measure ρ ∈ M(X) we say that ρ disintegrates with respect to time if there exists a Borel family of measures {ρ_t}_{t∈[0,1]} in M(Ω) such that

∫_X ϕ dρ = ∫_0^1 ∫_Ω ϕ(t, x) dρ_t(x) dt   for every ϕ ∈ C(X) .

We denote such disintegration with the symbol ρ = dt ⊗ ρ_t. Further, we say that a curve of measures t ↦ ρ_t ∈ M(Ω) is narrowly continuous if the map t ↦ ∫_Ω ϕ(x) dρ_t(x) is continuous for every ϕ ∈ C(Ω). The family of narrowly continuous curves will be denoted by C_w. We denote by C⁺_w the family of narrowly continuous curves with values in M⁺(Ω).

2.2. Optimal transport regularizer. Introduce the space

M := M(X) × M(X; R^d) .

We denote elements of M by µ = (ρ, m) with ρ ∈ M(X), m ∈ M(X; R^d), and by 0 the pair (0, 0). Define the set of pairs in M satisfying the continuity equation as

D := {(ρ, m) ∈ M : ∂_t ρ + div m = 0} ,

where the solutions of the continuity equation are intended in a distributional sense, that is,

(2.1)   ∫_X ∂_t ϕ dρ + ∫_X ∇ϕ · dm = 0   for every ϕ ∈ C¹_c((0,1) × R^d) .

The above weak formulation includes no-flux boundary conditions for the momentum m on ∂Ω, and no initial and final data for ρ. Notice that, by standard approximation arguments, it is equivalent to test (2.1) against maps in C¹_c(X) (see [4, Remark 8.1.1]). We now introduce the Benamou-Brenier energy, as originally done in [8]. To this end, define the convex, one-homogeneous and lower semicontinuous map Ψ : R × R^d → [0, +∞],

(2.2)   Ψ(t, x) := |x|²/(2t) if t > 0 ,   Ψ(0, 0) := 0 ,   Ψ(t, x) := +∞ otherwise ,

and the Benamou-Brenier energy B : M → [0, +∞],

(2.3)   B(ρ, m) := ∫_X Ψ( dρ/dλ , dm/dλ ) dλ ,

where λ ∈ M⁺(X) is any measure satisfying ρ, m ≪ λ. Note that (2.3) does not depend on the choice of λ, as Ψ is one-homogeneous. Following [16], we introduce a coercive version of B: for fixed parameters α, β > 0 define the functional J_{α,β} : M → [0, ∞] as

(2.4)   J_{α,β}(ρ, m) := β B(ρ, m) + α ‖ρ‖_{M(X)}   if (ρ, m) ∈ D ,   and J_{α,β}(ρ, m) := +∞ otherwise .

As recently shown [16], J_{α,β} can be employed as a regularizer for dynamic inverse problems in spaces of measures.
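For intuition, when ρ and m live on a common grid one can take λ to be the counting measure on the cells, and (2.2)-(2.3) reduce to a pointwise formula. The following sketch is a discrete caricature of the measure-theoretic definition, not the definition itself; it makes the one-homogeneity of Ψ directly visible.

```python
import numpy as np

def psi(rho, m):
    """Integrand of the Benamou-Brenier energy: |m|^2 / (2 rho) where rho > 0,
    0 at (0, 0), and +inf otherwise (momentum without mass is forbidden)."""
    rho = np.asarray(rho, dtype=float)
    m = np.asarray(m, dtype=float)
    out = np.full(rho.shape, np.inf)
    out[(rho == 0) & (m == 0)] = 0.0
    pos = rho > 0
    out[pos] = m[pos] ** 2 / (2.0 * rho[pos])
    return out

def bb_energy(rho, m, cell_volume=1.0):
    """Discrete Benamou-Brenier energy: sum of Psi over grid cells, using the
    counting measure on cells (scaled by the cell volume) as dominating measure."""
    return float(np.sum(psi(rho, m))) * cell_volume
```

Scaling both arguments by a factor scales the energy by the same factor, which is exactly the one-homogeneity that makes (2.3) independent of the choice of λ.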

The dynamic inverse problem and conditional gradient method
In this section we introduce the dynamic inverse problem we aim to solve, following the approach of [16]. Moreover, we set up the functional analytic framework necessary to state the numerical algorithm presented in Section 4. Recall that Ω ⊂ R^d is an open bounded domain, d ∈ N, d ≥ 1.
Let {H_t}_{t∈[0,1]} be a family of real Hilbert spaces and K*_t : M(Ω) → H_t a family of linear continuous forward operators parametrized by t ∈ [0,1]. Given some data f_t ∈ H_t for a.e. t ∈ (0,1), consider the dynamic inverse problem of finding a curve t ↦ ρ_t ∈ M(Ω) such that

(3.1)   K*_t ρ_t = f_t   for a.e. t ∈ (0,1) .

It has been recently proposed [16] to regularize the above problem with the optimal transport energy J_{α,β} defined in (2.4), where α, β > 0 are fixed parameters. This leads us to consider the Tikhonov functional T_{α,β} : M → [0, +∞],

T_{α,β}(µ) := F(µ) + J_{α,β}(µ) ,

with associated minimization problem

(P)   min_{µ ∈ M} T_{α,β}(µ) ,

where the fidelity term F : M → [0, +∞] is defined by

(3.2)   F(µ) := (1/2) ∫_0^1 ‖K*_t ρ_t − f_t‖²_{H_t} dt

if ρ disintegrates as ρ = dt ⊗ ρ_t with (t ↦ ρ_t) ∈ C_w, and F(µ) := +∞ otherwise. In the following we will denote by f the datum (f_t)_t, where f_t ∈ H_t for a.e. t ∈ (0,1), and ρ_t is the disintegration of ρ with respect to time. The fidelity term F serves to track the discrepancy in (3.1) continuously in time. Following [16], this is achieved by introducing the Hilbert space of square integrable maps f : [0,1] → H := ∪_t H_t, denoted by L²_H. The data f is then assumed to belong to L²_H. The assumptions under which this procedure can be made rigorous are briefly summarized in Section 3.1 below, see (H1)-(H3), (K1)-(K3). Under these assumptions, F is well defined, see Remark 3.4. Such a framework allows us to model a variety of time-dependent acquisition strategies in dynamic imaging, as discussed in Section A.6. We are now ready to recall an existence result for (P) (see [16, Theorem 4.4]).

Theorem 3.1. Let f ∈ L²_H and α, β > 0. Then T_{α,β} is weak* lower semicontinuous on M and there exists µ* = (ρ*, m*) ∈ D that solves the minimization problem (P). Moreover, ρ* disintegrates as ρ* = dt ⊗ ρ*_t with (t ↦ ρ*_t) ∈ C⁺_w. If in addition K*_t is injective for a.e. t ∈ (0,1), then µ* is unique.
The proposed numerical approach for (P) is based on the conditional gradient method, which consists in seeking minimizers of local linear approximations of the target functional. As is standard practice [19,20,52], we first replace (P) with a surrogate minimization problem, by defining the functional T̃_{α,β} as in (P̃) below. The key step in a conditional gradient method is then to find the steepest descent direction for a linearized version of T̃_{α,β}. In Section 3.3 we show that, in order to find such a direction, it is sufficient to solve the minimization problem

min_{µ ∈ Ext(C_{α,β})} −⟨ρ, w⟩ ,

where w_t := −K_t(K*_t ρ̃_t − f_t) is the dual variable associated to the current iterate (ρ̃, m̃), and the linear term µ ↦ ⟨ρ, w⟩ is defined in (3.9) below. Finally, in Section 3.4 we define the primal-dual gap G associated to (P̃), and prove optimality conditions.

3.1. Functional analytic setting for time continuous fidelity term. In order to define the continuous sampling fidelity term F at (3.2), the authors of [16] introduce suitable assumptions on the measurement spaces H_t and on the forward operators K*_t : M(Ω) → H_t.

Assumption 3.2. For a.e. t ∈ (0,1), let H_t be a real Hilbert space with norm ‖·‖_{H_t} and scalar product ⟨·,·⟩_{H_t}. Let D be a real Banach space with norm denoted by ‖·‖_D. Assume that for a.e. t ∈ (0,1) there exists a linear continuous operator i_t : D → H_t with the following properties:
(H1) ‖i_t‖ ≤ C for some constant C > 0 not depending on t,
(H2) i_t(D) is dense in H_t,
(H3) the map t ↦ ⟨i_t φ, i_t φ′⟩_{H_t} is Lebesgue measurable for every φ, φ′ ∈ D.

Setting H := ∪_{t∈[0,1]} H_t, it is possible to define the space of square integrable maps f : [0,1] → H such that f_t ∈ H_t for a.e. t ∈ (0,1), that is,

(3.4)   L²_H := { f : [0,1] → H : f_t ∈ H_t for a.e. t ∈ (0,1), f strongly measurable, ∫_0^1 ‖f_t‖²_{H_t} dt < +∞ } .

The strong measurability mentioned in (3.4) is an extension to time dependent spaces of the classical notion of strong measurability for Bochner integrals. The common subset D is employed to construct step functions in a suitable way. An important property of strong measurability is that t ↦ ⟨f_t, g_t⟩_{H_t} is Lebesgue measurable whenever f, g are strongly measurable [16, Remark 3.4].
Moreover, L²_H is a Hilbert space with inner product and norm given by

⟨f, g⟩_{L²_H} := ∫_0^1 ⟨f_t, g_t⟩_{H_t} dt ,   ‖f‖_{L²_H} := ⟨f, f⟩_{L²_H}^{1/2} .

We refer the interested reader to [16, Section 3] for more details on the construction of such spaces and their properties. We will now state the assumptions required for the measurement operators K*_t.

Assumption 3.3. For a.e. t ∈ (0,1) the linear continuous operators K*_t : M(Ω) → H_t satisfy:
(K1) K*_t is weak*-to-weak continuous, with pre-adjoint denoted by K_t : H_t → C(Ω),
(K2) ‖K*_t‖ ≤ C for some constant C > 0 not depending on t,
(K3) the map t ∈ [0,1] ↦ K*_t ρ ∈ H_t is strongly measurable for every fixed ρ ∈ M(Ω).

Remark 3.4. After assuming (H1)-(H3), (K1)-(K3), the fidelity term F introduced at (3.2) is well-defined. Indeed, the conditions ρ = dt ⊗ ρ_t and (t ↦ ρ_t) ∈ C_w imply that t ↦ K*_t ρ_t belongs to L²_H by Lemma A.4. We further remark that F(µ) is finite whenever J_{α,β}(µ) < +∞, as in this case ρ disintegrates as ρ = dt ⊗ ρ_t with (t ↦ ρ_t) ∈ C⁺_w.

3.2. The surrogate problem. Fix f ∈ L²_H and define the map ϕ : [0, +∞) → [0, +∞) by

(3.6)   ϕ(t) := t   if t ∈ [0, M_0] ,   ϕ(t) := (t² + M_0²)/(2M_0)   if t > M_0 ,

where we set

(3.7)   M_0 := T_{α,β}(0) .

Notice that by (A.1) we have J_{α,β}(0) = 0, so that M_0 = F(0) = (1/2)‖f‖²_{L²_H}. The surrogate minimization problem is then

(P̃)   min_{µ ∈ M} T̃_{α,β}(µ) ,   T̃_{α,β}(µ) := F(µ) + ϕ(J_{α,β}(µ)) .

Notice that (P) and (P̃) share the same set of minimizers, and they are thus equivalent. This is readily seen after noting that solutions to (P) and (P̃) belong to the set {µ ∈ M : J_{α,β}(µ) ≤ M_0}, thanks to the estimate t ≤ ϕ(t), and that T_{α,β} and T̃_{α,β} coincide on the said set. We remark that the surrogate minimization problem (P̃) is a technical modification of (P) introduced only to ensure that the partially linearized problem defined in the following section is coercive.
3.3. The linearized problem. Let ϕ and M_0 be as in (3.6)-(3.7). We consider the following linearized version of (P̃),

(3.11)   min_{µ ∈ M} −⟨ρ, w⟩ + ϕ(J_{α,β}(µ)) ,

which is well-posed by Theorem A.5. The objective of this section is to prove the existence of a solution to (3.11) belonging, up to a multiplicative constant, to the extremal points of the sublevel set C_{α,β} := {J_{α,β} ≤ 1}. To this end, consider the problem

(3.12)   min_{µ ∈ C_{α,β}} −⟨ρ, w⟩ .

In the following proposition we prove that (3.12) admits a minimizer µ* ∈ Ext(C_{α,β}). Moreover, we show that a suitably rescaled version of µ* solves (3.11).
3.4. The primal-dual gap. In this section we introduce the primal-dual gap associated to (P).

The algorithm: theoretical analysis
In this section we give a theoretical description of the dynamic generalized conditional gradient algorithm anticipated in the introduction, which we call the core algorithm. The proposed algorithm aims at finding minimizers of T_{α,β} as defined in (P), for some fixed data f ∈ L²_H and parameters α, β > 0. It is comprised of an insertion step, where one seeks a minimizer of (3.12) among the extremal points of the set C_{α,β} := {J_{α,β} ≤ 1}, and of a coefficients optimization step, which yields a finite dimensional quadratic program. As a result, each iterate will be a finite linear combination, with non-negative coefficients, of points in Ext(C_{α,β}). We remind the reader that Ext(C_{α,β}) = {0} ∪ 𝒞_{α,β} in view of Theorem 2.1, where 𝒞_{α,β} is the family of atoms defined at (2.5). From this characterization we see that, except for the zero element, the extremal points are in one-to-one correspondence with the space of curves AC² := AC²([0,1]; Ω). This observation motivates us to define the atoms of our problem.

Definition 4.1 (Atoms). We denote by AC²_∞ the one-point extension of the set AC², where we include a point denoted by γ_∞. For any γ ∈ AC² we name as atom the respective extremal point µ_γ = (ρ_γ, m_γ) ∈ M defined according to (2.6). For γ_∞ the corresponding atom is defined by µ_{γ_∞} := (0, 0). We call sparse any measure µ ∈ M of the form

(4.1)   µ = Σ_{j=1}^N c_j µ_{γ_j}   with N ∈ N, c_j > 0, γ_j ∈ AC²_∞ .

Note that γ_∞ can be regarded as the infinite length curve: indeed, if {γ_n} is a sequence of curves in AC² with diverging length, that is ∫_0^1 |γ̇_n| dt → ∞ as n → ∞, then ρ_{γ_n} ⇀* ρ_{γ_∞}, since a_{γ_n} → 0 by Hölder's inequality. Additionally, it is convenient to introduce the following map associating to vectors of curves and coefficients the corresponding sparse measure:

(4.2)   (γ, c) ↦ Σ_{j=1}^N c_j µ_{γ_j} ,

where N ∈ N is fixed, c := (c_1, ..., c_N) with c_j > 0, and γ := (γ_1, ..., γ_N) with γ_j ∈ AC².
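As an illustration of Definition 4.1, the weight a_γ and the pairing of ρ_γ against a test function can be approximated for a time-sampled curve. This numerical sketch (all function names are ours, and forward differences stand in for the weak derivative) also shows that faster curves carry smaller weight, consistent with the remark on γ_∞.

```python
import numpy as np

def atom_weight(gamma, ts, alpha, beta):
    """Weight a_gamma = ((beta/2) * int_0^1 |gamma'|^2 dt + alpha)^(-1) of the atom
    mu_gamma, for a curve sampled as gamma[k] ~ gamma(ts[k]), gamma of shape (n, d)."""
    dt = ts[1] - ts[0]
    speed_sq = np.sum((np.diff(gamma, axis=0) / dt) ** 2, axis=-1)
    return 1.0 / (0.5 * beta * np.sum(speed_sq) * dt + alpha)

def pair_with_atom(phi, gamma, ts, alpha, beta):
    """Riemann-sum approximation of <rho_gamma, phi> = a_gamma int_0^1 phi(t, gamma(t)) dt."""
    dt = ts[1] - ts[0]
    a = atom_weight(gamma, ts, alpha, beta)
    return a * np.sum([phi(t, x) for t, x in zip(ts, gamma)]) * dt
```

Pairing with phi = 1 recovers the (constant) total mass a_γ of the atom, and a stationary curve attains the maximal weight 1/α.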
Remark 4.2. The decomposition in extremal points of a given sparse measure might not be unique, that is, the map at (4.2) is not injective. For example, let N := 2, Ω := (0,1)², let γ_1, γ_2 ∈ AC² be two constant-speed straight curves crossing at time t = 1/2, and let γ̃_1, γ̃_2 be the curves obtained by exchanging their trajectories after the crossing time, so that γ̃_1, γ̃_2 rebound at the crossing point. Set µ* := µ_{γ_1} + µ_{γ_2}. Then injectivity fails for ((γ_1, γ_2), (1, 1)) and ((γ̃_1, γ̃_2), (1, 1)), given that they map to the same measure µ*, but γ_1 and γ_2 cross at t = 1/2, while γ̃_1 and γ̃_2 rebound. This observation is relevant for the algorithms presented, seeing that they operate in terms of extremal points: if for example µ* were the unique solution to (P) for some data f, due to the lack of a unique sparse representation for µ*, the numerical reconstruction could favor the representation having the least energy in terms of the regularizer J_{α,β}. This is not surprising, since our method aims at reconstructing sparse measures, rather than their extremal points. A numerical example displaying the behavior of our algorithm on crossings, such as the case of µ*, is given in Section 6.2.3.
The rest of the section is organized as follows. In Section 4.1 we present the core algorithm, describing its basic steps and summarizing it in Algorithm 1. In Section 4.2 we discuss the equivalence of the coefficients optimization step to a quadratic program, while in Section 4.3 we show sublinear convergence of Algorithm 1 in terms of the residual defined at (3.19). In Section 4.4 we detail a theoretical stopping criterion for our algorithm. To conclude, in Section 4.5, we describe how to modify Algorithm 1 in case the fidelity term F at (P) is replaced by a time-discrete version.
All the results presented above and in this section also hold in that case, with minor modifications.
4.1. Core Algorithm. The core algorithm consists of two steps. In the first one, named the insertion step, an atom is added to the current iterate, this atom being the minimizer of the linearized problem defined at (3.12). In the second step, named the coefficients optimization step, the atoms are fixed and their associated weights are optimized to minimize the target functional T α,β defined in (P).
In what follows f ∈ L 2 H is a given datum and α, β > 0 are fixed parameters.
4.1.1. Iterates. We initialize the algorithm at the zero atom µ_0 := 0. The n-th iterate µ_n is a sparse element of M according to (4.1), that is,

(4.3)   µ_n = Σ_{j=1}^{N_n} c_j µ_{γ_j} ,   c_j > 0 , γ_j ∈ AC² .

Notice that N_n counts the number of atoms present at the n-th iteration, and is not necessarily equal to n, since the optimization step could discard atoms by setting their associated weights to zero. In practice, Algorithm 1 operates in terms of curves and weights. That is, the n-th iteration outputs pairs (γ_j, c_j) with γ_j ∈ AC², c_j > 0; the iterate at (4.3) can then be constructed via the map (4.2).

4.1.2. Insertion step.
Assume µ_n is the current iterate. Define the dual variable associated to µ_n as in (3.8), that is,

(4.4)   w^n_t := −K_t(K*_t ρ^n_t − f_t) .

The insertion step then requires solving

(4.5)   min_{µ ∈ Ext(C_{α,β})} −⟨ρ, w^n⟩ ,

where the term ⟨·,·⟩ is defined in (3.9). We recall that (4.5) admits a solution by Proposition 3.6. Thanks to the characterization Ext(C_{α,β}) = {0} ∪ 𝒞_{α,β} provided by Theorem 2.1, problem (4.5) can be cast on the space AC²_∞. Indeed, given µ ∈ 𝒞_{α,β}, following the notations at (2.5)-(2.6), we have that µ = µ_γ for some γ ∈ AC². The curve t ↦ δ_{γ(t)} belongs to C⁺_w, and hence ⟨ρ_γ, w^n⟩ is finite (see Remark 3.5). Thus, by definition, we have

(4.6)   ⟨ρ_γ, w^n⟩ = a_γ ∫_0^1 w^n_t(γ(t)) dt ,

showing that (4.5) is equivalent to

(4.7)   min_{γ ∈ AC²_∞} −⟨ρ_γ, w^n⟩ .

The insertion step consists in finding a curve γ* ∈ AC²_∞ that solves (4.7). To such a curve we associate a new atom µ_{γ*} via Definition 4.1. Note that γ* depends on the current iterate µ_n, as well as on the datum f and parameters α, β; however, in order to simplify notations, we omit such dependencies. Afterwards, we have a stopping condition:
• if ⟨ρ_{γ*}, w^n⟩ ≤ 1, then µ_n is a solution to (P). The algorithm outputs µ_n and stops,
• if ⟨ρ_{γ*}, w^n⟩ > 1, then µ_n is not a solution to (P) and γ* ∈ AC². The found atom µ_{γ*} is inserted in the n-th iterate µ_n and the algorithm continues.
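The stopping condition amounts to evaluating the scalar ⟨ρ_{γ*}, w^n⟩ = a_{γ*} ∫_0^1 w^n_t(γ*(t)) dt and comparing it to the threshold 1. A time-discretized sketch of this check, with a placeholder dual variable w (the true w^n depends on the operators K_t and the data), could read as follows.

```python
import numpy as np

def insertion_certificate(w, gamma, ts, alpha, beta):
    """Value <rho_gamma, w^n> = a_gamma * int_0^1 w^n_t(gamma(t)) dt for a candidate
    curve, sampled on the time grid ts; w(ts, gamma) evaluates the dual variable."""
    dt = ts[1] - ts[0]
    kinetic = 0.5 * beta * np.sum((np.diff(gamma) / dt) ** 2) * dt  # (beta/2) int |gamma'|^2
    a_gamma = 1.0 / (kinetic + alpha)
    return a_gamma * np.sum(w(ts, gamma)) * dt

def should_stop(w, gamma_star, ts, alpha, beta):
    # <rho_{gamma*}, w^n> <= 1  means the current iterate solves (P);
    # otherwise the atom mu_{gamma*} is inserted and the algorithm continues.
    return insertion_certificate(w, gamma_star, ts, alpha, beta) <= 1.0
```

Increasing α shrinks a_{γ*} and thus the certificate value, so for large regularization the zero measure is already optimal, matching the role of the threshold 1.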
The optimality statements in the above stopping condition correspond to positivity conditions on the subgradient of T α,β and are rigorously proven in Section 4.4 below. Moreover, the mentioned stopping condition can be made quantitative as discussed in Remark 4.9 below.
Remark 4.3. In this section we will always assume the availability of an exact solution γ * to (4.7).
In particular, this allows us to obtain a sublinear convergence rate for the core algorithm (Theorem 4.6 below), and to make the stopping condition rigorous. In practice, however, obtaining γ* is not always possible, due to the non-linearity and non-locality of the functional at (4.7). For this reason, in Section 5.1, we propose a strategy aimed at obtaining stationary points of (4.7). Based on such strategy, a relaxed version of the insertion step is proposed for Algorithm 2, which we employ for the numerical simulations of Section 6.

4.1.3. Coefficients optimization step. This step is performed after the stopping condition is checked, with the condition ⟨ρ_{γ*}, w^n⟩ > 1 being satisfied. In particular, as observed above, in this case γ* ∈ AC². We then set γ^n_{N_n+1} := γ* and consider the coefficients optimization problem

(4.8)   min_{c ∈ R^{N_n+1}_+} T_{α,β}( Σ_{j=1}^{N_n+1} c_j µ_{γ^n_j} ) ,

where µ_{γ^n_j} for j = 1, ..., N_n are the atoms present in the n-th iterate µ_n. If c* ∈ R^{N_n+1}_+ is a solution to the above problem, the next iterate is defined by

(4.9)   µ_{n+1} := Σ_{j=1}^{N_n+1} c*_j µ_{γ^n_j} ,

thus discarding the curves that do not contribute to (4.8).
Remark 4.4. Problem (4.8) is equivalent to a quadratic program of the form

(4.10) min over c ∈ R^{N_n+1}_+ of ½⟨Γc, c⟩ + ⟨b, c⟩,

where Γ ∈ R^{(N_n+1)×(N_n+1)} is a positive-semidefinite symmetric matrix and b ∈ R^{N_n+1}, as proven in Proposition 4.5 below. Therefore, throughout the paper we will always assume the availability of an exact solution to (4.8). In practice we solved (4.8) by means of the free Python software package CVXOPT [5, 6].
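To illustrate the reduction in Remark 4.4, the following sketch solves a small nonnegative quadratic program of the form min over c ≥ 0 of ½⟨Γc, c⟩ + ⟨b, c⟩ by projected gradient descent. It is an illustrative stand-in for the CVXOPT solver used in the paper, not the paper's implementation; the function name and the sign convention for b are assumptions.

```python
import numpy as np

def nonneg_qp(Gamma, b, steps=5000, lr=None):
    """Minimize 0.5*c^T Gamma c + b^T c over c >= 0 by projected
    gradient descent (an illustrative stand-in for a QP solver)."""
    Gamma = np.asarray(Gamma, dtype=float)
    b = np.asarray(b, dtype=float)
    # A safe step size: reciprocal of the Lipschitz constant of the gradient.
    if lr is None:
        lr = 1.0 / max(np.linalg.eigvalsh(Gamma).max(), 1e-12)
    c = np.zeros(b.size)
    for _ in range(steps):
        grad = Gamma @ c + b
        c = np.maximum(c - lr * grad, 0.0)  # project onto the constraint c >= 0
    return c
```

For a Gram matrix Γ this objective is convex, so projected gradient descent converges to the global solution; a dedicated QP solver such as CVXOPT is of course far more efficient for larger problems.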
4.1.4. Algorithm summary. As discussed in Section 4.1.1, the iterates of Algorithm 1 are pairs of curves and weights (γ_j, c_j), with γ_j ∈ AC², c_j > 0 and j = 1, . . . , N_n. In the pseudo-code we denote such iterates by the tuples γ = (γ_1, . . . , γ_{N_n}) and c = (c_1, . . . , c_{N_n}). Note that these tuples vary in size at each iteration; the initial iterate of the algorithm, that is, the zero atom, corresponds to the empty tuples γ = () and c = (). We denote the number of elements of a tuple γ by |γ|. Via the map (4.2), the iterates (γ, c) define a sequence of sparse measures {µ_n} of the form (4.3). The generated sequence {µ_n} weakly* converges (up to subsequences) to a minimizer of T_{α,β} for the datum f ∈ L²_H, as shown in Theorem 4.6 below. Notice that the assignments at lines 4 and 9 are meant to choose one element of the respective argmin set. The function delete_zero_weighted(γ, c) at line 11 of Algorithm 1 inputs a tuple (γ, c) and outputs the tuple obtained by deleting every curve γ_j, together with its weight, whenever c_j = 0.
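The bookkeeping of the routine delete_zero_weighted is elementary; a minimal Python sketch, assuming curves and weights are stored as equal-length tuples:

```python
def delete_zero_weighted(curves, weights):
    """Drop the curves whose optimized weight vanished, mirroring the
    delete_zero_weighted routine of Algorithm 1. `curves` and `weights`
    are tuples of equal length; returns the filtered pair of tuples."""
    kept = [(g, c) for g, c in zip(curves, weights) if c > 0]
    if not kept:
        return (), ()
    gs, cs = zip(*kept)
    return tuple(gs), tuple(cs)
```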
Proposition 4.5. Problem (4.11) is equivalent to (4.12).

Proof. As γ_j is continuous, the curve t ↦ δ_{γ_j(t)} belongs to C⁺_w. Hence the map t ↦ K*_t δ_{γ_j(t)} belongs to L²_H by Lemma A.4, and the quantities at (4.13) are well-defined. Thanks to the definition of T_{α,β} and Lemma 2.2, the objective of (4.11) expands into that of (4.12), showing that the two problems are equivalent. The rest of the statement follows since Γ is the Gram matrix of the vectors K* ρ_{γ_1}, . . . , K* ρ_{γ_N} in L²_H.

Convergence analysis.
We prove sublinear convergence for Algorithm 1. The convergence rate is given in terms of the functional distance (3.19) associated to T_{α,β}. Throughout the section we assume that f ∈ L²_H is a given datum, that α, β > 0 are fixed regularization parameters, and that (H1)-(H3), (K1)-(K3) of Section 3.1 hold. The convergence result reads as follows.
Theorem 4.6. Let {µ_n} be a sequence generated by Algorithm 1. Then {T_{α,β}(µ_n)} is non-increasing and the residual at (3.19) satisfies the sublinear estimate (4.14), where C > 0 is a constant depending only on α, β, f and K*_t. Moreover, each accumulation point µ* of {µ_n} with respect to the weak* topology of M is a minimizer of T_{α,β}. If T_{α,β} admits a unique minimizer µ*, then µ_n ⇀* µ* along the whole sequence.

The proof relies on energy estimates for the surrogate iterate µ_s := µ_n + s(M µ_{γ*} − µ_n), where γ* ∈ AC² is a solution of the insertion step (4.7), M ≥ 0 is a suitable constant, and the step-size s is chosen according to the Armijo-Goldstein condition (see [20, Section 5]). Since µ_s is a candidate for problem (4.8), the added coefficients optimization step in Algorithm 1 does not worsen the convergence rate.
Finally, we are in a position to prove (4.14). Since µ_0 = 0, we have that T_{α,β}(µ_0) ≤ M_0. Thus we can inductively apply (4.16) and obtain that T_{α,β}(µ_n) ≤ M_0 for all n ∈ N. In particular, (4.16) holds for every n ∈ N and, as a consequence, the sequence {r(µ_n)} is non-increasing. Setting r_n := r(µ_n), we obtain a recursive estimate from which (4.14) follows. The remaining claims follow from the weak* lower semicontinuity of T_{α,β} (Theorem 3.1) and estimate (A.2), given that {µ_n} is a minimizing sequence for (P).

Stopping condition.
We prove the optimality statements in the stopping criterion for the core algorithm, anticipated in Section 4.1.2. In the following, G denotes the primal-dual gap introduced in (3.18). We denote by µ_n the n-th iterate of Algorithm 1, which is of the form (4.3), and by w^n the corresponding dual variable (4.4). Moreover, let µ_{γ*} be the atom associated to the curve γ* ∈ AC²_∞ solving the insertion step (4.7).

Lemma 4.8. For all n ∈ N, identity (4.22) holds. In particular, µ_n is a solution of (P) if and only if ⟨ρ_{γ*}, w^n⟩ ≤ 1.
Before proving Lemma 4.8, we give a quantitative version of the stopping condition of Section 4.1.2.
Remark 4.9. With the same notations as above, consider the condition at (4.23), which quantifies the optimality gap in Lemma 4.8. Assuming (4.23) and using (3.20), we see that the functional residual defined at (3.19) satisfies r(µ_n) < TOL, i.e., µ_n minimizes (P) up to the tolerance TOL. Therefore, the condition at (4.23) can be employed as a quantitative stopping criterion for Algorithm 1.
Proof of Lemma 4.8. We start by computing G(µ_n) for a fixed n ∈ N. Set µ̄ := M µ_{γ*}, where M is defined as in (3.13) with w = w^n. Since µ_{γ*} ∈ Ext(C_{α,β}) solves (4.5), we have that µ̄ solves (3.11) with respect to w^n (see Proposition 3.6); this yields the expression (4.24) for G(µ_n). Notice that, by the one-homogeneity of J_{α,β} (see Lemma A.3) and the fact that µ_{γ*} ∈ Ext(C_{α,β}), we have J_{α,β}(µ̄) = M. Recalling the definition of ϕ at (3.6), a direct calculation gives (4.25). We now compute the remaining terms in (4.24). By (4.8) and (4.9) at step n − 1, we know that the coefficients c^n = (c^n_1, . . . , c^n_{N_n}) ∈ R^{N_n}_{++} of µ_n solve the minimization problem (4.26). Since γ^n_j in (4.3) is continuous, we have that (t ↦ δ_{γ^n_j(t)}) ∈ C⁺_w, and thus t ↦ K*_t δ_{γ^n_j(t)} belongs to L²_H by Lemma A.4. In view of (4.3), the definition of T_{α,β} and Lemma 2.2, we can expand the expression at (4.26), differentiate with respect to each component of c, and use the optimality of c^n in (4.26) to obtain that (4.27) holds for all i = 1, . . . , N_n. By Lemma 2.2 and the linearity of ⟨·, ·⟩, we obtain the identity J_{α,β}(µ_n) = ⟨ρ_n, w^n⟩. The latter, together with (4.24) and (4.25), yields (4.22). For the remaining part of the statement, notice that by (4.22) we have G(µ_n) = 0 if and only if ⟨ρ_{γ*}, w^n⟩ ≤ 1. Therefore, the thesis follows from Lemma 3.8.

4.5.
Time-discrete version. The minimization problem (P) presented in Section 3 is posed for time-continuous measurements f ∈ L²_H, whose discrepancy to the reconstructed curve t ↦ ρ_t is modelled by the fidelity term F at (3.2). In real-world applications, however, the measured data is time-discrete, i.e., we can assume that measurements are taken at times 0 = t_0 < t_1 < . . . < t_T = 1, with T ∈ N fixed. Hence the data is of the form f = (f_{t_0}, f_{t_1}, . . . , f_{t_T}), where f_{t_i} ∈ H_{t_i}. For this reason, and with the additional goal of lowering the computational cost, we present numerical experiments (Section 6) in which the time-continuous fidelity term F is replaced by a discrete counterpart. In Section 4.5.1 we show how to modify the mathematical framework discussed so far in order to deal with the time-discrete case. It is then immediate to adapt Algorithm 1 to the resulting time-discrete functional, as discussed in Section 4.5.2. We remark that all the results up to this point hold, in a slightly modified form, also for the discrete setting discussed below.

4.5.1. Time-discrete framework. We replace problem (P) with (P_discr), where the time-discrete fidelity term F_D : M → [0, +∞] penalizes the discrepancy between K*_{t_i} ρ_{t_i} and f_{t_i} at each sampling time. We now define the other quantities needed to formulate the discrete counterpart of the theory developed so far. For a given curve of measures (t ↦ ρ̄_t) ∈ C_w, the corresponding dual variable (3.8) is redefined as

(4.28) w_{t_i} := −K_{t_i}(K*_{t_i} ρ̄_{t_i} − f_{t_i}) ∈ C(Ω), for each i = 0, . . . , T.

Consequently, we redefine the associated pairing (3.9) as the sum ⟨ρ, w⟩_D over the sampling times of the integrals of w_{t_i} against ρ_{t_i}. It is straightforward to check that all the results in Section 3 hold with T_{α,β} and ⟨·, ·⟩ replaced by T^D_{α,β} and ⟨·, ·⟩_D respectively, with the obvious modifications. In particular, the discrete insertion problem (4.29) admits a solution, and a computation analogous to the one at (4.6) shows that (4.29) is equivalent to (4.30).

Remark 4.10. Assume additionally that Ω is convex.
Then a solution γ* ∈ AC²_∞ to problem (4.30) is either γ* = γ_∞, or γ* ∈ AC² with γ* linear on each interval [t_i, t_{i+1}]. Indeed, given any curve γ ∈ AC², denote by γ̄ the piecewise linear interpolation of γ sampled at the times t_i for i = 0, . . . , T. Then ⟨ρ_γ, w⟩_D ≤ ⟨ρ_γ̄, w⟩_D, due to the inequality ∫₀¹ |γ̄'(t)|² dt ≤ ∫₀¹ |γ'(t)|² dt.

4.5.2. Adaptation of Algorithm 1 to the time-discrete setting. The core algorithm of Section 4.1 is readily adaptable to the task of minimizing (P_discr). To this end, let f_{t_i} ∈ H_{t_i} be a given datum and α, β > 0 fixed parameters for (P_discr). Assume that µ_n is the current sparse iterate, of the form (4.3). The dual variable associated to µ_n is defined, according to (4.28), by w^n_{t_i} := −K_{t_i}(K*_{t_i} ρ^n_{t_i} − f_{t_i}). Similarly to Section 4.1.2, the insertion step in the time-discrete version consists in finding a curve γ* ∈ AC²_∞ which solves (4.30) with respect to w^n_{t_i}. Such a curve defines a new atom µ_{γ*} according to Definition 4.1. Adapting the proofs of Lemmas 3.8 and 4.8 to the discrete setting, one deduces the following stopping condition:
• if ⟨ρ_{γ*}, w^n⟩_D ≤ 1, then µ_n is a solution to (P_discr). The algorithm outputs µ_n and stops;
• if ⟨ρ_{γ*}, w^n⟩_D > 1, then µ_n is not a solution to (P_discr) and γ* ∈ AC². The new atom µ_{γ*} is inserted into the n-th iterate µ_n and the algorithm continues.
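The discrete stopping test above reduces to evaluating the pairing ⟨ρ_γ, w⟩_D along a sampled curve. A minimal Python sketch, under the assumptions that curves are piecewise linear on a uniform grid, that the pairing uses a uniform quadrature weight over the T + 1 samples, and that the atom normalization is a_γ = (α + (β/2)∫₀¹|γ'|² dt)^{-1} (our reading of (2.6)):

```python
import numpy as np

# Hypothetical discretization: a curve is an array of shape (T+1, d) sampled
# at uniform times t_i = i/T; w is a callable w(i, x) giving the dual
# variable w_{t_i} evaluated at the point x.

def curve_energy(gamma):
    """Approximate int_0^1 |gamma'(t)|^2 dt for a piecewise linear curve."""
    T = gamma.shape[0] - 1
    velocities = np.diff(gamma, axis=0) * T   # constant speed on each segment
    return np.sum(velocities ** 2) / T

def discrete_pairing(gamma, w, alpha, beta):
    """<rho_gamma, w>_D with a_gamma = 1 / (alpha + beta/2 * energy)
    and uniform quadrature weights (both assumptions for illustration)."""
    T = gamma.shape[0] - 1
    a_gamma = 1.0 / (alpha + 0.5 * beta * curve_energy(gamma))
    vals = [w(i, gamma[i]) for i in range(T + 1)]
    return a_gamma * np.mean(vals)

def is_solution(gamma_star, w, alpha, beta):
    """Stopping test: the current iterate solves (P_discr) iff the pairing
    along the inserted candidate curve is at most 1."""
    return discrete_pairing(gamma_star, w, alpha, beta) <= 1.0
```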
Set γ^n_{N_n+1} := γ*. The time-discrete version of the coefficients optimization step discussed in Section 4.1.3 consists in solving (4.31). By proceeding as in Section 4.2, one can check that (4.31) is equivalent to a quadratic program of the form (4.10), where the matrix Γ ∈ R^{(N_n+1)×(N_n+1)} and the vector b ∈ R^{N_n+1} are the discrete counterparts of the quantities at (4.13). In view of Remark 4.10 and of the above construction, we note that the iterates of the discrete algorithm are of the form (4.3) with γ^n_j ∈ AC² piecewise linear. Finally, we remark that the time-discrete algorithm obtained with the above modifications has the same sublinear rate of convergence stated in Theorem 4.6.

The algorithm: numerical implementation
The aim of this section is twofold. First, in Section 5.1 we describe how to approach the minimization of the insertion step problem (4.7) by means of gradient descent strategies. This analysis is performed under additional assumptions on the operators K_t : H_t → C(Ω), which, loosely speaking, require that K_t maps into the space C^{1,1}(Ω) of differentiable functions with bounded Lipschitz gradient. The proposed strategies result in the multistart gradient descent Subroutine 1 (Section 5.1.5), which, given a dual variable w, outputs a set of stationary points of (4.7). Then, in Section 5.2, we present two acceleration steps that can be added to Algorithm 1 to, in principle, enhance its performance. The first acceleration strategy, called the multiple insertion step, adds all the outputs of Subroutine 1 to the current iterate. The second strategy, termed the sliding step, consists in locally descending the target functional T_{α,β} at (P) in a neighbourhood of the curves composing the current iterate, while keeping the coefficients fixed. These strategies are finally added to Algorithm 1. The outcome is Algorithm 2, presented in Section 5.3, which we name the dynamic generalized conditional gradient method (DGCG).

5.1. Insertion step implementation. We aim at minimizing the linearized problem (4.7) in the insertion step, which is of the form (5.1), where the dual variable t ↦ w_t is defined for a.e. t ∈ (0, 1) by w_t := −K_t(K*_t ρ̄_t − f_t) ∈ C(Ω), for some fixed curve (t ↦ ρ̄_t) ∈ C_w and data f ∈ L²_H. In (5.1) we also identify AC² with H¹([0, 1]; Ω) and employ the notations at (2.6). We remind the reader that, although (5.1) admits solutions (see Section 4.1.1), in practice they may be difficult to compute numerically (Remark 4.3). Therefore we turn our attention to finding stationary points of the functional F at (5.1), relying on gradient descent methods.
To make this approach feasible, we require additional assumptions on the operators K*_t (see Assumption 5.1 below), which allow us to extend the functional F to the Hilbert space H¹ := H¹([0, 1]; R^d) and make it Fréchet differentiable, without altering the value of the minimum at (5.1). In particular, this makes a gradient descent procedure well defined (Section 5.1.1). With this at hand, in Section 5.1.2 we define a descent operator G_w associated to F which, for a starting curve γ ∈ H¹, outputs either a stationary point of F or the infinite-length curve γ_∞. The starting curves for G_w are of two types:
• random starts, which are guided by the values of the dual variable w (Section 5.1.3),
• crossovers between known stationary points (Section 5.1.4).
We then propose a minimization strategy for (5.1), implemented in the multistart gradient descent algorithm contained in Subroutine 1 (Section 5.1.5). This algorithm inputs a set of curves and a dual variable w, and returns a set S ⊂ AC², which is either empty or contains stationary curves for F at (5.1).

5.1.1. Minimization strategy. The additional assumptions required on K*_t are as follows.

Assumption 5.1. For a.e. t ∈ (0, 1) the linear continuous operator K*_t : C^{1,1}(Ω)* → H_t satisfies:
(F1) K*_t is weak*-to-weak continuous, with pre-adjoint denoted by K_t : H_t → C^{1,1}(Ω),
(F2) ‖K*_t‖ ≤ C for some constant C > 0 not depending on t,
(F3) the map t ↦ K*_t ρ is strongly measurable for every fixed ρ ∈ M(Ω),
(F4) there exists a closed convex set E ⊂ Ω such that supp(K_t f), supp(∇(K_t f)) ⊂ E for all f ∈ H_t and a.e. t ∈ (0, 1), where ∇ denotes the spatial gradient.
Notice that (F1)-(F3) imply (K1)-(K3) of Section 3.1, due to the embedding M(Ω) ↪ C^{1,1}(Ω)*. Also note that (F4) has no counterpart in the assumptions of Section 3.1, and is only assumed for computational convenience, as discussed below. The problem of solving (5.1) under (F1)-(F4) is addressed in Appendix A.4; here we summarize the main results obtained. First, we extend F to the Hilbert space H¹ by setting w_t(x) := 0 for all x ∈ R^d \ Ω and a.e. t ∈ (0, 1). Due to (F4), we have that w_t ∈ C^{1,1}(R^d). We then show that F is continuously Fréchet differentiable on H¹, with locally Lipschitz derivative (Proposition A.9). Denote by D_γ F ∈ (H¹)* the Fréchet derivative of F at γ ∈ H¹. In Proposition A.12 we prove that (5.2) is equivalent to (5.3). As P(w) is easily computable, condition (5.3) provides an implementable test for (5.2). Moreover, note that (5.2) is satisfied when F is computed from the dual variable w^n associated to the n-th iterate µ_n = Σ_{j=1}^{N_n} c^n_j µ_{γ^n_j} of Algorithm 1, with n ≥ 1. This is because F(γ^n_j) = −1 for all j = 1, . . . , N_n, as shown in (4.27).
If (5.2) is satisfied, we aim at computing points in S_F by gradient descent. To this end, we say that {γ^n} in H¹ is a descent sequence if it satisfies (5.4), where D_{γ^n} F ∈ (H¹)* is identified with its Riesz representative in H¹, and the stepsizes {δ_n} are chosen according to the Armijo-Goldstein or backtracking-Armijo rules. In Theorem A.6 we prove that, if {γ^n} is a descent sequence, there exists at least one subsequence such that γ^{n_k} → γ* strongly in H¹; moreover, any such accumulation point γ* belongs to S_F. To summarize, descent sequences in the sense of (5.4) enable us to compute points in S_F, which are candidate solutions to (5.1) whenever (5.2) holds.
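A generic descent sequence with the backtracking-Armijo stepsize rule can be sketched as follows. Here F and gradF are placeholders for the extended insertion-step functional and its Riesz-represented derivative; for illustration, any smooth function on R^m works, and the parameter values are assumptions.

```python
import numpy as np

def backtracking_armijo(F, gradF, gamma0, sigma=1e-4, tau=0.5,
                        max_iter=200, tol=1e-8):
    """Descent sequence gamma_{n+1} = gamma_n - delta_n * grad F(gamma_n),
    with delta_n chosen by backtracking until the Armijo sufficient-decrease
    condition holds. F: R^m -> R, gradF: R^m -> R^m."""
    gamma = np.asarray(gamma0, dtype=float)
    for _ in range(max_iter):
        g = gradF(gamma)
        if np.linalg.norm(g) < tol:   # approximate stationary point reached
            break
        delta = 1.0
        # Shrink the step until sufficient decrease is achieved.
        while F(gamma - delta * g) > F(gamma) - sigma * delta * np.dot(g, g):
            delta *= tau
        gamma = gamma - delta * g
    return gamma
```

In the actual algorithm, gamma would be a discretized curve and gradF the derivative of the extended functional F from Proposition A.9; the stepsize rule is the same.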
The descent operator G_w is defined at (5.5), where γ* is an accumulation point of the descent sequence {γ^n} defined according to (5.4) with starting point γ^0 := γ. In view of the discussion in the previous section, the image of G_w is contained in S_F ∪ {γ_∞}. Note that, if P(w) > 0, in principle the image of G_w contains at least one stationary point γ* ∈ S_F, since in this case (5.2) holds (Remark 5.2). However, in simulations we can only compute G_w on some finite family of curves {γ_i}_{i∈I}, which we name the starts. Thus, in general, we have no guarantee of finding points in S_F, even if (5.2) holds. The situation improves if w = w^n is the dual variable associated to the n-th iterate µ_n = Σ_{j=1}^{N_n} c^n_j µ_{γ^n_j} of Algorithm 1 with n ≥ 1. In this case, setting A := {γ^n_1, . . . , γ^n_{N_n}}, we have that G_{w^n}(A) ⊂ S_F by the definition (5.5) and Remark 5.2. Therefore, by including the curves of A among the considered starts, we are guaranteed to obtain at least one point in S_F.

5.1.3. Random starts. We now describe how we randomly generate starting points γ ∈ H¹([0, 1]; Ω) for the descent operator G_w at (5.5). We start by selecting time nodes 0 = t_0 < t_1 < . . . < t_T = 1 drawn uniformly in [0, 1] (if operating in the time-discrete setting, we sample instead with uniform probability from the finite set of sampling times on which the fidelity term of (P_discr) is defined). We choose the value of a random start γ at time t_i seeking to maximize the dual variable w_{t_i}. To achieve this, let Q : R → R_+ be non-decreasing, and define the probability measure P_{w_{t_i}} on E, with density proportional to Q(w_{t_i}(x)), for A ⊂ E Borel measurable and E ⊂ Ω the set introduced in Assumption 5.1. We then draw samples from P_{w_{t_i}} with the rejection-sampling algorithm and assign those samples to γ(t_i). Using that E is a convex set, the random curve γ ∈ AC² is obtained by interpolating linearly between the values γ(t_i) ∈ E.
This procedure is executed by the routine sample, which inputs a dual variable w and outputs a randomly generated curve γ ∈ AC 2 .
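A sketch of the routine sample is given below, assuming a rectangular box stands in for the set E of Assumption 5.1 and each time node comes with a callable dual variable. The probe-based envelope used for rejection sampling is a simplification; the reweighting Q is the one stated for the experiments in Section 6.

```python
import numpy as np

rng = np.random.default_rng(0)

def Q(x):
    # Non-decreasing reweighting used in the experiments (Section 6):
    # Q(x) = exp(max(x + 0.05, 0)) - 1.
    return np.exp(np.maximum(x + 0.05, 0.0)) - 1.0

def rejection_sample(w, bounds, n_tries=10000):
    """Draw one point in the box `bounds` with density proportional to
    Q(w(x)), via rejection sampling; `w` is the dual variable at one time."""
    (x0, x1), (y0, y1) = bounds
    # Crude upper bound on Q(w) over the box, estimated from random probes.
    probes = np.column_stack([rng.uniform(x0, x1, 256), rng.uniform(y0, y1, 256)])
    M = max(Q(np.array([w(p) for p in probes])).max(), 1e-12) * 1.5
    for _ in range(n_tries):
        x = np.array([rng.uniform(x0, x1), rng.uniform(y0, y1)])
        if rng.uniform(0.0, M) <= Q(w(x)):
            return x
    raise RuntimeError("rejection sampling failed; Q(w) may vanish on the box")

def sample(w_list, bounds):
    """Random start: draw gamma(t_i) from P_{w_{t_i}} at each node; the
    linear interpolation between nodes is implicit in the returned array."""
    return np.array([rejection_sample(w, bounds) for w in w_list])
```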

Crossovers between stationary points.
It is heuristically observed that stationary curves tend to share common "routes", as seen for example in the reconstructions presented in Figures 4 and 5. It is then a reasonable ansatz to combine curves which share routes, in order to increase the likelihood that the newly obtained crossovers share common routes with the sought global minimizers of F. Such crossovers are then employed as starts for the descent operator G_w at (5.5). Formally, the crossover is achieved as follows. We fix small parameters ε > 0 and 0 < δ < 1. For γ_1, γ_2 ∈ AC² define the set R_ε(γ_1, γ_2) := {t ∈ [0, 1] : |γ_1(t) − γ_2(t)| ≤ ε}. We say that γ_1 and γ_2 share routes if R_ε(γ_1, γ_2) ≠ ∅. If R_ε(γ_1, γ_2) = ∅, we perform no operations on γ_1 and γ_2.
Assume now R_ε(γ_1, γ_2) ≠ ∅ and denote by I any of its connected components. Then I is an interval with endpoints t_− and t_+, satisfying 0 ≤ t_− < t_+ ≤ 1. Set t̄ := (t_+ + t_−)/2 and t̂ := (t_+ − t_−)/2. The crossovers of γ_1 and γ_2 in I are the two curves γ_3, γ_4 ∈ AC² obtained by exchanging the two curves across I: the curve γ_3 coincides with γ_1 before t̄ − δt̂ and with γ_2 after t̄ + δt̂, and is linearly interpolated in (t̄ − δt̂, t̄ + δt̂); the curve γ_4 is defined symmetrically, exchanging the roles of γ_1 and γ_2. We construct the crossovers of γ_1 and γ_2 in each connected component of R_ε(γ_1, γ_2), obtaining 2M new curves, where M is the number of connected components of R_ε(γ_1, γ_2). The described procedure is executed by the routine crossover, which inputs two curves γ_1, γ_2 ∈ AC² and outputs a set of curves in AC², possibly empty.
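A discrete sketch of the crossover routine, for curves sampled on a common uniform grid. As a simplification, the smoothing interpolation on (t̄ − δt̂, t̄ + δt̂) is omitted and the tails are simply swapped at the midpoint index of each shared component:

```python
import numpy as np

def shared_routes(g1, g2, eps):
    """Boolean mask of sample indices where the two curves are eps-close:
    a discrete stand-in for the set R_eps(gamma_1, gamma_2)."""
    return np.linalg.norm(g1 - g2, axis=1) <= eps

def crossover(g1, g2, eps):
    """For each maximal run of shared indices, swap the tails of the two
    curves at the midpoint of the run, producing two new curves per run.
    Curves are arrays of shape (T+1, d)."""
    close = shared_routes(g1, g2, eps)
    if not close.any():
        return []
    out, i, n = [], 0, len(close)
    while i < n:
        if close[i]:
            j = i
            while j + 1 < n and close[j + 1]:
                j += 1
            mid = (i + j) // 2                     # midpoint of the component
            out.append(np.vstack([g1[:mid], g2[mid:]]))   # swap tails at mid
            out.append(np.vstack([g2[:mid], g1[mid:]]))
            i = j + 1
        else:
            i += 1
    return out
```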

5.1.5.
Multistart gradient descent algorithm. In Subroutine 1 we sketch the proposed method to search for a minimizer of (5.1): it is implemented in the function MultistartGD, which inputs a set of curves A and a dual variable w, and outputs a (possibly empty) set S of stationary points of F defined at (5.1). We now describe how to interpret this subroutine in the context of Algorithm 1, and how it can be used to replace the insertion step operation at line 4. Given the n-th iterate µ_n = Σ_{j=1}^{N_n} c^n_j µ_{γ^n_j} of Algorithm 1, we define A := {γ^n_1, . . . , γ^n_{N_n}} if n ≥ 1 and A := ∅ if n = 0. The dual variable w^n is as in (4.4). We initialize to empty the sets S and O of known stationary points and crossover points, respectively. The condition at line 2 of Subroutine 1 checks whether we are at the 0-th iteration with P(w^0) ≤ 0. If this is satisfied, then 0 is the minimum of (5.1) and no stationary point is returned: indeed, in this situation Algorithm 1 stops, with µ_0 = 0 being the minimizer of (P) (see Section 5.1.1). Otherwise, the set A (possibly empty) is inserted into O. Then N_max initializations of the multistart gradient descent are performed, where a starting point γ is either chosen from the crossover set O, if the latter is non-empty, or sampled at random by the function sample described in Section 5.1.3. We then descend from γ, obtaining the new curve γ* := G_{w^n}(γ), where G_{w^n} is defined at (5.5). If the output point γ* does not belong to the set of known stationary points S, and γ* ≠ γ_∞, then we first compute the crossovers of γ* with all the elements of S, and afterwards insert γ* into S. After N_max iterations, the set S is returned. Notice that S can be empty only if A = ∅, i.e., if MultistartGD is called at the first iteration of Algorithm 1 (see Section 5.1.2).
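The loop just described can be sketched as follows. Here descend, sample_start and crossover stand for the descent operator G_w, the routine sample, and the crossover routine; a returned value of None plays the role of the infinite-length curve γ_∞. All names are placeholders for illustration.

```python
import numpy as np

def multistart_gd(A, descend, sample_start, crossover, N_max=20, tol=1e-6):
    """Sketch of Subroutine 1: run the descent operator from curves in the
    crossover queue O (seeded with the current atoms A) or from random
    starts, collecting distinct stationary points in S."""
    S, O = [], list(A)
    for _ in range(N_max):
        start = O.pop(0) if O else sample_start()
        g_star = descend(start)
        if g_star is None:                      # gamma_infty: discard
            continue
        if any(np.allclose(g_star, s, atol=tol) for s in S):
            continue                            # stationary point already known
        for s in S:                             # crossovers with known points
            O.extend(crossover(g_star, s))
        S.append(g_star)
    return S
```

As in the text, seeding the queue with the curves of the current iterate guarantees at least one stationary point whenever n ≥ 1.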
The modified insertion step, that is, line 4 of Algorithm 1, reads as follows. First we set A = γ and compute S := MultistartGD(A, w n ). If S = ∅, the algorithm stops and returns the current iterate µ n . Otherwise, the element in S with minimal energy with respect to F is chosen as candidate minimizer, and inserted as γ * in line 4.

Acceleration strategies.
In this section we describe two acceleration strategies that can be incorporated in Algorithm 1.

Multiple insertion step.
This is an extension of the insertion step for Algorithm 1 described in Section 4.1.2. Precisely, the multiple insertion step consists in inserting into the current iterate µ_n all the atoms associated to the stationary points in the set S produced by Subroutine 1 with respect to the dual variable w^n. This procedure is motivated by the following observations. First, computationally speaking, the coefficients optimization step described in Section 4.1.3 is cheap and fast. Second, stationary points are observed to be good candidates for the insertion step in the GCG method presented in [52] to solve (1.4). This observation extends similarly to our framework: the addition of multiple stationary points encourages the iterates to concentrate around every atom of the ground-truth measure µ†, and consequently the algorithm may need fewer iterations to efficiently locate the support of µ†.

5.2.2.
Sliding step. The sliding step proposed in this paper is a natural extension of the one introduced for the BLASSO in [20] and further analyzed in [30]. Precisely, given a current iterate µ = Σ_{j=1}^N c_j µ_{γ_j} of Algorithm 1, we fix the weights c = (c_1, . . . , c_N) and define the functional T_{α,β,c} : (H¹)^N → R by evaluating T_{α,β} on the measure with curves (γ_1, . . . , γ_N) and fixed weights c. We then perform additional gradient descent steps in the space (H¹)^N for the functional T_{α,β,c}, starting from the tuple of curves (γ_1, . . . , γ_N) contained in the current iterate µ. Formally, this procedure is justified: in Proposition A.13 we prove that T_{α,β,c} is continuously Fréchet differentiable on (H¹)^N under Assumption 5.1, with derivative given by (A.42). This step can be intertwined with the coefficients optimization step, alternating between modifying the positions of the current curves and optimizing their associated weights.
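The alternation between sliding and coefficient optimization can be sketched generically as follows; grad_curves and solve_weights are placeholders for the derivative (A.42) and the quadratic program of Section 4.1.3, and the fixed stepsize is a simplification of the descent rule used in practice.

```python
def sliding_and_weights(curves, weights, grad_curves, solve_weights,
                        n_alt=3, step=0.1):
    """Alternate between a gradient step on the curves with weights fixed
    (the sliding step) and re-optimizing the weights with curves fixed
    (the coefficients optimization step)."""
    for _ in range(n_alt):
        grads = grad_curves(curves, weights)
        curves = [g - step * dg for g, dg in zip(curves, grads)]
        weights = solve_weights(curves)
    return curves, weights
```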

Full algorithm.
By including Subroutine 1 and the proposed acceleration steps of Section 5.2 into Algorithm 1, we obtain Algorithm 2, which we name the dynamic generalized conditional gradient method (DGCG).

5.3.1. Algorithm summary. We employ the notations of Section 4.1.4. In particular, Algorithm 2 generates, as iterates, tuples of coefficients c = (c_1, . . . , c_N) with c_j > 0, and curves γ = (γ_1, . . . , γ_N) with γ_j ∈ AC². We now summarize the main steps of Algorithm 2. The first 3 lines are unaltered from Algorithm 1: they deal with tuple initialization and the assembly of the measure iterate µ_n. The multiple insertion step is carried out by Subroutine 1, via the function MultistartGD at line 4, which is called with arguments γ and w^n. The output is a set of curves S containing stationary points of the functional F at (5.1) with respect to the dual variable w^n. If S = ∅, the algorithm stops and outputs the current iterate µ_n. As observed in Section 5.1.2, this can only happen at the first iteration of the algorithm. Otherwise, in line 7, the function order inputs S and outputs a tuple of curves, obtained by ordering the elements of S increasingly with respect to their value of F. As anticipated in Section 5.1, the first element of order(S), named γ*_{N_n+1}, is considered the best available candidate solution to the insertion step problem (5.1), and is used in the stopping condition at lines 8 and 9. A similar stopping condition is used in Algorithm 1, with the difference that in Algorithm 2 the curve γ*_{N_n+1} is not necessarily a global minimizer of F but, in general, just a stationary point. We further discuss this stopping criterion in Section 5.3.2 below. Afterwards, the found stationary points are inserted into the current iterate, and the algorithm alternates between the coefficients optimization step (Section 4.1.3) and the sliding step (Section 5.2.2).
These operations are executed in lines 11 to 16 of Algorithm 2, and are repeated K_max times.

5.3.2.
Stopping condition. The stopping condition for Algorithm 2 is implemented in lines 8 and 9: when γ*_{N_n+1} satisfies ⟨ρ_{γ*_{N_n+1}}, w^n⟩ ≤ 1, the algorithm stops and outputs the current iterate µ_n. By the definition of F, this condition is equivalent to saying that Subroutine 1 has not been able to find any curve γ* ∈ AC² satisfying ⟨ρ_{γ*}, w^n⟩ > 1. The main difference compared to the stopping condition of Algorithm 1 (Section 4.1.2) is that, in Algorithm 2, the curve γ*_{N_n+1} is generally not a global minimizer of F. As a consequence, Lemma 4.8 does not apply, and the condition ⟨ρ_{γ*_{N_n+1}}, w^n⟩ ≤ 1 is not equivalent to the minimality of the current iterate µ_n for (P). The evident drawback is that Algorithm 2 could stop even if the current iterate does not solve (P). However, it is at least possible to say that, if Algorithm 2 continues past line 9, then the current iterate does not solve (P); thus, in this situation, the correct decision is taken. We remark that, even if the stopping condition for Algorithm 2 does not ensure the minimality of the output, from a practical standpoint the reconstruction is satisfactory whenever Subroutine 1 is employed with a high number of restarts. We also point out that the condition at line 8 can be replaced by the quantitative condition defined at (4.23) in Remark 4.9, with some predefined tolerance.

5.3.3.
Convergence and numerical residual. In the following, we say that Algorithm 2 converges if MultistartGD(γ, w^0) = ∅, or if some iterate satisfies the stopping condition at line 8. In case of convergence, we denote by µ_N the output of Algorithm 2. We remind the reader that, due to the discussion in Section 5.3.2, µ_N is considered an approximate solution of the minimization problem (P). In order to analyze the convergence rate of Algorithm 2 numerically, we define the numerical residual of each computed intermediate iterate µ_n as the difference T_{α,β}(µ_n) − T_{α,β}(µ_N), that is, (3.19) with the output µ_N in place of the (unknown) minimizer. We further define the numerical primal-dual gap G̃(µ_n), which we compute by employing (4.22) with γ* = γ*_{N_n+1}.

Numerical implementation and experiments
In this section we present the numerical experiments. In order to lower the computational cost, we chose to implement Algorithm 2 for the minimization of the time-discrete functional (P_discr) discussed in Section 4.5. The adaptation of Algorithm 2 to this setting follows directly from the discussion in Section 4.5.2. The simulations were produced by a Python code that is openly available at https://github.com/panchoop/DGCG_algorithm/. For all the simulations we employ the following:
• the considered domain is Ω := (0, 1)²;
• convergence of Algorithm 2 is intended as in Section 5.3.3. The number of restarts N_max for Subroutine 1 is stated in each experiment. Also, we employ the quantitative stopping condition for Algorithm 2 described in Remark 4.9, with tolerance TOL := 10^{-10};
• the crossover parameters described in Section 5.1.4 are chosen as ε = 0.05 and δ = 1/T.
The random starts presented in Section 5.1.3 are implemented using the function Q(x) = exp(max(x + 0.05, 0)) − 1. These parameter choices are purely heuristic and there is no reason to believe they are optimal. The remainder of the section is organized as follows. In Section 6.1 we first introduce the measurement spaces and Fourier-type forward operators employed in the experiments. We then explain how the synthetic data is generated in the noiseless case, and detail the noise model we consider. Subsequently, we show how the data and the obtained reconstructions can be visualized, by means of the so-called backprojections and intensities. We then pass to the actual experiments in Section 6.2, detailing three of them. The first experiment (Section 6.2.1), which is basic in nature, serves the purpose of illustrating how the measurements are constructed and how the data can be visualized. The second example (Section 6.2.2) showcases the reach of the proposed regularization and algorithm: there, we consider more complex data with various levels of added noise. In particular, we show that the proposed dynamic setting is capable of reconstructing severely spatially undersampled data. The final experiment (Section 6.2.3) illustrates a particular behaviour of our model when reconstructing sparse measures whose underlying curves cross, as discussed in Remark 4.2. We point out that in all the presented experiments the algorithm converged faster than the sublinear rate predicted by Theorem 4.6, indeed at a linear rate. We conclude the section with a few general observations on the proposed model and algorithm.
6.1.1. Forward operators. The employed forward operators K*_{t_i} are of Fourier type, defined at (6.1), where ρ is extended to zero outside of Ω, and the first integral is intended component-wise.
6.1.2. Data. For all the experiments, the ground-truth consists of a sparse measure of the form (6.2), where we follow the notations at (2.6). Given a ground-truth µ†, the respective noiseless data is constructed as f_{t_i} := K*_{t_i} ρ†_{t_i} ∈ H_{t_i}, for i = 0, . . . , T.

6.1.3. Noise model. Let U_{i,k}, V_{i,k}, for i = 0, . . . , T and k = 1, . . . , n_i, be realizations of two jointly independent families of standard 1-dimensional Gaussian random variables, with which we define the noise vector ν at (6.3). Given some data f = (f_{t_0}, . . . , f_{t_T}), with f_{t_i} ∈ H_{t_i}, and ν ≠ 0, the corresponding noisy data f^ε with relative noise level ε ≥ 0 is obtained by adding to f the vector ν, rescaled so that the perturbation has norm ε relative to that of f.
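A sketch of the noise model: we assume the noisy data is obtained by adding the Gaussian vector ν rescaled to a prescribed level ε relative to ‖f‖ (our reading of "relative noise" in Section 6.2), and store the data as a complex array for simplicity, with real and imaginary parts playing the roles of the families U_{i,k} and V_{i,k}.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_relative_noise(f, eps):
    """Corrupt time-discrete data with Gaussian noise at relative level eps:
    f_eps = f + eps * ||f|| * nu / ||nu||, with nu i.i.d. standard Gaussian
    in its real and imaginary parts. The exact scaling convention is an
    assumption made for illustration."""
    f = np.asarray(f, dtype=complex)
    nu = rng.standard_normal(f.shape) + 1j * rng.standard_normal(f.shape)
    nu_norm = np.linalg.norm(nu)
    if nu_norm == 0:
        return f.copy()
    return f + eps * np.linalg.norm(f) * nu / nu_norm
```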

Visualization via backprojection.
In general it is not illustrative to visualize the data directly. The proposed way to gain insight into the data structure is by means of backprojections: given f = (f_{t_0}, . . . , f_{t_T}), with f_{t_i} ∈ H_{t_i}, we call backprojection the collection of maps w^0_{t_i} := K_{t_i} f_{t_i} ∈ C(Ω). Note that w^0_{t_i} corresponds to the dual variable at the first iteration of Algorithm 2. As Ω = (0, 1)², such functions can be plotted at each time sample, allowing us to display the data.

6.1.5. Reconstruction intensities. Given a sparse measure µ := Σ_{j=1}^N c_j µ_{γ_j}, with N ∈ N, c_j > 0 and γ_j ∈ AC², the intensity associated to the atom µ_{γ_j} is defined as I_j := c_j a_{γ_j}. The quantity I_j measures the intensity at time t_i of the signal generated by a single source, since K*_{t_i}(c_j ρ_{γ_j(t_i)}) = I_j K*_{t_i}(δ_{γ_j(t_i)}). Therefore, when presenting reconstructions and comparing them to the given ground-truth, we will use the intensity I_j of each atom instead of its associated weight c_j.

Numerical experiments.
6.2.1. Experiment 1 - Single atom with constant speed. We start with a basic example that serves to illustrate how to read the data, the obtained reconstructions, and their respective distortions due to the employed regularization. We use constant-in-time forward Fourier-type measurements, with frequencies sampled from an Archimedean spiral: for each sampling time i ∈ {0, . . . , 50}, we consider the same frequency vector S_i ∈ (R²)^{n_i} with n_i := 20 and S_{i,k} lying on a spiral for k = 1, . . . , 20 (see Figure 1). Thus, the corresponding forward operators defined by (6.1) are constant in time, i.e., K*_{t_i} = K*. The employed ground-truth µ† is composed of a single atom with intensity I† = 1 and associated curve γ†(t) := (0.2, 0.2) + t(0.6, 0.6). Accordingly, we have µ† = c† µ_{γ†} with c† = 1/a_{γ†}. We consider the case of noiseless data f_{t_i} := K* ρ†_{t_i}. The corresponding backprojection w^0_{t_i} := K f_{t_i} is visualized in Figure 1 for some selected time samples. For the proposed data f we solve the minimization problem (P_discr) employing Algorithm 2. The obtained reconstructions are presented in Figure 2, where we display the results for two different choices of the parameters α, β. Given the simplicity of the considered example, we employed Subroutine 1 with only 5 restarts, that is, N_max = 5. For the parameter choices α = β = 0.1 and α = β = 0.4, Algorithm 2 converged in 2 and 1 iterations respectively, with execution times of 2 and 1 minutes (the employed CPU was an Apple M1 8 Core 3.2 GHz, running native arm64 Python 3.9.7). As is common for Tikhonov regularization methods, the obtained reconstructions differ from the ground-truth due to the effect of regularization. Specifically, in Figure 2c we notice a stronger effect of the regularization with parameters α = β = 0.4 at the endpoints of the reconstructed curve.
Such a phenomenon is expected: as argued in Section 5.1, each curve found by Subroutine 1 belongs to the set of stationary points S_F; due to the optimality conditions proven in Proposition A.10, any such curve has zero initial and final speed, i.e., γ̇(0) = γ̇(1) = 0. The mentioned constraint is achieved with a slower transition for larger speed penalizations β. We can quantify the discrepancy between the ground-truth curve γ† and a reconstructed curve γ with respect to the L² norm by computing D(γ†, γ) := ‖γ† − γ‖_{L²} / ‖γ†‖_{L²}. We obtain D(γ†, γ) = 0.00515 for α = β = 0.1 and D(γ†, γ) = 0.017 for α = β = 0.4. The reconstructed intensities for the parameter choices α = β = 0.1 and α = β = 0.4 are 87% and 48% of the ground-truth intensity respectively, as observed in Figure 2.

6.2.2. Experiment 2 - Complex example with time-varying measurements. The following example showcases the full strength of the proposed regularization and algorithm. The frequencies are sampled over lines through the origin of ℝ² which rotate in time. Specifically, let Θ ∈ ℕ be a bound on the number of lines, h > 0 a fixed spacing between measured frequencies on a given line, and consider frequencies S_i ∈ (ℝ²)^{n_i} defined by (6.4), where the involved matrix represents a rotation of angle θ_i := (i/Θ)π. For such frequencies, we consider the associated forward operators K^*_{t_i} as in (6.1). In the presented experiment the parameters are chosen as Θ = 4, h = 1 and n_i = 15 for all i = 0, …, 50: in other words, we sample along 4 different lines which rotate at each time sample; see Figure 3 for two examples. The ground-truth is a measure µ† composed of 3 atoms, whose associated curves are as in Figure 4a. Note that µ† displays different non-constant speeds, a contact point between two underlying curves, a strong kink, and intensities equal to 1.
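The rotating-line sampling can be sketched as follows. The function below generates one line per time index i, so that with Θ = 4 only four distinct directions occur overall; the exact content of (6.4), in particular the offsets of the frequencies on each line, is an assumption here.

```python
import numpy as np

def rotating_line_freqs(i, Theta=4, h=1.0, n=15):
    # Frequencies along a line through the origin, rotated by theta_i = (i/Theta)*pi
    # (assumed content of (6.4); the symmetric placement around 0 is illustrative).
    theta = i * np.pi / Theta
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    base = np.array([[k * h, 0.0] for k in range(-(n // 2), n // 2 + 1)])  # on the x-axis
    return base @ R.T                                                      # shape (n, 2)

S0, S1 = rotating_line_freqs(0), rotating_line_freqs(1)
print(S0.shape)                                              # (15, 2)
# All frequencies in S1 are collinear: 2D cross products with one of them vanish.
v = S1[0]
print(np.max(np.abs(S1[:, 0] * v[1] - S1[:, 1] * v[0])))     # numerically zero
```

The collinearity checked in the last line is precisely what makes each individual time sample ill-posed, as discussed in Remark 6.1 below.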
We point out that equal intensities are considered for the sole purpose of easing graph visualization: the reconstruction quality is not affected by different intensity choices. The noiseless data is defined by f_{t_i} := K^*_{t_i} ρ†_{t_i}. Following the noisy model described in Section 6.1.3, we also consider data f^{0.2} and f^{0.6} with 20% and 60% of added relative noise, respectively. For the noisy data we employed the same realization of the randomly generated noise vector ν defined in (6.3). In Figure 3 we present the backprojections for the noiseless and noisy data at two different times, with the ground-truth curves superimposed in blue. We can observe that at each time sample the backprojected data exhibits a line-constant behavior, as explained in Remark 6.1 below. We apply Algorithm 2 to obtain reconstructions for the noiseless and 20% added noise data with parameters α = β = 0.1 (see Figure 4); in Figure 5 we present the obtained reconstructions for the case of 60% added noise, where we further show the regularization effects of employing larger α, β values, namely α = β = 0.1 and α = β = 0.3. In the noiseless case, presented in Figure 4b, we observe an accurate reconstruction, with some low-intensity artifacts that for most of the time share paths with the higher-intensity atoms. In Figure 4c, where we add 20% of noise to the data, we notice a surge of low-intensity artifacts; nonetheless, the obtained solution is close, in the sense of measures, to the original
ground-truth. In Figure 5, for the case of 60% added noise, we notice that increasing the regularization parameters improves the quality of the obtained reconstruction, which displays only small regularization-induced distortions. The examples in Figures 4 and 5 demonstrate the power of the proposed regularization, given its reconstruction accuracy when simultaneously employing highly ill-posed forward measurements, as pointed out in Remark 6.1 below, together with strong noise. We conclude this example by presenting the convergence plots for the case of 60% added noise, this being the most complex experiment (see Figure 6). We plot the numerical residual r̃(µ^n) and the numerical primal-dual gap G̃(µ^n) for the iterates µ^n, where r̃ and G̃ are defined in Section 5.3.3. We observe that the algorithm exhibits a linear rate of convergence, instead of the proven sublinear one. Such linear convergence has been numerically observed in all the tested examples. Additionally, we see that the algorithm is greatly accelerated when one considers strong regularization parameters α, β. Finally, the plot confirms the efficacy of the proposed descent strategy for the insertion step (5.1): indeed we note that the inequality r̃(µ^n) ≤ G̃(µ^n) holds for most of the iterations in the performed experiments. As this inequality is proven in Lemma 3.8 for the actual residual and primal-dual gap, we have confirmation that γ^*_{N_n+1} in Algorithm 2 is a good approximate solution for (5.1). Execution times, iterations until convergence, and numbers of restarts are summarized in Table 1.
Remark 6.1. Consider the Fourier-type forward measurements K^*_t defined by (6.1) with frequencies S_i ∈ (ℝ²)^{n_i} sampled along rotating lines, as in (6.4). In this case, at each fixed time sample t_i, the operator K^*_{t_i} does not encode sufficient information to accurately resolve the location of the unknown ground-truth at time t_i. As a consequence, any static reconstruction technique, that is, one that does not jointly employ information from different time samples in order to perform a reconstruction, would not be able to accurately recover any ground-truth under these measurements. To justify this claim, notice that for each time sample t_i the family {S_{i,k}}_{k=1}^{n_i} defined in (6.4) is collinear, and as such, there exists a corresponding vector S⊥_i ∈ ℝ² such that S_{i,k} · S⊥_i = 0 for all k = 1, …, n_i. Therefore, for any given time-static source ρ† = δ_{x†} with x† ∈ (0.1, 0.9)² ⊂ ℝ², the measured forward data is invariant along S⊥_i, that is, K^*_{t_i} δ_{x† + sS⊥_i} = K^*_{t_i} δ_{x†} for every s such that x† + sS⊥_i ∈ (0.1, 0.9)². Hence, solely with the information of a single time sample, it is not possible to distinguish a source along a line, and therefore it is not possible to accurately resolve it. This is in contrast with the dynamic model presented in this paper, which is able to perform an efficient reconstruction, as demonstrated in Experiment 2.
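The invariance claimed in Remark 6.1 can be checked numerically: for a collinear frequency family, the Fourier data of a Dirac (in the assumed plain-exponential form of (6.1)) does not change under shifts along S⊥_i, since every inner product S_{i,k} · S⊥_i vanishes.

```python
import numpy as np

# Frequencies on a single line through the origin (a collinear family as in (6.4)),
# with direction d, spacing h and n samples; the angle 0.3 is an arbitrary choice.
h, n = 1.0, 15
d = np.array([np.cos(0.3), np.sin(0.3)])
S = np.array([k * h * d for k in range(-(n // 2), n // 2 + 1)])   # all multiples of d
d_perp = np.array([-d[1], d[0]])                                  # S_k . d_perp = 0 for all k

def dirac_data(x):
    # Fourier-type data of a Dirac at x (assumed form of (6.1), without cutoff).
    return np.exp(-2j * np.pi * (S @ x))

x = np.array([0.4, 0.5])
for s in (0.01, 0.1, 0.25):
    err = np.max(np.abs(dirac_data(x + s * d_perp) - dirac_data(x)))
    print(err)   # numerically zero: the data cannot see shifts along d_perp
```

Any shift of the source along d_perp thus produces identical measurements, so a static method has an entire line of equally plausible source locations at each time sample.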
6.2.3. Experiment 3 - Crossing example. The following is an example in which the considered model is not able to track dynamic sources: although the reconstruction is close to the ground-truth in the sense of measures, its underlying curves do not resemble those of the ground-truth. This effect is due to the non-injectivity of the map at (4.2): even if the sparse measure we wish to recover is unique, its decomposition into atoms might not be. A simple example in which injectivity fails is given by the crossing of two curves (see Remark 4.2): this is the subject of the numerical experiment performed in this section. Specifically, the considered ground-truth µ† is of the form (6.5). Notice that γ†_1 and γ†_2 cross at time t = 0.5, and the respective atoms both have intensity 1. For the forward measurements we employ the time-constant Archimedean spiral family of frequencies defined in the first numerical experiment in Section 6.2.1, resulting in the constant-in-time operator K^*_{t_i} = K^*. The reconstruction is performed for noiseless data f_{t_i} := K^* ρ†_{t_i}. In Figure 7 we present the considered frequency samples, together with the backprojected data at selected time samples. In Figure 8 we display the obtained reconstruction for high regularization parameters α = 0.5 and β = 0.5. It is observed that the reconstructed atoms are not close, as curves, to the ones in (6.5): rather than crossing at time t = 0.5, the two curves rebound. As already mentioned, this phenomenon is a consequence of the lack of uniqueness for the sparse representation of µ†, which in this case is represented both by crossing curves and by rebounding curves. The fact that Algorithm 2 outputs rebounding curves is due to the employed regularization: indeed, the considered Benamou-Brenier-type penalization selects a solution whose squared velocity is minimized, which discourages the reconstruction from following the crossing path.
However, we remark that the proposed algorithm yields a good solution in terms of the model, given that, in the sense of measures, the obtained reconstruction is very close to the ground-truth µ†. More sophisticated models are needed in order to resolve the crossing, as briefly discussed in Section 7.
6.3. Remark on execution times. The execution times of our algorithm are quite high for some of the examples presented in Table 1. This is mainly due to the computational cost of the insertion step, since the algorithm is set to run several gradient descents to find a global minimum of the linearized problem (3.12) at each iteration, and these descents are executed in a non-parallel fashion on a single CPU core. The other components of the algorithm, namely the routines sample and crossover included in Subroutine 1, the coefficient optimization step and the sliding step, have negligible computational cost in comparison. In particular, the execution time of the sliding step grows with the number of active atoms, but this effect appears towards the last iterations of the algorithm, and it is shadowed by the insertion step, whose gradient descents become longer as the iterate approaches the optimal value. As confirmation of the role of the insertion step in the overall computational cost, one can see that the execution times of the algorithm depend linearly on the total number of gradient descents run in each example. Indeed, since the total number of gradient descents is given by the number of restarts multiplied by the number of iterations (see Table 1), the ratio between the execution time and the total number of gradient descents is of the same order for all the presented examples (between 0.0008 and 0.0019 hours per gradient descent). It is worth pointing out that the multistart gradient descent is a highly parallelizable method. Some early tests in this direction indicate that much lower computational times are achievable for the presented examples by simultaneously computing gradient descents on several GPUs.
6.4. Conclusions and discussion. The presented numerical experiments confirm the effectiveness of Algorithm 2, and show that the proposed Benamou-Brenier-type energy is an excellent candidate to regularize dynamic inverse problems. This is evidenced in particular by Experiment 2 in Section 6.2.2, where we consider a dynamic inverse problem that is impossible to tackle with a purely static approach, as discussed in Remark 6.1. Even in the extreme case of 60% added noise, our method recovered satisfactory reconstructions. We can further observe the distortions induced by the considered regularization. Precisely, the attenuation of the reconstruction's velocities and intensities is a direct consequence of the minimization of the Benamou-Brenier energy and the total variation norm in the objective functional; for this reason, the choice of the regularization parameters α and β affects the reconstruction and the magnitude of such distortions. A further effect of the regularization is the phenomenon presented in Experiment 3 (Section 6.2.3): as the dynamic inverse problem is formulated in the space of measures, if the sparse ground-truth possesses several different decompositions into extremal points, the preferred reconstruction may be the one favored by the regularizer. Finally, concerning the execution times, we emphasize that the presented simulations are a proof of concept and not the result of a carefully optimized algorithm. There are many directions for improvement, the most promising being a GPU implementation that parallelizes the gradient descents of the insertion step. To increase the likelihood of finding a global minimizer in the insertion step, Subroutine 1 was employed with a high number of restarts N_max, which was tuned manually, prioritizing high reconstruction accuracy over execution time.
To improve on this aspect, one could include early stopping conditions in Subroutine 1, for example by exiting the routine when a sufficiently high fraction of the starts descends towards the same stationary curve. Lastly, the code was written with readability in mind, as well as adaptability to a broad class of inverse problems. Therefore, the reported execution times are not an accurate estimate of what would be possible in specific applications.

7. Future perspectives
In this section we propose several research directions expanding on the work presented in this paper. A first relevant question concerns the proposed DGCG algorithm, and, in particular, the possibility of proving a theoretical linear convergence rate under suitable structural assumptions on the minimization problem (P). Linear convergence has recently been proven for the GCG method applied to the BLASSO problem [39,51]. It seems feasible to extend such an analysis to the DGCG algorithm presented in this paper, especially in view of the linear convergence observed in the experiments provided in Section 6.2, and the fact that our proof of sublinear convergence (see Theorem 4.6) does not fully exploit the coefficient optimization step, as commented in Remark 4.7. This line of research is currently under investigation by the authors [12]. Another interesting research direction is the extension of the DGCG method introduced in this paper to the case of unbalanced optimal transport. Precisely, one can regularize the inverse problem (1.2) by replacing the Benamou-Brenier energy B in (1.3) with the so-called Wasserstein-Fischer-Rao energy, as proposed in [16]. Such energy, first introduced in [24,45,47] as a model for unbalanced optimal transport, accounts for more general displacements t ↦ ρ_t, in particular allowing the total mass of ρ_t to vary during the evolution. The possibility of numerically treating such a problem with conditional gradient methods would rest on the characterization of the extremal points of the Wasserstein-Fischer-Rao energy recently achieved by the authors in [13]. In addition, it is a challenging open problem to design alternative dynamic regularizers that allow one to accurately reconstruct a ground-truth composed of crossing atoms, such as the ones considered in the experiment in Section 6.2.3.
Because the considered Benamou-Brenier-type regularizer penalizes the square of the velocity field associated with the measure, the reconstruction obtained by our DGCG algorithm does not follow the crossing route (Figure 8b). A possible solution is to consider additional higher-order regularizers in (P), such as curvature-type penalizations. The challenging part is devising a penalization that can be enforced at the level of Borel measures, and whose extremal points are measures concentrated on sufficiently regular curves.
Finally, taking into account the possible improvements discussed in Section 6.4, the implementation of an accelerated and parallelized version of Algorithm 2 will be the subject of future work.

A.3. Existence of minimizers for linearized problems. In this section we show existence of minimizers for the problems at (3.11) and (3.12). To this end, we prove existence for a slightly more general functional (see (A.4) below), which coincides with (3.11) and (3.12) for ϕ as in (3.6) and ϕ(t) := χ_{(−∞,1]}(t), respectively.
As observed in Remark 5.2, the only case of interest is when the minimum value of the problem at (5.1) is strictly negative. Thus, we assume that we are in this situation, and consider the minimization problem (A.9). As discussed in Section 5.1, we are interested in computing stationary points of F by gradient descent. In order to make this possible, we first extend F to the Hilbert space H¹ := H¹([0,1]; ℝ^d), in a way that the set of stationary points of F is not altered. To be more precise, by assumptions (F1)-(F3) we have that the dual variable w_t := −K_t(K^*_t ρ̄_t − f_t) belongs to C^{1,1}(Ω) for a.e. t ∈ (0,1). Additionally, (F4) implies that supp w_t, supp ∇w_t ⊂ E for a.e. t ∈ (0,1), where E ⊂ Ω is closed and convex. We can then extend w_t to the whole ℝ^d by setting w_t(x) := 0 for all x ∈ ℝ^d \ Ω and a.e. t ∈ (0,1). Consequently, the functional F is well defined via (A.8) over the space H¹.
In the above setting we are able to prove that F is continuously Fréchet differentiable over H¹ (Proposition A.9 below). Denote by D_γF ∈ (H¹)^* the Fréchet derivative of F at γ. We also show that stationary points of F, i.e., curves γ^* ∈ H¹ such that D_{γ^*}F = 0, satisfy γ^*([0,1]) ⊂ E ⊂ Ω whenever F(γ^*) ≠ 0 (Proposition A.10 below). Therefore problem (A.9) is equivalent to
(A.10)   min_{γ ∈ H¹} F(γ).
We can now apply the gradient descent algorithm to compute stationary points of F, in an attempt to approximate solutions to (A.10), and hence to (A.9). The main result of this section states that descent sequences for F, in the sense of (5.4), converge (subsequentially) to stationary points.
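As a concrete illustration, the following sketch runs a plain gradient descent on a discretized curve for a functional of the assumed form F(γ) = −(∫₀¹ w_t(γ(t)) dt)/(α + (β/2)∫₀¹ |γ̇(t)|² dt), consistent with the quantities W, L and the bound L ≥ α appearing in this appendix. The dual variable w_t is replaced by a synthetic Gaussian bump, and a finite-difference gradient stands in for D_γF, so this is an illustration rather than the paper's implementation.

```python
import numpy as np

alpha, beta, T = 0.1, 0.1, 11
ts = np.linspace(0.0, 1.0, T)

def w(t, x):
    # Synthetic dual variable w_t(x): a Gaussian bump following the point (t, t).
    # It stands in for w_t = -K_t(K_t^* rho_t - f_t); an assumption for illustration.
    return np.exp(-20.0 * np.sum((x - np.array([t, t])) ** 2))

def F(gamma):
    # Discretized F(gamma) = -W(gamma) / L(gamma), with
    # W = int_0^1 w_t(gamma(t)) dt and L = alpha + (beta/2) int_0^1 |gamma'(t)|^2 dt.
    W = np.mean([w(t, g) for t, g in zip(ts, gamma)])
    vel = np.diff(gamma, axis=0) * (T - 1)          # finite-difference velocities
    L = alpha + 0.5 * beta * np.mean(np.sum(vel ** 2, axis=1))
    return -W / L

def grad_F(gamma, eps=1e-6):
    # Central finite-difference gradient: a stand-in for the Frechet derivative D_gamma F.
    g = np.zeros_like(gamma)
    for idx in np.ndindex(gamma.shape):
        gp, gm = gamma.copy(), gamma.copy()
        gp[idx] += eps
        gm[idx] -= eps
        g[idx] = (F(gp) - F(gm)) / (2.0 * eps)
    return g

rng = np.random.default_rng(1)
gamma = rng.uniform(0.2, 0.8, size=(T, 2))          # one random start; multistart repeats this
F0 = F(gamma)
for _ in range(200):                                # plain gradient descent with a fixed step
    gamma = gamma - 0.05 * grad_F(gamma)
print(F(gamma) < F0 < 0.0)                          # the descent strictly decreases F
```

Repeating this from several random starts and keeping the best stationary curve is the essence of the multistart strategy of Subroutine 1.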
Theorem A.6. Assume (F1)-(F4) as in Section 5.1 and (H1)-(H3) as in Section 3.1. Let (t ↦ ρ̄_t) ∈ C_w, f ∈ L²_H and α, β > 0 be given. For a.e. t ∈ (0,1) set w_t := −K_t(K^*_t ρ̄_t − f_t) and w_t(x) := 0 for all x ∈ ℝ^d \ Ω. Then, the corresponding functional F : H¹ → ℝ defined via (A.8) is continuously Fréchet differentiable. If {γ^n} in H¹ is a descent sequence for F in the sense of (5.4) then, up to subsequences, γ^n → γ^* strongly in H¹. Any such accumulation point γ^* satisfies F(γ^*) < 0 and is stationary for F, that is, D_{γ^*}F = 0.

The proof of Theorem A.6, postponed to Section A.4.2 below, relies on differentiability results for F and on properties of its stationary points, discussed in the following Section A.4.1. Finally, in Section A.4.3 we provide an implementable criterion to determine whether the minimum of (5.1) is strictly negative.
A.4.1. Differentiability of F and stationary points. In this section we discuss the Fréchet differentiability and the properties of stationary points of the (extended) operator F : H¹ → ℝ defined at (A.8). Before proceeding with the discussion, we establish some notation and make some remarks on assumptions (F1)-(F4). In what follows, for any closed S ⊆ ℝ^d we denote by C^{1,1}(S) the space of differentiable maps ϕ : S → ℝ for which the associated C^{1,1} norm is finite, where ∇ : C^{1,1}(S) → C(S; ℝ^d) is the gradient operator. Notice that in this case ∇ is linear and continuous. We will also consider the Bochner space L²([0,1]; C^{1,1}(S)) equipped with the norm ‖w‖²_{1,1} := ∫₀¹ ‖w_t‖²_{C^{1,1}(S)} dt. Finally, for two Banach spaces X, Y and a continuously Fréchet differentiable map G : X → Y, we denote the differential of G by DG, with evaluation at u ∈ X given by the linear continuous map v ↦ D_uG(v).
In order to show that F is differentiable, we first investigate the regularity of dual variables t ↦ w_t of the form considered in (5.1). The differentiability properties of F are considered afterwards.
We first show that γ ↦ D_γW is continuous from H¹ into (H¹)^*. To this end, fix γ¹, γ² ∈ H¹ and notice the estimate (A.17), where in the last inequality we employed Cauchy-Schwarz and the bound ‖η‖_∞ ≤ √2 ‖η‖_{H¹}. Notice that (A.17) shows that the map γ ↦ D_γW is Lipschitz from H¹ into (H¹)^*. Thus, in particular, W is continuously Fréchet differentiable. We will now prove the estimates (A.15)-(A.16). The first bound in (A.15) follows immediately from the definition of F, the fact that w ∈ L²([0,1]; C^{1,1}(ℝ^d)), and the estimate L ≥ α > 0. The second estimate in (A.15) follows from (A.13) and the triangle inequality. Notice that ‖D_γW‖_{(H¹)^*} ≤ ‖w‖_{1,1}, thanks to (A.14) and Hölder's inequality. Moreover, a further estimate follows by (A.14) and Hölder's inequality, noting that the real map s ↦ βs/(βs²/2 + α) is differentiable, with maximum value given by (β/(2α))^{1/2}. Concerning the first term in (A.20), the desired bound follows from the estimate L ≥ α. We now estimate the second term in (A.20). First note that, as a consequence of (A.15) and of the mean value theorem, the map γ ↦ F(γ) is bounded by ‖w‖_{1,1}/α and has (global) Lipschitz constant bounded by ‖w‖_{1,1}(1 + β/(2α))/α. Moreover, the map γ ↦ G(γ) := D_γL/L(γ) is bounded by β/(2α) (see (A.19)). It is easy to check that G is continuously Fréchet differentiable. Employing the estimates L ≥ α and (A.19), we also check that γ ↦ D_γG is bounded uniformly by 3β/(2α). By the mean value theorem we then conclude that G is globally Lipschitz with constant controlled by 3β/(2α). Arguing as in (A.22), we obtain the desired estimate. Finally, we show that stationary points of F with non-zero energy are curves contained in E ⊂ Ω.
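The maximum of the scalar map s ↦ βs/(βs²/2 + α) invoked above can be computed explicitly; under the stated form of the map, a short derivation gives the value √(β/(2α)):

```latex
f(s) = \frac{\beta s}{\tfrac{\beta}{2}s^{2} + \alpha}, \qquad
f'(s) = \frac{\beta\bigl(\tfrac{\beta}{2}s^{2} + \alpha\bigr) - \beta s \cdot \beta s}
             {\bigl(\tfrac{\beta}{2}s^{2} + \alpha\bigr)^{2}}
      = \frac{\beta\bigl(\alpha - \tfrac{\beta}{2}s^{2}\bigr)}
             {\bigl(\tfrac{\beta}{2}s^{2} + \alpha\bigr)^{2}},
```

so that f'(s^*) = 0 exactly at s^* = √(2α/β), where

```latex
f(s^{*}) = \frac{\beta\sqrt{2\alpha/\beta}}{\alpha + \alpha}
         = \frac{\sqrt{2\alpha\beta}}{2\alpha}
         = \sqrt{\frac{\beta}{2\alpha}} .
```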
Proof. By Lemma A.8 we have that w ∈ L²([0,1]; C^{1,1}(ℝ^d)). Moreover, Proposition A.9 ensures that F is continuously Fréchet differentiable over H¹. If γ^* is such that D_{γ^*}F = 0, from (A.13)-(A.14) and the inequality L ≥ α > 0 we deduce the weak formulation of (A.24), namely (A.25). Suppose that F(γ^*) ≠ 0 and set A := {t ∈ [0,1] : γ^*(t) ∉ E}. Assume by contradiction that A ≠ ∅. Note that A ≠ [0,1], since F(γ^*) ≠ 0 and (F4) holds. Since E is closed and γ^* is continuous, A is relatively open in [0,1]. Therefore A = ⋃_{i∈ℕ} I_i, with I_i pairwise disjoint intervals, each of the form (a_i, b_i) with 0 < a_i < b_i < 1, or [0, a_i), or (a_i, 1], with 0 < a_i < 1, or empty. Assume first that there exists i ∈ ℕ such that I_i = (a_i, b_i). Let ϕ ∈ C¹_c(a_i, b_i) and extend it to zero on the whole [0,1]. Set η := e_j ϕ, with e_j the j-th coordinate vector in ℝ^d. Since supp ∇w_t ⊂ E for a.e. t ∈ (0,1) (see (F4)), γ^*(t) ∉ E for t ∈ (a_i, b_i), and F(γ^*) ≠ 0, testing (A.25) against η shows that γ^* restricted to [a_i, b_i] is a straight segment joining γ^*(a_i), γ^*(b_i) ∈ E; by the convexity of E this yields γ^*(t) ∈ E for t ∈ (a_i, b_i), which is a contradiction. Assume now that there exists i ∈ ℕ such that I_i = [0, a_i) for some 0 < a_i < 1. Let ϕ ∈ L²(0, a_i), extend it to zero in [a_i, 1], and set η(t) := −e_j ∫_{a_i}^t ϕ(s) ds. Testing (A.25) against η allows us to conclude that γ^* is constant in [0, a_i], which is a contradiction since by construction γ^*(a_i) ∈ E. Similarly, the remaining case I_i = (a_i, 1] for some 0 < a_i < 1 leads to a contradiction. Thus we conclude that A = ∅, finishing the proof.

A.4.2. Gradient descent. In this section we prove Theorem A.6 on descent sequences for the functional F at (A.8). The proof relies on the following lemma: let {γ^n} in H¹ be a sequence such that, for some c < 0, F(γ^n) → c and D_{γ^n}F → 0 in (H¹)^*. Then, up to subsequences, γ^n → γ^* strongly in H¹. Any such accumulation point γ^* satisfies F(γ^*) = c and is stationary for F, namely, D_{γ^*}F = 0.
Moreover, by the definition of η^n and (A.14), one can check that D_{γ^n}L(η^n) = L(γ^n) for all n ∈ ℕ.