1 Introduction

The aim of this paper is to develop a dynamic generalized conditional gradient method (DGCG) to numerically compute solutions of ill-posed dynamic inverse problems regularized with optimal transport energies. The code is openly available on GitHub.

Lately, several approaches have been proposed to tackle dynamic inverse problems [42, 57, 58, 61, 67], all of which take advantage of redundancies in the data to stabilize reconstructions, both in the presence of noise and under undersampling. A common challenge in such time-dependent approaches is how to properly connect, or relate, neighboring data points in time, so that the reconstructed object follows a presumed dynamic. In this paper, we address this issue by means of dynamic optimal transport. A wide range of applications can benefit from motion-aware approaches. In particular, the employment of dynamic reconstruction methods represents one of the latest key mathematical advances in medical imaging. For instance, magnetic resonance imaging (MRI) [48, 50, 56] and computed tomography (CT) [9, 21, 32] methods allow dynamic modalities in which the time-dependent data are further undersampled to reach high temporal sampling rates; these are required to resolve organ motion, such as the beating heart or the breathing lung. A more accurate reconstruction of the image, and of the underlying dynamics, would yield valuable diagnostic information.

1.1 Setting and Existing Approaches

Recently, it has been proposed to regularize dynamic inverse problems using dynamic optimal transport energies, both in a balanced and in an unbalanced context [15, 59, 60], with the goal of efficiently reconstructing time-dependent Radon measures. Such a regularization choice is natural: optimal transport energies incorporate information about the time correlations present in the data, and thus favor a more stable reconstruction. Optimal transport theory was originally developed to find the most efficient way to move mass from a probability measure \(\rho _0\) to another one \(\rho _1\), with respect to a given cost [44, 54]. More recently, Benamou and Brenier [8] showed that the optimal transport map can be computed by solving

$$\begin{aligned} \min _{(\rho _t,v_t)} \frac{1}{2}\int _0^1\int _{\overline{\Omega }} \left|v_t(x)\right|^2 \, \mathrm {d}\rho _t(x)\,\mathrm{d}t \,\,\,\, \text { s.t. } \,\,\,\, \partial _t \rho _t + {{\,\mathrm{div}\,}}(v_t \rho _t ) = 0\,, \end{aligned}$$
(1)

where \(t \mapsto \rho _t\) is a curve of probability measures on the closure of a bounded domain \(\Omega \subset \mathbb {R}^d\), \(v_t\) is a time-dependent vector field advecting the mass, and the continuity equation is intended in the sense of distributions with initial data \(\rho _0\) and final data \(\rho _1\). Notably, the quantity at (1), named Benamou–Brenier energy, admits an equivalent convex reformulation. Specifically, consider the space of bounded Borel measures \(\mathcal {M}:= \mathcal {M}(X) \times \mathcal {M}(X; \mathbb {R}^d)\), \(X:=(0,1)\times \overline{\Omega }\), and define the convex energy \(B: \mathcal {M}\rightarrow [0,+\infty ]\) by setting

$$\begin{aligned} B(\rho ,m) := \frac{1}{2} \int _0^1 \int _{\overline{\Omega }} \left| \frac{\mathrm {d}m}{\mathrm {d}\rho }\,(t,x)\right| ^2\, \mathrm {d}\rho (t,x)\,, \end{aligned}$$

if \(\rho \ge 0\) and \(m\ll \rho \), and \(B:=+\infty \) otherwise. Then, (1) is equivalent to minimizing B under the linear constraint \(\partial _t \rho + {{\,\mathrm{div}\,}}m = 0\). Such convex reformulation can be employed as a regularizer for dynamic inverse problems, where instead of fixing initial and final data, a fidelity term is added to measure the discrepancy between the unknown and the observation at each time instant, as proposed in [15]. There the authors consider the dynamic inverse problem of finding a curve of measures \(t \mapsto \rho _t\), with \(\rho _t \in \mathcal {M}(\overline{\Omega })\), such that

$$\begin{aligned} K_t^*\rho _t = f_t \,\,\, \text { for a.e. }\,\, t \in (0,1)\,, \end{aligned}$$
(2)

where \(f_t \in H_t\) is some given data, \(\{H_t\}\) is a family of Hilbert spaces and \(K_t^* : \mathcal {M}(\overline{\Omega }) \rightarrow H_t\) are linear continuous observation operators. The problem at (2) is then regularized via the minimization problem

$$\begin{aligned} \min _{(\rho ,m) \in \mathcal {M}} \frac{1}{2}\int _0^1 \left\Vert K^*_t \rho _t - f_t\right\Vert ^2_{H_t}\mathrm{d}t + \beta B(\rho ,m) + \alpha \left\Vert \rho \right\Vert _{\mathcal {M}(X)} \,\, \text { s.t. } \,\, \partial _t \rho + {{\,\mathrm{div}\,}}m = 0\,, \end{aligned}$$
(3)

where \(\Vert \rho \Vert _{\mathcal {M}(X)}\) denotes the total variation norm of the measure \(\rho :=\mathrm{d}t \otimes \rho _t\), and \(\alpha , \beta >0\) are regularization parameters. Notice that any curve \(t \mapsto \rho _t\) having finite Benamou–Brenier energy and satisfying the continuity equation constraint must have constant mass (see Lemma 9). As a consequence, the regularization (3) is especially suited to reconstruct motions where preservation of mass is expected. We point out that such a formulation is remarkably flexible, as the measurement spaces and measurement operators are allowed to be very general. In this way, one could model, for example, undersampled acquisition strategies in medical imaging, particularly MRI [15].

The aim of this paper is to design a numerical algorithm to solve (3). The main difficulties arise due to the non-reflexivity of measure spaces. Even in the static case, solving the classical LASSO problem [62] in the space of bounded Borel measures (known as BLASSO [29]), i.e.,

$$\begin{aligned} \inf _{\rho \in \mathcal {M}(\Omega )}\frac{1}{2} \Vert K\rho - f\Vert ^2_Y + \alpha \Vert \rho \Vert _{\mathcal {M}(\Omega )}\,, \end{aligned}$$
(4)

for a Hilbert space Y and a linear continuous operator \(K : \mathcal {M}(\Omega ) \rightarrow Y\), has proven to be challenging. Usual strategies to tackle (4) numerically often rely on the discretization of the domain [26, 28, 63]; however, grid-based methods are known to be affected by theoretical and practical flaws such as high computational costs and the presence of mesh-dependent artifacts in the reconstruction. The mentioned drawbacks have motivated algorithms that do not rely on domain discretization, but optimize directly on the space of measures. One class of such algorithms, first introduced in [18] and subsequently developed in different directions [10, 30, 40, 52], is that of generalized conditional gradient methods (GCG) or Frank–Wolfe-type algorithms. They can be regarded as the infinite-dimensional generalization of the classical Frank–Wolfe optimization algorithm [41] and of GCG in Banach spaces [7, 17, 19, 25, 33, 43]. The basic idea behind such algorithms consists in exploiting the structure of sparse solutions to (4), which are given by finite linear combinations of Dirac deltas supported on \(\Omega \). In this case, Dirac deltas represent the extremal points of the unit ball of the Radon norm regularizer. With this knowledge at hand, the GCG method iteratively minimizes a linearized version of (4); such a minimum can be found in the set of extremal points. The iterate is then constructed by adding delta peaks at each iteration, and by subsequently optimizing the coefficients of the linear combination. GCG methods have proven to be successful at solving (4), and have been adapted to related problems in the context of, e.g., super-resolution [1, 49, 52, 59].
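To fix ideas, the following minimal numpy sketch illustrates the insertion step of such a GCG iteration for (4) in a discretized one-dimensional setting: the linearized problem is minimized by a Dirac delta placed where the dual variable \(K(f-K\rho )\) is largest in absolute value. The Gaussian kernel, the grids and all variable names are illustrative assumptions, not the setup used later in this paper.

import numpy as np

# Insertion step of a GCG / Frank-Wolfe iteration for the BLASSO problem (4):
# the new Dirac atom is placed where the dual variable K(f - K rho) peaks.
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 200)        # candidate locations in Omega = (0, 1)
samples = np.linspace(0.0, 1.0, 30)      # sampling positions defining Y = R^30
A = np.exp(-(samples[:, None] - grid[None, :]) ** 2 / (2 * 0.05 ** 2))  # columns = K(delta_x)

true_idx, true_weights = np.array([60, 140]), np.array([1.0, 0.7])
f = A[:, true_idx] @ true_weights + 0.01 * rng.standard_normal(len(samples))

weights = np.zeros(len(grid))            # current sparse iterate (here: the zero measure)
dual = A.T @ (f - A @ weights)           # dual variable K(f - K rho) evaluated on the grid
new_location = grid[np.argmax(np.abs(dual))]   # support point of the inserted Dirac delta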

1.2 Outline of the Main Contributions

Inspired by GCG methods, the goal of this paper is to develop a dynamic generalized conditional gradient method (DGCG) aimed at solving the dynamic minimization problem at (3). Similarly to the classical GCG approaches, our DGCG algorithm is based on the structure of sparse solutions to (3), and it is Lagrangian in essence, since it does not require a discretization of the space domain. Lagrangian approaches have been proven useful for many different dynamic applications [55], often outperforming Eulerian approaches, where the discretization in space is necessary. Indeed, since Eulerian approaches are based on the optimization of challenging discrete assignment problems in space, Lagrangian approaches allow to lower the computational costs and are more suitable to reconstruct coalescence phenomena in the dynamics. Motivated by similar considerations, our approach aims to reduce the reconstruction artifacts and lower the computational cost when compared to grid-based methods designed to solve similar inverse problems to (3) (see [60]).

The fundamentals of our approach rest on recent results concerning sparsity for variational inverse problems: it has been empirically observed that the presence of a regularizer promotes the existence of sparse solutions, that is, minimizers that can be represented as a finite linear combination of simpler atoms. This effect is evident in reconstruction problems [34, 39, 64,65,66], as well as in variational problems in other applications, such as materials science [37, 38, 46, 53]. Existence of sparse solutions has been recently proven for a class of general functionals comprised of a fidelity term, mapping to a finite dimensional space, and a regularizer: in this case, atoms correspond to the extremal points of the unit ball of the regularizer [11, 14]. In the context of (3), the extremal points of the Benamou–Brenier energy have been recently characterized in [20]; this provides an operative notion of atoms that will be used throughout the paper. More precisely, for every absolutely continuous curve \(\gamma : [0,1] \rightarrow \overline{\Omega }\), we name as atom of the Benamou–Brenier energy the respective pair of measures \(\mu _\gamma := (\rho _\gamma , m_\gamma ) \in \mathcal {M}\) defined by

$$\begin{aligned} \rho _\gamma :=a_\gamma \, \mathrm{d}t \otimes \delta _{\gamma (t)} \,, \,\,\,\, m_\gamma :=\dot{\gamma }(t) \rho _\gamma \,,\,\,\,\, a_\gamma :=\left( \frac{\beta }{2} \int _0^1 \left|\dot{\gamma }(t)\right|^2 \, \mathrm{d}t + \alpha \right) ^{-1}\,. \end{aligned}$$
(5)

The notion of atom described above can be regarded as the dynamic counterpart of Dirac deltas for the Radon norm regularizer. Pairs of measures of the form (5) constitute the building blocks used in our DGCG method to generate, at iteration step n, the sparse iterate \(\mu ^n = (\rho ^n , m^n)\)

$$\begin{aligned} \mu ^n = \sum _{j=1}^{N_n} c_j \mu _{\gamma _{j}} \end{aligned}$$
(6)

converging to a solution of (3), where \(c_j >0\).

The basic DGCG method we propose consists of two steps. The first one, called the insertion step, operates as follows. Given a sparse iterate \(\mu ^n\) of the form (6), we obtain a descent direction for the energy at (3) by minimizing a version of (3) around \(\mu ^n\), in which the fidelity term is linearized. We show that, in order to find such a descent direction, it is sufficient to solve

$$\begin{aligned} \min _{(\rho ,m) \in {{\,\mathrm{Ext}\,}}(C_{\alpha ,\beta })} - \int _0^1 \langle \rho _t, w_t^n\rangle _{H_t}\, \mathrm{d}t\,, \end{aligned}$$
(7)

where \(w^n_t := -K_t(K_t^* \rho ^n_t - f_t ) \in C(\overline{\Omega })\) is the dual variable of the problem at the iterate \(\mu ^n\), and \({{\,\mathrm{Ext}\,}}(C_{\alpha ,\beta })\) denotes the set of extremal points of the unit ball of the regularizer in (3), namely the set \(C_{\alpha ,\beta } := \{(\rho ,m) \in \mathcal {M}: \partial _t \rho + {{\,\mathrm{div}\,}}m = 0,\ \ \beta B(\rho ,m) + \alpha \left\Vert \rho \right\Vert _{\mathcal {M}(X)}\le 1\}\). Formula (7) clarifies the connection between atoms and extremal points of \(C_{\alpha ,\beta }\), showing the fundamental role that the latter play in sparse optimization and GCG methods. In view of the characterization Theorem 1, proven in [20], the minimization problem (7) can be equivalently written in terms of atoms

$$\begin{aligned} \min _{\gamma \in \mathrm{AC}^2} \, \left\{ 0, \, -a_{\gamma } \int _0^1 w_t^n(\gamma (t))\,\mathrm{d}t \right\} \,, \end{aligned}$$
(8)

where \(\mathrm{AC}^2\) denotes the set of absolutely continuous curves with values in \(\overline{\Omega }\) and weak derivative in \(L^2((0,1);\mathbb {R}^d)\). The insertion step then consists in finding a curve \(\gamma ^*\) solving (8) and considering the respective atom \(\mu _{\gamma ^*}\). Afterward, setting \(\gamma _{N_n+1}:=\gamma ^*\), the coefficients optimization step optimizes the conic combination \(\mu ^n + c_{N_{n}+1} \mu _{\gamma _{N_n+1}}\) with respect to (3), among all nonnegative coefficients \(c_j\). Denoting by \(c_1^*,\ldots , c_{N_n+1}^*\) a solution to this problem, the new iterate is defined by \(\mu ^{n+1}:=\sum _j c_j^*\mu _{\gamma _j}\). The two steps of inserting a new atom in the linear combination and optimizing the coefficients are the building blocks of our core algorithm, summarized in Algorithm 1. In Theorem 7, we prove that this algorithm has a sublinear convergence rate, similar to that of the GCG method for the BLASSO problem [18], and that the produced iterates \(\mu ^n\) converge in the weak* sense of measures to a solution of (3). The core algorithm and its analysis are the subject of Sect. 4.

From the computational point of view, we observe that the coefficients optimization step can be solved efficiently, as it is equivalent to a finite-dimensional quadratic program. Concerning the insertion step, however, even if the complexity of searching for a descent direction for (3) is reduced by only minimizing in the set of atoms, (8) remains a challenging nonlinear and non-local problem. For this reason, we shift our attention to computing stationary points for (8), relying on gradient descent strategies. Specifically, we prove that, under additional assumptions on \(H_t\) and \(K_t^*\), problem (8) can be cast in the Hilbert space \(H^1([0,1];\mathbb {R}^d)\), and that the gradient descent algorithm, with an appropriate stepsize, outputs stationary points of (8) (see Theorem 14). With this theoretical result at hand, in Sect. 5.1, we formulate a solution strategy for (8) based on multistart gradient descent methods, whose initializations are chosen according to heuristic principles. More precisely, the initial curves are chosen randomly in the regions where the dual variable \(w_t^n\) attains larger values, and new starting curves are produced by combining pieces of stationary curves for (8) by means of a procedure named crossover.

We complement the core algorithm with acceleration strategies. First, we add multiple atoms in the insertion step (multiple insertion step). Such new atoms can be easily obtained as a byproduct of the multistart gradient descent in the insertion step. Moreover, after optimizing the coefficients in the linear combination, we perform an additional gradient descent step with respect to (3), varying the curves in the iterate \(\mu ^{n+1}\), while keeping the weights fixed. Such procedure, named sliding step, will then be alternated with the coefficients optimization step for a fixed number of iterations, before searching for a new atom in the insertion step. These additional steps are described in Sect. 5.1. We mention that similar strategies were already employed for the BLASSO problem [18, 51]. They are then included in the basic core algorithm to obtain the complete DGCG method in Algorithm 2.

In Sect. 6, we provide numerical examples. As observation operators \(K^*_t\), we use time-dependent undersampled Fourier measurements, popular in imaging and medical imaging [16, 35], as well as in compressed sensing and super-resolution [1, 22, 23]. Such examples show the effectiveness of our DGCG method in reconstructing spatially sparse data in the simultaneous presence of strong noise and severe temporal undersampling. Indeed, satisfactory results are obtained for ground-truths with \(20\%\) and \(60\%\) added Gaussian noise, and heavy temporal undersampling, in the sense that at each time t the observation operator \(K^*_t\) is not able to distinguish sources along lines. With such ill-posed measurements, static reconstruction methods would not be able to accurately recover any ground-truth. In contrast, the time regularization chosen in (3) makes it possible to resolve the dynamics by correlating the information of neighboring data points. As shown by the presented experiments, our DGCG algorithm produces accurate reconstructions of the ground-truth. In the cases of \(20\%\) and \(60\%\) added noise, we note an increase of low-intensity artifacts; nonetheless, the obtained reconstruction is close, in the sense of measures, to the original ground-truth. Moreover, in all the considered examples, a linear convergence rate has been observed; this shows that the algorithm is, in practice, faster than the theoretical guarantees (Theorem 7) suggest, and a linear rate can be expected in most cases.
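For concreteness, a minimal numpy sketch of such a time-dependent undersampled Fourier measurement applied to a spatially sparse measure is given below. The frequency sets and source trajectories are illustrative assumptions and do not reproduce the experiments of Sect. 6.

import numpy as np

# K_t^* applied to rho_t = sum_j c_j delta_{x_j(t)}: a few Fourier coefficients
# at time-dependent frequencies, modeling the temporal undersampling.
def fourier_measurement(points, weights, freqs):
    # points: (N, d) support of rho_t; weights: (N,); freqs: (M, d) sampled frequencies
    phases = points @ freqs.T                                   # (N, M) array of xi . x_j
    return (weights[:, None] * np.exp(-2j * np.pi * phases)).sum(axis=0)

t = 0.5
sources = np.array([[0.2 + 0.5 * t, 0.3], [0.7, 0.8 - 0.4 * t]])   # two moving point sources
masses = np.array([1.0, 0.5])
freqs_t = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])            # frequencies observed at time t
data_t = fourier_measurement(sources, masses, freqs_t)              # element of H_t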

We conclude the paper with Sect. 7, in which we discuss future perspectives and open questions, such as the possibility of improving the theoretical convergence rate for Algorithm 1, and alternative modeling choices.

1.3 Organization of the Paper

The paper is organized as follows. In Sect. 2, we summarize all the relevant notation and preliminary results regarding the Benamou–Brenier energy that are needed in the paper. In particular, we recall the characterization of the extremal points of the unit ball of the Benamou–Brenier energy obtained in [20]. In Sect. 3, we introduce the dynamic inverse problem under consideration and its regularization (3), following the approach of [15]. We further establish the basic theory needed to set up the DGCG method. In Sect. 4, we provide the definition of atoms, and we give a high-level description of the DGCG method we propose in this paper, see Algorithm 1, proving its sublinear convergence. In Sect. 5, we describe the strategy employed to solve the insertion step problem (8), based on a multistart gradient descent method. Moreover, we outline the mentioned acceleration steps. Incorporating these procedures in the core algorithm, we obtain the complete DGCG method, see Algorithm 2. In Sect. 6, we show numerical results supporting the effectiveness of Algorithm 2. Finally, we present the reader with some open questions in Sect. 7.

2 Preliminaries and Notation

In this section, we introduce the mathematical concepts and results we need to formulate our minimization problem and the subsequent algorithms. Throughout the paper, \(\Omega \subset \mathbb {R}^d\) denotes an open and bounded domain with \(d\in \mathbb {N}, d \ge 1\). We define the time-space cylinder \(X:= (0,1) \times \overline{\Omega }\). Following [3], given a metric space Y, we denote by \(\mathcal {M}(Y)\), \(\mathcal {M}(Y ; \mathbb {R}^d)\), \(\mathcal {M}^+(Y)\) the spaces of bounded Borel measures, bounded \(\mathbb {R}^d\)-valued Borel measures, and bounded positive Borel measures, respectively. For a scalar measure \(\rho \), we denote by \(\left\Vert \rho \right\Vert _{\mathcal {M}(Y)}\) its total variation. In addition, we employ the notations \(\mathbb {R}_+\) and \(\mathbb {R}_{++}\) to refer to the nonnegative and positive real numbers, respectively.

2.1 Time-Dependent Measures

We say that \(\{\rho _t\}_{t \in [0,1]}\) is a Borel family of measures in \(\mathcal {M}(\overline{\Omega })\) if \(\rho _t \in \mathcal {M}(\overline{\Omega })\) for every \(t \in [0,1]\) and the map \(t \mapsto \int _{\overline{\Omega }} \varphi (x)\, d\rho _t(x)\) is Borel measurable for every function \(\varphi \in C(\overline{\Omega })\). Given a measure \(\rho \in \mathcal {M}(X)\), we say that \(\rho \) disintegrates with respect to time if there exists a Borel family of measures \(\{\rho _t\}_{t \in [0,1]}\) in \(\mathcal {M}(\overline{\Omega })\) such that

$$\begin{aligned} \int _X \varphi (t,x) \, \mathrm {d}\rho (t,x) = \int _0^1 \int _{\overline{\Omega }} \varphi (t,x) \, \mathrm {d}\rho _t(x) \, \mathrm{d}t \quad \text { for all } \quad \varphi \in L^1_{\rho }(X). \end{aligned}$$

We denote such disintegration with the symbol \(\rho =\mathrm{d}t \otimes \rho _t\). Further, we say that a curve of measures \(t \in [0,1] \mapsto \rho _t \in \mathcal {M}(\overline{\Omega })\) is narrowly continuous if, for all \(\varphi \in C(\overline{\Omega })\), the map \(t \mapsto \int _{\overline{\Omega }} \varphi (x) \, d\rho _t(x)\) is continuous. The family of narrowly continuous curves will be denoted by \(C_\mathrm{w}\). We denote by \(C_\mathrm{w}^+\) the family of narrowly continuous curves with values in \(\mathcal {M}^+(\overline{\Omega })\).

2.2 Optimal Transport Regularizer

Introduce the space

$$\begin{aligned} \mathcal {M}:= \mathcal {M}(X) \times \mathcal {M}( X; \mathbb {R}^d )\,. \end{aligned}$$

We denote elements of \(\mathcal {M}\) by \(\mu =(\rho ,m)\) with \(\rho \in \mathcal {M}(X)\), \(m \in \mathcal {M}(X;\mathbb {R}^d)\), and by 0 the pair (0, 0). Define the set of pairs in \(\mathcal {M}\) satisfying the continuity equation as

$$\begin{aligned} \mathcal {D}:= \{\mu \in \mathcal {M}\ : \ \partial _t \rho + {{\,\mathrm{div}\,}}m = 0 \ \ \text {in} \ \ X\}\,, \end{aligned}$$

where the solutions of the continuity equation are intended in a distributional sense, that is,

$$\begin{aligned} \int _{X} \partial _t \varphi \, \mathrm {d}\rho + \int _{X} \nabla \varphi \cdot \mathrm {d}m = 0 \quad \text {for all} \quad \varphi \in C^{\infty }_c ( X ) \,. \end{aligned}$$
(9)

The above weak formulation includes no-flux boundary conditions for the momentum m on \(\partial \Omega \), and no initial or final data for \(\rho \). Notice that, by standard approximation arguments, it is equivalent to test (9) against maps in \(C^1_c ( X )\) (see [4, Remark 8.1.1]).

We now introduce the Benamou–Brenier energy, as originally done in [8]. To this end, define the convex, one-homogeneous and lower semicontinuous map \(\Psi :\mathbb {R}\times \mathbb {R}^d \rightarrow [0,\infty ]\) as

$$\begin{aligned} \Psi (t,x):= {\left\{ \begin{array}{ll} \frac{\left|x\right|^2}{2t} &{} \text { if } t>0\,, \\ 0 &{} \text { if } t=\left|x\right|=0 \,, \\ +\infty &{} \text { otherwise}\,. \end{array}\right. } \end{aligned}$$
(10)

The Benamou–Brenier energy \(B :\mathcal {M}\rightarrow [0,\infty ]\) is defined by

$$\begin{aligned} B (\mu ) := \int _X \Psi \left( \frac{\mathrm {d}\rho }{\mathrm {d}\lambda }, \frac{\mathrm {d}m}{\mathrm {d}\lambda }\right) \, \mathrm {d} \lambda \,, \end{aligned}$$
(11)

where \(\lambda \in \mathcal {M}^+(X)\) is any measure satisfying \(\rho ,m \ll \lambda \). Note that (11) does not depend on the choice of \(\lambda \), as \(\Psi \) is one-homogeneous. Following [15], we introduce a coercive version of B: for fixed parameters \(\alpha ,\beta >0\) define the functional \(J_{\alpha , \beta } :\mathcal {M}\rightarrow [0,\infty ]\) as

$$\begin{aligned} J_{\alpha , \beta }(\mu ) := {\left\{ \begin{array}{ll} \beta B(\mu ) + \alpha \left\Vert \rho \right\Vert _{\mathcal {M}(X)} &{} \,\, \text { if }(\rho ,m) \in \mathcal {D}\,, \\ + \infty \qquad &{} \,\, \text { otherwise}\,. \end{array}\right. } \end{aligned}$$
(12)

As recently shown [15], \(J_{\alpha ,\beta }\) can be employed as a regularizer for dynamic inverse problems in spaces of measures.
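As a quick reference, the integrand (10) transcribes directly into code; the following minimal Python sketch (scalar time argument, vector space argument) is purely illustrative.

import numpy as np

def psi(t, x):
    # The integrand Psi of (10): |x|^2 / (2 t) for t > 0, 0 at (0, 0), +infinity otherwise.
    x = np.atleast_1d(np.asarray(x, dtype=float))
    if t > 0:
        return float(x @ x) / (2.0 * t)
    if t == 0 and np.all(x == 0):
        return 0.0
    return float("inf")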

2.3 Extremal Points of \(J_{\alpha ,\beta }\)

Define the convex unit ball

$$\begin{aligned} C_{\alpha ,\beta } := \left\{ \mu \in \mathcal {M}\, :\, J_{\alpha ,\beta }(\mu ) \le 1 \right\} \,, \end{aligned}$$

and the set of measures concentrated on \(\mathrm{AC}^2\) curves in \(\overline{\Omega }\)

$$\begin{aligned} \mathcal {C}_{\alpha ,\beta }:= \left\{ \mu _\gamma \in \mathcal {M}\, :\, \gamma \in \mathrm{AC}^2([0,1];\mathbb {R}^d),\,\, \gamma ([0,1]) \subset \overline{\Omega }\right\} \,, \end{aligned}$$
(13)

where we denote by \(\mu _\gamma \) the pair \((\rho _\gamma ,m_\gamma )\) with

$$\begin{aligned} \rho _\gamma :=a_\gamma \, \mathrm{d}t \otimes \delta _{\gamma (t)} \,, \,\,\,\, m_\gamma :=\dot{\gamma }(t) \rho _\gamma \,,\,\,\,\, a_\gamma :=\left( \frac{\beta }{2} \int _0^1 \left|\dot{\gamma }(t)\right|^2 \, \mathrm{d}t + \alpha \right) ^{-1}\,. \end{aligned}$$
(14)

Here, \(\mathrm{AC}^2([0,1];\mathbb {R}^d)\) denotes the space of curves having metric derivative in \(L^2((0,1);\mathbb {R}^d)\). We can identify \(\mathrm{AC}^2([0,1];\mathbb {R}^d)\) with the Sobolev space \(H^1((0,1);\mathbb {R}^d)\) (see [4, Remark 1.1.3]). For brevity, we will denote by \(\mathrm{AC}^2:=\mathrm{AC}^2([0,1];\overline{\Omega })\) the set of curves \(\gamma \) belonging to \(\mathrm{AC}^2([0,1];\mathbb {R}^d)\) such that \(\gamma ([0,1]) \subset \overline{\Omega }\). For the extremal points of \(C_{\alpha ,\beta }\), we have the following characterization result, originally proven in [20, Theorem 6].

Theorem 1

Let \(\alpha ,\beta >0\) be fixed. Then, it holds \(\,{{\,\mathrm{Ext}\,}}(C_{\alpha ,\beta })=\{0\} \cup \mathcal {C}_{\alpha ,\beta }\).

We now show that \(J_{\alpha ,\beta }\) is linear on nonnegative combinations of points in \(\mathcal {C}_{\alpha ,\beta }\). Such a property will be crucial for several computations in this paper; the proof is postponed to the Appendix.

Lemma 2

Let \(N \in \mathbb {N}, N \ge 1\), \(c_j \in \mathbb {R}\) with \(c_j >0\), and \(\gamma _j \in \mathrm{AC}^2\) for \(j=1,\ldots ,N\). Let \(\mu _{\gamma _j}=(\rho _{\gamma _j},m_{\gamma _j}) \in \mathcal {C}_{\alpha ,\beta }\) be defined according to (14). Then, \(J_{\alpha ,\beta }(\mu _{\gamma _j})=1\) and

$$\begin{aligned} J_{\alpha ,\beta } \left( \sum _{j=1}^N c_j \mu _{\gamma _j} \right) = \sum _{j=1}^N c_j\,. \end{aligned}$$
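In view of the identification of \(\mathrm{AC}^2\) with \(H^1\) mentioned above, an atom (14) is conveniently represented numerically by a curve sampled on a uniform time grid. The following numpy sketch (an illustrative discretization, not the data structures of the released code) computes the weight \(a_\gamma \) and the resulting per-time weights of \(\rho _\gamma \) and velocities of \(m_\gamma \); by construction, \(J_{\alpha ,\beta }(\mu _\gamma )=a_\gamma (\tfrac{\beta }{2}\int _0^1|\dot{\gamma }|^2\,\mathrm{d}t+\alpha )=1\), in agreement with Lemma 2.

import numpy as np

def atom_from_curve(curve, alpha, beta):
    # curve: (T, d) array with rows gamma(t_k) on the uniform grid t_k = k / (T - 1)
    T = curve.shape[0]
    dt = 1.0 / (T - 1)
    velocity = np.gradient(curve, dt, axis=0)          # approximates the weak derivative of gamma
    kinetic = 0.5 * np.sum(velocity ** 2) * dt         # ~ (1/2) int_0^1 |gamma'(t)|^2 dt
    a_gamma = 1.0 / (beta * kinetic + alpha)           # the normalizing weight in (14)
    rho_weights = np.full(T, a_gamma)                  # rho_gamma at t_k: a_gamma * delta_{gamma(t_k)}
    momentum = a_gamma * velocity                      # velocity density of m_gamma at gamma(t_k)
    return a_gamma, rho_weights, momentum

# example: a straight curve from (0.1, 0.1) to (0.9, 0.5) sampled at 50 time nodes
a_gamma, rho_w, mom = atom_from_curve(np.linspace([0.1, 0.1], [0.9, 0.5], 50), alpha=0.2, beta=0.2)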

3 The Dynamic Inverse Problem and Conditional Gradient Method

In this section, we introduce the dynamic inverse problem we aim at solving, following the approach of [15]. Moreover, we set up the functional analytic framework necessary to state the numerical algorithm presented in Sect. 4. Recall that \(\Omega \subset \mathbb {R}^d\) is an open and bounded domain, \(d \in \mathbb {N}\), \(d \ge 1\). Let \(\{H_t\}_{t \in [0,1]}\) be a family of real Hilbert spaces and \(K_t^* :\mathcal {M}(\overline{\Omega }) \rightarrow H_t\) a family of linear continuous forward operators parametrized by \(t\in [0,1]\). Given some data \(f_t \in H_t\) for a.e. \(t \in (0,1)\), consider the dynamic inverse problem of finding a curve \(t \mapsto \rho _t \in \mathcal {M}(\overline{\Omega })\) such that

$$\begin{aligned} K_t^* \rho _t = f_t \quad \text {for a.e.} \,\, t \in (0,1) \,. \end{aligned}$$
(15)

It has been recently proposed [15] to regularize the above problem with the optimal transport energy \(J_{\alpha ,\beta }\) defined in (12), where \(\alpha ,\beta >0\) are fixed parameters. This leads to consider the Tikhonov functional \(T_{\alpha ,\beta } :\mathcal {M}\rightarrow [0,+\infty ]\) with associated minimization problem

$$\begin{aligned} \min _{\mu \in \mathcal {M}} T_{\alpha ,\beta }(\mu )\,, \quad T_{\alpha ,\beta }(\mu ) := \mathcal {F}(\mu ) + J_{\alpha ,\beta }(\mu )\,, \end{aligned}$$
(\(\mathcal {P}\))

where the fidelity term \(\mathcal {F}:\mathcal {M}\rightarrow [0,+\infty ]\) is defined by

$$\begin{aligned} \mathcal {F}(\mu ):= {\left\{ \begin{array}{ll} \displaystyle \frac{1}{2}\int _0^1 \left\Vert K^*_t \rho _t - f_t\right\Vert ^2_{H_t}\mathrm{d}t &{} \, \text { if } \rho =\mathrm{d}t \otimes \rho _t \,, \,\, (t\mapsto \rho _t) \in C_\mathrm{w}\,, \\ + \infty &{} \, \text{ otherwise. } \end{array}\right. } \end{aligned}$$
(16)

In the following, we will denote by f, \(K^*\rho \), Kf the maps \(t\mapsto f_t\), \(t \mapsto K_t^* \rho _t\), \(t \mapsto K_tf_t\), respectively, where \(f_t \in H_t\) for a.e. \(t \in (0,1)\) and \(\rho _t\) is the disintegration of \(\rho \) with respect to time. The fidelity term \(\mathcal {F}\) serves to track the discrepancy in (15) continuously in time. Following [15], this is achieved by introducing the Hilbert space of square integrable maps \(f :[0,1] \rightarrow H:=\cup _t H_t\), denoted by \(L^2_H\). The data f are then assumed to belong to \(L^2_H\). The assumptions under which this procedure can be made rigorous are briefly summarized in Sect. 3.1, see (H1)–(H3), (K1)–(K3). Under these assumptions, \(\mathcal {F}\) is well-defined, see Remark 1. Such a framework allows modeling a variety of time-dependent acquisition strategies in dynamic imaging, as discussed in Sect. 1. We are now ready to recall an existence result for (\(\mathcal {P}\)) (see [15, Theorem 4.4]).

Theorem 3

Assume (H1)–(H3), (K1)–(K3) as in Sect. 3.1. Let \(f \in L^2_H\) and \(\alpha , \beta >0\). Then, \(T_{\alpha ,\beta }\) is weak* lower semicontinuous on \(\mathcal {M}\), and there exists \(\mu ^*=(\rho ^*,m^*) \in \mathcal {D}\) that solves the minimization problem (\(\mathcal {P}\)). Moreover, \(\rho ^*\) disintegrates as \(\rho ^*=\mathrm{d}t \otimes \rho _t^*\) with \((t \mapsto \rho ^*_t) \in C_\mathrm{w}^+\). If, in addition, \(K_t^*\) is injective for a.e. \(t \in (0,1)\), then \(\mu ^*\) is unique.

The proposed numerical approach for (\(\mathcal {P}\)) is based on the conditional gradient method, which consists in seeking minimizers of local linear approximations of the target functional. Following standard practice [18, 19, 51], we first replace (\(\mathcal {P}\)) with a surrogate minimization problem, by defining the functional \(\tilde{T}_{\alpha ,\beta }\) as in (\(\tilde{\mathcal {P}}\)) below. The key step in a conditional gradient method is then to find the steepest descent direction for a linearized version of \(\tilde{T}_{\alpha ,\beta }\). In Sect. 3.3, we show that, in order to find such a direction, it is sufficient to solve the minimization problem

$$\begin{aligned} \min _{\mu \in {{\,\mathrm{Ext}\,}}(C_{\alpha ,\beta })} - \langle \rho , w \rangle \,, \end{aligned}$$
(17)

where \(C_{\alpha ,\beta }:=\{J_{\alpha ,\beta } \le 1\}\), \(w_t:=-K_t(K_t^* \tilde{\rho _t} - f_t) \in C(\overline{\Omega })\) is the dual variable associated with the current iterate \((\tilde{\rho },\tilde{m})\), and the linear term \(\mu \mapsto \langle \rho , w \rangle \) is defined in (23). Finally, in Sect. 3.4, we define the primal-dual gap G associated with (\(\mathcal {P}\)), and prove optimality conditions.

3.1 Functional Analytic Setting for Time-Continuous Fidelity Term

In order to define the continuous sampling fidelity term \(\mathcal {F}\) at (16), the authors of [15] introduce suitable assumptions on the measurement spaces \(H_t\) and on the forward operators \(K_t^* :\mathcal {M}(\overline{\Omega }) \rightarrow H_t\).

Assumption 1

For a.e. \(t \in (0,1)\), let \(H_t\) be a real Hilbert space with norm \(\left\Vert \cdot \right\Vert _{H_t}\) and scalar product \( \langle \cdot , \cdot \rangle _{H_t}\). Let D be a real Banach space with norm denoted by \(\left\Vert \cdot \right\Vert _{D}\). Assume that for a.e. \(t \in (0,1)\), there exists a linear continuous operator \(i_t : D \rightarrow H_t\) with the following properties:

  1. (H1)

    \(\left\Vert i_t\right\Vert \le C\) for some constant \(C>0\) not depending on t,

  2. (H2)

    \(i_t(D)\) is dense in \(H_t\),

  3. (H3)

    the map \(t \in [0,1] \mapsto \langle i_t\varphi , i_t \psi \rangle _{H_t} \in \mathbb {R}\) is Lebesgue measurable for every fixed \(\varphi ,\psi \in D\).

Setting \(H := \bigcup _{t\in [0,1]} H_t\), it is possible to define the space of square integrable maps \(f :[0,1] \rightarrow H\) such that \(f_t \in H_t\) for a.e. \(t \in (0,1)\), that is,

$$\begin{aligned} L^2_H= & {} L^2([0,1] ;H):= \left\{ f :[0,1] \rightarrow H \, :\, f \ \text {strongly measurable}, \right. \nonumber \\&\quad \left. \int _0^1 \Vert f_t\Vert _{H_t}^2 \,\mathrm{d}t < \infty \right\} \,. \end{aligned}$$
(18)

The strong measurability mentioned in (18) is an extension to time-dependent spaces of the classical notion of strong measurability for Bochner integrals. The common space D is employed to construct step functions in a suitable way. An important property of strong measurability is that \(t \mapsto \langle f_t, g_t \rangle _{H_t}\) is Lebesgue measurable whenever f, g are strongly measurable [15, Remark 3.4]. Moreover, \(L^2_H\) is a Hilbert space with inner product and norm given by

$$\begin{aligned} \langle f, g \rangle _{L^2_H}:= \int _0^1 \langle f_t, g_t \rangle _{H_t} \, \mathrm{d}t \,, \quad \left\Vert f\right\Vert _{L^2_H}:=\left( \int _0^1 \left\Vert f_t\right\Vert _{H_t}^2 \, \mathrm{d}t \right) ^{1/2}\,, \end{aligned}$$
(19)

respectively [15, Theorem 3.13]. We refer the interested reader to [15, Section 3] for more details on the construction of such spaces and their properties. We will now state the assumptions required for the measurement operators \(K_t^*\).

Assumption 2

For a.e. \(t \in (0,1)\), the linear continuous operators \(K_t^* : \mathcal {M}(\overline{\Omega }) \rightarrow H_t\) satisfy:

  1. (K1)

    \(K_t^*\) is weak*-to-weak continuous, with pre-adjoint denoted by \(K_t: H_t \rightarrow C(\overline{\Omega })\),

  2. (K2)

    \(\left\Vert K^*_t\right\Vert \le C\) for some constant \(C>0\) not depending on t,

  3. (K3)

    the map \(t \in [0,1] \mapsto K_t^* \rho \in H_t\) is strongly measurable for every fixed \(\rho \in \mathcal {M}(\overline{\Omega })\).

Remark 1

After assuming (H1)–(H3), (K1)–(K3), the fidelity term \(\mathcal {F}\) introduced at (16) is well-defined. Indeed, the conditions \(\rho =\mathrm{d}t \otimes \rho _t\) and \((t \mapsto \rho _t) \in C_\mathrm{w}\) imply that \(t \mapsto K_t^*\rho _t\) belongs to \(L^2_H\) by Lemma 12. We further remark that \(\mathcal {F}(\mu )\) is finite whenever \(J_{\alpha ,\beta }(\mu )<+\infty \), as in this case we have \(\rho =\mathrm{d}t \otimes \rho _t\) with \((t \mapsto \rho _t) \in C_\mathrm{w}^+\), by Lemmas 9, 10.
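In computations, the time integral in (16) is replaced by a sum over time nodes; the following minimal sketch illustrates such a time-discretized fidelity term (a variant of this kind is discussed in Sect. 4.5). The helper forward stands for any implementation of \(K_{t_k}^*\) acting on the support points and weights of \(\rho _{t_k}\); all names here are illustrative assumptions.

import numpy as np

def fidelity(supports, weights, data, forward):
    # Time-discretized version of (16):
    #   supports[k]: (N_k, d) support points of rho_{t_k}
    #   weights[k]:  (N_k,)   corresponding masses
    #   data[k]:     observation f_{t_k} in H_{t_k}
    #   forward(points, w, k) plays the role of K_{t_k}^* rho_{t_k}
    T = len(data)
    dt = 1.0 / (T - 1)
    value = 0.0
    for k in range(T):
        residual = forward(supports[k], weights[k], k) - data[k]
        value += 0.5 * np.vdot(residual, residual).real
    return value * dt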

3.2 Surrogate Minimization Problem

Let \(f \in L^2_H\) and define the map \(\varphi :\mathbb {R}_+\rightarrow \mathbb {R}_{+}\)

$$\begin{aligned} \varphi (t) := {\left\{ \begin{array}{ll} t &{} \text { if } t \le M_0, \\ \frac{t^2 + M_0^2}{2 M_0} &{} \text { if } t > M_0, \end{array}\right. } \end{aligned}$$
(20)

where we set \(M_0 := T_{\alpha ,\beta }(0)\). Notice that by (A1), we have \(J_{\alpha ,\beta }(0)=0\), so that

$$\begin{aligned} M_0= \frac{1}{2} \int _0^1 \left\Vert f_t\right\Vert _{H_t}^2 \, \mathrm{d}t \,, \end{aligned}$$
(21)

highlighting the dependence of \(\varphi \) on f. Recalling the definition of \(\mathcal {F}\) at (16), define the surrogate minimization problem

$$\begin{aligned} \min _{\mu \in \mathcal {M}} \tilde{T}_{\alpha ,\beta }(\mu )\,, \quad \tilde{T}_{\alpha ,\beta }(\mu ) := \mathcal {F}(\mu ) + \varphi (J_{\alpha ,\beta }(\mu ))\,. \end{aligned}$$
(\(\tilde{\mathcal {P}}\))

Notice that (\(\mathcal {P}\)) and (\(\tilde{\mathcal {P}}\)) share the same set of minimizers, and they are thus equivalent. This is readily seen after noting that solutions to (\(\mathcal {P}\)) and (\(\tilde{\mathcal {P}}\)) belong to the set \(\{\mu \in \mathcal {M}:J_{\alpha ,\beta }(\mu ) \le M_0\}\), thanks to the estimate \(t \le \varphi (t)\), and that \(T_{\alpha ,\beta }\) and \(\tilde{T}_{\alpha ,\beta }\) coincide on the said set.

We remark that the surrogate minimization problem (\(\tilde{\mathcal {P}}\)) is a technical modification of (\(\mathcal {P}\)) introduced just to ensure that the partially linearized problem defined in the following section is coercive.
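For later reference, the surrogate growth function (20) is a one-line transcription (a trivial sketch; M0 denotes the constant from (21)):

def phi(t, M0):
    # The function (20): identity below M0, quadratic growth above.
    return t if t <= M0 else (t * t + M0 * M0) / (2.0 * M0)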

3.3 Linearized Problem

Fix some data \(f \in L^2_H\) and a curve \((t \mapsto \tilde{\rho }_t) \in C_\mathrm{w}\). We define the associated dual variable \(t \mapsto w_t\) by

$$\begin{aligned} w_t := -K_t(K_t^* \tilde{\rho }_t - f_t ) \in C(\overline{\Omega })\,, \end{aligned}$$
(22)

and the map \(\mu \in \mathcal {M}\mapsto \langle \rho , w\rangle \in \mathbb {R}\cup \{\pm \infty \}\) as

$$\begin{aligned} \langle \rho , w \rangle := {\left\{ \begin{array}{ll} \displaystyle \int _{0}^1\langle \rho _t, w_t \rangle _{\mathcal {M}(\overline{\Omega }),C(\overline{\Omega })} \, \mathrm{d}t &{} \,\, \text {if } \rho =\mathrm{d}t \otimes \rho _t\, , \,\, (t \mapsto \rho _t) \in C_\mathrm{w}\,, \\ -\infty &{}\,\, \text {otherwise.} \end{array}\right. } \end{aligned}$$
(23)

Remark 2

The above map is well-defined; indeed, assuming that \((t \mapsto \rho _t) \in C_\mathrm{w}\), we have \((t \mapsto K_t^*\rho _t ) \in L^2_H\) by Lemma 12. Similarly, also \((t \mapsto K_t^*\tilde{\rho }_t) \in L^2_H\). Thus, recalling (K1), we infer

$$\begin{aligned} \langle \rho , w \rangle = - \langle K^*\rho , K^* \tilde{\rho }- f \rangle _{L^2_H}\,, \end{aligned}$$
(24)

which is well-defined and finite, being a scalar product in the Hilbert space \(L^2_H\) (see (19)). Moreover if \(J_{\alpha ,\beta }(\mu )<+\infty \), then \( \langle \rho , w \rangle \) is finite, since \(\rho =\mathrm{d}t \otimes \rho _t\) for \((t \mapsto \rho _t) \in C_\mathrm{w}^+\), by Lemmas 9, 10.
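In the time-discrete setting, the pairing (24) becomes a Riemann sum; a minimal sketch, assuming that the measurements \(K^*_{t_k}\rho _{t_k}\), \(K^*_{t_k}\tilde{\rho }_{t_k}\) and the data \(f_{t_k}\) are stored as arrays per time node, and using the real part of the Hermitian product as the real inner product on \(H_{t_k}\):

import numpy as np

def pairing(K_rho, K_rho_tilde, f):
    # Discrete version of (24): <rho, w> = - <K* rho, K* rho_tilde - f>_{L^2_H}
    T = len(f)
    dt = 1.0 / (T - 1)
    return -sum(np.vdot(a, b - c).real for a, b, c in zip(K_rho, K_rho_tilde, f)) * dt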

Let \(\varphi \) and \(M_0\) be as in (20)-(21). We consider the following linearized version of (\(\tilde{\mathcal {P}}\))

$$\begin{aligned} \min _{\mu \in \mathcal {M}} - \langle \rho , w \rangle + \varphi (J_{\alpha ,\beta }(\mu ))\,, \end{aligned}$$
(25)

which is well-posed by Theorem 13.

The objective of this section is to prove the existence of a solution to (25) belonging, up to a multiplicative constant, to the extremal points of the sublevel set \(C_{\alpha ,\beta }:=\{J_{\alpha ,\beta } \le 1\}\).

To this end, consider the problem

$$\begin{aligned} \min _{\mu \in C_{\alpha ,\beta }} - \langle \rho , w \rangle \,. \end{aligned}$$
(26)

In the following proposition, we prove that (26) admits a minimizer \(\mu ^* \in {{\,\mathrm{Ext}\,}}(C_{\alpha ,\beta })\). Moreover, we show that a suitably rescaled version of \(\mu ^*\) solves (25).

Proposition 4

Assume (H1)–(H3), (K1)–(K3) as in Sect. 3.1. Let \(f \in L^2_H\), \(\alpha , \beta >0\). Then, there exists a solution \(\mu ^* \in {{\,\mathrm{Ext}\,}}(C_{\alpha ,\beta })\) to (26). Moreover, \(M \mu ^*\) is a minimizer for (25), where

$$\begin{aligned} M := {\left\{ \begin{array}{ll} 0 &{} \,\, \text { if } \,\,\, \langle \rho ^*, w \rangle \le 1 \,,\\ M_0 \langle \rho ^*, w \rangle &{} \,\, \text { if } \,\,\, \langle \rho ^*, w \rangle > 1 \,. \end{array}\right. } \end{aligned}$$
(27)

The above statement is reminiscent of the classical Bauer Maximum Principle [2, Theorem 7.69]. In our case, however, there is no clear topology that makes the set \(C_{\alpha ,\beta }\) compact and the linearized map defined in (23) continuous (or upper semicontinuous). Therefore, an ad hoc proof is required.

Proof

Let \( \hat{\mu }\in C_{\alpha ,\beta }\) be a solution to (26), which exists thanks to Theorem 13 with the choice \(\varphi (t):=\upchi _{(-\infty ,1]}(t)\). Consider the set \(S := \left\{ \mu \in C_{\alpha ,\beta } : \langle \rho ,w\rangle =\langle \hat{\rho },w\rangle \right\} \) of all solutions to (26). Note that S is bounded with respect to the total variation on \(\mathcal {M}\), due to (A2) and definition of \(C_{\alpha ,\beta }\). In particular, the weak* topology of \(\mathcal {M}\) is metrizable in S. We claim that S is compact in the same topology. Indeed, given a sequence \(\{\mu ^n\}\) in S we have by definition that

$$\begin{aligned} \sup _n J_{\alpha ,\beta }(\mu ^n) \le 1\,, \qquad \langle \rho ^n,w\rangle = \langle \hat{\rho },w\rangle \,\, \text { for every } \,\, n \in \mathbb {N}\,. \end{aligned}$$
(28)

Therefore, (28) and Lemma 11 imply that, up to subsequences, \(\mu ^n\) converges to some \(\mu \in \mathcal {M}\) in the sense of (A3). By (28) and by the weak* sequential lower semicontinuity of \(J_{\alpha ,\beta }\) (Lemma 11), we infer \(\mu \in C_{\alpha ,\beta }\). Moreover by (28), (24) and Lemma 12, we also conclude that \(\mu \in S\), hence proving compactness. Also notice that S is convex due to the convexity of \(J_{\alpha ,\beta }\) (Lemma 11) and linearity of the constraint. Since \(S \ne \emptyset \), by Krein–Milman’s theorem, we have that \({{\,\mathrm{Ext}\,}}(S) \ne \emptyset \). Let \(\mu ^* \in {{\,\mathrm{Ext}\,}}(S)\). If we show that \(\mu ^* \in {{\,\mathrm{Ext}\,}}(C_{\alpha ,\beta })\), the thesis is achieved by definition of S. Hence, assume that \(\mu ^*\) can be decomposed as

$$\begin{aligned} \mu ^* = \lambda \mu ^1 + (1-\lambda ) \mu ^2 \end{aligned}$$
(29)

with \(\mu ^j =(\rho ^j,m^j)\in C_{\alpha ,\beta }\) and \(\lambda \in (0,1)\). Assume that \(\mu ^1\) belongs to \(C_{\alpha ,\beta } \smallsetminus S\). By (29) and the minimality of the points in S for (26), we infer \(-\langle \hat{\rho },w\rangle <-\langle \rho ^*, w\rangle \), which is a contradiction since \(\mu ^* \in S\). Therefore, \(\mu ^1 \in S\). Similarly also \(\mu ^2 \in S\). Since \(\mu ^* \in {{\,\mathrm{Ext}\,}}(S)\), from (29), we infer \(\mu ^* = \mu ^1=\mu ^2\), showing that \(\mu ^* \in {{\,\mathrm{Ext}\,}}(C_{\alpha ,\beta })\).

Assume now that \(\mu ^* \in {{\,\mathrm{Ext}\,}}(C_{\alpha ,\beta })\) minimizes in (26). If \(\mu ^*=0\), it is straightforward to check that 0 minimizes in (25). Hence, assume \(\mu ^* \ne 0\), so that \(J_{\alpha ,\beta }(\mu ^*)>0\) by (A2). Since the functional at (26) is linear and \(\mu ^*\) is a minimizer, we can scale by \(J_{\alpha ,\beta }(\mu ^*)\) and exploit the one-homogeneity of \(J_{\alpha ,\beta }\) to obtain \(J_{\alpha ,\beta }(\mu ^*)=1\). For every \(\mu =(\rho ,m) \in \mathcal {M}\) such that \(J_{\alpha ,\beta }(\mu )<+\infty \), one has

$$\begin{aligned} - \langle \rho ,w \rangle + \varphi (J_{\alpha ,\beta }(\mu )) \ge - J_{\alpha ,\beta }(\mu ) \langle \rho ^*, w\rangle + \varphi (J_{\alpha ,\beta }(\mu ))\,, \end{aligned}$$
(30)

since \(\mu ^*\) is a minimizer, \(J_{\alpha ,\beta }\) is nonnegative and one-homogeneous, and since \(J_{\alpha ,\beta }(\mu )=0\) if and only if \(\mu =0\) by Lemma 10. Again one-homogeneity implies

$$\begin{aligned} \inf _{\mu \in \mathcal {M}} - J_{\alpha ,\beta }(\mu ) \langle \rho ^*, w \rangle + \varphi (J_{\alpha ,\beta }(\mu )) = \inf _{\lambda \ge 0} - \lambda \langle \rho ^*, w\rangle + \varphi (\lambda )\, . \end{aligned}$$
(31)

It is immediate to check that M defined in (27) is a minimizer for the right-hand side problem in (31).

Hence from (30)–(31), one-homogeneity of \(J_{\alpha ,\beta }\) and the fact that \(J_{\alpha ,\beta }(\mu ^*) = 1\), we conclude that \(M \mu ^*\) is a minimizer for (25). \(\square \)

3.4 The Primal-Dual Gap

In this section, we introduce the primal-dual gap associated with (\(\mathcal {P}\)).

Definition 1

The primal-dual gap is defined as the map \(G :\mathcal {M}\rightarrow [0,+\infty ]\) such that

$$\begin{aligned} G(\mu ):= {\left\{ \begin{array}{ll} J_{\alpha ,\beta }(\mu ) - \varphi (J_{\alpha ,\beta }(\hat{\mu })) - \langle \rho - \hat{\rho }, w \rangle &{} \, \text { if } \, J_{\alpha ,\beta }(\mu )<+\infty \,, \\ + \infty &{} \, \text { otherwise,} \end{array}\right. } \end{aligned}$$
(32)

for \(\mu \in \mathcal {M}\). Here \(w_t := -K_t(K_t^* \rho _t - f_t )\), the product \(\langle \cdot ,\cdot \rangle \) is defined in (23), the map \(\varphi \) at (20), and \( \hat{\mu }\in \mathcal {M}\) is a solution to (25).

Notice that G is well-defined. Indeed, assume that \(\mu \in \mathcal {M}\) is such that \(J_{\alpha ,\beta }(\mu )<+\infty \) and let \( \hat{\mu }\in \mathcal {M}\) be a solution to (25). In particular, \(J_{\alpha ,\beta }(\hat{\mu })<+\infty \) by Theorem 13. Therefore, the scalar product in (32) is finite, see Remark 2. The purpose of G becomes clear in its relationship with the functional distance associated with \(T_{\alpha ,\beta }\), which is defined by

$$\begin{aligned} r(\mu ):=T_{\alpha ,\beta }(\mu ) - \min T_{\alpha ,\beta } \,, \end{aligned}$$
(33)

for all \(\mu \in \mathcal {M}\). Such a relationship is described in the following lemma. We remark that a similar result is standard in the context of Frank–Wolfe-type algorithms and generalized conditional gradient methods (see e.g., [18, Lemma 5.5] and [43, Section 2]). Due to the specificity of our dynamic problem and for the sake of completeness, we present it in our setting as well.

Lemma 5

Let \(\mu \in \mathcal {M}\) be such that \(J_{\alpha ,\beta }(\mu )<+\infty \). Then,

$$\begin{aligned} r(\mu ) \le G(\mu )\,. \end{aligned}$$
(34)

Moreover, \(\mu ^*\) solves (\(\mathcal {P}\)) if and only if \(G(\mu ^*)=0\).

Proof

Let w and \(\hat{\rho }\) be as in (32). By using the Hilbert structure of \(L^2_H\), for any \(\tilde{\mu }\) such that \(J_{\alpha ,\beta }(\tilde{\mu })<+\infty \), we have, by the polarization identity,

$$\begin{aligned} -\langle \rho - \tilde{\rho }, w\rangle = \frac{\Vert K^* \rho - f\Vert ^2_{L^2_H}}{2} - \frac{\Vert K^* \tilde{\rho }- f\Vert ^2_{L^2_H}}{2} + \frac{\Vert K^*(\tilde{\rho }- \rho )\Vert ^2_{L^2_H}}{2}\,. \end{aligned}$$
(35)

Let \(\mu ^*\) be a minimizer for \(T_{\alpha ,\beta }\), which exists by Theorem 3. Since \(J_{\alpha ,\beta }(\mu ^*) \le T_{\alpha ,\beta }(\mu ^*) \le T_{\alpha ,\beta }(0)=M_0\), by definition of \(\varphi \), we have \(\varphi (J_{\alpha ,\beta }(\mu ^*)) = J_{\alpha ,\beta }(\mu ^*)\). Using (35) and the definition of G in (32), where \(\hat{\mu }\) is chosen to be a solution to (25), we obtain

$$\begin{aligned} G(\mu )&\ge -\langle \rho - \rho ^*, w\rangle + J_{\alpha ,\beta }(\mu ) - J_{\alpha ,\beta }(\mu ^*)\\&\ge \frac{\Vert K^* \rho - f\Vert ^2_{L^2_H}}{2} - \frac{\Vert K^* \rho ^* - f\Vert ^2_{L^2_H}}{2}+ J_{\alpha ,\beta }(\mu ) - J_{\alpha ,\beta }(\mu ^*)\\&= T_{\alpha ,\beta }(\mu ) - T_{\alpha ,\beta }(\mu ^*) = r(\mu )\,, \end{aligned}$$

proving (34). If \(G(\mu ^*)=0\), then \(\mu ^*\) minimizes in (\(\mathcal {P}\)) by (34). Conversely, assume that \(\mu ^*\) is a solution of (\(\mathcal {P}\)) and denote by \(w^*_t:=-K_t(K_t^*\rho _t^* - f_t)\) the associated dual variable. Let \(\mu \) be arbitrary and such that \(J_{\alpha ,\beta }(\mu )<+\infty \). Let \(s \in [0,1]\) and set \(\mu ^s:=\mu ^*+s(\mu -\mu ^*)\). By convexity of \(J_{\alpha ,\beta }\) (see Lemma 11), we have that \(J_{\alpha ,\beta }(\mu ^s)<+\infty \). As \(\mu ^*\) is optimal in (\(\mathcal {P}\)), we infer

$$\begin{aligned} \begin{aligned} 0&\le T_{\alpha ,\beta }(\mu ^s) - T_{\alpha ,\beta }(\mu ^*) \\&\le \frac{ \left\Vert K^*\rho ^s - f\right\Vert ^2_{L^2_H}}{2} - \frac{ \left\Vert K^*\rho ^* - f\right\Vert ^2_{L^2_H}}{2} + s(J_{\alpha ,\beta }(\mu )-J_{\alpha ,\beta }(\mu ^*)) \\&= - s \langle \rho -\rho ^*, w^* \rangle + s^2 \, \frac{\left\Vert K^*(\rho ^* - \rho )\right\Vert ^2_{L^2_H}}{2} + s(J_{\alpha ,\beta }(\mu )-J_{\alpha ,\beta }(\mu ^*))\,, \end{aligned} \end{aligned}$$

where we used convexity of \(J_{\alpha ,\beta }\), and the identity at (35) with respect to \(w^*, \rho ^*\) and \(\rho ^s\). Dividing the above inequality by s and letting \(s \rightarrow 0\) yields

$$\begin{aligned} 0 \le - \langle \rho -\rho ^* , w^* \rangle +J_{\alpha ,\beta }(\mu ) - J_{\alpha ,\beta }(\mu ^*) \,, \end{aligned}$$
(36)

which holds for all \(\mu \) with \( J_{\alpha ,\beta }(\mu )<+\infty \). Now notice that \(J_{\alpha ,\beta }(\mu ^*) \le T_{\alpha ,\beta }(\mu ^*) \le M_0\), since \(\mu ^*\) solves (\(\mathcal {P}\)). Therefore, \(\varphi (J_{\alpha ,\beta }(\mu ^*))=J_{\alpha ,\beta }(\mu ^*)\). Moreover, \(t \le \varphi (t)\) for all \(t \ge 0\). As a consequence of (36), we then infer

$$\begin{aligned} \begin{aligned} - \langle \rho ^*, w^* \rangle +\varphi (J_{\alpha ,\beta }(\mu ^*))&= - \langle \rho ^*, w^* \rangle +J_{\alpha ,\beta }(\mu ^*) \\&\le - \langle \rho , w^* \rangle +J_{\alpha ,\beta }(\mu ) \le - \langle \rho , w^* \rangle +\varphi (J_{\alpha ,\beta }(\mu ))\,, \end{aligned} \end{aligned}$$

proving that \(\mu ^*\) minimizes in (25) with respect to \(w^*\). Therefore, by definition, \(G(\mu ^*)=0\). \(\square \)

4 The Algorithm: Theoretical Analysis

In this section, we give a theoretical description of the dynamic generalized conditional gradient algorithm anticipated in the introduction, which we call the core algorithm. The proposed algorithm aims at finding minimizers of \(T_{\alpha ,\beta }\) as defined in (\(\mathcal {P}\)), for some fixed data \(f \in L^2_H\) and parameters \(\alpha ,\beta >0\). It consists of an insertion step, where one seeks a minimizer of (26) among the extremal points of the set \(C_{\alpha ,\beta }:=\{J_{\alpha ,\beta }\le 1\}\), and of a coefficients optimization step, which yields a finite-dimensional quadratic program. As a result, each iterate will be a finite linear combination, with nonnegative coefficients, of points in \({{\,\mathrm{Ext}\,}}(C_{\alpha ,\beta })\). We remind the reader that \({{\,\mathrm{Ext}\,}}(C_{\alpha ,\beta })= \{0\} \cup \mathcal {C}_{\alpha ,\beta }\) in view of Theorem 1, where \(\mathcal {C}_{\alpha ,\beta }\) is defined at (13). From the definition of \(\mathcal {C}_{\alpha ,\beta }\), we see that, except for the zero element, the extremal points are in one-to-one correspondence with the space of curves \(\mathrm{AC}^2:=\mathrm{AC}^2([0,1]; \overline{\Omega })\). This observation motivates us to define the atoms of our problem.

Definition 2

(Atoms) We denote by \(\mathrm{{AC}}_{\infty }^2\) the one-point extension of the set \(\mathrm{AC}^2\), where we include a point denoted by \(\gamma _\infty \). For any \(\gamma \in \mathrm{AC}^2\), we name as atom the respective extremal point \(\mu _\gamma =(\rho _\gamma , m_\gamma ) \in \mathcal {M}\) defined according to (14).

For \(\gamma _{\infty }\), the corresponding atom is defined by \(\mu _{\gamma _\infty }:=(0,0)\). We call sparse any measure \(\mu \in \mathcal {M}\) such that

$$\begin{aligned} \mu =\sum _{j=1}^N c_j \mu _{\gamma _j} \end{aligned}$$
(37)

for some \(N \in \mathbb {N}\), \(c_j > 0\) and \(\gamma _j \in \mathrm{AC}^2\), with \(\gamma _i \ne \gamma _j\) for \(i \ne j\).

Note that \(\gamma _\infty \) can be regarded as the infinite length curve: indeed if \(\{\gamma ^n\}\) in \(\mathrm{AC}^2\) is a sequence of curves with diverging length, that is, \(\int _0^1 \left|\dot{\gamma }^n\right| \,\mathrm{d}t \rightarrow \infty \) as \(n \rightarrow \infty \), then \(\rho _{\gamma ^n} {\mathop {\rightharpoonup }\limits ^{*}}\rho _{\gamma _\infty }\), since \(a_{\gamma ^n} \rightarrow 0\) by Hölder’s inequality.

Additionally, it is convenient to introduce the following map associating vectors of curves and coefficients with the corresponding sparse measure:

$$\begin{aligned} (\varvec{\gamma }, \varvec{c}) \mapsto \mu := \sum _{j=1}^N c_j \mu _{\gamma _j} \,, \end{aligned}$$
(38)

where \(N \in \mathbb {N}\) is fixed and \(\varvec{c}:= (c_1, \ldots , c_{N})\) with \(c_j > 0\), \(\varvec{\gamma } := (\gamma _1, \ldots , \gamma _{N})\), with \(\gamma _j \in \mathrm{AC}^2\).

Remark 3

The decomposition in extremal points of a given sparse measure might not be unique, that is, the map at (38) is not injective. For example, let \(N:=2\), \(\Omega :=(0,1)^2\) and

$$\begin{aligned} \begin{aligned}&\gamma _1(t):=(t,t) \,, \,\,\,\,\quad&\tilde{\gamma }_1 (t):= (t,t) \upchi _{[0,1/2)}(t)+ (t,1-t) \upchi _{[1/2,1]}(t)\,, \\&\gamma _2(t):=(t,1-t) \,,&\tilde{\gamma }_2 (t):= (t,1-t) \upchi _{[0,1/2)}(t)+ (t,t) \upchi _{[1/2,1]}(t)\,. \end{aligned} \end{aligned}$$

Note that injectivity fails for \(((\gamma _1,\gamma _2),(1,1))\) and \(((\tilde{\gamma }_1,\tilde{\gamma }_2),(1,1))\), given that they map to the same measure \(\mu ^*\), but \(\gamma _1\) and \(\gamma _2\) cross at \(t=1/2\), while \(\tilde{\gamma }_1\) and \(\tilde{\gamma }_2\) rebound. This observation is relevant for the presented algorithms, seeing that they operate in terms of extremal points: if, for example, \(\mu ^*\) were the unique solution to (\(\mathcal {P}\)) for some data f, then, due to the lack of a unique sparse representation for \(\mu ^*\), the numerical reconstruction could favor the representation having the least energy in terms of the regularizer \(J_{\alpha ,\beta }\). This is not surprising, since our method aims at reconstructing sparse measures, rather than their extremal points. A numerical example displaying the behavior of our algorithm on crossings, such as the case of \(\mu ^*\), is given in Sect. 6.2.3.
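The non-injectivity in this example can be checked numerically in a few lines; the sketch below verifies that, at every node of a time grid, the unordered pair of support points produced by the two decompositions coincides (the weights \(a_\gamma \) agree as well, since all four curves have the same kinetic energy).

import numpy as np

t = np.linspace(0.0, 1.0, 101)
g1, g2 = np.stack([t, t], axis=1), np.stack([t, 1.0 - t], axis=1)       # gamma_1, gamma_2
tg1 = np.where(t[:, None] < 0.5, g1, g2)                                 # tilde gamma_1
tg2 = np.where(t[:, None] < 0.5, g2, g1)                                 # tilde gamma_2
same_support = all(
    {tuple(g1[k]), tuple(g2[k])} == {tuple(tg1[k]), tuple(tg2[k])}
    for k in range(len(t)))
print(same_support)   # True: both pairs of curves generate the same time marginals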

The rest of the section is organized as follows. In Sect. 4.1, we present the core algorithm, describing its basic steps and summarizing it in Algorithm 1. In Sect. 4.2, we discuss the equivalence of the coefficients optimization step to a quadratic program, while in Sect. 4.3, we show sublinear convergence of Algorithm 1 in terms of the residual defined at (33). In Sect. 4.4, we detail a theoretical stopping criterion for our algorithm. To conclude, in Sect. 4.5, we describe how to alter Algorithm 1 in case the fidelity term \(\mathcal {F}\) at (\(\mathcal {P}\)) is replaced by a time-discrete version. All the results presented in this section and above also hold in this particular case, with minor modifications.

4.1 Core Algorithm

The core algorithm consists of two steps. In the first one, named the insertion step, an atom is added to the current iterate, this atom being the minimizer of the linearized problem defined at (26). In the second step, named the coefficients optimization step, the atoms are fixed and their associated weights are optimized to minimize the target functional \(T_{\alpha ,\beta }\) defined in (\(\mathcal {P}\)). In what follows, \(f \in L^2_H\) is a given datum and \(\alpha ,\beta >0\) are fixed parameters.

4.1.1 Iterates

We initialize the algorithm to the zero atom \(\mu ^0 := 0\). The n-th iteration \(\mu ^n\) is a sparse element of \(\mathcal {M}\) according to (37), that is,

$$\begin{aligned} \mu ^n = \sum _{j=1}^{N_n} c_j^n \mu _{\gamma _j^n} , \end{aligned}$$
(39)

where \(N_n \in \mathbb {N}\cup \{0\}\), \(\gamma _j^n \in \mathrm{AC}^2\), \(c_j^n >0\) and \(\gamma _i^n \ne \gamma _j^n\) if \(i \ne j\). Notice that \(N_n\) counts the number of atoms present at the n-th iteration, and is not necessarily equal to n, since the optimization step could discard atoms by setting their associated weights to zero. In practice, Algorithm 1 operates in terms of curves and weights. That is, the n-th iteration outputs pairs \((\gamma _j,c_j)\) with \(\gamma _j \in \mathrm{AC}^2\), \(c_j >0\): the iterate at (39) can then be constructed via the map (38).

4.1.2 Insertion Step

Assume \(\mu ^n\) is the current iterate. Define the dual variable associated with \(\mu ^n\) as in (22), that is,

$$\begin{aligned} w^n_t := -K_t (K_t^* \rho ^n_t - f_t ) \in C(\overline{\Omega }) \quad \text {for a.e.}\ t\in (0,1)\,. \end{aligned}$$
(40)

With it, consider the minimization problem of the form (17), that is,

$$\begin{aligned} \min _{\mu \in {{\,\mathrm{Ext}\,}}(C_{\alpha ,\beta })} - \langle \rho , w^n \rangle \,, \end{aligned}$$
(41)

where the term \( \langle \cdot , \cdot \rangle \) is defined in (23). We recall that (41) admits solution by Proposition 4. Thanks to the characterization \({{\,\mathrm{Ext}\,}}(C_{\alpha ,\beta })=\{0\}\cup \mathcal {C}_{\alpha ,\beta }\) provided by Theorem 1, problem (41) can be cast on the space \(\mathrm{{AC}}_{\infty }^2\). Indeed, given \(\mu \in \mathcal {C}_{\alpha ,\beta }\), following the notations at (13)–(14), we have that \(\mu =\mu _\gamma \) for some \(\gamma \in \mathrm{AC}^2\). The curve \(t \mapsto \delta _{\gamma (t)}\) belongs to \(C_\mathrm{w}^+\), and hence \( \langle \rho _\gamma , w \rangle \) is finite (see Remark 2). Thus, by definition, we have

$$\begin{aligned} \langle \rho _\gamma , w^n \rangle = a_\gamma \int _0^1 \langle \delta _{\gamma (t)}, w_t^n \rangle _{\mathcal {M}(\overline{\Omega }),C(\overline{\Omega })} \, \mathrm{d}t = a_\gamma \int _0^1 w_t^n(\gamma (t)) \,\mathrm{d}t \,, \end{aligned}$$
(42)

showing that (41) is equivalent to

$$\begin{aligned} \min _{\gamma \in \mathrm{{AC}}_{\infty }^2} -\langle \rho _\gamma ,w^n \rangle = \min _{\gamma \in \mathrm{AC}^2} \left\{ 0, -a_{\gamma } \int _0^1 w_t^n(\gamma (t))\,\mathrm{d}t \right\} \,. \end{aligned}$$
(43)

The insertion step consists in finding a curve \(\gamma ^* \in \mathrm{{AC}}_{\infty }^2\) that solves (43). To such a curve, we associate a new atom \(\mu _{\gamma ^*}\) via Definition 2. Note that \(\gamma ^*\) depends on the current iterate \(\mu ^n\), as well as on the datum f and the parameters \(\alpha ,\beta \); however, in order to simplify notation, we omit such dependencies. Afterward, we check the following stopping condition:

  • if \(\langle \rho _{\gamma ^*}, w^n\rangle \le 1\), then \(\mu ^n\) is a solution to (\(\mathcal {P}\)). The algorithm outputs \(\mu ^n\) and stops,

  • if \(\langle \rho _{\gamma ^*}, w^n\rangle > 1\), then \(\mu ^n\) is not a solution to (\(\mathcal {P}\)) and \(\gamma ^* \in \mathrm{AC}^2\). The found atom \(\mu _{\gamma ^*}\) is inserted in the n-th iterate \(\mu ^n\) and the algorithm continues.

The optimality statements in the above stopping condition correspond to positivity conditions on the subgradient of \(T_{\alpha ,\beta }\) and are rigorously proven in Sect. 4.4. Moreover, the mentioned stopping condition can be made quantitative as discussed in Remark 7.

Remark 4

In this section, we will always assume the availability of an exact solution \(\gamma ^*\) to (43). In particular, this allows us to obtain a sublinear convergence rate for the core algorithm (Theorem 7), and to make the stopping condition rigorous. In practice, however, obtaining \(\gamma ^*\) is not always possible, due to the nonlinearity and non-locality of the functional at (43). For this reason, in Sect. 5.1, we propose a strategy aimed at obtaining stationary points of (43). Based on such a strategy, a relaxed version of the insertion step is proposed for Algorithm 2, which we employ for the numerical simulations of Sect. 6.
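To illustrate the kind of descent performed on (43), the following sketch evaluates the objective \(-a_\gamma \int _0^1 w_t^n(\gamma (t))\,\mathrm{d}t\) for a curve sampled on a uniform time grid and runs a projected gradient descent with a naive finite-difference gradient. It is an illustration only: the exact \(H^1\) gradient and stepsize rule of Theorem 14, as well as the multistart and crossover heuristics of Sect. 5.1, are not reproduced, and the box projection assumes \(\overline{\Omega }=[0,1]^d\).

import numpy as np

def insertion_objective(curve, w, alpha, beta):
    # curve: (T, d) samples of gamma; w(k, x) evaluates the dual variable w^n_{t_k} at x
    T = curve.shape[0]
    dt = 1.0 / (T - 1)
    kinetic = 0.5 * np.sum(np.gradient(curve, dt, axis=0) ** 2) * dt
    a_gamma = 1.0 / (beta * kinetic + alpha)
    w_integral = sum(w(k, curve[k]) for k in range(T)) * dt
    return -a_gamma * w_integral                         # the functional minimized in (43)

def descend_curve(curve, w, alpha, beta, step=1e-2, n_iter=200, eps=1e-6):
    # projected gradient descent with a finite-difference gradient (illustration only)
    for _ in range(n_iter):
        grad = np.zeros_like(curve)
        base = insertion_objective(curve, w, alpha, beta)
        for idx in np.ndindex(*curve.shape):
            pert = curve.copy()
            pert[idx] += eps
            grad[idx] = (insertion_objective(pert, w, alpha, beta) - base) / eps
        curve = np.clip(curve - step * grad, 0.0, 1.0)   # keep the curve inside the closed domain
    return curve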

4.1.3 Coefficients Optimization Step

This step is performed after the stopping condition has been checked and \(\langle \rho _{\gamma ^*}, w^n\rangle > 1\) holds. In particular, as observed above, in this case \(\gamma ^* \in \mathrm{AC}^2\). We then set \(\gamma _{N_n+1}^n := \gamma ^*\) and consider the coefficients optimization problem

$$\begin{aligned} \min _{ (c_1,c_2,\ldots ,c_{N_n+1}) \in \mathbb {R}_{+}^{N_n+1}} T_{\alpha ,\beta } \left( \sum _{j=1}^{N_n+1} c_j \mu _{\gamma _j^{n}} \right) \,, \end{aligned}$$
(44)

where \(\mu _{\gamma _j^n}\) for \(j=1,\ldots ,N_n\) are the atoms present in the n-th iterate \(\mu ^n\). If \(c \in \mathbb {R}_{+}^{N_n+1}\) is a solution to the above problem, the next iterate is defined by

$$\begin{aligned} \mu ^{n+1} := \sum _{\begin{array}{c} c_j > 0 \end{array}} c_j \mu _{\gamma _j^n}\,, \end{aligned}$$
(45)

thus discarding the curves that do not contribute to (44).

Remark 5

Problem (44) is equivalent to a quadratic program of the form

$$\begin{aligned} \min _{c= (c_1,\ldots ,c_{N_n+1}) \in \mathbb {R}^{N_n+1}_+} \,\frac{1}{2}\, c^T \Gamma c + b^T c \,, \end{aligned}$$
(46)

where \(\Gamma \in \mathbb {R}^{(N_n+1) \times (N_n+1)}\) is a positive-semidefinite and symmetric matrix and \(b \in \mathbb {R}^{N_n+1}\), as proven in Proposition 6.

Therefore, throughout the paper, we will always assume the availability of an exact solution to (44). In practice, we solved (44) by means of the free Python software package CVXOPT [5, 6].
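For illustration, a minimal sketch of how the quadratic program (46) might be passed to CVXOPT is given below; the matrix \(\Gamma \) and the vector b are assumed to have been assembled beforehand (e.g., as in (49)), and the constraint \(c \ge 0\) is encoded as \(-Ic \le 0\). The released code may differ in its details.

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_coefficients(Gamma, b):
    """Solve min_{c >= 0} 0.5 c^T Gamma c + b^T c with CVXOPT.

    Gamma : (N, N) symmetric positive semi-definite array
    b     : (N,) array
    """
    N = Gamma.shape[0]
    P = matrix(Gamma.astype(float))
    q = matrix(b.astype(float))
    G = matrix(-np.eye(N))        # -c <= 0, i.e. c >= 0
    h = matrix(np.zeros(N))
    solvers.options['show_progress'] = False
    sol = solvers.qp(P, q, G, h)
    return np.array(sol['x']).ravel()
```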

4.1.4 Algorithm Summary

As discussed in Sect. 4.1.1, the iterates of Algorithm 1 are pairs of curves and weights \((\gamma _j,c_j)\) with \(\gamma _j \in \mathrm{AC}^2\), \(c_j>0\) and \(j=1, \ldots , N_n\). In the pseudo-code, we denote such iterates with the tuples \(\varvec{\gamma } = (\gamma _1, \ldots , \gamma _{N_n})\) and \(\varvec{c} = (c_1, \ldots , c_{N_n})\). Note that such tuples vary in size at each iteration, and the initial iterate of the algorithm, that is, the zero atom, corresponds to the empty tuples \(\varvec{\gamma } = ()\) and \(\varvec{c} = ()\). We denote the number of elements contained in a tuple \(\varvec{\gamma }\) with the symbol \(\left|\varvec{\gamma }\right|\). Via the map (38), the iterates \((\varvec{\gamma },\varvec{c})\) define a sequence of sparse measures \(\{\mu ^n\}\) of the form (39). The generated sequence \(\{\mu ^n\}\) weakly* converges (up to subsequences) to a minimizer of \(T_{\alpha ,\beta }\) for the datum \(f \in L^2_H\), as shown in Theorem 7. Notice that the assignments at lines 4 and 9 are meant to choose one element in the respective argmin set. The function \(\texttt {delete\_zero\_weighted}( \varvec{\gamma }, \varvec{c}) \) at line 11 in Algorithm 1 inputs a tuple \((\varvec{\gamma }, \varvec{c})\) and outputs another tuple where the curves \(\gamma _j\) and the corresponding weights \(c_j\) are deleted whenever \(c_j = 0\).

[Algorithm 1 (pseudo-code)]
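A possible implementation of delete_zero_weighted is elementary; the sketch below assumes the tuples are stored as Python lists and uses a small threshold to account for numerically zero weights returned by the quadratic program.

```python
def delete_zero_weighted(curves, weights, tol=1e-12):
    """Drop every curve whose associated weight is (numerically) zero."""
    kept = [(g, c) for g, c in zip(curves, weights) if c > tol]
    if not kept:
        return [], []
    new_curves, new_weights = zip(*kept)
    return list(new_curves), list(new_weights)
```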

4.2 Quadratic Optimization

We prove the statement in Remark 5. To be more precise, assume (H1)–(H3), (K1)–(K3) from Sect. 3.1 and let \(f \in L^2_H\), \(\alpha ,\beta >0\) be given. Fix \(N \in \mathbb {N}\), \(\gamma _1,\ldots ,\gamma _N \in \mathrm{AC}^2\) with \(\gamma _i \ne \gamma _j\) for all \(i\ne j\), and consider the coefficients optimization problem

$$\begin{aligned} \min _{c= (c_1,\ldots ,c_N)\in \mathbb {R}^N_+} \, T_{\alpha ,\beta } \left( \sum _{j=1}^N c_j \mu _{\gamma _j} \right) \,, \end{aligned}$$
(47)

where \(\mu _{\gamma _j}\) is defined according to (14). For (47), the following holds.

Proposition 6

Problem (47) is equivalent to

$$\begin{aligned} \min _{c=(c_1,\ldots ,c_N) \in \mathbb {R}^N_+}\, \frac{1}{2} c^T \Gamma c + b^T c\,, \end{aligned}$$
(48)

where \(\Gamma =(\Gamma _{i,j}) \in \mathbb {R}^{N \times N}\) is a positive semi-definite symmetric matrix and \(b=(b_i) \in \mathbb {R}^N\), with

$$\begin{aligned} \Gamma _{i,j}:=a_{\gamma _i} a_{\gamma _j} \int _0^1 \langle K_t^*\delta _{\gamma _i(t)}, K_t^*\delta _{\gamma _j(t)} \rangle _{H_t}\, \mathrm{d}t \,, \quad b_i:= 1- a_{\gamma _i} \int _0^1 \langle K_t^*\delta _{\gamma _i(t)}, f_t \rangle _{H_t}\, \mathrm{d}t \,. \end{aligned}$$
(49)

Proof

As \(\gamma _j\) is continuous, the curve \(t \mapsto \delta _{\gamma _j(t)}\) belongs to \(C_\mathrm{w}^+\). Hence, the map \(t \mapsto K_t^* \delta _{\gamma _j(t)}\) belongs to \(L^2_H\) by Lemma 12, and the quantities at (49) are well-defined. Thanks to the definition of \(T_{\alpha ,\beta }\) and Lemma 2, we immediately see that

$$\begin{aligned} \begin{aligned} T_{\alpha ,\beta } \left( \sum _{j=1}^N c_j \mu _{\gamma _j} \right)&= \frac{1}{2} \sum _{i,j=1}^N c_i c_j \langle K^* \rho _{\gamma _i}, K^* \rho _{\gamma _j} \rangle _{L^2_H} \\&\qquad \qquad \quad - \sum _{j=1}^N c_j \langle K^* \rho _{\gamma _j}, f \rangle _{L^2_H} + \frac{1}{2} \left\Vert f\right\Vert _{L^2_H}^2 + \sum _{j=1}^N c_j J_{\alpha ,\beta }(\mu _{\gamma _j}) \\&= \frac{1}{2} c^T \Gamma c + b^T c + M_0\,, \end{aligned} \end{aligned}$$

where \(M_0\ge 0\) is defined at (21). This shows that (47) and (48) are equivalent. The rest of the statement follows since \(\Gamma \) is the Gramian with respect to the vectors \(K^*\rho _{\gamma _1}, \ldots , K^*\rho _{\gamma _N}\) in \(L^2_H\). \(\square \)

4.3 Convergence Analysis

We prove sublinear convergence for Algorithm 1. The convergence rate is given in terms of the functional distance (33) associated with \(T_{\alpha ,\beta }\). Throughout the section, we assume that \(f \in L^2_H\) is a given datum, \(\alpha , \beta >0\) are fixed regularization parameters and (H1)–(H3), (K1)–(K3) as in Sect. 3.1 hold. The convergence result is stated as follows.

Theorem 7

Let \(\{\mu ^n\}\) be a sequence generated by Algorithm 1. Then, \(\{T_{\alpha ,\beta }(\mu ^n)\}\) is non-increasing and the residual at (33) satisfies

$$\begin{aligned} r(\mu ^n) \le \frac{C}{n} \,, \,\, \text { for all } \,\, n \in \mathbb {N}\,, \end{aligned}$$
(50)

where \(C>0\) is a constant depending only on \(\alpha ,\beta \), f and \(K_t^*\). Moreover, each accumulation point \(\mu ^*\) of \(\mu ^n\) with respect to the weak* topology of \(\mathcal {M}\) is a minimizer for \(T_{\alpha ,\beta }\). If \(T_{\alpha ,\beta }\) admits a unique minimizer \(\mu ^*\), then \(\mu ^n {\mathop {\rightharpoonup }\limits ^{*}}\mu ^*\) along the whole sequence.

Remark 6

The proof of Theorem 7 follows similar steps to [18, Theorem 5.8] and [51, Theorem 5.4]. We highlight that the proof of monotonicity of \(\{T_{\alpha ,\beta }(\mu ^n)\}\) and of the decay estimate (50) does not make full use of the coefficients optimization step (Sect. 4.1.3) of Algorithm 1. Rather, the proof relies on energy estimates for the surrogate iterate \(\mu ^s := \mu ^n + s(M\mu _{\gamma ^*} - \mu ^n)\), where \(\gamma ^*\in \mathrm{AC}^2\) is a solution of the insertion step (43), \(M \ge 0\) is a suitable constant, and the step-size s is chosen according to the Armijo–Goldstein condition (see [18, Section 5]). Since \(\mu ^s\) is a candidate for problem (44), the added coefficients optimization step in Algorithm 1 does not worsen the convergence rate.

Proof

Let \(\{\mu ^n\}\) be a sequence generated by Algorithm 1, which by construction consists of measures of the form (39), and fix \(n \in \mathbb {N}\). Recall that \(w^n_t:=-K_t(K_t^* \rho _t^n - f_t) \in C(\overline{\Omega })\) is the associated dual variable. Let \(\mu ^*=(\rho ^*,m^*)\) in \({{\,\mathrm{Ext}\,}}(C_{\alpha ,\beta })\) be a solution to the insertion step, that is, \(\mu ^*\) solves (41). The existence of such \(\mu ^*\) is guaranteed by Proposition 4. Without loss of generality, we can assume that the algorithm does not stop at iteration n, that is, \( \langle \rho ^*, w^n \rangle >1\) according to the stopping condition in Sect. 4.1.2. In particular, \(\rho ^*\ne 0\). Recalling that \({{\,\mathrm{Ext}\,}}(C_{\alpha ,\beta })=\{0\} \cup \mathcal {C}_{\alpha ,\beta }\) (Theorem 1), we then have \(\mu ^*=\mu _{\gamma ^*}\) for some \(\gamma ^* \in \mathrm{AC}^2\). By the coefficients optimization step, the next iterate is of the form

$$\begin{aligned} \mu ^{n+1} = \sum _{j=1}^{N_n} c_j^{n+1} \mu _{\gamma _{j}^n} + c^* \mu _{\gamma ^*} \,, \end{aligned}$$

and the coefficients \((c_1^{n+1},\ldots , c_{N_n}^{n+1},c^*)\) solve the quadratic problem

$$\begin{aligned} \min _{c=(c_1,\ldots ,c_{N_n +1} ) \in \mathbb {R}_+^{N_{n}+1}} T_{\alpha ,\beta } \left( \sum _{j=1}^{N_n} c_j \mu _{\gamma _j^n} + c_{N_n+1} \mu _{\gamma ^*} \right) \,. \end{aligned}$$
(51)

Claim. There exists a constant \(\tilde{C}>0\) independent of n such that, if \(T_{\alpha ,\beta }(\mu ^n) \le M_0\), then

$$\begin{aligned} r(\mu ^{n+1}) - r(\mu ^n) \le - \tilde{C} \, r(\mu ^n)^2 \,. \end{aligned}$$
(52)

To prove the above claim, assume that \(T_{\alpha ,\beta }(\mu ^n) \le M_0\). Define \(\hat{\mu }:=M\mu _{\gamma ^*}\), with \(M:=M_0 \langle \rho _{\gamma ^*}, w^n \rangle \). Since \(\mu _{\gamma ^*}\) solves (41) and \( \langle \rho _{\gamma ^*}, w^n \rangle >1\), by Proposition 4, we have that \(\hat{\mu }\) minimizes (25) with respect to \(w^n\). Therefore, according to (32),

$$\begin{aligned} G(\mu ^n)=J_{\alpha ,\beta }(\mu ^n) - \varphi (J_{\alpha ,\beta }(\hat{\mu })) - \langle \rho ^n-\hat{\rho }, w^n \rangle \,, \end{aligned}$$
(53)

given that \(J_{\alpha ,\beta }(\mu ^n)\le T_{\alpha ,\beta }(\mu ^n) \le M_0 <+\infty \). For \(s \in [0,1]\), set \(\mu ^s:=\mu ^n + s (\hat{\mu }- \mu ^n)\). By convexity of \(J_{\alpha ,\beta }\) (see Lemma 11), we have \(J_{\alpha ,\beta }(\mu ^s)<+\infty \). Hence, we can apply the polarization identity at (35) with respect to \(w^n,\rho ^n,\rho ^s\) to obtain

$$\begin{aligned} \begin{aligned} T_{\alpha ,\beta }(\mu ^s)&- T_{\alpha ,\beta }(\mu ^n) \\&= -s \langle \hat{\rho }- \rho ^n, w^n \rangle + \frac{s^2}{2} \left\Vert K^*( \rho ^n - \hat{\rho })\right\Vert ^2_{L^2_H} + J_{\alpha ,\beta }(\mu ^s) - J_{\alpha ,\beta }(\mu ^n) \\&\le -s \langle \hat{\rho }- \rho ^n, w^n \rangle + \frac{s^2}{2} \left\Vert K^*( \rho ^n - \hat{\rho })\right\Vert ^2_{L^2_H} + s(\varphi (J_{\alpha ,\beta }(\hat{\mu })) - J_{\alpha ,\beta }(\mu ^n) ) \,, \end{aligned} \end{aligned}$$

where in the second line, we used convexity of \(J_{\alpha ,\beta }\) and the inequality \(t \le \varphi (t)\) for all \(t \ge 0\). Note that \(\mu ^s\) is a competitor for (51), so that \(T_{\alpha ,\beta }(\mu ^{n+1}) \le T_{\alpha ,\beta }(\mu ^s)\). Recalling (53), we then obtain

$$\begin{aligned} r(\mu ^{n+1})-r(\mu ^n)=T_{\alpha ,\beta }(\mu ^{n+1}) - T_{\alpha ,\beta }(\mu ^n) \le - s G(\mu ^n) + \frac{s^2}{2} \left\Vert K^*(\hat{\rho }-\rho ^n )\right\Vert ^2_{L^2_H}\,, \end{aligned}$$
(54)

which holds for all \(s \in [0,1]\). Choose the stepsize \(\hat{s}\) according to the Armijo–Goldstein condition (see e.g., [51, Definition 4.1]) as

$$\begin{aligned} \hat{s}:= \min \left\{ 1, \frac{G(\mu ^n)}{ \left\Vert K^* (\hat{\rho } - \rho ^n) \right\Vert ^2_{L^2_H}} \right\} \,, \end{aligned}$$
(55)

with the convention that \(C/0 = +\infty \) for \(C>0\). If \(\hat{s} <1\), by (54) and (34), we obtain

$$\begin{aligned} r(\mu ^{n+1})-r(\mu ^n) \le - \frac{1}{2} \frac{G(\mu ^n)^2}{ \left\Vert K^* (\hat{\rho } - \rho ^n) \right\Vert ^2_{L^2_H}} \le -\frac{1}{2} \frac{r(\mu ^n)^2}{ \left\Vert K^* (\hat{\rho } - \rho ^n) \right\Vert ^2_{L^2_H}}\,. \end{aligned}$$
(56)

Since we are assuming \(T_{\alpha ,\beta }(\mu ^n) \le M_0\), by definition of \(T_{\alpha ,\beta }\), we deduce that \(\Vert \rho ^n\Vert _{\mathcal {M}(X)} \le M_0/\alpha \). Moreover, since \(\hat{\mu }= M\mu _{\gamma ^*}\), assumption (K2) in Sect. 3.1, the estimate \(a_{\gamma ^*} \le 1/\alpha \), and the Cauchy–Schwarz inequality yield

$$\begin{aligned} \begin{aligned} \left\Vert \hat{\rho }\right\Vert _{\mathcal {M}(X)} = a_{\gamma ^*}M&\le a_{\gamma ^*}^2 M_0 \int _0^1 \left| \langle \delta _{\gamma ^*(t)}, w^n_t \rangle \right| \, \mathrm{d}t \le \frac{M_0}{\alpha ^2} \int _0^1 \left\Vert w^n_t\right\Vert _{C(\overline{\Omega })} \, \mathrm{d}t \\&\le \frac{M_0}{\alpha ^2} \left( \int _0^1 \left\Vert w^n_t\right\Vert _{C(\overline{\Omega })}^2 \, \mathrm{d}t\right) ^{1/2} \le \frac{M_0}{\alpha ^2}\, C \sqrt{2} \left( \frac{1}{2} \left\Vert K^*\rho ^n - f\right\Vert _{L^2_H}^2 \right) ^{1/2}\\&\le \frac{M_0}{\alpha ^2}\, C \sqrt{2} \, T_{\alpha ,\beta } (\mu ^n)^{1/2} \le \frac{M_0^{3/2}}{\alpha ^2}\, C \sqrt{2}\,, \end{aligned} \end{aligned}$$

where \(C>0\) is the constant in (K2).   Thus, we can estimate

$$\begin{aligned} \Vert K^*(\hat{\rho }- \rho ^n)\Vert ^2_{L^2_H} \le 2C^2 (\Vert \hat{\rho }\Vert _{\mathcal {M}(X)}^2 + \Vert \rho ^n\Vert _{\mathcal {M}(X)}^2) \le \tilde{C}\,, \end{aligned}$$

with \(\tilde{C}>0\) not depending on n. Inserting the above estimate in (56) yields (52) and the claim follows. Assume now \(\hat{s}=1\), so that \( \Vert K^*(\hat{\rho }- \rho ^n)\Vert ^2_{L^2_H} \le G(\mu ^n)\). From (54) and (34), we obtain

$$\begin{aligned} r(\mu ^{n+1})-r(\mu ^n) \le - \frac{1}{2} G(\mu ^n) \le -\frac{1}{2} r(\mu ^n) \,. \end{aligned}$$
(57)

As \(T_{\alpha ,\beta }(\mu ^n) \le M_0\), we also have \(r(\mu ^n) \le T_{\alpha ,\beta }(\mu ^n) \le M_0\). Therefore, we can find a constant \(\tilde{C}>0\) not depending on n such that \(\tilde{C}r(\mu ^n)^2 \le r(\mu ^n)\). Substituting the latter in (57) yields (52), and the proof of the claim is concluded.

Finally, we are in a position to prove (50). Since \(\mu ^0=0\), we have that \(T_{\alpha ,\beta }(\mu ^0) \le M_0\). Thus, we can inductively apply (52) and obtain that \(T_{\alpha ,\beta }(\mu ^n) \le M_0\) for all \(n\in \mathbb {N}\). In particular, (52) holds for every \(n\in \mathbb {N}\) and, as a consequence, the sequence \(\{r(\mu ^n)\}\) is non-increasing. Setting \(r_n := r(\mu ^n)\), we then get

$$\begin{aligned} \frac{1}{r_{n}} - \frac{1}{r_{0}} = \sum _{j=0}^{n-1} \frac{1}{r_{j+1}} - \frac{1}{r_{j}} = \sum _{j=0}^{n-1} \frac{r_j - r_{j+1}}{r_j r_{j+1}} \ge \sum _{j=0}^{n-1}\frac{\tilde{C} r_j^2}{r_j r_{j+1}} \ge \tilde{C}n\,, \end{aligned}$$

and (50) follows. The remaining claims follow from the weak* lower semicontinuity of \(T_{\alpha ,\beta }\) (Theorem 3) and estimate (A2), given that \(\{\mu ^n\}\) is a minimizing sequence for (\(\mathcal {P}\)). \(\square \)

4.4 Stopping Condition

We prove the optimality statements in the stopping criterion for the core algorithm anticipated in Sect. 4.1.2. In the following, G denotes the primal-dual gap introduced in (32). We denote by \(\mu ^n\) the n-th iterate of Algorithm 1, which is of the form (39), and by \(w^n\) the corresponding dual variable (40). Moreover, let \(\mu _{\gamma ^*}\) be the atom associated with the curve \(\gamma ^* \in \mathrm{{AC}}_{\infty }^2\) solving the insertion step (43).

Lemma 8

For all \(n \in \mathbb {N}\), we have

$$\begin{aligned} G(\mu ^n) = \Lambda ( \langle \rho _{\gamma ^*}, w^n \rangle )\,, \qquad \Lambda (t):= {\left\{ \begin{array}{ll} 0 &{} \, \text { if } \, t \le 1\,, \\ \frac{M_0}{2} \left( t^2 -1\right) &{} \, \text { if } \, t>1 \,, \end{array}\right. } \end{aligned}$$
(58)

where \(M_0\ge 0\) is defined at (21). In particular, \(\mu ^n\) is a solution of (\(\mathcal {P}\)) if and only if \( \langle \rho _{\gamma ^*}, w^n \rangle \le 1\).

Before proving Lemma 8, we give a quantitative version of the stopping condition of Sect. 4.1.2.

Remark 7

With the same notations as above, consider the condition

$$\begin{aligned} \Lambda ( \langle \rho _{\gamma ^*}, w^n \rangle ) < \mathrm{TOL} \end{aligned}$$
(59)

where \(\mathrm{TOL}>0\) is a fixed tolerance. Notice that \(G(\mu ^n) =\Lambda ( \langle \rho _{\gamma ^*}, w^n \rangle )\) by Lemma 8. Thus, assuming (59), and using (34), we see that the functional residual defined at (33) satisfies \(r(\mu ^n) <\mathrm{TOL}\), i.e., \(\mu ^n\) almost minimizes (\(\mathcal {P}\)), up to the tolerance. Therefore, the condition at (59) can be employed as a quantitative stopping criterion for Algorithm 1.
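As an illustration, the check (59) amounts to the following few lines, where pairing_value stands for \(\langle \rho _{\gamma ^*}, w^n \rangle \) and M0 for the constant at (21); how these two quantities are computed depends on the concrete implementation and is assumed here.

```python
def primal_dual_gap(pairing_value, M0):
    """Evaluate Lambda(<rho_{gamma*}, w^n>) as in (58)."""
    t = pairing_value
    return 0.0 if t <= 1.0 else 0.5 * M0 * (t ** 2 - 1.0)

def should_stop(pairing_value, M0, tol=1e-10):
    """Quantitative stopping condition (59): gap below the tolerance TOL."""
    return primal_dual_gap(pairing_value, M0) < tol
```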

Proof of Lemma 8

We start by computing \(G(\mu ^n)\) for a fixed \(n \in \mathbb {N}\). Set \(\hat{\mu }:=M\mu _{\gamma ^*}\), where M is defined as in (27) with \(w=w^n\). Since \(\mu _{\gamma ^*} \in {{\,\mathrm{Ext}\,}}(C_{\alpha ,\beta })\) solves (41), we have that \(\hat{\mu }\) solves (25) with respect to \(w^n\) (see Proposition 4). By Theorem 7, the sequence \(\{T_{\alpha ,\beta }(\mu ^n)\}\) is non-increasing. Thus, \(J_{\alpha ,\beta }(\mu ^n) \le T_{\alpha ,\beta }(\mu ^n) \le T_{\alpha ,\beta }(\mu ^0) = M_0<+\infty \), since \(\mu ^0=0\). Then, (32) reads

$$\begin{aligned} G(\mu ^n)= J_{\alpha ,\beta }(\mu ^n) - \langle \rho ^n, w^n \rangle - \varphi (J_{\alpha ,\beta }(\hat{\mu }))+ \langle \hat{\rho }, w^n \rangle \,. \end{aligned}$$
(60)

Notice that by one-homogeneity of \(J_{\alpha ,\beta }\) (see Lemma 11) and the fact that \(\mu _{\gamma ^*} \in {{\,\mathrm{Ext}\,}}(C_{\alpha ,\beta })\), we have \(J_{\alpha ,\beta }(\hat{\mu })=M\). Recalling the definition of \(\varphi \) at (20), by direct calculation, we obtain

$$\begin{aligned} -\varphi (J_{\alpha ,\beta }(\hat{\mu })) + \langle \hat{\rho }, w^n \rangle = \Lambda ( \langle \rho _{\gamma ^*}, w^n \rangle )\,. \end{aligned}$$
(61)

We now compute the remaining terms in (60). By (44) and (45) at the step \(n-1\), we know that the coefficients \(c^n=(c_1^n,\ldots ,c_{N_n}^n) \in \mathbb {R}^{N_n}_{++}\) of \(\mu ^{n}\) solve the minimization problem

$$\begin{aligned} \min _{c=(c_1,\ldots ,c_{N_n}) \in \mathbb {R}^{N_n}_{++}} T_{\alpha ,\beta } \left( \sum _{j=1}^{N_n} c_j \mu _{\gamma _j^n}\right) \,. \end{aligned}$$
(62)

Since \(\gamma _j^n\) in (39) is continuous, we have that \((t \mapsto \delta _{\gamma _j^n(t)}) \in C_\mathrm{w}^+\) and thus \(t \mapsto K_t^* \delta _{\gamma _j^n(t)}\) belongs to \(L^2_H\), by Lemma 12. In view of (39), the definition of \(T_{\alpha ,\beta }\), and Lemma 2, we can expand the expression at (62), differentiate with respect to each component of c, and recall that \(c^n\) is optimal in (62), to obtain

$$\begin{aligned} \begin{aligned} 0&= \sum _{j=1}^{N_n} c_j^n a_{\gamma _i^n} a_{\gamma _j^n} \int _0^1 \langle K_t^* \delta _{\gamma _i^n(t)}, K_t^* \delta _{\gamma _j^n(t)} \rangle _{H_t} \, \mathrm{d}t - a_{\gamma _i^n} \int _0^1 \langle K_t^* \delta _{\gamma _i^n(t)}, f_t \rangle _{H_t} \, \mathrm{d}t + 1\\&= \langle K^* \rho _{\gamma ^n_i}, K^* \rho ^n \rangle _{L^2_H} - \langle K^* \rho _{\gamma ^n_i}, f \rangle _{L^2_H} + 1 = 1 - \langle \rho _{\gamma _i^n}, w^n\rangle \,, \end{aligned} \end{aligned}$$
(63)

which holds for all \(i=1,\ldots ,N_n\). By Lemma 2 and linearity of \( \langle \cdot , \cdot \rangle \), we obtain the identity \(J_{\alpha ,\beta }(\mu ^n) = \langle \rho ^n, w^n \rangle \). The latter, together with (60) and (61), yields (58). For the remaining part of the statement, notice that by (58), we have \(G(\mu ^n)=0\) if and only if \( \langle \rho _{\gamma ^*}, w^n \rangle \le 1\). Therefore, the thesis follows by Lemma 5. \(\square \)

4.5 Time-Discrete Version

The minimization problem (\(\mathcal {P}\)) presented in Sect. 3 is posed for time-continuous measurements \(f \in L^2_H\), whose discrepancy to the reconstructed curve \(t \mapsto \rho _t\) is modeled by the fidelity term \(\mathcal {F}\) at (16). In real-world applications, however, the measured data are time-discrete, i.e., we can assume that measurements are taken at times \(0 = t_0< t_1< \ldots < t_T = 1\), with \(T \in \mathbb {N}\) fixed. Hence, the data are of the form \(f = (f_{t_0}, f_{t_1},\ldots , f_{t_T})\), where \(f_{t_i} \in H_{t_i}\).

For this reason, and with the additional goal of lowering the computational cost, we decided to present numerical experiments (Sect. 6) where the time-continuous fidelity term \(\mathcal {F}\) is replaced by a discrete counterpart. In Sect. 4.5.1, we show how to modify the mathematical framework discussed so far, in order to deal with the time-discrete case. Consequently, it is immediate to adapt Algorithm 1 to the resulting time-discrete functional, as discussed in Sect. 4.5.2. We remark that all the results up to this point will hold, in a slightly modified version, also for the discrete setting discussed below.

4.5.1 Time-Discrete Framework

We replace problem (\(\mathcal {P}\)) with

$$\begin{aligned} \min _{\mu \in \mathcal {M}} \, T^{\mathcal {D}}_{\alpha ,\beta }(\mu )\,, \qquad T^{\mathcal {D}}_{\alpha ,\beta }(\mu ) := \mathcal {F}^{\mathcal {D}}(\mu ) + J_{\alpha ,\beta }(\mu )\,, \end{aligned}$$
(\(\mathcal {P}_{discr}\))

where the time-discrete fidelity term \(\mathcal {F}^{\mathcal {D}} :\mathcal {M}\rightarrow [0,+\infty ]\) is defined by

$$\begin{aligned} \mathcal {F}^{\mathcal {D}}(\mu ):= {\left\{ \begin{array}{ll} \displaystyle \frac{1}{2 (T+1)}\sum _{i=0}^T \left\Vert K_{t_i}^* \rho _{t_i} - f_{t_i}\right\Vert _{H_{t_i}}^2 &{} \, \text { if } \rho =\mathrm{d}t \otimes \rho _t \, , \, \, (t \mapsto \rho _t) \in C_\mathrm{w}\,, \\ + \infty &{} \, \text { otherwise.} \end{array}\right. } \end{aligned}$$

Here, \(H_{t_i}\) are real Hilbert spaces, and the given data vector \(f=(f_{t_0},\ldots ,f_{t_T})\) satisfies \(f_{t_i} \in H_{t_i}\) for all \(i=0,\ldots ,T\). The forward operators \(K_{t_i}^* :\mathcal {M}(\overline{\Omega }) \rightarrow H_{t_i}\) are assumed to be linear continuous and weak*-to-weak continuous, for each \(i=0,\ldots ,T\). The minimization problem (\(\mathcal {P}_{discr}\)) is well-posed by the direct method of calculus of variations: indeed \(T^{\mathcal {D}}_{\alpha ,\beta }\) is proper, \(J_{\alpha ,\beta }\) is weak* lower semicontinuous and coercive in \(\mathcal {M}\) (Lemma 11), and \(\mathcal {F}^{\mathcal {D}}\) is lower semicontinuous with respect to the convergence in (A3). In particular, a solution \(\mu ^*= (\rho ^*, m^*)\) to (\(\mathcal {P}_{discr}\)) will satisfy \(\rho ^*=\mathrm{d}t \otimes \rho _t^*\) with \((t \mapsto \rho _t^* )\in C_\mathrm{w}^+\). We now define the other quantities which are needed to formulate the discrete counterpart of the theory developed so far. For a given curve of measures \((t \mapsto \tilde{\rho }_t) \in C_\mathrm{w}\), the corresponding dual variable (22) is redefined to be

$$\begin{aligned} w_{t_i} := -K_{t_i}( K_{t_i}^* \tilde{\rho }_{t_i} - f_{t_i}) \in C(\overline{\Omega })\,, \end{aligned}$$
(64)

for each \(i=0,\ldots ,T\). Consequently, we redefine the associated scalar product (23) to

$$\begin{aligned} \langle \rho , w \rangle _{\mathcal {D}} := {\left\{ \begin{array}{ll} \displaystyle \frac{1}{T+1}\sum _{i=0}^T \langle \rho _{t_i}, w_{t_i} \rangle _{\mathcal {M}(\overline{\Omega }),C(\overline{\Omega })} &{} \,\, \text {if } \rho =\mathrm{d}t \otimes \rho _t\, , \,\, (t \mapsto \rho _t) \in C_\mathrm{w}\,, \\ -\infty &{}\,\, \text {otherwise.} \end{array}\right. } \end{aligned}$$

It is straightforward to check that all the results in Sect. 3 hold with \(T_{\alpha ,\beta }\) and \( \langle \cdot , \cdot \rangle \) replaced by \(T^{\mathcal {D}}_{\alpha ,\beta }\) and \( \langle \cdot , \cdot \rangle _{\mathcal {D}}\), respectively, with the obvious modifications. In particular, the problem

$$\begin{aligned} \min _{\mu \in {{\,\mathrm{Ext}\,}}(C_{\alpha ,\beta })} \, - \langle \rho , w \rangle _{\mathcal {D}} \end{aligned}$$
(65)

admits a solution \(\mu ^*\), where \(C_{\alpha ,\beta }:=\{ J_{\alpha ,\beta }(\mu )\le 1\}\). One can perform a similar computation to the one at (42) to obtain equivalence between (65) and

$$\begin{aligned} \min _{\gamma \in \mathrm{{AC}}_{\infty }^2} -\langle \rho _\gamma ,w \rangle _{\mathcal {D}} = \min _{\gamma \in \mathrm{AC}^2} \left\{ 0, -\frac{a_{\gamma }}{T+1}\sum _{i=0}^T w_{t_i}(\gamma (t_i)) \right\} \,. \end{aligned}$$
(66)
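In the time-discrete setting, the right-hand side of (66) is directly computable for a curve sampled at the times \(t_i\). A minimal sketch follows; the normalization constant \(a_\gamma \) from (14) is assumed to be supplied by a separate routine, and w(i, x) stands for the evaluation \(w_{t_i}(x)\).

```python
import numpy as np

def discrete_pairing(gamma, a_gamma, w):
    """Evaluate <rho_gamma, w>_D = a_gamma / (T+1) * sum_i w_{t_i}(gamma(t_i)).

    gamma   : (T+1, d) array with the positions gamma(t_i)
    a_gamma : weight a_gamma associated with the curve (assumed given)
    w       : callable, w(i, x) returns w_{t_i}(x)
    """
    vals = np.array([w(i, gamma[i]) for i in range(gamma.shape[0])])
    return a_gamma * vals.mean()        # mean over i equals (1/(T+1)) * sum

def insertion_objective(gamma, a_gamma, w):
    """Objective of (66), to be compared against the value 0."""
    return -discrete_pairing(gamma, a_gamma, w)
```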

Remark 8

Assume additionally that \(\overline{\Omega }\) is convex. Then, a solution \(\gamma ^* \in \mathrm{{AC}}_{\infty }^2\) to problem (66) is either \(\gamma ^*=\gamma _\infty \), or \(\gamma ^* \in \mathrm{AC}^2\) with \(\gamma ^*\) linear in each interval \([t_i,t_{i+1}]\). Indeed, given any curve \(\gamma \in \mathrm{AC}^2\), denote by \(\tilde{\gamma }\) the piecewise linear version of \(\gamma \) sampled at \(t_i\) for \(i=0,\ldots ,T\). Then \(\langle \rho _\gamma ,w \rangle _{\mathcal {D}} \le \langle \rho _{\tilde{\gamma }} ,w \rangle _{\mathcal {D}}\), due to the inequality \(\int _0^1 \left|\dot{\tilde{\gamma }}(t)\right|^2 \, \mathrm{d}t \le \int _0^1 \left|\dot{\gamma }(t)\right|^2 \, \mathrm{d}t\).

4.5.2 Adaption of Algorithm 1 to the Time-Discrete Setting

The core algorithm in Sect. 4.1 is readily adaptable to the task of minimizing (\(\mathcal {P}_{discr}\)). To this end, let \(f_{t_i} \in H_{t_i}\) be a given datum and \(\alpha ,\beta >0\) be fixed parameters for (\(\mathcal {P}_{discr}\)). Assume that \(\mu ^n\) is the current sparse iterate, of the form (39). The dual variable associated with \(\mu ^n\) is defined, according to (64), by \(w_{t_i}^n := -K_{t_i}( K_{t_i}^* \rho _{t_i}^n - f_{t_i})\). Similarly to Sect. 4.1.2, the insertion step in the time-discrete version consists in finding a curve \(\gamma ^* \in \mathrm{{AC}}_{\infty }^2\) which solves (66) with respect to \(w^n_{t_i}\). Such curve defines a new atom \(\mu _{\gamma ^*}\) according to Definition 2. Adapting the proofs of Lemmas 5 and 8 to the discrete setting, one deduces the following stopping condition:

  • if \(\langle \rho _{\gamma ^*}, w^n\rangle _{\mathcal {D}} \le 1\), then \(\mu ^n\) is a solution to (\(\mathcal {P}_{discr}\)). The algorithm outputs \(\mu ^n\) and stops,

  • if \(\langle \rho _{\gamma ^*}, w^n\rangle _{\mathcal {D}} > 1\), then \(\mu ^n\) is not a solution to (\(\mathcal {P}_{discr}\)) and \(\gamma ^* \in \mathrm{AC}^2\). The found atom \(\mu _{\gamma ^*}\) is inserted in the n-th iterate \(\mu ^n\) and the algorithm continues.

Set \(\gamma _{N_n+1}^n := \gamma ^*\). The time-discrete version of the coefficients optimization step discussed in Sect. 4.1.3 consists in solving

$$\begin{aligned} \min _{ (c_1,c_2,\ldots ,c_{N_n+1}) \in \mathbb {R}_{+}^{N_n+1}} T^{\mathcal {D}}_{\alpha ,\beta } \left( \sum _{j=1}^{N_n+1} c_j \mu _{\gamma _j^{n}} \right) \,. \end{aligned}$$
(67)

By proceeding as in Sect. 4.2, one can check that (67) is equivalent to a quadratic program of the form (46), where the matrix \(\Gamma \in \mathbb {R}^{(N_n +1) \times (N_n+1)}\) and the vector \(b \in \mathbb {R}^{N_n+1}\) are given by

$$\begin{aligned} \Gamma _{j,k}&:= \frac{a_{\gamma _j} a_{\gamma _k}}{T+1} \sum _{i=0}^{T} \langle K_{t_i}^*\delta _{\gamma _j(t_i)}, K_{t_i}^*\delta _{\gamma _k(t_i)} \rangle _{H_{t_i}}\,, \\ b_j&:= 1- \frac{a_{\gamma _j}}{T+1}\sum _{i=0}^{T} \langle K_{t_i}^*\delta _{\gamma _j(t_i)}, f_{t_i} \rangle _{H_{t_i}} \,. \end{aligned}$$

In view of Remark 8 and of the above construction, we note that the iterates of the discrete algorithm are of the form (39) with \(\gamma _j^n \in \mathrm{AC}^2\) piecewise linear. Finally, we remark that the time-discrete algorithm obtained with the above modifications enjoys the same sublinear convergence rate stated in Theorem 7.
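A sketch of the assembly of \(\Gamma \) and b in this time-discrete setting is given below. The routines Kstar_delta(i, x), returning \(K_{t_i}^*\delta _x\) as an array in \(H_{t_i}\), and inner(i, u, v), returning \(\langle u,v\rangle _{H_{t_i}}\), as well as the precomputed weights a[j] \(=a_{\gamma _j}\), are assumptions about the concrete implementation.

```python
import numpy as np

def assemble_qp(curves, a, f, Kstar_delta, inner):
    """Assemble Gamma and b for the discrete coefficient problem (67).

    curves      : list of (T+1, d) arrays, the curves gamma_j at the times t_i
    a           : list of weights a_{gamma_j} (assumed precomputed)
    f           : list of data vectors f_{t_i}
    Kstar_delta : Kstar_delta(i, x) -> K*_{t_i} delta_x in H_{t_i}
    inner       : inner(i, u, v) -> <u, v>_{H_{t_i}}
    """
    N, T1 = len(curves), len(f)           # T1 = T + 1 time samples
    meas = [[Kstar_delta(i, curves[j][i]) for i in range(T1)] for j in range(N)]
    Gamma, b = np.zeros((N, N)), np.zeros(N)
    for j in range(N):
        for k in range(N):
            Gamma[j, k] = a[j] * a[k] / T1 * sum(
                inner(i, meas[j][i], meas[k][i]) for i in range(T1))
        b[j] = 1.0 - a[j] / T1 * sum(
            inner(i, meas[j][i], f[i]) for i in range(T1))
    return Gamma, b
```

Together with the quadratic program solver sketched in Sect. 4.1.3, this yields the time-discrete coefficients optimization step.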

5 The Algorithm: Numerical Implementation

The aim of this section is twofold. First, in Sect. 5.1, we describe how to approach the minimization of the insertion step problem (43) by means of gradient descent strategies. This analysis is performed under additional assumptions on the operators \(K_t :H_t \rightarrow C(\overline{\Omega })\), which, loosely speaking, require that \(K_t\) map into the space \(C^{1,1}(\overline{\Omega })\) of differentiable functions with Lipschitz gradient. The strategies proposed will result in the multistart gradient descent Subroutine 1 (Sect. 5.1.5), which, given a dual variable w, outputs a set of stationary points for (43). Then, in Sect. 5.2, we present two acceleration steps that can be added to Algorithm 1 to, in principle, enhance its performance. The first acceleration strategy, called the multiple insertion step, proceeds by adding all the outputs of Subroutine 1 to the current iterate. The second strategy, termed sliding step, consists in locally descending the target functional \(T_{\alpha ,\beta }\) at (\(\mathcal {P}\)) in a neighborhood of the curves composing the current iterate, while keeping the coefficients fixed. These strategies are finally added to Algorithm 1. The outcome is Algorithm 2, presented in Sect. 5.3, which we name dynamic generalized conditional gradient (DGCG).

5.1 Insertion Step Implementation

We aim at minimizing the linearized problem (43) in the insertion step, which is of the form

$$\begin{aligned} \min _{\gamma \in H^1([0,1];\overline{\Omega })} \left\{ 0, F(\gamma ) \right\} \,, \quad F(\gamma ):=-a_{\gamma } \int _0^1 w_t(\gamma (t)) \, \mathrm{d}t \,, \end{aligned}$$
(68)

where the dual variable \(t \mapsto w_t\) is defined for a.e. \(t \in (0,1)\) by \(w_t:=-K_t(K_t^* \tilde{\rho }_t - f_t) \in C(\overline{\Omega })\), for some curve \((t \mapsto \tilde{\rho }_t) \in C_\mathrm{w}\) and data \(f \in L^2_H\) fixed. In (68), we also identified \(\mathrm{AC}^2\) with \(H^1([0,1];\overline{\Omega })\), and employed the notations at (14). We remind the reader that although (68) admits solutions (see Sect. 4.1.1), in practice, they may be difficult to compute numerically (Remark 4). Therefore, we turn our attention to finding stationary points for the functional F at (68), relying on gradient descent methods. To make this approach feasible, we require additional assumptions on the operators \(K_t^*\) (see Assumption 3), which allow to extend the functional F to the Hilbert space \(H^1:=H^1([0,1];\mathbb {R}^d)\) and make it Fréchet differentiable, without altering the value of the minimum at (68). In particular, this allows for a gradient descent procedure to be well-defined (Sect. 5.1.1). With this at hand, in Sect. 5.1.2, we define a descent operator \(\mathcal {G}_w\) associated with F, which, for a starting curve \(\gamma \in H^1\), outputs either a stationary point of F or the infinite length curve \(\gamma _\infty \). The starting curves for \(\mathcal {G}_w\) are of two types:

  • random starts which are guided by the values of the dual variable w (Sect. 5.1.3),

  • crossovers between known stationary points (Sect. 5.1.4).

We then propose a minimization strategy for (68), which is implemented in the multistart gradient descent algorithm contained in Subroutine 1 (Sect. 5.1.5). Such algorithm inputs a set of curves and a dual variable w, and returns a set \(\mathcal {S} \subset \mathrm{AC}^2\), which is either empty or contains stationary curves for F at (68).

5.1.1 Minimization Strategy

The additional assumptions required on \(K_t^*\) are as follows.

Assumption 3

For a.e. \(t \in (0,1)\), the linear continuous operator \(K_t^* :C^{1,1}(\overline{\Omega })^* \rightarrow H_t\) satisfies

  • (F1) \(K_t^*\) is weak*-to-weak continuous, with pre-adjoint denoted by \(K_t: H_t \rightarrow C^{1,1}(\overline{\Omega })\),

  • (F2) \(\left\Vert K^*_t\right\Vert \le C\) for some constant \(C>0\) not depending on t,

  • (F3) the map \(t \mapsto K_t^*\rho \) is strongly measurable for every fixed \(\rho \in \mathcal {M}(\overline{\Omega })\),

  • (F4) there exists a closed convex set \(E \Subset \Omega \) such that \({{\,\mathrm{supp}\,}}(K_t f), {{\,\mathrm{supp}\,}}(\nabla (K_t f)) \subset E\) for all \(f \in H_t\) and a.e. \(t \in (0,1)\), where \(\nabla \) denotes the spatial gradient.

Notice that (F1)–(F3) imply (K1)–(K3) of Sect. 3.1, due to the embedding \(\mathcal {M}(\overline{\Omega }) \hookrightarrow C^{1,1}(\overline{\Omega })^*\). Also note that (F4) has no counterpart in the assumptions of Sect. 3.1, and is only assumed for computational convenience, as discussed below. The problem of solving (68) under (F1)–(F4) is addressed in Appendix 1; here, we summarize the main results obtained. First, we extend F to the Hilbert space \(H^1\), by setting \(w_t(x):=0\) for all \(x\in \mathbb {R}^d \smallsetminus \overline{\Omega }\) and a.e. \(t \in (0,1)\). Due to (F4), we have that \(w_t \in C^{1,1}(\mathbb {R}^d)\). We then show that F is continuously Fréchet differentiable on \(H^1\), with locally Lipschitz derivative (Proposition 16).

Denote by \(D_\gamma F \in (H^1)^*\), the Fréchet derivative of F at \(\gamma \) (see (A13) for the explicit computation) and introduce the set of stationary points of F with nonzero energy

$$\begin{aligned} \mathfrak {S}_F :=\{\gamma \in H^1 \, :\, F(\gamma ) \ne 0 \,, \,\, D_\gamma F = 0 \}\,. \end{aligned}$$

In Proposition 17, we prove that every \(\gamma \in \mathfrak {S}_F\) satisfies \(\gamma ([0,1]) \subset \Omega \). As a consequence, \(\mathfrak {S}_F\) contains all the solutions to the insertion problem (68) whenever

$$\begin{aligned} \min _{\gamma \in H^1([0,1];\overline{\Omega })} \{0,F(\gamma )\}<0\,. \end{aligned}$$
(69)

Remark 9

The case (69) is the only one of interest: indeed Algorithm 1 stops if (69) is not satisfied, with the current iterate being a solution to the target problem (\(\mathcal {P}\)) (see Sect. 4.1.2). In Proposition 19, we prove that (69) is equivalent to

$$\begin{aligned} P(w):=\int _0^1 \max _{x \in \overline{\Omega }} w_t(x)\, \mathrm{d}t > 0 \,. \end{aligned}$$
(70)

As P(w) is easily computable, condition (70) provides an implementable test for (69). Moreover, note that (69) is satisfied when F is computed from the dual variable \(w^n\) associated with the n-th iterate \(\mu ^n=\sum _{j=1}^{N_n}c_j^n \mu _{\gamma _j^n}\) of Algorithm 1, and \(n \ge 1\). This is because \(F(\gamma _j^n)=-1\) for all \(j=1,\ldots ,N_n\), as shown in (63).

If (69) is satisfied, we aim at computing points in \(\mathfrak {S}_F\) by gradient descent. To this end, we say that \(\{\gamma ^n\}\) in \(H^1\) is a descent sequence if

$$\begin{aligned} F(\gamma ^0)<0 \,, \,\,\,\,\, \gamma ^{n+1} = \gamma ^n - \delta _n D_{\gamma ^n} F\,, \,\,\,\, \text { for all }\,\, n \in \mathbb {N}\cup \{0\}\,, \end{aligned}$$
(71)

where \(D_{\gamma ^n}F \in (H^1)^*\) is identified with its Riesz representative in \(H^1\), and \(\{\delta _n\}\) is a sequence of stepsizes chosen according to the Armijo–Goldstein or Backtracking-Armijo rules. In Theorem 14, we prove that if \(\{\gamma ^n\}\) is a descent sequence, there exists at least one subsequence such that \(\gamma ^{n_k} \rightarrow \gamma ^*\) strongly in \(H^1\); moreover, any such accumulation point \(\gamma ^*\) belongs to \(\mathfrak {S}_F\). To summarize, descent sequences in the sense of (71) enable us to compute points in \(\mathfrak {S}_F\), which are candidate solutions to (68) whenever (69) holds.
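A minimal sketch of one such descent, here with the Backtracking-Armijo rule, reads as follows; F, grad_F and h1_norm_sq are assumed to evaluate, on a discretized curve, the extended functional, the Riesz representative of its derivative, and the (discretized) squared \(H^1\) norm, respectively.

```python
import numpy as np

def descend_curve(gamma0, F, grad_F, h1_norm_sq,
                  max_iter=200, delta0=1.0, c1=1e-4, shrink=0.5, grad_tol=1e-8):
    """Gradient descent (71) with a Backtracking-Armijo stepsize rule."""
    gamma = np.array(gamma0, dtype=float)
    for _ in range(max_iter):
        g = grad_F(gamma)
        g_sq = h1_norm_sq(g)
        if g_sq < grad_tol:                   # (approximately) stationary
            break
        delta, f_old = delta0, F(gamma)
        # Shrink the stepsize until the sufficient-decrease condition holds.
        while F(gamma - delta * g) > f_old - c1 * delta * g_sq:
            delta *= shrink
            if delta < 1e-14:                 # no admissible step found
                return gamma
        gamma = gamma - delta * g
    return gamma
```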

5.1.2 Descent Operator

Fix a dual variable w and consider the functional F defined as in (68). The descent operator \(\mathcal {G}_w :H^1([0,1];\overline{\Omega }) \rightarrow H^1 \cup \{\gamma _\infty \}\) associated with w is defined by

$$\begin{aligned} \mathcal {G}_w(\gamma ) := {\left\{ \begin{array}{ll} \gamma ^* &{} \,\, \text { if } \,\, F(\gamma )<0 \,,\\ \gamma _\infty &{} \,\, \text { otherwise}\,, \end{array}\right. } \end{aligned}$$
(72)

where \(\gamma ^*\) is an accumulation point for the descent sequence \(\{\gamma ^n\}\) defined according to (71) with starting point \(\gamma ^0:=\gamma \). In view of the discussion in the previous section, we know that the image of \(\mathcal {G}_w\) is contained in \(\mathfrak {S}_F\cup \{\gamma _\infty \}\). Note that if \(P(w)>0\), in principle, the image of \(\mathcal {G}_w\) will contain at least one stationary point \(\gamma ^* \in \mathfrak {S}_F\), as in this case (69) holds (Remark 9). However, in simulations, we can only compute \(\mathcal {G}_w\) on some finite family of curves \(\{ \gamma _i\}_{i \in I}\), which we name the starts. Thus, in general, we have no guarantee of finding points in \(\mathfrak {S}_F\), even if (69) holds. The situation improves if \(w=w^n\) is the dual variable associated with the n-th iterate \(\mu ^n=\sum _{j=1}^{N_n}c_j^n \mu _{\gamma _j^n}\) of Algorithm 1 and \(n \ge 1\). In this case, setting \(\mathcal {A}:=\{\gamma _{1}^n, \ldots , \gamma _{N_n}^n\}\), we have that \(\mathcal {G}_{w^n}(\mathcal {A}) \subset \mathfrak {S}_F\) by definition (72) and Remark 9. Therefore, by including the curves in \(\mathcal {A}\) in the set of considered starts, we are guaranteed to obtain at least one point in \(\mathfrak {S}_F\).

5.1.3 Random Starts

We now describe how we randomly generate starting points \(\gamma \) in \(H^1([0,1];\overline{\Omega })\) for the descent operator \(\mathcal {G}_w\) at (72). We start by selecting time nodes \(0=t_0< t_1< \ldots < t_T=1\) drawn uniformly in [0, 1]. (If operating in the time-discrete setting, we sample instead with a uniform probability on the finite set of sampling times on which the fidelity term of (\(\mathcal {P}_{discr}\)) is defined.) We choose the value of a random start \(\gamma \) at time \(t_i\) so as to favor regions where the dual variable \(w_{t_i}\) is large. To achieve this, let \(Q: \mathbb {R}\rightarrow \mathbb {R}_+\) be a non-decreasing function, and define the probability measure on E

$$\begin{aligned} \mathbb {P}_{w_{t_i}}(A) := \frac{\int _A Q(w_{t_i}(x)) \,\mathrm {d}x }{\int _{\Omega } Q(w_{t_i}(x))\, \mathrm {d}x }\,, \end{aligned}$$

for \(A \subset E\) Borel measurable and \(E \Subset \Omega \) introduced in Assumption 3. We then draw samples from \(\mathbb {P}_{w_{t_i}}\) with the rejection-sampling algorithm, and assign those samples to \(\gamma (t_i)\). Using that E is a convex set, the random curve \(\gamma \in \mathrm{AC}^2\) is obtained by interpolating linearly the values \(\gamma (t_i) \in E\). This procedure is executed by the routine sample, which inputs a dual variable w and outputs a randomly generated curve \(\gamma \in \mathrm{AC}^2\).

5.1.4 Crossovers Between Stationary Points

It is heuristically observed that stationary curves have a tendency to share common "routes," as for example seen in the reconstructions presented in Figs. 4 and 5. It is then a reasonable ansatz to combine curves that share routes, so as to increase the likelihood that the newly obtained crossovers share common routes with the sought global minimizers of F. Such crossovers will then be employed as starts for the descent operator \(\mathcal {G}_w\) at (72). Formally, the crossover is achieved as follows. We fix small parameters \(\varepsilon >0\) and \(0<\delta <1\). For \(\gamma _1, \gamma _2 \in \mathrm{AC}^2\) define the set

$$\begin{aligned} R_\varepsilon (\gamma _1,\gamma _2):= \left\{ t\in [0,1] \, :\, \left|\gamma _1(t) -\gamma _2(t)\right| < \varepsilon \right\} \,. \end{aligned}$$

We say that \(\gamma _1\) and \(\gamma _2\) share routes if \(R_\varepsilon (\gamma _1,\gamma _2) \ne \emptyset \). If \(R_\varepsilon (\gamma _1,\gamma _2) = \emptyset \), we perform no operations on \(\gamma _1\) and \(\gamma _2\). If instead \(R_\varepsilon (\gamma _1,\gamma _2) \ne \emptyset \), first notice that \(R_\varepsilon (\gamma _1,\gamma _2)\) is relatively open in [0, 1]. Denote by I any of its connected components. Then, I is an interval with endpoints \(t^-\) and \(t^+\), satisfying \(0\le t^- < t^+\le 1\). The crossovers of \(\gamma _1\) and \(\gamma _2\) in I are the two curves \(\gamma _3, \gamma _4 \in \mathrm{AC}^2\) defined by

$$\begin{aligned} \gamma _3(t):= {\left\{ \begin{array}{ll} \gamma _1(t), &{}\ t \in [\,0, \hat{t}- \delta \tilde{t} \,] \\ \gamma _2(t), &{}\ t \in [\,\hat{t} + \delta \tilde{t}, 1] \end{array}\right. } \,\,, \qquad \gamma _4(t):= {\left\{ \begin{array}{ll} \gamma _2(t), &{}\ t \in [\,0, \hat{t} - \delta \tilde{t}\,] \\ \gamma _1(t), &{}\ t \in [\,\hat{t} + \delta \tilde{t}, 1\,] \end{array}\right. } \end{aligned}$$

and linearly interpolated in \((\hat{t} - \delta \tilde{t}, \hat{t} + \delta \tilde{t})\), where \(\hat{t} :=(t^++t^-)/2\) and \(\tilde{t} :=(t^+-t^-)/2\), i.e.,

$$\begin{aligned} \frac{t - (\hat{t} -\delta \tilde{t})}{2\delta \tilde{t}} \gamma _2(\hat{t} + \delta \tilde{t}) - \frac{t - (\hat{t} + \delta \tilde{t})}{2\delta \tilde{t}} \gamma _1(\hat{t} - \delta \tilde{t})\,, \quad t \in (\hat{t} - \delta \tilde{t}, \hat{t} + \delta \tilde{t}) \end{aligned}$$

is, for instance, the result of the linear interpolation of \(\gamma _3\) in \((\hat{t} - \delta \tilde{t}, \hat{t} + \delta \tilde{t})\). We construct the crossovers of \(\gamma _1\) and \(\gamma _2\) in each connected component of \(R_\varepsilon (\gamma _1,\gamma _2)\) obtaining 2M new curves, with M being the number of connected components of \(R_\varepsilon (\gamma _1,\gamma _2)\). The described procedure is executed by the routine crossover, which inputs two curves \(\gamma _1, \gamma _2 \in \mathrm{AC}^2\) and outputs a set of curves in \(\mathrm{AC}^2\), possibly empty.
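In the time-discrete setting, where curves are stored through their values at the sample times, the routine crossover can be sketched as follows; the default parameters mirror the choices \(\varepsilon =0.05\), \(\delta =1/T\) used later in the experiments.

```python
import numpy as np

def crossover(g1, g2, times, eps=0.05, delta=0.02):
    """Crossovers of two discretized curves, cf. Sect. 5.1.4.

    g1, g2 : (T+1, d) arrays with the curve values at the sample times
    times  : (T+1,) array of sample times in [0, 1]
    Returns a (possibly empty) list of new (T+1, d) curves.
    """
    close = np.linalg.norm(g1 - g2, axis=1) < eps        # the set R_eps
    idx = np.flatnonzero(close)
    if idx.size == 0:
        return []
    new_curves = []
    # Connected components of R_eps (consecutive runs of indices).
    for comp in np.split(idx, np.where(np.diff(idx) > 1)[0] + 1):
        t_minus, t_plus = times[comp[0]], times[comp[-1]]
        t_hat, t_tilde = 0.5 * (t_plus + t_minus), 0.5 * (t_plus - t_minus)
        lo, hi = t_hat - delta * t_tilde, t_hat + delta * t_tilde
        for a, b in ((g1, g2), (g2, g1)):
            child = np.where((times <= lo)[:, None], a,
                             np.where((times >= hi)[:, None], b, 0.0))
            mid = (times > lo) & (times < hi)
            if hi > lo and np.any(mid):
                # Linear interpolation inside the switching window (lo, hi).
                lam = ((times[mid] - lo) / (hi - lo))[:, None]
                a_lo = np.array([np.interp(lo, times, a[:, k]) for k in range(a.shape[1])])
                b_hi = np.array([np.interp(hi, times, b[:, k]) for k in range(b.shape[1])])
                child[mid] = (1 - lam) * a_lo + lam * b_hi
            new_curves.append(child)
    return new_curves
```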

5.1.5 Multistart Gradient Descent Algorithm

In Subroutine 1, we sketch the proposed method to search for a minimizer of (68): this is implemented in the function MultistartGD, which inputs a set of curves \(\mathcal {A}\) and a dual variable w, and outputs a (possibly empty) set \(\mathcal {S}\) of stationary points for F defined at (68). We now describe how to interpret such subroutine in the context of Algorithm 1, and how it can be used to replace the insertion step operation at line 4.

Given the n-th iterate \(\mu ^n=\sum _{j=1}^{N_n}c_{j}^n \mu _{\gamma _j^n}\) of Algorithm 1, we define \(\mathcal {A}:=\{\gamma _{1}^{n}, \ldots , \gamma _{N_n}^n\}\) if \(n \ge 1\) and \(\mathcal {A}:=\emptyset \) if \(n=0\). The dual variable \(w^n\) is as in (40). We initialize to empty the sets \(\mathcal {S}\) and \(\mathcal {O}\) of known stationary and crossover points, respectively. The condition at line 2 of Subroutine 1 checks whether we are at the 0-th iteration and \(P(w^0)\le 0\). If this is the case, then 0 is the minimum of (68), and no stationary point is returned; indeed in this situation, Algorithm 1 stops, with \(\mu ^0=0\) being the minimum of (\(\mathcal {P}\)) (see Sect. 5.1.1). Otherwise, the set \(\mathcal {A}\) (possibly empty) is inserted in \(\mathcal {O}\). Then, \(N_\mathrm{max}\) initializations of the multistart gradient descent are performed, where a starting point \(\gamma \) is either chosen from the crossover set \(\mathcal {O}\), if the latter is nonempty, or sampled at random by the function sample described in Sect. 5.1.3. We then descend \(\gamma \), obtaining the new curve \(\gamma ^*:=\mathcal {G}_{w^n}(\gamma )\), where \(\mathcal {G}_{w^n}\) is defined at (72). If the resulting curve \(\gamma ^*\) does not belong to the set of known stationary points \(\mathcal {S}\), and \(\gamma ^* \ne \gamma _{\infty }\), then we first compute the crossovers of \(\gamma ^*\) with all the elements of \(\mathcal {S}\), and afterward insert \(\gamma ^*\) in \(\mathcal {S}\). After \(N_\mathrm{max}\) iterations, the set \(\mathcal {S}\) is returned. Notice that \(\mathcal {S}\) can be empty only if \(\mathcal {A}=\emptyset \), i.e., if MultistartGD is called at the first iteration of Algorithm 1 (see Sect. 5.1.2).

The modified insertion step, that is, line 4 of Algorithm 1, reads as follows. First, we set \(\mathcal {A}=\varvec{\gamma }\) and compute \(\mathcal {S}:=\texttt {MultistartGD}(\mathcal {A},w^n)\). If \(\mathcal {S}=\emptyset \), the algorithm stops and returns the current iterate \(\mu ^n\). Otherwise, the element in \(\mathcal {S}\) with minimal energy with respect to F is chosen as candidate minimizer, and inserted as \(\gamma ^*\) in line 4.

[Subroutine 1: multistart gradient descent (pseudo-code)]
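For orientation, the structure of Subroutine 1 can be condensed into the following schematic loop; descend plays the role of \(\mathcal {G}_{w^n}\) (returning None in place of \(\gamma _\infty \)), while sample, crossover and P are the routines discussed above. All signatures are assumptions about the concrete implementation.

```python
import numpy as np

def multistart_gd(A, w, descend, sample, crossover, P, n_max=20):
    """Multistart gradient descent (Subroutine 1), schematic version.

    A : list of curves of the current iterate (empty at iteration 0)
    w : current dual variable
    Returns a (possibly empty) list S of stationary curves for F.
    """
    if not A and P(w) <= 0:
        return []                      # 0 already minimizes (68): stop
    S, O = [], list(A)                 # known stationary points, pending starts

    def is_known(g):
        return any(np.allclose(g, s) for s in S)

    for _ in range(n_max):
        gamma = O.pop() if O else sample(w)
        gamma_star = descend(gamma, w)
        if gamma_star is None or is_known(gamma_star):
            continue
        for s in S:                    # propose crossovers with known points
            O.extend(crossover(gamma_star, s))
        S.append(gamma_star)
    return S
```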

5.2 Acceleration Strategies

In this section, we describe two acceleration strategies that can be incorporated in Algorithm 1.

5.2.1 Multiple Insertion Step

This is an extension of the insertion step for Algorithm 1 described in Sect. 4.1.2. Precisely, the multiple insertion step consists in inserting into the current iterate \(\mu ^n\) all the atoms associated with the stationary points in the set \(\mathcal {S}\) produced by Subroutine 1 with respect to the dual variable \(w^n\). This procedure is motivated by the following observations. First, computationally speaking, the coefficients optimization step described in Sect. 4.1.3 is cheap and fast. Second, it is observed that stationary points are good candidates for the insertion step in the GCG method presented in [51] to solve (4). This observation extends naturally to our framework: inserting multiple stationary points encourages the iterates to concentrate around every atom of the ground-truth measure \(\mu ^\dagger \), so that the algorithm may need fewer iterations to efficiently locate the support of \(\mu ^\dagger \).

5.2.2 Sliding Step

The sliding step proposed in this paper is a natural extension of the one introduced for BLASSO in [18] and further analyzed in [30]. Precisely, given a current iterate \(\mu = \sum _{j=1}^{N} c_j \mu _{\gamma _j}\) of Algorithm 1, we fix the weights \( \varvec{c} = (c_1 ,\ldots , c_{N})\) and define the functional \(T_{\alpha ,\beta ,\varvec{c}} : (H^1)^{N} \rightarrow \mathbb {R}\) as

$$\begin{aligned} T_{\alpha ,\beta ,\varvec{c}}(\eta _1,\ldots , \eta _{N}) := T_{\alpha ,\beta } \left( \sum _{j=1}^{N} c_j \mu _{\eta _j} \right) \quad \text {for all} \quad (\eta _1,\ldots , \eta _{N}) \in \left( H^1\right) ^{N}\,. \end{aligned}$$
(73)

We then perform additional gradient descent steps in the space \((H^1)^{N}\) for the functional \(T_{\alpha ,\beta ,\varvec{c}}\), starting from the tuple of curves \((\gamma _1,\ldots ,\gamma _N)\) contained in the current iterate \(\mu \). This procedure is well-defined: in Proposition 20, we prove that \(T_{\alpha ,\beta ,\varvec{c}}\) is continuously Fréchet differentiable in \((H^1)^{N}\) under Assumption 3, with derivative given by (A42). This step can be intertwined with the coefficients optimization one, by alternating between modifying the position of the current curves and optimizing their associated weights, as sketched below.
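Schematically, the intertwining mentioned above can be organized as follows; slide_once (one or more gradient steps on \(T_{\alpha ,\beta ,\varvec{c}}\)) and optimize_coefficients (the quadratic program of Sect. 4.1.3) are placeholders for the concrete routines.

```python
def slide_and_optimize(curves, weights, slide_once, optimize_coefficients, k_max=5):
    """Alternate sliding of the curves with re-optimization of their weights."""
    for _ in range(k_max):
        curves = slide_once(curves, weights)              # descend T_{a,b,c} in (H^1)^N
        curves, weights = optimize_coefficients(curves)   # QP of Sect. 4.1.3
    return curves, weights
```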

5.3 Full Algorithm

By including Subroutine 1 and the proposed acceleration steps of Sect. 5.2 into Algorithm 1, we obtain Algorithm 2, which we name the dynamic generalized conditional gradient (DGCG).

5.3.1 Algorithm Summary

We employ the notations of Sect. 4.1.4. In particular, Algorithm 2 generates, as iterates, tuples of coefficients \(\varvec{c} = (c_1, \ldots , c_{N})\) with \(c_j>0\), and curves \(\varvec{\gamma } = (\gamma _1, \ldots , \gamma _{N})\) with \(\gamma _j \in \mathrm{AC}^2\). We now summarize the main steps of Algorithm 2. The first 3 lines are unaltered from Algorithm 1, and they deal with tuple initialization and the assembly of the measure iterate \(\mu ^n\). The multiple insertion step is carried out by Subroutine 1, via the function MultistartGD at line 4, which is called with arguments \(\varvec{\gamma }\) and \(w^n\). The output is a set of curves \(\mathcal {S}\) which contains stationary points for the functional F at (68) with respect to the dual variable \(w^n\). If \(\mathcal {S}=\emptyset \), the algorithm stops and outputs the current iterate \(\mu ^n\). As observed in Sect. 5.1.2, this can only happen at the first iteration of the algorithm. Otherwise, in line 7, the function order sorts the elements of \(\mathcal {S}\) increasingly with respect to their value of F and returns the corresponding tuple of curves. As anticipated in Sect. 5.1, the first element of \(\texttt {order}(\mathcal {S})\), named \(\gamma _{N_n + 1}^*\), is considered to be the best available candidate solution to the insertion step problem (68), and is used in the stopping condition at lines 8, 9. A similar stopping condition is used in Algorithm 1, with the difference that in Algorithm 2 the curve \(\gamma _{N_n + 1}^*\) is not necessarily a global minimizer of F, but, in general, just a stationary point. We further discuss such stopping criterion in Sect. 5.3.2. Afterward, the found stationary points are inserted in the current iterate, and the algorithm alternates between the coefficients optimization step (Sect. 4.1.3) and the sliding step (Sect. 5.2.2). These operations are executed in lines 11 to 16 of Algorithm 2, and repeated \(K_\mathrm{max}\) times.

5.3.2 Stopping Condition

The stopping condition for Algorithm 2 is implemented in lines 8, 9: when \(\gamma _{N_n + 1}^*\) satisfies \( \langle \rho _{\gamma _{N_n+1}^*}, w^n \rangle \le 1\), the algorithm stops and outputs the current iterate \(\mu ^n\). Due to the definition of F, such condition is equivalent to saying that Subroutine 1 has not been able to find any curve \(\gamma ^* \in \mathrm{AC}^2\) satisfying \(\langle \rho _{\gamma ^*}, w^n\rangle > 1 \). The main difference when compared to the stopping condition for Algorithm 1 (Sect. 4.1.2) is that in Algorithm 2, the curve \(\gamma _{N_n + 1}^*\) is generally not a global minimizer of F. As a consequence, Lemma 8 does not hold, and the condition \(\langle \rho _{\gamma _{N_n + 1}^*}, w^n\rangle \le 1\) is not equivalent to the minimality of the current iterate \(\mu ^n\) for (\(\mathcal {P}\)). The evident drawback is that Algorithm 2 could stop even if the current iterate does not solve (\(\mathcal {P}\)). However, it is at least possible to say that if Algorithm 2 continues after line 9, then the current iterate does not solve (\(\mathcal {P}\)); thus, in this situation, the correct decision is taken. We remark that even if the stopping condition for Algorithm 2 does not certify the minimality of the output, from a practical standpoint, the reconstruction is satisfactory whenever Subroutine 1 is employed with a high number of restarts. We also point out that the condition at line 8 can be replaced by the quantitative condition defined at (59) in Remark 7, with some predefined tolerance.

5.3.3 Convergence and Numerical Residual

In the following, we will say that Algorithm 2 converges if \(\texttt {MultistartGD}(\varvec{\gamma },w^0) = \emptyset \), or if some iterate satisfies the stopping condition at line 8. In case of convergence, we will denote by \(\mu ^{N}\) the output value of Algorithm 2. We remind the reader that due to the discussion in Sect. 5.3.2, \(\mu ^N\) is considered to be an approximate solution for the minimization problem (\(\mathcal {P}\)). In order to analyze the convergence rate for Algorithm 2 numerically, we define the numerical residual as

$$\begin{aligned} \tilde{r}(\mu ^n) := T_{\alpha ,\beta }(\mu ^n) - T_{\alpha ,\beta }(\mu ^N)\,, \qquad n = 0,\ldots ,N-1\,, \end{aligned}$$
(74)

with \(\mu ^n\) each of the computed intermediate iterates. We further define the numerical primal-dual gap \(\tilde{G}(\mu ^n)\), which we compute by employing (58) with \(\gamma ^*=\gamma ^*_{N_n+1}\).

[Algorithm 2: dynamic generalized conditional gradient (pseudo-code)]

6 Numerical Implementation and Experiments

In this section, we present the numerical experiments. In order to lower the computational cost, we chose to implement Algorithm 2 for the minimization of the time-discrete functional (\(\mathcal {P}_{discr}\)) discussed in Sect. 4.5. The adaptation of Algorithm 2 to such setting follows directly from the discussion in Sect. 4.5.2. The simulations were produced by a Python code that is openly available at https://github.com/panchoop/DGCG_algorithm/. For all the simulations, we employ the following:

  • the considered domain is \(\Omega := (0,1)\times (0,1) \subset \mathbb {R}^2\),

  • the number of time samples is fixed to \(T = 50\), with \(t_i := i/T\) for \(i =0,\ldots ,T\),

  • the data spaces \(H_{t_i}\) and forward operators \(K_{t_i}^* :\mathcal {M}(\overline{\Omega }) \rightarrow H_{t_i}\) are as in Sect. 6.1.1, and model a Fourier transform with time-dependent spatial undersampling. A specific choice of sampling pattern will be made in each experiment,

  • in each experiment, problem (\(\mathcal {P}_{discr}\)) is considered for specific choices of regularization parameters \(\alpha ,\beta >0\) and data \(f=(f_{t_0},\ldots , f_{t_T})\) with \(f_{t_i} \in H_{t_i}\),

  • convergence for Algorithm 2 is intended as in Sect. 5.3.3. The number of restarts \(N_\mathrm{max}\) for Subroutine 1 is stated in each experiment. Also, we employ the quantitative stopping condition for Algorithm 2 described in Remark 7, with tolerance \(\mathrm{TOL}:=10^{-10}\).

  • the crossover parameters described in Sect. 5.1.4 are chosen as \(\varepsilon =0.05\) and \(\delta = 1/T\). The random starts presented in Sect. 5.1.3 are implemented using the function \(Q(x) = \exp (\max (x + 0.05,0)) - 1\). These parameter choices are purely heuristic, and there is no reason to believe that they are optimal.

The remainder of the section is organized as follows. In Sect. 6.1, we first introduce the measurement spaces and Fourier-type forward operators employed in the experiments. After, we explain how the synthetic data are generated in the noiseless case, and then detail the noise model we consider. Subsequently, we show how the data and the obtained reconstructions can be visualized, by means of the so-called backprojections and intensities. We then pass to the actual experiments in Sect. 6.2, detailing three of them. The first experiment (Sect. 6.2.1), which is basic in nature, serves the purpose of illustrating how the measurements are constructed and how the data can be visualized. We then showcase the reach of the proposed regularization and algorithm in the second example (Sect. 6.2.2). There, we consider more complex data with various levels of added noise. In particular, we show that the proposed dynamic setting is capable of reconstructing severely spatially undersampled data. The final experiment (Sect. 6.2.3) illustrates a particular behavior of our model when reconstructing sparse measures whose underlying curves cross, as discussed in Remark 3. We point out that in all the experiments presented, the algorithm converged faster than the sublinear rate predicted by Theorem 7, indeed linearly. We conclude the section with a few general observations on the model and algorithm proposed.

6.1 Measurements, Data, Noise Model and Visualization

6.1.1 Measurements

The measurements employed in the experiments are given by the time-discrete version, in the sense of Sect. 4.5, of the spatially undersampled Fourier measurements introduced in Sect. 1. Such measurements will be suitably cut off, in order to avoid dealing with boundary effects in the insertion step (Sects. 4.1.2, 5.2.1) and the sliding step (Sect. 5.2.2). Precisely, at each time instant \(t_i\), we sample \(n_i \in \mathbb {N}\) frequencies, encoded in the given vector \(S_i=(S_{i,1},\ldots ,S_{i,n_i}) \in (\mathbb {R}^2)^{n_i}\), for all \(i \in \{0,\ldots ,T\}\). The sampling spaces \(H_{t_i}\) are defined as the realification of \(\mathbb {C}^{n_{i}}\), equipped with the inner product \( \langle u, v \rangle _{H_{t_i}} := \mathrm{Re} \langle u, v \rangle _{\mathbb {C}^{n_i}}/n_i\), where \(\mathrm{Re}\) denotes the real part of a complex number. According to (A50), we define the cut-off Fourier kernels \(\psi _{t_i} :\mathbb {R}^2 \rightarrow \mathbb {C}^{n_i}\) by

$$\begin{aligned} \psi _{t_i}(x) := \left( \exp (-2\pi i x \cdot S_{i,k})\chi (x_1)\chi (x_2) \right) _{k=1}^{n_i}, \end{aligned}$$

where \(\chi :[0,1] \rightarrow [0,1]\) is a twice differentiable cut-off function: it is strictly increasing in [0, 0.1], strictly decreasing in [0.9, 1], and satisfies \(\chi (0) = \chi (1) = 0\) and \(\chi (z) = 1\) for all \(z \in [0.1,0.9]\). Following (A49), the cut-off undersampled Fourier transform and its pre-adjoint are given by the linear continuous operators \(K_{t_i}^* :\mathcal {M}(\overline{\Omega }) \rightarrow H_{t_i}\) and \(K_{t_i} :H_{t_i} \rightarrow C(\overline{\Omega })\) defined by

$$\begin{aligned} K_{t_i}^*(\rho ) := \int _{\mathbb {R}^2} \psi _{t_i}(x) \, d \rho (x)\,, \qquad K_{t_i}(h) := \left( \ x \mapsto \langle \psi _{t_i}(x), h \rangle _{H_{t_i}} \right) \,, \end{aligned}$$
(75)

for all \(\rho \in \mathcal {M}(\overline{\Omega })\), \(h \in H_{t_i}\), where \(\rho \) is extended by zero outside of \(\overline{\Omega }\), and the first integral is intended component-wise.
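As an illustration, the measurement of a Dirac mass, which is all that is needed for the sparse iterates of the algorithm, reduces to evaluating the cut-off Fourier kernel. The quintic smoothstep used below in place of \(\chi \) is only an assumption with the properties listed above; the cut-off used in the released code may differ.

```python
import numpy as np

def smooth_cutoff(z):
    """A twice differentiable cut-off on [0, 1]: zero at the endpoints and
    equal to one on [0.1, 0.9] (a stand-in for the function chi above)."""
    def ramp(s):                           # C^2 ramp from 0 to 1 on [0, 1]
        s = np.clip(s, 0.0, 1.0)
        return s ** 3 * (10 - 15 * s + 6 * s ** 2)
    return ramp(z / 0.1) * ramp((1.0 - z) / 0.1)

def Kstar_delta(S_i, x):
    """K*_{t_i} delta_x: cut-off Fourier measurements (75) of a Dirac at x.

    S_i : (n_i, 2) array of sampled frequencies at time t_i
    x   : point of the unit square, as a length-2 array
    """
    phase = np.exp(-2j * np.pi * (S_i @ x))
    return phase * smooth_cutoff(x[0]) * smooth_cutoff(x[1])

def inner_H(u, v):
    """Realified inner product <u, v>_{H_{t_i}} = Re <u, v>_{C^{n_i}} / n_i."""
    return np.real(np.vdot(v, u)) / u.size
```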

6.1.2 Data

For all the experiments, the ground-truth consists of a sparse measure of the form

$$\begin{aligned} \mu ^\dagger =(\rho ^\dagger ,m^\dagger ):= \sum _{j=1}^N c^\dagger _j \mu _{\gamma ^\dagger _j}\,, \end{aligned}$$
(76)

for some \(N \in \mathbb {N}\), \(c^\dagger _j>0\), \(\gamma ^\dagger _j \in \mathrm{AC}^2\), where we follow the notations at (14). Given a ground-truth \(\mu ^\dagger \), the respective noiseless data is constructed by \(f_{t_i} := K_{t_i}^* \rho ^\dagger _{t_i} \in H_{t_i}\), for \(i =0,\ldots ,T\).

6.1.3 Noise Model

Let \(U_{i,k}, V_{i,k}\), for \(i = 0,\ldots ,T\) and \(k = 1,\ldots , n_i\), be realizations of two jointly independent families of standard one-dimensional Gaussian random variables, with which we define the noise vector \(\nu \) by

$$\begin{aligned} \nu _{i,k} := U_{i,k} + \mathrm {i} V_{i,k} \in \mathbb {C}, \quad \nu _{t_i} := (\nu _{i,k})_{k=1}^{n_i} \in H_{t_i}, \quad \nu := (\nu _0, \nu _1, \ldots ,\nu _T)\,. \end{aligned}$$
(77)

Given data \(f=(f_{t_0},\ldots ,f_{t_T})\), with \(f_{t_i} \in H_{t_i}\), and assuming \(\nu \ne 0\), the corresponding noisy data \(f^\varepsilon \) at noise level \(\varepsilon \ge 0\) are defined as

$$\begin{aligned} f^\varepsilon := f + \varepsilon \ \sqrt{ \frac{ \sum _{i=0}^T \left\Vert f_{t_i}\right\Vert _{H_{t_i}}^2}{ \sum _{i=0}^T \left\Vert \nu _{t_i}\right\Vert _{H_{t_i}}^2} } \ \nu \,. \end{aligned}$$
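In code, this rescaling amounts to normalizing a complex Gaussian draw by the ratio of the data and noise \(H\)-norms; the following is a minimal sketch (a hypothetical helper, independent of the released implementation).

```python
import numpy as np

def add_relative_noise(data, eps, rng=None):
    """Return f^eps = f + eps * sqrt(sum ||f_{t_i}||^2 / sum ||nu_{t_i}||^2) * nu,
    with ||u||^2_{H_{t_i}} = sum_k |u_k|^2 / n_i and nu complex standard Gaussian."""
    rng = np.random.default_rng() if rng is None else rng
    noise = [rng.standard_normal(len(f)) + 1j * rng.standard_normal(len(f))
             for f in data]
    sq_norm = lambda vecs: sum(np.sum(np.abs(v) ** 2) / len(v) for v in vecs)
    scale = eps * np.sqrt(sq_norm(data) / sq_norm(noise))
    return [f + scale * nu for f, nu in zip(data, noise)]
```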

6.1.4 Visualization via Backprojection

In general, it is not very informative to visualize the data directly. The proposed way to gain insight into the data structure is by means of backprojections: given \(f = (f_{t_0},\ldots , f_{t_T})\), with \(f_{t_i} \in H_{t_i}\), we call the function \(w_{t_i}^0:= K_{t_i}f_{t_i} \in C(\overline{\Omega })\) the backprojection of \(f_{t_i}\). Note that \(w_{t_i}^0\) corresponds to the dual variable at the first iteration of Algorithm 2. As \(\Omega = (0,1)^2\), such functions can be plotted at each time sample, allowing us to display the data.
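With the operator sketch of Sect. 6.1.1, the backprojection at a single time sample can be evaluated on a pixel grid and displayed, for instance with the hypothetical helper below (matplotlib is used only for display; `f_i` and `freqs_i` denote the measurement vector and frequencies of one time sample).

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_backprojection(f_i, freqs_i, npix=128):
    """Evaluate w^0_{t_i} = K_{t_i} f_{t_i} on an npix x npix grid of (0,1)^2 and display it."""
    xs = np.linspace(0.0, 1.0, npix)
    grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)  # columns: (x, y)
    w0 = K(f_i, freqs_i, grid).reshape(npix, npix)
    plt.imshow(w0, origin="lower", extent=(0, 1, 0, 1))
    plt.colorbar()
    plt.show()
```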

6.1.5 Reconstruction’s Intensities

Given a sparse measure \(\mu := \sum _{j=1}^N c_j \mu _{\gamma _j}\), with \(N \in \mathbb {N}\), \(c_j >0\), \(\gamma _j \in \mathrm{AC}^2\), the intensity associated with the atom \(\mu _{\gamma _j}\) is defined as \(I_{j} := c_j a_{\gamma _j}\). The quantity \(I_j\) measures the strength of the signal generated by a single source at each time \(t_i\), since \(K^*_{t_i}(c_j \rho _{\gamma _j(t_i)})=I_j K^*_{t_i}(\delta _{\gamma _j(t_i)})\). Therefore, when presenting reconstructions and comparing them to the given ground-truth, we will use the intensity \(I_j\) of each atom instead of its associated weight \(c_j\).

6.2 Numerical Experiments

6.2.1 Experiment 1: Single Atom with Constant Speed

We start with a basic example that serves to illustrate how to observe the data, the obtained reconstructions, and the distortions induced by the employed regularization. We use constant-in-time forward Fourier-type measurements, with frequencies sampled from an Archimedean spiral: for each sampling time \(i \in \{0,\ldots ,50\}\), we consider the same frequency vector \(S_{i} \in (\mathbb {R}^2)^{n_i}\) with \(n_i := 20\) and \(S_{i,k}\) lying on a spiral for \(k = 1,\ldots ,20\) (see Fig. 1). Thus, the corresponding forward operators defined by (75) are constant in time, i.e., \(K_{t_i}^*=K^*\). The employed ground-truth \(\mu ^\dagger \) is composed of a single atom with intensity \(I^\dagger = 1\) and associated curve \(\gamma ^\dagger (t) := (0.2,0.2) + t (0.6,0.6)\). Accordingly, we have \(\mu ^\dagger = c^\dagger \mu _{\gamma ^\dagger }\) with \(c^\dagger = 1/a_{\gamma ^\dagger }\). We consider the case of noiseless data \(f_{t_i} := K^* \rho ^\dagger _{t_i}\). The corresponding backprojection \(w_{t_i}^0:=Kf_{t_i}\) is shown in Fig. 1 for some selected time samples. For the proposed data f, we solve the minimization problem (\(\mathcal {P}_{discr}\)) employing Algorithm 2. The obtained reconstructions are shown in Fig. 2, where we display the results for two different choices of the parameters \(\alpha ,\beta \). Given the simplicity of the considered example, we employed Subroutine 1 with only 5 restarts, that is, \(N_\mathrm{max}=5\). For the respective parameter choices \(\alpha =\beta = 0.1\) and \(\alpha =\beta = 0.4\), Algorithm 2 converged in 2 and 1 iterations, with execution times of 2 and 1 minutes (the employed CPU was an Apple M1 8 Core 3.2 GHz, running native arm64 Python 3.9.7).
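The exact spiral used in the experiment is a design choice; the sketch below simply places 20 frequencies on an Archimedean spiral \(r = a\theta \) with illustrative constants `a` and `max_angle` (not the values of our implementation), and repeats the same vector at all 51 time samples.

```python
import numpy as np

def spiral_frequencies(n=20, a=0.5, max_angle=4 * np.pi):
    """n frequencies on the Archimedean spiral r = a * theta (illustrative constants)."""
    theta = np.linspace(0.0, max_angle, n)
    return np.column_stack((a * theta * np.cos(theta), a * theta * np.sin(theta)))

# Time-constant measurements: the same frequency vector S_i at every sample t_i
S = [spiral_frequencies() for _ in range(51)]
```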

Fig. 1
figure 1

Time-constant frequency samples \(\{S_{i,k}\}_{k}\) and backprojections \(w_{t_i}^0\) for data \(f_{t_i} := K^*\rho ^\dagger _{t_i}\) at times \(t_0 = 0,\ t_{25} = 0.5,\ t_{50} = 1\), with the ground-truth curve \(\gamma ^\dagger \) superimposed in blue color (Color figure online)

Fig. 2
figure 2

Reconstruction results for Experiment 1. We use color to indicate position in time and transparency to indicate intensity of the respective atom. From left to right, we plot the employed ground-truth, the obtained reconstruction with the specified parameters, and then the superimposition of ground-truth and reconstruction

As is common for Tikhonov regularization methods, the obtained reconstructions differ from the ground-truth due to the effect of regularization. Specifically, in Fig. 2c, we notice a stronger effect of the regularization with parameters \(\alpha =\beta =0.4\) at the endpoints of the reconstructed curve. This phenomenon is expected: as argued in Sect. 5.1, each curve found by Subroutine 1 belongs to the set of stationary points \(\mathfrak {S}_F\); by the optimality conditions proven in Proposition 17, any such curve has zero initial and final speed, i.e., \(\dot{\gamma }(0) = \dot{\gamma }(1) = 0\). This constraint is met with a slower transition for larger speed penalizations \(\beta \). We can quantify the discrepancy between the ground-truth curve \(\gamma ^\dagger \) and a reconstructed curve \(\overline{\gamma }\) with respect to the \(L^2\) norm by computing \(D(\gamma ^\dagger , \overline{\gamma }) := \left\Vert \gamma ^{\dagger } - \overline{\gamma }\right\Vert _{L^2}/\left\Vert \gamma ^{\dagger }\right\Vert _{L^2}\). We obtain \(D(\gamma ^\dagger , \overline{\gamma }) = 0.00515\) for \(\alpha =\beta = 0.1\) and \(D(\gamma ^\dagger , \overline{\gamma }) = 0.017\) for \(\alpha =\beta = 0.4\). The reconstructed intensities for the parameter choices \(\alpha =\beta = 0.1\) and \(\alpha =\beta = 0.4\) are \(87\%\) and \(48\%\) of the ground-truth’s intensity, respectively, as shown in Fig. 2.
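In practice, the discrepancy \(D\) is evaluated on the common time grid; a short sketch of this computation, assuming both curves are stored as arrays of sampled positions, is the following.

```python
import numpy as np

def relative_L2_discrepancy(gamma_true, gamma_rec):
    """D(gamma^dagger, gamma_bar) = ||gamma^dagger - gamma_bar||_{L^2} / ||gamma^dagger||_{L^2},
    with both curves sampled as (T+1, 2) arrays on the uniform grid t_i = i / T."""
    t = np.linspace(0.0, 1.0, len(gamma_true))
    diff2 = np.sum((gamma_true - gamma_rec) ** 2, axis=1)
    ref2 = np.sum(gamma_true ** 2, axis=1)
    return np.sqrt(np.trapz(diff2, t) / np.trapz(ref2, t))
```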

6.2.2 Experiment 2: Complex Example with Time-Varying Measurements

The following example showcases the full strength of the proposed regularization and algorithm. The frequencies are sampled over lines through the origin of \(\mathbb {R}^2\), which rotate in time. Specifically, let \(\Theta \in \mathbb {N}\) be the number of distinct sampling lines, \(h > 0\) a fixed spacing between measured frequencies on a given line, and consider frequencies \(S_i \in (\mathbb {R}^2)^{n_i}\) defined by

$$\begin{aligned} S_{i,k} := \begin{pmatrix} \cos ( \theta _i ) &{} \sin (\theta _i) \\ -\sin (\theta _i) &{} \cos (\theta _i) \end{pmatrix} \begin{pmatrix} h (k-(n_i+1)/2) \\ 0 \end{pmatrix}, \nonumber \\ i \in \{0,\ldots ,50\},\ k \in \{1,\ldots ,n_i\}\,. \end{aligned}$$
(78)

Here, the matrix represents a rotation of angle \(\theta _i\), with \(\theta _i := \frac{i}{\Theta }\pi \). For such frequencies, we consider the associated forward operators \(K_{t_i}^*\) as in (75).
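For illustration, the frequency vectors in (78) can be generated as follows (a small numpy sketch with the parameter names of the experiment; `line_frequencies` is a hypothetical helper, not part of the released code).

```python
import numpy as np

def line_frequencies(i, n_i=15, h=1.0, Theta=4):
    """Frequencies S_i of (78): n_i equispaced points on a line through the
    origin, rotated by the angle theta_i = i * pi / Theta."""
    theta = i * np.pi / Theta
    R = np.array([[np.cos(theta), np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])
    offsets = h * (np.arange(1, n_i + 1) - (n_i + 1) / 2)  # symmetric around the origin
    line = np.column_stack((offsets, np.zeros(n_i)))       # points on the horizontal axis
    return line @ R.T                                      # apply the rotation to each point

S = [line_frequencies(i) for i in range(51)]
```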

In the presented experiment, the parameters are chosen as \(\Theta = 4\), \(h=1\) and \(n_i = 15\) for all \(i = 0,\ldots ,50\): in other words, we sample along four different lines which rotate at each time sample; see Fig. 3 for two examples. The ground-truth is a measure \(\mu ^\dagger \) composed of three atoms, whose associated curves are shown in Fig. 4a. Note that \(\mu ^\dagger \) features different non-constant speeds, a contact point between two underlying curves, a strong kink, and intensities equal to 1. We point out that equal intensities are chosen for the sole purpose of easing graph visualization; the reconstruction quality is not affected by different intensity choices. The noiseless data are defined by \(f_{t_i}:=K_{t_i}^* \rho _{t_i}^\dagger \). Following the noise model described in Sect. 6.1.3, we also consider data \(f^{0.2}\) and \(f^{0.6}\) with 20% and 60% relative noise added, respectively. For the noisy data, we employed the same realization of the randomly generated noise vector \(\nu \) defined in (77). In Fig. 3, we present the backprojections for the noiseless and noisy data at two different times. We can observe that, at each time sample, the backprojected data exhibit a line-constant behavior, as explained in Remark 10.

Fig. 3
figure 3

Sampled frequencies and backprojected data for Experiment 2 at times \(t_{25} = 0.5\), \(t_{50} = 1\). The backprojected data are displayed for noiseless data and for 20% and 60% relative noise, i.e., \(f^{0.2}\) and \(f^{0.6}\), respectively, with the ground-truth curves superimposed in blue color

We apply Algorithm 2 to obtain reconstructions for the noiseless data and for the data with 20% added noise, with parameters \(\alpha =\beta =0.1\) (see Fig. 4). In Fig. 5, we present the reconstructions obtained for 60% added noise, where we further show the regularization effect of employing larger values of \(\alpha ,\beta \), namely \(\alpha =\beta =0.1\) and \(\alpha =\beta =0.3\).

Fig. 4
figure 4

Reconstruction results for Experiment 2, with noiseless data and with 20% relative noise, for parameters \( \alpha =\beta = 0.1\). The intensities of the reconstructed atoms are represented by the rightmost black and grey tick-lines in the colorbars

In the noiseless case, presented in Fig. 4b, we observe an accurate reconstruction, with some low-intensity artifacts that, for most of the time interval, share paths with the higher-intensity atoms. In Fig. 4c, where 20% of noise is added to the data, we notice a surge of low-intensity artifacts; nonetheless, the obtained solution is close, in the sense of measures, to the original ground-truth. In Fig. 5, for the case of 60% added noise, we notice that increasing the regularization parameters improves the quality of the obtained reconstruction, at the price of small regularization-induced distortions. The examples in Figs. 4 and 5 demonstrate the strength of the proposed regularization: accurate reconstructions are obtained even when highly ill-posed forward measurements (see Remark 10) are combined with strong noise.

Fig. 5
figure 5

Reconstruction results for Experiment 2, with 60% of relative noise. Algorithm 2 is applied to the same data, with regularization parameter choices \(\alpha =\beta =0.1\) and \(\alpha =\beta =0.3\)

We finalize this example by presenting the convergence plots for the case of 60% added noise, this being the most complex experiment (see Fig. 6). We plot the numerical residual \(\tilde{r}(\mu ^n)\) and the numerical primal-dual gap \(\tilde{G}(\mu ^n)\) for the iterates \(\mu ^n\), where \(\tilde{r}\) and \(\tilde{G}\) are defined in Sect. 5.3.3. We observe that the algorithm exhibits a linear rate of convergence, instead of the proven sublinear one. Such linear convergence has been numerically observed in all the tested examples. Additionally, we see that the algorithm is greatly accelerated when stronger regularization parameters \(\alpha ,\beta \) are employed. Finally, the plot confirms the efficacy of the proposed descent strategy for the insertion step (68); indeed, the inequality \(\tilde{r}(\mu ^n)\le \tilde{G}(\mu ^n)\) holds for most of the iterations in the performed experiments. As this inequality is proven in Lemma 5 for the exact residual and primal-dual gap, this confirms that \(\gamma ^*_{N_n+1}\) in Algorithm 2 is a good approximate solution of (68). Execution times, iterations until convergence, and the number of restarts are summarized in Table 1.

Fig. 6
figure 6

Convergence plot for Experiment 2 with 60% of added noise and parameter choices \(\alpha =\beta =0.1\) and \(\alpha =\beta =0.3\). For each iterate, we plot the numerical residual \(\tilde{r}(\mu ^n)\), together with the corresponding numerical primal-dual gap \(\tilde{G}(\mu ^n)\)

Table 1 Convergence information and execution times for Experiment 2

Remark 10

Consider the Fourier-type forward measurements \(K_{t_i}^*\) defined by (75) with frequencies \(S_{i} \in (\mathbb {R}^2)^{n_i}\) sampled along rotating lines, as in (78). In this case, at each fixed time sample \(t_i\), the operator \(K_{t_i}^*\) does not encode sufficient information to accurately resolve the location of the unknown ground-truth at time \(t_i\). As a consequence, any static reconstruction technique, that is, one that does not jointly employ information from different time samples in order to perform a reconstruction, would not be able to accurately recover any ground-truth from these measurements. To justify this claim, notice that for every time sample \(t_i\) the family \(\{S_{i,k}\}_{k=1}^{n_i}\) defined in (78) is collinear, and as such, there exists a corresponding vector \(S_i^{\perp } \in \mathbb {R}^2\) such that \(S_{i,k} \cdot S_{i}^{\perp } = 0\) for all \(k = 1,\ldots , n_i\). Therefore, for any given time-static source \(\rho ^\dagger = \delta _{x^\dagger }\) with \(x^\dagger \in (0.1,0.9)^2 \subset \mathbb {R}^2\), the measured forward data are invariant along \(S_{i}^{\perp }\), that is,

$$\begin{aligned} K_{t_i}^* \delta _{x^\dagger } = K_{t_i}^* \delta _{x^\dagger + \lambda S_{i}^{\perp }}\,, \quad \text { for all } \, \lambda \in \mathbb {R}\, \text { such that } \, x^\dagger + \lambda S_{i}^{\perp } \in (0.1,0.9)^2. \end{aligned}$$

Hence, using the information of a single time sample alone, it is not possible to distinguish between sources lying on the same line, and therefore a source cannot be accurately resolved. This is in contrast with the dynamic model presented in this paper, which is able to perform an efficient reconstruction, as demonstrated in Experiment 2.
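This invariance is easy to verify numerically with the illustrative helpers sketched in Sect. 6.1.1 and after (78) (again a sketch under those assumptions, not the released code).

```python
import numpy as np

freqs = line_frequencies(i=3)                   # frequencies on a single rotated line
S_perp = np.array([-freqs[0, 1], freqs[0, 0]])  # direction orthogonal to that line
S_perp /= np.linalg.norm(S_perp)

x = np.array([[0.5, 0.5]])                      # static source inside (0.1, 0.9)^2
x_shift = x + 0.05 * S_perp                     # shifted along the invariant direction

f = K_star(x, np.array([1.0]), freqs)
f_shift = K_star(x_shift, np.array([1.0]), freqs)
print(np.allclose(f, f_shift))                  # True: the two sources give identical data
```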

6.2.3 Experiment 3: Crossing Example

The following is an example in which the considered model is not able to track dynamic sources: although the reconstruction is close to the ground-truth in the sense of measures, its underlying curves do not resemble those of the ground-truth. This effect is due to the non-injectivity of the map in (38); even if the sparse measure we wish to recover is unique, its decomposition into atoms might not be. A simple example in which injectivity fails is given by the crossing of two curves (see Remark 3); this is the subject of the numerical experiment performed in this section. Specifically, the ground-truth \(\mu ^\dagger \) considered is of the form

$$\begin{aligned} \begin{gathered} \mu ^\dagger := a_{\gamma _1^\dagger }^{-1} \rho _{ \gamma _1^\dagger } + a_{\gamma _2^\dagger }^{-1} \rho _{\gamma _2^\dagger }\,, \\ \gamma _1^\dagger (t) := (0.2,0.2)+ t(0.6, 0.6), \qquad \gamma _2^\dagger (t) := (0.8,0.2)+ t(-0.6, 0.6)\,. \end{gathered} \end{aligned}$$
(79)

Notice that \(\gamma _1^\dagger \) and \(\gamma _2^\dagger \) cross at time \(t=0.5\), and the respective atoms both have intensity 1. For the forward measurements, we employ the time-constant Archimedean spiral family of frequencies defined in the first numerical experiment in Sect. 6.2.1, resulting in the time-constant operator \(K_{t_i}^*=K^*\). The reconstruction is performed for noiseless data \(f_{t_i}:=K^*\rho ^\dagger _{t_i}\). In Fig. 7, we present the considered frequency samples, together with the backprojected data at selected time samples.
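For reference, the ground-truth (79) and its noiseless data can be assembled with the earlier sketches; the weights below are generic placeholders, whereas the experiment uses \(c_j = 1/a_{\gamma _j^\dagger }\).

```python
import numpy as np

gamma1 = lambda t: np.array([0.2 + 0.6 * t, 0.2 + 0.6 * t])  # crosses gamma2 at t = 0.5
gamma2 = lambda t: np.array([0.8 - 0.6 * t, 0.2 + 0.6 * t])

times = np.linspace(0.0, 1.0, 51)
S = [spiral_frequencies() for _ in times]   # time-constant spiral frequencies
c1 = c2 = 1.0                               # placeholders for 1 / a_{gamma_j}
f = noiseless_data([gamma1, gamma2], [c1, c2], S, times)
```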

Fig. 7
figure 7

Time-constant frequency samples \(\{S_{i,k}\}_{k}\) and corresponding backprojected data \(w_t^0\) for Experiment 3. The backprojected data are taken in the noiseless case, and presented at times \(t_0 = 0,\ t_{25} = 0.5,\ t_{50} = 1\), with associated ground-truth superimposed in blue color

Fig. 8
figure 8

Reconstruction results for Experiment 3. In this case, the method fails to reconstruct the crossing, but still approximates the ground-truth in terms of measures

In Fig. 8, we display the reconstruction obtained for the high regularization parameters \(\alpha =0.5\) and \(\beta =0.5\). We observe that the reconstructed atoms are not close, as curves, to the ones in (79); rather than crossing at time \(t=0.5\), the two reconstructed curves rebound. As already mentioned, this phenomenon is a consequence of the lack of uniqueness of the sparse representation of \(\mu ^\dagger \), which in this case can be represented both by crossing curves and by rebounding curves. The fact that Algorithm 2 outputs rebounding curves is due to the employed regularization; indeed, the considered Benamou–Brenier-type penalization selects a solution whose squared velocity is minimized, which discourages the reconstruction from following the crossing path. However, we remark that the proposed algorithm yields a good solution in terms of the model, since the obtained reconstruction is very close, in the sense of measures, to the ground-truth \(\mu ^\dagger \). More sophisticated models are needed in order to resolve the crossing, as briefly discussed in Sect. 7.

6.3 Remark on Execution Times

The execution times of our algorithm reported in Table 1 are quite high for some of the presented examples. This is mainly due to the computational cost of the insertion step, since the algorithm is set to run several gradient descents to find a global minimum of the linearized problem (26) at each iteration, and these descents are executed in a non-parallel fashion on a single CPU core. The other components of the algorithm, namely the routines sample and crossover included in Subroutine 1, the coefficient-optimization step, and the sliding step, have, in comparison, negligible computational cost. In particular, the execution time of the sliding step grows with the number of active atoms, but this effect appears only toward the last iterations of the algorithm, and it is overshadowed by the insertion step, whose gradient descents become longer as the iterate approaches the optimal value. As a confirmation of the role of the insertion step in the overall computational cost, one can see that the execution times of the algorithm depend linearly on the total number of gradient descents run in each example. Indeed, since the total number of gradient descents is given by the number of restarts multiplied by the number of iterations (see Table 1), the ratio between the execution times and the total number of gradient descents is of the same order for all the presented examples (between \(0.0008\) and \(0.0019\) hours per gradient descent). It is worth pointing out that the multistart gradient descent is a highly parallelizable method. Some early tests in this direction indicate that much lower computational times are achievable for the presented examples by computing the gradient descents simultaneously on several GPUs.

6.4 Conclusions and Discussion

The presented numerical experiments confirm the effectiveness of Algorithm 2, and show that the proposed Benamou–Brenier-type energy is an excellent candidate for regularizing dynamic inverse problems. This is evidenced in particular by Experiment 2 in Sect. 6.2.2, where we consider a dynamic inverse problem that is impossible to tackle with a purely static approach, as discussed in Remark 10. Even in the extreme case of \(60\%\) added noise, our method produced satisfactory reconstructions.

We can further observe the distortions induced by the considered regularization. Precisely, the attenuation of the reconstruction’s velocities and intensities is a direct consequence of the minimization of the Benamou–Brenier energy and the total variation norm in the objective functional; for this reason, the choice of the regularization parameters \(\alpha \) and \(\beta \) affects the reconstruction and the magnitude of such distortions. A further effect of the regularization is the phenomenon presented in Experiment 3 (Sect. 6.2.3): as the dynamic inverse problem is formulated in the space of measures, if the sparse ground-truth possesses several different decompositions into extremal points, the preferred reconstruction may be the one favored by the regularizer.

Finally, concerning the execution times, we emphasize that the presented simulations are a proof of concept and not the result of a carefully optimized algorithm. There are many directions for improvement, the most promising one being a GPU implementation that parallelizes the multistart gradient descents of the insertion step. To increase the likelihood of finding a global minimizer in the insertion step, Subroutine 1 is employed with a high number of restarts \(N_\mathrm{max}\), which was tuned manually, prioritizing high reconstruction accuracy over execution time. To improve on this aspect, one could include early stopping conditions in Subroutine 1, for example by exiting the routine when a sufficiently high fraction of the starts descends toward the same stationary curve. Last, the code was written with readability in mind, as well as adaptability to a broad class of inverse problems. Therefore, the reported execution times are not an accurate estimate of what would be achievable in specific applications.

7 Future Perspectives

In this section, we propose several research directions that expand on the work presented in this paper. A first relevant question concerns the proposed DGCG algorithm and, in particular, the possibility of proving a theoretical linear convergence rate under suitable structural assumptions on the minimization problem (\(\mathcal {P}\)). Linear convergence has recently been proven for the GCG method applied to the BLASSO problem [40, 52]. It seems feasible to extend such an analysis to the DGCG algorithm presented in this paper, especially in view of the linear convergence observed in the experiments of Sect. 6.2, and of the fact that our proof of sublinear convergence (see Theorem 7) does not fully exploit the coefficient-optimization step, as commented in Remark 6. This line of research is currently under investigation by the authors [13].

Another interesting research direction is the extension of the DGCG method introduced in this paper to the case of unbalanced optimal transport. Precisely, one can regularize the inverse problem (2) by replacing the Benamou–Brenier energy B in (3) with the so-called Wasserstein–Fischer–Rao energy, as proposed in [15]. This energy, first introduced in [24, 45, 47] as a model for unbalanced optimal transport, accounts for more general displacements \(t \mapsto \rho _t\), in particular allowing the total mass of \(\rho _t\) to vary during the evolution. The possibility of numerically treating such a problem with conditional gradient methods would rest on the characterization of the extremal points of the Wasserstein–Fischer–Rao energy recently achieved by the authors in [12].

In addition, it is a challenging open problem to design alternative dynamic regularizers that allow for an accurate reconstruction of a ground-truth composed of crossing atoms, such as the one considered in the experiment in Sect. 6.2.3. Since the considered Benamou–Brenier-type regularizer penalizes the square of the velocity field associated with the measure, the reconstruction obtained by our DGCG algorithm does not follow the crossing route (Fig. 8b). A possible solution is to consider additional higher-order regularizers in (\(\mathcal {P}\)), such as curvature-type penalizations. The challenging part is devising a penalization that can be enforced at the level of Borel measures, and whose extremal points are measures concentrated on sufficiently regular curves.

Finally, taking into account the possible improvements discussed in Sect. 6.4, the implementation of an accelerated and parallelized version of Algorithm 2 will be the subject of future work.