1 Introduction

A recurrent problem in science and engineering is the reconstruction of a multidimensional signal \(f: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) from a finite number of (possibly noisy) linear measurements \({\varvec{y}}=(y_m)={\varvec{\nu }}(f) \in {\mathbb {R}}^M\), where the operator \({\varvec{\nu }}=(\nu _m): f \mapsto {\varvec{\nu }}(f)=(\langle \nu _1,f \rangle ,\dots ,\langle \nu _M, f\rangle )\) symbolizes the linear measurement process. The machine-learning version of the problem is the determination of a function \(f: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) from a finite number of samples \(y_m=f({\varvec{x}}_m)+\epsilon _m\) where \(\epsilon _m\) is a small perturbation term; it is a special case of the former with \(\nu _m=\delta (\cdot -{\varvec{x}}_m)\). Since a function that takes values over the continuum is an infinite-dimensional entity, the reconstruction problem is inherently ill-posed.

The standard remedy is to impose an additional minimum-energy requirement which, in effect, regularizes the solution. A natural choice of regularization is a smoothness norm associated with some function space \({\mathcal {X}}'\) (typically, a Sobolev space), which results in the prototypical formulation of the problem as

$$\begin{aligned} S=\arg \min _{f \in {\mathcal {X}}'} \Vert f\Vert _{{\mathcal {X}}'} \quad \text{ s.t. } \quad \langle \nu _m,f\rangle =y_m, \ m=1,\dots ,M. \end{aligned}$$
(1)

An alternative version that is better suited for noisy data is

$$\begin{aligned} S=\arg \min _{f \in {\mathcal {X}}'} \sum _{m=1}^M \left| y_m -\langle \nu _m,f\rangle \right| ^2 + \lambda \Vert f\Vert ^p_{{\mathcal {X}}'} \end{aligned}$$
(2)

with an adequate choice of hyper-parameters \(\lambda \in {\mathbb {R}}^+\) and \(p\in [1,\infty )\). We note that the unconstrained form (2) is a generalization of (1): the latter is recovered in the limit by taking \(\lambda \rightarrow 0\).

The term “representer theorem” is typically used to designate a parametric formula—preferably, a linear expansion in terms of some basis functions—that spans the whole range of solutions, irrespective of the value of the data \({\varvec{y}} \in {\mathbb {R}}^M\). Representer theorems are valued by practitioners because they indicate the way in which the initial problem can be recast as a finite-dimensional optimization, making it amenable to numerical computations. The other benefit is that the description of the manifold of possible solutions provides one with a better understanding of the effect of regularization. The best-known example is the representer theorem for reproducing kernel Hilbert spaces (RKHS), which states that the solution of (2) with \(\langle \nu _m,f\rangle =f({\varvec{x}}_m)\) and a Hilbertian regularization norm necessarily lives in a subspace of dimension M spanned by kernels centered on the data coordinates \({\varvec{x}}_m\) [7, 31, 35, 36, 43]. This theorem, in its extended version [42], is the foundation for the majority of kernel-based methods for machine learning, including regression, radial-basis functions, and support-vector machines [23, 44, 48]. There is also a whole line of generalizations of the concept that involves reproducing kernel Banach spaces (RKBS) [55,56,57]. More recently, motivated by the success of \(\ell _1\) and total-variation regularization for compressed sensing [11, 14, 19], researchers have derived alternative representer theorems in order to explain the sparsifying effect of such penalties and their robustness to missing data [8, 26, 28, 52]. A representer theorem for measures has also been invoked to justify the use of the total-variation norm for the super-resolution localization of spikes [12, 17, 21, 37] (see Sect. 4.1 for details).

In this paper, we present a unifying treatment of regularization by considering the problem from the abstract perspective of optimization in Banach spaces. Our motivation there is essentially twofold: (1) to get a better “geometrical” understanding of the effect of regularization, and (2) to state a generic representer theorem that applies to a wide variety of objects describable as elements of some native Banach space. The supporting theory is developed in Sect. 2. Our formulation takes advantage of the notion of Banach conjugates which is explained in Sect. 2.1. We then immediately proceed with the presentation of our key result: a generalized representer theorem (Theorem 2) that is valid for arbitrary convex data terms and Banach spaces in general, including the non-reflexive ones. The proof that is developed in Sect. 2.2 is rather soft (or “high-level”), as it relies exclusively on the powerful machinery of duality mappings and the Hahn–Banach theorem—in other words, there is no need for Gâteaux derivatives or subdifferentials, which are often invoked in such contexts. The resulting form of the solution in Theorem 2 is enlightening because it separates out the effect of the measurement operator from that of the regularization topology. Specifically, the measurement functionals \(\nu _1,\dots ,\nu _M\) in (1) or (2) specify a linear solution manifold that is then isometrically mapped into the primary space via the conjugate map \({\mathrm {J}}_{\mathcal {X}}: {\mathcal {X}} \rightarrow {\mathcal {X}}'\), which may or may not be linear, depending on whether the regularization norm is Hilbertian or not.

The theory is then complemented with concrete examples of usage of Theorem 2 to illustrate the power of the approach as well as its broad range of applicability. Section 3 is devoted to the scenario where the regularization norm is strictly convex, which ensures that the solution of the underlying minimization problem is unique. We make the link with the existing literature by deriving a number of classical results: Schölkopf’s generalized representer theorem for RKHS (Sect. 3.1), the closed-form solution of continuous-domain Tikhonov regularization with a Hilbertian norm (Sect. 3.2), and the connection with the theory of reproducing kernel Banach spaces (Sect. 3.3). In addition, we present a novel representer theorem for \(\ell _p\)-norm regularization (Sect. 3.4). Then, in Sect. 4, we turn our attention to sparsity-promoting regularization, which is more challenging because the underlying Banach spaces are typically non-reflexive and not strictly convex. The enabling ingredient there is a recent result by Boyer et al. [8], which allows one to express the extreme points of the solution set in Theorem 2 as a linear combination of a few basic atoms that are selected adaptively (Theorem 3). This result, in its simplest incarnation with \({\mathcal {X}}'=\ell _1({\mathbb {Z}})\), supports the well-documented sparsifying effect of \(\ell _1\)-norm minimization, which is central to the theory of compressed sensing. By switching to a continuum, we obtain the representer theorem for \({\mathcal {X}}'={\mathcal {M}}(\varOmega )\)—the space of signed Radon measures on a compact domain \(\varOmega \) (Sect. 4.1), which is relevant to super-resolution localization. We then also derive a representer theorem for generalized total-variation (Sect. 4.2)—in the spirit of [53]—that justifies the use of sparse kernel expansions for machine learning, in line with the generalized LASSO [39].

2 Mathematical Formulation

2.1 Banach Spaces and Duality Mappings

The notion of Banach space—basically, a vector space equipped with a norm—is remarkably general. Indeed, the elements (or points) of a Banach space can be vectors (e.g., \({\varvec{v}}\in {\mathbb {R}}^N\)), functions (e.g., \(f\in L_2({\mathbb {R}}^d)\)), sequences (e.g., \(u[\cdot ]\in \ell _1({\mathbb {Z}})\)), continuous linear functionals (e.g., \(f \in {\mathcal {X}}'\) where \({\mathcal {X}}'\) is the dual of some primary Banach space), vector-valued functions (e.g., \({\varvec{f}}=(f_1,\dots ,f_N)\) with \(f_n \in L_2({\mathbb {R}}^d)\)), matrices (e.g., \({\mathbf{X}} \in {\mathbb {R}}^{N \times N}\)), and, even, bounded linear operators from a Banach space \({\mathcal {U}}\) (domain) to another Banach space \({\mathcal {V}}\) (range) [e.g., \({\mathrm {X}} \in {\mathcal {L}}({\mathcal {U}}, {\mathcal {V}})\)] [32].

Definition 1

A normed vector space \({\mathcal {X}}\) is a linear space equipped with a norm, henceforth denoted by \(\Vert \cdot \Vert _{{\mathcal {X}}}\). It is called a Banach space if it is complete in the sense that every Cauchy sequence in \(({\mathcal {X}},\Vert \cdot \Vert _{{\mathcal {X}}})\) converges to an element of \({\mathcal {X}}\). It is said to be strictly convex if, for all \(v_1,v_2 \in {\mathcal {X}}\) such that \(\Vert v_1\Vert _{{\mathcal {X}}}=\Vert v_2\Vert _{{\mathcal {X}}}=1\) and \(v_1\ne v_2\), one has that \({\Vert \lambda v_1+(1-\lambda )v_2\Vert _{{\mathcal {X}}}}<1\) for any \(\lambda \in (0,1)\). Finally, a Hilbert space is a Banach space whose norm is induced by an inner product.

We recall that \({\mathcal {X}}'\) (the continuous dual of \({\mathcal {X}}\)) is the space of linear functionals \(u : v \mapsto \langle u, v\rangle {\mathop {=}\limits ^{\vartriangle }}u(v)\in {\mathbb {R}}\) that are continuous on \({\mathcal {X}}\). It is a Banach space equipped with the dual norm

$$\begin{aligned} \Vert u\Vert _{{\mathcal {X}}'}{\mathop {=}\limits ^{\vartriangle }}\sup _{{v \in {\mathcal {X}}}\backslash \{0\}}\frac{\langle u,v\rangle }{\Vert v\Vert _{{\mathcal {X}}}}. \end{aligned}$$
(3)

A direct implication of this definition is the generic duality bound

$$\begin{aligned} |\langle u, v\rangle |\le \Vert u\Vert _{{\mathcal {X}}'} \Vert v\Vert _{{\mathcal {X}}}, \end{aligned}$$
(4)

for any \(u \in {\mathcal {X}}'\) and \(v \in {\mathcal {X}}\). In fact, (4) can be interpreted as the Banach generalization of the Cauchy–Schwarz inequality for Hilbert spaces. By invoking the Hahn–Banach theorem, one can also prove that the duality bound is sharp for any dual pair \(({\mathcal {X}},{\mathcal {X}}')\) of Banach spaces [41]. This remarkable property inspired Beurling and Livingston to introduce the notion of duality mapping and to identify conditions of uniqueness [4]. We like to view the latter as the generalization of the classical Riesz map \({\mathrm {R}}: {\mathcal {H}}' \rightarrow {\mathcal {H}}\) or, rather, its inverse \({\mathrm {J}}_{\mathcal {H}}={\mathrm {R}}^{-1}: {\mathcal {H}} \rightarrow {\mathcal {H}}'\), which describes the isometric isomorphism between a Hilbert space \({\mathcal {H}}\) and its continuous dual \({\mathcal {H}}'\) [38]. The caveat with Banach spaces is that the duality mapping is not necessarily bijective nor even single-valued.

Definition 2

(Duality mapping) Let \(({\mathcal {X}},{\mathcal {X}}')\) be a dual pair of Banach spaces. Then, the elements \({v^*}\in {\mathcal {X}}'\) and \({v \in {\mathcal {X}}}\) form a conjugate pair if they satisfy:

  1. Norm preservation: \(\Vert {v^*}\Vert _{{\mathcal {X}}'}=\Vert v\Vert _{{\mathcal {X}}}\), and

  2. Sharp duality bound: \(\langle {v^*},v \rangle _{{\mathcal {X}}' \times {\mathcal {X}}}=\Vert {v^*}\Vert _{{\mathcal {X}}'}\Vert v\Vert _{{\mathcal {X}}}\).

For any given \(v\in {\mathcal {X}}\), the set of admissible conjugates defines the duality mapping

$$\begin{aligned} {\mathcal {J}}_{\mathcal {X}}(v)=\{{v^*} \in {\mathcal {X}}': \Vert {v^*}\Vert _{{\mathcal {X}}'}=\Vert v\Vert _{{\mathcal {X}}} \text { and } \langle {v^*},v \rangle _{{\mathcal {X}}' \times {\mathcal {X}}}=\Vert {v^*}\Vert _{{\mathcal {X}}'}\Vert v\Vert _{{\mathcal {X}}}\}, \end{aligned}$$

which is a nonempty subset of \({\mathcal {X}}'\). Whenever the duality mapping is a singleton (for instance, when \({\mathcal {X}}'\) is strictly convex), one also defines the corresponding duality operator \({\mathrm {J}}_{\mathcal {X}}: {\mathcal {X}} \rightarrow {\mathcal {X}}'\), which is such that \({\mathcal {J}}_{\mathcal {X}}(v)=\{{v^*}={\mathrm {J}}_{\mathcal {X}}\{v\}\}\).

We now list the properties of the duality mapping that are relevant for our purpose [see [4, 15, Proposition 4.7 p. 27, Proposition 1.4, p. 43], [45, Theorem 2.53, p. 43]].

Theorem 1

(Properties of duality mappings) Let \(({\mathcal {X}},{\mathcal {X}}')\) be a dual pair of Banach spaces. Then, the following holds:

  1. Every \({v \in {\mathcal {X}}}\) admits at least one conjugate \(v^*\in {\mathcal {X}}'\).

  2. \({\mathcal {J}}_{\mathcal {X}}(\lambda v)=\lambda {\mathcal {J}}_{\mathcal {X}}(v)\) for any \(\lambda \in {\mathbb {R}}\) (homogeneity).

  3. For every \({{v \in {\mathcal {X}}}}\), the set \({\mathcal {J}}_{\mathcal {X}}(v)\) is convex and weak\(^*\)-closed in \({\mathcal {X}}'\).

  4. The duality mapping is single-valued if \({\mathcal {X}}'\) is strictly convex; the latter condition is also necessary if \({\mathcal {X}}\) is reflexive.

  5. When \({\mathcal {X}}\) is reflexive, the duality map is bijective if and only if both \({\mathcal {X}}\) and \({\mathcal {X}}'\) are strictly convex.

The most favorable scenario is covered by Item 5. In that case, the duality map is invertible with \(v=({v^*})^*={\mathrm {J}}_{{\mathcal {X}}'}{\mathrm {J}}_{\mathcal {X}}\{v\}\); that is, \({\mathrm {J}}_{\mathcal {X}}^{-1}={\mathrm {J}}_{{\mathcal {X}}'}\), in conformity with the property that \({\mathcal {X}}''={\mathcal {X}}\).

We now prove that the duality map is linear if and only if \({\mathcal {X}}={\mathcal {H}}\) is a Hilbert space. In that case, the unitary operator \({\mathrm {J}}_{\mathcal {H}}: {\mathcal {H}} \rightarrow {\mathcal {H}}'\) is precisely the inverse of the Riesz map \({\mathrm {R}}: {\mathcal {H}}'\rightarrow {\mathcal {H}}\).

Proposition 1

Let \(({\mathcal {X}},{\mathcal {X}}')\) be a dual pair of Banach spaces such that \({\mathcal {X}}'\) is strictly convex. Then, the duality map \({\mathrm {J}}_{\mathcal {X}}: {\mathcal {X}} \rightarrow {\mathcal {X}}', v \mapsto {\mathrm {J}}_{\mathcal {X}}\{v\}={v^*}\) is linear if and only if \({\mathcal {X}}\) is a Hilbert space.

Proof

First, we recall that all Hilbert spaces are strictly convex. Consequently, the "if" part of the statement is Riesz' celebrated representation theorem, which identifies the canonical linear isometry \({\mathrm {J}}_{\mathcal {X}}={\mathrm {R}}^{-1}\) between a Hilbert space and its dual [41]. As for the "only if" part, we show that the underlying inner product is

$$\begin{aligned} \langle u, v\rangle _{{\mathcal {X}}}&=\tfrac{1}{2} \langle {\mathrm {J}}_{\mathcal {X}}\{u\},v \rangle _{{\mathcal {X}}'\times {\mathcal {X}}}+\tfrac{1}{2}\langle {\mathrm {J}}_{\mathcal {X}}\{v\},u\rangle _{{\mathcal {X}}'\times {\mathcal {X}}}. \end{aligned}$$
(5)

Its bilinearity follows from the bilinearity of the duality product and the linearity of \({\mathrm {J}}_{\mathcal {X}}\), while the symmetry in u and v is obvious. Finally, the definition of the conjugate yields

$$\begin{aligned} \langle v, v\rangle _{{\mathcal {X}}}=\langle {\mathrm {J}}_{\mathcal {X}}\{v\},v \rangle _{{\mathcal {X}}'\times {\mathcal {X}}}=\langle {v^*}, v \rangle _{{\mathcal {X}}'\times {\mathcal {X}}}=\Vert v\Vert ^2_{{\mathcal {X}}}, \end{aligned}$$
(6)

which confirms that the bilinear form \(\langle \cdot , \cdot \rangle _{{\mathcal {X}}}\) is positive-definite. Hence, it is the inner product that induces the \(\Vert \cdot \Vert _{{\mathcal {X}}}\)-norm. \(\square \)

As an example, we provide the expression of the (unique) Banach conjugate \({v^*}={\mathrm {J}}_{\mathcal {X}}\{v\} \in L_q({\mathbb {R}}^d)\) of a function \(v \in L_p({\mathbb {R}}^d)\backslash \{0\}\) with \(1<p<\infty \) and \(\frac{1}{p}+\frac{1}{q}=1\):

$$\begin{aligned} {v^*}({\varvec{x}})=\frac{\left| v({\varvec{x}})\right| ^{p-1}}{\Vert v\Vert ^{p-2}_{L_p}} \mathrm{sign}\big (v({\varvec{x}})\big ). \end{aligned}$$
(7)

This formula is intimately connected to Hölder's inequality. In particular, the \(L_2\) conjugation map with \(p=q=2\) is the identity.
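
A minimal numerical sketch of (7), written here for the finite-dimensional analogue \(({\mathbb {R}}^N,\Vert \cdot \Vert _{\ell _p})\) (Python/NumPy; the helper names and the random test vector are illustrative choices, not part of the theory), checks the two defining properties of a conjugate pair in Definition 2:

```python
import numpy as np

def lp_norm(v, p):
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

def conjugate(v, p):
    # finite-dimensional analogue of (7): v* = |v|^(p-1) sign(v) / ||v||_p^(p-2)
    return np.sign(v) * np.abs(v) ** (p - 1) / lp_norm(v, p) ** (p - 2)

rng = np.random.default_rng(0)
p = 1.5
q = p / (p - 1)                      # conjugate exponent: 1/p + 1/q = 1
v = rng.standard_normal(8)
v_star = conjugate(v, p)

# property 1 (norm preservation): ||v*||_q == ||v||_p
print(lp_norm(v_star, q), lp_norm(v, p))
# property 2 (sharp duality bound): <v*, v> == ||v*||_q * ||v||_p
print(np.dot(v_star, v), lp_norm(v_star, q) * lp_norm(v, p))
```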

2.2 General Representer Theorem

We now make use of the powerful tool of conjugation to characterize the solution of a broad class of unconstrained optimization problems in Banach space. Let us note that the result also covers the equality constraint of Problem (1) if one selects the barrier functional

$$\begin{aligned} E_{\mathrm{equal}}({\varvec{y}},{\varvec{z}})= {\left\{ \begin{array}{ll} 0, &{} {\varvec{y}}={\varvec{z}}\\ +\infty , &{} \text{ otherwise. } \end{array}\right. } \end{aligned}$$

Theorem 2

(General Banach representer theorem) Let us consider the following setting:

  • A dual pair \(({\mathcal {X}},{\mathcal {X}}')\) of Banach spaces.

  • The analysis subspace \({\mathcal {N}}_{\varvec{\nu }}=\mathrm{span}\{\nu _m\}_{m=1}^M \subset {\mathcal {X}}\) with the \(\nu _m\) being linearly independent.

  • The linear measurement operator \({\varvec{\nu }}: {\mathcal {X}}' \rightarrow {\mathbb {R}}^M: f \mapsto \big (\langle \nu _1,f \rangle , \dots ,\langle \nu _M, f \rangle \big )\) (it is weak\(^*\) continuous on \({\mathcal {X}}'\) because \(\nu _1,\dots ,\nu _M\in {\mathcal {X}}\)).

  • The loss functional \(E: {\mathbb {R}}^M \times {\mathbb {R}}^M \rightarrow {\mathbb {R}}^{+}\cup \{+\infty \}\) that is proper, weak lower semicontinuous and convex in its second argument.

  • Some arbitrary strictly increasing and convex function \(\psi : {\mathbb {R}}^+ \rightarrow {\mathbb {R}}^+\).

Then, for any fixed \({\varvec{y}}\in {\mathbb {R}}^M\), the solution set of the generic optimization problem

$$\begin{aligned} S=\arg \min _{f \in {\mathcal {X}}'} E\big ({\varvec{y}}, {\varvec{\nu }}(f)\big )+ \psi \left( \Vert f\Vert _{{\mathcal {X}}'}\right) \end{aligned}$$
(8)

is nonempty, convex, and weak\(^*\)-compact. If E is strictly convex (or if it imposes the equality \({\varvec{y}}={\varvec{\nu }}(f)\)), then any solution \(f_0 \in S\subset {\mathcal {X}}'\) is a \(({\mathcal {X}}',{\mathcal {X}})\)-conjugate of a common

$$\begin{aligned} \nu _0=\sum _{m=1}^M a_m \nu _m\in {\mathcal {N}}_{\varvec{\nu }}\subset {\mathcal {X}} \end{aligned}$$
(9)

with a suitable set of weights \({\varvec{a}} \in {\mathbb {R}}^M\), i.e., \(S\subseteq {\mathcal {J}}_{\mathcal {X}}(\nu _0)\). Moreover, if \({\mathcal {X}}'\) is strictly convex and \(f\mapsto \psi (\Vert f\Vert _{{\mathcal {X}}'})\) is strictly convex, then the solution is unique with \(f_0={\mathrm {J}}_{\mathcal {X}}\{\nu _0\} \in {\mathcal {X}}'\) (Banach conjugate of \(\nu _0\)). In particular, if \({\mathcal {X}}\) is a Hilbert space, then \(f_0=\sum _{m=1}^M a_m \nu _m^*\), where \(\nu _m^*\) is the Riesz conjugate of \(\nu _m\).

The condition of unicity requires the strict convexity of both \(\psi : {\mathbb {R}}^+ \rightarrow {\mathbb {R}}\) and \(f \mapsto \Vert f\Vert _{{\mathcal {X}}'}\). This applies to Banach spaces such as \({\mathcal {X}}'=\big (L_q({\mathbb {R}}^d)\big )'=L_p({\mathbb {R}}^d)\) (up to some isometric isomorphism) with \(1<p<\infty \) and the canonical choice of regularization \(R(f)=\lambda \Vert f\Vert _{L_p}^p\) with \(\psi (t)=\lambda |t|^p\) being strictly convex. While the solution of (8) also exists for Banach spaces such as \({\mathcal {M}}({\mathbb {R}}^d)=\big (C_0({\mathbb {R}}^d)\big )'\) or \(L_\infty ({\mathbb {R}}^d)=\big (L_1({\mathbb {R}}^d)\big )'\), the uniqueness is usually lost in such non-reflexive scenarios (see Sect. 4).

Proof

The proof uses standard arguments in convex analysis together with a dual reformulation of the problem inspired by the interpretation of best interpolation given by Carl de Boor in [6].

  • (i) Existence and Reformulation as a Generalized Interpolation Problem

    First, we recall that the basic properties of (weak lower semi-) continuity and coercivity are preserved through functional composition. The functional \(f \mapsto \Vert f\Vert _{{\mathcal {X}}'}\) is convex, (norm-)continuous and coercive on \({\mathcal {X}}'\) from the definition of a norm. Since \(\psi : {\mathbb {R}}^+\rightarrow {\mathbb {R}}^+\) is strictly increasing and convex, it is necessarily continuous and coercive. This ensures that \(f \mapsto \psi \left( \Vert f\Vert _{{\mathcal {X}}'}\right) \) is endowed with the same three basic properties. The linear measurement operator \({\varvec{\nu }}: {\mathcal {X}}' \rightarrow {\mathbb {R}}^M\) is continuous on \({\mathcal {X}}'\) by assumption (i.e., \(\nu _m \in {\mathcal {X}} \Rightarrow \nu _m \in {\mathcal {X}}''\) because of the canonical embedding of a Banach space in its bidual) and trivially convex. Since \({\varvec{z}} \mapsto E\big ({\varvec{y}}, {\varvec{z}}\big )\) is lower semicontinuous on \({\mathbb {R}}^M\) and convex, this implies by composition the lower semicontinuity and convexity of \(f \mapsto E\big ({\varvec{y}}, {\varvec{\nu }}(f)\big )\). Consequently, the functional \( f\mapsto F(f)=E\big ({\varvec{y}}, {\varvec{\nu }}(f)\big )+ \psi \left( \Vert f\Vert _{{\mathcal {X}}'}\right) \) is (weakly) lower semicontinuous, convex, and coercive on \({\mathcal {X}}'\), which guarantees the existence of the solution (as well as the convexity and closedness of the solution set) by a standard argument in convex analysis [22]—see [28, Proposition 8] for the non-reflexive case. Moreover, unicity is ensured when \(f \mapsto F(f)\) is strictly convex, which happens to be the case when both \({\varvec{z}} \mapsto E\big ({\varvec{y}}, {\varvec{z}}\big )\) and \(f \mapsto \psi \left( \Vert f\Vert _{{\mathcal {X}}'}\right) \) are strictly convex. For the general (not necessarily unique) scenario, we take advantage of the strict convexity of \(E({\varvec{y}}, \cdot )\) to show that all minimizers of F(f) share a common measurement vector \({\varvec{z}}_0= {\varvec{\nu }}(f_0) \in {\mathbb {R}}^M\). To that end, we pick any two distinct solutions \(f_i \in S, i=1,2\) with corresponding measurements \({\varvec{z}}_i= {\varvec{\nu }}(f_i)\) and regularization cost \(r_i=\psi (\Vert f_i\Vert _{{\mathcal {X}}'})\). The convexity of S implies that, for any \(\alpha \in (0,1)\), \(f=\alpha f_1 + (1-\alpha )f_2 \in S\) with \({\varvec{z}}={\varvec{\nu }}(f)= \alpha {\varvec{z}}_1 + (1-\alpha ){\varvec{z}}_2\) and \(F(f)=F(f_i), i=1,2\). Let us now assume that \({\varvec{z}}_1 \ne {\varvec{z}}_2\). Then, by invoking the strict convexity of \({\varvec{z}} \mapsto E({\varvec{y}},{\varvec{z}})\) and the convexity of \(f \mapsto \psi (\Vert f\Vert _{{\mathcal {X}}'})\), we get that

    $$\begin{aligned} F(f)&=E\big ({\varvec{y}}, \alpha {\varvec{z}}_1 + (1-\alpha ) {\varvec{z}}_2 \big ) + \psi \big ( \Vert \alpha f_1 + (1-\alpha ) f_2\Vert _{{\mathcal {X}}'}\big )\\&< \underbrace{\alpha E\left( {\varvec{y}}, {\varvec{z}}_1\right) + (1-\alpha ) E\left( {\varvec{y}}, {\varvec{z}}_2\right) + \alpha r_1 + (1-\alpha ) r_2} _{\alpha F(f_1) + (1-\alpha ) F(f_2)=F(f)}, \end{aligned}$$

    which is a contradiction. It follows that \({\varvec{z}}_i= {\varvec{\nu }}(f_i)={\varvec{z}}_0\) which, in turn, implies that the optimal regularization cost \(r_i=r_0\) is the same for all \(f_i \in S\). Although \({\varvec{z}}_0= {\varvec{\nu }}(f_0) \in {\mathbb {R}}^M\) is usually not known beforehand, this property provides us with a convenient parametric characterization of the solution set as

    $$\begin{aligned} S_{{\varvec{z}}}=\arg \min _{f \in {\mathcal {X}}'} \Vert f\Vert _{{\mathcal {X}}'} \text{ s.t. } {\varvec{\nu }}(f)={\varvec{z}}, \end{aligned}$$
    (10)

    where \({\varvec{z}}\) ranges over \({\mathbb {R}}^M\). In this reformulation, we also exploit the property that the minimization of \(\Vert f\Vert _{{\mathcal {X}}'}\) is equivalent to that of \(\psi (\Vert f\Vert _{{\mathcal {X}}'})\) because the mapping between the two quantities is monotone.

  • (ii) Explicit Resolution of the Generalized Interpolation Problem (10)

    The linear independence of the functionals \(\nu _m\) ensures that any \(\nu \in {\mathcal {N}}_{\varvec{\nu }}\) has the unique expansion \(\nu =\sum _{m=1}^M a_m \nu _m\). Based on this representation, we define the linear functional

    $$\begin{aligned} \nu \mapsto {\zeta }(\nu )=\sum _{m=1}^{M} a_m z_m \end{aligned}$$

    with \({\varvec{z}}={\varvec{z}}_0\) fixed. By construction, \({\zeta }\) is continuous \(\big ({\mathcal {N}}_{\varvec{\nu }},\Vert \cdot \Vert _{{\mathcal {X}}}\big )\xrightarrow {{\ \mathrm{c.}\ }}{\mathbb {R}}\) with \(|{\zeta }(\nu )|\le \Vert {\zeta }\Vert \; \Vert \nu \Vert _{{\mathcal {X}}}\), where \( \Vert {\zeta }\Vert =\sup _{\nu \in {\mathcal {N}}_{\varvec{\nu }}:\; \Vert \nu \Vert _{{\mathcal {X}}}= 1} {\zeta }(\nu )<\infty \). Moreover, the Hahn–Banach theorem ensures the existence of a continuous, norm-preserving extension of \({\zeta }\) to the whole Banach space \({\mathcal {X}}\); that is, an element \(f_0\in {\mathcal {X}}'\) such that

    $$\begin{aligned} \Vert f_0\Vert _{{\mathcal {X}}'}=\sup _{g \in {\mathcal {X}}:\; \Vert g\Vert _{{\mathcal {X}}}=1} \langle f_0, g\rangle =\Vert {\zeta }\Vert . \end{aligned}$$

    The connection between the above statement and the generalized interpolation problem (10) is that the complete set of continuous extensions of \({\zeta }\) to \({\mathcal {X}}\supset {\mathcal {N}}_{{\varvec{\nu }}}\) is given by

    $$\begin{aligned} U=\{f \in {\mathcal {X}}': \langle f,\nu \rangle ={\zeta }(\nu ) \text{ for } \text{ all } \nu \in {\mathcal {N}}_{{\varvec{\nu }}}\} \end{aligned}$$

    with the property that

    $$\begin{aligned} f_0\in \arg \inf _{f\in U} \Vert f\Vert _{{\mathcal {X}}'}=S_{{\varvec{z}}_0} \quad \Leftrightarrow \quad \Vert f_0\Vert _{{\mathcal {X}}'}=\Vert {\zeta }\Vert . \end{aligned}$$
    (11)

    The next fundamental observation is that \({\mathcal {N}}_{{\varvec{\nu }}}=\big ({\mathcal {N}}'_{{\varvec{\nu }}}\big )'\) because both spaces are of finite dimension M and, hence, reflexive. Consequently, for any \(\nu _0 \in {\mathcal {J}}_{{\mathcal {N}}'_{\varvec{\nu }}}({\zeta }) \subseteq \big ({\mathcal {N}}'_{{\varvec{\nu }}}\big )'={\mathcal {N}}_{\varvec{\nu }}\) (the duality mapping being taken here for the dual pair \(({\mathcal {N}}'_{{\varvec{\nu }}},{\mathcal {N}}_{{\varvec{\nu }}})\)), we have that \(\Vert \nu _0\Vert _{{\mathcal {X}}}=\Vert {\zeta }\Vert \) and \({\zeta }(\nu _0)=\Vert \nu _0\Vert ^2_{{\mathcal {X}}}\), as well as \(\Vert \nu _0\Vert _{{\mathcal {X}}}=\Vert f_0\Vert _{{\mathcal {X}}'}\) for all \(f_0 \in S_{{\varvec{z}}_0}\) because of (11). Since \(f_0\in U\subset {\mathcal {X}}'\) and \(\nu _0 \in {\mathcal {N}}_{{\varvec{\nu }}} \subset {\mathcal {X}}\), this yields

    $$\begin{aligned} \langle f_0,\nu _0\rangle ={\zeta }(\nu _0) =\Vert f_0\Vert _{{\mathcal {X}}'}\Vert \nu _0\Vert _{{\mathcal {X}}}, \end{aligned}$$

    which implies that \(f_0 \in {\mathcal {J}}_{\mathcal {X}}(\nu _0)\) with \({\mathcal {J}}_{\mathcal {X}}\) the duality mapping from \({\mathcal {X}}\) to \({\mathcal {X}}'\).

  • (iii) Structure of the Solution Set

    We have just shown that \(S_{{\varvec{z}}_0}\subseteq {\mathcal {J}}_{\mathcal {X}}(\nu _0)\) for any extremal element \(\nu _0 \in \{g \in {\mathcal {N}}_{{\varvec{\nu }}}: {\zeta }(g)=\Vert {\zeta }\Vert \, \Vert g\Vert _{{\mathcal {X}}}, \Vert g\Vert _{{\mathcal {X}}}=\Vert {\zeta }\Vert \}\). We now deduce that \(S_{{\varvec{z}}_0}\) is weak\(^*\)-compact since it is included in the closed ball in \({\mathcal {X}}'\) of radius \(\Vert f_0\Vert _{{\mathcal {X}}'}<\infty \), which is itself weak\(^*\)-compact, by the Banach–Alaoglu theorem. When \({\mathcal {X}}'\) is strictly convex, the situation is simpler because the duality mapping from \({\mathcal {X}}\) to \({\mathcal {X}}'\) is single-valued and the solution \(f_0\in {\mathcal {X}}'\) is unique. Moreover, the latter conjugate map is linear if and only if \({\mathcal {X}}\) is a Hilbert space, by Proposition 1. \(\square \)

Note that the existence of the conjugate of \(\nu _0 \in {\mathcal {N}}_{\varvec{\nu }}\subset {\mathcal {X}}\) is essential to the argumentation. This is the reason why the problem is formulated with \(f \in {\mathcal {X}}'\) subject to the hypothesis that \(\nu _1,\dots ,\nu _M \in {\mathcal {X}}\) (weak\(^*\) continuity). These considerations are inconsequential in the simpler reflexive scenario where the role of the two spaces is interchangeable since \({\mathcal {X}}={\mathcal {X}}''\). The hypothesis of linear independence of the \(\nu _m\) in Theorem 2 is only made for convenience. When it does not hold, one can adapt the proof by picking a basis of \({\mathcal {N}}_{{\varvec{\nu }}}\) of reduced dimension \(M'<M\), which then leads to a corresponding reduction in the number \(M'\) of degrees of freedom of the solution.

In the sequel, as we shall apply Theorem 2 to concrete scenarios, we shall implicitly interpret \(f \in {\mathcal {X}}'\) in (8) as a function (or, possibly, a vector) rather than a continuous linear functional on \({\mathcal {X}}\) (the abstract definition of an element of the dual space). This is acceptable provided that the defining space \({\mathcal {X}}'\) is isometrically embedded in some classical function space such as \(L_p({\mathbb {R}}^d)\) because of the bijective mapping (isometric isomorphism) that relates the two types of entities; for instance, there is a unique element \(f \in L_p({\mathbb {R}}^d)\) with p the conjugate exponent of \(q\in [1,\infty )\) such that the linear functional \(\zeta \in \big (L_q({\mathbb {R}}^d)\big )'\) can be specified as \(\zeta (g)= \langle f,g\rangle =\int _{{\mathbb {R}}^d} f({\varvec{x}})g({\varvec{x}}) \mathrm{d}{\varvec{x}}\) and vice versa. This allows us to identify \(\zeta =\zeta _f\) with \(f \in L_p({\mathbb {R}}^d)\), while it also gives a precise meaning to identities such as \(L_p({\mathbb {R}}^d)=\big (L_q({\mathbb {R}}^d)\big )'\).

3 Strictly Convex Regularization

The solution of the optimization problem in Theorem 2 is unique whenever the Banach space \({\mathcal {X}}'\) is reflexive and strictly convex. This is the setting that has been studied the most in the literature. We now illustrate the unifying character of Theorem 2 by using it to retrieve the key results in this area; that is, the classical kernel methods for machine learning in RKHS (Sect. 3.1), the resolution of linear inverse problems with Tikhonov regularization (Sect. 3.2), and the link with reproducing kernel Banach spaces (Sect. 3.3). In addition, we make use of the conjugate map to present a novel perspective on \(\ell _p\) regularization for \(p>1\) in Sect. 3.4.

3.1 Kernel/RKHS Methods in Machine Learning

Here, the search space \({\mathcal {X}}'\) is a reproducing kernel Hilbert space on \({\mathbb {R}}^d\) denoted by \({\mathcal {H}}\) with \(\Vert f\Vert ^2_{{\mathcal {H}}}=\langle f, f\rangle _{{\mathcal {H}}}\), where \(\langle \cdot , \cdot \rangle _{{\mathcal {H}}}\) is the underlying inner product. The predual space is \({\mathcal {X}}={\mathcal {H}}'\), so that \({\mathcal {X}}'={\mathcal {H}}''={\mathcal {H}}\) (reflexive scenario). The RKHS property [3] is equivalent to the existence of a (unique) positive-definite kernel \(r_{\mathcal {H}}: {\mathbb {R}}^d \times {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) (the reproducing kernel of \({\mathcal {H}}\)) such that

$$\begin{aligned} (i)&\ \ r_{\mathcal {H}}(\cdot ,{\varvec{x}}_m) \in {\mathcal {H}} \end{aligned}$$
(12)
$$\begin{aligned} (ii)&\ \ f({\varvec{x}}_m)=\langle f, r_{\mathcal {H}}(\cdot ,{\varvec{x}}_m)\rangle _{{\mathcal {H}}} \end{aligned}$$
(13)

for all \(f\in {\mathcal {H}}\) and any \({\varvec{x}}_m\in {\mathbb {R}}^d\).

In the context of machine learning, the loss function E is usually chosen to be additive with \(E({\varvec{y}}, {\varvec{z}})=\sum _{m=1}^M E_m\big (y_m, z_m\big )\) [29, 43]. Given a series of data points \(\big ({\varvec{x}}_m,y_m\big )\), \(m=1,\dots ,M\) with \({\varvec{x}}_m\in {\mathbb {R}}^d\), the learning problem is then to estimate a function \(f_0: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) such that

$$\begin{aligned} f_0=\arg \min _{f \in {\mathcal {H}}} \left( \sum _{m=1}^M E_m\big (y_m, f({\varvec{x}}_m)\big ) + \lambda \Vert f\Vert ^2_{{\mathcal {H}}}\right) \end{aligned}$$
(14)

where \(\lambda \in {\mathbb {R}}^+\) is an adjustable regularization parameter. In functional terms, the reproducing kernel represents the Schwartz kernel [27, 46] of the Riesz map \({\mathrm {R}}: {\mathcal {H}}' \rightarrow {\mathcal {H}}: \nu \mapsto \nu ^*=\int _{{\mathbb {R}}^d} r_{\mathcal {H}}(\cdot ,{\varvec{y}}) \nu ({\varvec{y}}) \mathrm{d}{\varvec{y}}\) so that \(\nu _m^{*}({\varvec{x}})={\mathrm {R}}\{\delta (\cdot -{\varvec{x}}_m)\}({\varvec{x}})=r_{\mathcal {H}}({\varvec{x}},{\varvec{x}}_m)\). The application of Theorem 2 with \({\mathcal {X}}'={\mathcal {H}}\) then immediately yields the parametric form of the solution

$$\begin{aligned} f_0({\varvec{x}})=\sum _{m=1}^M a_m r_{\mathcal {H}}({\varvec{x}},{\varvec{x}}_m), \end{aligned}$$
(15)

which is a linear kernel expansion. The optimality of such kernel expansions is precisely the result stated in Schölkopf’s representer theorem for RKHS [42]. Moreover, by invoking the reproducing kernel property (13) with \(f=r_{\mathcal {H}}(\cdot ,{\varvec{x}}_n) \in {\mathcal {H}}\), one readily finds that \( \Vert f_0\Vert ^2_{{\mathcal {H}}}={\varvec{a}}^T {\mathbf{G}} {\varvec{a}}\), where the Gram matrix \({\mathbf{G}} \in {\mathbb {R}}^{M \times M}\) is specified by \([{\mathbf{G}}]_{m,n}=r_{\mathcal {H}}({\varvec{x}}_m,{\varvec{x}}_n)\). By injecting the parametric form of the solution into the cost functional in (14), we then end up with the equivalent finite-dimensional minimization task

$$\begin{aligned} {\varvec{a}}_0=\arg \min _{{\varvec{a}} \in {\mathbb {R}}^M} \left( E\big ({\varvec{y}}, {\mathbf{G}}{\varvec{a}}) + \lambda {\varvec{a}}^T {\mathbf{G}} {\varvec{a}}\right) , \end{aligned}$$
(16)

which yields the exact solution of the original infinite-dimensional optimization problem. In short, (16) is the optimal discretization of the functional optimization problem (14), which is then readily transcribable into a numerical implementation using standard (finite-dimensional) techniques.
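
As a minimal sketch of such an implementation, the snippet below fits noisy one-dimensional samples with a Gaussian reproducing kernel and a quadratic loss (both being illustrative choices; for this loss and an invertible Gram matrix, (16) reduces to the linear system \(({\mathbf{G}}+\lambda {\mathbf{I}}){\varvec{a}}={\varvec{y}}\)):

```python
import numpy as np

def r_H(x, xp, sigma=0.2):
    # illustrative positive-definite reproducing kernel (Gaussian)
    return np.exp(-(x - xp) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(1)
x_m = np.sort(rng.uniform(0, 1, 20))                 # data coordinates x_m
y = np.sin(2 * np.pi * x_m) + 0.1 * rng.standard_normal(x_m.size)
lam = 1e-2

G = r_H(x_m[:, None], x_m[None, :])                  # Gram matrix [G]_{m,n} = r_H(x_m, x_n)
a = np.linalg.solve(G + lam * np.eye(x_m.size), y)   # minimizer of (16) for the quadratic loss

x = np.linspace(0, 1, 200)
f0 = r_H(x[:, None], x_m[None, :]) @ a               # kernel expansion (15) evaluated on a grid
```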

3.2 Tikhonov Regularization

Tikhonov regularization is a classical approach for dealing with ill-posed linear inverse problems [30, 50]. The goal there is to recover a function \(f: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) from a noisy or imprecise series of linear measurements \(y_m=\langle \nu _m,f \rangle + \epsilon _m\), where \(\epsilon _m\) is the disturbance term. By using the same functional framework as in Sect. 3.1 with \(\nu _1,\dots ,\nu _M \in {\mathcal {H}}'={\mathcal {X}}\), and \({\mathcal {X}}'={\mathcal {H}}''={\mathcal {H}}\), one formulates the recovery problem as

$$\begin{aligned} f_0=\arg \min _{f \in {\mathcal {H}}} \left( \sum _{m=1}^M |y_m-\langle \nu _m,f \rangle |^2 + \lambda \Vert f\Vert ^2_{{\mathcal {H}}}\right) . \end{aligned}$$
(17)

The application of Theorem 2 then yields a solution that takes the parametric form

$$\begin{aligned} f_0=\sum _{m=1}^M a_m \varphi _m \end{aligned}$$
(18)

with \(\varphi _m={\mathrm {R}}\{\nu _m\}\), where \({\mathrm {R}}\) is the Riesz map \({\mathcal {H}}'={\mathcal {X}}\rightarrow {\mathcal {H}}={\mathcal {X}}'\). The next fundamental observation is that the bilinear form \((\nu _m,\nu _n) \mapsto \langle \nu _m,{\mathrm {R}}\{\nu _n\} \rangle \) is actually the inner product for the dual space \({\mathcal {H}}'\) leading to \(\langle \nu _m,\varphi _n \rangle =\langle \nu _m,\nu _n\rangle _{{\mathcal {H}}'}\). In fact, by using the property that \(\nu _m\) and \(\varphi _m=\nu _m^*\) are Hilbert conjugates, we have that

$$\begin{aligned} \langle \nu _m,\varphi _n\rangle =\langle \nu _m,\nu _n\rangle _{{\mathcal {H}}'}= \langle \nu ^*_m,\nu ^*_n\rangle _{{\mathcal {H}}}=\langle \varphi _m,\varphi _n \rangle _{{\mathcal {H}}} \end{aligned}$$
(19)

which, somewhat remarkably, shows that the underlying system matrix is equal to the Gram matrix of the basis \(\{\varphi _m\}\).

Therefore, by injecting (18) into the cost functional in (17), we are able to reformulate the initial optimization problem as the finite-dimensional minimization

$$\begin{aligned} {\varvec{a}}_0=\arg \min _{{\varvec{a}} \in {\mathbb {R}}^M} \left( \Vert {\varvec{y}}- {\mathbf{H}}{\varvec{a}}\Vert ^2 + \lambda {\varvec{a}}^T {\mathbf{H}} {\varvec{a}}\right) , \end{aligned}$$
(20)

where the system/Gram matrix \({\mathbf{H}}\in {\mathbb {R}}^{M \times M}\) with \([{\mathbf{H}}]_{m,n}=\langle \nu _m,\varphi _n \rangle =\langle \varphi _m,\varphi _n \rangle _{{\mathcal {H}}}\) is symmetric positive-definite. By differentiating the quadratic form in (20) with respect to \({\varvec{a}}\) and setting the gradient to zero, we readily derive the very pleasing closed-form solution

$$\begin{aligned} {\varvec{a}}_0=({\mathbf{H}}{\mathbf{H}}\ +\lambda {\mathbf{H}})^{-1} {\mathbf{H}}{\varvec{y}}=({\mathbf{H}} +\lambda {\mathbf{I}})^{-1} {\varvec{y}} \end{aligned}$$
(21)

under the implicit assumption that \({\mathbf{H}}\) is invertible. We note that the latter is equivalent to the linear independence of the \(\varphi _m\) (resp., the linear independence of the \(\nu _m\) due to the Riesz pairing).
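
The chain (18)–(21) is easy to verify numerically in a toy finite-dimensional Hilbert space \({\mathcal {H}}=({\mathbb {R}}^N,\langle \cdot ,\cdot \rangle _{\mathbf{A}})\) with \(\langle u,v\rangle _{{\mathcal {H}}}=u^T{\mathbf{A}}v\) for some symmetric positive-definite \({\mathbf{A}}\), so that the Riesz map is \({\mathrm {R}}={\mathbf{A}}^{-1}\) (a purely illustrative setup): the representer route (21) and a direct minimization of (17) return the same element.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, lam = 12, 4, 0.1

B = rng.standard_normal((N, N))
A = B @ B.T + N * np.eye(N)               # SPD matrix defining <u, v>_H = u^T A v
Nu = rng.standard_normal((M, N))          # rows = measurement functionals nu_m
y = rng.standard_normal(M)

# representer route: phi_m = R{nu_m} = A^{-1} nu_m and [H]_{m,n} = <nu_m, phi_n>
Phi = np.linalg.solve(A, Nu.T)            # columns = phi_m
H = Nu @ Phi                              # system/Gram matrix of (20)
a = np.linalg.solve(H + lam * np.eye(M), y)   # closed form (21)
f_rep = Phi @ a                           # expansion (18)

# direct minimization of (17) in coordinates: f = (Nu^T Nu + lam*A)^{-1} Nu^T y
f_dir = np.linalg.solve(Nu.T @ Nu + lam * A, Nu.T @ y)
print(np.allclose(f_rep, f_dir))          # True: both routes give the same minimizer
```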

3.3 Reproducing Kernel Banach Spaces

The concept of reproducing kernel Banach space, which is the natural generalization of RKHS, was introduced and investigated by Zhang and Xu in [55, 56]. Similar to the Hilbertian case, one can identify the RKBS property as follows.

Definition 3

A strictly convex and reflexive Banach space \({\mathcal {B}}\) of functions on \({\mathbb {R}}^d\) is called a reproducing kernel Banach space (RKBS) if \(\delta (\cdot -{\varvec{x}}) \in {\mathcal {B}}'\) for any \({\varvec{x}} \in {\mathbb {R}}^d\). Then, the unique representer \(r_{\mathcal {B}}(\cdot ,{\varvec{x}})={\mathrm {J}}_{{\mathcal {B}}'}\{\delta (\cdot -{\varvec{x}})\} \in {\mathcal {B}}\) when indexed by \({\varvec{x}}\) is called the reproducing kernel of the Banach space.

It is then of interest to consider the Banach variant of (14) that involves a slightly more general regularization term: Given the data points \(\big ({\varvec{x}}_m,y_m\big )_{m=1}^M\), we want to find the unique solution of the optimization problem

$$\begin{aligned} f_0=\arg \min _{f \in {\mathcal {B}}} \left( \sum _{m=1}^M E\big (y_m, f({\varvec{x}}_m)\big ) + \psi (\Vert f\Vert _{{\mathcal {B}}}) \right) \end{aligned}$$
(22)

where the loss function \(E: {\mathbb {R}}\times {\mathbb {R}}\rightarrow {\mathbb {R}}\) is convex in its second argument and the regularization strength is modulated by the function \(\psi : {\mathbb {R}}^+\rightarrow {\mathbb {R}}^+\), which is convex and strictly increasing. Since the space \({\mathcal {B}}\) is reflexive by assumption, the optimization problem falls into the framework of Theorem 2 with \({\mathcal {X}}={\mathcal {B}}'\) and \({\mathcal {X}}'={\mathcal {B}}''={\mathcal {B}}\) and \(\nu _m=\delta (\cdot -{\varvec{x}}_m) \in {\mathcal {B}}', m=1,\dots ,M\), where the latter inclusion is guaranteed by the RKBS property. We thereby obtain the parametric form of the solution as

$$\begin{aligned} f_0={\mathrm {J}}_{{\mathcal {B}}'}\left\{ \sum _{m=1}^M a_m \delta (\cdot -{\varvec{x}}_m)\right\} ={\mathrm {J}}_{{\mathcal {B}}'}\left\{ \sum _{m=1}^M a_m r^*_{{\mathcal {B}}}(\cdot ,{\varvec{x}}_m)\right\} \quad \end{aligned}$$
(23)

with appropriate coefficients \((a_m)\in {\mathbb {R}}^M\), where the expression on the right-hand side has been included in order to make the connection with the Banach reproducing kernel, as in [56, 57]. Due to the homogeneity and invertibility of the duality mapping (see Theorem 1), we have that \({\mathrm {J}}_{{\mathcal {B}}'}\left\{ a_m r^*_{{\mathcal {B}}}(\cdot ,{\varvec{x}}_m)\right\} =a_m r_{{\mathcal {B}}}(\cdot ,{\varvec{x}}_m)\). This implies that (23) yields a linear expansion in terms of kernels if and only if \(M=1\) or if the duality map \({\mathrm {J}}_{{\mathcal {B}}'}: {\mathcal {B}}' \rightarrow {\mathcal {B}}\) is linear. We note that the latter condition together with Definition 3 is equivalent to \({\mathcal {B}}={\mathcal {H}}\) being an RKHS (by Proposition 1), which brings us back to the classical setting of Sect. 3.1. The same argumentation is also extendable to the vector-valued setting which has been considered by various authors both for RKHS and RKBS settings [1, 33, 58]. We also like to point out that our analysis is compatible with some recent results of Combettes et al. [16], where the corresponding conditions of optimality are stated using subdifferentials.

3.4 Toward Compressed Sensing: \(\ell _p\)-Norm Regularization

A classical problem in signal processing is to recover an unknown discrete signal \({\varvec{s}} \in {\mathbb {R}}^N\) from a set of corrupted linear measurements \(y_m={\varvec{h}}^T_m{\varvec{s}} + \epsilon _m\), \(m=1,\dots ,M\). The measurement vectors \({\varvec{h}}_1,\dots ,{\varvec{h}}_M \in {\mathbb {R}}^N\) specify the system matrix \({\mathbf{H}}=[{\varvec{h}}_1 \ {\varvec{h}}_2 \ \cdots \ {\varvec{h}}_M]^T \in {\mathbb {R}}^{M \times N}\). When M (the number of measurements) is less than N (the size of the unknown signal \({\varvec{s}}\)), the reconstruction problem is a priori ill-posed, and strongly so when \(M\ll N\) (compressed-sensing scenario). However, if the original signal is known to be sparse (i.e., \(\Vert {\varvec{s}}\Vert _{0}\le K_0\) with \(2K_0<M\)) and the system matrix \({\mathbf{H}}\) satisfies some “incoherence” properties, then the theory of compressed sensing provides general guarantees for a stable recovery [14, 19, 26]. The computational strategy then is to impose an \(\ell _p\) regularization (with p small to favor sparsity) on the solution and to formulate the reconstruction problem as

$$\begin{aligned} {\varvec{s}} = \arg \min _{{\varvec{x}} \in {\mathbb {R}}^N} \left( E\big ({\varvec{y}}, {\mathbf{H}}{\varvec{x}}) + \lambda \Vert {\varvec{x}}\Vert ^p_{\ell _p}\right) \end{aligned}$$
(24)

with \(\Vert {\varvec{x}}\Vert _{\ell _p}{\mathop {=}\limits ^{\vartriangle }}\left( \sum _{n=1}^N |x_n|^p\right) ^{1/p}\). The traditional choice for compressed sensing is \(p=1\), which is the smallest exponent that still results in a convex optimization problem.

We now show how we can use Theorem 2 to characterize the effect of such a regularization for \(p\in (1,\infty )\). The corresponding Banach space is \({\mathcal {X}}'=({\mathbb {R}}^N,\Vert \cdot \Vert _{\ell _p})\) whose predual is \({\mathcal {X}}=({\mathbb {R}}^N,\Vert \cdot \Vert _{\ell _q})\) with \(\frac{1}{p}+\frac{1}{q}=1\). Moreover, the underlying norms are strictly convex for \(p>1\), which guarantees that the solution is unique, irrespective of M and \({\mathbf{H}}\). By introducing the dual signal \({\varvec{\nu }}_0={\mathbf{H}}^T{\mathbf{a}} \in {\mathcal {X}}\) and by using the known form of the corresponding Banach q-to-p duality map \({\mathrm {J}}_{{\mathcal {X}}}: {\mathcal {X}} \rightarrow {\mathcal {X}}'\), we then readily deduce that the solution can be represented as

$$\begin{aligned}{}[{\varvec{s}}]_n=\frac{\left| [{\mathbf{H}}^T {\varvec{a}}]_n\right| ^{q-1}}{\Vert {\mathbf{H}}^T {\varvec{a}}\Vert ^{q-2}_{\ell _q}} \mathrm{sign}\big ([{\mathbf{H}}^T {\varvec{a}}]_n\big ) \end{aligned}$$
(25)

for a suitable value of the (dual) parameter vector \({\varvec{a}} \in {\mathbb {R}}^M\). While the exact value of \({\varvec{a}}\) is data-dependent, (25) provides us with a description of the solution manifold, which has intrinsic dimension M. Put differently, the fact that \({\varvec{s}}\) minimizes (24) induces a nonlinear pairing between the data vector \({\varvec{y}} \in {\mathbb {R}}^M\) and the dual variable \({\varvec{a}}\in {\mathbb {R}}^M\) in (25). In particular, for \(p=2\), we have that \({\varvec{s}}={\mathbf{H}}^T {\varvec{a}}=\sum _{m=1}^M {\varvec{h}}_m a_m\), which confirms the well-known result that \({\varvec{s}}\in \mathrm{span}\{{\varvec{h}}_m\}\). The latter also explains why classical quadratic/Tikhonov regularization performs poorly when M is much smaller than N.
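
For the quadratic loss \(E({\varvec{y}},{\varvec{z}})=\Vert {\varvec{y}}-{\varvec{z}}\Vert ^2\) (an illustrative choice), the stationarity condition of (24) shows that the dual vector \({\varvec{a}}\) is proportional to the residual \({\varvec{y}}-{\mathbf{H}}{\varvec{s}}\), so the form (25) can be checked numerically: the entries of \({\varvec{s}}\) and those of \(|[{\mathbf{H}}^T{\varvec{a}}]_n|^{q-1}\mathrm{sign}([{\mathbf{H}}^T{\varvec{a}}]_n)\) should differ only by a common scale factor. A minimal sketch (assuming NumPy/SciPy) follows.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
M, N, p, lam = 5, 20, 1.5, 0.1
q = p / (p - 1)
H = rng.standard_normal((M, N))
y = rng.standard_normal(M)

def cost(x):
    r = y - H @ x
    return r @ r + lam * np.sum(np.abs(x) ** p)

def grad(x):
    return -2.0 * H.T @ (y - H @ x) + lam * p * np.abs(x) ** (p - 1) * np.sign(x)

s = minimize(cost, np.zeros(N), jac=grad, method="L-BFGS-B").x

# stationarity with the quadratic loss implies that a is proportional to the residual,
# so s should match (25) up to a single global scale factor
nu0 = H.T @ (y - H @ s)                       # nu_0 = H^T a (up to scale)
pred = np.abs(nu0) ** (q - 1) * np.sign(nu0)  # duality-map form of (25), unnormalized
print(s / pred)                               # nearly constant across components
```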

4 Sparsity-Promoting Regularization

The limit case of the previous scenario is \(p=1\) (the compressed-sensing setting), for which the norm is no longer strictly convex. To deal with such cases, where the solution is potentially non-unique, we first recall the Krein–Milman theorem [41, p. 75], which allows us to describe the weak\(^*\)-compact solution set S in Theorem 2 as the closed convex hull of its extreme points. We then invoke a recent result by Boyer et al. [8] that yields the following characterization of the extremal points of the solution set of Problem (8).

Theorem 3

All extremal points \(f_{0,\mathrm{ext}}\) of the solution set S of Problem (8) can be expressed as

$$\begin{aligned} f_{0,\mathrm{ext}}=\sum _{k=1}^{K_0} a_k e_k \end{aligned}$$
(26)

for some \(1\le K_0\le M\) where the \(e_k\) are some extremal points of the unit “regularization” ball \(B_{{\mathcal {X}}'}=\{f \in {\mathcal {X}}': \Vert f\Vert _{{\mathcal {X}}'}\le 1\}\) and \((a_k)\in {\mathbb {R}}^{K_0}\) is a vector of appropriate weights.

The above is a direct corollary of [8, Theorem 1 with \(j=0\)] applied to an extreme point of the equivalent generalized interpolation problem (10). We also note that the existence of a minimizer \(f_0\in S\) of the form (26) has been established independently by Bredies and Carioni [9] in a framework that is even more general than the one considered here. The latter property is also directly deducible from the reduced problem (10) and a classical result by Singer [47, Lemma 1.3, p. 169]. Still, the existence of a global minimizer of the form (26) is not as strong a result as Theorem 3, which tells us that the characterization applies to all extremal points of S. Moreover, it should be pointed out that the result in Theorem 3 is not particularly informative for strictly convex spaces such as \(\ell _p({\mathbb {Z}})\) or \(L_p({\mathbb {R}}^d)\) with \(p\in (1,\infty )\), for which all unit vectors (i.e., \(e\in {\mathcal {X}}'\) with \(\Vert e\Vert _{{\mathcal {X}}'}=1\)) are extremal points of the unit ball. Indeed, since the corresponding solution is unique (by Theorem 2), we trivially have that \(f_0=\Vert f_0\Vert _{{\mathcal {X}}'}e_1\) with \(K_0=1\) and \(e_1=f_0/\Vert f_0\Vert _{{\mathcal {X}}'}\).

By contrast, the characterization in Theorem 3 is highly relevant for the non-strictly convex space \({\mathcal {X}}'=\ell _1({\mathbb {Z}})\), whose extreme points are intrinsically sparse, i.e., \(e_k=(\pm \delta [n-n_k])_{n\in {\mathbb {Z}}}\) for some fixed offset \(n_k\in {\mathbb {Z}}\). Here, \(\delta [\cdot ]\) denotes the Kronecker impulse, which is such that \(\delta [0]=1\) and \(\delta [n]=0\) for \(n\ne 0\). Hence, the outcome is that the use of the \(\ell _1\) penalty [e.g., (24) with \(p=1\)] has a tendency to induce sparse solutions with \(\Vert f\Vert _0=K_0\le M\), which is the flavor of the representer theorem(s) in [52]. Two other practically relevant examples that fall into the non-strictly convex category are considered next.
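
Before moving on, the sketch below illustrates this sparsifying effect numerically: it solves (24) with \(p=1\) and a quadratic loss by plain proximal-gradient (ISTA) iterations (the step size, iteration count, and problem dimensions are ad hoc choices) and counts the nonzero entries of the minimizer, which for generic data does not exceed \(M\).

```python
import numpy as np

rng = np.random.default_rng(4)
M, N, lam = 8, 50, 0.5
H = rng.standard_normal((M, N))
y = rng.standard_normal(M)

step = 1.0 / (2.0 * np.linalg.norm(H, 2) ** 2)   # 1/L, with L the Lipschitz constant of the data-term gradient
x = np.zeros(N)
for _ in range(20000):
    v = x - step * 2.0 * H.T @ (H @ x - y)                      # gradient step on ||y - Hx||^2
    x = np.sign(v) * np.maximum(np.abs(v) - lam * step, 0.0)    # soft-threshold: prox of lam*||.||_1

print(np.count_nonzero(x), "nonzero entries out of", N)         # at most M for generic data
```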

4.1 Super-Resolution Localization of Spikes

The space of continuous functions over a compact domain \(\varOmega \subset {\mathbb {R}}^d\) equipped with the supremum (or \(L_\infty \)) norm is a classical Banach space denoted by

$$\begin{aligned} C(\varOmega )=\{f : \varOmega \rightarrow {\mathbb {R}} \text{ continuous}: \Vert f\Vert _{\infty }{\mathop {=}\limits ^{\vartriangle }}\sup _{{\varvec{x}} \in \varOmega } |f({\varvec{x}})|<\infty \}. \end{aligned}$$
(27)

Its continuous dual

$$\begin{aligned} {\mathcal {M}}(\varOmega )=\{f: C(\varOmega ) \rightarrow {\mathbb {R}} \text{ linear}: \Vert f\Vert _{{\mathcal {M}}}{\mathop {=}\limits ^{\vartriangle }}\sup _{\varphi \in C(\varOmega ):\, \Vert \varphi \Vert _\infty \le 1} \langle f, \varphi \rangle < \infty \} \end{aligned}$$
(28)

is the Banach space of bounded (signed) Radon measures on \(\varOmega \) (by the Riesz–Markov representation theorem [40]). Moreover, it is well known that the extreme points of the unit ball in \({\mathcal {M}}(\varOmega )\) are point measures (a.k.a. Dirac impulses) of the form \(e_k=\pm \delta (\cdot -{\varvec{x}}_k)\) for some \({\varvec{x}}_k\in \varOmega \), with the property that

$$\begin{aligned} \varphi \mapsto \langle \delta (\cdot -{\varvec{x}}_k), \varphi \rangle = \varphi ({\varvec{x}}_k) \end{aligned}$$
(29)

for any \(\varphi \in C(\varOmega )\). For a collection of (linearly independent) analysis functions \(\nu _1,\dots ,\nu _M\in C(\varOmega )\) (e.g., Fourier exponentials), we can invoke Theorems 2 and 3 with \({\mathcal {X}}'={\mathcal {M}}(\varOmega )\) to deduce that the extreme points of the solution set of the problem

$$\begin{aligned} S=\arg \min _{f \in {\mathcal {M}}(\varOmega )} \left( E\big ({\varvec{y}},{\varvec{\nu }}(f)\big ) + \lambda \Vert f\Vert _{{\mathcal {M}}}\right) \end{aligned}$$
(30)

are inherently sparse. This means that there necessarily exists at least one minimizer of the form

$$\begin{aligned} f_0=\sum _{k=1}^{K_0} a_k \delta (\cdot -{\varvec{x}}_k) \end{aligned}$$
(31)

with \(K_0\le M\), \((a_k) \in {\mathbb {R}}^{K_0}\), and \({\varvec{x}}_1,\dots ,{\varvec{x}}_{K_0} \in \varOmega \). The fact that (30) admits a global solution whose representation is given by (31) is a result that can be traced back to the work of Fisher and Jerome in [25, Theorem 1]. This optimality result is the foundation for a recent variational method for super-resolution localization that was investigated by a number of authors [10, 12, 24]. Besides the development of grid-free optimization schemes, researchers have worked out the conditions on \({\varvec{x}}_k\) and \(\nu _m\) under which (30) can provide a perfect recovery of spike trains of the form given by (31) with a small \(K_0\) [13, 17, 37]. The remarkable finding is that there are many configurations for which super-resolution recovery is guaranteed, with an accuracy that only depends on the signal-to-noise ratio and the minimal spacing between neighboring spikes.
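
A small grid-based illustration of (30)–(31): restricting the measure to a fine grid on \(\varOmega =[0,1]\) and taking a quadratic loss turns (30) into a finite-dimensional \(\ell _1\) problem that an off-the-shelf convex solver handles directly (the sketch assumes the CVXPY package; the trigonometric measurement functions, grid size, and regularization weight are arbitrary choices). For this configuration, the recovered weights are typically supported on a few grid points clustered around the true spike locations.

```python
import numpy as np
import cvxpy as cp

# ground-truth sparse measure: three spikes on Omega = [0, 1]
t_true = np.array([0.22, 0.50, 0.81])
a_true = np.array([1.0, -0.7, 0.5])

# analysis functions nu_m: low-frequency cosines/sines, evaluated on a fine grid
grid = np.linspace(0, 1, 400)
freqs = np.arange(1, 7)                                   # M = 12 measurements
Phi = np.vstack([np.cos(2 * np.pi * k * grid) for k in freqs] +
                [np.sin(2 * np.pi * k * grid) for k in freqs])
Phi_true = np.vstack([np.cos(2 * np.pi * k * t_true) for k in freqs] +
                     [np.sin(2 * np.pi * k * t_true) for k in freqs])
y = Phi_true @ a_true                                     # noiseless measurements nu(f)

# grid discretization of (30): min ||y - Phi a||^2 + lam ||a||_1
a = cp.Variable(grid.size)
lam = 1e-3
cp.Problem(cp.Minimize(cp.sum_squares(y - Phi @ a) + lam * cp.norm1(a))).solve()

support = np.flatnonzero(np.abs(a.value) > 1e-3)
print(grid[support])    # few active grid points, near the true spike locations
```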

4.2 Sparse Kernel Expansions

Schwartz’ space of smooth and rapidly decaying functions on \({\mathbb {R}}^d\) is denoted by \({\mathcal {S}}({\mathbb {R}}^d)\). Its continuous dual is \({\mathcal {S}}'({\mathbb {R}}^d)\): the space of tempered distributions. In this section, \(\mathrm{L}: {\mathcal {S}}'({\mathbb {R}}^d) \xrightarrow {{\ \mathrm{c.}\ }}{\mathcal {S}}'({\mathbb {R}}^d)\) is an invertible operator with continuous inverse \(\mathrm{L}^{-1}: {\mathcal {S}}'({\mathbb {R}}^d) \xrightarrow {{\ \mathrm{c.}\ }}{\mathcal {S}}'({\mathbb {R}}^d)\). We also assume that the generalized impulse response of \(\mathrm{L}^{-1}\) is a bivariate function of slow growth \(h: {\mathbb {R}}^d \times {\mathbb {R}}^d \rightarrow {\mathbb {R}}\). In other words, the inverse operator \(\mathrm{L}^{-1}\) has the explicit integral representation

$$\begin{aligned} \mathrm{L}^{-1}\{\varphi \}=\int _{{\mathbb {R}}^d} h(\cdot ,{\varvec{y}}) \varphi ({\varvec{y}}) \mathrm{d}{\varvec{y}} \end{aligned}$$
(32)

for any \(\varphi \in {\mathcal {S}}({\mathbb {R}}^d)\). In conformity with the nomenclature of [53], the native Banach space for \(\big (\mathrm{L},{\mathcal {M}}({\mathbb {R}}^d)\big )\) is

$$\begin{aligned} {\mathcal {M}}_\mathrm{L}({\mathbb {R}}^d)=\{f\in {\mathcal {S}}'({\mathbb {R}}^d): \Vert \mathrm{L}f\Vert _{{\mathcal {M}}}{\mathop {=}\limits ^{\vartriangle }}\sup _{\varphi \in {\mathcal {S}}({\mathbb {R}}^d): \, \Vert \varphi \Vert _\infty \le 1} \langle \mathrm{L}f, \varphi \rangle < \infty \}. \end{aligned}$$
(33)

It is isometrically isomorphic to \({\mathcal {M}}({\mathbb {R}}^d)\) (the space of bounded Radon measures on \({\mathbb {R}}^d\)). This is to say that the operators \(\mathrm{L}, \mathrm{L}^{-1}\) have restrictions \(\mathrm{L}: {\mathcal {M}}_\mathrm{L}({\mathbb {R}}^d) \xrightarrow {{\ \mathrm{c.}\ }}{\mathcal {M}}({\mathbb {R}}^d)\) and \(\mathrm{L}^{-1}: {\mathcal {M}}({\mathbb {R}}^d) \xrightarrow {{\ \mathrm{c.}\ }}{\mathcal {M}}_\mathrm{L}({\mathbb {R}}^d)\) that are isometries. Consequently, we can apply Theorem 2 to deduce that the generic learning problem

$$\begin{aligned} S=\arg \min _{f \in {\mathcal {M}}_\mathrm{L}({\mathbb {R}}^d)} \left( \sum _{m=1}^M E_m\big (y_m, f({\varvec{x}}_m)\big ) + \lambda \Vert \mathrm{L}f\Vert _{{\mathcal {M}}}\right) \end{aligned}$$
(34)

admits a solution, albeit not necessarily a unique one since the underlying search space \({\mathcal {M}}_\mathrm{L}({\mathbb {R}}^d)\)—or, equivalently, the parent space \({\mathcal {M}}({\mathbb {R}}^d)\)—is neither reflexive nor strictly convex.

In order to refine the above statement with the help of Theorem 3, we first observe that the extreme points of the unit ball in \({\mathcal {M}}({\mathbb {R}}^d)\) take the form \(e_k=\pm \delta (\cdot -{\varvec{\tau }}_k)\) with \({\varvec{\tau }}_k\in {\mathbb {R}}^d\), which is consistent with the result in Sect. 4.1 for \({\mathcal {M}}(\varOmega )\). Since the map \(\mathrm{L}^{-1}: {\mathcal {M}}({\mathbb {R}}^d) \xrightarrow {{\ \mathrm{c.}\ }}{\mathcal {M}}_\mathrm{L}({\mathbb {R}}^d)\) is isometric, this allows us to identify the extreme points of the unit ball in \({\mathcal {M}}_\mathrm{L}({\mathbb {R}}^d)\) as

$$\begin{aligned} u_k=\mathrm{L}^{-1}\{e_k\}=\pm \mathrm{L}^{-1}\{\delta (\cdot -{\varvec{\tau }}_k)\}=\pm h(\cdot ,{\varvec{\tau }}_k) \end{aligned}$$
(35)

where \(h: {\mathbb {R}}^d \times {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) is the kernel of the operator in (32). Consequently, we can invoke Theorem 3 to prove that the extreme points of the solution set of Problem (34) are all expressible as

$$\begin{aligned} f_0({\varvec{x}})=\sum _{k=1}^{K_0} a_k h({\varvec{x}},{\varvec{\tau }}_k) \end{aligned}$$
(36)

with parameters \(K_0\le M\), \({\varvec{\tau }}_1,\dots ,{\varvec{\tau }}_{K_0} \in {\mathbb {R}}^d\), and \((a_k)\in {\mathbb {R}}^{K_0}\). Moreover, since \(\mathrm{L}\{h(\cdot ,{\varvec{\tau }}_k)\}=\delta (\cdot -{\varvec{\tau }}_k)\) and \(\Vert \delta (\cdot -{\varvec{\tau }}_k)\Vert _{{\mathcal {M}}}=\Vert e_k\Vert _{{\mathcal {M}}}=1\), the optimal regularization cost is \(\Vert \mathrm{L}f_0\Vert _{{\mathcal {M}}}=\sum _{k=1}^{K_0}|a_k|=\Vert {\varvec{a}}\Vert _{\ell _1}\), which makes an interesting connection with \(\ell _1\)-norm minimization and the generalized LASSO [39, 49]. To sum up, the solution (36) has a kernel expansion that is similar to (15), with the important twist that the kernel centers \({\varvec{\tau }}_k\) are adaptive, meaning that their locations as well as their number \(K_0\) are data-dependent. In effect, it is the underlying \(\ell _1\)-norm penalty that helps reduce the number \(K_0\) of active kernels, thereby producing a sparse solution. We should also point out that the form of the solution is compatible with the empirical method of moving and learning the data centers in kernel expansions (see [35, Section IV]) with the important difference that the present proposal is purely variational.

When \(\mathrm{L}: {\mathcal {S}}'({\mathbb {R}}^d) \xrightarrow {{\ \mathrm{c.}\ }}{\mathcal {S}}'({\mathbb {R}}^d)\) is linear shift-invariant (LSI) with frequency response \( {\mathcal {F}}\big \{\mathrm{L}\delta \big \}({\varvec{\omega }})={\widehat{L}}({\varvec{\omega }})\), then \(h({\varvec{x}},{\varvec{\tau }})=h_{\mathrm{LSI}}({\varvec{x}}-{\varvec{\tau }})\) with

$$\begin{aligned} h_{\mathrm{LSI}}({\varvec{x}})= {\mathcal {F}}^{-1}\left\{ \frac{1}{{\widehat{L}}({\varvec{\omega }})}\right\} ({\varvec{x}}), \end{aligned}$$
(37)

where the operator \( {\mathcal {F}}^{-1}: {\mathcal {S}}'({\mathbb {R}}^d) \rightarrow {\mathcal {S}}'({\mathbb {R}}^d)\) is the generalized inverse Fourier transform.

The overarching message in the optimality result of the present section is that the choice of the regularization operator \(\mathrm{L}\) in (34) predetermines the parametric form of the kernel in (36). Now, in light of (37), we can choose to specify first a kernel \(h_{\mathrm{LSI}}: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) and then infer the frequency response of the corresponding regularization operator

$$\begin{aligned} {\widehat{L}}({\varvec{\omega }})=\frac{1}{\widehat{h}_{\mathrm{LSI}}({\varvec{\omega }})}. \end{aligned}$$
(38)

Now, the necessary and sufficient condition for the continuity of \(\mathrm{L}: {\mathcal {S}}'({\mathbb {R}}^d) \rightarrow {\mathcal {S}}'({\mathbb {R}}^d)\) is that the function \({\widehat{L}}: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) be smooth and slowly growing [46]. A parametric class of kernels that meets this admissibility requirement is the super-exponential family

$$\begin{aligned} h_{\mathrm{LSI}}({\varvec{x}})=\exp \left( -\Vert {\varvec{x}}\Vert ^{\alpha }\right) \end{aligned}$$
(39)

with \(\alpha \in (0,2)\). The limit case with \(\alpha =2\) (Gaussian) is excluded because the corresponding frequency response in (38) (inverse of a Gaussian) fails to be slowly growing.
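
To make the interplay between (36) and (39) concrete, the sketch below builds a dictionary of shifted kernels \(h_{\mathrm{LSI}}(x-\tau )\) with \(\alpha =1\) (the exponential member of (39)) on a grid of candidate centers, fits scattered one-dimensional data by \(\ell _1\)-penalized least squares (a grid-restricted surrogate of (34); the synthetic data, grid, and CVXPY solver are illustrative assumptions), and reports the few centers \(\tau _k\) that are selected adaptively:

```python
import numpy as np
import cvxpy as cp

def h_lsi(x, alpha=1.0):
    # super-exponential kernel (39); alpha = 1 gives the exponential (Laplacian) kernel
    return np.exp(-np.abs(x) ** alpha)

rng = np.random.default_rng(5)
x_m = np.sort(rng.uniform(0, 8, 30))                       # data coordinates
y = 1.2 * h_lsi(x_m - 2.0) - 0.8 * h_lsi(x_m - 5.5) + 0.05 * rng.standard_normal(30)

tau = np.linspace(0, 8, 160)                               # candidate kernel centers
D = h_lsi(x_m[:, None] - tau[None, :])                     # dictionary [D]_{m,k} = h(x_m - tau_k)

a = cp.Variable(tau.size)
lam = 0.05
cp.Problem(cp.Minimize(cp.sum_squares(y - D @ a) + lam * cp.norm1(a))).solve()

active = np.flatnonzero(np.abs(a.value) > 1e-3)
print("selected centers tau_k:", tau[active])              # few, data-dependent centers, as in (36)
```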

5 Conclusion

We have shown that the fundamental ingredient in the quest for a representer theorem is the identification and characterization of a dual pair of Banach spaces that is linked to the regularization functional. The main result of the paper is expressed by Theorem 2, which is valid for Banach spaces in general. This characterization of the solution of the general optimization problem (8) is directly exploitable in the reflexive and strictly convex scenario—in which case the solution is also known to be unique—whenever the duality mapping is known. While our formulation also offers interesting insights for certain non-strictly convex and sparsity-promoting norms such as \(\Vert \cdot \Vert _{\ell _1}\) and its continuous-domain counterpart—the total variation \(\Vert \cdot \Vert _{{\mathcal {M}}}\) and generalization thereof—it raises intriguing questions about the unicity of such solutions and the necessity to develop some corresponding numerical optimization schemes.

We have made the link with the existing literature in machine learning (regression) and the resolution of ill-posed inverse problems by considering several concrete cases, including reproducing kernel Hilbert spaces (RKHS) and compressed sensing. The conciseness and self-containedness of the proposed derivations are a good indication of the power of the approach.

Since the concept of Banach spaces is remarkably general, one can easily conceive of other variations around the common theme of regularization and representer theorems. Potential topics for further research include the use of nonstandard norms, the deployment of hybrid regularization schemes, vector-valued functions or feature maps [1], and the consideration of direct-sum spaces and semi-norms, as in the theory of splines [7, 18, 20, 34, 53, 54]. In short, there is ample room for additional theoretical and practical investigations, in direct analogy with what has been accomplished during the past few decades in the simpler but more restrictive context of RKHS [1, 2]. Interestingly, there also appears to be a link with deep neural/kernel networks, as has been demonstrated recently [5, 51].