1 Introduction

Regularization theory constitutes a powerful framework for the derivation of algorithms for supervised learning [14, 41, 42]. Given a series of data points \(({\varvec{x}}_m, y_m) \in {\mathbb {R}}^d \times {\mathbb {R}}\), \(m=1,\dots ,M\), the basic problem (regression) is to find a mapping \(f: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) such that \(f({\varvec{x}}_m)\approx y_m\), without overfitting. The standard paradigm is to let f be the minimizer of a cost that consists of a data-fidelity term and an additive regularization functional [8]. The minimization proceeds over a prescribed class \({\mathcal {H}}\) of candidate functions. One usually distinguishes between the parametric approaches (e.g., neural networks), where \({\mathcal {H}}={\mathcal {H}}_{\varTheta }\) is a family of functions specified by a finite set of parameters \(\theta \in \varTheta \) (e.g., the weights of the network), and the nonparametric ones, where the properties of the solution are controlled by the regularization functional. The focus of this paper is on the nonparametric techniques. They rely on functional optimization, which means that the minimization proceeds over a space of functions rather than over a set of parameters. The regularization is usually chosen to be an increasing function of the norm associated with a particular Banach space, which results in a well-posed problem [9, 10, 56].

The functional-optimization point of view is often constructive, in that it suggests or supports explicit learning architectures. For instance, the choice of the Hilbertian regularization \(R(f)=\Vert f\Vert ^2_{{\mathcal {H}}}\) where \({\mathcal {H}}\) is a reproducing-kernel Hilbert space (RKHS) results in a closed-form solution that is a linear combination of kernels positioned on the data [7, 62]. In fact, the RKHS setting yields a generic class of estimators that is compatible with the classical kernel-based methods of machine learning, including support vector machines [1, 41, 49, 50, 56, 62]. Likewise, adaptive kernel methods are justifiable from the minimization of a generalized total-variation norm, which favors sparse representations [3, 11, 12]. These latter results actually have their roots in spline theory [18, 28, 60]. Similarly, it has been demonstrated that shallow ReLU networks are solutions of functional-optimization problems with an appropriate regularization. One way to achieve this is to start from an explicit parameterization of an infinite-width network [4] (the reverse engineering/synthesis approach). Another way is to consider a regularization operator that is matched to the neuronal activation with an \(L_1\)-type penalty; for instance, a second-order derivative for \(d=1\) [36, 48] or, more generally, the Radon-domain counterpart of the Laplace operator whose Green’s function is precisely a ReLU ridge [35, 37, 57]. Similar optimality results can be stated within the framework of reproducing-kernel Banach spaces [6], which is a formal point of view that bridges the synthesis and analysis approaches of [4] and [37], respectively. Also relevant to the discussion is a variational formulation that links the ridgelet transform to the training of shallow neural networks with weight-decay regularization [53].

The second important benefit of the functional-optimization approach is that it gives insight into the approximation capabilities (expressivity) of the resulting learning architectures. This information is encapsulated in the definition of the native space \({\mathcal {H}}\) (typically, a Sobolev space), which goes hand-in-hand with the regularization functional. Roughly speaking, the native space \({\mathcal {H}}\) ought to be “large enough” to allow for the approximation of any continuous function with an arbitrary degree of precision. This universal approximation property is a central theme in the theory of radial-basis functions (RBFs) [31, 63]. In machine learning, the kernel estimators that meet this approximation requirement are called universal [32]. When the basis functions are shifted replicates of a single template \(h: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\), then the condition is equivalent to h being strictly positive definite, which means that its Fourier transform is real-valued, symmetric, and (strictly) positive [13]. Similar guarantees of universal approximation exist for (shallow) neural networks under mild conditions on the activation functions [5, 16, 25, 30, 39]. The main difference from the RKHS framework, however, is that the universality results for neural nets usually make the assumption that the input domain is a compact subset of \({\mathbb {R}}^d\).

The purpose of this paper is to unify and extend these various approaches by introducing a universal regularization functional. The latter has two components: an admissible differential operator \(\textrm{L}\), and an \(L_p\)-type Radon-domain norm. The resulting regularization operator is \(\textrm{L}_{\textrm{R}}={\textrm{K}}_{\textrm{rad}}\text {RL}\), where \({\textrm{R}}\) is the Radon transform and \({\textrm{K}}_{\textrm{rad}}\) the “filtering” operator of computed tomography [33]. Our main result (Theorem 5) gives the parametric form of the solution of the corresponding functional-optimization problems under minimal hypotheses. For \(p=2\), the outcome is compatible with the type of kernel expansions (RBFs) of classical machine learning for which there is a vast literature [24, 52]. For \(p=1\), the solution set is parameterized by a neural network with one hidden layer whose activation function is determined by the regularization operator. In particular, if we take \(\textrm{L}\) to be the Laplacian, then one retrieves the popular ReLU activation. Remarkably, the connection with neural networks also works the other way round: Parhi et al. [36, 38] proved that the training of a shallow ReLU neural network that is sufficiently wide, with weight-decay regularization, converges to the solution of a functional-optimization problem that is a special instance of the class considered in this paper.

The foundation for our characterization is an abstract representer theorem for direct-sum Banach spaces [58]. Thus, the primary effort in this paper consists in the development of a functional framework that is adapted to the Radon transform and that fulfills the hypotheses of the abstract theorem. The main contributions can be summarized as follows.

  1.

    Construction and characterization of an extended family of native Banach spaces \({\mathcal {X}}'_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d)\) associated with a generic Radon-domain norm \(\Vert \cdot \Vert _{{\mathcal {X}}'}\) and a differential operator \(\textrm{L}\), under the general admissibility conditions stated in Definition 3 (Theorem 6).

  2.

    Proof that: (i) the sampling functionals \(\delta (\cdot -{\varvec{x}}_m): {\mathcal {X}}'_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d) \rightarrow {\mathbb {R}}\) are weak*-continuous; and (ii) the adjoint of the regularization operator has a stable generalized inverse \(\textrm{L}^{*\dagger }_{\textrm{R}}\) (see Theorem 7 and accompanying explanations). These technical points are essential to the argument (existence of a solution).

  3.

    Extension and unification of a number of earlier optimality results for RBF expansions and neural networks. While the present setup for \(p=2\) and \(\textrm{L}=(-\varDelta )^{\gamma }\) is reminiscent of thin-plate splines [17, 29], the resulting solution for a fixed \(\gamma \) does not depend on the dimension d, which makes it easier to deploy. Likewise, our variational formulation with \({\mathcal {X}}'={\mathcal {M}}\) extends the results of Parhi and Nowak [37] by: (i) proving that the neural network parameterization applies to all the extreme points of the solution set, and (ii) covering a much broader class of activation functions, including those with polynomial growth (of degree \(n_0\)).

  4.

    General guarantees of universality, subject to the admissibility condition in Definition 3. While the result for \(p=2\) is consistent with the known criteria for kernel estimators [32], its counterpart for neural networks \(({\mathcal {X}}'={\mathcal {M}})\) brings in a new twist: the addition of a polynomial component. The latter, which is not present in the traditional theory [5, 39], is necessary to lift the hypothesis of a compact input domain. The two cases of greatest practical relevance are the sigmoid and the ReLU activations which, in our formulation, require the addition of a bias (\(n_0=0\)) and an affine term \((n_0=1)\), respectively. Interestingly, the ReLU case yields a neural architecture with a skip connection akin to ResNet [22], which is highly popular in practice.

The paper is organized as follows: We start with mathematical preliminaries in Sect. 2. In particular, we state our criteria of admissibility for \(\textrm{L}\) and show how to represent its polynomial null space. In Sect. 3, we review the main properties of the Radon transform and specify the dual pair \(({\mathcal {X}}_{\textrm{Rad}}, {\mathcal {X}}'_{\textrm{Rad}})\) of hyper-spherical Banach spaces that enter the definition of our native spaces. We also provide formulas for the (filtered) Radon transform of RBFs and ridges (the elementary constituents of neural networks). Section 4 is devoted to the description and interpretation of our main result (Theorem 5). In particular, we draw a connection with RKHS in Sect. 4.2. We discuss the issue of universality in Sect. 4.3 and show in Sect. 4.4 how our framework can be extended to handle anti-symmetric activations, including sigmoids. We complement our exposition in Sect. 4.5 with a listing of specific configurations, many of which are intimately connected to splines. The mathematical developments that support our formulation are presented in Sect. 5. They include the characterization of the kernel of the inverse operator \({\textrm{L}}^{*\dagger }_{\textrm{R}}\) —the enabling ingredient of our formulation— and the construction of the predual Banach space \({\mathcal {X}}_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d)\).

2 Mathematical Preliminaries

2.1 Notations

We shall consider multidimensional functions f on \({\mathbb {R}}^d\) that are indexed by the variable \({\varvec{x}} \in {\mathbb {R}}^d\). To describe their partial derivatives, we use the multi-index \({\varvec{k}}=(k_1,\dots ,k_d) \in {\mathbb {N}}^d\) (where \({\mathbb {N}}\) includes 0) with the notational conventions \({\varvec{k}}!=\prod _{i=1}^d k_i!\), \(|{\varvec{k}}|=k_1+\cdots +k_d\), \({\varvec{x}}^{{\varvec{k}}}=\prod _{i=1}^d x_i^{k_i}\) for any \({\varvec{x}} \in {\mathbb {R}}^d\), and \(\partial ^{\varvec{k}} f({\varvec{x}})=\frac{\partial ^{|{\varvec{k}}|}f(x_1,\dots ,x_d)}{\partial ^{k_1}_{x_1} \cdots \partial ^{k_d}_{x_d}}\). This allows us to write the multidimensional Taylor expansion around \({\varvec{x}}={\varvec{x}}_0\) of an analytical function \(f: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) explicitly as

$$\begin{aligned} f({\varvec{x}})=\sum _{n=0}^\infty \sum _{|{\varvec{k}}|=n} \frac{\partial ^{\varvec{k}} f({\varvec{x}}_0) ({\varvec{x}}- {\varvec{x}}_0)^{\varvec{k}}}{{\varvec{k}}!} \end{aligned}$$
(1)

where the internal summation is over all multi-indices \({\varvec{k}}\) such that \(k_1+\cdots +k_d=n\).
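
To make the multi-index conventions and the expansion (1) concrete, here is a minimal numerical sketch for \(d=2\); the test function \(f(x_1,x_2)=\textrm{e}^{-x_1}\cos (x_2)\), the expansion point, and the truncation order are illustrative choices of ours, not objects from the paper.

```python
# A minimal numerical check of the multi-index conventions and of the truncated
# Taylor expansion (1) for d = 2.  The test function f(x1, x2) = exp(-x1)*cos(x2)
# and all numerical values below are illustrative choices.
import itertools
import math
import numpy as np

def multi_indices(d, n):
    """All multi-indices k in N^d with |k| = k_1 + ... + k_d = n."""
    return [k for k in itertools.product(range(n + 1), repeat=d) if sum(k) == n]

def partial_derivative(k, x):
    """Closed-form partial derivative d^k f at x for f(x1, x2) = exp(-x1)*cos(x2)."""
    k1, k2 = k
    x1, x2 = x
    trig = [np.cos(x2), -np.sin(x2), -np.cos(x2), np.sin(x2)][k2 % 4]
    return (-1) ** k1 * np.exp(-x1) * trig

x0 = np.array([0.3, -0.2])   # expansion point
x = np.array([0.5, 0.1])     # evaluation point
taylor = sum(
    partial_derivative(k, x0) * np.prod((x - x0) ** np.array(k))
    / math.prod(math.factorial(ki) for ki in k)
    for n in range(12)                 # truncation order, sufficient here
    for k in multi_indices(2, n)
)
print(taylor, np.exp(-x[0]) * np.cos(x[1]))   # the two values should agree closely
```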

Schwartz’ space of smooth and rapidly decreasing test functions \(\varphi : {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) equipped with the usual Fréchet-Schwartz topology is denoted by \({\mathcal {S}}({\mathbb {R}}^d)\). Its continuous dual is the space \({\mathcal {S}}'({\mathbb {R}}^d)\) of tempered distributions. In this setting, the Lebesgue spaces \(L_p({\mathbb {R}}^d)\) for \(p\in [1,\infty )\) can be specified as the completion of \({\mathcal {S}}({\mathbb {R}}^d)\) equipped with the \(L_p\)-norm \(\Vert \cdot \Vert _{L_p}\), denoted as \(L_p({\mathbb {R}}^d)=\overline{({\mathcal {S}}({\mathbb {R}}^d),\Vert \cdot \Vert _{L_p})}\). For the endpoint \(p=\infty \), we have \(\overline{({\mathcal {S}}({\mathbb {R}}^d),\Vert \cdot \Vert _{L_\infty })}=C_0({\mathbb {R}}^d)\) with \(\Vert \varphi \Vert _{L_\infty }=\sup _{{\varvec{x}} \in {\mathbb {R}}^d}|\varphi ({\varvec{x}})|\), which is the space of continuous functions that vanish at infinity. The continuous dual of \(C_0({\mathbb {R}}^d)\) is the space \({\mathcal {M}}({\mathbb {R}}^d)=\{f \in {\mathcal {S}}'({\mathbb {R}}^d): \Vert f\Vert _{{\mathcal {M}}}<\infty \}\) of bounded Radon measures with

$$\begin{aligned} \Vert f\Vert _{{\mathcal {M}}}=\sup _{\varphi \in {\mathcal {S}}({\mathbb {R}}^d): \Vert \varphi \Vert _{L_\infty }\le 1} \langle f, \varphi \rangle . \end{aligned}$$
(2)

The latter is a superset of \(L_1({\mathbb {R}}^d)\), which is isometrically embedded in it, in the sense that \(\Vert f\Vert _{L_1}= \Vert f\Vert _{{\mathcal {M}}}\) for all \(f \in L_1({\mathbb {R}}^d)\).

The Fourier transform of a function \(\varphi \in L_1({\mathbb {R}}^d)\) is defined as

$$\begin{aligned} {{\widehat{\varphi }}}({\varvec{\omega }}){\mathop {=}\limits ^{\vartriangle }} {\mathcal {F}}\{\varphi \}({\varvec{\omega }})=\int _{{\mathbb {R}}^d} \varphi ({\varvec{x}}) \textrm{e}^{-\textrm{i}\langle {\varvec{\omega }}, {\varvec{x}} \rangle } \textrm{d}{\varvec{x}}. \end{aligned}$$
(3)

Since the Fourier operator \( {\mathcal {F}}\) continuously maps \({\mathcal {S}}({\mathbb {R}}^d)\) into itself, the transform can be extended by duality to the whole space \({\mathcal {S}}'({\mathbb {R}}^d)\) of tempered distributions. Specifically, \({{\widehat{f}}}= {\mathcal {F}}\{f\} \in {\mathcal {S}}'({\mathbb {R}}^d)\) is the (unique) generalized Fourier transform of \(f\in {\mathcal {S}}'({\mathbb {R}}^d)\) if and only if \(\langle {{\widehat{f}}}, \varphi \rangle =\langle f, {{\widehat{\varphi }}} \rangle \) for all \(\varphi \in {\mathcal {S}}({\mathbb {R}}^d)\), where \({{\widehat{\varphi }}}= {\mathcal {F}}\{\varphi \}\) is the “classical” Fourier transform of \(\varphi \) defined by (3).
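
As a quick sanity check of the convention (3), the following sketch (restricted to \(d=1\)) compares a quadrature evaluation of the transform of the Gaussian \(\varphi (x)=\textrm{e}^{-x^2/2}\) with the closed form \({{\widehat{\varphi }}}(\omega )=\sqrt{2\pi }\,\textrm{e}^{-\omega ^2/2}\); the integration grid is an arbitrary choice.

```python
# A quick 1D sanity check of the Fourier convention (3): for the Gaussian
# phi(x) = exp(-x^2/2) one expects phi_hat(w) = sqrt(2*pi)*exp(-w^2/2).
import numpy as np

x = np.linspace(-30, 30, 20001)
phi = np.exp(-x**2 / 2)

def fourier_1d(w):
    # forward transform (3) specialized to d = 1, evaluated by the trapezoidal rule
    return np.trapz(phi * np.exp(-1j * w * x), x)

for w in (0.0, 1.0, 2.5):
    print(w, fourier_1d(w).real, np.sqrt(2 * np.pi) * np.exp(-w**2 / 2))
```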

To control the minimal order \(\alpha \ge 0\) of decay (resp., the maximal rate of growth) of functions, we use the dual pair of spaces \(L_{1,\alpha }({\mathbb {R}}^d)\) and \(L_{\infty ,-\alpha }({\mathbb {R}}^d)=\big (L_{1,\alpha }({\mathbb {R}}^d)\big )'\). These are the Banach spaces associated with the weighted norms

$$\begin{aligned}&\Vert f\Vert _{L_{1,\alpha }}{\mathop {=}\limits ^{\vartriangle }}\int _{{\mathbb {R}}^d} (1+\Vert {\varvec{x}}\Vert )^\alpha |f({\varvec{x}})| \textrm{d}{\varvec{x}} \end{aligned}$$
(4)
$$\begin{aligned}&\Vert f\Vert _{L_{\infty ,-\alpha }} {\mathop {=}\limits ^{\vartriangle }}{{\,\mathrm{ess\,sup}\,}}_{{\varvec{x}}\in {\mathbb {R}}^d} (1+\Vert {\varvec{x}}\Vert )^{-\alpha } |f({\varvec{x}})|, \end{aligned}$$
(5)

respectively. Specifically, the inclusion \(f\in L_{\infty ,-n_0}({\mathbb {R}}^d)\) with \(n_0 \in {\mathbb {N}}\) indicates that f cannot grow faster than a polynomial of degree \(n_0\), while the condition \(f \in L_{1,\alpha }({\mathbb {R}}^d)\) implies that \(f({\varvec{x}})\) must be locally integrable and must decay (slightly) faster than \(1/\Vert {\varvec{x}}\Vert ^{\alpha +d}\) as \(\Vert {\varvec{x}}\Vert \rightarrow \infty \).

2.2 Admissible Regularization Operators

The regularization operators \(\textrm{L}\) that are of interest to us are linear, shift-invariant (LSI), and isotropic. For simplicity, we shall first specify the action of \(\textrm{L}\) on test functions, with the understanding that the domain of the operator will be extended to some corresponding “native space” that will be identified as we progress through the paper.

Definition 1

The linear operator \(\textrm{L}: {\mathcal {S}}({\mathbb {R}}^d) \rightarrow {\mathcal {S}}'({\mathbb {R}}^d)\) is said to be

  • Shift-invariant: If \(\textrm{L}\{\varphi (\cdot -{\varvec{x}}_0) \}=\textrm{L}\{\varphi \}(\cdot -{\varvec{x}}_0)\) for all \(\varphi \in {\mathcal {S}}({\mathbb {R}}^d)\) and \({\varvec{x}}_0 \in {\mathbb {R}}^d\).

  • Isotropic (or rotation-invariant): If \(\textrm{L}\{\varphi ({\textbf{R}}_{{\varvec{\theta }}}\cdot ) \}=\textrm{L}\{\varphi \}({\textbf{R}}_{{\varvec{\theta }}}\cdot )\) for all \(\varphi \in {\mathcal {S}}({\mathbb {R}}^d)\) and any rotation matrix \({\textbf{R}}_{{\varvec{\theta }}}\) on \({\mathbb {R}}^d\).

  • Self-adjoint: If the adjoint operator \(\textrm{L}^*: {\mathcal {S}}''({\mathbb {R}}^d)={\mathcal {S}}({\mathbb {R}}^d)\rightarrow {\mathcal {S}}'({\mathbb {R}}^d)\) has the same Schwartz kernel (impulse response) as \(\textrm{L}\).

Since the Schwartz kernel of a linear operator is unique [55], the property of self-adjointness will be denoted as \(\textrm{L}=\textrm{L}^*\), irrespective of the actual domain and range of the operator. It is well-known that an LSI operator can always be expressed as the convolution \(\textrm{L}\{\varphi \}=h *\varphi \), where \(h=\textrm{L}\{\delta \}\in {\mathcal {S}}'({\mathbb {R}}^d)\) is the impulse response of \(\textrm{L}\). When \(\textrm{L}\) is isotropic, h is a purely radial function. Since all isotropic functions are symmetric, this also implies that an isotropic LSI operator is necessarily self-adjoint. All such operators are characterized by a Fourier symbol (a.k.a. frequency response) \({{\widehat{L}}}= {\mathcal {F}}\{h\}\) that is purely radial, with \({{\widehat{L}}}({\varvec{\omega }})={{\widehat{L}}}_{\textrm{rad}}(\Vert {\varvec{\omega }}\Vert )\), under the implicit assumption that the radial profile \({{\widehat{L}}}_{\textrm{rad}}\) is identifiable as a measurable function \({\mathbb {R}}\rightarrow {\mathbb {R}}\).

Our condition for admissibility is that \(\textrm{L}\) be invertible in an appropriate sense.

Definition 2

(Spline-admissible operators with trivial null space) An isotropic LSI operator \(\textrm{L}\) has a trivial null space if its radial frequency profile \({{\widehat{L}}}_{\textrm{rad}}\) does not vanish over \({\mathbb {R}}\). We then say that it is spline-admissible if \(1/{{\widehat{L}}}_{\textrm{rad}} \in L_1({\mathbb {R}})\) and \(\rho _{\textrm{rad}}= {\mathcal {F}}^{-1}\{1/{{\widehat{L}}}_{\textrm{rad}}\} \in L_1({\mathbb {R}})\) where the operator \( {\mathcal {F}}^{-1}: L_1({\mathbb {R}}) \rightarrow C_0({\mathbb {R}})\) is the classical inverse Fourier transform.

The typical scenario is \({{\widehat{L}}}({\varvec{\omega }})=(1 + \Vert {\varvec{\omega }}\Vert ^2)^{\alpha /2}\) with \(\alpha \ge 1\), which results in a stable inverse operator \({\textrm{L}}^{-1}\) whose radially symmetric impulse response is the Bessel potential of order \(\alpha \). These operators play a central role in the theory of Sobolev spaces [21].
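
The following sketch illustrates Definition 2 for this Bessel-type profile with the particular choice \(\alpha =2\), for which \(1/{{\widehat{L}}}_{\textrm{rad}}\) is integrable and \(\rho _{\textrm{rad}}(t)=\textrm{e}^{-|t|}/2\) under the classical \((2\pi )^{-1}\)-normalized inverse transform; the truncation and grid values are illustrative.

```python
# A numerical illustration of Definition 2 for the Bessel-type profile
# L_rad_hat(w) = (1 + w^2)^(alpha/2) with the illustrative choice alpha = 2, for which
# the classical inverse Fourier transform of 1/L_rad_hat is exp(-|t|)/2.
import numpy as np

w = np.linspace(-500, 500, 1000001)
inv_profile = 1.0 / (1.0 + w**2)                          # 1/L_rad_hat(w) for alpha = 2
print("||1/L_rad_hat||_L1 ~", np.trapz(inv_profile, w))   # finite (close to pi)

def inv_fourier_1d(t):
    # classical inverse transform, (1/(2*pi)) * int g(w) exp(iwt) dw
    return np.trapz(inv_profile * np.exp(1j * w * t), w).real / (2 * np.pi)

for t in (0.0, 0.5, 2.0):
    print(t, inv_fourier_1d(t), np.exp(-abs(t)) / 2)      # rho_rad(t) vs. closed form
```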

Distribution theory allows us to go further and to invert operators with nontrivial null spaces, but only if the zeros of the frequency response are located at isolated points. When the operator is isotropic, this reduces the options to the cases where \({{\widehat{L}}}({\varvec{\omega }})\) has a (multiple) zero at \({\varvec{\omega }}={\varvec{0}}\). Specifically, we shall say that \(\textrm{L}\) is of order \(\gamma _0\) if \(|{{\widehat{L}}}({\varvec{\omega }})|/\Vert {\varvec{\omega }}\Vert ^{\gamma _0}\rightarrow C_0>0\) as \(\Vert {\varvec{\omega }}\Vert \rightarrow 0\). The second important parameter is the asymptotic growth exponent of \({{\widehat{L}}}({\varvec{\omega }})\). This is the largest index \(\gamma _1\) such that \(|{{\widehat{L}}}({\varvec{\omega }})|\ge C_1 \Vert {\varvec{\omega }}\Vert ^{\gamma _1},\) for all \(\Vert {\varvec{\omega }}\Vert >R\). It determines the smoothness of the Green’s function of the operator.

Definition 3

(Spline-admissible operators with nontrivial null space) An isotropic LSI operator \(\textrm{L}\) with radial frequency profile \({{\widehat{L}}}_{\textrm{rad}}\) is said to be spline-admissible with a polynomial null space of degree \(n_0\) if the following conditions are satisfied.

  1.

    The profile \({{\widehat{L}}}_{\textrm{rad}}\) does not vanish over \({\mathbb {R}}\), except for a zero of order \(\gamma _0\in (n_0,n_0+1]\) at the origin; that is, \(|{{\widehat{L}}}_{\textrm{rad}}(\omega )|/|\omega |^{\gamma _0}\rightarrow C_0>0\) as \(\omega \rightarrow 0\).

  2.

    There exists an order \(\gamma _1> 1\), a constant \(C_1>0\), and a radius \(R_1>0\) such that \(|{{\widehat{L}}}_{\textrm{rad}}(\omega )|\ge C_1 |\omega |^{\gamma _1}\) for all \(|\omega |>R_1\) (ellipticity).

  3.

    For all \(\varphi \in {\mathcal {S}}({\mathbb {R}}^d)\), \(\textrm{L}^{*}\{\varphi \} \in L_{1,n_0}({\mathbb {R}}^d)\).

The connection between Condition 1 and the null space of \(\textrm{L}\) will be explained in Sect. 2.3. Conditions 1 and 2 with \(\gamma _1>1\) ensure that \(\rho _{\textrm{rad}}= {\mathcal {F}}^{-1}\{1/{{\widehat{L}}}_{\textrm{rad}}\}\), which is the generalized inverse Fourier transform of the distribution \(1/{{\widehat{L}}}_{\textrm{rad}}\), is identifiable as a continuous function \({\mathbb {R}}\rightarrow {\mathbb {R}}\). The order \(\gamma _1\) actually controls the degree of differentiability (Sobolev smoothness) of \(\rho _{\textrm{rad}}\). Condition 3 is a mild technical constraint on the decay of \(\textrm{L}^*\{\varphi \}\); this constraint has not appeared to be a practical limitation so far. For instance, if \(\textrm{L}\) is an ordinary differential operator (an arbitrary polynomial of the Laplace operator \(\varDelta \)) then \(\textrm{L}^*\{\varphi \}\in {\mathcal {S}}({\mathbb {R}}^d)\), which is included in \(L_{1,m}({\mathbb {R}}^d)\) for any \(m\in {\mathbb {Z}}\). We use this third condition for the handling of fractional operators whose impulse response decays slowly.

An attractive class of admissible operators with \(\gamma _0=\gamma _1=\alpha \) and \(n_0=\lceil \alpha -1\rceil \) is that of the fractional Laplacians \((-\varDelta )^{\tfrac{\alpha }{2}}\) with \(\alpha \in (1,\infty )\), whose frequency response is \(\Vert {\varvec{\omega }}\Vert ^{\alpha }\). The inverse of the fractional Laplacian of order \(\alpha \), which corresponds to a frequency-domain multiplication by \(\Vert {\varvec{\omega }}\Vert ^{-\alpha }\), is denoted by \((-\varDelta )^{-\tfrac{\alpha }{2}}\). Both operators are part of the same family (isotropic LSI and scale-invariant), their distributional impulse responses being given by

$$\begin{aligned} k_{\alpha ,d}({\varvec{x}})= {\mathcal {F}}^{-1}\left\{ \frac{1}{\Vert {\varvec{\omega }}\Vert ^{\alpha } } \right\} ({\varvec{x}})={\left\{ \begin{array}{ll} A_{\alpha ,d} \; \Vert {\varvec{x}}\Vert ^{\alpha -d},&{} \alpha -d, -\alpha \notin 2{\mathbb {N}}\\ B_{n,d} \; \Vert {\varvec{x}}\Vert ^{2n} \log (\Vert {\varvec{x}}\Vert ) , &{} \alpha -d=2n \in 2{\mathbb {N}}\\ (-\varDelta )^{n} \{\delta \}, &{} -\alpha /2=n \in {\mathbb {N}}, \end{array}\right. } \end{aligned}$$
(6)

with constants \(A_{\alpha ,d}= \frac{\varGamma \big ( \tfrac{d-\alpha }{2}\big )}{2^\alpha \pi ^{d/2} \varGamma \big ( \tfrac{\alpha }{2}\big )}\) and \(B_{n,d}= \frac{(-1)^{1+n} }{2^{2n+d-1}\pi ^{d/2} \varGamma \big ( n+\tfrac{d}{2}\big ) n! }\) [19, 47]. The kernel \(k_{\alpha ,d}\) can also be interpreted as the Green’s function of \((-\varDelta )^{\tfrac{\alpha }{2}}\), with the corresponding radial profile in Definition 3 being \(\rho _{\textrm{rad}}(t)=k_{\alpha ,1}(t)\). In view of (6), this means that \((-\varDelta )^{\tfrac{\alpha }{2}}\) is admissible for \(\alpha >1\).
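
For concreteness, consider the particular case \(\alpha =2\) and \(d=1\) (a worked example of our own): since \(A_{2,1}=\varGamma (-\tfrac{1}{2})/\big (2^2\pi ^{1/2} \varGamma (1)\big )=-\tfrac{1}{2}\), the first case of (6) yields the radial profile

$$\begin{aligned} \rho _{\textrm{rad}}(t)=k_{2,1}(t)=-\tfrac{1}{2}|t| \quad \text{ with } \quad -\frac{\textrm{d}^2}{\textrm{d}t^2}\Big (-\tfrac{1}{2}|t|\Big )=\delta (t), \end{aligned}$$

which confirms the Green’s-function interpretation in this one-dimensional setting. Since \(|t|=\max (t,0)+\max (-t,0)\), this profile also hints at the ReLU connection evoked in the introduction.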

We note that the impulse response of the filtering operator \({\textrm{K}}\) in Theorem 1 is proportional to \(k_{-d+1,d}({\varvec{x}})\), which tells us that it decays asymptotically like \(1/\Vert {\varvec{x}}\Vert ^{2d -1}\) when d is even, or is a power of the Laplacian (local operator) otherwise. Functionally, this means that \({\textrm{K}}\big ({\mathcal {S}}({\mathbb {R}}^d)\big )\subseteq {\mathcal {S}}({\mathbb {R}}^d)\) for odd dimensions, and \({\textrm{K}}\big ({\mathcal {S}}({\mathbb {R}}^d)\big )\subset L_{1, d-1-\epsilon }({\mathbb {R}}^d)\) otherwise. Likewise, the impulse response of the fractional Laplacians (of non-even order) decays asymptotically like \(1/\Vert {\varvec{x}}\Vert ^{\alpha +d}\), which implies that \((-\varDelta )^{\tfrac{\alpha }{2}}\big ({\mathcal {S}}({\mathbb {R}}^d)\big )\subset L_{1, \alpha -\epsilon }({\mathbb {R}}^d)\) for arbitrarily small \(\epsilon >0\), so that the third condition in Definition 3 is met.

2.3 Nontrivial Null Space and Related Projectors

Let \(\textrm{L}\) be an LSI operator whose frequency response \({{\widehat{L}}}\) satisfies the conditions

$$\begin{aligned} \partial ^{{\varvec{k}}} {{\widehat{L}}}({\varvec{0}})=0, \text{ for } \text{ all } {\varvec{k}} \in {\mathbb {N}}^d \text{ with } |{\varvec{k}}|\le n_0, \end{aligned}$$
(7)

for some integer \(n_0\ge 0\). This flatness behavior at the origin implies that \(\textrm{L}\) annihilates (i.e., maps to zero) all polynomials of degree at most \(n_0\) (see [61, p. 131]). The explanation lies in the property that the Fourier transform of any polynomial is entirely concentrated at the origin. If, in addition, we impose that \({{\widehat{L}}}({\varvec{\omega }})\ne 0\) for all \({\varvec{\omega }}\in {\mathbb {R}}^d \backslash \{{\varvec{0}}\}\), then we are also making sure that the null space of \(\textrm{L}\) is limited to polynomials.

Next, we recall that the directional derivative of a function along the direction \({\varvec{\xi }}\in {\mathbb {S}}^{d-1}\) (i.e., \({\varvec{\xi }}\in {\mathbb {R}}^d\) with \(\Vert {\varvec{\xi }}\Vert =1\)) is given by

$$\begin{aligned} \textrm{D}_{{\varvec{\xi }}}f={\varvec{\xi }}^{\textsf{T}}{\varvec{\nabla }}f =\xi _1\partial ^{{\varvec{e}}_1}f + \cdots + \xi _d\partial ^{{\varvec{e}}_d}f. \end{aligned}$$
(8)

The operator \(\textrm{D}_{{\varvec{\xi }}}\) is LSI with frequency response \(\widehat{\textrm{D}_{{\varvec{\xi }}}}({\varvec{\omega }})=(\textrm{i}{\varvec{\xi }}^{\textsf{T}}{\varvec{\omega }})\). The nth iterate of \(\textrm{D}_{{\varvec{\xi }}}\) yields the nth derivative along \({\varvec{\xi }}\) whose explicit expression in terms of partial derivatives is

$$\begin{aligned} \textrm{D}^n_{{\varvec{\xi }}}f({\varvec{x}})= {\mathcal {F}}^{-1}\{(\textrm{i}{\varvec{\xi }}^{\textsf{T}}{\varvec{\omega }})^n {{\hat{f}}}({\varvec{\omega }}) \}({\varvec{x}})=\sum _{|{\varvec{k}}|=n} \frac{n! }{{\varvec{k}}! } {\varvec{\xi }}^{{\varvec{k}}}\partial ^{\varvec{k}} f({\varvec{x}}), \end{aligned}$$
(9)

where the right-hand side follows from the application of the multinomial expansion to \((\textrm{i}{\varvec{\xi }}^{\textsf{T}}{\varvec{\omega }})^n=(\xi _1\textrm{i}\omega _1 + \dots + \xi _d\textrm{i}\omega _d)^n\).

For isotropic operators, the directional derivatives \({\textrm{D}}^n_{{\varvec{\xi }}}{{\widehat{L}}}({\varvec{0}})\) do not depend on the direction \({\varvec{\xi }}\) and coincide with the radial derivatives \({{\widehat{L}}}^{(n)}_{\textrm{rad}}(0)\). In view of (9), (7) then has the radial equivalent

$$\begin{aligned} {{\widehat{L}}}^{(n)}_{\textrm{rad}}(0)=\frac{\textrm{d}^n{{\widehat{L}}}_{\textrm{rad}}(0)}{\textrm{d}\omega ^n}=0, \text{ for } n=0,1,\dots ,n_0, \end{aligned}$$
(10)

which is much simpler to test. It follows that an operator whose radial frequency profile is such that \(|{{\widehat{L}}}_{\textrm{rad}}(\omega )|/|\omega |^{\gamma _0}\rightarrow C_0>0\) as \(\omega \rightarrow 0\) will annihilate all polynomials up to degree \(n_0=\lceil \gamma _0-1\rceil \).

Consequently, the null space of a spline-admissible operator \(\textrm{L}\) of order \(\gamma _0\) consists of the polynomials of degree \(n_0=\gamma _0-1\) when \(\gamma _0\) is an integer and of degree \(n_0=\lfloor \gamma _0\rfloor \) when \(\gamma _0 \notin {\mathbb {N}}\). We shall represent these polynomials by expanding them in the monomial/Taylor basis

$$\begin{aligned} m_{\varvec{k}}({\varvec{x}})=\frac{{\varvec{x}}^{{\varvec{k}}}}{{\varvec{k}}!} \end{aligned}$$
(11)

with \(|{{\varvec{k}}}|\le n_0\). We also add a topological structure by equipping the space with the \(\ell _2\) norm of the Taylor coefficients, which results in the description

$$\begin{aligned} {\mathcal {P}}_{n_0}=\left\{ p_0=\sum _{|{\varvec{k}}|\le n_0} b_{\varvec{k}} m_{{\varvec{k}}}: \Vert p_0\Vert _{{\mathcal {P}}} <\infty \right\} \text{ with } \Vert p_0\Vert _{{\mathcal {P}}}{\mathop {=}\limits ^{\vartriangle }}\Vert (b_{\varvec{k}})_{|{\varvec{k}}|\le n_0}\Vert _2. \end{aligned}$$
(12)

To avoid a notational overload, we shall often denote this null space by \({\mathcal {P}}\), with the convention that \({\mathcal {P}}={\mathcal {P}}_{n_0}=\{0\}\) when \(n_0=\lceil \gamma _0-1\rceil <0\) (for the operators \(\textrm{L}\) whose null space is trivial). The important point here is that (12) specifies a finite-dimensional Banach subspace of \({\mathcal {S}}'({\mathbb {R}}^d)\). Its continuous dual \({\mathcal {P}}'\) is finite-dimensional as well, although it is composed of “abstract” elements \(p^*_0 \in {\mathcal {P}}'\) that are, in fact, equivalence classes in \({\mathcal {S}}'({\mathbb {R}}^d)\). Yet, it is possible to identify every dual element \(p^*_0\in {\mathcal {P}}'\) as a true function by selecting a particular dual basis \(\{m_{{\varvec{k}}}^*\}_{|{\varvec{k}}|\le n_0}\) such that \(\langle m^*_{{\varvec{k}}}, m_{{\varvec{k}}'}\rangle =\delta _{{\varvec{k}}-{\varvec{k}}'}\) (Kronecker delta). Our specific choice is

$$\begin{aligned} m^*_{\varvec{k}}=(-1)^{|{\varvec{k}}|}\partial ^{{\varvec{k}}}\kappa _{\textrm{iso}} \in {\mathcal {S}}({\mathbb {R}}^d) \end{aligned}$$
(13)

with \({\varvec{k}}\in {\mathbb {N}}^d\), where \(\kappa _{\textrm{iso}}\) is the isotropic function described in Lemma 1.

Lemma 1

(adapted from [57]) There exists an isotropic window \(\kappa _{\textrm{iso}} \in {\mathcal {S}}({\mathbb {R}}^d)\) such that

$$\begin{aligned} \langle m_{{\varvec{k}}},(-1)^{|{\varvec{n}}|}\partial ^{{\varvec{n}}}\kappa _{\textrm{iso}} \rangle =\delta _{{\varvec{k}} -{\varvec{n}}} \end{aligned}$$
(14)

for all \({\varvec{k}}, {\varvec{n}} \in {\mathbb {N}}^d\), subject to the spectral constraints \({\widehat{\kappa }}_{\textrm{iso}}({\varvec{\omega }})=1\) for \(\Vert {\varvec{\omega }}\Vert < \frac{1}{2}\), \(1\ge {\widehat{\kappa }}_{\textrm{iso}}({\varvec{\omega }})\ge 0\) for \(\frac{1}{2}<\Vert {\varvec{\omega }}\Vert <1\), and \({\widehat{\kappa }}_{\textrm{iso}}({\varvec{\omega }})=0\) for \(\Vert {\varvec{\omega }}\Vert \ge 1\).

This allows us to describe the dual space explicitly as

$$\begin{aligned} {\mathcal {P}}'= {\mathcal {P}}'_{n_0}=\left\{ p_0^*=\sum _{|{\varvec{k}}|\le n_0} b^*_{\varvec{k}} m^*_{{\varvec{k}}}: \Vert p_0^*\Vert _{{\mathcal {P}}'} <\infty \right\} \text{ with } \Vert p_0^*\Vert _{{\mathcal {P}}'}{\mathop {=}\limits ^{\vartriangle }}\Vert (b^*_{\varvec{k}})\Vert _2 \end{aligned}$$
(15)

where each element \(p_0^*\) has a unique representation in terms of its coefficients \((b^*_{\varvec{k}})_{|{\varvec{k}}|\le n_0}\). We use the dual basis \(\{m_{\varvec{k}}^*\}\) to specify the projection operator \(\textrm{Proj}_{{\mathcal {P}}}: {\mathcal {S}}'({\mathbb {R}}^d) \rightarrow {\mathcal {P}}_{n_0}\) as

$$\begin{aligned} \textrm{Proj}_{{\mathcal {P}}}\{f\}&=\sum _{|{\varvec{k}}|\le n_0} \langle f,m_{\varvec{k}}^*\rangle \; m_{\varvec{k}}, \end{aligned}$$
(16)

which is well-defined for any \(f \in {\mathcal {S}}'({\mathbb {R}}^d)\) since \(m_{\varvec{k}}^*\in {\mathcal {S}}({\mathbb {R}}^d)\). The “transpose” of this operator is

$$\begin{aligned} \textrm{Proj}_{{\mathcal {P}}'}\{\nu \}&=\sum _{|{\varvec{k}}|\le n_0} \langle m_{\varvec{k}},\nu \rangle \; m^*_{\varvec{k}}, \end{aligned}$$
(17)

which returns the projection of \(\nu \) onto \({\mathcal {P}}'_{n_0}\subseteq {\mathcal {S}}({\mathbb {R}}^d)\) under the implicit assumption that \(\nu \) has sufficient decay for \(\nu \mapsto \langle m_{\varvec{k}},\nu \rangle \) to be well-defined—for instance, \(\nu \in L_{1,n_0}({\mathbb {R}}^d)\). Correspondingly, we also have that \(\textrm{Proj}_{{\mathcal {P}}'}\{\textrm{L}^*\varphi \}=0\) for all \(\varphi \) such that \(\textrm{L}^*{\{\varphi \}} \in L_{1,n_0}({\mathbb {R}}^d)\) since \(\langle m_{\varvec{k}}, \textrm{L}^*\varphi \rangle =\langle \textrm{L}m_{\varvec{k}}, \varphi \rangle =0\) for \(|{\varvec{k}}|\le n_0\). The latter manipulation of the duality product is legitimate because of the inclusion \({\mathcal {P}}_{n_0}\subset L_{\infty ,-n_0}({\mathbb {R}}^d)=\big (L_{1,n_0}({\mathbb {R}}^d)\big )'\).
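
As a concrete (and deliberately simplified) illustration of the projector (16), the sketch below works in the univariate setting \(d=1\) with \(n_0=1\): it builds a window whose spectrum is flat around the origin (a raised-cosine roll-off of our own choosing, standing in for the \(C^\infty \) window of Lemma 1), normalizes it so that \(\langle m_{0},\kappa \rangle =1\), checks the biorthogonality \(\langle m_{k},m^*_{n}\rangle \approx \delta _{k-n}\), and recovers the affine part of a test signal. All discretization parameters are illustrative.

```python
# A 1D sketch (d = 1, n_0 = 1) of the projector (16).  The window kappa below has a
# spectrum that is flat around the origin (a raised-cosine roll-off, our own choice,
# in place of the C-infinity window of Lemma 1) and is normalized so that int kappa = 1.
import numpy as np

N, T = 2**14, 200.0                               # grid size and spatial extent (illustrative)
x = (np.arange(N) - N // 2) * (T / N)
w = np.fft.fftfreq(N, d=T / N) * 2 * np.pi

def flat_top(w, a=0.5, b=1.0):
    """Spectrum equal to 1 for |w| < a, 0 for |w| > b, smooth raised cosine in between."""
    s = np.clip((np.abs(w) - a) / (b - a), 0.0, 1.0)
    return 0.5 * (1 + np.cos(np.pi * s))

kappa = np.real(np.fft.fftshift(np.fft.ifft(flat_top(w)))) / (T / N)
kappa /= np.trapz(kappa, x)                       # enforce <m_0, kappa> = 1
m = [np.ones_like(x), x]                          # monomials m_0 = 1, m_1 = x
m_dual = [kappa, -np.gradient(kappa, x)]          # m*_0 = kappa, m*_1 = -kappa'

# biorthogonality check: the matrix <m_k, m*_n> should be close to the identity
for k in range(2):
    print([round(np.trapz(m[k] * m_dual[n], x), 3) for n in range(2)])

# projection (16) of f = 2 + 3x + high-frequency bump: coefficients close to (2, 3)
f = 2 + 3 * x + np.cos(5 * x) * np.exp(-x**2 / 8)
print([round(np.trapz(f * m_dual[k], x), 3) for k in range(2)])
```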

Even though the null space of an admissible operator \(\textrm{L}\) may be nontrivial, its intersection with \({\mathcal {S}}({\mathbb {R}}^d)\) is always \(\{0\}\). This implies that \(\textrm{L}^*=\textrm{L}\) is injective on \({\mathcal {S}}({\mathbb {R}}^d)\) with \(\textrm{L}^{*-1}\textrm{L}^*\{\varphi \}=\varphi \) for all \(\varphi \in {\mathcal {S}}({\mathbb {R}}^d)\) where \(\textrm{L}^{*-1}=\textrm{L}^{-1}\) is the LSI operator whose frequency response is \(1/|{{\widehat{L}}}|\).

3 Radon Transform

The Radon transform extracts the integrals of a function on \({\mathbb {R}}^d\) over all hyperplanes of dimension \((d-1)\). These hyperplanes are indexed over \({\mathbb {R}}\times {\mathbb {S}}^{d-1}\), where \({\mathbb {S}}^{d-1}=\{{\varvec{\xi }}\in {\mathbb {R}}^d: \Vert {\varvec{\xi }}\Vert _2=1\}\) is the unit sphere in \({\mathbb {R}}^d\). The coordinates of a hyperplane associated with an offset \(t\in {\mathbb {R}}\) and a normal vector \({\varvec{\xi }}\in {\mathbb {S}}^{d-1}\) satisfy

$$\begin{aligned} {\varvec{\xi }}^{\textsf{T}}{\varvec{x}}=\xi _1x_1+ \dots + \xi _d x_d = t. \end{aligned}$$

Here, we first review the classical theory of the Radon transform [27], starting with the case of test functions (Sect. 3.1), and extending it by duality to tempered distributions (Sect. 3.2). Then, in Sect. 3.3, we specify the Radon transform and its inverse on an appropriate class of intermediate Banach spaces \({\mathcal {Y}}\) (Theorem 3). Finally, in Sect. 3.4, we provide the (filtered) Radon transforms of the dictionary elements—isotropic kernels and ridges—that are relevant to our investigation.

3.1 Classical Integral Formulation

The Radon transform of the function \(f\in L_1({\mathbb {R}}^d)\cap C_0({\mathbb {R}}^d)\) is defined as

$$\begin{aligned} {\textrm{R}}\{ f\}(t, {\varvec{\xi }})&=\int _{{\mathbb {R}}^d}\delta (t-{\varvec{\xi }}^{\textsf{T}}{\varvec{x}}) f({\varvec{x}}) \textrm{d}{\varvec{x}},\quad (t,{\varvec{\xi }}) \in {\mathbb {R}}\times {\mathbb {S}}^{d-1}. \end{aligned}$$
(18)

The adjoint of \({\textrm{R}}\) is the backprojection operator \({\textrm{R}}^*\). Its action on \(g: {\mathbb {R}}\times {\mathbb {S}}^{d-1} \rightarrow {\mathbb {R}}\) yields the function

$$\begin{aligned} {\textrm{R}}^*\{g\}({\varvec{x}})=\int _{{\mathbb {S}}^{d-1}} g(\underbrace{{\varvec{\xi }}^{\textsf{T}}{\varvec{x}}}_{t}, {\varvec{\xi }})\textrm{d}{\varvec{\xi }}, \quad {\varvec{x}}\in {\mathbb {R}}^d. \end{aligned}$$
(19)

Given the d-dimensional Fourier transform \({{\hat{f}}}\) of \(f \in L_1({\mathbb {R}}^d)\), one can calculate \( {\textrm{R}} \{f\}(\cdot ,{\varvec{\xi }}_0)\) for any fixed \({\varvec{\xi }}_0 \in {\mathbb {S}}^{d-1}\) through the relation

$$\begin{aligned} {\textrm{R}}\{ f\}(t, {\varvec{\xi }}_0)=\frac{1}{2 \pi } \int _{{\mathbb {R}}} {{\hat{f}}}(\omega {\varvec{\xi }}_0) \textrm{e}^{\textrm{i}\omega t} \textrm{d}\omega = {\mathcal {F}}_{\textrm{1D}}^{-1}\{ {{\hat{f}}}(\cdot {\varvec{\xi }}_0) \}(t). \end{aligned}$$
(20)

In other words, the restriction of \({{\hat{f}}}: {\mathbb {R}}^d \rightarrow {\mathbb {C}}\) along the ray \(\{{\varvec{\omega }}=\omega {\varvec{\xi }}_0: \omega \in {\mathbb {R}}\}\) coincides with the 1D Fourier transform of \({\textrm{R}}\{ f\}(\cdot , {\varvec{\xi }}_0)\), a property that is referred to as the Fourier-slice theorem.
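
As an elementary numerical check of (18) and (20), the sketch below computes the Radon transform of the bivariate Gaussian \(f({\varvec{x}})=\textrm{e}^{-\Vert {\varvec{x}}\Vert ^2/2}\) both by direct line integration and through the Fourier-slice relation, and compares the two against the closed form \(\sqrt{2\pi }\,\textrm{e}^{-t^2/2}\); the direction \({\varvec{\xi }}_0\) and the grids are arbitrary choices.

```python
# A d = 2 sanity check of (18) and of the Fourier-slice relation (20) for the Gaussian
# f(x) = exp(-||x||^2/2): both routes should reproduce R{f}(t, xi) = sqrt(2*pi)*exp(-t^2/2)
# for every unit vector xi.  The direction and all grids are arbitrary choices.
import numpy as np

xi = np.array([np.cos(0.7), np.sin(0.7)])          # a fixed direction xi_0
xi_perp = np.array([-xi[1], xi[0]])
s = np.linspace(-12, 12, 4001)                     # coordinate inside the hyperplane

def radon_line_integral(t):
    # direct evaluation of (18): integrate f over the line {x : xi^T x = t}
    pts = t * xi[None, :] + s[:, None] * xi_perp[None, :]
    return np.trapz(np.exp(-np.sum(pts**2, axis=1) / 2), s)

def radon_fourier_slice(t):
    # relation (20): 1D inverse FT of the restriction of f_hat to the ray {w*xi_0},
    # with f_hat(w) = 2*pi*exp(-||w||^2/2) for this Gaussian under convention (3)
    w = np.linspace(-12, 12, 4001)
    f_hat_slice = 2 * np.pi * np.exp(-w**2 / 2)
    return np.trapz(f_hat_slice * np.exp(1j * w * t), w).real / (2 * np.pi)

for t in (0.0, 0.8, 2.0):
    print(t, radon_line_integral(t), radon_fourier_slice(t),
          np.sqrt(2 * np.pi) * np.exp(-t**2 / 2))   # the three columns should agree
```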

To describe the functional properties of the Radon transform, one needs the (hyper)spherical (or Radon-domain) counterparts of the spaces described in Sect. 2.1. There, the Euclidean indexing with \({\varvec{x}} \in {\mathbb {R}}^d\) must be replaced by \((t, {\varvec{\xi }}) \in {\mathbb {R}}\times {\mathbb {S}}^{d-1}\).

The spherical counterpart of \({\mathcal {S}}({\mathbb {R}}^d)\) is \({\mathcal {S}}({\mathbb {R}}\times {\mathbb {S}}^{d-1})\). Correspondingly, an element \(g \in {\mathcal {S}}'({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) is a continuous linear functional on \({\mathcal {S}}({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) whose action on the test function \(\phi \) is represented by the duality product \(g: \phi \mapsto \langle g,\phi \rangle _{\textrm{Rad}}\). When g can be identified as an ordinary function \(g: (t,{\varvec{\xi }}) \mapsto g(t,{\varvec{\xi }})\in {\mathbb {R}}\), one can write that

$$\begin{aligned} \langle g,\phi \rangle _{\textrm{Rad}} = \int _{{\mathbb {S}}^{d-1}} \int _{{\mathbb {R}}} g(t, {\varvec{\xi }}) \phi (t, {\varvec{\xi }}) \textrm{d}t \textrm{d}{\varvec{\xi }}\end{aligned}$$
(21)

where \(\textrm{d}{\varvec{\xi }}\) stands for the surface element on the unit sphere \({\mathbb {S}}^{d-1}\).

The key property for analysis is that the Radon transform is continuous on \({\mathcal {S}}({\mathbb {R}}^d)\) and invertible [23, 27, 43]. In addition to a backprojection, the inversion involves the so-called filtering operator.

Definition 4

The filtering operator \({\textrm{K}}: {\mathcal {S}}({\mathbb {R}}^d) \rightarrow {\mathcal {S}}'({\mathbb {R}}^d)\) is defined as

$$\begin{aligned} {\textrm{K}}\{\varphi \}= {\mathcal {F}}^{-1}\{{{\widehat{K}}} {\hat{\varphi }}\} \quad \text{ with } \quad {{\widehat{K}}}({\varvec{\omega }})=c_d\Vert {\varvec{\omega }}\Vert ^{d-1}, \end{aligned}$$
(22)

where \(c_d=\frac{1}{2(2\pi )^{d-1}}\).

The filtering operator is isotropic LSI and, as such, has a Radon-domain counterpart (see Definition 5) denoted by \({\textrm{K}}_{\textrm{rad}}\) that exclusively acts along the radial variable.

Definition 5

(Radon-domain counterpart of an isotropic LSI operator) Let \(\textrm{L}: {\mathcal {S}}({\mathbb {R}}^d) \rightarrow {\mathcal {S}}'({\mathbb {R}}^d)\) be an isotropic LSI operator with radial frequency profile \({{\widehat{L}}}_{\textrm{rad}}: {\mathbb {R}}\rightarrow {\mathbb {R}}\). Then, the Radon-domain counterpart of \(\textrm{L}\) is \(\textrm{L}_{\textrm{rad}}: {\mathcal {S}} ({\mathbb {R}}\times {\mathbb {S}}^{d-1}) \rightarrow {\mathcal {S}}'({\mathbb {R}}\times {\mathbb {S}}^{d-1} )\), which is defined as

$$\begin{aligned} {\textrm{L}}_{\textrm{rad}}\{\phi (\cdot ,{\varvec{\xi }})\}(t)= {\mathcal {F}}_{\textrm{1D}}^{-1}\{{{\widehat{L}}}_{\textrm{rad}} {\hat{\phi }}(\cdot ,{\varvec{\xi }}) \}(t), \end{aligned}$$
(23)

where \({\hat{\phi }}(\omega ,{\varvec{\xi }}) =\int _{\mathbb {R}}\phi (t,{\varvec{\xi }}) \textrm{e}^{-\textrm{i}\omega t} \textrm{d}t\) is the 1D Fourier transform of \(t\mapsto \phi (t,{\varvec{\xi }})\).

Theorem 1

(Continuity and invertibility of the Radon transform on \({\mathcal {S}}({\mathbb {R}}^d)\)) The Radon operator \({\textrm{R}}\) continuously maps \({\mathcal {S}}({\mathbb {R}}^d) \rightarrow {\mathcal {S}}({\mathbb {R}}\times {\mathbb {S}}^{d-1})\). Moreover, \(\text {KR}^*{\textrm{R}}={\textrm{R}}^*{\textrm{K}}_{\textrm{rad}} {\textrm{R}}={\textrm{R}}^*\text {RK} =\textrm{Id}\) on \({\mathcal {S}}({\mathbb {R}}^d)\).

Based on this result, we can identify the filtering operator as \({\textrm{K}}=({\textrm{R}}^*{\textrm{R}})^{-1}=c_d(-\varDelta )^{\tfrac{d-1}{2}}\) (fractional Laplacian). Alternatively, one can perform the filtering in the Radon domain by means of the operator \({\textrm{K}}_{\textrm{rad}}\), which implements a 1D convolution along the radial variable. The connection is that the frequency response of \({\textrm{K}}_{\textrm{rad}}\) coincides with the radial frequency profile of \({\textrm{K}}\) so that \({{\widehat{K}}}({\varvec{\omega }})={{\widehat{K}}}_{\textrm{rad}}(\Vert {\varvec{\omega }}\Vert )\) with \({{\widehat{K}}}_{\textrm{rad}}(\omega )=c_d|\omega |^{d-1}\).
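
The inversion formula of Theorem 1 can be probed numerically. The sketch below does so for \(d=2\) and the Gaussian of the previous example: its (analytically known) projections are filtered with the multiplier \({{\widehat{K}}}_{\textrm{rad}}(\omega )=|\omega |/(4\pi )\) along the radial variable and then backprojected over \({\mathbb {S}}^{1}\), which should reproduce \(f\); the discretization parameters are again illustrative.

```python
# A d = 2 numerical probe of the filtered-backprojection identity R* K_rad R = Id
# (Theorem 1) for the Gaussian f(x) = exp(-||x||^2/2), whose projections are
# sqrt(2*pi)*exp(-t^2/2) in every direction.  Discretization choices are illustrative.
import numpy as np

w = np.linspace(-40, 40, 80001)
proj_hat = 2 * np.pi * np.exp(-w**2 / 2)           # 1D FT of sqrt(2*pi)*exp(-t^2/2)
filt_hat = (np.abs(w) / (4 * np.pi)) * proj_hat    # apply K_rad_hat(w) = c_2*|w| = |w|/(4*pi)

def filtered_projection(t):
    # (1/(2*pi)) * int filt_hat(w) exp(iwt) dw: the filtered projection at offset t
    return np.trapz(filt_hat * np.exp(1j * w * t), w).real / (2 * np.pi)

theta = np.linspace(0, 2 * np.pi, 720, endpoint=False)    # samples of xi on S^1
for r in (0.0, 0.7, 1.5):
    # backprojection (19): average over directions; by isotropy xi^T x = r*cos(theta)
    vals = np.array([filtered_projection(r * np.cos(th)) for th in theta])
    print(r, vals.mean() * 2 * np.pi, np.exp(-r**2 / 2))  # should recover f at radius r
```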

While the Radon transform \({\textrm{R}}: {\mathcal {S}}({\mathbb {R}}^d) \rightarrow {\mathcal {S}}({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) is invertible, it is not surjective, which means that not every hyper-spherical test function \(\phi \in {\mathcal {S}}({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) can be written as \(\phi ={\textrm{R}}\{\varphi \}\) with \(\varphi \in {\mathcal {S}}({\mathbb {R}}^d)\). A necessary condition is that \(\phi \) be even, but this is not sufficient [20, 23, 27]. The good news, however, is that the range of \({\textrm{R}}\) on \({\mathcal {S}}({\mathbb {R}}^d)\) is a closed subspace of \({\mathcal {S}}({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) [23, p. 60]. Accordingly, one can identify the range space \({\mathcal {S}}_{\textrm{Rad}}{\mathop {=}\limits ^{\vartriangle }}{\textrm{R}}\big ({\mathcal {S}}({\mathbb {R}}^d) \big )\) equipped with the Fréchet topology of \({\mathcal {S}}({\mathbb {R}}\times {\mathbb {S}}^{d-1})\). Since the domain and range spaces are both Fréchet, we then invoke the open-mapping theorem [45, Theorem 2.11] to deduce that the transform \(\varphi \mapsto {\textrm{R}}\{\varphi \}\) is a homeomorphism of \({\mathcal {S}}({\mathbb {R}}^d)\) onto \({\mathcal {S}}_{\textrm{Rad}}\).

Corollary 1

The operator \({\textrm{R}}: {\mathcal {S}}({\mathbb {R}}^d) \rightarrow {\mathcal {S}}_{\textrm{Rad}}\) is a continuous bijection, with a continuous inverse given by \( {\textrm{R}}^{-1}=({\textrm{R}}^*{\textrm{K}}_{\textrm{rad}}): {\mathcal {S}}_{\textrm{Rad}}\rightarrow {\mathcal {S}}({\mathbb {R}}^d)\).

3.2 Distributional Extension

To extend the framework to distributions, one proceeds by duality. By invoking the property that \({\textrm{R}}^*{\textrm{K}}_{\textrm{rad}}{\textrm{R}}=\textrm{Id}\) on \({\mathcal {S}}({\mathbb {R}}^d)\), we make the manipulation

$$\begin{aligned} \forall \varphi \in {\mathcal {S}}({\mathbb {R}}^d)\quad \langle f,\varphi \rangle&=\langle f,{\textrm{R}}^*{\textrm{K}}_{\textrm{rad}}{\textrm{R}}\{\varphi \} \rangle \nonumber \\&= \langle {\textrm{R}}\{f\},{\textrm{K}}_{\textrm{rad}}{\textrm{R}}\{ \varphi \}\rangle _{\textrm{Rad}} =\langle {\textrm{R}}\{f\}, \phi \rangle _{\textrm{Rad}}, \end{aligned}$$
(24)

with \(\phi ={\textrm{K}}_{\textrm{rad}}{\textrm{R}}\{\varphi \} \in {\textrm{K}}_{\textrm{rad}}{\textrm{R}}\big ({\mathcal {S}}({\mathbb {R}}^d)\big )\) and \(\varphi ={\textrm{R}}^*\{\phi \}\). Relation (24), which is valid in the classical sense for \(f \in L_1({\mathbb {R}}^d)\), is then used as definition to extend the scope of \({\textrm{R}}\) for \(f\in {\mathcal {S}}'({\mathbb {R}}^d)\).

Definition 6

The distribution \(g={\textrm{R}}\{f\} \in {\big ( {\textrm{K}}_{\textrm{rad}}{\textrm{R}}\big ( {\mathcal {S}}({\mathbb {R}}^d))\big )'}\) is the (formal) Radon transform of \(f \in {\mathcal {S}}'({\mathbb {R}}^d)\) if

$$\begin{aligned} \forall \phi \in {\textrm{K}}_{\textrm{rad}}{\textrm{R}} \big ( {\mathcal {S}}({\mathbb {R}}^d)\big ):\quad \langle g,\phi \rangle _{\textrm{Rad}} =\langle f, {\textrm{R}}^*\{\phi \} \rangle . \end{aligned}$$
(25)

Likewise, \({{\tilde{g}}}={\textrm{K}}_{\textrm{rad}}{\textrm{R}}\{f\} \in {\mathcal {S}}_{\textrm{Rad}}'\) is the (formal) filtered projection of \(f \in {\mathcal {S}}'({\mathbb {R}}^d)\) if

$$\begin{aligned} \forall \phi \in {\mathcal {S}}_{\textrm{Rad}}: \quad \langle {{\tilde{g}}},\phi \rangle _{\textrm{Rad}}=\langle f, {\textrm{R}}^*{\textrm{K}}_{\textrm{rad}} \{\phi \} \rangle . \end{aligned}$$
(26)

Finally, \(f={\textrm{R}}^*\{g\} \in {\mathcal {S}}'({\mathbb {R}}^d)\) is the backprojection of \(g \in {\mathcal {S}}'({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) if

$$\begin{aligned} \forall \varphi \in {\mathcal {S}}({\mathbb {R}}^d): \quad \langle {\textrm{R}}^*\{ g\}, \varphi \rangle =\langle g, {\textrm{R}} \{\varphi \}\rangle _{\textrm{Rad}}. \end{aligned}$$
(27)

While (27) identifies \({\textrm{R}}^*\{g\}\) as a single, unique distribution in \({\mathcal {S}}'({\mathbb {R}}^d)\), this is not so for (26) (resp., (25)), as the members of \({\mathcal {S}}'_{\textrm{Rad}}\) (resp., of \(\big ( {\textrm{K}}_{\textrm{rad}}{\textrm{R}}\big ( {\mathcal {S}})\big )'\)) are equivalence classes in \({\mathcal {S}}'({\mathbb {R}}\times {\mathbb {S}}^{d-1})\). To make this explicit, we take advantage of the equivalence \({\textrm{R}}^*\{g\}=0 \Leftrightarrow \langle g, \phi \rangle _{\textrm{Rad}}=0\) for all \(\phi \in {\mathcal {S}}_{\textrm{Rad}}\) to identify the null space of the backprojection operator as being

$$\begin{aligned} {\mathcal {N}}_{{\textrm{R}}^*}=\{g \in {\mathcal {S}}'({\mathbb {R}}\times {\mathbb {S}}^{d-1}): \langle g, \phi \rangle _{\textrm{Rad}}=0, \forall \phi \in {\mathcal {S}}_{\textrm{Rad}}\}, \end{aligned}$$
(28)

which is a closed subspace of \({\mathcal {S}}'({\mathbb {R}}\times {\mathbb {S}}^{d-1})\). It is then possible to describe \({\mathcal {S}}'_{\textrm{Rad}}\) as the abstract quotient space \({\mathcal {S}}'({\mathbb {R}}\times {\mathbb {S}}^{d-1})/{\mathcal {N}}_{{\textrm{R}}^*}\). In other words, if we find a hyper-spherical distribution \(g_0\in {\mathcal {S}}'({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) such that (26) is met for a given \(f \in {\mathcal {S}}'({\mathbb {R}}^d)\), then, strictly speaking, \({\textrm{K}}_{\textrm{rad}}{\textrm{R}}\{f\} \in {\mathcal {S}}'_{\textrm{Rad}}\) is the equivalence class (or coset) given by

$$\begin{aligned} {\textrm{K}}_{\textrm{rad}}{\textrm{R}}\{f\}=[g_0]=\{g_0 + h: h \in {\mathcal {N}}_{{\textrm{R}}^*}\}. \end{aligned}$$
(29)

Since \([g_0]=[g]\) for any \(g\in {\textrm{K}}_{\textrm{rad}}{\textrm{R}}\{f\}\), we refer to the members of \({\textrm{K}}_{\textrm{rad}}{\textrm{R}}\{f\}\) as “formal” filtered projections of f to remind us of this lack of unicity.

Based on those definitions, one obtains the classical result on the invertibility of the (filtered) Radon transform on \({\mathcal {S}}'({\mathbb {R}}^d)\) [27], which is the dual of Corollary 1.

Theorem 2

(Invertibility of the Radon transform on \({\mathcal {S}}'({\mathbb {R}}^d)\)) It holds that \({\textrm{R}}^*{\textrm{K}}_{\textrm{rad}} {\textrm{R}} =\textrm{Id}\) on \({\mathcal {S}}'({\mathbb {R}}^d)\). More precisely, the filtered Radon transform \({\textrm{K}}_{\textrm{rad}}{\textrm{R}}: {\mathcal {S}}'({\mathbb {R}}^d) \rightarrow {\mathcal {S}}'_{\textrm{Rad}}\) is a continuous bijection, with a continuous inverse given by \(({\textrm{K}}_{\textrm{rad}}{\textrm{R}})^{-1}={\textrm{R}}^*: {\mathcal {S}}'_{\textrm{Rad}} \rightarrow {\mathcal {S}}'({\mathbb {R}}^d)\).

To illustrate the fact that (26) does not identify a single distribution, we consider the Dirac ridge \(\delta ({\varvec{\xi }}_0^{\textsf{T}}{\varvec{x}} - t_0) \in {\mathcal {S}}'({\mathbb {R}}^d)\) and refer to the definition (18) of the Radon transform to deduce that, for all \(\phi ={\textrm{R}}\{\varphi \} \in {\mathcal {S}}_{\textrm{Rad}}\) with \(\varphi \in {\mathcal {S}}({\mathbb {R}}^d)\),

$$\begin{aligned} \langle \delta ({\varvec{\xi }}_0^{\textsf{T}}\cdot - t_0),{\textrm{R}}^*{\textrm{K}}_{\textrm{rad}}\{\phi \}\rangle&=\langle \delta ({\varvec{\xi }}_0^{\textsf{T}}\cdot - t_0), \overbrace{{\textrm{R}}^*{\textrm{K}}_{\textrm{rad}}{\textrm{R}}}^{\textrm{Id}}\{\varphi \}\rangle \\&= \int _{{\mathbb {R}}^d}\delta ({\varvec{\xi }}_0^{\textsf{T}}{\varvec{x}}-t_0) \varphi ({\varvec{x}}) \textrm{d}{\varvec{x}}={\textrm{R}}\{\varphi \}(-t_0,-{\varvec{\xi }}_0)\\&=\langle \delta \big (\cdot +(t_0,{\varvec{\xi }}_0)\big ),{\textrm{R}}\{\varphi \}\rangle _{\textrm{Rad}}=\langle \delta \big (\cdot +(t_0,{\varvec{\xi }}_0)\big ),\phi \rangle _{\textrm{Rad}}, \end{aligned}$$

which shows that \(\delta \big (\cdot +{\varvec{z}}_0\big )\) with \({\varvec{z}}_0=(t_0,{\varvec{\xi }}_0)\) is a formal filtered projection of \(\delta ({\varvec{\xi }}_0^{\textsf{T}}{\varvec{x}} - t_0)\). Moreover, since \(\delta ({\varvec{\xi }}_0^{\textsf{T}}{\varvec{x}} - t_0)=\delta (-{\varvec{\xi }}_0^{\textsf{T}}{\varvec{x}} +t_0)\), the same holds true for \(\delta (\cdot -{\varvec{z}}_0)\), as well as for \(\delta _{\textrm{Rad},{\varvec{z}}_0}{\mathop {=}\limits ^{\vartriangle }}\frac{1}{2} \big (\delta (\cdot -{\varvec{z}}_0)+\delta (\cdot +{\varvec{z}}_0)\big )\), which has the advantage of being symmetric. While the general solution in \({\mathcal {S}}'_{\textrm{Rad}}\) is \({\textrm{K}}_{\textrm{rad}}{\textrm{R}}\{\delta ({\varvec{\xi }}_0^{\textsf{T}}\cdot - t_0)\}=[\delta \big (\cdot \pm {\varvec{z}}_0\big )]\), we shall see that there also exists a way to specify a representer that is unique (i.e., \(\delta _{\textrm{Rad},{\varvec{z}}_0})\) by restricting the range of \({\textrm{K}}_{\textrm{rad}}{\textrm{R}}\) to a suitable subspace of measures.

The distributional extension of the Radon transform inherits most of the properties of the “classical” operator defined in (18). Of special relevance to us is the quasi-commutativity of \({\textrm{R}}\) with convolution, also known as the intertwining property. Specifically, let \(h,f \in {\mathcal {S}}'({\mathbb {R}}^d)\) be two distributions whose convolution \(h *f\) is well-defined in \({\mathcal {S}}'({\mathbb {R}}^d)\). Then,

$$\begin{aligned} {\textrm{R}}\{ h *f\}={\textrm{R}}\{ h\} \circledast {\textrm{R}}\{ f\}\end{aligned}$$
(30)

where the symbol “\(\circledast \)” denotes the 1D convolution along the radial variable \(t \in {\mathbb {R}}\) with \((u \circledast g)(t,{\varvec{\xi }})=\langle u(\cdot ,{\varvec{\xi }}),g(t-\cdot ,{\varvec{\xi }}) \rangle \). In particular, when \(h=\textrm{L}\{\delta \}\) is the (isotropic) impulse response of an LSI operator whose frequency response \({{\widehat{L}}}({\varvec{\omega }})={{\widehat{L}}}_{\textrm{rad}}(\Vert {\varvec{\omega }}\Vert )\) is purely radial, we get that

$$\begin{aligned} {\textrm{R}}\{ h *f\}=\text {RL}\{f\}=\textrm{L}_{\textrm{rad}}{\textrm{R}}\{f\}, \end{aligned}$$
(31)

where \(\textrm{L}_{\textrm{rad}}\) is the corresponding Radon-domain operator of Definition 5. Likewise, by duality, for \(g \in {\mathcal {S}}'({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) we have that

$$\begin{aligned} \text {LR}^*\{g\}={\textrm{R}}^*\textrm{L}_{\textrm{rad}}\{g\}, \end{aligned}$$
(32)

under the implicit assumption that \(\textrm{L}\{{\textrm{R}}^*g\}\) and \(\textrm{L}_{\textrm{rad}}\{g\}\) are well-defined distributions. By taking inspiration from Theorem 1, we can then use these relations for \(\textrm{L}={\textrm{K}}=({\textrm{R}}^*{\textrm{R}})^{-1}\) to show that \({\textrm{R}}^*{\textrm{K}}_{\textrm{rad}} {\textrm{R}}\{f\}={\textrm{R}}^*\text {RK}\{f\}=\text {KR}^*{\textrm{R}}\{f\}=f\) for a broad class of distributions. The first form is valid for all \(f \in {\mathcal {S}}'({\mathbb {R}}^d)\) (Theorem 2), but there is a slight restriction with the second form (resp., third form), which requires that \({\textrm{K}}\{f\}\) (resp., \({\textrm{K}}\{g\}\) with \(g={\textrm{R}}^*{\textrm{R}}\{f\} \in {\mathcal {S}}'({\mathbb {R}}^d)\)) be well-defined in \({\mathcal {S}}'({\mathbb {R}}^d)\). While the latter condition is always met when d is odd, it may fail in even dimensions with distributions (e.g., polynomials) whose Fourier transform is singular at the origin. The good news for our regularization framework is that these problematic distributions are either excluded from the native space or annihilated by \(\textrm{L}\), so that it is legitimate to write that \(\textrm{L}_{\textrm{R}}={\textrm{K}}_{\textrm{rad}}\text {RL}=\text {RKL}\), where the second form has the advantage that \({\textrm{K}}\) and \(\textrm{L}\) can be pooled into a single augmented operator \((\text {KL})\). An alternative form is \(\textrm{L}_{\textrm{R}}={\textrm{Q}}_{\textrm{rad}}{\textrm{R}}\), where \({\textrm{Q}}_{\textrm{rad}}={\textrm{K}}_{\textrm{rad}}{\textrm{L}}_{\textrm{rad}}\) is the radial Radon-domain operator whose frequency response is \({{\widehat{Q}}}_{\textrm{rad}}(\omega )=c_d |\omega |^{d-1} {{\widehat{L}}}_{\textrm{rad}}(\omega )\).
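
As an elementary illustration of these manipulations (a worked example of our own), take \(d=2\), \(\textrm{L}=-\varDelta \) so that \({{\widehat{L}}}_{\textrm{rad}}(\omega )=\omega ^2\) and \(\textrm{L}_{\textrm{rad}}=-\frac{\textrm{d}^2}{\textrm{d}t^2}\), and \(f({\varvec{x}})=\textrm{e}^{-\Vert {\varvec{x}}\Vert ^2/2}\). A direct line integration gives \({\textrm{R}}\{f\}(t,{\varvec{\xi }})=\sqrt{2\pi }\,\textrm{e}^{-t^2/2}\) and \({\textrm{R}}\{-\varDelta f\}(t,{\varvec{\xi }})=\sqrt{2\pi }\,(1-t^2)\,\textrm{e}^{-t^2/2}\), while

$$\begin{aligned} \textrm{L}_{\textrm{rad}}{\textrm{R}}\{f\}(t,{\varvec{\xi }})=-\frac{\textrm{d}^2}{\textrm{d}t^2}\Big (\sqrt{2\pi }\,\textrm{e}^{-t^2/2}\Big )=\sqrt{2\pi }\,(1-t^2)\,\textrm{e}^{-t^2/2}, \end{aligned}$$

in agreement with the intertwining relation (31). In the same configuration, the composite radial operator has the frequency response \({{\widehat{Q}}}_{\textrm{rad}}(\omega )=c_2\,|\omega |\,\omega ^2=|\omega |^3/(4\pi )\).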

3.3 Radon-Compatible Banach Spaces

Our formulation requires the identification of Radon-domain Banach spaces over which the backprojection operator \({\textrm{R}}^*\) is invertible. This is a nontrivial point because the extended operator \({\textrm{R}}^*: {\mathcal {S}}'({\mathbb {R}}\times {\mathbb {S}}^{d-1}) \rightarrow {\mathcal {S}}'({\mathbb {R}}^d)\) in Definition 6 is not injective. In fact, it has the highly nontrivial null space \(\textrm{ker}({\textrm{R}}^*)={\mathcal {S}}^\perp _{\textrm{Rad}}\), which is a superset of the odd Radon-domain distributions [20]. Yet, \({\textrm{R}}^*\) is invertible on \({\mathcal {S}}'_{\textrm{Rad}}\) and surjective on \({\mathcal {S}}'({\mathbb {R}}^d)\) (Theorem 2).

To ensure invertibility, we therefore need to restrict ourselves to Banach spaces that are embedded in \({\mathcal {S}}'_{\textrm{Rad}}\). To identify such objects, we consider a generic Banach space \({\mathcal {X}}=({\mathcal {X}}, \Vert \cdot \Vert _{{\mathcal {X}}})\) such that \({\mathcal {S}}({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) is densely embedded in \({\mathcal {X}}\), which is itself continuously embedded in \({\mathcal {S}}'({\mathbb {R}}\times {\mathbb {S}}^{d-1})\). This dense-embedding hypothesis has several implications:

  1.

    The space \({\mathcal {X}}\) is the completion of \({\mathcal {S}}({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) in the \(\Vert \cdot \Vert _{{\mathcal {X}}}\) norm, i.e.,

    $$\begin{aligned} {\mathcal {X}}=\overline{\big ({\mathcal {S}}({\mathbb {R}}\times {\mathbb {S}}^{d-1}), \Vert \cdot \Vert _{{\mathcal {X}}}\big )}. \end{aligned}$$
    (33)
  2.

    The dual space is equipped with the norm

    $$\begin{aligned} \Vert g\Vert _{{\mathcal {X}}'}=\sup _{\phi \in {\mathcal {X}}:\; \Vert \phi \Vert _{{\mathcal {X}}}\le 1} \langle g, \phi \rangle =\sup _{\phi \in {\mathcal {S}}({\mathbb {R}}\times {\mathbb {S}}^{d-1}):\; \Vert \phi \Vert _{{\mathcal {X}}}\le 1} \langle g, \phi \rangle , \end{aligned}$$
    (34)

    where the restriction of \(\phi \in {\mathcal {S}}({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) on the right-hand side of (34) is justified by the denseness of \({\mathcal {S}}({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) in \({\mathcal {X}}\).

  3.

    The definition of \(\Vert g\Vert _{{\mathcal {X}}'}\) given by the right-hand side of (34) is valid for any distribution \(g\in {\mathcal {S}}'({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) with \(\Vert g\Vert _{{\mathcal {X}}'}=\infty \) for \(g \notin {\mathcal {X}}'\). Accordingly, we can specify the topological dual of \({\mathcal {X}}\) as

    $$\begin{aligned} {\mathcal {X}}'=\big \{g \in {\mathcal {S}}'({\mathbb {R}}\times {\mathbb {S}}^{d-1}): \Vert g\Vert _{{\mathcal {X}}'}< \infty \big \}. \end{aligned}$$
    (35)

Likewise, based on the pair \(({\mathcal {S}}_{\textrm{Rad}},{\mathcal {S}}'_{\textrm{Rad}})\), we specify the Radon-compatible Banach subspaces

$$\begin{aligned} {\mathcal {X}}_{\textrm{Rad}}&=\overline{({\mathcal {S}}_{\textrm{Rad}},\Vert \cdot \Vert _{{\mathcal {X}}})} \end{aligned}$$
(36)
$$\begin{aligned} {\mathcal {X}}'_{\textrm{Rad}}&=\big ({\mathcal {X}}_{\textrm{Rad}}\big )'=\big \{g \in {\mathcal {S}}'_{\textrm{Rad}}: \Vert g\Vert _{{\mathcal {X}}'_{\textrm{Rad}}}< \infty \big \} \end{aligned}$$
(37)

where the underlying dual norms have a definition that is analogous to (34) with \({\mathcal {S}}_{\textrm{Rad}}\) and \({\mathcal {X}}_{\textrm{Rad}}\) substituting for \({\mathcal {S}}({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) and \({\mathcal {X}}\).

Theorem 3

(adapted from [57]) Let \(({\mathcal {X}}_{\textrm{Rad}},{\mathcal {X}}'_{\textrm{Rad}})\) be the dual pair of spaces specified by (36) and (37). Then,

  1.

    the map \(\ {\textrm{R}}^*{\textrm{K}}_{\textrm{rad}}: {\mathcal {X}}_{\textrm{Rad}} \rightarrow {\mathcal {Y}}={\textrm{R}}^*{\textrm{K}}_{\textrm{rad}}\big ({\mathcal {X}}_{\textrm{Rad}}\big )\) is an isometric bijection, with \(\text {RR}^*{\textrm{K}}_{\textrm{rad}}=\textrm{Id}\) on \({\mathcal {X}}_{\textrm{Rad}}\);

  2.

    the map \({\textrm{R}}^*: {\mathcal {X}}'_{\textrm{Rad}} \rightarrow {\mathcal {Y}}'={\textrm{R}}^*\big ({\mathcal {X}}'_{\textrm{Rad}}\big )\) is an isometric bijection, with \({\textrm{K}}_{\textrm{rad}}\text {RR}^*=\textrm{Id}\) on \({\mathcal {X}}'_{\textrm{Rad}}\).

Moreover, if there exists a complementary Banach space \({\mathcal {X}}^{\textrm{c}}_{\textrm{Rad}}\) such that \({\mathcal {X}}={\mathcal {X}}_{\textrm{Rad}}\oplus {\mathcal {X}}^{\textrm{c}}_{\textrm{Rad}}\), then \({\mathcal {X}}'={\mathcal {X}}'_{\textrm{Rad}}\oplus ({\mathcal {X}}^{\textrm{c}}_{\textrm{Rad}})'\) where \(({\mathcal {X}}^{\textrm{c}}_{\textrm{Rad}})'\) can be identified as the null space of the backprojection operator \({\textrm{R}}^*\) acting on \({\mathcal {X}}'\).

The prototypical examples where those properties are met are \(({\mathcal {X}}, {\mathcal {X}}')=\big (L_p({\mathbb {R}}\times {\mathbb {S}}^{d-1}),L_q({\mathbb {R}}\times {\mathbb {S}}^{d-1})\big )\) with \(p\in (1,\infty )\) and \(q=p/(p-1)\) (conjugate exponent), as well as \(({\mathcal {X}}, {\mathcal {X}}')=\big (C_0({\mathbb {R}}\times {\mathbb {S}}^{d-1}),{\mathcal {M}}({\mathbb {R}}\times {\mathbb {S}}^{d-1})\big )\). In fact, those hyper-spherical spaces have the convenient feature of admitting a decomposition into their even and odd components.

Lemma 2

Let \({\mathcal {Z}} ={\mathbb {R}}\times {\mathbb {S}}^{d-1}\). Then, for \({\mathcal {X}}=L_p({\mathcal {Z}})\) with \(p \in (1,\infty )\) and \({\mathcal {X}}=C_0({\mathcal {Z}})\) for \(p=\infty \), we have that \({\mathcal {X}}={\mathcal {X}}_{\textrm{Rad}}\oplus {\mathcal {X}}^{\textrm{c}}_{\textrm{Rad}}\) where

$$\begin{aligned} {\mathcal {X}}_{\textrm{Rad}}&={\mathcal {X}}_{\textrm{even}}=\{g\in {\mathcal {X}}: g({\varvec{z}})=g(-{\varvec{z}}), \forall {\varvec{z}} \in {\mathcal {Z}}\} \end{aligned}$$
(38)
$$\begin{aligned} {\mathcal {X}}^{\textrm{c}}_{\textrm{Rad}}&={\mathcal {X}}_{\textrm{odd}}=\{g\in {\mathcal {X}}: g({\varvec{z}})=-g(-{\varvec{z}}), \forall {\varvec{z}} \in {\mathcal {Z}}\}. \end{aligned}$$
(39)

Proof

To establish this result directly is tricky because the characterization of \({\mathcal {S}}_{\textrm{Rad}}\) involves some general moment conditions [20, 23, 27]. Instead, we consider the smaller space of even Radon-domain Lizorkin test functions [26] described by

$$\begin{aligned} {\mathcal {S}}_{\textrm{Liz, Rad}}=\{\phi \in {\mathcal {S}}_{\textrm{even}}({\mathcal {Z}}): \int _{\mathbb {R}}t^k\phi (t,{\varvec{\xi }})\textrm{d}t=0,\forall {\varvec{\xi }}\in {\mathbb {S}}^{d-1}, k \in {\mathbb {N}}\}, \end{aligned}$$
(40)

which is such that \({\mathcal {S}}_{\textrm{Liz, Rad}}\subset {\mathcal {S}}_{\textrm{Rad}} \subset {\mathcal {S}}_{\textrm{even}}({\mathcal {Z}})\). We then invoke a general result by Samko [46] that implies that \(\overline{({\mathcal {S}}_{\textrm{Liz, Rad}},\Vert \cdot \Vert _{L_p})}=L_{p,\textrm{even}}({\mathcal {Z}})\supset {({\mathcal {S}}_{\textrm{Rad}},\Vert \cdot \Vert _{L_p})}\) for \(p\in (1,\infty )\) and \(\overline{({\mathcal {S}}_{\textrm{Liz, Rad}},\Vert \cdot \Vert _{L_\infty })}=C_{0,\textrm{even}}({\mathcal {Z}})\) otherwise [34]. The claim then follows from the observation that \(L_{p}({\mathcal {Z}})=L_{p,\textrm{even}}({\mathcal {Z}})\oplus L_{p,\textrm{odd}}({\mathcal {Z}})\) with \(L_{p,\textrm{even}}({\mathcal {Z}})=\overline{({\mathcal {S}}_{\textrm{Rad}},\Vert \cdot \Vert _{L_p})}\) (because the completion is unique) and suitable adaptation for \(p=\infty \). \(\square \)

Correspondingly, we get that \({\mathcal {X}}'_{\textrm{Rad}}={\textrm{P}}_{\textrm{even}}({\mathcal {X}}')={\mathcal {X}}'_{\textrm{even}}\) and \(({\mathcal {X}}^{\textrm{c}}_{\textrm{Rad}})'=(\textrm{Id}-{\textrm{P}}_{\textrm{even}})({\mathcal {X}}')={\mathcal {X}}'_{\textrm{odd}}\), with the cases of greatest interest to us being \({\mathcal {M}}_{\textrm{Rad}}={\mathcal {M}}_{\textrm{even}}({\mathbb {R}}\times {\mathbb {S}}^{d-1}) \) and \(L_{2,\textrm{Rad}}=L_{2,\textrm{even}}({\mathbb {R}}\times {\mathbb {S}}^{d-1})\).

3.4 Specific Radon Transforms

The Fourier-slice theorem expressed by (20) remains valid for tempered distributions [43] and therefore also yields a characterization of \({\textrm{R}}\{ f\}\) that is compatible with the Banach framework of Theorem 3. It is especially helpful when the underlying function \(\rho _{\textrm{iso}}\) is isotropic with a known radial frequency profile \({{\widehat{\rho }}}_{\textrm{rad}}\) such that \( {\mathcal {F}}\{ \rho _{\textrm{iso}}\}({\varvec{\omega }})={{\widehat{\rho }}}_{\textrm{rad}}(\Vert {\varvec{\omega }}\Vert )\).

Proposition 1

(Radon transform of isotropic distributions) Let \(\rho _{\textrm{iso}}\) be an isotropic distribution whose radial frequency profile is \({{\widehat{\rho }}}_{\textrm{rad}}: {\mathbb {R}}\rightarrow {\mathbb {R}}\). Then,

$$\begin{aligned} {\textrm{R}}\{\rho _{\textrm{iso}}(\cdot -{\varvec{x}}_0)\}(t,{\varvec{\xi }})&=\rho _{\textrm{rad}}(t-{\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0) \end{aligned}$$
(41)
$$\begin{aligned} {\textrm{K}}_{\textrm{rad}}{\textrm{R}}\{ \rho _{\textrm{iso}}(\cdot -{\varvec{x}}_0)\}(t,{\varvec{\xi }})&=\nu _{\textrm{rad}}(t-{\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0) \end{aligned}$$
(42)

with \(\rho _{\textrm{rad}}(t)= {\mathcal {F}}^{-1}\{{{\widehat{\rho }}}_{\textrm{rad}}(\omega )\}(t)\) and \(\nu _{\textrm{rad}}(t)=\tfrac{1}{2(2\pi )^{d-1}} {\mathcal {F}}^{-1}\{|\omega |^{d-1}{{\widehat{\rho }}}_{\textrm{rad}}(\omega )\}(t)\).
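
As an elementary sanity check of Proposition 1 (our own example, with the Fourier convention \({\mathcal {F}}\{f\}({\varvec{\omega }})=\int _{{\mathbb {R}}^d} f({\varvec{x}})\textrm{e}^{-\textrm{i}{\varvec{\omega }}^{\textsf{T}}{\varvec{x}}}\textrm{d}{\varvec{x}}\) assumed here), take the isotropic Gaussian \(\rho _{\textrm{iso}}({\varvec{x}})=\textrm{e}^{-\Vert {\varvec{x}}\Vert ^2/2}\), whose radial frequency profile is \({{\widehat{\rho }}}_{\textrm{rad}}(\omega )=(2\pi )^{d/2}\textrm{e}^{-\omega ^2/2}\). Then, (41) yields

$$\begin{aligned} {\textrm{R}}\{\rho _{\textrm{iso}}(\cdot -{\varvec{x}}_0)\}(t,{\varvec{\xi }})=(2\pi )^{\frac{d-1}{2}}\,\textrm{e}^{-(t-{\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0)^2/2},\end{aligned}$$

which is the familiar statement that each Radon projection of a Gaussian is a 1D Gaussian centered at \(t={\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0\).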

The other important building blocks for representing functions are ridges. Specifically, a ridge is a multidimensional function

$$\begin{aligned} r_{{\varvec{\xi }}_0}:{\mathbb {R}}^d \rightarrow {\mathbb {R}}: {\varvec{x}} \mapsto r({\varvec{\xi }}_0^{\textsf{T}}{\varvec{x}}) \end{aligned}$$
(43)

that is characterized by a profile \(r: {\mathbb {R}}\rightarrow {\mathbb {R}}\) and a direction \({\varvec{\xi }}_0 \in {\mathbb {S}}^{d-1}\). In effect, \(r_{{\varvec{\xi }}_0}\) varies along the axis specified by \({\varvec{\xi }}_0\) and is constant within any hyperplane perpendicular to \({\varvec{\xi }}_0\). The connection between ridges and the Radon transform is given by the ridge identity

$$\begin{aligned} \forall \varphi \in {\mathcal {S}}({\mathbb {R}}^d): \quad \langle r_{{\varvec{\xi }}_0}, \varphi \rangle =\langle r, {\textrm{R}}\{\varphi \} (\cdot ,{\varvec{\xi }}_0)\rangle , \end{aligned}$$
(44)

where the (1D) duality product on the right-hand side is well-defined for any \(r \in {\mathcal {S}}'({\mathbb {R}})\) because \({\textrm{R}}\{\varphi \} (\cdot ,{\varvec{\xi }}_0) \in {\mathcal {S}}({\mathbb {R}})\) (by Theorem 1). When the profile \(r: {\mathbb {R}}\rightarrow {\mathbb {R}}\) is locally integrable, (44) is a simple consequence of Fubini’s theorem. For more general distributional profiles \(r \in {\mathcal {S}}'({\mathbb {R}})\), we take the ridge identity as a definition, which then leads to the following characterization [57].

Theorem 4

(Filtered Radon transform of a ridge) The filtered Radon transform of the (generalized) ridge \(r_{{\varvec{\xi }}_0}\) with profile \(r\in {\mathcal {S}}'({\mathbb {R}})\) and direction \({\varvec{\xi }}_0 \in {\mathbb {S}}^{d-1}\) is given by

$$\begin{aligned} {\textrm{K}}_{\textrm{rad}}{\textrm{R}} \{r_{{\varvec{\xi }}_0} \}(t,{\varvec{\xi }})= [r(t)\delta ({\varvec{\xi }}-{\varvec{\xi }}_0)], \end{aligned}$$
(45)

where \([r(t)\delta ({\varvec{\xi }}-{\varvec{\xi }}_0)]\in {\mathcal {S}}_{\textrm{Rad}}'\) is the equivalence class of distributions specified by (29). If \(r \in {\mathcal {M}}({\mathbb {R}})\), then the latter has the unique, concrete representer

$$\begin{aligned} {\textrm{K}}_{\textrm{rad}}{\textrm{R}} \{r_{{\varvec{\xi }}_0} \}(t,{\varvec{\xi }})&=\frac{1}{2} \big (r(t)\delta ({\varvec{\xi }}-{\varvec{\xi }}_0) + r(-t)\delta ({\varvec{\xi }}+{\varvec{\xi }}_0)\big ) \end{aligned}$$
(46)

in \({\mathcal {M}}_{\textrm{Rad}}={\mathcal {M}}_{\textrm{even}}({\mathbb {R}}\times {\mathbb {S}}^{d-1})\).

An important special case of (46) is the Radon transform of a Dirac ridge: \({\textrm{K}}_{\textrm{rad}}{\textrm{R}} \{\delta ({\varvec{\xi }}_0^{\textsf{T}}\cdot -t_0) \}= \delta _{\textrm{Rad},(t_0, {\varvec{\xi }}_0)}= \frac{1}{2} \big (\delta (\cdot -t_0)\delta (\cdot -{\varvec{\xi }}_0) + \delta (\cdot +t_0)\delta (\cdot +{\varvec{\xi }}_0)\big )\), which has already been mentioned in Sect. 3.2 (see also [35, Example 1]).
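
To make the ridge identity (44) concrete, consider the profile \(r=\delta (\cdot -t_0)\); the following one-line computation is a direct consequence of (44), spelled out here for convenience: for every \(\varphi \in {\mathcal {S}}({\mathbb {R}}^d)\),

$$\begin{aligned} \langle \delta ({\varvec{\xi }}_0^{\textsf{T}}\cdot -t_0), \varphi \rangle =\langle \delta (\cdot -t_0), {\textrm{R}}\{\varphi \}(\cdot ,{\varvec{\xi }}_0)\rangle ={\textrm{R}}\{\varphi \}(t_0,{\varvec{\xi }}_0),\end{aligned}$$

so that the Dirac ridge acts by sampling the Radon transform at \((t_0,{\varvec{\xi }}_0)\), which is consistent with the identification of its filtered Radon transform as the even point mass \(\delta _{\textrm{Rad},(t_0,{\varvec{\xi }}_0)}\) above.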

4 Unifying Variational Formulation

4.1 Representer Theorem for Radon-Domain Regularization

From now on, we shall use the generic symbol \({\mathcal {X}}\) to designate the hyper-spherical Banach space \(L_q({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) with \(q\in (1,\infty )\) or \(C_0({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) for \(q=\infty \), which fall into the category described by (33) with \(\Vert \cdot \Vert _{{\mathcal {X}}}=\Vert \cdot \Vert _{L_q}\).

The formulation of Theorem 5 requires the specification of a native space, \({\mathcal {X}}_{{\textrm{L}}_{\textrm{R}}}'({\mathbb {R}}^d)\), that is tied to a Radon-domain norm \(\Vert \cdot \Vert _{{\mathcal {X}}'}\) and an admissible regularization operator \(\textrm{L}_{\textrm{R}}\). Informally, our native space is the largest function space over which the proposed regularization functional \(f \mapsto \Vert \textrm{L}_{\textrm{R}}\{f\}\Vert _{{\mathcal {X}}'}\) is well-defined. The precise description of \({\mathcal {X}}_{{\textrm{L}}_{\textrm{R}}}'({\mathbb {R}}^d)\), however, is a bit more involved. As a first step, we need to restrict the dual pair \(({\mathcal {X}},{\mathcal {X}}')\) to the range of the (filtered) Radon transform. This yields the Banach spaces \(({\mathcal {X}}_{\textrm{Rad}},{\mathcal {X}}'_{\textrm{Rad}})\), as defined by (36) and (37), with the pairs of interest being \((C_{0,\textrm{Rad}},{\mathcal {M}}_{\textrm{Rad}})\) and \((L_{q,\textrm{Rad}},L_{p,\textrm{Rad}})\) with \(\tfrac{1}{p}+\tfrac{1}{q}=1\) and \(p\in (1,\infty )\). Given some spline-admissible operator \(\textrm{L}\) (Definitions 2 and 3), we then define our regularization operator and its adjoint as

$$\begin{aligned} \textrm{L}_{\textrm{R}}&{\mathop {=}\limits ^{\vartriangle }}{\textrm{K}} _{\textrm{rad}}\text {RL}: {\mathcal {X}}'_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d) \rightarrow {\mathcal {X}}'_{\textrm{Rad}} ,\\ \textrm{L}^*_{\textrm{R}}&={\textrm{L}}^*{\textrm{R}}^*{\textrm{K}} _{\textrm{rad}}: {\mathcal {X}}_{\textrm{Rad}} \rightarrow {\mathcal {X}}_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d) \end{aligned}$$

where \({\textrm{R}}\) is the Radon transform and \({\textrm{K}} _{\textrm{rad}}\) the (self-adjoint) filtering operator such that \({\textrm{K}}_{\textrm{rad}}{\textrm{R}}{\textrm{R}}^*=\textrm{Id}\) on \({\mathcal {X}}'_{\textrm{Rad}}\) (Theorem 3). In order to establish isometries, one needs to be able to invert \(\textrm{L}: {\mathcal {X}}'_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d) \rightarrow {\textrm{R}}^{*}({\mathcal {X}}'_{\textrm{Rad}})\) as well as \(\textrm{L}_{\textrm{R}}\), which is feasible once one factors out the null space \({\mathcal {P}}\) that is common to both operators. This motivates us to define the directed inverse operators

$$\begin{aligned} \textrm{L}_{\textrm{R}}^{\dagger }&{\mathop {=}\limits ^{\vartriangle }}\textrm{L}^{-1}_{\mathcal {P}} {\textrm{R}}^*: {\mathcal {X}}'_{\textrm{Rad}} \rightarrow {\mathcal {X}}'_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d) \end{aligned}$$
(47)
$$\begin{aligned} \textrm{L}_{\textrm{R}}^{*\dagger }&= \text {RL}^{-1*}_{\mathcal {P}} : {\mathcal {X}}_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d) \rightarrow {\mathcal {X}}_{\textrm{Rad}} \end{aligned}$$
(48)

where the operators \(\textrm{L}_{\textrm{R}}^{\dagger }\) and \(\textrm{L}_{\textrm{R}}^{*\dagger }\) are generalized inversesFootnote 3 of \(\textrm{L}_{\textrm{R}}\) and \(\textrm{L}_{\textrm{R}}^{*}\), respectively. We now have all the ingredients to specify our native space as

$$\begin{aligned} {\mathcal {X}}'_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d)&={\textrm{L}}^{\dagger }_{\textrm{R}}\big ({\mathcal {X}}'_{\textrm{Rad}} \big )\oplus {\mathcal {P}} \nonumber \\&=\{ f \in L_{\infty ,-n_0}({\mathbb {R}}^d): \Vert {\textrm{L}}_{\textrm{R}}\{f\}\Vert _{{\mathcal {X}}'}+ \Vert \textrm{Proj}_{\mathcal {P}}\{f\}\Vert _{{\mathcal {P}}}<\infty \}\nonumber \\&=\{ {\textrm{L}}^{\dagger }_{\textrm{R}}\{w\} + p_0:\quad (w, p_0)\in {\mathcal {X}}_{\textrm{Rad}}' \times {\mathcal {P}}\}, \end{aligned}$$
(49)

which is isometrically isomorphic to \({\mathcal {X}}_{\textrm{Rad}}' \times {\mathcal {P}} \), as expressed by (49). The key property there is that \(\textrm{L}_{\textrm{R}}{\textrm{L}}^{\dagger }_{\textrm{R}} =\textrm{Id}\) on \( {\mathcal {X}}_{\textrm{Rad}}' \), while \(\textrm{L}_{\textrm{R}}\{p_0\}=0\) for all \(p_0\in {\mathcal {P}}\) (Theorem  6). Moreover, \({\mathcal {X}}'_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d)\) is the topological dual of the predual space

$$\begin{aligned} {\mathcal {X}}_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d)&=\textrm{L}_{\textrm{R}}^{*}\big ({\mathcal {X}}_{\textrm{Rad}} \big )\oplus {\mathcal {P}}' \nonumber \\&=\{\nu \in {\mathcal {S}}'({\mathbb {R}}^d): \Vert \nu \Vert _{{\mathcal {X}}_{\textrm{L}_{\textrm{R}}}}= \max (\Vert \textrm{L}_{\textrm{R}}^{*\dagger }\{\nu \}\Vert _{{\mathcal {X}}}, \Vert \textrm{Proj}_{{\mathcal {P}}'}\{\nu \}\Vert _{{\mathcal {P}}'})<\infty \}\nonumber \\&= \{ {\textrm{L}}_{\textrm{R}}^*\{v\} + p_0^*:\quad (v,p_0^*)\in {\mathcal {X}}_{\textrm{Rad}} \times {\mathcal {P}}'\}, \end{aligned}$$
(50)

which is a Banach space, as shown in Theorem 9. The validity of this dual pairing can be checked formally in the absence of null space components: For any \((f,\nu ) \in {\mathcal {X}}'_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d) \times {\mathcal {X}}_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d)\) with \(\textrm{Proj}_{\mathcal {P}} \{f\}=0\) and \(\textrm{Proj}_{{\mathcal {P}}'} \{\nu \}= 0\), we have that

$$\begin{aligned} \langle f, \nu \rangle&=\langle {\textrm{L}}^{\dagger }_{\textrm{R}}\{w\}, {\textrm{L}}^*_{\textrm{R}} \{v \}\rangle =\langle {\textrm{L}}_{\textrm{R}}{\textrm{L}}^{\dagger }_{\textrm{R}}\{w\}, v\rangle _{\textrm{Rad}} =\langle w, v\rangle _{\textrm{Rad}} \end{aligned}$$

with \((w,v) \in {\mathcal {X}}'_{\textrm{Rad}} \times {\mathcal {X}}_{\textrm{Rad}} \). Finally, since \(\textrm{Proj}_{{\mathcal {P}}}\) continuously maps \({\mathcal {X}}'_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d)\) onto the finite-dimensional space \({\mathcal {P}}\), we can identify \(\textrm{Proj}_{{\mathcal {P}}'}\), which acts on the predual \({\mathcal {X}}_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d)\), as its adjoint.

To ensure that the generic regression problem in Theorem 5 is well-defined, we also need a mild hypothesis on the structure of the data points.

Definition 7

(Admissible data points) Let \(N_0=\textrm{dim}{\mathcal {P}}\) where \({\mathcal {P}}={\mathcal {P}}_{n_0}\) is the polynomial null space of \({\textrm{L}}_{\textrm{R}}\). Then, the set of data points \(\{{\varvec{x}}_1,\dots ,{\varvec{x}}_M\} \subset {\mathbb {R}}^d\) with \(M\ge N_0\) is said to be \({\mathcal {P}}\)-admissible if the sampling matrix \({\textbf{H}}=\big [{\textbf{v}}_1 \ \cdots \ {\textbf{v}}_M\big ]^{\textsf{T}}\in {\mathbb {R}}^{M \times N_0}\) with \({\textbf{v}}_m=\big (\langle \delta (\cdot -{\varvec{x}}_m),m_{\varvec{k}}\rangle \big )_{|{\varvec{k}}|\le n_0} \in {\mathbb {R}}^{N_0}\) is of rank \(N_0\).

The rank condition is precisely the condition under which the classical least-squares polynomial fitting problem

$$\begin{aligned} \min _{f\in {\mathcal {P}}} \sum _{m=1}^M |y_m-f({\varvec{x}}_m)|^2=\min _{{\textbf{b}} \in {\mathbb {R}}^{N_0}} \Vert {\textbf{y}}- {\textbf{H}} {\textbf{b}}\Vert ^2 \end{aligned}$$
(51)

is well-posed. Indeed, the formal solution of (51) is \(p_0({\varvec{x}})=\sum _{|{\varvec{k}}|\le n_0} b_{\varvec{k}} m_{{\varvec{k}}}({\varvec{x}})\) with \({\textbf{b}}=(b_{\varvec{k}})=({\textbf{H}}^{\textsf{T}}{\textbf{H}})^{-1}{\textbf{H}}^{\textsf{T}}{\textbf{y}}\) where the rank condition guarantees the invertibility of the normal matrix \(({\textbf{H}}^{\textsf{T}}{\textbf{H}}) \in {\mathbb {R}}^{N_0 \times N_0}\).
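
As a concrete illustration, the short Python snippet below (a minimal sketch of our own; the synthetic data, the choice \(n_0=1\), and all variable names are illustrative and not from the paper) assembles the sampling matrix \({\textbf{H}}\) of Definition 7 for an affine null space, verifies the rank-\(N_0\) condition, and solves the least-squares fit (51):

```python
import numpy as np

# Minimal sketch (ours, not from the paper): P-admissibility of Definition 7 and the
# least-squares polynomial fit (51) for an affine null space (n_0 = 1, monomials
# 1, x_1, ..., x_d). The synthetic data and all variable names are illustrative.

rng = np.random.default_rng(0)
d, M = 2, 20
X = rng.standard_normal((M, d))                      # data sites x_m in R^d
y = 1.0 + X @ np.array([2.0, -1.0]) + 0.1 * rng.standard_normal(M)

H = np.hstack([np.ones((M, 1)), X])                  # one column per monomial, N_0 = d + 1
assert np.linalg.matrix_rank(H) == H.shape[1]        # P-admissibility: rank N_0

# Normal equations b = (H^T H)^{-1} H^T y; lstsq is the numerically safer equivalent.
b, *_ = np.linalg.lstsq(H, y, rcond=None)
print("fitted null-space coefficients:", b)
```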

We are now ready to formulate our extended representer theorem, which supports a rich panorama of regression models. A case of direct practical relevance is obtained by setting \(\textrm{L}=(-\varDelta )^{\frac{\alpha +1}{2}}\) (fractional Laplacian) with \({{\widehat{L}}}_{\textrm{rad}}(\omega )=|\omega |^{\alpha +1}\). Indeed, we shall see that, for \({\mathcal {X}}=L_2\), this essentially yields the classical thin-plate splines (see Sect. 4.2), while, for \({\mathcal {X}}'={\mathcal {M}}\), it results in neural networks with activation functions (\(\rho _{\textrm{rad}}\)) labeled as “ridge splines” (including ReLU) and “fractional splines” in Table 1 (Sect. 4.5).

Theorem 5

Let us consider the following setting.

  • A proper, lower-semicontinuous, coercive, and convex loss functional \(E: {\mathbb {R}}\times {\mathbb {R}}\rightarrow {\mathbb {R}}^+ \cup \{+\infty \}\).

  • An isotropic, spline-admissible operator \(\textrm{L}\) with frequency profile \({{\widehat{L}}}_{\textrm{rad}}\) and polynomial null space \({\mathcal {P}}_{n_0}\) of degree \(n_0\), possibly trivial with the convention that \({\mathcal {P}}_{-1}=\{0\}\).

  • A convex, increasing function \(\psi : {\mathbb {R}}^+ \rightarrow {\mathbb {R}}^+\).

  • A set \(\{{\varvec{x}}_1,\dots ,{\varvec{x}}_M\} \subset {\mathbb {R}}^d\) of \({\mathcal {P}}_{n_0}\)-admissible data points.

Then, for any fixed \({\textbf{y}}=(y_m)\in {\mathbb {R}}^M\), the generic functional-optimization problem

$$\begin{aligned} S=\arg \min _{f \in {\mathcal {X}}'_{{\textrm{L}}_{\textrm{R}} }({\mathbb {R}}^d)} \left( \sum _{m=1}^M E(y_m ,f({\varvec{x}}_m)) + \psi (\Vert {\textrm{L}}_{\textrm{R}} \{f\}\Vert _{{\mathcal {X}}'} )\right) , \end{aligned}$$
(52)

with \(\textrm{L}_{\textrm{R}}={\textrm{K}} _{\textrm{rad}}\text {RL}\), and \(\psi \), \({\mathcal {X}}'\) as stated below, always has a solution.

  1. 1.

    When \({\mathcal {X}}={\mathcal {X}}'=L_2({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) and \(\psi \) is strictly convex, the solution of (52) is unique and representable by the linear kernel expansion

    $$\begin{aligned} f({\varvec{x}})=p_0({\varvec{x}}) +\sum _{m=1}^M a_m \rho _{\textrm{iso}}({\varvec{x}} - {\varvec{x}}_m), \end{aligned}$$
    (53)

    where \(\rho _{\textrm{iso}}=2(2\pi )^{d-1} {\mathcal {F}}^{-1}\{1/(|{{\widehat{L}}}_{\textrm{rad}}(\Vert {\varvec{\omega }}\Vert )|^2 \Vert {\varvec{\omega }}\Vert ^{d-1})\}\) is a radial-basis function, \((a_m) \in {\mathbb {R}}^{M}\) an adequate set of coefficients, and \(p_0 \in {\mathcal {P}}_{n_0}\) a polynomial that lies in the null space of \({\textrm{L}}_{\textrm{R}}\).

  2. 2.

    When \({\mathcal {X}}'=L_{p}({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) with \(p\in (1,2]\) and \(\psi \) is strictly convex, the solution is unique and admits the parametric representation

    $$\begin{aligned} f({\varvec{x}})= p_0({\varvec{x}}) + \textrm{L}_{\textrm{R}}^{\dagger } \circ {\textrm{J}}_q\left\{ \sum _{m=1}^M a_m \nu _{{\varvec{x}}_m}\right\} ({\varvec{x}}) \end{aligned}$$
    (54)

    with basis functions \(\nu _{{\varvec{x}}_1},\ldots ,\nu _{{\varvec{x}}_M} \in L_q({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) specified by (70) and parameters \((a_m)\in {\mathbb {R}}^M\), \(p_0 \in {\mathcal {P}}_{n_0}\), and where \({\textrm{J}}_q\) is the pointwise nonlinearity given by (92) with \(q=p/(p-1)\). (The latter is the duality map \({\textrm{J}}_{{\mathcal {X}}}: {\mathcal {X}} \rightarrow {\mathcal {X}}'\) with \({\mathcal {X}}=L_{q}({\mathbb {R}}\times {\mathbb {S}}^{d-1})\)—see Appendix A for explanations).

  3. 3.

    When \({\mathcal {X}}'={\mathcal {M}}({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) and \(\psi \) is strictly increasing, the solution set is the weak\(*\)-closed convex hull of its extreme points, which are all of the form

    $$\begin{aligned} f_{\textrm{ext}}({\varvec{x}})= p_0({\varvec{x}}) + \sum _{k=1}^{K_0} a_k \rho _{\textrm{rad}}({\varvec{\xi }}_k^{\textsf{T}}{\varvec{x}} - \tau _k) \end{aligned}$$
    (55)

    with activation function \(\rho _{\textrm{rad}}= {\mathcal {F}}^{-1}\{1/{{\widehat{L}}}_{\textrm{rad}}\}\), for some \(K_0\le M-\textrm{dim}{\mathcal {P}}_{n_0}\), \((a_k,{\varvec{\xi }}_k,\tau _k) \in {\mathbb {R}}\times {\mathbb {S}}^{d-1} \times {\mathbb {R}}\) for \(k=1,\dots ,K_0\), and a null-space component \(p_0\in {\mathcal {P}}_{n_0}\). The optimal regularization cost associated with (55) is \(\Vert {\textrm{L}}_{\textrm{R}}f_{\textrm{ext}}\Vert _{\mathcal {M}}=\sum _{k=1}^{K_0}|a_k|\) and is shared by all solutions.

Proof

Since \(\textrm{L}_{\textrm{R}}^{*}\) is injective on \({\mathcal {S}}_{\textrm{Rad}} \) and, by extension, on the completed space \({\mathcal {X}}_{\textrm{Rad}} \), the image space \({\mathcal {U}}=\textrm{L}_{\textrm{R}}^{*}\big ({\mathcal {X}}_{\textrm{Rad}} \big )\) is complete as well (see the proof of Theorem 9 for the details of the construction of \({\mathcal {U}}\)). Its continuous dual is given by \({\mathcal {U}}'={\textrm{L}}^{-1}_{{\mathcal {P}}}{\textrm{R}}^{*}\big ({\mathcal {X}}'_{\textrm{Rad}} \big )\), by virtue of the identities \(\text {RL}^{-1*}_{{\mathcal {P}}}\textrm{L}_{\textrm{R}}^{*}= \text {RL}^{-1*}_{{\mathcal {P}}}{\textrm{L}}^*{\textrm{R}}^*{\textrm{K}}_{\textrm{rad}}= \text {RR}^*{\textrm{K}}_{\textrm{rad}}=\textrm{Id}\) on \({\mathcal {X}}_{\textrm{Rad}}\). Likewise, \({\mathcal {P}}'\), as identified by (15), is a finite-dimensional Banach space. Its dual is simply \(({\mathcal {P}}')'={\mathcal {P}}\) (the null space of both \(\textrm{L}\) and \(\textrm{L}_{\textrm{R}}\)), owing to the property that all finite-dimensional spaces are reflexive. Using the notation for direct-sum topologies of [58], we then observe that \({\mathcal {X}}_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d)=({\mathcal {U}} \oplus {\mathcal {P}}')_{\ell _\infty }\), whose formal dual \(({\mathcal {U}} \oplus {\mathcal {P}}' )'_{\ell _\infty }=({\mathcal {U}}' \oplus {\mathcal {P}})_{\ell _1}\) is precisely the native space \({\mathcal {X}}'_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d)\) described by (49). By writing \(f=\textrm{L}_{{\mathcal {P}}}^{-1} {\textrm{R}}^*\{ w \}+ p_0\) and recalling that \(\text {LL}_{{\mathcal {P}}}^{-1}=\textrm{Id}\) (right-inverse property), we then identify the regularization functional as

$$\begin{aligned} \Vert {\textrm{L}}_{\textrm{R}} f\Vert _{{\mathcal {X}}' }&=\Vert {\textrm{K}}_{\textrm{rad}} {\textrm{R}} \{\text {LL}^{-1}_{\mathcal {P}} {\textrm{R}}^*w + {\textrm{L}} p_0\}\Vert _{{\mathcal {X}}'}\\&=\Vert {\textrm{K}}_{\textrm{rad}} \text {RR}^*\{w\} + {\textrm{K}}_{\textrm{rad}}{\textrm{R}} \{0\}\Vert _{{\mathcal {X}}'}\\&=\Vert w \Vert _{{\mathcal {X}}'_{\textrm{Rad}}}=\Vert \textrm{Proj}_{{\mathcal {U}}'} f\Vert _{{\mathcal {U}}'} \end{aligned}$$

which, in effect, converts the seminorm over \({\mathcal {X}}'_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d)\) into a norm over \({\mathcal {U}}'\) by factoring out the null space of \(\textrm{L}_{\textrm{R}}\). The other mathematical ingredient for the optimization problem (52) to be well-posed is the weak\(*\) continuity of the sampling functionals \(\delta (\cdot -{\varvec{x}}_m): f \mapsto f({\varvec{x}}_m)\) in the underlying topology. This is equivalent to \(\delta (\cdot -{\varvec{x}}_m) \in {\mathcal {X}}_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d)\) for any \({\varvec{x}}_m \in {\mathbb {R}}^d\). In the present context, this condition can be reframed as \(\nu _{{\varvec{x}}_m}=\textrm{L}^{*\dagger }_{\textrm{R}}\{\delta (\cdot -{\varvec{x}}_m)\} \in {\mathcal {X}}_{\textrm{Rad}}\), which is a property that is established in Theorem 7 for the cases \({\mathcal {X}}=C_0\) and \({\mathcal {X}}=L_q\) for any \(q\ge 2\).

The existence of the solution and the parametric descriptions (53), (54), and (55) then follows from the three cases of the abstract representer theorem for direct sums [58, Theorem 3]. The link with the abstract theorem is made by identifying \({\mathcal {N}}_{{\varvec{p}}}={\mathcal {P}}\), \({\mathcal {U}} \oplus {\mathcal {N}}_{{\varvec{p}}^*}={\mathcal {X}}_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d)\), \({\mathcal {U}}' \oplus {\mathcal {N}}_{{\varvec{p}}}={\mathcal {X}}'_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d)\) and \(\nu _m=\delta (\cdot -{\varvec{x}}_m)\) for \(m=1,\dots ,M\). As for the required technical assumptions, they directly follow from the weak\(*\) continuity of the sampling functionals (i.e., \(\nu _m \in {\mathcal {U}} \oplus {\mathcal {N}}_{{\varvec{p}}^*}\)) and the \({\mathcal {P}}\)-admissibility hypothesis in Definition 7, with the \({\textbf{v}}_m\) being the same as in the statement of the abstract theorem. The duality map that is required for the first and second scenarios is \({\textrm{J}}_{\mathcal {U}}= \textrm{L}^{\dagger }_{\textrm{R}} \circ {\textrm{J}} \circ \textrm{L}^{*\dagger }_{\textrm{R}}: {\mathcal {U}} \rightarrow {\mathcal {U}}'\) (see Appendix A, Proposition 4 with \({\textrm{T}}=\textrm{L}^*_{\textrm{R}}\), \({\textrm{T}}^{-1}=\textrm{L}_{\textrm{R}}^{*\dagger }={\textrm{R}}^{*}{\textrm{L}}^{*-1}_{{\mathcal {P}}}\), and \(({\textrm{T}}^*)^{-1}= \textrm{L}^{\dagger }_{\textrm{R}} ={\textrm{L}}^{-1}_{{\mathcal {P}}}{\textrm{R}}^{*}\)).

To describe the solution set for the non-reflexive case \({\mathcal {X}}'={\mathcal {M}}\), we invoke the third case of [58, Theorem 3], which tells us that the extreme points of S can all be expressed as the sum of a null-space component plus a linear combination of \(K_0\le M-\textrm{dim}({\mathcal {P}})\) atoms \(e_k\) that are selected adaptively within a dictionary consisting of the extreme points of the unit ball in \({\mathcal {U}}'\): \(B({\mathcal {U}}')=\{u \in {\mathcal {U}}': \Vert u\Vert _{{\mathcal {U}}'}\le 1\}\). Because of the linear isomorphism \({\mathcal {U}}'={\textrm{L}}^{\dagger }_{\textrm{R}}\big ({\mathcal {M}}_{\textrm{Rad}} \big )\), \(\textrm{ext}B({\mathcal {U}}')={\textrm{L}}^{\dagger }_{\textrm{R}}\big (\textrm{ext}B({\mathcal {M}}_{\textrm{Rad}})\big )\). Next, we use the property that \({\mathcal {M}}_{\textrm{Rad}}={{\mathcal {M}}_{\textrm{even}}({\mathbb {R}}\times {\mathbb {S}}^{d-1})}\), the extreme points of whose unit ball are \(\{\pm \delta _{\textrm{Rad}, {\varvec{z}}}\}_{{\varvec{z}}\in {\mathbb {P}}^{d}}\). Each \(\delta _{\textrm{Rad}, {\varvec{z}}_k}\in \textrm{ext}B({\mathcal {M}}_{\textrm{Rad}})\) then bijectively maps into an extreme point \(e_{k}= {\textrm{L}}^{\dagger }_{\textrm{R}}\{\delta _{\textrm{Rad}, {\varvec{z}}_k}\} \in \textrm{ext}B({\mathcal {U}}')\), and vice versa. Finally, by recalling that \({\textrm{L}}^{\dagger }_{\textrm{R}}=(\textrm{Id}-\textrm{Proj}_{{\mathcal {P}}}){\textrm{L}}^{-1}{\textrm{R}}^{*}\) and by invoking Theorem 4 to show that \({\textrm{L}}^{-1}{\textrm{R}}^{*}\{\delta _{\textrm{Rad}, (t_k,{\varvec{\xi }}_k)}\}={\textrm{L}}^{-1}{\textrm{R}}^{*}{\textrm{K}}_{\textrm{rad}}{\textrm{R}} \{\delta ({\varvec{\xi }}_k^{\textsf{T}}{\varvec{x}} -t_k)\}={\textrm{L}}^{-1} \{\delta ({\varvec{\xi }}_k^{\textsf{T}}{\varvec{x}} -t_k)\}=\rho _{\textrm{rad}}({\varvec{\xi }}_k^{\textsf{T}}{\varvec{x}} -t_k)\), we find that

$$\begin{aligned} e_k=\pm \rho _{\textrm{rad}}({\varvec{\xi }}_k^{\textsf{T}}{\varvec{x}} -t_k) {\mp } p_{0,k}, \end{aligned}$$
(56)

where \(p_{0,k}=\textrm{Proj}_{{\mathcal {P}}}\{\rho _{\textrm{rad}}({\varvec{\xi }}_k^{\textsf{T}}{\varvec{x}} -t_k)\}\in {\mathcal {P}}\). Since every extreme point of \(B({\mathcal {U}}')\) is necessarily of the form (56), we can substitute this expression into the generic expansion \(f_{\textrm{ext}}= {{\tilde{p}}}_0 +\sum _{k=1}^{K_0} a_k e_k \) which, upon collection of all null-space components, yields (55). \(\square \)

The two cases in Theorem 5 that are of direct practical relevance to machine learning are Items 1 and 3. The first scenario yields a learning architecture that is a linear expansion of RBFs, which also has a classical RKHS interpretation, as explained in Sect. 4.2.

The form of the solution in Item 3 is equivalent to a shallow network with the weights (\({\varvec{\xi }}_k\)) of the hidden layer being normalized. It actually turns out that this normalization is inconsequential when the activation is a homogeneous function. This happens for instance when the regularization operator \(\textrm{L}=(-\varDelta )^{\frac{\alpha +1}{2}}\) is a fractional Laplacian, which maps into \(\rho _{\textrm{rad}}(t)\propto |t|^{\alpha }\). Indeed, for any \(({\varvec{w}}_k,b_k)\in {\mathbb {R}}^d \times {\mathbb {R}}\), we have that

$$\begin{aligned} |{\varvec{w}}_k^{\textsf{T}}{\varvec{x}} -b_k|^{\alpha }= \Vert {\varvec{w}}_k\Vert ^{\alpha } |{\varvec{\xi }}_k^{\textsf{T}}{\varvec{x}} -t_k|^{\alpha } \end{aligned}$$
(57)

with \({\varvec{\xi }}_k={\varvec{w}}_k/\Vert {\varvec{w}}_k\Vert \) and \(t_k=b_k/\Vert {\varvec{w}}_k\Vert \), which indicates that the normalization (or lack thereof) can be absorbed in the weights \((a_k)\) of the output layer. The case \(\alpha =1\) with \(\textrm{L}=\varDelta \) (Laplacian) is particularly attractive, as it nicely maps into a ReLU network with one hidden layer and a skip connection to implement the affine component [37].
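
To make the link with ReLU networks fully explicit, we recall the standard algebraic identities (stated here for convenience)

$$\begin{aligned} |u|=\textrm{ReLU}(u)+\textrm{ReLU}(-u), \qquad \textrm{ReLU}(u)=\tfrac{u+|u|}{2},\end{aligned}$$

applied with \(u={\varvec{\xi }}_k^{\textsf{T}}{\varvec{x}}-t_k\): any expansion of the form (55) with \(\rho _{\textrm{rad}}(t)\propto |t|\) can thus be rewritten as a ReLU network with twice as many neurons and, conversely, a ReLU network can be rewritten as a \(|\cdot |\)-network up to an affine correction, which is precisely what the skip connection accounts for.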

The form of the solution in Item 2 is more involved, but it is still useful to gain insight into the transition, as p varies, from RBFs to neural nets. The equivalence with Item 1 for \(p=2\) is demonstrated in Sect. 4.2. As for the behavior as \(p\rightarrow 1\), we observe that the effect of the duality map \({\textrm{J}}_{q}\) as \(q\rightarrow \infty \) is to amplify the values of largest magnitude while attenuating all other (non-maximal) values. In effect, this means that \({\textrm{J}}_q\{\sum _{m=1}^M a_m \nu _{{\varvec{x}}_m}\}\) exhibits peaks that become more and more pronounced and eventually converge to a sum of impulses as \(p\rightarrow 1\), which is consistent with the limit form given by (55).
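
For intuition, one standard normalization of the \(L_q\) duality map (our choice here for illustration; the exact normalization used in this paper is the one fixed by (92)) is

$$\begin{aligned} {\textrm{J}}_q\{v\}=\Vert v\Vert _{L_q}^{2-q}\,|v|^{q-1}\,\textrm{sign}(v),\end{aligned}$$

which satisfies \(\langle {\textrm{J}}_q\{v\},v\rangle =\Vert v\Vert _{L_q}^2\). The pointwise power \(|v|^{q-1}\) increasingly favors the locations where \(|v|\) is close to its maximum as \(q\rightarrow \infty \), which is the peaking behavior described above.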

4.2 Connection with RKHS Methods

The scenario \({\mathcal {X}}=L_2\) in Theorem 5 is compatible with the kernel models of “classical” machine learning [40, 50]. This is because the underlying native space is a reproducing-kernel Hilbert space whose topological structure is now made explicit.

Proposition 2

(Characterization of RKHS for \({\mathcal {X}}=L_2\)) Let \(\textrm{L}\) be a spline-admissible operator with a polynomial null space of degree \(n_0\) and consider the self-adjoint operator \({\textrm{A}}=({\textrm{L}}^*\text {KL})^{-1}\). Then, the native space \({\mathcal {H}}=L'_{2,\textrm{L}_{\textrm{R}}}({\mathbb {R}}^d)\), defined by (49) with \({\mathcal {X}}'={\mathcal {X}}=L_2\), is the reproducing-kernel Hilbert space \({\mathcal {H}}={\mathcal {U}}' \oplus {\mathcal {P}}\) associated with the composite inner product

$$\begin{aligned} \langle f_1, f_2 \rangle _{{\mathcal {H}}}&=\langle \textrm{L}_{\textrm{R}} \{f_1\}, \textrm{L}_{\textrm{R}} \{f_2\} \rangle _{\textrm{Rad}}+ \sum _{|{\varvec{k}}|\le n_0} \langle m_{\varvec{k}}^*, f_1 \rangle \langle m_{\varvec{k}}^*, f_2 \rangle \nonumber \\&=\langle (\textrm{L}^*\text {KL}) \{f_1\}, f_2 \rangle + \sum _{|{\varvec{k}}|\le n_0} \langle m_{\varvec{k}}^*, f_1 \rangle \langle m_{\varvec{k}}^*, f_2 \rangle \end{aligned}$$
(58)

whose leading term can also be written as

$$\begin{aligned} \langle (\textrm{L}^*\text {KL}) \{f_1\}, f_2 \rangle&=\langle (\textrm{L}^*\text {KL}) \textrm{Proj}_{{\mathcal {U}}'}\{f_1\},\textrm{Proj}_{{\mathcal {U}}'}\{f_2 \}\rangle \nonumber \\&=\langle \textrm{Proj}_{{\mathcal {U}}'}\{f_1\}, \textrm{Proj}_{{\mathcal {U}}'}\{f_2\} \rangle _{{\mathcal {U}}'}, \end{aligned}$$
(59)

where \(\textrm{Proj}_{{\mathcal {U}}'}=\textrm{Id}- \textrm{Proj}_{{\mathcal {P}}}: {\mathcal {H}} \rightarrow {\mathcal {U}}'\). The topological dual of \({\mathcal {H}}\) is the Hilbert space \({\mathcal {H}}'=L_{2,\textrm{L}_{\textrm{R}}}({\mathbb {R}}^d)={\mathcal {U}} \oplus {\mathcal {P}}'\) equipped with the inner product

$$\begin{aligned} \langle g_1, g_2 \rangle _{{\mathcal {H}}'}&=\langle \textrm{L}_{\textrm{R}}^{*\dagger }\{ g_1\}, \textrm{L}_{\textrm{R}}^{*\dagger } \{g_2\} \rangle _{\textrm{Rad}}+ \sum _{|{\varvec{k}}|\le n_0} \langle m_{\varvec{k}}, g_1\rangle \langle m_{\varvec{k}}, g_2 \rangle \nonumber \\&=\langle \textrm{L}_{\textrm{R}}^{\dagger }\textrm{L}_{\textrm{R}}^{*\dagger } \{g_1\}, g_2 \rangle + \sum _{|{\varvec{k}}|\le n_0} \langle m_{\varvec{k}}, g_1\rangle \langle m_{\varvec{k}}, g_2 \rangle \nonumber \\&=\langle {\textrm{A}}\textrm{Proj}_{{\mathcal {U}}}\{g_1\}, \textrm{Proj}_{{\mathcal {U}}}\{g_2\} \rangle + \sum _{|{\varvec{k}}|\le n_0} \langle m_{\varvec{k}}, g_1\rangle \langle m_{\varvec{k}}, g_2 \rangle , \end{aligned}$$
(60)

where \(\textrm{Proj}_{{\mathcal {U}}}=\textrm{Id}- \textrm{Proj}_{{\mathcal {P}}'}: {\mathcal {H}}' \rightarrow {\mathcal {U}}\). The corresponding linear isometries (Riesz maps) that map \({\mathcal {U}}\rightarrow {\mathcal {U}}'\), \({\mathcal {P}}' \rightarrow {\mathcal {P}}\), and \({\mathcal {H}}\rightarrow {\mathcal {H}}'\) are

$$\begin{aligned} {\textrm{J}}_{{\mathcal {U}}}&= {\textrm{A}} : {\mathcal {U}} \rightarrow {\mathcal {U}}'\\&=(\textrm{Id}- \textrm{Proj}_{{\mathcal {P}}}){\textrm{A}} (\textrm{Id}- \textrm{Proj}_{{\mathcal {P}}'}) : {\mathcal {U}} \oplus {\mathcal {P}}' \rightarrow {\mathcal {U}}',\\ {\textrm{J}}_{{\mathcal {P}}'}&=\textrm{Proj}_{{\mathcal {P}}}: {\mathcal {P}}' \rightarrow {\mathcal {P}},\\&=\textrm{Proj}_{{\mathcal {P}}}\textrm{Proj}_{{\mathcal {P}}'}: {\mathcal {U}} \oplus {\mathcal {P}}' \rightarrow {\mathcal {P}},\\ {\textrm{J}}_{{\mathcal {H}}}&=\left( (\textrm{Id}- \textrm{Proj}_{{\mathcal {P}}}){\textrm{A}} (\textrm{Id}- \textrm{Proj}_{{\mathcal {P}}'}) + \textrm{Proj}_{{\mathcal {P}}}\textrm{Proj}_{{\mathcal {P}}'}\right) : {\mathcal {H}}\rightarrow {\mathcal {H}}', \end{aligned}$$

where the second, alternative forms of \({\textrm{J}}_{{\mathcal {U}}}\) and \({\textrm{J}}_{{\mathcal {P}}'}\) that include projectors are the proper extension of those operators to the whole space \({\mathcal {H}}={\mathcal {U}}'\oplus {\mathcal {P}}\).

Proposition 2 is obtained as a corollary of Theorems 6 and 9 with \({\mathcal {X}}={\mathcal {X}}'=L_2\), with the help of the intertwining property \({\textrm{L}}_{\textrm{R}}={\textrm{K}}_{\textrm{rad}}\text {RL}= \text {RKL}\) (see the discussion at the end of Sect. 3.2). The technical part is to establish the completeness of \({\mathcal {U}}\) (resp., \({\mathcal {H}}'={\mathcal {U}}\oplus {\mathcal {P}}'\)), which then implies that of \(\ {\mathcal {U}}'\) (resp., \({\mathcal {H}}={\mathcal {H}}''={\mathcal {U}}'\oplus {\mathcal {P}}\)) by duality. As for the Hilbert-space property, the maps defined by (58) and (60) are obviously bilinear and symmetric. To show that (58) is also positive definite, we invoke Theorem 6, which states that any \(f \in {\mathcal {H}}={\mathcal {U}}'\oplus {\mathcal {P}}\) has a unique decomposition as \(f=\textrm{L}_{\textrm{R}}^{\dagger }\{w\}+p_0\), with \(w=\textrm{L}_{\textrm{R}}\{f\}\in {\mathcal {X}}'\) and \(p_0=\textrm{Proj}_{{\mathcal {P}}}\{f\}=\sum _{|{{\varvec{k}}}|\le n_0} b_{{\varvec{k}}} m_{{\varvec{k}}} \in {\mathcal {P}}\) with \(b_{{\varvec{k}}}=\langle m^*_{{\varvec{k}}},f\rangle \). This then yields that

$$\begin{aligned} \langle f, f\rangle _{{\mathcal {H}}}&=\langle \textrm{L}_{\textrm{R}}\{f\}, \textrm{L}_{\textrm{R}}\{f\}\rangle + \sum _{|{{\varvec{k}}}|\le n_0} |b_{\varvec{k}}|^2=\underbrace{\langle w, w\rangle _{L_2({\mathbb {R}}\times {\mathbb {S}}^{d-1})}}_{\Vert w\Vert ^2_{{\mathcal {X}}'}} + \underbrace{\Vert {\varvec{b}}\Vert ^2_2}_{\Vert p_0\Vert ^2_{{\mathcal {P}}}} \ge 0, \end{aligned}$$

which vanishes if and only if \(f=0\). Likewise, one readily verifies that the semi-inner products associated with each individual term in (60) induce the two component norms \(\Vert v\Vert _{{\mathcal {X}}}\) and \(\Vert p_0^*\Vert _{{\mathcal {P}}'}\) for \({\mathcal {X}}=L_2\) that appear in the definition (50) of the predual space.

The denomination RKHS applies to any Hilbert space \({\mathcal {H}}\) of functions on \({\mathbb {R}}^d\) such that \(\delta (\cdot -{\varvec{x}}_0) \in {\mathcal {H}}'\) for any \({\varvec{x}}_0 \in {\mathbb {R}}^d\). In our case, this property is equivalent to \(\textrm{L}^{*\dagger }_{\textrm{R}}\{\delta (\cdot -{\varvec{x}}_0)\} \in L_{2, \textrm{Rad}} \), which follows from Theorem 7.

To get further insight, we now show that the RBF solution (53) is a particular case of the \(L_p\) solution (54) with \(p=2\). Since \(L_2=(L_2)'\) is its own dual, \({\textrm{J}}=\textrm{Id}\), which allows us to manipulate (54) as

$$\begin{aligned} f&=p_0 + \textrm{L}_{{\mathcal {P}}}^{-1}{\textrm{R}}^*\text {RL}_{{\mathcal {P}}}^{-1*} \left\{ \sum _{m=1}^{M} a_m\delta (\cdot - {\varvec{x}}_m)\right\} \nonumber \\&= p_0 +\textrm{L}_{{\mathcal {P}}}^{-1}{\textrm{K}}^{-1}\textrm{L}_{{\mathcal {P}}}^{-1*}\left\{ \sum _{m=1}^{M} a_m\delta (\cdot - {\varvec{x}}_m)\right\} \nonumber \\&= p_0 + (\textrm{Id}- \textrm{Proj}_{{\mathcal {P}}}){\textrm{A}} (\textrm{Id}- \textrm{Proj}_{{\mathcal {P}}'})\left\{ \sum _{m=1}^{M} a_m\delta (\cdot - {\varvec{x}}_m)\right\} \end{aligned}$$
(61)
$$\begin{aligned}&=p_0 + {\textrm{A}}\left\{ \sum _{m=1}^{M} a_m\delta (\cdot - {\varvec{x}}_m)\right\} = p_0 +\sum _{m=1}^{M} a_m \rho _{\textrm{iso}}({\varvec{x}}- {\varvec{x}}_m), \end{aligned}$$
(62)

where \(\rho _{\textrm{iso}}=(\textrm{L}^*\text {KL})^{-1}\{\delta \}={\textrm{A}}\{\delta \}: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) is the Green’s function of \((\textrm{L}^*\text {KL})\) and \(p_0\in {\mathcal {P}}\). The nonobvious simplification from (61) to (62) results from two crucial observations: (1) the “orthogonality-to-the-null-space” condition \(\sum _{m=1}^{M} a_m\delta (\cdot - {\varvec{x}}_m) \in {\mathcal {U}}\) is necessary for optimality; and (2) the availability of the identity

$$\begin{aligned} \forall u\in {\mathcal {U}}: \quad (\textrm{Id}- \textrm{Proj}_{{\mathcal {P}}}){\textrm{A}} (\textrm{Id}- \textrm{Proj}_{{\mathcal {P}}'})\{ u\} ={\textrm{A}}\{ u\} =u^*\in {\mathcal {U}}', \end{aligned}$$

which is tightly linked to the specification of the underlying spaces in Proposition 2. Likewise, we find that the quadratic regularization cost associated with the linear model (62) is \({\varvec{a}}^{\textsf{T}}{\textbf{G}} {\varvec{a}}\), where \({\textbf{G}}\in {\mathbb {R}}^{M \times M}\) is a symmetric, conditionally positive-definite matrix (see [31]) whose entries are calculated as follows:

$$\begin{aligned} {[}{\textbf{G}}]_{m,n}&=\langle \textrm{L}_{\textrm{R}}\rho _{\textrm{iso}}(\cdot -{\varvec{x}}_m), \textrm{L}_{\textrm{R}}\rho _{\textrm{iso}}(\cdot -{\varvec{x}}_n)\rangle _{\textrm{Rad}}\\&=\langle \text {RKL} \{\rho _{\textrm{iso}}(\cdot -{\varvec{x}}_m)\}, \text {RKL} \{\rho _{\textrm{iso}}(\cdot -{\varvec{x}}_n)\}\rangle _{\textrm{Rad}}\\&=\langle \text {KL} \{\rho _{\textrm{iso}}(\cdot -{\varvec{x}}_m)\}, {\textrm{R}}^*\text {RKL} \{\rho _{\textrm{iso}}(\cdot -{\varvec{x}}_n)\}\rangle \\&=\langle \rho _{\textrm{iso}}(\cdot -{\varvec{x}}_m), {\textrm{L}}^*{\textrm{K}}\underbrace{{\textrm{R}}^*\text {RK}}_{\textrm{Id}} {\textrm{L}} \{\rho _{\textrm{iso}}(\cdot -{\varvec{x}}_n)\}\rangle \\&= \langle \rho _{\textrm{iso}}(\cdot - {\varvec{x}}_m),\delta (\cdot -{\varvec{x}}_n)\rangle =\rho _{\textrm{iso}}({\varvec{x}}_n-{\varvec{x}}_m)=\rho _{\textrm{iso}}({\varvec{x}}_m-{\varvec{x}}_n). \end{aligned}$$
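
To make this concrete, the following Python sketch (ours, not from the paper) implements the resulting estimator for the “linear” RBF \(\rho _{\textrm{iso}}({\varvec{x}})=\Vert {\varvec{x}}\Vert \) with a constant null-space term (\(n_0=0\)), a quadratic data term, and \(\psi (u)=\lambda u^2\); under these assumptions, the first-order optimality conditions reduce to the classical saddle-point system \(({\textbf{G}}+\lambda {\textbf{I}}){\varvec{a}}+{\textbf{H}}{\textbf{b}}={\textbf{y}}\) and \({\textbf{H}}^{\textsf{T}}{\varvec{a}}={\textbf{0}}\), the second equation being the orthogonality-to-the-null-space condition recalled above.

```python
import numpy as np

# Minimal sketch (ours, not from the paper) of the kernel estimator of Theorem 5,
# Item 1, for rho_iso(x) = ||x||, a constant null-space term (n_0 = 0), a quadratic
# data term, and psi(u) = lam * u**2. Under these assumptions, the expansion
# f(x) = sum_m a_m ||x - x_m|| + b is obtained from the saddle-point system
#   (G + lam*I) a + H b = y,   H^T a = 0,
# with G[m, n] = ||x_m - x_n|| and H the sampling matrix of Definition 7.
# Sign/normalization conventions for rho_iso vary; this is only an illustration.

def fit_linear_rbf(X, y, lam=1e-3):
    """X: (M, d) data sites, y: (M,) targets; returns kernel weights a and offset b."""
    M = X.shape[0]
    G = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    H = np.ones((M, 1))                                          # degree-0 monomial
    A = np.block([[G + lam * np.eye(M), H],
                  [H.T, np.zeros((1, 1))]])
    sol = np.linalg.solve(A, np.concatenate([y, np.zeros(1)]))
    return sol[:M], sol[M]

def evaluate(X_train, a, b, X_new):
    """Evaluate f(x) = sum_m a_m ||x - x_m|| + b at the rows of X_new."""
    G_new = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=-1)
    return G_new @ a + b

# Typical usage:
#   a, b = fit_linear_rbf(X, y); y_hat = evaluate(X, a, b, X_new)
```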

As a variant of the \(L_2\) result in Theorem 5, we may also consider the modified regularization operator \({\tilde{\textrm{L}}}={\textrm{K}}^{-\frac{1}{2}} {\textrm{L}}\) whose frequency response is \(\widehat{{{\tilde{L}}}}({\varvec{\omega }})=\sqrt{2}(2\pi )^{(d-1)/2}{{\widehat{L}}}({\varvec{\omega }})/\Vert {\varvec{\omega }}\Vert ^{(d-1)/2}\). For this particular setting, \(\tilde{{\textrm{L}}}_{\textrm{R}}=\text {RK}^{\frac{1}{2}}{\textrm{L}}\), which translates into

$$\begin{aligned} \Vert \tilde{{\textrm{L}}}_{\textrm{R}}\{f\}\Vert ^2_{L_2({\mathbb {R}}\times {\mathbb {S}}^{d-1})}=\Vert \text {RK}^{\frac{1}{2}}{\textrm{L}}\{f\}\Vert ^2_{L_2({\mathbb {R}}\times {\mathbb {S}}^{d-1})}=\Vert {\textrm{L}}\{ f\}\Vert ^2_{L_2({\mathbb {R}}^d)}, \end{aligned}$$

owing to the property that \(\text {RK}^{\frac{1}{2}}\) is an \(L_2\) isometry [27]. The proposed Radon-domain regularization therefore reduces to the standard energy functional associated with (semi-)reproducing-kernel Hilbert spaces. The resulting basis function is \({\tilde{\rho }}_{\textrm{iso}}({\varvec{x}})= {\mathcal {F}}^{-1}\{1/|{{\widehat{L}}}|^2\}({\varvec{x}})\), which is the same as the one encountered in the classical formulation that does not involve the Radon transform. While this may suggest that the two formulations are equivalent, there is an important difference that concerns the dimension of the null space.
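
As a quick check of the isometry invoked above (our own verification, which combines the Fourier-slice theorem (20) with the 1D Parseval relation and assumes the normalization \({{\widehat{K}}}({\varvec{\omega }})=\Vert {\varvec{\omega }}\Vert ^{d-1}/(2(2\pi )^{d-1})\) that is consistent with Proposition 1 and with the expression of \(\widehat{{{\tilde{L}}}}\) given above), we have, for any \(g\in {\mathcal {S}}({\mathbb {R}}^d)\),

$$\begin{aligned} \Vert \text {RK}^{\frac{1}{2}}\{g\}\Vert ^2_{L_2({\mathbb {R}}\times {\mathbb {S}}^{d-1})}&=\frac{1}{2\pi }\int _{{\mathbb {S}}^{d-1}}\int _{{\mathbb {R}}}\frac{|\omega |^{d-1}}{2(2\pi )^{d-1}}\,|{{\widehat{g}}}(\omega {\varvec{\xi }})|^2\,\textrm{d}\omega \,\textrm{d}{\varvec{\xi }}\\&=\frac{1}{(2\pi )^{d}}\int _{{\mathbb {R}}^d}|{{\widehat{g}}}({\varvec{\omega }})|^2\,\textrm{d}{\varvec{\omega }}=\Vert g\Vert ^2_{L_2({\mathbb {R}}^d)}, \end{aligned}$$

so that the substitution \(g={\textrm{L}}\{f\}\) indeed recovers \(\Vert \tilde{{\textrm{L}}}_{\textrm{R}}\{f\}\Vert _{L_2({\mathbb {R}}\times {\mathbb {S}}^{d-1})}=\Vert {\textrm{L}}\{f\}\Vert _{L_2({\mathbb {R}}^d)}\).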

As a matter of illustration, we now compare two schemes that utilize the same “linear” radial-basis function \(\rho _{\textrm{iso}} ({\varvec{x}})\propto \Vert {\varvec{x}}\Vert \). Based on (6), we deduce that this corresponds to the choice \(|{{\widehat{L}}}({\varvec{\omega }})|^2=\Vert {\varvec{\omega }}\Vert ^{d+1}\) in the classical formulation, which induces the regularization norm \(\Vert (-\varDelta )^{(d+1)/2}\{f\}\Vert _{L_2}\) with a polynomial null space of degree \(n_0=\lceil (d+1)/2\rceil > d/2\) [17, 63]. In our proposed Radon-domain variant, the appropriate regularization norm is \(\Vert \text {RK} (-\varDelta )^{1/2}\{f\}\Vert _{L_2}\) with a null space of polynomial degree \(n_0=0\). The latter solution is attractive because the degree of its null space does not depend on the dimensionality of the data.

4.3 Universal Approximation Properties

The universal-approximation properties of the supervised-learning scheme specified by (52) are supported by Theorem 6, which summarizes the properties of the native space \({\mathcal {X}}'_{\textrm{L}_{\textrm{R}}}\) in relation to the regularization operator \(\textrm{L}_{\textrm{R}}\) and its generalized inverse \( \textrm{L}^{\dagger }_{\textrm{R}}\). This result is a direct corollary (dual counterpart) of Theorem 9, whose proof is given in Sect. 5.2.

Theorem 6

(Properties of the native space \({\mathcal {X}}'_{\textrm{L}_{\textrm{R}}}\)) Let \(\textrm{L}\) be an isotropic LSI operator that is spline-admissible with a polynomial null space \({\mathcal {P}}\) (possibly trivial) of degree \(n_0\). Then, the operators \(\textrm{L}_{\textrm{R}}={\textrm{K}}_{\textrm{rad}}\text {RL}: {\mathcal {X}}'_{\textrm{L}_{\textrm{R}}} \rightarrow {\mathcal {X}}'_{\textrm{Rad}}\) and \(\textrm{L}_{\textrm{R}}^{\dagger }=(\textrm{Id}-\textrm{Proj}_{{\mathcal {P}}}){\textrm{L}}^{-1}{\textrm{R}}^*: {\mathcal {X}}'_{\textrm{Rad}} \rightarrow L_{\infty ,-n_0}({\mathbb {R}}^d)\) (the adjoint of \(\textrm{L}_{\textrm{R}}^{*\dagger }\) in Theorem 9) are continuous and have the properties

$$\begin{aligned} \forall w\in {\mathcal {X}}'_{\textrm{Rad}} :&\quad \textrm{L}_{\textrm{R}}\textrm{L}_{\textrm{R}}^{\dagger } \{w\}=w \end{aligned}$$
(63)
$$\begin{aligned} \forall p_0\in {\mathcal {P}}:&\quad \textrm{L}_{\textrm{R}}\{p_0\}=0 \end{aligned}$$
(64)
$$\begin{aligned} \forall f \in {\mathcal {X}}'_{\textrm{L}_{\textrm{R}}}({\mathbb {R}}^d):&\quad \textrm{L}_{\textrm{R}}^{\dagger }\textrm{L}_{\textrm{R}}\{f\} =(\textrm{Id}-\textrm{Proj}_{{\mathcal {P}}})\{f\}=\textrm{Proj}_{{\mathcal {U}}'}\{ f\}, \end{aligned}$$
(65)

where \({\mathcal {X}}'_{\textrm{L}_{\textrm{R}}}({\mathbb {R}}^d){\mathop {=}\limits ^{\vartriangle }}\textrm{L}^{\dagger }_{\textrm{R}}({\mathcal {X}}'_{\textrm{Rad}}) \oplus {\mathcal {P}}\) is equipped with the composite norm \(\Vert f\Vert _{{\mathcal {X}}'_{\textrm{L}_{\textrm{R}}}}=\Vert \textrm{L}_{\textrm{R}}\{f\}\Vert _{{\mathcal {X}}'_{\textrm{Rad}}}+ \Vert \textrm{Proj}_{{\mathcal {P}}}\{f\}\Vert _{{\mathcal {P}}}\). The space \({\mathcal {X}}'_{\textrm{L}_{\textrm{R}}}\) is complete and isomorphic to \({\mathcal {X}}'_{\textrm{Rad}} \times {\mathcal {P}}\) with \(f =\textrm{L}^{\dagger }_{\textrm{R}}\{w\} + p_0 \mapsto (w,p_0)=(\textrm{L}_{\textrm{R}}\{f\}, \textrm{Proj}_{{\mathcal {P}}}\{f\})\). Moreover, \({\mathcal {S}}({\mathbb {R}}^d)\subset {\mathcal {X}}'_{\textrm{L}_{\textrm{R}}}({\mathbb {R}}^d)\).

To explain how Theorem 6 relates to universality, let us first consider the case \({\mathcal {X}}'=L_2\) for which we have just shown that \({\mathcal {X}}'_{\textrm{L}_{\textrm{R}}}={\mathcal {H}}\) is a RKHS whose (semi-)reproducing kernel is \(\rho _{\textrm{iso}}=(\textrm{L}^*\text {KL})^{-1}\{\delta \}\). From the general properties of RKHS [2, 63], we know that \({\mathcal {H}}\) (as a set) can be specified as \({\mathcal {H}}=\overline{\textrm{span}\{\rho _{\textrm{iso}}(\cdot -{\varvec{y}})\}}_{{\varvec{y}}\in {\mathbb {R}}^d} + {\mathcal {P}}\), which tells us that the class of RBF estimators of the form given by (53) is dense in \({\mathcal {H}}\). This means that such estimators can yield an approximation of any \(f \in {\mathcal {H}}\) to an arbitrary degree of precision. In particular, this applies to any \(f \in {\mathcal {S}}({\mathbb {R}}^d)\), due to the inclusion \({\mathcal {S}}({\mathbb {R}}^d) \subset {\mathcal {X}}'_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d)\), as guaranteed by Theorem 6. Now, the key to universal approximation is that \({\mathcal {S}}({\mathbb {R}}^d)\) is dense in most of the classical function spaces [51], in particular, \(C_0({\mathbb {R}}^d)\). We then readily deduce that (53), for M sufficiently large and a suitable choice of the \({\varvec{x}}_m\), has the ability to reproduce any continuous function \(f: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\). This deduction, of course, is consistent with the classical theory of kernel estimators: Our admissibility conditions for \({\widehat{L}}_{\textrm{rad}}\) in Definition 2 (resp., in Definition 3) ensure that the function \(\rho _{\textrm{iso}}: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) is strictly positive-definite (resp., strictly conditionally positive-definite), which is the standard criterion for universality [31, 32, 63].

The same kind of density argument can be made for \({\mathcal {X}}'={\mathcal {M}}\). The relevant basis elements there are the atomic Radon-compatible Dirac measures \(\delta _{\textrm{Rad},(t_k,{\varvec{\xi }}_k)}\in {\mathcal {M}}_{\textrm{Rad}}\) with \((t_k,{\varvec{\xi }}_k)\in {\mathbb {R}}\times {\mathbb {S}}^{d-1}\). These get mapped into \(e_k={\textrm{L}}^{\dagger }_{\textrm{R}}\{\delta _{\textrm{Rad},(t_k,{\varvec{\xi }}_k)}\} \in {\mathcal {U}}'\), which are essentially ridges (up to some polynomial) characterized by \(e_k=\rho _{\textrm{rad}}({\varvec{\xi }}_k^{\textsf{T}}\cdot -t_k)- p_{0,k}\) with \(p_{0,k}=\textrm{Proj}_{{\mathcal {P}}}\{\rho _{\textrm{rad}}({\varvec{\xi }}_k^{\textsf{T}}\cdot -t_k)\}\). Thus, by setting \(w\approx \sum _k w_k \delta _{\textrm{Rad},(t_k,{\varvec{\xi }}_k)}\), we can interpret the generative formula \(f={\textrm{L}}^{\dagger }_{\textrm{R}}\{w\}+ p_0\) in Theorem 6 as a linear superposition of ridges plus a global polynomial trend, which is compatible with the form of the estimator in (55). We then invoke the property that \({\mathcal {S}}({\mathbb {R}}^d) \subset {\mathcal {M}}_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d)\), which implies that any continuous function can be approximated as closely as desired by a member of \({\mathcal {M}}_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d)\). As it turns out, the latter is representable by a superposition of ridges plus a polynomial of degree \(n_0\). We emphasize that the presence of the polynomial term—the guarantor of stability for Theorem 8—is essential to counterbalance the growth of the individual atoms at infinity. This is an important aspect where our analysis and conclusions deviate from those of [54].

4.4 Regularization Operators for Anti-symmetric Activations

While the framework that has been discussed so far is very powerful, it has one shortcoming: it yields canonical activation functions \(\rho _{\textrm{rad}}\) that are necessarily symmetric. In some cases such as \(\textrm{L}_{\textrm{R}}={\textrm{K}}_{\textrm{rad}} {\textrm{R}}(-\varDelta )\), these can be converted into one-sided functions such as the ReLU by doctoring the polynomial term. That said, the original scheme that is described by Theorem 5 does not allow for sigmoids, which are frequently used for classification [8]. This is the reason why we now introduce a variant that systematically produces anti-symmetric activations, including sigmoids for \(n_0=0\).

The idea is to substitute the (symmetric) filtering operator \({\textrm{K}}_{\textrm{rad}}\) by its anti-symmetric counterpart \(\tilde{{\textrm{K}}}_{\textrm{rad}}\), which includes an additional Hilbert transform. Specifically, \(\tilde{{\textrm{K}}}_{\textrm{rad}}\) is the hyper-spherical radial filter whose frequency response is \(\widehat{{\tilde{K}}}_{\textrm{rad}}(\omega )=-\textrm{i}\,\textrm{sign}(\omega ) c_p|\omega |^{d-1}\) and whose adjoint is \(\tilde{{\textrm{K}}}^*_{\textrm{rad}}=-\tilde{{\textrm{K}}}_{\textrm{rad}}\). Since the action of the (radial) Hilbert transform \(\tilde{{\textrm{H}}}_{\textrm{rad}}: \phi (\cdot ,{\varvec{\xi }}) \mapsto \phi \circledast 1/(\pi t)\) is well-defined on \({\mathcal {S}}({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) with \(\tilde{{\textrm{H}}}^*_{\textrm{rad}}=-\tilde{{\textrm{H}}}_{\textrm{rad}}=\tilde{{\textrm{H}}}^{-1}_{\textrm{rad}}\), we have the identity \({\textrm{R}}^*{\textrm{K}}_{\textrm{rad}} {\textrm{R}}= {\textrm{R}}^*{\textrm{K}}_{\textrm{rad}}\tilde{{\textrm{H}}}_{\textrm{rad}}^*\tilde{{\textrm{H}}}_{\textrm{rad}}{\textrm{R}}={\textrm{R}}^*\tilde{ {\textrm{K}}}_{\textrm{rad}}^*\tilde{{\textrm{R}}}=\textrm{Id}\) with \(\tilde{{\textrm{R}}}{\mathop {=}\limits ^{\vartriangle }}\tilde{{\textrm{H}}}_{\textrm{rad}}{\textrm{R}}\). Accordingly, we can essentially replicate the whole mechanism of construction of spaces in Sect. 3.3 by substituting \({\mathcal {S}}_{\textrm{Rad}}\) by \(\tilde{{\mathcal {S}}}_{\textrm{Rad}}=\tilde{{\textrm{R}}}\big ({\mathcal {S}}({\mathbb {R}}^d)\big )=\tilde{{\textrm{H}}}_{\textrm{rad}}({\mathcal {S}}_{\textrm{Rad}})\), which is a space of odd functions that are smooth (\(C^\infty \)) and included in \(L_{p}({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) for all \(p\ge 1\). While the members of \(\tilde{{\mathcal {S}}}_{\textrm{Rad}}\) do not necessarily decay rapidly, the mapping \({\textrm{R}}^*\tilde{{\textrm{K}}}^*_{\textrm{rad}}: \tilde{{\mathcal {S}}}_{\textrm{Rad}} \rightarrow {\mathcal {S}}({\mathbb {R}}^d)\) is still guaranteed to be an isomorphism (see [43, Theorem 3.3.1, p. 83] where similar arguments are used). Under the assumption that the hyper-spherical norm \(\Vert \cdot \Vert _{{\mathcal {X}}}\) is continuous on \(\tilde{{\mathcal {S}}}_{\textrm{Rad}}\), we can readily adapt the proof of [57, Theorem 8] to establish the following.

Proposition 3

(Odd Radon-compatible Banach spaces) Consider the Banach space \(\tilde{{\mathcal {X}}}_{\textrm{Rad}}=\overline{(\tilde{{\mathcal {S}}}_{\textrm{Rad}},\Vert \cdot \Vert _{{\mathcal {X}}})}\) of odd hyper-spherical functions. Then, the following holds.

  1. 1.

    The map \({\textrm{R}}^*\tilde{{\textrm{K}}}^*_{\textrm{rad}}: \tilde{{\mathcal {X}}}_{\textrm{Rad}} \rightarrow \tilde{{\mathcal {Y}}}={\textrm{R}}^*\tilde{{\textrm{K}}}^*_{\textrm{rad}}\big (\tilde{{\mathcal {X}}}_{\textrm{Rad}}\big )\) is an isometric bijection, with \(\tilde{{\textrm{R}}}{\textrm{R}}^*\tilde{{\textrm{K}}}^*_{\textrm{rad}}=\textrm{Id}\) on \(\tilde{{\mathcal {X}}}_{\textrm{Rad}}\).

  2. 2.

    The map \(\tilde{{\textrm{R}}}^*: \tilde{{\mathcal {X}}}'_{\textrm{Rad}} \rightarrow \tilde{{\mathcal {Y}}}'=\tilde{{\textrm{R}}}^*\big (\tilde{{\mathcal {X}}}'_{\textrm{Rad}}\big )\) is an isometric bijection, with \(\tilde{{\textrm{K}}}_{\textrm{rad}}{\textrm{R}}\tilde{{\textrm{R}}}^*=\textrm{Id}\) on \(\tilde{{\mathcal {X}}}'_{\textrm{Rad}}\).

The underlying definition of the “oddified” backprojection operator \(\tilde{{\textrm{R}}}^*: \tilde{{\mathcal {X}}}'_{\textrm{Rad}} \rightarrow \tilde{{\mathcal {Y}}}'\) for \(g \in \tilde{{\mathcal {X}}}_{\textrm{Rad}}'\) is

$$\begin{aligned} \langle \tilde{{\textrm{R}}}^*\{g\}, \varphi \rangle =\langle g, \tilde{{\textrm{R}}}\{\varphi \}\rangle _{\textrm{Rad}} =\langle \tilde{{\textrm{H}}}^*_{\textrm{rad}}\{g\}, {\textrm{R}}\{\varphi \}\rangle _{\textrm{Rad}} \end{aligned}$$
(66)

for all \(\varphi \in \tilde{{\mathcal {Y}}}\) or, equivalently, \(\varphi \in {\mathcal {S}}({\mathbb {R}}^d)\) since \({\mathcal {S}}({\mathbb {R}}^d)\) is dense in \(\tilde{{\mathcal {Y}}}\) by construction. Likewise, by using the property that the Hilbert transform is a homeomorphism from \({\mathcal {S}}_{\textrm{Liz, even}}\) onto \({\mathcal {S}}_{\textrm{Liz, odd}}\) [47] with the underlying Lizorkin spaces being included in \({\mathcal {S}}_{\textrm{Rad}}\) and \(\tilde{{\mathcal {S}}}_{\textrm{Rad}}\subset L_{p, \textrm{odd}}\), respectively, we can adapt the proof of Lemma 2 to show that \({{\tilde{L}}}_{q,\textrm{Rad}}=L_{q, \textrm{odd}}\) for \(q\in (1,\infty )\), \({\tilde{C}}_{0,\textrm{Rad}}=C_{0, \textrm{odd}}\) for \(q=\infty \), and \(\tilde{{\mathcal {M}}}_{\textrm{Rad}}=({\tilde{C}}_{0,\textrm{Rad}})'={\mathcal {M}}_{\textrm{odd}}\).

The bottom line is that the whole argument, including the critical Fourier-based proofs of Sect. 5, applies in this modified setting as well. Accordingly, all theorems that mention the regularization operator \(\textrm{L}_{\textrm{R}}={\textrm{K}}_{\textrm{rad}} \text {RL}\) and the radial profile \(\rho _{\textrm{rad}}= {\mathcal {F}}^{-1}\{ 1/{\widehat{L}}_{\textrm{rad}}\}\) are also valid for the odd setting where these quantities are substituted by

$$\begin{aligned} \tilde{{\textrm{L}}}_{\textrm{R}}&{\mathop {=}\limits ^{\vartriangle }}\tilde{{\textrm{K}}}_{\textrm{rad}} \text {RL}=\tilde{{\textrm{K}}}_{\textrm{rad}} {\textrm{L}}_{\textrm{rad}}{\textrm{R}}, \end{aligned}$$
(67)
$$\begin{aligned} \tilde{\rho }_{\textrm{rad}}(t)&{\mathop {=}\limits ^{\vartriangle }} {\mathcal {F}}^{-1}\left\{ \frac{\textrm{i}\,\textrm{sign}(\omega )}{ {\widehat{L}}_{\textrm{rad}}(\omega ) } \right\} (t), \end{aligned}$$
(68)

where (68) directly follows from (66) and Theorem 4. The conditions for admissibility in Definitions 2 and 3, which are all Fourier-based, remain the same, while the adjusted activation \(\tilde{\rho }_{\textrm{rad}}\) (the 1D Hilbert transform of \(\rho _{\textrm{rad}}\)) is real-valued and anti-symmetric because the original Fourier profile \({\widehat{L}}_{\textrm{rad}}: {\mathbb {R}}\rightarrow {\mathbb {R}}\) is symmetric.
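
The anti-symmetry can also be checked directly in the frequency domain (a one-line verification that we add for convenience, using the fact that a real-valued profile is odd if and only if its Fourier transform is purely imaginary and odd): since \({\widehat{L}}_{\textrm{rad}}\) is real-valued and even, (68) gives

$$\begin{aligned} \widehat{\tilde{\rho }}_{\textrm{rad}}(-\omega )=\frac{\textrm{i}\,\textrm{sign}(-\omega )}{{\widehat{L}}_{\textrm{rad}}(-\omega )}=-\frac{\textrm{i}\,\textrm{sign}(\omega )}{{\widehat{L}}_{\textrm{rad}}(\omega )}=-\widehat{\tilde{\rho }}_{\textrm{rad}}(\omega ),\end{aligned}$$

with \(\widehat{\tilde{\rho }}_{\textrm{rad}}\) purely imaginary, so that \(\tilde{\rho }_{\textrm{rad}}\) is indeed a real-valued and anti-symmetric profile.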

We note that our admissibility condition with \(\gamma _0=1\) for the odd variant (\(n_0=0\)) is compatible with the condition used by Barron to prove the universality of neural networks with sigmoidal activations [5]. It is also worth mentioning that the substitution of \({\textrm{L}}_{\textrm{R}}\) by \(\tilde{{\textrm{L}}}_{\textrm{R}}\) has no effect on the form of the RBF in (53) because of the unitary nature of the Hilbert transform.

4.5 Specific Configurations

The proposed framework encompasses a wide variety of kernels and activation functions, with minimal restrictions. For instance, one can start with any strictly positive-definite function \(\rho _{\textrm{rad},0}\) whose Fourier transform is strictly positive, and construct some higher-order variants by (fractional) integration. The variants are such that \(\rho _{\textrm{rad},\gamma _0}(t)= {\mathcal {F}}^{-1}\{\frac{{\widehat{\rho }}_{\textrm{rad},0}(\omega )}{|\omega |^{\gamma _0}}\}(t)\) with suitable \(\gamma _0>0\) and are guaranteed to meet the requirements in Definition 3. The simplest scenario is to set \(\rho _{\textrm{rad},0}=\delta \), which maps into a Laplacian-type regularization.
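
For instance (our own worked instance, with the same Fourier convention as before; the constant is only valid for \(0<\gamma _0<1\), and larger orders require the usual finite-part regularization), the choice \(\rho _{\textrm{rad},0}=\delta \) gives

$$\begin{aligned} \rho _{\textrm{rad},\gamma _0}(t)={\mathcal {F}}^{-1}\left\{ \frac{1}{|\omega |^{\gamma _0}}\right\} (t)=\frac{\varGamma (1-\gamma _0)\sin (\pi \gamma _0/2)}{\pi }\,|t|^{\gamma _0-1},\end{aligned}$$

which, upon analytic continuation in \(\gamma _0\), recovers (up to a constant) the symmetric fractional-spline profile \(|t|^{\alpha }\) with \(\alpha =\gamma _0-1\).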

In Table 1, we provide examples of admissible operators together with their corresponding symmetric and anti-symmetric activations. It is noteworthy that the two most popular sigmoids (\(\textrm{tanh}\) and \(\textrm{arctan}\)) are part of the framework. We can determine the explicit frequency response of their regularization filter and show that it is first-order-admissible with a null space that consists of the polynomials of degree \(n_0=0\) (the constants). The symmetric spline activations of odd degree \(m-1\) and the anti-symmetric ones of even degree are also known: they coincide with the ridge splines of Parhi and Nowak, which are tied to the Radon-domain operator \(\textrm{L}_{\textrm{rad}}=\frac{\partial ^{m}}{\partial t^{m}}\) [37]. With the present formulation, we can seamlessly extend this family to fractional orders, in direct analogy with [59], by setting \(\textrm{L}=(-\varDelta )^{\frac{\alpha +1}{2}}\).

Table 1 Examples of admissible symmetric and anti-symmetric activation functions with their corresponding regularization operator

5 Supporting Mathematical Results

To settle the fundamental issue of the existence of a solution in Theorem 5, we need to 1) prove that the “predual” space \({\mathcal {X}}_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d)\) is a proper Banach space (Theorem 9); and 2) establish the weak* continuity of the sampling functionals \(\delta (\cdot -{\varvec{x}}_k)\). As we shall see, both aspects largely rest upon the functional characterization of the Schwartz kernel of the pseudoinverse operator \({\textrm{L}}_{\textrm{R}}^{*\dagger }\) provided in Theorem 7.

5.1 Kernel and Stability of Generalized Inverse Operators

Let \(\nu : f \mapsto \langle \nu , f\rangle \) be a linear functional that acts on some Banach space \({\mathcal {X}}'\). We recall that \(\nu \) is weak*-continuous if and only if \(\nu \in {\mathcal {X}}\), where \({\mathcal {X}}\) is the predual of \({\mathcal {X}}'\) [44]. When \({\mathcal {X}}'\) is reflexive (i.e., \({\mathcal {X}}''={\mathcal {X}}\)), weak* continuity is equivalent to the continuity of \(\nu \) on \({\mathcal {X}}'\), which is the standard condition for analysis. However, when \({\mathcal {X}}'\) is not reflexive (e.g., \({\mathcal {X}}'={\mathcal {M}}=(C_0)'\)), the constraint of weak* continuity is stronger than continuity, contrary to what could be suggested by the qualifier “weak.” In that case, the predual space \({\mathcal {X}}\) is continuously embedded in, but strictly smaller than, the bidual \({\mathcal {X}}''\). Therefore, in order to establish the weak* continuity of the Dirac functionals \(\delta (\cdot -{\varvec{x}}_0)\) for the scenarios in Theorem 5, we need to show that \(\delta (\cdot -{\varvec{x}}_0) \in {\mathcal {X}}_{{\textrm{L}}_{\textrm{R}}}\), which can be reduced to proving that \( \textrm{L}_{\textrm{R}}^{*\dagger }\{\delta (\cdot - {\varvec{x}}_0)\} \in {\mathcal {X}}_{\textrm{Rad}} \).

Theorem 7

(Properties of the impulse response of \({\textrm{L}}_{\textrm{R}}^{*\dagger }=\text {RL}_{\mathcal {P}}^{-1*}\)) Let \(\textrm{L}\) be an isotropic operator such that \({{\widehat{L}}}({\varvec{\omega }})={{\widehat{L}}}_{\textrm{rad}}(\Vert {\varvec{\omega }}\Vert )\) where \({{\widehat{L}}}_{\textrm{rad}}: {\mathbb {R}}\rightarrow {\mathbb {R}}\) is a continuous function and \(\rho _{\textrm{rad}}(t)= {\mathcal {F}}^{-1}\{1/{{\widehat{L}}}_{\textrm{rad}}\}(t)\). We consider two cases:

  1.

    Trivial null space: If \(\textrm{L}\) satisfies the admissibility conditions in Definition 2, then \({\textrm{L}}_{\textrm{R}}^{*\dagger }=\text {RL}^{-1*}\) and

    $$\begin{aligned} \nu _{{\varvec{x}}_0}(t,{\varvec{\xi }})&=\text {RL}^{-1*}\{\delta (\cdot - {\varvec{x}}_0)\}(t,{\varvec{\xi }}) =\rho _{\textrm{rad}}(t-{\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0) \end{aligned}$$
    (69)

    with \({\varvec{x}}_0 \in {\mathbb {R}}^d\) and \((t,{\varvec{\xi }})\in {\mathbb {R}}\times {\mathbb {S}}^{d-1}\). Moreover, \(\nu _{{\varvec{x}}_0} \in {\mathcal {X}}_{\textrm{Rad}} \) for \({\mathcal {X}}=C_0\) as well as \({\mathcal {X}}=L_q\) with \(q\in [1,\infty ]\).

  2.

    Nontrivial null space: If \(\textrm{L}\) satisfies the admissibility conditions in Definition 3 with a polynomial null space of degree \(n_0\), then

    $$\begin{aligned} \nu _{{\varvec{x}}_0}(t,{\varvec{\xi }})&= \textrm{L}_{\textrm{R}}^{*\dagger }\{\delta (\cdot - {\varvec{x}}_0)\}(t,{\varvec{\xi }}) \nonumber \\&=\rho _{\textrm{rad}}(t-{\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0) - \sum _{n=0}^{n_0} \frac{(-{\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0)^n}{n!}\big (\kappa _{\textrm{rad}} *\partial ^n\rho _{\textrm{rad}}\big )(t) \end{aligned}$$
    (70)

    with \(\nu _{{\varvec{x}}_0} \in {\mathcal {X}}_{\textrm{Rad}} \) for the same spaces as in Item 1, but with \(q\in [2,\infty ]\). Moreover,

    $$\begin{aligned} \sup _{({\varvec{x}}_0,{\varvec{\xi }}) \in {\mathbb {R}}^d \times {\mathbb {S}}^{d-1}} (1+|{\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0|)^{-n_0} \Vert \nu _{{\varvec{x}}_0}(\cdot ,{\varvec{\xi }})\Vert _{L_q({\mathbb {R}})} < \infty \end{aligned}$$
    (71)

    for any \(q\in [2,\infty ]\).

Proof

To show that \(\nu _{{\varvec{x}}_0}\in {\mathcal {X}}_{\textrm{Rad}}\), it is sufficient to prove that \(\nu _{{\varvec{x}}_0}\in {\mathcal {X}}({\mathbb {R}}^d \times {\mathbb {S}}^{d-1})\) since \(\nu _{{\varvec{x}}_0}\) is in the range of the Radon transform by construction.

When the null space of \(\textrm{L}\) is trivial, \(\textrm{L}\) has a stable convolution inverse so that it suffices to show that \(\nu _{{\varvec{x}}_0} \in L_q({\mathbb {R}}\times {\mathbb {S}}^{d-1}) \cap C_0({\mathbb {R}}\times {\mathbb {S}}^{d-1})\). To that end, we formally identify the isotropic distribution \(\rho _{\textrm{iso}}={\textrm{L}}^{-1*} \{\delta \}={\textrm{L}}^{-1} \{\delta \}= {\mathcal {F}}^{-1}\{1/{{\widehat{L}}}_{\textrm{rad}}(\Vert {\varvec{\omega }}\Vert )\}\) and apply Proposition 1, which yields

$$\begin{aligned} \nu _{{\varvec{x}}_0}(t,{\varvec{\xi }})={\textrm{R}} \{\rho _{\textrm{iso}}(\cdot -{\varvec{x}}_0)\}(t, {\varvec{\xi }})=\rho _{\textrm{rad}}(t-{\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0). \end{aligned}$$

Due to our assumptions, this function is such that \(\Vert \nu _{{\varvec{x}}_0}(\cdot ,{\varvec{\xi }})\Vert _{L_1}=\Vert \rho _{\textrm{rad}}\Vert _{L_1}<\infty \) for any fixed \({\varvec{\xi }}\in {\mathbb {S}}^{d-1}\). Moreover, since \(1/{{\hat{L}}}_{\textrm{rad}} \in L_1({\mathbb {R}})\), \(\rho _{\textrm{rad}}\) is bounded, continuous, and vanishing at infinity (by the Riemann-Lebesgue lemma), which gives \(\rho _{\textrm{rad}}(\cdot -{\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0) \in C_0({\mathbb {R}})\) for any \({\varvec{x}}_0 \in {\mathbb {R}}^d\) and \({\varvec{\xi }}\in {\mathbb {S}}^{d-1}\). Since \(\rho _{\textrm{rad}} \in L_\infty ({\mathbb {R}}) \cap L_1({\mathbb {R}})\), we readily deduce that \(\rho _{\textrm{rad}}(\cdot -{\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0) \in L_q({\mathbb {R}})\) for all intermediate \(q\ge 1\), which then also yields \(\rho _{\textrm{rad}}(t-{\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0) \in L_q({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) because the surface of the unit sphere \({\mathbb {S}}^{d-1}\) is bounded. Finally, since \(\rho _{\textrm{rad}}: {\mathbb {R}}\rightarrow {\mathbb {R}}\) is continuous and vanishing at infinity, \(\nu _{{\varvec{x}}_0}(t,{\varvec{\xi }})\) is jointly continuous in \((t,{\varvec{\xi }})\) and vanishing at \(t \rightarrow \pm \infty \) for all \({\varvec{\xi }}\in {\mathbb {S}}^{d-1}\), which implies that \(\nu _{{\varvec{x}}_0} \in C_0({\mathbb {R}}\times {\mathbb {S}}^{d-1})\).
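As a numerical sanity check of the ridge identity just used (in the spirit of (69), but for an illustrative profile of our own choosing), the following sketch evaluates the Radon transform of a shifted isotropic Gaussian in \(d=2\) by direct quadrature and compares it with \(\rho _{\textrm{rad}}(t-{\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0)\); the Gaussian profile and the specific values of \({\varvec{x}}_0\) and \({\varvec{\xi }}\) are assumptions made for the example.

import numpy as np

# Sanity check of R{rho_iso(. - x0)}(t, xi) = rho_rad(t - xi^T x0) in d = 2,
# for the illustrative choice rho_iso(x) = exp(-||x||^2 / 2), whose 1D Radon
# profile is rho_rad(t) = sqrt(2*pi) * exp(-t**2 / 2).
x0 = np.array([0.7, -1.3])
theta = 1.1
xi = np.array([np.cos(theta), np.sin(theta)])        # direction on the unit circle
xi_perp = np.array([-xi[1], xi[0]])

s = np.linspace(-30.0, 30.0, 200_001)                # integration variable along the line
ds = s[1] - s[0]

def radon_of_shifted_gaussian(t):
    # line integral of rho_iso(. - x0) over the line {t * xi + s * xi_perp : s in R}
    pts = t * xi[:, None] + s[None, :] * xi_perp[:, None] - x0[:, None]
    return np.sum(np.exp(-0.5 * np.sum(pts**2, axis=0))) * ds

for t in (-1.0, 0.0, 2.0):
    lhs = radon_of_shifted_gaussian(t)
    rhs = np.sqrt(2 * np.pi) * np.exp(-0.5 * (t - xi @ x0) ** 2)
    print(f"t = {t:+.1f}: quadrature = {lhs:.6f}, ridge formula = {rhs:.6f}")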

For the more difficult case of a nontrivial null space, we invoke the Fourier-slice theorem to evaluate the 1D Fourier transform of \(\nu _{{\varvec{x}}_0}(\cdot ,{\varvec{\xi }})\) with \({\varvec{\xi }}\) fixed as

$$\begin{aligned} {\hat{\nu }}_{{\varvec{x}}_0}(\omega ,{\varvec{\xi }})&=\frac{ {\mathcal {F}}\{\delta (\cdot -{\varvec{x}}_0) - \sum _{|{\varvec{k}}|\le n_0} \langle m_{\varvec{k}}, \delta (\cdot -{\varvec{x}}_0) \rangle \; m_{\varvec{k}}^*\}(\omega {\varvec{\xi }}) }{{{\widehat{L}}}_{\textrm{rad}}(\omega )} \nonumber \\&=\frac{\textrm{e}^{-\textrm{i}\omega {\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0} - \sum _{|{\varvec{k}}|\le n_0}\frac{{\varvec{x}}_0^{\varvec{k}}}{{\varvec{k}}!} \; {{\widehat{m}}}_{\varvec{k}}^*(\omega {\varvec{\xi }}) }{{{\widehat{L}}}_{\textrm{rad}}(\omega )}\nonumber \\&=\frac{\textrm{e}^{-\textrm{i}\omega {\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0} - \sum _{|{\varvec{k}}|\le n_0}\frac{{\varvec{x}}_0^{\varvec{k}}}{{\varvec{k}}!} \; (-\textrm{i}\omega {\varvec{\xi }})^{{\varvec{k}}} {{\widehat{\kappa }}}_{\textrm{rad}}(\omega )}{{{\widehat{L}}}_{\textrm{rad}}(\omega )}\nonumber \\&=\frac{\textrm{e}^{-\textrm{i}\omega {\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0} - \sum _{n=0}^{n_0}\frac{(-{\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0)^n}{n!} (\textrm{i}\omega )^n {{\widehat{\kappa }}}_{\textrm{rad}}(\omega )}{{{\widehat{L}}}_{\textrm{rad}}(\omega )}, \end{aligned}$$
(72)

where the simplification of the summation over \({\varvec{k}}\) results from the multinomial expansion \((-\textrm{i}\omega {\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0)^n=(y_1+ \cdots + y_d)^n= \sum _{|{\varvec{k}}|= n} \frac{n!}{{\varvec{k}}!} {\varvec{y}}^{{\varvec{k}}}\) with \({\varvec{y}}={\big (-\textrm{i}\omega \xi _i x_{0,i}\big )_{i=1}^d}\). The delicate point here is that (72) is potentially singular because it has a pole of multiplicity \(\gamma _0\) at \(\omega =0\). Fortunately, the condition \(\gamma _0\le n_0+1\) ensures that there is a proper pole-zero cancelation: By recalling that \({{\widehat{\kappa }}}_{\textrm{rad}}(\omega )=1\) for \(\omega \in \varOmega _0=[-R_0,R_0]\) with \(R_0=\frac{1}{2}\) and setting \(t_0={\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0\), we identify the numerator as the remainder of the Maclaurin series of \(\textrm{e}^{-\textrm{i}\omega t_0}\), which is bounded by

$$\begin{aligned} \left| \textrm{e}^{-\textrm{i}\omega t_0} - \sum _{n=0}^{N}\frac{(-\textrm{i}t_0)^n}{n!} \omega ^n\right|&\le \sup _{\omega \in {\mathbb {R}}} \left| (-\textrm{i}t_0)^{N+1} \textrm{e}^{-\textrm{i}\omega t_0}\right| \; \frac{ |\omega |^{N+1}}{(N+1)!} \nonumber \\&\le \frac{ |t_0|^{N+1}}{(N+1)!}\; | \omega |^{N+1}. \end{aligned}$$
(73)

This then yields the estimate

$$\begin{aligned} |{\hat{\nu }}_{{\varvec{x}}_0}(\omega ,{\varvec{\xi }})|&\le \frac{|{\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0|^{n_0+1}}{C_0(n_0+1) !} |\omega |^{\epsilon } \quad \text{ as } \omega \rightarrow 0 \nonumber \\ \text{ with } \epsilon =n_0+1-\gamma _0&={\left\{ \begin{array}{ll} 0,&{} \gamma _0 \in {\mathbb {N}}\\ 1-(\gamma _0-\lfloor \gamma _0\rfloor ) \in (0,1),&{} \text{ otherwise }. \end{array}\right. } \end{aligned}$$
(74)

Since the denominator \({{\widehat{L}}}_{\textrm{rad}}\) in (72) is continuous and non-vanishing away from the origin, \({\hat{\nu }}_{{\varvec{x}}_0}(\cdot ,{\varvec{\xi }})\) is bounded on \(\varOmega _0\), and, by extension, on any compact subset of \({\mathbb {R}}\). Moreover, since \(|\textrm{e}^{-\textrm{i}\omega {\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0}|=1\) and \({{\widehat{\kappa }}}_{\textrm{rad}}\) is rapidly decreasing, there exists a constant C such that \(|{\hat{\nu }}_{{\varvec{x}}_0}(\omega ,{\varvec{\xi }})|\le C |\omega |^{-\gamma _1}\) for \(|\omega |>R_1\). This leads to several conclusions. First, if \(\gamma _1> 1\), then \({\hat{\nu }}_{{\varvec{x}}_0}(\cdot ,{\varvec{\xi }}) \in L_1({\mathbb {R}})\) so that \(\nu _{{\varvec{x}}_0} \in C_{0,\mathrm Rad}({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) by the same argument as in the nonsingular case. Second, if \(\gamma _1> \alpha + \tfrac{1}{2}\), then \(\nu _{{\varvec{x}}_0}(\cdot ,{\varvec{\xi }}) \in W_2^{\alpha }({\mathbb {R}})\), where \(W_2^{\alpha }({\mathbb {R}})=\{f: \int _{{\mathbb {R}}} (1 + |\omega |^2)^\alpha |{{\hat{f}}}(\omega )|^2 \textrm{d}\omega <\infty \}\) is the Sobolev space of functions with finite-energy derivatives up to order \(\alpha \). Therefore, since \({\hat{\nu }}_{{\varvec{x}}_0}(\cdot ,{\varvec{\xi }}) \in L_1({\mathbb {R}})\cap L_2({\mathbb {R}})\) implies \(\nu _{{\varvec{x}}_0}(\cdot ,{\varvec{\xi }}) \in C_0({\mathbb {R}})\cap L_2({\mathbb {R}})\), we get that \(\nu _{{\varvec{x}}_0} \in L_{q,\mathrm Rad}({\mathbb {R}}\times {\mathbb {S}}^{d-1})\) for all \(q\in [2,\infty ]\) provided that \(\gamma _1> 1\). The explicit Radon-domain formula (70) with \(\kappa _{\textrm{rad}}(t)= {\mathcal {F}}^{-1}\{ {{\widehat{\kappa }}}_{\textrm{rad}}(\omega )\}(t)\) is obtained by taking the 1D inverse transform of (72).

Fig. 1

Functions \(\omega \mapsto |r_{N }(\omega )|\) for \(N=1,\dots ,10\). The dilated plot on the right includes the upper bound specified by (77) as an overlay

To refine our characterization of \({\hat{\nu }}_{{\varvec{x}}_0}\), we introduce the function

$$\begin{aligned} r_{N }(\omega )=\frac{\textrm{e}^{-\textrm{i}\omega } - \sum _{n=0}^{N}\frac{(- \textrm{i}\omega )^n}{n!}}{\frac{(-\textrm{i}\omega )^N }{N !}} \end{aligned}$$
(75)

whose modulus is plotted in Fig. 1. Some of the remarkable properties of \(r_N\) are

$$\begin{aligned}&r_{N }(\omega )= \frac{-\textrm{i}\omega }{N+1}\ \text{ as } \ \omega \rightarrow 0 \end{aligned}$$
(76)
$$\begin{aligned}&\forall \omega \in {\mathbb {R}}:\quad \left| r_{N }(\omega )\right| \le \min (|\omega |/2, 1.27) \end{aligned}$$
(77)
$$\begin{aligned}&\lim _{\omega \rightarrow \infty } \left| r_N(\omega )\right| =1, \end{aligned}$$
(78)

with the global bound (77) being overlaid on the graph to demonstrate its sharpness. This function will allow us to control the behavior of

$$\begin{aligned} {\hat{\nu }}_{{\varvec{x}}_0}(\omega ,{\varvec{\xi }})&= \frac{\textrm{e}^{-\textrm{i}t_0 \omega } - \sum _{n \le n_0}\frac{(-\textrm{i}t_0\omega )^n}{n!} {{\widehat{\kappa }}}_{\textrm{rad}}(\omega ) }{{{\widehat{L}}}_{\textrm{rad}}(\omega )} \\&= \frac{ {{\widehat{\kappa }}}_{\textrm{rad}}(\omega ) r_{n_0}(t_0\omega ) \frac{(-\textrm{i}t_0\omega )^{n_0}}{n_0!} + \big (1-{{\widehat{\kappa }}}_{\textrm{rad}}(\omega )\big )\textrm{e}^{-\textrm{i}t_0 \omega } }{{{\widehat{L}}}_{\textrm{rad}}(\omega )} \end{aligned}$$

by splitting the frequency axis into three regions; since \(|{\hat{\nu }}_{{\varvec{x}}_0}(\omega ,{\varvec{\xi }})|\) is even in \(\omega \), it suffices to consider \(\omega \ge 0\). First, for \(\omega \in [0,R_0]\), where \({{\widehat{\kappa }}}_{\textrm{rad}}(\omega )=1\), we find that

$$\begin{aligned} |{\hat{\nu }}_{{\varvec{x}}_0}(\omega ,{\varvec{\xi }})|&=\left| \frac{r_{n_0}(t_0\omega )\frac{(-\textrm{i}t_0\omega )^{n_0}}{n_0!}}{{{\widehat{L}}}_{\textrm{rad}}(\omega )} \right| \le |t_0|^{n_0} \min {(|\omega |,2)} \frac{2\frac{|\omega |^{n_0}}{n_0!}}{|{{\widehat{L}}}_{\textrm{rad}}(\omega )|}. \end{aligned}$$
(79)

For the transition region \(\omega \in \varOmega _{01}=[R_0, {{\tilde{R}}}_{1}]\) with \({{\tilde{R}}}_{1}=\max (2R_0,R_1)\), where \(0\le {{\widehat{\kappa }}}_{\textrm{rad}}(\omega )\le 1\), we bound the numerator by its maximum, which yields

$$\begin{aligned} |{\hat{\nu }}_{{\varvec{x}}_0}(\omega ,{\varvec{\xi }})|&\le \frac{ 2 \frac{|t_0 {{\tilde{R}}}_1 |^{n_0}}{n_0!} +1}{|{{\widehat{L}}}_{\textrm{rad}}(\omega )|}. \end{aligned}$$
(80)

Finally, for \(\omega \in \varOmega _1=[{{\tilde{R}}}_1,\infty )\), where \({{\widehat{\kappa }}}_{\textrm{rad}}(\omega )=0\), we get the expected tail behavior

$$\begin{aligned} |{\hat{\nu }}_{{\varvec{x}}_0}(\omega ,{\varvec{\xi }})|&\le \frac{ 1}{|{{\widehat{L}}}_{\textrm{rad}}(\omega )|}\le \frac{ 1}{C_1|\omega |^{\gamma _1}}. \end{aligned}$$
(81)

Based on those bounds with \(t_0=1\), we define the auxiliary function

$$\begin{aligned} u(\omega )={\left\{ \begin{array}{ll} \min {(|\omega |,2)} \frac{2\frac{|\omega |^{n_0}}{n_0!}}{|{{\widehat{L}}}_{\textrm{rad}}(\omega )|},&{}|\omega |< R_0 \\ \frac{ 2 \frac{{{\tilde{R}}}_1^{n_0}}{n_0!} +1}{|{{\widehat{L}}}_{\textrm{rad}}(\omega )|}, &{}R_0\le |\omega |\le {{\tilde{R}}}_1\\ \frac{1}{|{{\widehat{L}}}_{\textrm{rad}}(\omega )|},&{}{{\tilde{R}}}_1< |\omega | \end{array}\right. } \end{aligned}$$
(82)

which, by construction, is such that \(\Vert u\Vert _{L_p}<\infty \) for any \(p\ge 1\). We can now use (79), (80), and (81) to bound the norm of \({\hat{\nu }}_{{\varvec{x}}_0}(\cdot ,{\varvec{\xi }})\) by distinguishing between two cases. For \(|t_0|\le 1\), we have \(\Vert {\hat{\nu }}_{{\varvec{x}}_0}(\cdot ,{\varvec{\xi }})\Vert _{L_p} \le \Vert u\Vert _{L_p}\), while, for \(|t_0|\ge 1\), we get \(\Vert {\hat{\nu }}_{{\varvec{x}}_0}(\cdot ,{\varvec{\xi }})\Vert _{L_p} \le |t_0|^{n_0} \Vert u\Vert _{L_p}\). By combining these two inequalities, we obtain the universal norm estimate

$$\begin{aligned} \Vert {\hat{\nu }}_{{\varvec{x}}_0}(\cdot ,{\varvec{\xi }})\Vert _{L_p({\mathbb {R}})}&\le (1+ |{\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0|)^{n_0} \Vert u\Vert _{L_p} < + \infty , \end{aligned}$$
(83)

which holds for all \(({\varvec{x}}_0,{\varvec{\xi }}) \in {\mathbb {R}}^d \times {\mathbb {S}}^{d-1}\) and \(p\ge 1\). Finally, we invoke the boundedness of the (inverse) Fourier transform \( {\mathcal {F}}^{-1}: L_p({\mathbb {R}}) \rightarrow L_q({\mathbb {R}})\) with \(q=\frac{p}{p-1}\in [2,\infty ]\) for \(p\in [1,2]\) (Hausdorff-Young inequality) to obtain the generic bound

$$\begin{aligned} \sup _{({\varvec{x}}_0,{\varvec{\xi }}) \in {\mathbb {R}}^d \times {\mathbb {S}}^{d-1}} (1+|{\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0|)^{-n_0} \Vert \nu _{{\varvec{x}}_0}(\cdot ,{\varvec{\xi }})\Vert _{L_q({\mathbb {R}})} < \infty , \end{aligned}$$
(84)

which is the desired result. \(\square \)
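As a quick numerical sanity check of the global bound (77) used in the proof (ours, for illustration only), the following sketch evaluates \(|r_N(\omega )|\) on a grid and compares it with \(\min (|\omega |/2, 1.27)\); the grid starts away from \(\omega =0\) because the naive evaluation of the Maclaurin remainder suffers from floating-point cancellation there.

import math
import numpy as np

# Numerical check of the bound |r_N(w)| <= min(|w|/2, 1.27) for the normalized
# Maclaurin remainder r_N of (75), evaluated on a grid away from w = 0.
def r_N(w, N):
    partial = sum((-1j * w) ** n / math.factorial(n) for n in range(N + 1))
    return (np.exp(-1j * w) - partial) / ((-1j * w) ** N / math.factorial(N))

w = np.linspace(0.5, 50.0, 100_000)
for N in range(1, 11):
    slack = np.minimum(w / 2, 1.27) - np.abs(r_N(w, N))
    print(f"N = {N:2d}: bound (77) holds on the grid -> {bool(slack.min() >= 0)}")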

As a complement to the proof of Theorem 7, we make the following remarks.

  1.

    The function \(\omega \mapsto {\hat{\nu }}_{{\varvec{x}}_0}(\omega ,{\varvec{\xi }})\) in (72) is \(\epsilon \)-Hölder continuous around the origin for \(\gamma _0 \notin 2 {\mathbb {N}}\) and as smooth as \({{\widehat{L}}}\) when \(\gamma _0=2n\). For instance, the case where \(\textrm{L}=(-\varDelta )^{n}\) is a non-fractional iterate of the Laplacian corresponds to \({{\widehat{L}}}_{\textrm{rad}}(\omega )=\omega ^{2n}\) and \({\hat{\nu }}_{{\varvec{x}}_0}(\cdot ,{\varvec{\xi }}) \in C^\infty ({\mathbb {R}})\), which then translates into \(t\mapsto \nu _{{\varvec{x}}_0}(t,{\varvec{\xi }})\) being rapidly decreasing, but with a limited order of differentiability controlled by \(\gamma _0=\gamma _1=2n\).

  2.

    The proof can be readily adapted to characterize the partial derivatives of \(\nu _{{\varvec{x}}_0}\) by replacing \(\delta (\cdot -{\varvec{x}}_0)\) by \(\partial ^{{\varvec{n}}}\delta (\cdot -{\varvec{x}}_0)\) with \(|{\varvec{n}}|< n_0\). These distributions are such that \(\langle \partial ^{{\varvec{n}}}\delta (\cdot -{\varvec{x}}_0), m_{\varvec{k}}\rangle =(-1)^{|{\varvec{n}}|}\partial ^{{\varvec{n}}}m_{\varvec{k}}({\varvec{x}}_0)=(-1)^{|{\varvec{n}}|}m_{{\varvec{k}}-{\varvec{n}}}({\varvec{x}}_0)\), with the convention that \(m_{{\varvec{k}}-{\varvec{n}}}=0\) if \(k_m<n_m\) for any \(m\in \{1,\dots ,d\}\).

  3.

    The leading term in (70) is \(\rho _{\textrm{rad}}(t-\tau _0)\) with \(\tau _0={\varvec{\xi }}^{\textsf{T}}{\varvec{x}}_0\), which, in the nontrivial scenario, typically grows as \(O(|t|^{\gamma _0-1})\). Our analysis shows that the second correction term in (70), which is unbounded at infinity as well, essentially neutralizes this growth. It is tempting to call this a miraculous cancelation.

The bottom line is that, for any \(({\varvec{x}}_0, {\varvec{\xi }}_0) \in {\mathbb {R}}^d \times {\mathbb {S}}^{d-1}\), the function \(t\mapsto \nu _{{\varvec{x}}_0}(t,{\varvec{\xi }}_0)\) is continuous, bounded with a maximum that is proportional to \(|{\varvec{\xi }}_0^{\textsf{T}}{\varvec{x}}_0|^{n_0}\) (see (84) with \(q=\infty \)), and vanishing at infinity. This is a remarkable property that also guarantees the boundedness of \(\textrm{L}_{\textrm{R}}^{*\dagger }: L_{1,n_0}({\mathbb {R}}^d) \rightarrow {\mathcal {X}}_{\textrm{Rad}}\), which is not obvious a priori. The enabling ingredient there is (71), which ensures that the corresponding bounding constant in Theorem 8 is finite. Indeed, since \(\Vert {\varvec{x}}\Vert \ge |{\varvec{\xi }}^{\textsf{T}}{\varvec{x}}|\) with equality if and only if \({\varvec{\xi }}\) and \({\varvec{x}}\) are collinear, we have that

$$\begin{aligned} \Vert \textrm{L}_{\textrm{R}}^{*\dagger }\Vert&\le \sup _{{\varvec{x}}\in {\mathbb {R}}^d, \; {\varvec{\xi }}\in {\mathbb {S}}^{d-1}} (1+\Vert {\varvec{x}}\Vert )^{-n_0} \Vert \nu _{{\varvec{x}}}(\cdot ,{\varvec{\xi }})\Vert _{{\mathcal {X}}({\mathbb {R}})}\\&\le \sup _{{\varvec{x}}\in {\mathbb {R}}^d,\; {\varvec{\xi }}\in {\mathbb {S}}^{d-1}} (1+|{\varvec{\xi }}^{\textsf{T}}{\varvec{x}}|)^{-n_0} \Vert \nu _{{\varvec{x}}}(\cdot ,{\varvec{\xi }})\Vert _{{\mathcal {X}}({\mathbb {R}})} < \infty . \end{aligned}$$

Theorem 8

(Stability of Cartesian-to-Radon-domain mappings) Let \(h_{{\varvec{x}}}(t,{\varvec{\xi }})={\textrm{T}}\{\delta (\cdot -{\varvec{x}})\}(t,{\varvec{\xi }})\) denote the generalized impulse response of the operator \({\textrm{T}}: L_{1,\alpha }({\mathbb {R}}^d) \rightarrow {\mathcal {X}}_{\textrm{Rad}} ({\mathbb {R}}\times {\mathbb {S}}^{d-1}) \) and let

$$\begin{aligned} C_{p,\alpha }=\sup _{{\varvec{x}}\in {\mathbb {R}}^d,\; {\varvec{\xi }}\in {\mathbb {S}}^{d-1}} (1+\Vert {\varvec{x}}\Vert )^{-\alpha } \Vert h_{{\varvec{x}}}(\cdot ,{\varvec{\xi }})\Vert _{L_p({\mathbb {R}})}. \end{aligned}$$
(85)
  1.

    \({\mathcal {X}}=C_0\): If \(C_{\infty ,\alpha }<\infty \), then \({\textrm{T}}: L_{1,\alpha }({\mathbb {R}}^d) \rightarrow C_{0,\mathrm Rad} \) is bounded with \(\Vert {\textrm{T}}\Vert \le C_{\infty ,\alpha }\).

  2.

    \({\mathcal {X}}=L_p\) with \(p\in (1,\infty )\): If \(C_{p,\alpha }<\infty \), then \({\textrm{T}}: L_{1,\alpha }({\mathbb {R}}^d) \rightarrow L_{p, \mathrm Rad} \) is bounded with \( \Vert {\textrm{T}}\Vert \le \frac{2 \pi ^{d/2}}{\varGamma (d/2)} C_{p,\alpha }\).

The same holds true for the adjoint \({\textrm{T}}^*: {\mathcal {X}}'_{\textrm{Rad}} \rightarrow L_{\infty ,-\alpha }({\mathbb {R}}^d)\) with \(\Vert {\textrm{T}}^*\Vert =\Vert {\textrm{T}}\Vert \).

Proof

The function \(\big ((t, {\varvec{\xi }}),{\varvec{x}}\big ) \mapsto h_{{\varvec{x}}}(t,{\varvec{\xi }})\) is the Schwartz kernel of the operator \({\textrm{T}}\), so that

$$\begin{aligned} |g(t,{\varvec{\xi }})|&=\big |{\textrm{T}}\{f\}(t,{\varvec{\xi }})\big | = \big |\int _{{\mathbb {R}}^d} h_{{\varvec{x}}}(t,{\varvec{\xi }}) \; f({\varvec{x}})\textrm{d}{\varvec{x}}\big |\\&\le \sup _{{\varvec{x}}\in {\mathbb {R}}^d,\; {\varvec{\xi }}\in {\mathbb {S}}^{d-1}}\left( (1+\Vert {\varvec{x}}\Vert )^{-\alpha } \Vert h_{{\varvec{x}}}(\cdot ,{\varvec{\xi }})\Vert _{L_\infty }\right) \;\int _{{\mathbb {R}}^d}(1+\Vert {\varvec{x}}\Vert )^{\alpha }|f({\varvec{x}})| \textrm{d}{\varvec{x}}. \end{aligned}$$

Consequently,

$$\begin{aligned} \Vert g\Vert _{L_{\infty }}=\sup _{(t,{\varvec{\xi }}) \in {\mathbb {R}}\times {\mathbb {S}}^{d-1}}|g(t,{\varvec{\xi }})|&\le C_{\infty ,\alpha } \Vert f\Vert _{L_{1,\alpha }}, \end{aligned}$$

which is the desired bound for \({\mathcal {X}}=C_0\).

To handle the reflexive case \({\mathcal {X}}=L_p\), we consider the adjoint operator \({\textrm{T}}^*\) whose Schwartz kernel \(\big ({\varvec{x}},(t, {\varvec{\xi }})\big ) \mapsto h_{{\varvec{x}}}(t,{\varvec{\xi }})\) is obtained by transposition. We now show that \({\textrm{T}}^*: L_q({\mathbb {R}}\times {\mathbb {S}}^{d-1}) \rightarrow L_{\infty ,-\alpha }({\mathbb {R}}^d)\) with \(q=\tfrac{p}{p-1}\) is bounded which, by duality, implies that the same holds true for \({\textrm{T}}\), since \(L_{1,\alpha }({\mathbb {R}}^d)\) is isometrically embedded in its bidual \(\big (L_{1,\alpha }({\mathbb {R}}^d)\big )''=\big (L_{\infty ,-\alpha }({\mathbb {R}}^d)\big )'\). The action of the adjoint operator is described as

$$\begin{aligned} f({\varvec{x}})&={\textrm{T}}^*\{g\}({\varvec{x}}) = \int _{{\mathbb {R}}} \int _{{\mathbb {S}}^{d-1}} h_{{\varvec{x}}}(t,{\varvec{\xi }}) \; g(t,{\varvec{\xi }})\textrm{d}{\varvec{\xi }}\textrm{d}t \end{aligned}$$

which, with the help of Hölder’s inequality, yields the bound

$$\begin{aligned} \big | (1+\Vert {\varvec{x}}\Vert )^{-\alpha } f({\varvec{x}})\big |&\le (1+\Vert {\varvec{x}}\Vert )^{-\alpha } \Vert h_{{\varvec{x}}}\Vert _{L_p({\mathbb {R}}\times {\mathbb {S}}^{d-1})} \; \Vert g\Vert _{L_q({\mathbb {R}}\times {\mathbb {S}}^{d-1})}\\&\le \sup _{{\varvec{x}}\in {\mathbb {R}}^d,\; {\varvec{\xi }}\in {\mathbb {S}}^{d-1}} \left( (1+\Vert {\varvec{x}}\Vert )^{-\alpha } S_{d} \;\Vert h_{{\varvec{x}}}(\cdot ,{\varvec{\xi }}) \Vert _{L_p({\mathbb {R}})}\right) \Vert g\Vert _{L_q({\mathbb {R}}\times {\mathbb {S}}^{d-1})}\\&\le S_{d}\; C_{p,\alpha } \; \Vert g\Vert _{L_q({\mathbb {R}}\times {\mathbb {S}}^{d-1})}, \end{aligned}$$

where \(S_d=\frac{2 \pi ^{d/2}}{\varGamma (d/2)} \) is the surface of the unit hypersphere \({\mathbb {S}}^{d-1}\). This proves the desired result with \(\Vert {\textrm{T}}\Vert =\Vert {\textrm{T}}^*\Vert \le S_d C_{p,\alpha }\). \(\square \)
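The following discretized sketch (ours, with toy data, \(d=1\), a single direction \({\varvec{\xi }}\), and an arbitrary kernel) illustrates the mechanics of the weighted estimate in the case \({\mathcal {X}}=C_0\): the sup of the output is controlled by \(C_{\infty ,\alpha }\) times the weighted \(L_1\) norm of the input.

import numpy as np

# Discrete illustration of the C_0 stability bound of Theorem 8:
#   max_t |g(t)| <= C_{inf,alpha} * ||f||_{L_{1,alpha}},
# with the integrals replaced by Riemann sums. All data below are toy choices.
rng = np.random.default_rng(0)
alpha, dx = 2.0, 0.05
x = np.arange(-5.0, 5.0, dx)                   # spatial grid (d = 1 for simplicity)
t = np.linspace(-3.0, 3.0, 121)                # Radon-domain variable (single xi)
h = rng.normal(size=(t.size, x.size)) * np.exp(-np.abs(t[:, None] - x[None, :]))
f = rng.normal(size=x.size) * np.exp(-x**2)

g = h @ f * dx                                 # g(t) = int h_x(t) f(x) dx, discretized
weight = (1.0 + np.abs(x)) ** alpha
C_inf = np.max(np.abs(h) / weight[None, :])    # sup_x (1 + |x|)^{-alpha} |h_x(t)|
lhs = np.max(np.abs(g))
rhs = C_inf * np.sum(weight * np.abs(f)) * dx  # C_{inf,alpha} * ||f||_{L_{1,alpha}}
print(bool(lhs <= rhs))                        # True: the discrete bound holds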

5.2 Characterization of the Predual Space \({\mathcal {X}}_{{\textrm{L}}_{\textrm{R}}}\)

The application of the general representer theorem of [58] requires that the native space \({\mathcal {X}}'_{\textrm{L}_{\textrm{R}}}=({\mathcal {X}}_{\textrm{L}_{\textrm{R}}})'\) be identifiable as the topological dual of some primary Banach space \({\mathcal {X}}_{\textrm{L}_{\textrm{R}}}\). The specification of the proper predual space is achieved constructively, through a completion process that ensures that \({\mathcal {X}}_{\textrm{L}_{\textrm{R}}}\) is a complete normed space (Banach property).

Theorem 9

(Construction of the predual Banach space \({\mathcal {X}}_{\textrm{L}_{\textrm{R}}}\)) Let \(\textrm{L}\) be an isotropic LSI operator with radial frequency profile \({{\widehat{L}}}_{\textrm{rad}}\) that is spline-admissible with a polynomial null space \({\mathcal {P}}\) of degree \(n_0\). Then, \(\textrm{L}_{\textrm{R}}^{*\dagger }: L_{1,n_0}({\mathbb {R}}^d) \rightarrow {\mathcal {X}}_{\textrm{Rad}}\) is bounded and admits the unique extension \(\textrm{L}_{\textrm{R}}^{*\dagger }: {\mathcal {X}}_{\textrm{L}_{\textrm{R}}}({\mathbb {R}}^d) \rightarrow {\mathcal {X}}_{\textrm{Rad}}\) with the properties

$$\begin{aligned} \forall v\in {\mathcal {X}}_{\textrm{Rad}} :&\quad \textrm{L}_{\textrm{R}}^{*\dagger } \textrm{L}_{\textrm{R}}^{*}\{v\}=v \end{aligned}$$
(86)
$$\begin{aligned} \forall p_0^*\in {\mathcal {P}}':&\quad \textrm{L}_{\textrm{R}}^{*\dagger }\{p_0^*\}=0 \end{aligned}$$
(87)
$$\begin{aligned} \forall f \in {\mathcal {X}}_{\textrm{L}_{\textrm{R}}}({\mathbb {R}}^d):&\quad \textrm{L}_{\textrm{R}}^{*}\textrm{L}_{\textrm{R}}^{*\dagger }\{f\} =(\textrm{Id}-\textrm{Proj}_{{\mathcal {P}}'})\{f\}=\textrm{Proj}_{\mathcal {U}}\{ f\}, \end{aligned}$$
(88)

where \(\textrm{L}_{\textrm{R}}^{*}=\textrm{L}^*{\textrm{R}}^*{\textrm{K}}_{\textrm{rad}}\) and \({\mathcal {X}}_{\textrm{L}_{\textrm{R}}}({\mathbb {R}}^d)=\textrm{L}^*_{\textrm{R}}({\mathcal {X}}_{\textrm{Rad}}) \oplus {\mathcal {P}}'\) is equipped with the norm \(\Vert f\Vert _{{\mathcal {X}}_{\textrm{L}_{\textrm{R}}}}{\mathop {=}\limits ^{\vartriangle }}\max (\Vert \textrm{L}_{\textrm{R}}^{*\dagger }\{f\}\Vert _{{\mathcal {X}}}, \Vert \textrm{Proj}_{{\mathcal {P}}'}\{f\}\Vert _{{\mathcal {P}}'})\). The space \({\mathcal {X}}_{\textrm{L}_{\textrm{R}}}\) is complete and isomorphic to \({\mathcal {X}}_{\textrm{Rad}} \times {\mathcal {P}}'\) with \(f =\textrm{L}^*_{\textrm{R}}\{v\} + p_0^*\mapsto (v,p_0^*)=(\textrm{L}_{\textrm{R}}^{*\dagger }\{f\}, \textrm{Proj}_{{\mathcal {P}}'}\{f\})\). Moreover, \({\mathcal {S}}({\mathbb {R}}^d) \subseteq {\mathcal {X}}_{\textrm{L}_{\textrm{R}}}({\mathbb {R}}^d) \subseteq {\mathcal {S}}'({\mathbb {R}}^d)\), with the embeddings being continuous and dense.

Theorem 9 obviously also applies to scenarios where the null space is trivial by setting \({\mathcal {P}}'=\{0\}\) and only retaining (86), in which case \({\mathcal {X}}_{\textrm{L}_{\textrm{R}}}({\mathbb {R}}^d)=\textrm{L}^*_{\textrm{R}}({\mathcal {X}}_{\textrm{Rad}})\) and \(\Vert f\Vert _{{\mathcal {X}}_{\textrm{L}_{\textrm{R}}}}=\Vert \textrm{L}_{\textrm{R}}^{*\dagger }\{f\}\Vert _{{\mathcal {X}}}\).

Proof

Since \({\textrm{R}}^{*}{\textrm{K}}_{\textrm{rad}}\big ({\mathcal {S}}_{\textrm{Rad}} \big )={\mathcal {S}}({\mathbb {R}}^d)\), the image of \({\mathcal {S}}_{\textrm{Rad}}\) under \(\textrm{L}_{\textrm{R}}^*\) is a vector space denoted by \({\mathcal {S}}_{\textrm{L}_{\textrm{R}}^*}({\mathbb {R}}^d)=\textrm{L}_{\textrm{R}}^{*}\big ({\mathcal {S}}_{\textrm{Rad}})=\textrm{L}^*\big ({\mathcal {S}}({\mathbb {R}}^d)\big )\). The spline-admissibility of \(\textrm{L}\) implies that \(\textrm{L}^*\) is injective on \({\mathcal {S}}({\mathbb {R}}^d)\) which, in turn, translates into the injectivity of \(\textrm{L}_{\textrm{R}}^*\) on \({\mathcal {S}}_{\textrm{Rad}}\). The latter statement is equivalent to the existence of a linear operator \(\left. \textrm{L}_{\textrm{R}}^{*-1}\right| _{{\mathcal {S}}_{\textrm{L}_{\textrm{R}}^*}}=\textrm{L}_{\textrm{R}}^{*-1}\) (for short) such that, for any \(u=\textrm{L}_{\textrm{R}}^{*}\{ \phi \}\in {\mathcal {S}}_{\textrm{L}_{\textrm{R}}^*}({\mathbb {R}}^d)\) with \(\phi \in {\mathcal {S}}_{\textrm{Rad}}\), it holds that

$$\begin{aligned} \textrm{L}_{\textrm{R}}^{*-1}\{u\}=\textrm{L}_{\textrm{R}}^{*-1}\textrm{L}_{\textrm{R}}^*\{\phi \} =\phi . \end{aligned}$$

Hence, if \(\phi \mapsto \Vert \phi \Vert _{{\mathcal {X}}}\) is a norm on \({\mathcal {S}}_{\textrm{Rad}}\), then the same holds true for \(u \mapsto \Vert u\Vert _{{\mathcal {U}}}{\mathop {=}\limits ^{\vartriangle }}\Vert \textrm{L}_{\textrm{R}}^{*-1}\{u\} \Vert _{{\mathcal {X}}}\) on \({\mathcal {S}}_{\textrm{L}_{\textrm{R}}^*}({\mathbb {R}}^d)\). This means that the normed spaces \(({\mathcal {S}}_{\textrm{Rad}},\Vert \cdot \Vert _{{\mathcal {X}}})\) and \(({\mathcal {S}}_{\textrm{L}_{\textrm{R}}^*}({\mathbb {R}}^d),\Vert \cdot \Vert _{{\mathcal {U}}})\) are (isometrically) isomorphic. Moreover, since \({\mathcal {S}}_{\textrm{L}_{\textrm{R}}^*}({\mathbb {R}}^d)\subset L_{1,n_0}({\mathbb {R}}^d)\) (Condition 4 in Definition 3), the inverse operator \(\left. \textrm{L}_{\textrm{R}}^{*-1}\right| _{{\mathcal {S}}_{\textrm{L}_{\textrm{R}}^*}}\) coincides on \({\mathcal {S}}_{\textrm{L}_{\textrm{R}}^*}\) with the operator \(\textrm{L}_{\textrm{R}}^{*\dagger }\) whose impulse response is characterized in Theorem 7. The well-posedness and boundedness of \(\textrm{L}_{\textrm{R}}^{*\dagger }\) on \(L_{1,n_0}({\mathbb {R}}^d)\) for \(n_0=\lceil \gamma _0-1\rceil \) follows from Theorem 8 and (71) in Theorem 7, which provides the required stability condition. The other fundamental ingredient is \(\langle m_{\varvec{k}}, u\rangle =\langle m_{\varvec{k}}, \textrm{L}_{\textrm{R}}^*\{\phi \}\rangle =\langle \textrm{L}_{\textrm{R}} \{m_{\varvec{k}}\}, \phi \rangle _{\textrm{Rad}}=0\) for all \(u\in {\mathcal {S}}_{\textrm{L}_{\textrm{R}}^*}({\mathbb {R}}^d)\) and \(|{\varvec{k}}|\le n_0\), which implies that \(\textrm{Proj}_{{\mathcal {P}}'}\{u\}=0\). Consequently, we have that

$$\begin{aligned} \textrm{L}_{\textrm{R}}^{*\dagger }\{u\}=\textrm{L}_{\textrm{R}}^{*-1}(\textrm{Id}-\textrm{Proj}_{{\mathcal {P}}'})\{u\}=\textrm{L}_{\textrm{R}}^{*-1}\{u\}=\textrm{L}_{\textrm{R}}^{*-1}\textrm{L}_{\textrm{R}}^*\{\phi \}=\phi , \end{aligned}$$

which confirms the equivalence of \(\textrm{L}_{\textrm{R}}^{*\dagger }\) and \(\left. \textrm{L}_{\textrm{R}}^{*-1}\right| _{{\mathcal {S}}_{\textrm{L}_{\textrm{R}}^*}}\).

So far, we have shown that the operator \(\textrm{L}_{\textrm{R}}^{*\dagger }: \big ({\mathcal {S}}_{\textrm{L}_{\textrm{R}}^*},\Vert \cdot \Vert _{{\mathcal {U}}}\big ) \rightarrow {\mathcal {X}}_{\textrm{Rad}}\) is an isometry. As next step, we invoke the bounded linear transformation (BLT) theorem [44, Theorem I.7, p 9] to uniquely extend the operator to the completed space \({\mathcal {U}}=\overline{\big ({\mathcal {S}}_{\textrm{L}_{\textrm{R}}^*},\Vert \cdot \Vert _{{\mathcal {U}}}\big )}\) which, by construction, is the Banach space equipped with the norm \(\Vert \cdot \Vert _{{\mathcal {U}}}\). This extension argument also applies the other way around: Since \(\Vert \textrm{L}_{\textrm{R}}^{*}\{\phi \}\Vert _{{\mathcal {U}}}=\Vert \phi \Vert _{{\mathcal {X}}}\) for all \(\phi \in {\mathcal {S}}_{\textrm{Rad}}\), the operator \(\textrm{L}_{\textrm{R}}^*: ({\mathcal {S}}_{\textrm{Rad}},\Vert \cdot \Vert _{{\mathcal {X}}}) \rightarrow {\mathcal {U}}\) has a unique (isometric) extension \(\textrm{L}_{\textrm{R}}^*: {\mathcal {X}}_{\textrm{Rad}} \rightarrow {\mathcal {U}}\) with \({\mathcal {X}}_{\textrm{Rad}}\) being the closure \({\mathcal {X}}_{\textrm{Rad}} =\overline{({\mathcal {S}}_{\textrm{Rad}},\Vert \cdot \Vert _{{\mathcal {X}}})}\). This proves that the spaces \({\mathcal {X}}_{\textrm{Rad}}\) and \({\mathcal {U}}\) are isometrically isomorphic with \({\mathcal {U}}=\textrm{L}_{\textrm{R}}^{*}\big ({\mathcal {X}}_{\textrm{Rad}}\big )\) and \({\mathcal {X}}_{\textrm{Rad}}=\textrm{L}_{\textrm{R}}^{*\dagger }\big ({\mathcal {U}}\big )\). In addition, we have that \({\mathcal {U}} \perp {\mathcal {P}}\), which means that \(\textrm{Proj}_{{\mathcal {P}}'}\{u\}=0\) for all \(u\in {\mathcal {U}}\). Since \({\mathcal {U}}\) and \({\mathcal {P}}'\) are both Banach spaces, the direct-sum space \({\mathcal {X}}_{\textrm{L}_{\textrm{R}}}={\mathcal {U}} \oplus {\mathcal {P}}'\), equipped with the composite norm \(\Vert f\Vert _{{\mathcal {X}}_{\textrm{L}_{\textrm{R}}}}=\max (\Vert \textrm{Proj}_{{\mathcal {U}}}\{f\}\Vert _{{\mathcal {U}}}, \Vert \textrm{Proj}_{{\mathcal {P}}'}\{f\}\Vert _{{\mathcal {P}}'})\), is complete (Banach property) and isomorphic to \({\mathcal {X}}_{\textrm{Rad}} \times {\mathcal {P}}'\). The final element is to recognize that \(\Vert \textrm{Proj}_{{\mathcal {U}}}\{f\}\Vert _{{\mathcal {U}}}=\Vert {\textrm{L}}_{\textrm{R}}^{*\dagger }\textrm{Proj}_{{\mathcal {U}}}\{f\}\Vert _{{\mathcal {X}}}=\Vert {\textrm{L}}_{\textrm{R}}^{*\dagger }\{f\}\Vert _{{\mathcal {X}}}\), where \(\textrm{Proj}_{{\mathcal {U}}}=(\textrm{Id}-\textrm{Proj}_{{\mathcal {P}}'})\). This direct-sum decomposition has an equivalent description in terms of operators, which is the form given in the statement of Theorem 9. Specifically, the isomorphism between \({\mathcal {U}}\) and \({\mathcal {X}}_{\textrm{Rad}}\) is expressed by (86) and (88) for \(f \in {\mathcal {U}}\). This is complemented by the null-space property (87), which ensures that the components of f that are in \({\mathcal {P}}'\) are annihilated by \(\textrm{L}_{\textrm{R}}^{*\dagger }\).

Embeddings: The denseness of \({\mathcal {S}}\) in \({\mathcal {X}}_{{\textrm{L}}_{\textrm{R}}}\) follows from the observation that \({\mathcal {S}}({\mathbb {R}}^d)=(\textrm{Id}-\textrm{Proj}_{{\mathcal {P}}'})\big ({\mathcal {S}}({\mathbb {R}}^d)\big ) \oplus {\mathcal {P}}'\). Since, by construction, one has that \((\textrm{Id}-\textrm{Proj}_{{\mathcal {P}}'})\big ({\mathcal {S}}({\mathbb {R}}^d)\big )\subset {\mathcal {U}}\) and \({\mathcal {P}}'\subset {\mathcal {X}}_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d)\), one also has that \({\mathcal {S}}({\mathbb {R}}^d)\subset {\mathcal {X}}_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d)\).

As for the relation \({\mathcal {X}}_{{\textrm{L}}_{\textrm{R}}}({\mathbb {R}}^d)\subseteq {\mathcal {S}}'({\mathbb {R}}^d)\), we already have that \({\mathcal {P}}'\subset {\mathcal {S}}'({\mathbb {R}}^d)\), by construction. To show that \({\mathcal {U}}\subset {\mathcal {S}}'({\mathbb {R}}^d)\), we invoke the intertwining relation \(\textrm{L}^*{\textrm{R}}^*={\textrm{R}}^*\textrm{L}_{\textrm{rad}}\) and express \({\mathcal {U}}=\textrm{L}^*{\textrm{R}}^*{\textrm{K}}_{\textrm{rad}}({\mathcal {X}}_{\textrm{Rad}})\) as \({\mathcal {U}}={\textrm{R}}^*{\textrm{Q}}_{\textrm{rad}}({\mathcal {X}}_{\textrm{Rad}})\), where \({\textrm{Q}}_{\textrm{rad}}={\textrm{L}}_{\textrm{rad}}{\textrm{K}}_{\textrm{rad}}: {\mathcal {X}}_{\textrm{Rad}} \rightarrow {\mathcal {S}}'({\mathbb {R}}\times {\mathbb {S}}^{d-1})\). Our hypotheses on \({\widehat{L}}_{\textrm{rad}}\) ensure that \({\textrm{Q}}_{\textrm{rad}}\{\phi \}\) is well-defined for every \(\phi \in {\mathcal {X}}_{\textrm{Rad}}\), with the operator being continuous in the weak topology of \({\mathcal {S}}'({\mathbb {R}}\times {\mathbb {S}}^{d-1})\). (As the latter is a complete nuclear space, the weak (sequential) convergence also ensures continuity in the strong topology [55].) Since \({\textrm{R}}^*: {\mathcal {S}}'({\mathbb {R}}\times {\mathbb {S}}^{d-1}) \rightarrow {\mathcal {S}}'({\mathbb {R}}^d)\) is continuous, as implied by (27), the composed operator \({\textrm{R}}^*{\textrm{Q}}_{\textrm{rad}}: {\mathcal {X}}_{\textrm{Rad}} \rightarrow {\mathcal {S}}'({\mathbb {R}}^d)\) is continuous as well, which proves that \({\mathcal {U}}\) and \({\mathcal {X}}_{{\textrm{L}}_{\textrm{R}}}\) are both continuously embedded in \({\mathcal {S}}'({\mathbb {R}}^d)\). The latter embedding is also dense by transitivity since \({\mathcal {S}}({\mathbb {R}}^d) \subset {\mathcal {X}}_{{\textrm{L}}_{\textrm{R}}}\) and \({\mathcal {S}}({\mathbb {R}}^d)\) is dense in \({\mathcal {S}}'({\mathbb {R}}^d)\) [19, 51]. \(\square \)