1 Introduction

Wavelet systems have long been employed in time-frequency analysis and approximation theory to break the uncertainty principle and resolve local singularities against global smoothness. Nonlinear approximation over redundant families of localized waveforms has enabled the construction of efficient sparse representations, becoming common practice in signal processing, source coding, noise reduction, and beyond. Sparse dictionaries are also an important tool in machine learning, where the extraction of few relevant features can significantly enhance a variety of learning tasks, allowing them to scale to enormous quantities of data. However, the role of wavelets in machine learning is still unclear, and the impact they have had in signal processing remains, by far, unmatched. One objective constraint to a direct application of classical wavelet techniques to modern data science is of a geometric nature: real data are typically high-dimensional and inherently structured, often featuring or hiding non-Euclidean topologies. On the other hand, a representation built on empirical samples poses an additional problem of stability, accounted for by how well it generalizes to future data. In this paper, expanding upon the ideas outlined in [35], we introduce a data-driven construction of wavelet frames on non-Euclidean domains, and provide stability results in high probability.

Starting from Haar’s seminal work [31] and since the founding contributions of Grossmann and Morlet [30], a general theory of wavelet transforms and a wealth of specific families of wavelets have rapidly arisen [10, 14, 23, 39, 41], first and foremost on \(\mathbb {R}^d\), but soon thereafter also on non-Euclidean structures such as manifolds and graphs [12, 13, 18, 20, 26, 28, 33, 44]. Generalized wavelets usually consist of frames bearing a more or less tight link to ideas from multi-resolution analysis. At the very least, elements of a wavelet frame ought to be associated with locations and scales, decomposing signals into a sum of local features at increasing resolution. On a basic conceptual level, many of these generalized constructions stem from a reinterpretation of the frequency domain as the spectrum of a differential operator. Indeed, wavelets on \({\mathbb {R}}\) are commonly generated by dilating and translating a well-localized function \(\psi \),

$$\begin{aligned} \psi _{a,b}(x) = |a|^{-1/2} \psi \left( \tfrac{x-b}{a} \right) \qquad a \ne 0, b \in {\mathbb {R}}; \end{aligned}$$

but taking the Fourier transform, they can be rewritten as

$$\begin{aligned} \psi _{a,b}(x) = \int |a|^{1/2} {\widehat{\psi }}(a\xi ) e^{2\pi \imath (x-b) \xi } d\xi = \int G_a(\xi ) \overline{v_\xi (b)} {v_\xi } (x) d\xi , \end{aligned}$$
(1)

with \( G_a(\xi ) = |a|^{1/2} {\widehat{\psi }}(a\xi ) \) and \( v_{\xi } (x) = e^{2\pi \imath x\xi } \). This allows us to reinterpret the wavelet \( \psi _{a,b}(x) \) as a superposition of Fourier harmonics \(v_{\xi } (x)\), modulated by a spectral filter \(G_a(\xi )\). Moreover, each \(v_{\xi }\) can be seen as an eigenfunction of the Laplacian \( \Delta = -d^2/dx^2 \). Hence, in principle, we may retrace an analogous construction whenever some notion of Laplacian is at hand. In particular, Riemannian manifolds and weighted graphs are examples of spaces where this is possible, using the Laplace–Beltrami operator or the graph Laplacian. A more detailed overview of related work based on these or similar ideas is postponed to Sects. 2 and 6.
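For completeness, the rewriting (1) amounts to the standard dilation–translation rule for the Fourier transform (with the unitary convention \( {\widehat{\psi }}(\xi ) = \int \psi (x) e^{-2\pi \imath x\xi } dx \)):

$$\begin{aligned} \widehat{\psi _{a,b}}(\xi ) = \int |a|^{-1/2} \psi \left( \tfrac{x-b}{a} \right) e^{-2\pi \imath x\xi } dx = |a|^{1/2} e^{-2\pi \imath b\xi } {\widehat{\psi }}(a\xi ), \end{aligned}$$

so that Fourier inversion gives exactly the first integral in (1).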

Thus far, the study of generalized wavelets on non-Euclidean domains has primarily focused on either the continuous or the discrete setting. It is nonetheless natural to investigate the relationship between the two. For instance, regarding a graph as a sample of a manifold, we may ask whether and in what sense the frame built on the graph tends to the one on the manifold. In this paper we present a unified framework for the construction and the comparison of continuous and discrete frames. Returning for a moment to the real line, let us consider the semigroup \( e^{-t \Delta } \) generated by the Laplacian. This defines an integral operator

$$\begin{aligned} e^{-t \Delta }f(x) = \int K_t(x,y) f(y) dy, \end{aligned}$$

with \(K_t(x,y)\) being the heat kernel. Such a representation suggests that the generalized Fourier analysis, already revisited as spectral analysis of the Laplacian, can now be translated in terms of a corresponding integral operator (see e.g. [13, 38]). With the attention shifting from the Laplacian to an integral kernel, our idea is to recast the above constructions inside a reproducing kernel Hilbert space. Exploiting the reproducing kernel, we will extend a discrete frame out of the given samples, and thus compare it to its natural continuous counterpart.

Our construction yields empirical frames \({\widehat{{\varvec{\Psi }}}}^N\) on sets of N data. We will show that \({\widehat{{\varvec{\Psi }}}}^N\) converges in high probability to a continuous frame \({\varvec{\Psi }}\) associated to a reproducing kernel Hilbert space \({\mathcal {H}}\) as \( N \rightarrow \infty \), thus providing a proof of its stability in an asymptotic sense. The empirical frames \({\widehat{{\varvec{\Psi }}}}^N \) can be seen as Monte Carlo estimates of \({\varvec{\Psi }}\). Repeated random sampling will in fact produce a sequence of frames \({\widehat{{\varvec{\Psi }}}}^N\) on an increasing chain of finite dimensional reproducing kernel Hilbert spaces \({\widehat{{\mathcal {H}}}}_N \)

$$\begin{aligned} \begin{matrix} {\widehat{{\mathcal {H}}}}_N &{} \subset &{} \widehat{{\mathcal {H}}}_{N+1} &{} \subset &{} \cdots &{} \subset &{} {\mathcal {H}}\\ {\widehat{{\varvec{\Psi }}}}^N &{} &{} \widehat{\varvec{\Psi }}^{N+1} &{} &{} \longrightarrow &{} &{} {\varvec{\Psi }}\end{matrix} \quad , \end{aligned}$$

which approximates \({\varvec{\Psi }}\) on \({\mathcal {H}}\) up to a desired sampling resolution quantifiable by finite sample bounds in high probability.

One may also look at our result as a form of stochastic discretization of continuous frames. Going from the continuum to the discrete setting is an important problem in frame theory and applications of coherent states. Given a continuous frame of a Hilbert space, the discretization problem [2, Chapter 17] asks to extract a discrete frame out of it. Originally motivated by the need for numerical implementations of coherent states arising in quantum mechanics [15, 51], the problem was then generalized to continuous frames [1] and addressed in several theoretical efforts [21, 24, 29], until it found a complete yet non-constructive characterization in [22]. Sampling the continuous frame is tantamount to sampling the parameter space on which the frame is indexed. For a wavelet frame, this means selecting a discrete set of scales and locations. While the discretization of the scales can be readily obtained by a dyadic parametrization, the difficult part is usually sampling locations, that is, the domain where the frame is defined. How to do this is known in many cases and consists in a careful selection of nets of well-covering but sufficiently separated points. Already delicate in the Euclidean setting, this procedure can be hard to generalize and implement in more general geometries [13]. In this respect, our Monte Carlo frame estimation provides a randomized approach to frame discretization as opposed to a deterministic sampling design. Clearly, our Monte Carlo estimate does not solve the discretization problem in its original form, since it defines frames only on finite dimensional subspaces. Rather, it provides an asymptotic approximate solution, computing frames on an invading sequence of subspaces \( \widehat{{\mathcal {H}}}_N \subset {\mathcal {H}}\). We should also remark that, due to covering properties, standard frame discretization always entails a loosening of the frame bounds; hence, in particular, only non-tight frames may be sampled, even when the starting continuous frame is Parseval. As a result, signal reconstruction with respect to the discretized frame will in general require the computation of a dual frame, which is a problem on its own. On the contrary, in our randomized construction we preserve the tightness, albeit at the expense of a (possibly large) loss of resolution power \( {\mathcal {H}}\setminus \widehat{{\mathcal {H}}}_N \).

The remainder of the paper is organized as follows. The general notation used throughout the paper is listed in Table 1. In Sect. 2 we relate our main contribution to recent constructions of wavelets on graphs. This is both a special case and a main motivation of the general theory developed in the subsequent sections. In Sect. 3 we introduce the general framework and define the fundamental objects used in our analysis. The focus is on kernels, reproducing kernel Hilbert spaces, and associated integral operators. In Sect. 4 we present our frame construction based on spectral calculus of the integral operator. Our theory encompasses continuous and discrete frames within a unified formalism, paving the way for a principled comparison of the two. In particular, in Sect. 5, interpreting discrete locations as samples from a probability distribution we propose a Monte Carlo method for the estimation of continuous frames. In Sect. 6 we compare and contrast our approach to the existing literature. In Sect. 7 we prove the consistency of our Monte Carlo wavelets and obtain explicit convergence rates under Sobolev regularity of the signals. This is done by combining techniques borrowed from the theory of spectral regularization with concentration of measure bounds. In Sect. 8 we study the convergence rates in Besov spaces. In Sect. 9 we draw our conclusions and point to some directions for future work.

Table 1 Notation

2 Wavelets on Graphs and Their Stability

In this section we discuss how the framework introduced in the paper may be used to study the stability of typical constructions of wavelets on graphs. We first recall a few elementary concepts about graphs and set up some notation. After that, we outline a natural construction of wavelets based on the graph Laplacian, and observe that such a construction may be recast in terms of a reproducing kernel. Finally, we explain how this allows us to establish the stability of wavelet frames in a suitable random graph model.

2.1 Wavelets on Graphs

We start with some basics of spectral graph theory. We only review what is strictly necessary for our purposes, and refer to [11] for further details.

Definition 2.1

(Weighted graph) An undirected graph is a pair \( {\mathcal {G}}= ( {\mathcal {V}}, {\mathcal {E}}) \), where \( {\mathcal {V}}\) is a finite discrete set of vertices \( {\mathcal {V}}:= \{ x_1, \ldots , x_N \} \), and \( {\mathcal {E}}\) is a set of unordered pairs \( {\mathcal {E}}\subset \{ \{ x_i, x_k \} : x_i, x_k \in {\mathcal {V}}\} \), called edges. A weighted (undirected) graph is an undirected graph with an associated weight function \( w : {\mathcal {E}}\rightarrow (0,+\infty ) \).

Arguably, one of the most remarkable facts about graphs is that it is possible to define on such a minimal structure a consistent notion of Laplacian. Functions on the graph, more precisely functions \( f : {\mathcal {V}}\rightarrow {\mathbb {R}}\), can be identified with vectors \( {\mathbf {f}}\in {\mathbb {R}}^N \) by \( {\mathbf {f}}_i := f(x_i) \), and equipped with the standard inner product \( {\mathbf {f}}^\top {\mathbf {g}}\) for \( {\mathbf {f}}, {\mathbf {g}}\in {\mathbb {R}}^N \). As an operator acting on functions, the graph Laplacian is thus defined by a matrix \( {\mathbf {L}}\in {\mathbb {R}}^{N \times N} \).

Definition 2.2

(Graph Laplacian) Let \( {\mathcal {G}}= ( {\mathcal {V}}, {\mathcal {E}}, w ) \) be a weighted graph. The weight matrix \( {\mathbf {W}}:= [ w_{i,k} ]_{i,k=1}^N \) is defined by \( w_{i,k} := w(\{x_i,x_k\}) \) for \( \{ x_i, x_k \} \in {\mathcal {E}}\), and \( w_{i,k} := 0 \) otherwise. The degree matrix \( {\mathbf {D}}:= {\text {diag}}(d_1,\ldots ,d_N) \) is defined by \( d_i := \sum _{k=1}^N w_{i,k} \). The unnormalized graph Laplacian is the matrix

$$\begin{aligned} {\mathbf {L}}:= {\mathbf {D}}- {\mathbf {W}}. \end{aligned}$$

Assuming that \( {\mathcal {G}}\) is connected, hence \( d_i > 0 \) for all \( i =1, \ldots , N \), the symmetric normalized graph Laplacian is \( {\mathbf {L}}' := {\mathbf {D}}^{-1/2} {\mathbf {L}}{\mathbf {D}}^{-1/2} = {\mathbf {I}}- {\mathbf {D}}^{-1/2} {\mathbf {W}}{\mathbf {D}}^{-1/2}. \) Several other variants are considered in the literature, including the random walk normalized graph Laplacian \( {\mathbf {D}}^{-1} {\mathbf {L}}= {\mathbf {I}}- {\mathbf {D}}^{-1} {\mathbf {W}}\), which is not symmetric but conjugate to \({\mathbf {L}}'\). The operators \({\mathbf {L}}\), \({\mathbf {L}}'\) and further normalizations result from different definitions of Hilbert structures on the spaces of functions on \({\mathcal {V}}\) and \({\mathcal {E}}\) [34]. While each operator gives rise to a different analysis, the choice of one or the other does not have formal consequences in our construction, hence, for simplicity, we will generically use \({\mathbf {L}}\).
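For concreteness, the following minimal Python sketch (our illustration; the graph and its weights are arbitrary demo choices) assembles the matrices of Definition 2.2 and the normalized variants just mentioned.

```python
import numpy as np

# A small connected weighted graph: arbitrary demo edges {x_i, x_k} with weights.
N = 4
W = np.zeros((N, N))
for i, k, w in [(0, 1, 1.0), (1, 2, 0.5), (2, 3, 2.0), (0, 3, 1.5)]:
    W[i, k] = W[k, i] = w                     # symmetric weight matrix

D = np.diag(W.sum(axis=1))                    # degree matrix
L = D - W                                     # unnormalized graph Laplacian
D_isqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_sym = D_isqrt @ L @ D_isqrt                 # symmetric normalized Laplacian L'
L_rw = np.linalg.solve(D, L)                  # random walk Laplacian D^{-1} L

# L is positive semi-definite; the smallest eigenvalue is 0 (connected graph).
print(np.round(np.linalg.eigvalsh(L), 6))
```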

The matrix \({\mathbf {L}}\) is positive semi-definite, hence it admits an orthonormal basis of eigenvectors with non-negative eigenvalues, customarily sorted in increasing order:

$$\begin{aligned} {\mathbf {L}}{\mathbf {u}}_i = \xi _i {\mathbf {u}}_i, \quad i = 0, \ldots , N-1, \qquad 0 = \xi _0 \le \xi _1 \le \cdots \le \xi _{N-1}. \end{aligned}$$

The spectrum of \({\mathbf {L}}\) reveals several important topological properties of the graph. In particular, a graph has as many connected components as zero eigenvalues, with the corresponding eigenvectors being piecewise constant on the components. We assume from now on that the graph is connected, hence \( \xi _1 > 0 \).

The graph Laplacian can be seen as a discrete analog of the continuous Laplace operator. This analogy justifies the interpretation of the eigenvectors \( {\mathbf {u}}_i \) as Fourier harmonics, and the corresponding eigenvalues \( \xi _i \) as frequencies. Accordingly, the graph Fourier transform is defined by

$$\begin{aligned} {\mathbf {F}}:= [ {\mathbf {u}}_0 \cdots {\mathbf {u}}_{N-1} ]^\top , \qquad [{\mathbf {F}}{\mathbf {f}}]_i := {\mathbf {u}}_i^\top {\mathbf {f}}. \end{aligned}$$

Note that the indexing hides that \( {\mathbf {F}}{\mathbf {f}}\) should be thought of as a function on the frequencies \( \xi _i \). Carrying the analogy forward, a family of graph wavelets can be constructed by spectral filtering of the Fourier basis as follows. Let \( \{ H_j \}_{j\ge 0} \) be a family of functions \( H_j : [0,+\infty ) \rightarrow [0,+\infty ) \) satisfying

$$\begin{aligned}&\sum _{j\ge 0} H_j(\xi )^2 = 1 \quad \text {for all } \xi \in [0,+\infty ), \\&\# \{ H_j : H_j(\xi _i) \ne 0 \} < \infty \quad \text {for } i = 0, \ldots , N-1. \end{aligned}$$

Then, the family

$$\begin{aligned} \varphi _{j,k} := \sum _{i=0}^{N-1} H_j(\xi _i) {\mathbf {u}}_i[k] {\mathbf {u}}_i \qquad j\ge 0,\, k = 1,\ldots ,N \end{aligned}$$
(2)

defines a Parseval frame on \({\mathcal {G}}\) [28, Theorem 2].
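As a numerical check of (2) (a sketch under our own choice of filters, not a prescription of [28]), one can pick heat-semigroup filters \(H_j\) whose squares telescope to 1 and verify that the frame operator approximates the identity up to the truncation of scales:

```python
import numpy as np

# Same demo graph as above.
N = 4
W = np.zeros((N, N))
for i, k, w in [(0, 1, 1.0), (1, 2, 0.5), (2, 3, 2.0), (0, 3, 1.5)]:
    W[i, k] = W[k, i] = w
L = np.diag(W.sum(axis=1)) - W
xi, U = np.linalg.eigh(L)                       # eigenpairs of the graph Laplacian

# Heat-semigroup filters (demo choice): H_0(x)^2 = exp(-x) and, for j >= 1,
# H_j(x)^2 = exp(-x/2^j) - exp(-x/2^(j-1)); the squares telescope to 1.
def H_sq(j, x):
    return np.exp(-x) if j == 0 else np.exp(-x / 2**j) - np.exp(-x / 2 ** (j - 1))

J = 30                                          # truncation scale
# Column k of Phi[j] is the wavelet phi_{j,k} = sum_i H_j(xi_i) u_i[k] u_i.
Phi = [U @ np.diag(np.sqrt(H_sq(j, xi))) @ U.T for j in range(J + 1)]

# Frame operator: sum_{j,k} phi_{j,k} phi_{j,k}^T ~ identity (Parseval, up to
# the scales j > J left out by the truncation).
S = sum(P @ P.T for P in Phi)
print(np.allclose(S, np.eye(N), atol=1e-6))     # True
```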

Let \( {\mathcal {H}}_{\mathcal {G}}:= {{\,\mathrm{span}\,}}\{{\mathbf {u}}_0\}^\perp = {{\,\mathrm{span}\,}}\{{\mathbf {u}}_1,\ldots ,{\mathbf {u}}_{N-1}\} \) be the space of all non-constant signals on \({\mathcal {G}}\). The graph Laplacian defines an inner product on \( {\mathcal {H}}_{\mathcal {G}}\) by \( \langle {\mathbf {f}}, {\mathbf {g}}\rangle _{\mathcal {G}}:= {\mathbf {f}}^\top {\mathbf {L}}{\mathbf {g}}\), which is invariant under graph isomorphisms. The Hilbert space \( {\mathcal {H}}_{\mathcal {G}}\) has reproducing kernel

$$\begin{aligned} {\mathbf {K}}:= {\mathbf {L}}^+. \end{aligned}$$

The matrix \({\mathbf {K}}\) on \({\mathcal {H}}_{\mathcal {G}}\) has the same eigenvectors \( {\mathbf {u}}_1, \ldots , {\mathbf {u}}_{N-1} \) as \({\mathbf {L}}\), and eigenvalues

$$\begin{aligned} {\lambda }_1 = \xi _1^{-1} \ge {\lambda }_2 = \xi _2^{-1} \ge \cdots \ge {\lambda }_{N-1} = \xi _{N-1}^{-1}. \end{aligned}$$

Therefore, the wavelets (2) can just as well be defined starting from the spectral decomposition of the reproducing kernel \({\mathbf {K}}\), rather than the Laplacian \({\mathbf {L}}\). Conversely, given any reproducing kernel \({\mathbf {K}}\), a frame may be constructed, without any reference to a Laplacian matrix. Indeed, this is the point of view taken in this paper.

Besides the equivalence in defining the frame, starting from a kernel implies some technical differences, but also opens up new theoretical possibilities. First, note that the spectrum gets flipped, hence the eigenvalues of the kernel should be thought of as inverses of Fourier frequencies. This seemingly irrelevant remark is actually important to correctly interpret the definitions of Sobolev and Besov spaces given in Sect. 8. Moreover, in light of this, the scale \(\tau \) in (27) can be understood as a frequency threshold, and the regularization \(\tau ^{-1}\) in the regression problem (29) as keeping the low frequencies. Reasoning in reproducing kernel Hilbert spaces also suggests further definitions of filtering beyond the typical band-pass filters of Example 4.5, employing regularization techniques from inverse problems, as exemplified in Table 2. Lastly, reproducing kernels naturally extend the wavelet functions outside the graph vertices, making it possible to analyze the stability of the graph wavelet frame for different random realizations of the graph. We elaborate on this in the next section.

2.2 Stability of Wavelets on Random Graphs

By virtue of their generality, graphs can be used to model a variety of discrete objects with pairwise relations, as well as to approximate complex geometries in continuous domains. In both cases, complexity and uncertainty are often handled by assuming an underlying random model and studying statistics and asymptotic behavior of relevant variables. In particular, neighborhood graphs are often used to approximate the Riemannian structure of a manifold. In a neighborhood graph, vertices are sampled at random from the manifold, and edges are drawn connecting vertices in suitable neighborhoods, such as k-nearest neighborhoods or \({\epsilon }\)-radius balls in the ambient Euclidean distance, or by weighting edges with a global (possibly truncated) kernel function, as in the sketch below.
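A minimal sketch of one such construction (an \({\epsilon }\)-ball rule with truncated Gaussian weights; all parameters here are arbitrary demo choices):

```python
import numpy as np

def neighborhood_graph(X, eps, sigma=0.5):
    # Connect vertices within an eps-ball in the ambient Euclidean distance and
    # weight the edges with a (truncated) Gaussian kernel; one sample per row.
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.where(D2 <= eps**2, np.exp(-D2 / (2 * sigma**2)), 0.0)
    np.fill_diagonal(W, 0.0)                   # no self-loops
    return W

# Vertices sampled at random from a manifold: here, the unit circle in R^2.
rng = np.random.default_rng(3)
t = rng.uniform(0, 2 * np.pi, 300)
X = np.stack([np.cos(t), np.sin(t)], axis=1)
W = neighborhood_graph(X, eps=0.3)
print(int((W > 0).sum()) // 2, "edges")
```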

The convergence of the graph Laplacian to the Laplace–Beltrami operator has been studied and quantified in several settings, both as a pointwise [4, 27, 34, 53, 55] and as a spectral limit [3, 25, 37, 46, 54]. On the other hand, wavelets have been generalized to continuous non-Euclidean domains, notably Riemannian manifolds and spaces of homogeneous type [13, 20, 26], and while the conceptual ingredients remain similar, the convergence of graph wavelets to manifold wavelets has hardly been studied. We next describe how our theory provides a way to fill this gap.

Suppose we have a graph \({\mathcal {G}}\) with vertices \(\{x_1,\ldots ,x_N\}\) and a positive definite kernel matrix \({\widehat{{\mathbf {K}}}}\). For instance, the matrix \(N{\widehat{{\mathbf {K}}}}\) may be the kernel associated with the graph Laplacian. Computing the eigenvalues \({\widehat{{\lambda }}}_i\) and eigenvectors \({\widehat{{\mathbf {u}}}}_i\) of \({\widehat{{\mathbf {K}}}}\), we can define, in analogy with (2), the family

$$\begin{aligned} {\widehat{{\varvec{\varphi }}}}_{j,k} := \sum _{i=1}^N F_j({\widehat{{\lambda }}}_i) {\widehat{{\mathbf {u}}}}_i[k] {\widehat{{\mathbf {u}}}}_i \qquad j\ge 0,\, k = 1,\ldots ,N, \end{aligned}$$
(3)

for a suitable spectral filter \(F_j({\lambda })\). By Proposition 4.7, (3) defines a Parseval frame on \({\mathcal {G}}\). Now, suppose that the vertices of our graph are sampled from a space \({\mathcal {X}}\) with probability distribution \(\rho \) and reproducing kernel K satisfying the assumptions of Sects. 3 and 4. Furthermore, suppose that the kernel matrix \({\widehat{{\mathbf {K}}}}\) is given by

$$\begin{aligned} {\widehat{{\mathbf {K}}}}[i,k] = N^{-1} K(x_i,x_k). \end{aligned}$$

For example, the space \({\mathcal {X}}\) may be a compact Riemannian manifold, in which case we could consider the heat kernel associated with the Laplace–Beltrami operator, and regard the kernel matrix as a discretization of the integral operator. As a discrete example, one may also think of \({\mathcal {X}}\) as a supergraph of \({\mathcal {G}}\). Thanks to Proposition 4.7, the family of Monte Carlo wavelets

$$\begin{aligned} {\widehat{\psi }}_{j,k} (x) := \sum _i G_j({\widehat{{\lambda }}}_i) \overline{{\widehat{v}}_i(x_k)} {\widehat{v}}_i(x) \qquad j\ge 0,\, k = 1,\ldots ,N \end{aligned}$$

is a Parseval frame isomorphic to (3). Crucially, in this new representation, the frame functions are well-defined both on and off the graph \({\mathcal {G}}\), and thus the convergence of the frame can be studied on a test signal \( f : {\mathcal {X}}\rightarrow {\mathbb {R}}\), as discussed in Sect. 7. The stability of the graph wavelets (3) can therefore be established by an application of Theorem 7.5 or 8.8.

Starting from the next section, we develop our theory in greater generality, but always bearing in mind the motivating setting just discussed.

3 Preliminaries

In this section we prepare the technical ground on which our results will be built (see also [46]). Let \(\mathcal{X}\) be a locally compact, second countable topological space endowed with a Borel probability measure \(\rho \). Given a continuous, positive semi-definite kernel

$$\begin{aligned} K : \mathcal{X}\times \mathcal{X}\rightarrow {\mathbb {C}}, \end{aligned}$$

we denote the associated reproducing kernel Hilbert space (RKHS) by

$$\begin{aligned} \mathcal{H}:= \overline{{{\,\mathrm{span}\,}}} \{ K_x : x\in \mathcal{X}\}, \end{aligned}$$

where \( K_x := K(\cdot ,x) \in \mathcal{H}\), and the closure is taken with respect to the inner product \( \langle K_x, K_y \rangle _{\mathcal{H}} := K(y,x)\). Elements of \(\mathcal{H}\) are continuous functions satisfying the following reproducing property:

$$\begin{aligned} f(x) = \left<{f},{K_x}\right>_\mathcal{H}\quad \text {for all } f \in \mathcal{H}. \end{aligned}$$
(4)

The space \(\mathcal{H}\) is separable, since \(\mathcal{X}\) is separable. We further assume K is bounded on \(\mathcal{X}\) and denote

$$\begin{aligned} \kappa := \sup _{x \in \mathcal{X}} \sqrt{K(x,x)} = \sup _{x \in \mathcal{X}} \Vert K_x\Vert _\mathcal{H}< \infty , \end{aligned}$$

which implies that \(\mathcal{H}\) is continuously embedded into the space of bounded continuous functions on \(\mathcal{X}\).

We define the (non-centered) covariance operator \({\mathrm {T}}: \mathcal{H}\rightarrow \mathcal{H}\) by

$$\begin{aligned} {\mathrm {T}}:= \int _\mathcal{X}K_x \otimes K_x \,d\rho (x), \end{aligned}$$
(5)

where the integral converges strongly. The operator \({\mathrm {T}}\) is positive and trace-class (therefore compact) with \( \sigma ({\mathrm {T}}) \subset [0,\kappa ^2] \). Hence, the spectral theorem ensures the existence of a countable orthonormal set \( \{ v_i\}_{i \in \mathcal{I}_\rho \cup \mathcal{I}_0} \subset \mathcal{H}\) and a sequence \( ( {\lambda }_i )_{i\in \mathcal{I}_\rho } \subset (0,\kappa ^2] \) such that

$$\begin{aligned} {\mathrm {T}}v_i = {\left\{ \begin{array}{ll} \lambda _i v_i &{} i\in \mathcal{I}_\rho \\ 0 &{} i\in \mathcal{I}_0 \end{array}\right. }. \end{aligned}$$

Let \(L^{2}({\mathcal{X},\rho })\) be the space of square-integrable functions on \(\mathcal{X}\) with respect to the measure \(\rho \), and denote \( \mathcal{X}_\rho := {\text {supp}}(\rho )\). We define the integral operator \({\mathrm {L}_K}: L^{2}({\mathcal{X},\rho })\rightarrow L^{2}({\mathcal{X},\rho })\) by

$$\begin{aligned} {\mathrm {L}_K}F (x) := \int _\mathcal{X}K(x,y) F(y)\, d\rho (y). \end{aligned}$$

The spaces \(\mathcal{H}\) and \(L^{2}({\mathcal{X},\rho })\) and the operators \({\mathrm {T}}\) and \({\mathrm {L}_K}\) are related through the inclusion operator \({\mathrm {S}}: \mathcal{H}\rightarrow L^{2}({\mathcal{X},\rho })\) defined by

$$\begin{aligned} {\mathrm {S}}f(x) := \left<{f},{K_x}\right>_{\mathcal{H}}. \end{aligned}$$

The adjoint operator \({\mathrm {S}}^*: L^{2}({\mathcal{X},\rho })\rightarrow \mathcal{H}\) acts as the strongly converging integral

$$\begin{aligned} {\mathrm {S}}^* F= \int _\mathcal{X}F(x) K_x\,d\rho (x). \end{aligned}$$

We have \( {\mathrm {T}}= {\mathrm {S}}^* {\mathrm {S}}\) and \( {\mathrm {L}_K}= {\mathrm {S}}{\mathrm {S}}^* \). Hence, \(\sigma ({\mathrm {T}})\backslash \{0\}=\sigma ({\mathrm {L}_K})\backslash \{0\}\), and the eigenfunctions \( \{ u_i \}_{i\in \mathcal{I}_\rho \cup \mathcal{I}_0} \subset L^{2}({\mathcal{X},\rho })\) of \({\mathrm {L}_K}\) satisfy

$$\begin{aligned} {\mathrm {S}}v_i = {\left\{ \begin{array}{ll} \sqrt{\lambda _i} u_i &{} i \in \mathcal{I}_\rho \\ 0 &{} i\in \mathcal{I}_0 \end{array}\right. }. \end{aligned}$$
(6)

Mercer’s theorem gives

$$\begin{aligned} \begin{aligned}&K(x,y) = \sum _{i \in \mathcal{I}_\rho \cup \mathcal{I}_0} \overline{v_i(x)} v_i(y) \quad \text {for } x,y \in \mathcal{X}, \\&K(x,y) = \sum _{i \in \mathcal{I}_\rho } {\lambda }_i \overline{u_i(x)} u_i(y) \quad \text {for } x,y \in \mathcal{X}_\rho , \end{aligned} \end{aligned}$$
(7)

where the series converge absolutely and uniformly on compact subsets.

Defining

$$\begin{aligned} \mathcal{H}_\rho := \overline{{{\,\mathrm{span}\,}}} \{ K_x : x \in \mathcal{X}_\rho \} = \overline{{{\,\mathrm{span}\,}}} \{ v_i : i \in \mathcal{I}_\rho \}, \end{aligned}$$

where the closure is taken in \({\mathcal {H}}\), we can identify \(\mathcal{H}_\rho \) as a (non-closed) subspace of \(L^{2}({\mathcal{X},\rho })\). The closure of \(\mathcal{H}_\rho \) in \(L^{2}({\mathcal{X},\rho })\) is

$$\begin{aligned} \overline{\mathcal{H}}_\rho : = \overline{{{\,\mathrm{span}\,}}} \{ u_i : i \in \mathcal{I}_\rho \}, \end{aligned}$$

and the following decompositions hold true:

$$\begin{aligned} \mathcal{H}= \mathcal{H}_\rho \oplus {\text {ker}}{\mathrm {S}}, \qquad L^{2}({\mathcal{X},\rho })= \overline{\mathcal{H}}_\rho \oplus {\text {ker}}{\mathrm {S}}^*. \end{aligned}$$

For \(f \in \mathcal{H}_\rho \), we can relate the norms in \(\mathcal{H}\) and \(L^{2}({\mathcal{X},\rho })\) as

$$\begin{aligned} \Vert {f}\Vert _{\rho } = \Vert \sqrt{\mathrm {T}}f \Vert _\mathcal{H}. \end{aligned}$$
(8)

In other words, \(\sqrt{{\mathrm {T}}}\) induces an isometric isomorphism between \(\overline{\mathcal{H}}_\rho \) and \({\mathcal{H}}_\rho \). We define the partial isometry \(\mathrm{U}:\mathcal{H}\rightarrow L^{2}({\mathcal{X},\rho })\), such that \(\mathrm{U}\mathcal{H}_\rho = \overline{\mathcal{H}}_\rho \), by

$$\begin{aligned} \mathrm{U}f = \sum _{i\in \mathcal{I}_\rho } \left<{f},{v_i}\right>_{\mathcal{H}} u_i. \end{aligned}$$

As examples of this setting, we may think of \(\mathcal{X}\) as \(\mathbb {R}^d\), or a non-Euclidean domain such as a compact connected Riemannian manifold or a weighted graph. In these cases, we can take K as the heat kernel associated with the proper notion of Laplacian, be it the Laplace–Beltrami operator or the graph Laplacian.
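To preview the sampling results of Sects. 5 and 7 in this setting, the following small experiment (our illustration, with a Gaussian kernel on the unit circle standing in for K) shows the leading eigenvalues of the empirical matrix \( N^{-1} K(x_i,x_k) \) stabilizing as N grows, consistently with their interpretation as approximations of the eigenvalues of \({\mathrm {L}_K}\).

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel(X, Y, sigma=0.5):
    # K(x, y) = exp(-|x - y|^2 / (2 sigma^2)) for points in R^2
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def sample_circle(N):
    # i.i.d. samples from the uniform distribution rho on the unit circle
    t = rng.uniform(0, 2 * np.pi, N)
    return np.stack([np.cos(t), np.sin(t)], axis=1)

for N in [100, 400, 1600]:
    X = sample_circle(N)
    lam = np.linalg.eigvalsh(gaussian_kernel(X, X) / N)[::-1]
    print(N, np.round(lam[:5], 4))    # leading eigenvalues stabilize as N grows
```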

4 Wavelet Frames by Reproducing Kernels

We now build Parseval frames in the RKHS \(\mathcal{H}\) and in \(L^{2}({\mathcal{X},\rho })\). Our construction is centered around eigenfunctions of the covariance operator (5) and filters on the corresponding eigenvalues. Continuous frames emerged in the mathematical physics community from the study of coherent states, as a generalization of the more common notion of a discrete frame [2, 23].

Definition 4.1

(Frame) Let \(\mathcal{H}\) be a Hilbert space, \(\mathcal{A}\) a locally compact space and \(\mu \) a Radon measure on \(\mathcal{A}\) with \({\text {supp}}\mu = \mathcal{A}\). A family \({\varvec{\Psi }}=\{ \psi _a: a\in \mathcal{A}\}\subset \mathcal{H}\) is called a frame for \(\mathcal{H}\) if there exist constants \(0<A\le B<\infty \) such that, for every \(f\in \mathcal{H}\), we have

$$\begin{aligned} A \left\| {f}\right\| _\mathcal{H}^2 \le \int _{\mathcal{A}} \left| {\left<{f},{\psi _a}\right>_\mathcal{H}}\right| ^2 d\mu (a) \le B \left\| {f}\right\| _\mathcal{H}^2. \end{aligned}$$

We say that \({\varvec{\Psi }}\) is tight if \(A=B\), and Parseval if \(A=B=1\).

In the above definition it is implicitly assumed that the map \(a \mapsto \left<{f},{\psi _a}\right>_{\mathcal {H}}\) is measurable for all \(f\in {\mathcal {H}}\). It is important to note that this definition depends on the choice of the measure \(\mu \). In the case of a counting measure, we recover the standard definition of a discrete frame.

4.1 Filters

To construct our wavelet frames, we first need to define filters, i.e. functions acting on the spectrum of \({\mathrm {T}}\) that satisfy a partition of unity condition.

Definition 4.2

(Filters) A family \(\{{G_j}\}_{j\ge 0} \) of measurable functions \({G_j}: [0,+\infty ) \rightarrow [0,+\infty )\) such that

$$\begin{aligned} \lambda \sum _{j\ge 0} {G_j}(\lambda )^2 = 1 \quad \text {for all } \lambda \in (0,\kappa ^2] \end{aligned}$$
(9)

is called a family of filters.

By the spectral theorem, \({G_j}({\mathrm {T}})\) is a (possibly unbounded) positive operator on \(\mathcal{H}\) such that \(\sigma ({G_j}({\mathrm {T}}))= {G_j}(\sigma ({\mathrm {T}})),\) with domain of definition

$$\begin{aligned} \mathcal{D}_j := \Bigg \{f\in \mathcal{H}: \sum _{i\in \mathcal{I}_\rho \cup \mathcal{I}_0} {G_j}(\lambda _i)^2 \,\left| {\left<{f},{v_i}\right>_{\mathcal{H}}}\right| ^2 <\infty \Bigg \}. \end{aligned}$$

It follows that

$$\begin{aligned} \mathcal{D}:= {\text {span}}\{ v_i:i\in \mathcal{I}_\rho \cup \mathcal{I}_0\}\subset \mathcal{D}_j \quad \text {for all } j \ge 0, \end{aligned}$$

and

$$\begin{aligned} {G_j}({\mathrm {T}}) v_i = {\left\{ \begin{array}{ll} {G_j}(\lambda _i) v_i, &{} i\in \mathcal{I}_\rho \\ {G_j}(0) v_i, &{} i\in \mathcal{I}_0 \end{array}\right. }. \end{aligned}$$

An easy way to define filters is by differences of suitable spectral functions.

Definition 4.3

(Spectral functions) A family \(\{g_j\}_{j\ge 0} \) of measurable functions \(g_j : [0,\infty ) \rightarrow [0,\infty )\) satisfying

$$\begin{aligned} 0 \le g_j \le g_{j+1}, \qquad \lim _{j\rightarrow \infty } \lambda g_j(\lambda ) = 1 \quad \text {for all } \lambda \in (0,\kappa ^2] \end{aligned}$$
(10)

is called a family of spectral functions.

Given a family of spectral functions \(\{g_j\}_{j\ge 0} \), filters \(\{{G_j}\}_{j\ge 0} \) can be obtained setting

$$\begin{aligned} G_0(\lambda ) := \sqrt{g_0(\lambda )}, \qquad G_{j+1} (\lambda ) := \sqrt{ g_{j+1}(\lambda ) - g_j(\lambda ) } \quad \text {for } j\ge 0. \end{aligned}$$
(11)

The filters thus defined give rise to a telescopic sum:

$$\begin{aligned} \sum _{j\le \tau } {G_j}(\lambda )^2=g_\tau (\lambda ). \end{aligned}$$
(12)

Taking the limit for \(\tau \rightarrow \infty \), condition (9) is satisfied thanks to (10). Conversely, starting from a family of filters \(\{G_j\}_{j\ge 0}\), we can define spectral functions \(\{g_j\}_{j\ge 0} \) by

$$\begin{aligned} g_j(\lambda ):=\sum _{\ell \le j} G_\ell (\lambda )^2 \quad \text {for } j \ge 0, \end{aligned}$$

which satisfies (10) thanks to (9). Therefore, the notions of filter and spectral function are equivalent, and we will refer to them interchangeably.
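As a numerical sanity check of this equivalence (a sketch with a Tikhonov-type spectral function on a geometric schedule, an arbitrary choice of ours), the telescoping identity (12) and the partition of unity (9) can be verified directly:

```python
import numpy as np

kappa2 = 1.0                                  # spectral interval (0, kappa^2]
lam = np.linspace(1e-6, kappa2, 1000)

def g(j, x):
    # Tikhonov-type spectral function, geometric schedule (demo choice):
    # g_j(x) = 1 / (x + 2^-j); nondecreasing in j, with x * g_j(x) -> 1.
    return 1.0 / (x + 2.0 ** (-j))

def G_sq(j, x):
    # Squared filters from (11): G_0^2 = g_0, G_{j+1}^2 = g_{j+1} - g_j >= 0.
    return g(0, x) if j == 0 else g(j, x) - g(j - 1, x)

tau = 40
partial = sum(G_sq(j, lam) for j in range(tau + 1))
print(np.allclose(partial, g(tau, lam)))           # telescoping identity (12)
print(float(np.max(np.abs(lam * partial - 1.0))))  # -> 0 as tau grows, cf. (9)
```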

The definition in (11) allows us to find a wealth of filters by tapping into regularization theory [19]. In the forthcoming analysis, we will use the following notion of qualification.

Definition 4.4

(Qualification) The qualification of a spectral function \(g_j:[0,\infty )\rightarrow [0,\infty )\) is the maximum constant \( \nu \in (0,\infty ] \) such that

$$\begin{aligned} \sup _{\lambda \in (0,\kappa ^2]} \lambda ^\nu \left| {1-\lambda g_j(\lambda )}\right| \le C_\nu j^{-\nu } \quad \text {for all } j \ge 0, \end{aligned}$$

where the constant \(C_\nu \) does not depend on j.

In the theory of regularization of ill-posed inverse problems [19], the qualification represents the limit within which a regularizer may exploit the regularity of the true solution. In particular, methods with finite qualification suffer from the so-called saturation effect.
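As a worked example of Definition 4.4, consider the Tikhonov regularizer \( g_j(\lambda ) = (\lambda + j^{-1})^{-1} \) from Table 2. Then

$$\begin{aligned} 1 - \lambda g_j(\lambda ) = \frac{j^{-1}}{\lambda + j^{-1}}, \qquad \sup _{\lambda \in (0,\kappa ^2]} \lambda \left| {1-\lambda g_j(\lambda )}\right| \le j^{-1}, \end{aligned}$$

so the condition holds with \( \nu = 1 \); on the other hand, for \( \nu > 1 \), taking \( \lambda = \kappa ^2 \) shows that the supremum decays no faster than \( j^{-1} \), which is not \( \mathcal{O}(j^{-\nu }) \). Hence Tikhonov regularization has qualification 1, a classical instance of the saturation effect just mentioned.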

Some standard examples of spectral functions, together with their qualifications, are listed in Table 2.

Table 2 Spectral regularizers and their qualifications. Landweber iteration and Nesterov acceleration require \( \gamma < 1 / \kappa ^2 \) and \( \beta \ge 1 \). In heavy ball, \( \alpha _j,\, \beta _j \) are suitably selected sequences depending on \(\nu \), where \(\nu \) is any positive real (see [43])

Additional examples of admissible filters widely used in the construction of wavelet frames (see e.g. [13, 20]) are given by the following:

Example 4.5

(Localized filters) Let \( g \in C^\infty ([0,\infty )) \) such that \( {\text {supp}}(g) \subset (2^{-1},\infty ) \), \( 0 \le g \le 1 \), and \( g({\lambda }) = 1 \) for all \( {\lambda }\ge 1 \). Define

$$\begin{aligned} {\lambda }g_{j}({\lambda }) := g(2^j{\lambda }). \end{aligned}$$

Then the family \( \{ g_j \}_{j\ge 0} \) satisfies the properties (10). Furthermore, the corresponding filters (11) are localized, meaning that, defining \( F_j({\lambda }) := \sqrt{{\lambda }} G_j({\lambda }) \), we have

$$\begin{aligned} {\text {supp}}( F_0 ) \subset (2^{-1},\infty ), \qquad {\text {supp}}( F_j ) \subset ( 2^{-j-1}, 2^{-j+1} ) \quad \text {for } j \ge 1. \end{aligned}$$
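A concrete instance of Example 4.5 (a sketch with our own choice of the smooth cut-off g, built from a standard \(C^\infty \) bump; the constants 0.6 and 0.9 are arbitrary within the required constraints) is the following; the prints confirm the dyadic supports of \(F_j\).

```python
import numpy as np

def smooth_step(t):
    # C-infinity step: 0 for t <= 0, 1 for t >= 1 (standard bump construction).
    t = np.clip(t, -1.0, 2.0)
    h = lambda u: np.where(u > 0, np.exp(-1.0 / np.maximum(u, 1e-12)), 0.0)
    return h(t) / (h(t) + h(1.0 - t))

def g(x):
    # Smooth cut-off: 0 on (-inf, 0.6], 1 on [0.9, +inf); hence supp(g) lies in
    # (1/2, inf) and g = 1 on [1, inf), as required in Example 4.5.
    return smooth_step((x - 0.6) / 0.3)

def F_sq(j, lam):
    # F_j(lam)^2 = lam G_j(lam)^2 telescopes: g(2^j lam) - g(2^(j-1) lam).
    return g(lam) if j == 0 else g(2.0**j * lam) - g(2.0 ** (j - 1) * lam)

lam = np.linspace(1e-4, 1.0, 100000)
for j in range(4):
    on = lam[F_sq(j, lam) > 1e-12]
    print(j, round(float(on.min()), 4), round(float(on.max()), 4))
# supports fall within (2^-1, inf) for j = 0 and (2^-j-1, 2^-j+1) for j >= 1
```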

4.2 Frames

We are now ready to define our wavelet frames. We first form frame elements in \(\mathcal{H}\), and then use the partial isometry \(\mathrm{U}:\mathcal{H}\rightarrow L^{2}({\mathcal{X},\rho })\) to obtain frames in \(L^{2}({\mathcal{X},\rho })\).

Definition 4.6

(Wavelets) Let \(\{{G_j}\}_{j\ge 0}\) be a family of filters as in Definition 4.2, and assume

$$\begin{aligned} K_x \in \mathcal{D}_{j} \quad \text {for all } j \ge 0 \text { and almost every } x\in \mathcal{X}_{\rho }. \end{aligned}$$
(13)

We define the families of wavelets

$$\begin{aligned} {\varvec{\Psi }}:=\{\psi _{j,x}: j\ge 0,x\in \mathcal{X}_\rho \}\subset \mathcal{H}, \qquad \varvec{\Phi }:=\{\varphi _{j,x}: j\ge 0,x\in \mathcal{X}_\rho \}\subset L^{2}({\mathcal{X},\rho }), \end{aligned}$$

where

$$\begin{aligned} \psi _{j,x} :={G_j}({\mathrm {T}})K_x, \qquad \varphi _{j,x} := \mathrm{U}{G_j}({\mathrm {T}})K_x \quad \text {for } j\ge 0 \text { and } x\in \mathcal{X}_\rho . \end{aligned}$$
(14)

Observe that, since \( \psi _{j,x} \) and \( \varphi _{j,x} \) are defined for \( x \in \mathcal{X}_\rho \), we actually have \( {\varvec{\Psi }}\subset \mathcal{H}_\rho \subset \mathcal{H}\), and \( \varvec{\Phi }\subset \overline{\mathcal{H}}_\rho \subset L^{2}({\mathcal{X},\rho })\). In particular, the orthogonality of \(\mathcal{H}_\rho \) and \({\text {ker}}{\mathrm {S}}\) entails \( \left<{K_x},{G_j(T)v_i}\right>_\mathcal{H}= 0 \) for all \( i \in \mathcal{I}_0 \). By the reproducing property (4), condition (13) is thus equivalent to

$$\begin{aligned} \sum _{i\in \mathcal{I}_\rho } {G_j}(\lambda _i)^2\left| { v_i(x) }\right| ^2 <\infty \quad \text {for all } j\ge 0 \text { and almost every } x\in \mathcal{X}_{\rho }. \end{aligned}$$

If \({G_j}\) is a bounded function, then \({G_j}({\mathrm {T}})\) is a bounded operator, hence \(\mathcal{D}_j=\mathcal{H}\). In this case, which includes the spectral functions listed in Table 2, condition (13) is trivially satisfied.

Using the spectral decomposition of \({G_j}({\mathrm {T}})\) and the reproducing property, we obtain

$$\begin{aligned} \psi _{j,x}(y) = \sum _{i\in \mathcal{I}_\rho } {G_j}(\lambda _i) \overline{v_i(x)} v_i(y), \quad \varphi _{j,x}(y) = \sum _{i\in \mathcal{I}_\rho } \sqrt{\lambda _i}{G_j}(\lambda _i) \overline{u_i(x)}\, u_i(y). \end{aligned}$$
(15)

These expressions allow us to interpret \({\varvec{\Psi }}\) and \(\varvec{\Phi }\) as families of wavelets, in the sense of (1). We interpret x as the location and j as the scale parameter; the functions \(K_x\) localize the signal in space, whereas the filters \({G_j}\) regularize or localize in frequency. Note also the analogy with (7), in the light of which (15) may be seen as a filtered Mercer representation.

With the following proposition we show that (14) defines Parseval frames.

Proposition 4.7

Assume the setting in Sect. 3, and let \( {\varvec{\Psi }}, \varvec{\Phi }\) be defined as in Definition 4.6. Then, for every \(f\in \mathcal{H}\) we have

$$\begin{aligned} \sum _{j\ge 0}\int _{\mathcal{X}} \left| {\left<{f},{\psi _{j,x}}\right>_{\mathcal{H}}}\right| ^2\, d\rho (x) = \big \Vert {\mathrm{P}_{\mathcal{H}_\rho }f}\big \Vert _\mathcal{H}^2, \end{aligned}$$
(16)

and for any \(F\in L^{2}({\mathcal{X},\rho })\) we have

$$\begin{aligned} \sum _{j\ge 0}\int _{\mathcal{X}} \big |{\left<{F},{\varphi _{j,x}}\right>_\rho }\big |^2\, d\rho (x) = \big \Vert {\mathrm{P}_{\overline{\mathcal{H}}_\rho } F}\big \Vert _{\rho }^2. \end{aligned}$$
(17)

Proof

The equality (17) follows from (16) and the fact that U is unitary from \(\mathcal{H}_\rho \) to \(\overline{\mathcal{H}}_\rho \). To establish (16), in view of Lemma A.1 it suffices to consider functions in the dense subspace \( \mathcal{D}\subset \mathcal{H}\). Thus, let \( f \in \mathcal{D}\). Since \({G_j}({\mathrm {T}})\) is self-adjoint on \(\mathcal{D}_j\), and \( \mathcal{D}\subset \mathcal{D}_j \) for all j, we have

$$\begin{aligned} \left<{f},{\psi _{j,x}}\right>_\mathcal{H}= \left<{f},{{G_j}({\mathrm {T}}) K_x}\right>_{\mathcal{H}} = \left<{{G_j}({\mathrm {T}}) f},{K_x}\right>_{\mathcal{H}}, \end{aligned}$$

which integrated over \(x\in \mathcal{X}\) gives

$$\begin{aligned} \int _{ \mathcal{X}} \left| {\left<{f},{\psi _{j,x}}\right>_{\mathcal{H}}}\right| ^2\, d\rho (x) = \left<{{\mathrm {T}}{G_j}({\mathrm {T}}) f },{{G_j}({\mathrm {T}}) f}\right>_\mathcal{H}= \left<{{\mathrm {T}}{G_j}({\mathrm {T}})^2\,f},{f}\right>_\mathcal{H}. \end{aligned}$$
(18)

Summing over \(j\ge 0\) and using (9), we therefore obtain

$$\begin{aligned} \sum _{j\ge 0} \left<{{\mathrm {T}}{G_j}({\mathrm {T}})^2\, f},{f}\right>_{\mathcal{H}}&= \sum _{i\in \mathcal{I}_\rho } \Big (\left| {\left<{f},{v_i}\right>_{\mathcal{H}}}\right| ^2\sum _{j\ge 0} \lambda _i{G_j}(\lambda _i)^2 \Big ) \\&= \sum _{i\in \mathcal{I}_\rho } \left| {\left<{f},{v_i}\right>_{\mathcal{H}}}\right| ^2 = \left\| {\mathrm{P}_{\mathcal{H}_\rho }f}\right\| _{\mathcal{H}}^2. \end{aligned}$$

\(\square \)

The frame property can also be expressed as a resolution of the identity. Such a formulation will be particularly useful in Sect. 7.

Proposition 4.8

Under the assumptions of Proposition 4.7, there exists a positive bounded operator \({\mathrm {T}}_j:\mathcal{H}\rightarrow \mathcal{H}\) such that

$$\begin{aligned} {\mathrm {T}}_j = \int _{\mathcal {X}}\psi _{j,x}\otimes \psi _{j,x} \, d\rho (x), \end{aligned}$$
(19)

where the integral converges weakly. Furthermore,

$$\begin{aligned}&{\mathrm {T}}_j= {\mathrm {T}}{G_j}({\mathrm {T}})^2, \end{aligned}$$
(20)
$$\begin{aligned}&\sum _{j\le \tau } {\mathrm {T}}_j = {\mathrm {T}}g_\tau ({\mathrm {T}}), \end{aligned}$$
(21)

and the following resolution of the identity holds true:

$$\begin{aligned} \mathrm{P}_{\mathcal{H}_\rho } = \sum _{j\ge 0} {\mathrm {T}}_j. \end{aligned}$$
(22)

Proof

From (18) we have, for all \( f \in \mathcal{D}\),

$$\begin{aligned} \int _{ \mathcal{X}} \left| {\left<{f},{\psi _{j,x}}\right>_{\mathcal{H}}}\right| ^2\, d\rho (x) \le \Vert {\mathrm {T}}{G_j}({\mathrm {T}})^2\Vert \Vert f\Vert _{{\mathcal {H}}}^2, \end{aligned}$$

where \({\mathrm {T}}{G_j}({\mathrm {T}})^2\) is bounded since \(\lambda {G_j}(\lambda )^2 \le 1\) by (9). Hence, thanks to Lemma A.1, there exists a positive bounded operator \({\mathrm {T}}_j\) as in (19). Moreover, (18) implies (20) by the density of \(\mathcal{D}\). The equality (21) follows from (20) and (12). Lastly, (22) is a reformulation of (16). \(\square \)

Depending on the choice of the measure \(\rho \), Proposition 4.7 gives the frame property for either a continuous or a discrete setting. Namely, consider a discrete set \( \{x_1,\ldots ,x_N\} \), and let

$$\begin{aligned} \widehat{\rho }_N := \frac{1}{N}\sum _{k=1}^N \delta _{x_k}. \end{aligned}$$

With the choice of the discrete measure \( \widehat{\rho }_N\), (5) defines the discrete (non-centered) covariance operator \( {\widehat{\mathrm {T}} }:\mathcal{H}\rightarrow \mathcal{H}\) by

$$\begin{aligned} {\widehat{\mathrm {T}} }:= \frac{1}{N} \sum _{k=1}^N K_{x_k}\otimes K_{x_k}. \end{aligned}$$

Furthermore, Definition 4.6 produces the family of wavelets

$$\begin{aligned} \widehat{\psi }_{j,k} := {G_j}({\widehat{\mathrm {T}} }) K_{x_k} \quad \text {for } j\ge 0 \text { and } k=1,\ldots ,N, \end{aligned}$$

which, by Proposition 4.7, constitutes a discrete Parseval frame on

$$\begin{aligned} \widehat{\mathcal{H}}_N := \mathcal{H}_{\widehat{\rho }_N} = {{\,\mathrm{span}\,}}\{ K_{x_k} : k=1,\ldots ,N\} \simeq \mathbb {C}^N. \end{aligned}$$

In Sect. 5 we will make reference to this construction to define Monte Carlo wavelets, where the points \(x_1,\ldots ,x_N\) are drawn at random from \(\mathcal{X}_\rho \).

4.3 Two Generalizations

We discuss here two generalizations of the framework presented in Sect. 4.2. First, one may readily consider more general scale parameterizations. Namely, let \(\Omega \) be a locally compact, second countable topological space, endowed with a measure \(\mu \) defined on the Borel \(\sigma \)-algebra of \(\Omega \), finite on compact subsets, and such that \({\text {supp}}\mu =\Omega \). Adjusting the definitions accordingly, such as replacing the sums over all non-negative integers j in (9) and (16) with integrals over \(\Omega \) with respect to \(\mu \), the proof of Proposition 4.7 follows along the same steps. In this context, Definition 4.2 can be seen as a special case where \(\Omega \) is countable and \(\mu \) is the counting measure. Second, the assumption that the kernel K is bounded, implying that \({\mathrm {L}_K}\) admits an orthonormal basis of eigenvectors, is not necessary for our construction of Parseval frames. Indeed, it is enough to assume that

$$\begin{aligned} \int _{ \mathcal{X}}\left| { f(x)}\right| ^2\, d\rho (x) <+\infty \quad \text {for all } f\in {\mathcal {H}}. \end{aligned}$$

This implies that \({\mathcal {H}}\) is a subspace of \(L^{2}({\mathcal{X},\rho })\) and the inclusion operator \({\mathrm {S}}\) is bounded. The integral (5) now converges in the weak operator topology, and the covariance operator \({\mathrm {T}}\) is positive and bounded. Thus, the Riesz–Markov theorem entails that, for all \(f\in \mathcal{H}\), there is a unique finite measure \(\nu _f\) on \([0,+\infty )\) such that \(\nu _f\left( [0,+\infty )\right) =\left\| {f}\right\| _\mathcal{H}^2\) and

$$\begin{aligned} \left<{{\mathrm {T}}f},{f}\right>_\mathcal{H}= \int _{[0,{+\infty })} \lambda d\nu _f(\lambda ). \end{aligned}$$

By spectral calculus, there exists a unique positive operator \({G_j}({\mathrm {T}}):\mathcal{D}_j\rightarrow \mathcal{H}\) such that

$$\begin{aligned} \left<{{G_j}({\mathrm {T}})f},{f}\right>_\mathcal{H}= \int _{[0,{+\infty })} {G_j}(\lambda ) d\nu _f(\lambda ), \end{aligned}$$

where now

$$\begin{aligned} \mathcal{D}_j :=\Big \{f\in \mathcal{H}: \int _{[0,{+\infty })} {G_j}(\lambda )^2 d\nu _f(\lambda ) <\infty \Big \}. \end{aligned}$$

Assume further that

$$\begin{aligned} \mathcal{D}_\infty :=\{f\in \mathcal{H}: f\in {\text {dom}} {G_j}({\mathrm {T}})^2 \text { for all } j\ge 0\} \end{aligned}$$

is a dense subset of \(\mathcal{H}\). Assumption (13) and Definition 4.6 are still valid. Moreover, the proof of Proposition 4.7 remains essentially unchanged. The only difference is in the following chain of equalities: for a given \(f\in \mathcal{D}_\infty \), we have

$$\begin{aligned} \sum _{j\ge 0} \left<{{G_j}({\mathrm {T}})^2{\mathrm {T}}f},{f}\right>_{\mathcal{H}}&= \sum _{j\ge 0} \int _{[0,{+\infty })} \lambda {G_j}(\lambda )^2 d\nu _f(\lambda ) \\&= \int _{(0,{+\infty })} \Big ( \sum _{j\ge 0} \lambda {G_j}(\lambda )^2 \Big ) d\nu _f(\lambda ) \\&= \int _{(0,{+\infty })} 1 \,d\nu _f(\lambda ) = \left\| {\mathrm{P}_{\mathcal{H}_\rho } f}\right\| _{\mathcal{H}}^2, \end{aligned}$$

where the second equality is due to Tonelli’s theorem.

5 Monte Carlo Wavelets

We are finally ready to define our Monte Carlo wavelets. In the following, we adopt notations, definitions and assumptions of Sects. 3 and 4. For the sake of simplicity, we further assume \({\text {supp}}(\rho ) = \mathcal{X}\), so that \(\mathcal{H}_\rho =\mathcal{H}\). By Proposition 4.7, the family \({\varvec{\Psi }}\) defined in (14) describes a Parseval frame on the entire Hilbert space \(\mathcal{H}\).

Definition 5.1

(Monte Carlo wavelets) Suppose we have N independent and identically distributed samples \(x_1,\ldots ,x_N\sim \rho \). Consider the empirical covariance operator \( {\widehat{\mathrm {T}} }:\mathcal{H}\rightarrow \mathcal{H}\) defined by

$$\begin{aligned} {\widehat{\mathrm {T}} }:= \frac{1}{N} \sum _{k=1}^N K_{x_k}\otimes K_{x_k}. \end{aligned}$$

Let \(\{{G_j}\}_{j\ge 0}\) be a family of filters as in Definition 4.2. We call

$$\begin{aligned} \widehat{\varvec{\Psi }}^N:=\big \{\widehat{\psi }_{j,k} := {G_j}({\widehat{\mathrm {T}} }) K_{x_k}\,:\, j\ge 0 \text { and } k=1,\ldots , N\big \} \end{aligned}$$

a family of Monte Carlo wavelets.

The family \( \widehat{\varvec{\Psi }}^N \) of Definition 5.1 corresponds to the family \({\varvec{\Psi }}\) of Definition 4.6 with respect to the empirical measure \( \widehat{\rho }_N := \frac{1}{N}\sum _{k=1}^N \delta _{x_k} \). Hence, thanks to Proposition 4.7, \(\widehat{{\varvec{\Psi }}}^N\) defines a discrete Parseval frame on the finite dimensional space

$$\begin{aligned} \widehat{\mathcal{H}}_N := {{\,\mathrm{span}\,}}\{ K_{x_k} :k=1,\ldots ,N\}. \end{aligned}$$

Now, let \({\varvec{\Psi }}\) be the family of wavelets in the sense of Definition 4.6 with respect to the (continuous) measure \(\rho \). Again by Proposition 4.7, \({\varvec{\Psi }}\) is a (continuous) Parseval frame on the (infinite dimensional) space \(\mathcal{H}\). Taking more and more samples, we obtain a sequence of frames \(\widehat{{\varvec{\Psi }}}^N\) on a chain of nested subspaces of increasing dimension:

$$\begin{aligned} \widehat{\mathcal{H}}_N\subset \widehat{\mathcal{H}}_{N+1}\subset \cdots \subset \mathcal{H}. \end{aligned}$$

We thus interpret \( \widehat{{\varvec{\Psi }}}^N \) as a Monte Carlo estimate of \( {\varvec{\Psi }}\). In this view, we are interested in studying the asymptotic behavior of \(\widehat{{\varvec{\Psi }}}^N\) as \(N\rightarrow \infty \), and, in particular, the convergence of \(\widehat{{\varvec{\Psi }}}^N\) to \({\varvec{\Psi }}\).

Notice that, although it spans a finite dimensional space, the frame \( \widehat{\varvec{\Psi }}^N \) consists of functions that are well-defined on the entire space \({\mathcal {X}}\). In particular, for any signal f in the reproducing kernel Hilbert space \( {\mathcal {H}}\), we can study the wavelet expansion

$$\begin{aligned} f \approx \sum _{j\le \tau } \sum _{k=1}^N \langle f, {\widehat{\psi }}_{j,k} \rangle _{\mathcal {H}}{\widehat{\psi }}_{j,k}. \end{aligned}$$
(23)

This series approximates f up to a resolution \(\tau \) and a sampling rate N. Our main result (Theorem 7.5) states that, cutting off the frequencies at a threshold \( \tau = \tau (N) \) and letting N go to infinity, the error of (23) goes to zero,

$$\begin{aligned} \Big \Vert f - \sum _{j\le \tau (N)} \sum _{k=1}^N \langle f, {\widehat{\psi }}_{j,k} \rangle _{\mathcal {H}}{\widehat{\psi }}_{j,k} \Big \Vert _{\mathcal {H}}\xrightarrow {{N\rightarrow \infty }} 0, \end{aligned}$$

at a rate that depends on the regularity of the signal f. In other words, the frame constructed on the sample space \(\{x_1,\ldots ,x_N\}\) is asymptotically resolving the signal defined on the space \({\mathcal {X}}\). This result will be derived as a finite-sample bound in high probability.

Deterministic discretization vs random sampling Discretization is a classical problem in frame theory, harmonic analysis and applied mathematics tout court. While the construction of reproducing representations may usefully exploit rich topological, algebraic and measure theoretical properties of a continuous parameter space, discretization is eventually required when it comes to numerical implementation. Starting from a continuous frame \( \{ \psi _a : a \in \mathcal{A}\} \) in a Hilbert space \({\mathcal {H}}\), frame discretization selects a countable subset of parameters \( \mathcal{A}' \subset \mathcal{A}\) so that the corresponding subfamily \( \{ \psi _a : a \in \mathcal{A}' \} \) preserves the frame property. This typically involves a deterioration of the frame bounds, which grows with the sparsity of \(\mathcal{A}'\).

A possible interpretation of our Monte Carlo wavelets is as a randomized approximate frame discretization. Random sampling may be useful when the topology of the parameter space is complex or unknown. On the other hand, our discrete frame is not a frame on the original space \({\mathcal {H}}\), but only on a finite dimensional approximation \(\widehat{{\mathcal {H}}}\) of \({\mathcal {H}}\). Notice though that our frame preserves the tightness, and the signal loss \( {\mathcal {H}}\setminus \widehat{{\mathcal {H}}} \) is asymptotically zero. Moreover, the numerical implementation of any discretized frame on \({\mathcal {H}}\) would still require truncation at finitely many terms, resulting in fact in a loss of the global frame property. Lastly, when the space is unknown and we can only access signals through finite samples, going beyond the given sampling resolution might per se not be significant, while our results characterize how the frame parameters may be chosen adaptively to the given sampling rate.

Numerical implementation The representation of \( \widehat{\psi }_{j,k} \) in Definition 5.1 is remarkably compact, but hardly suitable for computation. We next provide an implementable formula for our Monte Carlo wavelets, using the Mercer representation (15) along with the singular value decomposition (6). Let \( {\widehat{\mathrm {T}} }{\widehat{v}}_i = {\widehat{{\lambda }}}_i {\widehat{v}}_i \) be the eigendecomposition of \({\widehat{\mathrm {T}} }\). Then (15) reads as

$$\begin{aligned} {\widehat{\psi }}_{j,k} (x) = \sum _{i=1}^N G_j({\widehat{{\lambda }}}_i) \overline{{\widehat{v}}_i(x_k)} {\widehat{v}}_i(x) \qquad j\ge 0,\, k = 1,\ldots ,N, \end{aligned}$$

where the eigenpairs \( ({\widehat{{\lambda }}}_i,{\widehat{v}}_i) \) can be computed from the kernel matrix

$$\begin{aligned} {\mathbf {K}}[i,k] := K(x_i,x_k) \qquad i, k = 1,\ldots ,N. \end{aligned}$$
(24)

Indeed, we have \( {\widehat{\mathrm {T}} }= {\widehat{\mathrm {S}}}^* {\widehat{\mathrm {S}}}\) and \( N^{-1} {\mathbf {K}}= {\widehat{\mathrm {S}}}{\widehat{\mathrm {S}}}^* \), where \({\widehat{\mathrm {S}}}\) is the sampling operator

$$\begin{aligned} {\widehat{\mathrm {S}}}: \mathcal{H}\rightarrow {\mathbb {C}}^N, \qquad ({\widehat{\mathrm {S}}}f)[i] = f(x_i) \qquad i = 1,\ldots ,N, \end{aligned}$$
(25)

and \({\widehat{\mathrm {S}}}^*\) is the out-of-sample extension

$$\begin{aligned} {\widehat{\mathrm {S}}}^* : {\mathbb {C}}^N \rightarrow \mathcal{H}, \qquad ({\widehat{\mathrm {S}}}^*{\mathbf {u}}) (x) = \frac{1}{N} \sum _{\ell =1}^N K(x,x_\ell ) {\mathbf {u}}[\ell ] \qquad x \in {\mathcal {X}}. \end{aligned}$$
(26)

Thus, the eigenvalues \({\widehat{{\lambda }}}_i\) of \({\widehat{\mathrm {T}} }\) are exactly the eigenvalues of \( N^{-1}{\mathbf {K}}\). Moreover, in view of (6), the eigenfunctions \({\widehat{v}}_i\) can be obtained from the eigenvectors \({\widehat{{\mathbf {u}}}}_i\) of \( N^{-1}{\mathbf {K}}\) by

$$\begin{aligned} {\widehat{v}}_i = {\widehat{{\lambda }}}_i^{-1/2} {\widehat{\mathrm {S}}}^* {\widehat{{\mathbf {u}}}}_i = {\widehat{{\lambda }}}_i^{-1/2} \frac{1}{N} \sum _{\ell =1}^N {\widehat{{\mathbf {u}}}}_i[\ell ] K_{x_\ell }, \end{aligned}$$

which evaluated at \(x_k\) gives

$$\begin{aligned} {\widehat{v}}_i(x_k) = {\widehat{{\lambda }}}_i^{-1/2} \frac{1}{N} \sum _{\ell =1}^N K(x_k,x_\ell ) {\widehat{{\mathbf {u}}}}_i[\ell ] = {\widehat{{\lambda }}}_i^{-1/2} N^{-1} ({\mathbf {K}}{\widehat{{\mathbf {u}}}}_i)[k] = {\widehat{{\lambda }}}_i^{1/2} {\widehat{{\mathbf {u}}}}_i[k]. \end{aligned}$$

We therefore obtain the computable formula

$$\begin{aligned} {\widehat{\psi }}_{j,k} (x) = \frac{1}{N} \sum _{i,\ell =1}^N G_j({\widehat{{\lambda }}}_i) \overline{{\widehat{{\mathbf {u}}}}_i[k]} {\widehat{{\mathbf {u}}}}_i[\ell ] K(x,x_\ell ) \qquad j\ge 0,\, k = 1,\ldots ,N. \end{aligned}$$

As for the Monte Carlo wavelet transform of a signal \( f \in \mathcal{H}\), it is easy to see that

$$\begin{aligned} \langle f, {\widehat{\psi }}_{j,k} \rangle _\mathcal{H}= \big [ {\mathbf {U}}G_j(\Lambda ) {\mathbf {U}}^* {\widehat{\mathrm {S}}}f \big ][k], \end{aligned}$$

where \( N^{-1} {\mathbf {K}}= {\mathbf {U}}\Lambda {\mathbf {U}}^* \) is the eigendecomposition of \(N^{-1}{\mathbf {K}}\) in matrix form, and \( {\widehat{\mathrm {S}}}f = (f(x_1),\ldots ,f(x_N)) \) is the vector of samples (25).
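The following end-to-end sketch (our illustration: a Gaussian kernel on [0, 1] and heat-type filters stand in for the abstract K and \(G_j\); all parameters are demo choices) implements the formulas above, evaluates a wavelet off the sample, and checks the truncated reconstruction (23) at the sample points.

```python
import numpy as np

rng = np.random.default_rng(1)

def kernel(s, t, sigma=0.1):
    # Arbitrary Gaussian kernel on X = [0, 1], a demo stand-in for K.
    return np.exp(-np.subtract.outer(s, t) ** 2 / (2 * sigma**2))

N = 200
x = np.sort(rng.uniform(0, 1, N))              # i.i.d. samples x_1, ..., x_N
lam, U = np.linalg.eigh(kernel(x, x) / N)      # eigenpairs of N^{-1} K

def G_sq(j, l):
    # Heat-type filters (demo choice): the squares telescope so that
    # l * sum_{j <= tau} G_j(l)^2 = 1 - exp(-2^tau l) -> 1, cf. (9).
    l = np.maximum(l, 1e-15)
    return (1 - np.exp(-l)) / l if j == 0 else \
        (np.exp(-2.0 ** (j - 1) * l) - np.exp(-(2.0**j) * l)) / l

def psi(j, k, t):
    # Computable formula: psi_{j,k}(t) = (1/N) sum_{i,l} G_j(lam_i) u_i[k] u_i[l] K(t, x_l).
    return kernel(np.atleast_1d(t), x) @ (U @ (G_sq(j, lam) ** 0.5 * U[k])) / N

print(psi(3, 0, 0.37)[0])                      # off-sample wavelet evaluation

# Wavelet transform: <f, psi_{j,k}>_H = [U G_j(Lambda) U^T Sf][k], with Sf the
# vector of samples of f; here f = K_{0.5}, which belongs to the RKHS.
f = kernel(x, 0.5)
coef = lambda j: U @ (G_sq(j, lam) ** 0.5 * (U.T @ f))

# Truncated reconstruction (23) at the sample points: error decreases with tau.
Kx = kernel(x, x)
for tau in [2, 6, 12]:
    rec = sum(Kx @ (U * G_sq(j, lam) ** 0.5) @ (U.T @ coef(j)) / N
              for j in range(tau + 1))
    print(tau, float(np.max(np.abs(rec - f))))
```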

Computational considerations The bottleneck in the implementation of our Monte Carlo wavelets is the eigendecomposition of the kernel matrix, which in general requires \(\mathcal{O}(N^3)\) operations and is therefore impractical in typical large scale scenarios. This is in fact a common problem for virtually all spectral constructions of frames (see e.g. [28, 33, 38]). A possible solution is approximating the filters by low order polynomials, thus simplifying the functional calculus to repeated matrix-vector multiplications, which scale well in the case of sparse graphs [33]. While kernel matrices are typically dense, such an approach may still be useful for compactly supported kernels [58], although their real applicability is mostly limited to the low-dimensional regime. Besides sparsity, a more reasonable property to leverage is fast eigenvalue decay, which opens onto a variety of methods for truncated approximate SVD. Deterministic methods allow one to compute a rank-r approximation in \(\mathcal{O}(r N^2)\) [52], whereas randomized methods can further reduce the complexity to \(\mathcal{O}(N^2 \log r + r^2 N)\) [32, 40].
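For illustration, a basic two-pass randomized range finder in the spirit of [32] (a minimal sketch, not an optimized implementation; the oversampling value and the absence of power iterations are our simplifications) reduces the N × N eigendecomposition to a small projected problem:

```python
import numpy as np

def randomized_eigh(K, r, oversample=10, seed=0):
    # Randomized range finder: capture the top range of the symmetric PSD
    # matrix K with a random sketch, then solve a small projected problem.
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((K.shape[0], r + oversample))
    Q, _ = np.linalg.qr(K @ Omega)         # orthonormal basis for range(K Omega)
    lam, V = np.linalg.eigh(Q.T @ K @ Q)   # small (r+p) x (r+p) eigenproblem
    lam, V = lam[::-1][:r], V[:, ::-1][:, :r]
    return lam, Q @ V                      # approximate top-r eigenpairs of K

# Demo on a kernel matrix with fast eigenvalue decay (smooth Gaussian kernel).
X = np.sort(np.random.default_rng(2).uniform(0, 1, 500))
K = np.exp(-np.subtract.outer(X, X) ** 2 / 0.02)
lam_r, _ = randomized_eigh(K, r=20)
lam_full = np.linalg.eigvalsh(K)[::-1][:20]
print(float(np.max(np.abs(lam_r - lam_full) / lam_full[0])))  # small rel. error
```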

We also remark that the actual Monte Carlo approximation of a given signal is in principle a different problem from the computation of the frame itself, and as such may in some cases be more tractable. For example, for some specific filters as in Table 2, the computation of (23) boils down to the implementation of some regularized inversion or minimization procedure, for which several approaches based on sketching, random projections, hierarchical decompositions and early stopping may be profitably used [7, 9, 17, 47, 48, 49, 50, 59]. An efficient implementation of Monte Carlo wavelets is out of the scope of this paper and will be the subject of future work.

6 Comparison with Other Frame Constructions

The approach we adopt in Sect. 4 differs from the existing literature in several crucial aspects. We now give an overview of similarities and differences. As argued in Sect. 1, many techniques for the analysis of signals on non-Euclidean domains, such as manifolds and graphs, are based on spectral filtering of some suitable operator. There are, generally speaking, two distinct yet related perspectives.

A first type of methods builds frames for function spaces on compact differentiable manifolds associated with certain positive operators (predominantly the Laplace–Beltrami operator). In [13, 26], filter functions \(g_j\) are applied to the given operator \(\mathrm{L}\), giving \(g_j(\sqrt{\mathrm{L}})\) for \(j\ge 0\). One then needs to ensure that this defines an integral operator with a corresponding kernel \(\psi _j(\sqrt{\mathrm{L}})(x,y)\), which often poses a technical challenge, and relies on the relationship between the operator \(\mathrm{L}\) and local metric properties of the manifold. We avoid this by using a positive definite kernel from the start. The next step is to sample points \(\{x^j_k\}_{k=1}^{m_j}\) from the manifold for each scale j, in such a way that they form a \(\delta _j\)-net and satisfy a cubature rule for functions in the desired space. Frame elements are then defined by \(C_{j,k}\,\psi _j(\sqrt{\mathrm{L}})(x^j_k,\cdot )\), for some suitable weights \(C_{j,k}\). The resulting family of functions constitutes a non-tight frame on the entire function space. On the contrary, our sampled frames are Parseval frames on finite-dimensional subspaces. As we are going to show in the next section, in order to establish convergence we do not require a stringent selection of points; instead, we sample at random, which allows for a straightforward algorithmic approach, independent of the specific geometry of the underlying space.

In a different line of research [38, 42, 57], frames are built from an arbitrary orthonormal basis \(\{w_i\}_{i\ge 0}\) of a separable Hilbert space of functions defined on a quasi-metric measure space, together with a suitable sequence of positive reals \((l_i)_{i\ge 0}\). Based on these data, a kernel-like function \(K_H(x,\cdot ) := \sum _{i\ge 0} H(l_i) w_i(x)w_i\) is constructed. This mirrors the basis expansion of frame elements (15), but in our case a specific orthonormal basis is taken, namely the eigenbasis of the integral operator, and \((l_i)_{i\ge 0}\) are the corresponding eigenvalues. Due to the use of an arbitrary basis and sequence, additional effort (or a set of assumptions) is needed to ensure the desired properties, such as the decay of the approximation error as the number of eigenvalues resolved by the function H increases. Some of the results are similar to those in our paper, although estimation errors and sample bounds have not been established in this context.

On the other hand, starting from a discrete setting, graph signal processing considers a weight (or adjacency) matrix to define a certain graph operator \(\mathrm{L}\), such as the graph Laplacian [28, 33] or a diffusion operator [12]. The frame elements are then defined in the spectral domain as \(\psi _{j,x} := g_j( \mathrm{L})\delta _x\), where g is an admissible wavelet kernel, j a scale parameter, and \(\delta _x\) the indicator function of a vertex x. This is conceptually similar to (14), though there are also several distinctions. First, following [28], our construction results in Parseval frames. This reduces the computational effort, since Parseval frames are canonically self-dual, and thus signal reconstruction does not require the computation of a dual frame. Moreover, to localize the frame in space we use the continuous kernel function \(K_x\) instead of the impulse \(\delta _x\). Since in our setting the kernel K is used both to define the underlying integral operator and to localize the frame elements, we can use the theory of RKHS to establish a connection between continuous and discrete frames, as we will show in Sect. 7. In typical constructions of frames on graphs, considerably more care is required to derive analogous convergence results. A schematic sketch of the graph-side construction is given below.
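For concreteness, the spectral definition \(\psi_{j,x} = g_j(\mathrm{L})\delta_x\) can be realized as follows (a minimal sketch via full eigendecomposition; the function name and the callable signature of g are ours):

```python
import numpy as np

def graph_wavelet(L, g, j, x):
    """Spectral graph wavelet psi_{j,x} = g_j(L) delta_x on a graph with
    (dense, symmetric) Laplacian L; g(j, lam) evaluates the filter g_j."""
    evals, V = np.linalg.eigh(L)
    delta = np.zeros(L.shape[0])
    delta[x] = 1.0                       # impulse at vertex x
    return V @ (g(j, evals) * (V.T @ delta))
```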

7 Stability of Monte Carlo Wavelets

In this section we study the relationship between continuous and discrete frames, regarding the latter as Monte Carlo estimates of the former. We begin by restricting our attention to \(\mathcal{H}\), and we will then extend the analysis to \(L^{2}({\mathcal{X},\rho })\). Let

$$\begin{aligned} {\mathrm {T}}_j := \int _{\mathcal{X}} \psi _{j,x} \otimes \psi _{j,x} d\rho (x), \qquad {\widehat{\mathrm {T}} }_j :=\frac{1}{N} \sum _{k=1}^N \widehat{\psi }_{j,k} \otimes \widehat{\psi }_{j,k} \end{aligned}$$

be the frame operator associated with the scale j and its empirical counterpart. By Proposition 4.8, we have

$$\begin{aligned} \mathsf {Id}_\mathcal{H}= \sum _{j\ge 0} {\mathrm {T}}_j, \qquad \mathsf {Id}_{\widehat{\mathcal{H}}_N} = \sum _{j\ge 0} {\widehat{\mathrm {T}} }_j. \end{aligned}$$

For \(f\in \mathcal{H}\), given a threshold scale \(\tau \in {\mathbb {N}}\) and a sample size N, we let

$$\begin{aligned} \widehat{f}_{\tau ,N}:=\sum _{j=0}^\tau {\widehat{\mathrm {T}} }_j f \end{aligned}$$
(27)

be the empirical approximation of f using the first \(\tau \) scales of the frame \(\widehat{{\varvec{\Psi }}}^N\). The reconstruction error of \(\widehat{f}_{\tau ,N}\) can be decomposed into

$$\begin{aligned} \left\| {f- \widehat{f}_{\tau ,N}}\right\| _\mathcal{H}\le \Big \Vert {\sum _{j>\tau } {\mathrm {T}}_j f}\Big \Vert _\mathcal{H}+ \Big \Vert {\sum _{j=0}^\tau \left( {\mathrm {T}}_j-{\widehat{\mathrm {T}} }_j\right) f}\Big \Vert _\mathcal{H}. \end{aligned}$$
(28)

The first term is the approximation error, arising from the truncation of the resolution of the identity. The second term is the estimation error, which stems from estimating the measure by means of empirical samples. Next, we derive quantitative bounds for both terms, and then balance the resolution \(\tau \) against the sample size N to obtain our convergence result.
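As a concrete illustration of (27), since the empirical operator \({\widehat{\mathrm {T}} }\) and the matrix \(\frac{1}{N}\mathbf{K}\) share their nonzero spectrum (see (24)–(26)), the empirical approximation can be evaluated at the sample points by spectral calculus on the kernel matrix. A minimal sketch for the Tikhonov filter of Table 2 (the function name is ours):

```python
import numpy as np

def truncated_reconstruction(K, y, tau):
    """Evaluate f_hat_{tau,N} of (27) at the sample points, assuming the
    Tikhonov filter g_tau(lam) = (lam + 1/tau)^{-1}; K is the N x N kernel
    matrix and y = (f(x_1), ..., f(x_N))."""
    N = K.shape[0]
    evals, V = np.linalg.eigh(K / N)       # spectrum of the empirical operator
    evals = np.clip(evals, 0.0, None)      # guard against round-off negatives
    filt = evals / (evals + 1.0 / tau)     # lam * g_tau(lam)
    return V @ (filt * (V.T @ y))
```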

Approximation error Note that Proposition 4.7 already implies

$$\begin{aligned} \Vert {\sum _{j>\tau } {\mathrm {T}}_j f}\Vert _\mathcal{H}\xrightarrow {\tau \rightarrow \infty } 0, \end{aligned}$$

being the tail of a convergent series. To quantify the speed of convergence with respect to \(\tau \), approximation theory suggests that f must obey some notion of regularity. In the following we assume smoothness of Sobolev type (see [20] and Sect. 8), also known in statistical learning theory as the source condition (see [8]):

$$\begin{aligned} f = {\mathrm {T}}^\alpha h \text { for some } h\in \mathcal{H}\text { and } \alpha >0. \end{aligned}$$

Proposition 7.1

Assume that \(g_j\) has qualification \( \nu \in (0,\infty ] \) and \(f\in {\text {range}}({\mathrm {T}}^\alpha )\) for some \( \alpha > 0 \). Let \(\beta := \min \{\nu , \alpha \}\). Then

$$\begin{aligned} \Bigg \Vert {\sum _{j>\tau } {\mathrm {T}}_j f}\Bigg \Vert _\mathcal{H}\lesssim \left\| {{\mathrm {T}}^{-\alpha } f}\right\| _\mathcal{H}\kappa ^{2(\alpha -\beta )}\tau ^{-\beta }. \end{aligned}$$

Proof

By (21) we have \( \sum _{j>\tau } {\mathrm {T}}_j = \mathsf {Id}_\mathcal{H}- {\mathrm {T}}g_\tau ({\mathrm {T}}) \). Hence,

$$\begin{aligned} \Big \Vert {\sum _{j>\tau } {\mathrm {T}}_j f}\Big \Vert _\mathcal{H}^2&= \sum _{i\in \mathcal{I}_\rho } | 1-\lambda _i g_\tau (\lambda _i)|^2 \left| {\left<{f},{v_i}\right>_\mathcal{H}}\right| ^2 \\&= \sum _{i\in \mathcal{I}_\rho } \left( \lambda _i^{\beta } | 1-\lambda _i g_\tau (\lambda _i) | \right) ^2 \left| {\left<{{\mathrm {T}}^{-\beta } f},{v_i}\right>_\mathcal{H}}\right| ^2 \\&\le \biggl ( \sup _{i\in \mathcal{I}_\rho } \lambda _i^{\beta } | 1-\lambda _i g_\tau (\lambda _i)| \biggr )^2 \sum _{i\in \mathcal{I}_\rho } \left| {\left<{{\mathrm {T}}^{-\beta } f},{v_i}\right>_\mathcal{H}}\right| ^2 \\&\lesssim \tau ^{-2\beta } \kappa ^{4(\alpha -\beta )} \left\| {{\mathrm {T}}^{-\alpha } f}\right\| _\mathcal{H}^2, \end{aligned}$$

where the last step uses the qualification of \(g_\tau \), which gives \( \sup _{i} \lambda _i^{\beta } | 1-\lambda _i g_\tau (\lambda _i)| \lesssim \tau ^{-\beta } \) since \( \beta \le \nu \), together with \( \Vert {{\mathrm {T}}^{-\beta } f}\Vert _\mathcal{H}= \Vert {{\mathrm {T}}^{\alpha -\beta } {\mathrm {T}}^{-\alpha } f}\Vert _\mathcal{H}\le \kappa ^{2(\alpha -\beta )} \Vert {{\mathrm {T}}^{-\alpha } f}\Vert _\mathcal{H}\).

\(\square \)

Estimation error To bound the second term in (28), we rely on concentration results for covariance operators [46].

Proposition 7.2

Assume that \(\lambda \mapsto \lambda g_\tau (\lambda )\) is Lipschitz continuous on \([0,\kappa ^2]\) with Lipschitz constant \(L(\tau )\). Then, for every \(f\in \mathcal{H}\) and \( t > 0 \), with probability at least \(1-2e^{-t}\) we have

$$\begin{aligned} \Big \Vert {\sum _{j=0}^\tau \left( {\mathrm {T}}_j-{\widehat{\mathrm {T}} }_j\right) f}\Big \Vert _\mathcal{H}\lesssim \left\| {f}\right\| _\mathcal{H}\kappa ^2\sqrt{t} L(\tau ) N^{-1/2}. \end{aligned}$$

Proof

Using (21) and Lemma A.2 we have

$$\begin{aligned} \Big \Vert {\sum _{j=0}^\tau \left( {\mathrm {T}}_j-{\widehat{\mathrm {T}} }_j\right) f}\Big \Vert _\mathcal{H}&= \Big \Vert \left( {\mathrm {T}}g_\tau ({\mathrm {T}}) - {\widehat{\mathrm {T}} }g_\tau ({\widehat{\mathrm {T}} })\right) f \Big \Vert _\mathcal{H}\\&\le \Big \Vert {\mathrm {T}}g_\tau ({\mathrm {T}}) - {\widehat{\mathrm {T}} }g_\tau ({\widehat{\mathrm {T}} }) \Big \Vert _{{\text {HS}}}\left\| {f}\right\| _\mathcal{H}\\&\le L(\tau )\big \Vert {{\mathrm {T}}-{\widehat{\mathrm {T}} }}\big \Vert _{{\text {HS}}} \left\| {f}\right\| _\mathcal{H}. \end{aligned}$$

Bounding \(\Vert {{\mathrm {T}}-{\widehat{\mathrm {T}} }}\Vert _{{\text {HS}}} \) with the concentration estimate [46, Theorem 7] we obtain

$$\begin{aligned} \big \Vert {{\mathrm {T}}-{\widehat{\mathrm {T}} }}\big \Vert _{{\text {HS}}} \lesssim \kappa ^2 \sqrt{t} N^{-1/2} \end{aligned}$$

with probability no lower than \(1-2e^{-t}\). \(\square \)

All examples of filters given in Sect. 4.1 satisfy the Lipschitz condition required in Proposition 7.2.

Lemma 7.3

Let \( g_j \) be a spectral function from Table 2. Then the function \( {\lambda }\mapsto {\lambda }g_\tau ({\lambda }) \) is Lipschitz continuous on \([0,{\kappa }^2]\), with Lipschitz constant \( L(\tau ) \lesssim \tau \) for the first four spectral functions, and \( L(\tau ) \lesssim \tau ^2 \) for the last two. Moreover, let \( g_j \) be defined as in Example 4.5, with \( |g'| \le B \). Then the function \( {\lambda }\mapsto {\lambda }g_\tau ({\lambda }) \) is Lipschitz continuous on \([0,{\kappa }^2]\), with Lipschitz constant \( L(\tau ) \le B 2^\tau \).

Proof

For the first four spectral functions of Table 2, the claim follows by bounding the explicit derivative of \( \lambda \mapsto \lambda g_\tau (\lambda )\); for the last two, from an application of Markov brothers’ inequality (see [43, Supplemental, Lemma 1]). For filters of Example 4.5, we differentiate \( {\lambda }\mapsto g(2^{\tau }{\lambda }) \) and use \( |g'| \le B \). \(\square \)
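For instance, for the Tikhonov filter \( g_\tau ({\lambda }) = ({\lambda }+ \tau ^{-1})^{-1} \) we have \( \frac{d}{d{\lambda }}\, {\lambda }g_\tau ({\lambda }) = \tau ^{-1}({\lambda }+ \tau ^{-1})^{-2} \le \tau \), with the maximum attained at \( {\lambda }= 0 \), so that \( L(\tau ) = \tau \).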

Remark 7.4

In this paper we are not interested in tracking the constants. We rely on the Hilbert–Schmidt norm since it provides both a simple bound on \(\big \Vert {{\mathrm {T}}-{\widehat{\mathrm {T}} }}\big \Vert _{{\text {HS}}} \) and, by the Lipschitz assumption, the stability bound \(\big \Vert {\mathrm {T}}g_\tau ({\mathrm {T}}) - {\widehat{\mathrm {T}} }g_\tau ({\widehat{\mathrm {T}} }) \big \Vert _{{\text {HS}}} \le L(\tau )\big \Vert {{\mathrm {T}}-{\widehat{\mathrm {T}} }}\big \Vert _{{\text {HS}}} \). Our result can be improved by using the sharper bound

$$\begin{aligned} \big \Vert {{\mathrm {T}}-{\widehat{\mathrm {T}} }}\big \Vert \le C \big \Vert {\mathrm {T}}\big \Vert \max \Big \{ \sqrt{\frac{{r({\mathrm {T}})}}{N}}, \frac{r({\mathrm {T}})}{N}, \sqrt{\frac{t}{N}}, \frac{t}{N}\Big \}, \end{aligned}$$

where \(r({\mathrm {T}})=\frac{{\text {trace}}({\mathrm {T}})}{\left\| {{\mathrm {T}}}\right\| }\) (see Theorem 9 in [36] and the techniques in the proof of Theorem 3.4 in [6] to bound \(\big \Vert {\mathrm {T}}g_\tau ({\mathrm {T}}) - {\widehat{\mathrm {T}} }g_\tau ({\widehat{\mathrm {T}} }) \big \Vert \)).

Reconstruction error and convergence Combining Propositions 7.1 and 7.2, we can finally prove the convergence of our Monte Carlo wavelets. In order to balance the approximation and estimation errors, we need to tune the resolution \(\tau \) to the number of samples N and the smoothness \(\alpha \) of the signal, insofar as the qualification \(\nu \) of the filter allows.

Theorem 7.5

Assume that \(g_\tau \) has qualification \( \nu \in (0,\infty ] \), \(f\in {\text {range}}({\mathrm {T}}^\alpha )\) for some \( \alpha > 0 \), and \(\lambda \mapsto \lambda g_\tau (\lambda )\) is Lipschitz continuous on \([0,\kappa ^2]\) with Lipschitz constant \(L(\tau )\lesssim \tau ^p\), \( p \ge 1 \). Let \(\beta :=\min \{\alpha ,\nu \}\) and set

$$\begin{aligned} \tau := \lceil N^{\frac{1}{2(\beta +p)}}\rceil . \end{aligned}$$

Then, for every \( t > 0 \), with probability at least \(1-2e^{-t}\) we have

$$\begin{aligned} \big \Vert {f-\widehat{f}_{\tau ,N}}\big \Vert _\mathcal{H}\lesssim \big \Vert {{\mathrm {T}}^{-\alpha } f}\big \Vert _\mathcal{H}\big ({\kappa ^{2(\alpha -\beta )}+\kappa ^{2\alpha +2}\sqrt{t}}\big ) N^{-\frac{\beta }{2(\beta +p)}}. \end{aligned}$$

Proof

Starting from the decomposition (28), we bound the two terms by Propositions 7.1 and 7.2. The approximation error is \(\mathcal{O}(\tau ^{-\beta })\), while the estimation error is \(\mathcal{O}(\tau ^pN^{-1/2})\). We thus choose \(\tau \) to balance them out, and collect the constants. \(\square \)
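The tuning rule of Theorem 7.5 is straightforward to evaluate; the following minimal sketch (function name ours) returns the balanced resolution and the resulting rate exponent:

```python
import numpy as np

def balanced_resolution(N, alpha, nu, p):
    """Resolution tau and rate exponent prescribed by Theorem 7.5."""
    beta = min(alpha, nu)
    tau = int(np.ceil(N ** (1.0 / (2.0 * (beta + p)))))
    return tau, -beta / (2.0 * (beta + p))   # error decays as N ** exponent

# e.g. Tikhonov (p = 1) and alpha = nu = 1: tau = ceil(N^{1/4}), rate N^{-1/4}
```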

If \({\text {supp}}\rho \ne \mathcal{X}\), we have instead a frame on \(\mathcal{H}_\rho \), and the corresponding resolution of the identity \(\mathsf {Id}_{\mathcal{H}_\rho } = \sum _{j\ge 0} {\mathrm {T}}_j\). The reconstruction error would thus include an additional bias term:

$$\begin{aligned} \big \Vert {f- \widehat{f}_{\tau ,N}}\big \Vert _\mathcal{H}\le \left\| {\mathrm{P}_{{\text {ker}}{\mathrm {S}}} f}\right\| _\mathcal{H}+ \big \Vert {\sum _{j>\tau } {\mathrm {T}}_j f}\big \Vert _\mathcal{H}+ \Big \Vert \sum _{j=0}^\tau \big ({{\mathrm {T}}_j-{\widehat{\mathrm {T}} }_j}\big )f\Big \Vert _\mathcal{H}. \end{aligned}$$

Classical spectral functions from Table 2 satisfy the assumptions of Theorem 7.5. We report the explicit rates in Table 3. A convergence result for filters of Example 4.5 will be provided at the end of Sect. 8.

Table 3 Error rates for signals \( f \in {\text {range}}({\mathrm {T}}^\alpha ) \) and several spectral regularizers

Convergence in \(L^{2}({\mathcal{X},\rho })\) Error rates in \(L^{2}({\mathcal{X},\rho })\) can be extracted using the isometry between \(\overline{\mathcal{H}}_\rho \) and \(\mathcal{H}_\rho \). Suppose again for simplicity that \( {\text {supp}}\rho = \mathcal{X}\). In view of (8), for \(f\in \mathcal{H}_\rho =\mathcal{H}\) we have

$$\begin{aligned} \big \Vert {f-\widehat{f}_{\tau ,N}}\big \Vert _{\rho }&= \big \Vert {\sqrt{{\mathrm {T}}}(f-\widehat{f}_{\tau ,N})}\big \Vert _{\mathcal{H}}. \end{aligned}$$

Decomposing the error into its approximation and estimation components, we can repeat the analysis of the proof of Theorem 7.5. The estimation bound simply picks up an additional factor of \(\kappa \). Assuming \( f \in {\mathrm {T}}^\alpha \mathcal{H}\) with \(\alpha >0\), for the approximation term we have

$$\begin{aligned} \Big \Vert {\sqrt{{\mathrm {T}}} \sum _{j>\tau }{\mathrm {T}}_j f}\Big \Vert _\mathcal{H}&\le \sup _{i\in \mathcal{I}_\rho } \big ({\lambda _i^{\beta }\,| 1-\lambda _i g_\tau (\lambda _i)|}\big ) \Big ({\sum _{i\in \mathcal{I}_\rho } \big |{\big \langle {\mathrm {T}}^{1/2-\beta } f,v_i\big \rangle _\mathcal{H}}\big |^2}\Big )^{1/2}\\&\lesssim \big \Vert {{\mathrm {T}}^{-\alpha } f}\big \Vert _\mathcal{H}\kappa ^{2(\alpha -\beta )+1}\tau ^{-\beta }, \end{aligned}$$

with \(\beta :=\min \{\alpha +1/2,\nu \}\). Therefore, the approximation rate improves by 1/2 (qualification permitting). Combining everything, we obtain the following bound in \(L^{2}({\mathcal{X},\rho })\).

Corollary 7.6

Assume that \(g_\tau \) has qualification \( \nu \in (0,\infty ] \), \(f\in {\text {range}}({\mathrm {T}}^\alpha )\) for some \( \alpha > 0 \), and \(\lambda \mapsto \lambda g_\tau (\lambda )\) is Lipschitz continuous on \([0,\kappa ^2]\) with Lipschitz constant \(L(\tau )\lesssim \tau ^p\), \( p \ge 1 \). Let \(\beta :=\min \{\alpha +1/2,\nu \}\) and set

$$\begin{aligned} \tau := \lceil N^{\frac{1}{2(\beta +p)}}\rceil . \end{aligned}$$

Then, for every \( t > 0 \), with probability at least \(1-2e^{-t}\) we have

$$\begin{aligned} \big \Vert {f-\widehat{f}_{\tau ,N}}\big \Vert _{\rho }\lesssim \big \Vert {{\mathrm {T}}^{-\alpha } f}\big \Vert _\mathcal{H}\left( \kappa ^{2(\alpha -\beta )+1}+\kappa ^{2\alpha +3}\sqrt{t}\right) N^{-\frac{\beta }{2(\beta +p)}}. \end{aligned}$$

See Table 3 for specific rates regarding spectral functions from Table 2.

Monte Carlo wavelet approximation as noiseless kernel ridge regression We conclude this section with an observation that draws a link between Monte Carlo wavelets and regression analysis. Let \( \widehat{f}_{\tau ,N} \) be the Monte Carlo wavelet approximation (27) of \( f \in \mathcal{H}\) at resolution \(\tau \), given samples \( x_1,\ldots ,x_N \). Then

$$\begin{aligned} \widehat{f}_{\tau ,N} = \sum _{j=0}^\tau {G_j}({\widehat{\mathrm {T}} })^2 {\widehat{\mathrm {T}} }f = g_\tau ({\widehat{\mathrm {T}} }) {\widehat{\mathrm {T}} }f. \end{aligned}$$

With the choice of the Tikhonov filter \( g_\tau ({\lambda }) = ( {\lambda }+ \tau ^{-1} )^{-1} \) (Table 2), recalling (24), (25) and (26), and defining

$$\begin{aligned} {\mathbf {y}}= [ f(x_1), \ldots , f(x_N) ]^\top , \qquad {{\varvec{\alpha }}}= \Big ({\mathbf {K}}+ \tfrac{N}{\tau } {\mathbf {I}}\Big )^{-1} {\mathbf {y}}, \end{aligned}$$

we have

$$\begin{aligned} \widehat{f}_{\tau ,N}&= \big ({\widehat{\mathrm {T}} }+ \tfrac{1}{\tau } \mathsf {Id}_\mathcal{H}\big )^{-1} {\widehat{\mathrm {T}} }f = \Big ({\widehat{\mathrm {S}}}^*{\widehat{\mathrm {S}}}+ \tfrac{1}{\tau } \mathsf {Id}_\mathcal{H}\Big )^{-1} {\widehat{\mathrm {S}}}^*{\widehat{\mathrm {S}}}f = \Big ({\widehat{\mathrm {S}}}^*{\widehat{\mathrm {S}}}+ \tfrac{1}{\tau } \mathsf {Id}_\mathcal{H}\Big )^{-1} {\widehat{\mathrm {S}}}^* {\mathbf {y}}\\&= {\widehat{\mathrm {S}}}^* \Big ({\widehat{\mathrm {S}}}\,{\widehat{\mathrm {S}}}^* + \tfrac{1}{\tau } {\mathbf {I}}\Big )^{-1} {\mathbf {y}}= \frac{1}{N} \sum _{i=1}^N K(\cdot ,x_i) \Big [ \Big (\tfrac{1}{N}{\mathbf {K}}+ \tfrac{1}{\tau } {\mathbf {I}}\Big )^{-1} {\mathbf {y}}\Big ][i] \\&= \sum _{i=1}^N K(\cdot ,x_i) \Big [ \Big ({\mathbf {K}}+ \tfrac{N}{\tau } {\mathbf {I}}\Big )^{-1} {\mathbf {y}}\Big ][i] = \sum _{i=1}^N {{\varvec{\alpha }}}[i] K(\cdot ,x_i). \end{aligned}$$

This is the (unique) solution to the kernel regularized least squares problem

$$\begin{aligned} \min _{\widehat{f} \in \mathcal{H}} \frac{1}{N} \sum _{i=1}^N | y_i - \widehat{f}(x_i) |^2 + {\lambda }\Vert \widehat{f} \Vert _\mathcal{H}^2, \end{aligned}$$
(29)

where \( y_i = {\mathbf {y}}[i] \) and \( {\lambda }= \tau ^{-1} \). Therefore, \( \widehat{f}_{\tau ,N} \) is the kernel ridge estimator for the noiseless regression problem

$$\begin{aligned} y_i = f(x_i) \qquad i = 1,\ldots ,N, \end{aligned}$$

and the squared reconstruction error \( \Vert f - \widehat{f}_{\tau ,N} \Vert _\rho ^2 \) is the generalization error of \(\widehat{f}_{\tau ,N}\).
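This identity is easy to check numerically. A minimal sketch (kernel, data and parameter values are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 1))
K = np.exp(-0.5 * (X - X.T) ** 2)        # Gaussian kernel matrix
y = np.sin(3.0 * X[:, 0])                # noiseless samples y_i = f(x_i)
N, tau = K.shape[0], 10.0

# spectral route: f_hat at the samples is (1/N) K g_tau((1/N) K) y
evals, V = np.linalg.eigh(K / N)
f_spec = V @ ((evals / (evals + 1.0 / tau)) * (V.T @ y))

# ridge route: alpha = (K + (N/tau) I)^{-1} y, then f_hat(x_k) = (K alpha)_k
coef = np.linalg.solve(K + (N / tau) * np.eye(N), y)
f_ridge = K @ coef

print(np.allclose(f_spec, f_ridge))      # True, up to round-off
```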

Contrasting this with the minimax-optimal rate for kernel ridge regression [8] shows that the rate in Table 3 is suboptimal for Tikhonov regularization, and presumably for all other regularizers. This is to be expected, given the crude Lipschitz bound used in Proposition 7.2. The scope of the present work is to establish a first convergence result for randomly sampled frames, rather than to identify optimal convergence rates. Refinements of our bounds will be the object of future investigation (see also Remark 7.4).

8 Sobolev and Besov Spaces in RKHS

The convergence rates of the frame reconstruction error in Theorem 7.5 depend on the approximation rates in Proposition 7.1, hence on the regularity of the original signal f, as quantified by the condition \(f\in {\text {range}}({\mathrm {T}}^\alpha )\). Thinking of \({\mathrm {T}}\) as the inverse square root of the Laplacian allows us to interpret \({\text {range}}({\mathrm {T}}^\alpha )\) as a Sobolev space. The theory of smoothness function spaces [56] plays a critical role in harmonic analysis, and also serves as a basis for the definition of statistical priors in learning theory [5]. In this section we examine general notions of regularity and their effect on the reconstruction error. Many of the results on Besov spaces reported here are well known [20], but we include them to keep the paper self-contained and to adapt them to our setting and notation. In particular, as already observed in Sect. 2, it should be borne in mind that the spectrum of the integral operator \({\mathrm {T}}\) behaves inversely to that of a Laplace operator; all spectral definitions of the generalized Besov spaces must therefore take this into account in order to remain consistent with their classical counterparts. As in the previous section, we assume \({\text {supp}}(\rho ) = \mathcal{X}\).

Sobolev spaces as domains of powers of a positive operator By virtue of the spectral theorem, for every \(\alpha >0\), \({\mathrm {T}}^\alpha \) is a positive, bounded, injective operator on \(\mathcal{H}\), with \( \sigma ({\mathrm {T}}^\alpha ) \subset (0,\kappa ^{2\alpha }] \). Thus, \({\mathrm {T}}^{-\alpha }\) is a positive, closed, densely defined, injective operator with \( \sigma ({\mathrm {T}}^{-\alpha }) \subset [\kappa ^{-2\alpha },\infty ) \). We make the following definition.

Definition 8.1

(Sobolev spaces) For \( \alpha > 0 \), we define the Sobolev space \(\mathcal{H}^\alpha \) by

$$\begin{aligned} \mathcal{H}^\alpha := {\text {dom}}({\mathrm {T}}^{-\alpha }) = {\text {range}}({\mathrm {T}}^\alpha ), \end{aligned}$$

equipped with the norm

$$\begin{aligned} \left\| {v}\right\| _{\mathcal{H}^\alpha } := \left\| {{\mathrm {T}}^{-\alpha } v}\right\| _\mathcal{H}. \end{aligned}$$

\(\mathcal{H}^\alpha \) is a Hilbert space. Moreover, we have

$$\begin{aligned} \mathcal{H}^\alpha = \Big \{f \in \mathcal{H}\,:\, \sum _{i\in \mathcal{I}_\rho } \lambda _i^{-2\alpha } \left| {\left<{f},{v_i}\right>_\mathcal{H}}\right| ^2 < \infty \Big \}, \end{aligned}$$

which expresses \(\mathcal{H}^\alpha \) in terms of the speed of decay of the Fourier coefficients, thus generalizing the standard Sobolev spaces \( H^\alpha = W^{\alpha ,2} \). Theorem 7.5 establishes the convergence of Monte Carlo wavelets for signals in the class \(\mathcal{H}^\alpha \).

Besov spaces as approximation spaces Besov spaces on Euclidean domains are traditionally defined by the decay of the modulus of continuity. A characterization that is best suited to generalize to arbitrary domains, and to which we also adhere, is through approximation and interpolation spaces [20, 45, 56]. We begin with the approximation perspective by defining a scale of Paley–Wiener spaces.

Definition 8.2

(Paley–Wiener spaces) For \( \omega > 0 \), the Paley–Wiener space \(\mathbf {PW}(\omega )\) is defined by

$$\begin{aligned} \mathbf {PW}(\omega ) := \left\{ f \in \mathcal{H}\,: \left<{f},{v_i}\right>_\mathcal{H}= 0 \text { for }\lambda _i < \omega ^{-1} \right\} = \overline{{{\,\mathrm{span}\,}}} \left\{ v_i : \lambda _i \ge \omega ^{-1}\right\} . \end{aligned}$$

The associated approximation error for \(f\in \mathcal{H}\) is

$$\begin{aligned} \mathcal{E}(f,\omega ) := \inf _{g\in \mathbf {PW}(\omega )} \left\| {f - g}\right\| _\mathcal{H}= \left\| {\mathrm{P}_{\mathbf {PW}(\omega )^\perp } f}\right\| _\mathcal{H}= \Big ({\sum _{\lambda _i<\omega ^{-1}} \left| {\left<{f},{v_i}\right>_\mathcal{H}}\right| ^2}\Big )^{1/2}. \end{aligned}$$

The space \( \mathbf {PW}(\omega ) \) is a closed subspace of \(\mathcal{H}\), and \( \bigcup _{\omega >0} \mathbf {PW}(\omega ) \) is dense in \(\mathcal{H}\). Note that \( \mathcal{E}(f,\omega )\xrightarrow {\omega \rightarrow 0}\left\| {f}\right\| _\mathcal{H}\) and \( \mathcal{E}(f,\omega )\xrightarrow {\omega \rightarrow \infty }0\). Approximation spaces classify functions in \(\mathcal{H}\) according to the rate of decay of their approximation error.

Definition 8.3

(Besov spaces) For \( s > 0 \) and \( q \in [1,\infty ) \), we define the Besov space \( \mathcal{B}^s_q \) as the approximation space

$$\begin{aligned} \mathcal{B}^s_q:= \left\{ f \in \mathcal{H}\,: \left( \int _0^\infty (\omega ^{s} \mathcal{E}(f,\omega ))^q \frac{d\omega }{\omega }\right) ^{1/q} < \infty \right\} , \end{aligned}$$

equipped with the norm

$$\begin{aligned} \left\| {f}\right\| _{\mathcal{B}^s_q} := \left\| {f}\right\| _\mathcal{H}+ \left( \int _0^\infty (\omega ^{s} \mathcal{E}(f,\omega ))^q \frac{d\omega }{\omega }\right) ^{1/q}. \end{aligned}$$
(30)

The space \( \mathcal{B}^s_\infty \) is defined with the usual adjustment.

Discretizing the integral in (30), we obtain the equivalent norm

$$\begin{aligned} \left\| {f}\right\| _\mathcal{H}+ \Big ( \sum _{j\ge 0} \left( 2^{j s} \mathcal{E}(f,2^j)\right) ^q \Big )^{1/q} \asymp \left\| {f}\right\| _{\mathcal{B}^s_q}. \end{aligned}$$
(31)

In particular, a function f belongs to \( \mathcal{B}^s_q \) if and only if the sequence \( \left( 2^{j s} \mathcal{E}(f,2^j)\right) _{j\ge 0} \) belongs to \( \ell ^q \). It is easy to see that the scale of spaces \(\mathcal{B}^s_q\) obeys the following lexicographical order [45, Proposition 3]:

$$\begin{aligned}&\mathcal{B}^{s}_q \supset \mathcal{B}_p^{t} \quad \text {for } s< t, \nonumber \\&\mathcal{B}^s_q\subset \mathcal{B}^s_p \quad \text {for } q < p. \end{aligned}$$
(32)
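For concreteness, the discretized norm (31) can be computed directly from the spectral data of \({\mathrm {T}}\); a minimal sketch (function name ours; the dyadic sum is truncated at j_max):

```python
import numpy as np

def besov_seminorm(evals, coeffs, s, q, j_max=50):
    """Dyadic Besov seminorm of (31); evals are the eigenvalues lambda_i
    of T and coeffs the Fourier coefficients <f, v_i>_H."""
    total = 0.0
    for j in range(j_max + 1):
        # E(f, 2^j)^2 collects |<f, v_i>|^2 over lambda_i < 2^{-j}
        E = np.sqrt(np.sum(coeffs[evals < 2.0 ** (-j)] ** 2))
        total += (2.0 ** (j * s) * E) ** q
    return total ** (1.0 / q)
```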

Besov spaces as interpolation spaces The Sobolev space \( \mathcal{H}^\alpha \) is continuously embedded into \(\mathcal{B}^s_q \) for every \( \alpha > s \). Indeed, for \( f \in \mathcal{H}^\alpha \) we have the Jackson-type inequality \(\mathcal{E}(f,\omega ) \le \omega ^{-\alpha } \Vert f \Vert _{\mathcal{H}^\alpha } \), hence

$$\begin{aligned} \sum _{j\ge 0} (2^{j s} \mathcal{E}(f,2^j))^q \le \left\| {f}\right\| _{\mathcal{H}^\alpha }^q \sum _{j\ge 0} 2^{-j q(\alpha -s)} < \infty . \end{aligned}$$

Furthermore, \(\mathcal{B}^s_q\) interpolates between \(\mathcal{H}^\alpha \) and \(\mathcal{H}\).

Definition 8.4

(Interpolation spaces) For quasi-normed spaces \({\mathbf {E}}\) and \({\mathbf {F}}\), \( \theta \in (0,1) \) and \(q \in (0,\infty ) \), the quasi-normed interpolation space \(\left( {\mathbf {E}},{\mathbf {F}}\right) _{\theta , q}\) is defined by

$$\begin{aligned} \left( {\mathbf {E}},{\mathbf {F}}\right) _{\theta , q} := \left\{ f\in {\mathbf {E}}+{\mathbf {F}}\,: \int _0^\infty \left( t^{-\theta } \mathcal{K}(f,t)\right) ^q \frac{dt}{t} <\infty \right\} , \end{aligned}$$

where \(\mathcal{K}(f,t)\) is Peetre’s K-functional

$$\begin{aligned} \mathcal{K}(f,t) := \inf _{\begin{array}{c} f_0 + f_1=f \\ f_0 \in {\mathbf {E}}, f_1 \in {\mathbf {F}} \end{array}} \left\| {f_0}\right\| _{{\mathbf {E}}} + t\left\| {f_1}\right\| _{{\mathbf {F}}}. \end{aligned}$$

The space \( \left( {\mathbf {E}},{\mathbf {F}}\right) _{\theta , \infty } \) is defined with the usual adjustment.

Standard interpolation theory [20, 56] gives

$$\begin{aligned} \mathcal{B}^s_q = (\mathcal{H},\mathcal{H}^\alpha )_{\frac{s}{\alpha },\,q} \quad \text {for } s \in (0,\alpha ) \text { and } q \in [1,\infty ], \end{aligned}$$
(33)

with

$$\begin{aligned} \left\| {f}\right\| _{\mathcal{B}^s_q} \asymp \left\| {f}\right\| _\mathcal{H}+ \left( \int _0^\infty \left( t^{-\theta } \mathcal{K}(f,t)\right) ^q \frac{dt}{t} \right) ^{1/q}. \end{aligned}$$
(34)

In the next proposition we show that, as in the Euclidean setting, the Besov space \(\mathcal{B}^s_2\) coincides with the Sobolev space \(\mathcal{H}^s\) of the same order; as in the classical case, this is particular to \( q = 2 \). This is probably a known fact, but we could find neither a proof nor a statement of it in the literature.

Proposition 8.5

For every \( s > 0 \), \( \mathcal{B}^s_2 = \mathcal{H}^s \) with equivalent norms.

Proof

Let \( {\alpha }= 2 s \). Then (33) and (34) give \( \mathcal{B}^s_2 = (\mathcal{H},\mathcal{H}^{{\alpha }})_{\frac{s}{{\alpha }},\,2} = (\mathcal{H},\mathcal{H}^{2s})_{\frac{1}{2},\,2} \) and

$$\begin{aligned} \left\| {f}\right\| _{\mathcal{B}^s_2}^2 \asymp \left\| {f}\right\| _\mathcal{H}^2 + \int _0^\infty t^{-1} \mathcal{K}(f,t)^2 \frac{dt}{t}. \end{aligned}$$
(35)

Let \( {\mathrm {A}}: \mathcal{H}^{\alpha }\rightarrow \mathcal{H}\) denote the canonical embedding \( {\mathrm {A}}g = g \). Then, for \( f \in \mathcal{H}\) and \( t > 0 \) we have

$$\begin{aligned} \mathcal{K}(f,t)^2&= \inf _{\begin{array}{c} f_0 + {\mathrm {A}}g=f \\ f_0 \in \mathcal{H}, g \in \mathcal{H}^{\alpha } \end{array}} (\left\| {f_0}\right\| _\mathcal{H}+ t \left\| {g}\right\| _{\mathcal{H}^{\alpha }})^2 \nonumber \\&= \inf _{g \in \mathcal{H}^{\alpha }} (\left\| {f - {\mathrm {A}}g}\right\| _\mathcal{H}+ t \left\| {g}\right\| _{\mathcal{H}^{\alpha }})^2 \asymp \mathcal{G}(f,t^2), \end{aligned}$$
(36)

with

$$\begin{aligned} \mathcal{G}(f,{\lambda }) := \inf _{g \in \mathcal{H}^{\alpha }} \left\| {f - {\mathrm {A}}g}\right\| _\mathcal{H}^2 + {\lambda }\left\| {g}\right\| _{\mathcal{H}^{\alpha }}^2. \end{aligned}$$

This infimum is attained by \( g = ({\mathrm {A}}^*{\mathrm {A}}+ {\lambda }\mathsf {Id}_{\mathcal{H}^\alpha })^{-1} {\mathrm {A}}^*f \). Since

$$\begin{aligned} ({\mathrm {A}}^*{\mathrm {A}}+ {\lambda }\mathsf {Id}_{\mathcal{H}^\alpha })^{-1} {\mathrm {A}}^* = {\mathrm {A}}^* ({\mathrm {A}}{\mathrm {A}}^* + {\lambda }\mathsf {Id}_\mathcal{H})^{-1}, \end{aligned}$$

defining \( {\mathrm {B}}:= {\mathrm {A}}{\mathrm {A}}^* :\mathcal{H}\rightarrow \mathcal{H}\) we obtain

$$\begin{aligned} {\mathrm {A}}({\mathrm {A}}^*{\mathrm {A}}+ {\lambda }\mathsf {Id}_{\mathcal{H}^\alpha })^{-1} {\mathrm {A}}^* = {\mathrm {B}}( {\mathrm {B}}+ {\lambda }\mathsf {Id}_{\mathcal{H}})^{-1}. \end{aligned}$$

Let \( {\mathrm {A}}^* = {\mathrm {U}}({\mathrm {A}}{\mathrm {A}}^*)^{1/2} = {\mathrm {U}}{\mathrm {B}}^{1/2} \) be the polar decomposition of \({\mathrm {A}}^*\), where \( {\mathrm {U}}: \mathcal{H}\rightarrow \mathcal{H}^{\alpha }\) is unitary. We have

$$\begin{aligned} \mathcal{G}(f,{\lambda }) = \left\| {(\mathsf {Id}_{\mathcal{H}} - {\mathrm {B}}( {\mathrm {B}}+ {\lambda }\mathsf {Id}_{\mathcal{H}} )^{-1}) f}\right\| _\mathcal{H}^2 + {\lambda }\Vert {\mathrm {U}}{\mathrm {B}}^{1/2} ({\mathrm {B}}+ {\lambda }\mathsf {Id}_{\mathcal{H}})^{-1} f \Vert _{\mathcal{H}^{\alpha }}^2. \end{aligned}$$

Since \((\mathsf {Id}_{\mathcal{H}} - {\mathrm {B}}( {\mathrm {B}}+ {\lambda }\mathsf {Id}_{\mathcal{H}} )^{-1} ) ( {\mathrm {B}}+ {\lambda }\mathsf {Id}_{\mathcal{H}}) = {\lambda }\mathsf {Id}_{\mathcal{H}},\) it follows that

$$\begin{aligned} \mathcal{G}(f,{\lambda })&= {\lambda }^2 \Vert ({\mathrm {B}}+ {\lambda }\mathsf {Id}_{\mathcal{H}})^{-1} f \Vert _\mathcal{H}^2 + {\lambda }\Vert {\mathrm {B}}^{1/2}({\mathrm {B}}+ {\lambda }\mathsf {Id}_{\mathcal{H}})^{-1} f \Vert _\mathcal{H}^2 \nonumber \\&= {\lambda }\left[ {\lambda }\langle ({\mathrm {B}}+ {\lambda }\mathsf {Id}_{\mathcal{H}})^{-2} f, f \rangle _\mathcal{H}+ \langle {\mathrm {B}}({\mathrm {B}}+ {\lambda }\mathsf {Id}_{\mathcal{H}})^{-2} f, f \rangle _\mathcal{H}\right] \nonumber \\&= {\lambda }\langle ({\mathrm {B}}+ {\lambda }\mathsf {Id}_{\mathcal{H}})^{-2} ({\lambda }\mathsf {Id}_\mathcal{H}+ {\mathrm {B}}) f, f \rangle _\mathcal{H}\nonumber \\&= {\lambda }\langle ({\mathrm {B}}+ {\lambda }\mathsf {Id}_{\mathcal{H}})^{-1} f, f \rangle _\mathcal{H}. \end{aligned}$$
(37)

Plugging (36) and (37) into (35) we get

$$\begin{aligned}&\int _0^\infty t^{-1} \mathcal{K}(f,t)^2 \frac{dt}{t} \asymp \int _0^\infty t^{-1} \mathcal{G}(f,t^2) \frac{dt}{t} \\&\quad = \int _0^\infty \left<{({\mathrm {B}}+ t^2 \mathsf {Id}_{\mathcal{H}})^{-1} f},{f}\right>_\mathcal{H}dt = \int _0^\infty \int _0^\infty \frac{1}{\sigma + t^2} \left<{d\pi _{{\mathrm {B}}}(\sigma ) f},{f}\right>dt, \end{aligned}$$

where \( \pi _{{\mathrm {B}}}\) is the spectral measure of \({{\mathrm {B}}}\). By Fubini we have

$$\begin{aligned}&\int _0^\infty \int _0^\infty \frac{1}{\sigma + t^2} dt \ \left<{d\pi _{{\mathrm {B}}}(\sigma ) f},{f}\right> = \int _0^\infty \frac{1}{\sqrt{\sigma }} \arctan \Big (\frac{t}{\sqrt{\sigma }}\Big ) \bigg |_0^\infty \left<{d\pi _{{\mathrm {B}}}(\sigma ) f},{f}\right> \\&\quad \asymp \int _0^\infty \sigma ^{-1/2} \left<{d\pi _{{\mathrm {B}}}(\sigma ) f},{f}\right> =\langle {{\mathrm {B}}^{-1/2} f,f}\rangle _\mathcal{H}= \big \Vert {{\mathrm {B}}^{-1/4} f}\big \Vert _\mathcal{H}^2. \end{aligned}$$

Therefore, \( f \in \mathcal{B}^s_2 \) if and only if \( f \in {\text {dom}}({\mathrm {B}}^{-1/4}) \). It now suffices to show \( {\mathrm {B}}^{-1/4} = {\mathrm {T}}^{-s} \), whence \( \Vert {{\mathrm {B}}^{-1/4} f}\Vert _\mathcal{H}^2 = \Vert {f}\Vert _{\mathcal{H}^s}^2 \). For any \( f \in \mathcal{H}\) and \( g \in \mathcal{H}^{\alpha }\) we have

$$\begin{aligned} \left<{f},{{\mathrm {A}}g}\right>_{\mathcal{H}} = \left<{{\mathrm {A}}^* f},{g}\right>_{\mathcal{H}^{\alpha }} = \left<{{\mathrm {T}}^{-{\alpha }} {\mathrm {A}}{\mathrm {A}}^*f},{{\mathrm {T}}^{-{\alpha }} {\mathrm {A}}g}\right>_\mathcal{H}= \left<{{\mathrm {T}}^{-2{\alpha }} {\mathrm {B}}f},{g}\right>_\mathcal{H}. \end{aligned}$$

Since \(\mathcal{H}^{\alpha }\) is dense in \(\mathcal{H}\), this implies \( {\mathrm {T}}^{-2{\alpha }} {\mathrm {B}}= \mathsf {Id}_{\mathcal{H}} \). Hence, \( {\mathrm {B}}= {\mathrm {T}}^{2{\alpha }} = {\mathrm {T}}^{4s} \), which completes the proof. \(\square \)

Besov spaces by wavelet coefficients The Besov norm can also be expressed by means of wavelet coefficients. Let

$$\begin{aligned} {F_j}(\lambda ):=\sqrt{\lambda }{G_j}(\lambda ), \end{aligned}$$

where \({G_j}\) is a filter as in Definition 4.2. The partition of unity (9) becomes

$$\begin{aligned} \sum _{j\ge 0}{F_j}(\lambda )^2=1 \quad \text {for all } \lambda \in (0,\kappa ^2]. \end{aligned}$$
(38)

Moreover, in view of (18), for a frame \( {\varvec{\Psi }}\) as in Definition 4.6 we have

$$\begin{aligned} \left\| { \left<{f},{\psi _{j,\cdot }}\right> }\right\| _{L^{2}({\mathcal{X},\rho })} = \left\| {{F_j}({\mathrm {T}}) f}\right\| _\mathcal{H}, \end{aligned}$$

and the frame property (16) can be rewritten as

$$\begin{aligned} \left\| {f}\right\| _\mathcal{H}^2 = \sum _{j\ge 0} \left\| {{F_j}({\mathrm {T}}) f}\right\| _\mathcal{H}^2. \end{aligned}$$
(39)

If we further assume the localization property (cf. Example 4.5)

$$\begin{aligned} {\text {supp}}( F_0 ) \subset (2^{-1},\infty ), \qquad {\text {supp}}( F_j ) \subset ( 2^{-j-1}, 2^{-j+1} ) \quad \text {for } j \ge 1, \end{aligned}$$
(40)

a weighted \(\ell ^q\)-norm of the sequence \( (\left\| {{F_j}({\mathrm {T}}) f}\right\| _\mathcal{H})_{j\ge 0} \) gives an equivalent characterization of the space \( \mathcal{B}^s_q \).

Proposition 8.6

([20, Theorem 3.18]) Let \( \{{F_j}\}_{j\ge 0} \) be a family of measurable functions \( {F_j}: [0,\infty ) \rightarrow [0,\infty ) \) satisfying (38) and (40). Then, for every \( f \in \mathcal{B}^s_q\) we have

$$\begin{aligned} \left\| {f}\right\| _{\mathcal{B}^s_q} \asymp {\left\| {f}\right\| }_\mathcal{H}+ \Big ( \sum _{j\ge 0} \left( 2^{j s} \left\| {{F_j}({\mathrm {T}}) f}\right\| _\mathcal{H}\right) ^q \Big )^{1/q}. \end{aligned}$$

Proof

We upper and lower bound the discretized norm in (31). Using (39) (which holds thanks to (38)) and (40), we have

$$\begin{aligned} \mathcal{E}(f,2^\ell )^2&= \left\| {\mathrm{P}_{\mathbf {PW}(2^\ell )^\perp } f}\right\| _\mathcal{H}^2 = \sum _{j\ge 0}\left\| {{F_j}({\mathrm {T}}) \mathrm{P}_{\mathbf {PW}(2^\ell )^\perp } f}\right\| _\mathcal{H}^2 \\&= \sum _{j\ge 0}\sum _{i\in \mathcal{I}_\rho } \left| {\left<{{F_j}({\mathrm {T}})\mathrm{P}_{\mathbf {PW}(2^\ell )^\perp } f},{v_i}\right>_\mathcal{H}}\right| ^2 = \sum _{j\ge 0}\sum _{i\in \mathcal{I}_\rho } \left| {\left<{\mathrm{P}_{\mathbf {PW}(2^\ell )^\perp } f},{{F_j}({\mathrm {T}}) v_i}\right>_\mathcal{H}}\right| ^2 \\&= \sum _{j \ge 0} \sum _{ \begin{array}{c} {\lambda }_i< 2^{-\ell } \\ {\lambda }_i \in (2^{-j-1},2^{-j+1}) \end{array} } {F_j}(\lambda _i)^2 \left| {\left<{f},{v_i}\right>_\mathcal{H}}\right| ^2 \le \sum _{j \ge \ell } \sum _{{\lambda }_i \in (2^{-j-1},2^{-j+1})} {F_j}(\lambda _i)^2 \left| {\left<{f},{v_i}\right>_\mathcal{H}}\right| ^2 \\&= \sum _{j\ge \ell } \sum _{i\in \mathcal{I}_\rho } {F_j}(\lambda _i)^2 \left| {\left<{f},{v_i}\right>_\mathcal{H}}\right| ^2 = \sum _{j\ge \ell } \left\| {{F_j}({\mathrm {T}}) f}\right\| _\mathcal{H}^2, \end{aligned}$$

where the inequality holds since the terms with \( j < \ell \) vanish, while dropping the constraint \( {\lambda }_i < 2^{-\ell } \) only adds nonnegative terms.

Thus, by the discrete Hardy inequality (Lemma A.3), we get

$$\begin{aligned} \Big ( \sum _{\ell \ge 0} (2^{\ell s} \mathcal{E}(f,2^\ell ))^q \Big )^{1/q}&\le \Big ( \sum _{\ell \ge 0} \Big (2^{\ell s} \sum _{j\ge \ell } \left\| {{F_j}({\mathrm {T}}) f}\right\| _\mathcal{H}\Big )^q \Big )^{1/q} \\&\le C_{sq} \Big ( \sum _{j\ge 0}\left( 2^{j s} \left\| {{F_j}({\mathrm {T}}) f}\right\| _\mathcal{H}\right) ^q \Big )^{1/q}, \end{aligned}$$

with \(C_{sq}=\frac{2^{sq}}{2^{sq}-1}\). Conversely, since \( {\text {supp}}(F_j) \subset (2^{-j-1},2^{-j+1}) \), for \( j \ge 1 \) we have \({F_j}({\mathrm {T}}) g = 0\) for every \( g \in \mathbf {PW}(2^{j-1}) \), and therefore

$$\begin{aligned} \left\| {{F_j}({\mathrm {T}}) f}\right\| _\mathcal{H}= \left\| {{F_j}({\mathrm {T}}) (f-g)}\right\| _\mathcal{H}\le \left\| {{F_j}({\mathrm {T}})}\right\| _\mathcal{H}\left\| {f-g}\right\| _\mathcal{H}\le \left\| {f-g}\right\| _\mathcal{H}, \end{aligned}$$

whence

$$\begin{aligned} \left\| {{F_j}({\mathrm {T}}) f}\right\| _\mathcal{H}\le \inf _{g\in \mathbf {PW}(2^{j-1})} \left\| {f-g}\right\| _\mathcal{H}= \mathcal{E}(f,2^{j-1}), \end{aligned}$$

while the term \( j = 0 \) is bounded by \(\left\| {f}\right\| _\mathcal{H}\); the resulting index shift only affects the constants in (31).

\(\square \)
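For concreteness, the wavelet-coefficient characterization can also be evaluated from spectral data. Below is a minimal sketch (function name ours); the dyadic filters are an illustrative Littlewood–Paley choice satisfying (38) and (40), not the paper's exact Example 4.5, and eigenvalues below \(2^{-j_{\max }-1}\) are ignored by the truncation:

```python
import numpy as np

def besov_norm_wavelet(evals, coeffs, s, q, j_max=50):
    """Wavelet-coefficient Besov norm as in Proposition 8.6; evals are the
    eigenvalues lambda_i of T, coeffs the coefficients <f, v_i>_H."""
    def h(lam):
        # smooth bump supported in (1/2, 2) with h(lam)^2 + h(2*lam)^2 = 1
        out = np.zeros_like(lam)
        up = (lam > 0.5) & (lam <= 1.0)
        down = (lam > 1.0) & (lam < 2.0)
        out[up] = np.sin(0.5 * np.pi * np.log2(2.0 * lam[up]))
        out[down] = np.cos(0.5 * np.pi * np.log2(lam[down]))
        return out

    Fsq = [h(2.0 ** j * evals) ** 2 for j in range(1, j_max + 1)]
    # F_0 completes the partition of unity (38) on (1/2, infinity)
    F0_sq = np.where(evals > 0.5, np.clip(1.0 - sum(Fsq), 0.0, None), 0.0)
    Fsq.insert(0, F0_sq)
    norms = [np.sqrt(np.sum(fs * coeffs ** 2)) for fs in Fsq]  # ||F_j(T) f||_H
    tail = sum((2.0 ** (j * s) * n) ** q for j, n in enumerate(norms))
    return np.sqrt(np.sum(coeffs ** 2)) + tail ** (1.0 / q)
```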

Convergence of spectrally localized Monte Carlo wavelets Proposition 8.6 can be used to obtain approximation bounds for frames built with filters satisfying the localization property (40).

Proposition 8.7

Under the conditions of Proposition 8.6, for every \( f \in \mathcal{B}^s_q \) and \( \epsilon \in (0,s) \), we have

$$\begin{aligned} \Big \Vert {\sum _{j>\tau } {\mathrm {T}}_j f}\Big \Vert _\mathcal{H}\lesssim {\left\{ \begin{array}{ll} \left\| {f}\right\| _{\mathcal{B}^s_q} 2^{-\tau s} &{} \text {for } q \in [1,2] \\ \left\| {f}\right\| _{\mathcal{B}^{s-\epsilon }_2} 2^{-\tau (s-\epsilon )} &{} \text {for } q \in (2,\infty ] \end{array}\right. }. \end{aligned}$$

Proof

By Proposition 8.6, we have

$$\begin{aligned} \sum _{j>\tau } \left\| {{F_j}({\mathrm {T}})f}\right\| _\mathcal{H}^q = \sum _{j>\tau } 2^{-jsq} \left( 2^{j s} \left\| {{F_j}({\mathrm {T}})f}\right\| _\mathcal{H}\right) ^q \lesssim 2^{-(\tau +1)s q} \left\| {f}\right\| _{\mathcal{B}^s_q}^q. \end{aligned}$$

Also, (38) implies \( \big ({\sum _{j>\tau } {F_j}(\lambda _i)^2}\big )^2 \le \sum _{j>\tau } {F_j}(\lambda _i)^2 \), since the partial sums are bounded by 1. Hence, for \( q \le 2 \) we obtain

$$\begin{aligned} \Big \Vert {\sum _{j>\tau } {\mathrm {T}}_j f}\Big \Vert _\mathcal{H}^2&= \sum _{i\in \mathcal{I}_\rho } \Big \vert {\sum _{j>\tau } {F_j}(\lambda _i)^2}\Big \vert ^2 \left| {\left<{f},{v_i}\right>_\mathcal{H}}\right| ^2 \le \sum _{j>\tau } \sum _{i\in \mathcal{I}_\rho } {F_j}(\lambda _i)^2 \left| {\left<{f},{v_i}\right>_\mathcal{H}}\right| ^2 \\&= \sum _{j>\tau } \left\| {{F_j}({\mathrm {T}}) f}\right\| _\mathcal{H}^2 = \big \Vert \left( \left\| {F_{\tau +j}({\mathrm {T}})f}\right\| _\mathcal{H}\right) _{j\ge 1} \big \Vert _{\ell ^2}^2 \\&\le \big \Vert \left( \left\| {F_{\tau +j}({\mathrm {T}})f}\right\| _\mathcal{H}\right) _{j\ge 1} \big \Vert _{\ell ^q}^2 \le \big ({ 2^{-(\tau +1) s} \left\| {f}\right\| _{\mathcal{B}^s_q} }\big )^2. \end{aligned}$$

If \(q>2\), then \( \mathcal{B}^s_q \subset \mathcal{B}^{s-\epsilon }_2 \) for every \( \epsilon \in (0,s)\), thanks to (32), and the claim follows. \(\square \)

Putting together Propositions 8.7 and 7.2 yields a convergence result for Monte Carlo wavelets with localized filters.

Theorem 8.8

Assume that \( {F_j}\) satisfies (40), \( f \in \mathcal{B}^s_q \) with \( q \in [1,2] \), and \(\lambda \mapsto \lambda g_\tau (\lambda )\) is Lipschitz continuous on \([0,\kappa ^2]\) with Lipschitz constant \(L(\tau )\lesssim 2^\tau \). Set

$$\begin{aligned} \tau = \lceil { {\tfrac{1}{2s+2}} \log _2 (N)}\rceil . \end{aligned}$$

Then, for every \( t > 0 \), with probability at least \(1-2e^{-t}\) we have

$$\begin{aligned} \big \Vert {f-\widehat{f}_{\tau ,N}}\big \Vert _\mathcal{H}\lesssim \left\| {f}\right\| _{\mathcal{B}^s_q} \big ({1+\kappa ^{2}\sqrt{t}}\big ) N^{-\frac{s}{2s+2}}. \end{aligned}$$

Compared to Theorem 7.5, Theorem 8.8 requires the resolution \(\tau \) to grow only logarithmically with the sample size N. Note that the conditions of Theorem 8.8 exclude the spectral functions of Table 2, since they do not satisfy (40). Admissible filters are instead given by Example 4.5, which have local support (40) but an exponential Lipschitz constant.
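For instance, for \( N = 10^6 \) samples and smoothness \( s = 1 \), Theorem 8.8 prescribes \( \tau = \lceil \log _2(10^6)/4 \rceil = 5 \) scales, whereas the polynomial tuning of Theorem 7.5 with \( p = 1 \) and \( \beta = 1 \) would require \( \tau = \lceil N^{1/4} \rceil = 32 \).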

9 Concluding Remarks and Future Directions

We presented a construction of tight frames that extends wavelets to general domains, based on the spectral filtering of a reproducing kernel. Depending on the measure considered, our construction leads to continuous or discrete frames, covering non-Euclidean structures such as Riemannian manifolds and weighted graphs. Besides the standard frequency-localized filters commonly used for wavelet frames, we defined admissible spectral filters by resorting to methods from regularization theory, such as Tikhonov regularization and Landweber iteration. Regarding discrete measures as empirical measures arising from independent realizations of a continuous density, we interpreted discrete frames as Monte Carlo estimates of continuous frames. We proved that the Monte Carlo frame converges to the corresponding deterministic continuous frame, and provided finite-sample bounds holding in high probability, with rates depending on the Sobolev or Besov regularity of the reconstructed signal. This demonstrates the stability of empirical frames built from sampled data.

In future work we intend to study the numerical implementation of our Monte Carlo wavelets, along with possible applications in graph signal processing, regression analysis and denoising. Further theoretical investigation may include \(L^p\) Banach frame extensions, sparse representations, nonlinear approximation rates, Lipschitz bound refinements, and explicit localization properties for specific families of kernels.