1 Introduction

Data science in general, and signal and image processing in particular, rely on mathematical methods, with the fast Fourier transform as the most prominent example. Besides its favourable computational complexity, its success relies on the good approximation of smooth functions by trigonometric polynomials. Mainly driven by specific applications, functions with additional properties, together with associated computational schemes, have gained attention: signals might for instance be sparse, as in single molecule fluorescence microscopy [45], or live on some other lower dimensional structure like microfilaments, again in bio-imaging. Such properties are well modeled by measures, which can express the underlying structure through the geometry of their support, e.g. being discrete or singular continuous. This representation has in particular led to a better understanding of the sparse super-resolution problem [7, 11, 17], but has also proven useful in many more applications, such as phase retrieval in X-ray crystallography [3] or contour reconstruction in natural images [49]. In this work, we consider measures \(\mu \) supported on the d-dimensional torus. The available data then consists of trigonometric moments of low to moderate order, i.e.

$$\begin{aligned} {{\hat{\mu }}(k)=\int _{\mathbb {T}^d} \text {e}^{-2\pi i {kx}}\textrm{d}\mu (x),\qquad k\in \{-n,\ldots ,n\}^d} \end{aligned}$$
(1.1)

for some \(n \in \mathbb {N}\), and one asks for the reconstruction or approximation of \(\mu \) from this partial information. In this context, our work focuses on approximations by trigonometric polynomials, more specifically on two types of asymptotic behaviours: after setting up some trigonometric polynomials \(q_n\) based on the knowledge of (1.1), we distinguish between pointwise convergence to the indicator function of \({\text {supp}}\,\mu \), i.e.

$$\begin{aligned} \lim _{n\rightarrow \infty } q_n(x) = {\left\{ \begin{array}{ll} 1, \quad &{} x\in {\text {supp}}\,\mu , \\ 0, \quad &{} \text {else,} \end{array}\right. } \end{aligned}$$

and weak convergence, i.e.

$$\begin{aligned} \lim _{n\rightarrow \infty } \int _{\mathbb {T}^d} f(x) q_n(x) \textrm{d}x = \int _{\mathbb {T}^d} f(x) \textrm{d}\mu (x) \end{aligned}$$
(1.2)

for all continuous test functions f. The latter is denoted by \(q_n\rightharpoonup \mu \) and is equivalent to convergence with respect to the Wasserstein distance, for which we achieve quantitative rates.
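For concreteness, the data (1.1) can be computed directly for a discrete measure; the following is a minimal univariate sketch in Python (the function name and the setup are ours, chosen for illustration only):

```python
import numpy as np

def trig_moments(positions, weights, n):
    """Moments mu_hat(k), k = -n,...,n, of the discrete measure
    sum_j weights[j] * delta_{positions[j]} on T = [0,1), cf. (1.1)."""
    k = np.arange(-n, n + 1)
    return np.exp(-2j * np.pi * np.outer(k, positions)) @ weights

# Example: mu = 0.5*delta_{0.2} + 0.5*delta_{0.7}
mu_hat = trig_moments(np.array([0.2, 0.7]), np.array([0.5, 0.5]), n=8)
assert np.isclose(mu_hat[8], 1.0)  # index 8 is k = 0, and mu_hat(0) = mu(T) = 1
```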

Related work For discrete measures, there is a large variety of subspace methods that compute or approximate the parameters (positions and weights) of the measure, e.g., Prony’s method [16, 30, 32, 53, 58], matrix pencil [17, 27, 46], ESPRIT [2, 40, 55, 57] or MUSIC [41, 60]. Except for MUSIC, these methods realise the parameters as eigenvalues of specific moment matrices and are well understood in the univariate case [46]. In the multivariate case, an often-used randomisation technique [17, 48] has only recently been discussed in a special case [23].

On the other hand, MUSIC [41, 60] as well as the variational methods [7, 9, 10, 54] build intermediate trigonometric polynomials which interpolate the value one at the support points and are smaller otherwise. If the measure is supported on a positive-dimensional variety, the situation is more involved. Specific curves in a two-dimensional domain are identified by the kernel of moment matrices in [19, 49, 50]; more general discussions can be found in [39, 62], where the support is again described implicitly by a trigonometric polynomial which takes the value one on the support and is smaller otherwise. Finally, Christoffel functions offer interesting guarantees both in terms of support identification [37] and approximation on the support [31, 43, 51], but, to the best of our knowledge, they require regularity assumptions on the underlying measure and only come with separate guarantees on and outside the support of the measure.

Contributions Following the approach of the seminal paper [44] to approximate a measure by using information about its trigonometric moments, we study easily computable trigonometric polynomials to approximate an arbitrary measure on the d-dimensional torus. In contrast to [44], we provide tight bounds on the pointwise approximation error as well as on the error with respect to the 1-Wasserstein distance, the latter scaling inversely with the polynomial degree (up to a logarithmic factor). One of the main contributions of our work lies in the simple connection between approximation in the 1-Wasserstein distance and known results from approximation theory for Lipschitz functions. For example, we relate questions on the best approximation of measures by polynomials to best approximation results in \(L^1(\mathbb {T}^d)\) and \(C(\mathbb {T}^d)\). Additionally, we show, analogously to classical approximation theory, that near best approximations can be derived through convolution with certain kernels. As far as we know, these connections, formulated in Sect. 3, were not considered before. On the other hand, we analyse in Sect. 4 the interpolation behaviour of a sum of squares polynomial, \(p_{1,n}\), similarly suggested in [32, Thm. 3.5] and [49, Prop. 5.3] (and indeed closely related to the rational function in the MUSIC algorithm, see [60, Eq. (6)]). The main contribution of this section is not the invention of this polynomial \(p_{1,n}\) but the analysis of its pointwise convergence to the indicator function of the Zariski closure of the support of the measure. This justifies estimating the support of discrete measures, or of measures supported on an algebraic curve, by considering points where \(p_{1,n}\) is equal or close to its maximal value one. For instance, this might be used in future works to represent sparse objects in single molecule microscopy.

Organisation of the paper We summarize our main results informally, neither stating technical details nor making the involved constants explicit. After setting up the notation, Sect. 3 considers the approximation of measures by trigonometric polynomials. The convolution of the measure with polynomial kernels is studied in Sect. 3.1 and leads in Theorem 3.3 to the 1-Wasserstein upper bound

$$\begin{aligned} W_1(p_n,\mu )&\le \frac{c_1 \log n}{n}, \end{aligned}$$

where \(p_n\) denotes the convolution of the measure with the n-th Fejér kernel. This approximation comes with a saturation result in Sect. 3.2, Theorem 3.5, showing that for each measure except the Lebesgue measure, there exists a constant \(c_2\) with

$$\begin{aligned} W_1(p_n,\mu )&\ge \frac{c_2}{n}. \end{aligned}$$

This individual lower bound is complemented with a worst case lower bound for the best approximation \(p^*\) in Sect. 3.3, Theorem 3.6, showing

$$\begin{aligned} \sup _\mu W_1(p^*,\mu )&\ge \frac{c_3}{n}, \end{aligned}$$

where the subsequent characterisation of the best approximation in the univariate case in Sect. 3.4, Theorem 3.9, and the following Example 3.10 show that the achieved constant \(c_3\) is sharp.

In Sect. 4, we start by showing a specific sum of squares representation for \(p_n\) involving the moment matrix \(\left( {\hat{\mu }}(k-\ell )\right) _{k,\ell \in [n]}\). Setting all non-zero singular values in this representation to a constant yields the so-called signal polynomial \(p_{1,n}\) which identifies the support of the measure in Sect. 4.1, Theorem 4.2, by

$$\begin{aligned} p_{1,n}(x)&{\left\{ \begin{array}{ll} =1, &{}\quad x\in {\text {supp}}\,\mu ,\\ <1, &{}\quad \text {otherwise}.\end{array}\right. } \end{aligned}$$

As is common to all subspace methods, this involves technical assumptions on the support of the measure and requires the degree n to be finite but large enough. For discrete measures the assumptions on the support are met, and Sect. 4.2, Theorem 4.6, proves the pointwise convergence

$$\begin{aligned} p_{1,n}(x)&{\left\{ \begin{array}{ll} \le 1-c_4n^2\mathop {\textrm{dist}}\limits (x,{\text {supp}}\,\mu )^2, &{} x \text { close to } {\text {supp}}\,\mu ,\\ \le \frac{c_5}{n^2\mathop {\textrm{dist}}\limits (x,{\text {supp}}\,\mu )^2}, &{} \text {otherwise}.\end{array}\right. } \end{aligned}$$

A weak convergence result for discrete measures is proven in Theorem 4.9. Finally, Sect. 4.3, Theorem 4.10, proves pointwise convergence

$$\begin{aligned} p_{1,m+n}(x)\le \frac{c_5(m,x)}{n}, \quad x\notin {\text {supp}}\,\mu , \end{aligned}$$

also for positive dimensional support sets (which are generated in degree m). We end by illustrating the theoretical results by numerical examples in Sect. 5.

2 Preliminaries

Let \(d\in \mathbb {N}\), \(1\le p\le \infty \) and let \(|x-y|_p = \min _{k\in \mathbb {Z}^d} \Vert x-y+k\Vert _p\) denote the wrap-around p-norm on \(\mathbb {T}^d=[0,1)^d\). For \(d=1\) these wrap-around distances coincide and we denote them by \(|x-y|_1\) to distinguish them from the absolute value. Throughout this paper, let \(\mu ,\nu \) denote some complex Borel measures on \(\mathbb {T}^d\) with finite total variation \(\Vert {\mu }\Vert _{\text {TV}}\) and normalization \(\mu (\mathbb {T}^d)=\nu (\mathbb {T}^d)=1\). This implies that the trigonometric moments as defined above are finite with \(|{\hat{\mu }}(k)|\le \Vert {\mu }\Vert _{\text {TV}}\) and \({\hat{\mu }}(0)=1\). We denote the set of all such measures by \(\mathcal {M}\) and restrict to the real signed and nonnegative case by \(\mathcal {M}_{\mathbb {R}}\) and \(\mathcal {M}_+\), respectively.
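Since the wrap-around distance reappears in all numerical sketches below, we record a small helper (a sketch; the name is ours):

```python
import numpy as np

def wrap_dist(x, y, p=1):
    """Wrap-around p-distance |x-y|_p on T^d = [0,1)^d: the minimum over
    integer shifts is attained componentwise by min(|t|, 1-|t|)."""
    t = np.abs(np.asarray(x, float) - np.asarray(y, float)) % 1.0
    return np.linalg.norm(np.minimum(t, 1.0 - t), ord=p)

assert np.isclose(wrap_dist([0.9, 0.1], [0.1, 0.9]), 0.4)  # 0.2 + 0.2 in the 1-norm
```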

A function \(f: \mathbb {T}^d\rightarrow \mathbb {C}\) has Lipschitz constant at most 1 if it satisfies \(|f(x)-f(y)|\le |x-y|_1\) for all \(x,y\in \mathbb {T}^d\), which we denote by the shorthand \(\mathop {\textrm{Lip}}\limits (f)\le 1\). Using the dual characterisation of Kantorovich-Rubinstein, the 1-Wasserstein distance of \(\mu \) and \(\nu \) is defined by

$$\begin{aligned} W_1(\nu ,\mu ) =\sup _{\mathop {\textrm{Lip}}\limits (f)\le 1} \left| \int _{\mathbb {T}^d} f(x) \;\textrm{d}(\nu -\mu )\left( x\right) \right| , \end{aligned}$$

for any \(\mu ,\nu \in {\mathcal {M}}\). If \(\mu ,\nu \in {\mathcal {M}}_+\), this distance also admits the primal formulation

$$\begin{aligned} W_1(\nu ,\mu )=\inf _{\pi } \int _{\mathbb {T}^{2d}} |x-y|_1 \textrm{d}\pi (x,y) \end{aligned}$$

where the infimum is taken over all couplings \(\pi \) with marginals \(\mu \) and \(\nu \), see e.g. [24, 52]. We note in passing that the 1-Wasserstein distances induced by the other p-norms on \(\mathbb {T}^d\) are equivalent, with lower and upper constants 1 and \(d^{1-1/p}\), respectively. Moreover, the 1-Wasserstein distance is a metric induced by the norm

$$\begin{aligned} \Vert \mu \Vert _{\mathop {\textrm{Lip}}\limits ^{*}}=\sup _{f: \mathop {\textrm{Lip}}\limits (f)\le 1, \Vert f\Vert _{\infty }\le \frac{d}{2}} \left| \int _{\mathbb {T}^d} f(x) d\mu (x)\right| \end{aligned}$$

which makes the space of complex-valued Borel measures with finite total variation a Banach space. By slight abuse of notation, we also write \(W_1(p,\mu )\) in case the measure \(\nu \) has density p, i.e., \(\textrm{d}\nu (x)=p(x)\textrm{d}x\). Using the trigonometric moments from (1.1), we compute in Sect. 3 trigonometric approximations \(q_n\in \mathcal {P}_n\) to the underlying measure, where we define the set \(\mathcal {P}_n\) of trigonometric polynomials with max-degree n as

$$\begin{aligned} \mathcal {P}_n=\left\{ p: \mathbb {T}^d\rightarrow \mathbb {C}, \quad x\mapsto p(x)=\sum _{k\in \{-n,\dots ,n\}^d} {\hat{p}}(k) \text {e}^{2\pi i {kx}} \text { for } {\hat{p}}(k)\in \mathbb {C}\right\} . \end{aligned}$$

In Sect. 4, we additionally consider causal trigonometric polynomials where the coefficients of the polynomial are only nonzero at the nonnegative frequencies, i.e. for \(k\in \{0,\dots ,n\}^d\).

3 Approximation

We study in this section weakly convergent polynomial approximations of measures, i.e. approximations satisfying the property (1.2). The 1-Wasserstein distance (along with all the Wasserstein distances) metrizes this notion of convergence for measures with equal mass [61, Thm. 6.9], which allows us to provide both upper and lower bounds on the rates of convergence with respect to this distance of our constructions.

While our focus is principally on actually computable approximations based on convolution with known kernels, we also turn in the last part of this section (Sect. 3.3 below) to more theoretical considerations on best polynomial approximations with respect to the 1-Wasserstein distance, which, in addition to giving new perspectives on polynomial approximations of measures, also highlight the near-optimality of our constructions.

3.1 Approximation by Convolution and Upper Bounds

Similarly to standard approaches in approximation theory, one may derive easy-to-compute polynomial estimates for a measure \(\mu \), by considering the convolution of the latter with adequate kernels. For instance, given the first trigonometric moments of \(\mu \), the Fourier partial sums

$$\begin{aligned} S_n\mu (x)=(D_n*\mu )(x)=\sum _{k\in \mathbb {Z}^d, \Vert k\Vert _{\infty }\le n} {\hat{\mu }}(k) \text {e}^{2\pi i {kx}}, \end{aligned}$$

which correspond to convolution with Dirichlet kernels, might serve as a sequence of approximations.

We focus in this section on yet another classical sequence of approximations, given by convolution with Fejér kernels \(F_n:\mathbb {T}^d\rightarrow \mathbb {R}\) (by slight abuse of notation, we use the same notation for both the multivariate and univariate kernels), defined for \(x = (x_1,\ldots ,x_d) \in \mathbb {T}^d\) as

$$\begin{aligned} F_n(x_1,\ldots ,x_d) = \prod _{i=1}^d F_n(x_i) \end{aligned}$$

where, for any \(x \in \mathbb {T}\),

$$\begin{aligned} F_n(x)&=\sum _{k=-n}^n \left( 1-\frac{|k|}{n+1}\right) \text {e}^{2\pi i {kx}} =\frac{1}{n+1}\left( \frac{\sin ((n+1)\pi x)}{\sin (\pi x)}\right) ^2\nonumber \\&= \frac{1}{n+1} \left| \sum _{k=0}^n \text {e}^{2\pi i {kx}} \right| ^2. \end{aligned}$$
(3.1)

The main object of study in this section is the trigonometric polynomial

$$\begin{aligned} p_n(x)&=\left( F_n*\mu \right) (x) =\int _{\mathbb {T}^d} F_n(x-y) \textrm{d}\mu (y). \end{aligned}$$
(3.2)

We give two illustrative examples in Example 3.1.

Example 3.1

Our first example for \(d=1\) is the measure

$$\begin{aligned} \mu =\frac{1}{3}\delta _{\frac{1}{8}}+\nu \in \mathcal {M}_+,\quad \frac{\textrm{d}\nu }{\textrm{d}\lambda }(x)=\frac{8}{9}\chi _{\left[ \frac{1}{4},\frac{5}{8}\right] }(x)+\frac{\sqrt{2}}{3}\left( \frac{1}{\sqrt{\left| x-\frac{7}{8} \right| }} - \sqrt{8}\right) \chi _{\left[ \frac{3}{4},1\right] }(x), \end{aligned}$$
(3.3)

where \(\lambda \) denotes the Lebesgue measure. Obviously, this measure \(\mu \) has singular and absolutely continuous parts including an integrable pole at \(x=\frac{7}{8}\).

Both the Fourier partial sums and the Fejér approximations for \(n=19\) are shown in the left and right panel of Fig. 1, respectively.

Fig. 1 The example measure (3.3), its approximations by the Fourier partial sum (left) and the convolution with the Fejér kernel (right). The weight \(\frac{1}{3}\) of the Dirac measure at \(\frac{1}{8}\) is displayed by an arrow of height n/3 for visibility

Our second example is a singular continuous measure for \(d=2\). We take \(\mu =(2\pi r_0)^{-1}\delta _{C}\in \mathcal {M}_+\) as the uniform measure on the circle

$$\begin{aligned} C=\{x\in \mathbb {T}^2: |x|_2=r_0\} \end{aligned}$$

for some radius \(0<r_0<\frac{1}{2}\). The total variation of this measure is

$$\begin{aligned} \Vert {\mu }\Vert _{\text {TV}} = {\hat{\mu }}(0)=\int _{\mathbb {T}^2} \textrm{d}\mu (x) = \frac{1}{2\pi r_0} \int _{C} \textrm{d}x = 1. \end{aligned}$$

Using a well-known representation of the Fourier transform of a radial function (cf.  [22, p. 574]), we find

$$\begin{aligned} \hat{\mu }(k)&=\int _{\mathbb {T}^2} \text {e}^{-2\pi i {kx}} \textrm{d}\mu (x)= \frac{1}{r_0} \int _{0}^{\infty } r J_0(2\pi r \Vert k\Vert _2) \textrm{d}\delta _{r_0}(r) = J_0(2\pi r_0 \Vert k\Vert _2) \end{aligned}$$
(3.4)

for the trigonometric moments of \(\mu \), where \(J_0\) denotes the 0-th order Bessel function of the first kind. These decay asymptotically like \(\Vert k\Vert _2^{-1/2}\), cf. [22, Appendix B.8]. The Fourier partial sum as well as the convolution with the Fejér kernel for \(n=29\) are shown with maximal contrast in the left and right panel of Fig. 2, respectively. We observe that both approximators peak around the support of the measure, and the approximation by convolution with the Fejér kernel produces less ringing than the Dirichlet kernel at the cost of a slightly thicker main lobe.

Fig. 2 Uniform measure on a circle of radius \(r_0=\frac{1}{3}\) and its approximations by the Fourier partial sum (left) and the convolution with the Fejér kernel (right), \(n=29\)

Of course, the construction and efficient evaluation of this approximation \(p_n\) rely on the convolution theorem and the fast Fourier transform (FFT). Given the trigonometric moments \({\hat{\mu }}(k)\), \(k\in \{-n,\ldots ,n\}^d\), we multiply these with the Fourier coefficients of the Fejér kernel (3.1) in each dimension and use an inverse FFT to evaluate \(p_n\) on the equispaced grid \((2n+1)^{-1}\{-n,\ldots ,n\}^d\); a small numerical sketch is given below.
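In Python, this could look as follows for \(d=1\); a minimal sketch assuming the moments are given as a numpy array indexed by \(k+n\) (all names are ours):

```python
import numpy as np

def fejer_approx_on_grid(mu_hat, n):
    """Evaluate p_n = F_n * mu on the grid j/(2n+1), j = 0,...,2n,
    from the moments mu_hat(k), k = -n,...,n (stored at index k+n).
    The multivariate case applies the same weights and an inverse
    FFT in each dimension."""
    k = np.arange(-n, n + 1)
    coeffs = (1.0 - np.abs(k) / (n + 1.0)) * mu_hat  # Fejér weights, cf. (3.1)
    # reorder from -n..n to the FFT convention, then invert the DFT
    return (2 * n + 1) * np.fft.ifft(np.fft.ifftshift(coeffs)).real

# Example: moments of delta_{1/4}; p_n peaks near x = 1/4
n = 19
mu_hat = np.exp(-2j * np.pi * np.arange(-n, n + 1) * 0.25)
p = fejer_approx_on_grid(mu_hat, n)
print(np.argmax(p) / (2 * n + 1))  # close to 0.25
```

Our next goal is a quantitative approximation result, for which we need the following preparatory lemma. This result can be found in qualitative form e.g. in [5, Lemma 1.6.4].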

Lemma 3.2

Let \(n,d\in \mathbb {N}\), then we have

$$\begin{aligned} \frac{d}{\pi ^2} \left( \frac{\log (n+1)}{n+1}+\frac{1}{n+2}\right) \le \int _{\mathbb {T}^d} F_n(x) |x|_1 \textrm{d}x \le \frac{d}{\pi ^2}\frac{\log (n) + 4}{n+1}. \end{aligned}$$

Proof

First note that

$$\begin{aligned} \int _{\mathbb {T}^d} \prod _{s=1}^d F_n(x_s) \sum _{\ell =1}^d |x_\ell |_1 \textrm{d}x = \sum _{\ell =1}^d \int _{\mathbb {T}^d} \prod _{s=1}^d F_n(x_s) |x_\ell |_1 \textrm{d}x = d \int _{\mathbb {T}} F_n(x) |x|_1 \textrm{d}x, \end{aligned}$$

where the second equality holds since \(\int F_n(x_s)\textrm{d}x_s = 1\). Thus it is sufficient to consider the univariate case. The representation \(F_n(x)=1+2\sum _{k=1}^n \left( 1-\frac{k}{n+1}\right) \cos (2\pi k x)\) gives

$$\begin{aligned} \int _{\mathbb {T}} F_n(x) |x|_1 \textrm{d}x&= 2 \int _0^{1/2} \left( x+2\sum _{k=1}^n \left( 1-\frac{k}{n+1}\right) \cos (2\pi k x)x \right) \,\textrm{d}x \\&=2\left[ \frac{1}{8} + \sum _{k=1}^n \frac{(-1)^k-1}{2\pi ^2 k^2} - \sum _{k=1}^n \frac{(-1)^k - 1}{2(n+1)\pi ^2 k} \right] \\&=2\left[ \frac{1}{8} - \sum _{j=0}^{\left\lfloor \frac{n-1}{2}\right\rfloor } \frac{1}{\pi ^2 (2j+1)^2} + \sum _{j=0}^{\left\lfloor \frac{n-1}{2}\right\rfloor } \frac{1}{(n+1)\pi ^2 (2j+1)} \right] \end{aligned}$$

since \(\int _0^{1/2} \cos (2\pi kx) x \textrm{d}x = ((-1)^k - 1)/(4\pi ^2 k^2)\). Using that \(\sum _{j=0}^{\infty } \frac{1}{(2j+1)^2}=\frac{\pi ^2}{8}\), we obtain

$$\begin{aligned} \int _{\mathbb {T}} F_n(x) |x|_1 \textrm{d}x&=2\left[ \frac{1}{\pi ^2} \sum _{j=\left\lfloor \frac{n+1}{2} \right\rfloor }^{\infty } \frac{1}{(2j+1)^2} + \frac{1}{(n+1)\pi ^2} \sum _{j=0}^{\left\lfloor \frac{n-1}{2} \right\rfloor } \frac{1}{2j+1} \right] \\&\le \frac{2}{\pi ^2} \left[ \frac{1}{\left( 2\left\lfloor \frac{n+1}{2}\right\rfloor +1\right) ^2} + \int _{\left\lfloor \frac{n+1}{2} \right\rfloor }^\infty \frac{1}{(2y+1)^2} \textrm{d}y + \hspace{-3pt} \frac{1+ \hspace{-3pt}\int _{0}^{\left\lfloor \frac{n-1}{2} \right\rfloor } \frac{1}{2y+1} \textrm{d}y}{n+1}\right] \\&\le \frac{\frac{2}{\left( 2\left\lfloor \frac{n+1}{2}\right\rfloor +1\right) }+1}{\left( 2\left\lfloor \frac{n+1}{2}\right\rfloor +1\right) \pi ^2}+ \frac{2+\log (n)}{(n+1)\pi ^2} \le \frac{\log (n)+4}{\pi ^2 (n+1)}. \end{aligned}$$

The lower bound follows similarly by bounding the series from the previous calculation by integrals from below. \(\square \)
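The bounds of Lemma 3.2 are easily checked numerically; a sketch for \(d=1\) using a midpoint rule (names and grid size are ours, chosen ad hoc):

```python
import numpy as np

def fejer_first_moment(n, grid=200001):
    """Midpoint-rule approximation of int_T F_n(x) |x|_1 dx for d = 1,
    using the closed form (3.1) of the Fejér kernel."""
    x = (np.arange(grid) + 0.5) / grid
    F = (np.sin((n + 1) * np.pi * x) / np.sin(np.pi * x)) ** 2 / (n + 1)
    return np.mean(F * np.minimum(x, 1 - x))

n = 50
val = fejer_first_moment(n)
lower = (np.log(n + 1) / (n + 1) + 1 / (n + 2)) / np.pi**2
upper = (np.log(n) + 4) / (np.pi**2 * (n + 1))
assert lower <= val <= upper  # consistent with Lemma 3.2
```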

Theorem 3.3

Let \(d,n\in \mathbb {N}\) and \(\mu \in \mathcal {M}\), then the measure with density \(p_n\) converges weakly to \(\mu \) with

$$\begin{aligned} W_1(p_n,\mu )\le \frac{d}{\pi ^2}\frac{\log (n) + 4}{n+1} \cdot \Vert {\mu }\Vert _{\text {TV}} , \end{aligned}$$

which is sharp since \(\mu \in \mathcal {M}_+\) implies \(\Vert {\mu }\Vert _{\text {TV}}=\mu (\mathbb {T}^d)=1\) and

$$\begin{aligned} \sup _{\mu \in \mathcal {M}} W_1(p_n,\mu ) \ge \frac{d}{\pi ^2} \left( \frac{\log (n+1)}{n+1}+\frac{1}{n+2}\right) . \end{aligned}$$

Proof

We compute

$$\begin{aligned} W_1(p_n,\mu )&=\sup _{\mathop {\textrm{Lip}}\limits (f)\le 1} \left| \langle F_n*\mu ,f\rangle -\langle \mu ,f\rangle \right| \\&=\sup _{\mathop {\textrm{Lip}}\limits (f)\le 1} \left| \langle \mu ,F_n*f-f\rangle \right| \\&\le \sup _{\mathop {\textrm{Lip}}\limits (f)\le 1} \int _{\mathbb {T}^d} \int _{\mathbb {T}^d} F_n(x) \left| f(y-x)-f(y)\right| \textrm{d}x \textrm{d}|\mu |(y) \\&\le \Vert {\mu }\Vert _{\text {TV}} \int _{\mathbb {T}^d} F_n(x) |x|_1 \textrm{d}x, \end{aligned}$$

and note that both inequalities become equalities when choosing \(\mu =\delta _0\) and \(f(x)=|x|_1\). Applying Lemma 3.2 gives the result. In particular, we remark in passing that \(W_1(F_n,\delta _0)= \int _{\mathbb {T}^d} F_n(x) |x|_1 \textrm{d}x\). \(\square \)

Remark 3.4

Similarly to classical results, the \(\log \)-factor can be removed by choosing another convolution kernel, which, however, does not allow for the representation later found in Lemma 4.1. The Jackson kernel, see [28, pp. 2 ff.],

$$\begin{aligned} J_{2m-2}(x)=\frac{3}{m(2m^2+1)} \frac{\sin ^4(m\pi x)}{\sin ^4(\pi x)}, \quad m\in \mathbb {N}, \end{aligned}$$

has degree \(n=2m-2\), is normalised, and, using \(\frac{\sin (m\pi x)}{\sin (\pi x)}\le \min \left( m,\frac{1}{2x}\right) \) for \(x\in \left( 0,\frac{1}{2}\right] \), satisfies

$$\begin{aligned} \int _{\mathbb {T}} J_n(x) |x|_1 \textrm{d}x&\le \frac{6}{m(2m^2+1)} \left[ \int _0^{1/2m} m^4 x \textrm{d}x+ \int _{1/2m}^{\infty } \frac{1}{16x^3} \textrm{d}x\right] \\&\le \frac{3m}{2(2m^2+1)} \le \frac{3}{2(n+2)}. \end{aligned}$$

Analogously to Theorem 3.3, we get

$$\begin{aligned} W_1(J_n*\mu ,\mu )\le \frac{3}{2} \frac{d \cdot \Vert {\mu }\Vert _{\text {TV}}}{n+2}, \end{aligned}$$
(3.5)

which is still approximately a factor of 6 worse than the lower bound in the univariate case (see Theorem 3.6). A numerical or more detailed analysis of the above estimate shows that a factor of 3 is due to the estimate itself, while the remaining factor of 2 seems to indicate that the Jackson kernel is not optimal. Moreover, the upper and lower bounds differ by a factor of d in the multivariate case, which might be due to the norms used or to our proof techniques.
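The first absolute moment of the Jackson kernel, and hence the constant in (3.5), can be checked in the same fashion (a sketch with ad hoc grid size; the corresponding Fejér value from Lemma 3.2 is printed for comparison):

```python
import numpy as np

m = 26                                 # Jackson degree n = 2m - 2 = 50
n = 2 * m - 2
x = (np.arange(200001) + 0.5) / 200001
jackson = 3 / (m * (2 * m**2 + 1)) * (np.sin(m * np.pi * x) / np.sin(np.pi * x)) ** 4
fejer = (np.sin((n + 1) * np.pi * x) / np.sin(np.pi * x)) ** 2 / (n + 1)
mom = lambda K: np.mean(K * np.minimum(x, 1 - x))  # int_T K(x)|x|_1 dx
assert mom(jackson) <= 3 / (2 * (n + 2))           # the bound used for (3.5)
print(mom(jackson), mom(fejer))                    # Jackson avoids the log factor
```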

3.2 Saturation

Theorem 3.3 gives a worst case lower bound while, on the other hand, the Lebesgue measure is approximated by \(F_n*\lambda =\lambda \) without any error. We may thus ask how well a measure \(\textrm{d}\mu =w(x) \textrm{d}x\) with smooth (nonnegative) density might be approximated. For an introductory example, consider the univariate analytic density \(w(x)=1+\cos (2\pi x)\). Since \(w(x)-(F_n*w)(x)=\cos (2\pi x)/(n+1)\), testing with the Lipschitz function \(f(x)=\cos (2\pi x)/(2\pi )\) yields

$$\begin{aligned} W_1(F_n*w,w)&\ge \frac{1}{2\pi (n+1)}\int _{\mathbb {T}}\cos ^2(2\pi x)\textrm{d}x=\frac{1}{4\pi (n+1)}. \end{aligned}$$

This effect is called saturation (cf. e.g. [5]). In greater generality, such a lower bound holds for each measure individually and can be inferred from a nice relationship between the Wasserstein distance and a discrepancy, cf. [18].

Theorem 3.5

For each individual measure \(\mu \in \mathcal {M}\) different from the Lebesgue measure, there is a constant \(c>0\) such that we have for all \(n\in \mathbb {N}\)

$$\begin{aligned} W_1(p_n,\mu )\ge \frac{c}{n+1}. \end{aligned}$$

Proof

Let \({\hat{h}}\in \ell ^2(\mathbb {Z}^d)\), \({\hat{h}}(k)\in \mathbb {R}\setminus \{0\}\), \({\hat{h}}(k)={\hat{h}}(-k)\), and consider the reproducing kernel Hilbert space

$$\begin{aligned} H=\{f\in L^2(\mathbb {T}^d): \sum _{k\in \mathbb {Z}^d} |{\hat{h}}(k)|^{-2} |{\hat{f}}(k)|^2 <\infty \},\qquad \Vert f\Vert _{H}^2=\sum _{k\in \mathbb {Z}^d} |{\hat{h}}(k)|^{-2} |{\hat{f}}(k)|^2. \end{aligned}$$

Given two measures \(\mu ,\nu \), their discrepancy (which also depends on the space H) is defined by

$$\begin{aligned} {\mathcal {D}}(\mu ,\nu )=\sup _{\Vert f\Vert _{H}\le 1} \left| \int _{\mathbb {T}^d} f~\textrm{d}(\mu - \nu )\right| =\sup _{\Vert f\Vert _{H} \le 1} \left| \sum _{k \in {\mathbb {Z}}^d}\frac{{\widehat{f}}(k)}{{\widehat{h}}(k)} {\widehat{h}}(k) \widehat{\mu -\nu }(k)\right| =\Vert {\widehat{h}} \cdot \widehat{\mu -\nu }\Vert _{\ell ^{2}} \end{aligned}$$

and fulfils, by the arithmetic-geometric mean inequality,

$$\begin{aligned} {\mathcal {D}}(p_n,\mu )^2&=\sum _{\Vert k\Vert _{\infty }\le n} |{\hat{h}}(k)|^{2} \left| 1-\prod _{\ell =1}^d \left( 1-\frac{|k_\ell |}{n+1}\right) \right| ^2 |\hat{\mu }(k)|^2 + \sum _{\Vert k\Vert _{\infty }> n} |{\hat{h}}(k)|^{2}|\hat{\mu }(k)|^2 \\&\ge \sum _{\Vert k\Vert _{\infty }\le n} |{\hat{h}}(k)|^{2} \left| \frac{\Vert k\Vert _1}{d(n+1)}\right| ^2 |\hat{\mu }(k)|^2 + \sum _{\Vert k\Vert _{\infty }> n} |{\hat{h}}(k)|^{2}|\hat{\mu }(k)|^2 \\&=\sum _{\Vert k\Vert _{\infty }\le n} |{\hat{h}}(k)|^{2} \left| \frac{\Vert k\Vert _1}{d(n+1)}\right| ^2 |\hat{\mu }(k)-\hat{\lambda }(k)|^2 + \sum _{\Vert k\Vert _{\infty }> n} |{\hat{h}}(k)|^{2}|\hat{\mu }(k)-\hat{\lambda }(k)|^2 \\&\ge \frac{1}{d^2(n+1)^2} \Vert h*(\mu -\lambda )\Vert _{L^2(\mathbb {T}^d)}^2 \end{aligned}$$

where \(h(x)=\sum _{k\in \mathbb {Z}^d} {\hat{h}}(k) \text {e}^{2\pi i {kx}}\) and \(\lambda \) denotes the Lebesgue measure with \({\hat{\lambda }}(0)=1\) and \({\hat{\lambda }}(k)=0\) for \(k\in \mathbb {Z}^d\setminus \{0\}\). Our second ingredient is a Lipschitz estimate that quantifies the Lipschitz constant of any \(f\in H\) with \(\Vert f\Vert _{H}\le 1\). For such a function f, the Cauchy-Schwarz inequality together with \(\left| e^{2 \pi i k y}-e^{2 \pi i k(y+x)}\right| ^{2}=2(1-\cos (2 \pi k x))\) gives

$$\begin{aligned} |f(y)-f(y+x)|^2&=\left| \sum _{k\in \mathbb {Z}^d} {\hat{f}}(k) \left( \text {e}^{2\pi i {ky}}-\text {e}^{2\pi i {k(y+x)}}\right) \right| ^2\\&\le \Vert f\Vert _{H}^2 \sum _{k\in \mathbb {Z}^d} \left| \text {e}^{2\pi i {ky}}-\text {e}^{2\pi i {k(y+x)}}\right| ^2 |{\hat{h}}(k)|^2\\&=\Vert f\Vert _{H}^2 \sum _{k\in \mathbb {Z}^d} (2 - 2\cos (2\pi kx)) |{\hat{h}}(k)|^2\\&\le 2\left( K(x,x)-K(x,0)\right) , \end{aligned}$$

where \(K(x,y)=\sum _{k\in \mathbb {Z}^d} |{\hat{h}}(k)|^2 \text {e}^{2\pi i {k(x-y)}}=(h*h)(x-y)\) denotes the so-called reproducing kernel of the space H. If this kernel is of the form \(K(x,y)=h^{[4]}(x_1-y_1)\cdot \ldots \cdot h^{[4]}(x_d-y_d)\) for some nonnegative univariate function \(h^{[4]}\in C^2(\mathbb {T})\) attaining its maximum at zero (and thus \(\left( h^{[4]}\right) '(0)=0\)), we find by a telescoping sum and a Taylor expansion

$$\begin{aligned} {K(x,x)-K(x,0)}&= \prod _{\ell =1}^dh^{[4]}(0)-\prod _{\ell =1}^d h^{[4]}(x_\ell ) \\&= \sum _{\ell =1}^d \left( h^{[4]}(0)^\ell \prod _{k=1}^{d-\ell } h^{[4]}(x_k)-h^{[4]}(0)^{\ell -1}\prod _{k=1}^{d-\ell +1} h^{[4]}(x_k)\right) \\&=\sum _{\ell =1}^d \left( h^{[4]}(0)^{\ell -1} \left( h^{[4]}(0)-h^{[4]}(x_{d-\ell +1})\right) \prod _{k=1}^{d-\ell } h^{[4]}(x_k)\right) \\&\le \sum _{\ell =1}^d \Vert h^{[4]}\Vert ^{d-1}_{\infty } \left[ h^{[4]}(0)-h^{[4]}(x_{d-\ell +1})\right] \\&\le \frac{1}{2} \Vert h^{[4]}\Vert ^{d-1}_{\infty } \left\| \left( h^{[4]}\right) ''\right\| _{\infty } |x|_2^2. \end{aligned}$$

To make a specific choice, let \(a\in (0,\frac{1}{8})\) be some irrational number and set

$$\begin{aligned} h^{[2]}(x)= \sum _{k\in \mathbb {Z}} (\chi _{[-a,a]}*\chi _{[-a,a]})(x+k), \quad x\in \mathbb {T}\end{aligned}$$

as the periodisation of the convolution of the indicator function of \([-a,a]\) with itself. Based on this, we set \(h^{[4]}=h^{[2]}*h^{[2]}\) and \(h(x_1,\ldots ,x_d)=h^{[2]}(x_1)\cdot \ldots \cdot h^{[2]}(x_d)\), which yields a valid kernel. Consequently, every \(f\in H\) with \(\Vert f\Vert _H\le 1\) satisfies \(\mathop {\textrm{Lip}}\limits (f)\le c'_{d,a}\) for some constant \(c'_{d,a}>0\) depending on the dimension d and the parameter a. This allows us to conclude

$$\begin{aligned} W_1(p_n,\mu ) \ge {c'}^{-1}_{d,a} {\mathcal {D}}(p_n,\mu ) \ge \frac{{c'}^{-1}_{d,a}}{d(n+1)} \Vert h*(\mu -\lambda )\Vert _{L^2(\mathbb {T}^d)}=:\frac{c}{n+1} \end{aligned}$$

for some \(c\in \mathbb {R}\). Since a is irrational, we can directly see by Parseval’s theorem that \(\Vert h*(\mu -\lambda )\Vert _{L^2(\mathbb {T}^d)}=0\) if and only if \(\mu =\lambda \). For \(\mu \ne \lambda \), we obtain the statement with a positive constant c depending on the measure \(\mu \), the constant a, and the spatial dimension d. \(\square \)

3.3 Best Approximation and Lower Bounds

After observing upper (Sect. 3.1) and lower bounds on the approximation by \(p_n=F_n*\mu \) for individual measures \(\mu \) (Sect. 3.2), one might ask whether an approximation rate faster than \({\mathcal {O}}(n^{-1})\) is possible by some general polynomial approximation. The following theorem shows that the answer to this question is negative, as the best approximation by a normalised polynomial only yields an \({\mathcal {O}}(n^{-1})\) worst-case rate.

Theorem 3.6

For any \(d,n\in \mathbb {N}\) and every \(\mu \in \mathcal {M}\), there exists a polynomial of degree n of best approximation in the 1-Wasserstein distance among all polynomials of degree n. Moreover, we have

$$\begin{aligned} \sup _{\mu \in \mathcal {M}} \min _{{\genfrac{}{}{0.0pt}{}{p\in \mathcal {P}_n}{{\hat{p}}(0)=1}}} \frac{W_1(p,\mu )}{\Vert {\mu }\Vert _{\text {TV}}} \ge \frac{1}{4(n+1)}. \end{aligned}$$

Proof

We have existence of a best approximation by polynomials in the Banach space of Borel measures with finite total variation. For the lower bound, we compute

$$\begin{aligned} \sup _{\mu \in \mathcal {M}} \min _{{\genfrac{}{}{0.0pt}{}{p\in \mathcal {P}_n}{{\hat{p}}(0)=1}}} W_1(p,\mu )&\ge \min _{{\genfrac{}{}{0.0pt}{}{p\in \mathcal {P}_n}{{\hat{p}}(0)=1}}} W_1(p,\delta _0) \\&= \min _{{\genfrac{}{}{0.0pt}{}{p\in \mathcal {P}_n}{{\hat{p}}(0)=1}}} \sup _{f: \mathop {\textrm{Lip}}\limits (f)\le 1} \left| f(0)-\int _{\mathbb {T}^d} f(x)p(x) \textrm{d}x\right| \\&=\min _{{\genfrac{}{}{0.0pt}{}{p\in \mathcal {P}_n}{{\hat{p}}(0)=1}}} \sup _{{\genfrac{}{}{0.0pt}{}{f: \Vert f\Vert _{\infty }\le \frac{d}{2}}{\mathop {\textrm{Lip}}\limits (f)\le 1}}} \left\| f-{\check{p}}*f\right\| _{\infty } \\&\ge \sup _{{\genfrac{}{}{0.0pt}{}{f: \Vert f\Vert _{\infty }\le \frac{d}{2}}{\mathop {\textrm{Lip}}\limits (f)\le 1}}} \min _{p\in \mathcal {P}_n} \Vert f-p\Vert _{\infty }, \end{aligned}$$

where we added a suitable constant to obtain the last equality, and \({\check{p}}\) denotes the reflection of p, i.e. \({\check{p}}(x)=p(-x)\) for all \(x\in \mathbb {T}^d\). It remains to find the worst case error of the best approximation of a Lipschitz function by a trigonometric polynomial of degree n. While this is well understood for \(d=1\) (cf. [1, 20]), we did not find a reference discussing whether and how the case \(d>1\) can be treated as well. Therefore, we show that the idea of [21] for the case \(d=1\) also works for \(d>1\) in our situation. A main ingredient of Fisher’s proof is the duality relation

$$\begin{aligned} \inf _{x\in Y\subset X} \Vert x_0-x\Vert = \sup _{{\genfrac{}{}{0.0pt}{}{\ell \in X^*}{\ell |_Y=0, \Vert \ell \Vert _{X^*}\le 1} }} |\ell (x_0)| \end{aligned}$$

for a Banach space X, \(x_0\in X\), with a subspace Y and dual space \(X^*\). A second ingredient is given by the 1-periodic Bernoulli spline of degree 1, i.e.,

$$\begin{aligned} {\mathcal {B}}_1(x)=\sum _{k\in \mathbb {Z}\setminus \{0\}} \frac{\text {e}^{2\pi i {kx}}}{2\pi i k} = \sum _{k=1}^\infty \frac{\sin (2\pi k x)}{\pi k} = {\left\{ \begin{array}{ll} \frac{1}{2}-x,&{}\quad x\in (0,1), \\ 0,&{}\quad x\in \{0,1\}.\end{array}\right. } \end{aligned}$$
(3.6)

A Lipschitz continuous and 1-periodic function \(f:\mathbb {T}\rightarrow \mathbb {R}\) with \(\mathop {\textrm{Lip}}\limits (f)\le 1\) has a derivative \(f'\) almost everywhere, and this derivative satisfies \(\int _{\mathbb {T}} f'(s)\,\textrm{d}s=0\) by the periodicity of f. Therefore, it follows that

$$\begin{aligned} \left( f'*{\mathcal {B}}_1\right) (t)&=\int _{\mathbb {T}} f'(s) {\mathcal {B}}_1(t-s) \textrm{d}s \nonumber \\&= -\int _0^{t} (t-s) f'(s) \textrm{d}s -\int _{t}^1 (t-s+1) f'(s) \textrm{d}s \nonumber \\&=f(t)-\int _0^1 f(s) \textrm{d}s \end{aligned}$$
(3.7)

for \(0<t,s\le 1\). The dual space of the space of continuous periodic functions is the space of periodic finite regular Borel measures equipped with the total variation norm and the duality formulation gives

$$\begin{aligned} \sup _{{\genfrac{}{}{0.0pt}{}{f:\mathbb {T}^d\rightarrow \mathbb {R}}{\Vert f\Vert _{\infty }\le \frac{d}{2}, \mathop {\textrm{Lip}}\limits (f)\le 1}}}\min _{p\in \mathcal {P}_n} \Vert f-p\Vert _{\infty }&= \sup _{{\genfrac{}{}{0.0pt}{}{f:\mathbb {T}^d\rightarrow \mathbb {R}}{\Vert f\Vert _{\infty }\le \frac{d}{2}, \mathop {\textrm{Lip}}\limits (f)\le 1}}}\sup _{{\genfrac{}{}{0.0pt}{}{\hat{\mu }(k) =0, \Vert k\Vert _{\infty }\le n}{\Vert {\mu }\Vert _{\text {TV}}\le 1} }} \left| \int _{\mathbb {T}^d} f(x) \textrm{d}\mu (x)\right| . \end{aligned}$$

Our main contribution to this result is the observation of how to transfer the multivariate setting back to the univariate one. It is easy to verify that \(f(x)=\frac{1}{d}\sum _{\ell =1}^d f_0(x_\ell )\) for a univariate Lipschitz function \(f_0\) with \(\mathop {\textrm{Lip}}\limits (f_0)\le d\) and \(\Vert f_0\Vert _{\infty }\le \frac{d}{2}\) fulfils the conditions on the Lipschitz function f. Additionally, \(\mu ^*=\frac{1}{d}\sum _{s=1}^d \mu _s\) with \(\mu _s=\left( \bigotimes _{\ell \ne s} \lambda (x_\ell )\right) \otimes \mu _0^*(x_s)\),

$$\begin{aligned} \mu _0^*(x_s)=\frac{1}{2(n+1)} \sum _{j=0}^{2n+1} (-1)^j \delta _{j/(2n+2)}(x_s) \end{aligned}$$

and \(\lambda \) being the Lebesgue measure on \(\mathbb {T}\) is admissible. Since this choice of \(\mu _s\) satisfies \(\int g \,\textrm{d}\mu _s=0\) if g is constant with respect to \(x_s\) (and the same holds for constant univariate functions integrated against \(\mu _0^*\)), we obtain with (3.7)

$$\begin{aligned} \sup _{{\genfrac{}{}{0.0pt}{}{f:\mathbb {T}^d\rightarrow \mathbb {R}}{\Vert f\Vert _{\infty }\le \frac{d}{2}, \mathop {\textrm{Lip}}\limits (f)\le 1}}}\min _{p\in \mathcal {P}_n} \Vert f-p\Vert _{\infty }&\ge \sup _{{\genfrac{}{}{0.0pt}{}{f_0:\mathbb {T}\rightarrow \mathbb {R}}{\Vert f_0\Vert _{\infty }\le \frac{d}{2}, \mathop {\textrm{Lip}}\limits (f_0)\le d}}} \left| \frac{1}{d^2} \sum _{s,\ell =1}^d \int _{\mathbb {T}^d} f_0(x_\ell ) \textrm{d}\mu _s(x)\right| \\&=\sup _{{\genfrac{}{}{0.0pt}{}{f_0:\mathbb {T}\rightarrow \mathbb {R}}{\Vert f_0\Vert _{\infty }\le \frac{d}{2}, \mathop {\textrm{Lip}}\limits (f_0)\le d}}} \left| \frac{1}{d^2} \sum _{\ell =1}^d \int _{\mathbb {T}} f_0(x_\ell ) \textrm{d}\mu _0^*(x_\ell )\right| \\&=\sup _{{\genfrac{}{}{0.0pt}{}{f_0:\mathbb {T}\rightarrow \mathbb {R}}{\Vert f_0\Vert _{\infty }\le \frac{d}{2}, \mathop {\textrm{Lip}}\limits (f_0)\le d}}} \frac{1}{d} \left| \int _{\mathbb {T}} f_0'(s) \left( \int _{\mathbb {T}}{\mathcal {B}}_1(t-s) \textrm{d}\mu _0^*(t)\right) \textrm{d}s\right| . \end{aligned}$$

We denote \({\mathcal {B}}_{\mu *}(s)= \int _{\mathbb {T}}{\mathcal {B}}_1(t-s) \textrm{d}\mu _0^*(t)\) and observe \(\int _{\mathbb {T}} {\mathcal {B}}_{\mu *}(s) \textrm{d}s=0\). Moreover, \(\mu _0^*\) has moments \(\hat{\mu }_0^*(k) = 1\) for \(k\in (n+1)\left( 2\mathbb {Z}+1\right) \) and \(\hat{\mu }_0^*(k)=0\) otherwise. Together with the Fourier representation (3.6) of \({\mathcal {B}}_1\) where one rewrites the sum over odd integers as the difference between the sum over all nonzero integers and the sum of all nonzero even integers, this gives

$$\begin{aligned} {\mathcal {B}}_{\mu *}(s)&= \sum _{k\in \mathbb {Z}\setminus \{0\}} \frac{\text {e}^{2\pi i {k(n+1)s}}}{2\pi i k(n+1)} - \sum _{k\in \mathbb {Z}\setminus \{0\}} \frac{\text {e}^{2\pi i {2k(n+1)s}}}{2\pi i 2k(n+1)} \\&= \frac{1}{n+1} {\mathcal {B}}_1((n+1)s)- \frac{1}{2n+2} {\mathcal {B}}_1((2n+2)s) \\&= {\left\{ \begin{array}{ll} \frac{1}{4} \frac{1}{n+1},&{}\quad (n+1)s-\lfloor (n+1)s\rfloor \in \left( 0,\frac{1}{2}\right) , \\ -\frac{1}{4}\frac{1}{n+1},&{}\quad (n+1)s-\lfloor (n+1)s\rfloor \in \left( \frac{1}{2},1\right) , \\ 0,&{}\quad (2n+2)s\in \{0,\dots ,2n+1\}. \end{array}\right. } \end{aligned}$$

Here, the last equality is a direct consequence of (3.6). Now, we choose \(f_0\) by taking \(f'_0(s)=d\cdot {\text {sgn}}({\mathcal {B}}_{\mu *}(s)) \) and \(f_0(0)=0\) which is possible as it yields

$$\begin{aligned} \Vert f_0\Vert _{\infty }=\max _{x\in \mathbb {T}} |f_0(x)-f_0(0)|\le \max _{x\in \mathbb {T}} d |x|_{1} =\frac{d}{2} \quad \text {and} \quad \int _{\mathbb {T}} f_0'(s) \textrm{d}s=0. \end{aligned}$$

Finally, we end up with

$$\begin{aligned} \sup _{{\genfrac{}{}{0.0pt}{}{f:\mathbb {T}^d\rightarrow \mathbb {R}}{\Vert f\Vert _{\infty }\le \frac{d}{2}, \mathop {\textrm{Lip}}\limits (f)\le 1}}}\min _{p\in \mathcal {P}_n} \Vert f-p\Vert _{\infty }&\ge \int _{\mathbb {T}} \left| {\mathcal {B}}_{\mu *}(s)\right| \textrm{d}s \\&= \frac{1}{n+1} \int _{\mathbb {T}} \left| {\mathcal {B}}_1((n+1)s)-\frac{1}{2}{\mathcal {B}}_1((2n+2)s)\right| \textrm{d}s \\&= \frac{1}{n+1} \int _{\mathbb {T}} \left| {\mathcal {B}}_1(s)-\frac{1}{2}{\mathcal {B}}_1(2s)\right| \textrm{d}s = \frac{1}{4(n+1)} \end{aligned}$$

and this was the claim. \(\square \)

Remark 3.7

(Information theoretic point of view) One should distinguish the above result on the best approximation by a polynomial with given degree n from the question of how well one can recover any measure given its low order trigonometric moments. While the polynomial approximation calculated in the framework of Theorem 3.6 is based on the knowledge of all moments, the latter information theoretic question would only consider the moments \(\hat{\mu }(k)\) for \(k\in \{-n,\dots ,n\}^d\). A lower bound can be reformulated as the largest difference

$$\begin{aligned} \sup \left\{ W_1(\mu ,\nu ): \mu ,\nu \in {\mathcal {M}},\;{\hat{\nu }}(k)={\hat{\mu }}(k)\text { for }k\in \{-n,\ldots ,n\}^d\right\} \end{aligned}$$
(3.8)

between two measures, which have equal low order moments and cannot be distinguished by a recovery algorithm if no additional prior is known. If \({\hat{\mu }}\) and \({\hat{\nu }}\) are equal up to order n, then convolution with the Jackson kernel yields \(J_n*\mu =J_n*\nu \), so that the triangle inequality for \(W_1\) and Remark 3.4 give

$$\begin{aligned} W_1(\mu ,\nu )\le W_1(\mu ,J_n*\mu )+W_1(\nu ,J_n*\nu ) \le \frac{3d}{2} \frac{\Vert {\mu }\Vert _{\text {TV}} +\Vert {\nu }\Vert _{\text {TV}}}{n+2}, \end{aligned}$$

and thus (3.8) is at most of order \({\mathcal {O}}(n^{-1})\). This order is also optimal, as can be seen by choosing \(\mu \) as the Lebesgue measure \(\lambda \), \(\nu \) absolutely continuous with \(\textrm{d}\nu (x_1,\dots ,x_d)= \left[ 1+\cos (2\pi (n+1) x_1)\right] \textrm{d}\lambda (x_1,\dots ,x_d)\), and \(f(x)=\cos (2\pi (n+1) x_1) / (2\pi (n+1))\) in

$$\begin{aligned} W_1\left( \mu ,\nu \right)&= \sup _{f:\,\mathop {\textrm{Lip}}\limits (f)\le 1} \int _{{\mathbb {T}}^d} f(x) \cos (2\pi (n+1) x_1) \textrm{d}x \\&\ge \int _{{\mathbb {T}}^d} \frac{\cos ^2(2\pi (n+1) x_1)}{2\pi (n+1)} \textrm{d}x = \frac{1}{4\pi (n+1)}. \end{aligned}$$

This shows that the knowledge of the Fourier coefficients of a measure up to order n without any prior assumption on the measure only allows to approximate the measure with worst case error of order \(n^{-1}\). This worst case error rate can be decreased if prior knowledge on the ground truth measure, e.g. sparsity (see [24]), is assumed.
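The pair of measures used in this argument is easily verified numerically; the following sketch checks that the density \(1+\cos (2\pi (n+1)x)\) and the Lebesgue measure share all moments up to order n (midpoint rule, \(d=1\), names are ours):

```python
import numpy as np

n, grid = 10, 4096
k = np.arange(-n, n + 1)
x = (np.arange(grid) + 0.5) / grid
density = 1 + np.cos(2 * np.pi * (n + 1) * x)
nu_hat = np.mean(density * np.exp(-2j * np.pi * np.outer(k, x)), axis=1)
lambda_hat = (k == 0).astype(float)      # moments of the Lebesgue measure
assert np.allclose(nu_hat, lambda_hat)   # indistinguishable up to order n
```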

3.4 Univariate Situation and Uniqueness of Best Approximation

On the univariate torus \(\mathbb {T}\), the Wasserstein distance of two probability measures can be rewritten as an \(L^1\) distance of their cumulative distribution functions (CDFs), shifted by some constant depending on the measures, see [6]. We extend this to real signed measures belonging to \({\mathcal {M}}_\mathbb {R}\).

Lemma 3.8

(Wasserstein via CDF) For any univariate \(\mu ,\nu \in {\mathcal {M}}_\mathbb {R}\), we have

$$\begin{aligned} W_1(\mu ,\nu )=\int _0^1 |\mu ([0,x])-\nu ([0,x])-c^*(\mu ,\nu )| \textrm{d}x, \end{aligned}$$

where \(c^*(\mu ,\nu )\in \mathbb {R}\) is a constant depending only on \(\mu \) and \(\nu \).

Proof

For \(\mu ,\nu \in {\mathcal {M}}_+(\mathbb {T})\) this is [6, Thm. 3.7]. For \(\mu ,\nu \in {\mathcal {M}}_\mathbb {R}\), we can use the Jordan decomposition of any signed measure as a difference of nonnegative measures, in other words we write \(\mu =\mu _+-\mu _{-}\), \(\nu =\nu _+-\nu _{-}\) for \(\mu _+,\mu _-,\nu _+,\nu _-\) being nonnegative measures on \(\mathbb {T}\) and rewrite

$$\begin{aligned} W_1(\mu ,\nu )&=\sup _{f: \mathop {\textrm{Lip}}\limits (f)\le 1} \left| \int _{\mathbb {T}} f(x)\left[ \textrm{d}\nu _+(x) +\textrm{d}\mu _-(x)-(\textrm{d}\nu _-(x)+\textrm{d}\mu _+(x))\right] \right| \nonumber \\&= (\nu _+ +\mu _-)(\mathbb {T}) \sup _{f: \mathop {\textrm{Lip}}\limits (f)\le 1} \left| \int _{\mathbb {T}} f(x)\left[ \frac{\textrm{d}\nu _+(x) +\textrm{d}\mu _-(x)}{(\nu _+ +\mu _-)(\mathbb {T})}-\frac{\textrm{d}\nu _-(x)+\textrm{d}\mu _+(x)}{(\nu _+ +\mu _-)(\mathbb {T})}\right] \right| \end{aligned}$$
(3.9)

and this allows us to apply [6, Thm. 3.7]. This then gives

$$\begin{aligned}&(\nu _+ +\mu _-)(\mathbb {T}) \int _0^1 \left| \left( \frac{ \nu _+ +\mu _--(\nu _-+\mu _+)}{(\nu _+ +\mu _-)(\mathbb {T})}\right) ([0,x])\right. \\&\quad \left. -c^*\left( \frac{ \nu _+ +\mu _-}{(\nu _+ +\mu _-)(\mathbb {T})},\frac{\nu _-+\mu _+}{(\nu _+ +\mu _-)(\mathbb {T})}\right) \right| \textrm{d}x \\&\quad = \int _0^1 \left| \left( \nu -\mu \right) ([0,x])-c^*(\nu ,\mu ) \right| \textrm{d}x \end{aligned}$$

for the Wasserstein distance of \(\mu \) and \(\nu \) where the constant \(c^*(\nu ,\mu )\) depends again only on the two measures. \(\square \)
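Lemma 3.8 is also convenient numerically: in a grid discretisation, the minimising constant is a median of the CDF difference (the median characterisation of the \(L^1\) minimiser is a standard fact). A minimal sketch with names of our own choosing:

```python
import numpy as np

def w1_torus(F_mu, F_nu):
    """W_1 on T via Lemma 3.8: min_c int |F_mu(x) - F_nu(x) - c| dx on a
    uniform grid; the minimiser c is a median of the difference."""
    h = F_mu - F_nu
    c = np.median(h)
    return np.mean(np.abs(h - c))

# Example: delta_{0.1} vs delta_{0.8}; their wrap-around distance is 0.3
x = (np.arange(100000) + 0.5) / 100000
F_mu = (x >= 0.1).astype(float)          # F_mu(x) = mu([0, x])
F_nu = (x >= 0.8).astype(float)
print(w1_torus(F_mu, F_nu))              # approximately 0.3
```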

The question of uniqueness of the best approximation can thus be characterised by the uniqueness of a best approximation in \(L^1(\mathbb {T})\), which allows for the following theorem.

Theorem 3.9

(Best approximation in the univariate case) If \(d=1\) and \(\mu ,\nu \in {\mathcal {M}}_{\mathbb {R}}\), we have

$$\begin{aligned} W_1(\nu ,\mu )= \inf _{c\in \mathbb {R}} \int _{\mathbb {T}} \left| ({\mathcal {B}}_1*\nu )(t)-({\mathcal {B}}_1*\mu )(t)-c\right| \textrm{d}t \end{aligned}$$
(3.10)

with \({\mathcal {B}}_1\) being the Bernoulli spline from (3.6). This allows us to conclude that, for any \(n\in \mathbb {N}\), any real normalised measure which does not give mass to atoms (i.e. \(\mu (\{x\})=0\) for all \(x\in \mathbb {T}\)) admits a unique best approximation by a normalised polynomial of degree n with respect to the 1-Wasserstein distance.

Proof

Let \(\mu ,\nu \in {\mathcal {M}}_\mathbb {R}\) and \({\mathcal {B}}_1\) denote the Bernoulli spline of degree 1 from the proof of Theorem 3.6, then we have by (3.7)

$$\begin{aligned} W_1(\nu ,\mu )&=\sup _{f: \mathop {\textrm{Lip}}\limits (f)\le 1} \left| \int _{\mathbb {T}} f(x)\left[ \textrm{d}\nu (x) -\textrm{d}\mu (x)\right] \right| \\&=\sup _{f: \mathop {\textrm{Lip}}\limits (f)\le 1} \left| \int _{\mathbb {T}}\int _{\mathbb {T}} f'(t){\mathcal {B}}_1(x-t)\left[ \textrm{d}\nu (x) -\textrm{d}\mu (x)\right] \textrm{d}t\right| \\&=\sup _{f: \mathop {\textrm{Lip}}\limits (f)\le 1} \left| \int _{\mathbb {T}} f'(t)\left[ ({\mathcal {B}}_1*\nu )(t)-({\mathcal {B}}_1*\mu )(t)\right] \textrm{d}t\right| . \end{aligned}$$

Since the integral over \(f'\) is zero by the periodicity of f, any \(c\in \mathbb {R}\) yields

$$\begin{aligned} \left| \int _{\mathbb {T}} f'(t)\left[ ({\mathcal {B}}_1*\nu )(t)-({\mathcal {B}}_1*\mu )(t)\right] \textrm{d}t\right|&= \left| \int _{\mathbb {T}} f'(t)\left[ ({\mathcal {B}}_1*\nu )(t)-({\mathcal {B}}_1*\mu )(t)-c\right] \textrm{d}t\right| \\&\le \inf _{c\in \mathbb {R}} \int _{\mathbb {T}} \left| ({\mathcal {B}}_1*\nu )(t)-({\mathcal {B}}_1*\mu )(t)-c\right| \textrm{d}t. \end{aligned}$$

We proceed by computing explicitly

$$\begin{aligned} ({\mathcal {B}}_1*\mu )(t)&=\int _{[0,t)\cup (t,1)} {\mathcal {B}}_1(t-x) \textrm{d}\mu (x) \nonumber \\&=\int _{[0,t)} \frac{1}{2}-(t-x) \textrm{d}\mu (x) + \int _{(t,1)} \frac{1}{2}-(t-x+1) \textrm{d}\mu (x) \nonumber \\&=\left( \frac{1}{2} -t\right) (\mu ([0,1))-\mu (\{t\}))\nonumber \\&\quad + \int _{[0,1)} x \textrm{d}\mu (x) - t \mu (\{t\}) - \mu ([0,1))+\mu ([0,t]) \nonumber \\&= \frac{\mu ([0,t))+\mu ([0,t])}{2} - \mu ([0,1))\left( t +\frac{1}{2}\right) +\int _{[0,1)} x \textrm{d}\mu (x) \end{aligned}$$
(3.11)

for \(t\in (0,1)\) and

$$\begin{aligned} ({\mathcal {B}}_1*\mu )(0)=\int _{[0,1)} x\textrm{d}\mu (x) -\frac{1}{2} \mu ([0,1)) + \frac{1}{2} \mu (\{0\}). \end{aligned}$$

On the other hand, Lemma 3.8 and (3.11) yield

$$\begin{aligned}&\int _0^1 \hspace{-0.05cm} \left| \left( \nu -\mu \right) ([0,x])-c^*(\nu ,\mu ) \right| \textrm{d}x = W_1(\nu ,\mu ) \\&\quad \le \inf _{c\in \mathbb {R}} \int _0^1 \left| \left( \nu -\mu \right) ([0,x])-\frac{\left( \nu -\mu \right) (\{x\})}{2}-c \right| \textrm{d}x \end{aligned}$$

and thus equality (3.10) holds for measures that give mass to at most countably many atoms, because in this case the set of x where the integrands of the upper and lower bounds disagree has Lebesgue measure zero. In fact, this holds for every measure, as the following argument shows. First, consider the case of a finite positive measure \(\mu \). For \(n \in {\mathbb {N}}\), consider \(N_{n}:=\left\{ x \in {\mathbb {T}}: \mu (\{x\}) \ge n^{-1}\right\} \) and observe that for any finite subset \(N^{*} \subseteq N_{n}\)

$$\begin{aligned} \infty >\mu ({\mathbb {T}}) \ge \mu \left( N^{*}\right) \ge \sum _{x \in N^{*}} \mu (\{x\}) \ge n^{-1} \cdot \# N^{*} \end{aligned}$$

and hence \(\# N^{*} \le n \cdot \mu ({\mathbb {T}})\), which then implies that \(N_{n}\) is finite with \(\# N_{n} \le n \cdot \mu ({\mathbb {T}})\). Therefore, the set

$$\begin{aligned} N:=\bigcup _{n=1}^{\infty } N_{n}=\{x \in {\mathbb {T}}: \mu (\{x\})>0\} \end{aligned}$$

is countable and the general case follows by decomposing \(\mu =\mu _{+}-\mu _{-}\).

With this knowledge, the question of approximation of \(\mu \) by p with degree n and \({\hat{p}}(0)=1\) can be rewritten as

$$\begin{aligned} \min _{{\genfrac{}{}{0.0pt}{}{p\in \mathcal {P}_n}{{\hat{p}}(0)=1}}} W_1(p,\mu )=\min _{{\genfrac{}{}{0.0pt}{}{p\in \mathcal {P}_n}{{\hat{p}}(0)=1}}} \inf _{c\in \mathbb {R}} \int _{\mathbb {T}} \left| ({\mathcal {B}}_1*p)(t)-({\mathcal {B}}_1*\mu )(t)-c\right| \textrm{d}t. \end{aligned}$$

By the assumption that \(\mu \) is atom-free, \({\mathcal {B}}_1*\mu \) is continuous by (3.11), and hence there exists a unique best \(L^1\)-approximation \(\tilde{p}\) (see e.g. [12, Thm. 3.10.9]), which determines the best approximation \(p^*\) to \(\mu \) uniquely via \(\tilde{p}={\mathcal {B}}_1*p^*-c\) and the normalisation condition \({\hat{p}}^*(0)=1\). \(\square \)

Example 3.10

Uniqueness and non-uniqueness of \(L^1\) approximation are discussed in some detail in [14, 47] and we note the following:

  1. (i)

    For \(\mu =\frac{1}{2}\delta _0-\frac{1}{2}\delta _{1/2}+\lambda \in \mathcal {M}_{\mathbb {R}}\) where \(\lambda \) is again the Lebesgue measure, one finds

    $$\begin{aligned} ({\mathcal {B}}_1*\mu )(t)=\frac{1}{2}\left( {\mathcal {B}}_1(t)-{\mathcal {B}}_1\Big (t-\frac{1}{2}\Big )\right) = {\left\{ \begin{array}{ll} 0,&{}\quad t=0,\\ \frac{1}{4}, &{} \quad t\in \big (0,\frac{1}{2}\big ), \\ 0,&{} \quad t=\frac{1}{2}, \\ -\frac{1}{4},&{}\quad t\in \big (\frac{1}{2},1\big ). \end{array}\right. } \end{aligned}$$

    For any normalised polynomial p, the difference \(({\mathcal {B}}_1*p)(t)-({\mathcal {B}}_1*\mu )(t)\) differs from \(\int _0^t p(x)\textrm{d}x-\mu ([0,t])\) by a constant, except at the discontinuity points \(t=0,\frac{1}{2}\). As these have Lebesgue measure zero, we can derive from Theorem 3.9

    $$\begin{aligned} \min _{{\genfrac{}{}{0.0pt}{}{p\in \mathcal {P}_n}{{\hat{p}}(0)=1}}} W_1(p,\mu )=\inf _{c\in \mathbb {R}} \int _{\mathbb {T}} \left| ({\mathcal {B}}_1*p)(t)-({\mathcal {B}}_1*\mu )(t)-c\right| \textrm{d}t. \end{aligned}$$
    (3.12)

    As proven in [47, Thm. 5.1], the function \({\mathcal {B}}_1*\mu \) does not have a unique \(L^1\) approximation for even n. Thus, \(\mu \) does not admit a unique best approximation either.

  2. (ii)

    For \(\mu =\delta _0\) one has \({\mathcal {B}}_1*\mu ={\mathcal {B}}_1\) such that again (3.12) holds for this choice of \(\mu \). According to [47, Lem. 2.2] this function \({\mathcal {B}}_1\) with only one jump has a unique best \(L^1\)-approximation given by the interpolation polynomial

    $$\begin{aligned} \tilde{p}(x) =\sum _{j=1}^n \frac{1}{2n+2} \cot \left( \frac{j\pi }{2n+2}\right) \sin (2\pi jx). \end{aligned}$$

    Deconvolving \(\tilde{p}={\mathcal {B}}_1*p^*\) gives

    $$\begin{aligned} p^*(x) = 1 + \sum _{j=1}^n \frac{j\pi }{n+1} \cot \left( \frac{j\pi }{2n+2}\right) \cos (2\pi jx) \end{aligned}$$

    as the unique best approximation to \(\delta _0\). Since the error of the best \(L^1\) approximation of \({\mathcal {B}}_1\) is known from a theorem by Favard [20] (e.g. this is mentioned in [12, p. 213]), we can compute

    $$\begin{aligned} W_1(\delta _0,p^*)&=\inf _{c\in \mathbb {R}} \int _{\mathbb {T}} \left| ({\mathcal {B}}_1*p^*)(t)-({\mathcal {B}}_1*\delta _0)(t)-c\right| \textrm{d}t \\&\le \left\| {\mathcal {B}}_1*p^*-{\mathcal {B}}_1\right\| _{L^1(\mathbb {T})}=\frac{1}{4(n+1)}. \end{aligned}$$

    By comparison with Theorem 3.6, we notice that equality holds in this calculation and that the bound from Theorem 3.6 is sharp.

Figure 3 and Table 1 summarize our findings on the approximation of \(\delta _0\). The best approximation \(p^*\) as well as the Dirichlet kernel \(D_n(x)=\sin ((2n+1)\pi x) /\sin (\pi x)\) are signed, with a small full width at half maximum (FWHM) but positive and negative oscillations at the sides. The latter might be seen as an unwanted artifact in applications. The approximations given by the Fejér and the Jackson kernel are nonnegative.

Fig. 3 Interpolation of \({\mathcal {B}}_1\) (left) and comparison of different polynomial approximations of degree \(n=10\) to \(\delta _0\) (right)

For completeness, we note that the Dirichlet kernel is the Fourier partial sum of \(\delta _0\) and allows for the estimate

$$\begin{aligned} W_1(\delta _0,D_n)\le W_1(\delta _0,p^*) + W_1(p^*,D_n) \le \left( 1 + \Vert D_n\Vert _1\right) W_1(\delta _0,p^*) \le \frac{\frac{4}{\pi ^2} \log (n)+{\mathcal {O}}(1)}{4(n+1)} \end{aligned}$$
(3.13)

which relies on \(W_1(p^*,D_n)= W_1(D_n*p^*,D_n*\delta _0)\le \Vert D_n\Vert _1 W_1(\delta _0,p^*)\), the well-known bound on the Lebesgue constant [5, Prop. 1.2.3], and Example 3.10 (ii).

Table 1 Convergence rates of different trigonometric polynomials approximating the Dirac delta \(\delta _0\)
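The entries of Table 1 can be reproduced with Lemma 3.8; a sketch for \(n=10\) comparing the Fejér approximation with the best approximation \(p^*\) from Example 3.10 (ii) (grid discretisation and names are ours):

```python
import numpy as np

def w1_to_delta0(p_vals):
    """W_1(p, delta_0) via Lemma 3.8, with delta_0's CDF equal to 1 on [0,1)."""
    h = np.cumsum(p_vals) / len(p_vals) - 1.0   # F_p(x) - F_{delta_0}(x)
    return np.mean(np.abs(h - np.median(h)))

n, grid = 10, 100000
x = (np.arange(grid) + 0.5) / grid
fejer = (np.sin((n + 1) * np.pi * x) / np.sin(np.pi * x)) ** 2 / (n + 1)
j = np.arange(1, n + 1)
cot = np.cos(j * np.pi / (2 * n + 2)) / np.sin(j * np.pi / (2 * n + 2))
p_star = 1 + (j * np.pi / (n + 1) * cot) @ np.cos(2 * np.pi * np.outer(j, x))
print(w1_to_delta0(fejer))                      # ~ log(n)/n, cf. Lemma 3.2
print(w1_to_delta0(p_star), 1 / (4 * (n + 1)))  # nearly coincide
```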

Remark 3.11

We close with some remarks which are specific to the univariate setting:

  1. (i)

    We stress that Theorem 3.9 allows us to compute the Wasserstein distance as an \(L^1\)-distance for real signed univariate measures. Similarly, this allows one to compute the so-called star discrepancy \(\Vert \nu ([0,\cdot ))\Vert _{\infty }\) as suggested in [44, eq. (2.1) and (2.2)]. Note, however, that (3.11) contains an additional term, so that \(\nu =\frac{1}{2}\delta _0-\frac{1}{2}\delta _{1/2}\) with \(\nu (\mathbb {T})=0\) gives

    $$\begin{aligned} \Vert \nu ([0,\cdot ))\Vert _{\infty } = \frac{1}{2} \ne \frac{1}{4} = \Vert {\mathcal {B}}_1*\nu \Vert _{\infty } \end{aligned}$$

    and thus [44, eq. (2.1) and (2.2)] needs some adjustment. More precisely, it seems that in [44] a factor \(\frac{1}{2}\) was lost, since the kth Fourier coefficient of \(\nu ([0,\cdot ))\) is \(\frac{\hat{\nu }(k)}{ i k}\), whereas \(\hat{{\mathcal {B}}_1}(k) \cdot {\hat{\nu }}(k)=\frac{1}{2} \frac{\hat{\nu }(k)}{ i k}\).

  2. (ii)

    In the univariate case, one can relate our work to a main result in [44]. As Theorem 3.9 reformulates the Wasserstein distance of two univariate measures in terms of the \(L^1\)-distance of their convolution with the Bernoulli spline, one can view this Bernoulli spline as a kernel of type \(\beta =1\) following the notation of [44]. Thus, one can take \(p=1,p'=\infty \) in [44, Thm. 4.1] yielding that the Wasserstein distance between a measure \(\mu \) and its trigonometric approximation is bounded from above by c/n. The latter agrees with our Remark 3.4 which additionally gives an explicit and small constant.

  3. (iii)

    The observation that the construction of \(p^*\) for \(\delta _0\) is possible via FFTs might lead to the idea of constructing near-best approximations to any measure \(\mu \) by interpolating \({\mathcal {B}}_1*\mu \) by some \(\tilde{p}\) and obtaining the polynomial p of near-best approximation, which satisfies \(\tilde{p}={\mathcal {B}}_1*p\), by dividing by the Fourier coefficients of the Bernoulli spline \({\mathcal {B}}_1\). A first problem is that the limited knowledge of moments only allows one to interpolate the partial Fourier sum \(S_n({\mathcal {B}}_1*\mu )\), which does not converge to \({\mathcal {B}}_1*\mu \) uniformly as \(n\rightarrow \infty \) for discrete \(\mu \). Secondly, the near-best approximation p cannot be expected to be nonnegative for a nonnegative measure \(\mu \), which is another drawback compared to convolution with nonnegative kernels like the Fejér or Jackson kernel.

4 Interpolation

While Sect. 3 focuses on weak approximations of a measure \(\mu \), in particular via convolution with smooth kernels, we consider in this section another type of polynomial estimator, denoted by \(p_{1,n}\) in (4.4), which depends non-linearly on \(\mu \) and is able to identify the support of \(\mu \) at a finite degree, under some assumptions on the latter. More precisely, the main results of this section, stated in Theorems 4.6 and 4.10 below, are quantitative rates for the pointwise convergence

$$\begin{aligned} p_{1,n}(x) \xrightarrow {n\rightarrow \infty } \chi _{V_{{\mu }}}(x)={\left\{ \begin{array}{ll} 1, &{} x\in V_{{\mu }},\\ 0, &{} \text {otherwise}\end{array}\right. } \end{aligned}$$
(4.1)

to the indicator function of the Zariski closure \(V_\mu \) of the support, i.e. the smallest algebraic variety containing \({\text {supp}}\,\mu \). After discussing algebraic properties of this estimator (Sect. 4.1), we consider separately the case of discrete measures (Sect. 4.2) and general measures (Sect. 4.3).

In the following, let \([n] :=\{0,\ldots ,n\}^d\) and \(N :=(n+1)^d\). We use bold type to designate vectors (resp. matrices) of \(\mathbb {C}^N\) (resp. \(\mathbb {C}^{N\times N}\)) only (vectors of \(\mathbb {T}^d\) or \(\mathbb {N}^d\) are left in normal type). We write

$$\begin{aligned} \varvec{e}^{(n)}_{x}:=\begin{pmatrix} \text {e}^{-2\pi i {kx}} \end{pmatrix}_{k \in [n]} \in \mathbb {C}^{N} \end{aligned}$$

for the vector containing all d-variate trigonometric monomials up to max-degree n. Unlike previously, we consider in this section causal trigonometric polynomials [15], i.e. polynomials having zero coefficients at all negative frequencies. We often identify such a polynomial \( p \in \langle \text {e}^{-2\pi i {k\cdot }}; \; k\in [n]\rangle \) with its vector of coefficients \(\varvec{p}\in \mathbb {C}^N\), i.e.

$$\begin{aligned} p(x) = {\varvec{e}^{(n)}_{x}}^* \varvec{p}\qquad \forall x \in \mathbb {T}^d. \end{aligned}$$

Note that from Parseval’s theorem, \(\Vert p\Vert _{L^2} = \Vert \varvec{p}\Vert _2\). Note also that \(\left| p \right| ^2 \in \mathcal {P}_n\).

The key object of this section is the (truncated) moment matrix associated with the unknown measure \(\mu \), defined as

$$\begin{aligned} \varvec{T}_n:=\left( {\hat{\mu }}(k-\ell )\right) _{k,\ell \in [n]} \in \mathbb {C}^{N\times N}, \end{aligned}$$
(4.2)

where \(\hat{\mu }(k)\) are the trigonometric moments of \(\mu \) (1.1).
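Assembling \(\varvec{T}_n\) from a table of moments is straightforward; a sketch (names are ours), here for a moment table given as a dictionary over \(\{-n,\ldots ,n\}^d\):

```python
import numpy as np
from itertools import product

def moment_matrix(mu_hat, n, d):
    """T_n = (mu_hat(k - l))_{k,l in [n]} from (4.2); mu_hat maps
    tuples in {-n,...,n}^d to the trigonometric moments (1.1)."""
    idx = list(product(range(n + 1), repeat=d))   # the index set [n]
    N = len(idx)
    T = np.empty((N, N), dtype=complex)
    for a, k in enumerate(idx):
        for b, l in enumerate(idx):
            T[a, b] = mu_hat[tuple(ki - li for ki, li in zip(k, l))]
    return T

# Univariate example: mu = delta_{0.3} yields a rank-one moment matrix
n = 5
mu_hat = {(k,): np.exp(-2j * np.pi * k * 0.3) for k in range(-n, n + 1)}
print(np.linalg.matrix_rank(moment_matrix(mu_hat, n, 1)))  # 1
```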

4.1 Algebraic Considerations

It is well-known that the range and kernel of the matrix (4.2) reveal some of the structure of the measure hidden behind the moments, and methods that aim at recovering \(\mu \) using purely algebraic manipulations on \(\varvec{T}_n\) are often referred to as subspace methods, e.g. MUSIC [60], ESPRIT [55] or matrix pencils [26]. The starting point for these methods is often the singular value decomposition of \(\varvec{T}_n\), which we denote by

$$\begin{aligned} \varvec{T}_n= \varvec{U}_n \varvec{\Sigma }_n \varvec{V}_n^* = \sum _{j=1}^N \sigma _j^{(n)}\varvec{u}_j^{(n)}{\varvec{v}_j^{(n)}}^*, \end{aligned}$$

where all matrices are of size \(N\times N\), \(\varvec{u}_j^{(n)}\) and \(\varvec{v}_j^{(n)}\) are the j-th columns of \(\varvec{U}_n\) and \(\varvec{V}_n\) respectively (left and right singular vectors), and \(\sigma _1^{(n)} \ge \sigma _2^{(n)} \ge \ldots \ge \sigma _N^{(n)}\) are the diagonal entries of the diagonal matrix \(\varvec{\Sigma }_n\) (singular values). This decomposition is sometimes explicitly used to design estimators for the support of \(\mu \), such as MUSIC’s frequency estimation function [60], or Christoffel polynomials [43]. In fact, it is interesting as a motivating remark to see that the construction \(p_n\) from the previous section can also be expressed in terms of this singular value decomposition.

Lemma 4.1

The polynomial estimator (3.2) fulfils

$$\begin{aligned} p_n(x) ={\frac{1}{N}} {{\varvec{e}^{(n)}_{x}}^* \varvec{T}_n\varvec{e}^{(n)}_{x}} =\frac{1}{{N}} \sum _{j=1}^N \sigma ^{(n)}_j u_j^{(n)}(x)\overline{v_j^{(n)}(x)}, \end{aligned}$$
(4.3)

where, as explained above, \(u_j^{(n)}(x) = {\varvec{e}^{(n)}_{x}}^* \varvec{u}_j^{(n)}\) and \(v_j^{(n)}(x) = {\varvec{e}^{(n)}_{x}}^* \varvec{v}_j^{(n)}\).

Proof

We have, for \(n\in \mathbb {N}\) and \(x\in \mathbb {T}^d\),

$$\begin{aligned} \frac{1}{N} {\varvec{e}^{(n)}_{x}}^* \varvec{T}_n\varvec{e}^{(n)}_{x}&= \frac{1}{N} \sum _{k\in [n]} \sum _{l \in [n]} \hat{\mu }(k-l) \text {e}^{2\pi i {kx}} \text {e}^{-2\pi i {lx}}\\&= \frac{1}{N} \int _{\mathbb {T}^d} \sum _{k\in [n]} \sum _{l \in [n]} \text {e}^{-2\pi i {(k-l)y}} \text {e}^{2\pi i {kx}} \text {e}^{-2\pi i {lx}} \textrm{d}\mu (y)\\&= \int _{\mathbb {T}^d} \frac{1}{N} \left| \sum _{k\in [n]} \text {e}^{-2\pi i {k(y-x)}} \right| ^2 \textrm{d}\mu (y) = \int _{\mathbb {T}^d} F_n(y-x)\textrm{d}\mu (y) \end{aligned}$$

where the last equality is a consequence of (3.1). Plugging in the singular value decomposition of \(\varvec{T}_n\) yields the second equality of the statement. \(\square \)
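Lemma 4.1 is also easy to verify numerically; the following univariate sketch (our own toy measure, all names ours) checks that \(\frac{1}{N}{\varvec{e}^{(n)}_{x}}^*\varvec{T}_n\varvec{e}^{(n)}_{x}\) agrees with \(\sum _j \lambda _j F_n(x-x_j)\).

```python
import numpy as np

n = 12; N = n + 1; k = np.arange(N)            # d = 1 for brevity
xs = np.array([0.21, 0.67]); lam = np.array([0.5, 1.5])
mu_hat = lambda m: np.sum(lam * np.exp(-2j * np.pi * m * xs))
T = np.array([[mu_hat(a - b) for b in k] for a in k])      # T_n from (4.2)

def fejer(t):            # F_n(t) = |sum_{k in [n]} e^{-2 pi i k t}|^2 / N
    return np.abs(np.exp(-2j * np.pi * k * t).sum()) ** 2 / N

x = 0.4
e = np.exp(-2j * np.pi * k * x)                # the vector e_x^{(n)}
lhs = (e.conj() @ T @ e).real / N              # (1/N) e_x^* T_n e_x
rhs = sum(l * fejer(x - xj) for l, xj in zip(lam, xs))     # (F_n * mu)(x)
assert abs(lhs - rhs) < 1e-10
```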

Note that if \(\mu \in \mathcal {M}_{\mathbb {R}}\) (the set of real-valued measures), then the moment matrix \(\varvec{T}_n\) is Hermitian.

If \(\mu \in \mathcal {M}_+\) (the set of nonnegative measures), then \(\varvec{T}_n\) is positive semi-definite, and we have in particular the sum of squares representation

$$\begin{aligned} p_n(x)=\frac{1}{{N}} \sum _{j=1}^N \sigma ^{(n)}_j \left| v^{(n)}_j(x)\right| ^2. \end{aligned}$$

We now introduce polynomial estimators for the measure, which can be understood as the unweighted counterparts of \(p_n\). Let \(r_n:={\text {rank}}\varvec{T}_n\) and define signal- and noise-polynomials \(p_{1,n},p_{0,n}:\mathbb {T}^d\rightarrow [0,1]\) respectively, by

$$\begin{aligned} p_{1,n}(x)=\frac{1}{N} \sum _{j=1}^{r_n} \left| v^{(n)}_j(x)\right| ^2 \quad {\text {and}} \quad p_{0,n}(x)=\frac{1}{N} \sum _{j=r_n+1}^N \left| v^{(n)}_j(x)\right| ^2. \end{aligned}$$
(4.4)

This signal/noise terminology comes from the notions of signal and noise subspaces, which were initially introduced in [60] and are at the core of the aforementioned subspace methods in signal processing (we refer the interested reader to [42, Section 9.6] for an overview). Schematically speaking, they correspond to the spaces spanned by the vectors \((\varvec{v}_1^{(n)},\ldots ,\varvec{v}_{r_n}^{(n)})\) (the signal space) and \((\varvec{v}_{r_n+1}^{(n)},\ldots ,\varvec{v}_N^{(n)})\) (the noise space) respectively.

These spaces are actually independent of the particular singular value decomposition, which ensures in particular that \(p_{1,n}\) and \(p_{0,n}\) are indeed well-defined.
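In practice, \(p_{1,n}\) and \(p_{0,n}\) are straightforward to evaluate from a single SVD. The following Python sketch is our own illustration; the rank tolerance tol is an assumption on our part, whereas the text works with the exact rank \(r_n\).

```python
import numpy as np

def signal_noise_polys(T, x_grid, k_idx, tol=1e-10):
    """Evaluate p_{1,n} and p_{0,n} from (4.4) at the rows of x_grid.

    T      : N x N moment matrix,
    x_grid : (M, d) array of evaluation points in T^d,
    k_idx  : (N, d) array of the multi-indices k in [n], in the same
             order as the rows/columns of T,
    tol    : relative threshold fixing the numerical rank r_n
             (our assumption; the text uses the exact rank).
    """
    N = T.shape[0]
    _, s, Vh = np.linalg.svd(T)
    r = int(np.sum(s > tol * s[0]))          # numerical rank r_n
    # row i of E is e_x^{(n)*} at x = x_grid[i]: entries e^{+2 pi i <k,x>}
    E = np.exp(2j * np.pi * x_grid @ k_idx.T)
    P = np.abs(E @ Vh.conj().T) ** 2 / N     # P[i, j] = |v_j(x_i)|^2 / N
    return P[:, :r].sum(axis=1), P[:, r:].sum(axis=1)   # p_{1,n}, p_{0,n}
```

By (4.7), the two returned vectors sum to one at every grid point, which is a convenient sanity check; for \(\mu =\delta _0\) the routine reproduces \(p_{1,n}=F_n/N\), consistent with Example 4.4 below.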

The key idea of subspace methods, relating these spaces to the underlying measure \(\mu \), is that, given a causal polynomial \(p \in \langle \text {e}^{-2\pi i {kx}}; \; k \in [n] \rangle \) that vanishes on \({\text {supp}}\,\mu \), one obtains using (4.2) that the k-th entry (\(k \in [n]\)) of \(\varvec{T}_n\varvec{p}\) is given by

$$\begin{aligned} \sum _{l \in [n]} \varvec{p}_l \cdot \int _{\mathbb {T}^d} \text {e}^{-2\pi i {(k-l)x}}\textrm{d}\mu (x) = \int _{\mathbb {T}^d} \text {e}^{-2\pi i {kx}} p(x) \textrm{d}\mu (x) = 0, \end{aligned}$$
(4.5)

and hence \(\varvec{p} \in \ker \varvec{T}_n\). Thus, finding the common roots of all polynomials contained in the kernel of the matrix \(\varvec{T}_n\) may allow one to identify the support of \(\mu \), or more accurately the smallest algebraic variety (the set of solutions of a polynomial system) that contains it, i.e. its Zariski closure \(V_{{\mu }}\). In what follows, we denote by \(V(\ker \varvec{T}_n)\) the set consisting of the common roots of all the polynomials in \(\ker \varvec{T}_n\), i.e.

$$\begin{aligned} V(\ker \varvec{T}_n) :=\{x\in \mathbb {T}^d: \; p(x) ={\varvec{e}^{(n)}_{x}}^* \varvec{p}= 0 \text { for all } \varvec{p}\in \ker \varvec{T}_n\}. \end{aligned}$$

We begin in this section with qualitative, purely algebraic considerations about the polynomials (4.4). The next theorem shows that, under the condition that \(V_{{\mu }}= V(\ker \varvec{T}_n)\), \(p_{0,n}\) and \(p_{1,n}\) actually identify the set \(V_{{\mu }}\) for finite n. Variants of this result can be found for the zero-dimensional and positive-dimensional (\(d=2,3\)) setting e.g. in [33] and [49, Propositions 5.2, 5.3], respectively.

Theorem 4.2

Let \(d,n\in \mathbb {N}\), \(\mu \in \mathcal {M}\), and suppose \(V(\ker \varvec{T}_n)=V_{{\mu }}\subseteq \mathbb {T}^d\). Then \(p_{0,n}(x)+p_{1,n}(x)=1\) for all \(x\in \mathbb {T}^d\). In particular, we have

$$\begin{aligned} \begin{aligned} p_{1,n}(x){\left\{ \begin{array}{ll} =1, &{} \text { if } x\in V_{{\mu }},\\ <1, &{} \text {otherwise}. \end{array}\right. } \end{aligned} \end{aligned}$$
(4.6)

Proof

We have

$$\begin{aligned} p_{1,n}(x) + p_{0,n}(x) = \frac{1}{N} \sum _{j=1}^N \left| v_j^{(n)}(x) \right| ^2 = {\frac{1}{N} {\varvec{e}^{(n)}_{x}}^* \varvec{V}_n \varvec{V}_n^* \varvec{e}^{(n)}_{x}= \frac{1}{N} {\varvec{e}^{(n)}_{x}}^* \varvec{e}^{(n)}_{x}} = 1,\nonumber \\ \end{aligned}$$
(4.7)

so in particular \(p_{1,n}(x) \in [0,1]\). Since \(V(\ker \varvec{T}_n) = V_{{\mu }}\) and \(\ker \varvec{T}_n= \langle \varvec{v}_{r_n+1}^{(n)},\dots ,\varvec{v}_N^{(n)}\rangle \), the polynomials \(v_{r_n+1}^{(n)},\dots ,v_N^{(n)}\) vanish on \(V_{{\mu }}\), so \(p_{1,n}(x) = 1\) for all \(x\in V_{{\mu }}\). Conversely, if \(x\in \mathbb {T}^d\) is such that \(p_{1,n}(x) = 1\), then \(1 - p_{1,n}(x) = \frac{1}{N}\sum _{j=r_n+1}^N \left| v_j^{(n)}(x) \right| ^2 = 0\), so x lies in the common vanishing set of \(v_{r_n+1}^{(n)},\dots ,v_N^{(n)}\), i.e. \(x\in V(\ker \varvec{T}_n) = V_{{\mu }}\). \(\square \)

Remark 4.3

The hypothesis \(V(\ker \varvec{T}_n) = V_{{\mu }}\) in Theorem 4.2 is well-known in the theory of super-resolution [32, 58] or polynomial system solving [38], and is hard to check in practice. Note however that:

  (i) It is satisfied for all sufficiently large n if \(\mu \in \mathcal {M}\) is finitely supported, see e.g. [35]. In particular, if \(\mu \) is supported on r points \(\{x_1,\ldots ,x_r\}\), this ensures that the rank of \(\varvec{T}_n\) is equal to r (while in general, one only has \(r_n \le r\)). The optimal n in that case depends on the geometry of the support, but it is sufficient to have \(n + 1 > 6d / \min _{j\ne k}|x_j-x_k|_{\infty }\), see [34, Example 4.4] and [33, Cor. 2.10]. Similarly, the condition holds for sufficiently large n if \(\mu \in \mathcal {M}_+\), see for example [39, Theorem 2.10] or [62, Proposition 4.10].

  (ii) If \(\mu \) is neither finitely supported nor nonnegative, then \(V(\ker \varvec{T}_n) = V_{{\mu }}\) can fail to hold for any \(n\in \mathbb {N}\) (cf. [62, Example 4.9]). In this case, it is possible to rephrase the hypothesis in terms of a non-square moment matrix of suitable size (cf. [62, Theorem 4.3]) to obtain a statement similar to Theorem 4.2.

  (iii) Theorem 4.2 and the results below only deal with the Zariski closure \(V_{{\mu }}\), which coincides with \({\text {supp}}\,\mu \) only when the latter is the zero locus of a trigonometric polynomial, but is larger otherwise. In particular, we have \(V_{{\mu }}=\mathbb {T}^d\) if \({\text {supp}}\,\mu \) has an interior point (with respect to the metric on \(\mathbb {T}^d\)) and in this case, \(r_n = N\) and \(p_{1,n}\equiv 1\). Beyond the scope of this paper, one might adapt the definition of \(p_{1,n}\) by adequately thresholding the singular values of \(\varvec{T}_n\), thus giving rise to algebraic approximations of the actual support.

Example 4.4

For \(\mu =\delta _0\), we have \(p_{1,n}(x)=F_n(x)/(n+1)^d\), and the proofs of Theorems 4.6 and 4.9 also show that \(p_{1,n}\) is close to a sum of normalized Fejér kernels for arbitrary discrete measures. A singular continuous measure with support on the zero locus of a specific trigonometric polynomial is discussed as a numerical example in Sect. 5.

We conclude this subsection by stating a variational characterization of \(p_{0,n}\), which will be an important tool in the proofs of the next sections.

Lemma 4.5

If \(\ker \varvec{T}_n\ne \{\varvec{0}\}\), we have that

$$\begin{aligned} p_{0,n}(x)=\max \left\{ \frac{1}{N}\frac{|p(x)|^2}{\Vert \varvec{p}\Vert _{2}^2} \;:\; \varvec{p}\in \ker \varvec{T}_n\setminus \{\varvec{0}\} \right\} . \end{aligned}$$
(4.8)

Proof

As we assume \(\ker \varvec{T}_n\ne \{\varvec{0}\}\), we have \(r_n={\text {rank}}\varvec{T}_n<N\) and find a matrix \(\varvec{V}_0=(\varvec{v}_{r_n+1}^{(n)},\dots ,\varvec{v}_{N}^{(n)})\in \mathbb {C}^{N\times (N-r_n)}\) whose columns form an orthonormal basis of \(\ker \varvec{T}_n\).

For fixed \(x \in \mathbb {T}^d\), let \({\varvec{q}_x :=\varvec{V}_0 \varvec{V}_0^{*}\varvec{e}^{(n)}_{x}} \in \ker \varvec{T}_n\), which we identify with the polynomial \(q_x\) satisfying \({q_x(x)} = {\varvec{e}^{(n)}_{x}}^{*}\varvec{q}_x=\sum _{j=r_n+1}^N \left| v^{(n)}_j(x)\right| ^2 = N p_{0,n}(x)\).

For all \(\varvec{p} \in \ker \varvec{T}_n\), we have

$$\begin{aligned} \varvec{q}_x^{*}\varvec{p}= {\varvec{e}^{(n)}_{x}} ^{*}\varvec{V}_0 \varvec{V}_0^{*}\varvec{p}= {\varvec{e}^{(n)}_{x}}^{*}\varvec{p}= p(x). \end{aligned}$$

In particular, note that

$$\begin{aligned} {\left\| \varvec{q}_x\right\| _{2}}^2 = \varvec{q}_x^{*}\varvec{q}_x = {\varvec{e}^{(n)}_{x}}^{*}\varvec{V}_0 \varvec{V}_0^{*}\varvec{e}^{(n)}_{x}= N p_{0,n}(x). \end{aligned}$$
(4.9)

Therefore, by the Cauchy–Schwarz inequality, it follows that

$$\begin{aligned} \left| p(x) \right| ^2 = \left| \varvec{q}_x^{*}\varvec{p} \right| ^2 \le {\left\| \varvec{q}_x\right\| _{2}}^2 \cdot {\left\| \varvec{p}\right\| _{2}}^2 = N p_{0,n}(x) \cdot {\left\| \varvec{p}\right\| _{2}}^2. \end{aligned}$$

Hence, we have

$$\begin{aligned} p_{0,n}(x) \ge \max _{\varvec{p}\in \ker \varvec{T}_n\setminus \{\varvec{0}\}} \frac{\left| p(x) \right| ^2}{N {\left\| \varvec{p}\right\| _{2}}^2} \ge \frac{\left| q_x(x) \right| ^2}{N {\left\| \varvec{q}_x\right\| _{2}}^2} = p_{0,n}(x), \end{aligned}$$

if \(\varvec{q}_x\ne \varvec{0}\). When \(\varvec{q}_x=\varvec{0}\), the first inequality still holds and the result follows from (4.9). \(\square \)
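The maximiser in (4.8) is precisely the polynomial \(q_x\) from the proof. A quick numerical check (our own univariate toy example) that random kernel elements never exceed \(p_{0,n}(x)\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8; N = n + 1; k = np.arange(N)
xs = np.array([0.1, 0.55, 0.8]); lam = np.array([1.0, 2.0, 0.5])
mu_hat = lambda m: np.sum(lam * np.exp(-2j * np.pi * m * xs))
T = np.array([[mu_hat(a - b) for b in k] for a in k])
_, s, Vh = np.linalg.svd(T)
r = int(np.sum(s > 1e-10 * s[0]))
V0 = Vh.conj().T[:, r:]                   # orthonormal basis of ker T_n
x = 0.3
e = np.exp(-2j * np.pi * k * x)           # e_x^{(n)}
p0 = np.linalg.norm(e.conj() @ V0) ** 2 / N        # p_{0,n}(x), cf. (4.4)
for _ in range(100):                      # random p in ker T_n
    p = V0 @ (rng.standard_normal(N - r) + 1j * rng.standard_normal(N - r))
    val = np.abs(e.conj() @ p) ** 2 / (N * np.linalg.norm(p) ** 2)
    assert val <= p0 + 1e-12              # never exceeds the maximum in (4.8)
```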

4.2 Zero-Dimensional Situation

We now come to the first main result of this section, stated in Theorem 4.6 below, which gives quantitative rates for the pointwise convergence (4.1) in the case where \(\mu \) is a discrete measure. If the measure is given by

$$\begin{aligned} \mu =\sum _{j=1}^r \lambda _j \delta _{x_j} \end{aligned}$$

with (Zariski-closed) support \(V_{{\mu }}={\text {supp}}\,\mu =\{x_1,\ldots ,x_r\}\subset \mathbb {T}^d\) and complex weights \(\varvec{\Lambda }={\text {diag}}(\lambda _1,\ldots ,\lambda _r)\), then the moment matrix allows for the Vandermonde factorisation

$$\begin{aligned} \varvec{T}_n=\varvec{A}_n^*\varvec{\Lambda } \varvec{A}_n,\qquad \varvec{A}_n=(\text {e}^{2\pi i {kx_j}})_{j=1,\ldots ,r; k\in [n]}\in \mathbb {C}^{r\times N}, \end{aligned}$$

which will be instrumental.
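In coordinates, this factorisation is immediate to check numerically; the following univariate sketch (with nodes and weights of our choosing) compares it against the definition (4.2):

```python
import numpy as np

n = 10; N = n + 1; k = np.arange(N)              # d = 1 for brevity
xs = np.array([0.12, 0.41, 0.77])                # nodes x_j
lam = np.array([1.0, -0.3, 2.0])                 # (possibly complex) weights
A = np.exp(2j * np.pi * np.outer(xs, k))         # A_n in C^{r x N}
T = A.conj().T @ np.diag(lam) @ A                # T_n = A_n^* Lambda A_n
mu_hat = lambda m: np.sum(lam * np.exp(-2j * np.pi * m * xs))
T_ref = np.array([[mu_hat(a - b) for b in k] for a in k])   # definition (4.2)
assert np.allclose(T, T_ref)
```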

Theorem 4.6

(Pointwise convergence) Let \(\mu = \sum _{j=1}^r \lambda _j \delta _{x_j}\), \(\lambda _j \in \mathbb {C}{\setminus }\{0\}, x_j \in \mathbb {T}^d\), and let \(x\in \mathbb {T}^d\) such that \(x\ne x_j\) for all \(1\le j\le r\). Let \(\lambda _{\min }\) and \(\lambda _{\max }\) be the minimal and maximal weights, in absolute value. If \(n+1>6d/\min _{j\ne \ell }|x_j-x_\ell |_{\infty }\), then

$$\begin{aligned} p_{1,n}(x) \le \frac{1}{(n+1)^2} \cdot \frac{|\lambda _{\max }|}{|\lambda _{\min }|} \cdot \frac{9d^{d/2}}{4}\sum _{j=1}^r \frac{1}{ |x - x_j|_{\infty }^2}. \end{aligned}$$

In particular, this implies the pointwise convergence (4.1). Moreover, for \(x\in \mathbb {T}^d\) such that \(\min _j |x-x_j|_{\infty }\le \sqrt{d}/(n+1)\), one has

$$\begin{aligned} p_{1,n}(x) \le {\left\{ \begin{array}{ll} 1- \frac{3^{d-1}(2d-1)}{2^d d^{2+d/2}} \cdot (n+1)^2 \cdot \min _j |x-x_j|_{\infty }^2, &{}\quad d>1,\\ 1- \frac{\pi ^2}{2\cdot 3^5} (n+1)^2 \cdot \min _j |x-x_j|_{\infty }^2 , &{}\quad d=1. \end{array}\right. } \end{aligned}$$

Proof

The condition \(n+1 > 6d/\min _{j\ne \ell } \left| x_j-x_\ell \right| _\infty \) implies \({\text {rank}}\,\varvec{T}_n= r\) and \(V_{{\mu }}= V(\ker \varvec{A}_n)=V(\ker \varvec{T}_n)\), see [34, Exa. 4.4] and [33, Cor. 2.5, 2.10, and their proofs]. We have \(p_{1,n} = 1-p_{0,n}\) by Theorem 4.2, and since \(\langle \varvec{v}_{r+1},\ldots ,\varvec{v}_{N}\rangle = \ker \varvec{T}_n= \ker \varvec{A}_n\), the polynomial \(p_{1,n}\) does not depend on the weights \(\lambda _j\); we may thus assume without loss of generality that \(\lambda _j>0\). Let then \(\varvec{T}_n= \varvec{V}\varvec{\Sigma }\varvec{V}^*\) be the moment matrix of this nonnegative measure. One has, for any \(x\in \mathbb {T}^d\),

$$\begin{aligned} p_{1,n}(x)= & {} \frac{1}{N}\sum _{j=1}^r \left| v_j^{(n)}(x) \right| ^2 \le \frac{1}{N} \sum _{j=1}^r \frac{\sigma _j^{{(n)}}}{\sigma _{\min }^{{(n)}}} \left| v^{(n)}_j(x) \right| ^2 = {\frac{1}{\sigma _{\min }^{{(n)}}} p_n(x)} \\= & {} \frac{1}{\sigma _{\min }^{{(n)}}} \sum _{j=1}^r |\lambda _j| F_n(x-x_j), \end{aligned}$$

where the last two equalities are (4.3) and (3.2), respectively. The final estimate follows from

$$\begin{aligned} F_n(x-x_j) \le \frac{(n+1)^{d-1}}{(n+1) \sin ^2\left( \pi |x - x_j|_{\infty }\right) } \le \frac{(n+1)^{d-2}}{4 |x - x_j|_{\infty }^2} \end{aligned}$$

where \(\sin (\pi x)\ge 2x\) for \(x\in [0,\frac{1}{2}]\) was used and \(\sigma _{\min }^{{(n)}}\ge \frac{1}{9d^{d/2}} (n+1)^d |\lambda _{\min }|\), see [34, Exa. 4.4].

Regarding the second estimate, denote by \(e_{r+1}=\left( 0,\dots ,0,1\right) ^\top \in \mathbb {C}^{r+1}\) the \((r+1)\)-th standard basis vector and consider the Vandermonde matrix

$$\begin{aligned} \varvec{\tilde{A}}_{n,x}=\begin{bmatrix} \varvec{e}_{x_1}^{(n)}&\cdots&\varvec{e}_{x_r}^{(n)}&\varvec{e}^{(n)}_{x}\end{bmatrix} \in \mathbb {C}^{N\times (r+1)} \end{aligned}$$

and note that its pseudo-inverse gives rise to the Lagrange polynomial \(\ell _{r+1}(y)=e_{r+1}^*\varvec{\tilde{A}}_{n,x}^\dagger {\varvec{e}^{(n)}_{y}}\), satisfying \(\ell _{r+1}(x_j)=0\) for \(j=1,\ldots ,r\) and \(\ell _{r+1}(x)=1\). We compute

$$\begin{aligned} \Vert \ell _{r+1}\Vert _{L^2}^2&=\int _{\mathbb {T}^d} |e_{r+1}^*\varvec{\tilde{A}}_{n,x}^\dagger \varvec{e}^{(n)}_{y}|^2 \,\textrm{d}y \\&=\int _{\mathbb {T}^d} |\langle \varvec{\tilde{A}}_{n,x}^{\dagger *} e_{r+1}, \varvec{e}^{(n)}_{y}\rangle |^2 \,\textrm{d}y =\Vert \varvec{\tilde{A}}_{n,x}^{\dagger *} e_{r+1}\Vert _2^2\le \sigma _{\min }(\varvec{\tilde{A}}_{n,x})^{-2} \end{aligned}$$

and use Lemma 4.5 to bound

$$\begin{aligned} 1-p_{1,n}(x)=p_{0,n}(x)=\max _{p} \frac{|p(x)|^2}{N \Vert p\Vert _{L^2}^2} \ge \frac{|\ell _{r+1}(x)|^2}{N \Vert \ell _{r+1}\Vert _{L^2}^2} \ge \frac{\sigma _{\min }(\varvec{\tilde{A}}_{n,x})^{2}}{N}. \end{aligned}$$

The assertion follows from known estimates on the smallest singular value for the Vandermonde matrix with pairwise clustering nodes, see [25, Cor. 3.20] for \(d>1\) and [13, Cor. 4.2] for the univariate case \(d=1\). \(\square \)

Remark 4.7

Theorem 4.6 actually exhibits the correct orders in n and \(\min _j |x-x_j|_{\infty }\) in the upper bound on \(p_{1,n}(x)\). First note that \(1-p_{1,n}\) and all its partial derivatives of order 1 vanish on \(x_1,\ldots ,x_r\). For fixed \(x\in \mathbb {T}^d\) and \(j'=\mathop {\textrm{argmin}}\limits _j |x-x_j|_{\infty }\), the Taylor expansion at \(x_{j'}\) thus gives \(\xi \in \mathbb {T}^d\) such that

$$\begin{aligned} 1-p_{1,n}(x)&= \frac{1}{2} (x-x_{j'})^\top H_x(\xi ) (x-x_{j'}), \end{aligned}$$

where \(H_x(\xi ):=\left( -\partial _{s}\partial _t p_{1,n}\left( \xi \right) \right) _{1\le s,t\le d}\) is the Hessian of \(1-p_{1,n}\) at \(\xi \). Thus,

$$\begin{aligned} 1-p_{1,n}(x)&\le {\frac{1}{2} {\left\| H_x(\xi )\right\| _{}}_F \cdot \left| x-x_{j'} \right| _2^2}\le {\frac{d}{2} \max _{r,s} {\left\| \partial _r \partial _s p_{1,n}\right\| _{}}_{L^\infty } \cdot d \left| x-x_{j'} \right| _\infty ^2}. \end{aligned}$$

One may apply Bernstein's inequality (see e.g. [12, Chapter 4]) to \(y_s \mapsto p_{1,n}(y_1,\ldots , y_d)\) and \(y_r \mapsto \partial _s p_{1,n}(y_1,\ldots ,y_d)\) successively (both trigonometric polynomials of degree n); since \(\Vert p_{1,n}\Vert _{L^\infty }=1\), this yields \({\left\| \partial _r \partial _s p_{1,n}\right\| _{}}_{L^\infty }\le 4\pi ^2 n^2\) and hence

$$\begin{aligned} 1-p_{1,n}(x) \le 2\pi ^2 d^2 n^2 \cdot \min _j \left| x-x_j \right| ^2_\infty . \end{aligned}$$

A bivariate visualisation of the bounds on \(p_{1,n}\) is shown in Fig. 4.

Fig. 4 Summary of the bounds on \(p_{1,n}\) from Theorem 4.6 and Remark 4.7 for \(d=2\), \(n=20\), and a discrete measure \(\mu \) supported on four points. The polynomial \(p_{1,20}\) was evaluated on a grid in \(\mathbb {T}^2\) and interpolated on the magenta cross section (left), while the bounds on \(p_{1,20}\) on this cross section are displayed (right). We see that specifically the bound \(1-\sigma _{\min }(\varvec{\tilde{A}}_{n,x})^2/N\) from the proof of Theorem 4.6 reproduces the behaviour of \(p_{1,n}\). The constant upper bound on \(p_{1,n}\) away from the support of \(\mu \) can be derived by using estimates for \(\sigma _{\min }(\varvec{\tilde{A}}_{n,x})\) in the case of separated nodes.

In fact, in this discrete setting, normalizing \(p_{1,n}\) differently even leads to a weak convergence result towards the empirical measure associated with the support points. This result, stated in Theorem 4.9 below, uses the following technical lemma.

Lemma 4.8

(Convergence of singular values) Let \(\mu =\sum _{j=1}^r \lambda _j \delta _{x_j}\) be a discrete complex measure whose weights are ordered non-increasingly with respect to their absolute value. Assume that \((n+1)\min _{j\ne \ell }|x_j-x_{\ell }|_{\infty } > d\), then the singular values \(\sigma _j^{{(n)}}\) of the moment matrix \(\varvec{T}_n\) fulfil

$$\begin{aligned} \left| |\lambda _j|-\frac{\sigma _j^{{(n)}}}{N}\right| \le \frac{1}{n+1}\cdot \frac{\left| \lambda _{1} \right| \left( 1+\sqrt{\text {e}}\right) r}{2\min _{j\ne \ell }|x_j-x_{\ell }|_{\infty }},\qquad j=1,\ldots ,r. \end{aligned}$$

Proof

With the polar decomposition \(\frac{1}{\sqrt{N}} \varvec{A}_n^{*}= \varvec{P} \varvec{H}\), where \(\varvec{P}\in \mathbb {C}^{N\times r}\) has orthonormal columns and \(\varvec{H}\in \mathbb {C}^{r\times r}\) is positive-definite, we have that \(\left| \lambda _1 \right| \ge \cdots \ge \left| \lambda _r \right| \) are the singular values of the matrix \(\varvec{P} \varvec{\Lambda } \varvec{P}^*\). Therefore, for the singular values of \(\varvec{T}_n= \varvec{A}_n^{*}\varvec{\Lambda } \varvec{A}_n\), we obtain

$$\begin{aligned} \max _{1\le j\le r} \left| \frac{\sigma _j^{{(n)}}}{N} - \left| \lambda _j \right| \right|&\le {\left\| \frac{1}{N} \varvec{T}_n- \varvec{P} \Lambda \varvec{P}^*\right\| _{2}} = {\left\| \varvec{H} \Lambda \varvec{H}^* - \Lambda \right\| _{2}}\\&\le {\left\| \varvec{H} \Lambda \left( \varvec{H} - \varvec{\textrm{I}}_{r}\right) \right\| _{2}} + {\left\| \left( \varvec{H} - \varvec{\textrm{I}}_{r}\right) \Lambda \right\| _{2}}\\&\le \left| \lambda _1 \right| \left( {\left\| \varvec{H}\right\| _{2}} + 1\right) {\left\| \varvec{H} - \varvec{\textrm{I}}_{r}\right\| _{2}}\\&\le \left| \lambda _1 \right| \left( {\left\| \varvec{H}\right\| _{2}} + 1\right) {\left\| (\varvec{H}+\varvec{\textrm{I}}_{r})^{-1}\right\| _{2}} {\left\| \varvec{H}^2 - \varvec{\textrm{I}}_{r}\right\| _{2}}\\&\le \left| \lambda _1 \right| \frac{\frac{1}{\sqrt{N}} \mathop {\mathrm {\sigma _{\text {max}}}}\limits (\varvec{A}_n) + 1}{\frac{1}{\sqrt{N}} \mathop {\mathrm {\sigma _{\text {min}}}}\limits (\varvec{A}_n) + 1} \left\| \frac{1}{N} \varvec{A}_n \varvec{A}_n^* - \varvec{\textrm{I}}_{r}\right\| _\text {F}, \end{aligned}$$

where the first inequality is due to [4, Theorem 2.2.8] and the last inequality is a consequence of \(\varvec{H}=\varvec{P}^{*} \varvec{P} \varvec{H}=\frac{1}{\sqrt{N}} \varvec{P}^{*} \varvec{A}_{n}^{*}\) yielding

$$\begin{aligned} \varvec{H}^{2}=\varvec{H}^{*} \varvec{H}=\frac{1}{N} \varvec{A}_{n} \varvec{P} \varvec{P}^{*} \varvec{A}_{n}^{*}=\frac{1}{\sqrt{N}} \varvec{A}_{n} \varvec{P} \varvec{P}^{*} \varvec{P} \varvec{H}=\varvec{A}_{n} \frac{1}{\sqrt{N}} \varvec{P} \varvec{H}=\frac{1}{N} \varvec{A}_{n} \varvec{A}_{n}^{*}. \end{aligned}$$

Each entry of the matrix \(\frac{1}{N} \varvec{A}_n \varvec{A}_n^* - \varvec{\textrm{I}}_{r}\) is a modified Dirichlet kernel and can be bounded uniformly by

$$\begin{aligned} \left\| \frac{1}{N}\varvec{A}_n\varvec{A}_n^* - \varvec{\textrm{I}}_{r}\right\| _\text {F}&= \frac{1}{N} \left( \sum _{j=1}^r\sum _{l\ne j} \left| \sum _{k\in [n]}\text {e}^{2\pi i {k(x_l-x_j)}}\right| ^2\right) ^{1/2} \le \frac{r}{N} \cdot \frac{(n+1)^{d-1}}{2 \min _{j\ne \ell }|x_j-x_{\ell }|_{\infty }}. \end{aligned}$$

Moreover, since \((n+1)\min _{j\ne \ell }|x_j-x_{\ell }|_{\infty }> d\), it follows from [35, Theorem 2.1] that

$$\begin{aligned} \frac{1}{\sqrt{N}} \mathop {\mathrm {\sigma _{\text {max}}}}\limits (\varvec{A}_n) \le \sqrt{\left( 1 + \frac{1}{d}\right) ^d} \le \sqrt{\text {e}}. \end{aligned}$$

\(\square \)
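The rate of Lemma 4.8 is easy to observe empirically. In the following univariate sketch (our own toy example with well-separated nodes), the deviation \(\max _j \left| |\lambda _j| - \sigma _j^{(n)}/N \right| \) is printed for growing n and decays like \(1/(n+1)\):

```python
import numpy as np

xs = np.array([0.1, 0.4, 0.75])                  # separation 0.3 on the torus
lam = np.array([2.0, -1.0, 0.5])                 # |lambda_j| non-increasing
for n in [20, 40, 80, 160]:
    N = n + 1; k = np.arange(N)
    A = np.exp(2j * np.pi * np.outer(xs, k))     # Vandermonde matrix A_n
    s = np.linalg.svd(A.conj().T @ np.diag(lam) @ A, compute_uv=False)[:3]
    print(n, np.max(np.abs(s / N - np.abs(lam))))  # decays like 1/(n+1)
```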

Theorem 4.9

For \(\lambda _j\in \mathbb {C}\setminus \{0\}\) and pairwise different \(x_j\in \mathbb {T}^d\), \(j=1,\dots ,r,\) we have

$$\begin{aligned} \frac{p_{1,n}}{\Vert p_{1,n}\Vert _{L^1}} \rightharpoonup {\tilde{\mu }}=\frac{1}{r}\sum _{j=1}^r \delta _{x_j} \end{aligned}$$

as \(n\rightarrow \infty \).

Proof

Throughout this proof we use that the Vandermonde matrix \(\varvec{A}_n\) has full rank r if n is sufficiently large. In particular, this implies \(V(\ker \varvec{T}_n) = V_{{\mu }}\) and \(\Vert p_{1,n}\Vert _{L^1}=r/N\). We define \({\tilde{p}}_n=F_n*{\tilde{\mu }}\) and observe that for any continuous function f on \(\mathbb {T}^d\) we have

$$\begin{aligned}&\left| \int _{\mathbb {T}^d} \frac{p_{1,n}(x)}{\Vert p_{1,n}\Vert _{L^1}} f(x) \,\textrm{d}x - \frac{1}{r}\sum _{j=1}^r f(x_j) \right| \\&\quad \le \left| \int _{\mathbb {T}^d} \left( \frac{p_{1,n}(x)}{\Vert p_{1,n}\Vert _{L^1}} - {\tilde{p}}_n(x)\right) f(x) \,\textrm{d}x \right| + \left| \int _{\mathbb {T}^d} {\tilde{p}}_n(x)f(x)\,\textrm{d}x - \frac{1}{r}\sum _{j=1}^r f(x_j) \right| \\&\quad \le \left\| \frac{N}{r}p_{1,n} - {\tilde{p}}_n \right\| _{L^1} \left\| f \right\| _{L^\infty } + \left| \int _{\mathbb {T}^d} f\,\textrm{d}(F_n*{\tilde{\mu }}) - \int _{\mathbb {T}^d} f\,\textrm{d}{\tilde{\mu }} \right| , \end{aligned}$$

so, by Theorem 3.3, it is enough to show that \(\left\| \frac{N}{r} p_{1,n} - {\tilde{p}}_n \right\| _{L^1}\) converges to zero for \(n\rightarrow \infty \). If n is sufficiently large, then by Lemma 4.1 we can write \({\tilde{p}}_n(x) = \frac{1}{N} {\varvec{e}^{(n)}_{x}}^{*}\tilde{\varvec{U}} \tilde{\varvec{\Sigma }} \tilde{\varvec{U}}^{*}\varvec{e}^{(n)}_{x}\), where \(\tilde{\varvec{\Sigma }}\in \mathbb {C}^{r\times r}\) denotes the diagonal matrix consisting of non-zero singular values, and \(\tilde{\varvec{U}} \in \mathbb {C}^{N\times r}\) denotes the corresponding singular vector matrix of the moment matrix of \({\tilde{\mu }}\). As the signal polynomial \(p_{1,n}=1-p_{0,n}\) only depends on the kernel of the moment matrix \(\varvec{T}_n\) of \(\mu \), which agrees with the kernel of \(\varvec{A}_n\) and with the kernel of the moment matrix of \({\tilde{\mu }}\), it follows by (4.4) that \(p_{1,n}(x) = \frac{1}{N} {\varvec{e}^{(n)}_{x}}^{*}\tilde{\varvec{U}} \tilde{\varvec{U}} ^{*}\varvec{e}^{(n)}_{x}\) and thus

$$\begin{aligned} \left| \frac{N}{r} p_{1,n}(x) - {\tilde{p}}_n(x) \right| = \left| {\varvec{e}^{(n)}_{x}}^{*}\tilde{\varvec{U}} \left( \frac{\varvec{\textrm{I}}_{r}}{r} - \frac{\tilde{\varvec{\Sigma }}}{N}\right) \tilde{\varvec{U}} ^{*}\varvec{e}^{(n)}_{x} \right| \le {\left\| {\varvec{e}^{(n)}_{x}}^* \tilde{\varvec{U}}\right\| _{2}}^2 {\left\| \frac{1}{r}\varvec{\textrm{I}}_{r} - \frac{1}{N}\tilde{\varvec{\Sigma }}\right\| _{2}}. \end{aligned}$$

Since \(\int _{\mathbb {T}^d} {\left\| {\varvec{e}^{(n)}_{x}}^* \tilde{\varvec{U}}\right\| _{2}}^{2} \,\textrm{d}x = N \Vert p_{1,n}\Vert _{L^1} = r\) is constant, the result follows from Lemma 4.8. \(\square \)

4.3 Positive-Dimensional Situation

For a measure \(\mu \) whose support is contained in a non-trivial algebraic variety of any dimension, we derive in Theorem 4.10 a pointwise convergence rate \(p_{1,n}(x)={\mathcal {O}}\left( n^{-1}\right) \) outside of the variety; together with Theorem 4.2, this proves (4.1) provided \(V(\ker \varvec{T}_n)=V_{{\mu }}\). It is not clear whether this rate is optimal, as we found the approximation rate \({\mathcal {O}}\left( n^{-2}\right) \) in the case of a discrete measure.

Theorem 4.10

Let \(y\in \mathbb {T}^d\) and let \(g \in \langle \text {e}^{2\pi i {\left\langle k,x \right\rangle }} \mid k\in [m]\rangle \) be a trigonometric polynomial of max-degree m such that \(g(y)\ne 0\) and g vanishes on \({\text {supp}}\,\mu \). Then

$$\begin{aligned}{} & {} p_{1,n+m}(y) \le 1 - \frac{(n+1)^d}{(n+m +1)^d} \cdot \frac{\left| g\left( y\right) \right| ^2}{\left( F_n * \left| g \right| ^2\right) (y)}\\{} & {} \le \frac{\Vert g\Vert _{L^2}^2}{|g(y)|^2} \frac{m(4m+2)^d}{n+1} + \frac{d m}{\left( n+m + 1\right) }, \end{aligned}$$

for \(n\in \mathbb {N}\), \(n\ge m\).

Proof

Set \(N_n = (n+1)^d\) for \(n\in \mathbb {N}\) and define the trigonometric polynomial \(p(x) = e_{n,y}(x) g(x)\) of max-degree \(n+m\), where \(e_{n,y}(x) :={\varvec{e}^{(n)}_{x}}^{*}\varvec{e}^{(n)}_{y}\). Furthermore, we define \(f(x):=\left| g(x) \right| ^2\). Then

$$\begin{aligned} \left| p(x) \right| ^2 = N_n F_n(x-y) f(x), \end{aligned}$$

for all \(x\in \mathbb {T}^d\). On the other hand,

$$\begin{aligned} \left\| p \right\| _{L^2}^2 = N_n \left( F_n * f\right) (y). \end{aligned}$$

Since g vanishes on the support of \(\mu \) but not at y, the polynomial \(p = e_{n,y}\, g \in \langle \text {e}^{2\pi i {\left\langle k,x \right\rangle }} \mid k\in [n+m]\rangle \) vanishes on \({\text {supp}}\,\mu \) as well while \(p(y)=N_n\, g(y)\ne 0\), and thus \(\ker \varvec{T}_{n+m}\ne \{\varvec{0}\}\) by (4.5). This allows us to use Lemma 4.5 in order to obtain

$$\begin{aligned} 1 - p_{1,n+m}(y)\ge & {} \frac{\left| p(y) \right| ^2}{N_{n+m} \left\| p \right\| _{L^2}^2} \nonumber \\= & {} \frac{N_n}{N_{n+m}} \cdot \frac{f(y)}{\left( F_n * f\right) (y)} \nonumber \\\ge & {} \left( 1-\frac{m}{n+m+1}\right) ^d \frac{1}{1 + h_n},\qquad \quad \end{aligned}$$
(4.10)

where we define \(h_n :={\left\| F_n * f - f \right\| _{L^\infty }}/{f(y)}\). This proves the first statement. For the second upper bound, we compute

$$\begin{aligned} \left| (F_n * f - f)(x) \right|&= \left| \sum _{k\in \{-m,\dots ,m\}^d} \sum _{{\genfrac{}{}{0.0pt}{}{s\in \{0,1\}^d}{1\le |s|\le d}}} \frac{(-1)^{|s|} |k^s|}{(n+1)^{|s|}} {\hat{f}}(k) \text {e}^{2\pi i {kx}} \right| \\&= \left| \sum _{k\in \{-m,\dots ,m\}^d} \sum _{{\genfrac{}{}{0.0pt}{}{s\in \{0,1\}^d}{1\le |s|\le d}}} \frac{(-1)^{|s|}|k^s|}{(n+1)^{|s|}} \int _{\mathbb {T}^d} \left| g(z) \right| ^2 \text {e}^{2\pi i {k(x-z)}} \textrm{d}z \right| \\&\le \sum _{{\genfrac{}{}{0.0pt}{}{s\in \{0,1\}^d}{1\le |s|\le d}}} \left( \frac{m}{n+1}\right) ^{|s|} (2m+1)^d \Vert g\Vert _{L^2}^2 \\&\le \Vert g\Vert _{L^2}^2 \frac{m(4m+2)^d}{n+1} \end{aligned}$$

by using that \(f=|g|^2\) is a trigonometric polynomial of degree m. Then it follows from (4.10) that

$$\begin{aligned} p_{1,n+m}(y)&\le 1- \left( 1-\frac{m}{n+m+1}\right) ^d \left( 1-\frac{h_n}{1+h_n}\right) \\&\le 1- \left( 1-\frac{dm}{n+m+1}\right) \left( 1-\frac{h_n}{1+h_n}\right) \\&= \frac{h_n}{1+h_n} + \frac{d m}{\left( n+m + 1\right) } - \frac{h_n}{1+h_n} \cdot \frac{d m}{\left( n+m + 1\right) } \\&\le \frac{\Vert g\Vert _{L^2}^2}{|g(y)|^2} \frac{m(4m+2)^d}{n+1} + \frac{d m}{\left( n+m + 1\right) }, \end{aligned}$$

where we used \(\frac{h_n}{1+h_n}\le h_n\) in the last step. \(\square \)

5 Numerical Examples

We illustrate in this section the asymptotic behaviour of \(p_n\) and \(p_{1,n}\) for several types of singular measures, with respect to the 1-Wasserstein distance. We compute the distance using a semidiscrete optimal transport algorithm, described below. The code to reproduce the figures is available at https://github.com/Paulcat/Measure-trigo-approximations.

Our experiments focus on three examples on \(\mathbb {T}^2\): a discrete measure \(\mu _{\textrm{d}}\) supported on 15 points, with (nonnegative) random amplitudes, a uniform measure \(\mu _{\textrm{cu}}\) supported on the trigonometric algebraic curve

$$\begin{aligned} \cos (2\pi x)\cos (2\pi y) + \cos (2\pi x) + \cos (2\pi y) = \frac{1}{4}, \end{aligned}$$
(5.1)

and a uniform measure \(\mu _{\textrm{ci}}\) supported on the circle centered in \(c_0=(\frac{1}{2},\frac{1}{2})\) with radius \(r_0 = 0.3\).

The moments of \(\mu _{\textrm{cu}}\) are computed numerically up to machine precision using Arb [29] with a parametrization of the implicit curve (5.1). It follows from (3.4) that the trigonometric moments of the measure \(\mu _{\textrm{ci}}\) are given by

$$\begin{aligned} \widehat{\mu _{\textrm{ci}}}(k) = \text {e}^{-2\pi i {k c_0}} J_0(2\pi r_0 {\left\| k\right\| _{}}_2). \end{aligned}$$
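These moments are cheap to tabulate; for instance (a sketch of ours, using scipy.special.j0 for the order-zero Bessel function \(J_0\)):

```python
import numpy as np
from scipy.special import j0

c0, r0, n = np.array([0.5, 0.5]), 0.3, 50
ks = np.arange(-n, n + 1)
K = np.stack(np.meshgrid(ks, ks, indexing='ij'), axis=-1)  # k in {-n,...,n}^2
mu_ci_hat = np.exp(-2j * np.pi * (K @ c0)) \
            * j0(2 * np.pi * r0 * np.linalg.norm(K, axis=-1))
```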

The polynomials \(p_n\), \(J_n*\mu \), and \(p_{1,n}\) can be evaluated efficiently on a regular grid in \(\mathbb {T}^2\) via the fast Fourier transform. For the polynomial \(p_{1,n}\), the singular value decomposition of the moment matrix \(\varvec{T}_n\) can be computed at reduced cost by exploiting the Toeplitz structure of \(\varvec{T}_n\) and resorting only to matrix–vector products, which can in turn be carried out by means of the FFT.
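A univariate sketch of this idea (our own illustration, not the repository code): the Toeplitz matrix–vector product \(\varvec{T}_n\varvec{p}\) is realised by circulant embedding and FFTs, and a few leading singular triplets are then obtained by applying scipy.sparse.linalg.svds to the resulting LinearOperator; the multilevel Toeplitz case for \(d=2\) is analogous.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, svds

n = 200; N = n + 1                     # d = 1 sketch
xs = np.array([0.2, 0.5, 0.9]); lam = np.array([1.0, 0.5, 2.0])
m = np.arange(-n, n + 1)
moments = (lam * np.exp(-2j * np.pi * np.outer(m, xs))).sum(axis=1)

# circulant embedding: first column (mu_hat(0..n), mu_hat(-n..-1))
fc = np.fft.fft(np.concatenate([moments[n:], moments[:n]]))

def matvec(p):                         # computes T_n p in O(n log n)
    p = np.ravel(p)
    return np.fft.ifft(fc * np.fft.fft(p, 2 * n + 1))[:N]

# mu is nonnegative here, so T_n is Hermitian and rmatvec = matvec
Top = LinearOperator((N, N), matvec=matvec, rmatvec=matvec, dtype=complex)
U, s, Vh = svds(Top, k=3)              # leading singular triplets only
assert np.allclose(np.sort(s)[::-1] / N, [2.0, 1.0, 0.5], atol=0.2)
```

The final assertion reflects Lemma 4.8: for well-separated nodes, \(\sigma _j^{(n)}/N\) is close to the j-th largest weight in absolute value.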

To compute transport distances to the measure \(\mu \in \{\mu _{\textrm{cu}},\mu _{\textrm{ci}}\}\), let the curve \(C={\text {supp}}\,\mu \subset \mathbb {T}^d\) denote its support with arc-length L. Now let \(s\in \mathbb {N}\), take a partition \(C=\bigcup _{\ell =1}^s C_\ell \) into path-connected curves with measure \(\mu (C_\ell )=s^{-1}\) and arc-length \(L_\ell \), and any \(x_\ell \in C_\ell \), then

$$\begin{aligned} W_1\left( \frac{1}{s}\sum _{\ell =1}^s \delta _{x_\ell }, \mu \right)&=\sup _{f: \mathop {\textrm{Lip}}\limits (f)\le 1} \left| \sum _\ell \int _{C_\ell } \left[ f(x)-f(x_\ell )\right] \textrm{d}\mu (x) \right| \\&\le \sum _{\ell =1}^s \int _{C_\ell } |x-x_\ell |_1 \textrm{d}\mu (x) \le \sum _{\ell =1}^s \sqrt{d} L_\ell \mu (C_\ell ) =\frac{\sqrt{d}\cdot L}{s}. \end{aligned}$$

We denote the resulting discrete measures by \(\mu _{\textrm{cu}}^s\) and \(\mu _{\textrm{ci}}^s\), respectively (see Fig. 5). In our tests, we use \(s = 3000\) samples, which offers a satisfactory tradeoff between computational time and accuracy for our range of degrees n. Indeed, the computational cost of evaluating the objective (5.2) or its gradient grows linearly in s, while for degrees up to \(n=250\), sampling beyond 3000 points has no effect on the output of our algorithm for computing \(W_1(p_n,\mu ^s)\), see Fig. 5.

Fig. 5 The two example measures \(\mu _{\textrm{cu}}^s\) (left) and \(\mu _{\textrm{ci}}^s\) (middle) used in our numerical tests. In this display the two continuous measures are discretized using \(s=60\) samples. The amplitudes of the spikes in both measures are taken equal and normalized. The last plot shows the 1-Wasserstein distance \(W_1(F_n * \mu _{\textrm{cu}},\mu _{\textrm{cu}}^s)\) for degrees \(n=1,\ldots ,250\) and several values of s.

Now let \(\mu = \sum _{j=1}^s \lambda _j\delta _{x_j}\) refer to either \(\mu _{\textrm{d}}\), \(\mu _{\textrm{cu}}^s\) or \(\mu _{\textrm{ci}}^s\). The semidiscrete optimal transport between a measure with density p and the discrete measure \(\mu \) may be computed by solving the finite-dimensional optimization problem

$$\begin{aligned} \max _{w\in \mathbb {R}_+^s} f(w),\qquad f(w)=\sum _{j=1}^s \lambda _j w_j + \sum _{j=1}^s \int _{\Omega _j(w)} (\left| x_j-y \right| -w_j)p(y)\textrm{d}y \end{aligned}$$
(5.2)

where the Laguerre cells associated to the weight vector w are given by

$$\begin{aligned} \Omega _j(w) = \left\{ y\in \mathbb {T}^d : \left| x_j-y \right| - w_j \le \left| x_k-y \right| -w_k,\;k=1,\ldots ,s\right\} , \end{aligned}$$

see e.g. [52, Sec. 5.2]. In our implementation, the density (and the Laguerre cells) are discretized over a \(502\times 502\) grid. We use a BFGS algorithm to perform the maximization, using the Matlab implementation [59]; we stop the iterations when the change in the objective value drops below \(10^{-9}\), or when the infinity norm \({\left\| \nabla f\right\| _{}}_\infty \) drops below \(10^{-5}\). The latter condition has a geometrical interpretation, since the j-th component of \(\nabla f\) is the difference between the measure of the Laguerre cell \(\Omega _j(w)\) and the amplitude \(\lambda _j\). We cap the number of iterations at 100.
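For concreteness, here is a condensed sketch of (5.2) and its gradient, with the Laguerre cells evaluated pointwise on the grid; the Euclidean torus distance and SciPy's L-BFGS (in place of the Matlab implementation [59]) are our own choices for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def torus_dist(xj, Y):
    """Pairwise distances on T^2 (Euclidean norm of componentwise
    torus distances; the norm in (5.2) is our modelling assumption)."""
    D = np.abs(Y[None, :, :] - xj[:, None, :])
    return np.linalg.norm(np.minimum(D, 1.0 - D), axis=-1)

def semidiscrete_w1(xj, lam, p_vals, G):
    """Maximise (5.2) for a density sampled as p_vals at grid points G."""
    q = p_vals / p_vals.sum()          # quadrature weights of the grid cells
    C = torus_dist(xj, G)              # s x (#grid points) cost matrix

    def neg_f(w):
        scores = C - w[:, None]
        j = np.argmin(scores, axis=0)  # Laguerre cell of each grid point
        f = lam @ w + scores[j, np.arange(G.shape[0])] @ q
        # d f / d w_j = lambda_j - p(Omega_j(w)): amplitude vs. cell mass
        grad = lam - np.bincount(j, weights=q, minlength=len(w))
        return -f, -grad

    res = minimize(neg_f, np.zeros(len(lam)), jac=True, method='L-BFGS-B',
                   options={'maxiter': 100, 'ftol': 1e-9, 'gtol': 1e-5})
    return -res.fun                    # optimal value of (5.2) equals W_1
```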

Fig. 6 Asymptotics of \(p_n\) and \(p_{1,n}\). For \(p_{1,n}\), the distance is computed with respect to the unweighted measure \(\tilde{\mu }^s\), that is, \(\tilde{\mu }^s = \frac{1}{s}\sum _{j=1}^s\delta _{x_j}\), where \(\{x_1,\dots ,x_s\}\) is the support of \(\mu \).

In the discrete case, our numerical results (see Fig. 6) show that the Wasserstein distance \(W_1(p_n,\mu ^s)\) decreases at a rate close to the worst-case bound derived in Theorem 3.3. The same holds for \(W_1(p_{1,n},\tilde{\mu }^s)\), which is consistent with the bound given in the proof of Theorem 4.9. In the positive-dimensional cases, one would need to compute the Wasserstein distances for degrees larger than \(n=250\) to reliably estimate a rate, but this would require better optimised algorithms, for instance in the spirit of [36], and goes beyond the scope of this paper. Still, our preliminary results indicate that the rates for \(F_n * \mu \) and \(J_n * \mu \) in the positive-dimensional situation are similar to the ones for discrete measures, but with better constants, see Fig. 6. For \(p_{1,n}\), on the other hand, the theory does not predict weak convergence in this setting; if it were to occur, our results indicate that the rate would be worse than in the discrete case.

6 Summary and Outlook

We provided tight bounds on the pointwise approximation error, as well as on the error with respect to the 1-Wasserstein distance, when approximating arbitrary measures by trigonometric polynomials. We recently generalised this to approximation with respect to the p-Wasserstein metric, where more strongly localised kernels are used [8]. Future work might address the truncation of the singular value decomposition in Sect. 4 when the support of the measure is only approximated by the zero set of an unknown trigonometric polynomial, or when the available trigonometric moments are perturbed by noise.