Embracing off-the-grid samples

Many empirical studies suggest that samples of continuous-time signals taken at locations randomly deviated from an equispaced grid (i.e., off-the-grid) can benefit signal acquisition, e.g., undersampling and anti-aliasing. However, explicit statements of such advantages and their respective conditions are scarce in the literature. This paper provides some insight on this topic when the sampling positions are known, with grid deviations generated i.i.d. from a variety of distributions. By solving a square-root LASSO decoder with an interpolation kernel we demonstrate the capabilities of nonuniform samples for compressive sampling, an effective paradigm for undersampling and anti-aliasing. For functions in the Wiener algebra that admit a discrete s-sparse representation in some transform domain, we show that O(s polylog N) random off-the-grid samples are sufficient to recover an accurate N/2-bandlimited approximation of the signal.
For sparse signals (i.e., s ≪ N), this sampling complexity is a great reduction in comparison to equispaced sampling, where O(N) measurements are needed for the same quality of reconstruction (Nyquist–Shannon sampling theorem). We further consider noise attenuation via oversampling (relative to a desired bandwidth), a standard technique with limited theoretical understanding when the sampling positions are non-equispaced. By solving a least squares problem, we show that O(N log N) i.i.d. randomly deviated samples provide an accurate N/2-bandlimited approximation of the signal with suppression of the noise energy by a factor ∼ 1/√(log N).


Introduction
The Nyquist–Shannon sampling theorem is perhaps the most impactful result in the theory of signal processing, fundamentally shaping the practice of acquiring and processing data [6,24] (also attributed to Kotel'nikov [60], Ferrar [61], Cauchy [1], Ogura [35], E.T. and J.M. Whittaker [12,31]). In this setting, typical acquisition of a continuous-time signal involves taking equispaced samples at a rate slightly higher than a prescribed frequency ω Hz in order to obtain a bandlimited approximation via a quickly decaying kernel. Such techniques provide accurate approximations of (noisy) signals whose spectral energy is largely contained in the band [−ω/2, ω/2] [1-3,18].
As a consequence, industrial signal acquisition and post-processing methods tend to be designed to incorporate uniform sampling. However, such sampling schemes are difficult to honor in practice due to physical constraints and natural factors that perturb sampling locations from the uniform grid, i.e., nonuniform or off-the-grid samples. In response, nonuniform analogs of the noise-free sampling theorem have been developed, where an average sampling density proportional to the highest frequency ω of the signal guarantees accurate interpolation, e.g., Landau density [18,26,36]. However, non-equispaced samples are typically unwanted and regarded as a burden due to the extra computational cost involved in regularization, i.e., interpolating the nonuniform samples onto the desired equispaced grid.
On the other hand, many works in the literature have considered the potential benefits of deliberate nonuniform sampling [4, 8, 9, 11, 13, 15, 16, 20-22, 25, 27, 33, 37, 38, 40, 43, 45, 46, 49-51, 53, 54, 56, 57, 59, 62]. Suppression of aliasing error, i.e., anti-aliasing, is a well-known advantage of randomly perturbed samples. For example, jittered sampling is a common technique for anti-aliasing that also provides a well distributed set of samples [7,10,51,52]. To the best of the authors' knowledge, this phenomenon was first noticed by Harold S. Shapiro and Richard A. Silverman [25] (also by Frederick J. Beutler [15,16] and implicitly by Henry J. Landau [26]) and remained unused in applications until rediscovered at Pixar Animation Studios by Robert L. Cook [53]. According to our literature review, such observations remain largely empirical or arguably uninformative for applications. Closing this gap between theory and experiments would help the practical design of such widely used methodologies.
To this end, in this paper we propose a practical framework that allows us to concretely investigate the properties of randomly deviated samples for undersampling, anti-aliasing and general noise attenuation. To elaborate (see Section 1.1 for notation), let f : [−1/2, 1/2) → C be our function of interest, belonging to some smoothness class. Our goal is to obtain a uniform discretization f ∈ C^N, where an estimate of f_k = f((k−1)/N − 1/2) will provide an accurate approximation f̂(x) of f(x) for all x ∈ [−1/2, 1/2). We are given noisy non-equispaced samples, b = f̃ + d ∈ C^m, where f̃_k = f((k−1)/m − 1/2 + Δ_k) is the nonuniformly sampled signal and d ∈ C^m encompasses unwanted additive noise. In general, we will consider functions f with support on [−1/2, 1/2) whose periodic extension is in the Wiener algebra A(Ω) [63], where by abuse of notation Ω denotes both the interval [−1/2, 1/2) and the torus T. To achieve undersampling and anti-aliasing, we assume our uniform signal admits a sparse (or compressible) representation along the lines of compressive sensing [23,30,58]. We say that f is compressible with respect to a transform, Ψ ∈ C^{N×N}, if there exists some g ∈ C^N such that f = Ψg and g can be accurately approximated by an s-sparse vector (s ≤ N). In this scenario, our methodology consists of constructing an interpolation kernel S ∈ R^{m×N} that achieves Sf ≈ f̃ accurately for smooth signals, in order to define our estimate f̂(x) using the discrete approximation Ψĝ ≈ f, where ĝ is the solution of a square-root LASSO decoder and λ ≥ 0 is a parameter to be chosen appropriately. We show that for signals in the Wiener algebra and under certain distributions D, if we have m ∼ O(s polylog(N)) off-the-grid samples with i.i.d. deviations {Δ_k}_{k=1}^m ∼ D, then the approximation error |f̂(x) − f(x)| is proportional to ‖d‖₂, the error of the best s-sparse approximation of g, and the error of the best N/2-bandlimited approximation of f in the Wiener algebra norm (see equation 6.1 in [63]). If s ≪ N, the average sampling rate required for our result (step size ∼ 1/(s polylog(N))) provides a stark contrast to standard density conditions, where a rate proportional to the highest frequency ω ∼ N (resulting in step size ∼ 1/N) is needed for the same bandlimited approximation. The result is among the first to formally state the anti-aliasing nature of nonuniform sampling in the context of undersampling (see Section 3).
Removing the sparse signal model, we attenuate measurement noise (i.e., denoise) by defining f̂(x) using a discrete least squares estimate. In this context, our main result states that m ≥ N log(N) i.i.d. randomly deviated samples provide approximation error |f̂(x) − f(x)| proportional to the noise level (suppressed by a factor ∼ 1/√(log N)) and the error of the best N/2-bandlimited approximation of f in the Wiener algebra norm. Thus, by nonuniform oversampling (relative to the desired N/2-bandwidth) we attenuate unwanted noise regardless of its structure. While uniform oversampling is a common noise filtration technique, our results show that general nonuniform samples also possess this denoising property (see Section 4).
The rest of the paper is organized as follows: Section 2 provides a detailed elaboration of our sampling scenario, signal model and methodology. Section 3 showcases our results for anti-aliasing and undersampling of compressible signals, while Section 4 considers noise attenuation via oversampling. A comprehensive discussion of the results and implications is presented in Section 5. Several experiments and computational considerations are found in Section 6, followed by concluding remarks in Section 7. We postpone the proofs of our statements until Section 8. Before proceeding to the next section, we find it best to introduce the general notation that will be used throughout. However, each subsequent section may introduce additional notation helpful in its specific context.

Notation
We denote complex valued functions of real variables using bold letters, e.g., f : R → C. For any integer n ∈ N, [n] denotes the set {ℓ ∈ N : 1 ≤ ℓ ≤ n}. For k, ℓ ∈ N, b_k indicates the k-th entry of the vector b, D_{kℓ} denotes the (k, ℓ) entry of the matrix D, and D_{k*} (D_{*ℓ}) denotes the k-th row (resp. ℓ-th column) of the matrix. We reserve x to denote real variables and write the complex exponential as e(x) := e^{2πix}, where i is the imaginary unit. For a vector f ∈ C^n and 1 ≤ p < ∞, ‖f‖_p := (Σ_{k=1}^n |f_k|^p)^{1/p} denotes the ℓ_p-norm, and ‖f‖_0 gives the total number of non-zero elements of f. For a matrix X ∈ C^{n×m}, σ_k(X) denotes the k-th largest singular value of X and ‖X‖ := σ_1(X) is the spectral norm. A(Ω) is the Wiener algebra and H^k(Ω) is the Sobolev space W^{k,2}(Ω) (with domain Ω), S^{n−1} is the unit sphere in C^n, and the adjoint of a linear operator A is denoted by A*.

Assumptions and Methodology
In this section we develop the signal model, deviation model and interpolation kernel, in Sections 2.1, 2.2 and 2.3 respectively. This will allow us to proceed to Sections 3 and 4, where the main results are elaborated. We note that the deviation model (Section 2.2) and the sparse signal model at the end of Section 2.1 only apply to the compressive sensing results in Section 3. However, the sampling on the torus assumption in Section 2.1 does apply to the results in Section 4 as well.

Signal Model
For the results in Sections 3 and 4, let Ω = [−1/2, 1/2) and let f : Ω → C be the function of interest to be sampled. We assume f ∈ A(Ω) with Fourier expansion f(x) = Σ_{ℓ∈Z} c_ℓ e(ℓx) on Ω. Note that our regularity assumption implies that Σ_{ℓ∈Z} |c_ℓ| < ∞, which will be crucial for our error bounds. Further, H^k(Ω) ⊂ A(Ω) for k ≥ 1, so that our context applies to many signals of interest. Henceforth, let N ∈ N be odd. We denote the discretized regular data vector by f ∈ C^N, which is obtained by sampling f on the uniform grid τ = {t_1, ..., t_N} with t_k = (k−1)/N − 1/2 (a collection of equispaced points), so that f_k = f(t_k). The vector f will be our discrete signal of interest to recover via nonuniform samples, in order to ultimately obtain an approximation to f(x) for all x ∈ Ω. Similar results can be obtained in the case N is even; our current assumption is adopted to simplify the exposition.
The observed nonuniform data is encapsulated in the vector f̃ ∈ C^m with underlying unstructured grid τ̃ = {t̃_1, ..., t̃_m} ⊂ Ω, where t̃_k := (k−1)/m − 1/2 + Δ_k is now a collection of generally non-equispaced points. The entries of the perturbation vector Δ ∈ R^m define the pointwise deviations of τ̃ from the equispaced grid {(k−1)/m − 1/2}_{k=1}^m, and f̃_k = f(t̃_k). Noisy nonuniform samples are given as b = f̃ + d ∈ C^m, where the noise model, d, does not incorporate off-the-grid corruption. We assume that we know τ̃.
In order for the expansion (3) to remain valid for x ∈ τ̃, we must impose τ̃ ⊂ Ω. This is not possible for the general deviations Δ we wish to examine, so we instead adopt the torus as our sampling domain to ensure this condition.
Sampling on the torus: for all our results, we consider sampling schemes to be on the torus. In other words, we allow grid points τ̃ to lie outside of the interval [−1/2, 1/2), but they will correspond to samples of f within [−1/2, 1/2) via a circular wrap-around. To elaborate, if f|_Ω(x) is given by the expansion (3), then we define f(x) as the periodic extension of f|_Ω(x) to the whole real line. We now apply samples generated from our deviations τ̃ to f(x). Indeed, any t̃_k generated outside of Ω will satisfy f(t̃_k) = f(t*) for some t* ∈ Ω. In this way, we avoid restricting the magnitude of the entries of Δ, and the expansion (3) will remain valid for any nonuniform samples generated.
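As a small illustration, the wrap-around can be sketched numerically (a minimal sketch; the function name and use of NumPy are ours, not the paper's):

```python
import numpy as np

def wrap_to_omega(t):
    """Map sample positions onto Omega = [-1/2, 1/2) via the periodic
    (torus) wrap-around: t* = ((t + 1/2) mod 1) - 1/2."""
    return (np.asarray(t) + 0.5) % 1.0 - 0.5

# A deviated point landing outside Omega is identified with a point inside,
# shifted by an integer number of periods.
print(wrap_to_omega(0.7))    # approximately -0.3
print(wrap_to_omega(-0.62))  # approximately  0.38
```

With this identification, a deviation Δ_k of arbitrary magnitude still yields a valid sample of f inside Ω.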
Sparse signal model: for the results of Section 3 only, we impose a compressibility condition on f ∈ C^N. To this end, let Ψ ∈ C^{N×N} be a basis with 0 < σ_N(Ψ) =: α and σ_1(Ψ) =: β. We assume there exists some g ∈ C^N such that f = Ψg, where g can be accurately approximated by an s-sparse vector (s ≤ N). To be precise, for s ∈ [N] we define the error of the best s-sparse approximation of g as ε_s(g) := min_{z : ‖z‖_0 ≤ s} ‖g − z‖₁, and assume s has been chosen so that ε_s(g) is within a prescribed error tolerance determined by the practitioner.
In Section 8.1, we will relax the condition that Ψ be a basis by allowing full column rank matrices Ψ ∈ C^{N×n} with n ≤ N. While such transforms are not typical in compressive sensing, we argue that they may be of practical interest since our results will show that if Ψ can be selected as a tall matrix then the sampling complexity will solely depend on its number of columns (i.e., the smallest dimension n).
The transform Ψ will have to be incoherent with respect to the 1D centered discrete Fourier basis F ∈ C^{N×N} (see Section 2.3 for the definition of F). We define the DFT-incoherence parameter as γ := max_{ℓ∈[N]} ‖FΨ_{*ℓ}‖₁, which provides a uniform bound on the ℓ₁-norm of the discrete Fourier coefficients of the columns of Ψ. This parameter will play a role in the sampling complexity of our result in Section 3, as a metric that quantifies the smoothness of our signal of interest. We discuss γ in detail in Section 5.3, including examples for several transforms common in compressive sensing.

Deviation Model
Section 3 will apply to random deviations Δ ∈ R^m whose entries are i.i.d. with any distribution, D, that obeys the following: for δ ∼ D, there exists some θ ≥ 0 such that |E e(jmδ)| ≤ θ for all nonzero integers j. This will be known as our deviation model. In our results, distributions with a smaller θ parameter will require fewer samples and provide reduced error bounds. We postpone further discussion of the deviation model until Section 5.2, where we will also provide examples of deviations that fit our criteria. We note that the deviation model is most relevant when m < N. The case m ≥ N is discussed in Section 4, which no longer requires this deviation model or the sparse signal model.

Dirichlet Kernel
In Sections 3 and 4, we model our nonuniform samples via an interpolation kernel S ∈ R^{m×N} that achieves Sf ≈ f̃ accurately. We consider the Dirichlet kernel defined by S = NF* : C^N → C^m, where F ∈ C^{N×N} is a 1D centered discrete Fourier transform (DFT) and N ∈ C^{m×N} is a 1D centered nonuniform discrete Fourier transform (NDFT, see [32,39]) with normalized rows and nonharmonic frequencies chosen according to τ̃. In other words, let Ñ = (N−1)/2; then the (k, ℓ) ∈ [m] × [N] entry of N is given as N_{kℓ} = e((ℓ − 1 − Ñ) t̃_k)/√N. This NDFT is referred to as a nonuniform discrete Fourier transform of type 2 in [39]. Thus, the action of S on f ∈ C^N can be given as follows: we first apply the centered inverse DFT to our discrete uniform data, followed by the NDFT in terms of τ̃. Equivalently, (Sf)_k = (1/N) Σ_{ℓ=1}^N f_ℓ K(t̃_k − t_ℓ), where K(x) = sin(Nπx)/sin(πx) is the Dirichlet kernel. This equality is well known and holds by applying the geometric series formula upon expansion. This kernel is commonly used for trigonometric interpolation and is accurate when acting on signals that can be well approximated by trigonometric polynomials of finite order, as we show in the following theorem.
where r(ℓ) = rem(ℓ + Ñ, N) − Ñ, with rem(p, q) giving the remainder after division of p by q. As a consequence, if t̃_k ∈ Ω for all k ∈ [m], then the interpolation error is controlled by the tail sum Σ_{|ℓ|>Ñ} |c_ℓ|. The proof of this theorem is postponed until Section 8.3. Therefore, the error due to S is proportional to the 1-norm (or Wiener algebra norm) of the Fourier coefficients of f that correspond to frequencies larger than Ñ = (N−1)/2. In particular, notice that if c_ℓ = 0 for all |ℓ| > Ñ, we obtain perfect interpolation, as expected from standard results in signal processing (i.e., bandlimited signals consisting of trigonometric polynomials of finite degree ≤ Ñ). Despite the wide usage of trigonometric interpolation in applications [14,28,74], such a result with a sharp error bound does not seem to exist in the literature.
Notice that Theorem 1 only holds for τ̃ ⊂ Ω, as restricted in Section 2.1. However, the result continues to hold for unrestricted τ̃ if we sample on the torus as imposed in Section 2.1. Therefore, the error bound will always hold under our setup.

Anti-aliasing via Nonuniform Sampling
With the definitions and assumptions introduced in Section 2, our methodology in this section will consist of modeling our m nonuniform measurements via S and approximating the s largest coefficients of f in Ψ (in the representation f = Ψg). This discrete approximation will provide an accurate estimate f̂(x) of f(x) for all x ∈ Ω, achieving precision comparable to that given by the best N/2-bandlimited approximation of f while requiring m ≪ N samples. The following is a simplified statement, assuming that Ψ is an orthonormal basis and m ≤ N. We focus on this cleaner result for ease of exposition, presented as a corollary of the main result in Section 8.1. The full statement considers the case m ≥ N and allows for more general and practical Ψ that admit reduced sampling complexity.
Theorem 2 Let 2 ≤ s ≤ N and m ≤ N, where m is the number of nonuniform samples. Under our signal model with Fourier expansion (3), let Ψ ∈ C^{N×N} be an orthonormal basis with DFT-incoherence parameter γ. Define the interpolation kernel S as in Section 2.3, with the entries of Δ i.i.d. from any distribution satisfying our deviation model from Section 2.2 with θ < 1. Define ĝ by (12); if m is sufficiently large, where C₁ and C₂ are absolute constants, then the error bound (14) holds with probability exceeding 1 − 1/N.
Therefore, with m ∼ s log 4 (N ) randomly perturbed samples we can recover f with error (14) proportional to the sparse model mismatch s (g), the noise level d 2 , and the error of the best N −1 2 -bandlimited approximation of f in the Wiener algebra norm (i.e., As a consequence, we can approximate f(x) for all x ∈ Ω as stated in the following corollary.
Corollary 1 Let h : Ω → C^N be the vector valued function defined entry-wise in (15), and define the estimate f̂(x) accordingly, where ĝ is given by (12). Then, under the assumptions of Theorem 2, the analogous continuous error bound holds. The proof of this corollary is presented in Section 8.3. In the case ε_s(g) = ‖d‖₂ = 0, the results intuitively say that we can recover a (N−1)/2-bandlimited approximation of f with O(s polylog(N)) random off-the-grid samples. In the case of equispaced samples, O(N) measurements are needed for the same quality of reconstruction by the Nyquist–Shannon sampling theorem (or by Theorem 1 directly). Thus, for compressible signals with s ≪ N, random nonuniform samples provide a significant reduction in sampling complexity (undersampling) and simultaneously allow recovery of frequency components exceeding the sampling density (anti-aliasing). See Section 5 for further discussion.
Notice that general denoising is not guaranteed in an undersampling scenario, due to the factor √N/√m appearing in (14) and (17). In other words, one cannot expect the output estimate to reduce the measurement noise ‖d‖₂, since √N/√m ≥ 1 in our error bound implies an amplification of the input noise level. Such situations with limited samples are typical in compressive sensing, and this noise amplifying behavior is demonstrated numerically in Section 6.3. In general, a practitioner must oversample (i.e., N < m) to attenuate the effects of generic noise. However, Theorem 2 and Corollary 1 state that nonuniform samples specifically attenuate aliasing noise.

Denoising via Nonuniform Oversampling
In this section, we show that reduction in the noise level introduced during acquisition is possible given nonuniform samples whose average density exceeds the Nyquist rate (relative to a desired bandwidth).While the implications of this section are not surprising in the context of classical sampling theory, to the best of our knowledge such guarantees do not exist in the literature when the sampling points are nonuniform.
By removing the sparse signal model (Section 2.1) and deviation model (Section 2.2), and requiring m ≥ N off-the-grid samples (on the torus, see Section 2.1), we now use the numerically cheaper program of least squares. To reiterate, f ∈ A(Ω) with Fourier expansion (3). The vector f̃ ∈ C^m encapsulates the nonuniformly sampled data, where f̃_k = f(t̃_k) for t̃_k := (k−1)/m − 1/2 + Δ_k. Noisy nonuniform samples are given as b = f̃ + d ∈ C^m, where the additive noise model, d, does not incorporate off-the-grid corruption.
In this oversampling context, we provide a denoising result for a more general set of deviations.
Theorem 3 Let the entries of Δ be i.i.d. from any distribution, and define f̂ by the least squares program (18). Then the stated error bound holds with probability exceeding 1 − 1/N.
The proof can be found in Section 8.2. In this scenario, we oversample relative to the desired N/2-bandwidth. The oversampling parameter κ may be varied for increased attenuation at the cost of denser sampling. We comment that the methodology from Section 3 with m ≥ N also allows for denoising and similar error bounds (see Theorem 4). However, focusing on the oversampling case distinctly provides simplified results with many additional benefits.
In particular, here the deviations ∆ need not be from our deviation model in Section 2.2 and instead the result applies to perturbations generated by any distribution.This includes the degenerate distribution (deterministic), so the claim also holds in the case of equispaced samples.Furthermore, we no longer require the sparsity assumption and the result applies to all functions in the Wiener algebra.Finally, the recovery method (18) consists of standard least squares which can be solved cheaply relative to the square-root LASSO decoder (12) from the previous section.
We may proceed analogously to Corollary 1 and show that the output discrete signal f̂ provides a continuous approximation for all x ∈ Ω, where h(x) is defined in (15). The error of this estimate is proportional to the error of the best (N−1)/2-bandlimited approximation in the Wiener algebra norm, while attenuating the introduced measurement noise. In this result, the structure of the deviated samples is quite general and accounts for many practical cases.
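A small numerical sketch of this oversampled least-squares recovery (our own illustration; the grid sizes, noise level and uniform jitter are arbitrary choices, not the paper's experiments):

```python
import numpy as np

def dirichlet_kernel(x, N):
    # K(x) = sin(N*pi*x)/sin(pi*x); removable singularities -> N (N odd)
    x = np.asarray(x, dtype=float)
    num, den = np.sin(N * np.pi * x), np.sin(np.pi * x)
    near = np.isclose(den, 0.0)
    return np.where(near, float(N), num / np.where(near, 1.0, den))

rng = np.random.default_rng(1)
N = 31                                   # desired uniform grid size (odd)
m = 4 * N                                # oversampling: m >= N
t_uni = np.arange(N) / N - 0.5           # uniform grid
t_dev = np.arange(m) / m - 0.5 + rng.uniform(-0.5 / m, 0.5 / m, m)  # deviated grid

# Interpolation kernel S in C^{m x N}: (S f)_k = (1/N) sum_l f_l K(t_dev_k - t_uni_l)
S = dirichlet_kernel(t_dev[:, None] - t_uni[None, :], N) / N

f = rng.standard_normal(N) + 1j * rng.standard_normal(N)          # discrete signal
d = 0.1 * (rng.standard_normal(m) + 1j * rng.standard_normal(m))  # additive noise
b = S @ f + d                                                     # noisy samples

f_hat, *_ = np.linalg.lstsq(S, b, rcond=None)   # least squares decoder
print(np.linalg.norm(f_hat - f) <= np.linalg.norm(d))  # noise is attenuated
```

With m well above N the matrix S is well conditioned for this jitter, so the residual error ‖f̂ − f‖₂ = ‖S⁺d‖₂ falls below the input noise level, in line with the oversampling discussion above.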
While related results exist in the equispaced case (see for example Section

Related Work
Several studies in the compressive sensing literature are similar to our results in Section 3 (see [28,29]). In contrast to these references, we derive recovery guarantees for non-orthonormal systems (when θ = 0) while focusing the scope of the paper within the context of classical sampling theory (introducing error according to the bandlimited approximation). The work in [28] considers sampling of sparse trigonometric polynomials and overlaps with our application in the case Ψ = F. Our results generalize this work to allow for other signal models and sparsifying transforms. Furthermore, [28] assumes that the samples are chosen uniformly at random from a continuous interval or a discrete set of N equispaced points. In contrast, our results pertain to general deviations from an equispaced grid with average sampling density ∼ s polylog(N) and allow for many other distributions of the perturbations.

Deviation Model
In this section, we present several examples of distributions that are suitable for our results in Section 3. Notice that our deviation model utilizes the characteristic function of a given distribution, evaluated at a finite set of points.This allows one to easily consider many distributions for our purpose by consulting the relevant and exhaustive literature of characteristic functions (see for example [17]).
5. Exponential: D = Exp(λ), for any rate λ > 0, gives |E e(jmδ)| = λ/√(λ² + (2πjm)²). In particular, notice that examples 1 and 2 include cases of jittered sampling [7,10,51,52]. Indeed, with p = 1 these examples partition Ω into m regions of equal size, and these distributions choose a point randomly from each region (in a continuous or discrete sense). The jittered sampling list can be expanded by considering other distributions to generate samples within each region.
In general we will have θ > 0, which implies deteriorated output quality and an increased number of required off-the-grid samples according to Theorem 2. Arguably, our deviation model introduces a notion of optimal jitter when the chosen distribution achieves θ = 0, which is ideal in our results. This observation may be of interest in the active literature on jittered sampling techniques [51].
Intuitively, θ is measuring how biased a given distribution is in generating deviations. If δ ∼ D, |E e(jmδ)| ≈ 0 means that the distribution is nearly centered and impartial. On the other hand, |E e(jmδ)| ≈ 1 gives the opposite interpretation, where the deviations will be generated favoring a certain direction in an almost deterministic sense. Our result is not applicable to such biased distributions, since in Theorem 2 as θ → 1 the error bound becomes unbounded and meaningless.
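For instance, one can check empirically (a sketch under our own conventions) that the jittered uniform distribution is unbiased in this sense, i.e., it achieves θ = 0:

```python
import numpy as np

# Monte Carlo estimate of |E e(j*m*delta)| for delta ~ Unif[-1/(2m), 1/(2m)].
# Analytically, E e(j*m*delta) = sin(pi*j)/(pi*j) = 0 for every nonzero
# integer j, so this jittered distribution achieves theta = 0.
rng = np.random.default_rng(0)
m = 128
delta = rng.uniform(-0.5 / m, 0.5 / m, size=200_000)

biases = [abs(np.mean(np.exp(2j * np.pi * k * m * delta))) for k in (1, 2, 5)]
print(biases)  # each entry close to 0 (Monte Carlo error ~ 1/sqrt(200000))
```

A biased choice, e.g., a point mass at a fixed offset, would instead give |E e(jmδ)| = 1 for all j, the degenerate case excluded by Theorem 2.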

Signal Model
In this section we discuss the DFT-incoherence parameter γ, introduced in Section 2.1 as γ := max_{ℓ∈[n]} ‖FΨ_{*ℓ}‖₁, where we now let Ψ ∈ C^{N×n} be a full column rank matrix with n ≤ N. The parameter γ is a uniform upper bound on the 1-norm of the discrete Fourier coefficients of the columns of Ψ. Since the decay of the Fourier coefficients of a function is related to its smoothness, intuitively γ can be seen as a measure of the smoothness of the columns of Ψ. Implicitly, this also measures the smoothness of f, since its uniform discretization admits a representation via this transformation, f = Ψg.
Therefore, the role of γ in the sampling complexity is clear: relatively small γ implies that our signal of interest is smooth and therefore requires fewer samples. This observation is intuitive, since non-smooth functions will require additional samples to capture discontinuities, in accordance with the Gibbs phenomenon. This argument is validated numerically in Section 6.1, where we compare reconstruction via an infinitely differentiable ensemble (FFT) and a discontinuous wavelet (Daubechies 2).
We now consider several common choices for Ψ and discuss the respective γ parameter:
1. Ψ = F (the DFT), then γ = 1, which is optimal. However, most appropriate and common is the choice Ψ = F*, which can be shown to exhibit γ ∼ O(1) by a simple calculation.
2. When Ψ = H* is the inverse 1D Haar wavelet transform, we have γ ∼ O(log(N)). In [19] it is shown that the inner products between rows of F and rows of H decay according to an inverse power law of the frequency (see Lemma 1 therein). A similar proof shows that |⟨F_{k*}, H_{*ℓ}⟩| ∼ 1/|k|, which gives the desired upper bound for γ via an integral comparison. Notice that these basis vectors have jump discontinuities, and yet we still obtain an acceptable DFT-incoherence parameter for nonuniform undersampling.
3. Ψ = I_N (the N × N identity) gives γ = √N. This is the worst case scenario for normalized transforms, since ‖FΨ_{*ℓ}‖₁ ≤ √N ‖FΨ_{*ℓ}‖₂. In general, our smooth signals of interest are not fit for this sparsity model.
4. Let p ≥ 1 be an integer, and let Ψ be a matrix whose columns are uniform discretizations of p-differentiable functions, with p − 1 periodic and continuous derivatives and p-th derivative that is piecewise continuous. In this case the Fourier coefficients decay fast enough that γ is bounded independently of N. For the sake of brevity we do not provide this calculation, but refer the reader to Section 2.8 in [48] for an informal argument.
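As a quick numerical sanity check of items 1 and 3 (our own sketch; we use a standard non-centered unitary DFT, whereas the paper uses a centered one, which does not change these two values):

```python
import numpy as np

N = 64
# Unitary DFT matrix: column j is the normalized DFT of the j-th standard basis vector
F = np.fft.fft(np.eye(N), axis=0) / np.sqrt(N)

def gamma(Psi):
    # DFT-incoherence: largest l1-norm among the DFT coefficient vectors of the columns
    return np.max(np.sum(np.abs(F @ Psi), axis=0))

print(gamma(np.eye(N)))    # identity transform: gamma = sqrt(N) = 8
print(gamma(F.conj().T))   # inverse DFT: columns are pure exponentials, gamma = 1
```

The identity attains the worst case because each standard basis vector spreads evenly over all N frequencies, while each column of the inverse DFT concentrates on a single frequency.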
Example 4 is particularly informative due to its generality and ability to somewhat formalize the intuition behind γ previously discussed.This example implies the applicability of our result to a general class of smooth functions that agree nicely with our signal model defined in Section 2.1 (functions in A(Ω)).

Numerical Experiments
In this section we present numerical experiments to explore several aspects of our methodology and results. Specifically, we consider the effects of the DFT-incoherence and θ parameters in Sections 6.1 and 6.2, respectively. Section 6.3 investigates the noise attenuation properties of nonuniform samples. We first introduce several terms and models to describe the setup of the experiments. Throughout, we let N = 2015 be the size of the uniformly discretized signal f. Program (1) with λ = 1/(2√(2s)) is solved using CVX [41,42], a MATLAB® optimization toolbox for solving convex problems. We implement the Dirichlet kernel using (7) directly to construct S. We warn the reader that in this section we have not dedicated much effort to optimizing the numerical complexity of the interpolation kernel. For a faster implementation, we recommend instead applying the DFT/NDFT representation S = NF* (see Section 2.3) using the NFFT 3 software from [32] or its parallel counterpart [44].
Given output f̂ = Ψĝ with true solution f, we consider the relative norm of the reconstruction error as a measure of output quality, given as ‖f̂ − f‖₂/‖f‖₂.
Grid perturbations: to construct the nonuniform grid τ̃, we introduce an irregularity parameter ρ ≥ 0. We define our perturbations by sampling from a uniform distribution, so that each Δ_k is drawn uniformly at random from [−ρ/m, ρ/m] for all k ∈ [m] independently. Off-the-grid samples τ̃ are generated independently for each signal reconstruction experiment.
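The grid construction above can be written compactly (a sketch; the function name is ours). Note that for ρ ≤ 1/2 neighboring points cannot cross, so the jittered grid stays sorted:

```python
import numpy as np

def nonuniform_grid(m, rho, rng):
    # t_k = (k-1)/m - 1/2 + Delta_k with Delta_k ~ Unif[-rho/m, rho/m] i.i.d.
    delta = rng.uniform(-rho / m, rho / m, size=m)
    return np.arange(m) / m - 0.5 + delta

rng = np.random.default_rng(0)
t = nonuniform_grid(100, 0.5, rng)
print(np.all(np.diff(t) >= 0))  # True: ordering preserved when rho <= 1/2
```

Larger ρ produces a more irregular grid; with ρ > 1/2 adjacent points may swap, and points may exit Ω, which is handled by the torus wrap-around of Section 2.1.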
Complex exponential signal model: we consider bandlimited complex exponentials with random harmonic frequencies. With bandwidth ω = (N−1)/2 = 1007 and sparsity level s = 50, we generate a frequency vector ω ∈ Z^s by choosing s frequencies uniformly at random from {−ω, −ω + 1, ..., ω}, and let f be the corresponding sum of complex exponentials. We use the DFT as a sparsifying transform, Ψ = F, so that g = Ψ*f = Ψ^{−1}f is a 50-sparse vector. This transform is implemented using MATLAB's fft function. The frequency vector, ω, is generated randomly for each independent set of experiments. Note that in this case we have optimal DFT-incoherence parameter γ = 1 (see Section 5.3).
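This model is easy to reproduce (a sketch in NumPy rather than MATLAB; the unit-coefficient normalization of f is our choice): sampling f on the uniform grid and taking the DFT yields an exactly 50-sparse coefficient vector:

```python
import numpy as np

rng = np.random.default_rng(0)
N, s = 2015, 50
bw = (N - 1) // 2                                  # bandwidth 1007
freqs = rng.choice(np.arange(-bw, bw + 1), size=s, replace=False)

grid = np.arange(N) / N - 0.5                      # uniform grid tau
f = np.exp(2j * np.pi * freqs[:, None] * grid[None, :]).sum(axis=0)

g = np.fft.fft(f) / N                              # DFT coefficients (Psi = F case)
print(int(np.sum(np.abs(g) > 1e-8)))               # 50: g is exactly s-sparse
```

Because the s chosen frequencies are distinct modulo N and lie inside the band, each contributes exactly one nonzero DFT coefficient, so no aliasing occurs on the uniform grid.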
For this dataset, we use the Daubechies 2 wavelet as a sparsifying transform Ψ, implemented using the Rice Wavelet Toolbox [55]. This provides g = Ψ*f = Ψ^{−1}f that can be well approximated by a 50-sparse vector. In other words, all entries of g are non-zero, but ε₅₀(g) < .088 ≈ ‖f‖₂/250, and if g₅₀ is the best 50-sparse approximation of g then ‖f − Ψg₅₀‖₂ < .026 ≈ ‖f‖₂/850. The smallest singular value of the transform is σ₂₀₁₅(Ψ) = 1, and we have γ ≈ 36.62, computed numerically.

Effect of DFT-incoherence
This section is dedicated to exploring the effect of the DFT-incoherence parameter in signal reconstruction. We consider the complex exponential and Gaussian signal models described above. Recall that in the complex exponential model we have Ψ = F (the DFT) with optimal DFT-incoherence parameter γ = 1. In the Gaussian model, Ψ is the Daubechies 2 wavelet with γ ≈ 36.62. Varying the number of nonuniform samples, we compare the quality of reconstruction using both signal models with respective transforms to investigate the role of γ in the reconstruction error. We consider the sparsity level s = 50 and solve (1) with λ = 1/(2√(2s)) = 1/20, though the Gaussian signal model is not 50-sparse in the Daubechies domain (see the last paragraph of this subsection for further discussion).
Here we set the irregularity parameter ρ = 1/2 to generate the deviations (so that θ = 0) and vary the average step size of the nonuniform samples by varying m. For each fixed value of m, the average relative error is obtained by averaging the relative errors of 50 independent reconstruction experiments. The results are shown in Figure 1, where we plot the average step size vs the average relative reconstruction error. These experiments demonstrate the negative effect of larger DFT-incoherence parameters on signal reconstruction. Indeed, in Figure 1 we see that the complex exponential model with γ = 1 allows for accurate reconstruction from larger step sizes. This is to be expected from Section 3, where the results imply that the Daubechies 2 wavelet requires more samples for successful reconstruction, according to its parameter γ ≈ 36.62.
To appropriately interpret these experiments, it is important to note that the signal from the Gaussian model is only compressible and does not exhibit a 50-sparse representation via the Daubechies transform. Arguably, this may render the experiments of this section inappropriate for purely determining the effect of γ, since the impact of approximating the Gaussian signal with a 50-sparse vector may be significant and produce an unfair comparison (i.e., due to the sparse model mismatch term ε₅₀(g) appearing in our error bound (14)). This is important for the reader to keep in mind, but we argue that the effect of this mismatch is negligible, since in the Gaussian signal model with g = Ψ⁻¹f we have ε₅₀(g) < ‖f‖₂/250, and if g₅₀ is the best 50-sparse approximation of g then ‖f − Ψg₅₀‖₂ < ‖f‖₂/850. This argument can be further validated with modified numerical experiments where f does have a 50-sparse representation in the Daubechies domain, producing reconstruction errors with identical behavior and magnitude as those in Figure 1. Therefore, we believe our results here are informative for understanding the impact of γ. For brevity, we do not present these modified experiments, since such an f will no longer satisfy the Gaussian signal model and would complicate our discussion.
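The mismatch quantities above are easy to compute once g is available. The helper below is our own sketch: we take the ℓ₂ norm of the tail as the best s-term error, and the synthetic power-law vector is only a stand-in for the Daubechies coefficients of the Gaussian model:

```python
import numpy as np

def best_s_term_error(g, s):
    """l2 norm of g minus its best s-sparse approximation
    (zero out everything except the s largest-magnitude entries)."""
    idx = np.argsort(np.abs(g))[::-1]
    tail = g.copy()
    tail[idx[:s]] = 0
    return np.linalg.norm(tail)

# A compressible (nowhere-zero) stand-in coefficient vector: power-law decay.
rng = np.random.default_rng(0)
g = rng.standard_normal(2015) / (1 + np.arange(2015)) ** 1.5
err = best_s_term_error(g, 50)
assert err > 0                       # g itself is not 50-sparse
assert err < np.linalg.norm(g) / 20  # but the sparse-model mismatch is small
```

This is the sense in which a signal can be "only compressible": every coefficient is nonzero, yet the tail beyond the 50 largest entries carries a negligible fraction of the energy.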

Effect of the Deviation Model Parameter
In this section we generate the deviations in a way that varies the deviation model parameter θ, in order to explore its effect on signal reconstruction. We only consider the complex exponential signal model for this purpose and fix m = N/10 = 201. We vary θ by generating deviations with irregularity parameter ρ ranging over {.001, .002, .003, …, .009} ∪ {.01, .02, .03, …, .5}. For each fixed ρ value we compute the average relative reconstruction error of 50 independent experiments. Notice that for each k ∈ [m] and any j,

E e(jm∆_k) = sin(2πjρ)/(2πjρ).
Given ρ, we use this observation and definition (4) to compute the respective θ value, considering the maximum of the expression above over all 0 < |j| ≤ (N − 1)/m = 10. The relationship between ρ and θ is illustrated in Figure 2 (right plot), where smaller irregularity parameters ρ ≈ 0 produce larger deviation model parameters θ.
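The closed form above is easy to verify numerically. Below is a quick Monte Carlo check (our own illustration, writing e(t) := exp(2πit); computing θ itself additionally requires definition (4), which is not reproduced in this section):

```python
import numpy as np

def expected_modulation(j, m, rho):
    """Closed form E e(j*m*Delta) = sin(2*pi*j*rho) / (2*pi*j*rho)
    for Delta ~ Uniform[-rho/m, rho/m] and e(t) := exp(2*pi*1j*t)."""
    x = 2 * np.pi * j * rho
    return np.sinc(x / np.pi)  # np.sinc(y) = sin(pi*y)/(pi*y)

# Monte Carlo sanity check of the closed form.
rng = np.random.default_rng(0)
m, rho, j = 201, 0.25, 3
delta = rng.uniform(-rho / m, rho / m, 200_000)
mc = np.mean(np.exp(2j * np.pi * j * m * delta))
assert abs(mc - expected_modulation(j, m, rho)) < 1e-2
```

Note that the expectation depends on ρ and j but not on m, since j·m·∆_k is uniform on [−jρ, jρ]; this is why the irregularity parameter alone controls θ.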
According to (4), this allows θ ∈ [0, 20.05), which violates the assumption θ < 1 of Theorem 2 and does not allow (1) to be implemented with the parameter λ in the required range. Despite this, we implement all experiments in this section with λ = 1/(2√(2s)) = 1/20 (where s = 50). Such a fixed choice may not provide a fair set of results, since the parameter is not adapted in any way to the deviation model. Regardless, the experiments will prove informative while revealing the robustness of the square-root LASSO decoder with respect to parameter selection. Figure 2 (left plot) shows θ vs the average relative reconstruction error. In the plot, our main result (Theorem 2) is only strictly applicable in three cases (outlined in red: θ = 0, .409, .833). However, the experiments demonstrate that decent signal reconstruction may be achieved when the condition θ < 1 does not hold and the parameter λ is not chosen appropriately. Therefore, the applicability of the methodology goes beyond the restrictions of the theorem, and the numerical results demonstrate the flexibility of the square-root LASSO decoder.

Noise Attenuation
This section explores the robustness of the methodology when presented with measurement noise, in both the undersampled and oversampled cases relative to the target bandwidth (N − 1)/2 (Sections 3 and 4, respectively). We only solve the square-root LASSO problem (1) with λ = 1/(2√(2s)) = 1/20, and avoid the least squares problem (18) for brevity. However, we note that both programs produce similar results and conclusions in the oversampled case (see Theorem 4). We only consider the bandlimited complex exponential signal model for this purpose. We generate additive random noise d ∈ C^m, chosen to maintain ‖d‖₂ relatively constant as m varies.
We set ρ = 1/2 to generate the deviations (so that θ = 0) and vary the average step size of the nonuniform samples. We do so by letting m vary through the set {N/.5, N/.75, N, …, N/6.75, N/7}; notice that only the first two cases correspond to oversampling. For each fixed value of m, the relative reconstruction error is obtained by averaging the result of 50 independent experiments. The results are shown in Figure 3, where we plot the average step size vs the average relative reconstruction error and the average relative input noise level ‖d‖₂/‖f‖₂.
The first two cases (m = N/.5 and N/.75) correspond to oversampling and illustrate the results from Section 4 (and Theorem 4), where attenuation of the input noise level is achieved. Surprisingly, these experiments demonstrate that nonuniform undersampling also allows for denoising. This is seen in Figure 3, where the values m = N/1.25, N/1.5, …, N/3.5 correspond to sub-Nyquist rates and output an average relative reconstruction error smaller than the input measurement noise level. Thus, when the nonuniform samples are not severely undersampled, the negative effects of random noise can be reduced.

Conclusions
This paper provides a concrete framework to study the benefits of random nonuniform samples for signal acquisition (in comparison to equispaced sampling), with explicit statements that are informative for practitioners. Related observations are extensive but largely empirical in the sampling theory literature; therefore, this work supplies novel theoretical insight on this widely discussed phenomenon. In the context of compressive sensing, we extend the applicability of this acquisition paradigm by demonstrating how it naturally intersects with standard sampling techniques. We hope that these observations will prompt broader usage of compressive sensing in real-world applications that rely on classical sampling theory.
There are several avenues for future research. First, the overall methodology requires the practitioner to know the nonuniform sampling locations τ accurately. While this is typical for signal reconstruction techniques that involve non-equispaced samples, it would be of practical interest to extend the methodology in such a way that allows for robustness to inaccurate sampling locations, and even self-calibration. Further, as mentioned in Section 6, this work has not dedicated much effort to a numerically efficient implementation of the Dirichlet kernel S. This is crucial for large-scale applications, where a direct implementation of the Dirichlet kernel via its Fourier or Dirichlet representation (see [47]) may be too inefficient for practical purposes. As future work, it would be useful to consider other interpolation kernels with greater numerical efficiency (e.g., a low-order Lagrange interpolation operator).
Finally, to explore the undersampling and anti-aliasing properties of nonuniform samples, our results here require a sparse signal assumption and adopt compressive sensing methodologies. However, most of the work that first discussed this nonuniform sampling phenomenon precedes the introduction of compressive sensing and does not explicitly impose sparsity assumptions. Therefore, to fully determine the benefits provided by off-the-grid samples, it would be most informative to consider a more general setting, e.g., one relying only on the smoothness of continuous-time signals. We believe the work achieved here provides a potential avenue to do so.

Proofs
We now provide proofs of all of our claims. In Section 8.1 we prove Theorem 2 via a more general result. Theorem 3 is proven in Section 8.2. Section 8.3 establishes the Dirichlet kernel error bounds of Theorem 1 and Corollary 1.

Proof of Theorem 2
In this section, we prove a more general result than Theorem 2, assuming only that Ψ is a full column-rank matrix and allowing m ≥ N. Theorem 2 will follow from Theorem 4 by taking α = β = 1 and n = N, and simplifying some terms (here C₁ and C₂ denote absolute constants). This theorem generalizes Theorem 2 to more general transformations Ψ for sparse representation. This is more practical, since the columns of Ψ need not be orthogonal; linear independence suffices (with knowledge of the extreme singular values α, β). In particular, notice that (21) depends on n and does not involve N, as opposed to m ∼ s log⁴(N) in (13). Since n ≤ N, this general result allows for a potential reduction in sample complexity if the practitioner can construct Ψ in such an efficient manner while still allowing a sparse and accurate representation of f. Furthermore, notice that this more general result allows for oversampling, m ≥ n or m ≥ N. If we apply Theorem 4 with s = n, then ε_s(g) = 0 and we obtain an error bound similar to those in Section 4, reducing the additive noise using ∼ n log⁴(n) off-the-grid samples. However, in this scenario the sparsifying transform is no longer of much relevance, and it is arguably best to consider the approach of Section 4, which removes the need to consider γ, β, α, and θ via a numerically cheaper methodology and a more general set of deviations.
To establish Theorem 4 we will consider the G-adjusted restricted isometry property (G-RIP) [5]. This property ensures that a measurement matrix is well conditioned amongst all s-sparse signals, allowing for successful compressive sensing from ∼ s polylog(n) measurements. Once it is established for our measurement ensemble, Theorem 4 will follow by applying the following result:

Theorem 5 (Theorem 13.9 in [5]) Let G ∈ C^{n×n} be invertible and A ∈ C^{m×n} have the G-RIP of order q and constant 0 < δ < 1.

We therefore obtain our main result once we establish the G-RIP for our measurement matrix. To do so, we note that our measurement ensemble is generated from a nondegenerate collection of independent families of random vectors; such random matrices have been shown to possess the G-RIP in the literature. To be specific, a collection C = {A_k}_{k=1}^m of independent families of random vectors is nondegenerate when the matrix (1/m) Σ_{k=1}^m E a_k a_k^*, where a_k ∼ A_k, is positive-definite. In this case, write G_C ∈ C^{n×n} for its unique positive-definite square root.
Our ensemble fits this definition, with the rows of Ñ ∈ C^{m×N} generated from a collection of m independent families of random vectors. In our scenario, the k-th family A_k independently generates the deviation ∆_k ∼ D and produces a random vector of the form above as the k-th row of Ñ. This in turn also generates the rows of A independently, since its k-th row is given as √N Ñ_k^* F^* Ψ. To apply G-RIP results from the literature to such matrices, we will have to consider the coherence of our collection (Definition 3). There, a family A_k is saturated if it consists of a single vector, and a collection is unsaturated if no family in the collection is saturated. In our context, it is easy to see that the condition θ < 1 avoids saturation, so the definition applies. The coherence of our collection of families will translate to the DFT-incoherence parameter defined in Section 2.1.
With these definitions in mind, we now state a simplified version of Theorem 13.12 in [5] that will show the G-RIP for our ensemble.
Theorem 6 (simplified from Theorem 13.12 in [5]) Let C = {A_k}_{k=1}^m be a nondegenerate collection generating the rows of A, and suppose that (24) holds, where c₁ is an absolute constant. Then, with probability at least 1 − ε, the matrix A has the G-RIP of order s with constant δ_{s,G} ≤ δ.
In conclusion, to obtain Theorem 4 we will first show that A is generated by a nondegenerate collection with unique positive-definite square root G. Establishing this will provide upper bounds for ‖G⁻¹‖, ‖G‖, and µ(C). At that point, Theorem 6 will endow A with the G-RIP, and subsequently Theorem 5 can be applied to obtain the error bounds.
To establish that the collection C = {A_k}_{k=1}^m above is nondegenerate, it suffices to show that (25) holds for all w ∈ C^n. This will show that (1/m) E A^*A is positive-definite when the deviation model satisfies θ < 1. Further, letting G be the unique positive-definite square root of (1/m) E A^*A, (25) will also establish the bounds (26). To this end, let w ∈ C^n and normalize the rows of Ñ, with entries Ñ_{k,ℓ} := e(−t̃_k(ℓ − Ñ − 1)) for ℓ ∈ [N], where Ñ = (N − 1)/2.
Throughout, let ∆̃ ∈ R be an independent copy of the entries of ∆ ∈ R^m, and set v := F^*Ψw. In the resulting computation, the third equality uses the fact that E e(∆_k(ℓ − ℓ̃)) = E e(∆̃(ℓ − ℓ̃)) for all k ∈ [m], in order to properly factor this constant out of the sum in the fourth equality; the last equality is due to the geometric series formula.
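The geometric-series step reduces to the standard orthogonality of equispaced complex exponentials, which can be sanity-checked numerically (our own illustration, with e(t) := exp(2πit)):

```python
import numpy as np

def exp_sum(j, m):
    """Geometric series sum_{k=0}^{m-1} e(j*k/m), with e(t) := exp(2*pi*1j*t).

    Equals m when m divides j, and vanishes otherwise."""
    k = np.arange(m)
    return np.sum(np.exp(2j * np.pi * j * k / m))

m = 7
for j in range(-2 * m, 2 * m + 1):
    expected = m if j % m == 0 else 0
    assert abs(exp_sum(j, m) - expected) < 1e-9
```

This identity is what collapses the double sum over (ℓ, ℓ̃) to the terms with ℓ − ℓ̃ a multiple of m.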
Returning to our original calculation, we bound the last term using our deviation model assumptions; here the sum runs over the index set of allowed indices according to j, i.e., those ℓ that satisfy ℓ ∈ [N] and ℓ + jm ∈ [N]. The second inequality holds by our deviation model assumption (4).
The remaining sum (with ℓ − ℓ̃ = jm) can be bounded similarly. Combining these inequalities with the singular values of Ψ yields the desired bound. We will apply this inequality and similar orthogonality properties in what follows (e.g., in Section 8.2), and ask the reader to keep this in mind.
To upper bound the coherence of the collection C, let Ñ_k^* F^* Ψ ∼ A_k be as above; the bound (27) then follows. The proof of Theorem 4 is now an application of Theorems 6 and 5, using the derivations above.
From the arguments above, the rows of A are generated by a nondegenerate collection C with coherence bounded as in (27), and the unique positive-definite square root G of (1/m) E A^*A satisfies the bounds (26). We now apply Theorem 6 with δ = 1/2, ε = n⁻¹, and order q. By (26) and (27), if m satisfies (28) then (24) is satisfied and the conclusion of Theorem 6 holds. Therefore, with probability exceeding 1 − n⁻¹, A has the G-RIP of order q with constant δ_{q,G} ≤ δ = 1/2. To show that our sampling assumption (21) implies (28), notice the chain of inequalities that follows from (26); the last inequality there holds since, for any real number a ≥ 24, we have ⌈a⌉ ≤ (1 + 1/24)a. In (28), replace q with ⌈q⌉. This provides our assumed sampling complexity, where expression (21) simplifies by absorbing all absolute constants into C₁ and C₂.
With the parameter λ chosen for (20), the conditions of Theorem 5 hold with δ = 1/2, and we obtain the error bound; the final inequality in the resulting estimate holds by Theorem 1.
To obtain Theorem 2 from Theorem 4, notice that in Theorem 2 we have n = N and α = β = 1. The assumption m ≤ N allows further simplification by combining all the logarithmic factors into a single polylog(N) term (introducing absolute constants where necessary). We note that the condition m ≤ N is not actually needed; it is only applied for ease of exposition in the introductory result.

Proof of Theorem 3
To establish the claim, we aim to show that a lower bound on the infimum of ‖S̃v‖₂ over unit-norm v holds with high probability. By optimality of f̂, this will give the error bound, where the last inequality is due to our noise model and the trigonometric interpolation error (Theorem 1).
To this end, we normalize the sampling operator, letting S̃ := (1/√m) Ñ F^*, and note that when m ≥ N our sampling operator is isometric in expectation, in the sense that E S̃^*S̃ = I_N, where I_N is the N × N identity matrix. To see this, we use our calculations from the previous section (those establishing (25)): the expectation splits into cases according to ℓ − ℓ̃ = jm, and if m ≥ N, the middle case (j ≠ 0) never occurs, since ℓ, ℓ̃ ∈ [N] forces |ℓ − ℓ̃| ≤ N − 1 < m. With the isometry established, we may now proceed to the main component of the proof of Theorem 3.
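The expected isometry can be checked by Monte Carlo simulation on a small instance with m ≥ N. The sketch below is ours; it assumes the row model Ñ_{k,ℓ} = e(−t̃_k(ℓ − Ñ − 1)) with the 1/√m normalization suggested above:

```python
import numpy as np

def sampling_matrix(m, N, rho, rng):
    """Rows (1/sqrt(m)) * e(-t_k(l - (N-1)/2)), l = 0..N-1, where the t_k
    are an equispaced grid on [-1/2, 1/2) plus i.i.d. uniform deviations."""
    t = -0.5 + np.arange(m) / m
    t = t + rng.uniform(-rho / m, rho / m, m)
    l = np.arange(N) - (N - 1) / 2  # centered frequencies
    return np.exp(-2j * np.pi * np.outer(t, l)) / np.sqrt(m)

rng = np.random.default_rng(0)
m, N = 12, 8  # m >= N: expected isometry regime
trials = 4000
avg = np.zeros((N, N), dtype=complex)
for _ in range(trials):
    S = sampling_matrix(m, N, rho=0.5, rng=rng)
    avg += S.conj().T @ S / trials
# E S*S should be the identity: diagonal exact, off-diagonal zero in expectation
assert np.allclose(np.diag(avg).real, 1.0)
assert np.linalg.norm(avg - np.eye(N)) < 0.1
```

The diagonal of S̃*S̃ equals 1 for every realization (the rows have constant modulus), so only the off-diagonal entries rely on averaging over the deviations, consistent with Theorem 7 holding for any deviation distribution.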
Theorem 7 Let S̃ be defined as above with m ≥ N, and let the entries of ∆ be i.i.d. with any distribution. Then the smallest singular value of S̃ is bounded below with probability exceeding 1 − 1/N.
Proof: We will apply a matrix Chernoff inequality to lower bound the smallest eigenvalue of S̃^*S̃. To apply Theorem 1.1 in [34], notice that we can expand S̃^*S̃ as a sum of independent, random, self-adjoint, positive-semidefinite matrices.
Our isometry condition (30) gives that E S̃^*S̃ = I_N has extreme eigenvalues equal to 1; we stress that this holds because we assume m ≥ N, as shown above. Further, each summand has largest eigenvalue at most N/m. Therefore, by Theorem 1.1 in [34] with R = N/m and δ = 1/2, we obtain the Chernoff tail bound, whose left-hand side is upper bounded by N⁻¹.
Since the singular values of S̃ are the square roots of the eigenvalues of S̃^*S̃, this establishes the result.
With our remarks at the beginning of the section, we can now easily establish the proof of Theorem 3; the remainder of the proof follows from our outline there.

Interpolation Error of Dirichlet Kernel: Proof
In this section we provide the error term of our interpolation operator when applied to our signal model (Theorem 1), as well as the error bound given in Corollary 1.
Proof of Theorem 1: We begin by showing (8), i.e., that if t̃_k = t_p for some p ∈ [N] (our "nonuniform" sample lies on the equispaced interpolation grid) then the error is zero. This is easy to see from the orthogonality of the complex exponentials, combining (5) and (6). For the general case, we arrange the Fourier coefficients into an infinite array in which each column continues from the coefficients in the last row (one column prior). Therefore, if we start at the top-left coefficient c_{−Ñ} and traverse this infinite array of Fourier coefficients "column-wise", we obtain an ordered sequence with no repetitions. The coefficients in row q ∈ [N] correspond to the frequency value −Ñ + q − 1 and have indices of the form pN − Ñ + q − 1 for some p ∈ N. To establish that the reordered series is equivalent, we finish by checking that, for a given index, the mapping r gives the correct frequency value, i.e., r(pN − Ñ + q − 1) = −Ñ + q − 1 for all q ∈ [N]:

r(pN − Ñ + q − 1) := rem(pN − Ñ + q − 1 + Ñ, N) − Ñ = rem(pN + q − 1, N) − Ñ = q − 1 − Ñ.
We can therefore reorder the series as desired, and incorporate the sum over j < 0 via the same logic, to establish the equality. Finally, note that such a condition would imply that t̃_k ∉ Ω := [−1/2, 1/2), which violates our assumption τ ⊂ Ω. This finishes the proof.
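The remainder computation above can be verified mechanically (our own check, implementing rem via Python's % operator, which agrees with rem for the nonnegative arguments occurring here):

```python
# N odd, so that N_tilde = (N - 1) / 2 is an integer
N = 2015
N_t = (N - 1) // 2

def r(x):
    """Frequency folding map r(x) = rem(x + N_tilde, N) - N_tilde."""
    return (x + N_t) % N - N_t

# r maps the aliased index p*N - N_tilde + q - 1 back to frequency q - 1 - N_tilde
for p in range(5):
    for q in range(1, N + 1):
        assert r(p * N - N_t + q - 1) == q - 1 - N_t
```

In other words, r folds every integer index into the centered frequency window {−Ñ, …, Ñ}, which is exactly the reordering used in the proof.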
We end this section with the proof of Corollary 1.
Proof of Corollary 1: The proof consists of applying Theorem 2 (under identical assumptions) together with Theorem 1.
By Theorem 2, we obtain the recovery guarantee; combining it with the interpolation error bound of Theorem 1 gives the claim.

Theorem 1
Let S, f and f̂ be defined as above, with Ñ = (N − 1)/2, and let t̃_k ∈ Ω for some k ∈ [m]. If t̃_k = t_p for some p ∈ [N], then the interpolation error at t̃_k is zero.

Fig. 1
Fig. 1 Plot of average relative reconstruction error vs average step size for both signal models. In the complex exponential model (Ψ = F, the DFT) we have γ = 1, and in the Gaussian signal model (Daubechies 2 wavelet) we have γ ≈ 36.62. Notice that the complex exponential model allows for reconstruction from larger step sizes in comparison to the Gaussian signal model.

Fig. 2
Fig. 2 (left) Plot of average relative reconstruction error vs corresponding θ parameter, and (right) plot illustrating the relationship between the irregularity parameter ρ and the deviation model parameter θ. The plots emphasize via red outlines the θ values that satisfy the conditions of Theorem 2 (i.e., θ < 1). Although our results only hold for three θ values (0, .409, .833), the experiments demonstrate that accurate recovery is possible otherwise.

Fig. 3
Fig. 3 Plot of average relative reconstruction error (‖f − f̂‖₂/‖f‖₂) vs average step size (blue curve), and average input relative measurement error (‖d‖₂/‖f‖₂) vs average step size (red curve). Notice that the first 13 step size values achieve noise attenuation, i.e., the reconstruction error is lower than the input noise level.

Definition 3
Definition 3 (Coherence of an unsaturated collection C [5]) Let A₁, …, A_m be independent families of random vectors, with smallest constants µ₁, …, µ_m such that ‖a_k‖∞² ≤ µ_k holds almost surely for a_k ∼ A_k. The coherence of an unsaturated collection C = {A_k}_{k=1}^m is µ(C) = max_{k∈[m]} µ_k.

Proof of Theorem 4
We are considering the equivalent program ĝ := argmin_{h ∈ C^n} …

Proof of Theorem 3
Under our assumptions, apply Theorem 7 to obtain that the bound holds for all v ∈ S^{N−1} with the prescribed probability. This establishes (29) with δ = √m/√(2N).

Recalling that Ñ = (N − 1)/2, we have (Sf)_k = ⟨f, S_k^*⟩ = f_p = f(t_p) = f(t̃_k) = f̂_k, where the fourth equality holds since we are assuming t̃_k = t_p for some p ∈ [N].