Rates of convergence for regression with the graph poly-Laplacian

In the (special) smoothing spline problem one considers a variational problem with a quadratic data fidelity penalty and Laplacian regularization. Higher-order regularity can be obtained by replacing the Laplacian regulariser with a poly-Laplacian regulariser. The methodology is readily adapted to graphs, and here we consider graph poly-Laplacian regularization in a fully supervised, non-parametric, noise-corrupted regression problem. In particular, given a dataset $\{x_i\}_{i=1}^n$ and a set of noisy labels $\{y_i\}_{i=1}^n\subset \mathbb{R}$, we let $u_n:\{x_i\}_{i=1}^n\rightarrow \mathbb{R}$ be the minimizer of an energy which consists of a data fidelity term and an appropriately scaled graph poly-Laplacian term.
When $y_i = g(x_i)+\xi_i$, for iid noise $\xi_i$, and using the geometric random graph, we identify (with high probability) the rate of convergence of $u_n$ to $g$ in the large data limit $n\rightarrow\infty$. Furthermore, our rate is close to the known rate of convergence in the usual smoothing spline model.


Introduction
Given the applications to signal processing and computer science, the smoothing spline problem has been attracting the interest of statisticians since the 1960's [59,60]. The problem can be stated as [73]: given feature vectors $\{x_i\}_{i=1}^n \subset \Omega \subset \mathbb{R}^d$ and labels $\{y_i\}_{i=1}^n \subset \mathbb{R}$, minimise
(1) $\quad \frac{1}{n}\sum_{i=1}^n |u(x_i) - y_i|^2 + \tau |u|^2_{H^s(\Omega)}$
over all $u \in H^s(\Omega)$, where $H^s(\Omega)$ is the Sobolev space with square integrable $s$th (weak) derivative on $\Omega$. Here, one tries to find an unknown function $u:\Omega\to\mathbb{R}$ that matches the observed labels $\{y_i\}_{i=1}^n$ at $\{x_i\}_{i=1}^n$ whilst being smooth (in the sense of $H^s$ regularisation). It is important to note that the regularity penalty is applied uniformly throughout the domain $\Omega$.
Recently, the spline methodology has found use in data science and machine learning as a candidate for semi-supervised or fully supervised learning. Zhu, Ghahramani and Lafferty [78] introduced the following variational problem as a method for finding missing labels. They assumed that for every pair of feature vectors $x_i, x_j$ with $i,j\in\{1,\dots,n\}$ one has a measure of similarity $W_{i,j}$. Further assuming that there is no error in the observed labels $\{y_i\}_{i\in Z_n}$, where $Z_n := \{1,\dots,n\}$, they proposed the variational problem: minimise
(2) $\quad \sum_{i,j=1}^n W_{i,j}\,|u(x_i) - u(x_j)|^2$ subject to $u(x_i) = y_i$ for all $i \in Z_n$
over $u:\{x_i\}_{i=1}^n\to\mathbb{R}$. The power of this method is that the regularisation is applied more strongly where the density of the data is higher. Indeed, the continuum, $n\to\infty$, limit of (2) is, up to a multiplicative constant,
(3) $\quad \int_\Omega |\nabla u(x)|^2\, \rho^2(x)\, \mathrm{d}x,$
where $\rho$ is the density of the data. We see that where $\rho$ is large the minimiser of (3) should be smoother, and conversely where $\rho$ is smaller the minimiser can fluctuate more. This is typically desirable behaviour in classification tasks, since the minimiser can be expected to be approximately constant within clusters and to transition quickly outside of clusters, where the density of the data is assumed to be low.
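Since the constrained problem above is quadratic, its minimiser is a harmonic extension with respect to the graph Laplacian $L = D - W$ and reduces to a linear solve on the unlabelled nodes. The following is a minimal sketch (the function name and the toy path graph are ours, not from the paper):

```python
import numpy as np

def zgl_interpolate(W, labelled, y_labelled):
    """Minimise sum_ij W_ij |u_i - u_j|^2 subject to u_i = y_i on labelled nodes.

    Setting the gradient to zero on unlabelled nodes gives the harmonic
    extension: L_uu u_u = -L_ul y_l, where L = D - W is the graph Laplacian.
    """
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W
    unlab = np.setdiff1d(np.arange(n), labelled)
    u = np.zeros(n)
    u[labelled] = np.asarray(y_labelled, dtype=float)
    u[unlab] = np.linalg.solve(L[np.ix_(unlab, unlab)],
                               -L[np.ix_(unlab, labelled)] @ u[labelled])
    return u

# On a path graph 0-1-2 with u(0) = 0 and u(2) = 1, the harmonic value
# at the middle node is the average of its neighbours, i.e. 0.5.
W = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
u = zgl_interpolate(W, np.array([0, 2]), np.array([0., 1.]))
print(u)
```

The solve only involves the unlabelled block of $L$, which is why the method scales to large graphs when sparse solvers are substituted.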
Several types of convergence result connecting (2) to (3) exist in the literature. For instance, pointwise convergence of the objective functionals was established in [20,51]. Rates of convergence between constrained minimisers of (2) and constrained minimisers of (3) appeared in [9]. Further results concern the convergence of the Laplacian operator, without rates in [2,32,71] and with rates in [8,28,39,62], the game-theoretic Laplacian in [6], the p-Laplacian in [63], and the ∞-Laplacian in [7]. However, none of these results consider asymptotic consistency in the sense that the minimiser of (2) converges to a "true function". (It would be more accurate to describe the above results as convergence properties of the method.) To our knowledge, we are the first to consider consistency in the graph-based setting.
When there is uncertainty in the observed labels, minimising (2) with constraints is not the natural model. Instead, as in (1), we can write a soft version of the Zhu, Ghahramani and Lafferty model by (4), where we include the correct scaling on the second term. The parameter τ controls the weighting between regularity and matching the data: formally, τ = 0 corresponds to the hard constrained problem. In some settings the large data limits for minimisers of (4) can be inferred from the hard constrained problem, see [9].
To complete the generalisation of the Zhu, Ghahramani and Lafferty model to an analogue of the spline problem (1), we discuss higher-order regularisation. The Dirichlet energy $E^{(ZGL)}_n$ can be written in inner product form as $\langle u, \Delta_n u\rangle_{L^2(\mu_n)}$, where $\Delta_n$ is the graph Laplacian and $L^2(\mu_n)$ is the space of $L^2$ functions with respect to the empirical measure $\mu_n$ (defined in the following subsection). A natural method to introduce higher-order regularity is to consider higher powers of the graph Laplacian, in particular the regulariser $\langle u, \Delta_n^s u\rangle_{L^2(\mu_n)}$ in (5), in which case (4) is the special case of (5) with s = 1. In the context of graphs this model was introduced in [19], although the fractional Laplacian, including non-local and discrete versions, has been of interest in applied mathematics for much longer; see for example [13] and references therein. As observed in [78] and (in terms of uncertainty quantification) [3], (5) has an interpretation as a maximum a-posteriori (MAP) estimate for the Gaussian process regression method (also known as kriging). As an aside we mention works that have explored the connection between discrete and continuum problems in the Bayesian setting, such as [29,69] and [58], where the latter introduces Matérn priors on graphs and studies their continuum limits. In this paper we use the setting of [19] to recover noisy observations of a labelling function g as the MAP estimator, given the data $\{(x_i,y_i)\}_{i=1}^n$ and using the graph poly-Laplacian to define a prior. Our focus is on recovering the labels g, as well as some of its higher-order information at the data points $\{x_i\}_{i=1}^n$ in the form of powers of the Laplacian of g.
Theoretical analysis of splines dates back to the 1960's; we refer to [73] for an overview of more historical works, and only mention a few select (more recent) references here related to large data limits. Convergence in norm of special splines under various settings has been studied in [1,4,5,14,38,42,45,49,74], and pointwise convergence results in [46,61,75-77]. The general spline problem writes (1) in a more abstract framework. In particular, one seeks to find $u \in H$, where $H$ is a reproducing kernel Hilbert space that can be decomposed as $H = H_0 \oplus H_1$, as the minimiser of
$\quad \frac{1}{n}\sum_{i=1}^n |L_i u - y_i|^2 + \tau \|P_1 u\|_H^2,$
where $L_i \in H^*$ and $P_1: H \to H_1$ is the orthogonal projection. For an appropriate choice of $H$ and $L_i$ one recovers the special smoothing spline problem (as a special case of the general smoothing spline problem).
The general smoothing spline problem has itself attracted attention with large data convergence in norm results appearing in [12,15,43,50,52,72] and weak convergence results in [66].
In this paper, using the model (5), we consider the problem of non-parametric regression of a noisy signal g observed at finitely many points that are randomly selected from an unknown probability distribution µ on the torus T. More precisely, we assume we are given a set of feature vectors $\{x_i\}_{i=1}^n \subset \Omega := T$ and a set of associated noisy real-valued labels $\{y_i\}_{i=1}^n$ satisfying
(6) $\quad y_i = g(x_i) + \xi_i,$
from which we wish to recover the true signal g, also known as the label function. The random variables $\xi_i$ are assumed to be mean zero, sub-Gaussian and independent. Our main results establish variance and bias estimates for the error of approximation of the signal g, and of some of its higher-order information, by the solution u of a variational problem characterised by a graph PDE of the form
(7) $\quad \tau \Delta_n^s u + u = y_n.$
Up to logarithms, we establish an $O(n^{-\frac{s}{d+4s}})$ rate of convergence of minimisers of $E^{(y_n)}_{n,\tau}$ in (5) to g. This is comparable to the $O(n^{-\frac{s}{d+2s}})$ rate of convergence for splines, see [64]. Projecting the samples onto the first K eigenvectors of the graph Laplacian gives a better rate of $O(n^{-\frac{2s}{2s+d}})$ (the minimax rate) [35]; however, K depends on the $H^s$ norm of g, which will usually be unknown. See Remarks 1.11 and 1.12 for further details.
Methods such as kernel ridge regression are closely related to smoothing spline models. These methods use a data fidelity term (on $\{x_i\}_{i=1}^n$) plus regularisation on Ω (in particular, incorporating Ω's geometry explicitly) to attempt to recover g. In that setting, the question of how to set regularisation parameters was studied in [10,17]. Some very recent works have focused on studying the "ridgeless case", where one considers the limit as the regularisation parameter is set to zero, with both positive [47] and negative [55] results depending on the richness of the data. A related approach is to interpolate between data points using $C^m$ penalisation [21,25,26] or Sobolev penalisation [22-24].
There are connections between the Gaussian process regression (kriging) approach that we take here and the generalised lasso model (which includes the lasso, the fused lasso, trend filtering, and, more closely related to our work, graph fused lasso, graph trend filtering, and Kronecker trend filtering); see for example [68]. Both Gaussian process regression and the generalised lasso attempt to recover an unknown function g from noisy observations of the form (6) in the fully supervised setting (i.e. for each feature vector $x_i$ we have an observation $y_i$). However, in the lasso models the function g is assumed to be linear, i.e. $g(x) = \beta \cdot x$, where β is an unknown vector which parametrises g. The fused lasso, on the other hand, uses a total variation regularisation in place of the graph poly-Laplacian considered here. In grid graphs this has been considered in [41,56,57], where the estimator is shown to be minimax (the estimator performs best amongst all other estimators in the worst case). Further results have considered chain graphs [54], and k-NN and ε-connected graphs [53] (the ε-connected graph setting is also the setting of this paper). In particular, the $L^2$ convergence rate of the fused lasso on an ε-connected graph is (ignoring logarithms) $O(n^{-\frac{1}{d}})$, which at least in some settings is the minimax rate [53]. Up to logarithms, and assuming a sufficiently smooth signal g, our basic $L^2$ convergence rates for the approximation of g coincide with these rates. This is also approximately the $L^\infty$ convergence rate given in [30] for the case s = 1 in (7), which is the minimax $L^\infty$ rate given in [44]. At this point we would like to remark that our variance estimates are meaningful and converge to zero with growing n even if the regularisation parameter τ is not scaled down to zero. Our results characterise precisely the continuum limit of the solutions to the graph PDE (7), and are of relevance in case one is interested not only in denoising, but also in enforcing additional regularisation. We also remark that our results provide additional information about the convergence towards g, by giving convergence rates for higher-order derivatives.
Other approaches to higher-order regularisation that do not consider Gaussian priors instead use a non-linear p-Laplacian operator for large enough p. In the graph setting, results like those in [37] establish that solutions of a p-Dirichlet regularised problem converge with rate $n^{-\frac14}$ to the solution of an analogous continuum non-local variational problem; although the setting differs from ours, as we scale the connectivity of our graph to obtain a local limit, whilst in [37] the connectivity of the graph remains fixed and the limit is a non-local variational problem. Naturally, the advantage of the framework in [37] is that the dimension of the space essentially plays no role in the analysis (depending on the precise edge model one uses), and therefore it is enough to consider the problem in 1D (as the authors do). On the other hand, by not scaling down the connectivity threshold it is not possible to recover the local geometry. The same authors, in the same setting, show a rate of convergence for the associated gradient flow [36]. As mentioned earlier when discussing generalised lasso models (in particular graph trend filtering), total variation is another tool used to regularise regression and classification problems. This has motivated theoretical works like [34], which study the convergence of graph total variation to a continuum weighted total variation (the same paper proposed a topology for studying the convergence that does not require regularity, in particular pointwise evaluation, of the continuum function). Total variation functionals are also widely used for clustering and segmentation, for example in graph cut methods such as ratio or Cheeger cuts [31,65], graph modularity clustering [18,40], and Ginzburg-Landau segmentation [16,67,70].
Since we have observations for all feature vectors, our problem is in the fully supervised setting. The semi-supervised setting with Laplacian regularisation (closely related to (7) with s = 1, but with hard constraints as opposed to a penalty term) has been considered in [9], which shows an "ill-posedness result" (the labels disappear in the large data limit) if the number of labelled points scales below a critical threshold, and a "well-posedness result" (the labels remain in the continuum problem) when the number of labels scales linearly with n. Using graph p-Laplacian regularisation with finitely many labels, the authors in [6,63] show that whether the variational problem is asymptotically well-posed depends on the choice of p and how the graph is scaled. For the fractional graph Laplacian with finitely many labels, it is shown in [19] that the problem is ill-posed if $s \le \frac{d}{2}$ or the length scale on the graph is sufficiently large, and it is conjectured that this is sharp.
We wrap up this brief literature review by pointing out that other approaches to regression on unknown manifolds include [48], where local tangent planes around points are carefully constructed in order to apply regression methods in the more classical functional data setting. Our approach is markedly different, as it does not rely on the construction of extrinsic geometric objects. In particular, once a proximity graph is defined on the data cloud, all regularisers and the resulting PDEs are intrinsic to the graph.
In the remainder of this section we define the graph and continuum operators that are analysed in the paper, and then state our main results.

Discrete Operators
We begin by stating our basic assumptions on the data and the graph that we use to model it: (A1) Assumptions on the domain Ω: Ω is a d-dimensional torus.
(A5) Assumptions on the labelled data: the noise variables $\xi_i$ are independent and identically distributed (iid), sub-Gaussian and centred (where sub-Gaussian by definition means there exist $C > c > 0$ such that $P(|\xi_j| > t) \le Ce^{-ct^2}$ for all $t \ge 0$).
Remark 1.1. The assumptions on η are technical in nature and are imposed to facilitate some very concrete steps in our analysis. Assumption (A4)b is slightly stronger than what is typically required in related papers, and will only be used to simplify our computations in, for example, Lemma 2.3.
The graph Laplacian $\Delta_n$ plays an important role in the regularisation and is defined as follows:
(8) $\quad \Delta_n u(x_i) := \frac{2}{n\varepsilon^2}\sum_{j=1}^n W_{i,j}\left(u(x_i) - u(x_j)\right), \qquad W_{i,j} = \eta_\varepsilon(|x_i - x_j|),$
where $\eta_\varepsilon(t) = \varepsilon^{-d}\eta(t/\varepsilon)$. Here we have chosen what is called the unnormalized graph Laplacian.
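The construction of $\Delta_n$ as a matrix can be sketched as follows. This follows our reading of the scaling in (8) with the indicator kernel; the exact constants should be treated as assumptions, and dense matrices are used only for illustration:

```python
import numpy as np

def graph_laplacian(X, eps):
    """Unnormalised graph Laplacian Delta_n = (2/(n*eps^2)) * (D_n - W_n),
    with weights W_ij = eps^{-d} * eta(|x_i - x_j| / eps) for the
    indicator kernel eta = 1_{[0,1)} (no self-loops)."""
    n, d = X.shape
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = (dist / eps < 1.0).astype(float) / eps**d
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))
    return (2.0 / (n * eps**2)) * (D - W)

rng = np.random.default_rng(0)
X = rng.random((200, 2))
L = graph_laplacian(X, eps=0.2)
# Delta_n is symmetric, positive semi-definite, and annihilates constants.
print(np.abs(L @ np.ones(200)).max())
```

These three structural properties (symmetry, positive semi-definiteness, constants in the kernel) are exactly what the spectral expansion in the next paragraphs relies on.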
Throughout the paper we denote the empirical measure by $\mu_n := \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$. We define an inner product with respect to a (usually probability) measure ν by $\langle u, v\rangle_{L^2(\nu)} = \int u v \, \mathrm{d}\nu$ for u, v measurable w.r.t. ν.
The associated $L^2$ norm is $\|u\|_{L^2(\nu)} = \sqrt{\langle u, u\rangle_{L^2(\nu)}}$. When $\nu = \mu_n$ the norm can be written $\|u\|^2_{L^2(\mu_n)} = \frac{1}{n}\sum_{i=1}^n |u(x_i)|^2$. There is a small abuse of notation in how we define $\Delta_n$, since we will also write $\Delta_n$ for the corresponding matrix. Given $a_n = (a_1,\dots,a_n)$, with $a_i\in\mathbb{R}$, we let
(9) $\quad E^{(a_n)}_{n,\tau}(u) := \tau R^{(s)}_n(u) + \|u - a_n\|^2_{L^2(\mu_n)},$
where the regularisation is given by
(10) $\quad R^{(s)}_n(u) := \langle u, \Delta_n^s u\rangle_{L^2(\mu_n)},$
and here s is a positive integer, with $\Delta_n^s$ the s-th power of the matrix. We will mostly be concerned with the situation where $a_n = y_n = (y_1,\dots,y_n)$, which gives the energy
(11) $\quad E^{(y_n)}_{n,\tau}(u) = \tau R^{(s)}_n(u) + \|u - y_n\|^2_{L^2(\mu_n)}.$
We define $u^*_{n,\tau}$ to be the minimiser of (11). Note that when s = 1 the regularisation functional is the graph Dirichlet energy. We define $R^{(s)}_n$ for non-integer powers via the eigenvector-eigenvalue expansion (however our results consider only integer powers). That is, we let $(\lambda^{(n)}_i, q^{(n)}_i)$ be the eigenpairs of $\Delta_n$, so that $\{q^{(n)}_i\}_{i=1}^n$ form an orthonormal basis of $L^2(\mu_n)$, and we can write $R^{(s)}_n(u) = \sum_{i=1}^n (\lambda^{(n)}_i)^s \langle u, q^{(n)}_i\rangle^2_{L^2(\mu_n)}$, which is defined for any $s\in\mathbb{R}$.
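For integer s the spectral definition of $R^{(s)}_n$ agrees with the matrix-power definition, which can be checked numerically; a sketch, assuming the $1/n$-weighted empirical inner product and using a generic symmetric PSD matrix as a stand-in for $\Delta_n$:

```python
import numpy as np

def regulariser_spectral(Delta, u, s):
    """R_n^{(s)}(u) = sum_i lambda_i^s <u, q_i>^2 in L^2(mu_n), i.e. with
    the 1/n-weighted inner product; defined for any real s >= 0."""
    lam, Q = np.linalg.eigh(Delta)
    lam = np.clip(lam, 0.0, None)      # remove tiny negative round-off
    return float((lam**s * (Q.T @ u)**2).sum() / len(u))

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
Delta = B @ B.T                        # symmetric PSD stand-in for Delta_n
u = rng.standard_normal(5)

# <u, Delta^2 u>_{L^2(mu_n)} computed directly via the matrix power
direct = u @ np.linalg.matrix_power(Delta, 2) @ u / 5
spectral = regulariser_spectral(Delta, u, 2)
print(direct, spectral)
```

The spectral form is what permits non-integer s, at the cost of a full eigendecomposition.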

Continuum Operators
We now define the appropriate continuum operators and variational formulations. It is well known that as $n\to\infty$ the operator $\Delta_n$ converges to a continuum limit $\Delta_\rho$ [2,6,9,28,32,39,62], where $\Delta_\rho$ is the differential operator defined by
(12) $\quad \Delta_\rho u := -\frac{\sigma_\eta}{\rho}\,\mathrm{div}\left(\rho^2 \nabla u\right),$
and $\sigma_\eta$ is the constant defined by
(13) $\quad \sigma_\eta := \int_{\mathbb{R}^d} \eta(|z|)\,|z_1|^2 \,\mathrm{d}z.$
For τ > 0 fixed, the continuum objective functional is defined by
(14) $\quad E^{(g)}_{\infty,\tau}(u) := \tau R^{(s)}_\infty(u) + \|u - g\|^2_{L^2(\mu)},$
where
(15) $\quad R^{(s)}_\infty(u) := \langle u, \Delta_\rho^s u\rangle_{L^2(\mu)}.$
We define $u^*_\tau$ to be the minimiser of (14). Again, we observe that when s = 1 the regularisation functional is a weighted Dirichlet energy. We remark that, since ρ is bounded from below, we may integrate by parts to obtain
$\quad c\|u\|^2_{H^s(\Omega)} \le \|u\|^2_{L^2(\mu)} + R^{(s)}_\infty(u) \le C\|u\|^2_{H^s(\Omega)}$
for some constants C > c > 0.
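For s = 1 the identity behind the weighted Dirichlet interpretation is a one-line integration by parts, using $\mathrm{d}\mu = \rho\,\mathrm{d}x$, periodicity of the torus (so there is no boundary term), and the form of $\Delta_\rho$ above (which should be treated as our reading of (12)):

```latex
\langle u, \Delta_\rho u\rangle_{L^2(\mu)}
  = -\sigma_\eta \int_\Omega u \,\frac{1}{\rho}\,\mathrm{div}\left(\rho^2 \nabla u\right) \rho \,\mathrm{d}x
  = \sigma_\eta \int_\Omega |\nabla u(x)|^2\, \rho^2(x)\,\mathrm{d}x .
```

This makes explicit how the density ρ weights the regularisation, mirroring the discussion of (3) in the introduction.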
We can also define $R_\infty$ for non-integer powers, analogously to the discrete case. More concretely, by the spectral theorem and the fact that Ω is compact, if we let $(\lambda_i, q_i)$ be the eigenpairs of $\Delta_\rho$, then $\{q_i\}_{i=1}^\infty$ forms an orthonormal basis of $L^2(\mu)$. In turn we can define $R^{(s)}_\infty(u) = \sum_{i=1}^\infty \lambda_i^s \langle u, q_i\rangle^2_{L^2(\mu)}$, which is well-defined for any $s\in\mathbb{R}$.

Main Results
Our results bound the bias and variance of the estimator $u^*_{n,\tau}$, defined as the minimiser of $E^{(y_n)}_{n,\tau}$. Following the terminology of [17], we define the variance of the estimator by $\|u^*_{n,\tau} - u^*_\tau\|_{L^2(\mu_n)}$, where $u^*_\tau$ is the minimiser of $E^{(g)}_{\infty,\tau}$, and the bias is defined to be $\|u^*_\tau - g\|_{L^2(\mu)}$. The main results are the following.
We state the $L^2$ variance estimates in the following theorem.

Theorem 1.2. Let Assumptions (A1)-(A5) hold and $s\in\mathbb{N}$. Define $R^{(s)}_\infty$ by (15), $\Delta_n$ by (8) and $\Delta_\rho$ by (12). Let $u^*_{n,\tau}$ be the minimiser of $E^{(y_n)}_{n,\tau}$ and $u^*_\tau$ be the minimiser of $E^{(g)}_{\infty,\tau}$. Then, for all α > 1, there exist $\varepsilon_0 > 0$, $\tau_0 > 0$, $C > c > 0$ such that the variance bound holds for all ε, n satisfying (16) and τ ∈ (0, τ_0).

Let $g_n = (g(x_1),\dots,g(x_n))$, and let $u^{g*}_{n,\tau}$ be the minimiser of the "noiseless" energy $E^{(g_n)}_{n,\tau}$. The proof of Theorem 1.2 is divided into two steps: in the first we compare $u^*_{n,\tau}$ and $u^{g*}_{n,\tau}$ (corresponding to averaging in y), which gives a quantitative bound on the effect of the noise; in the second we compare $u^{g*}_{n,\tau}$ with $u^*_\tau$ (which corresponds to averaging in x). We do this in Sections 2.1 and 2.2 respectively.

Remark 1.3. We notice that the estimates are meaningful for fixed τ as n goes to infinity, i.e. τ is not required to become smaller with growing n.

Remark 1.4. In addition to the bound in $L^2(\mu_n)$ between $u^*_{n,\tau}$ and $u^*_\tau$, we are able to show a bound between the Laplacians $\Delta_n^{s/2} u^*_{n,\tau}$ and $\Delta_\rho^{s/2} u^*_\tau \lfloor_{\Omega_n}$ when s is even. More precisely, our results show this with the same probability as in Theorem 1.2. This inequality likely generalises to odd s, but to prove it using the methods in this paper we would require a pointwise convergence result for the graph derivative (i.e. the operator $\nabla_n$ such that $\mathrm{div}_n \circ \nabla_n = \Delta_n$, where $\mathrm{div}_n$ is the adjoint operator to $\nabla_n$), which is beyond the scope of the paper.
Remark 1.5. We offer a comparison with the estimates in [37] (although note that a direct comparison is not possible, as we scale ε → 0 whilst [37] works in the setting where ε > 0 is fixed). If, as in [37], we fix τ > 0, and therefore absorb it into our constants, and choose s = 1, then the error bound simplifies accordingly. Unfortunately, optimising over ε implies a scaling of $\varepsilon = \varepsilon_n$ which is outside the conditions of Theorem 1.2, as (16) does not hold (one needs $n\varepsilon^{d+4} \gg \log(n)$ in order to get a high probability bound). Instead we choose $\varepsilon_n \sim \left(\frac{\log(n)}{n}\right)^{\frac{1}{d+4}}$. With this choice the error scales, up to logarithms, as $n^{-\frac{1}{d+4}}$.
The results in [37] show, for the p-Laplacian regularised problem, a rate of convergence of $n^{-\frac14}$ when d = 1, s = 1 and ε > 0 is fixed independently of n, compared to our rate of convergence of $n^{-\frac15}$ (up to logarithms).

L^2 Bias Estimates
We have the following $L^2$ bias estimate.
Remark 1.7. We are also able to show a corresponding bias bound for the higher-order information. The previous results, from Sections 1.3.1 and 1.3.2, can be combined to bound the error between $u^*_{n,\tau}$ and g. Trivially, by the triangle inequality, we may bound the total error by the sum of the variance and the bias. Introducing a transport map $T_n$: since we can choose any $T_n$ satisfying $(T_n)_\#\mu = \mu_n$, we choose the one that minimises the $L^2$ distance between $T_n$ and Id; by [27] this can be bounded. Putting the previous argument together, with the choice $\delta = n^{-\alpha}$, we have the following corollary.

Corollary 1.8. Let Assumptions (A1)-(A5) hold and $s\in\mathbb{N}$. $R^{(s)}_n$ is defined by (10) and $\Delta_n$ by (8). Let $u^*_{n,\tau}$ be the minimiser of $E^{(y_n)}_{n,\tau}$. Then, for all α > 1, there exist $\varepsilon_0 > 0$, $C > c > 0$ such that the stated error bound holds for all ε, n satisfying (16) and τ ∈ (0, τ_0).

Remark 1.9. Combining Remarks 1.4 and 1.7, we can also bound the higher-order error for even s.

Remark 1.10. When s = 1 the error simplifies, holding with probability at least $1 - C(n^{-\alpha} + ne^{-cn\varepsilon^{d+4}})$. Choosing τ optimally with respect to ε implies τ = ε. The optimal choice of ε is (as in Remark 1.5) outside the admissible scaling of $\varepsilon_n$, so we choose $\varepsilon_n$ as in Remark 1.5. In this regime the optimal error is approximately the minimax rate achieved for the total variation regularised problem which, in certain cases, is, up to logarithms, $n^{-\frac{1}{d}}$ [53], comparable to the $L^\infty$ minimax rates and convergence of spline smoothing obtained in [44], [30] and [64], which are approximately $n^{-\frac{1}{d+2}}$. This also coincides with the semi-supervised rate of convergence given in [9] when the number of labels is linear in n.
Remark 1.11. For $s\in\mathbb{N}$ we can choose $\tau = \varepsilon^s$ so that the error simplifies (where C depends on M), holding with probability at least $1 - C(n^{-\alpha} + n^{1-cM})$; choosing $M = \frac{1+\alpha}{c}$, we have that the bound holds with probability at least $1 - Cn^{-\alpha}$. This is close to the spline error rate which, up to logarithms, scales as $n^{-\frac{s}{d+2s}}$, see [64].
Remark 1.12. The minimax rate for estimating g from noisy samples (6) is $n^{-\frac{2s}{2s+d}}$ when $g\in H^s$ (the rate achieved by splines). In the graph setting the minimax rate can be obtained by projecting the samples onto the first $K = K(\|g\|_{H^s})$ eigenvectors of the graph Laplacian [35]. Whilst this has the advantage of better rates, one must have an a priori estimate of the $H^s$ norm of g in order to know K.

Outline
The rest of the paper is organized as follows. In Section 2 we obtain the $L^2$ variance estimates discussed in Section 1.3.1. In Section 3 we consider the bias of the estimation procedure given in Section 1.3.2.

L^2 Variance Estimates
In this section we prove the variance estimates stated precisely in Theorem 1.2. We split the proof into two main steps. First, we compare the solution $u^*_{n,\tau}$ with a discrete noiseless minimiser $u^{g*}_{n,\tau}$. Then, we compare the function $u^{g*}_{n,\tau}$ with $u^*_\tau$.

Removing the Noise
We start by stating the main result of the section.

Proposition 2.1. Let $E^{(y_n)}_{n,\tau}$ be defined by (9), where $R_n$, $\Delta_n$ are defined by (10), (8) respectively, and let $u^*_{n,\tau}$, $u^{g*}_{n,\tau}$ be the minimisers of $E^{(y_n)}_{n,\tau}$, $E^{(g_n)}_{n,\tau}$ respectively. Assume that the $\xi_i$ are iid, mean zero, sub-Gaussian random variables. Then, for all α > 1, there exist $\varepsilon_0 > 0$ and C > 0 such that the stated bound holds for any ε, n satisfying (16) and τ > 0.

The proof of the proposition is given at the end of the section. The strategy is to compare the Euler-Lagrange equations associated with minimising $E^{(y_n)}_{n,\tau}$ and $E^{(g_n)}_{n,\tau}$. In particular, writing $w^*_{n,\tau} := u^*_{n,\tau} - u^{g*}_{n,\tau}$, we have
(18) $\quad \tau\Delta_n^s w^*_{n,\tau} + w^*_{n,\tau} = \xi_n,$
where $\xi_n = (\xi_1,\dots,\xi_n)$. To obtain an estimate on $\|w^*_{n,\tau}\|_{L^2(\mu_n)}$ we use an ansatz $\tilde{w}_n$ and show that $\|w^*_{n,\tau} - \tilde{w}_n\|_{L^2(\mu_n)}$ is small (with a rate involving $\frac{\log(n)}{n\varepsilon^d}$) and that $\|\tilde{w}_n\|_{L^2(\mu_n)} \le C\frac{\varepsilon^{2s}}{\tau}$, both with high probability. Our choice of ansatz is to assume that the diagonal part of $\Delta_n$ dominates, and therefore $\Delta_n \approx \frac{2}{n\varepsilon^2} D_n$, which leads to the choice
(19) $\quad \tilde{w}_n := \left(\mathrm{Id} + \tau\left(\tfrac{2}{n\varepsilon^2}\right)^s D_n^s\right)^{-1}\xi_n.$
This choice of ansatz is appropriate because the off-diagonal elements of $\Delta_n$ equate to a local averaging procedure, which with high probability will not amplify the vector $\xi_n$. We can equivalently write
(20) $\quad \tilde{w}_n(x_i) = \frac{\xi_i}{1 + \tau\left(\frac{2 d_i}{n\varepsilon^2}\right)^s},$
where $d_i$ is the i-th diagonal entry of $D_n$. The following lemmas will be useful.
Proof. Fix $i\in\mathbb{N}$; then the $W_{i,j}$ are iid for $j \neq i$. Setting $\sigma^2 = \mathrm{Var}(W_{i,j})$ and $M = \|W_{i,j}\|_\infty$ (where the expectation $E[W_{i,j}]$ is taken over $x_j$), it is straightforward to show the bounds $\sigma^2 \le C\varepsilon^{-d}$ and $M \le C\varepsilon^{-d}$. By Bernstein's inequality, for all t > 0, the sum $\sum_j W_{i,j}$ concentrates around its mean. Choosing $t = \lambda n$ and restricting to $\lambda \le 1$, then union bounding over $i\in\{1,\dots,n\}$, we can conclude the first result.

The second result follows from the first by choosing $\tilde\varepsilon = 2\varepsilon$ and letting $\widetilde W_{i,j} = \eta_{\tilde\varepsilon}(|x_i - x_j|)$. Then $W_{i,j} > 0$ implies $\tilde\varepsilon^d \widetilde W_{i,j} \ge 0.5$, and therefore $\#\{j : W_{i,j} > 0\} \le 2\tilde\varepsilon^d \sum_{j=1}^n \widetilde W_{i,j}$. Applying the first part of the lemma, we have $2\tilde\varepsilon^d \sum_{j=1}^n \widetilde W_{i,j} \le 2C_2 n\tilde\varepsilon^d = 2^{d+1}C_2 n\varepsilon^d$, as required.
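The $O(n\varepsilon^d)$ degree scaling at the heart of this lemma can be illustrated numerically; a sketch, assuming uniform density on the unit torus in d = 2 with the indicator kernel, so that the expected degree is $n$ times the volume $\pi\varepsilon^2$ of an ε-ball:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 1000, 0.1
X = rng.random((n, 2))                      # uniform samples on the unit 2-torus

# periodic (torus) distances
diff = np.abs(X[:, None, :] - X[None, :, :])
diff = np.minimum(diff, 1.0 - diff)
dist = np.linalg.norm(diff, axis=2)

deg = (dist < eps).sum(axis=1) - 1          # neighbours within eps, excluding self
expected = n * np.pi * eps**2               # n * vol(eps-ball), density rho = 1
print(deg.mean(), expected)                 # mean degree concentrates near n*pi*eps^2
```

For non-uniform densities ρ the same experiment exhibits degrees proportional to the local density, which is the mechanism behind the density-adaptive regularisation discussed in the introduction.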
In [11, Lemma 3.1] the authors show, in the Euclidean setting, that if $\varepsilon_n\to 0$ satisfies $\frac{\log n}{n\varepsilon_n^d}\to 0$ then there exists an absolutely continuous probability measure $\tilde\mu_n\in\mathcal{P}(\Omega)$, with density $\tilde\rho_n$, for which the stated comparison holds. As in [8, Proposition 2.10], the proof can be modified to give a non-asymptotic quantitative high probability bound. In particular, there exist constants C, $\varepsilon_0$ and $\theta_0$ such that the bound holds with probability at least $1 - 2ne^{-cn\theta^2\varepsilon^d}$. For the rest of the proof we fix $\theta = \theta_0$ and absorb it into the constants.
Note that if we define $\tilde\eta = \eta((|\cdot| - 1)_+)$, then the corresponding bounds transfer. We choose $T_n$ to be a transport map satisfying $T_{n\#}\tilde\mu_n = \mu_n$. Let $\lambda_{\max}$ be the largest eigenvalue of $\Delta_n$; then, as in the proof of [19, Lemma 22], $\lambda_{\max}\le C\varepsilon^{-2}$. Although the bound holds with probability at least $1 - Cne^{-cn\varepsilon^d}$ when d = 2, we can assume that the C in (16) is sufficiently large so that $\frac{n\varepsilon^d}{\log(n)}\ge\frac{\alpha+1}{c}$. After some elementary algebra, one has that $\|\Delta_n\|_{\mathrm{op}}\le\frac{C}{\varepsilon^2}$. For any $v\in L^2(\mu_n)$ we have, by Lemma 2.2, with high probability $\|D_n v\|_{L^2(\mu_n)}\le C_2 n\|v\|_{L^2(\mu_n)}$, which implies $\|D_n\|_{\mathrm{op}}\le C_2 n$. The operator norm bound on $W_n$ follows from the bounds on the operator norms of $\Delta_n$, $D_n$ and the triangle inequality. Choosing C in Equation (16) sufficiently large, we can assume that these estimates hold simultaneously. In fact, [19, Lemma 22] suggests the operator bound on $\Delta_n$ is sharp (up to a constant); that is, there exists c > 0 such that $\|\Delta_n\|_{\mathrm{op}}\ge\frac{c}{\varepsilon^2}$.
Proof. Let us condition on a graph $G_n$ that satisfies the two inequalities of Lemma 2.2. Fix $i\in\{1,2,\dots,n\}$ and define the summands $q_j$ accordingly. Conditioned on $G_n$, the $q_j$ are zero mean and independent random variables; moreover, their $\Psi_2$ (Birnbaum-Orlicz) norms are bounded. By Hoeffding's inequality, for any t > 0, an exponential tail bound holds. We choose $t = \lambda\sqrt{n\log(n)}$, so that the bound holds with probability at least $1 - 2n^{-c\lambda^2}$, conditioned on $G_n$. Union bounding over $i\in\{1,2,\dots,n\}$ and selecting $\lambda = \sqrt{\frac{\alpha+1}{c}}$, we get that the above bound holds for all $i\in\{1,2,\dots,n\}$ with probability at least $1 - 2n^{1-c\lambda^2} = 1 - 2n^{-\alpha}$. Hence, after absorbing α into the constant C, the claimed bound holds conditioned on $G_n$ with probability at least $1 - 2n^{-\alpha}$. Since, by Lemma 2.2, the probability of $G_n$ satisfying conditions (i) and (ii) is at least $1 - 2e^{-cn\varepsilon^d}$, and by choosing C sufficiently large we have $2e^{-cn\varepsilon^d}\le Cn^{-\alpha}$, we can conclude the lemma.
To control $\|\tilde w_n - w^*_{n,\tau}\|_{L^2(\mu_n)}$ we take advantage of the convexity of our objective functional $E^{(\xi_n)}_{n,\tau}$, where we recall that $\xi_n = y_n - g_n$. In particular, one can easily show that $E^{(\xi_n)}_{n,\tau}$ satisfies a strong convexity inequality for any $u_n, v_n \in L^2(\mu_n)$. Hence, applying this bound to $u_n = w^*_{n,\tau}$ and $v_n = \tilde w_n$, and using the optimality of $w^*_{n,\tau}$, we obtain a bound in terms of the gradient of the energy at $\tilde w_n$. The next lemma bounds these gradients in order to prove $L^2$ convergence rates.
Lemma 2.5. Let Assumptions (A1)-(A4) hold and $s\in\mathbb{N}$. Let $\xi_i$ be iid, mean zero, sub-Gaussian random variables. Define $\tilde w_n$ by (20) and $w^*_{n,\tau}$ by (18), where $\Delta_n$ is given by (8). Then, for any α > 1, there exists C > 0 such that the stated gradient bound holds if ε, n satisfy (16) and τ > 0.

Proof. By the definition of $\tilde w_n$, namely Equation (19), we may expand the gradient of the energy at $\tilde w_n$. Using the bounds from Lemmas 2.3 and 2.4 and their associated probability estimates, along with the fact that we have cancelled the $D_n^s$ term, we may then bound the gradient as claimed. This concludes the proof.
Our final lemma before proving Proposition 2.1 bounds $\|\tilde w_n\|^2_{L^2(\mu_n)}$.

Lemma 2.6. Let Assumptions (A1)-(A4) hold and $s\in\mathbb{N}$. Let $\xi_i$ be iid, mean zero, sub-Gaussian random variables. Define $\tilde w_n$ by (20). Then, for any α > 1, there exists C > 0 such that for all ε, n satisfying $n\varepsilon^d \ge 1$ and τ > 0 we have $\|\tilde w_n\|_{L^2(\mu_n)} \le C\frac{\varepsilon^{2s}}{\tau}$ with probability at least $1 - n^{-\alpha}$.

Proof. By application of Lemma 2.2 we can bound the diagonal entries of $D_n$ from below. Applying a Chernoff bound, for all s, t ≥ 0, with the sub-Gaussian ($\Psi_2$) norm of the $\xi_i$ and the choice $t = An$, and choosing A sufficiently large so that the exponent is at least $\alpha + \log 2$, we conclude. In particular, $\|\tilde w_n\|_{L^2(\mu_n)} \le C\frac{\varepsilon^{2s}}{\tau}$ with probability at least $1 - n^{-\alpha}$, as required.
The proof of Proposition 2.1 now follows from Lemma 2.3, Lemma 2.5 and Lemma 2.6, since
$\quad \|w^*_{n,\tau}\|_{L^2(\mu_n)} \le \|w^*_{n,\tau} - \tilde w_n\|_{L^2(\mu_n)} + \|\tilde w_n\|_{L^2(\mu_n)}.$

Discrete-to-Continuum in the Noiseless Case
In this subsection we prove the following estimates, which relate the function $u^{g*}_{n,\tau}$ (the minimiser of the noiseless energy $E^{(g_n)}_{n,\tau}$, cf. (11)) to the function $u^*_\tau$ (the minimiser of $E^{(g)}_{\infty,\tau}$ defined in (14)). As in (17), we can write the Euler-Lagrange equation associated with minimising $E^{(g)}_{\infty,\tau}$ as
(24) $\quad \tau \Delta_\rho^s u^*_\tau + u^*_\tau - g = 0.$
Our main result for this section is the following proposition.
Proposition 2.7. Let Assumptions (A1)-(A5) hold and $s\in\mathbb{N}$. Define $\Delta_n$, $\Delta_\rho$ and $\sigma_\eta$ by (8), (12) and (13) respectively. Let $u^{g*}_{n,\tau}$ and $u^*_\tau$ satisfy (17) and (24) respectively. Then, for any α > 1 and $\tau_0 > 0$ there exist constants $\varepsilon_0 > 0$, $C > c > 0$ such that the stated bound holds for any ε, n satisfying (16) and τ ∈ (0, τ_0).

The proof of the proposition is given in Section 2.2.3. The proof of Theorem 1.2 follows immediately from the triangle inequality and Propositions 2.1 and 2.7. One of the main ingredients in proving Proposition 2.7 is the following result, which is of interest in its own right.

Theorem 2.8. Let Assumptions (A1)-(A4) hold and $s\in\mathbb{N}$. Define $\Delta_n$, $\Delta_\rho$ and $\sigma_\eta$ by (8), (12) and (13) respectively. Then, for any α > 1, there exist $C > c > 0$ and $\varepsilon_0 > 0$ such that the stated estimate holds for any $u\in C^{2s+1}$ and ε, n satisfying (16).

We note that when s = 1 it is well known that the graph Laplacian is pointwise consistent, along with the rate at which it converges, e.g. [62]. Theorem 2.8 generalises this result, and states that with high probability $\Delta_n^s u \to \Delta_\rho^s u$ in an $L^2$ sense for all $s\in\mathbb{N}$, where u is sufficiently smooth and $\varepsilon = \varepsilon_n$ satisfies a lower bound. The proof of Theorem 2.8 is given in Section 2.2.2.
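The s = 1 case of this consistency can be illustrated numerically on the 1D torus with uniform density, where $\Delta_\rho u = -\sigma_\eta u''$ and $\sigma_\eta = \int \eta(|z|)z^2\,\mathrm{d}z = 2/3$ for the indicator kernel. The scaling of $\Delta_n$ below follows our reading of (8); the specific n, ε and test function are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps = 2000, 0.2
x = rng.random(n)                           # uniform samples on the 1D torus
u = np.sin(2 * np.pi * x)

# torus distances; indicator kernel, so W_ij = eta_eps(|x_i - x_j|) = eta(.../eps)/eps
diff = np.abs(x[:, None] - x[None, :])
diff = np.minimum(diff, 1.0 - diff)
W = (diff < eps).astype(float) / eps        # d = 1
np.fill_diagonal(W, 0.0)

Delta_u = (2.0 / (n * eps**2)) * (W.sum(axis=1) * u - W @ u)

# continuum limit for rho = 1: Delta_rho u = -sigma_eta * u'' = sigma_eta*(2 pi)^2*sin(2 pi x)
target = (2.0 / 3.0) * (2 * np.pi) ** 2 * np.sin(2 * np.pi * x)
rel_err = np.linalg.norm(Delta_u - target) / np.linalg.norm(target)
print(rel_err)
```

With these modest values of n and ε the relative $L^2(\mu_n)$ error is already well below 1, and it shrinks further along admissible scalings $\varepsilon_n \to 0$ with $n\varepsilon_n^{d+2}/\log n \to \infty$.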
Before presenting a rigorous proof of Proposition 2.7, let us give a heuristic argument. First, we write the difference $\Delta_n^s u - \Delta_\rho^s u$ in terms of $v^{(k)} = \Delta_\rho^k u$. We keep track of higher-order errors in the pointwise consistency of the graph Laplacian, following the method in [6], to obtain, when $v\in C^r$, an expansion (25) with error terms $E_i \in C^{r-i-2}$. Now, heuristically one expects (with high probability) these errors to be small, and we recall a worst case (high probability) operator norm bound $\|\Delta_n^j\|_{\mathrm{op}} \le C\varepsilon_n^{-2j}$; see Lemma 2.3. Letting $u = u^*_\tau$, and assuming $g\in C^1(\Omega)$, we can immediately infer that $u\in C^{2s+1}(\Omega)$ from (24) (as a standard elliptic regularity result). We choose $v = v^{(k-1)}$ in (25) and note that $r = 2(s-k)+3$. Then (with high probability) the errors accumulate to give $\|\Delta_n^s u - \Delta_\rho^s u\|_{L^2(\mu_n)} = O(\varepsilon_n)$ (note that the constant C in the above inequality depends on u; in the proof we show that this dependence is in terms of $\|u\|_{C^{2s+1}(\Omega)}$, i.e. $\|\Delta_n^s u - \Delta_\rho^s u\|_{L^2(\mu_n)} \le C\varepsilon_n (\|u\|_{C^{2s+1}(\Omega)} + 1)$). The above discussion is clearly formal, and we spend the remainder of the section making the proof rigorous. We do this in two stages. The first step gives operator bounds on $\Delta_n$ for smooth functions, i.e. quantifying $\|\Delta_n^m v\|_{L^2(\mu_n)}$ in terms of norms of v. The second step derives (25), from which we can prove Theorem 2.8 when combined with the first step.
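The telescoping decomposition underlying this heuristic (and the rigorous proof of Theorem 2.8) can be written out explicitly; a sketch, with $v^{(k)} = \Delta_\rho^k u$ as above:

```latex
\Delta_n^s u - \Delta_\rho^s u
  = \sum_{k=1}^{s} \Delta_n^{s-k}\left(\Delta_n - \Delta_\rho\right)\Delta_\rho^{k-1} u
  = \sum_{k=1}^{s} \Delta_n^{s-k}\left(\Delta_n v^{(k-1)} - v^{(k)}\right),
```

so each term is a pointwise-consistency error $\Delta_n v^{(k-1)} - \Delta_\rho v^{(k-1)}$, amplified by at most $\|\Delta_n^{s-k}\|_{\mathrm{op}} \le C\varepsilon_n^{-2(s-k)}$.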
Let us define the non-local continuum Laplacian by
(26) $\quad \Delta_\varepsilon v(x) := \frac{2}{\varepsilon^2}\int_\Omega \eta_\varepsilon(|x-y|)\left(v(x) - v(y)\right)\rho(y)\,\mathrm{d}y.$
We prove the proposition in two steps. In the first step we bound $\|\Delta_\varepsilon^m v - \Delta_\rho^m v\|$. In the second step we bound the difference $\|\Delta_n^m v - \Delta_\varepsilon^m v\|_{L^2(\mu_n)}$. Initially we consider the case m = 1, which is just the difference of $\Delta_n v(x)$ from its expected value $\Delta_\varepsilon v(x) = E[\Delta_n v(x)]$. We then bootstrap this to m > 1.
Putting the two steps together proves Proposition 2.9.
Lemma 2.10. Let Assumptions (A1), (A3) and (A4) hold, and $k\in\mathbb{N}$. Define $\Delta_\varepsilon$ by (26). Then there exist C > 0, $\varepsilon_0 > 0$ such that the stated bounds hold for all $\varepsilon\in(0,\varepsilon_0)$ and all $v\in C^{k+2}(\Omega)$.

Proof. We can write, for ε sufficiently small, an expansion of $\Delta_\varepsilon v$, where ∇ is the gradient in $\mathbb{R}^d$ and $D^2$ the matrix of second derivatives of a function on $\mathbb{R}^d$. This proves the first part of the lemma. Iterating the estimate (27) implies (28).

We now turn to Step 2: bounding the difference $\Delta_n - \Delta_\varepsilon$.
Proof. Fix $w\in C^1(\Omega)$ and $x\in\Omega_n$, and define the summands accordingly. By Bernstein's inequality, for any t > 0, we obtain an exponential tail bound.
Choosing $t = n\varepsilon^p\|w\|_{C^1(\Omega)}$ implies the desired bound. Symmetrising the argument, we obtain the corresponding bound for all $i = 1,\dots,n$ with probability at least $1 - 2e^{-cn\varepsilon^{d+2p+2}}$. Substituting into (29) and union bounding over all $x\in\Omega_n$, we have proved the lemma.
Using the above lemma, we can provide a bound on $\Delta_n^m - \Delta_\varepsilon^m$.

Proof of Theorem 2.8
Now, we note that $\Delta_n^s u - \Delta_\rho^s u$ can be decomposed using $v^{(i)} = \Delta_\rho^i u$. The idea is to use pointwise convergence, but keeping track of higher-order terms than in the estimates appearing in [6,62]. For example, [6] shows that if $f\in C^3$ then a pointwise consistency bound (30) holds with error ϑ, where $\vartheta\ge\varepsilon$, with probability at least $1 - Cne^{-cn\varepsilon^{d+2}\vartheta^2}$. Directly applying the operator bounds, if we could choose $\vartheta_k = \varepsilon^{1+2(s-k)}$ then the proof would be immediate; however, the pointwise convergence result (30) requires $\vartheta\ge\varepsilon$, which rules out this choice. Nevertheless, we will show that this gives the right answer; in particular, that the convergence is within ε with probability at least $1 - Cne^{-cn\varepsilon^{d+4s}}$. The rest of the section is devoted to removing the assumption that $\vartheta_k\ge\varepsilon$.
As in the proof of Lemma 2.14, we take advantage of the fact that $\langle u^*_\tau, q_i\rangle_{L^2(\mu)} = \frac{\langle g, q_i\rangle_{L^2(\mu)}}{1+\tau\lambda_i^s}$ to infer a bound on $\|u^*_\tau\|_{H^k(\Omega)}$. By choosing k sufficiently large and employing Morrey's inequality, we can find C such that $\|u\|_{C^{2s+1}(\Omega)} \le C\|u\|_{H^k(\Omega)}$ for all $u\in H^k(\Omega)$. In particular, $\|u^*_\tau\|_{C^{2s+1}(\Omega)} \le C\|g\|_{H^k(\Omega)}$ (for a possibly larger constant), which proves the lemma.
We can now prove Proposition 2.7.
We can derive the second inequality from the first inequality and Theorem 2.8, as follows.


$\int_0^1 \nabla\rho(x + \varepsilon s z)\cdot z\,\mathrm{d}s\,\mathrm{d}z$, by Taylor's theorem and a change of variables. Using the reflective symmetry of η we have $\int_{\mathbb{R}^d}\eta(|z|)\,z\,\mathrm{d}z = 0$, and hence the first-order term vanishes.