Infill asymptotics and bandwidth selection for kernel estimators of spatial intensity functions

We investigate the asymptotic mean squared error of kernel estimators of the intensity function of a spatial point process. We show that when $n$ independent copies of a point process in $\mathbb R^d$ are superposed, the optimal bandwidth $h_n$ is of the order $n^{-1/(d+4)}$ under appropriate smoothness conditions on the kernel and true intensity function. We apply the Abramson principle to define adaptive kernel estimators and show that asymptotically the optimal adaptive bandwidth is of the order $n^{-1/(d+8)}$ under appropriate smoothness conditions.


Introduction
Often the first step in the analysis of a spatial point pattern is to estimate its intensity function. Various non-parametric estimators are available to do so. Some techniques are based on local neighbourhoods of a point, expressed for example by its nearest neighbours [7], its Voronoi [11] or Delaunay tessellation [13,14]. By far the most popular technique, however, is kernel smoothing [6]. Specifically, let Φ be a point process that is observed in a bounded open subset ∅ ≠ W of $\mathbb R^d$ and assume that its first order moment measure exists as a σ-finite Borel measure and is absolutely continuous with respect to Lebesgue measure with a Radon–Nikodym derivative λ : $\mathbb R^d$ → [0, ∞) known as its intensity function. A kernel estimator of λ based on Φ ∩ W takes the form
$$
\hat\lambda(x_0; h, \Phi, W) = \frac{1}{h^d} \sum_{y \in \Phi \cap W} \kappa\!\left( \frac{x_0 - y}{h} \right), \qquad x_0 \in W. \tag{1}
$$
The function κ : $\mathbb R^d$ → [0, ∞) is supposed to be a kernel, that is, a d-dimensional symmetric probability density function [15, p. 13]. The choice of bandwidth h > 0 determines the amount of smoothing. In principle, the support of κ((x_0 − y)/h) as a function of y could overlap the complement of W. Therefore, various edge corrections have been proposed [2,9]. In the sequel, though, we will be concerned with very small bandwidths, so this aspect may be ignored.

The aim of this paper is to derive asymptotic expansions for the bias and variance of (1) in terms of the bandwidth. This problem is well understood for probability density functions; indeed, there exists a vast literature, for example the textbooks [3,15,16] and the references therein. In a spatial context, bandwidth selection is dominated by ad hoc [2] and non-parametric [5] methods. The first rigorous study of bandwidth selection, to the best of our knowledge, is that by Lo [10], who studies infill asymptotics for spatial patterns consisting of independent and identically distributed points. Our goal is to extend his approach to point processes that may exhibit interactions between their points and to investigate adaptive versions thereof.
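To fix ideas, the fixed-bandwidth estimator just described can be sketched in a few lines of code. The snippet below is an illustration only: the function names and the choice of the Epanechnikov kernel are ours, not part of the paper, and no edge correction is applied.

```python
import math
import numpy as np

def epanechnikov(u):
    """d-dimensional Epanechnikov kernel (the Beta kernel with gamma = 1),
    vectorised over the rows of u; it integrates to one over the unit ball."""
    d = u.shape[1]
    c = math.gamma(d / 2 + 2) / math.pi ** (d / 2)  # normalising constant
    r2 = (u ** 2).sum(axis=1)
    return np.where(r2 <= 1.0, c * (1.0 - r2), 0.0)

def kernel_intensity(x0, points, h, kappa=epanechnikov):
    """Fixed-bandwidth kernel estimate of the intensity at x0:
    h^(-d) times the sum over observed points y of kappa((x0 - y) / h)."""
    x0 = np.asarray(x0, dtype=float)
    points = np.asarray(points, dtype=float)
    d = x0.shape[0]
    return kappa((x0 - points) / h).sum() / h ** d
```

For a single point located exactly at x_0 in the plane, the estimate equals κ(0)/h², i.e. (2/π)/h² for the Epanechnikov kernel; small bandwidths therefore produce sharp peaks at the data points, which is why the bandwidth trade-off studied below matters.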
The plan of this paper is as follows. In Section 2 we focus on the regime in which n independent copies of the same point process are superposed and the bandwidth h_n tends to zero as n tends to infinity but does not depend on the points of the pattern. We derive Taylor expansions and deduce the asymptotically optimal bandwidth. Intuitively, however, one feels that in sparse regions more smoothing is necessary than in regions that are rich in points. Indeed, in the context of estimating a probability density function, Abramson [1] proposed to scale the bandwidth in inverse proportion to the square root of the density. Analogously, in Section 3 we scale the bandwidth in inverse proportion to the square root of the intensity function and show that by doing so the bias can be reduced. For the sake of readability, all proofs are deferred to Section 4.
Lemma 1 Let Φ be a point process observed in a bounded open subset ∅ ≠ W ⊂ $\mathbb R^d$ whose factorial moment measures exist up to second order and are absolutely continuous with intensity function λ and second order product density $\rho^{(2)}$. Let κ be a kernel. Then the first two moments of (1) are
$$
\mathbb E \hat\lambda(x_0; h, \Phi, W) = \frac{1}{h^d} \int_W \kappa\!\left( \frac{x_0 - y}{h} \right) \lambda(y)\, dy
$$
and
$$
\mathbb E\!\left[ \hat\lambda(x_0; h, \Phi, W)^2 \right] = \frac{1}{h^{2d}} \int_W \int_W \kappa\!\left( \frac{x_0 - u}{h} \right) \kappa\!\left( \frac{x_0 - v}{h} \right) \rho^{(2)}(u, v)\, du\, dv + \frac{1}{h^{2d}} \int_W \kappa\!\left( \frac{x_0 - y}{h} \right)^2 \lambda(y)\, dy.
$$
The proof follows directly from the definition of product densities, see for example [4, Section 4.3.3]. Provided λ(·) > 0, the variance of $\hat\lambda(x_0; h, \Phi, W)$ can be expressed in terms of the pair correlation function $g(u,v) = \rho^{(2)}(u,v) / (\lambda(u)\lambda(v))$:
$$
\operatorname{Var} \hat\lambda(x_0; h, \Phi, W) = \frac{1}{h^{2d}} \int_W \int_W \kappa\!\left( \frac{x_0 - u}{h} \right) \kappa\!\left( \frac{x_0 - v}{h} \right) \left( g(u,v) - 1 \right) \lambda(u) \lambda(v)\, du\, dv + \frac{1}{h^{2d}} \int_W \kappa\!\left( \frac{x_0 - y}{h} \right)^2 \lambda(y)\, dy.
$$
For Poisson processes, the first integral vanishes as g ≡ 1.
In this paper, we will restrict ourselves to kernels that belong to the Beta class
$$
\kappa_\gamma(x) = \frac{\Gamma(d/2 + \gamma + 1)}{\pi^{d/2}\, \Gamma(\gamma + 1)} \left( 1 - \|x\|^2 \right)^\gamma 1\{ x \in b(0,1) \} \tag{3}
$$
for γ ≥ 0. Here b(0, 1) is the closed unit ball in $\mathbb R^d$ centred at the origin. The normalising constant will be abbreviated by $c(d, \gamma) = \Gamma(d/2+\gamma+1) / (\pi^{d/2} \Gamma(\gamma+1))$. Note that Beta kernels are supported on the compact unit ball and that their smoothness is governed by the parameter γ. Indeed, the box kernel defined by γ = 0 is constant and therefore continuous on the interior of the unit ball; the Epanechnikov kernel corresponding to the choice γ = 1 is Lipschitz continuous. For γ > k the function κ_γ is k times continuously differentiable on $\mathbb R^d$.
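A small numerical companion to the Beta class may be helpful. The normalising constant used below, Γ(d/2 + γ + 1)/(π^{d/2} Γ(γ + 1)), is the standard one for this family, and the quadrature check is our own sanity test rather than part of the paper.

```python
import math
import numpy as np

def beta_kernel(u, gamma):
    """Beta-class kernel kappa_gamma on the closed unit ball b(0,1) of R^d:
    c(d, gamma) * (1 - ||u||^2)^gamma inside the ball, zero outside."""
    u = np.atleast_2d(u)
    d = u.shape[1]
    c = math.gamma(d / 2 + gamma + 1) / (math.pi ** (d / 2) * math.gamma(gamma + 1))
    r2 = (u ** 2).sum(axis=1)
    return np.where(r2 <= 1.0, c * np.maximum(1.0 - r2, 0.0) ** gamma, 0.0)

def integrate_2d(f, n=400):
    """Midpoint-rule integral of f over [-1, 1]^2 on an n-by-n grid."""
    xs = (np.arange(n) + 0.5) / n * 2.0 - 1.0
    X, Y = np.meshgrid(xs, xs)
    pts = np.column_stack([X.ravel(), Y.ravel()])
    return f(pts).sum() * (2.0 / n) ** 2
```

For d = 2 the box kernel (γ = 0) takes the constant value 1/π on the unit disc, and each κ_γ integrates to one, which the midpoint rule reproduces up to discretisation error.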
The following Lemma collects further basic properties of the Beta kernels. The proof can be found in Section 4.1.

Lemma 2 Let κ_γ, γ ≥ 0, be a Beta kernel (3) on $\mathbb R^d$. Then, for all i ≠ j ∈ {1, . . . , d},
$$
\int_{\mathbb R^d} u_i\, \kappa_\gamma(u)\, du = 0, \qquad \int_{\mathbb R^d} u_i u_j\, \kappa_\gamma(u)\, du = 0,
$$
and
$$
V(d, \gamma) := \int_{\mathbb R^d} u_i^2\, \kappa_\gamma(u)\, du = \frac{1}{d + 2\gamma + 2}, \qquad V_4(d, \gamma) := \int_{\mathbb R^d} u_i^4\, \kappa_\gamma(u)\, du = \frac{3}{(d + 2\gamma + 2)(d + 2\gamma + 4)},
$$
and, for d ≥ 2,
$$
V_2(d, \gamma) := \int_{\mathbb R^d} u_i^2 u_j^2\, \kappa_\gamma(u)\, du = \frac{1}{(d + 2\gamma + 2)(d + 2\gamma + 4)}.
$$
Their values do not depend on the particular choices of i and j.
Lemma 1 can be used to derive the mean squared error of the estimator
$$
\hat\lambda(x_0) = \frac{1}{n} \sum_{i=1}^{n} \hat\lambda(x_0; h, \Phi_i, W) \tag{2}
$$
based on independent and identically distributed point processes Φ_1, . . . , Φ_n. The proof of the following Proposition can be found in Section 4.2.
Proposition 1 Let Φ_1, Φ_2, . . . be independent and identically distributed point processes observed in a bounded open subset ∅ ≠ W ⊂ $\mathbb R^d$. Assume that their factorial moment measures exist up to second order and are absolutely continuous with strictly positive intensity function λ : W → (0, ∞) and second order product density $\rho^{(2)}$. Write $Y_n = \cup_{i=1}^n \Phi_i$ for the union, n ∈ N, and let κ_γ be a Beta kernel (3) with γ ≥ 0. Then the mean squared error of (2) is given by
$$
\operatorname{mse} \hat\lambda(x_0) = \left( \frac{1}{h^d} \int_W \kappa_\gamma\!\left( \frac{x_0 - y}{h} \right) \lambda(y)\, dy - \lambda(x_0) \right)^2 + \frac{1}{n h^{2d}} \int_W \int_W \kappa_\gamma\!\left( \frac{x_0 - u}{h} \right) \kappa_\gamma\!\left( \frac{x_0 - v}{h} \right) \left( g(u,v) - 1 \right) \lambda(u) \lambda(v)\, du\, dv + \frac{1}{n h^{2d}} \int_W \kappa_\gamma\!\left( \frac{x_0 - y}{h} \right)^2 \lambda(y)\, dy.
$$
The first term in the above expression is the squared bias. It depends on λ and the bandwidth h but not on n. The remaining terms come from the variance and depend on λ, on g, on h and on n.
Our aim in the remainder of this section is to derive an asymptotic expansion of the mean squared error for bandwidths h_n that depend on n in such a way that h_n → 0 as n → ∞. In order to achieve this, first recall some basic facts from analysis. Let E be an open subset of $\mathbb R^n$ and denote by $C^k(E)$ the class of functions f : E → $\mathbb R^m$ for which all k-th order partial derivatives $D_{j_1 \cdots j_k} f$ exist and are continuous on E. For such functions the order of taking partial derivatives may be interchanged and the Taylor theorem states that if x ∈ E and x + th ∈ E for all 0 ≤ t ≤ 1, then a θ ∈ (0, 1) can be found such that
$$
f(x + h) = \sum_{r=0}^{k-1} \frac{1}{r!}\, D^r f(x)\!\left( h^{(r)} \right) + \frac{1}{k!}\, D^k f(x + \theta h)\!\left( h^{(k)} \right), \tag{5}
$$
where $h^{(r)}$ is the r-tuple (h, . . . , h) and
$$
D^r f(x)\!\left( h^{(r)} \right) = \sum_{j_1 = 1}^{n} \cdots \sum_{j_r = 1}^{n} D_{j_1 \cdots j_r} f(x)\, h_{j_1} \cdots h_{j_r}
$$
for h = (h_1, . . . , h_n) ∈ $\mathbb R^n$. We are now ready to state the main result of this section, generalising [10, Theorem 2] for the union of independent random points. The proof can be found in Section 4.2.
Theorem 1 Let Φ_1, Φ_2, . . . be i.i.d. point processes observed in a bounded open subset ∅ ≠ W ⊂ $\mathbb R^d$ with well-defined intensity function λ and pair correlation function g. Suppose that g : W × W → R is bounded and that λ : W → (0, ∞) is twice continuously differentiable with second order partial derivatives λ_ij = D_ij λ, i, j = 1, . . . , d, that are Hölder continuous with index α > 0 on W, that is, there exists some C > 0 such that for all i, j = 1, . . . , d:
$$
| \lambda_{ij}(x) - \lambda_{ij}(y) | \le C \| x - y \|^\alpha, \qquad x, y \in W.
$$
Consider the estimator (2) based on the unions $Y_n = \cup_{i \le n} \Phi_i$, n ∈ N, and Beta kernel κ_γ, γ ≥ 0, with bandwidth h_n chosen in such a way that, as n → ∞, h_n → 0 and $nh_n^d$ → ∞. Then, for x_0 ∈ W, as n → ∞,
1. the bias of $\hat\lambda(x_0)$ is
$$
\frac{h_n^2}{2}\, V(d, \gamma) \sum_{i=1}^{d} \lambda_{ii}(x_0) + O\!\left( h_n^{2+\alpha} \right);
$$
2. the variance of $\hat\lambda(x_0)$ is
$$
\frac{\lambda(x_0)}{n h_n^d} \int_{\mathbb R^d} \kappa_\gamma(u)^2\, du + O\!\left( \frac{1}{n h_n^{d-1}} \right).
$$
The bias depends on the second order partial derivatives of the unknown intensity function and on the smoothness parameter α. The smoothness of the kernel, measured by γ, also plays a role. The leading term of the variance depends on λ(x_0) and on the smoothness of the kernel.
Corollary 1 Consider the setting of Theorem 1. Then
$$
\operatorname{mse} \hat\lambda(x_0) = \frac{h_n^4}{4}\, V(d, \gamma)^2 \left( \sum_{i=1}^{d} \lambda_{ii}(x_0) \right)^2 + \frac{\lambda(x_0)}{n h_n^d} \int_{\mathbb R^d} \kappa_\gamma(u)^2\, du + o\!\left( h_n^4 + \frac{1}{n h_n^d} \right).
$$
Provided $\sum_i \lambda_{ii}(x_0) \ne 0$, the asymptotic mean squared error is optimised at
$$
h_n^*(x_0) = \left( \frac{d\, \lambda(x_0) \int \kappa_\gamma(u)^2\, du}{n\, V(d, \gamma)^2 \left( \sum_{i=1}^{d} \lambda_{ii}(x_0) \right)^2} \right)^{1/(d+4)}.
$$
In words, $h_n^*(x_0)$ is of the order $n^{-1/(d+4)}$. Clearly $h_n^*(x_0)$ tends to zero as n → ∞. Moreover, $n (h_n^*)^d$ is of the order $n^{1 - d/(d+4)} = n^{4/(d+4)}$ and therefore tends to infinity with n. For the special case d = 2, the optimal bandwidth is of the order $n^{-1/6}$. The following Proposition generalises [10, Proposition 5]. Its proof can be found in Section 4.2.
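The infill regime can be mimicked in a small simulation. The snippet below superposes n independent homogeneous Poisson patterns on the unit square and evaluates the averaged box-kernel estimate with h_n proportional to $n^{-1/6}$ (d = 2); the intensity value, the proportionality constant 0.5 and the use of the box kernel are illustrative choices of ours, not prescriptions from the paper.

```python
import math
import numpy as np

def poisson_pattern(lam, rng):
    """One homogeneous Poisson pattern with intensity lam on [0, 1]^2."""
    return rng.uniform(0.0, 1.0, size=(rng.poisson(lam), 2))

def superposition_estimate(lam, n, x0, rng):
    """Average over n i.i.d. patterns of the box-kernel estimate at x0,
    with bandwidth h_n = 0.5 * n^(-1/6), the optimal order for d = 2."""
    h = 0.5 * n ** (-1.0 / 6.0)
    x0 = np.asarray(x0, dtype=float)
    total = 0.0
    for _ in range(n):
        pts = poisson_pattern(lam, rng)
        inside = ((pts - x0) ** 2).sum(axis=1) <= h ** 2
        total += inside.sum() / (math.pi * h ** 2)  # box kernel kappa_0
    return total / n

rng = np.random.default_rng(0)
estimate = superposition_estimate(100.0, 200, [0.5, 0.5], rng)
```

For a homogeneous process the bias vanishes, and the standard deviation of the estimate is of the order $(\lambda / (n \pi h_n^2))^{1/2}$, so the computed value lies close to the true intensity 100.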
Proposition 2 Let Φ_1, Φ_2, . . . be i.i.d. point processes observed in a bounded open subset ∅ ≠ W ⊂ $\mathbb R^d$ with well-defined intensity function λ and pair correlation function g. Suppose that g : W × W → R is bounded and that λ : W → (0, ∞) is twice continuously differentiable with second order partial derivatives λ_ij = D_ij λ, i, j = 1, . . . , d, that are Hölder continuous with index α > 0 on W. Consider the estimator (2) based on the unions $Y_n = \cup_{i \le n} \Phi_i$, n ∈ N, and Beta kernel κ_γ, γ ≥ 0, with bandwidth h_n chosen in such a way that as n → ∞, h_n → 0 and $nh_n^d$ → ∞. Then, for x_0 ∈ W, $\hat\lambda(x_0) \to \lambda(x_0)$ in probability as n → ∞.

Adaptive infill asymptotics
Up to now, estimators based on (1) were considered in which the same bandwidth h was applied at every point y ∈ Φ ∩ W . However, at least intuitively, it seems clear that the bandwidth should be smaller in regions with many points, larger when points are scarce. This suggests that h = h(y) should be decreasing in λ(y).
Motivated by similar considerations in the context of density estimation, Abramson [1] suggested to consider point-dependent bandwidths of the form h(y) = h/c(y) for c(y) equal to the square root of the probability density function. He found that a significant reduction in bias could be obtained by the use of such adaptive bandwidths. Our aim in this section is to show that a similar result holds for spatial intensity function estimation.
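In code, Abramson's rule only changes the per-point rescaling. The sketch below is our illustration (Epanechnikov kernel, user-supplied weight function c): each point y receives the bandwidth h/c(y), so its contribution is $(c(y)/h)^d\, \kappa(c(y)(x_0 - y)/h)$.

```python
import math
import numpy as np

def adaptive_intensity(x0, points, h, c):
    """Adaptive kernel intensity estimate at x0 in which the point y is
    smoothed with bandwidth h / c(y) (Abramson-style rescaling)."""
    x0 = np.asarray(x0, dtype=float)
    d = x0.shape[0]
    norm = math.gamma(d / 2 + 2) / math.pi ** (d / 2)  # Epanechnikov constant
    est = 0.0
    for y in np.asarray(points, dtype=float):
        cy = c(y)                # larger c(y) means a smaller local bandwidth
        r2 = ((x0 - y) ** 2).sum() * (cy / h) ** 2
        if r2 <= 1.0:
            est += (cy / h) ** d * norm * (1.0 - r2)
    return est
```

With c ≡ 1 this reduces to the fixed-bandwidth estimator; doubling c(y) halves the bandwidth attached to y and multiplies its peak contribution by $2^d$.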
Define an estimator λ̂ of λ(x_0), x_0 ∈ W, as the average of the data-adaptive estimators
$$
\hat\lambda(x_0; h, \Phi_i, W) = \sum_{y \in \Phi_i \cap W} \left( \frac{c(y)}{h} \right)^d \kappa\!\left( \frac{c(y)\,(x_0 - y)}{h} \right), \qquad i = 1, \ldots, n. \tag{6}
$$
As in Section 2, κ is a kernel and the Φ_i, i = 1, . . . , n, are independent and identically distributed point processes on $\mathbb R^d$ observed in a bounded non-empty open subset W for which the first order moment measure exists and admits an intensity function λ : W → [0, ∞); c : W → (0, ∞) is assumed to be a measurable positive-valued weight function on W. The next result summarises the first two moments.
Lemma 3 Let Φ be a point process observed in a bounded open subset ∅ ≠ W ⊂ $\mathbb R^d$, whose factorial moment measures exist up to second order and are absolutely continuous with intensity function λ and second order product density $\rho^{(2)}$. Let κ be a kernel. Then the first two moments of (6) are
$$
\mathbb E \hat\lambda(x_0; h, \Phi, W) = \int_W \left( \frac{c(y)}{h} \right)^d \kappa\!\left( \frac{c(y)(x_0 - y)}{h} \right) \lambda(y)\, dy
$$
and
$$
\mathbb E\!\left[ \hat\lambda(x_0; h, \Phi, W)^2 \right] = \int_W \int_W \left( \frac{c(u)\,c(v)}{h^2} \right)^d \kappa\!\left( \frac{c(u)(x_0 - u)}{h} \right) \kappa\!\left( \frac{c(v)(x_0 - v)}{h} \right) \rho^{(2)}(u, v)\, du\, dv + \int_W \left( \frac{c(y)}{h} \right)^{2d} \kappa\!\left( \frac{c(y)(x_0 - y)}{h} \right)^2 \lambda(y)\, dy.
$$
The proof follows directly from the definition of product densities, see for example [4, Section 4.3.3]. For the special case c(u) ≡ 1, we retrieve Lemma 1.
Provided λ(·) > 0, the variance of λ̂(x_0), the average of the λ̂(x_0; h, Φ_i, W), can be expressed in terms of the pair correlation function as
$$
\operatorname{Var} \hat\lambda(x_0) = \frac{1}{n} \int_W \left( \frac{c(y)}{h} \right)^{2d} \kappa\!\left( \frac{c(y)(x_0 - y)}{h} \right)^2 \lambda(y)\, dy + \frac{1}{n} \int_W \int_W \left( \frac{c(u)\,c(v)}{h^2} \right)^d \kappa\!\left( \frac{c(u)(x_0 - u)}{h} \right) \kappa\!\left( \frac{c(v)(x_0 - v)}{h} \right) \left( g(u,v) - 1 \right) \lambda(u)\lambda(v)\, du\, dv. \tag{7}
$$
We are now ready to state the first main result of this section in analogy to [1, Theorem, p. 1218]. The proof can be found in Section 4.3.
Theorem 2 Let Φ_1, Φ_2, . . . be i.i.d. point processes observed in a bounded open subset ∅ ≠ W ⊂ $\mathbb R^d$ with well-defined intensity function λ and pair correlation function g. Suppose that g : W × W → R is bounded and that λ : W → $(\underline\lambda, \overline\lambda)$ is bounded, bounded away from zero and twice continuously differentiable on W with bounded second order partial derivatives. Consider the estimator λ̂ with $c(x) = (\lambda(x)/\lambda(x_0))^{1/2}$ based on the unions $Y_n = \cup_{i \le n} \Phi_i$, n ∈ N, and Beta kernel κ_γ, γ > 2, with bandwidth h_n chosen in such a way that, as n → ∞, h_n → 0 and $nh_n^d$ → ∞. Then, for x_0 ∈ W, as n → ∞,
1. the bias of λ̂(x_0) is $o(h_n^2)$;
2. the variance of λ̂(x_0) is
$$
\frac{\lambda(x_0)}{n h_n^d} \int_{\mathbb R^d} \kappa_\gamma(u)^2\, du + o\!\left( \frac{1}{n h_n^d} \right).
$$
In comparison with Theorem 1, the variance is the same as that for a non-adaptive bandwidth. The bias term on the other hand is of a smaller order. Note that, since the leading bias term is not specified, Theorem 2 cannot be used to calculate an asymptotically optimal bandwidth. To remedy this, stronger smoothness assumptions seem needed.
Theorem 3 Let Φ_1, Φ_2, . . . be i.i.d. point processes observed in a bounded open subset ∅ ≠ W ⊂ $\mathbb R^d$ with well-defined intensity function λ and pair correlation function g. Suppose that g : W × W → R is bounded and that λ : W → $(\underline\lambda, \overline\lambda)$ is bounded, bounded away from zero and five times continuously differentiable on W with bounded partial derivatives.

Consider the estimator λ̂ with $c(x) = (\lambda(x)/\lambda(x_0))^{1/2}$ based on the unions $Y_n = \cup_{i \le n} \Phi_i$, n ∈ N, and Beta kernel κ_γ, γ > 5, with bandwidth h_n chosen in such a way that, as n → ∞, h_n → 0 and $nh_n^d$ → ∞. Then, for x_0 ∈ W, as n → ∞, the bias of λ̂(x_0) is
$$
h_n^4\, \lambda(x_0) \int A(u; x_0)\, du + o\!\left( h_n^4 \right),
$$
where A(u; x_0) is an explicit function of u, the Beta kernel and the partial derivatives up to fourth order of c at x_0, and the variance of λ̂(x_0) admits the expansion of Theorem 2. For the important special cases d = 1, 2, the expression for A(u; x_0) may be simplified. All the proofs are given in Section 4.3.

Corollary 2 Consider the setting of Theorem 3. Then
$$
\operatorname{mse} \hat\lambda(x_0) = h_n^8\, \lambda(x_0)^2 \left( \int A(u; x_0)\, du \right)^2 + \frac{\lambda(x_0)}{n h_n^d} \int_{\mathbb R^d} \kappa_\gamma(u)^2\, du + o\!\left( h_n^8 + \frac{1}{n h_n^d} \right).
$$
Provided $\int A(u; x_0)\, du \ne 0$, the asymptotic mean squared error is optimised at
$$
h_n^*(x_0) = \left( \frac{d \int \kappa_\gamma(u)^2\, du}{8\, n\, \lambda(x_0) \left( \int A(u; x_0)\, du \right)^2} \right)^{1/(d+8)},
$$
which is of the order $n^{-1/(d+8)}$. The optimal bandwidth $h_n^*(x_0)$ and the weights $(\lambda(x)/\lambda(x_0))^{1/2}$ depend on the unknown intensity function. In practice, a non-parametric pilot estimator (for example the one proposed in [5]) would be plugged in.
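A plug-in version of the whole two-stage procedure might look as follows. This is a sketch under our own simplifying choices: the pilot is a plain box-kernel estimate (not the estimator of [5]), d = 2, the Epanechnikov kernel is used in the second stage, and the floor 1e-12 merely guards against division by zero.

```python
import math
import numpy as np

def box_estimate(x0, points, h):
    """Fixed-bandwidth box-kernel pilot estimate of the intensity at x0 (d = 2)."""
    r2 = ((points - x0) ** 2).sum(axis=1)
    return (r2 <= h ** 2).sum() / (math.pi * h ** 2)

def abramson_estimate(x0, points, h, h_pilot):
    """Two-stage estimate: plug pilot intensities into the Abramson weights
    c(y) = (pilot(y) / pilot(x0))^(1/2), then smooth y with bandwidth h / c(y)."""
    x0 = np.asarray(x0, dtype=float)
    points = np.asarray(points, dtype=float)
    pilot_x0 = max(box_estimate(x0, points, h_pilot), 1e-12)
    est = 0.0
    for y in points:
        pilot_y = max(box_estimate(y, points, h_pilot), 1e-12)
        cy = math.sqrt(pilot_y / pilot_x0)
        r2 = ((x0 - y) ** 2).sum() * (cy / h) ** 2
        if r2 <= 1.0:
            est += (cy / h) ** 2 * (2.0 / math.pi) * (1.0 - r2)  # Epanechnikov
    return est
```

On a regular 20 × 20 grid of points in the unit square (empirical intensity 400), the pilot weights are close to one near the centre and the adaptive estimate stays close to 400.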
To conclude this section, we present the analogue of Proposition 2. The proof can be found in Section 4.3.
Proposition 5 Let Φ_1, Φ_2, . . . be i.i.d. point processes observed in a bounded open subset ∅ ≠ W ⊂ $\mathbb R^d$ with well-defined intensity function λ and pair correlation function g. Suppose that g : W × W → R is bounded and that λ : W → $(\underline\lambda, \overline\lambda)$ is bounded, bounded away from zero and five times continuously differentiable on W with bounded partial derivatives. Consider λ̂ with $c(x) = (\lambda(x)/\lambda(x_0))^{1/2}$ based on the unions $Y_n = \cup_{i \le n} \Phi_i$, n ∈ N, and Beta kernel κ_γ, γ > 5, with bandwidth h_n chosen in such a way that as n → ∞, h_n → 0 and $nh_n^d$ → ∞. Then, for x_0 ∈ W, $\hat\lambda(x_0) \to \lambda(x_0)$ in probability as n → ∞.

Proofs and technicalities

4.1 Auxiliary lemmas for the Beta kernel
Proof of Lemma 2: The first two claims follow from the symmetry of the Beta kernel. Due to the symmetry of the Beta kernel it is also clear that the definitions of V(d, γ), V_4(d, γ) and V_2(d, γ) do not depend on the choices of i and j. First consider the case d = 1. By the symmetry of κ_γ and a change of variables,
$$
V(1, \gamma) = \frac{B(3/2, \gamma + 1)}{B(1/2, \gamma + 1)}.
$$
Similarly, $V_4(1, \gamma) = B(5/2, \gamma + 1) / B(1/2, \gamma + 1)$.
For dimensions d > 1, write V(d, γ) and V_4(d, γ) as a repeated integral and note that the innermost integral takes the form of a one-dimensional Beta-type integral. By the symmetry of κ_γ and a change of variables, the claimed expressions follow. Finally, for d > 1, V_2(d, γ) can be written as a repeated integral whose inner integral is equal to
$$
\left( 1 - \|x\|^2 \right)^{\gamma + 3/2} B\!\left( \frac{3}{2}, \gamma + 1 \right), \qquad x \in \mathbb R^{d-1},
$$
in accordance with the claim.
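The closed-form moments of the Beta kernels can be checked numerically. In the check below, the values 1/6, 1/16 and 1/48 for d = 2, γ = 1 are our own evaluations of V(2,1), V_4(2,1) and V_2(2,1), and the quadrature routine is an illustration, not part of the paper.

```python
import math
import numpy as np

def beta_moment_2d(p1, p2, gamma, n=600):
    """Midpoint-rule value of the integral of u1^p1 * u2^p2 * kappa_gamma(u)
    over the unit ball in R^2, for the two-dimensional Beta kernel."""
    c = (gamma + 1.0) / math.pi   # c(2, gamma) = Gamma(gamma+2) / (pi Gamma(gamma+1))
    xs = (np.arange(n) + 0.5) / n * 2.0 - 1.0
    X, Y = np.meshgrid(xs, xs)
    r2 = X ** 2 + Y ** 2
    kern = np.where(r2 <= 1.0, c * np.maximum(1.0 - r2, 0.0) ** gamma, 0.0)
    return (X ** p1 * Y ** p2 * kern).sum() * (2.0 / n) ** 2
```

For γ = 1 one obtains V(2,1) ≈ 1/6, V_4(2,1) ≈ 1/16 and V_2(2,1) ≈ 1/48, and all odd moments vanish by symmetry.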
In the sequel, the following additional properties of the Beta kernels will be needed.

Lemma 4
Consider the Beta kernels κ_γ with γ > 1 defined in equation (3). Then, for all i ∈ {1, . . . , d}, the integrals of second order products in u ∈ $\mathbb R^d$ with respect to D_iκ_γ vanish and for distinct i, j ∈ {1, . . . , d}, the integrals of other third order products in u ∈ $\mathbb R^d$ with respect to D_iκ_γ vanish. Finally the following identities hold for all i ≠ j ∈ {1, . . . , d}:

Proof of Lemma 4: The proof relies on partial integrations, which involve boundary evaluations at points u = (u_1, . . . , u_d) with ‖u‖ = 1. These take the value zero, as (1 − ‖u‖²) = 0 there, and the identities follow.

Lemma 5 Consider the Beta kernels κ_γ with γ > 2 defined in equation (3). Then, for all i ≠ j ∈ {1, . . . , d},

Proof of Lemma 5: Apply integration by parts and Lemma 4 to obtain that, for all distinct i, j, k, l in {1, . . . , d}, the boundary evaluations of products in u multiplied by D_kκ_γ(u) are zero, since D_kκ_γ contains the factor (1 − ‖u‖²)^{γ−1}, which takes the value zero when ‖u‖ = 1 and γ > 1. All other integrals of fourth order products in u ∈ $\mathbb R^d$ with respect to D_klκ_γ or D_kkκ_γ vanish.
and rearranging terms completes the proof.
Lemma 6 Let $g_u(v) = v^d \kappa_\gamma(vu)$, v > 0, u ∈ $\mathbb R^d$, for the Beta kernel κ_γ with γ > 4; as a function of v, g_u is four times continuously differentiable. The first three derivatives, as well as the fourth order derivative, follow by repeated application of the product and chain rules.

Proof of Lemma 6: For γ > 4, the function κ_γ is four times continuously differentiable. The expressions for the derivatives follow by straightforward calculation.

Proofs of propositions and theorems: non-adaptive case
Proof of Proposition 1: Since λ̂(x_0) is the average of n independent random variables,
$$
\mathbb E \hat\lambda(x_0) = \mathbb E \hat\lambda(x_0; h, \Phi_1, W) \qquad \text{and} \qquad \operatorname{Var} \hat\lambda(x_0) = \frac{1}{n} \operatorname{Var} \hat\lambda(x_0; h, \Phi_1, W).
$$
Therefore, the expectation and variance of λ̂(x_0) follow from Lemma 1. Since mse λ̂(x_0) is the sum of the squared bias and the variance, the claim is seen to hold.
Proof of Theorem 1: To prove 1. note that since h_n goes to zero, x_0 ∈ W and W is open, for n large enough b(x_0, h_n) ∩ W is equal to b(x_0, h_n). For such n, by a change of variables, the symmetry of the Beta kernels and the proof of Proposition 1, the bias is The intensity λ(x_0) can be brought under the integral since κ_γ is a probability density. Fix u ∈ b(0, 1). As x_0 + th_n u ∈ W for all 0 ≤ t ≤ 1 and λ is twice continuously differentiable on W, the term between curly brackets in the integrand may be expanded as a Taylor series (5) with k = 2: for some 0 < θ = θ(u) < 1 that may depend on u. Write

Now,
Since n was chosen large enough for x_0 + θh_n u to lie in W, we may use the Hölder assumption to obtain the inequality The right hand side does not depend on the particular choice of u ∈ b(0, 1) nor on θ(u) ∈ (0, 1). In summary, for a remainder term R(h_n, u) that satisfies |R(h_n, u)| ≤ Cd²h_n^{2+α}/2. Returning to the bias (8), for large n, By Lemma 2, because by Lemma 2, the cross terms with i ≠ j are zero. Finally, since κ_γ is a probability density and R(h_n, u) is uniformly bounded in u ∈ b(0, 1), To prove 2. note that, as for the bias, n may be chosen large enough for the ball b(x_0, h_n) to fall entirely in W. For such n, by a change of variables u = (x − x_0)/h_n and the symmetry of the Beta kernels, Fix u ∈ b(0, 1). As x_0 + th_n u ∈ W for all 0 ≤ t ≤ 1 and λ is continuously differentiable on W, we may use the Taylor expansion (5) with k = 1 to write for some 0 < θ = θ(u) < 1 that may depend on u. Since the partial derivatives are continuous and hence bounded on closed balls contained in W, say by Dλ, for a remainder term R(h_n, u) that satisfies |R(h_n, u)| ≤ dh_n Dλ and consequently by Lemma 2. The bound on the remainder term R(h_n, u) implies that We will now show that the contribution of the interaction structure (through the pair correlation function) to the variance vanishes. Choose n so large that b(x_0, h_n) ⊆ W. Then, by a change of variables and the symmetry of the Beta kernels, the double integral in Proposition 1 reduces to Since the pair correlation is assumed to be bounded on W, say g(·, ·) ≤ ḡ, and x_0 + h_n u ∈ W for all u ∈ b(0, 1), the double integral can be bounded in absolute value by cf. equation (9). The integrand in the right hand side is bounded in absolute value by κ_γ(u)(λ(x_0) + dh_n Dλ) and therefore the interaction structure contributes O(1/n) to the mean squared error. Upon adding (10), The last term is negligible with respect to the middle one, and the proof is complete.
Proof of Corollary 1: By Theorem 1,
$$
\operatorname{mse} \hat\lambda(x_0) = \left( \frac{h_n^2}{2}\, V(d, \gamma) \sum_{i=1}^{d} \lambda_{ii}(x_0) + R(h_n) \right)^2 + \frac{\lambda(x_0)}{n h_n^d} \int \kappa_\gamma(u)^2\, du + O\!\left( \frac{1}{n h_n^{d-1}} \right)
$$
for a remainder term R(h_n) for which there exists a scalar M such that |R(h_n)| ≤ M h_n^{2+α} for large n. Hence the claimed expression for the mean squared error follows. Consequently, the asymptotic mean squared error takes the form
$$
\alpha h_n^4 + \frac{\beta}{n h_n^d}
$$
for some scalars α, β > 0. Equating the derivative with respect to h_n to zero yields
$$
h_n^* = \left( \frac{d\beta}{4\alpha n} \right)^{1/(d+4)}.
$$
The second derivative with respect to h_n, $12\alpha h_n^2 + d(d+1)\beta n^{-1} h_n^{-d-2}$, is strictly positive, so $h_n^*$ is the unique minimum. Plugging in the expressions for α and β completes the proof.
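The minimisation step can be cross-checked numerically. The snippet below minimises the surrogate $\alpha h^4 + \beta/(n h^d)$ on a fine grid and compares the minimiser with the closed form $(d\beta/(4\alpha n))^{1/(d+4)}$; the particular values of α, β, n and d are arbitrary test inputs of ours.

```python
import numpy as np

def amse(h, alpha, beta, n, d):
    """Asymptotic mean squared error surrogate: squared bias + variance."""
    return alpha * h ** 4 + beta / (n * h ** d)

alpha, beta, n, d = 2.0, 3.0, 10_000, 2
h_closed = (d * beta / (4.0 * alpha * n)) ** (1.0 / (d + 4))  # stationary point
hs = np.linspace(0.01, 1.0, 100_000)
h_grid = hs[np.argmin(amse(hs, alpha, beta, n, d))]
```

The grid minimiser agrees with the closed form up to the grid resolution, confirming that the stationary point is indeed the minimum.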

Proof of Proposition 2:
Since h_n → 0, x_0 ∈ W and W is open, if n is large enough then b(x_0, h_n) ∩ W = b(x_0, h_n). For such n, by Lemma 1, λ̂(x_0) − E λ̂(x_0) can be written as an average of independent random variables Z_i with EZ_i = 0. Furthermore, by Theorem 1,
$$
\operatorname{Var} \hat\lambda(x_0) = \frac{\lambda(x_0)}{n h_n^d} \int \kappa_\gamma(u)^2\, du + R(h_n)
$$
for a remainder term R(h_n) satisfying $nh_n^{d-1} |R(h_n)| \le M$ for some M > 0 and large n. By Chebyshev's inequality, for all ε > 0, the probability that |λ̂(x_0) − E λ̂(x_0)| exceeds ε is at most Var λ̂(x_0)/ε². The upper bound tends to zero as n → ∞ so that λ̂(x_0) − E λ̂(x_0) → 0 in probability. To finish the proof, add the bias expansion 1. in Theorem 1.

Proofs of propositions and theorems: adaptive case
Proof of Theorem 2: To prove 1. note that since h n goes to zero, x 0 ∈ W , W is open and λ is bounded away from zero, for n large enough for all x ∈ W . For such n, by a change of variables, the symmetry of the Beta kernels and Lemma 3, the bias is equal to for the functions g u : R → R, u ∈ R d , defined by Note that the integral in (11) is compactly supported, say on K ⊂ R d , a property it inherits from the Beta kernel since c is bounded away from zero.
Since we are after the coefficient of h 2 n and, for γ > 2, κ γ is twice continuously differentiable, we use a Taylor expansion (5) with k = 2. Thus, fix u ∈ K. Then where the remainder term is for some 0 < θ = θ(v) < 1 that may depend on v ∈ R. Moreover, D 2 g u (v) can be written as Recall that g u is evaluated at v of the form c(x 0 + h n u). Since the function c is bounded we may restrict ourselves to a compact interval I for v and on this interval D 2 g u (v) is bounded as κ γ and its partial derivatives are bounded too. Moreover, the bound can be chosen uniformly in u over the compact set K. In summary, there exists a constant C > 0 such that |R(u, v)| ≤ Cv 2 for all u ∈ K and v ∈ I. We also need a Taylor expansion (5) with k = 2 for the function c around x 0 ∈ R d : where the remainder term is for some 0 < θ = θ(u) < 1 that may depend on u ∈ K. The second order partial derivatives are, for i, j ∈ {1, . . . , d}, where we use the notation λ i = D i λ. On the compact set K, the |u i | are bounded and so are the |D ij c(u)| since λ is bounded away from zero and twice continuously differentiable. Hence there exists a constantC > 0 such that |R n (u)| ≤Ch 2 n for all u ∈ K. Our next step is to combine the Taylor series (12) and (13). Write E n (u) := c(x 0 +h n u)−1. For large n the bias (11) can then be written as for some η(u) ∈ (0, 1).
We will show that the first and second order terms vanish. By (12)-(13), the first order term is equal to h n λ(x 0 ) multiplied by and vanishes because of Lemma 2 and Lemma 4.
Also by (12)-(13), the second order term reads h 2 n λ(x 0 )I n /2 where for some θ(u) and η(u) in (0, 1). Recall that I n is compactly supported and that the integrand is bounded. Therefore, by the dominated convergence theorem, The first double sum is zero because of Lemma 2 and Lemma 4, the second one because of Lemma 2, Lemma 4 and Lemma 5. By the bounds on the remainder terms R andR, all other terms in (14) are of the order o(h 2 n ) and the proof is complete.
To prove 2. note that, as for the bias, n may be chosen so large that For such n, by a change of variables u = (x − x 0 )/h n and the symmetry of the Beta kernels, for the function h u : R → R, u ∈ R d , defined by Note that the integral in (15) is compactly supported, say on K ⊂ R d , a property it inherits from the Beta kernel since c is bounded away from zero.
Fix u ∈ K. Then by a Taylor expansion (5) with k = 1 Recall that h u is evaluated at v of the form c(x 0 + h n u). Since the function c is bounded we may restrict ourselves to a compact interval I for v and on this interval Dh u (v) is bounded as κ γ and its partial derivatives are bounded too. Moreover, the bound can be chosen uniformly in u over the compact set K. In summary, there exists a constant H such that |Dh u (v)| ≤ H for all u ∈ K and v ∈ I. Hence, with E n (u) = c(x 0 + h n u) − 1 as before, (15) can be written as with 0 < θ(u) < 1. By (13), As u ∈ K and, for such u, |R n (u)| ≤Ch 2 n , We will finally show that the contribution of the interaction structure (through the pair correlation function) to the variance (7) vanishes. Again, choose n so large that For such n, by a change of variables and the symmetry of the Beta kernels, and writingḡ for an upper bound to the pair correlation function, the integral in the last line in (7) can be bounded in absolute value by since the integral is compactly supported and both c and κ γ are bounded.
Proof of Theorem 3: As in the proof of Theorem 2, the bias is given by (11) and the integral involved is supported on a compact set K ⊂ $\mathbb R^d$. Since we are after the coefficient of $h_n^4$, we use, for γ > 5, a Taylor expansion (5) with k = 5 for both c and g_u. For the former, E_n(u) = c(x_0 + h_n u) − 1 is equal to
$$
\sum_{r=1}^{4} \frac{h_n^r}{r!}\, D^r c(x_0)\!\left( u^{(r)} \right) \tag{16}
$$
up to a remainder term $\tilde R_n(u) = \frac{1}{120} h_n^5 D^5 c(x_0 + \theta h_n u)(u, u, u, u, u)$ for some 0 < θ = θ(u) < 1 that may depend on u ∈ K. Since λ is bounded away from zero and five times continuously differentiable, $|\tilde R_n(u)| \le \tilde C h_n^5$ for all u ∈ K. Similarly, for fixed u ∈ K,
$$
g_u(1 + v) = \sum_{r=0}^{4} \frac{v^r}{r!}\, D^r g_u(1) + R(u, v),
$$
where $R(u, v) = v^5 D^5 g_u(1 + \theta v)/120$ for some θ = θ(v) in (0, 1) that may depend on v ∈ R.
Recall that g u is evaluated at v of the form c(x 0 + h n u), u ∈ K. Since the function c is bounded we may restrict ourselves to a compact interval I for v and on this interval D 5 g u (v) is bounded as κ γ and its partial derivatives up to fifth order are bounded too. Moreover, the bound can be chosen uniformly in u over the compact set K. In summary, |R(u, v)| ≤ C|v| 5 for u ∈ K and v ∈ I. Next, plug the Taylor expansions into (11). Then By Theorem 2, the first and second order terms are zero. We will show that the third order term h 3 n λ(x 0 )I n,3 vanishes too. By (16), Lemma 6 implies that the first term of I n,3 is which vanishes by the symmetry properties of κ γ , Lemma 2 and Lemma 4. By Lemma 6, the second term is a linear combination of integrals of the form which vanish because of the symmetry properties of the Beta kernel and integration by parts. Similar arguments apply to the third and last term of I n,3 , which by Lemma 6 is a linear combination of integrals of the form The coefficient of h 4 n in (17) reads λ(x 0 ) A(u; x 0 )du with A as claimed, and does not vanish in general. Finally, by the bounds on the remainder terms R andR n , all other terms in (17) are of the order o(h 4 n ) and the proof is complete.
Proof of Proposition 3: By Theorem 3, the coefficient of $h_n^4$ in the bias expansion is $\lambda(x_0) \int A(u; x_0)\, du$. Lemma 6 and Lemma 7 can be used to derive the following equations: Hence, upon a rearrangement of terms, It remains to calculate and plug in expressions for the derivatives of c in terms of the underlying intensity function λ. These expressions can be plugged into (18), and the claim follows upon a rearrangement of terms.
Proof of Proposition 4: Theorem 3 states that the coefficient of h 4 n is λ(x 0 ) A(u; x 0 )du with an explicit expression for A(u; x 0 ). The non-zero terms in this expression can be reduced by repeated partial integration to a scalar multiple of either V 4 (2, γ) or V 2 (2, γ) as the integrals of other fourth order products in u ∈ R 2 with respect to κ γ vanish by the symmetry properties of the Beta kernel.
The scalar multipliers can be calculated as in Lemma 7: for i ≠ j ∈ {1, 2}, integrals with respect to first order partial derivatives reduce to integrals with respect to second order partial derivatives, and integrals with respect to third order partial derivatives are reduced as, for example,
$$
\int_{\mathbb R^d} u_i^6 u_j\, D_{iij} \kappa_\gamma(u)\, du = -30\, V_4(2, \gamma).
$$
Finally, evaluation of the expression for A(u; x_0) implies the claim by elementary but tedious calculation. For example, the coefficient of $D_{11}c(x_0) D_{22}c(x_0)$ arises from terms with these coefficients in
$$
\frac{1}{8} \int_{\mathbb R^d} \left( D^2 c(x_0)(u, u) \right)^2 g_u''(1)\, du_1\, du_2,
$$
which, by Lemma 6, reduces to a linear combination of fourth order moments of the Beta kernel. The desired coefficients occur when i = j = 1 and k = l = 2 or when i = j = 2 and k = l = 1.

Proof of Corollary 2: By Theorem 3, the claimed expression for the mean squared error follows. Consequently, the asymptotic mean squared error takes the form
$$
\alpha h_n^8 + \frac{\beta}{n h_n^d}
$$
for some scalars α, β > 0. Equating the derivative with respect to h_n to zero yields
$$
h_n^* = \left( \frac{d\beta}{8\alpha n} \right)^{1/(d+8)}.
$$
The second derivative with respect to h_n, $56\alpha h_n^6 + d(d+1)\beta n^{-1} h_n^{-d-2}$, is strictly positive, so $h_n^*$ is the unique minimum. Plugging in the expressions for α and β completes the proof.
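As a numerical cross-check of the final minimisation, the grid search below recovers the stationary point $(d\beta/(8\alpha n))^{1/(d+8)}$ of $\alpha h^8 + \beta/(n h^d)$; the constants are arbitrary test values of ours, and the exponent −1/(d+8) matches the order claimed in the abstract.

```python
import numpy as np

def amse_adaptive(h, alpha, beta, n, d):
    """Adaptive-case surrogate: alpha * h^8 (squared bias) + beta / (n h^d)."""
    return alpha * h ** 8 + beta / (n * h ** d)

alpha, beta, n, d = 2.0, 3.0, 10_000, 2
h_closed = (d * beta / (8.0 * alpha * n)) ** (1.0 / (d + 8))  # stationary point
hs = np.linspace(0.01, 1.0, 100_000)
h_grid = hs[np.argmin(amse_adaptive(hs, alpha, beta, n, d))]
```

For the same α, β, n and d the adaptive-optimal bandwidth exceeds its non-adaptive counterpart $(d\beta/(4\alpha n))^{1/(d+4)}$, reflecting the slower decay $n^{-1/(d+8)}$ versus $n^{-1/(d+4)}$.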

Proof of Proposition 5:
Since h_n → 0, x_0 ∈ W and W is open, if n is large enough the ball centred at x_0 with radius $\lambda(x_0)^{1/2} h_n / \underline\lambda^{1/2}$ is contained in W. For such n, by Lemma 3, λ̂(x_0) − E λ̂(x_0) can be written as an average of independent random variables Z_i with EZ_i = 0. Furthermore, by Theorem 3, the variance expansion holds for a remainder term R(h_n) satisfying $nh_n^{d-1} |R(h_n)| \le M$ for some M > 0 and large n. By Chebyshev's inequality, for all ε > 0, the probability that |λ̂(x_0) − E λ̂(x_0)| exceeds ε is at most Var λ̂(x_0)/ε². The upper bound tends to zero as n → ∞ so that λ̂(x_0) − E λ̂(x_0) → 0 in probability. To finish the proof, add the bias expansion in Theorem 3.