1 Introduction and motivation

Diffusions present a particularly important class of stochastic processes. The long-standing probabilistic interest in this subject is illustrated, for example, by the seminal books of Itō and McKean [9] and Stroock and Varadhan [22]. From the statistical point of view, one classical problem is to estimate the (unknown) characteristics of the diffusion, both from continuous-time and from discrete observations. In the last two decades, nonparametric estimation of diffusion processes has been widely developed, mainly due to applications in mathematical finance where diffusions are commonly used to model the evolution of financial assets or portfolios of assets. While diffusion models have been largely univariate in the past, they now predominantly include multiple state variables; see the introductory remarks of Aït-Sahalia [1] for concrete examples. In some respects, the statistical theory did not keep pace with this evolution: thorough theoretical results on nonparametric estimation of multidimensional diffusion processes are few and far between. At least partially, this is due to the fact that the concept of diffusion local time and related tools such as the occupation times formula are not available in dimension \(d >1\), so that the treatment of the multivariate case requires different approaches and new techniques.

The aim of this paper is to close a gap in the literature by analyzing the asymptotically exact behavior of the pointwise risk for adaptively estimating the drift vector \(b: \mathbb {R}^d \rightarrow \mathbb {R}^d\) of a multivariate diffusion given as a solution of the stochastic differential equation

$$\begin{aligned} \mathrm {d}X_t = b(X_t)\mathrm {d}t + \sigma (X_t)\mathrm {d}W_t, \quad X_0 = \xi , \ t \in [0,T], \end{aligned}$$
(1.1)

where \(\sigma : \mathbb {R}^d \rightarrow \mathbb {R}^{d\times d}\) is the dispersion matrix, \(W\) is a \(d\)-dimensional standard Wiener process and the initial value \(\xi \in \mathbb {R}^d\) is independent of \(W\). It will be assumed throughout that a continuous record of observations \(X^T := (X_t)_{0\le t \le T}\) is available. Thus, the diffusion coefficient \(\sigma \sigma ^\top \) is identifiable by means of the semimartingale quadratic variation, and it entails little loss of generality to simplify the analysis by considering only the case of a known, constant diffusion part. We further restrict attention to ergodic diffusions whose invariant measure is absolutely continuous with respect to the Lebesgue measure. Let \(\rho _{b}\) denote the invariant density. The initial value \(\xi \) is assumed to follow the invariant law such that the process \(X\) is strictly stationary.
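To see the identification of \(\sigma \sigma ^\top \) through the quadratic variation at work, the following sketch simulates a hypothetical two-dimensional instance of (1.1) (Ornstein–Uhlenbeck drift \(b(x)=-x\) and an ad hoc constant dispersion matrix, neither taken from the paper) with the Euler–Maruyama scheme and recovers \(a=\sigma \sigma ^\top \) from the squared increments:

```python
import math
import random

random.seed(1)

# hypothetical example: OU drift b(x) = -x, ad-hoc constant dispersion sigma
d, T = 2, 10.0
n = 100_000
dt = T / n
sigma = [[1.0, 0.0], [0.5, 1.5]]
a_true = [[sum(sigma[i][k] * sigma[j][k] for k in range(d)) for j in range(d)]
          for i in range(d)]          # a = sigma sigma^T = [[1, 0.5], [0.5, 2.5]]

x = [0.0, 0.0]
qv = [[0.0] * d for _ in range(d)]    # running quadratic covariation <X^i, X^j>_t
for _ in range(n):
    dW = [random.gauss(0.0, math.sqrt(dt)) for _ in range(d)]
    noise = [sum(sigma[i][k] * dW[k] for k in range(d)) for i in range(d)]
    dx = [-x[i] * dt + noise[i] for i in range(d)]
    for i in range(d):
        for j in range(d):
            qv[i][j] += dx[i] * dx[j]
    x = [x[i] + dx[i] for i in range(d)]

# under continuous observation <X>_T / T recovers sigma sigma^T exactly;
# the discretized version comes close as the step size shrinks
a_hat = [[qv[i][j] / T for j in range(d)] for i in range(d)]
```

Under truly continuous observation the recovery is exact; here only the time-discretization and the finite simulation horizon introduce a small error.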

It is statistical folklore to consider drift estimation as some analogue of the regression problem. Given some appropriately chosen kernel \(K: \mathbb {R}^d \rightarrow \mathbb {R}\) and bandwidth \(h>0\), it thus appears natural to investigate the following type of kernel estimators of the drift vector \(b\),

$$\begin{aligned} \widehat{b}_T(x) := \frac{T^{-1}\int _0^T K_h(X_u-x)\mathrm {d}X_u}{\widehat{\rho }_T(x) \vee \rho _*(x)}, \quad x \in \mathbb {R}^d, \end{aligned}$$
(1.2)

where \(K_h(\cdot ) := h^{-d}K(\cdot /h)\), \(\widehat{\rho }_T\) is some estimator of the invariant density \(\rho _{b}\) and \(\rho _*(x)>0\) denotes some a priori lower bound on \(\rho _{b}(x)\). In the sequel, the quality of an estimator \(\widehat{b}_T\) will be quantified by its pointwise risk

$$\begin{aligned} {\mathcal {R}}\big (\widehat{b}_T,b\big ) := \mathbf {E}_b\Vert \widehat{b}_T(x_0)-b(x_0)\Vert ^2, \quad x_0\in \mathbb {R}^d \text { fixed,} \end{aligned}$$

where \(\mathbf {E}_b\) denotes expectation with respect to the invariant measure associated with \(b\) and \(\Vert \cdot \Vert \) denotes the Euclidean norm. The goal is to define minimax adaptive estimators \(b_T^*\) satisfying

$$\begin{aligned} {\mathcal {R}}\big (b_T^*,b\big ) = \inf _{\widetilde{b}_T}\sup _{b \in \fancyscript{B}} {\mathcal {R}}\big (\widetilde{b}_T,b\big ). \end{aligned}$$

The supremum here extends over a given class of functions \({\fancyscript{B}}\), typically, a class of functions satisfying certain smoothness assumptions or structural constraints. For estimating the drift vector, we shall consider scales of Sobolev classes \((\varSigma _T(\beta ,L))_{(\beta ,L)\in {\mathcal {B}}_T}\) where, for fixed \(\beta _*>d/2\) and \(0<L_*<L^*<\infty \),

$$\begin{aligned} {\mathcal {B}}_T := \big \{(\beta ,L): \beta _*\!\le \! \beta \!<\! \beta _T, \ L_*\le L \le L^*\big \}, \quad \beta _T \!=\! (\log \log T)^\delta , \ \delta \!\in \! (0,1) \text { fixed.} \end{aligned}$$

We propose estimators which attain not only the optimal rate of convergence but also the best exact asymptotic minimax risk when the actual smoothness of the drift and the associated invariant density \(\rho _b\) is unknown and we only assume membership to \(\varSigma _T(\beta ,L)\) for some \((\beta ,L)\in \mathcal {B}_T\).
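As an illustration, the ratio estimator (1.2) can be run on simulated data; the sketch below uses a scalar Ornstein–Uhlenbeck process (a hypothetical example, with Gaussian kernel, bandwidth and lower bound \(\rho _*\) all chosen ad hoc) and approximates the stochastic integrals by their Riemann–Itô sums:

```python
import math
import random

random.seed(7)

# sketch of (1.2) for dX = -X dt + dW, whose invariant density is N(0, 1/2);
# kernel, bandwidth and the lower truncation are ad-hoc illustrative choices
T = 500.0
n = 500_000
dt = T / n
b = lambda x: -x
x0, h = 0.5, 0.25

K = lambda u: math.exp(-u * u / 2) / math.sqrt(2 * math.pi)
K_h = lambda u: K(u / h) / h

num = 0.0   # approximates \int_0^T K_h(X_u - x0) dX_u
den = 0.0   # approximates \int_0^T K_h(X_u - x0) du  (kernel density estimator)
x = 0.0
for _ in range(n):
    dW = random.gauss(0.0, math.sqrt(dt))
    dx = b(x) * dt + dW
    w = K_h(x - x0)
    num += w * dx
    den += w * dt
    x += dx

rho_hat = den / T                       # estimates rho_b(x0) = e^{-x0^2}/sqrt(pi) ~ 0.44
b_hat = (num / T) / max(rho_hat, 0.01)  # ratio estimator in the spirit of (1.2)
```

For this example the true value is \(b(x_0)=-0.5\); the estimate carries the usual smoothing bias and stochastic error, which the adaptive procedures developed below are designed to balance.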

To the best of our knowledge, sharp asymptotic minimax bounds for nonparametric estimation in the diffusion framework have been established exclusively for one-dimensional processes up to now. One particularly deep result is given in Dalalyan [5], where a fully data-driven procedure for exact global estimation of the drift is developed for a large class of ergodic scalar diffusion processes. In the multidimensional diffusion set-up however, we only know of upper bound results on rates of convergence, even for the prototypical problem of estimating the drift vector from continuous-time observations. Let us emphasize that the question of identifying the exact constant in the risk asymptotics is far from being of merely theoretical interest. The subsequent in-depth analysis rather reveals the quantities that govern the drift estimation problem, and these findings provide answers to practice-oriented issues. For instance, it is to be expected—and has indeed been observed in practice—that the speed of convergence for estimating functionals of a diffusion process solving the SDE (1.1) depends on the geometry of the diffusion coefficient \(\sigma \sigma ^\top =: a=(a_{jk})_{1\le j,k\le d}\). For the exemplary problem of estimating the \(j\)-th component \(b^j\) of the drift vector, \(j\in \{1,\ldots ,d\}\) fixed, the dependence will be shown to be reflected by the appearance of \(a_{jj}\), the \(j\)-th diagonal entry of the diffusion matrix, in the exact normalizing factor of the risk asymptotics. Our exact results further give a theoretical justification for the widespread use of standard kernel methods for drift estimation, which in applications (e.g., in financial econometrics) often occurs on an ad hoc basis. Heuristically, the use of such methods is based on the aforementioned folklore that “drift estimation is just regression,” provided that the long observation limit is considered and the diffusion is sufficiently regular.

On a mathematically formal level, abstract decision theory allows one to transfer risk bounds from one statistical model to another by referring to the concept of asymptotic equivalence of experiments in the sense of Le Cam. For inference on the drift in multidimensional ergodic diffusion models, asymptotic equivalence is established in Dalalyan and Reiß [7]. Their results concern the special case of Kolmogorov diffusions with unit diffusion part, i.e. \(\sigma = \mathbf {Id}\), and hold only for Hölder smoothness of the drift coefficient substantially larger than the lower bound of \(d/2\), the threshold that would correspond to known results on asymptotic nonequivalence of scalar nonparametric experiments at smoothness index \(1/2\). We take a direct approach and establish upper and lower asymptotic risk bounds for diffusions with general constant and nondegenerate diffusion part, without resorting to arguments based on asymptotic equivalence.

For ease of presentation however, let us merely announce the result for the important special case of diffusions with dispersion matrix of the form \(\sigma =\sigma _0~\mathbf {Id}\), for some \(0\ne \sigma _0\in \mathbb {R}\). Define

$$\begin{aligned} D(\beta ,L;\rho _{b},\sigma _0) := \frac{2\beta L^{\frac{d}{2\beta }}}{\rho _{b}(x_0)\sqrt{d}}\left( \frac{d^2\sigma _0^2\rho _{b}(x_0)}{\beta (2\beta -d)}\right) ^{\frac{\beta -d/2}{2\beta }} \mathbb {I}_\beta , \quad \beta >\frac{d}{2}, \ L>0.\qquad \end{aligned}$$
(1.3)

Here, with \(B(\cdot ,\cdot )\) and \(\varGamma (\cdot )\) denoting the Beta and the Gamma function, respectively, and letting \(\mathbb {S}_d:=2\pi ^{d/2}/\varGamma (d/2)\) denote the surface of the unit sphere in \(\mathbb {R}^d\),

$$\begin{aligned} \mathbb {I}_\beta ^2&:= \frac{1}{2\beta }\ B\left( 1+\frac{d}{2\beta },\ 1-\frac{d}{2\beta }\right) (2\pi )^{-d}\mathbb {S}_d \nonumber \\ {}&= \frac{1}{(2\pi )^d}\int _{\mathbb {R}^d}\frac{\Vert \lambda \Vert ^{2\beta }}{\left( 1+\Vert \lambda \Vert ^{2\beta }\right) ^2}\ \mathrm {d}\lambda . \end{aligned}$$
(1.4)
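The two expressions for \(\mathbb {I}_\beta ^2\) in (1.4) can be checked against each other numerically; the following sketch does so for \(d=2\), \(\beta =2\) (concrete values chosen only for illustration), evaluating the Beta function through \(\varGamma \) and the integral over \(\mathbb {R}^d\) in polar coordinates:

```python
import math

# numerical check of the two expressions for I_beta^2 in (1.4), for d = 2, beta = 2
d, beta = 2, 2.0
S_d = 2 * math.pi ** (d / 2) / math.gamma(d / 2)        # surface of the unit sphere
B = lambda a, b: math.gamma(a) * math.gamma(b) / math.gamma(a + b)

closed = (B(1 + d / (2 * beta), 1 - d / (2 * beta)) / (2 * beta)
          * (2 * math.pi) ** (-d) * S_d)

# the integral form in polar coordinates:
# (2 pi)^{-d} S_d \int_0^infty r^{2 beta + d - 1} (1 + r^{2 beta})^{-2} dr
n, R = 200_000, 200.0
dr = R / n
radial = sum(((r := (i + 0.5) * dr) ** (2 * beta + d - 1)) / (1 + r ** (2 * beta)) ** 2
             for i in range(n)) * dr
numeric = (2 * math.pi) ** (-d) * S_d * radial
# for d = 2, beta = 2 both expressions equal 1/16
```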

On the one hand, we show that

$$\begin{aligned}&\liminf _{T\rightarrow \infty } \inf _{\widetilde{b}_T}\sup _{(\beta ,L)\in {\mathcal {B}}_T} \sup _{b\in \varPi (c_1,c_2)} \sup _{\rho _b \in \varSigma _T(\beta ,L)}\left( \frac{T}{\log T}\right) ^{\frac{\beta -d/2}{\beta }}\\&\qquad \qquad \qquad \qquad \qquad \times D^{-2}(\beta ,L;\rho _{b},\sigma _0) \ \mathbf {E}_b\big \Vert \widetilde{b}_T(x_0)-b(x_0)\big \Vert ^2 \ge 1, \end{aligned}$$

for some suitably defined functional sets \(\varPi (c_1,c_2)\) and \(\varSigma _T(\beta ,L)=\varSigma _T(\beta ,L;L',\sigma _0)\), depending also on \(\sigma _0\) and constants \(c_1,c_2,L'\) related to the regularity properties of the class of investigated multivariate diffusion processes (for details, see Sects. 2 and 5). Furthermore, we suggest an asymptotically sharp adaptive estimator over \({\mathcal {B}}_T\), i.e. an adaptive estimator which achieves not only the best possible rate of convergence but also the best asymptotic constant associated to it. Our exact asymptotic results on drift estimation hold under mild regularity constraints and indicate that asymptotic equivalence—at least in some reduced sense—also holds under smoothness assumptions less severe than those imposed in Dalalyan and Reiß [7]; cf. the discussion in Sect. 6.

The current investigation is directly related to previous work both from the field of nonparametric statistics and from more applied areas such as financial econometrics. A substantial number of results on nonparametric drift estimation in the scalar diffusion case is already available. Dalalyan and Kutoyants [6] consider the problem of nonparametric estimation of the derivative of the invariant density and of the drift coefficient for scalar ergodic diffusion processes over weighted \(L^2\) Sobolev classes. The construction of the suggested asymptotically efficient estimator requires knowledge of the smoothness and the radius of these weighted Sobolev balls. On the basis of these results, Dalalyan [5] develops an adaptive procedure which does not depend on the characteristics of the Sobolev ball and which is asymptotically minimax simultaneously over a broad scale of Sobolev classes. In direct relation to the present work, Spokoiny [20] considers the problem of pointwise adaptive drift estimation and develops a locally linear smoother with data-driven bandwidth choice. His method is also derived in a scalar setting but generalizes to the multidimensional framework. The focus of Spokoiny [20] clearly differs from ours: he provides nonasymptotic results (which do not require stationarity, ergodicity or mixing properties of the observed diffusion process) for the suggested kernel type estimators, while our interest is in identifying the asymptotically exact behavior of adaptive drift estimators. The definition of such asymptotically sharp adaptive estimators requires not only a data-dependent choice of the smoothing parameter but also a data-driven selection of the kernel.

In the sequel, we will use rather recent results on functional inequalities (and the interplay of different types thereof) for diffusion processes. To be more precise, inspection of the constructive proof of the asymptotic upper risk bound suggests that the combination of a Bernstein-type deviation inequality and sufficiently tight variance bounds is the key to sharp adaptive drift estimation procedures for diffusion processes. Several generalizations of the classical Bernstein inequality applicable in the diffusion framework exist. In this paper, we will assume that the diffusion satisfies the spectral gap inequality—a condition which, at least in the area of statistics for random processes, is rather unconventional. However, it can be argued that this hypothesis presents some sort of minimal assumption for a Bernstein-type inequality for symmetric diffusion processes to hold and thus provides a natural framework for our investigation. The combination of different types of tail estimates of additive functionals and sharp variance bounds due to Dalalyan and Reiß [7] then allows us to prove the required type of exponential inequalities, and classical chaining arguments and conditions on the size of function classes in terms of bracketing numbers provide an extension to uniform versions thereof. Still, our results are by no means restricted to this specific kind of dependence mechanism, as will be sketched later. Currently, (probabilistic) research on diffusion processes is aimed at investigating the interplay between different approaches for the study of quantitative ergodic properties and the relationship between different functional inequalities. It would be interesting to complement these results with findings on the asymptotic statistical behavior of estimators in the respective ergodic diffusion models, and the present analysis provides a first step in this direction.

Outline of the paper One crucial point in our subsequent investigation is the fact that we may restrict attention to analyzing the exact asymptotics of the estimators which appear in the numerator of (1.2). Only mild regularity properties of the diffusion are required for translating results on estimating

$$\begin{aligned} \lim _{h\rightarrow 0}\mathbf {E}_b \left[ \frac{1}{T}\int _0^T K_h(X_u-x)\mathrm {d}X_u\right]&= \lim _{h\rightarrow 0}\int _{\mathbb {R}^d} K_h(y-x)b(y)\rho _{b}(y)\mathrm {d}y\\&= b(x) \rho _{b}(x),\quad x\in \mathbb {R}^d, \end{aligned}$$

into upper and lower bounds for drift estimation. We thus start our investigation by considering estimation of \(b\rho _{b}\), assuming that the components \(b^j\rho _{b}\), \(j\in \{1,\ldots ,d\}\), belong to some Sobolev class of regularity \(\beta \in {\mathcal {I}}\), \({\mathcal {I}}\) some given interval of the form \(\big [\beta _*,\beta _T\big ]\) with \(\beta _T \rightarrow _{T\rightarrow \infty }\infty \) slowly enough. Section 3 contains a lower bound for pointwise estimation of \(b\rho _{b}\), and an adaptive procedure for estimating the components of \(b\rho _{b}\) which asymptotically attains the respective infimum is introduced in Sect. 4. Provided that the drift grows at most linearly and the invariant density decays exponentially, upper and lower bounds for estimating \(b\rho _{b}\) can be translated into corresponding results for drift estimation. In favor of a concise and transparent presentation, the bounds are stated explicitly only for Kolmogorov diffusions. The respective results are given in Sect. 5. Section 6 contains a discussion of our findings and a sketch of possible extensions. Details on the exponential inequality used in the proof of the upper bound part of our exact result are given in Appendix A. The bulk of the proofs of the main results is deferred to Appendix B.

General definitions and notation For \(g:\mathbb {R}^d \rightarrow \mathbb {R}^d\), denote by \(g^j\) its \(j\)-th component. For a smooth function \(f: \mathbb {R}^d \rightarrow \mathbb {R}\), let \(\partial _j f:= \partial f/\partial x^j\), and denote its gradient by \(\nabla f = (\partial _j f)_j\). Rows of a \(d\times d\)-matrix \(a\) are denoted by \(a_j\), and the Frobenius norm of the matrix \(\sigma \) is denoted by \(\Vert \sigma \Vert _{S_2} := (\sum _{j=1}^d (\sigma \sigma ^\top )_{jj})^{1/2}\). Let \(\phi _f\) be the Fourier transform of \(f \in L^2(\mathbb {R}^d)\), that is, for any \(\lambda \in \mathbb {R}^d\), \(\phi _f(\lambda ) := \int _{\mathbb {R}^d}f(x)\exp (\mathrm{i } \lambda ^\top x)\mathrm {d}x\). Let \(\beta >d/2\), and define the Sobolev seminorm \(\eta _\beta (\cdot )\) by

$$\begin{aligned} \eta _{\beta }(f) := \left( \frac{1}{(2\pi )^d} \int _{\mathbb {R}^d} \Vert \lambda \Vert ^{2\beta }\big |\phi _f(\lambda )\big |^2\mathrm {d}\lambda \right) ^{1/2}, \quad f \in L^2(\mathbb {R}^d). \end{aligned}$$

The isotropic Sobolev class \({\mathcal {S}}(\beta ,L)\) is given as \({\mathcal {S}}(\beta ,L) \!:=\! \left\{ f \!\in \! L^2(\mathbb {R}^d): \eta _{\beta }(f)\!\le \! L\right\} \). Throughout, \(\lesssim \) means less or equal up to some constant which does not depend on the variable parameters in the expression.
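A small numerical sketch may help to fix the Fourier convention and the seminorm; it takes \(d=1\) and the test function \(f(x)=\mathrm {e}^{-x^2/2}\) (an illustrative choice), for which \(\phi _f(\lambda )=\sqrt{2\pi }\,\mathrm {e}^{-\lambda ^2/2}\) and \(\eta _\beta (f)^2=\varGamma (\beta +1/2)\) under the convention above:

```python
import cmath
import math

# d = 1, f(x) = exp(-x^2/2): with phi_f(l) = \int f(x) e^{i l x} dx one gets
# phi_f(l) = sqrt(2 pi) e^{-l^2/2} and eta_beta(f)^2 = Gamma(beta + 1/2)
f = lambda x: math.exp(-x * x / 2)

def phi(lam, X=10.0, n=2000):
    # midpoint-rule Fourier transform on [-X, X]
    dx = 2 * X / n
    return sum(f(x := (j + 0.5) * dx - X) * cmath.exp(1j * lam * x)
               for j in range(n)) * dx

def eta_sq(beta, L=8.0, m=400):
    # eta_beta(f)^2 = (2 pi)^{-1} \int |lambda|^{2 beta} |phi_f(lambda)|^2 d lambda
    dl = 2 * L / m
    return sum(abs(lam := (k + 0.5) * dl - L) ** (2 * beta) * abs(phi(lam)) ** 2
               for k in range(m)) * dl / (2 * math.pi)

# beta = 1: eta_1(f)^2 = Gamma(3/2) = sqrt(pi)/2, which also equals ||f'||_{L^2}^2
```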

2 Preliminaries

The complexity of the diffusion model requires some care in defining the framework for pointwise estimation of the components of the drift vector, with special consideration of the interplay of regularity properties of the individual components of the model. We start by defining \(\varPi _0=\varPi _0(\sigma )\), \(\sigma \) some constant nondegenerate \(\mathbb {R}^{d\times d}\)-valued dispersion matrix with associated diffusion coefficient \(\sigma \sigma ^\top =a\), as the set of all drift coefficients \(b:\mathbb {R}^d\rightarrow \mathbb {R}^d\) such that

(\(\hbox {P}_0\)):

the SDE

$$\begin{aligned} \mathrm {d}X_t = b(X_t)\mathrm {d}t +\sigma ~ \mathrm {d}W_t \end{aligned}$$
(2.1)

admits a strong solution which is ergodic with Lebesgue continuous invariant measure \(\mathrm {d}\mu _b(x) = \rho _{b}(x)\mathrm {d}x\), and

(\(\hbox {P}_0'\)):

for \(j\in \{1,\ldots ,d\}\), the invariant density \(\rho _{b}\) satisfies the relation

$$\begin{aligned} 2b^j\rho _{b} = a_j\nabla \rho _{b} = \sum _{k=1}^d a_{jk} \partial _k\rho _{b}. \end{aligned}$$

We further suppose that the initial value \(X_0\) has the density \(\rho _b\) such that \((X_t)_{t\ge 0}\) is strictly stationary.
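Relation \((\hbox {P}_0')\) can be checked by finite differences for a concrete example; the sketch below takes \(a=\mathbf {Id}\) and the Ornstein–Uhlenbeck drift \(b(x)=-x\) (a hypothetical instance, not singled out in the paper), whose invariant density is that of \(N(0,\mathbf {Id}/2)\):

```python
import math

# finite-difference check of 2 b^j rho_b = a_j grad rho_b for a = Id and the
# hypothetical OU drift b(x) = -x, with invariant density N(0, Id/2)
d = 2
b = lambda x: [-xi for xi in x]
rho = lambda x: math.pi ** (-d / 2) * math.exp(-sum(xi * xi for xi in x))

def grad_rho(x, eps=1e-6):
    g = []
    for k in range(d):
        xp, xm = list(x), list(x)
        xp[k] += eps
        xm[k] -= eps
        g.append((rho(xp) - rho(xm)) / (2 * eps))
    return g

x = [0.7, -0.3]
lhs = [2 * bj * rho(x) for bj in b(x)]    # 2 b^j rho_b
rhs = grad_rho(x)                         # a_j grad rho_b = partial_j rho_b for a = Id
```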

As aforementioned, the drift estimation problem will in the sequel be decomposed into the individual questions of estimating the invariant density \(\rho _b\) and the products \(b^j\rho _b\), \(j=1,\ldots ,d\). Restricting to diffusion processes satisfying \((\hbox {P}_0')\), the second question can also be stated as estimating the weighted sums of derivatives \(\sum _{k=1}^d a_{jk}\partial _k \rho _b\), \(j=1,\ldots ,d\). As demonstrated by Dalalyan and Kutoyants [6] and Dalalyan [5] in the scalar set-up, this approach has the potential to yield deep results. We already noted that the non-existence of diffusion local time presents a particular challenge for the statistical analysis of estimators in the multivariate diffusion framework, as a set of valuable technical tools is no longer available. One further difficulty consists in identifying regularity conditions on the diffusion which allow for an extension of the investigation to a multivariate framework that is as broad as possible. It is convenient to include the condition \((\hbox {P}_0')\), but our results can be extended to more general classes of diffusion processes.

In the sequel, we consider estimation of the drift function at a point \(x_0\in \mathbb {R}^d\) under Sobolev smoothness constraints on the associated invariant density. Precisely, set

$$\begin{aligned} \varSigma _T(\beta ,L;L',\sigma )&:= \left\{ \rho _b\ | \, b\in \varPi _0(\sigma ),\, \rho _b\in {\mathcal {S}}(\beta +1,L'), \, \right. \\&\qquad \quad \left. b^j\rho _b \in {\mathcal {S}}(\beta ,L), 1\le j\le d, \, \rho _b(x_0)\ge \rho _T^*\right\} , \end{aligned}$$

where \(\rho _{T}^*\) is a sequence of positive real numbers such that \(\lim _{T\rightarrow \infty }\rho _{T}^*=0\) and \(\liminf _{T\rightarrow \infty }\left( \rho _T^*\log T\right) >0\). To shorten notation, we frequently write \(\varSigma _T(\beta ,L)\) for \(\varSigma _T(\beta ,L;L',\sigma )\). For constants \(c_1\in (0,\infty ]\) and \(c_2>0\), we further define \(\varPi (c_1,c_2)=\varPi (c_1,c_2,\sigma )\) as the set of all drift functions \(b\in \varPi _0(\sigma )\) satisfying the following conditions:

\(\mathrm{(P_1) }\) :

We have \(\limsup _{\Vert x\Vert \rightarrow \infty }\Vert x\Vert ^{-2}\left\langle b(x),x\right\rangle = - c_1\).

\(\mathrm{(P_2) }\) :

For all \(x \in \mathbb {R}^d\), we have \(\left\| b(x)\right\| \le c_2(1+\Vert x\Vert )\).

A few comments on the definition of the functional sets \(\varPi _0(\sigma )\) and \(\varPi (c_1,c_2,\sigma )\) are in order:

Remark 1

  • A lower bound on the value \(\rho _b(x_0)\) is required for two reasons: First (and analogously to the case of nonparametric density estimation from i.i.d. observations considered in Butucea [3]), in order to obtain a reasonably good adaptive estimator of the value \((b^j\rho _b)(x_0)\), we have to exclude the case of a density \(\rho _b\) that varies with \(T\) such that \(\rho _b(x_0)\rightarrow 0\) too fast. Secondly, for defining a ratio-type drift estimator in the spirit of (1.2), a strictly positive a priori lower bound \(\rho _*(x_0)<\rho _b(x_0)\) is needed. The regularity conditions on the drift used in the proof of the asymptotic properties of our adaptive estimators actually allow for the derivation of explicit lower bounds; see Remark 2 below.

  • The assumption of ergodicity is central for our subsequent analysis. Existence and uniqueness of invariant measures are conveniently proven by means of versions of Khasminskii’s criterion, involving Lyapunov-type functions for the generator of the diffusion. Assumption \(\mathrm{(P_1) }\) is a radial assumption on the drift coefficient and states that (if \(c_1<\infty \)) the inward radial component of \(b\) has a prescribed polynomial behavior. In particular, \(\mathrm{(P_1) }\) implies that \(\exp (\delta \Vert x\Vert ^2)\) for \(\Vert x\Vert \ge 1\) is a Lyapunov function for small enough \(\delta \), thus ensuring the existence of an invariant measure. Together with the “at most linear growth”-condition in \(\mathrm{(P_2) }\), it further implies an exponential bound on the associated invariant density (see Lemma 1 below).
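The Lyapunov argument of the second bullet can be made concrete. For \(b(x)=-x\) (so that \(c_1=1\) in \(\mathrm{(P_1) }\)) and \(a=\mathbf {Id}\), one computes \(LV = \delta V\,\big (d-2(1-\delta )\Vert x\Vert ^2\big )\) for \(V(x)=\exp (\delta \Vert x\Vert ^2)\), which is negative outside a ball whenever \(\delta <1\); the following finite-difference sketch (drift and \(\delta \) are assumed for illustration) verifies this:

```python
import math

d, delta = 2, 0.25
b = lambda x: [-xi for xi in x]                     # inward radial drift, c1 = 1
V = lambda x: math.exp(delta * sum(xi * xi for xi in x))

def LV(x, eps=1e-5):
    # generator: L V = <b, grad V> + (1/2) trace(Hess V)   (a = Id)
    out = 0.0
    for k in range(d):
        xp, xm = list(x), list(x)
        xp[k] += eps
        xm[k] -= eps
        out += b(x)[k] * (V(xp) - V(xm)) / (2 * eps)
        out += 0.5 * (V(xp) - 2 * V(x) + V(xm)) / eps ** 2
    return out

x = [1.5, 0.0]
analytic = delta * V(x) * (d - 2 * (1 - delta) * sum(xi * xi for xi in x))
# LV(x) < 0 for ||x||^2 > d / (2 (1 - delta)): V is a Lyapunov function there
```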

3 Lower bound for pointwise estimation

In the Gaussian white noise framework, it has been shown by Lepski [12] that estimators which are optimally rate adaptive with respect to the pointwise risk over the scale of Hölder classes do not exist. The best adaptive estimators are proven to achieve only a rate which is slower than the optimal one by a logarithmic factor. Tsybakov [23] derives an analogous result for adaptation over the scale of Sobolev classes. To some extent, our findings are analogous, and the principal ideas of the proof rely on techniques developed in the classical framework. The exact analysis of the drift estimation problem, however, also involves some subtleties which go beyond the known intricacies associated with the question of pointwise adaptation.

Let us first state the exact lower bound for estimating the components of \(b\rho _{b}\) adaptively, assuming that the components \(b^j\rho _b \in {\mathcal {S}}(\beta ,L)\), \(j=1,\ldots ,d\), for some \(\beta \in [\beta _*,\infty )\) and \(L \in [L_*,L^*]\). Here, \(\beta _*\in (d/2,\infty )\) and \(0<L_*<L^*<\infty \) are fixed values. For any \(\beta >d/2\), let

$$\begin{aligned} \kappa = \kappa (\beta ) := \frac{\beta -\frac{d}{2}}{2\beta }, \quad \psi _{T,\beta } := \left( \frac{\log {T}}{T}\right) ^{\kappa (\beta )}, \end{aligned}$$
(3.1)

and recall the definition of \(\mathbb {I}_\beta \) according to (1.4).

Theorem 1

Fix \(\beta _*>d/2\) and \(\delta \in (0,1)\), and denote \({\mathcal {B}}_T := \left[ \beta _*,\beta _T\right] \times \left[ L_*,L^*\right] \), for \(\beta _T := (\log \log T)^\delta \). Then, for any \(x_0\in \mathbb {R}^d\) and \(j\in \{1,\ldots ,d\}\) fixed,

$$\begin{aligned} \liminf _{T\rightarrow \infty }\inf _{\widehat{g}_T}\sup _{(\beta ,L) \in {\mathcal {B}}_T}\sup _{b\in \varPi (c_1,c_2)} \sup _{\rho _b \in \varSigma _T(\beta ,L;L',\sigma )} \frac{\mathbf {E}_b\big |\widehat{g}_T(x_0)- (b^j\rho _{b})(x_0)\big |^2}{\psi _{T,\beta }^2 C_j^2(\beta ,L;\rho _{b},\sigma )} \ge 1,\nonumber \\ \end{aligned}$$
(3.2)

where the infimum is taken over all estimators \(\widehat{g}_T\) of \(b^j\rho _{b}\) and

$$\begin{aligned} C_j(\beta ,L;\rho _{b},\sigma ) := L^{\frac{d}{2\beta }} \frac{2\beta }{d} \left( \frac{d^2 a_{jj} \rho _{b}(x_0)}{\beta (2\beta -d)}\right) ^{\frac{\beta -d/2}{2\beta }}\mathbb {I}_\beta . \end{aligned}$$
(3.3)

The proof of Theorem 1 is deferred to Appendix B.1.
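For orientation, the normalizing quantities of Theorem 1 are straightforward to evaluate; the sketch below computes \(\mathbb {I}_\beta \) via the closed form (1.4), the rate \(\psi _{T,\beta }\) from (3.1) and the constant \(C_j\) from (3.3), for illustrative parameter values that are not taken from the paper:

```python
import math

def I_beta(beta, d):
    # closed form of I_beta from (1.4)
    S_d = 2 * math.pi ** (d / 2) / math.gamma(d / 2)
    B = lambda a, b: math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return math.sqrt(B(1 + d / (2 * beta), 1 - d / (2 * beta)) / (2 * beta)
                     * (2 * math.pi) ** (-d) * S_d)

def psi(T, beta, d):
    # rate psi_{T, beta} = (log T / T)^{kappa(beta)} from (3.1)
    return (math.log(T) / T) ** ((beta - d / 2) / (2 * beta))

def C_j(beta, L, d, a_jj, rho_x0):
    # exact constant (3.3)
    return (L ** (d / (2 * beta)) * (2 * beta / d)
            * (d ** 2 * a_jj * rho_x0 / (beta * (2 * beta - d)))
            ** ((beta - d / 2) / (2 * beta))
            * I_beta(beta, d))

# illustrative values: d = 2, beta = 2, L = 1, a_jj = 1, rho_b(x0) = 0.5
scale = psi(1e6, 2.0, 2) * C_j(2.0, 1.0, 2, 1.0, 0.5)
```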

The basic—and classical—idea of the proof of Theorem 1 is to reduce the proof of the lower bound in (3.2) to proving a lower bound on the risk of two suitably chosen hypotheses. A lower bound on the latter risk is then deduced by means of Theorem 6(i) in Tsybakov [23], as was also done in Butucea [3] and Klemelä and Tsybakov [10, 11]. The verification of the conditions of this result in the current diffusion framework however requires tools which differ from those used in the references mentioned above. Denoting by \({\mathbb {P}}_0\) and \({\mathbb {P}}_1\) the probability measures associated to the two different hypotheses, it needs to be shown that, for some fixed \(\tau \) and for any \(\alpha \in (0,1/2)\),

$$\begin{aligned} {\mathbb {P}}_1\left( \frac{\mathrm {d}{\mathbb {P}}_0}{\mathrm {d}{\mathbb {P}}_1}\ge \tau \right) \ge 1-\alpha . \end{aligned}$$
(3.4)

In the Gaussian white noise framework considered in Klemelä and Tsybakov [10, 11], (3.4) is verified directly for suitably chosen hypotheses due to the Gaussian nature of the model. For nonparametric density estimation from i.i.d. observations, Butucea [3] uses Lyapunov’s CLT. In our framework, the condition (3.4) is verified by means of the martingale CLT.

4 Construction of sharp adaptive estimators

To define pointwise adaptive estimators of the components of \(b\rho _{b}\) which attain the lower bound established in the previous section, we proceed similarly to Klemelä and Tsybakov [11]. Precisely, we will use a two-stage procedure in the spirit of Lepski’s method, first constructing a collection of admissible estimators and then selecting an estimator with minimal variance among them. In contrast to the Gaussian white noise setting considered in Klemelä and Tsybakov [11], the complexity of the multidimensional diffusion model however requires a more involved investigation and more sophisticated tools. This remark applies to the proofs of both the asymptotic lower and the upper bounds on the pointwise risk. In particular, for proving the exact upper bound, sufficiently precise exponential bounds on the stochastic error are needed. In the Gaussian white noise framework, the derivation of such exponential bounds is straightforward due to the Gaussian nature of the model. An additional complication arises in the classical problem of estimating a density at some fixed point \(x_0\in \mathbb {R}\) from i.i.d. observations (cf. Butucea [3]) where one has to derive exponential bounds on the risk which hold uniformly over a set of estimators associated to different bandwidths. To do so, Butucea [3] uses the classical Bernstein inequality and a uniform exponential inequality due to van de Geer [24]. Similarly to the pointwise density estimation problem, the bandwidths used for defining the estimators in our selection procedure involve an estimator \(\widehat{\rho }_T(x_0)\) of the (unknown) value of the invariant density \(\rho _{b}\) at \(x_0\), such that uniform risk bounds on the stochastic error are required.
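The two-stage logic can be sketched in a toy regression setting (a deliberately simplified stand-in for the diffusion model, with ad hoc thresholds and constants): build local estimators over a coarse-to-fine bandwidth grid and select the coarsest one whose value agrees with all finer ones up to a multiple of their stochastic errors:

```python
import math
import random

random.seed(3)

# toy model y_i = f(x_i) + noise on a design grid; estimate f(0) pointwise
n, sd_noise = 4000, 0.5
xs = [2 * i / n - 1 for i in range(n)]
f = lambda x: 1.0 + 6.0 * x * x                 # true value f(0) = 1
ys = [f(x) + random.gauss(0.0, sd_noise) for x in xs]

def local_mean(h):
    # local-average estimator of f(0) with bandwidth h, plus its standard error
    pts = [y for x, y in zip(xs, ys) if abs(x) <= h]
    return sum(pts) / len(pts), sd_noise / math.sqrt(len(pts))

bandwidths = [0.4, 0.2, 0.1, 0.05]              # coarse to fine
est = {h: local_mean(h) for h in bandwidths}

# Lepski-type rule: largest h consistent with all finer ones (C is ad hoc)
C = 2.5
selected = bandwidths[-1]
for i, h in enumerate(bandwidths):
    if all(abs(est[h][0] - est[hp][0]) <= C * (est[h][1] + est[hp][1])
           for hp in bandwidths[i + 1:]):
        selected = h
        break

estimate = est[selected][0]
```

The coarsest bandwidth is rejected because its bias is visible against the finer estimates, while among the admissible ones the procedure keeps the least variable; this is the mechanism used below, there additionally combined with a data-driven kernel choice.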

We proceed by introducing central assumptions on the diffusion process \(X\) required for proving adaptivity of the proposed estimation scheme. Let \(P_t\) be the transition semigroup of \(X\), and denote its transition density by \(p_t\), i.e.

$$\begin{aligned} P_tf(x) = \mathbf {E}_b[f(X_t)\mid X_0=x] = \int _{\mathbb {R}^d}f(y)p_t(x,y)\mathrm {d}y,\quad f \in L^2(\mu _b). \end{aligned}$$

The following Bernstein-type deviation inequality in particular allows us to prove uniform deviation inequalities which are crucial tools for verifying sufficiently sharp upper bounds on the pointwise squared risk of the adaptive estimators. Given any \(b\in \varPi _0(\sigma )\), denote by \(\varsigma _b^2(\cdot )\) the asymptotic variance appearing in the CLT, i.e.

$$\begin{aligned} \varsigma _b^2(g) := \lim _{T \rightarrow \infty }\frac{1}{T}~ \mathrm{Var }_{\mathbf {P}_b}\left( \int _0^T g(X_u)\mathrm {d}u\right) ,\quad g \in L^2(\mu _b). \end{aligned}$$
(4.1)

Assumption (BI)

Let \(b\in \varPi _0(\sigma )\). Then there exists some positive constant \(C_B\) such that, for any bounded measurable function \(f\in L^2(\mu _b)\) and for any \(r,T>0\) and \(j \in \{1,\ldots ,d\}\) fixed,

$$\begin{aligned} \mathbf {P}_b\Bigg (\bigg |\frac{1}{T}\int _0^T f(X_u)b^j(X_u)\mathrm {d}u ~-&\displaystyle \int _{\mathbb {R}^d} f(y)b^j(y)\mathrm {d}\mu _b(y)\bigg | > r \Bigg ) \\&\quad \qquad \le 2 \exp \left( -\frac{Tr^2}{2C_B\big (\varsigma _b(f)+r \Vert f\Vert _\infty \big )}\right) .\nonumber \end{aligned}$$
(BI)

The investigation of the variance term in (4.1) differs from the case of independent data since additional covariances appear in the dependent case. The following assumption provides sufficiently tight upper bounds on the (asymptotic) variance.

Assumption (SG+)

The carré du champ associated with the infinitesimal generator of the diffusion satisfies the spectral gap inequality, that is, for some constant \(c_P\) and any \(f \in L^2(\mu _b)\),

$$\begin{aligned} \Big \Vert P_t f-\int _{\mathbb {R}^d}f\mathrm {d}\mu _b\Big \Vert _{L^2(\mu _b)} \le \mathrm {e}^{-t/c_P}\Vert f\Vert _{L^2(\mu _b)}. \end{aligned}$$
(SG)

Furthermore, there exists some \(C_0>0\) such that, for any \(u\ge t>0\) and for any pair of points \(x,y\in \mathbb {R}^d\) with \(\Vert x-y\Vert ^2\le u\), the transition density \(p_t(\cdot ,\cdot )\) satisfies

$$\begin{aligned} p_t(x,y)\ \le \ C_0\big (t^{-d/2}+u^{3d/2}\big ). \end{aligned}$$
(4.2)

For any symmetric diffusion \(X\), it can be shown analogously to the proof of Proposition 1 in Dalalyan and Reiß [7] (also see the proof of Lemma 2.3 in Cattiaux et al. [4]) that, for any \(f \in L^2(\mu _b)\) and \(T>0\),

$$\begin{aligned} \mathbf {E}_b\Bigg [\bigg (\int _0^T f(X_u)\mathrm {d}u\bigg )^2\Bigg ]&= 2 \int _0^T \int _0^v \mathbf {E}_b\left[ f (X_u)f (X_v)\right] \mathrm {d}u \ \mathrm {d}v\\&\le 2 T \int _0^T \left\langle f , P_w f \right\rangle _{L^2(\mu _b)}\mathrm {d}w. \end{aligned}$$

The last term is bounded from above by applying the Cauchy–Schwarz and the spectral gap inequalities, so that, for some positive constant \(C\) (depending only on \(c_P\)),

$$\begin{aligned} \mathbf {E}_b\Bigg [\bigg (\int _0^T f(X_u)\mathrm {d}u\bigg )^2\Bigg ]\le CT\Vert f\Vert _{L^2(\mu _b)}^2. \end{aligned}$$
(4.3)
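Inequality (4.3), and the sharper bound \(2T\int _0^T\left\langle f,P_wf\right\rangle _{L^2(\mu _b)}\mathrm {d}w\) it derives from, can be probed by Monte Carlo. The sketch below uses the stationary scalar Ornstein–Uhlenbeck process (a hypothetical instance with \(c_P=1\)) and \(f(x)=x\), for which the latter bound evaluates to \(T(1-\mathrm {e}^{-T})\) while the exact second moment is \(T-1+\mathrm {e}^{-T}\):

```python
import math
import random

random.seed(11)

# stationary OU process dX = -X dt + dW: invariant law N(0, 1/2), c_P = 1;
# f(x) = x is centered under mu_b with ||f||_{L^2(mu_b)}^2 = 1/2
T, dt, reps = 5.0, 0.01, 2000
n = int(T / dt)

vals = []
for _ in range(reps):
    x = random.gauss(0.0, math.sqrt(0.5))         # start in the invariant law
    integral = 0.0
    for _ in range(n):
        integral += x * dt
        x += -x * dt + random.gauss(0.0, math.sqrt(dt))
    vals.append(integral)

second_moment = sum(v * v for v in vals) / reps   # E[(\int_0^T f(X_u) du)^2]
exact = T - 1 + math.exp(-T)                      # from Cov(X_u, X_v) = e^{-|u-v|}/2
bound = T * (1 - math.exp(-T))                    # 2 T \int_0^T <f, P_w f> dw
```

For this example the bound is nearly attained, which is consistent with its role as a tight variance estimate in the sequel.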

It however turns out that, given the goal of describing the precise asymptotics for nonparametric drift estimation, we actually require an exponential inequality with a tight leading term in the exponent. Taking also into account the upper bound on the transition density in (4.2), Proposition 1 in Dalalyan and Reiß [7] provides a strengthened upper bound on the variance of additive functionals of multidimensional diffusions which allows us to prove such a refined exponential inequality. In particular, for any compactly supported kernel \(G:\mathbb {R}^d\rightarrow \mathbb {R}\), Assumption (SG+) ensures that there exists some positive constant \(C'\) (depending only on \(d\), \(C_0\) and \(c_P\)) such that, for any bandwidth \(h>0\), \(y\in \mathbb {R}^d\), \(T>0\),

$$\begin{aligned} \mathrm{Var }_b\left( \frac{1}{\sqrt{T}}\int _0^T G_h(X_u-y)\mathrm {d}u\right) \le C'\times \left\{ \begin{array}{ll} 1, &{}\quad d=1,\\ \max \left\{ 1,(\log (h^{-4}))^2\right\} , &{}\quad d=2,\\ h^{2-d}, &{}\quad d\ge 3. \end{array} \right. \end{aligned}$$
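For quick reference, the dimension-dependent envelope on the right-hand side of this bound can be encoded as a small helper function; this is a minimal sketch in which the constant \(C'\) is deliberately omitted:

```python
import math

def variance_envelope(d: int, h: float) -> float:
    """Dimension-dependent factor in the variance bound for
    (1/sqrt(T)) * int_0^T G_h(X_u - y) du, up to the constant C'."""
    if d == 1:
        return 1.0
    if d == 2:
        return max(1.0, math.log(h ** (-4)) ** 2)
    return h ** (2 - d)  # d >= 3
```

In particular, the envelope is bounded in \(d=1\), grows only logarithmically in \(h^{-1}\) for \(d=2\), and blows up polynomially as \(h\rightarrow 0\) for \(d\ge 3\).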

It seems rather unconventional to investigate estimators in diffusion models under the explicit assumption that functional inequalities in the spirit of the spectral gap hypothesis are satisfied. We believe that this approach is useful as it allows us to formulate precise results for a sufficiently large class of diffusion processes under clear assumptions; see in particular Theorem 3 below.

The adaptive scheme is based on Lepski’s principle. For implementing the procedure, consider a sufficiently fine grid \(\mathcal {G}=\mathcal {G}_T\) on the interval \(\left[ \beta _*,\beta _T\right] \), with \(\beta _T\rightarrow \infty \). It is defined as \(\mathcal {G}= \mathcal {G}_T:= \big \{\beta _1,\ldots ,\beta _m\big \}\), where \(\beta _*< \beta _1 < \cdots < \beta _m = \beta _T\). Assume that there exist \(k_2>k_1>0\) and \(\delta _1 \ge \delta >1\) such that

$$\begin{aligned} k_1(\log T)^{-\delta _1}\le \beta _{i+1}-\beta _i \le k_2(\log T)^{-\delta }, \quad i=0,1,\ldots , m-1, \end{aligned}$$
(4.4)

and set \(\beta _0 := \beta _*-d/2\). As in the case of density estimation from i.i.d. observations (cf. Butucea [3]), the optimal bandwidth for estimating \(b^j\rho _{b}\) is not available in practice as it involves the unknown value of the invariant density \(\rho _{b}\) at \(x_0 \in \mathbb {R}^d\). The adaptive procedure for estimating \(b^j\rho _{b}\) therefore starts with a preliminary estimator \(\widehat{\rho }_T(x_0)\) of the value \(\rho _{b}(x_0)\).
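A grid fulfilling the spacing condition (4.4) can be generated, for instance, with uniform spacing of order \((\log T)^{-\delta }\) and upper endpoint \(\beta _T=(\log \log T)^{\delta '}\), \(\delta '\in (0,1)\), as in Theorem 2. The parameter names in the following sketch are ours and the default exponents are illustrative:

```python
import math

def smoothness_grid(T: float, beta_star: float,
                    delta_grid: float = 1.5, delta_top: float = 0.5) -> list:
    """Uniformly spaced grid beta_1 < ... < beta_m = beta_T on
    (beta_star, beta_T], with spacing at most (log T)^(-delta_grid),
    delta_grid > 1, in the spirit of condition (4.4).
    Requires beta_T > beta_star."""
    beta_T = math.log(math.log(T)) ** delta_top   # upper endpoint of the grid
    step = math.log(T) ** (-delta_grid)           # target spacing
    m = max(1, math.ceil((beta_T - beta_star) / step))
    return [beta_star + i * (beta_T - beta_star) / m for i in range(1, m + 1)]
```

Since the actual spacing \((\beta _T-\beta _*)/m\) lies between \(\mathrm{step}/2\) and \(\mathrm{step}\) (for \(\beta _T-\beta _*\ge \mathrm{step}\)), both the upper and the lower bound in (4.4) can be met with suitable \(k_1,k_2,\delta _1\).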

Definition of the preliminary density estimator. Define

$$\begin{aligned} \check{\rho }_T(x_0) := \frac{1}{Th_T^{d}}\int _0^T Q\left( \frac{X_u-x_0}{h_T}\right) \mathrm {d}u, \end{aligned}$$
(4.5)

where \(Q\) is a bounded positive kernel satisfying \(\int _{\mathbb {R}^d}\Vert u\Vert |Q(u)|\mathrm {d}u < \infty \), and the bandwidth \(h_T>0\) is such that

$$\begin{aligned} \lim _{T\rightarrow \infty } h_T= 0, \quad \lim _{T\rightarrow \infty } Th_T^d=\infty , \quad \lim _{T\rightarrow \infty }Th_T^{2d}(\log T)^{-3}=\infty , \end{aligned}$$
(4.6a)

and, for some \(\alpha _0 \in (0,1/2)\),

$$\begin{aligned} \limsup _{T\rightarrow \infty } h_T^dT^{\alpha _0} <\infty . \end{aligned}$$
(4.6b)

Recall the definition of \(\rho _T^*\), and let \(\widehat{\rho }_T(x_0) := \max \big \{\check{\rho }_T(x_0),\ \rho _T^*\big \}\).
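As an illustration of (4.5), the time integral can be approximated by a Riemann sum along a discretized trajectory. In the sketch below, an Euler-simulated Ornstein–Uhlenbeck path (a scalar Kolmogorov diffusion with \(b(x)=-x\), \(\sigma _0=1\), invariant law \(N(0,1/2)\), hence \(\rho _b(0)=1/\sqrt{\pi }\)) stands in for the continuous record, and the Gaussian kernel, step size and bandwidth are our illustrative choices:

```python
import math, random

def simulate_ou(T: float, dt: float, seed: int = 0) -> list:
    """Euler scheme for dX_t = -X_t dt + dW_t (a Kolmogorov diffusion);
    the invariant law is N(0, 1/2), so rho_b(0) = 1/sqrt(pi)."""
    rng = random.Random(seed)
    n = int(T / dt)
    x, path = 0.0, [0.0]
    for _ in range(n):
        x += -x * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path

def rho_check(path: list, dt: float, x0: float, h: float) -> float:
    """Riemann-sum version of the kernel estimator (4.5), Gaussian Q."""
    T = dt * (len(path) - 1)
    q = lambda u: math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)
    return sum(q((x - x0) / h) for x in path[:-1]) * dt / (T * h)
```

The truncated pilot estimator is then \(\widehat{\rho }_T(x_0)=\max \{\check{\rho }_T(x_0),\rho _T^*\}\), e.g. with \(\rho _T^*=1/\log T\).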

Main part of the procedure: adaptive estimation of \(b\rho _{b}\). For fixed \(j \in \{1,\ldots ,d\}\), we now describe the procedure for defining an adaptive estimator of the \(j\)-th component of the vector \(b\rho _{b}\). Recall that \(\sigma \) is the dispersion matrix taking values in \(\mathbb {R}^{d\times d}\) and \(a=\sigma \sigma ^\top \) denotes the associated diffusion coefficient. The adaptive estimator will be selected among the family of estimators \(\widehat{g}^j_{T,\beta }(x_0)\), defined as

$$\begin{aligned} \widehat{g}^j_{T,\beta }(x_0) := \frac{1}{T\widehat{h}_{T,\beta }^d}\int _0^T K_\beta \left( \frac{X_u-x_0}{\widehat{h}_{T,\beta }}\right) \mathrm {d}X_u^j, \end{aligned}$$

where \(\widehat{h}_{T,\beta } = \widehat{h}^j_{T,\beta } := \left( \frac{d\widehat{\rho }_T(x_0)a_{j j}\log T}{\beta T}\right) ^{1/(2\beta )}\). As in Klemelä and Tsybakov [10], the kernel \(K_\beta \) is obtained as a renormalized version of the basic kernel

$$\begin{aligned} \widetilde{K}_\beta (x) := (2\pi )^{-d}\int _{\mathbb {R}^d}\left( 1+\Vert \lambda \Vert ^{2\beta } \right) ^{-1}\exp (\mathrm{i }\lambda ^\top x)\mathrm {d}\lambda , \end{aligned}$$
(4.7)

namely

$$\begin{aligned} K_\beta (x) := \mathrm {b}^d \widetilde{K}_\beta (\mathrm {b}x), \quad \text { for } \mathrm {b}= \mathrm {b}(\beta ) :=\left( \frac{2\beta -d}{d}\right) ^{1/(2\beta )}. \end{aligned}$$
(4.8)
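In dimension \(d=1\), the Fourier integral in (4.7) is even in \(\lambda \) and can be evaluated by direct numerical quadrature; for \(\beta =1\) the closed form \(\widetilde{K}_1(x)=\mathrm {e}^{-|x|}/2\) provides a sanity check. The truncation point and step size below are ad hoc numerical choices:

```python
import math

def K_tilde(x: float, beta: float, lam_max: float = 400.0, n: int = 40000) -> float:
    """Basic kernel (4.7) in d = 1:
    K_tilde(x) = (1/pi) * int_0^inf cos(lam*x) / (1 + lam^(2*beta)) dlam,
    computed by the composite trapezoidal rule on [0, lam_max]."""
    h = lam_max / n
    f = lambda lam: math.cos(lam * x) / (1.0 + lam ** (2.0 * beta))
    s = 0.5 * (f(0.0) + f(lam_max)) + sum(f(i * h) for i in range(1, n))
    return s * h / math.pi

def K_beta(x: float, beta: float) -> float:
    """Renormalized kernel (4.8) in d = 1, with b = (2*beta - 1)^(1/(2*beta))."""
    b = (2.0 * beta - 1.0) ** (1.0 / (2.0 * beta))
    return b * K_tilde(b * x, beta)
```

For \(\beta =1\) one has \(\mathrm {b}(1)=1\), so \(K_1=\widetilde{K}_1\) and both functions return \(\approx 1/2\) at the origin.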

As the last ingredient of the adaptive procedure, introduce the thresholding sequence

$$\begin{aligned} \widehat{\eta }_{T,\beta } =\widehat{\eta }_{T,\beta }^j := \left( \frac{d\widehat{\rho }_T(x_0)a_{jj}\log T}{\beta T}\right) ^{\frac{\beta -d/2}{2\beta }} \Vert K_\beta \Vert _{L^2(\mathbb {R}^d)}. \end{aligned}$$
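Both \(\widehat{h}_{T,\beta }\) and \(\widehat{\eta }_{T,\beta }\) are explicit functions of the pilot estimate; a direct transcription reads as follows (the \(L^2\)-norm of \(K_\beta \) is passed in as an argument, since it has no elementary closed form):

```python
import math

def bandwidth(T: float, beta: float, d: int, rho_hat: float, a_jj: float) -> float:
    """h_hat = (d * rho_hat * a_jj * log T / (beta * T))^(1/(2*beta))."""
    return (d * rho_hat * a_jj * math.log(T) / (beta * T)) ** (1.0 / (2.0 * beta))

def threshold(T: float, beta: float, d: int, rho_hat: float, a_jj: float,
              K_norm: float) -> float:
    """eta_hat = (d * rho_hat * a_jj * log T / (beta * T))^((beta - d/2)/(2*beta))
    * ||K_beta||_{L^2}."""
    base = d * rho_hat * a_jj * math.log(T) / (beta * T)
    return base ** ((beta - d / 2.0) / (2.0 * beta)) * K_norm
```

Note the relation \(\widehat{\eta }_{T,\beta }=\widehat{h}_{T,\beta }^{\,\beta -d/2}\,\Vert K_\beta \Vert _{L^2}\), which the two functions reproduce.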

The adaptive estimator \(\widetilde{g}_T^j\) is defined as

$$\begin{aligned} \widetilde{g}_T^j(x_0) := \widehat{g}^j_{T,\widehat{\beta }_T^j}(x_0), \end{aligned}$$
(4.9)

where

$$\begin{aligned} \widehat{\beta }_T^j := \max \Big \{\beta \in {\mathcal {G}}_T:\ \big |\widehat{g}^j_{T,\gamma }(x_0)-\widehat{g}^j_{T,\beta }(x_0)\big |\le \widehat{\eta }_{T,\gamma } \ \forall \ \gamma \in {\mathcal {G}}_T,\ \gamma \le \beta \Big \}.\qquad \end{aligned}$$
(4.10)
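The rule (4.10) admits a literal transcription: a grid point \(\beta \) is admissible if its estimate lies within threshold distance of every estimate built from a coarser grid point. Since the condition is void for \(\gamma =\beta \), the smallest grid point is always admissible and the maximum is well defined. The grid values, estimates and thresholds in the accompanying check are toy numbers:

```python
def lepski_select(grid, g_hat, eta_hat):
    """Return the largest beta in grid such that
    |g_hat[gamma] - g_hat[beta]| <= eta_hat[gamma] for all gamma <= beta,
    i.e. the selector (4.10)."""
    admissible = [
        beta for beta in grid
        if all(abs(g_hat[gamma] - g_hat[beta]) <= eta_hat[gamma]
               for gamma in grid if gamma <= beta)
    ]
    return max(admissible)
```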

We continue with the main result on pointwise adaptive estimation of \(b\rho _{b}\). Recall the definition of the constants \(C_j(\beta ,L;\rho _{b},\sigma )\), \(j=1,\ldots ,d\), in (3.3). Denote by \(\widetilde{\varPi }(c_1,c_2)=\widetilde{\varPi }(c_1,c_2,\sigma )\) the intersection of \(\varPi (c_1,c_2,\sigma )\) with the set of all drift functions \(b\in \varPi _0(\sigma )\) satisfying (BI) and Assumption (SG+).

Theorem 2

For fixed \(\beta _*>d/2\), \(0<L_*<L^*<\infty \), \(\delta \in (0,1)\) and for \({\mathcal {B}}_T=\left[ \beta _*,\beta _T\right] \times \left[ L_*,L^*\right] \), where \(\beta _T = (\log \log T)^\delta \), the adaptive estimator \(\widetilde{g}_T^j\) defined according to (4.9) satisfies, for any \(x_0 \in \mathbb {R}^d\),

$$\begin{aligned} \limsup _{T\rightarrow \infty }\sup _{(\beta ,L)\in {\mathcal {B}}_T}\sup _{b\in \widetilde{\varPi }(c_1,c_2)} \sup _{\rho _b\in \varSigma _T(\beta ,L;L',\sigma )} \frac{\mathbf {E}_b\big |\widetilde{g}_T^j(x_0)-(b^j\rho _{b})(x_0)\big |^2}{\psi _{T,\beta }^2 C_j^2(\beta ,L;\rho _{b},\sigma )} \le 1. \end{aligned}$$

The proof of Theorem 2 is given in Appendix B.2.

5 Sharp adaptive drift estimation for Kolmogorov diffusions

Restricting to the important case of Kolmogorov diffusions allows us to derive and formulate the results in a comparatively concise way.

Let \(b\in \varPi _0(\sigma _0~\mathbf {Id})\), \(0\ne \sigma _0\in \mathbb {R}\), and consider the diffusion

$$\begin{aligned} X_t = X_0+\int _0^t b(X_u) \mathrm {d}u + \sigma _0~ W_t,\qquad t\ge 0, \end{aligned}$$
(5.1)

where \(W\) is a \(d\)-dimensional Brownian motion and the initial value \(X_0\) is independent of \(W\). Note that property \((\mathrm{P }_0')\) in the definition of the functional set \(\varPi _0\) is fulfilled if \(b=\sigma _0^2\nabla \left( \log \rho _b\right) /2\), that is, if the drift vector \(b\) can be represented as a gradient. To lighten notation, let

$$\begin{aligned} C(\beta ,L;\rho _{b},\sigma _0)= L^{\frac{d}{2\beta }}\frac{2 \beta }{\sqrt{d}}\left( \frac{d^2\sigma _0^2\rho _{b}(x_0)}{\beta (2\beta -d)}\right) ^{\frac{\beta -d/2}{2\beta }} \mathbb {I}_\beta . \end{aligned}$$
(5.2)

We further introduce the maximal risk of an estimator \(\check{g}_T\) of \(b\rho _{b}\); for \(\beta >d/2\), \(L>0\), \(T>0\), some bounded set \(A\subset \mathbb {R}^d\) and fixed \(x_0\in \mathring{A}\), it is defined as

$$\begin{aligned} {\fancyscript{R}}_{T,\beta ,L}\big (\check{g}_T\big ) := \sup _{b \in \widetilde{\varPi }(c_1,c_2)}\sup _{\rho _b\in \varSigma _T(\beta ,L;L', \sigma _0\mathbf {Id})} \mathbf {E}_b\Vert \check{g}_T(x_0)-(b\rho _{b})(x_0)\Vert ^2. \end{aligned}$$
(5.3)

Theorem 3

Define \({\mathcal {B}}_T\) as in Theorem 2, and consider the risk introduced in (5.3). Then the following holds true:

  1. (a)

    For any \(x_0\) and for \(C(\beta ,L;\rho _{b},\sigma _0)\) defined in (5.2), the estimator \(\widetilde{g}_T = \big (\widetilde{g}_T^j\big )_{j=1,\ldots ,d}\) defined according to (4.9) is sharp adaptive.

  2. (b)

    If there exists an estimator \(\check{g}_T\) such that, for some \(\beta _0 \ge \beta _*\), \(L>0\),

    $$\begin{aligned} \limsup _{T\rightarrow \infty }\sup _{b \in \widetilde{\varPi }(c_1,c_2)}\sup _{\rho _b\in \varSigma _T(\beta _0,L;L', \sigma _0\mathbf {Id})} \frac{\mathbf {E}_b\Vert \check{g}_T(x_0)-(b\rho _{b})(x_0)\Vert ^2}{\psi _{T,\beta _0}^2 C^2(\beta _0,L;\rho _{b},\sigma _0)} < 1, \end{aligned}$$

    then there exists \(\beta _0' > \beta _0\) such that

    $$\begin{aligned} \frac{{\fancyscript{R}}_{T,\beta _0',L}\big (\check{g}_T\big )}{{\fancyscript{R}}_{T,\beta _0',L}\big (\widetilde{g}_T\big )} \ge \varPsi _T\ \frac{{\fancyscript{R}}_{T,\beta _0,L}\big (\widetilde{g}_T\big )}{{\fancyscript{R}}_{T,\beta _0,L}\big (\check{g}_T\big )}, \end{aligned}$$
    (5.4)

    where \(\varPsi _T\rightarrow _{T\rightarrow \infty }\infty \). In particular, for any fixed \(\beta \ge \beta _*\), \(L>0\),

    $$\begin{aligned} \limsup _{T\rightarrow \infty }\sup _{b \in \widetilde{\varPi }(c_1,c_2)}\sup _{\rho _b\in \varSigma _T(\beta ,L;L', \sigma _0\mathbf {Id})} \frac{\mathbf {E}_b\Vert \widetilde{g}_T(x_0)-(b\rho _{b})(x_0)\Vert ^2}{\psi _{T,\beta }^2 C^2(\beta ,L;\rho _{b},\sigma _0)} = 1. \end{aligned}$$

The statement in the second part of the above theorem is to be interpreted in the sense that, whenever there exists an estimator \(\check{g}_T\) which outperforms the estimator \(\widetilde{g}_T\) for at least one smoothness degree \(\beta _0\), there exists another smoothness degree \(\beta _0'\) at which the loss of \(\check{g}_T\) is much larger. The assertion and its proof are to be compared with Theorem 2 in Klemelä and Tsybakov [11].

Proof (of Theorem 3)

We first show that \(\varPi (c_1,c_2,\sigma _0~\mathbf {Id})=\widetilde{\varPi }(c_1,c_2,\sigma _0~\mathbf {Id})\). Let \(b\in \varPi (c_1,c_2,\sigma _0~\mathbf {Id})\). In view of the results in Section 4.3 in Bakry et al. [2] (p. 747), \((\mathrm{P_1 })\) implies that (SG) holds. Since, in addition, \((\mathrm{P_2 })\) is satisfied, Theorem 3.2 in Qian and Zheng [17] entails that (4.2) holds and thus Assumption \((\mathrm{\mathbf {SG+} })\) is fulfilled. For any \(b\in \varPi _0(\sigma _0~\mathbf {Id})\), the associated measure \(\mu _b\) is reversible for \(X\) (see, e.g., Lemma 2.2.3 in Royer [19]). In particular, (SG) is equivalent to Poincaré’s inequality. Restricting to bounded drift functions, Poincaré’s inequality implies that Assumption (BI) is satisfied, too; this follows from Lemma 2 stated in Appendix A. In view of Theorems 1 and 2, it now remains only to verify (5.4). The proof is along the lines of the proof of Theorem 2 in Klemelä and Tsybakov [11] and is therefore omitted. \(\square \)

We conclude this section with a brief summary of the adaptive estimation procedure. Assume that a continuous record of observations \(X^T=(X_t)_{0\le t\le T}\) of a diffusion process solving the SDE (2.1) is available and that the (constant) diffusion matrix \(a=\sigma \sigma ^\top \) is known. The goal is to estimate the value of the \(j\)-th component of the product \(b\rho _b\) at some given fixed point \(x_0\). For implementing the adaptive estimation scheme, we need to specify a lower bound \(\beta _*>d/2\) on the unknown smoothness of the function \(b\rho _b\) and fix some value \(\delta \in (0,1)\). To define an estimator on the basis of the input parameters \(X^T\), \(a=(a_{ij})_{1\le i,j\le d}\), \(x_0\), \(\beta _*\) and \(\delta \), one might then proceed as follows:

\(*\) :

(Computation of a pilot estimator of the invariant density) Choose some bounded positive kernel \(Q:\mathbb {R}^d\rightarrow \mathbb {R}\) satisfying \(\int _{\mathbb {R}^d}\Vert u\Vert |Q(u)|\mathrm {d}u<\infty \) and some bandwidth \(h_T>0\) fulfilling (4.6). Define a preliminary density estimator \(\widehat{\rho }_T(x_0)\) by computing \(\check{\rho }_T(x_0)\) according to (4.5) and by letting

$$\begin{aligned} \widehat{\rho }_T(x_0):=\max \left\{ \check{\rho }_T(x_0), \ \rho _T^*\right\} , \end{aligned}$$

where \(\rho _T^*\) denotes a vanishing sequence of positive real numbers satisfying \(\liminf _{T\rightarrow \infty }(\rho _T^*\log T)>0\).

\(*\) :

(Computation of kernel estimators for a discrete set of parameters) Specify a grid \(\mathcal {G}_T:=\{\beta _1,\ldots ,\beta _m\}\), where the values \(\beta _*<\beta _1<\cdots <\beta _m=(\log \log T)^\delta \) are chosen such that (4.4) is satisfied. Define the bandwidths

$$\begin{aligned} \widehat{h}_{T,\beta _i}=\left( \frac{d\widehat{\rho }_T(x_0)a_{jj}\log T}{\beta _i T}\right) ^{\frac{1}{2\beta _i}},\quad i=1,\ldots ,m. \end{aligned}$$

Recall the definition of the kernel \(K_\beta \) in (4.8), and compute the family of estimators

$$\begin{aligned} \widehat{g}_{T,\beta _i}^j(x_0)=\frac{1}{T\widehat{h}_{T,\beta _i}^d}\int _0^T K_{\beta _i}\left( \frac{X_u-x_0}{\widehat{h}_{T,\beta _i}}\right) \mathrm {d}X_u^j,\quad i=1,\ldots ,m. \end{aligned}$$
(5.5)
\(*\) :

(Definition of the Lepski-type estimator of \(\beta \) and the adaptive estimator of \(b^j\rho _b\)) Recall the definition of \(\mathbb {I}_\beta ^2\) in (1.4), and define the thresholding values

$$\begin{aligned} \widehat{\eta }_{T,\beta _i}=\left( \frac{d\widehat{\rho }_T(x_0)a_{jj}\log T}{\beta _i T}\right) ^{\frac{\beta _i-d/2}{2\beta _i}} {\mathbb {I}}_{\beta _i}\left( \frac{2\beta _i-d}{d} \right) ^{\frac{\beta _i+d/2}{2\beta _i}},\quad i=1,\ldots ,m.\nonumber \\ \end{aligned}$$
(5.6)

Use the values (5.5) and (5.6) to determine \(\widehat{\beta }_T^j\) as specified in (4.10), and define the adaptive estimator \(\widetilde{g}_T^j(x_0)=\widehat{g}^j_{T,\widehat{\beta }_T^j}(x_0)\).
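For the scalar Ornstein–Uhlenbeck example \(\mathrm {d}X_t=-X_t\mathrm {d}t+\mathrm {d}W_t\) (a Kolmogorov diffusion with \((b\rho _b)(0)=0\)), the three steps above can be assembled end to end. All numerical choices in the sketch below (Euler step, toy grid, Gaussian kernel in place of \(K_\beta \) and \(Q\), truncation level \(\rho _T^*=1/\log T\)) are ours and purely illustrative:

```python
import math, random

def adaptive_g_estimate(T=500.0, dt=0.01, x0=0.0, seed=0):
    """End-to-end sketch: simulate dX = -X dt + dW by an Euler scheme,
    compute the pilot density estimate, the family g_hat_beta with
    bandwidths and thresholds built from the pilot, and select beta by
    the Lepski rule (4.10). True target: (b * rho_b)(0) = 0."""
    rng = random.Random(seed)
    n = int(T / dt)
    x, path = 0.0, [0.0]
    for _ in range(n):
        x += -x * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x)

    phi = lambda u: math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)
    phi_l2 = (1.0 / (2.0 * math.sqrt(math.pi))) ** 0.5   # ||phi||_{L^2}

    # Step 1: pilot density estimate, truncated from below at rho_star.
    h_T = 0.3
    rho_check = sum(phi((y - x0) / h_T) for y in path[:-1]) * dt / (T * h_T)
    rho_hat = max(rho_check, 1.0 / math.log(T))

    # Step 2: kernel estimators and thresholds along a toy grid (d = a_jj = 1).
    grid = [0.75, 1.0, 1.25, 1.5]
    g_hat, eta_hat = {}, {}
    for beta in grid:
        base = rho_hat * math.log(T) / (beta * T)
        h = base ** (1.0 / (2.0 * beta))
        incr = sum(phi((path[i] - x0) / h) * (path[i + 1] - path[i])
                   for i in range(n))
        g_hat[beta] = incr / (T * h)
        eta_hat[beta] = base ** ((beta - 0.5) / (2.0 * beta)) * phi_l2

    # Step 3: Lepski selection as in (4.10).
    admissible = [b for b in grid
                  if all(abs(g_hat[c] - g_hat[b]) <= eta_hat[c]
                         for c in grid if c <= b)]
    return g_hat[max(admissible)]
```

The returned value should be close to zero, the true value of \((b\rho _b)(0)\) in this example.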

Remark 2

  • Once one has determined the adaptive estimator \(\widetilde{g}_T=(\widetilde{g}_T^j)_{j=1,\ldots ,d}\) according to the above scheme, one obtains an adaptive drift estimator by defining a suitable invariant density estimator \(\widetilde{\rho }_T\) and setting

    $$\begin{aligned} \widetilde{b}_T(x_0) := \frac{\widetilde{g}_T(x_0)}{\widetilde{\rho }_T(x_0) \vee \rho _*(x_0)}, \qquad x_0\in \mathbb {R}^d, \end{aligned}$$

    with \(\rho _*(x_0)> 0\) denoting some a priori lower bound on \(\rho _{b}(x_0)\). Restricting again to the case of Kolmogorov diffusions as in (5.1), the normalizing factor appearing in the upper bound for the pointwise squared risk of \(\widetilde{b}_T^j(x_0)\), assuming that \(b \in \varPi (c_1,c_2)\) and \(\rho _b\in \varSigma _T(\beta ,L;L',\sigma _0~\mathbf {Id})\), is identified as \(C_j(\beta ,L;\rho _{b},\sigma _0)\rho _{b}^{-1}(x_0) = D_j(\beta ,L;\rho _{b},\sigma _0)\).

  • In the situation of Theorem 3, an explicit a priori lower bound on \(\rho _b(x_0)\) depending only on \(c_1,c_2\) can be derived as in Remark 6 in Dalalyan and Reiß [7]. For the more general case of diffusion processes with uniformly nondegenerate diffusion matrix \(a\), it was proven in Metafune et al. [15] that, if \(b^i \in C^2(\mathbb {R}^d)\), \(\mathrm{(P_1) }\) is satisfied and, in addition, \(\Vert b(x)\Vert +\Vert Db(x)\Vert +\Vert D^2b(x)\Vert \le c_1'(1+\Vert x\Vert )\) for some constant \(c_1'>0\), then \(\rho _b(x)\ge \mathrm {e}^{-K(1+\Vert x\Vert ^2)}\) for all \(x\in \mathbb {R}^d\), with some positive constant \(K\) depending only on \(c_1'\), \(c_2\), \(d\) and \(\sigma \).

6 Discussion and extensions

Placement and interpretation of the identified constants. Let us first collect the normalizing factors identified in the previous sections and relate them to known results on asymptotically exact adaptive estimation with respect to pointwise risk over the scale of Sobolev classes in the classical statistical models. Tsybakov [23] considers the problem of nonparametric function estimation in the Gaussian white noise model (in the one-dimensional case), assuming that the unknown function belongs to some Sobolev class with unknown regularity parameter. The question of density estimation at a fixed point \(x_0 \in \mathbb {R}\) is investigated by Butucea [3]. Since the variance of the proposed kernel estimator is proportional to the value of the unknown density \(f\) at \(x_0\), the value \(f(x_0)\) appears in the exact normalization. Klemelä and Tsybakov [11] deal with nonparametric estimation of a multivariate function and its partial derivatives at a fixed point when the Riesz transform is observed in Gaussian white noise. In particular, Klemelä and Tsybakov [11] find the exact constant for nonparametric estimation of a function \(f: \mathbb {R}^d \rightarrow \mathbb {R}\), observed in Gaussian white noise and satisfying the Sobolev smoothness constraint \(\eta _\beta (f) \le L\). In combination with the results of Butucea [3] on classical density estimation (in dimension \(d=1\)), the exact constant for nonparametric estimation of a density \(f:\mathbb {R}^d\rightarrow \mathbb {R}\) at some fixed point \(x_0\in \mathbb {R}^d\) is then identified as

$$\begin{aligned} L^{\frac{d}{2\beta }}\frac{2\beta }{d}\left( \frac{d^2 f(x_0)}{\beta (2\beta -d)}\right) ^{\frac{\beta -d/2}{2\beta }}\mathbb {I}_\beta . \end{aligned}$$
(6.1)

For the case of Kolmogorov diffusions with \(\sigma \sigma ^\top = \mathbf {Id}\), \(C_j(\beta ,L;\rho _{b},\mathbf {Id})\) as introduced in (3.3) coincides with the constant in (6.1). The agreement of the constants in the exact asymptotics reflects the long-established experience that different statistical models show similar behavior, at least from an asymptotic point of view. The remarkable results of Dalalyan and Reiß [7] on asymptotic statistical equivalence for inference on the drift in multidimensional Kolmogorov diffusion models put this observation on a mathematically formal footing. They are, however, established under rather restrictive (Hölder) smoothness assumptions. Precisely, the critical regularity for proving asymptotic equivalence with the Gaussian shift model grows like \((1/2+1/\sqrt{2})d\) as \(d \rightarrow \infty \). The authors refer to the question whether asymptotic equivalence fails for Hölder classes of smaller regularity as “a challenging open problem.” Our risk bounds are valid under weaker smoothness constraints, which suggests that asymptotic equivalence (at least in a reduced sense) still holds beyond the critical bounds of Dalalyan and Reiß [7]. Going beyond the special case of Kolmogorov diffusions with unit diffusion part, the dependence of the drift estimation problem on the form of the diffusion coefficient \(a=\sigma \sigma ^\top \) is reflected by the appearance of the \(j\)-th diagonal entry of the matrix \(a\) in the optimal normalizing factor \(C_j(\beta ,L;\rho _{b},\sigma )\) (for estimating \(b^j\rho _{b}\)).

Possible generalizations of the assumptions. One can debate which description of the dependence structure underlying the diffusion is most convenient. Given the goal of describing exact asymptotics for pointwise drift estimation for as large a class of diffusion processes as possible, under a preferably small set of assumptions, we decided to formulate our results in terms of the spectral gap hypothesis. Indeed, restricting to the case of reversible diffusion processes, Theorem 3.1 in Guillin et al. [8] implies that, whenever \(\varsigma _b^2(g)\le C \Vert g\Vert _\infty ^2\) for any centered bounded \(g\) and some constant \(C>0\), (BI) entails the Poincaré inequality, i.e.

$$\begin{aligned} \mathrm{Var }_{\mu _b}(f) = \int f^2\mathrm {d}\mu _b-\left( \int f\mathrm {d}\mu _b\right) ^2 \le c_P^{-1}\int |\nabla f|^2\mathrm {d}\mu _b \end{aligned}$$
(PI)

for any smooth enough function \(f:\mathbb {R}^d\rightarrow \mathbb {R}\) and some positive constant \(c_P\). It is furthermore well known that, in the symmetric case, Poincaré’s inequality is equivalent to the spectral gap assumption (SG). However, as already noted in the introduction, the upper bound result in Theorem 2 essentially holds whenever a Bernstein-type deviation inequality in the spirit of (BI) is combined with a sufficiently tight upper variance bound [as provided by Assumption (SG+)]. Such variance bounds for diffusion processes can actually be deduced from mixing conditions or weak-dependence assumptions on the data.