Optimal stable Ornstein–Uhlenbeck regression

We prove asymptotically efficient inference results for an Ornstein–Uhlenbeck regression model driven by a non-Gaussian stable Lévy process, where the output process is observed at high frequency over a fixed period. We present the local asymptotics of non-ergodic type for the likelihood function, followed by a way to construct an asymptotically efficient estimator through a suboptimal, yet very simple, preliminary estimator.

The primary objective of this paper is the asymptotically efficient estimation of $\theta$ when the available data are $(X_t)_{t\in[0,T]}$ and $(Y_{t_j})_{j=1}^n$, where $t_j = t_j^n := jh$ with $h = h_n := T/n$; later on, we will consider cases where we observe $(X_{t_j})_{j=1}^n$ instead of the full continuous-time record. We will denote the true value of $\theta$ by $\theta_0 = (\lambda_0, \mu_0, \beta_0, \sigma_0) \in \Theta$.
Analysis of the (time-homogeneous) OU process driven by a stable Lévy process goes back to [6], where Doob treated the model in a genuinely analytic manner without Itô's formula, which had not yet been published at that time. Nowadays, OU models are used in a wide variety of applications, such as electricity-consumption modeling [18], [1], and [21], ecology [14], and protein-dynamics modeling [3], to mention a few.
The model (1.1) may be seen as a continuous-time counterpart of the simple first-order ARX (autoregressive exogenous) model. Nevertheless, any proper form of the efficient estimation result has been missing in the literature, probably due to the lack of background theory that can handle the estimation of all the parameters involved under the bounded-domain infill asymptotics. Let us note that, when $J$ is a Wiener process ($\beta = 2$), the drift parameters are consistently estimable only when the terminal sampling time tends to infinity, and the associated statistical experiments are known to possess essentially different properties according to the sign of $\lambda$. That is to say, the model is: locally asymptotically normal for $\lambda > 0$ (ergodic case); locally asymptotically Brownian functional for $\lambda = 0$ (unit-root case); locally asymptotically mixed normal (LAMN) for $\lambda < 0$ (non-ergodic (explosive) case). Turning back to the stable-driven case, we should note that the least-squares-type estimator would not work unless $T_n \to \infty$, as is expected from [9] and [23]; there, the authors proved that (when $\beta$ is known) the rate of convergence for $\lambda > 0$ equals $(T_n/\log n)^{1/\beta}$ and the asymptotic distribution is given by a ratio of two independent stable distributions.
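As a concrete illustration, assuming (1.1) takes the OU regression form $dY_t = (\mu^\top X_t - \lambda Y_t)\,dt + \sigma\,dJ_t$ with a symmetric $\beta$-stable driver $J$ (the display of (1.1) itself, the function name, and all defaults below are illustrative assumptions), a high-frequency sample path can be simulated by an Euler-type scheme using the self-similarity of the stable increments:

```python
import numpy as np
from scipy.stats import levy_stable

def simulate_ou_stable(lam, mu, beta, sigma, T=1.0, n=1000, y0=0.0,
                       X=None, rng=None):
    """Euler-type scheme for dY = (mu*X_t - lam*Y_t) dt + sigma dJ_t,
    with J a standard symmetric beta-stable Levy process, so that an
    increment over a step of length h is distributed as h^(1/beta) * S
    for S standard symmetric beta-stable (self-similarity)."""
    rng = np.random.default_rng(rng)
    h = T / n
    t = np.linspace(0.0, T, n + 1)
    if X is None:
        X = np.ones(n + 1)  # constant regressor -> classical OU process
    # scipy's first argument is the stability index; skew 0 gives the
    # symmetric stable law
    dJ = h ** (1.0 / beta) * levy_stable.rvs(beta, 0.0, size=n,
                                             random_state=rng)
    Y = np.empty(n + 1)
    Y[0] = y0
    for j in range(n):
        Y[j + 1] = Y[j] + (mu * X[j] - lam * Y[j]) * h + sigma * dJ[j]
    return t, Y

t, Y = simulate_ou_stable(lam=1.0, mu=0.5, beta=1.5, sigma=0.1, n=500, rng=1)
```

Note the heavy tails: for $\beta$ well below 2, a few increments dominate the path, which is exactly why least squares fails here while the methods of this paper do not.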
1.2. Contributions in brief. First, in Section 2, we will show that the model is locally asymptotically mixed normal (LAMN) at $\theta_0 \in \Theta$, and also that the likelihood equation has a root that is asymptotically efficient in the classical sense of Hájek–Le Cam–Jeganathan. The asymptotic results presented here are valid, in a unified manner, uniformly over any compact subset of the parameter space $\Theta$. In particular, the sign of the autoregressive parameter $\lambda_0$ does not matter, revealing that (i) the results can be described in a unified manner regardless of whether the model is ergodic or not, and (ii) the conventional unit-root problem (see [19] and the references therein) is not relevant here at all; this is in sharp contrast to the case of ARX time-series models and the Gaussian OU models. Besides, in Section 3, we will provide a way to construct an asymptotically efficient estimator through a suboptimal yet very simple preliminary estimator, which enables us to bypass not only the computationally demanding numerical optimization of the likelihood function involving the $\beta$-stable density, but also the possible multiple-root problem [15, Section 7.3].

2. Local likelihood asymptotics
2.1. Preliminaries and result. Let $P_\theta$ denote the image measure of $(J, Y)$ associated with the value $\theta \in \Theta$. We are going to show a non-trivial stochastic expansion of the log-likelihood ratio of $P^Y_{\theta+\varphi_n(\theta)v_n,\,n}$ with respect to $P^Y_{\theta,n}$ for the appropriate norming matrix $\varphi_n(\theta)$ introduced later and a bounded sequence $(v_n) \subset \mathbb{R}^p$, where $P^Y_{\theta,n}$ stands for the restriction of $P_\theta$ to $\sigma(Y_{t_j} : j \le n)$. The distribution $\mathcal{L}(Y_0)$ may vary according to $\theta$; we will assume that for any $\varepsilon > 0$ there exists an $M > 0$ such that $\sup$ … Let $\phi_\beta$ denote the $\beta$-stable density of $J_1$; see [7] for details. Here we wrote …; analogous notation for the partial derivatives will be used in the sequel.
To proceed, we need to introduce further notation. Any asymptotics will be taken as $n \to \infty$ unless otherwise mentioned. We denote by $\to_u$ the uniform convergence of non-random quantities with respect to $\theta$ over $\Theta$. We write $C$ for a positive universal constant which may vary at each appearance, and $a_n \lesssim b_n$ when $a_n \le C b_n$ for every $n$ large enough. Given positive functions $a_n(\theta)$ and $b_n(\theta)$, we write …, respectively. The symbol $a_n(\theta) \lesssim_u b_n(\theta)$ means that $\sup_\theta |a_n(\theta)/b_n(\theta)| \lesssim 1$. We write $\int_j$ instead of $\int_{t_{j-1}}^{t_j}$. Integration by parts applied to the process $t \mapsto e^{\lambda t} Y_t$ yields the explicit càdlàg solution process under $P_\theta$: (2.2) … For $x, \lambda \in \mathbb{R}$, we write … The basic property of the Lévy integral and the fact that log … We introduce the non-random $p \times p$ matrix … where the real entries $\varphi_{kl,n} = \varphi_{kl,n}(\theta)$ are assumed to be continuously differentiable in $\theta \in \Theta$ and to satisfy the following conditions for some finite values $\varphi_{kl} = \varphi_{kl}(\theta)$: (2.6) … The matrix $\varphi_n(\theta)$ will turn out to be the right norming with which $u \mapsto \ell_n(\theta + \varphi_n(\theta)u) - \ell_n(\theta)$ has, under $P_\theta$, an asymptotically quadratic structure in $\mathbb{R}^p$; see [2] and [5] for the related previous studies. Note that $\sqrt{n}\,h$ … Let … and define the block-diagonal random matrix (2.7) … where, for a random variable $\epsilon \sim_{P_\theta} \phi_\beta(y)\,dy$, and denoting by $A^\top$ the transpose of a matrix $A$, … Note that $I(\theta)$ does depend on the choice of $\varphi(\theta) = \{\varphi_{kl}(\theta)\}$; if $\varphi(\theta)$ is free from $(\lambda, \mu)$, then so is $I(\theta)$.
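The display (2.2) can be sketched as follows, assuming (1.1) takes the form $dY_t = (\mu^\top X_t - \lambda Y_t)\,dt + \sigma\,dJ_t$ (this assumed form is a reconstruction, not taken from the excerpt):

```latex
% Integration by parts applied to t -> e^{\lambda t} Y_t:
% d(e^{\lambda t} Y_t) = e^{\lambda t}\,dY_t + \lambda e^{\lambda t} Y_t\,dt
%                      = e^{\lambda t}\,\mu^\top X_t\,dt + \sigma e^{\lambda t}\,dJ_t,
% so that, integrating over [0, t] and multiplying by e^{-\lambda t},
Y_t = e^{-\lambda t} Y_0
    + \int_0^t e^{-\lambda(t-s)}\,\mu^\top X_s\,ds
    + \sigma \int_0^t e^{-\lambda(t-s)}\,dJ_s .
```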
(1) For any bounded sequence $(v_n) \subset \mathbb{R}^p$, it holds that … where we have the convergence in distribution under $P_\theta$: $\mathcal{L}\left(\Delta_n(\theta), I_n(\theta) \,\middle|\, P_\theta\right) \Rightarrow \mathcal{L}\left(I(\theta)^{1/2} Z,\ I(\theta)\right)$, where $Z \sim N_p(0, I)$ is independent of $I(\theta)$, defined on an extended probability space. (2) There exists a local maximum point $\hat\theta_n$ of $\ell_n(\theta)$ with $P_\theta$-probability tending to 1, for which … It is worth mentioning that the particular non-diagonal form of $\varphi_n(\theta)$ is, as in [2], inevitable for deducing the asymptotically non-degenerate joint distribution of the maximum-likelihood estimator (MLE), the good local maximum point $\hat\theta_n$ in Theorem 2.1(2).
Remark 2.2. Here are some comments on the model-time scale.
(1) We are fixing the terminal sampling time $T$, so that the rate of convergence …; then a longer period would lead to a better (resp. worse) performance in estimating $(\lambda, \mu)$. The Cauchy case $\beta = 1$, where the two rates of convergence coincide, is exceptional.
(2) We can explicitly associate a change of the terminal sampling time $T$ with changes of the components of $\theta$. Specifically, changing the model-time scale from $t$ to $tT$ in (1.1), we see that the process satisfies exactly the same integral equation as in (1.1), except that $\theta = (\lambda, \mu, \beta, \sigma)$ is replaced ($\beta$ being unchanged), $X_t$ by $X^T_t := X_{tT}$, and $J_t$ by $J^T_t := T^{-1/\beta} J_{tT}$: note that $(J^T_t)_{t\in[0,1]}$ defines a standard $\beta$-stable Lévy process. This shows that we may set $T \equiv 1$ in the virtual (model) world without loss of generality. This is impossible for diffusion-type models, where we cannot consistently estimate the drift coefficient unless we let the terminal sampling time $T$ tend to infinity.
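The time-change argument can be sketched as follows, again assuming (1.1) takes the form $dY_t = (\mu^\top X_t - \lambda Y_t)\,dt + \sigma\,dJ_t$ (the explicit parameter map shown is reconstructed under this assumption):

```latex
% Time change t -> tT; set Y^T_t := Y_{tT}, X^T_t := X_{tT},
% and J^T_t := T^{-1/\beta} J_{tT}. Substituting s = uT,
Y^T_t = Y_0
      + \int_0^t \bigl( (T\mu)^\top X^T_u - (T\lambda)\, Y^T_u \bigr)\,du
      + \bigl( T^{1/\beta}\sigma \bigr) J^T_t ,
% so the parameter transforms as
% (\lambda, \mu, \beta, \sigma) \mapsto (T\lambda,\ T\mu,\ \beta,\ T^{1/\beta}\sigma),
% with (J^T_t)_{t\in[0,1]} again a standard \beta-stable Levy process
% by the 1/\beta-self-similarity of J.
```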
Remark 2.3. The present framework allows us to carry out unit-period-wise (for example, day-by-day) inference for both trend and scale structures, providing a sequence of period-wise estimates with theoretically valid approximate confidence sets. This, though informally, suggests an aspect of change-point analysis in high-frequency data: if we have a high-frequency sample over $[k-1, k]$ for $k = 1, \dots, [T]$, then we can construct a sequence of estimators $\{\hat\theta_n(k)\}_{k=1}^{[T]}$; it would then be possible in some way to reject the constancy of $\theta$ over the periods if the sequence is not likely to stay unchanged.
Under mild regularity conditions on the functions $(a, b, c)$ as well as on the non-random process $X$, a solution process $Y$ is explicitly given by (see [4, Appendix]) … where $\psi(s, t; \lambda) := \int_s^t a(X_v, \lambda)\,dv$. However, the corresponding likelihood asymptotics becomes much messier. It is worth mentioning that the optimal rate matrix can be diagonal if, for example, $\partial_t \partial_\theta \log c(t, \sigma) \equiv 0$ with $X_t = t$; for details, see the previous study [5], which treated the general time-homogeneous Markovian case.

2.2. Proof of Theorem 2.1.
In this proof, we are going to make use of the general result of [20] about the exact-likelihood asymptotics, in a more or less analogous way to that of [2, Theorem 1]: under the uniform nature of the exact-likelihood asymptotics, we will deduce the joint convergence in distribution of the normalized score $\Delta_n(\theta_0)$ and the normalized observed information $I_n(\theta_0)$ from the uniform convergence in probability of $I_n(\cdot)$ in an appropriate sense. Consequently, we will not need to derive the stable convergence in law of $\Delta_n(\theta_0)$, which is often crucial when dealing with high-frequency sampling of a process with dependent increments.
We have $\sup_{t\in[0,T]} |X_t| < \infty$ since $X : [0, T] \to \mathbb{R}^q$ is assumed to be càdlàg. Through the localization procedure, we may and do suppose that the driving stable Lévy process does not have jumps of size greater than some fixed threshold (see [17, Section 6.1] for a concise account). In that case, the Lévy measure of $J$ is compactly supported, hence in particular (2.10) $\sup$ … for any $K > 0$. Further, since the Lévy measure of $J$ is symmetric, the removal of large-size jumps does not change the parametric form of the drift coefficient. We also localize the initial variable $Y_0$ so that $|Y_0|$ is essentially bounded uniformly in $\theta$. It follows from (2.2) and (2.10) that $\sup$ … To proceed, we introduce some further notation. Given continuous random functions $\xi_0(\theta)$ and $\xi_n(\theta)$, $n \ge 1$, we write $\xi_n(\theta) \xrightarrow{p}_u \xi_0(\theta)$ if the joint distribution of $\xi_n$ and $\xi_0$ is well-defined under $P_\theta$ and if … Similarly, for any random functions $\chi_{nj}(\theta)$ doubly indexed by $n$ and $j \le n$, we write … We will complete the proof of Theorem 2.1 by verifying the three statements corresponding to the conditions (12), (13), and (14) in [2], which here read …, respectively, where (2.12) and (2.13) should hold for all $c > 0$, and where $\partial_\theta^2 \ell_n(\theta_1, \dots, \theta_p)$, $\theta_k \in \Theta$, denotes the $p \times p$ Hessian matrix of $\ell_n(\theta)$ whose $(k, l)$th element is given by $\partial_{\theta_k} \partial_{\theta_l} \ell_n(\theta_k)$, where $\theta =: (\theta_l)_{l=1}^p$. Having obtained (2.11), (2.12), and (2.13), [20, Theorems 1 and 2] immediately yields Theorem 2.1. We can verify (2.12) exactly as in [2], so we will only look at (2.11) and (2.13).
Proof of (2.11). Recall the expression (2.4). To look at the entries of $\partial_\theta^2 \ell_n(\theta)$, we introduce several shorthands for notational convenience; they may look somewhat daring, but will not cause confusion. Let us omit the subscript $\beta$ and the argument $\epsilon_j$ from the aforementioned notation, writing $\phi := \phi_\beta(\epsilon_j)$, $g := g_\beta(\epsilon_j)$, and so on. For brevity, we also write … Further, partial differentiation with respect to a variable will be denoted by the parenthesized subscript, such as $\epsilon_{(a)} := \partial_a \epsilon_j(\theta)$ and $\epsilon_{(a,b)} := \partial_a \partial_b \epsilon_j(\theta)$. Then, direct computations give the first-order partial derivatives: … followed by the second-order ones: … It is straightforward to see which term is the leading one in each expression above. We do not list all the details here, but for later reference mention a few of the points: …, and so forth; …, and, because of the consequence (2.10) of the localization, concerning the partial derivatives of $\epsilon_j(\theta)$ we obtain the asymptotic representations: …, and so on; the terms "$o_u(1)$" therein are all valid uniformly in $j \le n$.

Now we write
$\xrightarrow{p}_u I_{\beta,\sigma}(\theta)$ in exactly the same way as in the proof of Eq. (12) in [2]. Below, we will show $I_{11,n}(\theta)$ … The Burkholder inequality ensures that, for any continuous $\pi(x, y; \theta)$ and for any $U(\epsilon_j(\theta))$ such that $E_\theta[U(\epsilon_j(\theta))] = 0$ ($\theta \in \Theta$), … and that the left-hand side of (2.14) is continuous over $\theta \in \Theta$. Also, note that the right continuity of $t \mapsto X_t$ implies that … These basic facts will be used repeatedly below without further mention. For convenience, we will write (2.15) … and denote by $1_{u,p}$ any random array $\xi_{nj}(\theta)$ such that $\max_{j\le n} |\xi_{nj}(\theta) - 1| \xrightarrow{p}_u 0$. Direct computations give the following expressions for the components of … We can deduce that $I_{11,n}(\theta) \xrightarrow{p}_u I_{\lambda,\mu}(\theta)$ as follows.
• First, noting that $\epsilon_j = \epsilon_j(\theta)$ … in the summands on the rightmost sides of the last three displays, we pick up the leading part involving $E_\theta[g^2]$; the other part becomes negligible by the Burkholder inequality.
• Then, the a.s. Riemann integrability of $t \mapsto (X_t(\omega), Y_t(\omega))$ allows us to conclude that, for $k, l \in \{0, 1, 2\}$ and under $P_\theta$ for each $\theta$, … where the order symbols in the estimates are valid uniformly in $j \le n$. By (2.2), under the localization we have $\max$ … for any $M > 0$, from which it follows that $D_n(k, l) = o_{u,p}(1)$.
Next we turn to $I_{12,n}(\theta) = \{I^{kl}_{12,n}(\theta)\}_{k,l}$: … We can deduce that $I_{12,n}(\theta) \xrightarrow{p}_u 0$ just by inspecting the four components separately, in a way similar to how we handled $I_{11,n}(\theta)$. Let us only mention the lower-left $q \times 1$ component: recalling the properties (2.1) and $|\varphi_{22,n}| \lesssim_u l'$, we see that … Thus, the claim (2.11) follows.

3. Asymptotically efficient estimator
From now on, we fix a true value $\theta_0 \in \Theta$, and the stochastic symbols and convergences will be taken under $P := P_{\theta_0}$; accordingly, we write $E := E_{\theta_0}$. Having Theorem 2.1 in hand, we can proceed with the construction of an asymptotically efficient estimator. It is known that any asymptotically centering estimator $\theta^*_n$ satisfying (3.1) … is regular; by Theorem 2.1, the right-hand side converges in distribution to $MN_{p,\theta_0}\left(0, I(\theta_0)^{-1}\right)$. This, together with the convolution theorem, in turn gives the asymptotic minimax theorem: for any measurable (loss) function $L : \mathbb{R}^p \to \mathbb{R}_+$ such that $L(u) = \tau(|u|)$ for some non-decreasing $\tau : \mathbb{R}_+ \to \mathbb{R}_+$ with $\tau(0) = 0$, we have … Recalling that $\mathcal{L}(\Delta_n(\theta), \dots)$ …, where $Z \sim N_p(0, I)$ (Theorem 2.1), and in view of the lower bound in (3.2), we may call any estimator $\theta^*_n$ satisfying (3.1) asymptotically efficient. Again by Theorem 2.1, the good local maximum point $\hat\theta_n$ of $\ell_n(\theta)$ is asymptotically efficient. We refer to [12, Theorems 2 and 3, and Proposition 2] and also [13, Theorem 8] for more information and details of the above arguments.
Theorem 2.1 is based on the classical Cramér-type argument. Its well-known shortcoming is its local character: the result only tells us of the existence of an asymptotically well-behaved root of the likelihood equation, but does not tell us which local maximum is the right one when there are multiple local maxima, equivalently, multiple roots of the likelihood equation [15, Section 7.3]. Indeed, the log-likelihood function $\ell_n$ of (2.4) is highly nonlinear and non-concave. In this section, we remove this locality by a Newton–Raphson-type improvement, which in our case will not only remedy the aforementioned inconvenience of the multiple-root problem but also enable us to bypass the numerical optimization involving the stable density $\phi_\beta$. In [2, Section 3], for the $\beta$-stable Lévy process (the special case of (1.1) with $\lambda = 0$ and $X \equiv 1$), we provided an initial estimator based on the sample median and the method of moments associated with logarithmic and/or lower-order fractional moments. However, it was essential in [2] that the model is a Lévy process, for which we could apply the median-adjusted central limit theorem for an i.i.d. sequence of random variables. In the present case, we need a different sort of argument.
In Theorem 2.1, the process $X = (X_t)_{t\in[0,T]}$ was assumed to be observed continuously on $[0, T]$. In this section, we will instead deal with a discrete-time sample $(X_{t_j})_{j=0}^n$ under an additional condition. We will explicitly construct an estimator $\theta^*_n$ which is asymptotically equivalent to the MLE $\hat\theta_n$, by verifying the asymptotically centering property (3.1); for this much-thinned sample, we may and do keep calling such a $\theta^*_n$ asymptotically efficient.
3.1. Newton–Raphson procedure. To proceed with a discrete-time sample $\{(X_{t_j}, Y_{t_j})\}_{j=0}^n$, we introduce the approximate-likelihood function $H_n(\theta)$ by replacing $\zeta_j(\lambda)$ by $X_{t_{j-1}}$ in the definition (2.4) of the genuine log-likelihood function $\ell_n(\theta)$ (recall the notation $l' := \log(1/h)$): … Of course, this approximation is not for free: to manage the resulting discretization error specified later on, we additionally impose that (3.6) … Then we have at least $\beta_0 > 2/3$, so that small values of $\beta_0$ are excluded; this is the price we have to pay for dealing with a discrete-time sample from $X$ in an efficient way. Accordingly, in the sequel, we will reset the parameter space of $\beta$ to be a domain $\Theta_\beta$ such that $\Theta_\beta \subset (2/3, 2)$. Toward the construction of an asymptotically efficient estimator $\theta^*_n$ satisfying (3.1), we will prove a basic result about a Newton–Raphson-type procedure. As in (2.15), we write $r_n = r_n(\beta_0) = \sqrt{n}\, h^{1-1/\beta_0}$. Write $n^{-1/2} \tilde\varphi_n(\theta)$ for the lower-right $2 \times 2$ part of $\varphi_n(\theta)$, so that the definition (2.5) with $\theta = \theta_0$ becomes $\varphi_n(\theta_0) = \mathrm{diag}(r_n^{-1} I_{q+1},\ n^{-1/2} \tilde\varphi_n(\theta_0))$. We then introduce the diagonal matrix … The difference between $\varphi_n$ and $\varphi_{0,n}$ lies only in the lower-right component for $(\beta, \sigma)$, and note that the matrix $\varphi_n^{-1} \varphi_{0,n}$ may diverge in norm. Then, suppose that we are given an initial estimator $\hat\theta_{0,n} = (\hat\lambda_{0,n}, \hat\mu_{0,n}, \hat\beta_{0,n}, \hat\sigma_{0,n})$ such that $\varphi_{0,n}^{-1}(\hat\theta_{0,n} - \theta_0) = O_p(1)$ … Let us write $a = (\lambda, \mu)$ and $b = (\beta, \sigma)$. Based on the approximate-likelihood function (3.4) and $\hat\theta_{0,n}$, we recursively define the $k$-step estimator $\hat\theta_{k,n}$ ($k \ge 1$) by … and assign an arbitrary value to $\hat\theta_{k,n}$ on the complement set $F^c_{k-1,n}$; below, it will be seen (as in the proof of Theorem 2.1) that $P[F_{k-1,n}] \to 1$, hence this arbitrariness does not matter asymptotically, and we may and do suppose that $P[F_{k-1,n}] = 1$ for $k \ge 1$. In our subsequent arguments, the inverse-matrix part in (3.9) must be block-diagonal: see Remark 3.3 below.
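The block-diagonal one-step update of (3.9) can be sketched generically as follows; the names `score` and `hessian` stand for the gradient and Hessian of the approximate likelihood $H_n$, and they, together with the toy quadratic example, are illustrative assumptions rather than the paper's notation:

```python
import numpy as np

def one_step_block_diagonal(theta, score, hessian, block=2):
    """One Newton-Raphson step theta_new = theta - H^{-1} S, where the
    Hessian H is replaced by its block-diagonal part: the blocks for
    a = (lambda, mu) and b = (beta, sigma) are inverted separately,
    discarding the off-diagonal blocks (cf. Remark 3.3)."""
    S = score(theta)
    H = hessian(theta)
    p = len(theta)
    q1 = p - block                 # size of the a = (lambda, mu) block
    H_bd = np.zeros_like(H)
    H_bd[:q1, :q1] = H[:q1, :q1]   # keep the a-block
    H_bd[q1:, q1:] = H[q1:, q1:]   # keep the b-block
    return theta - np.linalg.solve(H_bd, S)

# Toy usage: for a concave quadratic objective with diagonal Hessian,
# a single step lands exactly on the maximizer.
A = np.diag([4.0, 3.0, 2.0, 1.0])
target = np.array([1.0, -1.0, 0.5, 2.0])
score = lambda th: -A @ (th - target)
hessian = lambda th: -A
theta1 = one_step_block_diagonal(np.zeros(4), score, hessian)
```

Because the Hessian in the toy example is diagonal, discarding the off-diagonal blocks loses nothing and `theta1` equals `target` exactly; in the paper's setting, the point of the block-diagonal form is precisely that the discarded $\partial_a \partial_b H_n$ block would otherwise spoil the claim (Remark 3.3).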
Step 2. Next, we show that … For the first term, we have … $\to 0$ for every $\varepsilon > 0$ and $s_n \uparrow \infty$. To manage the second term, we need to estimate the gap between $H_n(\theta)$ and $\ell_n(\theta)$, taking into account the different convergence rates of their components. By the definitions (2.4) and (3.4), … From the expressions (2.3) and (3.5), and since $\kappa \le 1$, a series of straightforward computations shows that the partial derivatives of … and $|\partial_\sigma d_{\epsilon,j}(\theta)| \lesssim h^{1+\kappa-1/\beta}$. Obviously, $h^{1/\hat\beta_n - 1/\beta_0} = 1 + o_p(1)$ for any $\hat\beta_n$ such that $n^v(\hat\beta_n - \beta_0) = O_p(1)$ for some $v > 0$; below, we will repeatedly make use of this fact without mention. Further, under (3.6), it holds that (3.16) … By piecing together these observations, the basic property (2.1), and the expression (3.15), under (3.3) we can obtain … This concludes that $R'_{0,n} = o_p(1)$.

Step 3. The goal of this step is to show $R''_{0,a,n} = o_p(1)$ and $R''_{0,b,n} = O_p(n^{1/2-r}(l')^C)$; at this stage, the latter component may not be stochastically bounded if $r \le 1/2$ (recall (3.8)). We have $R''_{0,n} = A_{0,n} H_{0,n}$, where … Under the assumption $\varphi_{0,n}^{-1}(\hat\theta_{0,n} - \theta_0) = O_p(1)$, recalling the block-diagonal forms (2.5) and (3.7), we see that … where the components $O_p(1) \in \mathbb{R}^{q+1}$ and $O_p(n^{(1-r)/2} l') \in \mathbb{R}^2$; here and in what follows, we use the stochastic-order symbols for random variables of different dimensions, which will not cause any confusion.
Turning to $A_{0,n}$, we will show that all the components of $A_{0,n}$ are at most $O_p\left(n^{-r/2}(l')^C\right)$: … For the diagonal parts of $A_{0,n}$, this follows from the same arguments as in proving (3.13) and (3.14) with the assumption … Write $\theta = (\theta_l)_{l=1}^p$ and so on, and also let … for the size of the matrix. Then, for the non-diagonal part of $A_{0,n}$, we expand it as follows: … As in the previous diagonal case, the second term on the right-hand side equals $O_p(n^{-r/2}(l')^C)$. As for the first term, we write … We have seen the explicit expressions of the components of $\partial_\theta^2 \ell_n(\theta)$ in Section 2.2. Based on them, it can be seen that all the components of $r^{-1}$ … for some $\mathcal{F}_{t_{j-1}}$-measurable random variable $\pi_{j-1}(\theta_0)$ such that $|\pi_{j-1}(\theta_0)| \lesssim (1+|Y_{t_{j-1}}|)(l')^C$, and for some odd function $\psi$ (hence $E[\psi(\epsilon_j(\theta_0))] = 0$); the last term "$O(h^2)$" only appears in $\partial_\lambda \partial_\beta \ell_n(\theta)$. Burkholder's inequality for martingale difference arrays gives … for the expression (3.15). The following estimates hold: … Since $r \le 1$, we have concluded (3.18). The desired stochastic orders follow from (3.17) and (3.18): …

Step 4. We are now able to derive the convergence rate of $\hat\theta_{1,n} - \hat\theta_n$. Recall the definition (3.10) of $K \in \mathbb{N}$ and the initial rate of convergence (3.7).
• Turning to $r \in (0, 1/2]$, we pick a constant $\epsilon' \in (0, r/2)$ (hence $r - \epsilon' > r/2$), which is to be taken sufficiently small later. Define … Again by Steps 1 to 3 and (3.12), … It follows that the rate of convergence for estimating $(\beta, \sigma)$ improves from $\mathrm{diag}(n^{r/2},\, n^{r/2}/l')$ for $\hat\theta_{0,n}$ to $\mathrm{diag}(n^{r-\epsilon'},\, n^{r-\epsilon'}/l')$ for $\hat\theta_{1,n}$; this can be seen as a matrix-norming counterpart of the (near-)doubling phenomenon in one-step estimation; see for example [22, Section 5.5]. To improve the rate further, we apply (3.9) to obtain $\hat\theta_{2,n}$ from $\hat\theta_{1,n}$, so that the rate of convergence for estimating $(\beta, \sigma)$ improves from $\mathrm{diag}(n^{r-\epsilon'},\, n^{r-\epsilon'}/l')$ to $\mathrm{diag}(n^{2r-3\epsilon'},\, n^{2r-3\epsilon'}/l')$; here again, we can take the constant $\epsilon' > 0$ to be sufficiently small. This procedure is iterated. Then, the last ($K$th-step) application of (3.9) is the same as in the case $r > 1/2$ mentioned above.
These observations conclude (3.11). Thus, we have arrived at the following claim.
The property (3.21) entails …, which can be used for constructing an approximate confidence ellipsoid and for goodness-of-fit testing, in particular for variable selection among the components of $X$.
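A generic Wald-type construction consistent with this use of (3.21) might look as follows; the function names, the matrix `I_n`, and the toy numbers are illustrative assumptions, not the paper's notation:

```python
import numpy as np
from scipy.stats import chi2

def wald_statistic(theta_hat, theta_null, info_n):
    """Quadratic form (theta_hat - theta_null)^T I_n (theta_hat - theta_null).
    Under mixed normality of the normalized estimator, this statistic is
    asymptotically chi-square with p = dim(theta) degrees of freedom."""
    d = np.asarray(theta_hat) - np.asarray(theta_null)
    return float(d @ info_n @ d)

def in_confidence_ellipsoid(theta_hat, theta_null, info_n, level=0.95):
    """True iff theta_null lies inside the approximate level-`level`
    confidence ellipsoid centered at theta_hat."""
    p = len(theta_hat)
    return wald_statistic(theta_hat, theta_null, info_n) <= chi2.ppf(level, df=p)

# Toy usage with a 4-dimensional parameter (lambda, mu, beta, sigma)
I_n = np.diag([100.0, 50.0, 10.0, 10.0])
W = wald_statistic([1.01, 0.52, 1.48, 0.11], [1.0, 0.5, 1.5, 0.1], I_n)
ok = in_confidence_ellipsoid([1.01, 0.52, 1.48, 0.11], [1.0, 0.5, 1.5, 0.1], I_n)
```

Variable selection among the components of $X$ then amounts to testing whether the corresponding entries of $\mu$ are zero, using the relevant sub-block of the information matrix.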
Remark 3.2. From the proof of Theorem 3.1, we see that it is possible to weaken (3.6) to $\beta_0 > 2/3$ if the integrated-process sequence $(\int_j X_s\,ds)_{j=1}^n$ is observable. Moreover, it is possible to remove (3.6) if the model is the Markovian $Y_t = Y_0 + \int_0^t (\mu - \lambda Y_s)\,ds + \sigma J_t$ with constant $\mu \in \mathbb{R}$, by modifying the definition (3.5) as in the estimating function of [5]. However, we worked under (3.3) and (3.5) in order to handle a possibly time-varying $X$.
Remark 3.3. The standard form of the one-step estimator is not (3.9), but $\hat\theta_{k,n} = $ … By inspecting the proof of Theorem 3.1, we found that the off-block-diagonal part $-\partial_a \partial_b H_n(\hat\theta_{k-1,n})$ would make the claim therein invalid. This happens because the rate of convergence for estimating the component $b = (\beta, \sigma)$ can be too slow. Still, because of the block-diagonality of the original form (2.7), it seems a natural and reasonable strategy to use the block-diagonal form from the outset in defining (3.9).
Remark 3.4. The necessity of more than one iteration ($K \ge 2$) would seem to be a technical matter. If we could verify the tail-probability estimate $\sup_n P[|r_n(\hat\lambda_{0,n} - \lambda_0, \hat\mu_{0,n} - \mu_0)| \ge s] \lesssim s^{-M}$ for a sufficiently large $M > 0$, then it would be possible to deduce the optimality of the one-step Newton–Raphson procedure even when the construction of $(\hat\beta_{0,n}, \hat\sigma_{0,n})$ is not smooth in $(\hat\lambda_{0,n}, \hat\mu_{0,n})$, as with the function $\hat{M}_n(a')$ in Section 3.2.2. However, the model under consideration is heavy-tailed, and it seems impossible to deduce such a bound, since we cannot make use of the localization for this purpose.
(1) First, we will estimate the trend parameter $(\lambda, \mu)$ by the least absolute deviation (LAD) estimator, which will turn out to be rate-optimal and asymptotically mixed-normally distributed; although the identification of the asymptotic distribution is not necessary here, it would be of independent interest (see Section 3.2.3).
(2) Next, by plugging in the LAD estimator, we construct a sequence of residuals for the noise term, based on which we will consider matching of lower-order fractional moments. Recall that we are working under the localization (2.10), by which large jumps of $J$ are removed.
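Step (1) can be sketched as follows, assuming the discretized regression $Y_{t_j} - Y_{t_{j-1}} \approx (\mu^\top X_{t_{j-1}} - \lambda Y_{t_{j-1}})h + \sigma \Delta_j J$; the optimizer choice and all names are illustrative, not the paper's construction:

```python
import numpy as np
from scipy.optimize import minimize

def lad_trend(Y, X, h):
    """LAD estimate of (lam, mu) from high-frequency data, minimizing
    sum_j |Y_{t_j} - Y_{t_{j-1}} - (mu*X_{t_{j-1}} - lam*Y_{t_{j-1}}) h|.
    The L1 loss is robust against the heavy-tailed stable increments,
    in contrast to least squares."""
    dY = np.diff(Y)

    def loss(ab):
        lam, mu = ab
        return np.abs(dY - (mu * X[:-1] - lam * Y[:-1]) * h).sum()

    res = minimize(loss, x0=np.zeros(2), method="Nelder-Mead",
                   options={"xatol": 1e-10, "fatol": 1e-12, "maxiter": 5000})
    return res.x

# Toy check on nearly noiseless Euler data with (lam0, mu0) = (1.0, 0.5);
# the heavy-tailed (Cauchy) noise is what motivates L1 over L2.
rng = np.random.default_rng(0)
n = 500
h = 1.0 / n
X = np.ones(n + 1)                 # constant regressor
Y = np.empty(n + 1)
Y[0] = 0.2
for j in range(n):
    Y[j + 1] = Y[j] + (0.5 * X[j] - 1.0 * Y[j]) * h + 1e-6 * rng.standard_cauchy()
lam_hat, mu_hat = lad_trend(Y, X, h)
```

Even a single Cauchy draw of enormous size perturbs the L1 criterion only linearly, which is the intuition behind the rate-optimality of the LAD estimator here.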

3.2.2. Rates of convergence of the moment matching for $(\beta, \sigma)$. The remaining task is to construct a specific estimator $(\hat\beta_{0,n}, \hat\sigma_{0,n})$ such that (3.31) $n^{r/2}(\hat\beta_{0,n} - \beta_0),\ (n^{r/2}/l')(\hat\sigma_{0,n} - \sigma_0) = O_p(1)$. This can be achieved simply by fitting some appropriate moments; for this purpose, the localization is not suitable, since precise expressions of the moments that truly exist without the localization come into play. Here we consider, as in [2], the pair of absolute moments of orders $r$ and $2r$. There exists a bijection $f_r$ such that $f_r(m(r; \beta)^2 / m(2r; \beta)) = \beta$; see [2, Section 3.2] and the references therein for related details. Therefore, setting $\hat\beta_{0,n} := f_r(\hat{M}_n(r)^2 / \hat{M}_n(2r))$ results in $n^{r/2}(\hat\beta_{0,n} - \beta_0) = O_p(1)$, as was to be shown. The bisection method is sufficient for finding $\hat\beta_{0,n}$ numerically. Turning to $\hat\sigma_{0,n}$, we note that … $u \in \mathbb{R}^q$, since for any constant real $\xi$ we have $Y = (u \cdot X)\xi$ a.s. as functions on $[0, T]$. Apply the identity $\det\begin{pmatrix} A & B^\top \\ B & C \end{pmatrix} = \det(A)\,\det(C - BA^{-1}B^\top)$ to conclude the $P_\theta$-a.s. positive definiteness of $I(\theta)$.
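The bisection-based inversion $\hat\beta_{0,n} = f_r(\hat{M}_n(r)^2/\hat{M}_n(2r))$ can be sketched as follows, assuming the standard symmetric-stable normalization $E|S|^p = 2^p\,\Gamma((1+p)/2)\,\Gamma(1-p/\beta)\,/\,(\sqrt{\pi}\,\Gamma(1-p/2))$ for $0 < p < \beta$ (this closed form is an assumption about the normalization of $\phi_\beta$; the ratio $m(r;\beta)^2/m(2r;\beta)$ is scale-free, so the unknown $\sigma$ drops out of it):

```python
import math

def abs_moment(p, beta):
    """E|S|^p for standard symmetric beta-stable S (requires 0 < p < beta),
    assuming the characteristic function exp(-|u|^beta)."""
    return (2.0 ** p * math.gamma((1.0 + p) / 2.0) * math.gamma(1.0 - p / beta)
            / (math.sqrt(math.pi) * math.gamma(1.0 - p / 2.0)))

def ratio(beta, r):
    """Scale-free statistic m(r; beta)^2 / m(2r; beta)."""
    return abs_moment(r, beta) ** 2 / abs_moment(2.0 * r, beta)

def f_r(target, r, lo=2.0 / 3.0 + 1e-6, hi=2.0 - 1e-6, tol=1e-12):
    """Invert beta -> ratio(beta, r) on (2/3, 2) by bisection; the text
    asserts this map is a bijection, so a sign-based bracket suffices.
    Note 2r must stay below 2/3 for m(2r; beta) to exist on the whole
    interval, e.g. r = 0.25."""
    flo = ratio(lo, r) - target
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        fmid = ratio(mid, r) - target
        if fmid == 0.0 or hi - lo < tol:
            break
        if (flo > 0) == (fmid > 0):
            lo, flo = mid, fmid
        else:
            hi = mid
    return 0.5 * (lo + hi)

beta_hat = f_r(ratio(1.5, 0.25), r=0.25)   # self-inversion check
```

In practice `target` would be the empirical ratio $\hat{M}_n(r)^2/\hat{M}_n(2r)$ computed from the residuals of step (1); the sketch above only verifies that the bisection inverts the forward map.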

Remark 2.4. It is formally straightforward to extend the model (1.1) to the following form: